# Your Home Away from Home

## Applied Data Science Capstone Final Assignment (Week 2)

by Sameh A.Rasoul

### Table of contents
* [The Business Problem](#introduction)
* [Data Acquisition & Cleaning](#data)
* [Methodology](#methodology)
* [Predictive Model](#model)
* [Conclusions](#conclusions)
* [Recommendations](#recommendations)

## The Business Problem <a name="introduction"></a>

#### Background
The concept of leveraging Foursquare APIs to find similarities between neighborhoods and cluster them in an unsupervised manner is a fascinating one. However, a real use case for this capability may not be obvious to those unfamiliar with analytics. This raises the question: when, in practice, would a casual web user be interested in finding similarities between the neighborhoods of a city?

#### The Problem
One problem this could solve is choosing a property when relocating to an entirely different city. This is one of those times when one is faced with many questions: Which region of the new city should I begin with? How should I go about it? Should I ask friends, or look up reviews instead? What factors other than price should I be wary of? And so on. It is frustrating that no property listing site offers any customization based on the user's profile. Even where a site has some adaptive features, as it currently stands no listings website analyzes which city you are coming from and prioritizes the listings according to their similarity to the neighborhoods in the user’s home city.

#### The Solution
This problem hypothetically manifests itself in Toronto residents who want to take advantage of falling property prices in Dubai to find themselves a new home away from home. In reality, the average buyer almost invariably starts at property listing websites. Property purchases are a major investment for the majority of people, and price is a deciding factor. However, none of the mainstream property listing websites compares listings with the neighborhoods of the buyer's home city. So, to solve our problem, we will write a script that scrapes the listings results page the user is looking at, finds the location of each listing, and then clusters the listings together with the neighborhoods of the user's home city. The result shows the user relevant details about each listing, similar neighborhoods back home, and additional features, all together in an interactive map. We will demonstrate a proof of concept with a resident of Toronto looking to buy an apartment in Dubai.


## Data Acquisition & Cleaning <a name="data"></a>

Several datasets need to be scraped from multiple sources. We divide our data acquisition and cleaning process according to the data sources, which include:
1.	The property listings website results page and the page of each individual listing in the results page
2.	The Wikipedia page listing Toronto’s postal codes and an online csv archive of the geolocation of Toronto’s neighborhoods
3.	Foursquare venue data for all property listings and Toronto’s neighborhoods

The datasets were harvested and cleaned separately before final consolidation and exploratory analysis.

Firstly, the dataset for the properties of interest in Dubai will be constructed as follows:


### Step 1: Scraping the Property Finder Page.

Let us start by installing and importing the pyperclip module which will make the transition from the listing site to our notebook seamless.

In [1]:
#!pip install pyperclip #<-- Uncomment to install pyperclip to copy from clipboard
import pyperclip

Now you can go to Propertyfinder.ae and input the desired search parameters for buying a property in Dubai. Once you are satisfied with the results page, copy the URL of the results page and run the following script.

In [2]:
# capture the results page URL from your clipboard
results_url = str(pyperclip.paste())

In [None]:
# In case you are having trouble setting up pyperclip especially in remote environments: manually copy-paste here.
# We have searched for completed properties in the AED 500,000 to 700,000 as an example
results_url = "https://www.propertyfinder.ae/en/search?c=1&cs=completed&ob=mr&page=1&pf=500000&pt=700000"

Now we can get the results page as a response object:

In [3]:
#import the requests library
import requests

# Save the results and raise an error if the request fails
results_page = requests.get(results_url)
results_page.raise_for_status()

We will need the beautiful soup library to scrape the results page:

In [4]:
#!pip install bs4 #<-- Uncomment to install beautifulsoup for webscraping
import bs4
from bs4 import BeautifulSoup

We now scrape the price, cover picture and hyperlink of each listing by iterating over the listing div tags. We noticed that occasionally there is a featured listing with a slightly different html construction, which raises an error. To handle this, we **try** the normal listing search first. If it fails, we fall back in the **except** branch to the alternate html structure used for featured ads:

In [5]:
# create the soup of the page
soup = BeautifulSoup(results_page.text, 'html.parser')

# capture all listings div html tags
divs = soup.find_all('div', {"class":"card-list__item"})

# initialize empty lists to capture listings data
links = []
pics = []
prices = []

#iterate over listing
for div in divs:
    
    try:  # normal listings will be captured here
        price = str(div.find("span", {"class":"card__price-value"}).getText())
        pic = str(div.find("source", {"type":"image/jpeg"}).get("srcset"))
        
    except AttributeError: # the featured listing, if found, will be captured here
        price = str(div.find("div", {"class":"cts__price"}).getText())
        pic = str(div.find("div", {"class":"cts__main-img"}).get("srcset"))
    
    # manipulate price data into comprehensible string
    price = " ".join(price.split())
    prices.append(price)
    
    # construct the url for the primary photo thumbnail for each listing
    # by dropping the query string, if there is one
    pic = pic.split('?')[0]
    pics.append(pic)
    
    # capture the link for each listing and construct the url
    link = str(div.find("a").get("href"))
    link = "https://www.propertyfinder.ae/"+str(link)
    links.append(link)

In [6]:
# Test the data by checking first two results:
print(links[0:2],'\n', pics [0:2],'\n', prices[0:2])

['https://www.propertyfinder.ae//en/buy/apartment-for-sale-dubai-business-bay-merano-tower-7616792.html', 'https://www.propertyfinder.ae//en/buy/apartment-for-sale-dubai-old-town-yansoon-yansoon-6-7761442.html'] 
 ['https://www.propertyfinder.ae/property/dc39565e20f20b72c655865ee9d48295/260/185/MODE/c97604/7616792-951bdo.jpg', 'https://www.propertyfinder.ae/property/e6d70835fb42df5aabe82cae58a66598/260/185/MODE/75a67a/7761442-94a4ao.jpg'] 
 ['1,200,000 AED', '1,300,000 AED']
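
The doubled slash in the links printed above comes from joining a base URL that ends in `/` with an `href` that starts with `/`. As a minimal sketch, the standard library can build the link and drop a thumbnail's query string. `BASE_URL`, `absolute_link` and `strip_query` are illustrative names that are not part of the notebook, the example thumbnail URL is made up, and the sketch assumes each `srcset` value holds a single URL.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.propertyfinder.ae/"  # same site root as used above


def absolute_link(href):
    """Join a relative href to the site root without doubling the slash."""
    return urljoin(BASE_URL, href)


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The href is taken from the output above; the thumbnail URL is hypothetical.
print(absolute_link("/en/buy/apartment-for-sale-dubai-business-bay-merano-tower-7616792.html"))
print(strip_query("https://www.propertyfinder.ae/property/example.jpg?width=260"))
```

Unlike slicing at the position returned by `str.find`, `strip_query` leaves a URL without a `?` unchanged.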


After exploring a property page, we were able to locate the geocoordinates of the listing in a snippet of code which looks like  **"GeoCoordinates","latitude":25.049301,"longitude":55.198896** . So we iterated over all the links to get this code snippet. 

We will add a tqdm progress bar to let us know if we are being throttled or blocked by the Property Finder servers:

In [7]:
#!pip install tqdm #<--- Uncomment to install the tqdm progress bar wrapper
from tqdm import tqdm

The tqdm wrapper adds a progress bar and a timer, which will also help you gauge your resources if you want to extrapolate to capturing hundreds of listings or more.

In [8]:
# initialize an empty list to store the geocoordinate snippets
snippets=[]

# iterate over each listing link and get the snippet including the coordinates
for link in tqdm(links, ncols=80):
    property_page = requests.get(link)
    property_page.raise_for_status()
    
    location_index = property_page.text.find("\"GeoCoordinates\",\"latitude\":")
    snippet = property_page.text[location_index:location_index+61]
    
    snippets.append(snippet)


100%|███████████████████████████████████████████| 25/25 [00:14<00:00,  1.74it/s]
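
If throttling does become a concern, one common courtesy is to pause between consecutive requests. The helper below is only a sketch of that idea: `fetch_all` and its parameters are hypothetical, and the actual download is passed in as a function so that the pacing logic stays independent of any HTTP library.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages
```

In this notebook it could be called as `fetch_all(links, lambda u: requests.get(u, timeout=10).text)`, where the 10 second timeout is an arbitrary choice.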


Looking at the snippets, we can see that the number of decimal places in the coordinates is inconsistent, which means that fixed-position string slicing cannot be used:

In [9]:
snippets

['"GeoCoordinates","latitude":25.184884,"longitude":55.260565},',
 '"GeoCoordinates","latitude":25.191404,"longitude":55.279303},',
 '"GeoCoordinates","latitude":25.186913,"longitude":55.277919},',
 '"GeoCoordinates","latitude":25.109367,"longitude":55.24798},"',
 '"GeoCoordinates","latitude":25.207533,"longitude":55.277978},',
 '"GeoCoordinates","latitude":25.189836,"longitude":55.27576},"',
 '"GeoCoordinates","latitude":25.073151,"longitude":55.136982},',
 '"GeoCoordinates","latitude":25.191107,"longitude":55.26991},"',
 '"GeoCoordinates","latitude":25.190874,"longitude":55.281666},',
 '"GeoCoordinates","latitude":25.212018,"longitude":55.28286},"',
 '"GeoCoordinates","latitude":25.08333,"longitude":55.144753},"',
 '"GeoCoordinates","latitude":25.085322,"longitude":55.144807},',
 '"GeoCoordinates","latitude":25.187129,"longitude":55.278394},',
 '"GeoCoordinates","latitude":25.190874,"longitude":55.281666},',
 '"GeoCoordinates","latitude":25.082217,"longitude":55.142624},',
 '"GeoCoor

To solve this, we will define a **regex** (regular expression) that can identify the coordinates, and use it to extract latitude and longitude data.

A regex object of the form **r'"latitude":(\d+\.\d+),"longitude":(\d+\.\d+)'** was tested and found suitable to extract the latitude and longitude separately as 'match' object groups:

In [10]:
# import the regular expressions library
import re

# create the regex object
coords = re.compile(r'"latitude":(\d+\.\d+),"longitude":(\d+\.\d+)')

# initialize empty lists to capture coordinates
latitudes=[]
longitudes=[]

# iterate over the snippets and capture both coordinates from each match
for snip in snippets:
    mo = coords.search(snip)
    latitudes.append(mo.group(1))
    longitudes.append(mo.group(2))
    
print(latitudes)
print(longitudes)

['25.184884', '25.191404', '25.186913', '25.109367', '25.207533', '25.189836', '25.073151', '25.191107', '25.190874', '25.212018', '25.08333', '25.085322', '25.187129', '25.190874', '25.082217', '25.087251', '25.189039', '25.19331', '25.193542', '25.074307', '25.086726', '25.191527', '25.206125', '25.071897', '25.193451']
['55.260565', '55.279303', '55.277919', '55.24798', '55.277978', '55.27576', '55.136982', '55.26991', '55.281666', '55.28286', '55.144753', '55.144807', '55.278394', '55.281666', '55.142624', '55.145574', '55.274261', '55.280919', '55.277061', '55.132694', '55.145205', '55.274545', '55.343607', '55.131535', '55.26512']
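
The pattern above suits Dubai, where both coordinates are positive and always carry a decimal part. A slightly more general variant is sketched below as an assumption about what other cities might need. It accepts a leading minus sign (Toronto's longitude is negative), returns floats, and returns `None` instead of raising when a page has no coordinates. `extract_coords` is an illustrative name.

```python
import re

# Optional minus sign, digits, and an optional fractional part of any length.
NUMBER = r"-?\d+(?:\.\d+)?"
COORDS = re.compile(rf'"latitude":({NUMBER}),"longitude":({NUMBER})')


def extract_coords(text):
    """Return (latitude, longitude) as floats, or None when no match is found."""
    match = COORDS.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(extract_coords('"GeoCoordinates","latitude":25.109367,"longitude":55.24798},"'))
```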


The latitude and longitude data appear to have been extracted properly, so let us port our first dataset into a pandas dataframe:

In [11]:
# import
import pandas as pd

# change settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# confirm
print('Pandas successfully loaded')

Pandas successfully loaded


In [12]:
df=pd.DataFrame({'Price':prices, 'Latitude':latitudes , 'Longitude':longitudes, 'Link':links, 'Pic':pics})
df

Unnamed: 0,Price,Latitude,Longitude,Link,Pic
0,"1,200,000 AED",25.184884,55.260565,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/dc39565...
1,"1,300,000 AED",25.191404,55.279303,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/e6d7083...
2,"1,190,000 AED",25.186913,55.277919,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/b097242...
3,"1,000,000 AED",25.109367,55.24798,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/7c33f53...
4,"1,300,000 AED",25.207533,55.277978,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/42ace08...
5,"1,300,000 AED",25.189836,55.27576,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/8455cdb...
6,"1,250,000 AED",25.073151,55.136982,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/f6b32e6...
7,"1,049,987 AED",25.191107,55.26991,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/5990384...
8,"1,000,000 AED",25.190874,55.281666,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/e16e669...
9,"1,250,000 AED",25.212018,55.28286,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/c9ccbed...
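
At this point the prices are stored as strings such as `1,200,000 AED`. If a numeric price is wanted for sorting or filtering, a conversion along the following lines may help. `parse_price` is an illustrative helper, and it assumes every price follows the "amount, space, currency" layout seen in the table above.

```python
def parse_price(text):
    """Split a string such as '1,200,000 AED' into an integer amount and a currency."""
    amount, currency = text.rsplit(" ", 1)
    return int(amount.replace(",", "")), currency


# Prices copied from the table above.
sample = ["1,200,000 AED", "1,049,987 AED", "1,000,000 AED"]
amounts = [parse_price(p)[0] for p in sample]
print(sorted(amounts))
```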


Now let us confirm that all the data was collected accurately by mapping it and checking that it lines up properly with our target city.

First we will need to get the geographic coordinates of the target city, which in our case is Dubai.

In [14]:
#!pip install geopy   #<--- Uncomment if you need to install geopy
from geopy.geocoders import Nominatim 
address = "Dubai"
geolocator = Nominatim(user_agent="Dubai_Explorer")

location = geolocator.geocode(address)
latituded = location.latitude
longituded = location.longitude

print('The geographical coordinates of {} are {}, {}.'.format(address, latituded, longituded))

The geographical coordinates of Dubai are 25.0750095, 55.18876088183319.


Next we initialize a folium map object around these coordinates and then iterate over the listings, adding in the details:

In [15]:
import folium
# create map of Dubai using latitude and longitude values
map_dubai = folium.Map(location=[latituded, longituded], zoom_start=9)

In [16]:
# iterate over every listing in our df dataset
for price, pic, lat, lng, link in zip(df['Price'], df['Pic'], df['Latitude'], df['Longitude'], df['Link']):
    # create an html pop-up including price and photo, and hyperlinked to listings webpage
    label = r'<center><a href={} target="_blank">{}</a><br><br><a href={} target="_blank"><img src={} width="200"></a></center>'.format(link,price,link,pic)
    # add the popup
    iframe = folium.IFrame(html=label, width=210, height=210)
    popup = folium.Popup(iframe, max_width=500)
    folium.Marker([float(lat),float(lng)], popup=popup).add_to(map_dubai)
map_dubai

You can also click on each listing to see a popup of price & picture, hyperlinked to take you to the listings page.
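
The popup markup above leaves the `href` and `src` values unquoted, which works for simple URLs but can break if a value contains spaces or characters such as `>`. A more defensive way to build the same markup is sketched below. `popup_html` is a hypothetical helper and the URLs in the example are made up.

```python
from html import escape


def popup_html(link, price, pic, width=200):
    """Build the popup markup with quoted, escaped attribute values."""
    link_attr = escape(link, quote=True)
    pic_attr = escape(pic, quote=True)
    return (
        '<center>'
        f'<a href="{link_attr}" target="_blank">{escape(price)}</a><br><br>'
        f'<a href="{link_attr}" target="_blank">'
        f'<img src="{pic_attr}" width="{int(width)}"></a>'
        '</center>'
    )


# Hypothetical listing values, used only to show the resulting markup.
print(popup_html("https://example.com/listing?id=1&lang=en", "1,200,000 AED", "https://example.com/pic.jpg"))
```

The returned string can be passed to `folium.IFrame(html=...)` in the same way as `label` in the loop above.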

### Step 2: Scraping and cleaning the home city neighborhoods dataset

To capture this dataset, we again use the requests and Beautiful Soup libraries to scrape the table on the Wikipedia page, which will eventually be ported to a pandas dataframe.

In [17]:
# download the wikipedia html, raise error if unsuccessful
import bs4
wikiPage = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
wikiPage.raise_for_status()

#create soup from html text
soup = bs4.BeautifulSoup(wikiPage.text, 'html.parser')

In [18]:
# initialize empty lists to capture the table data
pcodes = []
boroughs = []
hoods = []

# iterate over the table rows
for row in soup.select('tr'):
    try:  # rows with at least three cells are captured here
        postalcode = row.select('td')[0].getText()
        borough = row.select('td')[1].getText()
        neighborhood = row.select('td')[2].getText()
                
    except IndexError: # rows with fewer cells (e.g. header rows) get blank placeholders
        postalcode = ' '
        borough = ' '
        neighborhood = ' '
    pcodes.append(postalcode.strip())
    boroughs.append(borough.strip())
    hoods.append(neighborhood.strip())

As the loop also captured cells from elsewhere on the Wikipedia page, we'll go ahead and remove that unwanted data.

In [19]:
# slice only the required rows
pcodes = pcodes[1:181]
boroughs = boroughs[1:181]
hoods = hoods[1:181]
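
The fixed slice `[1:181]` depends on the current layout of the Wikipedia page. A possibly more robust alternative, sketched here under the assumption that every wanted row starts with a code of the form `M`, digit, letter (for example `M3A`), keeps rows by pattern instead of by position. `keep_postal_rows` is an illustrative helper that is not part of the original notebook.

```python
import re

# Toronto postal code prefixes look like "M5A": the letter M, a digit, a letter.
POSTAL_CODE = re.compile(r"^M\d[A-Z]$")


def keep_postal_rows(rows):
    """Keep only (postal_code, borough, neighbourhood) rows with a valid code."""
    return [row for row in rows if POSTAL_CODE.match(row[0])]


# A few rows shaped like the scraped data, including two unwanted ones.
scraped = [
    ("", "", ""),
    ("M3A", "North York", "Parkwoods"),
    ("M1A", "Not assigned", "Not assigned"),
    ("Canadian postal codes", "", ""),
]
print(keep_postal_rows(scraped))
```

With the lists built above, this could be applied as `keep_postal_rows(zip(pcodes, boroughs, hoods))`.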

In [20]:
df1=pd.DataFrame({'Postal Code': pcodes, 'Borough': boroughs, "Neighbourhood":hoods})
df1.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


The dataset has several postal codes that do not have an identifiable neighbourhood name. Let's continue cleaning the dataset by dropping the rows whose borough is not assigned:

In [21]:
# Import numpy to enable us to use nan
import numpy as np

#Replace all not assigned with NaN and drop not assigned Boroughs, then reset index
df1.replace("Not assigned", np.nan, inplace = True)
df1.dropna(subset=["Borough"], axis=0, inplace=True)
df1.reset_index(drop=True, inplace=True)
df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Verify that no NaN values remain in the Neighbourhood column:

In [22]:
#Check for any remaining Nan in Neighbourhood column
available_data = df1.notnull()
print(available_data["Neighbourhood"].value_counts())

True    103
Name: Neighbourhood, dtype: int64


Now that we have the postal codes, neighbourhoods and boroughs, we move on to extracting geocoordinates for each postal code. We will first try the freely available APIs.

**Attempt 1** : Using geocoder with the Google provider --> DID NOT WORK

In [23]:
#!pip install geocoder #<-- Uncomment if you need to download & install geocoder
import geocoder
# we start by testing a random postal code
g = geocoder.google('M1B Toronto')
coords = g.latlng
print(coords)
print(g)
#help(g)

None
<[REQUEST_DENIED] Google - Geocode [empty]>


In [24]:
coords = None
while(coords is None):
    g = geocoder.google('Toronto, Ontario')
    coords = g.latlng
    print(coords)

None
None
None
None
None
None


KeyboardInterrupt: 

Our request was denied, most likely because the Google geocoding service requires an API key, which we did not supply.
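
The `while` loop above never ends on its own when every request is denied, which is why it had to be interrupted manually. A bounded retry is sketched below as one possible alternative. `retry_until_result` is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```python
import time


def retry_until_result(func, attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call func() until it returns something other than None, at most `attempts` times."""
    for attempt in range(attempts):
        result = func()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next try
    return None
```

It could be used here as `coords = retry_until_result(lambda: geocoder.google('Toronto, Ontario').latlng)`, which gives up and returns `None` after five denied requests.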

**Attempt 2 :** Using Geopy --> DID NOT WORK on postal codes. It returned the coordinates of the city of Toronto rather than those of the postal code

In [25]:
#import Nominatim for geopy
from geopy.geocoders import Nominatim 
address = "M1B, Toronto, Ontario"
geolocator = Nominatim(user_agent="Toronto_Explorer")

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of M1B, Toronto, Ontario are 43.6534817, -79.3839347.


In [26]:
print(location)

Toronto, Golden Horseshoe, Ontario, Canada


We checked several example postal codes against maps.google.com, and the geocoder did not return the correct coordinates for them.

#### Solution: Obtain coordinates from online archive.

To remedy this and move forward with our proof of concept, we loaded the coordinates from a readily available online archive. In a real application, we would subscribe to the Google geocoding API and get the latitude and longitude data for any city.

The archive is posted as a csv at the following link, so we'll read it straight into a pandas dataframe:

In [27]:
# Use the pandas read_csv function to read coordinates data straight into a new dataframe df2
fileUrl = "https://cocl.us/Geospatial_data"

df2 = pd.read_csv(fileUrl)

df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


If you wish to save your own copy of the coordinates data for Toronto use the next section.

In [None]:
df2.to_csv('data')

In [None]:
df3 = pd.read_csv(r'C:\Users\Droidicus\Downloads\data')
df3.head()

In [None]:
df3.drop(columns=['Unnamed: 0'], inplace=True)
df3.head()

We'll then perform a preliminary check before merging to see if the number of rows matches:

In [28]:
print("The shape of the loaded dataframe is: " , df2.shape)

The shape of the loaded dataframe is:  (103, 3)


Then we can go ahead and merge on the "Postal Code" column, which is the only common column between the dataframes df1 and df2.

In [29]:
# Merge the dataframes on Postal Code
dff = pd.merge(df1, df2, on='Postal Code')

In [30]:
dff.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
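
Matching row counts do not by themselves guarantee that every postal code found a partner. As a hedged illustration on two tiny stand-in frames (not the real `df1` and `df2`), the `validate` and `indicator` options of `pd.merge` can make that check explicit.

```python
import pandas as pd

# Tiny stand-ins for df1 (neighbourhoods) and df2 (coordinates).
left = pd.DataFrame({"Postal Code": ["M3A", "M4A"], "Borough": ["North York", "North York"]})
right = pd.DataFrame({"Postal Code": ["M3A", "M4A"], "Latitude": [43.753259, 43.725882]})

merged = pd.merge(
    left,
    right,
    on="Postal Code",        # explicit join key
    how="left",              # keep every neighbourhood row
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no coordinates would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means every neighbourhood row received coordinates.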


Let's start by mapping all the neighbourhoods of Toronto to verify that the coordinate data is correct.

In [31]:
#Import the required modules for dealing with JSONs
import json
from pandas import json_normalize  # top-level import; requires pandas 1.0 or later

Now let's get the latitude and longitude of the city of Toronto using geopy:

In [32]:
#define the search term
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Ontario_Explorer")
location = geolocator.geocode(address)
latitudet = location.latitude
longitudet = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitudet, longitudet))

The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


Since we have 103 neighborhoods, we slice the original dataframe to keep only the boroughs with the word 'Toronto' in their name. This reduces the number of API calls we make to Foursquare in this proof-of-concept exercise. So let's create a new dataframe as follows:

In [33]:
dft = dff[dff['Borough'].str.contains('Toronto')]
dft.reset_index(inplace = True)
dft.head()

Unnamed: 0,index,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,19,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [34]:
dft.shape

(39, 6)

Now let us quickly map the neighbourhoods in Toronto to make sure that everything extracted seems about right:

In [35]:
# create map of Central Toronto using the same latitude and longitude values we got before
map_toronto = folium.Map(location=[latitudet, longitudet], zoom_start=11)

# add markers to map
for lat, lng, label in zip(dft['Latitude'], dft['Longitude'], dft['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

The location data appear to have been loaded correctly.

### Step 3: Getting the Venues data for all locations using Foursquare

Now let's concatenate the two dataframes on top of each other and then reset the index. But first, let us rename the "Price" column to "Neighbourhood". In this way, the price stands in as the name of each listing when it is compared with Toronto's neighbourhoods.

In [36]:
df.rename(columns = {"Price":"Neighbourhood"}, inplace=True)
df.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Link,Pic
0,"1,200,000 AED",25.184884,55.260565,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/dc39565...
1,"1,300,000 AED",25.191404,55.279303,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/e6d7083...
2,"1,190,000 AED",25.186913,55.277919,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/b097242...
3,"1,000,000 AED",25.109367,55.24798,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/7c33f53...
4,"1,300,000 AED",25.207533,55.277978,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/42ace08...
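
One caveat with using the price as a name: the table above already shows `1,300,000 AED` twice, and any later group-by on this column pools listings that share a label. If each listing should stay separate, the labels can be made unique first. The sketch below is one way to do that. `make_unique` and the ` (n)` suffix format are illustrative choices that are not part of the original notebook.

```python
from collections import Counter


def make_unique(labels):
    """Append ' (n)' to labels that occur more than once, leaving unique ones as they are."""
    totals = Counter(labels)
    seen = Counter()
    unique = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            unique.append(f"{label} ({seen[label]})")
        else:
            unique.append(label)
    return unique


# First five prices from the table above.
print(make_unique(["1,200,000 AED", "1,300,000 AED", "1,190,000 AED", "1,000,000 AED", "1,300,000 AED"]))
```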


In [37]:
DF=pd.concat([dft,df])
DF.reset_index(inplace=True)
DF.head()

Unnamed: 0,level_0,index,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Link,Pic
0,0,2.0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6543,-79.3606,,
1,1,4.0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.6623,-79.3895,,
2,2,9.0,M5B,Downtown Toronto,"Garden District, Ryerson",43.6572,-79.3789,,
3,3,15.0,M5C,Downtown Toronto,St. James Town,43.6515,-79.3754,,
4,4,19.0,M4E,East Toronto,The Beaches,43.6764,-79.293,,


In [38]:
DF.drop(columns=['index','level_0'], inplace=True)

In [56]:
DF.sample(5)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Link,Pic
61,,,"1,100,000 AED",25.206125,55.343607,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/fc117ae...
54,,,"1,199,000 AED",25.087251,55.145574,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/7ea868e...
18,M4N,Central Toronto,Lawrence Park,43.728,-79.3888,,
14,M6K,West Toronto,"Brockton, Parkdale Village, Exhibition Place",43.6368,-79.4282,,
44,,,"1,300,000 AED",25.189836,55.27576,https://www.propertyfinder.ae//en/buy/apartmen...,https://www.propertyfinder.ae/property/8455cdb...


The sample shows that the Latitude and Longitude columns were combined correctly, so we can proceed to call the Foursquare API.

#### Foursquare

Let's start by initializing our Foursquare API.

In [40]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID (placeholder; keep real keys out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
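
Keys that are typed or printed in a notebook are visible to anyone who reads it. One common alternative, sketched here as an assumption rather than a requirement, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are made up for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

The notebook would then start with `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` and would not need to print either value.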


Now we define a function, **getNearbyVenues**, as follows. For each location passed in, it requests up to 100 nearby venues (the `LIMIT` value) from Foursquare as JSON and flattens the results into a single new dataframe.

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in tqdm(zip(names, latitudes, longitudes), ncols=80):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
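
The function above indexes straight into the response, so a reply without groups, or a venue without categories, would raise an error partway through the loop. A more tolerant way to flatten one response is sketched below. `parse_explore_items` is a hypothetical helper, and the only assumption about the payload is the layout already used above (`response`, `groups`, `items`, `venue`).

```python
def parse_explore_items(name, lat, lng, payload):
    """Flatten one explore response into rows matching the columns used above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Venues without a category get None instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows
```

Inside the loop this would replace the list comprehension, for example `venues_list.append(parse_explore_items(name, lat, lng, requests.get(url).json()))`.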

In [42]:
LIMIT

100

In [43]:
all_venues = getNearbyVenues(names=DF['Neighbourhood'],
                                   latitudes=DF['Latitude'],
                                   longitudes=DF['Longitude']
                                  )

64it [00:38,  1.65it/s]


In [44]:
print(all_venues.shape)
all_venues.sample(10)

(2912, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2379,"1,100,000 AED",25.082217,55.142624,"The Radisson Blu Residence, Dubai Marina",25.078288,55.143216,Hotel
1984,"1,049,987 AED",25.191107,55.26991,Gazebo,25.190844,55.26695,Indian Restaurant
641,"Little Portugal, Trinity",43.6479,-79.4197,Ufficio,43.649439,-79.423014,Italian Restaurant
487,"Richmond, Adelaide, King",43.6506,-79.3846,Booster Juice,43.648898,-79.383351,Juice Bar
199,St. James Town,43.6515,-79.3754,Downtown Camera,43.653107,-79.37512,Camera Store
1911,"1,250,000 AED",25.073151,55.136982,Zafran Indian Bistro,25.076842,55.139459,Indian Restaurant
1419,"St. James Town, Cabbagetown",43.668,-79.3677,China Gourmet,43.66418,-79.368359,Chinese Restaurant
1539,Church and Wellesley,43.6659,-79.3832,Como En Casa,43.66516,-79.384796,Mexican Restaurant
4,"Regent Park, Harbourfront",43.6543,-79.3606,Impact Kitchen,43.656369,-79.35698,Restaurant
2404,"1,199,000 AED",25.087251,55.145574,"Habtoor Grand Resort, Autograph Collection",25.085991,55.141161,Resort


## Methodology <a name="methodology"></a>

Our approach is to apply an unsupervised clustering model, such as K-means or DBSCAN, directly to the venue data. DBSCAN can label points as outliers, so to avoid the mistake of classifying a poor quality neighbourhood as an outlier we chose K-means instead. We applied a K-means clustering model to the combined dataset, which effectively clusters the Dubai property listings together with Toronto neighbourhoods.
This essentially pairs every property listing with a cluster of neighbourhoods in Toronto: if the property were hypothetically listed in Toronto, it would most likely belong with this group of neighbourhoods, based on its surrounding amenities and facilities, i.e. venues.


One-hot encoding was then applied to the venue categories to allow model building on our categorical features:

In [45]:
# one hot encoding
onehot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighbourhood'] = all_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

#group rows by neighborhood  by taking the mean of the frequency of occurrence of each category
all_grouped = onehot.groupby('Neighbourhood').mean().reset_index()
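
To see why the mean of the one-hot columns is a frequency, it may help to run the same steps on a toy example. The data below is hypothetical, and `dtype=float` is added only to keep the arithmetic explicit. The sketch also shows that `pd.crosstab` with `normalize="index"` gives the same shares in one step.

```python
import pandas as pd

# Hypothetical venue rows for two locations.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Hotel", "Gym", "Hotel", "Hotel"],
})

# Same steps as above: one-hot encode, then average per neighbourhood.
onehot_toy = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot_toy["Neighbourhood"] = venues["Neighbourhood"]
grouped_toy = onehot_toy.groupby("Neighbourhood").mean()

# Equivalent shortcut: counts per category, normalised within each row.
shares = pd.crosstab(venues["Neighbourhood"], venues["Venue Category"], normalize="index")

print(grouped_toy)
print(shares)
```

Two of the four venues around location A are bars, so its `Bar` value is 0.5 in both tables.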


We can go ahead and print some of the frequencies of occurrence to verify our data.

In [46]:
num_top_venues = 5

for loc in all_grouped['Neighbourhood']:
    print("----"+loc+"----")
    temp = all_grouped[all_grouped['Neighbourhood'] == loc].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----1,000,000 AED----
                       venue  freq
0                      Hotel  0.13
1                       Café  0.07
2  Middle Eastern Restaurant  0.07
3                 Restaurant  0.06
4                     Lounge  0.06


----1,040,000 AED----
                venue  freq
0                Café  0.12
1         Coffee Shop  0.08
2                 Spa  0.08
3     Harbor / Marina  0.08
4  Chinese Restaurant  0.08


----1,049,987 AED----
                      venue  freq
0                     Hotel  0.09
1      Gym / Fitness Center  0.05
2       Japanese Restaurant  0.05
3               Coffee Shop  0.05
4  Mediterranean Restaurant  0.04


----1,090,000 AED----
              venue  freq
0              Café  0.10
1       Coffee Shop  0.08
2             Hotel  0.08
3            Lounge  0.05
4  Asian Restaurant  0.05


----1,100,000 AED----
                       venue  freq
0                      Hotel  0.13
1  Middle Eastern Restaurant  0.08
2                Coffee Shop  0.07
3   

                venue  freq
0         Coffee Shop  0.09
1                Café  0.09
2  Italian Restaurant  0.06
3                 Pub  0.06
4    Sushi Restaurant  0.06


----St. James Town----
          venue  freq
0   Coffee Shop  0.07
1          Café  0.06
2    Restaurant  0.05
3  Cocktail Bar  0.05
4     Gastropub  0.04


----St. James Town, Cabbagetown----
         venue  freq
0  Coffee Shop  0.08
1   Restaurant  0.06
2         Café  0.06
3  Pizza Place  0.06
4       Bakery  0.04


----Stn A PO Boxes----
                 venue  freq
0          Coffee Shop  0.10
1   Italian Restaurant  0.04
2                 Café  0.03
3                Hotel  0.03
4  Japanese Restaurant  0.03


----Studio District----
                 venue  freq
0          Coffee Shop  0.08
1               Bakery  0.05
2              Brewery  0.05
3            Gastropub  0.05
4  American Restaurant  0.05


----Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park----
              venue  freq
0       Co

In [47]:
all_venues.groupby('Neighbourhood').count()

| Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| 1,000,000 AED | 54 | 54 | 54 | 54 | 54 | 54 |
| 1,040,000 AED | 26 | 26 | 26 | 26 | 26 | 26 |
| 1,049,987 AED | 75 | 75 | 75 | 75 | 75 | 75 |
| 1,090,000 AED | 63 | 63 | 63 | 63 | 63 | 63 |
| 1,100,000 AED | 92 | 92 | 92 | 92 | 92 | 92 |
| 1,110,000 AED | 34 | 34 | 34 | 34 | 34 | 34 |
| 1,125,000 AED | 86 | 86 | 86 | 86 | 86 | 86 |
| 1,190,000 AED | 68 | 68 | 68 | 68 | 68 | 68 |
| 1,199,000 AED | 30 | 30 | 30 | 30 | 30 | 30 |
| 1,200,000 AED | 82 | 82 | 82 | 82 | 82 | 82 |


We can also check how many unique categories there are across all the venues pulled:

In [48]:
print('There are {} unique categories.'.format(len(all_venues['Venue Category'].unique())))

There are 281 unique categories.


<a id='item3'></a>


We can then define a function named **return_most_common_venues** to capture the top venues of each location; below it is called with `num_top_venues = 10`.

In [49]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
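
As a quick check of what the function returns, here is a small sketch on a hand-made row. `toy_row` and its values are hypothetical; the function body is repeated only so the example runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above, repeated so the sketch is self-contained
    row_categories = row.iloc[1:]  # skip the leading 'Neighbourhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row shaped like one row of all_grouped
toy_row = pd.Series({
    "Neighbourhood": "Toy Town",
    "Café": 0.10,
    "Hotel": 0.30,
    "Park": 0.05,
    "Pub": 0.20,
})

top3 = return_most_common_venues(toy_row, 3)
print(list(top3))
```

The function returns category names ordered by descending frequency, so the first element is the most common venue type for that row.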

In [50]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd take the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
locations_venues_sorted = pd.DataFrame(columns=columns)
locations_venues_sorted['Neighbourhood'] = all_grouped['Neighbourhood']

for ind in np.arange(all_grouped.shape[0]):
    locations_venues_sorted.iloc[ind, 1:] = return_most_common_venues(all_grouped.iloc[ind, :], num_top_venues)

locations_venues_sorted

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1,000,000 AED | Hotel | Middle Eastern Restaurant | Café | Restaurant | Coffee Shop | Lounge | Gym | Lebanese Restaurant | Park | Rental Car Location |
| 1 | 1,040,000 AED | Café | Spa | Cocktail Bar | Coffee Shop | Harbor / Marina | Chinese Restaurant | Hotel | Middle Eastern Restaurant | Sports Bar | Latin American Restaurant |
| 2 | 1,049,987 AED | Hotel | Gym / Fitness Center | Coffee Shop | Japanese Restaurant | Lebanese Restaurant | Indian Restaurant | Mediterranean Restaurant | Asian Restaurant | Pool | Restaurant |
| 3 | 1,090,000 AED | Café | Hotel | Coffee Shop | Asian Restaurant | Gym | Lounge | Restaurant | Japanese Restaurant | Pizza Place | Park |
| 4 | 1,100,000 AED | Hotel | Middle Eastern Restaurant | Coffee Shop | Café | Restaurant | Lounge | Italian Restaurant | Gym | Spa | Gym / Fitness Center |
| 5 | 1,110,000 AED | Coffee Shop | Hotel | Spa | Chinese Restaurant | Resort | Cocktail Bar | Indian Restaurant | Middle Eastern Restaurant | Gym / Fitness Center | Turkish Restaurant |
| 6 | 1,125,000 AED | Coffee Shop | Hotel | Café | Seafood Restaurant | Middle Eastern Restaurant | Breakfast Spot | Beach | Pizza Place | Bakery | Chinese Restaurant |
| 7 | 1,190,000 AED | Hotel | Café | Coffee Shop | Lounge | Asian Restaurant | Italian Restaurant | Gym | Pizza Place | Japanese Restaurant | Middle Eastern Restaurant |
| 8 | 1,199,000 AED | Café | Chinese Restaurant | Hotel | Harbor / Marina | Cocktail Bar | Coffee Shop | Gym | Spa | Resort | Gluten-free Restaurant |
| 9 | 1,200,000 AED | Middle Eastern Restaurant | Hotel | Lounge | Italian Restaurant | Coffee Shop | Gym | Cocktail Bar | Indian Restaurant | French Restaurant | Turkish Restaurant |


In [58]:
locations_venues_sorted.sample(5)

| | Cluster Labels | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 0 | Parkdale, Roncesvalles | Breakfast Spot | Gift Shop | Cuban Restaurant | Eastern European Restaurant | Bookstore | Movie Theater | Dessert Shop | Italian Restaurant | Bar | Restaurant |
| 29 | 0 | Harbourfront East, Union Station, Toronto Islands | Coffee Shop | Aquarium | Hotel | Café | Restaurant | Fried Chicken Joint | Brewery | Scenic Lookout | Park | Baseball Stadium |
| 22 | 0 | Commerce Court, Victoria Hotel | Coffee Shop | Restaurant | Café | Hotel | Gym | American Restaurant | Japanese Restaurant | Deli / Bodega | Seafood Restaurant | Bakery |
| 1 | 0 | 1,040,000 AED | Café | Spa | Cocktail Bar | Coffee Shop | Harbor / Marina | Chinese Restaurant | Hotel | Middle Eastern Restaurant | Sports Bar | Latin American Restaurant |
| 10 | 0 | 1,217,819 AED | Hotel | Café | Restaurant | Middle Eastern Restaurant | Coffee Shop | Seafood Restaurant | Breakfast Spot | Lounge | Chinese Restaurant | American Restaurant |


<a id='item4'></a>


## Predictive Model <a name="model"></a>

With the data one-hot encoded and ready for modelling, we import the required libraries and choose K=4 clusters for our k-means model.


In [51]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

In [52]:
# set number of clusters
kclusters = 4

all_grouped_clustering = all_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(all_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
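
The notebook fixes K=4 without showing how that value was chosen. One common check, which is not part of the original analysis, is to compare the silhouette score across several values of K. The sketch below runs on synthetic data standing in for `all_grouped_clustering`; the array `toy_features` and the range of K values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for all_grouped_clustering: three well-separated blobs
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_features = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centres])

# Fit k-means for each candidate K and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_features)
    scores[k] = silhouette_score(toy_features, labels)

# A higher silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On the real data it would also be worth counting how many rows land in each cluster, for instance with `np.bincount(kmeans.labels_)`, since the first ten labels printed above are all 0.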

In [53]:
# add clustering labels to our dataframe
locations_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Then we join the two data frames so that each location's cluster label and top venues sit alongside its geo coordinates.

In [54]:
all_merged = DF

all_merged = all_merged.join(locations_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

all_merged

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Link | Pic | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6543 | -79.3606 | | | 0 | Coffee Shop | Park | Pub | Bakery | Theater | Breakfast Spot | Café | Yoga Studio | Shoe Store | Beer Store |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.6623 | -79.3895 | | | 0 | Coffee Shop | Yoga Studio | Chinese Restaurant | Fried Chicken Joint | Music Venue | Sushi Restaurant | Sandwich Place | Restaurant | Bar | Bank |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.6572 | -79.3789 | | | 0 | Clothing Store | Coffee Shop | Café | Japanese Restaurant | Bubble Tea Shop | Cosmetics Shop | Ramen Restaurant | Pizza Place | Lingerie Store | Italian Restaurant |
| 3 | M5C | Downtown Toronto | St. James Town | 43.6515 | -79.3754 | | | 0 | Coffee Shop | Café | Restaurant | Cocktail Bar | Gastropub | Beer Bar | American Restaurant | Clothing Store | Moroccan Restaurant | Seafood Restaurant |
| 4 | M4E | East Toronto | The Beaches | 43.6764 | -79.293 | | | 0 | Pub | Health Food Store | Neighborhood | Trail | Yoga Studio | Ethiopian Restaurant | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | English Restaurant |
| 5 | M5E | Downtown Toronto | Berczy Park | 43.6448 | -79.3733 | | | 0 | Coffee Shop | Cheese Shop | Farmers Market | Restaurant | Bakery | Seafood Restaurant | Beer Bar | Cocktail Bar | French Restaurant | Sushi Restaurant |
| 6 | M5G | Downtown Toronto | Central Bay Street | 43.658 | -79.3874 | | | 0 | Coffee Shop | Café | Sandwich Place | Italian Restaurant | Thai Restaurant | Bubble Tea Shop | Burger Joint | Department Store | Japanese Restaurant | Salad Place |
| 7 | M6G | Downtown Toronto | Christie | 43.6695 | -79.4226 | | | 0 | Grocery Store | Café | Park | Baby Store | Nightclub | Coffee Shop | Athletics & Sports | Candy Store | Restaurant | Italian Restaurant |
| 8 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.6506 | -79.3846 | | | 0 | Coffee Shop | Café | Hotel | Gym | Bar | Restaurant | Clothing Store | Thai Restaurant | Bookstore | Juice Bar |
| 9 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669 | -79.4423 | | | 0 | Bakery | Pharmacy | Brewery | Park | Middle Eastern Restaurant | Music Venue | Supermarket | Grocery Store | Bar | Bank |


In [None]:
#all_merged.dropna(subset=["Cluster Labels"], axis=0, inplace=True)

In [None]:
#all_merged.dtypes
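
The `join` in cell 54 keeps every row of `DF`, so any location for which no venues were returned would end up with a missing cluster label, and pandas would then store the whole column as floats. The sketch below shows that behaviour on made-up frames and one way to clean it up before plotting. `toy_locations` and `toy_clusters` are hypothetical stand-ins for `DF` and `locations_venues_sorted`.

```python
import pandas as pd

# Hypothetical stand-ins for DF and locations_venues_sorted
toy_locations = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.65, 43.66, 25.20],
    "Longitude": [-79.38, -79.39, 55.27],
})
toy_clusters = pd.DataFrame({
    "Neighbourhood": ["A", "C"],  # "B" has no venues, so no cluster label
    "Cluster Labels": [0, 2],
})

# join() defaults to a left join: "B" is kept with a missing label
toy_merged = toy_locations.join(toy_clusters.set_index("Neighbourhood"), on="Neighbourhood")
print(toy_merged["Cluster Labels"].tolist())

# Drop unmatched rows, then restore integer labels so they can index a colour list
toy_clean = toy_merged.dropna(subset=["Cluster Labels"]).copy()
toy_clean["Cluster Labels"] = toy_clean["Cluster Labels"].astype(int)
print(toy_clean)
```

This is the same clean-up that the commented-out `dropna` line would perform on `all_merged`, followed by a cast back to integers.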

Now we can visualize the resulting clusters. Zooming out from Dubai and in on Toronto shows how the property listings compare to Toronto neighbourhoods.


In [55]:
# create the folium map object
map_clusters = folium.Map(location=[latituded, longituded], zoom_start=11)

# add in some colour coding
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add the markers on top of the map
markers_colors = []
for lat, lon, poi, cluster in zip(all_merged['Latitude'], all_merged['Longitude'], all_merged['Neighbourhood'], all_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster) - 1],
        fill=True,
        fill_color=rainbow[int(cluster) - 1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>


## Conclusions <a name="conclusions"></a>


From this, we can conclude that the listings on the results page passed to the program are largely similar to neighbourhoods in the southern part of Toronto. This conclusion rests on a k-means clustering model applied to a selection of listings from a results page on propertyfinder.ae. The model can be reapplied to any other results page on the same website, including pages for other cities, and adapted to scrape other listings websites. It can help people relocating to an entirely new city identify similarities between potential new locations and the neighbourhoods of their home city.

## Recommendations <a name="recommendations"></a>

The model depends largely on the quality of the data available from the Foursquare service, and assumes that the categories of venues in the vicinity of a location are a good indication of what living in that neighbourhood is like. A good extension of the project would be to expand the dataset with other sources in addition to Foursquare, such as Google Maps reviews and TripAdvisor, as well as official data such as crime rates and insurance rates.
Other clustering models might also give better results, for example by setting outliers aside rather than letting them occupy a cluster by themselves. The potential applications are numerous.
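
As one illustration of a clustering model that handles outliers differently, the sketch below uses scikit-learn's DBSCAN, which labels isolated points as noise (`-1`) instead of giving them a cluster of their own. This is an alternative technique, not what the notebook uses. The synthetic `toy_features` array and the `eps` and `min_samples` values are assumptions chosen for this toy data and would need tuning on the real venue frequencies.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense groups plus one far-away outlier
rng = np.random.default_rng(0)
blob_a = 0.2 * rng.standard_normal((40, 2))
blob_b = np.array([4.0, 4.0]) + 0.2 * rng.standard_normal((40, 2))
outlier = np.array([[20.0, -20.0]])
toy_features = np.vstack([blob_a, blob_b, outlier])

# eps and min_samples are illustrative values, not tuned for real data
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(toy_features)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

With k-means and the same data, the lone outlier would be forced into one of the K clusters or take up a cluster by itself, which is the behaviour described above.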
