**Business Problem: Introduction**


This project investigates the ratings of venues in Queens. To do this, we identify clusters of high-quality venues, mid-range venues and venues receiving low ratings.

This information can be visualised on a map and is useful for the following use cases:

- Deciding where to live
- Choosing a meeting place for eating out in an area with more highly rated restaurants
- Identifying the competition, or the reputation of a particular area, when looking for a location to open a business

**Data**<br>
In order to draw conclusions on this problem, we will require the following information:

- Venue locations provided by Foursquare
- Ratings for each venue, also provided by Foursquare

To obtain this information, we used the New York neighborhood data from Coursera and retrieved up to 50 venues within a 200 m radius of each neighborhood in Queens. Of these, we kept only venues that had been rated on Foursquare.
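
The collection step described above could be sketched roughly as follows. This is a minimal sketch, assuming the legacy Foursquare v2 `venues/explore` endpoint used in the Coursera lab; the helper `build_explore_request`, the placeholder credentials, the version date and the example coordinates are illustrative and are not taken from this notebook.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint (as used in the Coursera lab).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_request(lat, lng, client_id, client_secret,
                          radius=200, limit=50, version="20180605"):
    """Return (url, params) for one neighborhood; no network call is made here."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (placeholder value)
        "ll": f"{lat},{lng}",    # neighborhood centre
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of venues returned
    }
    return EXPLORE_URL, params


# Placeholder credentials and illustrative coordinates.
url, params = build_explore_request(40.7282, -73.7949, "CLIENT_ID", "CLIENT_SECRET")
full_url = f"{url}?{urlencode(params)}"
print(full_url)
```

The resulting pair can be passed to `requests.get(url, params=params)` once real credentials are supplied.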

The venues will be clustered based on their ratings and the location information will be used to visualise the clusters from a geographical perspective.
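
Clustering on ratings alone can be sketched with the `KMeans` class that this notebook imports. The ratings below are made up for illustration, and the step that reorders clusters into low, mid and high tiers is an assumption about how one might label them, not necessarily the procedure used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ratings on a 0-10 scale, for illustration only.
ratings = np.array([9.1, 8.8, 9.3, 7.0, 6.8, 7.2, 5.0, 4.7, 5.4]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ratings)

# K-means label numbers are arbitrary, so rank clusters by their mean rating:
# tier 0 = lowest-rated cluster, tier 2 = highest-rated cluster.
order = np.argsort(kmeans.cluster_centers_.ravel())
tier_of_label = np.empty_like(order)
tier_of_label[order] = np.arange(len(order))
tiers = tier_of_label[kmeans.labels_]

print(tiers)
```

Ranking the cluster centres keeps the tier numbers stable across runs, which matters when the tiers are later mapped to marker colours.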



# **Data Collection**

In [2]:
# !conda install -c conda-forge requests --yes
# import requests


In [3]:
# source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# soup = BeautifulSoup(source,'lxml')
# print(soup.prettify())

In [4]:
# table = soup.find('table')
# print(table.prettify())

In [5]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [6]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
table = data[0]
table


Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned


In [7]:
new_header = table.iloc[0] # grab the first row for the header
table = table[1:] # take the data less the header row
table.columns = new_header # set the header row as the table header
table

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
10,M8A,Not assigned,Not assigned


In [8]:
newtable = table[table.Borough != "Not assigned"].copy() # copy so later edits do not write to a slice of table
newtable

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern


In [9]:
# where no neighbourhood is assigned, fall back to the borough name
unassigned = newtable['Neighbourhood'] == "Not assigned"
newtable.loc[unassigned, 'Neighbourhood'] = newtable.loc[unassigned, 'Borough']
        
newtable

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Queen's Park
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern


In [10]:
newtable = newtable.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
newtable

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
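
As a small self-contained illustration of the grouping step above, the sketch below uses three toy rows rather than the full table. Neighbourhoods that share a postcode and borough are joined into one comma-separated string; `toy` and `grouped` are names introduced only for this example.

```python
import pandas as pd

# Three toy rows; two of them share the postcode M5A.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (Postcode, Borough), with neighbourhood names joined by ", ".
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```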


In [11]:
newtable.shape


(103, 3)

In [12]:
coordinates = pd.read_csv(r'http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv')
coordinates

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [13]:
coordinates.rename(columns={'Postal Code':'Postcode'},inplace=True)
coordinates

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [14]:
finaltable = pd.merge(newtable,coordinates,on='Postcode',how='left')
finaltable

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
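
Because the merge above is a left join, a postcode missing from the coordinates file would silently receive NaN coordinates. A minimal sketch of a check, using toy frames and the `indicator` option of `pd.merge`; the `M9Z` row is invented to show the unmatched case.

```python
import pandas as pd

# Toy inputs: M9Z has no entry in the coordinates frame.
left = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"]})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# indicator=True adds a "_merge" column that records where each row came from.
checked = pd.merge(left, right, on="Postcode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list would suggest that every postcode found a coordinate pair.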


In [15]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [16]:

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(finaltable['Latitude'], finaltable['Longitude'], finaltable['Borough'], finaltable['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Results and Discussion**<br>
In this project, which assesses the quality of venues in Queens as rated by Foursquare users, we excluded venues with no ratings. As a result, the visualisation does not take every venue into account, but it most likely indicates where venues are more frequently visited.
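
The exclusion of unrated venues amounts to dropping rows without a rating. A minimal sketch, assuming a DataFrame with a `rating` column; the column names and rows here are hypothetical. Reporting the share of venues kept makes the coverage limitation explicit.

```python
import pandas as pd

# Hypothetical venue table; names and ratings are invented for illustration.
venues = pd.DataFrame({
    "name": ["Cafe A", "Diner B", "Bar C", "Bakery D"],
    "rating": [8.4, None, 6.1, None],
})

# Keep only venues that have a rating.
rated = venues.dropna(subset=["rating"]).reset_index(drop=True)
share_kept = len(rated) / len(venues)
print(f"Kept {len(rated)} of {len(venues)} venues ({share_kept:.0%})")
```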

On top of this, the Foursquare API imposes a quota on the number of calls that can be made per day (I was unable to get around the bug of creating an app using a verified personal account, so had to use the Sandbox account).

The available data covers three distinct areas. Compared with the raw number of venues found, this is not enough data to draw meaningful conclusions about Queens as a whole.

If we limit the investigation to these three areas, we can only rank them in order of preference relative to each other.

In Area 1, most venues belong to cluster 2 (high-quality venues), whereas Area 2 has the highest concentration of venues, but most of them belong to cluster 0, indicating that venues in this area are not of a very high standard. Area 3 has a combination of mid- and low-quality venues.

From these observations, we conclude that the order of preference for places to live, judged on venue quality alone, would be Area 1, Area 3, then Area 2.
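
One way to make this ranking reproducible is to compute per-area cluster shares and a simple score. The sketch below uses invented venue assignments that only mimic the pattern described above, and it assumes cluster 0 is low, cluster 1 is mid and cluster 2 is high. The mean cluster index is a crude score chosen for illustration, not the method used in this project.

```python
import pandas as pd

# Invented per-venue assignments: area number and cluster (0 = low, 1 = mid, 2 = high).
venues = pd.DataFrame({
    "area":    [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    "cluster": [2, 2, 2, 0, 0, 0, 0, 0, 1, 1, 1, 0],
})

# Share of each cluster within each area (each row sums to 1).
shares = pd.crosstab(venues["area"], venues["cluster"], normalize="index")

# Mean cluster index per area as a crude quality score (higher is better).
score = venues.groupby("area")["cluster"].mean().sort_values(ascending=False)
ranking = score.index.tolist()

print(shares.round(2))
print(ranking)
```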

**Conclusion**<br>
From this investigation, we found:

- With the limited data obtained from Foursquare, the venues found were clustered into three distinct areas (indicated on the map).
- Area 1 contained the highest proportion of high-quality venues, while still containing a mixture of low to mid-range venues on its northern side.
- Area 2 contained the highest density of venues, most of them low-range.
- Area 3 contained a mixture of low- and mid-range venues, with no high-range venues.
- The order of preference for places to live, based on venue quality, would be Area 1, Area 3, then Area 2.