# Applied Data Science Capstone by IBM on Coursera

## Best neighborhoods for millennials to live in Toronto

## Introduction: Business Problem

Toronto, the largest metropolitan area in Canada, is very diverse and is the country's financial capital. Its employment opportunities have attracted many Canadians and immigrants, especially millennials between 22 and 37 years old. According to the UBS Global Real Estate Bubble Index 2019, Toronto ranks second worldwide for the risk of a housing bubble and is among the most overpriced cities in the world.

Because of the steeply inflating housing markets, many millennials are not able to afford living in Toronto. They are forced to live in the suburbs of Ontario and spend many hours commuting to the city for work. According to Statistics Canada, the millennial cohort had a median after-tax household income of CAD 44,093 in 2016.   

In this project, I investigate whether there are still neighbourhoods in Toronto that are suitable and affordable for millennials. A common rule of thumb is that we shouldn't spend more than 30 percent of our income on rent. Therefore, I will look for neighbourhoods with an average rent of less than about CAD 13,230 a year, or CAD 1,102 a month. I hope to provide some insight to millennials by recommending suitable neighbourhoods. I will focus on millennials who are looking only for a bachelor's unit. In addition to rent, I will use the Foursquare API to retrieve the first 100 venues closest to each neighbourhood and perform K-Means clustering to group neighbourhoods by venue category. Lastly, I will identify the ideal neighbourhoods by taking into account rent and the infrastructure around each neighbourhood, such as TTC stations, hospitals, police stations, and restaurants.
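
The rent ceiling follows from applying the 30 percent rule to the median income quoted above. A minimal sketch of that arithmetic is shown here; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Median after-tax household income for millennials in 2016 (Statistics Canada figure quoted above)
median_income_cad = 44093
rent_share = 0.30  # rule-of-thumb share of income spent on rent

annual_rent_cap = median_income_cad * rent_share
monthly_rent_cap = annual_rent_cap / 12

print('Annual rent cap:  CAD {:.0f}'.format(annual_rent_cap))
print('Monthly rent cap: CAD {:.0f}'.format(monthly_rent_cap))
```

The annual figure comes out slightly below the rounded CAD 13,230 used in the text, and the monthly figure rounds to CAD 1,102.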

## Data

Aspects that will be taken into account in this problem are:
* Rent of Bachelor's unit in each neighbourhood
* Number of Venues in the neighbourhood
* Type of Venues (Infrastructure) in the neighbourhood
* Distance from each neighbourhood to downtown Toronto
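
The distance to downtown can be estimated from the neighbourhood centre coordinates with the haversine formula. The sketch below is one possible way to do it, not the method used in this notebook; `haversine_km` is a hypothetical helper, and the two coordinate pairs are the values reported later in this notebook for Toronto and for Bay Street Corridor.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

downtown = (43.653963, -79.387207)         # Nominatim result for 'Toronto, ON'
bay_street_corridor = (43.6577, -79.3862)  # geocoded neighbourhood centre

distance = haversine_km(*bay_street_corridor, *downtown)
print('{:.2f} km'.format(distance))
```

This is a straight-line distance, so it understates actual commuting distance along roads or transit lines.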

Below is a list of data I will be using to investigate this problem:
* Toronto neighbourhood data already processed in Week 3, which gives the centre of each neighbourhood.
* The average rent of a bachelor's unit by neighbourhood, from CMHC: https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/Table?TableId=2.2.11&GeographyId=2270&GeographyTypeId=3&DisplayAs=Table&GeograghyName=Toronto#Row%20/%20Apartment
* The number and categories of venues in each neighbourhood, retrieved with the **Foursquare API**.

I will only work on Toronto neighbourhoods that have both location and rent data. Therefore, the conclusions made here will be slightly biased because the dataset is incomplete.

### Pre-processing Data

#### Import required libraries

In [2]:
import pandas as pd
import numpy as np
import urllib.request, urllib.parse, urllib.error
import requests
from tabulate import tabulate

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c conda-forge geocoder --yes # comment out this line if geocoder is already installed
import geocoder

print('Libraries imported.')

#### Load data.csv into a dataframe. This csv is downloaded from work done during Week 3. It includes postal code, latitude, and longitude for each neighborhood in Toronto.

#### Load AvgRent_Toronto.csv into a dataframe, df_rent. It is downloaded from CMHC's website and includes the rent of different unit types in each neighborhood.

In [16]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_5e6e27342afa4c63968e7a34a8916a0b = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential redacted before sharing
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')


body = client_5e6e27342afa4c63968e7a34a8916a0b.get_object(Bucket='ibmcourseraapplieddatasciencecaps-donotdelete-pr-vrpjp4rmk3ehku',Key='AvgRent_Toronto.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.
df_rent = pd.read_csv(body)
df_rent.head()

## Load week 3 csv (not necessary)
# body = client_5e6e27342afa4c63968e7a34a8916a0b.get_object(Bucket='ibmcourseraapplieddatasciencecaps-donotdelete-pr-vrpjp4rmk3ehku',Key='data.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
# if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.

# df = pd.read_csv(body)
# df.head()

Unnamed: 0,Neighborhood,Bachelor,1 Bedroom,2 Bedroom,3 Bedroom +,Total
0,Agincourt/Malvern,**,1105,1316,1569,1341
1,Ajax/Pickering,**,953,1248,1397,1283
2,Alderwood,**,1169,1462,**,1435
3,Aurora,**,1127,1347,**,1298
4,Banbury-Don Mills/York Mills,**,1163,1335,1643,1286


#### Since I am only looking at bachelor's units, irrelevant columns such as 1 Bedroom and 2 Bedroom will be dropped.

In [17]:
df_rent.drop(['1 Bedroom', '2 Bedroom', '3 Bedroom +', 'Total'], axis = 1, inplace = True)
df_rent.head()

Unnamed: 0,Neighborhood,Bachelor
0,Agincourt/Malvern,**
1,Ajax/Pickering,**
2,Alderwood,**
3,Aurora,**
4,Banbury-Don Mills/York Mills,**


In [18]:
df_rent.shape
# 135 neighborhoods in total

(135, 2)

#### Some of the neighborhoods in df_rent are listed in the same row. Split them and assign the same rent.

In [19]:
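# Index by the non-Neighborhood columns so their values follow each split name,
# split 'A/B' into separate columns, then stack them back into one row per name
df_rent2 = (df_rent.set_index(df_rent.columns.drop('Neighborhood').tolist())
            .Neighborhood.str.split('/', expand=True)
            .stack()
            .reset_index()
            .rename(columns={0: 'Neighborhood'})
            .loc[:, df_rent.columns])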

In [20]:
df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor
0,Agincourt,**
1,Malvern,**
2,Ajax,**
3,Pickering,**
4,Alderwood,**
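
The same split can be written more compactly with `DataFrame.explode`, which is available from pandas 0.25 onward. The sketch below is an alternative under that assumption, not the code used in this notebook; the two sample rows are copied from the table above.

```python
import pandas as pd

# Two sample rows in the same shape as df_rent
sample = pd.DataFrame({
    'Neighborhood': ['Agincourt/Malvern', 'Alderwood'],
    'Bachelor': ['**', '**'],
})

# Split on '/', then give each resulting name its own row with the same rent
split_rows = (
    sample.assign(Neighborhood=sample['Neighborhood'].str.split('/'))
          .explode('Neighborhood')
          .reset_index(drop=True)
)
print(split_rows)
```

Note that splitting on `/` assumes the slash always separates two distinct areas that share one rent figure.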


#### Neighborhoods with no Bachelor rent data (marked `**`) will be dropped.

In [21]:
# Keep only the rows that have a Bachelor rent; '**' marks missing data
df_rent2 = df_rent2[df_rent2['Bachelor'] != '**'].copy()

df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor
9,Bay Street Corridor,1615
10,Bayview Village,1072
13,Bedford Park-Nortown,947
14,Beechborough-Greenbrook,875
15,Bendale,1032


#### Reset index. 

In [22]:
df_rent2.reset_index(inplace=True)
df_rent2.head()

Unnamed: 0,index,Neighborhood,Bachelor
0,9,Bay Street Corridor,1615
1,10,Bayview Village,1072
2,13,Bedford Park-Nortown,947
3,14,Beechborough-Greenbrook,875
4,15,Bendale,1032


#### Drop the previous index column.

In [23]:
df_rent2.drop(['index'], axis = 1, inplace = True)
df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor
0,Bay Street Corridor,1615
1,Bayview Village,1072
2,Bedford Park-Nortown,947
3,Beechborough-Greenbrook,875
4,Bendale,1032


#### Number of neighborhoods that will be studied:

In [29]:
print(df_rent2.shape)

(107, 4)


#### Create two new columns, Latitude and Longitude, in df_rent2.

In [25]:
df_rent2['Latitude'] = ''
df_rent2['Longitude'] = ''
df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude
0,Bay Street Corridor,1615,,
1,Bayview Village,1072,,
2,Bedford Park-Nortown,947,,
3,Beechborough-Greenbrook,875,,
4,Bendale,1032,,


### Look up the latitude and longitude of each neighborhood

#### Retrieve latitude and longitude for each neighborhood listed in df_rent2

In [26]:
import geocoder

for i, row in df_rent2.iterrows():
    address = '{}, Toronto, Ontario'.format(row['Neighborhood'])
    g = geocoder.arcgis(address)
    # latlng is empty when the lookup fails, so check it before indexing
    if g.latlng:
        df_rent2.loc[i, 'Latitude'] = g.latlng[0]
        df_rent2.loc[i, 'Longitude'] = g.latlng[1]

In [27]:
df_rent2

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude
0,Bay Street Corridor,1615,43.6577,-79.3862
1,Bayview Village,1072,43.7771,-79.3796
2,Bedford Park-Nortown,947,43.7307,-79.4245
3,Beechborough-Greenbrook,875,43.6931,-79.4783
4,Bendale,1032,43.7596,-79.2574
5,Birchcliffe-Cliffside,943,43.6947,-79.2646
6,Black Creek,1108,43.7664,-79.5215
7,Bradford,951,43.6538,-79.4547
8,West Gwillimbury,951,44.1334,-79.6163
9,New Tecumseth,951,43.6424,-79.4054


#### Confirm the number of neighborhoods that will be studied.

In [28]:
df_rent2.shape

(107, 4)

### Option to download data to local machine
#### I downloaded the cleaned dataframe to my local machine so that I can continue working on it without re-running the code above.

In [43]:
from IPython.display import HTML
import base64 

def create_download_link(df, title = "Download CSV file", filename = "data.csv"):  
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

create_download_link(df_rent2)

#### I also have the option of loading the csv I downloaded using the code below instead of cleaning the data again.

In [6]:
# This cell reuses the client object and the __iter__ helper defined in the credentials cell above.
# Re-run that cell first after a kernel restart; otherwise this cell raises a NameError.
body = client_5e6e27342afa4c63968e7a34a8916a0b.get_object(Bucket='ibmcourseraapplieddatasciencecaps-donotdelete-pr-vrpjp4rmk3ehku',Key='AvgRent_Toronto_latlong.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# If you are reading an Excel file into a pandas DataFrame, replace `read_csv` by `read_excel` in the next statement.
df_rent2 = pd.read_csv(body)
df_rent2.head()

#### Create a new column to categorize rent that is less than CAD 1120 as 1, more than CAD 1120 as 0.

In [47]:
df_rent2['Rent Category'] = ''
df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category
0,Bay Street Corridor,1615,43.6577,-79.3862,
1,Bayview Village,1072,43.7771,-79.3796,
2,Bedford Park-Nortown,947,43.7307,-79.4245,
3,Beechborough-Greenbrook,875,43.6931,-79.4783,
4,Bendale,1032,43.7596,-79.2574,


#### The price in Bachelor is still an object. We need to remove the comma and convert it to a float.

In [55]:
# Cast to str first so the cell can be re-run after the column is already numeric
df_rent2['Bachelor'] = df_rent2['Bachelor'].astype(str).str.replace(',', '').astype(float)

df_rent2['Bachelor'].dtype

#### Confirm that Bachelor is now a float.

In [56]:
df_rent2.head()

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category
0,Bay Street Corridor,1615.0,43.6577,-79.3862,
1,Bayview Village,1072.0,43.7771,-79.3796,
2,Bedford Park-Nortown,947.0,43.7307,-79.4245,
3,Beechborough-Greenbrook,875.0,43.6931,-79.4783,
4,Bendale,1032.0,43.7596,-79.2574,
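
The Rent Category column is created empty above. One way to fill it once Bachelor is numeric is sketched below; `RENT_THRESHOLD` is an illustrative name, the sample rows are copied from the table above, and assigning 0 to a rent of exactly CAD 1120 is an assumption, since the text does not say how that case should be handled.

```python
import numpy as np
import pandas as pd

RENT_THRESHOLD = 1120  # CAD per month, the cut-off used in this notebook

sample = pd.DataFrame({
    'Neighborhood': ['Bay Street Corridor', 'Bayview Village', 'Bedford Park-Nortown'],
    'Bachelor': [1615.0, 1072.0, 947.0],
})

# 1 = rent below the threshold, 0 = otherwise
sample['Rent Category'] = np.where(sample['Bachelor'] < RENT_THRESHOLD, 1, 0)
print(sample)
```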


### Visualize the spatial distribution of neighborhoods in Toronto for analysis

In [30]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [31]:
# create map of Toronto using latitude and longitude values
map_gta = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_rent2['Latitude'], df_rent2['Longitude'], df_rent2['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_gta)  
    
map_gta

#### Let's plot the histogram of rent of a Bachelor for each neighborhood. In this case, we want to focus on neighborhoods with rent less than CAD 1120 a month. I will create 2 bins, 0 - 1120 and 1120 - max.

#### Import the required libraries first.

In [32]:
# use the inline backend to generate the plots within the browser
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

Matplotlib version:  3.0.2


#### Extract a subset of df_rent2 for plotting. We only need Neighborhood and Bachelor here.

In [35]:
df_hist = df_rent2[['Neighborhood', 'Bachelor']]
df_hist.head()

Unnamed: 0,Neighborhood,Bachelor
0,Bay Street Corridor,1615
1,Bayview Village,1072
2,Bedford Park-Nortown,947
3,Beechborough-Greenbrook,875
4,Bendale,1032
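
The two-bin split described earlier can be computed as follows. This is a sketch on the five preview rows only, not on the full dataset, and the variable names are illustrative.

```python
import numpy as np

# Sample rents copied from the df_hist preview above
rents = np.array([1615, 1072, 947, 875, 1032])

# Two bins: [0, 1120) and [1120, max]
bin_edges = [0, 1120, rents.max()]
counts, edges = np.histogram(rents, bins=bin_edges)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print('CAD {} to {}: {} neighborhoods'.format(int(low), int(high), int(n)))
```

The same `bin_edges` list can be passed as the `bins` argument of `matplotlib.pyplot.hist` to draw the histogram itself.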


#### Neighborhoods with rent higher than CAD 1120 a month will not be considered, so let's drop those neighborhoods.

In [65]:
indexNames = df_rent2[df_rent2['Bachelor'] > 1120].index
 
df_rent2.drop(indexNames , inplace=True)
df_rent2

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category
1,Bayview Village,1072.0,43.7771,-79.3796,
2,Bedford Park-Nortown,947.0,43.7307,-79.4245,
3,Beechborough-Greenbrook,875.0,43.6931,-79.4783,
4,Bendale,1032.0,43.7596,-79.2574,
5,Birchcliffe-Cliffside,943.0,43.6947,-79.2646,
6,Black Creek,1108.0,43.7664,-79.5215,
7,Bradford,951.0,43.6538,-79.4547,
8,West Gwillimbury,951.0,44.1334,-79.6163,
9,New Tecumseth,951.0,43.6424,-79.4054,
10,Brampton (East),893.0,43.6831,-79.5602,


#### Now we have 91 neighborhoods to work on.

## Methodology

#### Now that our neighborhood data is complete, we can start using the Foursquare API to fetch the top 10 venues around each neighborhood.
#### Below are my Foursquare credentials.

In [66]:
CLIENT_ID = '<YOUR_FOURSQUARE_CLIENT_ID>' # your Foursquare ID (redacted)
CLIENT_SECRET = '<YOUR_FOURSQUARE_CLIENT_SECRET>' # your Foursquare Secret (redacted)
VERSION = '20191019' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <YOUR_FOURSQUARE_CLIENT_ID>
CLIENT_SECRET:<YOUR_FOURSQUARE_CLIENT_SECRET>


#### Create a function that extracts category of the venue found around the neighborhood.

In [67]:
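def get_category_type(row):
    # The categories may be stored under either key, depending on the response shape
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # Return None when the venue has no category; otherwise the first category name
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']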

#### The function below gets venues within 500 m of each neighborhood in Toronto.

In [68]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
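
The function above indexes directly into `['response']['groups'][0]['items']` and `['categories'][0]`, which raises an error if a response has no groups or a venue has no category. A more defensive parsing step could look like the sketch below. `extract_venues` is a hypothetical helper, and the sample payload is made up for illustration; it only mirrors the nesting used in the function above and is not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict.

    Missing groups give an empty list and missing categories give None,
    instead of raising IndexError or KeyError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    venues = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        venues.append((venue.get('name'), location.get('lat'), location.get('lng'), category))
    return venues

# Hypothetical payload for illustration only
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Venue A', 'location': {'lat': 43.7, 'lng': -79.4},
                   'categories': [{'name': 'Park'}]}},
        {'venue': {'name': 'Venue B', 'location': {'lat': 43.8, 'lng': -79.3},
                   'categories': []}},
    ]}]}
}

print(extract_venues(sample_payload))
print(extract_venues({'response': {}}))
```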

#### I'd like to find out the top 10 closest venues in each neighborhood.

In [69]:
LIMIT = 10

neigh_venues = getNearbyVenues(names=df_rent2['Neighborhood'],
                                   latitudes=df_rent2['Latitude'],
                                   longitudes=df_rent2['Longitude'])

Bayview Village
Bedford Park-Nortown
Beechborough-Greenbrook
Bendale
Birchcliffe-Cliffside
Black Creek
Bradford
West Gwillimbury
New Tecumseth
Brampton (East)
Brampton (West)
Briar Hill-Belgravia
Broadview North
Brookhaven-Amesbury
Cabbagetown-South St. James Town
Caledonia-Fairbank
Erin Mills
Clairlea-Birchmount
Clanton Park
Clarkson
Lorne Park
Cooksville
Crescent Town
Danforth Village-East York
Don Valley Village
Pleasant View
Dorset Park
Downsview
Dufferin Grove
Little Portugal
East End-Danforth
East Gwillimbury
Newmarket
Eglinton East
Englemount-Lawrence
Forest Hill North
Forest Hill South
Humewood-Cedarvale
Ionview
Kennedy Park
Lambton Baby Point
Lawrence Park South
Leaside-Bennington
Long Branch
Maple Leaf
Milton
Halton Hills
Mimico
Mississauga Centre
Streetsville
Moss Park
Regent Park
Mount Dennis
Mount Pleasant East
New Toronto
North St. James Town
Oakville (excl. Bronte)
Oakwood-Vaughan
O'Connor-Parkview
Old East York
Orangeville
Mono
Parkwoods-Donalda
Playter Estates-Danforth

#### Let's look at the dataframe.

In [71]:
print(neigh_venues.shape)
neigh_venues.head()

(638, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bayview Village,43.7771,-79.37957,Flowers & Blossoms Inc.,43.776827,-79.378152,Flower Shop
1,Bayview Village,43.7771,-79.37957,Royal Frenchies,43.777059,-79.382422,Dog Run
2,Bayview Village,43.7771,-79.37957,Driving Range (Kennedy & Major Mackenzie),43.776885,-79.382417,Golf Driving Range
3,Bayview Village,43.7771,-79.37957,Forest Grove,43.779341,-79.379326,Trail
4,Bayview Village,43.7771,-79.37957,Firoka Construction & Painting,43.779619,-79.380599,Construction & Landscaping


#### Number of venues found in each neighborhood

In [72]:
neigh_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Bayview Village,5,5,5,5,5,5
Bedford Park-Nortown,8,8,8,8,8,8
Beechborough-Greenbrook,4,4,4,4,4,4
Bendale,4,4,4,4,4,4
Birchcliffe-Cliffside,4,4,4,4,4,4
Black Creek,6,6,6,6,6,6
Bradford,10,10,10,10,10,10
Brampton (East),5,5,5,5,5,5
Brampton (West),5,5,5,5,5,5
Briar Hill-Belgravia,10,10,10,10,10,10


#### Number of unique venue categories

In [73]:
print('There are {} unique categories.'.format(len(neigh_venues['Venue Category'].unique())))

There are 160 unique categories.


#### One hot encoding

In [74]:
# one hot encoding
neigh_onehot = pd.get_dummies(neigh_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neigh_onehot['Neighborhood'] = neigh_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [neigh_onehot.columns[-1]] + list(neigh_onehot.columns[:-1])
neigh_onehot = neigh_onehot[fixed_columns]

#### Let's evaluate the size of the neigh_onehot dataframe.

In [75]:
neigh_onehot.shape
neigh_onehot

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Bayview Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Bedford Park-Nortown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Bedford Park-Nortown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Bedford Park-Nortown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Bedford Park-Nortown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,Bedford Park-Nortown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Group neigh_onehot by Neighborhood.

In [77]:
neigh_grouped = neigh_onehot.groupby('Neighborhood').mean().reset_index()
neigh_grouped

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Bayview Village,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
1,Bedford Park-Nortown,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,Beechborough-Greenbrook,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,Bendale,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,Birchcliffe-Cliffside,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
5,Black Creek,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
6,Bradford,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
7,Brampton (East),0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
8,Brampton (West),0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
9,Briar Hill-Belgravia,0.00,0.0,0.00,0.0,0.0,0.1,0.0,0.0,0.000000,...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
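
To make the one-hot and group-by-mean steps concrete, here is a small self-contained sketch on toy data. The neighborhood and category names are illustrative; the point is that each value in the grouped table is the share of a neighborhood's venues that fall in that category, so each row sums to 1.

```python
import pandas as pd

# Toy venue list with illustrative names
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Gym', 'Pub', 'Park', 'Pub'],
})

# One indicator column per category, then the per-neighborhood mean
onehot = pd.get_dummies(venues['Venue Category']).astype(float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```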


#### Sort the top 10 venues in descending order in each neighborhood.

In [78]:
num_top_venues = 10

for hood in neigh_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = neigh_grouped[neigh_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bayview Village----
                        venue  freq
0  Construction & Landscaping   0.2
1                     Dog Run   0.2
2          Golf Driving Range   0.2
3                       Trail   0.2
4                 Flower Shop   0.2
5                 Pizza Place   0.0
6               Moving Target   0.0
7                  Nail Salon   0.0
8                Noodle House   0.0
9                Optical Shop   0.0


----Bedford Park-Nortown----
                   venue  freq
0   Fast Food Restaurant  0.12
1            Coffee Shop  0.12
2              Juice Bar  0.12
3         Sandwich Place  0.12
4          Grocery Store  0.12
5                   Park  0.12
6           Skating Rink  0.12
7           Liquor Store  0.12
8  Performing Arts Venue  0.00
9           Noodle House  0.00


----Beechborough-Greenbrook----
                           venue  freq
0              Convenience Store  0.25
1                     Restaurant  0.25
2                 Sandwich Place  0.25
3             Turk

#### Create a function to tabulate venues in decreasing frequency.

In [80]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
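
A quick usage sketch of this function on a toy row follows. The function body is repeated so the example runs on its own, and the row values are illustrative.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Same logic as the function above: skip the Neighborhood label, sort the rest
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Toy row: neighborhood name followed by category frequencies
row = pd.Series({'Neighborhood': 'A', 'Gym': 0.25, 'Park': 0.50, 'Pub': 0.15, 'Zoo': 0.10})
top = return_most_common_venues(row, 2)
print(list(top))
```

When a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is 0.0. This is why a neighborhood with only five venues, such as Bayview Village, still shows ten "most common" venues in the table that follows.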

#### Tabulate top 10 venues for each neighborhood.

In [82]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neigh_venues_sorted = pd.DataFrame(columns=columns)
neigh_venues_sorted['Neighborhood'] = neigh_grouped['Neighborhood']

for ind in np.arange(neigh_grouped.shape[0]):
    neigh_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neigh_grouped.iloc[ind, :], num_top_venues)

neigh_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview Village,Trail,Construction & Landscaping,Flower Shop,Golf Driving Range,Dog Run,Discount Store,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant
1,Bedford Park-Nortown,Coffee Shop,Sandwich Place,Fast Food Restaurant,Liquor Store,Skating Rink,Park,Juice Bar,Grocery Store,Deli / Bodega,Department Store
2,Beechborough-Greenbrook,Convenience Store,Restaurant,Turkish Restaurant,Sandwich Place,Yoga Studio,Diner,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner
3,Bendale,Park,Tennis Court,Dog Run,Discount Store,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner
4,Birchcliffe-Cliffside,General Entertainment,Café,Skating Rink,College Stadium,Cosmetics Shop,Discount Store,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant


### Clustering neighborhoods to compare similarities

#### Now I would like to look at the top 10 venues extracted for each neighborhood to determine how similar the neighborhoods are. After assessing the results, I will conclude which cluster is better for millennials to rent in.

#### I will use K Means Clustering to cluster venues.

In [83]:
# set number of clusters
kclusters = 5

neigh_grouped_clustering = neigh_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neigh_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2], dtype=int32)
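
The choice of five clusters is not justified above. One common way to compare candidate values of k is to look at how the inertia (within-cluster sum of squared distances) falls as k grows. The sketch below does this on a toy frequency matrix with illustrative values; it is not the analysis performed in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency matrix: 6 "neighborhoods" x 3 "categories"
X = np.array([
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0], [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9], [0.0, 0.2, 0.8],
])

inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print('k={}: inertia={:.3f}'.format(k, value))
```

On the real data, the same loop could be run with `neigh_grouped_clustering` in place of `X`; a value of k after which the inertia stops dropping sharply is a reasonable candidate.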

#### Add the cluster labels to the sorted venues and merge them with the rent data.

In [84]:
neigh_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

neigh_merged = df_rent2

# join the top venues and cluster labels onto the rent data (df_rent2) for each neighborhood
neigh_merged = neigh_merged.join(neigh_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

neigh_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bayview Village,1072.0,43.7771,-79.3796,,1.0,Trail,Construction & Landscaping,Flower Shop,Golf Driving Range,Dog Run,Discount Store,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant
2,Bedford Park-Nortown,947.0,43.7307,-79.4245,,1.0,Coffee Shop,Sandwich Place,Fast Food Restaurant,Liquor Store,Skating Rink,Park,Juice Bar,Grocery Store,Deli / Bodega,Department Store
3,Beechborough-Greenbrook,875.0,43.6931,-79.4783,,1.0,Convenience Store,Restaurant,Turkish Restaurant,Sandwich Place,Yoga Studio,Diner,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner
4,Bendale,1032.0,43.7596,-79.2574,,1.0,Park,Tennis Court,Dog Run,Discount Store,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner
5,Birchcliffe-Cliffside,943.0,43.6947,-79.2646,,1.0,General Entertainment,Café,Skating Rink,College Stadium,Cosmetics Shop,Discount Store,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant


#### Visualize the cluster via Folium

In [86]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
# Cluster Labels is a float column after the join, presumably because neighborhoods
# without venues get NaN; drop those rows and cast to int so the labels can index the color list
neigh_plot = neigh_merged.dropna(subset=['Cluster Labels'])
for lat, lon, poi, cluster in zip(neigh_plot['Latitude'], neigh_plot['Longitude'], neigh_plot['Neighborhood'], neigh_plot['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Download my work into my machine just in case

In [87]:
from IPython.display import HTML
import base64 

def create_download_link(df, title = "Download CSV file", filename = "neigh_merged.csv"):  
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

create_download_link(neigh_merged)

### Assess each cluster

#### Cluster Label = 0

In [96]:
n0 = neigh_merged.loc[neigh_merged['Cluster Labels'] == 0]
# sort_values returns a new DataFrame, which avoids the SettingWithCopyWarning raised by inplace=True on a slice
n0 = n0.sort_values(['Bachelor'], ascending = False, axis = 0)
n0


Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
78,Rexdale-Kipling,743.0,43.7243,-79.5672,,0.0,Flower Shop,Yoga Studio,Farmers Market,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run


#### Cluster Label = 1

In [98]:
# Sort a new DataFrame instead of sorting the slice in place, which avoids the SettingWithCopyWarning
n1 = neigh_merged.loc[neigh_merged['Cluster Labels'] == 1].sort_values('Bachelor', ascending=True)
n1


Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
55,Maple Leaf,724.0,43.7146,-79.4798,,1.0,Convenience Store,Gym / Fitness Center,Locksmith,Trail,Dry Cleaner,Yoga Studio,Dog Run,Falafel Restaurant,Event Space,Electronics Store
50,Lambton Baby Point,759.0,43.6593,-79.4975,,1.0,Playground,Park,Discount Store,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run
35,East Gwillimbury,759.0,44.1054,-79.442,,1.0,Temple,Yoga Studio,Farmers Market,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run
67,New Toronto,805.0,43.6014,-79.5092,,1.0,Park,Coffee Shop,Supermarket,Pub,Breakfast Spot,Restaurant,Indian Restaurant,Italian Restaurant,Mexican Restaurant,Café
102,Woodbine Corridor,810.0,43.6779,-79.3149,,1.0,Dog Run,Baseball Field,Café,Yoga Studio,Farm,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant
103,Greenwood-Coxwell,810.0,43.672,-79.3234,,1.0,Indian Restaurant,Café,Brewery,Egyptian Restaurant,Bar,Asian Restaurant,Pakistani Restaurant,Dry Cleaner,Falafel Restaurant,Event Space
82,Riverdale,853.0,43.6732,-79.3412,,1.0,Toy / Game Store,Fast Food Restaurant,Clothing Store,Caribbean Restaurant,Café,Sandwich Place,Dog Run,Diner,Electronics Store,Egyptian Restaurant
25,Cooksville,869.0,43.58,-79.6161,,1.0,Korean Restaurant,Café,Discount Store,Burrito Place,Pakistani Restaurant,Bank,Caribbean Restaurant,Middle Eastern Restaurant,Indian Restaurant,Dog Run
3,Beechborough-Greenbrook,875.0,43.6931,-79.4783,,1.0,Convenience Store,Restaurant,Turkish Restaurant,Sandwich Place,Yoga Studio,Diner,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner
26,Crescent Town,878.0,43.6948,-79.2953,,1.0,Park,Convenience Store,Sandwich Place,Theater,Discount Store,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner


#### Cluster Label = 2

In [99]:
# Sort a new DataFrame instead of sorting the slice in place, which avoids the SettingWithCopyWarning
n2 = neigh_merged.loc[neigh_merged['Cluster Labels'] == 2].sort_values('Bachelor', ascending=True)
n2


Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
49,Kennedy Park,719.0,43.7259,-79.2623,,2.0,Discount Store,Department Store,Coffee Shop,Convenience Store,Dry Cleaner,Farm,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant
54,Long Branch,728.0,43.5935,-79.5327,,2.0,Pizza Place,Bar,Wings Joint,Coffee Shop,Pharmacy,Grocery Store,Greek Restaurant,Italian Restaurant,Beer Store,Café
36,Newmarket,759.0,43.6895,-79.3061,,2.0,Middle Eastern Restaurant,Pizza Place,Coffee Shop,Grocery Store,Pharmacy,Pet Store,Breakfast Spot,Sandwich Place,Mexican Restaurant,Fast Food Restaurant
56,Milton,767.0,43.6962,-79.3322,,2.0,Coffee Shop,Ice Cream Shop,Pizza Place,Liquor Store,Thai Restaurant,Italian Restaurant,Pastry Shop,Sandwich Place,Restaurant,Pub
57,Halton Hills,767.0,43.6486,-79.4191,,2.0,Bar,Yoga Studio,Art Gallery,Brewery,Japanese Restaurant,Asian Restaurant,Pizza Place,Wine Bar,Vietnamese Restaurant,Grocery Store
27,Danforth Village-East York,805.0,43.6897,-79.3307,,2.0,Coffee Shop,Sandwich Place,Farmers Market,Italian Restaurant,Athletics & Sports,Convenience Store,Deli / Bodega,Department Store,Cuban Restaurant,Cosmetics Shop
11,Brampton (West),818.0,43.6831,-79.5602,,2.0,Coffee Shop,Pizza Place,Pharmacy,Liquor Store,Sandwich Place,Yoga Studio,Discount Store,Electronics Store,Egyptian Restaurant,Eastern European Restaurant
99,Weston,826.0,43.7044,-79.5094,,2.0,Coffee Shop,Convenience Store,Sandwich Place,Asian Restaurant,Mexican Restaurant,Bus Stop,Dog Run,Falafel Restaurant,Event Space,Electronics Store
72,Old East York,854.0,43.6962,-79.3329,,2.0,Coffee Shop,Ice Cream Shop,Pizza Place,Liquor Store,Thai Restaurant,Italian Restaurant,Pastry Shop,Sandwich Place,Restaurant,Pub
97,West Humber-Clairville,858.0,43.7146,-79.5925,,2.0,Hotel,Coffee Shop,Rental Car Location,Swiss Restaurant,Mediterranean Restaurant,Paper / Office Supplies Store,Storage Facility,Dry Cleaner,Falafel Restaurant,Event Space


#### Cluster Label = 3

In [100]:
# Sort a new DataFrame instead of sorting the slice in place, which avoids the SettingWithCopyWarning
n3 = neigh_merged.loc[neigh_merged['Cluster Labels'] == 3].sort_values('Bachelor', ascending=True)
n3


Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
95,Victoria Village,891.0,43.7315,-79.3143,,3.0,Park,Yoga Studio,Home Service,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run
88,St. Andrew-Windfields,905.0,43.7572,-79.3819,,3.0,Park,Yoga Studio,Home Service,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run
44,Ionview,913.0,43.7363,-79.2732,,3.0,Park,Yoga Studio,Home Service,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run


#### Cluster Label = 4

In [101]:
# Sort a new DataFrame instead of sorting the slice in place, which avoids the SettingWithCopyWarning
n4 = neigh_merged.loc[neigh_merged['Cluster Labels'] == 4].sort_values('Bachelor', ascending=True)
n4


Unnamed: 0,Neighborhood,Bachelor,Latitude,Longitude,Rent Category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Clanton Park,867.0,43.7436,-79.4432,,4.0,IT Services,Yoga Studio,Discount Store,Falafel Restaurant,Event Space,Electronics Store,Egyptian Restaurant,Eastern European Restaurant,Dry Cleaner,Dog Run


## Results and Discussion

From this analysis, I conclude that, out of 105 neighborhoods, the majority fall into two clusters: cluster label = 1 and cluster label = 2 (44 neighborhoods). This implies that most Toronto neighborhoods are very similar in terms of the types of venues around them.
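
The cluster sizes behind a statement like this can be read directly from the label column. The following is a minimal sketch on a toy column, so the counts it prints are illustrative and are not the notebook's results. It assumes the labels live in a column named `Cluster Labels`, as in the cells above.

```python
import pandas as pd

# Toy labels; in the notebook this column would come from neigh_merged.
toy = pd.DataFrame({"Cluster Labels": [1, 2, 2, 1, 0, 2, 1, 3]})

# Number of neighborhoods per cluster, ordered by cluster label.
cluster_sizes = toy["Cluster Labels"].value_counts().sort_index()

# Fraction of all neighborhoods that each cluster accounts for.
share = cluster_sizes / cluster_sizes.sum()

print(pd.DataFrame({"neighborhoods": cluster_sizes, "share": share}))
```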

Below I summarize the main categories of venues for each cluster:
* 0: Mostly food
* 1: Mostly food with some recreation
* 2: Mostly food with some retail
* 3: Fitness-oriented
* 4: Mix of all
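
One way to back a summary like this with numbers is to tally how often each venue category appears among the "Most Common Venue" columns of each cluster. The sketch below does that on a small made-up frame. The data are hypothetical, and only the column naming pattern is taken from the tables above.

```python
import pandas as pd

# Toy stand-in for neigh_merged with three of the ten venue-rank columns.
toy = pd.DataFrame({
    "Cluster Labels": [1, 1, 2, 2],
    "1st Most Common Venue": ["Park", "Park", "Coffee Shop", "Coffee Shop"],
    "2nd Most Common Venue": ["Gym", "Coffee Shop", "Pizza Place", "Pizza Place"],
    "3rd Most Common Venue": ["Pub", "Gym", "Pharmacy", "Coffee Shop"],
})

venue_cols = [c for c in toy.columns if c.endswith("Most Common Venue")]

# Reshape to one row per (neighborhood, rank), then count categories per cluster.
long = toy.melt(id_vars="Cluster Labels", value_vars=venue_cols, value_name="Category")
counts = (
    long.groupby(["Cluster Labels", "Category"])
        .size()
        .rename("n")
        .reset_index()
)

# Keep the two most frequent categories in each cluster.
top = (
    counts.sort_values(["Cluster Labels", "n"], ascending=[True, False])
          .groupby("Cluster Labels")
          .head(2)
)
print(top)
```

Grouping the raw categories into broader themes such as food, retail, or recreation would still be a manual step.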

Because I used the same algorithm as in Week 3, I did not calculate the distance from each venue to each neighborhood. The Foursquare API also did not return TTC subway stations in the output. This is probably because they are not rated by users and hence did not appear in the top 10 venues. Future work could use the Foursquare API to determine the 10 venues closest to each neighborhood, which might take access to transit into account.
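
The distance step could be sketched as follows, assuming each venue comes with a latitude and longitude. The haversine formula gives the great-circle distance, and the helper names `haversine_km` and `closest_venues` and the sample venues are hypothetical. Only the neighborhood coordinates are taken from the Riverdale row in the Cluster 1 table.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest_venues(neigh_lat, neigh_lon, venues, k=10):
    """Return the k venues nearest to a neighborhood centre as (distance_km, name) pairs."""
    ranked = sorted(
        (haversine_km(neigh_lat, neigh_lon, v["lat"], v["lon"]), v["name"])
        for v in venues
    )
    return ranked[:k]

# Hypothetical venues near the Riverdale coordinates (43.6732, -79.3412).
venues = [
    {"name": "Venue A", "lat": 43.6740, "lon": -79.3412},
    {"name": "Venue B", "lat": 43.6900, "lon": -79.3412},
    {"name": "Venue C", "lat": 43.6732, "lon": -79.3300},
]

nearest = closest_venues(43.6732, -79.3412, venues, k=2)
for dist, name in nearest:
    print(f"{name}: {dist:.2f} km")
```

Ranking by distance rather than by venue frequency would let transit stops count toward a neighborhood even when they carry few user ratings.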

## Conclusion

Since TTC subway stations are not included in this study, the decision is instead based on other infrastructure, the diversity of restaurants, and how close each neighborhood is to downtown Toronto.

I recommend that millennials rent in neighborhoods that are close to downtown Toronto and offer a diverse selection of venues, so that they don't necessarily need to commute downtown for entertainment. As a millennial myself, I am more inclined to rent in neighborhoods in Cluster 1, such as Yonge-Eglinton, Bayview, King, Cabbage Town, Regent Park, The Beaches, and Greenwood-Coxwell. Nevertheless, I'd still recommend that millennials consider neighborhoods in Cluster 2. They are Little Portugal, Danforth Village, Kennedy Park, Downsview, King West, and Eglinton East.