# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The fitness industry in Canada is growing by more than 5% year on year. As a fitness enthusiast, I want to work on a problem faced by most people who take fitness seriously. Exercise plays a vital role in getting fit, but diet plays an even more important one.

For exercise, people join a gym or fitness center, or simply go to a park. Diet is harder to manage on our own. Hence the idea of launching a chain of EAT-TO-GET-FIT outlets in Canada, which will take care of complete meals for anyone who enrolls in the service. The unique selling point is that the outlets will be located very close to gyms and fitness centers, so that enrolled members can eat at the outlet or take food home after their workout.

Many people today care deeply about fitness and join a gym or fitness center to sustain a daily exercise habit, while EAT-TO-GET-FIT outlets will take care of their diet. I believe this idea has great potential, and to test it in Canada I would like to start with one outlet in Toronto. This project aims to find the best location for that outlet.

In this project, we will use data science techniques to find the most promising location in Toronto. We plan to place the outlet near gyms or fitness centers because people need food to re-energize after a workout.

The target audience for this project is an entrepreneur who wants to find a location to open an EAT-TO-GET-FIT outlet (restaurant).


## Data <a name="data"></a>

#### Data Gathering Phase

Data Requirements:
* List of neighborhoods in Toronto, Canada
* Latitude and Longitude of these neighborhoods
* Venue Category data to understand the type of venues in each neighborhood

Data Fetching: 
* Using web scraping to gather the list of neighborhoods in Toronto
* Installing the Geocoder package to fetch the latitudes and longitudes of the neighborhoods
* Calling the Foursquare API to get the details of various types of venues in these neighborhoods

Of all the features collected, we will use Venue Category, filtered to Gym, to segment the data. This data will be used for data analysis (using techniques like clustering) to come up with the most suitable location for opening the first outlet of EAT-TO-GET-FIT.

#### Installing the important libraries in Jupyter

In [38]:
!conda install -c conda-forge folium=0.5.0 --yes
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge lxml --yes


#### Importing the relevant libraries

In [3]:
#importing necessary libraries
import requests
import pandas as pd
import lxml
import lxml.html as lh
import numpy as np
import folium
from sklearn.cluster import KMeans
print('Libraries Imported')

#### Scraping Boroughs and Neighborhoods in Toronto, Canada from the Wikipedia Page

In [4]:
url  = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)
if page.status_code == 200:
    print('Page download successful')
else:
    print('Page download error. Error code: {}'.format(page.status_code))

Page download successful


In [5]:
df_html = pd.read_html(url, header=0, na_values = ['Not assigned'])[0]
df_html.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,
1,M2A,,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Data Wrangling to eliminate NaN and Empty cells in the dataframe

In [6]:
#Drop the rows in which the Borough is empty
df_html.dropna(subset=['Borough'], inplace=True)

In [7]:
#Check Neighborhood is empty but Borough exists
n_empty_neighborhood = df_html[df_html['Neighbourhood'].isna()].shape[0]
print('Number of rows on which Neighborhood column is empty: {}'.format(n_empty_neighborhood))

Number of rows on which Neighborhood column is empty: 1


In [8]:
#Show which neighborhood is empty but Borough exists
df_html[df_html['Neighbourhood'].isna()]

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,


In [9]:
#Replace empty Neighborhood with Borough name and check again
df_html['Neighbourhood'] = df_html['Neighbourhood'].fillna(df_html['Borough'])
n_empty_neighborhood = df_html[df_html['Neighbourhood'].isna()].shape[0]
print('Number of rows on which Neighborhood column is empty: {}'.format(n_empty_neighborhood))

Number of rows on which Neighborhood column is empty: 0


In [10]:
#Confirm that Queen's Park Neighborhood is not empty now:
df_html[df_html['Borough']=="Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park


In [11]:
#Group by Postcode / Borough
df_postcodes = df_html.groupby(['Postcode','Borough']).Neighbourhood.agg([('Neighbourhood', ', '.join)])
df_postcodes.reset_index(inplace=True)
df_postcodes.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
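
To make the grouping step concrete, here is a minimal sketch on a hand-made three-row frame (the rows are illustrative, not scraped data). It uses pandas named aggregation, which is an alternative spelling of the same join and is an assumption of this sketch, not the notebook's own call.

```python
import pandas as pd

# Toy rows mimicking the scraped table: one postcode can list several neighbourhoods.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Collapse to one row per (Postcode, Borough), joining the names with ", ".
toy_grouped = (
    toy.groupby(["Postcode", "Borough"], as_index=False)
       .agg(Neighbourhood=("Neighbourhood", ", ".join))
)
print(toy_grouped)
```

Rows keep their original order inside each group, so the joined string lists the neighbourhoods in the order they appeared in the source table.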


In [12]:
#Drop the rows in which the Borough is empty
df_html.dropna(subset=['Borough'], inplace=True)

In [13]:
#Check Neighborhood is empty but Borough exists
n_empty_neighborhood = df_html[df_html['Neighbourhood'].isna()].shape[0]
print('Number of rows on which Neighborhood column is empty: {}'.format(n_empty_neighborhood))

Number of rows on which Neighborhood column is empty: 0


In [14]:
print('The shape of the dataset is:',df_postcodes.shape)

The shape of the dataset is: (103, 3)


#### Dataset with the List of Neighborhoods is ready to be saved in .csv format

In [4]:
#exporting the dataframe as .csv file
df_postcodes.to_csv('Toronto_Postcodes.csv')


#### Fetching the Longitude and Latitude

In [18]:
#Read CSV file from link and load into dataframe
url_csv = 'http://cocl.us/Geospatial_data'
df_coordinates = pd.read_csv(url_csv)
df_coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
#use the previously cleaned data
df_neighborhoods = pd.read_csv('Toronto_Postcodes.csv',index_col=[0])
df_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [20]:
# Make sure both dataframes use the same name for the postal code column
df_coordinates.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df_neighborhoods.rename(columns={'Postcode': 'PostalCode'}, inplace=True)

#### Merging both datasets

In [21]:
df_neighborhoods_coordinates = pd.merge(df_neighborhoods, df_coordinates, on='PostalCode')
df_neighborhoods_coordinates.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
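
The default inner merge silently drops any postal code that has no coordinates. A small sketch of how one might surface such rows, using toy frames (the `M9Z` row and the `Hypothetical` borough are made up for illustration) and the `how`, `validate`, and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],          # M9Z is a made-up code
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Keep every neighbourhood row, check that keys are unique on both sides,
# and record where each row came from in the "_merge" column.
merged = pd.merge(left, right, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge used in the notebook and this left merge return the same rows.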


In [22]:
# Check coordinates for a couple of neighborhoods
df_neighborhoods_coordinates[(df_neighborhoods_coordinates['PostalCode']=='M5G') |
                             (df_neighborhoods_coordinates['PostalCode']=='M2H') ]

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
17,M2H,North York,Hillcrest Village,43.803762,-79.363452
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


In [23]:
#Export to .csv
df_neighborhoods_coordinates.to_csv('Toronto_Postcodes_2.csv')

In [30]:
# Read .csv file from above
df = pd.read_csv('Toronto_Postcodes_2.csv', index_col=0)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [31]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


In [32]:
df.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)

In [33]:
#count neighborhoods per Borough
df.groupby('Borough').count()['Neighborhood']

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
East York            5
Etobicoke           11
Mississauga          1
North York          24
Queen's Park         1
Scarborough         17
West Toronto         6
York                 5
Name: Neighborhood, dtype: int64

In [86]:
df_toronto = df
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [35]:
#Check the number of neighborhoods
print(df_toronto.groupby('Borough').count()['Neighborhood'])

Borough
Central Toronto      9
Downtown Toronto    19
East Toronto         5
East York            5
Etobicoke           11
Mississauga          1
North York          24
Queen's Park         1
Scarborough         17
West Toronto         6
York                 5
Name: Neighborhood, dtype: int64


In [36]:
#Create list with the Boroughs (to be used later)
boroughs = df_toronto['Borough'].unique().tolist()

In [37]:
#Obtain the coordinates from the dataset itself, just averaging Latitude/Longitude of the current dataset 
lat_toronto = df_toronto['Latitude'].mean()
lon_toronto = df_toronto['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(lat_toronto, lon_toronto))

The geographical coordinates of Toronto are 43.70460773398059, -79.39715291165048


In [38]:
borough_color = {}
for borough in boroughs:
    borough_color[borough]= '#%02X%02X%02X' % tuple(np.random.choice(range(256), size=3)) #Random color

#### Visualizing the Toronto Neighborhood Merged Dataset

In [39]:
map_toronto = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], 
                                           df_toronto['Longitude'],
                                           df_toronto['Borough'], 
                                           df_toronto['Neighborhood']):
    label_text = borough + ' - ' + neighborhood
    label = folium.Popup(label_text, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_color[borough],
        fill=True,
        fill_color=borough_color[borough],
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

#### Getting Venues Data using Foursquare

In [40]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; keep real credentials out of the notebook)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

In [41]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
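
The function above indexes straight into the JSON (`["response"]['groups'][0]['items']` and `['categories'][0]`), so a response without groups or a venue without categories would raise an error and stop the loop. Below is a hedged sketch of a more defensive parsing step. The helper name `extract_venue_rows` and the sample payload are hypothetical; the payload is hand-written to mirror only the keys the function above reads.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into tuples; return [] if fields are missing."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to an empty category when the list is missing or empty.
        categories = venue.get("categories") or [{}]
        rows.append((name, lat, lng,
                     venue.get("name"),
                     location.get("lat"),
                     location.get("lng"),
                     categories[0].get("name")))
    return rows


# Hand-written payload (not real API output).
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Gym",
                   "location": {"lat": 43.70, "lng": -79.40},
                   "categories": [{"name": "Gym"}]}},
        {"venue": {"name": "Uncategorised Spot",
                   "location": {"lat": 43.71, "lng": -79.41},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(sample_payload, "Toy Neighborhood", 43.7, -79.4)
empty = extract_venue_rows({}, "Toy Neighborhood", 43.7, -79.4)
print(rows)
print(empty)
```

Venues without a category come back with `None` in the last position, which can be dropped or relabelled before one-hot encoding.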

In [42]:
#Get venues for all neighborhoods in our dataset
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                latitudes=df_toronto['Latitude'],
                                longitudes=df_toronto['Longitude'])

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [45]:
#Check size of resulting dataframe
toronto_venues.shape

(2213, 7)

In [46]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge, Malvern",43.806686,-79.194353,Interprovincial Group,43.80563,-79.200378,Print Shop
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


#### As we would like to have our outlet near gyms, fitness centers, and other fitness-related venues, we change the venue category of all these types to Gym to achieve consistency in the data

In [47]:
toronto_venues.replace("Gym / Fitness Center", "Gym", inplace=True)
toronto_venues.replace("College Gym", "Gym", inplace=True)
toronto_venues.replace("Climbing Gym", "Gym", inplace=True)

In [48]:
toronto_venues['Venue Category'].str.count("Gym").sum()

50
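
A small sketch of the same relabelling on a toy column, assuming the three aliases used above. It also shows that `str.count("Gym")` counts every category whose name contains the substring, while an exact comparison counts only rows labelled `Gym`. The `Boxing Gym` row is a hypothetical category added to show the difference.

```python
import pandas as pd

venues = pd.DataFrame({"Venue Category": [
    "Gym / Fitness Center", "College Gym", "Climbing Gym",
    "Gym", "Pizza Place", "Boxing Gym",   # "Boxing Gym" is hypothetical
]})

# Map the three aliases onto a single label in one call.
gym_aliases = ["Gym / Fitness Center", "College Gym", "Climbing Gym"]
venues["Venue Category"] = venues["Venue Category"].replace(gym_aliases, "Gym")

# Substring count vs. exact match.
substring_count = int(venues["Venue Category"].str.count("Gym").sum())
exact_count = int((venues["Venue Category"] == "Gym").sum())
print(substring_count, exact_count)
```

When the two numbers agree, as they do in the notebook (50 here and 50 in the one-hot check further down), no other category containing "Gym" is present.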

In [49]:
#Check size of resulting dataframe
toronto_venues.shape

(2213, 7)

In [54]:
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge, Malvern",43.806686,-79.194353,Interprovincial Group,43.805630,-79.200378,Print Shop
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
...,...,...,...,...,...,...,...
2208,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,McDonald's,43.741757,-79.584230,Fast Food Restaurant
2209,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,Rogers Plus,43.741312,-79.585263,Video Store
2210,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437,LCBO,43.741508,-79.584501,Liquor Store
2211,Northwest,43.706748,-79.594054,Economy Rent A Car,43.708471,-79.589943,Rental Car Location


In [52]:
#Number of venues per neighborhood
toronto_venues.groupby('Neighborhood').count()

In [55]:
#Number of unique venue categories
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 267 unique categories.


In [458]:
#print out the list of categories
toronto_venues['Venue Category'].unique()[:100]

array(['Fast Food Restaurant', 'Print Shop', 'Bar', 'Pizza Place',
       'Electronics Store', 'Mexican Restaurant', 'Rental Car Location',
       'Medical Center', 'Intersection', 'Breakfast Spot', 'Coffee Shop',
       'Korean Restaurant', 'Hakka Restaurant', 'Caribbean Restaurant',
       'Athletics & Sports', 'Thai Restaurant', 'Bank', 'Gas Station',
       'Bakery', 'Lounge', 'Fried Chicken Joint', 'Playground',
       'Department Store', 'Discount Store', 'Ice Cream Shop', 'Bus Line',
       'Metro Station', 'Bus Station', 'Park', 'Soccer Field', 'Motel',
       'American Restaurant', 'Café', 'General Entertainment',
       'Skating Rink', 'College Stadium', 'Indian Restaurant',
       'Chinese Restaurant', 'Pet Store', 'Vietnamese Restaurant',
       'Light Rail Station', 'Brewery', 'Thrift / Vintage Store',
       'Sandwich Place', 'Middle Eastern Restaurant', 'Shopping Mall',
       'Smoke Shop', 'Auto Garage', 'Latin American Restaurant',
       'Italian Restaurant', 'Noodle 

In [459]:
# check if the results still contain "Gym / Fitness Center", "College Gym" or "Climbing Gym"
gym_aliases = ["Gym / Fitness Center", "College Gym", "Climbing Gym"]
any(alias in toronto_venues['Venue Category'].unique() for alias in gym_aliases)

False

In [460]:
# check if the results contain "Gym"
"Gym" in toronto_venues['Venue Category'].unique()

True
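
A note on testing several category names at once: in Python, `"a" or "b" or "c"` evaluates to its first truthy operand, so wrapping the names in parentheses and applying `in` tests only the first name. The sketch below uses toy lists to show the difference from an explicit `any(...)` check.

```python
categories = ["Gym", "College Gym", "Pizza Place"]   # toy list of categories
targets = ["Gym / Fitness Center", "College Gym", "Climbing Gym"]

# `or` returns "Gym / Fitness Center", so only that one string is looked up.
first_only = ("Gym / Fitness Center" or "College Gym" or "Climbing Gym") in categories

# Test every target name individually.
any_present = any(target in categories for target in targets)

print(first_only, any_present)
```

Here the first form reports `False` even though `College Gym` is in the list, while the `any(...)` form reports `True`.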

#### Analyze Each Neighborhood

In [56]:
# one hot encoding
to_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
to_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [to_onehot.columns[-1]] + list(to_onehot.columns[:-1])
to_onehot = to_onehot[fixed_columns]

print(to_onehot.shape)
to_onehot.head()

(2213, 268)


Unnamed: 0,Neighborhoods,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge, Malvern",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Highland Creek, Rouge Hill, Port Union",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
len(to_onehot[to_onehot["Gym"] > 0])

50

#### Grouping the rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [61]:
to_grouped = to_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(to_grouped.shape)
to_grouped

(100, 268)


Unnamed: 0,Neighborhoods,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.02,0.0,0.000000,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.090909,0.0,0.0,0.00,0.0,0.0,0.00,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
96,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
97,"Woodbine Gardens, Parkview Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.000000,0.0,0.0,0.00,0.0,0.0,0.00,0.0
98,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.00,0.0,0.111111,0.0,0.0,0.00,0.0,0.0,0.00,0.0


In [464]:
len(to_grouped[to_grouped["Gym"] > 0])

31
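
Because each venue row is one-hot encoded, the per-neighborhood mean of the `Gym` column is the share of that neighborhood's returned venues that are gyms, not a count. A toy sketch (neighborhoods `A` and `B` are made up):

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Gym", "Coffee Shop", "Park", "Bar", "Park", "Bar"],
})

# One column per category, 1 where the venue has that category.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Mean of a 0/1 column = fraction of venues in that category.
share = onehot.groupby("Neighborhood").mean()
print(share["Gym"])
```

Neighborhood `A` has one gym among four venues, so its share is 0.25. By the same arithmetic, a share of 0.2 can come from a single gym among five venues, which is worth keeping in mind when comparing neighborhoods with very different venue totals.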

#### Creating a new dataframe to record the mean occurrence of Gym in each neighborhood

In [62]:
to_fitness = to_grouped[["Neighborhoods","Gym"]]

In [63]:
to_fitness

Unnamed: 0,Neighborhoods,Gym
0,"Adelaide, King, Richmond",0.030000
1,Agincourt,0.000000
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.000000
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.000000
4,"Alderwood, Long Branch",0.111111
...,...,...
95,Willowdale West,0.000000
96,Woburn,0.000000
97,"Woodbine Gardens, Parkview Hill",0.076923
98,Woodbine Heights,0.000000


## Methodology <a name="methodology"></a>

In this project, we focus on detecting areas of Toronto with a high density of gyms and fitness centers.

In the first step, we collected the required data: the location of every neighborhood in Toronto, Canada, and the categories of venues in each neighborhood (according to Foursquare categorization).

The second step is to clean the data. Toronto has several types of gyms and fitness centers, so we wrangle the venue categories into a single consistent label. We then one-hot encode the data set to convert the categorical data (venue category) to numerical form, which lets us calculate the mean occurrence of gyms/fitness centers in every neighborhood.

In the third and final step, we run a cluster analysis on that occurrence measure to identify the most suitable areas for opening the first outlet of *EAT-TO-GET-FIT*. We present a map of all the locations and create clusters (using k-means clustering) to identify general zones / neighborhoods which should be a starting point for final 'street level' exploration and the search for an optimal venue location.

### Cluster Neighborhoods

#### Running k-means to cluster the neighborhoods in Toronto into 3 clusters

In [87]:
# set number of clusters
toclusters = 3

to_clustering = to_fitness.drop(columns='Neighborhoods')

# run k-means clustering
kmeans = KMeans(n_clusters=toclusters, random_state=0).fit(to_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1], dtype=int32)
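
With a single feature, k-means amounts to cutting the range of gym shares into intervals around the cluster centers. The sketch below is a plain-Python illustration of that idea, not scikit-learn's implementation: the function `kmeans_1d`, the evenly spaced starting centers, and the ten toy values (picked from shares that appear in this notebook) are all assumptions made for the example.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's iterations on one-dimensional data with given starting centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:   # nothing moved, so the assignment is stable
            break
        centers = new_centers
    return labels, centers


gym_share = [0.0, 0.0, 0.0, 0.03, 0.066667, 0.090909, 0.111111, 0.125, 0.2, 0.2]
labels, centers = kmeans_1d(gym_share, centers=[0.0, 0.1, 0.2])
print(labels)
print(centers)
```

In this sketch the labels follow the order of the starting centers (low, middle, high). scikit-learn numbers its clusters arbitrarily, so the meaning of each label has to be read from the values in the cluster or from `kmeans.cluster_centers_`.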

In [88]:
# create a new dataframe that includes the cluster as well as the neighborhoods.
to_merged = to_fitness.copy()

# add clustering labels
to_merged["Cluster Labels"] = kmeans.labels_

In [89]:
to_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)

to_merged.head()

Unnamed: 0,Neighborhood,Gym,Cluster Labels
0,"Adelaide, King, Richmond",0.03,1
1,Agincourt,0.0,1
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,1
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,1
4,"Alderwood, Long Branch",0.111111,0


#### Joining this dataset with the dataset for neighborhoods (including their latitude and longitude) 

In [91]:
# merge to_merged with df_toronto to add latitude/longitude for each neighborhood
df_toronto = df_toronto.join(to_merged.set_index('Neighborhood'), on="Neighborhood")

print("The shape of the new dataset is:", df_toronto.shape)
df_toronto.head()

The shape of the new dataset is: (103, 7)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Gym,Cluster Labels
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,0.0,1.0
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0.0,1.0
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0.0,1.0
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0.0,1.0
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0.0,1.0


In [None]:
#Replace any NaN with 0 for visualisation and analysis purposes
#Note: this also turns a missing cluster label into 0, i.e. into Cluster 0
df_toronto.replace(np.nan , 0, inplace=True)
df_toronto.head()
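
Replacing every NaN with 0 also changes missing cluster labels to 0, which is a real cluster id. One possible alternative, sketched on a toy frame, is to fill the two columns separately and use a sentinel for "no cluster". The value `-1` is an assumption of this sketch; the map code would then need to skip or style that value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],       # made-up names
    "Gym": [0.2, np.nan, 0.0],
    "Cluster Labels": [2.0, np.nan, 1.0],  # B had no venues, so no label
})

# Blanket replacement: B ends up looking like a member of cluster 0.
blanket = toy.replace(np.nan, 0)

# Column-wise fill: gym share 0, cluster label -1 meaning "not clustered".
filled = toy.fillna({"Gym": 0.0, "Cluster Labels": -1}).astype({"Cluster Labels": int})

print(blanket["Cluster Labels"].tolist())
print(filled["Cluster Labels"].tolist())
```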

#### Visualizing the clusters

In [94]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [95]:
# create map
map_clusters = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(toclusters)
ys = [i + x + (i*x)**2 for i in range(toclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood'], df_toronto['Cluster Labels']):
    if (np.isnan(cluster) or (cluster != cluster) or np.isnan(lat) or np.isnan(lon)):
        continue
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
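
The markers are coloured with `rainbow[int(cluster)-1]`, so the label is shifted by one before indexing and label 0 wraps around to the last entry through Python's negative indexing. A tiny sketch with placeholder names instead of hex colours makes the mapping explicit:

```python
# Placeholder names standing in for the three hex colours in `rainbow`.
palette = ["first", "second", "third"]

# Same indexing as the map code: cluster label minus one.
assigned = {cluster: palette[int(cluster) - 1] for cluster in range(3)}
print(assigned)
```

Cluster 0 therefore takes the last colour of the palette, cluster 1 the first, and cluster 2 the middle one.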

In [99]:
# save the map as HTML file
map_clusters.save('map_clusters_complete.html')

## Analysis <a name="analysis"></a>

#### CLUSTER 0

In [96]:
to_merged.loc[to_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Gym,Cluster Labels
4,"Alderwood, Long Branch",0.111111,0
11,"Brockton, Exhibition Place, Parkdale Village",0.090909,0
12,Business Reply Mail Processing Centre 969 Eastern,0.066667,0
17,Canada Post Gateway Processing Centre,0.090909,0
26,"Commerce Court, Victoria Hotel",0.04,0
27,Davisville,0.055556,0
28,Davisville North,0.125,0
34,"Dovercourt Village, Dufferin",0.066667,0
43,"First Canadian Place, Underground city",0.04,0
44,"Flemingdon Park, Don Mills South",0.095238,0


#### CLUSTER 1

In [97]:
to_merged.loc[to_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Gym,Cluster Labels
0,"Adelaide, King, Richmond",0.03,1
1,Agincourt,0.00,1
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.00,1
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.00,1
5,"Bathurst Manor, Downsview North, Wilson Heights",0.00,1
...,...,...,...
94,Willowdale South,0.00,1
95,Willowdale West,0.00,1
96,Woburn,0.00,1
98,Woodbine Heights,0.00,1


#### CLUSTER 2

In [98]:
to_merged.loc[to_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Gym,Cluster Labels
32,Don Mills North,0.2,2
36,Downsview Northwest,0.2,2
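
To compare the three clusters side by side, one option is to summarise the `Gym` share per cluster label. The sketch below uses seven rows copied from the tables above rather than the full `to_merged` frame, so the counts are for this sample only.

```python
import pandas as pd

sample = pd.DataFrame({
    "Neighborhood": ["Adelaide, King, Richmond", "Agincourt",
                     "Alderwood, Long Branch", "Davisville North",
                     "Commerce Court, Victoria Hotel",
                     "Don Mills North", "Downsview Northwest"],
    "Gym": [0.03, 0.0, 0.111111, 0.125, 0.04, 0.2, 0.2],
    "Cluster Labels": [1, 1, 0, 0, 0, 2, 2],
})

# Count, minimum and maximum gym share per cluster, ordered from low to high.
summary = (sample.groupby("Cluster Labels")["Gym"]
                 .agg(["count", "min", "max"])
                 .sort_values("max"))
print(summary)
```

On the full data the same call would be applied to `to_merged`. Sorting by `max` lists the clusters from the lowest gym share to the highest, which is the order needed to interpret the labels.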


## Results <a name="results"></a>

We used k-means clustering to form 3 clusters:  
* **Cluster 2**: Neighborhoods where gyms make up the largest share of venues (0.2 for both members)
* **Cluster 0**: Neighborhoods with a moderate share of gyms (roughly 0.04 to 0.125 in the rows shown above)
* **Cluster 1**: Neighborhoods with few or no gyms (0 to 0.03 in the rows shown above)

Based on the clusters formed, 2 neighborhoods stand out with the highest share of gyms/fitness centers among their venues: **Don Mills North** and **Downsview Northwest**. Both belong to **Cluster 2** and have **Turquoise** colored markers in the map view.

## Discussion <a name="discussion"></a>

Based on the results of the analysis, I recommend Don Mills North and Downsview Northwest for opening the first outlet of EAT-TO-GET-FIT. The next step would be to incorporate factors such as the size and price of the available space, the distance from the nearest gym/fitness center, and the demographics of the neighborhood in order to recommend a specific location. That step is outside the scope of this project because of its timeline. The key takeaway is that these 2 areas are good starting points for on-the-ground, street-level exploration for the final outlet location.

## Conclusion <a name="conclusion"></a>

In this study, I analyzed the neighborhoods and venue categories in Toronto, Canada. Because I wanted to open the first outlet near fitness centers/gyms, I built a clustering model to identify the most promising areas. Based on the data, Don Mills North and Downsview Northwest are the 2 most suitable areas for opening the outlet. This model should also be useful when we expand the EAT-TO-GET-FIT chain to other parts of Toronto and, later, Canada.