# League of Toronto's Neighbourhoods in Sports Centers

### Introduction

Regular exercise is one of the most effective ways to stay healthy. Recent guidance suggests that at least [75 minutes][1] of vigorous exercise a week helps people lead healthier and happier lives. Even though most of us know that exercise is important, people often skip it for many reasons: being tired, feeling lazy, lacking a workout buddy, or simply making excuses. One reason that is easy to overlook is the lack of suitable Sports Centers in the neighbourhood.

This study examines neighbourhoods on factors such as ease of access, the popularity of sports centres and user preferences, and identifies similarities between neighbourhoods to answer the question:

> #### _"Which Toronto neighbourhoods are similar in providing convenient access to Sports Centers for a healthier life, and which are not?"_

##### Relevant stakeholders / Interested parties

This study is meant to help people who are looking to move to a different neighbourhood or to identify neighbourhoods with convenient access to sports centres. It can also help anyone who is looking to open a Sports Center.

[1]: https://www.health.harvard.edu/topics/exercise-and-fitness

### Data

No single data source can answer this question, so data will be drawn from several sources.

The data sources and how each one contributes are explained below.

#### Data Sources


| Data Source Id | Details   | Platform | Endpoint (if any) | Additional Comments |
|----------------|-----------|----------|-------------------|---------------------|
|   [DS01][ds01]         | List of Postal codes | wikipedia | n/a | Same dataset as in the previous exercise |
|   [DS02][ds02]         | Geospatial data | openstreetmap | n/a | |
|   [DS03][ds03]         | List of Sports Centers | foursquare | explore | Endpoint group is _venues_ |
|   [DS04][ds03]         | Details about the Sports Centers | foursquare | details | Endpoint group is _venues_ |

[ds01]: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
[ds02]: https://nominatim.openstreetmap.org/
[ds03]: https://api.foursquare.com/v2/venues/

#### Details about the Data Sources

##### DS01 - List of Postal codes

A list of postal codes will be extracted from **_Wikipedia_** and will be formalized based on the available postal codes. The source data is expected to be in _html_ format. Data will be parsed to extract the below features. 

* *PostalCode*
* *Borough*
* *Neighborhood*

It will help to identify all the boroughs and their corresponding neighbourhoods in Toronto.
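
The extraction step can be sketched as below. This is a minimal illustration using only the standard library, not the study's actual parser. The sample HTML is a hypothetical stand-in for the Wikipedia page, and dropping rows whose borough is "Not assigned" is an assumption about what cleaning against the available postal codes involves.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_postal_codes(html):
    """Return one dict per table row, keyed by the header cells."""
    parser = TableRowParser()
    parser.feed(html)
    header, *body = parser.rows
    records = [dict(zip(header, row)) for row in body]
    # Assumption: unassigned postal codes are marked "Not assigned" and dropped
    return [r for r in records if r.get("Borough") != "Not assigned"]


# Hypothetical sample mimicking the expected table layout
sample_html = """
<table>
<tr><th>PostalCode</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""
postal_codes = parse_postal_codes(sample_html)
```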

##### DS02 - Geospatial data 

Geospatial data will be extracted using the **_openstreetmap api_** and **_a file_** from Lab 3. Data is expected to be in _JSON_ and _CSV_ formats. Data will be parsed to extract the below features. 

* *PostalCode*
* *Latitude*
* *Longitude*

It will help to identify the Geospatial details for all the identified neighbourhoods in Toronto.
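
Attaching coordinates to the postal codes is essentially a join on *PostalCode*. The sketch below assumes `pandas` is available; the column names of the CSV file and all coordinate values are illustrative placeholders rather than the real data.

```python
import io

import pandas as pd

# Illustrative stand-in for the DS01 output
postal_codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Illustrative stand-in for the coordinates file (placeholder values)
coordinates_csv = io.StringIO(
    "PostalCode,Latitude,Longitude\n"
    "M3A,43.75,-79.33\n"
    "M4A,43.73,-79.32\n"
)
coordinates = pd.read_csv(coordinates_csv)

# A left join keeps every neighbourhood, even one without coordinates
neighbourhoods = postal_codes.merge(coordinates, on="PostalCode", how="left")

# Rows still lacking coordinates would be candidates for an API lookup
missing_coords = neighbourhoods[neighbourhoods["Latitude"].isna()]
```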

##### DS03 - List of Sports Centers

List of Sports Centers will be extracted using **_foursquare api_**. Data is expected to be in _JSON_ format. Data will be parsed to extract the below features. 

* *Venue Id*
* *Venue name*
* *Venue Latitude*
* *Venue Longitude*
* *Venue Distance*
* *Venue Category*

It will help to identify all the Sports Centers around the identified neighbourhoods. 

**_Highlights_**

* Only venue categories 4f4528bc4b90abdf24c9de85 and 4bf58dd8d48988d15e941735 will be extracted. 
* The _explore_ endpoint will be used to extract the data.
* A value of 10000 (metres) will be passed as the radius parameter to sweep each neighbourhood. 
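
A request built from these highlights might look like the sketch below. It only assembles the URL and makes no network call. The `categoryId` parameter name, the version date and the `limit` value are assumptions that should be checked against the Foursquare documentation; `build_explore_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"
SPORTS_CATEGORY_IDS = [
    "4f4528bc4b90abdf24c9de85",
    "4bf58dd8d48988d15e941735",
]


def build_explore_url(lat, lng, radius, client_id, client_secret,
                      version="20200101", limit=50):
    """Assemble an explore request URL for one neighbourhood centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,              # assumed API version date
        "ll": f"{lat},{lng}",      # neighbourhood latitude,longitude
        "radius": radius,          # search radius in metres
        "categoryId": ",".join(SPORTS_CATEGORY_IDS),  # assumed parameter name
        "limit": limit,            # assumed page size
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder coordinates and credentials
url = build_explore_url(43.75, -79.33, 10000, "CLIENT_ID", "CLIENT_SECRET")
```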



##### DS04 - Details about the Sports Centers

Details about each Sports Center will be extracted using **_foursquare api_**. Data is expected to be in _JSON_ format. Data will be parsed to extract the below features. 

* *likes*
* *rating*
* *ratingSignals*
* *reasons_summary*
* *tip_count*
* *venue_id*

It will help to identify user preferences and the popularity of sports centers.

**_Highlights_**

* Conflicting venue categories will be removed after extracting the initial data.
* The _venue_ endpoint will be used to extract the data.
* Features were identified based on initial observations and may need minor updates after extraction.
* *reasons_summary* will be extracted from the summary provided in the original data.
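
Flattening one "venue details" response into the features listed above could be sketched as follows. The nesting used here (`response.venue`, `likes.count`, `tips.count`) is an assumption about the JSON layout, and taking only the first reason's summary is likewise an assumption; `extract_venue_details` is a hypothetical helper.

```python
def extract_venue_details(payload):
    """Flatten one venue-details response into a feature row."""
    venue = payload.get("response", {}).get("venue", {})
    reasons = venue.get("reasons", {}).get("items", [])
    return {
        "venue_id": venue.get("id"),
        "likes": venue.get("likes", {}).get("count", 0),
        "rating": venue.get("rating"),                # None when not rated
        "ratingSignals": venue.get("ratingSignals"),  # None when not rated
        # Assumption: the first reason's summary is the one of interest
        "reasons_summary": reasons[0].get("summary") if reasons else None,
        "tip_count": venue.get("tips", {}).get("count", 0),
    }


# Hypothetical response for a venue without a rating
sample_payload = {
    "response": {
        "venue": {
            "id": "abc123",
            "likes": {"count": 12},
            "tips": {"count": 4},
            "reasons": {"items": [{"summary": "Lots of people like this place"}]},
        }
    }
}
row = extract_venue_details(sample_payload)
```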


###### Identified Features 
| Data Source Id | Features |
|----------------|-----------|
|   DS03         | Venue Distance | 
|   DS03         | Venue Category | 
|   DS04         | likes | 
|   DS04         | rating | 
|   DS04         | ratingSignals | 
|   DS04         | reasons_summary | 
|   DS04         | tip_count |

     
#### Missing Data

All the identified features are analyzed for missing data.

*rating* and *ratingSignals* are missing for almost 65% of the data. Foursquare calculates a rating only when certain inputs are available, which is not the case for most venues. These two features are therefore **omitted** from further processing.


![Missing Data](images/capstone_league_of_neighbourhoods_missing.png)
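
The share of missing values per feature can be computed as in the sketch below, assuming the venue data sits in a `pandas` DataFrame. The sample values and the 50% cut-off (`MAX_MISSING`) are illustrative assumptions, not figures from the study.

```python
import numpy as np
import pandas as pd

# Illustrative sample: most venues have no rating
venues = pd.DataFrame({
    "likes": [12, 0, 3, 7],
    "rating": [7.9, np.nan, np.nan, np.nan],
    "ratingSignals": [25, np.nan, np.nan, np.nan],
    "tip_count": [4, 0, 1, 2],
})

# Fraction of missing values in each column
missing_share = venues.isna().mean()

MAX_MISSING = 0.5  # hypothetical cut-off for dropping a feature
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
reduced = venues.drop(columns=to_drop)
```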

#### Imputation
*reasons_summary* is extracted from the Foursquare "venue details" data (*reasons.items.summary*). The *summary* is expected to be *"Lots of people like this place"* when a venue is liked by many people. 

*reasons_summary* will be imputed using ```SimpleImputer``` from the ```sklearn``` module, and all missing values will be set to blank in the first iteration. 

Once the data is imputed, *reasons_summary* will be encoded using ```OrdinalEncoder``` to convert it to a numerical format: 1 indicates that the venue was liked by a lot of people, and 0 otherwise.  
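
The two steps can be sketched as below, assuming `scikit-learn` is installed. Passing the category order explicitly to ```OrdinalEncoder``` is a choice made here so that blank maps to 0 and the "liked" summary maps to 1; the study may have configured it differently.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

LIKED = "Lots of people like this place"

# Illustrative column: the second venue has no summary
summaries = np.array([[LIKED], [np.nan], [LIKED]], dtype=object)

# Step 1: replace missing summaries with a blank string
imputer = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="")
filled = imputer.fit_transform(summaries)

# Step 2: encode blank as 0 and the "liked" summary as 1
encoder = OrdinalEncoder(categories=[["", LIKED]])
reasons_summary = encoder.fit_transform(filled).ravel()
```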

###### Final list of features
| Data Source Id | Features |
|----------------|-----------|
|   DS03         | Venue Distance | 
|   DS03         | Venue Category | 
|   DS04         | likes | 
|   DS04         | reasons_summary | 
|   DS04         | tip_count |

# Methodology

#### Categories

Each venue can have several categories. All the categories are extracted from the response and placed in the columns below.

* *sub_category_1*
* *sub_category_2*
* *sub_category_3*

In this study, venues are split into two broader categories: **"Athletics & Sports Centers"** and **"Fitness Centers"**.


Each venue is analyzed and mapped to one of these broader categories. If any of the categories listed below appears in the *sub_category_1*, *sub_category_2* or *sub_category_3* column, the broader category is set accordingly. 

Foursquare venue categories are grouped as follows.

###### _Athletics & Sports Centers_

*Athletics & Sports, Tennis Court, Baseball Field, Volleyball Court, Soccer Field, Hockey Arena, Badminton Court, Basketball Court, Basketball Stadium, Golf Course, Golf Driving Range, Skating Rink, Skate Park, Skating Field, Track, Sports Club, Paintball Field*

###### _Fitness Centers_

*Gym, Gym / Fitness Center, Boxing Gym, Climbing Gym, Gymnastics Gym, Cycle Studio, Social Club, Gym Pool, Pool, Martial Arts Dojo, Yoga Studio, Pilates Studio, College Gym*
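
The mapping described above can be sketched as a simple lookup. Two details are assumptions rather than rules stated in the study: the first matching sub-category wins, and venues with no match return `None`. `broad_category` is a hypothetical helper name.

```python
ATHLETICS = {
    "Athletics & Sports", "Tennis Court", "Baseball Field", "Volleyball Court",
    "Soccer Field", "Hockey Arena", "Badminton Court", "Basketball Court",
    "Basketball Stadium", "Golf Course", "Golf Driving Range", "Skating Rink",
    "Skate Park", "Skating Field", "Track", "Sports Club", "Paintball Field",
}
FITNESS = {
    "Gym", "Gym / Fitness Center", "Boxing Gym", "Climbing Gym",
    "Gymnastics Gym", "Cycle Studio", "Social Club", "Gym Pool", "Pool",
    "Martial Arts Dojo", "Yoga Studio", "Pilates Studio", "College Gym",
}


def broad_category(sub_categories):
    """Map a venue's sub-categories to one of the two broader categories."""
    for name in sub_categories:
        # Assumption: the first sub-category that matches decides the group
        if name in ATHLETICS:
            return "Athletics & Sports Centers"
        if name in FITNESS:
            return "Fitness Centers"
    return None  # no match: candidate for removal as a conflicting category
```

For example, `broad_category(["Yoga Studio", None, None])` returns `"Fitness Centers"`.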

#### Venue Distances

Venues are extracted using the *explore* endpoint with a radius value of 2000 to get a reasonable number of samples for each neighbourhood. More than 3000 venues are extracted. However, the venues are not all unique: the same venue can appear for several neighbourhoods. 

The snapshot below illustrates this with a few neighbourhoods around *Parkwoods*. Venues are shared across neighbourhoods, but the distance between a venue and each neighbourhood is expected to be different.

![Common Venues](images/capstone_league_of_neighbourhoods_shared_venues.png)

*Venue Distance* is the key feature for understanding how easily venues can be reached from different neighbourhoods.
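
Shared venues and their per-neighbourhood distances can be inspected with a small `pandas` sketch like the one below. The venue ids and distances are made-up illustrative values, and the column names follow the feature names used in this document.

```python
import pandas as pd

# Illustrative rows: venue "v1" is returned for two neighbourhoods
rows = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Victoria Village", "Parkwoods"],
    "venue_id": ["v1", "v1", "v2"],
    "Venue Distance": [850, 1620, 400],
})

# Venues that appear for more than one neighbourhood
neighbourhoods_per_venue = rows.groupby("venue_id")["Neighborhood"].nunique()
shared_ids = neighbourhoods_per_venue[neighbourhoods_per_venue > 1].index.tolist()

# Average distance to venues, per neighbourhood
mean_distance = rows.groupby("Neighborhood")["Venue Distance"].mean()
```

The same venue keeps one row per neighbourhood, so its distance contributes separately to each neighbourhood's profile.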

#### Descriptive Statistics

Below are the descriptive statistics for each feature. 

|Stats |Category|	Venue Distance|	likes	|reasons_summary|	tip_count|
|--------|--------|----------------|---------|---------------|------------|
|count	|3365.000000	|3365.00000	|3365.000000	|3365.000000	|3365.000000|
|mean	|0.754532	|1277.01367	|12.341159	|0.067756	|3.872808|
|std	|0.430428	|481.87610	|48.280216	|0.251365	|9.662521|
|min	|0.000000	|17.00000	|0.000000	|0.000000	|0.000000|
|25%	|1.000000	|916.00000	|1.000000	|0.000000	|0.000000|
|50%	|1.000000	|1346.00000	|3.000000	|0.000000	|1.000000|
|75%	|1.000000	|1678.00000	|9.000000	|0.000000	|3.000000|
|max	|1.000000	|2003.00000	|674.000000	|1.000000	|112.000000|

*likes* and *tip_count* contain a fair number of outliers. However, because the goal is only to find similarities between neighbourhoods, these outliers may be significant, so they are retained.
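
As a rough check on the outlier remark, a common rule of thumb (not one the study states it used) flags values above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles reported in the table above.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper outlier fence under the IQR rule of thumb."""
    return q3 + k * (q3 - q1)


# Quartiles taken from the descriptive statistics table
likes_fence = upper_fence(q1=1.0, q3=9.0)  # 9 + 1.5 * 8 = 21.0
tips_fence = upper_fence(q1=0.0, q3=3.0)   # 3 + 1.5 * 3 = 7.5

# Reported maxima are far beyond both fences
likes_max, tips_max = 674.0, 112.0
has_extreme_likes = likes_max > likes_fence
has_extreme_tips = tips_max > tips_fence
```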

#### Scaling

The features are on different scales, which is expected because each feature is measured in different units. Scaling is therefore an important step before a model is built. 

```StandardScaler``` is used to standardize the features so that each has a mean of 0 and a standard deviation of 1. 

The standardized features are shown below.

![Standardized features](images/capstone_league_of_neighbourhoods_std_venues.png)
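
A minimal sketch of the standardization step, assuming `scikit-learn` is installed. The small feature matrix is illustrative only; its columns loosely stand for *Venue Distance*, *likes* and *tip_count*.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix (rows are venues)
features = np.array([
    [916.0, 1.0, 0.0],
    [1346.0, 3.0, 1.0],
    [1678.0, 9.0, 3.0],
])

# Each column is shifted to mean 0 and rescaled to standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)
```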

#### Cluster Analysis

The aim of this study is to identify the similarities between the neighbourhoods.  

```k-means``` is one of the simplest and most popular clustering algorithms. ```KMeans``` from the ```sklearn``` library is used in this study to cluster the standardized features. 

There is no expected or known ```k``` value for the model, so the model is fitted with different ```k``` values to determine the best one. 

The elbow method is a popular way of identifying the best ```k``` value. In this study, ```k``` values from 1 to 25 are tested, and the results are shown below. 

![elbow method](images/capstone_league_of_neighbourhoods_elbow.png)

A ```k``` value of 4 or 5 appears to be the best choice for this data set, so 5 is chosen as the ```k``` value to produce the clusters.
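
The elbow search can be sketched as below, assuming `scikit-learn` is installed. The data is a synthetic stand-in with three well-separated groups, and the range is shortened to 1 through 10 for brevity; the study itself tested 1 to 25 on the standardized features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: three blobs of 30 points each in two dimensions
X = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(30, 2))
    for centre in (0.0, 3.0, 6.0)
])

# Fit one model per k and record the within-cluster sum of squares
k_values = list(range(1, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

# The "elbow" is the k after which inertia stops dropping sharply
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
```

Plotting `inertias` against `k_values` gives a curve like the one in the figure above, and the bend in that curve suggests a reasonable ```k```.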

# Results

Neighbourhoods are placed into 5 clusters based on their similarities. The size of each cluster is shown below.

|Cluster|Size|
|-------|----|
|0	|24|
|1	|20|
|2	|14|
|3	|28|
|4	|17|
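
Cluster sizes like those above can be tallied from the fitted labels, as in the sketch below. It assumes one feature row per neighbourhood (103 rows, the sum of the sizes in the table); how the venue-level features were aggregated per neighbourhood is not stated here, so the random matrix is purely a placeholder and will not reproduce the reported sizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Placeholder: 103 neighbourhoods with 4 features each
neighbourhood_features = rng.normal(size=(103, 4))

# Standardize, then cluster with the chosen k of 5
scaled = StandardScaler().fit_transform(neighbourhood_features)
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)

# Number of neighbourhoods assigned to each cluster label 0..4
cluster_sizes = np.bincount(model.labels_, minlength=5)
```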

###### Are all the downtown neighbourhoods similar? The data says so! 

Almost all the downtown neighbourhoods appear to have similar characteristics. They all have similarly easy access to Sports Centers.

![downtown](images/capstone_league_of_neighbourhoods_downtown.png)

The downtown neighbourhoods that offer similar, convenient access to Sports Centers are listed below. 

* Harbourfront, Regent Park
* Queen's Park
* Ryerson, Garden District
* St. James Town
* Berczy Park
* Central Bay Street
* Adelaide, King, Richmond
* Harbourfront East, Toronto Islands, Union Station
* Little Portugal, Trinity
* Design Exchange, Toronto Dominion Centre
* Commerce Court, Victoria Hotel
* Del Ray, Keelesdale, Mount Dennis, Silverthorn
* Harbord, University of Toronto
* Chinatown, Grange Park, Kensington Market
* Stn A PO Boxes 25 The Esplanade
* First Canadian Place, Underground city
* Church and Wellesley

###### Are all the suburbs similar? No. 

Neighbourhoods in the suburbs are split into the 4 other clusters. They are separated based on ease of access, most preferred venues, etc.

![all_others](images/capstone_league_of_neighbourhoods_all_others.png)

# Discussion

The results suggest that the downtown neighbourhoods have the same level of access to sports centers, while the suburban neighbourhoods are split based on their ease of access. 

The cluster sizes of the other neighbourhoods are 24, 20, 14 and 28.

It can be argued that some features may not represent the true nature of the venues or neighbourhoods. A reasonable number of samples was processed for this study to find the similarities between the neighbourhoods, but the data may well be biased because it depends on the platform's users.

# Conclusion

The aim of this study is to identify similar neighbourhoods that have the same level of access to Sports Centers. Based on this data, the downtown neighbourhoods have convenient access to sports centers, and all other neighbourhoods are naturally split based on ease of access and most preferred places. 

##### Future Enhancements 
* The identified features are based *only* on the Foursquare platform. 
* Samples may not represent the true nature of the venues / neighbourhoods, as these features depend heavily on the platform's users. 
* The model can be enhanced by adding more features. 
* Other sources can be explored to get more features that represent the true nature of the venues, for example:
  * Google Maps API
  * Trip Advisor API 
