<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Torronto</font></h1>

## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in New York City. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the *k*-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in Torronto City and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Clean Dataset</a>

2. <a href="#item2">Enrich the base dataset</a>

    
</font>
</div>

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.6.16  |       hecc5488_0         145 KB  conda-forge
    certifi-2019.6.16          |           py36_1         149 KB  conda-forge
    altair-3.2.0               |           py36_0         770 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.3 MB

The following NEW packages will be 

## 1. Download and Clean Dataset

The initial dataset is availble at this link : https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M .  First we will use Beautiful Soup to scraoe the wikipedia data for the table and extract the information out of it.

### Use the BeautifulSoup to scrape the data

In [3]:
website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,"lxml")
My_table = soup.find("table",{"class":"wikitable sortable"})

### Tranform the data into a *pandas* dataframe

After some pandas magic we are making sure of the following things :
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [4]:
postcode = []
borough = []
neighbourhood = []

for tr in My_table.findAll("tr")[1:]:
    td = tr.findAll('td')
    postcode.append(str(td[0].text).strip('\n'))
    borough.append(str(td[1].text).strip('\n'))
    neighbourhood.append(str(td[2].text).strip('\n'))

torronto_data = pd.DataFrame()
torronto_data["postcode"] = postcode
torronto_data["borough"] = borough
torronto_data["neighbourhood"] = neighbourhood
torronto_data.columns = ["postcode","borough","neighbourhood"]
torronto_data = torronto_data[torronto_data.borough != 'Not assigned']
torronto_data[torronto_data.neighbourhood == 'Not assigned'] = torronto_data.borough
torronto_data = torronto_data.groupby(["postcode","borough"])['neighbourhood'].apply(list).apply(', '.join).to_frame()
torronto_data.reset_index(inplace = True)

In [5]:
torronto_data.head()

Unnamed: 0,postcode,borough,neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [6]:
print("The number of rows in the data are : ",torronto_data.shape[0])

The number of rows in the data are :  103


## 2. Enrich the base dataset

In [14]:
# import geocoder # import geocoder
# def get_lat_long(postal_code):
#     # initialize your variable to None
#     lat_lng_coords = None

#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#         lat_lng_coords = g.latlng

#     return str(lat_lng_coords[0])+","+str(lat_lng_coords[1])

# torronto_data['lat_long'] = torronto_data['postcode'].apply(get_lat_long)

In [15]:
!wget -q -O 'geospatial_data.csv' http://cocl.us/Geospatial_data

In [None]:
torronto_data.head()

In [20]:
geo_data = pd.read_csv("geospatial_data.csv")
geo_data.columns = ["postcode","latitude","longitude"]

In [22]:
geo_data.head()

Unnamed: 0,postcode,latitude,longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [28]:
torronto_data = torronto_data.astype()

In [30]:
torronto_data.dtypes

postcode         object
borough          object
neighbourhood    object
dtype: object

In [31]:
geo_data.dtypes

postcode      object
latitude     float64
longitude    float64
dtype: object

In [32]:
final_join_df = torronto_data.merge(geo_data,how='inner',left_on = "postcode", right_on = "postcode")

In [35]:
print("The number of rows in the data are : ",final_join_df.shape[0])

The number of rows in the data are :  102
