# Peer-graded Assignment
## Segmenting and Clustering Neighborhoods in Toronto

This assignment explores and clusters the neighborhoods in Toronto.
<br /> Data source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

To create the dataframe:

* The dataframe will consist of three columns: **PostalCode, Borough, and Neighborhood**
* Only process the cells that have an assigned borough. **Ignore cells with a borough that is Not assigned.**
* **More than one neighborhood can exist in one postal code area.** For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 of the table in the assignment instructions.
* **If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* **In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.**
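
The rules above can be sketched on a tiny hand-made table. This is only an illustration: the four rows in `raw` are hypothetical and are not taken from the Wikipedia page, and the names `raw`, `clean`, and `unassigned` are assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical miniature of the postal code table (not the real data).
raw = pd.DataFrame({
    "PostalCode":   ["M1A", "M5A", "M5A", "M7A"],
    "Borough":      ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Rule 3: one row per postal code, neighborhoods joined with a comma.
clean = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
         .agg(", ".join)
         .reset_index()
)

print(clean)
print("rows:", clean.shape[0])
```

The group-by step is only needed when the source lists a postal code on several rows; if every code already has its own row, it leaves the table unchanged.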

**Download dependencies**

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


**Transform the data from html into a pandas dataframe**

In [3]:
url = 'List of postal codes of Canada_ M - Wikipedia.html'
dfs = pd.read_html(url)

In [4]:
df = dfs[0]
df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [6]:
df.shape

(180, 3)

**Drop rows with a borough that is Not assigned.**

In [7]:
# Check for not assigned values in borough and drop them

df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)

**Check if two or more neighborhoods exist in one postal code area.**

In [8]:
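temp = df.groupby('Postal Code').count()
temp[temp['Neighborhood'] > 1]  # postal codes that appear in more than one row

| Postal Code | Borough | Neighborhood |
|---|---|---|


The result is empty, so every postal code appears in exactly one row. Neighborhoods that share a postal code are already combined into a single comma-separated cell, as the M5A row above shows.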
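
The same check can be written without a group-by. The sketch below uses a hypothetical three-row `sample` in which M5A is repeated on purpose, so the check has something to find; `rows_per_code` and `repeated` are names made up for the example.

```python
import pandas as pd

# Hypothetical sample: M5A is listed twice on purpose.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M5A", "M5A"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# Number of rows per postal code; anything above 1 is a repeated code.
rows_per_code = sample["Postal Code"].value_counts()
repeated = rows_per_code[rows_per_code > 1]
print(repeated)

# A single boolean answers the same question for the whole column.
print(sample["Postal Code"].is_unique)
```

On the notebook's `df`, an empty `repeated` (or `is_unique` being `True`) would mean no rows need to be combined.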

**If a row has a Not assigned neighborhood, set the neighborhood to the borough name.**

In [23]:
# Rows whose neighborhood is missing take the borough name instead
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']
df.reset_index(drop=True, inplace=True)

# First Link's Answer below

### Number of Rows of the Dataframe

In [10]:
print('Number of Rows of the wrangled Dataframe is:', df.shape[0])

Number of Rows of the wrangled Dataframe is: 103


# Second Link's Answer below

Link used to download csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [64]:
#Load data from csv into a pandas dataframe

df_coor = pd.read_csv("Geospatial_Coordinates.csv")

In [65]:
df_coor.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [66]:
# Add Latitude and Longitude columns to df, using the coordinates from df_coor

for ind in df.index:
    for ind2 in df_coor.index:
        if  df.loc[ind, 'Postal Code'] == df_coor.loc[ind2, 'Postal Code']:
            df.loc[ind, 'Latitude'] = df_coor.loc[ind2, 'Latitude']
            df.loc[ind, 'Longitude'] = df_coor.loc[ind2, 'Longitude']
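
The nested loop compares every row of `df` with every row of `df_coor`. A left merge on the shared `Postal Code` column is a common alternative that should produce the same two columns in one call. The sketch below uses hypothetical two-row stand-ins (`neighborhoods`, `coordinates`) rather than the notebook's dataframes; the values are copied from the rows shown in this notebook.

```python
import pandas as pd

# Hypothetical two-row stand-ins for df and df_coor.
neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],  # deliberately in a different order
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# how="left" keeps every neighborhood row and its original order.
merged = neighborhoods.merge(coordinates, on="Postal Code", how="left")
print(merged)
```

With `how="left"`, a postal code missing from the coordinates table would keep its row and get `NaN` coordinates, which matches what the loop leaves behind in that case.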

In [67]:
df

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |


**The dataframe above contains the same information as the one shown in the Coursera assignment instructions; only the row order differs.**

# Third Link's Answer below
## Explore and cluster the neighborhoods in North York according to their coordinates
This is a simple clustering of the neighborhoods in the North York borough, using only their latitude and longitude.

In [68]:
# Create a new dataframe that contains only the North York borough
# (.copy() makes it independent of df, so columns can be added to it later)

North_York = df[df['Borough'] == 'North York'].copy()
North_York.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 10 | M6B | North York | Glencairn | 43.709577 | -79.445073 |


Run k-means to cluster the borough into 5 clusters.

In [69]:
# set number of clusters
kclusters = 5

# Keep only Latitude and Longitude for clustering.
# drop() returns a new dataframe, so North_York itself is left unchanged.
North_York_clu = North_York.drop(columns=['Postal Code', 'Borough', 'Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(North_York_clu)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([4, 4, 0, 4, 0, 4, 1, 2, 1, 3])
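
The notebook fixes the number of clusters at 5 without stating a reason. One common heuristic for sanity-checking such a choice is to compare the k-means inertia (within-cluster sum of squared distances) for several values of k and look for the point where it stops dropping sharply. The sketch below is an assumption-laden illustration, not the notebook's method: `points` holds six hypothetical coordinates forming three tight pairs, not the North York data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: three tight pairs of points, not the North York data.
points = np.array([
    [43.70, -79.50], [43.71, -79.51],
    [43.75, -79.40], [43.76, -79.41],
    [43.80, -79.30], [43.81, -79.31],
])

# Fit k-means for k = 1..4 and record the inertia of each fit.
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 5))
```

For this toy input the inertia should fall steeply up to k = 3 and change little afterwards, which is the "elbow" the heuristic looks for. On real coordinates the curve is rarely this clear-cut.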

In [70]:
# add clustering labels
North_York.insert(0, 'Cluster Labels', kmeans.labels_)

**Visualizing the results** using folium

In [78]:
# create map
latitude = 43.750112
longitude = -79.442259
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(North_York['Latitude'], North_York['Longitude'], North_York['Neighborhood'], North_York['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
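
The color scheme used for the markers can be inspected on its own, without building a map. The sketch below samples matplotlib's `rainbow` colormap at evenly spaced positions and converts each sample to a hex string; the name `palette` is an assumption introduced for the example, while `kclusters = 5` matches the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # same value the notebook uses

# Sample the rainbow colormap at kclusters evenly spaced positions.
rgba = cm.rainbow(np.linspace(0, 1, kclusters))
palette = [colors.rgb2hex(c) for c in rgba]

# Each cluster label (0 .. kclusters - 1) indexes its own color.
for label in range(kclusters):
    print(label, palette[label])
```

Because k-means labels run from 0 to `kclusters - 1`, indexing the list directly by label gives each cluster a distinct color.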