# Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the table of postal codes from the following Wikipedia page,

> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

and transforms the data into a pandas data frame.

## *Getting Wikipedia data*

#### Let's get started!

First, we have to import the libraries:

In [1]:
#import libraries
#!conda install -c conda-forge folium

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json # library to handle JSON files
from geopy.geocoders import Nominatim
from pandas import json_normalize # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

Now we scrape the Wikipedia table using the requests and BeautifulSoup libraries.

In [3]:
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki,'lxml')
table = soup.find("table",{"class":"wikitable sortable"})

In [4]:
#table #to see its content

#### We can now create a data frame by looping through the BeautifulSoup table.

I recommend inspecting the contents of `table` first to better understand this process.

In [5]:
#Create data frame

columns=['Postal Code','Borough','Neighborhood']
df=pd.DataFrame(columns=columns)
p=[]
b=[]
n=[]

for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Header rows use <th> cells, so only data rows have three <td> cells
    if len(cells) == 3:
        # Take the first text node of each cell and trim surrounding whitespace
        p.append(cells[0].find(string=True).strip())
        b.append(cells[1].find(string=True).strip())
        n.append(cells[2].find(string=True).strip())

df['Postal Code']=p
df['Borough']=b
df['Neighborhood']=n
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
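
To make the row loop easier to follow, here is a minimal sketch that runs the same idea on a small inline HTML snippet instead of the live page. The snippet, the `sample_` names, and the use of `get_text(strip=True)` are assumptions for illustration, not the notebook's actual code. Note that `get_text` joins all text inside a cell, whereas the loop above takes only the first text node.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical excerpt standing in for the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# html.parser ships with Python, so no lxml install is needed for this sketch
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

sample_records = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3:  # the header row has <th> cells, so it is skipped
        sample_records.append([cell.get_text(strip=True) for cell in cells])

sample_df = pd.DataFrame(
    sample_records, columns=["Postal Code", "Borough", "Neighborhood"]
)
print(sample_df)
```

Collecting each row as a list and building the data frame once avoids creating an empty frame and filling it column by column.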


#### Let's clean our data!
We only need the cells that have an assigned borough. Therefore, we can ignore the cells with a borough that is 'Not assigned'.

To do that, let's keep only the rows where the Borough is not equal to 'Not assigned':

In [6]:
ng_tnt=df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

More than one neighborhood can exist in one postal code area. Luckily, the table extracted from Wikipedia already lists such neighborhoods together in a single row, so we only need to check that the postal codes are grouped correctly. If the number of unique values in the Postal Code column equals the number of rows, each postal code appears exactly once and we can assume the grouping is correct.

In [7]:
ng_tnt['Postal Code'].unique().shape[0]==ng_tnt['Postal Code'].shape[0]

True
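
If the check had returned `False`, the neighborhoods would need to be combined by postal code. The sketch below shows one way to do that on a toy frame. The `toy` data and the comma-joined format are assumptions chosen to mirror the table above.

```python
import pandas as pd

# Toy frame in which one postal code is split across two rows
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Same test as the cell above, written with the is_unique shortcut
print(toy["Postal Code"].is_unique)

# Join the neighborhoods that share a postal code and borough into one row
toy_grouped = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(toy_grouped)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.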

#### Great! Now let's confirm that no neighborhood has a 'Not assigned' value:

In [8]:
ng_tnt[ng_tnt['Neighborhood'] == 'Not assigned'].shape[0]

0

And get the shape of the grouped data frame:

In [9]:
ng_tnt.shape

(103, 3)

## *Uploading data frame with latitude and longitude information*

#### I tried to get the latitude and longitude of each postal code area by using geocoder, but I failed.
> ## It was taking too long!

*This is what I did...*

```python
!pip install geocoder
import geocoder

def get_lat_log(postal):
    # Retry until the geocoder returns coordinates. Note that this loop
    # can run indefinitely if the service keeps returning None.
    lat_lng_coords = None
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal))
        lat_lng_coords = g.latlng

    lat = lat_lng_coords[0]
    long = lat_lng_coords[1]
    return (lat, long)

lat_list = []
long_list = []

# Quick test on the first postal code
lat, long = get_lat_log(ng_tnt['Postal Code'][0])

# Look up every postal code in the cleaned data frame
for pc in ng_tnt['Postal Code']:
    (lat, long) = get_lat_log(pc)
    lat_list.append(lat)
    long_list.append(long)

ng_tnt['Latitude'] = lat_list
ng_tnt['Longitude'] = long_list
```
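
The `while` loop above retries with no upper bound, so a single postal code that never resolves can stall the whole run. As a sketch only, the variant below caps the number of attempts and waits a little longer between tries. The function name `get_lat_lng_bounded`, its `lookup` parameter, and the `max_attempts` and `delay` defaults are hypothetical and not part of the geocoder library.

```python
import time

def get_lat_lng_bounded(postal, lookup, max_attempts=5, delay=1.0):
    """Return (lat, lng) for a postal code, or (None, None) if every attempt fails.

    `lookup` is any callable that takes a query string and returns
    [lat, lng] or None. In the notebook it could be, for example,
    lambda q: geocoder.google(q).latlng
    """
    for attempt in range(max_attempts):
        coords = lookup('{}, Toronto, Ontario'.format(postal))
        if coords is not None:
            return coords[0], coords[1]
        # Wait a little longer after each failed attempt
        time.sleep(delay * (attempt + 1))
    return None, None

# Stand-in lookup that fails twice and then succeeds (no network needed)
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return [43.806686, -79.194353] if calls["n"] >= 3 else None

print(get_lat_lng_bounded("M1B", flaky_lookup, delay=0))
print(get_lat_lng_bounded("M1B", lambda q: None, max_attempts=2, delay=0))
```

Returning `(None, None)` instead of looping forever lets the caller decide how to handle postal codes that could not be geocoded.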

#### So, I decided to read the CSV file with the coordinates and merge the latitude and longitude into my data frame:

In [10]:
df_ltlg=pd.read_csv('https://cocl.us/Geospatial_data')
df_ltlg.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [11]:
ng_data=ng_tnt.merge(df_ltlg, on='Postal Code', how='left')
ng_data

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |
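
A left merge keeps every row of the left frame and fills in `NaN` where no coordinates are found, so it can be worth checking for unmatched postal codes. The sketch below does this on toy data with the `indicator` and `validate` options of `DataFrame.merge`. The frames and the postal code `M0Z` are made up for illustration.

```python
import pandas as pd

# Toy frames; 'M0Z' is a made-up code with no coordinates on the right side
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

merged = left.merge(
    right,
    on="Postal Code",
    how="left",
    validate="one_to_one",  # raises if a postal code is duplicated on either side
    indicator=True,         # adds a '_merge' column describing each row's origin
)

# Rows marked 'left_only' found no match and therefore have NaN coordinates
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postal code received coordinates.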


Let's check whether the coordinates match the Capstone example:

In [12]:
check=ng_data[ng_data['Postal Code'] =='M9V'].reset_index(drop=True)
check

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... | 43.739416 | -79.588437 |


It does! So, now we are ready to analyze our data and cluster it for a better visualization.