<a> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTTYOPfdkntp9QtXcPkvO7JsGnU9GunyWgQ8TvyHBySz38_j0UAYQ" align="center" ></a>

<h1 align="center">Segmentation and Clustering Neighborhoods in Toronto (CAN)</h1>

## Summary

In this notebook, we will explore, segment, and cluster the neighborhoods in the city of Toronto. The neighborhood data is not readily available online in a ready-to-use format. What makes data science interesting is that each project can be challenging in its own way.

In this particular case we will extract the data from a Wikipedia page, which holds all the information we need to explore and cluster the neighborhoods in Toronto. We will need to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame.

We will use the BeautifulSoup Python Library. If you are interested, here is the library's main documentation page (<a href="https://beautiful-soup-4.readthedocs.io/en/latest/">BeautifulSoup Documentation</a>).

To segment the neighborhoods in Toronto, we will build a clustering step based on the k-means algorithm.
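
For orientation, the k-means idea alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it. The sketch below is a minimal NumPy illustration on made-up 2-D points, not the clustering applied to the Toronto data; the function name `kmeans_sketch`, the toy data, and the fixed seed are all assumptions for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Hypothetical helper: plain Lloyd's iterations on an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Made-up data: two well-separated groups of three points each
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random start, so only the grouping itself is meaningful.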

### Importing Libraries

In [399]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as BS
import csv
import matplotlib.pyplot as plt

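def count_tags(tag_name, html):
    """Count how many times tag_name occurs in the given HTML string."""
    soup = BS(html, 'lxml')  # name the parser explicitly to avoid bs4's parser warning
    return len(soup.find_all(tag_name))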

### Scrape the data from the Wikipedia page using BeautifulSoup

#### I tried to implement the scrape with BeautifulSoup alone, but could not get it to work. If you have tips, you are more than welcome to share them.

In [298]:
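source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BS(source, 'lxml')  # BeautifulSoup is imported as BS above
td_count = count_tags('td', source)

table = soup.find_all('table', {'class': 'wikitable sortable'})
info = table[0]
#print(info)

# Write one CSV row per table row; csv.writer quotes cells that contain commas
with open('neigh_toronto.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in info.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if cells:
            writer.writerow(cells)

# Read back the same file that was just written
df = pd.read_csv('neigh_toronto.csv')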
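
One way a BeautifulSoup-only scrape could work is to collect each table row into a list of cell strings and hand the lists to pandas directly, skipping the intermediate file. The sketch below runs on a small inline HTML fragment rather than the live Wikipedia page, so the fragment, the `rows_from_table` helper, and the use of the built-in `html.parser` are assumptions for illustration; the real page's markup may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment that mimics the shape of a 'wikitable sortable' table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

def rows_from_table(table):
    """Hypothetical helper: return the table as a list of rows of stripped cell text."""
    rows = []
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    return rows

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', class_='wikitable')
rows = rows_from_table(table)

# First row holds the header cells, the rest are data
scraped = pd.DataFrame(rows[1:], columns=rows[0])
print(scraped)
```

Building the DataFrame from lists avoids the fixed-width padding problem, because no text file has to be parsed back in.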

#### I tried a simpler implementation, using both the BeautifulSoup library and the pandas library

In [463]:
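from io import StringIO

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BS(source, 'lxml')  # BeautifulSoup is imported as BS
# If you want to see the source code
#print(soup.prettify())

# read_html returns a list of DataFrames, one per <table> it finds on the page
# (recent pandas versions expect a file-like object, hence StringIO)
df = pd.read_html(StringIO(source))
df = df[0]
df[:10]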

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


### Pre-processing the dataset to clean it

#### First, for every row where the Neighbourhood is 'Not assigned' but the Borough is assigned, we replace the Neighbourhood with the Borough value

In [478]:
df['Neighbourhood'] = np.where(
    (df['Neighbourhood'] == 'Not assigned') & (df['Borough'] != df['Neighbourhood']),
    df['Borough'],
    df['Neighbourhood'],
)
df[:10]
# In row 8 (M7A) we can see that the replacement was successful

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 9 | M8A | Not assigned | Not assigned |
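
The replacement rule can be checked on a handful of rows. This is a small sketch on made-up data shaped like the table above; `toy` and `needs_fill` are illustrative names. It uses `Series.mask`, which replaces values where the condition is true, as an alternative to `np.where`.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# True only where the neighbourhood is missing but the borough is known
needs_fill = (toy['Neighbourhood'] == 'Not assigned') & (toy['Borough'] != 'Not assigned')

# mask() swaps in the borough wherever needs_fill is True, leaving other rows untouched
toy['Neighbourhood'] = toy['Neighbourhood'].mask(needs_fill, toy['Borough'])
print(toy)
```

Rows where both columns are 'Not assigned' are left alone here, since there is no borough to copy from.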


#### Getting rid of the Not assigned rows

In [487]:
df.drop(df[df.Borough=="Not assigned"].index,inplace=True)
df.drop(df[df.Neighbourhood=="Not assigned"].index,inplace=True)
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
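
The same filtering can be written as a boolean mask that keeps the wanted rows instead of dropping the unwanted ones in place. A minimal sketch on made-up rows; `toy` and `assigned` are illustrative names, and the `reset_index(drop=True)` call is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep rows where neither column is 'Not assigned'
assigned = (toy['Borough'] != 'Not assigned') & (toy['Neighbourhood'] != 'Not assigned')
cleaned = toy[assigned].reset_index(drop=True)  # renumber rows from 0
print(cleaned)
```

Without the index reset, the surviving rows keep their original labels, which is why the output above starts at 2.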


#### Display the duplicates

In [480]:
df.Postcode.value_counts() 

M8Y    8
M9V    8
M5V    7
M8Z    5
M9B    5
M4V    5
M6M    4
M1V    4
M9R    4
M9C    4
M5J    3
M5R    3
M5T    3
M1M    3
M1E    3
M1P    3
M1T    3
M1L    3
M1C    3
M6K    3
M5H    3
M8X    3
M6L    3
M8V    3
M1K    3
M2J    3
M3H    3
M4B    2
M4X    2
M5K    2
      ..
M2H    1
M4M    1
M3N    1
M1G    1
M5G    1
M1J    1
M4A    1
M2K    1
M2R    1
M1S    1
M5C    1
M9N    1
M4N    1
M4G    1
M3L    1
M3B    1
M7A    1
M4C    1
M6E    1
M1W    1
M1H    1
M6B    1
M4W    1
M6C    1
M4H    1
M4S    1
M3M    1
M5W    1
M2P    1
M9L    1
Name: Postcode, Length: 103, dtype: int64
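
To see the duplicated rows themselves rather than just their counts, `DataFrame.duplicated` with `keep=False` flags every row whose postcode appears more than once. A sketch on made-up rows; `toy` and `repeated` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# keep=False marks all members of a duplicated group, not just the later ones
repeated = toy[toy.duplicated(subset='Postcode', keep=False)]
print(repeated)
```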

#### More than one neighbourhood can exist in one postal code area. We need to group them into a single row per postal code

In [482]:
df_final = df.groupby(['Postcode','Borough'], sort = False).agg(lambda x: ', '.join(x)).reset_index()
df_final.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [483]:
df_final.shape

(103, 3)
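
The grouping step can be sanity-checked on a few rows: after grouping, each postcode should appear exactly once, with its neighbourhoods joined by commas. A sketch on made-up rows mirroring the data above; `toy` and `grouped` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# sort=False keeps postcodes in first-seen order; ', '.join concatenates each group's names
grouped = (toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
              .agg(', '.join)
              .reset_index())

print(grouped)
print(grouped['Postcode'].is_unique)
```

The same uniqueness check on `df_final` would be consistent with the 103 distinct postcodes reported by `value_counts()` earlier.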