## Week 3 - Segmenting and Clustering Neighborhoods in Toronto

In [1]:
# import the requests, BeautifulSoup and pandas libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Fetch the Wikipedia page and locate the postal code table, storing it in `table`

In [2]:
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'
table = soup.find('table',{'class':'wikitable sortable'})

#### Find all `tr` tags and, for each one, collect the text of its `td` cells into the `data` array. Use `strip()` to remove the trailing `\n` from each cell

In [3]:
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    td=[]
    for t in row.find_all('td'):
        td.append(t.text.strip())
    data.append(td)
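
To see what this loop produces, here is a minimal sketch that runs the same extraction on a small inline table. The HTML snippet is a hypothetical stand-in, not the real Wikipedia markup, and it assumes the header row uses `th` cells. Under that assumption the header row yields an empty list, which is one likely source of the "bad rows" filtered later.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content)
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M1B
</td><td>Scarborough
</td><td>Malvern, Rouge
</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})

# One list per `tr`; strip() removes the trailing newline in each cell
rows = [[cell.text.strip() for cell in tr.find_all('td')]
        for tr in table.find_all('tr')]
print(rows)
```

The first entry is empty because the header row has no `td` cells; the remaining entries hold three stripped strings each.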

* Create a pandas dataframe from the data array
* Filter any bad rows
* Filter out rows for which `Borough` is `Not assigned`
* Reset the index after filtering
* Group by `Postal Code` and `Borough`, and join the `Neighborhood` values of each group into one comma-separated string. For example,
```csv
M1B, Scarborough, Malvern
M1B, Scarborough, Rouge
```
becomes
```csv
M1B, Scarborough, "Malvern, Rouge"
```
* Replace any `Neighborhood` that is `Not assigned` with the value of its `Borough`
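
The steps above can be sketched on a tiny hand-built dataframe. The rows below are illustrative only: `M0X` and `Example Borough` are made-up values, and the `dropna`/`where` calls are one possible way to express the filtering and replacement, not necessarily the approach used in the next cell.

```python
import pandas as pd

# Toy rows for illustration; 'M0X' / 'Example Borough' are hypothetical
toy = pd.DataFrame(
    [
        [None, None, None],                         # bad row (e.g. from a header)
        ['M1A', 'Not assigned', 'Not assigned'],    # borough not assigned
        ['M1B', 'Scarborough', 'Malvern'],
        ['M1B', 'Scarborough', 'Rouge'],
        ['M0X', 'Example Borough', 'Not assigned'], # neighborhood not assigned
    ],
    columns=['Postal Code', 'Borough', 'Neighborhood'],
)

# Filter bad rows, then rows whose borough is 'Not assigned'
toy = toy.dropna(subset=['Borough'])
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Join the neighborhoods that share a postal code and borough
toy = (
    toy.groupby(['Postal Code', 'Borough'])['Neighborhood']
    .apply(', '.join)
    .reset_index()
)

# Fall back to the borough name where the neighborhood is 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].where(
    toy['Neighborhood'] != 'Not assigned', toy['Borough']
)
print(toy)
```

Two rows remain: `M0X` with its borough name copied into `Neighborhood`, and `M1B` with `Malvern, Rouge`.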

In [4]:
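df = pd.DataFrame(data, columns=['Postal Code', 'Borough', 'Neighborhood'])
# filter out bad rows (rows with no cell data)
df = df.dropna(subset=['Borough'])
# filter out rows whose Borough is 'Not assigned', then reset the index
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
# join the neighborhoods that share a postal code and borough
df = df.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
# where Neighborhood is 'Not assigned', use the Borough value instead
df['Neighborhood'] = df['Neighborhood'].where(df['Neighborhood'] != 'Not assigned', df['Borough'])
df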

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village, St. Phillips, Martin Grove ... |
| 101 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |


In [5]:
# check the dimensions of the dataframe
df.shape

(103, 3)

In [6]:
# read `Geospatial_data.csv` directly from its URL
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')
df_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [7]:
# check the dimensions of the geospatial dataframe
df_geo.shape

(103, 3)

#### Now join the `df` and `df_geo` dataframes on the `Postal Code` column to get the desired output.

In [8]:
df_merged=df.join(df_geo.set_index('Postal Code'), on='Postal Code')
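
As a cross-check on the join, the same result can be obtained with `merge`, which can also verify that each postal code appears only once on each side. This is a minimal sketch on two toy frames (the names `left`, `right` and `merged` are illustrative, and the values are copied from the first rows shown above), not a replacement for the cell above.

```python
import pandas as pd

left = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})
right = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],  # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Left join on the shared key; validate raises MergeError on duplicate keys
merged = left.merge(right, on='Postal Code', how='left', validate='one_to_one')

# Count rows that found no matching coordinates
missing = int(merged['Latitude'].isna().sum())
print(merged)
print('rows without coordinates:', missing)
```

A left join keeps every row of the neighborhood frame, so counting missing latitudes afterwards is a quick way to spot postal codes that have no entry in the geospatial file.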

In [9]:
df_merged.to_csv('df_final.csv')