# **Building a DataFrame by Web Scraping**
---

In [0]:
# Scrape every table on the page directly into a list of DataFrames using pandas
import pandas as pd
dfs = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [86]:
# Print every table to find out which one is the required table
for i in dfs:
  print(i)

    Postal code  ...                                       Neighborhood
0           M1A  ...                                                NaN
1           M2A  ...                                                NaN
2           M3A  ...                                          Parkwoods
3           M4A  ...                                   Victoria Village
4           M5A  ...                         Regent Park / Harbourfront
..          ...  ...                                                ...
175         M5Z  ...                                                NaN
176         M6Z  ...                                                NaN
177         M7Z  ...                                                NaN
178         M8Z  ...  Mimico NW / The Queensway West / South of Bloo...
179         M9Z  ...                                                NaN

[180 rows x 3 columns]
                                                  0   ...   17
0                                                

In [87]:
# The required table is the first one, dfs[0]; rename its postal code column
df = dfs[0]
df = df.rename(columns={"Postal code": "PostalCode"})
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | NaN |
| 1 | M2A | Not assigned | NaN |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
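
Picking the table by position (`dfs[0]`) depends on the page layout staying the same. As a minimal sketch, assuming the required table can be recognised by its column names, the hypothetical helper `pick_table` below (not part of pandas) returns the first DataFrame that contains all the expected columns. The `toy_tables` list is made-up data standing in for the output of `pd.read_html`.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")


# Toy stand-ins for the list returned by pd.read_html (illustrative data only)
toy_tables = [
    pd.DataFrame({0: ["nav"], 1: ["links"]}),
    pd.DataFrame(
        {
            "Postal code": ["M3A"],
            "Borough": ["North York"],
            "Neighborhood": ["Parkwoods"],
        }
    ),
]

picked = pick_table(toy_tables, ["Postal code", "Borough", "Neighborhood"])
print(picked)
```

With the real data this would be called as `pick_table(dfs, ["Postal code", "Borough", "Neighborhood"])`, assuming the page still uses those column names.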


### **Process only the cells that have an assigned borough; ignore cells whose borough is Not assigned.**

In [88]:
# Find the index labels of rows whose borough is 'Not assigned' and drop them
index_names = df[df['Borough'] == 'Not assigned'].index
df = df.drop(index_names)
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
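
The same filtering can also be written as a single boolean mask, which avoids the intermediate index lookup. The sketch below uses a small made-up frame (`toy`) rather than the scraped table, and additionally resets the index, which the cell above does not do.

```python
import pandas as pd

# Illustrative data only, shaped like the scraped table
toy = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighborhood": [None, "Parkwoods", "Victoria Village"],
    }
)

# Keep only the rows whose borough is assigned, then renumber the rows from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```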


### **If a cell has a borough but its neighborhood is Not assigned, the neighborhood is set to the borough**

In [89]:
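# Treat both the literal 'Not assigned' and missing values (NaN) as unassigned
unassigned = df['Neighborhood'].isna() | (df['Neighborhood'] == 'Not assigned')
df.loc[unassigned, 'Neighborhood'] = df['Borough']
df.head()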

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
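
The table printed earlier shows missing neighborhoods as `NaN` rather than the text `Not assigned`, and an equality test against a string never matches `NaN`. The sketch below, on made-up data, shows the difference and one way to cover both cases with `Series.mask`. Whether the live page uses one encoding, the other, or both is an assumption to verify against the scraped frame.

```python
import pandas as pd

# Illustrative data only: one literal 'Not assigned' and one missing value
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M7A", "M9A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Regent Park / Harbourfront", "Not assigned", None],
    }
)

# A plain string comparison flags only the literal text, not the missing value
naive = toy["Neighborhood"] == "Not assigned"

# Combining it with isna() flags both kinds of unassigned neighborhood
unassigned = toy["Neighborhood"].isna() | naive

# Replace flagged neighborhoods with the borough from the same row
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```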


### **Rows with the same postal code are combined into one row, with the neighborhoods separated by a comma**

In [90]:
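# Group by postal code and borough (keeping first-seen order) and join the neighborhood names
result = df.groupby(['PostalCode', 'Borough'], sort=False).agg(', '.join)
# Turn the group keys back into ordinary columns
df_final = result.reset_index()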
df_final.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
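
In the rows shown above each postal code already appears only once, so the effect of the grouping is not visible. A small made-up example where one postal code spans two rows illustrates what the `groupby` with `', '.join` does. The neighborhood values here are illustrative, not taken from the scraped table.

```python
import pandas as pd

# Illustrative data only: M5A appears twice with different neighborhoods
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
    }
)

combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)  # join all neighborhood names within each group
    .reset_index()   # turn the group keys back into ordinary columns
)
print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance, so `M5A` stays ahead of `M3A` in this example.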


In [91]:
df_final.shape

(103, 3)