# Segmenting and Clustering Neighborhoods in Toronto

## Question 1

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Extracting/scraping table data 
I need to extract the table from the Wikipedia page. I will use the BeautifulSoup library for this, with lxml as the HTML parser.
From the resulting soup, I then extract the table.

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
text = requests.get(wiki_url).text  # get webpage content as text
html = BeautifulSoup(text, 'lxml')  # parse the HTML text into a BeautifulSoup tree
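
The download step above assumes the request succeeds. A minimal sketch of a more defensive variant follows; the helper name `fetch_soup` and the 10-second timeout are assumptions made for illustration, not part of the original notebook, and `html.parser` (the standard-library parser) is swapped in so the sketch does not require lxml.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Hypothetical helper: download a page and parse it, failing loudly on errors."""
    response = requests.get(url, timeout=timeout)  # give up if the server does not answer in time
    response.raise_for_status()                    # raise requests.HTTPError on a 4xx/5xx status
    # 'html.parser' ships with Python; 'lxml' can be used instead if it is installed
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup(wiki_url)` would then return a parsed tree or raise an exception, instead of silently parsing an error page.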

The next step needs a little HTML knowledge in order to find the table by its class. In HTML, tr stands for table row and th stands for table header cell. Hence, we first locate the table with class wikitable and then find all the rows under it.

In [3]:
table = html.find('table', class_='wikitable')  # table
rows = table.find_all('tr')  # finding all rows under the table

After getting the rows, we need the cells (columns) in each row, so we loop over every row and collect its cells. td stands for table data, i.e. a cell in the table.

In [4]:
table_df = []
for each_row in rows:
    row = []
    if each_row.find_all('th'):                  # header row: holds the column names
        header = each_row.text.split('\n')[1:4]  # slice off the empty strings at the start and end
        print(header)                            # used later as the column names of the df
    for cell in each_row.find_all('td'):         # collect the text of every data cell in this row
        row.append(cell.text.strip('\n'))
    table_df.append(row)                         # the header row has no td cells, so it is stored as []
table_df[0:10]

['Postcode', 'Borough', 'Neighbourhood']


[[],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned']]
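
The row and cell walk above can be tried offline on a small inline table. This is a sketch under stated assumptions: the HTML string and the `sample_*` names are made up for illustration, and `html.parser` is used instead of lxml to avoid the extra dependency. It uses `get_text(strip=True)` per cell rather than splitting the row text on newlines, and it skips the header row instead of storing an empty list for it.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table mimicking the structure of the Wikipedia one
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', class_='wikitable')

sample_header, sample_rows = [], []
for tr in sample_table.find_all('tr'):
    th_cells = tr.find_all('th')
    td_cells = tr.find_all('td')
    if th_cells:
        # header row: one column name per th cell
        sample_header = [th.get_text(strip=True) for th in th_cells]
    elif td_cells:
        # data row: one value per td cell
        sample_rows.append([td.get_text(strip=True) for td in td_cells])

print(sample_header)
print(sample_rows)
```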

### Creating the dataframe
Now we have the column headers and rows of the table to create the dataframe.

In [5]:
header[0] = 'PostalCode'     # rename the columns to the required names
header[2] = 'Neighborhood'
df = pd.DataFrame(data=table_df[1:], columns=header)  # table_df[0] is the empty header row, so skip it
df.head()

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
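
As a small illustration of the step above, a list of rows plus a list of column names is all `pd.DataFrame` needs. The sketch below uses two rows copied from the output above; the `sample_*` names are hypothetical, and `rename` is shown as one alternative to editing the header list in place.

```python
import pandas as pd

# Two data rows and the scraped header, as plain Python lists
sample_rows = [
    ['M3A', 'North York', 'Parkwoods'],
    ['M4A', 'North York', 'Victoria Village'],
]
sample_header = ['Postcode', 'Borough', 'Neighbourhood']

sample_df = pd.DataFrame(sample_rows, columns=sample_header)

# Rename after construction instead of mutating the header list
sample_df = sample_df.rename(
    columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'}
)
print(sample_df)
```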


### Cleaning the dataframe
Drop the rows whose Borough is 'Not assigned'.

In [6]:
drop_index = df.index[df['Borough'] == 'Not assigned']  # index labels of the rows whose Borough is 'Not assigned'
df.drop(drop_index, inplace=True)  # drop those rows by label
df.reset_index(drop=True, inplace=True)  # renumber the index after dropping rows
df.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
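
The same cleaning step can also be written as a boolean filter, which avoids collecting index labels first. This is a minimal sketch on a toy frame whose rows are taken from the scraped output above; the names `sample_df` and `kept` are assumptions for illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only the rows whose Borough is assigned, then renumber the index
kept = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(kept)
```

Unlike `drop(..., inplace=True)`, the filter returns a new frame and leaves `sample_df` untouched.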


More than one neighborhood can exist in one postal code area, so neighborhoods that share a PostalCode need to be grouped together. First, though, we have to handle cells that have a Borough but a 'Not assigned' Neighborhood: in that case the Neighborhood should be the same as the Borough.
If we grouped first, a 'Not assigned' value would end up inside the joined string whenever several Neighborhoods share a PostalCode, which makes it hard to replace using the index or df.loc.

So, before grouping, assign the Borough name to every Neighborhood whose value is 'Not assigned'.

In [7]:
# assign the corresponding Borough only to the rows whose Neighborhood is 'Not assigned'
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']
df.head()

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
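
The first five rows shown above contain no 'Not assigned' Neighborhood, so the effect of the replacement is not visible there. A toy frame makes it easier to see; this sketch uses `Series.mask` as one alternative way to express the replacement, and the names `sample_df` and `missing` are hypothetical.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# True where the Neighborhood still needs a value
missing = sample_df['Neighborhood'] == 'Not assigned'

# mask() replaces the entries where the condition is True with the aligned Borough value
sample_df['Neighborhood'] = sample_df['Neighborhood'].mask(missing, sample_df['Borough'])
print(sample_df)
```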


Now group together the Neighborhoods that share a PostalCode.

In [8]:
# Group the rows by PostalCode and Borough, then join each group's Neighborhood values into one comma-separated string.
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
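
To see what the grouping does to rows that share a PostalCode, here is a minimal sketch on five rows taken from the earlier output. The names `sample_df` and `grouped` are assumptions for illustration, and `agg` is used in place of `apply`; for a plain string join the two should give the same result.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A', 'M6A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# One output row per (PostalCode, Borough) pair; groupby sorts the keys by default
grouped = (
    sample_df.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(grouped)
```

Grouping on both PostalCode and Borough keeps the Borough column in the result; grouping on PostalCode alone would drop it.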


Use the shape attribute to print the number of rows and columns of the dataframe.

In [9]:
df.shape

(103, 3)

In [10]:
df.to_csv('Part_1.csv',index=False)