<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

### Import the Python libraries and fetch the HTML content of the Wikipedia page.

In [25]:
from bs4 import BeautifulSoup
import requests
import pandas as pd # library for data analysis

url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "lxml")

### Process the header/column names and initialize the dataFrame.

In [26]:
postcode_table = soup.find("table", attrs={"class": "wikitable"})
postcode_table_data = postcode_table.tbody.find_all("tr")

# Get the column headings from the first table row
column_names = []
for th in postcode_table_data[0].find_all("th"):
    # remove any newlines and extra spaces from left and right
    column_names.append(th.text.replace('\n', '').strip())

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

### Process the table rows to extract the full neighbourhood data from the HTML table. Rows whose Borough is 'Not assigned' are ignored. If a cell has a borough but a 'Not assigned' neighbourhood, the neighbourhood is set to the borough name. For example, for the 9th cell in the table on the Wikipedia page, both the Borough and the Neighbourhood columns will be Queen's Park.

In [27]:
postcode_table_rest_of_data = postcode_table_data[1:]
rows = []
for tr in postcode_table_rest_of_data:
    data_row = tr.find_all("td")
    post_code = data_row[0].text.replace('\n', '').strip()
    borough = data_row[1].text.replace('\n', '').strip()
    neighborhood_name = data_row[2].text.replace('\n', '').strip()

    # Skip rows that have no borough assigned
    if "Not assigned" in borough:
        continue
    # Fall back to the borough name when the neighbourhood is not assigned
    if "Not assigned" in neighborhood_name:
        neighborhood_name = borough
    rows.append({'Postcode': post_code,
                 'Borough': borough,
                 'Neighbourhood': neighborhood_name})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)

neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
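
The cleaning rule described above can also be isolated in a small function, which makes it easy to check without fetching the page. This is only a sketch: `clean_row` is a hypothetical helper that is not part of the notebook, the sample rows are illustrative, and it compares cell text for exact equality with 'Not assigned' where the notebook does a substring search.

```python
def clean_row(post_code, borough, neighbourhood):
    """Return a cleaned record, or None when the borough is 'Not assigned'."""
    post_code, borough, neighbourhood = (
        s.replace("\n", "").strip() for s in (post_code, borough, neighbourhood)
    )
    if borough == "Not assigned":
        return None  # rows without a borough are dropped
    if neighbourhood == "Not assigned":
        neighbourhood = borough  # fall back to the borough name
    return {"Postcode": post_code, "Borough": borough, "Neighbourhood": neighbourhood}


# Illustrative raw cell texts (hypothetical sample, shaped like the table cells)
raw_rows = [
    ("M1A\n", "Not assigned\n", "Not assigned\n"),
    ("M3A\n", "North York\n", "Parkwoods\n"),
    ("M7A\n", "Queen's Park\n", "Not assigned\n"),
]

cleaned = [rec for rec in (clean_row(*row) for row in raw_rows) if rec is not None]
for rec in cleaned:
    print(rec)
```

Keeping the rule in one function means the two 'Not assigned' cases are handled in a single place.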


### More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighbourhoods: Harbourfront and Regent Park. Such rows are combined into one row, with the neighbourhoods separated by a comma.

In [28]:
df_neighborhoods = neighborhoods.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()

df_neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
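
To see the grouping step in isolation, here is a minimal sketch on a three-row toy frame built around the M5A example mentioned above. The frame `sample` and the result name `combined` are made up for illustration. The grouping call mirrors the one in the notebook, using `agg` in place of `apply`.

```python
import pandas as pd

# Toy data: M5A appears twice, once per neighbourhood
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhood names joined with a comma
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)
print(combined)
```

Within each group the names keep their original row order, so the M5A row lists Harbourfront before Regent Park.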


### Shape of the dataframe:

In [29]:
df_neighborhoods.shape

(103, 3)

### The supplied CSV file provides the geospatial (latitude/longitude) data. The file was downloaded and added as a local data asset in IBM Watson Studio, which serves as the source of the dataframe.

In [30]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
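
Since the loading cell was stripped, the following is a rough sketch of how such a frame could be read with `pd.read_csv`. It is not the original code: an in-memory string stands in for the supplied file, holding only the first rows shown in the output above. With the real file, its local path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the supplied CSV file (assumption: same header as the output above)
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""

# With the real file, replace io.StringIO(csv_text) by the path to the CSV
df_geospatial_data = pd.read_csv(io.StringIO(csv_text))
print(df_geospatial_data.head())
```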


### Rename the column 'Postal Code' to 'Postcode' so that it matches the previously processed dataframe of Toronto neighbourhood data. This lets us join the two dataframes with minimal effort.

In [34]:
df_geospatial_data.columns = ['Postcode', 'Latitude', 'Longitude']
df_geospatial_data.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
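
An alternative to assigning the whole column list is to rename only the one column by name with `DataFrame.rename`, which leaves the other columns untouched. A minimal sketch on a toy frame (`df_geo` is a made-up name):

```python
import pandas as pd

# Toy frame with the same column layout as the geospatial data
df_geo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Rename by name rather than by position
df_geo = df_geo.rename(columns={"Postal Code": "Postcode"})
print(df_geo.columns.tolist())
```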


### Finally, merge (join) the two dataframes to get the required dataframe containing the Toronto neighbourhood data along with the respective latitude and longitude.

In [35]:
final_df = pd.merge(df_neighborhoods, df_geospatial_data, how='inner', on='Postcode')
final_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
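
An inner join silently drops postcodes that have no coordinates. One way to check for that, sketched here on toy frames, is a left join with `indicator=True` plus `validate='one_to_one'`, which raises an error if a postcode is duplicated on either side. The frames `left` and `right` and the postcode `M9Z` are invented for the example.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Example"],
    "Neighbourhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union", "Example"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Left join keeps every neighbourhood row; _merge records where each row matched
checked = pd.merge(left, right, how="left", on="Postcode",
                   validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join loses no rows.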
