# IBM Data Science Professional Capstone Project: Scraping, Parsing Table into Pandas DF for Neighborhood Clustering

### In this three-part series of notebooks, we scrape a table of postal codes, boroughs and neighborhoods in and around Toronto from a Wikipedia article; clean the data as necessary; geolocate each neighborhood and gather information about venues local to it; and finally perform a KMeans cluster analysis to identify similar neighborhoods, visualizing the result as a tagged Folium map.

### Notebook I:  The Neighborhoods DataFrame (Pandas)

#### Import necessary libraries for our analysis

In [81]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# install beautifulsoup4 if it is not already installed on your system
import bs4  # beautifulsoup4 will be used for parsing the scraped HTML
from bs4 import BeautifulSoup

print('Libraries imported.') 


Libraries imported.


#### Retrieve the Wikipedia article containing the Canadian regional Postal Code table, parse the HTML document with BeautifulSoup, and do a preliminary manipulation of the resulting object to isolate the table

In [82]:
       
def parse_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup
    
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
souper = parse_url(url)
type(souper)
     

bs4.BeautifulSoup

#### A quick inspection of the object reveals the table of interest (delimited by 'tbody' tags). The table comprises rows (delimited by 'tr' tags): the first row holds the column headings ('th' tags) and the remaining rows hold the data elements ('td' tags). Isolating the rows, we find a single row of column headings followed by 289 rows of data. Using Pandas, we move the parsed Postal Code table into a DataFrame ('pc_df') for further processing. Finally, as requested in the exercise, we delete the rows for which the Borough is 'Not assigned' and, for any row whose Neighbourhood is 'Not assigned', set the Neighbourhood to the name of its Borough.
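
The 'tr'/'th'/'td' structure described above can be illustrated on a small, made-up HTML table. This is a minimal sketch rather than the notebook's own cell: the `sample_html` string and its rows are invented for illustration, and Python's built-in `html.parser` is assumed in place of `lxml` so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (invented rows, same layout)
sample_html = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
<tr><td>M7A</td><td>Queen's Park</td><td>Not assigned
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = soup.tbody.find_all('tr')

# First row: 'th' header cells; get_text(strip=True) removes the trailing newline
col_names = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# Remaining rows: 'td' data cells, one list of strings per row
data = [[td.get_text(strip=True) for td in row.find_all('td')] for row in rows[1:]]

sample_df = pd.DataFrame(data, columns=col_names)
print(sample_df)
```

Building the full list of rows first and constructing the DataFrame once avoids filling an empty frame cell by cell, which is usually slower and harder to read.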

In [85]:
raw_table = souper.tbody
rows = raw_table.find_all('tr')
n_rows = len(rows)

# Column names come from the 'th' cells of the first row;
# strip() drops the trailing newline on the last heading
header_row = rows[0].find_all('th')
n_cols = len(header_row)
col_names = [th.text.strip() for th in header_row]

pc_df= pd.DataFrame(columns=col_names, index=range(0,len(rows)))
for i in range(1,n_rows):
    row_data = rows[i].find_all('td')
    row_data = [row_data[0].text, row_data[1].text, (row_data[2].text)[:-1]]
    for j in range(0,3):
        pc_df.iat[i,j] = row_data[j]
pc_df = pc_df[pc_df.Postcode.notnull()]
pc_df=pc_df.reset_index(drop=True)
pc_clean = pc_df[pc_df.Borough != 'Not assigned']
pc_clean=pc_clean.reset_index(drop=True)

# A spot check shows that at least one Borough has an unassigned Neighbourhood: we go through the dataframe and
# assign to that and any other similarly unassigned Neighbourhood the name of its respective Borough:

for indx in range(0,len(pc_clean)):
    if pc_clean.loc[indx,'Neighbourhood']=='Not assigned': 
        pc_clean.loc[indx,'Neighbourhood']=pc_clean.loc[indx,'Borough']
        print(pc_clean.loc[indx,'Neighbourhood'], ' had a "Not assigned" neighborhood, which now has been assigned the name of the Borough')
        
print('pc_clean.shape is: ',pc_clean.shape)


Queen's Park  had a "Not assigned" neighborhood, which now has been assigned the name of the Borough
pc_clean.shape is:  (212, 3)
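
The two cleaning steps (dropping unassigned Boroughs, then filling unassigned Neighbourhoods) can also be written without an explicit loop. The following is a sketch on an invented three-row sample, assuming the same column names as `pc_df`; it is an alternative formulation, not the cell that produced the output above.

```python
import pandas as pd

# Hypothetical sample with the same three columns as pc_df (invented rows)
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned Borough; copy() so the later assignment is safe
cleaned = sample[sample['Borough'] != 'Not assigned'].copy().reset_index(drop=True)

# Where the Neighbourhood is 'Not assigned', substitute the Borough name
unassigned = cleaned['Neighbourhood'] == 'Not assigned'
cleaned.loc[unassigned, 'Neighbourhood'] = cleaned.loc[unassigned, 'Borough']

print(cleaned)
```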


#### As requested in the exercise, for each Postal Code/Borough pair having multiple Neighborhoods we will collapse the group of Neighborhoods of that pair into a single entry by concatenating the names of those Neighborhoods into a single entry for that Postal Code/Borough pair.


In [86]:
pc_clean_groupby_neighbourhood=pc_clean.groupby(by=['Neighbourhood'],axis=0)
print(pc_clean_groupby_neighbourhood.size().sum(), pc_clean.shape[0])

212 212


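#### We now proceed to reshape our dataframe. Note that the cell below groups by Borough alone rather than by Postal Code/Borough pair, and that the Postcode column is filled by index alignment with the first rows of pc_clean, so the Postcode shown in each row does not correspond to that row's Borough: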

In [87]:
pc_clean_groupby_borough = pc_clean.groupby(by=['Borough'], axis=0)

frame = []
for name, group in pc_clean_groupby_borough:
    grp_membs = pc_clean_groupby_borough.get_group(name)
    neighs = ''
    for i in range(len(grp_membs)):
        neighs = neighs + grp_membs.iloc[i,2] +', '
    neighs = neighs[:-2]
    frame.append({'Borough':name, 'Neighborhoods':neighs})
    
frame_df=pd.DataFrame(frame)
postal_df = pc_clean['Postcode']
frame_df['Postcode']=postal_df
canada_nhds=frame_df[['Postcode','Borough','Neighborhoods']]
canada_nhds


| | Postcode | Borough | Neighborhoods |
|---|---|---|---|
| 0 | M3A | Central Toronto | Lawrence Park, Roselawn, Davisville North, For... |
| 1 | M4A | Downtown Toronto | Harbourfront, Regent Park, Ryerson, Garden Dis... |
| 2 | M5A | East Toronto | The Beaches, The Danforth West, Riverdale, The... |
| 3 | M5A | East York | Woodbine Gardens, Parkview Hill, Woodbine Heig... |
| 4 | M6A | Etobicoke | Islington Avenue, Cloverdale, Islington, Marti... |
| 5 | M6A | Mississauga | Canada Post Gateway Processing Centre |
| 6 | M7A | North York | Parkwoods, Victoria Village, Lawrence Heights,... |
| 7 | M9A | Queen's Park | Queen's Park |
| 8 | M1B | Scarborough | Rouge, Malvern, Highland Creek, Rouge Hill, Po... |
| 9 | M1B | West Toronto | Dovercourt Village, Dufferin, Little Portugal,... |
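
The exercise asks for one row per Postal Code/Borough pair. One way to get that, shown here as a sketch on invented sample rows rather than as the notebook's own method, is to group on both columns and join the Neighbourhood names within each group. The `sample` frame is hypothetical and only mimics the shape of `pc_clean`.

```python
import pandas as pd

# Hypothetical sample in the shape of pc_clean (invented subset of rows)
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# Group on the (Postcode, Borough) pair and join each group's names with ', ';
# sort=False keeps the groups in order of first appearance
grouped = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
          .agg(', '.join)
          .reset_index()
)
print(grouped)
```

Because the Postcode is part of the grouping key, it stays attached to the correct Borough and no separate column assignment is needed afterwards.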


#### As requested in the exercise, the shape of the final dataframe, "canada_nhds", is determined:

In [79]:
canada_nhds.shape

(11, 3)