# <font color='blue'><center>Exploring the Toronto Neighborhoods</center></font>

## <font color='green'><center>Part 1 - Dataset Formation</center></font>

In [20]:
# Import libraries
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

<b>The function defined below retrieves the raw HTML data from the Wikipedia URL: <br>https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.</b>

In [21]:
def getRawDataContent(urlString, contentParser):
    """
    Function to extract the raw HTML data using the URL.
    :param urlString: The URL containing the data to be retrieved.
    :param contentParser: The HTML parser within BeautifulSoup to extract the content.
    :return rawData: The raw HTML data.
    """
    urlResponse = urllib.request.urlopen(urlString)
    rawContent = urlResponse.read().decode() 
    rawData = BeautifulSoup(rawContent, contentParser)
    return rawData
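
A slightly more defensive fetch could look like the sketch below. It assumes that some servers reject requests lacking a User-Agent header and that a timeout is desirable. The names `build_request` and `fetch_html`, the header value, and the 10-second timeout are illustrative choices, not part of the notebook's own code.

```python
import urllib.request


def build_request(url_string, user_agent="toronto-notebook-example/0.1"):
    """Build a Request carrying an explicit User-Agent header (hypothetical value)."""
    return urllib.request.Request(url_string, headers={"User-Agent": user_agent})


def fetch_html(url_string, timeout_seconds=10):
    """Fetch a page and decode it with the charset the server reports (UTF-8 if none)."""
    request = build_request(url_string)
    # The context manager closes the connection even if reading fails.
    with urllib.request.urlopen(request, timeout=timeout_seconds) as url_response:
        charset = url_response.headers.get_content_charset() or "utf-8"
        return url_response.read().decode(charset)


# Building the request does not touch the network; only fetch_html does.
example_request = build_request(
    "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
)
```

The string returned by `fetch_html` could then be passed to `BeautifulSoup` in the same way `getRawDataContent` does.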

<b>Next, we define a function that parses the relevant rows of the HTML table describing the neighborhoods in Toronto. Each row is converted to a list, and the function returns a list of these lists.</b>

In [22]:
def generateDataset(rawData):
    """
    Function to get a list of data elements which would be used later to create a dataframe.
    :param rawData: The raw HTML data extracted using BeautifulSoup.
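    :return dataList: A list of rows, where each row is a list of cell strings.
    """
    dataList = []
    # Take the rows of the first <tbody> inside the main content div.
    divData = rawData.find('div', class_='mw-parser-output')
    dataTable = divData.find('tbody').find_all('tr')
    for tableRow in dataTable:
        colData = tableRow.find_all('td')
        colData = [dataElement.text.strip() for dataElement in colData]
        # Rows without <td> cells (such as a header row of <th> cells) give an empty list and are skipped.
        if colData:
            dataList.append(colData)
    return dataList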
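
To see what the row parsing produces, here is a minimal sketch that applies the same steps to a tiny hand-written table. The HTML string is a made-up miniature of the Wikipedia page, and the built-in `html.parser` is used here instead of `lxml` only so that the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<div class="mw-parser-output">
  <table>
    <tbody>
      <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
      <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
      <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table_rows = soup.find("div", class_="mw-parser-output").find("tbody").find_all("tr")

rows = []
for table_row in table_rows:
    # Collect the stripped text of every <td> cell in this row.
    cells = [cell.text.strip() for cell in table_row.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        rows.append(cells)

print(rows)
```

Each inner list has three entries, which is why the list can later be loaded into a dataframe with three column names.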

<b>Let us now invoke the functions defined above with the Wikipedia URL to obtain the output list.</b>

In [23]:
wikiURL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wikiData = getRawDataContent(wikiURL, 'lxml')
dataList = generateDataset(wikiData)

<b>Now, we convert the output list to a pandas dataframe with the required column names. A few sample rows of the dataframe are also displayed.</b>

In [24]:
torontoDataHeaderList = ['PostalCode', 'Borough', 'Neighborhood']
torontoNeighborhoods = pd.DataFrame(dataList, columns=torontoDataHeaderList)
torontoNeighborhoods.head()

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [25]:
# Check how many rows the dataframe has at this point
torontoNeighborhoods.shape

(180, 3)

<b>We only process the cells that have an assigned borough. Cells whose borough is <font color='red'>'Not assigned'</font> are ignored.</b>

In [26]:
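# Take an explicit copy so that later in-place assignments do not act on a view of the original frame
torontoNeighborhoods = torontoNeighborhoods[torontoNeighborhoods.Borough != 'Not assigned'].copy()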
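
The filtering step can be tried on a small frame first. The sketch below uses three made-up rows; `sample`, `has_borough`, and `assigned` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the notebook.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for rows whose borough is known.
has_borough = sample["Borough"] != "Not assigned"

# Keep only the rows with a known borough, as an independent copy.
assigned = sample[has_borough].copy()
print(len(sample), len(assigned))
```

The original index labels are kept after filtering, so the surviving rows here still carry the labels 1 and 2.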

<b>If a cell has a valid borough, but the neighborhood is <font color='red'>'Not assigned'</font>, then the neighborhood will be the same as the borough.</b>

In [27]:
# The assignment aligns on the index, so each matching row receives its own borough
torontoNeighborhoods.loc[torontoNeighborhoods['Neighborhood'] == 'Not assigned', 'Neighborhood'] = torontoNeighborhoods['Borough']
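
As a small illustration of this fallback rule, the sketch below applies it to two made-up rows, one of which has no neighborhood. The postal codes and names are invented for the example, and `unassigned` is an illustrative variable name.

```python
import pandas as pd

# Made-up rows: the first has a borough but no neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["X1A", "X2A"],
    "Borough": ["Example Borough", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Mask selecting the rows whose neighborhood is missing.
unassigned = sample["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample["Neighborhood"].tolist())
```

Rows that already have a neighborhood are left untouched because the mask is False for them.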

<b>If more than one neighborhood exists in the same postal code area, the corresponding rows are combined into a single row, with the neighborhoods separated by commas.</b>

In [28]:
torontoNeighborhoods = torontoNeighborhoods.groupby(['PostalCode', 'Borough'], sort=False).agg(', '.join).reset_index()
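
The grouping step is easiest to follow on a frame where one postal code really does span two rows. The sketch below assumes such a layout with made-up rows; it selects the `Neighborhood` column explicitly before aggregating, which is a small variation on the cell above rather than the notebook's exact call.

```python
import pandas as pd

# Made-up rows: postal code M5A appears twice with different neighborhoods.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

combined = (
    sample.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)   # join the neighborhood names within each group
    .reset_index()    # turn the group keys back into ordinary columns
)
print(combined)
```

With `sort=False`, the groups keep the order in which each postal code first appears, so M5A stays ahead of M3A here.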

In [34]:
torontoNeighborhoods.head(10)

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [30]:
torontoNeighborhoods.describe()

,PostalCode,Borough,Neighborhood
count,103,103,103
unique,103,10,99
top,M5A,North York,Downsview
freq,1,24,4


<b>Now, we write this dataframe to a CSV file.</b>

In [31]:
torontoNeighborhoods.to_csv('TorontoNeighborhoods.csv', index=False)
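
One way to confirm that the export is lossless is a write-then-read round trip. The sketch below does this with an in-memory buffer instead of a file on disk and uses made-up rows; `buffer` and `restored` are illustrative names.

```python
import io

import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

# Write without the index, as in the cell above, then read the text back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```

Because `index=False` omits the row index, the file read back has exactly the three data columns and no extra unnamed index column. Values containing commas are quoted on write, so they survive the round trip as single fields.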

<b>Let us take a look at the shape of our dataframe.</b>

In [32]:
torontoNeighborhoods.shape

(103, 3)

In [35]:
torontoNeighborhoods.loc[torontoNeighborhoods['PostalCode']=='M9V']

,PostalCode,Borough,Neighborhood
89,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
