# Notebook 1: Build the code to scrape the Wikipedia list of Toronto postal codes
With special thanks to [Syed Sadat Nazrul](https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059)

## Objectives:
Create a dataframe of Toronto postal codes:
1. The dataframe consists of three columns: PostalCode, Borough, and Neighborhood
2. Only process the cells that have an assigned borough. Ignore cells whose borough is Not assigned.
3. More than one neighborhood can exist in one postal code area. Such rows will be combined into one row, with the neighborhoods separated by commas.
4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
7. Submit a link to your Notebook on your GitHub repository. (10 marks)
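
The cleaning rules in objectives 2 to 4 can be sketched on a small hand-made table. This is an illustrative sketch, not the notebook's own implementation: the helper name `clean_postal_table` and the sample rows are made up, and the column names are assumed to match the scraped header.

```python
import pandas as pd

def clean_postal_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply objectives 2-4 to a table with Postal Code, Borough, Neighborhood columns."""
    # Objective 2: keep only rows with an assigned borough
    df = raw[raw["Borough"] != "Not assigned"].copy()
    # Objective 4: a 'Not assigned' neighborhood takes the borough's name
    missing = df["Neighborhood"] == "Not assigned"
    df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]
    # Objective 3: merge rows sharing a postal code, joining neighborhoods with commas
    return (
        df.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )

# Hypothetical sample rows, chosen to exercise each rule once
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})
cleaned = clean_postal_table(sample)
print(cleaned)
```

Grouping on both the postal code and the borough keeps the borough column in the result without needing a second aggregation.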

In [1]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
     |████████████████████████████████| 5.5MB 3.2MB/s eta 0:00:01
Installing collected packages: lxml
Successfully installed lxml-4.5.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests
import lxml.html as lh
import pandas as pd

In [3]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#Create a handle, page, to handle the contents of the website
page = requests.get(url)

#Store the contents of the website under doc
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [4]:
#Check the number of cells in each of the first 12 rows
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [5]:
#Create empty list
col=[]
i=0

#For each cell of the header row, store its text (the column name) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print('%d, %s' % (i,name))
    col.append((name,[]))

1, Postal Code

2, Borough

3, Neighborhood



In [6]:
#Since our first row is the header, data is stored from the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If the row does not have exactly 3 cells, the //tr data is no longer from our table, so stop
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Skip the first column (postal code); for the others,
        #convert any purely numerical value to an integer
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
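
The same row-by-row extraction can also be written by building one dictionary per row. The following is a minimal sketch, assuming lxml is installed; the inline HTML snippet stands in for the downloaded page, and the name `rows_from_table` is hypothetical. Unlike the loop above, it skips rows of the wrong width instead of stopping at the first one.

```python
import lxml.html as lh

# Small stand-in for the Wikipedia table; cells end with a newline, as on the real page
SAMPLE_HTML = """<table>
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M4A
</td><td>North York
</td><td>Victoria Village
</td></tr>
</table>"""

def rows_from_table(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    doc = lh.fromstring(html)
    tr_elements = doc.xpath('//tr')
    # Header cells carry a trailing newline, so strip the whitespace
    header = [cell.text_content().strip() for cell in tr_elements[0]]
    rows = []
    for tr in tr_elements[1:]:
        # Skip rows whose cell count differs from the header
        if len(tr) != len(header):
            continue
        values = [cell.text_content().strip() for cell in tr]
        rows.append(dict(zip(header, values)))
    return rows

rows = rows_from_table(SAMPLE_HTML)
print(rows[0])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(rows)`.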

In [7]:
# Consistency check:
[len(C) for (title,C) in col]

[181, 181, 181]

In [8]:
# Create the dictionary and put it into a dataframe:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

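# Strip the trailing newline from the column names (literal match, independent of the pandas regex default)
df.columns = df.columns.str.replace("\n", "", regex=False)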
df.replace('\n', '', regex=True, inplace=True)

In [9]:
# Check the size of the data frame:
print(df.shape)

(181, 3)


In [10]:
# Drop rows that are entirely empty, then drop rows whose 'Borough' is 'Not assigned':
df = df.dropna(how = 'all')
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace = True)

In [11]:
# Check that all 'Neighborhood' cells are holding a value
print('Number of cells in neighbourhood that are empty: ', df['Neighborhood'].isna().sum())

Number of cells in neighbourhood that are empty:  0
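
Note that `isna()` only counts genuinely missing values, so a neighborhood stored as the text 'Not assigned' would not show up in the count above. A hedged sketch of a stricter check on a made-up two-row frame (the name `toy` and its rows are illustrative only):

```python
import pandas as pd

# Hypothetical rows: one real neighborhood, one placeholder string
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# isna() only detects real missing values (NaN/None), not placeholder text
n_nan = int(toy["Neighborhood"].isna().sum())

# Count placeholder or blank strings explicitly
text = toy["Neighborhood"].astype(str).str.strip()
n_placeholder = int(text.isin(["Not assigned", ""]).sum())

print(n_nan, n_placeholder)
```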


In [12]:
df.head()

|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [13]:
# Check the size of the data frame:
print(df.shape)

(104, 3)


In [14]:
# Save the dataframe to a CSV file for further processing in a different notebook:
df.to_csv("toronto_clean.csv", index=False)
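
To confirm that the saved file can be read back unchanged in the next notebook, a round trip can be tried on a small frame. This sketch writes to an in-memory buffer rather than to `toronto_clean.csv`, and the name `toy` and its rows are made up for illustration. Passing `index=False` matters here: without it the row index is written as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataframe
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Write the CSV to an in-memory text buffer without the row index
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)
```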