URL for the required data:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Summary of Instructions for preparing the data:

1) The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
2) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
3) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
4) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
5) Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6) In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [7]:
#ad 1)
import pandas

#import requests for http-query
import requests

#import beautiful soup for scraping the website
from bs4 import BeautifulSoup

complete_website = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#the page is HTML, so parse it with an HTML parser rather than the 'xml' parser
complete_soup = BeautifulSoup(complete_website, 'html.parser')

data_table = complete_soup.find('table',{'class':'wikitable sortable'})
table_rows = data_table.find_all('tr')

#strip() returns a copy of the string with leading and trailing whitespace removed.
#The header row has no 'td' cells, so it produces an empty list (row 0 below).

my_list = []
for row in table_rows:
    my_list.append([col.text.strip() for col in row.find_all('td')])

# complete requirement 1

df = pandas.DataFrame(my_list, columns=['PostalCode', 'Borough', 'Neighborhood'])

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
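
Row 0 of the output above is empty, most likely because the table's header row is built from th cells, so row.find_all('td') returns an empty list for it. The following is a minimal sketch of skipping such rows at parse time. It runs on a small inline HTML snippet rather than the live page; sample_html, rows and sample_df are made-up names for illustration, and the snippet is only assumed to resemble the structure of the Wikipedia table.

```python
import pandas
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', class_='wikitable')

rows = []
for row in table.find_all('tr'):
    cells = [col.text.strip() for col in row.find_all('td')]
    # The header row has only <th> cells, so cells is empty there: skip it.
    if cells:
        rows.append(cells)

sample_df = pandas.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(sample_df)
```

With the empty row dropped at this stage, the later null check on Borough would no longer be needed for the header row.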


In [14]:
#ad 2)
# exclude nulls from the dataframe; this is not explicitly stated in 2) but seems straightforward
df2 = df[~df['Borough'].isnull()]

# additionally exclude "Not assigned" instances from the dataframe

mask = df2['Borough'].isin(['Not assigned'])
df2=df2[~mask]
df2.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
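
The two filtering steps above can also be written as one boolean mask. This is only a sketch on a few made-up rows that mimic the scraped table (raw, keep and filtered are illustrative names); resetting the index afterwards is an optional choice, not something the assignment asks for.

```python
import pandas

# Made-up rows that mimic the scraped table, including the empty header row.
raw = pandas.DataFrame(
    [[None, None, None],
     ['M1A', 'Not assigned', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods'],
     ['M4A', 'North York', 'Victoria Village']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# Keep rows whose borough is present and is not the 'Not assigned' placeholder.
keep = raw['Borough'].notna() & (raw['Borough'] != 'Not assigned')

# reset_index(drop=True) renumbers the remaining rows from 0.
filtered = raw[keep].reset_index(drop=True)
print(filtered)
```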


In [33]:
#ad 3) Check for duplicate postal codes
print("This is the data type of the variable df2: >>>>>>>",type(df2))
print("This is the size of the dataframe: >>>>>>>",df2.shape)
unique_values=df2['PostalCode'].unique().shape


if unique_values[0]==df2.shape[0]:
    print("SUCCESS and GOOD LUCK: There are no duplicates")
else:
    print("remove remaining duplicates")
    



This is the data type of the variable df2: >>>>>>> <class 'pandas.core.frame.DataFrame'>
This is the size of the dataframe: >>>>>>> (103, 3)
SUCCESS and GOOD LUCK: There are no duplicates
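
The check found no duplicates, so nothing had to be merged here. Had the else branch been reached, one possible way to combine the rows is a groupby on PostalCode and Borough that joins the neighborhoods with a comma. The sketch below uses the M5A example from instruction 3 on a small hand-built dataframe; dupes and combined are illustrative names.

```python
import pandas

# Hand-built rows reproducing the M5A case described in instruction 3.
dupes = pandas.DataFrame(
    [['M5A', 'Downtown Toronto', 'Harbourfront'],
     ['M5A', 'Downtown Toronto', 'Regent Park'],
     ['M3A', 'North York', 'Parkwoods']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# One row per (PostalCode, Borough); neighborhoods are joined with ', '.
# sort=False keeps the groups in order of first appearance.
combined = (
    dupes.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
         .agg(', '.join)
         .reset_index()
)
print(combined)
```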


In [57]:
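#ad 4) Check for "Not assigned" neighbourhoods
# df3 keeps only the rows whose neighbourhood is assigned; if its size equals
# that of df2, no neighbourhood needs to be replaced by its borough.

nbh_checker_mask=df2['Neighborhood'].isin(['Not assigned'])
df3=df2[~nbh_checker_mask]
df3.shape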


(103, 3)

In [58]:
#ad 4) Check if the size has changed
if df3.shape[0] == df2.shape[0]:
    print("No change to the dataframe is needed: there are no unassigned neighbourhoods")
else:
    print("... we need some additional data transformation...")

No change to the dataframe is needed: there are no unassigned neighbourhoods
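
The cells above only check whether any neighborhood is Not assigned. If some had been found, instruction 4 asks for the borough name to be used instead. A minimal sketch of that replacement follows; the first row is invented for illustration (it is not taken from the Wikipedia table), and sample, unassigned and fixed are illustrative names.

```python
import pandas

# The first row is a made-up example of a borough without a neighborhood.
sample = pandas.DataFrame(
    [['M0X', 'Sample Borough', 'Not assigned'],
     ['M3A', 'North York', 'Parkwoods']],
    columns=['PostalCode', 'Borough', 'Neighborhood'])

# Mark rows whose neighborhood is the 'Not assigned' placeholder.
unassigned = sample['Neighborhood'] == 'Not assigned'

# Work on a copy and overwrite the placeholder with the borough name.
fixed = sample.copy()
fixed.loc[unassigned, 'Neighborhood'] = fixed.loc[unassigned, 'Borough']
print(fixed)
```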


In [None]:
#ad 5) I have included one Markdown cell and several comments - hope this is o.k.

In [59]:
#ad 6) size of the resulting dataframe
print("The size of the DataFrame is:>>>>> ", df3.shape)

The size of the DataFrame is:>>>>>  (103, 3)
