<h1>Web Scrape Example in Python</h1>

<strong>Welcome!</strong> In this notebook we will scrape the Wikipedia page <a>https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a> to extract its table of postal codes and load the data into a pandas dataframe.

Dataframe contents:

- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
- More than one neighborhood can exist in one postal code area. Rows that share a postal code will be combined into one row, with the neighborhoods separated by commas.

In [5]:
# !pip install beautifulsoup4

In [6]:
# required libraries
import requests #to handle requests
from bs4 import BeautifulSoup #to parse structured data

In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#Access the url with requests library
response = requests.get(url)
print(response) #200 means access was successful

<Response [200]>
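
Printing the response is fine for a quick check, but a script usually needs to stop on an error status instead of parsing an error page. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook):

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of continuing.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would return the same bytes as `response.content` in the cell above, or raise if the request failed.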


In [8]:
#parse the html with BeautifulSoup to work with a nicer, 
#nested BeautifulSoup data structure
data_soup = BeautifulSoup(response.content, 'html.parser')
# data_soup

The table carries the classes wikitable sortable, which can be seen by right-clicking the table in the browser and choosing Inspect. The find function is therefore used to search for a table with the wikitable sortable class.

In [9]:
results=data_soup.find('table', class_='wikitable sortable')
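
One caveat: when `class_` is given a string containing a space, BeautifulSoup compares it against the exact value of the `class` attribute, so the lookup returns `None` if the page adds or reorders classes. A CSS selector checks each class independently. The sketch below uses a made-up HTML snippet (an assumption, not the real Wikipedia markup) to show the difference:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; the third class is an assumption.
html = (
    '<table class="wikitable sortable jquery-tablesorter">'
    '<tr><th>Postcode</th></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match: fails because the attribute holds a third class.
exact = soup.find('table', class_='wikitable sortable')

# CSS selector: requires both classes, in any order, and ignores extras.
by_selector = soup.select_one('table.wikitable.sortable')

print(exact)                    # None
print(by_selector is not None)  # True
```

If `find` ever returns `None` on the live page, switching to the selector form is one option worth trying.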

A table in HTML is made up of rows denoted by the tags **\<tr>\</tr>**.
Each row has cells, which are either headings defined using **\<th>\</th>** or data defined using **\<td>\</td>**.

In [11]:
columns=results.find_all('th')
rows=results.find_all('tr')[1:] #slicing to get rid of headings
# print(results)
print(columns)
print(rows[0])

[<th>Postcode</th>, <th>Borough</th>, <th>Neighbourhood
</th>]
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>


To get the text between the opening and closing tags of an element, **get_text()** is used. Additionally, **strip()** removes any leading and trailing whitespace in the text, including a trailing newline character \n such as the one in the last cell above. If needed, other specific characters can be removed explicitly.
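
As a small illustration, the sketch below applies both calls to a single cell whose text is copied from the row printed above. The `get_text(strip=True)` form is shown as a shorter alternative; it is not what the notebook uses.

```python
from bs4 import BeautifulSoup

# One cell from the row printed above, including its trailing newline.
cell = BeautifulSoup('<td>Not assigned\n</td>', 'html.parser').td

raw = cell.get_text()               # text exactly as it appears in the HTML
stripped = cell.get_text().strip()  # leading/trailing whitespace removed
short = cell.get_text(strip=True)   # strips each text fragment before joining

print(repr(raw))       # 'Not assigned\n'
print(repr(stripped))  # 'Not assigned'
print(repr(short))     # 'Not assigned'
```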

In [20]:
#creating headers by extracting the text of each <th> cell in a list comprehension
headers = [column.get_text().strip() for column in columns]
# for i in range(0,len(columns)):
#      columns[i]=str(columns[i]).replace('<th>', '').replace('</th>', '').replace('\n', '')
print(headers)

data_content = []
for row in rows:
    cells = row.find_all('td')  # all table data cells, marked by <td>
    # keep the text of the first three cells: postcode, borough, neighbourhood
    data_content.append([cell.get_text().strip() for cell in cells[:3]])
print(data_content[0])

['Postcode', 'Borough', 'Neighbourhood']
['M1A', 'Not assigned', 'Not assigned']


Next, the pandas library is used to convert the data into a tabular structure and to manipulate it efficiently with built-in functions.

In [25]:
import pandas as pd
#creating a dataframe from the data, using the scraped headers as column names
df=pd.DataFrame(data=data_content,columns=headers)
# df.columns = [columns[0],columns[1],columns[2]]
df.head()#to show first five rows

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront


In [31]:
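#Ignore cells with a borough that is Not assigned.
#.copy() makes df2 an independent dataframe, so the assignment below is safe
df2 = df[df.Borough != 'Not assigned'].copy()
#replacing not assigned neighbourhoods with the borough names
not_assigned = df2.Neighbourhood == 'Not assigned'
df2.loc[not_assigned, 'Neighbourhood'] = df2.loc[not_assigned, 'Borough']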
# d2 = df[df.Neighbourhood != 'Not assigned'] and df[df.Borough == 'Not assigned']
df2.head()

  Postcode           Borough     Neighbourhood
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
5      M6A        North York  Lawrence Heights
6      M6A        North York    Lawrence Manor
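
None of the first five rows has an unassigned neighbourhood, so the effect of the fill step is not visible in this output. The sketch below runs the filter and fill on a few hand-made rows. The last row is invented purely for illustration, and the boolean mask with `.loc` on a copy is one way to do the fill without chained-assignment warnings.

```python
import pandas as pd

toy = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
        ['M0X', 'Example Borough', 'Not assigned'],  # invented row
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

# Keep only rows with an assigned borough; copy so later writes are safe.
assigned = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is missing, fall back to the borough name.
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

print(assigned)
```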


In [36]:
#Group rows that share a postal code and borough, join their
#neighbourhoods with ', ' and reset the index
df3=df2.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
# df=df.groupby('Postcode')['Borough','Neighbourhood'].agg(', '.join).reset_index()
print(df3.head())
print('\n')
#checking for particular postal code
print(df3.loc[df3['Postcode']=='M9V'])

  Postcode      Borough                           Neighbourhood
0      M1B  Scarborough                          Rouge, Malvern
1      M1C  Scarborough  Highland Creek, Rouge Hill, Port Union
2      M1E  Scarborough       Guildwood, Morningside, West Hill
3      M1G  Scarborough                                  Woburn
4      M1H  Scarborough                               Cedarbrae


    Postcode    Borough                                      Neighbourhood
101      M9V  Etobicoke  Albion Gardens, Beaumond Heights, Humbergate, ...


In [37]:
# Checking the number of rows of the dataframe
df3.shape

(103, 3)
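
Beyond the row count, the rules listed at the top of the notebook can be checked directly. A minimal sketch, assuming a hypothetical helper named `check_cleaned` and a two-row sample in the same shape as `df3`:

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned table breaks one of the rules."""
    assert list(frame.columns) == ['Postcode', 'Borough', 'Neighbourhood']
    assert frame['Postcode'].is_unique, 'each postal code should appear once'
    assert not (frame['Borough'] == 'Not assigned').any()
    assert not (frame['Neighbourhood'] == 'Not assigned').any()


sample = pd.DataFrame(
    [
        ['M3A', 'North York', 'Parkwoods'],
        ['M6A', 'North York', 'Lawrence Heights, Lawrence Manor'],
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

check_cleaned(sample)  # passes silently when all rules hold
```

In the notebook itself the call would be `check_cleaned(df3)`.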

In [40]:
#saving to a csv file
df3.to_csv('toronto_neighburhoods.csv',index=False)
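
Passing `index=False` keeps the row index out of the file, so the CSV reads back with only the three data columns. The sketch below shows this on a one-row sample, writing to an in-memory string instead of a file (an assumption made to keep the example self-contained):

```python
import io

import pandas as pd

sample = pd.DataFrame(
    [['M6A', 'North York', 'Lawrence Heights, Lawrence Manor']],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = sample.to_csv(index=False)

# Read the text back; the comma inside the neighbourhood value is quoted.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(list(restored.columns))
```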

This notebook is part of an assignment for the **Coursera** course *Applied Data Science Capstone*. The course can be taken online [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2).