# Assignment Summary

###  Part 1 <br>First you will see how data from the Wikipedia page on Canadian postal codes is captured and converted <br> into a pandas DataFrame for further analysis. Various web scraping and data wrangling techniques will be used to cleanse and prepare the data.

###  Part 2 <br> You'll see how addresses are converted into their equivalent latitude and longitude values.  This step is necessary to visualize <br>the converted source data on a map.

###  Part 3 <br>The Foursquare API will be used to explore neighborhoods in Toronto. Its explore feature will be used to find the most common venue categories in <br>each neighborhood, and these categories will then be used to group the neighborhoods into clusters. The *k*-means clustering algorithm <br>will be used to complete this task.  The Folium library will be used to visualize the neighborhoods in Toronto and their emerging clusters. <br>Finally, an analysis of the clusters will be given, highlighting key observations about each cluster.<br>

# Table of contents

### Part 1
[1. Create Notebook and download libraries and packages](#1.1.0)<br>
[2. Download Wikipedia page, extract PostalCode, Borough & Neighbourhood table](#1.2.0)<br>
[3.a) Show table column headers, indexing and dataframe size](#1.3.0)<br>
[3.b) Only process cells that have an assigned Borough & ignore those Not assigned](#1.3.1)<br>
[3.c) Combine Neighbourhoods with same Postcode like M5A](#1.3.2)<br>
[3.d) If a cell has a Borough but a Not assigned Neighbourhood, the Neighbourhood gets the Borough's value](#1.3.3)<br>
[3.e&f) Show Number of (Rows, Columns) of new dataframe.](#1.3.4)<br>
<br>

# Part 1


## 1. Create Notebook and download libraries and packages <a name="1.1.0"></a>

In [1]:

# import libraries
import pandas as pd # library to analyze data
import numpy
import requests # library to handle web requests
from bs4 import BeautifulSoup
#
print('Libraries imported')

Libraries imported


## 2. Download Wikipedia page, extract PostalCode, Borough & Neighbourhood table <a name="1.2.0"></a>

### <font color=green>_The pandas library can read HTML tables directly from a URL. It uses an HTML parser to process the content of a given page and extract the tables it finds.<br>The read_html method returns a list of DataFrames.<br>The attribute class="wikitable sortable" is used to isolate the table we want to extract from the Wikipedia page.<br>header=0 means that row 0 of the table is used for the column headers.<br>The attrs argument takes a Python dictionary of attributes and keeps only the tables whose HTML attributes match it.<br>Finally, we print the number of tables extracted, which is the length of the list wikitables._</font>
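To see how the attrs filter narrows the result, here is a minimal offline sketch. The HTML snippet and the name `sample_df` are made up for illustration and are not taken from the Wikipedia page. It also assumes an HTML parser such as lxml is installed, which `pd.read_html` requires.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second carries the target class.
html = """
<table class="infobox">
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>Country</td><td>Canada</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table that matches attrs.
tables = pd.read_html(StringIO(html), header=0, attrs={"class": "wikitable sortable"})
sample_df = tables[0]

print("Extracted {num} tables".format(num=len(tables)))
print(sample_df)
```

Because the result is always a list, the table of interest still has to be selected by position, even when only one table matches.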


In [2]:
url ='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikitables = pd.read_html(url, header=0, attrs={"class":"wikitable sortable"})
print ("Extracted {num} wikitables".format(num=len(wikitables)))
wikidf = wikitables[0]
wikidf.head(10)

Extracted 1 wikitables


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


### <font color=orange> _You'll notice the table above has duplicate postal codes and cells with Not assigned in the Borough and Neighbourhood columns_</font>
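Both problems can also be counted rather than spotted by eye. The sketch below uses a small hand-made frame (`sample`, a hypothetical stand-in for the scraped table) to flag repeated postal codes and count the Not assigned cells.

```python
import pandas as pd

# Hypothetical sample that mirrors a few rows of the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# keep=False flags every row whose Postcode occurs more than once.
dup_rows = sample[sample["Postcode"].duplicated(keep=False)]

# Count the 'Not assigned' cells in each of the two text columns.
not_assigned = (sample[["Borough", "Neighbourhood"]] == "Not assigned").sum()

print(dup_rows)
print(not_assigned)
```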

## 3.a) Show table column headers, indexing and dataframe size<a name="1.3.0"></a>


### <font color=green> _To do this we use the built-in DataFrame attributes columns, index and shape._</font>

In [3]:
wikidf.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [4]:
wikidf.index

RangeIndex(start=0, stop=288, step=1)

In [5]:
wikidf.shape

(288, 3)

## 3.b) Only process cells that have an assigned Borough & ignore those Not assigned<a name="1.3.1"></a>

## Remove table rows where Borough is Not assigned

In [15]:
wikidf = wikidf[wikidf.Borough != 'Not assigned'] # != means not equal to Not assigned
wikidf.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


### <font color=orange> _Notice that the dataframe above has fewer rows, as shown by the gaps in the index in the first column. The rows with a Not assigned Borough were removed from the dataframe._</font>

In [7]:
wikidf.shape

(211, 3)

### <font color=green>_We verify the number of rows removed by looking at the size of the dataframe. In this case the output shows that the number of rows declined from 288 to 211._</font>
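The same check can be written out on a small example. This is only a sketch on made-up data (`sample` is hypothetical). It also shows that filtering keeps the original index labels, and that `reset_index(drop=True)` is one optional way to renumber them.

```python
import pandas as pd

# Hypothetical sample: the first two rows have no assigned Borough.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

# Boolean mask: True for rows that have an assigned Borough.
mask = sample["Borough"] != "Not assigned"
filtered = sample[mask]

# The number of rows removed is the difference in length.
removed = len(sample) - len(filtered)
print("Rows removed:", removed)

# The filtered frame keeps the old index labels; renumbering is optional.
renumbered = filtered.reset_index(drop=True)
print(list(filtered.index), list(renumbered.index))
```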

## 3.c) Combine Neighbourhoods with same Postcode like M5A<a name="1.3.2"></a>

### <font color=green>_In order to combine rows that share a Postcode, we use the unique method to get one row per Postcode. Since a Postcode can have multiple Neighbourhood values, we build a pandas Series that holds the list of values for each Postcode.<br>Then we join each list into a comma-separated string using a lambda expression, which removes the list brackets._</font>

In [8]:
# create unique values for Postcode column
new_df = pd.DataFrame({'Postcode':wikidf.Postcode.unique()})
# Add the Borough for each Postcode to the new dataframe (each Postcode belongs to a single Borough)
new_df['Borough']=pd.Series(wikidf['Borough'].loc[wikidf['Postcode'] == pc['Postcode']].iloc[0] for i, pc in new_df.iterrows())
# Iterates over the rows of the dataframe to add series of multiple Neighbourhoods in list into Neighbourhood column
new_df['Neighbourhood']=pd.Series(list(set(wikidf['Neighbourhood'].loc[wikidf['Postcode'] == pc['Postcode']])) for i, pc in new_df.iterrows())
# remove unwanted [] from text in Neighbourhood column of new dataframe
new_df['Neighbourhood']=new_df['Neighbourhood'].apply(lambda pc: ', '.join(pc))
#
new_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
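As an alternative to iterating over rows, the same combination can be sketched with `groupby`. This is not the method used in the cell above: it is a swapped-in technique shown on made-up data (`sample` and `combined` are hypothetical names). Unlike the set-based approach, it keeps the neighbourhoods in their original order and does not drop repeated names.

```python
import pandas as pd

# Hypothetical sample with two postal codes that span several neighbourhoods.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Group by Postcode and Borough, then join the neighbourhood names.
# sort=False keeps the groups in order of first appearance.
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```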


## 3.d) If a cell has a Borough but a Not assigned Neighbourhood, then the Neighbourhood gets the same value as the Borough.<a name="1.3.3"></a><br>
### <font color=green>_For Boroughs whose Neighbourhood is Not assigned, replace Not assigned with the Borough's cell value._<br> # See M7A Queen's Park in row index 4</font>

In [16]:
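# Writing to the rows yielded by iterrows() is not guaranteed to change the dataframe,
# so select the affected rows with a boolean mask and assign through .loc instead
mask = new_df['Neighbourhood'] == 'Not assigned'
new_df.loc[mask, 'Neighbourhood'] = new_df.loc[mask, 'Borough']
new_df.head(20) # See M7A Queen's Park in row index 4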

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


### <font color=orange>_Notice the change in the M7A row: Queen's Park now appears in both the Borough and Neighbourhood columns._</font>
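The replacement rule can also be expressed without a loop. The sketch below uses `Series.mask` on a made-up frame (`sample` is hypothetical). `mask` swaps in the Borough wherever the condition is true and returns a new Series, which is then assigned back to the column.

```python
import pandas as pd

# Hypothetical sample: M7A has a Borough but no assigned Neighbourhood.
sample = pd.DataFrame({
    "Postcode": ["M6A", "M7A", "M9A"],
    "Borough": ["North York", "Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Lawrence Heights, Lawrence Manor", "Not assigned",
                      "Islington Avenue"],
})

# Where Neighbourhood is 'Not assigned', take the value from Borough instead.
unassigned = sample["Neighbourhood"] == "Not assigned"
sample["Neighbourhood"] = sample["Neighbourhood"].mask(unassigned, sample["Borough"])

print(sample)
```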

## 3.e&f) Show Number of (Rows, Columns) of new dataframe.<a name="1.3.4"></a><br><br>

In [10]:
new_df.shape

(103, 3)

### <font color=green>_We verify the number of rows combined by looking at the size of the dataframe.<br> In this case the output shows that the number of rows declined from 211 to 103.<br> The dataframe is now formatted and cleaned and ready for Part 2_</font>
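Beyond the row count, a few simple checks can confirm that the cleaning rules hold. The following is a minimal sketch: `check_cleaned` is a hypothetical helper and `cleaned` is a small made-up stand-in for the final dataframe, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
cleaned = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M7A"],
    "Borough": ["North York", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Regent Park, Harbourfront", "Queen's Park"],
})


def check_cleaned(df):
    """Return a dict of pass/fail flags for the three cleaning rules."""
    return {
        # Each Postcode should appear exactly once after combining rows.
        "unique_postcodes": bool(df["Postcode"].is_unique),
        # No row should be left without an assigned Borough.
        "borough_assigned": not (df["Borough"] == "Not assigned").any(),
        # No Neighbourhood should still read 'Not assigned'.
        "neighbourhood_assigned": not (df["Neighbourhood"] == "Not assigned").any(),
    }


checks = check_cleaned(cleaned)
print(checks)
```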