# Coursera Capstone - Week 3

This notebook has three parts, as required by the assignment.  
Use the links below to jump directly to a part.
>
> <a href="#Part-1:-Creating-a-Dataframe-by-Webscrapping">Part 1: Creating a Dataframe by Webscrapping </a>
>
> <a href="#Part-2:-Getting-Coordinates">Part 2: Getting Coordinates </a>
>
> <a href="#Part-3:-Exploring-and-Clustering-Neighborhoods">Part 3: Exploring and Clustering Neighborhoods </a>

But first, let's import all the libraries that will be used in this notebook.

In [97]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from opencage.geocoder import OpenCageGeocode

## Part 1: Creating a Dataframe by Webscrapping

Fetching the Wikipedia page with `requests` and parsing it with BeautifulSoup:

In [102]:
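# Download the page HTML, then parse it with an explicit parser
# (omitting the parser makes BeautifulSoup emit a warning and guess one).
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'wikitable sortable'})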

The loop below fills the lists `postal_code`, `borough` and `neighborhood` with the values extracted from the table in the Wikipedia page.

In [103]:
postal_code = []
borough = []
neighborhood = []

rows = table.find_all('tr')

for row in rows:
    cells = row.find_all('td')

    # Header rows use <th>, so they yield no <td> cells and are skipped.
    # Require three cells because cells[2] is read below.
    if len(cells) >= 3:
        postal_code.append(cells[0].text.strip())
        borough.append(cells[1].text.strip())
        neighborhood.append(cells[2].text.strip())
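The same extraction can be tried offline on a small inline table. This is only a sketch: `SAMPLE_HTML` and `extract_rows` are hypothetical names introduced for illustration, and the sample markup is assumed to mirror the structure of the Wikipedia table (a header row of `<th>` cells followed by rows of three `<td>` cells).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, assumed to resemble the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def extract_rows(html):
    """Return one (postal code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # The header row has only <th> cells, so it is skipped here.
        if len(cells) >= 3:
            records.append(tuple(cell.text.strip() for cell in cells[:3]))
    return records


sample_rows = extract_rows(SAMPLE_HTML)
print(sample_rows)
```

Collecting each row as one tuple keeps the three values together, which avoids the three lists drifting out of step if a row is malformed.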

Now the populated lists are used to build the dataframe.

In [104]:
# Build the dataframe in one step, one column per list.
df = pd.DataFrame({
    'Postal Code': postal_code,
    'Borough': borough,
    'Neighborhood': neighborhood,
})
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Now let's drop the rows whose Borough is "Not assigned" and reset the index of the dataframe.

In [105]:
df.drop(df.loc[df['Borough']=='Not assigned'].index, inplace=True)

In [106]:
df.reset_index(drop=True, inplace=True)
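As an alternative to dropping by index, the same filtering can be written with a boolean mask. The sketch below runs on a small hand-made frame (`sample` and `assigned` are illustrative names, and the rows are copied from the output shown earlier) rather than on the scraped data.

```python
import pandas as pd

# Small stand-in for the scraped dataframe.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

This version returns a new dataframe instead of modifying the original in place, which makes the step safe to re-run in a notebook.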

In [107]:
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Finally, use the **.shape** attribute to print the number of rows in the dataframe.

In [108]:
print("The dataframe contains {} rows.".format(df.shape[0]))

The dataframe contains 103 rows.


## Part 2: Getting Coordinates

In this part, the OpenCage geocoding service is used to retrieve the latitude and longitude for each postal code in the dataframe. The values are collected in the lists `lat` and `lon`, which are then added to the dataframe as new columns.  
  
**Note: The values for latitude and longitude will vary depending on the geocoding service used.**

In [70]:
# The API key to use OpenCage was removed for privacy reasons.
key = 'my_key'
geocoder = OpenCageGeocode(key)

lat = []
lon = []

# Iterate over the postal codes themselves instead of a hard-coded row count.
for postcode in df['Postal Code']:
    address = '{}, Toronto, CA'.format(postcode)
    result = geocoder.geocode(address, no_annotations="1")
    if result:
        longitude = result[0]['geometry']['lng']
        latitude = result[0]['geometry']['lat']
    else:
        # No match: keep a placeholder so the lists stay aligned with the rows.
        longitude = 'N/A'
        latitude = 'N/A'

    lat.append(latitude)
    lon.append(longitude)
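The lookup-with-fallback pattern can be exercised without an API key by passing the geocoding call in as a function. This is a minimal sketch: `lookup_coordinates` and `fake_geocode` are hypothetical helpers, and `fake_geocode` only imitates the result shape that the loop above reads (`result[0]['geometry']['lat']` and `['lng']`); it is not part of the OpenCage library. `None` is used as the placeholder here instead of `'N/A'` so the columns stay numeric.

```python
def lookup_coordinates(postal_codes, geocode):
    """Return parallel latitude/longitude lists; None marks a failed lookup."""
    latitudes, longitudes = [], []
    for code in postal_codes:
        results = geocode('{}, Toronto, CA'.format(code))
        if results:
            geometry = results[0]['geometry']
            latitudes.append(geometry['lat'])
            longitudes.append(geometry['lng'])
        else:
            latitudes.append(None)
            longitudes.append(None)
    return latitudes, longitudes


def fake_geocode(address):
    # Hypothetical stand-in for the real geocoder: knows a single address
    # and returns an empty list for everything else.
    known = {
        'M4A, Toronto, CA': [{'geometry': {'lat': 43.7276, 'lng': -79.3148}}],
    }
    return known.get(address, [])


demo_lat, demo_lon = lookup_coordinates(['M4A', 'X0X'], fake_geocode)
print(demo_lat, demo_lon)
```

With the real service, the bound method `geocoder.geocode` could presumably be passed in place of `fake_geocode`, although extra arguments such as `no_annotations` would need to be wrapped in a small function first.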

In [109]:
df['Latitude'] = lat
df['Longitude'] = lon

In [111]:
df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.653482 | -79.383935 |
| 1 | M4A | North York | Victoria Village | 43.7276 | -79.3148 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.6555 | -79.3626 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.7223 | -79.4504 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.653482 | -79.383935 |


In [115]:
# Check whether the Neighborhood column contains any "Not assigned" values.
# Any such value should be replaced by the Borough name.
df.loc[df['Neighborhood']=='Not assigned']

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|


No rows are returned, so no replacement is needed. The dataframe is complete.
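Had any such rows existed, the replacement described in the comment could be done with a boolean mask. The sketch below uses a made-up frame (`demo`, with a hypothetical postal code and borough) purely to show the assignment; it does not reflect actual rows from the scraped data.

```python
import pandas as pd

# Hypothetical data: the second row has no neighborhood assigned.
demo = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where the neighborhood is missing, copy the borough name into it.
mask = demo["Neighborhood"] == "Not assigned"
demo.loc[mask, "Neighborhood"] = demo.loc[mask, "Borough"]
print(demo)
```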

## Part 3: Exploring and Clustering Neighborhoods

In this step, we are going to explore and cluster the neighborhoods in Toronto. A map will be generated to visualize the neighborhoods and how they cluster together.