# Applied Data Science Capstone

Welcome to my Coursera Capstone Project notebook. In this notebook I'll develop my final project for the [IBM Data Science](https://www.coursera.org/professional-certificates/ibm-data-science) course.

In [2]:
#!pip install geocoder
#!pip install bs4

In [38]:
import pandas as pd
import numpy as np
import requests, geocoder, folium
from bs4 import BeautifulSoup

## Segmenting and Clustering
### Convert table to dataframe

First, I fetched and parsed the Wikipedia page.

In [45]:
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

The Wikipedia page has a table containing all the postal codes. Instead of looping over the HTML rows with a `for` statement, I used BeautifulSoup's `soup.find()` method, which returns the first `<table>` element on the page. Then I converted it into a dataframe using `pd.read_html()`.

In [46]:
table = soup.find('table')
codes = pd.read_html(str(table))[0]
codes.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


To prevent invalid borough values, I removed every row whose borough is `Not assigned`.

In [47]:
codes = codes[(codes.Borough != 'Not assigned')]
codes.head(12)

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Downtown Toronto | Queen's Park |
| 9 | M9A | Etobicoke | Islington Avenue |
| 10 | M1B | Scarborough | Rouge |
| 11 | M1B | Scarborough | Malvern |
| 13 | M3B | North York | Don Mills North |


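Then I renamed the columns `Postcode` to `Postal Code` and `Neighbourhood` to `Neighborhood`, and grouped the rows by postal code and borough so that neighborhoods sharing a postal code are joined into a single comma-separated value. Both steps prepare `codes` to be merged with the `geo` dataframe into a new `df` dataframe containing all the information, which happens in the next section.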

In [48]:
codes.rename(columns = {'Postcode' : 'Postal Code', 'Neighbourhood':'Neighborhood'}, inplace = True)
codes = codes.groupby(by=['Postal Code','Borough'], sort=False).agg( ', '.join).reset_index()

After grouping, my dataset has a total of **103 rows**.

In [49]:
codes.shape

(103, 3)
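
To make the grouping step concrete, here is a minimal sketch on a toy dataframe. The three rows are copied from the tables above; the names `rows` and `grouped` are only illustrative and are not part of the notebook.

```python
import pandas as pd

# Toy rows mirroring the table above: M6A appears twice before grouping.
rows = pd.DataFrame({
    'Postal Code': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Group by postal code and borough, then join the remaining text column
# (Neighborhood) with ', ' inside each group. sort=False keeps the
# original row order of the groups.
grouped = (
    rows.groupby(by=['Postal Code', 'Borough'], sort=False)
        .agg(', '.join)
        .reset_index()
)
print(grouped)
```

The two `M6A` rows collapse into one row, which is why the row count drops after this step.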

### Merge postal codes and geolocations dataframes

In [50]:
geo = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/Coursera_Capstone/master/Geospatial_Coordinates.csv')
df = codes.merge(geo, on = 'Postal Code')
df.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |
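
A default `merge` is an inner join, so a postal code without coordinates would silently disappear from `df`. The sketch below shows one way this could be checked, using a left join with `indicator=True`. The dataframes `codes_demo` and `geo_demo` are toy data, and `M9Z` is a made-up postal code used only to trigger the unmatched case.

```python
import pandas as pd

# Toy stand-ins for the `codes` and `geo` dataframes.
codes_demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is hypothetical
    'Borough': ['North York', 'North York', 'Example Borough'],
})
geo_demo = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every postal code; indicator=True adds a `_merge`
# column saying whether each row found a match in geo_demo.
checked = codes_demo.merge(geo_demo, on='Postal Code', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list would mean every postal code received coordinates.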


### Map of Toronto

Now, let's plot the Toronto map, based on the dataframe locations, using the Folium library.

In [51]:
toronto_lat = 43.6532
toronto_lng = -79.3832
map_toronto = folium.Map(location = [toronto_lat, toronto_lng], zoom_start = 10.7)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
map_toronto
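
The map above is centered on hard-coded coordinates. As an alternative sketch, the center and bounding box could be derived from the dataframe itself. The `points` dataframe below holds three coordinates copied from the merged table, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Three locations copied from the merged dataframe shown earlier.
points = pd.DataFrame({
    'Latitude': [43.753259, 43.725882, 43.65426],
    'Longitude': [-79.329656, -79.315572, -79.360636],
})

# Mean position of all markers, usable as a map center.
center = [points['Latitude'].mean(), points['Longitude'].mean()]

# Opposite corners of the smallest box that contains every marker.
south_west = [points['Latitude'].min(), points['Longitude'].min()]
north_east = [points['Latitude'].max(), points['Longitude'].max()]
print(center, south_west, north_east)
```

With Folium, `center` could be passed as the `location` argument of `folium.Map`, and `[south_west, north_east]` to the map's `fit_bounds` method, so the zoom level would not need to be tuned by hand.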

### Select all boroughs containing Toronto

Let's first get all the neighborhoods from the North York borough.

In [53]:
borough = df[df['Borough'].str.contains("North York")]
borough.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 10 | M6B | North York | Glencairn | 43.709577 | -79.445073 |
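
`str.contains` performs a substring match, so the same pattern can select one borough or a whole family of them. Here is a small sketch on toy data; the `demo` dataframe and the result names are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto', 'Scarborough'],
})

# Exact comparison: only rows whose borough is exactly 'North York'.
exact = demo[demo['Borough'] == 'North York']

# Substring match: every borough whose name contains 'Toronto'.
# regex=False treats the pattern as plain text rather than a regex.
toronto_like = demo[demo['Borough'].str.contains('Toronto', regex=False)]
print(toronto_like)
```

For a single borough, the exact comparison is the stricter choice, while the substring match suits a selection such as all boroughs containing Toronto.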


In [54]:
# Use the first row of the filtered `borough` dataframe as the map center
borough_lat = borough['Latitude'].iloc[0]
borough_lng = borough['Longitude'].iloc[0]
borough_name = borough['Borough'].iloc[0]
print('The geographical coordinates of {} are {}, {}.'.format(borough_name, borough_lat, borough_lng))

The geographical coordinates of North York are 43.7532586, -79.3296565.


In [57]:
map_borough = folium.Map(location = [borough_lat, borough_lng], zoom_start = 10.7)

for lat, lng, neighborhood in zip(borough['Latitude'], borough['Longitude'], borough['Neighborhood']):
    # Build the popup from this row's neighborhood name
    label = folium.Popup(neighborhood, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_borough)
map_borough

<folium.folium.Map at 0x11805d190>

In [31]:
import os

def search(category, names, latitudes, longitudes):
    # Foursquare credentials are read from environment variables
    # (the variable names are an assumption) rather than hard-coded.
    CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
    CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
    VERSION = '20180604'
    radius = 500
    limit = 10
    venues = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/search?categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            category, CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, limit)
        result = requests.get(url).json()['response']['venues']
        # Collect the raw venue records found around every neighborhood
        venues.extend(result)
    return venues

In [32]:
# Search around each neighborhood of the North York `borough` dataframe
venues = search('4bf58dd8d48988d1e0931735', borough['Neighborhood'], borough['Latitude'], borough['Longitude'])

In [33]:
# Keep only the name and coordinates of each venue
ven = []
for v in venues:
    ven.append({'Venue' : v['name'], 'Latitude' : v['location']['lat'], 'Longitude' : v['location']['lng']})

In [34]:
venues
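
The request URL in `search` is assembled with string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode` and flattens a response into rows without calling the API. The helpers `build_search_url` and `venues_to_rows`, the placeholder credentials, and the `sample` venue are all hypothetical; only the endpoint, the parameter names, and the `name` / `location.lat` / `location.lng` fields come from the code above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(category, lat, lng, client_id, client_secret,
                     version='20180604', radius=500, limit=10):
    """Return the search URL with every parameter URL-encoded."""
    params = {
        'categoryId': category,
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return BASE_URL + '?' + urlencode(params)

def venues_to_rows(neighborhood, venues):
    """Flatten venue records into dicts tagged with their neighborhood."""
    return [{'Neighborhood': neighborhood,
             'Venue': v['name'],
             'Latitude': v['location']['lat'],
             'Longitude': v['location']['lng']} for v in venues]

# Hypothetical venue shaped like the fields the notebook reads.
sample = [{'name': 'Example Cafe', 'location': {'lat': 43.7533, 'lng': -79.3297}}]

url = build_search_url('4bf58dd8d48988d1e0931735', 43.753259, -79.329656,
                       'MY_ID', 'MY_SECRET')
rows = venues_to_rows('Parkwoods', sample)
print(url)
print(rows)
```

A list of such rows can be passed directly to `pd.DataFrame` to get one venue per row.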