This notebook contains the individual assignment _Segmenting and Clustering Neighborhoods in Toronto_ for Week 3 of the [Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone) course.

## Table of Contents
* [Install libraries](#install-libraries)
* [Import libraries](#import-libraries)
* [Display options for Pandas](#display-options-for-pandas)
* [Scrape Toronto Neighborhoods data](#scrape-toronto-neighborhoods-data)
* [Obtaining latitude and longitude coordinates for neighborhoods](#obtaining-latitude-and-the-longitude-coordinates-for-neighborhoods)
* [Displaying neighborhoods on the map](#displaying-neighborhoods-on-the-map)
* [Explore and cluster the neighborhoods](#explore-and-cluster-the-neighborhoods)

# Install libraries <a class="anchor" id="install-libraries"></a>

In [None]:
!pip3 install beautifulsoup4 lxml requests pandas numpy geopy scikit-learn folium matplotlib

# Import libraries <a class="anchor" id="import-libraries"></a>

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import numpy as np
from sklearn.cluster import KMeans
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# Display options for Pandas <a class="anchor" id="display-options-for-pandas"></a>

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Scrape Toronto Neighborhoods data <a class="anchor" id="scrape-toronto-neighborhoods-data"></a>

Fetch the content of https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M as the source to be scraped:

In [None]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

Select only the rows of the table from the HTML:

In [None]:
table_rows = soup.select('table.wikitable > tbody > tr')

Remove the first row, which is the header:

In [None]:
table_rows_without_header = table_rows[1:]

Collect the table data into a Python list:

In [None]:
scraped_list = []
for tr in table_rows_without_header:
    td = tr.find_all('td')
    # strip surrounding whitespace (including the trailing newline) from every cell
    scraped_list.append((td[0].text.strip(), td[1].text.strip(), td[2].text.strip()))
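
The row-collection loop can be tried offline on a small hand-written table. This is a minimal sketch: the HTML snippet is made up for illustration, and it uses the built-in `html.parser` backend rather than `lxml`, so the selector does not assume a `tbody` element. Header rows are skipped by checking for three `td` cells instead of slicing off the first row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table (illustrative data only).
html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table.wikitable tr"):
    cells = tr.find_all("td")
    if len(cells) != 3:
        # Header rows contain only <th> cells, so they are skipped here.
        continue
    # get_text(strip=True) removes the trailing newline in the last cell.
    rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```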

Create a dataframe from the scraped list:

In [None]:
scraped_neighborhoods_raw = pd.DataFrame(scraped_list, columns=['Postal Code', 'Borough', 'Neighborhood'])

Filter out rows where the _Borough_ column has the value `Not assigned`:

In [None]:
scraped_neighborhoods_filtered_na_boroughs = scraped_neighborhoods_raw[scraped_neighborhoods_raw['Borough'] != 'Not assigned']

Combine neighborhoods that share the same _Postal Code_ into a single row:

In [None]:
scraped_neighborhoods_clean = scraped_neighborhoods_filtered_na_boroughs.groupby(['Postal Code', 'Borough'], as_index=False, sort=False).agg({'Neighborhood': ', '.join})
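
To see what the group-and-join step produces, here is a small sketch on illustrative rows (the three rows are made up to mimic the scraped table). It passes `', '.join` directly as the aggregation function, and `sort=False` keeps the groups in their order of first appearance.

```python
import pandas as pd

# Illustrative rows only; two neighborhoods share postal code M5A.
df = pd.DataFrame(
    [
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M4A", "North York", "Victoria Village"),
    ],
    columns=["Postal Code", "Borough", "Neighborhood"],
)

# One row per (Postal Code, Borough); neighborhood names are comma-joined.
combined = df.groupby(["Postal Code", "Borough"], as_index=False, sort=False).agg(
    {"Neighborhood": ", ".join}
)
print(combined)
```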

For every row where _Borough_ is known but _Neighborhood_ is `Not assigned`, use the borough name as the neighborhood:

In [None]:
borough_for_not_assigned_neighborhoods = scraped_neighborhoods_clean[scraped_neighborhoods_clean['Neighborhood'] == 'Not assigned']['Borough']
scraped_neighborhoods_clean.loc[scraped_neighborhoods_clean['Neighborhood'] == 'Not assigned', 'Neighborhood'] = borough_for_not_assigned_neighborhoods
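
The fill step relies on pandas aligning the assigned values by index, so each masked row receives the borough from the same row. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Illustrative data: the first row has a borough but no neighborhood.
df = pd.DataFrame(
    {
        "Borough": ["Queen's Park", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    }
)

# Boolean mask of rows whose neighborhood is missing.
missing = df["Neighborhood"] == "Not assigned"
# Assignment aligns on the index, so each row gets its own borough.
df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]
print(df)
```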

In [None]:
scraped_neighborhoods_clean.shape

# Obtaining latitude and longitude coordinates for neighborhoods <a class="anchor" id="obtaining-latitude-and-the-longitude-coordinates-for-neighborhoods"></a>

Because the `geocoder` library is unreliable, load the coordinates from the CSV file at http://cocl.us/Geospatial_data instead:

In [None]:
geospatial_data = pd.read_csv('http://cocl.us/Geospatial_data')

Now merge the two dataframes into a single one on _Postal Code_:

In [None]:
neighborhoods = pd.merge(scraped_neighborhoods_clean, geospatial_data, on='Postal Code')
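
`pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without warning. The sketch below shows one way to check for that, assuming made-up data (the code `M9Z` and the coordinates are hypothetical): a left join with `indicator=True` flags unmatched rows, and `validate` raises if a postal code is duplicated on either side.

```python
import pandas as pd

# Hypothetical frames: M9Z has no entry in the coordinates table.
left = pd.DataFrame({"Postal Code": ["M1B", "M9Z"], "Borough": ["Scarborough", "Etobicoke"]})
right = pd.DataFrame({"Postal Code": ["M1B"], "Latitude": [43.8], "Longitude": [-79.2]})

checked = left.merge(
    right, on="Postal Code", how="left", indicator=True, validate="one_to_one"
)
# Rows marked 'left_only' found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```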

# Displaying neighborhoods on the map <a class="anchor" id="displaying-neighborhoods-on-the-map"></a>

Obtain the latitude and longitude of Toronto:

In [None]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
toronto_location = geolocator.geocode(address)
toronto_latitude = toronto_location.latitude
toronto_longitude = toronto_location.longitude

Create a map of Toronto using latitude and longitude values:

In [None]:
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=10)

Add neighborhood markers to the map:

In [None]:
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

# Explore and cluster the neighborhoods <a class="anchor" id="explore-and-cluster-the-neighborhoods"></a>

Define the Foursquare credentials and API version:

In [None]:
CLIENT_ID = 'put your real Client ID here'
CLIENT_SECRET = 'put your real Client Secret here'
VERSION = '20180605'

The `get_nearby_venues` function is borrowed from the Foursquare lab:

In [None]:
def get_nearby_venues(names, latitudes, longitudes, radius=500, limit=100):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        if results:
            print("Found {number} venues for '{name}' neighborhood.".format(number=len(results), name=name))
        else:
            print("WARNING: No venues found for '{name}' neighborhood.".format(name=name))
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
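
The function above indexes straight into the JSON response, so a venue without categories or a response without `groups` would raise an exception. The following sketch shows a more defensive way to flatten one response. The helper name `extract_venues`, the `"Unknown"` fallback label, and both sample payloads are assumptions made for illustration, not part of the Foursquare lab or the API.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # Fall back to an empty dict when the category list is missing or empty.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name", "Unknown"),
        ))
    return rows

# Hand-written payload in the shape the function above indexes into.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A", "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Shop B", "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Demo", 43.65, -79.38)
# A made-up payload without a "response" key yields no rows instead of a KeyError.
empty = extract_venues({"meta": {"code": 429}}, "Demo", 43.65, -79.38)
print(rows, empty)
```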

Run the above function on each neighborhood and create a new dataframe with venues information:

In [None]:
toronto_venues = get_nearby_venues(names=neighborhoods['Neighborhood'],
                                 latitudes=neighborhoods['Latitude'],
                                 longitudes=neighborhoods['Longitude']
                                )

One-hot encode the venue categories:

In [None]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

Move the _Neighborhood_ column to the first position:

In [None]:
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
cols = list(toronto_onehot)
cols.insert(0, cols.pop(cols.index('Neighborhood')))
toronto_onehot = toronto_onehot.loc[:, cols]

Group rows by neighborhood, taking the mean of the frequency of occurrence of each category:

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
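
Taking the mean of one-hot columns gives, for each neighborhood, the share of its venues that fall into each category. A small sketch with made-up venues makes this concrete: neighborhood `A` has two cafes and one park, so its row becomes roughly 0.67 and 0.33.

```python
import pandas as pd

# Illustrative data: three venues in A, one in B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of 0/1 indicators is the category's relative frequency.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is what makes neighborhoods with different venue counts comparable.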

Define a function that returns the most common venue categories for a row, sorted by frequency in descending order:

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
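
The same ranking can be checked on a single hand-made row (the frequencies below are illustrative). Note that when a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose frequency is zero.

```python
import pandas as pd

# One row shaped like a row of toronto_grouped: name first, then frequencies.
row = pd.Series({"Neighborhood": "A", "Cafe": 0.5, "Park": 0.3, "Gym": 0.2})

# Skip the first entry (the neighborhood name) and rank the frequencies.
frequencies = row.iloc[1:].astype(float)
top_two = frequencies.sort_values(ascending=False).index[:2].tolist()
print(top_two)
```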

Build a new dataframe with the top 10 venue categories for each neighborhood:

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
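
The `try`/`except` lookup for suffixes is fine for ten columns, but it would label a 21st column as "21th". If `num_top_venues` might grow, a small helper is one alternative. This is a sketch; the `ordinal` function is a hypothetical helper, not part of the original lab.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th and the rest of the teens always take 'th'.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Column names equivalent to the ones built in the cell above.
columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns)
```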

Run k-means to group the neighborhoods into 5 clusters:

In [None]:
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
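
To make concrete what `KMeans` does with the frequency vectors, here is a NumPy-only sketch of the basic Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is not scikit-learn's implementation, which by default also uses k-means++ initialization. The function name `lloyd_kmeans`, the explicit starting centroids, and the six 2-D points are assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd iterations: assign to nearest centroid, then re-average."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated groups of three points each (illustrative data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = lloyd_kmeans(points, init_centroids=points[[0, 3]])
print(labels)
print(centroids)
```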

Add clustering labels to dataframe:

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Drop the following neighborhoods, because the `get_nearby_venues` function found no venues in them:
- Islington Avenue
- Newtonbrook, Willowdale
- Upper Rouge

In [None]:
neighborhoods_with_venues = neighborhoods[~neighborhoods['Neighborhood'].isin(['Islington Avenue', 'Newtonbrook, Willowdale', 'Upper Rouge'])].reset_index(drop=True)
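
Venue results may change between runs, so a hardcoded list of names can go stale. One alternative is to derive the list from the grouped data, since a neighborhood without venues never appears in `toronto_grouped`. The sketch below uses tiny stand-in frames with made-up names to show the idea.

```python
import pandas as pd

# Stand-ins for the notebook's frames; B is assumed to have returned no venues.
neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
toronto_grouped = pd.DataFrame({"Neighborhood": ["A", "C"]})

# A neighborhood has venues if it survived the one-hot/group step.
has_venues = neighborhoods["Neighborhood"].isin(toronto_grouped["Neighborhood"])
without_venues = neighborhoods.loc[~has_venues, "Neighborhood"].tolist()
neighborhoods_with_venues = neighborhoods[has_venues].reset_index(drop=True)
print(without_venues)
```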

Merge `neighborhoods_venues_sorted` with `neighborhoods_with_venues` to add latitude/longitude for each neighborhood:

In [None]:
toronto_merged = neighborhoods_with_venues.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Visualize the resulting clusters:

In [None]:
map_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloring each one by its cluster label
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters