# Segmenting and Clustering Neighborhoods in Toronto

_Anna A. Stepanova, Ph.D_

## Introduction

In this project, I will scrape web data using the *BeautifulSoup* package to get the neighborhood information for Toronto, and then convert addresses into their equivalent latitude and longitude values. I will also use the Foursquare API to explore neighborhoods in Toronto. I will use the **explore** endpoint to get the most common venue categories in each neighborhood, and then use these categories as features to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, I will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents


1. [Scrape Wikipedia Page](#item1)
2. [Add Geospatial Data](#item2)
3. [Clustering Neighborhoods in Toronto](#item3)



Let's download and import the required packages before we explore the data.

In [None]:
#!pip install --upgrade ibm-watson

import numpy as np # library to handle data in a vectorized manner

#!pip install --user pandas==1.0.3

import pandas as pd # library for data analysis


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
# pd.json_normalize (pandas >= 1.0) is used later to transform JSON into a pandas dataframe, so no separate import is needed

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.11.0 --yes # installs folium; comment this line out if folium is already installed
import folium # map rendering library

from bs4 import BeautifulSoup # web scraping library

print('Libraries imported.')

<a id="item1"></a>
## 1. Scrape Wikipedia Page

Let's build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes, boroughs and neighborhoods.
We'll use the BeautifulSoup library to extract the table from the web page.

In [None]:
# import the library we use to open URLs
import urllib.request

# specify the URL of the Wikipedia page we are going to scrape
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Using the *urllib.request* library, we query the URL and store the HTTP response, which holds the HTML data, in a variable called ‘page’:

In [None]:
page = urllib.request.urlopen(url)

Then we use Beautiful Soup to parse the HTML data we stored in our ‘page’ variable and store the result in a new variable called ‘soup’, in the Beautiful Soup parse tree format. We use the “lxml” parser option:

In [None]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

Extract the table from the parsed HTML using the class "wikitable sortable":

In [None]:
# find and extract table data from the Wikipedia page
table=soup.find('table', class_='wikitable sortable')
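
To see what this lookup returns without fetching the live page, here is a minimal sketch on an inline HTML snippet. The snippet is a made-up stand-in for the Wikipedia table, the `demo_` names are hypothetical, and the sketch uses Python's built-in `"html.parser"` backend so that it runs without lxml. Note that passing a multi-word string to `class_` matches the exact value of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the Wikipedia page
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")
demo_table = demo_soup.find("table", class_="wikitable sortable")

demo_rows = []
for tr in demo_table.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) == 3:  # the header row has <th> cells only, so it is skipped
        # get_text(strip=True) also removes surrounding whitespace and newlines
        demo_rows.append([td.get_text(strip=True) for td in tds])

print(demo_rows)
```

If the page layout changes and no table carries that exact class string, `find` returns `None`, so it may be worth checking the result before looping over it.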

Now that the table has been found, let's use BeautifulSoup to extract each row's cells into three lists, one per future column, and then use *pandas* to build a data frame from them. Every cell ends with a newline character (**\n**), which we strip before further analysis. We also exclude rows whose borough is **Not assigned** and reset the index of the resulting data frame.
Finally, we print the first 12 rows.

In [None]:
# Collect the cell text for each of the three columns
postal_codes = []
boroughs = []
neighborhoods = []

for row in table.find_all('tr'):
    cells = row.find_all('td')
    # keep only rows with exactly three data cells (this skips e.g. the header row)
    if len(cells) == 3:
        postal_codes.append(cells[0].find(string=True))
        boroughs.append(cells[1].find(string=True))
        neighborhoods.append(cells[2].find(string=True))

# Now let's create a data frame using the pandas library
tor = pd.DataFrame({'PostalCode': postal_codes,
                    'Borough': boroughs,
                    'Neighborhood': neighborhoods})


# First filter out rows that
# do not contain any data
tor = tor.dropna(how = 'all')

# Remove \n from data frame
tor = tor.replace('\n','', regex=True)

### Drop rows Not assigned
tor.drop(tor[tor['Borough'] == 'Not assigned'].index, inplace = True)

### Reset the index of the data frame
tor.reset_index(drop=True, inplace = True)

### Print the modified dataframe 
tor.head(12)
     


The data frame has three columns (PostalCode, Borough, and Neighborhood) and 103 rows.

In [None]:
### print the shape of the data frame
print(tor.shape)
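
The cleaning steps in this section can be checked on a small toy frame. This is only a sketch: the rows below are illustrative stand-ins for the scraped cells, and `demo` is a hypothetical name that does not touch `tor`. It uses a boolean mask instead of `drop(..., inplace=True)`, which gives the same result here.

```python
import pandas as pd

# Toy rows mimicking the scraped cells, including the trailing newlines
demo = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighborhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip the newlines first, otherwise the comparison below would never match
demo = demo.replace("\n", "", regex=True)

# Keep only rows with an assigned borough and renumber the index from 0
demo = demo[demo["Borough"] != "Not assigned"].reset_index(drop=True)

print(demo.shape)
```

The order matters: comparing against `'Not assigned'` before the newlines are removed would leave every row in place.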


<a id="item2"></a>
## 2. Add Geospatial Data

In this part, the plan was to get the latitude and longitude coordinates of each postal code using the *Geocoder* package. However, I was not able to get the geographical coordinates of the neighborhoods with Geocoder.

Instead, I used a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [None]:
### Read Geospatial information from csv
postal_data = pd.read_csv('http://cocl.us/Geospatial_data')

postal_data.head()

Now we can create a new data frame containing information about neighborhoods and their coordinates by merging the neighborhood data with the geospatial data. The first 12 rows are printed below.

In [None]:
# Merge toronto data frame with postal_data
geo_tor = tor.merge(postal_data, left_on='PostalCode', right_on='Postal Code')

# drop the duplicate 'Postal Code' column
geo_tor.drop(["Postal Code"], axis = 1, inplace = True)

# Print first 12 rows
geo_tor.head(12)
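
By default, `merge` performs an inner join, so a postal code that is missing from the csv file would be dropped without any warning. The following sketch shows one way to check for that, using `how="left"` together with `indicator=True`. The two toy frames, their coordinate values, and the postal code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the geospatial csv
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.75, 43.73],
    "Longitude": [-79.33, -79.32],
})

# A left join keeps every scraped row; the _merge column records whether it matched
checked = left.merge(right, left_on="PostalCode", right_on="Postal Code",
                     how="left", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every scraped postal code found its coordinates.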


Our new data frame contains 103 entries and now has neighborhood coordinates.

In [None]:
### print the shape of the data frame
geo_tor.shape

<a id="item3"></a>
## 3. Clustering Neighborhoods in Toronto

In this final section, we'll use the Foursquare API to explore and cluster neighborhoods in Toronto.

#### Define Foursquare Credentials

In [None]:
# The code was removed by Watson Studio for sharing.

Let's explore the 41st neighborhood in our Toronto data frame. The code below will print the neighborhood name and its coordinates.

In [None]:
# print neighborhood name
neighborhood_name = geo_tor.loc[40, 'Neighborhood'] # neighborhood name

print('The 41st neighborhood in a cleaned Toronto data frame is {}.'.format(neighborhood_name))

# find neighborhood coordinates
neighborhood_latitude = geo_tor.loc[40, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = geo_tor.loc[40, 'Longitude'] # neighborhood longitude value


print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               round(neighborhood_latitude, 2), 
                                                               round(neighborhood_longitude, 2)))

#### Now, let's get the top 100 venues that are in Downsview within a radius of 1 kilometer.

First, let's create the GET request URL and store it in a variable named **url**.

In [None]:
# set the query parameters for the explore request
search_query = neighborhood_name
radius = 1000 # define the radius
LIMIT = 100 # limit of number of venues returned by Foursquare API
print(search_query + ' .... OK!')


url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    VERSION,
    radius, 
    LIMIT)
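
As an alternative to formatting the query string by hand, the same URL can be assembled with `urllib.parse.urlencode` from the standard library, which also takes care of escaping. This is only a sketch: the credentials, coordinates, and version date below are placeholders rather than the values used in this notebook, and `demo_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Placeholder values; the real ones come from the credentials cell and geo_tor
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "ll": "43.75,-79.33",   # latitude,longitude
    "v": "20180605",        # assumed API version date
    "radius": 1000,         # metres
    "limit": 100,           # maximum number of venues to return
}

demo_url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
print(demo_url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the query string at all.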



Now we are ready to send our GET request and examine the results.

In [None]:
results = requests.get(url).json()
results

Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [None]:
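# function that extracts the category of the venue
def get_category_type(row):
    # the category list is stored under 'categories' or, after
    # json_normalize, under the flattened key 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        # use the name of the first listed category
        return categories_list[0]['name']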
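
To illustrate how this function behaves, here is a small self-contained sketch that repeats the function and calls it on plain dictionaries. The three sample rows are made up; in the notebook, the function receives rows of a *pandas* dataframe instead.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row["categories"]
    except KeyError:
        categories_list = row["venue.categories"]
    if len(categories_list) == 0:
        return None
    return categories_list[0]["name"]

# Hypothetical rows covering the three cases the function handles
flat_row = {"categories": [{"name": "Park"}, {"name": "Trail"}]}
nested_row = {"venue.categories": [{"name": "Coffee Shop"}]}
empty_row = {"categories": []}

print(get_category_type(flat_row))
print(get_category_type(nested_row))
print(get_category_type(empty_row))
```

Only the first category is kept, so a venue listed under several categories is represented by whichever one comes first in the list.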

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns; copy so the assignment below works on an independent dataframe
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues[filtered_columns].copy()

# extract the category name for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
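
The dotted column names used above come from `pd.json_normalize`, which flattens nested dictionaries and joins their keys with a dot. The sketch below shows this on a mock response. The mock only imitates the nesting the code above relies on (`response`, `groups`, `items`, `venue`); the venues and coordinates are invented, and a real response contains many more fields.

```python
import pandas as pd

# Mock of the nested structure the notebook reads from the API response
mock_results = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Cafe",
                   "categories": [{"name": "Cafe"}],
                   "location": {"lat": 43.75, "lng": -79.33}}},
        {"venue": {"name": "Example Park",
                   "categories": [],
                   "location": {"lat": 43.76, "lng": -79.34}}},
    ]}]}
}

items = mock_results["response"]["groups"][0]["items"]

# Nested dictionaries become dotted columns; lists such as categories stay as lists
flat = pd.json_normalize(items)
print(sorted(flat.columns))
```

Because the `categories` list is left untouched by the flattening, a helper such as `get_category_type` is still needed to reduce it to a single name.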