# Segmenting and Clustering Neighbourhoods in Toronto I
## Applied Data Science Capstone by IBM on Coursera 
**Fernanda Oliveira**  
Data Analyst

## Introduction

In this hands-on project I use a table of postal codes for neighbourhoods in the city of Toronto to explore, segment, and cluster them. I first scrape the table from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, read it into a pandas DataFrame, clean it, and then perform the analysis.

First, let's import all libraries that are needed to perform the analysis.

In [1]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if you have already completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


I installed the BeautifulSoup package following the instructions at https://beautiful-soup-4.readthedocs.io/en/latest/. The cell below installs `lxml`, an HTML parser that the pandas `read_html` function can use.

In [2]:
!pip install lxml

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support


## 1. Download and Explore Dataset

Let's scrape the data using the `read_html` function from the pandas library. It returns a list of DataFrames, one for each table found on the page.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_data = pd.read_html(url, header = 0)

In [4]:
df = toronto_data[0] # the first table on the page holds the postal codes

In [5]:
df.keys()

Index([u'Postcode', u'Borough', u'Neighbourhood'], dtype='object')

In [6]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


The goal now is to keep only the rows that have an assigned borough, so every row whose "Borough" value is "Not assigned" is dropped from the dataset.

In [7]:
df=df[df.Borough != 'Not assigned']

In [8]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
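
As a minimal sketch of the filtering step, the snippet below applies the same boolean-mask idea to a small frame built from the first rows printed earlier. The names `sample`, `has_borough`, and `assigned` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Small frame built from the first rows shown earlier (not the full scraped table).
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

# Boolean mask: True for rows whose borough is assigned.
has_borough = sample["Borough"] != "Not assigned"

# Indexing with the mask keeps only the rows where it is True.
assigned = sample[has_borough]

print(assigned)
```

The filtered frame keeps the original row labels (2, 3, 4 here), which is why the output of `df.head()` above starts at index 2 rather than 0.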


The assignment states that if a cell has a borough but a "Not assigned" neighbourhood, then the neighbourhood is the same as the borough. Following this rule, every "Not assigned" entry in the "Neighbourhood" column is replaced with the value of the "Borough" column in the same row.

In [9]:
df['Neighbourhood']=df['Neighbourhood'].replace('Not assigned', df['Borough'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [10]:
df.head(15)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Queen's Park
9,M9A,Downtown Toronto,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
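
The `SettingWithCopyWarning` printed by the replacement step appears because `df` was created by filtering another DataFrame. One possible way to avoid it, sketched here on hypothetical rows, is to take an explicit `.copy()` after filtering and then fill the column with `Series.mask`. This is an alternative under those assumptions, not the code the notebook ran; `sample`, `assigned`, and `missing` are illustrative names.

```python
import pandas as pd

# Hypothetical rows in the shape of the table above; the M7A row has a
# borough but no neighbourhood name.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# Take an explicit copy of the filtered rows so that the assignment below
# writes to an independent DataFrame rather than to a slice of `sample`.
assigned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is "Not assigned", take the value from "Borough"
# (Series.mask aligns the two columns on the row index).
missing = assigned["Neighbourhood"] == "Not assigned"
assigned["Neighbourhood"] = assigned["Neighbourhood"].mask(missing, assigned["Borough"])

print(assigned)
```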


Observe that more than one neighbourhood can share one postal code. For example, in the table above M6A is listed twice (indices 5 and 6), with two neighbourhoods: Lawrence Heights and Lawrence Manor. The idea now is to combine such rows into a single row, with the neighbourhoods separated by commas. This is done with `groupby()`, the aggregation method `agg()`, and a lambda function that joins the names. Passing `sort = False` keeps the postal codes in their original order, and `.reset_index()` turns the group keys back into ordinary columns with a fresh integer index.

In [11]:
df = df.groupby(['Postcode','Borough'], sort = False).agg(lambda x: ', '.join(x)).reset_index()

In [12]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
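
To make the grouping step easier to follow, here is a small sketch on three rows taken from the tables above. It selects the "Neighbourhood" column explicitly and passes `", ".join` directly to `agg()`, which should be equivalent to the lambda used in the notebook for this case. The names `sample` and `merged` are illustrative.

```python
import pandas as pd

# Rows taken from the tables shown earlier: M6A appears twice.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

merged = (
    # sort=False keeps groups in order of first appearance.
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    # Join all neighbourhood names in a group into one comma-separated string.
    .agg(", ".join)
    # Turn the group keys back into columns with a fresh 0..n-1 index.
    .reset_index()
)

print(merged)
```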


In [13]:
df.to_csv(r'Toronto_cleandata.csv')

In [14]:
df.shape

(103, 3)