# _Notebook for Segmenting and Clustering Neighborhoods in Toronto_

### **Part 1: Getting, cleaning, processing data**

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
# Matplotlib modules for colormaps and color conversion
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests
# k-means, used in the clustering stage
from sklearn.cluster import KMeans
# The !conda lines below install missing packages; comment them out if the packages are already installed
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
!conda install -c anaconda lxml --yes # parser used by pandas read_html
import lxml

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge

The following packages will be UPDATED:

    certifi: 2019.6.16-py36_1 conda-forge --> 2019.9.11-py36_0 conda-forge


Downloading and Extracting Packages
certifi-2019.9.11    | 147 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

#### 1. Get info from Wikipedia page

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
df0 = pd.read_html(url)[0]
df0.columns = ['PostalCode', 'Borough', 'Neighborhood']
df0

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 283 | M8Z | Etobicoke | Mimico NW |
| 284 | M8Z | Etobicoke | The Queensway West |
| 285 | M8Z | Etobicoke | Royal York South West |
| 286 | M8Z | Etobicoke | South of Bloor |


#### 2. Ignore cells with a borough that is *__Not assigned__*

Let's see how many rows have a 'Not assigned' borough.

In [4]:
df0.Borough.value_counts()

Not assigned        77
Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

Now create a new DataFrame without the 'Not assigned' rows.

In [5]:
df1 = df0[df0.Borough != 'Not assigned']
print('New size is', df1.shape)

New size is (211, 3)
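The filter above can be tried on a small frame first. The sketch below uses hypothetical toy rows (not the real Wikipedia data) and adds a `.copy()` call, which is an assumption on my part rather than something the notebook does; it can help avoid pandas' `SettingWithCopyWarning` if the filtered frame is modified later.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of the Wikipedia table
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village', 'Not assigned'],
})

# Boolean mask: True for rows whose borough is known
has_borough = toy['Borough'] != 'Not assigned'

# Keep only those rows; .copy() returns an independent frame
toy_assigned = toy[has_borough].copy()
print('New size is', toy_assigned.shape)
```

Only the row with an unassigned borough is dropped here; the row with an assigned borough but an unassigned neighborhood is kept for a later step.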


#### 3. More than one neighborhood can exist in one postal code area. Such rows will be combined into one row, with the neighborhoods separated by a comma

In [6]:
df2 = df1.groupby(['PostalCode', 'Borough'])['Neighborhood']\
.apply(lambda Neighborhood: ','.join(Neighborhood))\
.to_frame(name = 'Neighborhood').reset_index()
df2.head(3)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
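As a possible alternative to the `apply` with a lambda, the same combination can be sketched with `agg`, passing the bound method `','.join` directly. The toy rows and the names `toy` and `combined` are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows: two postal codes, several neighborhoods each
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C'],
    'Borough': ['Scarborough'] * 5,
    'Neighborhood': ['Rouge', 'Malvern', 'Highland Creek', 'Rouge Hill', 'Port Union'],
})

# Group by postal code and borough, then join each group's
# neighborhood names into one comma-separated string
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(','.join)
       .reset_index()
)
print(combined)
```

Within each group the neighborhoods keep their original row order, so the joined string follows the order of the source table.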


#### 4. If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.

In [7]:
# Find rows whose Neighborhood is still 'Not assigned'
df2.loc[df2['Neighborhood'] == 'Not assigned']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Not assigned |


In [8]:
df2.loc[df2['Neighborhood'] == 'Not assigned', 'Neighborhood'] = \
df2.loc[df2['Neighborhood'] == 'Not assigned', 'Borough']

In [9]:
df2.loc[df2['PostalCode'] == 'M7A']

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
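The replacement in this step can also be written with `Series.mask`, which swaps in values from another Series wherever a condition is true. This is a minimal sketch on hypothetical toy rows, not the notebook's own approach.

```python
import pandas as pd

# Hypothetical toy rows: one complete row, one with a missing neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# Where the neighborhood is 'Not assigned', take the borough name instead;
# all other neighborhoods are left unchanged
unassigned = toy['Neighborhood'] == 'Not assigned'
toy['Neighborhood'] = toy['Neighborhood'].mask(unassigned, toy['Borough'])
print(toy)
```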


#### 5. In the last cell of the notebook, use the `.shape` attribute to print the number of rows of the dataframe.

In [10]:
df2.shape

(103, 3)
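The row count alone does not show that the cleaning rules held. A few extra checks can be sketched as below; the helper name `find_problems` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_problems(df):
    """Return a list of rule violations found in a cleaned postal-code table."""
    problems = []
    # Each postal code should appear in exactly one row after grouping
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    # No borough or neighborhood should still be unassigned
    if (df[['Borough', 'Neighborhood']] == 'Not assigned').any().any():
        problems.append('unassigned values remain')
    return problems

# Hypothetical toy frames: one clean, one violating both rules
clean = pd.DataFrame({
    'PostalCode': ['M1B', 'M7A'],
    'Borough': ['Scarborough', "Queen's Park"],
    'Neighborhood': ['Rouge,Malvern', "Queen's Park"],
})
dirty = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Not assigned'],
})

print(find_problems(clean))
print(find_problems(dirty))
```

Calling `find_problems(df2)` in the notebook would be one way to confirm the final frame before moving on to the next part.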