# Toronto Neighbourhoods: Explore, Analyze and Cluster
### Toronto M Postal Code Data 

### Table of Contents
1. [Installing and Importing Libraries](#libraries)
2. [Part 1 Data Identification and Preparation](#Part1)
3. [Part 2 Getting Latitude and Longitude of Neighbourhoods](#Part2)
4. [Part 3 Clustering Toronto Neighbourhoods](#Part3)

__ __

### 1. Installing and Importing Libraries <a class="anchor" id="libraries"></a>

In [1]:
try:
    print("Installing Libraries...\n")
    !conda install -c conda-forge beautifulsoup4 --yes
    print("BeautifulSoup4 has been successfully installed!\n")
    !conda install -c conda-forge ProgressBar2 --yes
    print("ProgressBar has been successfully installed!\n")
    !conda install -c conda-forge lxml --yes
    print("lxml has been successfully installed!\n")
    !conda install -c conda-forge geopy --yes
    print("GeoPy has been successfully installed!\n")
    !conda install -c conda-forge folium=0.5.0 --yes
    print("Folium has been successfully installed!\n")
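    print("All libraries have been successfully installed!\n")
except Exception as err:
    # A failing `!conda` shell command does not raise a Python exception,
    # so this branch only catches errors from the Python statements above.
    print(f"ERROR: could not install libraries: {err}\n")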

try:
    print("Importing libraries...\n")
    import numpy as np # library to handle data in a vectorized manner
    import pandas as pd # library for data analysis
    from bs4 import BeautifulSoup as bts # library for web scraping
    from pandas.io.json import json_normalize
    from IPython.display import Image 
    from IPython.core.display import HTML 
    import matplotlib as mp # library for visualization
    import matplotlib.cm as cm
    import matplotlib.colors as colors
    import requests # library to handle requests
    from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
    from sklearn.cluster import KMeans # import k-means from clustering stage
    import folium # map rendering library
    import lxml
    import re
    from time import sleep
    print("All libraries imported successfully!\n")
except ImportError as err:
    # Report which import failed instead of hiding the cause.
    print(f"ERROR: Could not import all libraries: {err}\n")

%matplotlib inline

Installing Libraries...

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    soupsieve-1.7.3            |           py36_0          52 KB  conda-forge
    cryptography-2.4.2         |   py36h1ba5d50_0         618 KB
    beautifulsoup4-4.7.1       |        py36_1001         140 KB  conda-forge
    openssl-1.1.1a             |    h14c3975_1000         4.0 MB  conda-forge
    libarchive-3.3.3           |       h5d8350f_5         1.5 MB
    grpcio-1.16.1              |   py36hf8bcb03_1         1.1 MB
    conda-4.6.2                |           py36_0         869 KB  conda-forge
    libssh2-1.8.0              |                1         239 KB  conda-forge
    python-3.6.8               |       h0371630_0        34.4 MB
    ---------------------------------
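
Because a failed `!conda install` shell command does not raise a Python exception, the success messages above print regardless of the outcome. A minimal sketch of an independent check follows, assuming the import names listed in `REQUIRED` (a hypothetical list assembled from the imports in this notebook) and using only the standard library:

```python
import importlib.util

# Import names (not conda package names) used in this notebook; this list is an assumption.
REQUIRED = ["numpy", "pandas", "bs4", "lxml", "geopy", "folium",
            "sklearn", "requests", "matplotlib"]

def missing_modules(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

Running this after the install cell shows which libraries, if any, still need attention before the import cell is executed.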

# Part 1 <a class="anchor" id="Part1"></a>
### __Data Identification and Preparation__

### 2. Scraping and Cleaning Toronto Neighborhood Data <a class="anchor" id="neighborhood_data"></a>


__Read the given Wikipedia page using the pandas read_html method (although BeautifulSoup (bs4) is installed and imported, it is not used here)__

In [2]:
try:
    print("Reading web page ...")
    url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
    wikipage = pd.read_html(url)
    print("Web page read  successful !")
except Exception as err:
    # Show the underlying cause (network error, missing parser, etc.).
    print(f"ERROR: could not read web page: {err}\n")

Reading web page ...
Web page read  successful !


__Check the type of the wikipage object__

In [3]:
type(wikipage)

list

__The wikipage object is a list; get its length__

In [4]:
len(wikipage)

3

__The wikipage list has 3 elements; inspect each of them in turn__

In [5]:
wikipage[0]

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned


In [6]:
wikipage[1]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,,Canadian postal codes,,,,,,,,,...,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL,NS,PE,NB,QC,ON,MB,SK,AB,...,L,M,N,P,R,S,T,V,X,Y
2,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
3,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,
4,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,...,,,,,,,,,,
5,A,B,C,E,G,H,J,K,L,M,...,,,,,,,,,,


In [7]:
wikipage[2]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,ON,MB,SK,AB,BC,NU/NT,YT,,,,,,
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


__wikipage is a list of data frames (one per HTML table on the page), and the relevant data is in wikipage[0]__

__Create the neighbour data frame from wikipage[0]__

In [8]:
neighbour = pd.DataFrame(wikipage[0])
neighbour.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
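
Selecting the table by position (`wikipage[0]`) depends on the page layout staying the same. A minimal sketch of selecting it by content instead is shown here; `find_table` is a hypothetical helper, and the two small frames are made-up stand-ins for the list returned by `pd.read_html`:

```python
import pandas as pd

def find_table(tables, required_value="Postcode"):
    """Return the first DataFrame whose header or first row contains required_value."""
    for table in tables:
        header_cells = [str(c) for c in table.columns]
        first_row = [str(v) for v in table.iloc[0]] if len(table) else []
        if required_value in header_cells or required_value in first_row:
            return table
    raise ValueError(f"No table containing {required_value!r} was found")

# Toy stand-ins for the list of tables (illustrative data only).
tables = [
    pd.DataFrame([["NL", "NS"], ["A", "B"]]),
    pd.DataFrame([["Postcode", "Borough", "Neighbourhood"],
                  ["M3A", "North York", "Parkwoods"]]),
]
picked = find_table(tables)
print(picked)
```

The helper checks both the column labels and the first row because, as in the output above, the header text can arrive as an ordinary data row.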


__Assign a proper header to the data frame; the header is the data frame's own first row: ['Postcode', 'Borough', 'Neighbourhood']__

In [9]:
header=neighbour[0:1].values.tolist()
header

[['Postcode', 'Borough', 'Neighbourhood']]

In [10]:
header = header[0]

In [11]:
neighbour.columns = header
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


__Delete row 0 and reset the index - Clean up neighbour data frame__

In [12]:
neighbour.drop(neighbour.index[:1], inplace=True)
neighbour.reset_index(drop=True, inplace=True)
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
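
The header promotion carried out over cells 9 to 12 can also be written as one small function. The following is a sketch on a toy frame whose rows are copied from the output above; `promote_first_row` is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

# Toy frame shaped like wikipage[0]: the header text sits in row 0.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

def promote_first_row(df):
    """Use row 0 as the column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)  # data rows, re-indexed from 0
    out.columns = df.iloc[0].tolist()         # first row becomes the header
    return out

clean = promote_first_row(raw)
print(clean)
```

Returning a new frame rather than modifying `raw` in place makes the step safe to re-run in a notebook.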


__Remove all rows where Borough is "Not assigned" and reset the index - Clean up neighbour data frame__

In [13]:
# Replace all 'Not assigned' in borough with np.nan
neighbour.replace({'Borough': 'Not assigned' }, np.nan, inplace = True)
# Drop whole row with NaN in "Borough" column
neighbour.dropna(subset=["Borough"], axis=0, inplace=True)
# Reset index, because we dropped some rows
neighbour.reset_index(drop=True, inplace=True)
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
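
The same clean-up can be expressed with a boolean mask instead of the replace-then-dropna sequence. This is an alternative sketch on a toy frame (rows taken from the outputs above), not the method used in this notebook:

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Note that the M7A row survives: its Borough is assigned even though its Neighbourhood is not.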


__Merge neighbourhoods that share a postal code - Prepare neighbour data frame__

In [14]:
# Groupby and join can be used for the purpose
neighbour = neighbour.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
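
To see what the groupby-and-join step does to duplicated postal codes, here is a small sketch on five rows copied from the earlier outputs. It uses `agg` rather than `apply`, which gives the same result for this case:

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Victoria Village"],
})

# One row per (Postcode, Borough); neighbourhood names joined with ", ".
merged = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Because `groupby` sorts by the group keys by default, the result is ordered by Postcode, which is why the real output above starts at M1B rather than M3A.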


__Set Neighbourhood = Borough where Neighbourhood is equal to 'Not assigned' - Prepare neighbour data frame__

In [15]:
neighbour['Neighbourhood'] = np.where(neighbour['Neighbourhood'] == 'Not assigned', neighbour['Borough'], neighbour['Neighbourhood'])
neighbour.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
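
The first five rows shown above are unaffected by this step, so its effect is not visible in the output. A small sketch using the M7A row from the raw table (Borough "Queen's Park", Neighbourhood "Not assigned") illustrates the replacement:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Where Neighbourhood is unassigned, fall back to the Borough name.
sample["Neighbourhood"] = np.where(
    sample["Neighbourhood"] == "Not assigned",
    sample["Borough"],
    sample["Neighbourhood"],
)
print(sample)
```

After the step, the M7A row carries "Queen's Park" in both the Borough and Neighbourhood columns, while the M3A row is unchanged.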


__Neighbour data frame shape__

In [16]:
neighbour.shape

(103, 3)
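
Beyond the shape, a few assertions can confirm that the preparation rules hold. This is a sketch under the assumption that the rules are exactly the ones applied above; `check_prepared` is a hypothetical helper, demonstrated on a toy frame rather than the real data:

```python
import pandas as pd

def check_prepared(df):
    """Return True if df satisfies the preparation rules, else raise AssertionError."""
    assert list(df.columns) == ["Postcode", "Borough", "Neighbourhood"], "unexpected columns"
    assert not df.duplicated(["Postcode", "Borough"]).any(), "duplicate Postcode/Borough pair"
    assert not (df["Borough"] == "Not assigned").any(), "unassigned borough left over"
    assert not (df["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood left over"
    return True

# Toy frame with rows copied from the output above.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Woburn"],
})
print(check_prepared(sample))
```

Calling `check_prepared(neighbour)` on the real frame would apply the same checks to all 103 rows.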