## Explore and cluster the neighborhoods in Toronto

Scraping a Wikipedia page with BeautifulSoup for Toronto neighborhood data

## 1. Import and install required libraries/modules 

In [1]:
import pandas as pd
!conda install -c conda-forge beautifulsoup4 --yes   # install BeautifulSoup
from bs4 import BeautifulSoup      # for scraping a webpage
from pandas import DataFrame       # for converting a list to a DataFrame
import requests                    # for requesting the webpage/URL
print('import and install finished')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    certifi-2020.6.20          |   py36h9f0ad1d_0         151 KB  conda-forge
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    beautifulsoup4-4.9.1       |   py36h9f0ad1d_0         163 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.6 MB

The following NEW packages will be INSTALLED:

    python_abi:      3.6-1_cp36m       conda-forge

The following packages will be UPDATED:

    beautifulsoup4:  4.7.1-py36_1                

## 2. Request, scrape (with BeautifulSoup), read (with pandas), and convert to a DataFrame named Toro

In [2]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'html.parser')    # parse the downloaded HTML
table = soup.find_all('table')[0]    # first table on the page holds the postal codes
df = pd.read_html(str(table))[0]     # let pandas read that table into a DataFrame
# make three lists from df for the new DataFrame
Postal_code = df["Postal Code"].tolist()
Borough = df["Borough"].tolist()
Neighborhood = df["Neighborhood"].tolist()
# make a dictionary that maps column names to the lists
d = {'PostalCodes': Postal_code, 'Borough': Borough, 'Neighborhood': Neighborhood}
# make the DataFrame
Toro = pd.DataFrame(d)
Toro

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
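
The parsing step above can also be done by hand, which shows what BeautifulSoup contributes before pandas takes over. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a made-up three-row stand-in for the Wikipedia table (so the example runs without a network request), and `table_to_frame` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: a tiny table with the same headers.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""


def table_to_frame(html):
    """Read the first <table> in `html` into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    # Collect the stripped text of every header/data cell, one list per row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


# Rename the first column to match the name used in this notebook.
sample = table_to_frame(SAMPLE_HTML).rename(columns={"Postal Code": "PostalCodes"})
print(sample)
```

Renaming the column directly avoids the intermediate lists and dictionary, though the list-based approach in the cell above produces the same columns.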


## 3. Remove rows whose Borough is not assigned

In [3]:
# Get the index labels of rows whose Borough is 'Not assigned'
indexNames = Toro[Toro['Borough'] == 'Not assigned'].index
# Delete those rows from the DataFrame in place
Toro.drop(indexNames, inplace=True)
Toro

Unnamed: 0,PostalCodes,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
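
The same filtering can be written as a single boolean-mask selection instead of collecting index labels and dropping them. This is a small sketch on made-up data: the `sample` frame and the `assigned` name are assumptions for illustration, not the notebook's `Toro`.

```python
import pandas as pd

# Made-up rows that mimic the first few lines of the scraped table.
sample = pd.DataFrame({
    "PostalCodes": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned; .copy() gives an independent frame.
assigned = sample[sample["Borough"] != "Not assigned"].copy()
print(assigned)
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched, so the cell can be re-run without side effects. The surviving rows keep their original index labels, as in the output above.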


## 4. Reset index and discard old index

In [4]:
print(Toro.index)

Int64Index([  2,   3,   4,   5,   6,   8,   9,  11,  12,  13,
            ...
            151, 152, 153, 156, 157, 160, 165, 168, 169, 178],
           dtype='int64', length=103)


In [5]:
Toro.reset_index(inplace = True)

In [35]:
Toro.head()

Unnamed: 0,index,PostalCodes,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [6]:
Toro.drop('index',axis = 1, inplace=True)

In [7]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
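
Resetting the index and then dropping the leftover `index` column can be combined into one call with `reset_index(drop=True)`. A minimal sketch, assuming a made-up `filtered` frame whose index has gaps like the one printed earlier:

```python
import pandas as pd

# Made-up frame with a gapped index, as left behind after dropping rows.
filtered = pd.DataFrame(
    {
        "PostalCodes": ["M3A", "M4A", "M9A"],
        "Borough": ["North York", "North York", "Etobicoke"],
    },
    index=[2, 3, 8],
)

# drop=True discards the old index instead of inserting it as an 'index' column.
renumbered = filtered.reset_index(drop=True)
print(renumbered)
```

The result has a fresh 0-based index and no extra column, so the separate `drop('index', axis=1)` step is not needed.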


In [8]:
Toro.shape

(103, 3)

## 5. Check whether any Borough has a Neighborhood that is not assigned

In [9]:
df = Toro    # alias, not a copy: the new column also appears in Toro
df.loc[df['Neighborhood'] != 'Not assigned', 'Assigned'] = 'Yes'
df.describe()    # check whether any Neighborhood has the value 'Not assigned'

Unnamed: 0,PostalCodes,Borough,Neighborhood,Assigned
count,103,103,103,103
unique,103,10,99,1
top,M4B,North York,Downsview,Yes
freq,1,24,4,103


In the result above, the Assigned column has a count of 103 and a single unique value, Yes, so all 103 rows have an assigned Neighborhood.
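
The same check can be made without a helper column by counting the matching rows directly. The sketch below uses made-up data; the `M9Z` row is a hypothetical case with an assigned Borough but no Neighborhood, included only to show that the count picks it up.

```python
import pandas as pd

# Made-up rows; the last one is a hypothetical unassigned Neighborhood.
sample = pd.DataFrame({
    "PostalCodes": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Not assigned"],
})

# Boolean Series: True where the Neighborhood is the placeholder value.
unassigned = sample["Neighborhood"].eq("Not assigned")
n_unassigned = int(unassigned.sum())

print(n_unassigned)        # number of rows still lacking a Neighborhood
print(sample[unassigned])  # the offending rows, if any
```

A count of zero on the real data would confirm the conclusion above, and nothing has to be dropped from the DataFrame afterwards.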

In [10]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood,Assigned
0,M3A,North York,Parkwoods,Yes
1,M4A,North York,Victoria Village,Yes
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",Yes
3,M6A,North York,"Lawrence Manor, Lawrence Heights",Yes
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",Yes


In [11]:
Toro.drop('Assigned',axis = 1, inplace=True)

In [12]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [13]:
Toro.shape

(103, 3)