## Explore and cluster the neighborhoods in Toronto

Scraping a Wikipedia page with BeautifulSoup to collect Toronto neighborhood data

## 1. Import and install required libraries/modules 

In [3]:
import pandas as pd 
!conda install -c conda-forge beautifulsoup4 --yes         # install BeautifulSoup
from bs4 import BeautifulSoup              # for scraping a webpage
from pandas import DataFrame       # for converting a list to a DataFrame
import requests        # make a request to get the webpage/URL
print('import and install finished')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    beautifulsoup4-4.9.1       |   py36h9f0ad1d_0         163 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.6.20          |   py36h9f0ad1d_0         151 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.6 MB

The following NEW packages will be INSTALLED:

    python_abi:      3.6-1_cp36m       conda-forge

The following packages will be UPDATED:

    beautifulsoup4:  4.7.1-py36_1                

## 2. Request, scrape (with BeautifulSoup), read (with pandas), and convert to a DataFrame named Toro

In [5]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'html.parser')    # parse the downloaded HTML
table = soup.find_all('table')[0]    # first table on the page
df = pd.read_html(str(table))[0]    # let pandas read that table into a DataFrame
# make three lists from df for the new DataFrame
Postal_code = df["Postal Code"].tolist()
Borough = df["Borough"].tolist()
Neighborhood = df["Neighborhood"].tolist()
# make a dictionary mapping column names to the lists
d = {'PostalCodes': Postal_code, 'Borough': Borough, 'Neighborhood': Neighborhood}
# build the DataFrame
Toro = pd.DataFrame(d)
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


## 3. Remove cells/rows with a Borough not assigned.

In [6]:
# Get the index labels of rows whose Borough is 'Not assigned'
indexNames = Toro[Toro['Borough'] == 'Not assigned'].index
# Delete those rows from the DataFrame
Toro.drop(indexNames, inplace=True)
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
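
The same rows can also be removed with a boolean mask instead of collecting index labels first. The following is a minimal sketch on a toy frame; the name `toy` is hypothetical and its rows are copied from the output above, so it stands in for the scraped table rather than reproducing it.

```python
import pandas as pd

# Toy stand-in for the scraped table (rows copied from the output above).
toy = pd.DataFrame({
    "PostalCodes": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned.
# The original index labels (2, 3) survive the filter.
assigned = toy[toy["Borough"] != "Not assigned"]
print(assigned)
```

Because the filter keeps the old labels, the index still has gaps afterwards, which is what the next step addresses.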


## 4. Reset index and discard old index

In [7]:
print(Toro.index)

Int64Index([  2,   3,   4,   5,   6,   8,   9,  11,  12,  13,
            ...
            151, 152, 153, 156, 157, 160, 165, 168, 169, 178],
           dtype='int64', length=103)


In [8]:
Toro.reset_index(inplace = True)

In [9]:
Toro.head()

Unnamed: 0,index,PostalCodes,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [10]:
Toro.drop('index',axis = 1, inplace=True)

In [11]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
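
Resetting the index and dropping the old one can be done in a single call by passing `drop=True` to `reset_index`. A small sketch, assuming a toy frame named `toy` (hypothetical) with the same gapped index as above:

```python
import pandas as pd

# Toy frame with a gapped index, like Toro after the rows were dropped.
toy = pd.DataFrame({"PostalCodes": ["M3A", "M4A", "M5A"]}, index=[2, 3, 4])

# drop=True discards the old index instead of adding it as an 'index' column.
toy = toy.reset_index(drop=True)
print(toy)
```

This avoids the extra `index` column and the separate `drop` call.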


In [12]:
Toro.shape

(103, 3)

## 5. Check whether any Borough has a Not assigned Neighborhood

In [13]:
df = Toro  # df is an alias, so the new column is also added to Toro
df.loc[df['Neighborhood'] != 'Not assigned', 'Assigned'] = 'Yes'
df.describe()  # check whether any Neighborhood has the value 'Not assigned'

Unnamed: 0,PostalCodes,Borough,Neighborhood,Assigned
count,103,103,103,103
unique,103,10,99,1
top,M2R,North York,Downsview,Yes
freq,1,24,4,103


The result above shows that all 103 rows are flagged Yes, so every Neighborhood has been assigned.
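
The same check can be made without a helper column by counting the matching rows directly. A minimal sketch on a toy frame (the name `toy` is hypothetical; the rows are taken from the output above):

```python
import pandas as pd

# Toy stand-in for Toro after the unassigned boroughs were removed.
toy = pd.DataFrame({
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# Comparing gives a boolean Series; summing it counts the True values.
unassigned = (toy["Neighborhood"] == "Not assigned").sum()
print(unassigned)
```

A count of zero means no row needs its Neighborhood filled in, and nothing has to be dropped afterwards.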

In [14]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood,Assigned
0,M3A,North York,Parkwoods,Yes
1,M4A,North York,Victoria Village,Yes
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",Yes
3,M6A,North York,"Lawrence Manor, Lawrence Heights",Yes
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",Yes


In [15]:
Toro.drop('Assigned',axis = 1, inplace=True)

In [16]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [17]:
Toro.shape

(103, 3)

## Question/part 2 - adding latitude and longitude

The geocoder provided did not work, so we read the CSV file provided at the link instead.

In [18]:
import pandas as pd

path = "https://cocl.us/Geospatial_data"
df2 = pd.read_csv(path)
df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## a) the csv data seems to be sorted by Postal Code. Let's sort our dataframe to match it.

In [25]:
Toro.sort_values('PostalCodes',inplace=True)
Toro.head()

Unnamed: 0,level_0,index,PostalCodes,Borough,Neighborhood
0,0,6,M1B,Scarborough,"Malvern, Rouge"
1,1,12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,2,18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,3,22,M1G,Scarborough,Woburn
4,4,26,M1H,Scarborough,Cedarbrae


A quick scan shows that the sorted dataframe matches the CSV file. The `level_0` and `index` columns are leftovers from earlier index resets, so we drop them.

In [28]:
Toro.drop('level_0',axis = 1, inplace=True)

In [29]:
Toro.drop('index',axis = 1, inplace=True)

In [30]:
Toro.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
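
Rather than relying on a visual scan, the order of the two postal code columns can be compared programmatically. This is a sketch under the assumption that both frames have the same number of rows; `left` and `right` are hypothetical stand-ins for `Toro` and `df2`.

```python
import pandas as pd

# Toy stand-ins for the two frames (postal codes copied from the output above).
left = pd.DataFrame({"PostalCodes": ["M1B", "M1C", "M1E"]})
right = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M1E"]})

# Compare the codes position by position, ignoring index labels.
same_order = left["PostalCodes"].tolist() == right["Postal Code"].tolist()
print(same_order)
```

If the comparison is false, a positional concatenation would pair coordinates with the wrong neighborhoods.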


## b) now that the two dataframes match, let's concatenate them

In [31]:
result = pd.concat([Toro, df2], axis=1, sort=False)
result.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
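
Concatenating side by side only works when both frames are in the same row order and share the same index. An alternative that does not depend on ordering is to join on the postal code with `merge`. The sketch below uses toy frames (`neighborhoods` and `coords` are hypothetical names) whose rows are deliberately in different orders.

```python
import pandas as pd

# Toy stand-ins; note that the two frames list the postal codes in different orders.
neighborhoods = pd.DataFrame({
    "PostalCodes": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge Hill, Port Union, Highland Creek", "Malvern, Rouge"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Join on the postal code; validate raises if a code is duplicated on either side.
merged = neighborhoods.merge(
    coords,
    left_on="PostalCodes",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")
print(merged)
```

With this approach the sorting step and the later removal of the duplicate `Postal Code` column are folded into one expression.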


## c) let's drop the extra column Postal Code

In [36]:
result.drop('Postal Code', axis = 1, inplace=True)
result.head()

Unnamed: 0,PostalCodes,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [35]:
result.head(12)

Unnamed: 0,PostalCodes,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
