# Criminal Project

This notebook is part of the Applied Data Science Capstone course,
exercise: Segmenting and Clustering Neighborhoods in Toronto.


In [23]:
!pip install lxml geopy folium



In [2]:
# pandas does the scraping (read_html relies on lxml, installed above)
import pandas as pd

### Reading the HTML table from Wikipedia
Using the URL defined below, pd.read_html reads every table on the page and returns them as a list of dataframes, dfs.

In this case the page contains three tables (the length of dfs), and the one we want is the first element of the list.

Each element of dfs is already a pandas dataframe; the last line wraps the first one in pd.DataFrame and assigns it to df.

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
dfs = pd.read_html(url)
df = pd.DataFrame(dfs[0])
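
Selecting the table by position works only as long as the page layout stays the same. As a hedged sketch of an alternative, the match argument of pd.read_html keeps only the tables whose text matches a given string. The HTML below is a hypothetical stand-in for the downloaded page, not the real Wikipedia markup, and the names html, tables and postal_df are illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, only one of
# which is the postal-code table.
html = """
<table>
  <thead><tr><th>Legend</th></tr></thead>
  <tbody><tr><td>Unrelated navigation table</td></tr></tbody>
</table>
<table>
  <thead>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  </thead>
  <tbody>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
"""

# match keeps only the tables whose text matches the given string or regex,
# so the result no longer depends on the table's position in the page.
tables = pd.read_html(io.StringIO(html), match="Postal Code")
postal_df = tables[0]
print(postal_df)
```

With the real page, the same call would take the url variable in place of the io.StringIO wrapper.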

Showing the first 12 rows of the dataframe

In [4]:
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


Removing rows whose borough is Not assigned.

In [5]:
# Show the shape of the dataframe before dropping rows, for later comparison.
df.shape

(180, 3)

In [6]:
df.drop(df.loc[df['Borough'] == 'Not assigned'].index, inplace=True)

In [7]:
# Show the shape again to compare with the previous value.
df.shape

(103, 3)
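
The same filtering can be written with a boolean mask, which returns a new dataframe instead of modifying the original in place. This is a minimal sketch on a small hypothetical frame (toy and assigned are illustrative names), not a replacement for the cell above.

```python
import pandas as pd

# Small hypothetical frame with the same columns as the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned; toy itself is left untouched.
assigned = toy[toy["Borough"] != "Not assigned"]
print(toy.shape, "->", assigned.shape)
```

As with drop, the surviving rows keep their original index labels, so a call to reset_index(drop=True) is needed if a contiguous index is wanted.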

When more than one neighborhood exists in a postal code area, the rows are combined into a single row with the neighborhoods separated by commas.

In [8]:
rdf = df.groupby("Postal Code")["Neighbourhood"].agg(", ".join)
rdf

Postal Code
M1B                                       Malvern, Rouge
M1C               Rouge Hill, Port Union, Highland Creek
M1E                    Guildwood, Morningside, West Hill
M1G                                               Woburn
M1H                                            Cedarbrae
                             ...                        
M9N                                               Weston
M9P                                            Westmount
M9R    Kingsview Village, St. Phillips, Martin Grove ...
M9V    South Steeles, Silverstone, Humbergate, Jamest...
M9W                  Northwest, West Humber - Clairville
Name: Neighbourhood, Length: 103, dtype: object
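
The result above has 103 entries, the same as the number of rows in df, which suggests that each postal code already occupies a single row in this version of the table. To show what the grouping does when a postal code is spread over several rows, here is a sketch on a hypothetical frame (toy and merged are illustrative names). Grouping on both Postal Code and Borough is an assumption made here so that the borough column survives the aggregation.

```python
import pandas as pd

# Hypothetical layout in which M5A is spread over two rows.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Malvern"],
})

# Join the neighbourhood names of each group with ", " and turn the
# group keys back into ordinary columns.
merged = (
    toy.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```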

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [9]:
def notAssNei(Borough, Neighbourhood):
    # Fall back to the borough name when only the neighbourhood is unassigned
    if (Borough != "Not assigned") and (Neighbourhood == "Not assigned"):
        return Borough
    else:
        return Neighbourhood

Applying the function above to the dataframe

In [10]:
df['Neighbourhood'] = df.apply(lambda x: notAssNei(x['Borough'], x['Neighbourhood']), axis=1)
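
The row-wise apply is easy to read; on larger tables a vectorized form is usually faster. The following is a sketch of one possible equivalent using Series.mask, shown on a hypothetical frame (toy and needs_fill are illustrative names).

```python
import pandas as pd

# Hypothetical rows: the second one has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# True where the borough is known but the neighbourhood is not.
needs_fill = (toy["Borough"] != "Not assigned") & (
    toy["Neighbourhood"] == "Not assigned"
)

# mask replaces the values where the condition is True with the borough name.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(needs_fill, toy["Borough"])
print(toy)
```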


Showing the shape of the Dataframe

In [11]:
df.shape

(103, 3)

Getting the geographical coordinates of the neighborhoods from the CSV file that lists the coordinates of each postal code.

In [12]:
url = "https://cocl.us/Geospatial_data"
dfGD = pd.read_csv(url)

Joining both dataframes on Postal Code to get a single dataframe.

In [13]:
dfJ = df.join(dfGD.set_index('Postal Code'),on="Postal Code") 
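
A join silently leaves NaN coordinates for any postal code that is missing from the CSV file. As a hedged sketch, DataFrame.merge with indicator=True makes such rows easy to find, and validate="one_to_one" raises an error if a postal code appears more than once on either side. The frames below are hypothetical: M9Z is a made-up code used to show an unmatched row, while the two coordinate pairs are taken from the output further down.

```python
import pandas as pd

# Hypothetical frames; M9Z is a made-up code with no coordinates.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a _merge column saying where each row came from.
joined = left.merge(
    coords, on="Postal Code", how="left", validate="one_to_one", indicator=True
)

# Postal codes that found no match in the coordinates table.
missing = joined.loc[joined["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```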

Showing the dataframe

In [14]:
dfJ

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
165,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


### Explore and cluster the neighborhoods in Toronto. 

Importing necessary modules

In [40]:
from geopy.geocoders import Nominatim
import folium

Getting Toronto Coordinates

In [41]:
address = "Toronto, ON"
geoloc = Nominatim(user_agent="Toronto")
locat = geoloc.geocode(address)
print('Toronto coordinates {}, {}.'.format(locat.latitude, locat.longitude))

Toronto coordinates 43.6534817, -79.3839347.
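
Nominatim's geocode returns None when it finds no match, in which case the print call above would fail with an AttributeError. Below is a minimal sketch of a guard around the lookup. The helper lookup_coordinates and its fallback parameter are hypothetical names introduced here, and the stub geocoders exist only so that the example runs without network access.

```python
from types import SimpleNamespace


def lookup_coordinates(geocode, address, fallback):
    """Return (latitude, longitude) for address, or fallback if nothing is found.

    geocode is any callable that behaves like Nominatim.geocode: it returns
    an object with latitude and longitude attributes, or None.
    """
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


# Stub geocoders standing in for the real service (illustration only).
def found(address):
    return SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)


def not_found(address):
    return None


hit = lookup_coordinates(found, "Toronto, ON", fallback=(0.0, 0.0))
miss = lookup_coordinates(not_found, "Nowhere", fallback=(0.0, 0.0))
print(hit, miss)
```

In the notebook, the call would be lookup_coordinates(geoloc.geocode, address, fallback) with a fallback of your choice.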


Creating the map and adding a circle marker for each postal code

In [37]:
toronto_map = folium.Map(location=[locat.latitude, locat.longitude], zoom_start=11)

In [38]:
for lat, lng, borough, neighbourhood in zip(dfJ['Latitude'], dfJ['Longitude'], dfJ['Borough'], dfJ['Neighbourhood']):
    # Popup text shown when a marker is clicked
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='green',
        fill=True,
        fill_color='#2344c3',
        fill_opacity=0.6,
    ).add_to(toronto_map)
toronto_map