# Analysis of neighbourhoods in Toronto

This analysis collects and groups information about venues and crime in Toronto in order to suggest how neighbourhoods can be grouped and priced.

1. A full report consisting of all of the following components (15 marks):
 1. Introduction where you discuss the business problem and who would be interested in this project.
 1. Data where you describe the data that will be used to solve the problem and the source of the data.
 1. Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learning techniques were used and why.
 1. Results section where you discuss the results.
 1. Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
 1. Conclusion section where you conclude the report.
2. A link to your Notebook on your Github repository pushed showing your code. (15 marks)

3. Your choice of a presentation or blogpost. (10 marks)

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Part 1. Prepare Toronto Neighborhood data</a>

1. <a href="#item2">Part 2. Explore</a>
    
1. <a href="#item3">Part 3. Prepare Toronto Crime data</a>
    
1. <a href="#item4">Part 4. Venue data</a>

1. <a href="#item5">Part 5. Combine dataset</a>
    
1. <a href="#item6">Part 6. Cluster</a>   
    
1. <a href="#item7">Conclusion</a>

</font>
</div>

In [1]:
import pandas as pd
import numpy as np

In [2]:
import requests
import lxml
from bs4 import BeautifulSoup

In [3]:
from geopy.geocoders import Nominatim 

In [4]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Part 1. Prepare Toronto Neighborhood data <a class="anchor" id="item1"></a>

### Get data into data frame

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
page_soup = BeautifulSoup(page.content, 'html.parser')     

In [6]:
table_raw = page_soup.table

In [7]:
tbl_header = []
for l in table_raw.find_all('th'):
    tbl_header.append(l.string.strip('\n'))

In [8]:
tbl_header

['Postal code', 'Borough', 'Neighborhood']

In [9]:
tbl_content = []
for l in table_raw.find_all('td'):
    tbl_content.append(l.string.strip('\n'))

In [10]:
len(tbl_content)

540

In [11]:
n_cols = len(tbl_header)
tbl_content_split = [tbl_content[x:x+n_cols] for x in range(0, len(tbl_content), n_cols)]

In [12]:
toronto_postal_codes_raw = pd.DataFrame(tbl_content_split, columns=tbl_header)
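The reshaping step in the two cells above can be illustrated on a toy list. This is a minimal sketch with made-up cell values (the `cells` and `header` variables are placeholders, not the scraped data). It also checks that the flat list divides evenly into rows, which the slicing above does not verify on its own.

```python
import pandas as pd

# Hypothetical flat list of table cells, read row by row (three columns per row).
header = ['Postal code', 'Borough', 'Neighborhood']
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods',
         'M4A', 'North York', 'Victoria Village']

n_cols = len(header)
# A ragged final row would silently produce missing values, so fail early instead.
if len(cells) % n_cols != 0:
    raise ValueError('cell count is not a multiple of the column count')

# Slice the flat list into consecutive chunks of n_cols items, one chunk per row.
rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
df = pd.DataFrame(rows, columns=header)
print(df)
```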

### Cleanup data

In [13]:
new_df = toronto_postal_codes_raw[toronto_postal_codes_raw['Borough'] != 'Not assigned']

In [14]:
# There are no duplicate postal codes, contrary to what the exercise suggests
new_df[new_df.duplicated('Postal code')]

Unnamed: 0,Postal code,Borough,Neighborhood


In [15]:
# There are no rows whose Neighborhood is 'Not assigned' or empty
new_df[(new_df['Neighborhood'] == 'Not assigned') | (new_df['Neighborhood'] == '')]

Unnamed: 0,Postal code,Borough,Neighborhood


In [16]:
# Replace ' /' in Neighborhood with ','
# assign() returns a new dataframe, which avoids modifying a filtered slice in place
new_df = new_df.assign(Neighborhood=new_df['Neighborhood'].replace(' /', ',', regex=True))
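pandas can raise a `SettingWithCopyWarning` when a filtered dataframe is modified in place, because it cannot tell whether the change should reach the original. A minimal sketch of one way to avoid it, using toy data (the values and the `raw` and `assigned` names are illustrative): take an explicit copy after filtering and assign results back to the column.

```python
import pandas as pd

# Toy data standing in for the scraped table (values are illustrative only).
raw = pd.DataFrame({
    'Borough': ['Not assigned', 'Etobicoke', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Alderwood / Long Branch', 'Berczy Park'],
})

# .copy() makes the filtered frame independent of `raw`,
# so later assignments are unambiguous and pandas does not warn.
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# Assign the result back to the column instead of using inplace=True.
assigned['Neighborhood'] = assigned['Neighborhood'].str.replace(' /', ',', regex=False)

print(assigned)
```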


In [17]:
toronto_postal_codes = new_df.reset_index(drop=True)

In [18]:
toronto_postal_codes.rename(columns={'Postal code': 'Postal Code'}, inplace=True)

In [19]:
toronto_postal_codes.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [20]:
toronto_postal_codes.tail()

Unnamed: 0,Postal Code,Borough,Neighborhood
98,M8X,Etobicoke,"The Kingsway, Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
102,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [21]:
toronto_postal_codes.shape

(103, 3)

### Add Latitude and Longitude to Postal Codes

In [22]:
url = 'https://cocl.us/Geospatial_data'

In [23]:
df_lat_lon = pd.read_csv(url)

In [24]:
df_lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [25]:
df_lat_lon.shape

(103, 3)

In [26]:
toronto_postal_codes_w_coords = pd.merge(
    toronto_postal_codes,
    df_lat_lon,
    on='Postal Code')
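`pd.merge` defaults to an inner join, so postal codes without coordinates would be dropped silently. The sketch below shows one way to check for that on toy frames; the `M9Z` row is a made-up code added only to demonstrate an unmatched key.

```python
import pandas as pd

# Toy frames; 'M9Z' is a hypothetical code with no coordinates.
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# how='left' keeps every postal code; indicator=True adds a '_merge' column
# telling which rows found a partner; validate raises if keys are duplicated.
merged = pd.merge(codes, coords, on='Postal Code', how='left',
                  indicator=True, validate='one_to_one')

unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.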

In [27]:
toronto_postal_codes_w_coords.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


# Part 2. Explore neighborhoods <a class="anchor" id="item2"></a>

In [43]:
import folium

In [69]:
toronto_postal_codes_w_coords['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

### Create map

In [47]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="new_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


In [49]:
# create map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(
    toronto_postal_codes_w_coords['Latitude'],
    toronto_postal_codes_w_coords['Longitude'], 
    toronto_postal_codes_w_coords['Borough'], 
    toronto_postal_codes_w_coords['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

# Part 3. Prepare Toronto Crime data <a class="anchor" id="item3"></a>

In [28]:
url = 'https://opendata.arcgis.com/datasets/f4c2e5de021f4836a3caf77f8421f487_0.csv?outSR=%7B%22latestWkid%22%3A3857%2C%22wkid%22%3A102100%7D'

df_crime = pd.read_csv(url)

In [29]:
df_crime.head()

Unnamed: 0,X,Y,Index_,event_unique_id,occurrencedate,reporteddate,premisetype,ucr_code,ucr_ext,offence,...,occurrencedayofyear,occurrencedayofweek,occurrencehour,MCI,Division,Hood_ID,Neighbourhood,Long,Lat,ObjectId
0,-79.405228,43.656982,7801,GO-20152165447,2015-12-18T03:58:00.000Z,2015-12-18T03:59:00.000Z,Commercial,1430,100,Assault,...,352.0,Friday,3,Assault,D14,79,University (79),-79.405228,43.656982,7001
1,-79.307907,43.778732,7802,GO-20151417245,2015-08-15T21:45:00.000Z,2015-08-17T22:11:00.000Z,Commercial,1430,100,Assault,...,227.0,Saturday,21,Assault,D42,118,Tam O'Shanter-Sullivan (118),-79.307907,43.778732,7002
2,-79.225029,43.765942,7803,GO-20151421107,2015-08-16T16:00:00.000Z,2015-08-18T14:40:00.000Z,Apartment,2120,200,B&E,...,228.0,Sunday,16,Break and Enter,D43,137,Woburn (137),-79.225029,43.765942,7003
3,-79.140823,43.778648,7804,GO-20152167714,2015-11-26T13:00:00.000Z,2015-12-18T13:38:00.000Z,Other,2120,200,B&E,...,330.0,Thursday,13,Break and Enter,D43,133,Centennial Scarborough (133),-79.140823,43.778648,7004
4,-79.288361,43.691235,7805,GO-20152169954,2015-12-18T19:50:00.000Z,2015-12-18T19:55:00.000Z,Commercial,1430,100,Assault,...,352.0,Friday,19,Assault,D55,61,Taylor-Massey (61),-79.288361,43.691235,7005


In [30]:
df_crime.shape

(206435, 29)

In [32]:
df_crime.columns

Index(['X', 'Y', 'Index_', 'event_unique_id', 'occurrencedate', 'reporteddate',
       'premisetype', 'ucr_code', 'ucr_ext', 'offence', 'reportedyear',
       'reportedmonth', 'reportedday', 'reporteddayofyear',
       'reporteddayofweek', 'reportedhour', 'occurrenceyear',
       'occurrencemonth', 'occurrenceday', 'occurrencedayofyear',
       'occurrencedayofweek', 'occurrencehour', 'MCI', 'Division', 'Hood_ID',
       'Neighbourhood', 'Long', 'Lat', 'ObjectId'],
      dtype='object')

In [33]:
df_crime['MCI'].value_counts()

Assault            111423
Break and Enter     43302
Auto Theft          23380
Robbery             21543
Theft Over           6787
Name: MCI, dtype: int64

In [34]:
df_crime['Neighbourhood'].value_counts()

Waterfront Communities-The Island (77)    7747
Bay Street Corridor (76)                  6817
Church-Yonge Corridor (75)                6232
West Humber-Clairville (1)                5702
Moss Park (73)                            4786
                                          ... 
Yonge-St.Clair (97)                        412
Guildwood (140)                            411
Maple Leaf (29)                            410
Woodbine-Lumsden (60)                      377
Lambton Baby Point (114)                   353
Name: Neighbourhood, Length: 140, dtype: int64

In [65]:
df_crime[df_crime['Neighbourhood'].str.match('Victoria Village')==True]

Unnamed: 0,X,Y,Index_,event_unique_id,occurrencedate,reporteddate,premisetype,ucr_code,ucr_ext,offence,...,occurrencedayofyear,occurrencedayofweek,occurrencehour,MCI,Division,Hood_ID,Neighbourhood,Long,Lat,ObjectId


In [102]:
df_crime_clean = df_crime[['MCI', 'Neighbourhood', 'Long', 'Lat']]
df_crime_clean.columns = ['MCI', 'Crime Neighbourhood', 'Crime Lon', 'Crime Lat']
df_crime_clean.head(5)

Unnamed: 0,MCI,Crime Neighbourhood,Crime Lon,Crime Lat
0,Assault,University (79),-79.405228,43.656982
1,Assault,Tam O'Shanter-Sullivan (118),-79.307907,43.778732
2,Break and Enter,Woburn (137),-79.225029,43.765942
3,Break and Enter,Centennial Scarborough (133),-79.140823,43.778648
4,Assault,Taylor-Massey (61),-79.288361,43.691235


In [103]:
# Work on an independent copy so the assignments below do not trigger SettingWithCopyWarning
df_crime_clean = df_crime_clean.copy()

for name in toronto_postal_codes_w_coords['Neighborhood']:
    # A postal code can cover several comma-separated neighbourhoods;
    # try each of them against the crime neighbourhood label.
    for n in name.split(','):
        n = n.lstrip()
        # str.match treats n as a regular expression anchored at the start of the label
        matches = df_crime_clean['Crime Neighbourhood'].str.match(n, na=False)
        df_crime_clean.loc[matches, 'Neighbourhood'] = name


In [104]:
df_crime_clean.head(5)

Unnamed: 0,MCI,Crime Neighbourhood,Crime Lon,Crime Lat,Neighbourhood
0,Assault,University (79),-79.405228,43.656982,
1,Assault,Tam O'Shanter-Sullivan (118),-79.307907,43.778732,"Clarks Corners, Tam O'Shanter, Sullivan"
2,Break and Enter,Woburn (137),-79.225029,43.765942,Woburn
3,Break and Enter,Centennial Scarborough (133),-79.140823,43.778648,
4,Assault,Taylor-Massey (61),-79.288361,43.691235,


In [70]:
df_crime_clean.shape

(206435, 5)

In [73]:
len(df_crime_clean[df_crime_clean['Neighbourhood'].isna()])

117933

It looks like more than half of the crime records (117,933 of 206,435) are not matched to a postal neighbourhood. Let's check which crime neighbourhoods are affected.

In [87]:
df_crime_nan = df_crime_clean[df_crime_clean['Neighbourhood'].isna()]
print(df_crime_nan.shape)
print(df_crime_nan['Crime Neighbourhood'].value_counts())

(117933, 5)
Waterfront Communities-The Island (77)    7747
Bay Street Corridor (76)                  6817
Church-Yonge Corridor (75)                6232
West Humber-Clairville (1)                5702
Moss Park (73)                            4786
                                          ... 
Old East York (58)                         479
Yonge-St.Clair (97)                        412
Maple Leaf (29)                            410
Woodbine-Lumsden (60)                      377
Lambton Baby Point (114)                   353
Name: Crime Neighbourhood, Length: 78, dtype: int64


That is 78 crime neighbourhoods without a match. Let's look at the top 5; maybe the naming is slightly different?
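One visible difference is that the crime labels carry a numeric id in parentheses, for example `Woburn (137)`. A minimal sketch of stripping that suffix before comparing names is shown below; `normalise_name` is a hypothetical helper, not part of the notebook, and it does not resolve hyphenated labels such as `Taylor-Massey`.

```python
import re

def normalise_name(label):
    """Strip a trailing '(id)' suffix and lowercase the name for comparison."""
    without_id = re.sub(r'\s*\(\d+\)\s*$', '', label)
    return without_id.strip().lower()

# Labels taken from the crime data shown above.
crime_labels = ['Woburn (137)', 'Taylor-Massey (61)', 'Yonge-St.Clair (97)']
cleaned = [normalise_name(label) for label in crime_labels]
print(cleaned)
```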

In [93]:
toronto_postal_neighborhoods = \
    set(toronto_postal_codes_w_coords['Neighborhood'].to_list())

In [94]:
toronto_postal_neighborhoods

{'Agincourt',
 'Alderwood, Long Branch',
 'Bathurst Manor, Wilson Heights, Downsview North',
 'Bayview Village',
 'Bedford Park, Lawrence Manor East',
 'Berczy Park',
 'Birch Cliff, Cliffside West',
 'Brockton, Parkdale Village, Exhibition Place',
 'Business reply mail Processing CentrE',
 'CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport',
 'Caledonia-Fairbanks',
 'Canada Post Gateway Processing Centre',
 'Cedarbrae',
 'Central Bay Street',
 'Christie',
 'Church and Wellesley',
 "Clarks Corners, Tam O'Shanter, Sullivan",
 'Cliffside, Cliffcrest, Scarborough Village West',
 'Commerce Court, Victoria Hotel',
 'Davisville',
 'Davisville North',
 'Del Ray, Mount Dennis, Keelsdale and Silverthorn',
 'Don Mills',
 'Dorset Park, Wexford Heights, Scarborough Town Centre',
 'Downsview',
 'Dufferin, Dovercourt Village',
 'East Toronto',
 'Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood',
 'Fairview, Henry Farm, Oriole',
 'F

In [97]:
lookup = 'Island'
for n in toronto_postal_neighborhoods:
    if lookup in n:
        print(n)

CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Harbourfront East, Union Station, Toronto Islands


Let's check the unmatched (NA) records on the map

In [None]:
# create map
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood, mci in zip(
    df_crime_nan['Crime Lat'],
    df_crime_nan['Crime Lon'], 
    df_crime_nan['Crime Neighbourhood'], 
    df_crime_nan['MCI']):
    label = '{}, {}'.format(neighborhood, mci)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**A quick check doesn't reveal an obvious naming mismatch, so we leave this for a future improvement. The map shows which areas are being missed, and we will suggest that the customer address this in a later iteration of the project (TODO).**
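One possible direction for that later iteration, which this notebook does not use, is to ignore names altogether and assign each crime record to the nearest postal-code centroid. The sketch below is only an approximation under that assumption (postal areas are not circles around their centroids); `haversine_km` is a helper written for this example, and the coordinates are copied from the tables above.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs in degrees, broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Two postal-code centroids and two crime locations from the data shown earlier.
postal = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Malvern, Rouge'],
    'Latitude': [43.753259, 43.806686],
    'Longitude': [-79.329656, -79.194353],
})
crimes = pd.DataFrame({
    'Crime Lat': [43.814213, 43.778732],
    'Crime Lon': [-79.209801, -79.307907],
})

# Distance matrix of shape (n_crimes, n_postal) via broadcasting.
dist = haversine_km(crimes['Crime Lat'].to_numpy()[:, None],
                    crimes['Crime Lon'].to_numpy()[:, None],
                    postal['Latitude'].to_numpy()[None, :],
                    postal['Longitude'].to_numpy()[None, :])

# Pick the closest centroid for each crime record.
crimes['Neighbourhood'] = postal['Neighbourhood'].to_numpy()[dist.argmin(axis=1)]
print(crimes)
```

With only two centroids the assignment is coarse; on the full postal-code table every record would receive some neighbourhood, so the result would still need a sanity check against the map.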

For now, let's drop the NA rows

In [107]:
# Reassign instead of dropping in place to avoid SettingWithCopyWarning
df_crime_clean = df_crime_clean.dropna()


In [108]:
df_crime_clean.shape

(88502, 5)

In [111]:
df_crime_clean.head()

Unnamed: 0,MCI,Crime Neighbourhood,Crime Lon,Crime Lat,Neighbourhood
1,Assault,Tam O'Shanter-Sullivan (118),-79.307907,43.778732,"Clarks Corners, Tam O'Shanter, Sullivan"
2,Break and Enter,Woburn (137),-79.225029,43.765942,Woburn
12,Assault,Downsview-Roding-CFB (26),-79.508636,43.720917,Downsview
13,Break and Enter,Bedford Park-Nortown (39),-79.41642,43.735794,"Bedford Park, Lawrence Manor East"
14,Assault,Malvern (132),-79.209801,43.814213,"Malvern, Rouge"


In [209]:
df_crime_clean['MCI'].unique()

array(['Assault', 'Break and Enter', 'Robbery', 'Theft Over',
       'Auto Theft'], dtype=object)

In [210]:
df_crime_clean['MCI'].value_counts()

Assault            46423
Break and Enter    18756
Auto Theft         11170
Robbery             9387
Theft Over          2766
Name: MCI, dtype: int64

In [217]:
df_crime['MCI'].describe()

count      206435
unique          5
top       Assault
freq       111423
Name: MCI, dtype: object

# Part 4. Venue data <a class="anchor" id="item4"></a>

```
# @hidden_cell
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
```

In [52]:
# Some definitions
LIMIT = 100

Helper function

In [53]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
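The helper above indexes `categories[0]` directly, which would raise an `IndexError` for a venue without a category. A minimal sketch of a more tolerant parsing step follows. `extract_venue_rows` is a hypothetical function, and `fake_items` is a hand-written stand-in shaped like the `items` list the helper reads, not real API output.

```python
def extract_venue_rows(items, name, lat, lng):
    """Flatten an 'items' list into tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hand-written stand-in for the 'items' list; not real API output.
fake_items = [
    {'venue': {'name': 'Some Park', 'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]

rows = extract_venue_rows(fake_items, 'Parkwoods', 43.753259, -79.329656)
print(rows)
```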

### Venues

In [54]:
df_venues = getNearbyVenues(names=toronto_postal_codes_w_coords['Neighborhood'],
                                   latitudes=toronto_postal_codes_w_coords['Latitude'],
                                   longitudes=toronto_postal_codes_w_coords['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview
The Danforth West, Ri

In [56]:
print(df_venues.shape)
df_venues.head()

(4879, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,Parkwoods,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant
4,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store


In [109]:
# How many venues per neighborhood
df_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,46,46,46,46,46,46
"Alderwood, Long Branch",28,28,28,28,28,28
"Bathurst Manor, Wilson Heights, Downsview North",27,27,27,27,27,27
Bayview Village,12,12,12,12,12,12
"Bedford Park, Lawrence Manor East",42,42,42,42,42,42
...,...,...,...,...,...,...
"Willowdale, Newtonbrook",31,31,31,31,31,31
Woburn,8,8,8,8,8,8
Woodbine Heights,29,29,29,29,29,29
York Mills West,21,21,21,21,21,21


In [113]:
# How many unique categories
print('There are {} unique categories.'.format(len(df_venues['Venue Category'].unique())))

There are 325 unique categories.


In [118]:
df_venues.rename(columns={'Neighborhood': 'Neighbourhood'}, inplace=True)

In [218]:
df_venues.head()

Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,Parkwoods,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant
4,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store


# Part 5. Combined dataset <a class="anchor" id="item5"></a>

### Crime data

In [115]:
df_crime_clean.groupby('Neighbourhood').count()

Unnamed: 0_level_0,MCI,Crime Neighbourhood,Crime Lon,Crime Lat
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Agincourt,1650,1650,1650,1650
"Alderwood, Long Branch",1270,1270,1270,1270
"Bathurst Manor, Wilson Heights, Downsview North",727,727,727,727
Bayview Village,927,927,927,927
"Bedford Park, Lawrence Manor East",1240,1240,1240,1240
"Clarks Corners, Tam O'Shanter, Sullivan",1371,1371,1371,1371
"Cliffside, Cliffcrest, Scarborough Village West",1217,1217,1217,1217
"Del Ray, Mount Dennis, Keelsdale and Silverthorn",953,953,953,953
"Dorset Park, Wexford Heights, Scarborough Town Centre",2109,2109,2109,2109
Downsview,3974,3974,3974,3974


In [116]:
# How many unique MCI types
print('There are {} unique MCIs.'.format(len(df_crime_clean['MCI'].unique())))

There are 5 unique MCIs.


### One-hot encode venues

In [128]:
# one hot encoding
df_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighbourhood'] = df_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

Unnamed: 0,Neighbourhood,ATM,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [123]:
df_onehot.shape

(4879, 326)

In [124]:
# Let's regroup by the frequency of occurrence of each category
df_grouped_v = df_onehot.groupby('Neighbourhood').mean().reset_index()
df_grouped_v

Unnamed: 0,Neighbourhood,ATM,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,...,0.000000,0.021739,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.037037,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.023810,0.0,0.0,0.0,...,0.023810,0.000000,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
93,Woburn,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
94,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.034483,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
95,York Mills West,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0


In [136]:
df_grouped_v.shape

(97, 326)
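To make the grouped values concrete: each cell is the share of a neighbourhood's venues that fall in a given category. The sketch below reproduces the one-hot and mean steps on toy data (neighbourhood and category names are illustrative) and compares the result with `pd.crosstab(..., normalize='index')`, which computes the same shares in one call.

```python
import pandas as pd

# Toy venue list; neighbourhood and category names are illustrative.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Bakery', 'Bakery', 'Bank', 'Park', 'Park'],
})

# One indicator column per category, then the per-neighbourhood mean of each indicator.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])
freq = onehot.groupby('Neighbourhood').mean()

# pd.crosstab with normalize='index' computes the same row-wise shares directly.
freq_alt = pd.crosstab(venues['Neighbourhood'], venues['Venue Category'],
                       normalize='index')

print(freq)
```

Each row of `freq` sums to 1, so a neighbourhood with few venues can show large shares for a single category.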

### One-hot encode crime data

In [129]:
# one hot encoding
df_onehot = pd.get_dummies(df_crime_clean[['MCI']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_onehot['Neighbourhood'] = df_crime_clean['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [df_onehot.columns[-1]] + list(df_onehot.columns[:-1])
df_onehot = df_onehot[fixed_columns]

df_onehot.head()

Unnamed: 0,Neighbourhood,Assault,Auto Theft,Break and Enter,Robbery,Theft Over
1,"Clarks Corners, Tam O'Shanter, Sullivan",1,0,0,0,0
2,Woburn,0,0,1,0,0
12,Downsview,1,0,0,0,0
13,"Bedford Park, Lawrence Manor East",0,0,1,0,0
14,"Malvern, Rouge",1,0,0,0,0


In [133]:
df_onehot.shape

(88502, 6)

In [137]:
# Let's regroup by the frequency of occurrence of each category
df_grouped_c = df_onehot.groupby('Neighbourhood').mean().reset_index()
df_grouped_c

Unnamed: 0,Neighbourhood,Assault,Auto Theft,Break and Enter,Robbery,Theft Over
0,Agincourt,0.428485,0.133333,0.290303,0.099394,0.048485
1,"Alderwood, Long Branch",0.462992,0.134646,0.258268,0.100787,0.043307
2,"Bathurst Manor, Wilson Heights, Downsview North",0.416781,0.220083,0.258597,0.077029,0.02751
3,Bayview Village,0.496224,0.132686,0.259978,0.057174,0.053937
4,"Bedford Park, Lawrence Manor East",0.212903,0.221774,0.447581,0.062903,0.054839
5,"Clarks Corners, Tam O'Shanter, Sullivan",0.479942,0.100656,0.274252,0.12108,0.02407
6,"Cliffside, Cliffcrest, Scarborough Village West",0.543139,0.081348,0.237469,0.117502,0.020542
7,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0.610703,0.113326,0.125918,0.134313,0.01574
8,"Dorset Park, Wexford Heights, Scarborough Town...",0.502134,0.148412,0.215742,0.107634,0.026079
9,Downsview,0.597635,0.162808,0.119024,0.097635,0.022899


In [138]:
df_grouped_c.shape

(47, 6)

**We are clearly missing crime data for about half of the neighbourhoods (47 of 97), which is something to improve later. For now we deliver results only for the neighbourhoods that have crime data, so the merge below keeps just those.**

In [139]:
df_grouped = pd.merge(df_grouped_c, df_grouped_v, on='Neighbourhood')

In [142]:
df_grouped

Unnamed: 0,Neighbourhood,Assault,Auto Theft,Break and Enter,Robbery,Theft Over,ATM,Accessories Store,Afghan Restaurant,Airport,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Agincourt,0.428485,0.133333,0.290303,0.099394,0.048485,0.0,0.0,0.0,0.0,...,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.462992,0.134646,0.258268,0.100787,0.043307,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.416781,0.220083,0.258597,0.077029,0.02751,0.0,0.0,0.0,0.0,...,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.496224,0.132686,0.259978,0.057174,0.053937,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.212903,0.221774,0.447581,0.062903,0.054839,0.0,0.0,0.0,0.0,...,0.02381,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0
5,"Clarks Corners, Tam O'Shanter, Sullivan",0.479942,0.100656,0.274252,0.12108,0.02407,0.0,0.0,0.0,0.0,...,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Cliffside, Cliffcrest, Scarborough Village West",0.543139,0.081348,0.237469,0.117502,0.020542,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0.610703,0.113326,0.125918,0.134313,0.01574,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0
8,"Dorset Park, Wexford Heights, Scarborough Town...",0.502134,0.148412,0.215742,0.107634,0.026079,0.0,0.0,0.0,0.0,...,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Downsview,0.597635,0.162808,0.119024,0.097635,0.022899,0.0,0.0,0.0,0.014706,...,0.0,0.073529,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


-----

In [146]:
# Print each neighborhood with 5 top venues
num_top_venues = 5

for hood in df_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = df_grouped[df_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0             Assault  0.43
1     Break and Enter  0.29
2  Chinese Restaurant  0.13
3          Auto Theft  0.13
4             Robbery  0.10


----Alderwood, Long Branch----
             venue  freq
0          Assault  0.46
1  Break and Enter  0.26
2       Auto Theft  0.13
3         Pharmacy  0.11
4   Discount Store  0.11


----Bathurst Manor, Wilson Heights, Downsview North----
             venue  freq
0          Assault  0.42
1  Break and Enter  0.26
2       Auto Theft  0.22
3          Robbery  0.08
4      Coffee Shop  0.07


----Bayview Village----
                 venue  freq
0              Assault  0.50
1      Break and Enter  0.26
2                 Bank  0.17
3          Gas Station  0.17
4  Japanese Restaurant  0.17


----Bedford Park, Lawrence Manor East----
                venue  freq
0     Break and Enter  0.45
1          Auto Theft  0.22
2             Assault  0.21
3  Italian Restaurant  0.07
4         Coffee Shop  0.07


----Clark

Helper function

In [143]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
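A short usage sketch of the helper on a made-up row laid out like `df_grouped` (the row values are hypothetical). Because `df_grouped` holds crime shares and venue shares side by side, the ranking mixes both kinds of feature.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name), then rank the rest.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical frequency row in the same layout as df_grouped.
row = pd.Series({'Neighbourhood': 'Example', 'Assault': 0.43,
                 'Park': 0.10, 'Bank': 0.02, 'Robbery': 0.25})

top3 = return_most_common_venues(row, 3)
print(list(top3))
```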

In [175]:
# Let's create a new dataframe with the 10 top venues
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = df_grouped['Neighbourhood']

for ind in np.arange(df_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Assault,Break and Enter,Auto Theft,Chinese Restaurant,Robbery,Shopping Mall,Theft Over,Bakery,Caribbean Restaurant,Pizza Place
1,"Alderwood, Long Branch",Assault,Break and Enter,Auto Theft,Discount Store,Pharmacy,Robbery,Convenience Store,Pizza Place,Park,Theft Over
2,"Bathurst Manor, Wilson Heights, Downsview North",Assault,Break and Enter,Auto Theft,Robbery,Coffee Shop,Bank,Pizza Place,Middle Eastern Restaurant,Sushi Restaurant,Fried Chicken Joint
3,Bayview Village,Assault,Break and Enter,Bank,Gas Station,Japanese Restaurant,Auto Theft,Chinese Restaurant,Park,Grocery Store,Restaurant
4,"Bedford Park, Lawrence Manor East",Break and Enter,Auto Theft,Assault,Coffee Shop,Italian Restaurant,Robbery,Theft Over,Bank,Restaurant,Sandwich Place


In [167]:
neighborhoods_venues_sorted.shape

(47, 11)

# Part6. Cluster <a class="anchor" id="item6"></a>

In [151]:
from sklearn.cluster import KMeans

In [196]:
# set number of clusters
kclusters = 6

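# keep only the numeric frequency columns (positional axis arguments were removed in pandas 2.0)
df_grouped_clustering = df_grouped.drop('Neighbourhood', axis=1)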

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(df_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 4, 1, 4, 3, 2, 0, 1, 0], dtype=int32)

In [173]:
df_postal = toronto_postal_codes_w_coords
df_postal.rename(columns={'Neighborhood': 'Neighbourhood'}, inplace=True)

In [176]:
# New dataframe with cluster and top venues
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_merged = df_postal

# join cluster labels and top venues onto the postal-code data, which carries latitude/longitude for each neighbourhood
df_merged = df_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

df_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Assault,Break and Enter,Auto Theft,Park,Robbery,Bus Stop,Pharmacy,Shopping Mall,Convenience Store,Fish & Chips Shop
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Assault,Break and Enter,Coffee Shop,Auto Theft,Sporting Goods Shop,Gym / Fitness Center,Grocery Store,Intersection,Golf Course,Lounge
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Assault,Break and Enter,Coffee Shop,Robbery,Auto Theft,Diner,Café,Pub,Theater,Park
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,,,,,,,,,,,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,,,,,,,,,,,


In [181]:
df_merged.dropna(inplace=True)

In [183]:
# Put on the map

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(
    df_merged['Latitude'], 
    df_merged['Longitude'], 
    df_merged['Neighbourhood'], 
    df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examine cluster

In [184]:
for i in range(kclusters):
    print('Cluster {}:'.format(i))
    #print(borough_merged.loc[borough_merged['Cluster Labels'] == i, borough_merged.columns[[2] + list(range(5, borough_merged.shape[1]))]])
    print(df_merged.loc[df_merged['Cluster Labels'] == i, df_merged.columns[[2] + list(range(5, 7))]])

Cluster 0:
                                        Neighbourhood  Cluster Labels  \
1                                    Victoria Village             0.0   
18                  Guildwood, Morningside, West Hill             0.0   
22                                             Woburn             0.0   
29                                   Thorncliffe Park             0.0   
32                                Scarborough Village             0.0   
38        Kennedy Park, Ionview, East Birchmount Park             0.0   
40                                          Downsview             0.0   
46                                          Downsview             0.0   
53                                          Downsview             0.0   
56   Del Ray, Mount Dennis, Keelsdale and Silverthorn             0.0   
60                                          Downsview             0.0   
64                                             Weston             0.0   
89  South Steeles, Silverstone, Humberga

### Cluster 0

In [229]:
df_merged.loc[df_merged['Cluster Labels'] == 0, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
1,M4A,Victoria Village
18,M1E,"Guildwood, Morningside, West Hill"
22,M1G,Woburn
29,M4H,Thorncliffe Park
32,M1J,Scarborough Village
38,M1K,"Kennedy Park, Ionview, East Birchmount Park"
40,M3K,Downsview
46,M3L,Downsview
53,M3M,Downsview
56,M6M,"Del Ray, Mount Dennis, Keelsdale and Silverthorn"


In [185]:
df_merged.loc[df_merged['Cluster Labels'] == 0, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Victoria Village,0.0,Assault,Break and Enter,Coffee Shop,Auto Theft,Sporting Goods Shop,Gym / Fitness Center,Grocery Store,Intersection,Golf Course,Lounge
18,"Guildwood, Morningside, West Hill",0.0,Assault,Pizza Place,Break and Enter,Robbery,Coffee Shop,Fast Food Restaurant,Bank,Auto Theft,Juice Bar,Pharmacy
22,Woburn,0.0,Assault,Park,Coffee Shop,Break and Enter,Robbery,Fast Food Restaurant,Chinese Restaurant,Mobile Phone Shop,Indian Restaurant,Auto Theft
29,Thorncliffe Park,0.0,Assault,Break and Enter,Coffee Shop,Robbery,Indian Restaurant,Grocery Store,Auto Theft,Theft Over,Gym,Pizza Place
32,Scarborough Village,0.0,Assault,Ice Cream Shop,Break and Enter,Robbery,Pizza Place,Coffee Shop,Restaurant,Sandwich Place,Fast Food Restaurant,Bowling Alley
38,"Kennedy Park, Ionview, East Birchmount Park",0.0,Assault,Discount Store,Break and Enter,Robbery,Chinese Restaurant,Coffee Shop,Grocery Store,Fast Food Restaurant,Auto Theft,Light Rail Station
40,Downsview,0.0,Assault,Auto Theft,Break and Enter,Robbery,Vietnamese Restaurant,Coffee Shop,Hotel,Grocery Store,Park,Pizza Place
46,Downsview,0.0,Assault,Auto Theft,Break and Enter,Robbery,Vietnamese Restaurant,Coffee Shop,Hotel,Grocery Store,Park,Pizza Place
53,Downsview,0.0,Assault,Auto Theft,Break and Enter,Robbery,Vietnamese Restaurant,Coffee Shop,Hotel,Grocery Store,Park,Pizza Place
56,"Del Ray, Mount Dennis, Keelsdale and Silverthorn",0.0,Assault,Furniture / Home Store,Robbery,Grocery Store,Break and Enter,Auto Theft,Coffee Shop,Fast Food Restaurant,Gas Station,Sandwich Place


### Cluster 1

In [228]:
df_merged.loc[df_merged['Cluster Labels'] == 1, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
0,M3A,Parkwoods
2,M5A,"Regent Park, Harbourfront"
6,M1B,"Malvern, Rouge"
19,M4E,The Beaches
31,M6H,"Dufferin, Dovercourt Village"
33,M2J,"Fairview, Henry Farm, Oriole"
34,M3J,"Northwood Park, York University"
37,M6J,"Little Portugal, Trinity"
39,M2K,Bayview Village
44,M1L,"Golden Mile, Clairlea, Oakridge"


In [186]:
df_merged.loc[df_merged['Cluster Labels'] == 1, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Parkwoods,1.0,Assault,Break and Enter,Auto Theft,Park,Robbery,Bus Stop,Pharmacy,Shopping Mall,Convenience Store,Fish & Chips Shop
2,"Regent Park, Harbourfront",1.0,Assault,Break and Enter,Coffee Shop,Robbery,Auto Theft,Diner,Café,Pub,Theater,Park
6,"Malvern, Rouge",1.0,Assault,Break and Enter,Auto Theft,Fast Food Restaurant,Trail,Robbery,Gym,Bank,Bakery,Chinese Restaurant
19,The Beaches,1.0,Assault,Break and Enter,Robbery,Auto Theft,Pub,Coffee Shop,Pizza Place,Japanese Restaurant,Breakfast Spot,Beach
31,"Dufferin, Dovercourt Village",1.0,Assault,Break and Enter,Robbery,Coffee Shop,Café,Park,Auto Theft,Convenience Store,Bakery,Bar
33,"Fairview, Henry Farm, Oriole",1.0,Assault,Break and Enter,Clothing Store,Auto Theft,Coffee Shop,Robbery,Theft Over,Sandwich Place,Restaurant,Japanese Restaurant
34,"Northwood Park, York University",1.0,Assault,Break and Enter,Auto Theft,Coffee Shop,Robbery,Furniture / Home Store,Pizza Place,Theft Over,Bar,Middle Eastern Restaurant
37,"Little Portugal, Trinity",1.0,Assault,Break and Enter,Café,Auto Theft,Robbery,Bar,Restaurant,Vegetarian / Vegan Restaurant,Bakery,Italian Restaurant
39,Bayview Village,1.0,Assault,Break and Enter,Bank,Gas Station,Japanese Restaurant,Auto Theft,Chinese Restaurant,Park,Grocery Store,Restaurant
44,"Golden Mile, Clairlea, Oakridge",1.0,Assault,Break and Enter,Robbery,Intersection,Auto Theft,Coffee Shop,Diner,Bus Line,Bakery,Convenience Store


### Cluster 2

In [230]:
df_merged.loc[df_merged['Cluster Labels'] == 2, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
51,M1M,"Cliffside, Cliffcrest, Scarborough Village West"


In [187]:
df_merged.loc[df_merged['Cluster Labels'] == 2, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
51,"Cliffside, Cliffcrest, Scarborough Village West",2.0,Assault,Pizza Place,Break and Enter,Ice Cream Shop,Beach,Robbery,Park,Cajun / Creole Restaurant,Burger Joint,Hardware Store


### Cluster 3

In [231]:
df_merged.loc[df_merged['Cluster Labels'] == 3, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
27,M2H,Hillcrest Village
50,M9L,Humber Summit
77,M9R,"Kingsview Village, St. Phillips, Martin Grove ..."
78,M1S,Agincourt
82,M1T,"Clarks Corners, Tam O'Shanter, Sullivan"
85,M1V,"Milliken, Agincourt North, Steeles East, L'Amo..."
91,M4W,Rosedale
93,M8W,"Alderwood, Long Branch"


In [189]:
df_merged.loc[df_merged['Cluster Labels'] == 3, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,Hillcrest Village,3.0,Assault,Break and Enter,Auto Theft,Robbery,Pharmacy,Park,Coffee Shop,Convenience Store,Bakery,Recreation Center
50,Humber Summit,3.0,Assault,Auto Theft,Electronics Store,Break and Enter,Robbery,Pharmacy,Arts & Crafts Store,Shopping Mall,Italian Restaurant,Park
77,"Kingsview Village, St. Phillips, Martin Grove ...",3.0,Assault,Auto Theft,Break and Enter,Pharmacy,Robbery,Mobile Phone Shop,Gas Station,Supplement Shop,Supermarket,Beer Store
78,Agincourt,3.0,Assault,Break and Enter,Auto Theft,Chinese Restaurant,Robbery,Shopping Mall,Theft Over,Bakery,Caribbean Restaurant,Pizza Place
82,"Clarks Corners, Tam O'Shanter, Sullivan",3.0,Assault,Break and Enter,Robbery,Coffee Shop,Auto Theft,Sandwich Place,Intersection,Park,Bank,Fast Food Restaurant
85,"Milliken, Agincourt North, Steeles East, L'Amo...",3.0,Break and Enter,Assault,Auto Theft,Chinese Restaurant,Robbery,Bakery,Pharmacy,Park,Noodle House,Fast Food Restaurant
91,Rosedale,3.0,Assault,Break and Enter,Coffee Shop,Grocery Store,Park,Robbery,Auto Theft,Metro Station,BBQ Joint,Bank
93,"Alderwood, Long Branch",3.0,Assault,Break and Enter,Auto Theft,Discount Store,Pharmacy,Robbery,Convenience Store,Pizza Place,Park,Theft Over


### Cluster 4

In [232]:
df_merged.loc[df_merged['Cluster Labels'] == 4, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
11,M9B,"West Deane Park, Princess Gardens, Martin Grov..."
16,M6C,Humewood-Cedarvale
17,M9C,"Eringate, Bloordale Gardens, Old Burnhamthorpe..."
23,M4G,Leaside
28,M3H,"Bathurst Manor, Wilson Heights, Downsview North"
30,M5H,"Richmond, Adelaide, King"
55,M5M,"Bedford Park, Lawrence Manor East"
61,M4N,Lawrence Park
81,M6S,"Runnymede, Swansea"


In [190]:
df_merged.loc[df_merged['Cluster Labels'] == 4, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,"West Deane Park, Princess Gardens, Martin Grov...",4.0,Assault,Auto Theft,Park,Break and Enter,Pizza Place,Robbery,Theft Over,Clothing Store,Mexican Restaurant,Hotel
16,Humewood-Cedarvale,4.0,Assault,Break and Enter,Auto Theft,Pizza Place,Coffee Shop,Robbery,Theft Over,Middle Eastern Restaurant,Soccer Stadium,Café
17,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",4.0,Assault,Break and Enter,Auto Theft,Robbery,Coffee Shop,Convenience Store,Cosmetics Shop,Pet Store,College Rec Center,Gas Station
23,Leaside,4.0,Break and Enter,Assault,Auto Theft,Robbery,Sporting Goods Shop,Coffee Shop,Theft Over,Grocery Store,Furniture / Home Store,Sports Bar
28,"Bathurst Manor, Wilson Heights, Downsview North",4.0,Assault,Break and Enter,Auto Theft,Robbery,Coffee Shop,Bank,Pizza Place,Middle Eastern Restaurant,Sushi Restaurant,Fried Chicken Joint
30,"Richmond, Adelaide, King",4.0,Assault,Break and Enter,Auto Theft,Robbery,Coffee Shop,Café,Hotel,Theft Over,Theater,Restaurant
55,"Bedford Park, Lawrence Manor East",4.0,Break and Enter,Auto Theft,Assault,Coffee Shop,Italian Restaurant,Robbery,Theft Over,Bank,Restaurant,Sandwich Place
61,Lawrence Park,4.0,Break and Enter,Assault,Auto Theft,Café,Coffee Shop,Trail,Gym / Fitness Center,College Gym,College Quad,Park
81,"Runnymede, Swansea",4.0,Assault,Break and Enter,Auto Theft,Robbery,Coffee Shop,Café,Pizza Place,Bakery,Pub,Italian Restaurant


### Cluster 5

In [233]:
df_merged.loc[df_merged['Cluster Labels'] == 5, (['Postal Code', 'Neighbourhood'])]

Unnamed: 0,Postal Code,Neighbourhood
12,M1C,"Rouge Hill, Port Union, Highland Creek"


In [194]:
df_merged.loc[df_merged['Cluster Labels'] == 5, df_merged.columns[[2] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,"Rouge Hill, Port Union, Highland Creek",5.0,Assault,Break and Enter,Breakfast Spot,Playground,Burger Joint,Park,Italian Restaurant,Auto Theft,Robbery,Theft Over


# Conclusion <a class="anchor" id="item7"></a>

We mapped crime and venue data for 48 neighbourhoods and grouped them into 6 clusters.

Cluster 0 is dominated by Assault, with some Break and Enter; typical venues are Parks, Discount Stores and Pizza Places.

Cluster 1 is dominated by Assault, Break and Enter and Robbery; typical venues are Coffee Shops, Banks and Restaurants.

Cluster 2 is dominated by Assault, with Pizza Places as the most common venue.

Cluster 3 is dominated by Assault, Auto Theft and Break and Enter; typical venues are Electronics Stores, Coffee Shops and Asian restaurants.

Cluster 4 is primarily Break and Enter and Assault; typical venues are Sporting Goods Shops, Coffee Shops and restaurants.

Cluster 5 is Assault and, to a lesser extent, Break and Enter; typical venues are Breakfast Spots, Playgrounds and Burger Joints.

### Follow up

There are 140 neighbourhoods in total in the crime data and 97 neighbourhoods in the venue data. Because of naming differences in the crime data, we were unable to map all the neighbourhoods in the first phase. The next phase would be to improve the matching and map up to 97 neighbourhoods.
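
One possible way to approach the naming differences is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; the helper `match_neighbourhood`, the `cutoff` value and the example names are assumptions for illustration and were not part of this analysis.

```python
import difflib

def match_neighbourhood(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or None if none is close enough."""
    # compare in lower case, but return the candidate in its original spelling
    lookup = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(name.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

# Hypothetical names, for illustration only
venue_names = ["Agincourt", "Bayview Village", "Rosedale"]
crime_names = ["Bayview Village (52)", "Agincourt North", "Unknown Area"]

matches = {name: match_neighbourhood(name, venue_names) for name in crime_names}
```

Names that come back as `None` would still need a manual mapping, and any automatic match should be reviewed before it is used in the merge.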

Another observation is that we have much more crime data, spread over fewer categories, than venue data: 88 0000 with 5 categories vs 4 800 with 325 categories. This can shift our clustering towards the crime data. A possible follow-up is to apply different weighting or to reduce the crime data.
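
A minimal sketch of the weighting idea, assuming the clustering features are per-neighbourhood frequencies with one column per crime or venue category. The helper `downweight_columns`, the weight of 0.2 and the toy values are hypothetical; a suitable weight would have to be found by experiment.

```python
import numpy as np
import pandas as pd

def downweight_columns(features, columns, weight):
    """Return a copy of `features` with the given columns multiplied by `weight`."""
    scaled = features.copy()
    scaled[columns] = scaled[columns] * weight
    return scaled

# Toy frequencies standing in for df_grouped_clustering; the values are made up
toy = pd.DataFrame({
    "Assault": [0.50, 0.45, 0.20, 0.25],
    "Break and Enter": [0.25, 0.30, 0.40, 0.35],
    "Coffee Shop": [0.05, 0.00, 0.10, 0.12],
    "Park": [0.00, 0.08, 0.02, 0.00],
})
crime_columns = ["Assault", "Break and Enter"]

# Shrink the crime columns so they contribute less to the k-means distances
weighted = downweight_columns(toy, crime_columns, weight=0.2)
```

The weighted frame could then be passed to `KMeans(...).fit(...)` in place of `df_grouped_clustering`, and the resulting clusters compared with the unweighted ones.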

This analysis did not investigate whether some neighbourhoods have a higher or lower crime rate. For a proper pricing policy it will be important to take this into account in the next research phase.

The crime data is dominated by Assault, with over 50% of the crime data, followed by Break and Enter with about 20%. A follow-up would be to check whether that skews our clustering: first investigate whether the split depends on the neighbourhood, and then see whether the weights should be adjusted.
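
To check whether the crime mix depends on the neighbourhood, one option is to compare the share of each category per neighbourhood. The sketch below uses `pd.crosstab` with `normalize="index"`; the frame `crime`, its column names and its values are hypothetical stand-ins for the real crime records.

```python
import pandas as pd

# Hypothetical long-format crime records: one row per incident
crime = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Category": ["Assault", "Assault", "Assault", "Break and Enter",
                 "Assault", "Break and Enter", "Break and Enter", "Auto Theft"],
})

# Share of each crime category within each neighbourhood (every row sums to 1)
shares = pd.crosstab(crime["Neighbourhood"], crime["Category"], normalize="index")

# Difference between the highest and lowest Assault share across neighbourhoods
assault_spread = shares["Assault"].max() - shares["Assault"].min()
```

A small spread would suggest that Assault dominates everywhere and adds little to the clustering, while a large spread would suggest that it does help to separate neighbourhoods.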

Try different numbers of clusters to find the optimal one.
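
One common way to compare cluster counts is the mean silhouette score, which was not used in this notebook. The sketch below is an assumption about how that comparison could look: the helper `silhouette_by_k` and the synthetic data are illustrative, and on the real data the features would be `df_grouped_clustering`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(features, k_values, random_state=4):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores

# Synthetic data with three well-separated groups, for illustration only
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy_features = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

scores = silhouette_by_k(toy_features, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette score
```

The silhouette score ranges from -1 to 1, with higher values indicating tighter, better separated clusters. It is only one criterion, so the suggested `best_k` should be weighed against how interpretable the resulting clusters are.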