# Tackling Bike Theft in Toronto

### Explore and cluster neighbourhoods in Toronto

This project compares Toronto neighbourhoods and identifies similarities between them using clustering techniques and location data services.

It uses web scraping to retrieve postal code data from the Wikipedia page that lists the Canadian postal codes beginning with M.

The data is then cleansed in preparation for clustering.

The geospatial coordinates are merged with the postal code data so that the neighbourhoods can be visualised on a map of the area.

The bicycle theft locations are merged with the postal code data in the same way, so that the thefts can be visualised on the same map.

The clustering is carried out with K-Means, and the clusters are plotted on the map using the Folium library.

The data is mapped across Toronto first and then clustered within boroughs.


Install and import the required libraries.

In [1]:
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib # parser passed to BeautifulSoup further down
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and html
from IPython.display import Image, HTML, display_html

# transforming a json file into a pandas dataframe
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
from bs4 import BeautifulSoup # html parsing
from sklearn.cluster import KMeans # clustering
import matplotlib.cm as cm # colour maps for the cluster markers
import matplotlib.colors as colors

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python-3.8-main

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.2               |     pyhd8ed1ab_0          26 KB  conda-forge
    certifi-2021.5.30          |   py38h578d9bd_0         141 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    openssl-1.1.1k             |       h7f98852_0         2.1 MB  conda-forge
    python_abi-3.8             |           2_cp38           4 KB  conda-forge
    vincent-0.4.4              |             

Set up the reference to the Postal Codes of Canada wiki page.

Read in the web page and define a Beautiful Soup object for manipulation.

In [2]:
canadian_postcodes_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
canadian_postcodes = requests.get(canadian_postcodes_url).text
soup = BeautifulSoup(canadian_postcodes, "html5lib")


Let's find the first table within the web page.

In [3]:
table_contents = []
canadian_postcodes_table = soup.find("table")

Let's take a look at the raw HTML table.

In [4]:
canadian_postcodes_table

<table cellpadding="2" cellspacing="0" rules="all" style="width:100%;">

<tbody><tr>
<td style="width:11%;">
<p>M1A<br/><span style="font-size:85%;">Not assigned</span>
</p>
</td>
<td style="width:11%;">
<p>M2A<br/><span style="font-size:85%;">Not assigned</span>
</p>
</td>
<td style="width:11%;">
<p>M3A<br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>)</span>
</p>
</td>
<td style="width:11%;">
<p>M4A<br/><span style="font-size:85%;"><a href="/wiki/North_York" title="North York">North York</a><br/>(<a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>)</span>
</p>
</td>
<td style="width:11%;">
<p>M5A<br/><span style="font-size:85%;"><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a><br/>(<a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a> / <a href="/wiki/Harbourfront,_Toronto" title="Harbourfront, Toronto">Harbourf

We're going to wrangle the data into a dataframe consisting of 3 columns:

PostalCode, Borough, Neighborhood

We're also going to skip the cells that contain Not assigned, combine the neighborhoods that share a postal code into a single row, and remove any duplicate postal codes.

In [5]:
for row in canadian_postcodes_table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass #ignore these ones
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)

#merge neighborhoods
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

#remove duplicate rows - some postal code areas have multiple neighbourhoods, e.g. M5A
postcode_data = df.drop_duplicates()

#check de-duped
#postcode_data.loc[df['PostalCode'] == 'M5A']
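
To make the string handling in the loop above easier to follow, here is a minimal sketch of the same split applied to a single cell. The function name `parse_cell` and the sample strings are assumptions for illustration; the strings are modelled on the raw HTML and the resulting dataframe shown in this notebook, not taken from a live request.

```python
def parse_cell(code_text, span_text):
    """Split one table cell into postal code, borough and neighbourhoods.

    Hypothetical helper: mirrors the chained split/strip/replace calls
    used in the loop, but on plain strings rather than Beautiful Soup tags.
    """
    if span_text.strip() == 'Not assigned':
        return None  # unassigned postal codes are skipped
    # everything before the first '(' is the borough, the rest the neighbourhoods
    borough, _, rest = span_text.partition('(')
    neighbourhoods = [n.strip() for n in rest.rstrip(')').split('/') if n.strip()]
    return {
        'PostalCode': code_text[:3],
        'Borough': borough.strip(),
        'Neighborhood': ', '.join(neighbourhoods),
    }

example = parse_cell('M5A', 'Downtown Toronto(Regent Park / Harbourfront)')
print(example)
```

Splitting on `/` and re-joining with `, ` gives the same "Regent Park, Harbourfront" form as the chained `replace` calls, and keeps each step readable on its own.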



View the data.

In [6]:
postcode_data.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


Print the number of rows and columns in the data.

In [7]:
postcode_data.shape

(103, 3)

Reading in the following files:

1. Geospatial_Coordinates.csv 
2. Toronto Bike Thefts - Jul to Dec 2020.csv

These files are imported as notebook project assets, therefore the import statements will be hidden, however the data is read into the following dataframes:

1. geospatial_data_df 
2. crime_data_df

These will be referenced later in the notebook

In [8]:
# The code was removed by Watson Studio for sharing.

In [9]:
geospatial_data_df = geospatial_data
crime_data_df = crime_data

Let's review the geospatial data.

In [10]:
geospatial_data_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Print the number of rows and columns in the data.

In [11]:
geospatial_data_df.shape

(103, 3)

We're going to merge (or join) the postal code data and the geospatial data on the postal code, so that each Toronto neighbourhood is linked to its latitude and longitude.

In [12]:
# rename the key column so that both dataframes share the same column name
geospatial_data_df.rename(columns={'Postal Code':'PostalCode'},inplace=True)
merged_geospatial_data = pd.merge(postcode_data,geospatial_data_df,on='PostalCode')
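
`pd.merge` defaults to an inner join, so any postal code without a matching coordinate row would be dropped silently. A minimal sketch of how this could be checked, using small made-up dataframes (the code `M0X` and the borough `Hypothetical` are invented for the example; the two real rows reuse coordinates shown in this notebook):

```python
import pandas as pd

postcodes = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],  # 'M0X' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
coordinates = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# a left join keeps every postal code; indicator=True adds a '_merge' column
# that says whether each row found a match on the right-hand side
checked = pd.merge(postcodes, coordinates, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty list here would mean the inner join used in the notebook loses nothing.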


View the merged geospatial contents.

In [13]:
merged_geospatial_data.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


### Section 3 - Explore and cluster the neighborhoods in Toronto


Let's do some quick analysis of the data: how many boroughs and neighborhoods does it contain?

In [14]:
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(merged_geospatial_data['Borough'].unique()),
        merged_geospatial_data.shape[0]
    )
)


The dataframe has 15 boroughs and 103 neighbourhoods.


Use the geopy library to get the latitude and longitude of Toronto.

In [15]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))



The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


Let's visualise the data over a map of Toronto, Ontario.

In [17]:
toronto_data = merged_geospatial_data

In [18]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#Note: the image has been uploaded to git - see toronto_map.png

Let's review the crime data.


In [19]:
crime_data_df

Unnamed: 0,OBJECTID,event_unique_id,Primary_Offence,Occurrence_Date,Occurrence_Year,Occurrence_Month,Occurrence_DayOfWeek,Occurrence_DayOfMonth,Occurrence_DayOfYear,Occurrence_Hour,...,PostalCode,Location_Type,Premises_Type,Bike_Make,Bike_Model,Bike_Type,Bike_Speed,Bike_Colour,Cost_of_Bike,Status
0,7439,GO-20209026034,THEFT UNDER,2020/10/09 04:00:00+00,2020,October,Friday,9,283,13,...,M6K,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,MA,2016 ARGENTA EL,RC,10,BLK,1155,STOLEN
1,7504,GO-20201983292,THEFT FROM MOTOR VEHICLE OVER,2020/10/07 04:00:00+00,2020,October,Wednesday,7,281,22,...,M4X,"Parking Lots (Apt., Commercial Or Non-Commercial)",Outside,OTHER,MACH 5.5,MT,11,BLK,11000,STOLEN
2,14334,GO-20202059580,THEFT UNDER - BICYCLE,2020/10/22 04:00:00+00,2020,October,Thursday,22,296,15,...,M1M,"Parking Lots (Apt., Commercial Or Non-Commercial)",Outside,MONTANA,220,MT,0,BLK,1300,STOLEN
3,6693,GO-20201336667,THEFT UNDER - BICYCLE,2020/07/18 04:00:00+00,2020,July,Saturday,18,200,7,...,M5V,"Streets, Roads, Highways (Bicycle Path, Privat...",Outside,OPUS,MULLARD,MT,7,GRN,1842,STOLEN
4,6723,GO-20209018435,THEFT UNDER,2020/07/22 04:00:00+00,2020,July,Wednesday,22,204,14,...,M6E,"Parking Lots (Apt., Commercial Or Non-Commercial)",Outside,KO,SUTRA LTD,TO,11,DBL,3164,STOLEN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253,20116,GO-20201583344,THEFT UNDER - BICYCLE,2020/08/22 04:00:00+00,2020,August,Saturday,22,235,14,...,M8Y,"Parking Lots (Apt., Commercial Or Non-Commercial)",Outside,KHS,RACING SERIES,RC,24,BLK,1600,STOLEN
254,6687,GO-20201352499,THEFT UNDER,2020/07/19 04:00:00+00,2020,July,Sunday,19,201,4,...,M4W,"Parking Lots (Apt., Commercial Or Non-Commercial)",Outside,EMMO,ZONE GT,EL,3,RED,4000,STOLEN
255,16850,GO-20202040830,THEFT OF EBIKE UNDER $5000,2020/09/27 04:00:00+00,2020,September,Sunday,27,271,9,...,M9R,Go Station,Transit,OTHER,ARROW,EL,1,ONGBLK,2100,STOLEN
256,20120,GO-20201693165,THEFT UNDER - BICYCLE,2020/07/11 04:00:00+00,2020,July,Saturday,11,193,16,...,M9R,"Gas Station (Self, Full, Attached Convenience)",Commercial,MIELE,,OT,10,WHI,2000,STOLEN


We're only interested in the postal code and the unique crime ID.

In [20]:
crime_location_data_df = crime_data_df[['PostalCode', 'event_unique_id']].copy()

Merge with the geospatial data so that each theft is linked to the coordinates of its postal code.

In [21]:
# merging with the combined postcode data also brings in the boroughs & neighborhoods

merged_crime_location_data_df = pd.merge(merged_geospatial_data,crime_location_data_df,on='PostalCode')

merged_crime_location_data_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,event_unique_id
0,M3A,North York,Parkwoods,43.753259,-79.329656,GO-20209025110
1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,GO-20209020276
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,GO-20202433586
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,GO-20209023520
4,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,GO-20201378917
...,...,...,...,...,...,...
250,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20209020706
251,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20201789535
252,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20202058919
253,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20209021205


Determine the number of bike thefts that have occurred per postal area.

In [22]:
merged_crime_location_data_df.groupby(['PostalCode', 'Borough', 'Neighborhood', 'Longitude', 'Latitude']).count().reset_index()[['PostalCode', 'Borough', 'Neighborhood', 'Longitude', 'Latitude','event_unique_id']]

Unnamed: 0,PostalCode,Borough,Neighborhood,Longitude,Latitude,event_unique_id
0,M1B,Scarborough,"Malvern, Rouge",-79.194353,43.806686,4
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",-79.160497,43.784535,4
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",-79.188711,43.763573,1
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",-79.262029,43.727929,43
4,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",-79.284577,43.711112,4
5,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",-79.239476,43.716316,1
6,M1P,Scarborough,"Dorset Park, Wexford Heights, Scarborough Town...",-79.273304,43.75741,1
7,M1R,Scarborough,"Wexford, Maryvale",-79.295849,43.750071,1
8,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",-79.284577,43.815252,5
9,M2H,North York,Hillcrest Village,-79.363452,43.803762,1
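
The table above is ordered by postal code, which makes the busiest areas hard to spot. A minimal sketch of the same count sorted by the number of thefts, on a small made-up dataframe (the event IDs are hypothetical and the column name `TheftCount` is an assumption):

```python
import pandas as pd

thefts = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A', 'M8Y', 'M8Y', 'M8Y'],
    'event_unique_id': ['E1', 'E2', 'E3', 'E4', 'E5', 'E6'],  # hypothetical IDs
})

theft_counts = (
    thefts.groupby('PostalCode')['event_unique_id']
    .count()                       # thefts recorded per postal code
    .rename('TheftCount')
    .sort_values(ascending=False)  # busiest postal codes first
    .reset_index()
)
print(theft_counts)
```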


Define K-Means clustering on the bike thefts of Toronto, using the latitude and longitude of each theft as the features.

In [23]:
k=5
# keep only Latitude and Longitude; axis must be passed by keyword in recent pandas
toronto_crime_clustering = merged_crime_location_data_df.drop(['PostalCode','Borough','Neighborhood','event_unique_id'],axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_crime_clustering)
kmeans.labels_
merged_crime_location_data_df.insert(0, 'Cluster Labels', kmeans.labels_)
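
For readers new to K-Means, the sketch below shows roughly what the algorithm does with latitude and longitude pairs: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a plain NumPy illustration (Lloyd's algorithm with fixed starting centroids), not scikit-learn's implementation, and the function name and the four coordinates are assumptions chosen for the example.

```python
import numpy as np

def lloyd_kmeans(points, init_centroids, n_iter=10):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then update."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# hypothetical (latitude, longitude) pairs: two downtown, two in Scarborough
coords = [(43.654, -79.361), (43.657, -79.379), (43.807, -79.194), (43.785, -79.160)]
labels, centroids = lloyd_kmeans(coords, init_centroids=[coords[0], coords[2]])
print(labels)
```

Because the features are raw degrees, the clusters are driven by straight-line distance in latitude and longitude, which is a reasonable approximation over an area the size of Toronto.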

In [24]:
#lets see the clusters
merged_crime_location_data_df

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighborhood,Latitude,Longitude,event_unique_id
0,3,M3A,North York,Parkwoods,43.753259,-79.329656,GO-20209025110
1,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,GO-20209020276
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,GO-20202433586
3,4,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,GO-20209023520
4,1,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,GO-20201378917
...,...,...,...,...,...,...,...
250,4,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20209020706
251,4,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20201789535
252,4,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20202058919
253,4,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,GO-20209021205


Visualise the bike thefts over a map of Toronto, in the relevant boroughs.

In [25]:
# create map
map_toronto_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(merged_crime_location_data_df['Latitude'], merged_crime_location_data_df['Longitude'], merged_crime_location_data_df['Neighborhood'], merged_crime_location_data_df['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_toronto_clusters)
       
map_toronto_clusters

#Note: the image has been uploaded to git - see toronto crime clusters map.png

Define the Foursquare credentials.

In [26]:
# The code was removed by Watson Studio for sharing.

In [32]:
FourSquare_data = toronto_data

In [33]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(
    FourSquare_data['Latitude'], 
    FourSquare_data['Longitude'],
    FourSquare_data['PostalCode'],
    FourSquare_data['Borough'],
    FourSquare_data['Neighborhood']):
    
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()['response']['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
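
The loop above indexes straight into the JSON, so a response with no groups or a venue with no categories would raise an error. A minimal sketch of a more defensive extraction is shown below. The helper `extract_venues` and the sample payload are assumptions: the payload shape is inferred only from the keys accessed in the cell above, not from the Foursquare documentation.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or [{}]  # tolerate a missing or empty list
        rows.append((
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('name'),
        ))
    return rows

# hypothetical payload with one complete venue and one without a category
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'No Category Venue',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]}]}}
rows = extract_venues(sample)
print(rows)
```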

In [34]:
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(2159, 9)


Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,North York,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop


In [35]:
venues_df.groupby(["PostalCode"]).count().reset_index()[['PostalCode','VenueCategory']]

Unnamed: 0,PostalCode,VenueCategory
0,M1B,1
1,M1C,2
2,M1E,9
3,M1G,4
4,M1H,8
...,...,...
96,M9N,2
97,M9P,8
98,M9R,4
99,M9V,10


In [36]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))
venues_df.VenueCategory.unique()

There are 275 unique categories.


array(['Park', 'Fast Food Restaurant', 'Food & Drink Shop',
       'Hockey Arena', 'Coffee Shop', 'Portuguese Restaurant',
       'Intersection', 'Pizza Place', 'Bakery', 'Distribution Center',
       'Restaurant', 'Spa', 'Gym / Fitness Center', 'Historic Site',
       'Chocolate Shop', 'Pub', 'Farmers Market', 'Performing Arts Venue',
       'Breakfast Spot', 'Dessert Shop', 'French Restaurant',
       'Mexican Restaurant', 'Theater', 'Yoga Studio', 'Event Space',
       'Shoe Store', 'Café', 'Asian Restaurant', 'Art Gallery',
       'Electronics Store', 'Bank', 'Beer Store', 'Wine Shop',
       'Antique Shop', 'Boutique', 'Furniture / Home Store',
       'Vietnamese Restaurant', 'Clothing Store', 'Accessories Store',
       'Miscellaneous Shop', 'Italian Restaurant', 'Beer Bar',
       'Sushi Restaurant', 'Creperie', 'Fried Chicken Joint',
       'Hobby Shop', 'Burrito Place', 'Diner', 'Japanese Restaurant',
       'Smoothie Shop', 'Sandwich Place', 'Gym', 'College Auditorium',
     

In [37]:
venues_df

Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M3A,North York,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,M3A,North York,Parkwoods,43.753259,-79.329656,KFC,43.754387,-79.333021,Fast Food Restaurant
2,M3A,North York,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A,North York,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
...,...,...,...,...,...,...,...,...,...
2154,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Royal Canadian Legion #210,43.628855,-79.518903,Social Club
2155,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Islington Florist & Nursery,43.630156,-79.518718,Flower Shop
2156,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
2157,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Once Upon A Child,43.631075,-79.518290,Kids Store


Find all the bike shops in the Toronto area.

In [38]:
toronto_bikeshop_data = venues_df[venues_df['VenueCategory'].str.contains('Bike Shop',regex=True)]

#remove duplicate rows, and take a copy so that a column can be inserted later
toronto_bike_shop_data = toronto_bikeshop_data.drop_duplicates().copy()
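
Since 'Bike Shop' is a literal string rather than a pattern, the filter can also be written without regular expressions. A minimal sketch on a made-up dataframe (the `Unnamed Kiosk` row with a missing category is hypothetical, added to show how `na=False` treats missing values):

```python
import pandas as pd

venues = pd.DataFrame({
    'VenueName': ['Skiis and Biikes', 'Tim Hortons', 'Unnamed Kiosk'],
    'VenueCategory': ['Bike Shop', 'Coffee Shop', None],  # None: hypothetical missing category
})

# regex=False does a plain substring match; na=False treats missing categories as non-matches
is_bike_shop = venues['VenueCategory'].str.contains('Bike Shop', regex=False, na=False)
bike_shops = venues[is_bike_shop].copy()
print(bike_shops)
```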

In [39]:
toronto_bike_shop_data

Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
230,M3C,North York,Don Mills South,43.7259,-79.340923,Skiis and Biikes,43.726351,-79.342977,Bike Shop
436,M4G,East York,Leaside,43.70906,-79.363452,Enduro Sport,43.706059,-79.361835,Bike Shop


In [40]:
k=2
# keep only VenueLatitude and VenueLongitude; axis must be passed by keyword in recent pandas
toronto_bikeshop_clustering = toronto_bike_shop_data.drop(['PostalCode','Borough','Neighborhood','BoroughLatitude','BoroughLongitude','VenueName','VenueCategory'],axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_bikeshop_clustering)
kmeans.labels_
toronto_bike_shop_data.insert(0, 'Cluster Labels', kmeans.labels_)



In [41]:
# create map
map_bike_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(toronto_bike_shop_data['VenueLatitude'], toronto_bike_shop_data['VenueLongitude'], toronto_bike_shop_data['Neighborhood'], toronto_bike_shop_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_bike_clusters)
       
map_bike_clusters

#Note: the image has been uploaded to git - see toronto bike shops clusters map.png

## Conclusion

Comparing the two maps, we can see that these shops sit within Cluster 2 and Cluster 3 of the bike theft data. Both shops would therefore be approached to start a conversation about stocking my product.
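
The comparison above is made by eye. One way it could be checked numerically is to assign each shop to the nearest theft-cluster centroid. The sketch below assumes the theft model's centroids are available as (latitude, longitude) pairs; the five centroids listed are made up for illustration, so the labels it prints say nothing about the real clusters. The shop coordinates are the ones returned by Foursquare above.

```python
import numpy as np

def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to point (straight-line distance)."""
    centroids = np.asarray(centroids, dtype=float)
    distances = np.linalg.norm(centroids - np.asarray(point, dtype=float), axis=1)
    return int(distances.argmin())

# hypothetical theft-cluster centroids as (latitude, longitude)
theft_centroids = [(43.65, -79.38), (43.67, -79.53), (43.72, -79.35),
                   (43.78, -79.25), (43.64, -79.50)]

# shop coordinates taken from the venue data in this notebook
shops = {
    'Skiis and Biikes': (43.726351, -79.342977),
    'Enduro Sport': (43.706059, -79.361835),
}
shop_clusters = {name: nearest_cluster(coord, theft_centroids) for name, coord in shops.items()}
print(shop_clusters)
```

In the notebook itself the centroids would come from the theft model's `cluster_centers_`. Note that `kmeans` is reassigned when the bike shops are clustered, so the theft model would need to be kept under its own name for this to work.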