## Week 5 analytics
Last week we collected venue information using the Foursquare API and saved it for later use:

In [1]:
import pandas as pd

df = pd.read_csv("toronto_venues.csv")
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Toronto Cooper Koo Family Cherry St YMCA Centre | 43.653191 | -79.357947 | Gym / Fitness Center |
| 3 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
| 4 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Morning Glory Cafe | 43.653947 | -79.361149 | Breakfast Spot |


In [2]:
df["Venue Category"].value_counts().head(10)

Coffee Shop            143
Café                    94
Restaurant              49
Italian Restaurant      47
Bakery                  45
Hotel                   39
Bar                     37
Pizza Place             34
Park                    31
Japanese Restaurant     27
Name: Venue Category, dtype: int64

There are a number of restaurant-related categories: ["Restaurant", "Italian Restaurant", "Japanese Restaurant", ...]. We'll relabel every venue whose category contains the word 'restaurant' as "Restaurant" so that these are counted together.

In [3]:
def restaurant_replace(venue):
    if "restaurant" in str(venue).strip().lower():
        return "Restaurant"
    else:
        return venue

df["Venue Category"] = df["Venue Category"].apply(restaurant_replace)
df["Venue Category"].value_counts().head(10)

Restaurant        405
Coffee Shop       143
Café               94
Bakery             45
Hotel              39
Bar                37
Pizza Place        34
Park               31
Gastropub          24
Sandwich Place     23
Name: Venue Category, dtype: int64
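
The same grouping can also be written without a Python-level function. The sketch below uses a small made-up `Series` (not the Toronto data) and relies on pandas' `Series.str.contains` and `Series.mask`; it is an illustration of the idea rather than a replacement for the cell above.

```python
import pandas as pd

# Hypothetical sample categories, for illustration only
categories = pd.Series(
    ["Italian Restaurant", "Coffee Shop", "Restaurant", "Bakery", "Sushi Restaurant"],
    name="Venue Category",
)

# True wherever the category mentions "restaurant" (case-insensitive)
is_restaurant = categories.str.contains("restaurant", case=False, na=False)

# Replace matching rows with the single label "Restaurant", keep the rest
grouped = categories.mask(is_restaurant, "Restaurant")
print(grouped.value_counts())
```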

That looks a lot better. As per our problem discussion, we want to find a location to open a bakery, so we focus on two venue categories: 'Bakery' and 'Related Venue'. We'll create the related venue category by combining Restaurant, Coffee Shop and Café. What we want is plenty of related venues that generate foot traffic, but few other bakeries for competition.

In [4]:
categories = ["Bakery", "Restaurant", "Coffee Shop", "Café"]
df = df[df["Venue Category"].isin(categories)]
df["Venue Category"].value_counts().head(10)

Restaurant     405
Coffee Shop    143
Café            94
Bakery          45
Name: Venue Category, dtype: int64

In [5]:
def group_related_shops(venue):
    # group the venues ["Restaurant", "Coffee Shop", "Café"] as related
    if venue in ["Restaurant", "Coffee Shop", "Café"]:
        return "Related Venue"
    else:
        return venue
    
df["Venue Category"] = df["Venue Category"].apply(group_related_shops)
df["Venue Category"].value_counts()

Related Venue    642
Bakery            45
Name: Venue Category, dtype: int64

We now have the bakery and related venue information. Time to look at how these cluster.

## Take a quick look at our data
It's always a good idea to go over our data before we start applying machine learning models.

In [6]:
import numpy as np
import pandas as pd

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
import folium

In [7]:
df.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Related Venue |
| 5 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Impact Kitchen | 43.656369 | -79.35698 | Related Venue |
| 12 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Sumach Espresso | 43.658135 | -79.359515 | Related Venue |
| 13 | Harbourfront, Regent Park | 43.65426 | -79.360636 | Starbucks | 43.651327 | -79.364329 | Related Venue |


In [8]:
# https://www.latlong.net/place/toronto-on-canada-27230.html
tlat = 43.651070
tlng = -79.347015

map_toronto = folium.Map(location=[tlat, tlng], zoom_start=10)

# add markers to map
data = zip(df['Venue Latitude'], df['Venue Longitude'], df['Neighborhood'], df["Venue Category"])
for lat, lng, neighborhood, category in data:
    label = '{}, {}'.format(neighborhood, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green' if category == "Related Venue" else "red",
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

From this map:
*  Each bakery is represented in red
*  Related venues are in green ("Restaurant", "Coffee Shop", "Café")

From an initial scan of the map it looks like, as expected, coffee shops and cafés generally have a bakery nearby. In some areas, though, such as Wellesley and Swansea, there appear to be plenty of related venues but no bakery.

That said, the Foursquare data may be somewhat incomplete. A quick [Google search](https://www.google.com/maps/search/bakery/@43.667978,-79.3928821,15z) of the same area goes a long way.
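
The visual impression above (areas with many related venues but no bakery) can also be checked with a count per neighbourhood. The sketch below uses a few made-up rows in the same shape as `df`, with placeholder neighbourhood names, so the result says nothing about Toronto itself; it assumes pandas' `crosstab`.

```python
import pandas as pd

# Hypothetical rows in the same shape as df (not the real venue data)
sample = pd.DataFrame(
    {
        "Neighborhood": ["A", "A", "A", "B", "B", "C"],
        "Venue Category": [
            "Related Venue", "Bakery", "Related Venue",
            "Related Venue", "Related Venue", "Bakery",
        ],
    }
)

# Count venues of each category per neighbourhood (missing pairs become 0)
counts = pd.crosstab(sample["Neighborhood"], sample["Venue Category"])

# Neighbourhoods that have related venues but no bakery
no_bakery = counts[(counts["Bakery"] == 0) & (counts["Related Venue"] > 0)]
print(no_bakery.index.tolist())
```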

## Clustering the related venue data

First, let's collect the latitude and longitude of the related venues.

In [9]:
# take copies so that adding a "Cluster" column later does not modify a slice of df
dfr = df[df["Venue Category"]=="Related Venue"].copy()
dfb = df[df["Venue Category"]=="Bakery"].copy()

In [10]:
# (n_venues, 2) array of [latitude, longitude] pairs
X = dfr[['Venue Latitude', 'Venue Longitude']].to_numpy()
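
One caveat worth keeping in mind: k-means on raw latitude/longitude treats a degree of longitude as being as long as a degree of latitude, which does not hold at Toronto's latitude. The sketch below uses the standard haversine formula to estimate how much the two differ. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration and are not part of the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Map centre used earlier in the notebook
tlat, tlng = 43.651070, -79.347015

# Length of one degree of latitude and of longitude around that point
km_per_deg_lat = haversine_km(tlat, tlng, tlat + 1, tlng)
km_per_deg_lng = haversine_km(tlat, tlng, tlat, tlng + 1)
print(round(km_per_deg_lat, 1), round(km_per_deg_lng, 1))
```

Around Toronto a degree of latitude is roughly 111 km while a degree of longitude is roughly 80 km, so one option is to scale longitude by the cosine of the latitude before clustering. The notebook keeps raw degrees, which may be acceptable over a single city.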

From the map there appear to be approximately five clusters of related venues.

In [11]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)

In [12]:
kmeans.labels_

array([0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3,
       3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [13]:
set(kmeans.labels_.tolist())

{0, 1, 2, 3, 4}
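
The choice of five clusters above was made by eye. One common way to sanity-check such a choice is the silhouette score. The sketch below runs it on synthetic points (three made-up blobs, not the venue data), so the numbers are illustrative only; it assumes scikit-learn's `silhouette_score` and sets `n_init` explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three hypothetical, well-separated blobs of (lat, lng)-like points
centres = np.array([[43.65, -79.38], [43.70, -79.40], [43.66, -79.32]])
points = np.vstack([c + rng.normal(scale=0.003, size=(60, 2)) for c in centres])

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The candidate with the highest score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data the same loop could be run with `X` in place of `points`; a higher score suggests tighter, better-separated clusters, though it is only one heuristic among several.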

In [14]:
import warnings
warnings.filterwarnings("ignore")

dfr["Cluster"] = kmeans.labels_
cluster_colours_dict = {
    0: "green",
    1: "blue",
    2: "orange",
    3: "pink",
    4: "purple"
}

Let's look at the map and colour each cluster...

In [16]:
map_toronto = folium.Map(location=[tlat, tlng], zoom_start=10)

# add markers to map
data = zip(dfr['Venue Latitude'],
           dfr['Venue Longitude'],
           dfr['Neighborhood'],
           dfr["Venue Category"],
           dfr["Cluster"])

# add the clusters of related venues:  ("Restaurant", "Coffee Shop", "Café")
for lat, lng, neighborhood, category, cluster in data:
    label = '{}, {}'.format(neighborhood, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=cluster_colours_dict[cluster],
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)


# also add the bakeries in red
data = zip(dfb['Venue Latitude'],
           dfb['Venue Longitude'],
           dfb['Neighborhood'],
           dfb["Venue Category"])
for lat, lng, neighborhood, category in data:
    label = '{}, {}'.format(neighborhood, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color="red",
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)


map_toronto

Of all the clusters, the one to the north around Eglinton looks short on bakeries. A quick look at [Google Maps](https://www.google.com.au/maps/search/bakery/@43.7062441,-79.400403,17z/data=!4m2!2m1!6e5) shows that one bakery has popped up in the area, but overall this could be a reasonable bet: competition is low and there are other venues nearby to generate foot traffic.

Wellesley also looks like a good bet: according to the Foursquare data there are a variety of shops in the area, but no bakery. However, as mentioned earlier, a [Google search](https://www.google.com/maps/search/bakery/@43.667978,-79.3928821,15z) of the area shows a number of bakeries nearby. I suspect that the Foursquare data is not quite as complete or up to date as what is available through Google.

## Number of bakeries per other venue in the cluster
If we suppose that each related venue ("Restaurant", "Coffee Shop", "Café") generates a roughly constant amount of foot traffic, then it is useful to know the number of bakeries per related venue in each cluster. We can use `kmeans.predict` to assign each bakery to one of the clusters fitted on the related venues.

In [18]:
#kmeans.predict([[0, 0], [12, 3]])

# (n_bakeries, 2) array of [latitude, longitude] pairs
B = dfb[['Venue Latitude', 'Venue Longitude']].to_numpy()
pred = kmeans.predict(B)
pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 3, 3,
       2, 0, 0, 0, 0, 0, 0, 2, 2, 1, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0,
       0])
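
For k-means, `predict` assigns each new point to the cluster whose centroid is closest. A minimal NumPy sketch of that rule follows, with two made-up centroids standing in for `kmeans.cluster_centers_` and three made-up bakery coordinates:

```python
import numpy as np

# Hypothetical centroids (stand-ins for kmeans.cluster_centers_)
centroids = np.array([[43.65, -79.38], [43.71, -79.40]])

# Hypothetical bakery coordinates to assign
bakeries = np.array([[43.653, -79.362], [43.706, -79.399], [43.66, -79.39]])

# Pairwise Euclidean distances, shape (n_bakeries, n_centroids)
dists = np.linalg.norm(bakeries[:, None, :] - centroids[None, :, :], axis=2)

# Index of the nearest centroid for each bakery
nearest = dists.argmin(axis=1)
print(nearest.tolist())
```

This is why a bakery can land in a cluster even when it sits far from every related venue: it is simply given the label of the nearest centroid.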

In [23]:
dfb["Cluster"] = pred

In [25]:
cluster_colours_dict

{0: 'green', 1: 'blue', 2: 'orange', 3: 'pink', 4: 'purple'}

In [47]:
#{k:dfr[dfr["Cluster"]==k].count() for k in cluster_colours_dict.keys()}
bakery_ratio_dict = {}
for cluster_num in cluster_colours_dict.keys():
    num_bakeries = len(dfb[dfb["Cluster"]==cluster_num].index)
    num_related = len(dfr[dfr["Cluster"]==cluster_num].index)
    bakery_ratio_dict[cluster_num] = (cluster_colours_dict[cluster_num],
                                      num_bakeries,
                                      num_related,
                                      str(round(num_bakeries/num_related*100,2))+"%")

bakery_ratio_dict

{0: ('green', 31, 432, '7.18%'),
 1: ('blue', 2, 28, '7.14%'),
 2: ('orange', 3, 48, '6.25%'),
 3: ('pink', 9, 107, '8.41%'),
 4: ('purple', 0, 27, '0.0%')}

In [48]:
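dfc = pd.DataFrame(bakery_ratio_dict).transpose()
dfc.columns = ["Colour", "num_bakeries", "num_related", "bakeries/related"]
# sort on the numeric value of the percentage strings rather than alphabetically
dfc.sort_values(by="bakeries/related",
                key=lambda s: s.str.rstrip("%").astype(float),
                inplace=True)
dfc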

| | Colour | num_bakeries | num_related | bakeries/related |
|---|---|---|---|---|
| 4 | purple | 0 | 27 | 0.0% |
| 2 | orange | 3 | 48 | 6.25% |
| 1 | blue | 2 | 28 | 7.14% |
| 0 | green | 31 | 432 | 7.18% |
| 3 | pink | 9 | 107 | 8.41% |
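
Storing the ratio as a string such as '7.18%' makes it awkward to sort or compare numerically, since plain string sorting would place '10.0%' before '9.0%'. A sketch of an alternative that keeps the ratio as a float, reusing the counts from the table above; the column name `ratio_pct` is an assumption introduced here:

```python
import pandas as pd

# Counts copied from the table above
dfc = pd.DataFrame(
    {
        "Colour": ["green", "blue", "orange", "pink", "purple"],
        "num_bakeries": [31, 2, 3, 9, 0],
        "num_related": [432, 28, 48, 107, 27],
    }
)

# Keep the ratio as a float so that it sorts numerically
dfc["ratio_pct"] = (dfc["num_bakeries"] / dfc["num_related"] * 100).round(2)
dfc = dfc.sort_values("ratio_pct")
print(dfc)
```

The percent sign can then be added only when the table is displayed, for example with `dfc["ratio_pct"].map("{:.2f}%".format)`.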


***