## Capstone Project - Bowling Alley near York University, Toronto

### Introduction / Business Problem

Students at York University, Toronto, have no nearby place to unwind during a demanding semester.
Nor do they have a food joint that stays open past 10 pm.
A recreational venue such as a bowling alley that serves good food and plays youthful music would be an ideal spot for students to gather and relax.

Such a venue would matter even more once winter starts, since students would not have to travel to downtown Toronto for a few hours of relaxation.
Because the campus provides a steady student population, anyone running this business would stand to see good profits.

### Approach & Data Usage for a solution

#### 1. Download and Explore Dataset
Use the dataset that is exposed by Wikipedia or find any other source that provides data about Toronto and its neighborhoods.
We need the Latitude and Longitude coordinates of each neighborhood.

To explore the data, transform the raw data into a pandas dataframe.
Use the geopy library to get the latitude and longitude of the location considered for this project.
Create a map with the neighborhoods superimposed on top.
Get the top 100 venues within a 1500-meter radius of the campus.
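
The 1500-meter radius is applied by the venue API, but it can help to sanity-check distances locally. The sketch below is an illustration only: `haversine_m` and the sample venue coordinates are hypothetical, the Earth radius is an approximation, and the campus coordinates are the ones quoted in this notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

CAMPUS = (43.8013261, -79.4998567)  # York University, Toronto (from this notebook)

# Hypothetical venue coordinates, for illustration only
sample_venues = {
    "venue_near": (43.8050, -79.4990),
    "venue_far": (43.6532, -79.3832),  # roughly downtown Toronto
}

# Flag each venue by whether it lies within 1500 meters of the campus
within_radius = {
    name: haversine_m(*CAMPUS, lat, lon) <= 1500
    for name, (lat, lon) in sample_venues.items()
}
print(within_radius)
```

A check like this could be used to confirm that the venues returned for the campus really fall inside the intended radius.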



In [None]:
# 43.801326100, -79.499856700 -> York University, Toronto


#Scrape the Wikipedia page for getting the contents of table
import pandas as pd
from pandas import DataFrame
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
table = pd.read_html(url,header=0)
df = DataFrame(table[0])

#Create the 3 columns PostalCode, Borough and Neighborhood
column_names = ['PostalCode', 'Borough', 'Neighborhood']
myDataFrame=pd.DataFrame(columns=column_names)

#Take all the values scraped from Wikipedia and load them under the respective columns
myDataFrame['PostalCode'] = df['Postal Code']
myDataFrame['Borough'] = df['Borough']
myDataFrame['Neighborhood'] = df['Neighborhood']

#Only process the cells that have an assigned borough. Ignore cells with a borough that is 'Not assigned'.
myAssignedBoroughs = myDataFrame[myDataFrame['Borough'] != 'Not assigned'].reset_index(drop=True)

#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
#Update only the affected rows; assigning the whole column would overwrite every neighborhood.
unassigned = myAssignedBoroughs['Neighborhood'] == 'Not assigned'
myAssignedBoroughs.loc[unassigned, 'Neighborhood'] = myAssignedBoroughs.loc[unassigned, 'Borough']
        
#print(f'The number of rows of the final data frame with all assigned Neighborhoods & Boroughs is : {myAssignedBoroughs.shape[0]}')

#Get the Latitude and Longitudes 
url='http://cocl.us/Geospatial_data'
df = pd.read_csv(url)
df.rename(columns = {'Postal Code': 'PostalCode'}, inplace=True)

#Get the final dataframe that has Boroughs in York
finalDataFrame = pd.merge(myAssignedBoroughs,df, on='PostalCode')

dfOnlyYork = finalDataFrame[finalDataFrame['Borough'].str.contains('York')]
dfOnlyYork = dfOnlyYork.reset_index(drop=True) # drop=True avoids keeping the old index as an extra column


#### 2. Explore Neighborhoods in Toronto
Write code to explore the venues across the Neighborhoods to check if there are eateries and bowling alleys.
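
As a rough illustration of the check described above, the sketch below scans venue categories for eateries and bowling alleys. The sample venues, the keyword lists, and the `matches` helper are all assumptions made for this example; the real category names depend on what the venue API returns.

```python
# Hypothetical (venue name, category) pairs standing in for the API results
sample_venues = [
    ("Venue A", "Coffee Shop"),
    ("Venue B", "Bowling Alley"),
    ("Venue C", "Pizza Place"),
    ("Venue D", "Bank"),
]

# Assumed keyword lists; adjust them to the category names actually returned
EATERY_KEYWORDS = ("restaurant", "pizza", "coffee", "diner", "food")
BOWLING_KEYWORDS = ("bowling",)

def matches(category, keywords):
    """Case-insensitive substring match of a category name against keywords."""
    category = category.lower()
    return any(keyword in category for keyword in keywords)

eateries = [name for name, cat in sample_venues if matches(cat, EATERY_KEYWORDS)]
bowling_alleys = [name for name, cat in sample_venues if matches(cat, BOWLING_KEYWORDS)]

print("Eateries:", eateries)
print("Bowling alleys:", bowling_alleys)
```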



In [None]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (do not publish real credentials)
VERSION = '20180605' # Foursquare API version

#Let us explore the York University neighborhood in our Dataframe
dfOnlyYork.loc[15, 'Neighborhood']

neighborhood_latitude = dfOnlyYork.loc[15, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dfOnlyYork.loc[15, 'Longitude'] # neighborhood longitude value

neighborhood_name = dfOnlyYork.loc[15, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))




In [None]:
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library


In [None]:
!conda install -c conda-forge geopy --yes # skip this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#Now let us get some top venues that are in York University within a radius of 1500 meters
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


In [None]:
LIMIT = 100
radius = 1500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL
results = requests.get(url).json()


#### Extract each venue's category, then clean the JSON and structure it into a dataframe

In [None]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues

#### 3. Analyze each Neighborhood
Analyze each neighborhood by grouping its venues into categories such as "Arcade", "Art Museum", and "Athletics & Sports".
This helps narrow down a neighborhood that can be recommended to an entrepreneur looking to open a new business.
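
To make the grouping concrete, here is a small sketch in plain Python that computes the share of each venue category per neighborhood. Taking the per-neighborhood mean of one-hot encoded categories yields the same shares. The rows are hypothetical sample data, not results from this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) rows, for illustration only
rows = [
    ("Neighborhood X", "Coffee Shop"),
    ("Neighborhood X", "Coffee Shop"),
    ("Neighborhood X", "Arcade"),
    ("Neighborhood X", "Park"),
    ("Neighborhood Y", "Art Museum"),
    ("Neighborhood Y", "Park"),
]

# Count how often each category appears in each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in rows:
    counts[neighborhood][category] += 1

# Convert counts to relative frequencies (share of that neighborhood's venues)
frequencies = {
    neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for neighborhood, counter in counts.items()
}

# Print the three most common categories per neighborhood
for neighborhood, freq in frequencies.items():
    top = sorted(freq.items(), key=lambda item: item[1], reverse=True)[:3]
    print(neighborhood, top)
```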



In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
york_venues = getNearbyVenues(names=dfOnlyYork['Neighborhood'],
                                   latitudes=dfOnlyYork['Latitude'],
                                   longitudes=dfOnlyYork['Longitude']
                                  )
york_venues.head()
york_venues.groupby('Neighborhood').count()
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

In [None]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]

york_onehot.head()

In [None]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()

#Let's print each neighborhood along with the top 10 most common venues
num_top_venues = 10

for hood in york_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
import numpy as np
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

#### 4. Cluster Neighborhoods
Run k-means to cluster the neighborhoods into 5 clusters.
Use a Folium map to visualize the resulting clusters.
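
For readers new to k-means, the sketch below shows the idea in plain Python: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points. It is a simplified illustration, not the scikit-learn implementation used in this notebook; the data points are hypothetical and the starting centroids are fixed by hand for reproducibility.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid (squared Euclidean)."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Hypothetical venue-frequency vectors, e.g. (share of cafes, share of parks)
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
initial_centroids = [points[0], points[2]]  # fixed start for reproducibility

labels, centroids = kmeans(points, initial_centroids)
print(labels)
```

Neighborhoods with similar venue mixes end up with the same label, which is what the cluster colors on the map represent.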



In [None]:
# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop(columns='Neighborhood') # positional axis argument is not accepted by recent pandas

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(york_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 



In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

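york_merged = dfOnlyYork

# merge the sorted venue table into dfOnlyYork so each neighborhood keeps its latitude/longitude
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods with no venues get no cluster label; drop them and keep the labels as integers
york_merged = york_merged.dropna(subset=['Cluster Labels'])
york_merged['Cluster Labels'] = york_merged['Cluster Labels'].astype(int)

york_merged.head(10) # check the last columns!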

In [None]:
#York University, Toronto

address = 'York University, Toronto'
geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### 5. Examine Clusters
Examine each of the 5 clusters above.
Check whether multiple locations can be recommended for the problem identified earlier: finding a location for a new bowling alley that can double as an eatery and music venue.
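
One possible way to turn the cluster tables into a shortlist is sketched below. It is only an assumption about how the screening could work: the rows are hypothetical, and the rule (keep neighborhoods with no bowling alley among their most common venues) is an illustrative choice rather than the method used in this notebook.

```python
# Hypothetical rows: (neighborhood, cluster label, most common venue categories)
rows = [
    ("Neighborhood X", 0, ["Coffee Shop", "Park", "Pizza Place"]),
    ("Neighborhood Y", 0, ["Bowling Alley", "Bar", "Gym"]),
    ("Neighborhood Z", 1, ["Grocery Store", "Bank", "Pharmacy"]),
]

def has_category(categories, keyword):
    """True if any category name contains the keyword (case-insensitive)."""
    return any(keyword.lower() in category.lower() for category in categories)

# Assumed screening rule: keep neighborhoods with no bowling alley among their top venues
candidates = [
    (name, cluster)
    for name, cluster, categories in rows
    if not has_category(categories, "bowling")
]

# Group the surviving candidates by cluster label
by_cluster = {}
for name, cluster in candidates:
    by_cluster.setdefault(cluster, []).append(name)

print(by_cluster)
```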

#### Cluster 1

In [None]:
york_merged.loc[york_merged['Cluster Labels'] == 0, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

#### Cluster 2

In [None]:
york_merged.loc[york_merged['Cluster Labels'] == 1, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

#### Cluster 3

In [None]:
york_merged.loc[york_merged['Cluster Labels'] == 2, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

#### Cluster 4

In [None]:
york_merged.loc[york_merged['Cluster Labels'] == 3, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

#### Cluster 5

In [None]:
york_merged.loc[york_merged['Cluster Labels'] == 4, york_merged.columns[[1] + list(range(5, york_merged.shape[1]))]]

### ****** Thank you ******