## 1. A description of the problem and a discussion of the background

# Introduction/Business Problem
When companies deploy new projects, they usually do so through experimentation. An experiment is run in a controlled environment or, in many cases, in a suitable pilot location; if it succeeds, it is expanded to other locations. When deciding where else to deploy such a project, it helps to compare candidate locations with the pilot location on appropriate parameters. One way to do this is to measure how similar or dissimilar the locations are.

In this project we aim to address three separate scenarios (all instances of the problem above) through clustering:

1. A company like Amazon or Walmart is thinking long-term and wants its own logistics fleet for better control and reliability. It ran an experiment in New York City to determine where to establish new package drop-off locations, making it more convenient for customers to return items. The experiment was a success, and the company has recognized Toronto and San Francisco as the best target cities.
We solve the problem of which areas in Toronto and San Francisco it should target for these drop-off locations.

2. A company like Trip Advisor wants to recommend locations in San Francisco to a user based on what that user did in New York City.
We solve the problem of recommending locations in a new city based on similarities in the previous one.

3. Clusters in different cities can also help companies determine where they can pilot-test self-driving cars, based on a successful pilot in one city. Let's assume Waymo, Uber or Tesla are at the stage where level 5 cars can be tested on roads with real customers. They would like to know where they can deploy their cars after a successful test (again) in New York City. They have identified San Francisco and Toronto as two metros that are open to this.
Alternatively, this can help ride-share companies like Uber and Lyft deploy more cars around a particular area for a certain type of service (such as shared rides), so that people would rather take a shared Uber/Lyft than, for example, a bus.
Here we solve the problem of identifying where they can deploy a specific type of service in San Francisco based on experimentation in NYC.

*Note: many assumptions were made to create these scenarios. The focus of this work is clustering neighborhoods in these three cities based on the top 10 venues in each neighborhood, to see which areas are similar across the three cities and how they can be identified once the target cities (San Francisco and Toronto) have been chosen.

** This study does not go into great detail and does not take into account many real-world factors and other data. It is meant as a presentation of Data Science and Machine Learning abilities, to showcase academic comprehension of the subject matter and tools.


## 2. A description of the data and how it will be used to solve the problem

# Data

In order to cluster different areas of the three cities mentioned above, we will consider each area within a city as a "neighborhood". The data for all three cities was cleaned from the sources below so that each 'Neighborhood' is paired with the latitude and longitude of its approximate center, and is stored in the "combined_df" dataframe. Details on the three individual data sources:

1. Toronto DataSet: This CSV dataframe was built by combining two datasets:
    - Postal Codes List in Toronto (which was exported and saved as an xlsx file). This file contained the postal codes and corresponding Neighborhood names. https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
    - Geospatial File that has the latitude and longitudes and has been provided by Coursera http://cocl.us/Geospatial_data
2. New York DataSet: This data set was extracted from https://geo.nyu.edu/catalog/nyu_2451_34572
3. San Francisco DataSet: This data was extracted from https://geodata.lib.berkeley.edu/catalog/ark28722-s75c8t

We will use the three datasets to create one merged dataset that contains all neighborhoods of all three cities. We will extract venues for each neighborhood from Foursquare and cluster the neighborhoods to determine similarities between neighborhoods, both across cities (inter-city) and within a city (intra-city). We will focus more on the inter-city relationships, as they are what is needed to solve the three problems stated in the first section.

# METHODOLOGY

This section explains the main body of the work conducted in the project. It is divided into 6 sub-sections (if you want to follow along in the notebook). After these sub-sections, the next main section, "RESULTS", is presented.

Before diving into details, here is a summary of the methodology. Neighborhood and location data for the three cities (NYC, SF and Toronto) was combined into a single dataframe. Venues were extracted for these neighborhoods from FourSquare, and each neighborhood was explored based on its top venues. Then partitioning clustering (K-means) was used to group the neighborhoods into six clusters, to see which neighborhoods in different cities are similar. The result of this clustering enables one to judge which locations in new cities (SF, Toronto) can be used to deploy an experiment that succeeded in a pilot city (NYC).

1. Importing and preparing all three data sets (NY, SF, Toronto): As mentioned in the data section, all three datasets were publicly available and were formatted and cleaned to get a combined dataframe "combined_df", which consists of four columns: 'City', 'Neighborhood', 'Latitude', and 'Longitude'.

2. Creating Maps and Pinouts to visualize the data: A map of North America was created using Folium to visualize the neighborhoods in the three cities. This confirmed that all the neighborhoods were imported properly and that the location data was correct.

3. Using FourSquare to get the Venue information for combined_df: FourSquare is an application that has data on venues, users and experiences. Data for venues (limited to 100 venues per neighborhood) was extracted from it using API calls (more on APIs here https://developer.foursquare.com/docs/). The main interest of this project was to segment the neighborhoods based on the different types of venues in each neighborhood. Hence the venue categories were exported and cleaned up to create a master_df containing all cities, neighborhoods, venues and venue categories, along with their location information.

4. Create dataframe for Clustering: The goal of this sub-section was to build a dataframe that could be used for clustering. This dataframe is "master_grouped"; it was created by grouping the one-hot encoded venues table by Neighborhood and City, to maintain the uniqueness of each neighborhood. Another dataframe was also created to show the top 10 venues by neighborhood, so that the cluster labels could be added to it to analyze what each cluster contains (after clustering). This dataframe is "neighborhoods_venues_sorted".

5. Clustering: Once the dataframes above were available, partitioning clustering (K-means) was used to create 6 clusters of neighborhoods. Cluster labels were inserted into "neighborhoods_venues_sorted", and this dataframe was merged with "combined_df", which contains the latitude and longitude (location) information, to create a final dataframe "master_merged" that could be analyzed for results.

6. Created a MAP for visualization: Finally, another map was created using Folium to visualize the clusters and how they were spread across the different cities. Because Toronto and NYC are close to each other, their spread could be compared quite conveniently on one map. SF is on the opposite coast, so we compared it using snippets.

\* The next sections are the cells for the METHODOLOGY section. After sub-section 6, the RESULTS are shared. You will need to install folium and run the cells to view the maps.
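
Before the notebook cells, steps 4 and 5 can be illustrated on a tiny scale. The sketch below is not the notebook's code: the toy table and its values are hypothetical, `pd.crosstab(..., normalize="index")` is used as a shortcut that should give the same per-neighborhood category shares as one-hot encoding followed by a grouped mean, and `k=2` is chosen only because the toy data is so small.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "City": ["NYC", "NYC", "NYC", "SF", "SF", "SF", "Toronto", "Toronto", "Toronto"],
    "Neighborhood": ["A", "A", "B", "C", "C", "D", "E", "E", "F"],
    "Venue Category": ["Cafe", "Bar", "Park", "Cafe", "Bar", "Park", "Cafe", "Cafe", "Trail"],
})

# Share of each venue category per (City, Neighborhood); every row sums to 1.
freq = pd.crosstab(
    [venues["City"], venues["Neighborhood"]],
    venues["Venue Category"],
    normalize="index",
)

# Cluster the category-share profiles; k=2 is only for this toy example.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(freq.values)
labels = pd.Series(kmeans.labels_, index=freq.index, name="Cluster Labels")
print(labels)
```

Neighborhoods with identical venue profiles, such as "A" in NYC and "C" in SF here, end up with the same label, which is the kind of inter-city match the project is looking for.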

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

## 1. Importing and preparing all three data sets (NY, SF, Toronto).
In order to create maps using Folium and identify neighborhoods on them, we need dataframes that contain the neighborhood, latitude and longitude.

toronto_df  
newyork_df  
sf_df  


#### 1.1. Creating Toronto Dataset

In [None]:
toronto_df = pd.read_csv(r"C:\Users\The Godfather\Desktop\totonto_neighborhoods.csv")
toronto_df.drop(columns=["Unnamed: 0","PostalCode"], inplace=True)
toronto_df.head()

In [None]:
toronto_df.shape

In [None]:
#Adding City column
toronto_df["City"] = "Toronto"
toronto_df = toronto_df[["City", "Neighborhood", "Latitude", "Longitude"]]
toronto_df.head()

In [None]:
toronto_df.shape

#### 1.2. Creating NYC Dataset

In [None]:
#1.2.1 Importing the json file
with open(r"C:\Users\The Godfather\Desktop\NYC_neighborhood_names-geojson.json") as json_data:
    newyork_data = json.load(json_data)
    
newyork_data

In [None]:
newyork_data = newyork_data['features']
newyork_data[0]

In [None]:
#1.2.2. Converting the above details into a dataframe
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
newyork_df = pd.DataFrame(columns=column_names)

In [None]:
# collect one record per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in newyork_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

newyork_df = pd.DataFrame(rows, columns=column_names)

In [None]:
newyork_df.head()

In [None]:
newyork_df.shape

In [None]:
#Adding City column
newyork_df["City"] = "NYC"
newyork_df = newyork_df[["City", "Neighborhood", "Latitude", "Longitude"]]
newyork_df.head()

#### 1.3 Creating San Francisco Dataset

In [None]:
import json

with open(r"C:\Users\The Godfather\Desktop\sf_neighborhood_bounds-geojson.json") as read_file:
    data = json.load(read_file)

In [None]:
list(data.keys())

In [None]:
#data['features'][0]['geometry']['coordinates'][0][0]
#data['features'][0]['properties']['NEIGHBORHO']
# one small dataframe of boundary vertices per neighborhood, concatenated once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []
for feature in data['features']:
    df = pd.DataFrame(feature['geometry']['coordinates'][0][0]).rename({0: "Longitude", 1: "Latitude"}, axis=1)
    df['Neighbrhood'] = feature['properties']['NEIGHBORHO']
    frames.append(df)
DDFF = pd.concat(frames)

In [None]:
sf_df = DDFF.groupby(['Neighbrhood']).mean().reset_index()
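
The grouped mean above treats the average of a neighborhood's boundary vertices as its approximate center. The following sketch shows the same idea in plain Python on a hypothetical square feature; `approximate_center` is an illustrative helper, not part of the notebook, and the result is only a rough stand-in for a true area-weighted centroid.

```python
def approximate_center(feature):
    """Average the vertices of the first polygon's outer ring.

    Mirrors the notebook's use of coordinates[0][0]; this is a vertex
    average, not an area-weighted centroid.
    """
    ring = feature["geometry"]["coordinates"][0][0]  # list of [lon, lat] pairs
    lon = sum(point[0] for point in ring) / len(ring)
    lat = sum(point[1] for point in ring) / len(ring)
    return lat, lon


# Hypothetical feature: a unit square whose closing vertex repeats the first one.
feature = {
    "properties": {"NEIGHBORHO": "Example"},
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [[[[-122.0, 37.0], [-122.0, 38.0], [-121.0, 38.0],
                          [-121.0, 37.0], [-122.0, 37.0]]]],
    },
}

lat, lon = approximate_center(feature)
print(lat, lon)
```

For this square the vertex average lands slightly off the geometric center (37.5, -121.5), because the repeated closing vertex is counted twice. For small neighborhoods that offset is probably acceptable when the point is only used as the center of a venue search radius.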

In [None]:
sf_df

In [None]:
sf_df = sf_df[['Neighbrhood', 'Latitude', 'Longitude']]
sf_df

In [None]:
sf_df.shape

In [None]:
sf_df["City"] = "SF"
sf_df = sf_df[["City", "Neighbrhood", "Latitude", "Longitude"]]
sf_df = sf_df.copy()

In [None]:
sf_df.rename(columns ={"Neighbrhood":"Neighborhood"}, inplace=True)
sf_df.head()

In [None]:
sf_df.shape

#### 1.4. Make the NY and Toronto dataframes match SF by dropping Boroughs, then merge the three dataframes into combined_df

In [None]:
toronto_df.head()


In [None]:
toronto_df.shape

In [None]:
newyork_df.head()

In [None]:
newyork_df.shape

In [None]:
sf_df.head()

In [None]:
#Now merge all three dfs vertically
combined_df = pd.concat([sf_df, newyork_df, toronto_df])
combined_df.head()

In [None]:
combined_df.shape

#### There are a total of 445 neighborhoods in the three cities combined that we will take into consideration, of which 431 have unique names.
We will keep the duplicate names at this stage, since the 'City' column makes each row unique for further processing. We will see later that these are weeded out step by step: a few neighborhoods are ignored during clustering, leaving a clean set of 423 rows (after removing the NaNs from one-hot encoding).
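
The difference between total rows and unique names can be checked with a few pandas calls. The sketch below uses a hypothetical four-row frame (the neighborhood names are placeholders, not taken from the data) to show how a name repeated across cities is still unique once 'City' is part of the key.

```python
import pandas as pd

# Hypothetical rows: the same neighborhood name can occur in more than one city.
combined = pd.DataFrame({
    "City": ["NYC", "SF", "Toronto", "NYC"],
    "Neighborhood": ["Chinatown", "Chinatown", "Rouge", "Chelsea"],
})

n_rows = len(combined)
n_unique_names = combined["Neighborhood"].nunique()
n_unique_pairs = len(combined.drop_duplicates(subset=["City", "Neighborhood"]))
print(n_rows, n_unique_names, n_unique_pairs)

# Rows whose name appears more than once anywhere in the combined frame.
repeated = combined[combined.duplicated(subset="Neighborhood", keep=False)]
print(repeated)
```

Running the same three counts on `combined_df` would show how many of its duplicate names fall within one city and how many are shared between cities.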

## 2. Creating Maps and Pinouts to visualize the data

In [None]:
address = 'USA'
geolocator = Nominatim(user_agent="usa-explorer")

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of USA are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of USA/Canada using latitude and longitude values
map_northamerica = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(combined_df['Latitude'], combined_df['Longitude'], combined_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northamerica)  
    
map_northamerica

## 3. Using FourSquare to get the Venue information for combined_df


#### 3.1. Create foursquare credentials

In [None]:
import os

# Read the Foursquare credentials from environment variables rather than hardcoding them
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumed convention)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

#### 3.2 Create a function to get all nearby venues in combined_df

In [None]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
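
The function above indexes `categories[0]` for every venue, which would raise an `IndexError` if a venue came back without a category. The sketch below shows one defensive way to flatten a response. It assumes only the nesting that the function itself relies on (`response`, `groups`, `items`, `venue`), and both `venues_from_response` and the sample payload are hypothetical.

```python
def venues_from_response(payload, name, lat, lng):
    """Flatten one explore-style response into rows, skipping venues without a category."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        if not categories:
            continue  # nothing to cluster on for this venue
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows


# Hypothetical payload: one categorised venue and one venue with no category.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe One", "location": {"lat": 40.0, "lng": -74.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unnamed Spot", "location": {"lat": 40.1, "lng": -74.1},
               "categories": []}},
]}]}}

rows = venues_from_response(payload, "Example", 40.05, -74.05)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the flattening logic offline, without spending API quota.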

In [None]:
all_venues = getNearbyVenues(names=combined_df['Neighborhood'],
                                   latitudes=combined_df['Latitude'],
                                   longitudes=combined_df['Longitude']
                                  )

In [None]:
all_venues.head()

In [None]:
all_venues.shape

#### There are a total of 12946 Venues that were exported from Foursquare.

#### 3.3 Add the city column from combined_df into all_venues and create a new dataframe called "master_df"

In [None]:
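# attach the 'City' column by matching each venue row to its neighborhood name and coordinates
neighborhood_keys = combined_df.drop_duplicates().rename(
    {'Latitude': 'Neighborhood Latitude', 'Longitude': 'Neighborhood Longitude'}, axis=1)
master_df = pd.merge(
    all_venues,
    neighborhood_keys,
    on=['Neighborhood', 'Neighborhood Longitude', 'Neighborhood Latitude'],
    how='left')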

In [None]:
master_df.head()

In [None]:
master_df.shape

#### 3.4. Make sure data is good

In [None]:
master_df.isnull().any()

In [None]:
# let's see the number of unique categories returned from Foursquare

print('There are {} unique categories.'.format(len(master_df['Venue Category'].unique())))

In [None]:
print(master_df.loc[master_df['Neighborhood'] == 'NaN'])
print(master_df.loc[master_df['Neighborhood'] == 0])

## 4. Create dataframe for Clustering
1. Create a dataframe grouped by 'City' and 'Neighborhood' with one-hot encoded venue categories.
2. Cluster them using K-means

In [None]:
master_df.groupby(['Neighborhood', 'City']).count()

In [None]:
# one hot encoding
master_onehot = pd.get_dummies(master_df[['Venue Category']], prefix="", prefix_sep="")


#add neighborhood and city column back to dataframe
master_onehot[['Neighborhood', 'City']] = master_df[['Neighborhood', 'City']] 

# move city column to the first column
fixed_columns = [master_onehot.columns[-1]] + list(master_onehot.columns[:-1])
master_onehot = master_onehot[fixed_columns]

# move neighborhood column to the first column
#fixed_columns = [master_onehot.columns[306]] + list(master_onehot.columns[:-1])
#master_onehot = master_onehot[fixed_columns]

master_onehot.head()

In [None]:
master_onehot.shape

In [None]:

master_onehot.columns.get_loc("Neighborhood")
     

In [None]:
master_onehot["Neighborhood"]

#### You will notice each neighborhood has multiple rows. This is because one-hot encoding was done per venue. That means, if a neighborhood has 5 venues, it will have 5 rows, each with a '1' in the column of that venue's category.
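
A small hypothetical example makes the row structure concrete: four venues in one neighborhood give four one-hot rows, and averaging them yields the share of each category. The frame and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical neighborhood "X" with four venues in three categories.
toy = pd.DataFrame({
    "City": ["SF", "SF", "SF", "SF"],
    "Neighborhood": ["X", "X", "X", "X"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar"],
})

# One row per venue, one column per category.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="").astype(float)
onehot["Neighborhood"] = toy["Neighborhood"]
onehot["City"] = toy["City"]
print(onehot.shape)

# The grouped mean turns the 0/1 rows into category frequencies.
grouped = onehot.groupby(["Neighborhood", "City"]).mean().reset_index()
print(grouped)
```

Here "Cafe" ends up with a frequency of 0.5 and "Bar" and "Park" with 0.25 each, which is the form of the rows in `master_grouped`.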


In [None]:
master_grouped = master_onehot.groupby(["Neighborhood", "City"]).mean().reset_index()
master_grouped

In [None]:
# Now let's see the top 5 venue categories in each neighborhood.
# Iterating row by row keeps neighborhoods that share a name across cities separate;
# the first two columns ('Neighborhood', 'City') are skipped.

num_top_venues = 5

for _, row in master_grouped.iterrows():
    print("----" + row['Neighborhood'] + " (" + row['City'] + ")----")
    temp = row.iloc[2:].astype(float).round(2).reset_index()
    temp.columns = ['venue', 'freq']
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put the above info in a dataframe

In [None]:

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = master_grouped['Neighborhood']

for ind in np.arange(master_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(master_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
neighborhoods_venues_sorted.shape

## 5. CLUSTERING

In [None]:
from sklearn.cluster import KMeans 


In [None]:
# set number of clusters
kclusters = 6


master_grouped_clustering = master_grouped.drop(columns =['Neighborhood', 'City'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(master_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

master_merged = combined_df

# merge master_grouped with combined_df to add latitude/longitude for each neighborhood
master_merged = master_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

master_merged.head() # check the last columns!

In [None]:
master_merged = master_merged.drop_duplicates(subset="Neighborhood")
master_merged

In [None]:
master_merged.dropna()

In [None]:
master_merged['Cluster Labels'].value_counts()

In [None]:
master_merged.loc[master_merged['Cluster Labels'].isnull()]

In [None]:
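# drop the neighborhoods that received no cluster label;
# index labels repeat after pd.concat, so dropping by hardcoded index could remove other rows too
master_merged = master_merged.dropna(subset=['Cluster Labels'])
master_merged.head()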

In [None]:
master_merged.shape

In [None]:
master_merged["Cluster Labels"] = master_merged["Cluster Labels"].apply(np.int64)

In [None]:
master_merged.head()

In [None]:
master_merged['Cluster Labels'].value_counts()
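
Since the focus is on inter-city similarity, it can help to break the cluster counts down by city. The sketch below does this with `pd.crosstab` on a hypothetical six-row frame that uses the same column names as `master_merged`; the labels in it are made up.

```python
import pandas as pd

# Hypothetical labelled neighborhoods in the shape of master_merged.
merged = pd.DataFrame({
    "City": ["NYC", "NYC", "SF", "SF", "Toronto", "Toronto"],
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "Cluster Labels": [5, 2, 5, 0, 5, 2],
})

# Number of neighborhoods per cluster and city.
by_city = pd.crosstab(merged["Cluster Labels"], merged["City"])
print(by_city)

# Clusters present in every city are the candidates for inter-city comparison.
shared = by_city[(by_city > 0).all(axis=1)].index.tolist()
print(shared)
```

A cluster that appears in the pilot city and in a target city is the kind of match the three scenarios depend on, whereas a cluster confined to one city says little about where to expand.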

## 6. Create a MAP to visualize if needed

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(master_merged['Latitude'], master_merged['Longitude'], master_merged['Neighborhood'], master_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
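
Note that the marker colour is looked up with `rainbow[cluster-1]`, so cluster 0 wraps around to the last colour in the list. The sketch below reproduces that lookup with plain colour names instead of hex codes. The names are approximations taken from the RESULTS section, and `palette` is a hypothetical stand-in for the six-colour `rainbow` list.

```python
# Approximate colour names in the order of the six-colour rainbow list (index 0 to 5).
palette = ["purple", "dark blue", "light blue", "green", "orange", "red"]

# Same indexing as the map cell: index -1 (cluster 0) picks the last entry.
cluster_colours = {cluster: palette[cluster - 1] for cluster in range(len(palette))}
for cluster, colour in cluster_colours.items():
    print(cluster, colour)
```

This is why cluster 0 is drawn in red and cluster 5 in orange on the map, even though red is the last colour in the list.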

# RESULTS & DISCUSSIONS

The clustering successfully divided the neighborhoods in all three cities into six clusters, labeled 0 to 5. Three of these (0, 2 and 5) are considered for analysis.

1. Cluster 0 (RED Colored Clusters): These were areas where the most popular venue was a Bus Stop. Two main locations in SF fit this category. This is a bit surprising, but a plausible explanation is that many students and tourists who travel from the city to these two destinations get there using multiple modes of transport, and prefer buses (a single mode of transport, even though it is slower) to get back to their residence because they are tired after a day at the beach or walking/hiking around Twin Peaks.

2. Cluster 1 (PURPLE Colored Clusters): Just had one location, so we will not consider this.

3. Cluster 2 (DARK BLUE Colored Clusters): These areas had trails, parks, playgrounds and zoos. These are places people visit for outdoor activities or weekend trips. Many tourists also visit them to see local nature and points of interest. This cluster is a good source of recommendations for a company like Trip Advisor to give to a person who enjoys outdoor activities and is visiting San Francisco or Toronto.

4. Cluster 3 (LIGHT BLUE Colored Clusters): Just had 2 locations, so we will not consider this.

5. Cluster 4 (GREEN Colored Clusters): Pizza Places. Not of much use for our applications, so we will not consider this.

6. Cluster 5 (ORANGE Colored Clusters): Restaurants, Coffee Shops, Cafes, Convenience Stores, Pubs & Bars, etc. This is the most prominent cluster, as all three locations are densely populated cities. These are the places where a lot of movement happens: people go to cafes and coffee shops on a regular basis, and neighborhood cafes, coffee shops, bakeries and restaurants are places people regularly visit with friends and family. These locations can be used by Amazon/Walmart to establish package drop-off locations.

In [None]:
# Cluster [0]
master_merged.loc[master_merged['Cluster Labels'] == 0, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

In [None]:
# Cluster [1]:
master_merged.loc[master_merged['Cluster Labels'] == 1, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

In [None]:
# Cluster [2]
master_merged.loc[master_merged['Cluster Labels'] == 2, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

In [None]:
# Cluster [3]
master_merged.loc[master_merged['Cluster Labels'] == 3, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

In [None]:
# Cluster [4]
master_merged.loc[master_merged['Cluster Labels'] == 4, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

In [None]:
# Cluster [5]
master_merged.loc[master_merged['Cluster Labels'] == 5, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

# CONCLUSION
This study successfully segmented neighborhoods based on similarities in their most popular venues (across cities), which can help companies decide in which areas of a new city to deploy a project after successful experimentation in a pilot city.
This was extended to the three cases introduced in the first section.

1. Companies like Amazon/Walmart could deploy "package drop-off locations" around neighborhood venues that belong to cluster 5 (Orange Colored Clusters) in Toronto and San Francisco.

2. Trip Advisor (or similar) could give relevant recommendations, drawn from cluster 2 (Dark Blue Colored Clusters), to users who enjoy outdoor activities and are either tourists or simply exploring Toronto and San Francisco.

3. Companies like Uber/Lyft or SuperShuttle could deploy new ride-share service routes in San Francisco in cluster 0 (Red Colored Clusters), where most people would otherwise wait at a bus stop for a single mode of transportation rather than use other modes that require changing stations or modes multiple times.


** Again, this study does not go into great detail and does not take into account many real-world factors and other data. It is meant as a presentation of Data Science and Machine Learning abilities, to showcase academic comprehension of the subject and tools.