# Capstone Project

## Table of Contents
1. [Introduction](#introduction)
2. [Data](#data)
3. [Methodology](#methodology)
4. [Results](#results)
5. [Conclusion and Discussion](#conclusion)

## Introduction/Business Problem <a id="introduction"></a>

**The purpose of the project is to recommend neighborhoods in New York City in which to start a business of a given type.**

Choosing a location is essential when starting a business. One of many approaches to identifying locations suited to a type of store or venue is to look at the best-rated venues of that type in the city and select similar locations. The underlying assumption is that if a business of a given type is successful in neighborhood A, it will most likely also be successful in similar neighborhoods B or C. The success of a venue is measured by its rating and its count of likes. Thus, given a business type, the best-rated venues of that type are searched for in the city. For each of these venues, the nearest neighborhood and its borough are determined; this neighborhood is referred to as the target neighborhood. Within the same borough, all neighborhoods are clustered and assigned cluster labels. Only neighborhoods with the same cluster label as a target neighborhood are returned, and a brief summary is generated to describe them.

## Data <a id="data"></a>

The data used for this project include location and venue data from Foursquare and neighborhood/borough data for New York retrieved from Wikipedia.

- The Foursquare location data, combined with the New York neighborhood data, is used to describe the physical locations of venues and to calculate the nearest neighborhood of each venue. Neighborhoods are clustered based on the number of venues of each type within a given radius.
- The Foursquare venue data is used to measure the quality of a venue by its rating and count of likes. Venues of a given type in a city are ranked by rating and then by count of likes: if two venues have the same rating, the one with more likes is ranked higher.
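
The ranking rule can be sketched in a few lines. The venue records below are made-up placeholders rather than Foursquare data, and `rank_venues` is a hypothetical helper name. Sorting on the tuple `(rating, likes_count)` in descending order makes the like count act as a tie-breaker.

```python
def rank_venues(venues):
    """Order venues by rating, then by like count, both descending."""
    return sorted(venues, key=lambda v: (v["rating"], v["likes_count"]), reverse=True)


# Placeholder records for illustration only (not real Foursquare results).
sample = [
    {"name": "A", "rating": 9.1, "likes_count": 120},
    {"name": "B", "rating": 9.4, "likes_count": 80},
    {"name": "C", "rating": 9.1, "likes_count": 300},
]

ranked_names = [v["name"] for v in rank_venues(sample)]
print(ranked_names)  # ['B', 'C', 'A']
```

Venues A and C share a rating, so C is placed first because it has more likes.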

## Methodology <a id="methodology"></a>

The project aims to generate the analysis results automatically and visualize them on a Folium map. The steps are as follows:
1. Find the best-rated venues of a given type in a city. <br>
    1). Use the Geopy library to identify the latitude and longitude of the city. <br>
    2). Use the Foursquare venue search API to search for venues of the given type within a radius.<br>
    3). Sort the returned venue list on rating and then on likes_count.<br>
2. List the neighborhoods in which these venues are located<br>
    1). Retrieve New York neighborhood data from Wikipedia and use Geopy to find the latitude and longitude of each neighborhood.<br>
    2). Calculate the distance from each venue in step 1 to each NY neighborhood. Assign the neighborhood with the minimum distance to the venue.<br>
    3). List the top neighborhoods as the target neighborhoods and the corresponding boroughs as the target boroughs.<br>
3. Use K-means clustering to find similar neighborhoods <br>
    1). For each target borough, count the number of venues of each category.<br>
    2). Use K-means to cluster neighborhoods with similar categories of venues in each target borough.<br>
    3). Filter the clustering results by the target neighborhoods. Only neighborhoods with the same cluster labels as the target neighborhoods are listed.<br>
4. Generate a descriptive summary of these similar neighborhoods<br>
    1). For each of the listed neighborhoods, a Foursquare venue search will be performed to collect information on venues of the type designated at the beginning of the analysis.<br>
    2). All listed neighborhoods will be marked on a Folium map. Colors are used to denote cluster labels.<br>
    3). In the pop-up of each neighborhood marker, the following information will be displayed as a brief summary:<br>
        a. Name: Borough.Neighborhood<br>
        b. Top 3 venue types<br>
        c. Number of designated venues<br>
        d. Average Ratings / Average likes_count<br>
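
Step 2 assigns each venue to its nearest neighborhood. The notebook code compares squared differences of latitude and longitude in degrees; the sketch below swaps in the haversine great-circle distance instead, which accounts for a degree of longitude being shorter than a degree of latitude at New York's latitude. The helper names and the coordinates are assumptions for illustration, not the geocoded notebook data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_neighborhood(venue, neighborhoods):
    """Return the name of the neighborhood whose center is closest to the venue."""
    return min(neighborhoods, key=lambda name: haversine_km(*venue, *neighborhoods[name]))


# Approximate, illustrative coordinates (assumptions, not the notebook's data).
neighborhoods = {
    "Chelsea.Manhattan": (40.7465, -74.0014),
    "Williamsburg.Brooklyn": (40.7081, -73.9571),
    "Astoria.Queens": (40.7644, -73.9235),
}
venue = (40.7420, -74.0048)

closest = nearest_neighborhood(venue, neighborhoods)
print(closest)  # Chelsea.Manhattan
```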

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation


!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transform JSON into a pandas dataframe
# (json_normalize used to live in pandas.io.json; newer pandas exposes it at top level)
from pandas import json_normalize

! pip install folium==0.5.0
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


In [2]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: 4OH1RDZOGTRG1M3F3LUP0ATVNRL5A42QUFFRRTOZ43PPNCEC
CLIENT_SECRET:OGKCVYGEQAJ4ZUGZIER4ZJMR01OL2DI0IDLAQNVLEJ4STQQB


## Step 1: Find the top best rated venues of a given type in a city
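
The search function in this step assembles its request URL with string formatting, which leaves the query text unencoded. As a hedged sketch, the same URL can be built with the standard library's `urlencode`, so that a category containing spaces (for example "coffee shop") is percent-encoded. The endpoint and parameter names are taken from the notebook code; `build_search_url` and the credential strings are hypothetical placeholders.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, lat, lng, query, radius,
                     version="20180604", limit=20):
    """Build a venue-search URL with every parameter value percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and coordinates, for illustration only.
url = build_search_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET",
                       40.7128, -74.006, "coffee shop", 10000)
print(url)
```

Passing the same dictionary to `requests.get(SEARCH_ENDPOINT, params=params)` would have an equivalent effect, since `requests` encodes the values itself.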

In [3]:
def get_nearby_venues(category, radius, location):
    '''
    Return a one-column DataFrame ('id') with the Foursquare ids of venues
    matching `category` within `radius` meters of the geocoded `location`.
    '''
        
    VERSION = '20180604'
    LIMIT = 20
    
    try:
        from geopy.geocoders import Nominatim
    except ImportError:
        # install on demand (IPython shell escape), then retry the import
        !pip install geopy
        from geopy.geocoders import Nominatim
    
    
    geolocator = Nominatim(user_agent="foursquare_agent")
    loc = geolocator.geocode(location)
    latitude = loc.latitude
    longitude = loc.longitude

    search_query = category


    url_search = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
    results = requests.get(url_search).json()
    venues = results['response']['venues']
    dataframe = json_normalize(venues)

    filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
    # copy so that the column assignment below does not write into a view of `dataframe`
    dataframe_filtered = dataframe.loc[:, filtered_columns].copy()

    # function that extracts the category of the venue
    def get_category_type(row):
        try:
            categories_list = row['categories']
        except KeyError:  # fall back to the prefixed column name
            categories_list = row['venue.categories']

        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    # filter the category for each row
    dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

    # clean column names by keeping only last term
    dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
    
    return dataframe_filtered[['id']]

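def get_venue_info(venue_id):
    '''
    Return [id, name, category, lat, lng, postalCode, rating, likes_count,
    ratingSignals, photos_count] for one venue from the Foursquare venue-details endpoint.
    ACCESS_TOKEN and VERSION must exist as module-level globals; no cell shown here defines them at that level.
    '''
    url_explore = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&oauth_token={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET,ACCESS_TOKEN, VERSION)
    result_explore = requests.get(url_explore).json()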
    
    cat = result_explore['response']['venue']
    
    def get_category_type(row):
        try:
            categories_list = row['categories']
        except KeyError:  # fall back to the prefixed key
            categories_list = row['venue.categories']
        
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']
        
    cat = get_category_type(cat)
    
    # optional fields fall back to a default when the response omits them
    venue = result_explore['response']['venue']
    zipcode = venue.get('location', {}).get('postalCode')
    rating = venue.get('rating', 0)
    likes_count = venue.get('likes', {}).get('count', 0)
    ratingSignals = venue.get('ratingSignals', 0)
    photo_count = venue.get('photos', {}).get('count', 0)

    results = [
    result_explore['response']['venue']['id'],
    result_explore['response']['venue']['name'], 
    cat,
    result_explore['response']['venue']['location']['lat'], 
    result_explore['response']['venue']['location']['lng'], 
    zipcode,
    rating,
    likes_count,
    ratingSignals,
    photo_count
    ]

    return results

def get_nearby_ratings(category, radius, location):
    '''
    return ['id','name','category','lat', 'lng', 'postalCode','rating', 'likes_count', 'ratingSignals', 'photos_count']
    of nearby venues
    
    '''
    
    venue_ids = get_nearby_venues(category = category, radius=radius, location=location)
    df = pd.DataFrame(columns=['id','name','category','lat', 'lng', 'postalCode','rating', 'likes_count', 'ratingSignals', 'photos_count'])

    for v_id in venue_ids['id']:
        df.loc[len(df)] = get_venue_info(v_id)

    # cast each column from its own values
    df['rating'] = df['rating'].astype(float)
    df['likes_count'] = df['likes_count'].astype(int)
    df['ratingSignals'] = df['ratingSignals'].astype(int)
    df['photos_count'] = df['photos_count'].astype(int)
    
    print("[1]Done! Get rating data for nearby venues at {}".format(location))
    return df


## Step 2: List the neighborhoods in which these venues are located

In [10]:
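def get_NY_data():
    '''
    Scrape New York neighborhood names from Wikipedia, geocode them with Nominatim and
    return a DataFrame with Borough, Neighbor, Latitude, Longtitude and Neighbor.Borough columns.
    '''
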
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City').text, 'html.parser')

    table_contents = soup.find('table')

    def containsNumber(value):
        return any([char.isdigit() for char in value])

    NY_neighbors = []

    for row in table_contents.find_all('td'):
        for r in row.find_all('a'):
            out = r.get('title')
            # skip links without a title and titles that contain digits
            if out and not containsNumber(out):
                NY_neighbors.append(out)


    neighbors = []
    borough = []
    for r in NY_neighbors:
        n = r.split(', ')[0]
        if '(' in n:
            n = n.split('(')[0]
        try:
            b = r.split(', ')[1]
        except IndexError:
            # title has no ', Borough' part: reuse the borough of the previous entry
            b = borough[-1]
        neighbors.append(n)
        borough.append(b)

    geolocator = Nominatim(user_agent='foursquare_agent')
    lats = []
    lngs = []
    for i in range(len(neighbors)):
        loc = neighbors[i] + ', ' + borough[i]
        try:
            gc = geolocator.geocode(loc)
            lats.append(gc.latitude)
            lngs.append(gc.longitude)
        except Exception:
            # geocoding failed or returned no match; flag with 0 so the row is dropped later
            lats.append(0)
            lngs.append(0)

    NY_neighbor_data = pd.DataFrame({
        'Borough': borough,
        'Neighbor': neighbors,
        'Latitude': lats,
        'Longtitude':lngs
    })
    # drop neighborhoods that failed geocoding (flagged with latitude 0), de-duplicate and renumber the rows
    NY_neighbor_data = NY_neighbor_data.loc[NY_neighbor_data['Latitude']!=0].drop_duplicates().reset_index(drop=True)
    NY_neighbor_data['Neighbor.Borough'] = NY_neighbor_data['Neighbor']+'.'+NY_neighbor_data['Borough']
    print("Done! Get NY Data")
    
    
    return NY_neighbor_data

def venue_neighborhoods(venue_df, neighborhood_df=None):
    if neighborhood_df is None:
        print('neighborhood_df is None, grabbing data...')
        neighborhood_df = get_NY_data()
        print('Done!')
    vn = list()
    nb_list = neighborhood_df['Neighbor.Borough']
    for _, v in venue_df.iterrows():
        # squared difference in degrees from this venue to every neighborhood center
        distance_sq = ((neighborhood_df['Latitude'] - v['lat'])**2
                       + (neighborhood_df['Longtitude'] - v['lng'])**2)
        # idxmin returns the index label of the closest neighborhood,
        # so the lookup stays correct even when the index is not 0..n-1
        vn.append(nb_list.loc[distance_sq.idxmin()])
    venue_df['Neighborhood'] = vn
    
    venue_sorted = venue_df.sort_values(by=['rating', 'likes_count'], ascending=False)
    # split 'Neighbor.Borough' on the last dot so that names containing a dot stay intact
    split_names = venue_sorted['Neighborhood'].str.rsplit('.', n=1)
    venue_sorted['Borough'] = split_names.str[1]
    venue_sorted['Neighborhood'] = split_names.str[0]
    
    print('[2]Done! Get information from best rated venues.')

    return venue_sorted
        
    
        

## Step 3: Use K-means clustering, find similar neighborhoods
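
The clustering in this step takes one row per neighborhood, holding the share of nearby venues that fall into each category (one-hot encoding followed by a group mean). As a rough illustration of what those rows contain, the sketch below computes the same shares with the standard library only. The `(neighborhood, category)` pairs are made up, and `category_profiles` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def category_profiles(venues):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for neighborhood, c in counts.items()
    }


# Made-up (neighborhood, venue category) pairs for illustration.
venues = [
    ("N1", "Cafe"), ("N1", "Cafe"), ("N1", "Bar"), ("N1", "Park"),
    ("N2", "Gym"), ("N2", "Cafe"),
]

profiles = category_profiles(venues)
print(profiles["N1"])  # {'Cafe': 0.5, 'Bar': 0.25, 'Park': 0.25}
```

Neighborhoods whose share vectors are close to each other end up in the same K-means cluster.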

In [16]:
def getNearbyVenues(borough, radius, NY_data=None):
    if NY_data is None:
        print('Getting NY data for getNearbyVenues...')
        df = get_NY_data()
        print('Finished!')
    else:
        df = NY_data
    
    selected_df = df[df['Borough']==borough]
    venues_list=[]
    for name, lat, lng in zip(selected_df['Neighbor'], selected_df['Latitude'], selected_df['Longtitude']):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


def get_venue_rank(borough, radius=1000, NY_data=None):
    b_venues = getNearbyVenues(borough=borough,radius=radius,NY_data=NY_data)
    b_onehot = pd.get_dummies(b_venues[['Venue Category']], prefix="", prefix_sep="")

    # add neighborhood column back to dataframe
    b_onehot['Neighborhood'] = b_venues['Neighborhood'] 
    # move neighborhood column to the first column
    fixed_columns = list(b_onehot.columns[b_onehot.columns=='Neighborhood']) + list(b_onehot.columns[b_onehot.columns!='Neighborhood'])
    b_onehot = b_onehot[fixed_columns]

    b_grouped = b_onehot.groupby(by='Neighborhood').mean().reset_index()


    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['Neighborhood']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            # ranks beyond 3rd use the 'th' suffix
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
    neighborhoods_venues_sorted['Neighborhood'] = b_grouped['Neighborhood']

    for ind in np.arange(b_grouped.shape[0]):
        neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(b_grouped.iloc[ind, :], num_top_venues)
        
    return neighborhoods_venues_sorted


def neighbor_clustering(borough, radius,k_clustering, NY_data=None):
    dataframe = getNearbyVenues(borough=borough, radius=radius, NY_data=NY_data)
    
    # select the numeric coordinate columns before averaging; the text columns cannot be averaged
    neighbor_data = dataframe.groupby(by='Neighborhood')[['Neighborhood Latitude', 'Neighborhood Longitude']].mean().reset_index()
    
    df_onehot = pd.get_dummies(dataframe[['Venue Category']], prefix='', prefix_sep='')
    df_onehot['Neighborhood']=dataframe["Neighborhood"]
    
    df_grouped = df_onehot.groupby(by="Neighborhood").mean().reset_index()
    
    df_ranks = get_venue_rank(borough=borough, radius=radius, NY_data=NY_data)
    
    from sklearn.cluster import KMeans
    df_cluster = df_grouped.drop(columns='Neighborhood')
    kmeans = KMeans(n_clusters=k_clustering).fit(df_cluster)
    
    df_ranks.insert(0,'Cluster Labels', kmeans.labels_)
    df_return = neighbor_data
    df_return = df_return.join(df_ranks.set_index('Neighborhood'), on='Neighborhood')
    
    print('[3]Done! K-means clustering for {}'.format(borough))
    return df_return
    
    

def target_neighbors(category ,city_radius, neighbor_radius,top_n, k_clustering=5, location="New York"):
    """
    Only New York is supported for location parameter at this point
    """
    NY_df = get_NY_data()
    venue_df = get_nearby_ratings(category = category, radius = city_radius, location=location)
    
    venue_sorted = venue_neighborhoods(venue_df=venue_df, neighborhood_df=NY_df)
    boroughs = venue_sorted['Borough']
    target_boroughs = list()
    for b in boroughs:
        if b not in target_boroughs:
            target_boroughs.append(b)
            if len(target_boroughs) >= top_n:
                break
                
    
    out = dict()
    for borough in target_boroughs:
        all_neighbors = venue_sorted[venue_sorted['Borough']==borough]['Neighborhood']
        target_neighbors = list()
        for n in all_neighbors:
            if n not in target_neighbors:
                target_neighbors.append(n)
                if len(target_neighbors) > 2:
                    break

        neighbor_cluster = neighbor_clustering(borough=borough, radius=neighbor_radius, k_clustering=k_clustering, NY_data=NY_df)
        target_labels = neighbor_cluster[neighbor_cluster['Neighborhood'].isin(target_neighbors)]['Cluster Labels']
        targets = neighbor_cluster[neighbor_cluster['Cluster Labels'].isin(target_labels)].set_index('Neighborhood')
        # summarize the best-rated venues that fall in the target neighborhoods;
        # columns are selected before aggregating so text columns are not averaged
        target_venues = venue_sorted[venue_sorted['Neighborhood'].isin(target_neighbors)]
        venue_summary = target_venues.astype({'rating': float}).groupby(by='Neighborhood')[['rating','likes_count']].mean()
        venue_counts = target_venues.groupby(by='Neighborhood')[['name']].count()
        targets = targets.join(venue_summary).join(venue_counts)
        targets = targets.rename(columns = {'name':'counts'})
        targets.reset_index(inplace=True)
        out[borough] = targets

    return out

In [61]:
out = target_neighbors(category='cafe', city_radius = 10000, neighbor_radius = 5000,top_n=3)

Done! Get NY Data
[1]Done! Get rating data for nearby venues at New York
[2]Done! Get information from best rated venues.
[3]Done! K-means clustering for Manhattan
[3]Done! K-means clustering for Brooklyn


## Create Visualization

In [62]:
def visualize(input_df=out, k_clusters = 5):
    from matplotlib import cm, colors

    geolocator = Nominatim(user_agent="foursquare_agent")
    loc = geolocator.geocode('New York')
    latitude = loc.latitude
    longitude = loc.longitude

    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

    # set color scheme for the clusters
    # one block of 7 colors per borough, so cluster labels 0..6 are supported
    colors_array = cm.rainbow(np.linspace(0, 1, 7*len(input_df)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    c = 0
    for borough in input_df:
        viz_df = input_df[borough].fillna(0)
        rainbow_range = rainbow[c*7:(c+1)*7]
        c+=1
        for lat, lon, poi, cluster, rating, likes_count, counts in zip(viz_df['Neighborhood Latitude'], viz_df['Neighborhood Longitude'], viz_df['Neighborhood'], viz_df['Cluster Labels'], viz_df['rating'], viz_df['likes_count'], viz_df['counts']):

            first_3 = viz_df[viz_df['Neighborhood']==poi].iloc[:,4:7].values.tolist()[0]
            first_3_label = ", ".join(first_3)

            label = "Neighborhood: {}.{}; Top 3 venue types: {}; Count of selected venue type: {}; Average rating/likes_count: {}/{}".format(borough, poi,first_3_label, counts, rating, likes_count)
            #folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
            folium.CircleMarker(
                [lat, lon],
                radius=5,
                popup=str(label),
                # border and fill both use this borough's color block
                color=rainbow_range[int(cluster)],
                fill=True,
                fill_color=rainbow_range[int(cluster)],
                fill_opacity=0.7).add_to(map_clusters)

    return map_clusters
visualize()

## Results <a id="results"></a>

The results illustrate the workflow with "cafe" as the designated venue type. Two boroughs, Manhattan and Brooklyn, are identified as the best boroughs in which to start a cafe business. Two clusters are identified within each borough and marked with distinct colors. The pop-up label of each marker shows the expected information: the borough and name of the neighborhood, the top 3 venue types around it, and the count, average rating and average likes_count of the designated venue type within range of the neighborhood. As expected, neighborhoods with the same cluster label show similar top 3 venue types. Some neighborhoods show counts and ratings of 0; they could be filtered out as an alternative presentation, but this is not recommended from a business perspective. This situation and further implications of the results are discussed in the next section.

## Conclusion and Discussion <a id="conclusion"></a>

In this section, the following topics will be discussed.
1. Business implications of the results.
2. Situations with 0 counts.
3. Parameters that can be tuned for better results.



### Business implications of the results

As shown on the map, the program recommends numerous neighborhoods, grouped into 4 clusters across 2 boroughs, in which to start a cafe business. Such a recommendation is still general, and additional factors such as population, traffic and rent need to be considered before making a final decision. For a cafe, the style of the store and the target market should also be part of the decision process. That being said, rather than serving as a "where to start" recommendation, the clustering analysis may serve better as a "where not to start" recommendation, because it narrows the search by ruling out neighborhoods that are dissimilar to those of the best-rated venues.

### Situations with 0 counts

Another issue with the analysis is that certain neighborhoods show a count of 0 for the designated venue type, i.e. cafe. A few factors could contribute to this. First, "cafe" is a narrower term than "restaurant", so searching for restaurants could reduce the chance of a 0 count. Another factor is the search radius. The illustration currently uses a radius of 5000 meters to search for venues around a neighborhood. The optimal radius varies from borough to borough: 5000 meters could be relatively large in Manhattan but insufficient in Brooklyn. Optimizing the radius per borough is therefore a possible next step for this project. Nevertheless, neighborhoods with a 0 count can still be valuable, because they may represent an untapped opportunity for starting the business. For this reason, neighborhoods with 0 counts are kept in the results.
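
A per-borough radius could be as simple as a lookup table with a fallback. The sketch below is one possible shape under that assumption; the override values are placeholders rather than tuned results, and `radius_for` is a hypothetical helper that the notebook does not define.

```python
DEFAULT_RADIUS_M = 5000  # radius used in the illustration

# Hypothetical per-borough overrides; the values are placeholders, not tuned results.
BOROUGH_RADIUS_M = {
    "Manhattan": 2000,
    "Brooklyn": 6000,
}


def radius_for(borough):
    """Return the search radius in meters for a borough, falling back to the default."""
    return BOROUGH_RADIUS_M.get(borough, DEFAULT_RADIUS_M)


print(radius_for("Manhattan"), radius_for("Queens"))  # 2000 5000
```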

### Parameters to be tuned
A set of parameters can be tuned in this analysis. One, mentioned above, is the venue type: switching the venue category between synonyms can potentially yield different results. Two radii are used in the analysis. The first is the city radius (`city_radius`), used to search for the best-rated venues within the city; the illustration uses 10,000 meters for New York City. The second is the neighborhood radius (`neighbor_radius`), used to search for venues around each neighborhood. The next tunable parameter is `top_n`, intended as the top $n$ venues within the city from which the target neighborhoods and boroughs are listed; in the code above it caps the number of distinct target boroughs. Finally, the number of clusters (`k_clustering`) can be tuned. Tuning these parameters yields different results to suit different uses of the program.
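
One way to explore these parameters systematically is to enumerate their combinations up front. The sketch below only builds the keyword-argument sets and makes no API calls; the candidate values are placeholders chosen for illustration.

```python
from itertools import product

# Candidate values are placeholders chosen for illustration.
grid = {
    "category": ["cafe", "coffee shop"],
    "neighbor_radius": [2000, 5000],
    "k_clustering": [3, 5],
}

# One keyword-argument dict per combination of values.
runs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(runs))  # 8
print(runs[0])    # {'category': 'cafe', 'neighbor_radius': 2000, 'k_clustering': 3}
```

Each dictionary could then be passed on as `target_neighbors(city_radius=10000, top_n=3, **run)`, at the cost of one full round of Foursquare requests per run.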