# Chinese Restaurant Clustering

## Obtain and Preprocess Data
The BeautifulSoup library is used to scrape three tables: the NYS Population Density table, the NY Health Neighborhood table, and the NYS Chinese Population by ZIP Code table. The first gives the population density for each ZIP Code. The second lists the ZIP Codes and neighborhoods within each borough. The third gives the total population and the percentage of Chinese residents for each ZIP Code.

Geocoder is used to look up the coordinates of each neighborhood. These coordinates are needed both to search for venues with Foursquare and to plot the neighborhoods on a Folium map.

        Table Links:
        http://www.usa.com/rank/new-york-state--population-density--zip-code-rank.htm
        https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm
        http://zipatlas.com/us/ny/zip-code-comparison/percentage-chinese-population{}.htm

        Useful Links:
        http://beautiful-soup-4.readthedocs.io/en/latest/
        https://datatofish.com/create-pandas-dataframe/
        https://geocoder.readthedocs.io/providers/ArcGIS.html

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import geocoder
import numpy as np

### Table 1:  NYS Population Density Table
This dataframe, densityDf, provides the population density for every zip code in NY.

In [2]:
# scrape site
website='http://www.usa.com/rank/new-york-state--population-density--zip-code-rank.htm'
file=urllib.request.urlopen(website)
htmlContent=file.read().decode('utf8')
file.close()
soup=BeautifulSoup(htmlContent,'html.parser')

# get all tables
tables=soup.findAll('table')
table=tables[1]

# initialize dictionary (will be transformed into dataframe)
densityDict={}
rows=table.findAll('tr')
for columnName in rows[0].findAll(['th','td']):  # header cells only; skips stray whitespace nodes
    densityDict[columnName.text]=[]

# delete first row (handled in prev step)
del rows[0]

# add entries to dictionary
for row in rows:
    entries=row.findAll('td');
    for key,entry in zip(densityDict.keys(),entries):
        # store everything before the '/'
        entryText=entry.text.split('/')[0]
        densityDict[key].append(entryText)       

# turn into pandas dataframe
densityDf=pd.DataFrame(densityDict,columns=list(densityDict))

# drop rank(first column), modify column names
densityDf.drop(densityDf.columns[0], axis=1, inplace=True)
densityDf.rename(columns={densityDf.columns[0]:'Density (persons / sq. mi)', densityDf.columns[1]:'ZIP Codes'},inplace=True)

# change strings into numerical types (float/int)
densityDf[densityDf.columns[0]]=densityDf[densityDf.columns[0]].replace(',','',regex=True).astype(float)
densityDf[densityDf.columns[1]]=densityDf[densityDf.columns[1]].astype(int)

# reverse column order
densityDf=densityDf.iloc[:,::-1]

densityDf.head()    

Unnamed: 0,ZIP Codes,Density (persons / sq. mi)
0,10028,146955.3
1,10128,132677.4
2,10075,132095.7
3,10025,129548.9
4,10023,123875.9
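
The cleaning steps in the cell above (keep the text before the '/', strip the thousands separators, cast to float) can be sketched on a single cell value. The sample string is an assumption about how the site formats a density cell, and `parse_density` is a hypothetical helper that is not part of the notebook.

```python
def parse_density(cell_text):
    """Return the numeric density from one scraped table cell.

    Assumes the cell looks like '146,955.3/sq mi' (hypothetical sample).
    """
    # keep everything before the first '/'
    number_part = cell_text.split('/')[0]
    # drop thousands separators and surrounding whitespace, then convert
    return float(number_part.replace(',', '').strip())


print(parse_density('146,955.3/sq mi'))
```

Doing the conversion per cell makes it easy to spot which value fails to parse; the notebook instead converts the whole column at once with `replace(...).astype(float)`.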


### Table 2: NY Health Neighborhood Table
This dataframe, neighborhoodDf, holds the NYC neighborhood data. It groups ZIP Codes by neighborhood and neighborhoods by borough.

In [3]:
website='https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm'
file=urllib.request.urlopen(website);
htmlContent=file.read().decode('utf8')
file.close()
neighSoup=BeautifulSoup(htmlContent,'html.parser')

table=neighSoup.table


# initialize dictionary with keys 
neighborhoodDict={}
columnNames=table.findAll('th')
for columnName in columnNames:
    neighborhoodDict[columnName.text.strip()]=[]

# store every row entry, skipping empty rows (no 'td' tag in a row)
rows=[row for row in table.findAll('tr') if row.findAll('td')]


# store data into dictionary
prevLoc="";
for row in rows:
    rowElements=row.findAll('td')

    # if borough name not present, concat the last borough name to the list of row elements
    if(len(rowElements)==3):
        prevLoc=rowElements[0]
    else:
        rowElements=[prevLoc]+rowElements


    for entry, col in zip(rowElements, list(neighborhoodDict)):
        neighborhoodDict[col].append(entry.text.strip())

# pandas dataframe
neighborhoodDf=pd.DataFrame(neighborhoodDict,columns=list(neighborhoodDict))
neighborhoodDf.head()

Unnamed: 0,Borough,Neighborhood,ZIP Codes
0,Bronx,Central Bronx,"10453, 10457, 10460"
1,Bronx,Bronx Park and Fordham,"10458, 10467, 10468"
2,Bronx,High Bridge and Morrisania,"10451, 10452, 10456"
3,Bronx,Hunts Point and Mott Haven,"10454, 10455, 10459, 10474"
4,Bronx,Kingsbridge and Riverdale,"10463, 10471"
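
In the source table the borough cell appears only on the first row of each borough, so shorter rows have to inherit the most recent borough. A minimal sketch of that carry-forward step on plain lists of cell text follows; `fill_borough` is a hypothetical helper, and the sample rows reuse values from the output above.

```python
def fill_borough(rows):
    """Prepend the most recently seen borough to rows that lack one."""
    filled = []
    last_borough = None
    for cells in rows:
        if len(cells) == 3:
            # full row: remember its borough for the rows that follow
            last_borough = cells[0]
            filled.append(list(cells))
        else:
            # short row: reuse the borough seen most recently
            filled.append([last_borough] + list(cells))
    return filled


# rows as they might come out of the scraper (text only)
sample_rows = [
    ['Bronx', 'Central Bronx', '10453, 10457, 10460'],
    ['Bronx Park and Fordham', '10458, 10467, 10468'],
    ['High Bridge and Morrisania', '10451, 10452, 10456'],
]
for row in fill_borough(sample_rows):
    print(row)
```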


In [4]:
# change ZIP Codes from str to list of int
# (apply avoids chained assignment, which can trigger SettingWithCopyWarning)
neighborhoodDf['ZIP Codes']=neighborhoodDf['ZIP Codes'].apply(
    lambda zipString: [int(z) for z in zipString.replace(' ','').split(',')])
neighborhoodDf.head()

Unnamed: 0,Borough,Neighborhood,ZIP Codes
0,Bronx,Central Bronx,"[10453, 10457, 10460]"
1,Bronx,Bronx Park and Fordham,"[10458, 10467, 10468]"
2,Bronx,High Bridge and Morrisania,"[10451, 10452, 10456]"
3,Bronx,Hunts Point and Mott Haven,"[10454, 10455, 10459, 10474]"
4,Bronx,Kingsbridge and Riverdale,"[10463, 10471]"


### Table 3: NYS Chinese Population by ZIP Code 
This dataframe, chinesePopulationDf, holds the population and the percentage of Chinese residents for each ZIP Code in NYS.

In [5]:
chinesePopulationDict={}
chinesePopulationDict['ZIP Codes']=[]
chinesePopulationDict['Population']=[]
chinesePopulationDict['% Chinese']=[]


pages=["",".2",".3",".4",".5",".6"]
for page in pages:

    website='http://zipatlas.com/us/ny/zip-code-comparison/percentage-chinese-population{}.htm'.format(page)
    file=urllib.request.urlopen(website);
    htmlContent=file.read().decode('utf8')
    file.close()

    # find the table
    soup=BeautifulSoup(htmlContent,'html.parser')
    tables=soup.find_all('table')
    innerTables=tables[4].find_all('table')
    tables=innerTables[5]
    innerTables=tables.find_all('table')
    table=innerTables[1]
    
    # store values
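    values=table.findAll('td')
    # each data row spans 7 cells; i is the ZIP Code cell, with population at i+3
    # and % Chinese at i+4 (cells 8,11,12, then 15,18,19, ...)
    for i in range(8,len(values)):
        if(i%7==1):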
            chinesePopulationDict['ZIP Codes'].append(values[i].text)
            chinesePopulationDict['Population'].append(values[i+3].text)
            chinesePopulationDict['% Chinese'].append(values[i+4].text)

# transform into df
chinesePopulationDf=pd.DataFrame(chinesePopulationDict,columns=list(chinesePopulationDict))

# modify data types
chinesePopulationDf['ZIP Codes']=chinesePopulationDf['ZIP Codes'].astype(int)
chinesePopulationDf['Population']=chinesePopulationDf['Population'].replace(r'[\$,]', '', regex=True).astype(int)
chinesePopulationDf['% Chinese']=chinesePopulationDf['% Chinese'].replace(r'[,%]', '', regex=True).astype(float)

chinesePopulationDf

Unnamed: 0,ZIP Codes,Population,% Chinese
0,10002,84870,46.63
1,10013,25042,42.80
2,11355,83281,29.12
3,10048,55,29.09
4,10038,15574,27.20
...,...,...,...
595,12849,445,0.22
596,11786,5883,0.22
597,11742,12119,0.22
598,11942,3981,0.22


### Latitude and Longitude for Each Neighborhood
To use the neighborhood dataframe with the Foursquare API or Folium, we need a latitude and longitude for each neighborhood. Since a neighborhood is a list of ZIP Codes, the mean latitude and longitude of its ZIP Codes serve as the neighborhood's center.

In [6]:
# Add latitude and longitude for each neighborhood (average for each list of zipcode is calculated and stored)
latitude=[]
longitude=[]

for row in range(0,len(neighborhoodDf.index)):
    localLatitude=[]
    localLongitude=[]
    zipcodeList=neighborhoodDf['ZIP Codes'][row]
    for zipcode in zipcodeList:
        # geocode once per ZIP Code and reuse the result
        latlng=geocoder.arcgis('{},New York'.format(zipcode)).latlng
        localLatitude.append(latlng[0])
        localLongitude.append(latlng[1])
    latitude.append(np.array(localLatitude).mean())
    longitude.append(np.array(localLongitude).mean())
    
neighborhoodDf['Latitude']=latitude # add the coordinate columns to the dataframe
neighborhoodDf['Longitude']=longitude
neighborhoodDf.head()
# pd.set_option('display.max_rows', neighborhood.shape[0]+1) 
# neighborhood

Unnamed: 0,Borough,Neighborhood,ZIP Codes,Latitude,Longitude
0,Bronx,Central Bronx,"[10453, 10457, 10460]",40.84744,-73.897128
1,Bronx,Bronx Park and Fordham,"[10458, 10467, 10468]",40.86705,-73.88607
2,Bronx,High Bridge and Morrisania,"[10451, 10452, 10456]",40.829625,-73.919533
3,Bronx,Hunts Point and Mott Haven,"[10454, 10455, 10459, 10474]",40.815615,-73.902221
4,Bronx,Kingsbridge and Riverdale,"[10463, 10471]",40.89335,-73.90391
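
The center calculation is just a mean over the coordinates of a neighborhood's ZIP Codes. A small sketch with made-up coordinates standing in for geocoder results follows; `neighborhood_center` is a hypothetical helper. A plain arithmetic mean is a reasonable approximation here only because the ZIP Codes within one neighborhood lie close together.

```python
from statistics import mean


def neighborhood_center(zip_coords):
    """Return the (latitude, longitude) mean of a list of (lat, lng) pairs."""
    lats = [lat for lat, lng in zip_coords]
    lngs = [lng for lat, lng in zip_coords]
    return mean(lats), mean(lngs)


# made-up coordinates, one pair per ZIP Code in a neighborhood
coords = [(40.85, -73.91), (40.84, -73.90), (40.86, -73.89)]
center = neighborhood_center(coords)
print(center)
```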


## Merging Dataframes 
The goal is to merge all three dataframes into one dataframe called neighborhood. neighborhoodDf (Table 2) is keyed by neighborhood, while densityDf (Table 1) and chinesePopulationDf (Table 3) are keyed by ZIP Code, so the latter two must be aggregated to the neighborhood level using sums or averages. Any rows that contain NaN are dropped.

In [7]:
# Merge chinesePopulationDf with DensityDf
populationDensityDf=pd.merge(chinesePopulationDf,densityDf, on='ZIP Codes', how='left')
populationDensityDf.head()

Unnamed: 0,ZIP Codes,Population,% Chinese,Density (persons / sq. mi)
0,10002,84870,46.63,90849.3
1,10013,25042,42.8,48772.7
2,11355,83281,29.12,48298.2
3,10048,55,29.09,
4,10038,15574,27.2,70556.6


In [8]:
# Merge populationDensityDf with neighborhoodDf 
neighborhood=neighborhoodDf.copy(deep=True)

totalPopulation=[]
percentChinese=[]
meanDensity=[]

for row in range(0,len(neighborhoodDf.index)):
    population=[]
    chinese=[]
    density=[]
    zipCodeList=neighborhoodDf['ZIP Codes'][row]
    for zipcode in zipCodeList:
        # look the ZIP Code up once and reuse the matching rows
        match=populationDensityDf.loc[populationDensityDf['ZIP Codes']==zipcode]
        if(len(match)>0):
            peopleCount=match['Population'].values[0]
            chinesePercent=match['% Chinese'].values[0]
            dense=match['Density (persons / sq. mi)'].values[0]
            
            population.append(peopleCount)
            chinese.append(peopleCount*chinesePercent) # population-weighted percentage (still in percent units)
            density.append(dense)

    totalPopulation.append(np.array(population).sum())
    meanDensity.append(np.array(density).mean())
    if(len(population)!=0):
        percentChinese.append(np.array(chinese).sum()/np.array(population).sum()) #*100
    else:
        percentChinese.append(0)
neighborhood['Population']=totalPopulation
neighborhood['% Chinese']=percentChinese
neighborhood['Mean Density (persons / sq. mi)']=meanDensity

neighborhood.dropna(axis=0,inplace=True)
neighborhood.head()


Unnamed: 0,Borough,Neighborhood,ZIP Codes,Latitude,Longitude,Population,% Chinese,Mean Density (persons / sq. mi)
1,Bronx,Bronx Park and Fordham,"[10458, 10467, 10468]",40.86705,-73.88607,250491.0,0.508621,61568.766667
2,Bronx,High Bridge and Morrisania,"[10451, 10452, 10456]",40.829625,-73.919533,40961.0,0.47,47063.6
4,Bronx,Kingsbridge and Riverdale,"[10463, 10471]",40.89335,-73.90391,88989.0,1.264727,26775.8
5,Bronx,Northeast Bronx,"[10466, 10469, 10470, 10475]",40.885696,-73.849714,79125.0,0.494188,19800.65
6,Bronx,Southeast Bronx,"[10461, 10462, 10464, 10465, 10472, 10473]",40.84026,-73.841711,290052.0,0.992098,30274.016667
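
The neighborhood-level "% Chinese" is a population-weighted average of the ZIP Code percentages, not a simple mean. A worked sketch with invented numbers follows; `weighted_percent` is a hypothetical helper that mirrors the arithmetic in the cell above.

```python
def weighted_percent(populations, percents):
    """Population-weighted average of per-ZIP percentages (result in percent)."""
    total = sum(populations)
    if total == 0:
        # no matching ZIP Codes: fall back to 0, as the notebook does
        return 0.0
    weighted_sum = sum(pop * pct for pop, pct in zip(populations, percents))
    return weighted_sum / total


# invented example: 1000 people at 10% and 3000 people at 2%
# (1000*10 + 3000*2) / 4000 = 4.0
print(weighted_percent([1000, 3000], [10.0, 2.0]))
```

A simple mean of 10% and 2% would give 6%, which overstates the share because the smaller ZIP Code carries the higher percentage.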


In [9]:
# number of neighborhoods, which determines how many Foursquare calls are needed
neighborhood.shape

(40, 8)

## Find Chinese Restaurant Venue Data Using Foursquare
Since Foursquare allows personal accounts only 500 premium calls a day, only about 400 restaurant ratings will be evaluated. For each neighborhood, up to 50 Chinese restaurant IDs within a 2000 m radius are returned. Of those, up to 10 are randomly selected for a detailed lookup. The resulting dataframe contains the neighborhood names along with the mean and variance of the rating and like count in each area.
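
The "about 400" figure follows from the 40 neighborhoods found above and a sample of at most 10 venues each. The sketch below spells out that budget and the capped random sample, assuming that only the per-venue detail lookups count against the premium quota; `pick_venues` and the constant names are hypothetical.

```python
import random

NEIGHBORHOODS = 40     # rows in the merged neighborhood dataframe
SAMPLE_SIZE = 10       # detail lookups per neighborhood, at most
PREMIUM_LIMIT = 500    # daily premium calls for a personal account

# worst case: every neighborhood has at least SAMPLE_SIZE venues
max_premium_calls = NEIGHBORHOODS * SAMPLE_SIZE


def pick_venues(venue_ids, k=SAMPLE_SIZE, seed=None):
    """Randomly pick up to k venue ids without replacement."""
    rng = random.Random(seed)
    return rng.sample(venue_ids, min(k, len(venue_ids)))


print(max_premium_calls, max_premium_calls <= PREMIUM_LIMIT)
print(len(pick_venues(['a', 'b', 'c'])))
```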

In [10]:
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import json
import os
import random
import requests

In [11]:
CLIENT_ID = os.environ.get('FOURSQUARE_KEY')
CLIENT_SECRET = os.environ.get('FOURSQUARE_SECRET')
VERSION = '20200101'

In [12]:
def getNearbyVenueIds(neighborhoods, latitudes, longitudes, radius=2000, LIMIT=50):    
    
    venue_id_list=[] # list of [neighborhood,venue_ids]
    CATEGORY='4bf58dd8d48988d145941735'
    
    for neighborhood, lat, lng in zip(neighborhoods, latitudes, longitudes):
        
        # store venue ids in neighborhood 
        venue_ids=[]
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?categoryId={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CATEGORY,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()['response']['venues']
        
        # print(json.dumps(results,indent=4))  # pretty-print the JSON
    
        # add venue ids to venue_ids
        for venue in results:
            venue_ids.append(venue['id'])
        
        #add [neighborhood,venue_ids] to venue_id_list
        venue_id_list.append([neighborhood,venue_ids])

    return venue_id_list
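
The search URL above is assembled with a long format string. An alternative sketch builds the same query from a dictionary with the standard library's `urlencode`, which also takes care of escaping. `build_search_url` is a hypothetical helper and the credentials are placeholders; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(lat, lng, category_id, client_id, client_secret,
                     version, radius=2000, limit=50):
    """Build a venue search URL from named parameters."""
    params = {
        'categoryId': category_id,
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)


url = build_search_url(40.84744, -73.897128, '4bf58dd8d48988d145941735',
                       'PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20200101')
# decode the query string again to check what the server would see
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'], query['limit'])
```

With `requests`, the same dictionary can be passed as `requests.get(base_url, params=params)` instead of building the string by hand.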
        

In [13]:
def reduceVenuesToTen(venue_id_list): # list of [neighborhood,venue_ids]
    
    ten_venue_list=[]
    
    for row in range(0,len(venue_id_list)):
        name=venue_id_list[row][0]
        id_list=venue_id_list[row][1]
        
        # random sample of size 10, or the whole list if it is shorter
        size=min(10,len(id_list))
        ten=random.sample(id_list,size)
        ten_venue_list.append([name,ten])
        
    return ten_venue_list

In [14]:
def getNearbyVenueDetail(ten_list):
    rating_likes_list=[]
    for row in range(0,len(ten_list)):
        ratings=[]
        likesCount=[]
        for venue_id in ten_list[row][1]:
            
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
                venue_id,
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
            )

            # make the GET request
            response=requests.get(url).json()['response']
            if('venue' in response):
                response=response['venue']
                if 'rating' in response:
                    ratings.append(response['rating'])     
                likesCount.append(response['likes']['count'])
        
        
        neighborhood=ten_list[row][0] 
        
        avgRating=np.array(ratings).mean()
        varRating=np.array(ratings).var()
        avgLikes=np.array(likesCount).mean()
        varLikes=np.array(likesCount).var()
        
        rating_likes_list.append([neighborhood, avgRating,avgLikes, varRating,varLikes])
        
    ratingsAndLikes = pd.DataFrame(rating_likes_list)
    ratingsAndLikes.columns = ['Neighborhood', 'Avg Rating', 'Avg Like Count', 'Variance Rating', 'Variance Like Count']
    
    return ratingsAndLikes

In [31]:
neighborhoodVenueList=getNearbyVenueIds(neighborhoods=chineseNeighborhood['Neighborhood'][0:1],#[0:1] for testing
                         latitudes=chineseNeighborhood['Latitude'][0:1],
                         longitudes=chineseNeighborhood['Longitude'][0:1]
)
tenNeighVenueList=reduceVenuesToTen(neighborhoodVenueList)
ratingsAndLikes=getNearbyVenueDetail(tenNeighVenueList);
ratingsAndLikes

<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>
<Response [429]>




Unnamed: 0,Neighborhood,Avg Rating
0,Central Bronx,
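
The `<Response [429]>` lines are HTTP 429 (Too Many Requests) responses, meaning the API was rate limiting the notebook, which is why the average rating comes back empty. One common way to handle short-lived rate limits is to retry with exponential backoff. The sketch below is not part of the notebook: `call_with_backoff` is a hypothetical helper, and `fetch` stands for any zero-argument callable returning `(status_code, payload)`.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_attempts - 1:
            # wait 1x, 2x, 4x, ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    return status, payload


# fake fetch for demonstration: two rate-limited replies, then success
replies = iter([(429, None), (429, None), (200, {'ok': True})])
delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(lambda: next(replies), sleep=delays.append)
print(result, delays)
```

If the 429 comes from an exhausted daily quota rather than a burst of requests, waiting a few seconds will not help, and the calls have to be spread over several days or cached.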


In [31]:
neighborhoodRatings=pd.merge(neighborhood,ratingsAndLikes)
neighborhoodRatings.head()

Unnamed: 0,Borough,Neighborhood,ZIP Codes,Avg Rating
0,The Bronx,Central Bronx,"10453, 10457, 10460",7.8
1,The Bronx,Bronx Park and Fordham,"10458, 10467, 10468",7.45
2,The Bronx,High Bridge and Morrisania,"10451, 10452, 10456",7.6
3,The Bronx,Hunts Point and Mott Haven,"10454, 10455, 10459, 10474",
4,The Bronx,Kingsbridge and Riverdale,"10463, 10471",7.266667


In [54]:
# populationRatings=pd.merge(neighborhoodRatings,populationIncome,on='Borough')
# populationRatings.head()

Unnamed: 0,Borough,Neighborhood,ZIP Codes,Avg Rating,Population,Density (persons / sq. mi),Median Household Income,Mean Household Income,Percentage in Poverty
0,The Bronx,Central Bronx,"10453, 10457, 10460",7.8,1418207,33867,34156,46298,0.271
1,The Bronx,Bronx Park and Fordham,"10458, 10467, 10468",7.45,1418207,33867,34156,46298,0.271
2,The Bronx,High Bridge and Morrisania,"10451, 10452, 10456",7.6,1418207,33867,34156,46298,0.271
3,The Bronx,Hunts Point and Mott Haven,"10454, 10455, 10459, 10474",,1418207,33867,34156,46298,0.271
4,The Bronx,Kingsbridge and Riverdale,"10463, 10471",7.266667,1418207,33867,34156,46298,0.271


## K-Means Clustering and Folium Visualization

In [50]:
from sklearn.cluster import KMeans
import folium

In [55]:
# populationRatings.drop(populationRatings.columns[:3], axis=1, inplace=True)
# populationRatings.head()

Unnamed: 0,Avg Rating,Population,Density (persons / sq. mi),Median Household Income,Mean Household Income,Percentage in Poverty
0,7.8,1418207,33867,34156,46298,0.271
1,7.45,1418207,33867,34156,46298,0.271
2,7.6,1418207,33867,34156,46298,0.271
3,,1418207,33867,34156,46298,0.271
4,7.266667,1418207,33867,34156,46298,0.271


In [82]:
kclusters=10
populationRatings.dropna(inplace=True)
model=KMeans(n_clusters=kclusters)
model.fit(populationRatings)
print(model.labels_)

[4 4 4 4 9 4 8 3 3 3 8 3 3 3 3 0 7 0 7 0 0 0 0 0 0 6 6 1 1 1 1 6 1 1 6 5 5
 2 5]
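
K-means groups rows by Euclidean distance, so a column measured in the millions (such as Population) can dominate a column measured in fractions (such as Percentage in Poverty). A common remedy, which the notebook does not apply, is to standardize each column to zero mean and unit variance before fitting. A minimal sketch on one column using only the standard library follows; `standardize` is a hypothetical helper, and the sample values are the ratings shown earlier.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # constant column: every value sits at the mean
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


ratings = [7.8, 7.45, 7.6, 7.266667]
scaled = standardize(ratings)
print([round(value, 3) for value in scaled])
```

In a pandas workflow the same effect is usually obtained with scikit-learn's `StandardScaler` applied to the feature dataframe before `KMeans.fit`.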


In [83]:
# droppedNeighborhood=neighborhood.drop([3,13,14])
droppedNeighborhood.insert(0,'Cluster Labels', model.labels_)
droppedNeighborhood


Unnamed: 0,Cluster Labels,Borough,Neighborhood,ZIP Codes,Latitude,Longitude
0,4,The Bronx,Central Bronx,"[10453, 10457, 10460]",40.84744,-73.897128
1,4,The Bronx,Bronx Park and Fordham,"[10458, 10467, 10468]",40.86705,-73.88607
2,4,The Bronx,High Bridge and Morrisania,"[10451, 10452, 10456]",40.829625,-73.919533
4,4,The Bronx,Kingsbridge and Riverdale,"[10463, 10471]",40.89335,-73.90391
5,9,The Bronx,Northeast Bronx,"[10466, 10469, 10470, 10475]",40.885696,-73.849714
6,4,The Bronx,Southeast Bronx,"[10461, 10462,10464, 10465, 10472, 10473]",40.840295,-73.838583
7,8,Brooklyn,Central Brooklyn,"[11212, 11213, 11216, 11233, 11238]",40.674294,-73.935457
8,3,Brooklyn,Southwest Brooklyn,"[11209, 11214, 11228]",40.615247,-74.015602
9,3,Brooklyn,Borough Park,"[11204, 11218, 11219, 11230]",40.631545,-73.980169
10,3,Brooklyn,Canarsie and Flatlands,"[11234, 11236, 11239]",40.636881,-73.899239


In [84]:
import matplotlib.cm as cm
import matplotlib.colors as colors
nycLatLng=geocoder.arcgis('New York City, New York').latlng # geocode once and reuse the result
nycLatitude=nycLatLng[0]
nycLongitude=nycLatLng[1]
    

# Modified from the New York Lab

# create map
map_clusters = folium.Map(location=[nycLatitude, nycLongitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(droppedNeighborhood['Latitude'], droppedNeighborhood['Longitude'], droppedNeighborhood['ZIP Codes'], droppedNeighborhood['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
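
KMeans labels run from 0 to kclusters-1, so `rainbow[cluster-1]` sends label 0 to the last color through Python's negative indexing. Each cluster still gets its own color, but indexing by the label directly is easier to follow. The sketch below builds a palette of distinct hex colors with the standard library's `colorsys` rather than matplotlib, purely as an illustrative assumption; `cluster_palette` is a hypothetical helper.

```python
import colorsys


def cluster_palette(n_clusters):
    """Return n_clusters hex colors spread evenly around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255)))
    return palette


palette = cluster_palette(10)
labels = [4, 4, 9, 0, 8]  # a few labels in the range 0..9
marker_colors = [palette[label] for label in labels]  # index by label directly
print(marker_colors)
```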