# The Battle of the Neighborhoods - *Applied Data Science Capstone*

## 1. Problem and Discussion of the Background

While it is widely known that New York is the financial centre of the world, people sometimes fail to realize that there is more than one financial centre within the United States (US) itself, let alone internationally. I'm interested in assessing how similar (or diverse, for that matter) different financial capitals are. For this reason, I'd like to compare Manhattan, NY with Chicago, IL (another important US financial centre) to see whether the makeup of those areas shows similar trends or whether other factors take precedence and explain the dissimilarities. Using clustering analysis, I will compare the results to make inferences about the areas. I hope to learn more about both places even though I have visited them numerous times. My target audience is the residents of Manhattan, NY and Chicago, IL, to show them how homogeneous or diverse (depending on the outcome) their neighborhoods are.

## 2. Data Needed and Solution Outline

Based on what we've learned throughout this program, the data needed for my discussion consists of geospatial data along with descriptive information (i.e. venue information) linked to it. FourSquare is a great platform for collecting such information, as it also provides coordinates, which makes mapping easier. Additionally, I'll need to scrape "neighborhood" information from web sources that list the different ZIP codes for both Manhattan, NY and Chicago, IL. Once the data is collected, I'll have to clean it and extract only the data that is pertinent to my discussion. Once I complete the data cleaning process, I will move on to the analysis stage, where I map and cluster the data using k-Means clustering. Finally, using the clusters computed by k-Means, I will draw my conclusions about the two locations by looking at the most common clusters in each.

## 3. Methodology

The cell below installs and imports all the necessary packages in order to compute the data.

In [1]:
!pip install folium
import folium

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import requests
import json

import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim

from pandas import json_normalize

from sklearn.cluster import KMeans



I found a credible data source for Chicago's neighborhoods and downloaded the CSV file. Then, I uploaded the file as a "data asset" to IBM's Cloud Object Storage and used IBM's interface to generate the code that loads the uploaded data into this notebook.

*Source: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6*

**The cell might get hidden due to my credentials being present in the cell.**

In [2]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,the_geom,PERIMETER,AREA,COMAREA_,COMAREA_ID,AREA_NUMBE,COMMUNITY,AREA_NUM_1,SHAPE_AREA,SHAPE_LEN
0,MULTIPOLYGON (((-87.60914087617894 41.84469250...,0,0,0,0,35,DOUGLAS,35,46004620.0,31027.05451
1,MULTIPOLYGON (((-87.59215283879394 41.81692934...,0,0,0,0,36,OAKLAND,36,16913960.0,19565.506153
2,MULTIPOLYGON (((-87.62879823733725 41.80189303...,0,0,0,0,37,FULLER PARK,37,19916700.0,25339.08975
3,MULTIPOLYGON (((-87.6067081256125 41.816813770...,0,0,0,0,38,GRAND BOULEVARD,38,48492500.0,28196.837157
4,MULTIPOLYGON (((-87.59215283879394 41.81692934...,0,0,0,0,39,KENWOOD,39,29071740.0,23325.167906


I kept only the "COMMUNITY" column, renamed it to "Neighborhood", and displayed the shape and the first 5 rows.

In [3]:
chicago_df = chicago_df['COMMUNITY'] 

chicago_df = pd.DataFrame(chicago_df)

chicago_df = chicago_df.rename(columns = {'COMMUNITY' : 'Neighborhood'})

print(chicago_df.shape)

chicago_df.head()

(77, 1)


Unnamed: 0,Neighborhood
0,DOUGLAS
1,OAKLAND
2,FULLER PARK
3,GRAND BOULEVARD
4,KENWOOD


Based on the neighborhoods identified by the data source, I extracted the latitude and longitude of each one in preparation for using the FourSquare API later. The results were stored in a new list.

In [4]:
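chicago_coordinates = []

# Create the geocoder once instead of on every iteration.
geolocator = Nominatim(user_agent = "chicago_explorer")

for hood in chicago_df['Neighborhood']:

    location = geolocator.geocode(hood + ", Chicago")
    # geocode() returns None when no match is found; skip such neighborhoods.
    if location is None:
        continue
    latitude = location.latitude
    longitude = location.longitude
    chicago_coordinates.append([hood, latitude, longitude])
    
chicago_coordinates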

[['DOUGLAS', 41.8348565, -87.6179536],
 ['OAKLAND', 41.8236535, -87.6082424],
 ['FULLER PARK', 41.8180891, -87.6325508],
 ['GRAND BOULEVARD', 41.8139226, -87.6172724],
 ['KENWOOD', 41.8091444, -87.5979908],
 ['LINCOLN SQUARE', 41.975989850000005, -87.6896163305115],
 ['WASHINGTON PARK', 41.7925338, -87.6181052],
 ['HYDE PARK', 41.7944464, -87.5939244],
 ['WOODLAWN', 41.7794786, -87.599493],
 ['ROGERS PARK', 42.01053135, -87.67074819664808],
 ['JEFFERSON PARK', 41.9697375, -87.7631179],
 ['FOREST GLEN', 41.991751550000004, -87.75167396842738],
 ['NORTH PARK', 41.9805872, -87.7208917],
 ['ALBANY PARK', 41.9719367, -87.7161739],
 ['PORTAGE PARK', 41.9578093, -87.7650594],
 ['IRVING PARK', 41.953365, -87.7364471],
 ['DUNNING', 41.952809, -87.7964493],
 ['MONTCLARE', 41.9253091, -87.8008931],
 ['BELMONT CRAGIN', 41.9316983, -87.7686699],
 ['WEST RIDGE', 42.0035482, -87.6962426],
 ['HERMOSA', 41.928643, -87.7345019],
 ['AVONDALE', 41.9389208, -87.711168],
 ['LOGAN SQUARE', 41.9285683, -87.70
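The geocoding loop above depends on a live service: Nominatim's public instance asks for at most about one request per second, and "geocode" returns None when it finds no match. The following is a minimal sketch of a more defensive version of that loop, not the code that produced the output above. The helper name "geocode_neighborhoods" and the fake geocoder are hypothetical; the geocoding function is passed in as a parameter so the logic can be tried offline. With geopy, one could pass `RateLimiter(geolocator.geocode, min_delay_seconds=1)` from `geopy.extra.rate_limiter` instead.

```python
from collections import namedtuple

# Stand-in for geopy's Location object, which exposes .latitude and .longitude.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def geocode_neighborhoods(names, geocode, suffix=", Chicago"):
    """Return ([name, lat, lon] rows, names that could not be geocoded)."""
    rows, missing = [], []
    for name in names:
        location = geocode(name + suffix)
        if location is None:  # no match found for this query
            missing.append(name)
            continue
        rows.append([name, location.latitude, location.longitude])
    return rows, missing


# Offline demonstration with a fake geocoder (hypothetical, for illustration).
# The coordinates are rounded from the DOUGLAS entry in the output above.
def fake_geocode(query):
    known = {"DOUGLAS, Chicago": FakeLocation(41.8349, -87.6180)}
    return known.get(query)


rows, missing = geocode_neighborhoods(["DOUGLAS", "NOT A REAL PLACE"], fake_geocode)
print(rows)
print(missing)
```

Returning the unmatched names separately makes it easy to see which neighborhoods would silently disappear from the later analysis.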

The list was converted to a pandas dataframe.

In [None]:
cc_df = pd.DataFrame(chicago_coordinates)

The newly created dataframe was modified to make it easier to read.

In [None]:
cc_df = cc_df.rename(columns = {0 : 'Neighborhood', 1 : 'Latitude', 2 : 'Longitude'})

cc_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,DOUGLAS,41.834857,-87.617954
1,OAKLAND,41.823653,-87.608242
2,FULLER PARK,41.818089,-87.632551
3,GRAND BOULEVARD,41.813923,-87.617272
4,KENWOOD,41.809144,-87.597991


The dataframes were merged for consistency's sake.

In [None]:
chi_merged = chicago_df.merge(cc_df, how = 'inner')

chi_merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,DOUGLAS,41.834857,-87.617954
1,OAKLAND,41.823653,-87.608242
2,FULLER PARK,41.818089,-87.632551
3,GRAND BOULEVARD,41.813923,-87.617272
4,KENWOOD,41.809144,-87.597991


Chicago's center latitude and longitude were located.

In [None]:
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="il_explorer")
location = geolocator.geocode(address)
il_latitude = location.latitude
il_longitude = location.longitude
# Print the city-center variables, not the leftover loop variables from the geocoding cell.
print('The geographical coordinates of Chicago are {}, {}.'.format(il_latitude, il_longitude))


The preliminary map was made to make sure the mapping was accurate.

In [None]:
map_chicago = folium.Map(location=[il_latitude, il_longitude], zoom_start=10)

for lat, lng, neighborhood in zip(chi_merged['Latitude'], chi_merged['Longitude'], chi_merged['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chicago)  
    
map_chicago

The same was done to find Manhattan's latitude and longitude.

In [None]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
ny_latitude = location.latitude
ny_longitude = location.longitude
# Print the Manhattan variables, not the leftover loop variables from the Chicago geocoding cell.
print('The geographical coordinates of Manhattan are {}, {}.'.format(ny_latitude, ny_longitude))


First, the available dataset on New York was extracted (this is the same dataset that was used throughout this program). Then, pertinent data from the dataset was pulled to create a new dataframe. However, the dataframe has information on all the boroughs of New York as opposed to just Manhattan.

*Source: https://cocl.us/new_york_dataset*

In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

ny_neighborhoods_data = newyork_data['features']

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect the rows in a list first: DataFrame.append was removed in pandas 2.0,
# and growing a dataframe row by row is slow.
rows = []
for data in ny_neighborhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude].
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates'][:2]
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

ny_neighborhoods = pd.DataFrame(rows, columns=column_names)
ny_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


The aforementioned dataframe was then filtered into a new dataframe that contained only Manhattan's data.

In [None]:
manhattan_data = ny_neighborhoods[ny_neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)

manhattan_data = manhattan_data.drop(columns = 'Borough')

manhattan_data.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Marble Hill,40.876551,-73.91066
1,Chinatown,40.715618,-73.994279
2,Washington Heights,40.851903,-73.9369
3,Inwood,40.867684,-73.92121
4,Hamilton Heights,40.823604,-73.949688


A preliminary map was made based on the information from the Manhattan dataframe to make sure the data was accurate.

In [None]:
map_manhattan = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=12)

for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

A connection was then established with the FourSquare API using credentials acquired prior to this project. A function was created that would search for the 50 closest locations within a 500 m radius to the coordinates of each neighborhood in the chosen location (Manhattan or Chicago).

**The cell might be hidden due to the presence of credentials in the cell.**

In [None]:
# The code was removed by Watson Studio for sharing.
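Because the cell above is hidden, here is a rough sketch of what a helper like "getNearbyVenues" might do for a single neighborhood. It is an assumption-laden illustration, not the removed code: it assumes the legacy FourSquare v2 "venues/explore" endpoint and the response layout used in the course materials, and the function names, the version date, and every output column other than "Neighborhood" and "Venue Category" are hypothetical. A hand-written payload stands in for a live request so that no credentials are needed.

```python
import pandas as pd

# Assumed endpoint of the legacy FourSquare v2 API; not confirmed by this notebook.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(lat, lng, client_id, client_secret,
                         version="20180605", radius=500, limit=50):
    """Query parameters for one neighborhood (500 m radius, up to 50 venues)."""
    return {"client_id": client_id, "client_secret": client_secret,
            "v": version, "ll": "{},{}".format(lat, lng),
            "radius": radius, "limit": limit}


def parse_venues(name, lat, lng, payload):
    """Flatten one response into rows of neighborhood and venue details."""
    items = payload["response"]["groups"][0]["items"]
    return [(name, lat, lng,
             item["venue"]["name"],
             item["venue"]["location"]["lat"],
             item["venue"]["location"]["lng"],
             item["venue"]["categories"][0]["name"]) for item in items]


# Hand-written payload in the assumed layout, used instead of a live request.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 41.835, "lng": -87.618},
               "categories": [{"name": "Cafe"}]}}]}]}}

columns = ["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
           "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]
venues = pd.DataFrame(parse_venues("DOUGLAS", 41.8349, -87.6180, sample),
                      columns=columns)
print(venues)
```

In a live run, the payload would come from something like `requests.get(EXPLORE_URL, params=build_explore_params(...)).json()`, repeated for every neighborhood and concatenated into one dataframe.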

Dataframes containing the information from FourSquare were created for both Chicago and Manhattan separately using the function defined from before. The first 5 rows of both dataframes were displayed to get a feel of what the dataframes were producing. 

In [None]:
chicago_venues = getNearbyVenues(names = chi_merged['Neighborhood'],
                                   latitudes = chi_merged['Latitude'],
                                   longitudes = chi_merged['Longitude']
                                  )

DOUGLAS
OAKLAND
FULLER PARK
GRAND BOULEVARD
KENWOOD
LINCOLN SQUARE
WASHINGTON PARK
HYDE PARK
WOODLAWN
ROGERS PARK
JEFFERSON PARK
FOREST GLEN
NORTH PARK
ALBANY PARK
PORTAGE PARK
IRVING PARK
DUNNING
MONTCLARE
BELMONT CRAGIN
WEST RIDGE
HERMOSA
AVONDALE
LOGAN SQUARE
HUMBOLDT PARK
WEST TOWN
AUSTIN
WEST GARFIELD PARK
EAST GARFIELD PARK
NEAR WEST SIDE
NORTH LAWNDALE
UPTOWN
SOUTH LAWNDALE
LOWER WEST SIDE
NEAR SOUTH SIDE
ARMOUR SQUARE
NORWOOD PARK
NEAR NORTH SIDE
LOOP
SOUTH SHORE
CHATHAM
AVALON PARK
SOUTH CHICAGO
BURNSIDE
MCKINLEY PARK
LAKE VIEW
CALUMET HEIGHTS
ROSELAND
NORTH CENTER
PULLMAN
SOUTH DEERING
EAST SIDE
WEST PULLMAN
RIVERDALE
HEGEWISCH
GARFIELD RIDGE
ARCHER HEIGHTS
BRIGHTON PARK
BRIDGEPORT
NEW CITY
WEST ELSDON
GAGE PARK
CLEARING
WEST LAWN
CHICAGO LAWN
WEST ENGLEWOOD
ENGLEWOOD


In [None]:
chicago_venues.head()

In [None]:
manhattan_venues = getNearbyVenues(names = manhattan_data['Neighborhood'],
                                   latitudes = manhattan_data['Latitude'],
                                   longitudes = manhattan_data['Longitude']
                                  )

In [None]:
manhattan_venues.head()

The shape of both dataframes was printed to make sure that there was enough data to conduct the analysis and that Chicago and Manhattan could be compared to each other.

In [None]:
print('Chicago dataset shape:', chicago_venues.shape)
print('Manhattan dataset shape:', manhattan_venues.shape)

Counts were done to see how many datapoints existed for each neighborhood.

In [None]:
chicago_venues.groupby('Neighborhood').count()

In [None]:
manhattan_venues.groupby('Neighborhood').count()

In [None]:
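# Drop venues whose category is 'Neighborhood' so the category cannot collide
# with the 'Neighborhood' column added after one-hot encoding.
chicago_venues = chicago_venues[chicago_venues['Venue Category'] != 'Neighborhood']

print('There are {} unique categories for Chicago.'.format(len(chicago_venues['Venue Category'].unique())))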

In [None]:
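# Drop venues whose category is 'Neighborhood' so the category cannot collide
# with the 'Neighborhood' column added after one-hot encoding.
manhattan_venues = manhattan_venues[manhattan_venues['Venue Category'] != 'Neighborhood']

print('There are {} unique categories for Manhattan.'.format(len(manhattan_venues['Venue Category'].unique())))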

Dataframes were created for both locations using one-hot encoding to prepare for clustering. Within each dataframe, the "Neighborhood" column is moved to the first column. Additionally, the shape of each one-hot encoded dataframe is printed and its first 5 rows are displayed.

In [None]:
chicago_onehot = pd.get_dummies(chicago_venues[['Venue Category']], prefix="", prefix_sep="")

chicago_onehot['Neighborhood'] = chicago_venues['Neighborhood'] 

fixed_columns = [chicago_onehot.columns[-1]] + list(chicago_onehot.columns[:-1])
chicago_onehot = chicago_onehot[fixed_columns]

print(chicago_onehot.shape)
chicago_onehot.head()

In [None]:
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

print(manhattan_onehot.shape)
manhattan_onehot.head()

For both dataframes, the rows were grouped by neighborhood and the mean of each one-hot column was taken, which gives the frequency of occurrence of each category within a neighborhood.

In [None]:
chicago_grouped = chicago_onehot.groupby('Neighborhood').mean().reset_index()
chicago_grouped

In [None]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped
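To make the one-hot encoding and grouping steps concrete, here is a small sketch on made-up data (two hypothetical neighborhoods, "A" and "B", with invented venues). It follows the same pattern as the cells above, except that "dtype=float" is passed to "get_dummies" so the indicator columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy venue list: two neighborhoods with made-up venues, for illustration only.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Gym", "Park", "Gym", "Gym"],
})

# One indicator column per category, one row per venue.
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]],
                            prefix="", prefix_sep="", dtype=float)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category.
toy_grouped = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

Neighborhood "A" has two bars among four venues, so its "Bar" frequency is 0.5, and each row of frequencies sums to 1. These per-neighborhood frequency vectors are what the clustering step later compares.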

The cells below print the 5 most common venue categories for each neighborhood in both locations.

In [None]:
num_top_venues = 5

for hood in chicago_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = chicago_grouped[chicago_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp2 = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp2.columns = ['venue','freq']
    temp2 = temp2.iloc[1:]
    temp2['freq'] = temp2['freq'].astype(float)
    temp2 = temp2.round({'freq': 2})
    print(temp2.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

The cell below defines a function, "return_most_common_venues", which sorts a neighborhood's venue categories by frequency in descending order and returns the most common ones (5 here). New dataframes were created to display (in order) what the most common "venues" were for each neighborhood within both locations.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
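As a quick illustration of what this function returns, the sketch below repeats its logic and applies it to one made-up row of frequencies (the values are invented and chosen without ties).

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Same logic as the notebook's function: skip the Neighborhood entry,
    # sort the remaining frequencies, and keep the top category names.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up frequencies for a single neighborhood.
toy_row = pd.Series({"Neighborhood": "A", "Bar": 0.5, "Gym": 0.3, "Park": 0.2})
top = list(return_most_common_venues(toy_row, 2))
print(top)
```

Note that when two categories have the same frequency, their relative order in the result is not guaranteed, so the ranking of tied venues should not be over-interpreted.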

In [None]:
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Ranks beyond 3rd have no special suffix and fall back to 'th'.
        columns.append('{}th Most Common Venue'.format(ind+1))

chi_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
chi_neighborhoods_venues_sorted['Neighborhood'] = chicago_grouped['Neighborhood']

for ind in np.arange(chicago_grouped.shape[0]):
    chi_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(chicago_grouped.iloc[ind, :], num_top_venues)

chi_neighborhoods_venues_sorted.head()

In [None]:
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Ranks beyond 3rd have no special suffix and fall back to 'th'.
        columns.append('{}th Most Common Venue'.format(ind+1))

ny_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
ny_neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    ny_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

ny_neighborhoods_venues_sorted.head()

The cells below run KMeans clustering for a range of k values. KMeans is a popular unsupervised machine learning algorithm that groups similar datapoints together, ideally revealing patterns. The displayed output is the inertia of each fit, i.e. the sum of squared distances from each point to its assigned cluster centre, which measures how tightly the clusters are packed (lower is tighter). The values are then plotted and the best value of k is picked using the "elbow method". The two locations produced different optimal values of k, which hints at a difference in diversity between them, so each location was clustered with its own optimal k.

Once the cluster labels were generated, they were appended to each location's dataframe so that every neighborhood carries the label of the cluster it belongs to. The new dataframes were then mapped for both locations, with each cluster assigned a different color for ease of identification on the map.

*Source: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1*

In [None]:
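# Use the columns= keyword; the positional axis argument was removed in pandas 2.0.
chicago_grouped_clustering = chicago_grouped.drop(columns='Neighborhood')

Ks = 18
chi_ssd = []

for n in range(1,Ks):
    
    # Fix the seed so the elbow curve is reproducible between runs.
    KM = KMeans(n_clusters = n, random_state = 0)
    KM = KM.fit(chicago_grouped_clustering)
    chi_ssd.append(KM.inertia_)

chi_ssd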

In [None]:
plt.plot(range(1,Ks), chi_ssd, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k - Chicago')
plt.show()
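To show what the elbow looks like when the answer is known in advance, here is a minimal sketch on synthetic data. The three groups of points, the random seed, and all variable names are made up for illustration and have nothing to do with the venue data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Inertia is the within-cluster sum of squared distances to the centroids.
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# Reduction in inertia gained by each additional cluster.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

In this toy case the inertia falls sharply up to k = 3 and barely changes afterwards, which is the "elbow". On real data such as the venue frequencies the bend is usually much less sharp, so the chosen k involves some judgment.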

In [None]:
chi_kclusters = 4

chi_kmeans = KMeans(n_clusters=chi_kclusters, random_state=0).fit(chicago_grouped_clustering)

chi_kmeans.labels_[0:10]

In [None]:
chi_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', chi_kmeans.labels_)

chi_merged = chi_merged.join(chi_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

chi_merged = chi_merged.dropna()

chi_merged = chi_merged.astype({'Cluster Labels': 'int32'})

chi_merged.head()

In [None]:
chi_map_clusters = folium.Map(location=[il_latitude, il_longitude], zoom_start=10)

# One evenly spaced rainbow color per cluster.
colors_array = cm.rainbow(np.linspace(0, 1, chi_kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(chi_merged['Latitude'], chi_merged['Longitude'], chi_merged['Neighborhood'], chi_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(chi_map_clusters)
       
chi_map_clusters

In [None]:
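# Use the columns= keyword; the positional axis argument was removed in pandas 2.0.
manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

ny_ssd = []

for n in range(1,Ks):
    
    # Fix the seed so the elbow curve is reproducible between runs.
    KM = KMeans(n_clusters = n, random_state = 0)
    KM = KM.fit(manhattan_grouped_clustering)
    ny_ssd.append(KM.inertia_)

ny_ssd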

In [None]:
plt.plot(range(1,Ks), ny_ssd, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k - Manhattan')
plt.show()

In [None]:
ny_kclusters = 12

ny_kmeans = KMeans(n_clusters=ny_kclusters, random_state=0).fit(manhattan_grouped_clustering)

ny_kmeans.labels_[0:10]

In [None]:
ny_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', ny_kmeans.labels_)

manhattan_data = manhattan_data.join(ny_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_data = manhattan_data.dropna()

manhattan_data = manhattan_data.astype({'Cluster Labels': 'int32'})

manhattan_data.head()

In [None]:
ny_map_clusters = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=12)

# One evenly spaced rainbow color per cluster.
colors_array = cm.rainbow(np.linspace(0, 1, ny_kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood'], manhattan_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(ny_map_clusters)
       
ny_map_clusters

The largest cluster on each map was visually identified, and the 5 most common "venues" were extracted for the neighborhoods in that cluster to compare the makeup of the clusters. The most frequently mentioned "venue category" was computed for both locations to get an idea of what kind of "venues" were in the cluster.

In [None]:
manhattan_final = manhattan_data.loc[manhattan_data['Cluster Labels'] == 1, manhattan_data.columns[[0] + list(range(4, manhattan_data.shape[1]))]]

manhattan_final

In [None]:
manhattan_final.describe(include = 'all')

In [None]:
chi_final = chi_merged.loc[chi_merged['Cluster Labels'] == 1, chi_merged.columns[[0] + list(range(4, chi_merged.shape[1]))]]

chi_final

In [None]:
chi_final.describe(include = 'all')

## 4. Discussion

For the discussion, observations that stand out will be included, as well as possible explanations for them. The recommendation in this case is more of an informed determination than an actionable plan. The determination in this context is whether the locations, Manhattan and Chicago, are similar or diverse based solely on venue information from a location intelligence website (FourSquare).

Starting with the "chicago_venues" and "manhattan_venues" shapes, we can see that Manhattan has more datapoints even though it has fewer neighborhoods (as shown in "manhattan_grouped" and "chicago_grouped"). The risk of a few neighborhoods holding most of the "venues" is mitigated, since a limit was put on how many locations can be returned for a single neighborhood. Furthermore, Manhattan has more "venues" per neighborhood than Chicago: most neighborhoods in Manhattan met the 50 "venue" limit when looking at the counts grouped by neighborhood. From the comparison, one can say that the data shows Manhattan has more to do within each neighborhood than Chicago. However, it is also possible that not enough data was captured on Chicago's neighborhoods.

When looking at the K-means clustering, we find that Chicago and Manhattan have different optimal numbers of clusters. Clusters are groups of datapoints that share similarities as detected by the machine learning algorithm. Using the "elbow method", it was found that Chicago's optimal K value was 4 and Manhattan's optimal K value was 12. In this case, optimal means the value beyond which adding more clusters no longer reduces the sum of squared distances by much. We can see here that Manhattan needs more clusters than Chicago to separate its neighborhoods distinctly. This suggests that Manhattan is more diverse than Chicago, with the algorithm finding more ways to separate the neighborhoods.
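Since the elbow is read off a plot by eye, it can help to cross-check the chosen K with a numeric criterion. The sketch below is not part of the original analysis: it swaps in the silhouette score ("silhouette_score" from scikit-learn) and applies it to synthetic data with three obvious groups, picking the K with the highest score. The data and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# The silhouette score needs at least two clusters, so start at k = 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k whose clusters are, on average, best separated.
best_k = max(scores, key=scores.get)
print(best_k)
```

Running the same loop on "chicago_grouped_clustering" and "manhattan_grouped_clustering" would show whether the values of 4 and 12 are also supported by this second criterion.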

With regard to the mapping of the clusters, we can see that the central areas of both locations are similar in that, within each location, they fall in the same cluster. For Manhattan, the clusters start to change towards the ends of the island. Chicago, on the other hand, is fairly uniform across the map, with a significant majority of neighborhoods falling under one cluster. In Chicago, around 82% (61 out of 74) of neighborhoods fall under the same cluster, whereas only around 26% (10 out of 39) of neighborhoods fall in the same cluster for Manhattan. Looking at the most common types of venues in the most prominent cluster in each location, Chicago is more food-oriented (e.g. bars, restaurants, groceries) while Manhattan is more fitness-oriented (e.g. gyms and yoga studios).
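The shares quoted above follow directly from the neighborhood counts, as this short worked example shows. The variable names are hypothetical; with the notebook's dataframes, the same shares could be obtained from something like `chi_merged['Cluster Labels'].value_counts(normalize=True)`.

```python
# Neighborhood counts quoted in the paragraph above.
chicago_largest, chicago_total = 61, 74
manhattan_largest, manhattan_total = 10, 39

# Share of neighborhoods that fall in the largest cluster of each location.
chicago_share = chicago_largest / chicago_total
manhattan_share = manhattan_largest / manhattan_total

print("Chicago: {:.1%}".format(chicago_share))
print("Manhattan: {:.1%}".format(manhattan_share))
```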

*Source: https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1*

## 5. Conclusion

Based on the analysis of the data, one can determine that Manhattan and Chicago are quite different locations even though both of them are in the United States. Both locations play an important role in global financial operations and are among the most heavily populated places in the United States. However, the findings suggest that the makeups of the neighborhoods are entirely different, as Chicago is more homogeneous across the map while Manhattan is similar only in the central region. Adding to that, the optimal number of clusters for Manhattan is significantly higher than for Chicago, which suggests that Manhattan is more diverse. Finally, comparing the types of venues that exist in the most common cluster of neighborhoods, we see that Chicago leans towards dining and Manhattan leans towards wellness.

Having said that, it is important to note that a single method of analysis is not enough to make this determination. There are other factors that can affect the similarities or differences between the above locations. Culture can play a role in how the neighborhoods are structured, and demographics can also make a difference. It is also good practice to scrutinize the data to make sure the source is credible. In this case, some categories were counted separately even when they described the same venue type (e.g. beer bar and beer garden).