## Coursera Capstone: The Battle of the Neighborhoods
Author: Steve Maraj

Date: 11/8/2019

## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>
Firstly, I used the outline of one of the presented notebook examples because of its simplicity.
   
This project tries to find the best type of store to open, and the best location for it, in any town. My analysis assumes that the businesses currently open reflect the current culture. Of course, the culture may change again, but that will take years, so I would suggest rerunning this analysis every few years with updated data. This analysis should be helpful to groups that wish to start a business in any town.
   
For this time-sensitive capstone project, I will take a district/community of related neighborhoods of a city, get their lat/lon coordinates, submit them to Foursquare and use the returned data to show the current businesses within a specific radius, a reflection of the current culture. Then I will cluster the data to reflect the current markets. To enhance the validity of this project, further data analysis could include the average household income and property values; your prices should reflect what your consumers can afford. Another enhancement would be to get your competitors' profitability figures. For example, if the laundromat segment has been running at a loss for years, this would suggest crossing laundromats off your list of businesses to open there. Getting more data from a premium Foursquare account allows you to identify the high-foot-traffic establishments based on the number of comments left by their patrons (this would also show whether the patrons are tech savvy). Analysis of the comments would be quite helpful in showing what is liked and disliked about an establishment.


## Data <a name="data"></a>
I will be analyzing the North Side & South Side district neighborhoods of Chicago, IL. Basically, the neighborhoods are segmented into 9 districts based on direction (North, West & South). I chose Chicago because it was highly rated by a Quora user as one of the best cities in the US for people who like the social life. Since the required data was not readily available, I manually created the data file with the neighborhoods obtained from Wikipedia, which also provided the latitude & longitude information of the 13 communities of the North Side & South Side.
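As a rough illustration of how such a hand-built file can be put together, the sketch below assembles three of the rows shown later in this notebook and round-trips them through CSV text. The variable names are made up for this example, and an in-memory buffer stands in for the real data file.

```python
import io

import pandas as pd

# Three rows copied from the table displayed in this notebook; the other
# communities would be typed in the same way.
columns = ["District", "Communities", "Neighborhoods", "lat", "lon"]
rows = [
    ("North Side", "North Center", "Horner Park, Roscoe Village", 41.95, -87.68),
    ("South Side", "Armour Square", "Chinatown, Wentworth Gardens", 41.833333, -87.633333),
    ("South Side", "Grand Boulevard", "Bronzeville", 41.81, -87.62),
]
df_sample = pd.DataFrame(rows, columns=columns)

# index=False keeps the row index out of the file, so reading it back
# yields exactly the five columns above.
csv_text = df_sample.to_csv(index=False)
df_roundtrip = pd.read_csv(io.StringIO(csv_text))
print(df_roundtrip)
```

Neighborhood lists contain commas, so pandas quotes that field automatically when writing; typing the file by hand needs the same quoting.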



In [4]:
# Let's load our required libraries
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library to process data as dataframes

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions: from pandas.io.json import json_normalize)

from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [56]:
# Let's load our data into a pandas dataframe
df_neighborhoods = pd.read_csv("Coursera_Capstone_Chicago_NorthSide_data.csv")

In [57]:
df_neighborhoods

Unnamed: 0,District,Communities,Neighborhoods,lat,lon
0,North Side,North Center,"Horner Park, Roscoe Village",41.95,-87.68
1,North Side,Lake View,"Boystown, Lake View East, Graceland West, Sout...",41.9435,-87.654167
2,North Side,Lincoln Park,"Old Town Triangle, Park West, Ranch Triangle, ...",41.92,-87.65
3,North Side,Avondale,"Belmont Gardens, Chicago's Polish Village, Kos...",41.94,-87.71
4,North Side,Logan Square,"Belmont Gardens, Bucktown, Kosciuszko Park, Pa...",41.928333,-87.706667
5,South Side,Armour Square,"Chinatown, Wentworth Gardens",41.833333,-87.633333
6,South Side,Douglas,"Groveland Park, Lake Meadows, the Gap, Prairie...",41.834722,-87.620556
7,South Side,Grand Boulevard,Bronzeville,41.81,-87.62
8,South Side,Kenwood,"Kenwood, South Kenwood",41.81,-87.6
9,South Side,Hyde Park,"East Hyde Park, Hyde Park",41.8,-87.59


In [58]:
# Let's use the geopy library to get the latitude and longitude values of Chicago, IL.
address = 'Chicago, IL'

geolocator = Nominatim(user_agent="il_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Chicago, IL are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Chicago, IL are 41.8755616, -87.6244212.


In [59]:
# Let's create a map of Chicago, IL using the latitude and longitude values generated above
map_chicago = folium.Map(location=[latitude, longitude], zoom_start=10)


In [79]:
# add markers to map
for lat, lon, Community, Neighborhood in zip(df_neighborhoods['lat'], df_neighborhoods['lon'], df_neighborhoods['Communities'], df_neighborhoods['Neighborhoods']):
    label = '{}--{}'.format(Community, Neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chicago)  
    
map_chicago


In [78]:
# Let's utilize the Foursquare API to explore the communities and neighborhoods and segment them.
# Setup Foursquare Credentials and Version
CLIENT_ID = 'XXXXXXXXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXXXXXXXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)


Your credentials:
CLIENT_ID: XXXXXXXXXX
CLIENT_SECRET: XXXXXXXXXX


In [62]:
# function that extracts the category of the venue
# Let's reuse our function from our lab
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


In [63]:
# Let's Explore the Neighborhoods in the community of North Side
# Let's reuse our function from our lab
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
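The function above assumes every response contains at least one group and every venue has at least one category, so a single incomplete record stops the whole run. A minimal sketch of a more tolerant parsing step follows, assuming the same response shape that the function reads; `extract_venue_rows` is a hypothetical helper, and the second venue in the sample payload is invented to show the missing-category case.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Big Bricks",
                   "location": {"lat": 41.951417, "lng": -87.676863},
                   "categories": [{"name": "Pizza Place"}]}},
        # Hypothetical venue with no category, for illustration only.
        {"venue": {"name": "Uncategorized Example",
                   "location": {"lat": 41.95, "lng": -87.68},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows("North Center", 41.95, -87.68, sample_payload)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept deliberately before the one-hot encoding step.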


In [64]:
# code to run the above function on each Community and create a new dataframe called northside_venues.
northside_venues = getNearbyVenues(names=df_neighborhoods['Communities'],
                                   latitudes=df_neighborhoods['lat'],
                                   longitudes=df_neighborhoods['lon']
                                  )


North Center
Lake View
Lincoln Park
Avondale
Logan Square 
Armour Square
Douglas
Grand Boulevard
Kenwood
Hyde Park
Woodlawn
South Shore
Greater Grand Crossing


In [65]:
# Let's check the size of the resulting dataframe
print(northside_venues.shape)
northside_venues.head()


(460, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North Center,41.95,-87.68,Big Bricks,41.951417,-87.676863,Pizza Place
1,North Center,41.95,-87.68,Trader Joe's,41.949938,-87.6755,Grocery Store
2,North Center,41.95,-87.68,Wasabi Cafe Sushi & Sake,41.95265,-87.677794,Sushi Restaurant
3,North Center,41.95,-87.68,G and L Fire Escape,41.950342,-87.68346,Pub
4,North Center,41.95,-87.68,knit1,41.95113,-87.676616,Arts & Crafts Store


In [66]:
# Let's check how many venues were returned for each community
northside_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Armour Square,20,20,20,20,20,20
Avondale,29,29,29,29,29,29
Douglas,28,28,28,28,28,28
Grand Boulevard,16,16,16,16,16,16
Greater Grand Crossing,11,11,11,11,11,11
Hyde Park,66,66,66,66,66,66
Kenwood,14,14,14,14,14,14
Lake View,90,90,90,90,90,90
Lincoln Park,56,56,56,56,56,56
Logan Square,80,80,80,80,80,80


In [67]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(northside_venues['Venue Category'].unique())))

There are 158 unique categories.


In [68]:
# Let's Analyze Each Neighborhood (Community)
# one hot encoding
northside_onehot = pd.get_dummies(northside_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
northside_onehot['Neighborhood'] = northside_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [northside_onehot.columns[-1]] + list(northside_onehot.columns[:-1])
northside_onehot = northside_onehot[fixed_columns]

northside_onehot.head()


Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,North Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,North Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,North Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,North Center,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,North Center,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
# Let's examine the new dataframe size.
northside_onehot.shape

(460, 159)

In [70]:
# Let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category
northside_grouped = northside_onehot.groupby('Neighborhood').mean().reset_index()
northside_grouped


Unnamed: 0,Neighborhood,Accessories Store,African Restaurant,American Restaurant,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,BBQ Joint,Baby Store,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Armour Square,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Avondale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.034483,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0
2,Douglas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.035714,0.0,0.0
3,Grand Boulevard,0.0,0.0,0.0625,0.0,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Greater Grand Crossing,0.0,0.0,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Hyde Park,0.0,0.0,0.030303,0.0,0.015152,0.015152,0.015152,0.0,0.0,...,0.0,0.015152,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.015152
6,Kenwood,0.0,0.071429,0.0,0.0,0.071429,0.0,0.071429,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Lake View,0.0,0.0,0.022222,0.011111,0.0,0.0,0.0,0.011111,0.0,...,0.0,0.022222,0.0,0.011111,0.0,0.0,0.0,0.0,0.0,0.0
8,Lincoln Park,0.017857,0.0,0.017857,0.0,0.017857,0.017857,0.0,0.0,0.017857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017857,0.017857
9,Logan Square,0.0,0.0,0.0,0.0,0.025,0.0,0.0125,0.0,0.0,...,0.0125,0.0,0.0125,0.0,0.0,0.0125,0.0125,0.0,0.0,0.0
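To see why the mean of one-hot rows gives a frequency, here is a small worked example on made-up data (neighborhoods "A" and "B" are placeholders). Each venue contributes a single 1 in its category column, so the per-neighborhood mean is that category's share of the neighborhood's venues. `pd.crosstab` with `normalize="index"` is shown as an equivalent one-step route.

```python
import pandas as pd

# Toy data: four venues in neighborhood A, two in neighborhood B.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Pizza Place", "Park", "Park", "Park"],
})

# Same steps as the notebook: one-hot encode, then average per neighborhood.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

# Equivalent shortcut: row-normalized contingency table.
shares = pd.crosstab(toy["Neighborhood"], toy["Venue Category"], normalize="index")

print(freq)
print(shares)
```

Every row sums to 1, so neighborhoods with very different venue counts (11 versus 90 in this notebook) are compared on proportions, not raw totals.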


In [71]:
# Let's print each neighborhood (community) along with the top 5 most common venues
num_top_venues = 5

for hood in northside_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = northside_grouped[northside_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Armour Square----
              venue  freq
0  Baseball Stadium  0.25
1               Bar  0.20
2        Sports Bar  0.10
3            Lounge  0.10
4       Coffee Shop  0.05


----Avondale----
                venue  freq
0        Dance Studio  0.07
1  Chinese Restaurant  0.07
2          Food Truck  0.07
3           Pet Store  0.03
4             Brewery  0.03


----Douglas----
                  venue  freq
0  Fast Food Restaurant  0.14
1        Sandwich Place  0.07
2   Fried Chicken Joint  0.07
3           Pizza Place  0.04
4        Clothing Store  0.04


----Grand Boulevard----
            venue  freq
0     Art Gallery  0.12
1    Liquor Store  0.12
2       Jazz Club  0.06
3           Plaza  0.06
4  Breakfast Spot  0.06


----Greater Grand Crossing----
                 venue  freq
0  American Restaurant  0.18
1               Lounge  0.18
2  Fried Chicken Joint  0.18
3           Restaurant  0.09
4    Currency Exchange  0.09


----Hyde Park----
            venue  freq
0     Pizza Plac

In [72]:
# Let's put that into a pandas dataframe
# but first, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [73]:
# Let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = northside_grouped['Neighborhood']

for ind in np.arange(northside_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northside_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Armour Square,Baseball Stadium,Bar,Sports Bar,Lounge,American Restaurant,Plaza,Historic Site,Park,Coffee Shop,Bakery
1,Avondale,Dance Studio,Food Truck,Chinese Restaurant,Pet Store,Korean Restaurant,Electronics Store,Bus Station,Brewery,Soccer Field,Sandwich Place
2,Douglas,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,Coffee Shop,Pharmacy,Convenience Store,Pizza Place,Park,College Cafeteria,Historic Site
3,Grand Boulevard,Art Gallery,Liquor Store,Breakfast Spot,Jazz Club,Pizza Place,Performing Arts Venue,Caribbean Restaurant,Food,Burger Joint,Sporting Goods Shop
4,Greater Grand Crossing,Fried Chicken Joint,Lounge,American Restaurant,Restaurant,Park,Donut Shop,Currency Exchange,Intersection,Health Food Store,Dive Bar


In [74]:
# Cluster Neighborhoods
# Run k-means to cluster the neighborhood into 5 clusters.
# set number of clusters
kclusters = 5

northside_grouped_clustering = northside_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northside_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([3, 1, 1, 1, 4, 1, 1, 1, 1, 1])
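The choice of 5 clusters is fixed by hand here. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below runs on synthetic stand-in data, not the notebook's venue frequencies; in the notebook, `northside_grouped_clustering` would presumably take the place of `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three tight groups of five rows, four features each.
rng = np.random.default_rng(0)
centers = np.array([0.0, 5.0, 10.0])
X = np.vstack([c + rng.normal(scale=0.1, size=(5, 4)) for c in centers])

# Silhouette needs between 2 and n_samples - 1 clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

With only 13 communities the scores will be noisy, so this is better read as a rough guide than as a rule.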

In [75]:
# Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

northside_merged = df_neighborhoods

# merge df_neighborhoods with neighborhoods_venues_sorted to add cluster labels and top venues to each community
northside_merged = pd.merge(northside_merged, neighborhoods_venues_sorted, how='left', left_on=['Communities'], right_on=['Neighborhood'])

northside_merged.head() # check the last columns!


Unnamed: 0,District,Communities,Neighborhoods,lat,lon,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North Side,North Center,"Horner Park, Roscoe Village",41.95,-87.68,1,North Center,Boutique,Pizza Place,Bank,Arts & Crafts Store,Pub,Theater,Mobile Phone Shop,Chinese Restaurant,Liquor Store,Miscellaneous Shop
1,North Side,Lake View,"Boystown, Lake View East, Graceland West, Sout...",41.9435,-87.654167,1,Lake View,Gay Bar,Sandwich Place,Mexican Restaurant,Sports Bar,Pizza Place,Pub,Japanese Restaurant,Coffee Shop,New American Restaurant,Vegetarian / Vegan Restaurant
2,North Side,Lincoln Park,"Old Town Triangle, Park West, Ranch Triangle, ...",41.92,-87.65,1,Lincoln Park,Pizza Place,Boutique,Cosmetics Shop,Coffee Shop,Burger Joint,Gym,Breakfast Spot,Mexican Restaurant,Men's Store,New American Restaurant
3,North Side,Avondale,"Belmont Gardens, Chicago's Polish Village, Kos...",41.94,-87.71,1,Avondale,Dance Studio,Food Truck,Chinese Restaurant,Pet Store,Korean Restaurant,Electronics Store,Bus Station,Brewery,Soccer Field,Sandwich Place
4,North Side,Logan Square,"Belmont Gardens, Bucktown, Kosciuszko Park, Pa...",41.928333,-87.706667,1,Logan Square,Coffee Shop,Bar,Bus Station,Café,Food & Drink Shop,Pizza Place,Gym / Fitness Center,Park,Cocktail Bar,Bookstore


In [76]:
# Let's drop the duplicate column "Neighborhood"
northside_merged.drop(['Neighborhood'], axis=1, inplace=True)
northside_merged.shape

(13, 16)
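A left merge keeps all 13 communities even when one has no matching row on the right, in which case its `Cluster Labels` value becomes NaN and cannot be used as a list index when drawing the map. A small sketch of how to surface such rows with the `indicator` option of `pd.merge` follows; the frames are cut-down stand-ins, and "Example Community" is a hypothetical unmatched entry.

```python
import pandas as pd

# Cut-down stand-in for df_neighborhoods.
communities = pd.DataFrame(
    {"Communities": ["Lake View", "Avondale", "Example Community"]}
)

# Cut-down stand-in for neighborhoods_venues_sorted.
summary = pd.DataFrame({
    "Neighborhood": ["Lake View", "Avondale"],
    "Cluster Labels": [1, 1],
    "1st Most Common Venue": ["Gay Bar", "Dance Studio"],
})

# indicator=True adds a "_merge" column saying where each row came from.
checked = pd.merge(communities, summary, how="left",
                   left_on="Communities", right_on="Neighborhood",
                   indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Communities"].tolist()
print(checked)
print("No venue data for:", unmatched)
```

An empty `unmatched` list means every community received a cluster label.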

In [77]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(northside_merged['lat'], northside_merged['lon'], northside_merged['Communities'], northside_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


## Methodology <a name="methodology"></a>

1) Understand the problem you are trying to solve - I am trying to find the best location for a business.<br>
2) Understanding the requirements and approach.<br>
3) Acquiring the data - since the required data was not readily available, I created the data from Wikipedia.<br>
4) Exploring the data or data wrangling - since I created exactly what I needed, there was no need for data exploration<br>
5) Analysing the data<br>
6) Gathering the result<br>
7) Discussing the results with the stakeholders (getting feedback)<br>

P.S. These are all iterative steps; you can rerun from any step, or even from the beginning, if you are in doubt about the results.

## Results <a name="results"></a>
Cluster 1 includes 9 communities. Clusters 0, 2, 3 & 4 cover the remaining 4 communities, one community per cluster, which shows the uniqueness of these communities.
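The counts above can be tabulated directly from the labels. The sketch below uses only labels that are visible in this notebook: the ten values printed by `kmeans.labels_[0:10]`, paired with the alphabetical community order of `northside_grouped`, plus North Center's label from the merged table. Woodlawn and South Shore are left out because their labels are not printed.

```python
import pandas as pd

known = pd.DataFrame({
    "Communities": [
        "Armour Square", "Avondale", "Douglas", "Grand Boulevard",
        "Greater Grand Crossing", "Hyde Park", "Kenwood", "Lake View",
        "Lincoln Park", "Logan Square", "North Center",
    ],
    # First ten values from kmeans.labels_[0:10]; the last from the merged table.
    "Cluster Labels": [3, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1],
})

# Number of communities in each cluster, and which ones they are.
sizes = known["Cluster Labels"].value_counts().sort_index()
members = known.groupby("Cluster Labels")["Communities"].apply(list)

print(sizes)
print(members)
```

Running the same two lines on `northside_merged` would cover all 13 communities.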

## Discussion <a name="discussion"></a>

For the 1st iteration of this project I used only the North Side data. Unfortunately, that was exactly 5 communities, so there was not sufficient data to compare. I then added the South Side data to my current file and reran the analysis. Even with the addition of the South Side data, the data set still seems insufficient. I may have to rerun my analysis based on individual neighborhoods rather than the collective communities.

## Conclusion <a name="conclusion"></a>
Using Jupyter notebooks was quite helpful. Since everything was already set up from the first pass, I simply had to rerun with the updated data file. In my 2nd iteration, 4 communities each ended up alone in their own cluster, so I would like to get more data and rerun the process to verify that these communities are really that unique. I would also like to incorporate some financial information, such as the average income of the residents living in the area, house prices and competitors' value, and to add transportation locations. Basically, the results are inconclusive. Because of time constraints I will submit as is. However, I think I really need to get the latitudes & longitudes of the neighborhoods and not use the communities column, since it seems too broad.
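A possible starting point for that neighborhood-level rerun is sketched below, under a few assumptions: the `Neighborhoods` column is comma-separated as in the data file, and the geocoder is passed in as a plain callable so the example runs offline. `explode_neighborhoods`, `add_coordinates` and `fake_geocode` are hypothetical names, and the coordinates returned by the fake geocoder are placeholders, not real lookups.

```python
from types import SimpleNamespace

import pandas as pd


def explode_neighborhoods(df):
    """Return one row per neighborhood instead of one row per community."""
    out = df.assign(Neighborhood=df["Neighborhoods"].str.split(","))
    out = out.explode("Neighborhood")
    out["Neighborhood"] = out["Neighborhood"].str.strip()
    # Drop the community-level coordinates; they are too coarse here.
    return out.drop(columns=["Neighborhoods", "lat", "lon"]).reset_index(drop=True)


def add_coordinates(df, geocode):
    """Look up each neighborhood; geocode(query) returns an object with
    .latitude and .longitude, or None when nothing is found."""
    lats, lons = [], []
    for hood in df["Neighborhood"]:
        location = geocode("{}, Chicago, IL".format(hood))
        lats.append(location.latitude if location else None)
        lons.append(location.longitude if location else None)
    return df.assign(lat=lats, lon=lons)


def fake_geocode(query):
    # Offline stand-in for a real geocoder; placeholder coordinates only.
    if query == "Chinatown, Chicago, IL":
        return SimpleNamespace(latitude=41.85, longitude=-87.63)
    return None


communities = pd.DataFrame({
    "Communities": ["Armour Square", "Kenwood"],
    "Neighborhoods": ["Chinatown, Wentworth Gardens", "Kenwood, South Kenwood"],
    "lat": [41.833333, 41.81],
    "lon": [-87.633333, -87.6],
})

per_hood = add_coordinates(explode_neighborhoods(communities), fake_geocode)
print(per_hood)
```

In the notebook, `geolocator.geocode` from the earlier Nominatim cell could be passed as `geocode`, ideally wrapped in geopy's `RateLimiter` to space out requests. Neighborhoods that come back with missing coordinates would need to be filled in manually.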