<a href="https://colab.research.google.com/github/tomthomas/Coursera_Capstone/blob/main/Capstone_BOTN_TomT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The United States consistently ranks among the countries with the highest obesity rates. A heavily commercialized fast-food industry makes it difficult for the average American to stay healthy, especially in large cities, where advertising is more prominent. This report explores how someone living in a large metropolitan city can resist those temptations and remain healthy.

This analysis focuses on two factors that matter to anyone trying to stay healthy: **fast-food restaurants** and **fitness centers**. We will analyze the data for patterns between these two venue types. Are there more fitness centers where there are more fast-food restaurants, or is it the opposite? And is there anything that can be done to reduce the temptation to visit a fast-food chain?


## Data <a name="data"></a>

For a thorough analysis, we require the following data:


*   **List of Neighborhoods in a Metropolitan City** - We have chosen **Los Angeles** for this project, a large city with many fast-food and fitness options, which gives us more venues to observe. We will collect this information from the LA Times neighborhood list.
*   **Department of Public Health Data for LA County** - This dataset will be used to evaluate the correlation between obesity rates and the two main factors.
*   **Geospatial Data from Geocoding Packages** - We will use geocoding packages (geopy and OpenCage) to obtain the required latitudes and longitudes for LA County.
*   **GeoJSON Data for LA County** - We will use this data to create choropleth maps of Los Angeles.
*   **Foursquare Venue Data** - We will use the Foursquare API to collect venue information for the different neighborhoods.


In [None]:
# Importing required libraries
# Install necessary dependencies if you haven't already
!pip install opencage

import pandas as pd
import json
import requests
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim
from opencage.geocoder import OpenCageGeocode
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
from sklearn.cluster import KMeans

print("Import complete!")

In [2]:
# Remove the limit on displayed rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Getting Neighborhood Data

The first data we need to import into our project is the **List of Neighborhoods in a Metropolitan City**.

Data Source: https://maps.latimes.com/neighborhoods/neighborhood/list/

This data lets us group the neighborhoods of LA into Regions. We will use the BeautifulSoup package to extract the list from the LA Times website.

BeautifulSoup parses the page's HTML so that we can pull out the table element, which pandas then converts into a DataFrame.


In [3]:
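from io import StringIO

# Create BeautifulSoup object
page = requests.get('https://maps.latimes.com/neighborhoods/neighborhood/list/').text
soup = BeautifulSoup(page,'html5lib')
tbl = str(soup.table)

# Convert to table (recent pandas versions expect a file-like object rather than a literal HTML string)
list1 = pd.read_html(StringIO(tbl))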
la_list = list1[0]
la_list.rename(columns={'Name':'Neighborhood'}, inplace =True)

la_list.head()

Unnamed: 0,Neighborhood,Region
0,Acton,Antelope Valley
1,Adams-Normandie,South L.A.
2,Agoura Hills,Santa Monica Mountains
3,Agua Dulce,Northwest County
4,Alhambra,San Gabriel Valley


In [9]:
# la_list = pd.read_csv("/content/la.csv", encoding= "latin_1")
# la_list.rename(columns={'Name':'Neighborhood'}, inplace =True)

### Getting Public Health Data

Next up is the data that could be key to our analysis. 
We are going to import the obesity data for various neighborhoods of LA.

Data source: http://publichealth.lacounty.gov/

We are going to get two types of indicators for our analysis:

1.   The Percentage of Adults in LA County Meeting Recommended Guidelines for Physical Activity
2.   The Percentage of Adults in LA County Who are Obese

The data is stored as .xlsx files which we will import and clean so that they can be ready for analysis. 



In [4]:
# Read Excel data
fit = pd.read_excel("https://github.com/tomthomas/Coursera_Capstone/blob/main/Data/PercentMeetingPAGuidelines.xlsx?raw=true", skiprows=4, usecols=lambda x: 'Unnamed' not in x)
obs = pd.read_excel("https://github.com/tomthomas/Coursera_Capstone/blob/main/Data/PercentObeseAdults.xlsx?raw=true", skiprows=4, usecols=lambda x: 'Unnamed' not in x)

# Drop unnecessary columns
fit.drop(fit.columns[[-1,-2]], axis = 1,inplace=True)
obs.drop(obs.columns[[-1,-2]], axis = 1,inplace=True)
fit.rename(columns={'Percentage':'Healthy Adult Percentage'}, inplace =True)
obs.rename(columns={'Percent':'Obesity Rate'}, inplace =True)
health_data = pd.merge(fit, obs, on = 'City/Community')

health_data.head()

Unnamed: 0,City/Community,Healthy Adult Percentage,Obesity Rate
0,Alhambra,0.27313,0.135779
1,Altadena,0.348242,0.244131
2,Arcadia,0.266197,0.057046
3,Azusa,0.375943,0.260744
4,Baldwin Park,0.316393,0.265167


### Getting GeoJSON Data

A GeoJSON file is a JSON file that, once loaded, is a dictionary holding the coordinates of the polygons that form the boundaries of geographical areas. In our case, these are the neighborhoods of LA County. This data was sourced from the UCLA GeoPortal.

Data Source: https://apps.gis.ucla.edu/geodata/dataset/93d71e41-6196-4ecb-9ddd-15f1a4a7630c/resource/6cde4e9e-307c-477d-9089-cae9484c8bc1/download/la-county-neighborhoods-v6.geojson

This data becomes more informative once we combine it visually with the Obesity Rate and the Healthy Adult Percentage.
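To make the structure concrete, here is a minimal sketch of a GeoJSON `FeatureCollection` loaded into a Python dictionary. The single feature, its coordinates, and its property values are made up for illustration; only the key layout (`features`, `properties`, `name`, `metadata`, `region`) mirrors the keys this notebook reads from the LA County file.

```python
import json

# Hypothetical one-feature GeoJSON document (illustrative values only)
sample_geojson = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Example Town", "metadata": {"region": "example-region"}},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-118.3, 34.0], [-118.2, 34.0], [-118.2, 34.1], [-118.3, 34.1], [-118.3, 34.0]]]
      }
    }
  ]
}
"""

sample = json.loads(sample_geojson)

# Each feature pairs a properties dict with a geometry (here one closed polygon ring)
for feature in sample["features"]:
    name = feature["properties"]["name"]
    region = feature["properties"]["metadata"]["region"]
    ring = feature["geometry"]["coordinates"][0]
    print(name, region, len(ring))
```

Note that GeoJSON stores each coordinate as `[longitude, latitude]`, the reverse of the `[latitude, longitude]` order Folium expects for markers.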

In [5]:
# Access and Store GeoJSON file
! wget --quiet https://apps.gis.ucla.edu/geodata/dataset/93d71e41-6196-4ecb-9ddd-15f1a4a7630c/resource/6cde4e9e-307c-477d-9089-cae9484c8bc1/download/la-county-neighborhoods-v6.geojson
    
# Create an object
world_geo = r'la-county-neighborhoods-v6.geojson'
# Load into a Dictionary
with open(world_geo) as data:
    geo = json.load(data)

Now that we have that GeoJSON file loaded, let's take a look at it.

In [196]:
# Run to view GeoJSON
# geo 

We can see that the dict also contains the name and region of every neighborhood in LA County. Let's extract that data into a DataFrame as well. This will be useful later on.

In [None]:
# Find all the Neighborhoods and Regions in the data
la_geo_name = []
la_geo_region = []

for feature in geo['features']:
    print (feature['properties']['name'])
    la_geo_name.append(feature['properties']['name'])
    la_geo_region.append(feature['properties']['metadata']['region'])

# Create a DataFrame
la_geo = pd.DataFrame({'Neighborhood':la_geo_name,
                       'Region':la_geo_region})    

### Merging Data

Combining the Neighborhood Data with the Public Health Data will give us a better view of how the data is spread across the different Regions. 

In [7]:
# Let's take a look at the shape of each dataframe

# Neighborhood Data
print("There are ",la_list['Neighborhood'].count(),"Neighborhoods in the Neighborhood Dataframe")
# Public Health Data
print("There are ",health_data['City/Community'].count(),"Cities/Communities in the Public Health Dataframe")


There are  272 Neighborhoods in the Neighborhood Dataframe
There are  87 Cities/Communities in the Public Health Dataframe


Los Angeles seems to be a bit of an anomaly when it comes to the total number of officially designated communities.

https://en.wikipedia.org/wiki/List_of_cities_in_Los_Angeles_County,_California states that there are 88 cities, which seems more in line with the 87 entries in our Public Health dataframe.

https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Los_Angeles_County,_California states that there are 76 unincorporated communities, each with a small population. We can drop these rows, since they belong to the greater county rather than the city of LA and may stick out as outliers in our analysis.

To keep things simple, we will use the 87 Cities/Communities found in the Public Health dataset and designate a Region for each of them.

In [8]:
# Rename health_data column for merge
health_data = health_data.rename(columns={'City/Community':'Neighborhood'})

health_data.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate
0,Alhambra,0.27313,0.135779
1,Altadena,0.348242,0.244131
2,Arcadia,0.266197,0.057046
3,Azusa,0.375943,0.260744
4,Baldwin Park,0.316393,0.265167


In [9]:
# Merge Neighborhood data and Public Health Data
la_merged = pd.merge(health_data, la_list,on='Neighborhood', how= 'left')

la_merged.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Alhambra,0.27313,0.135779,San Gabriel Valley
1,Altadena,0.348242,0.244131,Verdugos
2,Arcadia,0.266197,0.057046,San Gabriel Valley
3,Azusa,0.375943,0.260744,San Gabriel Valley
4,Baldwin Park,0.316393,0.265167,San Gabriel Valley


In [10]:
# Let's take a look at the count for each region in the merged Dataframe
la_merged['Region'].value_counts()

San Gabriel Valley        22
Southeast                 18
South Bay                  8
Westside                   3
Verdugos                   3
Pomona Valley              3
Harbor                     3
San Fernando Valley        2
Antelope Valley            2
Santa Monica Mountains     1
South L.A.                 1
Northwest County           1
Central L.A.               1
Eastside                   1
Name: Region, dtype: int64

After the left join, several rows have NaN in the Region column, either because the Neighborhood is unknown to the LA Times list or because the row is a Council District. Let's extract them to see which ones they are.

In [11]:

la_nan = la_merged[la_merged.isna().any(axis=1)]

# Reset Index
la_nan = la_nan.reset_index(drop=True)
la_nan

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Florence-Graham,0.298663,0.312639,
1,Los Angeles Council District 1,0.314527,0.21641,
2,Los Angeles Council District 2,0.386215,0.207762,
3,Los Angeles Council District 3,0.365171,0.20308,
4,Los Angeles Council District 4,0.417725,0.15148,
5,Los Angeles Council District 5,0.423018,0.102214,
6,Los Angeles Council District 6,0.329751,0.2296,
7,Los Angeles Council District 7,0.356249,0.286106,
8,Los Angeles Council District 8,0.356369,0.318689,
9,Los Angeles Council District 9,0.322849,0.325907,


In [12]:
# Renaming the Council Districts to help with getting geospatial data
# .loc assigns in place and avoids the chained-assignment SettingWithCopyWarning
la_nan.loc[la_nan.index[:17], 'Neighborhood'] = ['Florence','Cypress Park','Valley Village','Tarzana','Hollywood Hills','Bel-Air','Panorama City','Pacoima','Hyde Park',
                                    'Vermont Square','Gramercy Park','Pacific Palisades','Granada Hills','Silver Lake','Downtown','San Pedro','City of Los Angeles']
la_nan.reset_index(drop=True,inplace=True)


In [13]:
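# Re-merge Missing Value data to main DataFrame
# Obesity Rate serves as a temporary key because the rename left it unchanged; this assumes its values are unique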

la_merged = la_merged.set_index('Obesity Rate')
la_nan = la_nan.set_index('Obesity Rate')
la_merged.update(la_nan)
la_merged.reset_index(inplace=True)

# Move Neighborhood back to the front
col = la_merged.pop('Neighborhood')
la_merged.insert(0, 'Neighborhood', col)

Our main DataFrame should have the updated Neighborhoods now.

In [14]:
la_merged.head()

Unnamed: 0,Neighborhood,Obesity Rate,Healthy Adult Percentage,Region
0,Alhambra,0.135779,0.27313,San Gabriel Valley
1,Altadena,0.244131,0.348242,Verdugos
2,Arcadia,0.057046,0.266197,San Gabriel Valley
3,Azusa,0.260744,0.375943,San Gabriel Valley
4,Baldwin Park,0.265167,0.316393,San Gabriel Valley


Let's quickly check if the datatypes of our DataFrame are all in order.

In [17]:
la_merged.dtypes

Neighborhood                 object
Obesity Rate                float64
Healthy Adult Percentage    float64
Region                       object
dtype: object

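The two rate columns need to be numeric. The output above already reports `float64` because, as the cell numbers show, it was captured after the conversion below had been run. Running the conversion explicitly makes sure the types are right.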

In [16]:
la_merged['Obesity Rate'] = pd.to_numeric(la_merged['Obesity Rate'],errors = 'coerce')
la_merged['Healthy Adult Percentage'] = pd.to_numeric(la_merged['Healthy Adult Percentage'],errors = 'coerce')
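A minimal sketch of what `pd.to_numeric(..., errors='coerce')` does, using a made-up column that arrived as strings: parseable entries become floats, and anything else becomes NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical column stored as strings, including one non-numeric entry
raw = pd.Series(["0.135779", "0.244131", "n/a"], dtype="object")

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

print(raw.dtype, "->", clean.dtype)
print(clean.isna().sum(), "value(s) could not be parsed")
```

Because coercion silently produces NaN, it is worth counting the missing values afterwards, as the last line does.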

Earlier, we noted that extracting the Neighborhoods and Regions from the GeoJSON data would be useful. 

Here, we can utilize that dataset to fill the missing regions in our main DataFrame.

In [18]:
# Merge both DataFrames
# By doing so, we will miss out on two rows which include Obesity Rate and Healthy Adult Percentage for City of LA and LA County. This is fine; we will include them
# later in our analysis. Having them here now would only display inaccurately in the maps.
la_merged1 = pd.merge(la_merged, la_geo, on = 'Neighborhood')

# Drop old Region Column
la_merged1.drop(['Region_x'], axis= 1, inplace=True)
la_merged1.rename(columns={"Region_y":"Region"},inplace=True)

# Tidy up the Region names
la_merged1['Region'] = la_merged1['Region'].str.replace('-',' ')
la_merged1['Region'] = la_merged1['Region'].str.title()
# Restore the "LA" abbreviation; \b limits the match to the standalone word "La"
la_merged1['Region'] = la_merged1['Region'].replace(r'\bLa\b', 'LA', regex=True)

In [19]:
la_merged1

Unnamed: 0,Neighborhood,Obesity Rate,Healthy Adult Percentage,Region
0,Alhambra,0.135779,0.27313,San Gabriel Valley
1,Altadena,0.244131,0.348242,Verdugos
2,Arcadia,0.057046,0.266197,San Gabriel Valley
3,Azusa,0.260744,0.375943,San Gabriel Valley
4,Baldwin Park,0.265167,0.316393,San Gabriel Valley
5,Bell,0.257832,0.310867,Southeast
6,Bell Gardens,0.286023,0.285274,Southeast
7,Bellflower,0.246875,0.343177,Southeast
8,Beverly Hills,,0.460141,Westside
9,Burbank,0.174591,0.379088,San Fernando Valley


## Methodology <a name="methodology"></a>

This section describes the methods used to approach this project. 
The aim of this project is to identify the fitness centers best suited to helping an adult in LA County stay healthy.

This goal is based on **2 main assumptions**:

1.   Visiting a fitness center that is surrounded by a larger number of fast-food restaurants **reduces the motivation** of the participant to return to the fitness center OR **increases the probability** of visiting a fast-food restaurant after a workout.
2.   Neighborhoods with a larger number of fast-food restaurants have a **higher obesity rate**.

To analyze these assumptions and to identify the ideal fitness center location, we will use the following methods:

*   Using the cleaned neighborhood data, import geospatial data to obtain latitude and longitude values for each Neighborhood.
*   Acquire venue information for each Neighborhood from the Foursquare API.
*   Analyze the venue information using visualization and clustering to look for evidence that supports or contradicts our assumptions.
*   Identify which fitness centers are the best candidates for someone who is training and trying to maintain a healthy diet.
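One way the second assumption could be checked, sketched here under stated assumptions, is a Pearson correlation between the number of fast-food restaurants and the obesity rate per neighborhood. The numbers below are invented purely to show the calculation; they are not results from this project.

```python
import numpy as np

# Hypothetical per-neighborhood values, for illustration only (not real data)
fast_food_count = np.array([2, 5, 8, 11, 14])
obesity_rate = np.array([0.10, 0.16, 0.19, 0.27, 0.30])

# Pearson correlation coefficient: +1 is a perfect positive linear relationship,
# 0 means no linear relationship, -1 a perfect negative one
r = np.corrcoef(fast_food_count, obesity_rate)[0, 1]
print(round(r, 3))
```

A high coefficient would be consistent with the assumption but would not by itself show that fast-food density causes obesity.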





In [294]:
# Let's do a re-count for each region in the merged Dataframe
la_merged1['Region'].value_counts()


#The data seems to be more spread out now

San Gabriel Valley        22
Southeast                 18
South Bay                  8
San Fernando Valley        7
South LA                   5
Westside                   5
Central LA                 4
Harbor                     4
Verdugos                   3
Pomona Valley              3
Antelope Valley            2
Santa Monica Mountains     1
Eastside                   1
Northwest County           1
Northeast LA               1
Name: Region, dtype: int64

### Getting Geospatial Data

We're going to use Nominatim from the geopy package to get the latitude and longitude of Los Angeles, so that we can center our map visualizations on the city.

In [31]:
address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="la_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.242766.


We also need the lat,long values for each Neighborhood in our created DataFrame.

In [21]:
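import os

Neigh_lat = []
Neigh_lng = []
Neighborhood = la_merged1['Neighborhood']

# Read the OpenCage API key from an environment variable rather than hard-coding it.
# The variable name OPENCAGE_API_KEY is an assumption; use whichever name you have set.
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)

# Geocode each neighborhood and keep the first match returned
for name in Neighborhood:
    address = '{}, Los Angeles, CA'.format(name)
    location = geocoder.geocode(address)
    Neigh_lat.append(location[0]['geometry']['lat'])
    Neigh_lng.append(location[0]['geometry']['lng'])

print(Neigh_lat, Neigh_lng)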


[34.093042, 34.1863161, 34.1362075, 34.1338751, 34.0854739, 33.9747806, 33.9694561, 33.8825705, 34.0696501, 34.1816482, 34.1446643, 33.8322043, 33.8644291, 34.0966764, 33.894927, 34.0877926, 33.9620584, 34.0211224, 34.0286226, 33.942215, 34.0239015, 34.0751571, 33.9741588, 33.8963593, 34.1469416, 34.1361187, 33.9930677, 33.9188589, 33.9827043, 33.9562003, 33.9060971, 34.01979, 34.1008426, 33.8503463, 34.6981064, 33.8885217, 33.7690164, 34.0922322, 34.1637147, 34.1688075, 34.1311792, 34.0827278, 34.2242902, 34.2625025, 33.9805691, 34.0019449, 33.9511938, 34.0480643, 34.2661558, 34.0896518, 34.0428494, 33.7358518, 33.924831, 33.8915985, 33.9866807, 34.1483499, 34.0159398, 34.051522, 33.9092802, 34.5793131, 33.898917, 34.1476452, 33.9830688, 34.0553813, 33.7483311, 33.8455911, 34.0676169, 33.9761238, 34.1066756, 34.28497, 34.0990733, 34.3916641, 34.0194704, 33.9463456, 34.1133062, 33.9366769, 34.1082994, 33.8358492, 34.0384785, 34.0202894, 34.0686208, 34.0923014, 34.123985, 33.9414035, 33
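The loop above indexes `location[0]`, which assumes the geocoder found at least one match for every address. A defensive variant is sketched below with a hypothetical helper, `first_lat_lng`, exercised on hand-written results shaped like the list of dictionaries the loop reads. No real geocoder is called.

```python
def first_lat_lng(results):
    """Return (lat, lng) from the first geocoder result, or (None, None) if there is none."""
    if not results:
        return None, None
    geometry = results[0]["geometry"]
    return geometry["lat"], geometry["lng"]


# Hypothetical responses: one with a match, one empty
found = [{"geometry": {"lat": 34.093042, "lng": -118.12706}}]
not_found = []

print(first_lat_lng(found))
print(first_lat_lng(not_found))
```

Storing `None` for unmatched addresses keeps the latitude and longitude lists the same length as the DataFrame, so the missing rows can be inspected later instead of stopping the loop.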

Add the Neighborhood location data to the main DataFrame.

In [22]:
la_merged1['Latitude'] = Neigh_lat
la_merged1['Longitude'] = Neigh_lng

la_merged1.head()

Unnamed: 0,Neighborhood,Obesity Rate,Healthy Adult Percentage,Region,Latitude,Longitude
0,Alhambra,0.135779,0.27313,San Gabriel Valley,34.093042,-118.12706
1,Altadena,0.244131,0.348242,Verdugos,34.186316,-118.135233
2,Arcadia,0.057046,0.266197,San Gabriel Valley,34.136207,-118.04015
3,Azusa,0.260744,0.375943,San Gabriel Valley,34.133875,-117.905605
4,Baldwin Park,0.265167,0.316393,San Gabriel Valley,34.085474,-117.961176


### Plot Map of Los Angeles

Now, we will use the geospatial coordinates we've gathered and the Neighborhood information to draw a map.

To achieve this, we will use the Folium library.


In [27]:
# Create map of Los Angeles using latitude and longitude values
map_la = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.TileLayer('CartoDBpositron').add_to(map_la)

# Add markers to map
for lat, lng, region, neighborhood in zip(la_merged1['Latitude'], la_merged1['Longitude'], la_merged1['Region'], la_merged1['Neighborhood']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

The map shows some outlying points that appear to have been geocoded incorrectly. Seeing them on a map helped us identify them. We will remove the regions they belong to in order to maintain accuracy.
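As an alternative to spotting such points by eye, a bounding-box check can flag coordinates that fall far outside the study area. This is only a sketch: the box limits are rough assumptions for LA County, and the third row is a deliberately wrong, made-up example. The notebook itself filters by region instead.

```python
import pandas as pd

# Rough bounding box around LA County; these limits are approximate assumptions
LAT_MIN, LAT_MAX = 33.7, 34.85
LNG_MIN, LNG_MAX = -118.95, -117.6

# Hypothetical geocoding results; the last row is deliberately misplaced
points = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Azusa", "Misplaced Example"],
    "Latitude": [34.093042, 34.133875, 40.7128],
    "Longitude": [-118.12706, -117.905605, -74.0060],
})

# True where both coordinates fall inside the box
inside = (
    points["Latitude"].between(LAT_MIN, LAT_MAX)
    & points["Longitude"].between(LNG_MIN, LNG_MAX)
)

# Rows outside the box are candidates for manual review or removal
flagged = points.loc[~inside, "Neighborhood"].tolist()
print(flagged)
```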

In [25]:
la_merged1['Region'].value_counts()

San Gabriel Valley        22
Southeast                 18
South Bay                  8
San Fernando Valley        7
Westside                   5
South LA                   5
Central LA                 4
Harbor                     4
Verdugos                   3
Pomona Valley              3
Antelope Valley            2
Northeast LA               1
Santa Monica Mountains     1
Northwest County           1
Eastside                   1
Name: Region, dtype: int64

In [26]:
# Drop the regions that contain the outlying points
excluded_regions = ['Antelope Valley', 'Santa Monica Mountains', 'Northwest County', 'San Fernando Valley', 'Pomona Valley']
la_merged1 = la_merged1[~la_merged1['Region'].isin(excluded_regions)]

Re-run the folium code to get an updated map.

The map is looking beautiful!

Now let's move onto Foursquare data.

### Getting Foursquare Venue Data


The Foursquare API provides user-generated data about public points of interest, which is very useful for an analysis like ours.

Using a Foursquare account, we will extract the venue data for each Neighborhood and then see what insights can be gathered from it.

Let's set up the required Foursquare client credentials first.

In [28]:
import os

# Read the credentials from environment variables instead of hard-coding them.
# The variable names below are assumptions; use whichever names you have set.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

Credentials loaded.


There are two options for getting the venue data. One is lighter on resources and the other is more intensive.

**Method 1** imports a limited number of venues of all types from the API and then keeps just the ones we need (Fast Food and Fitness). This is more resource friendly, but it returns fewer of the venues we are interested in.

**Method 2** requests only Fast Food and Fitness Center venues, using the Category IDs that the API defines. This returns more venues, and more specific ones. It is more resource heavy and therefore takes longer. It is useful when exploring each area of Los Angeles in detail.

*Note: Be aware of how much venue data you pull from the server. Foursquare limits how much venue data you can request for free within a certain period of time.*

#### Getting Foursquare Venue Data - Method 1


In [33]:
LIMIT = 200 # Limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


Convert our acquired data blob into a DataFrame.

In [None]:
la_venues = getNearbyVenues(names=la_merged1['Neighborhood'],
                                   latitudes=la_merged1['Latitude'],
                                   longitudes=la_merged1['Longitude']
                                  )

Let's go ahead and remove every category except Fast Food Restaurants and Gyms/Fitness Centers.

In [36]:
# .copy() gives us an independent DataFrame, so the replacement below does not trigger SettingWithCopyWarning
la_health = la_venues[la_venues['Venue Category'].isin(['Fast Food Restaurant', 'Gym / Fitness Center', 'Gym'])].copy()
# Treat 'Gym / Fitness Center' and 'Gym' as a single category
la_health['Venue Category'] = la_health['Venue Category'].replace({'Gym / Fitness Center': 'Gym'})
print(la_health.shape)
la_health.reset_index(drop= True, inplace=True)
la_health.head()

(382, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alhambra,34.093042,-118.12706,In-N-Out Burger,34.106211,-118.134465,Fast Food Restaurant
1,Alhambra,34.093042,-118.12706,Planet Fitness,34.07742,-118.116117,Gym
2,Altadena,34.186316,-118.135233,24 Hour Fitness,34.183503,-118.159222,Gym
3,Altadena,34.186316,-118.135233,Equinox Pasadena,34.145032,-118.145536,Gym
4,Altadena,34.186316,-118.135233,"iLoveKickboxing - Pasadena, CA",34.145764,-118.114105,Gym


Let's take a look at how many venues of each type were returned.

In [37]:
health_stat = la_health.groupby(by=['Venue Category'])
health_stat.count()

Venue Category,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Fast Food Restaurant,271,271,271,271,271,271
Gym,111,111,111,111,111,111


So, we have 111 Fitness Centers and 271 Fast Food Restaurants in the area (the counts will vary with each GET request). 
With more than twice as many Fast Food Restaurants as Fitness Centers in this sample, fast food is simply easier to come across. Let's keep exploring.
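The overall totals hide how the two venue types are distributed across neighborhoods. As a sketch of a per-neighborhood breakdown, the example below counts categories with `pd.crosstab` on a few made-up rows shaped like `la_health`. The column `Gym per Fast Food` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Hypothetical venue rows in the same shape as la_health (illustrative only)
venues = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Alhambra", "Altadena", "Altadena", "Altadena"],
    "Venue Category": ["Fast Food Restaurant", "Gym", "Gym", "Gym", "Fast Food Restaurant"],
})

# Count venues of each category per neighborhood
counts = pd.crosstab(venues["Neighborhood"], venues["Venue Category"])

# Gyms per fast-food restaurant; higher values mean relatively more gyms
counts["Gym per Fast Food"] = counts["Gym"] / counts["Fast Food Restaurant"]
print(counts)
```

On real data, a neighborhood with no fast-food restaurants would produce a division by zero, so such rows would need separate handling.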

While we are at it, let's split the data in two for ease of display when it comes to mapping.

In [41]:
la_fit = la_health[(la_health['Venue Category'] != 'Fast Food Restaurant')]
la_fat = la_health[(la_health['Venue Category'] == 'Fast Food Restaurant')]

print(la_fit.shape)
print(la_fat.shape)

(111, 7)
(271, 7)


#### Getting Foursquare Venue Data - Method 2


Here we request all the venues in LA that are categorized as a Fast Food restaurant or a Fitness Center, using the Category IDs Foursquare provides.

Data Source: https://developer.foursquare.com/docs/build-with-foursquare/categories

Clearly, this would be a large number of venues and would fill the entire map. So, for ease of analysis, we will limit the number of venues returned for each neighborhood to 200. This will also allow us to see which neighborhoods have a higher number of fast food restaurants.
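When several Category IDs are sent in one request, they need to travel as a single `categoryId` value. The sketch below builds such a query string with the standard library, joining the IDs with commas. `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the comma-separated format is an assumption to confirm against the Foursquare documentation.

```python
from urllib.parse import urlencode


def build_explore_url(category_ids, lat, lng, radius, limit, client_id, client_secret, version):
    """Build a venues/explore URL with the category IDs joined by commas."""
    params = {
        "categoryId": ",".join(category_ids),
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the values, so the commas appear as %2C in the URL
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Two of the category IDs used in this notebook; the credentials are placeholders
url = build_explore_url(
    ["4bf58dd8d48988d16e941735", "4bf58dd8d48988d16a941735"],
    34.0536909, -118.242766, 1000, 200,
    "PLACEHOLDER_ID", "PLACEHOLDER_SECRET", "20180605",
)
print(url)
```

Joining the IDs with `&` instead would end the `categoryId` parameter after the first ID and turn the remaining IDs into separate, valueless parameters.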

In [155]:
LIMIT = 200 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL (cat_id is read from the global scope, so set it before calling)
        url = 'https://api.foursquare.com/v2/venues/explore?categoryId={}&intent=browse&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            cat_id,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)


In [None]:
# Create a new DataFrame to store the API values for Fast Food using Venue Categories:
# Fast Food Restaurant, Bakery, Bubble Tea Shop, Dessert Shop, Donut Shop, Fried Chicken Joint, Hot Dog Joint
# The IDs are joined with commas; joining them with '&' would split the query string after the first ID
cat_id = '4bf58dd8d48988d16e941735,4bf58dd8d48988d16a941735,52e81612bcbc57f1066b7a0c,4bf58dd8d48988d1d0941735,4bf58dd8d48988d148941735,4d4ae6fc7a7b7dea34424761,4bf58dd8d48988d16f941735'
# Use la_merged1, the DataFrame that holds the Latitude and Longitude columns
la_fatness = getNearbyVenues(names=la_merged1['Neighborhood'],
                                   latitudes=la_merged1['Latitude'],
                                   longitudes=la_merged1['Longitude']
                                  )

In [None]:
#Lets check the shape of the created DataFrame
print(la_fatness.shape)
la_fatness.head()

Now, let's repeat the same process for Fitness Centers in L.A.

In [None]:
# Create a new DataFrame to store the API values for Fitness Centers using Venue Category Athletics & Sports
cat_id = '4f4528bc4b90abdf24c9de85'
# Use la_merged1, the DataFrame that holds the Latitude and Longitude columns
la_fitness = getNearbyVenues(names=la_merged1['Neighborhood'],
                                   latitudes=la_merged1['Latitude'],
                                   longitudes=la_merged1['Longitude']
                                  )

In [None]:
print(la_fitness.shape)
la_fitness.head()

### Mapping Data


Here, we can use the DataFrames from either Method 1 or Method 2 as the source for Folium, and the map will be updated accordingly.

Let's go with the simpler Method 1 for now (the DataFrames are la_fit | la_fat).

For Method 2, the DataFrames would be la_fatness | la_fitness.

Let's redraw the map with just our two venue categories.

In [43]:

map_la = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.TileLayer('CartoDBpositron').add_to(map_la)
# Loop through the Fitness Centers in LA to display as blue points of interest
for lat, lng, vname, vcat in zip(la_fit['Venue Latitude'], la_fit['Venue Longitude'], la_fit['Venue'], la_fit['Venue Category']):
    label = '{},{}'.format(vname, vcat)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la) 

# Loop through the Fast Food Restaurants in LA to display as red points of interest
for lat, lng, vname, vcat in zip(la_fat['Venue Latitude'], la_fat['Venue Longitude'], la_fat['Venue'], la_fat['Venue Category']):
    label = '{},{}'.format(vname, vcat)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#af4e53',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)          

    
map_la

At a high level, the map shows some areas with a higher share of Fitness Centers and others with a higher share of Fast Food Restaurants. For someone moving to LA, the former would be locations to consider for living healthy, in contrast to the areas dominated by fast food.

*Note: Keep in mind that this isn't showing all the restaurants and gyms in LA. This is a sample of the Foursquare data, and there can be discrepancies in the data captured.*

In [None]:
m = folium.Map([latitude, longitude], zoom_start=9)
folium.TileLayer('CartoDBpositron').add_to(m)
# Map.choropleth() was deprecated and later removed from folium; folium.Choropleth is its replacement
folium.Choropleth(
    geo_data = geo,
    name = 'choropleth',
    data = la_merged,
    columns = ['Neighborhood','Obesity Rate'],
    key_on = 'feature.properties.name',
    fill_color = 'YlOrRd',
    fill_opacity = 0.7,
    line_opacity = 0.2,
    nan_fill_color = None,
    nan_fill_opacity = 0,
    legend_name = 'Obesity Rate LA County'
).add_to(m)
folium.LayerControl().add_to(m)


m

In [None]:
pd.pivot_table(la_test, values = ['Venue Latitude','Venue Longitude'], index=['Neighborhood','Venue Category'], columns = 'Venue').reset_index()

In [None]:

# one hot encoding
la_onehot = pd.get_dummies(la_test[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
la_onehot['Neighborhood'] = la_test['Neighborhood'] 

fixed_columns = [la_onehot.columns[-1]] + list(la_onehot.columns[:-1])
la_onehot = la_onehot[fixed_columns]

# la_grouped = la_onehot.groupby('Neighborhood').mean().reset_index()

# # set number of clusters
# kclusters = 5

# la_clustering = la_grouped.drop('Neighborhood', 1)

# # run k-means clustering
# kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_clustering)

# # check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:10] 

# # add clustering labels
# la_test['Cluster Labels'] =  kmeans.labels_


TypeError: ignored
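The one-hot encoding and grouping steps in the cell above can be sketched on a few made-up rows. Each venue becomes a row of 0/1 indicators, and averaging them per neighborhood gives each category's share of that neighborhood's venues. The data here is illustrative only and does not reproduce the notebook's `la_test` DataFrame.

```python
import pandas as pd

# Hypothetical venue rows (illustrative only)
venues = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Alhambra", "Altadena", "Altadena"],
    "Venue Category": ["Fast Food Restaurant", "Gym", "Gym", "Gym"],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging the indicators gives each category's share of a neighborhood's venues
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

A numeric table of this shape, with the `Neighborhood` column dropped, is the kind of input the commented-out k-means step would take.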

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(la_test['Latitude'], la_test['Longitude'], la_test['Neighborhood'], la_test['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters