<a href="https://colab.research.google.com/github/tomthomas/Coursera_Capstone/blob/main/Capstone_BOTN_TomT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The US is known for having one of the highest obesity rates in the world. The highly commercialized fast-food industry makes it difficult for the average American to stay healthy, especially in large cities, where advertisements are more prominent. This report explores how someone living in a large metropolitan city could resist these temptations and remain healthy.

For this analysis, we focus on the two factors that matter most to an individual attempting to stay healthy: **fast-food restaurants** and **fitness centers**. We will analyze the data for patterns between these two venue types. Are there more fitness centers where there are more fast-food restaurants, or the opposite? And can anything be done to reduce the desire to visit a fast-food chain restaurant?


## Data <a name="data"></a>

For a thorough analysis, we require the following data:


*   **List of Neighborhoods in a Metropolitan City** - We have chosen **Los Angeles** for this project, a large city with many fast-food and fitness options, which allows for more concrete observations. We will collect this information from the neighborhood list published on the LA Times website.
*   **Department of Public Health Data for LA County** - This dataset will be useful for evaluating the correlation between obesity rates and the two main factors.
*   **Geospatial Data using Geocoder Package** - We will use this package to capture the required latitudes and longitudes for LA County.
*   **Foursquare Venue Data** - We will use the Foursquare API to collect venue information for the different neighborhoods.


In [1]:
# Importing required libraries

import pandas as pd
import requests
import numpy as np
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
from sklearn.cluster import KMeans

print("Import complete!")

Import complete!


In [2]:
# De-limit displayed rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#### Getting Neighborhood Data

The first data we need to import into our project is the **List of Neighborhoods in a Metropolitan City**.

Data Source: https://maps.latimes.com/neighborhoods/neighborhood/list/

This data will be valuable for grouping the different neighborhoods of LA into a Region/Borough. We will use the BeautifulSoup package to extract the list found on the LA Times website.

BeautifulSoup will parse the HTML text so that we can pull out the table, which we then convert into a pandas DataFrame.


In [4]:
# Create BeautifulSoup object
page = requests.get('http://maps.latimes.com/neighborhoods/neighborhood/list/').text
soup = BeautifulSoup(page,'html5lib')
tbl = str(soup.table)

# Convert to table
list1 = pd.read_html(tbl)
la_list = list1[0]
la_list.rename(columns={'Name':'Neighborhood'}, inplace =True)

la_list.head()

ValueError: ignored

#### Getting Public Health Data

Next up is the data that could be key to our analysis. 
We are going to import the obesity data for various neighborhoods of LA.

Data source: http://publichealth.lacounty.gov/

We are going to get two types of indicators for our analysis:

1.   The Percentage of Adults in LA County Meeting Recommended Guidelines for Physical Activity
2.   The Percentage of Adults in LA County Who are Obese

The data is stored as .xlsx files which we will import and clean so that they can be ready for analysis. 
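The cleaning described here (keep one indicator per sheet, give it a clearer name, join on the community name) can be sketched on toy frames. This is only an illustration: the place names, the numbers, and the `Lower CI`/`Upper CI` columns are made up, and `clean` is a hypothetical helper. Only `City/Community`, `Percentage`, and `Percent` are column names taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets; all values are illustrative only.
fit_demo = pd.DataFrame({
    "City/Community": ["A-town", "B-ville", "C-city"],
    "Percentage": [0.30, 0.35, 0.40],
    "Lower CI": [0.25, 0.30, 0.35],   # hypothetical trailing columns
    "Upper CI": [0.35, 0.40, 0.45],
})
obs_demo = pd.DataFrame({
    "City/Community": ["A-town", "B-ville", "D-burg"],
    "Percent": [0.20, 0.25, 0.15],
    "Lower CI": [0.15, 0.20, 0.10],
    "Upper CI": [0.25, 0.30, 0.20],
})

def clean(df, value_col, new_name):
    """Keep only the community name and one indicator, under a clearer name."""
    return df[["City/Community", value_col]].rename(columns={value_col: new_name})

fit_clean = clean(fit_demo, "Percentage", "Healthy Adult Percentage")
obs_clean = clean(obs_demo, "Percent", "Obesity Rate")

# pd.merge defaults to an inner join: only communities present in both sheets survive.
health_demo = pd.merge(fit_clean, obs_clean, on="City/Community")
print(health_demo)
```

Selecting columns by name rather than by position keeps the step readable if the spreadsheet layout ever changes.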



In [4]:
# Read Excel data
fit = pd.read_excel("https://github.com/tomthomas/Coursera_Capstone/blob/main/Data/PercentMeetingPAGuidelines.xlsx?raw=true", skiprows=4, usecols=lambda x: 'Unnamed' not in x)
obs = pd.read_excel("https://github.com/tomthomas/Coursera_Capstone/blob/main/Data/PercentObeseAdults.xlsx?raw=true", skiprows=4, usecols=lambda x: 'Unnamed' not in x)

# Drop unnecessary columns
fit.drop(fit.columns[[-1,-2]], axis = 1,inplace=True)
obs.drop(obs.columns[[-1,-2]], axis = 1,inplace=True)
fit.rename(columns={'Percentage':'Healthy Adult Percentage'}, inplace =True)
obs.rename(columns={'Percent':'Obesity Rate'}, inplace =True)
health_data = pd.merge(fit, obs, on = 'City/Community')

health_data.head()

Unnamed: 0,City/Community,Healthy Adult Percentage,Obesity Rate
0,Alhambra,0.27313,0.135779
1,Altadena,0.348242,0.244131
2,Arcadia,0.266197,0.057046
3,Azusa,0.375943,0.260744
4,Baldwin Park,0.316393,0.265167


#### Merging Data

Combining the Neighborhood Data with the Public Health Data will give us a better view of how the data is spread across the different Regions.

In [5]:
# Let's take a look at the shape of each dataframe

# Neighborhood Data
print("There are ",la_list['Neighborhood'].count(),"Neighborhoods in the Neighborhood Dataframe")
# Public Health Data
print("There are ",health_data['City/Community'].count(),"Cities/Communities in the Public Health Dataframe")


There are  272 Neighborhoods in the Neighborhood Dataframe
There are  87 Cities/Communities in the Public Health Dataframe


Los Angeles seems to be a bit of an anomaly when it comes to the total number of officially designated communities.

https://en.wikipedia.org/wiki/List_of_cities_in_Los_Angeles_County,_California states that there are 88 cities, which aligns more closely with our Public Health dataframe.

https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Los_Angeles_County,_California states that there are 76 unincorporated communities, each with a small population. We can drop these rows, since they belong to the greater county of LA rather than the city of LA and may therefore stick out as outliers in our analysis.

To keep things simple, we will use the Public Health dataset and designate Regions for each City/Community.
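Because the two sources do not always use the same names, a left join from the health data onto the neighborhood list can leave some rows without a Region. A minimal sketch of how such rows could be spotted, using `indicator=True` on toy frames (the `Venice` row is illustrative; the other names appear in this notebook's output):

```python
import pandas as pd

health_demo = pd.DataFrame({"Neighborhood": ["Alhambra", "Florence-Graham", "Azusa"]})
regions_demo = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Azusa", "Venice"],
    "Region": ["San Gabriel Valley", "San Gabriel Valley", "Westside"],
})

# how="left" keeps every health row; indicator=True adds a "_merge" column
# recording whether each row found a partner in the right-hand frame.
merged_demo = pd.merge(health_demo, regions_demo, on="Neighborhood",
                       how="left", indicator=True)

unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "Neighborhood"].tolist()
print(unmatched)
```

Rows flagged `left_only` are the ones that would need a Region assigned by hand.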

In [6]:
# Rename health_data column for merge
health_data = health_data.rename(columns={'City/Community':'Neighborhood'})

health_data.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate
0,Alhambra,0.27313,0.135779
1,Altadena,0.348242,0.244131
2,Arcadia,0.266197,0.057046
3,Azusa,0.375943,0.260744
4,Baldwin Park,0.316393,0.265167


In [7]:
# Merge Neighborhood data and Public Health Data
la_merged = pd.merge(health_data, la_list,on='Neighborhood', how= 'left')

la_merged.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Alhambra,0.27313,0.135779,San Gabriel Valley
1,Altadena,0.348242,0.244131,Verdugos
2,Arcadia,0.266197,0.057046,San Gabriel Valley
3,Azusa,0.375943,0.260744,San Gabriel Valley
4,Baldwin Park,0.316393,0.265167,San Gabriel Valley


In [8]:
# Let's take a look at the count for each region in the merged Dataframe
la_merged['Region'].value_counts()

San Gabriel Valley        22
Southeast                 18
South Bay                  8
Harbor                     3
Westside                   3
Pomona Valley              3
Verdugos                   3
San Fernando Valley        2
Antelope Valley            2
Eastside                   1
Northwest County           1
Santa Monica Mountains     1
South L.A.                 1
Central L.A.               1
Name: Region, dtype: int64

In [9]:
# The left join leaves NaN in the Region column for rows without a match. Let's extract those rows so we can edit them
la_nan = la_merged[la_merged.isna().any(axis=1)]

# Reset Index
la_nan = la_nan.reset_index(drop=True)
la_nan

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Florence-Graham,0.298663,0.312639,
1,Los Angeles Council District 1,0.314527,0.21641,
2,Los Angeles Council District 2,0.386215,0.207762,
3,Los Angeles Council District 3,0.365171,0.20308,
4,Los Angeles Council District 4,0.417725,0.15148,
5,Los Angeles Council District 5,0.423018,0.102214,
6,Los Angeles Council District 6,0.329751,0.2296,
7,Los Angeles Council District 7,0.356249,0.286106,
8,Los Angeles Council District 8,0.356369,0.318689,
9,Los Angeles Council District 9,0.322849,0.325907,


At this point, the easiest approach is to fill in the missing Region values manually, using various sources of LA map data such as:

1.   https://laedc.org/wtc/chooselacounty/regions-of-la-county/
2.   https://culturela.org/council-districts-directory/
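A manual fill like this can also be keyed by name instead of by row position, which makes it harder to assign a Region to the wrong row. The sketch below is one possible approach on a toy frame, not the method used in this notebook; `manual_regions` is a hypothetical lookup holding two of the assignments shown in the output further down.

```python
import pandas as pd

merged_demo = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Florence-Graham", "Los Angeles Council District 1"],
    "Region": ["San Gabriel Valley", None, None],
})

# Manual lookups keyed by neighborhood name, so the fill does not depend on row order.
manual_regions = {
    "Florence-Graham": "South L.A.",
    "Los Angeles Council District 1": "Northeast L.A.",
}

# fillna only touches the missing entries; existing Regions are left alone.
merged_demo["Region"] = merged_demo["Region"].fillna(
    merged_demo["Neighborhood"].map(manual_regions)
)
print(merged_demo)
```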

In [10]:
# Populate missing values
# (named `regions` rather than `list`, so the built-in list() stays usable in later cells)
regions = ['South L.A.','Northeast L.A.','San Fernando Valley','San Fernando Valley','Central L.A.','Westside','San Fernando Valley','San Fernando Valley','South L.A.',
        'South L.A.','Central L.A.','Westside','San Fernando Valley','Central L.A.','Northeast L.A.','Harbor','Central L.A.','Los Angeles County']
for i in range(len(regions)):
  la_nan.iloc[i,3] = regions[i]
la_nan

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Florence-Graham,0.298663,0.312639,South L.A.
1,Los Angeles Council District 1,0.314527,0.21641,Northeast L.A.
2,Los Angeles Council District 2,0.386215,0.207762,San Fernando Valley
3,Los Angeles Council District 3,0.365171,0.20308,San Fernando Valley
4,Los Angeles Council District 4,0.417725,0.15148,Central L.A.
5,Los Angeles Council District 5,0.423018,0.102214,Westside
6,Los Angeles Council District 6,0.329751,0.2296,San Fernando Valley
7,Los Angeles Council District 7,0.356249,0.286106,San Fernando Valley
8,Los Angeles Council District 8,0.356369,0.318689,South L.A.
9,Los Angeles Council District 9,0.322849,0.325907,South L.A.


In [11]:
# Re-merge Missing Value data to main DataFrame

la_merged = la_merged.set_index('Neighborhood')
la_nan = la_nan.set_index('Neighborhood')
la_merged.update(la_nan)
la_merged.reset_index(inplace=True)

In [12]:
# Renaming the Council Districts to help with getting geospatial data.
# .loc avoids chained assignment; label slices are inclusive, so 37:52 covers the same 16 rows as [37:53]
la_merged.loc[37:52, 'Neighborhood'] = ['Cypress Park','Valley Village','Tarzana','Hollywood Hills','Bel Air','Panorama City','Pacoima','Hyde Park',
                                        'Vermont Square','Crenshaw','Pacific Palisades','Granada Hills','Silver Lake','Downtown','San Pedro','City of Los Angeles']


In [13]:
# Merged DataFrame
la_merged.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region
0,Alhambra,0.27313,0.135779,San Gabriel Valley
1,Altadena,0.348242,0.244131,Verdugos
2,Arcadia,0.266197,0.057046,San Gabriel Valley
3,Azusa,0.375943,0.260744,San Gabriel Valley
4,Baldwin Park,0.316393,0.265167,San Gabriel Valley


## Methodology <a name="methodology"></a>

This section will focus on the methods used to approach this project. 
The aim of this project is to identify the ideal fitness centers that would aid in staying healthy for an adult in LA County.

This goal is based on **2 main assumptions**:



1.   Visiting and leaving fitness centers that are surrounded by a greater number of fast-food restaurants **reduces the motivation** of the participant to return to the fitness center OR **increases the probability** of visiting a fast-food restaurant after a workout.
2.   Neighborhoods with a larger number of fast-food restaurants have a **higher obesity rate**.

To analyze these assumptions and to identify the ideal fitness center location, we will use the following methods.


*   Based on the neighborhood data acquired and cleaned, import geospatial data to pinpoint latitude and longitude values for each Neighborhood.
*   Acquire venue information for each Neighborhood from the Foursquare API.
*   Analyze the venue information using visualization and clustering to test our assumptions.
*   Identify which fitness centers are the best candidates for training while maintaining a healthy diet.
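One simple way the second assumption could be checked is a correlation between the number of fast-food restaurants per neighborhood and its obesity rate. The sketch below only shows the mechanics: the neighborhoods `N1` to `N5` and every number in `summary_demo` are made up, and `Fast Food Count` is a hypothetical column that the notebook does not build.

```python
import pandas as pd

# Hypothetical per-neighborhood summary; every number here is made up.
summary_demo = pd.DataFrame({
    "Neighborhood": ["N1", "N2", "N3", "N4", "N5"],
    "Fast Food Count": [2, 4, 5, 7, 9],
    "Obesity Rate": [0.12, 0.18, 0.17, 0.25, 0.30],
})

# Pearson correlation: values near +1 mean the two columns rise together,
# values near 0 mean no linear relationship.
r = summary_demo["Fast Food Count"].corr(summary_demo["Obesity Rate"])
print(round(r, 3))
```

A strong correlation would be consistent with the assumption but would not by itself show that fast-food density causes obesity.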





In [14]:
# Let's do a re-count for each region in the merged Dataframe
la_merged['Region'].value_counts()


#The data seems to be more spread out now

San Gabriel Valley        22
Southeast                 18
South Bay                  8
San Fernando Valley        7
Westside                   5
Central L.A.               5
Harbor                     4
South L.A.                 4
Pomona Valley              3
Verdugos                   3
Northeast L.A.             2
Antelope Valley            2
Eastside                   1
Los Angeles County         1
Northwest County           1
Santa Monica Mountains     1
Name: Region, dtype: int64

#### Getting Geospatial Data

We're going to use Nominatim from the geopy package to obtain the latitude and longitude of Los Angeles, so that we can display the city on a map as a visualization.

In [15]:
address = 'Los Angeles, CA'

geolocator = Nominatim(user_agent="la_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.242766.


We also need the lat,long values for each Neighborhood in our created DataFrame.
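A geocoding service may return no match for some names, so it can help to guard the lookup. The sketch below assumes the result shape this notebook reads (a list of dicts whose `geometry` entry holds `lat` and `lng`). The helpers `first_lat_lng` and `geocode_all` are hypothetical, and `fake_geocode` is an offline stub standing in for a real geocoder.

```python
def first_lat_lng(results):
    """Return (lat, lng) from the first geocoder hit, or (None, None) if there is none."""
    if not results:
        return None, None
    geometry = results[0]["geometry"]
    return geometry["lat"], geometry["lng"]

def geocode_all(names, geocode_fn, suffix="Los Angeles, CA"):
    """Geocode each name via geocode_fn, a callable mapping an address to a result list."""
    coords = {}
    for name in names:
        coords[name] = first_lat_lng(geocode_fn("{}, {}".format(name, suffix)))
    return coords

# Stub geocoder standing in for a real service, so the sketch runs offline.
def fake_geocode(address):
    known = {
        "Alhambra, Los Angeles, CA": [{"geometry": {"lat": 34.093042, "lng": -118.12706}}],
    }
    return known.get(address, [])

coords = geocode_all(["Alhambra", "Nowhereville"], fake_geocode)
print(coords)
```

Names that come back as `(None, None)` can then be reviewed by hand instead of stopping the loop with an `IndexError`.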

In [None]:
!pip install opencage
from opencage.geocoder import OpenCageGeocode


In [17]:
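import os

Neigh_lat = []
Neigh_lng = []
Neighborhood = la_merged['Neighborhood']

# Read the OpenCage API key from the environment instead of hard-coding it.
# The variable name OPENCAGE_API_KEY is our own choice.
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)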

for i in range(len(Neighborhood)):
    address = '{}, Los Angeles, CA'.format(Neighborhood[i])
    location = geocoder.geocode(address)
    Neigh_lat.append(location[0]['geometry']['lat'])
    Neigh_lng.append(location[0]['geometry']['lng'])

print(Neigh_lat, Neigh_lng)


[34.093042, 34.1863161, 34.1362075, 34.1338751, 34.0854739, 33.9747806, 33.9694561, 33.8825705, 34.0696501, 34.1816482, 34.1446643, 33.8322043, 33.8644291, 34.0966764, 33.894927, 34.0877926, 33.9620584, 34.0211224, 34.0286226, 33.942215, 34.0239015, 34.0751571, 33.9741588, 33.8963593, 34.1469416, 34.1361187, 33.9930677, 33.9188589, 33.9827043, 33.9562003, 33.9060971, 34.01979, 34.1008426, 33.8503463, 34.6981064, 33.8885217, 33.7690164, 34.0922322, 34.1637147, 34.1688075, 34.1311792, 34.0827278, 34.2242902, 34.2625025, 33.9805691, 34.0019449, 33.9252122, 34.0480643, 34.2661558, 34.0896518, 34.0428494, 33.7358518, 34.0536909, 33.924831, 33.8915985, 33.9866807, 34.1483499, 34.0159398, 34.051522, 33.9092802, 34.5793131, 33.898917, 34.1476452, 33.9830688, 34.0553813, 33.7483311, 33.8455911, 34.0676169, 33.9761238, 34.1066756, 34.28497, 34.0990733, 34.3916641, 34.0194704, 33.9463456, 34.1133062, 33.9366769, 34.1082994, 33.8358492, 34.0384785, 34.0202894, 34.0686208, 34.0923014, 34.123985, 33

Add the Neighborhood location data to the main DataFrame.

In [20]:
la_merged['Latitude'] = Neigh_lat
la_merged['Longitude'] = Neigh_lng

la_merged.head()

Unnamed: 0,Neighborhood,Healthy Adult Percentage,Obesity Rate,Region,Latitude,Longitude
0,Alhambra,0.27313,0.135779,San Gabriel Valley,34.093042,-118.12706
1,Altadena,0.348242,0.244131,Verdugos,34.186316,-118.135233
2,Arcadia,0.266197,0.057046,San Gabriel Valley,34.136207,-118.04015
3,Azusa,0.375943,0.260744,San Gabriel Valley,34.133875,-117.905605
4,Baldwin Park,0.316393,0.265167,San Gabriel Valley,34.085474,-117.961176


#### Plot Map of Los Angeles

Now, we will take the geospatial coordinates we've gathered and the Neighborhood information to visualize a map.

To achieve this, we will use the Folium library.


In [27]:
# Create map of Los Angeles using latitude and longitude values
map_la = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, region, neighborhood in zip(la_merged['Latitude'], la_merged['Longitude'], la_merged['Region'], la_merged['Neighborhood']):
    label = '{}, {}'.format(neighborhood, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

Some points appear to have been geocoded incorrectly by the geocoding package; seeing them on the map helped us identify them as outliers. We will remove them to maintain accuracy.
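As a complement to inspecting the map by eye, suspicious coordinates could also be flagged with a simple bounding box. This is a sketch under stated assumptions rather than the approach taken in this notebook: the box limits are rough values picked for illustration, not official boundaries, and the `Far Away` row is invented.

```python
import pandas as pd

# Rough bounding box around the Los Angeles basin; these limits are assumptions
# chosen for illustration, not official boundaries.
LAT_MIN, LAT_MAX = 33.70, 34.35
LNG_MIN, LNG_MAX = -118.70, -117.65

points_demo = pd.DataFrame({
    "Neighborhood": ["Alhambra", "Azusa", "Far Away"],
    "Latitude": [34.093042, 34.133875, 36.0],
    "Longitude": [-118.12706, -117.905605, -115.0],
})

# True for rows whose coordinates fall inside the box on both axes.
in_box = (points_demo["Latitude"].between(LAT_MIN, LAT_MAX)
          & points_demo["Longitude"].between(LNG_MIN, LNG_MAX))

print(points_demo[~in_box])   # rows to inspect by hand
kept_demo = points_demo[in_box].reset_index(drop=True)
```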

In [23]:
# Drop the regions that contain the outlying points
excluded_regions = ['Antelope Valley', 'Santa Monica Mountains', 'Northwest County',
                    'San Fernando Valley', 'Pomona Valley']
la_merged = la_merged[~la_merged['Region'].isin(excluded_regions)
                      & (la_merged['Neighborhood'] != 'Los Angeles County')]

Re-run the folium code to get an updated map.

The map is looking beautiful!

Now let's move onto Foursquare data.

#### Get Foursquare Venue Data

The Foursquare API provides user-generated data about public points of interest, which is very useful for an analysis like ours.

Using a Foursquare account, we will extract the venue data for each Neighborhood and then see what insights can be gathered from it.

Let's set up the required Foursquare client credentials first.

In [28]:
import os

# Read the Foursquare credentials from environment variables rather than hard-coding or printing them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are our own choice.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value


In [62]:
LIMIT = 200 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)    
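The `radius=5000` default above is a distance in meters around each neighborhood center. To get a feel for what that covers, the great-circle distance between two coordinates can be estimated with the haversine formula. The helper `haversine_m` below is a hypothetical illustration that is not used elsewhere in this notebook; the coordinates are taken from this notebook's outputs.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Alhambra's center and one venue returned for it.
d_venue = haversine_m(34.093042, -118.12706, 34.106211, -118.134465)

# The centers of Alhambra and Arcadia.
d_centers = haversine_m(34.093042, -118.12706, 34.136207, -118.04015)

print(round(d_venue), round(d_centers))
```

When two centers are closer together than twice the radius, their search circles overlap, so the same venue can be returned for both neighborhoods.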

In [63]:
#Create a new Dataframe to store the API values
la_venues = getNearbyVenues(names=la_merged['Neighborhood'],
                                   latitudes=la_merged['Latitude'],
                                   longitudes=la_merged['Longitude']
                                  )

Alhambra
Altadena
Arcadia
Azusa
Baldwin Park
Bell
Bell Gardens
Bellflower
Beverly Hills
Carson
Cerritos
Compton
Covina
Cudahy
Culver City
Diamond Bar
Downey
East Los Angeles
El Monte
Florence-Graham
Gardena
Glendale
Glendora
Hacienda Heights
Hawthorne
Huntington Park
Inglewood
La Mirada
La Puente
Lakewood
Lawndale
Long Beach
Cypress Park
Hollywood Hills
Bel Air
Hyde Park
Vermont Square
Crenshaw
Pacific Palisades
Silver Lake
Downtown
San Pedro
City of Los Angeles
Lynwood
Manhattan Beach
Maywood
Monrovia
Montebello
Monterey Park
Norwalk
Paramount
Pasadena
Pico Rivera
Rancho Palos Verdes
Redondo Beach
Rosemead
Rowland Heights
San Dimas
San Gabriel
Santa Monica
South Gate
South Pasadena
South Whittier
Temple City
Torrance
Valinda
Walnut
West Covina
West Hollywood
West Whittier-Los Nietos
Westmont
Whittier


In [64]:
# Let's check the shape of the created DataFrame
print(la_venues.shape)
la_venues.head()

(7183, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alhambra,34.093042,-118.12706,Blaze Pizza,34.09542,-118.125894,Pizza Place
1,Alhambra,34.093042,-118.12706,Sprouts Farmers Market,34.095288,-118.125046,Grocery Store
2,Alhambra,34.093042,-118.12706,Borneo Kalimantan Cuisine,34.094818,-118.126659,Food
3,Alhambra,34.093042,-118.12706,85C Bakery Cafe,34.093411,-118.130214,Bakery
4,Alhambra,34.093042,-118.12706,Grill 'Em All,34.095595,-118.126725,Burger Joint


Let's take a look at the unique venue categories we've collected

In [None]:
la_venues['Venue Category'].unique()

Let's go ahead and remove every category except Fast Food Restaurants and Gyms/Fitness Centers.

In [88]:
health_categories = ['Fast Food Restaurant', 'Gym / Fitness Center', 'Gym', 'Yoga Studio']
# .copy() gives an independent DataFrame, so the assignment below does not trigger SettingWithCopyWarning
la_health = la_venues[la_venues['Venue Category'].isin(health_categories)].copy()
la_health['Venue Category'] = la_health['Venue Category'].replace({'Gym / Fitness Center': 'Gym', 'Yoga Studio': 'Gym'})
print(la_health.shape)
la_health.reset_index(drop= True, inplace=True)
la_health.head()

(404, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alhambra,34.093042,-118.12706,In-N-Out Burger,34.106211,-118.134465,Fast Food Restaurant
1,Alhambra,34.093042,-118.12706,Planet Fitness,34.07742,-118.116117,Gym
2,Altadena,34.186316,-118.135233,24 Hour Fitness,34.183503,-118.159222,Gym
3,Altadena,34.186316,-118.135233,Equinox Pasadena,34.145032,-118.145536,Gym
4,Altadena,34.186316,-118.135233,"iLoveKickboxing - Pasadena, CA",34.145764,-118.114105,Gym


Let's take a look at how many venues of each type were returned.

In [66]:
health_stat = la_health.groupby(by=['Venue Category'])
health_stat.count()

Unnamed: 0_level_0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Fast Food Restaurant,273,273,273,273,273,273
Gym,131,131,131,131,131,131


So, we have 131 fitness centers and 273 fast-food restaurants in the area, roughly two fast-food restaurants for every fitness center.
A greater number of fast-food restaurants also makes them easier to visit than fitness centers. However, a raw count on its own is not conclusive evidence. Let's keep exploring.
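One reason to be cautious with raw totals is that a venue lying within the search radius of several neighborhoods may be returned more than once. A minimal sketch of counting each physical venue only once is shown below. The frame is a toy example: `Burger Stop` and `Taco Stand` and their coordinates are invented, and the duplicate `Pasadena` row is an assumption made to illustrate the effect.

```python
import pandas as pd

# Toy venue table: the same gym is returned for two neighboring search centers.
venues_demo = pd.DataFrame({
    "Neighborhood": ["Altadena", "Pasadena", "Altadena", "Pasadena"],
    "Venue": ["Equinox Pasadena", "Equinox Pasadena", "Burger Stop", "Taco Stand"],
    "Venue Latitude": [34.145032, 34.145032, 34.18, 34.15],
    "Venue Longitude": [-118.145536, -118.145536, -118.13, -118.14],
    "Venue Category": ["Gym", "Gym", "Fast Food Restaurant", "Fast Food Restaurant"],
})

# Treat rows with the same name and coordinates as one physical venue.
unique_venues = venues_demo.drop_duplicates(
    subset=["Venue", "Venue Latitude", "Venue Longitude"]
)

raw_counts = venues_demo["Venue Category"].value_counts()
unique_counts = unique_venues["Venue Category"].value_counts()
print(raw_counts)
print(unique_counts)
```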

Let's remap the data with just our new venue categories.

In [86]:

map_la = folium.Map(location=[latitude, longitude], zoom_start=10)
for lat, lng, vname, vcat in zip(la_health['Venue Latitude'], la_health['Venue Longitude'], la_health['Venue'], la_health['Venue Category']):
    label = '{},{}'.format(vname, vcat)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

In [None]:
la_health.pivot(columns='Venue Category')

In [None]:
la_health['Venue Category']

In [91]:

la_test = la_health

In [98]:

# one hot encoding
la_onehot = pd.get_dummies(la_test[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe as the first column
la_onehot.insert(0, 'Neighborhood', la_test['Neighborhood'])

# average the indicators per neighborhood: the share of fast-food venues vs. gyms
la_grouped = la_onehot.groupby('Neighborhood').mean().reset_index()

# set number of clusters
kclusters = 5

la_clustering = la_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_clustering)

# check cluster labels generated for the first rows of la_grouped
kmeans.labels_[0:10]

# add clustering labels: map each neighborhood's label back onto its venue rows
cluster_by_neighborhood = dict(zip(la_grouped['Neighborhood'], kmeans.labels_))
la_test['Cluster Labels'] = la_test['Neighborhood'].map(cluster_by_neighborhood)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per neighborhood, using the neighborhood coordinates stored in la_test
la_points = la_test.drop_duplicates(subset='Neighborhood')
for lat, lon, poi, cluster in zip(la_points['Neighborhood Latitude'], la_points['Neighborhood Longitude'], la_points['Neighborhood'], la_points['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters