# Unhealthiest Cities in America

### Business Problem
Staying healthy and fit has become a craze among many Americans, thanks in part to social media influencers promoting healthy eating and fitness routines, yet many people still struggle to stay healthy. Why is this? Could the number of healthy grocery stores contribute to the problem? Or is it because many Americans lack adequate access to healthcare or fitness centers? To find out, we count the healthy grocery stores, healthcare facilities, and fitness centers in the healthiest cities in America and compare them with the counts in the unhealthiest cities. If the number of these venues correlates with how healthy a city is, we can make a reasonable assumption about why a city ranks among the unhealthiest in America. This analysis aims to inform you about the factors that make a city one of the unhealthiest in America and to help you decide which city would be the best place to start your health journey.

### Data and Data Sources
To find out whether access to healthy resources is related to how healthy a city's population is, we need to look at some data. <a href="https://wallethub.com">WalletHub</a> has ranked more than 170 cities in America in their <a href="https://wallethub.com/edu/healthiest-cities/31072">Healthiest and Unhealthiest Cities in America</a> report. We will also use venue data from the Foursquare API to find out how many healthy grocery stores, healthcare facilities, and fitness centers are available in each city.

In [1]:
# Import and download all necessary libraries for this notebook
import pandas as pd
import numpy as np
import requests

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

! pip install folium==0.5.0
import folium

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


#### Data from WalletHub
<hr>
Here, we scrape the more than 170 cities ranked from healthiest to least healthy in <a href="https://wallethub.com">WalletHub</a>'s <a href="https://wallethub.com/edu/healthiest-cities/31072">Healthiest and Unhealthiest Cities in America</a> report.

In [2]:
# Open url to scrape table data from website
url = "https://wallethub.com/edu/healthiest-cities/31072"
hdr = {'User-Agent':'Mozilla/5.0'}
req = Request(url, headers=hdr)
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [3]:
table = soup.find_all("table")[0]
print(table)

<table class="cardhub-edu-table center-aligned sortable" style="width: 680px;">
<thead>
<tr>
<th class="rank-numeric">
<p><b>Overall Rank*</b></p>
</th>
<th>
<p><b>City</b></p>
</th>
<th class="rank-numeric">
<p><b>Total Score</b></p>
</th>
<th class="rank-numeric">
<p><b>‘Health Care’ Rank</b></p>
</th>
<th class="rank-numeric">
<p><b>‘Food’ Rank</b></p>
</th>
<th class="rank-numeric">
<p><b>‘Fitness’ Rank</b></p>
</th>
<th class="rank-numeric">
<p><b>‘Green Space’ Rank</b></p>
</th>
</tr>
</thead>
<tbody><tr><td>1</td><td>San Francisco, CA</td><td>73.99</td><td>29</td><td>1</td><td>4</td><td>1</td></tr><tr><td>2</td><td>Seattle, WA</td><td>70.62</td><td>19</td><td>4</td><td>3</td><td>2</td></tr><tr><td>3</td><td>San Diego, CA</td><td>70.01</td><td>25</td><td>3</td><td>1</td><td>8</td></tr><tr><td>4</td><td>Portland, OR</td><td>65.66</td><td>61</td><td>6</td><td>16</td><td>3</td></tr><tr><td>5</td><td>Washington, DC</td><td>63.87</td><td>47</td><td>9</td><td>26</td><td>5</td></tr><tr>

In [4]:
# Put scraped data into pandas dataframe
num_rows = 0
column_names = []

# count numbers of rows in table
for row in table.find_all('tr'):
    num_td = row.find_all('td')
    if len(num_td) > 0:
        num_rows += 1

# get headers of the table
num_th = table.find_all('th')
if len(num_th) > 0:
    for th in num_th:
        column_names.append(th.get_text(strip=True))
    
# create pandas dataframe with header names and include all rows in the table
df = pd.DataFrame(columns=column_names,index=range(0,num_rows))

row_position = 0
for row in table.find_all('tr'):
    column_position = 0
    columns = row.find_all('td')
    for column in columns:
        df.iat[row_position,column_position] = column.get_text(strip=True)
        column_position += 1
    if len(columns) > 0:
        row_position += 1

df

Unnamed: 0,Overall Rank*,City,Total Score,‘Health Care’ Rank,‘Food’ Rank,‘Fitness’ Rank,‘Green Space’ Rank
0,1,"San Francisco, CA",73.99,29,1,4,1
1,2,"Seattle, WA",70.62,19,4,3,2
2,3,"San Diego, CA",70.01,25,3,1,8
3,4,"Portland, OR",65.66,61,6,16,3
4,5,"Washington, DC",63.87,47,9,26,5
...,...,...,...,...,...,...,...
169,170,"Memphis, TN",29.64,166,155,169,160
170,171,"Shreveport, LA",27.42,165,171,171,165
171,172,"Gulfport, MS",24.82,171,172,167,174
172,173,"Laredo, TX",24.06,151,170,174,156
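
The row-by-row loop above collects header text from `th` cells and body text from `td` cells. As a cross-check that needs no third-party parser, the same idea can be sketched with Python's built-in `html.parser`. The class name `TableTextParser` and the two-row `sample_html` snippet are illustrative assumptions, not part of the original notebook; the snippet only mimics the shape of the WalletHub table printed earlier.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect header (th) text and body (td) rows from an HTML table."""

    def __init__(self):
        super().__init__()
        self.headers = []   # text of every <th> cell, in order
        self.rows = []      # one list of <td> texts per non-empty <tr>
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> or <th>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row:
            # Header rows contain no <td> cells, so they are skipped here
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample shaped like the scraped table (two columns only)
sample_html = (
    "<table><thead><tr><th><p><b>Overall Rank*</b></p></th>"
    "<th><p><b>City</b></p></th></tr></thead>"
    "<tbody><tr><td>1</td><td>San Francisco, CA</td></tr>"
    "<tr><td>2</td><td>Seattle, WA</td></tr></tbody></table>"
)

parser = TableTextParser()
parser.feed(sample_html)
print(parser.headers)
print(parser.rows)
```

In the notebook, feeding the parser `str(table)` and then calling `pd.DataFrame(parser.rows, columns=parser.headers)` should give a frame comparable to `df`, which makes it easy to confirm that the row and column counts match.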


In [5]:
# Clean the data by removing columns that are unnecessary for this analysis
df.columns = df.columns.str.replace('‘','').str.replace('’', '')
df.drop(['Overall Rank*', 'Total Score', 'Health Care Rank', 'Food Rank', 'Fitness Rank', 'Green Space Rank'], axis=1, inplace=True)
df

Unnamed: 0,City
0,"San Francisco, CA"
1,"Seattle, WA"
2,"San Diego, CA"
3,"Portland, OR"
4,"Washington, DC"
...,...
169,"Memphis, TN"
170,"Shreveport, LA"
171,"Gulfport, MS"
172,"Laredo, TX"


In [6]:
# Find latitude and longitude of all the ranked cities and append to the dataframe
latitude = []
longitude = []

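# Create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="city_explorer")

for address in df['City']:
    location = geolocator.geocode(address)
    # geocode() returns None when it finds no match, so record a missing value
    latitude.append(location.latitude if location else None)
    longitude.append(location.longitude if location else None)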

df['Latitude'] = latitude
df['Longitude'] = longitude
df

Unnamed: 0,City,Latitude,Longitude
0,"San Francisco, CA",37.779026,-122.419906
1,"Seattle, WA",47.603832,-122.330062
2,"San Diego, CA",32.717420,-117.162773
3,"Portland, OR",45.520247,-122.674195
4,"Washington, DC",38.894992,-77.036558
...,...,...,...
169,"Memphis, TN",35.149022,-90.051629
170,"Shreveport, LA",32.522183,-93.765194
171,"Gulfport, MS",30.367420,-89.092816
172,"Laredo, TX",27.519984,-99.495376
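
Geocoding every ranked city means well over a hundred requests to a public service, so it may help to pause between lookups and to tolerate cities that cannot be resolved. The sketch below shows one way to do that. The helper `geocode_cities`, the `Location` stand-in, and `fake_geocode` are hypothetical names introduced here for illustration; the coordinates are the San Francisco values from the table above, and "Nowhere, ZZ" is a made-up address that deliberately fails.

```python
import time
from collections import namedtuple

# Stand-in for the object a geocoder returns (only the two fields used here)
Location = namedtuple("Location", ["latitude", "longitude"])


def geocode_cities(cities, geocode, delay_seconds=1.0):
    """Return (latitudes, longitudes), using None where a lookup fails.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None when nothing is found.
    """
    latitudes, longitudes = [], []
    for i, city in enumerate(cities):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        location = geocode(city)
        if location is None:
            latitudes.append(None)
            longitudes.append(None)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
    return latitudes, longitudes


def fake_geocode(address):
    """Offline stub so the example runs without network access."""
    known = {"San Francisco, CA": Location(37.779026, -122.419906)}
    return known.get(address)


lats, lngs = geocode_cities(
    ["San Francisco, CA", "Nowhere, ZZ"], fake_geocode, delay_seconds=0
)
print(lats, lngs)
```

In the notebook, `geolocator.geocode` could be passed in place of `fake_geocode`. geopy also provides `geopy.extra.rate_limiter.RateLimiter`, which wraps a geocode function with a minimum delay and may be a simpler choice if it is available in your environment.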


#### Data from Foursquare API
<hr>
Here, we retrieve venue data for all ranked cities from the Foursquare API.

In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
# Define function to retrieve nearby venues for all ranked cities
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
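
The chained indexing in `getNearbyVenues` raises a `KeyError` if a response has no `groups` entry and an `IndexError` if a venue has an empty `categories` list, and either one stops the whole loop. A more forgiving way to flatten one response might look like the sketch below. The helper `extract_venue_rows` and the sample payloads are hypothetical; the payloads only mirror the keys the function above reads, and "Example Gym" and "Example Plaza" are made-up venues.

```python
def extract_venue_rows(city, lat, lng, payload):
    """Flatten one explore-style response into row tuples.

    Each tuple is (city, city_lat, city_lng, venue, venue_lat, venue_lng,
    category). Missing pieces become None instead of raising an exception.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Use the first category name when one exists
        category = categories[0].get("name") if categories else None
        rows.append((
            city, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            category,
        ))
    return rows


# Hypothetical payload with one categorized and one uncategorized venue
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Gym",
                   "location": {"lat": 37.78, "lng": -122.42},
                   "categories": [{"name": "Gym"}]}},
        {"venue": {"name": "Example Plaza",
                   "location": {"lat": 37.77, "lng": -122.41},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows("San Francisco, CA", 37.779026, -122.419906, sample_payload)
empty = extract_venue_rows("San Francisco, CA", 37.779026, -122.419906, {"response": {}})
print(rows)
print(empty)
```

With this approach, a city whose request returns nothing contributes zero rows rather than ending the run, so it is worth checking afterwards which cities are absent from the result.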

In [None]:
df_venues = getNearbyVenues(names=df['City'],latitudes=df['Latitude'],longitudes=df['Longitude'])

San Francisco, CA
Seattle, WA
San Diego, CA
Portland, OR
Washington, DC
New York, NY
Denver, CO
Irvine, CA
Scottsdale, AZ
Chicago, IL
Austin, TX
Los Angeles, CA
Honolulu, HI
Huntington Beach, CA
Minneapolis, MN
Salt Lake City, UT
Burlington, VT
Fremont, CA
Boston, MA
San Jose, CA
Santa Clarita, CA
Atlanta, GA
Portland, ME
Glendale, CA
Virginia Beach, VA
Sacramento, CA
Long Beach, CA
Lincoln, NE
Orlando, FL
Madison, WI
Boise, ID
Philadelphia, PA
Tampa, FL
Miami, FL
Oakland, CA
Santa Rosa, CA
Oceanside, CA
Raleigh, NC
Pittsburgh, PA
Plano, TX
Peoria, AZ
Anaheim, CA
Richmond, VA
Tempe, AZ
Aurora, CO
Fort Lauderdale, FL
Chesapeake, VA
Rochester, NY
Rancho Cucamonga, CA
Phoenix, AZ
Las Vegas, NV
Vancouver, WA
Garden Grove, CA
St. Paul, MN
Nashua, NH
Overland Park, KS
Sioux Falls, SD
Yonkers, NY
Grand Rapids, MI
Chandler, AZ
Colorado Springs, CO
St. Louis, MO
Omaha, NE
Gilbert, AZ
Reno, NV
Durham, NC
Charleston, SC
Spokane, WA
Tacoma, WA
Charlotte, NC
Providence, RI
Manchester, NH
Fargo, ND


In [None]:
print(df_venues.shape)
df_venues.head()

Here, we look at unique venue categories that can be associated with health and fitness.
<hr>
<strong>Health</strong>: Vegetarian / Vegan Restaurant, Doctor's Office, Pharmacy, Salad Place, Health Food Store, Supplement Shop, Medical Center, Farmers Market, Grocery Store, Supermarket, Organic Grocery, Fruit & Vegetable Store<br>
<strong>Fitness</strong>: Gym / Fitness Center, Gym, Sports Club, Skating Rink, Tennis Court, Athletics & Sports, Basketball Court, Hockey Field, Baseball Field, Climbing Gym, Recreation Center, Bike Rental / Bike Share, Yoga Studio

In [None]:
df_venues['Venue Category'].unique()
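
Once the unique categories are known, the Health and Fitness lists above can be used to tag each venue and count matches per city. The sketch below is one possible way to do that with the standard library only. The names `HEALTH`, `FITNESS`, `label_category`, and `count_by_city` are assumptions introduced here, and `sample_pairs` is made-up data; the category names themselves are copied from the two lists above.

```python
from collections import Counter

HEALTH = {
    "Vegetarian / Vegan Restaurant", "Doctor's Office", "Pharmacy",
    "Salad Place", "Health Food Store", "Supplement Shop", "Medical Center",
    "Farmers Market", "Grocery Store", "Supermarket", "Organic Grocery",
    "Fruit & Vegetable Store",
}
FITNESS = {
    "Gym / Fitness Center", "Gym", "Sports Club", "Skating Rink",
    "Tennis Court", "Athletics & Sports", "Basketball Court", "Hockey Field",
    "Baseball Field", "Climbing Gym", "Recreation Center",
    "Bike Rental / Bike Share", "Yoga Studio",
}


def label_category(category):
    """Return 'Health', 'Fitness', or None for an unrelated category."""
    if category in HEALTH:
        return "Health"
    if category in FITNESS:
        return "Fitness"
    return None


def count_by_city(pairs):
    """Count Health and Fitness venues per city from (city, category) pairs."""
    counts = {}
    for city, category in pairs:
        label = label_category(category)
        if label is not None:
            counts.setdefault(city, Counter())[label] += 1
    return counts


# Hypothetical (city, venue category) pairs for illustration only
sample_pairs = [
    ("Seattle, WA", "Gym"),
    ("Seattle, WA", "Pharmacy"),
    ("Seattle, WA", "Coffee Shop"),
    ("Seattle, WA", "Yoga Studio"),
    ("Laredo, TX", "Supermarket"),
]

counts = count_by_city(sample_pairs)
print(counts)
```

With the notebook's data, the pairs could come from `zip(df_venues['City'], df_venues['Venue Category'])`. Cities with no matching venues do not appear in the result, so they need to be filled in with zeros before comparing the healthiest and unhealthiest cities.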