# Choosing a U.S. City to Live In to Pursue a Data Science Career

## Introduction 

Pursuing a data science career is an exciting option for scientists and business people who want to break into the technology industry.

For people transitioning to a data science career, one consideration is whether it would be beneficial to move to one of the cities known as tech hubs, such as San Francisco. Although this is an attractive option, other factors should be weighed in order to make the best decision.

Apart from career prospects, one should choose a new home based on personal preferences.

In this project, I will gather data about several cities I have in mind and determine which one is the best fit for my preferences and my new career.

## Objective

To evaluate several U.S. cities more objectively based on personal preferences for lifestyle and health. 

## Preferences

- Weather: Mild weather is preferred.
- Pollen and mold: Lower pollen and mold counts are preferred.
- Scenery: A city near mountains is preferred.
- Urbanization and beautification: A city with a large number of parks is preferred.
- Outdoors: The availability of hiking trails and outdoor venues is preferred.
- Political views: A blue state is preferred.
- Career: A tech hub is preferred.

## Audience

This project may be of interest to any person trying to figure out where to move. In order to make an objective, responsible decision, one must research and weigh pros and cons.

Machine learning may be able to determine where we should move better than we can ourselves.

## Data

I will gather data for various U.S. cities that can be correlated with the preferences listed above. For example, the number of recycling centers may be correlated with political affiliation. The other proxies are:

- The presence of tech startups and coworking spaces is a proxy for a tech hub.
- Spiritual centers are a measure of diversity of thought.
- Street art, sculptures, botanical gardens, parks, trees, and trails are a proxy for beautification.
- Universities are a proxy for the educational status of the population.

# Methodology

First, venue data is gathered from Foursquare.

## Libraries

In [142]:

import pandas as pd
import numpy as np # library to handle data in a vectorized manner
import requests

import json # library to handle JSON files
import matplotlib.cm as cm # Matplotlib and associated plotting modules
import matplotlib.colors as colors

import plotly
import plotly.plotly as py
import plotly.figure_factory as ff


!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
from pandas.io.html import read_html
from sklearn.cluster import KMeans # import k-means from clustering stage

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.19.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


In [157]:
import matplotlib.pyplot as plt

# Cities of Interest

- Houston, TX
- Austin, TX
- Denver, CO
- Seattle, WA
- San Francisco, CA
- Portland, OR


In [158]:
# Enter the names of the cities of interest
cities = ['Houston, TX','Austin, TX','Denver, CO','Seattle, WA','San Francisco, CA','Portland, OR']

# Data Source 1: Foursquare

We will gather venue data from Foursquare for each city. We would like to find out which cities have the most venues in the categories defined below.

In [159]:
# Create a list of venues that align with personal interests. The codes had to be looked up in Foursquare.

#Outdoors and Recreation Venues: Trails, Bike Trail, Botanical Gardens, Forest, Mountain, Nature Preserve, National Park, Park, Tree, Outdoor Event Space


outdoors_venues_ID = ['4bf58dd8d48988d159941735','56aa371be4b08b9a8d57355e','52e81612bcbc57f1066b7a22','52e81612bcbc57f1066b7a23','4eb1d4d54b900d56c88a45fc','52e81612bcbc57f1066b7a13','52e81612bcbc57f1066b7a21','4bf58dd8d48988d163941735','56aa371be4b08b9a8d57356a']
                      
# Professional & Other Places:  Tech Startup

professional_venues_ID = ['4bf58dd8d48988d125941735']
    
#cultural venues:  Spiritual Center: Buddhist Temple, Hindu Temple, 

cultural_venues_ID = ['52e81612bcbc57f1066b7a3e','52e81612bcbc57f1066b7a3f']

# Food and drink shop: Farmers Market, Health Food Store, Organic Grocery, Fruit and Vegetable Store, Juice Bar

food_venues_ID = ['4bf58dd8d48988d1fa941735','50aa9e744b90af0d42d5de0e','52f2ab2ebcbc57f1066b8b45','52f2ab2ebcbc57f1066b8b1c','4bf58dd8d48988d112941735']

# Beautification: Park, 

beauty_venues_ID = ['4bf58dd8d48988d163941735']

categoryIDs = [outdoors_venues_ID,professional_venues_ID,cultural_venues_ID,food_venues_ID,beauty_venues_ID]
categoryIDs

[['4bf58dd8d48988d159941735',
  '56aa371be4b08b9a8d57355e',
  '52e81612bcbc57f1066b7a22',
  '52e81612bcbc57f1066b7a23',
  '4eb1d4d54b900d56c88a45fc',
  '52e81612bcbc57f1066b7a13',
  '52e81612bcbc57f1066b7a21',
  '4bf58dd8d48988d163941735',
  '56aa371be4b08b9a8d57356a'],
 ['4bf58dd8d48988d125941735'],
 ['52e81612bcbc57f1066b7a3e', '52e81612bcbc57f1066b7a3f'],
 ['4bf58dd8d48988d1fa941735',
  '50aa9e744b90af0d42d5de0e',
  '52f2ab2ebcbc57f1066b8b45',
  '52f2ab2ebcbc57f1066b8b1c',
  '4bf58dd8d48988d112941735'],
 ['4bf58dd8d48988d163941735']]

In [160]:
venues_df = pd.DataFrame(columns = ['City','Category','Latitude','Longitude'])

In [161]:
# This function connects to Foursquare and extracts venues matching a CategoryID and 
# stores them in the dataframe designated.

def getFoursquareCityData(cities, categoryIDs, limit, max_radius, venues_df):

    # Connect to Foursquare and Query each city to find the number of each venue.

    client_ID = 'HJQTB2PO3CQ31PY0D3MKAFCODL1XOO2RLY3VXWZ2XVOUHERI'
    client_secret = 'YVXU2GICCUXXHV00HDZUG2ZCR5WG50VYWQCCF14A5JJYY31Y'
    version = '20180605' # Foursquare API version

    print('Your credentials:')
    print('CLIENT_ID: ' + client_ID)
    print('CLIENT_SECRET:' + client_secret)

    venues_list = []

    for city in cities:
        city_abr = city.upper()[:3]
        for category_group in categoryIDs:  # avoid shadowing the built-in `list`
            for category in category_group:
                # Let requests build and encode the query string
                params = {
                    'client_id': client_ID,
                    'client_secret': client_secret,
                    'v': version,
                    'near': city,
                    'radius': max_radius,
                    'limit': limit,
                    'categoryId': category}
                try:
                    response = requests.get('https://api.foursquare.com/v2/venues/explore', params=params)
                    venues = response.json()['response']['groups'][0]['items']
                    print(city_abr)

                    venues_list.append([(
                    city,
                    category,
                    v['venue']['name'],
                    v['venue']['location']['lat'],
                    v['venue']['location']['lng'],
                    v['venue']['categories'][0]['name']) for v in venues])
                except (IndexError, KeyError):
                    # Skip responses without results or venues missing an expected field
                    continue

    # Build the dataframe once, after all requests have completed
    venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    print(venues_df.shape)
    return venues_df


In [163]:
venues_df = getFoursquareCityData(cities, categoryIDs, 100, 100000, venues_df)

Your credentials:
CLIENT_ID: HJQTB2PO3CQ31PY0D3MKAFCODL1XOO2RLY3VXWZ2XVOUHERI
CLIENT_SECRET:YVXU2GICCUXXHV00HDZUG2ZCR5WG50VYWQCCF14A5JJYY31Y
HOU (repeated 18 times)
AUS (repeated 18 times)
DEN (repeated 18 times)
SEA (repeated 18 times)
SAN (repeated 18 times)
POR (repeated 18 times)
(5964, 6)
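
Hard-coding the Foursquare credentials in the function, and printing them, makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are assumptions made for this illustration; they are not defined by Foursquare or by this notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical, chosen for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * (len(secret) - visible)
```

Inside `getFoursquareCityData`, the two hard-coded assignments could then become `client_ID, client_secret = load_foursquare_credentials()`, and the print statements could show `mask(client_secret)` instead of the full value.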


In [164]:

venues_df.columns = ('City','CategoryID','Venue','Latitude','Longitude','Type')
venues_df

| | City | CategoryID | Venue | Latitude | Longitude | Type |
|---|---|---|---|---|---|---|
| 0 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Walk | 29.762177 | -95.375844 | Trail |
| 1 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Park | 29.762068 | -95.391626 | Park |
| 2 | Houston, TX | 4bf58dd8d48988d159941735 | Houston Arboretum & Nature Center | 29.765361 | -95.452177 | Botanical Garden |
| 3 | Houston, TX | 4bf58dd8d48988d159941735 | Herman Park Trails | 29.719804 | -95.388748 | Trail |
| 4 | Houston, TX | 4bf58dd8d48988d159941735 | Terry Hershey Park | 29.779138 | -95.623096 | Park |
| 5 | Houston, TX | 4bf58dd8d48988d159941735 | Ho Chi Minh: Memorial Park Mountain Bike Trails | 29.765167 | -95.444738 | Other Great Outdoors |
| 6 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Loop | 29.761345 | -95.401556 | Trail |
| 7 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial/Allen Parkway Trails | 29.760212 | -95.408612 | Trail |
| 8 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial Park | 29.767656 | -95.442524 | Park |
| 9 | Houston, TX | 4bf58dd8d48988d159941735 | Eleanor Tinsley Park | 29.76144 | -95.379271 | Trail |


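Observations from inspecting the data:

- Tree turned out to mean tree services, so it should be deleted.
- Duplicates are present.
- University is too broad.
- An unexpected venue, Hogwarts, shows up in Austin.
- University, Library, and Coworking Space are not as interesting.

The duplicates are dropped next.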

In [165]:
venues_df = venues_df.drop_duplicates()

In [166]:
venues_df.shape

(5364, 6)

In [167]:
# print how many duplicates just for kicks
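
As a sketch of the duplicate count mentioned in the comment above, `DataFrame.duplicated()` flags every row that repeats an earlier one, so summing the flags gives the number of rows that `drop_duplicates()` removes. The three rows below are a small illustration (venue names borrowed from the table above, with one row repeated on purpose), not the full Foursquare results.

```python
import pandas as pd

# Illustration rows; the third row deliberately repeats the first
sample_df = pd.DataFrame(
    [('Houston, TX', 'Buffalo Bayou Walk', 'Trail'),
     ('Houston, TX', 'Buffalo Bayou Park', 'Park'),
     ('Houston, TX', 'Buffalo Bayou Walk', 'Trail')],
    columns=['City', 'Venue', 'Type'])

# duplicated() marks each row that repeats an earlier row
n_duplicates = int(sample_df.duplicated().sum())
deduplicated_df = sample_df.drop_duplicates()

print('Duplicate rows:', n_duplicates)
print('Rows kept:', len(deduplicated_df))
```

For the data in this notebook, the shapes reported before and after `drop_duplicates()`, (5964, 6) and (5364, 6), imply that 600 duplicate rows were removed.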

# Foursquare Data Exploration

Let's find out which city has the most venues of interest in total.

In [168]:
total_venues = pd.DataFrame(venues_df.groupby('City').count()['Venue'])
total_venues = total_venues.sort_values(by='Venue')
total_venues = total_venues.reset_index()
max_number = total_venues['Venue'].max()
max_city = total_venues.loc[total_venues['Venue'].idxmax(), 'City']
min_number = total_venues['Venue'].min()
min_city = total_venues.loc[total_venues['Venue'].idxmin(), 'City']
print('The city with the highest amount of venues matching your interests is: ' + str(max_city) +
      ' with ' + str(max_number) + ' venues.')
print('The city with the lowest amount of venues matching your interests is: ' + str(min_city) +
      ' with ' + str(min_number) + ' venues.')

The city with the highest amount of venues matching your interests is: San Francisco, CA with 1258 venues.
The city with the lowest amount of venues matching your interests is: Houston, TX with 729 venues.


In [170]:
# create lists of cities and number of venues for easy graphical representation
# (total_venues is sorted by venue count, so both lists are in ascending order)
N = len(cities)
cities = total_venues['City'].tolist()
total_number_venues_list = total_venues['Venue'].tolist()

In [171]:
print(total_number_venues_list)
print(cities)

[729, 730, 777, 906, 964, 1258]
['Houston, TX', 'Austin, TX', 'Portland, OR', 'Denver, CO', 'Seattle, WA', 'San Francisco, CA']


In [194]:
# Create a plot to visualize the cities with the most venues.
plotly.tools.set_credentials_file(username='tinaprisma', api_key='3VyJz3uXuIJdwNlO6NOB')
import plotly.graph_objs as go

data = [go.Bar(
            x=total_number_venues_list,
            y=cities,
            orientation = 'h',
            width = .7
    )]

py.iplot(data, filename='basic-bar')


High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~tinaprisma/0 or inside your plot.ly account where it is named 'basic-bar'


For the second analysis, let's remove the venue types that are scarce.

In [195]:
total_venues = pd.DataFrame(venues_df.groupby('Type').count())

In [196]:
total_venues

| Type | City | CategoryID | Venue | Latitude | Longitude |
|---|---|---|---|---|---|
| American Restaurant | 3 | 3 | 3 | 3 | 3 |
| Amphitheater | 3 | 3 | 3 | 3 | 3 |
| Athletics & Sports | 2 | 2 | 2 | 2 | 2 |
| Bakery | 7 | 7 | 7 | 7 | 7 |
| Bar | 1 | 1 | 1 | 1 | 1 |
| Beach | 8 | 8 | 8 | 8 | 8 |
| Big Box Store | 1 | 1 | 1 | 1 | 1 |
| Bike Rental / Bike Share | 2 | 2 | 2 | 2 | 2 |
| Bike Shop | 5 | 5 | 5 | 5 | 5 |
| Bike Trail | 95 | 95 | 95 | 95 | 95 |


In [174]:
# keep only the types with more than 14 venues; scarcer types are not relevant

relevant_venues_df = total_venues[total_venues['City'] > 14]
relevant_venues_df

| Type | City | CategoryID | Venue | Latitude | Longitude |
|---|---|---|---|---|---|
| Bike Trail | 95 | 95 | 95 | 95 | 95 |
| Botanical Garden | 73 | 73 | 73 | 73 | 73 |
| Buddhist Temple | 100 | 100 | 100 | 100 | 100 |
| Dog Run | 20 | 20 | 20 | 20 | 20 |
| Farmers Market | 460 | 460 | 460 | 460 | 460 |
| Forest | 82 | 82 | 82 | 82 | 82 |
| Fruit & Vegetable Store | 181 | 181 | 181 | 181 | 181 |
| Garden | 20 | 20 | 20 | 20 | 20 |
| Grocery Store | 153 | 153 | 153 | 153 | 153 |
| Health Food Store | 534 | 534 | 534 | 534 | 534 |


In [175]:
# create a list of relevant venues

In [177]:
relevant_venues_df = relevant_venues_df.reset_index()
relevant_venues_df

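| | index | Type | City | CategoryID | Venue | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Bike Trail | 95 | 95 | 95 | 95 | 95 |
| 1 | 1 | Botanical Garden | 73 | 73 | 73 | 73 | 73 |
| 2 | 2 | Buddhist Temple | 100 | 100 | 100 | 100 | 100 |
| 3 | 3 | Dog Run | 20 | 20 | 20 | 20 | 20 |
| 4 | 4 | Farmers Market | 460 | 460 | 460 | 460 | 460 |
| 5 | 5 | Forest | 82 | 82 | 82 | 82 | 82 |
| 6 | 6 | Fruit & Vegetable Store | 181 | 181 | 181 | 181 | 181 |
| 7 | 7 | Garden | 20 | 20 | 20 | 20 | 20 |
| 8 | 8 | Grocery Store | 153 | 153 | 153 | 153 | 153 |
| 9 | 9 | Health Food Store | 534 | 534 | 534 | 534 | 534 |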


In [197]:
relevant_types = relevant_venues_df['Type'].tolist()
relevant_types

['Bike Trail',
 'Botanical Garden',
 'Buddhist Temple',
 'Dog Run',
 'Farmers Market',
 'Forest',
 'Fruit & Vegetable Store',
 'Garden',
 'Grocery Store',
 'Health Food Store',
 'Hindu Temple',
 'Juice Bar',
 'Lake',
 'Mountain',
 'National Park',
 'Nature Preserve',
 'Organic Grocery',
 'Other Great Outdoors',
 'Outdoor Event Space',
 'Park',
 'Playground',
 'Scenic Lookout',
 'State / Provincial Park',
 'Tech Startup',
 'Trail']
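
The frequency filter and the list of relevant types can also be derived in one step with `Series.value_counts()`. This is only a sketch: the helper name `frequent_types` is hypothetical, and the six-row dataframe is a toy example rather than the Foursquare data.

```python
import pandas as pd

def frequent_types(df, min_count):
    """Return the sorted venue types that occur more than min_count times."""
    counts = df['Type'].value_counts()
    return sorted(counts[counts > min_count].index)

# Toy dataframe: 3 trails, 2 parks, 1 bakery
sample_df = pd.DataFrame({'Type': ['Trail', 'Park', 'Trail', 'Bakery', 'Park', 'Trail']})

# Keep types that occur more than once, then filter the rows
relevant = frequent_types(sample_df, min_count=1)
filtered_df = sample_df[sample_df['Type'].isin(relevant)]
print(relevant)
```

With the notebook's data, the equivalent call would be `frequent_types(venues_df, 14)`, matching the threshold used in the cell above.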

In [198]:
#Filter out irrelevant types from dataset
df = venues_df
df = df.loc[df['Type'].isin(relevant_types)]

In [199]:
df

| | City | CategoryID | Venue | Latitude | Longitude | Type |
|---|---|---|---|---|---|---|
| 0 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Walk | 29.762177 | -95.375844 | Trail |
| 1 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Park | 29.762068 | -95.391626 | Park |
| 2 | Houston, TX | 4bf58dd8d48988d159941735 | Houston Arboretum & Nature Center | 29.765361 | -95.452177 | Botanical Garden |
| 3 | Houston, TX | 4bf58dd8d48988d159941735 | Herman Park Trails | 29.719804 | -95.388748 | Trail |
| 4 | Houston, TX | 4bf58dd8d48988d159941735 | Terry Hershey Park | 29.779138 | -95.623096 | Park |
| 5 | Houston, TX | 4bf58dd8d48988d159941735 | Ho Chi Minh: Memorial Park Mountain Bike Trails | 29.765167 | -95.444738 | Other Great Outdoors |
| 6 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Loop | 29.761345 | -95.401556 | Trail |
| 7 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial/Allen Parkway Trails | 29.760212 | -95.408612 | Trail |
| 8 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial Park | 29.767656 | -95.442524 | Park |
| 9 | Houston, TX | 4bf58dd8d48988d159941735 | Eleanor Tinsley Park | 29.76144 | -95.379271 | Trail |


In [206]:
df['CategoryID']

0       4bf58dd8d48988d159941735
1       4bf58dd8d48988d159941735
2       4bf58dd8d48988d159941735
3       4bf58dd8d48988d159941735
4       4bf58dd8d48988d159941735
5       4bf58dd8d48988d159941735
6       4bf58dd8d48988d159941735
7       4bf58dd8d48988d159941735
8       4bf58dd8d48988d159941735
9       4bf58dd8d48988d159941735
10      4bf58dd8d48988d159941735
11      4bf58dd8d48988d159941735
12      4bf58dd8d48988d159941735
13      4bf58dd8d48988d159941735
14      4bf58dd8d48988d159941735
15      4bf58dd8d48988d159941735
16      4bf58dd8d48988d159941735
17      4bf58dd8d48988d159941735
18      4bf58dd8d48988d159941735
19      4bf58dd8d48988d159941735
20      4bf58dd8d48988d159941735
21      4bf58dd8d48988d159941735
23      4bf58dd8d48988d159941735
24      4bf58dd8d48988d159941735
25      4bf58dd8d48988d159941735
26      4bf58dd8d48988d159941735
27      4bf58dd8d48988d159941735
29      4bf58dd8d48988d159941735
30      4bf58dd8d48988d159941735
31      4bf58dd8d48988d159941735
32      4b

In [213]:
# split the data into one dataframe per category group

outdoors_df = df.loc[df['CategoryID'].isin(outdoors_venues_ID)]
startups_df = df.loc[df['CategoryID'].isin(professional_venues_ID)]
cultural_df = df.loc[df['CategoryID'].isin(cultural_venues_ID)]
food_df = df.loc[df['CategoryID'].isin(food_venues_ID)]
beauty_df = df.loc[df['CategoryID'].isin(beauty_venues_ID)]
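
A single label column is one alternative to five separate dataframes. The sketch below builds a lookup from category ID to group label and applies it with `Series.map`. The label strings and the names `category_groups` and `id_to_label` are assumptions made for illustration, and only a few of the IDs defined earlier are repeated here. Note that the Park ID `4bf58dd8d48988d163941735` appears in both the outdoors and the beautification lists, so a one-to-one lookup can keep only one label for it; in this sketch the first group listed wins.

```python
import pandas as pd

# A few of the category groups defined earlier, as (label, IDs) pairs
category_groups = [
    ('Outdoors', ['4bf58dd8d48988d159941735', '4bf58dd8d48988d163941735']),
    ('Professional', ['4bf58dd8d48988d125941735']),
    ('Beautification', ['4bf58dd8d48988d163941735']),
]

# Build the lookup; setdefault keeps the first label seen for an ID
id_to_label = {}
for label, ids in category_groups:
    for category_id in ids:
        id_to_label.setdefault(category_id, label)

sample_df = pd.DataFrame({'CategoryID': ['4bf58dd8d48988d159941735',
                                         '4bf58dd8d48988d125941735',
                                         '4bf58dd8d48988d163941735']})

# Add the label column by looking up each row's category ID
sample_df['Category'] = sample_df['CategoryID'].map(id_to_label)
print(sample_df)
```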

In [212]:
outdoors_df

| | City | CategoryID | Venue | Latitude | Longitude | Type | Outdoors Venue |
|---|---|---|---|---|---|---|---|
| 0 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Walk | 29.762177 | -95.375844 | Trail | 0 |
| 1 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Park | 29.762068 | -95.391626 | Park | 0 |
| 2 | Houston, TX | 4bf58dd8d48988d159941735 | Houston Arboretum & Nature Center | 29.765361 | -95.452177 | Botanical Garden | 0 |
| 3 | Houston, TX | 4bf58dd8d48988d159941735 | Herman Park Trails | 29.719804 | -95.388748 | Trail | 0 |
| 4 | Houston, TX | 4bf58dd8d48988d159941735 | Terry Hershey Park | 29.779138 | -95.623096 | Park | 0 |
| 5 | Houston, TX | 4bf58dd8d48988d159941735 | Ho Chi Minh: Memorial Park Mountain Bike Trails | 29.765167 | -95.444738 | Other Great Outdoors | 0 |
| 6 | Houston, TX | 4bf58dd8d48988d159941735 | Buffalo Bayou Loop | 29.761345 | -95.401556 | Trail | 0 |
| 7 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial/Allen Parkway Trails | 29.760212 | -95.408612 | Trail | 0 |
| 8 | Houston, TX | 4bf58dd8d48988d159941735 | Memorial Park | 29.767656 | -95.442524 | Park | 0 |
| 9 | Houston, TX | 4bf58dd8d48988d159941735 | Eleanor Tinsley Park | 29.76144 | -95.379271 | Trail | 0 |


(247, 6)

In [None]:
relevant_venues_df = venues_df

In [174]:
len(total_venues)

121

# Data Source 2: Weather Data

We need to find weather data for the cities of interest. This includes temperature and precipitation data.

Houston Code: USW00012918

MAX Temperature Houston 2000 - 2019 August
https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tmax-1-8-2000-2019.csv

MIN Temperature Houston 2000 - 2019 August

https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tmin-1-8-2000-2019.csv

AVG Temperature Houston 2000 - 2019 August

https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tavg-1-8-2000-2019.csv

MAX Temperature Houston 2000 - 2019 Feb
https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tmax-1-2-2000-2019.csv
MIN Temperature Houston 2000 - 2019 Feb
https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tmin-1-2-2000-2019.csv
AVG Temperature Houston 2000 - 2019 Feb
https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-tavg-1-2-2000-2019.csv

Annual Precipitation Houston 2000 - 2019 
https://www.ncdc.noaa.gov/cag/city/time-series/USW00012918-pcp-12-12-2000-2019.csv

AUSTIN DATA
Austin Code: USW00013958

DENVER DATA
Denver Code: USW00093037

PORTLAND DATA
Portland Code: USW00024229

SAN FRANCISCO DATA
San Francisco Code: USW00023234

SEATTLE DATA
Seattle Code: USW00024233
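
The Houston links follow a regular pattern, so the links for the other stations can probably be generated rather than typed by hand. The sketch below assumes the pattern is `<station>-<parameter>-<time scale>-<month>-<start year>-<end year>.csv`; that reading of the fields is inferred from the Houston links only, and the generated links for the other cities have not been checked. The helper names `noaa_url` and `city_urls` are hypothetical.

```python
NOAA_BASE = 'https://www.ncdc.noaa.gov/cag/city/time-series'

# Station codes listed above
STATION_CODES = {
    'Houston, TX': 'USW00012918',
    'Austin, TX': 'USW00013958',
    'Denver, CO': 'USW00093037',
    'Portland, OR': 'USW00024229',
    'San Francisco, CA': 'USW00023234',
    'Seattle, WA': 'USW00024233',
}

def noaa_url(station, parameter, time_scale, month, start_year=2000, end_year=2019):
    """Build one CSV link following the pattern of the Houston links."""
    return '{}/{}-{}-{}-{}-{}-{}.csv'.format(
        NOAA_BASE, station, parameter, time_scale, month, start_year, end_year)

def city_urls(city):
    """Return the seven links used for Houston, generated for any listed city."""
    station = STATION_CODES[city]
    urls = {}
    for month_name, month in (('aug', 8), ('feb', 2)):
        for parameter in ('tmax', 'tmin', 'tavg'):
            urls['{}_{}'.format(parameter, month_name)] = noaa_url(station, parameter, 1, month)
    urls['pcp_annual'] = noaa_url(station, 'pcp', 12, 12)
    return urls

houston_urls = city_urls('Houston, TX')
print(houston_urls['tmax_aug'])
```

Each link could then be passed to `pd.read_csv`; the number of header rows to skip in these files would need to be checked first.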

# Data Source 3: Pollen and Mold Data

In addition to temperature and precipitation, we need humidity index, pollen count, and mold spore count data for the cities of interest.


HOUSTON DATA - Station 188
http://pollen.aaaai.org/nab/index.cfm?p=AllergenCalendar&stationid=188&qsFullDate=10/1/2018

AUSTIN DATA - Station 111 

DENVER DATA - Station 196

SAN JOSE DATA - Station 108

SEATTLE DATA - Station 3

PORTLAND DATA - Station 1
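
The allergen calendar link for Houston suggests that the page is selected by a station ID and a date. A minimal sketch, assuming the other stations accept the same query parameters as the Houston link (this has not been verified), with the hypothetical helper name `pollen_calendar_url`:

```python
from datetime import date

POLLEN_BASE = 'http://pollen.aaaai.org/nab/index.cfm'

# Station numbers listed above
POLLEN_STATIONS = {
    'Houston': 188,
    'Austin': 111,
    'Denver': 196,
    'San Jose': 108,
    'Seattle': 3,
    'Portland': 1,
}

def pollen_calendar_url(station_id, day):
    """Build an allergen calendar link shaped like the Houston link."""
    # The date is written month/day/year without zero padding, as in the Houston link
    full_date = '{}/{}/{}'.format(day.month, day.day, day.year)
    return '{}?p=AllergenCalendar&stationid={}&qsFullDate={}'.format(
        POLLEN_BASE, station_id, full_date)

houston_pollen_url = pollen_calendar_url(POLLEN_STATIONS['Houston'], date(2018, 10, 1))
print(houston_pollen_url)
```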


Mold Spore Count Houston
http://www.houstontx.gov/health/Pollen-Mold/mold-archives.html

What the Numbers Mean
http://www.houstontx.gov/health/Pollen-Mold/numbers.html




### Other Sources and Statistics to Consider

I will have to think more deeply about where to find reliable data on the following statistics and how to integrate them into my analysis: healthiest U.S. cities, best standard of living, cost of living, and demographics.

This website contains open government data.
https://cities.data.gov/

## Methodology

I will use one-hot encoding and a grading algorithm to find the best city for me to live in.
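
The grading algorithm is not specified yet, so the following is only one possible sketch, not the method this project will necessarily use. It one-hot encodes a categorical feature with `pd.get_dummies` (here the state, taken from the city name) and grades each city with a min-max scaled, weighted score. The venue counts come from the Foursquare exploration in this notebook; the column names, the single weight, and the choice of min-max scaling are assumptions.

```python
import pandas as pd

# Venue counts per city, taken from the Foursquare exploration in this notebook
scores_df = pd.DataFrame({
    'City': ['Houston, TX', 'Austin, TX', 'Portland, OR',
             'Denver, CO', 'Seattle, WA', 'San Francisco, CA'],
    'Venues': [729, 730, 777, 906, 964, 1258],
})

# One-hot encode a categorical feature: the state part of the city name
scores_df['State'] = scores_df['City'].str.split(', ').str[1]
state_dummies = pd.get_dummies(scores_df['State'], prefix='State')

# Min-max scale the numeric feature to the range 0..1
venues = scores_df['Venues']
scores_df['VenueScore'] = (venues - venues.min()) / (venues.max() - venues.min())

# Hypothetical weights; a fuller grading would add one weighted term per preference
weights = {'VenueScore': 1.0}
scores_df['Grade'] = sum(scores_df[column] * weight for column, weight in weights.items())

# Combine the grade with the encoded columns and pick the top city
features_df = pd.concat([scores_df, state_dummies], axis=1)
best_city = features_df.loc[features_df['Grade'].idxmax(), 'City']
print(best_city)
```

Because venue count is the only weighted term here, the grade simply reproduces the venue ranking. Weather, pollen, and mold scores would each need their own scaled column and weight before the grade reflects the full list of preferences.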

## Results

## Discussion 

## Conclusion