# Capstone Project

### Problem Statement

A family currently lives in the Bushwick neighborhood of Brooklyn, New York. Following the birth of their two children, they want to return to Upstate New York, where they are originally from. The couple would like to live in a neighborhood similar in diversity and amenities to their current one (Bushwick), and they are investigating Troy, Schenectady, and Albany, New York as potentially suitable communities.

The goal of this analysis is to investigate Troy, Schenectady, and Albany, New York (which are small enough to be analyzed as whole cities) and determine whether any of them closely approximates the living conditions of Bushwick, Brooklyn, NY, USA.

### Data

The analysis examines the following data:

- Foursquare API data for all identified cities and neighborhoods.
- Mean housing price/cost information from Zillow.
- Federal regional cost-of-living adjustment data (to normalize home values between NYC and Albany, Schenectady, and Troy).
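
The cost-of-living normalization listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used later in the notebook: the function name `normalize_price`, the index convention (reference region = 100), and every number are hypothetical placeholders rather than Zillow or federal figures.

```python
def normalize_price(price, local_index, reference_index=100.0):
    """Express a home price in reference-region dollars.

    Assumes a cost-of-living index where a higher value means a more
    expensive region; the convention and the name are hypothetical.
    """
    if local_index <= 0:
        raise ValueError("local_index must be positive")
    return price * reference_index / local_index


# Placeholder inputs only: these are NOT real Zillow prices or federal indices.
example_prices = {"City A": 200_000.0, "City B": 600_000.0}
example_indices = {"City A": 100.0, "City B": 150.0}

normalized = {
    city: normalize_price(price, example_indices[city])
    for city, price in example_prices.items()
}
print(normalized)
```

Dividing by the local index puts prices from regions with different cost levels on one scale, so they can be compared directly.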

### Analysis Methodology

The analysis proceeds as follows:

- First, venues and venue categories for Albany, Schenectady, and Troy are combined with NYC venue data and clustered to determine whether any of the three cities is similar to Bushwick.
- Second, single-family housing costs are reviewed, normalized against neighborhood housing costs for Bushwick, Brooklyn, USA.
- Finally, a brief demographic study, based on a quick literature search, builds a comparison table of the cities.
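
The clustering step in the first bullet might look roughly like the sketch below. It assumes each place has already been reduced to a row of venue-category frequencies. The table values, the place names other than Bushwick, and the choice of two clusters are all invented for illustration and are not results from this analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy category-frequency table: one row per place, one column per venue
# category. The values are invented for illustration, not Foursquare results.
freq = pd.DataFrame(
    {
        "Bar": [0.40, 0.38, 0.05, 0.06],
        "Bank": [0.05, 0.07, 0.45, 0.40],
        "Cafe": [0.55, 0.55, 0.50, 0.54],
    },
    index=["Bushwick", "Place A", "Place B", "Place C"],
)

# Fit k-means; k=2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq.values)
clusters = pd.Series(kmeans.labels_, index=freq.index, name="cluster")

# Places sharing Bushwick's cluster label are the candidate matches.
matches = clusters[clusters == clusters["Bushwick"]].index.drop("Bushwick")
print(list(matches))
```

A city that lands in the same cluster as Bushwick has a similar venue mix under this measure. With real data the number of clusters would need to be chosen more carefully.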

In [1]:
import numpy as np # library to handle data in a vectorized manner

!pip install BeautifulSoup4
from bs4 import BeautifulSoup

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
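# transform JSON data into a pandas dataframe
try:
    from pandas import json_normalize # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize # older pandas releases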

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

import urllib.request

!pip install html5lib
!pip install lxml
!pip install et_xmlfile

from pandas import DataFrame

print('Libraries imported.')


In [5]:
# The code was removed by Watson Studio for sharing.

In [2]:
# Obtain NYC Borough Data.
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

#load json. 
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
#define json features. 
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data: # each item is one GeoJSON feature (a dict)
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

#Check the dataframe has all 5 boroughs. 
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

Data downloaded!
The dataframe has 5 boroughs and 306 neighborhoods.
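
To make the structure read in the loop above easier to see, here is a small self-contained sketch that flattens one hand-written feature. The helper name `feature_to_row` is hypothetical, and the sample coordinates are placeholders rather than values from `newyork_data.json`.

```python
# A single hand-written feature that mimics the structure read above
# (properties.borough, properties.name, geometry.coordinates). The
# coordinate values are placeholders, not taken from newyork_data.json.
sample_features = [
    {
        "properties": {"borough": "Brooklyn", "name": "Bushwick"},
        "geometry": {"type": "Point", "coordinates": [-73.9, 40.7]},
    }
]


def feature_to_row(feature):
    """Flatten one feature into a dict keyed by the dataframe column names."""
    lon, lat = feature["geometry"]["coordinates"]  # GeoJSON stores [lon, lat]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


rows = [feature_to_row(f) for f in sample_features]
print(rows)
```

The point to notice is the coordinate order: longitude comes first in the file, while the dataframe columns list latitude first.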


### Create a Map of the NYC Data to Check That It Works

In [3]:
# centre the map on the mean neighborhood coordinates
# (a simple stand-in for geocoding the city)
latitude = neighborhoods['Latitude'].astype(float).mean()
longitude = neighborhoods['Longitude'].astype(float).mean()

# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one marker per neighborhood; zip() walks the four columns in parallel
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough) # popup text: "neighborhood, borough"
    label = folium.Popup(label, parse_html=True) # wrap the text in a popup
    folium.CircleMarker( # define circle markers
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)

map_newyork

In [6]:
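import requests

# Build the Foursquare venue-search URL. client_id, client_secret, version,
# neighborhood_latitude, neighborhood_longitude, radius, and limit must be
# defined in an earlier cell (the credentials cell was removed for sharing).
url = (
    'https://api.foursquare.com/v2/venues/search'
    '?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'
).format(client_id, client_secret, version,
         neighborhood_latitude, neighborhood_longitude,
         radius, limit)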
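
As an alternative to formatting the query string by hand, the same URL could be assembled with the standard library's `urlencode`, which escapes each parameter. This is only a sketch: the helper name `build_search_url` is hypothetical, and the credentials, version string, coordinates, radius, and limit below are placeholders.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return a Foursquare v2 venue-search URL with an encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder credentials and arbitrary example values; substitute real ones.
example_url = build_search_url("YOUR_ID", "YOUR_SECRET", "20180604",
                               42.81, -73.94, 500, 100)
print(example_url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`, which is an equivalent encoding of the same value. Keeping the parameters in a dict also makes it harder to pass them in the wrong order.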

In [7]:
search_data = requests.get(url).json()
#search_data

{'meta': {'code': 200, 'requestId': '5ee2f8ea8bc50045a2c27e9b'},
 'response': {'venues': [{'id': '4c4f0e8afb742d7fb546522f',
    'name': "Katie O'Byrne's",
    'location': {'address': '121 Wall St',
     'crossStreet': 'Erie Blvd',
     'lat': 42.81466591859655,
     'lng': -73.94340240559035,
     'labeledLatLngs': [{'label': 'display',
       'lat': 42.81466591859655,
       'lng': -73.94340240559035}],
     'distance': 71,
     'postalCode': '12305',
     'cc': 'US',
     'city': 'Schenectady',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['121 Wall St (Erie Blvd)',
      'Schenectady, NY 12305',
      'United States']},
    'categories': [{'id': '4bf58dd8d48988d11b941735',
      'name': 'Pub',
      'pluralName': 'Pubs',
      'shortName': 'Pub',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/pub_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1591933182',
    'hasPerk': False},
   {'id': '5269a

In [8]:
# assign relevant part of JSON to venues
venues = search_data['response']['venues'] # search_data['response'] is a dict; its 'venues' key holds the list of venue records
#venues

[{'id': '4c4f0e8afb742d7fb546522f',
  'name': "Katie O'Byrne's",
  'location': {'address': '121 Wall St',
   'crossStreet': 'Erie Blvd',
   'lat': 42.81466591859655,
   'lng': -73.94340240559035,
   'labeledLatLngs': [{'label': 'display',
     'lat': 42.81466591859655,
     'lng': -73.94340240559035}],
   'distance': 71,
   'postalCode': '12305',
   'cc': 'US',
   'city': 'Schenectady',
   'state': 'NY',
   'country': 'United States',
   'formattedAddress': ['121 Wall St (Erie Blvd)',
    'Schenectady, NY 12305',
    'United States']},
  'categories': [{'id': '4bf58dd8d48988d11b941735',
    'name': 'Pub',
    'pluralName': 'Pubs',
    'shortName': 'Pub',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/pub_',
     'suffix': '.png'},
    'primary': True}],
  'referralId': 'v-1591933182',
  'hasPerk': False},
 {'id': '5269a11b498e93b4efba326f',
  'name': 'Thai Thai Bistro',
  'location': {'address': '2333 Nott St E',
   'crossStreet': 'Balltown Rd',
   'lat': 42.8

In [9]:
data = json_normalize(venues)
data.head(2)

Unnamed: 0,categories,delivery.id,delivery.provider.icon.name,delivery.provider.icon.prefix,delivery.provider.icon.sizes,delivery.provider.name,delivery.url,hasPerk,id,location.address,...,location.crossStreet,location.distance,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.postalCode,location.state,name,referralId
0,"[{'id': '4bf58dd8d48988d11b941735', 'name': 'P...",,,,,,,False,4c4f0e8afb742d7fb546522f,121 Wall St,...,Erie Blvd,71,"[121 Wall St (Erie Blvd), Schenectady, NY 1230...","[{'label': 'display', 'lat': 42.81466591859655...",42.814666,-73.943402,12305,NY,Katie O'Byrne's,v-1591933182
1,"[{'id': '4bf58dd8d48988d149941735', 'name': 'T...",2054920.0,/delivery_provider_grubhub_20180129.png,https://fastly.4sqi.net/img/general/cap/,"[40, 50]",grubhub,https://www.grubhub.com/restaurant/thai-thai-b...,False,5269a11b498e93b4efba326f,2333 Nott St E,...,Balltown Rd,4306,"[2333 Nott St E (Balltown Rd), Schenectady, NY...","[{'label': 'display', 'lat': 42.8162956237793,...",42.816296,-73.891335,12309,NY,Thai Thai Bistro,v-1591933182


## Clean the Results.

In [10]:
# keep only the venue name, categories, id, and every column that starts with "location."
filtered_columns = ['name', 'categories'] + [col for col in data.columns if col.startswith('location.')] + ['id']
dataframe_filtered = data.loc[:, filtered_columns].copy() # copy so later column assignments do not touch `data`

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered.head(5)

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,postalCode,state,id
0,Katie O'Byrne's,Pub,121 Wall St,US,Schenectady,United States,Erie Blvd,71,"[121 Wall St (Erie Blvd), Schenectady, NY 1230...","[{'label': 'display', 'lat': 42.81466591859655...",42.814666,-73.943402,12305,NY,4c4f0e8afb742d7fb546522f
1,Thai Thai Bistro,Thai Restaurant,2333 Nott St E,US,Schenectady,United States,Balltown Rd,4306,"[2333 Nott St E (Balltown Rd), Schenectady, NY...","[{'label': 'display', 'lat': 42.8162956237793,...",42.816296,-73.891335,12309,NY,5269a11b498e93b4efba326f
2,KeyBank,Bank,315 State St,US,Schenectady,United States,,44,"[315 State St, Schenectady, NY 12305, United S...","[{'label': 'display', 'lat': 42.81407963765152...",42.81408,-73.943476,12305,NY,4c49f6edbad7a5934d9ebaa9
3,H&R Block,Financial or Legal Service,133 Wall St Ste 3,US,Schenectady,United States,,58,"[133 Wall St Ste 3, Schenectady, NY 12305, Uni...","[{'label': 'display', 'lat': 42.81435509669349...",42.814355,-73.943315,12305,NY,4cfe753847699eb005521715
4,Schenectady Amtrak Station,Train Station,323 Erie Blvd,US,Schenectady,United States,,126,"[323 Erie Blvd, Schenectady, NY 12305, United ...","[{'label': 'display', 'lat': 42.81452824174902...",42.814528,-73.942516,12305,NY,4b743adcf964a52086ce2de3


## Create a Map of Venues of The Schenectady Area. 

In [None]:
# generate map centred on the Schenectady latitude and longitude
venues_map = folium.Map(location=[neighborhood_latitude, neighborhood_longitude], zoom_start=8)

# add a red circle marker for the search centre (Schenectady)
folium.CircleMarker(
    [neighborhood_latitude, neighborhood_longitude],
    radius=10,
    color='red',
    popup='Schenectady',
    fill=True,
    fill_color='red',
    fill_opacity=0.4
).add_to(venues_map)

# add each venue as a blue circle marker labelled with its category
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        # the marker colour could later be varied by venue category
        popup=label,
        fill=True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map

## Encode the Venue Categories for Schenectady. 

In [13]:
# one hot encoding: one indicator column per venue category
schenectady_onehot = pd.get_dummies(dataframe_filtered[['categories']], prefix="", prefix_sep="")

# label each row with its city so rows can be grouped later
schenectady_onehot['neighborhood'] = dataframe_filtered['city']
schenectady_onehot.head(3)

# TODO: join the one-hot category columns back onto the filtered Schenectady
# dataframe by defining a new df with just the relevant Schenectady info, e.g.
# coded_df = dataframe_filtered[columns]
# coded_df[list of categories] = array of categories from the one-hot df

Unnamed: 0,American Restaurant,Art Gallery,Bank,Bar,Beer Garden,Brewery,Building,Bus Line,Bus Stop,Business Center,...,Science Museum,Smoke Shop,Soccer Field,Speakeasy,Tattoo Parlor,Tech Startup,Thai Restaurant,Theater,Train Station,neighborhood
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Schenectady
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,Schenectady
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Schenectady
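
A common next step after one-hot encoding is to average the indicator columns per neighborhood, which gives the share of venues in each category. This step is an assumption, since this section stops at the encoding. The sketch below uses a toy stand-in for `dataframe_filtered`, with illustrative rows that reuse category names seen above.

```python
import pandas as pd

# Toy stand-in for dataframe_filtered; the rows are illustrative only.
venues = pd.DataFrame(
    {
        "city": ["Schenectady", "Schenectady", "Schenectady", "Schenectady"],
        "categories": ["Pub", "Bank", "Pub", "Thai Restaurant"],
    }
)

# One indicator column per category, as in the cell above
# (dtype=float keeps the indicators numeric across pandas versions).
onehot = pd.get_dummies(venues[["categories"]], prefix="", prefix_sep="", dtype=float)
onehot["neighborhood"] = venues["city"]

# Mean of 0/1 indicators per neighborhood = share of venues in each category.
category_share = onehot.groupby("neighborhood").mean()
print(category_share)
```

Each row of `category_share` is a frequency vector for one place, which is the kind of input a clustering step can compare across cities.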
