# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal office location for a hotel tech startup. Specifically, this report is targeted at stakeholders interested in choosing a location in **Manhattan**, **New York**.

We will try to detect locations that are surrounded by **subway stations**. We are also particularly interested in areas with **Asian food**, since most of the employees are Asian. We would also prefer locations as close to **hotels** as possible, assuming that the first two conditions are met.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of potential office locations in Manhattan
* number of existing subway stations in the neighborhood (particularly on the **ACE and NQRW** lines)
* number of and distance to Asian restaurants in the neighborhood
* distance of the neighborhood from groups of hotels

We decided to use a regularly spaced grid of locations, centered around the city center, to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* the geographic location of Manhattan will be obtained using the **geopy library**
* the number of restaurants and their type and location in every neighborhood will be obtained using the **Foursquare API**
* the coordinates of the subway stations will be obtained from **[NYC OpenData](https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49)**

### Office Location

Let's use the geopy library to get the latitude and longitude values of **Manhattan, New York**.

In [2]:
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

def get_coordinates(address):
    
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    lat = location.latitude
    lng = location.longitude
    
    return [lat, lng]
    
address = 'Manhattan, NY'
mahattan_geo = get_coordinates(address)
print('The geographical coordinates of {} are {}.'.format(address,mahattan_geo))

The geographical coordinates of Manhattan, NY are [40.7900869, -73.9598295].


Let's use the get_coordinates function to convert the office addresses to coordinates.

In [13]:
# set up libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [53]:
# Get the office dataframe and clean the data

office_df = pd.read_csv('office.csv')
office_df = office_df.drop(['index'],axis = 1)
office_df

| | address | rent |
|---|---|---|
| 0 | 142 W 57th St New York, NY 10019 | $1,100.00 |
| 1 | 12 E 49th St New York, NY 10017 | $1,020.00 |
| 2 | 349 5th Ave New York, NY 10016 | $900.00 |
| 3 | 33 Irving Pl New York, NY 10003 | $1,190.00 |
| 4 | 428 Broadway New York, NY 10013 | $1,000.00 |
| 5 | 200 Broadway New York, NY 10038 | $1,080.00 |
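The rent column is stored as text such as `$1,100.00`, so it has to be converted to a number before rents can be compared or sorted. A minimal sketch of that conversion, assuming every value follows the dollar-sign-and-commas format shown in the table; the helper name `parse_rent` is hypothetical and not part of the original notebook.

```python
def parse_rent(text):
    """Convert a rent string such as '$1,100.00' to a float."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned)

# The six rent values listed in the office table.
rents = ['$1,100.00', '$1,020.00', '$900.00', '$1,190.00', '$1,000.00', '$1,080.00']
rent_values = [parse_rent(r) for r in rents]
cheapest = min(rent_values)
```

In the notebook, the same helper could be applied to the whole column with `office_df['rent'].apply(parse_rent)`.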


In [54]:
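# add latitude and longitude columns to the dataframe
office_geo = []
for ad in office_df['address']:
    try:
        office_geo.append(get_coordinates(ad))
        print(ad + ' Success')
    except Exception:
        # keep a placeholder so the coordinates stay aligned with the addresses
        office_geo.append([None, None])
        print(ad + ' Fail')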


office_df = pd.concat([office_df,pd.DataFrame(office_geo,columns=['latitude','longitude'])],axis=1)
office_df



142 W 57th St New York, NY 10019 Success
12 E 49th St New York, NY 10017 Success
349 5th Ave New York, NY 10016 Success
33 Irving Pl New York, NY 10003 Success
428 Broadway New York, NY 10013 Success
200 Broadway New York, NY 10038 Success


| | address | rent | latitude | longitude |
|---|---|---|---|---|
| 0 | 142 W 57th St New York, NY 10019 | $1,100.00 | 40.764807 | -73.979244 |
| 1 | 12 E 49th St New York, NY 10017 | $1,020.00 | 40.757368 | -73.97687 |
| 2 | 349 5th Ave New York, NY 10016 | $900.00 | 40.748181 | -73.984518 |
| 3 | 33 Irving Pl New York, NY 10003 | $1,190.00 | 40.735105 | -73.988113 |
| 4 | 428 Broadway New York, NY 10013 | $1,000.00 | 40.719648 | -74.001443 |
| 5 | 200 Broadway New York, NY 10038 | $1,080.00 | 40.71056 | -74.009014 |


Create a map of Manhattan, New York using the latitude and longitude values, and add a circle marker for each office location.

In [125]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

map_newyork= folium.Map(location=mahattan_geo,zoom_start=12)

# add markers to map

for lat,lng,address, price in zip(office_df['latitude'],office_df['longitude'],office_df['address'],office_df['rent']):
    label = '{}, {}'.format(address,price)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)


# show the map
map_newyork

### Subway Location

Load the subway station locations and show them on the map.

In [62]:
# read the csv into dataframe
mta_df = pd.read_csv('subway.csv')

# remove the columns that are irrelevant
mta_df.drop(['URL','NOTES'],axis = 1,inplace=True)

mta_df.head()

| | OBJECTID | NAME | the_geom | LINE |
|---|---|---|---|---|
| 0 | 1 | Astor Pl | POINT (-73.99106999861966 40.73005400028978) | 4-6-6 Express |
| 1 | 2 | Canal St | POINT (-74.00019299927328 40.71880300107709) | 4-6-6 Express |
| 2 | 3 | 50th St | POINT (-73.98384899986625 40.76172799961419) | 1-2 |
| 3 | 4 | Bergen St | POINT (-73.97499915116808 40.68086213682956) | 2-3-4 |
| 4 | 5 | Pennsylvania Ave | POINT (-73.89488591154061 40.66471445143568) | 3-4 |


In [70]:
# remove the POINT text
mta_df['the_geom'] = mta_df['the_geom'].str[6:]

# split the column into longitude and latitude
mta_df['longitude'] = mta_df['the_geom'].str.split(' ', expand=True, n=1).iloc[:,0]
mta_df['latitude'] = mta_df['the_geom'].str.split(' ', expand=True, n=1).iloc[:,1]

# delete the 'the_geom' 
mta_df.drop(['the_geom'],axis = 1, inplace=True)

# clean the longitude and latitude
mta_df['longitude'] = mta_df['longitude'].str[1:]
mta_df['latitude'] = mta_df['latitude'].str[:-1]

# change the type 
mta_df['latitude'] = mta_df['latitude'].astype('float')
mta_df['longitude'] = mta_df['longitude'].astype('float')

In [99]:
mta_df.head()

| | OBJECTID | NAME | LINE | longitude | latitude |
|---|---|---|---|---|---|
| 0 | 1 | Astor Pl | 4-6-6 Express | -73.99107 | 40.730054 |
| 1 | 2 | Canal St | 4-6-6 Express | -74.000193 | 40.718803 |
| 2 | 3 | 50th St | 1-2 | -73.983849 | 40.761728 |
| 3 | 4 | Bergen St | 2-3-4 | -73.974999 | 40.680862 |
| 4 | 5 | Pennsylvania Ave | 3-4 | -73.894886 | 40.664714 |
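The `the_geom` values are well-known text (WKT) points, which list longitude first and latitude second. The slicing used above depends on the exact character positions of `POINT (` and `)`. As an alternative, a regular expression can pull out both numbers in one step and fail loudly on unexpected input. This is only a sketch; `POINT_RE` and `parse_point` are hypothetical names that do not appear in the original notebook.

```python
import re

# Matches 'POINT (<longitude> <latitude>)' with optional signs and decimals.
POINT_RE = re.compile(r'POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)')

def parse_point(wkt):
    """Return (longitude, latitude) as floats from a WKT point string."""
    match = POINT_RE.fullmatch(wkt.strip())
    if match is None:
        raise ValueError('not a WKT point: {!r}'.format(wkt))
    return float(match.group(1)), float(match.group(2))

# First row of the station table (Astor Pl).
lon, lat = parse_point('POINT (-73.99106999861966 40.73005400028978)')
```

On a whole pandas column, `Series.str.extract` with the same pattern should give the two values as separate columns, which can then be cast to `float`.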


Since the client only cares about ACE and NQRW stations in Manhattan, we will remove the other data.

In [112]:
# remove the stations that are not in Manhattan, using an approximate
# axis-aligned bounding box built from four corner points [latitude, longitude]
top_left = [40.806470, -73.973205]
bottom_left = [40.709729, -74.035690]
bottom_right = [40.696715, -73.992431]
top_right = [40.781518, -73.934066]

min_lat = bottom_right[0]
max_lat = top_left[0]
min_lon = bottom_left[1]  # westernmost longitude
max_lon = top_right[1]    # easternmost longitude

mta_df = mta_df.loc[
    (mta_df["latitude"] < max_lat) &
    (mta_df["latitude"] > min_lat) &
    (mta_df["longitude"] > min_lon) &
    (mta_df["longitude"] < max_lon)
]
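The four corner points describe a tilted quadrilateral, but the filter above only uses their extreme latitudes and longitudes. The resulting rectangle therefore also admits stations east of the East River, such as Court Sq - 23rd St in Queens, which shows up in the ACE table further down. A tighter alternative is a point-in-polygon test against the quadrilateral itself. The ray-casting sketch below is an illustration, not the method used in this notebook, and `point_in_polygon` is a hypothetical helper.

```python
def point_in_polygon(lat, lon, corners):
    """Ray-casting test; corners is a list of [lat, lon] in boundary order."""
    inside = False
    n = len(corners)
    for i in range(n):
        lat1, lon1 = corners[i]
        lat2, lon2 = corners[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if cross_lon > lon:
                inside = not inside
    return inside

# Corner points from the cell above, listed in boundary order.
corners = [
    [40.806470, -73.973205],  # top_left
    [40.781518, -73.934066],  # top_right
    [40.696715, -73.992431],  # bottom_right
    [40.709729, -74.035690],  # bottom_left
]

astor_pl = point_in_polygon(40.730054, -73.991070, corners)        # Manhattan
court_sq = point_in_polygon(40.747768, -73.946055, corners)        # Queens
cathedral_pkwy = point_in_polygon(40.800582, -73.958067, corners)  # Manhattan
```

With these corners, Astor Pl is inside and Court Sq - 23rd St is outside, but Cathedral Pkwy (110th St) is also outside even though it is in Manhattan. The quadrilateral is only as good as its four corners, so they would need adjusting before this replaces the rectangle.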

In [116]:
# list the distinct line combinations that remain
mta_df['LINE'].unique()


array(['4-6-6 Express', '1-2', 'A-B-C', 'J-M-Z', '4-5-6-6 Express',
       'B-D-F-M', 'E-M', 'G', 'L', 'J-M', 'N-Q-R-W', 'S', '7-7 Express',
       '7-7 Express-N-W', '1-2-3', '1', '2-3', 'A-C-E', 'E-M-R', 'F',
       'B-D-E', 'F-Q', 'A-B-C-D', 'N-R-W', 'A-C', 'F-M', 'E', 'J-Z',
       'R-W', '4-5', 'B-D', 'N-Q', 'Q'], dtype=object)

In [120]:
# keep only the stations served by the ACE and NQRW lines
ace_list = ['A-B-C','E-M','A-C-E','E-M-R','A-B-C-D','A-C','E']
nqrw_list = ['N-Q-R-W','7-7 Express-N-W','N-R-W','F-Q','R-W','N-Q','Q']
mta_ace = mta_df[mta_df['LINE'].isin(ace_list)]
mta_nqrw = mta_df[mta_df['LINE'].isin(nqrw_list)]
mta_ace.head()

| | OBJECTID | NAME | LINE | longitude | latitude |
|---|---|---|---|---|---|
| 6 | 7 | Cathedral Pkwy (110th St) | A-B-C | -73.958067 | 40.800582 |
| 55 | 56 | 72nd St | A-B-C | -73.976337 | 40.775519 |
| 56 | 57 | 96th St | A-B-C | -73.964602 | 40.791619 |
| 65 | 66 | Court Sq - 23rd St | E-M | -73.946055 | 40.747768 |
| 141 | 142 | 5th Ave - 53rd St | E-M | -73.975249 | 40.760087 |
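The hand-written lists above have to name every LINE combination explicitly, which makes it easy to miss one. A possible alternative is to split each LINE value on the hyphen and test whether any of the resulting services is wanted. The sketch below runs on the unique LINE values printed earlier; `serves_any` is a hypothetical helper, not part of the original notebook.

```python
def serves_any(line, wanted):
    """True if the hyphen-separated LINE value includes any wanted service."""
    services = {part.strip() for part in line.split('-')}
    return not services.isdisjoint(wanted)

# Unique LINE values as printed by mta_df['LINE'].unique() above.
lines = ['4-6-6 Express', '1-2', 'A-B-C', 'J-M-Z', '4-5-6-6 Express',
         'B-D-F-M', 'E-M', 'G', 'L', 'J-M', 'N-Q-R-W', 'S', '7-7 Express',
         '7-7 Express-N-W', '1-2-3', '1', '2-3', 'A-C-E', 'E-M-R', 'F',
         'B-D-E', 'F-Q', 'A-B-C-D', 'N-R-W', 'A-C', 'F-M', 'E', 'J-Z',
         'R-W', '4-5', 'B-D', 'N-Q', 'Q']

ace_lines = [l for l in lines if serves_any(l, {'A', 'C', 'E'})]
nqrw_lines = [l for l in lines if serves_any(l, {'N', 'Q', 'R', 'W'})]
```

Compared with the lists above, this selection additionally picks up `B-D-E` for ACE and `E-M-R` for NQRW. Whether those stations should count depends on what the client considers an ACE or NQRW station.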


Let's add the stations to the map, with blue for ACE and yellow for NQRW.

In [126]:
# add ace markers to map

for lat,lng,name, line in zip(mta_ace['latitude'],mta_ace['longitude'],mta_ace['NAME'],mta_ace['LINE']):
    label = '{}, {}'.format(name,line)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=2,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)

# add nqrw markers to map

for lat,lng,name, line in zip(mta_nqrw['latitude'],mta_nqrw['longitude'],mta_nqrw['NAME'],mta_nqrw['LINE']):
    label = '{}, {}'.format(name,line)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=2,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)

# show map
map_newyork
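The map shows visually which offices are surrounded by stations. To turn that into a number, one option is to count the stations within a fixed walking radius of each office using the haversine great-circle distance. The sketch below is an illustration under assumptions: the 1000 m radius is arbitrary, the helper names are hypothetical, and the three stations are simply the first rows of the station table rather than the filtered ACE or NQRW sets.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def count_within(lat, lon, stations, radius_m):
    """Number of (lat, lon) stations no farther than radius_m from the point."""
    return sum(1 for s_lat, s_lon in stations
               if haversine_m(lat, lon, s_lat, s_lon) <= radius_m)

office = (40.735105, -73.988113)      # 33 Irving Pl
stations = [(40.730054, -73.991070),  # Astor Pl
            (40.718803, -74.000193),  # Canal St
            (40.761728, -73.983849)]  # 50th St

d_astor = haversine_m(office[0], office[1], *stations[0])
nearby = count_within(office[0], office[1], stations, 1000)
```

Here the office at 33 Irving Pl is a little over 600 m from Astor Pl, and only that one of the three sample stations falls within the 1000 m radius.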

### Foursquare

In [127]:
# Define Foursquare credentials and version
# replace the placeholders with your own credentials and keep them out of shared notebooks

CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


In [None]:
# Let's create a function to repeat the same process for all the neighborhoods in Manhattan
import requests # library to handle HTTP requests

def getNearbyFood(names, latitudes, longitudes, radius=500, limit=100):
    # limit=100 is an assumed default; the original cell relied on a global
    # LIMIT that is not defined in this part of the notebook
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
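The function indexes straight into `['response']['groups'][0]['items']` and `['categories'][0]`, so a location with no results or a venue without a category would raise an error partway through the loop. A more forgiving parser might look like the sketch below. It assumes the same nested layout that the function above reads; `extract_venues` is a hypothetical helper, and the sample payload is made up for illustration rather than taken from real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category at all.
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up payload for illustration only; not real API output.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 40.75, 'lng': -73.98},
               'categories': [{'name': 'Category A'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 40.76, 'lng': -73.97},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({'response': {}})
```

Rows with a `None` category can then be dropped or kept explicitly, instead of the whole request loop failing on one incomplete venue.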