# Data Science Capstone project on NYC Traffic Flow

A wealth of public data about NYC is available on data.cityofnewyork.us. In this project I enrich the venue information provided by Foursquare with a dataset on New York City traffic flow.

The deliverable has two phases. In the first phase I want to be able to:
- (1) Visualise spatial, traffic and Foursquare data by plotting (a) venues, (b) traffic speeds, and (c) the map of New York City, to get an overview of the businesses that are easily accessible by car.
- (2) Apply filters to plot traffic data over different days and time periods.

In the second phase I intend to analyse the traffic data to better understand:
- (3) what kinds of businesses are located near places with good (fast) traffic throughput, and
- (4) what kinds of businesses have poor traffic throughput nearby.

## Business Problem
The NYC government wants to understand which areas have good or poor traffic flow, and which venues are located in those areas. In particular, it wants to identify the areas that suffer from poor traffic flow, and the venues in those areas that cannot be reached easily by car.
- Target audience: NYC government.

## Data Required
- Foursquare: the Foursquare API is used to gather information about venues near a given location, including each venue's latitude and longitude.
- Folium: used to plot the maps.
- Traffic data: contains speed, latitude, longitude and a number of other variables that can be used to determine the traffic throughput on a road or in an area.

### Data links
- Foursquare API: https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}
- Traffic Data (1): The traffic data from October 2016 https://www.kaggle.com/crailtap/nyc-real-time-traffic-speed-data-feed#october2016.csv
- Traffic Data (2): The node information for the traffic data referred to in the October 2016 dataset: https://www.kaggle.com/crailtap/nyc-real-time-traffic-speed-data-feed#linkinfo.csv

The traffic file (1) contains the following columns:
- Id
- Speed
- TravelTime
- Status
- DataAsOf
- linkId

Traffic Data file (2) contains these columns:
- linkId
- linkPoints
- EncodedPolyLine
- EncodedPolyLineLvls
- Transcom_id
- Borough
- linkName
- Owner

The Foursquare API calls will return at least the following information which I will use:
- Venue Latitude
- Venue Longitude
- Venue Category
- Venue Name

- Traffic Data (1) provides the traffic speed on road segments, each identified by its linkId.
- Traffic Data (2) provides the locations of the road segments, also identified by linkId.
- Foursquare data contains the venue category and the location (coordinates) of each venue.

Combined, these sources make it possible to visualise everything on a single map and to analyse the features listed above.
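Because both traffic files share the linkId column, speeds and segment locations can be brought together with a join. The snippet below is only a minimal sketch of that idea: the rows are made-up illustrative values (not taken from the real files), and the names `df_speed`, `df_links` and `joined` are my own.

```python
import pandas as pd

# Illustrative rows only; the real data comes from october2016.csv and linkinfo.csv
df_speed = pd.DataFrame({
    'linkId': [4616337, 4616337, 4616325],
    'Speed': [36.04, 30.00, 11.81],
})
df_links = pd.DataFrame({
    'linkId': [4616337, 4616325],
    'linkPoints': [
        '40.74047,-74.009251 40.74137,-74.00893',
        '40.73933,-74.01004 40.73895,-74.01012',
    ],
})

# Average the speed readings per link, then attach the link geometry
mean_speed = df_speed.groupby('linkId', as_index=False)['Speed'].mean()
joined = mean_speed.merge(df_links, on='linkId', how='inner')
print(joined)
```

An inner join keeps only links that appear in both files, which seems the safer default here because a speed without a location cannot be drawn on the map.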


## Approach

### 1 Visualise spatial and traffic and Foursquare data by plotting (a) Venues, (b) Traffic speeds, and (c) the map of New York city.
- Use Folium to plot the map.
- Use Traffic Data file (1) to determine the traffic speed for each link ID.
- Use Traffic Data file (2) to look up the latitude and longitude of each link ID.
- Use the Foursquare API to identify venues near each link location.
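As a rough illustration of plotting traffic speeds, each speed could be mapped to a line colour before the segment is drawn with Folium. The function name `speed_to_color` and the two thresholds are my own assumptions, not values from the dataset, and would need tuning against the real speed distribution.

```python
def speed_to_color(speed, slow=15.0, fast=35.0):
    """Map a speed reading to a colour name usable as a Folium line colour.

    The thresholds are assumed values, in the same unit as the Speed column.
    """
    if speed < slow:
        return 'red'      # poor throughput
    if speed < fast:
        return 'orange'   # moderate throughput
    return 'green'        # good throughput

# Speeds taken from the first rows of the October 2016 file
for speed in (9.94, 24.23, 47.85):
    print(speed, speed_to_color(speed))
```

The returned string can be passed as the `color` argument when a segment is drawn as a `folium.PolyLine`.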

### 2 Apply filters to plot traffic data over different days and time periods.
- Investigate how to make the map interactive, so that changes in traffic speed over different days and time periods in October 2016 can be visualised.
- Classify areas by their traffic speeds, taking into account that the speeds form a time series and differ over time.
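Filtering by day and time period could look roughly like the sketch below. It assumes the timestamps follow the `MM/DD/YYYY HH:MM` pattern seen in the first rows of the traffic file (longer values would need trimming first); the helper `filter_traffic` and the column name `timestamp` are hypothetical names of my own.

```python
import pandas as pd

# Three sample readings in the format of the DataAsOf column
df = pd.DataFrame({
    'DataAsOf': ['10/05/2016 09:41', '10/05/2016 18:05', '10/08/2016 09:15'],
    'Speed': [36.04, 11.81, 47.85],
    'linkId': [4616298, 4616298, 4616298],
})
df['timestamp'] = pd.to_datetime(df['DataAsOf'], format='%m/%d/%Y %H:%M')

def filter_traffic(frame, weekdays, start_hour, end_hour):
    """Keep rows on the given weekdays (0 = Monday) with start_hour <= hour < end_hour."""
    ts = frame['timestamp']
    mask = (
        ts.dt.dayofweek.isin(list(weekdays))
        & (ts.dt.hour >= start_hour)
        & (ts.dt.hour < end_hour)
    )
    return frame[mask]

# Weekday morning rush hour, 07:00 up to (not including) 10:00
morning = filter_traffic(df, weekdays=[0, 1, 2, 3, 4], start_hour=7, end_hour=10)
print(morning[['DataAsOf', 'Speed']])
```

Working with one parsed timestamp column rather than separate date and time strings makes weekday and hour filters one-liners, which should help once the map becomes interactive.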

### 3 What kind of businesses are located near places that have good or fast traffic throughput.
### 4 Identify what the kinds of businesses are that have poor traffic throughput nearby.
- Build a dataset of venues and their locations (latitude and longitude) using the Foursquare API, and store it in a DataFrame in Watson Studio so that it can be reused without querying the Foursquare API again.
- Try out various machine learning algorithms (exploratory) to look for correlations between venue category and traffic speeds in NYC.
- Cluster business categories based on their locations and the traffic speeds near those locations.
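Before any correlation or clustering can be attempted, each venue needs to be linked to a nearby road segment. One possible way, sketched here under my own assumptions, is to take the great-circle (haversine) distance from the venue to every listed point of every link and keep the closest one. The helpers `haversine_m` and `nearest_link` are hypothetical, and measuring to the listed points only approximates the true distance to the road.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def nearest_link(venue, link_points):
    """Return (linkId, distance_m) of the link with the point closest to the venue.

    venue is a (lat, lon) tuple; link_points maps linkId -> list of (lat, lon).
    """
    best = None
    for link_id, points in link_points.items():
        for lat, lon in points:
            dist = haversine_m(venue[0], venue[1], lat, lon)
            if best is None or dist < best[1]:
                best = (link_id, dist)
    return best

# Two points copied from the link table; the venue is the geocoded address used in this notebook
links = {
    4616325: [(40.72619, -74.011131)],
    4616323: [(40.77158, -73.994441)],
}
print(nearest_link((40.7149555, -74.0153365), links))
```

With the nearest linkId attached to every venue, the average speed of that link becomes a feature that can sit next to the venue category in the exploratory models.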





In [3]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import datetime # For timeseries

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library


### Set the variables for the Foursquare API, determine the NYC location, and create the main function for querying venues near a location

In [10]:
CLIENT_ID = 'VSKNDXQXCA0IPVQ4AMSF40AGK4AUAIVWI41XKCLMO4X1EAG2' # your Foursquare ID
# Client secret is in hidden cell below
VERSION = '20180604'
LIMIT = 30

In [11]:
# The code was removed by Watson Studio for sharing.

In [12]:
address = '102 North End Ave, New York, NY'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

40.7149555 -74.0153365


In [13]:
def getNearbyVenues(latitudes, longitudes, radius=500):
    
    venues_list=[]
    for lat, lng in zip(latitudes, longitudes):
                    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        #return only relevant information for each nearby venue
        venues_list.append([(
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
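The list comprehension in `getNearbyVenues` assumes that every venue has at least one category and that the response always contains a group. The following is a more defensive sketch of the same flattening step. The helper name `extract_venue_rows` is hypothetical, and the payload is a hand-written stand-in that only mirrors the keys indexed in the function above; it is not real Foursquare output.

```python
def extract_venue_rows(payload, lat, lng):
    """Flatten an explore-style response into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # A venue without categories gets None instead of raising IndexError
        category = categories[0].get('name') if categories else None
        rows.append((lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hand-written stand-in payload with made-up venues
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 40.7130, 'lng': -74.0070},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 40.7140, 'lng': -74.0050},
               'categories': []}},
]}]}}

rows = extract_venue_rows(fake_payload, 40.7127281, -74.0060152)
print(rows)
```

The tuples have the same order as the columns assigned to `nearby_venues`, so they could be passed to `pd.DataFrame` in the same way.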

In [14]:
nyc = getNearbyVenues(
    latitudes=[40.7127281],
    longitudes=[-74.0060152],
)

In [15]:
place = 'New York City,United States'

geolocator = Nominatim(user_agent='can_explorer')
location = geolocator.geocode(place)
latitude = location.latitude
longitude = location.longitude
print('The coordinates of NYC are {}, {}.'.format(latitude, longitude))

The coordinates of NYC are 40.7127281, -74.0060152.


### Create the Traffic Dataset that contains the locations of the segments and store it in a Pandas DataFrame - df_linkinfo

In [16]:
# The code was removed by Watson Studio for sharing.

In [17]:
df_linkinfo = pd.read_csv(body)

In [18]:
df_linkinfo.drop(["EncodedPolyLine", "Transcom_id", "Borough", "linkName", "Owner"], axis=1, inplace=True)
df_linkinfo.head()

| | linkId | linkPoints | EncodedPolyLineLvls |
|---|---|---|---|
| 0 | 4616337 | 40.74047,-74.009251 40.74137,-74.00893 40.7431... | BBBBBBBBBBBBB |
| 1 | 4616325 | 40.73933,-74.01004 40.73895,-74.01012 40.7376,... | BBBBBB |
| 2 | 4616324 | 40.76375,-73.999191 40.763521,-73.99935 40.762... | BBBBBBBBBBBBBBB |
| 3 | 4616338 | 40.7607,-74.002141 40.76212,-74.91 40.76335,-7... | BBBBBBBBB |
| 4 | 4616323 | 40.77158,-73.994441 40.7713004,-73.99455 40.77... | BBBBBBBBBBBBBBBBB |


In [19]:
# Replace the level string with its length (in the rows shown, one character per point of the link)
df_linkinfo['EncodedPolyLineLvls'] = df_linkinfo['EncodedPolyLineLvls'].astype(str).str.len()

In [22]:
# Split the space-separated "lat,lon" pairs into one column per point
df_linkinfo = pd.concat([df_linkinfo, df_linkinfo.linkPoints.str.split(expand=True)], axis=1, sort=False)

In [23]:
# The CSV holds incomplete coordinates beyond point 12, so drop the last split column
df_linkinfo.drop(df_linkinfo.columns[-1], axis=1, inplace=True)
df_linkinfo.drop("linkPoints", inplace=True, axis=1)
# Rename the remaining split columns 0..12 to point0..point12
df_linkinfo.rename(columns={i: 'point{}'.format(i) for i in range(13)}, inplace=True)

In [24]:
df_linkinfo.head()

Unnamed: 0,linkId,EncodedPolyLineLvls,point0,point1,point2,point3,point4,point5,point6,point7,point8,point9,point10,point11,point12
0,4616337,13,"40.74047,-74.009251","40.74137,-74.00893","40.7431706,-74.008591","40.7462304,-74.00797","40.74812,-74.007651","40.748701,-74.007691","40.74971,-74.00819","40.75048,-74.008321","40.751611,-74.00789","40.7537504,-74.00704","40.75721,-74.00463","40.76003,-74.002631","40.7607405,-7"
1,4616325,6,"40.73933,-74.01004","40.73895,-74.01012","40.7376,-74.010021","40.7346,-74.01026","40.72912,-74.010781","40.72619,-74.011131",,,,,,,
2,4616324,15,"40.76375,-73.999191","40.763521,-73.99935","40.7620804,-74.00136","40.75985,-74.00306","40.75775,-74.00457","40.75775,-74.00457","40.75576,-74.00601","40.7544904,-74.006921","40.7538404,-74.007241","40.75415,-74.00712","40.7502804,-74.00848","40.74833,-74.007771","40.74114,-74.0"
3,4616338,9,"40.7607,-74.002141","40.76212,-74.91","40.76335,-73.999271","40.76491,-73.99805","40.7667406,-73.996681","40.7693,-73.994801","40.7699605,-73.994521","40.7710104,-73.99438","40.7715106,-73.9942",,,,
4,4616323,17,"40.77158,-73.994441","40.7713004,-73.99455","40.77085,-73.99467","40.76997,-73.99481","40.7701604,-73.99477","40.76986,-73.994831","40.7695406,-73.99496","40.769341,-73.99508","40.768311,-73.9958","40.768311,-73.9958","40.76623,-73.99733","40.76623,-73.99733","40.76547,-73.9979"
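The point columns still hold text such as "40.74047,-74.009251", and some entries are visibly cut off (for example "40.7607405,-7"). A possible way to turn them into numeric coordinates while discarding damaged pairs is sketched below. The bounding box values and the helper name `parse_link_points` are my own assumptions; the box is only a crude plausibility check, not an official NYC boundary.

```python
# Rough NYC bounding box (assumed values, used only to reject damaged points)
LAT_RANGE = (40.4, 41.0)
LON_RANGE = (-74.3, -73.6)

def parse_link_points(link_points):
    """Turn 'lat,lon lat,lon ...' into a list of (lat, lon) float tuples."""
    points = []
    for pair in link_points.split():
        parts = pair.split(',')
        if len(parts) != 2:
            continue  # malformed pair
        try:
            lat, lon = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # empty or cut-off number
        if LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]:
            points.append((lat, lon))
    return points

# First two points and the truncated last point of link 4616337
sample = "40.74047,-74.009251 40.74137,-74.00893 40.7607405,-7"
print(parse_link_points(sample))
```

The same check would also reject the suspicious "40.76212,-74.91" entry of link 4616338, whose longitude lies well outside the city.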


### Create the Traffic Dataset that contains the traffic speed data and store it in a Pandas DataFrame - df_traffic

In [25]:
# The storage client and the __iter__ helper are assumed to be defined in the hidden Watson Studio cell (In [16])
import types
body = client_74cc9913793c416c8e6e21855b2aa2d8.get_object(Bucket='datascienceprojectcoursera-donotdelete-pr-j5jbz0cymbpaj8',Key='october2016v2.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_traffic = pd.read_csv(body)
df_traffic.head()

| | Id | Speed | TravelTime | Status | DataAsOf | linkId |
|---|---|---|---|---|---|---|
| 0 | 422 | 36.04 | 249 | 0 | 10/05/2016 09:41 | 4616298 |
| 1 | 423 | 11.81 | 434 | 0 | 10/05/2016 09:41 | 4616299 |
| 2 | 424 | 9.94 | 360 | 0 | 10/05/2016 09:41 | 4616300 |
| 3 | 425 | 47.85 | 225 | 0 | 10/05/2016 09:41 | 4616276 |
| 4 | 426 | 24.23 | 285 | 0 | 10/05/2016 09:41 | 4616272 |


In [26]:
df_traffic = pd.concat([df_traffic, df_traffic.DataAsOf.str.split(expand=True)], axis=1, sort=False)

In [27]:
df_traffic.rename(columns={0: 'Date', 1: 'Time'}, inplace=True)

In [28]:
# Keep only the leftmost 5 characters (HH:MM) of the Time cell,
# e.g. 10:10 rather than 10:10:30, since the seconds are not needed
df_traffic["Time"] = df_traffic['Time'].str[:5]

In [29]:
df_traffic.drop("DataAsOf", inplace=True, axis=1)

In [30]:
pd.to_datetime(df_traffic['Date'][0])

Timestamp('2016-10-05 00:00:00')

In [31]:
date = "10/05/2016"

In [32]:
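# Zoom level 13 keeps a whole road segment in view
map_nyc = folium.Map(location=[latitude, longitude], zoom_start=13)
# Assumption: draw the first link as a test; point12 is cut off in the CSV, so only point0..point11 are used
first_link = df_linkinfo.loc[0, 'point0':'point11'].dropna()
points = [tuple(map(float, p.split(','))) for p in first_link]
folium.PolyLine(points, color="red", weight=2.5, opacity=1).add_to(map_nyc)
map_nyc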