<a href="https://colab.research.google.com/github/strickert/Applied-Data-Science-Machine-Learning/blob/main/0-proposal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 5. Capstone Project: Project Proposal

***

## Urban Densification and Traffic Congestion: A case study of New York City cabs

![images](./images/headers/header_proposal.jpg)

As global population and urban densification are expected to increase in the 21st century, cities around the world will face many new challenges. According to recent United Nations reports, the world's population is expected to reach 9.7 billion by 2050 [(1)](https://www.un.org/en/global-issues/population). The proportion of urban dwellers is expected to increase from 55% to 68% over the next 30 years [(2)](https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html). By 2030, the world will have about 43 megacities with populations of more than 10 million, including Tokyo, New Delhi, Shanghai, Mexico City, and New York. Transportation networks will become less efficient as rising car ownership increases congestion. As a result, congestion, air pollution, and aging infrastructure will affect both the health and overall quality of life of urban populations [(3)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243514/).

Despite government efforts to develop and promote greener means of transportation in metropolitan areas, the list of cities with severe traffic congestion continues to grow each year [(4)](https://www.cbsnews.com/pictures/worst-traffic-cities-in-the-world/38/). Recently, New York City became the most congested city in the United States [(5)](https://nypost.com/2021/03/09/nyc-has-americas-worst-traffic-congestion/). Personal cars, cabs, and other ride-sharing alternatives remain very popular as they are usually seen as a more efficient way to get to one's destination and offer more flexibility and comfort to their users than trains, subways, streetcars, and buses. However, with the ever-increasing traffic, are cars always the fastest way to get around town?

In this project, we will use New York City taxi trip records to study when and where the city is most congested. We will then use this information and combine it with additional data to build machine learning models that can predict travel time, with the ultimate goal of informing people about the best way to get around the city, i.e. by car or other means of transportation.

***
## Table of Contents:
    1. Data Preparation
        1.1 External Datasets
            1.1.1 Weather Forecast Dataset
            1.1.2 Holidays Dataset
            1.1.3 Taxi Zones Dataset
        1.2 Primary Dataset
            1.2.1 Taxi Trips Dataset
            1.2.2 Taxi Trips Subset
    2. Exploratory Data Analysis
        2.1 Primary Dataset
            2.1.1 Temporal Analysis
            2.1.2 Spatio-Temporal Analysis
        2.2 External Datasets
            2.2.1 Temporal Analysis of Weather Data
            2.2.2 Temporal Analysis of Holidays Data
        2.3 Combined Dataset
            2.3.1 Overall Features Correlation
    3. Machine Learning Models

***
## Python Libraries and Magic Commands Import

In [1]:
# Import python core libraries
import os

# Import data processing libraries
import pandas as pd
import geopandas as gpd

# Import other libraries
from IPython import display

In [2]:
# Set up magic commands
%matplotlib inline
%config Completer.use_jedi = False

***
## Data Import

In [3]:
# Import the weather forecasts dataset
weather_df_raw = pd.read_csv(r"data/raw/weather.csv")
weather_df_processed = pd.read_pickle(r"data/processed/weather.pickle")

# Import the holidays dataset
holidays_df_raw1 = pd.read_csv(r"data/raw/school_holidays.csv")
holidays_df_raw2 = pd.read_csv(r"data/raw/official_holidays.csv")
holidays_df_processed = pd.read_pickle(r"data/processed/holidays.pickle")

# Import the zones dataset
zones_df_raw = gpd.read_file(r"data/raw/zones.geojson")
zones_df_processed = pd.read_pickle(r"data/processed/zones.pickle")

# Import the taxi trips dataset
records_df_raw = pd.read_pickle(r"data/raw/taxi_records.pickle")
records_df_processed = pd.read_pickle(r"data/processed/train.pickle")

***
# Project Proposal

*PLEASE NOTE THAT IN-DEPTH ANALYSES ARE PROVIDED IN THE DATA PREPARATION I AND II, AND EXPLORATORY DATA ANALYSIS NOTEBOOKS!*

## Project Goal

In this work, we will first analyze where and when traffic congestion is highest and lowest in New York City. We will then build several machine learning models capable of predicting cab travel times in and around the city using only variables that can be easily obtained from a smartphone app or a website. Finally, we will compare their performance and explore whether additional variables, such as weather forecasts and holidays, improve the predictive performance of the models.

## Data

Many factors can affect traffic, including construction or renovation of road infrastructure, weather conditions, vacations or public events, to name a few. In this project, we will use several external datasets in addition to New York City taxi trip records, in hopes of improving the predictive power of our models.

**External Datasets:**

1. Weather Forecast
2. Holidays
3. Taxi Zones

**Primary Dataset:**

4. Taxi Trip Records

### 1. External Datasets: Weather Forecast

![](./images/figures/1-data_preparation/lineplot_weather.png)

The 2018 NYC weather forecast was collected from the [National Weather Service Forecast Office](https://w2.weather.gov/climate/index.php?wfo=okx) website. The dataset contains 365 rows and ten columns holding the date and daily measurements taken in Central Park from January to December 2018 (*Table 1*). These measurements are given in imperial units and include daily minimum, maximum, and average temperatures, precipitation, snowfall, and snow depth.

In [4]:
# Display the first five rows of the raw weather data frame
weather_df_raw.head()

Unnamed: 0,Date,max_temp,min_temp,avg_temp,dep_temp,hdd,cdd,prec,new_snow,snow_depth
0,1/1/2018,19,7,13.0,-22.2,52,0,0.0,0.0,T
1,1/2/2018,26,13,19.5,-15.5,45,0,0.0,0.0,0
2,1/3/2018,30,16,23.0,-11.8,42,0,0.0,0.0,T
3,1/4/2018,29,19,24.0,-10.7,41,0,0.76,9.8,1
4,1/5/2018,19,9,14.0,-20.5,51,0,0.0,0.0,7


***Table 1:** first five rows of the raw weather dataset.*

The weather dataset does not contain any missing or incorrect values, but it does contain a few outliers. These outliers will not be removed from the dataset, as they simply reflect rarer weather events. Measurements, including temperature, precipitation, and snow depth, are given in imperial units and will be converted to metric units. The snow depth and precipitation columns may contain the character T, which stands for "trace amounts". These characters will be replaced by zeros. The departure temperature (dep_temp), heating degree days (hdd), and cooling degree days (cdd) columns will be dropped because they do not provide useful information for model training. Finally, continuous variables will be binned according to their level of intensity, with the bin edges adjusted so that each level contains a sufficient number of data points. The lowest level (level 0) will correspond to no weather event, while the highest level (level 3 or 4) will correspond to the highest intensity of a weather event. Binning continuous variables can improve the performance of some models, linear ones in particular, by letting them capture non-linear effects. The dates will be used later to merge the weather forecast dataset with the primary dataset. A subset of the weather dataset after preprocessing is shown in *Table 2*.

In [5]:
# Display the first five rows of the data frame
weather_df_processed.head()

Unnamed: 0,date,avg_temp,prec,new_snow,snow_depth
0,2018-01-01,0,0,0,0
1,2018-01-02,0,0,0,0
2,2018-01-03,0,0,0,0
3,2018-01-04,0,1,3,1
4,2018-01-05,0,0,0,2


***Table 2:** first five rows of the processed weather dataset.*
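The weather preprocessing steps described above can be sketched as follows. This is only a minimal illustration on a few toy rows shaped like *Table 1*: the helper names `preprocess_weather` and `bin_levels`, the `prec_level` column, and the bin edges are assumptions made for this example, not the values used in the data preparation notebook.

```python
import pandas as pd

# Toy rows mimicking the raw weather columns of Table 1 (values are illustrative).
raw = pd.DataFrame({
    "Date": ["1/1/2018", "1/4/2018", "1/5/2018"],
    "avg_temp": [13.0, 24.0, 14.0],   # degrees Fahrenheit
    "prec": ["0.0", "0.76", "T"],     # inches; "T" marks a trace amount
    "snow_depth": ["T", "1", "7"],    # inches
})

def preprocess_weather(df):
    """Parse dates, replace trace markers by zero, and convert to metric units."""
    out = pd.DataFrame({"date": pd.to_datetime(df["Date"], format="%m/%d/%Y")})
    out["avg_temp"] = (df["avg_temp"] - 32) * 5 / 9             # Fahrenheit -> Celsius
    for col, factor in [("prec", 25.4), ("snow_depth", 2.54)]:  # inches -> mm / cm
        cleaned = df[col].astype(str).str.strip().replace("T", "0")
        out[col] = pd.to_numeric(cleaned) * factor
    return out

def bin_levels(values, edges):
    """Level 0 means no event (value <= 0); higher levels mean higher intensity."""
    bins = [-float("inf"), 0.0, *edges, float("inf")]
    return pd.cut(values, bins=bins, labels=False)

weather = preprocess_weather(raw)
# Hypothetical bin edges in millimetres; real edges would be tuned on the data.
weather["prec_level"] = bin_levels(weather["prec"], edges=[5.0, 20.0])
print(weather)
```

Keeping the "no event" level as its own bin makes level 0 directly interpretable, while the remaining edges can be adjusted until each level holds enough observations.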

### 2. External Datasets: Holidays

![](./images/figures/1-data_preparation/barplot_monthly_holidays.png)

The 2018 NYC holidays were collected from the [Office Holiday](https://www.officeholidays.com/countries/usa/new-york/2021) website and the [School Year Calendar](https://www.schools.nyc.gov/) released by the Department of Education. The school holidays dataset contains 27 rows and three columns: the month, day, and name of each holiday (*Table 3*). The official holidays dataset contains 16 rows and five columns: the day of the week, date, name, and type of each holiday, plus comments (*Table 4*).

In [6]:
# Display the first five rows of the raw school holidays dataset
holidays_df_raw1.head()

Unnamed: 0,Month,Day,Holiday Name
0,January,15,Dr Martin Luther King Jr day
1,March,30,Spring recess
2,March,1,Spring recess
3,March,2,Spring recess
4,March,3,Spring recess


***Table 3:** first five rows of the raw school holidays dataset.*

In [7]:
# Display the first five rows of the official holidays dataset
holidays_df_raw2.head()

Unnamed: 0,Day,Date,Holiday Name,Type,Comments
0,Monday,1-Jan,New Year's Day,Federal Holiday,
1,Monday,15-Jan,Martin Luther King Jr. Day,Federal Holiday,3rd Monday in January
2,Monday,12-Feb,Lincoln's Birthday,Government Holiday,"Connecticut, Illinois, Missouri, New York."
3,Monday,19-Feb,Washington's Birthday (Observed),Federal Holiday,3rd Monday in February
4,Sunday,13-May,Mother's Day,Not A Public Holiday,2nd Sunday in May. Not a public holiday


***Table 4:** first five rows of the raw official holidays dataset.*

Neither dataset contains incorrect or outlier values. Some comments are missing, but this has no consequence since that column will be dropped anyway. The columns containing date-related data will be combined and transformed into a date-time format. These date columns will then be used to combine the two datasets. Once combined, most columns except for date and holiday type will be removed, as these are the only two variables that can provide useful information for training our models. An additional column containing 1s will be added to the dataset to indicate that these dates are not regular days. Finally, the holiday types will be ordinal encoded so that holidays are ranked by national importance. If two holidays fall on the same date, the one with the higher rank will take precedence. Feature encoding is a crucial step, as many machine learning algorithms cannot handle non-numeric variables. Later, the dates will be used to merge the combined holidays dataset with the primary dataset. A subset of the holidays dataset after preprocessing is shown in *Table 5*.

In [8]:
# Display the first five rows of the processed holidays dataset
holidays_df_processed.head()

Unnamed: 0,holiday_date,holiday_type,holiday
27,2018-01-01,3,1
28,2018-01-15,3,1
29,2018-02-12,2,1
30,2018-02-19,3,1
2,2018-03-01,1,1


***Table 5:** first five rows of the processed holidays dataset.*
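As a rough sketch of the combination and encoding steps described above, the snippet below builds one date column per dataset, ranks the holiday types, and keeps the highest-ranked holiday for each date. The `RANKS` mapping and the `combine_holidays` helper are assumptions made for illustration; the ranks were chosen so that the output agrees with the values visible in *Table 5*, but the actual encoding used in the data preparation notebook may differ.

```python
import pandas as pd

# Toy rows shaped like Tables 3 and 4.
school = pd.DataFrame({
    "Month": ["January", "March"],
    "Day": [15, 1],
    "Holiday Name": ["Dr Martin Luther King Jr day", "Spring recess"],
})
official = pd.DataFrame({
    "Date": ["1-Jan", "15-Jan", "12-Feb"],
    "Holiday Name": ["New Year's Day", "Martin Luther King Jr. Day", "Lincoln's Birthday"],
    "Type": ["Federal Holiday", "Federal Holiday", "Government Holiday"],
})

# Hypothetical ordinal ranks: a higher value means higher national importance.
RANKS = {"School Holiday": 1, "Government Holiday": 2, "Federal Holiday": 3}

def combine_holidays(school_df, official_df, year=2018):
    """Merge both holiday tables and keep the highest-ranked holiday per date."""
    school_part = pd.DataFrame({
        "holiday_date": pd.to_datetime(
            school_df["Month"] + " " + school_df["Day"].astype(str) + " " + str(year),
            format="%B %d %Y",
        ),
        "holiday_type": RANKS["School Holiday"],
    })
    official_part = pd.DataFrame({
        "holiday_date": pd.to_datetime(
            official_df["Date"] + "-" + str(year), format="%d-%b-%Y"
        ),
        "holiday_type": official_df["Type"].map(RANKS),
    })
    combined = pd.concat([school_part, official_part], ignore_index=True)
    # Sorting by rank and keeping the last duplicate lets the higher rank win.
    combined = (
        combined.sort_values(["holiday_date", "holiday_type"])
        .drop_duplicates("holiday_date", keep="last")
        .reset_index(drop=True)
    )
    combined["holiday"] = 1  # flag: this date is not a regular day
    return combined

holidays = combine_holidays(school, official)
print(holidays)
```

In this toy example, January 15 appears in both tables, and the federal holiday (rank 3) takes precedence over the school holiday (rank 1).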

### 3. External Datasets: Taxi Zones

![](./images/figures/1-data_preparation/map_zones_boroughs.png)

The taxi zones were collected from the [Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website. The dataset contains 263 rows and seven columns (*Table 6*): an object identifier, a unique location identifier, the name of the zone, and the name of the borough in which it is located. The remaining three columns contain the length, area, and geospatial coordinates of the zone boundaries.

In [9]:
# Display the first five rows of the raw zones data frame
zones_df_raw.head()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,geometry
0,1,0.116357,0.000782,Newark Airport,1,EWR,"POLYGON ((-8258175.509 4967457.200, -8258179.5..."
1,2,0.43347,0.004866,Jamaica Bay,2,Queens,"MULTIPOLYGON (((-8217980.649 4959237.189, -821..."
2,3,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,"POLYGON ((-8220713.532 4993383.076, -8220638.4..."
3,4,0.043567,0.000112,Alphabet City,4,Manhattan,"POLYGON ((-8234500.209 4971984.014, -8234502.1..."
4,5,0.092146,0.000498,Arden Heights,5,Staten Island,"POLYGON ((-8257036.153 4948033.072, -8256954.6..."


***Table 6:** first five rows of the raw taxi zones dataset.*

The dataset does not contain any incorrect, missing, or outlier values. New columns will be added to the dataset. The first two will contain the latitude and longitude of the centroid of each zone. A third, called BoroughID, will contain a unique numeric identifier for each borough, and two more will hold borough-level coordinates (borough_latitude and borough_longitude). A subset of the taxi zones dataset after preprocessing is shown in *Table 7*.

In [10]:
# Display the first five rows of the processed zones data frame
zones_df_processed.head()

Unnamed: 0,OBJECTID,Shape_Leng,Shape_Area,zone,LocationID,borough,geometry,zone_latitude,zone_longitude,BoroughID,borough_latitude,borough_longitude
0,1,0.116357,0.000782,Newark Airport,1,EWR,"POLYGON ((-8258175.509 4967457.200, -8258179.5...",4966993.0,-8257012.0,0,4966993.0,-8257012.0
1,2,0.43347,0.004866,Jamaica Bay,2,Queens,"MULTIPOLYGON (((-8217980.649 4959237.189, -821...",4955975.0,-8218863.0,1,4969016.0,-8217521.0
2,3,0.084341,0.000314,Allerton/Pelham Gardens,3,Bronx,"POLYGON ((-8220713.532 4993383.076, -8220638.4...",4992372.0,-8220657.0,2,4990631.0,-8222784.0
3,4,0.043567,0.000112,Alphabet City,4,Manhattan,"POLYGON ((-8234500.209 4971984.014, -8234502.1...",4971680.0,-8235078.0,3,4979599.0,-8233970.0
4,5,0.092146,0.000498,Arden Heights,5,Staten Island,"POLYGON ((-8257036.153 4948033.072, -8256954.6...",4946581.0,-8258624.0,4,4950718.0,-8254718.0


***Table 7:** first five rows of the processed taxi zones dataset.*
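One simple way to obtain a numeric borough identifier like the one in *Table 7* is `pandas.factorize`, which numbers categories in order of first appearance. This is an assumption about how such a column could be built, not necessarily what the data preparation notebook does; the toy frame below reuses the five rows of *Table 6* without the geometry column. The zone centroids themselves could be derived from the `geometry` column (for instance with the `centroid` attribute of a geopandas `GeoSeries`), which is not shown here.

```python
import pandas as pd

# First five zones from Table 6, without the geometry column.
zones = pd.DataFrame({
    "LocationID": [1, 2, 3, 4, 5],
    "zone": ["Newark Airport", "Jamaica Bay", "Allerton/Pelham Gardens",
             "Alphabet City", "Arden Heights"],
    "borough": ["EWR", "Queens", "Bronx", "Manhattan", "Staten Island"],
})

# factorize assigns integer codes in order of first appearance.
codes, borough_names = pd.factorize(zones["borough"])
zones["BoroughID"] = codes

print(zones[["zone", "borough", "BoroughID"]])
```

On these rows the resulting codes agree with the BoroughID values displayed in *Table 7*.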

### 4. Primary Dataset: Taxi Trip Records

The 2018 NYC Taxi Trip dataset was collected from the [Google BigQuery](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-tlc-trips?project=jovial-monument-300209&folder=&organizationId=) platform. The original dataset contains more than 100 million yellow taxi trip records for 2018, but only 1% of it will be used in this project in order to avoid tedious computations and issues related to hardware limitations.

The dataset contains several variables, including pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. Of these, only pick-up datetime, drop-off datetime, passenger count, trip distance, tolls amount, fare amount, pick-up location ID, and drop-off location ID were ultimately selected for this project. Lastly, the target variable, i.e. trip duration, was computed as the time difference in minutes between pick-up and drop-off.

In [11]:
# Display the first five rows of the raw taxi trips data frame
records_df_raw.head()

Unnamed: 0,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,tolls_amount,fare_amount,pickup_location_id,dropoff_location_id,trip_duration
0,2018-04-26 12:11:19,2018-04-26 12:24:34,3,8.9,2.64,25.0,143,220,13.25
1,2018-06-19 10:02:34,2018-06-19 10:45:46,1,15.84,0.0,48.0,138,85,43.2
2,2018-09-05 13:51:40,2018-09-05 14:33:07,1,11.1,5.76,37.5,138,142,41.45
3,2018-05-11 10:20:58,2018-05-11 10:46:27,2,8.49,0.0,27.5,45,41,25.483333
4,2018-12-16 18:21:08,2018-12-16 18:46:04,6,7.79,0.0,24.5,229,106,24.933333


***Table 8:** first five rows of the raw taxi trips dataset.*
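The target variable can be reproduced from the two timestamp columns. The short sketch below uses the first two rows of *Table 8* and the same column names; it is only meant to illustrate the computation.

```python
import pandas as pd

# Pick-up and drop-off timestamps from the first two rows of Table 8.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2018-04-26 12:11:19", "2018-06-19 10:02:34"]),
    "dropoff_datetime": pd.to_datetime(["2018-04-26 12:24:34", "2018-06-19 10:45:46"]),
})

# Trip duration in minutes = (drop-off - pick-up) expressed in seconds, divided by 60.
elapsed = trips["dropoff_datetime"] - trips["pickup_datetime"]
trips["trip_duration"] = elapsed.dt.total_seconds() / 60

print(trips)
```

The two durations come out as 13.25 and 43.2 minutes, in line with the `trip_duration` column of *Table 8*.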

The dataset has no missing values, but contains erroneous values and outliers. More than a dozen records are from before or after 2018. Most of the numeric columns contain extreme and erroneous values (*Table 9*). For instance, trips with zero or more than five passengers are either errors or the result of cab drivers picking up additional passengers en route to their final destination. Some records have negative, zero, or extreme values in distance traveled, toll amount, fare amount, and trip duration. These records will be removed from the dataset.

In [12]:
# Display some descriptive statistics
records_df_raw.describe().round(2)

Unnamed: 0,passenger_count,trip_distance,tolls_amount,fare_amount,pickup_location_id,dropoff_location_id,trip_duration
count,1122346.0,1122346.0,1122346.0,1122346.0,1122346.0,1122346.0,1122346.0
mean,1.6,2.88,0.34,13.02,163.23,161.47,17.24
std,1.24,3.72,1.78,189.3,66.42,70.36,66.13
min,0.0,0.0,-18.0,-275.0,1.0,1.0,0.0
25%,1.0,0.96,0.0,6.5,114.0,107.0,6.58
50%,1.0,1.6,0.0,9.0,162.0,162.0,10.97
75%,2.0,2.93,0.0,14.5,233.0,233.0,18.08
max,9.0,106.45,770.76,200005.5,265.0,265.0,1439.97


***Table 9:** descriptive statistics of the numeric columns from the raw taxi trips dataset.*

The distributions of variables such as number of passengers, distance traveled, toll amount, fare amount, and trip duration are heavily skewed by the presence of many near-zero values and outliers. We will not eliminate the near-zero values because they correspond to real and frequent trips, but we will eliminate most of the outliers while trying to preserve some degree of variability so that our models can correctly estimate the duration of non-conventional taxi trips.

![images](./images/figures/1-data_preparation/scatter_num_cols_before.png)
![images](./images/figures/1-data_preparation/histplot_num_cols_before.png)
![images](./images/figures/1-data_preparation/boxplot_num_cols_before.png)
***Figure 1:** scatterplots, histograms, and boxplots of the primary dataset's numerical columns before data cleaning.*

Eliminating data based on the 99th and 95th percentiles of the variables would remove the most extreme or erroneous trip records while retaining a high degree of variability. The 95th percentile will be used to eliminate outliers from the passenger count column, while the 99th percentile will be used to eliminate outliers and erroneous data from all other columns. This process will remove about 7% of the total data.
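A percentile-based filter of this kind might look like the sketch below. The helper name `remove_outliers`, the optional `lower_bounds` argument, and the toy data are assumptions for illustration only; the thresholds mirror the 95th and 99th percentiles mentioned above.

```python
import pandas as pd

def remove_outliers(df, upper_quantiles, lower_bounds=None):
    """Keep rows at or below each column's quantile and above each lower bound."""
    mask = pd.Series(True, index=df.index)
    for col, q in upper_quantiles.items():
        # The quantile is computed on the full, unfiltered column.
        mask &= df[col] <= df[col].quantile(q)
    for col, low in (lower_bounds or {}).items():
        mask &= df[col] > low
    return df[mask]

# Toy data: 100 trips with a few extreme passenger counts and one extreme distance.
toy = pd.DataFrame({
    "passenger_count": [1] * 90 + [2] * 6 + [9] * 4,
    "trip_distance": [float(i) for i in range(1, 100)] + [1000.0],
})

cleaned = remove_outliers(
    toy,
    upper_quantiles={"passenger_count": 0.95, "trip_distance": 0.99},
    lower_bounds={"trip_distance": 0.0},  # drop zero or negative distances
)
print(len(toy), "->", len(cleaned), "rows")
```

Computing every threshold on the unfiltered data keeps the cut-offs independent of the order in which the columns are processed.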

![images](./images/figures/1-data_preparation/scatter_num_cols_after.png)
![images](./images/figures/1-data_preparation/histplot_num_cols_after.png)
![images](./images/figures/1-data_preparation/boxplot_num_cols_after.png)
***Figure 2:** scatterplots, histograms, and boxplots of the primary dataset's numerical columns after data cleaning.*

After removing the extreme and spurious values from the dataset, strong correlations between travel time, distance, and fare amount can be observed. In addition, each variable has a right-skewed distribution, so the data will need to be normalized before being used to train the different models. New features will also be engineered and encoded, including pick-up dates and times; the external datasets will be merged with the primary one; and variables such as fares, which are calculated from trip duration and could not be known in advance, will be dropped.

In [13]:
# Display the first five rows of the processed taxi trips data frame
records_df_processed.head()

Unnamed: 0,trip_distance,tolls_amount,pickup_location_id,dropoff_location_id,trip_duration,pickup_month,pickup_week,pickup_yearday,pickup_weekday,pickup_weekday_type,...,pickup_zone_latitude,pickup_zone_longitude,pickup_borough_latitude,pickup_borough_longitude,dropoff_borough_id,dropoff_zone_latitude,dropoff_zone_longitude,dropoff_borough_latitude,dropoff_borough_longitude,trip_within_borough
0,6.115492,0,144,237,20.4,12,52,360,2,0,...,4971260.0,-8237299.0,4979599.0,-8233970.0,3,4978272.0,-8233817.0,4979599.0,-8233970.0,1
1,2.253076,0,234,162,11.466667,4,16,108,2,0,...,4974117.0,-8236580.0,4979599.0,-8233970.0,3,4976519.0,-8234565.0,4979599.0,-8233970.0,1
2,2.574944,0,249,231,7.933333,9,39,272,5,1,...,4973270.0,-8237962.0,4979599.0,-8233970.0,3,4970802.0,-8238519.0,4979599.0,-8233970.0,1
3,1.255285,0,43,75,8.2,7,30,205,1,0,...,4980310.0,-8233808.0,4979599.0,-8233970.0,3,4981417.0,-8231603.0,4979599.0,-8233970.0,1
4,3.089933,0,163,140,8.633333,7,28,194,4,0,...,4977656.0,-8235145.0,4979599.0,-8233970.0,3,4977812.0,-8232604.0,4979599.0,-8233970.0,1


***Table 10:** first five rows of the primary dataset after processing.*

Seven new variables were created from the pick-up date and time column: month of the year, week of the year, day of the year, day of the week, type of day (weekday or weekend), time of day, and peak hours. The time range used for peak hours was determined using the analysis of daily traffic data available on the [TomTom website](https://www.tomtom.com/en_gb/traffic-index/new-york-traffic/). The toll amounts were grouped into three categories. The external datasets were merged with the primary dataset using dates. Finally, borough identifiers were added to the primary dataset, along with an additional feature indicating whether the trip started and ended in the same borough.
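The date and time features can be derived with the pandas `.dt` accessor, as in the sketch below. The first five column names follow *Table 10*; `pickup_hour`, `pickup_peak_hour`, the helper name, and the 06:00 to 21:00 peak window are assumptions made for this example. The two timestamps are illustrative and were chosen so that the derived values agree with the first and third rows of *Table 10*.

```python
import pandas as pd

def add_pickup_features(df, peak_hours=range(6, 21)):
    """Derive calendar and time-of-day features from the pick-up timestamp."""
    ts = df["pickup_datetime"]
    out = df.copy()
    out["pickup_month"] = ts.dt.month
    out["pickup_week"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
    out["pickup_yearday"] = ts.dt.dayofyear
    out["pickup_weekday"] = ts.dt.weekday                      # Monday = 0
    out["pickup_weekday_type"] = (ts.dt.weekday >= 5).astype(int)  # 1 = weekend
    out["pickup_hour"] = ts.dt.hour
    # Hypothetical peak window; the actual range was taken from TomTom data.
    out["pickup_peak_hour"] = ts.dt.hour.isin(list(peak_hours)).astype(int)
    return out

trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2018-12-26 08:30:00", "2018-09-29 23:15:00"]),
})
features = add_pickup_features(trips)
print(features.T)
```

Note that ISO numbering assigns 31 December 2018 to week 1 of 2019, so a different week convention may be needed to reproduce a week 53.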

**The combined New York City taxi trip data was ultimately split into a training set (80%) and a test set (20%).**
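A random 80/20 split can be sketched with pandas alone, as shown below. The helper name and the seed are assumptions; scikit-learn's `train_test_split` would be an equally valid choice, and the notebook may use a different method.

```python
import pandas as pd

def split_train_test(df, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the rows as a test set."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test

# Toy frame with 100 rows standing in for the combined dataset.
toy = pd.DataFrame({"trip_duration": [float(i) for i in range(100)]})
train, test = split_train_test(toy)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that every model is evaluated on the same held-out trips.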

## Exploratory Data Analysis


**Note that the number of taxi trips shown below corresponds to 0.8% of the whole dataset.**

![images](./images/figures/2-exploratory_data_analysis/heatmap_hour_month.png)

***Figure 3:** heat map of the number of taxi trips by month of year and hour.*

The largest increase in the number of trips is observed in March, while the months of July through September and November through December show a slight decrease compared to the rest of the year. Looking at the hourly scale, we observe that the number of hourly trips starts to decrease rapidly after midnight and increases again around 18:00. The daily peak is reached around 19:00.

![images](./images/figures/2-exploratory_data_analysis/heatmap_hour_week.png)

***Figure 4:** heat map of the number of taxi trips by week and hour.*

The largest increase in trips is seen in March, specifically from week 9 to week 13. This increase coincides with the start and end of spring break, a national school vacation whose start and end dates depend on the state but which typically begins in late February and ends in late March. The week of the national holiday of July 4 (week 27), the week of Thanksgiving (week 47), and New Year's Eve (week 53) may also account for some of the slight declines observed in July, November, and December 2018.

![images](./images/figures/2-exploratory_data_analysis/heatmap_hour_day.png)

***Figure 5:** heat map of the number of taxi trips by day and hour.*

On weekdays, the number of hourly trips begins to decline rapidly after midnight and increases again around 06:00, while on weekends, taxi trips remain high until 16:00 and decline significantly between 05:00 and 08:00. On weekdays, the daily peak is reached around 19:00, while on weekends the number of trips is relatively stable throughout the day and until late at night. For the first 31 days of the year, the number of rides appears to drop sharply from Saturday to Sunday. In addition, on January 4, 2018, New York City was hit hard by a powerful cyclonic blizzard, which caused many disruptions.

![images](./images/figures/2-exploratory_data_analysis/barplot_avg_day_trip.png)

***Figure 6:** barplot of the average number of taxi trips per day.*

The average number of taxi trips increases from Monday to Friday and then decreases again on weekends, with the largest decrease on Sunday. The closing of offices on weekends may explain the decline in taxi trips on Saturday and Sunday. However, the gradual increase in daily trips from Monday to Friday is more difficult to explain. One hypothesis could be that bars and restaurants that are open on weekends close on Mondays.

![images](./images/figures/2-exploratory_data_analysis/barplot_avg_hour_trip.png)

***Figure 7:** barplot of the average number of taxi trips per day and hour.*

The average number of hourly trips can vary significantly throughout the year, especially during peak hours, which begin at 06:00 and end around 21:00. Moreover, the number of trips is not necessarily correlated with traffic fluidity and congestion, and thus with trip duration, the variable we are trying to predict. Below we look at the average length, duration, and speed of trips on a monthly, daily, and hourly scale. Trip speed will be used as an indicator of traffic fluidity/congestion.
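Trip speed is not part of the raw records, but it follows directly from distance and duration. The sketch below uses the first two rows of *Table 8* and assumes that the raw `trip_distance` is expressed in miles; the `trip_speed` column name is chosen here for illustration.

```python
import pandas as pd

# Distance and duration (minutes) from the first two rows of Table 8.
trips = pd.DataFrame({
    "trip_distance": [8.9, 15.84],
    "trip_duration": [13.25, 43.2],
})

# Average speed = distance / time, with the duration converted from minutes to hours.
trips["trip_speed"] = trips["trip_distance"] / (trips["trip_duration"] / 60)

print(trips.round(2))
```

Under the miles assumption, these two trips average roughly 40.3 and 22.0 miles per hour, respectively.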

![images](./images/figures/2-exploratory_data_analysis/boxplot_distance_month.png)

***Figure 8:** boxplot of daily average distance traveled for each month of the year*

The average trip distance shows no significant changes throughout 2018.

![images](./images/figures/2-exploratory_data_analysis/boxplot_duration_month.png)

***Figure 9:** boxplot of daily average travel time for each month of the year*

The average trip duration shows no significant changes throughout 2018.

![images](./images/figures/2-exploratory_data_analysis/boxplot_speed_month.png)

***Figure 10:** boxplot of daily average travel speed for each month of the year*

The average trip speed shows no significant changes throughout 2018.

![images](./images/figures/2-exploratory_data_analysis/lineplot_distance_hour.png)

***Figure 11:** lineplot of daily average distance traveled for each hour of the day and day of week.*

The average distance traveled is significantly lower between 00:00 and 02:00, and significantly higher between 05:00 and 08:00 on Saturdays and Sundays. This may suggest that New Yorkers do not travel to the same locations on weekdays and weekends.

![images](./images/figures/2-exploratory_data_analysis/lineplot_duration_hour.png)

***Figure 12:** lineplot of daily average travel time for each hour of the day and day of week.*

The average travel time is significantly lower on Saturdays and Sundays during peak hours. This may suggest that New Yorkers do not travel to the same places on weekdays and weekends. Another possible interpretation could corroborate previous observations, namely that fewer taxi trips may be a good indicator of less traffic and potentially better traffic conditions.

![images](./images/figures/2-exploratory_data_analysis/lineplot_speed_hour.png)

***Figure 13:** lineplot of daily average travel speed for each hour of the day and day of week.*

As a result of the higher distances traveled in the early morning and the lower travel durations during rush hour, the average travel speed is significantly higher on weekends between 06:00 and 11:00.

![images](./images/figures/2-exploratory_data_analysis/map_pickup_dropoff.png)

***Figure 14:** heat map of the number of annual taxi trips by location.*

There are no significant differences between pick-up and drop-off locations, with the exception of JFK and LaGuardia airports. There, people arriving at the airports appear more likely to take a cab into New York City than people are to take one to the airport. In addition, most pick-ups appear to be located in Midtown Manhattan, whereas drop-off locations are slightly more evenly distributed, including in more suburban areas such as Queens, the Bronx, and Brooklyn. Manhattan is the heart of New York City with its many stores, bars, and restaurants, and is also one of the most densely populated boroughs. As a result, traffic jams are more likely to occur, and special attention must be paid to travel times, which are more likely to vary considerably depending on a variety of factors, including holidays and weather conditions.

![images](./images/figures/2-exploratory_data_analysis/map_dropoff_week-end.png)

***Figure 15:** heat maps of the percentage of annual taxi trips by drop-off location for weekdays and weekends.*

Midtown Manhattan is more heavily traveled during the week, and the southeast suburbs during the weekend. The latter include Hell's Kitchen, the East Village, and the Lower East Side, which are known for their lively nightlife.

![images](./images/figures/2-exploratory_data_analysis/map_pickup_week-end.png)

***Figure 16:** heat maps of the percentage of annual taxi trips by pick-up location for weekdays and weekends.*

Similarly to the previous observations, Midtown Manhattan is more heavily traveled during the week, and the southeast suburbs during the weekend. Pick-up and drop-off locations do not change significantly during the week, but they do change between weekdays and weekends. However, is the reduction in the number of taxi trips linked to shorter travel times?

![images](./images/figures/2-exploratory_data_analysis/heatmap_boroughs.png)

***Figure 17:** heat map of average number of daily taxi trips by origin and destination borough*

The majority of the trips are within the same borough, and most of them start and end in Manhattan. As noted above, another significant portion of trips start in Manhattan and end in Queens or Brooklyn. Below we will have a closer look at the number of trips taking place within the same borough or between boroughs.
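An origin-destination table of this kind can be obtained with `pandas.crosstab`. The sketch below works on a handful of made-up trips and counts raw trips rather than daily averages; the `pickup_borough` and `dropoff_borough` column names are assumptions, while `trip_within_borough` follows *Table 10*.

```python
import pandas as pd

# Made-up trips, for illustration only.
trips = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Brooklyn"],
    "dropoff_borough": ["Manhattan", "Queens", "Manhattan", "Manhattan", "Brooklyn"],
})

# Flag trips that start and end in the same borough.
same_borough = trips["pickup_borough"] == trips["dropoff_borough"]
trips["trip_within_borough"] = same_borough.astype(int)

# Origin-destination matrix: rows are pick-up boroughs, columns are drop-off boroughs.
od_matrix = pd.crosstab(trips["pickup_borough"], trips["dropoff_borough"])

print(od_matrix)
```

Dividing such counts by the number of days covered would give the average number of daily trips for each origin-destination pair.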

![images](./images/figures/2-exploratory_data_analysis/barplot_distance_boroughs.png)

***Figure 18:** barplot of the average daily distance traveled by borough*

The distance traveled is significantly lower for trips within the same borough than to another borough. In addition, excluding Staten Island and EWR, the travel distance is the lowest for Manhattan.

![images](./images/figures/2-exploratory_data_analysis/barplot_duration_boroughs.png)

***Figure 19:** barplot of the average daily travel duration by borough*

The average travel duration is significantly lower for trips within the same borough than to another borough, which may be expected since the distance traveled is also lower. However, travel time in Manhattan is comparable to that of Brooklyn for trips within the same borough.

![images](./images/figures/2-exploratory_data_analysis/barplot_speed_boroughs.png)

***Figure 20:** barplot of the average daily travel speed by borough*

The difference in average travel speed between trips within a borough and to another borough is not as significant as the difference in average travel distance and duration. Manhattan, with its high population density and numerous intersections, appears to be one of the boroughs most strongly affected by traffic density, since the travel speed within this borough is the lowest.

Tollbooths are typically located before bridges, tunnels, and on highways in the New York metropolitan area. Paying at a tollbooth usually implies a longer distance to travel to the destination. These facilities are also a major cause of traffic congestion because cars must stop to pay at the booth. However, the booths have recently been replaced by new [automated systems](https://www.localsyr.com/news/local-news/construction-to-remove-toll-booths-along-nys-thruway-resumes/).

![images](./images/figures/2-exploratory_data_analysis/boxplot_distance_toll.png)

***Figure 21:** boxplot of the distance traveled by toll category*

![images](./images/figures/2-exploratory_data_analysis/boxplot_duration_toll.png)

***Figure 22:** boxplot of the travel duration by toll category*

![images](./images/figures/2-exploratory_data_analysis/boxplot_speed_toll.png)

***Figure 23:** boxplot of the travel speed by toll category*

The average distance traveled is three to five times greater when the trip passes through a tollbooth. However, there is no significant difference between high and low toll prices.