# FareCast NYC: Empowering TLC Riders with Fare Estimation

## Overview

**Executive Summary:**

Automatidata, a fictional data consulting firm, is collaborating with the New York City Taxi and Limousine Commission (TLC). Since its establishment in 1971, the TLC has been responsible for regulating and overseeing the licensing of various transportation services in New York City, including taxi cabs, for-hire vehicles, commuter vans, and paratransit vehicles <sup>[1]</sup>. The TLC envisions the development of a user-friendly app, and in partnership with Automatidata, aims to enhance the overall experience for TLC riders by providing the ability to estimate taxi fares in advance.

This initiative fits Automatidata's mission of turning stored, underused data into practical solutions. Using the NYC TLC dataset, our goal is to build a regression model that estimates taxi fares before a trip begins. The FareCast NYC app is intended to meet the TLC's business needs and to give riders clearer, more transparent fare information.

For more information about the TLC, visit their official page: [About TLC](https://www.nyc.gov/site/tlc/about/about-tlc.page).

**Project Objectives:**

1. Develop a robust fare estimation model:
    * Utilise historical data from the NYC TLC dataset to construct a reliable regression model for accurately estimating taxi fares.
2. Ensure accuracy and reliability:
    * Implement rigorous testing and optimization procedures to guarantee the precision and dependability of fare estimates generated by the model.
3. Create a user-friendly app interface:
    * Design an intuitive app interface tailored for TLC riders, facilitating easy and efficient estimation of taxi fares.
4. Enhance user satisfaction and confidence:
    * Improve overall user satisfaction and instil confidence in NYC taxi services by providing transparent and trustworthy fare estimation through the developed app.

**Data Sources:**

The primary data source for this project is the "2017 Yellow Taxi Trip Data", available on NYC OpenData. The project works with a subset of this dataset containing 22,699 rows and 18 columns, where each row represents a single yellow taxi trip.

For a detailed description of each attribute and its meaning, please refer to the [Data Dictionary](https://github.com/sssunri/farecast_NYC-TLC_estimation/blob/main/data_dictionary_trip_records_yellow.pdf) provided by the New York City Taxi & Limousine Commission. Leveraging this rich data dictionary, we will gain a comprehensive understanding of the dataset, which is crucial for building a robust fare estimation model.

## Data Exploration

**Load the Data**

In [1]:
# import libraries and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

plt.rcParams['figure.figsize'] = 15, 5
pd.set_option('display.max_columns', None)

In [2]:
# load dataset into dataframe
df = pd.read_csv('data/2017_Yellow_Taxi_Trip_Data.csv')

**Initial Exploration**

In [3]:
df.head()

| | Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
| 1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.8 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.0 | 0.0 | 0.3 | 20.8 |
| 2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.0 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
| 3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.7 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
| 4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 17.8 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

Most columns have appropriate data types. The exceptions are `tpep_pickup_datetime` and `tpep_dropoff_datetime`, which hold the pickup and dropoff timestamps but are stored as `object` (string) columns. Converting them to `datetime` will make it straightforward to extract time-based features, such as trip duration or hour of day, for later analysis and model training.
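The conversion described above could look like the following minimal sketch. It runs on two rows copied from the `df.head()` output rather than on the full dataset. The format string is an assumption inferred from those sample rows, and the derived column names (`duration_min`, `pickup_hour`, `pickup_day`) are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Two rows copied from the df.head() output; the notebook itself would
# apply the same calls to the full `df`.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["03/25/2017 8:55:43 AM", "04/11/2017 2:53:28 PM"],
    "tpep_dropoff_datetime": ["03/25/2017 9:09:47 AM", "04/11/2017 3:19:58 PM"],
})

# Assumed format, inferred from the sample rows: month/day/year, 12-hour clock.
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    sample[col] = pd.to_datetime(sample[col], format=DATETIME_FORMAT)

# Hypothetical derived features, named here only for illustration.
sample["duration_min"] = (
    sample["tpep_dropoff_datetime"] - sample["tpep_pickup_datetime"]
).dt.total_seconds() / 60
sample["pickup_hour"] = sample["tpep_pickup_datetime"].dt.hour
sample["pickup_day"] = sample["tpep_pickup_datetime"].dt.day_name()

print(sample.dtypes)
print(sample[["duration_min", "pickup_hour", "pickup_day"]])
```

Passing an explicit `format` avoids per-row format inference and makes a malformed timestamp raise an error instead of being silently misparsed.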

All columns have non-null counts matching the total number of entries (22,699), so there are no missing values. Several attributes are likely to influence fare amounts, such as `trip_distance`, `passenger_count`, `payment_type`, and features derived from the datetime columns. The column `Unnamed: 0` appears to be a leftover index or record identifier; it is unlikely to help predict fares, so it can be dropped to keep the dataset tidy.
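As a rough sketch of these two checks, the snippet below counts missing values per column and drops the redundant column. It uses three rows and four columns copied from the `df.head()` output as a stand-in for the full dataframe; the variable names `missing_per_column` and `cleaned` are illustrative only.

```python
import pandas as pd

# Stand-in for the full dataframe: three rows copied from df.head().
sample = pd.DataFrame({
    "Unnamed: 0": [24870114, 35634249, 106203690],
    "VendorID": [2, 1, 1],
    "trip_distance": [3.34, 1.80, 1.00],
    "fare_amount": [13.0, 16.0, 6.5],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = sample.isna().sum()

# Check whether the column behaves like an identifier before dropping it.
looks_like_identifier = sample["Unnamed: 0"].is_unique

# errors="ignore" keeps the cell safe to re-run after the column is gone.
cleaned = sample.drop(columns=["Unnamed: 0"], errors="ignore")

print(missing_per_column)
print(looks_like_identifier)
print(cleaned.columns.tolist())
```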

In [5]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 22699.0 | 56758490.0 | 32744930.0 | 12127.0 | 28520556.0 | 56731504.0 | 85374524.0 | 113486300.0 |
| VendorID | 22699.0 | 1.556236 | 0.4968384 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| passenger_count | 22699.0 | 1.642319 | 1.285231 | 0.0 | 1.0 | 1.0 | 2.0 | 6.0 |
| trip_distance | 22699.0 | 2.913313 | 3.653171 | 0.0 | 0.99 | 1.61 | 3.06 | 33.96 |
| RatecodeID | 22699.0 | 1.043394 | 0.7083909 | 1.0 | 1.0 | 1.0 | 1.0 | 99.0 |
| PULocationID | 22699.0 | 162.4124 | 66.63337 | 1.0 | 114.0 | 162.0 | 233.0 | 265.0 |
| DOLocationID | 22699.0 | 161.528 | 70.13969 | 1.0 | 112.0 | 162.0 | 233.0 | 265.0 |
| payment_type | 22699.0 | 1.336887 | 0.4962111 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| fare_amount | 22699.0 | 13.02663 | 13.24379 | -120.0 | 6.5 | 9.5 | 14.5 | 999.99 |
| extra | 22699.0 | 0.3332746 | 0.4630966 | -1.0 | 0.0 | 0.0 | 0.5 | 4.5 |


The descriptive statistics show numeric features on very different scales. Most trips carry one or two passengers (the 25th percentile and median are both 1, and the 75th percentile is 2), with a maximum of six. The median `payment_type` is 1 (credit card), so at least half of all trips are paid by card. The mean trip distance is approximately 2.91 miles, well above the median of 1.61 miles, which points to a right-skewed distribution. `RatecodeID` equals 1 up to the 75th percentile, so the large majority of trips fall under code 1, although its maximum of 99 should be checked against the data dictionary.
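Category shares such as these are easier to read from `value_counts` than from `describe`. The sketch below shows the idea on the five rows from the `df.head()` output, so the printed proportions describe only those five rows and not the full dataset. The `PAYMENT_LABELS` mapping reflects my reading of the TLC data dictionary and should be verified against it before use.

```python
import pandas as pd

# The five rows shown by df.head(); the full `df` would be used in practice.
head_rows = pd.DataFrame({
    "passenger_count": [6, 1, 1, 1, 1],
    "payment_type": [1, 1, 1, 1, 2],
})

# Assumed labels, based on the TLC data dictionary; verify before relying on them.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute"}

# Share of trips per payment method.
payment_share = (
    head_rows["payment_type"]
    .map(PAYMENT_LABELS)
    .value_counts(normalize=True)
)

# Share of trips per passenger count, ordered by count.
passenger_share = (
    head_rows["passenger_count"].value_counts(normalize=True).sort_index()
)

print(payment_share)
print(passenger_share)
```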

Several columns contain values that need further investigation. `passenger_count` has a minimum of 0, `fare_amount` has a minimum of -120, and `extra` also goes negative (-1). At the other extreme, the maximum `fare_amount` of 999.99 is far above the 75th percentile of 14.5, and the related columns `tip_amount` and `total_amount` likewise have maximum values that appear to be outliers. Because outliers can distort summary statistics and regression fits, a comprehensive exploratory data analysis (EDA) is needed to understand the distributions and decide how to handle these values.
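One common way to flag such values is the 1.5 × IQR rule. It is offered here as one possible approach, not necessarily the method used later in this project. The sketch below takes the `fare_amount` quartiles from the `describe()` output and applies the resulting fences to a small illustrative series. That series combines the five fares from `df.head()` with the reported minimum and maximum. The helper `iqr_bounds` is a hypothetical name.

```python
import pandas as pd


def iqr_bounds(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of fare_amount as reported by df.describe().
lower, upper = iqr_bounds(q1=6.5, q3=14.5)

# Illustrative series: the five df.head() fares plus the reported min and max.
fares = pd.Series(
    [13.0, 16.0, 6.5, 20.5, 16.5, -120.0, 999.99], name="fare_amount"
)

flags = pd.DataFrame({
    "fare_amount": fares,
    "negative_fare": fares < 0,
    "outside_iqr_fences": (fares < lower) | (fares > upper),
})

print(lower, upper)
print(flags)
```

On the full dataframe, the quartiles would come from `df["fare_amount"].quantile([0.25, 0.75])` rather than being typed in by hand.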

**Acknowledgments**

This project is inspired by the [Google Advanced Data Analytics Professional Certificate](https://www.coursera.org/professional-certificates/google-advanced-data-analytics) program, specifically the end-of-course portfolio project titled *Automatidata*.

This project utilises a subset of the [2017 Yellow Taxi Trip Data](https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs) obtained from the New York City Taxi & Limousine Commission, originally published as part of the NYC Open Data program. The dataset, containing 22,699 rows representing different trips and 18 columns, was provided by the Google Advanced Data Analytics Professional Certificate program for the purpose of the end-of-course portfolio project. Invaluable insights into the dataset's attributes and their meanings were derived from the detailed [Data Dictionary](https://data.cityofnewyork.us/api/views/biws-g3hs/files/eb3ccc47-317f-4b2a-8f49-5a684b0b1ecc?download=true&filename=data_dictionary_trip_records_yellow.pdf) provided by the New York City Taxi & Limousine Commission. We acknowledge the contribution of both the NYC TLC and the Google program to the dataset used in this analysis.