## NYC Taxi Fare Prediction using Exploratory Data Analysis and Machine Learning

Source Dataset: https://data.cityofnewyork.us/Transportation/2017-Yellow-Taxi-Trip-Data/biws-g3hs/about_data  
Data Provided By: Taxi and Limousine Commission (TLC)  
Dataset Owner: NYC OpenData

#### Import libraries

In [5]:
# Basic imports for data analysis
import numpy as np
import pandas as pd
import datetime as dt

# Importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for data modeling
from sklearn.model_selection import train_test_split


#### Review Data Available

In [None]:
# Load dataset
df_original = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
df_original.head()

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [4]:
# Variable overview
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  to

**Observations**:
- The original dataset contains 113M records; for practice, we are working with a smaller subset of 22,699 records.  
- The column 'Unnamed: 0' is not mentioned in the data dictionary and is likely a randomly generated number used as an index.
- Columns 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' hold date/time values but are stored as `object` (string) dtype, so they need to be converted to a datetime type.
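
The datetime conversion noted above could look roughly like the sketch below. It uses two rows copied from the `head()` output rather than the full file, and the format string is an assumption inferred from those rows (month/day/year with a 12-hour clock); it should be checked against the whole column before use.

```python
import pandas as pd

# Two rows copied from the df_original.head() output above.
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["03/25/2017 8:55:43 AM", "04/15/2017 11:32:20 PM"],
    "tpep_dropoff_datetime": ["03/25/2017 9:09:47 AM", "04/15/2017 11:49:03 PM"],
})

# Assumed format: month/day/year, 12-hour clock, AM/PM marker.
DATETIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Convert both string columns to datetime64 in place.
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    sample[col] = pd.to_datetime(sample[col], format=DATETIME_FORMAT)

print(sample.dtypes)
```

Passing an explicit `format` makes the conversion fail loudly on rows that do not match, which is usually preferable to silently guessing the layout.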

In [8]:
# Summary statistics
df_original.describe(include='all')

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,,22687,22688,,,,2,,,,,,,,,,
top,,,04/15/2017 6:05:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,,2,2,,,,22600,,,,,,,,,,
mean,56758490.0,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,32744930.0,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,12127.0,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,28520560.0,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,56731500.0,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,85374520.0,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


**Observations**:
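- There seem to be outliers in fare_amount, trip_distance, total_amount, etc., as the max. value of ~1000 USD is much higher than the mean (about 13 USD for fare_amount and about 16 USD for total_amount).
- There are negative values in fare_amount, extra, mta_tax, improvement_surcharge and total_amount (for example, the minimum fare_amount is -120 USD).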
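
As a rough illustration of how these two observations could be quantified, the sketch below counts negative entries per column and computes an upper outlier fence with the common 1.5 × IQR rule. The helper names `count_negative_values` and `iqr_upper_fence` are hypothetical, the IQR rule is one possible choice rather than the method used in this notebook, and the demo frame is illustrative only (its last row is made up from the minimum values reported by `describe()`).

```python
import pandas as pd

def count_negative_values(df, columns):
    """Return the number of negative entries in each listed column."""
    return (df[columns] < 0).sum()

def iqr_upper_fence(q1, q3, k=1.5):
    """Upper fence of the IQR rule: values above it are flagged as outliers."""
    return q3 + k * (q3 - q1)

# Illustrative frame: the first three rows come from the head() output above;
# the last row is hypothetical, assembled from the column minimums in describe().
demo = pd.DataFrame({
    "fare_amount": [13.0, 16.0, 6.5, -120.0],
    "extra": [0.0, 0.0, 0.0, -1.0],
    "total_amount": [16.56, 20.8, 8.75, -120.3],
})

negatives = count_negative_values(demo, ["fare_amount", "extra", "total_amount"])
print(negatives)

# Quartiles for fare_amount taken from the describe() output (25% = 6.5, 75% = 14.5).
fare_fence = iqr_upper_fence(6.5, 14.5)
print(fare_fence)
```

On the real data, `count_negative_values(df_original, [...])` would show how many rows are affected, which helps decide whether to drop or correct them.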

Based on the observations above, we can plan our EDA approach as follows:
- Data Cleaning
    - Check/eliminate/deal with duplicates
    - Check/eliminate/deal with Nulls
    - Modify columns as needed
    - Identify outliers, isolate to assess how to deal with them
    - Check for incorrect values

- Draw insights from variables
    - Derive 'trip_duration' column based on date/time values provided
    - Are trip distance and trip duration correlated strongly enough that we can keep just one?
    - Does the data sample capture all kinds of rate code IDs?
    - Do the observed numbers vary between the different data providers (VendorID)?
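
A minimal sketch of a few of the planned steps (duplicate and null checks, deriving 'trip_duration', and checking its correlation with 'trip_distance') is shown below. It runs on four rows copied from the `head()` output, so the resulting numbers are not meaningful; the datetime format string and the choice of minutes as the duration unit are assumptions.

```python
import pandas as pd

# Four rows copied from the df_original.head() output above (subset of columns).
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["03/25/2017 8:55:43 AM", "04/11/2017 2:53:28 PM",
                             "12/15/2017 7:26:56 AM", "05/07/2017 1:17:59 PM"],
    "tpep_dropoff_datetime": ["03/25/2017 9:09:47 AM", "04/11/2017 3:19:58 PM",
                              "12/15/2017 7:34:08 AM", "05/07/2017 1:48:14 PM"],
    "trip_distance": [3.34, 1.8, 1.0, 3.7],
})

# Data cleaning checks: fully duplicated rows and missing values.
n_duplicates = int(trips.duplicated().sum())
n_missing = int(trips.isna().sum().sum())

# Derive trip_duration in minutes (assumed unit) from the two timestamps.
fmt = "%m/%d/%Y %I:%M:%S %p"  # assumed format, inferred from the sample rows
pickup = pd.to_datetime(trips["tpep_pickup_datetime"], format=fmt)
dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"], format=fmt)
trips["trip_duration"] = (dropoff - pickup).dt.total_seconds() / 60

# Pearson correlation between distance and duration.
corr = trips["trip_distance"].corr(trips["trip_duration"])

print(n_duplicates, n_missing)
print(trips[["trip_distance", "trip_duration"]])
print(corr)
```

A correlation close to 1 on the full dataset would suggest the two variables carry largely the same information, which is what the question about keeping just one is getting at.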

#### Exploratory Data Analysis

#### Model Selection/Checking Assumptions

#### Model Build and Initial Assessment

#### Final Assessment and Summary