
## <b> Data Description </b>

### The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC), then sampled and cleaned for the purposes of this project. Based on individual trip attributes, you should predict the duration of each trip in the test set.

### <b>NYC Taxi Data.csv</b> - the training set (contains 1458644 trip records)


### Data fields
* #### id - a unique identifier for each trip
* #### vendor_id - a code indicating the provider associated with the trip record
* #### pickup_datetime - date and time when the meter was engaged
* #### dropoff_datetime - date and time when the meter was disengaged
* #### passenger_count - the number of passengers in the vehicle (driver entered value)
* #### pickup_longitude - the longitude where the meter was engaged
* #### pickup_latitude - the latitude where the meter was engaged
* #### dropoff_longitude - the longitude where the meter was disengaged
* #### dropoff_latitude - the latitude where the meter was disengaged
* #### store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server - Y=store and forward; N=not a store and forward trip
* #### trip_duration - duration of the trip in seconds

## **STEPS** 
### 1. Data Analysis
### 2. Feature Engineering
### 3. Feature Selection
### 4. Model Building
### 5. Model Validation & Selection


In [None]:
import numpy as np
import pandas as pd
import math  # `from numpy import math` fails on recent NumPy releases
from datetime import datetime

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from lightgbm import LGBMRegressor

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

import seaborn as sns
import folium
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

### DATA ANALYSIS

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Loading Data
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/NYC Taxi Data.csv')

In [None]:
dataset.head()

In [None]:
dataset.shape

(1458644, 11)

There are approx 1.46 million records in our dataset.

In [None]:
dataset.describe()

A preliminary look at the output of `describe()` shows anomalous values in passenger_count and trip_duration that need to be addressed later.
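
To make the anomalies above concrete, here is a minimal sketch that counts suspicious rows. It runs on a small made-up frame rather than the real `dataset`. The duration bounds (`MIN_SECONDS`, `MAX_SECONDS`) are hypothetical thresholds chosen for illustration, and the 1 to 6 passenger range is the one the notebook applies later.

```python
import pandas as pd

# Toy frame standing in for `dataset`; the values are made up for illustration.
toy = pd.DataFrame({
    "passenger_count": [1, 0, 2, 9, 1],
    "trip_duration": [450, 600, 1, 900, 3_000_000],  # seconds
})

# Hypothetical thresholds (assumptions, not taken from the notebook):
# trips shorter than 10 seconds or longer than 24 hours count as suspicious.
MIN_SECONDS, MAX_SECONDS = 10, 24 * 3600

bad_passengers = ~toy["passenger_count"].between(1, 6)
bad_duration = ~toy["trip_duration"].between(MIN_SECONDS, MAX_SECONDS)

anomaly_counts = pd.Series({
    "passenger_count outside 1-6": int(bad_passengers.sum()),
    "trip_duration outside bounds": int(bad_duration.sum()),
})
print(anomaly_counts)
```

Swapping `toy` for `dataset` would give the corresponding counts on the real data.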

In [None]:
# Checking null values
dataset.isnull().sum()

In [None]:
# Checking duplicated values
dataset.duplicated().sum()

0

*There are no null values and no duplicated rows in the given dataset.*

In [None]:
# Copying data to new dataframe for further analysis
df = dataset.copy()

In [None]:
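# Let's look at the distribution plots of the numeric features
pos = 1
fig = plt.figure(figsize=(18,26))
for i in df.describe().columns:
    ax = fig.add_subplot(6,2,pos)
    pos = pos + 1
    # distplot is deprecated in seaborn; histplot with kde=True replaces it
    sns.histplot(df[i], kde=True, stat="density", ax=ax)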

Inferences from the distribution plots:

1. There are two major vendors in NYC.

2. Trips with a single passenger are by far the most common.

3. The distribution of trip duration is highly skewed.
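
One common way to handle a highly skewed target is a log transform. The sketch below is not a step taken in this part of the notebook. It uses synthetic log-normal durations (an assumption, not the real trip_duration column) to show how `np.log1p` changes the skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed durations in seconds, for illustration only.
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=10_000))

raw_skew = durations.skew()
log_skew = np.log1p(durations).skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness near zero after the transform suggests the log scale is closer to symmetric. Whether that holds for the real column would need to be checked on `df['trip_duration']`.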

### Preliminary analysis of independent variables

#### Vendor ID

In [None]:
df['vendor_id'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.xlabel('Vendor ID')

*The two vendors have roughly similar market shares, with Vendor 2 recording somewhat more trips, as the graph above shows.*

#### Datetime

In [None]:
# Converting datetime datatype from object
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

In [None]:
# Adding new features month , day and hour from datetime
df['hour'] = df['pickup_datetime'].dt.hour 
df['day'] = df.pickup_datetime.dt.day_name()
df['month'] = df.pickup_datetime.dt.month_name()

In [None]:
# Analyzing month 
data=df['month']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

Trips are spread fairly evenly across the months, with March the highest and January the lowest.

In [None]:
# Analyzing day
data=df['day']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

Friday has the most trips in the dataset and Monday the fewest.
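
The `day` feature holds plain strings, so counts are not guaranteed to appear in calendar order. A minimal sketch of one way to get a Monday to Sunday ordering follows. The timestamps are made up, and `day_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Small made-up sample of pickup timestamps, standing in for df['pickup_datetime'].
pickups = pd.Series(pd.to_datetime([
    "2016-03-14 17:24:55",  # Monday
    "2016-06-12 00:43:35",  # Sunday
    "2016-01-19 11:35:24",  # Tuesday
    "2016-04-06 19:32:31",  # Wednesday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count trips per day name, then reindex so the rows follow calendar order.
day_counts = pickups.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(day_counts)
```

For the plots, `sns.countplot` accepts the same list through its `order` parameter.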

In [None]:
# Analyzing hour
data=df['hour']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

Trip counts are fairly steady between 7 am and 3 pm, rise from 5 pm to 10 pm, and then decline until about 5 am.

#### Passenger Count

In [None]:
# Analyzing passenger count
data=df['passenger_count']
plt.figure(figsize=(8,5))
sns.countplot(x=data, palette='rainbow')
plt.show()

The passenger_count variable has a minimum value of 0. These observations are most likely errors and will be removed from the dataset.

According to the NYC Taxi & Limousine Commission, a yellow taxicab may legally carry at most 5 passengers plus one child. Observations with more than 6 passengers are therefore likely errors and will also be removed from the dataset.

In [None]:
# Keeping only trips with 1 to 6 passengers (drops 0 and more than 6)
df = df[(df['passenger_count']>0) & (df['passenger_count']<=6)]

#### store_and_fwd_flag

In [None]:
# analyzing trip data storing flag column
df['store_and_fwd_flag'].value_counts()

Most trip records were sent to the vendor directly (flag N). Only a small share were held in vehicle memory first because the vehicle had no connection to the server (flag Y).
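
To put a number on "most", `value_counts` can return shares instead of raw counts. The sketch below uses made-up flags rather than the real column, so the printed proportions are illustrative only.

```python
import pandas as pd

# Made-up flags standing in for df['store_and_fwd_flag'].
flags = pd.Series(["N", "N", "Y", "N", "N", "N", "N", "N", "N", "N"])

# normalize=True returns fractions of the total instead of raw counts.
share = flags.value_counts(normalize=True)
print(share)
```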

#### Longitude and Latitude

The approximate bounding box of New York City is:

longitude = (-74.03, -73.77),
latitude = (40.63, 40.85)


Any coordinates outside this box are treated as outliers.

In [None]:
# Max and min values of lat and long in pickup and dropoff location
print(np.min(df['pickup_longitude']), np.min(df['pickup_latitude']))
print(np.max(df['pickup_longitude']), np.max(df['pickup_latitude']))

print(np.min(df['dropoff_longitude']), np.min(df['dropoff_latitude']))
print(np.max(df['dropoff_longitude']), np.max(df['dropoff_latitude']))

In [None]:
# Removing outlier coordinates
west, south, east, north = -74.03, 40.63, -73.77, 40.85

df = df[(df.pickup_latitude> south) & (df.pickup_latitude < north)]
df = df[(df.dropoff_latitude> south) & (df.dropoff_latitude < north)]
df = df[(df.pickup_longitude> west) & (df.pickup_longitude < east)]
df = df[(df.dropoff_longitude> west) & (df.dropoff_longitude < east)]
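
The four filters above can also be written as one reusable boolean mask. This is a minimal sketch under the same bounding box. `inside_bbox` is a hypothetical helper name, and the coordinates in `toy` are made up.

```python
import pandas as pd

# Bounding box used in the notebook: west, south, east, north.
WEST, SOUTH, EAST, NORTH = -74.03, 40.63, -73.77, 40.85

def inside_bbox(frame: pd.DataFrame) -> pd.Series:
    """Return True where pickup and dropoff both lie strictly inside the box."""
    mask = pd.Series(True, index=frame.index)
    for prefix in ("pickup", "dropoff"):
        lat = frame[f"{prefix}_latitude"]
        lon = frame[f"{prefix}_longitude"]
        mask &= lat.gt(SOUTH) & lat.lt(NORTH)
        mask &= lon.gt(WEST) & lon.lt(EAST)
    return mask

# Made-up coordinates: the second trip is dropped off outside the box.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.99],
    "pickup_latitude": [40.75, 40.74],
    "dropoff_longitude": [-73.96, -73.50],
    "dropoff_latitude": [40.77, 40.70],
})
kept = toy[inside_bbox(toy)]
print(len(toy) - len(kept), "row(s) removed")
```

Building the mask once also makes it easy to count how many rows the filter removes before applying it.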

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(18,10))

df.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax1)
ax1.set_title("Pickups")
ax1.set_facecolor('black')

df.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',
                color='yellow', 
                s=.02, alpha=.6, subplots=True, ax=ax2)
ax2.set_title("Dropoffs")
ax2.set_facecolor('black') 

In [None]:
# Finding the straight-line (haversine) distance of each trip with a get_distance function
from math import sin, cos, sqrt, atan2, radians

def get_distance(lon_1, lon_2, lat_1, lat_2):

    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(lat_1)
    lon1 = radians(lon_1)
    lat2 = radians(lat_2)
    lon2 = radians(lon_2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

In [None]:
# Applying get_distance function to calculate each trip distance
df["distance"] = df.apply(lambda x: get_distance(x["pickup_longitude"],x["dropoff_longitude"],x["pickup_latitude"],x["dropoff_latitude"]),axis=1)

We will analyse distance column during feature engineering and selection.
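
Row-wise `apply` calls `get_distance` once per trip, which can be slow on more than a million rows. Below is a sketch of a vectorized alternative using NumPy, assuming the same haversine formula and the same approximate radius. `haversine_km` is a hypothetical name, and its argument order (lon, lat, lon, lat) differs from `get_distance`.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same approximate radius as get_distance

def haversine_km(lon_1, lat_1, lon_2, lat_2):
    """Great-circle distance in km; accepts scalars, NumPy arrays or pandas Series."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon_1, lat_1, lon_2, lat_2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c

# Made-up coordinates: identical points, and two points one degree of latitude apart.
lons = np.array([-73.98, 0.0])
lats_from = np.array([40.75, 0.0])
lats_to = np.array([40.75, 1.0])
distances = haversine_km(lons, lats_from, lons, lats_to)
print(distances)
```

On the notebook's frame, the call would take the four coordinate columns directly, for example `haversine_km(df["pickup_longitude"], df["pickup_latitude"], df["dropoff_longitude"], df["dropoff_latitude"])`.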