<h2 style="color:blue;"> Problem Definition: </h2>

**Predict the fare amount for a taxi ride in New York City, given the pickup and dropoff location details.**

![image.png](https://storage.googleapis.com/kaggle-competitions/kaggle/10170/logos/header.png?t=2018-07-12-22-07-30)

<h2 style="color:blue;"> Data Fields: </h2>

##### <u>ID:</u>
**key** - Unique string identifying each row in both the training and test sets. Comprised of pickup_datetime plus a unique integer, but this doesn't matter, it should just be used as a unique ID field. 

##### <u>Features:</u>
**pickup_datetime** - timestamp value indicating when the taxi ride started.<br>
**pickup_longitude** - float for longitude coordinate of where the taxi ride started.<br>
**pickup_latitude** - float for latitude coordinate of where the taxi ride started.<br>
**dropoff_longitude** - float for longitude coordinate of where the taxi ride ended.<br>
**dropoff_latitude** - float for latitude coordinate of where the taxi ride ended.<br>
**passenger_count** - integer indicating the number of passengers in the taxi ride.<br>

##### <u>Target:</u>
**fare_amount** - float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.

<h2 style="color:blue;">Problem type: </h2>

**This is a supervised regression problem**. We can try the following popular regression algorithms to solve our use case.

1. Linear Regression
2. Ridge Regression
3. Neural Network Regression 
4. Lasso Regression 
5. Decision Tree Regression 
6. Random Forest
7. KNN Model 
8. Support Vector Machines (SVM)

In [1]:
# Including the necessary python libraries

# Data manipulation
import pandas as pd
import calendar

# Math calculations
import numpy as np

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns

# Geographical visualization
import folium
from folium import plugins
from folium.plugins import MeasureControl
from folium.plugins import HeatMap

# For math calculations
from math import sin, cos, sqrt, atan2, radians

# Stats
from scipy import stats
from scipy.stats import norm, skew

# Warnings
import warnings
warnings.filterwarnings('ignore')

### 1. Reading data

In [2]:
# Read the train dataset from the directory path with pandas read_csv and store it as a dataframe

dataset = pd.read_csv('train.csv', nrows = 500000, parse_dates = ["pickup_datetime"])
dataset.sample(5)

Unnamed: 0,key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
268324,2014-12-14 02:28:58.0000001,11.5,2014-12-14 02:28:58+00:00,-73.999024,40.720236,-73.953729,40.707792,1
57943,2013-08-06 20:53:59.0000002,9.5,2013-08-06 20:53:59+00:00,-74.008025,40.740285,-73.992061,40.74898,2
93588,2014-07-25 01:15:41.0000002,6.0,2014-07-25 01:15:41+00:00,-73.994274,40.722592,-73.985489,40.73569,1
350686,2014-02-24 08:19:39.0000001,11.5,2014-02-24 08:19:39+00:00,-73.963247,40.762614,-73.985771,40.731783,1
473270,2011-06-21 19:15:00.00000010,10.1,2011-06-21 19:15:00+00:00,-73.952872,40.782918,-73.984767,40.76938,1


In [3]:
# Read the test dataset from the directory path with pandas read_csv and store it as a dataframe

test = pd.read_csv('test.csv')
print("Number of datapoints in test dataset", test.shape[0])
test.head()

Number of datapoints in test dataset 9914


Unnamed: 0,key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
0,2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.97332,40.763805,-73.98143,40.743835,1
1,2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862,40.719383,-73.998886,40.739201,1
2,2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
3,2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
4,2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1


In [4]:
# Reading season details in New york 
# Reference: https://www.nyc.com/visitor_guide/weather_facts.75835/

season_table = pd.read_csv("Season_Details.txt", sep = ',')
season_table.head()

Unnamed: 0,month,season
0,September,FallSeason
1,October,FallSeason
2,November,FallSeason
3,December,WinterSeason
4,January,WinterSeason


In [5]:
# Reading weather details for New York City 
# Reference: https://www.kaggle.com/shaneysze/new-york-city-daily-temperature-18692021

weather = pd.read_csv("nyc_temp_1869_2021.csv")
weather.head()

Unnamed: 0.1,Unnamed: 0,MM/DD/YYYY,YEAR,MONTH,DAY,TMAX,TMIN
0,0,1869-01-01,1869,1,1,-17.0,-72.0
1,1,1869-01-02,1869,1,2,-28.0,-61.0
2,2,1869-01-03,1869,1,3,17.0,-28.0
3,3,1869-01-04,1869,1,4,28.0,11.0
4,4,1869-01-05,1869,1,5,61.0,28.0


### 2. Dataset Investigation:

In [6]:
# Total observation in dataset

print("Sample dataset {}\nTest dataset {}\nSeason dataset {}\nWeather dataset {}".format(dataset.shape,
                                                                                         test.shape,
                                                                                         season_table.shape,
                                                                                         weather.shape))

Sample dataset (500000, 8)
Test dataset (9914, 7)
Season dataset (12, 2)
Weather dataset (55634, 7)


#### Insights:
+ We have taken 500000 taxi bookings as a sample from the population; each booking is represented by 8 features.
+ The test data has 9914 datapoints with 7 features each (all except the dependent feature).
+ The season table maps each month to its season.
+ The weather dataset contains daily weather records (TMAX, TMIN) from 1869 to 2021.

In [None]:
# Dataset Features Information

dataset.info()

**Insights:** <br>
+ We have **8 features in our dataset**: 7 independent features and 1 dependent feature.
+ **Independent features:** key, pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, passenger_count.
+ **Dependent feature:** fare_amount.
+ The features already have proper datatypes, so we don't need any datatype conversion in the data cleaning phase.
+ There are **5 missing values in dropoff geo coordinates**.

In [None]:
# Basic statistics about numerical features in the dataset

dataset.describe()

#### Insights:
+ The **average taxi fare amount is about $11**.
+ A few datapoints contain negative fare amounts. These could be outliers.
+ The fare amount distribution is **right skewed**.
+ We cannot infer many details from the latitude & longitude coordinates, but we can say there are a few outliers in them.
+ The maximum Pickup longitude is **2140.6011** & the minimum longitude is **-2986.242495**. The valid longitude range is **-180 <= longitude <= 180**.
+ The maximum Pickup latitude is **1703.092772** & the minimum latitude is **-3116.285383**. The valid latitude range is **-90 <= latitude <= 90**.
+ The maximum Dropoff latitude is **404.616667** & the minimum latitude is **-2559.748913**. The valid latitude range is **-90 <= latitude <= 90**.
+ A few datapoints have **zero passenger count**. Taxis are sometimes booked to transfer goods, so we cannot directly call these outliers. We can check whether the test datapoints also contain a zero passenger count.
+ Most bookings are for a single passenger.
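The range checks listed above can be expressed as a boolean mask. The sketch below is only an illustration on a made-up two-row frame; `valid_coordinates` is a hypothetical helper name and is not part of this notebook.

```python
import pandas as pd

def valid_coordinates(data: pd.DataFrame) -> pd.Series:
    """Return True for rows whose four coordinates are physically possible."""
    # Latitude must lie in [-90, 90] and longitude in [-180, 180].
    lat_ok = data[["pickup_latitude", "dropoff_latitude"]].abs().le(90).all(axis=1)
    lon_ok = data[["pickup_longitude", "dropoff_longitude"]].abs().le(180).all(axis=1)
    return lat_ok & lon_ok

# Toy rows (made up for illustration): the second row has an impossible longitude.
toy = pd.DataFrame({
    "pickup_longitude":  [-73.99, 2140.60],
    "pickup_latitude":   [40.72, 40.75],
    "dropoff_longitude": [-73.95, -73.98],
    "dropoff_latitude":  [40.70, 40.76],
})
mask = valid_coordinates(toy)
print(mask.tolist())  # [True, False]
```

Rows with missing coordinates also come out as `False`, because comparisons against NaN are false.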

In [None]:
# Basic statistics about numerical features in the dataset

test.describe()

#### Insights:
+ There are few geographical outliers in train dataset.
+ The minimum passenger count value is one in test dataset.

In [None]:
# Basic statistics about categorical features in the dataset

dataset.describe( include = object )  # np.object was removed from recent NumPy releases

#### Insights:
The key feature **identifies a unique datapoint in the dataset**, because each value has a frequency count of 1.

### 2. Data cleaning & preprocessing

In [None]:
# Replicate the dataset and make our changes only in copied dataset

df1 = dataset.copy( deep = True )
df1.head()

In [None]:
# Rounding the geographical coordinates to 4 decimal places

df1.pickup_longitude  = round(df1.pickup_longitude.astype(float),4)
df1.pickup_latitude   = round(df1.pickup_latitude.astype(float),4)
df1.dropoff_longitude = round(df1.dropoff_longitude.astype(float),4)
df1.dropoff_latitude  = round(df1.dropoff_latitude.astype(float),4)
df1.sample()

In [None]:
# Reduce our latitude & longitude scope with respect to test dataset.

# First we will check the max & min geographical coordinates in test dataset.
# Test: Latitude Minimum & Maximum
test_lat_min = min(test.pickup_latitude.min(), test.dropoff_latitude.min())
test_lat_max = max(test.pickup_latitude.max(), test.dropoff_latitude.max())

# Train: Latitude Minimum & Maximum
train_lat_min = min(dataset.pickup_latitude.min(), dataset.dropoff_latitude.min())
train_lat_max = max(dataset.pickup_latitude.max(), dataset.dropoff_latitude.max())

print(">>> Minimum Latitude in test: {}, Maximum Latitude in test: {}".format(test_lat_min, test_lat_max))
print(">>> Minimum Latitude in train: {}, Maximum Latitude in train: {}".format(train_lat_min, train_lat_max))

# Test: Longitude Minimum & Maximum
test_lon_min = min(test.pickup_longitude.min(), test.dropoff_longitude.min())
test_lon_max = max(test.pickup_longitude.max(), test.dropoff_longitude.max())

# Train: Longitude Minimum & Maximum
train_lon_min = min(dataset.pickup_longitude.min(), dataset.dropoff_longitude.min())
train_lon_max = max(dataset.pickup_longitude.max(), dataset.dropoff_longitude.max())

print(">>> Minimum Longitude in test: {}, Maximum Longitude in test: {}".format(test_lon_min, test_lon_max))
print(">>> Minimum Longitude in train: {}, Maximum Longitude in train: {}".format(train_lon_min, train_lon_max))

#### Insights:
+ There is a huge difference between the train & test geographical coordinate ranges.
+ So we can restrict the train dataset to the region covered by the test dataset.
+ **The boundary box is defined by the test datapoints, and we keep only the train datapoints whose coordinates fall inside it.**

In [None]:
# Eliminate the datapoints from the train dataset whose geographical coordinates fall outside the boundary
# The boundary is decided based on the test dataset

# Defining method 
def geographical_boundary(data):
    return (data[ (data.pickup_latitude  >= test_lat_min)  & (data.pickup_latitude <= test_lat_max) &
                  (data.dropoff_latitude >= test_lat_min)  & (data.dropoff_latitude <= test_lat_max) &
                  (data.pickup_longitude >= test_lon_min)  & (data.pickup_longitude <= test_lon_max) & 
                  (data.dropoff_longitude >= test_lon_min) & (data.dropoff_longitude <= test_lon_max) ])

# Invoking method
df1 = geographical_boundary(df1)
print("Number of datapoint remaining after deletion : ",df1.shape[0])

In [None]:
# Statistical analysis for pickup geographical coordinate outlier detection 
# Checking both pickup latitude & longitude  

#--> Statistics using describe method
print("------------------------------------------------------\n| Statistical Data about pickup latitude & longitude |\n------------------------------------------------------")
print(df1[['pickup_latitude', 'pickup_longitude']].describe(percentiles = [.25,.50,.75,.95]))

# Find the IQR and count the outliers with respect to latitude
P_Q1 = df1.pickup_latitude.quantile(0.25)
P_Q3 = df1.pickup_latitude.quantile(0.75)
P_IQR = P_Q3 - P_Q1
lat_out = df1[(df1.pickup_latitude < (P_Q1 - 1.5 * P_IQR)) | (df1.pickup_latitude > (P_Q3 + 1.5 * P_IQR))].shape[0]
print("\n>>> Number of outlier records only in pickup latitude: ",lat_out)

# Find the IQR and count the outliers with respect to longitude
p_q1 = df1.pickup_longitude.quantile(0.25)
p_q3 = df1.pickup_longitude.quantile(0.75)
p_iqr = p_q3 - p_q1
lon_out = df1[(df1.pickup_longitude < (p_q1 - 1.5 * p_iqr)) | (df1.pickup_longitude > (p_q3 + 1.5 * p_iqr))].shape[0]
print(">>> Number of outlier records only in pickup longitude: ",lon_out)

# Finding list of records for outlier either in latitude or longitude 
outlier = df1[(df1.pickup_latitude < (P_Q1 - 1.5 * P_IQR)) | 
              (df1.pickup_latitude > (P_Q3 + 1.5 * P_IQR)) |
              (df1.pickup_longitude < (p_q1 - 1.5 * p_iqr))|
              (df1.pickup_longitude > (p_q3 + 1.5 * p_iqr)) ]

print(">>> Number of pickup geographical coordinate outlier records comparing Latitude & Longitude: ",outlier.shape[0])

fig = plt.figure(figsize=(16,2))
# Boxplot for pickup latitude
plt.subplot(121)
sns.boxplot(x = df1.pickup_latitude).set_title("Boxplot for Pickup latitude outlier detection", size = 11)
# Boxplot for pickup longitude
plt.subplot(122)
sns.boxplot(x = df1.pickup_longitude).set_title("Boxplot for Pickup longitude outlier detection", size = 11)
plt.show()

#### Insights:
+ Most of the Pickup latitude coordinates lie between **40.73 and 40.83**. There are **15063 outlier entries** with respect to Pickup latitude alone.
+ Most of the Pickup longitude coordinates lie between **-74.05 and -73.96**. There are **24295 outlier entries** with respect to Pickup longitude alone.
+ When we **combine the Pickup latitude & longitude outlier datapoints** (a row counts if either coordinate is an outlier), we get **30716 outlier datapoints**.
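The IQR rule used above (values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged) is repeated for several columns, so it can be factored into a small helper. This is a minimal sketch on made-up latitudes; `iqr_fences` and `iqr_outlier_mask` are hypothetical names, not functions defined elsewhere in this notebook.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value falls outside the IQR fences."""
    lower, upper = iqr_fences(values, k)
    return (values < lower) | (values > upper)

# Made-up latitudes: five close together and one far away.
toy = pd.Series([40.73, 40.74, 40.75, 40.76, 40.77, 41.90])
flags = iqr_outlier_mask(toy)
print(int(flags.sum()))  # 1
```

Because `iqr_fences` returns the fences separately, fences computed on one dataset can also be applied to another one, for example `(other < lower) | (other > upper)`.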

In [None]:
# Statistical analysis for Dropoff geographical coordinate outlier detection 
# Checking both dropoff latitude & longitude  

#--> Statistics using describe method
print("-------------------------------------------------------\n| Statistical Data about dropoff latitude & longitude |\n-------------------------------------------------------")
print(df1[['dropoff_latitude', 'dropoff_longitude']].describe(percentiles = [.25,.50,.75,.95]))

# Find the IQR and count the outliers with respect to latitude
D_Q1 = df1.dropoff_latitude.quantile(0.25)
D_Q3 = df1.dropoff_latitude.quantile(0.75)
D_IQR = D_Q3 - D_Q1
lat_out = df1[(df1.dropoff_latitude < (D_Q1 - 1.5 * D_IQR)) | (df1.dropoff_latitude > (D_Q3 + 1.5 * D_IQR))].shape[0]
print("\n>>> Number of outlier records only in dropoff latitude: ",lat_out)

# Find the IQR and count the outliers with respect to longitude
d_q1 = df1.dropoff_longitude.quantile(0.25)
d_q3 = df1.dropoff_longitude.quantile(0.75)
d_iqr = d_q3 - d_q1
lon_out = df1[(df1.dropoff_longitude < (d_q1 - 1.5 * d_iqr)) | (df1.dropoff_longitude > (d_q3 + 1.5 * d_iqr))].shape[0]
print(">>> Number of outlier records only in dropoff longitude: ",lon_out)

# Finding list of records for outlier either in latitude or longitude 
outlier = df1[(df1.dropoff_latitude < (D_Q1 - 1.5 * D_IQR)) | 
              (df1.dropoff_latitude > (D_Q3 + 1.5 * D_IQR)) |
              (df1.dropoff_longitude < (d_q1 - 1.5 * d_iqr))|
              (df1.dropoff_longitude > (d_q3 + 1.5 * d_iqr)) ]

print(">>> Number of dropoff geo coordinate outlier records: ",outlier.shape[0])

fig = plt.figure(figsize=(16,2))
# Boxplot for dropoff latitude
plt.subplot(121)
sns.boxplot(x = df1.dropoff_latitude).set_title("Boxplot for dropoff latitude outlier detection", size = 11)
# Boxplot for dropoff longitude
plt.subplot(122)
sns.boxplot(x = df1.dropoff_longitude).set_title("Boxplot for dropoff longitude outlier detection", size = 11)
plt.show()

#### Insights:
+ Most of the Dropoff latitude coordinates lie between **40.73 and 40.83**. There are **22487 outlier entries** with respect to Dropoff latitude alone.
+ Most of the Dropoff longitude coordinates lie between **-74.08 and -73.95**. There are **27540 outlier entries** with respect to Dropoff longitude alone.
+ When we **combine the Dropoff latitude & longitude outlier datapoints** (a row counts if either coordinate is an outlier), we get **41907 outlier datapoints**.

In [None]:
# Scatter Plot for trip location spread in Test & Train

# Boundary Box
boundary = (test_lon_min, test_lon_max, test_lat_min, test_lat_max)

# Method for create scatter plot 
def Scatter_plot(train, test, boundary):
    fig, axis = plt.subplots(2, 2, figsize = (16, 10))
    # Pickup outlier condition in train
    ptrain_condition = [ (train.pickup_latitude < (P_Q1 - 1.5 * P_IQR)) | 
                         (train.pickup_latitude > (P_Q3 + 1.5 * P_IQR)) |
                         (train.pickup_longitude < (p_q1 - 1.5 * p_iqr))|
                         (train.pickup_longitude > (p_q3 + 1.5 * p_iqr))]
    # Dropoff outlier condition in train
    dtrain_condition = [ (train.dropoff_latitude < (D_Q1 - 1.5 * D_IQR))  | 
                         (train.dropoff_latitude > (D_Q3 + 1.5 * D_IQR))  |
                         (train.dropoff_longitude < (d_q1 - 1.5 * d_iqr)) |
                         (train.dropoff_longitude > (d_q3 + 1.5 * d_iqr)) ]
    # Pickup outlier condition in test
    ptest_condition = [ (test.pickup_latitude < (P_Q1 - 1.5 * P_IQR)) | 
                        (test.pickup_latitude > (P_Q3 + 1.5 * P_IQR)) |
                        (test.pickup_longitude < (p_q1 - 1.5 * p_iqr))|
                        (test.pickup_longitude > (p_q3 + 1.5 * p_iqr))]
    # Dropoff outlier condition in test
    dtest_condition = [ (test.dropoff_latitude < (D_Q1 - 1.5 * D_IQR))  | 
                        (test.dropoff_latitude > (D_Q3 + 1.5 * D_IQR))  |
                        (test.dropoff_longitude < (d_q1 - 1.5 * d_iqr)) |
                        (test.dropoff_longitude > (d_q3 + 1.5 * d_iqr)) ]
    # Pickup location in train dataset
    plt.subplot(221)
    train['out_detection'] = np.select(ptrain_condition, ['outlier'], default = 'non-outlier')
    sns.scatterplot(x = train.pickup_longitude, y = train.pickup_latitude, hue = train.out_detection).set_title("Pickup datapoints in Train")
    
    # Dropoff location in train dataset
    plt.subplot(222)
    train['out_detection'] = np.select(dtrain_condition, ['outlier'], default = 'non-outlier')
    sns.scatterplot(x = train.dropoff_longitude, y = train.dropoff_latitude, hue = train.out_detection).set_title("Dropoff datapoints in Train")
    
    # Pickup location in test dataset
    plt.subplot(223)
    test['out_detection'] = np.select(ptest_condition, ['outlier'], default = 'non-outlier')
    sns.scatterplot(x = test.pickup_longitude, y = test.pickup_latitude, hue = test.out_detection).set_title("Pickup datapoints in Test")
    
    # Dropoff location in test dataset
    plt.subplot(224)
    test['out_detection'] = np.select(dtest_condition, ['outlier'], default = 'non-outlier')
    sns.scatterplot(x = test.dropoff_longitude, y = test.dropoff_latitude, hue = test.out_detection).set_title("Dropoff datapoints in Test")
    plt.show()

# Invoking the method call 
Scatter_plot(df1, test, boundary)

#### Insights:
+ The **test dataset also has outliers**.
+ The IQR values of the train dataset are used to find the outliers of the test dataset as well.
+ If we **remove these outliers from the train dataset, the model will generalize less well to new datapoints**.
+ For the initial model we can keep the outliers. Based on model accuracy we can decide whether we have to re-impute them or not.

In [None]:
# Statistics for fare amount feature
# Finding IQR and to check number of outliers with respect to fare amount
print("--------------------------------------\n| Statistical data about Fare amount |\n--------------------------------------")
print(df1.fare_amount.describe(percentiles = [.25, .50, .75, .95]))

#--> IQR calculation
Q1 = df1.fare_amount.quantile(0.25)
Q3 = df1.fare_amount.quantile(0.75)
IQR = Q3 - Q1

#--> Checking outliers
out_fare = df1[(df1.fare_amount < (Q1 - 1.5 * IQR)) | (df1.fare_amount > (Q3 + 1.5 * IQR))]
print("\n=> Number of outlier records: ",out_fare.shape[0])

far_condition = [(df1.fare_amount < (Q1 - 1.5 * IQR)) | (df1.fare_amount > (Q3 + 1.5 * IQR))]
df1['fare_out'] = np.select(far_condition, ['outlier'], default = 'non-outlier')

fig = plt.figure(figsize=(15,10))
# Histogram
plt.subplot(211)
sns.histplot(df1.fare_amount, kde = True).set_title('fare_amount data distribution', size = 11)
# Boxplot
plt.subplot(212)
sns.boxplot(x = df1.fare_amount).set_title("Boxplot for fare_amount outlier detection", size = 11)
plt.show()

#### Insights:
+ The dataset contains **42202 outlier datapoints with respect to fare amount**.
+ There are **a few negative fare amount datapoints in the train dataset**. We have to check those entries and remove them from the dataset.
+ The average taxi fare amount is **$11.3**.
+ We can **treat the fare amount outliers by considering the trip distance**. This will be done only after calculating the distance feature in the feature engineering phase.
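The trip distance mentioned above is not computed in this part of the notebook. One common choice is the haversine (great-circle) distance, which uses the same `math` functions imported at the top. The sketch below is an assumption about how such a feature could be computed, not the notebook's own implementation; `haversine_km` and `EARTH_RADIUS_KM` are hypothetical names.

```python
from math import sin, cos, sqrt, atan2, radians

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * atan2(sqrt(a), sqrt(1 - a))

# Coordinates taken from the first row of the train sample shown earlier.
d = haversine_km(40.720236, -73.999024, 40.707792, -73.953729)
print(round(d, 2))
```

The result is a straight-line distance, so it will underestimate the distance actually driven on a street grid.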

In [None]:
# Checking the negative fare amount datapoints

negative_fare = df1[ df1.fare_amount <= 0 ]
print("Number of datapoints with a negative fare amount: ",len(negative_fare[negative_fare.fare_amount < 0]))
print("Number of datapoints with a zero fare amount: ",len(negative_fare[negative_fare.fare_amount == 0]))
negative_fare.sample(5)

#### Insights:
+ We have **20 negative fare amount datapoints & 13 zero fare amount datapoints in the train dataset**.
+ For a few entries the fare amount is zero. This could be due to a special offer given to the customer.
+ We could either simply remove these datapoints or apply a mean fare amount based on trip distance. We will recalculate the fare amount from the trip distance.

In [None]:
# Checking outlier in Passenger count feature
# Checking unique pasenger count with its frequency in dataset

df1.passenger_count.value_counts()

#### Assumption:
+ **Most bookings are for 1 passenger and the maximum passenger count is 6**.
+ Surprisingly, there are **1754 datapoints with zero passenger count**. This case is possible, because a taxi can be booked to transfer goods.
+ We can check the value counts in the test dataset. If it has any entry with zero passengers, we have to keep those datapoints. Otherwise we can delete the 1754 datapoints.

In [None]:
# Checking the test dataset passenger count

test.passenger_count.value_counts()

#### Insights:
+ The test dataset does not have any datapoints with zero passenger count.
+ So it is completely optional for us to keep or delete the train datapoints with zero passenger count.

In [None]:
# Filter the zero passenger count rows first and apply the following reasoning:
# We focus on geographical coordinate outliers rather than fare amount outliers,
# because the fare amount can easily be recalculated from proper non-outlier coordinates.
# So only zero-passenger rows that are also coordinate outliers are deleted.
# Note: out_detection was set inside Scatter_plot and holds the dropoff outlier flag.

df1.drop(df1[(df1.passenger_count == 0) & ( df1.out_detection == 'outlier')].index, inplace = True)

print("Number of datapoint remaining after deletion : ",df1.shape[0])

# Assign passenger count 1 to the remaining zero-passenger entries, because 1 is the most frequent passenger count
df1['passenger_count'] = df1['passenger_count'].apply( lambda x : 1 if x == 0 else x )
print(df1.passenger_count.value_counts())

In [None]:
# Validating whether duplicate entries present or not 

duplicate = df1[df1.duplicated()]
duplicate

#### Insights:
There are no duplicate entries in the train dataset.

In [None]:
# Checking is there any null values or not

df1.isnull().sum()

#### Insights:
There are no missing values in the train dataset.

In [None]:
# Separate the timestamp into date, day, hour, month, year
# These new features will be treated as dummy features and will be helpful in EDA

# Date
df1['pickup_date']  = df1['pickup_datetime'].dt.date
# Day
df1['pickup_day']   = df1['pickup_datetime'].apply(lambda x : calendar.day_name[x.weekday()])
# Hour
df1['pickup_hour']  = df1['pickup_datetime'].apply(lambda x : x.hour).astype(int)
# Month
df1['pickup_month'] = df1['pickup_datetime'].apply(lambda x : x.month).astype(int)
# Year
df1['pickup_year']  = df1['pickup_datetime'].apply(lambda x : x.year).astype(int)
# Weekend or Weekday
df1['pickup_on']    = df1['pickup_day'].apply(lambda x : 'Weekend' if x == 'Saturday' or x == 'Sunday' else 'Weekday')           

df1.sample(3)

#### Insights:
**We have added the following new features from the datetime feature**
+ **pickup_date:** Pickup date in the form of [YYYY-MM-DD].
+ **pickup_day:** Calendar day of the pickup date.
+ **pickup_hour:** Pickup hour.
+ **pickup_month:** Pickup month.
+ **pickup_year:** Pickup year.
+ **pickup_on:** Booked on a weekday or at the weekend.
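The same features can also be derived with the vectorized `.dt` accessor instead of row-wise `apply`. The sketch below is an alternative formulation, shown on a made-up two-row frame that reuses timestamps from the train sample. It is not the code this notebook runs.

```python
import pandas as pd

# Toy frame with two timestamps from the train sample shown earlier.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2014-12-14 02:28:58", "2013-08-06 20:53:59"], utc=True
    )
})

dt = toy["pickup_datetime"].dt
toy["pickup_date"] = dt.date
toy["pickup_day"] = dt.day_name()
toy["pickup_hour"] = dt.hour
toy["pickup_month"] = dt.month
toy["pickup_year"] = dt.year
# Weekend if the day name is Saturday or Sunday, otherwise Weekday.
toy["pickup_on"] = (
    toy["pickup_day"].isin(["Saturday", "Sunday"]).map({True: "Weekend", False: "Weekday"})
)
print(toy[["pickup_day", "pickup_hour", "pickup_on"]])
```

On large frames the accessor avoids a Python-level function call per row, which is usually faster than `apply` with a lambda.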

### 3. Exploratory Data Analysis

In [None]:
# Booking location density plot using Folium Heatmap 
# Reference: https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf

# Data preparation: Combining pickup & drop details into single column
df_pickup = df1[['pickup_latitude', 'pickup_longitude']].copy().rename(columns = {'pickup_latitude' : 'latitude', 
                                                                                  'pickup_longitude' : 'longitude'})
df_dropoff = df1[['dropoff_latitude', 'dropoff_longitude']].copy().rename(columns = {'dropoff_latitude' : 'latitude',
                                                                                     'dropoff_longitude' : 'longitude'})
df_pickup = pd.concat([df_pickup, df_dropoff])  # DataFrame.append was removed in pandas 2.0
df_pickup['count'] = 1

# Map instance creation
new_york = folium.Map(location=[40.693943, -73.985880], control_scale=True, zoom_start=9)
new_york.add_child(MeasureControl())
# Apply heatmap on top of map instance
HeatMap(data=df_pickup[['latitude', 'longitude', 'count']].groupby(['latitude', 'longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(new_york)
new_york

#### Insights:
+ Most of the trips are concentrated in and around **Manhattan**.
+ A few trips point to locations in the ocean. These must be outliers.

In [None]:
# Checking data distribution for fare_amount feature

fig = plt.figure(figsize=(15, 5))
sns.histplot(df1.fare_amount, kde = True).set_title('fare_amount data distribution', size = 15)
plt.show()

#### Insights:
+ The dataset contains **42202 outlier datapoints with respect to fare amount**.
+ There are **a few negative fare amount datapoints in the train dataset**. We have to check those entries and remove them from the dataset.
+ The average taxi fare amount is **$11.3**.
+ We can **treat the fare amount outliers by considering the trip distance**. This will be done only after calculating the distance feature in the feature engineering phase.
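The right skew visible in the fare histogram can be quantified with the sample skewness (the third standardized moment). The sketch below uses made-up fares and a hypothetical helper `sample_skewness`. It also shows that a `log1p` transform reduces the skew, which is a common option and not something this notebook applies.

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5, using central moments m2 and m3."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Made-up fares with one large value to mimic a long right tail.
fares = np.array([4.5, 6.0, 6.5, 8.0, 9.5, 11.5, 14.0, 52.0])
raw = sample_skewness(fares)
logged = sample_skewness(np.log1p(fares))
print(raw > 0, logged < raw)  # True True
```

A positive value indicates a right-skewed distribution. `np.log1p` requires fares greater than -1, so negative fares would need to be handled first.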

In [None]:
# Checking data distribution for passenger_count data

fig, ax = plt.subplots(figsize = (10,5))
sns.countplot(x = df1.passenger_count, ax = ax)
ax.set_title('passenger_count data analysis', size = 16)
ax.set_xlabel('Passenger count', size = 12)
ax.set_ylabel('Count', size = 12)
ax.grid(axis='y')
for p in ax.patches:
    ax.annotate('{:.1f}%'.format( (p.get_height() / df1.shape[0]) * 100 ), (p.get_x()+0.2, p.get_height()+55))
plt.show()

#### Insights:
+ Around **70% of bookings are for a single passenger**.
+ Rides with five or six passengers may have booked larger-capacity cabs, so a higher fare would be expected.
+ We can verify this after computing the distance, because it is one of our hypotheses to test.

In [None]:
# Finding average fare amount with respect to passenger count

df1.groupby('passenger_count')['fare_amount'].mean().plot(kind='bar')
plt.title("Average fare amount VS Passenger count")
plt.xlabel("Passenger count")
plt.ylabel("Avg. fare amount")
plt.show()

#### Assumption:
+ **Hypothesis:** The fare increases with the passenger count.
+ The **average fare amount for passenger count 6 is higher than for all others**, which supports the hypothesis. On the other hand, the average fare amount for passenger counts 3, 4 & 5 is not comparably high.
+ After calculating the distance feature, we can test this hypothesis again.

In [None]:
# Year wise taxi booking count

fig, ax = plt.subplots(figsize = (10,5))
sns.countplot(x = df1.pickup_year, ax = ax)
ax.set_title('Year wise taxi booking count', size = 16)
ax.set_xlabel('Year', size = 12)
ax.set_ylabel('Count', size = 12)
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()) , (p.get_x()+0.2, p.get_height()+55))
plt.show()

#### Insights:
+ Up to 2012 the yearly booking count increases roughly linearly, except in 2010.
+ The booking count then drops in 2013 & 2014.
+ Surprisingly, in **2015 the booking count is about half that of the previous year**.
+ Either the dataset was generated in the middle of 2015, or the actual number of bookings was half that of the previous year.

In [None]:
# Explore further why we have only 3389 bookings in 2015

#--> Fetch the booking details of 2015 and check whether we have observations for all the months
df1.query("pickup_year == 2015")['pickup_month'].unique()

#### Insights:
+ So it is clear that the **dataset contains 2015's booking details only for the first 6 months**.
+ Because of this we have far fewer bookings than in the previous year.
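The same check can be run for every year at once by summarizing the months present per year. The sketch below runs on a made-up frame; in the notebook the same `groupby` could be applied to `df1`.

```python
import pandas as pd

# Toy data (made up): 2014 spans January to December, 2015 stops in June.
toy = pd.DataFrame({
    "pickup_year":  [2014, 2014, 2014, 2015, 2015],
    "pickup_month": [1, 7, 12, 2, 6],
})

# First month, last month, and number of distinct months observed per year.
coverage = toy.groupby("pickup_year")["pickup_month"].agg(["min", "max", "nunique"])
print(coverage)
```

A year whose `max` is below 12 or whose `nunique` is below 12 is only partially covered, which explains a low yearly count.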

In [None]:
# Month wise booking count with respect to year

fig, ax = plt.subplots(figsize = (20,5))
sns.countplot(x = df1.pickup_month, hue = df1.pickup_year, ax = ax)
ax.set_title('Month wise taxi booking count', size = 16)
ax.set_xlabel('Month', size = 15)
ax.set_ylabel('Count', size = 15)
plt.show()

#### Insights:
+ Except for a few months, **the booking count is uniform** across the months of the different years.
+ 2015 contains very few datapoints, so its booking count is very low.

In [None]:
# Grouping monthly booking count with respect to year wise


#--> Creating group by table
df1['count'] = 1
month_group = pd.DataFrame(df1.groupby(['pickup_month', 'pickup_year'])['count'].count()).reset_index()
month_group = month_group.pivot(index = "pickup_month", columns = "pickup_year", values = "count")

#--> Plotting
fig, ax = plt.subplots(figsize = (18,7))
sns.lineplot(data = month_group, markers = True, dashes=False, ax = ax)
ax.set_title('Month wise taxi booking count with respect to year', size = 16)
ax.set_xlabel('Month', size = 15)
ax.set_ylabel('Count', size = 15)
ax.grid(axis='x')
plt.show()

#### Insights:
+ The **maximum number of trips was booked in March 2012** and the **minimum in August 2010**.
+ **The month wise booking count follows more or less the same distribution in every year**. The **reason could be the weather conditions**, so we can explore the booking count by weather season.

In [None]:
# Reading season details 
# Reference: https://www.nyc.com/visitor_guide/weather_facts.75835/

#Reading data
season_tab = pd.read_csv("Season_Details.txt", sep = ',')

# First preprocessing the month name to respective month number in season data
season_tab['month'] = season_tab.apply(lambda x : list(calendar.month_name).index(x.month), axis =1)

# Adding new column for Season detail in our dataframe
df1['season'] = df1.pickup_month.replace(dict(zip(season_tab['month'],season_tab['season'])))
df1.season.value_counts()

#### Insights:
+ **FallSeason:** September, October & November
+ **WinterSeason:** December, January & February
+ **SpringSeason:** March, April & May
+ **SummerSeason:** June, July & August
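For readers without the season file, the month-to-season lookup described above can be written out directly. This is a minimal sketch: the Fall and Winter labels follow the season table shown earlier, while the exact spelling of the Spring and Summer labels in the file is an assumption. `SEASON_BY_MONTH_NAME` and `season_by_month` are hypothetical names.

```python
import calendar

# Month name -> season label. Spring/Summer label spellings are assumed.
SEASON_BY_MONTH_NAME = {
    "September": "FallSeason", "October": "FallSeason", "November": "FallSeason",
    "December": "WinterSeason", "January": "WinterSeason", "February": "WinterSeason",
    "March": "SpringSeason", "April": "SpringSeason", "May": "SpringSeason",
    "June": "SummerSeason", "July": "SummerSeason", "August": "SummerSeason",
}

# calendar.month_name has '' at index 0 and the month names at indices 1..12.
month_names = list(calendar.month_name)
season_by_month = {
    month_names.index(name): season for name, season in SEASON_BY_MONTH_NAME.items()
}
print(season_by_month[1], season_by_month[9])  # WinterSeason FallSeason
```

The resulting dictionary is keyed by month number, so it can be passed to `Series.map` on a month column. `calendar.month_name` is locale dependent, and this sketch assumes English month names.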

In [None]:
# Grouping Season booking count with respect to year wise

#--> Creating group by table
# Since we have very few entries in 2015, we exclude that year for the time being
temp = df1[~ ( df1.pickup_year == 2015 )].copy()  # copy so that adding a column does not write to a slice of df1
temp['count'] = 1
month_group = pd.DataFrame(temp.groupby(['season', 'pickup_year'])['count'].count()).reset_index()
month_group = month_group.pivot(index = "season", columns = "pickup_year", values = "count")

#--> Plotting
fig, ax = plt.subplots(figsize = (18,7))
sns.lineplot(data = month_group, markers = True, dashes=False, ax = ax)
ax.set_title('Season wise taxi booking count with respect to year', size = 16)
ax.set_xlabel('Season', size = 15)
ax.set_ylabel('Count', size = 15)
ax.grid(axis='x')
plt.show()

#### Insights:
+ **In the Summer season the booking count decreases** compared with the previous season in every year.
+ The **highest number of bookings occurs during the Spring season**.
+ We cannot predict the Winter season, because it varies from year to year.
+ 2012 is the best year, because its booking count is very high compared with all other years.

In [None]:
# Hour wise booking count 

fig, ax = plt.subplots(figsize = (18,5))
sns.countplot(x = df1.pickup_hour, ax = ax)
ax.set_title('Hour wise taxi booking count', size = 16)
ax.set_xlabel('Hour', size = 12)
ax.set_ylabel('Count', size = 12)
for p in ax.patches:
    ax.annotate('{:}'.format(p.get_height()) , (p.get_x()+0.2, p.get_height()+55))
plt.show()

#### Insights:
+ From midnight to **early morning** (12 AM to 7 AM) the **booking rate gradually decreases** and reaches a very low value.
+ From 8 AM to 4 PM the booking rate is average.
+ **The maximum number of bookings is made between 6 PM and 8 PM**.
+ From late evening to midnight the booking count is above average.

In [None]:
# Grouping hourly booking count year-wise

#--> Creating groupby table
hour_group = pd.DataFrame(df1.groupby(['pickup_hour', 'pickup_year'])['count'].count()).reset_index()
hour_group = hour_group.pivot(index = "pickup_hour", columns = "pickup_year", values = "count")

#--> Plotting
fig, ax = plt.subplots(figsize = (18,5))
sns.lineplot(data = hour_group, dashes=False, ax = ax)
ax.set_title('Hour wise taxi booking count with respect to year', size = 16)
ax.set_xlabel('Hour', size = 15)
ax.set_ylabel('Count', size = 15)
plt.show()

#### Insights:
+ The **hour-wise booking count distribution is more or less the same for all the years**.
+ In the early morning the number of bookings is very low.
+ There is a large number of bookings between 18:00 and 20:00.
+ Since we don't have sufficient datapoints for 2015, it looks different from all other years.

In [None]:
# Exploring time based heatmap for taxi booking 

# Creating map instance
new_york = folium.Map(location=[40.693943, -73.985880], control_scale=True, zoom_start=12)
heat_df = df1[['pickup_latitude', 'pickup_longitude']].copy()

# Create weight column, using the pickup hour
heat_df['Weight'] = df1['pickup_hour']
heat_df = heat_df.dropna(axis=0, subset=['pickup_latitude','pickup_longitude', 'Weight'])

# List comprehension to build a list of coordinate lists, one per hour
heat_data = [[[row['pickup_latitude'],row['pickup_longitude']] for index, row in heat_df[heat_df['Weight'] == i].iterrows()] for i in range(0,24)]

# Create airport markers and add them to the map object
folium.Marker([40.6441666667, -73.7822222222], tooltip="John F. Kennedy International Airport (JFK)").add_to(new_york)
folium.Marker([40.7769271, -73.87396590000003], tooltip="LaGuardia Airport (LGA)").add_to(new_york)
folium.Marker([40.6895314, -74.17446239999998], tooltip="Newark Liberty International Airport (EWR)").add_to(new_york)

# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,max_opacity=0.8)
hm.add_to(new_york)
# Display the map
new_york

#### Insights:
+ **Most of the trips are within Manhattan throughout the day**. 
+ Most of the **bookings are near airports and hotspot places**.
+ **JFK and LGA airports have the highest booking density**.
+ The **long trips mostly go up to Norwalk, stford, Ossining, Huntington, Paterson & Cedar Grove**.
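
The per-hour coordinate lists that feed `HeatMapWithTime` can be built with a single `groupby` instead of 24 filtered `iterrows` loops, which tends to be noticeably faster on a large dataframe. This is a sketch under assumptions: the helper name `build_hourly_heat_data` is hypothetical, it reads `pickup_hour` directly rather than a separate `Weight` column, and the sample rows are made up.

```python
import pandas as pd

def build_hourly_heat_data(df, hours=range(24)):
    """Return one list of [lat, lon] pairs per hour, in hour order."""
    cols = ["pickup_latitude", "pickup_longitude"]
    clean = df.dropna(subset=cols + ["pickup_hour"])
    by_hour = {
        hour: group[cols].values.tolist()
        for hour, group in clean.groupby("pickup_hour")
    }
    # Hours without any pickup get an empty list so the result always has 24 frames
    return [by_hour.get(hour, []) for hour in hours]

# Made-up sample rows (the last one has a missing latitude and is dropped)
sample = pd.DataFrame({
    "pickup_latitude": [40.7, 40.8, 40.6, None],
    "pickup_longitude": [-74.0, -73.9, -73.8, -73.7],
    "pickup_hour": [0, 0, 23, 5],
})
heat_data = build_hourly_heat_data(sample)
print(len(heat_data), heat_data[0], heat_data[23])
```

The returned list has the same shape as the `heat_data` list built in the cell above, so it could be passed to `plugins.HeatMapWithTime` in the same way.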

### 4. Feature Engineering:


In [None]:
# Replicate the dataset and make our changes in copied dataset

df2 = df1.copy( deep = True )
df2.head()

**According to Wikipedia, the haversine formula determines the great-circle distance between two points on a sphere given their longitudes and latitudes.**
![image.png](http://ttarnawski.usermd.net/wp-content/uploads/2017/08/Bez-nazwy.png)
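
In case the image does not load, the formula can be written out as follows. Here $\varphi$ is latitude and $\lambda$ is longitude (both in radians), $\Delta\varphi = \varphi_2 - \varphi_1$, $\Delta\lambda = \lambda_2 - \lambda_1$, and $R$ is the radius of the Earth. This is the standard `atan2` form of the formula; it is assumed, not verified, that the image shows an equivalent expression.

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right), \qquad
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right), \qquad
d = R\,c
$$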


In [None]:
# Creating a new column for trip distance, computed from the trip latitude & longitude details
from math import radians, sin, cos, atan2, sqrt

def haversine_distance( lon1, lat1, lon2, lat2 ):
    # approximate radius of earth in km
    R = 6373.0 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return round(R * c, 2)

df2['distance'] = df2.apply( lambda row : haversine_distance( row['pickup_longitude'],
                                                              row['pickup_latitude'],
                                                              row['dropoff_longitude'],
                                                              row['dropoff_latitude'] ), axis = 1 )
df2.head()

#### Assumption:
+ The distance is calculated between pickup (latitude, longitude) & dropoff (latitude, longitude) using the **Haversine distance** formula.
+ We can calculate it either with a manual formula or with a geographic API such as a geodesic distance function. But the API takes a long time compared to the manual calculation.
+ This **distance feature will play an important role in model building**.
+ We can replace the outlier values in fare amount using the distance.
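
Since calculation time matters here, the same haversine formula can also be applied to whole columns at once with NumPy instead of row by row. The sketch below is one possible variant, not the notebook's implementation; the function name `haversine_distance_vectorized` and the constant name are hypothetical, and the radius is assumed to be the same 6373 km approximation.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # approximate radius of the Earth in km (assumed value)

def haversine_distance_vectorized(lon1, lat1, lon2, lat2):
    """Great-circle distance in km for scalars or equal-length arrays given in degrees."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return np.round(EARTH_RADIUS_KM * c, 2)

# Two made-up trips: one degree of latitude along the equator's meridian, and a zero-length trip
distances = haversine_distance_vectorized(
    [0.0, -73.98], [0.0, 40.75], [0.0, -73.98], [1.0, 40.75]
)
print(distances)
```

Passing the four coordinate columns of the dataframe to this function would return an array that can be assigned to the `distance` column directly.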

In [None]:
# Checking the relationship between fare_amount and distance using a scatter plot

fig, ax = plt.subplots(figsize = (18,5))
sns.regplot( x = df2.distance, y = df2.fare_amount, ax = ax)
ax.set_title('Distance VS Fare amount', size = 16)
ax.set_xlabel('Distance', size = 15)
ax.set_ylabel('Fare amount', size = 15)
plt.show()

#### Insights:
+ Some datapoints have a **long distance but a very low fare amount**. These datapoints could be **outliers**.
+ As we already know, a few datapoints have a negative fare.
+ A few datapoints have a **high fare amount for a very short distance**. These datapoints could also be **outliers**.
+ We can use the average fare amount for a given distance to treat the fare amount outliers.


In [None]:
# Outlier detection for distance feature

#--> IQR calculation
Q1 = df2.distance.quantile(0.25)
Q2 = df2.distance.quantile(0.50)
Q3 = df2.distance.quantile(0.75)
Q4 = df2.distance.quantile(0.95)
IQR = Q3 - Q1

#--> Summary statistics and outlier count for distance
print("Statistical Data about Distance")
print("----------------------------------")
print("=> 25th Quantile: {} \n=> 50th Quantile: {} \n=> 75th Quantile: {} \n=> 95th Quantile: {}".format(Q1, Q2, Q3, Q4))
print("=> Min distance: {} \n=> Max distance: {} ".format(df2.distance.min(), df2.distance.max()))
length = df2[(df2.distance < (Q1 - 1.5 * IQR)) | (df2.distance > (Q3 + 1.5 * IQR))].shape[0]
print("=> Number of outlier records: ",length)

fig = plt.figure(figsize=(15,10))
# Histogram
plt.subplot(211)
sns.histplot(df2.distance, kde = True).set_title('Distance data distribution', size = 11)
# Boxplot
plt.subplot(212)
sns.boxplot(x = df2.distance).set_title("Boxplot for Distance outlier detection", size = 11)
plt.show()

#### Insights:
+ **40632 datapoints** are considered **outliers based on the distance feature**.
+ A few datapoints have a **distance value of zero. We have to analyze these further**.
+ Since the outlier count is very large, deleting the data would lose some information. 
+ So we have to find a better way to handle these outliers.
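
The 1.5 × IQR rule used above can be wrapped in a small helper, and one alternative to deleting rows is to clip (cap) values at the fences. This is a sketch of that option only, not the approach this notebook takes; the helper name `iqr_fences` and the sample values are made up for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences Q1 - k*IQR and Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up distances with one extreme value
distances = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = iqr_fences(distances)

# Flag values outside the fences, then cap them instead of dropping the rows
is_outlier = (distances < lower) | (distances > upper)
clipped = np.clip(distances, lower, upper)
print(lower, upper, int(is_outlier.sum()), clipped.max())
```

Clipping keeps every row but distorts the capped values, so whether it is preferable to deletion depends on how the distance feature is used later.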

In [None]:
# Fetching the zero distance datapoints

print("=> Number of datapoints with distance value as zero: ",len(df2[df2.distance == 0]))
print("=> Number of datapoints with pickup & drop at same location: ",len(df2[(df2.pickup_latitude == df2.dropoff_latitude) & 
                                                                            (df2.pickup_longitude == df2.dropoff_longitude)]))

print("=> Number of datapoints with distance & fare value as zero: ",len(df2[(df2.distance == 0) & (df2.fare_amount == 0)]))

df2.drop( df2[ df2.distance == 0 ].index, inplace = True )
print("Number of datapoint remaining after deletion : ",df2.shape[0])

#### Insights:
+ The number of datapoints with **distance value zero is equal to the number of datapoints with pickup & drop at the same location**: **6013 datapoints**.
+ The distance feature is going to be an important feature for model building, because the correlation between **fare amount & distance is very high**.
+ Some datapoints have **zero fare amount and zero distance**, so these records are not useful at all.
+ We have to remove the zero-distance datapoints from our dataset to get a more accurate model. They are a small share of the original datapoints, so removing them will not cause a big issue.
+ After deletion, **483255 datapoints remain**.

In [None]:
# Removing the outlier datapoints with respect to distance

#--> IQR calculation
Q1 = df2.distance.quantile(0.25)
Q3 = df2.distance.quantile(0.75)
IQR = Q3 - Q1

#--> Filtering the outlier datapoints
df2 = df2[~((df2.distance < (Q1 - 1.5 * IQR)) | (df2.distance > (Q3 + 1.5 * IQR)))]
print("Number of datapoint remaining after distance outlier deletion : ",df2.shape[0])
fig, ax = plt.subplots(figsize = (15,5))
sns.regplot( x = df2.distance, y = df2.fare_amount , marker = '+', ax = ax).set_title("Distance VS Fare amount")
plt.show()

#### Insights:
+ After removing the most extreme distance outliers we get a **proper linear relationship between distance & fare amount**.
+ Based on distance we can rework the fare amount outliers by taking the **average fare amount for the same distance**.

In [None]:
# Handling fare amount outliers with respect to distance:
# The strong assumption is that the fare amount depends linearly on distance, 
# so outlier fares are replaced by the average fare amount of rows with the same distance.

# IQR calculation
Q1 = df2.fare_amount.quantile(0.25)
Q3 = df2.fare_amount.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Method for outlier treatment on fare amount
def remove_outlier(distance, fare):
    if fare <= 0 or fare < lower or fare > upper:
        # Zero, negative or IQR-outlier fare: use the mean fare at the same distance
        return df2.loc[df2['distance'] == distance, 'fare_amount'].mean()
    # Otherwise keep the input fare amount
    return fare

# Outlier removal function call
df2['fare_amount'] = df2.apply(lambda x : remove_outlier(x['distance'], x['fare_amount']), axis = 1 )

#### Insights: 
+ The strong assumption is that the **fare amount depends linearly on distance**.
+ Using this assumption we can recalculate the outlier fare amounts from the distance.
+ **The average fare amount is calculated over the rows with exactly the same distance and substituted for the outlier entry**.
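
The same replacement rule can be expressed without a row-wise `apply`, using `groupby(...).transform('mean')` to get the mean fare for each row's distance. The sketch below is an assumed equivalent written for illustration; the function name `replace_fare_outliers` and the sample data are hypothetical.

```python
import pandas as pd

def replace_fare_outliers(df):
    """Replace non-positive and IQR-outlier fares by the mean fare at the same distance."""
    q1 = df["fare_amount"].quantile(0.25)
    q3 = df["fare_amount"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    fare = df["fare_amount"].astype(float)
    is_outlier = (fare <= 0) | (fare < lower) | (fare > upper)

    # Mean fare of all rows sharing the same distance (outlier rows included)
    mean_by_distance = fare.groupby(df["distance"]).transform("mean")

    out = df.copy()
    out["fare_amount"] = fare.where(~is_outlier, mean_by_distance)
    return out

# Made-up sample: one negative fare and one very high fare
sample = pd.DataFrame({
    "distance": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
    "fare_amount": [5.0, 5.0, 5.0, -3.0, 9.0, 9.0, 9.0, 60.0],
})
cleaned = replace_fare_outliers(sample)
print(cleaned["fare_amount"].tolist())
```

Note that the group mean includes the outlier rows themselves, so a replaced value can still be pulled toward the extreme fare. Excluding flagged rows before averaging would be a possible refinement, but it is a change in method rather than a faster version of the same one.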

In [None]:
# Scatter plot after fare amount outlier value replacement

fig, ax = plt.subplots(figsize = (15,5))
sns.scatterplot( x = df2.distance, y = df2.fare_amount , marker = '+').set_title("Distance VS Fare amount")
plt.show()

In [None]:
# Removing the unwanted columns from Dataframe

df2.drop( ['key', 'count', 'out_detection', 'fare_out'], axis = 'columns', inplace = True)
df2.sample(3)

In [None]:
# Map instance creation
new_york = folium.Map(location=[40.693943, -73.985880], control_scale=True, zoom_start=9)
new_york.add_child(MeasureControl())
# Apply heatmap on top of map instance
plugins.HeatMap(data=df_pickup[['latitude', 'longitude', 'count']].groupby(['latitude', 'longitude']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(new_york)
# Airport
folium.Marker([40.644167, -73.782223], tooltip=" John F. Kennedy International Airport (JFK)").add_to(new_york)
folium.Marker([40.769914, -73.864324], tooltip=" LaGuardia Airport (LGA)").add_to(new_york)
folium.Marker([40.854591, -74.066219], tooltip="  Teterboro Airport").add_to(new_york)
folium.Marker([40.689531, -74.174463], tooltip=" Newark Liberty International Airport (EWR)").add_to(new_york)
# Railway station
folium.Marker([40.750580, -73.993584], tooltip=" Penn Station Railway Station ").add_to(new_york)
folium.Marker([40.743721, -73.923719], tooltip=" Railway Station ").add_to(new_york)
folium.Marker([40.752655, -73.977295], tooltip=" Grand Central Terminal Railway station").add_to(new_york)
# Important hotspot locations
folium.Marker([40.759060, -73.979431], tooltip=" Top of The Rock").add_to(new_york)
folium.Marker([40.758678, -73.978798], tooltip=" Rockefeller Center").add_to(new_york)
folium.Marker([40.741112, -73.989723], tooltip=" Flatiron Building").add_to(new_york)
folium.Marker([40.741555, -73.972354], tooltip=" FDR Drive").add_to(new_york)
folium.Marker([40.703717, -74.013573], tooltip=" National Museum").add_to(new_york)
folium.Marker([40.705689, -74.017682], tooltip=" Skyscraper Museum").add_to(new_york)
folium.Marker([40.731678, -74.063076], tooltip=" Journal Square Transportation Center").add_to(new_york)



# Display the map
new_york

In [None]:
weather.head()

In [None]:
weather = weather[(weather.YEAR >= 2009) & (weather.YEAR <= 2015)]
weather.head()

In [None]:
from geopy.geocoders import Nominatim
  
# Initialize Nominatim API
geolocator = Nominatim(user_agent="geoapiExercises")
  
# Assign Latitude & Longitude
Latitude = "25.594095"
Longitude = "85.137566"
  
# Displaying Latitude and Longitude
print("Latitude: ", Latitude)
print("Longitude: ", Longitude)

# Get location with reverse geocoding (coordinates -> address)
location = geolocator.reverse(Latitude+","+Longitude)

# Display location
print("\nLocation of the given Latitude and Longitude:")
print(location)

In [None]:
df2.sample(3)

In [None]:
# Initialize Nominatim API
geolocator = Nominatim(user_agent="geoapiExercises")

def location_finder(row):
    # Assign Latitude & Longitude
    Latitude = str(row.pickup_latitude)
    Longitude = str(row.pickup_longitude)

    # Get location with reverse geocoding (coordinates -> address)
    location = geolocator.reverse(Latitude+","+Longitude)
    
    return location

In [None]:
df2['pickup_addr'] = df2.apply(location_finder, axis = 1)
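
Calling a web geocoder once per row is likely to be very slow for a dataframe of this size, and the public Nominatim service asks clients to stay at or below about one request per second. One way to cut the number of requests is to round coordinates and cache the result per rounded location. The sketch below shows the idea only: `make_cached_lookup` and `fake_reverse_geocode` are hypothetical names, and the fake function stands in for a real geocoder call so that the example runs offline.

```python
import time

def make_cached_lookup(lookup, precision=3, min_delay_seconds=0.0):
    """Wrap a (lat, lon) -> address function with rounding, caching and an optional delay."""
    cache = {}

    def cached_lookup(lat, lon):
        # Nearby points share one key after rounding (3 decimals is roughly 100 m)
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            if min_delay_seconds > 0:
                time.sleep(min_delay_seconds)  # pause before each real request
            cache[key] = lookup(*key)
        return cache[key]

    return cached_lookup

# Stand-in for a real geocoder call (assumption: the real one returns an address)
calls = []
def fake_reverse_geocode(lat, lon):
    calls.append((lat, lon))
    return f"address near ({lat}, {lon})"

finder = make_cached_lookup(fake_reverse_geocode, precision=3)
first = finder(40.75061, -73.99358)
second = finder(40.75064, -73.99361)  # rounds to the same key, served from the cache
print(first, len(calls))
```

With a real geocoder, `min_delay_seconds=1` would space out the uncached requests. The rounding precision trades address accuracy against the number of requests.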