**New York City Taxi Fare Prediction Playground Competition**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df_train = pd.read_csv('../input/train.csv', nrows=2_000_000, parse_dates=['pickup_datetime'] )

In [None]:
# to read the first three rows of training dataset
df_train.head(3)

This dataset has several columns, including the target variable **fare_amount**.

Next, we look at the columns and their data types.

In [None]:
#check the datatypes
df_train.dtypes

We will look at the distribution of the data in each of these columns.

In [None]:
#check the statistics of the features
df_train.describe()

In [None]:
#check for missing values in train data
df_train.isnull().sum().sort_values(ascending=False)

In [None]:
#drop the missing values
df_train = df_train.drop(df_train[df_train.isnull().any(axis=1)].index, axis = 0)
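
Dropping every row that contains a missing value can also be written with `DataFrame.dropna`. The sketch below compares the two forms on a small toy frame. The frame `toy` and its values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are invented.
toy = pd.DataFrame({
    "fare_amount": [4.5, np.nan, 7.0],
    "passenger_count": [1, 2, np.nan],
})

# Mask-based drop: find rows with at least one NaN, then drop them by index.
by_mask = toy.drop(toy[toy.isnull().any(axis=1)].index, axis=0)

# dropna() with default arguments also removes every row with at least one NaN.
by_dropna = toy.dropna()

print(by_mask.equals(by_dropna))
```

Both forms keep the original index labels, so later index-based drops behave the same either way.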

In [None]:
df_train.describe()

From the above we can spot some outliers:

1. Obviously, a taxi fare amount cannot be negative.
2. The maximum passenger count is 208, and 208 passengers cannot travel in a single taxi.

So we will remove these known outliers.
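
The two rules can be expressed as one boolean mask. This is a minimal sketch on invented values, not the notebook's own cleaning step. The name `MAX_PASSENGERS` is hypothetical, and its value of 6 is an assumed seat limit rather than something read from the data.

```python
import pandas as pd

# Toy data; the values are invented to exercise both rules.
toy = pd.DataFrame({
    "fare_amount": [-2.5, 6.0, 12.5, 9.0],
    "passenger_count": [1, 208, 2, 6],
})

MAX_PASSENGERS = 6  # assumed seat limit, not a value taken from the dataset

# Keep rows with a non-negative fare and a plausible passenger count.
valid = (toy["fare_amount"] >= 0) & (toy["passenger_count"] <= MAX_PASSENGERS)
cleaned = toy.loc[valid].copy()

print(f"removed {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Building the mask first makes it easy to inspect how many rows each rule removes before committing to the filter.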


In [None]:
df_train.boxplot(column='fare_amount')

In [None]:
df_train.boxplot(column='passenger_count')

In [None]:
from collections import Counter
Counter(df_train['fare_amount']<0)

So there are 77 negative fare amounts in the dataset; we will remove them.
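
A count like this can also be read directly off the boolean Series, since `True` counts as 1 when summed. A small sketch with made-up fares (the Series `fares` is illustrative only):

```python
import pandas as pd

fares = pd.Series([-3.0, 5.5, 0.0, -1.0, 12.0])  # made-up values

negative = fares < 0
n_negative = int(negative.sum())   # True counts as 1, False as 0
share_negative = negative.mean()   # fraction of rows that are negative

print(n_negative, share_negative)
```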

In [None]:
df_train=df_train.drop(df_train[df_train['fare_amount']<0].index, axis=0)
df_train.shape

Assuming a maximum seating capacity of 6 (for an SUV taxi), we drop the rows with a passenger count greater than 6.

In [None]:
df_train=df_train.drop(df_train[df_train['passenger_count']>6].index, axis=0)
df_train['passenger_count'].describe()

Since the fare we are going to predict depends on the location data, we will check the latitude and longitude values against their valid ranges:

Latitude ranges from -90 to +90
Longitude ranges from -180 to +180
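
These range checks can be applied to all four coordinate columns in one pass. The sketch below uses `Series.between` on invented coordinates. The column names match the dataset, but the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Invented coordinates: row 1 has an invalid latitude, row 2 an invalid longitude.
toy = pd.DataFrame({
    "pickup_latitude": [40.75, 401.08, 40.71],
    "pickup_longitude": [-73.98, -73.97, -736.4],
    "dropoff_latitude": [40.76, 40.74, 40.70],
    "dropoff_longitude": [-73.99, -73.96, -74.01],
})

lat_cols = ["pickup_latitude", "dropoff_latitude"]
lon_cols = ["pickup_longitude", "dropoff_longitude"]

# between() is inclusive on both ends by default.
lat_ok = toy[lat_cols].apply(lambda s: s.between(-90, 90)).all(axis=1)
lon_ok = toy[lon_cols].apply(lambda s: s.between(-180, 180)).all(axis=1)

cleaned = toy.loc[lat_ok & lon_ok]
print(cleaned)
```

Note that a comparison with NaN is False, so rows with missing coordinates would also be filtered out by this mask.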

In [None]:
#checking the pickup_latitude 
df_train['pickup_latitude'].describe()

In [None]:
len(df_train[df_train['pickup_latitude']<-90])

In [None]:
len(df_train[df_train['pickup_latitude']>90])

From the above, there are 33 outliers in total in pickup_latitude, so we will remove them.

In [None]:
df_train['pickup_latitude'].shape

In [None]:
# combine the two conditions into one boolean mask, then drop the matching rows
df_train=df_train.drop(df_train[(df_train['pickup_latitude']<-90)|(df_train['pickup_latitude']>90)].index, axis=0)

In [None]:
df_train['pickup_latitude'].shape

In [None]:
df_train['pickup_longitude'].shape

In [None]:
# combine the two conditions into one boolean mask, then drop the matching rows
df_train=df_train.drop(df_train[(df_train['pickup_longitude']<-180)|(df_train['pickup_longitude']>180)].index, axis=0)

In [None]:
df_train['pickup_longitude'].shape

In [None]:
df_train.describe()

In [None]:
# combine the two conditions into one boolean mask, then drop the matching rows
df_train=df_train.drop(df_train[(df_train['dropoff_latitude']<-90)|(df_train['dropoff_latitude']>90)].index, axis=0)

In [None]:
# combine the two conditions into one boolean mask, then drop the matching rows
df_train=df_train.drop(df_train[(df_train['dropoff_longitude']<-180)|(df_train['dropoff_longitude']>180)].index, axis=0)

In [None]:
df_train.describe()

The first three records show that key and pickup_datetime hold datetime values, so we convert both columns to the datetime type.

In [None]:
df_train['key']=pd.to_datetime(df_train['key'])
df_train['pickup_datetime']=pd.to_datetime(df_train['pickup_datetime'])

In [None]:
df_train.dtypes

Now we will explore the data with plots, checking demand and taxi fares by day, date and hour.

In [None]:
data = [df_train]
for i in data:
    i['year']=i['pickup_datetime'].dt.year
    i['month']=i['pickup_datetime'].dt.month
    i['date']=i['pickup_datetime'].dt.date
    i['day of week']=i['pickup_datetime'].dt.dayofweek
    i['hour']=i['pickup_datetime'].dt.hour

In [None]:
df_train.head(3)

Next, we check how fares vary with the number of passengers on a trip.

In [None]:
plt.figure(figsize=(15,7))
plt.hist(df_train['passenger_count'], bins=15)
plt.xlabel('Number of Passengers')
plt.ylabel('Number of trips')
plt.title('Number of trips by passenger count')
plt.show()

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(x=df_train['passenger_count'], y=df_train['fare_amount'], s=1.5)
plt.xlabel('Number of Passengers')
plt.ylabel('Fare amount')
plt.title('Fare rates based on the number of passengers')
plt.show()

From the above, trips with one passenger show the highest fares.

Now we will look at the fares by year, month, date, day and hour.

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(x=df_train['year'], y=df_train['fare_amount'], s=1.5)
plt.xlabel('Year')
plt.ylabel('Fare amount')
plt.title('Year wise taxi fares details')
plt.show()

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(x=df_train['month'], y=df_train['fare_amount'], s=1.5)
plt.xlabel('Month')
plt.ylabel('Fare amount')
plt.title('Month wise taxi fares details')
plt.show()

In [None]:
plt.figure(figsize=(15,7))
plt.scatter(x=df_train['day of week'], y=df_train['fare_amount'], s=1.5)
plt.xlabel('Day of week')
plt.ylabel('Fare amount')
plt.title('Day of week wise taxi fares details')
plt.show()
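
With millions of points, a scatter plot mostly shows the range of fares in each group. A grouped summary can complement it. The following is a minimal sketch on invented trips, not a result from the competition data. The frame `toy` is hypothetical, and the same pattern would apply to the `year`, `month` or `day of week` columns.

```python
import pandas as pd

# Invented trips; only the column names mirror the dataset.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2012-01-02 08:15:00", "2012-01-02 08:45:00",
        "2012-01-07 23:10:00", "2013-06-15 23:30:00",
    ]),
    "fare_amount": [6.0, 10.0, 20.0, 30.0],
})
toy["hour"] = toy["pickup_datetime"].dt.hour

# Trip count (demand) plus mean and median fare for each pickup hour.
summary = toy.groupby("hour")["fare_amount"].agg(["count", "mean", "median"])
print(summary)
```

The `count` column gives a view of demand, while `mean` and `median` summarise the typical fare and are less dominated by a few extreme values than the raw scatter.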

Fares depend mainly on the distance between the pickup and dropoff points. The dataset has no distance column, only pickup latitude, pickup longitude, dropoff latitude and dropoff longitude, so we compute the distance from these coordinates using the [Haversine Formula](https://community.esri.com/groups/coordinate-reference-systems/blog/2017/10/05/haversine-formula).

***Haversine Formula***

The Haversine formula is perhaps the first equation to consider when understanding how to calculate distances on a sphere. The word "Haversine" comes from the function:

haversine(θ) = sin²(θ/2)

 

The following equation where φ is latitude, λ is longitude, R is earth’s radius (mean radius = 6,371km) is how we translate the above formula to include latitude and longitude coordinates. Note that angles need to be in radians to pass to trig functions:

a = sin²((φB - φA)/2) + cos φA * cos φB * sin²((λB - λA)/2)

c = 2 * atan2( √a, √(1−a) )

d = R ⋅ c
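
As a quick check of the formula, here is a scalar sketch using only the standard `math` module. The function name `haversine_km` and the test coordinates are chosen for illustration. Two points one degree of latitude apart on the same meridian should be R · π/180 apart, which is about 111.19 km for R = 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of the Earth

def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in kilometres between two points given in degrees."""
    # Trig functions expect radians, so convert from degrees first.
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = math.radians(lat_b - lat_a)
    d_lambda = math.radians(lon_b - lon_a)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c

# One degree of latitude along a meridian: d = R * (pi / 180)
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
print(round(one_degree, 2))  # 111.19
```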

In [None]:
def haversine_distance(lat1, long1, lat2, long2):
    """Add an 'h_distance' column (in km) to df_train from the named coordinate columns."""
    R = 6371  # mean radius of the Earth in kilometers
    # R = 3959  # mean radius of the Earth in miles

    # trig functions expect radians, so convert the degree columns first
    phi1 = np.radians(df_train[lat1])
    phi2 = np.radians(df_train[lat2])
    delta_phi = np.radians(df_train[lat2] - df_train[lat1])
    delta_lambda = np.radians(df_train[long2] - df_train[long1])

    # a = sin²((φB - φA)/2) + cos φA . cos φB . sin²((λB - λA)/2)
    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2

    # c = 2 * atan2( √a, √(1−a) )
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    # d = R*c, in kilometers
    d = R * c
    df_train['h_distance'] = d
    return d

In [None]:
haversine_distance('pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude')

In [None]:
df_train['h_distance'].describe()

In [None]:
df_train=df_train.drop(df_train[df_train['h_distance']==0].index, axis=0)

In [None]:
df_train['h_distance'].describe()

In [None]:
df_train['fare_amount'].describe()

In [None]:
df_train=df_train.drop(df_train[df_train['fare_amount']==0].index, axis=0)

In [None]:
df_train['fare_amount'].describe()

**More to come..**