## Introduction:
In this competition, we are tasked with predicting the fare amount for a taxi ride in New York City given the pickup and dropoff locations.

## Features:
* pickup_datetime - timestamp indicating when the taxi ride started.
* pickup_longitude - longitude coordinate of where the taxi ride started.
* pickup_latitude - latitude coordinate of where the taxi ride started.
* dropoff_longitude - longitude coordinate of where the taxi ride ended.
* dropoff_latitude - latitude coordinate of where the taxi ride ended.
* passenger_count - integer indicating the number of passengers in the taxi ride.

## Target:
* fare_amount - float dollar amount of the cost of the taxi ride.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set()

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('../input/nyctaxifares/NYCTaxiFares.csv')

In [None]:
data.head()

In [None]:
data.tail()

## EDA and Feature Engineering

1. Describing the features of the data
2. Checking for null values
3. Changing the datatype of pickup_datetime
4. Calculating the distance
5. Visualization

In [None]:
data.info()

In [None]:
# Description of all features
data.describe().T

In [None]:
data.isnull().sum()

In [None]:
data["fare_class"].value_counts()

In [None]:
print("Rides with fare amount >= $10 :", data[data["fare_amount"]>=10].shape[0])
data[data["fare_amount"] >=10]
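
The `fare_class` counts and the `fare_amount >= 10` filter above suggest that `fare_class` may simply flag rides costing $10 or more. That is an assumption, not something documented here. A minimal sketch of how one could check it, using a small made-up frame named `toy` in place of `data`:

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "fare_amount": [6.5, 10.0, 4.9, 22.1, 9.99],
    "fare_class":  [0,   1,    0,   1,    0],
})

# Rebuild the flag from the fare and compare it with the stored column.
rebuilt = (toy["fare_amount"] >= 10).astype(int)
match_rate = (rebuilt == toy["fare_class"]).mean()
print(f"Share of rows where fare_class == (fare_amount >= 10): {match_rate:.2%}")
```

If the share is 100% on the real data, `fare_class` carries information about the target and should be handled with care when it is used as a model input.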

In [None]:
## Converting pickup_datetime from Object type to TimeStamp type

data["pickup_datetime"] = pd.to_datetime(data["pickup_datetime"])

In [None]:
data.head()

In [None]:
data.dtypes

**Calculation of distance between Latitude and Longitude**

The great-circle distance is the shortest distance between two points on the surface of a sphere. In this notebook we calculate the distance between the pickup and dropoff points using the Haversine formula.
First, convert the latitude and longitude values from decimal degrees to radians by multiplying them by pi/180 (equivalently, dividing by 180/pi). Use r = 6371 km, the Earth's mean radius, so that the result is in kilometres.
For more details on the [Haversine distance](https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/) and the complete formula, visit the linked site.
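
For reference, writing $\varphi$ for latitude and $\lambda$ for longitude (notation chosen here, both in radians) and $r$ for the Earth's radius, the Haversine formula can be stated as:

$$
a = \sin^2\!\left(\frac{\varphi_2-\varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2-\lambda_1}{2}\right), \qquad
c = 2\arcsin\!\left(\sqrt{a}\right), \qquad
d = r\,c
$$

With $r = 6371$ km, $d$ is the distance in kilometres.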

In [None]:
from math import radians, cos, sin, asin, sqrt

def distance(lon1, lon2, lat1, lat2):
    """Haversine distance in km between (lat1, lon1) and (lat2, lon2), given in degrees."""
    # Convert decimal degrees to radians
    lon1, lon2, lat1, lat2 = map(radians, (lon1, lon2, lat1, lat2))

    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))

    r = 6371  # mean Earth radius in km
    return round(c * r, 2)


# Arguments must follow the signature: both longitudes first, then both latitudes
d = []
for i in range(data.shape[0]):
    d.append(distance(data["pickup_longitude"][i],
                      data["dropoff_longitude"][i],
                      data["pickup_latitude"][i],
                      data["dropoff_latitude"][i]))
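
The row-by-row loop above can be slow on large frames. A vectorized sketch of the same Haversine calculation with NumPy follows. The function name `haversine_km` and the sample coordinates are assumptions made for illustration and are not taken from the dataset:

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance in km; inputs are degrees (scalars or arrays)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Made-up coordinates for illustration (not taken from the dataset).
pickup_lon = np.array([-73.99, 0.0])
pickup_lat = np.array([40.73, 0.0])
dropoff_lon = np.array([-73.98, 1.0])
dropoff_lat = np.array([40.75, 0.0])

km = haversine_km(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat)
print(np.round(km, 2))
```

Because the function accepts arrays, the four coordinate columns of `data` could be passed in directly instead of looping over the rows.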

In [None]:
data["distance in kms"] = d

In [None]:
data.head()

In [None]:
# Dropping Longitude and Latitude Features

data.drop(["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"], axis=1, inplace=True)

In [None]:
data.head()

In [None]:
print("Date in data : ", data["pickup_datetime"].dt.day.sort_values().unique())
print("Month in data : ", data["pickup_datetime"].dt.month.unique()[0])
print("Year in data : ", data["pickup_datetime"].dt.year.unique()[0])

In [None]:
# Mapping weekday numbers to weekday names (pandas dt.weekday: Monday=0 ... Sunday=6)
week_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday", 4: "Friday", 5: "Saturday", 6: "Sunday"}

data["weekday_name"] = data["pickup_datetime"].dt.weekday.map(week_names)
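
pandas numbers weekdays the same way as Python's `datetime.date.weekday()`: Monday is 0 and Sunday is 6. A small standard-library sketch that builds the mapping from `calendar.day_name` instead of typing it by hand (the date is an arbitrary example, and the English names assume the default locale):

```python
import calendar
from datetime import date

# calendar.day_name is ordered Monday..Sunday, matching weekday() numbering.
week_names = dict(enumerate(calendar.day_name))
print(week_names)

sample = date(2010, 4, 19)  # arbitrary example date
print(sample.weekday(), week_names[sample.weekday()])
```

pandas also offers `Series.dt.day_name()`, which returns the names directly and avoids the manual mapping.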

In [None]:
data.head()

In [None]:
plt.figure(figsize = (12,8))
data.groupby("weekday_name")["fare_amount"].sum().sort_values().plot()

plt.xlabel("Weekday", fontsize=15)
plt.ylabel("Total Fare Amount ($)", fontsize=15)
plt.title("Total Fare Amount by Weekday", fontsize=20)
plt.show()
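
A total per weekday depends on how many rides fall on that day, so it can help to look at the mean and the ride count next to the sum. A minimal sketch with an invented frame named `toy` (the values are made up, and the column names mirror the notebook's):

```python
import pandas as pd

# Hypothetical rides; values are invented for illustration.
toy = pd.DataFrame({
    "weekday_name": ["Monday", "Monday", "Monday", "Sunday"],
    "fare_amount":  [5.0, 7.0, 6.0, 15.0],
})

# Named aggregation: one output column per statistic.
summary = (
    toy.groupby("weekday_name")["fare_amount"]
       .agg(total="sum", average="mean", rides="count")
       .sort_values("total")
)
print(summary)
```

In this toy example Monday has the larger total only because it has more rides, while its average fare is lower than Sunday's.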

In [None]:
week_names_encode = {"Sunday": 1, "Saturday": 2, "Monday": 3, "Tuesday": 4, "Friday": 5, "Wednesday": 6, "Thursday": 7}

In [None]:
data["weekday_name"] = data["weekday_name"].map(week_names_encode)

In [None]:
data.head()

In [None]:
data["Hour"] = data["pickup_datetime"].dt.hour

In [None]:
data["Hour"].unique()

In [None]:

# Plotting graph of Fare vs Pickup time
plt.figure(figsize = (12,8))

data.groupby("Hour")["fare_amount"].sum().plot()
plt.title("Pickup Time vs Sum of Fare Amount at that Hour", fontsize=20)
plt.xlabel("Hour", fontsize=15)
plt.ylabel("Sum of Fare Amount", fontsize=15)
plt.show()

In [None]:
data["Month_Day"] = data["pickup_datetime"].dt.day

In [None]:
# Sum of Taxi Fare in a particular day

for day in list(data["pickup_datetime"].dt.day.sort_values().unique()):
    print(f"Date : {day} \t Total fare Amount : ${round(data[data.pickup_datetime.dt.day==day].fare_amount.sum(), 2)}")


In [None]:
plt.figure(figsize = (12, 8))

data.groupby("Month_Day")["fare_amount"].sum().plot()
plt.title("Day of Month vs Sum of Fare Amount on that Day", fontsize=20)
plt.xlabel("Month Day", fontsize=15)
plt.ylabel("Sum of Fare Amount", fontsize=15)
plt.show()

In [None]:
data.head()

In [None]:
data["passenger_count"].value_counts()

In [None]:
## Graph - Fare vs Distance

sns.relplot(data = data, kind = "scatter",x = "distance in kms",y = "fare_amount",
            hue = "passenger_count",height=6 ,aspect = 1.75,)
plt.title("Fare($) vs distance(kms)" , fontsize=15)
plt.show()


In [None]:
data.head()

In [None]:
data["fare_class"].value_counts()

In [None]:
data["fare_class"].unique()

In [None]:
# Total passengers travelling by taxi, split by fare class (fare amount less than or more than $10).

data.groupby(["fare_class","passenger_count"])[["passenger_count"]].sum()

In [None]:
plt.figure(figsize=(15,8))
data.groupby("passenger_count")["fare_amount"].sum().sort_values().plot.barh()
plt.xlabel("Total Fare($)",fontsize =13)
plt.ylabel("Passengers in Taxi", fontsize =13)
plt.title("Number of Passengers vs Total Fare of Taxi", fontsize = 15)
plt.show()

In [None]:
data.drop("pickup_datetime", axis=1, inplace=True)

In [None]:
data.head()

In [None]:
data.to_csv("data_transformed.csv", index=False)

In [None]:
df = pd.read_csv("data_transformed.csv")
df.head()

In [None]:
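# Separating dependent and independent features
# Dependent feature ---> fare_amount
# Note: fare_class appears to be derived from fare_amount (see the $10 split above),
# so keeping it in X may leak the target into the features.

X = df.drop(columns=["fare_amount"])
y = df["fare_amount"]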

In [None]:
X.head()

In [None]:
y.head()

In [None]:
from sklearn.model_selection import train_test_split
# A fixed random_state makes the split (and the scores below) reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; it does not affect ordinary least squares predictions
linreg = LinearRegression(fit_intercept=True)
linreg.fit(X_train , y_train)

In [None]:
y_pred = linreg.predict(X_test)

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test , y_pred)
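
R² alone does not show how large the errors are in dollars. A minimal NumPy sketch that reports MAE and RMSE alongside R², with the formulas written out. The helper name `regression_report` and the sample values are assumptions for illustration:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R^2, MAE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residual)),
        "rmse": np.sqrt(np.mean(residual ** 2)),
    }

# Made-up fares in dollars, for illustration only.
report = regression_report([3, 5, 7, 9], [2, 5, 8, 10])
print(report)
```

`sklearn.metrics` provides `mean_absolute_error` and `mean_squared_error` for the same quantities. Note that R² is not symmetric in its arguments, because the total sum of squares is computed from the true values.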

### Random Forest

Random forest is an ensemble technique: it trains many decision trees, each on a bootstrap sample of the training data, and averages their predictions to produce the output.

**A single decision tree grown to its full depth tends to have low bias but high variance; averaging many such trees reduces the variance.**
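
The bias/variance point can be illustrated on synthetic data. The sketch below is an assumption-laden toy (a noisy sine curve, not the taxi data): it compares a fully grown tree with a small forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# A fully grown tree can fit the training noise almost perfectly.
tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X_tr, y_tr)
# A forest averages many such trees, each fit on a bootstrap sample.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

tree_train, tree_test = tree.score(X_tr, y_tr), tree.score(X_te, y_te)
forest_test = forest.score(X_te, y_te)
print(f"tree   train R2={tree_train:.3f}  test R2={tree_test:.3f}")
print(f"forest test R2={forest_test:.3f}")
```

A large gap between the tree's training and test scores is the high-variance symptom. On noisy data like this, the forest typically narrows that gap.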

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor(n_estimators = 15)
rfreg.fit(X_train,y_train)

In [None]:
predict = rfreg.predict(X_test)

In [None]:
r2_score(y_test, predict)

In [None]:
from sklearn.tree import DecisionTreeRegressor
# 'mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
dt_reg = DecisionTreeRegressor(criterion='squared_error', max_depth=None, random_state=42)
dt_reg.fit(X_train, y_train)

In [None]:
pred = dt_reg.predict(X_test)
r2_score(y_test, pred)

In [None]:
# Decision plot
from sklearn import tree
plt.figure(figsize = (15,8))
tree.plot_tree(dt_reg, max_depth = 2, fontsize = 15, feature_names=list(X.columns))
plt.title("<---------------------Decision Tree Split-------------------->", fontsize = 20)
plt.show()

## XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. In supervised learning, the model usually refers to the mathematical structure by which the prediction yi is made from the input xi.
The parameters are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients θ, and we use θ to denote the parameters in general.
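
To give a feel for what "gradient boosting" means, here is a simplified sketch for squared error: each new shallow tree is fit to the residuals of the current prediction. This is not XGBoost's actual algorithm, which adds regularization and second-order information. The synthetic data, learning rate and tree depth are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)

learning_rate = 0.3
pred = np.full_like(y, y.mean())        # start from a constant prediction
trees = []
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(50):
    residual = y - pred                 # what the current ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred += learning_rate * stump.predict(X)  # take a small step toward the residuals
    trees.append(stump)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Here the learned "parameters" are the split points and leaf values of every tree in `trees`, rather than a single coefficient vector θ.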

In [None]:
from xgboost import XGBRegressor
xgb_reg = XGBRegressor(learning_rate= 0.30, max_depth=6, n_estimators=100, n_jobs =0)
xgb_reg.fit(X_train,y_train)

In [None]:
y_pred = xgb_reg.predict(X_test)

In [None]:
r2_score(y_test, y_pred)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np


In [None]:
n_estimators = [40,80,120,160]

criterion = ["squared_error", "absolute_error"]  # formerly "mse" and "mae"

max_depth = [int(x) for x in np.linspace(10,200,10)]

min_samples_split= [5,10,15]

min_samples_leaf = [4,6,8,10]

max_features = [1.0, 'sqrt', 'log2']  # 1.0 (all features) replaces the removed 'auto'

In [None]:
param_grid = {"n_estimators":n_estimators, "criterion":criterion, "max_depth":max_depth, "min_samples_split":
             min_samples_split, "min_samples_leaf":min_samples_leaf, "max_features":max_features}

In [None]:
rf_hyper = RandomForestRegressor()

In [None]:
rf_randomcv = RandomizedSearchCV(estimator=rf_hyper, param_distributions=param_grid, n_iter=10, 
                                 cv = 2, verbose=1, random_state=100, n_jobs=-1)

In [None]:
rf_randomcv.fit(X_train,y_train)
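
After a randomized search has been fit, the winning configuration can be read from the search object. The sketch below is self-contained: the synthetic data and the deliberately small search space are assumptions chosen so that it runs quickly, and they are not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.5 * X[:, 0] + rng.normal(0, 1.0, size=200)

# A deliberately small search space so the sketch runs quickly.
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4, cv=2, random_state=100,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean CV R2 of best candidate:", round(search.best_score_, 3))
best_model = search.best_estimator_  # refit on all of X, y (refit=True is the default)
```

In the notebook, the same attributes on `rf_randomcv` would give the tuned forest, which could then be scored on `X_test` like the other models.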

In [None]:
import pickle
filename = 'rf_NYCTaxifare_model.pkl'

# Save the fitted random forest; the context manager closes the file handle
with open(filename, 'wb') as f:
    pickle.dump(rfreg, f)
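
A saved model is only useful if it can be loaded again. A minimal round-trip sketch follows. It uses a tiny stand-in model and a temporary file named `demo_model.pkl`, both hypothetical, in place of the notebook's forest and file name:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be the fitted forest.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.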

In this notebook I have tried to predict the cost of a taxi ride, calculating the distance between the pickup and dropoff points with the Haversine formula. I have also applied EDA and feature engineering, which are basic steps for any kernel.
After that, I checked the accuracy of different models using r2_score; the models reach approximately the same R² score. I have also tried RandomizedSearchCV for the random forest model. Finally, I saved the model.
That is all for this notebook.

## I hope you like this notebook. If you learned something, please leave an upvote as a gesture of motivation and encouragement.
