# Taxi Demand Prediction (NYC Taxi) (Regression Problem)

## Story

This notebook shows how to derive a new column from a dataset to suit the problem at hand, and then train machine learning models on the result.


### About Dataset

The dataset comes from the New York City government site (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and covers January, February, and March of 2015 and 2016. It can serve many purposes, such as predicting the total fare of a trip, but here we use it for taxi demand prediction. The interesting part is that the CSV has no column for taxi demand in a specific area, so we will experiment with creating one and then build a machine learning model on it. Hope you like the work.


### Things to learn

1. Feature Engineering
2. How to handle a large dataset (CSV of 1.5 GB)
3. Machine Learning Techniques
4. Regression

## Cautions

I use only part of the data from each CSV because Kaggle does not allow more than 16 GB of RAM. If you use all of the data, you will get 97% accuracy on the validation data and 92% accuracy on the test data.

In [None]:
import time
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans, KMeans
import gpxpy.geo # Get the haversine distance
from sklearn.linear_model import LinearRegression
from sklearn import tree
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import math
from prettytable import PrettyTable

<h1 align="center">Loading Data</h1> 

We load the data into pandas DataFrames. The CSVs contain a lot of columns, but we only take the specific ones we need.

**Columns we will use:**
1. tpep_pickup_datetime    : Pick up Datetime
2. tpep_dropoff_datetime   : Drop Off Datetime
3. trip_distance           : Distance of Trip
4. pickup_longitude        : PickUp longitude
5. pickup_latitude         : Pickup Latitude
6. dropoff_longitude       : Dropoff Longitude
7. dropoff_latitude        : Dropoff Latitude
8. total_amount            : Total Fare amount

In [None]:

base_path = "../input/taxidemandfarepredictiondataset/"
# read only the columns we need, which speeds up loading and saves memory
columns=['tpep_pickup_datetime',
           'tpep_dropoff_datetime',
           'trip_distance',
           'pickup_longitude',
           'pickup_latitude',
           'dropoff_longitude',
           'dropoff_latitude',
           'total_amount']




df_2015_1 = pd.read_csv(f'{base_path}yellow_tripdata_2015-01.csv', usecols=columns, nrows=1000000)
df_2015_2 = pd.read_csv(f'{base_path}yellow_tripdata_2015-02.csv', usecols=columns, nrows=1000000)
df_2015_3 = pd.read_csv(f'{base_path}yellow_tripdata_2015-03.csv', usecols=columns, nrows=1000000)

df_2016_1 = pd.read_csv(f'{base_path}yellow_tripdata_2016-01.csv', usecols=columns, nrows=1000000)
df_2016_2 = pd.read_csv(f'{base_path}yellow_tripdata_2016-02.csv', usecols=columns, nrows=1000000)
df_2016_3 = pd.read_csv(f'{base_path}yellow_tripdata_2016-03.csv', usecols=columns, nrows=1000000)

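# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
# ignore_index=True gives every row a unique label, which the index-based
# drops in the cleaning steps below rely on.
df_2015 = pd.concat([df_2015_1, df_2015_2, df_2015_3], ignore_index=True)
df_2016 = pd.concat([df_2016_1, df_2016_2, df_2016_3], ignore_index=True)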

original_2015_len = df_2015.shape[0]
original_2016_len = df_2016.shape[0]
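
The cell above stays within the memory limit by capping each file with `nrows`. Another option is to stream a CSV in chunks and downcast the float columns as they arrive. The sketch below uses a small in-memory CSV as a stand-in for a monthly file; the helper name `read_in_chunks`, the sample rows, and the chunk size are illustrative assumptions and not part of the original notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a monthly trip file (made-up sample rows).
sample_csv = io.StringIO(
    "tpep_pickup_datetime,trip_distance,total_amount,extra\n"
    "2015-01-01 00:10:00,1.5,9.8,0.5\n"
    "2015-01-01 00:20:00,3.2,15.3,0.5\n"
    "2015-01-01 01:05:00,0.9,6.3,0.0\n"
    "2015-01-01 01:45:00,7.4,28.1,0.5\n"
    "2015-01-01 02:30:00,2.1,11.0,0.0\n"
)

def read_in_chunks(source, usecols, chunksize):
    """Read a CSV piece by piece, downcasting float64 columns to float32."""
    pieces = []
    for chunk in pd.read_csv(source, usecols=usecols, chunksize=chunksize):
        float_cols = chunk.select_dtypes(include="float64").columns
        chunk = chunk.astype({col: "float32" for col in float_cols})
        pieces.append(chunk)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(pieces, ignore_index=True)

sample_df = read_in_chunks(
    sample_csv,
    usecols=["tpep_pickup_datetime", "trip_distance", "total_amount"],
    chunksize=2,
)
print(sample_df.dtypes)
print(len(sample_df))
```

Whether float32 precision is acceptable for the coordinate columns depends on the use case, so treat the downcast as optional.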

## Preprocessing 

Preprocessing Includes some basic tasks like

1. Removing outliers
2. Build more features which help our model to learn things
3. Identifying and removing null values

In [None]:

'''
This function drops rows with missing values and rows whose pickup or dropoff
latitude and longitude are both 0. It also keeps only trips with a total amount
between $5 and $45, removing the upper and lower extremes (outlier removal).

Alternatively, you can compute quantiles and keep the rows that fall within your chosen range.
'''
def clean_data(df, test=False, predict=False):
    df = df.dropna(how='any', axis='rows')
    df = df[(df.dropoff_latitude != 0) | (df.dropoff_longitude != 0)]
    df = df[(df.pickup_latitude != 0) | (df.pickup_longitude != 0)]
    
    if "total_amount" in list(df):
        df = df[df.total_amount.between(5, 45)]
    
    return df

df_2015 = clean_data(df_2015)
df_2016 = clean_data(df_2016)
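
The docstring above mentions quantiles as an alternative to the fixed $5 to $45 range. A minimal sketch of that idea follows; the helper name `filter_by_quantiles`, the toy fares, and the 10% / 90% cutoffs are assumptions chosen for illustration only.

```python
import pandas as pd

def filter_by_quantiles(df, column, lower=0.05, upper=0.95):
    """Keep rows whose `column` value lies between the two quantiles (inclusive)."""
    low, high = df[column].quantile([lower, upper])
    return df[df[column].between(low, high)]

# Made-up fares with one negative value and one extreme value
fares = pd.DataFrame(
    {"total_amount": [-3.0, 6.5, 8.0, 9.5, 12.0, 14.0, 18.5, 22.0, 30.0, 950.0]}
)
trimmed = filter_by_quantiles(fares, "total_amount", lower=0.1, upper=0.9)
print(trimmed["total_amount"].tolist())
```

Unlike fixed dollar bounds, quantile cutoffs adapt to the data, but they always discard a fixed share of rows even when no outliers are present.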

<h1 align="center">Data Cleaning</h1> 

In [None]:
# print percentile values to help decide where to start removing outliers
def remove_outliers(data, start=0, end=100):
    data=np.sort(data)
    for i in np.linspace(start, end, 10):
        i=round(i, 6)
        print(str(i).zfill(5) + " percentile value is " + str(round(data[int(len(data)*(float(i)/100))-1], 1)))
    print(str(float(end)).zfill(3) + " percentile value is " + str(data[-1]))
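
The same kind of percentile inspection can be sketched with `np.percentile`, which interpolates between sorted values rather than indexing into the sorted array. The sample values and the 90 to 100 range below are made up for illustration.

```python
import numpy as np

# Made-up values with one obvious outlier at the top
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 500.0])

# Percentiles from 90 to 100 in steps of 2; a sharp jump hints at where outliers begin
percentiles = np.arange(90, 101, 2)
cutoffs = np.percentile(values, percentiles)

for p, v in zip(percentiles, cutoffs):
    print(f"{int(p):>3d} percentile value is {v:.1f}")
```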

### 1. <font color='red'>**pickup_latitude**</font> and <font color='red'>**pickup_longitude**</font>  
[NYC Coordinates Source](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc)

In [None]:
# drop rows with coordinates outside NYC 
def clean_coordinates(df):
    nrows = df.shape[0]
    df.drop(df.index[
        
            ~((df['pickup_latitude'].between(40.496115395170364, 40.91553277700258)) &
              (df['pickup_longitude'].between(-74.25559136315209, -73.7000090639354))) 
        
    ], inplace=True)
    print("Number of rows removed due to wrong coordinates is {}".format(nrows - df.shape[0]))
    
clean_coordinates(df_2015)
clean_coordinates(df_2016)

### 2. <font color='red'>**trip_duration**</font> - **tpep_pickup_datetime** and **tpep_dropoff_datetime**

In [None]:
def clean_trip_duration(df):
    # convert from object to datetime
    df['tpep_pickup_datetime']  = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime']  = pd.to_datetime(df['tpep_dropoff_datetime'])
    
    # compute the time difference between pickup and dropoff
    # to convert from nanoseconds to minutes we divide by 1000000000, then by 60
    # store the result in the trip_duration column
    trip_duration = np.array(df['tpep_dropoff_datetime']-df['tpep_pickup_datetime'])
    trip_duration = trip_duration/1000000000/60
    df['trip_duration'] = trip_duration.astype(float)
    
    # drop all records that have trip_duration > 160 minutes
    #                         or trip_duration <= 0
    nrows = df.shape[0]
    df.drop(df[(df['trip_duration'] > 160) | 
               (df['trip_duration'] <= 0)].index, inplace = True)
    print("Number of rows removed due to wrong trip_duration {}".format(nrows - df.shape[0]))
    
    
clean_trip_duration(df_2015)
clean_trip_duration(df_2016)
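
As an alternative to dividing nanoseconds by hand, the duration can be taken from the timedelta with `.dt.total_seconds()`. The sketch below uses three made-up trips and applies the same bounds as the cell above; note that this variant keeps fractional minutes.

```python
import pandas as pd

# Three made-up trips: 15.5 minutes, 3 hours, and zero duration
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-01-01 00:10:00", "2015-01-01 08:00:00", "2015-01-01 09:00:00"],
    "tpep_dropoff_datetime": ["2015-01-01 00:25:30", "2015-01-01 11:00:00", "2015-01-01 09:00:00"],
})
pickup = pd.to_datetime(trips["tpep_pickup_datetime"])
dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"])

# Timedelta -> seconds -> minutes, keeping fractional minutes
trips["trip_duration"] = (dropoff - pickup).dt.total_seconds() / 60

# Same bounds as the cell above: keep 0 < duration <= 160 minutes
valid = trips[(trips["trip_duration"] > 0) & (trips["trip_duration"] <= 160)]
print(valid["trip_duration"].tolist())
```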

### 3. <font color='red'>**pickup_time**</font>

In [None]:
def clean_pickuptime(df):
    return df.rename(columns={'tpep_pickup_datetime': 'pickup_time'})

df_2015 = clean_pickuptime(df_2015)
df_2016 = clean_pickuptime(df_2016)




### 4. <font color='red'>**trip_distance**</font>

In [None]:
def clean_trip_distance(df):
    nrows = df.shape[0]
    df.drop(df[(df['trip_distance'] <= 0) | (df['trip_distance'] > 77.5)].index, inplace = True)
    print("Number of rows removed due to trip_distance outliers {}".format(nrows - df.shape[0]))
    
clean_trip_distance(df_2015)
clean_trip_distance(df_2016)

### 5. <font color='red'>**speed**</font> - trip_distance/trip_duration

In [None]:
def compute_speed(df):
    # computing Taxi speed average (mile/hour)
    df['speed'] = df['trip_distance']/df['trip_duration']*60
    
def clean_speed(df):

    # Removing speed anomaly/outliers
    nrows = df.shape[0]
    df.drop(df[((df['speed'] <= 0) | (df['speed'] > 63.0))].index, inplace = True)
    print("Number of rows removed due to speed outliers {}".format(nrows - df.shape[0]))


compute_speed(df_2015)
compute_speed(df_2016)    
clean_speed(df_2015)
clean_speed(df_2016)



### 6. <font color='red'>**K-Means with respect to longitude and latitude**</font>


In [None]:
from datetime import datetime, timedelta
from sklearn.cluster import MiniBatchKMeans, KMeans
from pandarallel import pandarallel


#Clustering pickups
print("Getting clusters")
coord = df_2015[["pickup_latitude", "pickup_longitude"]].values
regions = MiniBatchKMeans(n_clusters = 30, batch_size = 10000).fit(coord)

print("Predicting clusters")
cluster_column = regions.predict(df_2015[["pickup_latitude", "pickup_longitude"]])
cluster_column_2016 = regions.predict(df_2016[["pickup_latitude", "pickup_longitude"]])
df_2015["pickup_cluster"] = cluster_column
df_2016["pickup_cluster"] = cluster_column_2016
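
Once fitted, a k-means model labels a point with the index of its nearest cluster center, which is the step `regions.predict` performs for the 2015 and 2016 pickups. Below is a minimal NumPy sketch of that assignment step only; the helper name `assign_to_nearest_center` and the three centers are made up for illustration, and distances are plain Euclidean distances on raw latitude/longitude degrees.

```python
import numpy as np

def assign_to_nearest_center(points, centers):
    """Return, for each (lat, lon) point, the index of the closest center."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centers)
    diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

# Hypothetical centers, roughly Midtown, JFK and LaGuardia
centers = np.array([[40.755, -73.985], [40.645, -73.785], [40.775, -73.872]])
pickups = np.array([
    [40.758, -73.980],
    [40.650, -73.790],
    [40.770, -73.870],
    [40.740, -73.990],
])
labels = assign_to_nearest_center(pickups, centers)
print(labels)
```

Because the centers are learned from 2015 pickups only, the 2016 labels refer to the same regions, which keeps `pickup_cluster` comparable across both years.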




In [None]:
# Replacing minutes and seconds with 0, then moving to the next full hour
print("Removing minutes and seconds")
pandarallel.initialize()
df_2015['pickup_time'] = df_2015.pickup_time.parallel_apply(lambda x : pd.to_datetime(x).replace(minute=0, second=0) + timedelta(hours=1))
df_2016['pickup_time'] = df_2016.pickup_time.parallel_apply(lambda x : pd.to_datetime(x).replace(minute=0, second=0) + timedelta(hours=1))
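
The same hour bucketing can be sketched without a per-row lambda by using the vectorized `.dt.floor` accessor. The timestamps below are made up, and the sketch assumes the column already has a datetime dtype.

```python
import pandas as pd

pickup_time = pd.Series(pd.to_datetime([
    "2015-01-01 00:10:45",
    "2015-01-01 00:59:59",
    "2015-01-01 23:30:00",
]))

# Floor to the start of the hour, then shift by one hour (same rule as the cell above)
hour_bucket = pickup_time.dt.floor("h") + pd.Timedelta(hours=1)
print(hour_bucket.tolist())
```

On millions of rows a vectorized operation like this is usually much faster than applying a Python function to each value.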


In [None]:

print("Group by Cluster and time")
df2 = df_2015.groupby(['pickup_time','pickup_cluster']).size().reset_index(name='count')
df1 = df_2016.groupby(['pickup_time','pickup_cluster']).size().reset_index(name='count')

print("Scaling counts by the maximum count (demand as a fraction of peak demand)")
# vectorized division; the maximum is computed once instead of once per row
df2['count'] = df2['count'] / df2['count'].max()
df1['count'] = df1['count'] / df1['count'].max()


print("Getting month, days, hours, day of week")
df2['month'] = pd.DatetimeIndex(df2['pickup_time']).month
df2['day'] = pd.DatetimeIndex(df2['pickup_time']).day
df2['dayofweek'] = pd.DatetimeIndex(df2['pickup_time']).dayofweek
df2['hour'] = pd.DatetimeIndex(df2['pickup_time']).hour


df1['month'] = pd.DatetimeIndex(df1['pickup_time']).month
df1['day'] = pd.DatetimeIndex(df1['pickup_time']).day
df1['dayofweek'] = pd.DatetimeIndex(df1['pickup_time']).dayofweek
df1['hour'] = pd.DatetimeIndex(df1['pickup_time']).hour
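
Since the CSV has no demand column, the target is built by counting pickups per hour and cluster. A toy version of that construction, using six made-up pickups, could look like this:

```python
import pandas as pd

# Six made-up pickups, already bucketed by hour and assigned to a cluster
toy = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2015-01-05 09:00", "2015-01-05 09:00", "2015-01-05 09:00",
        "2015-01-05 09:00", "2015-01-05 10:00", "2015-01-05 10:00",
    ]),
    "pickup_cluster": [0, 0, 0, 1, 0, 1],
})

# One row per (hour, cluster); the group size is the raw demand
demand = toy.groupby(["pickup_time", "pickup_cluster"]).size().reset_index(name="count")

# Scale by the busiest (hour, cluster) pair so the target lies in (0, 1]
demand["count"] = demand["count"] / demand["count"].max()

# Calendar features via the .dt accessor
demand["hour"] = demand["pickup_time"].dt.hour
demand["dayofweek"] = demand["pickup_time"].dt.dayofweek
print(demand)
```

Note that (hour, cluster) pairs with no pickups produce no row at all, so zero demand is never seen by the model in this construction.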


### 7. <font color='red'>**Split data into train and test, X and y**</font>

In [None]:
# training X and y
X_2015_1 = df2[['pickup_cluster', 'month', 'day', 'hour', 'dayofweek']]
y_2015_1 = df2['count']


# test X and y (2016 data)
X_2016_1 = df1[['pickup_cluster', 'month', 'day', 'hour', 'dayofweek']]
y_2016_1 = df1['count']

print(len(X_2015_1))
print(len(y_2015_1))


### 8. <font color='red'>**Saving Preprocessed DataFrame for Later processing**</font>

In [None]:

# X_2016.to_csv("X_2016_X.csv")
# y_2016.to_csv("X_2016_Y.csv")
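from sklearn.model_selection import train_test_split
# Hold out 33% of the 2015 data for validation.
# Note: despite their names, X_2016 and y_2016 are this 2015 validation split;
# the real 2016 data (X_2016_1, y_2016_1) is used in the Testing section.
X_2015, X_2016, y_2015, y_2016 = train_test_split(
     X_2015_1.values, y_2015_1.values, test_size=0.33, random_state=42)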

<h1 align="center">Models Training</h1> 

In [None]:
print('model training 0/3 (creating model)', end='\r')
LReg = LinearRegression()

print('model training 1/3 (fitting model)', end='\r')
LReg.fit(X_2015, y_2015)

print('model training 2/3 (predicting)    ', end='\r')
LReg_y_pred = LReg.predict(X_2016)

print('model training 3/3 done!           ', end='\r')

In [None]:
print('model training 0/3 (creating model)', end='\r')
RFRegr = RandomForestRegressor()

print('model training 1/3 (fitting model)', end='\r')
RFRegr.fit(X_2015, y_2015)

print('model training 2/3 (predicting)    ', end='\r')
RFRegr_y_pred = RFRegr.predict(X_2016)

print('model training 3/3 done!           ', end='\r')

In [None]:
print('model training 0/3 (creating model)', end='\r')
GBRegr = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)

print('model training 1/3 (fitting model)', end='\r')
GBRegr.fit(X_2015, y_2015)

print('model training 2/3 (predicting)    ', end='\r')
GBRegr_y_pred = GBRegr.predict(X_2016)

print('model training 3/3 done!           ', end='\r')

<h1 align="center">Models Evaluation</h1> 

In [None]:
def model_evaluation(algorithem_name, X_Test, y_pred, y_true):
    
    # R2 and Adjusted R2
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1-(1-r2)*((len(X_Test)-1)/(len(X_Test)-X_Test.shape[1]-1))
    # MSE and RMSE
    mse = mean_squared_error(y_true, y_pred)
    rmse = math.sqrt(mse)
    
    # print in table
    x = PrettyTable()
    x.add_row(['R2', r2])
    x.add_row(['Adjusted R2', adj_r2])
    x.add_row(['MSE',mse])
    x.add_row(['RMSE', rmse])
    x.title = algorithem_name
    print(x)
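
To make the metrics concrete, here is a small hand computation of R2, adjusted R2, MSE and RMSE with NumPy. The five values and the feature count of 2 are made up for illustration; the adjusted R2 line uses the same formula as `model_evaluation`.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
n_samples = len(y_true)
n_features = 2  # assumed number of model inputs for this toy example

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot
# Adjusted R2 penalizes R2 for the number of features used
adj_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
mse = ss_res / n_samples
rmse = np.sqrt(mse)
print(r2, adj_r2, mse, rmse)
```

With millions of rows and only five features, the adjustment factor is very close to 1, so R2 and adjusted R2 will be nearly identical in this notebook.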
    


### 1. <font color='red'>**y_True**</font> 

In [None]:
model_evaluation('y True',X_Test=X_2016, y_pred=y_2016, y_true=y_2016)

### 2. <font color='red'>**Linear Regression**</font>

In [None]:
model_evaluation('Linear Regression',X_Test=X_2016, y_pred=LReg_y_pred, y_true=y_2016)

### 3. <font color='red'>**Random Forest**</font> 

In [None]:
model_evaluation('Random Forest',X_Test=X_2016, y_pred=RFRegr_y_pred, y_true=y_2016)

### 4. <font color='red'>**Gradient Boosting**</font>

In [None]:
model_evaluation('Gradient Boosting',X_Test=X_2016, y_pred=GBRegr_y_pred, y_true=y_2016)

## Testing

In [None]:
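# predict on the real 2016 data; .values matches the NumPy arrays used for fitting
LReg_y_pred = LReg.predict(X_2016_1.values)
RFRegr_y_pred = RFRegr.predict(X_2016_1.values)
GBRegr_y_pred = GBRegr.predict(X_2016_1.values)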

### Actual

In [None]:
model_evaluation('y True',X_Test=X_2016_1, y_pred=y_2016_1, y_true=y_2016_1)

### Linear Regression

In [None]:
model_evaluation('Linear Regression',X_Test=X_2016_1, y_pred=LReg_y_pred, y_true=y_2016_1)

### Random Forest

In [None]:
model_evaluation('Random Forest',X_Test=X_2016_1, y_pred=RFRegr_y_pred, y_true=y_2016_1)

### Gradient Boosting

In [None]:
model_evaluation('Gradient Boosting',X_Test=X_2016_1, y_pred=GBRegr_y_pred, y_true=y_2016_1)