<div class="alert alert-block alert-success"><b> 1. Load and read 55 million rows of data using pandas and Dask</b>

<b>File descriptions</b><br>
train.csv - Input features and target fare_amount values for the training set (about 55M rows).<br>
test.csv - Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.<br>
sample_submission.csv - A sample submission file in the correct format (columns key and fare_amount). This file 'predicts' fare_amount to be $11.35 for all rows, which is the mean fare_amount from the training set.

<b>Reference:</b><br>
https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask<br>
https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/<br></div>

<div class="alert alert-block alert-info"><b>Loading Libraries</b></div>

In [None]:
from dask.diagnostics import ProgressBar
from dask.distributed import progress
from distributed import Client
client = Client()
client

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math 
import pandas_profiling
%matplotlib inline

import dask
import dask.dataframe as dd
import datetime

import warnings
warnings.filterwarnings("ignore")

import os
print(os.listdir("../input/new-york-city-taxi-fare-prediction"))


In [None]:
os.chdir("../input/new-york-city-taxi-fare-prediction")

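<div class="alert alert-block alert-info"><b>Amazing loading speed from Dask</b><br>
<br>In this instance, the read call returned almost instantly with Dask, whereas pandas took<b> more than 3 minutes</b> to finish loading the same file. Note that the comparison is not like for like: Dask reads lazily, so this call only samples the start of the file to infer column types and builds a task graph. The rows themselves are read later, when a result is computed.</div>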

In [None]:
print('Start of Dask Read:',datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
dask_data = dd.read_csv("./train.csv")
print('End of Dask Read:',datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
# print('Start of Pandas Read:',datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
# pandas_data=pd.read_csv('./train.csv')
# print('End of Pandas Read:',datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
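
For the pandas side of the comparison, one way to keep memory bounded is to read the file in chunks rather than all at once. The sketch below is not part of the original notebook: it assumes only the `fare_amount` column is needed, uses a tiny in-memory CSV with made-up values as a stand-in for `train.csv`, and the helper name `chunked_mean` is hypothetical.

```python
import io
import pandas as pd

def chunked_mean(source, column, chunksize=100_000):
    """Mean of one CSV column, reading `chunksize` rows at a time."""
    total, count = 0.0, 0
    # usecols limits parsing to the single column we need
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")

# Tiny stand-in for train.csv (made-up values, for illustration only)
sample_csv = io.StringIO(
    "key,fare_amount,passenger_count\n"
    "a,4.5,1\n"
    "b,16.9,1\n"
    "c,5.7,2\n"
    "d,7.7,1\n"
)
mean_fare = chunked_mean(sample_csv, "fare_amount", chunksize=2)
print(f"Mean fare: {mean_fare:.2f}")
```

With the real file, `source` would be the path `"./train.csv"` and a much larger `chunksize` would likely be appropriate.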

<div class="alert alert-block alert-info"><b>Dask Descriptive Statistics</b></div>

In [None]:

dask_data.columns

In [None]:
with ProgressBar():
    dask_data.head()

In [None]:
display(dask_data.head(2))
print('Information:')
# Column dtypes come from the metadata; calling compute() first would load every row into memory
dask_data.dtypes
print('Shape:')
# len() counts the rows without materialising the whole table
(len(dask_data), len(dask_data.columns))
print('Describe:')
dask_data.describe().compute()
print('Columns:')
len(dask_data.columns)
print('Empty Values:')
dask_data.isnull().sum().compute()
print('Taxi fare Mean Value:')
dask_data.fare_amount.mean().compute()

<div class="alert alert-block alert-info"><b>Pandas Descriptive Statistics</b></div>

In [None]:
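# pandas_data exists only if the commented-out pd.read_csv lines above have been run
pandas_data.shape
pandas_data.head(2)
pandas_data.describe()
pandas_data['fare_amount'].unique()
# isna() is an alias of isnull(), so one call is enough
pandas_data.isnull().sum()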

<div class="alert alert-block alert-warning">Data profiling packages for Dask are still in their infancy, and running the profiler on the pandas DataFrame with a dataset this large inevitably crashed with a memory error.<br>
pandas_data.profile_report()<br>
profile = pandas_data.profile_report(title='Profiling Report')<br>
profile.to_file(outputfile="New York Taxi Fare data profiling.html") </div>
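
One common workaround, not used in the original notebook, is to profile a random sample instead of the full table. The sketch below shows only the sampling step on a synthetic frame. The 1% fraction, the seed, and the generated values are assumptions for illustration, and the profiling call is left commented out because it depends on `pandas_profiling`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full training frame (random values, not real data)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "fare_amount": rng.uniform(2.5, 60.0, size=10_000),
    "passenger_count": rng.integers(1, 7, size=10_000),
})

# Keep a 1% random sample; random_state makes the sample reproducible
sample = big.sample(frac=0.01, random_state=42)
print(sample.shape)

# The profiler would then run on the much smaller frame:
# sample.profile_report(title='Profiling Report')
```

Whether a sample is representative enough depends on the question being asked; rare values can easily be missed at small fractions.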

<div class="alert alert-block alert-success"><b> 2. Preprocessing</b></div>

In [None]:
# count missing values
missing_values = dask_data.isnull().sum().compute()
missing_values

In [None]:
# calculate the percentage of missing values per column
mysize = dask_data.index.size.compute()
missing_percent = (missing_values / mysize) * 100
missing_percent
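
The same percentage can be obtained in one step, because the mean of a boolean mask is the fraction of `True` values. A minimal pandas sketch on a made-up frame follows (the values are illustrative, not taken from the dataset). A Dask DataFrame offers the same `isnull` and `mean` methods, with `.compute()` added at the end.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fare_amount": [4.5, 16.9, 5.7, 7.7],
    "dropoff_longitude": [-73.84, np.nan, -73.99, -73.98],
    "dropoff_latitude": [40.71, np.nan, 40.75, np.nan],
})

# Mean of the boolean mask = fraction of missing rows per column
missing_percent = toy.isnull().mean() * 100
print(missing_percent)
```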

In [None]:
# Haversine formula: great-circle distance in km between pickup and dropoff points
def haversine_dist(long_pickup, long_dropoff, lat_pickup, lat_dropoff):
    
    distance = []
    
    # Row-by-row loop: the inputs must support len() and integer indexing
    for i in range(len(long_pickup)):
        long1, long2, lat1, lat2 = map(math.radians, 
                                       (long_pickup[i], long_dropoff[i], 
                                        lat_pickup[i], lat_dropoff[i]))
        dlat = (lat2 - lat1)
        dlong = (long2 - long1)    
        a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * (math.sin(dlong/2)**2)

        # 6371 km is the mean radius of the Earth
        distance.append(2 * math.asin(math.sqrt(a)) * 6371)

    return distance
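
The loop above handles one row at a time in Python, which is slow on millions of rows and relies on integer indexing. A vectorised variant using NumPy is sketched below as an alternative, not as the notebook's method: the name `haversine_dist_vec` and the toy coordinates are hypothetical, and the formula and Earth radius are the same as in the loop version.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same mean Earth radius as the loop version

def haversine_dist_vec(long_pickup, long_dropoff, lat_pickup, lat_dropoff):
    """Great-circle distance in km, computed on whole columns at once."""
    long1, long2, lat1, lat2 = map(
        np.radians, (long_pickup, long_dropoff, lat_pickup, lat_dropoff)
    )
    dlat = lat2 - lat1
    dlong = long2 - long1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    return 2 * np.arcsin(np.sqrt(a)) * EARTH_RADIUS_KM

# Toy trips: one degree of longitude along the equator, and a zero-length trip
trips = pd.DataFrame({
    "pickup_longitude": [0.0, -73.98],
    "dropoff_longitude": [1.0, -73.98],
    "pickup_latitude": [0.0, 40.75],
    "dropoff_latitude": [0.0, 40.75],
})
trips["dist_km"] = haversine_dist_vec(
    trips["pickup_longitude"], trips["dropoff_longitude"],
    trips["pickup_latitude"], trips["dropoff_latitude"],
)
print(trips["dist_km"])
```

Because every operation is a NumPy ufunc or plain arithmetic, the function returns a Series aligned with its inputs rather than a Python list.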

In [None]:
dask_data.columns

In [None]:
dist_km_interim = dask_data.map_partitions(
    lambda df: haversine_dist(df["pickup_longitude"], df["dropoff_longitude"],
                              df["pickup_latitude"], df["dropoff_latitude"]))

In [None]:
dask_data["dist_km"] = dist_km_interim

In [None]:
dask_data_new = dask_data.assign(dist_km = dist_km_interim )

In [None]:
with ProgressBar():
    dask_data_new.head()

In [None]:
def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])

In [None]:
# Load the test set lazily so the same feature can be added to it
dask_data_test = dd.read_csv("./test.csv")
dask_data['dist_km'] = haversine_dist(dask_data['pickup_longitude'],dask_data['dropoff_longitude'],dask_data['pickup_latitude'],dask_data['dropoff_latitude'])
dask_data_test['dist_km'] = haversine_dist(dask_data_test['pickup_longitude'],dask_data_test['dropoff_longitude'],dask_data_test['pickup_latitude'],dask_data_test['dropoff_latitude'])

dask_data.head(5)

<div class="alert alert-block alert-warning"><b>Errors running the Dask DataFrame; switching back to a pandas DataFrame</b></div>

<div class="alert alert-block alert-success"><b> 3. Machine Learning</b></div>

<div class="alert alert-block alert-info"><b>Simple Linear Regression</b></div>

<div class="alert alert-block alert-info"><b>CatBoost</b></div>

<div class="alert alert-block alert-info"><b>LightGBM</b></div>

<div class="alert alert-block alert-info"><b>XGBoost</b></div>

<div class="alert alert-block alert-success"><b>