# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Implementation of Linear Regression on a Large Dataset Using Dask Library

## Learning Objectives

At the end of the mini-project, you will be able to:

- understand how Dask handles large datasets compared with a pandas dataframe
- perform exploratory data analysis on a large dataset (2 million rows) using Dask
- implement a linear regression model using the Dask library and make predictions


## Problem Statement

Predict the taxi fare amount in New York City using Dask-ML.

## Information

### Dask
[Dask](https://dask.pydata.org/en/latest/) is an open-source project that provides abstractions over NumPy arrays, pandas dataframes and regular lists, allowing you to run operations on them in parallel across multiple cores.

We can summarize the basics of Dask as follows:

* processes data that doesn’t fit into memory by breaking it into blocks and specifying task chains

* parallelizes execution of tasks across cores and even nodes of a cluster

* moves computation to the data rather than the other way around, to minimize communication overhead
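
The points above can be illustrated with a minimal sketch of block-wise, lazy execution using `dask.array`. The array size and chunk size are arbitrary values chosen for illustration, not settings required by this project.

```python
import numpy as np
import dask.array as da

# Split a 1,000-element array into 10 blocks of 100 elements each.
x = da.from_array(np.arange(1000), chunks=100)

# Building the expression only records a task graph; nothing is computed yet.
lazy_total = (x ** 2).sum()

# .compute() executes the tasks block by block and returns the final result.
total = lazy_total.compute()
print(x.numblocks, total)
```

Each block is squared and summed on its own before the partial sums are combined, which is why the full array never has to be processed in one piece.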

### Dataset

The dataset is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. Its variables are as follows:
![Dataset](https://cdn.iisc.talentsprint.com/CDS/Images/NYC_Taxi_data_description.png)




## Grading = 10 Points

In [None]:
#@title Install Dask dependencies and restart runtime
!pip -qq install dask-ml==1.8.0
!pip -qq install dask==2.9.1
!pip -qq install dask[delayed]
!pip -qq install dask[dataframe] --upgrade

#### Importing Necessary Packages

In [None]:
import warnings
warnings.filterwarnings('ignore')
import dask
import dask.dataframe as dd
import dask.array as da
from dask_ml.linear_model import LinearRegression
from dask_ml.model_selection import train_test_split
from dask_ml.metrics import mean_squared_error, r2_score
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from dask.distributed import Client, progress
client = Client()

In [None]:
#@title Download the data
!wget https://cdn.iisc.talentsprint.com/CDS/MiniProjects/Dask_MP_dataset.csv

#### Exercise 1: Read the dataset using dask library and compare the time of execution with pandas library.

**Hint:** pass `dtype` for passenger_count as `int64`

In [None]:
%%time
# YOUR CODE HERE

#### Use pandas to read the dataset and compare the time taken

In [None]:
%%time
# YOUR CODE HERE

### Data Analysis (2 Points)



#### Exercise 2: Drop the unnecessary columns. Also drop the duplicate rows and the rows having null values.

**Hint:** Drop the columns that are useful for neither EDA nor model implementation

In [None]:
""" Drop unnecessary columns """
# YOUR CODE HERE

In [None]:
""" Drop duplicate rows """
# YOUR CODE HERE

In [None]:
""" drop NA rows """
# YOUR CODE HERE

#### Exercise 3: Visualize the target variable, i.e., `fare_amount`, using a histogram density plot to study the fare distribution. Analyze the `fare_amount` distribution and visualize it over the range [0, 60].

**Hint:** [sns.histplot()](https://stackoverflow.com/questions/51027636/seaborn-histogram-with-bigdata/51027895) and use `.between` to plot the graph for the given range
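
A minimal sketch of the range filter on a pandas Series is shown here. The fare values are made up for illustration; the filtered values can then be handed to the plotting call.

```python
import pandas as pd

# Hypothetical fares, including a negative value and a large outlier.
fares = pd.Series([-2.5, 0.0, 7.5, 12.0, 60.0, 250.0], name="fare_amount")

# Series.between is inclusive on both ends by default.
in_range = fares[fares.between(0, 60)]
print(in_range.tolist())
```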


In [None]:
""" explore and plot the density plot of fare_amount """
# YOUR CODE HERE

#### Observe the number of workers and cores running on your machine

Initialize a client and observe how many workers are running and how many cores are being used for the given dataset.

In [None]:
""" Initialize a client """
# YOUR CODE HERE

### EDA based on Time (2 Points)

#### Exercise 4: Extract day of the week (dow), hour, month and year from `pickup_datetime`.

**Hint:** use the `pd.to_datetime()` function on a pandas dataframe. Dask also offers `dd.to_datetime()` if you prefer to keep the conversion lazy.

Remember to call `.compute()` when passing the Dask dataframe to the function you define.
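
One possible shape for such a function is sketched below on a two-row pandas dataframe. The helper name `add_time_features` and the timestamp strings are hypothetical; the format of the real `pickup_datetime` column may differ.

```python
import pandas as pd

# Hypothetical timestamps; the real column's format may differ.
df = pd.DataFrame(
    {"pickup_datetime": ["2016-01-04 08:15:00", "2016-06-18 23:40:00"]}
)

def add_time_features(frame):
    """Return a copy with day-of-week, hour, month and year columns added."""
    out = frame.copy()
    ts = pd.to_datetime(out["pickup_datetime"])
    out["dow"] = ts.dt.dayofweek  # Monday = 0 ... Sunday = 6
    out["hour"] = ts.dt.hour
    out["month"] = ts.dt.month
    out["year"] = ts.dt.year
    return out

features = add_time_features(df)
print(features[["dow", "hour", "month", "year"]])
```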

In [None]:
# YOUR CODE HERE

#### Exercise 5: a.) Plot the number of taxi trips by hour of the day

* Partition the data into segments using `dd.from_pandas()`

* Plot the number of taxi trips for each hour of the day. **Hint:** [sns.catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html)

In [None]:
""" distribution of taxi trips by hour of the day """
# YOUR CODE HERE

#### Exercise 5: b.) Plot the distribution of taxi trips by day of the week (dow)

In [None]:
""" distribution of taxi trips by day of the week """
# YOUR CODE HERE

#### Exercise 6: a.) Plot the target variable against passenger count and analyze the result.

In [None]:
""" passenger count feature """
# YOUR CODE HERE

#### Exercise 6: b.) Plot the target variable against hour and analyze the result.

In [None]:
""" fare amount by hour """
# YOUR CODE HERE

### Feature Engineering (1 Point)

#### Exercise 7: Compute the Haversine distance between pickup and dropoff point

* Convert the latitude and longitude coordinates to radians

* Calculate the Haversine distance

  **Hint:** [haversine_distances](https://towardsdatascience.com/heres-how-to-calculate-distance-between-2-geolocations-in-python-93ecab5bbba4)

* Add the "distance" feature to the dataset and plot its distribution
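
The first two steps above can be sketched in plain NumPy, as an alternative to the scikit-learn helper in the linked article. The function name `haversine_km` and the Earth radius of 6371 km are assumptions made for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    # Step 1: convert degrees to radians.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Step 2: apply the Haversine formula.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian is roughly 111 km.
print(haversine_km(40.0, -74.0, 41.0, -74.0))
```

Because the function uses only NumPy operations, it accepts scalars as well as whole columns of coordinates.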

In [None]:
""" distance feature """
# YOUR CODE HERE

In [None]:
""" plot the distance feature (take distance < 50) """
# YOUR CODE HERE

### Correlation between distance and fare amount (1 Point)

In [None]:
""" correlation between fare_amount and distance """
# YOUR CODE HERE

### Preparing dataset for model implementation

**Note:** Use the above modified dataset for modelling.

In [None]:
# YOUR CODE HERE

### Removing outliers from training set based on coordinates (1 Point)

#### Exercise 8: Remove the outliers using the given latitude and longitude features from the dataset. We only want to analyze taxi trips within New York City.

**Hint:** The coordinates of New York City are latitude 40.7128° and longitude -74.0060°. Keep the pickup and drop-off points that fall within a range whose left and right bounds average to the given coordinate value.

Also, choose bounds that stay close to these coordinates.

Use `.between()` and pass the `left` and `right` arguments accordingly.
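
A minimal sketch of a symmetric bounding box is given here for the pickup columns only. The column names are assumed to match the dataset description, and the margin `HALF_WIDTH` is a hypothetical value to tune against the data.

```python
import pandas as pd

NYC_LAT, NYC_LON = 40.7128, -74.0060
HALF_WIDTH = 1.0  # hypothetical margin in degrees; tune to the data

# Hypothetical pickup points; the second row is clearly outside the city.
df = pd.DataFrame(
    {
        "pickup_latitude": [40.75, 0.0, 40.65],
        "pickup_longitude": [-73.99, 0.0, -73.80],
    }
)

# The bounds are symmetric, so their mean equals the city coordinate.
lat_ok = df["pickup_latitude"].between(NYC_LAT - HALF_WIDTH, NYC_LAT + HALF_WIDTH)
lon_ok = df["pickup_longitude"].between(NYC_LON - HALF_WIDTH, NYC_LON + HALF_WIDTH)

inside = df[lat_ok & lon_ok]
print(len(inside))
```

The same pair of masks would be built for the drop-off columns and combined with `&`.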

In [None]:
""" remove the outliers in pickup latitude longitude and drop off latitude and longitude """
# YOUR CODE HERE

### Modelling (3 Points)

#### Exercise 9: Divide the data into train and test splits with X as feature variables and y as target variable

* Divide the data into train and test sets with a 70-30 ratio. **Hint:** `train_test_split()`

* Because Dask operates lazily, call `.compute()` on the Dask dataframe before calling `.fit()`.
* Convert `X_train` and `y_train` into arrays using `.values`, as [Dask-ML's](https://ml.dask.org/modules/api.html) `.fit()` function takes arrays as arguments

In [None]:
""" select the target and feature variables and split the data into train and test """
# YOUR CODE HERE

#### Exercise 10: Predict the test data and calculate the mean squared error and r2 score.

**Hint:** Remember to call `.compute()`, as Dask functions operate lazily, and convert the Dask dataframe to an array with `.values`, as suggested in the exercise above
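
For reference, the two metrics can be written out in plain NumPy as a sketch of what they measure. The helper names `mse` and `r2` and the sample values are hypothetical; in the exercise itself, `mean_squared_error` and `r2_score` imported from `dask_ml.metrics` would be used.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical fares and predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 30.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```

A lower MSE means smaller errors in squared fare units, while an R^2 score closer to 1 means the model explains more of the variance in the target.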

In [None]:
""" predict the values """
# YOUR CODE HERE

In [None]:
""" compute mean squared error and r2_score """
# YOUR CODE HERE

### Report Analysis
* Discuss the pros and cons of using dask
* Derive the insights and discuss
* Comment on the performance metrics (MSE, R^2 score)
