<img style="float: right" src="../img/saturn.png" width="300" />

# Machine Learning on Big Data with Dask

## Single-node workflow

We'll start with a typical data preparation and machine learning workflow that uses only the Jupyter Server.


## Monitor resource utilization

For this workshop, it's important to monitor CPU and memory utilization while running commands. Doing so helps show which operations are slow, and which ones run faster on a cluster!

To monitor resource utilization of the Jupyter Server, open a new Terminal window inside Jupyter Lab and run `htop`. You can position the window to view the notebook and terminal on the same screen:

![htop](../img/htop.png)
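
If a terminal is not handy, a rough reading is also available from inside the notebook. The following is a minimal sketch using only the Python standard library. It assumes a Linux server, where `ru_maxrss` is reported in KiB, and the helper name `peak_memory_mb` is made up for this illustration. It reports the peak memory of the notebook process only, so `htop` remains the better tool for watching the whole machine.

```python
import os
import resource  # Unix-only standard library module


def peak_memory_mb():
    """Peak resident memory of this Python process in MB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kib * 1024 / 1e6


print(f"CPU cores visible to Python: {os.cpu_count()}")
print(f"Peak memory of this process: {peak_memory_mb():.1f} MB")
```

Calling `peak_memory_mb()` before and after a heavy cell gives a quick sense of how much memory that cell needed at its peak.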

# Load data

This workshop will utilize [NYC taxi data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for yellow taxi rides from the 2019 calendar year. The machine learning exercises involve predicting the "tip fraction" of each ride - how much a rider will tip the driver as a fraction of the charged fare amount.

Let's work with one month for now to explore the data and build out the machine learning code.

In [1]:
import pandas as pd
import numpy as np

In [2]:
%%time

taxi = pd.read_csv(
    'https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2019-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
)

CPU times: user 12.3 s, sys: 3.24 s, total: 15.6 s
Wall time: 20.3 s


In [3]:
taxi.head()

|  | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-01-01 00:46:40 | 2019-01-01 00:53:20 | 1 | 1.5 | 1 | N | 151 | 239 | 1 | 7.0 | 0.5 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 |  |
| 1 | 1 | 2019-01-01 00:59:47 | 2019-01-01 01:18:59 | 1 | 2.6 | 1 | N | 239 | 246 | 1 | 14.0 | 0.5 | 0.5 | 1.0 | 0.0 | 0.3 | 16.3 |  |
| 2 | 2 | 2018-12-21 13:48:30 | 2018-12-21 13:52:40 | 3 | 0.0 | 1 | N | 236 | 236 | 1 | 4.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 5.8 |  |
| 3 | 2 | 2018-11-28 15:52:25 | 2018-11-28 15:55:45 | 5 | 0.0 | 1 | N | 193 | 193 | 2 | 3.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 7.55 |  |
| 4 | 2 | 2018-11-28 15:56:57 | 2018-11-28 15:58:33 | 5 | 0.0 | 2 | N | 193 | 193 | 2 | 52.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 55.55 |  |


### Exercise

How many rows are in the `taxi` DataFrame?

In [4]:
len(taxi)

7667792

Memory usage is also an important consideration, as DataFrames often take more space in memory than on disk:

In [5]:
taxi_bytes = taxi.memory_usage(deep=True).sum()
print(f"Size (MB): {taxi_bytes / 1e6}")

Size (MB): 1487.551776
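
One likely contributor to the gap is text columns such as `store_and_fwd_flag`. The sketch below uses a small made-up DataFrame (not the taxi data) to compare `memory_usage` with and without `deep=True`. It assumes a pandas version that stores strings as Python objects. On versions with a dedicated string type the two numbers may be much closer.

```python
import pandas as pd

# Toy stand-in for two taxi columns: one text flag, one numeric fare.
demo = pd.DataFrame({
    "store_and_fwd_flag": ["N"] * 1000,
    "fare_amount": [7.0] * 1000,
})

# deep=False counts only the array buffers (pointers, for object columns).
shallow_bytes = demo.memory_usage(deep=False).sum()

# deep=True also inspects the Python objects the column points to.
deep_bytes = demo.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

This is why the cell above passes `deep=True`: without it, the reported size can understate what the DataFrame really occupies.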


# Exploratory analysis

For this workshop, we will just look at column statistics. There are many more exploratory analyses that can be performed with `pandas` and data visualization tools.

In [6]:
%%time
np.round(taxi.describe().T, 3)

CPU times: user 3.9 s, sys: 248 ms, total: 4.15 s
Wall time: 4.15 s


|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| VendorID | 7667792.0 | 1.637 | 0.54 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| passenger_count | 7667792.0 | 1.567 | 1.224 | 0.0 | 1.0 | 1.0 | 2.0 | 9.0 |
| trip_distance | 7667792.0 | 2.801 | 3.738 | 0.0 | 0.9 | 1.53 | 2.8 | 831.8 |
| RatecodeID | 7667792.0 | 1.058 | 0.678 | 1.0 | 1.0 | 1.0 | 1.0 | 99.0 |
| PULocationID | 7667792.0 | 165.501 | 66.392 | 1.0 | 130.0 | 162.0 | 234.0 | 265.0 |
| DOLocationID | 7667792.0 | 163.753 | 70.364 | 1.0 | 113.0 | 162.0 | 234.0 | 265.0 |
| payment_type | 7667792.0 | 1.292 | 0.473 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| fare_amount | 7667792.0 | 12.409 | 262.072 | -362.0 | 6.0 | 8.5 | 13.5 | 623259.86 |
| extra | 7667792.0 | 0.328 | 0.507 | -60.0 | 0.0 | 0.0 | 0.5 | 535.38 |
| mta_tax | 7667792.0 | 0.497 | 0.053 | -0.5 | 0.5 | 0.5 | 0.5 | 60.8 |


# Feature engineering

We are using stateless features, meaning the feature values for a given observation don't depend on other observations. This allows us to create features before performing any data splitting.

Then, split data into train/test sets.

In [7]:
# specify feature and label column names
raw_features = [
    'tpep_pickup_datetime', 
    'passenger_count', 
    'tip_amount', 
    'fare_amount',
]
features = [
    'pickup_weekday', 
    'pickup_weekofyear', 
    'pickup_hour', 
    'pickup_week_hour', 
    'pickup_minute', 
    'passenger_count',
]
label = 'tip_fraction'

In [8]:
def prep_df(taxi_df):
    '''
    Generate features from a raw taxi dataframe.
    '''
    df = taxi_df[taxi_df.fare_amount > 0][raw_features].copy()  # avoid divide-by-zero
    df[label] = df.tip_amount / df.fare_amount
     
    df['pickup_weekday'] = df.tpep_pickup_datetime.dt.weekday
    df['pickup_weekofyear'] = df.tpep_pickup_datetime.dt.isocalendar().week
    df['pickup_hour'] = df.tpep_pickup_datetime.dt.hour
    df['pickup_week_hour'] = (df.pickup_weekday * 24) + df.pickup_hour
    df['pickup_minute'] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [label]].astype(float).fillna(-1)
    
    return df

In [9]:
taxi_feat = prep_df(taxi)
taxi_feat.head()

|  | pickup_weekday | pickup_weekofyear | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 24.0 | 46.0 | 1.0 | 0.235714 |
| 1 | 1.0 | 1.0 | 0.0 | 24.0 | 59.0 | 1.0 | 0.071429 |
| 2 | 4.0 | 51.0 | 13.0 | 109.0 | 48.0 | 3.0 | 0.0 |
| 3 | 2.0 | 48.0 | 15.0 | 63.0 | 52.0 | 5.0 | 0.0 |
| 4 | 2.0 | 48.0 | 15.0 | 63.0 | 56.0 | 5.0 | 0.0 |
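
To see where these numbers come from, the time features can be reproduced by hand for single timestamps. The sketch below uses only the standard library `datetime` module rather than pandas, and the helper name `pickup_features` is made up for this illustration. It recomputes the features for the pickup times of rows 0 and 2 above.

```python
from datetime import datetime


def pickup_features(ts):
    """Recompute the time-based features for one pickup timestamp."""
    weekday = ts.weekday()  # Monday=0 ... Sunday=6, the same convention pandas uses
    return {
        "pickup_weekday": weekday,
        "pickup_weekofyear": ts.isocalendar()[1],  # ISO week number
        "pickup_hour": ts.hour,
        "pickup_week_hour": weekday * 24 + ts.hour,  # hour within the week, 0 to 167
        "pickup_minute": ts.minute,
    }


row0 = pickup_features(datetime(2019, 1, 1, 0, 46, 40))     # a Tuesday just after midnight
row2 = pickup_features(datetime(2018, 12, 21, 13, 48, 30))  # a Friday afternoon

print(row0)
print(row2)
```

Row 0 falls on a Tuesday (weekday 1) at hour 0, so its `pickup_week_hour` is 1 * 24 + 0 = 24. Row 2 falls on a Friday (weekday 4) at hour 13, giving 4 * 24 + 13 = 109.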


Split into train/test sets

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    taxi_feat[features], 
    taxi_feat[label], 
    test_size=0.3,
    random_state=42
)

In [11]:
X_train.shape, y_train.shape

((5360764, 6), (5360764,))

In [12]:
X_test.shape, y_test.shape

((2297471, 6), (2297471,))
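
The shapes can be sanity-checked with a little arithmetic. This sketch only reuses the row counts printed earlier in the notebook. It confirms that the train and test sets together hold slightly fewer rows than the raw data (the `fare_amount > 0` filter drops the rest) and that the test set is 30% of the kept rows, rounded up.

```python
import math

total_rows = 7_667_792  # len(taxi)
train_rows = 5_360_764  # X_train.shape[0]
test_rows = 2_297_471   # X_test.shape[0]

kept_rows = train_rows + test_rows     # rows that survived prep_df
dropped_rows = total_rows - kept_rows  # rows removed by the fare_amount > 0 filter
expected_test_rows = math.ceil(0.3 * kept_rows)

print(f"kept: {kept_rows}, dropped: {dropped_rows}")
print(f"expected test rows: {expected_test_rows}, actual: {test_rows}")
```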

# Train model

We'll train a linear model to predict `tip_fraction`. We define a `Pipeline` to encompass both feature scaling and model training. This will be useful later when performing a grid search.

We then evaluate the model on the test set using RMSE (root mean squared error) and save the model for later use.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    # `normalize` was removed from ElasticNet in scikit-learn 1.2; its old default was False
    ('clf', ElasticNet(max_iter=100, l1_ratio=0)),
])
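
Naming the pipeline steps is what makes a grid search convenient: each step's parameters can be addressed as `<step>__<parameter>`. The following is a minimal sketch of that idea on small synthetic data, not the workshop's actual grid search. The data, the parameter values, and the names `demo_pipeline` and `param_grid` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the six taxi features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([0.5, -0.2, 0.0, 0.1, 0.0, 0.3])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('clf', ElasticNet(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter name>" convention.
param_grid = {
    'clf__alpha': [0.01, 0.1, 1.0],
    'clf__l1_ratio': [0.2, 0.5, 0.8],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every cross-validation fold, so no information from the held-out fold leaks into the scaling.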

In [14]:
%%time
fitted = pipeline.fit(X_train, y_train)

CPU times: user 14.8 s, sys: 235 ms, total: 15 s
Wall time: 7.98 s




In [15]:
%%time
preds = fitted.predict(X_test)

CPU times: user 162 ms, sys: 24.3 ms, total: 187 ms
Wall time: 92.9 ms


In [16]:
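# RMSE: take the square root explicitly, since newer scikit-learn releases no longer accept `squared=False`
np.sqrt(mean_squared_error(y_test, preds))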

5.118045186129104
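
For reference, RMSE is the square root of the average squared difference between the true and predicted values. Here is a tiny worked example in plain Python. The three tip fractions and predictions are made up for illustration and do not come from the taxi data.

```python
import math

# Made-up tip fractions and predictions for three rides.
y_true = [0.20, 0.00, 0.10]
y_pred = [0.10, 0.10, 0.10]

# Squared error per ride, then the mean, then the square root.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE: {mse:.6f}, RMSE: {rmse:.6f}")
```

RMSE is expressed in the same units as the target, so the value above can be read directly as an error in tip fraction.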

In [17]:
import cloudpickle

with open('/tmp/model.pkl', 'wb') as f:
    cloudpickle.dump(fitted, f)
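
Loading the model back later is the mirror image of saving it. Files written by `cloudpickle.dump` are ordinary pickle streams, so the standard library `pickle.load` can read them, as long as the libraries the model depends on (here scikit-learn) are importable where it is loaded. The sketch below shows the round trip with a small stand-in object and a temporary path, both made up for illustration, so that it runs without the fitted pipeline.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object behaves the same way here.
model_stub = {"coef": [0.1, -0.2], "intercept": 0.05}

# Hypothetical location in a fresh temporary directory (the notebook uses /tmp/model.pkl).
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save the object to disk.
with open(model_path, "wb") as f:
    pickle.dump(model_stub, f)

# Read it back; for the real model, this is where `fitted` would be restored.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```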

## Hooray!

We trained a terrible model. But that's okay. The point of this workshop is to scale our work, not make the model perfect!

# Let's step things up

We were able to train a model on a sample of the taxi data (a single month from 2019). Many model-building settings require more data. Follow along below to see where the single-node environment starts to struggle. The rest of the workshop covers how Dask solves these problems!


## Load and process large dataset

Let's look at the size of the files on disk using `s3fs`. 

In [18]:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)

files = s3.glob('s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv')
total_size = 0
for f in files:
    size = s3.du(f)
    total_size += size
    
    print(f"{f}, Size: {round(size / 1e6, 2)} MB")
print()
print(f"Total size: {round(total_size / 1e9, 2)} GB")

nyc-tlc/trip data/yellow_tripdata_2019-01.csv, Size: 687.09 MB
nyc-tlc/trip data/yellow_tripdata_2019-02.csv, Size: 649.88 MB
nyc-tlc/trip data/yellow_tripdata_2019-03.csv, Size: 726.2 MB
nyc-tlc/trip data/yellow_tripdata_2019-04.csv, Size: 689.21 MB
nyc-tlc/trip data/yellow_tripdata_2019-05.csv, Size: 701.54 MB
nyc-tlc/trip data/yellow_tripdata_2019-06.csv, Size: 643.49 MB
nyc-tlc/trip data/yellow_tripdata_2019-07.csv, Size: 584.39 MB
nyc-tlc/trip data/yellow_tripdata_2019-08.csv, Size: 562.39 MB
nyc-tlc/trip data/yellow_tripdata_2019-09.csv, Size: 608.97 MB
nyc-tlc/trip data/yellow_tripdata_2019-10.csv, Size: 669.17 MB
nyc-tlc/trip data/yellow_tripdata_2019-11.csv, Size: 637.81 MB
nyc-tlc/trip data/yellow_tripdata_2019-12.csv, Size: 639.11 MB

Total size: 7.8 GB


<br>

Files end up taking more space when loaded into memory, but let's see if this will fit on our Jupyter Server. We can loop through the files and concatenate them into one DataFrame. Watch memory utilization as the loop runs!

In [19]:
def load_csv(file):
    df = pd.read_csv(
        s3.open(file, mode='rb'),
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime']
    )
    return df

In [None]:
%%time

dfs = []
for f in files:
    print(f)
    df = load_csv(f)
    print(f'{len(df)} rows, {df.memory_usage(deep=True).sum() / 1e6} MB')
    dfs.append(df)
taxi_big = pd.concat(dfs)

![explosion](https://media4.giphy.com/media/13d2jHlSlxklVe/giphy.gif)

Oh no! Looks like we have enough memory to load the CSV files individually, but not to concatenate them together into one DataFrame.
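
A back-of-the-envelope estimate suggests why. The sketch below reuses numbers printed earlier in this notebook: the January file is 687.09 MB on disk and about 1487.55 MB in memory, and the twelve files total 7.8 GB on disk. It assumes every month expands by roughly the same factor as January, and that the concatenation briefly needs room for both the individual DataFrames and the combined copy.

```python
# Figures taken from earlier cell outputs in this notebook.
jan_disk_mb = 687.09        # January CSV size on S3
jan_memory_mb = 1487.551776  # January DataFrame size in memory
total_disk_gb = 7.8         # all twelve 2019 CSV files

# Assumption: every month expands like January when loaded into pandas.
expansion = jan_memory_mb / jan_disk_mb
est_memory_gb = total_disk_gb * expansion

# Assumption: during pd.concat, the list of monthly DataFrames and the
# new combined DataFrame exist at the same time.
est_concat_peak_gb = 2 * est_memory_gb

print(f"In-memory expansion factor: {expansion:.2f}x")
print(f"Estimated size of all months in memory: {est_memory_gb:.1f} GB")
print(f"Estimated peak during concatenation: {est_concat_peak_gb:.1f} GB")
```

Under these assumptions the full year needs roughly 17 GB just to hold in memory, and about twice that at the moment of concatenation.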

--- 

## Train model with large dataset

We are stuck here: training this way requires loading the full dataset into memory, and it doesn't fit.

Think about all the data we're missing out on. All those observations, all those models that will never get a chance!

<img src="https://media0.giphy.com/media/k61nOBRRBMxva/giphy.gif" width="400" alt="crying" />

# There must be a better way!

<img src="https://docs.dask.org/en/latest/_images/dask_horizontal_no_pad.svg" width="300" alt="dask" />

With Dask, we can scale out to a cluster to address these problems. 

Move on to [03-dask-basics.ipynb](../03-dask-basics.ipynb) to get started!