### ML with Dask

In the following tasks, you'll continue working with the Credit Card Fraud Detection dataset from Kaggle. Before starting, load the dataset using Dask.

In [1]:
import pandas as pd
import joblib

import dask.dataframe as dd
from dask.distributed import Client, progress
from dask_ml.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

Client: Scheduler: tcp://127.0.0.1:51812, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 8, Memory: 8.00 GB


In [3]:
ccfraud = dd.read_csv('data/creditcard.csv', dtype={'Time': 'float64'})
ccfraud.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Build many models

In this task, you'll train several scikit-learn machine learning models using Dask as the joblib backend. This time, use all the variables except Class as your feature set. The Class variable is your target.

In [4]:
X = ccfraud.drop(columns=['Class'])
y = ccfraud['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=36)
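The model cells that follow all share one pattern: a scikit-learn estimator is fitted inside `joblib.parallel_backend('dask')`, so that any joblib-parallel work the estimator performs is sent to the Dask workers. A minimal sketch of that pattern on synthetic data is shown here. The in-process client settings (`processes=False`, `dashboard_address=None`), the `make_classification` data, and the names `demo_client`, `forest`, `X_demo` and `y_demo` are assumptions for illustration, not the notebook's own setup.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fraud data (assumption, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, random_state=0
)

# A small in-process cluster keeps the sketch lightweight;
# the notebook itself uses four worker processes.
demo_client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,
)

try:
    forest = RandomForestClassifier(n_estimators=20, random_state=0)

    # Inside this context, joblib dispatches its tasks to the Dask cluster.
    with joblib.parallel_backend('dask'):
        forest.fit(X_demo, y_demo)
finally:
    demo_client.close()

print(len(forest.estimators_))
```

The backend only helps estimators that parallelise through joblib. `GradientBoostingClassifier` builds its trees sequentially and has no `n_jobs` parameter, so the context manager is unlikely to shorten its training time.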



### Logistic Regression

In [5]:
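model = LogisticRegression()

# scikit-learn fits in memory, so materialise the Dask collections first.
with joblib.parallel_backend('dask'):
    model.fit(X_train.compute(), y_train.compute())

# Predict on a DataFrame too, so the feature names match those seen in fit.
y_pred = model.predict(X_test.compute())

# roc_auc_score is documented as (y_true, y_score); predictions are passed
# first here, which is the argument order that produced the score below.
display(roc_auc_score(y_pred, y_test.compute()))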

0.8537403330793096

### GB Classifier

In [6]:
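model = GradientBoostingClassifier()

# scikit-learn fits in memory, so materialise the Dask collections first.
with joblib.parallel_backend('dask'):
    model.fit(X_train.compute(), y_train.compute())

# Predict on a DataFrame too, so the feature names match those seen in fit.
y_pred = model.predict(X_test.compute())

# roc_auc_score is documented as (y_true, y_score); predictions are passed
# first here, which is the argument order that produced the score below.
display(roc_auc_score(y_pred, y_test.compute()))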

0.9537512888143252

### RF Classifier

In [7]:
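model = RandomForestClassifier()

# scikit-learn fits in memory, so materialise the Dask collections first.
with joblib.parallel_backend('dask'):
    model.fit(X_train.compute(), y_train.compute())

# Predict on a DataFrame too, so the feature names match those seen in fit.
y_pred = model.predict(X_test.compute())

# roc_auc_score is documented as (y_true, y_score); predictions are passed
# first here, which is the argument order that produced the score below.
display(roc_auc_score(y_pred, y_test.compute()))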

0.977054804150381

### Compare the results of your models

<span style="color:blue">Random forest performs best on out-of-sample data (ROC AUC 0.977), followed by gradient boosting (0.954) and logistic regression (0.854). RF and GB take significantly longer to train, even with the parallel backend, while logistic regression takes significantly less time and still seems to perform well enough out of sample. We're only using the base models, so hyperparameter tuning could get the logistic regression to perform better.</span>
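A comparison like the one above can be made more systematic by timing each fit and scoring every model the same way. The sketch below does this on synthetic, imbalanced data. It is not the notebook's procedure: the `make_classification` data, the stratified split, the reduced `n_estimators`, `max_iter=1000`, and the use of `predict_proba` scores with `roc_auc_score(y_true, y_score)` are all assumptions chosen for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=36
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=36
)

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'GradientBoostingClassifier': GradientBoostingClassifier(
        n_estimators=50, random_state=36
    ),
    'RandomForestClassifier': RandomForestClassifier(
        n_estimators=50, random_state=36
    ),
}

results = {}
for name, estimator in candidates.items():
    # Time only the training step.
    start = time.perf_counter()
    estimator.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    # Score with the predicted probability of the positive class,
    # passing the true labels first.
    positive_probability = estimator.predict_proba(X_te)[:, 1]
    results[name] = {
        'fit_seconds': fit_seconds,
        'roc_auc': roc_auc_score(y_te, positive_probability),
    }

for name, metrics in results.items():
    print(f"{name}: AUC={metrics['roc_auc']:.3f}, "
          f"fit={metrics['fit_seconds']:.2f}s")
```

Scoring on probabilities rather than hard 0/1 predictions lets ROC AUC reflect how well each model ranks the positive class, which can change the ordering of the models on heavily imbalanced data.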