&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&ensp;
[Home Page](../../START_HERE.ipynb)



# Challenge - Gene Expression Classification


### Introduction

This notebook walks through an end-to-end GPU machine learning workflow where cuDF is used for processing the data and cuML is used to train machine learning models on it. 

After completing this exercise, you will be able to use cuDF to load data from disk, combine tables, scale features, use one-hot encoding, and even write your own GPU kernels to efficiently transform feature columns. Additionally, you will learn how to pass this data to cuML and how to train ML models on it. The trained model is saved and later used for prediction.

You do not need prior experience with cuDF or cuML. Since our aim is to go from ETL to ML training, a detailed introduction is out of scope for this notebook. We recommend [Introduction to cuDF](../../CuDF/01-Intro_to_cuDF.ipynb) for additional information.

### Problem Statement:
We are trying to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using machine learning (classification) algorithms. This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. 

Here is the dataset link: https://www.kaggle.com/crawford/gene-expression.

## Here is the list of exercises and modules to work on in the lab:

- Convert the serial pandas computations to cuDF operations.
- Use cuML to accelerate the machine learning models.
- Experiment with Dask to create a cluster, distribute the data, and scale the operations.

You will start writing code from <a href='#dask1'>here</a>, but make sure you execute the data processing blocks to understand the dataset.



### 1. Data Processing

The first step is to download the dataset from the Kaggle link above and place it in the data directory used by this tutorial (the `host/data` folder). Then we import the necessary libraries.

In [3]:
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd
import sys
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
from sklearn import preprocessing 
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, auc
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
import cudf
import cupy
# import for model building
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error
from cuml.metrics.regression import r2_score
from sklearn.pipeline import make_pipeline
from sklearn import linear_model
from sklearn import model_selection, datasets
from cuml.dask.common import utils as dask_utils
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
from sklearn.ensemble import RandomForestClassifier as sklRF

NumPy Version: 1.20.3
Scikit-Learn Version: 0.24.2


We'll read the dataframe into y from the csv file, view its dimensions and observe the first 5 rows of the dataframe.

In [4]:
%%time
y = pd.read_csv('../../../data/actual.csv')
print(y.shape)
y.head()

(72, 2)
CPU times: user 2.61 ms, sys: 305 µs, total: 2.92 ms
Wall time: 2.49 ms


Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL


Let's convert our target variable categories to numbers.

In [5]:
y['cancer'].value_counts()
# Recode label to numeric
y = y.replace({'ALL':0,'AML':1})
labels = ['ALL', 'AML'] # for plotting convenience later on
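
The recoding step can also be sketched on a toy label table. The frame `toy` and the name `label_map` below are made up for illustration and are not part of the challenge data. Mapping only the `cancer` column with `Series.map` leaves the other columns untouched and turns any unexpected label into `NaN`, which makes mistakes in the label file easy to spot.

```python
import pandas as pd

# Toy labels, invented for illustration only
toy = pd.DataFrame({"patient": [1, 2, 3, 4],
                    "cancer": ["ALL", "AML", "ALL", "AML"]})

label_map = {"ALL": 0, "AML": 1}

# map() returns NaN for any value that is missing from label_map
encoded = toy["cancer"].map(label_map)
assert encoded.notna().all(), "found a label outside label_map"

toy["cancer"] = encoded.astype("int32")
print(toy["cancer"].value_counts().sort_index())
```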

Read the training and test data provided in the challenge from the data folder. View their dimensions.

In [6]:
# Import training data
df_train = pd.read_csv('../../../data/data_set_ALL_AML_train.csv')
print(df_train.shape)

# Import testing data
df_test = pd.read_csv('../../../data/data_set_ALL_AML_independent.csv')
print(df_test.shape)

(7129, 78)
(7129, 70)


Observe the first few rows of the train dataframe and the data format.

In [7]:
df_train.head()

Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


Observe the first few rows of the test dataframe and the data format.

In [8]:
df_test.head()

Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,...,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,...,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,...,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,...,-5,A,63,A,-46,A,-124,A,-81,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,328,A,295,A,276,A,182,A,...,141,A,95,A,146,A,431,A,9,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-224,A,-226,A,-211,A,-289,A,...,-256,A,-191,A,-172,A,-496,A,-294,A


As we can see, the dataset contains categorical values, but only in the columns whose names start with "call". We will not use these columns, so we remove them.

In [9]:
# Remove "call" columns from training and testing data
train_to_keep = [col for col in df_train.columns if "call" not in col]
test_to_keep = [col for col in df_test.columns if "call" not in col]

X_train_tr = df_train[train_to_keep]
X_test_tr = df_test[test_to_keep]
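
To see what this filter does, here is a small sketch on a toy frame that mimics the layout of the challenge files (one value column and one "call" column per patient). The values and the name `values_only` are invented for illustration.

```python
import pandas as pd

# Toy frame: two genes, two patients, one "call" column per patient
toy = pd.DataFrame({
    "Gene Description": ["g1", "g2"],
    "Gene Accession Number": ["a1", "a2"],
    "1": [-214, -153], "call": ["A", "A"],
    "2": [-139, -73], "call.1": ["A", "P"],
})

# Keep every column whose name does not contain the substring "call"
to_keep = [col for col in toy.columns if "call" not in col]
values_only = toy[to_keep]
print(values_only.columns.tolist())
```

The test is a substring match on the column name, so it would also drop any other column that happened to contain "call"; `col.startswith("call")` would be a stricter alternative.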

Reorder the columns by patient number with `reindex`. This makes the data easier to read and keeps the patients in the same order as the labels.

In [10]:
train_columns_titles = ['Gene Description', 'Gene Accession Number', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', 
       '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38']

X_train_tr = X_train_tr.reindex(columns=train_columns_titles)

In [11]:
test_columns_titles = ['Gene Description', 'Gene Accession Number','39', '40', '41', '42', '43', '44', '45', '46',
       '47', '48', '49', '50', '51', '52', '53',  '54', '55', '56', '57', '58', '59',
       '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']

X_test_tr = X_test_tr.reindex(columns=test_columns_titles)
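
The two title lists do not have to be typed by hand. A minimal sketch, assuming the patient columns are named with the strings "1" to "72" as in the lists above; the names `id_columns`, `train_titles`, `test_titles`, and the toy frame are illustrative only.

```python
import pandas as pd

# Patient ids as strings: 1..38 for training, 39..72 for testing
id_columns = ["Gene Description", "Gene Accession Number"]
train_titles = id_columns + [str(i) for i in range(1, 39)]
test_titles = id_columns + [str(i) for i in range(39, 73)]

# Toy frame with its patient columns out of order
toy = pd.DataFrame({"Gene Description": ["g1"], "Gene Accession Number": ["a1"],
                    "3": [30], "1": [10], "2": [20]})
ordered = toy.reindex(columns=id_columns + ["1", "2", "3"])
print(ordered.columns.tolist())
```

Note that `reindex` silently creates an all-`NaN` column for any requested name that is missing from the frame, so it is worth checking the result for missing values.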

We will take the transpose of the dataframe so that each row is a patient and each column is a gene.

In [12]:
X_train = X_train_tr.T
X_test = X_test_tr.T

print(X_train.shape) 
X_train.head()

(40, 7129)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
Gene Description,AFFX-BioB-5_at (endogenous control),AFFX-BioB-M_at (endogenous control),AFFX-BioB-3_at (endogenous control),AFFX-BioC-5_at (endogenous control),AFFX-BioC-3_at (endogenous control),AFFX-BioDn-5_at (endogenous control),AFFX-BioDn-3_at (endogenous control),AFFX-CreX-5_at (endogenous control),AFFX-CreX-3_at (endogenous control),AFFX-BioB-5_st (endogenous control),...,Transcription factor Stat5b (stat5b) mRNA,Breast epithelial antigen BA46 mRNA,GB DEF = Calcium/calmodulin-dependent protein ...,TUBULIN ALPHA-4 CHAIN,CYP4B1 Cytochrome P450; subfamily IVB; polypep...,PTGER3 Prostaglandin E receptor 3 (subtype EP3...,HMG2 High-mobility group (nonhistone chromosom...,RB1 Retinoblastoma 1 (including osteosarcoma),GB DEF = Glycophorin Sta (type A) exons 3 and ...,GB DEF = mRNA (clone 1A7)
Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41


Now clean the data: use the gene accession numbers as column names, drop the two descriptive rows, and convert the values to numeric types.

In [13]:
# Training data: use the accession numbers as column names, then drop the descriptive rows
X_train.columns = X_train.iloc[1]
X_train = X_train.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)

# Testing data: same cleanup
X_test.columns = X_test.iloc[1]
X_test = X_test.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)

print(X_train.shape)
print(X_test.shape)
X_train.head()

(38, 7129)
(34, 7129)


Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25
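
The transpose and cleanup steps can be followed end to end on a toy frame with two genes and two patients. All values and names below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene Description": ["desc A", "desc B"],
    "Gene Accession Number": ["geneA_at", "geneB_at"],
    "1": [-214, -153],
    "2": [-139, -73],
})

t = toy.T                  # rows: the two descriptive fields, then one row per patient
t.columns = t.iloc[1]      # second row holds the accession numbers
# After the transpose every column has object dtype, so convert back to numbers
t = t.drop(["Gene Description", "Gene Accession Number"]).apply(pd.to_numeric)
print(t)
```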


We now have 38 patients as rows in the training set and the other 34 as rows in the testing set. Each dataset has 7129 gene expression features. However, we have not yet associated the target labels with the right patients. Recall that all the labels are stored in a single dataframe. Let's split the labels so that they match the patients in the training and testing dataframes: the first 38 patients' cancer types go to the training set, and the rest go to the testing set.

In [14]:
X_train = X_train.reset_index(drop=True)
y_train = y[y.patient <= 38].reset_index(drop=True)

# Subset the rest for testing
X_test = X_test.reset_index(drop=True)
y_test = y[y.patient > 38].reset_index(drop=True)
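
Because the features and labels are joined only by row position after `reset_index(drop=True)`, a quick alignment check is cheap insurance. A minimal sketch on toy labels, where `n_train` stands in for the 38 training patients and all values are invented:

```python
import pandas as pd

# Toy labels for six patients; patients 1-4 play the role of the training set
y_toy = pd.DataFrame({"patient": [1, 2, 3, 4, 5, 6],
                      "cancer": [0, 0, 1, 1, 0, 1]})
n_train = 4

y_tr = y_toy[y_toy.patient <= n_train].reset_index(drop=True)
y_te = y_toy[y_toy.patient > n_train].reset_index(drop=True)

# Feature rows are ordered by patient id, so label row i must belong to patient i + 1
assert y_tr["patient"].tolist() == list(range(1, n_train + 1))
print(len(y_tr), len(y_te))
```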

Generate descriptive statistics to analyse the data further.

In [15]:
X_train.describe()

Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
count,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,...,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0,38.0
mean,-120.868421,-150.526316,-17.157895,181.394737,-276.552632,-439.210526,-43.578947,-201.184211,99.052632,112.131579,...,178.763158,750.842105,8.815789,399.131579,-20.052632,869.052632,335.842105,19.210526,504.394737,-29.210526
std,109.555656,75.734507,117.686144,117.468004,111.004431,135.458412,219.482393,90.838989,83.178397,211.815597,...,84.82683,298.008392,77.108507,469.579868,42.346031,482.366461,209.826766,31.158841,728.744405,30.851132
min,-476.0,-327.0,-307.0,-36.0,-541.0,-790.0,-479.0,-463.0,-82.0,-215.0,...,30.0,224.0,-178.0,36.0,-112.0,195.0,41.0,-50.0,-2.0,-94.0
25%,-138.75,-205.0,-83.25,81.25,-374.25,-547.0,-169.0,-239.25,36.0,-47.0,...,120.0,575.5,-42.75,174.5,-48.0,595.25,232.75,8.0,136.0,-42.75
50%,-106.5,-141.5,-43.5,200.0,-263.0,-426.5,-33.5,-185.5,99.5,70.5,...,174.5,700.0,10.5,266.0,-18.0,744.5,308.5,20.0,243.5,-26.0
75%,-68.25,-94.75,47.25,279.25,-188.75,-344.75,79.0,-144.75,152.25,242.75,...,231.75,969.5,57.0,451.75,9.25,1112.0,389.5,30.25,487.25,-11.5
max,17.0,-20.0,265.0,392.0,-51.0,-155.0,419.0,-24.0,283.0,561.0,...,356.0,1653.0,218.0,2527.0,52.0,2315.0,1109.0,115.0,3193.0,36.0


Clearly there is some variation in the scales across the different features. Many machine learning models work much better with data that's on the same scale, so let's create a scaled version of the dataset.

In [16]:
X_train_fl = X_train.astype(np.float64)
X_test_fl = X_test.astype(np.float64)

# Apply the same scaling to both datasets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_fl)
X_test = scaler.transform(X_test_fl) # note that we transform rather than fit_transform
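
What the scaler computes can be written out in plain NumPy. The sketch below uses random toy data in place of `X_train` and `X_test`; it is an illustration of standardization, not a replacement for `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=50.0, size=(38, 3))  # toy stand-in for X_train
test = rng.normal(loc=100.0, scale=50.0, size=(34, 3))   # toy stand-in for X_test

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics for the test rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Reusing the training mean and standard deviation on the test rows is the reason the notebook calls `transform` rather than `fit_transform` on `X_test_fl`: the test data must not influence the fitted statistics.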

In [21]:
y_test

Unnamed: 0,patient,cancer
0,39,0
1,40,0
2,41,0
3,42,0
4,43,0
5,44,0
6,45,0
7,46,0
8,47,0
9,48,0


<a id='dask1'></a>

### 2. Conversion to cuDF Dataframe
Convert the scaled feature arrays and the label columns to cuDF objects so they can be passed to cuML.

In [22]:
#Modify the code in this cell

#%%time
X_cudf_train = cudf.DataFrame(X_train)  #Pass X train dataframe here
X_cudf_test = cudf.DataFrame(X_test)    #Pass X test dataframe here

y_cudf_train = cudf.Series(y_train['cancer'].values)  #Pass y train dataframe here
y_cudf_test = cudf.Series(y_test['cancer'].values)   #Pass y test dataframe here

In [24]:
X_cudf_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
0,-0.861496,-0.033101,-0.351701,-0.805738,-0.168417,-0.888716,1.120068,0.280962,1.86347,0.44911,...,0.074511,-0.81562,-1.758717,-0.021865,-0.405584,-0.159782,-0.033046,0.546068,-0.43582,-0.255875
1,-0.167723,1.0374,0.139139,0.876572,0.1146,0.293351,-1.322502,0.370212,0.023726,-0.182439,...,-0.11664,0.292993,-0.589006,0.092516,0.073055,-0.182892,-0.19726,-0.267043,-0.595744,0.499648
2,0.415047,1.35855,-2.495899,1.10088,-0.907912,-1.577008,0.353591,-1.849884,1.303018,-1.565148,...,1.627617,1.52403,0.317849,-0.498816,1.724361,0.565043,2.130709,0.70869,-0.384366,-0.38727
3,-0.130721,0.488768,2.429729,-1.461407,-1.300484,-1.090715,0.930757,-0.57807,-0.609828,-0.388171,...,0.731595,0.286192,2.749271,-0.485868,-2.152617,-0.508538,-0.800986,-2.251033,-0.526212,-2.029712
4,0.137537,0.341574,-0.506703,-0.115559,0.425006,1.161198,0.219688,0.8834,-0.35397,0.669195,...,-0.271951,-0.34633,0.633277,0.226322,-0.142332,-1.300593,-0.105493,-0.169469,-0.623557,0.138311
5,-0.158472,0.876825,1.999167,-0.952401,0.041564,-0.888716,0.51058,0.169399,-0.146846,0.386912,...,-0.761776,1.598844,-1.114719,-0.490184,-1.291066,-0.470721,0.024912,0.220824,-0.433039,-0.781456
6,0.452048,0.08733,2.197225,-1.090436,-1.117893,-0.836346,0.806089,0.247493,0.32832,-0.632178,...,-1.777269,0.231781,-2.455287,-0.535505,0.049123,0.569244,0.705915,-0.299567,-0.188286,-0.420119
7,-2.702307,-1.464896,0.208029,-1.582188,-2.414292,-2.624409,-1.068548,-2.920885,-0.35397,-1.345063,...,1.316996,-0.414343,-1.246147,-0.209624,1.030334,1.953764,0.532042,1.29413,0.384659,1.616508
8,1.164322,0.314812,1.060542,0.747163,0.607597,-0.716643,0.201219,0.303274,-0.914421,1.884448,...,2.117443,0.779287,-0.037007,-0.479393,0.192715,-0.233315,0.237423,3.115498,-0.362116,-0.321573
9,0.304043,0.6092,0.509423,0.324429,0.899744,1.445492,1.715704,0.593337,0.94969,0.339067,...,-1.633905,0.799691,0.225849,-0.643412,0.98247,-0.628292,0.111848,-0.332092,-0.463633,1.189473


### 3. Model Building
#### Dask Integration

We will train a random forest classifier, implemented with cuML and distributed with Dask.

#### Start Dask cluster

In [25]:
#Modify the code in this cell

# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1) #Set 1 thread per worker using arguments to cluster
c = Client(cluster) #Pass the cluster as an argument to Client

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization

NVMLError_NotFound: Not Found

#### Define Parameters

In addition to the number of examples, random forest fitting performance depends heavily on the number of columns in a dataset and (especially) on the maximum depth to which trees are allowed to grow. Lower `max_depth` values can greatly speed up fitting, though going too low may reduce accuracy.

In [28]:
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
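
A rough way to see why `max_depth` matters is to bound the size of a single tree. The helper `max_leaves` below is hypothetical and written only for this illustration; it assumes binary trees and the convention that a tree of depth d has at most 2^d leaves.

```python
def max_leaves(max_depth: int, n_samples: int) -> int:
    """Upper bound on the number of leaves of one binary decision tree.

    A tree of depth d has at most 2**d leaves, and it cannot have more
    leaves than training samples, since every leaf holds at least one sample.
    """
    return min(2 ** max_depth, n_samples)

for depth in (2, 4, 6, 12):
    print(depth, max_leaves(depth, n_samples=38))
```

With only 38 training patients, this bound on the leaf count stops growing at depth 6, although an unbalanced tree can still be deeper than that. On larger datasets the limit has a much stronger effect on fitting time.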

#### Distribute data to worker GPUs

In [29]:
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)

In [None]:
n_partitions = n_workers

def distribute(X, y):
    # First convert to cudf (with real data, you would likely load in cuDF format to start)
    X_cudf = cudf.DataFrame.from_pandas(pd.DataFrame(X))
    y_cudf = cudf.Series(y)

    # Partition with Dask
    # In this case, each worker will train on 1/n_partitions fraction of the data
    X_dask = dask_cudf.from_cudf(X_cudf, npartitions=n_partitions)
    y_dask = dask_cudf.from_cudf(y_cudf, npartitions=n_partitions)

    # Persist to cache the data in active memory
    X_dask, y_dask = \
      dask_utils.persist_across_workers(c, [X_dask, y_dask], workers=workers)
    
    return X_dask, y_dask

In [None]:
#Modify the code in this cell

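# distribute() builds a cudf.Series from y, so pass the single label column rather than the whole dataframe
X_train_dask, y_train_dask = distribute(X_train, y_train['cancer']) #Pass train data as arguments here
X_test_dask, y_test_dask = distribute(X_test, y_test['cancer']) #Pass test data as arguments here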
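
To get a feel for how little data each worker receives, the sketch below splits 38 row indices evenly. The worker count of 4 is a hypothetical value, not taken from this notebook, and an even split is only an approximation of how Dask chooses partition boundaries.

```python
import numpy as np

n_rows = 38      # training patients in this notebook
n_workers = 4    # hypothetical GPU count, chosen for illustration

# Even row split across workers
parts = np.array_split(np.arange(n_rows), n_workers)
sizes = [len(p) for p in parts]
print(sizes)
```

Since each worker trains on roughly 1/n_partitions of the rows, a dataset this small leaves only a handful of patients per worker, so distributing it is mainly useful here as practice with the API.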

#### Create the scikit-learn model

As a CPU baseline for comparison, we train scikit-learn's `RandomForestClassifier` on the same training data, using all available CPU cores.

In [59]:
%%time
max_depth = 12
n_trees = 150

# Use all available CPU cores
skl_model = sklRF(n_estimators=n_trees, max_depth=max_depth, n_jobs=-1)
skl_model.fit(X_train, y_train.iloc[:,1])

CPU times: user 206 ms, sys: 16.8 ms, total: 223 ms
Wall time: 242 ms


RandomForestClassifier(max_depth=12, n_estimators=150, n_jobs=-1)

In [63]:
import xgboost as xgb
xgr=xgb.XGBClassifier(max_depth=50,min_child_weight=0.1,gamma=0.0001)
dtrain = xgb.DMatrix(X_train, label=y_train.iloc[:,1])
dtest = xgb.DMatrix(X_test, label=y_test.iloc[:,1])



In [64]:
# instantiate params
params = {}

# general params
general_params = {}
params.update(general_params)

# booster params
booster_params = {'tree_method': 'gpu_hist'}
params.update(booster_params)

# learning task params
learning_task_params = {'objective': 'reg:squarederror'}
params.update(learning_task_params)
print(params)

{'tree_method': 'gpu_hist', 'objective': 'reg:squarederror'}


In [65]:
evallist = [(dtest, 'test'), (dtrain, 'train')]
num_round = 300

In [66]:
bst = xgb.train(params, dtrain, num_round, evallist)


[0]	test-rmse:0.39144	train-rmse:0.35744
[1]	test-rmse:0.32951	train-rmse:0.25557
[2]	test-rmse:0.29880	train-rmse:0.18276
[3]	test-rmse:0.28634	train-rmse:0.13072
[4]	test-rmse:0.28309	train-rmse:0.09351
[5]	test-rmse:0.28381	train-rmse:0.06691
[6]	test-rmse:0.28591	train-rmse:0.04788
[7]	test-rmse:0.28821	train-rmse:0.03427
[8]	test-rmse:0.29027	train-rmse:0.02453
[9]	test-rmse:0.29195	train-rmse:0.01757
[10]	test-rmse:0.29327	train-rmse:0.01258
[11]	test-rmse:0.29427	train-rmse:0.00901
[12]	test-rmse:0.29502	train-rmse:0.00645
[13]	test-rmse:0.29557	train-rmse:0.00462
[14]	test-rmse:0.29597	train-rmse:0.00331
[15]	test-rmse:0.29627	train-rmse:0.00238
[16]	test-rmse:0.29648	train-rmse:0.00170
[17]	test-rmse:0.29664	train-rmse:0.00122
[18]	test-rmse:0.29675	train-rmse:0.00088
[19]	test-rmse:0.29683	train-rmse:0.00063
[20]	test-rmse:0.29689	train-rmse:0.00045
[21]	test-rmse:0.29693	train-rmse:0.00032
[22]	test-rmse:0.29696	train-rmse:0.00023
[23]	test-rmse:0.29699	train-rmse:0.00017
[2

In [82]:
y_pred_train = bst.predict(dtrain)
y_pred_train = np.round(y_pred_train).astype(np.int32)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0], dtype=int32)

In [84]:
print("XGboost accuracy:  ", accuracy_score(y_train.iloc[:,1], y_pred_train))


XGboost accuracy:   1.0


In [85]:
y_pred_test = bst.predict(dtest)
y_pred_test = np.round(y_pred_test).astype(np.int32)
print("XGboost accuracy:  ", accuracy_score(y_test.iloc[:,1], y_pred_test))

XGboost accuracy:   0.9117647058823529
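
Because the booster above is trained with a regression objective, its predictions are real-valued scores rather than class labels. The sketch below shows one way to turn such scores into labels and compute accuracy by hand. The scores and labels are invented, and thresholding at 0.5 is a choice made for this illustration (the notebook uses `np.round`).

```python
import numpy as np

# Toy regression-style scores and true labels, invented for illustration
scores = np.array([0.02, 0.97, 0.40, 0.61, -0.10, 1.08])
y_true = np.array([0, 1, 1, 1, 0, 1])

# Threshold at 0.5 so every score maps to a valid label, 0 or 1
y_pred = (scores >= 0.5).astype(np.int32)

# Accuracy is the fraction of predictions that match the true labels
accuracy = float((y_pred == y_true).mean())
print(y_pred, accuracy)
```

A threshold and `np.round` agree for most scores, but they differ at the edges: `np.round` rounds exact halves to the nearest even integer and can produce labels outside {0, 1} when a score is far from that range.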


In [73]:
y_test.iloc[:,1]

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    1
12    1
13    1
14    1
15    1
16    0
17    0
18    1
19    1
20    0
21    1
22    1
23    1
24    1
25    1
26    1
27    1
28    0
29    0
30    0
31    0
32    0
33    0
Name: cancer, dtype: int32

In [71]:
import cupy as cp

y_pred_test = bst.predict(dtest)

# Round the regression outputs to class labels on the GPU
y_pred_cp = cp.rint(cp.asarray(y_pred_test)).astype(cp.int32)

# scikit-learn metrics need host (NumPy) arrays, so copy back explicitly
print("XGBoost accuracy:  ", accuracy_score(y_test.iloc[:,1], cp.asnumpy(y_pred_cp)))

In [34]:
from sklearn.model_selection import GridSearchCV
params_grid = {'max_depth': [10, 15, 50, 100], 'n_estimators': [500, 1000, 1500, 2000]}
grid = GridSearchCV(sklRF(), params_grid, scoring='accuracy')
grid.fit(X_train, y_train.iloc[:,1])

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [10, 15, 50, 100],
                         'n_estimators': [500, 1000, 1500, 2000]},
             scoring='accuracy')

In [35]:
grid.best_params_

{'max_depth': 10, 'n_estimators': 500}
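
The cost of this search is easy to estimate by enumerating the grid. The sketch below assumes 5-fold cross-validation, which is scikit-learn's default when `cv` is not passed (version 0.22 and later), and a final refit on the full training set; `cv_folds`, `candidates`, and `n_fits` are names chosen for this illustration.

```python
from itertools import product

params_grid = {"max_depth": [10, 15, 50, 100],
               "n_estimators": [500, 1000, 1500, 2000]}
cv_folds = 5  # assumed default number of folds

# Every combination of the listed values is one candidate model
names = list(params_grid)
candidates = [dict(zip(names, values)) for values in product(*params_grid.values())]

n_fits = len(candidates) * cv_folds + 1  # +1 for the final refit on all training rows
print(len(candidates), n_fits)
```

Each of those fits trains between 500 and 2000 trees, which explains why even a small grid can take noticeable time on the CPU.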

#### Train the distributed cuML model

In [31]:
%%time
#Modify the code in this cell (the %%time cell magic must stay on the first line)

cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask) # Pass X and y train dask data here

wait(cuml_model.rfs) # Allow asynchronous training tasks to finish


#### Predict and check accuracy

In [74]:
#Modify the code in this cell

# Note: this scores the model on the training data; predict on X_test for a held-out estimate
skl_y_pred = skl_model.predict(X_train)
#cuml_y_pred = cuml_model.predict().compute().to_array()  #Pass the X test dask data as argument here

# Due to randomness in the algorithm, you may see slight variation in accuracies
print("SKLearn accuracy:  ", accuracy_score(y_train.iloc[:,1], skl_y_pred))
#print("CuML accuracy:     ", accuracy_score())  #Pass the y test dask data  and predicted values from CuML model as argument here

SKLearn accuracy:   1.0


In [75]:
skl_y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

<a id='ex4'></a><br>

### 4. Conclusion

Let's compare the performance of our solution!

| Algorithm | Implementation | Accuracy | Time |
| ----------- | ----------- | ----------- | ----------- |


Write down your observations and compare the cuML and scikit-learn scores; they should be approximately equal. We hope you found this exercise exciting and useful for understanding RAPIDS better. Share your highest accuracy, and try the unique features of RAPIDS to accelerate your data science pipelines. Don't restrict yourself to the concepts explained so far: use the documentation to apply more models and functions and achieve the best results. Jump over to the next notebook for our sample solution.


### 5.  References



<p xmlns:dct="http://purl.org/dc/terms/">
  <a rel="license"
     href="http://creativecommons.org/publicdomain/zero/1.0/">
    <center><img src="http://i.creativecommons.org/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0"  /></center>
  </a>
 
</p>


- The dataset is licensed under a CC0: Public Domain license.

- T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286:531-537 (1999). Published: 1999.10.14.


## Licensing
  
This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).

