### Name: M.Vineeth
### Roll No: E18CSE095
### Batch: EB03

AI and machine learning services on the cloud rely on HPC to handle frequent calls from millions of users. The questions in this lab concern applications of HPC in machine learning. The scenario: you are providing cloud HPC (similar to Azure/AWS/GCP) for training and testing machine learning models on large datasets. In this lab we implement task parallelism to train different machine learning models on the large-scale "BitcoinHeistRansomwareAddressDataset Data Set", which contains about 3M samples of Bitcoin ransomware attack data and is available at
https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset
Training and testing ML/DL models serially on large-scale datasets takes a lot of time, so we run the work in parallel.
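
Task parallelism means running independent units of work (here, one per model) at the same time instead of one after another. The sketch below illustrates the idea with Python's standard-library `concurrent.futures` rather than the joblib calls used later in this notebook; `train_task` and its sleep-based workload are hypothetical stand-ins for fitting a model.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def train_task(name, seconds):
    """Hypothetical stand-in for fitting one model: waits, then reports."""
    time.sleep(seconds)
    return f"{name} done"

# three independent tasks, one per (hypothetical) model
tasks = [("random_forest", 0.2), ("extra_trees", 0.2), ("decision_tree", 0.2)]

# serial: each task starts only after the previous one has finished
start = time.perf_counter()
serial_results = [train_task(name, secs) for name, secs in tasks]
serial_time = time.perf_counter() - start

# task-parallel: all three tasks are submitted at once to a pool of workers
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(train_task, name, secs) for name, secs in tasks]
    parallel_results = [f.result() for f in futures]
parallel_time = time.perf_counter() - start

print(f"serial:   {serial_time:.2f} s")
print(f"parallel: {parallel_time:.2f} s")
```

Threads are enough here because sleeping releases the interpreter lock; CPU-bound Python code usually needs separate processes to see a similar gain.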

In [1]:
# loading the csv file
import numpy as np
import pandas as pd

data = pd.read_csv("BitcoinHeistData.csv")
data.shape

(2916697, 10)

In [2]:
data.head(10)

Unnamed: 0,address,year,day,length,weight,count,looped,neighbors,income,label
0,111K8kZAEnJg245r2cM6y9zgJGHZtJPy6,2017,11,18,0.008333333,1,0,2,100050000.0,princetonCerber
1,1123pJv8jzeFQaCV4w644pzQJzVWay2zcA,2016,132,44,0.0002441406,1,0,1,100000000.0,princetonLocky
2,112536im7hy6wtKbpH1qYDWtTyMRAcA2p7,2016,246,0,1.0,1,0,2,200000000.0,princetonCerber
3,1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7,2016,322,72,0.00390625,1,0,2,71200000.0,princetonCerber
4,1129TSjKtx65E35GiUo4AYVeyo48twbrGX,2016,238,144,0.07284841,456,0,1,200000000.0,princetonLocky
5,112AmFATxzhuSpvtz1hfpa3Zrw3BG276pc,2016,96,144,0.084614,2821,0,1,50000000.0,princetonLocky
6,112E91jxS2qrQY1z78LPWUWrLVFGqbYPQ1,2016,225,142,0.002088519,881,0,2,100000000.0,princetonCerber
7,112eFykaD53KEkKeYW9KW8eWebZYSbt2f5,2016,324,78,0.00390625,1,0,2,100990000.0,princetonCerber
8,112FTiRdJjMrNgEtd4fvdoq3TC33Ah5Dep,2016,298,144,2.302828,4220,0,2,80000000.0,princetonCerber
9,112GocBgFSnaote6krx828qaockFraD8mp,2016,62,112,3.72529e-09,1,0,1,50000000.0,princetonLocky


In [3]:
classes = list(data['label'].unique())
print(classes)

['princetonCerber', 'princetonLocky', 'montrealCryptoLocker', 'montrealCryptXXX', 'paduaCryptoWall', 'montrealWannaCry', 'montrealDMALockerv3', 'montrealCryptoTorLocker2015', 'montrealSamSam', 'montrealFlyper', 'montrealNoobCrypt', 'montrealDMALocker', 'montrealGlobe', 'montrealEDA2', 'paduaKeRanger', 'montrealVenusLocker', 'montrealXTPLocker', 'paduaJigsaw', 'montrealGlobev3', 'montrealJigSaw', 'montrealXLockerv5.0', 'montrealXLocker', 'montrealRazy', 'montrealCryptConsole', 'montrealGlobeImposter', 'montrealSam', 'montrealComradeCircle', 'montrealAPT', 'white']


In [4]:
print("No. of classes: ", len(classes))

No. of classes:  29


In [5]:
for i in ["year", "looped"]:
    print(data[i].unique())

[2017 2016 2013 2014 2015 2012 2011 2018]
[    0     1  3283 ... 11938  7168  8173]


In [6]:
# let's encode the label column

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data.label = le.fit_transform(data.label.values)
data.head(10)

Unnamed: 0,address,year,day,length,weight,count,looped,neighbors,income,label
0,111K8kZAEnJg245r2cM6y9zgJGHZtJPy6,2017,11,18,0.008333333,1,0,2,100050000.0,26
1,1123pJv8jzeFQaCV4w644pzQJzVWay2zcA,2016,132,44,0.0002441406,1,0,1,100000000.0,27
2,112536im7hy6wtKbpH1qYDWtTyMRAcA2p7,2016,246,0,1.0,1,0,2,200000000.0,26
3,1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7,2016,322,72,0.00390625,1,0,2,71200000.0,26
4,1129TSjKtx65E35GiUo4AYVeyo48twbrGX,2016,238,144,0.07284841,456,0,1,200000000.0,27
5,112AmFATxzhuSpvtz1hfpa3Zrw3BG276pc,2016,96,144,0.084614,2821,0,1,50000000.0,27
6,112E91jxS2qrQY1z78LPWUWrLVFGqbYPQ1,2016,225,142,0.002088519,881,0,2,100000000.0,26
7,112eFykaD53KEkKeYW9KW8eWebZYSbt2f5,2016,324,78,0.00390625,1,0,2,100990000.0,26
8,112FTiRdJjMrNgEtd4fvdoq3TC33Ah5Dep,2016,298,144,2.302828,4220,0,2,80000000.0,26
9,112GocBgFSnaote6krx828qaockFraD8mp,2016,62,112,3.72529e-09,1,0,1,50000000.0,27


In [7]:
# label encode the address values

le = LabelEncoder()
data["address"] = le.fit_transform(data["address"].values)
data.head(10)

Unnamed: 0,address,year,day,length,weight,count,looped,neighbors,income,label
0,23,2017,11,18,0.008333333,1,0,2,100050000.0,26
1,128,2016,132,44,0.0002441406,1,0,1,100000000.0,27
2,169,2016,246,0,1.0,1,0,2,200000000.0,26
3,217,2016,322,72,0.00390625,1,0,2,71200000.0,26
4,293,2016,238,144,0.07284841,456,0,1,200000000.0,27
5,335,2016,96,144,0.084614,2821,0,1,50000000.0,27
6,437,2016,225,142,0.002088519,881,0,2,100000000.0,26
7,1194,2016,324,78,0.00390625,1,0,2,100990000.0,26
8,477,2016,298,144,2.302828,4220,0,2,80000000.0,26
9,518,2016,62,112,3.72529e-09,1,0,1,50000000.0,27


In [8]:
# scale the income values

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data["income"] = scaler.fit_transform(np.array(data["income"]).reshape(-1, 1))
data.head()

Unnamed: 0,address,year,day,length,weight,count,looped,neighbors,income,label
0,23,2017,11,18,0.008333,1,0,2,1.401999e-06,26
1,128,2016,132,44,0.000244,1,0,1,1.400998e-06,27
2,169,2016,246,0,1.0,1,0,2,3.402425e-06,26
3,217,2016,322,72,0.003906,1,0,2,8.245876e-07,26
4,293,2016,238,144,0.072848,456,0,1,3.402425e-06,27


In [9]:
# sanity check 

print(np.min(data["income"]))
print(np.max(data["income"]))

0.0
1.0


In [10]:
# defining input and output

X = data.drop(columns = ['label']).values
y = data['label'].values

print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (2916697, 9)
y shape:  (2916697,)


In [11]:
# splitting the data into 50-50 (train-test-split)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, shuffle = True)

print("X train shape: ", X_train.shape)
print("y train shape: ", y_train.shape)
print("X test shape: ", X_test.shape)
print("y test shape: ", y_test.shape)

X train shape:  (1458348, 9)
y train shape:  (1458348,)
X test shape:  (1458349, 9)
y test shape:  (1458349,)


In [31]:
%%time
# fitting baseline model-1 : Random Forest

from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators = 50, max_depth = 4, n_jobs = -1)
clf_rf.fit(X_train, y_train)

Wall time: 15.5 s


RandomForestClassifier(max_depth=4, n_estimators=50, n_jobs=-1)

In [32]:
%%time
from sklearn.metrics import accuracy_score, classification_report

# accuracy on the training set
accuracy_score(y_train, clf_rf.predict(X_train))

Wall time: 6.62 s


0.9858312282116477

In [33]:
%%time
# accuracy on the testing set
accuracy_score(y_test, clf_rf.predict(X_test))

Wall time: 6.37 s


0.9857715814252966

In [34]:
%%time
# fitting baseline model-2 : Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier

clf_et = ExtraTreesClassifier(n_estimators = 50, max_depth = 4, n_jobs = -1)
clf_et.fit(X_train, y_train)

Wall time: 7.06 s


ExtraTreesClassifier(max_depth=4, n_estimators=50, n_jobs=-1)

In [37]:
%%time
# accuracy on the training set
accuracy_score(y_train, clf_et.predict(X_train))

Wall time: 6.25 s


0.9858312282116477

In [38]:
%%time
# accuracy on the testing set
accuracy_score(y_test, clf_et.predict(X_test))

Wall time: 6.66 s


0.9857715814252966

In [39]:
%%time
# fitting baseline model-3: Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier()
clf_dt.fit(X_train, y_train)

Wall time: 11.6 s


DecisionTreeClassifier()

In [40]:
%%time
# accuracy on the training set
accuracy_score(y_train, clf_dt.predict(X_train))

Wall time: 355 ms


1.0

In [41]:
%%time
# accuracy on the testing set
accuracy_score(y_test, clf_dt.predict(X_test))

Wall time: 349 ms


0.9795885621343039
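
The random forest and extra-trees scores above are identical to many decimal places, which can happen when shallow models predict one dominant class for every sample. One quick way to check, sketched here with a small made-up label list rather than the real `y_train`, is to compare a model's accuracy with the share of the most frequent label. The helper name `majority_class_share` is hypothetical.

```python
from collections import Counter

def majority_class_share(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# made-up toy labels for illustration only (28 is the encoded 'white' class)
toy_labels = [28] * 97 + [26, 27, 27]

label, share = majority_class_share(toy_labels)
print(label, share)  # 28 0.97
```

On the real data the same helper could be called as `majority_class_share(y_train.tolist())`; a result close to the ensemble accuracy would suggest the ensembles add little beyond this baseline.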

## Saving the models in pickle files

In [49]:
import os
base_dir = os.getcwd()
print(base_dir)

C:\Users\vinee\Desktop\BENNETT UNIVERSITY\SEMESTER 6\ECSE302L High Performance Computing\LAB ASSIGNMENTS\Lab_03


In [51]:
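%%time
import pickle as pk

# create the output folder if needed, then write each fitted model;
# the with-block closes every file handle after the dump
os.makedirs(os.path.join(base_dir, "models"), exist_ok = True)
for name, model in [("model1", clf_rf), ("model2", clf_et), ("model3", clf_dt)]:
    with open(os.path.join(base_dir, "models", name + ".pickle"), "wb") as f:
        pk.dump(model, f)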

Wall time: 20 ms


## Running things in Parallel

In [54]:
from joblib import Parallel, delayed

def load_models():
    models = []
    for name in ("model1", "model2", "model3"):
        # the with-block closes the file once the model has been read
        with open(os.path.join(base_dir, "models", name + ".pickle"), "rb") as f:
            models.append(pk.load(f))
    
    return tuple(models)

In [56]:
%%time

model1, model2, model3 = load_models()

Parallel(4)((delayed(model1.fit)(X_train, y_train), 
             delayed(model2.fit)(X_train, y_train),
             delayed(model3.fit)(X_train, y_train)))

Wall time: 28.7 s


[RandomForestClassifier(max_depth=4, n_estimators=50, n_jobs=-1),
 ExtraTreesClassifier(max_depth=4, n_estimators=50, n_jobs=-1),
 DecisionTreeClassifier()]
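
One subtlety with the cell above: joblib's default backend runs tasks in separate processes, so each worker fits a pickled copy of the estimator and the fitted copy comes back in the returned list. The objects named `model1`, `model2` and `model3` in the notebook process are not themselves refitted. The sketch below imitates that round trip with `pickle` on a plain dictionary, a hypothetical stand-in for an estimator, to show why the return value is the thing to keep.

```python
import pickle

def fit_in_worker(pickled_model):
    """Imitates a worker process: unpickle a copy, 'fit' it, send it back."""
    model = pickle.loads(pickled_model)
    model["fitted"] = True
    return pickle.dumps(model)

# hypothetical stand-in for an estimator object
original = {"name": "model1", "fitted": False}

returned = pickle.loads(fit_in_worker(pickle.dumps(original)))

print(original["fitted"])  # False: the parent's object was never touched
print(returned["fitted"])  # True: only the returned copy carries the fit
```

In the notebook this would mean assigning the result, for example `model1, model2, model3 = Parallel(4)(...)`, whenever the refitted models are the ones wanted afterwards.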

In [57]:
# sequential predictions on training data
def predict_train():
    model1_train = model1.predict(X_train)
    model2_train = model2.predict(X_train)
    model3_train = model3.predict(X_train)
    
    return (model1_train, model2_train, model3_train)

In [59]:
%%time
# sequential predictions
model1_train, model2_train, model3_train = predict_train()

Wall time: 12.9 s


In [62]:
# parallel predictions on training data
def predict_train_parallel():
    model1_train, model2_train, model3_train = Parallel(16)((delayed(model1.predict)(X_train),
                                                             delayed(model2.predict)(X_train),
                                                             delayed(model3.predict)(X_train)))
    
    return (model1_train, model2_train, model3_train)

In [63]:
%%time
# parallel predictions
model1_train, model2_train, model3_train = predict_train_parallel()

Wall time: 14.2 s


In [60]:
# sequential prediction on the testing data
def predict_test():
    model1_test = model1.predict(X_test)
    model2_test = model2.predict(X_test)
    model3_test = model3.predict(X_test)
    
    return (model1_test, model2_test, model3_test)

In [61]:
%%time
# sequential operations
model1_test, model2_test, model3_test = predict_test()

Wall time: 12.7 s


In [64]:
# parallel predictions on testing data
def predict_test_parallel():
    model1_test, model2_test, model3_test = Parallel(16)((delayed(model1.predict)(X_test),
                                                          delayed(model2.predict)(X_test),
                                                          delayed(model3.predict)(X_test)))
    
    return (model1_test, model2_test, model3_test)

In [65]:
%%time
# parallel predictions
model1_test, model2_test, model3_test = predict_test_parallel()

Wall time: 13 s
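
The timings recorded above can be summarised as a speedup, the serial wall time divided by the parallel wall time, where a value below 1 means the parallel run was slower. The short calculation below plugs in the wall times from the cells above. A plausible but unverified explanation for values under 1 is that the two forests already use all cores through `n_jobs = -1`, so the extra worker processes mainly add the cost of copying the data.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial to parallel wall time; above 1 means parallel was faster."""
    return serial_seconds / parallel_seconds

# wall times copied from the notebook outputs above
train_speedup = speedup(12.9, 14.2)
test_speedup = speedup(12.7, 13.0)

print(f"train predictions: {train_speedup:.2f}x")  # 0.91x
print(f"test predictions:  {test_speedup:.2f}x")   # 0.98x
```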


## Performing Max Voting
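
Max (hard) voting assigns each sample the label that the most models predicted. As a minimal, library-free sketch of the rule, using made-up predictions from three hypothetical models rather than the fitted ones in this notebook:

```python
from collections import Counter

def max_vote(*model_predictions):
    """Return, for each sample, the label predicted by the most models.

    Ties go to the label that appears first among the models' votes.
    """
    voted = []
    for votes in zip(*model_predictions):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# made-up predictions for five samples from three hypothetical models
preds_a = [28, 28, 26, 27, 28]
preds_b = [28, 27, 26, 28, 28]
preds_c = [26, 28, 26, 26, 28]

print(max_vote(preds_a, preds_b, preds_c))  # [28, 28, 26, 27, 28]
```

The tie-breaking rule here (first vote wins) is a choice made for this sketch; scikit-learn's `VotingClassifier` resolves ties by ascending class order instead.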

In [71]:
estimators = [('model1', model1), ('model2', model2), ('model3', model3)]
print(len(estimators))

3


In [75]:
# serial implementation
from sklearn.ensemble import VotingClassifier

def serial_voting():
    estimators = []
    estimators.extend([('model1', model1), ('model2', model2), ('model3', model3)])
    
    voting_hard = VotingClassifier(estimators = estimators, voting = 'hard')
    voting_hard.fit(X_train, y_train)
    
    y_pred = voting_hard.predict(X_train)
    
    print("accuracy score: ", accuracy_score(y_train, y_pred))

In [76]:
%%time

serial_voting()

accuracy score:  0.9858312282116477
Wall time: 54 s


In [77]:
# parallel implementation using n_jobs parameter
def parallel_voting_1():
    estimators = []
    estimators.extend([('model1', model1), ('model2', model2), ('model3', model3)])
    
    voting_hard = VotingClassifier(estimators = estimators, voting = 'hard', n_jobs = -1)
    voting_hard.fit(X_train, y_train)
    
    y_pred = voting_hard.predict(X_train)
    
    print("accuracy score: ", accuracy_score(y_train, y_pred))

In [78]:
%%time

parallel_voting_1()

accuracy score:  0.9858312282116477
Wall time: 48.8 s


In [86]:
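# parallel implementation using task parallelism
def parallel_voting_2():
    # predictions only: the models are already fitted, so unlike VotingClassifier
    # nothing is retrained here and the wall time is not a like-for-like comparison
    p1, p2, p3 = Parallel(4)((delayed(model1.predict)(X_train),
                              delayed(model2.predict)(X_train),
                              delayed(model3.predict)(X_train)))
    # hard vote with three voters: if model2 and model3 agree they win,
    # otherwise model1 either forms the majority or breaks the three-way tie
    y_pred = np.where(p2 == p3, p2, p1)
    print("accuracy score: ", accuracy_score(y_train, y_pred))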

In [87]:
%%time

parallel_voting_2()

accuracy score:  0.9858312282116477
Wall time: 13.5 s
