# Kaggle Competition : Microsoft Malware Data
> This is Sungryong Hong's notebook.  

> I have a stand-alone Spark (2.3.2) / Hadoop (2.8.3) cluster with 48 logical cores and 150 GB of memory. 

> I have put the data files on my HDFS. Check the contents with `hfs -cat /data/spark/msmalware/test.csv | head`.  

>`hfs` is an alias for `hdfs dfs`. 



## 1. Import Basic Packages

In [1]:
# Basic Libraries 
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree
import gc

pd.set_option('display.max_rows', 500)

# plot settings
plt.rc('font', family='serif') 
plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

#### Spark-Shell Session 

In [2]:
# Basic PySpark Libraries

# Old Style : SparkContext 
#from pyspark import SparkContext   
#from pyspark.sql import SQLContext


# New Style : Spark Session  
#Shell-Mode: Spark Session Name is `spark`

sc = spark.sparkContext
sqlsc = SQLContext(sc)
sc.setCheckpointDir("hdfs://master:54310/tmp/spark/checkpoints")

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W

In [3]:
# Enable Arrow to speed up Python-side work (e.g. `toPandas()` conversions)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set('spark.debug.maxToStringFields',50)

## 2. Read `mldata`

In [4]:
import pyarrow as pa
import pyarrow.parquet as pq

In [5]:
mldata = sqlsc.read.parquet('hdfs://master:54310/data/spark/msmalware/mldata.parquet.snappy')

In [6]:
print(mldata.columns)

['MachineIdentifier', 'IsBeta', 'IsSxsPassiveMode', 'HasTpm', 'CountryIdentifier', 'LocaleEnglishNameIdentifier', 'OsBuild', 'OsSuite', 'AutoSampleOptIn', 'Census_HasOpticalDiskDrive', 'Census_OSBuildNumber', 'Census_OSBuildRevision', 'Census_OSUILocaleIdentifier', 'Census_IsPortableOperatingSystem', 'Census_IsSecureBootEnabled', 'Census_IsTouchEnabled', 'Census_IsPenCapable', 'HasDetections', 'ProductName_indexed_onehot', 'EngineVersion_indexed_onehot', 'AppVersion_indexed_onehot', 'AvSigVersion_indexed_onehot', 'RtpStateBitfield_indexed_onehot', 'DefaultBrowsersIdentifier_indexed_onehot', 'AVProductStatesIdentifier_indexed_onehot', 'AVProductsInstalled_indexed_onehot', 'AVProductsEnabled_indexed_onehot', 'CityIdentifier_indexed_onehot', 'OrganizationIdentifier_indexed_onehot', 'GeoNameIdentifier_indexed_onehot', 'Platform_indexed_onehot', 'Processor_indexed_onehot', 'OsVer_indexed_onehot', 'OsPlatformSubRelease_indexed_onehot', 'OsBuildLab_indexed_onehot', 'SkuEdition_indexed_onehot'

In [7]:
len(mldata.columns)

84

## 4. Run MLs

In [8]:
mldata.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
+--------------------+
only showing top 5 rows



#### Tiny Sample

In [9]:
%%time
mltiny = mldata.sample(False,0.01,seed=777)

CPU times: user 1.21 ms, sys: 1.33 ms, total: 2.54 ms
Wall time: 26 ms


In [10]:
mltiny.cache()

DataFrame[MachineIdentifier: string, IsBeta: int, IsSxsPassiveMode: int, HasTpm: int, CountryIdentifier: int, LocaleEnglishNameIdentifier: int, OsBuild: int, OsSuite: int, AutoSampleOptIn: int, Census_HasOpticalDiskDrive: int, Census_OSBuildNumber: int, Census_OSBuildRevision: int, Census_OSUILocaleIdentifier: int, Census_IsPortableOperatingSystem: int, Census_IsSecureBootEnabled: int, Census_IsTouchEnabled: int, Census_IsPenCapable: int, HasDetections: int, ProductName_indexed_onehot: vector, EngineVersion_indexed_onehot: vector, AppVersion_indexed_onehot: vector, AvSigVersion_indexed_onehot: vector, RtpStateBitfield_indexed_onehot: vector, DefaultBrowsersIdentifier_indexed_onehot: vector, AVProductStatesIdentifier_indexed_onehot: vector, AVProductsInstalled_indexed_onehot: vector, AVProductsEnabled_indexed_onehot: vector, CityIdentifier_indexed_onehot: vector, OrganizationIdentifier_indexed_onehot: vector, GeoNameIdentifier_indexed_onehot: vector, Platform_indexed_onehot: vector, Pro

In [11]:
%%time
mltiny.count()

CPU times: user 13 ms, sys: 6.75 ms, total: 19.8 ms
Wall time: 1min 19s


89525

In [12]:
mltiny.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
+--------------------+
only showing top 5 rows



#### Small Sample

In [13]:
%%time
mlsmall = mldata.sample(False,0.1,seed=888)

CPU times: user 765 µs, sys: 958 µs, total: 1.72 ms
Wall time: 17.4 ms


In [14]:
%%time
mlsmall.count()

CPU times: user 1.13 ms, sys: 856 µs, total: 1.99 ms
Wall time: 2.75 s


892286
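
The two sample sizes above are close to, but not exactly, 1% and 10% of the data. This is expected: `sample(False, fraction)` keeps each row independently with probability `fraction`, so the resulting count fluctuates around the target. The sketch below mimics that behaviour in plain Python on a toy range of 100,000 rows; the helper `bernoulli_sample` is a hypothetical stand-in, not Spark's implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# Toy data: 100,000 row ids instead of the real DataFrame
rows = range(100000)
tiny = bernoulli_sample(rows, 0.01, seed=777)
small = bernoulli_sample(rows, 0.1, seed=888)

# The counts land near 1,000 and 10,000 but are generally not exact
print("tiny: %d rows, small: %d rows" % (len(tiny), len(small)))
```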

### 4.1 Trying `LightGBM` from mmlspark

> Website : https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

In [15]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator as BCE

In [16]:
from mmlspark import LightGBMClassifier

In [17]:
gbclassifier = LightGBMClassifier(learningRate=0.3, numIterations=100,\
                                  #earlyStoppingRound=10,\
                                  labelCol='HasDetections',featuresCol='features')

In [18]:
paramGrid = ParamGridBuilder().addGrid(gbclassifier.numLeaves, [10, 20, 40]).build()
gbeval = BCE(labelCol='HasDetections',metricName='areaUnderROC')

In [19]:
gbcrossval = CrossValidator(estimator=gbclassifier, estimatorParamMaps=paramGrid,\
                            evaluator=gbeval,numFolds=4) 
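
To get a feel for the cost of this setup, the sketch below counts the model fits that one cross-validation run performs. It assumes the usual `CrossValidator` behaviour: one fit per fold for every grid point, plus one final refit of the best parameters on the whole dataset. The helper name `count_cv_fits` is made up for illustration.

```python
def count_cv_fits(num_grid_points, num_folds):
    """Number of training runs in a k-fold grid search.

    Assumes one fit per (grid point, fold) pair plus a final refit
    of the best grid point on the full dataset.
    """
    return num_grid_points * num_folds + 1

# 3 values of numLeaves and numFolds=4, as in the cells above
total_fits = count_cv_fits(num_grid_points=3, num_folds=4)
print("model fits: %d" % total_fits)
```

Every extra grid value therefore adds one fit per fold, which is worth keeping in mind when reading the wall times of the `fit` calls.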

#### For `mltiny`

In [20]:
%%time
gbcvmodeltiny  = gbcrossval.fit(mltiny)

CPU times: user 1.21 s, sys: 378 ms, total: 1.59 s
Wall time: 19min 52s


In [21]:
print("trained LightGBM :%s" % gbcvmodeltiny) # the CrossValidatorModel holds the best model found over the parameter grid 

trained LightGBM :CrossValidatorModel_4c69a7f410f0ed37b8ac


In [22]:
gbcvmodeltiny.avgMetrics

[1.0, 1.0, 1.0]

In [23]:
# display the CV score of the first grid point (numLeaves=10)
auc_roc = gbcvmodeltiny.avgMetrics[0]
print("AUC ROC = %g" % auc_roc)
gini = (2 * auc_roc - 1)
print("GINI ~=%g" % gini)

AUC ROC = 1
GINI ~=1
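
For reference, an AUC of 1 means that every positive row is scored above every negative row. The sketch below computes the AUC directly from that pairwise definition on toy labels and scores (all values invented for illustration) and derives the Gini coefficient the same way as the cell above. The function `auc_roc_pairwise` is a hypothetical helper, not part of Spark.

```python
def auc_roc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of rows,
    so only suitable for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Perfect separation: both positives outrank both negatives
perfect = auc_roc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
# One positive (0.4) is ranked below one negative (0.6)
imperfect = auc_roc_pairwise([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])

for name, auc in (("perfect", perfect), ("imperfect", imperfect)):
    print("%s: AUC = %g, Gini = %g" % (name, auc, 2 * auc - 1))
```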


### 4.2 Trying `RandomForestClassifier` from Spark ML

In [24]:
from pyspark.ml.classification import RandomForestClassifier

In [25]:
rfclassifier = RandomForestClassifier(labelCol='HasDetections',featuresCol='features')

In [26]:
rfparamGrid = ParamGridBuilder().addGrid(rfclassifier.numTrees, [10, 30, 60]).build()
rfeval = BCE(labelCol='HasDetections',metricName='areaUnderROC')

In [27]:
rfcrossval = CrossValidator(estimator=rfclassifier, estimatorParamMaps=rfparamGrid,\
                            evaluator=rfeval,numFolds=4) 

#### For `mltiny`

In [28]:
%%time
rfcvmodeltiny  = rfcrossval.fit(mltiny)

CPU times: user 16.6 s, sys: 8.06 s, total: 24.7 s
Wall time: 3h 29min 32s


In [29]:
rfcvmodeltiny.avgMetrics

[0.5710660200474003, 0.6549259685679865, 0.7044438195073475]

> Is `LightGBM` too good? 
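
The random forest scores are easier to read when paired with the grid values. The sketch below assumes that `avgMetrics` follows the order of the `numTrees` grid `[10, 30, 60]`, picks the best grid point, and converts its AUC to a Gini coefficient. The helper `best_grid_point` is introduced here only for illustration.

```python
# numTrees grid and the averaged AUC values reported for the random forest
num_trees_grid = [10, 30, 60]
avg_metrics = [0.5710660200474003, 0.6549259685679865, 0.7044438195073475]

def best_grid_point(grid, metrics):
    """Return the (parameter value, metric) pair with the highest metric."""
    return max(zip(grid, metrics), key=lambda pair: pair[1])

best_trees, best_auc = best_grid_point(num_trees_grid, avg_metrics)
best_gini = 2 * best_auc - 1
print("best numTrees = %d, AUC = %.4f, Gini = %.4f"
      % (best_trees, best_auc, best_gini))
```

The AUC grows with every step of the grid, so larger values of `numTrees` may still improve the score.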

### 4.3 Feature Importance 

> `LightGBM` from `mmlspark` raises errors when getting the feature importances with `model.getFeatureImportances()`. Let's look at the feature importances from `RandomForestClassifier` instead. 

In [30]:
# Convert the vector-assembled feature importances to a human-readable table
def ExtractFeatureImportance(featureImp, dataset, featuresCol):
    # "attrs" groups the feature slots by type (e.g. numeric, binary, nominal);
    # each entry is a dict with the slot index "idx" and the feature "name"
    attrs = dataset.schema[featuresCol].metadata["ml_attr"]["attrs"]
    list_extract = []
    for attr_type in attrs:
        list_extract = list_extract + attrs[attr_type]
    varlist = pd.DataFrame(list_extract)
    # look up the importance score of each slot by its index
    varlist['score'] = varlist['idx'].apply(lambda x: featureImp[x])
    return varlist.sort_values('score', ascending=False)
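
The lookup above relies on the metadata that the vector assembler attaches to the `features` column: a dict of attribute lists, one per attribute type, where each attribute carries its slot index and name. The sketch below reproduces the same flattening and ranking in plain Python on a toy metadata dict. The attribute names, indices and scores are invented for illustration, and `rank_features` is a hypothetical helper.

```python
# Toy stand-in for dataset.schema['features'].metadata["ml_attr"]["attrs"];
# the names and indices here are invented for illustration.
toy_attrs = {
    "numeric": [{"idx": 0, "name": "OsBuild"}],
    "binary": [{"idx": 1, "name": "Platform_indexed_onehot_windows10"},
               {"idx": 2, "name": "Processor_indexed_onehot_x64"}],
}
# Toy importance vector, indexed by feature slot
toy_importances = [0.2, 0.5, 0.3]

def rank_features(attrs, importances):
    """Flatten the per-type attribute lists and sort by importance."""
    flat = [attr for group in attrs.values() for attr in group]
    scored = [(attr["name"], importances[attr["idx"]]) for attr in flat]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_features(toy_attrs, toy_importances)
for name, score in ranking:
    print("%-40s %.2f" % (name, score))
```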

#### Random Forest 

In [31]:
rfFeatureImportance = ExtractFeatureImportance(rfcvmodeltiny.bestModel.featureImportances,mltiny,'features')

In [32]:
rfFeatureImportance.head(10)

        idx     name                                               score
331789  331806  Census_OSBranch_indexed_onehot_rs3_release         0.022218
383202  9       Census_OSBuildNumber                               0.021848
147511  147528  OsBuildLab_indexed_onehot_16299.15.x86fre.rs3_...  0.02095
331820  331837  Census_OSEdition_indexed_onehot_Core               0.020472
331944  331961  Census_ActivationChannel_indexed_onehot_Volume...  0.020178
148487  148504  SmartScreen_indexed_onehot_ExistsNotSet            0.016347
39712   39729   AVProductsInstalled_indexed_onehot_2               0.01574
147436  147453  Processor_indexed_onehot_x64                       0.014901
331316  331333  Census_OSVersion_indexed_onehot_10.0.17134.228     0.014107
148523  148540  Census_MDC2FormFactor_indexed_onehot_Detachable    0.013863


#### LightGBM

In [33]:
# the below command will produce some error messages; hence, I cannot get the feature importances of LightGBM
#gbcvmodeltiny.bestModel.getFeatureImportances(importance_type='split')

### 4.4 Predictions 

#### LightGBM

In [34]:
predict = gbcvmodeltiny.transform(mldata.select('features'))

In [35]:
predict.columns

['features', 'rawPrediction', 'probability', 'prediction']

In [36]:
mldata.select('MachineIdentifier','HasDetections','features')\
      .join(predict.select('features','prediction'),mldata.features==predict.features)\
      .show(5)

+--------------------+-------------+--------------------+--------------------+----------+
|   MachineIdentifier|HasDetections|            features|            features|prediction|
+--------------------+-------------+--------------------+--------------------+----------+
|e7aa60177047f9807...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       1.0|
|4c4652d8c2ec536e4...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       1.0|
|3d0a2862336a50e93...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       1.0|
|5793965ea86e97b20...|            0|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
|1c8ed5ff343f2afa1...|            0|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
+--------------------+-------------+--------------------+--------------------+----------+
only showing top 5 rows



In [37]:
resultdf = \
mldata.select('MachineIdentifier','HasDetections','features')\
      .join(predict.select('features','prediction'),mldata.features==predict.features)

In [38]:
resultdf.cache()

DataFrame[MachineIdentifier: string, HasDetections: int, features: vector, features: vector, prediction: double]

In [39]:
resultdf.count()

8939807
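
One caveat about joining on `features`: the join emits one row for every pair of rows with equal feature vectors, so if two machines share the same vector, the result has more rows than the input. The plain-Python sketch below shows the effect on three toy rows (all values invented for illustration); `join_on_features` is a hypothetical helper that imitates an inner join.

```python
# Toy rows: (MachineIdentifier, HasDetections, features); machines "a" and "b"
# share the same feature vector. All values are invented for illustration.
rows = [("a", 1, (1, 0)), ("b", 0, (1, 0)), ("c", 1, (0, 1))]
# Toy predictions keyed only by the feature vector, one per input row
predictions = [((1, 0), 1.0), ((1, 0), 1.0), ((0, 1), 1.0)]

def join_on_features(left, right):
    """Inner join pairing every left row with every right row
    that carries an equal feature vector."""
    return [(mid, label, feats, pred)
            for (mid, label, feats) in left
            for (rfeats, pred) in right
            if feats == rfeats]

joined = join_on_features(rows, predictions)
# The shared vector (1, 0) matches 2 x 2 times, so 3 rows become 5
print("input rows: %d, joined rows: %d" % (len(rows), len(joined)))
```

Since `transform` appends its prediction columns to whatever DataFrame it receives, passing `mldata.select('MachineIdentifier','HasDetections','features')` to it directly would keep the identifiers next to the predictions and make the join unnecessary.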

In [40]:
resultdf.columns

['MachineIdentifier', 'HasDetections', 'features', 'features', 'prediction']

In [41]:
resultdf.crosstab('HasDetections','prediction').show()

+------------------------+-------+-------+
|HasDetections_prediction|    0.0|    1.0|
+------------------------+-------+-------+
|                       1|      0|4463906|
|                       0|4475901|      0|
+------------------------+-------+-------+



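> I guess `LightGBM` is too good and hence overfits the given training set. Note, however, that the model was trained only on the 1% `mltiny` sample, so a perfect confusion matrix on the full data may instead point to label leakage in `features`, which is worth checking. 
> The random forest, on the other hand, cannot even overfit the data. This may hint at why gradient boosted trees are a magic word in many ML problems.  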

#### RandomForest

In [42]:
rfpredict = rfcvmodeltiny.transform(mldata.select('features'))

In [43]:
mldata.select('MachineIdentifier','HasDetections','features')\
      .join(rfpredict.select('features','prediction'),mldata.features==rfpredict.features)\
      .show(5)

+--------------------+-------------+--------------------+--------------------+----------+
|   MachineIdentifier|HasDetections|            features|            features|prediction|
+--------------------+-------------+--------------------+--------------------+----------+
|e7aa60177047f9807...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
|4c4652d8c2ec536e4...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
|3d0a2862336a50e93...|            1|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
|5793965ea86e97b20...|            0|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
|1c8ed5ff343f2afa1...|            0|(383218,[1,2,3,4,...|(383218,[1,2,3,4,...|       0.0|
+--------------------+-------------+--------------------+--------------------+----------+
only showing top 5 rows



In [44]:
rfresultdf = \
mldata.select('MachineIdentifier','HasDetections','features')\
      .join(rfpredict.select('features','prediction'),mldata.features==rfpredict.features)

In [45]:
rfresultdf.crosstab('HasDetections','prediction').show()

+------------------------+-------+-------+
|HasDetections_prediction|    0.0|    1.0|
+------------------------+-------+-------+
|                       1|2950589|1513317|
|                       0|3691198| 784703|
+------------------------+-------+-------+

