# Kaggle Competition : Microsoft Malware Data
> This is Sungryong Hong's notebook.  

> I have a stand-alone Spark (2.3.2) / Hadoop (2.8.3) cluster with 48 logical cores and 150 GB of memory. 

> I have put the data files on my HDFS. Check the contents with `hfs -cat /data/spark/msmalware/test.csv | head`.  

> `hfs` is an alias for `hdfs dfs`. 



## 1. Import Basic Packages

In [1]:
# Basic Libraries 
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree
import gc

pd.set_option('display.max_rows', 500)

# plot settings
plt.rc('font', family='serif') 
plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

#### Spark-Shell Session 

In [2]:
# Basic PySpark Libraries

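# Old style: SparkContext / SQLContext
#from pyspark import SparkContext
from pyspark.sql import SQLContext  # explicit import so that `sqlsc` below does not rely on the shell's pre-imports


# New style: SparkSession
# Shell mode: the SparkSession is already available as `spark`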

sc = spark.sparkContext
sqlsc = SQLContext(sc)
sc.setCheckpointDir("hdfs://master:54310/tmp/spark/checkpoints")

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W

In [3]:
# Enable Arrow to speed up data transfer between Spark and Python (e.g. toPandas)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set('spark.debug.maxToStringFields',50)

## 2. Read `mldata`

In [4]:
import pyarrow as pa
import pyarrow.parquet as pq

In [5]:
mldata = sqlsc.read.parquet('hdfs://master:54310/data/spark/msmalware/mldata.parquet.snappy')

In [6]:
print(mldata.columns)

['MachineIdentifier', 'IsBeta', 'IsSxsPassiveMode', 'HasTpm', 'CountryIdentifier', 'LocaleEnglishNameIdentifier', 'OsBuild', 'OsSuite', 'AutoSampleOptIn', 'Census_HasOpticalDiskDrive', 'Census_OSBuildNumber', 'Census_OSBuildRevision', 'Census_OSUILocaleIdentifier', 'Census_IsPortableOperatingSystem', 'Census_IsSecureBootEnabled', 'Census_IsTouchEnabled', 'Census_IsPenCapable', 'HasDetections', 'ProductName_indexed_onehot', 'EngineVersion_indexed_onehot', 'AppVersion_indexed_onehot', 'AvSigVersion_indexed_onehot', 'RtpStateBitfield_indexed_onehot', 'DefaultBrowsersIdentifier_indexed_onehot', 'AVProductStatesIdentifier_indexed_onehot', 'AVProductsInstalled_indexed_onehot', 'AVProductsEnabled_indexed_onehot', 'CityIdentifier_indexed_onehot', 'OrganizationIdentifier_indexed_onehot', 'GeoNameIdentifier_indexed_onehot', 'Platform_indexed_onehot', 'Processor_indexed_onehot', 'OsVer_indexed_onehot', 'OsPlatformSubRelease_indexed_onehot', 'OsBuildLab_indexed_onehot', 'SkuEdition_indexed_onehot'

In [7]:
len(mldata.columns)

84

## 4. Run MLs

In [8]:
mldata.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
+--------------------+
only showing top 5 rows
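
Each `features` entry is printed in Spark's sparse-vector notation `(size,[indices],[values])`: the vector has 383218 slots, and only the listed positions hold non-zero values. A minimal pure-Python sketch of that layout follows. The function name `sparse_to_dense` and the toy vector are illustrative assumptions, not values taken from `mldata`.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a plain list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical toy vector with 8 slots and 3 non-zero entries
toy = (8, [2, 3, 5], [1.0, 4.0, 1.0])
dense = sparse_to_dense(*toy)
print(dense)
```

With one-hot encoded columns almost every slot is zero, which is why the sparse form is the practical one at this dimensionality.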



#### Tiny Sample

In [9]:
%%time
mltiny = mldata.sample(False,0.001,seed=777)

CPU times: user 1.19 ms, sys: 873 µs, total: 2.06 ms
Wall time: 27.5 ms


In [10]:
mltiny.cache()

DataFrame[MachineIdentifier: string, IsBeta: int, IsSxsPassiveMode: int, HasTpm: int, CountryIdentifier: int, LocaleEnglishNameIdentifier: int, OsBuild: int, OsSuite: int, AutoSampleOptIn: int, Census_HasOpticalDiskDrive: int, Census_OSBuildNumber: int, Census_OSBuildRevision: int, Census_OSUILocaleIdentifier: int, Census_IsPortableOperatingSystem: int, Census_IsSecureBootEnabled: int, Census_IsTouchEnabled: int, Census_IsPenCapable: int, HasDetections: int, ProductName_indexed_onehot: vector, EngineVersion_indexed_onehot: vector, AppVersion_indexed_onehot: vector, AvSigVersion_indexed_onehot: vector, RtpStateBitfield_indexed_onehot: vector, DefaultBrowsersIdentifier_indexed_onehot: vector, AVProductStatesIdentifier_indexed_onehot: vector, AVProductsInstalled_indexed_onehot: vector, AVProductsEnabled_indexed_onehot: vector, CityIdentifier_indexed_onehot: vector, OrganizationIdentifier_indexed_onehot: vector, GeoNameIdentifier_indexed_onehot: vector, Platform_indexed_onehot: vector, Pro

In [11]:
%%time
mltiny.count()

CPU times: user 15.1 ms, sys: 7.48 ms, total: 22.5 ms
Wall time: 1min 23s


9063
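
`sample(False, fraction, seed)` samples without replacement: each row is kept independently with probability `fraction`, so the sample size is only approximately `fraction` times the row count. The call itself returns in milliseconds because Spark transformations are lazy; the work happens at the first action, here `count()`. The sketch below imitates the idea in plain Python. The helper `bernoulli_sample` and the row count are assumptions for illustration, and Python's random generator will not reproduce Spark's samples for the same seed.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]


n_rows = 100000  # hypothetical table size
sample_a = bernoulli_sample(range(n_rows), 0.01, seed=888)
sample_b = bernoulli_sample(range(n_rows), 0.01, seed=888)

# The size is close to fraction * n_rows, but not exactly equal to it
print(len(sample_a), 0.01 * n_rows)
```

Fixing the seed makes the draw repeatable, which is what makes `mltiny` and `mlsmall` stable across reruns on the same data.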

In [12]:
mltiny.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
|(383218,[2,3,4,5,...|
+--------------------+
only showing top 5 rows



#### Small Sample

In [13]:
%%time
mlsmall = mldata.sample(False,0.01,seed=888)

CPU times: user 728 µs, sys: 923 µs, total: 1.65 ms
Wall time: 14.8 ms


In [14]:
%%time
mlsmall.count()

CPU times: user 1.34 ms, sys: 949 µs, total: 2.29 ms
Wall time: 2.94 s


89452

### 4.1 Trying `LightGBM` from mmlspark

> Website : https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md

In [15]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator as BCE

In [16]:
from mmlspark import LightGBMClassifier

In [17]:
gbclassifier = LightGBMClassifier(learningRate=0.3, numIterations=100,
                                  #earlyStoppingRound=10,
                                  labelCol='HasDetections', featuresCol='features')

In [18]:
paramGrid = ParamGridBuilder().addGrid(gbclassifier.numLeaves, [30, 50]).build()
gbeval = BCE(labelCol='HasDetections',metricName='areaUnderROC')

In [19]:
gbcrossval = CrossValidator(estimator=gbclassifier, estimatorParamMaps=paramGrid,
                            evaluator=gbeval, numFolds=4) 
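
With two `numLeaves` values and `numFolds=4`, the cross-validator trains one model per (parameter set, fold) pair and then refits the best parameter set on the full input, which helps explain the long wall times below. A rough pure-Python sketch of that bookkeeping follows. The helper `kfold_indices` uses a deterministic modulo split as a simplifying assumption, whereas Spark assigns rows to folds randomly, and the 12-row table is a toy.

```python
from itertools import product


def kfold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx); fold k validates on rows with i % num_folds == k."""
    for k in range(num_folds):
        valid = [i for i in range(n_rows) if i % num_folds == k]
        train = [i for i in range(n_rows) if i % num_folds != k]
        yield train, valid


param_grid = [{"numLeaves": 30}, {"numLeaves": 50}]
num_folds = 4
n_rows = 12  # toy table size

folds = list(kfold_indices(n_rows, num_folds))
# One training run per (parameter set, fold) pair
plan = [(params, len(train), len(valid))
        for params, (train, valid) in product(param_grid, folds)]

total_fits = len(plan) + 1  # plus the final refit of the best parameter set
print(len(plan), total_fits)
```

Each entry of `avgMetrics` is then the metric averaged over the folds for one parameter set, in the order of the grid.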

#### For `mltiny`

In [20]:
%%time
gbcvmodeltiny  = gbcrossval.fit(mltiny)

CPU times: user 709 ms, sys: 202 ms, total: 912 ms
Wall time: 10min 51s


In [21]:
print("trained LightGBM :%s" % gbcvmodeltiny) # CrossValidatorModel; its bestModel is the best parameter set from the grid, refit on the full input 

trained LightGBM :CrossValidatorModel_4146bf1865d951917d8a


In [22]:
gbcvmodeltiny.avgMetrics

[1.0, 1.0]

In [23]:
# display the CV score of the first grid point (numLeaves=30), averaged over the folds
auc_roc = gbcvmodeltiny.avgMetrics[0]
print("AUC ROC = %g" % auc_roc)
gini = (2 * auc_roc - 1)
print("GINI ~=%g" % gini)

AUC ROC = 1
GINI ~=1
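
The Gini value above is the normalized Gini coefficient, `2 * AUC - 1`, where AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. A small pure-Python sketch of both quantities follows. The function names and the four-row toy data are assumptions for illustration, and the pairwise loop is quadratic, so it only suits tiny inputs.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Normalized Gini coefficient derived from AUC."""
    return 2.0 * auc - 1.0


# Hypothetical toy labels and scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(labels, scores)
gini = gini_from_auc(auc)
print(auc, gini)
```

A random classifier gives AUC 0.5 and Gini 0, and a perfect ranking gives 1 for both.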


#### For `mlsmall`

In [24]:
%%time
gbcvmodelsmall  = gbcrossval.fit(mlsmall)

CPU times: user 2.12 s, sys: 842 ms, total: 2.96 s
Wall time: 26min 20s


In [25]:
gbcvmodelsmall.avgMetrics

[1.0, 1.0]
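
An average AUC of exactly 1.0 for every grid point, on both samples, is unusual and worth a sanity check before trusting it. One possible cause is that the label, or a copy of it, ended up inside the feature vector. This is an assumption, since the notebook does not show how `features` was assembled. The sketch below shows the kind of check meant here on plain Python rows. The helper name and the `label_copy` column are hypothetical.

```python
def columns_equal_to_label(rows, label_col):
    """Return names of non-label columns whose value equals the label in every row."""
    if not rows:
        return []
    candidates = [c for c in rows[0] if c != label_col]
    return [c for c in candidates
            if all(r[c] == r[label_col] for r in rows)]


# Hypothetical toy rows; `label_copy` imitates a leaked label
toy_rows = [
    {"HasDetections": 1, "IsBeta": 0, "label_copy": 1},
    {"HasDetections": 0, "IsBeta": 0, "label_copy": 0},
    {"HasDetections": 1, "IsBeta": 1, "label_copy": 1},
]
leaked = columns_equal_to_label(toy_rows, "HasDetections")
print(leaked)
```

On the real data the equivalent check would be to inspect which input columns went into `features` and confirm that `HasDetections` is not among them.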