### Introduction
One of the most important problems in the [challenge to predict malware](https://www.kaggle.com/c/microsoft-malware-prediction) is finding a validation dataset that represents the test set. As many commented in the [discussions](https://www.kaggle.com/c/microsoft-malware-prediction/discussion/75087), the data in this competition differs considerably between the train and test datasets. This kernel contains some feature engineering and adversarial validation done by me and [DimitreOliveira](https://www.kaggle.com/dimitreoliveira) to deal with that problem.

#### **Table of contents**
1. [Simplification of version related features](#Simplification-of-version-related-features);
2. [Encoding](#Encoding);
3. [Adversarial Validation](#Adversarial-Validation).

In [1]:
import dask
import dask.dataframe as dd
import warnings
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score, precision_score, f1_score
import xgboost as xgb

%matplotlib inline
sns.set(style="whitegrid")
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [2]:
dtypes = {
        'MachineIdentifier':                                    'category',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'IsProtected':                                          'float16',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_SystemVolumeTotalCapacity':                     'float32',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'AvSigVersion':                                         'category',
        'OsBuildLab':                                           'category',
        'Census_OSVersion':                                     'category',
        'AppVersion':                                           'category',
        'EngineVersion':                                        'category',
        'Census_PowerPlatformRoleName':                         'category',
        'OsPlatformSubRelease':                                 'category',
        'Census_OSInstallTypeName':                             'category',
        'SkuEdition':                                           'category',
        'Census_ActivationChannel':                             'category',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'ProductName':                                          'category',
        'Platform':                                             'category',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OSArchitecture':                                'category',
        'Processor':                                            'category',
        'HasDetections':                                        'int8'
        }

label = ['HasDetections']

ids = ['MachineIdentifier']

numerical_features = ['AVProductsEnabled', 'AVProductsInstalled', 
                      'Census_ProcessorCoreCount', 'Census_SystemVolumeTotalCapacity']

binary_features = ['Census_IsAlwaysOnAlwaysConnectedCapable', 'IsProtected', 'Wdft_IsGamer']

version_features = ['AvSigVersion', 'OsBuildLab', 'Census_OSVersion', 'AppVersion', 'EngineVersion']

# < 10 categories
low_cardinality_features = ['Census_PowerPlatformRoleName', 'OsPlatformSubRelease', 
                            'Census_OSInstallTypeName', 'SkuEdition', 'Census_ActivationChannel', 
                            'Census_OSWUAutoUpdateOptionsName', 'ProductName', 
                            'Platform', 'Census_PrimaryDiskTypeName', 'Census_DeviceFamily', 
                            'Census_OSArchitecture', 'Processor']

use_columns = numerical_features + binary_features + version_features + low_cardinality_features

### Load data

I reduced the number of rows so the kernel could run on Kaggle; locally I ran it with all the data.

In [3]:
train = pd.read_csv('../input/microsoft-malware-prediction/train.csv', dtype=dtypes, usecols=(use_columns + label), nrows=1000000)
print(train.shape)

(1000000, 25)


## Simplification of version related features
### Reduce granularity on version features

In [4]:
for feature in version_features:
    if feature in ['EngineVersion']:
        train[feature] = train[feature].apply(lambda x : ".".join(x.split('.')[:3]))
    elif feature in ['OsBuildLab']:
        train[feature] = train[feature].apply(lambda x : ".".join(x.split('.')[:1]))
    else:
        train[feature] = train[feature].apply(lambda x : ".".join(x.split('.')[:2]))
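
To make the effect of the loop above concrete, here is a minimal sketch of the same truncation on single values. The helper name `truncate_version` and the sample strings are made up for illustration, and the pass-through for non-string values is an assumption about how missing entries should be treated, not something the original cell does.

```python
def truncate_version(version, n_parts):
    """Keep only the first `n_parts` dot-separated components of a version string."""
    # Missing values are not strings, so they are returned unchanged (assumption).
    if not isinstance(version, str):
        return version
    return ".".join(version.split(".")[:n_parts])

# Made-up sample values in the style of the competition columns.
print(truncate_version("1.1.15100.1", 3))      # EngineVersion keeps 3 parts -> 1.1.15100
print(truncate_version("4.18.1807.18075", 2))  # AppVersion keeps 2 parts -> 4.18
print(truncate_version("17134.1.amd64fre.rs4_release.180410-1804", 1))  # OsBuildLab keeps 1 part -> 17134
```

Fewer distinct values per column means each category is backed by more rows, which tends to make the encoding in the next section more stable.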

In [5]:
# Remove rows with NA
train.dropna(inplace=True)

## Encoding

After comparing several encoders (one-hot, hash, frequency, binary), the one with the best results was the Target Encoder.
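
For readers unfamiliar with the idea, the sketch below shows plain mean target encoding with pandas on a toy frame. It is a simplification and not the library's implementation: `ce.TargetEncoder` additionally smooths each category mean toward the global mean, so its output will differ. The helper `target_encode` and the data are hypothetical.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Platform": ["windows10", "windows10", "windows8", "windows8", "windows7"],
    "HasDetections": [1, 0, 1, 1, 0],
})

# Mean of the target for each category, learned from the training rows only.
category_means = toy.groupby("Platform")["HasDetections"].mean()
global_mean = toy["HasDetections"].mean()

def target_encode(values, means, fallback):
    """Replace each category with its target mean; unseen categories get the fallback."""
    return values.map(means).fillna(fallback)

encoded = target_encode(toy["Platform"], category_means, global_mean)
unseen = target_encode(pd.Series(["windows2016"]), category_means, global_mean)
print(encoded.tolist())  # [0.5, 0.5, 1.0, 1.0, 0.0]
print(unseen.tolist())   # [0.6]
```

The fallback to the global mean for unseen categories matters here, because the test set contains version strings that never appear in train.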

In [6]:
Y_train = train[label]
X_train = train.drop(label, axis = 1)

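# Target-encode every categorical column. fit_transform both fits and transforms,
# so a separate encoder.fit call is not needed. reset_index() adds an 'index'
# column, which is dropped again before the adversarial validation step.
encoder = ce.TargetEncoder(cols=(version_features + low_cardinality_features))
X_train = encoder.fit_transform(X_train.reset_index(), Y_train)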

### Fill missing values with mean
Any remaining missing values are filled with the column mean, since the mean is also the basis of our encoder.

In [7]:
X_train.fillna(X_train.mean(), inplace=True)
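
As a small illustration of the mean fill above, here is a sketch on a toy frame. The values are made up, and the cast to `float64` is my own precaution rather than part of the original cell: `float16` cannot represent values above 65504, so sums over many rows may overflow before the mean is taken.

```python
import numpy as np
import pandas as pd

# Toy frame with the same narrow dtype used when loading the data.
toy = pd.DataFrame({
    "AVProductsInstalled": np.array([1.0, 2.0, np.nan, 3.0], dtype="float16"),
    "IsProtected": np.array([1.0, np.nan, 0.0, 1.0], dtype="float16"),
})

# Widen the dtype first (assumption: precision matters more than memory here),
# then replace every NaN with the mean of its own column.
wide = toy.astype("float64")
filled = wide.fillna(wide.mean())
print(filled)
```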

## Adversarial Validation
Among the different techniques for avoiding overfitting, the one that best fits our problem is Adversarial Validation. A classifier is trained to tell train rows from test rows, which gives us, for each row of the train dataset, the probability that it belongs to the test dataset.

* *References: [Improve Your Model Performance using Cross Validation (in Python and R)](https://www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validation-in-python-r/)*
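
Before the full-size run, the idea can be sketched on synthetic data. This is an illustration under stated assumptions, not the code used in this kernel: the data is random, the classifier is a small `RandomForestClassifier` instead of XGBoost, rows are labelled `1` for test (the cells below label train as `1`), and scores come from out-of-fold predictions via `cross_val_predict`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)

# Synthetic stand-ins for the real data: the "test" feature is shifted on purpose.
toy_train = pd.DataFrame({"feature": rng.normal(0.0, 1.0, 500)})
toy_test = pd.DataFrame({"feature": rng.normal(1.5, 1.0, 500)})

# Label the origin of every row: 0 = train, 1 = test.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
is_test = np.r_[np.zeros(len(toy_train), dtype=int), np.ones(len(toy_test), dtype=int)]

# Out-of-fold predictions, so no row is scored by a model that was fitted on it.
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
p_test = cross_val_predict(model, combined, is_test, cv=5, method="predict_proba")[:, 1]

# AUC near 0.5 means train and test look alike; values well above 0.5 indicate a shift.
auc = roc_auc_score(is_test, p_test)

# The train rows with the highest P(test) are the candidates for a validation set.
train_scores = pd.Series(p_test[: len(toy_train)], index=toy_train.index)
val_index = train_scores.sort_values(ascending=False).index[: int(len(toy_train) * 0.2)]
print(round(auc, 2), len(val_index))
```

Scoring rows out of fold avoids rewarding rows the classifier has simply memorised, at the cost of fitting the model several times.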

In [8]:
test = dd.read_csv('../input/microsoft-malware-prediction/test.csv', dtype=dtypes, usecols=(use_columns + ids)).compute()

In [9]:
test.drop('MachineIdentifier',axis=1, inplace=True)

In [10]:
for feature in version_features:
    if feature in ['EngineVersion']:
        test[feature] = test[feature].apply(lambda x : ".".join(x.split('.')[:3]))
    elif feature in ['OsBuildLab']:
        test[feature] = test[feature].apply(lambda x : ".".join(x.split('.')[:1]))
    else:
        test[feature] = test[feature].apply(lambda x : ".".join(x.split('.')[:2]))
        
test = encoder.transform(test.reset_index())

In [11]:
test.drop('index', axis = 1, inplace = True)
X_train.drop('index', axis = 1, inplace = True)
test['is_train'] = 0
X_train['is_train'] = 1

In [17]:
df = pd.concat([X_train, test], axis = 0)
df.describe()

,ProductName,EngineVersion,AppVersion,AvSigVersion,AVProductsInstalled,AVProductsEnabled,Platform,Processor,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,Census_DeviceFamily,Census_ProcessorCoreCount,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_PowerPlatformRoleName,Census_OSVersion,Census_OSArchitecture,Census_OSInstallTypeName,Census_OSWUAutoUpdateOptionsName,Census_ActivationChannel,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,is_train
count,8806371.0,8806371.0,8806371.0,8806371.0,8782604.0,8782604.0,8806371.0,8806371.0,8806371.0,8806371.0,8806371.0,8782722.0,8806371.0,8745094.0,8806371.0,8731681.0,8806371.0,8806371.0,8806371.0,8806371.0,8806371.0,8806371.0,8716219.0,8503514.0,8806371.0
mean,0.5,0.49,0.51,0.48,,,0.5,0.5,0.5,0.5,0.5,,0.5,,0.5,363652.66,0.5,0.5,0.5,0.5,0.5,0.5,,,0.11
std,0.0,0.04,0.04,0.04,0.0,0.0,0.0,0.04,0.03,0.02,0.01,0.0,0.0,0.0,0.02,318906.34,0.03,0.0,0.04,0.02,0.02,0.03,0.0,0.0,0.31
min,0.26,0.0,0.3,0.09,1.0,0.0,0.49,0.0,0.44,0.06,0.36,0.0,0.5,1.0,0.43,0.0,0.23,0.5,0.0,0.46,0.48,0.44,0.0,0.0,0.0
25%,0.5,0.49,0.5,0.46,1.0,1.0,0.5,0.51,0.49,0.49,0.49,1.0,0.5,2.0,0.5,115799.0,0.5,0.5,0.51,0.48,0.49,0.49,0.0,0.0,0.0
50%,0.5,0.49,0.53,0.49,1.0,1.0,0.5,0.51,0.51,0.51,0.49,1.0,0.5,4.0,0.51,242814.0,0.5,0.5,0.51,0.52,0.52,0.49,0.0,0.0,0.0
75%,0.5,0.5,0.53,0.5,2.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.51,475809.0,0.52,0.5,0.51,0.52,0.52,0.5,0.0,1.0,0.0
max,0.5,0.94,0.56,0.94,6.0,5.0,0.51,0.51,0.52,0.6,0.54,1.0,0.5,224.0,0.51,95374808.0,0.67,0.5,0.51,0.53,0.52,0.59,1.0,1.0,1.0


In [14]:
df.head()

,ProductName,EngineVersion,AppVersion,AvSigVersion,AVProductsInstalled,AVProductsEnabled,Platform,Processor,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,Census_DeviceFamily,Census_ProcessorCoreCount,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_PowerPlatformRoleName,Census_OSVersion,Census_OSArchitecture,Census_OSInstallTypeName,Census_OSWUAutoUpdateOptionsName,Census_ActivationChannel,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer
0,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.51,299451.0,0.53,0.5,0.51,0.53,0.49,0.49,0.0,0.0
1,0.5,0.4,0.45,0.4,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.51,102385.0,0.5,0.5,0.51,0.52,0.49,0.49,0.0,0.0
2,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.49,1.0,0.5,4.0,0.5,113907.0,0.53,0.5,0.51,0.53,0.52,0.53,0.0,0.0
3,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.43,227116.0,0.53,0.5,0.51,0.53,0.52,0.53,0.0,0.0
4,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.49,1.0,0.5,4.0,0.51,101900.0,0.5,0.5,0.51,0.48,0.52,0.49,0.0,0.0


In [15]:
X_train.head()

,ProductName,EngineVersion,AppVersion,AvSigVersion,AVProductsInstalled,AVProductsEnabled,Platform,Processor,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,Census_DeviceFamily,Census_ProcessorCoreCount,Census_PrimaryDiskTypeName,Census_SystemVolumeTotalCapacity,Census_PowerPlatformRoleName,Census_OSVersion,Census_OSArchitecture,Census_OSInstallTypeName,Census_OSWUAutoUpdateOptionsName,Census_ActivationChannel,Census_IsAlwaysOnAlwaysConnectedCapable,Wdft_IsGamer,is_train
0,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.51,299451.0,0.53,0.5,0.51,0.53,0.49,0.49,0.0,0.0,1
1,0.5,0.4,0.45,0.4,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.51,102385.0,0.5,0.5,0.51,0.52,0.49,0.49,0.0,0.0,1
2,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.49,1.0,0.5,4.0,0.5,113907.0,0.53,0.5,0.51,0.53,0.52,0.53,0.0,0.0,1
3,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.51,1.0,0.5,4.0,0.43,227116.0,0.53,0.5,0.51,0.53,0.52,0.53,0.0,0.0,1
4,0.5,0.55,0.53,0.55,1.0,1.0,0.5,0.51,0.52,0.52,0.49,1.0,0.5,4.0,0.51,101900.0,0.5,0.5,0.51,0.48,0.52,0.49,0.0,0.0,1


In [18]:
y = df['is_train']
df.drop('is_train', axis = 1, inplace = True) 

# Xgboost parameters
xgb_params = {'learning_rate': 0.05, 
              'max_depth': 4,
              'subsample': 0.9,        
              'colsample_bytree': 0.9,
              'objective': 'binary:logistic',
              'silent': 1, 
              'n_estimators':100, 
              'gamma':1,         
              'min_child_weight':4}   
clf = xgb.XGBClassifier(**xgb_params, seed = 10)
clf.fit(df, y)

[00:26:39] Tree method is automatically selected to be 'approx' for faster speed. To use old behavior (exact greedy algorithm on single machine), set tree_method to 'exact'.


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.9, gamma=1, learning_rate=0.05, max_delta_step=0,
       max_depth=4, min_child_weight=4, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=10, silent=1,
       subsample=0.9)

In [None]:
del test, Y_train, df, y

In [21]:
X_train.drop('is_train', axis = 1, inplace = True) 

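# Column 1 of predict_proba is P(is_train == 1), because train rows were labelled 1 above.
# Sorting in descending order therefore ranks the most train-like rows first; rows that
# resemble the test set have the lowest values.
probs = clf.predict_proba(X_train)[:,1]
new_df = pd.DataFrame({'id':X_train.index, 'probs':probs})
new_df = new_df.sort_values(by = 'probs', ascending=False)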

In [36]:
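# Keep the top 20% of rows, starting from the first. Column 1 is 'probs'; the Series index holds the row ids.
val_set_ids = new_df.iloc[:int(new_df.shape[0]*0.2),1]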

In [38]:
# No header row, so the file can be read back with explicit column names.
val_set_ids.to_csv('validation_20.csv', header=False)

### To read

In [31]:
# # Adversarial validation indexes
# avi = pd.read_csv('../input/validation_20.csv', names=['indexes', 'probability'])

# # Split into train and validation
# X_train = train[~train.index.isin(avi['indexes'])]
# X_val = train[train.index.isin(avi['indexes'])]

217576   0.95
123947   0.95
276616   0.95
795739   0.95
7371     0.95
Name: probs, dtype: float32
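
The save-and-reload round trip described in this section can be sketched end to end on toy data. Everything here is illustrative: the frame, the scores and the names `toy_train`, `X_rest` and `buffer` are made up, and an in-memory buffer stands in for `validation_20.csv`.

```python
import io
import pandas as pd

# Toy stand-ins: `toy_train` has 10 rows, `scores` holds one probability per row.
toy_train = pd.DataFrame({"feature": range(10)})
scores = pd.Series([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.5, 0.0], name="probs")

# Keep the top 20% of rows by score; the Series index carries the row ids.
val_scores = scores.sort_values(ascending=False).iloc[: int(len(scores) * 0.2)]

# Write without a header so the file can be read back with explicit column names.
buffer = io.StringIO()
val_scores.to_csv(buffer, header=False)
buffer.seek(0)
avi = pd.read_csv(buffer, names=["indexes", "probability"])

# Split the original frame using the saved indexes.
X_val = toy_train[toy_train.index.isin(avi["indexes"])]
X_rest = toy_train[~toy_train.index.isin(avi["indexes"])]
print(sorted(X_val.index), len(X_rest))  # [1, 3] 8
```

The split only works if the index of the frame being split matches the index that was in place when the scores were computed, so any `reset_index` or `dropna` has to be applied identically in both places.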