# Create baseline random forest model

This is a first pass at the Microsoft malware Kaggle competition. This notebook creates a baseline model using scikit-learn's random forest classifier, to give a benchmark against which future models can be measured.

We'll also use this notebook to get an initial view of feature importance in this dataset.

In [1]:
import os
import random
import feather

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier



In [2]:
data_path_training = 'data/train.csv'

The training data set has already been loaded and saved in feather format in the raw_data_to_feather notebook. This reduces the load time of the full training set from ~2 min to ~1 min. The full training set is ~7.5 GB, which is starting to put pressure on my laptop's memory. I will need to consider how to work with datasets of this size, particularly once we also want to pull in the test set.

In [3]:
#%time train_df = feather.read_dataframe(data_path_training)

### Issues with file size

The 7.5 GB size of the full training set has been causing issues on my local machine. Instead, we'll use a sample of the training data for local development and then run the full set on a cloud instance when ready.

The sampling approach comes from exp1orer's answer to this Stack Overflow question: https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame

At this stage we'll work with 1% of the training data. Note: the sampling is not seeded, so each run takes a different sample.

In [4]:
# Always keep the header (row 0); keep each data row with probability 0.01
%time train_df = pd.read_csv(data_path_training, header=0, skiprows=lambda i: i > 0 and random.random() > 0.01)

CPU times: user 22.3 s, sys: 873 ms, total: 23.1 s
Wall time: 23.2 s
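
Because the sample changes on every run, results are hard to compare between sessions. Below is a minimal sketch of a reproducible variant. The helper name `make_skiprows` and the seed value are hypothetical, and it assumes pandas calls the `skiprows` function once per row in file order, which I have not verified for every parser engine.

```python
import random


def make_skiprows(keep_fraction, seed):
    """Return a skiprows predicate that keeps roughly keep_fraction of the rows.

    make_skiprows is a hypothetical helper, not part of pandas.
    """
    rng = random.Random(seed)  # private generator, independent of the global one

    def skip(i):
        # Row 0 is the header and is never skipped; every other row is
        # skipped with probability 1 - keep_fraction.
        return i > 0 and rng.random() > keep_fraction

    return skip


# Two predicates built from the same seed make identical decisions.
first = make_skiprows(0.01, seed=42)
second = make_skiprows(0.01, seed=42)
decisions_first = [first(i) for i in range(10_000)]
decisions_second = [second(i) for i in range(10_000)]
print(decisions_first == decisions_second)
```

The predicate could then be passed in place of the lambda, for example `pd.read_csv(data_path_training, header=0, skiprows=make_skiprows(0.01, seed=42))`.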




In [5]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

## Initial overview of data

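The full training dataset contains ~9 million rows, each with 83 columns (including the MachineIdentifier and the HasDetections target). The examples are split almost exactly in half between those with and without malware detected. The outputs below come from the 1% sample, so they show ~89,000 rows.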

In [6]:
train_df.shape

(89376, 83)

In [7]:
train_df.HasDetections.value_counts()

1    44717
0    44659
Name: HasDetections, dtype: int64

In [8]:
train_df.head(2).T

| | 0 | 1 |
|---|---|---|
| MachineIdentifier | 0000d322f204595d1d90eac4cb8a6e6e | 000143e604f86e38c23c6e858c02d562 |
| ProductName | win8defender | win8defender |
| EngineVersion | 1.1.15100.1 | 1.1.15100.1 |
| AppVersion | 4.18.1807.18075 | 4.18.1807.18075 |
| AvSigVersion | 1.273.1135.0 | 1.273.529.0 |
| IsBeta | 0 | 0 |
| RtpStateBitfield | 7 | 7 |
| IsSxsPassiveMode | 0 | 0 |
| DefaultBrowsersIdentifier | | |
| AVProductStatesIdentifier | 53447 | 53447 |


## Preprocessing training data for Random Forest

All of the features need to be converted to numbers to work in the random forest. We need to identify which columns hold non-numeric values (we expect all of these to be strings) and then decide whether to one hot encode or linear encode each feature. Here "linear encode" means mapping each distinct value to an integer.

Features with few enough distinct values will be one hot encoded; the others will be passed through a linear encoder.
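
To make the two options concrete, here is a small sketch on a made-up column (the `size` column and its values are illustrative, not from the dataset). It uses `pd.get_dummies` for the one hot encoding and pandas category codes for the integer encoding; these are assumptions about tooling, and other encoders would work as well.

```python
import pandas as pd

toy = pd.DataFrame({'size': ['low', 'high', 'medium', 'low']})

# One hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(toy, columns=['size'])
print(list(one_hot.columns))

# Integer ("linear") encoding: each distinct value is mapped to one integer.
# Category codes follow the sorted order of the values unless an explicit
# order is given, so here high=0, low=1, medium=2.
codes = toy['size'].astype('category').cat.codes
print(list(codes))
```

Note that the default integer order is alphabetical, not low < medium < high, so an explicit category order would be needed if the order is meant to carry meaning.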

#### Potential problem
These distinct-value counts come from the 1% sample, so they may not hold true in the full training set, where a feature can only have the same number of distinct values or more.

#### Results from full training set
When running this code over the full training set, this was the result with a limit of 35:

"After one hot encoding features with a cardinality of less than 35 values there will be 303 features in the training set"

The full results of this test are saved in non_numerical_features_full_training_set.csv. Note that lowering the cut-off to 30 drops the number of features to 208, which is an option if the number of features becomes an issue.

In [9]:
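# Columns stored as strings (dtype "object") must be encoded before training
non_numerical_features_names = list(train_df.select_dtypes(include=["object"]).columns)

# One row per non-numerical feature, holding its number of distinct values
non_numerical_features = pd.DataFrame(index=non_numerical_features_names)
non_numerical_features['occurences'] = np.nan

for feat in non_numerical_features_names:
    # nunique() ignores NaN, matching value_counts().count()
    non_numerical_features.loc[feat, 'occurences'] = train_df[feat].nunique()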

In [10]:
feature_limit_one_hot_encode = 35

non_numerical_features.sort_values(by='occurences', ascending=False, inplace=True)
# One hot encode only the features with fewer distinct values than the limit
non_numerical_features['oh_encode'] = non_numerical_features['occurences'] < feature_limit_one_hot_encode

In [11]:
n_one_hot_columns = non_numerical_features.occurences.where(non_numerical_features.oh_encode).sum()
print(
    f'After one hot encoding features with less than {feature_limit_one_hot_encode} values '
    f'there will be {int(train_df.shape[1] + n_one_hot_columns)} features in the training set'
)

After one hot encoding features with less than 35 values there will be 297 features in the training set
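
One caveat on this count, offered as a sketch rather than a correction of the figure: if the encoding is done with `pd.get_dummies` (an assumption, since the encoding step is not written yet), each encoded column is replaced by its indicator columns rather than kept alongside them. The net gain per feature is then its number of distinct values minus one. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'numeric': [1, 2, 3, 4],
    'colour': ['red', 'green', 'blue', 'red'],       # 3 distinct values
    'shape': ['round', 'square', 'round', 'round'],  # 2 distinct values
})
to_encode = ['colour', 'shape']

# Estimate: original columns, minus the encoded ones, plus their indicator columns
estimated = toy.shape[1] - len(to_encode) + sum(toy[c].nunique() for c in to_encode)

encoded = pd.get_dummies(toy, columns=to_encode)
print(estimated, encoded.shape[1])
```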


In [12]:
non_numerical_features

| | occurences | oh_encode |
|---|---|---|
| MachineIdentifier | 89376.0 | False |
| AvSigVersion | 3854.0 | False |
| OsBuildLab | 345.0 | False |
| Census_OSVersion | 251.0 | False |
| AppVersion | 77.0 | False |
| EngineVersion | 42.0 | False |
| Census_ChassisTypeName | 26.0 | True |
| Census_OSSkuName | 21.0 | True |
| Census_OSEdition | 21.0 | True |
| Census_InternalBatteryType | 21.0 | True |


### Exploring linear encoding / training categories

What happens when a value the model has never seen comes in? Say training had low, medium and high (encoded as 1, 2, 3), but then the test set or a real-world example contains the value very high. How does that get passed through the random forest? I assume it doesn't get assigned the value of 4. Even more problematic would be a value of low-medium.

If there is a split midway through the tree, what does it do if the split tests for less than 2.5?
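
Here is a sketch of what could happen, under two assumptions: the integer codes come from a fixed mapping learned on the training data, and the tree split is a plain numeric comparison of the form `value <= threshold` (which is how scikit-learn's decision trees split). The category names and the 2.5 threshold mirror the example above; nothing here is taken from the dataset.

```python
import pandas as pd

# Mapping learned from the training data, as in the example above
mapping = {'low': 1, 'medium': 2, 'high': 3}

new_values = pd.Series(['low', 'high', 'very high', 'low-medium'])

# Values missing from the mapping become NaN rather than getting a new code
encoded = new_values.map(mapping)
print(encoded.isna().tolist())

# One possible choice: send every unseen value to a reserved code such as -1
encoded = encoded.fillna(-1).astype(int)

# A split "value <= 2.5" then only compares numbers, so the reserved code
# ends up on the same side as 'low' and 'medium'
goes_left = (encoded <= 2.5).tolist()
print(goes_left)
```

Whether it is acceptable for unseen values to travel with the lowest codes depends on the feature, so the reserved code is a design choice to revisit rather than a safe default.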

In [None]:
a = train_df.AvSigVersion.value_counts().keys()

In [26]:
a.sort_values()

Index(['0.0.0.0', '1.199.3431.0', '1.221.14.0', '1.223.1419.0', '1.223.1437.0',
       '1.223.1440.0', '1.223.1683.0', '1.223.1733.0', '1.223.1838.0',
       '1.223.1849.0',
       ...
       '1.277.41.0', '1.277.43.0', '1.277.46.0', '1.277.48.0', '1.277.49.0',
       '1.277.51.0', '1.277.58.0', '1.277.62.0', '1.277.64.0', '1.277.67.0'],
      dtype='object', length=3854)
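
The sort above is alphabetical on the strings, so an integer encoding built from it would stop following release order as soon as a component gains a digit. Below is a minimal sketch of a numeric sort key. It assumes every component of every version string is an integer, which has not been checked against the full dataset, and `version_key` is a hypothetical helper.

```python
def version_key(version):
    """Turn '1.273.529.0' into (1, 273, 529, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


# The first three strings appear in the output above; '1.277.100.0' is made up
# to show the difference between the two orderings.
versions = ['1.277.41.0', '0.0.0.0', '1.223.1419.0', '1.277.100.0']

alphabetical = sorted(versions)
numeric = sorted(versions, key=version_key)
print(alphabetical)
print(numeric)
```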