### Key Challenge in this Task:
Loading the data is the main difficulty here: the train and test data sets are each 5GB+ in size. We use [Dask Dataframe](https://docs.dask.org/en/latest/dataframe.html) to tackle this. [Dask](https://dask.org/) scales pandas workflows by splitting the data into partitions and processing them in parallel.
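
The split-then-combine idea behind a partitioned read can be sketched with plain pandas. This is only an illustration using `pandas.read_csv(chunksize=...)`, not Dask itself, and the inline CSV is made up for the example.

```python
import io
import pandas as pd

# Made-up CSV standing in for a large file (hypothetical data).
csv_text = "id,flag\n" + "\n".join(f"{i},{i % 2}" for i in range(10))

# Read the file in chunks of 4 rows, similar in spirit to reading partitions.
chunks = list(pd.read_csv(io.StringIO(csv_text), dtype={"flag": "int8"}, chunksize=4))

# Combine the chunks into one in-memory frame (loosely analogous to .compute()).
full = pd.concat(chunks, ignore_index=True)
print(len(chunks), full.shape, full["flag"].dtype)
```

Declaring the dtypes up front, as done here for `flag`, keeps each chunk small before the pieces are combined.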



### List of Acknowledgements :
* **Acknowledgement 1:** [Theo Viel's data loading work](https://www.kaggle.com/theoviel/load-the-totality-of-the-data), with some modifications of my own to the data types.
* **Acknowledgement 2:** Memory optimization explanation from [Chris Deotte](https://www.kaggle.com/cdeotte) (Each column in train.csv has about 9 million rows. If we declare a column as int16, we allocate 2 bytes per row, which equals 18 million bytes for that column. If we declare it as int8 instead, we allocate 1 byte per row, which equals 9 million bytes).
* **Acknowledgement 3:** [Fabien Daniel's brilliant work using LGBM](https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm), where he groups the variables into three kinds: binary, true numerical (float) and categorical.
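
The memory arithmetic in Acknowledgement 2 can be checked directly with NumPy's item sizes. The row count below is the rounded figure quoted above, not the exact size of train.csv.

```python
import numpy as np

n_rows = 9_000_000  # rounded row count quoted above (assumption)

# itemsize is the number of bytes used to store one value of the dtype.
bytes_int16 = n_rows * np.dtype("int16").itemsize
bytes_int8 = n_rows * np.dtype("int8").itemsize

print(bytes_int16)  # 2 bytes per row
print(bytes_int8)   # 1 byte per row
```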

In [None]:
import dask
import dask.dataframe as dd
import warnings
import numpy as np
import pandas as pd
import gc
import seaborn as sns
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error, classification_report

%matplotlib inline
sns.set(style="whitegrid")
warnings.filterwarnings("ignore")

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function() {
    return false;
}

In [None]:
# Taken from https://www.kaggle.com/theoviel/load-the-totality-of-the-data
# I modified some data types

dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float16',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int8',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float32',
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float16',
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float16',
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float32',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float32',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float16',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float16',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float16',
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float32',
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }
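
When choosing integer dtypes by hand, it helps to confirm that the values actually fit. A minimal sketch, assuming made-up columns (the real value ranges have to be checked on the actual data), using `pd.to_numeric` with `downcast`:

```python
import numpy as np
import pandas as pd

# Hypothetical columns for illustration only.
flags = pd.Series([0, 1, 1, 0])
ids = pd.Series([0, 150, 283])

# downcast="integer" picks the smallest signed integer dtype that holds every value.
flags_small = pd.to_numeric(flags, downcast="integer")
ids_small = pd.to_numeric(ids, downcast="integer")

print(flags_small.dtype, ids_small.dtype)
print(np.iinfo(np.int8).max, np.iinfo(np.int16).max)
```

Here `ids` cannot be stored as int8 because 283 exceeds the int8 maximum of 127, so pandas falls back to int16.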

In [None]:
# We will use this quite often to clear memory space
gc.collect()

In [None]:
train_data = dd.read_csv("../input/microsoft-malware-prediction/train.csv", dtype=dtypes) 
train_data = train_data.compute()

In [None]:
gc.collect()

In [None]:
def segregate_features():
    binary = [col for col in train_data.columns if train_data[col].nunique() == 2]
    numerical_floats = ['Census_ProcessorCoreCount',
                        'Census_PrimaryDiskTotalCapacity',
                        'Census_SystemVolumeTotalCapacity',
                        'Census_TotalPhysicalRAM',
                        'Census_InternalPrimaryDiagonalDisplaySizeInInches',
                        'Census_InternalPrimaryDisplayResolutionHorizontal',
                        'Census_InternalPrimaryDisplayResolutionVertical',
                        'Census_InternalBatteryNumberOfCharges']
    categorical = [col for col in train_data.columns if col not in numerical_floats and col not in binary]
    return binary, numerical_floats, categorical
    
binary_columns, numerical_float_columns, categorical_columns = segregate_features()
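
To see how this split behaves, here is a small sketch on a made-up frame (the column values are hypothetical). Note that `nunique()` ignores missing values by default, so a column holding two distinct values plus NaN still counts as binary.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "IsBeta": [0, 1, 0, 1],
    "Firewall": [1.0, np.nan, 0.0, 1.0],  # two distinct values plus a missing one
    "RAM": [4096.0, 8192.0, 2048.0, 4096.0],
    "Platform": ["windows10", "windows8", "windows7", "windows10"],
})
numerical_floats = ["RAM"]

# Same rules as segregate_features(), applied to the toy frame.
binary = [c for c in toy.columns if toy[c].nunique() == 2]
categorical = [c for c in toy.columns if c not in numerical_floats and c not in binary]
print(binary, categorical)
```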

In [None]:
gc.collect()

In [None]:
print(train_data.shape)
train_data.head()

In [None]:
gc.collect()

### Checking for Missing Values :

In [None]:
# Taken from https://michael-fuchs-python.netlify.app/2019/03/18/dealing-with-missing-values/

def display_missing_values(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [None]:
gc.collect()

In [None]:
print(display_missing_values(train_data))
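
The core of the missing-value report can be sketched in a few lines on a made-up frame. This is a compact variant for illustration, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": ["x", None, "y", "z"],
})

# isnull().mean() is the fraction of missing entries per column.
pct_missing = (100 * toy.isnull().mean()).round(1)

# Keep only columns that have missing values, largest share first.
report = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(report)
```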

In [None]:
gc.collect()

### Column Type Distribution : 

In [None]:
total = train_data.shape[0]
missing_df = []
cardinality_df = []
for col in train_data.columns:
    missing_df.append([col, train_data[col].count(), total])
    cardinality = train_data[col].nunique()
    if cardinality > 2 and col != 'MachineIdentifier':
        cardinality_df.append([col, cardinality])
    
missing_df = pd.DataFrame(missing_df, columns=['Column', 'Number of records', 'Total']).sort_values("Number of records", ascending=False)
cardinality_df = pd.DataFrame(cardinality_df, columns=['Column', 'Cardinality']).sort_values("Cardinality", ascending=False)
type_df = [['Binary columns', len(binary_columns)], ['Numerical columns', len(numerical_float_columns)], ['Categorical columns', len(categorical_columns)]]

type_df = pd.DataFrame(type_df, columns=['Type', 'Column Count']).sort_values('Column Count', ascending=True)

In [None]:
plt.style.use('ggplot')
f, ax = plt.subplots(figsize=(8, 1))
sns.barplot(x="Column Count", y="Type", data=type_df, label="Feature Type Distribution", palette='PuRd')
plt.show()

In [None]:
gc.collect()

### Testing Imbalance of Labels :

In [None]:
f, ax = plt.subplots(figsize=(3, 3))
ax = sns.countplot(x="HasDetections", data=train_data, label="Label Count")
sns.despine(bottom=True)

The plot shows that the training data set is well balanced: the two labels occur in almost equal numbers.

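
The balance can also be quantified rather than read off the plot. A minimal sketch on made-up labels, using `value_counts(normalize=True)`:

```python
import pandas as pd

toy_labels = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])  # made-up labels

# Share of each class, and the ratio between the largest and smallest class.
shares = toy_labels.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()
print(shares.to_dict(), imbalance_ratio)
```

A ratio close to 1 indicates balanced classes.
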
### Feature-wise Missing Value Distribution :

In [None]:
f, ax = plt.subplots(figsize=(10, 16))
sns.set_color_codes("muted")
sns.barplot(x="Total", y="Column", data=missing_df, label="Missing", color="navy")
sns.barplot(x="Number of records", y="Column", data=missing_df, label="Existing", color="skyblue")
ax.legend(ncol=2, loc="upper right", frameon=True)
plt.show()

In [None]:
gc.collect()

### Cardinality Distribution of Categorical Variables :

In [None]:
f, ax = plt.subplots(figsize=(10, 15))
sns.set_color_codes("muted")
sns.barplot(x="Cardinality", y="Column", data=cardinality_df, label="Existing", color="red")
plt.show()

In [None]:
gc.collect()

## Data Pre-Processing :
#### Step 1: Removing Columns with High Cardinality

In [None]:
high_cardinality_cols = [col for col in categorical_columns if train_data[col].nunique() > 500] 
high_cardinality_cols.remove('MachineIdentifier')  # Keep MachineIdentifier for now; it is dropped in Step 3
train_data.drop(high_cardinality_cols, axis=1, inplace=True)
print('Columns with High Cardinality: \n')
high_cardinality_cols

#### Step 2: Removing Columns Having >40% Missing Data

In [None]:
high_null_cols = [col for col in train_data.columns if train_data[col].count() < len(train_data)*0.6]
train_data.drop(high_null_cols, axis=1, inplace=True)
print('Columns with > 40% Missing Values: \n')
high_null_cols
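
The same 60% non-null rule can be expressed with `DataFrame.dropna(axis=1, thresh=...)`. The sketch below uses a made-up frame and is an alternative formulation, not the code used in this notebook.

```python
import math
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full": range(10),                      # 100% present
    "mostly": [1.0] * 7 + [np.nan] * 3,     # 70% present
    "sparse": [1.0] * 5 + [np.nan] * 5,     # 50% present
})

# Keep a column only if at least 60% of its values are non-null.
min_non_null = math.ceil(0.6 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))
```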

#### Step 3: Removing Unnecessary Columns

In [None]:
useless_cols = ['MachineIdentifier']
train_data.drop(useless_cols, axis=1, inplace=True)

In [None]:
gc.collect()

In [None]:
# Drop rows that have a missing value in any numeric feature
# We will need this later to plot the distributions
train_data.dropna(subset = numerical_float_columns, inplace=True)

In [None]:
gc.collect()

In [None]:
binary_columns, numerical_float_columns, categorical_columns = segregate_features()

In [None]:
gc.collect()

### Plotting Distribution :

In [None]:
def plot_distribution():
    for feat in numerical_float_columns:
        f, axes = plt.subplots(1, 3, figsize=(20, 8), sharex=True)
        sns.distplot(train_data[feat], ax=axes[0], kde_kws={'bw': 0.1}).set_title("All Labels")
        sns.distplot(train_data[train_data['HasDetections']==1][feat], ax=axes[1], kde_kws={'bw': 0.00001}).set_title("HasDetections = 1")
        sns.distplot(train_data[train_data['HasDetections']==0][feat], ax=axes[2], kde_kws={'bw': 0.00001}).set_title("HasDetections = 0")
        sns.despine(left=True)
        plt.tight_layout()

In [None]:
gc.collect()

In [None]:
plot_distribution()

### Observations :
* Census_ProcessorCoreCount: The distribution for machines with detections (HasDetections = 1) is right-skewed.
* Census_PrimaryDiskTotalCapacity: Almost symmetric.
* Census_SystemVolumeTotalCapacity, Census_TotalPhysicalRAM: The distribution for machines without detections (HasDetections = 0) is right-skewed.
* Census_InternalPrimaryDiagonalDisplaySizeInInches: The distribution for machines without detections has a long right tail.
* Census_InternalPrimaryDisplayResolutionHorizontal: Almost symmetric.
* Census_InternalPrimaryDisplayResolutionVertical: The distribution for machines without detections has a long right tail.
* Census_InternalBatteryNumberOfCharges: Almost symmetric.
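
Skewness can also be measured numerically to back up a visual reading. A minimal sketch on made-up samples, using `Series.skew()`: a positive value indicates a right-skewed distribution, a value near zero a symmetric one.

```python
import pandas as pd

right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])  # a few large values stretch the right tail
symmetric = pd.Series([1, 2, 3, 4, 5])

skew_right = right_skewed.skew()
skew_sym = symmetric.skew()
print(skew_right, skew_sym)
```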

# Data Processing for LGBM

In [None]:
train_data.dropna(inplace=True)

In [None]:
labels = train_data['HasDetections']
train_data.drop('HasDetections', axis=1, inplace=True)
X_train, X_val, Y_train, Y_val = train_test_split(train_data, labels, test_size=0.15, random_state=12345)

In [None]:
binary_columns, numerical_float_columns, categorical_columns = segregate_features()

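# Label-encode each categorical column using the categories seen in X_train;
# get_indexer returns -1 for values that were not seen in training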
lencoder = {}
for col in categorical_columns:
    _, lencoder[col] = pd.factorize(X_train[col])
    
for col in categorical_columns:
    X_train[col] = lencoder[col].get_indexer(X_train[col])
    X_val[col] = lencoder[col].get_indexer(X_val[col])
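
How this encoding treats unseen categories is easiest to see on a toy column. The values below are made up for illustration.

```python
import pandas as pd

train_col = pd.Series(["win10", "win8", "win10", "win7"])
val_col = pd.Series(["win8", "winxp", "win10"])  # "winxp" never appears in training

# factorize returns integer codes and the unique values in order of first appearance.
codes, uniques = pd.factorize(train_col)

# get_indexer maps values to their position in uniques, or -1 if absent.
train_codes = uniques.get_indexer(list(train_col))
val_codes = uniques.get_indexer(list(val_col))
print(list(train_codes), list(val_codes))
```

All categories unseen during training collapse into the single code -1.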

In [None]:
gc.collect()

In [None]:
params = {'num_leaves': 60,
         'min_data_in_leaf': 100, 
         'objective':'binary',
         'max_depth': -1,
         'learning_rate': 0.1,
         "boosting": "gbdt",
         "feature_fraction": 0.8,
         "bagging_freq": 1,
         "bagging_fraction": 0.8 ,
         "bagging_seed": 42,
         "metric": 'auc',
         "lambda_l1": 0.1,
         "random_state": 12345,
         "verbosity": -1}

In [None]:
lgb_train = lgb.Dataset(X_train, label=Y_train)
lgb_val = lgb.Dataset(X_val, label=Y_val)

In [None]:
gc.collect()

## LGBM Model Training :

In [None]:
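# early_stopping_rounds / verbose_eval were removed from lgb.train() in LightGBM 4.0;
# the callbacks below are the equivalent (available since LightGBM 3.3)
model = lgb.train(params, lgb_train, 1000, valid_sets=[lgb_train, lgb_val],
                  callbacks=[lgb.early_stopping(200), lgb.log_evaluation(100)])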

## Feature Importance :

In [None]:
lgb.plot_importance(model, figsize=(16, 16))
plt.show()

#### Top 10 Most Important Features for Predicting Malware Detections:
* Census_SystemVolumeTotalCapacity
* CountryIdentifier
* Census_OSVersion
* Census_InternalPrimaryDiagonalDisplaySizeInInches
* AppVersion
* GeoNameIdentifier
* LocaleEnglishNameIdentifier
* Wdft_RegionIdentifier
* Census_PrimaryDiskTotalCapacity
* EngineVersion

In [None]:
gc.collect()

In [None]:
train_preds_raw = model.predict(X_train, num_iteration=model.best_iteration)
val_preds_raw = model.predict(X_val, num_iteration=model.best_iteration)
train_preds = np.around(train_preds_raw)
val_preds = np.around(val_preds_raw)
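
Rounding the probabilities amounts to a cut-off at 0.5, with one subtlety: `np.around` rounds exact halves to the nearest even integer, so a probability of exactly 0.5 becomes 0. A small sketch on made-up probabilities, compared with an explicit threshold:

```python
import numpy as np

raw = np.array([0.2, 0.5, 0.51, 0.9])  # made-up predicted probabilities

rounded = np.around(raw).astype(int)       # exact 0.5 rounds to 0 (half to even)
thresholded = (raw >= 0.5).astype(int)     # explicit cut-off, 0.5 counts as positive
print(rounded, thresholded)
```

With real-valued model outputs an exact 0.5 is rare, so the two approaches usually agree.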

# Model Metrics :
### [1] Classification Report for Training & Validation

In [None]:
target_names=['HasDetections = 0', 'HasDetections = 1']
print('************************* TRAIN *************************')
print(classification_report(Y_train, train_preds, target_names=target_names))
print('*********************** VALIDATION **********************')
print(classification_report(Y_val, val_preds, target_names=target_names))

Accuracy is almost the same on the training and validation data sets, hovering around 68% - 69%.

### [2] Confusion Matrices for Training & Validation

In [None]:
f, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)
train_cnf_mat = confusion_matrix(Y_train, train_preds)
val_cnf_mat = confusion_matrix(Y_val, val_preds)

train_cnf_mat_norm = train_cnf_mat / train_cnf_mat.sum(axis=1)[:, np.newaxis]
val_cnf_mat_norm = val_cnf_mat / val_cnf_mat.sum(axis=1)[:, np.newaxis]

train_df_cm = pd.DataFrame(train_cnf_mat_norm, index=[0, 1], columns=[0, 1])
val_df_cm = pd.DataFrame(val_cnf_mat_norm, index=[0, 1], columns=[0, 1])

sns.heatmap(train_df_cm, annot=True, fmt='.2f', cmap="Spectral", ax=axes[0]).set_title("TRAIN")
sns.heatmap(val_df_cm, annot=True, fmt='.2f', cmap="Spectral", ax=axes[1]).set_title("VALIDATION")

The training and validation datasets show nearly identical TP, TN, FP and FN rates.
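
The row normalization used for these heatmaps can be checked on a small made-up matrix. Following scikit-learn's convention, rows are true labels and columns are predicted labels, so each row sums to 1 after normalization.

```python
import numpy as np

# Made-up counts: rows = true label (0, 1), columns = predicted label (0, 1).
cm = np.array([[80, 20],
               [30, 70]])

# Divide each row by its total so entries become per-class rates.
cm_norm = cm / cm.sum(axis=1)[:, np.newaxis]

tn_rate, fp_rate = cm_norm[0]
fn_rate, tp_rate = cm_norm[1]
print(cm_norm)
```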

### [3] Distribution Plot for Training & Validation:

In [None]:
f, ax = plt.subplots(figsize=(16, 4))
sns.set_color_codes("muted")
ax = sns.distplot(train_preds_raw, color="blue", kde_kws={"label": "TRAIN"}, axlabel='Probability Distribution')
ax = sns.distplot(val_preds_raw, color="orange", kde_kws={"label": "VALIDATION"})
sns.despine(left=True)

The predicted probabilities for the training and validation datasets follow almost the same distribution.

In [None]:
gc.collect()

In [None]:
del train_data, X_val, Y_train, Y_val

In [None]:
gc.collect()

# Prediction on Test Data :

In [None]:
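# .head() on a Dask DataFrame only reads the first partition by default,
# so use .compute() to load every row of the test set
test_data = dd.read_csv('../input/microsoft-malware-prediction/test.csv', dtype=dtypes, usecols=(['MachineIdentifier'] + list(X_train.columns))).compute().reset_index(drop=True)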

In [None]:
gc.collect()

In [None]:
submission = pd.DataFrame({"MachineIdentifier":test_data['MachineIdentifier']})
test_data.drop('MachineIdentifier', axis=1, inplace=True)

for col in categorical_columns:
    test_data[col] = lencoder[col].get_indexer(test_data[col])

In [None]:
predictions = model.predict(test_data, num_iteration=model.best_iteration)

In [None]:
gc.collect()

In [None]:
submission["HasDetections"] = predictions
submission.head()

In [None]:
submission.to_csv("malware_prediction_submission.csv", index=False)