# Embeddings Neural Network - Microsoft Malware
I'm excited to share my Microsoft Malware Neural Network. It uses embeddings to accept categorical variables into the network, and it is part of an ensemble that scored 0.707 public LB (14th place out of 2450), described [here][1]. The NN uses two innovative designs that I'm proud to have discovered:
* Similar variables are "squeezed" together forcing the network to PCA encode the group
* High cardinality categorical variables are reduced using statistical hypothesis testing   

Executing this kernel trains one fold of a 5-fold network and scores 0.696 (± 0.001) public LB and 0.770 (± 0.003) private LB. If you ensemble the outputs from running this kernel 5 times, you score 0.699 public LB and 0.771 private LB! Below is an example of a "squeeze" grouping for geographical variables. The actual group receives 8 geographical variables. The full network employs 6 groupings.  
  
.  
![image](http://playagricola.com/Kaggle/geographicalB31919.jpg)  
  
.  
.  
After the 6 groupings are created, they are fed into the remaining 3 hidden layers.
![image](http://playagricola.com/Kaggle/outputB31919.jpg)  
  
[1]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84135

# Load Data

In [None]:
# SET THIS VARIABLE TO TRUE TO RUN KERNEL QUICKLY
# AND TEST FOR BUGS. ONLY 10000 ROWS OF DATA ARE LOADED
Debug = False

# IMPORT LIBRARIES
import pandas as pd, numpy as np, os, gc

# LOAD THESE BUT DONT ENCODE
MM = ['AvSigVersion','Census_OSVersion','Census_OSBuildRevision','AppVersion','EngineVersion']

# LOAD AND NUMERIC ENCODE
NE = ['Census_SystemVolumeTotalCapacity','Census_PrimaryDiskTotalCapacity']

# LOAD AND STATISTICAL ONE-HOT-ENCODE
OHE = [ 'RtpStateBitfield','DefaultBrowsersIdentifier', 'AVProductStatesIdentifier',
        'AVProductsInstalled', 'AVProductsEnabled', 'CountryIdentifier', 'CityIdentifier', 
        'GeoNameIdentifier', 'LocaleEnglishNameIdentifier', 'Processor', 'OsBuild', 'OsSuite',
        'SmartScreen','Census_MDC2FormFactor', 'Census_OEMNameIdentifier', 
        'Census_ProcessorCoreCount', 'Census_ProcessorModelIdentifier', 
        'Census_OSUILocaleIdentifier', 'Census_PrimaryDiskTypeName',
        'Census_HasOpticalDiskDrive', 'Census_TotalPhysicalRAM', 'Census_ChassisTypeName',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches',
        'Census_InternalPrimaryDisplayResolutionHorizontal',
        'Census_InternalPrimaryDisplayResolutionVertical',
        'Census_PowerPlatformRoleName', 'Census_InternalBatteryType',
        'Census_InternalBatteryNumberOfCharges', 'Census_OSEdition', 'Census_GenuineStateName',
        'Census_ActivationChannel', 'Census_FirmwareManufacturerIdentifier', 'Census_IsTouchEnabled', 
        'Census_IsPenCapable', 'Census_IsAlwaysOnAlwaysConnectedCapable', 'Wdft_IsGamer', 
        'Wdft_RegionIdentifier', 'OsBuildLab', 'OrganizationIdentifier','Platform',
        'Census_OEMModelIdentifier', 'IsProtected', 'IeVerIdentifier','Firewall', 
        'Census_ProcessorManufacturerIdentifier','Census_OSInstallTypeName',
        'Census_OSWUAutoUpdateOptionsName','Census_IsFlightingInternal',
        'Census_FlightRing','Census_ThresholdOptIn','Census_FirmwareVersionIdentifier',
        'Census_IsSecureBootEnabled','Census_IsWIMBootEnabled']

# DONT LOAD THESE
XX = ['SMode','IsBeta', 'OsVer', 'OsPlatformSubRelease', 'SkuEdition', 'AutoSampleOptIn', 'PuaMode',
     'UacLuaenable', 'Census_ProcessorClass', 'Census_OSArchitecture', 'Census_OSBranch',
     'Census_OSBuildNumber', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier',
     'Census_IsPortableOperatingSystem', 'Census_IsFlightsDisabled', 'Census_IsVirtualDevice',
     'IsSxsPassiveMode','ProductName','HasTpm','Census_DeviceFamily']

# DONT LOAD THIS
XXX = ['MachineIdentifier']

# LOAD ALL AS CATEGORIES
dtypes = {}
for x in OHE+NE+MM: dtypes[x] = 'category'
dtypes['HasDetections'] = 'int8'

# LOAD TRAIN CSV FILE
if Debug:
    df_train = pd.read_csv('../input/microsoft-malware-prediction/train.csv', usecols=dtypes.keys(), dtype=dtypes,nrows=10000)
else:
    df_train = pd.read_csv('../input/microsoft-malware-prediction/train.csv', usecols=dtypes.keys(), dtype=dtypes)
if 5244810 in df_train.index:
    df_train.loc[5244810,'AvSigVersion'] = '1.273.1144.0'
    # ASSIGN THE RESULT; THE inplace ARGUMENT WAS REMOVED FROM NEWER PANDAS
    df_train['AvSigVersion'] = df_train['AvSigVersion'].cat.remove_categories('1.2&#x17;3.1144.0')
print ('Loaded',len(df_train),'rows of TRAIN.CSV!')

# SHUFFLE TRAIN DATA
df_train = df_train.sample(frac=1)
df_train.reset_index(drop=True,inplace=True)

# LOAD TEST CSV FILE
if Debug:
    df_test = pd.read_csv('../input/microsoft-malware-prediction/test.csv', usecols=list(dtypes.keys())[0:-1], dtype=dtypes,nrows=10000)
else:
    df_test = pd.read_csv('../input/microsoft-malware-prediction/test.csv', usecols=list(dtypes.keys())[0:-1], dtype=dtypes)
print ('Loaded',len(df_test),'rows of TEST.CSV!')

# Define Encoding Functions
This model uses a special statistical label encoding function which is explained in the kernel [here][1]. Here is a brief summary. When a categorical variable has many unique categories, each category value is hypothesis tested to see if it influences the `HasDetections` rate. For example, say that `CountryIdentifier` has 222 unique values: 1, 2, 3, ..., 222, and we know that the average `HasDetections` rate is 0.5. Then for all observations from country `k`, we test whether the rate of `HasDetections` deviates from 0.5 by more than 1 standard deviation. If it does, we keep that category value. If it does not, we change that category value to a new category called `unhelpful`. In this specific example, the cardinality is reduced from 222 down to 115, thus removing 107 unhelpful values.  
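
As a rough illustration of the test described above, here is a minimal pure-Python sketch. The helper name `keep_category` and the toy counts are hypothetical. It only mirrors the `sd = zscore * 0.5 / sqrt(n)` threshold from `encode_CE`; the real function also applies a frequency filter and tests the complement rate.

```python
import math

def keep_category(n, rate, z=1.0, m=0.5):
    """Keep a category if its detection rate differs from the mean m
    by at least z standard deviations, with sd = 0.5 / sqrt(n)."""
    sd = z * 0.5 / math.sqrt(n)
    return abs(rate - m) >= sd

# Hypothetical categories: (number of rows, observed HasDetections rate)
toy = {"A": (10000, 0.52), "B": (10000, 0.502), "C": (100, 0.54)}
kept = {name: keep_category(n, rate) for name, (n, rate) in toy.items()}
print(kept)
```

Category `C` has the largest deviation but only 100 rows, so its threshold is wide and it is treated as unhelpful, while `A` is kept because the same test on 10000 rows is much stricter.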
  
Additionally, we use log frequency encoding, label encoding, and memory optimization.
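
Log frequency encoding can be sketched without pandas as follows. This is an illustrative version, assuming no missing values; the function name `log_frequency_encode` is hypothetical, and the guard for a zero range is an addition that `encode_FE_lg` does not have.

```python
import math
from collections import Counter

def log_frequency_encode(values):
    """Map each value to log(relative frequency + 1/n_unique), rescaled to [0, 1]."""
    counts = Counter(values)
    n = len(values)
    eps = 1 / len(counts)  # same smoothing term as encode_FE_lg
    raw = {k: math.log(c / n + eps) for k, c in counts.items()}
    lo = min(raw.values())
    span = max(raw.values()) - lo
    if span == 0:  # all values equally frequent (guard added for this sketch)
        return [0.0] * n
    return [(raw[v] - lo) / span for v in values]

encoded = log_frequency_encode(["a"] * 6 + ["b"] * 3 + ["c"])
print(encoded[0], encoded[6], encoded[9])
```

The most frequent value maps to 1, the rarest to 0, and everything else falls in between on a log scale.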
  
[1]: https://www.kaggle.com/cdeotte/neural-network-malware-0-67

In [None]:
import math

# FACTORIZE
def factor_data(df_train, df_test, col):
    df_comb = pd.concat([df_train[col],df_test[col]],axis=0)
    df_comb,_ = df_comb.factorize(sort=True)
    # MAKE SMALLEST LABEL 1, RESERVE 0
    df_comb += 1
    # MAKE NAN LARGEST LABEL
    df_comb = np.where(df_comb==0, df_comb.max()+1, df_comb)
    df_train[col] = df_comb[:len(df_train)]
    df_test[col] = df_comb[len(df_train):]
    del df_comb
    mx = max(df_train[col].max(),df_test[col].max())+1
    return mx
    
# OPTIMIZE MEMORY
def reduce_memory(df,col):
    mx = df[col].max()
    if mx<256:
        df[col] = df[col].astype('uint8')
    elif mx<65536:
        df[col] = df[col].astype('uint16')
    else:
        df[col] = df[col].astype('uint32')
    
# LOG FREQUENCY ENCODE
def encode_FE_lg(df,col,verbose=1):
    ln = 1/df[col].nunique()
    vc = (df[col].value_counts(dropna=False, normalize=True)+ln).map(math.log).to_dict()
    nm = col+'_FE_lg'
    df[nm] = df[col].map(vc)
    df[nm] -= df[nm].min()
    df[nm] = df[nm]/df[nm].max()
    df[nm] = df[nm].astype('float32')
    if verbose==1:
        print('FE encoded',col)
    return [nm]

# STATISTICAL CATEGORY ENCODE
def encode_CE(df, col, filter, zscore, tar='HasDetections', m=0.5, verbose=1):
    # NAME THE COLUMNS EXPLICITLY SO THIS WORKS ACROSS PANDAS VERSIONS
    cv = df[col].value_counts(dropna=False).rename_axis('index').reset_index(name=col)
    cv4 = df.groupby(col)[tar].mean().reset_index().rename({tar:'rate',col:'index'},axis=1)
    d1 = set(cv['index'].unique())
    cv = pd.merge(cv,cv4,on='index',how='left')
    if (len( cv[ cv['index'].isna() ])!=0 ):
        cv.loc[ cv['index'].isna(),'rate' ] = df.loc[ df[col].isna(),tar ].mean()
    cv = cv[ cv[col]> (filter * len(df)) ]
    cv['ratec'] = (df[tar].sum() - cv['rate']*cv[col])/(len(df)-cv[col])
    cv['sd'] = zscore * 0.5 / cv[col].map(lambda x: math.sqrt(x))
    cv = cv[ (abs(cv['rate']-m)>=cv['sd']) | (abs(cv['ratec']-1+m)>=cv['sd']) ]
    d2 = set(cv['index'].unique())
    d = list(d1 - d2)
    if (df[col].dtype.name=='category'):
        if (not 0 in df[col].cat.categories):
            df[col].cat.add_categories(0,inplace=True)
        else:
            print('###WARNING CAT 0 ALREADY EXISTS IN',col)
    df.loc[ df[col].isin(d),col ] = 0
    if verbose==1:
        print('CE encoded',col,'-',len(d2),'values. Removed',len(d),'values')
    mx = df[col].nunique()
    return [mx,d2]

# CATEGORY ENCODE FROM KEEP LIST
def encode_CE_test(df,col,d):
    if (df[col].dtype.name=='category'):
        if (not 0 in df[col].cat.categories):
            df[col].cat.add_categories(0,inplace=True)
        else:
            print('###WARNING CAT 0 ALREADY EXISTS IN',col)
    df.loc[ ~df[col].isin(d),col ] = 0
    mx = df[col].nunique()
    return [mx,d]

# Feature Engineering
We engineer `AppVersion2`, which is the second number from `AppVersion`. For example, `AppVersion2 = 18` when `AppVersion = 4.18.1807.18075`. This variable indicates whether Windows Defender has been updated to the latest version. We also engineer two time stamp variables: `Lag1 = AvSigVersion_Date - Census_OSVersion_Date` and `Lag5 = max(July 26,2018 - AvSigVersion_Date,0)`. When encoding the test data, we define `Lag5 = max(September 27,2018 - AvSigVersion_Date,0)`. The time stamp variables determine if a computer has an outdated `AvSigVersion`. EDA shows that these computers have decreased `HasDetections`, presumably because their owners use them less or have better antivirus than Windows Defender. Lastly, we clean the numeric values and normalize them.  
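
The arithmetic behind these features can be checked on a single toy row. The sketch below uses made-up dates and hypothetical helper names; it follows the same units as `makeNew` (whole weeks divided by 52 for the first lag, days divided by 365 and clipped at 0 for the second).

```python
from datetime import datetime

def app_version2(app_version):
    """Second dot-separated field of AppVersion."""
    return app_version.split('.')[1]

def lag_between(date_as, date_os):
    """Whole weeks between the two timestamps, expressed in years."""
    return ((date_as - date_os).days // 7) / 52.0

def lag_to_reference(date_as, reference):
    """Days from the AvSigVersion date to a reference date, clipped at 0, in years."""
    return max((reference - date_as).days, 0) / 365.0

print(app_version2('4.18.1807.18075'))
print(lag_between(datetime(2018, 7, 12), datetime(2018, 6, 14)))
print(lag_to_reference(datetime(2018, 7, 12), datetime(2018, 7, 26)))
```

A signature dated after the reference date gives a negative difference, which is clipped to 0.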
  
 [Time split validation][1] shows that these new variables increase our model's accuracy.  

[1]: https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68

In [None]:
def makeNew(df,verbose=1,add=0,TS=True,data=0):

    old = df.columns
    
    # FEATURE ENGINEER
    df['AppVersion2'] = df['AppVersion'].apply(lambda x: x.split('.')[1]).astype('category')

    if TS:
        from datetime import datetime, date, timedelta

        # AS timestamp
        datedictAS = np.load('../input/malware-timestamps/AvSigVersionTimestamps.npy', allow_pickle=True)[()]
        df['DateAS'] = df['AvSigVersion'].map(datedictAS)

        # OS timestamp
        datedictOS = np.load('../input/malware-timestamps-2/OSVersionTimestamps.npy', allow_pickle=True)[()]
        df['DateOS'] = df['Census_OSVersion'].map(datedictOS)

        df['Lag1'] = df['DateAS'] - df['DateOS']
        df['Lag1'] = df['Lag1'].map(lambda x: x.days//7)
        df['Lag1'] = df['Lag1']/52.0
        df['Lag1'] = df['Lag1'].astype('float32')
        df['Lag1'].fillna(0,inplace=True)
        
        if data!=0:
            if data==1:
                df['Lag5'] = datetime(2018,7,26) - df['DateAS'] # TRAIN
            elif data==2:
                df['Lag5'] = datetime(2018,9,27) - df['DateAS'] #PUBLIC TEST
            elif data==3:
                df['Lag5'] = datetime(2018,10,27) - df['DateAS'] #PRIVATE TEST
            df['Lag5'] = df['Lag5'].map(lambda x: x.days//1)
            df.loc[ df['Lag5']<0, 'Lag5' ] = 0
            df['Lag5'] = df['Lag5']/365.0
            df['Lag5'] = df['Lag5'].astype('float32')
            df['Lag5'].fillna(0,inplace=True)

        del df['DateAS'], df['DateOS']
        del datedictAS, datedictOS
        x=gc.collect()
    
    # NUMERIC ENCODE NE VARIABLES
    for col in NE:
        nm = col+'_NE'
        df[nm] = df[col].astype('float32')
        df[nm] /= np.std(df[nm])
    new = list(set(df.columns)-set(old))
    ret = []
    for x in new:
        if str(df[x].dtype)=='category': # if cat
            if add==1: OHE.append(x)
        else: 
            ret.append(x)
            df[x].fillna(df[x].mean(),inplace=True)
    if verbose==1:
        print('Engineered',len(new),'new features!')
    return ret

# Encode Variables
In addition to statistically label encoding categorical variables (described above), we add a log frequency encoded copy of every categorical variable with cardinality over 10. [Time split validation][1] shows that these new variables increase our model's accuracy.  

[1]: https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68

In [None]:
# GET FREQUENCY ENCODE LIST
FE = []
for col in df_train.columns:
    if col=='HasDetections': continue
    if df_train[col].nunique()>10:
        FE.append(col)

# FEATURE ENGINEER / NUMERIC ENCODE
NUM = makeNew(df_train,verbose=0,add=1,data=1)
makeNew(df_test,verbose=0,data=2)
print('Engineered '+str(len(NUM))+' variables (including NE)')
ct = len(NUM)+1; cnew = ct
    
# FREQUENCY ENCODE
for x in FE:
    NUM += encode_FE_lg(df_train,x,verbose=0)
    encode_FE_lg(df_test,x,verbose=0)
    #print(str(ct)+': FE: '+x)
    ct += 1
print('Frequency encoded '+str(len(NUM)-cnew)+' variables')
    
# STATISTICAL CATEGORY ENCODE
inps={}; tt = 0
for col in OHE:
    factor_data(df_train,df_test,col)
    d = encode_CE(df_train,col,0.001,1)[1]
    encode_CE_test(df_test,col,d)
    inps[col] = factor_data(df_train,df_test,col)
    tt += inps[col]
    reduce_memory(df_train,col)
    reduce_memory(df_test,col)
    #print(str(ct)+': CE: '+col)
    ct += 1

# REMOVE UNNEEDED
for x in np.unique(NE+MM):
    del df_train[x]
    if x!='AvSigVersion': del df_test[x]
x = gc.collect()

mm = round(df_train.memory_usage(deep=True).sum() / 1024**2)
mm2 = round(df_test.memory_usage(deep=True).sum() / 1024**2)
print('Encoded '+str(len(NUM))+' non-CE variables and '+str(len(OHE))+' CE containing '+str(tt)+' unique values into '+str(mm)+' Mb memory')
print('Test memory is '+str(mm2)+' Mb')

# Define AUC Callback
This callback displays AUC after each epoch.
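
For readers unfamiliar with the metric, AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The callback uses scikit-learn's `roc_auc_score`; the quadratic-time function below is only an illustrative sketch of that definition, with a hypothetical name.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_value = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75.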

In [None]:
from keras import callbacks
from sklearn.metrics import roc_auc_score

class printAUC(callbacks.Callback):
    def __init__(self, X_train, y_train, inps, fes, X_val, y_val, k, ee):
        super(printAUC, self).__init__()
        self.bestAUC = 0
        self.X_train = X_train
        self.y_train = y_train
        self.inps = inps
        self.fes = fes
        self.X_val = X_val
        self.y_val = y_val
        self.k = k
        self.ee = ee
        
    def on_epoch_end(self, epoch, logs={}):
        pred = self.model.predict([self.X_train[col] for col in self.inps] + [self.X_train[self.fes]])
        aucTR = roc_auc_score(self.y_train, pred)
        pred = self.model.predict([self.X_val[col] for col in self.inps] + [self.X_val[self.fes]])
        auc = roc_auc_score(self.y_val, pred)
        print ("Train AUC: " + str(round(aucTR,5))+" - Validation AUC: " + str(round(auc,5)))
        if (self.bestAUC < auc) :
            self.bestAUC = auc
            self.model.save("bestNet"+str(self.k)+".h5", overwrite=True)
        return

# Define Variable Groupings
All variables in the same group get "squeezed" together and thus the group's dimension is reduced.
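
Group membership is resolved against the columns that were actually loaded, and anything not claimed by a named group ends up in a leftover group. Here is a small sketch of that set logic, using a toy column list rather than the full `OHE` list.

```python
# Toy stand-ins: columns that were loaded and encoded
loaded = {'CountryIdentifier', 'CityIdentifier', 'SmartScreen', 'Firewall', 'Platform'}

# Toy groups; some names refer to columns that were never loaded
toy_groups = [
    ['CountryIdentifier', 'CityIdentifier', 'Census_OSInstallLanguageIdentifier'],
    ['SmartScreen', 'Firewall', 'SMode'],
]

# Keep only the members that exist, then collect whatever no group claimed
resolved = [sorted(set(g) & loaded) for g in toy_groups]
used = set().union(*resolved)
leftover = sorted(loaded - used)

print(resolved)
print(leftover)
```

Names that appear in a group but were dropped at load time are silently ignored, so the effective group can be smaller than the list suggests.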

In [None]:
# DEFINE NETWORK ARCHITECTURE GROUPINGS
# (1) GEOGRAPHICAL, (2) SOFTWARE/VIRUS, (3) HARDWARE, (4) NAME/MODEL
groups = [  ['CountryIdentifier','CityIdentifier','OrganizationIdentifier','GeoNameIdentifier',
             'LocaleEnglishNameIdentifier','Census_OSInstallLanguageIdentifier','Census_OSUILocaleIdentifier',
            'Wdft_RegionIdentifier'],
            ['DefaultBrowsersIdentifier', 'AVProductStatesIdentifier', 'AVProductsInstalled', 'AVProductsEnabled',
             'IsProtected', 'SMode', 'IeVerIdentifier', 'SmartScreen', 'Firewall','Census_IsSecureBootEnabled',
            'Census_IsWIMBootEnabled','Wdft_IsGamer','Census_OSWUAutoUpdateOptionsName','Census_GenuineStateName',
            'AppVersion2'],
            ['Processor','Census_MDC2FormFactor','Census_DeviceFamily','Census_ProcessorCoreCount','Census_ProcessorClass',
            'Census_PrimaryDiskTypeName','Census_HasOpticalDiskDrive','Census_TotalPhysicalRAM','Census_ChassisTypeName',
            'Census_InternalPrimaryDiagonalDisplaySizeInInches', 'Census_InternalPrimaryDisplayResolutionHorizontal',
            'Census_InternalPrimaryDisplayResolutionVertical', 'Census_PowerPlatformRoleName', 'Census_InternalBatteryType',
            'Census_InternalBatteryNumberOfCharges','Census_IsTouchEnabled','Census_IsPenCapable',
             'Census_IsAlwaysOnAlwaysConnectedCapable'],
            ['Census_OEMNameIdentifier', 'Census_OEMModelIdentifier', 'Census_ProcessorManufacturerIdentifier',
            'Census_ProcessorModelIdentifier','Census_FirmwareManufacturerIdentifier', 'Census_FirmwareVersionIdentifier']
         ]

# Build Embeddings Network
All statistically label encoded categorical variables are accepted into an embedding with input output ratio 1:1. Then similar variables are grouped as defined above. These groupings are fed into a common dense layer with input output ratio 2:1. This "squeezes" the variables together and finds a reduced dimensional representation of the variables. Since the dense layer itself applies no activation (identity), this mimics PCA; the code then applies batch normalization and an ELU activation to the squeezed output. Each group contains the following number of variables:  
* Geographical Group - 8 variables
* Hardware Group - 18 variables
* Name/Model Group - 6 variables
* Software/Virus Group - 15 variables
* Miscellaneous Group - 12 variables
* Time Group - 33 variables  
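
The layer widths follow directly from the cardinalities. As a quick sketch with made-up cardinalities (the real values come from `inps`), a 1:1 embedding per variable followed by a 2:1 squeeze gives:

```python
# Hypothetical cardinalities after statistical encoding
cardinalities = {'CountryIdentifier': 40, 'Wdft_RegionIdentifier': 10, 'GeoNameIdentifier': 7}

# Each variable gets an embedding as wide as its cardinality (1:1)
embedding_widths = dict(cardinalities)

# Concatenating the group's embeddings and halving gives the squeeze width (2:1)
concat_width = sum(embedding_widths.values())
squeezed_width = concat_width // 2

print(concat_width, squeezed_width)
```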
  
Embeddings are equivalent to one-hot-encoding a categorical variable. For example, suppose a categorical variable has 100 unique values. If we one-hot-encode it, we get 100 boolean variables. If we input those 100 variables into 100 units and send the outputs of those 100 units into 50 units, that is equivalent to sending a label encoded variable of 100 unique values into an embedding with a 100:50 input output ratio. Below are some examples of groupings. In these examples, the categorical variables have 3 unique values and the embeddings are 3:3. The grouping then squeezes 6 inputs down to 3 outputs (thus mimicking PCA). Keep in mind that the true grouping has more variables being inputted into it.
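
The core of this equivalence is that multiplying a one-hot vector by a weight matrix just selects one row of the matrix, which is exactly what an embedding lookup does. A minimal sketch with a made-up 3-category, 2-unit weight matrix:

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def dense_no_bias(vec, weights):
    """Linear layer without bias; weights has shape (len(vec), units)."""
    units = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(units)]

# Made-up weights: 3 categories mapped to 2 units
weights = [[0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]

label = 2
via_one_hot = dense_no_bias(one_hot(label, 3), weights)  # one-hot then dense
via_lookup = weights[label]                              # embedding lookup

print(via_one_hot, via_lookup)
```

Both routes give the same vector, but the lookup avoids materializing the one-hot input, which matters when a variable has thousands of categories.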
  
![image](http://playagricola.com/Kaggle/geographicalB31919.jpg)  

.  
.  
![image](http://playagricola.com/Kaggle/outputB31919.jpg)  
  
.  
.  
![image](http://playagricola.com/Kaggle/hardwareB31919.jpg)


In [None]:
from keras.models import Model
from keras.layers import Dense, Input, concatenate, BatchNormalization, Activation, Dropout, Embedding, Reshape
from keras.callbacks import LearningRateScheduler
from keras.optimizers import Adam

df_train_Y = df_train['HasDetections']
del df_train['HasDetections']
x=gc.collect()

#SPLIT TRAIN AND VALIDATION SET
chunk = len(df_train)//5
idx = range(chunk*0,chunk//2)
idx2 = range(chunk//2,chunk)
idx3 = range(chunk,chunk*3)
idx4 = range(chunk*3,chunk*5)
X_val1 = df_train.loc[idx]
Y_val1 = df_train_Y.loc[idx]
X_val2 = df_train.loc[idx2]
Y_val2 = df_train_Y.loc[idx2]
X_train1 = df_train.loc[idx3]
Y_train1 = df_train_Y.loc[idx3]
X_train2 = df_train.loc[idx4]
Y_train2 = df_train_Y.loc[idx4]
del df_train, df_train_Y
x=gc.collect()

In [None]:
ins = []; outs = {}
# CREATE AN EMBEDDING FOR EACH CATEGORY VARIABLE
for k in inps.keys():
    x = Input(shape=(1,))
    ins.append(x)
    y = int(inps[k])  # np.int was removed from newer NumPy
    x = Embedding(y, y, input_length=1)(x)
    x = Reshape(target_shape=(y, ))(x)
    outs[k]=x 
    
# ORGANIZE EMBEDDINGS INTO GROUPS
all = set(inps.keys())
used = []
outs2 = []
for k in groups:
    g = [outs[x] for x in set(k).intersection(all)]
    used += list(set(k).intersection(all))
    x = concatenate(g)
    s = sum([inps[x] for x in set(k).intersection(all)])
    x = Dense(s//2,kernel_initializer='he_uniform')(x)
    x = BatchNormalization()(x)
    x = Activation('elu')(x)
    outs2.append(x)
g = [outs[x] for x in all-set(used)]
x = concatenate(g)
s = sum([inps[x] for x in all-set(used)])
x = Dense(s//2,kernel_initializer='he_uniform')(x)
x = BatchNormalization()(x)
x = Activation('elu')(x)
outs2.append(x)

# ORGANIZE FREQUENCY ENCODED AND NUMERICS INTO A GROUP
x = Input(shape=(len(NUM), ))
ins.append(x)
x = Dense(len(NUM)//2,kernel_initializer='he_uniform')(x)
x = BatchNormalization()(x)
x = Activation('elu')(x) 

# CONNECT GROUPS TO DENSE LAYERS
x = concatenate(outs2+[x])
x = Dense(100,kernel_initializer='he_uniform')(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Activation('elu')(x)
x = Dense(100,kernel_initializer='he_uniform')(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Activation('elu')(x)
x = Dense(100,kernel_initializer='he_uniform')(x)
x = Dropout(0.2)(x)
x = BatchNormalization()(x)
x = Activation('elu')(x)
x = Dense(1,activation='sigmoid')(x)

model = Model(inputs=ins, outputs=x)
model.compile(optimizer=Adam(lr=0.01), loss='binary_crossentropy', metrics=['accuracy'])
#annealer = LearningRateScheduler(lambda x: 1e-2 * 0.95 ** x)

# Train Embeddings Network
All of the training data won't fit into GPU memory, so we alternate training with half and half. This single kernel only trains one fold of a 5-fold NN. If you run this kernel 5 times and ensemble the 5 submission files, you will achieve 0.699 public LB.
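
The training loop alternates between the two halves within each epoch and then reshuffles each half, keeping features and labels aligned. The sketch below shows only that control flow on toy lists; the helper names are hypothetical and no model is trained.

```python
import random

def training_schedule(epochs, halves=("half1", "half2")):
    """Order in which (epoch, half) pairs are fitted."""
    return [(e, h) for e in range(epochs) for h in halves]

def shuffle_together(features, labels, rng):
    """Apply one random permutation to features and labels so pairs stay aligned."""
    order = list(range(len(features)))
    rng.shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]

schedule = training_schedule(2)
feats, labs = shuffle_together([10, 20, 30, 40], [1, 2, 3, 4], random.Random(0))

print(schedule)
print(feats, labs)
```

The notebook achieves the same alignment by temporarily attaching `HasDetections` to the feature frame before calling `sample(frac=1)`.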

In [None]:
epochs=10
batch=256
for k in range(epochs):
    # SPLIT TRAINING DATA IN HALF TO FIT INTO GPU MEMORY
    model.fit( [X_train1[col] for col in OHE] + [X_train1[NUM]],Y_train1,
        batch_size=batch, epochs = 1, verbose=2, callbacks=[ #annealer, 
        printAUC(X_train1, Y_train1, OHE, NUM, X_val1, Y_val1, 0, k)],
        validation_data = ([X_val1[col] for col in OHE] + [X_val1[NUM]],Y_val1) )
    model.fit( [X_train2[col] for col in OHE] + [X_train2[NUM]],Y_train2,
        batch_size=batch, epochs = 1, verbose=2, callbacks=[ #annealer, 
        printAUC(X_train2, Y_train2, OHE, NUM, X_val2, Y_val2, 0, k)],
        validation_data = ([X_val2[col] for col in OHE] + [X_val2[NUM]],Y_val2) )
    # SHUFFLE TRAIN
    X_train1['HasDetections'] = Y_train1
    X_train1 = X_train1.sample(frac=1)
    Y_train1 = X_train1['HasDetections']
    del X_train1['HasDetections']
    X_train2['HasDetections'] = Y_train2
    X_train2 = X_train2.sample(frac=1)
    Y_train2 = X_train2['HasDetections']
    del X_train2['HasDetections']
    x=gc.collect()

In [None]:
del model
del X_train1, Y_train1, X_val1, Y_val1
del X_train2, Y_train2, X_val2, Y_val2
del ins, outs, outs2, x
x = gc.collect()

# Predict Test Data
We will predict test.csv in chunks of 1 million rows.
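
Chunked prediction boils down to walking over half-open row ranges. A small generator like the one below (a hypothetical helper, not part of the kernel) makes the boundaries explicit, including the shorter final chunk.

```python
def chunk_bounds(n_rows, chunk):
    """Yield (start, stop) row ranges that cover n_rows in steps of chunk."""
    start = 0
    while start < n_rows:
        stop = min(start + chunk, n_rows)
        yield start, stop
        start = stop

bounds = list(chunk_bounds(2_500_000, 1_000_000))
print(bounds)
```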

In [None]:
# LOAD BEST SAVED NET
from keras.models import load_model

# PREDICT TEST
pred = np.zeros((len(df_test),1))
print('Predicting test...')
model = load_model('bestNet0.h5')
idx = 0; chunk = 1000000
if Debug: chunk = 5000
ct2 = 1;
while idx < len(df_test):
    idx2 = min(idx + chunk, len(df_test) )
    idx = range(idx, idx2)
    pred[idx] += model.predict( [df_test.iloc[idx][col] for col in OHE] + [df_test.iloc[idx][NUM]] )
    print(' part '+str(ct2)+' done')
    ct2 += 1
    idx = idx2
del model
x = gc.collect()

# Adjust Private Test Submission
The private test dataset is 33% outliers, explained [here][1], [here][4], [here][2], and [here][3]. Therefore we must adjust for these or our private test score will be terrible.  
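
The adjustment in the next code cell multiplies each prediction by a factor that depends on how many days the row's `AvSigVersion` date falls after November 20, 2018 04:00. Written as a standalone function (the name is hypothetical; the constant `s = 5.813888` is taken from the code), the factor is:

```python
def adjustment_factor(days_after_cutoff, s=5.813888):
    """1 at or before the cutoff, falling linearly to 0 at s days after it."""
    if days_after_cutoff <= 0:
        return 1.0
    if days_after_cutoff > s:
        return 0.0
    return 1.0 - days_after_cutoff / s

for days in (-3, 0, 2.906944, 10):
    print(days, adjustment_factor(days))
```

Rows with a missing date get `X = 0` in the notebook and therefore keep their original prediction.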
  
[1]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84745
[2]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84096
[3]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84227
[4]: https://www.kaggle.com/cdeotte/private-leaderboard-0-750

In [None]:
from datetime import datetime
datedictAS = np.load('../input/malware-timestamps/AvSigVersionTimestamps.npy', allow_pickle=True)[()]
df_test['Date'] = df_test['AvSigVersion'].map(datedictAS)
df_test['HasDetections'] = pred
df_test['X'] = df_test['Date'] - datetime(2018,11,20,4,0) 
df_test['X'] = df_test['X'].map(lambda x: x.total_seconds()/86400)
df_test['X'].fillna(0,inplace=True)
s = 5.813888
df_test['F'] = 1.0
df_test['F'] = 1 - df_test['X']/s
df_test.loc[df_test['X']<=0,'F'] = 1.0
df_test.loc[df_test['X']>s,'F'] = 0
df_test['HasDetections'] *= df_test['F']
pred = df_test['HasDetections']

# Write Submission File

In [None]:
print('Writing submission file...')
if Debug:
    submit = pd.read_csv('../input/microsoft-malware-prediction/sample_submission.csv', nrows=10000)
else:
    submit = pd.read_csv('../input/microsoft-malware-prediction/sample_submission.csv')
submit['HasDetections'] = pred
submit.to_csv('submission.csv', index=False)
print('Done!')

# Display Predictions
First we will display a histogram and next display a time series plot.

In [None]:
import matplotlib.pyplot as plt    
b = plt.hist(pred, bins=200)

In [None]:
import calendar, math

def dynamicPlot(data,col, target='HasDetections', start=datetime(2018,4,1), end=datetime(2018,12,1)
                ,inc_hr=0,inc_dy=7,inc_mn=0,show=0.99,top=5,top2=4,title='',legend=1,z=0,dots=False):
    # check for timestamps
    if 'Date' not in data:
        print('Error dynamicPlot: DataFrame needs column Date of datetimes')
        return
    
    # remove detection line if category density is too small
    cv = data[(data['Date']>start) & (data['Date']<end)][col].value_counts(dropna=False)
    cvd = cv.to_dict()
    nm = cv.index.values
    th = show * len(data)
    sum = 0; lnn2 = 0
    for x in nm:
        lnn2 += 1
        sum += cvd[x]
        if sum>th:
            break
    top = min(top,len(nm))
    top2 = min(top2,len(nm),lnn2,top)

    # calculate rate within each time interval
    diff = (end-start).days*24*3600 + (end-start).seconds
    size = diff//(3600*((inc_mn * 28 + inc_dy) * 24 + inc_hr)) + 5
    data_counts = np.zeros([size,2*top+1],dtype=float)
    idx=0; idx2 = {}
    for i in range(top):
        idx2[nm[i]] = i+1
    low = start
    high = add_time(start,inc_mn,inc_dy,inc_hr)
    data_times = [low+(high-low)/2]
    while low<end:
        slice = data[ (data['Date']<high) & (data['Date']>=low) ]
        #data_counts[idx,0] = len(slice)
        data_counts[idx,0] = 5000*len(slice['AvSigVersion'].unique())
        for key in idx2:
            if nan_check(key): slice2 = slice[slice[col].isna()]
            else: slice2 = slice[slice[col]==key]
            data_counts[idx,idx2[key]] = len(slice2)
            if target in data:
                data_counts[idx,top+idx2[key]] = slice2[target].mean()
        low = high
        high = add_time(high,inc_mn,inc_dy,inc_hr)
        data_times.append(low+(high-low)/2)
        idx += 1

    # plot lines
    fig = plt.figure(1,figsize=(15,3))
    cl = ['r','g','b','y','m']
    ax3 = fig.add_subplot(1,1,1)
    lines = []; labels = []
    if z==1: ax3.plot(data_times,data_counts[0:idx+1,0],'k')
    for i in range(top):
        tmp, = ax3.plot(data_times,data_counts[0:idx+1,i+1],cl[i%5])
        if dots: ax3.plot(data_times,data_counts[0:idx+1,i+1],cl[i%5]+'o')
        lines.append(tmp)
        labels.append(str(nm[i]))
    ax3.spines['left'].set_color('red')
    ax3.yaxis.label.set_color('red')
    ax3.tick_params(axis='y', colors='red')
    if col!='ones': ax3.set_ylabel('Category Density', color='r')
    else: ax3.set_ylabel('Data Density', color='r')
    #ax3.set_yticklabels([])
    if target in data:
        ax4 = ax3.twinx()
        for i in range(top2):
            ax4.plot(data_times,data_counts[0:idx+1,i+1+top],cl[i%5]+":")
            if dots: ax4.plot(data_times,data_counts[0:idx+1,i+1+top],cl[i%5]+"o")
        ax4.spines['left'].set_color('red')
        ax4.set_ylabel('Detection Rate', color='k')
    if title!='': plt.title(title)
    if legend==1: plt.legend(lines,labels,loc=2)
    plt.show()
        
# INCREMENT A DATETIME
def add_time(sdate,months=0,days=0,hours=0):
    month = sdate.month -1 + months
    year = sdate.year + month // 12
    month = month % 12 + 1
    day = sdate.day + days
    if day>calendar.monthrange(year,month)[1]:
        day -= calendar.monthrange(year,month)[1]
        month += 1
        if month>12:
            month = 1
            year += 1
    hour = sdate.hour + hours
    if hour>23:
        hour = 0
        day += 1
        if day>calendar.monthrange(year,month)[1]:
            day -= calendar.monthrange(year,month)[1]
            month += 1
            if month>12:
                month = 1
                year += 1
    return datetime(year,month,day,hour,sdate.minute)

# CHECK FOR NAN
def nan_check(x):
    if isinstance(x,float):
        if math.isnan(x):
            return True
    return False

In [None]:
df_test['ones'] = 1
dynamicPlot(df_test, 'ones', inc_dy=2, legend=0,
        title='Test.csv HasDetections Predictions. (Dotted line uses right y-axis. Solid uses left.)')

# Result
Here is the result from running this kernel once (1 fold). You can increase the score by 0.002 by ensembling the outputs from running this kernel 5 times. Trained neural networks have a lot of variance, so ensembling neural networks usually helps.
![image](http://playagricola.com/Kaggle/score31919.png)
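
The kernel does not say how the 5 outputs are combined. One simple option, assumed here purely for illustration, is to average the predicted probabilities row by row:

```python
def average_predictions(runs):
    """Row-wise arithmetic mean over several runs' prediction lists."""
    n = len(runs)
    return [sum(col) / n for col in zip(*runs)]

# Toy predictions from three hypothetical runs over two test rows
runs = [[0.2, 0.8],
        [0.4, 0.6],
        [0.3, 0.7]]
averaged = average_predictions(runs)
print(averaged)
```

In practice each run would be a `HasDetections` column read from that run's `submission.csv`.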