# Predict if server will be hacked


In [None]:
# importing libraries

import pandas as pd
import pandas_profiling
from pandas_profiling import ProfileReport
import numpy as np
from matplotlib import pyplot as plt
import seaborn as seab

# importing libraries for preprocessing

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import PowerTransformer

# importing the classifiers and metrics

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier 
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from xgboost  import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold 
from datetime import datetime 

print('Successfully imported libraries !!!') 

In [None]:
# Import the train and test csv files

pd.set_option('display.max_colwidth', 100)   # display up to 100 characters in each column
train_data = pd.read_csv(r'D:\jobs\A! Hackathon\Novartis Challenge\Dataset\train.csv')
test_data = pd.read_csv(r'D:\jobs\A! Hackathon\Novartis Challenge\Dataset\test.csv')
print(train_data.shape)
print(test_data.shape)

## Exploratory Data Analysis (EDA)

The first step for creating an ML model is exploring the given datasets (train.csv and test.csv). This is a crucial step.

To explore the data, I generated a report with pandas_profiling, which builds profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis; pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data

In [None]:
pandas_profiling.ProfileReport(train_data)

The pandas_profiling report is an HTML widget, so it is not accessible once the notebook has been downloaded. Therefore, to support my inferences, I will also show some plots and data.
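A few of the statistics from the profile report can also be approximated with plain pandas, which keeps them visible in a downloaded notebook. The helper below is only a sketch: `quick_profile` and the small `demo` frame are hypothetical names used for illustration, not part of the challenge data.

```python
import numpy as np
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, unique count and skewness."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })
    # skewness is only defined for numeric columns; other columns get NaN
    summary["skew"] = df.select_dtypes(include="number").skew()
    return summary

# hypothetical toy data, standing in for train_data
demo = pd.DataFrame({
    "X_a": [1, 2, 2, 3, 50],
    "X_b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "ID": ["a", "b", "c", "d", "e"],
})
profile = quick_profile(demo)
print(profile)
```

On the real data the same call would be `quick_profile(train_data)`.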

In [None]:
# plotting the histograms of all the variables

train_data.hist(bins=10, figsize=(14,10))
plt.show()

In [None]:
pandas_profiling.ProfileReport(test_data)

Notes based on the pandas_profiling report:
- The target variable (MULTIPLE_OFFENSE) is highly imbalanced. There are almost 20 times more positive examples than negative ones
- Variables X_1, X_2, X_3, X_4, X_5, X_6, X_7, X_8, X_9, X_10, X_11, X_12, X_13, X_14 and X_15 are the anonymized logging parameters which were provided and will be used as features
- Among these, X_10 and X_12 are highly skewed, whereas X_2 and X_3 are highly correlated with each other
- Missing values are <0.1% (all from X_12). These values will be replaced with zeros
- The INCIDENT_ID and DATE fields have high cardinality, so we can drop them from both train and test sets
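The imbalance and correlation notes above can be checked directly with pandas. This is a minimal sketch on synthetic data; `class_ratio`, `correlated_pairs`, the 0.9 threshold and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def class_ratio(target: pd.Series) -> float:
    """Ratio of the majority class count to the minority class count."""
    counts = target.value_counts()
    return counts.max() / counts.min()

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# synthetic stand-ins: f2 is almost a copy of f1, f3 is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})
demo_target = pd.Series([1] * 190 + [0] * 10)

print(class_ratio(demo_target))
print(correlated_pairs(demo))
```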

To build a model for this classification problem, my approach is as follows.

1. Data-Preprocessing
   1.1     Filling in missing values
   1.2     SMOTE for imbalanced data
   1.3     Scaling the features
2. Model comparisons and validation
   2.1     Testing different models (primarily ensemble and tree models)
   2.2     Compare based on cross-validated score (recall) as well as execution time and choose the best model
3. Checking feature importance and dropping irrelevant features
4. Predict on test set to get our final predictions
5. Submit the model

## Data Pre-Processing

### Dropping high-cardinality variables (train and test datasets) and removing duplicates (train dataset)

In [None]:
# dropping the INCIDENT_ID and DATE columns from the train/test datasets, and duplicate rows from the train dataset

train_data = train_data.drop(['INCIDENT_ID', 'DATE'], axis=1)
train_data.drop_duplicates(keep='first', inplace=True)

# print('train_data shape: {}'.format(train_data.shape))

index = pd.DataFrame()
index['INCIDENT_ID'] = test_data['INCIDENT_ID']
# print('index shape: {}'.format(index.shape))

test_data = test_data.drop(['INCIDENT_ID','DATE'], axis=1)
# print('test_data shape: {}'.format(test_data.shape))

In [None]:
y = train_data['MULTIPLE_OFFENSE']
train_data = train_data.drop(['MULTIPLE_OFFENSE'], axis=1)

# print('Target shape: {}'.format(y.shape))
# print('train_data shape: {}'.format(train_data.shape))

### Filling missing values (with 0)

In [None]:
train_data.info()   # we find that there are 182 missing values in variable X_12

In [None]:
test_data.info()   # we find that there are 127 missing values in variable X_12

In [None]:
# there are 182 and 127 missing values in the train and test datasets, respectively, all in variable X_12
# let's fill these missing values with 0

train_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)
# print(train_data.isnull().sum())   # to check if successful --- Done!
# print(test_data.isnull().sum())
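One side effect of filling with 0 is that an imputed entry can no longer be told apart from a genuine zero. If that ever matters, scikit-learn's `SimpleImputer` can do the same constant fill and append a missing-value indicator column. This is a sketch on a toy frame, not what was used for the submission.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame: only X_12 contains missing values
demo = pd.DataFrame({
    "X_12": [1.0, np.nan, 2.0, np.nan],
    "X_13": [5.0, 6.0, 7.0, 8.0],
})

# constant fill with 0, plus one indicator column per feature that had missing values
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
filled = imputer.fit_transform(demo)

# columns: X_12 (filled), X_13, indicator for X_12
print(filled)
```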

### Handling imbalanced data (using SMOTE)

To do: try undersampling next.

Source: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [None]:
# Applying smote to balance the imbalanced labels

print("Number records train_data dataset: ", train_data.shape)
print("Number records test_data dataset: ", test_data.shape)
print("\nNumber labels: ", y.shape)
print("Before OverSampling, counts of label '1': {}".format(sum(y==1)))    
print("Before OverSampling, counts of label '0': {}".format(sum(y==0)))

oversample = SMOTE()
X_train, Y = oversample.fit_resample(train_data, y)

print('\nAfter OverSampling, the shape of X_train: {}'.format(X_train.shape))
print('After OverSampling, the shape of Y: {}'.format(Y.shape))

print("\nAfter OverSampling, counts of label '1': {}".format(sum(Y==1)))
print("After OverSampling, counts of label '0': {}".format(sum(Y==0)))
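To make it clearer what SMOTE does to the minority class, here is a simplified numpy sketch of its core idea: each synthetic row is placed on the line segment between a minority row and one of its nearest minority neighbours. `smote_like` is a hypothetical helper written for illustration. It is not imblearn's implementation and leaves out many details.

```python
import numpy as np

def smote_like(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# four toy minority points on the corners of the unit square
minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like(minority_demo, n_new=6)
print(new_rows)
```

Because every new row is an interpolation, the synthetic points stay inside the region already covered by the minority class.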

### Scaling features

The effects of different scalers (namely StandardScaler(), MinMaxScaler() and PowerTransformer()) were compared, and PowerTransformer() was chosen. With the Yeo-Johnson method it makes each feature more Gaussian-like and, with its default settings, also standardizes the result to zero mean and unit variance.

In [None]:
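# scaling: fit the transformer on the original (pre-SMOTE) training features,
# then apply it to the oversampled training features and to the test features
scaler = PowerTransformer(method='yeo-johnson')
scaler.fit(train_data)
X_features = pd.DataFrame(scaler.transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(test_data), columns = test_data.columns)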

# print(X_features.shape)
# print(X_test.shape)

In [None]:
X_features.hist(bins=10, figsize=(14,10))
plt.show()

Comparing the above plots to the non-scaled plots, we can see that all the features have been centred at zero after scaling.
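The centring effect can also be checked numerically. The sketch below applies the same transformer to a synthetic, strongly right-skewed feature; the log-normal data and the `sample_skew` helper are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def sample_skew(x: np.ndarray) -> float:
    """Moment-based skewness of a 1-D array."""
    centred = x - x.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

# synthetic right-skewed feature standing in for a column such as X_10
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt = PowerTransformer(method="yeo-johnson")   # standardize=True is the default
transformed = pt.fit_transform(skewed)

before_skew = sample_skew(skewed[:, 0])
after_skew = sample_skew(transformed[:, 0])
print("skewness before / after:", before_skew, after_skew)
print("mean and std after:", transformed.mean(), transformed.std())
```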

## Model Comparison and Validation

Now that we are done with data preprocessing, let's move on to make some models. The models that will be trained are:
1. Logistic Regression Classifier
2. Random Forests Classifier
3. Gradient Boosted Trees Classifier
4. Decision Trees Classifier
5. Extra Trees Classifier
6. LightGBM Classifier
7. XGBoost Classifier
8. CatBoost Classifier
9. KNN Classifier

Ensemble and tree-based models make up most of this list because they tend to cope comparatively well with class imbalance.

### GridSearchCV

GridSearchCV was performed on the chosen models to get the best possible parameters. These can be seen below.
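As a minimal, self-contained illustration of the same search pattern, here is a deliberately tiny grid on synthetic data. The data from `make_classification` and the grid values are assumptions chosen so that the example runs quickly; `max_features` is kept at or below the 15 available features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in with 15 features, like X_1 ... X_15
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=10)

param_grid = {"max_depth": [3, 5, 7], "max_features": [5, 10, 15]}
search = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid,
                      cv=5, scoring="recall")
search.fit(X_demo, y_demo)

print(search.best_params_)   # best combination found on this toy data
print(search.best_score_)    # mean cross-validated recall of that combination
```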

In [None]:
'''
# training Logistic regression
lr = LogisticRegression()
params = { 'class_weight':['balanced', None],
           'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
           'C' : np.logspace(-3,3,7), 
           'penalty' : ["l1","l2"],
          
         }
gs = GridSearchCV(lr, params, cv=5, scoring='recall', verbose=1, n_jobs=-1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

-------------------------------------------------------------------------------------
# Training Random Forest Classifier
rf = RandomForestClassifier()
params = {'criterion' : ['gini', 'entropy'],
          'n_estimators' : [10, 50, 100, 500, 1000], 
          'max_depth' : [7, 11, 15, 19]
         }
gs = GridSearchCV(rf, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

---------------------------------------------------------------------------------------
# Training Gradient Boosted Trees
gb = GradientBoostingClassifier()
params = {'n_estimators' : [10, 50, 100, 500, 1000], 
          'max_depth' : [7, 11, 15, 19, 25, 30]
         }
gs = GridSearchCV(gb, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

----------------------------------------------------------------------------------------
# Training Decision Tree
dt = DecisionTreeClassifier()
params = {'max_depth' : [7, 11, 15, 19, 25, 30, 60, 90], 
          'max_features' : [1000, 2000, 5000, 10000]
         }
gs = GridSearchCV(dt, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

----------------------------------------------------------------------------------------
# Training Extra trees classifiers
etc = ExtraTreesClassifier(random_state=10)
params = {'n_estimators' : [10, 50, 100, 500, 1000], 
          'max_features' : [5, 10, 50, 100]
         }
gs = GridSearchCV(etc, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

----------------------------------------------------------------------------------------
# Training LightGBM ---- https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier()
params = { 'learning_rate': [0.005, 0.001, 0.01, 0.05, 0.1],
           'n_estimators': [5, 10, 50, 100, 500, 1000],
           'max_depth' : [7, 11, 15, 19],       # helps control overfitting
           'num_leaves': [30, 60, 90],          # correlated with max_depth, double the max_depth gives high accuracy
           'bagging_fraction' : [0.5],  # to speed up
           'bagging_freq' : [5],        # to speed up
           'feature_fraction' : [0.5],  # to speed up
           'boosting_type' : ['gbdt'],
           'objective' : ['binary'],
           'reg_alpha': [0, 0.01, 0.1, 1, 2, 5, 7, 10, 50, 100],
           'reg_lambda': [0, 0.01, 0.1, 1, 5, 10, 20, 50, 100]
         }
gs = GridSearchCV(lgbm, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

----------------------------------------------------------------------------------------------
# Training xgb Classifier
xgb = XGBClassifier()
params = {'n_estimators': [50, 100, 500, 1000],
          'max_depth': [7, 9, 15, 19],    # typical values 3-10, higher values cause overfitting
          'learning_rate': [0.05, 0.01, 0.5, 0.1],
          'gamma': [0, 1, 5],
          'min_child_weight' : [1, 5, 10],   # to avoid overfitting, too high values cause underfitting
          'subsample' : [0.5, 0.8, 1],    #lower values make it conservative, too low values lead to underfitting
          'colsample_bytree' : [0.5, 0.8, 1]    #same as max_features in GBT
         }
gs = GridSearchCV(xgb, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

------------------------------------------------------------------------------------------------
# CatBoost classifier

from catboost import CatBoostClassifier

cbt = CatBoostClassifier()
params = {'depth' : [7, 11, 15, 19],
          'learning_rate' : [0.005, 0.001, 0.01, 0.05, 0.1],   # how fast can reach the global/local optima
          'iterations'    : [30, 50, 100],    # many iterations would mean using more trees to fit --- risk of overfitting
          'l2_leaf_reg': [1, 3, 5, 7, 9],   # L2 regularization term in the cost function
          'border_count':[5, 10, 20, 30, 50, 100, 200, 254],   # 254 is usually used for best possible results on CPU or GPU; this parameter impacts speed
          'bagging_temperature' : [0, 1]    # 0: all sample weights equal to 1; 1: weights drawn from an exponential distribution
         }
gs = GridSearchCV(cbt, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

-------------------------------------------------------------------------------------------------
# knn classifier

knn = KNeighborsClassifier()

k_range = list(range(1,31))
weight_options = ["uniform", "distance"]
params = dict(n_neighbors = k_range, weights = weight_options)

gs = GridSearchCV(knn, params, cv=5, scoring='recall', verbose = 1, n_jobs = -1)
gs.fit(X_features, Y)
print(gs.best_params_)
print(gs.best_score_)

'''
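A quick count shows why exhaustive searches like the ones above are expensive. The sketch below copies the LightGBM grid from the commented-out cell and multiplies the number of options per parameter; only the variable names are new.

```python
from math import prod

# same value lists as the LightGBM grid in the cell above
lgbm_grid = {
    "learning_rate": [0.005, 0.001, 0.01, 0.05, 0.1],
    "n_estimators": [5, 10, 50, 100, 500, 1000],
    "max_depth": [7, 11, 15, 19],
    "num_leaves": [30, 60, 90],
    "bagging_fraction": [0.5],
    "bagging_freq": [5],
    "feature_fraction": [0.5],
    "boosting_type": ["gbdt"],
    "objective": ["binary"],
    "reg_alpha": [0, 0.01, 0.1, 1, 2, 5, 7, 10, 50, 100],
    "reg_lambda": [0, 0.01, 0.1, 1, 5, 10, 20, 50, 100],
}

n_candidates = prod(len(values) for values in lgbm_grid.values())
n_fits = n_candidates * 5   # cv=5 means five fits per candidate

print(n_candidates)   # 32400 parameter combinations
print(n_fits)         # 162000 model fits
```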

## Compare models

In [None]:
# add our tuned models into list
models = []

lr_model = LogisticRegression(C=0.01, penalty='l1', solver='liblinear', random_state = 10)

rf_model = RandomForestClassifier(criterion='entropy', max_depth=19, n_estimators=100, random_state = 10)

gb_model = GradientBoostingClassifier(max_depth=7, n_estimators=500, random_state = 10)

dt_model = DecisionTreeClassifier(max_depth=15, max_features=15, random_state = 10)

etc_model = ExtraTreesClassifier(n_estimators= 500, max_features=15, random_state = 10)

lgbm_model = LGBMClassifier(learning_rate= 0.03, n_estimators= 500, max_depth= 7, num_leaves= 50, bagging_fraction= 0.5, 
                            bagging_freq= 2, feature_fraction= 0.7, boosting_type= 'gbdt', 
                            objective= 'binary', reg_alpha= 0, reg_lambda= 0, random_state = 10)

xgb_model = XGBClassifier(colsample_bytree=0.5, gamma=0, learning_rate=0.01, 
                          max_depth=15, min_child_weight=1, n_estimators=1000, subsample=1, random_state = 10)

cbt_model = CatBoostClassifier(depth=16, learning_rate=0.01, iterations=15, border_count=254, 
                               bagging_temperature=0, silent=True, random_state = 10)

knn_model = KNeighborsClassifier(n_neighbors=5, weights='uniform')

models.append(('Logistic Regression', lr_model))
models.append(('Random Forest', rf_model))
models.append(('Gradient Boosting Trees', gb_model))
models.append(('Decision Trees', dt_model))
models.append(('Extra Trees', etc_model))
models.append(('LightGBM', lgbm_model))
models.append(('XGBoost', xgb_model))
models.append(('CatBoost', cbt_model))
models.append(('K Nearest Neighbours Classifier', knn_model))

results = []
names = []

# evaluate each model in turn
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle = True, random_state = 10)
    cv_results = model_selection.cross_val_score(model, X_features, Y, cv = kfold, scoring = 'recall')
    results.append(cv_results)
    names.append(name)
    # print mean recall and standard deviation across the folds
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
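The scores above are recall values, not accuracies. As a reminder of what is being measured, recall is the share of actual positives that the model finds. A small hand-rolled version on made-up labels:

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# made-up labels: four actual positives, three of them predicted as positive
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 1]

print(recall(y_true_demo, y_pred_demo))   # 0.75
```

Note that the false positive in the last position does not lower recall; it would show up in precision instead.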

In [None]:
fig = plt.figure(figsize=(12, 10))
plt.boxplot(results)
plt.title('Algorithm Comparison')
plt.xticks([1,2,3,4,5,6,7,8,9], names, rotation=45, horizontalalignment='right', fontweight='light')
plt.show()

From the above we can see that XGBoost, LightGBM and Gradient Boosting Trees perform the best among all the classifiers. Another criterion for choosing the best model is execution time, since that is also a key factor in this challenge.

In [None]:
# https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/

start = datetime.now() 
xgm = xgb_model.fit(X_features, Y) 
stop = datetime.now()
execution_time_xgb = stop-start 
execution_time_xgb

In [None]:
start = datetime.now() 
lgbm_fitted = lgbm_model.fit(X_features, Y)   # separate name, so the lightgbm module alias `lgbm` stays usable later
stop = datetime.now()
execution_time_lgbm = stop-start 
execution_time_lgbm

In [None]:
start = datetime.now() 
gb = gb_model.fit(X_features, Y) 
stop = datetime.now()
execution_time_gb = stop-start 
execution_time_gb

In [None]:
# Creating a dataframe 'comparison_df' for comparing the execution times of LightGBM, XGBoost and Gradient Boosting Trees

comparison_dict = {'execution time':(execution_time_lgbm, execution_time_xgb, execution_time_gb)}
comparison_df = pd.DataFrame(comparison_dict) 
comparison_df.index= ['LightGBM','xgboost','GradientBoosting Trees'] 
comparison_df

Since LightGBM has the shortest execution time, we shall proceed with this model.
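The three timing cells repeat the same start/stop pattern. It could be wrapped in a small helper like the one below; `time_call` is a hypothetical name, and `time.perf_counter` is used here in place of `datetime.now()` because it is intended for measuring short durations.

```python
import time

def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# stand-in workload; with the models above this could be
# time_call(lgbm_model.fit, X_features, Y)
result, seconds = time_call(sum, range(1_000_000))
print(result, seconds)
```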

## Learning Curves

Before fitting our LightGBM model on our datasets, we need to check its performance, that is, whether it is overfitting or underfitting the data.

We shall do so by plotting the classification error of lgbm_model on the training and validation sets. This was done by referring to https://www.kaggle.com/tobikaggle/humble-lightgbm-starter-with-learning-curve.
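The idea behind such a learning curve can be sketched without LightGBM: record the error on the training and validation sets after every boosting iteration and look at where the two curves separate. The example below swaps in scikit-learn's GradientBoostingClassifier and its `staged_predict` method on synthetic data, so it only illustrates the technique and is not the code used for lgbm_model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data with 15 features
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=10)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=10)
gbc.fit(X_tr, y_tr)

# classification error after each boosting iteration
train_err = [np.mean(pred != y_tr) for pred in gbc.staged_predict(X_tr)]
valid_err = [np.mean(pred != y_va) for pred in gbc.staged_predict(X_va)]

# iteration with the lowest validation error (1-based)
best_iter = int(np.argmin(valid_err)) + 1
print(best_iter, train_err[-1], valid_err[-1])
```

A training error that keeps falling while the validation error rises would point to overfitting; two curves that both stay high would point to underfitting.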

In [None]:
params = {'learning_rate':0.03, 'n_estimators':500, 'max_depth':7, 'num_leaves':50, 'bagging_fraction':0.5,
          'bagging_freq':2, 'feature_fraction':0.7, 'boosting_type':'gbdt',
          'objective':'binary', 'reg_alpha':0, 'reg_lambda':0, 'random_state':10, 
          'metric':{'binary_error'}, 'verbose':0
         }   # the best possible found via gridsearchcv

from sklearn.model_selection import train_test_split   # needed for the hold-out split below

X_train, X_valid, y_train, y_valid = train_test_split(X_features, Y, test_size=0.2)

# create dataset for lightgbm
lgbm_train = lgbm.Dataset(X_train, y_train)
lgbm_valid = lgbm.Dataset(X_valid, y_valid, reference=lgbm_train)

evals_result={}

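# note: the evals_result, early_stopping_rounds and verbose_eval arguments require LightGBM < 4.0;
# later releases expect callbacks instead
model = lgbm.train(params, lgbm_train, valid_sets=[lgbm_train, lgbm_valid], 
                    evals_result= evals_result, early_stopping_rounds=100, verbose_eval=0)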

print('Start predicting...')
# predict
y_pred = model.predict(X_test)


print('Plot metrics during training...')
ax = lgbm.plot_metric(evals_result, metric='binary_error', figsize=(10,8))
plt.show()

In [None]:
print('Plot feature importances...')

ax = lgbm.plot_importance(model, max_num_features=15, figsize=(10,8))
plt.show()

Looks like our model is doing fine. We can also see that all the features contribute to the predictions. Hence, this model will be submitted.

On submitting this model, a public score of 99.40696 was obtained. Let's see what happens when we combine three of our best models using a VotingClassifier.

In [None]:
# Combining classifiers with hard (majority) voting

clf1 = xgb_model
clf2 = lgbm_model
clf3 = rf_model
final_clf = VotingClassifier(estimators=[('xgb', clf1), ('lgbm', clf2), ('rf', clf3)], voting='hard')
start = datetime.now() 
final_clf = final_clf.fit(X_features, Y)
stop = datetime.now()
execution_time = stop-start 
print('Execution_time: {}'.format(execution_time))
print('Voting classifier fitted successfully!!!')

The public score increased to 99.55110 (using rf_model as the third estimator). This is slightly better, so this voting ensemble will be our final model for submission.
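For reference, hard voting simply takes the most common label among the individual models for each row. A minimal sketch with made-up predictions from three hypothetical models:

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: list of equal-length label lists, one list per model.
    Returns the majority label for each row."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# made-up predictions of three models for four rows
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

voted = hard_vote([model_a, model_b, model_c])
print(voted)   # [1, 0, 1, 1]
```

With three models and two classes there is always a strict majority, so ties cannot occur.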

## Submission

In [None]:
final_predictions = final_clf.predict(X_test)
# print(final_predictions.shape)
# print(index.shape)

results = pd.DataFrame({'INCIDENT_ID': index['INCIDENT_ID'], 'MULTIPLE_OFFENSE': final_predictions})
results.to_csv(r'D:\jobs\A! Hackathon\Novartis Challenge\Dataset\mysubmission.csv', index=False)
print("Your submission was successfully saved!")
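Before uploading, a few quick consistency checks on the submission frame can catch simple mistakes. The helper below is a sketch: `check_submission` and the toy IDs are hypothetical, and the 0/1 label assumption is taken from the label counts printed earlier.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids) -> list:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if list(sub.columns) != ["INCIDENT_ID", "MULTIPLE_OFFENSE"]:
        problems.append("unexpected columns")
        return problems
    if len(sub) != len(expected_ids):
        problems.append("row count differs from the test set")
    if sub["INCIDENT_ID"].duplicated().any():
        problems.append("duplicate INCIDENT_ID values")
    if not sub["MULTIPLE_OFFENSE"].isin([0, 1]).all():
        problems.append("labels other than 0/1")
    return problems

# hypothetical IDs, for illustration only
demo_ids = ["CR_1", "CR_2", "CR_3"]
good = pd.DataFrame({"INCIDENT_ID": demo_ids, "MULTIPLE_OFFENSE": [1, 0, 1]})
bad = pd.DataFrame({"INCIDENT_ID": ["CR_1", "CR_1"], "MULTIPLE_OFFENSE": [1, 2]})

print(check_submission(good, demo_ids))
print(check_submission(bad, demo_ids))
```

In this notebook the call would be `check_submission(results, index['INCIDENT_ID'])`.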