# Predicting Whether a Server Will Be Hacked

This kernel (notebook) showcases my work for the Novartis hackathon hosted by HackerEarth, on the topic of imbalanced datasets. If you like it, please upvote and leave your comments on this kernel; they will help me improve further.

Although the competition ended three months ago, it took me a while to post the code here.

**Better late than never :):)**

## Problem Statement

Countries across the globe have adopted digital payments. With the increased volume of digital payments, hacking has become a common event:
a hacker can try to obtain your details using just the phone number linked to your bank account.
However, there is data with some anonymized variables from which one can predict that a hack is going to happen.

The problem is to build a __predictive model that can identify a pattern in these variables and flag that a hack is going to happen, so that the cyber security team can stop it before it happens__.

![](https://raw.githubusercontent.com/VijayMukkala/Mini-Projects/master/Predict_incident-is-hack/hacker-1944688_640.jpg)

- [Import Packages](#section1)<br>
- [Exploring the data ](#section2)<br>
- [Visualizing the data](#section3)<br>
- [Preprocessing](#section4)<br>
- [Model Selection](#section5)<br>
- [Test for unseen data & output file ](#section6)<br>

<a id=section1></a>
## Import Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Model libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
import xgboost as xgb
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier


#Other Libraries
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split,RandomizedSearchCV
from sklearn.metrics import recall_score,precision_score,confusion_matrix
import scipy.stats as stats

<a id=section2></a>
## Exploring the data 

In [None]:
data = pd.read_csv('/kaggle/input/novartis-data/Train.csv')
data_test = pd.read_csv('/kaggle/input/novartis-data/Test.csv')


print('Shape of the data:',data.shape)
data.head()

### Data:
- __X_1 - X_15__ : Anonymized logging parameters
- __DATE__ : Date of incident occurrence
- __INCIDENT_ID__ : ID of the incident
- __MULTIPLE_OFFENSE__ : Indicates whether the incident was a hack

In [None]:
data.info() #getting more information on the dtype

In [None]:
data.describe()  # To check on the statistics

In [None]:
data.isnull().sum() #checking for missing values in data

In [None]:
data.columns #columns of the data

## Visualizing the target variable

In [None]:
print(data['MULTIPLE_OFFENSE'].value_counts())
plt.figure(figsize=(5,3))
sns.countplot(x=data['MULTIPLE_OFFENSE'])  # pass the column as x; newer seaborn versions no longer accept it positionally
plt.show()

We can see that the data is imbalanced: most of the incidents are labelled as hacks (suspicious).

### What Is Data Imbalance?

Data imbalance usually reflects an unequal distribution of classes within a dataset.

As the figure above shows, most of the incidents are suspicious. If we don't address this problem, the model will be biased towards the majority class.

Dealing with imbalanced datasets:
There are many ways of dealing with imbalanced datasets, including:
- __Undersampling :__
Undersampling randomly deletes some of the observations from the majority class so that their number matches the minority class.
- __Oversampling :__
Oversampling adds observations to the minority class, either by repeating existing ones or by generating synthetic samples from the attributes of minority-class observations.
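
The two ideas can be sketched on toy labels as follows. This is only an illustration under assumptions: the 90/10 label split is made up, and the oversampling shown is plain random duplication of minority rows rather than synthetic generation such as SMOTE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels (hypothetical): 90 majority-class rows (1) and 10 minority-class rows (0).
y = np.array([1] * 90 + [0] * 10)
majority_idx = np.flatnonzero(y == 1)
minority_idx = np.flatnonzero(y == 0)

# Undersampling: keep a random subset of the majority class (no replacement).
under_idx = np.concatenate([
    rng.choice(majority_idx, size=len(minority_idx), replace=False),
    minority_idx,
])

# Oversampling: repeat minority rows (sampling with replacement) until balanced.
over_idx = np.concatenate([
    majority_idx,
    rng.choice(minority_idx, size=len(majority_idx), replace=True),
])

print(np.bincount(y[under_idx]))  # [10 10]
print(np.bincount(y[over_idx]))   # [90 90]
```

Undersampling throws away 80 majority rows here, while oversampling keeps everything but repeats the same 10 minority rows many times; that trade-off is the usual reason to consider alternatives.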

The best Kaggle kernel I have found on these topics:
https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

However, in this kernel I did not use undersampling or oversampling techniques.
Instead, I applied machine learning algorithms such as XGBoost, ensemble methods and the Balanced Bagging classifier, with hyperparameters that deal with imbalanced data, which helped me achieve a score of __99.5 %__ on the test data. We will look at this in more detail during model building.

### Splitting the data into test & train before we do any preprocessing or implementing sampling techniques

In [None]:
X = data.drop('MULTIPLE_OFFENSE', axis=1)
y = data['MULTIPLE_OFFENSE']

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size =0.25, random_state = 42, stratify=y)

In [None]:
print('y_train:\n',y_train.value_counts(normalize = True))
print('y_test:\n',y_test.value_counts(normalize = True))

We can see that the distribution of the classes __0 & 1__ is identical in the train & test data.

<a id=section3></a>
## Visualizing the data
### Checking the distribution of continuous variables

In [None]:
#function to create histogram, Q-Q Plot and boxplot

def diagnostic_plots(df,variable):
    
    #define figure size
    plt.figure(figsize =(16,4))
    
    #histogram
    plt.subplot(1,3,1)
    sns.distplot(df[variable], bins = 30, kde = False)
    plt.title('Histogram')
    
    #Q-Q plot
    plt.subplot(1,3,2)
    stats.probplot(df[variable], dist = "norm", plot = plt)
    plt.ylabel(f'{variable} quantiles')  # label the axis with the plotted variable
    
    # box plot
    plt.subplot(1,3,3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')
    
    plt.show()

In [None]:
diagnostic_plots(data,'X_2')

In [None]:
diagnostic_plots(data,'X_3')

In [None]:
diagnostic_plots(data,'X_6')

In [None]:
diagnostic_plots(data,'X_7')

In [None]:
diagnostic_plots(data,'X_8')

In [None]:
diagnostic_plots(data,'X_10')

In [None]:
diagnostic_plots(data,'X_11')

In [None]:
diagnostic_plots(data.dropna(),'X_12')

In [None]:
diagnostic_plots(data,'X_13')

In [None]:
diagnostic_plots(data,'X_14')

In [None]:
diagnostic_plots(data,'X_15')

From the histograms, probability plots and boxplots above for the continuous variables, we can see that there are many outliers in the features __X_6, X_7, X_8, X_10, X_11, X_12, X_13, X_15__.

One way of dealing with outliers is to remove them from the data, but this causes information loss, so in this kernel we will use a technique called __capping (censoring)__.

__Capping or censoring__ means capping the maximum and/or minimum of a distribution at an arbitrary value. In other words, values bigger or smaller than the arbitrarily determined ones are __censored__.
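
As a rough illustration of capping, the sketch below clips values at mean ± 3 standard deviations. The function name `cap_gaussian` and the toy array are hypothetical, and this hand-rolled version is not the library implementation; it only shows the idea.

```python
import numpy as np

def cap_gaussian(values, fold=3.0, tail="both"):
    """Cap values at mean +/- fold * std, with limits learned from `values`."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    # Only the requested tail(s) get a finite limit.
    lower = mean - fold * std if tail in ("left", "both") else -np.inf
    upper = mean + fold * std if tail in ("right", "both") else np.inf
    return np.clip(values, lower, upper)

x = np.array([1.0] * 20 + [100.0])          # one extreme value on the right
capped = cap_gaussian(x, fold=3, tail="right")
print(x.max(), capped.max())                # the extreme value is pulled down to the limit
```

No rows are dropped: the array keeps its length, and only the value beyond the limit changes.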

This can be done with simple code on each feature, but in this notebook we will use the __Winsorizer__ from the __feature engine__ library, which handles outliers by capping.

### Distribution of categorical variables 

In [None]:
f, axes = plt.subplots(ncols=4, figsize=(20,4))
sns.countplot(x=data['X_1'],ax=axes[0])
sns.countplot(x=data['X_4'],ax=axes[1])
sns.countplot(x=data['X_5'],ax=axes[2])
sns.countplot(x=data['X_9'],ax=axes[3])
plt.show()

The observation here is that for variables X_1 and X_9 only three categories carry most of the information, and the remaining categories occur very rarely. So instead of plain __one-hot encoding__ I will use __one-hot encoding of frequent categories__.

In __one-hot encoding of frequent categories__, we create dummy variables only for the most frequent categories. It is equivalent to grouping all the remaining categories under a new category. We can choose the top 10 most frequent categories, or more or fewer, depending on the use case and the data provided to us.
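
A minimal pandas sketch of the idea, assuming a made-up column `X_demo` and a hypothetical helper `one_hot_top_categories` (neither is part of the notebook or of any library):

```python
import pandas as pd

def one_hot_top_categories(series, top_n):
    """Create one 0/1 column for each of the top_n most frequent categories."""
    top = series.value_counts().index[:top_n]
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top},
        index=series.index,
    )

s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="X_demo")
encoded = one_hot_top_categories(s, top_n=2)
print(encoded)
```

Rows holding the rare categories `c` and `d` get zeros in every new column, which is what groups them together implicitly.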

In this kernel, we select the top frequent values using the __OneHotCategoricalEncoder API__, which can create binary variables for the n most popular categories.

In [None]:
### We need to install the libraries required for the preprocesing steps
!pip install -U imbalanced-learn
!pip install feature_engine

<a id=section4></a>

## Preprocessing

- We will use pipelines to drop columns, standardize the numerical data and apply frequent-category one-hot encoding to the categorical data


### Why Pipelines?

In a typical machine learning workflow you will need to apply all transformations at least twice. Once when training the model and again on any new data when we want to predict. Using Scikit-learn pipelines as a tool will simplify this process.

They have several key benefits:
- They make your workflow much easier to read and understand.
- They enforce the implementation and order of steps in your project.
- These in turn make your work much more reproducible.
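
A minimal sketch of the fit-once, reuse-on-new-data idea. The toy arrays and the `LogisticRegression` step are assumptions chosen for brevity, not the models used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature, class 1 has the larger values.
X_fit = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_fit = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[1.5], [11.5]])

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),    # learns mean and std during fit
    ("clf", LogisticRegression()),   # trained on the scaled values
])

pipe.fit(X_fit, y_fit)               # every step is fitted once, on training data
preds = pipe.predict(X_new)          # the same fitted scaling is reused on new data
print(preds)
```

Because the scaler lives inside the pipeline, there is no separate transform call to forget when new data arrives.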

In [None]:
# Pipelines
from sklearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as pl1

#preprocessing methods
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler

#preprocessing methods using feature engine
from feature_engine import categorical_encoders as ce
from feature_engine.outlier_removers import Winsorizer

### Preprocessing steps that need to be done before fitting the model

- Impute missing values for feature 'X_12'
- Outlier treatment using the capping technique
- Create a pipeline with standardization (standard scaling) for continuous data & one-hot encoding for categorical data
- Drop the columns 'INCIDENT_ID' & 'DATE', as they carry little information
- Fit on X_train and transform both X_train & X_test

The data will be ready for modelling after the above steps.

In [None]:
# Segregating the data into numerical, categorical and features with outliers
numerical_features = ['X_2', 'X_3', 'X_6','X_7', 'X_8', 'X_10', 'X_11','X_12', 'X_13', 'X_14','X_15']
categorical_features = ['X_1', 'X_4', 'X_5','X_9']
outliers_data = ['X_6', 'X_7','X_8','X_10','X_11','X_12','X_13','X_15']

### Missing data imputation
Because the X_12 feature is skewed, its missing values can be replaced with the median.
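
To see why the median suits skewed data, here is a small sketch on made-up values (the series below are hypothetical). It also reuses the training median for new data, which is one common convention rather than a requirement.

```python
import pandas as pd

train_col = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 50.0, None])
new_col = pd.Series([2.0, None, 4.0])

# The single extreme value pulls the mean up, while the median stays at 2.0.
print(train_col.mean(), train_col.median())

fill_value = train_col.median()            # statistic learned from training data only
train_filled = train_col.fillna(fill_value)
new_filled = new_col.fillna(fill_value)    # same value reused for new data
print(new_filled.tolist())                 # [2.0, 2.0, 4.0]
```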

In [None]:
X_train['X_12'] = X_train['X_12'].fillna(X_train['X_12'].median())
X_test['X_12'] = X_test['X_12'].fillna(X_train['X_12'].median())  # reuse the train median so no test statistics are used

### Outlier Treatment
Each variable is capped on a different tail (right, left or both), depending on where its outliers lie.
The code below builds a pipeline that applies these cappings.

In [None]:
categorical_features = ['X_1', 'X_4', 'X_5','X_9']
X_train[categorical_features] = X_train[categorical_features].astype('object')
X_test[categorical_features] = X_test[categorical_features].astype('object')

In [None]:
outlier_treat =Pipeline(steps = [
              ('outlier1', Winsorizer(distribution = 'gaussian', tail = 'right',fold = 3, variables = ['X_6', 'X_7','X_8','X_10','X_12'])),
              ('outlier2', Winsorizer(distribution = 'gaussian', tail = 'left',fold = 3, variables = ['X_11', 'X_13'])),
              ('outlier3', Winsorizer(distribution = 'gaussian', tail = 'both',fold = 3, variables = ['X_15']))
                                      ])

In [None]:
outlier_treat.fit(X_train)

In [None]:
X_train = outlier_treat.transform(X_train)

**NOTE**: The outlier treatment is applied only to the train data, not to the test data.

### Creating pipelines with standardization of continuous values & one-hot encoding of frequent categories using Feature engine

In [None]:
# Converting the categorical variables to 'object' for doing the one hot encoding operation
X_train[categorical_features] = X_train[categorical_features].astype('object')
X_test[categorical_features] = X_test[categorical_features].astype('object')

In [None]:
numeric_transformer = Pipeline(steps = [
              ('scaler', StandardScaler())
                     ])
categorical_transformer = Pipeline(steps=[
    ('onehot3',ce.OneHotCategoricalEncoder(top_categories = 3, variables = ['X_9','X_1'] )),
    ('onehot4',ce.OneHotCategoricalEncoder(top_categories = 4, variables = ['X_5'] )),
    ('onehot10',ce.OneHotCategoricalEncoder(top_categories = 9, variables = ['X_4'] ))
])

### Creating a column transformer

In [None]:
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
                 ('drop_columns', 'drop', ['INCIDENT_ID','DATE']), #dropping the columns 
                 ('num', numeric_transformer, numerical_features),
                 ('cat', categorical_transformer,categorical_features)
    ])

### Fitting the data to the created column transformer

In [None]:
preprocessor.fit(X_train)

### Transforming the data on the Train & test

In [None]:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

<a id=section5></a>
## Model Selection

We can try fitting the imbalanced data with the models below:

- XGBoost
- Balanced Bagging classifier
- Support vector machine
- Random Forest
- AdaBoost

**Evaluation Metric : Recall score : TP/(TP+FN)**
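
The formula can be checked by hand on a few labels. The lists below are hypothetical and only serve to show how TP and FN are counted.

```python
def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    return tp / (tp + fn)

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # hacks correctly flagged
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # hacks that were missed
print(recall_from_counts(tp, fn))  # 0.75
```

The false positive in the fifth position does not affect recall. Note that scikit-learn's `recall_score` expects the true labels first and the predictions second; swapping them returns precision instead.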

### 1. XGB Classifier with hyperparameter (scale_pos_weight)

The scale_pos_weight hyperparameter controls the balance of positive and negative weights, which is useful for unbalanced classes.
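
The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value for `scale_pos_weight`. A small sketch on made-up labels (the 90/10 split is an assumption, not this dataset's distribution):

```python
import numpy as np

# Hypothetical labels: 90 negatives (0) and 10 positives (1).
y_demo = np.array([0] * 90 + [1] * 10)

n_neg = int((y_demo == 0).sum())
n_pos = int((y_demo == 1).sum())
suggested = n_neg / n_pos   # weight applied to the positive class
print(suggested)            # 9.0
```

When the positive class is the majority, this ratio falls below 1, so it is worth treating the value as one more hyperparameter to search over rather than a fixed rule.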

In [None]:
# from sklearn.model_selection import RandomizedSearchCV
# rs = RandomizedSearchCV(xgb_model, {
#         'scale_pos_weight': [1,2],
#         'learning_rate'   : [0.05,0.10,0.15,0.20,0.25,0.30],
#         'min_child_weight': [1,3,5,7],
#         'gamma'           : [0.0,0.1,0.2,0.3,0.4],
#         'colsample_bytree': [0.3,0.4,0.5,0.7]
#     }, 
#     cv=5, 
#     scoring = 'f1',
#     return_train_score=False, 
#     n_iter=50
# )
# rs.fit(X_train, y_train)
# pd.DataFrame(rs.cv_results_)[['param_scale_pos_weight','param_learning_rate','param_min_child_weight','param_gamma','param_colsample_bytree','mean_test_score']]

Best parameters from the RandomizedSearchCV:
{'scale_pos_weight': 1,
 'min_child_weight': 1,
 'learning_rate': 0.3,
 'gamma': 0.3,
 'colsample_bytree': 0.3}

Applying the parameters in the XGB model (with the learning rate set slightly higher, at 0.35):

In [None]:
xgb_model = xgb.XGBClassifier(scale_pos_weight= 1,min_child_weight=1,learning_rate= 0.35,gamma= 0.3,colsample_bytree= 0.3 )
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

score = recall_score(y_test,y_pred_xgb)
print('Recall score :',score)
confusion_matrix(y_test,y_pred_xgb, labels = [1,0])

### 2. Balanced Bagging Classifier
BalancedBaggingClassifier applies random undersampling to the majority class within each bootstrap sample in order to balance the two classes.
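
The sampling step can be sketched in NumPy as below. This is a simplified illustration under assumptions: the helper `balanced_bootstrap_indices` is hypothetical and does not reproduce imbalanced-learn's actual implementation.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw a bootstrap sample, then undersample each class to the smallest class size."""
    y = np.asarray(y)
    boot = rng.integers(0, len(y), size=len(y))           # rows drawn with replacement
    classes, counts = np.unique(y[boot], return_counts=True)
    n_keep = counts.min()
    keep = [
        rng.choice(boot[y[boot] == c], size=n_keep, replace=False)
        for c in classes
    ]
    return np.concatenate(keep)

rng = np.random.default_rng(0)
y_demo = np.array([1] * 90 + [0] * 10)                   # hypothetical imbalanced labels
idx = balanced_bootstrap_indices(y_demo, rng)
print(np.bincount(y_demo[idx]))                          # equal counts per class
```

Each base estimator in the ensemble would be trained on a different balanced sample of this kind, so the majority rows discarded by one estimator can still be seen by another.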

In [None]:
#Create an object of the classifier.
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),sampling_strategy='auto',
replacement=False,random_state=0)


In [None]:
bbc.fit(X_train, y_train)
y_pred_bbc = bbc.predict(X_test)

score = recall_score(y_test,y_pred_bbc)
precision = precision_score(y_test,y_pred_bbc)  # precision_score is already imported above
print('recall score :',score)
print('precision score :',precision)
confusion_matrix(y_test,y_pred_bbc, labels = [1,0])


### 3. SVM with Weighted class

Class-weighted SVM is designed to deal with unbalanced data by assigning higher misclassification penalties to training instances of the minority class.
The parameter used: __class_weight = 'balanced'__
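
With `class_weight='balanced'`, scikit-learn computes each class weight as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula on made-up labels (the 90/10 split is an assumption):

```python
import numpy as np

y_demo = np.array([1] * 90 + [0] * 10)   # hypothetical imbalanced labels

classes, counts = np.unique(y_demo, return_counts=True)
weights = len(y_demo) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.tolist())))
```

Here the minority class 0 gets weight 5.0 and the majority class 1 about 0.56, so an error on a minority instance costs roughly nine times as much.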

In [None]:
# from sklearn.model_selection import RandomizedSearchCV
# fold = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
# svm = SVC()

# rs = RandomizedSearchCV(SVC(class_weight = 'balanced'), {
#         'C': [0.1, 1, 10],
#         'kernel': ['linear', 'poly', 'rbf'],
#         'tol' :[0.1,0.001,0.001]
#     }, 
#     cv=fold, 
#     scoring="recall", 
#     n_iter=5
# )
# rs.fit(X_train, y_train)
# pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','param_tol','mean_test_score']]

After the randomized search, the best parameters are:
__C = 0.1 , kernel = 'poly', tol = 0.001__

In [None]:
model = SVC(class_weight = 'balanced', C = 0.1 , kernel = 'poly', tol = 0.001)
model.fit(X_train, y_train)
y_pred_svm = model.predict(X_test)

score = recall_score(y_test,y_pred_svm)  # true labels first, predictions second
print('recall score',score)
confusion_matrix(y_test,y_pred_svm, labels = [1,0])


### 4. Random Forest With Bootstrap Class Weighting

Given that each decision tree is constructed from a bootstrap sample (e.g. random selection with replacement), the class distribution in the data sample will be different for each tree.

As such, it might be interesting to change the class weighting based on the class distribution in each bootstrap sample, instead of the entire training dataset.

This can be achieved by setting the class_weight argument to the value 'balanced_subsample'.
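
To illustrate why per-sample weighting differs from whole-dataset weighting, the sketch below recomputes balanced weights on a few bootstrap samples. The labels and the helper `balanced_weights` are hypothetical; this is not how the forest is implemented internally, only the idea behind it.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.array([1] * 90 + [0] * 10)   # hypothetical imbalanced labels

def balanced_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    values = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), values.tolist()))

print("whole data:", balanced_weights(y_demo))
for tree in range(3):
    boot = rng.integers(0, len(y_demo), size=len(y_demo))   # one bootstrap sample per tree
    print("bootstrap", tree, balanced_weights(y_demo[boot]))
```

The weights shift from one bootstrap sample to the next because each sample contains a slightly different number of minority rows.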

In [None]:
model_rf = RandomForestClassifier(class_weight='balanced_subsample')
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)

score = recall_score(y_test,y_pred_rf)  # true labels first, predictions second
print('recall score :',score)
confusion_matrix(y_test,y_pred_rf, labels = [1,0])


### 5. AdaBoost

In [None]:
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)

score = recall_score(y_test,y_pred_ada)
print('Recall score :', score)
confusion_matrix(y_test,y_pred_ada, labels = [1,0])

### Conclusion :
Considering the recall scores, the XGB classifier has the best recall after hyperparameter tuning, so we will use this model for our prediction.

<a id=section6></a>
## Test for unseen data & output file 

#### Now applying the model to the hackathon test data

In [None]:
data_test.head()

In [None]:
data_test.isnull().sum()

#### Transforming the test data with the same operations applied to the train data

In [None]:
# Replacing the missing values with the median values
data_test['X_12'] = data_test['X_12'].fillna(data['X_12'].median())  # median taken from the training file

In [None]:
#Converting the datatype to 'object' for all the categorical features for transformations
data_test[categorical_features] = data_test[categorical_features].astype('object')

In [None]:
# Performing scaling , one hot encoding on the data using sklearn pipelines
data_test1 = preprocessor.transform(data_test)

#### Transforming the full train data. Previously we split the train data into train & test for internal evaluation and algorithm selection

In [None]:
#Missing value imputation
X['X_12'] = X['X_12'].fillna(X['X_12'].median())

In [None]:
# Outlier treatment
X = outlier_treat.transform(X)

In [None]:
# Preprocessing 
X = preprocessor.transform(X)

Now our data is ready for the final prediction, and the model we will use is the XGBoost classifier.

### Testing the unseen data for XGB Classifier

In [None]:
xgb_model.fit(X, y)
prediction_xgb = xgb_model.predict(data_test1)

In [None]:
output_xgb=pd.DataFrame({"INCIDENT_ID":data_test["INCIDENT_ID"],"MULTIPLE_OFFENSE":prediction_xgb}) 
output_xgb.head()

In [None]:
print(output_xgb['MULTIPLE_OFFENSE'].value_counts())
sns.countplot(x=output_xgb['MULTIPLE_OFFENSE'])

Using the above approaches, I was able to achieve 99.5% on the unseen test data using the XGB model with hyperparameter tuning.

### I hope this notebook was useful

## Thank you, kindly Upvote and Happy learning :)