> Hey Kagglers,

> Everyone has done their fair share of digging to put out the best predictions for rainfall in Australia, and so have I. You won't see much data exploration or visualization in this notebook; I have kept it to a minimum (the kernel still got big, sorry) to make room for other aspects: kinds of imputation, types of class imbalance and ways to deal with them, and a class that runs a grid search and passes the best parameters to the algorithms automatically, so all you have to do is instantiate it and pass the parameters. And at last, a StackingClassifier.

In [None]:

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, precision_recall_curve
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV

import warnings
warnings.filterwarnings(action='ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.filterwarnings(action='ignore', category=FutureWarning)


In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
        
df = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
df.head(5)

> We have df.info() to get details such as the count and the data types, but I thought of writing my own reusable dfInfo function that gives me all the details I need to move ahead.

In [None]:
def dfInfo(df):
    feature_dict = {}
    
    #list of all features
    features = df.columns.tolist()

    #list of datatypes of all features
    datatype = [df[col].dtype for col in df.columns] 

    #Count of each feature
    count = [df[col].count() for col in df.columns]

    #Missing percentage in each feature
    miss_percent = [(round(((len(df) - df[col].count())/len(df) * 100), 2)) for col in df.columns]

    #Marking yes for missing and No for not missing 
    missing = ['Yes' if df[col].isnull().sum() != 0 else 'No' for col in df.columns] 
    
    #Unique count of categorical features
    unique_count =  [len(df[col].unique()) if df[col].dtype == "object" else "NA" for col in df.columns]
    
    #Feature Categorical or numerical
    cat_num = ["Categorical" if df[col].dtype == "object" else "Numerical" for col in df.columns]
    
    feature_dict.update({"Features": features, "Datatype":datatype, "Count":count, 
                        "Missing":missing, "Missing_percent":miss_percent, "CatOrNum":cat_num, "Unique_Count":unique_count})
    
    return pd.DataFrame(data=feature_dict)
    

In [None]:
dfInfo(df)

> Our next step is imputation, but before comparing imputation techniques we have to deal with categorical features, categorical features with NaN values, and outliers. Categorical features with NaN values should be imputed in a way that avoids data leakage. Outliers are treated first because they would otherwise influence the imputation, producing imputed values far from the real ones and invalid estimates. Outlier treatment is applied only to the training set, because test sets in the real world are not in our control. Remember, though, that too much outlier treatment can lead to loss of information: a data point far from the mean does not necessarily mean the data was captured wrongly.

In [None]:
#Splitting "Date" column into Day, Month and Year

df['Date'] = pd.to_datetime(df['Date'])

df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

#Will use a copy of df, excluding date column
data = df.drop('Date', axis=1)

data = data.dropna(axis=0, how='any', subset=["RainTomorrow"])

print(data.shape)

data.head(3)


In [None]:
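#Assign the result back instead of using inplace=True on a column, which recent pandas versions may silently ignore
data["RainToday"] = data["RainToday"].map({'No' : 0, 'Yes' : 1})
data["RainTomorrow"] = data["RainTomorrow"].map({'No' : 0, 'Yes' : 1})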

**Outlier treatment**

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=[20,10])
data.boxplot(column=['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm'])
plt.xticks(rotation=45)
plt.show()

In [None]:
X = data.drop("RainTomorrow", axis=1)
y = data["RainTomorrow"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
def detect_outliers(col):
    IQR = data[col].quantile(0.75) - data[col].quantile(0.25)
    lower_bound = data[col].quantile(0.25) - (IQR * 3)
    upper_bound = data[col].quantile(0.75) + (IQR * 3)
    return col + " outliers are < {lowerbound} or > {upperbound}".format(lowerbound=round(lower_bound,2), upperbound=round(upper_bound,2))

print(detect_outliers("Rainfall"))
print(detect_outliers("Evaporation"))
print(detect_outliers("WindGustSpeed"))
print(detect_outliers("WindSpeed9am"))
print(detect_outliers("WindSpeed3pm"))

In [None]:
X_train["Rainfall"] = np.where(X_train["Rainfall"]>3.2, 3.2, X_train["Rainfall"])
X_train["Evaporation"] = np.where(X_train["Evaporation"]>21.8, 21.8, X_train["Evaporation"])

In [None]:
X_train.describe()
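
> A minimal sketch of the capping idea on made-up numbers: the upper bound is learned from the training column only and then used to clip it. The helper name `iqr_upper_bound` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def iqr_upper_bound(series, k=3.0):
    """Return Q3 + k * IQR for a numeric pandas Series (NaNs are ignored)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical training column with one extreme value
toy_train = pd.Series([0.0, 0.0, 0.2, 0.4, 0.8, 1.0, 45.0], name="Rainfall")

# The bound is learned on the training data only
upper = iqr_upper_bound(toy_train)
# Values above the bound are set to the bound, everything else is untouched
toy_capped = toy_train.clip(upper=upper)

print(round(upper, 2))
print(toy_capped.tolist())
```

> Computing the bound from the training split (rather than the full dataframe) keeps the test set out of the preprocessing decisions.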

**Missing Value Treatment**

> It is important to note that some algorithms, like XGBoost and LightGBM, handle missing values natively without any preprocessing.

> Before we start imputing the missing values, it is important to understand why values are missing in the dataframe. There are three main mechanisms:

> 1- MCAR (Missing completely at random): The probability that a value is missing does not depend on any data, observed or unobserved. These are just random misses.

> 2- MAR (Missing at random): The probability that a value is missing depends on other observed features, but not on the missing value itself. For example, if men are more likely to report their age or weight than women, we will find more missing values for women in that feature.

> 3- MNAR (Missing not at random): The probability that a value is missing depends on the value itself. For example, people with a low income may be less willing to report their income.

> Remember that a dataset may have missing values in many features, and the fact that one feature is MCAR does not mean the others are. They can be MAR or MNAR.
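
> To make the three mechanisms concrete, here is a small simulation sketch. The feature names, probabilities and thresholds are made up for illustration; only the way each mask is built matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical features: humidity is always observed, rainfall may go missing
humidity = rng.uniform(20, 100, size=n)
rainfall = rng.exponential(scale=2.0, size=n)

# MCAR: every rainfall value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2
# MAR: missingness depends on the observed humidity, not on rainfall itself
mar_mask = rng.random(n) < np.where(humidity > 60, 0.4, 0.05)
# MNAR: missingness depends on the (unobserved) rainfall value itself
mnar_mask = rng.random(n) < np.where(rainfall > 3.0, 0.6, 0.05)

def observed_mean(values, missing_mask):
    """Mean of the values that are still observed."""
    return values[~missing_mask].mean()

print("true mean:", rainfall.mean())
print("MCAR     :", observed_mean(rainfall, mcar_mask))
print("MAR      :", observed_mean(rainfall, mar_mask))
print("MNAR     :", observed_mean(rainfall, mnar_mask))
```

> In this toy setup the MNAR mask hides large rainfall values more often, so the mean of what is left is pulled below the true mean, while the MCAR mean stays close to it.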

> With all that said, let's explore the missing data.

In [None]:
#One way to check the missing values in dataframe is using  missingno
import missingno as msno

msno.matrix(X_train) #gives you a data-dense display and help pick patterns

> From the matrix above we can immediately notice a few things:

> 1- Missing values in WindGustDir and WindGustSpeed occur in the same rows, which I read as MNAR.

> 2- Missing values in Pressure9am and Pressure3pm go together as well, hence MNAR.

> 3- Rainfall and RainToday are MNAR as well.

> 4- Evaporation, Sunshine, Cloud9am and Cloud3pm could also be a case of MNAR.

> 5- MinTemp and Temp9am are also missing together, hence MNAR.

> Strictly speaking, gaps that occur together only show that the missingness is linked across columns; telling MAR from MNAR also needs domain knowledge about why a reading is absent.

> You would have got the gist by now. 

In [None]:
#Lets see if we are on point with the claims of MNAR we made above
msno.heatmap(X_train)

> The heatmap backs this up: you can see strong nullity correlations between these columns. You could do extensive feature-wise imputation, but I will just continue with Iterative Imputation.
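
> The kind of relationship the heatmap shows can be approximated by hand as the correlation between missingness indicators. This is a sketch on a made-up frame (the values are hypothetical), not a reimplementation of missingno.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the two wind-gust columns are missing in the same rows
toy = pd.DataFrame({
    "WindGustDir":   ["N", None, "S", "E", None, "W", "N", "S"],
    "WindGustSpeed": [30.0, np.nan, 41.0, 28.0, np.nan, 35.0, 50.0, 22.0],
    "Humidity9am":   [70.0, 65.0, np.nan, 80.0, 55.0, 60.0, np.nan, 75.0],
})

# 1 where a value is missing, 0 where it is present
nullity = toy.isnull().astype(int)
# Pearson correlation between the missingness indicators of each column pair
nullity_corr = nullity.corr()

print(nullity_corr.round(2))
```

> A value near 1 means two columns tend to be missing in the same rows; a value near 0 means their gaps are unrelated.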

> I will not delete any column, because that would lose potentially important information.

> There are various imputation techniques, such as:

> 1- imputing with a constant value

> 2- imputing with the mean, median or mode

> 3- imputing using KNN-based methods or MICE

> Note: Time-series imputations are different, such as ffill, bfill and linear interpolation.
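
> The simpler techniques from the list above can be sketched on a toy series with plain pandas. The numbers are made up; the point is only to compare what each strategy fills in.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
toy = pd.Series([1.0, np.nan, 3.0, np.nan, 11.0])

# Constant value
constant_filled = toy.fillna(0.0)
# Median of the observed values (NaNs are skipped when computing it)
median_filled = toy.fillna(toy.median())
# Time-series style: carry the last observed value forward
forward_filled = toy.ffill()
# Time-series style: straight line between the neighbouring observations
interpolated = toy.interpolate(method="linear")

print(constant_filled.tolist())
print(median_filled.tolist())
print(forward_filled.tolist())
print(interpolated.tolist())
```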

> I will show how to impute with both KNN and MICE (Iterative).

> But before doing that, let's encode the categorical features.

In [None]:
#Lets find out the categorical and numerical features
from sklearn.compose import make_column_selector

select_numeric_features = make_column_selector(dtype_exclude="object")
select_categorical_features = make_column_selector(dtype_include="object")

numeric = select_numeric_features(data)
categorical = select_categorical_features(data)

print("Numerical features :", numeric)
print("Categorical features :", categorical)

In [None]:
X_train_num = X_train[["MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustSpeed", "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm", "Temp9am", "Temp3pm", "RainToday", "Year", "Month", "Day"]]
X_test_num = X_test[["MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustSpeed", "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm", "Temp9am", "Temp3pm", "RainToday", "Year", "Month", "Day"]]

X_train_num = X_train_num.reset_index(drop=True)
X_test_num = X_test_num.reset_index(drop=True)

> We can encode in multiple ways, such as:

> 1- Label encoding - It maps an integer to each class in the feature, but imposes a false ordinal relationship between the classes (49 > 38).

> 2- One-hot encoding - It expands each feature into multiple dummy columns based on its cardinality, which can leave the model struggling with large, sparse data when there are too many dummy features.

> 3- Target encoding (using category_encoders) - It encodes each class with the mean or median of the target. It can improve model quality, but it has a high chance of overfitting and can leak the target into the feature if not done carefully.

> I will go with one-hot encoding.

> Note: OneHotEncoder from scikit-learn looks similar to pandas pd.get_dummies, but it has the advantage that it remembers the categories seen during fit, so the test set is encoded with the same columns as the training set.
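
> A small sketch of the difference mentioned in the note above, on made-up wind directions. It uses `handle_unknown="ignore"` (the notebook itself uses `'error'`) so that the unseen category can be shown; that choice is an assumption made only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split of one categorical feature
toy_train = pd.DataFrame({"WindGustDir": ["N", "S", "E", "N"]})
toy_test = pd.DataFrame({"WindGustDir": ["S", "W"]})  # "W" never appears in training

# pandas builds the dummy columns from whatever values each frame contains
dummies_train = pd.get_dummies(toy_train)
dummies_test = pd.get_dummies(toy_test)

# scikit-learn learns the categories once, on the training data
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(toy_train)
# The test set is encoded against the training categories (E, N, S)
encoded_test = encoder.transform(toy_test).toarray()

print(list(dummies_train.columns))
print(list(dummies_test.columns))
print(encoded_test)
```

> The two pandas frames end up with different columns, while the fitted encoder always returns the training layout; the unseen "W" becomes a row of zeros.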

In [None]:
from sklearn.preprocessing import OneHotEncoder

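#drop = "first" helps us escape the dummy variable trap
#Note: scikit-learn >= 1.2 renames the sparse argument to sparse_output (sparse was removed in 1.4)
ohe = OneHotEncoder(handle_unknown='error', categories="auto", sparse=False, drop="first")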

In [None]:
#Fill missing wind directions with the training-set mode (computed on train only, so nothing leaks from the test set)
for dframe in [X_train, X_test]:
    for col in ['WindGustDir', 'WindDir9am', 'WindDir3pm']:
        #Assign back instead of using inplace=True on a column, which recent pandas versions may ignore
        dframe[col] = dframe[col].fillna(X_train[col].mode()[0])

In [None]:
#Fit the encoder on the training set only and reuse it on the test set,
#so both sets get the same dummy columns and nothing is learned from the test data
cat_cols = ["Location", "WindGustDir", "WindDir9am", "WindDir3pm"]

#get_feature_names_out needs scikit-learn >= 1.0; string column names also keep later estimators happy
X_train_cat = pd.DataFrame(ohe.fit_transform(X_train[cat_cols]), columns=ohe.get_feature_names_out(cat_cols))

X_train = pd.concat([X_train_num, X_train_cat], axis=1)

X_train.head(3)

In [None]:

#transform only: the categories were learned from the training set above
X_test_cat = pd.DataFrame(ohe.transform(X_test[cat_cols]), columns=ohe.get_feature_names_out(cat_cols))

X_test = pd.concat([X_test_num, X_test_cat], axis=1)

X_test.head(3)

X_test.head(3)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
y_train = y_train.astype(int)
y_test = y_test.astype(int)

**Imputation : KNN and Iterative**

***KNN Imputation***

In [None]:
from sklearn.impute import KNNImputer

X_train_Knn = X_train.copy(deep=True)

knn_imputer = KNNImputer(n_neighbors=3, weights="uniform")
X_train_Knn = knn_imputer.fit_transform(X_train_Knn)

In [None]:
X_test_Knn = X_test.copy(deep=True)
X_test_Knn = knn_imputer.transform(X_test_Knn)

***Iterative Imputing***

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
X_train_Imp = X_train.copy(deep=True)

Iter_Imp = IterativeImputer(max_iter=25, verbose=2, imputation_order="ascending") 
X_train_Imp = Iter_Imp.fit_transform(X_train_Imp)

In [None]:
X_test_Imp = X_test.copy(deep=True)
X_test_Imp = Iter_Imp.transform(X_test_Imp)

In [None]:
#Note: casting to int truncates the decimals of continuous features (for example 13.4 becomes 13)
X_train_Imp = X_train_Imp.astype(int)
X_test_Imp = X_test_Imp.astype(int)
X_train_Knn = X_train_Knn.astype(int)
X_test_Knn = X_test_Knn.astype(int)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#KNN Imputation train and test scaling
X_train_Knn = scaler.fit_transform(X_train_Knn)
X_test_Knn = scaler.transform(X_test_Knn)

X_train_Imp = scaler.fit_transform(X_train_Imp)
X_test_Imp = scaler.transform(X_test_Imp)

> Let's check how the KNN and Iterative imputed data perform.

In [None]:
lr = LogisticRegression()

lr.fit(X_train_Knn, y_train)
prediction_KNN = lr.predict(X_test_Knn)

lr.fit(X_train_Imp, y_train)
prediction_IMP = lr.predict(X_test_Imp)

In [None]:
def evaluation(y_actual, predicted):
    cnf_matrix = confusion_matrix(y_actual, predicted)
    sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    labels = ['No', 'Yes']
    print(classification_report(y_actual, predicted, target_names=labels))

In [None]:
#KNN evaluation
evaluation(y_test, prediction_KNN)
plt.show()

#Iterative imputation evaluation
evaluation(y_test, prediction_IMP)

In [None]:
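#Precision recall curve for KNN
#lr was last fitted on the iterative-imputed data, so fit a separate model on the KNN-imputed data
lr_knn = LogisticRegression().fit(X_train_Knn, y_train)
y_pred_prob = lr_knn.predict_proba(X_test_Knn)[:,1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.plot(recall, precision) #recall on the x-axis, precision on the y-axis
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve')

In [None]:
#AUC is computed from the predicted probabilities rather than the hard 0/1 labels
print("AUC score is: ", roc_auc_score(y_test, y_pred_prob))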

In [None]:
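#Precision recall curve for IMP
#lr was last fitted on the iterative-imputed training data
y_pred_prob = lr.predict_proba(X_test_Imp)[:,1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.plot(recall, precision) #recall on the x-axis, precision on the y-axis
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve')

In [None]:
#AUC is computed from the predicted probabilities rather than the hard 0/1 labels
print("AUC score is: ", roc_auc_score(y_test, y_pred_prob))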

> As AUC is a bit better with Iterative Imputation, I will use Iterative Imputed data ahead.

**Imbalance techniques**

> Why deal with the class imbalance problem? As we can see below, there is a big difference between the number of Yes and No values in our binary target. This biases the model towards predicting No more often than Yes, so we may see a very good accuracy, maybe even a whopping 98% on a heavily skewed dataset. But did the model learn? Not at all. Classifiers strive for good performance on the training data and concentrate on learning the pattern of the majority class more than the minority class. If a student knows that 98% of the exam questions will come from Science and Geography and 2% from History, they will happily ignore History. When dealing with imbalanced problems, accuracy is not the metric to focus on; precision, recall, F1 score and ROC AUC should be our main focus.

> Real-life applications where we have to deal with class imbalance include fake news detection, fraud detection and intrusion detection.
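
> The accuracy trap described above can be shown with a few lines of plain Python. The 98/2 split mirrors the exam analogy, and the "model" here is a made-up one that always predicts No.

```python
# Hypothetical labels: 98 "No" (0) and 2 "Yes" (1)
y_true = [0] * 98 + [1] * 2
# A "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
toy_accuracy = correct / len(y_true)

# Recall: share of the actual positives that the model found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
toy_recall = true_positives / actual_positives

print(toy_accuracy)
print(toy_recall)
```

> The accuracy is 0.98 even though not a single rainy day is detected, which is why recall, precision and F1 are the numbers to watch.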

In [None]:
f, ax = plt.subplots(figsize=(6, 8))
ax = sns.countplot(x="RainTomorrow", data=data, palette="Set1")
plt.show()

***Borderline SMOTE***

In [None]:
from collections import Counter

import imblearn
from imblearn.over_sampling import BorderlineSMOTE
over = BorderlineSMOTE(random_state=142)

Xtrain_BLS, ytrain_BLS = over.fit_resample(X_train_Imp, y_train)

counter = Counter(ytrain_BLS)

for label, _ in counter.items():
    X_train_scatter = np.array(Xtrain_BLS)
    y_train_scatter = np.array(ytrain_BLS)
    row_ix = np.where(y_train_scatter == label)[0]
    plt.scatter(X_train_scatter[row_ix, 0], X_train_scatter[row_ix, 1], label=str(label))
plt.title(f"{counter}")
plt.legend()
plt.show()


***SVM SMOTE***

In [None]:
from imblearn.over_sampling import SVMSMOTE
over = SVMSMOTE(random_state=142)
Xtrain_SVM, ytrain_SVM = over.fit_resample(X_train_Imp, y_train)

counter = Counter(ytrain_SVM)

for label, _ in counter.items():
    X_train_scatter = np.array(Xtrain_SVM)
    y_train_scatter = np.array(ytrain_SVM)
    row_ix = np.where(y_train_scatter == label)[0]
    plt.scatter(X_train_scatter[row_ix, 0], X_train_scatter[row_ix, 1], label=str(label))
plt.title(f"{counter}")
plt.legend()
plt.show()

***ADASYN***

In [None]:
from imblearn.over_sampling import ADASYN
over = ADASYN(random_state=142)

Xtrain_ADA, ytrain_ADA = over.fit_resample(X_train_Imp, y_train)

counter = Counter(ytrain_ADA)

for label, _ in counter.items():
    X_train_scatter = np.array(Xtrain_ADA)
    y_train_scatter = np.array(ytrain_ADA)
    row_ix = np.where(y_train_scatter == label)[0]
    plt.scatter(X_train_scatter[row_ix, 0], X_train_scatter[row_ix, 1], label=str(label))
plt.title(f"{counter}")
plt.legend()
plt.show()

***SMOTETomek***

In [None]:
from imblearn.combine import SMOTETomek

smotek = SMOTETomek(random_state=142)
Xtrain_SMT, ytrain_SMT = smotek.fit_resample(X_train_Imp, y_train)

counter = Counter(ytrain_SMT)

for label, _ in counter.items():
    X_train_scatter = np.array(Xtrain_SMT)
    y_train_scatter = np.array(ytrain_SMT)
    row_ix = np.where(y_train_scatter == label)[0]
    plt.scatter(X_train_scatter[row_ix, 0], X_train_scatter[row_ix, 1], label=str(label))
plt.title(f"{counter}")
plt.legend()
plt.show()

**Model**

> The Model class below takes the inputs, runs a GridSearchCV and pulls the best parameters. Once the class is instantiated and the best parameters are found, the next steps fit the model on the training set you pass and give you the desired output. You can tune hyperparameters as you require; I haven't done any real tuning because of computing constraints, but you can play with the parameters: all you have to do is pass the lists of values. I will also only test the models with Borderline SMOTE data, but that should not stop you from testing with the others.

In [None]:
model = list()
resample = list()
accuracy = list()
precision = list()
recall = list()
F1score = list()
AUCROC = list()

class Model:
    def __init__(self, X_train, y_train, X_test, y_test, model_type=None, enhanced_model_type=None, params=None, algo=None, sampling=None):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.params = params
        self.algo = algo
        self.sampling = sampling

        if model_type == 'rf':
            self.user_defined_model = RandomForestClassifier()
        elif model_type == 'lr':
            self.user_defined_model = LogisticRegression()
        elif model_type == 'ada':
            self.user_defined_model = AdaBoostClassifier()
        elif model_type == 'sgd':
            self.user_defined_model = SGDClassifier()
    
        self.cv = StratifiedKFold(n_splits=5, random_state=142, shuffle=True)
        self.GS = GridSearchCV(self.user_defined_model, param_grid=params, cv=self.cv, scoring='roc_auc', n_jobs=-1, refit=True)
        self.MS = self.GS.fit(X_train, y_train)
        self.best_param = self.MS.best_params_
        print(self.best_param)

        if enhanced_model_type == 'rf':
            self.enhanced_model = RandomForestClassifier(**self.best_param)
        elif enhanced_model_type == 'lr':
            self.enhanced_model = LogisticRegression(**self.best_param)
        elif enhanced_model_type == 'ada':
            self.enhanced_model = AdaBoostClassifier(**self.best_param)
        elif enhanced_model_type == 'sgd':
            self.enhanced_model = SGDClassifier(**self.best_param)
            
    def fit(self, X_train, y_train):
        self.model = self.enhanced_model.fit(X_train, y_train)
        return self.model

    def predict(self, X_test):
        y_pred = self.model.predict(X_test)
        return y_pred
    
    def predict_prob(self, X_test):
        y_prob = self.model.predict_proba(X_test)
        return y_prob

    def append_metrics(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        y_prob = self.model.predict_proba(X_test)
        model.append(self.algo)
        accuracy.append(accuracy_score(y_test, y_pred, normalize=False)) #returns the no of correctly classified samples
        precision.append(precision_score(y_test, y_pred))
        recall.append(recall_score(y_test, y_pred))
        F1score.append(f1_score(y_test, y_pred))
        AUCROC.append(roc_auc_score(y_test, y_prob[:,1]))
        resample.append(self.sampling)
    
    def print_metric(self, X_train, y_train, X_test, y_test):
        y_pred = self.model.predict(X_test)
        y_prob = self.model.predict_proba(X_test)
        print("="*60)
        print("Confusion Matrix")
        print("-"*30)
        print(confusion_matrix(y_test, y_pred), "\n")
        print("="*60)
        print("Classification Report")
        print("-"*30)
        print(classification_report(y_test, y_pred), "\n")
        print("="*60)
        print("ROC-AUC score")
        print("-"*30)
        print(roc_auc_score(y_test, y_prob[:,1]))
        print("*"*60)
  

    

> Try out hyperparameter tuning at your convenience on the models below. I ran them with single-value parameters only, as it would take a lot of time otherwise.
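
> For reference, a minimal sketch of what a multi-valued grid could look like. It runs on small synthetic data (an assumption, so that it finishes in seconds) and follows the same search-then-transfer flow as the Model class; the grid values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, mildly imbalanced data standing in for the resampled training set
X_toy, y_toy = make_classification(n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=142)

# Several candidate values per parameter instead of a single one
toy_params = {"C": [0.1, 1, 10], "class_weight": [None, "balanced"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=142)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=toy_params, cv=cv, scoring="roc_auc")
search.fit(X_toy, y_toy)

# Transfer the winning combination to a fresh estimator, as the Model class does
tuned = LogisticRegression(max_iter=1000, **search.best_params_).fit(X_toy, y_toy)

print(search.best_params_)
```

> Every extra value multiplies the number of fits (here 3 x 2 combinations x 5 folds), which is why the runs below stick to one value per parameter.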

***Logistic regression with BorderlineSMOTE***

In [None]:
params = {'C':[10],'class_weight':['balanced'], 'solver':['lbfgs'], 'max_iter': [1000], 'n_jobs': [-1]}
logreg_Borderline = Model(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test, model_type='lr', enhanced_model_type='lr', params=params, algo='Logistic', sampling='BLS')

In [None]:
logreg_Borderline.fit(Xtrain_BLS, ytrain_BLS)
logreg_Borderline.predict(X_test_Imp)
logreg_Borderline.predict_prob(X_test_Imp)
logreg_Borderline.append_metrics(X_test_Imp, y_test)
logreg_Borderline.print_metric(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test)

***RandomForest with BorderlineSMOTE***

In [None]:
n_estimators = [1100]
max_features = ['sqrt']
max_depth = [200]
max_depth.append(None)
min_samples_split = [2]
min_samples_leaf = [2]
bootstrap = [True]
  
params = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

randomforest_BLS = Model(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test, model_type='rf', enhanced_model_type='rf', params=params, algo='RandomForest', sampling='BLS')

In [None]:
randomforest_BLS.fit(Xtrain_BLS, ytrain_BLS)
randomforest_BLS.predict(X_test_Imp)
randomforest_BLS.predict_prob(X_test_Imp)
randomforest_BLS.append_metrics(X_test_Imp, y_test)
randomforest_BLS.print_metric(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test)

***Adaboost with BorderlineSMOTE***

In [None]:
n_estimators = [1500]
learning_rate = [0.01]

params = {'n_estimators': n_estimators,
          'learning_rate': learning_rate}

adaboost_BLS = Model(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test, model_type='ada', enhanced_model_type='ada', params=params, algo='Adaboost', sampling='BLS')

In [None]:
adaboost_BLS.fit(Xtrain_BLS, ytrain_BLS)
adaboost_BLS.predict(X_test_Imp)
adaboost_BLS.append_metrics(X_test_Imp, y_test)
adaboost_BLS.print_metric(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test)

***SGDClassifier with BorderlineSMOTE***

In [None]:
params = {
    'alpha': [1e-3],#0.001
    'max_iter' : [1500],
    'class_weight': ['balanced'],
    'loss': ['log'], #use 'log_loss' on scikit-learn >= 1.3, where 'log' was removed
    'eta0': [0.05],
    'penalty': ['elasticnet'],
    'n_jobs': [-1]
    }
SGD_BLS = Model(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test, model_type='sgd', enhanced_model_type='sgd', params=params, algo='SGDClassifier', sampling='BLS')

In [None]:
SGD_BLS.fit(Xtrain_BLS, ytrain_BLS)
SGD_BLS.predict(X_test_Imp)
SGD_BLS.append_metrics(X_test_Imp, y_test)
SGD_BLS.print_metric(Xtrain_BLS, ytrain_BLS, X_test_Imp, y_test)

In [None]:
clf_eval_df = pd.DataFrame({'model':model,
                            'resample':resample,
                            'accuracy':accuracy,
                            'precision':precision,
                            'recall':recall,
                            'f1-score':F1score,
                            'AUC-ROC':AUCROC})

In [None]:
clf_eval_df

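> If we consider recall alone, logistic regression did a great job.

> Note that the accuracy column shows the number of correctly classified samples rather than a fraction, because accuracy_score is called with normalize=False.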
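
> A quick sketch of what the `normalize` flag changes, on four made-up labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions: three of the four match
toy_true = [0, 1, 1, 0]
toy_pred = [0, 1, 0, 0]

# normalize=False returns the number of correct predictions
n_correct = accuracy_score(toy_true, toy_pred, normalize=False)
# The default (normalize=True) returns the fraction of correct predictions
fraction_correct = accuracy_score(toy_true, toy_pred)

print(n_correct, fraction_correct)
```

> Dropping `normalize=False` in `append_metrics` would put the familiar 0 to 1 accuracy in the table instead of a count.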

**Voting Classifiers**

In [None]:
from sklearn.ensemble import VotingClassifier

logreg = LogisticRegression(C=10, class_weight='balanced', solver='lbfgs', max_iter=1000, n_jobs=-1)
xgb_classifier = XGBClassifier(gamma=0.0468, learning_rate=0.05, max_depth=3, n_estimators=1500, nthread = -1, random_state = 142)
adaboost = AdaBoostClassifier(n_estimators = 1000, learning_rate = 0.001)
sgdclassifier = SGDClassifier(alpha = 1e-3, max_iter = 1000, class_weight = 'balanced', loss = 'log', eta0=0.05, penalty='elasticnet', n_jobs=-1)

models = [('log_reg', logreg), ('xgb', xgb_classifier), ('adaboost', adaboost), ('sgd', sgdclassifier)]

voting_hard = VotingClassifier(estimators=models, voting='hard', n_jobs=-1)

voting_hard.fit(Xtrain_BLS,ytrain_BLS)

voting_soft = VotingClassifier(estimators=models, voting='soft', n_jobs=-1)

voting_soft.fit(Xtrain_BLS,ytrain_BLS)

pred_hard=voting_hard.predict(X_test_Imp)
pred_soft=voting_soft.predict(X_test_Imp)

print("Hard Voting Scores")

print("="*30)

print("Precision score", precision_score(y_test, pred_hard))

print("Recall score", recall_score(y_test, pred_hard))

print("F1 score", f1_score(y_test, pred_hard))

print(confusion_matrix(y_test, pred_hard))
  
print(classification_report(y_test, pred_hard))

print("="*30)

print("Soft Voting Scores")

print("Precision score", precision_score(y_test, pred_soft))

print("Recall score", recall_score(y_test, pred_soft))

print("F1 score", f1_score(y_test, pred_soft))

print(confusion_matrix(y_test, pred_soft))
  
print(classification_report(y_test, pred_soft))

> Hard voting does not output class probabilities (only soft voting exposes predict_proba), so we cannot compute the ROC AUC score for it directly. But we can see the voting classifier did better than the others in terms of recall.
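
> To see why hard and soft voting can disagree, here is a hand-rolled sketch for a single sample. The three probabilities are made up, and the 0.5 threshold is the usual binary default rather than anything specific to this notebook.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers for one sample
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each classifier casts a 0/1 vote and the majority wins
votes = (proba_class1 >= 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities, then threshold the mean
mean_proba = proba_class1.mean()
soft_prediction = int(mean_proba >= 0.5)

print(votes.tolist(), hard_prediction)
print(round(mean_proba, 2), soft_prediction)
```

> Two lukewarm "No" votes outnumber one confident "Yes" under hard voting, while soft voting lets the confident classifier tip the average over the threshold.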

**Stacking Classifier**

In [None]:
from sklearn.ensemble import StackingClassifier

logreg = LogisticRegression(C=10, class_weight='balanced', solver='lbfgs', max_iter=1000, n_jobs=-1)
xgb_classifier = XGBClassifier(gamma=0.0468, learning_rate=0.05, max_depth=3, n_estimators=1500, nthread = -1, random_state = 142)
adaboost = AdaBoostClassifier(n_estimators = 1000, learning_rate = 0.001)
sgdclassifier = SGDClassifier(alpha = 1e-3, max_iter = 1000, class_weight = 'balanced', loss = 'log', eta0=0.05, penalty='elasticnet', n_jobs=-1)

estimators = [('log_reg', logreg), ('xgb', xgb_classifier), ('adaboost', adaboost), ('sgd', sgdclassifier)]

final_estimator = RandomForestClassifier(n_estimators=1500, max_features='sqrt', max_depth=200, min_samples_split=2, min_samples_leaf=2, bootstrap=True)

stack = StackingClassifier(estimators=estimators, final_estimator=final_estimator, cv=5, n_jobs=-1, passthrough=True, verbose=2)

stack.fit(Xtrain_BLS, ytrain_BLS)

pred = stack.predict(X_test_Imp)

predprob = stack.predict_proba(X_test_Imp)

print("Precision score", precision_score(y_test, pred))

print("Recall score", recall_score(y_test, pred))

print("F1 score", f1_score(y_test, pred))

print("AUC_ROC score", roc_auc_score(y_test, predprob[:,1]))

print(confusion_matrix(y_test, pred))
  
print(classification_report(y_test, pred))

> The overall scores of all the models could have been improved with a more extensive process: feature selection, feature engineering, feature extraction, hyperparameter tuning, a multicollinearity check, etc.