## Table of Contents

1. [Problem statement](#1)
1. [Import Libraries and dataset](#2)
1. [Exploratory data analysis (EDA)](#3)
1. [Explore Target Variable `RainTomorrow`](#4)
1. [Explore Categorical Variables](#5)
1. [Explore Numerical Variables](#6)
1. [Multivariate Analysis](#7)
1. [Handling Class Imbalance](#8)
1. [Splitting of data](#9)
1. [Feature Engineering](#10)
1. [Feature Scaling](#11)
1. [Model training, making predictions and evaluation](#12)
    - [Logistic Regression](#12.1)
    - [KNN](#12.2)
    - [Decision Tree](#12.3)
    - [Random Forest](#12.4)
    - [LightGBM](#12.5)
    - [CatBoost](#12.6)
    - [XGBoost](#12.7)
    - [Neural Network](#12.8)
1. [Model Comparison](#13)
1. [Bias and variance](#14)
1. [Feature Importance](#15)
1. [Model Performance on Imbalanced Dataset](#16)
    
    

## 1. Problem statement <a class="anchor" id="1"></a>    
The goal of this notebook is to predict whether or not it will rain tomorrow in Australia.    
My approach is to build several binary classifiers and compare them to find the best one. I will first explore and preprocess the data and then train the classifiers.    
So, let's start this story.

## 2. Import Libraries and dataset <a class="anchor" id="2"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display  
import time

import warnings
warnings.filterwarnings('ignore')

# for data preprocessing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix,\
cohen_kappa_score, plot_confusion_matrix, roc_curve

# import different classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm
import catboost
import xgboost
from sklearn.neural_network import MLPClassifier

In [None]:
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")

## 3. Exploratory data analysis (EDA) <a class="anchor" id="3"></a>

In [None]:
def data_explore(df):
    display(df.head())
    print("*" * 30)
    print(f"shape of dataset {df.shape}")
    print("*" * 30)
    df.info()  # info() prints its own report and returns None
    print("*" * 30)
    print("Dtypes: \n{}".format(df.dtypes.value_counts()))
    print("*" * 30)
    print(df.columns)
    print("*" * 30)
    print("Number of columns having null values: ", df.isnull().any().sum())
data_explore(df)

In [None]:
# describe for all numeric variables
df.describe().T

In [None]:
# describe for all categorical variables
df.describe(include=['object']).T

In [None]:
# data type plots
fig, ax = plt.subplots(1,2,figsize = (12,6))

df.dtypes.value_counts().plot.pie(explode = [0.05,0.05], autopct = "%1.0f%%",
                                  shadow = True, ax = ax[1])
ax[1].set_title("datatype")

df.dtypes.value_counts().plot(kind = 'bar', ax = ax[0])
ax[0].set_title("datatype")

In the dataset, about 70% of the features are numeric (`float64`), while the remaining 30% are categorical (`object`).

## 4. Explore Target Variable `RainTomorrow`  <a class="anchor" id="4"></a>

In [None]:
# missing values in the target variable
df['RainTomorrow'].isnull().sum()

There are 3267 entries in the dataset where the target variable `RainTomorrow` is null.    
We can't impute these missing values, so the only option left is to drop all the entries where the target variable is `NaN`.

In [None]:
df.dropna(subset=['RainTomorrow'], axis = 0, inplace = True)

df['RainTomorrow'].isnull().sum()

In [None]:
display(df['RainTomorrow'].value_counts())
display(df['RainTomorrow'].value_counts() * 100 / len(df))

df['RainTomorrow'].value_counts().plot(kind = 'bar', color = ['skyblue', 'navy'], rot = 0)

In [None]:
# conversion of target variable from categorical to numeric
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})

**Findings**    
* The target variable had 3267 `NaN` values, which were removed from the dataset.
* 110316 entries for the `No` class (77.58%)
* 31877 entries for the `Yes` class (22.41%)    
    The dataset is highly imbalanced.


## 5. Explore Categorical Variables <a class="anchor" id="5"></a>
   * Unique values 
   * number of missing values
   * frequency plot

In [None]:
cat_cols = df.select_dtypes('object').columns
cat_cols

We have 6 categorical columns ('Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday').    
Let's explore these first.


In [None]:
# number of unique values in each categorical column
for col in cat_cols:
    print(col,"\t", df[col].nunique())

In [None]:
# number of missing values in each categorical column
for col in cat_cols:
    print(col, '\t', df[col].isnull().sum())

In [None]:
# frequency plot of each categorical variable (Date is skipped: it has too many unique values)
plt.figure(figsize = (20,8))
for i, col in enumerate(cat_cols[1:]):
    plt.subplot(2, 3, i+1)
    sns.countplot(data = df, x = col)
    plt.xticks(rotation = 90)
    plt.title(f"{col} has {df[col].nunique()} unique values")

**Date**   
* The `Date` column is categorical and is not useful as it stands. So we convert it to datetime format and then create separate features for year, month and day.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df[['Date', 'Year', 'Month', 'Day']].head()

In [None]:
# Now lets drop Date column
df.drop('Date', axis = 1, inplace = True)

In [None]:
# update cat_cols
cat_cols = df.select_dtypes('object').columns
cat_cols

## 6. Explore Numerical Variables <a class="anchor" id="6"></a>
   * missing values check
   * outlier check
   * distribution check

In [None]:
# numerical columns
num_cols = df.select_dtypes(exclude=['object']).columns
# drop the last four non-object columns: RainTomorrow (target), Year, Month, Day
num_cols = num_cols[:-4]
num_cols

In [None]:
# missing value check
df[num_cols].isnull().sum()

In [None]:
# Outlier check
df[num_cols].describe().T

In [None]:
# box plot of numerical variables
plt.figure(figsize = (15,12))
for i, col in enumerate(num_cols):
    plt.subplot(4, 4, i+1)
    sns.boxplot(data = df, y = col, whis = 3)
    plt.title(col)

Columns `Rainfall`, `Evaporation`, `WindSpeed9am` and `WindSpeed3pm` contain most of the outliers.

**Distribution check**   
* To check whether the distribution is normal or skewed.

In [None]:
outlier_cols = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed3pm']
# histogram plot to check distribution
plt.figure(figsize = (12,10))
for i, col in enumerate(outlier_cols):
    plt.subplot(2, 2, i+1)
    sns.histplot(data = df,x = col, bins = 20)
    plt.title(col)

**Observation**    
The distribution is skewed for all of these columns, so we will use the IQR (interquartile range) to detect outliers.

In [None]:
def IQR(df, out_cols):
    for col in out_cols:
        iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
        lower =  df[col].quantile(0.25) - (iqr * 3)
        upper = df[col].quantile(0.75) + (iqr * 3)
        outlier_percent = round((df[df[col] > upper].shape[0] * 100)/len(df), 2)
        print( col , '\t', lower.round(2), '\t', upper.round(2), 
              '\t', df[col].min(), '\t', df[col].max(), '\t', outlier_percent)
print('column \t\t lower \t high \t min \t max \t outlier_percent')
IQR(df, outlier_cols)

With fences at three times the IQR, the expected range for `Rainfall` is -2.4 to 3.2, while its actual minimum and maximum are 0 and 371, so we only need to cap the high values at 3.2.
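
As a small worked illustration of the fence arithmetic, the sketch below applies the same "three times the IQR" rule to a made-up series. The names `iqr_fences` and `toy_rain` are hypothetical helpers for this example only and the values are not taken from the dataset.

```python
import pandas as pd

def iqr_fences(series, factor=3.0):
    """Return (lower, upper) fences placed `factor` IQRs beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Made-up, right-skewed values standing in for a column such as Rainfall
toy_rain = pd.Series([0.0, 0.0, 0.0, 0.2, 0.4, 0.8, 1.0, 2.0, 50.0])

lower, upper = iqr_fences(toy_rain)
# values above the upper fence are replaced by the fence itself
capped = toy_rain.clip(upper=upper)
print(lower, upper, capped.max())
```

Because rainfall cannot be negative, the lower fence is never reached in practice, which is why only the upper limit matters for such a column.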

## 7. Multivariate Analysis <a class="anchor" id="7"></a>
   * To discover patterns and relationships between variables in the dataset. 
   * Heatmap of correlation 
   * Pairplot to see the patterns

In [None]:
# Heatmap
sns.set_context('notebook', font_scale=1.0, rc = {'lines.linewidth': 2.5})
plt.figure(figsize = (15,12))

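# correlation of the numeric columns only (object columns cannot be correlated)
corr = df.select_dtypes('number').corr()

# mask the duplicate correlation values in the upper triangle
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask, 1)] = True

a = sns.heatmap(corr, mask = mask, annot=True, fmt = '.2f', cmap = 'viridis')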

rotx = a.set_xticklabels(a.get_xticklabels(), rotation = 90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation = 30)


**Observations**     
A few variables ('MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 'WindGustSpeed', 'WindSpeed3pm', 'Pressure9am', 'Pressure3pm') are highly correlated with other variables, but no pair is perfectly correlated (coefficient of 1.0), so no features need to be removed.
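
Reading pairs off a large heatmap is error-prone, so a short sketch of listing them programmatically may help. It runs on made-up data; the helper name `high_corr_pairs` and the 0.8 threshold are assumptions for illustration, not values used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes("number").corr().abs()
    # keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up data: b is almost a copy of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

On the real data the same helper could be called as `high_corr_pairs(df)` to get the pairs sorted by strength.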

In [None]:
# Pair plot for highly correlated variables
sns.pairplot(data = df, vars = ['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm', 'WindGustSpeed', 'WindSpeed3pm', 'Pressure9am', 'Pressure3pm'], 
             kind = 'scatter', 
             diag_kind= 'hist',
             hue = 'RainTomorrow')

## 8. Handling Class Imbalance <a class="anchor" id="8"></a>

In [None]:
df['RainTomorrow'].value_counts().plot(kind='bar',color=['blue', 'cyan'])

The target is highly imbalanced, so we can either increase the number of minority-class samples or decrease the number of majority-class samples. Here I will oversample the minority class.

In [None]:
no = df[df['RainTomorrow'] == 0]
yes = df[df['RainTomorrow'] == 1]

yes_os = resample(yes, replace = True, n_samples=len(no), random_state=21)

df_os = pd.concat([no, yes_os])
print(df_os.shape)
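
For comparison, the other option mentioned in this section, undersampling the majority class, can be sketched with the same `resample` utility. This is only an illustration on a made-up frame (`toy`, `no_us` and `toy_us` are hypothetical names); the notebook itself continues with the oversampled `df_os`.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced frame: 8 "no rain" rows and 2 "rain" rows
toy = pd.DataFrame({"x": range(10), "RainTomorrow": [0] * 8 + [1] * 2})
no = toy[toy["RainTomorrow"] == 0]
yes = toy[toy["RainTomorrow"] == 1]

# draw as many majority rows as there are minority rows, without replacement
no_us = resample(no, replace=False, n_samples=len(yes), random_state=21)
toy_us = pd.concat([no_us, yes])
print(toy_us["RainTomorrow"].value_counts())
```

Undersampling discards data, so it tends to be attractive mainly when the majority class is very large.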

In [None]:
fig = plt.figure(figsize = (8,5))
df_os['RainTomorrow'].value_counts(normalize = True).plot(kind = 'bar', 
                                                         color = ['skyblue', 'navy'], 
                                                         alpha = 0.9, 
                                                         rot = 0)
plt.title('balanced dataset')

## 9. Splitting of data <a class="anchor" id="9"></a>

In [None]:
X = df_os.drop(['RainTomorrow'], axis = 1)
y = df_os['RainTomorrow']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape
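
One caveat with resampling before splitting is that copies of the same minority row can land in both the training and the test set. A possible alternative ordering, shown here only as a sketch on made-up data, is to split first and oversample the training part alone. The function name `split_then_oversample` is hypothetical, and the code assumes the target is coded 0 (majority) and 1 (minority).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_then_oversample(frame, target, test_size=0.25, seed=42):
    """Split first, then upsample the minority class in the training part only."""
    train, test = train_test_split(
        frame, test_size=test_size, random_state=seed, stratify=frame[target]
    )
    major = train[train[target] == 0]  # assumption: 0 is the majority class
    minor = train[train[target] == 1]
    minor_up = resample(minor, replace=True, n_samples=len(major), random_state=seed)
    return pd.concat([major, minor_up]), test

# Made-up frame with an 80/20 class split
toy = pd.DataFrame({"x": range(100), "RainTomorrow": [0] * 80 + [1] * 20})
train_bal, test = split_then_oversample(toy, "RainTomorrow")
print(train_bal["RainTomorrow"].value_counts())
print(len(test))
```

With this ordering the test set keeps the original class proportions, so its scores are not directly comparable with scores measured on a balanced test set.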

## 10. Feature Engineering <a class="anchor" id="10"></a>
   1. Imputing missing values
       - In numerical variables
       - In categorical variables   
   2. Treating outliers
   3. Encoding of categorical variables

#### 1. Imputing Missing values <a class="anchor" id="10.1"></a>  
  * Numeric missing values are imputed with the median of the training dataset, as the median is more robust to outliers than the mean. Because both splits are filled with the median of the **training dataset** only, this step does not leak information from the test set.
  * Categorical missing values are imputed with the mode of the training dataset.
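
The two rules above can be sketched on a tiny made-up example. The frames `train` and `test` and the dictionary `fill_values` are hypothetical; the point is only that the fill values are learned from the training split and then reused unchanged on the test split.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"Rainfall": [0.0, 1.0, np.nan, 3.0],
                      "WindGustDir": ["N", "N", "S", None]})
test = pd.DataFrame({"Rainfall": [np.nan, 10.0],
                     "WindGustDir": [None, "E"]})

# learn the fill values from the training split only
fill_values = {
    "Rainfall": train["Rainfall"].median(),         # median of 0, 1 and 3
    "WindGustDir": train["WindGustDir"].mode()[0],  # most frequent direction
}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Note that the value 10.0 in the test split has no influence on the median used for filling.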

In [None]:
# categorical and numeric missing values in train and test datasets

cat_miss = pd.concat([pd.DataFrame(X_train[cat_cols].isnull().sum()),
                      pd.DataFrame(X_test[cat_cols].isnull().sum())],
                     axis = 1)
num_miss = pd.concat([pd.DataFrame(X_train[num_cols].isnull().sum()),
                      pd.DataFrame(X_test[num_cols].isnull().sum())],
                     axis = 1)
cat_miss.columns = ['train', 'test']
num_miss.columns = ['train', 'test']
display(cat_miss)
display(num_miss)


In [None]:
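# imputing missing values of numeric columns
for col in num_cols:
    col_median = X_train[col].median()  # learned from the training split only
    for df1 in [X_train, X_test]:
        # assign the result back instead of calling fillna in place on a column
        df1[col] = df1[col].fillna(col_median)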

In [None]:
X_train[num_cols].isnull().any().sum(), X_test[num_cols].isnull().any().sum()

In [None]:
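# imputing missing values in categorical variables
for col in cat_cols:
    col_mode = X_train[col].mode()[0]  # learned from the training split only
    for df1 in [X_train, X_test]:
        # assign the result back instead of calling fillna in place on a column
        df1[col] = df1[col].fillna(col_mode)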

In [None]:
X_train[cat_cols].isnull().any().sum(), X_test[cat_cols].isnull().any().sum()

In [None]:
X_train.isnull().any().sum(), X_test.isnull().any().sum()

All the missing values have been imputed; now we will treat the outliers.

#### 2. Treating outliers <a class="anchor" id="10.2"></a>
  * In the 4 columns that contain outliers, values above the upper IQR limit are capped at that limit.

In [None]:
# treating outliers: cap each column at its upper IQR limit
def max_value(df, col, top):
    return np.where(df[col]> top, top, df[col])
# the loop variable is not named `df`, so the original DataFrame is not overwritten
for df1 in [X_train, X_test]:
    df1['Rainfall'] = max_value(df1, 'Rainfall', 3.2)
    df1['Evaporation'] = max_value(df1, 'Evaporation', 21.8)
    df1['WindSpeed9am'] = max_value(df1, 'WindSpeed9am', 55.0 )
    df1['WindSpeed3pm'] = max_value(df1, 'WindSpeed3pm', 57.0)

In [None]:
X_train['Rainfall'].max(), X_test['Rainfall'].max()

In [None]:
cat_cols

#### 3. Encoding Categorical Variables <a class="anchor" id="10.3"></a>
There are 5 categorical variables to encode. Here I am using `get_dummies` with `drop_first=True`.

In [None]:
for col in cat_cols:
    print(col , '\t', X_train[col].nunique())

In [None]:
num_cols = X_train.select_dtypes(exclude= 'object').columns

X_train[num_cols].shape, X_train[cat_cols].shape, X_train.shape

In [None]:
# categorical encoding for training dataset
X_train = pd.concat([X_train[num_cols],
                     pd.get_dummies(X_train['Location'], drop_first=True),
                     pd.get_dummies(X_train['WindGustDir'], drop_first=True, prefix = 'WindGustDir'),
                     pd.get_dummies(X_train['WindDir9am'], drop_first=True, prefix = 'WD9am'),
                     pd.get_dummies(X_train['WindDir3pm'], drop_first=True, prefix = 'WD3pm'),
                     pd.get_dummies(X_train['RainToday'], drop_first=True, prefix ='RainToday')],
                    axis = 1)
X_train.shape

In [None]:
X_train.head(2)

In [None]:
# categorical encoding for test dataset
X_test.shape
X_test = pd.concat([X_test[num_cols], 
                   pd.get_dummies(X_test['Location'], drop_first=True), 
                   pd.get_dummies(X_test['WindGustDir'], drop_first=True, prefix = 'WindGustDir'), 
                   pd.get_dummies(X_test['WindDir9am'], drop_first=True, prefix = 'WD9am'),
                   pd.get_dummies(X_test['WindDir3pm'], drop_first=True, prefix = 'WD3pm'), 
                   pd.get_dummies(X_test['RainToday'], drop_first=True, prefix ='RainToday')], 
                   axis = 1
                   )
X_test.shape
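
Encoding the two splits separately works when every category occurs in both of them, which is likely here given the size of the data. If a category were missing from the test split, the column sets would differ and `drop_first` could drop a different level in each split. A defensive variant, sketched below on made-up data with hypothetical frame names, encodes the test split without `drop_first` and then reindexes it to the training columns.

```python
import pandas as pd

train = pd.DataFrame({"WindGustDir": ["N", "S", "E", "W"]})
test = pd.DataFrame({"WindGustDir": ["N", "S", "S"]})  # "E" and "W" never occur here

train_enc = pd.get_dummies(train["WindGustDir"], drop_first=True,
                           prefix="WindGustDir", dtype=int)

# encode every level seen in the test split, then keep exactly the training columns;
# training columns that are absent from the test split are filled with 0
test_enc = pd.get_dummies(test["WindGustDir"], prefix="WindGustDir", dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

After the reindex both matrices share the same columns in the same order, which is what the scaler and the classifiers expect.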

Before diving into model training, we should bring all the feature variables onto the same scale using feature scaling.

## 11. Feature Scaling <a class="anchor" id="11"></a>

In [None]:
X_train.describe()

In [None]:
cols = X_train.columns

In [None]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# scaling returns NumPy arrays, so convert them back to DataFrames
# (pass `cols` directly; wrapping it in a list would create MultiIndex columns)
X_train = pd.DataFrame(X_train, columns = cols)
X_test = pd.DataFrame(X_test, columns = cols)

In [None]:
X_train.describe()

**Observation**    
Now the minimum and maximum of every feature are 0 and 1, and the count is the same for all of them, indicating the absence of null values. Our training and test datasets are now ready for model building.
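
As a quick check of what `MinMaxScaler` does, the sketch below compares its output with the formula (x - min) / (max - min), where min and max come from the training data. The arrays are made up for illustration. It also shows that the 0 to 1 range is guaranteed for the training split only: a test value outside the training range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])  # 40 lies outside the training range

scaler = MinMaxScaler().fit(train)  # learns min=10 and max=30 from train only
scaled = scaler.transform(test)

# the same computation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(scaled.ravel(), manual.ravel())
```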

## 12. Model training, making predictions and evaluation <a class="anchor" id="12"></a>
1. Logistic Regression
2. KNN
3. Decision Tree
4. Random Forest
5. LightGBM
6. CatBoost
7. XGBoost
8. Neural Network

In [None]:
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color = 'red', label = 'ROC')
    plt.plot([0,1], [0,1], color = 'navy', linestyle = '--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.show()

In [None]:
# General method for model training
def model_run(clf, X_train, y_train, X_test, y_test, verbose = 1):
    tic = time.time()
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    # positive-class probabilities for the ROC AUC and the ROC curve
    probs = clf.predict_proba(X_test)[:,1]
    
    accuracy_train = accuracy_score(y_train, y_pred_train)
    accuracy_test = accuracy_score(y_test, y_pred)
    # score the ranking with probabilities; hard labels would reduce the AUC to balanced accuracy
    roc_auc = roc_auc_score(y_test, probs)
    coh_kap = cohen_kappa_score(y_test, y_pred)
    
    toc = time.time()
    time_taken = toc-tic
    
    print("Training Accuracy = {}".format(round(accuracy_train * 100, 2)))
    print("Test Accuracy = {}".format(round(accuracy_test * 100, 2)))
    print("ROC Area under Curve = {}".format(round(roc_auc, 2)))
    print("Cohen's Kappa = {}".format(round(coh_kap, 2)))
    print("Time taken = {}".format(time_taken))
    print(classification_report(y_test,y_pred,digits=5))
    
    fpr, tpr, threshold = roc_curve(y_test,probs)
    plot_roc_curve(fpr, tpr)
    
    plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues, normalize='all')
    
    return clf, accuracy_train,accuracy_test, roc_auc, coh_kap, time_taken
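
A note on the ROC AUC metric: `roc_auc_score` accepts both hard labels and scores, but the two inputs can give different numbers. The sketch below uses made-up labels and probabilities (all names are hypothetical) to show the difference at a 0.5 cut-off.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.2, 0.3, 0.9])  # made-up positive-class probabilities
labels = (probs >= 0.5).astype(int)      # hard predictions at a 0.5 cut-off

auc_from_probs = roc_auc_score(y_true, probs)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_probs, auc_from_labels)
```

Here the probabilities rank every positive above every negative, so the probability-based AUC is perfect, while the label-based value only reflects the single operating point at 0.5.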

### 1. Logistic Regression <a class="anchor" id="12.1"></a>

In [None]:
# Logistic regression
clf_lr = LogisticRegression()
clf_lr, acc_tr_lr, acc_lr, roc_lr, coh_kap_lr, time_lr = model_run(clf_lr, X_train, y_train, X_test, y_test)

### 2. KNN  <a class="anchor" id="12.2"></a>

In [None]:

clf_knn = KNeighborsClassifier()
clf_knn, acc_tr_knn, acc_knn, roc_knn, coh_kap_knn, time_knn = model_run(clf_knn, X_train, y_train, X_test, y_test)

### 3. Decision Tree  <a class="anchor" id="12.3"></a>

In [None]:
# Decision Tree
param_dt = {'max_depth': 16, 'max_features': 'sqrt'}

clf_dt = DecisionTreeClassifier(**param_dt)
clf_dt, acc_tr_dt, acc_dt, roc_dt, coh_kap_dt, time_dt = model_run(clf_dt, X_train, y_train, X_test, y_test)

### 4. Random Forest  <a class="anchor" id="12.4"></a>

In [None]:
params_rf = {'max_depth': 16,
             'min_samples_leaf': 1,
             'min_samples_split': 2,
             'n_estimators': 200,
             'random_state': 21}
clf_rf = RandomForestClassifier(**params_rf)
clf_rf, acc_tr_rf, acc_rf, roc_rf, coh_kap_rf, time_rf = model_run(clf_rf, X_train, y_train, X_test, y_test)

### 5. LGBM  <a class="anchor" id="12.5"></a>

In [None]:
params_lgbm = {'colsample_bytree': 0.95, 
         'max_depth': 16, 
         'min_split_gain': 0.1, 
         'n_estimators': 200, 
         'num_leaves': 50, 
         'reg_alpha': 1.2, 
         'reg_lambda': 1.2, 
         'subsample': 0.95, 
         'subsample_freq': 20}
clf_lgbm = lightgbm.LGBMClassifier(**params_lgbm)
clf_lgbm, acc_tr_lgbm, acc_lgbm, roc_lgbm, coh_kap_lgbm, time_lgbm = model_run(clf_lgbm, X_train.values, y_train.values, X_test, y_test)

### 6. CatBoost  <a class="anchor" id="12.6"></a>

In [None]:
params_cboost ={'iterations': 50,
            'max_depth': 16}
clf_cbst = catboost.CatBoostClassifier(**params_cboost)
clf_cbst, acc_tr_cbst, acc_cbst, roc_cbst, coh_kap_cbst, time_cbst = model_run(clf_cbst, X_train, y_train, X_test, y_test, verbose=0)

### 7. XGBoost  <a class="anchor" id="12.7"></a>

In [None]:
params_xgb ={'max_depth': 16}
clf_xgb = xgboost.XGBClassifier(**params_xgb)
clf_xgb, acc_tr_xgb,acc_xgb, roc_xgb, coh_kap_xgb, time_xgb = model_run(clf_xgb, X_train, y_train, X_test, y_test)

### 8. Neural Network  <a class="anchor" id="12.8"></a>

In [None]:
clf_nn = MLPClassifier(random_state=21, verbose=0)
clf_nn, acc_tr_nn, acc_nn, roc_nn, coh_kap_nn, time_nn = model_run(clf_nn, X_train, y_train, X_test, y_test)

## 13. Model Comparison  <a class="anchor" id="13"></a>

In [None]:
# comparison of accuracy, roc score, coh_kappa score and time
acc_all = [acc_lr, acc_knn,acc_dt, acc_rf, acc_lgbm, acc_cbst, acc_xgb, acc_nn]
acc_tr_all = [acc_tr_lr, acc_tr_knn,acc_tr_dt, acc_tr_rf, acc_tr_lgbm, acc_tr_cbst, acc_tr_xgb, acc_tr_nn]
roc_all = [roc_lr, roc_knn, roc_dt, roc_rf, roc_lgbm, roc_cbst, roc_xgb, roc_nn]
coh_kap_all = [coh_kap_lr, coh_kap_knn, coh_kap_dt, coh_kap_rf, coh_kap_lgbm, coh_kap_cbst, coh_kap_xgb, coh_kap_nn]
time_taken = [time_lr, time_knn, time_dt, time_rf, time_lgbm, time_cbst, time_xgb, time_nn]
models = ['Logistic Regression','KNN','Decision Tree','Random Forest','LightGBM','Catboost','XGBoost', 'Neural Network' ]

model_comp_df = pd.DataFrame({'Model': models,
                              'Train Accuracy' : acc_tr_all,
                          'Test Accuracy': acc_all, 
                          'ROC_AUC': roc_all,
                          'Cohen_kappa': coh_kap_all,
                          'Time_taken': time_taken})

model_comp_df.to_csv('balanced_summery.csv')
# the styled table must be the last expression in the cell to be displayed
model_comp_df.style.background_gradient(cmap='Blues')

In [None]:
fig = plt.figure(figsize = (15, 12))

# plot of Test Accuracy of every model
plt.subplot(221)
sns.barplot(data = model_comp_df, x = 'Model', y = 'Test Accuracy', palette = 'winter')
plt.title('Test Accuracy of classifier')
plt.xticks(rotation = 90)
plt.ylim(0.5, 1.0)

# plot of Time of every model
plt.subplot(222)
sns.barplot(data = model_comp_df, x = 'Model', y = 'Time_taken', palette = 'summer' )
plt.title('Time taken by classifier')
plt.xticks(rotation = 90)

# plot of ROC of every model
plt.subplot(223)
sns.barplot(data = model_comp_df, x = 'Model', y = 'ROC_AUC', palette = 'winter')
plt.title('ROC-AUC Score of classifier')
plt.xticks(rotation = 90)
plt.ylim(0.5, 1.0)

# plot of Cohen_kappa of every model
plt.subplot(224)
sns.barplot(data = model_comp_df, x = 'Model', y = 'Cohen_kappa', palette = 'summer')
plt.title('Cohen_kappa Score of classifier')
plt.xticks(rotation = 90)
plt.ylim(0.5, 1.0)

plt.savefig('balanced_plots.jpeg')

**Result Discussion**    
1. Among all the classifiers, XGBoost, CatBoost and Random Forest perform best, with the highest accuracy scores.
2. Comparing the time taken by the classifiers, KNN takes the longest, followed by the neural network.
3. The trends of the ROC-AUC and Cohen's kappa scores are similar to that of accuracy.

<span style="color:blue">**Considering both time and accuracy, the best classifier for the current problem is XGBoost.**</span>

## 14. Bias and variance of different models <a class="anchor" id="14"></a>

In [None]:
model_comp_df['Train Accuracy'] = model_comp_df['Train Accuracy'].apply(lambda x: round(x, 3))
model_comp_df['Test Accuracy'] = model_comp_df['Test Accuracy'].apply(lambda x: round(x, 3))
model_comp_df[['Model','Train Accuracy', 'Test Accuracy']] 

**Observations**    
* Logistic regression has high bias and low variance.
* XGBoost has low bias and low variance.

Hence XGBoost is considered the best-suited model for this problem.
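
A rough way to read these two properties from the table is to treat a low training accuracy as a sign of bias and a large gap between training and test accuracy as a sign of variance. The sketch below applies that heuristic to made-up numbers; the model names and accuracies are purely illustrative and are not results from this notebook.

```python
import pandas as pd

# Made-up accuracies for illustration only
toy_scores = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "Train Accuracy": [0.80, 0.99, 0.95],
    "Test Accuracy": [0.79, 0.85, 0.94],
})

# train-test gap as a simple proxy for variance (overfitting)
toy_scores["Gap"] = toy_scores["Train Accuracy"] - toy_scores["Test Accuracy"]
ranked = toy_scores.sort_values("Gap", ascending=False).reset_index(drop=True)
print(ranked)
```

In this toy table, model B overfits the most, model A underfits, and model C combines a high training accuracy with a small gap.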

## 15. Feature Importance <a class="anchor" id="15"></a>

In [None]:
# calculating the most important features
importance = clf_xgb.feature_importances_

feat_imp_df = pd.DataFrame({'Features': cols, 'Importance': importance})

# filter first, then sort, so the boolean mask lines up with the rows and the result stays ordered
f_imp = feat_imp_df[feat_imp_df['Importance'] >= 0.01].sort_values('Importance', ascending=False).reset_index(drop= True)
f_imp.shape

In [None]:
# pass figsize to the pandas plot call; a separate plt.figure() would stay empty
f_imp.plot.bar(x = 'Features', y = 'Importance', figsize = (20, 8))

In [None]:
# model performance on important features
imp_feat = f_imp['Features']

clf_xgb_imp = xgboost.XGBClassifier(**params_xgb)
clf_xgb_imp, acc_tr_xgb_imp,acc_xgb_imp, roc_xgb_imp, coh_kap_xgb_imp, time_xgb_imp = model_run(clf_xgb_imp, X_train[imp_feat], y_train, X_test[imp_feat], y_test)

**Observations**    
* For the dataset restricted to the most important features obtained from XGBoost, the test accuracy decreases from 94% to 88%.

## 16. Model Performance on Imbalanced Dataset <a class="anchor" id="16"></a>

To observe the effect of balancing the dataset, I also trained all the classifiers on the imbalanced dataset and obtained the corresponding model comparison summary.

In [None]:
bal_clf_summ = pd.read_csv('../input/balance-model-summary/balanced_summery.csv', index_col = 0)
bal_clf_summ

In [None]:
unbal_clf_summ = pd.read_csv('../input/unbalance-model-summary-rainfall/unbalanced_summery.csv', index_col=0)
unbal_clf_summ
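
To compare two such summaries row by row, they can be merged on the model name. The sketch below uses two small made-up frames (`bal` and `unbal` are hypothetical stand-ins for the summaries loaded above, and the numbers are invented) and assumes both share a `Model` column.

```python
import pandas as pd

# Made-up stand-ins for the balanced and unbalanced summaries
bal = pd.DataFrame({"Model": ["A", "B"], "Test Accuracy": [0.90, 0.80]})
unbal = pd.DataFrame({"Model": ["A", "B"], "Test Accuracy": [0.85, 0.84]})

side_by_side = bal.merge(unbal, on="Model", suffixes=("_balanced", "_unbalanced"))
# positive Delta: the model scores higher on the balanced data
side_by_side["Delta"] = (side_by_side["Test Accuracy_balanced"]
                         - side_by_side["Test Accuracy_unbalanced"])
print(side_by_side)
```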

**Observation**    
Comparing the model accuracy before and after balancing the minority class, we observe that:
* Models like Logistic Regression and Decision Tree perform better on the unbalanced dataset.
* Models including CatBoost and XGBoost show higher variance on the unbalanced dataset than on the balanced dataset.
* Random Forest performs almost the same in both cases.

<span style="color:blue">**Hence Random Forest can be considered a good model in this case, with low bias and low variance.**</span>

