In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

**Objective**

1. Get a better understanding of the simplified predictive modelling framework

2. Grasp the logic behind different coding methods & concise techniques used

3. Comparisons between different models


**Introduction**

This is my first Kernel, so please comment with any improvements you feel I could make! It is based on the Porto Seguro Safe Driver Prediction dataset (a classification problem). I am mainly self-taught, so most of the approaches here are simplified and structured to show alternative methods.

I've also included explanations alongside the code to explain the logic behind it. Hopefully, those in similar self-taught circumstances will find this Kernel useful!

**Some pointers to NOTE:**

    -This Kernel focuses more on data manipulation for 'Model Comparisons' (Chapter 8 onwards)

    -Chapters 1 to 7 mainly serve as set-up for the Model Comparisons
    
    -Chapters 8 onwards deal directly with Model Comparison techniques


I will be doing another Kernel that focuses more on 'Data Cleaning' & 'Feature Engineering' separately...

**Chapter Outline**

1.Open Dataset

2.Preliminary Analysis

    2.1 Structure
    2.2 Composition (Correlation, Missing, Unique, Data-types)

3.Data Cleaning

    3.1 C1 Correction
    3.2 C2 Complete
    3.3 C3 Create
    3.4 C4 Convert

4.Prepare Data - A

5.Exploratory Data Analysis (EDA)

    5.1 Balance of dataset (Target Variable)
    5.2 Uni-variates
    5.3 Bi-variates
    5.4 EDA Summative notes for Feature Importance comparison

6.Parameter Tuning

7.Feature Select - A

    7.1 Pre-Drop Accuracy Score
    7.2 Post-Drop Accuracy Score

8.Feature Select & Individual Model Charting - B

    8.1 Model Preparation
    8.2 LASSO
    8.3 Ridge
    8.4 Balanced Logistic Regression
    8.5 XGBoost
    8.6 Random Forest Classifier

9.Features (Side To Side Comparison)

    9.1 (Coefficient Values) Quick Easy Method
    9.2 (Coefficient Values) Neat DataFrame Method
    9.3 (Plotting) Quick Method
    9.4 (Plotting) Sorted Neat Method

10.ROC AUC (Side To Side Comparison)

    10.1 Brief Annotations

11.Cross Validation (Side To Side Comparison)

    11.1 Overall Conclusions

**Coding Techniques :**

    A.List comprehensions
    B.Samples to reduce computational cost
    C.Concise 'def' functions that can be used repetitively
    D.Pivoting using groupby
    E.When & How to convert and reshape dictionaries into lists or dataframes
    F.Quickly split dataframe columns
    G.loc & conditionals
    H.Loop Sub-plots
    I.Quick Lambda formulae functions
    J.Quick looping print or DataFrame conversion of summative scores
    K.Order plot components 
    L.Create & Plot Bulk Ensemble comparative results


In [3]:
# Import Modules

# Foundational Packages
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.options.display.max_columns = 100
ZZ = 15

**1. Open Dataset**

We will now use pandas to open the dataset. This part is pretty straightforward. 

I've also made a working copy. The primary set "train_raw" acts as a control where no adjustments (data cleaning) are made at all, while "train_raw_copy" is where the adjustments are made. You will see the control set used during the feature selection phase.

In [4]:
# Open Train & Test files
train_raw = pd.read_csv('../input/train.csv', na_values=-1) #FYI na_values are defined in the original data page
test_raw = pd.read_csv('../input/test.csv', na_values=-1)
# Copy Train file for workings
train_raw_copy = train_raw.copy(deep=True)

**2. Preliminary Analysis**

This is just a preliminary schematic analysis to get a rough understanding of the data-set.

* **2.1** Structure

This is simply the size dimensions (length of rows and width of columns) of the data-set. Head and samples are only to give yourself a quick truncated visual.

In [10]:
# Shape
print('Train Shape: ', train_raw_copy.shape)
print('Test Shape: ', test_raw.shape)

In [11]:
# Brief Head Output
display(train_raw_copy.head())
display(test_raw.head())

In [12]:
# Brief Sample Output
samples_show = 10
display(train_raw_copy.sample(samples_show))
display(test_raw.sample(samples_show))

* **2.2** Composition

Now the actual composition of the data-set in preparation for data cleaning.

We will first deal with the internal relationships within the data, hence for this we will use the heatmap from the seaborn module

-Correlation: The relationships between features. (Positive or Negative, Strong or Weak, None)

In [13]:
# Heatmap of correlations
cor = train_raw_copy.corr()
plt.figure(figsize=(12, 9))
sns.heatmap(cor)
plt.show()

We see that the 'calc' features show no visible correlation with any other feature. We'll remove them later during 'Data Cleaning' to reduce unnecessary computational cost and noise.

Now to deal with the data attributes. We will use a generic function that can be reused as a check as we proceed with data cleaning.

-Missing values: For each respective column of feature, the count of empty data entries.

-Unique values: For each respective feature, how many unique values there are.

-Data-types:

Categorical (Each category represents a specific class of a particular description)

Binary (Yes/No or 1/0 indicator)

Integer/Ordinal (A series of ordered value counts that represents a scale range)

Float/Interval (A numerical continuous value scale)

In [14]:
# Function to output missing values & UniqueCounts & DataTypes
def basic_details(df):
    details = pd.DataFrame()
    details['Missing value'] = df.isnull().sum()
    details['N unique value'] = df.nunique()
    details['dtype'] = df.dtypes
    print('\n', details, '\n')

Train set *Unhide to view output

In [16]:
basic_details(train_raw_copy)

Test set *Unhide to view output

In [17]:
basic_details(test_raw)
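
As a side note, the same summary can also be returned as a DataFrame instead of printed, which makes it easier to filter or sort afterwards. This is only a minimal sketch: `summarise_columns` and the toy values below are made up for illustration and are not part of the original Kernel.

```python
import numpy as np
import pandas as pd

def summarise_columns(df):
    """Return missing counts, unique counts and dtypes, one row per column."""
    return pd.DataFrame({
        'Missing value': df.isnull().sum(),
        'N unique value': df.nunique(),
        'dtype': df.dtypes,
    })

# Toy frame with made-up values standing in for the real data
toy = pd.DataFrame({
    'ps_ind_01': [0, 1, 2, 1],
    'ps_car_01_cat': [10.0, np.nan, 7.0, 7.0],
    'ps_ind_06_bin': [0, 1, 1, 0],
})

summary = summarise_columns(toy)
print(summary)

# Because it is a DataFrame, we can filter it, e.g. columns that still have gaps
print(summary[summary['Missing value'] > 0])
```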

**3. Data Cleaning**

After getting a rough picture of what the data-set "has" and "lacks", we proceed to tidy up these imperfections.

**C1 - Correction**

This step drops the uncorrelated features and removes features that contain an excessive number of empty data entries. The objective is to remove unnecessary noise (prediction errors) and computational cost (run time) from the analysis and modelling process as we proceed.

Factors in deciding the cut-off threshold include the data-set dimensions and the unique-value counts found during the preliminary analysis earlier.

In [18]:
##### C1 - Correction

# Combine both df for easy referencing
data_cleaner = [train_raw_copy, test_raw]

# Get List of Column names to drop
# 1.Drop those whose missing-value count exceeds the threshold
limit = 569  # ps_car_09_cat from "train_raw_copy" used as threshold for Missing Values
remove_cols_1 = [c for c in train_raw_copy.columns if train_raw_copy[c].isnull().sum() > limit]

# 2.Drop those that are uncorrelated from Heatmap
# **NOTE we will rectify this later during Feature Selection**
remove_cols_2 = train_raw_copy.columns[train_raw_copy.columns.str.startswith('ps_calc')]

# Dropping
for DataSet in data_cleaner:
    # 'columns=' already targets columns, so 'axis=1' is not needed
    DataSet.drop(columns=remove_cols_1, inplace=True)
    DataSet.drop(columns=remove_cols_2, inplace=True)

In [20]:
# Check New Shape
print('Train New Shape: ',train_raw_copy.shape)
print('Test New Shape: ', test_raw.shape)
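
To make the C1 logic easier to follow, here is a small self-contained sketch of the same idea on a toy frame. The column values and the threshold of 1 are hypothetical (the Kernel itself uses 569 on the real data); the point is only to show how the two drop lists are built from the train set and then applied to both sets.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with made-up values
toy_train = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'ps_car_03_cat': [np.nan, np.nan, np.nan, 1.0],  # 3 missing entries
    'ps_car_09_cat': [0.0, np.nan, 2.0, 1.0],        # 1 missing entry
    'ps_calc_01': [0.1, 0.2, 0.3, 0.4],
    'target': [0, 0, 1, 0],
})
toy_test = toy_train.drop(columns='target')

limit = 1  # hypothetical threshold for this toy example

# 1. Columns whose missing count exceeds the threshold (decided on train only)
missing = toy_train.isnull().sum()
too_sparse = missing[missing > limit].index.tolist()

# 2. Columns dropped by name prefix
calc_cols = [c for c in toy_train.columns if c.startswith('ps_calc')]

to_drop = too_sparse + calc_cols

# Apply the same drop list to both sets so their features stay aligned
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)

print(to_drop)
print(toy_train.shape, toy_test.shape)
```

Note that `ps_car_09_cat` survives because the comparison is strictly greater than the threshold, which mirrors how the threshold column itself is kept in the Kernel.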

**C2 — Complete**

Now to fill the residual missing empty data entries that fell below the threshold in C1. The most common approach is to replace it with either the mode/mean/median. In this case, we will replace with the Mode since we have little depth of knowledge on the features.

In [21]:
##### C2 - Completing (Missing)
# Choices : Median / Mean / Mode

# Easy referencing
for df in data_cleaner:
    # List Comprehension
    Residual_Missing = [c for c in df.columns if df[c].isnull().sum() > 0]
    for col in Residual_Missing:
        # Assign back instead of a chained inplace fillna, which newer pandas may not apply to df
        df[col] = df[col].fillna(df[col].mode()[0])

Train set *Unhide to view output

In [23]:
# Check Missing
print('Train Missing: ',train_raw_copy.isnull().sum())

Test set *Unhide to view output

In [24]:
# Check Missing
print('Test Missing: ',test_raw.isnull().sum())
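
For reference, the mode fill can also be done for the whole frame in one step. This is a sketch on made-up values, not the Kernel's own code: `DataFrame.mode()` returns one row per tied mode, so taking the first row gives a single fill value per column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'a': [1.0, 1.0, np.nan, 2.0],
    'b': [np.nan, 3.0, 3.0, 4.0],
    'c': [0, 1, 1, 0],
})

# First row of mode() holds the (first) most frequent value of each column
modes = toy.mode().iloc[0]

# fillna with a Series matches fill values to columns by name
filled = toy.fillna(modes)

print(modes)
print(filled)
```

Mode is a cautious default when we know little about the features, since it works for categorical, binary and numeric columns alike.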

**C3 — Create**

I have skipped this step for now as we do not have specific knowledge of the features. If you'd like to delve into this further, do check out my other Kernel titled "Feature Engineering & EDA Focused".

**C4 — Convert**

We'll now convert each statistical data-type to its respective computational data-type.

* Metadata Loop

Here we use a metadata loop to return the following 6x stats...

1.use (The purpose it serves in this analysis): input, ID, target

2.type (The statistical data-types, NOT computational data-types):

Nominal_Categorical_cat ->variables w/o order ranking sequence  (Discrete),

Binary_bin                  ->variables w only 2 option either or   (Discrete),

Interval_Real_float     ->Continuous                                         (Continuous), 

Ordinal_Integer_int   ->variables w an ordered series         (Discrete)

3.preserve (Retain for prediction or not): True or False

4.dataType (Computational data-type): int, float, char

5.category (Feature type): ind-individual, reg-registration, car, calc-calculated

6.NUnique: Number of unique values

In [25]:
##### C4 - Convert
data = []
for feature in train_raw_copy.columns:
    # Defining the role of each variable
    if feature == 'target':
        use = 'target'
    elif feature == 'id':
        use = 'id'
    else:
        use = 'input'

    # Defining the statistical data type
    # (pandas dtype helpers avoid platform-dependent comparisons with the builtin int/float)
    if 'bin' in feature or feature == 'target':
        type = 'binary'
    elif 'cat' in feature or feature == 'id':
        type = 'categorical'
    elif pd.api.types.is_float_dtype(train_raw_copy[feature]):
        type = 'real'
    elif pd.api.types.is_integer_dtype(train_raw_copy[feature]):
        type = 'integer'
    else:
        type = 'unknown'  # fallback so a stale value from the previous feature is never reused

    # Initialize preserve to True for all variables except for id.
    # Since ONLY id is not in use
    preserve = True
    if feature == 'id':
        preserve = False

    # Defining the data type
    dtype = train_raw_copy[feature].dtype
    
    # Set default
    category = 'none'
    # Defining the category
    if 'ind' in feature:
        category = 'individual'
    elif 'reg' in feature:
        category = 'registration'
    elif 'car' in feature:
        category = 'car'
    elif 'calc' in feature:
        category = 'calculated'

    # Define UniqueValue Count
    NUnique = train_raw_copy[feature].nunique()

    # Creating a Dictionary that contains all the metadata for the variable to allocate/append above derivations
    feature_dictionary = {
        'varname': feature,
        'use': use,
        'type': type,
        'preserve': preserve,
        'dtype': dtype,
        'category': category,
        'NUnique': NUnique
    }
    data.append(feature_dictionary)

# Adjust & Define DataFrame
metadata = pd.DataFrame(data, columns=['varname', 'use', 'type', 'preserve', 'dtype', 'category', 'NUnique'])

* Pivot Stats

This works just as the pivot table in Excel.

In [26]:
# How many of each Feature types do we have?
print(metadata.groupby(['category'])['category'].count())

In [27]:
# How many of each Statistical data-types do we have?
print(metadata.groupby(['use', 'type'])['use'].count())

In [28]:
# Combining both of the above
print(metadata.groupby(['use', 'type', 'category'])['category'].count())
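
If you want the result to look more like an actual Excel pivot table (rows by columns), the grouped counts can be unstacked. The sketch below uses a small hand-made stand-in for `metadata`; the rows are illustrative only.

```python
import pandas as pd

# Hand-made stand-in for the metadata frame (illustrative rows only)
toy_meta = pd.DataFrame({
    'varname': ['id', 'target', 'ps_ind_01', 'ps_ind_06_bin', 'ps_car_01_cat', 'ps_reg_01'],
    'use': ['id', 'target', 'input', 'input', 'input', 'input'],
    'type': ['categorical', 'binary', 'integer', 'binary', 'categorical', 'real'],
    'category': ['none', 'none', 'individual', 'individual', 'car', 'registration'],
})

# size() counts rows per group without having to pick a column to count
counts = toy_meta.groupby(['category', 'type']).size()

# unstack moves the inner level ('type') into columns; missing combinations become 0
pivot = counts.unstack(fill_value=0)

print(counts)
print(pivot)
```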

* Convert computational data-types

We will now 'brute-force' convert the computational data-type of each feature based on its statistical data-type.

Here you can see 2 different methods. Either using the metadata we did above, or a shortcut list comprehension.

*Unhide to view output

In [33]:
# Cat_Categorical
#BinaryLevel_cat_col = [col for col in train_raw_copy.columns if '_cat' in col] #Alternative List Comprehension Approach
BinaryLevel_cat_col = metadata.loc[metadata['type'] == 'categorical']['varname'] # Uses the metadata we made earlier
BinaryLevel_cat_col = list(BinaryLevel_cat_col)
BinaryLevel_cat_col.remove('id')

for c in BinaryLevel_cat_col:
    train_raw_copy[c] = train_raw_copy[c].astype('uint8')
    test_raw[c] = test_raw[c].astype('uint8')

In [31]:
# Bin_Binary
# NominalLevel_bin_col = [col for col in train_raw_copy.columns if 'bin' in col] #Alternative List Comprehension Approach
NominalLevel_bin_col = metadata.loc[metadata['type'] == 'binary']['varname'] # Uses the metadata we made earlier
NominalLevel_bin_col = list(NominalLevel_bin_col)
NominalLevel_bin_col.remove('target')

for c in NominalLevel_bin_col:
    train_raw_copy[c] = train_raw_copy[c].astype('uint8')
    test_raw[c] = test_raw[c].astype('uint8')

In [32]:
# Other_Others / Numerical
# Shortcut list comprehension method
other_col = [c for c in train_raw_copy.columns if c not in BinaryLevel_cat_col + NominalLevel_bin_col]
other_col.remove('id')
other_col.remove('target')
OrdinalLevel_other_col = [c for c in other_col if train_raw_copy[c].dtypes == 'int64']
IntervalLevel_other_col = [c for c in other_col if train_raw_copy[c].dtypes == 'float64']

# Now to check again what we have cleaned up so far!

*Unhide to view output

In [34]:
basic_details(train_raw_copy)

**4.Prepare Data - A**

Here we prepare the data by setting or assigning variables. This makes things easier when we chart graphs, conduct feature selection, model the dataset as we proceed.

In [35]:
# Break-Down WITHOUT 'id' & 'target'
# Categorical_cat
Categorical = BinaryLevel_cat_col
# Binary_bin
Binary = NominalLevel_bin_col
# Integer_'int'_Ordinal
Integer = OrdinalLevel_other_col
# Real_'float'_Interval
Real = IntervalLevel_other_col


# Original
Original_All_w = train_raw_copy.columns.tolist()  # Index.get_values() was removed in pandas 1.0
# Original WITHOUT 'id' & 'target'
Original_All_wo = [c for c in train_raw.columns if c not in ['id', 'target']]


# Converted Dtypes WITHOUT FeatureEngineering
Converted_dtypes_All_wo = Categorical + Binary + Integer + Real


# For Graph Chart Plots
# W/O 'id' & 'target'
Categorical_Chart_wo = Categorical
Binary_Chart_wo = Binary
Integer_Chart_wo = Integer
Real_Chart_wo = Real


# For Feature Selection /OR Interaction Building /OR Pre- Model Benchmarks
Features_PreSelect_Original = Original_All_wo
Features_PreSelect = Converted_dtypes_All_wo

Final Checks on Data for Model

*Unhide to view output

In [36]:
# Missing values
print(train_raw_copy.isnull().sum())
print(test_raw.isnull().sum())

In [37]:
# Stats
basic_details(train_raw_copy)
basic_details(test_raw)

**5.Exploratory Data Analysis (EDA)**

Now with a neater data-set, we can proceed to do some EDA to get a clearer idea of what relationships or abnormal relationships exist among the features. These may include outliers, skewed data, reasonableness checks, feature selection etc.

**5.1** Balance of dataset (Target variable)

In [41]:
"""target (i.e.Target Variable)"""
print("Exploring target (i.e.Target Variable)...")

# List Comprehension
class_0 = [c for c in train_raw_copy['target'] if c == 0]
class_1 = [c for c in train_raw_copy['target'] if c == 1]
# # Alternative value_counts Method
# # (these are already counts, so skip the len() calls below)
# class_0 = train_raw_copy['target'].value_counts()[0]
# class_1 = train_raw_copy['target'].value_counts()[1]

class_0_count = len(class_0)
class_1_count = len(class_1)

print("Target Variable Balance...")
print("Total number of class_0: {}".format(class_0_count))
print("Total number of class_1: {}".format(class_1_count))
print("Event rate: {} %".format(round(class_1_count/(class_0_count+class_1_count) * 100, 3)))   # round 3.dp
print('-' * ZZ)

# Plot
sns.countplot(x="target", data=train_raw_copy)  # newer seaborn requires the keyword 'x='
plt.show()

Quick Commentary: 

Not good... we have a very unbalanced dataset. However, fixing this isn't the objective of this Kernel, so we will take it as it is. The focus is still on comparing ensemble models!
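
To see why the imbalance matters for the accuracy scores used later, here is a quick sketch on a made-up target with a 4% event rate (an illustrative figure, not the real one). It computes the class counts with `value_counts` and the accuracy you would get by always predicting the majority class.

```python
import pandas as pd

# Made-up target: 96 zeros and 4 ones
toy_target = pd.Series([0] * 96 + [1] * 4, name='target')

counts = toy_target.value_counts()

# For a 0/1 target, the mean is the share of ones (the event rate)
event_rate = toy_target.mean()

# Accuracy of a "model" that always predicts the most common class
majority_baseline = counts.max() / counts.sum()

print(counts)
print("Event rate: {:.1%}".format(event_rate))
print("Majority-class accuracy: {:.1%}".format(majority_baseline))
```

Any accuracy score should be read against that baseline, which is one reason ROC AUC is used for the comparisons later on.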

Moving on!

**5.2** Uni-variates

Univariate - Categorical


In [38]:
# Bar Plot # N/A
# Density Plot  # Chosen as opposed to histogram since this doesnt need bins parameter
print("Plotting Density Plot...for Categorical")
i = 0

# Single out the 'target' & those that are not for easy reference
t1 = train_raw_copy.loc[train_raw_copy['target'] != 0]
t0 = train_raw_copy.loc[train_raw_copy['target'] == 0]

sns.set_style('whitegrid')
# plt.figure()
fig, ax = plt.subplots(4, 4, figsize=(8, 8))

for feature in BinaryLevel_cat_col:
    i += 1
    plt.subplot(4, 4, i)
    sns.kdeplot(t1[feature], bw_method=0.5, label="target = 1")  # 'bw' was renamed 'bw_method' in newer seaborn
    sns.kdeplot(t0[feature], bw_method=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=10)
    plt.xlabel(feature, fontsize=10)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=10)
plt.show()

Quick Commentary: we can easily see ps_car_11_cat is highly skewed.

Univariate - Binary

In [39]:
# Bar Plot   # N/A
# Density Plot  # Chosen
print("Plotting Density Plot...for Binary")
i = 0
t1 = train_raw_copy.loc[train_raw_copy['target'] != 0]
t0 = train_raw_copy.loc[train_raw_copy['target'] == 0]

sns.set_style('whitegrid')
# plt.figure()
fig, ax = plt.subplots(4, 4, figsize=(8, 8))

for feature in NominalLevel_bin_col:
    i += 1
    plt.subplot(4, 4, i)
    sns.kdeplot(t1[feature], bw_method=0.5, label="target = 1")  # 'bw' was renamed 'bw_method' in newer seaborn
    sns.kdeplot(t0[feature], bw_method=0.5, label="target = 0")
    plt.ylabel('Density plot', fontsize=10)
    plt.xlabel(feature, fontsize=10)
    locs, labels = plt.xticks()
    plt.tick_params(axis='both', which='major', labelsize=10)
plt.show()

Quick Commentary: Nothing abnormal

Univariate - Ordinal / Int

In [40]:
# Bar Plot   # N/A
# Density Plot   # N/A
# Violin Plot   # Chosen
print("Plotting Violin Plot...for Ordinal_Int")
sns.set_style("whitegrid")  # Chosen
for col in OrdinalLevel_other_col:
    ax = sns.violinplot(x="target", y=col, data=train_raw_copy)
    plt.show()

Quick Commentary: ps_ind_14 seems highly skewed too...

Univariate - Interval / Float

In [42]:
# Bar Plot   # N/A
# Density Plot   # N/A
# Violin Plot   # Chosen
print("Plotting...for Interval_Float")
sns.set_style("whitegrid")  # Chosen
for col in IntervalLevel_other_col:
    ax = sns.violinplot(x="target", y=col, data=train_raw_copy)
    plt.show()

Quick Commentary: ps_car_12 & 13 & 15 seems highly skewed as well...

Uni-variates Summative: Highly skewed (ps_car_11_cat, ps_ind_14, ps_car_12 & 13 & 15)

Let's keep that in mind and revisit them later during the model ensemble comparisons for Feature Importance.

**5.3** Bi-variates

Do note that I have only taken a random sample of 800 rows to reduce the computational cost of plotting the entire dataset.
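
One thing worth knowing about sampling: without a seed, every run draws different rows, so the plots change between runs. A minimal sketch of a reproducible sample on synthetic data (the frame, seed and 4% event rate are all made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'ps_reg_01': rng.random(10_000),
    'ps_reg_02': rng.random(10_000),
    'target': (rng.random(10_000) < 0.04).astype(int),
})

# The same random_state always returns the same rows
sample_a = toy.sample(n=800, random_state=42)
sample_b = toy.sample(n=800, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
print(sample_a['target'].value_counts())
```

With a target this rare, a sample of 800 rows only contains a few dozen positives, so treat the pair plots as a rough visual check rather than firm evidence.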

Bivariate - Categorical

In [43]:
# Set sample size to reduce computational cost
sample_SIZE = 800
sample = train_raw_copy.sample(sample_SIZE)
BinaryLevel_cat_col.extend(['target'])  # Add 'target' into list
var = BinaryLevel_cat_col
sample = sample[var]
g = sns.pairplot(sample,  hue='target', palette='Set1', height=1, diag_kind='kde', plot_kws={"s": 8})  # 'size' was renamed 'height' in seaborn 0.9
plt.show()
BinaryLevel_cat_col.remove('target')  # Remove 'target' from list to revert to original

Quick Commentary: Doesn't seem to bear any clear collinearity

Bivariate - Binary

In [44]:
# Set sample size to reduce computational cost
sample_SIZE = 800
sample = train_raw_copy.sample(sample_SIZE)
NominalLevel_bin_col.extend(['target'])  # Add 'target' into list
var = NominalLevel_bin_col
sample = sample[var]
g = sns.pairplot(sample,  hue='target', palette='Set1', height=1, diag_kind='kde', plot_kws={"s": 8})  # 'size' was renamed 'height' in seaborn 0.9
plt.show()
NominalLevel_bin_col.remove('target') # Remove to revert to original

We can hardly see anything here!!!!!! Let's try switching to a heatmap instead.

In [45]:
cor = train_raw_copy[NominalLevel_bin_col].corr()
plt.figure(figsize=(12, 9))
sns.heatmap(cor)
plt.show()

Quick Commentary: OK, since there are no correlations above 0.3

Bivariate - Ordinal / Int

In [46]:
# Set sample size to reduce computational cost
sample_SIZE = 800
sample = train_raw_copy.sample(sample_SIZE)
OrdinalLevel_other_col.extend(['target'])  # Add 'target' into list
var = OrdinalLevel_other_col
sample = sample[var]
g = sns.pairplot(sample,  hue='target', palette='Set1', height=1, diag_kind='kde', plot_kws={"s": 8})  # 'size' was renamed 'height' in seaborn 0.9
plt.show()
OrdinalLevel_other_col.remove('target') # Remove to revert to original

Quick Commentary: Doesn't seem to bear any clear collinearity

Bivariate - Interval / Float

In [47]:
# Set sample size to reduce computational cost
sample_SIZE = 800
sample = train_raw_copy.sample(sample_SIZE)
IntervalLevel_other_col.extend(['target'])  # Add 'target' into list
var = IntervalLevel_other_col
sample = sample[var]
g = sns.pairplot(sample,  hue='target', palette='Set1', height=1, diag_kind='kde', plot_kws={"s": 8})  # 'size' was renamed 'height' in seaborn 0.9
plt.show()
IntervalLevel_other_col.remove('target') # Remove to revert to original

Quick Commentary: 

ps_reg_01 & ps_reg_02 seem to bear some positive linear relationship

ps_car_12 & ps_car_13 seem to bear some positive linear relationship

ps_car_15 & ps_car_13 seem to bear some strong exponential relationship

Just as before, we will note this during the model ensemble comparisons for Feature Importance.

Multivariate

In [48]:
cor = train_raw_copy[Features_PreSelect].corr()
plt.figure(figsize=(12, 9))
sns.heatmap(cor)
plt.show()

Quick Commentary: Ok since no exceptionally high correlations. Except for ps_car_04_cat against ps_car_12 & 13.

# 5.4 EDA Summative notes for Feature Importance comparison

Uni-variate: 
ps_car_11_cat <> 
ps_ind_14 <> 
ps_car_12 <> 
ps_car_13 <> 
ps_car_15

Bi-Variate: 
ps_reg_01 & ps_reg_02 <> 
ps_car_12 & ps_car_13 <> 
ps_car_15 & ps_car_13

Multi-variate: 
ps_car_04_cat & ps_car_12 <> 
ps_car_04_cat & ps_car_13

**6.Parameter Tuning**

You could run this, but I've left it out due to the huge computational cost that comes with it. 

It simply re-runs the grid search for each param_test in turn and logs the best parameters.

In [None]:
##################Lasso Parameter C Tuning
# # COMMENT: Best Parameter was found as {'logisticregression__C': 0.1}
#
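# from sklearn.pipeline import make_pipeline
# from sklearn.model_selection import GridSearchCV
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LogisticRegression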
#
# X = train_raw_copy.drop(['id', 'target'], axis = 1)
# y = train_raw_copy['target']
#
# # # {'logisticregression__C': [1, 10, 100, 1000]
# param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}
# pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty='l1', solver='liblinear'))  # l1 needs a solver that supports it
# grid = GridSearchCV(pipe, param_grid, cv=10)
# grid.fit(X, y)
# print(grid.best_params_)
#
#
#
# #################XGB Classifier Tuning
# from sklearn.model_selection import GridSearchCV
# from xgboost.sklearn import XGBClassifier
# from sklearn.preprocessing import StandardScaler

# # Substitute this after every run for new parameter grid to test
# param_test1 = {
#  'classifier__max_depth': range(3, 10, 2),
#  'classifier__min_child_weight': range(1, 6, 2)
# }
#
# param_test2 = {
#  'classifier__gamma': [i/10.0 for i in range(0, 5)]
# }
#
# param_test3 = {
#  'classifier__learning_rate': [0.1, 0.01, 0.001],
#  'classifier__n_estimators': [100, 140, 200]
# }
#
# #Log down the best parameters
# # 'classifier__gamma': 0,
# # 'classifier__max_depth': 7,
# # 'classifier__min_child_weight': 5
#       
# print("Tuning XGBClassifier Parameters")
# #
# from sklearn.pipeline import make_pipeline
# from sklearn.pipeline import Pipeline
# print("Making XGBClassifier-Pipeline...")
# pipeXGBC = Pipeline([('scaler', StandardScaler()),
#                       ('classifier', XGBClassifier(gamma=0, max_depth=7, min_child_weight=5))])
# print("Running XGBClassifier-Pipeline Parameters GridSearchCV...")
# gsearchXGBC2 = GridSearchCV(pipeXGBC, cv=5, param_grid=param_test3)
# print("Fitting XGBClassifier-Pipeline Parameters GridSearchCV...")
# gsearchXGBC2.fit(X_train, y_train)
# print("Running XGBClassifier-Pipeline GridSearchCV Scores...")
# print(gsearchXGBC2.cv_results_, gsearchXGBC2.best_params_, gsearchXGBC2.best_score_)
# print("Running XGBClassifier-Pipeline Best Estimator...")
# best_gridXGBC2 = gsearchXGBC2.best_estimator_
# print(best_gridXGBC2)
#
#
# #################Random Forest Classifier Tuning
# # Create the parameter grid based on the results of random search
# param_grid0 = {
#      'bootstrap': [True],
#      'max_depth': [80, 90, 100, 110],
#      'max_features': [2, 3],
#      'min_samples_leaf': [3, 4, 5],
#      'min_samples_split': [8, 10, 12],
#      'n_estimators': [100, 200, 300, 1000]
# }
#
# param_grid1 = {
#     'classifier__bootstrap': [True],
#     'classifier__max_depth': [80, 100],
#     'classifier__max_features': [2, 4],
#     'classifier__min_samples_leaf': [4],
#     'classifier__min_samples_split': [10],
#     'classifier__n_estimators': [100, 200]
# }
#
# param_grid2 = {
#     'classifier__min_samples_leaf': [3, 5],
#     'classifier__min_samples_split': [10],
#     'classifier__n_estimators': [100, 200]
# }
#
# #Log down the best parameters
# #    'classifier__bootstrap': [True],
# #    'classifier__max_depth': [80],
# #    'classifier__max_features': [2],
#
# from sklearn.pipeline import make_pipeline
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# print("Making RFClassifier-Pipeline...")
# pipeRFC2 = Pipeline([('scaler', StandardScaler()),
#                      ('classifier', RandomForestClassifier(bootstrap=True, max_depth=80, max_features=2,
#                                                            criterion='entropy'))])
# print("Running RFClassifier-Pipeline Parameters GridSearchCV...")
# gsearchRFC2 = GridSearchCV(pipeRFC2, cv=5, param_grid=param_grid2)
# print("Fitting RFClassifier-Pipeline Parameters GridSearchCV...")
# gsearchRFC2.fit(X_train, y_train)
# print("Running RFClassifier-Pipeline GridSearchCV Scores...")
# print(gsearchRFC2.cv_results_, gsearchRFC2.best_params_, gsearchRFC2.best_score_)
# print("Running RFClassifier-Pipeline Best Estimator...")
# best_gridRFC2 = gsearchRFC2.best_estimator_
# print(best_gridRFC2)
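
Since the tuning code above is commented out, here is a tiny runnable version of the same pattern that finishes in seconds. It is only a sketch: the data is synthetic (`make_classification`), and the grid, `cv=3` and `scoring='roc_auc'` are my own illustrative choices rather than the settings used for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for the real features/target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear', random_state=0),
)

# make_pipeline names each step after its lowercased class name,
# hence the 'logisticregression__' prefix
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='roc_auc')
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

When the pipeline is built with `Pipeline([('classifier', ...)])` instead, the prefix is whatever step name you chose, e.g. `classifier__max_depth`.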

**7.Feature Select - A**

This segment simply uses LASSO regression (Logistic Regression with an L1 penalty) to validate the initial feature dropping done in 
3.'Data Cleaning' C1 - Correction

As mentioned earlier in Chapter 1, this is the part where we use the original control dataset 'train_raw'.

In short, we are comparing the LASSO accuracy scores before vs. after the drop. Ideally, we should see no difference in accuracy scores, indicating that the drop had no effect on the model accuracy.

**7.1** Pre-Drop Accuracy Score

In [49]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

############# PRE DROPPING FEATURES
##### Organizing to validate C1 Drop
# ONLY C2->Fillna step iterated. NO COLUMNS DROPPED
TempToBeFilled = [c for c in train_raw.columns if train_raw[c].isnull().sum() > 0]
for col in TempToBeFilled:
    # Assign back rather than relying on a chained inplace fillna
    train_raw[col] = train_raw[col].fillna(train_raw[col].mode()[0])

train_x1 = train_raw.drop(columns=['id', 'target'])      
Y1 = train_raw['target'].values

# Preparing train/test split of dataset            
X_train, X_validation, y_train, y_validation = train_test_split(train_x1, Y1, train_size=0.9, random_state=1234)

##### Instantiate Logistic Regression 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Transform data for LogReg fitting
scaler = StandardScaler()
std_data = scaler.fit_transform(X_train.values)

# Establish Model
RandomState=42
model_LogRegLASSO1 = LogisticRegression(penalty='l1', C=0.1, random_state=RandomState, solver='liblinear', n_jobs=1)
model_LogRegLASSO1.fit(std_data, y_train)

# Run Accuracy score without any dropping of features
print("PRE DROPPING FEATURES: Running LASSO Accuracy Score without features drop...")
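# make predictions for validation data and evaluate
# The model was fitted on scaled data, so the validation set must go through the same scaler
y_pred = model_LogRegLASSO1.predict(scaler.transform(X_validation.values))
predictions = [round(value) for value in y_pred]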
accuracy = accuracy_score(y_validation, predictions)
print("PRE Accuracy: %.2f%%" % (accuracy * 100.0))

**7.2** Post-Drop Accuracy Score

In [50]:
############# POST DROPPING FEATURES
train_x2 = train_raw_copy[Features_PreSelect]   
Y2 = train_raw_copy['target'].values  

# Preparing train/test split of dataset            
X_train, X_validation, y_train, y_validation = train_test_split(train_x2, Y2, train_size=0.9, random_state=1234)

##### Instantiate Logistic Regression 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Transform data for LogReg fitting
scaler = StandardScaler()
std_data = scaler.fit_transform(X_train.values)

# Establish Model

model_LogRegLASSO1 = LogisticRegression(penalty='l1', C=0.1, random_state=RandomState, solver='liblinear', n_jobs=1)
model_LogRegLASSO1.fit(std_data, y_train)

# Run Accuracy score with features dropped
print("POST DROPPING FEATURES: Running LASSO Accuracy Score with features dropped...")
# make predictions for validation data and evaluate
# Apply the fitted scaler to the validation set before predicting
y_pred = model_LogRegLASSO1.predict(scaler.transform(X_validation.values))
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_validation, predictions)
print("POST Accuracy: %.2f%%" % (accuracy * 100.0))

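Both came out at 96.22% in my run, so there is no difference in accuracy scores! Bear in mind that with a target this unbalanced, a model that predicts class 0 for every row also scores a high accuracy, so matching accuracy is only a rough check. Moving on...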
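
A tidier way to keep the scaling consistent between training and validation is to wrap the scaler and the model in one pipeline, and to compare the result against a majority-class baseline. The sketch below runs on synthetic, deliberately unbalanced data; the 96/4 class weights and the use of `DummyClassifier` are illustrative assumptions, not results from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unbalanced stand-in for the real data
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.96, 0.04], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.9, random_state=1234, stratify=y_demo)

# The pipeline fits the scaler on the training data only and
# automatically re-applies it whenever predict() is called
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42),
)
model.fit(X_tr, y_tr)
model_acc = accuracy_score(y_val, model.predict(X_val))

# Baseline that ignores the features and always predicts the most common class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_val, baseline.predict(X_val))

print("Model accuracy:    %.2f%%" % (model_acc * 100.0))
print("Baseline accuracy: %.2f%%" % (baseline_acc * 100.0))
```

If the two numbers are close, accuracy is telling us little about the model, which is why the later chapters lean on ROC AUC instead.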

# 8.Feature Select & Individual Model Charting - B

Now we will take a similar approach to part A, but include extra models, a mix of BlackBox & WhiteBox models. Note that we now continue without the 'calc' features, since we just validated that dropping them doesn't affect accuracy scores.

    /////I didn't use a bulk loop iteration here as I kept getting an "insufficient memory" error, hence I had to split the models up individually. Interestingly, I could do it with the ROC AUC segment later on... If anyone knows a solution to this, do let me know! Much appreciated!/////

WhiteBox Models> 
LASSO, Ridge, Balanced (class-weighted) Logistic Regression

BlackBox Models>
Extreme Gradient Boosting Classifier, Random Forest Classifier

# 8.1 Model Preparation

We will now prepare the data by first train/test splitting our dataset. Here we intentionally hold out 10% of the train set as a validation set, so that over-fitting can be checked on data the models have not seen.

We also establish a common function that scales each model's feature rankings/coefficients to a common range for comparative purposes, as that is the focus of this Kernel! *Unhide to view output

In [51]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Preparing train/test split of dataset
train_x = train_raw_copy[Features_PreSelect]   
Y = train_raw_copy['target'].values             
X_train, X_validation, y_train, y_validation = train_test_split(train_x, Y, train_size=0.9, random_state=1234)

# Preparing Side to Side Comparative Function
from sklearn.preprocessing import MinMaxScaler

# Generic Function to Normalize Rankings/Coefficients
def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    # Transposes array of 'ranks' into single column array, then applies Fit_Transforms with MinMax
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    # shortcut map & lambda function to round ranks to 2 precision dp
    # Alternatively, you can use a list comprehension here as well. 
    # *See Mean rounding code at Chapter 9.Features (Side To Side Comparison) for example*
    ranks = map(lambda x: round(x, 2), ranks)   
    # Returns names with each respective rounded ranks
    return dict(zip(names, ranks))

names = Features_PreSelect
ranks = {}

print('Prep done...')
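
To make the normalization concrete, here is a minimal dependency-free sketch of the min-max scaling that `rank_to_dict` applies. The helper name `minmax_ranks`, the feature names and the coefficient values are all made up for illustration; they are not taken from the dataset.

```python
# Hypothetical helper mirroring what rank_to_dict does, without sklearn
def minmax_ranks(values, names):
    lo, hi = min(values), max(values)
    span = hi - lo
    # Guard against a zero range (all values equal): map everything to 0.0
    scaled = [0.0 if span == 0 else (v - lo) / span for v in values]
    # Round to 2 decimal places and pair each name with its scaled value
    return {n: round(s, 2) for n, s in zip(names, scaled)}

# Made-up signed coefficients, e.g. from a linear model
toy_names = ["feat_a", "feat_b", "feat_c"]
toy_coefs = [-0.4, 0.1, 0.6]

# Scaling the signed values: the most negative coefficient lands at 0
signed = minmax_ranks(toy_coefs, toy_names)

# Scaling the magnitudes instead: the coefficient closest to zero lands at 0
by_magnitude = minmax_ranks([abs(c) for c in toy_coefs], toy_names)

print(signed)
print(by_magnitude)
```

Note the design choice this exposes: when signed coefficients are scaled directly, a strongly negative coefficient ends up with the lowest score even though it may matter a lot. Scaling the absolute values is one possible alternative if you want the score to reflect magnitude only.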

# 8.2 LASSO

*Unhide to view output

In [53]:
# LASSO via LogisticRegression l1 penalty - WhiteBox Model
print('Running LASSO via LogisticRegression l1 penalty...')
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Transform data for LogReg fitting
scaler = StandardScaler()
std_data = scaler.fit_transform(X_train.values)

# Establish Model
model_LogRegLASSO = LogisticRegression(penalty='l1', C=0.1, random_state=RandomState, solver='liblinear', n_jobs=1)
model_LogRegLASSO.fit(std_data, y_train)

# For Side To Side
ranks["LogRegLASSO"] = rank_to_dict(list(map(float, model_LogRegLASSO.coef_.reshape(len(Features_PreSelect), -1))),
                                    names, order=1)
print(ranks["LogRegLASSO"])


######Alternative Direct Methods:

### Method 1 without Coefficients shown
#from sklearn.feature_selection import SelectFromModel
#model = SelectFromModel(model_LogRegLASSO, prefit=True)
#X_new = model.transform(X_train)
#print("New Shape", X_new.shape)
#print("Old Shape", X_train.shape)


### Method 2 with Coefficients shown
# Set df to append
#zero_feat = []
#nonzero_feat = []

# Loop through feature coefficients & append accordingly
#num_features = len(X_train.columns)
#for i in range(num_features):
#    coef = model_LogRegLASSO.coef_[0, i]
#    if coef == 0:
#        zero_feat.append(X_train.columns[i])
#    else:
#        nonzero_feat.append((coef, X_train.columns[i]))
#print('Features that have coefficient of 0 are: ', zero_feat, '\n')
#print('Features that have non-zero coefficients are:')
#print(sorted(nonzero_feat, reverse=True))

In [91]:
# Plotting
import operator
listsLASSO = sorted(ranks["LogRegLASSO"].items(), key=operator.itemgetter(1))
# convert list>array>dataframe
dfLASSO = pd.DataFrame(np.array(listsLASSO).reshape(len(listsLASSO),2), columns = ['Features','Ranks']).sort_values('Ranks') 
dfLASSO['Ranks']=dfLASSO['Ranks'].astype(float)
#df.sort_values('Ranks', ascending=True)

# Pass figsize directly: setting rcParams after the plot call only affects later figures
dfLASSO.plot.bar(x='Features', y='Ranks', color='blue', figsize=(7, 10))
plt.xticks(rotation=90)
plt.show()

# 8.3 Ridge

*Unhide to view output

In [55]:
# Ridge via LogisticRegression l2 penalty - WhiteBox Model
print('Running Ridge via LogisticRegression l2 penalty...')
# Establish Model
model_LogRegRidge = LogisticRegression(penalty='l2', C=0.1, random_state=RandomState, solver='liblinear', n_jobs=1)
model_LogRegRidge.fit(std_data, y_train)

# For Side To Side
ranks["LogRegRidge"] = rank_to_dict(list(map(float, model_LogRegRidge.coef_.reshape(len(Features_PreSelect), -1))),
                                    names, order=1)
print(ranks["LogRegRidge"])

In [103]:
# Plotting
import operator
listsRidge = sorted(ranks["LogRegRidge"].items(), key=operator.itemgetter(1))
dfRidge = pd.DataFrame(np.array(listsRidge).reshape(len(listsRidge),2), columns = ['Features','Ranks']).sort_values('Ranks') # convert list>array>dataframe
dfRidge['Ranks']=dfRidge['Ranks'].astype(float)
#df.sort_values('Ranks', ascending=True)

# Pass figsize directly: setting rcParams after the plot call only affects later figures
dfRidge.plot.bar(x='Features', y='Ranks', color='blue', figsize=(12, 8))
plt.xticks(rotation=90)
plt.show()

# 8.4 Logistic Regression Balance weighted

*Unhide to view output

In [57]:
# LogisticRegression Standard 'Balanced' weighted - WhiteBox Model
print('Running LogisticRegression Balanced...')
# Establish Model
model_LogRegBalance = LogisticRegression(class_weight='balanced', C=0.1, random_state=RandomState, solver='liblinear',
                                         n_jobs=1)
model_LogRegBalance.fit(std_data, y_train)

# For Side To Side
ranks["LogRegBalance"] = rank_to_dict(list(map(float, model_LogRegBalance.coef_.reshape(len(Features_PreSelect), -1))),
                                      names, order=1)
print(ranks["LogRegBalance"])

In [104]:
#Plotting
import operator
listsBal = sorted(ranks["LogRegBalance"].items(), key=operator.itemgetter(1))
dfBal = pd.DataFrame(np.array(listsBal).reshape(len(listsBal),2), columns = ['Features','Ranks']).sort_values('Ranks') # convert list>array>dataframe
dfBal['Ranks']=dfBal['Ranks'].astype(float)
#df.sort_values('Ranks', ascending=True)

# Pass figsize directly: setting rcParams after the plot call only affects later figures
dfBal.plot.bar(x='Features', y='Ranks', color='blue', figsize=(7, 10))
plt.xticks(rotation=90)
plt.show()

# 8.5 Extreme Gradient Boosting

*Unhide to view output

In [61]:
# Extreme Gradient Boosting Classifier - BlackBox Model
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance

print("Running XGBClassifier Feature Importance Part 1...")
model_XGBC = XGBClassifier(objective='binary:logistic',
                           max_depth=7, min_child_weight=5,
                           gamma=0,
                           learning_rate=0.1, n_estimators=100,)
model_XGBC.fit(X_train, y_train)
print("XGBClassifier Fitted")

# For Side To Side
print("Ranking Features with XGBClassifier...")
ranks["XGBC"] = rank_to_dict(model_XGBC.feature_importances_, names)
print(ranks["XGBC"])

In [78]:
#Plotting
# Plot feature importance for feature selection using the built-in function
print("Plotting XGBClassifier Feature Importance")
# Create the axes with the intended size first, then draw onto it
fig, ax = plt.subplots(figsize=(5, 10))
plot_importance(model_XGBC, ax=ax)
plt.show()

# 8.6 Random Forest Classifier

*Unhide to view output

In [79]:
# Random Forest Classifier - BlackBox Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model_RFC = RandomForestClassifier(bootstrap=True, max_depth=80,
                                   criterion='entropy',
                                   min_samples_leaf=3, min_samples_split=10, n_estimators=100)
model_RFC.fit(X_train, y_train)

# For Side To Side
print("Ranking Features with RFClassifier...")
ranks["RFC"] = rank_to_dict(model_RFC.feature_importances_, names)
print(ranks["RFC"])

In [83]:
#Plotting
# For Chart
importance = pd.DataFrame({'feature': X_train.columns, 'importance': np.round(model_RFC.feature_importances_, 3)})
importance_sorted = importance.sort_values('importance', ascending=False).set_index('feature')
# Plot feature importance for feature selection using the built-in pandas plotting
#print(importance_sorted)
# Pass figsize directly: setting rcParams after the plot call only affects later figures
importance_sorted.plot.bar(figsize=(10, 20))
plt.show()

# 9.Features (Side To Side Comparison)

Now we will collate all the feature coefficients & normalize them for a scaled comparison across all of them. This uses the "rank_to_dict" function we defined at "8.Feature Select - B".

# **9.1** (Coefficient Values) Quick Easy Method

In [105]:
pd.options.display.max_columns = 100
##### Collate Feature Coefficients Side by Side
print("Collating Side To Side Feature Scores...")

######## Easy quick print Method
# Create empty dictionary to store the mean value calculated across all the scores
r = {}
for name in names:
    # This is the alternative rounding method from the earlier map & lambda combination
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)

methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print("\t%s" % "\t".join(methods))
for name in names:
    print("%s\t%s" % (name, "\t".join(map(str, [ranks[method][name] for method in methods]))))

Evidently, this method isn't for the tidy-minded analyst... the column alignments are all off.
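
If you want to stay with plain `print`, fixed-width format specifiers are one way to line the columns up instead of relying on tabs. The sketch below is only an illustration: `format_table` is a hypothetical helper, and the scores in `toy_ranks` are made up rather than taken from the models above.

```python
# Made-up scores in the same nested-dict shape as 'ranks' (method -> feature -> score)
toy_ranks = {
    "LogRegLASSO": {"ps_car_13": 1.0, "ps_ind_05_cat": 0.42},
    "XGBC": {"ps_car_13": 0.87, "ps_ind_05_cat": 0.3},
}

def format_table(ranks, names):
    methods = sorted(ranks)
    # Width of the feature-name column = longest feature name
    name_w = max(len(n) for n in names)
    # Width of every score column = longest method name plus some padding
    col_w = max(len(m) for m in methods) + 2
    # Header row: blank name column, then right-aligned method names
    lines = [" " * name_w + "".join(m.rjust(col_w) for m in methods)]
    for n in names:
        # Right-align each score to the column width with 2 decimal places
        cells = "".join(f"{ranks[m][n]:>{col_w}.2f}" for m in methods)
        lines.append(n.ljust(name_w) + cells)
    return "\n".join(lines)

table = format_table(toy_ranks, ["ps_car_13", "ps_ind_05_cat"])
print(table)
```

Because every cell is padded to the same width, all rows come out the same length regardless of how long the feature or method names are.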

Now let's instead use a much more presentable method!

# **9.2** (Coefficient Values) Neat DataFrame Method

In [109]:
######## Alternatively, set into Dataframe. Advantage is that we can plot here.
# Loop through dictionary of scores to append into a dataframe
row_index = 0
AllFeatures_columns = ['Feature', 'Scores']
AllFeats = pd.DataFrame(columns=AllFeatures_columns)
for name in names:
    AllFeats.loc[row_index, 'Feature'] = name
    AllFeats.loc[row_index, 'Scores'] = [ranks[method][name] for method in methods]
        
    row_index += 1

# Here the dataframe scores are a list in a list. 
# To split them, we convert the 'Scores' column from a dataframe into a list & back into a dataframe again
AllFeatures_only = pd.DataFrame(AllFeats.Scores.tolist(), )
# Now to rename the column headers
AllFeatures_only.rename(columns={0:'LogRegBalance',1:'LogRegLASSO',2:'LogRegRidge',
                                     3:'Random ForestClassifier',4:'XGB Classifier', 5:'Mean'},inplace=True)
AllFeatures_only = AllFeatures_only[['LogRegBalance','LogRegLASSO','LogRegRidge', 
                                           'Random ForestClassifier', 'XGB Classifier', 'Mean']]
# Now to join both dataframes
AllFeatures_compare = AllFeats.join(AllFeatures_only).drop(['Scores'],  axis=1)
display(AllFeatures_compare)

Perfect ain't it!!! Now likewise for the plotting

# **Plot the Normalized Feature Rankings/Coefficients/Gini Importance**

# **9.3** (Plotting) Quick Method

In [110]:
#Plotting
df = AllFeatures_compare.melt('Feature', var_name='cols',  value_name='vals')
g = sns.factorplot(x="Feature", y="vals", hue='cols', data=df, size=10, aspect=2)

plt.xticks(rotation=90)
plt.show()

# **9.4** (Plotting) Sorted Neat Method

**Alternatively, a better plot could be used, where we sort by the mean first.** Similar to what an Excel Pivot Chart does.

In [111]:
AllFeatures_compare_sort = AllFeatures_compare.sort_values(by=['Mean'], ascending=True)
order_ascending = AllFeatures_compare_sort['Feature']
#Plotting
df2 = AllFeatures_compare_sort.melt('Feature', var_name='cols',  value_name='vals')
# ONLY difference is that now we use 'order' to sort the x-axis by ascending Mean
# (row_order only applies to facet rows, which this plot doesn't have)
g2 = sns.factorplot(x="Feature", y="vals", hue='cols', data=df2, size=10, aspect=2, order=order_ascending)

plt.xticks(rotation=90)
plt.show()

# Quick Commentary:

* Some **Observations** include:

1.Noise

Amongst the WhiteBox Models, LASSO comparatively seems the noisiest.

2.Mean alignment

WhiteBox Models tend to align closer

3.RandomForest & XGB Classifier

Closely aligned


* Some **Subjective** conclusions:

1.White-Box models (LASSO) are the better option when determining the optimal *Feature* *Quantity*. 

The L1 penalty keeps the top performing features while forcing the coefficients of the remaining features to zero (or close to it), as is evident from the steep slope.


2.Black-Box Models are the better option for determining Feature Interactions or specific *Feature Quality Selection*.

They generate relatively good accuracy and robustness, thereby easing the resulting data interpretation.
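
As a rough illustration of why LASSO produces that steep slope, here is a sketch under a textbook simplification that is an assumption of mine and not something from this Kernel: for a least-squares problem with orthonormal features, the L1 penalty soft-thresholds each unpenalized coefficient, while the L2 penalty shrinks each one proportionally. Logistic regression does not have this closed form, so treat it purely as intuition. The coefficient values and `lam` are made up.

```python
def soft_threshold(beta, lam):
    # L1 (LASSO) update in the orthonormal least-squares case:
    # pull the coefficient towards zero by lam, and clip at exactly zero
    mag = max(abs(beta) - lam, 0.0)
    if mag == 0.0:
        return 0.0
    return mag if beta > 0 else -mag

def ridge_shrink(beta, lam):
    # L2 (Ridge) update in the same setting: proportional shrinkage, never exactly zero
    return beta / (1.0 + lam)

# Made-up unpenalized coefficients and penalty strength
ols_coefs = [2.0, -1.2, 0.3, -0.1]
lam = 0.5

lasso_coefs = [round(soft_threshold(b, lam), 3) for b in ols_coefs]
ridge_coefs = [round(ridge_shrink(b, lam), 3) for b in ols_coefs]

print("LASSO:", lasso_coefs)
print("Ridge:", ridge_coefs)
```

The two small coefficients are zeroed out by the L1 rule but only shrunk by the L2 rule, which is the behaviour that makes LASSO handy for deciding how many features to keep.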


**10.ROC AUC (Side To Side Comparison)**

Now as before we will collate the ROC & AUC for each model & plot all of them for comparison

Just a tip from a ridiculous error I faced: don't forget to use 'predict_proba' for the roc_curve!! I didn't initially & ended up with a 0.5 score.
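
Here is a small dependency-free sketch of one way a 0.5 score can arise (presumably similar to what happened above, though that is my assumption). AUC can be computed as the share of positive/negative pairs where the positive is scored higher, with ties counted as half. On an unbalanced dataset the predicted probabilities often all sit below 0.5, so the hard class predictions are all 0 and every pair is a tie. The labels and probabilities below are made up, and `auc_from_scores` is a hypothetical helper, not an sklearn function.

```python
def auc_from_scores(y_true, scores):
    # Pairwise (rank-based) AUC: fraction of positive/negative pairs
    # where the positive gets the higher score; ties count as 0.5
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up unbalanced labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.02, 0.03, 0.05, 0.04, 0.10, 0.06, 0.01, 0.08, 0.09, 0.20]

# Hard class predictions at the usual 0.5 cut-off: everything becomes 0
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = auc_from_scores(y_true, proba)
auc_hard = auc_from_scores(y_true, hard)

print("AUC from probabilities:", auc_proba)
print("AUC from hard labels:  ", auc_hard)
```

The probabilities rank the positives well even though none of them crosses 0.5, which is exactly the information that is thrown away when hard labels are fed to the ROC calculation.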

In [125]:
##### Ensemble Comparison of ROC AUC 
from sklearn import model_selection
import matplotlib.pyplot as plt

# run model 10x with 60/30 split, but intentionally leaving out 10% avoiding overfitting
cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=.3, train_size=.6, random_state=0)

print("Charting ROC AUC for Ensembles...")
from sklearn.metrics import roc_curve, auc

# Establish Models
models = [
    {
        'label': 'LASSO',
        'model': model_LogRegLASSO,
    },
    {
        'label': 'Ridge',
        'model': model_LogRegRidge,
    },
    {
        'label': 'LogReg Balance',
        'model': model_LogRegBalance,
    },
    {
        'label': 'XGBoost Classifier',
        'model': model_XGBC,
    },
    {
        'label': 'Random Forest Classifier',
        'model': model_RFC,
    }
]

# Models Plot-loop
for m in models:
    # The LogisticRegression models were fitted on standardized data (std_data),
    # so the validation set must pass through the same already-fitted scaler
    if isinstance(m['model'], LogisticRegression):
        X_val = scaler.transform(X_validation.values)
    else:
        X_val = X_validation
    # Column 1 of predict_proba holds the probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_validation, m['model'].predict_proba(X_val).T[1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (m['label'], roc_auc))

# Set Plotting attributes
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=0, fontsize='small')
plt.show()

**10.1 Brief Annotations**

As expected, given the way the XGBoost algorithm is structured (Boosting: learning from errors & applying error emphasis), it outperforms all the others. That said, we haven't done any Feature Engineering, so this could possibly be a sign of overfitting too.
 
Overall the AUCs also aren't fancy but at least we know how to compare them going forward!

**11.Cross Validation Scores (Side To Side Comparison)**

Now as before we will collate the relevant 'Cross Validated' (CV) accuracy scores for each model & compare all of them. 

In [126]:
##### Ensemble Comparison of Accuracy Scores 
# Set dataframe for appending
pd.options.display.max_columns = 100
Scores_columns = ['Model Name', 'Model Parameters', 'Train Accuracy Mean', 'Test Accuracy Mean']
Scores_compare = pd.DataFrame(columns=Scores_columns)

# Models CV-loop
row_index = 0
for m in models:
    # Name of Model
    Scores_compare.loc[row_index, 'Model Name'] = m['label']
    # Model Parameters
    Scores_compare.loc[row_index, 'Model Parameters'] = str(m['model'].get_params())
    
    # Execute Cross Validation (CV); return_train_score=True so that 'train_score' is available below
    cv_results = model_selection.cross_validate(m['model'], X_train, y_train, cv=cv_split, return_train_score=True)
    # Model Train Accuracy
    Scores_compare.loc[row_index, 'Train Accuracy Mean'] = cv_results['train_score'].mean()
    # Model Test Accuracy
    Scores_compare.loc[row_index, 'Test Accuracy Mean'] = cv_results['test_score'].mean()

    row_index += 1
    
display(Scores_compare)

# 11.1 OVERALL CONCLUSIONS

Interesting deviations in the AUC vs Accuracy scores... Low AUC but high Accuracy scores. 

BUT point to note: the two are derived in totally different ways. In addition, we have to factor in the 'balance' of the class representation in the data to evaluate these metrics holistically. Given that we did not focus much on EDA & Feature Engineering here, there isn't much room for contention... We can also exclude the 'Balanced' LogisticRegression here, since we have no secure knowledge of the data 'balance'.

Furthermore, back to the basic definitions... The Receiver Operating Characteristic (ROC) curve is simply a trade-off graph of the True Positive rate (y-axis) against the False Positive rate (x-axis) at incremental threshold settings, and the AUC is the area under that curve. Ideally, the more the curve bows towards the top left corner the better, as this indicates the ability to correctly detect a high proportion of the cases where the condition is present, at a low expense of false detections.
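
To make the threshold sweep concrete, the sketch below computes a few ROC points by hand. The labels, scores and thresholds are made up, and `roc_points` is a hypothetical helper written for this illustration only (in practice `roc_curve` does this for you).

```python
def roc_points(y_true, scores, thresholds):
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        # Everything scored at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        # (threshold, False Positive rate, True Positive rate)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

# Made-up labels and predicted scores
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

# Lowering the threshold moves us from the bottom left to the top right of the curve
points = roc_points(y_true, scores, [0.8, 0.55, 0.35, 0.0])
for t, fpr, tpr in points:
    print("threshold=%.2f  FPR=%.2f  TPR=%.2f" % (t, fpr, tpr))
```

Each lower threshold catches more true positives but also lets in more false positives; a curve that reaches a high TPR while the FPR is still small is the one that hugs the top left corner.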

Hence, since we didn't even have an adequate proportion of presence conditions to begin with, it further emphasizes why we shouldn't harp too much on AUC ROC scores. Therefore, we can actually place lower weightage on these low AUC scores. 

Thus, I'll narrow in on the accuracy score for now instead. Superficially, the accuracy scores do suggest the models seem optimal. 

Then again, this was an unbalanced dataset, and I'm pretty sure 'real-world' models would be 1000x more complex, which would likely further dampen the scores!!!
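
A caveat worth making concrete when reading accuracy on an unbalanced dataset: a model that never predicts the positive class can still score very high. The sketch below uses a made-up class ratio of 4% positives purely for illustration; it is not a figure computed in this Kernel.

```python
# Made-up unbalanced labels: 40 positives out of 1000 rows
n_total = 1000
n_pos = 40
y_true = [1] * n_pos + [0] * (n_total - n_pos)

# A "model" that always predicts the majority class
y_pred = [0] * n_total

# Accuracy: share of rows predicted correctly
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_total

# Recall: share of actual positives that were detected
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / n_pos

print("Accuracy:", accuracy)
print("Recall:  ", recall)
```

So a high accuracy on its own says little here; comparing it against the majority-class share of the data is a quick sanity check before trusting it.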

BUT more importantly, we now know more coding methods to compare models & neaten coding outputs!!! 

**Recap of Essential Coding Techniques:**

    A.List comprehensions
    B.Samples to reduce computational cost
    C.Concise 'def' functions that can be used repetitively
    D.Pivoting using groupby
    E.When & How to convert and reshape dictionaries into lists or dataframes
    F.Quickly split dataframe columns
    G.loc & conditionals
    H.Loop Sub-plots
    I.Quick Lambda formulae functions
    J.Quick looping print or DataFrame conversion of summative scores
    K.Order plot components 
    L.Create & Plot Bulk Ensemble comparative results

Thank you so much for reading up till the end! Hopefully, you have a better understanding of Data Manipulation in respect of Ensemble Models now!