## **Introduction:**
This tutorial is for beginners learning scikit-learn, a high-level library for supervised and unsupervised machine learning. It is built on top of the NumPy and SciPy libraries, which handle the lower-level numerical work.

### Author: Abdelwahed Ashraf 

### Linkedin: [Link](https://www.linkedin.com/in/abdelwahed-ashraf-090523169/)

### Kaggle: [Link](https://www.kaggle.com/abdelwahed43)

**The sequence of steps is as follows:**

* Check for missing values in the dataset
* Pre-process the data by splitting it into train and test sets
* Train the classification models
* Scale the features by standardizing and normalizing the data
    * Observe the effect on accuracy
* Reduce the dimension of the data using PCA
    * Observe the effect on accuracy
* Apply a Gaussian Mixture and Grid Search
    * Observe the effect on accuracy
* Fit our best model



## Outline
The following sections are included in this notebook:

### A. [Load and Parse Data](#section-one)

### B. [Exploratory Data Analysis (EDA)](#section-two)
   1. [Missing Data](#section-two-a)    
   
    

      
        
### C. [Fit and Evaluate the Model](#section-four)
   1. [Cross-Validation](#section-four-a)
       
       * Naive Bayes classifier
       * KNN or k-Nearest Neighbors  
       * Random Forest 
       * Logistic Regression             
       * Support Vector Machines  
       * Decision Tree
       * XGBOOST Classifier 
       * AdaBoosting Classifier
       * GradientBoostingClassifier 
       * HistGradientBoostingClassifier
       * Principal Component Analysis (PCA) 
       * Gaussian Mixture 
       * Grid Search
      
   
   2. [Model Stacking](#section-four-c)
    
### D. [Predict Test Dataset and Submit](#section-five)

<a id="section-one"></a>
# A. Load and Parse Data

The pandas package helps us work with our datasets. We start by loading the training and test datasets into pandas DataFrames.
Later we also combine these datasets to run certain operations on both together.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.


# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# read csv (comma separated value) into data
train = pd.read_csv('../input/data-science-london-scikit-learn/train.csv', header=None)
trainLabel = pd.read_csv('../input/data-science-london-scikit-learn/trainLabels.csv', header=None)
test = pd.read_csv('../input/data-science-london-scikit-learn/test.csv', header=None)
print(plt.style.available) # look at available plot styles
plt.style.use('ggplot')

<a id="section-two"></a>
# B. Exploratory Data Analysis (EDA)
The purpose of EDA is to get familiar with our data, but not so familiar that we begin making assumptions about the model fit! In Kaggle competitions it can be tempting to overfit the training data in hopes of a better leaderboard score, but this often doesn't bode well for real-world applications. Typically, it's best to let the data speak for itself and allow the model the flexibility to find correlations between the target and the features. After all, today's models are very robust.

### Do Preprocessing Later!
This is really more of a personal opinion. I find it hard to keep track of data processing done in cells throughout an EDA section. Typically, I prefer to do all the preprocessing in a single code block or even better in a pipeline. This way I know the preprocessing is being applied the same way to the train, validation, and test datasets. I use EDA as a way to identify the preprocessing steps that need to take place and potential feature engineering opportunities. 

Remember, it's best to do preprocessing in a pipeline!!!
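
As a rough illustration of the pipeline idea, here is a minimal sketch on synthetic data. The data from `make_classification`, the step names `"scale"` and `"knn"`, and the choice of a KNN classifier are all assumptions made for this example, not part of the competition setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler is fitted on the training split only and is reused,
# unchanged, whenever the pipeline predicts on new data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_tr, y_tr)

val_accuracy = pipe.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Because the scaler lives inside the pipeline, the same fitted transformation is applied to the train, validation, and test data without any extra bookkeeping.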

In this section I will explore the following common issues:
1. Missing Data


### B.1 Check the Shape of the Training and Test Sets

In [None]:
print('Data Description')
print('The shape of our training set: ', train.shape[0], 'rows', 'and', train.shape[1], 'columns')
print('The shape of our training labels: ', trainLabel.shape[0], 'rows', 'and', trainLabel.shape[1], 'column')
print('The shape of our testing set: ', test.shape[0], 'rows', 'and', test.shape[1], 'columns')



### B.2 Analyze by describing data
Pandas also helps us describe the datasets and answer the following questions early in our project.


### Which features are available in the dataset?



In [None]:
print(train.columns.values)

In [None]:
# preview the data from head
train.head(3)

### B.3 Descriptive statistics
***split training data into numeric and categorical data***

In [None]:
# split the training data into numeric and categorical columns
numeric = train.select_dtypes(exclude='object')
categorical = train.select_dtypes(include='object')

#### Which features are categorical?
 * These values classify the samples into sets of similar samples.
 * Within categorical features, are the values nominal, ordinal, ratio, or interval based?
 * Among other things this helps us select the appropriate plots for visualization.

In [None]:
print("\nNumber of categorical features : ",(len(categorical.axes[1])))
print("\n", categorical.axes[1])
categorical.head()

In [None]:
##train.describe(include=['O'])

#### Which features are numerical?
* These values change from sample to sample.
* Within numerical features, are the values discrete, continuous, or time-series based?
* Among other things this helps us select the appropriate plots for visualization.
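
To make the numeric/categorical split concrete, here is a small sketch on a toy DataFrame. The frame `toy` and its column names are hypothetical; the string column is given an explicit `object` dtype so that the `select_dtypes` calls behave the same way as in the cells above.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column
toy = pd.DataFrame({
    "height": [1.70, 1.82, 1.65],
    "count": [3, 5, 2],
    "colour": pd.Series(["red", "blue", "red"], dtype="object"),
})

toy_numeric = toy.select_dtypes(exclude="object")
toy_categorical = toy.select_dtypes(include="object")

print("numeric:", list(toy_numeric.columns))
print("categorical:", list(toy_categorical.columns))
```

On a dataset whose columns are all numeric, the categorical frame simply comes back with zero columns.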

In [None]:
print("\nNumber of numeric features : ",(len(numeric.axes[1])))
print("\n", numeric.axes[1])

In [None]:
train.describe()

### What are the data types for various features?

* Knowing the data types helps us decide which features need converting.

### Quantitative data:
* Discrete or continuous.
* Measures of values or counts, expressed as numbers.
* Data about numeric variables (e.g. how many, how much, or how often).

### Qualitative data:
* Ordinal or nominal.
* Measures of 'types', which may be represented by a name, symbol, or number code.
* Data about categorical variables (e.g. what type).

#### Note: All features are numerical (quantitative data)

In [None]:
train.info()
print('_'*50)
test.info()

<a id="section-two-a"></a>
### B.4 Missing Data

So far we have gathered several assumptions and decisions about our datasets and solution requirements without changing a single feature or value. Let us now act on them for the correcting, creating, and completing goals.

* Correcting by dropping features
* Completing by filling in missing values
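
The two options in the list above can be sketched on a toy DataFrame. The frame `toy`, the 50% drop threshold, and the median fill are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in columns "a" and "b"
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Share of missing values per column
missing_share = toy.isnull().mean()

# Dropping: remove features that are mostly missing (threshold is arbitrary)
kept = toy.loc[:, missing_share <= 0.5]

# Filling: replace the remaining gaps with the column median
filled = kept.fillna(kept.median())

print(missing_share)
print(filled)
```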


In [None]:
#missing data in Training examples
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data.head(3)

In [None]:
#missing data in Training Label (target label) examples
total = trainLabel.isnull().sum().sort_values(ascending=False)
percent = (trainLabel.isnull().sum()/trainLabel.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data.head(3)

In [None]:
#missing data in Test examples
total = test.isnull().sum().sort_values(ascending=False)
percent = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data.head(3)

In [None]:
print(train.shape)
print(trainLabel.shape)
print(test.shape)

<a id="section-four"></a>
# C. Model, Predict and Solve
- Grid Search
- Naive Bayes classifier
- KNN or k-Nearest Neighbors
- Random Forest
- Logistic Regression
- Support Vector Machines
- Decision Tree
- XGBOOST Classifier
- AdaBoosting Classifier
- GradientBoostingClassifier
- HistGradientBoostingClassifier
- Principal Component Analysis (PCA)
- Gaussian Mixture

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import LabelEncoder

# NAIVE BAYES
from sklearn.naive_bayes import GaussianNB
#KNN
from sklearn.neighbors import KNeighborsClassifier
#RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
#LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression
#SVM
from sklearn.svm import SVC
#DECISION TREE
from sklearn.tree import DecisionTreeClassifier
#XGBOOST
from xgboost import XGBClassifier
#AdaBoosting Classifier
from sklearn.ensemble import AdaBoostClassifier
#GradientBoosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
#HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier, StackingClassifier

from sklearn.model_selection import cross_val_score,StratifiedKFold,GridSearchCV




from sklearn.preprocessing import StandardScaler ,Normalizer , MinMaxScaler, RobustScaler 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

We further split the training set into a train and a test set to validate our models.
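
As a side note, `train_test_split` can also keep the class proportions equal in both parts through its `stratify` argument. The sketch below uses made-up arrays `X_demo` and `y_demo`; the notebook's own split does not pass `stratify`, so treat this as an optional variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, perfectly balanced binary labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101, stratify=y_demo
)

print(X_tr.shape, X_val.shape)
print("share of class 1:", y_tr.mean(), y_val.mean())
```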

In [None]:
X_train,X_test,y_train,y_test = train_test_split(train,trainLabel,test_size=0.30, random_state=101)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

Now for the modeling part. In this first pass we fit the models on the raw, unscaled features to get a baseline.
We use grid search with stratified 10-fold cross-validation for ten algorithms.
For each model we record the best cross-validation score and the accuracy of its predictions on the test split from our train_test_split.
For stacking we report the accuracy on both the train and the test split.
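
The grid-search step described above can be sketched for a single model as follows. The synthetic data, the 5-fold setting (chosen to keep the example fast), and the small KNN grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 8]},
    scoring="accuracy",
    cv=cv,
)
grid.fit(X_tr, y_tr)

cv_accuracy = grid.best_score_  # mean cross-validated accuracy of the best grid point
holdout_accuracy = accuracy_score(y_val, grid.predict(X_val))

print("best parameters:", grid.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, hold-out accuracy: {holdout_accuracy:.3f}")
```

`accuracy_score` gives the same number as summing the diagonal of the confusion matrix and dividing by the total, which is how the cells in this notebook compute it.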

In [None]:
sk_fold = StratifiedKFold(10, shuffle=True, random_state=42)

# No scaling in this first pass: the models see the raw features
X_train_1 = train.values  # full training set
X_submit = test.values    # Kaggle test set

g_nb = GaussianNB()
knn = KNeighborsClassifier()
ran_for  = RandomForestClassifier()
log_reg = LogisticRegression()
svc = SVC()
tree= DecisionTreeClassifier()
xgb = XGBClassifier()

ada_boost = AdaBoostClassifier()
grad_boost = GradientBoostingClassifier(n_estimators=100)
hist_grad_boost = HistGradientBoostingClassifier()




# (name, estimator, parameter grid) for every model; an empty grid keeps the defaults
clf = [("Naive Bayes", g_nb, {}),
       ("K Nearest", knn, {"n_neighbors": [3, 5, 8], "leaf_size": [25, 30, 35]}),
       ("Random Forest", ran_for, {"n_estimators": [100], "random_state": [42], "min_samples_leaf": [5, 10, 20, 40, 50], "bootstrap": [False]}),
       ("Logistic Regression", log_reg, {"penalty": ['l2'], "C": [100, 10, 1.0, 0.1, 0.01], "solver": ['saga']}),
       ("Support Vector", svc, {"kernel": ["linear", "rbf"], "gamma": ['auto'], "C": [0.1, 1, 10, 100, 1000]}),
       ("Decision Tree", tree, {}),
       ("XGBoost", xgb, {"n_estimators": [200], "max_depth": [3, 4, 5], "learning_rate": [.01, .1, .2], "subsample": [.8], "colsample_bytree": [1], "gamma": [0, 1, 5], "lambda": [.01, .1, 1]}),
       ("Adaptive Boost", ada_boost, {"n_estimators": [100], "learning_rate": [.6, .8, 1]}),
       ("Gradient Boost", grad_boost, {}),
       # scikit-learn 1.1 and later call this loss "log_loss"
       ("Histogram GB", hist_grad_boost, {"loss": ["binary_crossentropy"], "min_samples_leaf": [5, 10, 20, 40, 50], "l2_regularization": [0, .1, 1]})]

stack_list = []
train_scores = pd.DataFrame(columns=["Name", "Train Score", "Test Score"])

# Tune each model with grid search, then score the best estimator on the hold-out split
for i, (name, estimator, param_grid) in enumerate(clf):
    # A separate name for the search keeps the `clf` list from being overwritten
    grid = GridSearchCV(estimator, param_grid=param_grid, scoring="accuracy", cv=sk_fold, return_train_score=True)
    grid.fit(X_train, y_train)
    y_pred = grid.best_estimator_.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # Accuracy = correct predictions (diagonal of the confusion matrix) / all predictions
    test_accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # "Train Score" holds the mean cross-validated accuracy of the best parameter set
    train_scores.loc[i] = [name, grid.best_score_, test_accuracy]
    stack_list.append(grid.best_estimator_)
    
est = [("g_nb",stack_list[0]),\
       ("knn",stack_list[1]),\
       ("ran_for",stack_list[2]),\
       ("log_reg",stack_list[3]),\
       ("svc",stack_list[4]),\
       ("dec_tree",stack_list[5]),\
       ("XGBoost",stack_list[6]),\
       ("ada_boost",stack_list[7]),\
       ("grad_boost",stack_list[8]),\
       ("hist_grad_boost",stack_list[9])]








# final_estimator=None falls back to scikit-learn's default, LogisticRegression
sc = StackingClassifier(estimators=est, final_estimator=None, cv=sk_fold, passthrough=False)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
cm1 = confusion_matrix(y_test, y_pred)         # hold-out split
y_pred_train = sc.predict(X_train)
cm2 = confusion_matrix(y_train, y_pred_train)  # training split
# DataFrame.append was removed in pandas 2.0, so add the stacking row with .loc
train_scores.loc[len(train_scores)] = ["Stacking", (cm2[0, 0] + cm2[1, 1]) / cm2.sum(), (cm1[0, 0] + cm1[1, 1]) / cm1.sum()]
train_scores
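
To show the stacking idea in isolation, here is a minimal sketch with only two base models on synthetic data. The data, the two base estimators, and the explicit `LogisticRegression` meta-model are assumptions for the example; the notebook itself stacks all ten tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

# Base models make out-of-fold predictions; the final estimator learns from them
estimators = [
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_val, y_val)
print(f"Stacked hold-out accuracy: {stack_accuracy:.3f}")
```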

### Training Models with Feature Scaling

### **Feature Scaling**

Two approaches are shown below:
1. The **StandardScaler** standardizes each feature by subtracting its mean and dividing by its standard deviation, so that every feature is centred around 0 with a standard deviation of 1. It works best when the features are roughly normally distributed.

2. The **Normalizer** rescales each sample (row) to unit norm by dividing its values by the sample's magnitude in n-dimensional space, where n is the number of features.
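
The difference between the two approaches is easiest to see on a tiny array. The values in `X_toy` below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up data: 3 samples (rows), 2 features (columns) on very different scales
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# StandardScaler works column by column
X_std = StandardScaler().fit_transform(X_toy)

# Normalizer works row by row
X_norm = Normalizer().fit_transform(X_toy)

print("column means after StandardScaler:", X_std.mean(axis=0))
print("column stds after StandardScaler: ", X_std.std(axis=0))
print("row norms after Normalizer:       ", np.linalg.norm(X_norm, axis=1))
```

After standardizing, every column has mean 0 and standard deviation 1; after normalizing, every row has length 1.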

In [None]:
sk_fold = StratifiedKFold(10, shuffle=True, random_state=42)

sc = StandardScaler()
#sc = Normalizer()

# Fit the scaler on the training split only, then reuse it for every other set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train_1 = sc.transform(train.values)  # full training set
X_submit = sc.transform(test.values)    # Kaggle test set


g_nb = GaussianNB()
knn = KNeighborsClassifier()
ran_for  = RandomForestClassifier()
log_reg = LogisticRegression()
svc = SVC()
tree= DecisionTreeClassifier()
xgb = XGBClassifier()

ada_boost = AdaBoostClassifier()
grad_boost = GradientBoostingClassifier(n_estimators=100)
hist_grad_boost = HistGradientBoostingClassifier()




# (name, estimator, parameter grid) for every model; an empty grid keeps the defaults
clf = [("Naive Bayes", g_nb, {}),
       ("K Nearest", knn, {"n_neighbors": [3, 5, 8], "leaf_size": [25, 30, 35]}),
       ("Random Forest", ran_for, {"n_estimators": [100], "random_state": [42], "min_samples_leaf": [5, 10, 20, 40, 50], "bootstrap": [False]}),
       ("Logistic Regression", log_reg, {"penalty": ['l2'], "C": [100, 10, 1.0, 0.1, 0.01], "solver": ['saga']}),
       ("Support Vector", svc, {"kernel": ["linear", "rbf"], "gamma": ['auto'], "C": [0.1, 1, 10, 100, 1000]}),
       ("Decision Tree", tree, {}),
       ("XGBoost", xgb, {"n_estimators": [200], "max_depth": [3, 4, 5], "learning_rate": [.01, .1, .2], "subsample": [.8], "colsample_bytree": [1], "gamma": [0, 1, 5], "lambda": [.01, .1, 1]}),
       ("Adaptive Boost", ada_boost, {"n_estimators": [100], "learning_rate": [.6, .8, 1]}),
       ("Gradient Boost", grad_boost, {}),
       # scikit-learn 1.1 and later call this loss "log_loss"
       ("Histogram GB", hist_grad_boost, {"loss": ["binary_crossentropy"], "min_samples_leaf": [5, 10, 20, 40, 50], "l2_regularization": [0, .1, 1]})]

stack_list = []
train_scores = pd.DataFrame(columns=["Name", "Train Score", "Test Score"])

# Tune each model with grid search, then score the best estimator on the hold-out split
for i, (name, estimator, param_grid) in enumerate(clf):
    # A separate name for the search keeps the `clf` list from being overwritten
    grid = GridSearchCV(estimator, param_grid=param_grid, scoring="accuracy", cv=sk_fold, return_train_score=True)
    grid.fit(X_train, y_train)
    y_pred = grid.best_estimator_.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # Accuracy = correct predictions (diagonal of the confusion matrix) / all predictions
    test_accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # "Train Score" holds the mean cross-validated accuracy of the best parameter set
    train_scores.loc[i] = [name, grid.best_score_, test_accuracy]
    stack_list.append(grid.best_estimator_)
    
est = [("g_nb",stack_list[0]),\
       ("knn",stack_list[1]),\
       ("ran_for",stack_list[2]),\
       ("log_reg",stack_list[3]),\
       ("svc",stack_list[4]),\
       ("dec_tree",stack_list[5]),\
       ("XGBoost",stack_list[6]),\
       ("ada_boost",stack_list[7]),\
       ("grad_boost",stack_list[8]),\
       ("hist_grad_boost",stack_list[9])]








# final_estimator=None falls back to scikit-learn's default, LogisticRegression
sc = StackingClassifier(estimators=est, final_estimator=None, cv=sk_fold, passthrough=False)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
cm1 = confusion_matrix(y_test, y_pred)         # hold-out split
y_pred_train = sc.predict(X_train)
cm2 = confusion_matrix(y_train, y_pred_train)  # training split
# DataFrame.append was removed in pandas 2.0, so add the stacking row with .loc
train_scores.loc[len(train_scores)] = ["Stacking", (cm2[0, 0] + cm2[1, 1]) / cm2.sum(), (cm1[0, 0] + cm1[1, 1]) / cm1.sum()]
train_scores

KNN gave maximum accuracy using Feature Scaling.

## **Principal Component Analysis**

PCA helps us identify patterns in data based on the correlation between features. It reduces the number of variables by projecting the data onto a smaller set of components that capture most of the variance. Thus, it reduces the dimension of your data with the aim of retaining as much information as possible.

Here we will use a straightforward PCA, asking it to preserve 85% of the variance in the projected data.
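
To see what passing a fraction to `PCA` does, here is a small sketch on random data. The array `X_toy` and its decreasing per-feature scales are assumptions for illustration; with a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data: 200 samples, 10 features with steadily decreasing variance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

pca_demo = PCA(n_components=0.85)
X_reduced = pca_demo.fit_transform(X_toy)

retained = pca_demo.explained_variance_ratio_.sum()
print("components kept:", pca_demo.n_components_)
print(f"variance retained: {retained:.3f}")
```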

In [None]:
from sklearn.decomposition import PCA

pca = PCA(0.85, whiten=True)
pca_train_data = pca.fit_transform(train)
print(pca_train_data.shape,'\n')

explained_variance = pca.explained_variance_ratio_ 
print(explained_variance)

We now introduce another concept: **K-Fold Cross-validation**, sometimes called rotation estimation, a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.

Cross-validation can be used to evaluate the performance of a model while reducing the variance that comes from relying on a single split.

In this approach, the data used for training and testing are non-overlapping. To implement it, first separate your data set into two subsets. Use one subset for training and the other for testing. Then repeat the exercise with the subsets swapped and report the average test result. This is called 2-fold cross-validation.

Similarly, if you divide your entire data set into ten subsets, perform the exercise ten times with a different subset held out each time, and report the average test result, that is 10-fold cross-validation (which is what we'll be doing now).
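
The fold mechanics can be sketched with `KFold` on a tiny array. The array `X_toy` and the choice of 5 folds are assumptions to keep the printout short; the notebook itself uses `StratifiedKFold` with 10 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Tiny made-up dataset: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each sample lands in a test fold
test_counts = np.zeros(len(X_toy), dtype=int)
n_folds = 0
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    test_counts[test_idx] += 1
    n_folds += 1
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

print("times each sample was used for testing:", test_counts.tolist())
```

Every sample is used for testing exactly once and for training in all the other folds.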


In [None]:
# Replace the raw features with their PCA projections
train = pd.DataFrame(pca_train_data)
test = pd.DataFrame(pca.transform(test))

print(train.shape)
print(trainLabel.shape)
print(test.shape)

X_train,X_test,y_train,y_test = train_test_split(train,trainLabel,test_size=0.30, random_state=101)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

sk_fold = StratifiedKFold(10, shuffle=True, random_state=42)

# The PCA-transformed features are used as they are, without further scaling
X_train_1 = train.values  # full training set
X_submit = test.values    # Kaggle test set

g_nb = GaussianNB()
knn = KNeighborsClassifier()
ran_for  = RandomForestClassifier()
log_reg = LogisticRegression()
svc = SVC()
tree= DecisionTreeClassifier()
xgb = XGBClassifier()

ada_boost = AdaBoostClassifier()
grad_boost = GradientBoostingClassifier(n_estimators=100)
hist_grad_boost = HistGradientBoostingClassifier()




# (name, estimator, parameter grid) for every model; an empty grid keeps the defaults
clf = [("Naive Bayes", g_nb, {}),
       ("K Nearest", knn, {"n_neighbors": [3, 5, 8], "leaf_size": [25, 30, 35]}),
       ("Random Forest", ran_for, {"n_estimators": [100], "random_state": [99], "min_samples_leaf": [5, 10, 20, 40, 50], "bootstrap": [False]}),
       ("Logistic Regression", log_reg, {"penalty": ['l2'], "C": [100, 10, 1.0, 0.1, 0.01], "solver": ['saga']}),
       ("Support Vector", svc, {"kernel": ["linear", "rbf"], "gamma": ['auto'], "C": [0.1, 1, 10, 100, 1000]}),
       ("Decision Tree", tree, {}),
       ("XGBoost", xgb, {"n_estimators": [200], "max_depth": [3, 4, 5], "learning_rate": [.01, .1, .2], "subsample": [.8], "colsample_bytree": [1], "gamma": [0, 1, 5], "lambda": [.01, .1, 1]}),
       ("Adaptive Boost", ada_boost, {"n_estimators": [100], "learning_rate": [.6, .8, 1]}),
       ("Gradient Boost", grad_boost, {}),
       # scikit-learn 1.1 and later call this loss "log_loss"
       ("Histogram GB", hist_grad_boost, {"loss": ["binary_crossentropy"], "min_samples_leaf": [5, 10, 20, 40, 50], "l2_regularization": [0, .1, 1]})]

stack_list = []
train_scores = pd.DataFrame(columns=["Name", "Train Score", "Test Score"])

# Tune each model with grid search, then score the best estimator on the hold-out split
for i, (name, estimator, param_grid) in enumerate(clf):
    # A separate name for the search keeps the `clf` list from being overwritten
    grid = GridSearchCV(estimator, param_grid=param_grid, scoring="accuracy", cv=sk_fold, return_train_score=True)
    grid.fit(X_train, y_train)
    y_pred = grid.best_estimator_.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # Accuracy = correct predictions (diagonal of the confusion matrix) / all predictions
    test_accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # "Train Score" holds the mean cross-validated accuracy of the best parameter set
    train_scores.loc[i] = [name, grid.best_score_, test_accuracy]
    stack_list.append(grid.best_estimator_)
    
est = [("g_nb",stack_list[0]),\
       ("knn",stack_list[1]),\
       ("ran_for",stack_list[2]),\
       ("log_reg",stack_list[3]),\
       ("svc",stack_list[4]),\
       ("dec_tree",stack_list[5]),\
       ("XGBoost",stack_list[6]),\
       ("ada_boost",stack_list[7]),\
       ("grad_boost",stack_list[8]),\
       ("hist_grad_boost",stack_list[9])]








# final_estimator=None falls back to scikit-learn's default, LogisticRegression
sc = StackingClassifier(estimators=est, final_estimator=None, cv=sk_fold, passthrough=False)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
cm1 = confusion_matrix(y_test, y_pred)         # hold-out split
y_pred_train = sc.predict(X_train)
cm2 = confusion_matrix(y_train, y_pred_train)  # training split
# DataFrame.append was removed in pandas 2.0, so add the stacking row with .loc
train_scores.loc[len(train_scores)] = ["Stacking", (cm2[0, 0] + cm2[1, 1]) / cm2.sum(), (cm1[0, 0] + cm1[1, 1]) / cm1.sum()]
train_scores

## **Applying Gaussian Mixture and Grid Search to improve the accuracy**

We select the three algorithms above **(KNN, Random Forest and SVM)** that gave the highest accuracy for further analysis, widening their parameter grids in the grid search below.

In [None]:
# Importing libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

X = np.r_[train,test]
print('X shape :',X.shape)
print('\n')

# USING THE GAUSSIAN MIXTURE MODEL

# The Bayesian information criterion (BIC) can be used to select the number of components in a Gaussian Mixture in an efficient way.
# In theory, it recovers the true number of components only in the asymptotic regime
lowest_bic = np.inf  # np.infty was removed in NumPy 2.0
bic = []
n_components_range = range(1, 7)

# The GaussianMixture comes with different options to constrain the covariance of the different classes estimated:
# spherical, diagonal, tied or full covariance.
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        gmm = GaussianMixture(n_components=n_components, covariance_type=cv_type)
        gmm.fit(X)
        # Score with BIC so the code matches the criterion described above
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

best_gmm.fit(X)
# One column of probabilities per mixture component
gmm_train = best_gmm.predict_proba(train)
gmm_test = best_gmm.predict_proba(test)

The **predict_proba** method will take in new data points and predict the responsibilities for each Gaussian. In other words, the probability that this data point came from each distribution.
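
Here is a minimal sketch of what these responsibilities look like. The two well-separated blobs in `X_toy` are made-up data, and fixing `n_components=2` is an assumption for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data: two well-separated 2-D blobs of 100 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(-5.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_toy)

# One row per sample, one column per Gaussian component
resp = gmm_demo.predict_proba(X_toy)

print(resp.shape)
print(resp[:3].round(3))
```

Each row sums to 1, so the responsibilities can be fed to a classifier as a compact set of features, which is what the next cells do.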



**Now Applying Grid Search Algorithm:** 

To identify the best algorithm and best parameters

In [None]:
X_train,X_test,y_train,y_test = train_test_split(gmm_train,trainLabel,test_size=0.30, random_state=101)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
sk_fold = StratifiedKFold(10, shuffle=True, random_state=42)

# The Gaussian-mixture responsibilities now serve as the features
X_train_1 = pd.DataFrame(gmm_train).values  # full training set
X_submit = pd.DataFrame(gmm_test).values    # Kaggle test set

g_nb = GaussianNB()
knn = KNeighborsClassifier()
ran_for  = RandomForestClassifier()
log_reg = LogisticRegression()
svc = SVC()
tree= DecisionTreeClassifier()
xgb = XGBClassifier()

ada_boost = AdaBoostClassifier()
grad_boost = GradientBoostingClassifier(n_estimators=100)
hist_grad_boost = HistGradientBoostingClassifier()




# (name, estimator, parameter grid) for every model; an empty grid keeps the defaults
clf = [("Naive Bayes", g_nb, {}),
       ("K Nearest", knn, {"n_neighbors": [3, 5, 6, 7, 8, 9, 10], "leaf_size": [25, 30, 35]}),
       ("Random Forest", ran_for, {"n_estimators": [10, 50, 100, 200, 400], "max_depth": [3, 10, 20, 40], "random_state": [99], "min_samples_leaf": [5, 10, 20, 40, 50], "bootstrap": [False]}),
       ("Logistic Regression", log_reg, {"penalty": ['l2'], "C": [100, 10, 1.0, 0.1, 0.01], "solver": ['saga']}),
       ("Support Vector", svc, {"kernel": ["linear", "rbf"], "gamma": [0.05, 0.0001, 0.01, 0.001], "C": [0.1, 1, 10, 100, 1000]}),
       ("Decision Tree", tree, {}),
       ("XGBoost", xgb, {"n_estimators": [200], "max_depth": [3, 4, 5], "learning_rate": [.01, .1, .2], "subsample": [.8], "colsample_bytree": [1], "gamma": [0, 1, 5], "lambda": [.01, .1, 1]}),
       ("Adaptive Boost", ada_boost, {"n_estimators": [100], "learning_rate": [.6, .8, 1]}),
       ("Gradient Boost", grad_boost, {}),
       # scikit-learn 1.1 and later call this loss "log_loss"
       ("Histogram GB", hist_grad_boost, {"loss": ["binary_crossentropy"], "min_samples_leaf": [5, 10, 20, 40, 50], "l2_regularization": [0, .1, 1]})]

stack_list = []
train_scores = pd.DataFrame(columns=["Name", "Train Score", "Test Score"])

# Tune each model with grid search, then score the best estimator on the hold-out split
for i, (name, estimator, param_grid) in enumerate(clf):
    # A separate name for the search keeps the `clf` list from being overwritten
    grid = GridSearchCV(estimator, param_grid=param_grid, scoring="accuracy", cv=sk_fold, return_train_score=True)
    grid.fit(X_train, y_train)
    y_pred = grid.best_estimator_.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # Accuracy = correct predictions (diagonal of the confusion matrix) / all predictions
    test_accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # "Train Score" holds the mean cross-validated accuracy of the best parameter set
    train_scores.loc[i] = [name, grid.best_score_, test_accuracy]
    stack_list.append(grid.best_estimator_)
    
est = [("g_nb",stack_list[0]),\
       ("knn",stack_list[1]),\
       ("ran_for",stack_list[2]),\
       ("log_reg",stack_list[3]),\
       ("svc",stack_list[4]),\
       ("dec_tree",stack_list[5]),\
       ("XGBoost",stack_list[6]),\
       ("ada_boost",stack_list[7]),\
       ("grad_boost",stack_list[8]),\
       ("hist_grad_boost",stack_list[9])]







In [None]:

# final_estimator=None falls back to scikit-learn's default, LogisticRegression
sc = StackingClassifier(estimators=est, final_estimator=None, cv=sk_fold, passthrough=False)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
cm1 = confusion_matrix(y_test, y_pred)         # hold-out split
y_pred_train = sc.predict(X_train)
cm2 = confusion_matrix(y_train, y_pred_train)  # training split
# DataFrame.append was removed in pandas 2.0, so add the stacking row with .loc
train_scores.loc[len(train_scores)] = ["Stacking", (cm2[0, 0] + cm2[1, 1]) / cm2.sum(), (cm1[0, 0] + cm1[1, 1]) / cm1.sum()]
train_scores

In [None]:
print(X_train_1.shape)
print(trainLabel.shape)
print(X_submit.shape)

In [None]:
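# Fitting our model: stack_list[3] is the tuned Logistic Regression from the grid search above
# ravel() turns the one-column label frame into the 1-D array scikit-learn expects
stack_list[3].fit(X_train_1, trainLabel.values.ravel())
y_submit = stack_list[3].predict(X_submit)

y_submit = pd.DataFrame(y_submit)
y_submit.index += 1

# FRAMING OUR SOLUTION
y_submit.columns = ['Solution']
y_submit['Id'] = np.arange(1, y_submit.shape[0] + 1)  # Ids start at 1
y_submit = y_submit[['Id', 'Solution']]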





In [None]:
print(y_submit.shape)
print(y_submit.head(8))

Exporting the data to submit.

In [None]:
y_submit.to_csv('Submission.csv',index=False)