# Red Wine Quality Classification - Optimized Models


**Bugra Sebati E.** - **August 2021**

## Introduction
This dataset is related to red variants of the Portuguese “Vinho Verde” wine. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

**What can you find in this notebook?**
* Understanding the data
* Exploratory Data Analysis
* Normal distribution control
* Unbalanced Data problem
* Classification Models
* Classification Models Optimization
* Model Selection

**Information about Red Wine :**

Starting with the basics, red wine is an alcoholic beverage made by fermenting the juice of dark-skinned grapes. It differs from white wine in its base material and production process: red wine is made with dark-skinned rather than light-skinned grapes. During production, the winemaker allows pressed grape juice, called must, to macerate and ferment with the dark grape skins, which adds color, flavor, and tannin to the wine. Alcohol is produced when yeast converts grape sugar into ethanol and carbon dioxide. The result of these processes is red wine.

References : www.winemag.com


![](http://media0.giphy.com/media/RNDV3Y4K0g19YY5EWd/giphy.gif)

**What are our variables? Let's meet**

* **Fixed acidity** :  Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).
* **Volatile acidity** : The amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste.
* **Citric acid** :  Found in small quantities, citric acid can add 'freshness' and flavor to wines.
* **Residual sugar** :  The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.
* **Chlorides** :  The amount of salt in the wine.
* **Free sulfur dioxide** :  The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
* **Total sulfur dioxide** :  Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
* **Density** :  The density of wine is close to that of water, depending on the percent alcohol and sugar content.
* **pH** :  Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
* **Sulphates** :  A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
* **Alcohol** :  The percent alcohol content of the wine.

**Target Variable : Quality** : Based on sensory data, score between 3 and 8


**If you like this notebook, don't forget to upvote :) Thanks !**

**We are ready to start. Cheers !**

In [None]:
#### IMPORT LIBRARIES


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import shapiro
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split , GridSearchCV , cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings("ignore")

In [None]:
redwine = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
df = redwine.copy()
df.head()

In [None]:
df.info()

In [None]:
df.describe().T

Are there any missing values in the dataset?

In [None]:
df.isnull().values.any()

Now, let's look at the distribution of the target variable

In [None]:
sns.countplot(x = "quality" , data = df);

In [None]:
plt.figure(1, figsize = (12,7))
plt.title("Quality distribution", color = "black", fontsize = 15)
df["quality"].value_counts().plot.pie(autopct = "%1.1f%%");

Let's analyze the correlation

In [None]:
plt.figure(figsize = (11,8))
sns.set(font_scale = 1.1)
sns.heatmap(df.corr() , cmap = "RdYlGn", annot = True, fmt = ".2f", annot_kws = {"size": 12});

Highest **positive** correlation with quality : alcohol (0.48) **Moderate degree**

Highest **negative** correlation with quality : volatile acidity (-0.39) **Moderate degree**

Now, we will look at the distribution, skewness, and kurtosis of each variable.

We will test for normality with the Shapiro-Wilk test.

#### **Shapiro-Wilk Test :**
Tests whether a data sample has a Normal distribution.

**Assumption** : Observations in each sample are independent and distributed identically.

**Hypothesis** : 

**H0**: the sample has a Normal distribution.

**H1**: the sample does not have a Normal distribution.

If the p-value is less than 0.05, we reject H0 and conclude that the sample is not normally distributed.
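As a small illustration of this decision rule, the sketch below applies the same skewness, kurtosis, and Shapiro-Wilk checks to two synthetic samples. The sample names, the seed, and the sample size are assumptions made for the example; they are not taken from the wine data.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic samples (illustrative only): one drawn from a normal
# distribution, one from a right-skewed exponential distribution.
rng = np.random.default_rng(0)
samples = {
    "normal": pd.Series(rng.normal(size=500)),
    "right_skewed": pd.Series(rng.exponential(size=500)),
}

alpha = 0.05  # significance level used in this notebook
decisions = {}
for name, values in samples.items():
    statistic, pvalue = shapiro(values)
    # Reject H0 (normality) when the p-value falls below alpha.
    decisions[name] = "reject H0" if pvalue < alpha else "fail to reject H0"
    print(
        f"{name}: skew={values.skew():.2f}, kurt={values.kurt():.2f}, "
        f"W={statistic:.4f}, p={pvalue:.4f} -> {decisions[name]}"
    )
```

Note that pandas reports excess kurtosis, so a normal sample should give a value near 0 rather than 3.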

#### Fixed acidity

In [None]:
sns.distplot(df["fixed acidity"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["fixed acidity"].skew())
print("Kurtosis: %f" % df["fixed acidity"].kurt())
test_statistic, pvalue = shapiro(df["fixed acidity"])
print('Shapiro Test Statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue))

#### Volatile acidity

In [None]:
sns.distplot(df["volatile acidity"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["volatile acidity"].skew())
print("Kurtosis: %f" % df["volatile acidity"].kurt())
test_statistic, pvalue = shapiro(df["volatile acidity"])
print('Shapiro Test Statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue))

#### Citric acid

In [None]:
sns.distplot(df["citric acid"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["citric acid"].skew())
print("Kurtosis: %f" % df["citric acid"].kurt())
test_statistic, pvalue = shapiro(df["citric acid"])
print('Shapiro Test Statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue))

#### Residual sugar

In [None]:
sns.distplot(df["residual sugar"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["residual sugar"].skew())
print("Kurtosis: %f" % df["residual sugar"].kurt())
test_statistic, pvalue = shapiro(df["residual sugar"])
print('Shapiro Test Statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue))

#### Chlorides

In [None]:
sns.distplot(df["chlorides"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["chlorides"].skew())
print("Kurtosis: %f" % df["chlorides"].kurt())
test_statistic, pvalue = shapiro(df["chlorides"])
print('Shapiro Test Statistic = %.4f, p-value = %.4f' % (test_statistic, pvalue))

#### Free sulfur dioxide

In [None]:
sns.distplot(df["free sulfur dioxide"] , color = "r", bins = 60 , hist_kws = {"alpha": 0.4});

In [None]:
print("Skewness: %f" % df["free sulfur dioxide"].skew())
print("Kurtosis: %f" % df["free sulfur dioxide"].kurt())
test_statistic, pvalue = shapiro(df["free sulfur dioxide"])
print("Shapiro Test Statistic = %.4f, p-value = %.4f" % (test_statistic, pvalue))

#### Total sulfur dioxide

In [None]:
sns.distplot(df["total sulfur dioxide"] , color = "r" , bins = 60 , hist_kws = {"alpha" : 0.4});

In [None]:
print("Skewness: %f" % df["total sulfur dioxide"].skew())
print("Kurtosis: %f" % df["total sulfur dioxide"].kurt())
test_statistic , pvalue = shapiro(df["total sulfur dioxide"])
print("Shapiro Test Statistic = %.4f, p-value = %.4f" % (test_statistic , pvalue))

#### Density

In [None]:
sns.distplot(df["density"] , color = "r" , bins = 60 , hist_kws = {"alpha" : 0.4});

In [None]:
print("Skewness: %f" % df["density"].skew())
print("Kurtosis: %f" % df["density"].kurt())
test_statistic, pvalue = shapiro(df["density"])
print("Shapiro Test Statistic = %.4f, pvalue = %.4f" % (test_statistic , pvalue))

#### pH

In [None]:
sns.distplot(df["pH"], color = "r" , bins = 60 , hist_kws = {"alpha" : 0.4});

In [None]:
print("Skewness : %f" % df["pH"].skew())
print("Kurtosis : %f" % df["pH"].kurt())
test_statistic , pvalue = shapiro(df["pH"])
print("Shapiro Test Statistic : %.4f, pvalue = %.4f" % (test_statistic , pvalue))

#### Sulphates

In [None]:
sns.distplot(df["sulphates"] , color = "r" , bins = 60 , hist_kws = {"alpha" : 0.4});

In [None]:
print("Skewness : %f" % df["sulphates"].skew())
print("Kurtosis : %f" % df["sulphates"].kurt())
test_statistic , pvalue = shapiro(df["sulphates"])
print("Shapiro Test Statistic : %.4f , pvalue : %.4f" % (test_statistic , pvalue))

#### Alcohol

In [None]:
sns.distplot(df["alcohol"] , color = "r" , bins = 60 , hist_kws = {"alpha" : 0.4});

In [None]:
print("Skewness : %f" % df["alcohol"].skew())
print("Kurtosis : %f" % df["alcohol"].kurt())
test_statistic , pvalue = shapiro(df["alcohol"])
print("Shapiro Test Statistic : %.4f , pvalue : %.4f" % (test_statistic , pvalue))

Let's look at the distribution of variables with the target variable using the boxplot graph.

In [None]:
figure, ax = plt.subplots(1,5, figsize = (24,6))
sns.boxplot(data = df, x = "quality", y="fixed acidity", ax = ax[0])
sns.boxplot(data = df, x = "quality", y="volatile acidity", ax = ax[1])
sns.boxplot(data = df, x = "quality", y="citric acid", ax = ax[2])
sns.boxplot(data = df, x = "quality", y="residual sugar", ax = ax[3])
sns.boxplot(data = df, x = "quality", y="chlorides", ax = ax[4])
plt.show()

In [None]:
figure, ax = plt.subplots(1,6, figsize = (24,6))
sns.boxplot(data = df, x = "quality", y="free sulfur dioxide", ax = ax[0])
sns.boxplot(data = df, x = "quality", y="total sulfur dioxide", ax = ax[1])
sns.boxplot(data = df, x = "quality", y="density", ax = ax[2])
sns.boxplot(data = df, x = "quality", y="pH", ax = ax[3])
sns.boxplot(data = df, x = "quality", y="sulphates", ax = ax[4])
sns.boxplot(data = df, x = "quality", y="alcohol", ax = ax[5])
plt.show()

Before modeling, we need to transform the target variable.

The quality variable has 6 categories (scores 3 to 8); we'll reduce it to 2: 1 for good wines (quality of 7 or higher) and 0 for the rest.

In [None]:
df["quality"].value_counts()

In [None]:
## transform func.

df["quality"] = df["quality"].apply(lambda value : 1 if value >= 7 else 0)

In [None]:
df["quality"].value_counts()

In [None]:
x = df[df.columns[:-1]]
y = df["quality"]

We need to standardize our independent variables. 

In [None]:
scaler = StandardScaler()
x = scaler.fit_transform(x)

In [None]:
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.20 , random_state = 42)

In [None]:
x_train.shape , x_test.shape , y_train.shape , y_test.shape
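The cells above scale the full feature matrix first and split afterwards. A common variant, sketched here on synthetic data, splits first and fits the scaler on the training part only, so that no test-set statistics reach the training step. The `X_demo` and `y_demo` arrays and the `demo_scaler` name are made up for this example, and using this variant would change the scores reported later in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, using the same 80/20 proportion as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Learn mean and standard deviation from the training part only,
# then apply the same transformation to the test part.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```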

I think we're ready, but we have a problem. Unbalanced data !

#### **Unbalanced Data Problem**
Unbalanced data refers to classification problems where we have unequal instances for different classes. Having unbalanced data is actually very common in general, but it is especially prevalent when working with disease data where we usually have more healthy control samples than disease cases. Even more extreme unbalance is seen with fraud detection, where e.g. most credit card uses are okay and only very few will be fraudulent.

Most machine learning classification algorithms are sensitive to imbalance in the target classes. Let’s consider an even more extreme example than our red wine dataset: assume we had 10 good quality vs 90 bad quality samples. A machine learning model trained and tested on such a dataset could predict “bad quality” for all samples and still reach 90% accuracy. An unbalanced dataset will bias the prediction model towards the more common class!
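The 10 vs 90 example can be checked with a few lines of plain Python. This is only an illustration of the arithmetic; the labels and variable names are invented for the example.

```python
# 10 good-quality and 90 bad-quality samples, as in the example above.
actual = ["good"] * 10 + ["bad"] * 90

# A "model" that always predicts the majority class.
predicted = ["bad"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall for the minority class: how many good wines were found?
good_found = sum(a == "good" and p == "good" for a, p in zip(actual, predicted))
recall_good = good_found / actual.count("good")

print(accuracy, recall_good)  # 0.9 0.0
```

The accuracy looks high, yet the model never identifies a single good wine, which is why accuracy alone can mislead on unbalanced data.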

**Solutions**

* **Find extra data**  : Collect more observations for the class that has few of them.
* **Undersampling**   : A group of techniques that balance a skewed class distribution by removing observations from the majority class.
* **Oversampling**    : Balances the classes by duplicating observations from the minority class. I think it should be preferred only when undersampling can't be done, because it is very likely to cause overfitting.
* **Resampling : SMOTE**  : It does not simply copy data, since copied data gives the model no new information. It uses the KNN algorithm: it selects random observations from the minority class, finds their nearest neighbors, and generates synthetic observations between them. This process repeats until the minority class reaches the number of observations in the majority class.
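To make the SMOTE idea more concrete, here is a minimal NumPy sketch of the interpolation step. It is not the imbalanced-learn implementation used in this notebook; the function name `smote_like_samples`, its parameters, and the toy points are assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=4, seed=0):
    """Generate n_new synthetic points by interpolating between
    minority points and their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every point.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbors for each new sample.
    base_idx = rng.integers(0, len(minority), size=n_new)
    neigh_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gaps = rng.random((n_new, 1))
    return minority[base_idx] + gaps * (minority[neigh_idx] - minority[base_idx])

# Toy minority class with six 2-D points (needs more than k points).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like_samples(minority, n_new=20, k=4, seed=1)
print(synthetic.shape)  # (20, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new data stays inside the region the minority class already occupies instead of repeating existing rows.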

**I used SMOTE in this notebook. You can also find different methods for the unbalanced data problem.**

In [None]:
smote = SMOTE(k_neighbors = 4 , random_state = 12)
x_train, y_train = smote.fit_resample(x_train, y_train)
y_train.value_counts()

#### READY FOR ML !

![](http://media4.giphy.com/media/8Iv5lqKwKsZ2g/giphy.gif)

#### KNN Classifier (K-Nearest Neighbors Algorithm)

In [None]:
knn = KNeighborsClassifier()
knn_model = knn.fit(x_train , y_train)

In [None]:
y_pred_knn = knn_model.predict(x_test)
accuracy_score(y_test , y_pred_knn)

#### OPTIMIZATION

In [None]:
knn_params = {"n_neighbors": np.arange(1,60)}
knn_cv = GridSearchCV(knn , knn_params, cv = 10)
knn_cv.fit(x_train , y_train)

In [None]:
print("Best Parameters: " + str(knn_cv.best_params_))
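The best value is typed in by hand in the next cell. As an alternative sketch, `GridSearchCV` refits the best configuration on the whole training set by default (`refit=True`) and exposes it as `best_estimator_`, so it can be used directly. The synthetic dataset and the `demo_cv` and `best_knn` names below are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Search over the number of neighbors with 5-fold cross-validation.
demo_cv = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": np.arange(1, 20)}, cv=5
)
demo_cv.fit(X_tr, y_tr)

# The refitted best model is available without retyping its parameters.
best_knn = demo_cv.best_estimator_
test_accuracy = best_knn.score(X_te, y_te)
print(demo_cv.best_params_, round(demo_cv.best_score_, 4), round(test_accuracy, 4))
```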

**Optimized Model**

In [None]:
knn = KNeighborsClassifier(n_neighbors = 2)
opt_knn = knn.fit(x_train , y_train)

In [None]:
y_pred_knn = opt_knn.predict(x_test)
accuracy_score(y_test , y_pred_knn)

In [None]:
print(classification_report(y_test , y_pred_knn))

So, what is this table? Let's explain.

You can read the report for every model in the same way.

![](http://www.linkpicture.com/q/Sunu1_3.jpg)
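As a rough sketch of how the columns of the classification report relate to a binary confusion matrix, the snippet below computes precision, recall, F1, and accuracy from four counts. The counts are hypothetical and are not results from this notebook.

```python
# Hypothetical counts for the positive class (quality = 1).
tn, fp = 50, 10   # actual 0: predicted 0 / predicted 1
fn, tp = 5, 35    # actual 1: predicted 0 / predicted 1

# Precision: of everything predicted positive, how much was right?
precision = tp / (tp + fp)

# Recall: of everything actually positive, how much was found?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The layout of the counts follows scikit-learn's `confusion_matrix`, where rows are actual values and columns are predicted values.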

In [None]:
score = round(accuracy_score(y_test, y_pred_knn), 6)
cm = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### Logistic Regression

In [None]:
logi = LogisticRegression(solver = "liblinear")
log_model = logi.fit(x_train,y_train)

In [None]:
log_model.predict_proba(x_test)[:5]

In [None]:
y_pred_logi = log_model.predict(x_test)
accuracy_score(y_test, y_pred_logi)

In [None]:
cross_val_score(log_model, x_test, y_test, cv = 10).mean()

In [None]:
print(classification_report(y_test , y_pred_logi))

In [None]:
score = round(accuracy_score(y_test, y_pred_logi), 6)
cm = confusion_matrix(y_test, y_pred_logi)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### GBM (Gradient Boosting Machine)

In [None]:
gbm = GradientBoostingClassifier()
gbm_model = gbm.fit(x_train, y_train)

In [None]:
y_pred_gbm = gbm_model.predict(x_test)
accuracy_score(y_test, y_pred_gbm)

#### OPTIMIZATION

In [None]:
#gbm_params = {"learning_rate" : [0.001, 0.01, 0.1, 0.05],
#             "n_estimators": [100,200,300,400,500,600,700],
#             "max_depth": [3,5,10],
#             "min_samples_split": [2,5,10]}

#gbm_cv = GridSearchCV(gbm , gbm_params , cv = 10 , n_jobs = -1 , verbose = 2)

#gbm_cv.fit(x_train, y_train)

In [None]:
#print("Best Parameters: " + str(gbm_cv.best_params_))

**Optimized Model**

In [None]:
gbm = GradientBoostingClassifier(learning_rate = 0.1, max_depth = 10 , min_samples_split = 5, n_estimators = 300)
opt_gbm = gbm.fit(x_train, y_train)

In [None]:
y_pred_gbm = opt_gbm.predict(x_test)
accuracy_score(y_test, y_pred_gbm)

In [None]:
print(classification_report(y_test , y_pred_gbm))

In [None]:
score = round(accuracy_score(y_test, y_pred_gbm), 6)
cm = confusion_matrix(y_test, y_pred_gbm)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### Naive Bayes

In [None]:
nb = GaussianNB()
nb_model = nb.fit(x_train , y_train)

In [None]:
nb_model.predict_proba(x_test)[:5]

In [None]:
y_pred_nb = nb_model.predict(x_test)
accuracy_score(y_test , y_pred_nb)

In [None]:
cross_val_score(nb_model, x_test, y_test, cv = 10).mean()

In [None]:
print(classification_report(y_test , y_pred_nb))

In [None]:
score = round(accuracy_score(y_test, y_pred_nb), 6)
cm = confusion_matrix(y_test, y_pred_nb)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### SVC (Support Vector Classifier)

In [None]:
svc = SVC()
svc_model = svc.fit(x_train , y_train)

In [None]:
y_pred_svc = svc_model.predict(x_test)
accuracy_score(y_test , y_pred_svc)

#### OPTIMIZATION

In [None]:
svc_params = { "C" : [0.01 , 0.1 , 1 , 2 , 3 , 5 , 7 , 10],
             "gamma" : [0.01 , 0.1 , 1 , 3 , 5 , 7 , 10]}

svc_cv_model = GridSearchCV(svc,svc_params, 
                            cv = 10, 
                            n_jobs = -1, 
                            verbose = 2)

svc_cv_model.fit(x_train, y_train)

In [None]:
print("Best Parameters: " + str(svc_cv_model.best_params_))

**Optimized Model**

In [None]:
opt_svc = SVC(C = 2 , gamma = 1).fit(x_train, y_train)

In [None]:
y_pred_svc = opt_svc.predict(x_test)
accuracy_score(y_test, y_pred_svc)

In [None]:
print(classification_report(y_test , y_pred_svc))

In [None]:
score = round(accuracy_score(y_test, y_pred_svc), 6)
cm = confusion_matrix(y_test, y_pred_svc)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### MLP Classifier (Multi-Layer Perceptron Classifier)

In [None]:
mlpc = MLPClassifier()
mlpc_model = mlpc.fit(x_train , y_train)

In [None]:
y_pred_mlpc = mlpc.predict(x_test)
accuracy_score(y_test, y_pred_mlpc)

#### OPTIMIZATION

In [None]:
#mlpc_params = {"alpha": [0.1, 0.01, 0.02, 0.005, 0.0001],
#              "hidden_layer_sizes": [(10,10,10),(100,100,100),(100,100),(10,10),(100,)],
#              "solver" : ["lbfgs","adam","sgd"],
#              "activation": ["relu","logistic"]}

#mlpc_cv_model = GridSearchCV(mlpc , mlpc_params , cv = 10 , n_jobs = -1 , verbose = 2)

#mlpc_cv_model.fit(x_train , y_train)

In [None]:
#print("Best Parameters: " + str(mlpc_cv_model.best_params_))

**Optimized Model**

In [None]:
mlpc = MLPClassifier(activation = "relu" , alpha = 0.01,
                     hidden_layer_sizes = (100, 100, 100) , solver = "adam")
opt_mlpc = mlpc.fit(x_train , y_train)

In [None]:
y_pred_mlpc = opt_mlpc.predict(x_test)
accuracy_score(y_test, y_pred_mlpc)

In [None]:
print(classification_report(y_test , y_pred_mlpc))

In [None]:
score = round(accuracy_score(y_test, y_pred_mlpc), 6)
cm = confusion_matrix(y_test, y_pred_mlpc)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### Decision Tree Classifier

In [None]:
clf = DecisionTreeClassifier()
clf_model = clf.fit(x_train , y_train)

In [None]:
y_pred_clf = clf_model.predict(x_test)
accuracy_score(y_test , y_pred_clf)

In [None]:
print(classification_report(y_test , y_pred_clf))

In [None]:
score = round(accuracy_score(y_test, y_pred_clf), 6)
cm = confusion_matrix(y_test, y_pred_clf)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### Light GBM Classifier

In [None]:
lgbm = LGBMClassifier()
lgbm_model = lgbm.fit(x_train, y_train)

In [None]:
y_pred_lgbm = lgbm_model.predict(x_test)
accuracy_score(y_test, y_pred_lgbm)

#### OPTIMIZATION

In [None]:
lgbm_params = {"n_estimators": [100 , 500 , 1000],
               "subsample": [0.6 , 0.8 , 1.0],
               "learning_rate": [0.1 , 0.01 , 0.02 , 0.05],
               "min_child_samples": [5 , 10 , 20]}

lgbm_cv_model = GridSearchCV(lgbm , lgbm_params , cv = 10 , n_jobs = -1 , verbose = 2)

lgbm_cv_model.fit(x_train, y_train)

In [None]:
print("Best Parameters: " + str(lgbm_cv_model.best_params_))

**Optimized Model**

In [None]:
lgbm = LGBMClassifier(learning_rate = 0.1 ,max_depth = 8 , min_child_samples = 10 , 
                      n_estimators = 500 , subsample = 0.6)

opt_lgbm = lgbm.fit(x_train, y_train)

In [None]:
y_pred_lgbm = opt_lgbm.predict(x_test)
accuracy_score(y_test, y_pred_lgbm)

In [None]:
print(classification_report(y_test , y_pred_lgbm))

In [None]:
score = round(accuracy_score(y_test, y_pred_lgbm), 6)
cm = confusion_matrix(y_test, y_pred_lgbm)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### Random Forest Classifier

In [None]:
rf = RandomForestClassifier()
rf_model = rf.fit(x_train, y_train)

In [None]:
y_pred_rf = rf_model.predict(x_test)
accuracy_score(y_test, y_pred_rf)

#### OPTIMIZATION

In [None]:
#rf_params = {"max_features": ["sqrt","log2"], "n_estimators": [10, 100 ,300 ,500 ,600 ,800 ,1000],
#            "max_depth" : [2,5,10,15,19,25]}

#rf_cv_model = GridSearchCV(rf_model , rf_params , cv = 10 , n_jobs = -1 , verbose = 2)

#rf_cv_model.fit(x_train, y_train)

In [None]:
#print("Best Parameters: " + str(rf_cv_model.best_params_))

**Optimized Model**

In [None]:
# "auto" meant "sqrt" for classifiers and was removed in scikit-learn 1.3, so "sqrt" is used here.
rf_model = RandomForestClassifier(max_features = "sqrt" , max_depth = 19,
                                  random_state = 44 , n_estimators = 1000)

opt_rf = rf_model.fit(x_train, y_train)

In [None]:
y_pred_rf = opt_rf.predict(x_test)
accuracy_score(y_test, y_pred_rf)

In [None]:
print(classification_report(y_test , y_pred_rf))

In [None]:
score = round(accuracy_score(y_test, y_pred_rf), 6)
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### XGBoost

In [None]:
xgb = XGBClassifier()
xgb_model = xgb.fit(x_train , y_train)

In [None]:
y_pred_xgb = xgb_model.predict(x_test)
accuracy_score(y_test, y_pred_xgb)

#### OPTIMIZATION

In [None]:
#xgb_params = {"n_estimators": [100 , 500 , 1000] ,"subsample": [0.6 , 0.8 , 1.0] ,"learning_rate": [0.1 , 0.01 , 0.05]}

#xgb_cv_model = GridSearchCV(xgb , xgb_params , cv = 10 , n_jobs = -1 , verbose = 2)

#xgb_cv_model.fit(x_train, y_train)

In [None]:
#print("Best Parameters: " + str(xgb_cv_model.best_params_))

**Optimized Model**

In [None]:
xgb = XGBClassifier(n_estimators = 1000 , learning_rate = 0.1 , subsample = 1 , max_depth = 4)
opt_xgb =  xgb.fit(x_train,y_train)

In [None]:
y_pred_xgb = opt_xgb.predict(x_test)
accuracy_score(y_test , y_pred_xgb)

In [None]:
print(classification_report(y_test , y_pred_xgb))

In [None]:
score = round(accuracy_score(y_test, y_pred_xgb), 6)
cm = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm, annot = True, fmt = ".0f")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Accuracy Score: {0}".format(score), size = 15)
plt.show()

#### MODEL SELECTION

In [None]:
models = [opt_knn , log_model , nb_model , opt_gbm , opt_svc, opt_mlpc , clf_model , opt_lgbm , opt_rf , opt_xgb]
rows = []
for model in models:
    # Use a separate loop variable so the feature matrix `x` is not overwritten.
    name = model.__class__.__name__
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("-" * 28)
    print(name + ":")
    print("Accuracy: {:.4%}".format(accuracy))
    rows.append([name, accuracy * 100])

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
results = pd.DataFrame(rows, columns = ["Models", "Accuracy"])

sns.barplot(x = "Accuracy", y = "Models", data = results, color = "g")
plt.xlabel("Accuracy %");

#### **CONCLUSION**

Best accuracy : **91.875%** with the **Random Forest Classifier**

Hope you found this notebook useful!
Please leave a comment if you have any questions, concerns, or feedback! Thank you :)

**Thanks for your attention !**