> # **Introduction** 

#### The Glass Classification dataset is used to identify the type of a glass sample from its chemical composition. The dataset is provided by the UCI Machine Learning Repository. It contains 10 attributes, including an id. The response is the glass type (a discrete variable with 7 values). In this notebook I analyze it using an ensemble machine learning approach.

## **1. Import packages and Dataset** 

In [None]:
#import python packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split  
from sklearn.metrics import accuracy_score 
%matplotlib inline

In [None]:
#import dataset from draft environment
data = pd.read_csv('../input/glass.csv')

In [None]:
data.head()

In [None]:
data.info()

## **2. Exploratory Data Analysis** 

In [None]:
# correlation between the columns of the dataset
corr = data.corr()
plt.figure(figsize=(12,12))
sns.heatmap(corr, cbar = True,  square = True, annot=True, fmt= '.2f',annot_kws={'size': 15}
            , alpha = 0.7, cmap= 'coolwarm')
plt.show()

In [None]:
# draw boxplots to check whether there are outliers
# you can repeat this code for all features
fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(10,10)
sns.boxplot(x=data['RI'],color = 'blue', ax=axes[0][0])
sns.boxplot(x=data['Na'],color = 'Red', ax=axes[0][1])
sns.boxplot(x=data['Mg'],color = 'Green', ax=axes[1][0])
sns.boxplot(x=data['Al'],color = 'Orange', ax=axes[1][1])

## **3. Data Preprocessing**

In [None]:
dt = data['Type'].value_counts()
print ('The number of each Type class = \n')
print (dt)

The counts above show that the classes are imbalanced, so we should handle this issue first. There are several methods to do it. Here I try SMOTE (Synthetic Minority Oversampling Technique) to resample the dataset.
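To make the idea behind SMOTE more concrete, here is a minimal library-free sketch of its core step: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like_samples`, the parameter `k`, and the toy points are assumptions made for illustration; this is not the imbalanced-learn implementation, which handles many more details.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical minority-class samples with two features each
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote_like_samples(minority, n_new=6)
print(len(new_points))
for point in new_points:
    print(point)
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region already covered by that class instead of being an exact copy of a row.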

In [None]:
# pass the column as a keyword argument; recent seaborn versions expect x= here
sns.countplot(x=data['Type'])
plt.show()

## **4. Oversampling using the imbalanced-learn package (SMOTE)**

In [None]:
# import SMOTE from imbalanced-learn to balance the classes
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

x = data.drop('Type', axis=1)
y = data['Type']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_test: ", y_test.shape)

In [None]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '2': {}".format(sum(y_train==2)))
print("Before OverSampling, counts of label '3': {}".format(sum(y_train==3)))
print("Before OverSampling, counts of label '5': {}".format(sum(y_train==5)))
print("Before OverSampling, counts of label '6': {}".format(sum(y_train==6)))
print("Before OverSampling, counts of label '7': {} \n".format(sum(y_train==7)))

sm = SMOTE(random_state=2)
# fit_sample was removed from recent imbalanced-learn releases; fit_resample replaces it
x_train_res, y_train_res = sm.fit_resample(x_train, y_train.to_numpy())

print('After OverSampling, the shape of train_X: {}'.format(x_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '2': {}".format(sum(y_train_res==2)))
print("After OverSampling, counts of label '3': {}".format(sum(y_train_res==3)))
print("After OverSampling, counts of label '5': {}".format(sum(y_train_res==5)))
print("After OverSampling, counts of label '6': {}".format(sum(y_train_res==6)))
print("After OverSampling, counts of label '7': {}".format(sum(y_train_res==7)))

## **A. Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

RFC = RandomForestClassifier(n_estimators = 500, criterion = 'entropy', random_state = 42, max_depth = 10 )
RFC.fit(x_train_res, y_train_res.ravel())

#predict 
pred_train = RFC.predict(x_train_res)

In [None]:
# Confusion matrix of the train dataset
print(confusion_matrix(y_train_res,pred_train))
print ('\n')
print(classification_report(y_train_res,pred_train))

In [None]:
# Confusion matrix of the test dataset
Pred_RFC =RFC.predict(x_test)

print('Confusion Matrix : ','\n',confusion_matrix(y_test,Pred_RFC))
print ('\n')
print(classification_report(y_test,Pred_RFC))
print('\n')
print ('Accuracy_R.Forest_Classifier : ', 
                     accuracy_score(y_test,Pred_RFC)*100,'%')

The test accuracy above is below 70%, which is low compared with the scores on the training data and indicates a tendency to overfit. What is overfitting?
Overfitting happens when a model learns the detail and noise in the training data to the extent that it hurts the model's performance on new data. The noise or random fluctuations in the training data are picked up and learned as concepts by the model. Therefore, you need to find a method that reduces it. The following sections try other methods to address this issue.
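The gap between train and test accuracy can be illustrated with a small self-contained sketch. It assumes a toy one-feature dataset with 20% label noise and a 1-nearest-neighbour rule that simply memorises the training points; all names (`make_noisy_data`, `predict_1nn`, `toy_train`, `toy_test`) are hypothetical and unrelated to the glass data or to the models used in this notebook.

```python
import random

rng = random.Random(0)

def make_noisy_data(n, noise=0.2):
    """One feature x in [0, 1); the true label is x > 0.5,
    flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # inject label noise
        data.append((x, label))
    return data

def predict_1nn(train_points, x):
    """Return the label of the training point closest to x (pure memorisation)."""
    return min(train_points, key=lambda point: abs(point[0] - x))[1]

def accuracy(train_points, data):
    """Share of points in `data` that the 1-NN rule labels correctly."""
    hits = sum(predict_1nn(train_points, x) == label for x, label in data)
    return hits / len(data)

toy_train = make_noisy_data(200)
toy_test = make_noisy_data(200)

train_acc = accuracy(toy_train, toy_train)
test_acc = accuracy(toy_train, toy_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"gap:            {train_acc - test_acc:.2f}")
```

The memorising rule reproduces every training label, noisy ones included, so it looks perfect on the training data while the noise it learned costs accuracy on unseen points. Comparing the train and test reports printed in this notebook is the same check.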


## **B. Bootstrap Aggregating (Bagging) Classifier**

In [None]:
from sklearn.ensemble import BaggingClassifier
BS = BaggingClassifier(RandomForestClassifier(), n_estimators = 300 )
BS.fit(x_train_res, y_train_res.ravel())

#predict 
pred_train_BS = BS.predict(x_train_res)
pred_test_BS = BS.predict(x_test)

In [None]:
# Confusion matrix of the test dataset
print('Confusion Matrix : ','\n',confusion_matrix(y_test,pred_test_BS))
print ('\n')
print(classification_report(y_test,pred_test_BS))
print('\n')
print ('Accuracy_Bagging Classifier : ', 
                     accuracy_score(y_test,pred_test_BS)*100,'%')

## **C. AdaBoost Classifier**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
AB = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators = 300 )
AB.fit(x_train_res, y_train_res.ravel())

#predict 
pred_train_AB = AB.predict(x_train_res)
pred_test_AB = AB.predict(x_test)

In [None]:
# Confusion matrix of the test dataset
print('Confusion Matrix : ','\n',confusion_matrix(y_test,pred_test_AB))
print ('\n')
print(classification_report(y_test,pred_test_AB))
print('\n')
print ('Accuracy_ AdaBoost Classifier : ', 
                     accuracy_score(y_test,pred_test_AB)*100,'%')

## **5. Oversampling using another approach**

To handle the imbalanced classes, you can also use another method. The one I use here duplicates the rows of every class that is under-represented.

In [None]:
# combine the features and labels of the train split into one dataframe
train = pd.concat([x_train,y_train], axis = 1)
print(train.head())

In [None]:
test = pd.concat([x_test,y_test], axis = 1)
print(test.head())

Note: remember that only the train dataset is oversampled. The test dataset is left unchanged.
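The reason for this rule can be shown with a small sketch. It assumes ten hypothetical rows identified only by an id and compares two orders of operations: duplicating before the split and duplicating after it. The helper names `split` and `leaked_ids` are made up for this illustration.

```python
import random

def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_ids(train_rows, test_rows):
    """Ids of original rows that appear in both the train and the test set."""
    return set(train_rows) & set(test_rows)

rng = random.Random(0)
original_ids = list(range(10))  # hypothetical minority-class rows

# Wrong order: duplicate first, then split, so copies can land on both sides
dup_train, dup_test = split(original_ids * 5, 0.2, rng)
leak_wrong = leaked_ids(dup_train, dup_test)

# Right order: split first, then duplicate only the training part
clean_train, clean_test = split(original_ids, 0.2, rng)
clean_train = clean_train * 5
leak_right = leaked_ids(clean_train, clean_test)

print("rows shared when duplicating before the split:", len(leak_wrong))
print("rows shared when duplicating after the split: ", len(leak_right))
```

When copies of the same row sit in both sets, the model is partly evaluated on rows it has already seen, so the test score would look better than it really is.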

In [None]:
dt=train['Type'].groupby(train['Type']).count()

print ('The number of each Type class = \n')
print (dt)

In [None]:
# replicate the rows of each minority class so the class counts become similar
C3 = train[train['Type']==3]
C3 = pd.concat([C3]*5)

C5 = train[train['Type']==5]
C5 = pd.concat([C5]*5)

C6 = train[train['Type']==6]
C6 =pd.concat([C6]*8)

C7 = train[train['Type']==7]
C7 = pd.concat([C7]*2)

C1 = train[train['Type']==1]

C2 = train[train['Type']==2]

# Combine the dataframes above into a new variable
data_balanced=pd.concat([C1,C2,C3,C5,C6,C7])
data_balanced.head()


In [None]:
data_balanced.shape

In [None]:
# use a name that does not shadow the built-in `type`
type_counts = data_balanced['Type'].groupby(data_balanced['Type']).count()
type_counts

The output above shows the result of oversampling with this second method. Next we use the same models as before to predict the type of glass.
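The multipliers 5, 5, 8 and 2 used above were chosen by hand from the class counts. As a possible alternative, the sketch below derives one factor per class from the counts themselves. The helper names and the toy labels are assumptions for illustration and do not reflect the real class counts of the glass data.

```python
from collections import Counter

def replication_factors(labels):
    """Map each class to the number of copies that brings it
    close to the size of the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: max(1, round(largest / n)) for cls, n in counts.items()}

def oversample_by_duplication(rows, labels):
    """Repeat every (row, label) pair according to the factor of its class."""
    factors = replication_factors(labels)
    balanced = []
    for row, label in zip(rows, labels):
        balanced.extend([(row, label)] * factors[label])
    return balanced

# Hypothetical training labels, not the real counts of the glass data
labels = [1] * 12 + [2] * 6 + [3] * 2
rows = list(range(len(labels)))  # stand-ins for the feature rows

factors = replication_factors(labels)
balanced = oversample_by_duplication(rows, labels)
balanced_counts = Counter(label for _, label in balanced)
print(factors)
print(balanced_counts)
```

With pandas, the same factors could be applied per class with `pd.concat([group] * factor)`, exactly as in the cell above, which avoids editing the multipliers whenever the split changes.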

## **A. Random Forest Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

x_train_A = data_balanced.drop('Type', axis=1)
y_train_A = data_balanced['Type']

RFC = RandomForestClassifier(n_estimators = 300, criterion = 'entropy', random_state = 42, max_depth = 10 )
RFC.fit(x_train_A, y_train_A)

In [None]:
pred_train = RFC.predict(x_train_A)
print(confusion_matrix(y_train_A,pred_train))
print ('\n')
print(classification_report(y_train_A,pred_train))

In [None]:
#predict test dataset 
x_test_A = test.drop('Type', axis=1)
y_test_A = test['Type']

pred_test = RFC.predict(x_test_A)
print('Confusion Matrix : ','\n',confusion_matrix(y_test_A,pred_test))
print ('\n')
print(classification_report(y_test_A,pred_test))

print ('Accuracy_R.Forest_Classifier_B : ', 
                     accuracy_score(y_test_A,pred_test)*100,'%')

## **B. Bootstrap Aggregating (Bagging) Classifier**

In [None]:
from sklearn.ensemble import BaggingClassifier
bs = BaggingClassifier(RandomForestClassifier(), n_estimators = 300 )
# fit on the duplicated train dataset from this section, not on the SMOTE-resampled one
bs.fit(x_train_A, y_train_A)

#predict 
pred_train_bs = bs.predict(x_train_A)
pred_test_bs = bs.predict(x_test)

In [None]:
# Confusion matrix of the test dataset
print('Confusion Matrix : ','\n',confusion_matrix(y_test,pred_test_bs))
print ('\n')
print(classification_report(y_test,pred_test_bs))
print('\n')
print ('Accuracy_Bagging Classifier : ', 
                     accuracy_score(y_test,pred_test_bs)*100,'%')

Both oversampling methods, SMOTE and row duplication, give the same accuracy of about 67%. I think this is a poor model. It can be improved by tuning the hyperparameters or through feature engineering and selection.