# Glass type classification with machine learning

This is my first Kaggle notebook. Here's my plan of attack for the glass classification problem.

# Contents

## 1) Prepare Problem

* Load libraries

* Load and explore the shape of the dataset

## 2) Summarize Data

* Descriptive statistics

* Data visualization

## 3) Prepare Data

* Data Cleaning

* Split-out validation dataset

* Data transformation

## 4) Evaluate Algorithms

* Assessing feature importance via XGBoost and PCA

* Compare Algorithms

## 5) Improve Accuracy

* Algorithm Tuning

## 6) Finalize Model

* Create standalone model on entire training dataset

* Predictions on test dataset

## 1. Prepare Problem

### Loading the libraries 

Let's begin by loading the libraries used in this notebook.

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # read dataframes
import matplotlib.pyplot as plt # visualization
import seaborn as sns # statistical visualizations and aesthetics
from sklearn.preprocessing import StandardScaler # preprocessing 
from sklearn.decomposition import PCA # dimensionality reduction
from scipy.stats import boxcox # data transform
from sklearn.model_selection import (train_test_split, KFold , cross_val_score, GridSearchCV ) # model selection modules
from sklearn.pipeline import Pipeline # streaming pipelines
# load models
from sklearn.tree import DecisionTreeClassifier
from xgboost import (XGBClassifier, plot_importance)
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
%matplotlib inline 

### Loading and exploring the shape of the dataset

In [None]:
df = pd.read_csv('../input/glass.csv')

print(df.shape)

The dataset consists of 214 observations.

In [None]:
df.head(15)

In [None]:
df.dtypes

## 2. Summarize data

### Descriptive statistics

Let's first summarize the distribution of the numerical variables.

In [None]:
df.describe()

The features are not on the same scale. For example, Si has a mean of 72.65 while Fe has a mean of 0.057. Algorithms trained by gradient descent, such as logistic regression, converge faster when the features share a common scale. Next, let's check the distribution of the glass types.

In [None]:
df['Type'].value_counts()

The dataset is fairly imbalanced: types 1 and 2 together account for more than 67% of the samples.
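
To make the imbalance explicit, the class counts can be turned into percentage shares. The sketch below is only an illustration: the helper name `class_shares` and the toy labels are made up and are not part of the glass data.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each class, largest class first."""
    counts = labels.value_counts()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_%": shares})


# Toy labels, invented for illustration only (not the glass dataset)
toy_labels = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 7, 7], name="Type")
summary = class_shares(toy_labels)
print(summary)
```

In this notebook the same helper could be called as `class_shares(df['Type'])`.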

###  Data Visualization

* **Univariate plots**

Let's go ahead and look at the distribution of the different features of this dataset.

In [None]:
features = df.columns[:-1].tolist()
for feat in features:
    skew = df[feat].skew()
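    # histplot with a KDE overlay replaces distplot, which is deprecated since seaborn 0.11
    sns.histplot(df[feat], kde=True, stat='density', label='Skew = %.3f' % skew)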
    plt.legend(loc='best')
    plt.show()

None of the features is normally distributed. The features Fe, Ba, Ca and K exhibit the highest skew coefficients. Let's draw boxplots of the feature distributions.

In [None]:
sns.boxplot(data=df[features])  # pass the dataframe as `data` to get one box per feature
plt.show()

Unsurprisingly, silicon has a much higher mean than the other constituents, as we already saw in the previous section. This is expected, since glass is mainly made of silica.
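
As a complement to the univariate plots, the skew coefficients can be ranked in a single table instead of being read off each legend. This is a minimal sketch on synthetic data; the helper `rank_by_skew` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_skew(frame: pd.DataFrame) -> pd.Series:
    """Return the skewness of each column, ordered by absolute value (largest first)."""
    skew = frame.skew()
    order = skew.abs().sort_values(ascending=False).index
    return skew.reindex(order)


# Synthetic columns for illustration only (not the glass dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=500),
    "right_skewed": rng.lognormal(sigma=1.0, size=500),
})
ranking = rank_by_skew(toy)
print(ranking)
```

Applied to this notebook, `rank_by_skew(df[features])` would list the most skewed features first.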

* **Multivariate plots**

Let's now do a pairplot to visually examine the correlation between the features.

In [None]:
sns.pairplot(df[features])  # pairplot creates its own figure; palette only has an effect together with hue
plt.show()

Let's go ahead and examine a heatmap of the correlations.

In [None]:
corr = df[features].corr()
plt.figure(figsize=(8,8))
sns.heatmap(corr, cbar=True, square=True, annot=True,
            xticklabels=features, yticklabels=features,  # labels must match the columns of corr
            cmap='coolwarm')
plt.show()
print(corr)

There seems to be a strong positive correlation between RI and Ba, and a strong positive correlation between Ba and Na is also noticeable. This suggests that principal component analysis (PCA) could be used to decorrelate some of the input features.
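
Reading the strongest correlations off a heatmap is error-prone, so it can help to list them programmatically. The sketch below keeps the upper triangle of the correlation matrix and sorts the pairs by absolute value. The helper `top_correlated_pairs` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep each pair once: mask everything except the strict upper triangle
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Synthetic data for illustration only (not the glass dataset)
rng = np.random.default_rng(7)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=300),   # almost a copy of a
    "c": rng.normal(size=300),             # independent of a
    "d": -a + 1.0 * rng.normal(size=300),  # negatively related to a
})
top = top_correlated_pairs(toy, n=3)
print(top)
```

In this notebook the call would be `top_correlated_pairs(df[features])`, which gives a quick check on the pairs picked out from the heatmap.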

## 3. Prepare data

### - Data cleaning 

In [None]:
df.info()

This dataset is clean; there aren't any missing values in it.

### - Split-out validation dataset

In [None]:
# Define X as features and y as labels
X = df[features]
y = df['Type']
# set a seed and a test size for splitting the dataset 
seed = 7
test_size = 0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size , random_state = seed)

### - Data transformation  

Let's check whether a Box-Cox transform brings some features closer to a normal distribution. Note that the transform parameters should be estimated on the training set only and then applied unchanged to the test set. Otherwise information from the test set leaks into the model (data snooping) and the test error estimate is biased.

In [None]:
features_boxcox = []

for feature in features:
    bc_transformed, _ = boxcox(X_train[feature]+1)  # Box-Cox requires strictly positive inputs; shifting by 1 handles zero values
    features_boxcox.append(bc_transformed)

features_boxcox = np.column_stack(features_boxcox)
# reuse the training index so that rows and labels stay aligned
df_bc = pd.DataFrame(data=features_boxcox, columns=features, index=X_train.index)
df_bc['Type'] = y_train

In [None]:
df_bc.head()

In [None]:
for feature in features:
    fig, ax = plt.subplots(1,2,figsize=(7,3.5))    
    ax[0].hist(X_train[feature], color='blue', bins=30, alpha=0.3, label='Skew = %s' %(str(round(X_train[feature].skew(),3))) )
    ax[0].set_title(str(feature))   
    ax[0].legend(loc=0)
    ax[1].hist(df_bc[feature], color='red', bins=30, alpha=0.3, label='Skew = %s' %(str(round(df_bc[feature].skew(),3))) )
    ax[1].set_title(str(feature)+' after a Box-Cox transformation')
    ax[1].legend(loc=0)
    plt.show()

In [None]:
# check if skew is closer to zero after a box-cox transform
for feature in features:
    delta = np.abs( df_bc[feature].skew() / X_train[feature].skew() )  # compare on the training set only
    if delta < 1.0 :
        print('Feature %s is less skewed after a Box-Cox transform' %(feature))
    else:
        print('Feature %s is more skewed after a Box-Cox transform'  %(feature))

The Box-Cox transform seems to do a good job of reducing the skew of the feature distributions. Next, we will feed the transformed features into our machine learning models. Si is the exception: it is left untransformed, because the transform produces very large values without much improvement in skewness.

In [None]:
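df_bc["Si"] = X_train["Si"].values  # keep Si untransformed (same row order as X_train)

# work on copies so that the original dataframe is not modified
X_train, X_test = X_train.copy(), X_test.copy()

for feature in features:
    if feature != "Si":
        # estimate lambda on the training set only, then reuse it on the test set
        X_train[feature], lmbda = boxcox(X_train[feature] + 1)  # shift by 1 so that zero values become strictly positive
        X_test[feature] = boxcox(X_test[feature] + 1, lmbda=lmbda)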


X_train, X_test = X_train.values, X_test.values
y_train, y_test = y_train.values, y_test.values
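
The same "fit on train, apply to test" idea can also be expressed with scikit-learn's `PowerTransformer`, which stores the fitted lambda for later reuse. The sketch below is an alternative to the manual loop, not the method used in this notebook, and it runs on synthetic strictly positive data, so no shift by 1 is applied.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, strictly positive feature (illustration only)
rng = np.random.default_rng(7)
train = rng.lognormal(mean=0.0, sigma=0.8, size=(200, 1))
test = rng.lognormal(mean=0.0, sigma=0.8, size=(50, 1))

# standardize=False keeps this a pure Box-Cox transform without rescaling
pt = PowerTransformer(method="box-cox", standardize=False)
train_bc = pt.fit_transform(train)  # lambda is estimated on the training data only
test_bc = pt.transform(test)        # the same lambda is reused on the test data

skew_before = skew(train[:, 0])
skew_after = skew(train_bc[:, 0])
print("fitted lambda:", pt.lambdas_)
print("train skew before: %.3f, after: %.3f" % (skew_before, skew_after))
```

Because the transformer is a regular scikit-learn estimator, it can be placed inside a `Pipeline`, which keeps the lambda estimation within each cross-validation fold.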

### - Standardizing the dataset

Now we have to standardize the different features to bring them to the same scale.

In [None]:
# Standardize the dataset: fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()  # scales every column independently
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
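
To see what standardization does, the sketch below applies `StandardScaler` to two synthetic features on very different scales. The numbers are invented and only loosely mimic the scale gap described earlier. The scaler learns the mean and standard deviation from the training data and reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (illustration only)
rng = np.random.default_rng(7)
train = np.column_stack([rng.normal(72.0, 0.8, 150), rng.normal(0.06, 0.1, 150)])
test = np.column_stack([rng.normal(72.0, 0.8, 40), rng.normal(0.06, 0.1, 40)])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns the per-column mean and std from train
test_std = scaler.transform(test)        # reuses the train statistics

print("train means:", train_std.mean(axis=0).round(6))
print("train stds: ", train_std.std(axis=0).round(6))
print("test means: ", test_std.mean(axis=0).round(3))
```

The training columns end up with zero mean and unit variance. The test columns are only approximately standardized, which is expected because their statistics were never used for fitting.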

## 4. Evaluate Algorithms

### - Assessing feature importance via XGBoost and PCA

* **XGBoost**

In [None]:
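# Recent XGBoost releases expect class labels 0..K-1, so encode the glass types first
_, y_train_encoded = np.unique(y_train, return_inverse=True)
model_importances = XGBClassifier(n_estimators=200)
model_importances.fit(X_train, y_train_encoded)
plot_importance(model_importances)
plt.show()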

* **PCA**

Let's go ahead and perform a PCA on the features to decorrelate the ones that are linearly dependent and then let's plot the cumulative explained variance.

In [None]:
pca = PCA(random_state = seed)
pca.fit(X_train)
var_exp = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(var_exp)
plt.bar(range(1, len(cum_var_exp)+1), var_exp, align='center',
        label='individual variance explained', alpha=0.7)
plt.step(range(1, len(cum_var_exp)+1), cum_var_exp, where='mid',
         label='cumulative variance explained', color='red')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.xticks(np.arange(1,len(var_exp)+1,1))
plt.legend(loc='best')
plt.show()

# Cumulative variance explained
for i, cum_var in enumerate(cum_var_exp, start=1):  # avoid shadowing the built-in sum
    print("PC%d Cumulative variance: %.3f%%" % (i, cum_var * 100))

It appears that about 96% of the variance can be explained by the first 6 principal components. PCA seems a better choice for reducing the dimensionality of the dataset than selecting the most important features via XGBoost.
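
The number of components needed to reach a given share of variance can also be computed rather than read from the plot. The sketch below is a minimal illustration on synthetic data built from three latent factors. The helper `components_for_variance` and the 95% threshold are assumptions, not values taken from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.95):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches the threshold."""
    cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # searchsorted finds the first index where cum >= threshold
    return int(min(np.searchsorted(cum, threshold) + 1, len(cum)))


# Synthetic data: 9 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(7)
factors = rng.normal(size=(300, 3))
X_demo = np.repeat(factors, 3, axis=1) + 0.05 * rng.normal(size=(300, 9))

n_needed = components_for_variance(X_demo, threshold=0.95)
# scikit-learn can do the same selection when n_components is a float in (0, 1)
n_sklearn = PCA(n_components=0.95).fit(X_demo).n_components_
print(n_needed, n_sklearn)
```

In this notebook, `components_for_variance(X_train, 0.95)` would give the component count for a 95% threshold.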

### - Compare Algorithms

Now it's time to compare 4 different algorithms (XGBoost Classifier, Support Vector Classifier, RandomForest Classifier and KNeighbors Classifier) after reducing the dimensionality of the data to 6. We'll use 10-fold cross-validation to assess the performance of each model, with classification accuracy as the metric.

In [None]:
pca = PCA(n_components = 6, random_state= seed)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

models = []
models.append(('XGBoost', XGBClassifier(random_state = seed) ))
models.append(('SVC', SVC(random_state=seed)))
models.append(('RF', RandomForestClassifier(random_state=seed, n_jobs=-1 )))
tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
models.append(('KNN', KNeighborsClassifier(n_jobs=-1)))

results, names  = [], []
num_folds = 10
scoring = 'accuracy'

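# Encode the glass types as 0..K-1 (expected by recent XGBoost releases; accuracy is unaffected)
_, y_train_encoded = np.unique(y_train, return_inverse=True)
# random_state is dropped: it has no effect without shuffling, and recent scikit-learn raises an error
kfold = KFold(n_splits=num_folds)

for name, model in models:
    cv_results = cross_val_score(model, X_train_pca, y_train_encoded, cv=kfold, scoring=scoring, n_jobs=-1)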
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    
fig = plt.figure(figsize=(8,6))    
fig.suptitle("Algorithms comparison")
ax = fig.add_subplot(1,1,1)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()


**Observation:** The XGBoost classifier (XGBClassifier), the support vector classifier (SVC) and the KNeighbors classifier yield the highest scores. However, their scores also vary widely across folds (10% to 13%). It is worth continuing the study by tuning these algorithms.
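
One caveat of the comparison is that the scaler and the PCA were fitted on the whole training set before cross-validation, so each validation fold has already influenced the preprocessing. A `Pipeline` refits every step inside each fold. The sketch below shows the idea on a synthetic dataset from `make_classification`. The dataset, the step names and the choice of KNN are assumptions for illustration, and shuffling is enabled in `KFold` here, unlike in the comparison cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the glass dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=7,
)

# Scaling and PCA are refitted on the training part of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=6, random_state=7)),
    ("knn", KNeighborsClassifier()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print("accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

The same pipeline object can be passed to `GridSearchCV`, with parameters addressed as `knn__n_neighbors` or `pca__n_components`.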

## 5. Algorithm tuning

Let's start by tuning the hyperparameters of the XGBoost Classifier.

to be continued ...