# Studying glass classification

I'm trying to learn a bit more about two topics:

1. spotting and correcting skewed data
2. tuning the Random Forest Classifier parameters to achieve better results

## The data
This is the list of all the fields.

* Id number: 1 to 214 (removed from CSV file)
* RI: refractive index
* Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
* Mg: Magnesium
* Al: Aluminum
* Si: Silicon
* K: Potassium
* Ca: Calcium
* Ba: Barium
* Fe: Iron
* Type of glass: (class attribute)
  * 1 building_windows_float_processed
  * 2 building_windows_non_float_processed
  * 3 vehicle_windows_float_processed
  * 4 vehicle_windows_non_float_processed (none in this database)
  * 5 containers
  * 6 tableware
  * 7 headlamps

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

Loading dataset and showing some records

In [None]:
df = pd.read_csv('../input/glass.csv')

df.sample(5)

## Unbalanced classes

The classes in this dataset are not equally represented. Classes `3`, `5` and `6` have very few samples, and past analyses show that the algorithms struggle to classify them.

In [None]:
df.groupby(by='Type').count()

I look for common values in the records of those classes. The goal is to duplicate their data with some variation in order to "reinforce" the classification.
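
As a rough illustration of the idea, here is a minimal sketch that counts the samples per class and works out how many synthetic records each small class would need to reach a chosen size. The toy labels and the `target` value are made up for the example; they are not taken from the glass data.

```python
import pandas as pd

# Hypothetical toy labels, NOT the real glass data
toy = pd.DataFrame({'Type': [1] * 8 + [2] * 7 + [3] * 2 + [6] * 1})

# Number of records per class
counts = toy['Type'].value_counts().sort_index()

# Assumed target size for every class (an arbitrary choice for this sketch)
target = 5

# How many synthetic records each class would need (0 if already large enough)
deficit = (target - counts).clip(lower=0)
print(deficit.to_dict())
```

In the notebook itself the number of added records per class is chosen by hand rather than computed this way.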

### Past results
Before doing that, I note the previous results here so I can tell whether there is an improvement.

This is the correlation between the features and the class:

* Mg      0.744993
* Al      0.598829
* Ba      0.575161
* Na      0.502898
* Fe      0.188278
* RI      0.164237
* Si      0.151565
* K       0.010054
* Ca      0.000952

This is the best estimator performance for a Random Forest, with its parameters:

`0.7890625`

`{'n_estimators': 100, 'min_samples_split': 2, 'criterion': 'entropy', 'min_samples_leaf': 1}`


And this is the cross-validation score:

`Score: 0.756`

Let's go.


### Generating random samples

I create a couple of helper functions that generate new samples for the "poor" classes from their mean and standard deviation values, setting a given set of columns to 0.

In [None]:

# return -1 or +1 at random, to subtract or add a random offset from the mean
def random_sign():
    return -1 if np.random.random() < .5 else 1

# generate a new fake sample from the passed mean and standard deviation Series,
# setting the passed Type value and zeroing the passed columns
def gen_sample(mm, ss, glass_type, labels_at_zero=()):
    # one random factor in [0, 1) per column, with a single sign for the whole sample
    new = np.abs(mm + ss * np.random.random(len(mm)) * random_sign())
    for l in labels_at_zero:
        new[l] = 0
    new['Type'] = glass_type
    return new



In [None]:
# Class 6
# Selecting the 'mean' and 'std' rows from describe() to generate random values for the class
means = df[df['Type']==6].describe().loc['mean',:]
stds = df[df['Type']==6].describe().loc['std',:]

# this was the full table, I took only the second and third row
df[df['Type']==6].describe()

In [None]:
# building a temporary DataFrame; DataFrame.append was removed in pandas 2.0, so pd.concat is used
fake = [gen_sample(means, stds, 6, ['Ba', 'Fe', 'K']) for _ in range(20)]
dfnew = pd.concat([df, pd.DataFrame(fake)], ignore_index=True)

In [None]:
# Let's check what happened
dfnew[dfnew['Type']==6].describe()

The count rose to 29, which is the expected +20. The standard deviation decreased, as expected, because every new value lies within one standard deviation of the mean. Maybe this helps prediction too, but I feel I'm corrupting the data a bit more than I wanted.
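
The shrinking standard deviation can be checked numerically. Each synthetic value is `mean ± u * std` with `u` drawn uniformly from [0, 1), so for a single feature the offset is uniform between `-std` and `+std`, and its standard deviation is `std / sqrt(3)`, roughly 0.58 of the original. A small sketch with made-up numbers (the mean, std and sample count are assumptions, not dataset values):

```python
import numpy as np

rng = np.random.default_rng(0)

mean, std = 10.0, 2.0      # hypothetical class statistics for one feature
n = 100_000                # many samples, to make the estimate stable

# same recipe as gen_sample for one feature: a random sign times a uniform fraction of std
signs = rng.choice([-1, 1], size=n)
fake = np.abs(mean + std * rng.random(n) * signs)

ratio = fake.std() / std
print("std of fake values / original std: %.3f" % ratio)   # close to 1/sqrt(3)
```

So the more synthetic rows a class receives, the more its spread is pulled toward this narrower distribution.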

Now I do the same for the other under-represented classes. I already checked that classes 3 and 5 have no all-zero features.

In [None]:
# Class 3
# Selecting the 'mean' and 'std' rows from describe() to generate random values for the class
means = df[df['Type']==3].describe().loc['mean',:]
stds = df[df['Type']==3].describe().loc['std',:]

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
fake = [gen_sample(means, stds, 3, []) for _ in range(13)]
dfnew = pd.concat([dfnew, pd.DataFrame(fake)], ignore_index=True)

# Class 5
# Selecting the 'mean' and 'std' rows from describe() to generate random values for the class
means = df[df['Type']==5].describe().loc['mean',:]
stds = df[df['Type']==5].describe().loc['std',:]
fake = [gen_sample(means, stds, 5, []) for _ in range(17)]
dfnew = pd.concat([dfnew, pd.DataFrame(fake)], ignore_index=True)

### A final check on the data

In [None]:
dfnew.groupby(by='Type').count()

In [None]:
df = dfnew

### X and Y
Dropping the class (`Type` column) from the X set and moving it into the Y set.

In [None]:
X = df.drop(['Type'], axis=1)
Y = df['Type']

How strongly each feature correlates with the class:

In [None]:
df.corr()['Type'].abs().sort_values(ascending=False)

### Calculating data skewness and possibly unskewing

In [None]:
import matplotlib.pyplot as plt
from sklearn import preprocessing
from scipy.stats import skew
from scipy.stats import boxcox

# getting feature names to loop over
features = X.columns.values

# This will contain the unskewed features
X_unsk = pd.DataFrame()

# looping through the features
for c in features:
    scaled = preprocessing.scale(X[c])
    # Box-Cox needs strictly positive input, so each feature is shifted by max(|x|) + 1 first
    boxcox_scaled = preprocessing.scale(boxcox(X[c] + np.max(np.abs(X[c]) + 1))[0])

    # Populating the unskewed dataset
    X_unsk[c] = boxcox_scaled

    # Next we calculate skewness using skew from scipy.stats
    skness = skew(scaled)
    boxcox_skness = skew(boxcox_scaled)

    # We draw the histograms
    figure = plt.figure()
    # First the original (scaled) data shape
    figure.add_subplot(121)
    plt.hist(scaled, facecolor='blue', alpha=0.55)
    plt.xlabel(c + " - Scaled")
    plt.title("Skewness: {0:.2f}".format(skness))

    # then the unskewed
    figure.add_subplot(122)
    plt.hist(boxcox_scaled, facecolor='red', alpha=0.55)
    plt.xlabel(c + " - Box-Cox")
    plt.title("Skewness: {0:.2f}".format(boxcox_skness))

    plt.show()

In most cases the Box-Cox transformation successfully reduces the skewness of the data.
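
To see the effect in isolation, here is a minimal sketch on a synthetic, strongly right-skewed feature. The log-normal data is an assumption chosen because Box-Cox handles it well; it is not one of the glass features.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)

# Hypothetical right-skewed, strictly positive feature (log-normal), not the glass data
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# boxcox returns the transformed values and the lambda it estimated
x_bc, lmbda = boxcox(x)

skew_before = skew(x)
skew_after = skew(x_bc)
print("before: %.2f, after: %.2f, lambda: %.2f" % (skew_before, skew_after, lmbda))
```

Features with many exact zeros (such as Ba or Fe here) are a harder case, since shifting them does not remove the spike at the minimum.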

## Hyperparameters

Searching for the best parameters for the Random Forest Classifier.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Here I use the unskewed dataset
X = X_unsk
X_tr, X_ts, y_tr, y_ts = train_test_split(X, Y, test_size=0.40, random_state=42)

# max_features='auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
rf = RandomForestClassifier(max_features='sqrt', oob_score=True, random_state=1, n_jobs=-1)
param_grid = { "criterion" : ["gini", "entropy"]
              , "min_samples_leaf" : [1, 5, 10]
              , "min_samples_split" : [2, 4, 10, 12, 16]
              , "n_estimators": [100, 125, 200]}
gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)
gs = gs.fit(X_tr, y_tr)
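
To get a feel for the cost of this search, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below reuses the grid from the cell above and assumes the same `cv=3`.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_leaf": [1, 5, 10],
              "min_samples_split": [2, 4, 10, 12, 16],
              "n_estimators": [100, 125, 200]}

n_candidates = len(ParameterGrid(param_grid))   # 2 * 3 * 5 * 3 combinations
n_fits = n_candidates * 3                       # each one is fitted on 3 folds
print(n_candidates, n_fits)
```

That is 90 combinations and 270 fits, plus one final refit of the best combination.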

### Score

Printing the best cross-validation score and the corresponding parameters

In [None]:
print(gs.best_score_)
print(gs.best_params_)

### Training

Final training with the best hyperparameters found by GridSearchCV

In [None]:
bp = gs.best_params_
rf = RandomForestClassifier( criterion=bp['criterion'], 
                             n_estimators=bp['n_estimators'],
                             min_samples_split=bp['min_samples_split'],
                             min_samples_leaf=bp['min_samples_leaf'],
                             max_features='sqrt',
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)

rf.fit(X_tr, y_tr)
pred = rf.predict(X_ts)

score = rf.score(X_ts, y_ts)
print("Score: %.3f" % (score))

Compared with the past results, it works better on the training set but worse on the validation set.
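
A side note on the final training step: with the default `refit=True`, `GridSearchCV` already refits the best parameter combination on the whole training set and exposes it as `best_estimator_`, so the model does not have to be rebuilt by hand. A minimal sketch on synthetic data (the `make_classification` dataset, the `_demo` names and the tiny grid are assumptions made to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the glass data
X_demo, y_demo = make_classification(n_samples=200, n_features=9, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr_d, X_ts_d, y_tr_d, y_ts_d = train_test_split(X_demo, y_demo, test_size=0.4,
                                                  random_state=42)

gs_demo = GridSearchCV(RandomForestClassifier(random_state=1),
                       param_grid={"n_estimators": [10, 20], "min_samples_leaf": [1, 5]},
                       scoring="accuracy", cv=3)
gs_demo.fit(X_tr_d, y_tr_d)

# best_estimator_ is already fitted on X_tr_d with best_params_
best_rf = gs_demo.best_estimator_
test_score = best_rf.score(X_ts_d, y_ts_d)
print("Score: %.3f" % test_score)
```

Rebuilding the forest explicitly, as done above, gives the same kind of model and makes the chosen parameters visible in the code.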

### Feature importance
These are the feature importances computed by the algorithm

In [None]:
pd.concat((pd.DataFrame(X.columns, columns = ['variable']), 
           pd.DataFrame(rf.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]
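
The same table can be built a little more directly with a `pandas.Series` indexed by feature name. This is only a sketch on synthetic data: the forest settings and the random values are assumptions, and the feature names are reused from the glass dataset purely for readability.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe']

# Random stand-in data, NOT the real measurements; only 'Mg' drives the label
X_demo = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
y_demo = (X_demo['Mg'] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)

# importances are aligned with the column order used for fitting
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).head())
```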

Taking a look at the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
import itertools
#print(y_ts.values)
#print(pred)

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # normalize first, so that the plotted image matches the printed matrix
    if normalize:
        cm = (cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]).round(decimals=2)
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cnf_matrix = confusion_matrix(y_ts.values, pred)

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=np.sort(y_ts.unique()), normalize=False,
                      title='Confusion matrix, without normalization')

Classes 4 and 5 are OK now, but I corrupted classes 1 and 2, and class 3 is matched less than 50% of the time.
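
Statements such as "matched less than 50% of the time" can be read off the confusion matrix as per-class recall: the diagonal divided by the row sums. A small sketch with made-up labels and predictions (not this notebook's actual results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
y_pred = np.array([1, 1, 2, 2, 2, 1, 2, 3, 1])

labels = np.unique(y_true)
cm = confusion_matrix(y_true, y_pred, labels=labels)

# rows are true classes, so diagonal / row sum = fraction of each class predicted correctly
recall = cm.diagonal() / cm.sum(axis=1)
for label, r in zip(labels, recall):
    print("class %d: %.2f" % (label, r))
```

In this toy case class 3 is recovered only a quarter of the time, even though the overall accuracy looks moderate.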

## Trying with XGBoost

As suggested in the comments, I compare this Random Forest model with XGBoost to see how both perform on this dataset. [I fixed the parameter values to the best set found on my PC because the search takes too long to run online.]

In [None]:
from xgboost import XGBClassifier

# Here I use the unskewed dataset
X = X_unsk
X_tr, X_ts, y_tr, y_ts = train_test_split(X, Y, test_size=0.4, random_state=42)
xgb = XGBClassifier()

param_grid = { "max_depth" : [5]
              , "learning_rate" : [0.125]
              , "n_estimators": [50]
              , "reg_lambda": [.1]}
gs = GridSearchCV(estimator=xgb, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1, verbose=1)
gs = gs.fit(X_tr, y_tr)
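
One practical caveat, stated as an assumption about the installed version: recent XGBoost releases expect the class labels to be consecutive integers starting at 0, while the glass types are 1, 2, 3, 5, 6 and 7. If `fit` complains about invalid classes, the labels can be remapped with scikit-learn's `LabelEncoder` and mapped back after prediction. A minimal sketch of the remapping only (no XGBoost involved, and `y_demo` is a made-up label vector):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Glass types present in the dataset (class 4 has no samples)
y_demo = np.array([1, 2, 3, 5, 6, 7, 7, 5, 1])

le = LabelEncoder()
y_enc = le.fit_transform(y_demo)        # consecutive codes 0..5

# After predicting in the encoded space, map the codes back to the glass types
y_back = le.inverse_transform(y_enc)
print(sorted(set(y_enc.tolist())), list(le.classes_))
```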

In [None]:
print(gs.best_score_)
print(gs.best_params_)

The score on the test set is a bit worse than the RF's, even though I tried many parameter combinations.

In [None]:
bp = gs.best_params_
xgb = XGBClassifier(max_depth=bp['max_depth'],
                    n_estimators=bp['n_estimators'],
                    learning_rate=bp['learning_rate'],
                    reg_lambda=bp['reg_lambda'])

xgb.fit(X_tr, y_tr)
pred = xgb.predict(X_ts)

score = xgb.score(X_ts, y_ts)
print("Score: %.3f" % (score))

The feature importance table is slightly different from the RF's.

In [None]:
pd.concat((pd.DataFrame(X.columns, columns = ['variable']), 
           pd.DataFrame(xgb.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]

In [None]:
cnf_matrix = confusion_matrix(y_ts.values, pred)

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=np.sort(y_ts.unique()), normalize=False,
                      title='Confusion matrix, without normalization')

Again, the issue here is classes 1 and 2, which worked a lot better before the fake data was introduced, and class 3, which is badly misclassified.

## Conclusions

OLD COMMENT:
> I'm really a novice in ML but I'm trying to apply all the interesting stuff I find in many awesome Kaggle kernels and discussions in order to slowly learn how to work with data.
> In this case I've learned a bit more about feature skewing and Random Forest parameter tuning, and I did some experiments with XGBoost, but the kernel is really far from defining a decent classifier for the glass classification problem. Maybe I need to study the dataset more and try other classifiers.

NEW COMMENT:
I tried to fix the poor performance on this small dataset by adding fake values, similar to the existing ones, for the less represented classes. It has not worked as I expected. Maybe I added too much data, or maybe someone out there has a better explanation!

Anyway, any advice is welcome!