# Glass Identification data set

# 1. Exploratory Data Analysis

## 1.1. Data set description

The Glass Identification data set was created to support criminological investigations.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white")
import warnings
warnings.simplefilter("ignore")

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score,cross_validate
from sklearn.model_selection import KFold,StratifiedKFold
from sklearn.metrics import classification_report,f1_score,accuracy_score,confusion_matrix

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.pipeline import make_pipeline

In [None]:
# load dataset
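# pd.DataFrame.from_csv was removed from pandas; pd.read_csv is its replacement
df = pd.read_csv("glass.data.txt", sep=",", header=None, index_col=None)
# drop the first column (sample ID)
df = df[df.columns[1:11]]
# add column names
df.columns = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Class"]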
df.head(3)

In [None]:
      RI      Na     Mg     Al     Si    K      Ca     Ba     Fe     Class
0   1.52101  13.64  4.49   1.10   71.78  0.06   8.75   0.0    0.0     1
1   1.51761  13.89  3.60   1.36   72.73  0.48   7.83   0.0    0.0     1
2   1.51618  13.53  3.55   1.54   72.99  0.39   7.78   0.0    0.0     1


In [None]:
print(df.info())

In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
RI       214 non-null float64
Na       214 non-null float64
Mg       214 non-null float64
Al       214 non-null float64
Si       214 non-null float64
K        214 non-null float64
Ca       214 non-null float64
Ba       214 non-null float64
Fe       214 non-null float64
Class    214 non-null int64
dtypes: float64(9), int64(1)
memory usage: 16.8 KB
None

The data set contains 214 instances, each with 9 numeric attributes and a class label.

There are no missing values.
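
The absence of missing values can also be checked directly rather than read off the `info()` output. The following is a minimal sketch, not part of the original notebook: the helper name `count_missing` is illustrative, and the small frame holds only a few columns of the first three rows shown earlier.

```python
import pandas as pd

# A few columns of the first three rows displayed by df.head(3) above.
sample = pd.DataFrame({
    "RI": [1.52101, 1.51761, 1.51618],
    "Na": [13.64, 13.89, 13.53],
    "Class": [1, 1, 1],
})

def count_missing(frame):
    """Return the total number of missing cells in the frame."""
    # isnull() marks each missing cell; the two sums add them up
    # first per column and then over all columns.
    return int(frame.isnull().sum().sum())

print(count_missing(sample))  # prints 0
```

Calling `count_missing(df)` on the full data set should likewise return 0 if the `info()` summary above is accurate.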

Features:

1. RI: refractive index
2. Na: Sodium (unit of measurement: weight percent in the corresponding oxide, as for attributes 3-9)
3. Mg: Magnesium
4. Al: Aluminum
5. Si: Silicon
6. K: Potassium
7. Ca: Calcium
8. Ba: Barium
9. Fe: Iron

Glass types:

1. building_windows_float_processed
2. building_windows_non_float_processed
3. vehicle_windows_float_processed
4. vehicle_windows_non_float_processed (none in this database)
5. containers
6. tableware
7. headlamps
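
For readable plots and reports, the numeric class codes can be mapped to the names listed above. This is a minimal sketch; the names `GLASS_TYPES` and `glass_type_name` are illustrative and do not appear in the original notebook.

```python
# Class codes and names as listed above. GLASS_TYPES and glass_type_name
# are illustrative names, not part of the original notebook.
GLASS_TYPES = {
    1: "building_windows_float_processed",
    2: "building_windows_non_float_processed",
    3: "vehicle_windows_float_processed",
    4: "vehicle_windows_non_float_processed",  # none in this database
    5: "containers",
    6: "tableware",
    7: "headlamps",
}

def glass_type_name(code):
    """Return the glass type name for a numeric class code."""
    return GLASS_TYPES[code]

print(glass_type_name(7))  # prints headlamps
```

With the DataFrame loaded earlier, `df["Class"].map(GLASS_TYPES)` would produce a column of names in the same way.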

## 1.2. Summary statistics and data distribution

The means of some features are very small (Fe, Ba), while the mean of Si is very large. Plot 1.1 shows that Si has by far the largest weight percent in the oxide, which is expected because it is the dominant component of glass. We therefore need to standardise the features so that each has a mean of 0 and a standard deviation of 1.

Standardization is a common requirement for many machine learning algorithms, because many elements of the objective function assume that all features are centered around zero and have variances of the same order. If one feature has a variance orders of magnitude larger than the others, it may dominate the objective function and prevent the estimator from learning from the other features as expected.
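
To make the transformation concrete, here is a minimal sketch of standardization written with NumPy only. The function name `standardize` is illustrative, and the input is limited to the Na, Mg and Si values of the first three rows shown earlier, so the resulting numbers are not the ones obtained on the full data set.

```python
import numpy as np

# Na, Mg and Si values of the first three rows displayed by df.head(3).
X = np.array([
    [13.64, 4.49, 71.78],
    [13.89, 3.60, 72.73],
    [13.53, 3.55, 72.99],
])

def standardize(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std

X_std = standardize(X)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

scikit-learn's `StandardScaler`, imported above, applies the same transformation and additionally stores the fitted mean and scale so that they can be reused on unseen data.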

We can explore further the distribution of each feature by class.
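
One possible starting point for a by-class view is a grouped summary. The sketch below is an assumption about how this could be done, not the notebook's own method: `per_class_means` is an illustrative helper, and the toy frame contains made-up values that only demonstrate the call pattern.

```python
import pandas as pd

# Toy frame with made-up values, used only to show the call pattern;
# in the notebook the same function would be applied to `df`.
toy = pd.DataFrame({
    "Mg": [4.0, 3.0, 0.0, 1.0],
    "Ba": [0.0, 0.0, 1.0, 2.0],
    "Class": [1, 1, 7, 7],
})

def per_class_means(frame, target="Class"):
    """Return the mean of every feature within each class."""
    return frame.groupby(target).mean()

means = per_class_means(toy)
print(means)
```

Replacing `.mean()` with `.describe()` would give the full set of summary statistics per class instead of the mean alone.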

In [None]:
print(df.describe())

In [None]:
             RI          Na          Mg          Al          Si           K  \
count  214.000000  214.000000  214.000000  214.000000  214.000000  214.000000   
mean     1.518365   13.407850    2.684533    1.444907   72.650935    0.497056   
std      0.003037    0.816604    1.442408    0.499270    0.774546    0.652192   
min      1.511150   10.730000    0.000000    0.290000   69.810000    0.000000   
25%      1.516523   12.907500    2.115000    1.190000   72.280000    0.122500   
50%      1.517680   13.300000    3.480000    1.360000   72.790000    0.555000   
75%      1.519157   13.825000    3.600000    1.630000   73.087500    0.610000   
max      1.533930   17.380000    4.490000    3.500000   75.410000    6.210000   

               Ca          Ba          Fe       Class  
count  214.000000  214.000000  214.000000  214.000000  
mean     8.956963    0.175047    0.057009    2.780374  
std      1.423153    0.497219    0.097439    2.103739  
min      5.430000    0.000000    0.000000    1.000000  
25%      8.240000    0.000000    0.000000    1.000000  
50%      8.600000    0.000000    0.000000    2.000000  
75%      9.172500    0.000000    0.100000    3.000000  
max     16.190000    3.150000    0.510000    7.000000  

In [None]:
fig,ax=plt.subplots(figsize=(10, 10))
sns.boxplot(data=df.loc[:,"RI":"Fe"], palette='Paired',ax=ax)
sns.despine()
plt.title('Plot 1.1 Boxplot Glass data set')