Before starting any data science analysis, it's important that we understand the data we are dealing with. Whatever we can learn about it will help us build a better model.

The current dataset describes different types of glass. The goal of this analysis is to create a model that can identify the glass type from the variables that make up a glass.

## Loading and understanding the dataset


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

data = pd.read_csv('../input/glass.csv')
features = data.loc[:,data.columns != 'Type']
target = data.loc[:,'Type']

In [None]:
features.describe()

The dimensions in our dataset are the refractive index and the various constituents of glass. Variation in these quantities gives us different kinds of glass.

### Let's understand what those dimensions really mean.

1. RI - Refractive index. In optics, the refractive index (or index of refraction) of a material is a dimensionless number that describes how light propagates through that medium. **This is an important factor that differentiates glass types**.

2. Na - Sodium

3. Mg - Magnesium
Magnesium and iron increase glass alteration, forming tri-octahedral smectites with the same (Fe + Mg)/Si ratio. With iron, two kinds of silicates precipitate with the same composition but with a different morphology, whereas with magnesium alone, a single Mg-silicate forms. Moreover, it was found that the glass alteration rate drops when the pH stabilizes at a minimum value of 7.8 for Mg-silicates and 6.2 for Fe-silicates. At this point the secondary silicates stop precipitating. This result was confirmed by geochemical simulation and the solubility product of these silicates was estimated considering the presence or absence of aluminum in their structure.
Source: https://www.sciencedirect.com/science/article/pii/S088329271630539X

4. Al - Aluminium. Gives strength to the glass; according to the source below, a higher alumina content makes for stronger glass.
Source: https://www.theregister.co.uk/2015/11/04/alumina_in_glass_could_make_busted_smartphones_a_thing_of_the_past/

5. Si - Silica. Silica is the primary ingredient in the production of most glass, so it won't be a surprise if we find Si having the largest share in the composition.

6. K - Potassium. Potassium ions are used in chemically strengthened glass. Source: https://en.wikipedia.org/wiki/Chemically_strengthened_glass

7. Ca - Calcium. Calcium, in the form of lime, is a main ingredient of soda–lime glass. Soda–lime glass, also called soda–lime–silica glass, is the most prevalent type of glass, used for windowpanes and glass containers (bottles and jars) for beverages, food, and some commodity items. Glass bakeware is often made of borosilicate glass. Soda–lime glass accounts for about 90% of manufactured glass.
Source: https://en.wikipedia.org/wiki/Soda%E2%80%93lime_glass

8. Ba - Barium. Barium is used in the glass of cathode-ray-tube television screens to shield viewers from X-rays that could cause long-term health issues.

9. Fe - Iron. As mentioned under Magnesium, iron increases glass alteration.

Now that we have an understanding of our dimensions, let's find the correlations between them.


In [None]:
corrmat = features.corr()
corrmat

In [None]:
corrmat.iloc[0,:].plot(kind='bar')

As the plot above shows, the refractive index of a glass is positively correlated with Calcium and Iron content and negatively correlated with the remaining elements. Keep in mind that a correlation alone does not show that changing one quantity causes the other to change.

## Let's also see how the other elements are correlated with each other.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 12))
# One bar chart per feature: its correlation with every other feature.
# corrmat.iterrows() yields (row label, row), and the row label is the column name.
for position, (column_name, row) in enumerate(corrmat.iterrows(), start=1):
    plt.subplot(3, 3, position)
    row.drop(column_name).plot(kind='bar', title=column_name)
plt.tight_layout()
plt.show()

We see that Silica is the only component that is negatively correlated with every other component of the glass. Because the amounts are shares of the same composition, a higher share of Silica generally leaves less room for everything else. In the graph, Silica's strongest relationship is its negative correlation with the refractive index.
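One plausible explanation for a dominant component being negatively correlated with the rest is that the amounts are percentages of a whole. The sketch below illustrates this on synthetic data, not on the glass dataset: the column names `major`, `minor_a` and `minor_b` and their distributions are made up for the illustration. The raw amounts are drawn independently, yet converting them to percentages is enough to make the largest component negatively correlated with the others.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical, independently drawn raw amounts; "major" dominates the mix.
raw = pd.DataFrame({
    "major": rng.normal(70, 5, 1000),
    "minor_a": rng.normal(15, 2, 1000),
    "minor_b": rng.normal(10, 2, 1000),
})

# Convert each row to percentages so that it sums to 100.
composition = raw.div(raw.sum(axis=1), axis=0) * 100

raw_corr = raw.corr()             # close to zero off the diagonal
closed_corr = composition.corr()  # "major" is now negatively correlated with the rest
print(closed_corr.round(2))
```

This does not prove that the same mechanism drives the correlations in the glass data, but it is worth keeping in mind when reading correlations between shares of a composition.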

We now have clarity on what the data is about and how the features relate to each other in the making of glass. Let's now see if there are any outliers in the dataset. A good way to do that is to visualize each feature using boxplots.

In [None]:
chartlocation = 0
plt.figure(figsize=(15,12))
columns = features.columns.values
for column in columns:
    chartlocation = chartlocation + 1
    plt.subplot(3,3,chartlocation)
    features.boxplot(column=column)

As the boxplots above show, many data points are outliers and may hurt the quality of our machine learning model, so we need to deal with them. The effect of the outliers is also visible in the distribution of the data, which we will look at now.

In [None]:
# DataFrame.hist draws one subplot per column, so let pandas create the figure.
features.hist(figsize=(15, 12))
plt.show()

As the histograms above show, K, Ca, Fe and Ba have the most skewed distributions. A distribution is skewed if one of its tails is longer than the other. The K, Ca, Fe and Ba distributions have a positive skew, meaning the longer tail is on the right. The plot below shows the skewness of every feature in our feature set.

In [None]:
features.skew().plot(kind='bar')
plt.show()
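To get a feel for the numbers that `skew()` returns, here is a minimal sketch on synthetic data (the column names and distributions are assumptions chosen for the illustration, not part of the glass dataset). A symmetric sample has a skewness near zero, while a sample with a long right tail has a clearly positive value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical samples: one symmetric, one with a long right tail.
demo = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 5000),
    "right_skewed": rng.exponential(1.0, 5000),
})

skewness = demo.skew()
print(skewness.round(2))
```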

Now that we know our dataset has a lot of outliers, it's time to remove them.

In [None]:
def find_outlier_fences_IQR(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    return [fence_low, fence_high]

fences = {}
for column in features.columns.values:
    fences[column] = find_outlier_fences_IQR(features, column)
print(fences)

# Let's find rows with more than one outlier value and mark them for removal.
outliers_index = []
for index, row in features.iterrows():
    outliers_detected = 0
    for column in features.columns.values:
        fence_low = fences[column][0]
        fence_high = fences[column][1]
        if row[column] < fence_low or row[column] > fence_high:
            outliers_detected = outliers_detected + 1
    
    if outliers_detected > 1:
        outliers_index.append(index)

print("\nthere are %d rows found with more than 1 outlier" %(len(outliers_index)))
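The same 1.5 × IQR rule can also be written without explicit loops. The sketch below is one possible vectorized variant, shown on a tiny made-up frame; the helper name `count_iqr_outliers` and the demo values are assumptions for illustration only.

```python
import pandas as pd

def count_iqr_outliers(df, k=1.5):
    """Return, for each row, how many columns fall outside the IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the column labels.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum(axis=1)

# Hypothetical data: row 5 is extreme in both columns, row 4 in one column only.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 2.0, 50.0, 60.0, 2.5, 1.5],
    "b": [10.0, 11.0, 12.0, 11.5, 10.5, 99.0, 11.0, 12.5],
})

counts = count_iqr_outliers(demo)
rows_to_drop = counts[counts > 1].index.tolist()
print(rows_to_drop)
```

As in the loop version, only rows with more than one out-of-fence value are selected, so a row that is unusual in a single column is kept.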

Let's remove the rows found above from the dataset and then move on with modelling.

In [None]:
outliers_removed_featureset = features.drop(outliers_index)
outliers_removed_targetset = target.drop(outliers_index)

### Standardization of a dataset
This is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

In [None]:
from sklearn.preprocessing import StandardScaler
autoscaler = StandardScaler()
features_scaled = autoscaler.fit_transform(outliers_removed_featureset)
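A quick way to convince yourself of what `StandardScaler` does is to check the column means and standard deviations after scaling. The sketch below uses two made-up features on very different scales (the values are assumptions, loosely resembling a refractive index and a percentage).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(1.5, 0.003, 200),
    rng.normal(70, 5, 200),
])

X_scaled = StandardScaler().fit_transform(X)
col_means = X_scaled.mean(axis=0)  # approximately 0 for every column
col_stds = X_scaled.std(axis=0)    # approximately 1 for every column
print(col_means.round(6), col_stds.round(6))
```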

### Visualizing the dataset 

In [None]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

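fig = plt.figure(1, figsize=(8, 6))
# Recent Matplotlib versions no longer attach Axes3D(fig) to the figure automatically.
ax = fig.add_subplot(projection='3d')
# features_scaled is already a NumPy array, so pass it to PCA directly.
X_reduced = PCA(n_components=3).fit_transform(features_scaled)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=outliers_removed_targetset)
ax.set_title("First three principal components")
plt.show()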

In [None]:
X_reduced = PCA(n_components=2).fit_transform(features_scaled)
plt.title("First two principal components")
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=outliers_removed_targetset)
plt.show()
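When reading PCA scatter plots it helps to know how much of the total variance the plotted components actually capture. A minimal sketch, using random data as a stand-in for the scaled glass features (the data here is an assumption, so the printed numbers say nothing about the real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled features: 200 rows, 9 columns.
X = StandardScaler().fit_transform(rng.normal(size=(200, 9)))

pca = PCA(n_components=3).fit(X)
explained = pca.explained_variance_ratio_  # share of variance per component
print(explained.round(3), explained.sum().round(3))
```

If the retained components explain only a small share of the variance, overlapping classes in the plot do not necessarily mean the classes are inseparable in the full feature space.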

## Splitting the datasets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_scaled,outliers_removed_targetset, test_size=0.20, random_state=42)
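Note that in the cells above the scaler is fitted on all rows before the split, so the test rows influence the scaling. One common alternative is to split first and put the scaler in a pipeline, so that it is fitted on the training rows only. The sketch below shows the idea on synthetic data generated with `make_classification`; the dataset shape and the choice of `SVC` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the glass features (9 columns, 3 classes); not the real data.
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           n_classes=3, random_state=42)

# Split first; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# The scaler inside the pipeline is fitted on the training rows only.
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print('score:', test_score)
```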

### Modelling
We will use one of the ensemble methods to estimate how important each feature is for predicting the glass type.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100)
clf = clf.fit(X_train, y_train)
feature_with_importance = pd.DataFrame()
feature_with_importance['columns'] = outliers_removed_featureset.columns
feature_with_importance['importance'] = clf.feature_importances_
feature_with_importance.sort_values(by=['importance'], ascending=True, inplace=True)
feature_with_importance.set_index('columns', inplace=True)
feature_with_importance.plot(kind='bar')
plt.show()

As the graph above shows, almost all features contribute to predicting the glass type. Now let's try out a few algorithms and see which one best suits our dataset.

In [None]:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = [
    SVC(),
    KNeighborsClassifier(),
    GradientBoostingClassifier(n_estimators=100)
]

# Fit each candidate and report its accuracy on the held-out test set.
for model in models:
    clf = model.fit(X_train, y_train)
    print(type(model).__name__, 'score:', clf.score(X_test, y_test))
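A single 20% hold-out split can give noisy scores on a small dataset, so the ranking of models may change with the random seed. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in, so the printed scores are not results for the glass dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (9 features, 3 classes); not the real glass data.
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)

cv_results = {}
for model in [SVC(), KNeighborsClassifier()]:
    # Accuracy on each of 5 folds; the spread hints at how stable the score is.
    scores = cross_val_score(model, X, y, cv=5)
    cv_results[type(model).__name__] = (scores.mean(), scores.std())
    print(type(model).__name__, scores.mean().round(3), scores.std().round(3))
```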

### Hyperparameters tuning
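Hyperparameters are parameters that are not learnt directly within estimators; they are set before training, for example `C`, `gamma` and `kernel` for an SVC. Grid search tries every combination in a parameter grid and keeps the one with the best cross-validated score.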

In [None]:
from sklearn.model_selection import GridSearchCV
parameter_grid = {
    'C' :  [1, 10, 100, 1000, 1500],
    'gamma' : [0.001, 0.01, 0.1, 1],
    'kernel': [ 'rbf', 'sigmoid']
}

gsv = GridSearchCV(SVC(),parameter_grid)
gsv = gsv.fit(X_train, y_train)
print('score:',gsv.score(X_test,y_test))
gsv.best_params_
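Besides `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of every combination through `cv_results_`, which helps to see whether the winner is clearly better or only marginally so. A minimal sketch on synthetic data with a deliberately small grid (both the data and the grid values are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (9 features, 3 classes); not the real glass data.
X, y = make_classification(n_samples=200, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)

# A small hypothetical grid: 2 x 2 = 4 combinations.
grid = {"C": [1, 10], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)

# One row per combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated score of the best combination, so it is a different quantity from the test-set score printed by `gsv.score(X_test, y_test)`.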