# UCI Glass Detection Dataset

## Comparison of Classification Algorithms



# Outline

- Importing libraries
- Importing Dataset
- Exploring Dataset
- Preparing Dataset
- Removing Outliers
- Visualization of dataset
- Train/Test Split
- Applying Machine Learning Models
- Summary

## List Of Machine Learning Models That Have Been Applied:

- KNN
- Logistic Regression
- Decision Tree
- SVM (Linear Kernel)
- SVM (Non-Linear Kernel)
- Random Forest
- Neural Network
- Gradient Tree Boosting

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np

## Importing Dataset

In [None]:
data = pd.read_csv("../input/glass/glass.csv")
data.head()

## Exploring Dataset

1. Shape of dataset  
2. Count of null values  
3. Unique values  
4. Statistics of dataset

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
data.describe()
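
The unique-values step listed above could be covered with `nunique()` and `value_counts()`. The following is a minimal sketch on a small made-up frame (the `toy` values are hypothetical, not taken from the glass data); on the real dataset the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for the glass data: two features and a class label.
toy = pd.DataFrame({
    "RI": [1.52, 1.52, 1.51, 1.53, 1.52],
    "Na": [13.6, 13.9, 13.5, 13.2, 13.3],
    "glass_type": [1, 1, 2, 2, 7],
})

# Number of distinct values in every column.
unique_counts = toy.nunique()
print(unique_counts)

# Number of samples per class, ordered by class label.
class_counts = toy["glass_type"].value_counts().sort_index()
print(class_counts)
```

The per-class counts are worth checking early, because strongly imbalanced classes affect how accuracy should be interpreted later on.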

### The descriptive statistics show that the attributes span very different ranges. Hence, the data should be scaled before training.

## Preparing The Dataset

#### Adding meaningful column/attribute names

In [None]:
names = ['RI','Na','Mg','Al','Si','K','Ca','Ba','Fe','glass_type']
data.columns = names
data.head()

In [None]:
data.head(3)

## Removing Outliers With The Help Of Z-score

In [None]:
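from scipy import stats

# Absolute z-score of every value, computed per column.
# Note that the glass_type label column is included here as well.
z = abs(stats.zscore(data))

#np.where(z > 3)

# Keep only the rows in which every column lies within 3 standard deviations of its mean.
data = data[(z < 3).all(axis=1)]

#data.shape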
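
To make the effect of the z-score filter concrete, here is a small sketch on made-up data (the `toy` frame and its values are assumptions for illustration). One row holds an extreme value in column `a`, and only that row is dropped.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 20 rows, the last one has an extreme value in column "a".
toy = pd.DataFrame({
    "a": [1.0] * 19 + [100.0],
    "b": np.arange(20, dtype=float),
})

# Absolute z-score of each value, computed column by column.
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# A row is kept only if all of its values are within 3 standard deviations.
keep = (z < 3).all(axis=1)
filtered = toy[keep]

print(toy.shape, "->", filtered.shape)
print("rows removed:", list(toy.index[~keep]))
```

With very small samples a single value cannot reach a z-score of 3 at all, so the threshold only becomes meaningful once there are enough rows.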

### Separating Features and Label

In [None]:
features = ['RI','Na','Mg','Al','Si','K','Ca','Ba','Fe']
label = ['glass_type']

X = data[features]

y = data[label]

In [None]:
X.shape

In [None]:
type(X)

## Data Visualization

In [None]:
x2 = X.values

from matplotlib import pyplot as plt
import seaborn as sns

# Plot the distribution of every feature; x2[:, i] selects the i-th feature column.
for i in range(len(features)):
    sns.distplot(x2[:, i])
    plt.xlabel(features[i])
    plt.show()

## The plots above show that most features are skewed, either positively or negatively, and that the features are not on a common scale.
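
Skewness can also be read off numerically instead of from the plots. The sketch below uses pandas' `DataFrame.skew()` on synthetic columns (all column names and values are made up for illustration); a positive value means a long right tail, a negative value a long left tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns with known shapes.
toy = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=1000),
    "left_skewed": -rng.exponential(scale=1.0, size=1000),
    "symmetric": rng.normal(size=1000),
})

# Sample skewness per column.
skewness = toy.skew()
print(skewness.round(2))
```

On the glass data the equivalent call would be `X.skew()`, assuming `X` is still a DataFrame at that point.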

In [None]:
x2 = pd.DataFrame(X)

# pairplot creates its own figure, so a separate plt.figure() call is not needed.
sns.pairplot(data=x2)
plt.show()

In [None]:
correlation= X.corr()
plt.figure(figsize=(15,15))
sns.heatmap(correlation,cbar=True,square=True,annot=True,fmt='.1f',annot_kws={'size': 15},
            xticklabels=features,yticklabels=features,alpha=0.7,cmap= 'coolwarm')
plt.show()

The heatmap shows the pairwise correlation between the features of the dataset.
Conclusion:
- RI and Ca are strongly correlated with each other
- Al and Ba are moderately correlated with each other
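
The strongest pairs can also be listed directly from the correlation matrix rather than read off the heatmap. This is a sketch on synthetic data (the columns `x`, `y`, `z` are made up); on the notebook's data the same loop would run over `X.corr()`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Hypothetical features: y is almost a linear function of x, z is independent.
toy = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

corr = toy.corr()

# Every unordered pair of columns, sorted by absolute correlation (strongest first).
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(toy.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs:
    print(f"{a} - {b}: {r:+.2f}")
```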

#### Scaling the data (0-1 range)

In [None]:
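## Normalizing / Scaling the data  

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler maps every feature to the [0, 1] range. The scaler is only created here;
# the fit/transform calls are left commented out, so X is not changed by this cell.
scaler = MinMaxScaler()
#scaler.fit(X)
#X = scaler.transform(X)
#X = pd.DataFrame(X)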
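
Min-max scaling and standardization give different guarantees, which the following sketch illustrates on a tiny made-up array (the `values` are hypothetical). `MinMaxScaler` bounds each column to [0, 1], while `StandardScaler` centers each column at zero with unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two columns on very different scales.
values = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 900.0],
])

minmax = MinMaxScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

# Min-max scaling: each column now runs from 0 to 1.
print(minmax.min(axis=0), minmax.max(axis=0))

# Standardization: each column now has mean 0 and standard deviation 1.
print(standard.mean(axis=0).round(6), standard.std(axis=0).round(6))
```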

In [None]:
X.head(2)

In [None]:
y.head(2)

### Scaling The Features

In [None]:
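from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance.
# After this call X is a NumPy array, not a DataFrame.
X=preprocessing.scale(X)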
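
A quick way to see what `preprocessing.scale` changes is to compare summary statistics before and after. The sketch below uses a synthetic skewed column (the `raw` data is made up); it suggests that the mean and spread are rescaled while the skewness stays the same, since standardization is a linear transformation.

```python
import numpy as np
from scipy.stats import skew
from sklearn import preprocessing

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature, stored as a single column.
raw = rng.exponential(scale=5.0, size=(500, 1))
scaled = preprocessing.scale(raw)

# Location and spread change to mean 0 and standard deviation 1.
print("mean:", raw.mean().round(3), "->", scaled.mean().round(3))
print("std: ", raw.std().round(3), "->", scaled.std().round(3))

# The shape of the distribution, measured by skewness, is unchanged.
skew_before = skew(raw, axis=0)
skew_after = skew(scaled, axis=0)
print("skew:", skew_before.round(3), "->", skew_after.round(3))
```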

## Visualizing Data After Preprocessing

In [None]:
x2 = X

from matplotlib import pyplot as plt
import seaborn as sns

# X is now a NumPy array, so x2[:, i] selects the i-th feature column.
for i in range(len(features)):
    sns.distplot(x2[:, i])
    plt.xlabel(features[i])
    plt.show()

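## After standardization every feature has zero mean and unit variance, so all features share a common scale. Standardization is a linear rescaling, so the shape of each distribution, including its skewness, stays the same.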

## Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0,stratify=y)

In [None]:
## Flattening the array
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

In [None]:
print('Shape of X_train = ' + str(X_train.shape))
print('Shape of X_test = ' + str(X_test.shape))
print('Shape of y_train = ' + str(y_train.shape))
print('Shape of y_test = ' + str(y_test.shape))
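
A common alternative to scaling the full dataset first is to split first and fit the scaler on the training part only, so that no test-set statistics leak into preprocessing. The following is a sketch of that ordering on synthetic data from `make_classification` (the dataset and the names `X_toy`, `X_tr`, `X_te` are assumptions for illustration, not the notebook's variables).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-class problem with 9 features, mirroring the shape of the glass data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)

# Stratified split, same settings as above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
# Class proportions are preserved by stratify=...
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```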

## Applying Different Machine learning Models

### 1.KNN

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

Scores = []

# Test accuracy for k = 2, ..., 10 neighbours.
for i in range(2, 11):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    score = knn.score(X_test,y_test)
    Scores.append(score)

# Training accuracy of the last model fitted (k = 10), then the test accuracy for each k.
print(knn.score(X_train,y_train))
print(Scores)
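
Picking `k` by test accuracy reuses the test set for model selection. One hedged alternative is to choose `k` by cross-validation on the training data. The sketch below does this on synthetic data (the dataset, the pipeline and the name `cv_means` are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-class problem standing in for the training data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)

# Mean 5-fold cross-validated accuracy for k = 2, ..., 10.
cv_means = {}
for k in range(2, 11):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(model, X_toy, y_toy, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print({k: round(v, 3) for k, v in cv_means.items()})
print("best k:", best_k)
```

Putting the scaler inside the pipeline means it is refitted on each training fold, which keeps the validation folds out of the preprocessing.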

### 2.Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

Scores = []

for i in range(1):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    score = tree.score(X_test,y_test)
    Scores.append(score)

print(tree.score(X_train,y_train))
print(Scores)

### 3.Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

Scores = []

for i in range(1):
    logistic = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial',max_iter=100)
    logistic.fit(X_train, y_train)
    score = logistic.score(X_test,y_test)
    Scores.append(score)
    
print(logistic.score(X_train,y_train))
print(Scores)

### 4.SVC Classifier (Non-Linear Kernel)

In [None]:
from sklearn.svm import SVC

Scores = []

for i in range(1):
    svc = SVC(gamma='auto')
    svc.fit(X_train, y_train)
    score = svc.score(X_test,y_test)
    Scores.append(score)

print(svc.score(X_train,y_train))
print(Scores)

### 5.SVC Classifier (Linear Kernel)

In [None]:
from sklearn.svm import LinearSVC

Scores = []

for i in range(1):
    svc = LinearSVC(random_state=0)
    svc.fit(X_train, y_train)
    score = svc.score(X_test,y_test)
    Scores.append(score)

print(svc.score(X_train,y_train))
print(Scores)

### 6.Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

Scores = []
Range = [10,20,30,50,70,80,100,120]

for i in range(1):
    forest = RandomForestClassifier(criterion='gini', n_estimators=10, min_samples_leaf=1, min_samples_split=4, random_state=1,n_jobs=-1)
    #forest = RandomForestClassifier(n_estimators=i ,random_state=0)
    forest.fit(X_train, y_train)
    score = forest.score(X_test,y_test)
    #Scores.append(score)

print(forest.score(X_train,y_train))
print(score)
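
The `Range` list above is defined but not used by the loop. A sweep over it might look like the following sketch, which runs on synthetic data (the dataset and the split names are assumptions; the forest here uses default settings apart from `n_estimators` and `random_state`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical 3-class problem standing in for the glass data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

Range = [10, 20, 30, 50, 70, 80, 100, 120]
Scores = []

# One forest per candidate number of trees; record the held-out accuracy of each.
for n in Range:
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(X_tr, y_tr)
    Scores.append(forest.score(X_te, y_te))

for n, s in zip(Range, Scores):
    print(f"n_estimators={n:3d}  accuracy={s:.3f}")
```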

### 7.Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier

Scores = []

for i in range(1):
    NN = MLPClassifier(random_state=0)
    NN.fit(X_train, y_train)
    score = NN.score(X_test,y_test)
    Scores.append(score)

print(NN.score(X_train,y_train))
print(Scores)

### 8.Gradient Tree Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gd = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

gd.fit(X_train, y_train)
score = gd.score(X_test,y_test)

print(gd.score(X_train,y_train))
print(score)

# Summary

### Out of all the models tested above:

1. Random forest gives the best result with:  

    - training accuracy: 0.9793103448275862  
    - test accuracy: 0.7755102040816326

### However, the large gap between its training and test accuracy indicates overfitting, which suggests that the model might not perform well on unseen data. Hence, we choose the next best model:

2. SVM (Non-Linear Kernel)
    
    - training accuracy: 0.7586206896551724
    - test accuracy:  0.7551020408163265
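
The comparison above rests on the gap between training and test accuracy. A small helper table makes that gap explicit for each model. This is a sketch on synthetic data (the dataset, the model settings and the `summary` frame are assumptions for illustration, so the numbers will not match the ones reported above).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical 3-class problem standing in for the glass data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=1),
    "SVM (RBF kernel)": SVC(gamma="auto"),
}

# Fit each model and record train accuracy, test accuracy and their difference.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    rows.append({
        "model": name,
        "train_accuracy": train_acc,
        "test_accuracy": test_acc,
        "gap": train_acc - test_acc,
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(3))
```

A large positive `gap` is the usual sign of overfitting, while similar train and test accuracy suggests the model generalizes more consistently.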