# Logistic Regression to classify glass types
First, we import the necessary packages and load the data.

In [72]:
# Glass families (values of the 'Family of glass' column):
# 1: Window glass
# 2: Kitchenware glass
# 3: Smartphone cover glass
# 4: OLED display substrate glass
# 5: Bioactive glass
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('Glasses.csv')
# Keep only the six oxide columns and the label; the last 9 columns are artifacts of the CSV file
data = data.loc[:, "SiO2":"Family of glass"]

print(data.head())

   SiO2      B2O3      P2O5     Al2O3       Na2O        CaO  Family of glass
0   0.0  0.516284  3.654946  3.240924  37.375325  55.212521                5
1   0.0  0.138838  0.000000  2.563676  16.432675  80.864810                5
2   0.0  0.813877  3.288834  4.412898  43.806742  47.677650                5
3   0.0  2.010510  1.302941  1.023996  40.145047  55.517506                5
4   0.0  1.321328  3.072029  0.491069  26.582426  68.533148                5


### Splitting the data
We split the data into training and test sets. k-fold cross-validation would give a more robust accuracy estimate, but I chose to focus my efforts on the clustering part of this homework, so a single split is used here. It is sufficient for a first estimate of the model's accuracy.

In [73]:
def splitData(data, training_frac):
    train = data.sample(frac=training_frac, random_state=1).sort_index()
    test = data.drop(train.index)
    return train, test
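
For comparison, the k-fold alternative mentioned above can be sketched as follows. This is a minimal sketch, not what was run in this notebook: it uses synthetic stand-in data (an assumption, so that the block runs without `Glasses.csv`) and scikit-learn's default L2-regularized `LogisticRegression` rather than the unregularized model fitted below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the glass data (assumption): 6 features, 5 classes
rng = np.random.default_rng(1)
n_per_class = 40
centers = rng.uniform(0, 10, size=(5, 6))
X = np.vstack([c + rng.normal(scale=1.0, size=(n_per_class, 6)) for c in centers])
y = np.repeat(np.arange(1, 6), n_per_class)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
clf = LogisticRegression(max_iter=1000)

# One accuracy value per fold; each sample is used for testing exactly once
scores = cross_val_score(clf, X, y, cv=cv)

print('Fold accuracies: {}'.format(np.round(scores, 2)))
print('Mean accuracy: {:.2f} +/- {:.2f}'.format(scores.mean(), scores.std()))
```

The spread of the fold accuracies gives a sense of how much a single-split estimate can vary with the choice of test set.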

### Training and evaluating the logistic regressor
Here I fit only a plain logistic regression without regularization. A more advanced model could use engineered features or a neural network, but again, I chose to focus my time on the second task.

In [74]:
# 80/20 split into training and test sets
train, test = splitData(data, 0.8)

# Features are the six oxide columns; the target is the glass family label
train_X = train.loc[:, 'SiO2':'CaO'].to_numpy()
train_Y = train['Family of glass'].to_numpy()
test_X = test.loc[:, 'SiO2':'CaO'].to_numpy()
test_Y = test['Family of glass'].to_numpy()

# Unregularized logistic regression. scikit-learn >= 1.2 expects penalty=None;
# older versions use the string penalty='none' instead.
clf = LogisticRegression(penalty=None)

clf.fit(train_X, train_Y)

# Accuracy on the held-out test set
print('The accuracy of the model is {}'.format(clf.score(test_X, test_Y)))

# Three unseen compositions; columns are SiO2, B2O3, P2O5, Al2O3, Na2O, CaO
predict_unseen = clf.predict([[70, 0, 0, 0, 20, 10], [70, 0, 0, 15, 0, 15], [70, 0, 0, 15, 15, 0]])
print('The predictions for the unseen glasses are as follows: {}'.format(predict_unseen))

The accuracy of the model is 0.76
The predictions for the unseen glasses are as follows: [1 4 3]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
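
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both combined is shown here. It runs on synthetic stand-in data (an assumption, not the glass data set) and uses the default L2 penalty, so its accuracy says nothing about the result reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 features on a 0-80 scale, 5 classes
rng = np.random.default_rng(1)
centers = rng.uniform(0, 80, size=(5, 6))
X = np.vstack([c + rng.normal(scale=5.0, size=(40, 6)) for c in centers])
y = np.repeat(np.arange(1, 6), 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# The scaler is fitted on the training data only and then reused for the
# test data, because both steps live in the same pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print('Test accuracy: {:.2f}'.format(accuracy))
```

Standardized features usually let the lbfgs solver converge in far fewer iterations, so the higher `max_iter` mostly acts as a safety margin.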


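As can be seen, the model reaches an accuracy of 76% on the test set. It also classifies the three unseen glasses successfully, assigning them to families 1 (window glass), 4 (OLED display substrate glass), and 3 (smartphone cover glass). Note that the solver stopped at its iteration limit, so the fit may not have fully converged. The accuracy could perhaps be improved with a more complex model, such as one with engineered features or a neural network.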
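
A single accuracy figure does not show which families get confused with each other. One way to look at this is a confusion matrix, sketched below on synthetic stand-in data (an assumption, not the glass data set) with scikit-learn's default `LogisticRegression` settings apart from `max_iter`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 5 classes labelled 1..5
rng = np.random.default_rng(1)
centers = rng.uniform(0, 80, size=(5, 6))
X = np.vstack([c + rng.normal(scale=5.0, size=(40, 6)) for c in centers])
y = np.repeat(np.arange(1, 6), 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Rows are the true families, columns the predicted families
labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_test, clf.predict(X_test), labels=labels)
print(cm)
```

Off-diagonal entries count the misclassified samples, so a row with large off-diagonal values points to a family the model struggles with.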