# Discussion 5
## This is just a demonstration how to solve the homework using the iris.csv that we previously used in Discussion 2

## Exercise 1 : Building a Feed-Forward Neural Network(50 points)

<img src="./ffnn.JPG"/>
Multiple input attributes go through a hidden layer to the output. When training the model, the hidden layer is trained so that the produced output get as close to the actual output as possible

### Exercise 1.1 : Data Preprocessing (10 points)

- As the classes are categorical, use one-hot encoding to represent the set of classes. You will find this useful when developing the output layer of the neural network.
- Normalize each field of the input data using the min-max normalization technique.

### Exercise 1.2 : Training and Testing the Neural Network (40 points)

Design a 4-layer artificial neural network, specifically a feed-forward multi-layer perceptron (using the sigmoid activation function), to classify the type of 'Dry Bean' given the other attributes in the data set, similar to the one mentioned in the paper above. Please note that this is a multi-class classification problem so select the right number of nodes accordingly for the output layer.

For training and testing the model, split the data into training and testing set by __80:20__ and use the training set for training the model and the test set to evaluate the model performance.

Consider the following hyperparameters while developing your model :

- Number of nodes in each hidden layer should be (10, 2)
- Learning rate should be 0.4
- Number of epochs should be 600
- The sigmoid function should be used as the activation function in each layer
- Stochastic Gradient Descent should be used to minimize the error rate

__Requirements once the model has been trained :__

- A confusion matrix for all classes, specifying the true positive, true negative, false positive, and false negative cases for each category in the class
- The accuracy and mean squared error (MSE) of the model
- The precision and recall for each label in the class

__Notes :__

- Splitting of the dataset should be done __after__ the data preprocessing step.
- The mean squared error (MSE) values obtained __should be positive__.

<!-- ## Part 1 (50 points)

Design a 4-layer artificial neural network, specifically a feed-forward multi-layer perceptron (using the sigmoid activation function), to classify the type of 'Dry Bean' given the other attributes in the data set, similar to the one mentioned in the paper above. For this, split the data into training and testing set by 90:10 and use the training set for training the model and the test set to evaluate the model performance. Please note that this is a multi-class classification problem so select the right number of nodes accordingly for the output layer.

Consider the following hyperparameters :

- Number of nodes in each hidden layer should be (12, 3).
- Learning rate should be 0.3
- Number of epochs should be 500
- The sigmoid function should be used as the activation function in each layer
- Stochastic Gradient Descent should be used to minimize the error rate

Once the model has been trained, test the model and obtain the following :

- The confusion matrix for each class, specifying the true positive, true negative, false positive, and false negative cases for each category in the class
- The accuracy and mean squared error (MSE) of the model
- The precision and recall for each label in the class

 -->

In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv("iris.csv")

print("Dataset :")
print(dataset.head())
print("Species : ")
print(dataset['species'].unique())

print("Dimensions of the dataset : ", dataset.shape)
print("Features of the dataset :")
print(dataset.describe(include = 'all'))


Dataset :
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Species : 
['setosa' 'versicolor' 'virginica']
Dimensions of the dataset :  (150, 5)
Features of the dataset :
        sepal_length  sepal_width  petal_length  petal_width species
count     150.000000   150.000000    150.000000   150.000000     150
unique           NaN          NaN           NaN          NaN       3
top              NaN          NaN           NaN          NaN  setosa
freq             NaN          NaN           NaN          NaN      50
mean        5.843333     3.054000      3.758667     1.198667     NaN
std         0.828066     0.433594      1.764420     0.763161     NaN
min         

In [3]:
dataset['species']

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

X = dataset.drop('species', axis = 1)
y = dataset['species']

# normalize data
scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X)
X = pd.DataFrame(data = X_rescaled, columns = X.columns)

set_of_classes = y.value_counts().index.tolist()
set_of_classes= pd.DataFrame({'species': set_of_classes})
y = pd.get_dummies(y)

print("Pre-processed data :")
print(X)

print("Pre-processed class :")
print(y)

#splitting data into ratio 80:20
data_train, data_test, class_train, class_test = train_test_split(X, y, test_size=0.2)

# Number of nodes in each hidden layer should be (10, 2)
# Learning rate should be 0.4
# Number of epochs should be 600
mlp = MLPClassifier(solver = 'sgd', random_state = 42, activation = 'logistic', learning_rate_init = 0.4, batch_size = 100, hidden_layer_sizes = (10, 2), max_iter = 600)
mlp

Pre-processed data :
     sepal_length  sepal_width  petal_length  petal_width
0        0.222222     0.625000      0.067797     0.041667
1        0.166667     0.416667      0.067797     0.041667
2        0.111111     0.500000      0.050847     0.041667
3        0.083333     0.458333      0.084746     0.041667
4        0.194444     0.666667      0.067797     0.041667
..            ...          ...           ...          ...
145      0.666667     0.416667      0.711864     0.916667
146      0.555556     0.208333      0.677966     0.750000
147      0.611111     0.416667      0.711864     0.791667
148      0.527778     0.583333      0.745763     0.916667
149      0.444444     0.416667      0.694915     0.708333

[150 rows x 4 columns]
Pre-processed class :
     setosa  versicolor  virginica
0         1           0          0
1         1           0          0
2         1           0          0
3         1           0          0
4         1           0          0
..      ...         ...    

In [5]:
y = dataset['species']
y = pd.get_dummies(y)

In [6]:
mlp.fit(data_train, class_train)

pred = mlp.predict(data_test)
pred
#prediction on the test data. species are represented using the hot-keys

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]])

In [7]:
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, recall_score

print("Accuracy : ", accuracy_score(class_test, pred))
print("Mean Square Error : ", mean_squared_error(class_test, pred))

print(pred[:5])
print("Confusion Matrix for each label : ")
print(multilabel_confusion_matrix(class_test, pred))

print("Classification Report : ")
print(classification_report(class_test, pred))

Accuracy :  0.9333333333333333
Mean Square Error :  0.044444444444444446
[[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]]
Confusion Matrix for each label : 
[[[21  0]
  [ 0  9]]

 [[15  0]
  [ 2 13]]

 [[22  2]
  [ 0  6]]]
Classification Report : 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.87      0.93        15
           2       0.75      1.00      0.86         6

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.96      0.93        30
weighted avg       0.95      0.93      0.94        30
 samples avg       0.93      0.93      0.93        30



#### Confusion Matrix
<img src="./matrix.png" style="width:200px;height:200px"/> 

## Exercise 2 : k-fold Cross Validation (10 points)

In order to avoid using biased models, use 8-fold cross validation to generalize the model based on the given data set.

__Requirements :__
- The accuracy and MSE values during each iteration of the cross validation
- The overall average accuracy and MSE value

__Note :__ The mean squared error (MSE) values obtained should be positive.

<img src="./k_fold.jpg" style="width:300px;height:400px"/> 

In [8]:
# Using sklearn function cross_validate()

from sklearn.model_selection import cross_validate

CV = cross_validate(mlp, X, y, cv=8, scoring=['accuracy', 'neg_mean_squared_error'])
print('Accuracy')
print(CV['test_accuracy'])
print('MSE')
print(-1*CV['test_neg_mean_squared_error'])

Accuracy
[1.         1.         1.         0.84210526 0.94736842 1.
 1.         0.83333333]
MSE
[0.         0.         0.         0.10526316 0.03508772 0.
 0.         0.11111111]


In [9]:
print('Average Accuracy = ', sum(CV['test_accuracy']) / len(CV['test_accuracy']))
print('Average MSE = ', sum(-1 * CV['test_neg_mean_squared_error']) / len(CV['test_neg_mean_squared_error']))

Average Accuracy =  0.9528508771929824
Average MSE =  0.031432748538011694


In [10]:
# To find list of accuracy and MSE values
# Without using the sklearn function cross_validate()

from sklearn.model_selection import KFold

n_splits=8
# step 1: randomize the dataset and create k equal size partitions
kf = KFold(n_splits=n_splits)

acc = 0
mse = 0

i = 0 #keep track of batch number
# step 5: iterate k times with a different testing subset
for train_indices, test_indices in kf.split(X):

    # step 2-3: use k-1/k^th partition for the training/testing model
    start_train, stop_train = train_indices[0], train_indices[-1]+1
    start_test, stop_test = test_indices[0], test_indices[-1]+1
    
    # perform the training similar to Q1
    #this was based on the requirements in Q1
    mlp = MLPClassifier(solver = 'sgd', random_state = 42, activation = 'logistic', learning_rate_init = 0.4, batch_size = 100, hidden_layer_sizes = (10, 2), max_iter = 600)
    mlp.fit(X[start_train:stop_train], y[start_train:stop_train])
    pred = mlp.predict(X[start_test:stop_test])
    
    # step 4: record the evaluating scores
    i+=1
    acc += accuracy_score(y[start_test:stop_test], pred)
    mse += mean_squared_error(y[start_test:stop_test], pred)
    
    print("\nAccuracy for batch ", i, " : ", accuracy_score(y[start_test:stop_test], pred))
    print("Mean Square Error for batch ", i, " : ", mean_squared_error(y[start_test:stop_test], pred))

# step 6: find the average and select the batch with highest evaluation scores
print('\nAverage Accuracy = ', acc / n_splits)
print('Average MSE = ', mse / n_splits)


Accuracy for batch  1  :  1.0
Mean Square Error for batch  1  :  0.0

Accuracy for batch  2  :  1.0
Mean Square Error for batch  2  :  0.0

Accuracy for batch  3  :  1.0
Mean Square Error for batch  3  :  0.0

Accuracy for batch  4  :  1.0
Mean Square Error for batch  4  :  0.0

Accuracy for batch  5  :  0.9473684210526315
Mean Square Error for batch  5  :  0.03508771929824561

Accuracy for batch  6  :  1.0
Mean Square Error for batch  6  :  0.0

Accuracy for batch  7  :  1.0
Mean Square Error for batch  7  :  0.0

Accuracy for batch  8  :  0.8333333333333334
Mean Square Error for batch  8  :  0.1111111111111111

Average Accuracy =  0.9725877192982456
Average MSE =  0.01827485380116959


## Exercise 3 - Logistic Regression (20 points in total)
Recall the dataset from last week homework (Discussion use iris.csv)

Now we are going to build a classification model on ``species`` using all the other 4 attributes. <br >
Note that Logistic Regression is a binary classificaiton algorithm.

### Exercise 3.1 - Processing and Splitting the Dataset (5 points)
In this exercise 3, we only consider those species of "versicolor" or "virginica". <br >
So please **remove** those species that belong to "setosa". <br >
And then, split the data into training and testing set with the ratio of 70:30. <br >

In [11]:
df = pd.read_csv('./iris - Copy.csv')

data = df.copy().loc[(df['species'] != 'setosa'), :]
train, test = train_test_split(data, test_size=0.3, random_state=21)
# The remaining
X_train, y_train = train.drop(columns=['species']) ,train['species']
X_test, y_test = test.drop(columns=['species']), test['species']

### Exercise 3.2 - Logistic Regression (15 points)

Using all the other 4 attributes, please build a Logistic Regression model that distinguishes between flowers in versicolor versus virginica. <br >

Requirements
 - Report the testing precision and recall for both species.

In [12]:
from sklearn.linear_model import LogisticRegression

cls = LogisticRegression()
cls.fit(X_train, y_train)
print(classification_report(y_test, cls.predict(X_test)))

              precision    recall  f1-score   support

  versicolor       1.00      0.90      0.95        21
   virginica       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.91      0.95      0.93        30
weighted avg       0.95      0.93      0.94        30



**precision**: proportion of TP to the total number of positive predictions (TP+FP) <br>
**recall**: true positive rate, which is the proportion of true positive (TP) 
predictions to the total number of actual positive instances (TP+FN)

## Exercise 4 - Polynomial Regression (20 points in total)
Check out the notebook file that Professor went over in class. 