# Henry Ezeanowi
# 8900446
# Lab 6

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_val_predict
from sklearn.datasets import load_iris
import numpy as np


In [63]:
# Load the Iris dataset (load_iris returns a dict-like Bunch object, not a DataFrame)
df = load_iris()

In [64]:
df.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [65]:
df.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Feature matrix and splitting the data into training and test sets

In [66]:
X = df.data  # feature matrix: 150 samples x 4 features

# Binary target: True for virginica, False for setosa or versicolor
y = df.target_names[df.target] == 'virginica'

# Hold out 25% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)
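
Because the target merges two species into one class, it can help to check the class balance and the split sizes. The sketch below repeats the split from the cell above and, as an assumed variant that is not part of this lab, also passes `stratify=y` so both subsets keep a similar share of virginica samples. The names `iris`, `Xs_train`, `Xs_test`, `ys_train` and `ys_test` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target_names[iris.target] == 'virginica'

# Same split as in the lab: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

# Assumed variant (not used in the lab): stratify=y keeps the share of
# virginica samples roughly equal in the training and test subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, train_size=0.75, random_state=42, stratify=y
)

print("All samples:", len(y), "of which virginica:", int(y.sum()))
print("Train / test sizes:", len(y_train), "/", len(y_test))
print("Virginica share in plain test set:     ", y_test.mean())
print("Virginica share in stratified test set:", ys_test.mean())
```

With 150 samples and `train_size=0.75`, the split yields 112 training samples and 38 test samples.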


Training and Building a Logistic Regression Model

In [67]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

Predicting the results

In [68]:
y_pred = model.predict(X_test)
y_pred

array([False, False,  True, False, False, False, False,  True, False,
       False,  True, False, False, False, False, False,  True, False,
       False,  True, False,  True, False,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False,  True,
       False, False])

Accuracy of the model

In [69]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

1.0

There are no incorrect predictions on the test set: the model classifies all 38 test samples correctly.

In [70]:
# Perform 5-fold cross-validation on the training set to estimate accuracy

cross_val_scores = cross_val_score(model, X_train, y_train, cv=KFold(n_splits=5, shuffle=True, random_state=42))

# Print the cross-validation scores
print("Cross Validation scores:", cross_val_scores)
print("Mean Accuracy:", cross_val_scores.mean())
print("Standard Deviation:", cross_val_scores.std())

Cross Validation scores: [0.95652174 1.         0.95454545 0.86363636 0.95454545]
Mean Accuracy: 0.9458498023715416
Standard Deviation: 0.044623788228946075


Accuracy varies across the folds: the 4th fold had the lowest accuracy (about 0.864) and the 2nd fold had the highest (1.0).
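
The fold scores above can be turned back into misclassification counts, which makes the spread easier to read. This is a small arithmetic sketch, assuming the 112 training samples are divided the way `KFold` divides them (the first `n_samples % n_splits` folds receive one extra sample). The variable names are illustrative.

```python
from statistics import mean, pstdev

n_samples, n_splits = 112, 5

# KFold gives the first (n_samples % n_splits) folds one extra sample
fold_sizes = [
    n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
    for i in range(n_splits)
]

# Accuracy per fold, copied from the output above
scores = [0.95652174, 1.0, 0.95454545, 0.86363636, 0.95454545]

# Recover the number of correct and incorrect predictions in each fold
correct = [round(s * n) for s, n in zip(scores, fold_sizes)]
errors = [n - c for n, c in zip(fold_sizes, correct)]

print("Fold sizes:", fold_sizes)   # [23, 23, 22, 22, 22]
print("Errors per fold:", errors)  # [1, 0, 1, 3, 1]
print("Mean:", round(mean(scores), 4), "Std:", round(pstdev(scores), 4))
```

The lowest score, in the 4th fold, corresponds to 3 misclassified samples out of 22, so a difference of one or two samples moves a fold's accuracy by several percentage points.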

Precision

In [71]:
precision = precision_score(y_test, y_pred)
precision

1.0

Every sample the model predicted as positive (virginica) is a true positive, so there are no false positives.

Recall

In [72]:
recall = recall_score(y_test, y_pred)
recall

1.0

The model correctly identified every actual positive (virginica) sample in the test set, so there are no false negatives.

Calculating Confusion Matrix

In [73]:
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat

array([[26,  0],
       [ 0, 12]], dtype=int64)

There are no false positives or false negatives: all 26 non-virginica and all 12 virginica test samples are classified correctly.
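
The accuracy, precision and recall reported earlier all follow from the four counts in this matrix. The sketch below recomputes them by hand, assuming scikit-learn's layout in which rows are actual classes and columns are predicted classes, ordered `[False, True]`.

```python
# Confusion matrix as printed above: rows = actual, columns = predicted,
# both ordered [False, True]
conf_mat = [[26, 0],
            [0, 12]]

(tn, fp), (fn, tp) = conf_mat

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(accuracy, precision, recall)  # 1.0 1.0 1.0
```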

In [74]:
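# Confusion matrix with 5-fold cross-validation
# Note: this cross-validates within the 38-sample test set, so each prediction
# comes from a copy of the model refit on the other test folds, not from the model trained above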

y_cross_pred = cross_val_predict(model, X_test, y_test, cv=5)

cross_conf_mat = confusion_matrix(y_test, y_cross_pred)

cross_conf_mat

array([[26,  0],
       [ 1, 11]], dtype=int64)

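With cross-validation on the test set, one sample is misclassified: a virginica sample predicted as non-virginica (a false negative). Recall drops to 11/12 while precision stays at 1.0.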
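
The cell above passes `X_test` and `y_test` to `cross_val_predict`, so every fold trains on only about 30 samples. A more common setup, sketched here as an assumption rather than as this lab's method, cross-validates on the training set and leaves the test set for the final check. The names `iris`, `y_train_cv_pred` and `train_cv_conf_mat` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, train_test_split

iris = load_iris()
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

model = LogisticRegression(random_state=42)

# Each training sample is predicted by a model fitted on the other 4 folds
y_train_cv_pred = cross_val_predict(model, X_train, y_train, cv=5)

# Rows = actual class, columns = predicted class, ordered [False, True]
train_cv_conf_mat = confusion_matrix(y_train, y_train_cv_pred)
print(train_cv_conf_mat)
```

The resulting matrix covers all 112 training samples, so its counts are less sensitive to a single misclassification than a matrix built from 38 samples.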

Insights about the model:


1. Test-set accuracy indicates how well the logistic regression model performs on held-out data. A higher score indicates a better-performing model, but it should be interpreted in the context of the problem and the dataset, here a test set of only 38 samples.

2. Perfect accuracy is a possible warning sign that the model has adapted too closely to this particular data. With only 38 test samples, it may also simply reflect an easy split.

3. The model may have effectively memorized the training data, including its anomalies or noise, although the test-set score alone cannot confirm this.

4. Combining 'setosa' and 'versicolor' into a single 'non-virginica' class can affect the model's performance. Virginica accounts for 50 of the 150 samples, so the classes are imbalanced 1:2 and the model may be biased towards the 'non-virginica' class.

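5. The model may not generalize as well as the test score suggests. The mean cross-validation accuracy of about 0.946 is lower than the perfect test accuracy, so the 1.0 figure should not be trusted on its own as an estimate of performance on unseen data.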
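
The overfitting concern raised in these insights can be checked directly by comparing accuracy on the training set with accuracy on the test set. The sketch below rebuilds the same split and model, then uses `model.score`, which returns mean accuracy for classifiers. The names `iris`, `train_acc` and `test_acc` are illustrative, and no output is shown because the training accuracy was not reported in this lab.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

model = LogisticRegression(random_state=42).fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not seen
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print("Train accuracy:", train_acc)
print("Test accuracy: ", test_acc)
print("Gap (train - test):", train_acc - test_acc)
```

A training accuracy well above the test accuracy would point to overfitting. A test accuracy at or above the training accuracy points instead to a small or easy test split.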