# **Practical Lab-6 Dwarakanath Chandra (8856840)**

In [1]:
# Importing the libraries

import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

### **IRIS Data Import**

In [2]:
# Import the IRIS Dataset

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)

In [3]:
type(iris)

sklearn.utils._bunch.Bunch

In [4]:
list(iris)  # show the keys of the Bunch dictionary

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename',
 'data_module']

In [5]:
iris.keys()  # this is the same as list(iris)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [6]:
# print types of all the keys
for key in iris.keys():
    print(key, type(iris[key]))

data <class 'pandas.core.frame.DataFrame'>
target <class 'pandas.core.series.Series'>
frame <class 'pandas.core.frame.DataFrame'>
target_names <class 'numpy.ndarray'>
DESCR <class 'str'>
feature_names <class 'list'>
filename <class 'str'>
data_module <class 'str'>


In [7]:
iris.data.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [8]:
iris.target.head() 

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int32

In [9]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

### **Creating the Feature Matrix and Target Vector, and Splitting the Data for Training and Validation**

In [10]:
X = iris.data
y = iris.target_names[iris.target] == 'virginica'  # Boolean target: True for virginica, False for non-virginica
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)  # 75% training, 25% validation
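Because only one of the three species is treated as the positive class, it can help to check how the positive share differs between the training and validation subsets. The sketch below compares the plain split used above with a stratified split. The `stratify=y` argument is not used in this lab and is shown only as an alternative, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data
# Class index 2 corresponds to 'virginica' in iris.target_names
y = (iris.target == 2).to_numpy()

# Plain random split, same parameters as the cell above
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, train_size=0.75, random_state=42
)

# Stratified split (an assumption, not part of the lab): preserves the class ratio
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, train_size=0.75, random_state=42, stratify=y
)

print("Overall positive share:", y.mean())
print("Plain split (train, test):", y_train_plain.mean(), y_test_plain.mean())
print("Stratified split (train, test):", y_train_strat.mean(), y_test_strat.mean())
```

With a stratified split the positive share in both subsets stays close to one third, which makes a single validation score a little less dependent on the particular random split.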

### **Logistic Regression Model Fitting**

In [11]:
# Logistic Regression Model Building and Training

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

### **Predicting the Results for the Validation set**

In [12]:
# Predicting the results for X_test data

y_pred = log_reg.predict(X_test)

In [13]:
y_pred

array([False, False,  True, False, False, False, False,  True, False,
       False,  True, False, False, False, False, False,  True, False,
       False,  True, False,  True, False,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False,  True,
       False, False])

In [14]:
y_test

array([False, False,  True, False, False, False, False,  True, False,
       False,  True, False, False, False, False, False,  True, False,
       False,  True, False,  True, False,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False,  True,
       False, False])
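The hard `True`/`False` labels hide how confident the model is in each prediction. As a minimal sketch, assuming the same data preparation as the cells above, `predict_proba` can be used to inspect the estimated probability of the virginica class. The 0.3 to 0.7 band used to flag borderline samples is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

log_reg = LogisticRegression(random_state=42).fit(X_train, y_train)

# Column 1 holds the estimated probability of the positive class (True = virginica)
proba = log_reg.predict_proba(X_test)[:, 1]
y_pred = log_reg.predict(X_test)

# predict() corresponds to thresholding the positive-class probability at 0.5
print("Labels match 0.5 threshold:", np.array_equal(proba > 0.5, y_pred))

# Count samples whose probability falls in an (arbitrarily chosen) uncertain band
borderline = int(np.sum((proba > 0.3) & (proba < 0.7)))
print("Borderline predictions:", borderline)
```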

### **Performance Evaluation of the Logistic Regression model**

#### **Accuracy**

In [15]:
# Calculate the accuracy of the model

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

print(accuracy)

1.0


#### **Observation:** 
##### An accuracy of 1.0 means that all 38 validation samples were classified correctly; there are no false predictions. This does not imply that the model will classify every future sample correctly. The validation set is small and the target classes are imbalanced (only about one third of the samples are virginica), so a single train/validation split can give an optimistic estimate. This can be investigated further by cross-validating the same model over different splits, as shown below.
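One way to put an accuracy figure in context under class imbalance is to compare it with a trivial baseline that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` for this; the baseline is not part of the original lab and is added only as a reference point.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

# Baseline: always predict the most frequent class seen in training (non-virginica)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Logistic regression model, as fitted in the lab
log_reg = LogisticRegression(random_state=42).fit(X_train, y_train)
model_acc = log_reg.score(X_test, y_test)

print("Baseline accuracy:", baseline_acc)
print("Logistic regression accuracy:", model_acc)
```

A model is only informative to the extent that it beats this baseline, which is why accuracy alone can be misleading when one class dominates.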

#### **Performing Cross_validation to find accuracy**

In [16]:
# Perform Cross Validation with 5 splits

from sklearn.model_selection import cross_val_score, KFold

cross_val_scores = cross_val_score(log_reg, X_train, y_train, cv=KFold(n_splits=5, shuffle=True, random_state=42))

In [17]:
# Print the cross-validation scores
print("Cross-validation scores:", cross_val_scores)
print("Mean accuracy:", cross_val_scores.mean())
print("Standard deviation:", cross_val_scores.std())

Cross-validation scores: [0.95652174 1.         0.95454545 0.86363636 0.95454545]
Mean accuracy: 0.9458498023715416
Standard deviation: 0.044623788228946075


#### **Observation:**
#### With 5-fold cross-validation, the accuracy varies from fold to fold. The second fold reaches 100%, while the fourth drops to 86.4%. The mean accuracy over the five folds is about 94.6%, with a standard deviation of about 4.5%. Accuracy alone is not sufficient to evaluate a model, so the following sections also examine the confusion matrix, precision, and recall.
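Several metrics can be collected in a single cross-validation pass with `cross_validate`. The sketch below is one possible setup, not the method used in this lab: it swaps in `StratifiedKFold` (instead of the `KFold` used above) so that every fold contains a similar share of virginica samples, and it scores precision and recall for the positive class only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

# Stratified folds are an assumption here; the lab itself uses plain KFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall"]

results = cross_validate(
    LogisticRegression(random_state=42), X_train, y_train, cv=cv, scoring=metrics
)

# Each metric is stored under the key "test_<metric>" with one score per fold
for metric in metrics:
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```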

#### **Confusion Matrix**

#### **Creating the confusion matrix**

In [18]:
# Confusion matrix formation

from sklearn.metrics import confusion_matrix

conf_matr = confusion_matrix(y_test, y_pred)

conf_matr

array([[26,  0],
       [ 0, 12]], dtype=int64)

#### **Observation:**
#### The above confusion matrix has no False Positives and no False Negatives. This indicates that the model predictions are perfectly accurate. However, let's explore the same confusion matrix with K-fold cross validation prediction results.

In [19]:
# Creating confusion matrix with Cross_validation

from sklearn.model_selection import cross_val_predict

y_cross_pred = cross_val_predict(log_reg, X_test, y_test, cv=5)

cross_conf_matr = confusion_matrix(y_test, y_cross_pred)

print("Confusion Matrix:")
print(cross_conf_matr)

Confusion Matrix:
[[26  0]
 [ 1 11]]


In [20]:
tn, fp, fn, tp = confusion_matrix(y_test, y_cross_pred).ravel()
tn, fp, fn, tp

(26, 0, 1, 11)

#### **Observation:**
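#### With cross_val_predict applied to X_test and y_test, the model is refit on four folds of the validation data and predicts the held-out fold, so each prediction comes from a model trained on only about 30 samples rather than from the model fitted on X_train. Under this setup, one sample is misclassified. It is a False Negative: the actual class is virginica, but the logistic regression model predicts non-virginica.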
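The four confusion-matrix counts are enough to work out accuracy, precision, and recall by hand. The short sketch below plugs in the values reported above (`tn=26, fp=0, fn=1, tp=11`); the variable names are illustrative.

```python
# Counts taken from the cross-validated confusion matrix above
tn, fp, fn, tp = 26, 0, 1, 11

total = tn + fp + fn + tp

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / total

# Precision: share of predicted virginica samples that really are virginica
precision = tp / (tp + fp)

# Recall: share of actual virginica samples that the model found
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.4f}")   # 37/38
print(f"precision = {precision:.4f}")  # 11/11
print(f"recall    = {recall:.4f}")     # 11/12
```

The single false negative leaves precision at 1.0 but lowers recall to 11/12, which illustrates why the two metrics need to be read together.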

#### **Precision**

In [21]:
# Finding the precision score of the logistic regression model

from sklearn.metrics import precision_score

prec_score = precision_score(y_test, y_pred)

prec_score

1.0

#### **Observation:**
#### A precision of 1.0 indicates that all of the samples predicted as positive are True Positives. This is consistent with the confusion matrix. However, let's explore the same metric with cross-validated predictions.

In [22]:
# Performing cross_validation to find the precision score

from sklearn.metrics import make_scorer

# Define the precision scoring metric
precision_scorer = make_scorer(precision_score, average='weighted')  # average='weighted' averages the per-class precision of both classes, weighted by class support

# Perform cross-validation with precision scoring
cross_val_scores = cross_val_score(log_reg, X_train, y_train, cv=KFold(n_splits=5, shuffle=True, random_state=42), scoring=precision_scorer)

# Print the cross-validation scores
print("Cross-validation precision scores:", cross_val_scores)
print("Mean precision:", cross_val_scores.mean())
print("Standard deviation:", cross_val_scores.std())

Cross-validation precision scores: [0.96195652 1.         0.95738636 0.86893939 0.96103896]
Mean precision: 0.9498642480707697
Standard deviation: 0.043336140005007136


#### **Observation:**
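#### The precision scores vary from fold to fold because each fold trains and evaluates the model on a different subset of the training data. Note that these scores are weighted averages over both classes, so they are not directly comparable to the positive-class precision of 1.0 computed on the validation set above.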
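The choice of `average` changes what the precision score measures. The toy example below, with hand-made labels that are not taken from the iris data, contrasts the default binary precision with the weighted and macro averages.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: three negatives and one positive, with one false positive
y_true = [0, 0, 0, 1]
y_hat = [0, 0, 1, 1]

# Default (average='binary'): precision of the positive class only -> 1 / 2
binary = precision_score(y_true, y_hat)

# Per-class precision: class 0 -> 2/2 = 1.0, class 1 -> 1/2 = 0.5
# 'macro' takes the plain mean: (1.0 + 0.5) / 2
macro = precision_score(y_true, y_hat, average="macro")

# 'weighted' weights each class by its support (3 and 1): (3*1.0 + 1*0.5) / 4
weighted = precision_score(y_true, y_hat, average="weighted")

print("binary  :", binary)    # 0.5
print("macro   :", macro)     # 0.75
print("weighted:", weighted)  # 0.875
```

Because the majority class usually has high precision, the weighted average tends to sit well above the positive-class precision when classes are imbalanced.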


#### **Recall**

In [23]:
# Calculating the recall for the logistic regression model

from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred)

recall

1.0

#### **Observation:**

#### A recall of 1.0 means that the model correctly identified every actual virginica sample in the validation set. Let's explore the same metric with cross-validation.

In [24]:
# Calculating the recall score using cross-validation

# average='macro': recall is calculated for each class individually and the
# unweighted mean is taken, so each class counts equally regardless of imbalance.
scoring = make_scorer(recall_score, average='macro')

recall_scores = cross_val_score(log_reg, X_train, y_train, cv=5, scoring=scoring)

# Print the recall scores for each fold
print("Recall Scores:", recall_scores)

# Calculate the mean recall score
mean_recall_score = recall_scores.mean()
print("Mean Recall Score:", mean_recall_score)

Recall Scores: [1.         0.9375     0.9        0.96666667 0.9375    ]
Mean Recall Score: 0.9483333333333335


#### **Observation:**
#### The recall scores vary from fold to fold because each fold evaluates the logistic regression model on a different subset of the training data.
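Part of the fold-to-fold variation depends on how the folds are built. Passing `cv=5` for a classifier makes scikit-learn use stratified folds, whereas the earlier cells pass an explicit shuffled `KFold`. The sketch below counts the virginica samples that land in each validation fold under both schemes; `positives_per_fold` is a hypothetical helper written for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target_names[iris.target] == 'virginica'
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)

def positives_per_fold(splitter):
    """Return the number of positive samples in each validation fold."""
    return [
        int(y_train[val_idx].sum())
        for _, val_idx in splitter.split(X_train, y_train)
    ]

# Shuffled KFold, as used for the accuracy and precision cross-validation
kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))

# StratifiedKFold, which is what cv=5 resolves to for a classifier
strat_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per fold:          ", kfold_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

Stratified folds keep the number of positives per fold nearly constant, so scores from the two setups are not strictly comparable.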