# Assignment 12: Logistic Regression for Banknote Authentication

Scenario: Imagine you are working for a financial institution developing automated systems to detect counterfeit banknotes. Data has been collected from images of genuine and forged banknote-like specimens. Features were extracted from these images using Wavelet Transform tools, resulting in four continuous numerical measurements. The goal is to build a model that can predict whether a banknote is genuine or forged based on these image-derived features.

Dataset: Banknote Authentication Dataset

In [32]:
# Import the libraries and modules used throughout the notebook
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import warnings


## Task 1: Data Loading and Preparation

In [33]:
#executing rest of the starter code

# Suppress potential convergence warnings for cleaner output (optional)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

# --- Reference Date ---
# Assignment context date: Saturday, April 5, 2025 (as per environment context)
print("Assignment starter code executed. Context Date: April 5, 2025")

# --- Task 1: Data Loading and Preparation ---
print("\n--- Task 1: Data Loading and Preparation ---")

# URL for the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt'

# Define column names
column_names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# Load data, specifying no header and comma separator
try:
    df = pd.read_csv(url, header=None, names=column_names, sep=',')
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading data: {e}")
    # Re-raise so later cells do not run against a missing DataFrame
    # (calling exit() inside a notebook would shut down the kernel instead)
    raise

Assignment starter code executed. Context Date: April 5, 2025

--- Task 1: Data Loading and Preparation ---
Data loaded successfully.
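
The `header=None` and `names=` arguments used above can be checked without network access. The following is a minimal sketch, not part of the starter code: it parses the five rows that appear in the DataFrame head output from an in-memory string rather than downloading the file. The names `sample_text` and `sample_df` are illustrative.

```python
import io
import pandas as pd

# First five rows of the dataset, copied from the DataFrame head output.
sample_text = """3.62160,8.6661,-2.8073,-0.44699,0
4.54590,8.1674,-2.4586,-1.46210,0
3.86600,-2.6383,1.9242,0.10645,0
3.45660,9.5228,-4.0112,-3.59440,0
0.32924,-4.4552,4.5718,-0.98880,0
"""

column_names = ['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class']

# header=None tells pandas the first line is data, not column labels;
# names= supplies the labels instead.
sample_df = pd.read_csv(io.StringIO(sample_text), header=None, names=column_names)

print(sample_df.shape)
print(sample_df.dtypes)
```

Without `header=None`, pandas would treat the first data row as column labels and silently drop one observation.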


Checking Data

In [34]:
# Display basic info - Verify data loaded correctly
print("\nDataFrame Head:")
print(df.head())
print("\nDataFrame Info:")
df.info()
print("\nDataFrame Description:")
print(df.describe())


DataFrame Head:
   Variance  Skewness  Curtosis  Entropy  Class
0   3.62160    8.6661   -2.8073 -0.44699      0
1   4.54590    8.1674   -2.4586 -1.46210      0
2   3.86600   -2.6383    1.9242  0.10645      0
3   3.45660    9.5228   -4.0112 -3.59440      0
4   0.32924   -4.4552    4.5718 -0.98880      0

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Variance  1372 non-null   float64
 1   Skewness  1372 non-null   float64
 2   Curtosis  1372 non-null   float64
 3   Entropy   1372 non-null   float64
 4   Class     1372 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 53.7 KB

DataFrame Description:
          Variance     Skewness     Curtosis      Entropy        Class
count  1372.000000  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657     0.444606
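std       2.842763     5.869047     4.310030     2.101013     0.497103
min      -7.042100   -13.773100    -5.286100    -8.548200     0.000000
25%      -1.773000    -1.708200    -1.574975    -2.413450     0.000000
50%       0.496180     2.319650     0.616630    -0.586650     0.000000
75%       2.821475     6.814625     3.179250     0.394810     1.000000
max       6.824800    12.951600    17.927400     2.449500     1.000000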

Preparing Data:

In [35]:
# --- Prepare Features (X) and Target (Y) ---
x = df[['Variance', 'Skewness', 'Curtosis', 'Entropy']] #-- features
y = df['Class']                                         #-- target

print("\nShape of X:", x.shape)
print("Shape of Y:", y.shape)

#using pandas describe
print("\nDescribe X:\n", x.describe())
print("Describe Y:\n", y.describe())


Shape of X: (1372, 4)
Shape of Y: (1372,)

Describe X:
           Variance     Skewness     Curtosis      Entropy
count  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657
std       2.842763     5.869047     4.310030     2.101013
min      -7.042100   -13.773100    -5.286100    -8.548200
25%      -1.773000    -1.708200    -1.574975    -2.413450
50%       0.496180     2.319650     0.616630    -0.586650
75%       2.821475     6.814625     3.179250     0.394810
max       6.824800    12.951600    17.927400     2.449500
Describe Y:
 count    1372.000000
mean        0.444606
std         0.497103
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Class, dtype: float64


## Task 2: Train-Test Split 

In [36]:
print("\n--- Task 2: Train-Test Split ---")

x_train, x_test, y_train, y_test = train_test_split( #splitting data into train and test sets
    x, y,
    test_size=0.20,      # 20% for testing
    random_state=42,     # For reproducibility
    stratify=y     # Ensure class proportions are maintained in splits
)

print("Training set size:", x_train.shape[0])
print("Test set size:", x_test.shape[0])
print("\nForged vs. Genuine proportion in Train set:\n", y_train.value_counts(normalize=True))
print("\nForged vs. Genuine proportion in Test set:\n", y_test.value_counts(normalize=True))


--- Task 2: Train-Test Split ---
Training set size: 1097
Test set size: 275

Forged vs. Genuine proportion in Train set:
 Class
0    0.55515
1    0.44485
Name: proportion, dtype: float64

Forged vs. Genuine proportion in Test set:
 Class
0    0.556364
1    0.443636
Name: proportion, dtype: float64
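
To see what `stratify=y` buys, the split can be reproduced on stand-in labels. This is a sketch under stated assumptions: `y_demo` is a synthetic label vector with 762 zeros and 610 ones, chosen to match the 1372 rows and the class mean of 0.444606 reported earlier, and `X_demo` is a placeholder feature column. Neither is the real banknote data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in labels: 762 of class 0 and 610 of class 1 (1372 total),
# chosen to match the class mean of 0.444606 reported earlier.
y_demo = np.array([0] * 762 + [1] * 610)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Same split arguments as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("Train size:", len(y_tr), "Test size:", len(y_te))
print("Class 1 share overall: ", round(y_demo.mean(), 4))
print("Class 1 share in train:", round(y_tr.mean(), 4))
print("Class 1 share in test: ", round(y_te.mean(), 4))
```

With stratification, the class 1 share in both subsets stays close to the overall share, which keeps accuracy, precision, and recall on the test set comparable to what the full dataset would give.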


## Task 3: Model Training

In [37]:
print("\n--- Task 3: Model Training ---")
# Standardize features: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Create and train the Logistic Regression model
log_reg_model = LogisticRegression(random_state=42) # Using defaults for now
log_reg_model.fit(x_train, y_train) # Train ONLY on the training data

print("\nLogistic Regression Model trained successfully on banknote authentication data.")



--- Task 3: Model Training ---

Logistic Regression Model trained successfully on banknote authentication data.
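
One alternative way to organize the scaling and fitting steps is a scikit-learn `Pipeline`, which guarantees that the scaler statistics come from the training data only. The sketch below is an illustration, not the assignment's required approach, and it uses synthetic data from `make_classification` as a stand-in for the banknote features; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four numeric features (not the banknote data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training data only;
# predict() and score() reuse those statistics on new data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("Test accuracy on synthetic data:", pipe.score(X_te, y_te))
```

Bundling the two steps also means the same object can later be passed to cross-validation utilities without risk of fitting the scaler on held-out folds.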


## Task 4: Model Evaluation

In [38]:
print("\n--- Task 4: Model Evaluation ---")

# Make predictions on the test data using the trained model
y_pred_labels = log_reg_model.predict(x_test)       # Predicted class labels
y_pred_proba = log_reg_model.predict_proba(x_test)  # Predicted class probabilities

print("\n--- Predictions on Banknote Authentication Test Set ---")
print("First 5 Actual Labels:   ", y_test.iloc[:5].values)
print("First 5 Predicted Labels:", y_pred_labels[:5])
print("\nFirst 5 Predicted Probabilities (P(Y=0), P(Y=1)):")
print(y_pred_proba[:5].round(4))


# Calculate the accuracy score by comparing y_pred_labels to y_test
print("\n--- Calculating the Accuracy Score")
accuracy = accuracy_score(y_test, y_pred_labels)

# Print the accuracy score (formatted nicely)
print(f"Accuracy Score on Test Set: {accuracy:.2f} ({(accuracy*100):.2f}%)")

#Add a print statement interpreting the accuracy in context
print("\n--- Interpreting the Accuracy Score")
print(f"This means the model correctly predicted whether banknotes were forged or not for {accuracy*100:.2f}% of the banknotes in the unseen test data.")

# Calculate the precision and recall scores from the confusion matrix
print("\n--- Calculating the Precision and Recall Scores")
cm = confusion_matrix(y_test, y_pred_labels)
tn, fp, fn, tp = cm.ravel()  # class 1 (forged) is the positive class

# Named precision/recall to avoid shadowing sklearn.metrics.precision_score/recall_score
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"Precision Score: {precision:.2f}")
print(f"Recall Score: {recall:.2f}")

# Add a print statement interpreting these scores in context
print("\n--- Interpreting the Precision and Recall Scores")
print(f"- Precision ({precision:.2f}): When the model predicts a banknote is forged, it is correct {precision*100:.1f}% of the time.")
print(f"- Recall ({recall:.2f}): The model correctly identifies {recall*100:.1f}% of all banknotes which were actually forged.")


--- Task 4: Model Evaluation ---

--- Predictions on Banknote Authentication Test Set ---
First 5 Actual Labels:    [0 1 0 1 1]
First 5 Predicted Labels: [0 1 0 1 1]

First 5 Predicted Probabilities (P(Y=0), P(Y=1)):
[[9.996e-01 4.000e-04]
 [3.360e-02 9.664e-01]
 [7.556e-01 2.444e-01]
 [3.560e-02 9.644e-01]
 [7.000e-03 9.930e-01]]

--- Calculating the Accuracy Score
Accuracy Score on Test Set: 0.97 (97.09%)

--- Interpreting the Accuracy Score
This means the model correctly predicted whether banknotes were forged or not for 97.09% of the banknotes in the unseen test data.

--- Calculating the Precision and Recall Scores
Precision Score: 0.94
Recall Score: 1.00

--- Interpreting the Precision and Recall Scores
- Precision (0.94): When the model predicts a banknote is forged, it is correct 93.8% of the time.
- Recall (1.00): The model correctly identifies 100.0% of all banknotes which were actually forged.
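
The reported scores pin down the test-set confusion matrix. The counts used below are inferred from the output above rather than printed by the notebook: 275 test notes with a class 1 share of 0.443636 gives 122 forged and 153 genuine notes; 100.0% recall gives no false negatives; 97.09% accuracy gives 267 correct predictions, which leaves 8 false positives. The sketch rebuilds label arrays from those inferred counts and cross-checks the manual formulas against scikit-learn's `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts inferred from the reported scores (an assumption, see the text above)
tn, fp, fn, tp = 145, 8, 0, 122

# Rebuild label arrays consistent with those counts
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_hat = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Manual formulas computed from the confusion matrix
cm = confusion_matrix(y_true, y_hat)
tn_c, fp_c, fn_c, tp_c = cm.ravel()
manual_precision = tp_c / (tp_c + fp_c)
manual_recall = tp_c / (tp_c + fn_c)
accuracy = (tp_c + tn_c) / cm.sum()

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {manual_precision:.3f} (sklearn: {precision_score(y_true, y_hat):.3f})")
print(f"Recall:    {manual_recall:.3f} (sklearn: {recall_score(y_true, y_hat):.3f})")
```

Under these counts, every misclassification is a genuine note flagged as forged, which is the less costly error for a counterfeit-detection system.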


## Task 5: Coefficient Interpretation

In [39]:
print("\n--- Task 5: Coefficient Interpretation ---")

# Extract coefficients from the trained model
log_coefficients = log_reg_model.coef_[0]
feature_names = x.columns
coeffs_log_df = pd.DataFrame({'Feature': feature_names, 'Coefficient (Log-Odds)': log_coefficients})

# Print the coefficients for 'Skewness' and 'Entropy'
print("\n--- Logistic Regression Model Coefficients (Banknote Authentication Data) ---")
print("--- Coefficients for Skewness:")
print(coeffs_log_df[coeffs_log_df['Feature'] == 'Skewness'])
print("\n--- Coefficients for Entropy:")
print(coeffs_log_df[coeffs_log_df['Feature'] == 'Entropy'])




--- Task 5: Coefficient Interpretation ---

--- Logistic Regression Model Coefficients (Banknote Authentication Data) ---
--- Coefficients for Skewness:
    Feature  Coefficient (Log-Odds)
1  Skewness               -4.806524

--- Coefficients for Entropy:
   Feature  Coefficient (Log-Odds)
3  Entropy                0.257966
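
Log-odds coefficients are often easier to read as odds ratios, obtained by exponentiating them. Because the model was fit on standardized features, each ratio describes a one-standard-deviation increase in that feature with the other features held fixed. The sketch below uses the two coefficients printed above; the dictionary and variable names are illustrative.

```python
import numpy as np

# Coefficients copied from the output above (log-odds per standard deviation)
coefficients = {"Skewness": -4.806524, "Entropy": 0.257966}

# exp(coefficient) is the multiplicative change in the odds of class 1
odds_ratios = {name: float(np.exp(value)) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "decrease" if ratio < 1 else "increase"
    print(f"{name}: odds ratio = {ratio:.4f} ({direction} in odds of class 1)")
```

An odds ratio of roughly 0.008 for Skewness means a one-standard-deviation increase cuts the odds of class 1 to under 1% of their previous value, while a ratio of roughly 1.29 for Entropy raises them by about 29%.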


## Task 6: Conclusion

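The logistic regression model reached 97.09% accuracy, 94% precision, and 100% recall on the 275-note test set, treating class 1 as forged. Because recall is 100%, no forged note in the test set was missed; every error was a genuine note flagged as forged. The coefficients were estimated on standardized features, so each one describes the change in the log-odds of class 1 per one-standard-deviation increase in that feature, with the other features held fixed. Skewness has a strong negative coefficient (-4.81), which sharply lowers the odds of a forged prediction as Skewness rises, while Entropy has a small positive coefficient (0.26), which raises them slightly. These results come from a single train-test split with default hyperparameters; cross-validation or tuning the regularization strength would give a firmer estimate of how well the model generalizes.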

## Task 7: Code Quality

In [40]:
print("\n--- Task 7: Code Quality ---")
print("Ensure code is commented, uses meaningful variable names, and runs without errors.")


--- Task 7: Code Quality ---
Ensure code is commented, uses meaningful variable names, and runs without errors.
