# <font face = 'Impact' color = '#FFAEBC' > Sample Demonstration on Machine Learning for Classification<font/>
#### <font face = 'Times New Roman' color = '#B5E5CF'> License: GPL v3.0<font/>
#### <font face = 'Times New Roman' color = '#B5E5CF'> Author and Trainer: Paolo Hilado MSc. (Data Science)<font/>
This notebook provides a backgrounder on doing Machine Learning in Python, employing models such as K-Nearest Neighbors, Decision Trees, Logistic Regression, Support Vector Machine, and Random Forest Classifier.

# <font face = 'Palatino Linotype' color = '#5885AF'> Business Understanding:<font/>

Every business that invoices customers faces one common challenge:
- ✅ Ensuring invoices are paid on time.
- ⚠️ Late payments can disrupt cash flow, increase administrative costs, and affect the company’s ability to operate efficiently.

"We Provide All", a mid-sized B2B service provider, issues hundreds of invoices monthly to clients across various industries. Some clients pay promptly, while others consistently delay payments. The Operations Department want to anticipate payment delays so they can take preventive action such as early reminders, offering discounts, or adjusting payment terms.

The QUESTION relevant to the business problem is simple but impactful:
## “Can we predict whether a customer will pay their next invoice on time?”

By identifying invoices at risk of late payment, the company can:
- 🎯 Improve cash flow forecasting
- 🎯 Prioritize follow-up actions for at-risk customers
- 🎯 Offer proactive incentives or flexible payment plans
- 🎯 Reduce overdue receivables and collection costs

This transforms a reactive collections process into a data-driven, proactive operation.

In [None]:
# Load the necessary packages
import pandas as pd

In [None]:
# Load the dataset
df = pd.read_csv("timely_pay3.csv")
# Check out the first few records
df.head()

In [None]:
# Check the information about the data frame.
df.info()

# <font face = 'Palatino Linotype' color = '#5885AF'> Data Understanding:<font/>
Collaboration with the Operations Department paved the way to a solid understanding of the explanatory variables and the target variable relevant to the business problem. They are documented as follows:

| Feature                  | Description                                                                    | Why It Matters                                                                                                |
| ------------------------ | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------- |
| **credit_score**         | A numeric score representing a customer’s creditworthiness (300–850).          | Higher credit scores indicate lower risk and better likelihood of on-time payments.                           |
| **debt_to_income_ratio** | Ratio of total debt to total income, typically between 0.1–0.6 in the dataset. | Lower ratios suggest the customer has manageable debt relative to income, making timely payments more likely. |
| **num_late_payments**    | Number of past late payments recorded (0–2 in synthetic dataset).              | Past payment behavior is a strong predictor; more late payments increase the risk of delays.                  |
| **on_time_payment**      | Target variable: 1 if payment was on time, 0 if late.                          | The outcome we are trying to predict; understanding its drivers is key for modeling risk.                     |
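
The ranges documented in the table can be turned into a quick sanity check before modelling. The sketch below is illustrative only: the `sample` rows and the `out_of_range_counts` helper are hypothetical, and the debt-to-income bounds are described above as typical rather than hard limits, so a violation there is a prompt to inspect the row, not proof of an error.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook, `df` would be passed instead.
sample = pd.DataFrame({
    "credit_score": [720, 580, 910],
    "debt_to_income_ratio": [0.25, 0.55, 0.40],
    "num_late_payments": [0, 2, 1],
    "on_time_payment": [1, 0, 1],
})

# Expected ranges copied from the data dictionary above.
expected_ranges = {
    "credit_score": (300, 850),
    "debt_to_income_ratio": (0.1, 0.6),
    "num_late_payments": (0, 2),
    "on_time_payment": (0, 1),
}


def out_of_range_counts(frame, ranges):
    """Count, per column, how many values fall outside the documented range."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }


violations = out_of_range_counts(sample, expected_ranges)
print(violations)
```

Here only the made-up credit score of 910 is flagged, since it exceeds the documented maximum of 850.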


In [None]:
# Checking for dataset balance or imbalance
import qdesc as qd
data = pd.DataFrame()
data['col'] = df['on_time_payment'].astype('category')
qd.freqdist(data, "col")
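
If `qdesc` is not installed, a similar balance check can be sketched with plain pandas. The `target` values below are made up for illustration; in the notebook, `df['on_time_payment']` would take their place.

```python
import pandas as pd

# Hypothetical target values standing in for df['on_time_payment'].
target = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="on_time_payment")

# Absolute counts and relative shares of each class, side by side.
balance = pd.DataFrame({
    "count": target.value_counts(),
    "share": target.value_counts(normalize=True).round(3),
})
print(balance)
```

A share far from 0.5 for either class is a signal to look at precision and recall alongside accuracy.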

# <font face = 'Palatino Linotype' color = '#5885AF'> Data Preparation:<font/>
This presents the recipe for data exploration and transformation in preparation for modelling.

In [None]:
from sklearn.model_selection import train_test_split 
# Split the dataset into train and test sets.
# With 3 explanatory variables, the N > 50 + 8m rule of thumb calls for more
# than 74 observations to train a regression model (Tabachnick and Fidell,
# 2013). The 80-20 split will be used for this project.
train, test = train_test_split(df, test_size=0.20, random_state=42)
print(f'''The number of records for the train set is {len(train)}.
The number of records for the test set is {len(test)}.''')
# Source: Tabachnick, B.G.,Fidell, L.S., 2013. Using Multivariate Statistics, 
#         6th ed. Pearson Education, Inc., Boston. 
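
The sample-size rule of thumb cited above, N > 50 + 8m for m predictors, can be sketched as a small helper. The same block also shows an optional variation on the split: passing `stratify` keeps the class proportions equal in both sets. The `demo` frame and the `min_training_size` helper are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def min_training_size(n_predictors: int) -> int:
    """Rule of thumb N > 50 + 8m, where m is the number of predictors."""
    return 50 + 8 * n_predictors


# Synthetic stand-in for the dataset: 200 rows, 75% paid on time.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "credit_score": rng.integers(300, 851, size=200),
    "on_time_payment": [1] * 150 + [0] * 50,
})

# Stratifying on the target preserves the 75/25 class mix in both sets.
demo_train, demo_test = train_test_split(
    demo, test_size=0.20, random_state=42, stratify=demo["on_time_payment"]
)

print("Threshold for 3 predictors:", min_training_size(3))
print("Train rows:", len(demo_train), "Test rows:", len(demo_test))
print("On-time share (train):", demo_train["on_time_payment"].mean())
print("On-time share (test):", demo_test["on_time_payment"].mean())
```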

In [None]:
# Separating the explanatory variables from the outcome variable (train).
x_train = train.drop(['on_time_payment'], axis = 1)
y_train = train['on_time_payment']
x_train.head()

In [None]:
# Separating the explanatory variables from the outcome variable (test).
x_test = test.drop(['on_time_payment'], axis = 1)
y_test = test['on_time_payment']
x_test.head()

In [None]:
# Standardize all the continuous variables for train set.
from sklearn.preprocessing import StandardScaler

# Assigning feature labels to variable continuous_vars.
continuous_vars = ['credit_score', 'debt_to_income_ratio', 'num_late_payments',]

# Initialize StandardScaler.
scaler = StandardScaler()

# Fit scaler to the continuous variables and transform them.
x_train[continuous_vars] = scaler.fit_transform(x_train[continuous_vars])

In [None]:
# Standardize all the continuous variables for test set.
# Reuse the scaler fitted on the train set; it must not be refitted here.
continuous_vars = ['credit_score', 'debt_to_income_ratio', 'num_late_payments',]

# Transform the test set with the mean and standard deviation of the train set.
x_test[continuous_vars] = scaler.transform(x_test[continuous_vars])
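
The two cells above fit the scaler on the train set and only apply it to the test set, which keeps test information out of training. One possible refinement, sketched here on synthetic data rather than the notebook's dataset, is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaler is also refitted inside each cross-validation fold. The names `X_demo`, `y_demo`, `pipe`, and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-feature data standing in for the payment dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

search = GridSearchCV(
    estimator=pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7]},  # step name + double underscore
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best Hyperparameters:", search.best_params_)
print("Best Cross-Validation Accuracy:", round(search.best_score_, 3))
```

With this layout, the fitted pipeline can be applied directly to unscaled rows, because scaling happens inside `predict`.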

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Checking for Multicollinearity among continuous variables using correlation matrix
plt.figure(figsize=(6, 4))
sns.heatmap(x_train[continuous_vars].corr(), annot=True, cmap='coolwarm')
plt.show()
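
The heatmap is read by eye. As a complement, strongly correlated pairs can be listed programmatically. This is a minimal sketch: the 0.8 cut-off is an assumed convention rather than a value from the notebook, and the `toy` columns and the `high_correlation_pairs` helper are hypothetical.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]


# Toy data: x2 is almost a multiple of x1, x3 is unrelated to both.
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "x3": [5, 1, 4, 2, 6, 3],
})

flagged = high_correlation_pairs(toy)
print(flagged)
```

In the notebook, the same helper could be called on `x_train[continuous_vars]`.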

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: KNN<font/>

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

knn = KNeighborsClassifier()

param_grid = {
    'n_neighbors': list(range(3, 21)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(x_train, y_train)

best_knn = grid_search.best_estimator_

# CV Score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# Train Score
y_train_pred = best_knn.predict(x_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)

# Test Score
y_pred = best_knn.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_accuracy)

In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred))

metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred, average='macro'),
    "Recall": recall_score(y_test, y_pred, average='macro'),
    "F1-score": f1_score(y_test, y_pred, average='macro')
}

results_df = np.round(pd.DataFrame(metrics, index=['Score']),2)
print(results_df)

In [None]:
# Visualize the confusion matrix of the best KNN model on the test set.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Define class labels.
class_labels = np.unique(y_test)

# Create a confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=class_labels)

# Proper labelling of outcomes.
tn, fp, fn, tp= confusion_matrix(y_test, y_pred, labels=class_labels).ravel()
print("tp:", tp,"fn:", fn,"fp:",fp,"tn:", tn )

# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()
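
For binary labels ordered as `[0, 1]`, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp, which is why the unpacking order in the cell above matters. The toy labels below are made up to show this and to recompute precision and recall by hand.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = paid on time, 0 = paid late.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 1]

# Rows are true labels, columns are predicted labels, so ravel() yields
# tn, fp, fn, tp when the labels are ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted on-time that were on time
recall = tp / (tp + fn)     # share of actual on-time that were caught

print("tn:", tn, "fp:", fp, "fn:", fn, "tp:", tp)
print("precision:", precision, "recall:", recall)
```

The hand-computed values match `precision_score` and `recall_score` for the positive class.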

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Decision Trees<font/>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

# Define classifier with random state for reproducible results
classifier = DecisionTreeClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],       # Split criterion ('gini' or 'entropy')
    'max_depth': [3, 5, 10, 15],         # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],        # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],          # Minimum number of samples required to be at a leaf node
    'max_features': ['log2', 'sqrt']      # Number of features to consider when looking for the best split    
}

# Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid search
grid_search = GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(x_train, y_train)

# Best model
best_tree = grid_search.best_estimator_

print("Best Hyperparameters:", grid_search.best_params_)
print("Best CV Accuracy Score:", grid_search.best_score_)

# Train evaluation
y_train_pred = best_tree.predict(x_train)
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))

# Test evaluation
y_pred = best_tree.predict(x_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
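
One reason to include a decision tree is that its learned rules can be printed and read. The sketch below fits a shallow tree on synthetic data and prints its splits with `export_text`. The feature names are borrowed from the data dictionary purely as labels, so the printed thresholds say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Labels only; the values below are synthetic, not the payment data.
feature_names = ["credit_score", "debt_to_income_ratio", "num_late_payments"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# A depth of 2 keeps the printed rules short enough to read.
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

In the notebook, `best_tree` and `list(x_train.columns)` could be passed in the same way.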

In [None]:
print("\nClassification Report:\n", classification_report(y_test, y_pred))

metrics = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "Precision": precision_score(y_test, y_pred, average='macro'),
    "Recall": recall_score(y_test, y_pred, average='macro'),
    "F1-score": f1_score(y_test, y_pred, average='macro')
}

results_df = np.round(pd.DataFrame(metrics, index=['Score']),2)
print(results_df)

In [None]:
# Visualize the confusion matrix of the best Decision Tree model on the test set.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Make predictions using the best model on the test set.
y_pred = grid_search.predict(x_test)

# Define class labels.
class_labels = np.unique(y_test)

# Create a confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=class_labels)

# Proper labelling of outcomes.
tn, fp, fn, tp= confusion_matrix(y_test, y_pred, labels=class_labels).ravel()
print("tp:", tp,"fn:", fn,"fp:",fp,"tn:", tn )

# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Logistic Regression<font/>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report

# Proper Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

# Logistic Regression model with balanced classes
logreg = LogisticRegression(class_weight='balanced', random_state=42, max_iter=500)

# Hyperparameter grid
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear'],  # 'liblinear' supports both L1 and L2
}

# Grid search using accuracy
grid_search = GridSearchCV(
    estimator=logreg,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',   
    n_jobs=-1
)

# Fit grid search
grid_search.fit(x_train, y_train)

# Best model
best_logreg = grid_search.best_estimator_

print("Best Hyperparameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)

# Train prediction
y_train_pred = best_logreg.predict(x_train)
print("\nTrain Accuracy:", accuracy_score(y_train, y_train_pred))

# Test prediction
y_test_pred = best_logreg.predict(x_test)
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
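
Because the features are standardized, each logistic regression coefficient describes the change in log-odds per one standard deviation of that feature, and exponentiating it gives an odds ratio. The sketch below illustrates this on synthetic data. The feature names are reused only as labels and the fitted values are not results from the payment dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Labels only; the data is synthetic.
feature_names = ["credit_score", "debt_to_income_ratio", "num_late_payments"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

demo_logreg = LogisticRegression(solver="liblinear", C=1.0).fit(X_demo, y_demo)

# exp(coefficient) = multiplicative change in the odds of class 1
# for a one-standard-deviation increase in the feature.
odds_ratios = pd.Series(
    np.exp(demo_logreg.coef_[0]), index=feature_names, name="odds_ratio"
)
print(odds_ratios.round(3))
```

An odds ratio above 1 raises the odds of the positive class, and a value below 1 lowers them.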

In [None]:
# Use the Logistic Regression test predictions (y_test_pred) from the cell above.
print("\nClassification Report:\n", classification_report(y_test, y_test_pred))

metrics = {
    "Accuracy": accuracy_score(y_test, y_test_pred),
    "Precision": precision_score(y_test, y_test_pred, average='macro'),
    "Recall": recall_score(y_test, y_test_pred, average='macro'),
    "F1-score": f1_score(y_test, y_test_pred, average='macro')
}

results_df = np.round(pd.DataFrame(metrics, index=['Score']),2)
print(results_df)

In [None]:
# Visualize the confusion matrix of the best Logistic Regression model on the test set.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Print the Logistic Regression Model performance on our test set.
y_pred = grid_search.predict(x_test)

# Define class labels.
class_labels = np.unique(y_test)

# Compute confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=class_labels)

# Proper labelling of outcomes (for labels [0, 1], ravel() returns tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=class_labels).ravel()
print("tp:", tp,"fn:", fn,"fp:",fp,"tn:", tn )

# Create a heatmap of the confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()

# <font face = 'Palatino Linotype' color = '#5885AF'> Modelling: Random Forest<font/>

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Load the Random Forest Classifier Model.
rf = RandomForestClassifier(n_jobs=-1, random_state=42)

# Define the parameters of the Random Forest Classifier for Hyperparameter Tuning.
param_grid = {
    'n_estimators': [10, 50, 100,200],
    'max_depth': [None, 5, 10,20],
    'min_samples_split': [2, 5, 10,15]
}

# Setup the grid search with accuracy scoring
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy', 
    n_jobs=-1
)

# Using GridSearchCV to determine the best model.
grid_search.fit(x_train, y_train)

# Get the best parameters and its Accuracy score
best_params = grid_search.best_params_
best_cv_accuracy = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best CV Accuracy:", best_cv_accuracy)

# Train the best model
best_rf = grid_search.best_estimator_

# Predictions
y_train_pred = best_rf.predict(x_train)
y_test_pred = best_rf.predict(x_test)

# Evaluate
print("\nTrain Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
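
A fitted random forest exposes impurity-based feature importances, which give a rough view of which inputs drive its predictions. The sketch below uses synthetic data with the notebook's feature names as labels only. The resulting ranking is illustrative and does not describe the payment dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels only; the data is synthetic.
feature_names = ["credit_score", "debt_to_income_ratio", "num_late_payments"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Importances are normalized to sum to 1 across the features.
importances = pd.Series(
    demo_rf.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

In the notebook, `best_rf.feature_importances_` paired with `x_train.columns` would give the corresponding view.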

In [None]:
# Use the Random Forest test predictions (y_test_pred) from the cell above.
print("\nClassification Report:\n", classification_report(y_test, y_test_pred))

metrics = {
    "Accuracy": accuracy_score(y_test, y_test_pred),
    "Precision": precision_score(y_test, y_test_pred, average='macro'),
    "Recall": recall_score(y_test, y_test_pred, average='macro'),
    "F1-score": f1_score(y_test, y_test_pred, average='macro')
}

results_df = np.round(pd.DataFrame(metrics, index=['Score']),2)
print(results_df)

In [None]:
from sklearn import tree
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Print the RF Model performance on our test set.
y_pred = grid_search.predict(x_test)

# Calculate model performance such as F1, Recall, and Precision. 
from sklearn import metrics
print("Test f1 Score =",metrics.f1_score(y_test, y_pred, pos_label = 1))
print("Test recall Score =",metrics.recall_score(y_test, y_pred, pos_label = 1))
print("Test precision Score =",metrics.precision_score(y_test, y_pred, pos_label = 1))


# Define class labels.
class_labels = np.unique(y_test)

# Compute confusion matrix.
cm = confusion_matrix(y_test, y_pred, labels=class_labels)

# Proper labelling of outcomes (for labels [0, 1], ravel() returns tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=class_labels).ravel()
print("tp:", tp,"fn:", fn,"fp:",fp,"tn:", tn )

# Create a heatmap of the confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()

# <font face = 'Palatino Linotype' color = '#5885AF'> Saving the Model for Future Deployment<font/>

In [None]:
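# Save a copy of the Random Forest Model.
import pickle
# A context manager ensures the file is flushed and closed after writing.
with open('RFPaymodel.pkl', 'wb') as model_file:
    pickle.dump(best_rf, model_file)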
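
The saved model expects standardized inputs, so a deployment also needs the scaler that was fitted on the train set. One option, sketched here on synthetic data, is to pickle both objects together and confirm that the reloaded pair reproduces the original predictions. The file name `demo_bundle.pkl` and the `bundle` dictionary layout are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=42
)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(
    demo_scaler.transform(X_demo), y_demo
)

# Save the scaler and the model together so they cannot drift apart.
bundle = {"scaler": demo_scaler, "model": demo_model}
path = os.path.join(tempfile.mkdtemp(), "demo_bundle.pkl")
with open(path, "wb") as fh:
    pickle.dump(bundle, fh)

# Reload and score a few raw (unscaled) rows.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

new_rows = X_demo[:5]
original = demo_model.predict(demo_scaler.transform(new_rows))
reloaded = restored["model"].predict(restored["scaler"].transform(new_rows))
print("Predictions match:", np.array_equal(original, reloaded))
```

Pickle files should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.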

# Decision

Based on the cross-validation (CV) accuracy of the different models, the Random Forest Classifier shows the most promising performance among the trained models. Its CV accuracy is consistent with its train accuracy, which suggests the model is not overfitting. It also performs well on the unseen test data, with an accuracy close to 0.9.