# **5. Model Building**

In [None]:
# loading the final DataFrame
df = pd.read_csv(final_data_path)

df.head()

In [None]:
# loading a second copy of the final DataFrame (df2 is the copy used for modeling below)
df2 = pd.read_csv(final_data_path)

df2.head()

## **5.1 Splitting the Dataset**
- We will split the dataset into training and testing sets using an 80-20 split, so that the model is trained on a substantial portion of the data while a separate test set is retained for evaluation.

In [None]:
# Splitting the dataset into features and target variable
X = df2.drop(columns=['default_payment_next_month'])
y = df2['default_payment_next_month']

# Displaying the shapes of features and target variable
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

- We split the dataset into features (X) and target variable (y).

In [None]:
# Splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Displaying the shapes of training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

- We used stratified sampling to ensure that the class distribution in the target variable is maintained in both training and testing sets.
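One way to confirm that stratification worked is to compare the class shares of the full target with those of each split. The helper below is a minimal sketch; `class_proportions` and the toy label lists are hypothetical names and values introduced for illustration, not part of the notebook.

```python
from collections import Counter

def class_proportions(labels):
    """Return the share of each class in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}

# Toy labels standing in for y, y_train and y_test (illustrative values only)
y_all_demo = [0] * 80 + [1] * 20
y_train_demo = [0] * 64 + [1] * 16   # 80% of each class
y_test_demo = [0] * 16 + [1] * 4     # remaining 20% of each class

props_all = class_proportions(y_all_demo)
props_train = class_proportions(y_train_demo)
props_test = class_proportions(y_test_demo)
print(props_all, props_train, props_test)
```

In the notebook, the same helper could be called on `y`, `y_train` and `y_test`; with `stratify=y` the three results should be close to each other.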

## **5.2 Handling Class Imbalance on Training Dataset**
- We will combine oversampling and undersampling techniques to handle class imbalance in the training dataset.
    - We will use `SMOTE` (Synthetic Minority Over-sampling Technique) to oversample the minority class (default payment). It generates synthetic samples by interpolating between existing minority class samples and their nearest minority class neighbors, increasing the representation of the minority class in the training dataset.
    - We will then use the `Tomek Links` undersampling technique to clean up the class boundary. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Removing one sample from each such pair reduces the overlap between classes and improves the model's ability to distinguish between them.
- Only the training set is resampled; the test set keeps its original class distribution. Combining oversampling with boundary cleaning gives a nearly balanced training dataset, improving the model's ability to predict both classes.
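The interpolation step behind SMOTE can be sketched in a few lines of plain Python. This is only an illustration of the idea, not imblearn's implementation: the real algorithm also searches for the k nearest minority class neighbors and picks one of them at random. The function name `smote_like_sample` and the two points are hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a minority-class neighbor."""
    lam = rng.random()  # interpolation weight drawn from [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)          # seeded for reproducibility
x = [1.0, 2.0]                   # an existing minority-class sample (toy values)
neighbor = [3.0, 6.0]            # one of its minority-class neighbors (toy values)
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Because the weight lies between 0 and 1, the synthetic point always falls on the line segment joining the two real samples.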

In [None]:
# Applying SMOTE to handle class imbalance
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Displaying the shapes of resampled training set
print(f"X_train_resampled shape: {X_train_resampled.shape}")
print(f"y_train_resampled shape: {y_train_resampled.shape}")

# Displaying class distribution in original and resampled training sets
from collections import Counter
print("Class distribution in original training set:", Counter(y_train))
print(f"Class distribution in resampled training set: {Counter(y_train_resampled)}")

- After applying SMOTE oversampling, the training dataset has 37336 rows: 18668 rows in the majority class (no default payment) and 18668 rows in the minority class (default payment).

In [None]:
# Applying Tomek Links Under-sampling technique to the resampled training set
from imblearn.under_sampling import TomekLinks
tomek_links = TomekLinks()
X_train_final, y_train_final = tomek_links.fit_resample(X_train_resampled, y_train_resampled)

# Displaying the shapes of final training set after Tomek Links
print(f"X_train_final shape: {X_train_final.shape}")
print(f"y_train_final shape: {y_train_final.shape}")

# Displaying class distribution in SMOTE resampled and final training sets
print(f"Class distribution in SMOTE resampled training set: {Counter(y_train_resampled)}")
print("Class distribution in final training set:", Counter(y_train_final))

# Saving X_train_final and y_train_final back to X_train and y_train
X_train, y_train = X_train_final, y_train_final

- After applying Tomek Links undersampling, the training dataset has 36804 rows: 18668 rows in the original majority class (no default payment) and 18136 rows in the original minority class (default payment). Tomek Links removed 532 rows (37336 - 36804), all from the default payment class, leaving the two classes nearly balanced for training.
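To make concrete what counts as a Tomek link, here is a brute-force sketch that finds pairs of mutual nearest neighbors with different labels. It is an illustration under simplifying assumptions (Euclidean distance, tiny toy data) and is not how imblearn implements the search; `nearest_neighbor`, `find_tomek_links` and the sample points are hypothetical.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def find_tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links

# Toy data: points 0 and 1 are close together and belong to different classes
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 1]
links = find_tomek_links(points, labels)
print(links)
```

Points 2 and 3 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek link. An undersampler then drops one member of each link, depending on which classes it is configured to clean.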

## **5.3 Baseline Model**
- First we will build a baseline model using Logistic Regression to establish a performance benchmark for our credit card default prediction task.

In [None]:
# Building a baseline model using Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initializing the Logistic Regression model
logistic_model = LogisticRegression(penalty='l1', max_iter=1000, solver='saga', random_state=42, n_jobs=-1)

# Fitting the model on the training data
logistic_model.fit(X_train, y_train)

# Making predictions on training set and test set
logistic_y_train_pred = logistic_model.predict(X_train)
logistic_y_pred = logistic_model.predict(X_test)

# Calculating accuracy on training set and test set
logistic_train_accuracy = accuracy_score(y_train, logistic_y_train_pred)
logistic_test_accuracy = accuracy_score(y_test, logistic_y_pred)

# Displaying results for the training set
print("Results on Training Set:")
print(f"Training Accuracy: {logistic_train_accuracy:.4f}")
print(classification_report(y_train, logistic_y_train_pred))
print(confusion_matrix(y_train, logistic_y_train_pred))
print()

# Displaying results for the test set
print("Results on Test Set:")
print(f"Test Accuracy: {logistic_test_accuracy:.4f}")
print(classification_report(y_test, logistic_y_pred))
print(confusion_matrix(y_test, logistic_y_pred))

- The logistic regression model does not perform well on either the training or the testing dataset. The confusion matrix shows more false positives than false negatives: the model flags many non-defaulting customers as defaulters, so its predictions for the minority class (default payment) are not reliable.
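The link between false positives, false negatives and the per-class scores in the classification report can be made explicit with a small helper. This is a sketch only; `binary_metrics` is a hypothetical name, and the counts passed to it are made-up values, not the notebook's results.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration: many false positives, few false negatives
metrics = binary_metrics(tn=70, fp=20, fn=5, tp=5)
print(metrics)
```

For binary labels 0 and 1, scikit-learn's `confusion_matrix(y_test, logistic_y_pred).ravel()` returns the counts in the order `tn, fp, fn, tp`, so they can be passed straight to the helper. Many false positives lower precision for the default class, while many false negatives lower its recall.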

## **5.4 Advanced Models**
- Now we will build advanced models such as Decision Trees, Random Forest, Gradient Boosting, and XGBoost to improve the performance of our credit card default prediction task.

### **Decision Tree Classifier**