# Feature Engineering and Modeling

In this notebook, I will build upon the preprocessed dataset created in the `01_data_loading_and_exploration.ipynb` notebook. Here, I will:

1. Perform additional feature engineering, if needed, to improve predictive performance.
2. Train a baseline machine learning model (e.g., Logistic Regression) to establish a performance benchmark.
3. Evaluate the model using appropriate metrics.
4. Potentially experiment with more advanced models or parameter tuning.

By the end of this notebook, I should have a good sense of how well a simple model can predict credit risk and understand where there might be room for improvement.

In [1]:
import pandas as pd
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the previously saved train/test splits
X_train = joblib.load("../data/X_train.pkl")
X_test = joblib.load("../data/X_test.pkl")
y_train = joblib.load("../data/y_train.pkl")
y_test = joblib.load("../data/y_test.pkl")

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (800, 48)
X_test shape: (200, 48)
y_train shape: (800,)
y_test shape: (200,)


### Baseline Model: Logistic Regression

Before diving into complex models or extensive parameter tuning, I will start with a simple baseline model. Logistic Regression is a classic linear model often used as a first benchmark in classification tasks. 

**Why Logistic Regression?**  
- It’s straightforward and widely understood, making it an excellent first choice to gauge the difficulty of the prediction problem.
- It runs quickly and gives a point of comparison for more advanced models later.
- If Logistic Regression performs reasonably well, it suggests that linear relationships between features and the target are informative. If it performs poorly, it may indicate that more complex relationships or features are needed.

**What Am I Looking For?**  
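- **Accuracy**: The fraction of predictions that are correct overall. With imbalanced classes, accuracy can look good even when the minority class is predicted poorly, so I will not rely on it alone.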
- **Precision/Recall/F1-score**: To understand the quality of predictions for each class, especially the "bad" credit class, which might be more critical to identify accurately.
- **Confusion Matrix**: To see if the model predominantly misclassifies one class over the other.

After evaluating Logistic Regression, I’ll know whether I need more advanced techniques (e.g., Random Forests, Gradient Boosted Trees) or additional data transformations. This baseline sets the foundation for all subsequent improvements.

In [2]:
# Initialize a Logistic Regression model; max_iter is raised from the default of 100 to 1000
model = LogisticRegression(max_iter=1000, random_state=42)

# Fit the model on the training data
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [3]:
# Evaluate the model
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", acc)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", report)

Accuracy: 0.795

Confusion Matrix:
 [[127  14]
 [ 27  32]]

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.90      0.86       141
           1       0.70      0.54      0.61        59

    accuracy                           0.80       200
   macro avg       0.76      0.72      0.74       200
weighted avg       0.79      0.80      0.79       200
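
To make the report above easier to read, here is a minimal sketch that recomputes the class 1 metrics by hand from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative, and the majority-class figure is an extra reference point that is not part of the original evaluation.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes.
cm = np.array([[127, 14],
               [27, 32]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of all predicted class 1, how many are truly class 1
recall_1 = tp / (tp + fn)      # of all true class 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Reference point: accuracy obtained by always predicting class 0,
# the majority class in the test set (141 of 200 samples).
majority_accuracy = cm[0].sum() / cm.sum()

print(f"accuracy:          {accuracy:.3f}")
print(f"precision (1):     {precision_1:.2f}")
print(f"recall (1):        {recall_1:.2f}")
print(f"f1 (1):            {f1_1:.2f}")
print(f"majority baseline: {majority_accuracy:.3f}")
```

The hand-computed values match the report (0.70, 0.54, and 0.61 for class 1). Always predicting class 0 would already score 0.705 on this test set, so the model's 0.795 accuracy is best read against that figure rather than against 0.5.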



### Addressing the Convergence Warning

I received a `ConvergenceWarning`, which indicates that the Logistic Regression optimizer did not fully converge within the iteration limit (`max_iter=1000`, already ten times scikit-learn's default of 100). This can happen when:

- Features are on different scales, making it harder for the optimizer to find a stable solution.
- The solver and iteration limit are a poor match for this particular dataset.

**Why does scaling help?**  
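When features vary widely in scale (e.g., some in the hundreds, others less than one), the loss surface becomes poorly conditioned, and a gradient-based optimizer needs many more steps to reach the minimum. Scaling all features to a similar range (for example, using `StandardScaler` to give each a mean of 0 and a standard deviation of 1) lets the model converge more easily. The scaler should be fit on the training data only and then applied to the test data, so that no test-set statistics leak into training.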
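
As a small illustration of what standardization does, the sketch below applies `StandardScaler` to made-up toy values. The arrays and variable names are hypothetical and are not taken from the credit dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales:
# one in the thousands, one below one.
X_toy_train = np.array([[1000.0, 0.10],
                        [2000.0, 0.20],
                        [3000.0, 0.30],
                        [4000.0, 0.40]])
X_toy_test = np.array([[2500.0, 0.25]])

scaler_toy = StandardScaler()
X_toy_train_scaled = scaler_toy.fit_transform(X_toy_train)  # learn mean/std from train only
X_toy_test_scaled = scaler_toy.transform(X_toy_test)        # reuse the training statistics

print("Train column means:", X_toy_train_scaled.mean(axis=0))
print("Train column stds: ", X_toy_train_scaled.std(axis=0))
print("Scaled test row:   ", X_toy_test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1 (up to floating-point error), so neither dominates the optimization purely because of its units. The test row happens to sit at the training mean of each column, so it maps to approximately zero in both.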

**Why increase `max_iter` or change the solver?**  
If the optimizer doesn’t converge within the allotted number of iterations, raising `max_iter` gives it more room to finish. The choice of solver also matters: `liblinear`, for example, is a reasonable choice for small datasets such as this one (800 training rows) and may converge faster or more reliably.
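
One way to check convergence programmatically, rather than by watching for the warning in the cell output, is sketched below. It uses synthetic data from `make_classification` as a stand-in (not the credit dataset), and the helper `fit_and_check` is a hypothetical name introduced only for this illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; NOT the credit dataset used in this notebook.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)


def fit_and_check(max_iter):
    """Fit a model and report whether a ConvergenceWarning was raised."""
    clf = LogisticRegression(max_iter=max_iter, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not suppressed
        clf.fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return clf, warned


# A deliberately tiny iteration budget versus a generous one.
clf_short, warned_short = fit_and_check(max_iter=1)
clf_long, warned_long = fit_and_check(max_iter=5000)

print("max_iter=1    -> warning:", warned_short, "| iterations used:", clf_short.n_iter_)
print("max_iter=5000 -> warning:", warned_long, "| iterations used:", clf_long.n_iter_)
```

The fitted attribute `n_iter_` reports how many iterations the solver actually used. If it equals `max_iter`, the optimizer most likely stopped because it ran out of budget rather than because it converged.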

**Next Step**: I will scale the data, increase `max_iter`, and use a different solver (`liblinear`) to see if the warning disappears and to ensure the model is properly converged.

In [4]:
from sklearn.preprocessing import StandardScaler

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize Logistic Regression with more iterations and a different solver
model_improved = LogisticRegression(max_iter=2000, solver='liblinear', random_state=42)
model_improved.fit(X_train_scaled, y_train)
y_pred_improved = model_improved.predict(X_test_scaled)

acc_improved = accuracy_score(y_test, y_pred_improved)
cm_improved = confusion_matrix(y_test, y_pred_improved)
report_improved = classification_report(y_test, y_pred_improved)

print("Accuracy (Improved):", acc_improved)
print("\nConfusion Matrix (Improved):\n", cm_improved)
print("\nClassification Report (Improved):\n", report_improved)

Accuracy (Improved): 0.795

Confusion Matrix (Improved):
 [[124  17]
 [ 24  35]]

Classification Report (Improved):
               precision    recall  f1-score   support

           0       0.84      0.88      0.86       141
           1       0.67      0.59      0.63        59

    accuracy                           0.80       200
   macro avg       0.76      0.74      0.74       200
weighted avg       0.79      0.80      0.79       200



### Conclusion

After scaling the features, raising `max_iter` to 2000, and switching to the `liblinear` solver, the Logistic Regression model fits without a `ConvergenceWarning`. Because all three changes were made at once, this run does not show which of them was responsible. The adjustment still demonstrates how to diagnose and fix a common modeling issue.

Overall accuracy is unchanged at 0.795, but the errors shift: recall for class 1 rises from 0.54 to 0.59 (35 of 59 correctly identified, up from 32), while its precision drops from 0.70 to 0.67. More importantly, the model is now properly optimized, which gives a more solid baseline for comparison if I try more advanced models or further tune the pipeline in subsequent steps.
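
As a possible starting point for that further tuning, the scaler and the model can be wrapped in a single scikit-learn `Pipeline`, so that scaling is always fit on the training data only. The sketch below is an illustration rather than part of the analysis above: it runs on synthetic data shaped like this notebook's splits (1000 rows, 48 features), and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's data; NOT the credit dataset.
X_syn, y_syn = make_classification(n_samples=1000, n_features=48, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Scaling and the classifier are fit together; the scaler only ever sees X_tr.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, solver="liblinear", random_state=42),
)
pipe.fit(X_tr, y_tr)

pipe_accuracy = pipe.score(X_te, y_te)
print("Test accuracy on synthetic data:", pipe_accuracy)
```

With the notebook's own data, `X_train`, `y_train`, `X_test`, and `y_test` would take the place of the synthetic splits. A pipeline like this can also be passed directly to cross-validation or grid-search utilities, which refit the scaler inside each fold.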