In [8]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Read in data
df = pd.read_csv(r"C:\Users\parir\Downloads\model.csv")

# Separate features and target variable
X = df.drop('default', axis=1)
y = df['default']

# Handle missing values
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)

# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    

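Uses SimpleImputer from scikit-learn to replace missing values with the median of each column.
Uses StandardScaler from scikit-learn to standardize the features to zero mean and unit variance.
Uses train_test_split from scikit-learn to split the data into training and validation sets (80% training, 20% validation) so the models can be evaluated on rows they were not trained on. The random_state parameter is set to 42 for reproducibility. Note that the imputer and scaler are fit on the full dataset before the split, so the validation rows contribute to the medians, means, and variances. To avoid this leakage, split first and fit the preprocessing on the training set only.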
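
One way to keep validation rows out of the preprocessing statistics is to split first and wrap the imputer and scaler in a scikit-learn Pipeline. The sketch below is a minimal illustration, not the code that produced the results in this notebook. It uses synthetic data from make_classification in place of model.csv, and the names X_demo, y_demo, and prep are assumptions made for the example. The stratify argument is an optional addition that keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for model.csv (assumption: numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values

# Split first so the validation rows never influence the fitted statistics
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_tr_prep = prep.fit_transform(X_tr)  # medians, means, and variances come from training rows only
X_va_prep = prep.transform(X_va)      # validation rows reuse the training statistics
```

A classifier can be appended as a final pipeline step so that fit and predict_proba apply the same preprocessing automatically.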

In [9]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, RandomizedSearchCV
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_val)
logreg_proba = logreg.predict_proba(X_val)[:, 1]

# Print classification report and ROC AUC score
print('Logistic Regression:\n', classification_report(y_val, logreg_preds))
print('ROC AUC score:', roc_auc_score(y_val, logreg_proba))

Logistic Regression:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98     18051
           1       0.43      0.03      0.05       749

    accuracy                           0.96     18800
   macro avg       0.70      0.51      0.51     18800
weighted avg       0.94      0.96      0.94     18800

ROC AUC score: 0.8119185227968908


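Fits a logistic regression model to the training data and makes predictions on the validation data. Calculates the classification report and ROC AUC score to evaluate the model's performance. Although accuracy is 0.96, only 749 of the 18,800 validation rows (about 4%) belong to the default class, and the model recovers just 3% of them (recall 0.03), so accuracy alone is misleading here.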

The AUC score ranges from 0.0 to 1.0, where a score of 0.5 indicates a random classifier and a score of 1.0 indicates a perfect classifier. Higher AUC scores indicate better overall performance of the model in distinguishing between the positive and negative classes.
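
Equivalently, the AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The small example below checks this against roc_auc_score. The labels and scores are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so both values are 8/9, about 0.889.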

In [10]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_val)
rf_proba = rf.predict_proba(X_val)[:, 1]

# Print classification report and ROC AUC score
print('Random Forest:\n', classification_report(y_val, rf_preds))
print('ROC AUC score:', roc_auc_score(y_val, rf_proba))

Random Forest:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98     18051
           1       0.52      0.02      0.03       749

    accuracy                           0.96     18800
   macro avg       0.74      0.51      0.51     18800
weighted avg       0.94      0.96      0.94     18800

ROC AUC score: 0.807635116909152


For logistic regression, I didn't perform any feature engineering, as it's a linear model that assumes the features are linearly related to the log-odds of the target variable. For random forest, I didn't perform any feature selection, as random forests are able to handle a large number of features.
For both models, I used the default hyperparameters, but you could try tuning them with GridSearchCV or RandomizedSearchCV from scikit-learn to improve performance. Also note that I evaluated the models with the classification report (precision, recall, F1 score, accuracy) and the ROC AUC score. Other metrics, such as average precision, may be a better fit depending on the specifics of your problem, especially when the classes are as imbalanced as they are here.
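
As a rough sketch of what such tuning could look like, the example below runs GridSearchCV over a small logistic regression grid with stratified 5-fold cross-validation and ROC AUC as the score. The synthetic data, the grid values, and the max_iter setting are all assumptions chosen for illustration. They are not tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the training split (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Hypothetical grid: regularization strength and class weighting
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # best combination found in the grid
print(round(search.best_score_, 3))  # mean cross-validated ROC AUC for that combination
```

RandomizedSearchCV takes the same estimator and scoring arguments but samples a fixed number of combinations, which tends to be cheaper for larger grids such as those of a random forest.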

The ROC AUC score summarizes overall model performance in a single value, which makes it convenient for comparing models or tuning hyperparameters. On the validation set, logistic regression scores slightly higher than the random forest classifier (0.812 vs. 0.808), although a difference this small may not be meaningful.

In [11]:
# Read in data
df = pd.read_csv(r"C:\Users\parir\Downloads\val.csv")

# Separate features and target variable
X = df.drop('default', axis=1)
y = df['default']

# Handle missing values
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)

# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [12]:
# Logistic Regression predictions
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict_proba(X_val)[:,1] # class probabilities for default class
lr_df = pd.DataFrame(lr_preds, columns=['predictions'])
lr_df.to_csv('results1.csv', index=False)

In [13]:
# Random Forest predictions
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict_proba(X_val)[:,1] # class probabilities for default class
rf_df = pd.DataFrame(rf_preds, columns=['predictions'])
rf_df.to_csv('results2.csv', index=False)

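This saves each model's predicted default probabilities to a separate CSV file, results1.csv and results2.csv. Each file has a single column with the header predictions and no index column. Pass header=False to to_csv if no header label is wanted. Because cell 11 splits val.csv again and both models are refit on its 80% training portion, the files contain predictions only for the remaining 20% of rows. Note also that the random forest here sets max_depth=10, unlike the default-parameter model evaluated in cell 10.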
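
If the goal is instead to score every row of a second file with a model trained on the first, one option is to fit a single pipeline on the training file and reuse it unchanged. The sketch below is an assumption about that intent, not the notebook's method. It uses small synthetic DataFrames in place of model.csv and val.csv, and make_frame, train_df, and score_df are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_frame(n_rows, seed):
    """Build a synthetic frame with two features and a binary 'default' column."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({"f1": rng.normal(size=n_rows), "f2": rng.normal(size=n_rows)})
    frame["default"] = (frame["f1"] + rng.normal(scale=0.5, size=n_rows) > 1.0).astype(int)
    frame.loc[frame.index[::10], "f2"] = np.nan  # every tenth row has a missing value
    return frame

train_df = make_frame(400, seed=0)  # stands in for model.csv
score_df = make_frame(100, seed=1)  # stands in for val.csv

model = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression())
model.fit(train_df.drop("default", axis=1), train_df["default"])

# Reuse the fitted imputer, scaler, and classifier; nothing is refit on the second file
proba = model.predict_proba(score_df.drop("default", axis=1))[:, 1]
out = pd.DataFrame({"predictions": proba})

# Without a path, to_csv returns the CSV text; pass a filename to write a file instead
csv_text = out.to_csv(index=False)
```

With this layout the output has one prediction per row of the second file, in the original row order.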

Logistic Regression:

Pros:

Simple and easy to interpret.
Computationally efficient, even with large datasets.
Can handle binary and multi-class classification problems.
Can provide class probabilities in addition to binary predictions.

Cons:

Assumes a linear relationship between the features and the log-odds of the target variable.
Limited by the assumptions of the underlying statistical model.
Can be sensitive to outliers and multicollinearity.
May underperform when there are non-linear relationships or interactions between features.

Random Forest:

Pros:

Can capture non-linear relationships and interactions between features.
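Can handle a mix of categorical and numerical features, although scikit-learn requires categorical features to be numerically encoded first.
Robust to outliers and, in some implementations, to missing data.
Can provide feature importance rankings.
Generally performs well on a wide range of problems.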

Cons:

Can be computationally expensive, especially with large datasets and many trees.
Can be prone to overfitting if hyperparameters are not tuned correctly.
May be less interpretable than simpler models like logistic regression.
Can be difficult to diagnose and debug if issues arise.
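
To make the interpretability points above concrete, the sketch below puts logistic regression coefficients next to random forest feature importances. It runs on synthetic data, and the feature names are hypothetical placeholders rather than columns from model.csv.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data with two informative features out of five (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical column names

logreg_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

summary = pd.DataFrame(
    {
        "logreg_coef": logreg_demo.coef_[0],            # signed change in log-odds per unit of the feature
        "rf_importance": rf_demo.feature_importances_,  # impurity-based, non-negative, sums to 1
    },
    index=names,
).sort_values("rf_importance", ascending=False)
print(summary)
```

The coefficients carry a direction and are comparable across features only when the features share a scale, whereas the importances rank features without saying which way they push the prediction.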

Considerations for business context
In a business context, there are several additional factors to consider when choosing a modeling technique:

Interpretability: Depending on the problem domain, it may be important to choose a model that is easy to interpret and explain to stakeholders. In this case, logistic regression may be preferred over random forest.

Computation time: If the dataset is very large, or if predictions need to be generated in real-time, the computational efficiency of the model may be a critical factor. In this case, logistic regression may be preferred over random forest.

Performance metrics: The choice of performance metrics should be aligned with the specific business goal of the model. For example, if the goal is to minimize false positives (i.e. predicting someone will default when they won't), then precision may be a more important metric than recall. It's important to choose a model and tuning strategy that maximizes the desired performance metrics.
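
One practical lever for trading precision against recall is the decision threshold applied to the predicted probabilities, since predict uses a fixed cut-off of 0.5. The sketch below picks the threshold with the best precision among those that still meet a recall floor. The synthetic data and the min_recall value of 0.5 are assumptions standing in for a real business requirement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the notebook's data (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

# precision and recall have one more entry than thresholds; drop the final (1, 0) point
precision, recall, thresholds = precision_recall_curve(y_va, proba)
precision, recall = precision[:-1], recall[:-1]

min_recall = 0.5  # hypothetical business requirement
meets_floor = recall >= min_recall

# Among thresholds that meet the recall floor, take the one with the best precision
best = np.argmax(np.where(meets_floor, precision, -1.0))
chosen_threshold = thresholds[best]
preds = (proba >= chosen_threshold).astype(int)

print(chosen_threshold, precision[best], recall[best])
```

The threshold should be chosen on validation data and then confirmed on a separate held-out set, since a cut-off selected and evaluated on the same rows will look better than it is.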

Overall, both logistic regression and random forest are useful modeling techniques that have their own strengths and weaknesses. The choice of model and tuning strategy should be based on the specific problem domain, available data, and business requirements.