<a href="https://colab.research.google.com/github/varshith1824/Data-science-_cipher-schools/blob/main/credit_risk_scoring_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Develop a credit risk scoring system using a logistic regression model, starting with the `credit_risk_dataset.csv` file. This involves data preprocessing, training and evaluating the model, and extracting the model's coefficients and intercept for future use in a MySQL stored procedure.

## Load Dataset

### Subtask:
Load the 'credit_risk_dataset.csv' file into a pandas DataFrame to begin the credit risk scoring system development.


**Reasoning**:
First, import the pandas library, then load the specified CSV file into a DataFrame, and finally display its first 5 rows to verify the data has been loaded correctly.



In [None]:
import pandas as pd

df = pd.read_csv('/content/credit_risk_dataset.csv')
df.head()

Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


## Data Preprocessing and Feature Engineering

### Subtask:
Perform necessary data preprocessing steps including handling missing values, encoding categorical features (e.g., one-hot encoding), and preparing the data for model training. This will involve identifying features and the target variable.


**Reasoning**:
The first step in data preprocessing is to inspect the DataFrame for missing values and data types to understand its structure and identify columns that need cleaning. I will use `df.isnull().sum()` to count missing values and `df.info()` to check data types.



In [None]:
print(df.isnull().sum())
df.info()

person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6
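
To put these counts in proportion, the sketch below converts the two non-zero counts reported above into shares of the 32581 rows. The variable names are illustrative only; on the real DataFrame, `df.isnull().mean()` should give the same ratios directly.

```python
# Counts copied from the df.isnull().sum() output above.
total_rows = 32581
missing_counts = {"person_emp_length": 895, "loan_int_rate": 3116}

# Share of rows with a missing value, per column.
missing_share = {col: n / total_rows for col, n in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.2%}")
```

Both shares are below 10%, which is consistent with imputing the gaps rather than dropping the affected rows.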

**Reasoning**:
The `df.isnull().sum()` output shows missing values in `person_emp_length` (895) and `loan_int_rate` (3116). I will impute both numerical columns with their respective medians, which are less sensitive to outliers than the mean (the first row, for instance, has a `person_emp_length` of 123.0), so that the data is complete for model training.



In [None]:
# Assign the result back to the column. Calling fillna(..., inplace=True) on a
# column selection triggers a pandas warning and will stop working in pandas 3.0.
df['person_emp_length'] = df['person_emp_length'].fillna(df['person_emp_length'].median())
df['loan_int_rate'] = df['loan_int_rate'].fillna(df['loan_int_rate'].median())

print('Missing values after imputation:')
print(df.isnull().sum())

Missing values after imputation:
person_age                    0
person_income                 0
person_home_ownership         0
person_emp_length             0
loan_intent                   0
loan_grade                    0
loan_amnt                     0
loan_int_rate                 0
loan_status                   0
loan_percent_income           0
cb_person_default_on_file     0
cb_person_cred_hist_length    0
dtype: int64
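
As a small illustration of why the median was chosen, the sketch below applies median imputation to the five `person_emp_length` values from the `df.head()` output plus one artificial missing entry. The Series is a toy stand-in, not the full column.

```python
import pandas as pd

# person_emp_length values from the first five rows shown earlier,
# plus one artificial missing entry (None becomes NaN).
emp_length = pd.Series([123.0, 5.0, 1.0, 4.0, 8.0, None], name="person_emp_length")

median_value = emp_length.median()  # NaN is skipped by default
mean_value = emp_length.mean()      # pulled upward by the 123.0 outlier

# Assign the result to a new Series instead of relying on inplace=True.
filled = emp_length.fillna(median_value)

print(f"median={median_value}, mean={mean_value:.1f}")
print(filled.tolist())
```

Here the median is 5.0 while the mean is 28.2, so a single extreme value would distort a mean-based fill far more than a median-based one.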


**Reasoning**:
Now that missing values have been handled, I will identify the categorical columns and apply one-hot encoding to them to convert them into a numerical format suitable for model training. Then, I will separate the features (X) from the target variable (y).



In [None]:
categorical_cols = ['person_home_ownership', 'loan_intent', 'loan_grade', 'cb_person_default_on_file']
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

X = df_encoded.drop('loan_status', axis=1)
y = df_encoded['loan_status']

print('Shape of feature matrix X:', X.shape)
print('Shape of target vector y:', y.shape)
print('\nFirst 5 rows of X:')
print(X.head())
print('\nFirst 5 values of y:')
print(y.head())

Shape of feature matrix X: (32581, 22)
Shape of target vector y: (32581,)

First 5 rows of X:
   person_age  person_income  person_emp_length  loan_amnt  loan_int_rate  \
0          22          59000              123.0      35000          16.02   
1          21           9600                5.0       1000          11.14   
2          25           9600                1.0       5500          12.87   
3          23          65500                4.0      35000          15.23   
4          24          54400                8.0      35000          14.27   

   loan_percent_income  cb_person_cred_hist_length  \
0                 0.59                           3   
1                 0.10                           2   
2                 0.57                           3   
3                 0.53                           2   
4                 0.55                           4   

   person_home_ownership_OTHER  person_home_ownership_OWN  \
0                        False                      False
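
The effect of `drop_first=True` matters later when the coefficients are mapped to columns, so here is a minimal sketch on a made-up frame that uses the four `person_home_ownership` values seen above. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the four home-ownership values; the rows are made up.
toy = pd.DataFrame(
    {"person_home_ownership": ["RENT", "OWN", "MORTGAGE", "OTHER", "RENT"]}
)

encoded = pd.get_dummies(toy, columns=["person_home_ownership"], drop_first=True)

# With drop_first=True the alphabetically first category (MORTGAGE) gets no
# column and becomes the implicit baseline: a MORTGAGE row has no dummy set.
print(list(encoded.columns))
print(encoded.iloc[2].tolist())
```

Any scoring logic built from the coefficients has to treat the dropped category the same way, i.e. as the case where all dummies of that feature are zero.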

## Split Data into Training and Testing Sets

### Subtask:
Divide the preprocessed dataset into training and testing sets to prepare for model development and evaluation.


**Reasoning**:
To prepare the data for model training and evaluation, I will split the preprocessed feature matrix `X` and target vector `y` into training and testing sets using `train_test_split` with a test size of 20% and a random state for reproducibility. Finally, I'll print the shapes of the resulting datasets to verify the split.



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)

Shape of X_train: (26064, 22)
Shape of X_test: (6517, 22)
Shape of y_train: (26064,)
Shape of y_test: (6517,)
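
The split above does not stratify on the target. As an optional variation (not what the notebook ran), the sketch below shows how `stratify` keeps the share of positives identical in both partitions, using synthetic arrays in place of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 20% positives (not the real data).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```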


## Train Logistic Regression Model

### Subtask:
Develop and train a logistic regression model using scikit-learn on the training data to score loan applicants.


**Reasoning**:
I will import the `LogisticRegression` class, instantiate it with `random_state=42` and `solver='liblinear'`, and then fit the model to the training data.



In [None]:
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
model = LogisticRegression(random_state=42, solver='liblinear')

# Fit the model to the training data
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.
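
The model above is fitted on raw feature values, which range from incomes in the tens of thousands to ratios below 1. A common variation, sketched here on synthetic data and not part of the original notebook, is to standardize the features in a pipeline before the same estimator. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for two features on very different scales
# (an income-like column and a ratio-like column); not the real data.
income = rng.normal(60_000, 20_000, size=500)
ratio = rng.uniform(0.0, 0.8, size=500)
X_demo = np.column_stack([income, ratio])
y_demo = (ratio + rng.normal(0, 0.1, size=500) > 0.4).astype(int)

# Standardize each feature, then fit the same estimator as in the notebook.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, solver="liblinear"),
)
scaled_model.fit(X_demo, y_demo)

proba = scaled_model.predict_proba(X_demo)[:, 1]
print(proba[:3])
```

If this variation were adopted, the MySQL procedure would also need the scaler's `mean_` and `scale_` values, since the coefficients would then apply to standardized inputs.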


## Evaluate Model Performance

### Subtask:
Evaluate the trained logistic regression model's performance on the test set using appropriate metrics such as accuracy, precision, recall, or F1-score.


**Reasoning**:
To assess the logistic regression model, I will generate predictions on `X_test` with the trained `model`, import the metric functions from `sklearn.metrics`, and compute accuracy, precision, recall, and F1-score by comparing the predictions against the true labels in `y_test`, printing each metric.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')

Accuracy: 0.8148
Precision: 0.7621
Recall: 0.2394
F1-score: 0.3644
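
As a quick consistency check on these numbers, the F1-score can be recomputed as the harmonic mean of the printed precision and recall. The inputs below are the four-decimal values from the output above, so the result agrees with the reported F1 only to within rounding.

```python
# Metrics as printed above (already rounded to four decimals).
precision = 0.7621
recall = 0.2394

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed: {f1:.4f}")
```

The harmonic mean sits close to the smaller of the two inputs, which is why the low recall dominates the F1-score despite the fairly high precision.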


## Extract Model Coefficients and Intercept

### Subtask:
Extract the coefficients and intercept from the trained logistic regression model. These values are crucial for implementing the scoring logic in a MySQL stored procedure later.


**Reasoning**:
To extract the coefficients and intercept from the trained logistic regression model, I will access the `coef_` and `intercept_` attributes of the `model` object. For this binary model, `coef_` is a 2D array of shape (1, 22) and `intercept_` is a 1D array of length 1, so I will take the first row of `coef_` to get the 22 per-feature coefficients (in the column order of `X`) and the first element of `intercept_` to get the scalar intercept. Finally, I will print these values with clear labels.



In [None]:
coefficients = model.coef_[0]
intercept = model.intercept_[0]

print("Model Coefficients:")
print(coefficients)
print("\nModel Intercept:")
print(intercept)

Model Coefficients:
[-1.74160153e-02 -4.27112656e-05 -1.60074182e-02  1.02503750e-04
  6.03446816e-02  5.25947519e-04  5.13138485e-03  5.31603083e-05
 -3.31474516e-03  6.90049670e-03 -2.42087409e-03  1.41455305e-03
  1.07780277e-03 -7.68018464e-04 -3.05358788e-03 -4.83007174e-03
 -1.16913938e-03  8.39567124e-03  2.45881542e-03  7.24967789e-04
  2.98648441e-04  5.39738997e-03]

Model Intercept:
-0.001487981301711348
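
The printed array carries no feature names, and the stored procedure will have to reproduce the model's probability by hand. The sketch below, on synthetic data with hypothetical column names, pairs each coefficient with its column and checks that the manual formula matches `predict_proba`. It illustrates the computation only and does not use the real coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data; column names are hypothetical.
X_demo = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo["feat_a"] - X_demo["feat_b"] + noise > 0).astype(int)

demo_model = LogisticRegression(random_state=42, solver="liblinear")
demo_model.fit(X_demo, y_demo)

# Pair every coefficient with its column name so the order cannot drift.
coef_by_feature = dict(zip(X_demo.columns, demo_model.coef_[0]))
intercept_demo = demo_model.intercept_[0]

# Manual scoring: z = intercept + sum(coef * value), p = 1 / (1 + exp(-z)).
z = intercept_demo + sum(
    coef_by_feature[name] * X_demo[name] for name in X_demo.columns
)
manual_proba = 1.0 / (1.0 + np.exp(-z))

# The hand-computed probabilities should match scikit-learn's own output.
print(np.allclose(manual_proba, demo_model.predict_proba(X_demo)[:, 1]))
```

For the real model the names would come from `X.columns`, and in MySQL the final step can be expressed with the built-in `EXP()` function as `1 / (1 + EXP(-z))`.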


## Final Task

### Subtask:
Summarize the data preparation, modeling process, model evaluation results, and the extracted coefficients and intercept, which completes the Python notebook deliverable for the credit risk scoring system.


## Summary:

### Data Analysis Key Findings

*   The `credit_risk_dataset.csv` file, containing 32581 records across 12 columns, was successfully loaded. Initial columns included `person_age`, `person_income`, `person_home_ownership`, `person_emp_length`, `loan_intent`, `loan_grade`, `loan_amnt`, `loan_int_rate`, `loan_status`, `loan_percent_income`, `cb_person_default_on_file`, and `cb_person_cred_hist_length`.
*   During data preprocessing, missing values in `person_emp_length` (895 entries) and `loan_int_rate` (3116 entries) were imputed using the median of their respective columns.
*   Categorical features (`person_home_ownership`, `loan_intent`, `loan_grade`, `cb_person_default_on_file`) were one-hot encoded, resulting in a feature matrix `X` with 32581 samples and 22 features. The target variable `y` was `loan_status`.
*   The dataset was split into training (80%) and testing (20%) sets. The training set comprised 26064 samples, and the testing set had 6517 samples.
*   A Logistic Regression model was successfully trained using the training data.
*   Model evaluation on the test set yielded the following performance metrics:
    *   Accuracy: 0.8148
    *   Precision: 0.7621
    *   Recall: 0.2394
    *   F1-score: 0.3644
*   The model's coefficients (a 22-element array corresponding to each feature) and intercept (-0.001487981301711348) were successfully extracted for future use.

### Insights or Next Steps

*   The model reaches an accuracy of 0.8148, but its recall of 0.2394 means it flags only about 24% of the actual loan defaults in the test set, which pulls the F1-score down to 0.3644. This pattern suggests class imbalance in the target variable, which should be checked (for example with `y.value_counts(normalize=True)`) before relying on accuracy.
*   The extracted coefficients and intercept are ready to be integrated into a MySQL stored procedure, enabling the deployment of the credit risk scoring logic for real-time evaluations. Future work should focus on optimizing the model (e.g., using techniques for imbalanced datasets or exploring other algorithms) to improve its ability to detect defaults.
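
One of the imbalance techniques mentioned above is class weighting. The sketch below compares the default estimator with `class_weight="balanced"` on synthetic imbalanced data. It is an assumption-laden illustration rather than a result for the credit dataset, and the actual effect on the real data would need to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 80% negatives); not the credit dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

plain = LogisticRegression(random_state=42, solver="liblinear").fit(X_tr, y_tr)
# class_weight="balanced" reweights samples inversely to class frequency.
balanced = LogisticRegression(
    random_state=42, solver="liblinear", class_weight="balanced"
).fit(X_tr, y_tr)

plain_pred = plain.predict(X_te)
balanced_pred = balanced.predict(X_te)

print("recall (default): ", recall_score(y_te, plain_pred))
print("recall (balanced):", recall_score(y_te, balanced_pred))
```

Weighting typically trades some precision for recall, so both metrics should be compared before the coefficients of a reweighted model are exported.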
