
# Task
Implement a credit scoring classification model using the `GermanCredit.csv` dataset.

## Load the data

### Subtask:
Load the `GermanCredit.csv` dataset into a pandas DataFrame.


**Reasoning**:
Load the dataset into a pandas DataFrame and display the first 5 rows to verify the data is loaded correctly.



In [1]:
import pandas as pd

df = pd.read_csv('GermanCredit.csv')
display(df.head())

Unnamed: 0,status,duration,credit_history,purpose,amount,savings,employment_duration,installment_rate,personal_status_sex,other_debtors,...,property,age,other_installment_plans,housing,number_credits,job,people_liable,telephone,foreign_worker,credit_risk
0,... < 100 DM,6,critical account/other credits existing,domestic appliances,1169,unknown/no savings account,... >= 7 years,4,male : single,none,...,real estate,67,none,own,2,skilled employee/official,1,yes,yes,1
1,0 <= ... < 200 DM,48,existing credits paid back duly till now,domestic appliances,5951,... < 100 DM,1 <= ... < 4 years,2,female : divorced/separated/married,none,...,real estate,22,none,own,1,skilled employee/official,1,no,yes,0
2,no checking account,12,critical account/other credits existing,retraining,2096,... < 100 DM,4 <= ... < 7 years,2,male : single,none,...,real estate,49,none,own,1,unskilled - resident,2,no,yes,1
3,... < 100 DM,42,existing credits paid back duly till now,radio/television,7882,... < 100 DM,4 <= ... < 7 years,2,male : single,guarantor,...,building society savings agreement/life insurance,45,none,for free,1,skilled employee/official,2,no,yes,1
4,... < 100 DM,24,delay in paying off in the past,car (new),4870,... < 100 DM,1 <= ... < 4 years,3,male : single,none,...,unknown/no property,53,none,for free,2,skilled employee/official,2,no,yes,0


## Explore and preprocess

### Subtask:
Explore the dataset to understand its structure and characteristics. Preprocess the data by handling missing values, encoding categorical features, and scaling numerical features.


**Reasoning**:
Get column names and data types, check for missing values, identify categorical and numerical columns, encode categorical features, scale numerical features, and concatenate the processed data.



In [2]:
print("Column names and data types:")
df.info()  # info() prints its report directly and returns None, so no print() is needed

print("\nMissing values count:")
print(df.isnull().sum())

categorical_cols = df.select_dtypes(include=['object']).columns
# Note: the integer target `credit_risk` is selected here too, so it is scaled below
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

print("\nCategorical columns:", list(categorical_cols))
print("Numerical columns:", list(numerical_cols))

# One-hot encode categorical columns
df_categorical_encoded = pd.get_dummies(df[categorical_cols], drop_first=True)

# Scale numerical columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_numerical_scaled = pd.DataFrame(scaler.fit_transform(df[numerical_cols]), columns=numerical_cols)

# Concatenate processed features
df_processed = pd.concat([df_numerical_scaled, df_categorical_encoded], axis=1)

display(df_processed.head())

Column names and data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   status                   1000 non-null   object
 1   duration                 1000 non-null   int64 
 2   credit_history           1000 non-null   object
 3   purpose                  1000 non-null   object
 4   amount                   1000 non-null   int64 
 5   savings                  1000 non-null   object
 6   employment_duration      1000 non-null   object
 7   installment_rate         1000 non-null   int64 
 8   personal_status_sex      1000 non-null   object
 9   other_debtors            1000 non-null   object
 10  present_residence        1000 non-null   int64 
 11  property                 1000 non-null   object
 12  age                      1000 non-null   int64 
 13  other_installment_plans  1000 non-null   object
 14  housing     

Unnamed: 0,duration,amount,installment_rate,present_residence,age,number_credits,people_liable,credit_risk,status_... >= 200 DM / salary for at least 1 year,status_0 <= ... < 200 DM,...,property_unknown/no property,other_installment_plans_none,other_installment_plans_stores,housing_own,housing_rent,job_skilled employee/official,job_unemployed/unskilled - non-resident,job_unskilled - resident,telephone_yes,foreign_worker_yes
0,-1.236478,-0.745131,0.918477,1.046987,2.766456,1.027079,-0.42829,0.654654,False,False,...,False,True,False,True,False,True,False,False,True,True
1,2.248194,0.949817,-0.870183,-0.765977,-1.191404,-0.704926,-0.42829,-1.527525,False,True,...,False,True,False,True,False,True,False,False,False,True
2,-0.738668,-0.416562,-0.870183,0.140505,1.183312,-0.704926,2.334869,0.654654,False,False,...,False,True,False,True,False,False,False,True,False,True
3,1.750384,1.634247,-0.870183,1.046987,0.831502,-0.704926,2.334869,0.654654,False,False,...,False,True,False,False,False,True,False,False,False,True
4,0.256953,0.566664,0.024147,1.046987,1.535122,1.027079,2.334869,-1.527525,False,False,...,True,True,False,False,False,True,False,False,False,True
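
In the output above, `credit_risk` is standardized along with the features because it is an integer column. The following is a minimal sketch of one way to avoid that, by setting the target aside before choosing which columns to scale. The frame `toy` and its values are made up for illustration and stand in for the real dataset; the names `target_col`, `X_toy`, and `y_toy` are likewise hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for GermanCredit.csv (values are made up).
toy = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24],
    "amount": [1169, 5951, 2096, 7882, 4870],
    "housing": ["own", "own", "own", "for free", "for free"],
    "credit_risk": [1, 0, 1, 1, 0],
})

target_col = "credit_risk"
y_toy = toy[target_col]                       # keep the 0/1 labels untouched
features = toy.drop(columns=[target_col])     # only features are transformed

num_cols = features.select_dtypes(include=["number"]).columns
cat_cols = features.select_dtypes(exclude=["number"]).columns

# One-hot encode the categorical features and standardize the numerical ones.
encoded = pd.get_dummies(features[cat_cols], drop_first=True)
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features[num_cols]),
    columns=num_cols,
    index=features.index,
)
X_toy = pd.concat([scaled, encoded], axis=1)

print(X_toy.columns.tolist())
print(sorted(y_toy.unique()))
```

With the target kept out of the scaler, it can be passed to a classifier directly, without any conversion back to class labels.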


## Split the data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
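Separate the features from the `credit_risk` target, then hold out 20% of the rows as a test set, fixing `random_state` so the split is reproducible.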



In [3]:
from sklearn.model_selection import train_test_split

X = df_processed.drop('credit_risk', axis=1)
y = df_processed['credit_risk']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (800, 48)
Testing set shape: (200, 48)
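
The split above draws rows at random, and the `StandardScaler` in the preprocessing step was fit on all 1000 rows before splitting. As a hedged sketch of two common refinements, the example below stratifies the split on the target so both sets keep similar class proportions, and fits the scaler on the training rows only. It uses synthetic data from `make_classification` rather than the credit dataset; `X_demo`, `y_demo`, and `demo_scaler` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the German credit dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.3, 0.7], random_state=42
)

# stratify=y_demo keeps the class proportions similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both subsets,
# so no statistics from the test rows influence the transformation.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```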


## Train a classification model

### Subtask:
Choose a suitable classification algorithm (e.g., Logistic Regression, Random Forest, or Gradient Boosting) and train it on the training data.


**Reasoning**:
Import the Logistic Regression model and train it on the training data.



In [4]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.

**Reasoning**:
The error indicates that the target `y_train` is continuous, while Logistic Regression expects discrete classes. The cause is in the preprocessing step: `credit_risk` is an integer column, so it was standardized together with the numerical features, and its original 0/1 labels now appear as approximately -1.53 and 0.65. Rounding these values would produce the labels -2 and 1, so I will instead threshold at zero, which maps the negative value back to 0 and the positive value back to 1.



In [5]:
# Standardization preserves order, so the original class 1 is the positive scaled value
y_train = (y_train > 0).astype(int)
y_test = (y_test > 0).astype(int)

model = LogisticRegression()
model.fit(X_train, y_train)
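
An alternative way to organize the preprocessing and training steps is to bundle them in a single scikit-learn `Pipeline`, so the encoder and scaler are fit only on whatever data is passed to `fit`. The sketch below is illustrative rather than the notebook's method: the frame `toy` and its values are made up, and the names `preprocess` and `clf` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for GermanCredit.csv (values are made up).
toy = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36, 9, 18],
    "amount": [1169, 5951, 2096, 7882, 4870, 9055, 1200, 3000],
    "housing": ["own", "own", "own", "for free", "for free", "rent", "own", "rent"],
    "credit_risk": [1, 0, 1, 1, 0, 0, 1, 1],
})
X_toy = toy.drop(columns=["credit_risk"])
y_toy = toy["credit_risk"]

# Scale numerical columns and one-hot encode categorical ones in one step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["duration", "amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["housing"]),
])

# The pipeline applies preprocessing and then fits the classifier.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_toy, y_toy)

predictions = clf.predict(X_toy)
print(predictions)
```

Because the target never passes through the transformer, it stays as 0/1 labels and the classifier accepts it as is.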

## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics such as accuracy, precision, recall, and F1-score.


**Reasoning**:
Predict on the test set and compute accuracy, precision, recall, and F1-score with scikit-learn. With the default `pos_label=1`, precision, recall, and F1 describe the class labeled 1.



In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

Accuracy: 0.79
Precision: 0.8235294117647058
Recall: 0.8936170212765957
F1-score: 0.8571428571428571
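
The four numbers above can be unpacked into a confusion matrix. The sketch below searches for the counts of true positives, false positives, false negatives, and true negatives that reproduce the reported accuracy, precision, and recall on 200 test rows, assuming class 1 is the positive class (scikit-learn's default). The function name `find_confusion_counts` and the `reported_*` variables are illustrative.

```python
# Reported test-set metrics, copied from the output above, and the test-set size.
n_test = 200
reported_accuracy = 0.79
reported_precision = 0.8235294117647058
reported_recall = 0.8936170212765957


def find_confusion_counts(n, acc, prec, rec, tol=1e-9):
    """Return every (tp, fp, fn, tn) over n samples that matches the metrics."""
    matches = []
    for tp in range(1, n + 1):
        for fp in range(0, n - tp + 1):
            if abs(tp / (tp + fp) - prec) > tol:
                continue
            for fn in range(0, n - tp - fp + 1):
                tn = n - tp - fp - fn
                recall_ok = abs(tp / (tp + fn) - rec) <= tol
                accuracy_ok = abs((tp + tn) / n - acc) <= tol
                if recall_ok and accuracy_ok:
                    matches.append((tp, fp, fn, tn))
    return matches


matches = find_confusion_counts(
    n_test, reported_accuracy, reported_precision, reported_recall
)
print(matches)  # [(126, 27, 15, 32)]

tp, fp, fn, tn = matches[0]
print("Accuracy of always predicting class 1:", (tp + fn) / n_test)  # 0.705
print("Recall for class 0:", tn / (tn + fp))
```

Under these counts, a model that always predicts class 1 would already reach 141/200 = 0.705 accuracy, and only 32 of the 59 class-0 test rows are classified correctly, so the 0.79 accuracy is best read alongside the per-class results.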


## Summary:

### Data Analysis Key Findings

*   The dataset contains 1000 entries and 21 columns, with no missing values.
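*   The dataset consists of 8 numerical columns (including the target `credit_risk`) and 13 categorical columns.
*   Categorical features were one-hot encoded and numerical features were standardized. The target was standardized along with them and had to be converted back to class labels before training.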
*   The dataset was split into training (800 samples) and testing (200 samples) sets.
*   A Logistic Regression model was trained for credit risk classification.
*   The model achieved an accuracy of 0.79 on the test set.
*   The model showed a precision of 0.82 and a recall of 0.89 on the test set.
*   The F1-score for the model on the test set was 0.86.

### Insights or Next Steps

*   Explore other classification algorithms and compare their performance metrics to the Logistic Regression model.
*   Investigate feature importance to identify the most influential factors in credit risk prediction.
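
Both next steps can be sketched briefly. The example below compares three classifiers with 5-fold cross-validated F1 and then ranks features by the magnitude of the Logistic Regression coefficients. It runs on synthetic data from `make_classification`, not on the credit dataset, so the scores say nothing about this task; the names `candidates`, `scores`, and `ranking` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the German credit dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Compare the models on the same folds using the mean cross-validated F1.
scores = {}
for name, estimator in candidates.items():
    cv_f1 = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1")
    scores[name] = cv_f1.mean()
    print(f"{name}: mean F1 = {scores[name]:.3f}")

# Rank features by absolute coefficient size. This is only a rough guide and
# is meaningful only when the features are on comparable scales.
logreg = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
ranking = np.argsort(-np.abs(logreg.coef_[0]))
print("Feature indices ranked by |coefficient|:", ranking.tolist())
```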
