# Task
Analyze the Statlog (German Credit Data) dataset (https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data) to predict creditworthiness using a random forest model.

## Download the data

### Subtask:
Download the Statlog (German Credit Data) dataset.


**Reasoning**:
Download the dataset and the description file from the provided URLs using curl.



In [1]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data -o german.data
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.names -o german.names

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 79793    0 79793    0     0   166k      0 --:--:-- --:--:-- --:--:--  166k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     9    0     9    0     0     34      0 --:--:-- --:--:-- --:--:--    34
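
The second transfer above received only 9 bytes, which is too small for a dataset description and suggests the server returned a short error body rather than the file (curl's `-f` flag would turn such HTTP errors into a failed command). A minimal sketch of a post-download sanity check follows; the helper name `check_download` and the `min_bytes` threshold are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile


def check_download(path, min_bytes=1024):
    """Return (ok, message) for a downloaded file.

    Hypothetical helper: `min_bytes` is an arbitrary illustrative threshold.
    """
    if not os.path.exists(path):
        return False, f"{path}: file is missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        # Tiny files are usually an error body, not data; show what they contain.
        with open(path, "r", errors="replace") as fh:
            preview = fh.read(80)
        return False, f"{path}: only {size} bytes, content starts with {preview!r}"
    return True, f"{path}: {size} bytes"


# Demonstration on temporary stand-in files (not the real downloads).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "german.data")
    bad = os.path.join(tmp, "german.names")
    with open(good, "w") as fh:
        fh.write("A11 6 A34 A43 1169\n" * 100)  # stand-in rows, 1900 bytes
    with open(bad, "w") as fh:
        fh.write("Not Found")  # a 9-byte stand-in for an error body
    good_ok, good_msg = check_download(good)
    bad_ok, bad_msg = check_download(bad)
    print(good_msg)
    print(bad_msg)
```

Running such a check right after the download cell would stop the notebook before a broken file is parsed.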


## Load the data

### Subtask:
Load the data into a pandas DataFrame and add column names.


**Reasoning**:
Import the pandas library, define column names, load the data into a DataFrame, and display the first few rows and info.



In [2]:
import pandas as pd

column_names = [
    'checking_account_status', 'duration_in_month', 'credit_history', 'purpose',
    'credit_amount', 'savings_account_status', 'employment_status',
    'installment_rate', 'personal_status_and_sex', 'other_debtors_or_guarantors',
    'residence_since', 'property', 'age_in_years', 'other_installment_plans',
    'housing', 'number_of_existing_credits', 'job', 'number_of_people_liable',
    'telephone', 'foreign_worker', 'creditworthiness'
]

df = pd.read_csv('german.data', sep=' ', names=column_names)

display(df.head())
df.info()

|   | checking_account_status | duration_in_month | credit_history | purpose | credit_amount | savings_account_status | employment_status | installment_rate | personal_status_and_sex | other_debtors_or_guarantors | ... | property | age_in_years | other_installment_plans | housing | number_of_existing_credits | job | number_of_people_liable | telephone | foreign_worker | creditworthiness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   checking_account_status      1000 non-null   object
 1   duration_in_month            1000 non-null   int64 
 2   credit_history               1000 non-null   object
 3   purpose                      1000 non-null   object
 4   credit_amount                1000 non-null   int64 
 5   savings_account_status       1000 non-null   object
 6   employment_status            1000 non-null   object
 7   installment_rate             1000 non-null   int64 
 8   personal_status_and_sex      1000 non-null   object
 9   other_debtors_or_guarantors  1000 non-null   object
 10  residence_since              1000 non-null   int64 
 11  property                     1000 non-null   object
 12  age_in_years                 1000 non-null   int64 
 13  other_installment_plans      1000 ...
[output truncated]
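
In this dataset's documentation the `creditworthiness` column is coded 1 for good credit and 2 for bad credit. A minimal sketch of decoding it and checking class balance follows, using only the five target values visible in the `df.head()` output above; the `label_names` mapping and the `is_bad` series are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Target values from the five rows displayed by df.head() above.
target = pd.Series([1, 2, 1, 1, 2], name="creditworthiness")

# Coding per the dataset documentation: 1 = good credit, 2 = bad credit.
label_names = {1: "good", 2: "bad"}
decoded = target.map(label_names)

# Class counts; on the full DataFrame the equivalent call would be
# df["creditworthiness"].map(label_names).value_counts().
counts = decoded.value_counts()
print(counts)

# Optional 0/1 recoding that makes "bad credit" the positive class. This is an
# assumption about which class matters most, not something the notebook does.
is_bad = (target == 2).astype(int)
print(is_bad.tolist())
```

Checking the class counts before modeling makes it easier to judge later whether an accuracy figure is meaningfully above the majority-class rate.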

## Preprocess the data

### Subtask:
Convert categorical variables to numerical using one-hot encoding and split the data into training and testing sets.


**Reasoning**:
Identify categorical columns, apply one-hot encoding, separate features and target, and split the data into training and testing sets.



In [3]:
from sklearn.model_selection import train_test_split

categorical_cols = df.select_dtypes(include='object').columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

X = df_encoded.drop('creditworthiness', axis=1)
y = df_encoded['creditworthiness']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (800, 48)
Shape of X_test: (200, 48)
Shape of y_train: (800,)
Shape of y_test: (200,)
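
The split above is purely random, so the share of each class can differ between the training and test sets. A hedged sketch of a stratified alternative using the `stratify` argument of `train_test_split` follows, shown on synthetic labels with a 70/30 class ratio. The synthetic arrays are assumptions for illustration; applying this to the notebook's `X` and `y` would change the reported metrics.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, labels 1 (70%) and 2 (30%).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 700 + [2] * 300)

# stratify=y_demo keeps the class proportions the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share_bad = float(np.mean(y_tr == 2))
test_share_bad = float(np.mean(y_te == 2))
print(f"Share of class 2 in train: {train_share_bad:.3f}")
print(f"Share of class 2 in test:  {test_share_bad:.3f}")
```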


## Train a random forest model

### Subtask:
Train a Random Forest classifier on the training data.


**Reasoning**:
Train a RandomForestClassifier model on the training data.



In [4]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

## Evaluate the model

### Subtask:
Evaluate the model's performance on the testing data using appropriate metrics like accuracy, precision, recall, and F1-score.


**Reasoning**:
Calculate and print the accuracy, precision, recall, and F1-score of the model on the test data.



In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

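# Predict class labels (1 = good credit, 2 = bad credit) for the test set.
y_pred = rf_model.predict(X_test)

# The binary metrics below treat class 1 (good credit) as the positive class,
# which is scikit-learn's default pos_label; it is spelled out here for clarity.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)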

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.7450
Precision: 0.7679
Recall: 0.9149
F1-score: 0.8350
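
The four scores above, together with the test-set size of 200, pin down the underlying confusion matrix. The following sketch searches for integer counts whose metrics, rounded to four decimals, match the printed values. It assumes class 1 (good credit) is the positive class, which is scikit-learn's default `pos_label` for these labels; the function name is illustrative.

```python
def find_confusion_counts(n, accuracy, precision, recall, f1, digits=4):
    """Return every (tp, fp, fn, tn) summing to n whose rounded metrics match."""
    matches = []
    for tp in range(1, n + 1):
        for fp in range(n - tp + 1):
            # Filter on precision first to keep the search small.
            if round(tp / (tp + fp), digits) != precision:
                continue
            for fn in range(n - tp - fp + 1):
                tn = n - tp - fp - fn
                if (round((tp + tn) / n, digits) == accuracy
                        and round(tp / (tp + fn), digits) == recall
                        and round(2 * tp / (2 * tp + fp + fn), digits) == f1):
                    matches.append((tp, fp, fn, tn))
    return matches


matches = find_confusion_counts(200, 0.7450, 0.7679, 0.9149, 0.8350)
print(matches)  # → [(129, 39, 12, 20)]

tp, fp, fn, tn = matches[0]
# Share of bad-credit applicants that the model flags as bad (specificity).
specificity = tn / (tn + fp)
# Accuracy obtained by labeling every test applicant as good credit.
majority_baseline = (tp + fn) / 200
print(f"Specificity: {specificity:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give these counts directly.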


## Summary:

### Data Analysis Key Findings

*   The dataset contains 1000 entries and 21 columns, including a mix of integer and object data types.
*   After one-hot encoding of categorical variables, the feature set for training and testing has 48 columns.
*   The training set consists of 800 samples, and the testing set consists of 200 samples.
*   The trained Random Forest model achieved an accuracy of 0.7450 on the test data.
*   The model's precision on the test data is 0.7679.
*   The model's recall on the test data is 0.9149.
*   The model's F1-score on the test data is 0.8350.

### Insights or Next Steps

*   The high recall (0.9149) shows that the model identifies most creditworthy individuals, but it achieves this partly by predicting "creditworthy" very often. With 200 test samples, the reported scores correspond to 129 true positives, 39 false positives, 12 false negatives, and 20 true negatives, so 39 of the 59 non-creditworthy test applicants were labeled creditworthy. Further analysis could explore the trade-off between precision and recall based on the business requirements.
*   Investigate feature importance from the trained Random Forest model to understand which features are most influential in predicting creditworthiness and potentially refine the model or data collection.
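
One way to explore the precision/recall trade-off mentioned above is to vary the probability cut-off applied to `predict_proba` instead of relying on the default used by `predict`. The sketch below uses synthetic data from `make_classification` as a stand-in; the thresholds and variable names are illustrative assumptions, and on the notebook's data the column for the class of interest would first have to be looked up in `rf_model.classes_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 70% of rows in class 1).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.3, 0.7], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Column of predict_proba that corresponds to class 1.
pos_col = list(model.classes_).index(1)
proba_pos = model.predict_proba(X_te)[:, pos_col]

results = []
for threshold in (0.3, 0.5, 0.7):
    # Predict class 1 only when its estimated probability reaches the cut-off.
    y_hat = (proba_pos >= threshold).astype(int)
    prec = precision_score(y_te, y_hat, zero_division=0)
    rec = recall_score(y_te, y_hat, zero_division=0)
    results.append((threshold, prec, rec))
    print(f"threshold={threshold:.1f}  precision={prec:.3f}  recall={rec:.3f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases; whether the precision gained is worth it depends on the relative cost of the two error types.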


## Feature Importance Analysis

### Subtask:
Analyze the feature importance from the trained Random Forest model.

**Reasoning**:
Extract feature importances from the trained Random Forest model, create a DataFrame for visualization, sort by importance, and display the top features.

In [6]:
importances = rf_model.feature_importances_
feature_names = X_train.columns
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

print("Top 10 Most Important Features:")
display(feature_importance_df.head(10))

Top 10 Most Important Features:


|    | feature | importance |
|---:|---|---:|
| 1 | credit_amount | 0.118808 |
| 4 | age_in_years | 0.099245 |
| 0 | duration_in_month | 0.083883 |
| 9 | checking_account_status_A14 | 0.061039 |
| 2 | installment_rate | 0.043058 |
| 3 | residence_since | 0.037845 |
| 5 | number_of_existing_credits | 0.022982 |
| 13 | credit_history_A34 | 0.022859 |
| 41 | housing_A152 | 0.021538 |
| 40 | other_installment_plans_A143 | 0.021171 |
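
Because one-hot encoding spreads each categorical variable over several dummy columns, a single dummy's importance understates the variable as a whole. A minimal sketch of summing importances back to the original column names follows. The helper `aggregate_importances` is hypothetical, and the demo values are made-up round numbers chosen for illustration, not results from the model above.

```python
from collections import defaultdict


def aggregate_importances(importances, categorical_cols):
    """Sum dummy-column importances under their original categorical column.

    `importances` maps encoded feature names to importance values; names that
    do not derive from a categorical column are kept unchanged.
    """
    totals = defaultdict(float)
    for name, value in importances.items():
        original = name
        for col in categorical_cols:
            # pd.get_dummies names dummy columns "<column>_<level>" by default.
            if name.startswith(col + "_"):
                original = col
                break
        totals[original] += value
    # Sort from most to least important.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up illustrative values (NOT taken from the trained model).
toy_importances = {
    "credit_amount": 0.5,
    "purpose_A41": 0.2,
    "purpose_A42": 0.1,
    "housing_A152": 0.2,
}
grouped = aggregate_importances(toy_importances, ["purpose", "housing"])
print(grouped)
```

In the notebook, the input could be built with `dict(zip(X_train.columns, rf_model.feature_importances_))` and the list of categorical columns taken from `categorical_cols`.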


## Summary:

### Data Analysis Key Findings

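* The features with the highest impurity-based importance are 'credit_amount' (0.119), 'age_in_years' (0.099), 'duration_in_month' (0.084), and the dummy column 'checking_account_status_A14' (0.061), which in the dataset's coding marks applicants with no checking account.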

### Insights or Next Steps

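* The feature importances indicate which encoded columns the forest relies on most, which can guide model refinement or data collection. Impurity-based importances tend to favor continuous and high-cardinality features, so the ranking is worth confirming with permutation importance on the test set.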
* Consider exploring other classification algorithms or hyperparameter tuning to potentially improve model performance.
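
As a hedged illustration of the hyperparameter tuning suggested above, the sketch below runs a small cross-validated grid search over a random forest. The data is synthetic, and the grid values, the `f1` scoring choice, and the fold count are assumptions for demonstration rather than recommended settings for the German Credit data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook this would be X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# A deliberately small grid so the search finishes quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 of the positive class, averaged over the folds
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Tuning on cross-validation folds of the training data keeps the test set untouched for a final, unbiased evaluation.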


## Project Conclusion

This project aimed to predict creditworthiness using the Statlog (German Credit Data) dataset. We followed a standard machine learning pipeline, starting with data acquisition and loading. The dataset contains 1000 entries and 21 columns (20 attributes plus the `creditworthiness` target), with a mix of numerical and categorical data.

To prepare the data for modeling, we applied one-hot encoding to the categorical variables, resulting in a feature set with 48 columns. The data was then split into training (800 samples) and testing (200 samples) sets to evaluate the model's performance on unseen data.

We trained a Random Forest Classifier as our initial model. The evaluation metrics on the test set were as follows:

*   **Accuracy:** 0.7450
*   **Precision:** 0.7679
*   **Recall:** 0.9149
*   **F1-score:** 0.8350

The Random Forest model demonstrated a high recall, identifying about 91% of the creditworthy individuals in the test set. Precision was lower: roughly 23% of the applicants the model predicted to be creditworthy were not.

We also analyzed the feature importance from the trained Random Forest model. The highest-ranked features were 'credit_amount', 'age_in_years', 'duration_in_month', and 'checking_account_status_A14', the dummy column that marks applicants with no checking account. This ranking helps show which factors drive the model's predictions and can inform future data collection or model refinement.

While the Random Forest model provides a starting point, further exploration with other algorithms or hyperparameter tuning could improve performance. In this initial analysis the model favors recall over precision, and its accuracy of 0.7450 should be read against the class imbalance in the data: 700 of the 1000 applicants are labeled creditworthy, so a classifier that approves everyone would already reach roughly 0.70.

In summary, this project demonstrated the application of a Random Forest model to predict creditworthiness using the German Credit Data. The results highlight the importance of features like credit amount and age, and the model provides a baseline for further development.