![WMLE LOGOS](https://github.com/sanjayksau/wmle2024/blob/main/logo3.png?raw=true)



#Machine Learning with German Credit Dataset
![ML Pipeline](https://github.com/sanjayksau/wmle2024/blob/main/ml_pipeline.png?raw=true)



##Dataset Characteristics
Dataset: Statlog (German Credit Data), https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data


Dataset Characteristics: Multivariate

Subject Area: Social Science

Associated Tasks: Classification

Feature Type: Categorical, Integer

Num of Instances: 1000

Num of Features: 20

# Attribute Information (Dataset Details):
- Attribute 1: (qualitative) Status of existing checking account
- Attribute 2: (numerical) Duration in month
- Attribute 3: (qualitative) Credit history
- Attribute 4: (qualitative) Purpose
- Attribute 5: (numerical) Credit amount
- Attribute 6: (qualitative) Savings account/bonds
- Attribute 7: (qualitative) Present employment since
- Attribute 8: (numerical) Installment rate in percentage of disposable income
- Attribute 9: (qualitative) Personal status and sex
- Attribute 10: (qualitative) Other debtors/guarantors
- Attribute 11: (numerical) Present residence since
- Attribute 12: (qualitative) Property
- Attribute 13: (numerical) Age in years
- Attribute 14: (qualitative) Other installment plans
- Attribute 15: (qualitative) Housing
- Attribute 16: (numerical) Number of existing credits at this bank
- Attribute 17: (qualitative) Job
- Attribute 18: (numerical) Number of people being liable to provide maintenance for
- Attribute 19: (qualitative) Telephone
- Attribute 20: (qualitative) Foreign worker


##Load the German Credit Dataset

###Upload Dataset from  https://archive.ics.uci.edu/static/public/144/data.csv

In [None]:
#TODO: access dataset from https://archive.ics.uci.edu/static/public/144/data.csv
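
One possible way to complete this step is sketched below. It assumes the CSV has a header row and that the target column is named `class`; check both against the downloaded file. The helper `split_features_target` is a hypothetical name, and the three made-up rows stand in for the real file so that the sketch runs without network access.

```python
import io

import pandas as pd

# URL taken from this notebook; reading it requires network access.
DATA_URL = "https://archive.ics.uci.edu/static/public/144/data.csv"


def split_features_target(df, target="class"):
    """Split a DataFrame into features X and target y.

    Assumption: the target column is called `class`.
    """
    X = df.drop(columns=[target])
    y = df[target]
    return X, y


# Made-up illustrative rows, used only so the sketch runs offline.
sample_csv = io.StringIO(
    "Attribute1,Attribute2,Attribute5,class\n"
    "A11,12,1000,1\n"
    "A12,24,2500,2\n"
    "A14,36,4000,1\n"
)

df_demo = pd.read_csv(sample_csv)  # for the real data: pd.read_csv(DATA_URL)
X_demo, y_demo = split_features_target(df_demo)
print(X_demo.shape, y_demo.shape)
```

With the real file, replacing `sample_csv` by `DATA_URL` should be the only change needed, provided the column-name assumption holds.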


### Upload Dataset from Google Drive

In [None]:
#TODO: Upload dataset from Google Drive if available
#/content/drive/MyDrive/WMLE2024/german_data.csv: change this appropriately


### Upload data available on local system

In [None]:
#Upload the dataset from the local system to Colab and read it into a DataFrame.


##Install ucimlrepo package

In [None]:
#TODO:  Install ucimlrepo package to fetch UCI datasets easily


In [None]:
#TODO: Fetch dataset here as suggested on https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data


In [None]:
#default column names
print(X.columns)

#Renaming columns with these relatable names.
new_feature_names = ['Status_of_existing_checking_account',
 'Duration_in_month',
 'Credit_history',
 'Purpose',
 'Credit_amount',
 'Savings_account_or_bonds',
 'Present_employment_since',
 'Installment_rate_in_percentage_of_disposable_income',
 'Personal_status_and_sex',
 'Other_debtors_or_guarantors',
 'Present_residence_since',
 'Property',
 'Age_in_years',
 'Other_installment_plans',
 'Housing',
 'Number_of_existing_credits_at_this_bank',
 'Job',
 'Number_of_people_being_liable_to_provide_maintenance_for',
 'Telephone',
 'foreign_worker']

#Update and Verify the updated names
X.columns = new_feature_names
print()
print(X.columns)

##Exploratory Data Analysis (EDA)
In this step, we will perform some basic exploratory data analysis to understand the dataset.

###Visualizing the distribution of Numeric Variables

In [None]:
#TODO: import matplotlib.pyplot as plt

#Identify all numeric columns
num_cols = X.select_dtypes(include=['int64', 'float64']).columns
subset = X[num_cols]

# Plot histograms of numeric features
subset.hist(bins=15, figsize=(15, 10)) #figsize: width and height in inches
plt.show()

###Age Distribution

In [None]:
# Plot Age distribution (Attribute: 'Age_in_years' represents Age in the dataset)
#TODO: import the seaborn library as sns

sns.histplot(X['Age_in_years'], kde=True)
plt.title('Age Distribution')
plt.show()

###Credit Amount Distribution


In [None]:
#TODO: Plot Credit Amount distribution (Attribute: 'Credit_amount' represents Credit Amount)


### Correlation Heatmap of Numerical Features

In [None]:
# Create a correlation matrix of numeric features
plt.figure(figsize=(10, 8))
sns.heatmap(subset.corr(), annot=True) #annot=True writes the correlation value in each cell of the heatmap
plt.title('Correlation Heatmap')
plt.show()

##Data Preprocessing

Before building a machine learning model, it's important to preprocess the data. We will use a pipeline for the transformation.

###Categorical to Numerical Conversion and Feature Scaling in the Dataset


In [None]:
#TODO: Display Dataset information again to verify the different feature datatypes


In [None]:
# Import necessary preprocessing tools
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler


# Define ColumnTransformer for handling different types of variables
#Whether each categorical/qualitative feature should be ordinal or one-hot encoded needs more careful consideration.
#Scaling is included only for demonstration; a decision tree does not require it.
HybridTransformer = ColumnTransformer([
    ("ordinal", OrdinalEncoder(), ["Status_of_existing_checking_account", "Credit_history", "Savings_account_or_bonds", "Present_employment_since",
                                   "Other_debtors_or_guarantors", "Property", "Other_installment_plans", "Housing", "Job", "Telephone", "foreign_worker"]),
    ("onehot", OneHotEncoder(), ["Purpose", "Personal_status_and_sex"])
], remainder="passthrough")

# Build pipeline with ColumnTransformer and StandardScaler
pipe = Pipeline([
    ("transforming", HybridTransformer),
    ("scale", StandardScaler())
])


#TODO: Transform the data using the pipeline and assign it to X_transformed


#TODO: Display the shape of transformed X

# TODO: Display pipe


# TODO: Display new Feature names


###Encode the Target Variable to 0, 1


In [None]:
# Label encode the target variable (class 0 and 1)
from sklearn.preprocessing import LabelEncoder

y_encoded = LabelEncoder().fit_transform(y.values.ravel()) #flatten the target to a 1-D numpy array before encoding

#TODO Check the shape of the target variable
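
As a small illustration of what the encoding does, the sketch below uses a made-up target column, assuming the target is stored with the labels 1 and 2. The names `y_toy` and `encoder` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up target column with the two class labels 1 and 2.
y_toy = pd.DataFrame({"class": [1, 2, 1, 1, 2]})

encoder = LabelEncoder()
# ravel() flattens the (n, 1) column into the 1-D array LabelEncoder expects.
y_toy_encoded = encoder.fit_transform(y_toy.values.ravel())

print(encoder.classes_)     # original labels, in the order mapped to 0, 1
print(y_toy_encoded.shape)  # 1-D shape of the encoded target
```

Keeping the fitted `encoder` object around (instead of calling `LabelEncoder()` inline) makes it possible to check which original label was mapped to 0 and which to 1.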


##Model Building (Decision Tree)
- Note: a decision tree is used here, but several other classifiers are available, and you can substitute any of them depending on your choice.

In [None]:
# Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split


#TODO: Split the data, Build the Model, Train and Predict
# Split the data into train and test sets

# Initialize and train the decision tree classifier, max_depth=4


# Predict on test data(X_test)


# Check the accuracy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("Model accuracy:", accuracy_score(y_test, y_pred))

#TODO: Compute: precision, recall and f1_score.
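
The split, train, predict and evaluate steps can be sketched as follows. To stay self-contained, the sketch uses synthetic data from `make_classification` instead of `X_transformed` and `y_encoded`; the 70/30 class balance, the 80/20 split and all `*_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 1000 rows, 20 features, roughly 70/30 class balance.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)

# Hold out 20% for testing; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Initialize and train a shallow decision tree.
clf_demo = DecisionTreeClassifier(max_depth=4, random_state=42)
clf_demo.fit(X_tr, y_tr)

# Predict on the held-out data.
y_pred_demo = clf_demo.predict(X_te)

# Collect the four metrics in one place.
metrics_demo = {
    "accuracy": accuracy_score(y_te, y_pred_demo),
    "precision": precision_score(y_te, y_pred_demo, zero_division=0),
    "recall": recall_score(y_te, y_pred_demo, zero_division=0),
    "f1": f1_score(y_te, y_pred_demo, zero_division=0),
}
for name, value in metrics_demo.items():
    print(f"{name}: {value:.3f}")
```

In the notebook itself, `X_syn` and `y_syn` would be replaced by `X_transformed` and `y_encoded`.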


##Visualizing the Decision Tree


In [None]:
# Import necessary function to plot the decision tree
from sklearn.tree import plot_tree

# Set figure size for better readability
plt.figure(figsize=(20, 10))

#TODO: Plot the decision tree, max_depth=2
plot_tree(clf, max_depth=2, feature_names=pipe[0].get_feature_names_out(), class_names=['Good', 'Bad'], proportion=True, filled=True, rounded=True)
plt.show()

##TODO

- Change the split criterion to 'entropy', retrain and generate tree again.

- The dataset website defines a cost matrix: labeling a 'Good' customer as 'Bad' costs 1, whereas labeling a 'Bad' customer as 'Good' costs 5.

- Implement this more aggressive assignment of 'Bad' and observe the changes in the decision tree. Use the following to implement it:
clf = DecisionTreeClassifier(max_depth=4, class_weight={0: 1, 1: 5}) #more favorable towards classifying as bad

- Look at the German Credit Dataset again and verify the ordinal/one-hot encoding assigned to the categorical variables. Reassign if required, then rerun the pipeline, retrain the model, evaluate its performance and visualize the decision tree for the new results.
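
One way to compare the variants suggested in this list is to score each model by its total misclassification cost rather than by accuracy alone. The sketch below does this on synthetic data. The helper `total_cost`, the `*_cs` variable names and the synthetic dataset are hypothetical; the cost values 1 and 5 follow the cost matrix described above, with 0 standing for 'Good' and 1 for 'Bad'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rows = actual class, columns = predicted class (0 = Good, 1 = Bad).
COST_MATRIX = np.array([[0, 1],
                        [5, 0]])


def total_cost(y_true, y_pred, cost=COST_MATRIX):
    """Sum the cost of every prediction according to the cost matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return int((cm * cost).sum())


# One error of each kind: 1*1 + 1*5 = 6
example_cost = total_cost([0, 0, 1, 1], [0, 1, 0, 1])
print("example cost:", example_cost)

# Synthetic stand-in for the transformed German Credit data.
X_cs, y_cs = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=0
)
X_tr_cs, X_te_cs, y_tr_cs, y_te_cs = train_test_split(
    X_cs, y_cs, test_size=0.2, stratify=y_cs, random_state=0
)

# Three variants: default gini, entropy, and entropy with class weights.
variants = {
    "gini": {"criterion": "gini"},
    "entropy": {"criterion": "entropy"},
    "entropy_weighted": {"criterion": "entropy", "class_weight": {0: 1, 1: 5}},
}

costs = {}
for name, params in variants.items():
    model = DecisionTreeClassifier(max_depth=4, random_state=0, **params)
    model.fit(X_tr_cs, y_tr_cs)
    costs[name] = total_cost(y_te_cs, model.predict(X_te_cs))
    print(f"{name}: total cost = {costs[name]}")
```

Whether the weighted tree actually lowers the total cost depends on the data, so the comparison is worth repeating on the real transformed features.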

In [None]:
#TODO: Try plotting with different max_depth parameter values.
#plot_tree(clf, max_depth=2, feature_names=pipe[0].get_feature_names_out(), class_names=['Class 0', 'Class 1'], filled=True, rounded=True)