![WMLE LOGOS](https://github.com/sanjayksau/wmle2024/blob/main/logo3.png?raw=true)

#XGBoost Classifier
In this workbook, we will use XGBoost, a powerful machine learning algorithm, to predict credit risk using the German Credit Dataset. The task is to classify whether a credit applicant is a "Good Credit" or a "Bad Credit" based on various financial and personal attributes.
https://xgboost.readthedocs.io/en/latest/tutorials/model.html

https://www.nvidia.com/en-in/glossary/xgboost/


##Install Required Libraries

In [None]:
#install xgboost
!pip install xgboost

In [None]:
import pandas as pd
import xgboost as xgb

#suppress warnings
#import warnings
#warnings.filterwarnings('ignore')

##Load the dataset
The dataset we will use is the German Credit Dataset (Statlog), hosted in the UCI Machine Learning Repository at the URL used in the code below. It contains data about applicants who applied for credit, along with information about their financial status.

In [None]:
# Load the German Credit dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"

# Define column names for the dataset
columns = [
    'Status of existing checking account', 'Duration in month', 'Credit history',
    'Purpose', 'Credit amount', 'Savings account/bonds', 'Present employment since',
    'Installment rate in percentage of disposable income', 'Personal status and sex',
    'Other debtors / guarantors', 'Present residence since', 'Property',
    'Age in years', 'Other installment plans', 'Housing', 'Number of existing credits at this bank',
    'Job', 'Number of people being liable to provide maintenance for', 'Telephone',
    'foreign worker', 'Creditability'  # The target column (1: Good, 2: Bad)
]

#TODO: Read the dataset into a pandas DataFrame named df, using the meaningful column names defined above.
#Hint: the file is space-separated and has no header row, so pass the columns list as names.

# TODO: Display the first few rows of the dataset
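
As a rough sketch of the loading step, the snippet below reads a space-separated, header-less table with `pd.read_csv`. To stay runnable offline it parses two illustrative rows from an in-memory string instead of the real URL, and the short column names in `cols_demo` are hypothetical stand-ins for the full `columns` list above.

```python
import io

import pandas as pd

# Two illustrative rows in the same layout as the real file:
# space-separated values with no header line (not the actual dataset).
sample = io.StringIO("A11 6 1169 1\nA12 48 5951 2\n")

# Hypothetical short names, used only for this sketch.
cols_demo = ["checking_status", "duration_months", "credit_amount", "Creditability"]

# sep=" " splits on single spaces, header=None says the file has no header row,
# and names= supplies the column labels.
df_demo = pd.read_csv(sample, sep=" ", header=None, names=cols_demo)

print(df_demo.head())
```

For the real data, the same call would take `url` in place of `sample` and `columns` in place of `cols_demo`.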


##Data Preprocessing
Before we train our model, we need to prepare the data. This involves converting categorical data into numerical values and preparing the target labels for binary classification (1 for "Bad Credit" and 0 for "Good Credit").

In [None]:
#TODO: Display DataFrame info


In [None]:
from sklearn.preprocessing import LabelEncoder

#TODO: Convert target labels to binary (0 for Good Credit, 1 for Bad Credit)
#Assign map to df['Creditability']

# Preprocess categorical variables using LabelEncoder
# Convert categorical features into numerical values
for column in df.select_dtypes(include=['object']).columns:
    #TODO: fit and transform df[column] using LabelEncoder()

#TODO: Display the first few rows after preprocessing
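
The two preprocessing steps can be sketched on a small made-up DataFrame: remapping the raw target labels (1 = Good, 2 = Bad) to 0/1, and turning string categories into integer codes with `LabelEncoder`. The frame `df_demo` and its values are invented for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up miniature frame: one categorical column, one numeric column, raw target.
df_demo = pd.DataFrame({
    "checking_status": pd.Series(["A11", "A12", "A14", "A11"], dtype="object"),
    "duration_months": [6, 48, 12, 42],
    "Creditability": [1, 2, 1, 2],
})

# Map raw labels to binary: 1 (Good) -> 0, 2 (Bad) -> 1.
df_demo["Creditability"] = df_demo["Creditability"].map({1: 0, 2: 1})

# Encode every object-typed column as integer codes (classes are sorted alphabetically).
for column in df_demo.select_dtypes(include=["object"]).columns:
    df_demo[column] = LabelEncoder().fit_transform(df_demo[column])

print(df_demo)
```

Label encoding imposes an arbitrary order on the categories. Tree-based models such as XGBoost usually tolerate this, but it is a design choice worth keeping in mind.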


##Split the Dataset
We will split the dataset into training and test sets to evaluate the model’s performance.

In [None]:
from sklearn.model_selection import train_test_split

# TODO: Split df into features (X) and target (y), drop columns=['Creditability']


#TODO: Split the data into training and test sets

#Display the shapes of the train and test sets
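
A minimal sketch of the split on synthetic arrays is shown here. The 80/20 ratio, `random_state=42`, and the `stratify` argument are assumptions made for this example, not requirements of the exercise; `stratify` keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 samples, 2 features, imbalanced binary target.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 14 + [1] * 6)

# Hold out 20% of the rows for testing; stratify on the target so both
# splits keep roughly the same share of each class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```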


##Train the XGBoost Model


In [None]:
import xgboost as xgb

# Convert the train and test sets into DMatrix, the data structure XGBoost
# uses internally to represent datasets efficiently
train_data = xgb.DMatrix(X_train, label=y_train)
test_data = xgb.DMatrix(X_test, label=y_test)

# Define the parameters for the XGBoost model
#https://xgboost.readthedocs.io/en/latest/parameter.html

# 'binary:logistic': logistic regression for binary classification; outputs a probability
# 'logloss': negative log-likelihood
params = {
    #'objective': 'binary:hinge',
    'objective': 'binary:logistic',  # logistic regression for binary classification; outputs a probability
    'max_depth': 4,  # Maximum depth of each tree
    'learning_rate': 0.1,  # Learning rate
    'eval_metric': 'logloss', # Logarithmic loss as the evaluation metric
    'seed': 42  # Random seed for reproducibility
}

# Train the XGBoost model with 100 boosting rounds
# xgb.train is the native-API counterpart of fit() in the scikit-learn wrapper
model = xgb.train(params, train_data, num_boost_round=100)

print("Model training done.")

#TODO: Try training with 'objective': 'binary:hinge' and 'eval_metric': 'mae' (mean absolute error).
#Note: binary:hinge predicts 0/1 class labels rather than probabilities.
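
To make the `logloss` metric concrete, here is a small pure-Python sketch of binary log loss, the average negative log-likelihood of the true labels under the predicted probabilities. The helper name `binary_logloss` and the sample values are hypothetical and not part of XGBoost.

```python
import math


def binary_logloss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities of the positive class.
loss_good = binary_logloss([1, 0, 1], [0.9, 0.2, 0.6])
# The same labels with a confident but wrong first prediction.
loss_bad = binary_logloss([1, 0, 1], [0.1, 0.2, 0.6])

print(round(loss_good, 4), round(loss_bad, 4))
```

Confident wrong predictions are penalized heavily, which is why the second loss is much larger than the first.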

##Make Predictions and Evaluate the Model
After training, we will use the test data to evaluate the model's performance by making predictions and calculating accuracy, confusion matrix, and classification report.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

#TODO: Make predictions y_pred_prob on the test data using model.predict

#TODO: Convert probabilities into binary predictions (0 if <=0.5 or 1 if >0.5)

#TODO: Evaluate the model's accuracy using y_test, y_pred
print("Accuracy: ", accuracy_score(y_test, y_pred))

#TODO: Confusion Matrix: Rows: Ground Truth, Columns: Predictions
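
The thresholding and evaluation steps can be sketched on made-up values as follows. The arrays below are invented for illustration; with the real model, the probabilities would come from `model.predict(test_data)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predicted probabilities of the positive class.
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.1, 0.6, 0.8, 0.4, 0.5, 0.9])

# Probabilities strictly above 0.5 become class 1, everything else class 0.
y_pred_demo = (y_prob_demo > 0.5).astype(int)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
# Rows are ground truth, columns are predictions: [[TN, FP], [FN, TP]].
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

print(y_pred_demo)
print(acc_demo)
print(cm_demo)
```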


##Evaluation Metrics:
- What do precision, recall, and F1-score tell about the model’s performance?

Accuracy measures the proportion of correct predictions (both true positives and true negatives).

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

Precision is the proportion of positive predictions that are actually correct. It is important when false positives are costly (e.g. fraud detection, spam filtering).
$$Precision = \frac{TP}{TP+FP}$$


Recall is the proportion of actual positives that the model correctly identifies. It is important when false negatives are costly (e.g. identifying diseases).
$$Recall = \frac{TP}{TP+FN}$$


F1-Score is the harmonic mean of precision and recall. It balances the two, which is especially useful when the classes are imbalanced.
$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
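
As a worked example of these metrics, the sketch below computes them from a set of made-up confusion-matrix counts. The numbers are purely illustrative and do not come from the credit model.

```python
# Made-up confusion-matrix counts for a 200-sample test set.
TP, FP, FN, TN = 30, 10, 30, 130

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy looks healthy at 0.80, yet the model finds only half of the actual positives (recall 0.50). That gap is why accuracy alone can mislead on imbalanced data.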

##Optional Tasks
- Hyperparameter Tuning: Experiment with different XGBoost parameters (max_depth, learning_rate, etc.) to improve the model's performance.
- Feature Engineering: Try adding or transforming features to see if it improves the model's accuracy.
- Cross-validation: Implement cross-validation to get a more reliable estimate of the model's performance.

##Hyperparameter Tuning using GridSearchCV
We will use GridSearchCV to search for the best hyperparameters for the XGBoost model. First, we define the parameter grid we want to search over, and then we use GridSearchCV to find the best combination.

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Create a base model
# use_label_encoder is deprecated and ignored by recent XGBoost releases, so it is omitted here
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss')

# Note: tuning can lower performance if the best settings lie outside the grid,
# so a wider range may be needed. The grid below is deliberately small and only
# illustrates how to search over a specified parameter range.

# Define the parameter grid for GridSearch
param_grid = {
    'max_depth': [3, 4, 5],        # Maximum depth of the trees
    'learning_rate': [0.01, 0.1, 0.2],  # Learning rate
    'n_estimators': [50, 100, 200],     # Number of boosting rounds
}

# Instantiate the GridSearchCV object with 3-fold cross-validation
#https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, verbose=1) #verbose=1,2,3

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Output the best parameters found by GridSearch
print(f"Best parameters: {grid_search.best_params_}")

# Get the best model from grid search
best_model = grid_search.best_estimator_
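
To get a feel for the cost of the search above, this small sketch enumerates the grid with `itertools.product` and counts the model fits. The names `param_grid_demo` and `cv_folds` are local to this example and mirror the grid and `cv=3` used above.

```python
from itertools import product

# Same grid as above, copied so the sketch is self-contained.
param_grid_demo = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
}
cv_folds = 3

# Every combination of one value per parameter is a candidate.
candidates = [
    dict(zip(param_grid_demo.keys(), values))
    for values in product(*param_grid_demo.values())
]

# Each candidate is trained once per fold during cross-validation.
cv_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", cv_fits, "cross-validation fits")
```

With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set, so the total grows quickly as more values or parameters are added.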


##Re-evaluate the tuned model
After obtaining the best model from GridSearchCV, we can use it to make predictions and evaluate its performance just like we did earlier.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#TODO: Make predictions with the tuned model: best_model.predict

# Evaluate the tuned model's accuracy
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Accuracy after tuning: {accuracy_tuned}")

# Confusion Matrix for the tuned model
conf_matrix_tuned = confusion_matrix(y_test, y_pred_tuned)
print("Confusion Matrix (Tuned Model):")
print(conf_matrix_tuned)

# Classification Report: Provides precision, recall, and F1-score for both classes
# Classification Report for the tuned model
#class_report_tuned = classification_report(y_test, y_pred_tuned)
#print("Classification Report (Tuned Model):")
#print(class_report_tuned)


#TODO:
- Change parameters such as the objective (loss) and eval_metric by referring to the documentation for the available options: https://xgboost.readthedocs.io/en/stable/parameter.html

- Build a model with both xgb.train and XGBClassifier, then train and test each one
(refer to the xgb and XGBClassifier documentation).
- Change the verbosity level in GridSearchCV to a higher value (2 or more) and observe the messages.
- Prepare a synthetic dataset using make_classification with two features and two classes. Train and evaluate an XGBoost model on it and compare the results with a DecisionTree.