# **Classification Practical Assessment**

## **Datasets overview and preliminary analysis**

In this lab, we explore two datasets: the **adult dataset** and the **student dataset**. The adult dataset is related to demographics and income levels, while the student dataset focuses on student performance and characteristics. These datasets belong to the social and education domains, respectively. The goal is to analyze and model patterns within the data, such as understanding income brackets or predicting academic outcomes. The datasets include a mix of attribute types, such as numerical (e.g., age, scores) and categorical data (e.g., occupation, gender).

The **adult** and **student** data are first loaded into `DataFrame` data structures from the `adult.csv` and `student.csv` files. When working with data in Python, these packages are very helpful for different tasks:  

1. **`sys`**: This package helps manage Python's environment. It lets you interact with the system, such as reading command-line arguments or exiting a program.  
2. **`numpy` (`np`)**: A powerful library for working with numbers. It makes it easy to perform mathematical operations on large arrays or tables of data.  
3. **`pandas` (`pd`)**: A library for working with `DataFrame` structures. It helps organize, analyze, and clean data efficiently.  
4. **`matplotlib.pyplot` (`plt`)**: A tool for creating graphs and charts to visualize data.  
5. **`seaborn` (`sns`)**: Built on top of `matplotlib`, it simplifies the process of creating beautiful and advanced visualizations.  
6. **`adsa_utils2` (`ads`)**: A custom set of Python helper functions used throughout the assessment labs.

In [None]:
# Suppress all warnings raised during execution to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Import essential libraries
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import ads  # custom helper functions for the assessment labs

### **Adult dataset**

The adult dataset is a collection of data related to **income** and **demographics**, primarily used to understand factors influencing income levels. The dataset is from the socioeconomic domain, focusing on **identifying individuals earning more than $50,000 annually**. This analysis helps in understanding **income disparities**, **labor market trends**, and **demographic characteristics affecting earnings**, making it useful for social research and policy-making.  

The dataset contains 947 rows with a binary class label (`<=50K`, `>50K`) indicating income levels. It includes a diverse set of attributes: **6 numerical** and **8 categorical**. For example, continuous attributes like age, capital gains/losses, and hours worked per week provide quantitative insights, while categorical features such as workclass, education level, occupation, and marital status reflect demographic and professional characteristics.  

The dataset also includes information about individuals' native countries, spanning diverse regions such as the United States, Europe, Asia, and Latin America. This allows us to **explore how geographic and cultural backgrounds influence income levels**, adding a layer of complexity to the analysis.  

This dataset is frequently used in **predictive modeling for income classification**, helping to uncover how factors like education, work hours, and occupation impact earning potential. The mix of numerical and categorical data presents challenges during preprocessing, such as encoding categories into numerical formats and addressing potential imbalances in class labels (`<=50K` and `>50K`). By analyzing and modeling this data, we gain valuable insights into the **interplay of various demographic, geographic, and professional factors** that contribute to income disparities in diverse populations.

In [None]:
adult_data = pd.read_csv('data/adult.csv', sep=';')
adult_data.head()

In [None]:
adult_data.info()

All columns have **947 non-null values**, confirming that there are no null entries. This is beneficial for analysis as it removes the need for data imputation. Note that `info()` only detects true nulls; placeholder strings used to mark missing values (for example `?`) would not be counted. In addition, the unique values of some categorical attributes are inspected below to clarify ambiguities and identify the distinct categories or groups within each feature.

In [None]:
adult_data['education'].unique()

The output of `adult_data['education'].unique()` shows the unique values in the "education" column. These values represent different levels or types of education that individuals in the dataset have completed. Here's a brief analysis of these values.

The education levels are categorized across different stages of formal schooling. **Primary education** typically includes **1st-6th grade**, with the dataset reflecting this in categories like **1st-4th** and **5th-6th**. **Secondary education** spans **7th-12th grade**, which includes **middle school** (grades 7-8) and **high school** (grades 9-12). The dataset reflects this progression with categories like **7th-8th**, **9th**, **10th**, **11th**, and **12th**. **Higher education** follows high school and includes degrees like **Associate**, **Bachelor’s**, **Master’s**, and **Doctorate**.

Additionally, **Some-college** refers to individuals who have attended college but have not completed a degree, typically after high school, while **Preschool** refers to early childhood education. Categories like **HS-grad** indicate high school graduates, and **Assoc-acdm** and **Assoc-voc** represent associate degrees, typically completed after high school, with **Assoc-acdm** focusing on academic subjects and **Assoc-voc** on vocational training. **Prof-school** represents education from specialized schools, such as law or medical schools, often following a bachelor's degree. 

In [None]:
adult_data['occupation'].unique()

The **occupation** attribute in the dataset represents various job categories, such as managerial roles (Exec-managerial), technical support (Tech-support), skilled trades (Craft-repair), and professional specialties (Prof-specialty). It also includes roles in sales, administrative work (Adm-clerical), and protective services (Protective-serv). Some categories, like **Other-service** and **Priv-house-serv**, are more ambiguous, as they are broad and lack specific details, with **Other-service** covering various unspecified service roles and **Priv-house-serv** referring to household workers without further clarification. 

## **Classification models for Adult dataset**

### **Algorithm choices**

In this analysis, the goal is to compare different classification algorithms for predicting income levels in the Adult dataset, where the target variable is a binary class label. The first model chosen is **logistic regression**, a simple and interpretable model designed for **binary classification**. Although logistic regression is a form of regression, it is widely used for classification tasks where the outcome is a binary class. It is effective when the relationship between the features and the target is mostly linear and provides clear insights into how each feature impacts the model's prediction. This simplicity and interpretability make logistic regression a natural choice as a baseline model for comparison.

The second approach involves **decision trees**, followed by **XGBoost**, a powerful gradient boosting algorithm. Decision trees are able to handle both numerical and categorical data, and can capture non-linear relationships and interactions between features. However, their performance can often be limited by overfitting or underfitting. To address this, **XGBoost** is applied to enhance the decision tree model. As an ensemble method, XGBoost combines multiple decision trees and is particularly effective at managing complex, non-linear relationships, leading to improved accuracy and robustness.

### **Logistic regression model**

Logistic regression is a classification algorithm used for predicting a binary class label (e.g., 0 or 1). It is based on the concept of estimating the probability that an instance (row in the dataset) belongs to a particular class.

- **Logistic function (Sigmoid)**: The core of logistic regression is the **sigmoid function**, which transforms any real-valued number into a probability between 0 and 1. It is given by:
$$
P(y=1 | X) = \frac{1}{1 + e^{-z}}
$$
where $P(y=1 | X)$ is the probability that the instance belongs to class 1, and $z$ is the **logit**, which is a linear combination of the input features and their corresponding coefficients:
$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$
where $\beta_0$ is the intercept term, and $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients for the features $x_1, x_2, \dots, x_n$.

- **Decision rule**:
To classify an instance, the output probability is compared to a threshold (usually 0.5):
    - If $P(y=1 | X) \geq 0.5$, the instance is classified as **class 1**.
    - If $P(y=1 | X) < 0.5$, the instance is classified as **class 0**.

- **Cost function**:
Logistic regression uses a cost function called **log-loss** or **binary cross-entropy** to measure how well the model's predictions match the actual class labels. The cost function is minimized during training:
$$
J(\beta) = - \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log(h_\beta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\beta(x^{(i)})) \right)
$$
where $m$ is the number of training examples, $y^{(i)}$ is the actual label of the $i$-th training example, and $h_\beta(x^{(i)})$ is the predicted probability for the $i$-th example.
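The three pieces above (sigmoid, decision rule, and log-loss) can be sketched in a few lines of plain Python. This is a minimal illustration only: the coefficients `beta0` and `betas` and the three toy instances are hypothetical values chosen for the example, not parameters fitted to the adult data.

```python
import math

def sigmoid(z):
    """Map a real-valued logit z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(y=1 | x) for one instance x, given intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

def log_loss(y_true, probs):
    """Mean binary cross-entropy J(beta) over the m examples."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

# Hypothetical coefficients and instances, chosen only for illustration
beta0, betas = -1.0, [2.0, -0.5]
X = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.0]]
y = [1, 0, 0]

probs = [predict_proba(x, beta0, betas) for x in X]
labels = [1 if p >= 0.5 else 0 for p in probs]  # decision rule with threshold 0.5
loss = log_loss(y, probs)

print(probs)
print(labels)
print(loss)
```

The loss shrinks as the predicted probabilities move toward the true labels, which is what the optimizer exploits during training.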

Before applying logistic regression, **categorical features** in the dataset must be transformed into numerical representations using techniques such as **one-hot encoding**, **label encoding**, or **ordinal encoding**. One-hot encoding is often preferred because it prevents the model from assuming any order or hierarchy among the categories, which label encoding or ordinal encoding might unintentionally introduce.

Logistic regression is a linear model, and it interprets numerical inputs as having a specific magnitude or order. If categories are encoded as integers using label encoding, the model could mistakenly assume that higher numbers represent greater importance or imply a ranking, even when no such relationship exists. Similarly, ordinal encoding explicitly assigns a ranking to categories (e.g., `Low < Medium < High`), which is only appropriate when the categories naturally have an order. Using ordinal encoding for nominal data (e.g., `Red`, `Green`, `Blue`) would also lead to incorrect assumptions about relationships between the categories.

One-hot encoding solves these issues by creating separate binary columns for each category, treating them as independent and ensuring the model does not infer relationships that do not exist. This makes one-hot encoding more suitable for preserving the integrity of categorical data when applying logistic regression, especially when the categories lack a meaningful order.
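To make the difference concrete, the sketch below encodes one small nominal column both ways with pandas. The column values are a hypothetical toy sample, not rows from `adult.csv`.

```python
import pandas as pd

# Hypothetical nominal feature with no natural order
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

# Integer codes: categories are numbered in sorted order, so a linear model
# would read Private (0) < Self-emp (1) < State-gov (2) as a ranking
codes = df["workclass"].astype("category").cat.codes

# One-hot encoding: one independent 0/1 column per category, no implied order
one_hot = pd.get_dummies(df["workclass"], prefix="workclass", dtype=int)

print(codes.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one `1`, so no category is numerically "larger" than another.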

In addition to transforming the data properly, the evaluation of logistic regression should not rely solely on the cost function, such as log-loss, but also consider a variety of metrics derived from the confusion matrix. Metrics like **accuracy**, **precision**, **recall**, and **F1 score** provide detailed insights into the model’s classification performance. The **ROC-AUC** metric is especially important for understanding how well the model distinguishes between the two classes across different thresholds. For imbalanced datasets, other metrics, such as **Cohen’s Kappa** and the **Geometric Mean (G-Mean)**, are helpful as they account for imbalances in the class distribution and provide a more nuanced evaluation. These metrics help ensure the model's predictions are assessed comprehensively, capturing both its strengths and weaknesses.
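As a worked illustration of why Cohen's Kappa and the G-Mean complement accuracy on imbalanced data, the sketch below computes all three from a single confusion matrix. The counts are hypothetical and deliberately imbalanced; they are not results from this assessment.

```python
import math

# Hypothetical confusion-matrix counts for an imbalanced binary problem
tp, fn, fp, tn = 30, 20, 10, 440
n = tp + fn + fp + tn

accuracy = (tp + tn) / n  # observed agreement

# Agreement expected by chance, computed from the row and column totals
p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (accuracy - p_expected) / (1 - p_expected)

sensitivity = tp / (tp + fn)  # recall on the positive class
specificity = tn / (tn + fp)  # recall on the negative class
g_mean = math.sqrt(sensitivity * specificity)

print(f"accuracy={accuracy:.3f} kappa={kappa:.3f} g_mean={g_mean:.3f}")
```

With these counts the accuracy looks high, while Kappa and the G-Mean are noticeably lower because the minority class is recognized far less reliably than the majority class.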

The next step is to prepare the dataset for training, fit the model in Python, and then evaluate its performance and interpret the results.

In [None]:
# Create a copy of adult_data
ad = adult_data.copy()
ad

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Initialize the OneHotEncoder for categorical attributes
# All dummy columns are kept here; passing drop='first' would instead drop one
# column per feature to avoid the dummy variable trap.
onehot_encoder = OneHotEncoder(sparse_output=False)

# Initialize the OrdinalEncoder for the label column
ordinal_encoder = OrdinalEncoder()

# Identify categorical columns (excluding the label column)
ad_c = ad.select_dtypes(include=['object']).columns

# If the label column is categorical, apply OrdinalEncoder to it
ad['label'] = ordinal_encoder.fit_transform(ad[['label']])

# Remove the label from the categorical list
ad_c = ad_c[ad_c != 'label']

# Apply one-hot encoding to the categorical features (excluding the label column)
ad_c_e = onehot_encoder.fit_transform(ad[ad_c])

# Convert the encoded data to a DataFrame
ad_c_e_df = pd.DataFrame(ad_c_e, columns=onehot_encoder.get_feature_names_out(ad_c))

# Drop the original categorical columns (the encoded label column is kept) and concatenate the encoded ones
ad_e = pd.concat([ad.drop(columns=ad_c), ad_c_e_df], axis=1)

# Display the updated data
ad_e

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Manually list the numerical columns
ad_n = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']  # replace with actual names of numerical columns

ad_e_s = ad_e.copy()

# Apply scaling to numerical columns
ad_e_s[ad_n] = scaler.fit_transform(ad_e[ad_n])

# Display the updated data with scaled numerical features
ad_e_s

In [None]:
print(ad_e_s.columns)

When working with small datasets, it's often better to use **cross-validation** instead of **train-test split validation**. With split validation, the data is divided into just two sets: one for training and one for testing. This approach can leave some data unused for training, which may result in an unrepresentative model, especially with small datasets.

In contrast, cross-validation splits the data into multiple smaller parts (folds); each fold is used once for testing while the remaining folds are used for training. Every instance therefore contributes to both training and evaluation, which gives a more reliable performance estimate than a single split. Cross-validation is, therefore, a better approach when working with limited data.

However, before applying any model, we should always **split** the dataset into **training** and **testing** sets. This step ensures that the model is evaluated on **unseen data**, providing a better understanding of its real-world performance. Once the initial split is done, **cross-validation** can be applied to the training set to optimize and fine-tune the model. Afterward, the model can be tested on the **test set** to assess its ability to generalize.

This approach is beneficial for two main reasons:
- It allows for **model optimization** using cross-validation on the training set, reducing the risk of overfitting.
- It provides an accurate measure of the model's performance on unseen data (the test set), helping us evaluate how it might perform in real-world applications.

Since our dataset is relatively small, we will use a **0.3 test / 0.7 train split**. Compared with a smaller hold-out fraction, this gives a larger test set, so the performance metrics computed on it are less sensitive to a handful of instances and give a more robust picture of how well the model generalizes to unseen data. The trade-off is that fewer instances remain for training.
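The split-then-cross-validate workflow can be sketched with scikit-learn as follows. The data here is a synthetic stand-in generated with `make_classification`, and the variable names are illustrative rather than the ones used in the cells below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a small binary dataset (not the adult data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 1) Hold out 30% as a test set, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2) Estimate performance with 5-fold cross-validation on the training set only
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())

# 3) Fit on the full training set and evaluate once on the untouched test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

The test set is touched only once, at the very end, so its score is not influenced by any tuning decisions.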

In [None]:
X_ad_e_s = ad_e_s[ad_e_s.columns.difference(['label'])]
print(X_ad_e_s.columns)

In [None]:
y_ad_e_s = ad_e_s['label'] 
y_ad_e_s

In [None]:
from sklearn.model_selection import train_test_split, cross_val_predict, cross_validate, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression 

# Split the dataset into training and testing sets
X_ad_e_s_train, X_ad_e_s_test, y_ad_e_s_train, y_ad_e_s_test = train_test_split(X_ad_e_s, y_ad_e_s,
                                                                                test_size=0.3,
                                                                                stratify=y_ad_e_s,
                                                                                random_state=43)
# `stratify=y_ad_e_s` ensures that the class distribution of the target variable is preserved in both the training and test sets.

# Define the model (e.g., Logistic Regression)
logistic_regression_m_ad = LogisticRegression()

ads.custom_crossvalidation(X_ad_e_s_train, y_ad_e_s_train, logistic_regression_m_ad, cv_=5)

Logistic Regression is a widely used machine learning algorithm for binary and multi-class classification tasks. However, its performance heavily depends on the choice of **hyperparameters**, such as:

1. **Inverse regularization strength (`C`)**: Controls the trade-off between fitting the training data and regularization to prevent overfitting. In scikit-learn, smaller values of `C` mean stronger regularization.
2. **Penalty type (`l1`, `l2`, etc.)**: Determines the kind of regularization applied to the model.

Manual selection of these hyperparameters is time-consuming and prone to suboptimal results. **GridSearchCV** automates this process by systematically exploring combinations of hyperparameter values and selecting the combination that maximizes model performance, based on a chosen metric (e.g., accuracy, F1 score).

Using GridSearch:
1. **Optimizes model performance**: Ensures that the best combination of hyperparameters is chosen for the given dataset.
2. **Reduces human bias**: Removes guesswork and ensures a more objective selection process.
3. **Efficient evaluation**: Uses cross-validation to test each combination, providing a reliable estimate of performance.

Applying grid search makes it more likely that the selected Logistic Regression model is both accurate and able to generalize to unseen data, although the final check remains its performance on the held-out test set.
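A minimal sketch of such a search with scikit-learn's `GridSearchCV` is shown below. It runs on synthetic data, and the grid is an illustrative assumption: it is written as a list of dictionaries so that only solver/penalty pairs accepted by `LogisticRegression` are tried.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a small binary dataset (not the adult data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A list of grids keeps only valid solver/penalty pairs:
# lbfgs supports the l2 penalty, while liblinear supports both l1 and l2
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1_macro",  # metric used to rank the candidates
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Restricting the grid to valid pairs avoids spending folds on combinations that cannot be fitted.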

In [None]:
# Define the expanded parameter grid for Logistic Regression.
# Note: not every solver supports every penalty (e.g. 'lbfgs' and 'newton-cg' accept only 'l2',
# and 'elasticnet' needs solver='saga' and expects an l1_ratio), so some combinations will fail to fit.
logistic_regression_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller values regularize more)
    'penalty': ['l1', 'l2', 'elasticnet'],  # Regularization types (L1, L2, or ElasticNet)
    'solver': ['liblinear', 'saga', 'newton-cg', 'lbfgs'],  # Optimization algorithms
    'max_iter': [100, 200],  # Maximum number of iterations for optimization
    'tol': [1e-4, 1e-3, 1e-2],  # Tolerance for stopping criteria
    'fit_intercept': [True, False],  # Whether to include an intercept in the model
    'class_weight': [None, 'balanced']  # Class weight options (useful for imbalanced datasets)
}

# Call the custom_crossvalidation_gridsearch function
best_logistic_regression_m_ad, best_logistic_regression_m_ad_params, best_logistic_regression_m_ad_preds = (
    ads.custom_crossvalidation_gridsearch(
        X_ad_e_s_train, 
        y_ad_e_s_train, 
        logistic_regression_m_ad, 
        logistic_regression_param_grid, 
        cv_=5
    )
)

Accuracy measures how many predictions were correct overall. It is calculated by dividing the number of correct predictions (true positives and true negatives) by the total number of instances:

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}
$$

In this case, the accuracy is **82.62%**, meaning the model correctly predicted the class for about 83% of the instances. While useful, accuracy can be misleading when the dataset is imbalanced because it may overlook the model's performance on the minority class.

Precision measures how many of the predicted positive instances are actually correct. It is calculated as the number of true positives (TP) divided by the sum of true positives and false positives (FP):

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

For the <=50K class, the precision is 86%, meaning most predictions for this class are correct. For the >50K class, the precision is **80%**, indicating that a smaller proportion of predicted >50K instances were correct. Precision is important when the cost of false positives is high.

Recall shows how well the model identifies all the actual positive instances. It is calculated as the number of true positives (TP) divided by the sum of true positives and false negatives (FN):

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

For <=50K, recall is 78%, meaning the model missed about 22% of the instances for this class. For >50K, recall is 88%, meaning the model identified most of the >50K instances. Recall is crucial when it is important to identify as many positives as possible.

The F1-score is the harmonic mean of precision and recall, providing a balance between them. It is calculated as:

$$
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

For the <=50K class, the F1-score is **81%**, and for the >50K class, it is **84%**. This indicates that the model balances precision and recall well for both classes, with a slightly better performance for >50K.
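The four formulas can be checked by hand on a small confusion matrix. The counts below are hypothetical and serve only to show the arithmetic; they are not the counts behind the percentages reported here.

```python
# Hypothetical confusion-matrix counts with `>50K` treated as the positive class
tp, fn, fp, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Treating `<=50K` as the positive class swaps the roles of the counts
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"precision_neg={precision_neg:.3f} recall_neg={recall_neg:.3f}")
```

This is why a classification report lists precision, recall, and F1 once per class: each row simply treats that class as the positive one.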

Other metrics like support, macro average, and weighted average are less informative for this binary classification. Support simply counts the number of instances in each class, and averages like macro or weighted averages are more useful for imbalanced datasets or multi-class problems. 

The output suggests that the model achieved a mean accuracy of 82.62% with a small standard deviation (3.94%), indicating a good general performance with slight variability across folds. The precision for both classes is good (83.06% for <=50K and 80% for >50K), but there is a trade-off between precision and recall, with recall being slightly lower for <=50K (78%). The F1-score, which balances both precision and recall, is around 82.55%, indicating that the model is well-balanced in terms of both metrics.

These metrics can be used to further tune the model for better performance, especially in handling imbalanced datasets by adjusting parameters like `class_weight`.

In [None]:
# The test portion was encoded and scaled together with the training data above,
# so the tuned model can be applied to it directly
y_ad_e_s_test_predicted = best_logistic_regression_m_ad.predict(X_ad_e_s_test)
print(classification_report(y_ad_e_s_test, y_ad_e_s_test_predicted))
ads.plot_confusion_matrix(y_ad_e_s_test, y_ad_e_s_test_predicted)

**Prediction confidence levels**

- In scikit-learn, once a model is trained, the confidence levels (probabilities) of its predictions can be retrieved with the `predict_proba` method, which returns an array holding the probability of each class for each row/instance.
- This is especially useful for measuring the model's confidence in its predictions.

NOTE: Most classifiers in scikit-learn support `predict_proba`, but a few do not by default. For example, `SVC` (Support Vector Machine) does not produce probabilities directly; probability estimates can be enabled by setting `probability=True` when initializing the model, at additional computational cost.

NOTE: Adding the predictions and confidence levels to the test dataset is implemented in `get_test_dataset` from the custom helper module (`ads`).
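As a rough sketch of what retrieving confidence levels looks like, the example below fits a logistic regression on synthetic data and reads the probability of the predicted class for a few rows. It does not reproduce `get_test_dataset`, whose implementation is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a small binary dataset (not the adult data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, 2): one column per class
pred = clf.predict(X[:5])         # predicted class labels
confidence = proba.max(axis=1)    # probability assigned to the predicted class

print(clf.classes_)  # column order of predict_proba
print(pred)
print(confidence)
```

The columns of `predict_proba` follow the order of `clf.classes_`, and each row sums to 1.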

In [None]:
# Fit (train) the model on the training portion
best_logistic_regression_m_ad.fit(X_ad_e_s_train, y_ad_e_s_train)
# Call get_test_dataset from the helper module to attach predictions and confidence levels to the test portion
ads.get_test_dataset(best_logistic_regression_m_ad, X_ad_e_s_test, y_ad_e_s_test)

### **Decision tree models**

Decision trees are widely recognized as effective supervised learning tools for classification and regression. Data is divided into subsets based on specific rules, forming a tree-like structure where **leaf nodes** represent final decisions. Measures such as **entropy** or the **Gini index** are used to evaluate the quality of splits, ensuring meaningful partitions. However, if trees are grown excessively deep, **overfitting** can occur, where noise is captured instead of general patterns.  
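The two split-quality measures can be illustrated with a short, self-contained sketch. The label lists are hypothetical, and the helper names (`gini`, `entropy`, `split_gain`) are introduced here for illustration only.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Fraction of instances belonging to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

# Hypothetical node with 10 instances and one candidate split
parent = ["<=50K"] * 6 + [">50K"] * 4
left = ["<=50K"] * 5 + [">50K"] * 1
right = ["<=50K"] * 1 + [">50K"] * 3

gini_gain = split_gain(parent, left, right, gini)
info_gain = split_gain(parent, left, right, entropy)
print(gini(parent), gini_gain)
print(entropy(parent), info_gain)
```

A tree-growing algorithm evaluates many candidate splits this way and keeps the one with the largest gain.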

After logistic regression has been applied to the Adult dataset, a decision tree classifier is used for further analysis and classification. The **C5.0 algorithm** is a useful reference point: it improves upon its predecessor, C4.5, by incorporating **boosting**, which allows weak models to be combined for higher accuracy. Continuous attributes and missing data are handled efficiently, while **post-pruning** is employed to simplify the tree and reduce overfitting. The tree is first grown fully, and branches that do not significantly enhance accuracy are removed, resulting in a model that is both interpretable and effective for unseen data.

The code below uses scikit-learn's `DecisionTreeClassifier`, which implements an optimized version of CART rather than C5.0. It does not accept categorical strings directly, so the categorical attributes are ordinal-encoded first, and post-pruning is available through cost-complexity pruning (`ccp_alpha`).

By applying a pruned decision tree, improved classification performance and deeper insights into the Adult dataset are expected to be achieved.

In [None]:
ad = adult_data.copy()
ad

In [None]:
X_ad = ad[ad.columns.difference(['label'])]
X_ad

In [None]:
y_ad = ad[['label']] 
y_ad

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder()

# Identify categorical columns
X_ad_c = X_ad.select_dtypes(include=['object']).columns

# Apply ordinal encoding to categorical columns
X_ad_c_e = encoder.fit_transform(X_ad[X_ad_c])

# Convert the encoded data to a DataFrame
X_ad_c_e = pd.DataFrame(X_ad_c_e, columns=encoder.get_feature_names_out(X_ad_c))

# Drop the original categorical columns and concatenate the encoded ones
X_ad_e = pd.concat([X_ad.drop(columns=X_ad_c), X_ad_c_e], axis=1)

# Display the updated data
X_ad_e.info()

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz

# Split the dataset into training and testing sets
X_ad_e_train, X_ad_e_test, y_ad_train, y_ad_test = train_test_split(X_ad_e, y_ad,
                                                                    test_size=0.3,
                                                                    stratify=y_ad,
                                                                    random_state=43)
# `stratify=y_ad`: This ensures that the distribution of the target variable (`y_ad`) is maintained in both the training and test sets.

# Define the model (e.g., decision tree classifier)
decision_tree_classifier_m_ad = DecisionTreeClassifier(random_state=43)

ads.custom_crossvalidation(X_ad_e_train, y_ad_train, decision_tree_classifier_m_ad)

In [None]:
# get the tree model
decision_tree_classifier_m_ad.fit(X_ad_e_train, y_ad_train)
plt.figure(figsize=(25, 16))
plot_tree(decision_tree_classifier_m_ad, 
          filled=True, 
          feature_names=list(X_ad_e_train.columns), 
          class_names=decision_tree_classifier_m_ad.classes_)

In [None]:
# can output the depth and the number of leaf nodes
decision_tree_classifier_m_ad.get_depth(), decision_tree_classifier_m_ad.get_n_leaves()

In [None]:
print(decision_tree_classifier_m_ad.classes_)

In [None]:
decision_tree_classifier_m_ad.decision_path(X_ad_e_train, check_input=True)

### **Pruning**

Various hyperparameters can be modified manually:
1. Pre-pruning hyperparameters: `max_depth`, `min_samples_split`, `min_samples_leaf`
2. Post-pruning hyperparameter: `ccp_alpha`
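One way to choose candidate values for `ccp_alpha`, rather than guessing them, is to ask scikit-learn for the cost-complexity pruning path of a fully grown tree. The sketch below does this on synthetic data; the sampling of roughly five alphas is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small binary dataset (not the adult data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Grow an unpruned tree and compute the effective alphas along its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Refit with a handful of alphas from the path and record the resulting tree sizes
step = max(1, len(path.ccp_alphas) // 5)
leaves = []
for alpha in path.ccp_alphas[::step]:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    leaves.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.4f} depth={pruned.get_depth()} leaves={pruned.get_n_leaves()}")
```

Larger values of `ccp_alpha` prune more aggressively, so the candidates from the path can then be compared with cross-validation.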

In [None]:
# prepruning
decision_tree_classifier_m_ad_pre = DecisionTreeClassifier(max_depth=8, min_samples_split=2, min_samples_leaf=1, random_state=43)
ads.custom_crossvalidation(X_ad_e_train, y_ad_train, decision_tree_classifier_m_ad_pre)

In [None]:
decision_tree_classifier_m_ad_pre.fit(X_ad_e_train, y_ad_train)
print(decision_tree_classifier_m_ad_pre.get_depth(), decision_tree_classifier_m_ad_pre.get_n_leaves())
plt.figure(figsize=(25, 16))
plot_tree(decision_tree_classifier_m_ad_pre, 
          filled=True, 
          feature_names=list(X_ad_e_train.columns), 
          class_names=decision_tree_classifier_m_ad_pre.classes_)

In [None]:
# can output the depth and the number of leaf nodes
decision_tree_classifier_m_ad_pre.get_depth(), decision_tree_classifier_m_ad_pre.get_n_leaves()

In [None]:
print(decision_tree_classifier_m_ad_pre.classes_)

In [None]:
decision_tree_classifier_m_ad_pre.decision_path(X_ad_e_train, check_input=True)

In [None]:
# postpruning
decision_tree_classifier_m_ad_post = DecisionTreeClassifier(ccp_alpha=0.005, random_state=43)
ads.custom_crossvalidation(X_ad_e_train, y_ad_train, decision_tree_classifier_m_ad_post)

In [None]:
decision_tree_classifier_m_ad_post.fit(X_ad_e_train, y_ad_train)
print(decision_tree_classifier_m_ad_post.get_depth(), decision_tree_classifier_m_ad_post.get_n_leaves())
plt.figure(figsize=(25, 16))
plot_tree(decision_tree_classifier_m_ad_post, 
          filled=True, 
          feature_names=list(X_ad_e_train.columns), 
          class_names=decision_tree_classifier_m_ad_post.classes_)

In [None]:
# can output the depth and the number of leaf nodes
decision_tree_classifier_m_ad_post.get_depth(), decision_tree_classifier_m_ad_post.get_n_leaves()

In [None]:
print(decision_tree_classifier_m_ad_post.classes_)

In [None]:
decision_tree_classifier_m_ad_post.decision_path(X_ad_e_train, check_input=True)

In [None]:
# pre/postpruning
decision_tree_classifier_m_ad_prepost = DecisionTreeClassifier(max_depth=8, min_samples_split=2,min_samples_leaf=1, ccp_alpha=0.005,random_state=43)
ads.custom_crossvalidation(X_ad_e_train, y_ad_train, decision_tree_classifier_m_ad_prepost)

In [None]:
decision_tree_classifier_m_ad_prepost.fit(X_ad_e_train, y_ad_train)
print(decision_tree_classifier_m_ad_prepost.get_depth(), decision_tree_classifier_m_ad_prepost.get_n_leaves())
plt.figure(figsize=(25, 16))
plot_tree(decision_tree_classifier_m_ad_prepost, 
          filled=True, 
          feature_names=list(X_ad_e_train.columns), 
          class_names=decision_tree_classifier_m_ad_prepost.classes_)

In [None]:
# can output the depth and the number of leaf nodes
decision_tree_classifier_m_ad_prepost.get_depth(), decision_tree_classifier_m_ad_prepost.get_n_leaves()

In [None]:
print(decision_tree_classifier_m_ad_prepost.classes_)

In [None]:
decision_tree_classifier_m_ad_prepost.decision_path(X_ad_e_train, check_input=True)

**Find Optimal Model using a grid search**

Optimise for f-measure rather than accuracy, given the class imbalance.

In [None]:
# Wider grid kept for reference; the search below uses params_grid_2
params_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': range(3, 9),
    'min_impurity_decrease' : np.arange(0.01, 0.3, 0.01),
    'min_samples_leaf': range(1, 10, 1),
    'min_samples_split': range(2, 10, 1),
    'max_features': [None, 'sqrt', 'log2'],
    'class_weight': [None, 'balanced'],
    'ccp_alpha': [0.003, 0.005, 0.007]
}

params_grid_2 = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': range(3, 13),
    #'min_weight_fraction_leaf': [0.0, 0.01, 0.05, 0.1],
    #'min_impurity_decrease' : np.arange(0.01, 0.3, 0.01),
    #'min_samples_split': [2, 3, 4, 5, 10, 20],
    #'min_samples_leaf': [1, 2, 3, 4, 5, 10],
    'ccp_alpha': [0.003, 0.004, 0.005]
}

grid_search = GridSearchCV(decision_tree_classifier_m_ad, param_grid=params_grid_2, scoring='f1_macro', cv=5, n_jobs=-1)
#grid_search = GridSearchCV(decision_tree_classifier_m_ad, param_grid=params_grid_2, scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(X_ad_e_train, y_ad_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Split the dataset into training and testing sets
XX_ad_train, XX_ad_test, yy_ad_train, yy_ad_test = train_test_split(X_ad, y_ad,
                                                                    test_size=0.3,
                                                                    stratify=y_ad,
                                                                    random_state=43)

best_decision_tree_classifier_m_ad_prepost = Pipeline([
    # Ordinal-encode only the categorical columns (unknown categories are mapped to -1);
    # numerical columns are passed through unchanged
    ('encoder', ColumnTransformer(
        [('categorical',
          OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
          list(X_ad_c))],
        remainder='passthrough')),
    ('sklearn_dt', grid_search.best_estimator_)
])
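# Fit the encoder and the tuned tree together on the training split
best_decision_tree_classifier_m_ad_prepost.fit(XX_ad_train, yy_ad_train)

# Cross-validate the whole pipeline on the training split
ads.custom_crossvalidation(XX_ad_train, yy_ad_train, best_decision_tree_classifier_m_ad_prepost)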
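The pipeline's encoder maps categories it did not see during fitting to -1. A minimal sketch of that behaviour on a toy column follows; the category values are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same encoder settings as in the pipeline above
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Toy training column with two known categories
train = np.array([['Private'], ['State-gov'], ['Private']], dtype=object)
encoder.fit(train)

# 'Never-worked' was not seen during fit, so it is encoded as -1
test = np.array([['State-gov'], ['Never-worked']], dtype=object)
encoded = encoder.transform(test)
print(encoded.ravel())
```

Without `handle_unknown='use_encoded_value'`, the default behaviour is to raise an error when an unseen category appears at prediction time.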

In [None]:
yy_ad_test_pred = best_decision_tree_classifier_m_ad_prepost.predict(XX_ad_test)
print(classification_report(yy_ad_test, yy_ad_test_pred))
ads.plot_confusion_matrix(yy_ad_test, yy_ad_test_pred)

**XGBoost**

In [None]:
from xgboost import XGBClassifier
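import xgboost as xgb

# XGBoost can consume pandas 'category' columns directly (with enable_categorical=True),
# so convert the object columns instead of encoding them by hand
XX_ad = X_ad.copy()
XX_ad_c = XX_ad.select_dtypes(include='object').columns
XX_ad[XX_ad_c] = XX_ad[XX_ad_c].astype('category')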
XX_ad.info()

In [None]:
# Encode the target as integers (0 = '<=50K', 1 = '>50K')
yy_ad = y_ad.copy()
yy_ad_e = yy_ad['label'].map({'<=50K': 0, '>50K': 1})
yy_ad_e.info()

In [None]:
yy_ad_e.value_counts()

In [None]:
# Split the data; stratify on the target, as in the earlier split, to preserve the class ratio
XX_ad_train, XX_ad_test, yy_ad_e_train, yy_ad_e_test = train_test_split(XX_ad, yy_ad_e,
                                                                        test_size=0.3,
                                                                        stratify=yy_ad_e,
                                                                        random_state=43)
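Stratified splitting keeps the class ratio the same in the training and test parts, which matters when one class is rare. The sketch below compares a stratified and a plain split on synthetic labels (80 zeros, 20 ones); the data are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split: the test set keeps the 80/20 ratio
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.3, stratify=y, random_state=43)

# Plain split: the ratio in the test set depends on the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=43)

print("stratified positives:", int(y_test_strat.sum()), "of", len(y_test_strat))
print("plain positives:     ", int(y_test_plain.sum()), "of", len(y_test_plain))
```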

In [None]:
# Fit a single boosting round; leave other hyperparameters at their default values.
# Note: with objective='multi:softmax' and num_class=2, XGBoost grows one tree per class
# in each round, so this model contains two trees (indices 0 and 1) rather than one.
xgb_dt_m_ad = XGBClassifier(enable_categorical=True,
                            n_estimators=1,
                            num_parallel_tree=1,
                            objective='multi:softmax',
                            booster='dart',
                            num_class=2,
                            eval_metric='mlogloss',
                            random_state=43)
# Cross-validate and report performance
ads.custom_crossvalidation(XX_ad_train, yy_ad_e_train, xgb_dt_m_ad)

In [None]:
# Refit on the full training split and dump every tree as text, including split statistics
xgb_dt_m_ad.fit(XX_ad_train, yy_ad_e_train)
xgb_dt_m_ad_tree_text = xgb_dt_m_ad.get_booster().get_dump(with_stats=True)
# Print the decision rules for the first tree
print(xgb_dt_m_ad_tree_text[0])

In [None]:

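# xgb.plot_tree relies on graphviz, so the package must be installed
import graphviz
fig = plt.figure(figsize=(25, 16))
ax = fig.gca()
# num_trees specifies the index of the tree to plot; rankdir='LR' lays it out left to right
xgb.plot_tree(xgb_dt_m_ad, num_trees=0, rankdir='LR', ax=ax)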
plt.show()

In [None]:
fig = plt.figure(figsize=(25, 40))
ax = fig.gca()
# num_trees specifies the index of the tree to plot; index 1 is the second tree
xgb.plot_tree(xgb_dt_m_ad, num_trees=1, ax=ax)
plt.show()

In [None]:
# Hyperparameter optimisation with a randomised search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Small discrete search space kept for reference; the search below uses params2
params = {
    'max_depth' : range(3, 11),
    #'n_estimators': [1, 3, 5, 7],
    #'num_parallel_tree': [1, 3, 5],  # Number of trees to build in parallel per boosting round
    'max_leaves':[3, 5, 7, 10],
    'learning_rate': np.arange(0.1, 0.5, 0.1),
    'min_child_weight': range(1,6,2)
}

# Search space used below; scipy's uniform(loc, scale) samples from [loc, loc + scale]
params2 = {
    #'n_estimators': [100, 200, 300, 400],
    'n_estimators': [21],
    'num_parallel_tree': [1],
    'learning_rate': uniform(0.01, 0.5),    # [0.01, 0.51]
    'max_depth' : range(3, 13),
    'subsample': uniform(0.6, 0.4),         # [0.6, 1.0]
    'colsample_bytree': uniform(0.6, 0.4),  # [0.6, 1.0]
    'gamma': uniform(0, 0.5),               # [0, 0.5]
    #'max_leaves':[3, 5, 7, 10],
    'min_child_weight': range(1,6,1)
}

random_search = RandomizedSearchCV(xgb_dt_m_ad, 
                                   param_distributions=params2, 
                                   n_iter=200, 
                                   cv=5, 
                                   n_jobs=-1, 
                                   random_state=43, 
                                   scoring='f1_macro')

# Fit the randomized search
random_search.fit(XX_ad_train, yy_ad_e_train)
print(random_search.best_params_)


# Alternative (disabled): exhaustive grid search optimised for macro F1; params3 would need to be defined first
# grid_search = GridSearchCV(xgb_dt_m_ad, param_grid=params3, scoring='f1_macro', cv=5)
# grid_search.fit(XX_ad_train, yy_ad_e_train)
# print(grid_search.best_params_)
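`scipy.stats.uniform` takes `loc` and `scale`, not a lower and an upper bound, which is easy to misread in a search space. The short check below confirms the interval that `uniform(0.6, 0.4)` actually samples from; the sample size and seed are arbitrary choices.

```python
from scipy.stats import uniform

# uniform(loc, scale) is supported on [loc, loc + scale]
dist = uniform(0.6, 0.4)
low, high = dist.support()
print(f"support: [{low}, {high}]")

# Draw a reproducible sample and check that it stays inside the interval
samples = dist.rvs(size=1000, random_state=43)
print(f"min = {samples.min():.3f}, max = {samples.max():.3f}")
```

`RandomizedSearchCV` draws one value from each such distribution per iteration by calling its `rvs` method.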

In [None]:
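# Mean and standard deviation of the mean CV score across all sampled candidates
# (this describes the spread of the search, not the variability of the best model)
#(np.mean(grid_search.cv_results_['mean_test_score']),
#np.std(grid_search.cv_results_['mean_test_score']))
(np.mean(random_search.cv_results_['mean_test_score']),
 np.std(random_search.cv_results_['mean_test_score']))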

In [None]:
# best_estimator_ has already been refit on the full training split (refit=True is the default)
optimal_xgb = random_search.best_estimator_
ads.custom_crossvalidation(XX_ad_train, yy_ad_e_train, optimal_xgb)

In [None]:
yy_ad_e_test_pred = optimal_xgb.predict(XX_ad_test)
print(classification_report(yy_ad_e_test, yy_ad_e_test_pred))
ads.plot_confusion_matrix(yy_ad_e_test, yy_ad_e_test_pred)

In [None]:
fig = plt.figure(figsize=(25, 16))
ax = fig.gca()
# num_trees specifies the index of the tree to plot
xgb.plot_tree(optimal_xgb, num_trees=1, ax=ax)
plt.show()

In [None]:
from sklearn.metrics import RocCurveDisplay

# ROC curves for the three optimised models, drawn on shared axes
best_lr_m_ad_disp = RocCurveDisplay.from_estimator(best_logistic_regression_m_ad, X_ad_e_s_test, y_ad_e_s_test)

# XX_ad_test now holds the XGBoost split, so pair it with the labels from that same split,
# mapped back to the string classes the decision-tree pipeline was trained on
yy_ad_test_aligned = yy_ad_e_test.map({0: '<=50K', 1: '>50K'})
best_dtc_m_ad_prepost_named_disp = RocCurveDisplay.from_estimator(
    best_decision_tree_classifier_m_ad_prepost,
    XX_ad_test, yy_ad_test_aligned,
    ax=best_lr_m_ad_disp.ax_
)

best_xgb_m_dt_disp = RocCurveDisplay.from_estimator(
    optimal_xgb,
    XX_ad_test, yy_ad_e_test,
    ax=best_dtc_m_ad_prepost_named_disp.ax_
)

best_xgb_m_dt_disp.figure_.suptitle("ROC curve comparison")
# Chance-level diagonal for reference
best_xgb_m_dt_disp.ax_.plot([0, 1], [0, 1], 'k--')
plt.show()
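Each ROC curve can be summarised by its area under the curve (AUC), which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up scores; `auc_by_ranking` is an illustrative helper, not part of the lab utilities.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_by_ranking(y_true, scores)
print(f"AUC = {auc_value}")
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75; a curve lying on the dashed diagonal corresponds to 0.5.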
