# Predicting income

## Lecture 7

### GRA 4160
### Advanced Regression and Classification Analysis, Ensemble Methods And Neural Networks

#### Lecturer: Vegard H. Larsen

## Boosting

Boosting is an ensemble technique that trains a sequence of models, each one correcting the errors of the previous model.
Boosting focuses on learning from the mistakes of the previous models and adjusts the weights of the training samples to emphasize the harder samples in the next round of training.
In boosting, each model is trained on a weighted version of the training data, with more weight assigned to the samples that were misclassified by the previous models.
The final prediction is the weighted sum of the predictions of the individual models. One of the most popular boosting algorithms is the Gradient Boosting algorithm.

## The `GradientBoostingClassifier` in Scikit-Learn

The gradient boosting algorithm is a popular ensemble learning method that combines multiple weak learners to form a strong learner.

The `GradientBoostingClassifier` builds an ensemble of decision trees sequentially, where each subsequent tree aims to correct the errors made by the previous tree.
Specifically, it minimizes a loss function by using gradient descent to update the parameters of the model, such as the weights of the features and the thresholds for the decision nodes.

The `GradientBoostingClassifier` has several hyperparameters that can be tuned to control the complexity of the model and prevent overfitting, including:

- **n_estimators**: The number of decision trees in the ensemble
- **learning_rate**: The step size of the gradient descent algorithm
- **max_depth**: The maximum depth of each decision tree in the ensemble
- **min_samples_split**: The minimum number of samples required to split a decision node.
- **min_samples_leaf**: The minimum number of samples required to be at a leaf node.

The GradientBoostingClassifier also supports several loss functions, such as the deviance loss (used for binary classification) and the exponential loss (used for AdaBoost).
Additionally, it provides several methods for making predictions, such as predict for obtaining the class labels and predict_proba for obtaining the class probabilities.

## Adult dataset

https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

The Adult dataset, also commonly referred to as the "Census Income" dataset, is a popular resource for machine learning, especially for tasks involving classification and pattern recognition. This dataset was extracted from the 1994 Census database by Barry Becker and contains demographic information about adults from various backgrounds.

Key features of the dataset include:

- age: The age of the individual.
- workclass: The type of employment of the individual (e.g., Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.).
- fnlwgt: Final weight. The number of people the census believes the entry represents.
- education: The highest level of education achieved by the individual (e.g., Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.).
- education-num: The highest level of education in numerical form.
- marital-status: Marital status of the individual.
- occupation: The individual's occupation (e.g., Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, etc.).
- relationship: Describes the individual's role in the family (e.g., Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried).
- race: Race of the individual (e.g., White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black).
- sex: The sex of the individual (Male, Female).
- capital-gain: Income from investment sources, apart from wages/salary.
- capital-loss: Losses from investment sources.
- hours-per-week: Number of hours worked per week.
- native-country: Country of origin for the individual.
- income: Whether the individual earns more than $50K/year.

The dataset is often used for predictive modeling and binary classification tasks, such as predicting whether an individual's income exceeds a certain threshold based on their demographic characteristics.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 header=None, names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                                     'marital-status', 'occupation', 'relationship', 'race', 'sex',
                                     'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])


# Replace '?' values with NaN
df = df.replace('?', pd.NaT)

# Drop rows with missing values
df = df.dropna()

# Encode categorical features using LabelEncoder
categorical_features = ['workclass', 'education', 'marital-status', 'occupation',
                        'relationship', 'race', 'sex', 'native-country']
encoder = LabelEncoder()
for feature in categorical_features:
    df[feature] = encoder.fit_transform(df[feature])

# Encode target variable
df['income'] = df['income'].apply(lambda x: 1 if x.strip() == '>50K' else 0)

# Split dataset into training and testing sets
X = df.drop(columns=['income'])
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


# Exercises
1. Train a Gradient Boosting Classifier using the "Adult" dataset and evaluate its performance using the area under the ROC curve (AUC).
2. Experiment wih different hyperparameters of the Gradient Boosting Classifier, such as the number of trees, learning rate, and maximum depth of each tree, and observe how they affect the model's performance.
3. Train an AdaBoost Classifier using the "Adult" dataset and compare its performance to the Gradient Boosting Classifier. You can also try with XGBoost Classifier (need to be installed first).
4. Perform feature selection using the Gradient Boosting Classifier and evaluate the performance of the model using only the top-k most important features.

In [2]:
# Exercise 1

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Train a GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# Make predictions on the test set and calculate the AUC
y_pred = gb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred)
print('AUC:', auc)

AUC: 0.923775181024132


In [3]:
# Exercise 2

from sklearn.metrics import accuracy_score

# Experiment with different hyperparameters
for n_trees in [50, 100, 200]:
    for learning_rate in [0.01, 0.1, 1]:
        for max_depth in [3, 5, 7]:
            # Train a GradientBoostingClassifier
            gb = GradientBoostingClassifier(n_estimators=n_trees, learning_rate=learning_rate, max_depth=max_depth)
            gb.fit(X_train, y_train)

            # Make predictions on the test set
            y_pred = gb.predict(X_test)

            # Evaluate the accuracy of the model
            accuracy = accuracy_score(y_test, y_pred)
            print(f'n_trees={n_trees}, learning_rate={learning_rate}, max_depth={max_depth}, accuracy={accuracy:.2f}')

n_trees=50, learning_rate=0.01, max_depth=3, accuracy=0.81
n_trees=50, learning_rate=0.01, max_depth=5, accuracy=0.81
n_trees=50, learning_rate=0.01, max_depth=7, accuracy=0.82
n_trees=50, learning_rate=0.1, max_depth=3, accuracy=0.86
n_trees=50, learning_rate=0.1, max_depth=5, accuracy=0.87
n_trees=50, learning_rate=0.1, max_depth=7, accuracy=0.88
n_trees=50, learning_rate=1, max_depth=3, accuracy=0.87
n_trees=50, learning_rate=1, max_depth=5, accuracy=0.86
n_trees=50, learning_rate=1, max_depth=7, accuracy=0.84
n_trees=100, learning_rate=0.01, max_depth=3, accuracy=0.84
n_trees=100, learning_rate=0.01, max_depth=5, accuracy=0.85
n_trees=100, learning_rate=0.01, max_depth=7, accuracy=0.86
n_trees=100, learning_rate=0.1, max_depth=3, accuracy=0.87
n_trees=100, learning_rate=0.1, max_depth=5, accuracy=0.88
n_trees=100, learning_rate=0.1, max_depth=7, accuracy=0.88
n_trees=100, learning_rate=1, max_depth=3, accuracy=0.87
n_trees=100, learning_rate=1, max_depth=5, accuracy=0.85
n_trees=10

In [4]:
# Exercise 3

from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# Train an AdaBoostClassifier
ab = AdaBoostClassifier()
ab.fit(X_train, y_train)

# Train the XGBClassiifer
#xgb = XGBClassifier()
#xgb.fit(X_train, y_train)

# Make predictions on the test set and compare the performance of the two models
y_pred_gb = gb.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)

y_pred_ab = ab.predict(X_test)
accuracy_ab = accuracy_score(y_test, y_pred_ab)

#y_pred_xgb = xgb.predict(X_test)
#accuracy_xgb = accuracy_score(y_test, y_pred_xgb)

print(f'Gradient Boosting Classifier accuracy: {accuracy_gb:.2f}')
print(f'AdaBoost Classifier accuracy: {accuracy_ab:.2f}')
#print(f'XGBoost Classifier accuracy: {accuracy_xgb:.2f}')

Gradient Boosting Classifier accuracy: 0.85
AdaBoost Classifier accuracy: 0.86


In [5]:
# Exercise 4

# Train a GradientBoostingClassifier with custom loss function
gb_custom = GradientBoostingClassifier(loss='exponential')
gb_custom.fit(X_train, y_train)

# Train a GradientBoostingClassifier with default loss function
gb_default = GradientBoostingClassifier()
gb_default.fit(X_train, y_train)

# Make predictions on the test set and compare the performance of the two models
y_pred_custom = gb_custom.predict(X_test)
accuracy_custom = accuracy_score(y_test, y_pred_custom)

y_pred_default = gb_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f'Custom loss function accuracy: {accuracy_custom:.2f}')
print(f'Default loss function accuracy: {accuracy_default:.2f}')

Custom loss function accuracy: 0.87
Default loss function accuracy: 0.87


In [6]:
# Exercise 4

from sklearn.feature_selection import SelectFromModel

# Perform feature selection using the GradientBoostingClassifier
fs = SelectFromModel(gb, prefit=True, max_features=5)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)

# Train a GradientBoostingClassifier with the selected features
gb_fs = GradientBoostingClassifier()
gb_fs.fit(X_train_fs, y_train)

# Make predictions on the test set and evaluate the performance of the model
y_pred_fs = gb_fs.predict(X_test_fs)
accuracy_fs = accuracy_score(y_test, y_pred_fs)
print(f'Accuracy with top-5 most important features: {accuracy_fs:.2f}')



Accuracy with top-5 most important features: 0.86


<bound method BaseEstimator.get_params of SelectFromModel(estimator=GradientBoostingClassifier(learning_rate=1,
                                                     max_depth=7,
                                                     n_estimators=200),
                max_features=5, prefit=True)>