# Week 4 - Models and Experimentation

## Step 1 Training a model

This demo adapts the [DataCamp XGBoost tutorial](https://www.datacamp.com/tutorial/xgboost-in-python): we train an XGBoost model, then experiment with it and tune its hyperparameters.


If you run this notebook locally, set up a virtual environment first:
- Use Python 3.10 or earlier
- Create the virtual environment with `venv`

`python -m venv NAME`

- Activate it before installing anything (`source NAME/bin/activate` on Linux/macOS, `NAME\Scripts\activate` on Windows)
- Avoid Anaconda, Poetry, or similar package-management tools
- Install packages with pip

`python -m pip install <package-name>`

- When you are done working in the virtual environment, deactivate it with `deactivate`
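
Before installing packages, it can help to confirm that the notebook kernel really runs inside the virtual environment and respects the Python 3.10 upper bound from the list above. The following is a minimal sketch using only the standard library; the helper names `in_virtualenv` and `version_ok` are made up for this illustration.

```python
import sys

MAX_SUPPORTED = (3, 10)  # assumption taken from the Python 3.10 advice above


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def version_ok(version=sys.version_info, max_supported=MAX_SUPPORTED) -> bool:
    """Return True when (major, minor) does not exceed max_supported."""
    return tuple(version[:2]) <= max_supported


if __name__ == "__main__":
    print("inside a virtual environment:", in_virtualenv())
    print("Python version within the suggested range:", version_ok())
```

If the first check prints `False`, the kernel is most likely using the system interpreter rather than the environment created with `venv`.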

### Install packages

In [1]:
!pip install wandb -qU


In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost.sklearn import XGBRegressor


### Import data

We will use the Diamonds dataset, loaded from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

Read about the features by following the link. We will be predicting the price of diamonds.

In [None]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

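| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |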


In [None]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [None]:
diamonds.shape

(53940, 10)

In [None]:
X,y = diamonds.drop('price', axis=1), diamonds[['price']]

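# Cast cut, color, and clarity to the pandas category dtype so that XGBoost can handle them as categorical features.
# (diamonds.info() above shows they already have this dtype, so the casts act as a safeguard.)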

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [None]:
# Define hyperparameters
params = {"objective": "reg:squarederror", "tree_method": "exact", "max_depth": 5}

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)

In [None]:
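# Evaluate on the test set with Root Mean Squared Error (RMSE)

predictions = model.predict(dtest)
# np.sqrt(MSE) avoids the `squared=False` argument, which newer scikit-learn releases no longer accept
rmse = np.sqrt(mean_squared_error(y_test, predictions))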
print(f"RMSE: {rmse}")

RMSE: 538.0601629551486
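
RMSE is expressed in the units of the target, so the value above means the predictions are off by roughly 538 dollars in the root-mean-square sense. As a small worked illustration of the formula, the sketch below computes RMSE by hand on a few made-up prices; the `rmse` helper and the toy values are assumptions for this example, not part of the notebook's pipeline.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy prices in dollars (made-up values, not taken from the diamonds data)
actual = [300.0, 500.0, 1000.0]
predicted = [310.0, 480.0, 1020.0]

# Errors are -10, 20, -20, so the mean squared error is (100 + 400 + 400) / 3 = 300
toy_rmse = rmse(actual, predicted)
print(f"toy RMSE: {toy_rmse:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.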


### Incorporate validation

In [None]:
params = {"objective": "reg:squarederror", "tree_method": "exact", "max_depth": 5}
n = 100

# Evaluation sets to monitor during training (the test split doubles as the validation set here)
evals = [(dtrain, "train"), (dtest, "validation")]

In [None]:

model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=15,
)

[0]	train-rmse:2882.30339	validation-rmse:2878.35615
[15]	train-rmse:543.69613	validation-rmse:569.84565
[30]	train-rmse:502.37185	validation-rmse:547.94872
[45]	train-rmse:480.85806	validation-rmse:541.82053
[60]	train-rmse:464.29028	validation-rmse:538.52657
[75]	train-rmse:447.50316	validation-rmse:541.60391
[90]	train-rmse:435.67158	validation-rmse:539.10447
[99]	train-rmse:428.48040	validation-rmse:538.06016


In [None]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2882.30339	validation-rmse:2878.35615
[50]	train-rmse:475.20786	validation-rmse:541.01524
[100]	train-rmse:427.92908	validation-rmse:537.71564
[150]	train-rmse:397.07634	validation-rmse:541.32713
[152]	train-rmse:396.09247	validation-rmse:541.08236
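
The stopping rule behind `early_stopping_rounds` can be sketched in a few lines of plain Python. This is a simplified illustration rather than XGBoost's actual implementation: the `early_stop` helper and the score curve are made up, and lower scores are assumed to be better.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round) for a sequence of validation scores.

    Lower is better. Training stops once `patience` rounds have passed
    without a new best score; stop_round is None if that never happens.
    """
    best_round, best_score = 0, float("inf")
    for rnd, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, None


# Made-up validation RMSE curve: improves until round 3, then drifts upward
curve = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.8, 6.9]
best, stopped = early_stop(curve, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

Under this logic, a run with a patience of 50 that stops at round 152, as in the log above, would have reached its best validation score around round 102. On the trained booster, the `best_iteration` and `best_score` attributes report the values XGBoost actually recorded.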


In [None]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "exact", "max_depth": 5}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)


In [None]:
results.head()

| | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 2882.071247 | 8.116676 | 2883.527781 | 36.019748 |
| 1 | 2113.945404 | 5.698647 | 2117.894678 | 31.647177 |
| 2 | 1591.767738 | 5.099691 | 1597.508452 | 28.805518 |
| 3 | 1238.222797 | 4.385918 | 1248.097223 | 27.271571 |
| 4 | 1002.220567 | 4.779087 | 1018.119509 | 28.45302 |


In [None]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.7105620216503
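
Each `test-rmse-mean` value above is an average over the five held-out folds. To make the fold mechanics concrete, here is a minimal sketch of how k-fold splitting partitions sample indices. The `kfold_indices` helper is hypothetical and skips shuffling and stratification, which a real cross-validation routine would typically offer.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because every sample is used for testing once, the averaged score is less sensitive to one lucky or unlucky split than a single train/test evaluation.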

## Start W&B


- Log in to your W&B profile using the code below
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - WANDB_API_KEY - your API key, found under "Profile" -> "Settings" in the W&B App
    - WANDB_BASE_URL - the URL of the W&B server
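
For the environment-variable route, the sketch below shows one way to inspect the two variables named above without printing the key itself. The `wandb_env_settings` helper, the example values, and the fallback URL are assumptions made for this illustration.

```python
import os

DEFAULT_BASE_URL = "https://api.wandb.ai"  # assumed default for the hosted W&B service


def wandb_env_settings(environ=os.environ):
    """Collect the two W&B variables named above from an environment mapping."""
    return {
        # Report only whether a key is present, never the key itself
        "api_key_set": bool(environ.get("WANDB_API_KEY")),
        "base_url": environ.get("WANDB_BASE_URL", DEFAULT_BASE_URL),
    }


# Hypothetical values for illustration; never hard-code a real key in a notebook
example_env = {
    "WANDB_API_KEY": "<your-api-key>",
    "WANDB_BASE_URL": "https://wandb.example.com",
}
settings = wandb_env_settings(example_env)
print(settings)
```

In practice the variables would be set in the shell or with `os.environ[...]` before `wandb.login()` is called, so that no interactive prompt is needed.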



In [3]:
# Log in to your W&B account
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

True

In [None]:
# TO DO
# Start experiment tracking with W&B
# Do at least 5 experiments with various hyperparameters
# Choose any method for hyperparameter tuning: grid search, random search, bayesian search
# Describe your findings and what you see

In [4]:
sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'rmse',
        'goal': 'minimize'
    },
    'parameters': {
        'max_depth': {
            'values': [3, 5, 8, 10]
        },
        'learning_rate': {
            'values': [0.001, 0.01, 0.1]
        },
        'n_estimators': {
            'values': [100, 200, 300]
        }
    }
}
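
Since every parameter in this sweep takes a short list of discrete values, the search space is finite and easy to enumerate. The sketch below counts the combinations with `itertools.product`; the `search_space` dictionary simply repeats the candidate values from `sweep_config`.

```python
from itertools import product

# Same candidate values as in sweep_config above
search_space = {
    "max_depth": [3, 5, 8, 10],
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [100, 200, 300],
}

names = list(search_space)
# Every combination a grid search would visit, as a list of dicts
grid = [dict(zip(names, values)) for values in product(*search_space.values())]
n_configs = len(grid)
print(f"{n_configs} possible configurations")
```

With 4 x 3 x 3 = 36 combinations, a random sweep of 15 runs covers at most 15 of them, and it may draw the same combination more than once. Setting `'method': 'grid'` would visit each combination exactly once instead.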

In [5]:
sweep_file = wandb.sweep(sweep_config, project="industry_hw")

Create sweep with ID: qz5nm0f0
Sweep URL: https://wandb.ai/northwestern_yiyig/industry_hw/sweeps/qz5nm0f0


In [6]:
def train():
    with wandb.init() as run:
        config = run.config

        pipeline = Pipeline([
            ('regressor', XGBRegressor(
                max_depth=config.max_depth,
                learning_rate=config.learning_rate,
                n_estimators=config.n_estimators,
                tree_method='hist',
                objective='reg:squarederror',
                device='cuda'))  # needs XGBoost >= 2.0 and a CUDA GPU; use device='cpu' otherwise
        ])

        diamonds = sns.load_dataset('diamonds')
        diamonds = pd.get_dummies(diamonds, columns=['cut', 'color', 'clarity'])
        X, y = diamonds.drop('price', axis=1), diamonds[['price']]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        pipeline.fit(X_train, y_train)
        predictions = pipeline.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, predictions))

        wandb.log({'rmse': rmse})

In [7]:
wandb.agent(sweep_file, train, count=15)

[34m[1mwandb[0m: Agent Starting Run: fy20z73z with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 10
[34m[1mwandb[0m: 	n_estimators: 200
[34m[1mwandb[0m: Currently logged in as: [33myiyinggao[0m ([33mnorthwestern_yiyig[0m). Use [1m`wandb login --relogin`[0m to force relogin


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,799.79207


[34m[1mwandb[0m: Agent Starting Run: x2n529r3 with config:
[34m[1mwandb[0m: 	learning_rate: 0.1
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,552.01295


[34m[1mwandb[0m: Agent Starting Run: owxz4x7e with config:
[34m[1mwandb[0m: 	learning_rate: 0.1
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 200


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,560.14146


[34m[1mwandb[0m: Agent Starting Run: 6ncjqr0i with config:
[34m[1mwandb[0m: 	learning_rate: 0.1
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,552.01295


[34m[1mwandb[0m: Agent Starting Run: aksfg9xs with config:
[34m[1mwandb[0m: 	learning_rate: 0.1
[34m[1mwandb[0m: 	max_depth: 3
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,623.57359


[34m[1mwandb[0m: Agent Starting Run: 1uemnz9i with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 8
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,620.65364


[34m[1mwandb[0m: Agent Starting Run: i4196yi7 with config:
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 200


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,3331.09516


[34m[1mwandb[0m: Agent Starting Run: 30gx6fyk with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 10
[34m[1mwandb[0m: 	n_estimators: 200


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,799.79207


[34m[1mwandb[0m: Agent Starting Run: 0brk7k1a with config:
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	max_depth: 10
[34m[1mwandb[0m: 	n_estimators: 100


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,3622.3769


[34m[1mwandb[0m: Agent Starting Run: b3pf7w28 with config:
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,3052.36026


[34m[1mwandb[0m: Agent Starting Run: 5jexhowq with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 3
[34m[1mwandb[0m: 	n_estimators: 200


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,1286.10002


[34m[1mwandb[0m: Agent Starting Run: l6h40vlh with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 10
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,595.48507


[34m[1mwandb[0m: Agent Starting Run: 1s300fyv with config:
[34m[1mwandb[0m: 	learning_rate: 0.01
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 100


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,1752.31052


[34m[1mwandb[0m: Agent Starting Run: zrn5xy80 with config:
[34m[1mwandb[0m: 	learning_rate: 0.1
[34m[1mwandb[0m: 	max_depth: 5
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.011113438477777512, max=1.0…

VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,552.01295


[34m[1mwandb[0m: Sweep Agent: Waiting for job.
[34m[1mwandb[0m: Job received.
[34m[1mwandb[0m: Agent Starting Run: 9ca0cwzb with config:
[34m[1mwandb[0m: 	learning_rate: 0.001
[34m[1mwandb[0m: 	max_depth: 8
[34m[1mwandb[0m: 	n_estimators: 300


VBox(children=(Label(value='0.010 MB of 0.010 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
rmse,▁

0,1
rmse,3012.06549


This experiment tunes the hyperparameters of an XGBoost regression model that predicts diamond prices from the `diamonds` dataset in the Seaborn library. The categorical variables 'cut', 'color', and 'clarity' are one-hot encoded with `pd.get_dummies` to convert them into numerical format.

The tuning uses a Weights & Biases sweep with random search over the following parameters:

- `max_depth`
- `learning_rate`
- `n_estimators`

The goal is to find the combination of these parameters that yields the lowest RMSE on the test split.

The agent completed 15 runs, each with a configuration drawn at random from the candidate values. Random search can pick the same configuration more than once: `learning_rate=0.1`, `max_depth=5`, `n_estimators=300` ran three times, so only 12 distinct configurations were evaluated. That configuration also gave the lowest RMSE of the sweep (552.01). The learning rate had the largest effect: every run with `learning_rate=0.001` ended with an RMSE above 3000, which suggests that 100 to 300 boosting rounds are too few at that step size, while the runs with `learning_rate=0.1` stayed between 552 and 624. None of the sweep runs improved on the RMSE of 538.06 obtained earlier with the native API and categorical features.
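
The runs logged above can be summarized with a few lines of plain Python. In this sketch the `runs` list is copied by hand from the sweep output; the variable names are made up for the illustration, and a W&B dashboard or API export would normally replace the manual copy.

```python
# (learning_rate, max_depth, n_estimators, rmse) copied from the sweep output above
runs = [
    (0.01, 10, 200, 799.79207),
    (0.1, 5, 300, 552.01295),
    (0.1, 5, 200, 560.14146),
    (0.1, 5, 300, 552.01295),
    (0.1, 3, 300, 623.57359),
    (0.01, 8, 300, 620.65364),
    (0.001, 5, 200, 3331.09516),
    (0.01, 10, 200, 799.79207),
    (0.001, 10, 100, 3622.3769),
    (0.001, 5, 300, 3052.36026),
    (0.01, 3, 200, 1286.10002),
    (0.01, 10, 300, 595.48507),
    (0.01, 5, 100, 1752.31052),
    (0.1, 5, 300, 552.01295),
    (0.001, 8, 300, 3012.06549),
]

# Run with the lowest RMSE, and the set of distinct configurations tried
best = min(runs, key=lambda r: r[3])
distinct_configs = {r[:3] for r in runs}

# Lowest RMSE reached at each learning rate
best_by_lr = {}
for lr, _depth, _n, score in runs:
    best_by_lr[lr] = min(score, best_by_lr.get(lr, float("inf")))

print("best run:", best)
print("distinct configurations:", len(distinct_configs))
print("best RMSE per learning rate:", best_by_lr)
```

Grouping by learning rate makes the gap between `0.001` and the two larger values easy to see, and the distinct-configuration count shows how many of the 15 runs were repeats.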