<a href="https://colab.research.google.com/github/shubhi/msai-490-ai-industry-practicum/blob/main/week4-weights%26biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 - Models and Experimentation

## Step 1 Training a model

For this demo, we adapt this [DataCamp tutorial](https://www.datacamp.com/tutorial/xgboost-in-python) to train an XGBoost model, then run experiments and tune its hyperparameters.


If you run this notebook locally, create a virtual environment first:
- Use Python 3.10 or earlier.
- Create the virtual environment with `venv`:

`python -m venv NAME`

- Avoid Anaconda, Poetry, or similar package managers.
- Install packages with pip:

`python -m pip install <package-name>`

- When you are done working in the virtual environment, deactivate it with `deactivate`.

### Install packages

In [1]:
!pip install wandb -qU

In [2]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We will use the Diamonds dataset, loaded from Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

The Kaggle page describes each feature. We will be predicting the price of diamonds.

In [3]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

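|   | carat | cut     | color | clarity | depth | table | price | x    | y    | z    |
|---|-------|---------|-------|---------|-------|-------|-------|------|------|------|
| 0 | 0.23  | Ideal   | E     | SI2     | 61.5  | 55.0  | 326   | 3.95 | 3.98 | 2.43 |
| 1 | 0.21  | Premium | E     | SI1     | 59.8  | 61.0  | 326   | 3.89 | 3.84 | 2.31 |
| 2 | 0.23  | Good    | E     | VS1     | 56.9  | 65.0  | 327   | 4.05 | 4.07 | 2.31 |
| 3 | 0.29  | Premium | I     | VS2     | 62.4  | 58.0  | 334   | 4.2  | 4.23 | 2.63 |
| 4 | 0.31  | Good    | J     | SI2     | 63.3  | 58.0  | 335   | 4.34 | 4.35 | 2.75 |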


In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [5]:
diamonds.shape

(53940, 10)

In [6]:
X, y = diamonds.drop('price', axis=1), diamonds[['price']]

# cut, color and clarity must have the pandas category dtype so that XGBoost
# can handle them natively (together with enable_categorical=True below).

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [7]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [8]:
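# Define hyperparameters
# "gpu_hist" is deprecated as of XGBoost 2.0; use tree_method="hist" with device="cuda" instead
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}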

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)


In [9]:
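# Evaluate on the test set with Root Mean Squared Error (RMSE)

predictions = model.predict(dtest)
# Take the square root explicitly; the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse}")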

RMSE: 532.8838153117543
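
As a reminder of what this metric measures, RMSE is the square root of the mean squared difference between the true and predicted prices. A minimal sketch on made-up numbers follows; the three prices and the helper name `rmse_by_hand` are illustrative and not taken from the dataset or the notebook.

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative prices only (errors of 10, 20 and 20)
example_true = [300.0, 500.0, 1000.0]
example_pred = [310.0, 480.0, 1020.0]
example_rmse = rmse_by_hand(example_true, example_pred)
print(f"RMSE: {example_rmse:.2f}")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same unit as the target (here, price).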


### Incorporate validation

In [10]:
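# "gpu_hist" is deprecated as of XGBoost 2.0; use tree_method="hist" with device="cuda" instead
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 100

# Evaluation sets reported during training; the test split doubles as the validation set here
evals = [(dtrain, "train"), (dtest, "validation")]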

In [11]:
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[10]	train-rmse:550.99470	validation-rmse:571.16640
[20]	train-rmse:491.51435	validation-rmse:544.08058
[30]	train-rmse:464.38845	validation-rmse:537.01895
[40]	train-rmse:445.99106	validation-rmse:533.85127
[50]	train-rmse:430.36010	validation-rmse:532.90320
[60]	train-rmse:418.87898	validation-rmse:533.04629
[70]	train-rmse:409.66247	validation-rmse:533.58046
[80]	train-rmse:397.34048	validation-rmse:534.31963
[90]	train-rmse:389.94294	validation-rmse:532.61946
[99]	train-rmse:377.70831	validation-rmse:532.88383


In [12]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[50]	train-rmse:430.36010	validation-rmse:532.90320
[100]	train-rmse:377.56825	validation-rmse:532.79980
[103]	train-rmse:375.44970	validation-rmse:532.50220
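
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea behind `early_stopping_rounds`, not XGBoost's actual implementation; the function name and the validation scores are made up for the example.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round) for a metric that should decrease.

    Training stops at the first round that lies `patience` rounds past the
    best score seen so far. If that never happens, stop_round is the last round.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_round = 0
    for i, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = i  # new best validation score
        elif i - best_round >= patience:
            return i, best_round  # no improvement for `patience` rounds
    return len(scores) - 1, best_round


# Illustrative validation RMSE per boosting round
val_rmse = [900.0, 600.0, 550.0, 540.0, 541.0, 542.0, 543.0, 539.0]
stop_round, best_round = early_stopping_round(val_rmse, patience=3)
print(stop_round, best_round)
```

In this toy series the best score is at round 3 and training stops at round 6, so the later improvement at round 7 is never reached. That is the trade-off behind the patience value: a small value saves time but can stop before a late improvement.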


In [13]:
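# Cross-validation

# "gpu_hist" is deprecated as of XGBoost 2.0; use tree_method="hist" with device="cuda" instead
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 1000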

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)


In [14]:
results.head()

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|-----------------|----------------|----------------|---------------|
| 0 | 2861.153015     | 8.266765       | 2861.773555    | 36.937516     |
| 1 | 2081.378004     | 5.534608       | 2084.973481    | 32.064109     |
| 2 | 1545.361682     | 3.287745       | 1553.681211    | 31.059209     |
| 3 | 1182.364236     | 3.585787       | 1192.464771    | 26.157805     |
| 4 | 941.828819      | 2.971779       | 958.467497     | 23.613538     |


In [15]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.1039652582465
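
With `nfold=5`, the training data is split into five parts; each part is held out once while a model is trained on the other four, and the results table reports the mean and standard deviation of the metric across the five folds. The sketch below shows only the index splitting, under simplifying assumptions: it does not shuffle or stratify, and `kfold_indices` is a hypothetical helper rather than an XGBoost function.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "held out:", test_idx)
```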

## Start W&B


- Log in to your W&B profile using the code below.
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - WANDB_API_KEY - your API key, found under "Profile" -> "Settings" in the W&B App
    - WANDB_BASE_URL - the URL of the W&B server
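
If you go the environment-variable route, it can help to check which variables are set before starting a run. The helper below is a hypothetical sketch (the name `wandb_env_report` is not part of the W&B library); it reports only whether each variable is present and never prints the key itself.

```python
import os

# The two variables named in the list above
WANDB_VARS = ("WANDB_API_KEY", "WANDB_BASE_URL")


def wandb_env_report(env=None):
    """Return {variable_name: True/False} depending on whether it is set and non-empty."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name)) for name in WANDB_VARS}


# Example with an explicit mapping; call wandb_env_report() to inspect the real environment
report = wandb_env_report({"WANDB_API_KEY": "example-key"})
print(report)
```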



In [16]:
# Log in to your W&B account
import wandb

In [17]:
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# TO DO
# Start experiment tracking with W&B
# Do at least 5 experiments with various hyperparameters
# Choose any method for hyperparameter tuning: grid search, random search, bayesian search
# Describe your findings and what you see

In [18]:
sweep_config = {
    'method': 'bayes',  # Using Bayesian optimization
    'metric': {
        'name': 'rmse',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'uniform',
            'min': 0.01,
            'max': 0.1
        },
        'max_depth': {
            'distribution': 'int_uniform',
            'min': 3,
            'max': 9
        },
        'subsample': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.9
        },
        'colsample_bytree': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.9
        },
        'n_estimators': {
            'distribution': 'int_uniform',
            'min': 100,
            'max': 300
        },
        'reg_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        },
        'reg_lambda': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        }
    }
}
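
To make the `uniform` and `int_uniform` entries concrete, the sketch below draws a few configurations from part of the same ranges using Python's `random` module. Note that this is plain random sampling, not the Bayesian method the sweep uses, and `sample_config` is a hypothetical helper; it only illustrates what a single draw from the search space looks like.

```python
import random

# (distribution, min, max) for a subset of the sweep parameters above
search_space = {
    "learning_rate": ("uniform", 0.01, 0.1),
    "max_depth": ("int_uniform", 3, 9),
    "subsample": ("uniform", 0.5, 0.9),
    "n_estimators": ("int_uniform", 100, 300),
}


def sample_config(space, rng):
    """Draw one value per parameter from its declared distribution."""
    config = {}
    for name, (dist, low, high) in space.items():
        if dist == "uniform":
            config[name] = rng.uniform(low, high)  # continuous value
        elif dist == "int_uniform":
            config[name] = rng.randint(low, high)  # integer, both ends included
        else:
            raise ValueError(f"unsupported distribution: {dist}")
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_config(search_space, rng) for _ in range(5)]
for s in samples:
    print(s)
```

A Bayesian sweep starts from draws like these, then uses the metric values of finished runs to decide where in the space to sample next.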


In [19]:
sweep_id = wandb.sweep(sweep_config, project="xgboost_diamonds_sweep_v5", entity='shubhigupta2025')

Create sweep with ID: 70qp7pyh
Sweep URL: https://wandb.ai/shubhigupta2025/xgboost_diamonds_sweep_v5/sweeps/70qp7pyh


In [20]:
def train():
    # Initialize a W&B run
    run = wandb.init()

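    # Access the hyperparameters chosen by the sweep through wandb.config
    config = wandb.config

    # Booster parameters. The number of trees is set through num_boost_round
    # below; xgb.train does not read an "n_estimators" key from params.
    # Note: reg_alpha and reg_lambda are sampled by the sweep but are not
    # passed to the booster here, so they have no effect on the logged runs.
    params = {
        'objective': 'reg:squarederror',
        'learning_rate': config.learning_rate,
        'max_depth': int(config.max_depth),
        'subsample': config.subsample,
        'colsample_bytree': config.colsample_bytree,
        'eval_metric': 'rmse'
    }


    # Train the model
    model = xgb.train(params, dtrain, num_boost_round=int(config.n_estimators))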

    # Evaluate the model
    predictions = model.predict(dtest)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))

    # Log metrics
    wandb.log({'rmse': rmse})

    run.finish()


In [21]:
wandb.agent(sweep_id, train)

wandb: Agent Starting Run: j0r1dx7i with config:
wandb:     colsample_bytree: 0.6487029679976578
wandb:     learning_rate: 0.04425632624485143
wandb:     max_depth: 6
wandb:     n_estimators: 246
wandb:     reg_alpha: 0.8142253560247309
wandb:     reg_lambda: 0.4032885738617237
wandb:     subsample: 0.8610179687262243
wandb: Currently logged in as: shubhigupta2025. Use `wandb login --relogin` to force relogin

rmse: 533.34659

wandb: Agent Starting Run: syscrzks with config:
wandb:     colsample_bytree: 0.6033584665328677
wandb:     learning_rate: 0.0441048878891554
wandb:     max_depth: 8
wandb:     n_estimators: 236
wandb:     reg_alpha: 0.6129941551241375
wandb:     reg_lambda: 0.07547624872366165
wandb:     subsample: 0.6622373335421011

rmse: 534.11683

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: ehohhczg with config:
wandb:     colsample_bytree: 0.7566517277483122
wandb:     learning_rate: 0.04838156745770726
wandb:     max_depth: 7
wandb:     n_estimators: 115
wandb:     reg_alpha: 0.16520217519318592
wandb:     reg_lambda: 0.766003599431076
wandb:     subsample: 0.5771819861024324

rmse: 535.50793

wandb: Agent Starting Run: tv5yq7sk with config:
wandb:     colsample_bytree: 0.5880766549262627
wandb:     learning_rate: 0.05235242808198565
wandb:     max_depth: 7
wandb:     n_estimators: 248
wandb:     reg_alpha: 0.8030926699211278
wandb:     reg_lambda: 0.5967901568355315
wandb:     subsample: 0.8840120018794912

rmse: 530.3824

wandb: Agent Starting Run: vyfswxmm with config:
wandb:     colsample_bytree: 0.8228527578752438
wandb:     learning_rate: 0.06402912391150568
wandb:     max_depth: 8
wandb:     n_estimators: 190
wandb:     reg_alpha: 0.1882271296800541
wandb:     reg_lambda: 0.27165958329105444
wandb:     subsample: 0.5639946950298452

rmse: 531.68613

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 10mzyn1z with config:
wandb:     colsample_bytree: 0.6053522091587215
wandb:     learning_rate: 0.04311112586161467
wandb:     max_depth: 5
wandb:     n_estimators: 139
wandb:     reg_alpha: 0.1980423039235656
wandb:     reg_lambda: 0.902342887382237
wandb:     subsample: 0.8544642769849016

rmse: 568.65554

wandb: Agent Starting Run: 3fi303pd with config:
wandb:     colsample_bytree: 0.6668347619767728
wandb:     learning_rate: 0.05328091798672931
wandb:     max_depth: 9
wandb:     n_estimators: 243
wandb:     reg_alpha: 0.9427688603656184
wandb:     reg_lambda: 0.4135942528169418
wandb:     subsample: 0.8108339706110325

rmse: 534.70286

wandb: Agent Starting Run: lsiyfmkw with config:
wandb:     colsample_bytree: 0.5926452043237564
wandb:     learning_rate: 0.08671999867821084
wandb:     max_depth: 7
wandb:     n_estimators: 275
wandb:     reg_alpha: 0.96405470965418
wandb:     reg_lambda: 0.3872358304264987
wandb:     subsample: 0.8729397658475437

rmse: 542.57643

wandb: Agent Starting Run: fzu8e5ub with config:
wandb:     colsample_bytree: 0.8409373834860037
wandb:     learning_rate: 0.04710774451332946
wandb:     max_depth: 9
wandb:     n_estimators: 221
wandb:     reg_alpha: 0.4871009996885294
wandb:     reg_lambda: 0.3649433057515271
wandb:     subsample: 0.5626617602427579

rmse: 529.76083

wandb: Agent Starting Run: bbrihgdt with config:
wandb:     colsample_bytree: 0.8683947723279376
wandb:     learning_rate: 0.04134047333828361
wandb:     max_depth: 7
wandb:     n_estimators: 156
wandb:     reg_alpha: 0.6318692537070247
wandb:     reg_lambda: 0.2210900956533559
wandb:     subsample: 0.5394169226548503

rmse: 529.04779

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: pbjeu7xg with config:
wandb:     colsample_bytree: 0.8704557710607872
wandb:     learning_rate: 0.024533868933606404
wandb:     max_depth: 9
wandb:     n_estimators: 138
wandb:     reg_alpha: 0.48588791060416303
wandb:     reg_lambda: 0.3927732949260675
wandb:     subsample: 0.5056358706298488

rmse: 562.66072

wandb: Agent Starting Run: 60mj2koh with config:
wandb:     colsample_bytree: 0.8869385693151993
wandb:     learning_rate: 0.05553307752739365
wandb:     max_depth: 9
wandb:     n_estimators: 221
wandb:     reg_alpha: 0.468884644399385
wandb:     reg_lambda: 0.3336011710716471
wandb:     subsample: 0.5197394744398728

rmse: 531.86523

wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.


## Hyperparameter Optimization Sweep Report for XGBoost on Diamond Dataset

### Sweep Overview

The hyperparameter optimization was conducted using Weights & Biases with Bayesian optimization on the XGBoost model applied to a dataset of diamonds. The goal was to minimize the Root Mean Square Error (RMSE) by tuning several hyperparameters.

### Hyperparameter Space

The following hyperparameters were included in the sweep:

- **Learning Rate**: Uniformly distributed between 0.01 and 0.1.
- **Max Depth**: Integer values uniformly distributed between 3 and 9.
- **Subsample**: Uniform distribution between 0.5 and 0.9.
- **Colsample_bytree**: Uniform distribution between 0.5 and 0.9.
- **N_estimators**: Integer values uniformly distributed between 100 and 300.
- **Reg_alpha**: Uniform distribution between 0.0 and 1.0 (L1 regularization).
- **Reg_lambda**: Uniform distribution between 0.0 and 1.0 (L2 regularization).

### Results Summary

The Bayesian sweep ran twelve times, each run testing a different combination of hyperparameters. The results of all twelve runs are summarized below:

| Run ID       | RMSE       | Learning Rate | Max Depth | N_estimators | Reg_alpha | Reg_lambda | Subsample | Colsample_bytree |
|--------------|------------|---------------|-----------|--------------|-----------|------------|-----------|------------------|
| j0r1dx7i     | 533.34659  | 0.0443        | 6         | 246          | 0.814     | 0.403      | 0.861     | 0.649            |
| syscrzks     | 534.11683  | 0.0441        | 8         | 236          | 0.613     | 0.0755     | 0.662     | 0.603            |
| ehohhczg     | 535.50793  | 0.0484        | 7         | 115          | 0.165     | 0.766      | 0.577     | 0.757            |
| tv5yq7sk     | 530.3824   | 0.0524        | 7         | 248          | 0.803     | 0.597      | 0.884     | 0.588            |
| vyfswxmm     | 531.68613  | 0.0640        | 8         | 190          | 0.188     | 0.272      | 0.564     | 0.823            |
| 10mzyn1z     | 568.65554  | 0.0431        | 5         | 139          | 0.198     | 0.902      | 0.854     | 0.605            |
| 3fi303pd     | 534.70286  | 0.0533        | 9         | 243          | 0.943     | 0.414      | 0.811     | 0.667            |
| lsiyfmkw     | 542.57643  | 0.0867        | 7         | 275          | 0.964     | 0.387      | 0.873     | 0.593            |
| fzu8e5ub     | 529.76083  | 0.0471        | 9         | 221          | 0.487     | 0.365      | 0.563     | 0.841            |
| bbrihgdt     | 529.04779  | 0.0413        | 7         | 156          | 0.632     | 0.221      | 0.539     | 0.868            |
| pbjeu7xg     | 562.66072  | 0.0245        | 9         | 138          | 0.486     | 0.393      | 0.506     | 0.870            |
| 60mj2koh     | 531.86523  | 0.0555        | 9         | 221          | 0.469     | 0.334      | 0.520     | 0.887            |

### Analysis

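- **Best Performing Configurations**: The lowest RMSEs came from runs `bbrihgdt` (529.04779) and `fzu8e5ub` (529.76083). Both combine a high `colsample_bytree` (0.84 to 0.87), a moderate `learning_rate` (0.041 to 0.047), a low `subsample` (0.54 to 0.56), and a mid-range `n_estimators` (156 and 221).
- **Parameter Influence**: Neither `max_depth` nor `subsample` shows a consistent trend on its own across the twelve runs. The two worst runs, `10mzyn1z` (568.65554) and `pbjeu7xg` (562.66072), both used fewer than 140 boosting rounds together with either the shallowest trees in the sweep (`max_depth` 5) or the lowest learning rate (0.0245), which is consistent with underfitting. `reg_alpha` and `reg_lambda` were sampled by the sweep but are not passed to the booster in `train()`, so they cannot account for any of the differences.
- **Variability**: RMSE ranged from 529.05 to 568.66. Ten of the twelve runs fall between 529.05 and 542.58, close to the 532.88 obtained with default parameters in Step 1, so the best sweep result improves on that baseline by less than 1%. With only twelve runs evaluated on a single test split, differences of a few RMSE points should not be over-interpreted.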
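
The ranking behind this analysis can be reproduced directly from the results table. The sketch below copies the run IDs and RMSE values from the table, sorts them, and compares the best run with the default-parameter RMSE from Step 1; the variable names are illustrative.

```python
# Run ID -> RMSE, copied from the results table above
sweep_rmse = {
    "j0r1dx7i": 533.34659, "syscrzks": 534.11683, "ehohhczg": 535.50793,
    "tv5yq7sk": 530.3824, "vyfswxmm": 531.68613, "10mzyn1z": 568.65554,
    "3fi303pd": 534.70286, "lsiyfmkw": 542.57643, "fzu8e5ub": 529.76083,
    "bbrihgdt": 529.04779, "pbjeu7xg": 562.66072, "60mj2koh": 531.86523,
}

# RMSE of the default-parameter model trained in Step 1
baseline_rmse = 532.8838153117543

# Lower RMSE is better, so sort ascending
ranked = sorted(sweep_rmse.items(), key=lambda item: item[1])
best_id, sweep_best_rmse = ranked[0]
worst_id, sweep_worst_rmse = ranked[-1]

# Relative improvement of the best sweep run over the baseline, in percent
improvement_pct = 100 * (baseline_rmse - sweep_best_rmse) / baseline_rmse

print(f"best:  {best_id} ({sweep_best_rmse})")
print(f"worst: {worst_id} ({sweep_worst_rmse})")
print(f"improvement over baseline: {improvement_pct:.2f}%")
```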