<a href="https://colab.research.google.com/github/PavanKumarDharmoju/weights-and-biases/blob/main/Weights_and_biases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Weights and Biases

## Step 1: Training a model

This demo adapts the DataCamp [XGBoost tutorial](https://www.datacamp.com/tutorial/xgboost-in-python): we train an XGBoost model, then run some experiments and hyperparameter tuning.


If you run this notebook locally, set up a virtual environment first:
- Use Python 3.10 or earlier.
- Create the virtual environment with `venv`:

`python -m venv NAME`

- Activate it with `source NAME/bin/activate` (on Windows: `NAME\Scripts\activate`).
- Avoid Anaconda, Poetry, and similar package management platforms.
- Install packages with pip:

`python -m pip install <package-name>`

- When you are done working in the virtual environment, deactivate it with `deactivate`.
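Before installing anything, it can help to confirm which interpreter the environment uses. The check below is a minimal sketch: the helper name `python_version_ok` is hypothetical, and the 3.10 ceiling simply mirrors the recommendation above.

```python
import sys


def python_version_ok(version=None, max_version=(3, 10)):
    """Return True if the (major, minor) version does not exceed max_version."""
    major_minor = tuple(version or sys.version_info)[:2]
    return major_minor <= max_version


if not python_version_ok():
    # Only a hint; newer interpreters may still work for parts of the notebook.
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is newer than 3.10; "
        "consider creating the virtual environment with an older interpreter."
    )
```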

### Install packages

In [None]:
!pip install wandb -qU

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


### Import data

We use the Diamonds dataset, loaded through Seaborn. It is also available on [Kaggle](https://www.kaggle.com/datasets/shivam2503/diamonds).

The Kaggle page describes each feature. Our target is the price of each diamond.

In [None]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

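|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |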


In [None]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB


In [None]:
diamonds.shape

(53940, 10)

In [None]:
X,y = diamonds.drop('price', axis=1), diamonds[['price']]

# Cast cut, color, and clarity to the pandas 'category' dtype so that XGBoost
# can handle them natively (together with enable_categorical=True below).

X['cut'] = X['cut'].astype('category')
X['color'] = X['color'].astype('category')
X['clarity'] = X['clarity'].astype('category')

### Split the data and train a model

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [None]:
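# Define hyperparameters. XGBoost 2.0+ deprecates "gpu_hist" in favor of
# tree_method="hist" with device="cuda" (use device="cpu" if no GPU is available).
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}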

n = 100
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
)


In [None]:
# Define evaluation metrics - Root Mean Squared Error

predictions = model.predict(dtest)
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"RMSE: {rmse}")

RMSE: 532.8838153117543
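For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values, so it is expressed in the same unit as the target (here, price). The snippet below is a minimal standard-library sketch of that formula; the numbers are made up for illustration and the helper is not part of the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices and predictions, for illustration only.
actual = [326.0, 334.0, 335.0]
predicted = [330.0, 331.0, 335.0]
print(f"RMSE: {rmse(actual, predicted):.4f}")
```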



### Incorporate validation

In [None]:
params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 100

# Create the validation set
evals = [(dtrain, "train"), (dtest, "validation")]

In [None]:
model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=10,
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[10]	train-rmse:550.99470	validation-rmse:571.16640
[20]	train-rmse:491.51435	validation-rmse:544.08058
[30]	train-rmse:464.38845	validation-rmse:537.01895
[40]	train-rmse:445.99106	validation-rmse:533.85127
[50]	train-rmse:430.36010	validation-rmse:532.90320
[60]	train-rmse:418.87898	validation-rmse:533.04629
[70]	train-rmse:409.66247	validation-rmse:533.58046
[80]	train-rmse:397.34048	validation-rmse:534.31963
[90]	train-rmse:389.94294	validation-rmse:532.61946
[99]	train-rmse:377.70831	validation-rmse:532.88383
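The log shows the training RMSE falling steadily while the validation RMSE levels off around 533, which is the usual sign that extra rounds mostly fit the training set. As a rough sketch, the gap can be computed straight from the printed lines. The regular expression and helper name are illustrative assumptions; the values are copied from the log above.

```python
import re

# Lines copied from the training log above (whitespace-separated).
LOG = """
[0] train-rmse:2859.49097 validation-rmse:2851.62630
[10] train-rmse:550.99470 validation-rmse:571.16640
[20] train-rmse:491.51435 validation-rmse:544.08058
[30] train-rmse:464.38845 validation-rmse:537.01895
[40] train-rmse:445.99106 validation-rmse:533.85127
[50] train-rmse:430.36010 validation-rmse:532.90320
[60] train-rmse:418.87898 validation-rmse:533.04629
[70] train-rmse:409.66247 validation-rmse:533.58046
[80] train-rmse:397.34048 validation-rmse:534.31963
[90] train-rmse:389.94294 validation-rmse:532.61946
[99] train-rmse:377.70831 validation-rmse:532.88383
"""

PATTERN = re.compile(r"\[(\d+)\]\s+train-rmse:([\d.]+)\s+validation-rmse:([\d.]+)")


def parse_eval_log(text):
    """Return a list of (round, train_rmse, validation_rmse) tuples."""
    return [(int(m.group(1)), float(m.group(2)), float(m.group(3)))
            for m in PATTERN.finditer(text)]


rows = parse_eval_log(LOG)
for rnd, train, val in rows:
    print(f"round {rnd:3d}  validation - train = {val - train:8.2f}")

# Best validation score among the rounds that were printed.
best_round, _, best_val = min(rows, key=lambda row: row[2])
print(f"best logged validation RMSE: {best_val} at round {best_round}")
```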


In [None]:
# Incorporate early stopping
n = 10000


model = xgb.train(
   params=params,
   dtrain=dtrain,
   num_boost_round=n,
   evals=evals,
   verbose_eval=50,
   # Activate early stopping
   early_stopping_rounds=50
)

[0]	train-rmse:2859.49097	validation-rmse:2851.62630
[50]	train-rmse:430.36010	validation-rmse:532.90320
[100]	train-rmse:377.56825	validation-rmse:532.79980
[102]	train-rmse:376.20429	validation-rmse:532.59813
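Early stopping halts training once the validation metric has not improved for a given number of consecutive rounds. The function below is a simplified, standard-library sketch of that rule for a metric that is minimized. It is not XGBoost's implementation, and the scores are invented.

```python
def early_stopping_round(val_scores, patience):
    """Return (stop_round, best_round) for a metric that should be minimized.

    Training stops at the first round that is `patience` rounds past the
    best score seen so far; otherwise it runs through all the scores.
    """
    best_round, best_score = 0, float("inf")
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(val_scores) - 1, best_round


# Invented validation scores: the best value appears at round 2.
scores = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3]
stop_round, best_round = early_stopping_round(scores, patience=3)
print(f"stopped at round {stop_round}, best round was {best_round}")
```

Under this rule, a run that stops at round 102 with a patience of 50 would have seen its best score at round 52.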


In [None]:
# Cross-validation

params = {"objective": "reg:squarederror", "tree_method": "hist", "device": "cuda"}
n = 1000

results = xgb.cv(
   params, dtrain,
   num_boost_round=n,
   nfold=5,
   early_stopping_rounds=20
)



In [None]:
results.head()

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 2861.153015 | 8.266765 | 2861.773555 | 36.937516 |
| 1 | 2081.378004 | 5.534608 | 2084.973481 | 32.064109 |
| 2 | 1545.361682 | 3.287745 | 1553.681211 | 31.059209 |
| 3 | 1182.364236 | 3.585787 | 1192.464771 | 26.157805 |
| 4 | 941.828819 | 2.971779 | 958.467497 | 23.613538 |


In [None]:
best_rmse = results['test-rmse-mean'].min()

best_rmse

549.1039652582465
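Each row of the cross-validation result summarizes one boosting round across the folds, and `best_rmse` is the smallest of the per-round test means. The sketch below reproduces that aggregation with the standard library. The fold scores are hypothetical, and the use of the population standard deviation is an assumption about how the spread is reported.

```python
from statistics import mean, pstdev

# Hypothetical validation RMSE per boosting round (rows) for three folds (columns).
fold_scores = [
    [2860.0, 2850.0, 2875.0],
    [1550.0, 1540.0, 1570.0],
    [560.0, 545.0, 551.0],
    [562.0, 548.0, 549.0],
]


def summarize_cv(scores_per_round):
    """Return one (mean, std) pair per boosting round."""
    return [(mean(scores), pstdev(scores)) for scores in scores_per_round]


summary = summarize_cv(fold_scores)

# Pick the round with the lowest mean validation RMSE.
best_round, (best_mean, best_std) = min(enumerate(summary), key=lambda item: item[1][0])
print(f"best round: {best_round}, mean RMSE: {best_mean:.2f} (std {best_std:.2f})")
```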

## Start W&B


- Log in to your W&B profile using the code below.
- Alternatively, set environment variables. Several environment variables change the behavior of W&B logging; the most important are:
    - `WANDB_API_KEY`: your API key, found under "Profile" -> "Settings" in the W&B App
    - `WANDB_BASE_URL`: the URL of the W&B server
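If you prefer the environment-variable route, the variables need to be set before W&B is first used. The helper below is a minimal sketch: `configure_wandb_env` is a hypothetical name, the key is a placeholder, and the default URL assumes the hosted W&B service rather than a self-managed server.

```python
import os


def configure_wandb_env(api_key, base_url="https://api.wandb.ai"):
    """Set the W&B environment variables for the current process only."""
    os.environ["WANDB_API_KEY"] = api_key
    os.environ["WANDB_BASE_URL"] = base_url


# Placeholder value: replace it with your own key, and never commit a real key
# to a notebook or repository.
configure_wandb_env("YOUR_API_KEY")
print(os.environ["WANDB_BASE_URL"])
```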



In [None]:
# Log in to your W&B account
import wandb

wandb.login()

wandb: Currently logged in as: pavandharmoju (pavanresearch). Use `wandb login --relogin` to force relogin


True

In [None]:
sweep_config = {
    'method': 'random',  # Can be 'grid', 'random', or 'bayes'
    'metric': {
        'name': 'rmse',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'min': 0.01,
            'max': 0.2
        },
        'max_depth': {
            'values': [3, 5, 7, 9]
        },
        'subsample': {
            'min': 0.6,
            'max': 0.9
        },
        'colsample_bytree': {
            'min': 0.6,
            'max': 0.9
        },
        'n_estimators': {
            'values': [50, 100, 150, 200]
        }
    }
}
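With the `random` method, each run receives one independently sampled configuration. The sketch below imitates that idea with the standard library: it draws uniformly between `min` and `max` and picks uniformly from `values`. This is an approximation for illustration, not W&B's sampler, and `sample_config` is a hypothetical helper.

```python
import random


def sample_config(parameters, rng=random):
    """Draw one configuration from a sweep-style parameter specification."""
    config = {}
    for name, spec in parameters.items():
        if "values" in spec:
            # Discrete choice among the listed values
            config[name] = rng.choice(spec["values"])
        else:
            # Continuous draw between the bounds (uniform is an assumption)
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config


parameters = {
    "learning_rate": {"min": 0.01, "max": 0.2},
    "max_depth": {"values": [3, 5, 7, 9]},
    "subsample": {"min": 0.6, "max": 0.9},
    "colsample_bytree": {"min": 0.6, "max": 0.9},
    "n_estimators": {"values": [50, 100, 150, 200]},
}

rng = random.Random(0)  # seeded so the sketch is repeatable
print(sample_config(parameters, rng))
```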


In [None]:
import wandb

sweep_id = wandb.sweep(sweep_config, project="xgboost_diamonds_sweep_experiments", entity='pavandharmoju')


Create sweep with ID: nrzboqad
Sweep URL: https://wandb.ai/pavandharmoju/xgboost_diamonds_sweep_experiments/sweeps/nrzboqad


In [None]:
def train():
    # Initialize a W&B run
    run = wandb.init()

    # Access the hyperparameters through wandb.config
    config = wandb.config

    # Define the model
    params = {
        'objective': 'reg:squarederror',
        'learning_rate': config.learning_rate,
        'max_depth': int(config.max_depth),
        'subsample': config.subsample,
        'colsample_bytree': config.colsample_bytree,
        'eval_metric': 'rmse'
    }

    # Train the model. With xgb.train the number of trees is set by
    # num_boost_round; an 'n_estimators' entry in params would be ignored.
    model = xgb.train(params, dtrain, num_boost_round=int(config.n_estimators))

    # Evaluate the model
    predictions = model.predict(dtest)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))

    # Log metrics
    wandb.log({'rmse': rmse})

    run.finish()


In [None]:
wandb.agent(sweep_id, train)

wandb: Agent Starting Run: 6gfns4qy with config:
wandb: 	colsample_bytree: 0.8721852812160612
wandb: 	learning_rate: 0.10543401200754732
wandb: 	max_depth: 5
wandb: 	n_estimators: 200
wandb: 	subsample: 0.8827807666925989
wandb: Currently logged in as: pavandharmoju. Use `wandb login --relogin` to force relogin

rmse: 533.9805

wandb: Agent Starting Run: wp0kn5ky with config:
wandb: 	colsample_bytree: 0.7233831099982967
wandb: 	learning_rate: 0.09199680902788608
wandb: 	max_depth: 3
wandb: 	n_estimators: 200
wandb: 	subsample: 0.7715684068652043

rmse: 576.59301

wandb: Agent Starting Run: 1q7y3qv5 with config:
wandb: 	colsample_bytree: 0.7050223344036699
wandb: 	learning_rate: 0.062081427442106735
wandb: 	max_depth: 9
wandb: 	n_estimators: 50
wandb: 	subsample: 0.7500189687798899

rmse: 598.59432

wandb: Agent Starting Run: xjqsanvw with config:
wandb: 	colsample_bytree: 0.7723809156777699
wandb: 	learning_rate: 0.0491656842054201
wandb: 	max_depth: 5
wandb: 	n_estimators: 100
wandb: 	subsample: 0.8388963760375594

rmse: 563.89694

wandb: Agent Starting Run: 8glgv2s0 with config:
wandb: 	colsample_bytree: 0.6392808452239291
wandb: 	learning_rate: 0.017168999761187714
wandb: 	max_depth: 7
wandb: 	n_estimators: 50
wandb: 	subsample: 0.8693235766968883

rmse: 1868.47987

wandb: Agent Starting Run: baqfjdud with config:
wandb: 	colsample_bytree: 0.7926860515476201
wandb: 	learning_rate: 0.1062646486843946
wandb: 	max_depth: 5
wandb: 	n_estimators: 200
wandb: 	subsample: 0.7437248354719405

rmse: 535.12056

wandb: Agent Starting Run: 2jwg3eb7 with config:
wandb: 	colsample_bytree: 0.8924666901340875
wandb: 	learning_rate: 0.028815912974652844
wandb: 	max_depth: 3
wandb: 	n_estimators: 150
wandb: 	subsample: 0.7984276672054719

rmse: 757.77383

wandb: Agent Starting Run: fysq5fvd with config:
wandb: 	colsample_bytree: 0.6910431691986211
wandb: 	learning_rate: 0.061841582667712953
wandb: 	max_depth: 9
wandb: 	n_estimators: 100
wandb: 	subsample: 0.853416628532357

rmse: 533.42644

wandb: Agent Starting Run: 0dk0qwqq with config:
wandb: 	colsample_bytree: 0.7879126178034734
wandb: 	learning_rate: 0.1596815070224063
wandb: 	max_depth: 5
wandb: 	n_estimators: 50
wandb: 	subsample: 0.871375863357573

rmse: 542.93392

wandb: Agent Starting Run: rp93klp7 with config:
wandb: 	colsample_bytree: 0.7983299733272369
wandb: 	learning_rate: 0.06151642810780787
wandb: 	max_depth: 3
wandb: 	n_estimators: 200
wandb: 	subsample: 0.7232341550797495

rmse: 600.45896

wandb: Agent Starting Run: j7hwjuyh with config:
wandb: 	colsample_bytree: 0.7087480918840534
wandb: 	learning_rate: 0.08908882135167041
wandb: 	max_depth: 7
wandb: 	n_estimators: 150
wandb: 	subsample: 0.6387753392249019

rmse: 535.64313

wandb: Sweep Agent: Waiting for job.
wandb: Sweep Agent: Exiting.

## Hyperparameters and RMSE Observations

### Hyperparameter Variability and Impact:

- **colsample_bytree**: Ranged from about 0.639 to 0.892 across runs. This parameter sets the fraction of features sampled for each tree. No clear trend in RMSE is visible from the logs alone, which suggests that its effect depends on the other parameters.
  
- **learning_rate**: Ranged from about 0.017 to 0.16. The best runs used learning rates of roughly 0.06 to 0.11. The two lowest rates (below 0.03) produced the two worst scores, most likely because at most 200 trees is too few for such small steps.
  
- **max_depth**: Took the values 3, 5, 7, and 9. Shallower trees tend toward higher bias, while deeper trees can overfit, particularly on smaller datasets. In this sweep, all three runs with `max_depth` 3 ended with an RMSE above 575.
  
- **n_estimators**: Took the values 50, 100, 150, and 200. More trees generally improve performance up to a point, after which gains plateau or performance declines due to overfitting.
  
- **subsample**: Ranged from about 0.639 to 0.883. This parameter sets the fraction of the training data randomly sampled for each tree. Values below 1 add randomness that can reduce overfitting, while higher values let each tree see more of the data.
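To check these observations against the numbers, it helps to put the runs side by side. The sketch below copies the run IDs, three of the hyperparameters, and the RMSE values from the agent log above (learning rates rounded to three decimals), then ranks the runs and groups them by `max_depth`. The tuple layout is an arbitrary choice for illustration.

```python
from collections import defaultdict

# (run id, learning_rate, max_depth, n_estimators, rmse), copied from the agent log.
runs = [
    ("6gfns4qy", 0.105, 5, 200, 533.98050),
    ("wp0kn5ky", 0.092, 3, 200, 576.59301),
    ("1q7y3qv5", 0.062, 9, 50, 598.59432),
    ("xjqsanvw", 0.049, 5, 100, 563.89694),
    ("8glgv2s0", 0.017, 7, 50, 1868.47987),
    ("baqfjdud", 0.106, 5, 200, 535.12056),
    ("2jwg3eb7", 0.029, 3, 150, 757.77383),
    ("fysq5fvd", 0.062, 9, 100, 533.42644),
    ("0dk0qwqq", 0.160, 5, 50, 542.93392),
    ("rp93klp7", 0.062, 3, 200, 600.45896),
    ("j7hwjuyh", 0.089, 7, 150, 535.64313),
]

# Rank runs from best (lowest RMSE) to worst.
ranked = sorted(runs, key=lambda run: run[-1])
for run_id, lr, depth, n_trees, rmse in ranked:
    print(f"{run_id}  lr={lr:<5}  depth={depth}  trees={n_trees:<3}  rmse={rmse:.2f}")

# Collect the RMSE values observed for each max_depth.
by_depth = defaultdict(list)
for _, _, depth, _, rmse in runs:
    by_depth[depth].append(rmse)
for depth in sorted(by_depth):
    print(f"max_depth={depth}: best RMSE {min(by_depth[depth]):.2f} over {len(by_depth[depth])} runs")
```

With only eleven runs and several parameters changing at once, such a table can suggest patterns but cannot isolate the effect of any single hyperparameter.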

### Performance Analysis:

- RMSE ranged widely, from about 533 to about 1868; lower is better. The worst run (RMSE 1868) combined a very low `learning_rate` (0.017) with only 50 trees (`n_estimators`) at `max_depth` 7, so the model most likely stopped well before it had converged.
  
- The best runs (RMSE of about 533 to 536) used learning rates between about 0.06 and 0.11, 100 to 200 trees, and a `max_depth` of 5 to 9. None of them improved on the untuned baseline RMSE of about 532.9 from Step 1.
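A back-of-the-envelope calculation makes the underfitting explanation more concrete. Assume, as a deliberate simplification that real boosted trees do not satisfy exactly, that every tree could fit the current residual perfectly. A learning rate of `lr` would then remove a fraction `lr` of the residual per round, leaving `(1 - lr) ** n` of it after `n` rounds. The function name below is hypothetical.

```python
def residual_fraction(learning_rate, n_rounds):
    """Fraction of the initial residual left after n_rounds of idealized boosting."""
    return (1.0 - learning_rate) ** n_rounds


# Learning rate and tree count of the worst run and of the two best runs (rounded).
for lr, n in [(0.017, 50), (0.062, 100), (0.105, 200)]:
    fraction = residual_fraction(lr, n)
    print(f"lr={lr}, rounds={n}: {fraction:.6f} of the initial residual remains")
```

Even under this optimistic assumption, the worst run's settings leave a large share of the initial error unexplained, while the settings of the best runs leave almost none.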
