## Weights & Biases workshop

* Video: https://www.youtube.com/watch?v=yNyqFMwEyL4
* GitHub repository: https://wandb.me/mlops-zoomcamp-github


## Homework with Weights & Biases

The goal of this homework is to get familiar with Weights & Biases for experiment tracking, model management, hyperparameter optimization, and more.

Before getting started with the homework, you need a Weights & Biases account. You can create one by visiting [wandb.ai/site](https://wandb.ai/site) and clicking the **Sign Up** button.

# Q1. Install the Package

To get started with Weights & Biases you'll need to install the appropriate Python package.

For this we recommend creating a separate Python environment (for example, with [conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-envs)) and then installing the package there with `pip` or `conda`.

Install the following libraries:

* `pandas`
* `matplotlib`
* `scikit-learn`
* `pyarrow`
* `wandb`
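
Before moving on, it can help to confirm that the libraries listed above are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the homework code.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the list above.
REQUIRED = ["pandas", "matplotlib", "scikit-learn", "pyarrow", "wandb"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can then be added with `pip install <name>` inside the same environment.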


In [None]:
# pip install pandas matplotlib scikit-learn pyarrow wandb

In [6]:
print('Q1. Version:')
!wandb --version

Q1. Version:
wandb, version 0.15.3


# Q2. Download and preprocess the data

We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip. 

Download the data for January, February and March 2022 in parquet format from [here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
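
The three files can also be fetched from a script rather than by hand. The sketch below assumes that the TLC page links to files named `green_tripdata_<year>-<month>.parquet` under the base URL shown; both the URL and the helper names are assumptions, so check them against the links on the page before relying on them. The destination `./Data` matches the `--raw_data_path` used in the next cell.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: base URL and file naming pattern behind the links on the TLC page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def green_taxi_files(year, months):
    """Return the expected parquet file names for the given year and months."""
    return [f"green_tripdata_{year}-{month:02d}.parquet" for month in months]


def download(dest_dir="./Data", year=2022, months=(1, 2, 3)):
    """Download each file into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in green_taxi_files(year, months):
        target = dest / name
        if not target.exists():
            urlretrieve(f"{BASE_URL}/{name}", str(target))
    return sorted(p.name for p in dest.glob("*.parquet"))
```

Calling `download()` with the defaults would fetch January, February and March 2022.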


In [4]:
!python homework-wandb/preprocess_data.py \
  --wandb_project mlops_dtc \
  --wandb_entity rifrif \
  --raw_data_path ./Data  \
  --dest_path output

wandb: Currently logged in as: rifrif. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /home/jovyan/MLOps-Zoomcamp-2023/Week2/wandb/run-20230606_081420-wh5onw4i
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sleek-rain-6
wandb: ⭐️ View project at https://wandb.ai/rifrif/mlops_dtc
wandb: 🚀 View run at https://wandb.ai/rifrif/mlops_dtc/runs/wh5onw4i
wandb: Adding directory to artifact (./output)... Done. 0.0s
wandb: Waiting for W&B process to finish... (success).
wandb: 🚀 View run sleek-rain-6 at: https://wandb.ai/rifrif/mlops_dtc/runs/wh5onw4i
wandb: Synced 6 W&B file(s), 0 media file(s), 4 artifact file(s) and 0 other file(s)

In [5]:
import os
print('Question 2: Size of the DictVectorizer file:', os.path.getsize("output/dv.pkl"), 'bytes')

Question 2: Size of the DictVectorizer file: 153660 bytes


# Q3. Train a model with Weights & Biases logging

We will train a `RandomForestRegressor` (from Scikit-Learn) on the taxi dataset.

We have prepared the training script `train.py` for this exercise, which can also be found in the folder `homework-wandb`.

The script will:

* initialize a Weights & Biases run,
* load the preprocessed datasets by fetching them from the Weights & Biases artifact previously created,
* train the model on the training set,
* calculate the MSE score on the validation set and log it to Weights & Biases,
* save the trained model and log it to Weights & Biases as a model artifact.
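
The steps above can be sketched roughly as follows. This is not the actual `train.py` from the repository, only an illustration of the same flow. The file names `train.pkl`, `val.pkl` and `regressor.pkl`, the default `max_depth=10`, and the function names are assumptions made for the example.

```python
import os
import pickle


def load_pickle(filename):
    """Load a pickled object (here: a (features, target) tuple) from disk."""
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def run_train(wandb_project, wandb_entity, data_artifact, max_depth=10, random_state=0):
    # Imported lazily so load_pickle stays usable without these packages installed.
    import wandb
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # 1. Initialize a Weights & Biases run.
    run = wandb.init(project=wandb_project, entity=wandb_entity, job_type="train")

    # 2. Fetch the preprocessed datasets from the artifact created earlier.
    #    Assumption: the artifact contains train.pkl and val.pkl.
    data_path = run.use_artifact(data_artifact).download()
    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_val, y_val = load_pickle(os.path.join(data_path, "val.pkl"))

    # 3. Train the model on the training set.
    model = RandomForestRegressor(max_depth=max_depth, random_state=random_state)
    model.fit(X_train, y_train)

    # 4. Compute the MSE on the validation set and log it.
    mse = mean_squared_error(y_val, model.predict(X_val))
    run.log({"MSE": mse})

    # 5. Save the model and log it as a model artifact.
    with open("regressor.pkl", "wb") as f_out:
        pickle.dump(model, f_out)
    model_artifact = wandb.Artifact(f"{run.id}-model", type="model")
    model_artifact.add_file("regressor.pkl")
    run.log_artifact(model_artifact)

    run.finish()
    return mse
```

Logging the model as an artifact of type `model` is what later makes it available for linking to the model registry.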

In [8]:
!python homework-wandb/train.py \
  --wandb_project mlops_dtc \
  --wandb_entity rifrif \
  --data_artifact "rifrif/mlops_dtc/NYC-Taxi:v1"


wandb: Currently logged in as: rifrif. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /home/jovyan/MLOps-Zoomcamp-2023/Week2/wandb/run-20230606_083422-ncta458t
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run warm-snowball-8
wandb: ⭐️ View project at https://wandb.ai/rifrif/mlops_dtc
wandb: 🚀 View run at https://wandb.ai/rifrif/mlops_dtc/runs/ncta458t
wandb:   4 of 4 files downloaded.  
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: MSE ▁
wandb: 
wandb: Run summary:
wandb: MSE 2.45398
wandb: 
wandb: 🚀 View run warm-snowball-8 at: ht

# Q4. Tune model hyperparameters

Now let's try to reduce the validation error by tuning the hyperparameters of the `RandomForestRegressor` using [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps). We have prepared the script `sweep.py` for this exercise in the `homework-wandb` directory.
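
To give a sense of what a sweep definition looks like, here is a hedged sketch of a configuration for the four `RandomForestRegressor` hyperparameters that appear in the run output. The search method (`bayes`) and the value ranges are assumptions chosen for illustration and may differ from what `sweep.py` actually uses; `launch_sweep` is a hypothetical helper.

```python
# Sweep definition: minimize the logged "MSE" metric over four hyperparameters.
# Assumption: the ranges and the "bayes" method are illustrative only.
SWEEP_CONFIG = {
    "method": "bayes",
    "metric": {"name": "MSE", "goal": "minimize"},
    "parameters": {
        "max_depth": {"distribution": "int_uniform", "min": 1, "max": 20},
        "n_estimators": {"distribution": "int_uniform", "min": 10, "max": 50},
        "min_samples_split": {"distribution": "int_uniform", "min": 2, "max": 10},
        "min_samples_leaf": {"distribution": "int_uniform", "min": 1, "max": 4},
    },
}


def launch_sweep(train_fn, project, entity, count=5):
    """Register the sweep and run `count` trials of train_fn with a local agent."""
    import wandb  # imported lazily; requires a logged-in wandb installation

    sweep_id = wandb.sweep(SWEEP_CONFIG, project=project, entity=entity)
    wandb.agent(sweep_id, function=train_fn, count=count)
    return sweep_id
```

Inside `train_fn`, each trial would call `wandb.init()` and read its sampled values from `wandb.config`, then log `MSE` so the sweep can compare runs.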

In [11]:
!python homework-wandb/sweep.py \
  --wandb_project mlops_dtc \
  --wandb_entity rifrif \
  --data_artifact "rifrif/mlops_dtc/NYC-Taxi:v1"


Create sweep with ID: 65hsvtmw
Sweep URL: https://wandb.ai/rifrif/mlops_dtc/sweeps/65hsvtmw
wandb: Agent Starting Run: 9q6ctt11 with config:
wandb: 	max_depth: 9
wandb: 	min_samples_leaf: 4
wandb: 	min_samples_split: 2
wandb: 	n_estimators: 22
wandb: Currently logged in as: rifrif. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.3
wandb: Run data is saved locally in /home/jovyan/MLOps-Zoomcamp-2023/Week2/wandb/run-20230606_085605-9q6ctt11
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run true-sweep-1
wandb: ⭐️ View project at https://wandb.ai/rifrif/mlops_dtc
wandb: 🧹 View sweep at https://wandb.ai/rifrif/mlops_dtc/sweeps/65hsvtmw
wandb: 🚀 View run at https://wandb.ai/rifrif/mlops_d

# Q5. Link the best model to the model registry

Now that we have obtained the optimal set of hyperparameters and trained the best model, we can assume that we are ready to test some of these models in production. In this exercise, you'll create a model registry and link the best model from the Sweep to the model registry.

First, you will need to create a Registered Model to hold all the candidate models for your particular modeling task. You can refer to [this section](https://docs.wandb.ai/guides/models/walkthrough#1-create-a-new-registered-model) of the official docs to learn how to create a registered model using the Weights & Biases UI.

Once you have created the Registered Model, navigate to the best run of your sweep, open the model artifact created by that run, and click the **Link to Registry** option in the UI. This links the model artifact to the Registered Model. You can also add suitable aliases for the linked model version, such as `production` or `best`.
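
The homework uses the UI for this step, but the same link can in principle be created from code. The sketch below assumes that `Run.link_artifact` in your installed `wandb` version accepts a target path of the form `<entity>/model-registry/<registered model name>`; the helper names, the `job_type` value, and the default alias are hypothetical.

```python
def registry_path(entity, registered_model):
    """Build the target path of a Registered Model in the model registry."""
    return f"{entity}/model-registry/{registered_model}"


def link_model(entity, project, model_artifact, registered_model, aliases=("best",)):
    """Link an existing model artifact (e.g. '<run-id>-model:v0') to the registry."""
    import wandb  # imported lazily; requires a logged-in wandb installation

    run = wandb.init(project=project, entity=entity, job_type="link-model")
    artifact = run.use_artifact(model_artifact, type="model")
    # Assumption: the Registered Model already exists (created via the UI as above).
    run.link_artifact(artifact, registry_path(entity, registered_model), aliases=list(aliases))
    run.finish()
```

Doing it in code can be handy when the best run is selected automatically, while the UI route remains the one described in the official walkthrough.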
