<img src="https://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{beans-University-of-Stavanger} -->

Use Weights & Biases for machine learning experiment tracking, dataset versioning, and project collaboration.


<img src="https://wandb.me/mini-diagram" width="650" alt="Weights & Biases" />


## What this notebook covers with Weights & Biases:
* Metrics logging
* Exploratory Data Analysis (EDA)
* W&B plots such as Confusion Matrices, ROC curves & PR curves
* Hyperparameter search with W&B Sweeps



# ✅ Sign Up

Sign up for a free [Weights & Biases account here](https://wandb.ai/signup).

# Kaggle Competition Page

[Submit to the Competition here](http://www.kaggle.com/c/university-of-stavanger-beans-classification)

# 🚀 Installing and importing

In [1]:
!pip install -q --upgrade wandb
!pip install -q scikit-learn==1.0.1
!pip install pandas

Collecting pandas
  Downloading pandas-1.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Installing collected packages: pandas
Successfully installed pandas-1.4.1


In [2]:
import os
import wandb
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

A useful logging function to log multiple metrics to W&B at once

In [3]:
def log_metrics(labels, preds, is_val=True):
  # Prefix metric names by split so validation and train values stay separate in W&B
  pref = 'validation' if is_val else 'train'

  # Score the arguments passed in, not the global y_val / y_pred
  metrics = {}
  metrics[f"{pref}/accuracy_score"] = accuracy_score(labels, preds)
  metrics[f"{pref}/precision"] = precision_score(labels, preds, average="weighted")
  metrics[f"{pref}/recall"] = recall_score(labels, preds, average="weighted")
  metrics[f"{pref}/f1_score"] = f1_score(labels, preds, average="weighted")

  for k, v in metrics.items():
    print(f'{k} : {v}')
    wandb.summary[k] = v

  #wandb.log(metrics)
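
To make `average="weighted"` concrete, here is a minimal, dependency-free sketch of weighted precision: per-class precision averaged with weights equal to each class's share of the true labels. The function name `weighted_precision` and the toy labels are illustrative and not part of the notebook; scikit-learn's `precision_score(..., average="weighted")` should agree on inputs like these.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        predicted_as_cls = sum(1 for p in y_pred if p == cls)
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        # A class that is never predicted contributes a precision of 0
        precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
        score += (n / total) * precision
    return score

# Made-up labels for three classes
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
print(weighted_precision(toy_true, toy_pred))
```

Class 0 has precision 1, class 1 has precision 2/3 and class 2 has precision 1, so the weighted result is 3/6 · 1 + 2/6 · 2/3 + 1/6 · 1 = 8/9.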

Set some constants 

In [4]:
PROJECT = 'beans-tabular-University-of-Stavanger'
DATA_DIR = 'data'
ARTIFACT_PATH = 'wandb_fc/beans-tabular-pydata-tunisia/beans_competition_dataset:latest'
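
The `ARTIFACT_PATH` constant reads as `entity/project/artifact_name:alias`. The sketch below only splits that string to show the parts; `split_artifact_path` is a hypothetical helper, not a W&B function.

```python
def split_artifact_path(path):
    """Split 'entity/project/name:alias' into its parts (alias may be empty)."""
    entity, project, name_alias = path.split("/")
    name, _, alias = name_alias.partition(":")
    return {"entity": entity, "project": project, "name": name, "alias": alias}

parts = split_artifact_path(
    "wandb_fc/beans-tabular-pydata-tunisia/beans_competition_dataset:latest"
)
print(parts)
```

Here the dataset lives in another entity's project (`wandb_fc`), while runs are logged to your own `PROJECT`.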

# 💾 Data
#### Download and Load the Data
`train.csv` and `val.csv` data will be downloaded to `DATA_DIR`


In [5]:
wandb.init(project=PROJECT, job_type='download_dataset')
artifact = wandb.use_artifact(ARTIFACT_PATH, type='dataset')
artifact_dir = artifact.download(DATA_DIR)
wandb.finish()

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize


[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ubuntu/.netrc





In [6]:
# Read csvs to DataFrame
train_df = pd.read_csv(f'{DATA_DIR}/train_c.csv')
train_df = train_df.sample(frac=1)  # shuffle the train data
train_df.reset_index(inplace=True, drop=True)

val_df = pd.read_csv(f'{DATA_DIR}/val.csv')
test_df = pd.read_csv(f'{DATA_DIR}/test_no_label.csv')

train_df.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class,id
0,34942,701.872,250.916336,178.058144,1.409182,0.704573,35396,210.925428,0.786504,0.987174,0.891337,0.840621,0.007181,0.002212,0.706643,0.995788,SIRA,7493
1,27427,625.835,229.816161,152.537045,1.506625,0.747968,27869,186.871991,0.797297,0.98414,0.879971,0.813137,0.008379,0.00226,0.661192,0.996168,DERMASON,6580
2,73178,1033.997,398.300147,235.982898,1.687835,0.805589,73969,305.242729,0.735162,0.989306,0.860106,0.766364,0.005443,0.001158,0.587313,0.991288,CALI,3428
3,52423,914.234,382.9034,175.428808,2.182671,0.888873,53020,258.354479,0.732308,0.98874,0.788165,0.674725,0.007304,0.000934,0.455254,0.993669,HOROZ,3464
4,49493,833.519,309.598164,204.0679,1.517133,0.752022,50044,251.030765,0.714432,0.98899,0.895205,0.810828,0.006255,0.001668,0.657442,0.997426,SIRA,3031


In [7]:
test_df

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,id
0,68008,996.629,371.705881,234.441758,1.585493,0.776012,69070,294.262595,0.725465,0.984624,0.860405,0.791654,0.005466,0.001324,0.626717,0.993655,0
1,33169,676.789,242.434085,174.652218,1.388096,0.693547,33629,205.504458,0.766701,0.986321,0.909988,0.847671,0.007309,0.002328,0.718547,0.997412,1
2,32279,670.977,250.820564,164.181397,1.527704,0.755995,32662,202.728635,0.757190,0.988274,0.900979,0.808262,0.007770,0.002046,0.653287,0.998029,2
3,47480,809.477,308.172912,197.172445,1.562961,0.768532,47977,245.872759,0.811915,0.989641,0.910566,0.797840,0.006491,0.001622,0.636549,0.994902,3
4,35615,692.536,233.003591,195.224650,1.193515,0.545883,36004,212.947004,0.743342,0.989196,0.933164,0.913922,0.006542,0.002815,0.835253,0.996887,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3398,38491,745.466,278.802822,176.662392,1.578167,0.773623,39247,221.378100,0.726341,0.980737,0.870389,0.794031,0.007243,0.001776,0.630485,0.995011,3398
3399,43489,769.887,285.884484,194.201811,1.472100,0.733859,43893,235.312377,0.807415,0.990796,0.922009,0.823103,0.006574,0.001861,0.677499,0.997346,3399
3400,50400,857.430,327.844472,198.024117,1.655579,0.796970,51001,253.320495,0.686695,0.988216,0.861476,0.772685,0.006505,0.001430,0.597042,0.988450,3400
3401,43378,787.214,304.180475,182.437741,1.667311,0.800174,43943,235.011883,0.774801,0.987142,0.879617,0.772607,0.007012,0.001541,0.596921,0.995253,3401


#### Prep Data
Extract the X,y values and encode the classes into integer values

In [8]:
le = preprocessing.LabelEncoder()

y_train_txt = train_df['Class'].values.tolist()
le.fit(y_train_txt)
labels = le.classes_

X_train = train_df.iloc[:,:-2].values.tolist()
y_train = le.transform(y_train_txt)

X_val = val_df.iloc[:,:-2].values.tolist()
y_val_txt = val_df['Class'].values.tolist()
y_val = le.transform(y_val_txt)

X_test = test_df.iloc[:,:-1].values.tolist()

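# Keep `labels = le.classes_` from above: its sorted order matches the integer codes,
# whereas train_df['Class'].unique() returns classes in order of first appearance.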

list(le.inverse_transform([2, 2, 1]))

['CALI', 'CALI', 'BOMBAY']
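
As a rough picture of what `LabelEncoder` does here, the sketch below sorts the distinct class names and maps each to its position. It uses a toy subset of the class names visible in `train_df.head()`, so the integer codes differ from those of the encoder fitted on the full training set.

```python
# Toy subset of class names; the real training data contains more classes
toy_classes = ["SIRA", "DERMASON", "CALI", "HOROZ", "SIRA"]

classes = sorted(set(toy_classes))                 # plays the role of le.classes_
to_int = {name: i for i, name in enumerate(classes)}

encoded = [to_int[name] for name in toy_classes]   # like le.transform
decoded = [classes[i] for i in encoded]            # like le.inverse_transform

print(classes)
print(encoded)
print(decoded)
```

Because the mapping is positional, any list of class names passed to a plot must use this same sorted order.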

# 🖼️ EDA with W&B Tables
Log the train and validation datasets to W&B Tables for EDA

In [9]:
wandb.init(project=PROJECT, job_type='log_dataset')
wandb.log({'Datasets/train_ds':train_df})
wandb.log({'Datasets/val_ds':val_df})
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33msanderele1[0m (use `wandb login --relogin` to force relogin)


[34m[1mwandb[0m: Network error (ConnectionError), entering retry loop.
wandb: Network error (ConnectionError), entering retry loop.





# 👟 Train
Train a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) from scikit-learn.

In [10]:
wandb.init(project=PROJECT)

model = RandomForestClassifier()

# ✍️ Log your model's parameter config to W&B
wandb.config.update(model.get_params())

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred = model.predict(X_val)
y_probas = model.predict_proba(X_val)

importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

✍️ Log your model's Metrics to W&B

In [11]:
log_metrics(y_val, y_pred)

validation/accuracy_score : 0.931843891402715
validation/precision : 0.931686336138724
validation/recall : 0.931843891402715
validation/f1_score : 0.9317082546430002


# 🤩 Visualize Model Performance in W&B
Weights & Biases has charting functions for popular model evaluation charts, including confusion matrices, ROC curves, PR curves and more.
[Check out wandb charts documentation here $\rightarrow$](https://docs.wandb.ai/guides/track/log/plots#model-evaluation-charts)

**Confusion Matrix**


In [12]:
wandb.log({"Confusion Matrix": wandb.plot.confusion_matrix(probs=y_probas, y_true=y_val, class_names=labels)})
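
For intuition about what the chart shows, here is a small sketch that builds a confusion matrix from class probabilities by taking the most probable class per row. It assumes predictions are the argmax of each probability row; `confusion_from_probs` and the toy data are illustrative only.

```python
def confusion_from_probs(probs, y_true, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for row, true_cls in zip(probs, y_true):
        pred_cls = max(range(n_classes), key=lambda c: row[c])  # argmax
        matrix[true_cls][pred_cls] += 1
    return matrix

# Made-up probabilities for four samples and three classes
toy_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
toy_true = [0, 1, 0, 2]

cm = confusion_from_probs(toy_probs, toy_true, n_classes=3)
for row in cm:
    print(row)
```

Off-diagonal entries are the mistakes: here one sample of class 0 was predicted as class 1.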

**ROC Curve**


In [13]:
wandb.log({"ROC Curve": wandb.plot.roc_curve(y_val, y_probas, labels=labels, title='ROC Curve')})
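
A multi-class ROC curve is usually drawn one class at a time, treating that class as positive and all others as negative. The sketch below computes a single (false positive rate, true positive rate) point under that one-vs-rest assumption; `roc_point` and the toy scores are hypothetical.

```python
def roc_point(y_true, scores, positive, threshold):
    """Return (fpr, tpr) for one class treated as positive at one threshold."""
    tp = fp = fn = tn = 0
    for true_cls, score in zip(y_true, scores):
        actual_pos = true_cls == positive
        predicted_pos = score >= threshold
        if actual_pos and predicted_pos:
            tp += 1
        elif actual_pos:
            fn += 1
        elif predicted_pos:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr

toy_true = [1, 1, 0, 2, 0]
toy_scores_cls1 = [0.9, 0.4, 0.6, 0.2, 0.1]  # made-up probabilities of class 1

fpr, tpr = roc_point(toy_true, toy_scores_cls1, positive=1, threshold=0.5)
print(fpr, tpr)
```

Sweeping the threshold from 1 down to 0 and joining the resulting points traces the curve for that class.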

**Precision Recall Curve**

In [14]:
wandb.log({"Precision-Recall": wandb.plot.pr_curve(y_val, y_probas, labels=labels, title='Precision-Recall')})

**Feature Importances**

Evaluates and plots the importance of each feature for the classification task. Only works with classifiers that have a `feature_importances_` attribute, like trees.

In [15]:
feat_names = train_df.columns.values
imps = []
feats = []
for i in indices:
  imps.append(importances[i])
  feats.append(feat_names[i])

fi_data = pd.DataFrame({"Feature":feats, "Importance":imps})

In [16]:
table = wandb.Table(data=fi_data, columns = ["Feature", "Importance"])
wandb.log({"Feature Importance" : wandb.plot.bar(table, "Feature",
                               "Importance", title="Feature Importance")})
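
The ranking step (`np.argsort(importances)[::-1]` in the training cell) orders feature indices from most to least important. A dependency-free sketch of the same idea, with made-up importance values:

```python
toy_names = ["Area", "Perimeter", "MajorAxisLength", "MinorAxisLength"]
toy_importances = [0.10, 0.40, 0.15, 0.35]  # made-up values that sum to 1

# Indices sorted by descending importance
order = sorted(range(len(toy_importances)),
               key=lambda i: toy_importances[i], reverse=True)
ranked = [(toy_names[i], toy_importances[i]) for i in order]

for name, importance in ranked:
    print(f"{name}: {importance}")
```

The feature names must be indexed in the same column order the model was trained on, otherwise names and importances end up mismatched.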

#### 🏁 Finish W&B Run
When you're finished logging for a run, make sure to call `wandb.finish()` to avoid logging metrics from your next experiment to the wrong run.

In [17]:
wandb.finish()




validation/accuracy_score,0.93184
validation/f1_score,0.93171
validation/precision,0.93169
validation/recall,0.93184


# Submission

In [18]:
y_pred_test = model.predict(X_test)
y_pred_test = list(le.inverse_transform(y_pred_test))
ids = test_df.id.values

submission_df = pd.DataFrame({'Id':ids, 'Predicted':y_pred_test})
submission_df.to_csv('submission.csv', index=False)
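
The submission file is a two-column CSV with `Id` and `Predicted` headers. The sketch below builds the same layout in memory with the standard library so the format can be checked before uploading; `build_submission` and the three sample rows are illustrative.

```python
import csv
import io

def build_submission(ids, predictions):
    """Return CSV text with one 'Id,Predicted' row per test sample."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "Predicted"])
    writer.writerows(zip(ids, predictions))
    return buffer.getvalue()

preview = build_submission([0, 1, 2], ["SIRA", "DERMASON", "CALI"])
print(preview)
```

Predictions are written as class names, not integer codes, which is why the notebook calls `le.inverse_transform` first.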

# 🧪 Hyperparameter Sweep

Weights & Biases also lets you run hyperparameter sweeps with our own [Sweeps functionality](https://docs.wandb.ai/guides/sweeps/python-api).

#### Sweep Train Function
A W&B Sweep needs to be passed a config and a training function to run.

In [19]:
def train():     
    with wandb.init() as _:
      
      model = RandomForestClassifier(
          
          n_estimators=wandb.config['n_estimators'],   # n_estimators parameter will now be set by W&B
          max_depth=wandb.config['max_depth']     # max_depth parameter will now be set by W&B
          
          # [Optional] add additional model parameters here
          
          )
      
      # ✍️ Log your model's parameter config to W&B
      wandb.config['model_type'] = 'random_forest'
      wandb.config.update(model.get_params())

      model.fit(X_train, y_train)

      y_pred_train = model.predict(X_train)
      y_pred = model.predict(X_val)
      y_probas = model.predict_proba(X_val)
        
      # Log validation summary metrics to W&B
      wandb.summary["validation/accuracy"] = accuracy_score(y_val, y_pred)
      wandb.summary["validation/precision"] = precision_score(y_val, y_pred, average="weighted")
      wandb.summary["validation/recall"] = recall_score(y_val, y_pred, average="weighted")
      wandb.summary["validation/f1_score"] = f1_score(y_val, y_pred, average="weighted")
    
      # Make test set predictions and save as csv  
      y_pred_test = model.predict(X_test)
      y_pred_test = list(le.inverse_transform(y_pred_test))

      submission_df = pd.DataFrame({'Id':test_df.id.values, 'Predicted':y_pred_test})
      submission_df.to_csv('submission.csv', index=False)
        
      wandb.log_artifact('submission.csv', name=f'{wandb.run.id}_submission', type='submission')

💡 **Tip**

The `train` function above uses scikit-learn's [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier), but you can also modify the code to use other models such as `DecisionTreeClassifier` or `AdaBoostClassifier`, or other boosting models such as [`XGBoost`](https://xgboost.readthedocs.io/en/latest/get_started.html).

Note that you'll likely have to change the argument names in the `sweep_config` when using these models in a sweep.


```
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

model = DecisionTreeClassifier()
model = AdaBoostClassifier()
```
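
One way to keep argument names straight when swapping models is a small lookup of which swept parameters each constructor accepts. The table below is a sketch: `SWEEPABLE` and `kwargs_for` are hypothetical names, and the lists cover only a few parameters of each scikit-learn class.

```python
# Subset of constructor arguments each model accepts (not exhaustive)
SWEEPABLE = {
    "RandomForestClassifier": ["n_estimators", "max_depth"],
    "DecisionTreeClassifier": ["max_depth"],                   # single tree: no n_estimators
    "AdaBoostClassifier": ["n_estimators", "learning_rate"],   # no max_depth on the ensemble
}

def kwargs_for(model_name, config):
    """Keep only the config entries the chosen model understands."""
    return {k: config[k] for k in SWEEPABLE[model_name] if k in config}

# A config like the ones a sweep agent reports for each run
toy_config = {"n_estimators": 43, "max_depth": 13}
print(kwargs_for("DecisionTreeClassifier", toy_config))
print(kwargs_for("AdaBoostClassifier", toy_config))
```

The filtered dictionary can then be unpacked into the constructor, for example `DecisionTreeClassifier(**kwargs_for("DecisionTreeClassifier", toy_config))`.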



#### Sweep Config
Define the name of your sweep, the search method, and the parameters to sweep over. See the [Sweep Configuration Docs](https://docs.wandb.ai/guides/sweeps/configuration) for more advanced functionality.

In [20]:
sweep_config = {
  "name" : "beans_sweep",
  "method" : "random",
  "parameters" : {
    "n_estimators" :{
      "min": 10,
      "max": 400
    },
    "max_depth" :{
      "min": 2,
      "max": 100
    },

    # [Optional] add additional parameters here

  }
}

sweep_id = wandb.sweep(sweep_config, project=PROJECT)

Create sweep with ID: k0ph9jy4
Sweep URL: https://wandb.ai/sanderele1/beans-tabular-University-of-Stavanger/sweeps/k0ph9jy4
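
With `"method": "random"`, each run gets an independent draw from the configured ranges. The sketch below imitates that locally, assuming integer `min`/`max` bounds are sampled uniformly and inclusively; `sample_config` is a hypothetical helper and not how W&B implements it.

```python
import random

def sample_config(parameters, rng):
    """Draw one integer per parameter from its inclusive [min, max] range."""
    return {name: rng.randint(spec["min"], spec["max"])
            for name, spec in parameters.items()}

toy_parameters = {
    "n_estimators": {"min": 10, "max": 400},
    "max_depth": {"min": 2, "max": 100},
}

rng = random.Random(0)  # seeded only to make the sketch repeatable
trials = [sample_config(toy_parameters, rng) for _ in range(3)]
for trial in trials:
    print(trial)
```

Random search does not use earlier results to choose later configurations, so runs can execute in any order or in parallel.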


💡 **Tip**

The `sweep_config` above is very simple. Consider sweeping over additional parameters, and don't forget to modify your `train` function to pass these additional parameters to your model.
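
As one possible extension, the sketch below adds two more `RandomForestClassifier` arguments to the parameter block and shows a helper that collects the swept values for the model constructor. The sweep name, the extra parameters, and `model_kwargs` are illustrative choices, not part of the original notebook.

```python
extended_sweep_config = {
    "name": "beans_sweep_extended",   # hypothetical sweep name
    "method": "random",
    "parameters": {
        "n_estimators": {"min": 10, "max": 400},
        "max_depth": {"min": 2, "max": 100},
        "min_samples_split": {"min": 2, "max": 10},     # numeric range
        "criterion": {"values": ["gini", "entropy"]},   # categorical choice
    },
}

def model_kwargs(config):
    """Pick out the swept keys so they can be unpacked into the model."""
    return {name: config[name] for name in extended_sweep_config["parameters"]}

# Stand-in for the config of one sweep run
toy_run_config = {"n_estimators": 43, "max_depth": 13,
                  "min_samples_split": 4, "criterion": "gini",
                  "model_type": "random_forest"}
print(model_kwargs(toy_run_config))
```

Inside `train`, the model could then be built as `RandomForestClassifier(**model_kwargs(wandb.config))`, so new parameters only need to be added in one place.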

#### Run Sweep
Now we define the number of experiments we'd like to run using `N_RUNS`, pass the `sweep_id` and the training function to `wandb.agent`, and start the sweep.


In [None]:
N_RUNS = 50 # number of runs to execute
wandb.agent(sweep_id, project=PROJECT, function=train, count=N_RUNS)

[34m[1mwandb[0m: Agent Starting Run: qxi3q8ym with config:
[34m[1mwandb[0m: 	max_depth: 13
[34m[1mwandb[0m: 	n_estimators: 43







0,1
validation/accuracy,0.9293
validation/f1_score,0.92919
validation/precision,0.92916
validation/recall,0.9293


[34m[1mwandb[0m: Agent Starting Run: x3x0u6jv with config:
[34m[1mwandb[0m: 	max_depth: 15
[34m[1mwandb[0m: 	n_estimators: 40







0,1
validation/accuracy,0.92619
validation/f1_score,0.92617
validation/precision,0.92621
validation/recall,0.92619


[34m[1mwandb[0m: Agent Starting Run: 326amfsi with config:
[34m[1mwandb[0m: 	max_depth: 11
[34m[1mwandb[0m: 	n_estimators: 327







0,1
validation/accuracy,0.93439
validation/f1_score,0.93429
validation/precision,0.93425
validation/recall,0.93439


[34m[1mwandb[0m: Agent Starting Run: kvqm8vpu with config:
[34m[1mwandb[0m: 	max_depth: 23
[34m[1mwandb[0m: 	n_estimators: 312







0,1
validation/accuracy,0.93071
validation/f1_score,0.93062
validation/precision,0.93062
validation/recall,0.93071


[34m[1mwandb[0m: Agent Starting Run: rn56nvvz with config:
[34m[1mwandb[0m: 	max_depth: 3
[34m[1mwandb[0m: 	n_estimators: 265


  _warn_prf(average, modifier, msg_start, len(result))





0,1
validation/accuracy,0.79808
validation/f1_score,0.76257
validation/precision,0.82161
validation/recall,0.79808


[34m[1mwandb[0m: Agent Starting Run: gd5z42sq with config:
[34m[1mwandb[0m: 	max_depth: 39
[34m[1mwandb[0m: 	n_estimators: 132







0,1
validation/accuracy,0.93213
validation/f1_score,0.93204
validation/precision,0.932
validation/recall,0.93213


[34m[1mwandb[0m: Agent Starting Run: 24qof6aw with config:
[34m[1mwandb[0m: 	max_depth: 87
[34m[1mwandb[0m: 	n_estimators: 347




# Get Submission Files From a Sweeps Run

You can download the test set predictions from each Sweeps run via Artifacts. Or you can find which parameters resulted in the best trained model, and replicate that training using the training and prediction code in the first part of this notebook.

In [None]:
# Find the run ID of the best-performing Sweeps run; it appears in the run's URL in the W&B UI
RUN_ID = 'abc123'
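
If you have the validation accuracies to hand, picking the run ID is a one-line maximum. The sketch below uses the run IDs and `validation/accuracy` values copied by hand from the completed runs in the sweep output earlier in this notebook; in practice you would read them from the W&B UI.

```python
# Run IDs and validation/accuracy values copied from the sweep output
sweep_results = {
    "qxi3q8ym": 0.9293,
    "x3x0u6jv": 0.92619,
    "326amfsi": 0.93439,
    "kvqm8vpu": 0.93071,
    "rn56nvvz": 0.79808,
    "gd5z42sq": 0.93213,
}

best_run_id = max(sweep_results, key=sweep_results.get)
artifact_name = f"{best_run_id}_submission:latest"

print(best_run_id)
print(artifact_name)
```

The artifact name follows the `f'{wandb.run.id}_submission'` pattern used in the `train` function, so it can be dropped into the download cell.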

In [None]:
# Download the submission file from artifacts
wandb.init(project=PROJECT, job_type='download_submission')
artifact = wandb.use_artifact(f'{wandb.run.entity}/{PROJECT}/{RUN_ID}_submission:latest', type='submission')
artifact_dir = artifact.download('my_submissions')
wandb.finish()

# 🪄 More from W&B
#### 🎨 Example Gallery

See examples of projects tracked and visualized with W&B in our gallery, [Fully Connected→](https://app.wandb.ai/gallery)

#### 🏙️ Community

Join a community of ML practitioners in our
[Discourse forum→](http://wandb.me/and-you)

#### 📏 Best Practices

1. **Projects**: Log multiple runs to a project to compare them. `wandb.init(project="project-name")`
2. **Groups**: For multiple processes or cross validation folds, log each process as a run and group them together. `wandb.init(group='experiment-1')`
3. **Tags**: Add tags to track your current baseline or production model.
4. **Notes**: Type notes in the table to track the changes between runs.
5. **Reports**: Take quick notes on progress to share with colleagues and make dashboards and snapshots of your ML projects.
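
The practices above map onto keyword arguments of `wandb.init`. The sketch below only assembles those arguments for a cross-validation setup so it runs without a W&B login; the group name, run names, notes and `init_kwargs` helper are hypothetical.

```python
def init_kwargs(fold, n_folds, baseline=False):
    """Build wandb.init keyword arguments for one cross-validation fold."""
    return {
        "project": "beans-tabular-University-of-Stavanger",
        "group": "rf-cross-validation",            # hypothetical group name
        "job_type": "train",
        "name": f"fold-{fold + 1}-of-{n_folds}",
        "tags": ["baseline"] if baseline else [],
        "notes": "Random forest with default parameters",
    }

runs = [init_kwargs(fold, 3, baseline=(fold == 0)) for fold in range(3)]
for kwargs in runs:
    print(kwargs["name"], kwargs["tags"])
    # In a real script: wandb.init(**kwargs) ... wandb.finish()
```

All folds share one `group`, so they appear together in the project and can be compared against runs from other groups.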

#### 🤓 Advanced Setup

1. [Environment variables](https://docs.wandb.com/library/environment-variables): Set API keys in environment variables so you can run training on a managed cluster.
2. [Offline mode](https://docs.wandb.com/library/technical-faq#can-i-run-wandb-offline): Use `offline` mode (`WANDB_MODE=offline`, formerly `dryrun`) to train offline and sync results later.
3. [On-prem](https://docs.wandb.com/self-hosted): Install W&B in a private cloud or air-gapped servers in your own infrastructure. We have local installations for everyone from academics to enterprise teams.
4. [Sweeps](http://wandb.me/sweeps-colab): Set up hyperparameter search quickly with our lightweight tool for tuning.
5. [Artifacts](http://wandb.me/artifacts-colab): Track and version models and datasets in a streamlined way that automatically picks up your pipeline steps as you train models.
6. [Tables](http://wandb.me/dsviz-nature-colab): Log, query, and analyze tabular data. Understand your datasets, visualize model predictions, and share insights in a central dashboard.
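
For the environment-variable and offline setups listed above, a minimal sketch is to set the variables before `wandb` is initialised. `WANDB_PROJECT`, `WANDB_MODE` and `WANDB_API_KEY` are W&B environment variables; the `wandb_env` helper is hypothetical, and on a managed cluster the API key would normally come from a secret store, not from source code.

```python
import os

def wandb_env(project, offline=False, api_key=None):
    """Collect W&B environment variables for a training job."""
    env = {"WANDB_PROJECT": project}
    if offline:
        env["WANDB_MODE"] = "offline"   # log locally, upload later
    if api_key is not None:
        env["WANDB_API_KEY"] = api_key  # never hard-code a real key
    return env

settings = wandb_env("beans-tabular-University-of-Stavanger", offline=True)
os.environ.update(settings)
print(settings)
```

Runs recorded offline can be uploaded afterwards with the `wandb sync` command.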