# PoC Hugging Face, WandB, PyTorch

## Goal

* Explore a possible pipeline
* Use existing tools to focus on data and models

## Schema

* SCM/VCS: GitHub
* Exploration with Jupyter Notebooks
* Models, datasets, tokenizer, metrics from Hugging Face
* Logging and visualisation with Weights & Biases (WandB)

![Pipeline Jupyter HF WandB](https://raw.githubusercontent.com/qte77/ML-HF-WnB/32f21d112ab707b737a07cd027ad837b680f32fe/img/ML-Pipeline-HF-WnB.draw.io.png)

## TODO

* Export helper functions for saving/loading into py
 * models, datasets, tokenizer, metrics
* Import specific architecture with [PretrainedConfig](https://huggingface.co/docs/transformers/v4.20.1/en/model_doc/bert#transformers.BertConfig)
 * vocab_size, hidden_size, num_attention_heads, num_hidden_layers
* Try multiprocessing
 * python module `multiprocessing`
 * linux `!nohup`
* Mount gdrive non-interactive, e.g. PyDrive
 * [Google loading and saving data from external sources](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=zU5b6dlRwUQk) 
* [HF How to benchmark models with Transformers](https://github.com/huggingface/notebooks/blob/main/examples/benchmark.ipynb)
 * deprecated, use other module/framework or self-implement
* Use dataset specific metrics
 * [GLUE](https://github.com/huggingface/datasets/tree/master/metrics/glue), [SuperGLUE](https://github.com/huggingface/datasets/tree/master/metrics/super_glue), [SQuAD](https://github.com/huggingface/datasets/tree/master/metrics/squad), [SQuADv2](https://github.com/huggingface/datasets/tree/master/metrics/squad_v2)
* Instead of [HF Metrics Builder Scripts](https://github.com/huggingface/datasets/tree/master/metrics), try
 * `from sklearn.metrics import precision_recall_fscore_support, accuracy_score`
 * `from datasets import load_metric`



# Transformer

## Implementation

* [Paper: Attention is all you need](https://arxiv.org/abs/1706.03762)
* [Paper: BERT: Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805)
* The Annotated Transformer: [Article](https://nlp.seas.harvard.edu/2018/04/03/attention), [GitHub](https://github.com/harvardnlp/annotated-transformer)
* [Tensorflow tutorial: Transformer model for language understanding](https://www.tensorflow.org/text/tutorials/transformer)
* HF DistilBERT "Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT": [Paper](https://arxiv.org/abs/1910.01108), [Blog](https://medium.com/huggingface/distilbert-8cf3380435b5)

## DistilBERT Tips

* DistilBERT doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
* DistilBERT doesn’t have options to select the input positions (position_ids input). This could be added if necessary though, just let us know if you need this option.


[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)

## Benchmarks

* [GLUE General Language Understanding Evaluation](https://gluebenchmark.com/)
* [SuperGLUE](https://super.gluebenchmark.com/)
* [SQuAD Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/)
* [Paper: Long Range Arena: A Benchmark for efficient Transformers](https://arxiv.org/abs/2011.04006)
* [Paper: Efficient Transformers: A survey](https://arxiv.org/abs/2009.06732)

# Additional resources

* [Are Sixteen Heads Really Better than One?](https://blog.ml.cmu.edu/2020/03/20/are-sixteen-heads-really-better-than-one/)
* [HF How to benchmark models with Transformers](https://github.com/huggingface/notebooks/blob/main/examples/benchmark.ipynb)
* [A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/)
* [MetaAI OPT175B Logbook](https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf)
* BigScience 176B multi-lingual
 * [Lessons learned](https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons-learned.md)
 * [Chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md)
 * [TensorBoard](https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard)
 * [Paper](https://openreview.net/forum?id=rI7BL3fHIZq)
 * [Blog](https://bigscience.huggingface.co/blog/model-training-launched)
 * [Announcement](https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours)
* [ML Roadmap 2020](https://whimsical.com/machine-learning-roadmap-2020-CA7f3ykvXpnJ9Az32vYXva)
* WandB
 * [WandB get raw data](https://docs.wandb.ai/guides/track/public-api-guide)
* Mixed Precision Training
 * [Paper: Mixed Precision Training](https://arxiv.org/pdf/1710.03740.pdf)
 * [Nvidia: Train With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html)
 * [Fast.ai: Mixed precision training](https://docs.fast.ai/callback.fp16.html)

# Pre-Requisites

## Module-Installation

In [None]:
import sys
import os

In [None]:
red='\033[31m'
green='\033[32m'
orange='\033[33m'

In [None]:
# os.environ['req'] = "https://raw.githubusercontent.com/qte77/ML-HF-WnB/main/k8s-app/app/config/"
# os.environ['rfn'] = "requirements.txt"

In [None]:
# %%shell
# if [ ! -f $rfn ]; then
#   echo "Downloading ${rfn}"
#   curl "${req}${rfn}" -o $rfn
# else
#   echo "${rfn} already in path"
# fi

In [None]:
# !{sys.executable} -m pip install -r $rfn

In [None]:
!{sys.executable} -m pip install -qqq setuptools watermark

In [None]:
#pre-install folium because wandb-version is obsolete
!{sys.executable} -m pip install -qqq 'folium == 0.2.1'
!{sys.executable} -m pip install -qqq wandb
#remove and re-install folium from wandb if pre-install fails
#!{sys.executable} -m pip uninstall -yyy -qqq folium

In [None]:
!{sys.executable} -m pip install -qqq datasets transformers
# Optional -> install latest version from source
#!{sys.executable} -m pip install -qqq git+https://github.com/huggingface/transformers

In [None]:
from google.colab import drive
import json
import watermark
%load_ext watermark

In [None]:
from datasets import load_dataset, list_datasets, list_metrics
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_metric
import wandb
#load bert_score if needed
#import bert_score
import numpy as np
import torch

In [None]:
%watermark -a qte77 -gu qte77 -ws qte77.github.io -u -i -v -iv
#%watermark?

## Mount storage

### interactive mount

Mounting non-interactively is not possible:

* https://github.com/googlecolab/colabtools/issues/2563#issuecomment-1083524007

`!gcsfuse`

* https://cloud.google.com/storage/docs/gcs-fuse
* https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md

`!gcloud --help --no-browser --access-token-file {conf_dir}/{keyfile}`

* https://cloud.google.com/sdk/docs/cheatsheet


In [None]:
gdrive = '/gdrive'
save_dir = f'{gdrive}/MyDrive' #no spaces allowed

In [None]:
drive.mount(gdrive)

## Parametrise experiment

* Model
* Dataset
* Metrics
 * Primary metric for eval
 * Further metrics to use
* WandB
  * Project (entity)
  * Logging settings
* Compute accelerator (CPU, GPU, TPU)



In [None]:
dataset = 'mrpc'
model = 'rbc'
wnb_entity = 'ba'
wnb_run_group = '' #'no-label-smoothing-more-steps'
wnb_job_type = 'training'
wnb_notes = 'Runs without label_smoothing changed and more steps'
wnb_tags = ['medium-range', 'no-label-smoothing']
train_count = 5 #int: number of runs passed to wandb.agent(count=...)
metric_to_optimize = 'f1'
#https://huggingface.co/metrics
metrics_to_load = ['accuracy', 'precision', 'recall', 'f1', 'mae', 'mse']
toggle_reproduce_wrong_optim: bool = False
nvidia_smi_query = 'timestamp,name,temperature.gpu,utilization.gpu,' \
      'utilization.memory,memory.total,memory.free,memory.used'

## Validate and ingest parameters

In [None]:
#https://github.com/huggingface/datasets
#dataset/task, configuration (sub ds/task), col to rename for tokenizer, cat avg for f1/recall/prec,
#MRPC human annotations for whether the sentences in the pair are semantically equivalent
#https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv
#MNLI Multi-Genre Natural Language Inference Corpus, sentence pairs
#MNLI https://dl.fbaipublicfiles.com/glue/data/MNLI.zip 
dataset_param = {
    'YAHOO': ['yahoo_answers_topics','','macro','topic',['question_title'], #,'question_content','best_answer'],
              ['id','question_content','best_answer']],
    'MRPC': ['glue','mrpc','macro','label',['sentence1','sentence2'],['idx']],
    'MNLI': ['glue','mnli','macro','label',['premise','hypothesis'],['idx']]
}
model_param = {
    'DBBU' : 'distilbert-base-uncased',
    'BBU' : 'bert-base-uncased',
    #https://huggingface.co/docs/transformers/model_doc/longformer#transformers.LongformerForSequenceClassification.forward
    'LBU' : 'allenai/longformer-base-4096',
    'ESG' : 'google/electra-small-generator',
    'ESD' : 'google/electra-small-discriminator',
    'EBD' : 'google/electra-base-discriminator',
    'ABU1' : 'albert-base-v1',
    'ABU2' : 'albert-base-v2',
    'RBC' : 'roberta-base',
}
params = {
  'accuracy' : ['maximize', True],
  'f1' : ['maximize', True]  ,
  'loss' : ['minimize', False],
  'eval_loss' : ['minimize', False]
}
'''
try:
  metrics_to_load.index(metric_to_optimize)
except:
  print('Metric to optimize not contained in metrics to load.')
'''
try:
  dataset = dataset.upper()
  ds_name, ds_config, ds_avg, ds_colren, ds_colstokens, ds_colrem = dataset_param.get(dataset)
except Exception as e:
  print(red, e)
try:
  model = model.upper()
  modelname = model_param[model]
except Exception as e:
  print(red, e)
try:
  #str.lower() returns a new string, so the result has to be assigned
  metric_to_optimize = metric_to_optimize.lower()
  goal, greaterBool = params.get(metric_to_optimize, ['Invalid metric.',''])
except Exception as e:
  print(red, e)
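
The lookups above print an error and carry on, which can leave `ds_name` or `modelname` undefined in later cells. A minimal sketch of a stricter variant follows. The helper names `_lookup` and `resolve_experiment` are hypothetical, and the tables are abbreviated copies of `dataset_param`, `model_param` and `params`; the idea is simply to raise a `ValueError` that lists the valid keys.

```python
# Hypothetical helpers, not part of the notebook: fail fast on unknown keys.
# The tables are abbreviated copies of dataset_param, model_param and params.
DATASETS = {
    'MRPC': ['glue', 'mrpc', 'macro', 'label', ['sentence1', 'sentence2'], ['idx']],
    'MNLI': ['glue', 'mnli', 'macro', 'label', ['premise', 'hypothesis'], ['idx']],
}
MODELS = {'DBBU': 'distilbert-base-uncased', 'RBC': 'roberta-base'}
METRICS = {
    'accuracy': ['maximize', True],
    'f1': ['maximize', True],
    'loss': ['minimize', False],
    'eval_loss': ['minimize', False],
}


def _lookup(table, key, kind):
    """Return table[key] or raise a ValueError listing the valid keys."""
    try:
        return table[key]
    except KeyError:
        valid = ', '.join(sorted(table))
        raise ValueError(f'Unknown {kind} "{key}". Valid: {valid}') from None


def resolve_experiment(dataset, model, metric):
    """Normalise the three user inputs and return the resolved settings."""
    ds_settings = _lookup(DATASETS, dataset.upper(), 'dataset')
    model_name = _lookup(MODELS, model.upper(), 'model')
    goal, greater_is_better = _lookup(METRICS, metric.lower(), 'metric')
    return {
        'dataset': ds_settings,
        'model_name': model_name,
        'goal': goal,
        'greater_is_better': greater_is_better,
    }


resolved = resolve_experiment('mrpc', 'rbc', 'F1')
print(resolved['model_name'], resolved['goal'])
```

Raising instead of printing stops the notebook at the cell that contains the typo, rather than several cells later with a `NameError`.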

In [None]:
#os.environ['COLAB_TPU_ADDR']
#os.environ['XRT_TPU_CONFIG']
try:
  os.environ['TPU_NAME']
  device = 'tpu'
except:
  try:
    #torch.cuda.is_available()
    #cpu, cuda, xpu, mkldnn, opengl, opencl, ideep, hip, ve, ort, mlc, xla, lazy, vulkan, meta, hpu
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  except Exception as e:
    print(red, e)
device = f'{device}'.upper()
print(device)
if f'{device}' == 'CUDA':
  %shell nvidia-smi

In [None]:
wnb_project_name=f'{model}-{dataset}-{device}-sweep'
print(green, wnb_project_name)

In [None]:
#https://docs.wandb.ai/guides/track/advanced/environment-variables
%env WANDB_WATCH=all
%env WANDB_LOG_MODEL=true
%env WANDB_SAVE_CODE=true
%env WANDB_PROJECT={wnb_project_name}
%env WANDB_ENTITY={wnb_entity}
# %env WANDB_JOB_TYPE={wnb_job_type}
# %env WANDB_RUN_GROUP={wnb_run_group}
%env WANDB_NOTES={wnb_notes}
#WANDB_TAGS expects a comma-separated string, not the name of the Python list
os.environ['WANDB_TAGS'] = ','.join(wnb_tags)
#avoid error:
#The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
%env TOKENIZERS_PARALLELISM=true

In [None]:
#os.environ['COLAB_GPU']
#os.environ['COLAB_TPU_ADDR']
#os.environ['XRT_TPU_CONFIG']
#os.environ['TPU_NAME']
#os.environ

# Dataset

## Links

* [HF Datasets](https://huggingface.co/docs/datasets/index)
* [Load](https://huggingface.co/docs/datasets/load_hub)
* [EDA](https://huggingface.co/docs/datasets/access)
* [Pre-process](https://huggingface.co/docs/datasets/process)

## Load and save

In [None]:
%%time
#https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html
#https://huggingface.co/docs/datasets/loading#local-and-remote-files
#list_datasets()
#load_ds also splits into train/eval

dataset_dir = f'{save_dir}/Datasets/{dataset}'

if ds_config == '':
  print(orange, f'Downloading "{ds_name}"')
  ds = load_dataset(ds_name)
else:
  print(orange, f'Downloading "{ds_config}" from "{ds_name}"')
  ds = load_dataset(ds_name, ds_config)

ds.save_to_disk(dataset_dir)

#TODO save and load locally
# dataset_dir = f'{save_dir}/Datasets/{dataset}'
# try:
#   if not os.path.exists(dataset_dir):
#     os.makedirs(dataset_dir)
#     if ds_config == '':
#       print(orange, f'Downloading and saving dataset "{ds_name}".')
#       ds = load_dataset(ds_name)
#     else:
#       print(orange, f'Downloading and saving dataset "{ds_config}" from "{ds_name}".')
#       ds = load_dataset(ds_name, ds_config)
#     ds.save_to_disk(dataset_dir)
#   else:
#     print(orange, f'Loading dataset from {dataset_dir}')    
#     # if ds_config == '':
#     data_files = { 'train': 'train' }
#     ds = load_dataset(path = dataset_dir, data_files = data_files)
#     # else:
#     #   ds = load_dataset(path=ds_name, name=ds_config, data_files=dataset_dir)
# except Exception as e:
#   print(red, e)

print(ds_name, ds_config, dataset_dir)

# dataset_dir = None; del dataset_dir

In [None]:
%shell ls -ARsh /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/

## EDA

In [None]:
ds

In [None]:
ds.shape

In [None]:
ds['train'][:2]

In [None]:
lblcnt0 = ds['train']['label'].count(0)
lblcnt1 = ds['train']['label'].count(1)
print(f'Label count: 0x {lblcnt0}, 1x {lblcnt1}')

In [None]:
# from sklearn.utils import compute_class_weight
# classWeight = compute_class_weight(
#   'balanced',
#   classes = np.unique(ds['train']['label']),
#   y = ds['train']['label']
# ) 
# classWeight = dict(enumerate(classWeight))

# Tokenizer

## Prepare dataset

In [None]:
label_list = ds['train'].unique(ds_colren)
num_labels = len(label_list)

In [None]:
#rename column `ds_colren`, model expects 'labels'
for name in ds:
  if ds_colren in ds[name].column_names:
    ds[name] = ds[name].rename_column(ds_colren, 'labels')
  else:
    print(red, "Attribute/Feature/Column '%s' not found in '%s'. Found:" % (ds_colren, name))
    print(ds[name].column_names)

In [None]:
ds.column_names

## Load and save

In [None]:
%%time
# tokenizer converts the tokens to vocabulary indices and pads batched data
#TODO try args max_length=X and fast=False
tokenizer_dir = f'{save_dir}/Tokenizer/{modelname}'
try:
  if not os.path.exists(tokenizer_dir):
    print(orange, f'Downloading and saving tokenizer to {tokenizer_dir}')
    os.makedirs(tokenizer_dir)
    tokenizer = AutoTokenizer.from_pretrained(modelname, use_fast=True, truncation=True, padding=True)
    tokenizer.save_pretrained(tokenizer_dir)
  else:
    print(orange, f'Loading tokenizer from {tokenizer_dir}')
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
except Exception as e:
  print(red, e)

In [None]:
tokenizer.encode("This is not Sparta")

In [None]:
print("Vocab size: %s" % len(tokenizer.vocab))
print("Special tokens: %s" % tokenizer.special_tokens_map.values())

In [None]:
%%time
#tokenizing and padding
#try tokenizer(padding="max_length", max_length=X)
#attributes needed for down-stream training:
#'input_ids', 'token_type_ids', 'attention_mask', 'labels'
#lambda convert tuple to list, list comprehension?
#'labels' has to be present after tokenization to avoid KeyError('loss')
#https://github.com/huggingface/transformers/issues/11256

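def _tokenize(batch):
  #one list per text column; sentence pairs are passed as two lists
  cols = [batch[col] for col in ds_colstokens]
  return tokenizer(*cols, truncation=True)

#keep the result: the following cells expect `ds_tokenized`
ds_tokenized = ds.map(_tokenize, batched=True)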

In [None]:
#avoid info 'The following columns in the evaluation set  don't have a corresponding argument'
try:
  ds_tokenized = ds_tokenized.remove_columns(ds_colstokens).remove_columns(ds_colrem)
except Exception as e:
  print(red, e)
ds_tokenized.column_names

In [None]:
#save tokenized dataset to /root/.cache/huggingface/datasets
#or google drive, mount beforehand
#TODO

# Model

## Load and save

In [None]:
%%time
model_dir = f'{save_dir}/Models/{modelname}'
try:
  if not os.path.exists(model_dir):
    print(green, f'Downloading and saving model to {model_dir}')
    os.makedirs(model_dir)
    modelobj = AutoModelForSequenceClassification.from_pretrained(modelname, num_labels=num_labels)
    modelobj.save_pretrained(model_dir)
  else:
    print(green, f'Loading model from {model_dir}')
    modelobj = AutoModelForSequenceClassification.from_pretrained(model_dir)
except Exception as e:
  print(red, e)

model_dir = None; del model_dir

## Pre-trained hyperparam

In [None]:
print(modelobj.config)
# modelobj.config.classifier_dropout = 0.2
# modelobj.config.max_position_embeddings = 1024

## Transformer architecture

![Transformer Attention](https://miro.medium.com/max/875/1*9nUzdaTbKzJrAsq1qqJNNA.png)

## Attention

In [None]:
try:
  print(modelobj.base_model.encoder.layer[0])
except:
  try:
    print(modelobj.base_model.transformer.layer[0])
  except:
    try:
      print(modelobj.base_model.encoder)
    except Exception as e:
      print(red, e)

## Embeddings

In [None]:
print(modelobj.base_model.embeddings)
# print(modelobj.bert.embeddings)

## Test model before fine-tuning

In [None]:
#TODO source for test function, WandB colab?
#TODO multi inputs
def test_model(inputs, tokenizer=tokenizer, model=modelobj):
  if device == "CUDA":
    for i in inputs:
      print(i)
    # inputs = tokenizer(sentence, return_tensors='pt')
    # ensure model and inputs are on the same device (GPU)
    # inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
    # model = model.cuda()
    # get prediction - 10 classes "probabilities" (not really true because they still need to be normalized)
    # with torch.no_grad(): 
    #     predictions = model(**inputs)[0].cpu().numpy()
    # get the top prediction class and convert it to its associated label
    # top_prediction = predictions.argmax().item()
    # return ds['train'].features['labels'].int2str(top_prediction)
    return 1
  else:
    return "NO CUDA"

In [None]:
%%time
if dataset == 'YAHOO':
  print(test_model('Why is cheese so much better with wine?'))
# elif dataset == 'MRPC':
#   print(test_model('hallo', 'hedda'))

# Metrics

## Links

* [Metrics (deprecated)](https://huggingface.co/metrics)
* [Evaluate (new)](https://huggingface.co/docs/evaluate/index)

## Types

### Model

* Loss (MSE, MAE)
* Accuracy
* Recall, Precision, F1
* Perplexity (PPL)
* BLEU
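
The notebook requests `macro` averaging for precision and recall. To show what that setting means, the stdlib-only sketch below computes per-class precision, recall and F1 and then takes their unweighted mean. The function name `macro_prf` and the toy labels are assumptions for illustration; in the notebook the scores come from the HF metric scripts.

```python
def macro_prf(references, predictions):
    """Return macro-averaged (precision, recall, f1) over all seen classes."""
    classes = sorted(set(references) | set(predictions))
    pairs = list(zip(references, predictions))
    precisions, recalls, f1s = [], [], []

    for cls in classes:
        tp = sum(1 for ref, pred in pairs if ref == cls and pred == cls)
        fp = sum(1 for ref, pred in pairs if ref != cls and pred == cls)
        fn = sum(1 for ref, pred in pairs if ref == cls and pred != cls)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    # macro average: every class counts equally, regardless of its frequency
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


refs = [1, 1, 1, 0]
preds = [1, 1, 0, 0]
precision, recall, f1 = macro_prf(refs, preds)
print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')
```

Because every class has the same weight, a rare class with poor recall lowers the macro score as much as a frequent one would, which matters for imbalanced label counts.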

### System

## PrecisionRecallCurve
* [sklearn.metrics.precision_recall_curve](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/wandb-log/Plot_Precision_Recall_Curves_with_W%26B.ipynb)
* [plot_precision_recall](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py)
* [plot_display_object_visualization](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_display_object_visualization.html#sphx-glr-auto-examples-miscellaneous-plot-display-object-visualization-py)

## Load with HF Metrics Builder Scripts

In [None]:
# #list_metrics()
# metrics_dir = f'{save_dir}/Metrics'
# metrics_loaded = []

# #downloading metrics builder scripts
# try:
#   for met in metrics_to_load:
#     print(orange, f'Downloading builder script for "{met}".')
#     metrics_loaded.append(load_metric(met))
#     print(green, metrics_loaded[-1].description)
#     print(green, metrics_loaded[-1].features)
#     print('\n')
# except Exception as e:
#   print(red, e)
# #TODO save locally
# # for met in metrics_loaded:
#   # met.

In [None]:
# def compute_metrics(eval_pred):
#   predictions, labels = eval_pred
#   predictions = np.argmax(predictions, axis=1) #predictions.argmax(-1)
  
#   print(orange,"*************")
  
#   for i, m in enumerate(metrics_loaded):

#     if metrics_to_load[i] in ['precision','recall','f1']:
#       met = m.compute(predictions=predictions, references=labels, average=ds_avg)
#     else:
#       met = m.compute(predictions=predictions, references=labels)

#     if metrics_to_load[i] == 'accuracy':
#       ret = met

#     wandb.log(met)
#     print(met)

#     #test if metrics-obj need to be reloaded to avoid same eval values
#     #metrics_loaded[i] = load_metric(metrics_to_load[i])

# #    print("**************** dir(predictions)")
# #    print(dir(met))
# #    print("**************** dir(labels)")
# #    print(dir(labels))
# #    print("**************** m.__dir__")
# #    print(m.__dict__)
# #    print("**************** dir(m)")
# #    print(dir(m))
    
#   print(orange,"*************")

#   return ret

## Load with specific HF Builder Scripts for provided dataset

In [None]:
#dataset-specific builder scripts by HF
from datasets import load_metric
#ds_config is an empty string for datasets without a sub-configuration
if not ds_config:
  metrics = load_metric(ds_name)
else:
  metrics = load_metric(ds_name, ds_config)

In [None]:
metrics_to_load = ["recall", "precision", "mse", "mae"]
metrics_loaded = []
met_avg = "macro"

In [None]:
try:
  for met in metrics_to_load:
    print(orange, f'Downloading builder script for "{met}".')
    metrics_loaded.append(load_metric(met))
    # print(green, metrics_loaded[-1].description)
    # print(green, metrics_loaded[-1].features)
    # print('\n')
except Exception as e:
  print(red, e)

In [None]:
import numpy as np

def compute_metrics(pred):

  #TODO, same axis as np?
  #https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
  predictions = np.argmax(pred.predictions, axis=1)
  # predictions = pred.predictions.argmax(-1)
  labels = pred.label_ids

  results = metrics.compute(
      predictions = predictions,
      references = labels
  )

  for i, metric in enumerate(metrics_loaded):
    if metrics_to_load[i] in ['precision','recall']:
      met = metric.compute(
        predictions = predictions,
        references = labels,
        average = met_avg
      )
    else:
      met = metric.compute(
        predictions = predictions,
        references = labels
      )
    results[metrics_to_load[i]] = met[metrics_to_load[i]]

  print(orange,"*************")
  wandb.log(results)
  print(results)
  
  if device == 'CUDA':
    #https://developer.nvidia.com/nvidia-system-management-interface
    #https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
    %shell nvidia-smi --query-gpu={nvidia_smi_query} --format=csv

  print(orange,"*************")

  return results

## Load with sklearn.metrics

In [None]:
# from sklearn.metrics import precision_recall_fscore_support, accuracy_score
# def compute_metrics(pred):
#     """
#     Compute metrics for Trainer
#     """
#     labels = pred.label_ids
#     preds = pred.predictions.argmax(-1)
#     #_, _, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
#     precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
    
#     acc = accuracy_score(labels, preds)

#     print(f'acc: {acc}')
#     print(f'f1: {f1}')

#     results = {
#         'accuracy': acc,
#         'f1': f1,
#         #'macro f1': macro_f1,
#         'precision': precision,
#         'recall': recall
#     }

#     print(orange,"*************")
#     wandb.log(results)
#     print(results)
#     print(orange,"*************")

#     return results

# Sweep

## Links
* [Sweep configuration](https://docs.wandb.com/sweeps/configuration)
* [YAML file](https://docs.wandb.com/sweeps/quickstart#2-sweep-config)
* [sweep random variables](https://docs.wandb.com/sweeps/configuration#distributions)
* [simple training script and a few flavors of sweep configs](https://github.com/wandb/examples/tree/master/examples/keras/keras-cnn-fashion)
* [list of all configuration options](https://docs.wandb.com/library/sweeps/configuration)
* [Big collection of examples in YAML format](https://github.com/wandb/examples/tree/master/examples/keras/keras-cnn-fashion)
* [Sweep from an existing project](https://docs.wandb.ai/guides/sweeps/existing-project)
* [Sweep on CLI Quickstart](https://docs.wandb.com/sweeps/quickstart)
* [Example Tune Sweep Dashboard](https://app.wandb.ai/wandb/examples-keras-cnn-fashion/sweeps/xbs2wm5e?workspace=user-lavanyashukla)
* We offer to `early_terminate` your runs with the [HyperBand](https://arxiv.org/pdf/1603.06560.pdf) scheduling algorithm. See more [here](https://docs.wandb.com/sweeps/configuration#stopping-criteria)
* [Bayesian BOHB ArXiv 1807.01774](https://arxiv.org/abs/1807.01774)
* [Bayesian Hyperband](https://app.wandb.ai/wandb/examples-keras-cnn-fashion/sweeps/us0ifmrf?workspace=user-lavanyashukla)

## Configuration

In [None]:
#get better acc
if not toggle_reproduce_wrong_optim:
  parameters_dict = {}
  parameters_dict = {
        'max_steps' : {
          'values': list(range(4000,7000,1000))
        },
        'per_device_train_batch_size': {
          'values' : [8, 16, 32] #not enough RAM: [64,128,256,512,1024]
        },
        # 'per_device_eval_batch_size':
        # 'per_gpu_eval_batch_size':
        # 'per_gpu_train_batch_size':
        'seed' : {
          # 'value': 101
          'distribution': 'int_uniform',
          'min': 1,
          'max': 101
        },
        'evaluation_strategy': {
          'distribution': 'categorical',
          'values': ['steps', 'epoch'] #, 'no']          
        },
        # 'label_smoothing_factor': {
        #   'distribution': 'uniform',
        #   'min': 0,
        #   'max': 0.1
        # },
  }
  # parameters_dict['per_gpu_eval_batch_size'] = parameters_dict['per_device_train_batch_size']

In [None]:
# #reproduce leveling out of metrics with wrong optimizers
# #https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
if toggle_reproduce_wrong_optim:
  parameters_dict = {}
  parameters_dict = {
      'learning_rate': {
          'distribution': 'uniform',
          'min': 5e-7,
          'max': 5e-3
        },
        'seed' : {
          'distribution': 'int_uniform',
          'min': 51,
          'max': 202
        },
        'max_steps' : {
          'values': list(range(1500,5000,500))
        },
        'optim': {
          'distribution': 'categorical',
          'values': ['adamw_torch', 'adafactor'] # 'sgd'; 'adamw_hf' is deprecated
        },
        'evaluation_strategy': {
          'value': 'steps'          
        },
  }

In [None]:
sweep_config = {
  'method' : 'random' #grid, bayes
}
sweep_config['metric'] = {
    'name': metric_to_optimize,
    'goal': goal
    }
sweep_config['parameters'] = parameters_dict
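
To make the structure of `parameters_dict` concrete, the sketch below draws a few configurations the way a `random` search conceptually does: a uniform pick from `values`, or a draw from the named `distribution`. It is an illustrative stand-in and not W&B's sampler. `sample_config` is a hypothetical name, only the cases used in this notebook are handled, and treating both `int_uniform` bounds as inclusive is an assumption.

```python
import random


def sample_config(parameters, rng):
    """Draw one value per hyperparameter from a sweep-style parameter dict."""
    config = {}
    for name, spec in parameters.items():
        if 'value' in spec:
            # fixed value, nothing to sample
            config[name] = spec['value']
        elif 'values' in spec:
            # uniform pick from an explicit list (also covers 'categorical')
            config[name] = rng.choice(spec['values'])
        elif spec.get('distribution') == 'int_uniform':
            # assumption: both bounds are treated as inclusive here
            config[name] = rng.randint(spec['min'], spec['max'])
        elif spec.get('distribution') == 'uniform':
            config[name] = rng.uniform(spec['min'], spec['max'])
        else:
            raise ValueError(f'Unsupported spec for "{name}": {spec}')
    return config


demo_parameters = {
    'max_steps': {'values': list(range(4000, 7000, 1000))},
    'per_device_train_batch_size': {'values': [8, 16, 32]},
    'seed': {'distribution': 'int_uniform', 'min': 1, 'max': 101},
    'evaluation_strategy': {'distribution': 'categorical',
                            'values': ['steps', 'epoch']},
}

rng = random.Random(0)  # seeded so the demo is repeatable
samples = [sample_config(demo_parameters, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Each printed dict corresponds to what one run would receive through `wandb.config`.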

In [None]:
# print({
#     'train': parameters_dict['per_device_train_batch_size']
# })

# Training

* The function `train()` starts each run with the configuration values selected by `wandb.agent`
* [**`wandb.init()`**](https://docs.wandb.com/library/init) – Initialize a new W&B Run. Each Run is a single execution of the training function
* [**`wandb.config`**](https://docs.wandb.com/library/config) – Save all your hyperparameters in a configuration object so they can be logged. Read more about how to use `wandb.config` [here](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/wandb-config/Configs_in_W%26B.ipynb)
* More details on instrumenting W&B with PyTorch, see [this Colab](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Simple_PyTorch_Integration.ipynb)

## Args and Train()

In [None]:
# num_train_epochs=16,              # total number of training epochs
# per_device_train_batch_size=32,  # batch size per device during training
# per_device_eval_batch_size=32,   # batch size for evaluation
# warmup_steps=600,                # number of warmup steps for learning rate scheduler
# weight_decay=0.01,               # strength of weight decay
# logging_dir='/content/logs',
# fp16 = True,  
def train(config=None):
  
  with wandb.init(config=config):
      
    config = wandb.config

    eval_steps = round(config.max_steps / 5)
    save_steps = eval_steps * 2

    #args need to be assigned here to avoid wandb runtime TypeError()
    #"'TrainingArguments' object does not support item assignment"
    args = TrainingArguments(
      report_to = 'wandb',
      output_dir = wnb_project_name,
      run_name = wnb_project_name,
      overwrite_output_dir = True,
      load_best_model_at_end = True,
      logging_steps = 100,
      eval_steps = eval_steps,
      save_steps = save_steps,
      #remove_unused_columns = True,          # avoid info 'The following columns in the evaluation set  don't have a corresponding argument'
      metric_for_best_model = metric_to_optimize,
      greater_is_better = greaterBool,
      #sweep params to be changed by sweep agent
      max_steps = config.max_steps,
      seed = config.seed,
      #In Trainer, evaluation_strategy defaults to no, but save_strategy defaults to steps. Why? #14051
      #https://github.com/huggingface/transformers/issues/14051
      evaluation_strategy = config.evaluation_strategy,
      save_strategy = config.evaluation_strategy,
      # label_smoothing_factor = config.label_smoothing_factor,
      per_device_train_batch_size = config.per_device_train_batch_size,
      per_device_eval_batch_size = config.per_device_train_batch_size,
    )

    if toggle_reproduce_wrong_optim:
      #['adamw_hf', 'adamw_torch', 'adamw_torch_xla', 'adamw_apex_fused', 'adafactor', 'adamw_bnb_8bit', 'sgd', 'adagrad']
      #'adamw_hf' deprecated
      args.optim = config.optim
      #lr the lower the better in experiments
      args.learning_rate = config.learning_rate
    
    trainer = Trainer(
      model = modelobj,
      args = args, 
      train_dataset = ds_tokenized['train'],
      eval_dataset = ds_tokenized['test'],
      tokenizer = tokenizer,
      compute_metrics = compute_metrics
    )
    
    print(orange,"*************")
    print("Model: %s, Architectures: %s" % (
        trainer.model.name_or_path,
        trainer.model.config.architectures
      )
    )
    print("Metric to optimize: %s, #Labels: %s, Avg: %s" % (
        args.metric_for_best_model,
        num_labels,
        ds_avg
      )
    )

    if args.evaluation_strategy.value == 'steps':
      print("eval_steps: %s, save_steps: %s" % (eval_steps, save_steps))
    print(orange,"*************")

    trainer.evaluate()
    trainer.train()

# WandB

## Load key-file

In [None]:
wandb_keyfile = 'conf/wandb.json'

In [None]:
#get API-key from https://app.wandb.ai/authorize
#get key from file and save to ENV
with open(f"{save_dir}/{wandb_keyfile}", 'r') as j:
 data = json.loads(j.read())
 os.environ['WANDB_USERNAME'] = data['username']
 os.environ['WANDB_API_KEY'] = data['key'] #WANDB_KEY deprecated
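
A slightly more defensive variant of the cell above, sketched under the assumption that the key file is a JSON object with the keys `username` and `key`, as that cell implies. `load_wandb_credentials` is a hypothetical helper. The demo writes a throwaway file with a dummy key and fills a plain dict instead of `os.environ`, so it does not touch real credentials.

```python
import json
import os
import tempfile


def load_wandb_credentials(path, environ=os.environ):
    """Read username and API key from a JSON file into the given environment."""
    with open(path, 'r', encoding='utf-8') as handle:
        data = json.load(handle)

    missing = [name for name in ('username', 'key') if not data.get(name)]
    if missing:
        raise KeyError(f'{path} is missing: {", ".join(missing)}')

    environ['WANDB_USERNAME'] = data['username']
    environ['WANDB_API_KEY'] = data['key']
    return data['username']


# demo with a temporary file and a dict standing in for os.environ
demo_env = {}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'wandb.json')
    with open(demo_path, 'w', encoding='utf-8') as handle:
        json.dump({'username': 'demo-user', 'key': 'not-a-real-key'}, handle)
    user = load_wandb_credentials(demo_path, environ=demo_env)

print(user, sorted(demo_env))
```

In the notebook, the call would be `load_wandb_credentials(f"{save_dir}/{wandb_keyfile}")`, which writes to `os.environ` by default.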

## Login

In [None]:
#wandb.login()
#wandb.init(project=wnb_project_name, entity=wnb_entity, save_code = True)
#wandb.finish()

In [None]:
#read api-key from ENV, if not provided from interactive user input 
#!wandb login --relogin
!wandb login --cloud $WANDB_API_KEY

## Schema

* Initialise the sweep controller by calling `wandb.sweep()` with `sweep_config` and `project`
* The sweep controller returns a `sweep_id`, which is used to start agents via `wandb.agent`
* On the CLI, `wandb.sweep` is replaced by `wandb sweep config.yaml`
* Inside a notebook, `wandb.finish()` has to be called to stop the agents

<img src="https://i.imgur.com/zlbw3vQ.png" alt="sweeps-diagram" width="500">
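
The controller/agent flow can be sketched as one small function. `run_sweep` is a hypothetical wrapper that takes the `wandb` module as an argument, and `FakeWandb` is a stand-in that only records calls so the sketch runs without a W&B account. Only `sweep`, `agent` and `finish` are called, mirroring the calls used in the cells of this notebook.

```python
def run_sweep(wandb_module, sweep_config, train_fn, project, entity, count):
    """Create a sweep, run `count` agent runs, and always call finish()."""
    sweep_id = wandb_module.sweep(sweep_config, project=project, entity=entity)
    try:
        wandb_module.agent(sweep_id, train_fn, count=count)
    finally:
        # inside a notebook the run has to be closed explicitly
        wandb_module.finish()
    return sweep_id


class FakeWandb:
    """Stand-in for the wandb module, for demonstration only."""

    def __init__(self):
        self.calls = []

    def sweep(self, sweep, project=None, entity=None):
        self.calls.append('sweep')
        return 'demo-sweep-id'

    def agent(self, sweep_id, function=None, count=None):
        self.calls.append('agent')
        for _ in range(count):
            function()

    def finish(self):
        self.calls.append('finish')


fake = FakeWandb()
runs = []
sweep_id = run_sweep(
    fake,
    {'method': 'random'},
    lambda: runs.append('run'),
    project='demo-project',
    entity='demo-entity',
    count=3,
)
print(sweep_id, fake.calls, len(runs))
```

With the real library, the first argument would be the imported `wandb` module and `train_fn` would be the notebook's `train`. The `try`/`finally` ensures `finish()` is still called if a run raises.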




## Initialise

In [None]:
sweep_id = wandb.sweep(sweep_config, project=wnb_project_name, entity=wnb_entity)

## Start Training

In [None]:
wandb.agent(sweep_id, train, count=train_count)

## Finish if inside NB

In [None]:
#explicitly call finish within notebooks
wandb.finish()