# Baseline solution

In this notebook we will create a baseline solution to our lemon problem. A notebook is handy for iterating quickly. We will then refactor this code into a script so that we can run hyperparameter sweeps.

In [None]:
import os
from pathlib import Path
import pandas as pd

import wandb
from fastai.vision.all import *

In [None]:
set_seed(42, reproducible=True)

We will define some global configuration parameters.

In [None]:
PROJECT_NAME = 'lemon-project'
ENTITY = 'wandb_course'
PROCESSED_DATA_AT = 'lemon_dataset_split_data'

Let's start a W&B run so that we can grab the preprocessed data.

In [None]:
run = wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="training")

Find the most recent ("latest") version of the processed data.

In [None]:
processed_data_at = run.use_artifact(f'{PROCESSED_DATA_AT}:latest')
processed_dataset_dir = Path(processed_data_at.download())
df = pd.read_csv(processed_dataset_dir / 'data_split.csv')

We will not use the held-out test split at this stage.

In [None]:
df = df[df.stage != 'test'].reset_index(drop=True)

In [None]:
(processed_dataset_dir/'images').ls()

This column tells the trainer how to split the data between training and validation.

In [None]:
df['valid'] = df.stage == 'valid'
df.head()

## Using a configuration dict

We will use `ml_collections` here, which is a handy way of expressing configurations of experiments and models. Here we will define some global configuration parameters to train our model.

In [None]:
from ml_collections import config_dict

cfg = config_dict.ConfigDict()
cfg.img_size = 256
cfg.target_column = 'mold'
cfg.bs = 32
cfg.seed = 42
cfg.arch = 'resnet18'

We will update the config of the run. It is as simple as adding the new entries to the `wandb.config` dictionary.

In [None]:
wandb.config.update(cfg.to_dict())
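
When this notebook is later refactored into a script, the same defaults can be exposed as command-line flags so that a sweep can override them. The following is only a minimal sketch of that idea: it uses the standard-library `argparse` and a plain dict in place of `ConfigDict`, and `DEFAULTS` and `parse_config` are hypothetical names, not part of the course code.

```python
import argparse

# Defaults mirroring the notebook's cfg entries. A plain dict stands in for
# ConfigDict so the sketch runs without ml_collections (an assumption).
DEFAULTS = {
    "img_size": 256,
    "target_column": "mold",
    "bs": 32,
    "seed": 42,
    "arch": "resnet18",
}


def parse_config(argv=None):
    """Build a config dict, letting command-line flags override the defaults."""
    parser = argparse.ArgumentParser(description="Baseline training config")
    for name, default in DEFAULTS.items():
        # type(default) keeps ints as ints and strings as strings
        parser.add_argument(f"--{name}", type=type(default), default=default)
    return vars(parser.parse_args(argv))


# Simulate a launch that overrides the batch size and the architecture
config = parse_config(["--bs", "64", "--arch", "resnet34"])
print(config)
```

A dict built this way could be passed to `wandb.config.update` exactly like `cfg.to_dict()` above.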

We are using fastai, so creating a `DataLoaders` pipeline from a dataframe is straightforward with the `ImageDataLoaders` class.

In [None]:
dls = ImageDataLoaders.from_df(df, path=processed_dataset_dir, seed=cfg.seed, fn_col='file_name', 
                               label_col=cfg.target_column, valid_col='valid', 
                               item_tfms=Resize(cfg.img_size), bs=cfg.bs)

In [None]:
dls.show_batch()

Let's check how many images in the validation dataset have mold.

In [None]:
df[df.valid == True]['mold'].value_counts()

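This is the baseline accuracy: the share of the most common class in the validation set, which is what a model that always predicts that class would score.

In [None]:
# share of the majority class, whichever label it happens to be
df[df.valid]['mold'].value_counts(normalize=True).max()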
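
To make the baseline concrete, here is a minimal sketch in plain Python of the majority-class accuracy. The labels are hypothetical toy values and `majority_baseline_accuracy` is an illustrative helper, not something from fastai or W&B.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical validation labels: 0 = no mold, 1 = mold
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
print(majority_baseline_accuracy(toy_labels))  # 6 of the 8 labels are 0 -> 0.75
```

Any trained model should beat this number to be considered useful.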

`fastai` already has a callback that integrates tightly with W&B: we only need to pass the `WandbCallback` to the learner and we are ready to go. The callback logs the useful variables for us. For example, any metric we pass to the learner is tracked by the callback.

In [None]:
from fastai.callback.wandb import WandbCallback

In [None]:
learn = vision_learner(dls, 
                       cfg.arch,
                       metrics=[accuracy, Precision(), Recall(), F1Score()],
                       cbs=[WandbCallback(log_preds=False, log_model=True), SaveModelCallback(monitor='f1_score')])

learn.fine_tune(2)
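
As a reminder of what the tracked metrics measure, here is a small plain-Python sketch that computes them from binary predictions. It assumes label 1 ("mold") is the positive class. `binary_metrics` and the toy data are hypothetical and only illustrate the definitions; they are not how fastai computes the metrics internally.

```python
def binary_metrics(targets, predictions):
    """Accuracy, precision, recall and F1, treating label 1 as positive."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: 6 moldy and 4 clean lemons; the model finds 3 of the moldy ones
# and raises 1 false alarm (tp=3, fn=3, fp=1, tn=3)
toy_targets = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
metrics = binary_metrics(toy_targets, toy_predictions)
print(metrics)
```

On imbalanced data, precision, recall and F1 say more than accuracy alone, which is one reason to monitor `f1_score` when saving the best model.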

We can log a table with all the predictions on the validation dataset using `learn.get_preds`.

In [None]:
inp,preds,targs,out = learn.get_preds(with_input=True, with_decoded=True)
inp.shape, preds.shape, targs.shape, out.shape

We will create a Table with 4 columns: image, probability, prediction and target.

In [None]:
imgs = [wandb.Image(t.permute(1,2,0)) for t in inp] # move channels last, as wandb.Image expects
pred_proba = preds[:,1].numpy().tolist()
targets = targs.numpy().tolist()
predictions = out.numpy().tolist()
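
The relationship between these columns can be sketched in plain Python. The example assumes two-class probability rows and that the decoded output is the argmax over the classes, as is the case with a cross-entropy loss. All `toy_*` values are hypothetical.

```python
# Hypothetical probability rows shaped like `preds`: [P(class 0), P(class 1)]
toy_preds = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
toy_targets = [0, 1, 1]

# Probability of the positive class, analogous to preds[:, 1]
toy_pred_proba = [row[1] for row in toy_preds]

# Index of the largest probability in each row, analogous to the decoded `out`
toy_predictions = [max(range(len(row)), key=row.__getitem__) for row in toy_preds]

# One tuple per example, in the column order (probability, prediction, target)
toy_rows = list(zip(toy_pred_proba, toy_predictions, toy_targets))
for row in toy_rows:
    print(row)
```

The third example shows a miss: the model assigns 0.45 to mold, predicts class 0, and the target is 1.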

We create an intermediate `pd.DataFrame` and then build the Table from it.

In [None]:
preds_df = pd.DataFrame(list(zip(imgs, pred_proba, predictions, targets)),
                        columns=['image', 'probability', 'prediction', 'target'])

run.log({'predictions_table': wandb.Table(dataframe=preds_df)})
run.finish()