# Bee vs wasp dataset preview

You can write up to 5GB to the current directory (`/kaggle/working/`), which is preserved as output when you create a version using "Save & Run All".
You can also write temporary files to `/kaggle/temp/`, but they won't be saved outside of the current session.

Note: this code requires fastai 2.0+

In [None]:
!pip install fastai --upgrade

Initialisation:

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.vision.all import *
from fastai.metrics import error_rate

# additional classic imports
from pathlib import Path
import pandas as pd
import numpy as np
import random

Training hyperparameters:

In [None]:
bs = 64 # Batch size
resize_size = 96 # for training, resize all the images to a square of this size
training_subsample = 0.1 # for development, use a small fraction of the dataset rather than the full dataset

Load the labels from `.csv` using pandas:

In [None]:
bees_vs_wasps_dataset_path=Path('../input/bee-vs-wasp/kaggle_bee_vs_wasp') # this is relative to the "example_notebook" folder. Modify this to reflect your setup
df_labels = pd.read_csv(bees_vs_wasps_dataset_path/'labels.csv')
df_labels=df_labels.set_index('id')
df_labels.head()

Convert the Windows-style paths (backslashes) to Linux-style paths (forward slashes):

In [None]:
# Replace backslashes with forward slashes in the whole column at once
df_labels['path'] = df_labels['path'].str.replace('\\', '/', regex=False)
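
The same conversion can be illustrated on a single string with the standard library. This is a minimal sketch, not part of the original notebook; the helper `to_posix` and the example file name are hypothetical.

```python
from pathlib import PureWindowsPath

def to_posix(path_str: str) -> str:
    """Return a Windows-style relative path written with forward slashes."""
    # PureWindowsPath accepts both separators and does not touch the file system.
    return PureWindowsPath(path_str).as_posix()

# Hypothetical example value; the real entries come from labels.csv.
example = "bee1\\some_image.jpg"
print(to_posix(example))
```

Paths that already use forward slashes pass through unchanged, so the helper is safe to apply twice.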

Subsample the dataset to reduce training time for this demonstration only:

In [None]:
df_labels = df_labels.sample(frac=training_subsample, axis=0) 
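
As written, the call above draws a different subset on every run because no seed is given. If reproducibility matters, pandas' `sample` accepts a `random_state` argument. The idea behind seeded subsampling can be sketched with the standard library alone; the function `subsample` and the row ids below are hypothetical, for illustration only.

```python
import random

def subsample(ids, frac, seed):
    """Return a reproducible random subset holding round(frac * len(ids)) items."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    k = round(frac * len(ids))
    return rng.sample(list(ids), k)

ids = list(range(1000))  # hypothetical row ids
first = subsample(ids, 0.1, seed=42)
second = subsample(ids, 0.1, seed=42)
print(len(first), first == second)
```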

Configure the fast.ai image data loader:

In [None]:
data = ImageDataLoaders.from_df(
    df = df_labels,
    path = Path(bees_vs_wasps_dataset_path),
    valid_pct=0.2,
    seed = 42,
    fn_col='path',
    folder=None,
    label_col='label',
    bs=bs,
    shuffle_train=True,
    batch_tfms=aug_transforms(),
    item_tfms=Resize(resize_size),
    device='cpu',
    num_workers=0,
)
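
The `valid_pct=0.2` and `seed=42` arguments request a seeded random 80/20 split into training and validation sets. The sketch below shows the general idea with the standard library; it is not fastai's implementation, and the function name `random_split` is an assumption made for illustration.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Shuffle the indices 0..n_items-1 and hold out the first valid_pct of them."""
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)  # same seed gives the same split
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # (training indices, validation indices)

train_idx, valid_idx = random_split(1000)
print(len(train_idx), len(valid_idx))
```

Fixing the seed keeps the validation set identical between runs, which makes error rates comparable across experiments.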

Preview a few samples from the dataset:

In [None]:
data.show_batch()

## Train a basic classifier
Note that the settings are deliberately minimal (small images, 10% of the data, a single epoch) so that the notebook executes quickly; the results are correspondingly poor.

In [None]:
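# Note: fastai 2.7+ renames cnn_learner to vision_learner; cnn_learner remains as a deprecated alias
learn = cnn_learner(data, resnet18, metrics=error_rate)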
learn.model_dir='/kaggle/temp/'

Run the learning rate finder:

In [None]:
best_lr=learn.lr_find(start_lr=1e-04, end_lr=1, num_it=30) 
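
A learning rate finder trains for a few iterations while increasing the learning rate from `start_lr` to `end_lr`, recording the loss at each step. The sketch below shows an exponential sweep with the values used above. It is only an illustration: the function `lr_sweep` is hypothetical, and fastai's actual schedule may differ in detail (for example, it can stop early when the loss diverges).

```python
def lr_sweep(start_lr=1e-4, end_lr=1.0, num_it=30):
    """Return num_it learning rates spaced evenly on a log scale."""
    ratio = end_lr / start_lr
    # Each step multiplies the rate by the same constant factor.
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

lrs = lr_sweep()
print(f"{lrs[0]:.1e} ... {lrs[-1]:.1e} over {len(lrs)} iterations")
```

With only 30 iterations the sweep is coarse, so the suggested rate should be treated as a rough starting point.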

Run the transfer learning procedure:

In [None]:
learn.fine_tune(1,base_lr=best_lr[0])

In [None]:
learn.show_results()

In [None]:
learn.save('stage-1')

Visualize the samples with the highest losses:

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
interp.plot_top_losses(12, figsize=(14,14))

Display the confusion matrix.

Note that in this example, it is calculated only over the validation split of the subsampled dataset.

In [None]:
interp.plot_confusion_matrix(figsize=(4,4), dpi=120)
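
For reference, a confusion matrix simply counts (actual, predicted) label pairs, and the error rate is the share of pairs that disagree. The sketch below uses made-up labels and hypothetical helper names; it does not reproduce fastai's internals or this notebook's results.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def error_rate_from_counts(counts):
    """Fraction of samples whose predicted label differs from the actual one."""
    total = sum(counts.values())
    wrong = sum(n for (a, p), n in counts.items() if a != p)
    return wrong / total

# Made-up labels for illustration only.
actual = ["bee", "bee", "wasp", "wasp", "wasp", "bee"]
predicted = ["bee", "wasp", "wasp", "wasp", "bee", "bee"]

counts = confusion_counts(actual, predicted)
err = error_rate_from_counts(counts)
for (a, p), n in sorted(counts.items()):
    print(f"actual={a:<5} predicted={p:<5} count={n}")
print(f"error rate: {err:.3f}")
```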

---