# **Audio Tagging**

# Table of contents

## **<div id="I">I. Define the problem</div>**

### **<div id="I0">0. Sources</div>**

* https://medium.com/@ageitgey/machine-learning-is-fun-part-6-how-to-do-speech-recognition-with-deep-learning-28293c162f7a
* https://medium.com/nanonets/how-to-use-deep-learning-when-you-have-limited-data-part-2-data-augmentation-c26971dc8ced
* https://www.kaggle.com/daisukelab/cnn-2d-basic-solution-powered-by-fast-ai

### **<div id="I1">1. Problem description</div>**

One year ago, Freesound and Google’s Machine Perception hosted an audio tagging competition challenging Kagglers to build a general-purpose auto tagging system. This year they’re back and taking the challenge to the next level with multi-label audio tagging, doubled number of audio categories, and a noisier than ever training set.

![](https://storage.googleapis.com/kaggle-media/competitions/freesound/task2_freesound_audio_tagging.png)

Here's the background: Some sounds are distinct and instantly recognizable, like a baby’s laugh or the strum of a guitar. Other sounds are difficult to pinpoint. If you close your eyes, could you tell the difference between the sound of a chainsaw and the sound of a blender?

Because of the vastness of sounds we experience, no reliable automatic general-purpose audio tagging systems exist. A significant amount of manual effort goes into tasks like annotating sound collections and providing captions for non-speech events in audiovisual content.

To tackle this problem, Freesound (an initiative by MTG-UPF that maintains a collaborative database with over 400,000 Creative Commons Licensed sounds) and Google Research’s Machine Perception Team (creators of AudioSet, a large-scale dataset of manually annotated audio events with over 500 classes) have teamed up to develop the dataset for this new competition.

To win this competition, Kagglers will develop an algorithm to tag audio data automatically using a diverse vocabulary of 80 categories.

If successful, your systems could be used for several applications, ranging from automatic labelling of sound collections to the development of systems that automatically tag video content or recognize sound events happening in real time.

### **<div id="I2">2. Tools importing</div>**

Here we are importing every useful tool needed during our research process.

In [None]:
import time
start_time = time.time()

# Data analysis and wrangling
import numpy as np
import pandas as pd
import random

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython
import IPython.display
from tqdm import tqdm_notebook

# Audio processing
import librosa
import librosa.display

# Deep learning (fastai v1)
from fastai import *
from fastai.vision import *
from fastai.vision.data import *
from fastai.imports import *
from fastai.callback import *
from fastai.callbacks import *
# Machine learning
from sklearn import preprocessing
import sklearn.metrics
from sklearn.metrics import label_ranking_average_precision_score

# File handling
from pathlib import Path
import gc
import os
print(os.listdir("../input"))

### **<div id="I3">3. Label ranking average precision</div>**

The task consists of predicting the audio labels (tags) for every test clip. Some test clips bear one label while others bear several labels. The predictions are to be done at the clip level, i.e., no start/end timestamps for the sound events are required.

The primary competition metric will be label-weighted label-ranking average precision (lwlrap, pronounced "Lol wrap"). This measures the average precision of retrieving a ranked list of relevant labels for each test clip (i.e., the system ranks all the available labels, then the precisions of the ranked lists down to each true label are averaged). This is a generalization of the mean reciprocal rank measure (used in last year’s edition of the competition) for the case where there can be multiple true labels per test item. The novel "label-weighted" part means that the overall score is the average over all the labels in the test set, where each label receives equal weight (by contrast, plain lrap gives each test item equal weight, thereby discounting the contribution of individual labels when they appear on the same item as multiple other labels).

We use label weighting because it allows per-class values to be calculated, and still have the overall metric be expressed as a simple average of the per-class metrics (weighted by each label's prior in the test set). For participants’ convenience, a Python implementation of lwlrap is provided in this public Google Colab.
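
To make the definition concrete, here is a minimal sketch of the metric on a toy example. The function name `toy_lwlrap` and the two-clip data are made up for illustration, and the sketch assumes there are no tied scores; it is not the official implementation.

```python
import numpy as np

def toy_lwlrap(truth, scores):
    """Average the precision-at-rank of every (clip, true label) pair."""
    precisions = []
    for t, s in zip(truth, scores):
        order = np.argsort(-s)              # class indices, best score first
        hits = t[order].astype(bool)        # True where the ranked class is a true label
        cum_hits = np.cumsum(hits)          # number of true labels seen so far
        ranks = np.arange(1, len(s) + 1)    # 1-based rank of each position
        precisions.extend(cum_hits[hits] / ranks[hits])
    return float(np.mean(precisions))

# Hypothetical data: 2 clips, 3 classes.
truth = np.array([[1, 0, 1],
                  [0, 1, 0]])
scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.3, 0.7]])

# Clip 1: true labels sit at ranks 1 and 3, so the precisions are 1/1 and 2/3.
# Clip 2: the true label sits at rank 2, so the precision is 1/2.
score = toy_lwlrap(truth, scores)
print(round(score, 4))
```

Every true label contributes one precision value to the average, so a clip with two labels counts twice, which is the "label-weighted" part.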

In [None]:
# from official code https://colab.research.google.com/drive/1AgPdhSp7ttY18O3fEoHOQKlt_3HJDLi8#scrollTo=cRCaCIb9oguU
def _one_sample_positive_class_precisions(scores, truth):
    """Calculate precisions for each true class for a single sample.

    Args:
      scores: np.array of (num_classes,) giving the individual classifier scores.
      truth: np.array of (num_classes,) bools indicating which classes are true.

    Returns:
      pos_class_indices: np.array of indices of the true classes for this sample.
      pos_class_precisions: np.array of precisions corresponding to each of those
        classes.
    """
    num_classes = scores.shape[0]
    pos_class_indices = np.flatnonzero(truth > 0)
    # Only calculate precisions if there are some true classes.
    if not len(pos_class_indices):
        return pos_class_indices, np.zeros(0)
    # Retrieval list of classes for this sample.
    retrieved_classes = np.argsort(scores)[::-1]
    # class_rankings[top_scoring_class_index] == 0 etc.
    class_rankings = np.zeros(num_classes, dtype=int)  # np.int was removed in recent NumPy
    class_rankings[retrieved_classes] = range(num_classes)
    # Which of these is a true label?
    retrieved_class_true = np.zeros(num_classes, dtype=bool)  # np.bool was removed in recent NumPy
    retrieved_class_true[class_rankings[pos_class_indices]] = True
    # Num hits for every truncated retrieval list.
    retrieved_cumulative_hits = np.cumsum(retrieved_class_true)
    # Precision of retrieval list truncated at each hit, in order of pos_labels.
    precision_at_hits = (
            retrieved_cumulative_hits[class_rankings[pos_class_indices]] /
            (1 + class_rankings[pos_class_indices].astype(float)))
    return pos_class_indices, precision_at_hits


def calculate_per_class_lwlrap(truth, scores):
    """Calculate label-weighted label-ranking average precision.

    Arguments:
      truth: np.array of (num_samples, num_classes) giving boolean ground-truth
        of presence of that class in that sample.
      scores: np.array of (num_samples, num_classes) giving the classifier-under-
        test's real-valued score for each class for each sample.

    Returns:
      per_class_lwlrap: np.array of (num_classes,) giving the lwlrap for each
        class.
      weight_per_class: np.array of (num_classes,) giving the prior of each
        class within the truth labels.  Then the overall unbalanced lwlrap is
        simply np.sum(per_class_lwlrap * weight_per_class)
    """
    assert truth.shape == scores.shape
    num_samples, num_classes = scores.shape
    # Space to store a distinct precision value for each class on each sample.
    # Only the classes that are true for each sample will be filled in.
    precisions_for_samples_by_classes = np.zeros((num_samples, num_classes))
    for sample_num in range(num_samples):
        pos_class_indices, precision_at_hits = (
            _one_sample_positive_class_precisions(scores[sample_num, :],
                                                  truth[sample_num, :]))
        precisions_for_samples_by_classes[sample_num, pos_class_indices] = (
            precision_at_hits)
    labels_per_class = np.sum(truth > 0, axis=0)
    weight_per_class = labels_per_class / float(np.sum(labels_per_class))
    # Form average of each column, i.e. all the precisions assigned to labels in
    # a particular class.
    per_class_lwlrap = (np.sum(precisions_for_samples_by_classes, axis=0) /
                        np.maximum(1, labels_per_class))
    # overall_lwlrap = simple average of all the actual per-class, per-sample precisions
    #                = np.sum(precisions_for_samples_by_classes) / np.sum(precisions_for_samples_by_classes > 0)
    #           also = weighted mean of per-class lwlraps, weighted by class label prior across samples
    #                = np.sum(per_class_lwlrap * weight_per_class)
    return per_class_lwlrap, weight_per_class


# Wrapper for fast.ai library
def lwlrap(scores, truth, **kwargs):
    score, weight = calculate_per_class_lwlrap(to_np(truth), to_np(scores))
    return torch.Tensor([(score * weight).sum()])

### **<div id="I4">4. Data</div>**

#### **Train set**

The train set is meant to be for system development. The idea is to limit the supervision provided (i.e., the manually-labeled data), thus promoting approaches to deal with label noise. The train set is composed of two subsets as follows:

**Curated subset**

The curated subset is a small set of manually-labeled data from FSD.
* Number of clips/class: 75 except in a few cases (where there are fewer)
* Total number of clips: 4970
* Average number of labels/clip: 1.2
* Total duration: 10.5 hours

The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

**Noisy subset**

The noisy subset is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset.
* Number of clips/class: 300
* Total number of clips: 19815
* Average number of labels/clip: 1.2
* Total duration: ~80 hours

The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s.

Considering the numbers above, per-class data distribution available for training is, for most of the classes, 300 clips from the noisy subset and 75 clips from the curated subset, which means 80% noisy - 20% curated at the clip level (not at the audio duration level, considering the variable-length clips).
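
The split can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above (the noisy duration is approximate, so the last ratio is a rough estimate); it shows why the 80/20 balance holds at the clip level but not at the duration level.

```python
# Figures quoted in the dataset description above.
curated_per_class, noisy_per_class = 75, 300
curated_clips, noisy_clips = 4970, 19815
curated_hours, noisy_hours = 10.5, 80.0   # "~80 hours" for the noisy subset

noisy_share_per_class = noisy_per_class / (noisy_per_class + curated_per_class)
noisy_share_clips = noisy_clips / (noisy_clips + curated_clips)
noisy_share_hours = noisy_hours / (noisy_hours + curated_hours)

print(f"per class: {noisy_share_per_class:.1%} noisy")
print(f"all clips: {noisy_share_clips:.1%} noisy")
print(f"duration:  {noisy_share_hours:.1%} noisy")
```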

#### **Test set**

The test set is used for system evaluation and consists of manually-labeled data from FSD. Since most of the train data come from YFCC, some acoustic domain mismatch between the train and test set can be expected. All the acoustic material present in the test set is labeled, barring human error, with respect to the vocabulary of 80 classes used in the competition.

The test set is split into two subsets, for the public and private leaderboards. In this competition, the submission is to be made through Kaggle Kernels. Only the test subset corresponding to the public leaderboard is provided (without ground truth).

Submissions must be made with inference models running in Kaggle Kernels. However, participants can decide to train also in the Kaggle Kernels or offline (see Kernels Requirements for details).

This is a kernels-only competition with two stages. The first stage comprises the submission period until the deadline on June 10th. After the deadline, in the second stage, Kaggle will rerun your selected kernels on an unseen test set. The second-stage test set is approximately three times the size of the first. You should plan your kernel's memory, disk, and runtime footprint accordingly.

#### **Files**

* train_curated.csv - ground truth labels for the curated subset of the training audio files (see Data Fields below)
* train_noisy.csv - ground truth labels for the noisy subset of the training audio files (see Data Fields below)
* sample_submission.csv - a sample submission file in the correct format, including the correct sorting of the sound categories; it contains the list of audio files found in the test.zip folder (corresponding to the public leaderboard)
* train_curated.zip - a folder containing the audio (.wav) training files of the curated subset
* train_noisy.zip - a folder containing the audio (.wav) training files of the noisy subset
* test.zip - a folder containing the audio (.wav) test files for the public leaderboard

#### **Columns**

Each row of the train_curated.csv and train_noisy.csv files contains the following information:

* fname: the audio file name, e.g., 0006ae4e.wav
* labels: the audio classification label(s) (ground truth). Note that the number of labels per clip can be one, e.g., Bark, or more, e.g., "Walk_and_footsteps,Slam".
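
Because several labels can share one comma-separated field, the `labels` column has to be split before individual classes can be counted or encoded. The sketch below shows one way to do it; the rows and file names are hypothetical, and only the label strings follow the format described above.

```python
import numpy as np

# Hypothetical rows in the same (fname, labels) format as train_curated.csv.
rows = [
    ("clip_a.wav", "Bark"),
    ("clip_b.wav", "Raindrop"),
    ("clip_c.wav", "Walk_and_footsteps,Slam"),
]

# Build the vocabulary from the individual labels, not from the raw strings.
classes = sorted({label for _, labels in rows for label in labels.split(",")})
class_to_idx = {name: i for i, name in enumerate(classes)}

# One row per clip, one column per class, 1.0 where the class is present.
multi_hot = np.zeros((len(rows), len(classes)), dtype=np.float32)
for r, (_, labels) in enumerate(rows):
    for label in labels.split(","):
        multi_hot[r, class_to_idx[label]] = 1.0

print(classes)
print(multi_hot)
```

Counting unique values of the raw `labels` strings would instead count label combinations, so "Walk_and_footsteps,Slam" would be treated as a class of its own.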


## **<div id="II">II. Gather the data</div>**

We start by acquiring the training and testing datasets into Pandas DataFrames. We also combine these datasets to run certain operations on both datasets together.

In [None]:
training_curated_df = pd.read_csv("../input/train_curated.csv")
training_noisy_df = pd.read_csv("../input/train_noisy.csv")
training_df = [training_curated_df, training_noisy_df]
testing_df = pd.read_csv('../input/sample_submission.csv')

In [None]:
Path('trn_curated').mkdir(exist_ok=True, parents=True)
Path('trn_noisy').mkdir(exist_ok=True, parents=True)
Path('test').mkdir(exist_ok=True, parents=True)

## **<div id="III">III. Wrangle, cleanse and Prepare Data for Consumption</div>**

In [None]:
# preview the data
training_curated_df.head()

In [None]:
# preview the data
training_noisy_df.head()

In [None]:
training_curated_df.info()
print('_'*40)
training_noisy_df.info()

In [None]:
labels_curated = training_curated_df['labels'].unique()
print(labels_curated.shape)
print('_'*40)
print(labels_curated)

In [None]:
labels_noisy = training_noisy_df['labels'].unique()
print(labels_noisy.shape)
print('_'*40)
print(labels_noisy)

In [None]:
training_curated_df.describe()

## **<div id="IV">IV. Turning sound into bits</div>**

### **<div id="IV1">1. Sampling sound data</div>**

The first step in speech recognition is obvious — we need to feed sound waves into a computer. But sound is transmitted as waves. How do we turn sound waves into numbers?

![](https://cdn-images-1.medium.com/max/1200/1*6_q1VIVJuavYa-9Uby_L-A.png)

Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave. To turn this sound wave into numbers, we just record the height of the wave at equally-spaced points.

This is called **sampling**. We are taking a reading thousands of times a second and recording a number representing the height of the sound wave at that point in time. That’s basically all an uncompressed .wav audio file is.

“CD Quality” audio is sampled at 44.1 kHz (44,100 readings per second). For our problem, here is how we will proceed:
* Handle sampling rate 44.1 kHz as is, no information loss.
* Size of each file will be 128 x L, where L is the number of time steps. With the settings used in this notebook, L is about 64 per second of audio: [128, 128] if the sound is 2 s long.
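
As a small illustration of sampling, the sketch below generates two seconds of a pure tone at 44.1 kHz. The 440 Hz tone and the 16-bit conversion are assumptions chosen for the example; the competition clips are real recordings.

```python
import numpy as np

sampling_rate = 44100                        # readings per second
duration = 2.0                               # seconds
n_samples = int(sampling_rate * duration)    # 88200 readings in total

# Time of each reading, then the height of the wave at that time.
times = np.arange(n_samples) / sampling_rate
wave = 0.5 * np.sin(2 * np.pi * 440.0 * times)   # hypothetical 440 Hz tone

# Uncompressed .wav files commonly store each reading as a 16-bit integer.
pcm16 = np.round(wave * 32767).astype(np.int16)

print(n_samples, wave[:4], pcm16[:4])
```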

### **<div id="IV2">2. Pre-processing our sampled Sound Data</div>**

We now have an array of numbers with each number representing the sound wave’s amplitude at 1/44100th of a second intervals.

We could feed these numbers right into a neural network. But trying to recognize speech patterns by processing these samples directly is difficult. Instead, we can make the problem easier by doing some pre-processing on the audio data.

To make this data easier for a neural network to process, we are going to break apart this complex sound wave into its component parts. We’ll break out the low-pitched parts, the next-lowest-pitched-parts, and so on. Then by adding up how much energy is in each of those frequency bands (from low to high), we create a fingerprint of sorts for this audio snippet.

We do this using a mathematical operation called a Fourier transform. It breaks apart the complex sound wave into the simple sound waves that make it up. Once we have those individual sound waves, we add up how much energy is contained in each one.

The end result is a score of how important each frequency range is, from low pitch (i.e. bass notes) to high pitch.

![](https://cdn-images-1.medium.com/max/1200/1*A4CxgdyqYd_nrF3e-7ETWA.png)

If we repeat this process on every 20 millisecond chunk of audio, we end up with a spectrogram. This is what we have done below on one of our audio clips.
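
The chunk-by-chunk procedure can be sketched with plain numpy. This is a simplified illustration, not what librosa does internally: it uses non-overlapping 20 ms chunks, a Hann window, and a synthetic 440 Hz tone, and it skips the mel filter bank entirely.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr                       # one second of audio
signal = np.sin(2 * np.pi * 440.0 * t)       # synthetic 440 Hz tone

frame_len = sr // 50                         # 882 samples = 20 ms
n_frames = len(signal) // frame_len          # 50 chunks in one second
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

# Fourier transform of each windowed chunk, then energy per frequency bin.
window = np.hanning(frame_len)
spectrum = np.fft.rfft(frames * window, axis=1)
power = np.abs(spectrum) ** 2

spectrogram = power.T                        # rows: frequency bins, columns: time
freqs = np.fft.rfftfreq(frame_len, d=1 / sr) # center frequency of each bin (50 Hz apart)
peak_hz = freqs[spectrogram.mean(axis=1).argmax()]

print(spectrogram.shape, peak_hz)
```

The strongest bin lands within one 50 Hz bin of the 440 Hz tone, which is the "fingerprint" idea in its simplest form.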

In [None]:
#EasyDict allows accessing dict values as attributes (works recursively). A JavaScript-like dot notation for Python dicts.
#It is needed for the configuration object used by the helper functions below
# Special thanks to https://github.com/makinacorpus/easydict/blob/master/easydict/__init__.py
class EasyDict(dict):

    def __init__(self, d=None, **kwargs):
        if d is None:
            d = {}
        if kwargs:
            d.update(**kwargs)
        for k, v in d.items():
            setattr(self, k, v)
        # Class attributes
        for k in self.__class__.__dict__.keys():
            if not (k.startswith('__') and k.endswith('__')) and not k in ('update', 'pop'):
                setattr(self, k, getattr(self, k))

    def __setattr__(self, name, value):
        if isinstance(value, (list, tuple)):
            value = [self.__class__(x)
                     if isinstance(x, dict) else x for x in value]
        elif isinstance(value, dict) and not isinstance(value, self.__class__):
            value = self.__class__(value)
        super(EasyDict, self).__setattr__(name, value)
        super(EasyDict, self).__setitem__(name, value)

    __setitem__ = __setattr__

    def update(self, e=None, **f):
        d = e or dict()
        d.update(f)
        for k in d:
            setattr(self, k, d[k])

    def pop(self, k, d=None):
        delattr(self, k)
        return super(EasyDict, self).pop(k, d)

In [None]:
#Thanks to https://github.com/daisukelab/ml-sound-classifier
def read_audio(conf, pathname, trim_long_data):
    y, sr = librosa.load(pathname, sr=conf.sampling_rate) # Loads the audio file as a floating point time series, resampled to conf.sampling_rate if needed
    # trim silence
    if 0 < len(y): # workaround: 0 length causes error
        y, _ = librosa.effects.trim(y) # trim, top_db=default(60)
    # make it unified length to conf.samples
    if len(y) > conf.samples: # long enough
        if trim_long_data:
            y = y[0:0+conf.samples]
    else: # pad blank
        padding = conf.samples - len(y)    # add padding at both ends
        offset = padding // 2
        y = np.pad(y, (offset, conf.samples - len(y) - offset), 'constant')
    return y

def audio_to_melspectrogram(conf, audio):
    spectrogram = librosa.feature.melspectrogram(y=audio,
                                                 sr=conf.sampling_rate,
                                                 n_mels=conf.n_mels,
                                                 hop_length=conf.hop_length,
                                                 n_fft=conf.n_fft,
                                                 fmin=conf.fmin,
                                                 fmax=conf.fmax)
    spectrogram = librosa.power_to_db(spectrogram)
    spectrogram = spectrogram.astype(np.float32) # Returns a 128 x L array holding the spectrogram of the sound (L = 1 + number of samples // hop_length, i.e. 128 for a 2 s clip)
    return spectrogram

def melspectrogram_to_delta(mels):
    return librosa.feature.delta(mels)

def show_melspectrogram(conf, mels, title='Log-frequency power spectrogram'):
    librosa.display.specshow(mels, x_axis='time', y_axis='mel', 
                             sr=conf.sampling_rate, hop_length=conf.hop_length,
                            fmin=conf.fmin, fmax=conf.fmax)
    plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.show()

def read_as_melspectrogram(conf, pathname, trim_long_data, debug_display=False):
    x = read_audio(conf, pathname, trim_long_data)
    mels = audio_to_melspectrogram(conf, x)
    if debug_display:
        delta = melspectrogram_to_delta(mels)
        delta_squared = melspectrogram_to_delta(delta)
        IPython.display.display(IPython.display.Audio(x, rate=conf.sampling_rate))
        show_melspectrogram(conf, mels)
        show_melspectrogram(conf, delta)
        show_melspectrogram(conf, delta_squared)
    return mels

conf = EasyDict()
conf.sampling_rate = 44100
conf.duration = 2
conf.hop_length = 347 * conf.duration # to make time steps 128
conf.fmin = 20
conf.fmax = conf.sampling_rate // 2
conf.n_mels = 128
conf.n_fft = conf.n_mels * 20
conf.samples = conf.sampling_rate * conf.duration

In [None]:
# example
path = '../input/train_curated/0006ae4e.wav'
x = read_audio(conf, path, trim_long_data=False)
print(x)
print('_'*40)
print(audio_to_melspectrogram(conf, x))
print(audio_to_melspectrogram(conf, x).shape)
x1 = read_as_melspectrogram(conf, path, trim_long_data=False, debug_display=True)

* The first array corresponds to our **sampled audio file**. Every value is the amplitude of the sound wave (a unitless float, not a frequency in Hz) at one sampling instant, with 44,100 values per second.
* The second array corresponds to the **spectrogram of our audio file**. It is a 2-D array with one row per mel frequency band and one column per time frame.
    * First, we cut our sampled audio file into short overlapping pieces. The 20 ms figure used above (44100/50 = 882 values per piece) is illustrative: with the configuration in this notebook, each piece is `n_fft` = 2560 samples long and a new piece starts every `hop_length` = 694 samples.
    * Then, we perform a Fourier transform on each piece. It breaks apart the complex sound wave into the simple sound waves that make it up.
    * Once we have those individual sound waves, we add up how much energy falls into each of the 128 mel bands. This gives the first column of our array.
    * Last, we do the same thing for every other piece, which fills in the remaining columns of our spectrogram.
* The last item is a **representation of our spectrogram**. Each column of the previous array is displayed as a colored vertical bar. The brighter the color, the more energy that frequency band carries at that moment.
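
The timing implied by the configuration can be worked out directly. The sketch below repeats the `conf` values from the cell above as plain variables and assumes librosa's default centered framing, under which the number of frames is `1 + samples // hop_length`.

```python
# Values copied from the conf object defined above.
sampling_rate = 44100
duration = 2
hop_length = 347 * duration          # 694 samples between frame starts
n_mels = 128
n_fft = n_mels * 20                  # 2560 samples per analysis window
samples = sampling_rate * duration   # 88200 samples in a 2 s clip

hop_ms = 1000 * hop_length / sampling_rate     # time between two frames
window_ms = 1000 * n_fft / sampling_rate       # length of one analysis window
n_frames = 1 + samples // hop_length           # assumes centered frames (librosa default)

print(f"hop: {hop_ms:.1f} ms, window: {window_ms:.1f} ms, frames: {n_frames}")
```

So a 2 s clip becomes a 128 x 128 array, with a new column roughly every 16 ms and each column summarizing about 58 ms of audio.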


### **<div id="IV3">3. Transforming sound into images</div>**


Now that we can convert our sounds into spectrograms, we want to be able to use them. The next step is to convert our spectrograms into images. Convolutional Neural Networks (CNNs) are very powerful models for image recognition, and they also give very good results for audio recognition; but before using one, we need to actually transform our sounds into images, using these spectrograms.
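
A minimal numpy sketch of this conversion is shown here, under stated assumptions: the spectrogram is random stand-in data, and `np.gradient` is used as a simple substitute for `librosa.feature.delta`, so the numbers will differ from the real pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a 128 x 128 mel spectrogram in dB (random values, not real audio).
mels = rng.normal(loc=-40.0, scale=10.0, size=(128, 128)).astype(np.float32)

# Rate of change along time, and its own rate of change.
delta = np.gradient(mels, axis=1)
delta2 = np.gradient(delta, axis=1)

# Three channels, like the R, G and B planes of a color image.
stacked = np.stack([mels, delta, delta2], axis=-1)

# Standardize, then rescale to the 0..255 range of an 8-bit image.
standardized = (stacked - stacked.mean()) / (stacked.std() + 1e-6)
lo, hi = standardized.min(), standardized.max()
image = ((standardized - lo) / (hi - lo) * 255).astype(np.uint8)

print(image.shape, image.dtype, image.min(), image.max())
```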

In [None]:
"""
The mels_preprocessing function takes as input the mel spectrogram of our sound together with its delta and delta-delta (2-D arrays, see above).
It stacks the three arrays along a last axis, so that the result has the same shape as a classic RGB image.
Then it standardizes the array (shifts and scales it so that its mean is 0 and its variance is 1). This improves performance.
Then it rescales each value to the range 0 to 255, like an 8-bit image channel.
"""

def mels_preprocessing(X1, X2, X3, mean=None, std=None, norm_max=None, norm_min=None, eps=1e-6):
    # Stack the three inputs as channels: [X1, X2, X3]
    X = np.stack([X1, X2, X3], axis=-1)

    # Standardize (explicit None checks so that a caller-supplied 0 is respected)
    mean = X.mean() if mean is None else mean
    std = X.std() if std is None else std
    # Standardization. Xstd has 0 mean and 1 variance
    Xstd = (X - mean) / (std + eps)
    _min, _max = Xstd.min(), Xstd.max()
    norm_max = _max if norm_max is None else norm_max
    norm_min = _min if norm_min is None else norm_min
    if (_max - _min) > eps:
        # Scale to [0, 255]
        V = Xstd
        V[V < norm_min] = norm_min
        V[V > norm_max] = norm_max
        V = 255 * (V - norm_min) / (norm_max - norm_min)
        V = V.astype(np.uint8)
    else:
        # Just zero
        V = np.zeros_like(Xstd, dtype=np.uint8)
    return V

def convert_wav_to_image(df, source, img_dest):
    X = []
    for i, row in tqdm_notebook(df.iterrows()):
        x1 = read_as_melspectrogram(conf, source/str(row.fname), trim_long_data=False)
        x2 = melspectrogram_to_delta(x1)
        x3 = melspectrogram_to_delta(x2)
        x_preprocessed = mels_preprocessing(x1, x2, x3)
        X.append(x_preprocessed)
    return df, X

In [None]:
training_curated_df, X_train_curated = convert_wav_to_image(training_curated_df, source=Path('../input/train_curated'), img_dest=Path('trn_curated'))
testing_df, X_test = convert_wav_to_image(testing_df, source=Path('../input/test'), img_dest=Path('test'))

print(f"Finished data conversion at {(time.time()-start_time)/3600} hours")

In [None]:
for i in range(0,6):
    a = np.asarray(X_train_curated[i:i+1])
    a = np.squeeze(a)
    print(a.shape)
    plt.imshow(a)
    plt.show()

The arrays above hold our spectrograms as three-channel images, which `plt.imshow` renders as RGB.
* Each 1x3 list represents **one pixel of our image**. Its three values (between 0 and 255) are read as the Red, Green and Blue values of the pixel (see image below). In our case, they hold the mel spectrogram value (in dB), its delta, and its delta-delta for a particular frequency at a particular time. The three values generally differ, so the pixel is usually colored rather than a shade of gray.
* Each list of 1x3 lists represents **one horizontal line of our image**, i.e. one frequency band over time.
* **The whole array is our sound**, represented as a **three-channel image**.

![image.png](attachment:image.png)

### **<div id="IV4">4. Normalizing images and performing data augmentation</div>**

Now that we have transformed our sound into images, we want them to have the same scale (for example, 128x128), for training purposes.
We will also perform **data augmentation**.

![test](https://cdn-images-1.medium.com/max/800/1*C8hNiOqur4OJyEZmC7OnzQ.png)

Data augmentation consists of making minor alterations to our existing dataset, such as flips, translations or rotations. Our neural network treats the altered copies as distinct images.
A convolutional neural network that can robustly classify objects even when they are placed in different orientations is said to have the property called invariance. More specifically, a CNN can be invariant to translation, viewpoint, size or illumination (or a combination of the above).

This is essentially the premise of data augmentation. Augmentation can also help with a large dataset, because it increases the amount of relevant data in it. This is related to the way neural networks learn.

Of course, not every change suits every type of data. For example, in our problem, we may not want to flip or rotate our image, since it would alter our sound in a bad way. But we can, for example, slightly increase the brightness of the image (which roughly corresponds to a louder sound), or translate it.
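
The two "safe" alterations mentioned above can be sketched in numpy. This is an illustration only: the helper names are hypothetical, the input is random stand-in data, and the shift wraps around the edges, which is not how fastai's own transforms behave.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for one 128 x 128 three-channel spectrogram image.
image = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)

def shift_in_time(img, max_shift, rng):
    """Translate the image along the time axis (axis 1), wrapping around."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(img, shift, axis=1)

def change_brightness(img, max_change, rng):
    """Scale all pixel values by a random factor close to 1."""
    factor = 1.0 + rng.uniform(-max_change, max_change)
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

augmented = change_brightness(shift_in_time(image, 16, rng), 0.1, rng)
print(augmented.shape, augmented.dtype)
```

Neither helper flips or rotates the image, so the frequency axis keeps its meaning.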


In [None]:
CUR_X_FILES, CUR_X = list(training_curated_df.fname.values), X_train_curated

def open_fat2019_image(fn, convert_mode, after_open)->Image:
    # open
    idx = CUR_X_FILES.index(fn.split('/')[-1])
    x = PIL.Image.fromarray(CUR_X[idx])
    # crop
    time_dim, base_dim = x.size
    crop_x = 0
    #crop_x = random.randint(0, time_dim - base_dim)
    x = x.crop([crop_x, 0, crop_x+base_dim, base_dim])    
    # standardize
    return Image(pil2tensor(x, np.float32).div_(255))

vision.data.open_image = open_fat2019_image

In [None]:
#Batch size --> How many images are trained at one time. Lower it if you run out of memory
bs = 64
#Image size. Square images make the learning process faster. We can increase the size of the images once our model is stable, in order to improve accuracy.
size = 128

#Performing data augmentation
tfms = get_transforms(do_flip=False, max_rotate=0, max_lighting=0.1, max_zoom=0, max_warp=0.)

#We put the transformed image data into /kaggle/working because ../input is a read-only directory.
src = (ImageList.from_csv('/kaggle/working', '../input/train_curated.csv', folder='../input/train_curated')
       .split_by_rand_pct(0.2).label_from_df(label_delim=','))

#Creates a databunch, because our cnn_learner below needs a databunch.
data = src.transform(tfms, size=size).databunch(bs=bs).normalize()

In [None]:
data.show_batch(4)

## **<div id="V">V. Model data</div>**

Now we will start training our model. We will use a **convolutional neural network** backbone and a fully connected head with a single hidden layer as a classifier. Our model takes images as input and will output the predicted probability for each of the categories.

We need to feed our learner with a databunch and an architecture. **resnet34** is a very good architecture to get started with, and we can use **resnet50** to try getting better results once we are happy with our model; the cell below uses the lighter **resnet18**. If we run out of memory with a larger architecture, we can try to lower *bs* (batch size, how many images are trained at one time). Computing time will get a little bit longer, but we won't run out of memory anymore.

We have to set *pretrained* to *False*, as pretrained models are forbidden in this competition. For our metrics, we use *lwlrap* as it is the metric used in the competition.

We use **lr_find** to find a good learning rate for our model. The loss seems to start diverging when lr > 0.1, so we choose a value ten times lower, i.e. **lr = 0.01**.

We will train for 5 epochs (5 cycles through all our data).
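
The rule of thumb used to read the learning rate plot can be sketched as follows. The loss curve here is synthetic and shaped by hand to bottom out near lr = 0.1; it is not a result from this notebook, and dividing by ten is a heuristic rather than a fastai API.

```python
import numpy as np

# Learning rates tested on a log scale, from 1e-6 to 1.
log_lrs = np.linspace(-6, 0, 61)
lrs = 10.0 ** log_lrs

# Synthetic loss curve (illustrative only): slowly improving, then diverging past 1e-1.
losses = 0.7 - 0.02 * (log_lrs + 6) + 5.0 * np.clip(log_lrs + 1, 0, None) ** 2

# Heuristic: locate where the loss stops improving, then back off by a factor of ten.
lr_at_min = lrs[np.argmin(losses)]
suggested_lr = lr_at_min / 10

print(f"loss bottoms out near lr={lr_at_min:.3g}, suggested lr={suggested_lr:.3g}")
```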

In [None]:
arch = models.resnet18

learn = cnn_learner(data, arch, pretrained=False, metrics=[lwlrap], wd = 0.1, ps = 0.5)

learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(5, 1e-2)
learn.save('first-attempt-128')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(1)

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit(20, slice(2e-3, 2e-4))
learn.save('second-attempt-128')

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit(20, slice(1e-3, 1e-4))
learn.save('third-attempt-128')

## **<div id="VI">VI. New learner with increased size of images</div>**

Now that we have a pretty good learner fed with 128x128 images, let's: 
* Create a new databunch full of 256x256 images, 
* Keep the same learner as before ('third-attempt-128'),
* Replace the data inside the learner with our new 256x256 data,
* Freeze the model again (in order to train only the last few layers).

In [None]:
size = 256

#Creates a databunch, because our cnn_learner below needs a databunch.
data = src.transform(tfms, size=size).databunch(bs=bs).normalize(imagenet_stats)

In [None]:
#Replace with 256x256 databunch
learn.data = data
#Freeze the model
learn.freeze()
#Plot lr_find()
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit(5, 3e-3)
learn.save('first-attempt-256')

In [None]:
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit(7, slice(1e-3, 1e-4))
learn.save('second-attempt-256')

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.fit_one_cycle(20, slice(5e-4, 5e-5), callbacks=[SaveModelCallback(learn, monitor='lwlrap', mode='max')])

In [None]:
learn.export()

In [None]:
CUR_X_FILES, CUR_X = list(testing_df.fname.values), X_test

test = ImageList.from_csv(Path('/kaggle/working'), Path('../input/sample_submission.csv'), folder=Path('../input/test'))
learn = load_learner(Path('/kaggle/working'), test=test)
preds, _ = learn.get_preds(ds_type=DatasetType.Test)

testing_df[learn.data.classes] = preds
testing_df.to_csv('submission.csv', index=False)
testing_df.head()