# fastai 2019: notes

## Data
- in fastai, the following applies (this is mainly how kaggle does it too)
- `train` training data with labels
- `valid` a validation set for assessing and testing. Valid DS has labels
- `test` a test data set which has **NO** labels

## Gradient Descent
- will take our prediction and try to make it better using the `intercept` and `slope`
- for each "movement", calculate the loss, if it is better we keep that
- we do this by calculating the `derivative` (ie calculating the gradient)

**Gradient Descent** is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves towards a set of parameter values that minimizes the function. This is done by taking steps in the negative direction of the function gradient.
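
A minimal sketch of that loop, assuming a straight-line model `y = intercept + slope * x` and MSE as the function being minimized. The function name `fit_line`, the toy data and the `lr`/`steps` values are made up for illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = intercept + slope * x by gradient descent on the MSE."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # error of the current prediction for every point
        errs = [(intercept + slope * x) - y for x, y in zip(xs, ys)]
        # derivative of the MSE with respect to each parameter
        d_intercept = 2 * sum(errs) / n
        d_slope = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        # step in the negative direction of the gradient
        intercept -= lr * d_intercept
        slope -= lr * d_slope
    return intercept, slope

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
intercept, slope = fit_line(xs, ys)
```

On this toy data the parameters end up very close to `intercept = 1` and `slope = 2`.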

## Weight Decay
- is a type of regularization
- it penalizes large weights so the model cannot simply memorize the data it is training on (throwing away a certain % of activations in each layer is dropout, which is a different regularizer)
- Weight decay is
    - where we take the loss function and add the sum of the squared parameters multiplied by some number (`wd`)
    - `wd` should be `0.1`
    - according to J Howard. You should try using this when you train
- $w_t = w_{t-1} - lr \times \frac{dL}{dw_{t-1}}$
    - update the weights $w$ at time step $t$
    - this means we take our weights at the previous time, $w_{t-1}$
    - minus the learning rate $lr$ times
    - the derivative of our loss $L$ with respect to our weights $w_{t-1}$
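
A small sketch of one update step, assuming the weight decay penalty is `wd * sum(w**2)`, so that its derivative adds `2 * wd * w` to each gradient. The function name `sgd_step` and the numbers are hypothetical.

```python
def sgd_step(w, grad, lr, wd=0.0):
    """Return updated weights: w - lr * (dL/dw + 2 * wd * w)."""
    return [wi - lr * (gi + 2 * wd * wi) for wi, gi in zip(w, grad)]

w = [1.0, -2.0]
grad = [0.5, 0.5]  # pretend gradient of the loss without the penalty
plain = sgd_step(w, grad, lr=0.1)            # no weight decay
decayed = sgd_step(w, grad, lr=0.1, wd=0.1)  # weights pulled towards zero
```

With weight decay switched on, both weights finish closer to zero than with the plain step.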

## Momentum
- `0.9` is really common
- momentum is what affects our step size when we are exploring the weight space
- it is the exponentially weighted moving average of the gradient
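
A sketch of the moving average, assuming the form `avg = beta * avg + (1 - beta) * grad` with `beta = 0.9`. Library optimizers may scale the terms differently, and `momentum_steps` is a made-up name.

```python
def momentum_steps(grads, lr=0.1, beta=0.9):
    """Return the step taken for each gradient, using a moving average of gradients."""
    avg = 0.0
    steps = []
    for g in grads:
        # keep 0.9 of the previous average, add 0.1 of the new gradient
        avg = beta * avg + (1 - beta) * g
        steps.append(lr * avg)
    return steps

steps = momentum_steps([1.0, 1.0, 1.0])
```

When the gradient keeps pointing the same way, the step grows from one iteration to the next.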

## RMS Prop
- similar to momentum but is..
    - exponentially weighted moving average of the gradient squared
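
A sketch of one RMS Prop update for a single weight, assuming the step is the gradient divided by the square root of the moving average of squared gradients. `rmsprop_step`, `alpha` and `eps` are illustrative names and values.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.9, eps=1e-8):
    """Update one weight and its moving average of squared gradients."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    # large recent gradients shrink the step, small ones enlarge it
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg

w, sq_avg = rmsprop_step(w=1.0, grad=2.0, sq_avg=0.0)
```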

## Adam
- does **both** momentum and RMS Prop
- keep track of
    - exponentially weighted moving average of the gradient squared
    - exponentially weighted moving average of my steps
    - then divide the averaged step by the square root of the exponentially weighted moving average of the squared terms
    - with momentum `0.9`, take .9 of a step in the same direction as last time
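
A sketch of one Adam update for a single weight, combining the moving average of the gradient (momentum) with the moving average of the squared gradient (as in RMS Prop). The bias correction lines come from the usual description of Adam rather than from these notes, and `adam_step` and its defaults are illustrative.

```python
import math

def adam_step(w, grad, avg, sq_avg, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    avg = beta1 * avg + (1 - beta1) * grad              # momentum part
    sq_avg = beta2 * sq_avg + (1 - beta2) * grad ** 2   # RMS Prop part
    # bias correction (assumption: not mentioned in the notes)
    avg_hat = avg / (1 - beta1 ** t)
    sq_hat = sq_avg / (1 - beta2 ** t)
    w = w - lr * avg_hat / (math.sqrt(sq_hat) + eps)
    return w, avg, sq_avg

w, avg, sq_avg = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, avg, sq_avg = adam_step(w, grad=2.0, avg=avg, sq_avg=sq_avg, t=t)
```

With a constant gradient each step is close to `lr`, so three steps move the weight by roughly `0.003`.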


### Image regression
- different to classification
    - in classification problems, we predict a discrete variable (ie category)
    - in regression, we are trying to predict continuous variables
- look at the 'heads' data set
    - [lesson 3](https://www.youtube.com/watch?v=MpZxV6DVsmM&t=619s)
    - it is predicting a set of coordinates on an image
    
    
## Loss
- is some function of our independent variables and our weights
- $L(x,w) = mse(\hat{y}, y)$
- we used MSE for eg
    - between predictions `y_hat` and actuals `y`
    - predictions come from running a model on the inputs `x`, and the model contains some weights `w`
    - this creates `a.graf` 
    - then we add weight decay
- $L(x,w) = mse(\hat{y}, y) + wd \times \sum{w}^2$
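
The formula above written out as a sketch in plain Python. `loss_with_wd` and the example numbers are hypothetical.

```python
def mse(preds, targets):
    """Mean squared error between predictions and actuals."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def loss_with_wd(preds, targets, weights, wd=0.1):
    """MSE plus wd times the sum of squared weights."""
    return mse(preds, targets) + wd * sum(w ** 2 for w in weights)

loss = loss_with_wd(preds=[1.0, 2.0], targets=[1.0, 4.0], weights=[3.0, -1.0])
```

Here the MSE part is `2.0` and the penalty is `0.1 * (9 + 1) = 1.0`.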

### cross entropy loss
- this is the loss function you want when doing
    - single label, multi-class classification
- for classification, we need a loss function where
    - predicting the right thing confidently should have low loss
    - predicting the wrong thing confidently should have high loss
- minus the sum of the one hot encoded targets times the log of your (softmaxed) activations
- requires `softmax` which says
    - e to the activations, div sum of e to the activations
    - it guarantees that
    - all activations sum to 1
    - all activations are > 0
    - all activations are less than 1
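
A sketch of `softmax` and cross entropy for a single example, assuming the target is given as a class index (which is equivalent to picking the one hot encoded position). Function names are illustrative.

```python
import math

def softmax(acts):
    """e to each activation, divided by the sum of e to the activations."""
    m = max(acts)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(a - m) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(acts, target_idx):
    """Negative log of the probability given to the correct class."""
    return -math.log(softmax(acts)[target_idx])

probs = softmax([2.0, 1.0, 0.1])
confident_right = cross_entropy([5.0, 0.0, 0.0], target_idx=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target_idx=1)
```

The confident and correct prediction gets a loss near zero, while the confident and wrong one gets a loss of about 5.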


### Loss functions to use
- Classification
    - cross entropy loss
- Regression
    - MSE


### data augmentation
- called by `get_transforms` in fastai
- particularly important to image data as the transforms are things like brightness, flipping, skewing/perspective warping, padding (zeros, border, reflection)
    - "reflection is nearly always better" J Howard
- these are transformations that your preprocessing will add to your data.
- the purpose is to "teach" the model that a cat is still a cat even if the image is too bright/dark/blurry
- you should assess that data to see whether the data requires augmentation
- data augmentation creates more versions of each individual image, thereby increasing the size of your dataset, providing you with more training data
- every time you grab something, fastai randomly transforms it so potentially every image will look a little bit different. You can see this by plotting something a few times (check 2019 lesson 7 video around 10:30)


# Convolutional Neural Networks
- like the neural nets we have seen before, so doing matrix multiplication but a slightly different type of matrix multiplication

### Convolutional Kernel
- explained well [here](https://setosa.io/ev/image-kernels/)
- as the kernel passes over the image, the resulting mat mul and addition is creating a negative image
- all a convolution can do is find edges and gradients
- each layer takes the results of the previous to create more complex shapes (see Zeiler and Fergus visualizing layers of nets)
- each output is the result of a linear equation
- convolutions can be implemented with matrix multiplication but we generally don't do it because it is slow
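
A sketch of a kernel passing over a tiny single-channel image, assuming no padding and stride 1. The image and kernel values are made up, and `conv2d` here is a plain Python stand-in rather than a library function.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of rows), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # element wise multiply the patch by the kernel, then add up
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# dark (0) in the top half, bright (1) in the bottom half
image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
# responds where the image goes from dark above to bright below
edge_kernel = [[-1, -1, -1],
               [ 0,  0,  0],
               [ 1,  1,  1]]
edges = conv2d(image, edge_kernel)
```

The output is zero over the flat regions and positive only where the kernel straddles the dark to bright boundary.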

### rank three tensor and kernels
- think of a cube
- think of a colour image as having 3 channels (R,G,B) - rank 3 tensor
- the kernel now becomes a rank 3 kernel (3x3x3 kernel)
- we now do an element wise mult of 27 things instead of 9
- we then add all 27 together to end up with one number
- there need to be 3 kernels to create rank 3 tensor as an output
    - however we get to choose how ever many kernels we need
    - often 16 in the first layer
    - these will create 16 channels representing
        - how much left edge, top edge, gradient, blue etc etc
- this is repeated many times
    - we want to have more and more channels as we go deeper into the network
    - this creates memory issues
    - to avoid this we use a kernel that skips over pixels
    - **called STRIDE 2** convolution
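
A sketch of the 27-element multiply for one 3x3x3 patch, and of how 16 kernels give 16 output channels. The patch and kernel values are placeholders, and `apply_kernel` is a hypothetical helper.

```python
def apply_kernel(patch, kernel):
    """patch and kernel are 3x3x3 nested lists indexed (channel, row, col)."""
    products = [patch[c][i][j] * kernel[c][i][j]
                for c in range(3) for i in range(3) for j in range(3)]
    # 27 products are added together to give one number
    return len(products), sum(products)

patch = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]   # all-ones RGB patch
kernel = [[[0.5] * 3 for _ in range(3)] for _ in range(3)]  # placeholder weights
n_products, activation = apply_kernel(patch, kernel)

# one kernel gives one output channel, so 16 kernels give 16 channels
kernels = [kernel] * 16
channels = [apply_kernel(patch, k)[1] for k in kernels]
```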
    

### weight tying
- when you have multiple things with the same weight it's called weight tying

### kernel size in model
- generally speaking we start with a larger kernel first which then reduces
- stride size will reduce the image size ie image of 224 will become 112x112 with a stride 2 conv
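
The size reduction follows the usual convolution output size arithmetic. A sketch, assuming a 3x3 kernel with padding 1 (the kernel size and padding are assumptions, the notes only give the stride):

```python
def conv_out_size(size, kernel_size=3, stride=1, padding=1):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel_size) // stride + 1

same = conv_out_size(224, stride=1)    # stride 1 keeps the size
halved = conv_out_size(224, stride=2)  # stride 2 halves it
```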

### how to get to final output
- for every channel in the final output, we take an average, which will give us a vector of x length
- then we pop through a single matmul of vector of size x by the number of categories 
- this is called average pooling
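
A sketch of those two steps with 2 channels and 3 categories. The feature maps and the weight matrix are made-up numbers and `avg_pool_head` is a hypothetical name.

```python
def avg_pool_head(feature_maps, weights):
    """feature_maps: list of 2D channels. weights: channels x categories."""
    # average each channel down to one number
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature_maps]
    # single matmul: vector of size x by the number of categories
    n_cat = len(weights[0])
    scores = [sum(pooled[c] * weights[c][k] for c in range(len(pooled)))
              for k in range(n_cat)]
    return pooled, scores

feature_maps = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4
                [[2.0, 2.0], [2.0, 2.0]]]   # channel 1, mean 2
weights = [[1.0, 0.0, 0.5],
           [0.0, 1.0, 0.5]]
pooled, scores = avg_pool_head(feature_maps, weights)
```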


# Simple CNN
- sequential layers
- conv2d

# ResNet
- adds skip connections to sequential architecture

# DenseNet
- like resnet but instead of + x it concatenates
- see lesson 7.
- not too clear on this yet
- dense blocks get bigger and bigger but the original layer features are still there
- these nets are very memory intensive because of this
- though they do have fewer parameters
- they work really well for small data sets and for segmentation
    - maybe for generation too??
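
A toy sketch of the difference between the two blocks, using plain lists as stand-ins for feature channels. `layer` is a placeholder for a real conv layer, not anything from fastai.

```python
def layer(x):
    """Stand-in for a conv layer: here it just doubles every feature."""
    return [2 * v for v in x]

def res_block(x):
    # ResNet: add the input back on, output has the same size as the input
    return [a + b for a, b in zip(layer(x), x)]

def dense_block(x):
    # DenseNet: concatenate, so the original features are kept and the output grows
    return layer(x) + x

x = [1.0, 2.0]
res_out = res_block(x)
dense_out = dense_block(x)
```

The concatenated output is twice as long as the input, which is where the memory cost comes from as blocks are stacked.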

# U-Net
- 




## Tabular Data
- what architecture is this? It's not a `CNN` so maybe `RNN` or a linear model??
- you need to specify your categorical and continuous variables
- if this is a regression problem, ie your dependent variable is continuous then you need to... **this is discussed somewhere** **add notes here**
- `Normalization`
    - takes continuous variables, subtracts their mean and divides by their standard deviation (so each ends up with mean 0 and standard deviation 1)
- whatever you do to training, you have to do to validation set (re: pre-processing)
- `layers=[200,100]`
    - this is the embedding size of the last two layers??
- Time Series Tabular Data
    - generally **don't use RNN** for time series tabular
    - add additional categorical variables for date columns
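
A sketch of the normalization step for one continuous column, assuming the mean and standard deviation are computed on the training set and then reused unchanged on the validation set. Function names and data are illustrative.

```python
import statistics

def fit_normalizer(train_col):
    """Compute the statistics from the training data only."""
    return statistics.mean(train_col), statistics.pstdev(train_col)

def normalize(col, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in col]

train = [10.0, 20.0, 30.0]
valid = [20.0, 40.0]
mean, std = fit_normalizer(train)
train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)  # same mean/std as training
```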
    
    


## Collaborative filtering
- **linear model**
    - it's basically a regression
    - which means we only have one layer so no point with discriminative `learning rates`
        - so for fit, just pass in one `lr`
- `n_factors=50`
    - this is the width of the embedding size
    - factors is what they call the term in this collab filtering domain.
- `min_score` & `max_score`
    - the min and max of the 'ratings'
    - **replaced by `y_range`**
    - these are the bounds where the `sigmoid` function will truncate
        - we need to go a bit above the max "rating" number so that the actual number can be reached.
        - ie if you have ratings from 1-5, you would pick a `y_range=[0, 5.5]`
        - this is a way of improving the network by limiting the range. We want it to be as good as possible at predicting scores between 0-5 so no use allowing any numbers above ~5.
- embedding matrix
    - these are vectors with a bias term added on
    - the bias term is like an overall score for each "movie/product"
    - the bias term is a way to say some products/movies are better than others so it's not surprising that they are liked more.
    - **an embedding means, look something up in an array**
        - this is the same as doing a matrix product by a one hot encoded matrix
        - embedding is a memory efficient way of doing the multiplication
- latent factors or features
    - these are the hidden features that are revealed through training our model
    - the bias term is a weight that basically gives better items more weight, worse items less weight
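
A sketch of a dot-product model with bias terms and a `y_range`, assuming the prediction is a sigmoid rescaled to that range. The embedding values, `sigmoid_range` and `predict` are illustrative and not fastai's own code. The last lines check that an array lookup gives the same result as multiplying by a one hot vector.

```python
import math

def sigmoid_range(x, lo, hi):
    """Squash x into the interval (lo, hi)."""
    return lo + (hi - lo) / (1 + math.exp(-x))

def predict(user_id, item_id, user_emb, item_emb, user_bias, item_bias,
            y_range=(0, 5.5)):
    u, m = user_emb[user_id], item_emb[item_id]  # embedding = array lookup
    dot = sum(a * b for a, b in zip(u, m))
    return sigmoid_range(dot + user_bias[user_id] + item_bias[item_id], *y_range)

user_emb = [[0.1, 0.2], [1.0, -1.0]]  # 2 users, 2 factors each
item_emb = [[0.3, 0.4], [2.0, 2.0]]   # 2 items, 2 factors each
user_bias = [0.0, 0.1]
item_bias = [0.5, -0.5]
rating = predict(1, 0, user_emb, item_emb, user_bias, item_bias)

# lookup of row 1 equals a one hot vector times the embedding matrix
one_hot = [0, 1]
via_matmul = [sum(one_hot[r] * user_emb[r][c] for r in range(2)) for c in range(2)]
```

Whatever the raw score is, the prediction stays strictly between 0 and 5.5.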
    


-----------------
# Classification
### loss functions, regularization and activations






------------------

# Reading
- [neuralnetworks and deep learning](http://neuralnetworksanddeeplearning.com/)
- [chroma](https://en.wikipedia.org/wiki/Chroma_feature)
- matrix products 
    - be familiar with the output of a matrix of size x * size y = size ?
    
-----------------

# training notes
- `lr` of `3e-3` generally works well for the first round of training before unfreezing
- then for the second round pass a `slice`: for the second part of the slice use 10x lower than the first round `lr`, and for the first part use whatever `lr_finder` found
    - `learn.fit_one_cycle(4, 3e-3)`
    - `learn.unfreeze()`
    - `learn.fit_one_cycle(4, slice("lr_finder number",3e-4))`
- `learn.recorder.plot_losses()` will show you the loss plotted out. 
    - You **want** to see something that goes down, then increases a bit then goes down again. That is a good sign
    - ![lr_good](../img/good_loss.png)
    - if it is ALWAYS going down, then you can bump your learning rate up a bit.
- if you are overfitting
    - add more `wd`


## `Learner`
- we can pass in `data`, `model`, `metrics`, `loss function`
- it is a convenience function for us

## `fit_one_cycle`
- we use (in fastai) something like `Adam` by default
- fit one cycle implements
    - discriminative learning rate and learning rate annealing
        - start with a low `lr` when exploring the weight space, then increase it
        - then decrease the `lr` again after about half way, towards the end
    - as `lr` increases, `momentum` decreases, then towards the end, `lr` decreases, `momentum` increases
- 
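
A rough sketch of that schedule, assuming simple linear ramps and a turning point at half way to match the notes. The library's own curve shape and default values may differ. `one_cycle`, `div`, `moms` and `pct_start` are illustrative names and values.

```python
def one_cycle(step, total_steps, max_lr=3e-3, div=25.0, moms=(0.95, 0.85),
              pct_start=0.5):
    """Return (lr, momentum) at `step`, using linear ramps (a simplification)."""
    warm = int(total_steps * pct_start)
    low_lr = max_lr / div
    if step < warm:
        p = step / warm  # goes 0 -> 1 while lr rises
        lr = low_lr + p * (max_lr - low_lr)
        mom = moms[0] + p * (moms[1] - moms[0])
    else:
        p = (step - warm) / (total_steps - warm)  # goes 0 -> 1 while lr falls
        lr = max_lr - p * (max_lr - low_lr)
        mom = moms[1] + p * (moms[0] - moms[1])
    return lr, mom

schedule = [one_cycle(s, 100) for s in range(101)]
```

Plotting the two columns of `schedule` shows `lr` rising then falling while `momentum` does the opposite.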


## over and underfitting
- **training loss should always be lower than validation loss**
- **`lr` too high**
    - validation loss will be very high
        - lower the `lr`
- **`lr` too low**
    - error_rate will reduce but very very slowly
    - increase `lr` a bit
    - **training loss will be higher than validation loss**
        - you never want this
        - this means you are **underfitting**
        - num epochs too low or `lr` too low
        - see [48:50](https://www.youtube.com/watch?v=ccMHJeQU4Qw&t=3219s) in lesson 2 video
- **too few epochs**
        - this looks similar to low `lr`
        - so try more epochs first
        - then if `lr` goes over the top, lower it
- **too many epochs**
    - overfitting
    - it is really hard to overfit
    - how to tell
        - **error rate improves for a while, then gets worse again**

    
    
---------  

# Terms
- Learning rate
    - is the thing we mult gradient by to decide how much we update weights by
- Epoch
    - one complete run through all data points
- Minibatch
    - random bunch of points to update weights
- SGD
    - stochastic gradient descent
- Model/Architecture
    - function we are fitting the parameters to
- Parameters / coefficients / weights
    - the numbers we are updating 
- Affine function
    - linear function
    - multiply things together then add them up
- Loss Function
    - how far away or how close you are to the correct answer
- ReLU
    - is a "filter" where any number below 0 is cut off and set to 0
- **Activations**
    - numbers
    - are the result of either a matrix multiply or an activation function (such as ReLU)
    - activation functions are sometimes called nonlinearities
- **Parameters/Weights**
    - numbers inside the weight matrices that we multiply by
    - that are stored to make a calculation
    - this is what the model learns
    - we use gradient descent on the parameters to update them
    - `parameters -= lr * parameters.grad`
- Layers
    - everything in the network that does a calculation
    - every layer results in a set of activations
    - Start layer
        - input layer
    - End layer
        - output (final set of activations)
- Back Propagation
    - the process of calculating the gradients of the loss with respect to the parameters, which gradient descent then uses to update them
- Fine Tuning
    - Resnet34 was trained on imagenet so the final weight matrix is of len 1000 because you need to predict 1000 categories. We generally don't need to do that so that final set of weights is thrown away and replaced by 2 new weight matrices with a ReLU in between. 
    - these originally have random numbers in them
    - this is what we train first while the start layers are frozen
    - this ensures we don't back propagate the gradients into the initial layers
        - this must be why when you unfreeze then re-train, you get worse before you get better. Makes sense!
    - fastai by default splits the model into different sections and applies different learning rates to each part. This is because we don't need to train the early layers by much. So those weights won't be trained a lot.
    - this is called using **discriminative learning rates** see Leslie Smith
    - after unfreezing you can call
        - `fit(epochs=1, max_lr=1e-3)`
            - single lr throughout
        - `fit(epochs=1, max_lr=slice(1e-3))`
            - evenly split `lr` between layers based on divisions of 3 (ie 1e-3/3)
        - `fit(epochs=1, max_lr=slice(1e-5,1e-3))`
            - will apply 1e-5 to start group, then 1e-4 for middle group then 1e-3 for last layer group
- `fit_one_cycle`
    - `epoch` and `cyc_len` are the same
    - both represent the number of times you scan through your items
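
For the `slice(1e-5,1e-3)` case under Fine Tuning above, the three learning rates are spaced by a constant multiple. A sketch of that spacing, where `spread_lrs` is a hypothetical helper and not fastai's own function:

```python
def spread_lrs(start, stop, n_groups):
    """n_groups learning rates from start to stop, each a constant multiple of the last."""
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]

lrs = spread_lrs(1e-5, 1e-3, 3)  # start group, middle group, last group
```

This gives roughly `1e-5`, `1e-4` and `1e-3` for the three layer groups.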

---------------

# Project notes

## Multilabel Classification with Audio data

- What labels can you predict?
    - using a spectrogram, you could classify
    - `key`, `scale`, `instrument`, `timbre` etc
    - using voice it could be the tone of the voice
        - `hooty`, `squeezed`, `breathy`, `chest`, `head`, `etc`
        
- Metrics
    - use `accuracy_thresh` with a selected threshold
    - check the [video](https://www.youtube.com/watch?v=MpZxV6DVsmM&t=619s)
    - you need to update the accuracy metric
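
A plain Python sketch of what a thresholded accuracy computes for multilabel predictions, assuming the model outputs raw scores that go through a sigmoid first. This is a reimplementation for illustration under the made-up name `accuracy_at_threshold`, not the fastai metric itself.

```python
import math

def accuracy_at_threshold(scores, targets, thresh=0.5):
    """Fraction of label slots predicted correctly after sigmoid + threshold."""
    correct = total = 0
    for row_scores, row_targets in zip(scores, targets):
        for z, t in zip(row_scores, row_targets):
            pred = 1 / (1 + math.exp(-z)) > thresh  # label is "on" above thresh
            correct += int(pred == bool(t))
            total += 1
    return correct / total

scores = [[2.0, -1.0, 0.5],
          [-3.0, 0.2, 1.5]]
targets = [[1, 0, 0],
           [0, 1, 1]]
acc = accuracy_at_threshold(scores, targets, thresh=0.5)
```

Five of the six label slots are right here, so `acc` is 5/6. Each item can have several labels switched on at once, which is why a threshold is used instead of picking the single highest score.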

### Creating data
- you might want to use the actual matrices instead of images
- PyTorch has a `TensorDataset()` class that converts any two tensors into a dataset. You can then use `DataBunch.create()` to create a databunch iterator
- [lesson 5 at 1:27](https://www.youtube.com/watch?v=CJKnDu2dxOE) shows this

# Reading
### links
- [musical freqs](https://pages.mtu.edu/~suits/notefreqs.html)
- [ISMIR](https://www.ismir.net/)
    - librosa creators
- [MIR](https://musicinformationretrieval.com/index.html)
    - [THIS IS USEFUL](https://musicinformationretrieval.com/pitch_transcription_exercise.html)
    - **Pitch Detection**
### Papers
- [Detecting Musical Key With Supervised Learning](http://cs229.stanford.edu/proj2016/report/Mahieu-DetectingMusicalKeyWithSupervisedLearning-report.pdf)
- [Deep residual learning for image recognition](https://arxiv.org/abs/1512.03385)
    - ResBlocks and resnets

- [Visualising the loss landscape of neural nets](https://arxiv.org/abs/1712.09913)

