<h1 style="color:rgb(0,120,170)">Neural Networks and Deep Learning</h1>
<h2 style="color:rgb(0,120,170)">Flux Introduction</h2>

For the uninitiated, Flux.jl is currently (September 2021) the most starred package in the
Julia ecosystem, and it is the go-to package for deep neural networks in Julia.


Autograd: Automatic Differentiation
===================================

Automatic differentiation (the topic of this notebook) is
Flux's core feature. It takes a Julia function `f` and a set of arguments, and returns the
gradient with respect to each argument.


*Ref: This tutorial is based on
[this example tutorial from Flux's GitHub page](https://github.com/FluxML/model-zoo/blob/master/tutorials/60-minute-blitz/60-minute-blitz.jl).*

In [1]:
using Pkg
# This command activates the environment of the current folder.
# It is similar to using a pyenv in Python, but this is how it is done in Julia.
# Note that no external package is needed: Julia supports environments natively.
# More about it can be found in the tutorial on Julia basics.
Pkg.activate(".")

# Import the function `gradient` from Flux. It comes from Zygote.jl.
using Flux: gradient

f(x) = 3x^2 + 2x + 1

# Returns the gradient of f at x = 0.0, as a tuple with one entry per argument.
gradient(f,0.0)

[32m[1m  Activating[22m[39m environment at `~/MEGA/EMAp/NeuralNetworks_Course/Notebooks/Flux/Project.toml`


(2.0,)
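
As a quick sanity check, the derivative is $f'(x) = 6x + 2$, so $f'(0) = 2$, matching the output. The sketch below compares that value with a central finite difference in plain Julia. `central_diff` is a hypothetical helper written for this illustration; it is not part of Flux or Zygote, and it is not how automatic differentiation works internally.

```julia
# Same polynomial as in the cell above.
f(x) = 3x^2 + 2x + 1

# Hypothetical helper: central finite-difference approximation of f'(x).
central_diff(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

approx = central_diff(f, 0.0)
exact = 6 * 0.0 + 2   # derivative worked out by hand

@show approx exact
@show isapprox(approx, exact; atol=1e-6)
```

Finite differences are only an approximation that depends on the step `h`, which is one reason to prefer automatic differentiation in practice.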

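The `gradient` function uses automatic differentiation to calculate
derivatives, and it is not limited to polynomials.

For example, `f(x) = exp(x)` works as well (this notebook differentiates `exp` further down).
It does not work for every function, though. For instance, Zygote raises an error
when the function mutates an array in place.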

Below, we write another example for a function of three variables.

In [3]:
h(x,y,z) = y^3 + x^3 + x*y + z

gradient(h,1,0,0)

(3.0, 1.0, 1.0)
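
The output can be checked by hand. The partial derivatives of $h$ are

$$
\frac{\partial h}{\partial x} = 3x^2 + y, \qquad
\frac{\partial h}{\partial y} = 3y^2 + x, \qquad
\frac{\partial h}{\partial z} = 1,
$$

so at $(x, y, z) = (1, 0, 0)$ the gradient is $(3, 1, 1)$.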

This is where things get interesting. We can also differentiate
functions of arrays.

Take for example
$Ax + b$, where $A$ is a 2 by 2 matrix and $x$ and $b$ are two-dimensional vectors.
This function returns another vector, so it has no gradient,
but it does have a Jacobian. For this situation, we use the `jacobian` function,
which has to be imported from Zygote directly, since Flux does not export it.

In [18]:
using Zygote: jacobian

f(A,x,b) = A*x .+ b

A = [1 2
     3 3]
b = [1,1]
x = [0,0]

jacobian(f,A,x,b)

([0 0 0 0; 0 0 0 0], [1 2; 3 3], [1 0; 0 1])
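
The three blocks can be read by hand. This is a worked check, assuming the matrix argument is flattened column by column, which is consistent with the $2 \times 4$ block in the output:

$$
\frac{\partial (Ax + b)}{\partial \operatorname{vec}(A)} = x^\top \otimes I_2, \qquad
\frac{\partial (Ax + b)}{\partial x} = A, \qquad
\frac{\partial (Ax + b)}{\partial b} = I_2 .
$$

With $x = (0, 0)$ the first block vanishes, the second is $A$ itself, and the third is the $2 \times 2$ identity.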

The reason that `jacobian` is not exported by Flux is that neural networks
usually only require gradients in order to perform backpropagation. Hence,
instead of $Ax + b$, we have scalar-valued functions such as $\sum_{i=1}^{n} \left((Ax)_i + b_i\right)$, which do
have a gradient. Look at the example below.

In [19]:
f(A,x,b) = sum(A*x .+ b)

A = [1 2
     3 3]
b = [1,1]
x = [0,0]

gradient(f,A,x,b)

([0.0 0.0; 0.0 0.0], [4.0, 5.0], [1.0, 1.0])

<div class="alert alert-info"><p>
<strong>Obs</strong>: depending on the installed versions, the last element of the output of the <code>gradient</code> function may be displayed as <code>Fill(1,2)</code> instead of <code>[1.0, 1.0]</code>.
This is just a way to represent a vector of dimension 2 where all elements are equal to 1.
If you want to understand more about it, copy the code below and run it in a cell to see the output.
</p></div>

```julia
using FillArrays
@show collect(Fill(1,2))
@show collect(Fill(3.5,2,2))
```
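
For the `sum(A*x .+ b)` example above, the gradients also have simple closed forms: $\nabla_A = \mathbf{1}x^\top$, $\nabla_x = A^\top \mathbf{1}$ and $\nabla_b = \mathbf{1}$, where $\mathbf{1}$ is a vector of ones. A minimal sketch in plain Julia (no Flux needed; the variable names are chosen for this illustration) reproduces the numbers:

```julia
A = [1 2
     3 3]
b = [1, 1]
x = [0, 0]

onesvec = ones(2)          # vector of ones, one entry per output row

grad_A = onesvec * x'      # outer product 1 * xᵀ, all zeros here since x = 0
grad_x = A' * onesvec      # Aᵀ * 1, i.e. the column sums of A
grad_b = onesvec           # each b[i] enters the sum exactly once

@show grad_A grad_x grad_b
```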

It gets even more impressive: `gradient` can also differentiate functions defined programmatically, with control flow such as `if`/`else` branches!
Take a look at the example below.

In [6]:
function mycrazyfunction(x)
    if x ≥ 1
        return sin(x)
    elseif -1 < x < 1
        return exp(x)
    else
        return x^2
    end
end

@show gradient(mycrazyfunction,0)[1] == exp(0)
@show gradient(mycrazyfunction,1)[1] == cos(1)
@show gradient(mycrazyfunction,-10)[1] == 2*-10;

(gradient(mycrazyfunction, 0))[1] == exp(0) = true
(gradient(mycrazyfunction, 1))[1] == cos(1) = true
(gradient(mycrazyfunction, -10))[1] == 2 * -10 = true
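
One caveat worth keeping in mind: automatic differentiation follows the branch that is actually executed, so at `x = 1` it reports `cos(1)` even though the function jumps there. The plain-Julia sketch below makes the jump visible. `piecewise` is just a copy of `mycrazyfunction` under a different (hypothetical) name so that the snippet is self-contained.

```julia
# Copy of `mycrazyfunction`, renamed so this snippet stands alone.
function piecewise(x)
    if x ≥ 1
        return sin(x)
    elseif -1 < x < 1
        return exp(x)
    else
        return x^2
    end
end

left  = piecewise(1 - 1e-9)   # exp branch, close to ℯ ≈ 2.718
right = piecewise(1.0)        # sin branch, sin(1) ≈ 0.841

@show left right
@show right - left            # a jump of about -1.88, so no classical derivative at x = 1
```

Away from the branch boundaries the function is smooth, and the reported derivative is the usual one.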


Let's define a loss function similar to what we actually find in neural networks.
Remember that the function `gradient()` returns a tuple of length `n`,
where `n` is the number of arguments that the loss function receives. In our example below,
`n = 3`.

In [6]:
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = [3,1,0,1,2]

# Here we get the gradient with respect to x, evaluated at x = [3,1,0,1,2]
gradient(myloss, W, b, x)[3]

5-element Vector{Float64}:
 -0.6711226626925774
  1.0659156306091664
  1.175536176922669
 -0.47252159253728543
  1.5855710120528388
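
Because `myloss` is linear in `x`, this gradient does not depend on the value of `x`: it equals $W^\top \mathbf{1}$, the column sums of `W`. The sketch below is a hedged check in plain Julia using finite differences. `fd_gradient` is a hypothetical helper for this illustration, not a Flux function.

```julia
myloss(W, b, x) = sum(W * x .+ b)

W = randn(3, 5)
b = zeros(3)
x = [3.0, 1.0, 0.0, 1.0, 2.0]

# Closed form: d/dx sum(W*x .+ b) = Wᵀ * ones, i.e. the column sums of W.
closed_form = vec(sum(W, dims=1))

# Hypothetical helper: forward finite differences, one coordinate at a time.
function fd_gradient(g, x; h=1e-6)
    map(eachindex(x)) do i
        delta = zeros(length(x))
        delta[i] = h
        (g(x .+ delta) - g(x)) / h
    end
end

numeric = fd_gradient(v -> myloss(W, b, v), x)

@show isapprox(numeric, closed_form; atol=1e-5)
```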

When training a network, we usually do not need the gradient with respect to `x`, since `x` represents our data.
We only want the gradient with respect to the parameters of the model, e.g. `W` and `b`.
This can be done using the function `params` from Flux.

In [7]:
using Flux: params


W = randn(3, 5)
b = zeros(3)
x = [3,1,0,1,2]

y(x) = sum(W * x .+ b)

grads = gradient(()->y(x), params([W, b]))

grads[W], grads[b]

([3.0 1.0 … 1.0 2.0; 3.0 1.0 … 1.0 2.0; 3.0 1.0 … 1.0 2.0], Fill(1.0, 3))
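
For comparison, the same gradients can be requested explicitly by making `W` and `b` arguments of an anonymous function and closing over `x`. This is a sketch assuming the same `gradient` imported from Flux earlier in the notebook. Depending on the installed version, the gradient with respect to `b` may come back as a `Fill` or as a plain vector, hence the `collect`.

```julia
using Flux: gradient

W = randn(3, 5)
b = zeros(3)
x = [3, 1, 0, 1, 2]

# Differentiate only with respect to W and b; x is captured as a constant.
gW, gb = gradient((W, b) -> sum(W * x .+ b), W, b)

@show gW == ones(3) * x'       # every row of the gradient is xᵀ
@show collect(gb) == ones(3)   # each b[i] contributes with weight 1
```

The explicit form returns a plain tuple, while the `params` form returns an object indexed by the parameter arrays themselves, as in `grads[W]`.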

In [8]:
grads.params[1]

3×5 Matrix{Float64}:
 -0.996638  -0.05127     -0.18591   -1.7349    -0.930081
 -1.15274    0.835228     0.960265  -0.874299   0.834282
  0.596683  -0.00502874   1.90293   -0.86473    0.392212

In [52]:
func()

-7.959585942769246