<a href="https://colab.research.google.com/github/wiso/TutorialML-AtlasItalia2022/blob/main/notebooks/0.2-IntroKeras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import tensorflow as tf
from matplotlib import pyplot as plt
import scipy

## Ingredients for ML

   * A large, curated dataset
   * A model, taking inputs and making predictions (e.g. a neural network)
   * A loss, evaluating how well the model is performing, including a regularization to constrain the model
   * A minimization procedure, to optimize the loss by tuning the model parameters
   * Several metrics, to evaluate the performance of the trained model
   * Powerful hardware

## Model

As a model we consider only neural networks here, but many other ML models are available. For regression and classification the main competitors are decision trees.

## Simplest neural network

<img src="imgs/1neuron.png" />
<img src="imgs/dense.png" />

Consider a fully connected neural network with a single layer. Each neuron $i$ takes as input a vector $x\in \mathbb{R}^N$ and returns as output $\sigma(W^{(i)} \cdot  x + b^{(i)})$, where $W^{(i)}\in\mathbb{R}^N$ is a vector of weights and $b^{(i)}\in\mathbb{R}$ is the bias. $\sigma:\mathbb{R}\to\mathbb{R}$ is the activation (response) function, and it must be non-linear. If we stack the outputs of all the $L$ neurons in a vector $y$ (the response of the layer):

$$
y = \sigma(W x + b)
$$

In the formula above, $\sigma$ is applied to each element of the vector in parentheses (elementwise). Here $W\in\mathbb{R}^{L\times N}$, while $y, b\in\mathbb{R}^L$.
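
The layer response can be sketched in a few lines of NumPy. The sizes `N = 4` and `L = 3`, the random weights and the choice of `tanh` as $\sigma$ are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 3                      # input size and number of neurons (illustrative)

x = rng.normal(size=N)           # input vector, shape (N,)
W = rng.normal(size=(L, N))      # one row of weights per neuron, shape (L, N)
b = rng.normal(size=L)           # one bias per neuron, shape (L,)

sigma = np.tanh                  # any non-linear function, applied elementwise

y = sigma(W @ x + b)             # response of the layer, shape (L,)

# the first component is the output of the first neuron taken alone
y0 = sigma(W[0] @ x + b[0])
print(y.shape, np.isclose(y[0], y0))
```

Each row of `W` plays the role of $W^{(i)}$, so the matrix product evaluates all the neurons at once.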

## Deep dense neural network


<img src="imgs/Fullyconnected.png" />

We can stack several layers:

$$
\begin{aligned}
y_1 &= \sigma_{L1}(W^{L1} x + b^{L1}) \\
y_2 &= \sigma_{L2}(W^{L2} y_1 + b^{L2}) \\
&\ldots
\end{aligned}
$$
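
A minimal NumPy sketch of stacking layers, under assumed layer sizes (4, 8, 2) and random weights. The helper `forward` is a hypothetical name, not part of any library. The second part illustrates why the activation must be non-linear: with an identity activation two layers collapse into a single linear layer.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def identity(a):
    return a

def forward(x, layers):
    """Apply y_k = sigma_k(W_k y_{k-1} + b_k) layer by layer."""
    y = x
    for W, b, sigma in layers:
        y = sigma(W @ y + b)
    return y

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

# two layers with a non-linear activation
y_deep = forward(x, [(W1, b1, relu), (W2, b2, relu)])

# without non-linearity the two layers are equivalent to a single linear layer
y_linear = forward(x, [(W1, b1, identity), (W2, b2, identity)])
y_single = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(y_deep.shape, np.allclose(y_linear, y_single))
```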

## Activation functions

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
xspace = np.linspace(-3, 3, 100)
for fname in 'relu', 'elu', 'gelu', 'selu', 'swish', 'tanh', 'sigmoid':
    f = getattr(tf.keras.activations, fname)
    ax.plot(xspace, f(xspace), label=fname)
ax.legend(ncol=2, fontsize=20);

## Keras functional API

In [None]:
x_input = tf.keras.Input(shape=(4,))
x = tf.keras.layers.Dense(64, activation="relu")(x_input)
x = tf.keras.layers.Dense(32, activation="relu")(x)
x = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # softmax on a single unit would always return 1

model = tf.keras.Model(inputs=x_input, outputs=x)
model.summary()
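
The number of trainable parameters reported by the summary can be cross-checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper `dense_params` below is a hypothetical name introduced for this check.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

sizes = [4, 64, 32, 1]  # input size followed by the units of each Dense layer
per_layer = [dense_params(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(per_layer, sum(per_layer))  # [320, 2080, 33] 2433
```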

In [None]:
test_input = np.array([[1, 2, 3, 4]])  # note the [[  ]]
model(test_input)

In [None]:
test_input = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])  
model(test_input)

## Keras sequential API

In [None]:
model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(4, )),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # softmax on a single unit would always return 1
    ]
)

tf.keras.utils.plot_model(model, show_layer_activations=True)

In [None]:
model.save_weights('model_weights.h5')

with open('model_description.json', 'w') as fn:
    fn.write(model.to_json(indent=2))

# load the model
# with open('model_description.json', 'r') as model_file:
#     model = tf.keras.models.model_from_json(model_file.read())
# model.load_weights('model_weights.h5')

## Metrics and losses
Many losses exist, and the choice is related to the metric we want to optimize, which in turn depends on the specific problem. Usually the metric cannot be used directly as a loss: the metric is a function of the whole sample, while the loss can be evaluated on each element of the sample, or at least on a mini-batch. In this way the loss of the sample can be defined as the sum of the losses of its elements:

$$L = \sum_i l(\hat y_i)$$

where $\hat y_i$ is the $i$-th output of the neural network.
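
As a small illustration of why this additive form is convenient, the sketch below checks that the loss of the whole sample equals the sum of the losses of its mini-batches. The squared error used as per-element loss, the toy data and the batch size are assumptions for the example. Here `l` also takes the true value explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
y_pred = y_true + rng.normal(scale=0.1, size=1000)   # stand-in for the network outputs

def l(y_hat, y):
    """Per-element loss (squared error, chosen only as an example)."""
    return (y_hat - y) ** 2

total = l(y_pred, y_true).sum()                      # L = sum_i l(y_hat_i)

batch_size = 100
batch_sums = [
    l(y_pred[i:i + batch_size], y_true[i:i + batch_size]).sum()
    for i in range(0, len(y_true), batch_size)
]
print(np.isclose(total, sum(batch_sums)))
```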

### Metrics

   * for classification: accuracy (fraction of correct guesses), signal significance, area under the curve, ...
   * for regression: the resolution of the estimated quantity

Remember that ML is not just classification and regression...

### Losses

   * for classification: for example, for signal vs background: binary cross entropy (e.g. $y_i = 0$ or $1$)
   $$
   -\sum_i (y_i \log(\hat y_i) + (1-y_i) \log(1-\hat y_i))
   $$
   * for regression: mean squared error, mean absolute error, ...
   
Regularization terms may also be added to the loss, for example, for each layer, $L_2 = \sum_{ij} |W_{ij}|^2$.
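
A minimal NumPy sketch of the binary cross entropy and of the $L_2$ term written above. The toy labels, predictions, weight matrix and the regularization strength `lam` are hypothetical values chosen for illustration. The clipping with `eps` is an implementation detail added here to avoid evaluating $\log(0)$.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def l2_penalty(W):
    return np.sum(np.abs(W) ** 2)

y = np.array([1, 0, 1, 1])                 # true labels (signal = 1, background = 0)
good = np.array([0.9, 0.1, 0.8, 0.7])      # confident and correct predictions
bad = np.array([0.2, 0.9, 0.3, 0.4])       # mostly wrong predictions

W = np.array([[0.5, -1.0], [2.0, 0.0]])    # toy weight matrix
lam = 0.01                                 # hypothetical regularization strength

loss_good = binary_cross_entropy(y, good) + lam * l2_penalty(W)
loss_bad = binary_cross_entropy(y, bad) + lam * l2_penalty(W)
print(loss_good, loss_bad)
```

Better predictions give a smaller loss, while the penalty depends only on the size of the weights.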

## Minimization
The training procedure updates the parameters of the model to minimize the loss evaluated on the training dataset.

## Automatic differentiation
The key ingredient for optimizing a neural network is the ability to compute the gradient of the loss with respect to the parameters of the model. This is achieved with automatic differentiation.

In [None]:
def f(x):
    return (x - 1.) ** 2
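
To give a feeling for how automatic differentiation can work, here is a toy forward-mode implementation based on dual numbers. It is only a sketch: the class `Dual` and the helper `derivative` are hypothetical names, only a few operations are supported, and the libraries used in the following cells (`tf.GradientTape`, `autograd.grad`, `jax.grad`) rely on reverse-mode differentiation instead. The function `f` is the same as in the cell above.

```python
class Dual:
    """Number of the form value + deriv * eps, with eps**2 = 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # promote plain numbers to constants (zero derivative)
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __sub__(self, other):
        other = self._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = self._lift(other)
        # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    def __pow__(self, n):
        # power rule for a constant exponent n
        return Dual(self.value ** n, n * self.value ** (n - 1) * self.deriv)


def f(x):
    return (x - 1.) ** 2


def derivative(func, x):
    # seed the input with derivative 1 and read the derivative of the output
    return func(Dual(x, 1.0)).deriv


print(derivative(f, 3.0))  # 4.0
```

The derivative is propagated together with the value through every elementary operation, so no symbolic expression and no finite difference is needed.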

### In TensorFlow

In [None]:
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = f(x)

tape.gradient(y, x)

### In Autograd

In [None]:
import autograd

f_dx = autograd.grad(f)
f_dx(3.)

### In JAX

In [None]:
import jax

f_dx = jax.grad(f)
f_dx(3.)

## Using control flow
You can differentiate functions that use control flow (`if`, `for`, ...) and recursion.

In [None]:
# complicated function with if and for
def f(x):
    if x > 0:                        # condition
        for i in range(2):           # for loop
            x += jax.numpy.sqrt(x)
        return x / 10.
    else:
        return f(f(x ** 2)) + 1      # recursion
    
f_dx = jax.grad(f)
f_dx(3.4)

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
xspace = np.linspace(-2., 5, 500)
yi = np.asarray([f(xx) for xx in xspace])
ax.plot(xspace, yi, label='function')
yi = np.asarray([f_dx(xx) for xx in xspace])
ax.plot(xspace, yi, label='derivative')
ax.legend()
plt.show()

## Minimizers

Once you have defined the loss, you want to optimize the parameters of the model to minimize the loss.

Many optimizers are based on Stochastic Gradient Descent (SGD). Usually the loss is defined as

$$
L = \sum_q l_w(x_q)
$$

where the sum runs over the elements of the training sample. We can apply gradient descent to minimize it, iteratively updating the weights $w$:

$$
w_{i+1} = w_{i} - \eta \nabla_w L = w_{i} - \eta \nabla_w \sum_q l_w(x_q)
$$

Here we need to compute the gradient for all the elements at every step. In SGD, instead, each step uses only one element (taken in random order):

$$
w_{i+1} = w_{i} - \eta \nabla_w l_w(x_{i+1})
$$
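
The two update rules can be compared on a toy problem. The sketch below assumes a one-parameter model $w x$ with a squared-error per-element loss, synthetic data with true slope 3 and a learning rate of 0.1. None of these choices comes from the text above. In the full gradient descent the learning rate is divided by the sample size, which only rescales $\eta$ in the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # toy data, true slope 3

def grad_l(w, xq, yq):
    """Gradient of the per-element loss l_w = (w * x_q - y_q)**2 with respect to w."""
    return 2 * (w * xq - yq) * xq

eta = 0.1

# gradient descent: every update uses the gradient summed over all the elements
w_gd = 0.0
for _ in range(500):
    w_gd -= (eta / len(x)) * grad_l(w_gd, x, y).sum()

# stochastic gradient descent: every update uses a single element, in random order
w_sgd = 0.0
for epoch in range(5):
    for q in rng.permutation(len(x)):
        w_sgd -= eta * grad_l(w_sgd, x[q], y[q])

print(w_gd, w_sgd)
```

Both estimates end up close to the true slope. The SGD one keeps fluctuating around the minimum because each step sees only one noisy element.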


### Minimize a function without data
Let's minimize a function (the loss) which has only parameters (and no data).

In [None]:
# define a function without data
var = tf.Variable(starting_point := 1.0)
loss = lambda: (var ** 2)

def minimize_with_history(loss, opt, nepochs):
    # record (parameter, loss) before the first step
    history_steps = [(var.numpy(), var.numpy() ** 2)]
    for epoch in range(nepochs):
        # opt.minimize performs just one minimization step (not the full minimization)
        opt.minimize(loss, [var])
        history_steps.append((var.numpy(), loss().numpy()))
    return np.asarray(history_steps)

# start with a moderate learning_rate
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
history_steps1 = minimize_with_history(loss, opt, 50)

# reset and use a huge learning_rate
var = tf.Variable(starting_point)
opt = tf.keras.optimizers.SGD(learning_rate=0.99)
history_steps2 = minimize_with_history(loss, opt, 50)

# reset and use a tiny learning_rate
var = tf.Variable(starting_point)
opt = tf.keras.optimizers.SGD(learning_rate=0.001)
history_steps3 = minimize_with_history(loss, opt, 50)

In [None]:
fig, ax = plt.subplots(figsize=(8, 7))
xspace = np.linspace(-1, 1, 100)
ax.plot(xspace, xspace ** 2, color='0.7')
ax.plot(*history_steps1.T, '.-', label='0.1')
ax.plot(*history_steps2.T, '.-', label='0.99')
ax.plot(*history_steps3.T, '.-', label='0.001')
ax.legend(title='learning rate', loc=0);
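
The behaviour for the three learning rates can be understood without TensorFlow. For $L(w) = w^2$ the gradient is $2w$, so each gradient-descent step multiplies $w$ by the constant factor $1 - 2\eta$. The pure-Python sketch below (the function name `gd_on_square` is hypothetical) reproduces the same sequences of parameter values.

```python
def gd_on_square(w0, eta, nsteps):
    """Gradient descent on L(w) = w**2, whose gradient is 2 * w."""
    history = [w0]
    for _ in range(nsteps):
        history.append(history[-1] - eta * 2 * history[-1])
    return history

for eta in (0.1, 0.99, 0.001):
    history = gd_on_square(1.0, eta, 50)
    print(f"eta={eta}: factor per step = {1 - 2 * eta:+.3f}, "
          f"w after 50 steps = {history[-1]:+.4f}")
```

With $\eta = 0.1$ the parameter shrinks quickly, with $\eta = 0.99$ it changes sign at every step and shrinks slowly, and with $\eta = 0.001$ it barely moves. The iteration converges only if $|1 - 2\eta| < 1$.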

## Linear fit 1D
Minimize the usual sum of squared residuals
$$L(m, q) = \sum_i(f(m, q, x_i) - y_i)^2$$

where $f$ is our model $f(x) = mx+q$. In this case we need to feed the data to the model. Let's use mini-batches.

In [None]:
data_x = np.arange(0, 1000.)  # 1000 points
data_y = data_x * 2.5 + 1.2  # true-y
data_y = np.random.normal(data_y, data_y * 0.1 + 0.1)

class MyModel(tf.keras.Model):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.m = tf.Variable(5.)  # arbitrary starting values (use floats)
        self.q = tf.Variable(10.)

    def call(self, x):
        return self.m * x + self.q

model = MyModel()

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.000001),
    loss=tf.keras.losses.mean_squared_error,
)
model.fit(data_x, data_y, epochs=5, batch_size=100)

In [None]:
plt.plot(data_x, data_y)
plt.plot(data_x, model.predict(data_x))
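
As a cross-check that does not depend on Keras, the same straight line can be fitted in closed form with `np.polyfit`, which minimizes the same sum of squared residuals. The sketch regenerates the data with the same noise model as above. The fixed random seed is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
data_x = np.arange(0, 1000.)
true_y = data_x * 2.5 + 1.2
data_y = rng.normal(true_y, true_y * 0.1 + 0.1)   # same noise model as above

# least-squares solution of y = m * x + q (polyfit returns the highest degree first)
m_fit, q_fit = np.polyfit(data_x, data_y, deg=1)
print(m_fit, q_fit)
```

The slope is well constrained by the data, while the intercept is poorly determined because the noise grows with $y$.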

## Not only ML

### Statistics

Let's define the likelihood of a counting experiment with one category, one signal and a background uncertainty. The parameters are the parameter of interest (POI, the signal strength) and the nuisance parameter (NP) describing the background uncertainty.

In [None]:
import pyhf
pyhf.set_backend('jax')

# make a counting experiment
model = pyhf.simplemodels.uncorrelated_background(signal=[5.], bkg=[10.], bkg_uncertainty=[3.5])
pars = jax.numpy.array(model.config.suggested_init())

# generate an Asimov dataset (e.g. 15 events observed)
data = jax.numpy.array(model.expected_data(model.config.suggested_init()))

bestfit = pyhf.infer.mle.fit(data, model)  # not really needed since it is an Asimov
bestfit

In [None]:
# the covariance is the inverse of the Hessian of the negative log-likelihood
H = -jax.hessian(model.logpdf)(bestfit, data)[0]
cov = np.linalg.inv(H)
cov

You can compute the Hessian of the log-likelihood with automatic differentiation.

If you have a likelihood, you don't need any minimization to compute the expected errors!
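
The idea can be checked on an even simpler case, where everything is known analytically: a single Poisson count with no nuisance parameter. This is a sketch under stated assumptions. The number of observed events is illustrative, and the second derivative is computed with central finite differences as a stand-in for automatic differentiation.

```python
import numpy as np

n_obs = 15.0   # observed events (illustrative)

def log_likelihood(mu, n=n_obs):
    """Poisson log-likelihood for n observed events and expectation mu (constant terms dropped)."""
    return n * np.log(mu) - mu

mu_hat = n_obs   # the maximum-likelihood estimate is known analytically here

# second derivative at the best fit, by central finite differences
h = 1e-3
d2 = (log_likelihood(mu_hat + h) - 2 * log_likelihood(mu_hat)
      + log_likelihood(mu_hat - h)) / h ** 2

variance = 1.0 / (-d2)   # inverse of the Hessian of the negative log-likelihood
print(np.sqrt(variance), np.sqrt(n_obs))
```

The error obtained from the curvature agrees with the familiar $\sqrt{n}$ of a counting experiment.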

In [None]:
from matplotlib.patches import Ellipse

In [None]:
grid = x, y = np.meshgrid(np.linspace(0.5, 1.5, 101), np.linspace(0.5, 1.5, 101))

def get_ellispse_from_covariance(x, y, covariance, z=1, **kwargs):
    # eigh is meant for symmetric matrices: eigenvalues are returned in
    # ascending order and the eigenvectors are the columns of vs
    ls, vs = np.linalg.eigh(covariance)

    v_max = vs[:, -1]  # direction of the major axis
    angle = np.degrees(np.arctan2(v_max[1], v_max[0]))

    # Ellipse expects the full width and height, i.e. twice the semi-axes
    return Ellipse((x, y),
                   width=2 * z * np.sqrt(ls[-1]),
                   height=2 * z * np.sqrt(max(ls[0], 0)),
                   angle=angle, **kwargs)

ellipse = get_ellispse_from_covariance(bestfit[0], bestfit[1], cov, fill=False, edgecolor='k', lw=2)

points = np.swapaxes(grid, 0, -1).reshape(-1, 2)
v = jax.vmap(model.logpdf, in_axes=(0, None))(points, data)
v = np.swapaxes(v.reshape(101, 101), 0, -1)

fig, ax = plt.subplots(figsize=(10, 10))
ax.pcolormesh(x, y, v)

grid = x, y = np.meshgrid(np.linspace(0.5, 1.5, 11), np.linspace(0.5, 1.5, 11))
points = np.swapaxes(grid,0,-1).reshape(-1,2)
values, gradients = jax.vmap(
    jax.value_and_grad(lambda p,d: model.logpdf(p,d)[0]),
    in_axes = (0,None))(points, data)

ax.quiver(points[:,0], points[:,1], gradients[:,0], gradients[:,1],
          angles = 'xy', scale = 75)
ax.scatter(bestfit[0], bestfit[1], c='r')
ax.add_patch(ellipse)
ax.set_aspect('equal')

## Heavy number crunching
Even though the interface to most ML libraries is in Python, the expressions (the model, but also the minimization steps, the preprocessing, ...) are represented as a computational graph, which is optimized, compiled and dispatched to the most suitable hardware (CPU/GPU/TPU).

In [None]:
ymin, ymax = -1.5, 1.5
xmin, xmax = -1.5, 1.5

nx, ny = 500, 500

X, Y = np.meshgrid(np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny))
Z = X + 1j * Y

# Grid of complex numbers
xs = tf.constant(Z.astype(np.complex64))

# Z-values for determining divergence; initialized at zero
zs = tf.zeros_like(xs)

# N-values store the number of iterations taken before divergence
ns = tf.Variable(tf.zeros_like(xs, tf.float32))

def step(c, z, n):
    z = z * z + c
    
    not_diverged = tf.abs(z) < 4
    n = tf.add(n, tf.cast(not_diverged, tf.float32))
    
    return c, z, n

fig, axs = plt.subplots(1, 2, figsize=(15, 7))
iterations = 1000

# mandelbrot
for _ in range(iterations): 
    xs, zs, ns = step(xs, zs, ns)

def shade_fractal(fractal):
    fractal = np.where(fractal == 0, iterations, fractal)
    fractal = fractal / fractal.max()
    fractal = np.log10(fractal)  
    return fractal

axs[0].pcolormesh(X, Y, shade_fractal(ns), shading='gouraud')    

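# Julia: the grid points are the starting values of z, c is a fixed constant
zs = tf.zeros_like(xs)
ns = tf.Variable(tf.zeros_like(xs, tf.float32))

for _ in range(iterations): 
    # step returns (c, z, n): zs receives the constant c, xs the updated z
    zs, xs, ns = step(-0.7269 + 0.1889j, xs, ns)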
    
axs[1].pcolormesh(X, Y, shade_fractal(ns), shading='gouraud')    

for ax in axs:
    ax.set_aspect('equal')

## Hardware
Computing power used to train models (in PFLOP/s $\times$ day): it doubles every 3.4 months!

<img width="1000" src="imgs/ai-and-compute-all-error-no-title.png">

from https://arxiv.org/abs/2005.04305
<img src="imgs/ai-and-efficiency-compute.png">

The GPT-3 175B model (175B parameters) required $3.14\times 10^{23}$ FLOP for training. Even at the theoretical 28 TFLOPS of a V100 GPU (one GPU costs about 10k\\$) and the lowest 3-year reserved cloud pricing we could find, this would take 355 GPU-years and cost \\$4.6M for a single training run.
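
The GPU-years figure follows directly from the numbers quoted above, as this short calculation shows. The hourly price printed at the end is only derived from the quoted total cost. It is not a quoted price.

```python
total_flop = 3.14e23                 # training compute quoted above
v100_flops = 28e12                   # theoretical V100 throughput quoted above, in FLOP/s
seconds_per_year = 365 * 24 * 3600

gpu_years = total_flop / v100_flops / seconds_per_year
print(f"{gpu_years:.1f} GPU-years")

# hourly price implied by the quoted total cost
total_cost = 4.6e6                   # dollars
gpu_hours = gpu_years * 365 * 24
price_per_hour = total_cost / gpu_hours
print(f"{price_per_hour:.2f} dollars per GPU-hour")
```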

<img src="imgs/gpt3_table.png">

## GPU

GPUs expose massive parallelism

<img src="imgs/A100.png">

The block diagram of the A100 below shows an architecture with 128 streaming multiprocessors (SMs), though only 108 are actually enabled in a production A100 chip.

<img src="imgs/A100_block_diagram.png">

### Pascal / Volta / Ampere SM (Compute Capability 6.0, 7.0, 8.0)

- 64 SP units ("CUDA cores")
- 32 DP units
- LD/ST units
- FP16 at twice the SP rate
- Pascal: 2 warp schedulers; Volta, Ampere: 4 warp schedulers
- Tensor Cores (in the Volta and Ampere variants)
- INT32 units (in the Volta and Ampere variants)
- P100: 56 SMs, 16 GB
- V100: 80 SMs, 16/32 GB
- A100: 108 SMs, 40 GB
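
To get a sense of the scale of this parallelism, the total number of single-precision units per chip can be obtained by multiplying the numbers in the list above:

```python
SP_UNITS_PER_SM = 64   # "CUDA cores" per SM, from the list above

enabled_sms = {"P100": 56, "V100": 80, "A100": 108}

cuda_cores = {gpu: n_sm * SP_UNITS_PER_SM for gpu, n_sm in enabled_sms.items()}
print(cuda_cores)  # {'P100': 3584, 'V100': 5120, 'A100': 6912}
```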

<img src="imgs/volta_sm.png" width="800" />

## Serverless solutions

Google Cloud Platform, AWS Lambda, IBM Watson, Microsoft Azure, Lambdalabs, Paperspace, ...

### Swan (https://swan-k8s.cern.ch)

<img src="imgs/swan.png">

### INFN Cloud
<img src="imgs/infn_cloud.png" />

But don't be scared: for a simple NN with tens of inputs, your laptop is usually enough.