<a href="https://colab.research.google.com/github/rastringer/code_first_ml/blob/main/intro_to_ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Machine Learning

Machine learning (ML) is a field within artificial intelligence (AI) that has been around for decades. Improvements in computational power (accelerators such as GPUs) and algorithm design have led to stunning advances in the last decade.

In this notebook, we will examine typical approaches to ML: some of the mathematics involved, what 'training' means in model development, and common considerations for 'MLOps', or ML in production.

In [None]:
! pip install torchviz torch torchvision shap torchsummary numba

<img src="https://blog.hnf.de/wp-content/uploads/2020/12/Arthur_Samuel.jpg" width="400"/>

#### Early days

Arthur Samuel, IBM researcher in 1962:

“Programming a computer...is, at best, a difficult task, not primarily because of any inherent complexity in the computer itself but, rather, because of the need to spell out every minute step of the process in the most exasperating detail. Computers, as any programmer will tell you, are giant morons, not giant brains.”

In traditional computing, we take inputs, perform some prescribed operations, and generate an output.

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/input_program_output.png?raw=true" width="600"/>


```
def square(x):
   return x*x
```

Samuel referred to the idea of assigning weights to inputs, which could then be adjusted to maximize the performance of a particular task.

Samuel:

“Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. ”


<img src="https://github.com/rastringer/code_first_ml/blob/main/images/input_model_weights.png?raw=true" width="500"/>

"We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would "learn" from its experience.” –Samuel.

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/input_model_backprop.png?raw=true" width="600"/>


A sporting analogy is that of an athlete practicing their discipline, with a good coach who ensures they are learning to improve their skills as they perform them repeatedly.

<img src="https://d1s9j44aio5gjs.cloudfront.net/2020/07/Becoming_a_swimming_coach_Careers_in_Aquatics.jpg" width="600"/>

We will explore each of these themes in more detail as we progress through the notebook. Here, in a nutshell and outlined in the 1960s, are the basics of what became the burgeoning field of machine learning.

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/flowers_example.png?raw=true" width="600"/>




### Principles and code

Let's explore the building blocks of a neural network. We are grateful for Jeremy Howard's teaching methodology and [related notebook](https://www.kaggle.com/code/jhoward/how-does-a-neural-net-really-work), on which the following materials are based.
If today interests you, check out his courses at [fast.ai](https://www.fast.ai).

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    x = torch.linspace(min,max, 100)[:,None]
    if ylim: plt.ylim(ylim)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)

### Quadratic functions
Quadratic functions can be useful for modeling the trajectory of projectiles, arcs and parabolic shapes. They are often used in optimization problems, or to fit real-world data.

For this notebook, we have no use in mind other than to demonstrate some of the building blocks of a neural network.

The function below computes $ax^2+bx+c$; we then plot it with parameters $a = 1, b = -2, c = 1$.

In [None]:
def quadratic(a, b, c, x):
  return a*x**2 + b*x + c

In [None]:
# Setup quadratic
f = lambda x: quadratic(1, -2, 1, x)

# Generate x and y values over a range
x = np.linspace(-2.1, 2.1, 100)[:,None]
y = f(x)

# Create plot
fig, ax = plt.subplots()
ax.plot(x, y, '-r', label='Quadratic function')
ax.legend()
ax.set_title('Simple Quadratic Function')
ax.set_xlabel('X value')
ax.set_ylabel('Y = f(x)')

plt.show()



We can fix the values of `a`, `b` and `c` using `functools.partial`, which leaves us with a function of `x` alone.

In [None]:
from functools import partial

def make_quad(a,b,c):
  return partial(quadratic, a,b,c)

In [None]:
f2 = make_quad(3,2,1)
plot_function(f2)
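
As a minimal sketch of what `partial` is doing, the snippet below fixes the three coefficients and then calls the result with only `x`. The name `quad_321` is just an illustrative choice.

```python
from functools import partial

def quadratic(a, b, c, x):
    return a * x**2 + b * x + c

# Fix a=3, b=2, c=1; the returned callable only needs x
quad_321 = partial(quadratic, 3, 2, 1)

# Calling the partial is the same as calling quadratic with all four arguments
print(quad_321(2.0))
print(quad_321(2.0) == quadratic(3, 2, 1, 2.0))
```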

Next, we add random noise to the function's outputs, to mimic the imperfect measurements found in real-world data.

In [None]:
def noise(x, scale):
  return np.random.normal(scale=scale, size=x.shape)

def add_noise(x, mult, add):
  return x * (1+noise(x,mult)) + noise(x,add)

In [None]:
np.random.seed(42)

x = torch.linspace(-2, 2, steps=20)[:,None]
y = add_noise(f(x), 0.15, 1.5)

In [None]:
x[:5],y[:5]

What are these tensors?

A tensor is an array of numerical values that can have any number of dimensions. Scalars, vectors and matrices are tensors with zero, one and two dimensions respectively. For example, a color image is a 3D tensor with height, width and channel dimensions.
The number of dimensions is also commonly referred to as the tensor's 'rank'.

In [None]:
tensor = torch.rand(3, 4)
print(f"Tensor = {tensor}")
print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

In [None]:
import torch

# Scalar
scalar = torch.tensor(5)
print(f"Scalar rank: {scalar.dim()}")

# Vector
vec = torch.tensor([1, 2, 3])
print(f"Vector rank: {vec.dim()}")

# Matrix
mat = torch.tensor([[1, 2], [3, 4]])
print(f"Matrix rank: {mat.dim()}")

# 3D Tensor
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"3D tensor rank: {tensor3d.dim()}")
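
To connect rank to the image example, here is a small sketch with made-up sizes. It assumes the channels-first layout (channels, height, width) that PyTorch image tooling commonly uses; the tensors are random stand-ins, not real images.

```python
import torch

# A hypothetical 28x28 RGB image: (channels, height, width)
image = torch.rand(3, 28, 28)
# A batch of 64 such images adds a leading batch dimension
image_batch = torch.rand(64, 3, 28, 28)

print(image.dim(), image.shape)
print(image_batch.dim(), image_batch.shape)
```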

Back to the quadratic. Let's plot the noisy points we generated, so we can explore fitting a quadratic to points in a graph.

In [None]:
plt.scatter(x,y);

In [None]:
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
    plt.scatter(x,y)
    plot_function(make_quad(a,b,c), ylim=(-3,13))

Our experimentation would be a lot more efficient if we had a way of calculating our error to help us move towards improvement faster. Introducing...

### Mean Absolute Error

MAE is one possible 'loss function' for machine learning. It measures the error between predictions and actual values, and is calculated as the sum of absolute errors divided by the sample size:


$\mathit{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_{true}^{(i)} - y_{pred}^{(i)}|$

As is often the case in ML, this looks far simpler in code:

In [None]:
def mae(preds, acts):
  return (torch.abs(preds-acts)).mean()
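
A tiny worked example, using made-up predictions and actual values, shows the arithmetic behind the formula:

```python
import torch

def mae(preds, acts):
    return torch.abs(preds - acts).mean()

example_preds = torch.tensor([1.0, 2.0, 3.0])
example_acts = torch.tensor([2.0, 2.0, 5.0])

# Absolute errors are 1, 0 and 2, so the mean is (1 + 0 + 2) / 3 = 1.0
example_loss = mae(example_preds, example_acts)
print(example_loss)
```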

A useful feature of Colab or Jupyter notebooks is appending `??` to a name to view its documentation (and its source code, where available). Let's look at `torch.abs`, part of the PyTorch library.

In [None]:
torch.abs??

In [None]:
@interact(a=1.1, b=1.1, c=1.1)
def plot_quad(a, b, c):
    f = make_quad(a,b,c)
    plt.scatter(x,y)
    loss = mae(f(x), y)
    plot_function(f, ylim=(-3,12), title=f"MAE: {loss:.2f}")

### Gradient descent

If we know the gradient of our MAE function with respect to `a`, `b`, and `c`, then we can work out how adjusting any one of the parameters will change the value of the MAE function.

For example, if `a` has a negative gradient, increasing the value of `a` slightly will decrease our MAE loss, which of course we want to make as low as possible.

We therefore need a function that takes `a`, `b`, and `c` as a vector input, and returns the MAE based on those parameters.

In [None]:
def quad_mae(params):
    f = make_quad(*params)
    return mae(f(x), y)

The result should be the same result MAE gave us above for the first plot.

In [None]:
quad_mae([1.1, 1.1, 1.1])

Let's try it on basic initialized values.

In [None]:
abc = torch.tensor([1.1,1.1,1.1])

To have PyTorch track gradients for these parameters, we call `requires_grad_()`.

In [None]:
abc.requires_grad_()

In [None]:
loss = quad_mae(abc)
loss

Backward pass:

Output layer's gradient:
\begin{equation}
\frac{\partial \text{loss}}{\partial Y} = 2(Y - \hat{y})
\end{equation}

Propagate the gradient backwards. Here $Y_{l-1}$ is the input to layer $l$ and $Y_l = g(W_l \cdot Y_{l-1})$ is its output, so $Y_0 = X$ and $Y_L = Y$:
\begin{align*}
\frac{\partial \text{loss}}{\partial W_l} &= \left(\frac{\partial \text{loss}}{\partial Y_l} \odot g'(W_l \cdot Y_{l-1})\right) \cdot Y_{l-1}^T \\
\frac{\partial \text{loss}}{\partial Y_{l-1}} &= W_l^T \cdot \left(\frac{\partial \text{loss}}{\partial Y_l} \odot g'(W_l \cdot Y_{l-1})\right)
\end{align*}

where $\odot$ denotes the element-wise (Hadamard) product and $g'$ is the derivative of the activation function $g$.

Update the weights:
\begin{equation}
W_l \leftarrow W_l - \eta \cdot \frac{\partial \text{loss}}{\partial W_l}
\end{equation}

where $\eta$ is the learning rate.

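In [None]:
"""
Let's say your network has L layers,
where each layer has an activation function g and a weight matrix W.
The input to the network is X, and the output is Y.
The layer sizes, tanh activation and random data below are illustrative choices.
"""
L = 2
g = torch.tanh
g_prime = lambda z: 1 - torch.tanh(z) ** 2   # derivative of tanh
W = [torch.randn(4, 3), torch.randn(2, 4)]   # one weight matrix per layer
X = torch.randn(3, 1)                        # input (column vector)
y_hat = torch.randn(2, 1)                    # target
learning_rate = 0.01

# Forward pass, remembering each layer's input and pre-activation
Y = X
layer_inputs, pre_acts = [], []
for l in range(L):
    layer_inputs.append(Y)
    pre_acts.append(W[l] @ Y)
    Y = g(pre_acts[l])

# Calculate the loss, e.g. squared error, to minimize
# (named sq_error so the `loss` from quad_mae above is left untouched)
sq_error = ((Y - y_hat) ** 2).sum()

# Backward pass: calculate the output layer's gradient
dY = 2 * (Y - y_hat)

# Propagate the gradient backwards, for each layer in reverse order
for l in reversed(range(L)):
    delta = dY * g_prime(pre_acts[l])   # g'(z) evaluated at the layer's pre-activation
    dW = delta @ layer_inputs[l].T      # gradient of the loss w.r.t. this layer's weights
    dY = W[l].T @ delta                 # gradient of the loss w.r.t. the previous layer's output
    W[l] = W[l] - learning_rate * dW    # update the weights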

In reality, ML frameworks make this a lot easier.

Calculating the gradients from our `quad_mae` loss above in PyTorch is straightforward, using `backward()`.

In [None]:
loss.backward()

The gradients are accessible via `.grad`.

In [None]:
abc.grad
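
To see what the sign of a gradient tells us, here is a self-contained sketch with a hypothetical one-parameter loss (`toy_loss` is not part of the notebook's model). The gradient is negative at the starting point, so a small step against it increases the parameter and lowers the loss.

```python
import torch

# Hypothetical one-parameter loss with its minimum at w = 3
def toy_loss(w):
    return (w - 3.0) ** 2

w = torch.tensor(1.0, requires_grad=True)
toy_loss(w).backward()
print(w.grad)  # negative: the loss falls as w increases

# Take one small step against the gradient and compare losses
with torch.no_grad():
    w_stepped = w - 0.1 * w.grad
print(toy_loss(w).item(), toy_loss(w_stepped).item())
```

The same idea, applied to all three parameters at once, is what the update in the next cell does.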

In [None]:
with torch.no_grad():
    abc -= abc.grad*0.01
    loss = quad_mae(abc)

print(f'loss={loss:.2f}')

In [None]:
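abc.grad.zero_()  # clear the gradient left over from the earlier backward() call
for i in range(10):
    loss = quad_mae(abc)
    loss.backward()
    with torch.no_grad():
        abc -= abc.grad*0.01
        abc.grad.zero_()  # reset, otherwise gradients accumulate across steps
    print(f'step={i}; loss={loss:.2f}')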

Here are all the steps together.

**Forward pass**:

**Define the network**:
\begin{align*}
Y_0 &= X \\
Y_l &= g(W_l \cdot Y_{l-1}) \quad \text{for } l = 1, \dots, L \\
Y &= Y_L
\end{align*}

**Calculate the loss**:
\begin{equation}
\text{loss} = || Y - \hat{y} ||^2
\end{equation}

**Backward pass**:

**Output layer's gradient**:
\begin{equation}
\frac{\partial \text{loss}}{\partial Y} = 2(Y - \hat{y})
\end{equation}

**Propagate the gradient backwards**:
\begin{align*}
\frac{\partial \text{loss}}{\partial W_l} &= \left(\frac{\partial \text{loss}}{\partial Y_l} \odot g'(W_l \cdot Y_{l-1})\right) \cdot Y_{l-1}^T \\
\frac{\partial \text{loss}}{\partial Y_{l-1}} &= W_l^T \cdot \left(\frac{\partial \text{loss}}{\partial Y_l} \odot g'(W_l \cdot Y_{l-1})\right)
\end{align*}

**where** $\odot$ denotes the element-wise (Hadamard) product and $g'$ is the derivative of the activation function $g$.

**Update the weights**:
\begin{equation}
W_l \leftarrow W_l - \eta \cdot \frac{\partial \text{loss}}{\partial W_l}
\end{equation}

where $\eta$ is the learning rate.
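
As a sanity check, a hand-derived gradient for a single layer can be compared with what autograd computes. This is a minimal sketch under assumed choices (one layer, `tanh` as the activation, a squared-error loss, random data); all names here are illustrative.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3, requires_grad=True)  # weights of a single layer
X0 = torch.randn(3, 1)                      # input as a column vector
target = torch.randn(4, 1)

# Forward pass with g = tanh and a squared-error loss
Z1 = W1 @ X0
Y1 = torch.tanh(Z1)
sq_loss = ((Y1 - target) ** 2).sum()
sq_loss.backward()                          # autograd gradient lands in W1.grad

# Hand-derived gradient: (dloss/dY * g'(Z)) times the layer input transposed
with torch.no_grad():
    dY1 = 2 * (Y1 - target)
    delta1 = dY1 * (1 - torch.tanh(Z1) ** 2)  # tanh'(z) = 1 - tanh(z)^2
    manual_dW1 = delta1 @ X0.T

print(torch.allclose(manual_dW1, W1.grad, atol=1e-6))
```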

### From quadratic functions to approximating any computable function

The two core building blocks that allow neural networks to classify images, generate text, and translate languages are matrix multiplication and activation functions.

### Matrix multiplication

Matrix multiplication, AKA matmul, computes the weighted sum of multiple inputs to produce the output at each neuron.

The weight matrix, which contains the connection weights between neurons in adjacent layers, is multiplied by the input vector, which contains the values fed into the current layer of neurons, to create an output vector.

See [matrixmultiplication.xyz](http://matrixmultiplication.xyz) for a helpful visualization of this operation.
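
A small sketch with made-up numbers shows the weighted sums directly. The layer here is hypothetical: three inputs feeding two neurons, with one row of weights per neuron.

```python
import torch

# Hypothetical layer with 3 inputs and 2 neurons: one row of weights per neuron
layer_weights = torch.tensor([[1.0, 2.0, 3.0],
                              [0.0, 1.0, 0.0]])
layer_input = torch.tensor([1.0, 0.0, 2.0])

# Each output is the weighted sum of all inputs for one neuron
layer_output = layer_weights @ layer_input
print(layer_output)  # 1*1 + 2*0 + 3*2 = 7 and 0*1 + 1*0 + 0*2 = 0
```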

### Activation function

Activations introduce non-linearity, which allows neural networks to learn complex patterns in data. Without non-linear activations, a stack of layers collapses into a single linear transformation, so the network is essentially just a linear regression model.

Some activations, such as sigmoid and tanh, also constrain the range of outputs by squashing inputs to a range like 0 to 1 or -1 to 1. Bounding neuron outputs in this way can benefit model training.
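
The point about non-linearity can be checked numerically. In this sketch, with small hand-picked matrices chosen only for illustration, two layers without an activation give the same result as one combined matrix, while putting a ReLU between them does not.

```python
import torch

A1 = torch.tensor([[1.0, -1.0],
                   [-1.0, 1.0]])   # first layer's weights
A2 = torch.tensor([[1.0, 1.0]])    # second layer's weights
v = torch.tensor([1.0, 0.0])       # input

# Two linear layers with no activation equal one layer with weights A2 @ A1
stacked = A2 @ (A1 @ v)
collapsed = (A2 @ A1) @ v
print(stacked, collapsed)

# With a ReLU in between, the negative value is zeroed and the result changes
with_relu = A2 @ torch.relu(A1 @ v)
print(with_relu)
```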

Here is one of the most commonly used activation functions, the Rectified Linear Unit. This function simply replaces all negative numbers with zero.

In [None]:
def rectified_linear(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

In [None]:
plot_function(partial(rectified_linear, 1,1))

In [None]:
@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m,b), ylim=(-1,4))

In [None]:
def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))

While there are many pre-processing techniques depending on whether a model is to be trained on images, or text etc, and many different model architectures, most boil down to inventive and efficient combinations of matrix multiplications and activation functions.

### A beginner neural network

Let's look at a simple neural net written from scratch. Again, major thanks to Jeremy Howard for this [example](https://pytorch.org/tutorials/beginner/nn_tutorial.html) on PyTorch.org.

Models need data, and we will use the classic MNIST dataset, which comprises grayscale images of handwritten digits from 0 to 9. This is a very common starting point in ML.

In [None]:
from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

URL = "https://github.com/pytorch/tutorials/raw/main/_static/"
FILENAME = "mnist.pkl.gz"

if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    (PATH / FILENAME).write_bytes(content)

In [None]:
import pickle
import gzip

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

The images are stored as a flattened row, 784 in length. To display an image, we can reshape it to 2d.


In [None]:
from matplotlib import pyplot
import numpy as np

pyplot.imshow(x_train[18].reshape((28, 28)), cmap="gray")
# ``pyplot.show()`` only if not on Colab
try:
    import google.colab
except ImportError:
    pyplot.show()
print(x_train.shape)

In [None]:
type(x_train)

Since the images are stored as NumPy arrays, we convert them to PyTorch tensors.

In [None]:
import torch

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
n, c = x_train.shape
print(x_train, y_train)
print(x_train.shape)
print(y_train.min(), y_train.max())

### Neural network from tensor operations

We create the weight tensor using *Xavier initialisation*, which here means multiplying the random values by $1/\sqrt{n}$, where $n$ is the number of inputs. There are many approaches to initializing weights; generally we want to start with small values that we can update throughout the training cycle.

Calling `requires_grad_()` here tells PyTorch that the weights require a gradient. PyTorch will record the operations carried out on the tensor, enabling automatic gradient calculation during backpropagation.

In [None]:
import math

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
bias = torch.zeros(10, requires_grad=True)
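
To get a feel for why the $1/\sqrt{n}$ scaling helps, the sketch below compares the spread of a layer's outputs with and without it. The inputs are random stand-ins with unit variance (an assumption; real MNIST pixels are distributed differently), and the variable names are illustrative.

```python
import math
import torch

torch.manual_seed(0)
n_in = 784
fake_inputs = torch.randn(1000, n_in)  # stand-in inputs with unit variance

unscaled = torch.randn(n_in, 10)
scaled = unscaled / math.sqrt(n_in)

# Each output sums n_in products, so its std grows like sqrt(n_in) unless rescaled
unscaled_std = (fake_inputs @ unscaled).std()
scaled_std = (fake_inputs @ scaled).std()
print(unscaled_std)  # roughly sqrt(784) = 28
print(scaled_std)    # roughly 1
```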

### Activation function

We use `log_softmax`, which converts the prediction scores for each class into log probabilities. Exponentiating them gives probabilities that sum to 1 (eg probability this is a '3': 0.6; a '4': 0.3; an '8': 0.1).

This typically would only be done for the output layer of the neural network, however since we have a network of one 'layer' (essentially a function in this case), it suffices.

In [None]:
def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)
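
A quick check on made-up scores confirms two things: exponentiating the outputs gives rows that sum to 1, and the hand-written function agrees with PyTorch's built-in `F.log_softmax`.

```python
import torch
import torch.nn.functional as F

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

scores = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 0.5]])
log_probs = log_softmax(scores)

print(log_probs.exp().sum(-1))   # each row of probabilities sums to 1
print(torch.allclose(log_probs, F.log_softmax(scores, dim=-1)))
```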

### Model

Our model is somewhat simpler than most current ML architectures. We use matrix multiplication and the `log_softmax` activation function.

`@` is Python's matrix multiplication operator, so `xb @ weights` multiplies `xb` (a batch of training examples) by the weights.

In [None]:
def model(xb):
    return log_softmax(xb @ weights + bias)


### Feed forward neural net

We now have a feed forward neural network. The results will not be accurate yet, since we start with random weights and have not taken any optimization steps.

In [None]:
bs = 64  # batch size
xb = x_train[0:bs]  # a mini-batch from x
preds = model(xb)  # predictions
preds[0], preds.shape
print(preds[0], preds.shape)

At this stage, we have a matrix multiplication to multiply inputs by weights. We add a bias. Then `log_softmax` converts the resulting scores into log probabilities for each class.

We just need a loss function to work out how wrong our predictions are.

Here's negative log-likelihood, another approach to loss functions.

$\mathit{L}(\hat{y}, y) = -\log(\hat{y}_y)$

In [None]:
def neg_loss_likelihood(input, target):
    return -input[range(target.shape[0]), target].mean()

loss_func = neg_loss_likelihood

In [None]:
yb = y_train[0:bs]
print(loss_func(preds, yb))

In [None]:
def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()

In [None]:
print(accuracy(preds, yb))

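We can now run a training loop. For each iteration, we will:

- select a mini-batch of data (of size `bs`)
- use the model to make predictions
- calculate the loss
- call `loss.backward()` to compute the gradients of the loss with respect to the model's parameters, in this case `weights` and `bias`
- use those gradients to update `weights` and `bias`, then reset the gradients to zero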

In [None]:
from IPython.core.debugger import set_trace

lr = 0.02  # learning rate
epochs = 4  # how many epochs to train for


for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        #         set_trace()
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        with torch.no_grad():
            weights -= weights.grad * lr
            bias -= bias.grad * lr
            weights.grad.zero_()
            bias.grad.zero_()

In [None]:
print(loss_func(model(xb), yb), accuracy(model(xb), yb))

In [None]:
import torch.nn.functional as F

loss_func = F.cross_entropy

def model(xb):
    return xb @ weights + bias

In [None]:
print(loss_func(model(xb), yb), accuracy(model(xb), yb))
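
`F.cross_entropy` combines the log-softmax and negative log-likelihood steps, which is why the model above now returns raw scores. The sketch below checks that equivalence on random stand-in data; the names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)             # raw scores for 4 examples, 10 classes
targets = torch.tensor([3, 0, 9, 1])

# Manual route: log-softmax, then pick out the log-probability of the true class
manual = -F.log_softmax(logits, dim=1)[range(4), targets].mean()
# Built-in route: cross_entropy takes the raw scores directly
builtin = F.cross_entropy(logits, targets)

print(torch.allclose(manual, builtin))
```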

### Now for the easy way

Let's use the features of PyTorch to train an MNIST model.



### Hyperparameters

Hyperparameters govern some of the behaviour of a neural network. Tuning hyperparameters is a key concern for a machine learning workload. Typically engineers will run several experiments to find the best combination of hyperparameters on subsets of data before full (and more expensive) training runs.

Typically in model training notebooks you would see hyperparameters defined as constants near the top, much like environment variables:

In [None]:
# EPOCHS = 5 # How many times the entire training dataset is passed through the net

# BATCH_SIZE = 64 # No. of data samples processed at once during training
# LR = 0.001 # Small value used in weight updates based on calculated gradients
# SEED = 1 # Random seed for network initialization. Ensures reproducible results
# LOG_INTERVAL = 100 # Interval at which training results are logged

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch.profiler import profile, record_function, ProfilerActivity

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [None]:
# Load MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=True)

In [None]:
# Access and view elements from the DataLoader
for batch in train_loader:
    images, labels = batch
    # You can now work with the batch of images and labels
    # For example, printing the shape of the batch
    print("Batch of images shape:", images.shape)
    print("Batch of labels shape:", labels.shape)
    image_tensor = images[:5]
    print("Image tensor shape:", image_tensor.shape)
    break  # Stop after processing the first batch

In [None]:
examples = enumerate(test_loader)
batch_idx, (example_data, example_targets) = next(examples)

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()
for i in range(6):
  plt.subplot(2,3,i+1)
  plt.tight_layout()
  plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
  plt.title("Ground Truth: {}".format(example_targets[i]))
  plt.xticks([])
  plt.yticks([])
fig

In [None]:
# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNN().to(device)


criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

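# Train the model and profile with PyTorch Profiler.
# CUDA activity is only recorded when a GPU is available; the deprecated `use_cuda` flag is dropped.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
with profile(activities=activities, record_shapes=True) as prof: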
    for epoch in range(5):
        total_loss = 0.0
        correct = 0
        total = 0

        for batch_idx, (inputs, labels) in enumerate(train_loader):
            optimizer.zero_grad()
            with record_function("model_inference"):
              inputs = inputs.to(device)
              labels = labels.to(device)
              outputs = model(inputs)
              loss = criterion(outputs, labels)
              loss.backward()
              optimizer.step()

            total_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

        # Print results at the end of each epoch
        avg_loss = total_loss / len(train_loader)
        accuracy = 100 * correct / total
        print(f"Epoch {epoch + 1}/{5}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%")

torch.save(model.state_dict(), 'mnist_model.pth')
# Save profiler results
prof.export_chrome_trace("profile_results.json")


### Visualizing the model

In [None]:
from torchsummary import summary

summary(model, (1, 28, 28))  # Assuming input image size is (1, 28, 28)
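
The parameter count reported by a summary can also be worked out by hand. This sketch rebuilds the same layer sizes as `SimpleNN` with `nn.Sequential` (a stand-in, so it runs on its own) and compares the count with the arithmetic.

```python
import torch.nn as nn

# Same layer sizes as SimpleNN: 784 -> 128 -> 10
sketch_net = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

n_params = sum(p.numel() for p in sketch_net.parameters())
by_hand = (784 * 128 + 128) + (128 * 10 + 10)   # weights + biases per layer
print(n_params, by_hand)
```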

Better still, we can use the `torchviz` library to visualize the various building blocks of the network.

In [None]:
from torchviz import make_dot

# Same size as input data, placed on the same device as the model
dummy_input = torch.randn(1, 1, 28, 28).to(device)

graph = make_dot(model(dummy_input), params=dict(model.named_parameters()))
graph.render("CNNModel", format="png", cleanup=True)

In [None]:
from IPython.display import Image, display

# Display the image in the notebook
image_path = "CNNModel.png"
display(Image(filename=image_path))

In [None]:
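# The file holds a state_dict (saved above), so rebuild the network and load the weights into it
model = SimpleNN().to(device)
model.load_state_dict(torch.load("mnist_model.pth"))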

# with torch.no_grad():
#   output = model(example_data)

### Interpreting the model

In [None]:
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define the test dataset and DataLoader
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Load the trained model
model = SimpleNN()
model.load_state_dict(torch.load('mnist_model.pth'))  # Replace with the actual path to your trained model file

# Set the model to evaluation mode
model.eval()

# Use the model to make predictions on the test set
correct = 0
total = 0

with profile(activities=[
        ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
        with torch.no_grad():
          for inputs, labels in test_loader:
            with record_function("model_inference"):
              outputs = model(inputs)
              _, predicted = outputs.max(1)
              total += labels.size(0)
              correct += predicted.eq(labels).sum().item()

accuracy = 100 * correct / total
print(f'Test Accuracy: {accuracy:.2f}%')


In [None]:
import shap

# Select a few samples from the MNIST dataset for interpretation
batch_size = 128
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

# since shuffle=True, this is a random sample of test data
batch = next(iter(test_loader))
images, _ = batch

background = images[:100]
test_images = images[100:103]

e = shap.DeepExplainer(model, background)
shap_values = e.shap_values(test_images)

In [None]:
# Inference above ran on the CPU, so sort by CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))


In [None]:
shap_numpy = [np.swapaxes(np.swapaxes(s, 1, -1), 1, 2) for s in shap_values]
test_numpy = np.swapaxes(np.swapaxes(test_images.numpy(), 1, -1), 1, 2)

# plot the feature attributions
shap.image_plot(shap_numpy, -test_numpy)

### Matmul and accelerators

In [None]:
image_tensor = x_train[100:110]

In [None]:
image_tensor.shape

In [None]:
torch.manual_seed(1)
weights = torch.randn(784, 10)
bias = torch.zeros(10)

In [None]:
a = image_tensor
b = weights
a.shape, b.shape

In [None]:
# a rows, a columns
ar, ac = a.shape
# b rows, b columns
br, bc = b.shape

(ar, ac), (br, bc)


In [None]:
t1 = torch.zeros(ar, bc)
t1.shape

In [None]:
t1

In [None]:
# Check for CUDA
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# # Set seed for reproducibility
# torch.manual_seed(SEED)

# if device == "cuda":
#     torch.cuda.manual_seed(SEED)

# print(f"Performing computations on {device}")

### Matmul on CPU

In [None]:
def matmul_simple(a, b):
  (ar,ac),(br,bc) = a.shape,b.shape
  t1 = torch.zeros(ar, bc)
  for i in range(ar):
    for j in range(bc):
      for k in range(ac):
        t1[i][j] += a[i][k] * b[k][j]

  return t1

In [None]:
%timeit matmul_simple(a, b) # around 1.78s on CPU
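
Before timing alternatives, it is worth confirming that the triple loop really computes a matrix product. This sketch restates the loop under a different name (`matmul_loops`, an illustrative choice) and compares it with PyTorch's `@` operator on small random matrices.

```python
import torch

def matmul_loops(m1, m2):
    rows, inner = m1.shape
    _, cols = m2.shape
    out = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i, j] += m1[i, k] * m2[k, j]
    return out

torch.manual_seed(0)
small_a = torch.randn(3, 4)
small_b = torch.randn(4, 2)

# The triple loop and the built-in operator should agree up to rounding
print(torch.allclose(matmul_loops(small_a, small_b), small_a @ small_b, atol=1e-5))
```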

### Numba / CUDA

In [None]:
import numba as nb
from numba import njit
import numpy as np

a_np = a.numpy()
b_np = b.numpy()

@nb.jit(nopython=True)
def matmul_numba(a, b):
  # Use the shapes of the arguments, not the global arrays
  ar,ac = a.shape
  br,bc = b.shape
  t1 = np.zeros((ar, bc))
  for i in range(ar):
    for j in range(bc):
      dot_product = 0.0
      for k in range(ac):
        dot_product += a[i][k] * b[k][j]
      t1[i][j] = dot_product
  return t1

In [None]:
%timeit matmul_numba(a_np, b_np)

In [None]:
# 50,000 matmul at 114 microseconds
0.000114 * 50000

In [None]:
# 50,000 matmuls at 1.57 seconds each (total in minutes)
1.57 * 50000 / 60
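
Putting the two calculations side by side makes the gap clearer. The sketch below reuses the per-matmul timings quoted in the cells above (114 microseconds with Numba, 1.57 seconds for the Python loops); your own measurements will differ by machine.

```python
n_matmuls = 50_000
numba_seconds = 0.000114   # 114 microseconds per matmul, from the timing above
python_seconds = 1.57      # seconds per matmul for the pure-Python loops

numba_total = numba_seconds * n_matmuls              # total seconds
python_total_min = python_seconds * n_matmuls / 60   # total minutes
speedup = python_seconds / numba_seconds

print(f"Numba: {numba_total:.1f} s")
print(f"Python loops: {python_total_min:.0f} min")
print(f"Speedup: about {speedup:,.0f}x")
```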