# MetalNeedle Tutorial
First, follow the instructions in the repo's [README](https://github.com/wenjunsun/dlsys-needle-m1/blob/main/README.md) to set up the environment.

In [19]:
import needle as ndl
import numpy as np

To run this tutorial you need a Mac with an M1 chip. Check whether your machine meets this requirement with the code below.

In [20]:
ndl.m1().enabled()

True

## Simple operations and automatic differentiation
MetalNeedle supports the basic tensor operations found in other automatic differentiation tools such as PyTorch and TensorFlow.

Set the `device` argument to `ndl.m1()` to run these operations on the M1 GPU.

In [22]:
a = ndl.Tensor(np.array([1, 2, 3]), device=ndl.m1(), requires_grad=True)
b = ndl.Tensor(np.array([3, 2, 1]), device=ndl.m1(), requires_grad=True)
c = ndl.Tensor(np.array([2, 2, 2]), device=ndl.m1(), requires_grad=True)
d = ndl.exp((a + b) * c)
e = ndl.relu(d)
f = ndl.summation(e)

The tensor's `backward` method computes the gradients automatically via backpropagation.

In [23]:
f.backward()

In [24]:
a.grad, b.grad, c.grad

(needle.Tensor([5961.9155 5961.9155 5961.9155]),
 needle.Tensor([5961.9155 5961.9155 5961.9155]),
 needle.Tensor([11923.831 11923.831 11923.831]))
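These values can be checked by hand. Because `exp` is always positive, `relu` acts as the identity here, so the gradient of `f` with respect to `a` and `b` is `c * exp((a + b) * c)`, and with respect to `c` it is `(a + b) * exp((a + b) * c)`. The following is a minimal NumPy-only sketch (it does not use needle; `f` and `numerical_grad` are helper names introduced here for illustration) that compares those formulas against central finite differences.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])
c = np.array([2.0, 2.0, 2.0])

def f(a, b, c):
    # Same computation as the needle cell: sum(relu(exp((a + b) * c)))
    return np.maximum(np.exp((a + b) * c), 0.0).sum()

def numerical_grad(fn, args, idx, eps=1e-5):
    # Central finite differences with respect to args[idx] (illustrative helper)
    grad = np.zeros_like(args[idx])
    for i in range(grad.size):
        plus = [x.copy() for x in args]
        minus = [x.copy() for x in args]
        plus[idx][i] += eps
        minus[idx][i] -= eps
        grad[i] = (fn(*plus) - fn(*minus)) / (2 * eps)
    return grad

# Analytic gradients from the chain rule
d = np.exp((a + b) * c)
grad_a = c * d
grad_b = c * d
grad_c = (a + b) * d

for idx, analytic in enumerate([grad_a, grad_b, grad_c]):
    assert np.allclose(analytic, numerical_grad(f, [a, b, c], idx), rtol=1e-6)

print(grad_a)  # each entry is approximately 5961.916
print(grad_c)  # each entry is approximately 11923.832
```

The analytic results agree with the `a.grad`, `b.grad`, and `c.grad` values printed above, up to float32 rounding.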

## Matrix Multiplication
Matrix multiplication is a core operation in machine learning. On the M1 GPU, the demo case below runs over 7x faster than on the CPU.

In [14]:
M, N, K = 1000, 1000, 1000
A_m1 = ndl.Tensor(np.random.randn(M, K), device=ndl.m1())
B_m1 = ndl.Tensor(np.random.randn(K, N), device=ndl.m1())
A_cpu = ndl.Tensor(np.random.randn(M, K), device=ndl.cpu())
B_cpu = ndl.Tensor(np.random.randn(K, N), device=ndl.cpu())

In [15]:
%timeit A_cpu @ B_cpu

71.3 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%timeit A_m1 @ B_m1

9.79 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
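To put these timings in perspective, multiplying an M×K matrix by a K×N matrix takes roughly 2·M·N·K floating-point operations (one multiply and one add per inner-product term). The sketch below plugs the two mean timings reported above into that count. `matmul_flops` and `gflops` are hypothetical helper names, and the timings will differ on other machines.

```python
def matmul_flops(m, n, k):
    # One multiply and one add for each of the m * n * k inner-product terms
    return 2 * m * n * k

def gflops(flops, seconds):
    # Throughput in billions of floating-point operations per second
    return flops / seconds / 1e9

M, N, K = 1000, 1000, 1000
cpu_seconds = 71.3e-3  # mean from `%timeit A_cpu @ B_cpu` above
m1_seconds = 9.79e-3   # mean from `%timeit A_m1 @ B_m1` above

flops = matmul_flops(M, N, K)
speedup = cpu_seconds / m1_seconds

print(f"speedup: {speedup:.2f}x")  # about 7.28x
print(f"cpu: {gflops(flops, cpu_seconds):.1f} GFLOP/s")
print(f"m1:  {gflops(flops, m1_seconds):.1f} GFLOP/s")
```

Note that the CPU and M1 runs above use different random matrices of the same shape, which does not affect the operation count.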


## Train ResNet on CIFAR10
You can implement a ResNet in needle and train it on the CIFAR10 dataset. With M1 GPU acceleration, you can achieve a 10x speedup in training time. The matrix multiplication kernel is currently implemented in a naive way. We believe that a highly optimized implementation, comparable to cuBLAS, could achieve performance competitive with CUDA.
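To illustrate what separates a naive matrix multiplication from an optimized one, here is a CPU-side NumPy sketch. It is not the MetalNeedle kernel, which this tutorial does not show. It assumes "naive" means computing each output element independently with no data reuse, and it contrasts that with tiling (blocking), one common optimization in tuned libraries. The function names and the `tile` parameter are hypothetical.

```python
import numpy as np

def matmul_naive(a, b):
    # One output element at a time: reads a full row of `a` and a full column of `b`
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
    return out

def matmul_tiled(a, b, tile=16):
    # Accumulate the output block by block so each loaded block is reused
    # for a whole tile of outputs instead of a single element
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, p0:p0 + tile] @ b[p0:p0 + tile, j0:j0 + tile]
                )
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((40, 50))
b = rng.standard_normal((50, 30))

# Both variants must agree with NumPy's built-in matmul
assert np.allclose(matmul_naive(a, b), a @ b)
assert np.allclose(matmul_tiled(a, b), a @ b)
```

On a GPU, the same tiling idea is typically applied by staging blocks in fast on-chip memory shared by a group of threads, which reduces traffic to main memory.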

In [26]:
import sys
sys.path.append("./apps")
from models import ResNet9
from simple_training import train_cifar10, evaluate_cifar10

device = ndl.m1()
train_dataset = ndl.data.CIFAR10Dataset("data/cifar-10-batches-py", train=True)
train_dataloader = ndl.data.DataLoader(dataset=train_dataset,
                                       batch_size=128,
                                       shuffle=True,
                                       device=device)
test_dataset = ndl.data.CIFAR10Dataset("data/cifar-10-batches-py", train=False)
test_dataloader = ndl.data.DataLoader(dataset=test_dataset,
                                      batch_size=128,
                                      shuffle=True,
                                      device=device)

model = ResNet9(device=device, dtype="float32")
train_cifar10(model, train_dataloader, n_epochs=10, optimizer=ndl.optim.Adam,
              lr=0.0005, weight_decay=0.001)
evaluate_cifar10(model, test_dataloader)

Epoch: 0, Acc: 0.3538, Loss: 1.80337234375, Time: 394.81s
Epoch: 1, Acc: 0.46514, Loss: 1.48837484375, Time: 394.36s
Epoch: 2, Acc: 0.51546, Loss: 1.3542328125, Time: 396.63s
Epoch: 3, Acc: 0.5532, Loss: 1.2529396875, Time: 397.00s
Epoch: 4, Acc: 0.58366, Loss: 1.1686440625, Time: 395.02s
Epoch: 5, Acc: 0.612, Loss: 1.09461765625, Time: 393.30s
Epoch: 6, Acc: 0.63522, Loss: 1.0276240625, Time: 396.00s
Epoch: 7, Acc: 0.65828, Loss: 0.9641878125, Time: 422.36s
Epoch: 8, Acc: 0.67964, Loss: 0.906412421875, Time: 1413.13s
Epoch: 9, Acc: 0.70216, Loss: 0.85104984375, Time: 405.90s


(0.4764, 1.683678515625)