# Introduction

This Colab notebook serves as a personal workbook to guide you (and me) through coding micrograd from scratch.

I like to try to code some of Andrej Karpathy's 'Zero-to-Hero' projects from scratch every now and again.

By following this workbook, you can regularly practice and internalize the process of building micrograd from the ground up.

Each time you start a new Colab, simply copy over the instructions cell and build upon it.

[Go to micrograd](https://github.com/karpathy/micrograd)

# Instructions

In [3]:

# ------------------------------------

# There are 3 'phases' to coding micrograd from scratch

# 1. Value class (micrograd engine)
# 2. Neural Network
# 3. Training


# 1. Value class.

# __init__
# __add__
# __repr__
# __radd__
# __mul__
# __rmul__
# __pow__
# __neg__
# __sub__
# __rsub__
# __truediv__
# __rtruediv__

# relu
# tanh

# backward() - topological sort

# Compare to PyTorch with micrograd/test/test_engine.py

# ------------------------------------


# 2. Neural Network Library

# Module
# Neuron
# Layer
# MLP


# ------------------------------------

# 3. Training

# test_mlp()
# test_train()

# Initiate model
# Test model with one batch
# Test data (xs, ys)
# Training loop
# squared-error loss (summed over the batch)

# Micrograd Engine

In [4]:
import math

class Value:

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0

        self._prev = set(_children)
        self._backward = lambda: None
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"

        out = Value(self.data**other, (self,), f'**{other}' )

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out

    def relu(self):
        out = Value(0.0 if self.data <0 else self.data, (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def tanh(self):
        x = self.data
        t = (math.exp(2*x)-1)/(math.exp(2*x)+1)
        out = Value(t, (self,), 'tanh')

        def _backward():
            self.grad += (1-t**2) * out.grad
        out._backward = _backward

        return out

    def backward(self):

        topo = []
        visited = set()

        def dfs(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    dfs(child)
                topo.append(v)
        dfs(self)

        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __rsub__(self, other):
        return other + (-self)

    def __truediv__(self, other):
        return self * (other**-1)

    def __rtruediv__(self, other):
        # other / self, reached when the left operand is a plain number
        return other * (self**-1)


In [5]:
a = Value(5.0)
b = Value(3.0)
c = a.tanh()
print(c)
c.grad=1.0
c._backward()
print(a.grad)



Value(data=0.9999092042625951, grad=0.0)
0.0001815832309438603
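
The gradient printed above can be cross-checked against the closed-form derivative of tanh, which is 1 - tanh(x)^2. The snippet below is a minimal sketch that uses only the standard `math` module rather than the `Value` class, so it serves as an independent check.

```python
import math

x = 5.0
t = math.tanh(x)      # forward value, same quantity as c.data above
grad = 1 - t**2       # closed-form derivative of tanh at x

print(t)
print(grad)
```

Both printed numbers should agree with the cell output to within floating-point rounding.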


# Compare micrograd with PyTorch (simple graph)

In [6]:
x1 = Value(2.0)
x2 = Value(0.0)
w1 = Value(-3.0)
w2 = Value(1.0)
b = Value(6.8813735870195432)

x1w1 = x1*w1
x2w2 = x2*w2
x1w1x2w2 = x1w1 + x2w2
n = x1w1x2w2 + b
o = n.tanh()
print(o)
o.backward()


print(x2.grad)


Value(data=0.7071067811865476, grad=0.0)
0.4999999999999999


In [7]:
import torch

In [8]:

x1 = torch.tensor([2.0], dtype=torch.double) ; x1.requires_grad=True
x2 = torch.tensor([0.0], dtype=torch.double) ; x2.requires_grad=True
w1 = torch.tensor([-3.0], dtype=torch.double) ; w1.requires_grad=True
w2 = torch.tensor([1.0], dtype=torch.double) ; w2.requires_grad=True
b = torch.tensor([6.8813735870195432], dtype=torch.double) ; b.requires_grad=True

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)
print(o.item())
o.backward()

print(x2.grad)

0.7071067811865476
tensor([0.5000], dtype=torch.float64)
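
Both engines report a gradient of about 0.5 for `x2`. A third, autograd-free check is a central finite difference on the same expression. The helper names `forward` and `numerical_grad_x2` and the step size `h` below are illustrative assumptions, not part of micrograd or PyTorch.

```python
import math

def forward(x1, x2, w1, w2, b):
    # same expression as the graph above: o = tanh(x1*w1 + x2*w2 + b)
    return math.tanh(x1 * w1 + x2 * w2 + b)

def numerical_grad_x2(x1, x2, w1, w2, b, h=1e-6):
    # central difference: (f(x2 + h) - f(x2 - h)) / (2h)
    up = forward(x1, x2 + h, w1, w2, b)
    down = forward(x1, x2 - h, w1, w2, b)
    return (up - down) / (2 * h)

args = dict(x1=2.0, x2=0.0, w1=-3.0, w2=1.0, b=6.8813735870195432)
o = forward(**args)
g = numerical_grad_x2(**args)
print(o)
print(g)
```

The finite difference is only an approximation, so expect agreement to several decimal places rather than exact equality.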


# Neural Network Library

In [32]:
import random

class Module:

    def parameters(self):
        return []

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0

class Neuron(Module):

    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)

    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]


class Layer(Module):

    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs)== 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]


class MLP(Module):

    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]



# Training Loop

In [51]:
x = [ 2.0, 3.0, -1.0]

n = MLP(3, [4, 4, 1])

out = n(x)

print(out)

Value(data=0.6569427405516585, grad=0.0)
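
As a sanity check on the model size, the number of parameters can be computed from the layer sizes alone: each neuron has one weight per input plus one bias. The function name `count_parameters` below is a hypothetical helper, not part of the notebook's classes.

```python
def count_parameters(nin, nouts):
    # every neuron in a layer has (inputs + 1 bias) parameters
    sizes = [nin] + nouts
    return sum((sizes[i] + 1) * sizes[i + 1] for i in range(len(nouts)))

total = count_parameters(3, [4, 4, 1])
print(total)  # (3+1)*4 + (4+1)*4 + (4+1)*1 = 41
```

With the classes defined above, `len(n.parameters())` should give the same number for `MLP(3, [4, 4, 1])`.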


In [52]:
xs = [[ 2.0, 3.0, -1.0],
      [ 1.0, -5.0, 1.0],
      [ -2.0, 2.0, -1.0],
      [ 4.0, 0.0, -2.0]]

ys = [1.0, -1.0, -1.0, 1.0]



In [54]:
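for i in range(200):

    # forward pass: predictions for every input in the batch
    ypred = [n(x) for x in xs]

    # sum of squared errors over the batch
    loss = sum((ygt - yout)**2 for ygt, yout in zip(ys, ypred))

    if i % 10 == 0:
        print(f"step: {i}, loss: {loss.data}")

    # reset gradients so they do not accumulate across steps
    n.zero_grad()

    # backward pass
    loss.backward()

    # gradient descent update with learning rate 0.1
    for p in n.parameters():
        p.data += -0.1 * p.grad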

step: 0, loss: 0.0074704852292340595
step: 10, loss: 0.005207816272289852
step: 20, loss: 0.004016421118811813
step: 30, loss: 0.003277972254928078
step: 40, loss: 0.002773866493847907
step: 50, loss: 0.0024069798459459735
step: 60, loss: 0.002127504748165722
step: 70, loss: 0.0019072283361321052
step: 80, loss: 0.001728954056751397
step: 90, loss: 0.0015815916238825045
step: 100, loss: 0.0014576608611143716
step: 110, loss: 0.0013519290420868763
step: 120, loss: 0.0012606230678682245
step: 130, loss: 0.0011809519190026786
step: 140, loss: 0.0011108054570466788
step: 150, loss: 0.001048557916129813
step: 160, loss: 0.0009929359005978148
step: 170, loss: 0.0009429274300054031
step: 180, loss: 0.000897717849936193
step: 190, loss: 0.0008566437689616818
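
The loop calls `n.zero_grad()` before every `loss.backward()`. The sketch below illustrates why, using a stripped-down stand-in for `Value` (multiplication only) that is an assumption for this example, not the full class from the engine cell. Without a reset, gradients from earlier passes stay in `grad` and are added to the new ones.

```python
class Value:
    """Stripped-down stand-in for the notebook's Value (multiplication only)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def dfs(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    dfs(child)
                topo.append(v)
        dfs(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


w = Value(2.0)
x = Value(3.0)

# two backward passes without a reset: the second adds to the first
(w * x).backward()
(w * x).backward()
stale = w.grad

# reset, then a single pass gives the gradient of one step
w.grad = 0.0
(w * x).backward()
fresh = w.grad

print(stale, fresh)
```

Here `stale` is twice `fresh`, which is the kind of silent error a missing `zero_grad()` introduces into the parameter update.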


# Notes


## The benefits of learning micrograd

Studying this 'homemade' autograd engine helps in understanding the principles behind production autograd engines such as PyTorch's. These features make the code worth studying:

### 1. **Core Concepts of Autograd:**
   - **Automatic Differentiation:** The engine computes gradients automatically, which is what makes training neural networks by backpropagation practical. Understanding this is the foundation for understanding how frameworks like PyTorch handle gradient computation.
   - **Gradient Accumulation:** Each operation's `_backward` closure adds to the gradients of its inputs with `+=` rather than overwriting them, so a value that feeds into several operations receives the sum of all its contributions.
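
Gradient accumulation is easiest to see when one value is used more than once in an expression. The sketch below uses a reduced stand-in for `Value` (add and multiply only), which is an assumption for this example and not the full class.

```python
class Value:
    """Reduced stand-in for the notebook's Value class (add and mul only)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # += so contributions from multiple uses are summed
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def dfs(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    dfs(child)
                topo.append(v)
        dfs(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


a = Value(3.0)
b = a + a          # a appears twice, so db/da = 2
b.backward()
print(a.grad)

c = Value(3.0)
d = c * c          # d = c^2, so dd/dc = 2c = 6
d.backward()
print(c.grad)
```

If `_backward` used `=` instead of `+=`, the second assignment would overwrite the first and both gradients would come out at half their correct value.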

### 2. **Computational Graph Construction:**
   - **Node Representation:** Each `Value` object is a node in the computational graph. It stores its data, its gradient, and references to the nodes it was computed from (`_prev`). This mirrors how tensors in PyTorch are tracked in the computational graph.
   - **Operation Overloading:** By overloading operators (e.g., `__add__`, `__mul__`, `__pow__`), the code constructs the computational graph dynamically as operations are performed on `Value` instances. This is akin to PyTorch's dynamic graph construction.
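
To make the dynamically built graph visible, one option is to walk `_prev` from the output back to the leaves and collect nodes and edges. The `trace` helper below is hypothetical, and the `Value` class is a reduced stand-in without gradients, shown only to illustrate the graph structure.

```python
class Value:
    """Reduced stand-in: records data, inputs and operation, no gradients."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), "*")


def trace(root):
    # hypothetical helper: collect every node reachable from root,
    # and an edge (input, output) for every dependency
    nodes, edges = set(), set()

    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges


a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c      # the operators build the graph as a side effect

nodes, edges = trace(d)
ops = sorted(n._op for n in nodes if n._op)
print(len(nodes), len(edges), ops)
```

The expression creates five nodes (three leaves, the product and the sum) and four edges, without any explicit graph-building call.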

### 3. **Backward Propagation:**
   - **Topological Sorting:** The `backward` method topologically sorts the graph and then walks it in reverse, so each node's `_backward` runs only after the gradient flowing into that node is complete. More complex frameworks structure their backward passes around the same ordering constraint.
   - **Chain Rule Application:** Each `_backward` closure multiplies the local derivative of its operation by `out.grad`, the gradient of the output. This is the chain rule applied one operation at a time.
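
The ordering guarantee can be seen in isolation with the same depth-first search applied to a plain dictionary. The graph, its node names, and the `topo_sort` function below are made up for illustration.

```python
def topo_sort(root, inputs):
    """Return nodes so that each node appears after all of its inputs."""
    order, visited = [], set()

    def dfs(node):
        if node not in visited:
            visited.add(node)
            for child in inputs.get(node, ()):
                dfs(child)
            order.append(node)   # appended only after all inputs
    dfs(root)
    return order


# hypothetical graph: L = d * f, d = e + c, e = a * b
inputs = {"L": ["d", "f"], "d": ["e", "c"], "e": ["a", "b"]}

order = topo_sort("L", inputs)
backward_order = list(reversed(order))
print(order)
print(backward_order)

# in the backward order, every node comes before the nodes it was built from
pos = {node: i for i, node in enumerate(backward_order)}
ok = all(pos[out] < pos[inp] for out, ins in inputs.items() for inp in ins)
print(ok)
```

Walking `backward_order` starts at `L` and reaches `e` only after `d`, so by the time `e` passes gradient on to `a` and `b`, its own gradient is final.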

### 4. **Extensibility and Debugging:**
   - **Custom Operations:** The operations included here (addition, multiplication, power, ReLU, tanh) all follow the same pattern of a forward computation paired with a gradient rule, which serves as a template for extending the engine with more functions.
   - **Debugging and Visualization:** The `_op` and `_prev` attributes record which operation produced each node and from which inputs. That is enough to print or draw the computational graph when troubleshooting the autograd system.
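
As an example of extending the engine, the sketch below adds an `exp` method following the same pattern as `tanh`. The method is not part of the class defined in this notebook, and the `Value` shown here is a reduced stand-in that contains only what the example needs.

```python
import math


class Value:
    """Reduced stand-in for the notebook's Value, with one added operation."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None
        self._op = _op

    def exp(self):
        # forward: compute the result and record the input and the op
        out = Value(math.exp(self.data), (self,), "exp")

        def _backward():
            # d/dx exp(x) = exp(x), which is already stored in out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out


x = Value(1.0)
y = x.exp()
y.grad = 1.0       # seed the output gradient, as in the tanh cell above
y._backward()
print(y.data, x.grad)
```

Both printed numbers should be close to `math.e`, since the derivative of exp at 1 equals its value there.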

### 5. **Simplicity and Clarity:**
   - **Minimalist Design:** The engine fits in a single class and depends only on Python's `math` module, which makes it an excellent educational tool. It distills the core ideas of an autograd engine without the complexity of a full-fledged framework.
   - **Clear and Readable:** Every operation follows the same pattern (compute `out`, define `_backward`, attach it), which keeps the code accessible to learners who are new to automatic differentiation and computational graphs.

### 6. **Practical Understanding:**
   - **Hands-On Experience:** Implementing and modifying this code yourself deepens your understanding of how autograd systems work in practice, and bridges the gap between the theory of backpropagation and working code.