# PyTorch internals
`Tensor` is the central data structure in PyTorch. Internally, a `torch.Tensor` consists of
1. data
2. metadata
   1. the size of the tensor (`a.shape`)
   2. the type of its elements (`a.dtype`)
   3. the device the tensor lives on (`a.device`)
   4. the layout of the tensor (`a.layout`)
   5. the strides of the tensor (`a.stride()`)
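
As a quick check of the metadata listed above, the following sketch inspects a small dense tensor. The variable name `a` and the shape `(2, 3)` are arbitrary choices for illustration, and the printed dtype assumes PyTorch's default floating-point type has not been changed.

```python
import torch

a = torch.zeros(2, 3)  # a small dense tensor; the shape is arbitrary

print(a.shape)     # size of the tensor
print(a.dtype)     # element type (float32 unless the default was changed)
print(a.device)    # device the data lives on
print(a.layout)    # memory layout (strided for ordinary dense tensors)
print(a.stride())  # strides; note that stride is a method, not an attribute
```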

## Memory structure
The `torch.Tensor` class is backed by a low-level C++ implementation, so we can regard a tensor as a C++ object with multiple member variables and functions.

> A `torch.Tensor` is essentially a more elegant wrapper around a pointer.

By default, tensors use a strided, row-major (contiguous) layout, and the physical memory is always one-dimensional. So, as with pointer arithmetic in C++, the element `a[i, j, k]` is located at offset $i \times stride[0] + j \times stride[1] + k \times stride[2]$ from the start of the tensor's data (plus the tensor's storage offset, which can be nonzero for views).
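
The stride formula can be verified on a small example. This is a minimal sketch: the shape `(2, 3, 4)` and the indices are arbitrary, and `a.flatten()` is used as a stand-in for the underlying one-dimensional memory, which is valid here because `a` is contiguous.

```python
import torch

a = torch.arange(24).reshape(2, 3, 4)  # contiguous, row-major tensor
i, j, k = 1, 2, 3

s = a.stride()  # number of elements to skip per step along each dimension
offset = a.storage_offset() + i * s[0] + j * s[1] + k * s[2]

# Because `a` is contiguous, flatten() lists the elements in memory order.
flat = a.flatten()
assert flat[offset] == a[i, j, k]

# Transposing only swaps the strides; the underlying data is not moved.
t = a.transpose(0, 2)
ts = t.stride()
t_offset = t.storage_offset() + k * ts[0] + j * ts[1] + i * ts[2]
assert flat[t_offset] == t[k, j, i]

print(s, ts)
```

The transposed tensor `t` reads the same memory as `a`, which is why the same flat buffer can be indexed with either set of strides.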

## Low-level tensor operations
At the most abstract level, when you call `torch.mm`, two dispatches happen:
1. The first dispatch is based on the device type and layout of tensors. The device type determines where the computation will be performed (e.g., CPU or GPU), while the layout defines how the tensor data is organized in memory (e.g., strided tensors or sparse tensors). It is a dynamic dispatch mechanism that allows PyTorch to choose the appropriate implementation for the given input tensors.
2. The second dispatch is based on the dtype of the tensors (e.g., float32, float64, int32). It selects the kernel implementation specialized for the element type involved in the operation.
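
To make the two-step idea concrete, here is a toy sketch in pure Python. It is not how PyTorch's dispatcher is implemented: the function `toy_mm_dispatch` and all kernel names are hypothetical, and the lookup tables cover only a few combinations for illustration.

```python
import torch

def toy_mm_dispatch(a: torch.Tensor) -> str:
    """Return the name of a hypothetical kernel a toy dispatcher would pick."""
    # First dispatch: device type and layout select a kernel family.
    families = {
        ("cpu", "torch.strided"): "cpu_dense_mm",
        ("cuda", "torch.strided"): "cuda_dense_mm",
        ("cpu", "torch.sparse_coo"): "cpu_sparse_mm",
    }
    family = families[(a.device.type, str(a.layout))]

    # Second dispatch: dtype selects the specialization within that family.
    specializations = {
        torch.float32: "float",
        torch.float64: "double",
        torch.int32: "int",
    }
    return f"{family}<{specializations[a.dtype]}>"

print(toy_mm_dispatch(torch.ones(2, 2)))
print(toy_mm_dispatch(torch.ones(2, 2, dtype=torch.float64)))
print(toy_mm_dispatch(torch.ones(2, 2).to_sparse()))
```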


<div align="center">
  <img src="https://github.com/rhu2xx/picx-images-hosting/raw/master/20251014/tensor_dispatch.pfqamzmoj.png" alt="tensor_dispatch" width="300"/>
</div>


## Tensor extensions
Besides dense tensors, PyTorch also supports various other tensor types, including XLA tensors, quantized tensors, sparse tensors, and MKL-DNN tensors.

Three parameters, the "trinity", together determine what kind of tensor you have:

| Name | Definition | Example |
|------|------|------|
| **device** | where the tensor is stored | `'cpu'`, `'cuda:0'`, `'xla'`, `'mps'` |
| **layout** | how the tensor is laid out in memory | `torch.strided`, `torch.sparse_coo`, `torch.sparse_csr`, `torch._mkldnn` |
| **dtype** | the data type of each element | `torch.float32`, `torch.int8`, `torch.qint8`, etc. |

> The Cartesian product of these three parameters defines the full space of tensor types in PyTorch.
>
> **Tensor = function(device, layout, dtype)**
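
The three parameters can be read off any tensor. The sketch below builds a few tensors that differ in layout or dtype and prints the resulting triple; the variable names are arbitrary, and everything runs on the CPU so that no GPU is assumed.

```python
import torch

dense = torch.zeros(2, 2, dtype=torch.float64)  # strided layout, float64
sparse = torch.eye(2).to_sparse()               # sparse COO layout, float32
ints = torch.zeros(2, 2, dtype=torch.int8)      # strided layout, int8

for t in (dense, sparse, ints):
    # Each tensor is described by its (device, layout, dtype) triple.
    print(t.device.type, t.layout, t.dtype)
```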

Besides extending PyTorch (creating a new tensor type), we can also write a wrapper class around PyTorch tensors that implements our own object type.

## Differences between a tensor wrapper and extending PyTorch
| Feature | **Tensor Wrapper** | **Extending PyTorch** |
|----------|--------------------|------------------------|
| **What it is** | A normal Python class that holds a Tensor inside | A new Tensor type added inside PyTorch’s core |
| **How it works** | Uses an existing Tensor (just wraps it) | Changes how Tensor is built or stored (C++ level) |
| **Autograd (gradient)** | ❌ Wrapper itself doesn’t get gradients (the wrapped Tensor still can) | ✅ Fully supports autograd and gradients |
| **Use case** | Add simple features like logging, printing, unit labels | Add new behavior like new device, layout, or data type |
| **Difficulty** | 🟢 Easy (Python only) | 🔴 Hard (needs C++ knowledge) |
| **Example** | `class MyTensor:` that wraps a Tensor | `torch.sparse_coo`, quantized Tensor |
| **Speed** | Same as normal Tensor | May be faster or slower depending on backend |
| **When to use** | When you only need extra Python logic | When you need new Tensor behavior or backend support |
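
A minimal sketch of the wrapper approach from the table: a plain Python class that holds a tensor and adds a unit label. The class name `LabeledTensor` and its `unit` field are hypothetical and chosen only for illustration.

```python
import torch

class LabeledTensor:
    """Hypothetical wrapper: a plain Python object holding a tensor and a unit label."""

    def __init__(self, data: torch.Tensor, unit: str):
        self.data = data
        self.unit = unit

    def __add__(self, other: "LabeledTensor") -> "LabeledTensor":
        # The extra Python logic: refuse to add quantities with different units.
        if self.unit != other.unit:
            raise ValueError(f"unit mismatch: {self.unit} vs {other.unit}")
        return LabeledTensor(self.data + other.data, self.unit)

    def __repr__(self) -> str:
        return f"LabeledTensor({self.data.tolist()}, unit={self.unit!r})"

a = LabeledTensor(torch.tensor([1.0, 2.0]), "m")
b = LabeledTensor(torch.tensor([0.5, 0.5]), "m")
print(a + b)
```

The arithmetic is still done by the ordinary wrapped tensors, so the wrapper adds no new backend behavior.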

## Gradient
A gradient is a slope: it tells you how much something changes when you change its input. More precisely, the gradient is the rate of change of a function with respect to its inputs.

In deep learning, we want our model to learn good weights. To do that, we ask:
> If I change this weight a little, will my loss get bigger or smaller?

So the update rule in training is:
$$
new\_weight = old\_weight - learning\_rate \times gradient
$$
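
The update rule can be tried on a single weight. This is a sketch under stated assumptions: the toy loss $(w - 1)^2$, the starting value 3.0, and the learning rate 0.1 are all made up for illustration.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # old_weight
learning_rate = 0.1

loss = (w - 1.0) ** 2  # toy loss with its minimum at w = 1
loss.backward()        # fills w.grad with d(loss)/dw = 2 * (w - 1)

# Apply the update rule outside of gradient tracking.
with torch.no_grad():
    new_w = w - learning_rate * w.grad

print(w.grad.item(), new_w.item())  # gradient is 4.0; new weight is about 2.6
```

The new weight is closer to the minimum at 1 than the old one, which is the point of stepping against the gradient.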


### Forward Pass
1. Put data into the model.
2. Let it flow through all the layers.
3. Get an output and compute a loss.

```python
y_pred = model(x)
loss = criterion(y_pred, y_true)
```
### Backward Pass
1. Compute the gradient of the loss with respect to the model parameters.
2. Apply the chain rule to compute how each weight affected the final loss.
3. Store those derivatives (gradients) in each parameter's `.grad`.
```python
loss.backward()
```

4. Update the model parameters using the gradients (this is done by the optimizer, after the backward pass has finished).
```python
optimizer.step()
```
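
Putting the forward pass, backward pass, and parameter update together gives one training step. The snippets above leave `model`, `criterion`, `optimizer`, `x`, and `y_true` undefined, so the sketch below fills them in with assumptions: a small `nn.Linear` model, mean squared error loss, plain SGD, and random data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed setup; none of these choices come from the text above.
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 3)
y_true = torch.randn(8, 1)

weight_before = model.weight.detach().clone()

optimizer.zero_grad()             # clear gradients left over from earlier steps
y_pred = model(x)                 # forward pass
loss = criterion(y_pred, y_true)  # scalar loss
loss.backward()                   # backward pass: fills each parameter's .grad
optimizer.step()                  # update parameters using the gradients

print(model.weight.grad.shape)
```

Calling `optimizer.zero_grad()` first matters in a loop, because `backward()` accumulates into `.grad` rather than overwriting it.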

## Autograd

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2      # y = x²
y.backward()   # compute dy/dx
print(x.grad)
```

Output:

```
tensor(4.)
```


```python
import torch

# 1️⃣ Make input tensor that tracks gradients
x = torch.tensor(2.0, requires_grad=True)

# 2️⃣ Forward pass (PyTorch builds the graph automatically)
y = 2 * x + 3     # Step 1
y.retain_grad()
z = y ** 3        # Step 2
print("Forward pass:")
print(f"x = {x.item()}, y = {y.item()}, z = {z.item()}")

# 3️⃣ Backward pass
z.backward()       # Compute dz/dx
print("\nBackward pass (gradients):")
print(f"x.grad = {x.grad}")
print(f"y.grad = {y.grad}")
```

Output:

```
Forward pass:
x = 2.0, y = 7.0, z = 343.0

Backward pass (gradients):
x.grad = 294.0
y.grad = 147.0
```
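
The two gradients in this example, with $y = 2x + 3$ and $z = y^3$ evaluated at $x = 2$ (so $y = 7$), can be checked by hand with the chain rule:

$$
\frac{dz}{dy} = 3y^2 = 3 \times 7^2 = 147, \qquad
\frac{dy}{dx} = 2, \qquad
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 147 \times 2 = 294
$$

Here $\frac{dz}{dy}$ corresponds to `y.grad` and $\frac{dz}{dx}$ to `x.grad`.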
