In [1]:
# check PyTorch version
import torch 
torch.__version__

'2.5.1'

In [2]:
# scalar
scalar = torch.tensor(7)
scalar

tensor(7)

In [3]:
scalar.ndim

0

In [4]:
# retrieve scalar value 
scalar.item()

7

In [5]:
# vector 
vector = torch.tensor([7, 7])
vector

tensor([7, 7])

In [6]:
vector.ndim

1

In [8]:
vector.shape

torch.Size([2])

In [14]:
# matrix
A = torch.tensor([[7, 8],
                  [9, 10]])
A

tensor([[ 7,  8],
        [ 9, 10]])

In [15]:
A.ndim

2

In [16]:
A.shape

torch.Size([2, 2])

In [17]:
T = torch.tensor([[[1, 2, 3], 
                   [3, 6, 9], 
                   [2, 4, 5]]])
T.ndim

3

In [18]:
T.shape

torch.Size([1, 3, 3])
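
As a quick check on the cells above, `ndim` should always equal the length of `shape`, and `numel()` should equal the product of the entries in `shape`. A minimal sketch that rebuilds the same four tensors (the names `matrix` and `tensor3d` are illustrative, not from the cells above):

```python
import torch

scalar = torch.tensor(7)
vector = torch.tensor([7, 7])
matrix = torch.tensor([[7, 8], [9, 10]])
tensor3d = torch.tensor([[[1, 2, 3], [3, 6, 9], [2, 4, 5]]])

examples = [("scalar", scalar), ("vector", vector),
            ("matrix", matrix), ("tensor3d", tensor3d)]

for name, item in examples:
    # ndim counts the axes, so it equals the number of entries in shape
    assert item.ndim == len(item.shape)
    print(f"{name}: ndim={item.ndim}, shape={tuple(item.shape)}, numel={item.numel()}")
```

One way to read `ndim` off the printed tensor is to count the opening square brackets at the start of the output.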

In [19]:
# create a random tensor of size (2, 3)
random_tensor = torch.rand(2, 3)
random_tensor, random_tensor.dtype

(tensor([[0.1083, 0.1840, 0.5514],
         [0.2550, 0.1402, 0.5912]]),
 torch.float32)

In [20]:
# random tensor shaped like a 224 x 224 image with 3 color channels
random_image_size_tensor = torch.rand(size=(224, 224, 3))

In [21]:
# tensor of zeros 
zeros = torch.zeros(size=(3, 4))
zeros, zeros.dtype

(tensor([[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]),
 torch.float32)

In [23]:
# integers from 0 up to (but not including) 10 with torch.arange
zero_to_ten = torch.arange(0, 10)
zero_to_ten

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [24]:
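# tensor dtypes
float_32_tensor = torch.tensor([3.0, 6.0, 9.0],
                               dtype=None,            # None: infer the dtype from the data
                               device=None,           # None: use the current default device
                               requires_grad=False)   # do not track gradients for this tensor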
float_32_tensor.shape, float_32_tensor.dtype

(torch.Size([3]), torch.float32)
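
With `dtype=None`, the dtype is inferred from the data. The sketch below shows the inferred dtypes for integer and float inputs, an explicit cast, and what happens when the two are mixed. It assumes the global default dtype has not been changed with `torch.set_default_dtype`; the variable names are illustrative.

```python
import torch

int_tensor = torch.tensor([3, 6, 9])            # Python ints are stored as torch.int64
float_tensor = torch.tensor([3.0, 6.0, 9.0])    # Python floats use the default dtype, torch.float32
half_tensor = float_tensor.to(torch.float16)    # explicit cast; returns a new tensor

print(int_tensor.dtype, float_tensor.dtype, half_tensor.dtype)

# combining an integer tensor with a floating-point tensor yields a floating-point result
mixed = int_tensor * float_tensor
print(mixed.dtype)
```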

In [35]:
# multiply every element by a scalar with torch.multiply
t = torch.tensor([1, 2, 3])
torch.multiply(t, 10)

tensor([10, 20, 30])

In [30]:
# element-wise multiplication
print(t * t)
# dot product of two 1-D tensors via torch.matmul
print(torch.matmul(t, t))
# the same dot product via the @ operator
print(t @ t)
 

tensor([1, 4, 9])
tensor(14)
tensor(14)
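
For two 1-D tensors, `torch.matmul` and `@` return the dot product, which is the sum of the element-wise products. A small sketch that makes the link explicit (the intermediate names are illustrative):

```python
import torch

t = torch.tensor([1, 2, 3])

elementwise = t * t                  # element-wise products: [1, 4, 9]
dot_via_sum = elementwise.sum()      # summing them gives the dot product
dot_via_matmul = t @ t               # matmul of two 1-D tensors
dot_via_dot = torch.dot(t, t)        # dedicated 1-D dot product

print(dot_via_sum.item(), dot_via_matmul.item(), dot_via_dot.item())
```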


In [41]:
# is matmul faster than a Python loop? time the loop first
import time
start_time = time.perf_counter()
value = 0
for i in range(len(t)):
    value += t[i] * t[i]
end_time = time.perf_counter()
elapsed_time = end_time - start_time 
print(f"elapsed time is: {elapsed_time}")

elapsed time is: 0.0012700839999979507


In [42]:
import time
start_time = time.perf_counter()
value = t @ t
end_time = time.perf_counter()
elapsed_time = end_time - start_time 
print(f"elapsed time is: {elapsed_time}")

elapsed time is: 0.00021220899998297682


In [43]:
import time
start_time = time.perf_counter()
value = torch.matmul(t, t)
end_time = time.perf_counter()
elapsed_time = end_time - start_time 
print(f"elapsed time is: {elapsed_time}")

elapsed time is: 0.0001752079999732814
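
A single `perf_counter` measurement on a three-element vector is dominated by call overhead and noise, so the numbers above are only suggestive. A hedged sketch of a steadier comparison, using `timeit` and a larger vector (the size of 1,000 and the 5 repetitions are arbitrary choices for illustration, and the helper names are hypothetical):

```python
import timeit
import torch

big = torch.rand(1_000)  # larger vector; size chosen only for illustration

def loop_dot(v):
    """Dot product of v with itself using an explicit Python loop."""
    total = 0.0
    for i in range(len(v)):
        total += v[i] * v[i]
    return total

def matmul_dot(v):
    """Dot product of v with itself using the @ operator."""
    return v @ v

repeats = 5
loop_time = timeit.timeit(lambda: loop_dot(big), number=repeats) / repeats
matmul_time = timeit.timeit(lambda: matmul_dot(big), number=repeats) / repeats

print(f"loop:   {loop_time:.6f} s per call")
print(f"matmul: {matmul_time:.6f} s per call")
```

Exact timings depend on the machine, so no expected output is shown here.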


In [45]:
# A and B are both (3, 2), so transpose B to make the inner dimensions match
A = torch.tensor([[1, 2], 
                  [3, 4], 
                  [5, 6]], dtype=torch.float32)
B = torch.tensor([[7, 8], 
                  [9, 19], 
                  [11, 12]], dtype=torch.float32)

torch.matmul(A, B.T)

tensor([[ 23.,  47.,  35.],
        [ 53., 103.,  81.],
        [ 83., 159., 127.]])

In [46]:
# torch.mm is shorter to type; it handles 2-D matrices only (no broadcasting)
torch.mm(A, B.T)

tensor([[ 23.,  47.,  35.],
        [ 53., 103.,  81.],
        [ 83., 159., 127.]])
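
To see why the transpose is needed, the sketch below multiplies two `(3, 2)` tensors directly, which fails because the inner dimensions differ, and then shows the two transposed variants that work. Random values are used, so only the shapes matter here; the `failed` flag is just for illustration.

```python
import torch

A = torch.rand(3, 2)
B = torch.rand(3, 2)

failed = False
try:
    torch.matmul(A, B)               # (3, 2) @ (3, 2): inner dimensions 2 and 3 differ
except RuntimeError as err:
    failed = True
    print(f"matmul failed: {err}")

print(torch.matmul(A, B.T).shape)    # (3, 2) @ (2, 3) gives a (3, 3) result
print(torch.matmul(A.T, B).shape)    # (2, 3) @ (3, 2) gives a (2, 2) result
```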

`torch.nn.Linear()` implements a fully connected (feed-forward) layer:

$$\mathbf{y} = \mathbf{x}W^T + \mathbf{b}$$ 

where $\mathbf{x}$ is the input, $W$ is the weight matrix, $\mathbf{b}$ is the bias term, and $\mathbf{y}$ is the output.

In [47]:
torch.manual_seed(42)
# matrix multiplication inside a linear layer
linear = torch.nn.Linear(in_features=2,     # in_features = size of the input's last dimension
                         out_features=6)    # out_features = size of the output's last dimension
x = torch.randn(3, 2)              # shape (3, 2)
output = linear(x)                 # (3, 2) @ (2, 6) -> (3, 6)
print(f"Input shape is {x.shape}\n")
print(f"Output shape is {output.shape}\n")
print(output) 


# What's happening above 
W = linear.weight   # shape: [6, 2]
b = linear.bias     # shape: [6]
manual_output = x @ W.T + b 

Input shape is torch.Size([3, 2])

Output shape is torch.Size([3, 6])

tensor([[ 0.9888,  0.3117,  0.3418, -0.0555,  0.5656,  0.4033],
        [ 0.5818,  1.2026,  0.7456,  0.9326, -1.0080, -0.5535],
        [ 0.4686, -0.4747,  0.1364, -0.6240,  1.3291,  0.4219]],
       grad_fn=<AddmmBackward0>)
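
The cell above computes `manual_output` but never compares it with `output`. A minimal sketch of that check, rebuilding the same layer and input so it runs on its own (the `atol` value is a tolerance picked for illustration):

```python
import torch

torch.manual_seed(42)
linear = torch.nn.Linear(in_features=2, out_features=6)
x = torch.randn(3, 2)

output = linear(x)                                   # what the layer returns
manual_output = x @ linear.weight.T + linear.bias    # the same formula written out

# the two results should agree up to floating-point rounding
print(torch.allclose(output, manual_output, atol=1e-6))
```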


### Multiple Observations in Statistical Learning

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

Where:

- $\mathbf{X} \in \mathbb{R}^{n \times p}$: design matrix ($n$ samples, $p$ features)
- $\boldsymbol{\beta} \in \mathbb{R}^{p}$: coefficient vector (equivalently, a $p \times 1$ matrix)
- $\mathbf{y} \in \mathbb{R}^{n}$: response vector
- $\boldsymbol{\varepsilon} \in \mathbb{R}^{n}$: error (noise) vector

That is, you're multiplying a design matrix $ \mathbf{X} $ (rows = examples, columns = features) by a weight vector/matrix $ \boldsymbol{\beta} $.
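
The product $\mathbf{X}\boldsymbol{\beta}$ can be sketched directly in PyTorch. The values below are made up for illustration ($n = 4$, $p = 2$), and the noise term $\boldsymbol{\varepsilon}$ is left out, so the result is the fitted part of the model only:

```python
import torch

# hypothetical design matrix: 4 samples (rows), 2 features (columns)
X = torch.tensor([[2.0, 3.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [4.0, 2.0]])
beta = torch.tensor([0.5, -1.0])   # one coefficient per feature

y_hat = X @ beta                   # (4, 2) @ (2,) gives shape (4,)
print(y_hat)                       # tensor([-2.0000,  0.5000, -1.0000,  0.0000])
```

Each entry of `y_hat` is the dot product of one row of `X` with `beta`.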

---

### Linear Layer in PyTorch

PyTorch’s `nn.Linear(in_features, out_features)` defines a layer that applies the transformation:

$$
\mathbf{y} = \mathbf{x} \mathbf{W}^\top + \mathbf{b}
$$

Where:

- $\mathbf{x} \in \mathbb{R}^{n \times \text{in\_features}}$: input matrix ($n$ rows, one per sample)
- $\mathbf{W} \in \mathbb{R}^{\text{out\_features} \times \text{in\_features}}$: weight matrix
- $\mathbf{W}^\top \in \mathbb{R}^{\text{in\_features} \times \text{out\_features}}$: transposed weight matrix
- $\mathbf{b} \in \mathbb{R}^{\text{out\_features}}$: bias vector, added to every row
- So, $\mathbf{x} \mathbf{W}^\top \in \mathbb{R}^{n \times \text{out\_features}}$

The shapes therefore compose as expected: an input of shape $(n, \text{in\_features})$ produces an output of shape $(n, \text{out\_features})$.

---

### Why the Transpose?

The transpose is mostly an implementation convention:

- PyTorch stores $\mathbf{W}$ with shape $(\text{out\_features}, \text{in\_features})$, so each row holds the weights of one output unit. This layout is chosen for efficiency and compatibility with backend libraries (e.g., C++/CUDA/BLAS).
- The forward pass then computes $\mathbf{x} \mathbf{W}^\top$, so the inner dimensions match as in standard linear algebra (writing $p$ for `in_features` and $d$ for `out_features`):

$$
\mathbf{x} \in \mathbb{R}^{n \times p}, \quad \mathbf{W}^\top \in \mathbb{R}^{p \times d}, \quad \Rightarrow \quad \mathbf{y} \in \mathbb{R}^{n \times d}
$$

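So the transpose in the formula reflects how the weights are stored rather than a different operation: it is the same matrix product as $\mathbf{X} \boldsymbol{\beta}$ in the statistical notation, with $\mathbf{W}^\top$ playing the role of $\boldsymbol{\beta}$.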
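
One way to see the storage convention is to pick a single output unit and recompute its column of the output from the matching row of the weight matrix. A minimal sketch, with an arbitrary seed, layer size, and unit index `j` chosen for illustration:

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(in_features=2, out_features=3)
x = torch.randn(4, 2)
out = linear(x)                                   # shape (4, 3)

j = 1                                             # any output unit index in [0, 3)
# row j of the weight matrix holds the weights of output unit j
unit_j = x @ linear.weight[j] + linear.bias[j]    # shape (4,)

print(torch.allclose(out[:, j], unit_j, atol=1e-6))
```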

#### Statistical View (ISLR-style)

Let:

$$
\mathbf{x}_i = [x_1,\ x_2] = [2.0,\ 3.0]
$$

$$
\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ -1.0 \end{bmatrix}
$$

Then:

$$
y_i = \beta_1 x_1 + \beta_2 x_2 = 0.5 \cdot 2.0 + (-1.0) \cdot 3.0 = 1.0 - 3.0 = -2.0
$$

---

#### PyTorch View

PyTorch applies the transformation:

$$
\mathbf{y} = \mathbf{x} \mathbf{W}^\top + \mathbf{b}
$$

Assuming no bias ($\mathbf{b} = 0$), and with:

$$
\mathbf{x} = \begin{bmatrix} 2.0 & 3.0 \end{bmatrix}, \quad
\mathbf{W} = \begin{bmatrix} 0.5 & -1.0 \end{bmatrix}
$$

Then:

$$
\mathbf{W}^\top = \begin{bmatrix} 0.5 \\ -1.0 \end{bmatrix}
$$

and:

$$
\mathbf{x} \mathbf{W}^\top =
\begin{bmatrix} 2.0 & 3.0 \end{bmatrix}
\begin{bmatrix} 0.5 \\ -1.0 \end{bmatrix}
= (2.0)(0.5) + (3.0)(-1.0) = -2.0
$$
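
The worked example can be reproduced with an actual `nn.Linear` layer. The sketch below disables the bias with `bias=False` and overwrites the randomly initialized weights with the values from the example; copying into `linear.weight` under `torch.no_grad()` is one common way to do this, not the only one.

```python
import torch

# one output unit, two input features, no bias term
linear = torch.nn.Linear(in_features=2, out_features=1, bias=False)

# replace the random initial weights with W = [0.5, -1.0]; stored shape is (1, 2)
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[0.5, -1.0]]))

x = torch.tensor([[2.0, 3.0]])   # a single sample, shape (1, 2)
y = linear(x)                    # x @ W.T, shape (1, 1)

print(y.item())                  # -2.0
```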

