## Learn GPT and the basics of PyTorch

References:
+ https://pytorch.org/docs/stable/index.html
+ https://github.com/karpathy/nanoGPT/blob/master/model.py
+ https://www.youtube.com/watch?v=kCc8FmEb1nY
+ https://github.com/rasbt/LLMs-from-scratch
+ ChatGPT

Purpose:
+ build a simple GPT model using PyTorch
+ build a simple GPT model using only simple PyTorch functions
+ understand the data flow through each module

In [1]:
import math
import inspect
from dataclasses import dataclass

## inspect Module
The inspect module provides several useful functions to help you get information about live objects, such as modules, classes, methods, functions, tracebacks, frame objects, and code objects. It is primarily used for introspection, which means examining the type or properties of an object at runtime.

Key functions in the inspect module include:

+ inspect.getmembers(object): Return all the members of an object.
+ inspect.signature(callable): Return a Signature object for the given callable.
+ inspect.isfunction(object): Return True if the object is a Python function.
+ inspect.isclass(object): Return True if the object is a class.
+ inspect.getdoc(object): Return the documentation string for an object.

In [2]:
import inspect

def my_function(a, b):
    """This is my function."""
    return a + b

# Get the signature of the function
sig = inspect.signature(my_function)
print(sig)  # Output: (a, b)

# Get the documentation of the function
doc = inspect.getdoc(my_function)
print(doc)  # Output: This is my function.

(a, b)
This is my function.
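
The cell above exercises `inspect.signature` and `inspect.getdoc`. The remaining helpers from the list can be sketched in the same way. The names `Greeter` and `add` are made up for this illustration only.

```python
import inspect


class Greeter:
    """A tiny class used only for this illustration."""

    def greet(self, name):
        return f"Hello, {name}"


def add(a, b=1):
    """Add two numbers."""
    return a + b


# isfunction / isclass distinguish the kind of object being inspected.
print(inspect.isfunction(add))      # True
print(inspect.isclass(Greeter))     # True
print(inspect.isfunction(Greeter))  # False

# getmembers returns (name, value) pairs; an optional predicate filters them.
method_names = [name for name, _ in inspect.getmembers(Greeter, inspect.isfunction)]
print(method_names)  # ['greet']

# A Signature object exposes each parameter, including its default value.
params = inspect.signature(add).parameters
print(list(params))         # ['a', 'b']
print(params["b"].default)  # 1
```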


## dataclasses Module
The dataclasses module provides a decorator and functions for automatically adding special methods to user-defined classes. These methods include `__init__`, `__repr__`, `__eq__`, and others. The main purpose is to reduce boilerplate code when creating classes that primarily store data.

A class decorated with @dataclass automatically gets:

+ An `__init__` method that initializes the instance variables.
+ A `__repr__` method that provides a string representation of the instance.
+ An `__eq__` method that compares instances field by field.
+ Additional methods depending on the parameters passed to the decorator, for example ordering methods with `order=True`.

In [3]:
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

# Creating an instance of Point
p1 = Point(1, 2)
print(p1)  # Output: Point(x=1, y=2)

# Comparing instances of Point
p2 = Point(1, 2)
print(p1 == p2)  # Output: True

Point(x=1, y=2)
True
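
As a rough sketch of the "additional methods" that decorator parameters can add, the example below uses `order=True` and `frozen=True`. The `Version` class and its fields are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(order=True, frozen=True)
class Version:
    major: int
    minor: int = 0                                  # fields may have defaults
    tags: tuple = field(default=(), compare=False)  # excluded from comparisons


v1 = Version(1, 2)
v2 = Version(1, 10, tags=("beta",))

# order=True generates __lt__, __le__, __gt__, __ge__ (compared field by field, like tuples).
print(v1 < v2)  # True

# frozen=True makes instances read-only; assignment raises FrozenInstanceError.
frozen_error = None
try:
    v1.major = 3
except FrozenInstanceError as err:
    frozen_error = err
print(type(frozen_error).__name__)  # FrozenInstanceError

# Frozen dataclasses with eq=True are hashable, so equal instances collapse in a set.
print(len({v1, Version(1, 2)}))  # 1
```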


* Example: a hand-written `__repr__`, which controls how an instance is displayed by `print()` and `repr()`:

In [4]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"Person(name={self.name!r}, age={self.age!r})"

# Creating an instance of the Person class
p = Person("Alice", 30)

# Printing the instance
print(p)  # Output: Person(name='Alice', age=30)

Person(name='Alice', age=30)
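
To see what `__repr__` changes, it may help to compare a class that defines it with one that does not. This is a small sketch; `Plain` and `Described` are hypothetical names.

```python
class Plain:
    def __init__(self, name):
        self.name = name


class Described(Plain):
    def __repr__(self):
        return f"Described(name={self.name!r})"


# Without __repr__, Python falls back to the default "<... object at 0x...>" form.
print(repr(Plain("Alice")).startswith("<"))  # True

# print() uses __str__, which falls back to __repr__ when __str__ is not defined.
print(Described("Alice"))  # Described(name='Alice')

# Containers display their elements with repr().
print([Described("Bob")])  # [Described(name='Bob')]
```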


In [5]:
import torch
import torch.nn as nn
from torch.nn import functional as F

## LayerNorm

In [6]:
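class LayerNorm(nn.Module):
    """LayerNorm with an optional bias. Older PyTorch releases do not accept bias=False in nn.LayerNorm."""

    def __init__(self, ndim, bias):
        super().__init__()
        # Learnable per-feature scale (gamma), initialized to 1.
        self.weight = nn.Parameter(torch.ones(ndim))
        # Learnable per-feature shift (beta), initialized to 0; omitted when bias is False.
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        # Normalize over the last dimension (size ndim) with eps = 1e-5.
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)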

+ The super().__init__() call in the __init__ method initializes the parent class (here nn.Module), so that any state the parent class defines is set up before the subclass adds its own attributes.

+ In PyTorch, nn.Module is the base class for all neural network modules. Its initializer creates the internal structures that track the model's parameters, buffers, and submodules, which is what later makes calls such as parameters() or moving the module between CPU and GPU work. Calling super().__init__() ensures that LayerNorm has these structures before it assigns self.weight and self.bias.

+ https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html#layernorm
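
A minimal sketch of why the `super().__init__()` call matters: a parameter assigned after it is registered on the module, whereas assigning one without it raises an `AttributeError`. The classes `Scale` and `BrokenScale` are hypothetical and only serve this illustration.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    def __init__(self, ndim):
        super().__init__()  # creates the internal registries used below
        self.weight = nn.Parameter(torch.ones(ndim))

    def forward(self, x):
        return x * self.weight


class BrokenScale(nn.Module):
    def __init__(self, ndim):
        # super().__init__() is deliberately omitted here.
        self.weight = nn.Parameter(torch.ones(ndim))


m = Scale(4)
param_names = [name for name, _ in m.named_parameters()]
print(param_names)  # ['weight']

error_message = None
try:
    BrokenScale(4)
except AttributeError as err:
    error_message = str(err)
print("AttributeError:", error_message)
```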

Layer normalization (LayerNorm) normalizes the inputs across the features of each training example independently, which helps stabilize training and tends to speed up convergence. The computation is summarized below:

### Summary of Steps

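1. **Compute the mean** of the input vector $ \mathbf{x} $ over its $ d $ features: $ \mu = \frac{1}{d} \sum_{i=1}^{d} x_i $.
2. **Compute the variance** of the input vector $ \mathbf{x} $: $ \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 $.
3. **Normalize** the input vector $ \mathbf{x} $ using the computed mean and variance: $ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} $, where $ \epsilon $ is a small constant (`1e-5` in the code above) added for numerical stability.
4. **Scale and shift** the normalized input using learnable parameters $ \gamma $ and $ \beta $: $ y_i = \gamma_i \hat{x}_i + \beta_i $.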
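
The four steps can be written out by hand and checked against `F.layer_norm`. This is a sketch under assumed shapes: a `(batch, time, features)` input with 8 features, with `gamma` set to ones and `beta` to zeros as in the freshly initialized module above.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 8)   # (batch, time, features)
gamma = torch.ones(8)      # learnable scale
beta = torch.zeros(8)      # learnable shift
eps = 1e-5

mean = x.mean(dim=-1, keepdim=True)                # step 1: mean over features
var = x.var(dim=-1, keepdim=True, unbiased=False)  # step 2: biased variance over features
x_hat = (x - mean) / torch.sqrt(var + eps)         # step 3: normalize
y = gamma * x_hat + beta                           # step 4: scale and shift

reference = F.layer_norm(x, (8,), gamma, beta, eps)
matches = torch.allclose(y, reference, atol=1e-5)
print(matches)  # True
```

Note that the statistics are computed separately for every position in the batch and time dimensions; only the last dimension is reduced.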

### Benefits of LayerNorm

- **Stabilizes Training:** By normalizing the inputs to each layer, LayerNorm helps stabilize the training process, especially for deep networks.
- **Accelerates Convergence:** Normalization helps the model converge faster by providing more stable gradients.
- **Reduces Covariate Shift:** LayerNorm is commonly described as reducing internal covariate shift, that is, the change in the distribution of a layer's inputs as the parameters of earlier layers are updated.

### Comparison with Batch Normalization (BatchNorm)

- **Normalization Scope:** BatchNorm computes its statistics per feature, across the examples in a batch. LayerNorm computes its statistics per example, across that example's features, so it does not depend on the batch size.
- **Application:** BatchNorm is commonly used in convolutional neural networks (CNNs) trained with large batch sizes, while LayerNorm is more suitable for recurrent neural networks (RNNs) and transformers, where batch sizes can vary or be small.

Understanding LayerNorm mathematically involves recognizing its role in standardizing the inputs to a neural network layer, which in turn helps in achieving stable and efficient training.
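
The difference in normalization scope can be checked numerically. The sketch below assumes a 2-D input of shape `(batch, features)` and uses PyTorch's built-in `nn.LayerNorm` and `nn.BatchNorm1d` in their default (training) mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 8) * 3 + 5   # (batch, features), deliberately not zero-mean

layer_norm = nn.LayerNorm(8)
batch_norm = nn.BatchNorm1d(8)   # training mode by default: uses batch statistics

ln_out = layer_norm(x)
bn_out = batch_norm(x)

# LayerNorm: each row (one example) has mean ~0 across its 8 features.
ln_row_means_zero = ln_out.mean(dim=1).abs().max().item() < 1e-4
# BatchNorm: each column (one feature) has mean ~0 across the 16 examples.
bn_col_means_zero = bn_out.mean(dim=0).abs().max().item() < 1e-4

print(ln_row_means_zero, bn_col_means_zero)  # True True
```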

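The L1 norm and L2 norm are two common ways to measure the length (or magnitude) of a vector. They are widely used in machine learning and statistics. Note that they are a separate concept from layer normalization: a norm maps a vector to a single non-negative number, whereas LayerNorm rescales every component of a vector.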

### L1 Norm (Manhattan Norm or Taxicab Norm)

The L1 norm of a vector $ \mathbf{x} = [x_1, x_2, \ldots, x_n] $ is defined as the sum of the absolute values of its components:

$$
\| \mathbf{x} \|_1 = \sum_{i=1}^n |x_i|
$$

For example, for the vector $ \mathbf{x} = [1, -2, 3, -4, 5] $:

$$
\| \mathbf{x} \|_1 = |1| + |-2| + |3| + |-4| + |5| = 1 + 2 + 3 + 4 + 5 = 15
$$

### L2 Norm (Euclidean Norm)

The L2 norm of a vector $ \mathbf{x} = [x_1, x_2, \ldots, x_n] $ is defined as the square root of the sum of the squares of its components:

$$
\| \mathbf{x} \|_2 = \sqrt{\sum_{i=1}^n x_i^2}
$$

For the same vector $ \mathbf{x} = [1, -2, 3, -4, 5] $:

$$
\| \mathbf{x} \|_2 = \sqrt{1^2 + (-2)^2 + 3^2 + (-4)^2 + 5^2} = \sqrt{1 + 4 + 9 + 16 + 25} = \sqrt{55} \approx 7.416
$$

### Applications

- **L1 Norm**: Used in Lasso regression, where it encourages sparsity in the model parameters, effectively performing feature selection.
- **L2 Norm**: Used in Ridge regression, where it encourages smaller coefficients, helping to prevent overfitting by penalizing large weights.

### Summary

- **L1 Norm**: Sum of absolute values of vector components.
- **L2 Norm**: Square root of the sum of the squares of vector components.

Both norms are useful in different contexts and have distinct properties that make them suitable for various tasks in machine learning and data analysis.
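
The worked example above can be reproduced in PyTorch. This sketch uses `torch.linalg.vector_norm` on the same vector.

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0, -4.0, 5.0])

l1 = torch.linalg.vector_norm(x, ord=1)  # sum of absolute values
l2 = torch.linalg.vector_norm(x, ord=2)  # square root of the sum of squares

print(l1.item())            # 15.0
print(round(l2.item(), 3))  # 7.416
```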

In [1]:
text = '''First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes:'''
print(text)

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes:


In [2]:
print(set(text))

{':', '\n', 'k', 'i', ',', 'n', 'h', '.', 'F', ';', 'f', 'm', 'j', 'g', 's', 'z', 'a', 'o', 'C', 'p', 'w', 'u', 'y', 'v', 't', 'W', 'e', 'r', 'L', 'd', ' ', 'c', 'b', 'l'}


In [3]:
print(sorted(list(set(text))))

['\n', ' ', ',', '.', ':', ';', 'C', 'F', 'L', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']


In [4]:
chars = sorted(list(set(text)))
for i,ch in enumerate(chars):
    print(i,ch)

0 

1  
2 ,
3 .
4 :
5 ;
6 C
7 F
8 L
9 W
10 a
11 b
12 c
13 d
14 e
15 f
16 g
17 h
18 i
19 j
20 k
21 l
22 m
23 n
24 o
25 p
26 r
27 s
28 t
29 u
30 v
31 w
32 y
33 z
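
An index table like the one printed above is typically what a character-level tokenizer is built from. The following is a sketch only: the names `stoi`, `itos`, `encode`, and `decode` are assumptions, not part of the cells above, and a short stand-in string replaces the full text.

```python
text = "First Citizen:\nWe are accounted poor citizens"  # short stand-in for the full text
chars = sorted(set(text))

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character


def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("Citizen")
print(ids)
print(decode(ids))  # Citizen
```

Encoding only works for characters that occur in `text`; any other character would raise a `KeyError` in this sketch.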


In [5]:
import torch
ix = torch.randint(12315, (4,))
print(ix)

tensor([ 5248,  1044, 11086, 11059])


In [6]:
ixx = torch.randint(12315, (4,8))
print(ixx)

tensor([[10390,  2825,  6406,  5501,  4820,  5107,  2540, 10388],
        [ 7783,  6997,  6379,  4976, 11672, 10848,  8931,  9748],
        [ 9113,  1373,  2486,   742,  2324,  5557,   956,  8499],
        [ 3402,  7472,  8843, 10691,  7444,  9004,  4771,  6852]])
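
One common use of `torch.randint` in this setting is to pick random start offsets for training batches. The sketch below assumes a 1-D tensor `data` of token ids (here just `torch.arange(100)` as a stand-in) and hypothetical sizes `block_size = 8` and `batch_size = 4`.

```python
import torch

torch.manual_seed(0)
data = torch.arange(100)   # stand-in for an encoded text tensor
block_size, batch_size = 8, 4

# Random start offsets; the upper bound keeps every slice and its shifted target in range.
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i + block_size] for i in ix])          # inputs, shape (4, 8)
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, shifted by one

print(x.shape, y.shape)
shifted = torch.equal(x[:, 1:], y[:, :-1])
print(shifted)  # True
```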


In [1]:
import torch

In [2]:
input = torch.randn(3, 5, requires_grad=True)
print(input)

tensor([[-1.2619,  0.5803,  0.2570, -1.6198,  0.7313],
        [-0.6691, -1.0521,  0.3380, -0.1779,  0.6680],
        [-0.8905, -1.0465,  0.2469, -0.9095,  0.6921]], requires_grad=True)


In [3]:
target = torch.randint(5, (3,), dtype=torch.int64)
print(target)

tensor([2, 4, 1])


In [6]:
from torch.nn import functional as F
loss = F.cross_entropy(input, target)
print(loss)

tensor(1.6540, grad_fn=<NllLossBackward0>)
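
The `NllLossBackward0` in the output reflects that `F.cross_entropy` combines a log-softmax with a negative log-likelihood loss. A sketch that reproduces it by hand, using freshly generated logits rather than the exact values printed above:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)          # 3 examples, 5 classes
target = torch.tensor([2, 4, 1])    # correct class index for each example

log_probs = F.log_softmax(logits, dim=-1)        # log of the softmax probabilities
picked = log_probs[torch.arange(3), target]      # log-probability of each correct class
manual = -picked.mean()                          # average negative log-likelihood

builtin = F.cross_entropy(logits, target)
same = torch.allclose(manual, builtin)
print(same)  # True
```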


In [15]:
import torch
input = torch.randn(8, 4, 5)
print(input.shape)
print(input)
logits = input[:, 0, :]
print(logits.shape)
print(logits)

torch.Size([8, 4, 5])
tensor([[[ 0.0556,  0.7931,  0.2027, -1.1361,  1.4996],
         [ 0.7243, -1.0727, -0.0048, -0.7183, -0.1111],
         [ 0.5431, -0.5410,  0.3666, -0.8849, -0.4974],
         [ 1.3834, -1.7518,  1.3320, -1.0032, -0.1557]],

        [[ 0.0234,  2.2345,  0.0224,  0.4388, -0.4750],
         [ 0.0817, -0.2610, -0.7862,  0.2321, -0.3828],
         [ 0.3017,  0.3296, -0.6282,  0.3463, -0.2611],
         [-1.8314, -0.8343, -1.2071, -1.0810,  0.9788]],

        [[-0.4080, -1.2915, -0.4471, -1.7200, -0.9816],
         [ 2.1752, -0.9490,  0.4909,  1.1006,  0.7446],
         [ 0.8315, -0.7812,  0.5920,  0.1139,  0.7367],
         [ 1.1277, -1.5654, -1.3464, -0.5767,  0.7914]],

        [[-0.4262, -2.2725, -1.3849,  0.8259,  0.2521],
         [ 0.1539,  0.3614,  0.1212, -0.8539,  0.4816],
         [ 0.1144, -0.3812,  0.3434, -0.4478, -0.2321],
         [-0.5238, -1.3684, -2.2022,  0.1606,  0.7356]],

        [[ 0.4162, -1.7961,  0.8615, -0.1497,  0.1669],
         [-0.6429,

In [16]:
input = torch.randn(1, 1, 65)
print(input.shape)
print(input)
logits = input[:, -1, :]
print(logits.shape)
print(logits)
probs = F.softmax(logits, dim=-1)
print(probs)

torch.Size([1, 1, 65])
tensor([[[ 2.3839e+00,  2.2380e-01, -7.4795e-01, -9.6642e-01, -1.2405e-01,
           1.5229e+00, -1.1294e+00,  9.5270e-01, -8.4379e-01,  1.7417e-01,
          -1.5384e+00,  1.6064e+00, -5.6547e-01, -6.2347e-02, -5.8367e-01,
          -1.9701e-01, -8.6588e-01,  7.1284e-01, -6.5321e-01,  1.0467e+00,
           1.6062e-01,  2.6254e-01,  1.9448e-01,  8.8856e-01, -1.8510e+00,
          -9.6507e-01,  1.1852e+00, -1.3623e+00, -1.1595e-03, -6.1612e-01,
           4.1488e-01,  8.3554e-02, -2.0600e+00,  1.4339e+00, -1.4375e-01,
          -9.0445e-01, -1.9722e-01, -1.5797e+00,  6.7009e-01, -1.6732e-02,
           6.3933e-01,  1.3744e+00, -8.5839e-01, -9.8788e-01, -1.2302e+00,
          -2.4145e+00, -2.6937e+00,  3.5657e-01,  4.2258e-01,  3.4872e-01,
          -7.3164e-01, -6.4600e-02,  1.9877e-01,  8.2799e-01,  1.4397e+00,
          -1.0195e+00, -1.6995e+00,  7.4717e-02,  8.2202e-01,  8.7804e-01,
          -1.0397e-01,  6.0437e-01, -1.0304e+00,  1.8374e+00,  3.8428e-01]]])

In [21]:
probs = F.softmax(logits, dim=-1)
print(probs)
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
print(idx_next)

tensor([[0.1101, 0.0127, 0.0048, 0.0039, 0.0090, 0.0466, 0.0033, 0.0263, 0.0044,
         0.0121, 0.0022, 0.0506, 0.0058, 0.0095, 0.0057, 0.0083, 0.0043, 0.0207,
         0.0053, 0.0289, 0.0119, 0.0132, 0.0123, 0.0247, 0.0016, 0.0039, 0.0332,
         0.0026, 0.0101, 0.0055, 0.0154, 0.0110, 0.0013, 0.0426, 0.0088, 0.0041,
         0.0083, 0.0021, 0.0198, 0.0100, 0.0192, 0.0401, 0.0043, 0.0038, 0.0030,
         0.0009, 0.0007, 0.0145, 0.0155, 0.0144, 0.0049, 0.0095, 0.0124, 0.0232,
         0.0428, 0.0037, 0.0019, 0.0109, 0.0231, 0.0244, 0.0091, 0.0186, 0.0036,
         0.0638, 0.0149]])
tensor([[22]])
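
Put together, the last few cells form one sampling step: take the logits at the last time step, turn them into probabilities, sample a token id, and append it to the context. The sketch below uses random logits as a stand-in for a model's output; `logits_all` and `idx` are hypothetical names.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, vocab_size = 2, 4, 65
logits_all = torch.randn(B, T, vocab_size)   # stand-in for model output
idx = torch.zeros(B, T, dtype=torch.long)    # stand-in for the token context

logits = logits_all[:, -1, :]                        # keep the last time step: (B, vocab_size)
probs = F.softmax(logits, dim=-1)                    # each row sums to 1
idx_next = torch.multinomial(probs, num_samples=1)   # one sampled id per row: (B, 1)
idx = torch.cat((idx, idx_next), dim=1)              # extended context: (B, T + 1)

print(probs.sum(dim=-1))
print(idx.shape)  # torch.Size([2, 5])
```

Because `torch.multinomial` samples rather than taking the argmax, lower-probability tokens are still chosen some of the time.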
