# MLP: A Multi-Layer Perceptron for Our Character-Level Language Model

[Paper Implementation: A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

Useful link:
* [PyTorch internals](https://blog.ezyang.com/2019/05/pytorch-internals/)

In [1]:
import torch 
import torch.nn.functional as F
import matplotlib.pyplot as plt 

%matplotlib inline
%config InlineBackend.figure_format='retina'
%load_ext autoreload
%autoreload 2

## The main idea:


A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: with a count-based model, our count matrix N grows exponentially as the context gets longer.
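
To make that growth concrete, here is a small sketch that counts how many entries a full count table would need. The helper name `table_entries` is hypothetical, and the vocabulary sizes (27 characters, and a round 17,000 words) are assumed purely for illustration.

```python
# Number of entries in a full count table for a count-based model:
# one row per possible context (vocab_size ** context_len) and
# one column per possible next token (vocab_size).
def table_entries(vocab_size: int, context_len: int) -> int:
    return vocab_size ** context_len * vocab_size

for context_len in (1, 2, 3):
    n_chars = table_entries(27, context_len)      # character-level vocabulary
    n_words = table_entries(17_000, context_len)  # assumed word-level vocabulary
    print(f"context={context_len}: chars={n_chars:,}  words={n_words:,}")
```

Even at the character level, the table grows by a factor of 27 for every extra character of context, and most of its entries would never be observed in the training data.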

The paper proposes to fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

The model simultaneously learns:
1. a distributed representation for each word, along with
2. the probability function for word sequences, expressed in terms of these representations.

**What is a Distributed Representation for Each Word?**

In traditional language models (like n-grams), words are treated as discrete symbols. Each word is typically assigned a unique index, meaning the model has no inherent understanding of the relationships between words. For example:
*	“dog” → Index 1023
*	“cat” → Index 453
*	“apple” → Index 7890

This representation is one-hot encoding, where each word is represented as a long vector with a single 1 at the corresponding index and 0s elsewhere. However, this approach has two major drawbacks:
*	It is sparse (high-dimensional vectors with mostly zeros).
*	It lacks semantic relationships (e.g., “dog” and “cat” are treated as completely unrelated).

**How Distributed Representations Solve This Problem**

Instead of using one-hot vectors, the neural model assigns each word a continuous, dense vector representation in a lower-dimensional space. This is known as word embeddings.

For example, instead of:
`dog → [0, 0, 0, 1, 0, 0, ..., 0]  (one-hot vector of size 100,000)`

we have:
`dog → [0.8, -1.2, 0.3, 0.5, ...]  (dense vector of size 300)`

These vectors capture semantic similarities:
*	“dog” and “cat” will have similar vector representations because they belong to the same category (animals).
*	“dog” and “car” will have very different vectors because they are unrelated concepts.

____________________
*Key Idea*
*	Each word is mapped to a continuous vector space, where similar words have similar representations.
*	These embeddings are learned during training and allow the model to generalize better to unseen word sequences.
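
One common way to measure how "similar" two embeddings are is cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors (not learned values) only to show the computation; the function name `cosine_similarity` is also just an illustrative helper.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "dog": [0.8, -1.2, 0.3],
    "cat": [0.7, -1.0, 0.4],
    "car": [-0.9, 0.6, 1.1],
}

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): close to 1 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(f"dog vs cat: {dog_cat:.3f}")
print(f"dog vs car: {dog_car:.3f}")
```

With these made-up vectors, "dog" and "cat" score close to 1 while "dog" and "car" score below 0. In a trained model the vectors are learned, so the similarities come from the data rather than from hand-picked numbers.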


**Probability Function for Word Sequences Using These Representations**

Once we have dense word embeddings, we can define a probability function for predicting the next word in a sequence.

*Traditional n-gram Probability Function*

In an n-gram model, the probability of the next word given the previous n-1 words is written as:

$$P(w_t | w_{t-1}, w_{t-2}, …, w_{t-n+1})$$

Where each $w_t$ is a word from the vocabulary, and probabilities are learned based on counting occurrences in a corpus.
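
As a rough illustration of the counting approach, here is a minimal bigram (n = 2) sketch over a tiny toy corpus. The corpus and the helper `bigram_prob` are assumptions made for this example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

# Toy corpus, assumed for illustration.
corpus = ["the cat sat on the mat", "the dog lay on the mat"]

# counts[prev][nxt] = how often `nxt` follows `prev`
counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    # Relative frequency estimate of P(nxt | prev); 0.0 if prev was never seen.
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "mat"))        # "the" is followed by "mat" 2 times out of 4
print(bigram_prob("rabbit", "jumped"))  # unseen context: no counts to draw on
```

The second call shows the weakness of pure counting: any context that never appeared in the corpus gets no useful estimate.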

*Neural Probability Function*

Instead of relying on discrete word occurrences, the neural probabilistic model defines the probability function in terms of the learned word representations:

$$P(w_t | w_{t-1}, w_{t-2}, …, w_{t-n+1}) = f(\textbf{v}_{w_{t-1}}, \textbf{v}_{w_{t-2}}, …, \textbf{v}_{w_{t-n+1}})$$

where:
*	$\textbf{v}_{w}$ is the word embedding of word w.
* 	f is a neural network that learns to predict the next word given its context.

Instead of looking up fixed probability tables (as in n-grams), the model learns a smooth function that can generalize to unseen sequences.
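
A minimal sketch of such a function $f$ in PyTorch might look like the following. All sizes (27 tokens, context of 3, 2-dimensional embeddings, 100 hidden units) and the output-layer parameters `W2` and `b2` are assumptions for illustration. The weights are random and untrained, so the probabilities themselves are meaningless; only the shapes and the data flow matter here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, context_len, emb_dim, hidden = 27, 3, 2, 100  # assumed sizes

C = torch.randn(vocab_size, emb_dim)             # embedding table, one row per token
W1 = torch.randn(context_len * emb_dim, hidden)  # hidden layer weights
b1 = torch.randn(hidden)
W2 = torch.randn(hidden, vocab_size)             # output layer weights (assumed)
b2 = torch.randn(vocab_size)

def f(context):
    emb = C[context]                                      # (batch, context_len, emb_dim)
    h = torch.tanh(emb.view(emb.shape[0], -1) @ W1 + b1)  # (batch, hidden)
    logits = h @ W2 + b2                                  # (batch, vocab_size)
    return F.softmax(logits, dim=1)                       # each row sums to 1

context = torch.tensor([[0, 0, 5], [0, 5, 13]])  # two example contexts of token indices
probs = f(context)
print(probs.shape)
print(probs.sum(dim=1))
```

Because every context is mapped through the shared embedding table `C`, two contexts with similar embeddings produce similar output distributions, which is where the generalization to unseen sequences comes from.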

*Example: Predicting the Next Word*

Let’s say our training corpus contains:
* 	“The cat sat on the mat.”
* 	“The dog lay on the mat.”

If the model encounters the incomplete sentence “The rabbit jumped on the ___”, a count-based model struggles because it has never seen this exact context and so has no counts to draw on.

However, a neural probabilistic model uses distributed representations. If “rabbit” has learned a vector similar to “cat” and “dog” (because it appeared in similar contexts elsewhere in training), the model can infer that “mat” is still a likely next word.

________________________

*Key Takeaways*
1.	Distributed Representations:
	*	Instead of treating words as discrete symbols, the model represents each word as a dense, continuous vector.
	*	This allows words with similar meanings to have similar representations, helping generalization.
2.	Probability Function in Terms of Representations:
	*	Instead of counting word sequences, the model learns a function that predicts the next word based on the embeddings of previous words.
	*	This function is implemented as a neural network, which can generalize to unseen sequences.
3.	Advantage Over Traditional Models:
	*	The neural model mitigates the curse of dimensionality by replacing explicit probability tables with a learned, shared set of parameters.
	*	It learns semantic relationships between words, making it more robust to data sparsity.

## Data loading

In [3]:
words = open('name.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [11]:
# Lookup tables: character -> index (stoi) and index -> character (itos)
chars = sorted(set(''.join(words)))
stoi = {s:i+1 for i, s in enumerate(chars)}
stoi['.'] = 0 
itos = {i:s for s,i in stoi.items()}
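
As a quick sanity check of these lookup tables, the sketch below encodes a name into indices and decodes it back. It rebuilds `stoi` and `itos` from a small stand-in word list (an assumption, so the indices differ from those of the full dataset).

```python
# Small stand-in word list; with the full dataset the indices would differ.
words = ["emma", "olivia", "ava"]

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}  # index 0 is reserved for '.'
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[ch] for ch in "emma."]       # characters -> indices
decoded = "".join(itos[i] for i in encoded)  # indices -> characters
print(encoded, decoded)
```

The round trip returns the original string, and the trailing `.` maps to index 0, which is the same index used to pad the start of each context.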

In [None]:
# Creating the dataset
X, Y = [], []

# context length: how many characters we use to predict the next one
block_size = 3

for w in words[:5]:

    print(w)
    # start each word with a context of all '.' tokens (index 0)
    context = [0] * block_size

    # loop over the characters of w, with '.' appended as the end marker
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)  # the known context
        Y.append(ix)       # the next character to predict

        print(f"{''.join(itos[i] for i in context)} -----> {itos[ix]}")
        # slide the context window: drop the oldest index, append the new one
        context = context[1:] + [ix]

# Converting to tensors
X = torch.tensor(X)
Y = torch.tensor(Y)


emma
... -----> e
..e -----> m
.em -----> m
emm -----> a
mma -----> .
olivia
... -----> o
..o -----> l
.ol -----> i
oli -----> v
liv -----> i
ivi -----> a
via -----> .
ava
... -----> a
..a -----> v
.av -----> a
ava -----> .
isabella
... -----> i
..i -----> s
.is -----> a
isa -----> b
sab -----> e
abe -----> l
bel -----> l
ell -----> a
lla -----> .
sophia
... -----> s
..s -----> o
.so -----> p
sop -----> h
oph -----> i
phi -----> a
hia -----> .


In [None]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)
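
The 32 rows come from one example per character plus one for the end marker: 5 + 7 + 4 + 9 + 7 = 32. The same loop can be wrapped in a function so that `block_size` is easy to change. This is a sketch with a hypothetical helper name `build_dataset`; it uses plain lists and the five names printed above so that it runs on its own.

```python
def build_dataset(words, stoi, block_size=3):
    # Returns (X, Y): X holds contexts of `block_size` indices, Y the next index.
    X, Y = [], []
    for w in words:
        context = [0] * block_size        # pad the start with '.' (index 0)
        for ch in w + ".":
            ix = stoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window by one character
    return X, Y

words = ["emma", "olivia", "ava", "isabella", "sophia"]
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0

X, Y = build_dataset(words, stoi, block_size=3)
print(len(X), len(X[0]), len(Y))
```

Changing `block_size` changes the width of each row of `X` but not the number of examples, since every character (plus the end marker) still produces exactly one training pair.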

In [20]:
# lookup table (embedding matrix) for the neural net
# in the paper, a vocabulary of about 17,000 words is embedded into as few as 30 dimensions
# here we embed our 27 characters into 2 dimensions, so C has shape (27, 2)

# C is randomly initialized
C = torch.randn((27, 2)) 


In [None]:
# look up the embedding for the character with index 5
# (row 5 of C, which is the sixth row because indexing starts at 0)
C[5]

tensor([0.8341, 0.2555])

In [None]:
# Earlier we used one-hot encoding to turn an index into a vector.
# For index 5 that is a vector of size 27 with a 1 at position 5 and 0 everywhere else.
# Multiplying that one-hot vector by C selects row 5, so the result equals C[5].
# Instead of one-hot encoding followed by a matrix multiplication, we can simply index into C.

F.one_hot(torch.tensor(5), num_classes = 27).float() @ C

tensor([0.8341, 0.2555])
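
The same equivalence holds for several indices at once. The sketch below checks it for a small batch; the seed and the chosen indices are arbitrary assumptions, so the values differ from the outputs shown in this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)              # arbitrary seed; values differ from the notebook's C
C = torch.randn((27, 2))
idx = torch.tensor([5, 0, 13])    # a small batch of character indices

via_index = C[idx]                                        # direct row lookup, shape (3, 2)
via_one_hot = F.one_hot(idx, num_classes=27).float() @ C  # (3, 27) @ (27, 2) -> (3, 2)

print(torch.allclose(via_index, via_one_hot))  # True
```

Indexing avoids building the (3, 27) one-hot matrix and the matrix multiplication, which is why the rest of the notebook uses `C[...]` directly.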

In [27]:
X[0], C[0], C[X[0]].shape

(tensor([0, 0, 0]), tensor([ 0.1190, -0.0155]), torch.Size([3, 2]))

In [29]:
C[X].shape

torch.Size([32, 3, 2])

In [30]:
emb = C[X]
emb.shape

torch.Size([32, 3, 2])
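
Indexing `C` with a 2-D integer tensor looks up every entry independently, so the result has the shape of the index tensor plus one trailing dimension for the embedding. A small sketch with an assumed 4-row `X` (instead of the 32 rows above) makes this explicit:

```python
import torch

torch.manual_seed(0)  # arbitrary seed; values differ from the notebook's C
C = torch.randn((27, 2))
X = torch.tensor([[0, 0, 0],
                  [0, 0, 5],
                  [0, 5, 13],
                  [5, 13, 13]])  # assumed 4 contexts of 3 indices each

emb = C[X]
print(emb.shape)                           # torch.Size([4, 3, 2])
print(torch.equal(emb[2, 1], C[X[2, 1]]))  # True: each entry is just a row of C
```

In other words, `emb[i, j]` is the embedding of the `j`-th context character of example `i`.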

In [38]:
# Creating the hidden layer 
W1 = torch.randn((6, 100) )
b1 = torch.randn(100)


# We want emb @ W1 + b1, but this throws an error because
# emb has shape (32, 3, 2) and W1 has shape (6, 100).
# One fix is torch.cat, which concatenates a sequence of tensors along a given dim.

# emb[:, 0, :], emb[:, 1, :], emb[:, 2, :] each have shape (32, 2): for every example they hold
# the 2-dimensional embedding of the first, second, and third context character respectively.
# In the figure in the paper, these are the three lookup outputs that feed the hidden layer.
# Concatenating them along dim 1 gives a tensor of shape (32, 6).


In [None]:
torch.cat((emb[:, 0, :] , emb[:, 1, :], emb[:, 2, :]), 1).shape

# but if we change block_size we would have to update the code above
# instead we can use torch.unbind, which removes a tensor dimension and returns a tuple of slices
# unbinding dim 1 gives a tuple of 3 tensors (one per context position), each of shape (32, 2)
# concatenating them along dim 1 again gives (32, 6)

torch.Size([32, 6])

In [34]:
torch.cat(torch.unbind(emb, 1), 1).shape

torch.Size([32, 6])

In [None]:
# a better option is view(): it returns a tensor with the requested shape that shares the same underlying storage, so nothing is copied and emb itself is unchanged
emb.view(-1, 6).shape, emb.view(-1, 6).dtype

(torch.Size([32, 6]), torch.float32)
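
To see why `view` is preferable, the sketch below compares it with the `torch.cat` version on a random stand-in for `emb` (an assumption, so the values differ from the notebook). Both give the same numbers, but only `view` reuses the existing memory. Writing `emb.shape[0]` instead of a hard-coded size also keeps the code independent of the batch size.

```python
import torch

torch.manual_seed(0)
emb = torch.randn(32, 3, 2)  # random stand-in for C[X]

flat_view = emb.view(emb.shape[0], -1)         # reinterpret the same memory as (32, 6)
flat_cat = torch.cat(torch.unbind(emb, 1), 1)  # allocate a new (32, 6) tensor

print(torch.equal(flat_view, flat_cat))        # True: same values
print(flat_view.data_ptr() == emb.data_ptr())  # True: view shares emb's memory
print(flat_cat.data_ptr() == emb.data_ptr())   # False: cat copied the data
```

One caveat: `view` needs a memory layout compatible with the requested shape. That holds for the result of `C[X]`, which is a freshly created contiguous tensor.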

In [51]:
h = torch.tanh((emb.view(-1, 6) @ W1 + b1))
h.shape, h.dtype, h[:1]

(torch.Size([32, 100]),
 torch.float32,
 tensor([[-0.0348,  0.8522, -0.3373,  0.6458, -0.1282,  0.9641,  0.1663,  0.2037,
           0.7526, -0.8614,  0.4565,  0.7607, -0.9409, -0.8764,  0.0039, -0.9848,
           0.9433,  0.7091, -0.2657, -0.9657, -0.2004, -0.9435, -0.3533,  0.3090,
          -0.8595,  0.6137, -0.2199,  0.6065,  0.6908, -0.1122,  0.7290, -0.3638,
           0.1711, -0.7295, -0.3449, -0.6122, -0.9385, -0.4828, -0.5417, -0.2458,
          -0.3233,  0.3978,  0.6026, -0.5404, -0.7556,  0.3038, -0.6807, -0.9824,
           0.0593,  0.8942,  0.7186, -0.9873, -0.7633, -0.5711,  0.7156,  0.1131,
           0.8977,  0.7049,  0.3637, -0.8260,  0.9959, -0.2805,  0.9093, -0.1771,
           0.7259, -0.9278, -0.2451,  0.9379, -0.9114,  0.9254,  0.5064, -0.1054,
          -0.5759,  0.6350, -0.8552,  0.8523,  0.1835,  0.8706, -0.9628,  0.0686,
          -0.7906,  0.3130,  0.9276,  0.7049, -0.9981,  0.5847,  0.6729, -0.1481,
           0.0935, -0.1218, -0.8515, -0.4874,  0.8115,  ...  (output truncated)

In [52]:
(emb.view(-1, 6) @ W1 + b1)

tensor([[-0.0349,  1.2642, -0.3510,  ..., -0.0552,  1.1362, -0.3026],
        [-0.4745,  1.2872, -0.6505,  ...,  0.2471,  1.3004, -0.1889],
        [-0.0846,  1.5484, -1.3016,  ..., -0.6944,  1.8721, -1.3949],
        ...,
        [ 0.0611,  2.0055,  0.8441,  ..., -2.4274,  1.8375,  0.1690],
        [-0.1578, -0.2346,  2.2537,  ..., -0.5914, -0.5785, -0.3627],
        [-0.4736,  1.7491, -3.0632,  ..., -2.6217,  2.2543, -0.4815]])
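
Two details of the hidden layer are worth checking: the `+ b1` relies on broadcasting a `(100,)` vector across all 32 rows, and `tanh` squashes the pre-activations into the range [-1, 1]. The sketch below verifies both on random stand-in tensors with the same shapes as above (the values are assumptions and will not match the notebook's outputs).

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 6)     # stand-in for emb.view(-1, 6)
W1 = torch.randn(6, 100)
b1 = torch.randn(100)

pre = x @ W1 + b1                                    # (32, 100) + (100,): b1 is added to every row
explicit = x @ W1 + b1.unsqueeze(0).expand(32, 100)  # the same addition written out explicitly
h = torch.tanh(pre)

print(torch.allclose(pre, explicit))               # True: broadcasting adds b1 to each row
print(h.min().item() >= -1.0, h.max().item() <= 1.0)
```

This matches the two outputs above: the raw pre-activations contain values such as 1.2642 or -3.0632, while the entries of `h` all lie between -1 and 1.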