# What are Embedding Layers in PyTorch
* https://www.youtube.com/watch?v=e6kcs9Uj_ps&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi&index=30
* https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_06_4_embedding.ipynb

Embedding layers are a handy feature of PyTorch that let a program automatically insert additional information into the data flow of a neural network. For example, an embedding layer can replace each word index with a vector.


Programmers often use embedding layers in Natural Language Processing (NLP); however, you can use these layers whenever you wish to replace an index value with a longer vector. In some ways, you can think of an <u>embedding layer as dimension expansion</u>. The hope is that these additional dimensions give the model more information and lead to a better score.

In [None]:
try:
    from google.colab import drive
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Simple Embedding Layer Example
* **num_embeddings** = How large is the vocabulary? How many categories are you encoding? This parameter is the number of items in your "lookup table".
* **embedding_dim** = How many numbers in the vector you wish to return.


Now we create a neural network with a vocabulary size of 10, which maps each integer value from 0 to 9 to a vector of 4 numbers. This neural network does nothing more than pass the embedding on to the output, but it lets us see what the embedding is doing. Each input row will contain two such integer indexes.

In [None]:
import torch
import torch.nn as nn

embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
optimizer = torch.optim.Adam(embedding_layer.parameters(), lr=0.001)
loss_function = nn.MSELoss()

Let's take a look at the structure of this neural network to see what is happening inside it.

In [None]:
print(embedding_layer)

Embedding(10, 4)


For this neural network, which is just an embedding layer, the input is a vector of size 2. These two inputs are integers from 0 to 9 (corresponding to the requested `num_embeddings` of 10 values). The printed structure above, `Embedding(10, 4)`, describes a lookup table with 10 rows and 4 columns, so the embedding layer has 40 parameters: four values (`embedding_dim`) for each of the 10 (`num_embeddings`) possible integer values. The output is 2 vectors (one per input index) of length 4 (`embedding_dim`), for a total output size of 8.
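The parameter count and output size described above can be checked with a short sketch. The names `layer`, `indices`, and `n_params` are illustrative and do not appear in the original notebook.

```python
import torch
import torch.nn as nn

# Same configuration as above: 10 possible index values, 4 numbers per vector.
layer = nn.Embedding(num_embeddings=10, embedding_dim=4)

# The lookup table is the layer's only parameter tensor: 10 rows x 4 columns.
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)          # 40

# One input row holding two integer indexes.
indices = torch.tensor([[1, 2]], dtype=torch.long)
output = layer(indices)
print(output.shape)      # torch.Size([1, 2, 4])
print(output.numel())    # 8
```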


Now, let us query the neural network with one row of input. The row contains two integer values, matching the input size described above.

In [None]:
input_tensor = torch.tensor([[1, 2]], dtype=torch.long)
print(input_tensor)
print("input_tensor.shape: ", input_tensor.shape, '\n')

pred = embedding_layer(input_tensor)
print(pred)
print("pred.shape: ", pred.shape)

tensor([[1, 2]])
input_tensor.shape:  torch.Size([1, 2]) 

tensor([[[-1.0776, -0.0567, -1.0366, -0.3445],
         [-0.6044, -1.7945, -0.5762, -0.7177]]], grad_fn=<EmbeddingBackward0>)
pred.shape:  torch.Size([1, 2, 4])


Here we see two length-4 vectors that PyTorch looked up for each input integer. Recall that Python indexing is zero-based. PyTorch replaced the value 1 with the second row of the 10 * 4 lookup matrix. Similarly, PyTorch replaced the value 2 with the third row of the lookup matrix. The following code displays the lookup matrix in its entirety. The embedding layer performs no mathematical operations other than inserting the correct row from the lookup table.

In [None]:
embedding_layer.weight.data

tensor([[ 1.9302, -0.6667,  0.3645,  1.5885],
        [-1.0776, -0.0567, -1.0366, -0.3445],
        [-0.6044, -1.7945, -0.5762, -0.7177],
        [ 0.7623, -0.4987,  0.1511,  0.1636],
        [-0.5855, -0.8876, -1.8424,  1.3046],
        [ 0.0557,  0.6229,  0.7430, -2.4860],
        [ 0.8600, -1.7739,  0.5993, -0.5418],
        [-0.0722,  0.8435,  0.3237, -0.9304],
        [-0.6080, -0.5653, -1.3088, -0.2174],
        [-1.1741,  1.8047,  1.1985,  0.3427]])

The values above are random parameters that PyTorch generated as starting points. Generally, we will transfer an existing embedding or train these random values into something useful. The following section demonstrates how to load a hand-coded embedding.
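As a minimal sketch of the lookup behaviour described earlier, the layer's output can be compared with plain row indexing of the weight matrix and with a one-hot matrix multiplied by that matrix. All three should agree. The variable names here are illustrative, not taken from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Embedding(num_embeddings=10, embedding_dim=4)
idx = torch.tensor([[1, 2]], dtype=torch.long)

with torch.no_grad():
    from_layer = layer(idx)                            # shape (1, 2, 4)
    from_indexing = layer.weight[idx]                  # plain row lookup
    one_hot = F.one_hot(idx, num_classes=10).float()   # shape (1, 2, 10)
    from_matmul = one_hot @ layer.weight               # (1, 2, 10) @ (10, 4)

print(torch.equal(from_layer, from_indexing))
print(torch.allclose(from_layer, from_matmul))
```

The matrix-multiplication form is shown only to connect embeddings with one-hot encoding; the layer itself performs the cheaper row lookup.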

# Transferring An Embedding
Now, we see how to hard-code an embedding lookup that performs a simple one-hot encoding. One-hot encoding transforms the input integer values **0**, **1**, and **2** into the vectors **[1, 0, 0]**, **[0, 1, 0]**, and **[0, 0, 1]**, respectively. The following code replaces the random lookup values in the embedding layer with this one-hot-inspired lookup table.

In [None]:
# Define the embedding lookup matrix
embedding_lookup = torch.tensor([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
], dtype=torch.float32) # Make sure to use float32 for weight matrices

# Create the embedding layer
embedding_layer = nn.Embedding(num_embeddings=3, embedding_dim=3)

# Set the weights of the embedding layer
embedding_layer.weight.data = embedding_lookup

We have the following settings for the embedding layer:
* num_embeddings=3 - There are three different integer categorical values allowed.
* embedding_dim=3 - Three columns represent a categorical value with three possible values per one-hot encoding.
* input length of 2 - Each input row has two of these categorical values. PyTorch does not take this as a parameter; it follows from the shape of the input tensor.


We query the neural network with two categorical values to see the lookup performed.

In [None]:
# Create the input tensor directly in PyTorch
input_tensor = torch.tensor([[0, 1]], dtype=torch.long)
print(input_tensor)
print("input_tensor.shape: ", input_tensor.shape, '\n')

# Forward pass to get the predictions
pred = embedding_layer(input_tensor)
print(pred)
print("pred.shape: ", pred.shape)

tensor([[0, 1]])
input_tensor.shape:  torch.Size([1, 2]) 

tensor([[[1., 0., 0.],
         [0., 1., 0.]]], grad_fn=<EmbeddingBackward0>)
pred.shape:  torch.Size([1, 2, 3])


The output shows that the layer returned two rows from the one-hot encoding table. This is the correct one-hot encoding for the values 0 and 1, where up to 3 unique values are possible.
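As an alternative to assigning `weight.data`, PyTorch also provides the `nn.Embedding.from_pretrained` class method. The sketch below assumes the same 3 x 3 one-hot table; `one_hot_table` and `frozen_layer` are illustrative names.

```python
import torch
import torch.nn as nn

# A 3 x 3 identity matrix is exactly the one-hot lookup table used above.
one_hot_table = torch.eye(3)

# freeze=True (the default) keeps the table fixed during training.
frozen_layer = nn.Embedding.from_pretrained(one_hot_table, freeze=True)

indices = torch.tensor([[0, 1]], dtype=torch.long)
print(frozen_layer(indices))
print(frozen_layer.weight.requires_grad)   # False
```

Passing `freeze=False` instead would let training continue to adjust the transferred values.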

# Training an Embedding

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import OneHotEncoder
from torch.nn.utils.rnn import pad_sequence

We create a neural network that classifies restaurant reviews as positive or negative. The reviews are given as strings, such as those shown here, which later code converts to word indexes before they reach the network. This code also includes a positive or negative label for each review.

In [None]:
# Define 10 restaurant reviews.
reviews = [
    'Never coming back!',
    'Horrible service',
    'Rude waitress',
    'Cold food.',
    'Horrible food!',
    'Awesome',
    'Awesome service!',
    'Rocks!',
    'poor work',
    'Couldn\'t have done better']

# Define labels (1=negative, 0=positive)
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

Notice that the second-to-last label is incorrect. Errors such as this are not out of the ordinary, as most training data contains some noise.


We define a vocabulary size of 50 words. Though we do not have 50 distinct words, it is okay to use a value larger than needed. If there were more than 50 distinct words, some of them would necessarily share an index, because `hash(word) % VOCAB_SIZE` can produce only 50 different values; the embedding layer itself does not drop any words. For input, we encode each word as an integer index rather than one-hot encoding the strings with Scikit-Learn. Scikit-Learn would expand these strings to the 0's and 1's we would typically see for dummy variables. Here, Python's built-in `hash` function and the modulo operator translate every word to an index value and replace each word with that index.

In [None]:
hash('Never') % 50

10

In [None]:
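# Encode each review as a tensor of word indexes (hashing, not a true one-hot encoding)
VOCAB_SIZE = 50
# `%` computes the remainder of a division
encoded_reviews = [torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]

print(f"Encoded reviews: {encoded_reviews}")
# The output below shows the words of each review converted to numbers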

Encoded reviews: [tensor([10,  5, 27]), tensor([33, 29]), tensor([28,  0]), tensor([39,  5]), tensor([33, 21]), tensor([36]), tensor([36, 20]), tensor([19]), tensor([28,  3]), tensor([ 3, 15, 46, 18])]
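Python salts the built-in `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so the indexes printed above will generally differ from one run to the next. The following is a minimal sketch of a reproducible alternative based on `hashlib`. The helper `stable_index` is hypothetical and not part of the original notebook.

```python
import hashlib

VOCAB_SIZE = 50

def stable_index(word: str, vocab_size: int = VOCAB_SIZE) -> int:
    """Map a word to an index in [0, vocab_size - 1], identically on every run."""
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % vocab_size

review = "Never coming back!"
encoded = [stable_index(word) for word in review.split()]
print(encoded)
```

A stable hash matters as soon as a trained model is saved and reloaded, because the learned rows of the lookup table are only meaningful if each word maps to the same index again.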


The program encodes these reviews as word indexes; however, their lengths differ. We pad these reviews to 4 words and truncate any words beyond the fourth.

In [None]:
MAX_LENGTH = 4
padded_reviews = pad_sequence(encoded_reviews, batch_first=True, padding_value=0).narrow(1, 0, MAX_LENGTH)
print(padded_reviews)
# pad_sequence() fills the missing elements of each row with zeros

tensor([[10,  5, 27,  0],
        [33, 29,  0,  0],
        [28,  0,  0,  0],
        [39,  5,  0,  0],
        [33, 21,  0,  0],
        [36,  0,  0,  0],
        [36, 20,  0,  0],
        [19,  0,  0,  0],
        [28,  3,  0,  0],
        [ 3, 15, 46, 18]])


Because `pad_sequence` pads at the end of each sequence, every review shorter than the longest one has zeros (`padding_value=0`) appended.
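`pad_sequence` pads only up to the longest sequence in the batch, so the `.narrow(1, 0, MAX_LENGTH)` call relies on at least one review having `MAX_LENGTH` or more words. The sketch below shows one way to always obtain exactly `MAX_LENGTH` columns. The helper `pad_or_truncate` is a hypothetical name introduced for illustration.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

MAX_LENGTH = 4

def pad_or_truncate(sequences, max_length=MAX_LENGTH, padding_value=0):
    """Return a (batch, max_length) tensor regardless of the input lengths."""
    # Truncate first so that no sequence is longer than max_length.
    clipped = [seq[:max_length] for seq in sequences]
    padded = pad_sequence(clipped, batch_first=True, padding_value=padding_value)
    # pad_sequence stops at the longest sequence, so add columns if still short.
    missing = max_length - padded.shape[1]
    if missing > 0:
        padded = F.pad(padded, (0, missing), value=padding_value)
    return padded

short_batch = [torch.tensor([10, 5]), torch.tensor([33])]
long_batch = [torch.tensor([1, 2, 3, 4, 5, 6]), torch.tensor([7])]
print(pad_or_truncate(short_batch))   # rows [10, 5, 0, 0] and [33, 0, 0, 0]
print(pad_or_truncate(long_batch))    # rows [1, 2, 3, 4] and [7, 0, 0, 0]
```

Note that index 0 doubles as a real word index whenever a word hashes to 0 (as the second word of 'Rude waitress' does in the run shown earlier), so padding and that word share one embedding vector.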

Next we create a neural network to learn to classify reviews.

In [None]:
model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, 8),
    nn.Flatten(),
    nn.Linear(8 * MAX_LENGTH, 1),
    nn.Sigmoid()
)

This network accepts four integer inputs that specify the word indexes of a padded restaurant review. The first layer, the embedding layer, converts these four indexes into four vectors of length 8. These vectors come from the lookup table that contains 50 (VOCAB_SIZE) rows of vectors of length 8. This encoding is evident from the 400 (8 times 50) parameters in the embedding layer. The output size from the embedding layer is 32 (4 words expressed as 8-number embedded vectors). A single output neuron is connected to the embedding layer by 33 parameters (32 weights from the embedding output and a single bias). Because this is a binary classification network, we use the sigmoid activation function and binary cross-entropy loss (`nn.BCELoss`).
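The counts quoted above can be verified with a short sketch that rebuilds the same architecture. The constant `EMBEDDING_DIM` and the two counter variables are names added here for readability.

```python
import torch.nn as nn

VOCAB_SIZE = 50
MAX_LENGTH = 4
EMBEDDING_DIM = 8

model = nn.Sequential(
    nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    nn.Flatten(),
    nn.Linear(EMBEDDING_DIM * MAX_LENGTH, 1),
    nn.Sigmoid(),
)

# Lookup table: 50 rows x 8 columns.
embedding_params = sum(p.numel() for p in model[0].parameters())
# Dense layer: 32 weights plus 1 bias.
linear_params = sum(p.numel() for p in model[2].parameters())
print(embedding_params)   # 400
print(linear_params)      # 33
```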


The program now trains the neural network. Training updates both the embedding lookup table and the 33 parameters of the dense layer to produce a better score.

In [None]:
criterion = nn.BCELoss() # Binary Cross Entropy
optimizer = optim.Adam(model.parameters())

# Training the model
epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    outputs = model(padded_reviews.long())
    loss = criterion(outputs.squeeze(), torch.tensor(labels, dtype=torch.float))
    loss.backward()
    optimizer.step()

We can see the learned embeddings. Think of each word's vector as a location in 8-dimensional space, where words associated with positive reviews end up close to each other. Similarly, training places words associated with negative reviews close to each other. In addition to training these embeddings, the 33 parameters between the embedding layer and the output neuron learn to transform the embeddings into an actual prediction. You can see these embeddings here.

In [None]:
embedding_weights = list(model[0].parameters())[0]
print(embedding_weights.shape)
print(embedding_weights)

torch.Size([50, 8])
Parameter containing:
tensor([[-1.6768, -1.9031,  0.0252, -0.5319,  0.7442,  1.2344,  1.2802,  0.7962],
        [ 0.6082,  1.5734,  0.8852,  1.5405, -2.1552, -0.1733, -2.5363, -1.8048],
        [-0.1492,  1.6089,  0.1645, -1.0403,  0.2712, -1.6028,  0.1053, -0.7191],
        [ 0.6954, -1.6610,  0.2845,  1.5048, -0.6243, -0.6798, -0.3955, -1.2092],
        [-1.4907,  0.9310, -1.1549,  1.8388, -0.4695,  0.3294, -1.2787, -1.2318],
        [-0.6396, -1.8355,  1.7043,  0.1649, -0.2356,  2.3743,  0.1703,  0.0206],
        [ 1.3712,  1.5142, -1.3969,  0.1835,  1.2148,  0.2743, -0.4599, -0.8549],
        [-0.4438,  2.9385,  0.5593, -0.5218,  0.4831,  0.2573,  0.8810,  0.8127],
        [ 1.2870,  0.4554,  0.8146, -1.7634,  0.6124, -0.5731,  0.3127, -0.4764],
        [-0.4755,  0.0245,  0.8558, -1.7638, -1.3385, -0.0087,  2.2258, -0.0103],
        [ 0.6813,  0.8824, -0.3097,  1.1874,  0.0993, -0.5140, -0.5533, -0.8991],
        [-0.0908, -1.2281,  1.0500,  0.6610, -1.4649,  0
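One way to explore how close two word vectors are is cosine similarity. The sketch below is only an illustration: it uses a freshly initialised layer as a stand-in for the trained `model[0]`, and the helper `word_similarity` is a hypothetical name. The indexes 33 and 28 are the ones that 'Horrible' and 'Rude' received in the run shown earlier, and they may differ in another run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for model[0]; in practice the trained embedding layer would be used.
embedding = nn.Embedding(50, 8)

def word_similarity(layer, index_a, index_b):
    """Cosine similarity between two rows of the lookup table, in [-1, 1]."""
    with torch.no_grad():
        vec_a = layer.weight[index_a]
        vec_b = layer.weight[index_b]
        return F.cosine_similarity(vec_a, vec_b, dim=0).item()

print(word_similarity(embedding, 33, 28))
print(word_similarity(embedding, 33, 33))   # a vector compared with itself
```

With only ten short reviews, any similarities found this way are anecdotal at best.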

We can now evaluate this neural network's accuracy, including the embeddings and the learned dense layer.

In [None]:
# Evaluation
with torch.no_grad():
    outputs = model(padded_reviews.long())
    predicted_labels = (outputs > 0.5).float().squeeze()
    accuracy = (predicted_labels == torch.tensor(labels)).float().mean().item()
    loss_value = criterion(outputs.squeeze(), torch.tensor(labels, dtype=torch.float)).item()

print(f"Accuracy: {accuracy}")
print(f"Loss: {loss_value}")


Accuracy: 1.0
Loss: 0.3505394160747528


The accuracy is great, but there could be overfitting. For a more complex data set, it would be good to use early stopping to avoid overfitting. The loss, however, is not perfect. Even though the predicted probabilities indicated a correct prediction in every case, the program did not achieve absolute confidence in each correct answer. The lack of confidence was likely due to the small amount of noise in the data set. Words that appeared in both positive and negative reviews also contributed to this lack of absolute certainty.
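To illustrate how perfect accuracy and a non-zero loss can coexist, the sketch below evaluates binary cross-entropy on made-up probabilities that are all on the correct side of 0.5 but far from 0 or 1. The labels and probabilities are hypothetical and are not the notebook's actual predictions.

```python
import math
import torch
import torch.nn as nn

labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
# Hypothetical predictions: every one is correct, none is confident.
probs = torch.tensor([0.7, 0.7, 0.3, 0.3])

accuracy = ((probs > 0.5).float() == labels).float().mean().item()
loss = nn.BCELoss()(probs, labels).item()

print(accuracy)          # 1.0
print(round(loss, 4))    # 0.3567, which is -ln(0.7)
```

Since the mean binary cross-entropy is the average of `-ln(p)` over the probability `p` given to the correct class, `exp(-loss)` is the geometric mean of those probabilities. For the loss of about 0.35 reported above, that works out to roughly 0.70.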

## About the Encoding `[torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]`
The goal is to convert the textual data (reviews) into a numerical format that can be used as input for the neural network.
Here is a detailed explanation of the code:
```
VOCAB_SIZE = 50
encoded_reviews = [torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]
```

### Components:
1. **VOCAB_SIZE**:
```
VOCAB_SIZE = 50
```
    * This sets the vocabulary size to 50. It means that we will represent each word by an index ranging from 0 to 49.

2. **Encoding Reviews**:
```
encoded_reviews = [torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]
```
    * Here, we are converting each review (a string) into a tensor of integers (indices).

### Detailed Breakdown:
1. **Splitting Reviews into Words**:
```
review.split()
```
    * For each review, the **`split()`** method is called. This splits the review string into a list of words. For example, **"Never coming back!"** becomes **["Never", "coming", "back!"]**.

2. **Hashing and Modulo Operation**:
```
[hash(word) % VOCAB_SIZE for word in review.split()]
```
    * For each word in the split review, we apply the **`hash`** function.
    * The **`hash`** function generates an integer for each word. Within one Python process the same word always yields the same integer, but two different words can occasionally yield the same value, so the result is not guaranteed to be unique.
    * We then use the modulo operation **`% VOCAB_SIZE`** to map the hash value to an index within the range of **0** to **49** (since **VOCAB_SIZE** is 50). This ensures that each word is represented by an index that fits within our defined vocabulary size.
    * For example, if **hash("Never")** returns **135798642**, then **`135798642 % 50`** is **42**, so **"Never"** is represented by the index **42**.

3. **Creating Tensors**:
```
torch.tensor([...])
```
    * After computing the indices for all words in a review, we wrap the list of indices in **torch.tensor** to create a tensor.
    * Each review is now represented as a tensor of integers.

### Example:
Let’s take a concrete example to see how a review is encoded.

**Review**: `"Never coming back!"`

**Split into Words**: `["Never", "coming", "back!"]`

**Hash and Modulo Operation** (illustrative values; the actual indexes depend on the Python process):
- `hash("Never") % 50 = 42`
- `hash("coming") % 50 = 17`
- `hash("back!") % 50 = 9`

**Encoded Review**:
- Tensor: `torch.tensor([42, 17, 9])`

**Applying this to all reviews**:
The process is repeated for each review in the list, resulting in `encoded_reviews`, a list of tensors.

### Summary
This step converts each word in the reviews into a numerical index based on its hash value and maps it into a fixed vocabulary size. This process ensures that the textual data is transformed into a numerical format that can be fed into the neural network.

## Why Use `hash(word) % VOCAB_SIZE`?

The calculation **`hash(word) % VOCAB_SIZE`** is used to map words to integer indices within a fixed range. Here are the reasons for using this approach:

(**x % y**: the remainder of x / y)

### Reason for Hashing and Modulo Operation
1. **Fixed Vocabulary Size**:
    * **Purpose**: Neural networks typically require <u>fixed-size input dimensions</u>, and an embedding layer can only look up indexes smaller than its `num_embeddings`.
    * **Reason**: By using the modulo operation with a fixed **VOCAB_SIZE**, we ensure that all words are mapped to indices within the range **[0, VOCAB_SIZE - 1]**. This keeps every index valid for the lookup table.
2. **Handling Unknown Words**:
    * **Purpose**: Real-world data often contains a vast number of unique words, including typos, slang, and rare words.
    * **Reason**: The hashing approach allows the model to handle any word, even those not seen during training, by mapping them to one of the fixed indices. This is a way to deal with out-of-vocabulary words without requiring explicit handling.
3. **Simplicity and Efficiency**:
    * **Purpose**: Directly converting words to indices should be computationally efficient.
    * **Reason**: Hash functions provide a simple and quick way to convert strings (words) into integers. The modulo operation ensures the result stays within the desired range. This combination is both fast and easy to implement.
4. **Avoiding Large Memory Usage**:
    * **Purpose**: Managing a large vocabulary can lead to significant memory usage.
    * **Reason**: By limiting the vocabulary size to a manageable number (**VOCAB_SIZE**), we prevent the model from needing to handle an excessively large number of unique indices, which can be memory-intensive.
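The price of a fixed index range is that different words can collide. In the run shown earlier, for instance, 'coming' and 'food.' both received index 5. The sketch below uses made-up tokens to show that collisions are unavoidable once there are more distinct words than indexes. The exact counts vary between runs because `hash()` is salted per process.

```python
VOCAB_SIZE = 50

# 60 distinct made-up tokens: more tokens than available indexes.
words = [f"word{i}" for i in range(60)]
indices = [hash(word) % VOCAB_SIZE for word in words]

distinct = len(set(indices))
print(distinct)                # at most 50
# At least 10 tokens must reuse an index that another token already occupies.
print(len(words) - distinct)
```

Colliding words share one embedding vector, so the model cannot tell them apart. A larger `VOCAB_SIZE` reduces this effect at the cost of a larger lookup table.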

### Example for Clarity
Imagine we have a vocabulary size of 50 (VOCAB_SIZE=50). Here's a step-by-step example of how a word is mapped to an index:
1. **Word**: "`delicious`"
2. **Hash Calculation**:
    * `hash("delicious")` might return a large integer, e.g., `103728491`.
3. **Modulo Operation**:
    * `103728491 % 50` results in `41`, an index within `[0, 49]`.


This process ensures that no matter how large the hash value, the final index will always fit within the range defined by **VOCAB_SIZE**.


### Why Not Use Direct Indices or One-Hot Encoding?
1. **Direct Indices**:
    * If we were to assign a unique index to every word directly, the vocabulary could become very large, especially in applications with diverse and extensive text data. This would require a correspondingly large embedding matrix and could lead to significant computational and memory challenges.
2. **One-Hot Encoding**:
    * One-hot encoding would result in <u>very sparse vectors (mostly zeros)</u> with a size equal to the vocabulary. For large vocabularies, this approach is highly inefficient in terms of memory and computation. Instead, using embeddings allows us to represent words in a dense, lower-dimensional space.
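For comparison, the direct-index alternative can be sketched as an explicit word-to-index dictionary. This is not what the notebook does. The reserved `PAD` and `UNK` indexes, the lowercasing, and the `encode` helper are all assumptions made for this illustration.

```python
reviews = ["Horrible service", "Awesome service!", "Cold food."]

PAD, UNK = 0, 1   # reserved indexes (a convention assumed for this sketch)

# Assign the next free index to each new word, in order of first appearance.
word_to_index = {}
for review in reviews:
    for word in review.lower().split():
        word_to_index.setdefault(word, len(word_to_index) + 2)

def encode(review):
    """Look up each word; unseen words fall back to the UNK index."""
    return [word_to_index.get(word, UNK) for word in review.lower().split()]

print(word_to_index)
print(encode("Horrible food."))     # [2, 7]
print(encode("Awesome dessert"))    # [4, 1]
```

This mapping has no collisions and is reproducible, but the dictionary must be stored alongside the model, and the embedding table grows with every new word.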


### Summary
The use of `hash(word) % VOCAB_SIZE` allows for a fixed-size, efficient, and straightforward way to convert words into numerical indices. This ensures consistency, handles out-of-vocabulary words gracefully, and maintains computational efficiency. This approach is particularly useful in scenarios with large or dynamic vocabularies.


## What Is the Modulo Operation?

The "modulo operation" is a mathematical operation that finds the remainder of the division of one number by another. In other words, for two numbers (a) and (b), the expression (a % b) (read as "a modulo b") returns the remainder when (a) is divided by (b).

### How the Modulo Operation Works
1. **Basic Definition**:
    * If (a) and (b) are integers, then (a % b) gives the remainder when (a) is divided by (b).
    * The formula can be written as:
$$
a \% b = a - \left\lfloor \frac{a}{b} \right\rfloor \times b
$$
    * Here, $\left\lfloor \frac{a}{b} \right\rfloor$ represents the floor function, which gives the largest integer less than or equal to $\frac{a}{b}$.

2. **Example Calculation**:
    * 10 % 3:
        * $10 \div 3$ is $3$ with a remainder of $1$.
        * So, 10 % 3 = 1
    * 20 % 5:
        * $20 \div 5$ is $4$ with no remainder.
        * So, 20 % 5 = 0
    * 7 % 4:
        * $7 \div 4$ is $1$ with a remainder of $3$.
        * So, 7 % 4 = 3
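The worked examples above can be reproduced in Python, where `%` follows the floor-based formula. The sketch also shows a negative left operand, which is relevant because `hash()` can return negative integers.

```python
import math

for a, b in [(10, 3), (20, 5), (7, 4)]:
    quotient, remainder = divmod(a, b)
    # The remainder matches the floor-based formula a - floor(a / b) * b.
    assert remainder == a % b == a - math.floor(a / b) * b
    print(f"{a} % {b} = {remainder} (quotient {quotient})")

# With a positive divisor the result is never negative in Python,
# even when the left operand is negative.
print(-7 % 50)           # 43
print(103728491 % 50)    # 41
```

This behaviour is specific to languages such as Python. In some other languages, the remainder of a negative number can itself be negative.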

### Why Use the Modulo Operation?
In the context of the provided code, the modulo operation is used to ensure that the result of hashing a word (which can produce a very large integer) fits within a predefined range, namely, the size of the vocabulary (**VOCAB_SIZE**). This is crucial for maintaining consistency and preventing excessively large indices.

### Applying Modulo in the Code
Here's a step-by-step breakdown of how the modulo operation is used in the code:

```python
VOCAB_SIZE = 50
encoded_reviews = [torch.tensor([hash(word) % VOCAB_SIZE for word in review.split()]) for review in reviews]
```

1. **Hash Function**:
   - `hash(word)` generates a large integer value based on the input word.
   - Example: `hash("delicious")` might return `103728491`.

2. **Modulo Operation**:
   - `hash(word) % VOCAB_SIZE` maps the large integer to a smaller range.
   - If `VOCAB_SIZE` is 50, the result will be between 0 and 49.
   - Example: `103728491 % 50` results in `41`.

By using the modulo operation, we ensure that every word is represented by an index within the fixed range `[0, VOCAB_SIZE - 1]`. This keeps the vocabulary size manageable and the input dimensions consistent for the neural network.