In [5]:
import torch
from torch import nn

### What an embedding layer does in PyTorch

In [11]:
# the input to the embedding layer (nn.Embedding) is a tensor of indices,
# one per word, pointing into the vocabulary list
input_index = torch.tensor([4, 5, 6, 7], dtype=torch.long)

In [25]:
# embeddings is a lookup table with 500 entries, one per word in the vocabulary;
# each entry is a randomly initialized 50-dimensional vector
embeddings = nn.Embedding(500, 50)

In [18]:
# passing input_index to the layer retrieves the four vectors
# corresponding to the four input indices
embed_vectors = embeddings(input_index)

In [26]:
# for a simple embedding, we just sum the four input vectors and
# reshape the result into a single row vector
embeds = embeddings(input_index).sum(dim=0).view(1, -1)
embeds.shape

torch.Size([1, 50])
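
The lookup described above can be checked directly: a forward pass through `nn.Embedding` is row selection from the layer's weight matrix. The following is a minimal sketch, assuming the same sizes as the cells above; the seed and the names `looked_up`, `indexed` and `summed` are only there for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the sketch reproducible

embeddings = nn.Embedding(500, 50)
input_index = torch.tensor([4, 5, 6, 7], dtype=torch.long)

# The layer stores one (500, 50) weight matrix ...
print(embeddings.weight.shape)

# ... and a forward pass simply selects rows from that matrix.
looked_up = embeddings(input_index)
indexed = embeddings.weight[input_index]
print(looked_up.shape)
print(torch.equal(looked_up, indexed))

# Summing over the first dimension and reshaping gives a (1, 50) vector,
# as in the cell above.
summed = looked_up.sum(dim=0).view(1, -1)
print(summed.shape)
```

Because the lookup is plain indexing, the cost of a forward pass depends on how many indices you pass in, not on the vocabulary size.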

### Initialization of embedding vectors
* `nn.init.normal_(embedding.weight)`: initializes the weights with random values drawn from a normal distribution with a mean of 0 and std of 1
* `nn.init.constant_(embedding.weight, 0)`: initializes the weights with the specific constant value of 0
* `nn.init.xavier_uniform_()` and `nn.init.xavier_normal_()` are designed to work well with sigmoid and tanh activation functions. They scale the random values by the layer's fan-in and fan-out so that the weights stay close to zero without becoming too small
* `nn.init.kaiming_uniform_()` and `nn.init.kaiming_normal_()` work well with ReLU and its variants (LeakyReLU, PReLU, RReLU, etc.)
* the weights can also be initialized from pre-trained word vectors such as GloVe or word2vec, which have been trained on large corpora and have been shown to be useful for many NLP tasks. Continuing to train these pre-trained vectors on your own task is called fine-tuning
``` python
    import torch
    import torch.nn as nn
    from torchtext.vocab import GloVe  # needed only for the GloVe example below

    # Stand-in for a pre-trained embedding matrix (example only, not actual pre-trained embeddings)
    pretrained_embeddings = torch.randn(10, 50)

    # Create a layer of matching shape and copy the pre-trained embeddings into it
    embedding = nn.Embedding(10, 50)
    with torch.no_grad():
        embedding.weight.copy_(pretrained_embeddings)

    # you can also use the from_pretrained() method
    embedding_layer = nn.Embedding.from_pretrained(pretrained_embeddings)

    # or load real pre-trained vectors (the GloVe files are downloaded on first use)
    glove = GloVe(name='6B', dim=300)
    embedding_layer = nn.Embedding.from_pretrained(glove.vectors)
```

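  + you can freeze the embedding layer so that it is not trained by setting
  `embedding_layer.weight.requires_grad = False` (layers created with `from_pretrained()` are frozen by default; pass `freeze=False` to keep them trainable)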
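
To make the initializers and the freezing behaviour above concrete, here is a small sketch. The layer size (1000 x 64), the seed, and the variable names are arbitrary choices for illustration, and the random matrix again stands in for real pre-trained vectors.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the sketch reproducible

embedding = nn.Embedding(1000, 64)

# Standard normal: mean 0, std 1 (this is also what nn.Embedding does by default).
nn.init.normal_(embedding.weight)
normal_std = embedding.weight.std().item()

# Constant: every entry becomes 0.
nn.init.constant_(embedding.weight, 0.0)
all_zero = bool((embedding.weight == 0).all())

# Xavier uniform: values drawn from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
nn.init.xavier_uniform_(embedding.weight)
xavier_bound = (6.0 / (1000 + 64)) ** 0.5
xavier_max = embedding.weight.abs().max().item()

print(normal_std, all_zero, xavier_max <= xavier_bound)

# Freezing: from_pretrained() freezes the weights unless freeze=False is passed.
pretrained = torch.randn(10, 50)  # stand-in, not real pre-trained vectors
frozen = nn.Embedding.from_pretrained(pretrained)
trainable = nn.Embedding.from_pretrained(pretrained, freeze=False)
print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```

A frozen layer keeps the pre-trained vectors fixed, while `freeze=False` lets the optimizer update them during training.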


### Other parameters
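* `sparse`: if `True`, the gradient with respect to the weight matrix is a sparse tensor
* `padding_idx`: the vector at this index is not updated during training (it receives no gradient), which is useful for padding tokens
* `max_norm`: any looked-up vector whose norm is larger than `max_norm` is rescaled to have norm `max_norm`
* `norm_type`: the `p` of the p-norm used for `max_norm` (default 2)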
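
A short sketch of how these options behave in practice. The sizes, the seed, and the layer names (`padded`, `clipped`, `sparse_emb`) are made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the sketch reproducible

# padding_idx: row 0 starts as all zeros and receives no gradient.
padded = nn.Embedding(10, 4, padding_idx=0)
batch = torch.tensor([[3, 5, 0, 0]])  # one sequence, padded with index 0
padded(batch).sum().backward()
print(padded.weight[0])       # the padding vector itself
print(padded.weight.grad[0])  # its gradient stays zero

# max_norm / norm_type: looked-up rows are rescaled in place so that
# their p-norm (p = norm_type, default 2) is at most max_norm.
clipped = nn.Embedding(10, 4, max_norm=1.0, norm_type=2.0)
vectors = clipped(torch.tensor([1, 2, 3]))
print(vectors.norm(dim=1))

# sparse: the gradient of the weight matrix is stored as a sparse tensor.
sparse_emb = nn.Embedding(10, 4, sparse=True)
sparse_emb(torch.tensor([1, 2])).sum().backward()
print(sparse_emb.weight.grad.is_sparse)
```

Sparse gradients are only supported by some optimizers, so check the optimizer you plan to use before turning `sparse` on.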

In [30]:
from scipy.sparse import csr_matrix
import numpy as np
# Initialize a sparse matrix: This could be your training set
X_train = csr_matrix(np.array([[1, 0, 1, 0],
                               [0, 0, 1, 1],
                               [1, 1, 1, 0]]))
# Get one row: One sample in the training set
row = X_train.getrow(0)

w_linear = nn.Linear(4, 3, bias=False)
w_linear.weight

# Build an embedding table whose rows are the columns of the linear layer's weight
w_embedding = nn.Embedding.from_pretrained(w_linear.weight.T.detach())
w_embedding.weight

Parameter containing:
tensor([[ 0.1929,  0.2584,  0.4394],
        [-0.0670,  0.4838,  0.0833],
        [ 0.3219,  0.2301,  0.2915],
        [ 0.1095,  0.1989,  0.3667]])

In [32]:
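# row.indices holds the column indices of the non-zero entries (here 0 and 2),
# so the lookup returns exactly the weight rows that the dense product would use
row.indices
w_embedding(torch.tensor(row.indices, dtype=torch.long))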

tensor([[0.1929, 0.2584, 0.4394],
        [0.3219, 0.2301, 0.2915]])
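
The point of the cells above is that an embedding lookup is a shortcut for multiplying a one-hot (or multi-hot) row by a weight matrix. The sketch below checks that equivalence. To stay self-contained it uses a plain tensor in place of the scipy row (the values mirror row 0 of `X_train`), and the names `dense_out` and `lookup_out` are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only to make the sketch reproducible

w_linear = nn.Linear(4, 3, bias=False)
w_embedding = nn.Embedding.from_pretrained(w_linear.weight.T.detach())

# Dense route: multiply the full multi-hot row by the weight matrix.
dense_row = torch.tensor([[1.0, 0.0, 1.0, 0.0]])  # same values as row 0 of X_train
dense_out = w_linear(dense_row)                   # shape (1, 3)

# Sparse route: look up only the non-zero columns and add the vectors up.
idx = dense_row[0].nonzero().squeeze(1)           # indices of the non-zero entries
lookup_out = w_embedding(idx).sum(dim=0, keepdim=True)

print(idx)
print(torch.allclose(dense_out, lookup_out))
```

The two routes agree because every non-zero entry is 1; with other values the looked-up vectors would have to be weighted before summing.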

### Transformers
``` python
    import torch
    import torch.nn as nn

    class Transformer(nn.Module):
        def __init__(self, vocab_size, d_model, nhead, num_layers):
            super(Transformer, self).__init__()
            # This is our holy embedding layer - the topic of this post
            self.embedding = nn.Embedding(vocab_size, d_model)

            # This is the transformer itself. It contains an encoder stack and a decoder stack
            self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                              num_encoder_layers=num_layers,
                                              num_decoder_layers=num_layers)

            # This is the final fully connected layer that produces a score (logit) for each word in the vocabulary
            self.fc = nn.Linear(d_model, vocab_size)

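        def forward(self, src, tgt):
            # Pass the source and target token indices through the (shared) embedding layer
            src = self.embedding(src)
            tgt = self.embedding(tgt)

            # Pass both sequences through the transformer layers (NOTE: A positional encoding is usually added to the embeddings first. I left it out for simplicity)
            # nn.Transformer needs a source and a target and, by default, expects shape (seq_len, batch, d_model)
            x = self.transformer(src, tgt)
            # Pass the transformer output through the final linear layer
            x = self.fc(x)
            return x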

    # Initialize the model
    vocab_size = 10
    d_model = 50
    nhead = 2
    num_layers = 3
    model = Transformer(vocab_size, d_model, nhead, num_layers)
```
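
The positional encoding that the note in `forward` leaves out could be sketched as follows. This assumes the fixed sine/cosine scheme; the helper name `sinusoidal_positions`, the sequence lengths, and `dim_feedforward=64` are choices made for this example, and attention masks are omitted to keep it short.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) table of fixed sine/cosine position codes."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(seq_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even columns
    table[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return table

vocab_size, d_model, nhead, num_layers = 10, 50, 2, 3
embedding = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                             num_encoder_layers=num_layers,
                             num_decoder_layers=num_layers,
                             dim_feedforward=64)
fc = nn.Linear(d_model, vocab_size)

# nn.Transformer defaults to batch_first=False: tensors are (seq_len, batch, d_model).
src = torch.randint(0, vocab_size, (7, 2))  # 7 source tokens, batch of 2
tgt = torch.randint(0, vocab_size, (5, 2))  # 5 target tokens, batch of 2

# Add (not concatenate) the position codes; unsqueeze(1) broadcasts over the batch.
src_emb = embedding(src) + sinusoidal_positions(7, d_model).unsqueeze(1)
tgt_emb = embedding(tgt) + sinusoidal_positions(5, d_model).unsqueeze(1)

logits = fc(transformer(src_emb, tgt_emb))
print(logits.shape)
```

The output has one row of vocabulary scores per target position and batch element, so its shape follows the target sequence rather than the source.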
