# The Attention Mechanism   

## Introduction ## 

______________________________________________________________________________________________________________

![](assets/Caption_Generator_Attention.png)

[Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044.pdf)


![](assets/Align_And_Translate_Attention.png)

[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473v7.pdf)

___________________________

## I've seen what attention does visually; let's take a step forward ## 

![](assets/Align_And_Translate_Attention.png)

In [None]:
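import torch
import torch.nn as nn
import torch.nn.functional as F

HDDN_DIM = ENC_DIM = 5  # same sizes as in the "Internals" section

# Query entity: (batch_size, HDDN_DIM)
hidden = torch.randn(1, HDDN_DIM)
# List of entities: (seq_len, batch_size, ENC_DIM)
encoder_outputs = torch.randn(14, 1, ENC_DIM)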

In [None]:
encoder_outputs.shape

torch.Size([14, 1, 5])

In [None]:
hidden.shape

torch.Size([1, 5])

In [None]:
encoder_outputs_attention_weights, energy = attention(hidden,encoder_outputs)

# bmm 
context_vector = torch.bmm(
    encoder_outputs_attention_weights.unsqueeze(1),
    encoder_outputs.permute(1, 0, 2))

In [None]:
context_vector.shape

torch.Size([1, 1, 5])

In [None]:
encoder_outputs_attention_weights

tensor([[0.0068, 0.0087, 0.0099, 0.0084, 0.0070, 0.0063, 0.0093, 0.0092, 0.0091,
         0.0069, 0.0091, 0.0067, 0.0067, 0.0070, 0.0070, 0.0059, 0.0082, 0.0076,
         0.0056, 0.0045, 0.0047, 0.0079, 0.0074, 0.0046, 0.0057, 0.0089, 0.0066,
         0.0061, 0.0060, 0.0097, 0.0066, 0.0068, 0.0055, 0.0088, 0.0051, 0.0070,
         0.0076, 0.0077, 0.0074, 0.0080, 0.0099, 0.0052, 0.0047, 0.0064, 0.0086,
         0.0074, 0.0065, 0.0056, 0.0081, 0.0063, 0.0071, 0.0072, 0.0086, 0.0070,
         0.0049, 0.0054, 0.0074, 0.0086, 0.0067, 0.0056, 0.0055, 0.0086, 0.0065,
         0.0062, 0.0075, 0.0088, 0.0056, 0.0062, 0.0092, 0.0079, 0.0083, 0.0073,
         0.0072, 0.0073, 0.0084, 0.0065, 0.0058, 0.0061, 0.0086, 0.0074, 0.0072,
         0.0074, 0.0111, 0.0054, 0.0060, 0.0077, 0.0067, 0.0079, 0.0091, 0.0062,
         0.0048, 0.0053, 0.0063, 0.0089, 0.0066, 0.0067, 0.0090, 0.0047, 0.0075,
         0.0081, 0.0074, 0.0092, 0.0086, 0.0070, 0.0078, 0.0058, 0.0071, 0.0051,
         0.0064, 0.0060, 0.0

In [None]:
encoder_outputs_attention_weights.sum()

tensor(1., grad_fn=<SumBackward0>)


The Input 
---------------


In [None]:
encoder_outputs.shape

torch.Size([14, 1, 5])

In [None]:
hidden.shape

torch.Size([1, 5])

The Output 
---------------

In [None]:
encoder_outputs_attention_weights, energy

(tensor([[0.0897, 0.0583, 0.0661, 0.0786, 0.0655, 0.0809, 0.0639, 0.0688, 0.0652,
          0.0619, 0.0821, 0.0815, 0.0630, 0.0745]], grad_fn=<SoftmaxBackward>),
 tensor([[ 0.1802, -0.2512, -0.1253,  0.0478, -0.1342,  0.0772, -0.1589, -0.0844,
          -0.1381, -0.1903,  0.0922,  0.0851, -0.1732, -0.0048]],
        grad_fn=<SqueezeBackward1>))

In [None]:
encoder_outputs_attention_weights.sum()

tensor(1.0000, grad_fn=<SumBackward0>)

In [None]:
encoder_outputs.shape

torch.Size([50, 1, 5])

BMM
--------


In [None]:
encoder_outputs.shape

torch.Size([140, 1, 5])

In [None]:
encoder_outputs_attention_weights.shape

torch.Size([1, 140])

In [None]:
encoder_outputs_small = torch.tensor([
    [1.,2.], # [1.,2.]*0.7 = [0.7, 1.4]
    [2.,1.], # [2.,1.]*0.2 = [0.4, 0.2]
    [1.,1.]  # [1.,1.]*0.1 = [0.1, 0.1]
]).view([3,1,2])

attn_weights_small = torch.tensor([
    [0.7],
    [0.2],
    [0.1]
]).view(1,3)

In [None]:
encoder_outputs_small.shape

torch.Size([3, 1, 2])

In [None]:
attn_weights_small.shape

torch.Size([1, 3])

In [None]:
torch.bmm(
    attn_weights_small.unsqueeze(1),
    encoder_outputs_small.permute(1, 0, 2))
# This vector is supposed to be 70% of [1., 2.], 20% of [2., 1.] and 10% of [1., 1.]

tensor([[[1.2000, 1.7000]]])

The context vector obtained is supposed to be a vector representation of: 
1. The image with the focus on the frisbee from the first example. 
2. The source sentence with the focus on the word 'in' in the NMT example. 
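The `bmm` step above is just a weighted sum of the encoder outputs. As a minimal sketch in plain Python (so it runs without PyTorch; the helper name `weighted_sum` is made up for this illustration), the same toy numbers give the same context vector:

```python
# Context vector as a weighted sum of encoder outputs (plain-Python sketch).
encoder_outputs_small = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]]
attn_weights_small = [0.7, 0.2, 0.1]

def weighted_sum(weights, vectors):
    """Return sum_j weights[j] * vectors[j], computed per dimension."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

context_small = weighted_sum(attn_weights_small, encoder_outputs_small)
print([round(x, 4) for x in context_small])  # [1.2, 1.7]
```

Because the weights sum to 1, the context vector always lies "between" the encoder outputs, pulled towards the ones with the largest weights.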

## Internals of the Attention Mechanism ## 

In [None]:
# ENC_HID_DIM = 5
# DEC_HID_DIM = 5

HDDN_DIM = 5
ENC_DIM = 5
DEC_DIM = 5

attn = nn.Linear(HDDN_DIM + ENC_DIM, DEC_DIM)
v = nn.Linear(DEC_DIM, 1, bias = False)

In [None]:
attn

Linear(in_features=10, out_features=5, bias=True)

In [None]:
hidden = torch.randn(1,HDDN_DIM) # hidden = entity_hidden
encoder_outputs = torch.randn(10, 1, ENC_DIM) # encoder_outputs = [entity_1, entity_2, entity_3]

In [None]:
encoder_outputs.shape #(seq_len, batchsize, vector_size)

torch.Size([10, 1, 5])

In [None]:
# query entity
hidden.shape

torch.Size([1, 5])

In [None]:
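def attention(hidden, encoder_outputs):
    # hidden: (batch_size, HDDN_DIM); encoder_outputs: (src_len, batch_size, ENC_DIM)
    src_len = encoder_outputs.shape[0]

    # repeat the decoder hidden state src_len times: (batch_size, src_len, HDDN_DIM)
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
    # put the batch dimension first: (batch_size, src_len, ENC_DIM)
    encoder_outputs = encoder_outputs.permute(1, 0, 2)

    # energy function: (batch_size, src_len, DEC_DIM)
    energy = torch.tanh(attn(torch.cat((hidden, encoder_outputs), dim=2)))
    # one unnormalized score per source position: (batch_size, src_len)
    scores = v(energy).squeeze(2)

    # softmax over the source positions, plus the raw scores
    return F.softmax(scores, dim=1), scores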

In [None]:
hidden.shape

torch.Size([1, 5])

In [None]:
encoder_outputs.shape

torch.Size([14, 1, 5])

Walking Back 
--------------------

In [None]:
encoder_outputs_attention_softmax_weights, energy  = attention(hidden, encoder_outputs)

# bmm 
context_vector = torch.bmm(
    encoder_outputs_attention_softmax_weights.unsqueeze(1),
    encoder_outputs.permute(1, 0, 2))

The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$. Intuitively, this implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to.
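For reference, the quantities named in this paragraph are defined in the linked Bahdanau et al. paper as follows (notation from the paper: $T_x$ is the source length and $a$ is the learned alignment model):

$$
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

Read against the code above, this is only a rough correspondence: `hidden` stands in for $s_{i-1}$, each row of `encoder_outputs` for $h_j$, `attn` followed by `tanh` and `v` plays the role of $a$, the softmax output is $\alpha_{ij}$, and the `bmm` result is the context vector $c_i$.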

In [None]:
HDDN_DIM = 5
ENC_DIM = 5
DEC_DIM = 5

attn = nn.Linear(HDDN_DIM + ENC_DIM, DEC_DIM)
v = nn.Linear(DEC_DIM, 1, bias = False)

In [None]:
q = torch.randn(1,HDDN_DIM + ENC_DIM)

In [None]:
q.shape

torch.Size([1, 10])

In [None]:
attn(q).shape

torch.Size([1, 5])

In [None]:
hidden = torch.randn(1,HDDN_DIM) # hidden = entity_hidden
encoder_outputs = torch.randn(10, 1, ENC_DIM) # encoder_outputs = [entity_1, entity_2, entity_3]

### The Attention Function ### 

Let's zoom into the attention function. I've taken every line and put it in its own cell.

In [None]:
encoder_outputs.shape

torch.Size([14, 1, 5])

In [None]:
batch_size = encoder_outputs.shape[1]
src_len = encoder_outputs.shape[0]

In [None]:
src_len

14

In [None]:
hidden.shape

torch.Size([1, 5])

In [None]:
#repeat decoder hidden state src_len times
hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
encoder_outputs = encoder_outputs.permute(1, 0, 2)


In [None]:
hidden.shape

torch.Size([1, 14, 5])

In [None]:
encoder_outputs.shape

torch.Size([1, 14, 5])

In [None]:
torch.tanh(
    attn(
        torch.cat(
            (hidden,
            encoder_outputs), dim = 2)
    )
).shape

torch.Size([1, 14, 5])

In [None]:
# energy function 
energy = torch.tanh(
    attn(
        torch.cat(
            (hidden,
            encoder_outputs), dim = 2)
    )
) 



In [None]:
scores = v(energy).squeeze(2)  # named `scores` so the `attention` function is not overwritten

In [None]:
energy.shape

torch.Size([1, 14, 5])

In [None]:
scores

tensor([[0.3557, 0.3651, 0.4421, 0.4193, 0.4137, 0.6503, 0.3485, 0.2695, 0.4164,
         0.5820, 0.4092, 0.2467, 0.5506, 0.5263]], grad_fn=<SqueezeBackward1>)

In [None]:
F.softmax(scores, dim=1).sum()

tensor(1., grad_fn=<SumBackward0>)
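To see why the softmax output sums to 1, here is a minimal plain-Python sketch of the softmax step (not PyTorch's implementation; the `softmax` helper is written just for this illustration), applied to the scores printed above:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.3557, 0.3651, 0.4421, 0.4193, 0.4137, 0.6503, 0.3485,
          0.2695, 0.4164, 0.5820, 0.4092, 0.2467, 0.5506, 0.5263]
weights = softmax(scores)

print(abs(sum(weights) - 1.0) < 1e-9)                            # True
print(weights.index(max(weights)) == scores.index(max(scores)))  # True
```

The softmax keeps the ordering of the scores, so the entity with the highest energy also receives the largest attention weight.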

# The Energy Function # 

![](assets/energy_functions.png)

[Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/pdf/1508.04025.pdf).
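For readers without the figure at hand, the linked Luong et al. paper defines three score (energy) functions. In that paper's notation, $h_t$ is the current target hidden state, $\bar{h}_s$ is a source hidden state, and $W_a$, $v_a$ are learned parameters:

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat}
\end{cases}
$$

The `attention` function from the previous section follows the concat form, and the "General Attention" cells in this section follow the general form.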

## General Attention ## 

In [None]:
n_hidden = 5
attn = nn.Linear(n_hidden, n_hidden)

In [None]:
def get_att_weight(dec_output, enc_outputs):
    # Attention weights of one `dec_output` over all `enc_outputs`.
    n_step = len(enc_outputs)
    attn_scores = torch.zeros(n_step)  # attn_scores: [n_step]

    for i in range(n_step):
        attn_scores[i] = get_att_score(dec_output, enc_outputs[i])

    # Normalize the scores into weights in the range 0 to 1 that sum to 1.
    # Passing dim explicitly avoids the implicit-dim deprecation warning.
    return F.softmax(attn_scores, dim=0).view(1, 1, -1), attn_scores

def get_att_score(dec_output, enc_output):  # enc_output: [batch_size, n_hidden]
    score = attn(enc_output)  # score: [batch_size, n_hidden]
    return torch.dot(dec_output.view(-1), score.view(-1))  # inner product gives a scalar

In [None]:
encoder_outputs = torch.randn(3, 1, 5)
hidden = torch.randn(1, 1, 5)

In [None]:
encoder_outputs_attention_softmax_weights, energy = get_att_weight(hidden, encoder_outputs)


In [None]:
encoder_outputs_attention_softmax_weights

tensor([[[3.4176e-04, 8.2757e-01, 1.7209e-01]]], grad_fn=<ViewBackward>)

In [None]:
energy

tensor([-5.8196,  1.9725,  0.4020], grad_fn=<CopySlices>)
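What `get_att_score` computes can be spelled out without PyTorch. The sketch below is plain Python with made-up 2-dimensional values (the matrix `W`, the bias `b` and the vectors are hypothetical, not taken from the run above); it mirrors a linear layer followed by a dot product:

```python
def general_score(dec_output, enc_output, W, b):
    """score = dec_output . (W @ enc_output + b), mirroring get_att_score."""
    projected = [sum(w_ij * x_j for w_ij, x_j in zip(row, enc_output)) + b_i
                 for row, b_i in zip(W, b)]
    return sum(d * p for d, p in zip(dec_output, projected))

# Hypothetical toy values, chosen so the arithmetic is easy to follow.
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
dec_output = [1.0, 1.0]
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = [general_score(dec_output, h, W, b) for h in enc_outputs]
print(scores)  # [1.0, 2.0, 3.0]
```

Compared with the concat energy used earlier, there is no `tanh` and no extra vector `v`: the learned matrix alone decides how decoder and encoder dimensions interact.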

## Further Reading ## 
Before we move on to the next topic, note that we've covered enough ground for you to read implementations of the attention mechanism and research papers that use it for various tasks.

[**Neural Machine Translation by Jointly Learning to Align and Translate**](https://arxiv.org/pdf/1409.0473v7.pdf): This paper introduced the attention mechanism in the context of the NMT task. Implementations of this paper are listed on [Papers with Code](https://paperswithcode.com/paper/neural-machine-translation-by-jointly).


[**Effective Approaches to Attention-based Neural Machine Translation**](https://arxiv.org/pdf/1508.04025.pdf): This paper introduces the energy functions discussed above.

## Self-Attention ## 

[transformers](https://huggingface.co/transformers/)

In [None]:
import torch
from transformers import AutoTokenizer
from transformers.models.distilbert.modeling_distilbert import MultiHeadSelfAttention
from transformers.models.distilbert.configuration_distilbert import DistilBertConfig

In [None]:
model_checkpoint = "distilbert-base-uncased"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokenizer 

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [None]:
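def embeddify(text):
    # Tokenize once, then embed the ids as a batch of size 1.
    token_ids = tokenizer(text)['input_ids']
    input_ids = torch.tensor(token_ids).view(1, -1)  # (1, seq_len)
    # `embeddings` is the embedding module, assumed to be defined earlier in the notebook.
    return embeddings(input_ids), tokenizer.convert_ids_to_tokens(token_ids)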

In [None]:
config = DistilBertConfig()  # assumption: default config, which matches the values printed below
multi_head_attn = MultiHeadSelfAttention(config)

In [None]:
config

DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "vocab_size": 30522
}

In [None]:
embeddified_text, tokens = embeddify('Ronaldo is one of the best football players in the world')
# x = torch.randn(1,10,config.dim) # (bs, seq_length, dim)


In [None]:
embeddified_text.shape

torch.Size([1, 14, 768])

In [None]:
tokens

['[CLS]',
 'ronald',
 '##o',
 'is',
 'one',
 'of',
 'the',
 'best',
 'football',
 'players',
 'in',
 'the',
 'world',
 '[SEP]']

In [None]:
mask = torch.ones(1,14)

In [None]:
multi_head_attn_op = multi_head_attn(
    embeddified_text,
    embeddified_text,
    embeddified_text,
    mask)

In [None]:
len(multi_head_attn_op)

1

In [None]:
multi_head_attn_op[0].shape

torch.Size([1, 14, 768])

![](assets/transformer_self-attention_visualization.png)

In [None]:
chatuur_multi_head_attn = Chatuur_MultiHeadSelfAttention(config)

In [None]:
op = chatuur_multi_head_attn(embeddified_text,
                             embeddified_text,
                             embeddified_text,
                             mask)

> <ipython-input-819-8f2a1170a3c9>(46)forward()
     44         """
     45         import pdb;pdb.set_trace()
---> 46         bs, q_length, dim = query.size()
     47         k_length = key.size(1)
     48         # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)

ipdb> n
> <ipython-input-819-8f2a1170a3c9>(47)forward()
     45         import pdb;pdb.

ipdb> 
> <ipython-input-819-8f2a1170a3c9>(67)forward()
     65         v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)
     66 
---> 67         q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)
     68         scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
     69         mask = (

ipdb> n
> <ipython-input-819-8f2a1170a3c9>(80)forward()
     78 
     79         context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)
---> 80         context = unshape(context)  # (bs, q_length, dim)
     81         context = self.out_lin(context)  # (bs, q_length, dim)
     82 

ipdb> p context.shape
torch.Size([1, 12, 14, 64])
ipdb> p weights.shape
torch.Size([1, 12, 14, 14])
ipdb> v.shape
torch.Size([1, 12, 14, 64])
ipdb> q


BdbQuit: 

In [None]:
# Here we have the MultiHeadSelfAttention from the transformers library.
class Chatuur_MultiHeadSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.n_heads = config.n_heads
        self.dim = config.dim
        self.dropout = nn.Dropout(p=config.attention_dropout)

        assert self.dim % self.n_heads == 0

        self.q_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
        self.k_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
        self.v_lin = nn.Linear(in_features=config.dim, out_features=config.dim)
        self.out_lin = nn.Linear(in_features=config.dim, out_features=config.dim)

        self.pruned_heads = set()

    def prune_heads(self, heads):
        attention_head_size = self.dim // self.n_heads
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(heads, self.n_heads, attention_head_size, self.pruned_heads)
        # Prune linear layers
        self.q_lin = prune_linear_layer(self.q_lin, index)
        self.k_lin = prune_linear_layer(self.k_lin, index)
        self.v_lin = prune_linear_layer(self.v_lin, index)
        self.out_lin = prune_linear_layer(self.out_lin, index, dim=1)
        # Update hyper params
        self.n_heads = self.n_heads - len(heads)
        self.dim = attention_head_size * self.n_heads
        self.pruned_heads = self.pruned_heads.union(heads)

    def forward(self, query, key, value, mask, head_mask=None, output_attentions=False):
        """
        Parameters:
            query: torch.tensor(bs, seq_length, dim)
            key: torch.tensor(bs, seq_length, dim)
            value: torch.tensor(bs, seq_length, dim)
            mask: torch.tensor(bs, seq_length)

        Returns:
            context: torch.tensor(bs, seq_length, dim) Contextualized layer.
            weights: torch.tensor(bs, n_heads, seq_length, seq_length) Attention weights.
                Optional: only if `output_attentions=True`
        """
        import pdb;pdb.set_trace()
        bs, q_length, dim = query.size()
        k_length = key.size(1)
        # assert dim == self.dim, 'Dimensions do not match: %s input vs %s configured' % (dim, self.dim)
        # assert key.size() == value.size()

        dim_per_head = self.dim // self.n_heads

        mask_reshp = (bs, 1, 1, k_length)

        def shape(x):
            """ separate heads """
            return x.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)

        def unshape(x):
            """ group heads """
            return x.transpose(1, 2).contiguous().view(bs, -1, self.n_heads * dim_per_head)

        # query object
        q = shape(self.q_lin(query))  # (bs, n_heads, q_length, dim_per_head)

        # list of things to attend over; `shape` splits the last dimension
        # into n_heads chunks of size dim_per_head
        k = shape(self.k_lin(key))  # (bs, n_heads, k_length, dim_per_head)
        v = shape(self.v_lin(value))  # (bs, n_heads, k_length, dim_per_head)

        # The "Attention Is All You Need" paper scales the dot products by
        # 1/sqrt(dim_per_head) to keep the softmax out of regions with very small gradients.
        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)

        # Dot energy function: every token acts as a query and is scored
        # against every token of the same sequence (self-attention).
        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)

        # Masking is covered later: positions where mask == 0 get a score of -inf
        mask = (mask == 0).view(mask_reshp).expand_as(scores)  # (bs, n_heads, q_length, k_length)
        scores.masked_fill_(mask, -float("inf"))  # (bs, n_heads, q_length, k_length)

        weights = nn.Softmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)
        weights = self.dropout(weights)  # (bs, n_heads, q_length, k_length)

        # Mask heads if we want to
        if head_mask is not None:
            weights = weights * head_mask

        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)
        context = unshape(context)  # (bs, q_length, dim)
        context = self.out_lin(context)  # (bs, q_length, dim)

        if output_attentions:
            return (context, weights)
        else:
            return (context,)
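Stripped of the linear projections, multiple heads, dropout and masking, the core of `forward` is scaled dot-product attention. The following is a minimal single-head sketch in plain Python, not the library's code; the function names and the toy input `x` are made up for this illustration:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """q: (q_len, d), k: (k_len, d), v: (k_len, d_v), all as nested lists."""
    d = len(q[0])
    context, all_weights = [], []
    for q_i in q:
        # scaled dot product of one query against every key
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        weights = softmax(scores)
        # weighted sum of the value vectors
        context.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                        for c in range(len(v[0]))])
        all_weights.append(weights)
    return context, all_weights

# Hypothetical toy input: 3 tokens with 2-dimensional vectors, used as q, k and v.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = scaled_dot_product_attention(x, x, x)
print(len(context), len(context[0]))   # 3 2
print(len(weights), len(weights[0]))   # 3 3
```

Passing the same sequence as query, key and value is what makes this self-attention: the weights form a `q_length × k_length` table, matching the `(14, 14)` slice per head seen in the debugger output.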

## Conclusion ## 

Now, I am leaving a few questions for the next video: 
1. What is the purpose of q_lin, k_lin and v_lin? 
2. What is a mask? 

## When I'm using attention, what does that look like? ## 

In [None]:
# Properties of Attention: 
#     1. Fetches information from a list of entities. 
#     2. Represents the fetched information as one entity

In [None]:
# Query Entity 
hidden = torch.randn(1,HDDN_DIM) # hidden = entity_hidden

# List of Entities
encoder_outputs = torch.randn(5, 1, ENC_DIM) # encoder_outputs = [entity_1, entity_2, entity_3]

In [None]:
encoder_outputs.shape

torch.Size([5, 1, 5])

In [None]:
encoder_outputs_attention_weights, energy  = attention(hidden, encoder_outputs)

# bmm 
context_vector = torch.bmm(
    encoder_outputs_attention_weights.unsqueeze(1),
    encoder_outputs.permute(1, 0, 2))

In [None]:
encoder_outputs_attention_weights

tensor([[0.1984, 0.2289, 0.1863, 0.1938, 0.1925]], grad_fn=<SoftmaxBackward>)

In [None]:
context_vector.shape

torch.Size([1, 1, 5])

In [None]:
encoder_outputs_attention_weights

tensor([[0.3430, 0.3196, 0.3373]], grad_fn=<SoftmaxBackward>)

In [None]:
encoder_outputs.shape

torch.Size([3, 1, 5])

In [None]:
F.softmax(torch.tensor([[100., 75., 5., 4.]]), dim=1)

tensor([[1.0000e+00, 1.3888e-11, 5.5211e-42, 2.0305e-42]])

In [None]:
encoder_outputs[0]*0.37

tensor([[ 0.0524,  0.0319, -0.1579, -0.6566,  0.0464]])

In [None]:
encoder_outputs[1]*0.33

tensor([[ 0.3120,  0.5839, -0.3179, -0.1141, -0.1452]])

In [None]:
encoder_outputs[1]*0.29

tensor([[ 0.2742,  0.5132, -0.2793, -0.1002, -0.1276]])

In [None]:
encoder_outputs[0]*0.3704 + encoder_outputs[1]*0.3327 + encoder_outputs[2]*0.2969

tensor([[ 0.3082,  1.0963, -0.7871, -0.5865, -0.1027]])

In [None]:
context_vector = torch.bmm(
    encoder_outputs_attention_softmax_weights.unsqueeze(1),
    encoder_outputs.permute(1, 0, 2))

In [None]:
context_vector

tensor([[[ 0.3082,  1.0963, -0.7871, -0.5866, -0.1027]]],
       grad_fn=<BmmBackward0>)

In [None]:
context_vector

tensor([[[ 1.0724, -0.2875,  0.0412, -0.3702,  0.0347]]],
       grad_fn=<BmmBackward0>)

In [None]:
energy

tensor([[0.3601, 0.4211, 0.5020]], grad_fn=<SqueezeBackward1>)

In [None]:
encoder_outputs_attention_softmax_weights

tensor([[0.3110, 0.3306, 0.3584]], grad_fn=<SoftmaxBackward>)

In [None]:
hidden

tensor([[-0.9320, -2.8623,  0.2094,  0.2498, -0.2511]])

In [None]:
encoder_outputs

tensor([[[-2.6336e-05,  8.2289e-01,  7.7262e-02,  1.9560e+00,  1.5071e-02]],

        [[ 7.4831e-02, -1.6805e+00, -9.8090e-01, -6.8144e-01, -7.4640e-03]],

        [[-1.5339e-01, -6.5884e-03,  3.0120e-01, -1.7828e+00, -1.1082e+00]]])

In [None]:
print(encoder_outputs_attention_softmax_weights.unsqueeze(1).shape)
print(encoder_outputs_attention_softmax_weights.shape)

torch.Size([1, 1, 3])
torch.Size([1, 3])


In [None]:
print(encoder_outputs.permute(1, 0, 2).shape)
print(encoder_outputs.shape)

torch.Size([1, 3, 5])
torch.Size([3, 1, 5])


In [None]:
# BMM step. 
context_vector = torch.bmm(
    encoder_outputs_attention_softmax_weights.unsqueeze(1),
    encoder_outputs.permute(1, 0, 2))

In [None]:
encoder_outputs_attention_softmax_weights

tensor([[0.3234, 0.3342, 0.3424]], grad_fn=<SoftmaxBackward>)

In [None]:
encoder_outputs

tensor([[[-2.6336e-05,  8.2289e-01,  7.7262e-02,  1.9560e+00,  1.5071e-02]],

        [[ 7.4831e-02, -1.6805e+00, -9.8090e-01, -6.8144e-01, -7.4640e-03]],

        [[-1.5339e-01, -6.5884e-03,  3.0120e-01, -1.7828e+00, -1.1082e+00]]])

In [None]:
0.2442*0.3492 + -1.0587*0.2922 + 0.5481*0.3586

-0.02752884

In [None]:
context_vector

tensor([[[-0.0275, -0.2978, -0.1997, -0.2056, -0.3770]]],
       grad_fn=<BmmBackward0>)

# The context of Discovery: NMT

![](assets/seq2seq1.png)

## [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) ##

## 1 Introduction ##

Neural machine translation is a newly emerging approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b). Unlike the traditional phrase-based translation system (see, e.g., Koehn et al., 2003) which consists of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single, large neural network that reads a sentence and outputs a correct translation.

Most of the proposed neural machine translation models belong to a family of encoder–decoders (Sutskever et al., 2014; Cho et al., 2014a), with an encoder and a decoder for each language, or involve a language-specific encoder applied to each sentence whose outputs are then compared (Hermann and Blunsom, 2014). An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

In [None]:
rnn_example_seq_tensor, seq = embeddify('This person is a good person')


In [None]:
rnn_example_seq_tensor.shape

torch.Size([1, 6, 768])

In [None]:
seq

['This', 'person', 'is', 'a', 'good', 'person']

In [None]:
rnn_example_hidden = torch.zeros(1,1,768)

In [None]:
example_rnn = nn.RNN(768,768,batch_first=True)

In [None]:
encoder_outputs,last_encoder_output = example_rnn(rnn_example_seq_tensor,rnn_example_hidden)

In [None]:
encoder_outputs.shape

torch.Size([1, 8, 768])

In [None]:
last_encoder_output.shape

torch.Size([1, 1, 768])

In [None]:
encoder_outputs[0][-1] == last_encoder_output[0][0]

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, Tr
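The comparison above holds by construction: for a single-layer, unidirectional RNN the last per-step output is the final hidden state. A minimal scalar sketch in plain Python makes this visible (it is not `nn.RNN`; the weights are made-up numbers and the biases are omitted):

```python
import math

def rnn_forward(inputs, h0, w_x, w_h):
    """Scalar Elman RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    outputs, h = [], h0
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
        outputs.append(h)   # one output per time step
    return outputs, h       # h is the final hidden state

# Hypothetical inputs and weights, for illustration only.
outputs, h_n = rnn_forward([0.5, -1.0, 0.25, 2.0], h0=0.0, w_x=0.8, w_h=0.3)
print(len(outputs))          # 4
print(outputs[-1] == h_n)    # True
```

A plain encoder–decoder hands only `h_n` to the decoder, while attention keeps every entry of `outputs` available.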

In [None]:
# code by Tae Hwan Jung @graykode
import numpy as np
import torch
import torch.nn as nn

# S: symbol that marks the start of the decoder input
# E: symbol that marks the end of the decoder target
# P: padding symbol that fills a sequence shorter than the number of time steps


In [None]:

# Model
class Seq2Seq(nn.Module):
    def __init__(self):
        super(Seq2Seq, self).__init__()

        self.enc_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)
        self.dec_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)
        self.fc = nn.Linear(n_hidden, n_class)

    def forward(self, enc_input, enc_hidden, dec_input):
        # inputs arrive batch-first: [batch_size, seq_len, n_class]
        enc_input = enc_input.transpose(0, 1) # enc_input: [max_len(=n_step, time step), batch_size, n_class]
        dec_input = dec_input.transpose(0, 1) # dec_input: [max_len(=n_step, time step), batch_size, n_class]

        # enc_states : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
        _, enc_states = self.enc_cell(enc_input, enc_hidden)
        # outputs : [max_len+1(=6), batch_size, num_directions(=1) * n_hidden(=128)]
        outputs, _ = self.dec_cell(dec_input, enc_states)

        model = self.fc(outputs) # model : [max_len+1(=6), batch_size, n_class]
        return model



In [None]:

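def make_batch(data=None):
    # Build one-hot encoder/decoder inputs and integer targets.
    # Defaults to the global `seq_data`; `translate` passes its own pair.
    if data is None:
        data = seq_data
    input_batch, output_batch, target_batch = [], [], []
    for src, tgt in data:
        # Pad both words with 'P' up to n_step characters (without mutating `data`).
        src = src + 'P' * (n_step - len(src))
        tgt = tgt + 'P' * (n_step - len(tgt))

        enc_ids = [num_dic[c] for c in src]
        dec_ids = [num_dic[c] for c in ('S' + tgt)]  # decoder input starts with 'S'
        target = [num_dic[c] for c in (tgt + 'E')]   # target ends with 'E'

        input_batch.append(np.eye(n_class)[enc_ids])
        output_batch.append(np.eye(n_class)[dec_ids])
        target_batch.append(target)  # not one-hot

    # make tensors
    return (torch.FloatTensor(np.array(input_batch)),
            torch.FloatTensor(np.array(output_batch)),
            torch.LongTensor(target_batch))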


In [None]:
n_step = 5
n_hidden = 128

char_arr = [c for c in 'SEPabcdefghijklmnopqrstuvwxyz']
num_dic = {n: i for i, n in enumerate(char_arr)}
seq_data = [['man', 'women'], ['black', 'white'], ['king', 'queen'], ['girl', 'boy'], ['up', 'down'], ['high', 'low']]

n_class = len(num_dic)
batch_size = len(seq_data)

model = Seq2Seq()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

input_batch, output_batch, target_batch = make_batch()





In [None]:
len(seq_data)

6

In [None]:
input_batch.shape
# batch_size, seq_len, vector_len

torch.Size([6, 5, 29])

In [None]:
for epoch in range(5):
    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, batch_size, n_hidden)

    optimizer.zero_grad()
    output = model(input_batch, hidden, output_batch)

    output = output.transpose(0, 1) # [batch_size, max_len+1(=6), n_class]
    loss = 0
    for i in range(0, len(target_batch)):
        loss += criterion(output[i], target_batch[i])
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
    loss.backward()
    optimizer.step()

# Test
def translate(word):
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]])

    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, 1, n_hidden)  # `args` is never defined; use the global n_hidden
    output = model(input_batch, hidden, output_batch)
    # output : [max_len+1(=6), batch_size(=1), n_class]

    predict = output.data.max(2, keepdim=True)[1] # select n_class dimension
    decoded = [char_arr[i] for i in predict]
    end = decoded.index('E')
    translated = ''.join(decoded[:end])

    return translated.replace('P', '')



A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al. (2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.

In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, **it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.** The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

The most important distinguishing feature of this approach from the basic encoder–decoder is that **it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.** This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.
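
The soft search described in the quoted passage can be written compactly. The block below restates the paper's definitions as I read them; the symbols are the paper's, not names used in this notebook: $h_j$ is the encoder vector for source position $j$, $s_{i-1}$ is the previous decoder state, $a$ is a small learned alignment network, and $T_x$ is the source length.

$$
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j
$$

The weights $\alpha_{ij}$ are positive and sum to 1 over the source positions, so the context vector $c_i$ is a weighted average of the encoder vectors, recomputed for every target word $i$.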

In [None]:
# code by Tae Hwan Jung @graykode
# Reference : https://github.com/hunkim/PyTorchZeroToAll/blob/master/14_2_seq2seq_att.py
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

# S: Symbol that marks the start of the decoder input
# E: Symbol that marks the end of the decoder output
# P: Symbol that pads a sequence when it is shorter than the number of time steps

def make_batch():
    input_batch = [np.eye(n_class)[[word_dict[n] for n in sentences[0].split()]]]
    output_batch = [np.eye(n_class)[[word_dict[n] for n in sentences[1].split()]]]
    target_batch = [[word_dict[n] for n in sentences[2].split()]]

    # make tensor
    return torch.FloatTensor(input_batch), torch.FloatTensor(output_batch), torch.LongTensor(target_batch)

class Attention(nn.Module):
    def __init__(self):
        super(Attention, self).__init__()
        self.enc_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)
        self.dec_cell = nn.RNN(input_size=n_class, hidden_size=n_hidden, dropout=0.5)

        # Linear for attention
        self.attn = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden * 2, n_class)

    def forward(self, enc_inputs, hidden, dec_inputs):
        enc_inputs = enc_inputs.transpose(0, 1)  # enc_inputs: [n_step(=n_step, time step), batch_size, n_class]
        dec_inputs = dec_inputs.transpose(0, 1)  # dec_inputs: [n_step(=n_step, time step), batch_size, n_class]

        # enc_outputs : [n_step, batch_size, num_directions(=1) * n_hidden], matrix F
        # enc_hidden : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
        enc_outputs, enc_hidden = self.enc_cell(enc_inputs, hidden)

        trained_attn = []
        hidden = enc_hidden
        n_step = len(dec_inputs)
        model = torch.empty([n_step, 1, n_class])

        for i in range(n_step):  # each time step
            # dec_output : [n_step(=1), batch_size(=1), num_directions(=1) * n_hidden]
            # hidden : [num_layers(=1) * num_directions(=1), batch_size(=1), n_hidden]
            dec_output, hidden = self.dec_cell(dec_inputs[i].unsqueeze(0), hidden)
            attn_weights = self.get_att_weight(dec_output, enc_outputs)  # attn_weights : [1, 1, n_step]
            trained_attn.append(attn_weights.squeeze().data.numpy())

            # matrix-matrix product of matrices [1,1,n_step] x [1,n_step,n_hidden] = [1,1,n_hidden]
            context = attn_weights.bmm(enc_outputs.transpose(0, 1))
            dec_output = dec_output.squeeze(0)  # dec_output : [batch_size(=1), num_directions(=1) * n_hidden]
            context = context.squeeze(1)  # [1, num_directions(=1) * n_hidden]
            model[i] = self.out(torch.cat((dec_output, context), 1))

        # make model shape [n_step, n_class]
        return model.transpose(0, 1).squeeze(0), trained_attn

    def get_att_weight(self, dec_output, enc_outputs):  # get attention weight one 'dec_output' with 'enc_outputs'
        n_step = len(enc_outputs)
        attn_scores = torch.zeros(n_step)  # attn_scores : [n_step]

        for i in range(n_step):
            attn_scores[i] = self.get_att_score(dec_output, enc_outputs[i])

        # Normalize scores to weights in range 0 to 1 (softmax over the n_step axis)
        return F.softmax(attn_scores, dim=0).view(1, 1, -1)

    def get_att_score(self, dec_output, enc_output):  # enc_outputs [batch_size, num_directions(=1) * n_hidden]
        score = self.attn(enc_output)  # score : [batch_size, n_hidden]
        return torch.dot(dec_output.view(-1), score.view(-1))  # inner product make scalar value
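
The per-step Python loop in `get_att_weight` can also be written as a single matrix product. The following is a minimal sketch, assuming batch size 1 as in the class above; the toy sizes and the standalone `attn` layer and function names are assumptions made for this illustration, not part of the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_step, n_hidden = 5, 8  # toy sizes, chosen only for this sketch

attn = nn.Linear(n_hidden, n_hidden)            # plays the role of self.attn
enc_outputs = torch.randn(n_step, 1, n_hidden)  # [n_step, batch_size(=1), n_hidden]
dec_output = torch.randn(1, 1, n_hidden)        # [1, batch_size(=1), n_hidden]

def att_weight_loop(dec_output, enc_outputs):
    # Same logic as get_att_weight / get_att_score: one dot product per encoder step.
    scores = torch.zeros(len(enc_outputs))
    for i in range(len(enc_outputs)):
        scores[i] = torch.dot(dec_output.view(-1), attn(enc_outputs[i]).view(-1))
    return F.softmax(scores, dim=0).view(1, 1, -1)

def att_weight_vectorized(dec_output, enc_outputs):
    projected = attn(enc_outputs).squeeze(1)  # [n_step, n_hidden]
    scores = projected @ dec_output.view(-1)  # [n_step], all dot products at once
    return F.softmax(scores, dim=0).view(1, 1, -1)

loop_w = att_weight_loop(dec_output, enc_outputs)
vec_w = att_weight_vectorized(dec_output, enc_outputs)

# Context vector: [1, 1, n_step] x [1, n_step, n_hidden] = [1, 1, n_hidden]
context = vec_w.bmm(enc_outputs.transpose(0, 1))
```

Both functions should give the same weights up to floating-point error; the vectorized form just avoids the Python loop over encoder steps.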


In [None]:
sentences = ['ich mochte ein bier P', 'S i want a beer', 'i want a beer E']

word_list = " ".join(sentences).split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)}
number_dict = {i: w for i, w in enumerate(word_list)}
n_class = len(word_dict)  # vocab list

# hidden : [num_layers(=1) * num_directions(=1), batch_size, n_hidden]
hidden = torch.zeros(1, 1, n_hidden)

model = Attention()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

input_batch, output_batch, target_batch = make_batch()




In [None]:
for epoch in range(2):
    optimizer.zero_grad()
    output, _ = model(input_batch, hidden, output_batch)

    loss = criterion(output, target_batch.squeeze(0))
    if (epoch + 1) % 400 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))

    loss.backward()
    optimizer.step()



In [None]:
from transformers.models.distilbert.modeling_distilbert import MultiHeadSelfAttention
from transformers.models.distilbert.configuration_distilbert import DistilBertConfig

In [None]:
config = DistilBertConfig()

In [None]:
from transformers.models.distilbert.modeling_distilbert import Embeddings

In [None]:
embeddings = Embeddings(config)

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sequence_a = "This is a short sequence."
encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
input_ids = torch.tensor(encoded_sequence_a).view(1, -1)  # (batch_size=1, seq_len=8)
input_ids.shape

p = embeddings(input_ids)


In [None]:
len()

4

In [None]:
tokenizer.convert_ids_to_tokens([101, 8667, 1291, 102])

In [None]:
def embeddify(text):
    token_ids = tokenizer(text)['input_ids']
    input_ids = torch.tensor(token_ids).unsqueeze(0)  # (batch_size=1, seq_len)
    return embeddings(input_ids), tokenizer.convert_ids_to_tokens(token_ids)

In [None]:
q, _ = embeddify('My Name is something and else as well in 2016')


In [None]:
q.shape

torch.Size([1, 8, 768])

In [None]:
multi_head_attn = MultiHeadSelfAttention(config)

In [None]:
config.attention_dropout

0.1

In [None]:
multi_head_attn.dim

768

In [None]:
q = Chatuur_MultiHeadSelfAttention(config)

In [None]:
x = torch.randn(1,10,config.dim) # (bs, seq_length, dim)

In [None]:
mask = torch.ones(1,10)

In [None]:
w = q(x,x,x,mask)

In [None]:
w[0].shape

torch.Size([1, 10, 768])
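
The call `q(x, x, x, mask)` above returns a tensor with the same shape as its input. To make the steps in between concrete, here is a minimal sketch of multi-head scaled dot-product self-attention. It is not the library's implementation: the class name `TinyMultiHeadSelfAttention` and the toy sizes are assumptions for this illustration, and only the layer names `q_lin`, `k_lin`, `v_lin`, `out_lin` are borrowed from the DistilBERT printout later in this notebook.

```python
import math
import torch
import torch.nn as nn

class TinyMultiHeadSelfAttention(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, dim, n_heads):
        super().__init__()
        assert dim % n_heads == 0, "dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.q_lin = nn.Linear(dim, dim)
        self.k_lin = nn.Linear(dim, dim)
        self.v_lin = nn.Linear(dim, dim)
        self.out_lin = nn.Linear(dim, dim)

    def forward(self, query, key, value, mask):
        bs, q_length, dim = query.size()
        k_length = key.size(1)
        dim_per_head = dim // self.n_heads

        def split_heads(t):  # (bs, length, dim) -> (bs, n_heads, length, dim_per_head)
            return t.view(bs, -1, self.n_heads, dim_per_head).transpose(1, 2)

        q = split_heads(self.q_lin(query)) / math.sqrt(dim_per_head)
        k = split_heads(self.k_lin(key))
        v = split_heads(self.v_lin(value))

        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
        padding = (mask == 0).view(bs, 1, 1, k_length)  # True where a key is padding
        scores = scores.masked_fill(padding, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over the keys

        context = torch.matmul(weights, v)  # (bs, n_heads, q_length, dim_per_head)
        context = context.transpose(1, 2).contiguous().view(bs, q_length, dim)
        return self.out_lin(context), weights

torch.manual_seed(0)
attn_layer = TinyMultiHeadSelfAttention(dim=16, n_heads=4)
x = torch.randn(1, 10, 16)  # (bs, seq_length, dim)
mask = torch.ones(1, 10)
mask[0, -2:] = 0            # treat the last two positions as padding
out, weights = attn_layer(x, x, x, mask)
```

Dividing `q` by `sqrt(dim_per_head)` keeps the dot products from growing with the head size, and the masked positions receive exactly zero weight after the softmax.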

In [None]:
import math 

In [None]:
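q = ['Golu', 'Gupta', 'is', 'a', 'bad', 'man']

In [None]:
golu_vect = torch.tensor([1., 2., 3.])  # toy 3-d vector; the real dim = 256

In [None]:
golu_token = [1, 2, 3]
# horrible_token = [x, y, z]  # placeholder only: y and z are never defined

In [None]:
bad_vect = torch.tensor([5., 8., 0.])  # originally [5, 8]; a trailing 0 is assumed so the length matches golu_vect

In [None]:
golu_pcnt = 0.1  # attention weight given to 'Golu'
bad_pcnt = 0.4   # attention weight given to 'bad'

In [None]:
# scale each token vector by its attention weight
golu_pcnt * golu_vect
bad_pcnt * bad_vect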

In [None]:
[0.1,0.2,0.3]
[]
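
The scratch cells above scale toy token vectors by attention "percentages". The remaining step is to add the scaled vectors together, which gives the context vector. Below is a dependency-free sketch of that arithmetic. The third token `man`, its vector, the trailing `0.0` in the `bad` vector, and the raw scores are all made up for this example; the scores are chosen so that the softmax reproduces the 0.1 and 0.4 used above.

```python
import math

# Toy 3-d value vectors (made-up numbers, one vector per token).
values = {
    "Golu": [1.0, 2.0, 3.0],
    "bad": [5.0, 8.0, 0.0],
    "man": [0.0, 1.0, 1.0],
}

# Made-up raw relevance scores for the current query.
scores = {"Golu": 0.0, "bad": math.log(4.0), "man": math.log(5.0)}

def softmax(score_by_token):
    """Turn raw scores into positive weights that sum to 1."""
    exps = {tok: math.exp(s) for tok, s in score_by_token.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def weighted_sum(weights, vectors):
    """Add up the vectors, each scaled by its weight."""
    dim = len(next(iter(vectors.values())))
    return [sum(weights[tok] * vectors[tok][d] for tok in vectors) for d in range(dim)]

weights = softmax(scores)                # roughly Golu: 0.1, bad: 0.4, man: 0.5
context = weighted_sum(weights, values)  # roughly [2.1, 3.9, 0.8]
```

The context vector is pulled most strongly toward the tokens with the largest weights, which is the sense in which attention "focuses" on them.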

# Testing Out DistilBert # 

In [None]:
dir(AutoModel)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'from_config',
 'from_pretrained']

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, AutoTokenizer, AutoModel
model_checkpoint = "distilbert-base-uncased"
model = AutoModel.from_pretrained(model_checkpoint)

In [None]:
tokens_pt = tokenizer("This is an input example", return_tensors="pt")
for key, value in tokens_pt.items():
    print("{}:\n\t{}".format(key, value))

input_ids:
	tensor([[ 100,   23,   31, 7301,  634]])


In [None]:
outputs = model(**tokens_pt)
last_hidden_state = outputs.last_hidden_state
# pooler_output = outputs.pooler_output
# print("Token wise output: {}, Pooled output: {}".format(last_hidden_state.shape, pooler_output.shape))



In [None]:
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [None]:
tokens_pt

{'input_ids': tensor([[ 100,   23,   31, 7301,  634]])}

In [None]:
tokens_pt['input_ids']

tensor([[ 100,   23,   31, 7301,  634]])

In [None]:
tokenizer.convert_ids_to_tokens(tokens_pt['input_ids'][0])

['This', 'is', 'an', 'input', 'example']

In [None]:
outputs[0].shape

torch.Size([1, 5, 768])

In [None]:
outputs[0][:, 0].shape

torch.Size([1, 768])

In [None]:
outputs.pooler_output.shape

AttributeError: 'BaseModelOutput' object has no attribute 'pooler_output'
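
As the error above shows, this model's output has no `pooler_output`, so a single sentence vector has to be built from `last_hidden_state`. Two common options are sketched below on a random stand-in tensor: taking the first token's hidden state (what `outputs[0][:, 0]` does above) or averaging over the non-padding tokens. The tensor values, the batch of two sequences, and the attention mask are assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)
# Stand-in for outputs.last_hidden_state: (batch_size, seq_len, dim)
last_hidden_state = torch.randn(2, 7, 768)
# 1 = real token, 0 = padding; the second sequence has 4 real tokens.
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0, 0, 0]])

# Option 1: hidden state of the first token.
first_token = last_hidden_state[:, 0]  # (batch_size, dim)

# Option 2: mean over real tokens only, ignoring padding positions.
mask = attention_mask.unsqueeze(-1).float()  # (batch_size, seq_len, 1)
mean_pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)  # (batch_size, dim)
```

Which option works better depends on the downstream task; neither is claimed here to match what a pooler layer would produce.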

In [None]:
outputs.last_hidden_state.shape

torch.Size([1, 7, 768])

In [None]:
model.embeddings.word_embeddings(torch.tensor([[ 101, 2023, 2003, 2019, 7953, 2742,  102]]))[0][0][:10]

tensor([ 0.0390, -0.0123, -0.0208, -0.0005, -0.0198,  0.0383, -0.0206,  0.0034,
        -0.0225, -0.0440], grad_fn=<SliceBackward>)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
tokens = tokenizer(q)

In [None]:
tokens

{'input_ids': [101, 1045, 2572, 1037, 2204, 2711, 1998, 1045, 2066, 3256, 6949, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
q = 'I am a good person and I like ice cream'

In [None]:
vec, tokens = embeddify(q)


In [None]:
tokens 

{'input_ids': [101, 1045, 2572, 1037, 2204, 2711, 1998, 1045, 2066, 3256, 6949, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
model(**tokens)

AttributeError: 'list' object has no attribute 'size'
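
The error above is consistent with the model receiving plain Python lists: `tokens` was produced without `return_tensors="pt"`, so `input_ids` is a `list` and has no `.size()`. A minimal sketch of one way around it is to convert each list to a tensor and add a batch dimension. The ids are copied from the `tokens` output shown above; the name `batch` is an assumption for this example.

```python
import torch

# Token ids and mask as printed above (plain Python lists).
tokens = {
    "input_ids": [101, 1045, 2572, 1037, 2204, 2711, 1998, 1045, 2066, 3256, 6949, 102],
    "attention_mask": [1] * 12,
}

# Convert every field to a tensor of shape (batch_size=1, seq_len).
batch = {name: torch.tensor(ids).unsqueeze(0) for name, ids in tokens.items()}
```

With the notebook's `model`, `model(**batch)` should then get past the `.size()` call; calling the tokenizer as `tokenizer(q, return_tensors="pt")` is the more direct route to the same kind of input.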

# Transformer XL Things # 

In [None]:
# from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer, AutoTokenizer, AutoModel
# model_checkpoint = "distilbert-base-uncased"
# model = AutoModel.from_pretrained(model_checkpoint)

In [None]:
_config = 'transfo-xl-wt103'

In [None]:
from transformers import TransfoXLConfig, TransfoXLModel, TransfoXLTokenizer

In [None]:
configuration = TransfoXLConfig()

In [None]:
model = TransfoXLModel(configuration)


In [None]:
tokenizer = AutoTokenizer.from_pretrained(_config)

In [None]:
model = AutoModel.from_pretrained(_config)

In [None]:
# model

In [None]:
tokens_pt = tokenizer("This is an input example", return_tensors="pt")

In [None]:
tokens_pt = tokenizer("This is an input example")

In [None]:
tokens_pt

{'input_ids': tensor([[ 100,   23,   31, 7301,  634]])}

## Import and Other Things I don't want taking up space but ARE IMPORTANT 

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import pandas as pd 