## Introduction to NLP ## 

Hello Everyone! 
Welcome to the first video of this lecture series on Attention. 

Before we start our discussion on Attention, we're gonna briefly discuss how Deep Learning is used in Natural Language Processing, to establish a baseline of terminology and to cover certain concepts that form the bedrock of Deep Learning in NLP. 

Like every other usage of Deep Learning models, in NLP we use different kinds of neural networks to create vectorized representations of our input. For any NLP task, to create a vector space representation of text, we start by representing every word in our dataset with a randomly initialized vector. In PyTorch, the [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) class is used to create a mapping between every word in our dataset and a fixed length vector (a one-dimensional tensor). Here's an example. 

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(10, 3)
# Here I've created a mapping where my vocab size is 10 
# and the length of my fixed length vector is 3. 

word_id = torch.LongTensor([[9]])
# Each word in our dataset is given a numerical ID. 
embedding(word_id)


tensor([[[-0.3612,  0.4019,  0.7331]]], grad_fn=<EmbeddingBackward>)

Now, one of the objectives of this video is to help you, the viewer, develop a habit of looking at source code. Keeping that in mind, let's take a brief detour, see what the internals of the Embedding class look like, and try to recreate sections of it. 

In [None]:
from torch.nn.parameter import Parameter
weight = Parameter(torch.Tensor(10, 3))

In [None]:
weight
# Not initialized yet, so this shows whatever happened to be in memory.

Parameter containing:
tensor([[-1.6990e-26,  3.0871e-41, -2.6363e-29],
        [ 4.5593e-41,  2.1449e-02,  5.8680e-01],
        [ 2.2857e+00,  1.0582e+00,  1.0593e-02],
        [ 9.1253e-01,  9.1309e-02,  2.3141e-01],
        [ 1.4316e+00,  5.2611e-01,  6.8577e-01],
        [ 3.2810e-01,  9.0337e-01,  9.4789e-01],
        [ 1.0346e+00,  1.0548e+00,  7.6137e-01],
        [ 5.8065e-01,  1.0494e+00,  1.8311e-01],
        [ 1.4869e+00,  1.9779e-01,  6.5181e-01],
        [ 1.1752e+00,  7.5294e-01,  1.2326e+00]], requires_grad=True)

In [None]:
nn.init.normal_(weight)
# Fill the weights in place with samples from a normal distribution.

Parameter containing:
tensor([[ 1.4227, -1.5219,  0.9717],
        [-1.1892,  0.0214, -0.5868],
        [ 2.2857,  1.0582,  0.0106],
        [-0.9125, -0.0913, -0.2314],
        [-1.4316,  0.5261, -0.6858],
        [-0.3281, -0.9034,  0.9479],
        [-1.0346,  1.0548, -0.7614],
        [-0.5806,  1.0494, -0.1831],
        [-1.4869,  0.1978,  0.6518],
        [ 1.1752, -0.7529,  1.2326]], requires_grad=True)
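
To tie the detour back to `nn.Embedding`: at lookup time the layer essentially just picks rows out of its weight matrix. Here is a minimal sketch of that idea, assuming the same vocab size and vector length as above. The names `manual_weight` and `word_ids` are made up for illustration, and the real class has extra options (such as `padding_idx`) that are skipped here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter

torch.manual_seed(0)

# A 10 x 3 table: one row of length 3 for each of the 10 words in the vocab.
manual_weight = Parameter(torch.empty(10, 3))
nn.init.normal_(manual_weight)  # fill the table with samples from N(0, 1)

word_ids = torch.LongTensor([[9, 2]])  # one sentence containing two word IDs

# Looking up an embedding is just picking rows out of the table.
by_indexing = manual_weight[word_ids]
by_functional = F.embedding(word_ids, manual_weight)

print(by_indexing.shape)                        # torch.Size([1, 2, 3])
print(torch.equal(by_indexing, by_functional))  # True
```

Because the table is a `Parameter`, its rows receive gradients during training, which is how the random vectors slowly turn into useful ones.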

Another library that we're gonna be actively looking at is the [transformers](https://huggingface.co/transformers/) library. This is a library built on top of PyTorch which provides a lot of inbuilt functionality for NLP tasks. Here, I've created the `embeddify` function that uses that library to return the vector space representation of every word in the sentence passed as input. 

In [None]:
a, b = embeddify('The attention mechanism was invented in 2015')
# We will look at the internals of this function at some other point. 



In [None]:
a.shape
# (batch_size,seq_len, token_vector_size)

torch.Size([1, 7, 768])

In [None]:
b

['The', 'attention', 'mechanism', 'was', 'invented', 'in', '2015']

`a` is a 3-dimensional tensor. Let's take a look at what each of those dimensions means: 

**batch_size** : This refers to the number of sentences being represented by the tensor. Usually when I'm training or evaluating a model I'm gonna be passing multiple sentences. 

**seq_len**: This refers to the number of words in my sentence. Here, my sentence is made up of 7 words.  Each entity is further represented by a vector of size 768. 

**token_vector_size**: This is the length of the fixed length vector representing every word. 
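
Here is a small sketch of how a tensor with these three dimensions comes about. It is not the `embeddify` function used above (that one relies on the transformers library); `toy_vocab`, `toy_embedding` and `toy_embeddify` are hypothetical stand-ins that split on whitespace and use a randomly initialized `nn.Embedding`.

```python
import torch
import torch.nn as nn

sentence = 'The attention mechanism was invented in 2015'
tokens = sentence.split()  # naive whitespace tokenizer, for illustration only

# Hypothetical vocabulary: give every distinct token a numerical ID.
toy_vocab = {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}
toy_embedding = nn.Embedding(len(toy_vocab), 768)

def toy_embeddify(text):
    """Return (embeddings, tokens) for a single sentence."""
    toks = text.split()
    ids = torch.tensor([[toy_vocab[t] for t in toks]])  # shape: (1, seq_len)
    return toy_embedding(ids), toks

vectors, toks = toy_embeddify(sentence)
print(vectors.shape)  # torch.Size([1, 7, 768])
```

The leading `1` is the batch size (one sentence), `7` is the sequence length and `768` is the length of each token vector.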

In [None]:
# rnn_example_seq_tensor, seq = embeddify('This person is a good person')



In [None]:
b

['The', 'attention', 'mechanism', 'was', 'invented', 'in', '2015']

In [None]:
rnn_example_hidden = torch.zeros(1,1,768)

In [None]:
example_rnn = nn.RNN(768,768,batch_first=True)

In [None]:
encoder_outputs,last_encoder_output = example_rnn(a,rnn_example_hidden)

In [None]:
a.shape

torch.Size([1, 7, 768])

In [None]:
encoder_outputs.shape

torch.Size([1, 7, 768])

In [None]:
last_encoder_output.shape

torch.Size([1, 1, 768])


Now, you can look at `a` as a sequence of vectors representing our sentence. Given such input, [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html)s are used to generate a vector of fixed size. Here, `example_rnn` is an RNN that takes as input a sequence of vectors, each 768 in size, and uses a vector of size 768 to represent the whole sequence of tokens.


An RNN starts off with a fixed length vector, which is referred to as the hidden state (`rnn_example_hidden`), and processes the input sequentially. At each step it performs an operation to merge the current input vector with the hidden state. 

`last_encoder_output` represents the entire sequence in `b` and was created after merging the last vector/word with the hidden state. `encoder_outputs` holds the "hidden states" generated after merging each vector with the hidden state. Because we passed `batch_first=True`, its first index picks the sentence and its second index picks the step, and you can see that there are 7 entities for our one sentence. 

> **encoder_outputs[0, 0]** represents the merging of **_rnn_example_hidden_** and the word **_The_**

> **encoder_outputs[0, 1]** represents the merging of **_encoder_outputs[0, 0]_** and the word **_attention_**

And so on and so forth. Lastly, `encoder_outputs[0, -1]` represents the entire sequence.
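
To make the "merging" concrete, here is a minimal sketch that redoes the work of a single-layer `nn.RNN` one token at a time. It assumes the default `tanh` non-linearity and uses a tiny vector size of 4 instead of 768 to keep things readable; `rnn`, `x`, `h0` and `manual_outputs` are names made up for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=4, hidden_size=4, batch_first=True)
x = torch.randn(1, 7, 4)   # (batch_size, seq_len, token_vector_size)
h0 = torch.zeros(1, 1, 4)  # (num_layers, batch_size, hidden_size)

outputs, last = rnn(x, h0)

# Recreate the same computation one token at a time.
h = h0[0]                  # (batch_size, hidden_size)
manual_outputs = []
for t in range(x.size(1)):
    x_t = x[:, t, :]
    # Merge the current token with the previous hidden state.
    h = torch.tanh(
        x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
        + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
    )
    manual_outputs.append(h)
manual_outputs = torch.stack(manual_outputs, dim=1)  # (1, 7, 4)

print(torch.allclose(outputs, manual_outputs, atol=1e-5))  # True
print(torch.allclose(outputs[:, -1], last[0], atol=1e-5))  # True
```

The second check shows that the hidden state at the final step is exactly what the RNN hands back as its last output.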


Now, a discussion of the internals of the RNN module is beyond the scope of this lecture. For a more in-depth introduction to RNNs I would first suggest [this](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) tutorial by [Andrej Karpathy](https://karpathy.ai/) and follow that up with [this](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) discussion by Chris Olah. But it's worth taking a moment to discuss one particular drawback of the RNN. The RNN does not do a very good job of representing long sequences: it mostly remembers things near the most recent input. So, a consequence of that in our case will be: 

> `encoder_outputs[0, -1]` does a good job of remembering the information near the token `2015` but might not represent earlier tokens in the sequence very well. 
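
One rough way to see this drawback is to ask how strongly the final hidden state depends on each input token, using gradients. The sketch below is only an illustration under stated assumptions: an untrained `nn.RNN` with PyTorch's default initialization, random inputs, and a sequence of 50 tokens. A trained model can behave differently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=16, hidden_size=16, batch_first=True)
x = torch.randn(1, 50, 16, requires_grad=True)  # one sequence of 50 tokens

outputs, last = rnn(x)   # the hidden state defaults to zeros
last.sum().backward()    # how does the final state depend on each token?

# One gradient norm per position in the sequence.
grad_norms = x.grad.norm(dim=-1).squeeze(0)
print(grad_norms[0].item(), grad_norms[-1].item())
```

In a setup like this the first number is typically many orders of magnitude smaller than the second, meaning the earliest token has almost no influence on the final hidden state.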

## The Humble Linear Layer ## 

![](assets/1layer.gif)

Let's talk a bit about the humble Linear Layer. Now, if any of you have trained a neural network chances are you've stacked up a few linear layers and covered them with activation functions to perform some task. However, for the longest time I did not understand the intuition behind them. The idea is subtle and it has got to do with visualizing what a linear layer does. 

A Linear layer performs a `Linear Transformation`. Linear Transformations stretch and squish the space they are applied to, though they have certain properties attached to them that restrict the kind of stretching and squishing that can be done. In essence, the properties translate to: parallel lines should remain parallel.

A Linear Transformation maps a vector to a different space, one where the i-cap and j-cap (the basis vectors) are not (1,0) and (0,1) but a different set of vectors. These vectors are represented by the weights of our linear layer: each column of the weight matrix is where one basis vector lands. 



In [None]:
lin1 = nn.Linear(2,2, bias=False)
# Here I have created a Linear Layer without a bias.
# The weights are randomly initialized, so your values will differ.

In [None]:
lin1.weight
# For this LT i-cap is at [-0.6971,  0.2710] (first column)
#          and j-cap is at [ 0.6101,  0.0192] (second column)

Parameter containing:
tensor([[-0.6971,  0.6101],
        [ 0.2710,  0.0192]], requires_grad=True)

In [None]:
lin1(torch.tensor([1.,0.]))
# lin1(torch.tensor([0.,1.]))
# lin1(torch.tensor([1.,1.]))

tensor([-0.6971,  0.2710], grad_fn=<SqueezeBackward3>)
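
Here is a small sketch that checks where the basis vectors land. It copies the weights printed above into a fresh layer so the numbers are reproducible; `lin`, `i_cap`, `j_cap` and `v` are names made up for this example.

```python
import torch
import torch.nn as nn

lin = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    # Overwrite the random initialization with the weights printed above.
    lin.weight.copy_(torch.tensor([[-0.6971, 0.6101],
                                   [ 0.2710, 0.0192]]))

i_cap = torch.tensor([1., 0.])
j_cap = torch.tensor([0., 1.])

print(lin(i_cap))  # the first column of the weight matrix
print(lin(j_cap))  # the second column of the weight matrix

# Linearity: transforming a combination equals combining the transforms.
v = 2 * i_cap + 3 * j_cap
print(torch.allclose(lin(v), 2 * lin(i_cap) + 3 * lin(j_cap), atol=1e-6))  # True
```

Once you know where i-cap and j-cap land, every other vector is determined, because any 2D vector is a combination of the two.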

A Linear Transformation followed by an activation function is shown in the animation above. Three things happen here: stretching, sliding and squishing. The first two are performed by the linear layer (the weight matrix stretches the space and the bias slides it), and the last is performed by an activation function. So, when vectors are passed as input they are mapped to a different space. 
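
The three stages can be separated out in code. This is a minimal sketch, assuming `tanh` as the activation function and a handful of random 2D points; `layer`, `points` and the stage names are made up for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(2, 2)              # has a weight matrix and a bias vector
points = torch.randn(5, 2)           # five points in 2D space

stretched = points @ layer.weight.T  # stretching: the weight matrix
slid = stretched + layer.bias        # sliding: the bias shifts every point
squished = torch.tanh(slid)          # squishing: the activation function

# The three stages together match the layer followed by the activation.
print(torch.allclose(squished, torch.tanh(layer(points)), atol=1e-6))  # True
```

Stacking several such layers repeats this stretch, slide and squish sequence, each time starting from the space produced by the previous layer.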

Now, I suspect many of you might be full of questions, and an exhaustive analysis of Linear Transformations is beyond the scope of this lecture. But [here's](https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) an excellent series by the wonderful [3Blue1Brown](https://twitter.com/3Blue1Brown) on Linear Transformations which without a doubt will cover most of your questions. Seriously folks, it changed how I look at Deep Learning and I'm sure it will change your perspective as well. Also, I have taken the above animation from [Chris Olah](https://twitter.com/ch402)'s wonderful [blogpost](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/) about NNs and topology (which I would suggest you look at only after the linear algebra series, otherwise it might confuse you further).

In [None]:
# a = [.1,.1]

# # lin1 = nn.Linear(2,2, bias=False)
# # lin2 = nn.Linear(2,2, bias=False)

# b = lin1(torch.tensor(a))
# c = lin2(torch.tensor(a))

# print(b)
# print(c)

# import numpy as np
# import matplotlib.pyplot as plt

# V = np.array([[.1,.1], [-.4452,.6256], [-.0827,.1850]])
# origin = np.array([[0, 0, 0],[0, 0, 0]]) # origin point

# plt.quiver(*origin, V[:,0], V[:,1], color=['r','b','g'], scale=2)
# plt.show()

## What is SoftMax? Not Perfect  ## 
Given a set of scores, the Softmax function is used to output probabilities. The most common use case is classification. The setup in which softmax is commonly used is something like this: 

You have a network of some sort which outputs a vector of a given size and you want to use this vector to perform classification. So, if you have 5 labels, you pass this vector through a Linear Layer to create 5 scores, one per label. 
For example: 



In [None]:
x = torch.randn(1,100) # Fixed length vector representing your input
lin = nn.Linear(100,5)
y = lin(x)

In [None]:
y

tensor([[-0.3326, -0.9005, -1.0053,  0.1555,  0.1022]],
       grad_fn=<AddmmBackward>)

In [None]:
F.softmax(y, dim=-1)
# dim=-1 normalizes across the 5 scores of each input.


tensor([[0.1904, 0.1079, 0.0972, 0.3103, 0.2942]], grad_fn=<SoftmaxBackward>)

Here I have used the Softmax to convert the scores into probabilities. Now, what does the softmax do which makes me say that these are probabilities? That is related to how the softmax computes its output. It exponentiates every score, which makes them all positive, and then divides each one by the sum of all of them. So, all that softmax does is normalize the scores given to it, and one of the properties of the values generated by Softmax is that they will always sum to 1. 

So, in our example, if you know that the input definitely belongs to one, and only one, of the 5 classes, the softmax output can be read as the probability of each class, since the 5 values sum up to 1.

So, to conclude, What does Softmax do? 
Given a set of scores it will normalize the scores so that they sum to 1 and this allows us to think of the output as probabilities. 
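
The normalization is short enough to write out by hand. Here is a minimal sketch using the five scores from the example above; `scores`, `exps` and `manual` are names made up for this example, and `F.softmax` is only used to double check the result.

```python
import torch
import torch.nn.functional as F

# The five scores from the example above.
scores = torch.tensor([[-0.3326, -0.9005, -1.0053, 0.1555, 0.1022]])

exps = torch.exp(scores)                        # make every score positive
manual = exps / exps.sum(dim=-1, keepdim=True)  # divide by the total

print(manual)        # approximately [[0.1904, 0.1079, 0.0972, 0.3103, 0.2942]]
print(manual.sum())  # 1, up to floating point error
print(torch.allclose(manual, F.softmax(scores, dim=-1)))  # True
```

Note that library implementations usually subtract the largest score before exponentiating to avoid overflow, which does not change the result.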

Now, there are a few caveats. If your input could belong to multiple classes or to none of the classes, the probabilities produced by the softmax function can be misleading. 
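
A quick sketch of why this happens: softmax only looks at the differences between scores, so it cannot tell "the network is confident" apart from "every class got a very low score". The scores below are made up for illustration, and the sigmoid at the end is shown only as one common alternative for scoring each class independently, not as something this lecture relies on.

```python
import torch
import torch.nn.functional as F

confident = torch.tensor([[4.0, 1.0, 0.0, -1.0, -2.0]])
all_low = confident - 100.0  # every class now gets a very low score

# Softmax only depends on differences between scores, so both rows match.
print(F.softmax(confident, dim=-1))
print(F.softmax(all_low, dim=-1))

# Sigmoid scores each class on its own, so "none of the classes" is possible.
print(torch.sigmoid(all_low))  # every value is close to 0
```

Both softmax outputs still sum to 1 and still point at the first class, even though the second set of scores suggests the input fits none of the classes.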