# Introduction to Natural Language Processing

----

<center><h3>Suriyadeepan Ramamoorthy</h3></center>

# Overview
---

- Text Representation
- Reduction / Abstraction
- Articulation / Synthesis
- Reasoning


# Overview
---

- Text Representation
- Reduction / Abstraction
- ~~Articulation / Synthesis~~
- ~~Reasoning~~


## Text Representation
---
- Count-based Representation
- Continuous Representation
    - Example : Continuous Bag-of-Words
- Text Preprocessing
    - Example : Social Media Sentiment Corpus

## Reduction / Abstraction
---
- Neural Networks
- Example : Neural Networks from Scratch in PyTorch
- Recurrent Neural Networks
    - LSTM (Long Short Term Memory)
- Example : Sentiment Classification

# Count-based Representation
---

- One-hot Encoding
- Bag-of-Words
- TF-IDF

# One-hot Encoding

```python
vocab = [ 'one', 'two', 'three' , 'four' ]
```
```
one   : tensor([1., 0., 0., 0.])
two   : tensor([0., 1., 0., 0.])
three : tensor([0., 0., 1., 0.])
four  : tensor([0., 0., 0., 1.])
```

# One-hot Encoding

In [None]:
import torch

def one_hot_encode(w, w2i):
  x = torch.zeros(len(w2i))
  idx = w2i[w]
  x[idx] = 1
  return x

vocab = [ 'one', 'two', 'three' , 'four' ]
w2i = { w: i for i, w in enumerate(vocab) }
for w in vocab:
    print(w, ':', one_hot_encode(w, w2i))

# N-gram

```python
sentence = 'one two three four'
```

```
['one',
 'two',
 'three',
 'four',
 'one two',
 'two three',
 'three four',
 'one two three',
 'two three four']
```
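
As a library-free sketch of the listing above, word n-grams can be generated with a sliding window over the tokens. The helper name `word_ngrams` is an assumption made for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def word_ngrams(sentence, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = sentence.split()  # naive whitespace tokenizer (assumption)
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

ngrams = word_ngrams('one two three four')
print(ngrams)
```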

# N-gram

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


ngram_vectorizer = CountVectorizer(ngram_range=(1, 3))
analyze = ngram_vectorizer.build_analyzer()
analyze('one two three four')

# Count Vectorization


```
Size of vocabulary :  122
Sentence :  The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
Vector :  [[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0
  0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0
  2 0 0 0 0 1 0 0 0 0 1 0 0 0]]
Most Frequent Word :  (104, 'the')
```
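
The vector above has one slot per vocabulary word, holding the number of times that word occurs in the sentence. A minimal sketch of the same idea on a toy corpus follows; the corpus and the helper names `build_vocab` and `count_vector` are assumptions, and sklearn's default tokenizer additionally ignores punctuation and single-character tokens.

```python
from collections import Counter

def build_vocab(sentences):
    """Map every distinct lowercased word to a column index (alphabetical order)."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def count_vector(sentence, w2i):
    """Bag-of-words vector: entry i is the count of word i in the sentence."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(w2i)
    for word, count in counts.items():
        if word in w2i:  # ignore out-of-vocabulary words
            vector[w2i[word]] = count
    return vector

corpus = ['the cat sat on the mat', 'the dog sat']
w2i = build_vocab(corpus)
vector = count_vector(corpus[0], w2i)
print(w2i)
print(vector)
```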

# Count Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize
import numpy as np

# create an instance of sklearn's Count Vectorizer
vectorizer = CountVectorizer()
sentences = sent_tokenize(open('data/asimov-tiny.txt').read())
vectorizer.fit(sentences)
w2i = vectorizer.vocabulary_
vocab = { i: w for w, i in w2i.items() }
print('Size of vocabulary : ', len(vocab))
print('Sentence : ', sentences[0])
sent_vector = vectorizer.transform([sentences[0]]).toarray()
print('Vector : ', sent_vector)
most_frequent = int(np.argmax(sent_vector[-1]))
print('Most Frequent Word : ', (most_frequent, vocab[most_frequent]))

# TF-IDF
---

$$tf \times idf$$
- $tf$ : Term Frequency in the document
- $idf$ : Inverse Document Frequency, the logarithm of the inverse of the fraction of documents that contain the word
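
As a worked sketch of the product above, the block below uses the plain textbook definition $idf = \log(N / df)$, where $N$ is the number of documents and $df$ the number of documents containing the word. The toy corpus and the function name `tf_idf` are assumptions; sklearn's `TfidfTransformer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf * idf} dict per document (unsmoothed textbook variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency in this document
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

docs = ['the cat sat', 'the dog sat', 'the cat ran']
scores = tf_idf(docs)
print(scores[0])
```

A word that occurs in every document, such as `the` here, receives a weight of zero, while a word confined to one document gets the largest idf.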

# TF-IDF

In [None]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# create an instance of sklearn's Count Vectorizer
vectorizer = CountVectorizer()
sentences = sent_tokenize(open('data/asimov-tiny.txt').read())
vectorizer.fit(sentences)
# count vectors of all sentences, used to estimate the idf weights
vectors = vectorizer.transform(sentences)
tf_transformer = TfidfTransformer()
tf_transformer.fit(vectors)

w2i = vectorizer.vocabulary_
vocab = { i: w for w, i in w2i.items() }
print('Size of vocabulary : ', len(vocab))
print('Sentence : ', sentences[0])
sent_vector = vectorizer.transform([sentences[0]]).toarray()
print('Vector : ', sent_vector)
# re-weight the raw counts of the first sentence by tf-idf
print(tf_transformer.transform(sent_vector).toarray())

# TF-IDF


```
[[0.19564323 0.19564323 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.19564323 0.1639643  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.39128647
  0.         0.         0.1639643  0.         0.         0.
  0.         0.         0.         0.         0.19564323 0.
  0.         0.         0.         0.19564323 0.1639643  0.19564323
  0.         0.         0.         0.19564323 0.         0.19564323
  0.         0.         0.19564323 0.         0.         0.
  0.         0.19564323 0.         0.         0.         0.
  0.         0.         0.         0.         0.19564323 0.
  0.         0.         0.         0.         0.         0.
  0.         0.1639643  0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.19564323
  0.         0.         0.26199672 0.         0.         0.
  0.39128647 0.         0.         0.         0.         0.14148774
  0.         0.         0.         0.         0.19564323 0.
  0.         0.        ]]
  ```

# PyTorch for Machine Learning
---

- Logistic Regression
- Neural Network
- Recurrent Neural Network
- Long Short Term Memory (LSTM)

# Linear Regression
---

| x | y   |
|------|------|
| 0.54  | 2.01    |
| 1.21  | 4.13    |
| 0.2   | 0.82    |
| ...   | ...     |

$$\hat{y} = wx + b$$
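
To make the fit concrete, here is a sketch that estimates the slope and intercept from the three visible rows of the table by ordinary least squares. The closed-form estimator and the function name `fit_line` are illustrative choices, not something the slides prescribe; an `nn.Linear` layer would instead be trained by gradient descent.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

xs = [0.54, 1.21, 0.2]   # the three visible rows of the table
ys = [2.01, 4.13, 0.82]
w, b = fit_line(xs, ys)
predictions = [w * x + b for x in xs]
print('w =', round(w, 3), 'b =', round(b, 3))
```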

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1)
linear = nn.Linear(1, 1)
y = linear(x)
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Linear Regression

```
x :  tensor([0.3726])
y :  tensor([-0.6245], grad_fn=<AddBackward0>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[0.1961]], requires_grad=True)), ('bias', Parameter containing:
tensor([-0.6975], requires_grad=True))]
```

# Multi-variate Linear Regression
---

| $x_0$ | $x_1$   | $x_2$ | y   |
|------|------|------|------|
| 0.54  | 2.01    | 0.14  | 2.91    |
| 1.21  | 4.13    | 0.24  | 4.22    |
| 0.2   | 0.82    | 1.8   | 1.35    |
| ...   | ...     | ...   | ...     |

$$\hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b$$
<center><h3>OR</h3></center>
$$\hat{\overrightarrow{y}} = W\overrightarrow{x} + b$$
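
The two forms describe the same computation: the matrix form is a dot product between a row of weights and the feature vector, plus the bias. A small sketch follows; the values of `weights` and `bias` are assumptions chosen for illustration, and only `x` is taken from the first row of the table.

```python
def predict(weights, bias, x):
    """Compute y_hat = w_0*x_0 + w_1*x_1 + ... + b for one sample."""
    assert len(weights) == len(x), 'one weight per feature'
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias

x = [0.54, 2.01, 0.14]        # first row of the table
weights = [0.5, 1.0, -2.0]    # hypothetical weights
bias = 0.1                    # hypothetical bias
y_hat = predict(weights, bias, x)
print(y_hat)
```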

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1, 3)
linear = nn.Linear(3, 1)
y = linear(x)
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Multi-variate Linear Regression

```
x :  tensor([[0.2249, 0.2030, 0.8778]])
y :  tensor([[0.0854]], grad_fn=<AddmmBackward>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[ 0.4610, -0.3530,  0.5720]], requires_grad=True)), ('bias', Parameter containing:
tensor([-0.4488], requires_grad=True))]
```

# Logistic Regression
---
$$\hat{y} = \sigma( wx + b )$$
$$ \hat{y} \in (0, 1)\ \forall\ x \in (-\infty, +\infty) $$

![](doc/sigmoid.png)
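
The squashing behaviour can be checked numerically. The sketch below assumes the standard logistic function $\sigma(z) = 1 / (1 + e^{-z})$; the weight and bias values are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return exp_z / (1.0 + exp_z)

w, b = 0.19, 0.39  # arbitrary parameters for illustration
outputs = [sigmoid(w * x + b) for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
print(outputs)
```

Larger inputs give larger outputs, but every output stays strictly between 0 and 1 for these inputs, which is what lets it be read as a probability.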

# Logistic Regression

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1)
linear = nn.Linear(1, 1)
activation_fn = nn.Sigmoid()
# activation_fn = nn.Tanh()
y = activation_fn(linear(x))
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Logistic Regression

```
x :  tensor([0.4730])
y :  tensor([0.6162], grad_fn=<SigmoidBackward>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[0.1867]], requires_grad=True)), ('bias', Parameter containing:
tensor([0.3851], requires_grad=True))]
```

# Neural Network
---

- Feed-Forward Neural Network
- Add a hidden layer
- 1 input layer, 1 hidden layer, 1 output layer

$$ \hat{y} = \tanh(W_2z + b_2) $$
$$ z = \tanh(W_1x + b_1) $$

![](doc/nn.png)
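
Written out without a framework, the forward pass is two affine maps, each followed by tanh. The sketch below assumes a 3 → 5 → 1 layout (three inputs, five hidden units, one output); the random initialisation and the helper names `affine`, `forward`, and `rand_matrix` are assumptions.

```python
import math
import random

def affine(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward(x, params):
    """z = tanh(W1 x + b1), y_hat = tanh(W2 z + b2)."""
    w1, b1, w2, b2 = params
    z = [math.tanh(v) for v in affine(w1, b1, x)]
    y_hat = [math.tanh(v) for v in affine(w2, b2, z)]
    return z, y_hat

def rand_matrix(rows, cols):
    """Matrix of small random weights (illustrative initialisation)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
params = (rand_matrix(5, 3), [0.0] * 5, rand_matrix(1, 5), [0.0])
z, y_hat = forward([0.2, 0.5, 0.9], params)
print(len(z), len(y_hat), y_hat)
```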

# Neural Network



In [None]:
import torch
import torch.nn as nn

x = torch.rand(1, 3)
linear_1 = nn.Linear(3, 5)
activation_fn_1 = nn.Tanh()
linear_2 = nn.Linear(5, 1)
activation_fn_2 = nn.Tanh()
z = activation_fn_1(linear_1(x))
y = activation_fn_2(linear_2(z))
print('x : ', x)
print('z : ', z)
print('y : ', y)
print('Associated Parameters : ')
print('\nlinear_1 (weight) : ', 
      linear_1.weight.size(), linear_1.weight)
print('linear_1 (bias) : ', 
      linear_1.bias.size(), linear_1.bias)
print('\nlinear_2 (weight) : ', 
      linear_2.weight.size(), linear_2.weight)
print('linear_2 (bias) : ', 
      linear_2.bias.size(), linear_2.bias)

# Neural Network

```
linear_1 (weight) :  torch.Size([5, 3]) Parameter containing:
tensor([[ 0.0067, -0.0129, -0.3105],
        [ 0.0487, -0.5520, -0.5208],
        [-0.5579, -0.0674, -0.3417],
        [-0.2760, -0.2374,  0.4286],
        [ 0.1238,  0.5067, -0.0352]], requires_grad=True)
linear_1 (bias) :  torch.Size([5]) Parameter containing:
tensor([-0.4454,  0.5320,  0.0727,  0.1670, -0.0999], requires_grad=True)

linear_2 (weight) :  torch.Size([1, 5]) Parameter containing:
tensor([[ 0.1594,  0.2691, -0.1408,  0.1989,  0.4021]], requires_grad=True)
linear_2 (bias) :  torch.Size([1]) Parameter containing:
tensor([0.2987], requires_grad=True)
```

# Recurrent Neural Network
---

$$h_t = W_{hh}h_{t-1} + W_{xh}x_{t}$$
$$y_t = \mathrm{softmax}(W_{hy}h_t)$$

![](doc/rnn.png)
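
Unrolled over a sequence, the recurrence above becomes a loop that carries the hidden state forward from one step to the next. A minimal sketch with hand-picked 2×2 matrices follows; all numeric values and helper names are assumptions, and, like the equations, it applies no non-linearity to the hidden state.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product, with the matrix given as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Normalise scores into probabilities; subtract the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn(inputs, w_hh, w_xh, w_hy, h0):
    """h_t = W_hh h_{t-1} + W_xh x_t ; y_t = softmax(W_hy h_t)."""
    h, outputs = h0, []
    for x in inputs:
        h = [a + b for a, b in zip(matvec(w_hh, h), matvec(w_xh, x))]
        outputs.append(softmax(matvec(w_hy, h)))
    return outputs, h

w_hh = [[0.5, 0.0], [0.0, 0.5]]
w_xh = [[1.0, 0.0], [0.0, 1.0]]
w_hy = [[1.0, -1.0], [-1.0, 1.0]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, final_state = rnn(sequence, w_hh, w_xh, w_hy, h0=[0.0, 0.0])
print(outputs)
```

Each element of `outputs` is a probability distribution for one time step, and `final_state` summarises the whole sequence.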

# Recurrent Neural Network
---

![](doc/simple_seq2seq.png)

# Recurrent Neural Network

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
input = torch.rand([1, 5, 3])
output_size = 2
state = torch.zeros(1, 3)
ss, si = nn.Linear(3, 3), nn.Linear(3, 3)
so = nn.Linear(3, 3)
linear_final = nn.Linear(3, output_size)
y = []
for i in range(5):
    state = ss(state) + si(input.permute(1, 0, 2)[i])
    out = so(state)
    y.append(F.softmax(linear_final(out), dim=-1))
    
print(y)

[tensor([[0.2284, 0.7716]], grad_fn=<SoftmaxBackward>), tensor([[0.2077, 0.7923]], grad_fn=<SoftmaxBackward>), tensor([[0.2070, 0.7930]], grad_fn=<SoftmaxBackward>), tensor([[0.2067, 0.7933]], grad_fn=<SoftmaxBackward>), tensor([[0.1817, 0.8183]], grad_fn=<SoftmaxBackward>)]


  


# LSTM
---

![](doc/lstm1.png)

# LSTM
---

![](doc/lstm_eqn.png)
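
For reference, one widely used formulation of the LSTM cell, with input, forget, and output gates, is written out below. The notation is an assumption and may differ from the figure; $\odot$ denotes element-wise multiplication.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$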

# LSTM

```python
class LstmClassifier(nn.Module):

  def __init__(self, hparams, weights=None):
    """ (hparams : dictionary of hyperparameters) """
    super(LstmClassifier, self).__init__()
    self.hparams = hparams
    self.weights = weights
    self.embedding = nn.Embedding(hparams['vocab_size'], hparams['emb_dim'])
    self.lstm = nn.LSTM(hparams['emb_dim'], hparams['hidden_dim'])
    self.linear = nn.Linear(hparams['hidden_dim'], hparams['output_size'])
```

# LSTM

```python
  def forward(self, sequence, batch_size=None, get_hidden=False):
    input = self.embedding(sequence)
    input = input.permute(1, 0, 2)
    batch_size = batch_size if batch_size else self.hparams['batch_size']
    h0 = torch.zeros(1, batch_size, self.hparams['hidden_dim'])
    c0 = torch.zeros(1, batch_size, self.hparams['hidden_dim'])
    self.lstm.flatten_parameters()
    lstm_out, (h, c) = self.lstm(input, (h0, c0))
    self.h = h[-1]
    linear_out = self.linear(h[-1])
    return linear_out
```

# Preprocessing
---

- Tokenization
- Indexing
- Padding
- Batching

# Tokenization

In [6]:
import nltk
sentence = "Hi, I'm Suriya"
nltk.word_tokenize(sentence)

['Hi', ',', 'I', "'m", 'Suriya']

# Indexing

In [14]:
def word_to_index(w, w2i, UNK_IDX=1):
    return w2i[w] if w in w2i else UNK_IDX

def index(sentence, w2i):
    assert isinstance(sentence, list)
    return [ word_to_index(token, w2i) for token in sentence ]

w2i = {'one' : 2, 'two' : 3, 'three' : 4, 'four' : 5}
tokenized_sentence = ['zero', 'three', 'two', 'five']
index(tokenized_sentence, w2i)

[1, 4, 3, 1]
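
The `w2i` mapping above starts at index 2, which leaves 0 and 1 free for the padding and unknown-word indices. One way such a mapping could be built from tokenized text is sketched below; `build_w2i`, `to_indices`, the `PAD`/`UNK` strings, and the frequency ordering are assumptions, not the notebook's own helpers.

```python
from collections import Counter

PAD, UNK = '<pad>', '<unk>'  # hypothetical special tokens at indices 0 and 1

def build_w2i(tokenized_sentences, min_count=1):
    """Assign indices to words, most frequent first, after PAD and UNK."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # sort by descending count, then alphabetically, so the result is deterministic
    words = sorted((w for w, c in counts.items() if c >= min_count),
                   key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate([PAD, UNK] + words)}

def to_indices(tokens, w2i):
    """Replace each token by its index, falling back to the UNK index."""
    return [w2i.get(tok, w2i[UNK]) for tok in tokens]

corpus = [['one', 'two', 'two'], ['three', 'two', 'one']]
w2i = build_w2i(corpus)
indexed = to_indices(['zero', 'two', 'one'], w2i)
print(w2i)
print(indexed)
```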

# Padding

In [20]:
def pad(samples, max_len=None, PAD_IDX=0, to_tensor=torch.tensor):
  max_len = max([len(s[0]) for s in samples]) if not max_len else max_len
  padded_text = []
  labels = []
  for text, label in samples:
    padded_text.append(
        text + [PAD_IDX] * (max_len - len(text))
       )
    labels.append(label)
  return (
      to_tensor(padded_text) if to_tensor else padded_text,
      to_tensor(labels) if to_tensor else labels
      )
samples = [ ([1], 1), ([2, 3], 1), ([4, 5, 6], 0) ]
pad(samples)

(tensor([[1, 0, 0],
         [2, 3, 0],
         [4, 5, 6]]), tensor([1, 1, 0]))

# Batching

In [25]:
def make_batches(samples, batch_size):
  batches = []
  num_batches = len(samples) // batch_size
  for i in range(num_batches):
    batches.append(
        pad(samples[i * batch_size : (i + 1) * batch_size])
        )
  return batches

samples = [ ([1], 1), ([2, 3], 1), ([4, 5, 6], 0), ([7, 8, 9, 10], 1) ]
batches = make_batches(samples, batch_size=2)
for i, batch in enumerate(batches):
    text, label = batch
    print(i, text, label)

0 tensor([[1, 0],
        [2, 3]]) tensor([1, 1])
1 tensor([[ 4,  5,  6,  0],
        [ 7,  8,  9, 10]]) tensor([0, 1])


## Social Media Sentiment Dataset
---


In [39]:
!python3 srmnlp/data/socialmedia.py

Text  : torch.Size([3, 37]) tensor([[ 1,  1, 22,  1,  1,  1, 24,  1,  1, 22,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1, 12,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0],
        [ 1, 22,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 22,  1, 24,  1, 22,  1,
          1,  1,  1, 24,  1, 22,  1,  1,  1,  5,  1,  1,  1,  1,  1,  5,  1,  1,
         12],
        [22,  1,  1,  1, 12,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0]])
Label : torch.Size([3]) tensor([0, 1, 1])


## Exercise
---

1. Use existing code to recreate sentences from preprocessed social media data
2. Reduce the number of padding zeros by sorting the samples based on length, before batching

# Continuous Representation
---

- word2vec
    - (CBOW) Continuous Bag-of-Words
    - Skipgram
- GloVe

# CBOW


![](doc/cbow1.png)


## Model
---



In [27]:
class CBOW(torch.nn.Module):

  def __init__(self, vocab_size, embed_dim):
    super(CBOW, self).__init__()

    # create embeddings with random initialization
    self.embeddings = nn.Embedding(vocab_size, embed_dim)
    # linear layer (1)
    self.linear1 = nn.Linear(embed_dim, 128)
    # linear layer (2)
    self.linear2 = nn.Linear(128, vocab_size)

  def forward(self, inputs):
    embedded = self.embeddings(inputs).sum(dim=1)
    out = F.relu(self.linear1(embedded))
    return F.log_softmax(self.linear2(out), dim=-1)

  def get_embedding(self, idx):
    if not isinstance(idx, torch.Tensor):
      idx = torch.tensor(idx)
    return self.embeddings(idx)

## Data
---

```
We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.
```
---
```python
(['the', 'idea', 'a', 'computational'], 'of') 
(['to', 'study', 'idea', 'of'], 'the') 
(['processes', 'manipulate', 'abstract', 'things'], 'other')
(['the', 'spirits', 'the', 'computer'], 'of')
(['programs', 'to', 'processes.', 'In'], 'direct')
(['a', 'pattern', 'rules', 'called'], 'of')
...
```

## Preprocessing
---

```python
for i in range(2, len(words) - 2):
    context = [ words[i - 2], words[i - 1], words[i + 1], words[i + 2] ]
    target = words[i]
    samples.append((context, target))
```
---
```python
vocab = create_vocabulary(samples)[2:]  # we do not need PAD and UNK here
w2i = vocab_to_w2i(vocab)
label2idx = w2i
samples = [ (x, label2idx[y]) for x, y in samples ]
indexed_samples = [ (index(text, w2i), label) for text, label in samples ]
return (
  make_batches(indexed_samples, batch_size),
  None
  ), vocab, w2i
```
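
Put together as a self-contained sketch, the windowing loop above turns raw text into (context, target) pairs. Whitespace splitting and the function name `make_cbow_samples` are assumptions; the window of two words on each side matches the loop above.

```python
def make_cbow_samples(text, window=2):
    """Pair each word with the `window` words on either side of it."""
    words = text.split()  # naive whitespace tokenization (assumption)
    samples = []
    for i in range(window, len(words) - window):
        context = words[i - window:i] + words[i + 1:i + window + 1]
        samples.append((context, words[i]))
    return samples

text = 'We are about to study the idea of a computational process.'
samples = make_cbow_samples(text)
for context, target in samples[:3]:
    print(context, target)
```

Words closer than `window` positions to either end of the text are never used as targets, because they lack a full context.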

## Training
---

```python
def train(model, trainset, num_epochs=50):
    loss_fn = nn.NLLLoss()
    optim = torch.optim.Adam(model.parameters())
    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0.
        for context, target in trainset:
            optim.zero_grad()
            log_probs = model(context)
            loss = loss_fn(log_probs, target)
            loss.backward()
            optim.step()
            # accumulate the loss of every batch, not just the last one
            epoch_loss += loss.item()
        if epoch % 10 == 0:
            print('epoch [{}], Loss : {}'.format(epoch, epoch_loss))
```
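
`nn.NLLLoss` expects log-probabilities as input, which is why the model ends in `log_softmax`. For a single sample the loss is the negated log-probability assigned to the correct word, as this framework-free sketch illustrates; the scores are made up and the helper names are assumptions.

```python
import math

def log_softmax(scores):
    """log(softmax(scores)), computed stably via the log-sum-exp trick."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

scores = [2.0, 0.5, -1.0]  # hypothetical scores over a 3-word vocabulary
log_probs = log_softmax(scores)
loss_correct = nll_loss(log_probs, target=0)  # target is the highest-scoring word
loss_wrong = nll_loss(log_probs, target=2)    # target is the lowest-scoring word
print(loss_correct, loss_wrong)
```

The loss is small when the model already favours the target word and grows as the target's probability shrinks.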

## Training
---

In [33]:
!python3 srmnlp/cbow_train.py

epoch [0], Loss : 238.19856929779053
epoch [10], Loss : 1.45696222782135
epoch [20], Loss : 0.37830962240695953
epoch [30], Loss : 0.1671372950077057
epoch [40], Loss : 0.0899118036031723
Context Words :  ['People', 'create', 'to', 'direct']
Target Word :  programs


# Skipgram

![](doc/skipgram.png)

## Exercise
---

- Implement Skipgram

# Sentiment Analysis
---

- Model : LSTM-based Recurrent Neural Network
- Data  : Social Media Sentiment Corpus (from Kaggle)

```bash
# train-evaluate-predict
python3 srmnlp/sentiment_analysis.py
```

# Sentiment Analysis

In [40]:
!python3 srmnlp/sentiment_analysis.py

[1]
Epoch loss : 1.376315678541477, Epoch accuracy : 57.03125%
::[evaluation] Loss : 0.6824692029219407, Accuracy : 56.875
[2]
Epoch loss : 1.3328340710737767, Epoch accuracy : 60.19631576538086%
::[evaluation] Loss : 0.604145020705003, Accuracy : 67.45192307692308
[3]
Epoch loss : 0.9626376546728306, Epoch accuracy : 77.9647445678711%
::[evaluation] Loss : 0.3938124108773011, Accuracy : 82.9326923076923
[4]
Epoch loss : 0.7804959511909729, Epoch accuracy : 83.47355651855469%
::[evaluation] Loss : 0.38190830326997316, Accuracy : 85.0
[5]
Epoch loss : 0.7446358345257931, Epoch accuracy : 83.57371520996094%
::[evaluation] Loss : 0.3639948526254067, Accuracy : 85.4326923076923
[6]
Epoch loss : 0.7185407674465424, Epoch accuracy : 84.19470977783203%
::[evaluation] Loss : 0.3618682542672524, Accuracy : 85.24038461538461
[7]
Epoch loss : 0.6821818559979781, Epoch accuracy : 84.49519348144531%
::[evaluation] Loss : 0.35941146543392766, Accuracy : 85.24038461538461
[8]
Epoch loss : 0.652417094

## Exercise
---

- Try to improve the accuracy of Sentiment Classification on Social Media data

# Tasks
---

- <h4>Machine Translation</h4>
- <h4>Question Answering</h4>
- <h4>Text Summarization</h4>
- <h4>Natural Language Inference</h4>
- <h4>Goal-Oriented Dialogue</h4>

## Machine Translation

![](doc/mt.png)

## Question Answering

![](doc/qa.png)

## Text Summarization

![](doc/summarization2.png)

## Natural Language Inference
---

- <h4>Entailment</h4>: If the first sentence is true, the second sentence must be true.
- <h4>Contradiction</h4>: If the first sentence is true, the second sentence must be false.
- <h4>Neutral</h4>: Neither entailment nor contradiction.


## Natural Language Inference

![](doc/nli.svg)

## Goal-Oriented Dialogue

![](doc/god0.jpg)

## Goal-Oriented Dialogue

![](doc/god.png)

