# Introduction to Natural Language Processing

----

<center><h3>Suriyadeepan Ramamoorthy</h3></center>

# Overview
---

- Text Representation
- Reduction / Abstraction
- Articulation / Synthesis
- Reasoning


## Text Representation
---
- Count-based Representation
- Continuous Representation
    - Example : Continuous Bag-of-Words
- Text Preprocessing
    - Example : Social Media Sentiment Corpus

## Reduction / Abstraction
---
- Neural Networks
- Example : Neural Networks from Scratch in PyTorch
- Recurrent Neural Networks
    - LSTM (Long Short Term Memory)
- Example : Sentiment Classification

# Count-based Representation
---

- One-hot Encoding
- Bag-of-Words
- TF-IDF

# One-hot Encoding

```python
vocab = [ 'one', 'two', 'three' , 'four' ]
```
```
one   : tensor([1., 0., 0., 0.])
two   : tensor([0., 1., 0., 0.])
three : tensor([0., 0., 1., 0.])
four  : tensor([0., 0., 0., 1.])
```

# One-Hot Encoding

In [None]:
import torch

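def one_hot_encode(w, w2i):
    # start from an all-zero vector with one slot per vocabulary word
    x = torch.zeros(len(w2i))
    # switch on the slot that belongs to word `w`
    idx = w2i[w]
    x[idx] = 1
    return x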

vocab = [ 'one', 'two', 'three' , 'four' ]
w2i = { w: i for i, w in enumerate(vocab) }
for w in vocab:
    print(w, ':', one_hot_encode(w, w2i))

# N-gram

```python
sentence = 'one two three four'
```

```
['one',
 'two',
 'three',
 'four',
 'one two',
 'two three',
 'three four',
 'one two three',
 'two three four']
```
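
The list above can be reproduced without any library. The following is a minimal sketch in plain Python, assuming whitespace tokenization; the helper name `word_ngrams` and its `min_n` / `max_n` parameters are hypothetical and chosen only for illustration.

```python
def word_ngrams(sentence, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = sentence.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

ngrams = word_ngrams('one two three four')
print(ngrams)
```

A sentence of four words yields 4 unigrams, 3 bigrams and 2 trigrams, which accounts for the 9 entries listed.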

# N-gram

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 3) extracts unigrams, bigrams and trigrams
ngram_vectorizer = CountVectorizer(ngram_range=(1, 3))
analyze = ngram_vectorizer.build_analyzer()
analyze('one two three four')

# Count Vectorization


```
Size of vocabulary :  122
Sentence :  The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
Vector :  [[1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0
  0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 3 0 0 0
  2 0 0 0 0 1 0 0 0 0 1 0 0 0]]
Most Frequent Word :  (104, 'the')
```
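
The counts in the vector above can be checked by hand. This is a rough sketch in plain Python that mimics the default behaviour of `CountVectorizer` (lowercasing, then keeping tokens of two or more word characters). It builds its vocabulary from this single sentence only, so the vector is much shorter than the 122-entry vector shown above.

```python
import re
from collections import Counter

sentence = ("The last question was asked for the first time, half in jest, "
            "on May 21, 2061, at a time when humanity first stepped into the light.")

# lowercase, then keep runs of two or more word characters (drops 'a')
tokens = re.findall(r"\b\w\w+\b", sentence.lower())
counts = Counter(tokens)

# alphabetical vocabulary built from this one sentence (an assumption of the sketch)
vocab = sorted(counts)
vector = [counts[w] for w in vocab]

print('Size of vocabulary : ', len(vocab))
print('Vector : ', vector)
print('Most Frequent Word : ', counts.most_common(1))
```

Here `the` occurs three times and `first` and `time` twice each, which matches the single `3` and the two `2` entries in the printed vector.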

# Count Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize
import numpy as np

# create an instance of sklearn's Count Vectorizer
vectorizer = CountVectorizer()
with open('data/asimov-tiny.txt') as f:
    sentences = sent_tokenize(f.read())
vectorizer.fit(sentences)
# w2i maps word -> index; vocab maps index -> word
w2i = vectorizer.vocabulary_
vocab = { i: w for w, i in w2i.items() }
print('Size of vocabulary : ', len(vocab))
print('Sentence : ', sentences[0])
sent_vector = vectorizer.transform([sentences[0]]).toarray()
print('Vector : ', sent_vector)
# index of the largest count in the sentence vector
most_frequent = int(np.argmax(sent_vector[0]))
print('Most Frequent Word : ', (most_frequent, vocab[most_frequent]))

# TF-IDF
---

$$tf \times idf$$
- $tf$ : Term Frequency in the document
- $idf$ : (logarithm of) inverse fraction of the documents that contain the word 
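
To make the two factors concrete, here is a small sketch in plain Python using the textbook form $idf = \log(N / df)$. The three-document corpus and the helper name `tf_idf` are made up for illustration. Note that sklearn's `TfidfTransformer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# hypothetical toy corpus, one document per string
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
corpus = [d.split() for d in docs]

def tf_idf(term, doc_tokens, corpus):
    # tf: raw count of the term in this document
    tf = Counter(doc_tokens)[term]
    # df: number of documents that contain the term
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    # idf: log of the inverse fraction of documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf

score_the = tf_idf("the", corpus[0], corpus)
score_cat = tf_idf("cat", corpus[0], corpus)
print('the :', score_the)
print('cat :', score_cat)
```

Although `the` appears twice in the first document and `cat` only once, `cat` receives the higher score because it occurs in fewer documents.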

# TF-IDF

In [None]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# split the corpus into sentences; each sentence is treated as a document
with open('data/asimov-tiny.txt') as f:
    sentences = sent_tokenize(f.read())

# learn the vocabulary and build the count matrix
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)

# learn the idf weights from the count matrix
tf_transformer = TfidfTransformer()
tf_transformer.fit(vectors)

w2i = vectorizer.vocabulary_
vocab = { i: w for w, i in w2i.items() }
print('Size of vocabulary : ', len(vocab))
print('Sentence : ', sentences[0])
sent_vector = vectorizer.transform([sentences[0]]).toarray()
print('Vector : ', sent_vector)
# re-weight the raw counts of the first sentence by tf-idf
print(tf_transformer.transform(sent_vector).toarray())

# TF-IDF


```
[[0.19564323 0.19564323 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.19564323 0.1639643  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.39128647
  0.         0.         0.1639643  0.         0.         0.
  0.         0.         0.         0.         0.19564323 0.
  0.         0.         0.         0.19564323 0.1639643  0.19564323
  0.         0.         0.         0.19564323 0.         0.19564323
  0.         0.         0.19564323 0.         0.         0.
  0.         0.19564323 0.         0.         0.         0.
  0.         0.         0.         0.         0.19564323 0.
  0.         0.         0.         0.         0.         0.
  0.         0.1639643  0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.19564323
  0.         0.         0.26199672 0.         0.         0.
  0.39128647 0.         0.         0.         0.         0.14148774
  0.         0.         0.         0.         0.19564323 0.
  0.         0.        ]]
  ```

# PyTorch for Machine Learning
---

- Linear Regression
- Logistic Regression
- Neural Network
- Recurrent Neural Network
- Long Short Term Memory (LSTM)

# Linear Regression
---

| x | y   |
|------|------|
| 0.54  | 2.01    |
| 1.21  | 4.13    |
| 0.2   | 0.82    |
| ...   | ...     |

$$\hat{y} = wx + b$$

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1)
linear = nn.Linear(1, 1)
y = linear(x)
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Linear Regression

```
x :  tensor([0.3726])
y :  tensor([-0.6245], grad_fn=<AddBackward0>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[0.1961]], requires_grad=True)), ('bias', Parameter containing:
tensor([-0.6975], requires_grad=True))]
```

# Multi-variate Linear Regression
---

| $x_0$ | $x_1$   | $x_2$ | y   |
|------|------|------|------|
| 0.54  | 2.01    | 0.14  | 2.91    |
| 1.21  | 4.13    | 0.24  | 4.22    |
| 0.2   | 0.82    | 1.8   | 1.35    |
| ...   | ...     | ...   | ...     |

$$\hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b$$
<center><h3>OR</h3></center>
$$\hat{\overrightarrow{y}} = W\overrightarrow{x} + b$$

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1, 3)
linear = nn.Linear(3, 1)
y = linear(x)
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Multi-variate Linear Regression

```
x :  tensor([[0.2249, 0.2030, 0.8778]])
y :  tensor([[0.0854]], grad_fn=<AddmmBackward>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[ 0.4610, -0.3530,  0.5720]], requires_grad=True)), ('bias', Parameter containing:
tensor([-0.4488], requires_grad=True))]
```
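
The printed `y` can be verified by hand from the printed `x`, `weight` and `bias`. The sketch below does the dot product in plain Python. Since the printed values are rounded to four decimals, the result should agree with the output above only to about three decimal places.

```python
# values copied from the printed output above (rounded to 4 decimals)
x = [0.2249, 0.2030, 0.8778]
w = [0.4610, -0.3530, 0.5720]
b = -0.4488

# y_hat = w_0*x_0 + w_1*x_1 + w_2*x_2 + b
y_hat = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print('y_hat : ', round(y_hat, 3))
```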

# Logistic Regression
---
$$\hat{y} = \sigma( wx + b )$$
$$ \hat{y} \in (0, 1)\ \forall\ x \in (-\infty, +\infty) $$

In [None]:
import torch
import torch.nn as nn

x = torch.rand(1)
linear = nn.Linear(1, 1)
activation_fn = nn.Sigmoid()
# activation_fn = nn.Tanh()
y = activation_fn(linear(x))
print('x : ', x)
print('y : ', y)
print('Associated Parameters : ', list(linear.named_parameters()))

# Logistic Regression

```
x :  tensor([0.4730])
y :  tensor([0.6162], grad_fn=<SigmoidBackward>)
Associated Parameters :  [('weight', Parameter containing:
tensor([[0.1867]], requires_grad=True)), ('bias', Parameter containing:
tensor([0.3851], requires_grad=True))]
```
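
As a sanity check, the same number can be recomputed in plain Python from the printed `x`, `weight` and `bias`. This is only a sketch; the helper name `sigmoid` is introduced here for illustration, and the inputs are the rounded values from the output above.

```python
import math

def sigmoid(a):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

# values copied from the printed output above (rounded to 4 decimals)
x, w, b = 0.4730, 0.1867, 0.3851

y_hat = sigmoid(w * x + b)
print('y_hat : ', round(y_hat, 4))

# the output stays inside (0, 1) even for large-magnitude inputs
print(sigmoid(-10.0), sigmoid(10.0))
```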

# Neural Network
---

- Feed-Forward Neural Network
- Add a hidden layer
- 1 input layer, 1 hidden layer, 1 output layer

$$ \hat{y} = \tanh(W_2z + b_2) $$
$$ z = \tanh(W_1x + b_1) $$


# Neural Network



In [None]:
import torch
import torch.nn as nn

x = torch.rand(1, 3)
linear_1 = nn.Linear(3, 5)
activation_fn_1 = nn.Tanh()
linear_2 = nn.Linear(5, 1)
activation_fn_2 = nn.Tanh()
z = activation_fn_1(linear_1(x))
y = activation_fn_2(linear_2(z))
print('x : ', x)
print('z : ', z)
print('y : ', y)
print('Associated Parameters : ')
print('\nlinear_1 (weight) : ', 
      linear_1.weight.size(), linear_1.weight)
print('linear_1 (bias) : ', 
      linear_1.bias.size(), linear_1.bias)
print('\nlinear_2 (weight) : ', 
      linear_2.weight.size(), linear_2.weight)
print('linear_2 (bias) : ', 
      linear_2.bias.size(), linear_2.bias)

# Neural Network

```
linear_1 (weight) :  torch.Size([5, 3]) Parameter containing:
tensor([[ 0.0067, -0.0129, -0.3105],
        [ 0.0487, -0.5520, -0.5208],
        [-0.5579, -0.0674, -0.3417],
        [-0.2760, -0.2374,  0.4286],
        [ 0.1238,  0.5067, -0.0352]], requires_grad=True)
linear_1 (bias) :  torch.Size([5]) Parameter containing:
tensor([-0.4454,  0.5320,  0.0727,  0.1670, -0.0999], requires_grad=True)

linear_2 (weight) :  torch.Size([1, 5]) Parameter containing:
tensor([[ 0.1594,  0.2691, -0.1408,  0.1989,  0.4021]], requires_grad=True)
linear_2 (bias) :  torch.Size([1]) Parameter containing:
tensor([0.2987], requires_grad=True)
```
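
To see how these shapes fit together, the forward pass can be traced in plain Python with the parameters printed above. This is a sketch, not the PyTorch implementation: the input `x` is a made-up value (the original input is not shown in the output), and the helper names `affine` and `tanh_layer` are hypothetical.

```python
import math

# parameters copied from the printed output above
W1 = [[ 0.0067, -0.0129, -0.3105],
      [ 0.0487, -0.5520, -0.5208],
      [-0.5579, -0.0674, -0.3417],
      [-0.2760, -0.2374,  0.4286],
      [ 0.1238,  0.5067, -0.0352]]
b1 = [-0.4454, 0.5320, 0.0727, 0.1670, -0.0999]
W2 = [[0.1594, 0.2691, -0.1408, 0.1989, 0.4021]]
b2 = [0.2987]

def affine(W, b, v):
    # one output per row of W: dot(row, v) + bias
    return [sum(w * x for w, x in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def tanh_layer(W, b, v):
    return [math.tanh(a) for a in affine(W, b, v)]

x = [0.5, 0.1, 0.9]          # hypothetical input with 3 features
z = tanh_layer(W1, b1, x)    # hidden layer: 3 -> 5
y = tanh_layer(W2, b2, z)    # output layer: 5 -> 1

print('z : ', z)
print('y : ', y)
```

Each row of a weight matrix produces one output unit, which is why `linear_1.weight` has size `[5, 3]` and `linear_2.weight` has size `[1, 5]`.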

