###### 2023-12-13 ITHS

```
===============================

Lesson 11: Robert Nyquist

===============================
```

### NLP & RNN

[wikipedia.org/wiki/Natural_language_processing](https://en.wikipedia.org/wiki/Natural_language_processing)<br>
[wikipedia.org/wiki/Recurrent_neural_network](https://en.wikipedia.org/wiki/Recurrent_neural_network)

# Natural Language Processing

- Tokenization
- Stop words
- Lemmatization & Stemming
- Word embeddings
<br><br>
---

NLP combines linguistics and AI.
- Natural language is very complex, and computers do not understand it natively

Many difficulties, not just for computers:
- Different meaning (Polysemy, Homophones, Idioms, Sarcasm & Irony)
- Different languages (False friends)
- Cultural and historical context

### Interesting topics on language ambiguity

[Polysemy](https://en.wikipedia.org/wiki/Polysemy) | [Aberrant decoding](https://en.wikipedia.org/wiki/Aberrant_decoding) | [False friend](https://en.wikipedia.org/wiki/False_friend) | [Dysphemism](https://en.wikipedia.org/wiki/Dysphemism) | [Oxymoron](https://en.wikipedia.org/wiki/Oxymoron) | [Homophone](https://en.wikipedia.org/wiki/Homophone) | [Phrasal verbs](https://en.wikipedia.org/wiki/English_phrasal_verbs)

How would you teach a computer to deal with this?

What is a *good* text?
- Text can vary in quality.
- How to deal with spelling and grammar?

Language changes quickly over time.

**Text corpus**: Dataset with text (large, unstructured)

## Tokenization

Breaking a sentence up into its individual words (tokens), so that the *meaning* of each word can be learned.

It is sometimes useful to have a placeholder for unknown words: "UNK" or `<UNK>` (unknown word or unknown token).

```
{
    "<UNK>": 0,
    "jag": 1,
    "läser": 2,
    "en": 3,
    "bok": 4
}
```
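A minimal sketch of how such a vocabulary could be built and used, assuming simple whitespace splitting and lowercasing (real tokenizers also handle punctuation and subwords). The function names `build_vocab` and `encode` are made up for this example.

```python
def build_vocab(sentences):
    """Map each distinct lowercase word to an integer ID; ID 0 is reserved for <UNK>."""
    vocab = {"<UNK>": 0}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace each word with its ID, falling back to <UNK> for unseen words."""
    return [vocab.get(token, vocab["<UNK>"]) for token in sentence.lower().split()]


vocab = build_vocab(["Jag läser en bok"])
encoded = encode("Jag läser en tidning", vocab)
print(vocab)    # {'<UNK>': 0, 'jag': 1, 'läser': 2, 'en': 3, 'bok': 4}
print(encoded)  # [1, 2, 3, 0]
```

"tidning" was never seen when the vocabulary was built, so it is mapped to the `<UNK>` ID.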

[N-grams](https://en.wikipedia.org/wiki/N-gram) to break sentences into sequences with n words.

1-gram (unigram): "Jag", "läser", "en", "bok"<br>
2-gram (bigram) : "Jag läser", "läser en", "en bok"<br>
3-gram (trigram): "Jag läser en", "läser en bok"
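The n-grams above can be produced with a sliding window over the token list. A small sketch, assuming the sentence is already split on whitespace; `ngrams` is a name chosen for this example:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Jag läser en bok".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['Jag läser', 'läser en', 'en bok']
print(trigrams)  # ['Jag läser en', 'läser en bok']
```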

Tokenization is a typical first step in sentiment analysis.

## Stop words

"and", "is", "the", "in", "to", "it"

Very frequent words, often not very useful (English and Swedish). *Can* be very important, but generally recommended to remove.

Removing stop words:
- Reduce dimensions in our data
- Remove noise
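A rough sketch of stop-word filtering, assuming the tiny list from above. Real stop-word lists are much longer and language-specific; `remove_stop_words` is a made-up helper name.

```python
# Tiny illustrative list taken from the examples above, not a complete stop-word list.
STOP_WORDS = {"and", "is", "the", "in", "to", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The book is in the library and it is good".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['book', 'library', 'good']
```

This also shows why stop words *can* matter: a list that includes "not" would turn "not good" into "good".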



## Lemmatization & Stemming
"Normalizing" words in terms of their grammatical root or base form.

[Stemming](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html): Remove inflectional endings.

Example:

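Stemming would reduce "running" and "runs" to "run" by cutting off the endings, but it would miss an irregular form like "ran".<br>
Lemmatization would identify the **base form** of each word based on vocabulary and **context**, so "ran" also becomes "run".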

The complexity of the language can make lemmatization quite difficult.
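To make the difference concrete, here is a toy comparison. Both functions are deliberately simplified stand-ins: `toy_stem` uses a made-up two-suffix rule set (not the Porter algorithm), and `toy_lemmatize` uses a hand-written lookup table in place of a real dictionary and ignores context.

```python
SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word


# Stand-in for the vocabulary a real lemmatizer would consult.
LEMMAS = {"ran": "run", "running": "run", "runs": "run"}


def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


words = ["running", "runs", "ran"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['run', 'run', 'ran']
print(lemmas)  # ['run', 'run', 'run']
```

The stemmer only manipulates letters, so the irregular form "ran" slips through, while the lookup-based lemmatizer maps it to "run".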

Why lemmatization?
- Reduce risk of overfitting
- Reduce dimensionality
- Better word count

## Word embeddings
We cannot do calculations on words (strings) directly.

"House" and "cottage" are very similar to us as natural-language speakers, but there is no similarity in the letters.

To combat this we translate the strings to vectors.

Example: [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) 

King - Man + Woman ≈ Queen

![](https://miro.medium.com/v2/resize:fit:2000/format:webp/1*SYiW1MUZul1NvL1kc1RxwQ.png)

From [TowardsDataScience: A Guide to Word Embedding](https://towardsdatascience.com/a-guide-to-word-embeddings-8a23817ab60f)
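The analogy can be illustrated with vector arithmetic and cosine similarity. The 3-D vectors below are invented for this sketch so that the arithmetic works out. Real Word2Vec embeddings are learned from a corpus and have far more dimensions.

```python
import math

# Made-up vectors for illustration only, not learned embeddings.
embeddings = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the word whose embedding is most similar to the given vector."""
    candidates = (w for w in embeddings if w not in exclude)
    return max(candidates, key=lambda w: cosine_similarity(vector, embeddings[w]))


target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
result = nearest(target, exclude=("king", "man", "woman"))
print(result)  # queen
```

The query words are excluded from the search, which is the usual convention when evaluating analogies like this.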

## Padding
When working with sentences we won't know the *size of our input* in advance.<br>
The model needs all sentences to have the same length, so we set a maximum length and fill up shorter sentences: Padding

Padding adds extra tokens to a sequence of data to make it the same length as other sequences in a dataset.

Example: 5 input tokens:
```
{
    "<PAD>":0,
    "<UNK>":1,
    "jag":2,
    "läser":3,
    "en":4,
    "bok":5
}
```
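A minimal sketch of padding with this vocabulary, assuming a maximum length of 5 and padding at the end of the sequence. Whether to pad at the start or at the end is a design choice, and libraries differ in their defaults. `pad_sequence` is a name chosen for this example.

```python
PAD_ID = 0


def pad_sequence(ids, max_len, pad_id=PAD_ID):
    """Truncate to max_len, then right-pad with pad_id up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


vocab = {"<PAD>": 0, "<UNK>": 1, "jag": 2, "läser": 3, "en": 4, "bok": 5}
sentence = [vocab.get(w, vocab["<UNK>"]) for w in "jag läser".split()]
padded = pad_sequence(sentence, max_len=5)
print(padded)  # [2, 3, 0, 0, 0]
```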



# Recurrent Neural Network

RNN handles sequences of data, where **order matters**.

"Student cook delicious food" != "Delicious food cook student"

![RNN_1.png](../Resources/material/RNN_1.png)

$x_t$: Input<br>
$h_t$: Hidden state (memory)<br>
$A$: Activation function<br>

$h_t=\sigma(W_{hh}h_{t-1} + W_{xh}x_t+b_h)$

$y_t = W_{hy}h_t + b_y$

The memory lets the network handle the sequence internally in its layers. <br>
The hidden state (memory) is computed as: activation function(weights * previous hidden state of the same layer + weights * input + bias).
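The two equations above can be sketched in plain Python. The weights below are arbitrary small numbers picked for illustration (input size 2, hidden size 2, output size 1), and the helper names are made up. In practice a deep learning framework would handle this.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = sigma(W_hh h_{t-1} + W_xh x_t + b_h)"""
    pre = [a + b + c for a, b, c in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [sigmoid(z) for z in pre]


def rnn_forward(xs, h0, W_xh, W_hh, b_h, W_hy, b_y):
    """Run the same step over the whole sequence, carrying the hidden state along."""
    h, ys = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        ys.append([o + b for o, b in zip(matvec(W_hy, h), b_y)])  # y_t = W_hy h_t + b_y
    return ys, h


# Arbitrary illustrative weights, not trained values.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
W_hy = [[1.0, -1.0]]
b_y = [0.0]

seq = [[1.0, 0.0], [0.0, 1.0]]
ys_forward, _ = rnn_forward(seq, [0.0, 0.0], W_xh, W_hh, b_h, W_hy, b_y)
ys_reversed, _ = rnn_forward(seq[::-1], [0.0, 0.0], W_xh, W_hh, b_h, W_hy, b_y)
print(ys_forward[-1], ys_reversed[-1])  # the final outputs differ
```

Feeding the same two inputs in reverse order gives a different final output, which is the "order matters" property in action.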


Regular backpropagation calculates the gradients of the weights given an error.
The error is the difference between the output and the ground truth. => Get gradient.

This needs adjustments when working with RNNs.

## Backpropagation through time (BPTT)
Backpropagation for RNN, [BPTT](https://en.wikipedia.org/wiki/Backpropagation_through_time)

[RNN Unrolling](https://machinelearningmastery.com/rnn-unrolling/)

We calculate the gradient at every time step.<br>
Then we sum the gradients over all time steps and update the weights.

At a time step $t$ we calculate the error $e_t$.

1. We unroll the RNN in time by creating one copy of the network for each time step of the input sequence. Every copy has the same structure and shares the same weights as the original network, but it only receives the input from the current time step and the hidden state from the previous time step.

2. We apply backpropagation to each time-step copy: we calculate the error and the resulting gradient for each time step.

3. We sum up the gradients for all time steps. This gives us the final gradient for the entire input sequence.

4. We use the final gradient to update the weights of the original network.
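The four steps can be sketched for the smallest possible case: an RNN where input, hidden state and output are single numbers. The assumptions are mine: a squared-error loss summed over time steps, a sigmoid activation as in the formula earlier, and arbitrary parameter values and data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(p, xs, h0=0.0):
    """Unrolled forward pass; returns hidden states (including h0) and outputs."""
    hs, ys = [h0], []
    for x in xs:
        hs.append(sigmoid(p["w_hh"] * hs[-1] + p["w_xh"] * x + p["b_h"]))
        ys.append(p["w_hy"] * hs[-1] + p["b_y"])
    return hs, ys


def loss(p, xs, targets):
    _, ys = forward(p, xs)
    return sum(0.5 * (y - t) ** 2 for y, t in zip(ys, targets))


def bptt(p, xs, targets):
    """Sum the gradient contributions of every time step, walking backwards in time."""
    hs, ys = forward(p, xs)
    grads = {name: 0.0 for name in p}
    dh_next = 0.0  # gradient flowing back from step t+1 into h_t
    for t in reversed(range(len(xs))):
        e_t = ys[t] - targets[t]        # error at time step t
        h_t, h_prev = hs[t + 1], hs[t]
        grads["w_hy"] += e_t * h_t
        grads["b_y"] += e_t
        dh = e_t * p["w_hy"] + dh_next
        dz = dh * h_t * (1.0 - h_t)     # derivative of the sigmoid
        grads["w_hh"] += dz * h_prev
        grads["w_xh"] += dz * xs[t]
        grads["b_h"] += dz
        dh_next = dz * p["w_hh"]
    return grads


# Arbitrary illustrative parameters and data.
params = {"w_xh": 0.5, "w_hh": -0.4, "b_h": 0.1, "w_hy": 1.2, "b_y": -0.2}
xs, targets = [1.0, 0.5, -1.0], [0.2, 0.4, 0.6]
grads = bptt(params, xs, targets)

# One gradient-descent update, applied once after all time steps are summed.
learning_rate = 0.1
updated = {name: params[name] - learning_rate * grads[name] for name in params}
print(loss(params, xs, targets), loss(updated, xs, targets))
```

Because the same weights are reused at every time step, each weight's gradient is accumulated with `+=` across the loop, and the weights are only changed after the loop has finished.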

---

### Encoder-Decoder

Sequence-to-sequence

[seq2seq](https://en.wikipedia.org/wiki/Seq2seq)

Handle dynamic length of output.

Encoder-Decoder architecture

- Encoder: Sequence ➜ Fixed-length output
- Decoder: Fixed-length output ➜ Dynamic-length output

I am a student ➜ Encoder ➜ [0.34, 1.23][0.34, 1.23] ➜ Decoder ➜ soy un estudiante



### Time series

<center>
<blockquote><h3>If we cannot randomize order, we should consider RNNs.</h3></blockquote>
</center>

Order matters!

- Weather
- Stocks
- Health data etc.


Time series:
- Feature engineering
- Handle missing data
- Normalization etc.
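A small sketch of these preparation steps on made-up temperature readings: forward-filling missing values, min-max normalization, and cutting the series into ordered (window, next value) pairs that an RNN could train on. It assumes the first reading is present and that the values are not all identical. The helper names are invented for this example.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def sliding_windows(values, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    return [(values[i:i + window], values[i + window])
            for i in range(len(values) - window)]


temperatures = [10.0, None, 14.0, 18.0, None, 20.0]  # made-up readings
clean = forward_fill(temperatures)
scaled = min_max_normalize(clean)
samples = sliding_windows(scaled, window=3)
print(clean)       # [10.0, 10.0, 14.0, 18.0, 18.0, 20.0]
print(samples[0])  # ([0.0, 0.0, 0.4], 0.8)
```

The order inside each window is preserved, and a train/test split should then be made chronologically rather than by shuffling.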

---

## LSTM

Long Short-Term Memory: a type of RNN

![LSTM_1.png](../Resources/material/LSTM_1.png)

A lot is going on here. Instead of just one activation, an LSTM cell has:<br>
Input, hidden state, memory, forget gate, input gate, candidate memory, output gate.

- Memory (Cell state): Long-term memory carried from one time step to the next
- Candidate memory: New candidate values computed from the input and the previous hidden state
- Forget gate: What data to discard
- Input gate: What data should be added to cell state
- Output gate: What data should be used to calculate output


Flow:

- Three inputs: $x_t$ (e.g. a word in a sentence), $h_{t-1}$ (previous hidden state) and previous cell state ($C_{t-1}$)

- Forget gate decides what information to keep from the previous cell state. It uses the previous hidden state and the input to generate a vector with values in the range [0, 1]. (0 = forget everything, 1 = keep everything)

- Input gate decides what values to update in the cell.

- Combining information from the forget gate and the input gate: the cell decides what to add or update.

- Output gate decides what part of the cell state becomes the new hidden state. The new cell state and the new hidden state are used in the next time step.
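The flow can be written out as one LSTM step. This sketch uses the commonly cited LSTM equations with scalar states to keep it short. The parameter layout (one `(w_x, w_h, bias)` triple per gate) and the weight values are assumptions for illustration, not taken from the lecture.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar states; p maps each gate to (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    f = gate("forget", sigmoid)             # how much of c_prev to keep, in (0, 1)
    i = gate("input", sigmoid)              # how much of the candidate to add
    c_tilde = gate("candidate", math.tanh)  # candidate memory from x_t and h_prev
    c_t = f * c_prev + i * c_tilde          # new cell state
    o = gate("output", sigmoid)             # how much of the cell state to expose
    h_t = o * math.tanh(c_t)                # new hidden state
    return h_t, c_t


# Arbitrary illustrative weights, not trained values.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.0),
    "candidate": (1.0, 0.3, 0.0),
    "output": (0.4, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state is only changed by a multiplication (forget) and an addition (input), which is what lets information survive over many time steps.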


Pros & Cons

- **Handling long-term dependencies** in time, word, sentences, etc.
- **Increased complexity** (slower training and prediction)
- **Vulnerability to overfitting** especially when the dataset is limited

## GRU

Gated Recurrent Unit

![GRU_1.png](../Resources/material/GRU_1.png)

###### *"a bit more moderate"*

Less complex than LSTM and easier to train, but may not handle complex problems as well as LSTM.

Flow:

- Two inputs: $x_t$ (e.g. a word in a sentence) and $h_{t-1}$ (previous hidden state). Like RNN.
- Reset gate decides how much of the previous hidden state should be combined with the current input.
- Update gate decides how much of the new hidden state comes from new information (current input) and how much is kept from the previous hidden state.
- Hidden state output: The update gate blends the previous hidden state with a candidate state built from the input and the reset-gated previous hidden state.
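The same flow as a single GRU step with a scalar hidden state. This is a sketch under assumptions: the weights are arbitrary, the parameter layout is invented for the example, and the blend is written as $(1 - z)\,h_{t-1} + z\,\tilde{h}_t$. Some sources and libraries swap the roles of $z$ and $1 - z$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x_t, h_prev, p):
    """One GRU step with a scalar hidden state; p maps each part to (w_x, w_h, bias)."""
    w_x, w_h, b = p["reset"]
    r = sigmoid(w_x * x_t + w_h * h_prev + b)                # reset gate
    w_x, w_h, b = p["update"]
    z = sigmoid(w_x * x_t + w_h * h_prev + b)                # update gate
    w_x, w_h, b = p["candidate"]
    h_tilde = math.tanh(w_x * x_t + w_h * (r * h_prev) + b)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                  # blend old and new


# Arbitrary illustrative weights, not trained values.
params = {
    "reset": (0.5, 0.3, 0.0),
    "update": (0.8, -0.1, 0.0),
    "candidate": (1.0, 0.6, 0.0),
}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
    print(round(h, 4))
```

Compared with the LSTM there is no separate cell state and one gate fewer, which is where the lower complexity comes from.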