# LSTM for Text Classification

In this notebook, we will get acquainted with **Long Short-Term Memory (LSTM)**, a *neural network* architecture that is commonly used for sequential data.

## Agenda

Today's agenda:
* Why LSTM
* LSTM with PyTorch
* How LSTM works

In [None]:
import random
import re
from collections import Counter
from pathlib import Path
from string import punctuation

import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

pd.set_option("display.max_colwidth", 0)

## Why LSTM?

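Suppose we are given a news excerpt like this one:

> Timnas Indonesia U-22 lolos ke babak final SEA Games 2019 dengan perjuangan berat
>
> (English: Indonesia's U-22 national team advanced to the final of the 2019 SEA Games after a hard-fought battle)

and another like this one:

> Pedangdut Selvi Kitty masih mengupayakan yang terbaik bagi kesembuhan putranya, Abizard Kavin Suseno yang mengidap penyakit langka, sindrom Kawasaki.
>
> (English: Dangdut singer Selvi Kitty is still seeking the best for the recovery of her son, Abizard Kavin Suseno, who suffers from a rare disease, Kawasaki syndrome.)

Which category would you assign to each of these two excerpts?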

---

Every word in the sentences above is connected to the words before or after it, and sometimes to words that appear much earlier or much later. This is what makes text data challenging to process.

Models such as CNNs or ANNs are limited when they have to learn from sequential data such as text. One limitation of deep neural networks, and even of convolutional neural networks, is that both architectures accept an input vector of a fixed shape or size (for example, an image) and produce an output vector of a fixed size as well (for example, the probability of each class). A news article, by contrast, can contain any number of words.

## LSTM with PyTorch

We will build an LSTM model to classify the category of news articles.

### Datasets

The data consists of previously collected news articles that fall into 9 categories:
* football
* international news
* health
* politik
* business
* celebs
* local news
* romance
* religi

In [None]:
DATA_DIR = Path("data/news-article")
DATA_FILEPATH = DATA_DIR / "news.csv"

In [None]:
df_news = pd.read_csv(DATA_FILEPATH)
df_news = df_news[~df_news["class"].isin(["international_film_tv"])]

In [None]:
df_news.sample(5)

### Data Preprocessing

In [None]:
def lowerize(df):
    df["full_text"] = df["full_text"].str.lower()
    return df


def remove_punctuation(df):
    df["full_text"] = df["full_text"].apply(
        lambda excerpt: "".join([char for char in excerpt if char not in punctuation])
    )
    return df


def remove_digits(df):
    df["full_text"] = df["full_text"].apply(
        lambda excerpt: re.sub(r"\b\d+\b", "", excerpt)
    )
    return df


df_news = df_news.pipe(lowerize).pipe(remove_punctuation).pipe(remove_digits)

In [None]:
df_news.sample(5)

#### Tokenization

In [None]:
# create dict vocab
count = Counter(" ".join(df_news["full_text"].tolist()).split())
vocab = sorted(count, key=count.get, reverse=False)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}
print("Number of vocab:", len(vocab))

# tokenize
news_tokens = []
for news in df_news["full_text"]:
    news_tokens.append([vocab_to_int[word] for word in news.split()])
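The vocabulary mapping is easier to follow on a tiny corpus. The sketch below repeats the same steps on two made-up sentences (the `corpus` list is an assumption, not part of the dataset) and shows that index 0 stays free, which is what later allows 0 to act as the padding value.

```python
from collections import Counter

# Toy corpus standing in for df_news["full_text"] (hypothetical sentences).
corpus = ["timnas indonesia lolos ke final", "indonesia juara"]

# Count word frequencies, then sort ascending by frequency as in the cell above.
count = Counter(" ".join(corpus).split())
vocab = sorted(count, key=count.get, reverse=False)

# Enumerate from 1 so that index 0 is never assigned to a real word.
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
int_to_vocab = {i: word for word, i in vocab_to_int.items()}

# Encode the second sentence and decode it back.
tokens = [vocab_to_int[word] for word in corpus[1].split()]
decoded = " ".join(int_to_vocab[token] for token in tokens)
print(tokens, "->", decoded)
```

Because "indonesia" is the most frequent word and the sort is ascending, it receives the largest index in this toy vocabulary.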

In [None]:
len(vocab_to_int)

In [None]:
df_news.loc[1, "full_text"]

In [None]:
print(news_tokens[1])

In [None]:
" ".join([int_to_vocab[token] for token in news_tokens[1]])

In [None]:
news_lengths = Counter([len(x) for x in news_tokens])
print("Maximum news length:", max(news_lengths))
print("Minimum news length:", min(news_lengths))

#### Padding Sequences

In [None]:
def pad_sequence(sequences, seq_length):
    """
    Return sequences where each sequence is padded with 0's
    or truncated to the `seq_length`
    """
    padded_sequences = np.zeros((len(sequences), seq_length), dtype=int)

    for i, row in enumerate(sequences):
        row = np.array(row, dtype=int)[:seq_length]
        if len(row) > 0:  # an empty sequence stays all zeros
            padded_sequences[i, -len(row):] = row

    return padded_sequences
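To see what left-padding and truncation do, here is a minimal standalone sketch on three toy sequences. The helper name `pad_or_truncate` and the sample token ids are assumptions made for illustration; the logic follows the same idea as the function above, with an explicit guard for empty sequences.

```python
import numpy as np

def pad_or_truncate(sequences, seq_length):
    """Left-pad each sequence with 0s, or keep only its first `seq_length` tokens."""
    padded = np.zeros((len(sequences), seq_length), dtype=int)
    for i, row in enumerate(sequences):
        row = np.asarray(row[:seq_length], dtype=int)
        if len(row) > 0:  # an empty sequence stays all zeros
            padded[i, -len(row):] = row
    return padded

# one short, one long, and one empty sequence (made-up token ids)
toy = [[5, 2, 9], [7, 1, 4, 8, 3, 6], []]
padded = pad_or_truncate(toy, 4)
print(padded)
```

The short sequence gets a leading zero, the long one loses its last two tokens, and the empty one becomes a row of zeros.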

In [None]:
SEQ_LENGTH = 500

padded_sequences = pad_sequence(news_tokens, SEQ_LENGTH)

# assert to check all goes as expected
assert len(padded_sequences) == len(news_tokens)
assert len(padded_sequences[0]) == SEQ_LENGTH

In [None]:
print(padded_sequences[0])

In [None]:
len(news_tokens[0])

In [None]:
np.where(padded_sequences == 0, 1, 0)[0].sum()

#### Encode Target Class

In [None]:
class_to_idx = {c: idx for idx, c in enumerate(df_news["class"].unique())}
idx_to_class = {idx: c for c, idx in class_to_idx.items()}
df_news["encoded_class"] = df_news["class"].map(class_to_idx)

In [None]:
print(class_to_idx)

In [None]:
df_news.sample(5)

### Training, Validation, Test

#### Shuffle Dataset

In [None]:
features = padded_sequences.copy()
labels = df_news["encoded_class"].values.copy()

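# shuffle features and labels with the same permutation
# so that every article keeps its own label
rng = np.random.default_rng(11)
shuffled_idx = rng.permutation(len(features))
features = features[shuffled_idx]
labels = labels[shuffled_idx]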
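When shuffling, features and labels have to be reordered by one shared permutation; shuffling the two arrays separately draws two different orders and breaks the pairing between an article and its category. A minimal sketch on toy arrays (the arrays `X` and `y` are assumptions built so the pairing is easy to check):

```python
import numpy as np

# Toy data: row i of X is [2i, 2i + 1], and its label is i.
X = np.arange(10).reshape(5, 2)
y = np.arange(5)

rng = np.random.default_rng(11)

# Draw one permutation and apply it to both arrays.
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

# Every shuffled row still carries its own label.
still_paired = bool(np.all(X_shuffled[:, 0] == 2 * y_shuffled))
print("pairs preserved:", still_paired)
```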

In [None]:
TRAIN_SIZE = .8

train_idx = int(len(features) * TRAIN_SIZE)
X_train, X_remaining = features[:train_idx], features[train_idx:]
y_train, y_remaining = labels[:train_idx], labels[train_idx:]

test_idx = int(len(X_remaining) * .5)
X_val, X_test = X_remaining[:test_idx], X_remaining[test_idx:]
y_val, y_test = y_remaining[:test_idx], y_remaining[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{} {}".format(X_train.shape, y_train.shape), 
      "\nValidation set: \t{} {}".format(X_val.shape, y_val.shape),
      "\nTest set: \t\t{} {}".format(X_test.shape, y_test.shape))

#### Data Batching

In [None]:
BATCH_SIZE = 32

train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(X_val), torch.from_numpy(y_val))
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))

train_loader = DataLoader(train_data, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)

In [None]:
dataiter = iter(train_loader)
X_sample, y_sample = next(dataiter)

print('Sample input size: ', X_sample.size()) # batch_size, seq_length
print('Sample input: \n', X_sample)
print()
print('Sample label size: ', y_sample.size()) # batch_size
print('Sample label: \n', y_sample)

### Modeling

In [None]:
is_using_gpu = torch.cuda.is_available()

if is_using_gpu:
    print("Will use GPU for modeling")
else:
    print("No GPU available, will use CPU")

In [None]:
class NewsLSTM(nn.Module):
    """LSTM model for News Category Classification."""
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=.5):
        """Init model."""
        super().__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim=embedding_dim)
        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            dropout=drop_prob,
            batch_first=True
        )

        # dropout layer
        self.dropout = nn.Dropout(.3)

        # linear layer; softmax is defined but left unused in forward()
        self.fc = nn.Linear(hidden_dim, output_size)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x: torch.Tensor, hidden: tuple):
        """Perform forward propagation; `hidden` is the (hidden state, cell state) pair."""

        # embedding and LSTM outs
        x = x.long()
        embed_outs = self.embedding(x)
        lstm_outs, hidden = self.lstm(embed_outs, hidden)

        # get the last sequence step outputs
        lstm_outs = lstm_outs[:, -1, :]

        # dropout and fully-connected layer
        out = self.dropout(lstm_outs)
        out = self.fc(out)

        # softmax is intentionally skipped: nn.CrossEntropyLoss expects raw logits
        # out = self.softmax(out)

        return out, hidden

    def init_hidden(self, batch_size):
        """Init hidden state of LSTM layer."""
        # Create two zero tensors with size: n_layers x batch_size x hidden_dim
        # for the hidden state and cell state of the LSTM. `new_zeros` keeps the
        # dtype and device of the model weights, so no explicit `.cuda()` is needed.
        weight = next(self.parameters())
        size = (self.n_layers, batch_size, self.hidden_dim)
        hidden = (weight.new_zeros(size), weight.new_zeros(size))

        return hidden

#### Training Model

In [None]:
EPOCHS = 4
GRAD_CLIP = 5

In [None]:
def train_model(
    model, train_loader, valid_loader,
    criterion, optimizer, print_every=50,
    epochs=EPOCHS, grad_clip=GRAD_CLIP, batch_size=BATCH_SIZE
):
    if is_using_gpu:
        model.cuda()

    # set mode to training
    model.train()
    for epoch in range(epochs):
        # initialize hidden state
        h = model.init_hidden(batch_size)

        for counter, (inputs, labels) in enumerate(train_loader, 1):
            if is_using_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()

            # create new instance for hidden state
            # so we don't backprop to the entire training history
            h = tuple([weight.data for weight in h])

            # zeros gradient
            model.zero_grad()

            # forward propagation
            output, h = model(inputs, h)
            # print(output, output.shape)

            # calculate loss and do backprop
            loss = criterion(output, labels)
            loss.backward()

            # update weights and clip gradient to avoid exploding gradient
            nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            optimizer.step()

            # validate after `print_every` batch
            if counter % print_every == 0:
                val_h = model.init_hidden(batch_size)
                val_losses = []

                # set mode to validation
                model.eval()
                for inputs, labels in valid_loader:
                    # create new instance for hidden state
                    # so we don't backprop to the entire training history
                    val_h = tuple([weight.data for weight in val_h])

                    if is_using_gpu:
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output, val_h = model(inputs, val_h)
                    val_loss = criterion(output, labels)

                    val_losses.append(val_loss.item())

                model.train()
                print("Epoch: {}/{}...".format(epoch + 1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.6f}...".format(loss.item()),
                      "Val Loss: {:.6f}".format(np.mean(val_losses)))
    return model
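The training loop calls `nn.utils.clip_grad_norm_` to guard against exploding gradients. The idea behind clipping by norm can be sketched in NumPy: compute one norm over all gradients together and, if it exceeds the limit, scale every gradient by the same factor. The helper `clip_by_global_norm` and the toy gradients are assumptions for illustration, not the PyTorch implementation itself.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Toy gradients of two parameters; their joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)

norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print("norm before:", norm_before)
print("norm after: ", norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length shrinks.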

In [None]:
# training components
LEARNING_RATE = 1e-2
VOCAB_SIZE = len(vocab_to_int) + 1
OUTPUT_SIZE = df_news["class"].nunique()
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
NUM_LAYERS = 2

# define model
model = NewsLSTM(
    vocab_size=VOCAB_SIZE,
    output_size=OUTPUT_SIZE,
    embedding_dim=EMBEDDING_DIM,
    hidden_dim=HIDDEN_DIM,
    n_layers=NUM_LAYERS
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

model = train_model(model, train_loader, valid_loader, criterion, optimizer, print_every=25, epochs=EPOCHS)
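The model returns raw class scores (logits) rather than probabilities, which is what `nn.CrossEntropyLoss` expects, since it applies log-softmax internally. The NumPy sketch below shows that computation on made-up logits; the function name `cross_entropy_from_logits` and the sample values are assumptions for illustration.

```python
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy of integer targets given raw class scores."""
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log-softmax: log of each class probability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the true class for every sample
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two samples, three classes (made-up scores).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])

loss = cross_entropy_from_logits(logits, targets)
print("loss:", loss)
```

Applying a softmax layer before this loss would normalize the scores twice, which is a reason to leave the softmax call out of `forward` during training.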

#### Test Model

In [None]:
test_losses = []
num_correct = 0
num_samples = 0

h = model.init_hidden(BATCH_SIZE)

# set mode to eval
model.eval()

# no gradients are needed for evaluation
with torch.no_grad():
    for inputs, labels in test_loader:
        if is_using_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()

        output, h = model(inputs, h)

        test_loss = criterion(output, labels)
        test_losses.append(test_loss.item())

        _, prediction = output.max(1)

        num_correct += (prediction == labels).sum().item()
        num_samples += labels.size(0)

# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over the evaluated test samples; `test_loader` uses drop_last=True,
# so dividing by len(test_loader.dataset) would count samples that were never scored
test_acc = num_correct / num_samples
print("Test accuracy: {:.3f}".format(test_acc))

#### Inference

In [None]:
def preprocess(inputs: str):
    inputs = inputs.lower()
    # join with "" (not " ") so words stay intact after punctuation is removed
    inputs = "".join([char for char in inputs if char not in punctuation])
    inputs = re.sub(r"\b\d+\b", "", inputs)
    # tokenize; skip words that are not in the training vocabulary
    inputs = [vocab_to_int[word] for word in inputs.split() if word in vocab_to_int]
    # pad sequence
    inputs = pad_sequence([inputs], SEQ_LENGTH)
    return torch.from_numpy(inputs)

In [None]:
excerpt = """
Pelatih Gillingham, Neil Harris, menurunkan Elkan Baggott sejak awal laga.
Bek Timnas Indonesia itu berduet dengan Max Ehmer sebagai bek tengah,
sementara di sayap ada Cheye Alexander dan Robbie McKenzie.
Kemenangan Gillingham hadir dari gol Lewis Walker pada menit 43.
Ini merupakan pertandingan ulang melawan Fylde karena pada sebelumnya
laga berakhir imbang dengan skor 1-1 pada 5 November lalu.
"""
print(excerpt)

inputs = preprocess(excerpt)

In [None]:
model.eval()
with torch.no_grad():
    # move the input to the same device as the model
    if is_using_gpu:
        inputs = inputs.cuda()
    h = model.init_hidden(1)
    out, h = model(inputs, h)
    _, prediction = out.max(1)
    prediction = idx_to_class[prediction.item()]

print("Text:")
print(excerpt)
print("Predicted news category:", prediction)

## How Does LSTM Work?

LSTM was deliberately designed to address the problem of **long-term dependencies**. The computation in an LSTM layer is more complex than in other sequential *neural network* models (for example, a plain RNN), because LSTM tries to retain information that was received long before the current input is processed.

An LSTM produces 2 outputs:
* long-term memory
* short-term memory

The computation in an LSTM layer involves **2 activation functions**, sigmoid and tanh, which are used to produce both the long-term memory and the short-term memory.

<div align='center'>
     <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F39eea92c-e227-4707-b28c-d377461a528e%2Flstm-diagram.png?table=block&id=b5653706-1884-4ec8-9b06-2cea7564e4fa&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=2000&userId=&cache=v2" width=80%/>
</div>

If we picture the 2 inputs and 2 outputs of an LSTM layer as the **“gates”** into and out of the layer, then the layer has 4 computational “gates”:
* learn gate
* forget gate
* remember gate
* use gate

### Forget Gate

This gate decides which information is **discarded** and which is **kept**. It takes the input vector at step $t$ combined with the hidden state at step $t-1$, $[h^{<t-1>},\: X^{<t>}]$, and feeds it to a sigmoid activation to obtain $f_t$, a forget factor whose values lie between 0 and 1. The output of the forget gate is then $c^{<t-1>} \cdot f_t$, the previous cell state scaled element-wise by that factor. The equations below compute the forget gate.

$$
\begin{align}
f_t &= \sigma(W_f[h^{<t-1>},\: X^{<t>}] + b_f) \\
\text{forget_gate} &= c^{<t-1>} \cdot f_t
\end{align}
$$

<div align='center'>
    <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F22a48a00-ba3f-4b9f-a696-29194ad11c80%2Fforget-gate.png?table=block&id=39f7afd4-1905-409b-8e69-aa6702f12a46&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=1860&userId=&cache=v2" width=70%/>
</div>
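The forget-gate equations can be traced with a few lines of NumPy. This is only a sketch: the toy dimensions, the random weights `W_f`, and the helper `sigmoid` are assumptions made up for illustration, not values taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes values into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_dim, input_dim = 3, 2  # toy sizes (assumption)

h_prev = rng.normal(size=hidden_dim)  # h^{<t-1>}
x_t = rng.normal(size=input_dim)      # X^{<t>}
c_prev = rng.normal(size=hidden_dim)  # c^{<t-1>}

W_f = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_f = np.zeros(hidden_dim)

# f_t = sigmoid(W_f [h^{<t-1>}, X^{<t>}] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# forget_gate = c^{<t-1>} * f_t (element-wise)
forget_gate = c_prev * f_t

print("forget factor:", f_t)
print("kept memory:  ", forget_gate)
```

Since every entry of `f_t` lies between 0 and 1, each entry of the old cell state can only shrink in magnitude, never grow.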

### Learn Gate

The learn gate only processes the input vector $X^{<t>}$ and the hidden state $h^{<t-1>}$. The intuition is to “combine” these two inputs and then **ignore part of the information “as needed”**. The gate therefore involves 2 computations. First, for the “ignore” part, a sigmoid activation is applied to the combined inputs $[h^{<t-1>},\: X^{<t>}]$ to obtain the ignoring factor $i_t$. Second, the “combine” part applies a tanh activation to the same combined inputs, using its own weights $W_n$ and bias $b_n$. The equations are as follows.

$$
\begin{align}
i_t &= \sigma(W_i[h^{<t-1>},\: X^{<t>}] + b_i) \\
\text{learn_gate} &= \tanh(W_n[h^{<t-1>},\: X^{<t>}] + b_n) \cdot i_t \\
&= \tilde{C}_t \cdot i_t
\end{align}
$$

<div align='center'>
    <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F4f159c27-0689-448f-9c29-c7f6b6e941f2%2Flearn-gate.png?table=block&id=998e9a10-7274-4d87-a296-8c941a19cf84&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=1810&userId=&cache=v2" width=70%/>
</div>
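A minimal NumPy sketch of the learn gate follows. The random weights, the toy sizes, and the use of a separate bias for the tanh term (named `b_n` here) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes values into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden_dim, input_dim = 3, 2  # toy sizes (assumption)

# [h^{<t-1>}, X^{<t>}] as one already-concatenated vector
concat = rng.normal(size=hidden_dim + input_dim)

W_i = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
W_n = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_i = np.zeros(hidden_dim)
b_n = np.zeros(hidden_dim)

i_t = sigmoid(W_i @ concat + b_i)      # ignoring factor, between 0 and 1
c_tilde = np.tanh(W_n @ concat + b_n)  # candidate information, between -1 and 1

# learn_gate = c_tilde * i_t (element-wise)
learn_gate = c_tilde * i_t

print("ignoring factor:", i_t)
print("candidate:      ", c_tilde)
print("learn gate:     ", learn_gate)
```

The tanh term proposes new information, and the sigmoid term decides how much of each proposed entry is let through.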

### Remember Gate

This gate combines the two previous gates, the **forget gate** and the **learn gate**. Its output serves as the long-term memory, also called the cell state, at time step $t$, $c^{<t>}$. The computation is very simple: we just add the outputs of the forget gate and the learn gate, as follows.

$$
\begin{align}
c^{<t>} &= \text{forget_gate} + \text{learn_gate} \\
&= c^{<t-1>} \cdot f_t + \tilde{C}_t \cdot i_t
\end{align}
$$

<div align="center">
    <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F94884e35-2f15-49c2-8446-87bfcffd0221%2Fremember-gate.png?table=block&id=4e90725f-16bd-45ae-9c1b-f5b220792c67&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=1790&userId=&cache=v2" width=70%/>
</div>
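The effect of the sum is easiest to see with hand-picked numbers. In the sketch below, all values are made up, and the factors 0 and 1 are idealized extremes that a real sigmoid only approaches.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])   # c^{<t-1>}, the old long-term memory
f_t = np.array([1.0, 0.0, 0.5])       # forget factor: keep, drop, halve
c_tilde = np.array([0.8, 0.8, -0.5])  # candidate information
i_t = np.array([0.0, 1.0, 0.5])       # ignoring factor: ignore, accept, halve

forget_gate = c_prev * f_t   # what survives from the old memory
learn_gate = c_tilde * i_t   # what is admitted from the new input

# c^{<t>} = forget_gate + learn_gate
c_t = forget_gate + learn_gate
print(c_t)  # c_t equals [2.0, 0.8, 0.0]
```

The first entry keeps the old memory untouched, the second replaces it entirely with new information, and the third blends half of each.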

### Use Gate

This gate produces the output that is passed on to the next layer and that also serves as the new short-term memory, or hidden state, $h^{<t>}$. The equations for the use gate are as follows.

$$
\begin{align}
o_t &= \sigma(W_o[h^{<t-1>},\: X^{<t>}] + b_o) \\
h^{<t>} &= o_t \cdot \tanh(c^{<t>})
\end{align}
$$

<div align="center">
    <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F7d43ea1e-d77b-49c9-95fa-d04aac804aad%2Fuse-gate.png?table=block&id=ddfd2fda-35af-46ca-a068-bf308421caf2&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=1780&userId=&cache=v2" width=70%/>
</div>
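The use gate can be sketched the same way. The random weights `W_o`, the toy sizes, and the stand-in cell state `c_t` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes values into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden_dim, input_dim = 3, 2  # toy sizes (assumption)

h_prev = rng.normal(size=hidden_dim)     # h^{<t-1>}
x_t = rng.normal(size=input_dim)         # X^{<t>}
c_t = rng.normal(size=hidden_dim) * 3.0  # stand-in for the new cell state c^{<t>}

W_o = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_o = np.zeros(hidden_dim)

# o_t = sigmoid(W_o [h^{<t-1>}, X^{<t>}] + b_o)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

# h^{<t>} = o_t * tanh(c^{<t>}) (element-wise)
h_t = o_t * np.tanh(c_t)

print("cell state:  ", c_t)
print("hidden state:", h_t)
```

Even when the cell state holds large values, the tanh keeps every entry of the hidden state between -1 and 1.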

### Overview

<div align="center">
    <img src="https://sylabs.notion.site/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F4317adf2-9ac7-467d-9c3f-d829ebfdd68b%2Flstm-symbol.png?table=block&id=fc4eea6f-d311-4b74-9d8d-b96b91dd9bb9&spaceId=685593da-9b2b-4a94-b296-d52808c79757&width=1400&userId=&cache=v2" width=70%/>
</div>

<div align="center">
    <img src="https://s3.us-west-2.amazonaws.com/secure.notion-static.com/f6bd87e8-eefc-4805-8117-f933c9acc30c/animated-lstm.gif?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20221116%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20221116T040918Z&X-Amz-Expires=86400&X-Amz-Signature=09fa5a1660aa745c5f728cd696568f937c36b73e7885a5bb8a2996226c7317d2&X-Amz-SignedHeaders=host&x-id=GetObject" width=70%/>
</div>
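Putting the four gates together gives one full LSTM time step. The sketch below is a plain NumPy illustration under assumed toy sizes and random weights; the function `lstm_step` and the `params` dictionary are hypothetical names, not the internals of `nn.LSTM`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; squashes values into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step built from the forget, learn, remember, and use gates."""
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget factor
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # ignoring factor
    c_tilde = np.tanh(params["W_n"] @ concat + params["b_n"])  # candidate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output factor
    c_t = c_prev * f_t + c_tilde * i_t                         # remember gate
    h_t = o_t * np.tanh(c_t)                                   # use gate
    return h_t, c_t

rng = np.random.default_rng(3)
hidden_dim, input_dim, seq_len = 4, 3, 6  # toy sizes (assumption)

# one weight matrix and one bias per gate: f, i, n, o
params = {}
for name in "fino":
    params[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

# a random sequence standing in for the embedded words of one article
sequence = rng.normal(size=(seq_len, input_dim))

# both memories start at zero and are updated one step at a time
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)

print("final hidden state:", h)
print("final cell state:  ", c)
```

For a single-layer LSTM, the hidden state after the last step is the vector a classifier would read, which corresponds to taking the last sequence step of the LSTM outputs.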