<h1 style="text-align: center;"><b>Text generation</b></h1>
<h5 style="text-align: center;"><I>Generating poetry from a trained model</I></h5>


#### **1. Executive summary**

#### **2. Introduction**

Previously, I experimented with clustering and topic modeling on my poetry. I first used K-means clustering on vectorized text and found that it performed poorly: the data was limited and the model was too simple to compensate for it. Next, I used ELMo word embeddings and tried clustering the themes in the data with both a K-means model and a Latent Dirichlet Allocation (LDA) model. The LDA model performed much better, especially when used with an autoencoder to extract features.

The next step in this pipeline is to generate poetry with a text generation model. I will use a Long Short-Term Memory (LSTM) model, trained on the poetry corpus, to generate new poems that reflect the themes in the corpus.

#### **3. Importing data and packages**

#### **4. Preprocessing**

#### **5. Text generation models**

***Long Short-Term Memory (LSTM) model***

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture specifically designed to address the vanishing gradient problem in traditional RNNs. When an RNN is trained on long sequences, the gradients propagated back through many time steps tend to shrink toward zero (or, less often, explode), which makes it difficult for the network to learn long-range dependencies. LSTMs can learn and retain information over long sequences, which makes them suitable for sequence tasks such as language modeling, machine translation, and text generation.

The LSTM architecture introduces memory cells and additional gating mechanisms to control the flow of information within the network. The key components of an LSTM cell are:

- Input gate $(i_t)$
- Forget gate $(f_t)$
- Output gate $(o_t)$
- Cell state $(C_t)$
- Hidden state $(h_t)$


Mathematically, the LSTM cell's operations can be described as follows:

**Input gate $(i_t)$:**
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
The input gate decides how much of the new input $(x_t)$ and previous hidden state $(h_{t-1})$ should be used to update the cell state. Here $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors. The gate uses a sigmoid activation function $(\sigma)$ and learns the weights $(W_i)$ and biases $(b_i)$ during training.

**Forget gate $(f_t)$:**
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
The forget gate decides how much of the previous cell state $(C_{t-1})$ should be retained or forgotten. It also uses a sigmoid activation function and learns the weights $(W_f)$ and biases $(b_f)$ during training.


**Cell state $(C_t)$:**
$$C_t = f_t * C_{t-1} + i_t * \tanh(W_c[h_{t-1}, x_t] + b_c)$$
The cell state is the internal memory of the LSTM cell. It is updated from two terms: the previous cell state $(C_{t-1})$ scaled by the forget gate $(f_t)$, and a candidate cell state scaled by the input gate $(i_t)$. The candidate is computed from the previous hidden state $(h_{t-1})$ and the current input $(x_t)$ passed through a hyperbolic tangent (tanh) activation function. Here $*$ denotes element-wise multiplication. The weights $(W_c)$ and biases $(b_c)$ are learned during training.
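
The additive form of this update is what helps gradients survive over long sequences. As a simplified sketch, assuming the gate values are treated as constants with respect to $C_{t-1}$ (that is, ignoring their indirect dependence on it through $h_{t-1}$), the derivative of the cell state with respect to an earlier cell state $C_k$ reduces to a product of forget-gate activations:

$$\frac{\partial C_t}{\partial C_{t-1}} \approx \mathrm{diag}(f_t), \qquad \frac{\partial C_t}{\partial C_k} \approx \prod_{j=k+1}^{t} \mathrm{diag}(f_j)$$

When the forget gate stays close to 1, this product stays close to the identity, so the gradient is not repeatedly squashed. In a plain RNN, by contrast, every step multiplies the gradient by the recurrent weight matrix and an activation derivative.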


**Output gate $(o_t)$:**
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
The output gate controls how much of the cell state $(C_t)$ is exposed to the next layer or the next time step. It uses a sigmoid activation function and learns the weights $(W_o)$ and biases $(b_o)$ during training.


**Hidden state $(h_t)$:**
$$h_t = o_t * \tanh(C_t)$$
The hidden state is the output of the LSTM cell at each time step. It is the element-wise product of the output gate $(o_t)$ and the cell state $(C_t)$ after the cell state has been passed through a hyperbolic tangent $(\tanh)$ activation function.
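
To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM time step. It is an illustration under stated assumptions, not the model trained in this notebook: the `params` dictionary, the `init_params` helper, and the sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """Run one LSTM time step using the gate equations above.

    `params` is a hypothetical dict with W_i, W_f, W_c, W_o of shape
    (hidden, hidden + input) and b_i, b_f, b_c, b_o of shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Random small weights and zero biases (illustrative only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "ifco":
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


# Unroll the cell over a short random sequence of 5 steps.
params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.random.default_rng(1).normal(size=(5, 3)):
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

In practice a framework layer would handle batching, learned weights, and backpropagation; the sketch only shows how the five quantities depend on each other within one step.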

**5.1.1 Training the LSTM model**

**5.1.2 Testing the model**

***BERT***

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google. It is a bidirectional transformer-based model that uses a self-attention mechanism to learn contextual representations of words. The model is trained on a large corpus of text and can be fine-tuned on a variety of downstream tasks.

BERT is built upon the Transformer architecture, which was introduced by Vaswani et al. in their paper "Attention is All You Need." The key components of the BERT model are:

1. Multi-Head Self-Attention Mechanism
2. Position-wise Feed-Forward Networks
3. Layer Normalization
4. Positional Encoding
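
As an illustration of the first component, the following is a minimal NumPy sketch of scaled dot-product self-attention for a single head. The function names, the dimensions, and the random projection matrices are assumptions chosen for the example; a real BERT layer runs several such heads in parallel with learned weights and an extra output projection.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (seq_len, d_model); W_q, W_k, W_v have shape (d_model, d_k).
    Returns the attended values and the attention weight matrix.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # No causal mask is applied, so every token attends to tokens on both sides.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4            # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))    # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1 and describes how strongly one token draws on every other token in the sequence, which is where the bidirectional context comes from.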

BERT is a complex model, but at a high level its use can be summarized in the following pseudo-algorithm:

1. **Load pre-trained BERT model and tokenizer**.

2. **Tokenize input text**:<br>
    a. Apply WordPiece tokenization to the input text.<br>
    b. Add the special tokens [CLS] at the beginning and [SEP] at the end of the token sequence.<br>
    c. Convert tokens into token IDs using the tokenizer's vocabulary.<br>

3. **Create input features**:<br>
    a. Token IDs: The sequence of token IDs obtained from step 2.c.<br>
    b. Segment IDs: A binary mask to differentiate between different sentences (0 for the first sentence and 1 for the second sentence, if applicable).<br>
    c. Position IDs: A sequence of integers representing the position of each token in the input.<br>
    d. Attention Mask: A binary mask indicating the positions of non-padding tokens.<br>

4. **Forward pass through the BERT model**:<br>
    a. Embed token IDs, segment IDs, and position IDs.<br>
    b. Pass the embeddings through a stack of Transformer encoder layers.<br>
    c. For each layer:<br>
    * i. Apply multi-head self-attention mechanism.<br>
    * ii. Add the attention output to the layer input (residual connection) and apply layer normalization.<br>
    * iii. Pass the result through a position-wise feed-forward network.<br>
    * iv. Add the feed-forward output to its input (the result of step ii) and apply layer normalization.<br>
    
    d. Obtain the final output representations for each token.<br>

5. **Fine-tune BERT for a specific task (if necessary):**<br>
    a. Add a task-specific output layer on top of the BERT model.<br>
    b. Train the model using labeled data for the target task, updating the weights with backpropagation.<br>

6. **Perform inference for the target task:**<br>
    a. For classification tasks, use the representation of the [CLS] token and pass it through the task-specific output layer.<br>
    b. For token-level tasks, use the representations of the individual tokens and pass them through the task-specific output layer.<br>
    c. Apply the appropriate activation function and compute the final predictions.<br>
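
Steps 2 and 3 can be sketched in plain Python without downloading a model. This is a toy illustration under stated assumptions: `VOCAB` is a hypothetical eleven-entry vocabulary (a real BERT vocabulary has tens of thousands of entries), `wordpiece` and `encode_pair` are names made up for this example, and the real tokenizer also handles punctuation, casing rules, and truncation.

```python
# Hypothetical toy vocabulary, for illustration only.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]",
         "the", "moon", "sing", "##s", "soft", "##ly", "river"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def wordpiece(word):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode_pair(sentence_a, sentence_b, max_len=12):
    """Build token IDs, segment IDs, position IDs and the attention mask."""
    tokens_a = [p for w in sentence_a.lower().split() for p in wordpiece(w)]
    tokens_b = [p for w in sentence_b.lower().split() for p in wordpiece(w)]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len; truncation is not handled here")
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    attention_mask += [0] * padding  # 0 marks padding positions
    token_ids = [TOKEN_TO_ID[t] for t in tokens]
    position_ids = list(range(max_len))
    return tokens, token_ids, segment_ids, position_ids, attention_mask


features = encode_pair("the moon sings", "softly the river")
for name, values in zip(
    ["tokens", "token_ids", "segment_ids", "position_ids", "attention_mask"], features
):
    print(f"{name:15s} {values}")
```

The four integer lists are what the embedding layer in step 4.a consumes; in a real pipeline the pre-trained tokenizer loaded in step 1 would produce them.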

**5.2.1 Training the BERT model**

**5.2.2 Testing the model**

#### **6. Discussion**

Both the models show some ver