# Fundamentals of Text Generation

**Author:** [Murat Karakaya](https://www.linkedin.com/in/muratkarakaya/)<br>
**Date created:** 21 April 2021<br>
**Last modified:** 29 May 2021<br>
**Description:** This is an introductory ***tutorial*** on ***Text Generation in Deep Learning*** which is the first part of the "**Controllable Text Generation with Transformers**" series<br>
**Accessible on:**
* [YouTube in English](https://youtube.com/playlist?list=PLQflnv_s49v8Eo2idw9Ju5Qq3JTEF-OFW)
* [YouTube in Turkish](https://youtube.com/playlist?list=PLQflnv_s49v8-xeTLx1QmuE-YkRB4bToF)
* [Medium](https://medium.com/deep-learning-with-keras/controllable-text-generation-in-deep-learning-with-transformers-gpt3-using-tensorflow-keras-3d9e6bbe243b)
* [Github pages](https://kmkarakaya.github.io/Deep-Learning-Tutorials/)
* [Github Repo](https://github.com/kmkarakaya/Deep-Learning-Tutorials)
* [Google Colab](https://colab.research.google.com/drive/1JGgU3Zcpe7sitdPuI3WvCMp1kDna2UY5?usp=sharing)


# Controllable Text Generation with Transformers tutorial series

In this series, we will focus on developing a TensorFlow (TF) / Keras implementation of Controllable Text Generation from scratch.

**Part A:** Fundamentals of Controllable Text Generation:
  * **A1** A Review of Text Generation
  * **A2** An Introduction to Controllable Text Generation


**Part B:** A Tensorflow Data Pipeline for Word Level Controllable Text Generation

**Part C**: Sample Implementations of Controllable Text Generation with TensorFlow & Keras:

  * **C1** **Approach**: Input Update + **Language Model**: LSTM  
  * **C2** **Approach**: Input Update + **Language Model**: Encoder-Decoder 
  * **C3** **Approach**: Input Update + **Language Model**: Transformer (GPT3)  

[You can access all the parts from this link](https://medium.com/deep-learning-with-keras/controllable-text-generation-in-deep-learning-with-transformers-gpt3-using-tensorflow-keras-3d9e6bbe243b).



# Important
Before getting started, I **assume** that you have already **reviewed**:

*  the tutorial series "[Text Generation methods in Deep Learning with Tensorflow (TF) & Keras](https://medium.com/deep-learning-with-keras/text-generation-in-deep-learning-with-tensorflow-keras-e403aee375c1)" 
*  the tutorial series "[Sequence-to-Sequence Learning](https://medium.com/deep-learning-with-keras/part-a-introduction-to-seq2seq-learning-a-sample-solution-with-mlp-network-95dc0bcb9c83)"
*   the previous parts in [this series](https://medium.com/deep-learning-with-keras/controllable-text-generation-in-deep-learning-with-transformers-gpt3-using-tensorflow-keras-3d9e6bbe243b)

Please **ensure** that you have **completed** the above tutorial series so that you can ***easily follow*** the discussion below.

# References

**Language Models:**

- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, [**A neural probabilistic language model**](https://dl.acm.org/doi/10.5555/944919.944966)
 

- A. Radford, Karthik Narasimhan, [**Improving Language Understanding by Generative Pre-Training (GPT)**](https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035)
- A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, Ilya Sutskever, [**Language Models are Unsupervised Multitask Learners (GPT-2)**](https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe)
- Tom B. Brown et al., [**Language Models are Few-Shot Learners (GPT-3)**](https://arxiv.org/abs/2005.14165)

- Jay Alammar, [**The Illustrated GPT-2** (Visualizing Transformer Language Models)](http://jalammar.github.io/illustrated-gpt2/)

- Murat Karakaya, **Encoder-Decoder Structure in Seq2Seq Learning** Tutorials: on YouTube in [English](https://youtube.com/playlist?list=PLQflnv_s49v-4aH-xFcTykTpcyWSY4Tww) or [Turkish](https://youtube.com/playlist?list=PLQflnv_s49v97hDXtCo4mgje_SEiJ0_hH). You can also access these tutorials [on Medium here](https://medium.com/deep-learning-with-keras/sequence-to-sequence-learning-c8be6cd34848). 

- Sebastian Ruder, [Recent Advances in Language Model Fine-tuning](https://ruder.io/recent-advances-lm-fine-tuning/)

- Jackson Stokes, [A guide to language model sampling in AllenNLP](https://medium.com/ai2-blog/a-guide-to-language-model-sampling-in-allennlp-3b1239274bc3)
- Jason Brownlee, [How to Implement a Beam Search Decoder for Natural Language Processing](https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/)

 












**Text Generation:**
- Murat Karakaya, **Text Generation with different Deep Learning Models** Tutorials: on YouTube in [English](https://youtube.com/playlist?list=PLQflnv_s49v9QOres0xwKyu21Ai-Gi3Eu) or [Turkish](https://youtube.com/playlist?list=PLQflnv_s49v-oEYNgoqK5e4GyUbodfET3). You can also access these tutorials [on Medium here](https://medium.com/deep-learning-with-keras/text-generation-in-deep-learning-with-tensorflow-keras-e403aee375c1). 
- Apoorv Nandan, [Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
- Nicholas Renotte, [Generate Blog Posts with GPT2 & Hugging Face Transformers | AI Text Generation GPT2-Large](https://www.youtube.com/watch?v=cHymMt1SQn8)
- Mariya Yao, [Novel Methods For Text Generation Using Adversarial Learning & Autoencoders](https://www.topbots.com/ai-research-gan-vae-text-generation/)
- Guo, Jiaxian and Lu, Sidi and Cai, Han and Zhang, Weinan and Yu, Yong and Wang, Jun, [Long Text Generation via Adversarial Training with Leaked Information](https://github.com/CR-Gjx/LeakGAN)
- Patrick von Platen, [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate)
- Discussion Forum, [What is the difference between word-based and char-based text generation RNNs?](https://datascience.stackexchange.com/questions/13138/what-is-the-difference-between-word-based-and-char-based-text-generation-rnns)
- Papers with Code web page, [Text Generation](https://paperswithcode.com/task/text-generation)
- Ben Mann, [How to sample from language models](https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277)




 

**Controllable Text Generation:**
- Neil Yager, [Neural text generation: How to generate text using conditional language models](https://medium.com/phrasee/neural-text-generation-generating-text-using-conditional-language-models-a37b69c7cd4b)
- Alec Radford, Ilya Sutskever, Rafal Józefowicz, Jack Clark, Greg Brockman, [Unsupervised Sentiment Neuron](https://openai.com/blog/unsupervised-sentiment-neuron/)
- Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu, **Plug and Play Language Models: A Simple Approach to Controlled Text Generation**, [video](https://www.youtube.com/watch?app=desktop&v=q3Q_LTetx9o&feature=youtu.be), [code](https://github.com/uber-research/PPLM)
- Ivan Lai, [Conditional Text Generation by Fine Tuning GPT-2](https://towardsdatascience.com/conditional-text-generation-by-fine-tuning-gpt-2-11c1a9fc639d)
- Lilian Weng, [Controllable Neural Text Generation](https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html)
- Shrimai Prabhumoye, Alan W Black, Ruslan Salakhutdinov, **Exploring Controllable Text Generation Techniques**, [video](https://www.youtube.com/watch?v=khTYgGHLDqE), [paper](https://www.aclweb.org/anthology/2020.coling-main.1/)
- Muhammad Khalifa, Hady Elsahar, Marc Dymetman, **A Distributional Approach to Controlled Text Generation** [video](https://www.youtube.com/watch?v=RJ9TT81i338), [paper](https://arxiv.org/abs/2012.11635)
-  Abigail See, [Controlling text generation for a better chatbot](https://www.youtube.com/watch?v=uqPTpUyHZJE)
- Alvin Chan, Yew-Soon Ong, Bill Pung, Aston Zhang, Jie Fu, **CoCon: A Self-Supervised Approach for Controlled Text Generation**, [video](https://www.youtube.com/watch?app=desktop&v=f-DjgIMn02k&feature=youtu.be), [paper](https://arxiv.org/abs/2006.03535)


# PART A1: A Review of Text Generation

# What is text generation?

In the simplest form, you train a Deep Learning (DL) model to generate random but hopefully meaningful text.

Text generation is a subfield of **natural language processing (NLP)**. It leverages knowledge in computational linguistics and artificial intelligence to ***automatically generate natural language texts***, which can ***satisfy certain communicative requirements***.

You can visit [Write With Transformer](https://transformer.huggingface.co/) or [Talk to Transformer](https://app.inferkit.com/demo) websites to interact with several demos.

Here is [a quick demo](https://youtu.be/mLwdx5IxjwU):



<img src="https://raw.githubusercontent.com/kmkarakaya/Deep-Learning-Tutorials/master/images/ControllableTextGen_TalkToTransformer.gif">

# What is a prompt?
A prompt is the initial text input given to the trained model, which completes it by generating suitable text.

We expect the trained model to take the prompt into account properly so that the generated text is sensible.

In the above demo, the prompt we provided is "***I believe that one day, robots will***" and the trained model generates the following text:

<img src="https://raw.githubusercontent.com/kmkarakaya/Deep-Learning-Tutorials/master/images/ControllableTextGen_Prompt.png" width="500">

# What is a Corpus?

A corpus (plural ***corpora***) or text corpus is a **language resource** consisting of a ***large and structured set of texts***. 

For example, the [Large Movie Review corpus](https://ai.stanford.edu/~amaas/data/sentiment/index.html) consists of 25,000 highly polar movie reviews for training and 25,000 for testing, which can be used to train a Language Model for sentiment analysis. 


<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_Corpus.gif?raw=true" width="500">

# What is a Token?

In general, a **token** is ***a string of contiguous characters*** between two spaces, or between a space and punctuation marks. 


A **token** can also be any ***number*** (an integer, or real).

All other **symbols** are tokens themselves ***except*** apostrophes and quotation marks in a word (with no space), which in many cases symbolize acronyms or citations. 

A token can be
* a word 
* a character 
* a symbol
* a number
* a run of several contiguous items of the above kinds.

Actually, the programmer decides the size and meaning of the token in the NLP implementation. 

It is the **unit** (granularity) of the **text input** to the Language Model and of the model's **output** as well.

Mostly, we use 3 levels of tokenization in Deep Learning applications:
* **word** level
* **character** level
* character **n-gram** level
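
The three tokenization levels listed above can be illustrated with a minimal sketch in plain Python. The helper names (`word_tokens`, `char_tokens`, `char_ngram_tokens`) and the regular expression are assumptions made for this illustration; they are not part of the tutorial series' code.

```python
import re

def word_tokens(text):
    """Word level: words (apostrophes inside a word are kept) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def char_tokens(text):
    """Character level: every character, including spaces, is a token."""
    return list(text)

def char_ngram_tokens(text, n):
    """Character n-gram level: non-overlapping chunks of n characters (the last one may be shorter)."""
    return [text[i:i + n] for i in range(0, len(text), n)]

sample = "I want to cook."
print(word_tokens(sample))           # ['I', 'want', 'to', 'cook', '.']
print(char_tokens(sample)[:6])       # ['I', ' ', 'w', 'a', 'n', 't']
print(char_ngram_tokens(sample, 3))  # ['I w', 'ant', ' to', ' co', 'ok.']
```

The same sentence yields 5 tokens at the word level, 15 at the character level, and 5 at the 3-gram character level, which shows how the chosen unit changes the sequence the Language Model sees.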





# What is Text Tokenization?
Tokenization is a way of separating a piece of text into smaller units called tokens.

Basically, to train a language model, we prepare the training data as follows:
* we **collect**, **clean**, and **structure** the data
* this data is called **corpus**
* we decide 
  * the **token size** (***word, character, or n-gram***)
  * the **maximum number of tokens** in each sample
  * the **number of distinct tokens** in the ***dictionary*** (***vocabulary size***)
* we **tokenize** the corpus into **chunks of tokens** (***sequences***) considering the maximum size (length)

At the end of the **tokenization** process, we have
* **sequences** of tokens as samples (inputs or outputs for the LM)
* a **vocabulary** consisting of at most n of the most frequent tokens in the corpus
* an **index** list to represent each token in the dictionary 

After that, we can **convert** ***sequences of tokens*** to ***sequences of indices***.
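
As a rough sketch of the steps above, the following plain-Python example builds a size-limited vocabulary, maps tokens to indices, and cuts the result into fixed-length sequences. The function names, the `[UNK]` placeholder for out-of-vocabulary tokens, and the choice to drop the incomplete last chunk are assumptions for this illustration; a TensorFlow Data Pipeline would handle these steps differently in practice.

```python
from collections import Counter

def build_vocabulary(tokens, max_size, oov_token="[UNK]"):
    """Index 0 is reserved for unknown tokens; the rest are the most frequent tokens."""
    most_common = [tok for tok, _ in Counter(tokens).most_common(max_size - 1)]
    return [oov_token] + most_common

def to_indices(tokens, token_to_index):
    """Replace each token by its index; tokens outside the vocabulary map to 0."""
    return [token_to_index.get(tok, 0) for tok in tokens]

def chunk(indices, max_len):
    """Cut the index stream into non-overlapping sequences of exactly max_len items."""
    return [indices[i:i + max_len]
            for i in range(0, len(indices) - max_len + 1, max_len)]

corpus = "i want to cook pasta . i want to eat pasta . i want to sleep ."
tokens = corpus.split()                       # word level tokenization
vocabulary = build_vocabulary(tokens, max_size=6)
token_to_index = {tok: i for i, tok in enumerate(vocabulary)}
indices = to_indices(tokens, token_to_index)
sequences = chunk(indices, max_len=5)

print(vocabulary)
print(sequences)
```

Because the vocabulary is capped at 6 entries, rare tokens such as "cook" or "sleep" fall outside it and are all represented by index 0.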

All of the above steps together are called **tokenization**, and you can use a ***Tensorflow Data Pipeline*** to handle them in a structured way. For more information about tokenization and the Tensorflow Data Pipeline, see these ***Murat Karakaya Akademi tutorials***: 
* Tensorflow Data Pipeline Medium blogs: 
  * [Word level tokenization](https://medium.com/deep-learning-with-keras/build-an-efficient-tensorflow-input-pipeline-for-word-level-text-generation-2d224e02ae15)
  * [Character level tokenization](https://medium.com/deep-learning-with-keras/build-an-efficient-tensorflow-input-pipeline-for-char-level-text-generation-b369d6a68429)  
* Tensorflow Data Pipeline YouTube videos: 
  * [Turkish](https://youtube.com/playlist?list=PLQflnv_s49v8l8dYU01150vcoAn4sWSAm) 
  * [English](https://youtube.com/playlist?list=PLQflnv_s49v_m6KLMsORgs9hVIvDCwDAb).

# What is a Language Model?

A language model is at the **core** of many NLP tasks, and is simply a probability distribution over sequences of words.

In the current context, **the model trained to generate text** is mostly called a **Language Model** (LM).

In a broader context, a ***statistical language model*** is a probability distribution over sequences of tokens (i.e., words or characters). 

Given a prompt (***assume a partial statement***) of length $m$, that is, the tokens $w_{1}, ..., w_{m}$, a ***trained*** Language Model (LM) assigns a **conditional probability distribution** over the dictionary (vocabulary) tokens for the next token: $P(w_{m+1} \mid w_{1}, ..., w_{m})$. 

We can use the **conditional probability distribution** to select (***sample***) the **next token** to complete the given ***prompt***.

For example, when the prompt is "***I want to cook***", the ***trained*** language model can output the **probability** of each token in the dictionary ***to be the next token*** as below.

Then, according to the implemented **sampling** method, one can pick the next token considering this ***probability distribution***. 






<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_wordLevel.png?raw=true" width="800">
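
To make the idea of a conditional next-token distribution concrete, here is a minimal count-based sketch. It is a deliberate simplification: the toy model conditions only on the last token (a bigram model), whereas a neural LM conditions on the whole prompt. The function name `train_bigram_lm` and the tiny corpus are assumptions for this illustration.

```python
from collections import Counter, defaultdict

def train_bigram_lm(tokens):
    """Estimate P(next token | previous token) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(nxts.values()) for tok, c in nxts.items()}
        for prev, nxts in counts.items()
    }

corpus = "i want to cook pasta . i want to cook rice . i want to cook pasta ."
lm = train_bigram_lm(corpus.split())

# Distribution of the next token after the prompt "i want to cook"
# (only the last token, "cook", is used by this toy model).
print(lm["cook"])
```

In this corpus "pasta" follows "cook" twice and "rice" once, so the model assigns them probabilities of about 2/3 and 1/3.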

# How does a Language Model generate  text?
In general, we first **train** an LM and then use it to generate text (**inference**).

* In **training**, we first prepare the training data from the ***corpus***. Then, the LM **learns** the ***conditional probability distribution*** of the next token given a sequence (prompt) ***drawn*** from the ***corpus***.



* In **inference** (***text generation***) mode, an LM works in a **loop**:
  * We provide the initial text (**prompt**) to the LM. 
  * The LM **calculates** the **conditional probability** of each vocabulary token to be the ***next*** token. 
  * We **sample** the next token using this conditional probability distribution.
  * We **concatenate** this token to the prompt (seed) and provide the extended sequence to the LM as the new prompt.

<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_wordLevel_GenerationLoop.gif?raw=true" width="800">

# What is Word-based and Char-based Text Generation?

We can set the token size at the **word** level or **character** level.

In the ***above example***, the tokenization is done at **word level**. Thus, the input and output of the Language Model are composed of words.

Below, you see that we opt for **character-based** tokenization.

**Pay attention** to the above and below models' ***outputs*** and ***dictionaries (vocabularies)***. 








<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_charLevel.png?raw=true" width="500">

# Which Level of Tokenization (Word or Character based) should be used?

In general, 
* **character level** LMs can mimic grammatically correct sequences for a wide range of languages, but they require a bigger hidden layer and are computationally more expensive
* **word level** LMs train faster and generate more coherent texts, yet even these generated texts are far from making actual sense. 

Main advantages of character level over word level Text Generation models:
*  **Character level models** have a really **small vocabulary**. For example, the GBW dataset contains approximately **800 characters** compared to **800,000 words**. 
* In practice, this means that **Character level models** require **less memory** and have **faster inference** than their word counterparts. 
* **Character level models** **do not require tokenization** as a preprocessing step. 
* However, **Character level models** require a much **bigger hidden layer** to successfully model ***long-term dependencies***, which means higher computational costs.
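
One side of this trade-off can be seen even on a single sentence: the same text becomes a much longer sequence at the character level, so dependencies span more steps. The sketch below is only an illustration on a toy sentence; the vocabulary-size gap quoted above appears only on a large corpus, which is not reproduced here.

```python
text = "i want to cook pasta ."

word_sequence = text.split()   # word level tokens
char_sequence = list(text)     # character level tokens (spaces included)

print("word level:", len(word_sequence), "tokens")
print("char level:", len(char_sequence), "tokens")

# The two words "want" and "pasta" are 3 steps apart at the word level,
# but their first characters are 13 steps apart at the character level.
print(word_sequence.index("pasta") - word_sequence.index("want"))
print(text.index("pasta") - text.index("want"))
```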


In summary, you need to work on both to understand their advantages and disadvantages.

[More discussion is here](https://datascience.stackexchange.com/questions/13138/what-is-the-difference-between-word-based-and-char-based-text-generation-rnns)

# What is Sampling?
**Sampling** means randomly **picking** the next token according to its *conditional probability distribution*.
After generating a probability distribution over the vocabulary for the given input sequence, we need to carefully decide how to **select the next token** (***sample***) from this distribution. 

<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_wordLevel_GenerationLoop.gif?raw=true" width="400">


There are **several methods for sampling** in text generation such as:
* **Greedy Search (Maximization)** 
* **Temperature Sampling**
* **Top-K Sampling**
* **Top-P Sampling (Nucleus sampling)**
* **Beam Search**
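
The first four methods in the list above can be sketched with the Python standard library as follows. This is a minimal illustration on a hand-made distribution, not the TensorFlow / Keras code of the linked tutorials; the function names and the example probabilities are assumptions.

```python
import math
import random

def greedy(probs):
    """Greedy search: always pick the most probable token."""
    return max(probs, key=probs.get)

def apply_temperature(probs, temperature):
    """Temperature: T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = {tok: math.exp(math.log(p) / temperature)
              for tok, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {tok: v / total for tok, v in scaled.items()}

def top_k(probs, k):
    """Top-K: keep the k most probable tokens and renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def top_p(probs, p):
    """Top-P (nucleus): keep the smallest top set whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

def sample(probs, rng):
    """Draw one token at random according to the given distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"pasta": 0.5, "rice": 0.3, "soup": 0.15, "rocks": 0.05}
rng = random.Random(42)

print(greedy(probs))                                # pasta
print(sample(apply_temperature(probs, 0.7), rng))
print(sample(top_k(probs, 2), rng))                 # only "pasta" or "rice"
print(sample(top_p(probs, 0.9), rng))               # never "rocks"
```

Top-K and Top-P both cut off the unlikely tail before sampling; they differ in whether the cut is defined by a token count or by a probability mass.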

You can learn details of these **sampling** methods and **how to code them** with ***Tensorflow / Keras*** in these Murat Karakaya Akademi tutorials: 
* [Medium blog](https://medium.com/deep-learning-with-keras/sampling-in-text-generation-b2f4825e1dad)  
* Youtube videos in [Turkish](https://youtu.be/L3D6rNwqqpo) & [English](https://youtu.be/0RFQ6QOYL68).

Also, you can visit the blog by Patrick von Platen, [How to generate text: using different decoding methods for language generation with Transformers](https://huggingface.co/blog/how-to-generate).
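
Beam Search differs from the other methods in that it keeps several candidate sequences alive at each step instead of committing to one token. The following minimal sketch uses a hypothetical hand-written table (`TOY_LM`) in place of a trained model; the names and probabilities are assumptions for this illustration.

```python
import math

# Hypothetical stand-in for a trained LM: last token -> next-token distribution.
TOY_LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(start, steps, beam_width):
    """Return the beam_width best (sequence, log-probability) pairs after the given steps."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            dist = TOY_LM.get(seq[-1])
            if dist is None:               # finished sequence: carry it over unchanged
                candidates.append((seq, score))
                continue
            for tok, p in dist.items():    # expand with every possible next token
                candidates.append((seq + [tok], score + math.log(p)))
        # keep only the highest scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

best_seq, best_score = beam_search("<s>", steps=2, beam_width=2)[0]
print(best_seq, round(math.exp(best_score), 2))      # ['<s>', 'a', 'cat'] 0.36

greedy_seq, greedy_score = beam_search("<s>", steps=2, beam_width=1)[0]
print(greedy_seq, round(math.exp(greedy_score), 2))  # ['<s>', 'the', 'cat'] 0.3
```

With a beam width of 1 the procedure reduces to greedy search and commits to "the" because it is the most probable first token. With a beam width of 2 it also keeps "a" and finds the overall more probable sequence.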


# What kinds of Language Models exist in Artificial Neural Networks?

The most popular approaches to create a Language Model in Deep Learning are:
* Recurrent Neural Networks (LSTM or GRU)
* Encoder-Decoder Models
* Transformers
* Generative Adversarial Networks (GANs)

# Which Language Model to use?

The LMs mentioned above have their advantages and disadvantages.

In a very short and simple comparison:
* **Transformers** are the most recent models, but they require much more data to be trained with.

* **RNNs** cannot create coherent long sequences.

* **Encoder-Decoder** models enhanced with Attention Mechanism could perform better than RNNs but worse than Transformers.

* **GANs** are hard to train and may fail to converge.

As researchers or developers, we need to know how to apply all of these approaches to the text generation problem.

# Text Generation Types
Mainly, we can think of 2 types of text generation approaches:
* **Random Text Generation:** The LM is free to generate any text without being limited or directed by any specific rules or expectations. We only hope that the generated content is realistic, coherent, and understandable. 
* **Controllable Text Generation:** This is the task of generating natural sentences whose **attributes** can be controlled. For example, we can define some attributes of the text to be generated such as:
  * tense 
  * sentiment
  * structure
  * grammar
  * inclusion of some key terms/topics

For example, [in this work](https://arxiv.org/abs/1703.00955),  the authors train a LM such that it can **control** the **tense** (present or past) and **attitude** (positive or negative) of the generated text like below: 




<img src="https://github.com/kmkarakaya/Deep-Learning-Tutorials/blob/master/images/LanguageModel_controlExample.png?raw=true" width="800">
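
To give a feel for what "controlling an attribute" means, here is a toy sketch in which the next-token counts are conditioned on a sentiment label as well as on the previous token. This is not the method of the work cited above, and it is not the approach implemented later in this series; the labeled sentences, the function names, and the greedy generation are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_conditional_bigram_lm(labeled_sentences):
    """Count next tokens separately for each (attribute, previous token) pair."""
    counts = defaultdict(Counter)
    for attribute, sentence in labeled_sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[(attribute, prev)][nxt] += 1
    return counts

def generate(counts, attribute, prompt, max_new_tokens):
    """Greedy generation: always append the most frequent next token for the attribute."""
    sequence = prompt.split()
    for _ in range(max_new_tokens):
        options = counts.get((attribute, sequence[-1]))
        if not options:   # nothing observed after this token: stop
            break
        sequence.append(options.most_common(1)[0][0])
    return " ".join(sequence)

data = [
    ("positive", "the movie was great ."),
    ("positive", "the acting was great ."),
    ("negative", "the movie was awful ."),
    ("negative", "the acting was awful ."),
]
counts = train_conditional_bigram_lm(data)

print(generate(counts, "positive", "the movie", max_new_tokens=5))  # the movie was great .
print(generate(counts, "negative", "the movie", max_new_tokens=5))  # the movie was awful .
```

The same prompt leads to different continuations depending on the requested attribute, which is the essence of controllable generation.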

# Text Generation Summary

So far,  we have reviewed the important concepts and methods related to Text Generation in Deep Learning.

If you want to go deeper and see how to implement several Language Models (LSTM, Encoder-Decoder, Transformers, etc.) with Python / TensorFlow / Keras, you can refer to the following **Text Generation with different Deep Learning Models** tutorials provided by ***Murat Karakaya Akademi***:
* on YouTube in [English](https://youtube.com/playlist?list=PLQflnv_s49v9QOres0xwKyu21Ai-Gi3Eu) or [Turkish](https://youtube.com/playlist?list=PLQflnv_s49v-oEYNgoqK5e4GyUbodfET3)
* [on Medium](https://medium.com/deep-learning-with-keras/text-generation-in-deep-learning-with-tensorflow-keras-e403aee375c1)


Furthermore, you might want to check the above **references** for more details.

If you want to learn **Controllable Text Generation Fundamentals** and how to implement it with different **Deep Learning models** in ***Python, TensorFlow & Keras*** please **continue** with the **next parts**.













# Controllable Text Generation tutorial series

[You can access all the parts from this link](https://medium.com/deep-learning-with-keras/controllable-text-generation-in-deep-learning-with-transformers-gpt3-using-tensorflow-keras-3d9e6bbe243b).

# Comments or Questions?

Please **[share your Comments or Questions](https://www.youtube.com/post/UgyU_3vLcztTo6rz0E14AaABCQ)**.

Thank you in advance.

Do not forget to check out the next parts!

Take care!