
A full-stack data science project that utilizes techniques in artificial intelligence to automatically generate natural language text


Text Generation

The final model is served through a Flask app and deployed to Heroku.

Scientific Text Generator

Objective

The goal of this project is to generate scientific sentences when given a scientific phrase or sentence as a prompt.

Background

Text generation is an application of language modeling and a subfield of natural language processing. It uses artificial intelligence techniques to automatically generate natural language text that fits a given communication context.

Text generation can be used to write stories, poems, emails, news articles, and more. It is also useful for machine translation and chatbots.

Descriptions

Note: Due to limited computing power, only the abstracts of the articles are used to form the corpus.

flask-app

This folder contains all files needed to deploy the Flask app to Heroku.
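
A minimal sketch of how such a Flask app can be wired up (the route, form field, and the predict_next_words placeholder are illustrative, not the actual code in flask-app):

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

def predict_next_words(seed_text):
    """Placeholder for the trained language model; the real app would load
    the saved Keras model and return a generated continuation of seed_text."""
    return seed_text + " ..."

@app.route("/", methods=["GET", "POST"])
def index():
    generated = ""
    if request.method == "POST":
        generated = predict_next_words(request.form.get("seed_text", ""))
    # The real app uses an HTML template; an inline template keeps this sketch self-contained.
    return render_template_string(
        '<form method="post"><input name="seed_text"><button>Generate</button></form>'
        "<p>{{ generated }}</p>",
        generated=generated,
    )

if __name__ == "__main__":
    app.run()
```

On Heroku, an app like this is typically started by a Procfile entry such as `web: gunicorn app:app`.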

scraper.py

This script collects paper information from the NIPS proceedings website using web scraping.
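
A rough sketch of the scraping approach with requests and BeautifulSoup (the URLs and CSS selectors below are assumptions about the NIPS proceedings site, not necessarily what scraper.py does):

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://papers.nips.cc"

def get_paper_links(year=2019):
    """Collect links to individual paper pages for one proceedings year."""
    index = requests.get(f"{BASE_URL}/paper/{year}")
    soup = BeautifulSoup(index.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if "/paper/" in a["href"]]

def get_abstract(paper_path):
    """Fetch one paper page and return its abstract text, if found."""
    page = requests.get(BASE_URL + paper_path)
    soup = BeautifulSoup(page.text, "html.parser")
    abstract = soup.find("p", class_="abstract")  # assumed class name
    return abstract.get_text(strip=True) if abstract else ""

if __name__ == "__main__":
    abstracts = [get_abstract(link) for link in get_paper_links(2019)[:10]]
    print("\n\n".join(a for a in abstracts if a))
```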

text_preprocessing.ipynb

It is used for data cleaning and saves the cleaned text into a .txt file for model training.

Steps involved in text preprocessing (see the sketch after this list):

  • Words in British English are converted into American English.

Note: Although included in the data cleaning pipeline to reduce the vocabulary size, the following steps can be skipped for a large corpus when enough computing power is available.

  • URLs, equations, and citations are removed.
  • Punctuation (except periods and commas) and special characters are removed.
  • Hyphenated descriptions like “video-related” are converted into separate words, like “video related”.
  • All numbers are replaced with “NUMBER”.
  • All characters are converted to lowercase.
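
A condensed sketch of the cleaning rules listed above (the regex patterns and the small British-to-American map are illustrative; the notebook may implement them differently):

```python
import re

# Small sample of the British-to-American spelling map (illustrative only).
british_to_american = {"optimise": "optimize", "behaviour": "behavior", "modelling": "modeling"}

def clean_text(text):
    # Normalize British spellings to American ones.
    for brit, amer in british_to_american.items():
        text = re.sub(rf"\b{brit}\b", amer, text)
    # Remove URLs, LaTeX-style equations, and bracketed citations.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\$[^$]*\$", " ", text)
    text = re.sub(r"\[\d+(,\s*\d+)*\]", " ", text)
    # Split hyphenated descriptions like "video-related" into separate words.
    text = re.sub(r"(\w)-(\w)", r"\1 \2", text)
    # Replace all numbers with the NUMBER token.
    text = re.sub(r"\d+(\.\d+)?", "NUMBER", text)
    # Drop punctuation except periods and commas, lowercase, and collapse whitespace.
    text = re.sub(r"[^\w\s.,]", " ", text)
    return re.sub(r"\s+", " ", text).lower().strip()

print(clean_text("We optimise a video-related model on 1000 samples (see https://example.org)."))
```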

text_generation_word_train.ipynb

It is used for data preparation and modeling. It outputs a word-level language model for text generation.

A bidirectional LSTM is used to improve performance.
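
A minimal sketch of such a word-level bidirectional LSTM language model in Keras (vocabulary size, sequence length, and layer sizes are placeholders, not the tuned values from the notebook):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

VOCAB_SIZE = 10_000   # number of distinct words in the corpus (placeholder)
SEQ_LEN = 50          # length of each input word sequence (placeholder)

model = Sequential([
    Embedding(VOCAB_SIZE, 100, input_length=SEQ_LEN),
    Bidirectional(LSTM(128)),
    Dense(VOCAB_SIZE, activation="softmax"),  # probability of each word being next
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```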

A Python generator is used to save memory and address the scaling problem.
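
A sketch of the generator idea: batches are encoded on the fly, so the full one-hot training matrix never has to sit in memory (variable and function names are illustrative):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

def batch_generator(encoded_sequences, vocab_size, batch_size=128):
    """Yield (X, y) batches forever, as expected by Keras fit().

    `encoded_sequences` is a list of equal-length integer word sequences;
    the last word of each sequence is the prediction target.
    """
    while True:
        for start in range(0, len(encoded_sequences), batch_size):
            batch = np.array(encoded_sequences[start:start + batch_size])
            X, y = batch[:, :-1], batch[:, -1]
            # One-hot encode only the current batch of targets.
            yield X, to_categorical(y, num_classes=vocab_size)

# Hypothetical usage:
# model.fit(batch_generator(sequences, VOCAB_SIZE, batch_size=128),
#           steps_per_epoch=len(sequences) // 128, epochs=10)
```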

text_generation_word_generate.ipynb

It loads the pre-trained model created by text_generation_word_train.ipynb to generate text. A seed text sequence must be provided so that the model can continue generating from it.
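
A sketch of seeded generation with a Keras model: the pre-trained model repeatedly predicts the next word, which is appended to the growing sequence (the model file name and the tokenizer are assumptions about the training notebook's outputs):

```python
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate(model, tokenizer, seed_text, n_words=30, seq_len=50):
    """Append n_words predicted words to seed_text, one word at a time."""
    text = seed_text
    for _ in range(n_words):
        encoded = tokenizer.texts_to_sequences([text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_len, truncating="pre")
        next_id = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        text += " " + tokenizer.index_word.get(next_id, "")
    return text

# Hypothetical usage, assuming the training notebook saved these artifacts:
# model = load_model("word_lstm_model.h5")
# print(generate(model, tokenizer, "deep neural networks have"))
```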

gpt2_train.ipynb and gpt2_generate.ipynb

A GPT-2 model is used to evaluate the performance of the developed LSTM model.

gpt2_train.ipynb takes the cleaned text file from text_preprocessing.ipynb and fine-tunes a GPT-2 model on it. The fine-tuned model is then loaded by gpt2_generate.ipynb to generate text.

For more details on how to use a GPT-2 model, please refer to gpt-2-simple.
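
A sketch of the gpt-2-simple workflow, based on its documented API (the corpus file name, model size, and step count are placeholders):

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")        # fetch the small pre-trained GPT-2 weights

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="clean_abstracts.txt", # cleaned corpus from text_preprocessing.ipynb (name assumed)
              model_name="124M",
              steps=1000)                    # number of fine-tuning steps (placeholder)

gpt2.generate(sess, prefix="deep neural networks have", length=100)
```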

Future work

Several ways to further improve the model:

  • Add a spell checker to the text preprocessing step
  • Use full articles, rather than only abstracts, to form the corpus
  • Perform rigorous hyperparameter tuning
  • Add more LSTM layers
  • Try a seq2seq model

References:

[1] Andrej Karpathy, 2015, The Unreasonable Effectiveness of Recurrent Neural Networks

[2] Mike Schuster and Kuldip K. Paliwal, 1997, Bidirectional Recurrent Neural Networks

[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, 2019, Language Models are Unsupervised Multitask Learners
