SeqGAN Paraphrase Generation

A neural network model based on SeqGAN that generates paraphrases for a given sentence.

Details can be found in the final report.

Dataset

Quora Question Pairs (quora_duplicate_questions.tsv), downloaded from https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.

References

Usage

Preparation

1: Install dependencies
  1. Anaconda (default env name: seqgan; this can be changed in environment.yml)
    1. Create the virtual environment: $ conda env create -f environment.yml
    2. Activate the virtual environment: $ source activate <env-name>
  2. Other methods
    1. $ pip3 install -r requirements.txt
2: Set up nltk

The nltk.word_tokenize() method requires the punkt package, so download it once:

$ python3
>>> import nltk
>>> nltk.download('punkt')
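
Once punkt is downloaded, tokenization should work; a quick sanity check (not part of the project scripts):

>>> nltk.word_tokenize("How do I generate a paraphrase?")
['How', 'do', 'I', 'generate', 'a', 'paraphrase', '?']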
3: Install nlg-eval (for evaluation only)

Follow the instructions at https://github.com/Maluuba/nlg-eval:

$ git clone https://github.com/Maluuba/nlg-eval.git
$ cd nlg-eval
$ pip3 install -e .
$ nlg-eval --setup
4: Download pretrained word embeddings
  • Download pretrained word embeddings (e.g. GloVe from https://nlp.stanford.edu/projects/glove/)
  • Extract the plain-text files and put them under dataset/pretrained_word_embeddings/original/
5: Parse pretrained word embeddings
$ python3 tools/parse_emb.py <word-embedding-file> <output-vector-file> <output-info-file>
## Vector file: <output-dir>/<word-emb>_<emb-dim>.npy, which keeps a numpy array
## Info file: <output-dir>/<word-emb>_<emb-dim>_info.pkl, which keeps the word mapping information

* Currently only GloVe is tested; to substitute another embedding, make sure its file format is consistent with GloVe's and change the pretrained_emb parameter in src/utils/hyper_params.py accordingly.
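
For reference, a minimal sketch of what a parser like tools/parse_emb.py might do, assuming GloVe's plain-text format (one token per line followed by its space-separated vector values); the function and variable names below are illustrative, not the actual implementation:

# Illustrative sketch only: turn a GloVe-style text file into a numpy matrix
# (the .npy vector file) and a word-to-index mapping (the _info.pkl file).
import pickle
import sys

import numpy as np

def parse_glove(emb_path, vec_out, info_out):
    words, vectors = [], []
    with open(emb_path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            words.append(token)
            vectors.append(np.asarray(values, dtype=np.float32))
    np.save(vec_out, np.stack(vectors))                  # vector file (.npy)
    with open(info_out, "wb") as f:                      # info file (.pkl)
        pickle.dump({w: i for i, w in enumerate(words)}, f)

if __name__ == "__main__":
    parse_glove(sys.argv[1], sys.argv[2], sys.argv[3])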

Summary
$ conda env create -f environment.yml
$ source activate seqgan

$ python3
>>> import nltk
>>> nltk.download('punkt')

$ git clone https://github.com/Maluuba/nlg-eval.git
$ cd nlg-eval
$ pip3 install -e .
$ nlg-eval --setup

# Download pretrained word embeddings & extract

$ python3 tools/parse_emb.py dataset/pretrained_word_embeddings/original/glove.twitter.27B.50d.txt dataset/pretrained_word_embeddings/glove_50.npy dataset/pretrained_word_embeddings/glove_50_info.pkl

Train Model

$ python3 -m src.train
# Output model files: model/<dataset-info>/<model-params>/<gen/dis>.trc
# Output pretrained model files: model/<dataset-info>/<model-params>/pretrain/<pretrained-model-params>/<gen/dis>.trc
# Output log file: log/<timestamp>_<dataset-info>_<model-params>_<pretrained-model-params>.log
  • Hyperparameters: src/utils/hyper_params.py
    • Batch size, rollout number, training epochs/steps, etc. (see the illustrative sketch after this list)
  • Project parameters: src/utils/static_params.py
    • Debug mode (run fewer iterations), light mode (load a smaller dataset), save mode (whether to save the model)
    • Train/test/validation set sizes
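
As an illustration of what tuning means in practice, training is configured by editing constants in src/utils/hyper_params.py and rerunning; apart from ADV_TRAIN_ITERS (used in the next subsection), the names and values here are hypothetical placeholders, not the actual parameters:

# Hypothetical excerpt of src/utils/hyper_params.py; only ADV_TRAIN_ITERS is a
# name confirmed by this README, the other names and values are placeholders.
BATCH_SIZE = 64            # sequences per training batch
ROLLOUT_NUM = 16           # Monte Carlo rollouts per generated prefix
PRETRAIN_GEN_EPOCHS = 100  # generator pretraining epochs (MLE)
PRETRAIN_DIS_STEPS = 50    # discriminator pretraining steps
ADV_TRAIN_ITERS = 100      # adversarial training iterations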
Continue training

If you have already trained for some iterations and would like to pick up from that point, change the ADV_TRAIN_ITERS param in src/utils/hyper_params.py to the number of additional training iterations, leave all other parameters intact, and rerun. The pathbuilder detects an existing model trained with the same parameters, loads it, and resumes training (a sketch of the idea follows).
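
A minimal sketch of that detect-and-resume behavior, assuming the generator is stored as a PyTorch state dict named gen.trc under the parameter-derived directory; the function name and loading details are assumptions, not the actual src/utils/pathbuilder.py API:

# Illustrative sketch only: resume from an existing checkpoint if one exists
# at the path built from the current parameters, otherwise train from scratch.
import os
import torch

def maybe_resume(generator, model_dir):
    """Load model_dir/gen.trc into the generator if present; return whether we resumed."""
    ckpt = os.path.join(model_dir, "gen.trc")
    if os.path.exists(ckpt):
        generator.load_state_dict(torch.load(ckpt, map_location="cpu"))
        return True
    return False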

Evaluate Model

Paraphrases will be generated, and the BLEU-2 and METEOR evaluation metrics will be calculated. The model path below should be the directory path of the pretrained model and must end with a slash; the pathbuilder parses the model parameters from this path.

$ python3 -m tools.evaluate model/<model-params>/pretrain/<pretrained-model-params>/
# Output results to output/<model-params>/pretrain/<pretrained-model-params>/results.tsv
  • Options
    • Only evaluate pretrained model: --pretrained
    • Don't include evaluation metric scores: --no-score
      • Output file name: results_raw.tsv
    • Evaluate on training set instead of test set: --mode train

Output file format (tab-separated):

original (cond)             | sample (pos)                    | generated (neg)        | BLEU                  | METEOR
This is the first sentence. | This sentence is the first one. | This one is generated. | 1.005427487071676e-08 | 0.1839080459770115
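
For reference, a single row like the one above could also be scored directly from Python through nlg-eval's API; this is a hedged sketch of that library's usage, not the project's tools/evaluate.py:

# Illustrative sketch: score one generated paraphrase against its references
# with nlg-eval (embedding-based metrics disabled to keep setup light).
from nlgeval import NLGEval

evaluator = NLGEval(no_skipthoughts=True, no_glove=True)
scores = evaluator.compute_individual_metrics(
    ["This is the first sentence.", "This sentence is the first one."],  # references
    "This one is generated.",                                            # hypothesis
)
print(scores["Bleu_2"], scores["METEOR"])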

Other Tools

  • Visualize log file with loss & accuracy information
    • $ python3 tools/visualize_log.py <path-to-log-file> plots/
    • Output *_acc.png and *_loss.png
  • Interactive paraphrase generation
    • $ python3 -m tools.paraphrase_generation <path-to-pretrained-model-dir>
  • Find paraphrases with top-n scores (a minimal sketch of the idea follows this list)
    • $ python3 tools/find_good_ex.py <path-to-result-file> <output-filename> -n <n> --metric <metric-to-compare>
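
A minimal sketch of that top-n selection under the results.tsv layout shown above; illustrative code, not the actual tools/find_good_ex.py (the example file path is a placeholder):

# Illustrative sketch: keep the n rows with the highest score for one metric,
# assuming a tab-separated results file with BLEU and METEOR columns.
import csv

def top_n(result_file, n=10, metric="METEOR"):
    with open(result_file, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return sorted(rows, key=lambda r: float(r[metric]), reverse=True)[:n]

for row in top_n("results.tsv", n=5, metric="BLEU"):
    print(row)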

Project structure

.
├── dataset
│   ├── pretrained_word_embeddings    # not included by git, download word embedding dataset and see Usage to create parsed files
│   │   ├── glove_50_info.pkl         # word mapping information
│   │   ├── glove_50.npy              # pretrained word vectors
│   │   └── original                  # raw vectors, downloaded from https://nlp.stanford.edu/projects/glove/
│   │       ├── glove.twitter.27B.50d.txt
│   │       └── ...
│   └── quora_duplicate_questions.tsv # downloaded from https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
├── environment.yml                   # conda env details
├── log                               # storing logs named by timestamp and hyperparameters
├── model                             # storing trained models
├── plots                             # plots of log files, including accuracy & loss trends
├── README.md
├── requirements.txt                  # dependency details
├── src
│   ├── discriminator.py              # discriminator model
│   ├── generator.py                  # generator model
│   ├── __init__.py
│   ├── train.py                      # training script
│   └── utils
│       ├── dataloader.py             # load dataset, parse into positive samples & condition pairs
│       ├── helpers.py                # helper methods
│       ├── hyper_params.py           # hyperparameters e.g. hidden dimension, training epoch, etc.
│       ├── pathbuilder.py            # build model save paths given model parameters; auto-detect trained models and resume training
│       ├── static_params.py          # static project settings e.g. debug mode, dataset path, etc.
│       └── word_embeddings.py        # word embedding helper class, load or initialize word embeddings
└── tools
    ├── __init__.py
    ├── evaluate.py                   # evaluation script, output generated paraphrases and evaluation scores
    ├── find_good_ex.py               # find generated samples with the best n scores
    ├── paraphrase_generation.py      # interactive script, generate paraphrase given a sentence
    ├── parse_emb.py                  # parse word embedding vectors and mappings from raw file
    └── visualize_log.py              # visualize loss, acc, etc. information in a given log file
