
A one-stop-shop for augmenting text data in Python using recent (and not-so-recent-but-effective) techniques

samhavens/NLP-data-augmentation

NLP data augmentation

This repository started as code snippets created as part of Konstantin Hemker's Master's thesis on NLP at Imperial College London. It currently contains two similar techniques, both based on replacing words with nearby neighbours in word2vec space.

This fork plans on adding many more techniques.

Installation

Python requirements

Python version: 3

Only tested on 3.6.

Dependencies

Uses Pipenv. Install deps with pipenv install. Run commands like pipenv run python augment.py, or launch a shell in the automatically created venv with pipenv shell.

Other dependencies

Most methods require pre-trained word vectors. As these files are relatively large, they are omitted from the git repo. To download them, run the shell script ./pretrained_vectors.sh

Models are also downloaded as needed using gensim.downloader; see the Gensim documentation for details. Currently the models it downloads are hardcoded, but this should probably be made configurable.
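As a hedged sketch of how the corpus names used later in this README could map onto gensim.downloader models (the specific model names chosen here are assumptions, not necessarily the ones this repo hardcodes):

```python
# Assumed mapping from this repo's corpus arguments to gensim-data
# model names; the exact choices are illustrative.
GENSIM_MODELS = {
    'google': 'word2vec-google-news-300',
    'glove': 'glove-wiki-gigaword-300',
    'fasttext': 'fasttext-wiki-news-subwords-300',
}

def load_vectors(corpus):
    """Load pre-trained word vectors for one of the supported corpora."""
    if corpus not in GENSIM_MODELS:
        raise ValueError(f"unknown corpus: {corpus!r}")
    # Lazy import; gensim downloads the model on first use, then caches it.
    import gensim.downloader as api
    return api.load(GENSIM_MODELS[corpus])
```

The lookup is validated before the gensim import so an invalid corpus name fails fast.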

Methods

Threshold

Loads a word embedding pre-trained on one of the supported text corpora. Each word in the sentence is replaced with its nearest neighbour by cosine similarity, provided that similarity exceeds a threshold given as an argument.
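A minimal sketch of the threshold idea, not the repo's actual implementation. The embedding is abstracted as a get_neighbor callable (word -> (neighbour, cosine similarity)) so the logic is independent of any particular model; with gensim you might pass something like lambda w: kv.most_similar(w, topn=1)[0].

```python
def threshold_augment(sentence, get_neighbor, threshold=0.75):
    """Replace each word with its nearest neighbour when the cosine
    similarity exceeds `threshold`; otherwise keep the original word."""
    out = []
    for word in sentence.split():
        try:
            neighbor, similarity = get_neighbor(word)
        except KeyError:
            # Out-of-vocabulary words are kept unchanged.
            out.append(word)
            continue
        out.append(neighbor if similarity > threshold else word)
    return " ".join(out)
```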

POS-tag

Replaces every word in the sentence carrying a given POS tag (passed as an argument) with its most similar word vector from a large pre-trained word embedding.
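A sketch of the POS-tag variant, again with the tagger and the embedding injected as callables (with NLTK the tagger could be nltk.pos_tag; the function names and structure here are illustrative, not the repo's code):

```python
def postag_augment(sentence, tagger, get_neighbor, valid_tags=("NN",)):
    """Replace every word whose POS tag is in `valid_tags` with its
    most similar word vector; leave all other words unchanged."""
    out = []
    for word, tag in tagger(sentence.split()):
        if tag in valid_tags:
            try:
                out.append(get_neighbor(word)[0])
                continue
            except KeyError:
                pass  # OOV word: fall through and keep the original
        out.append(word)
    return " ".join(out)
```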

Generative

The idea is to train a two-layer LSTM network to learn the word representations of a given class. The network then generates new samples of that class by initialising with a random start word and following the LSTM's predictions of the next word given the preceding sequence.

However, this hasn't been implemented yet.
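Since the method is unimplemented, here is only a hedged sketch of the generation loop described above, with the trained LSTM abstracted as a predict_next callable (sequence-so-far -> next word, or None to stop). Everything here is illustrative:

```python
import random

def generate_sample(predict_next, vocabulary, max_len=20, seed=None):
    """Start from a random word, then repeatedly ask the model for the
    next word given the sequence generated so far."""
    rng = random.Random(seed)
    words = [rng.choice(sorted(vocabulary))]
    for _ in range(max_len - 1):
        nxt = predict_next(words)
        if nxt is None:  # model signals end of sequence
            break
        words.append(nxt)
    return " ".join(words)
```

The actual model training (the two-layer LSTM itself) is out of scope for this sketch.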

Input

Takes in a CSV file with mutually exclusive, numerical labels and text input. The arguments for the Augment object are as follows:

Augment(method, source_path, target_path, corpus_='none', valid_tags=['NN'], threshold=0.75, x_col='tweet', y_col='class')

  • method: Which of the three augmentation methods should be used (valid args: 'threshold', 'postag', 'generate')
  • source_path: Path of the input csv file (type: string)
  • target_path: Path of the output csv file (type: string)
  • corpus_: Text corpus of pre-trained word embeddings that should be used (valid args: 'google', 'glove', 'fasttext')
  • valid_tags: POS-tags of words that should be replaced in the POS-tag based method (type: list of strings)
  • threshold: Threshold hyperparameter when threshold-based augmentation is used (type: float)
  • x_col: Column name of the samples in input CSV file (type: string)
  • y_col: Column name of the labels in input CSV file (type: string)
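For illustration, with the default x_col and y_col an input CSV would be shaped like this (the rows are made up):

```
tweet,class
"what a great day",0
"worst service ever",1
```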

Fork Plan

The idea is to collect many NLP data augmentation techniques in one place; these are enumerated in the issues. Improving developer experience is also important: text data augmentation is not widely adopted, and making it easy to do will help speed adoption.