In this project, we use state-of-the-art pretrained language models such as BERT and GPT-2 to denoise corrupted text within a sequential denoising autoencoder architecture and training objective. What does that mean? For example, in this sentence:
"Where is the the teacher?"
you can easily tell that the intended message is "Where is the teacher?". You unconsciously delete the repeated "the" from the sentence with minimal, if any, mental effort.
To approach this task, I train an autoencoder: a network that recreates the sentence I give it as input. Since I feed the autoencoder words one by one, it is a sequential autoencoder. Finally, I give the autoencoder noisy input, i.e. corrupted text where a word has been deleted or inserted, or two words have been swapped, and ask it to give me back the original, uncorrupted text. This means that the autoencoder denoises.
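To make the corruption concrete, here is a minimal sketch of the three noise operations (deletion, insertion of a duplicate, and swapping of adjacent words). This is illustrative only, not the project's actual preprocessing code; the function name `corrupt` and the noise probability `p` are assumptions for the example.

```python
import random

def corrupt(tokens, p=0.1, rng=random):
    """Return a noisy copy of a token list by randomly deleting,
    duplicating, or swapping words, each with probability p."""
    noisy = list(tokens)

    # Deletion: drop a random word.
    if len(noisy) > 1 and rng.random() < p:
        del noisy[rng.randrange(len(noisy))]

    # Insertion: duplicate a random word next to itself.
    if noisy and rng.random() < p:
        i = rng.randrange(len(noisy))
        noisy.insert(i, noisy[i])

    # Swap: exchange two adjacent words.
    if len(noisy) > 1 and rng.random() < p:
        i = rng.randrange(len(noisy) - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]

    return noisy

print(corrupt("where is the teacher ?".split()))
# e.g. ['where', 'is', 'the', 'the', 'teacher', '?']
```

The denoising autoencoder is trained to map the corrupted token sequence back to the original one.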
How do powerful language models such as BERT encode noisy sentences? Can these language models help us denoise sentences at scale? What does this process reveal about the model's knowledge of language in general?
preprocess.sh: This script introduces noise into a corpus you provide and creates a vocabulary file. Afterwards, you can run the autoencoder (either train your own or use a provided model) on the preprocessed corpus.
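Conceptually, the preprocessing step does something like the sketch below: it writes a corrupted copy of every sentence and collects a word-frequency vocabulary. The `preprocess` function, the file paths, and the vocabulary format are assumptions made for illustration (the script's actual interface may differ), and `corrupt` is the hypothetical helper sketched above.

```python
from collections import Counter

def preprocess(corpus_path, noisy_path, vocab_path, p=0.1):
    """Illustrative version of the preprocessing step: write a corrupted
    copy of each sentence and build a vocabulary file."""
    vocab = Counter()
    with open(corpus_path) as src, open(noisy_path, "w") as dst:
        for line in src:
            tokens = line.split()
            vocab.update(tokens)
            # Reuses the corrupt() sketch shown earlier.
            dst.write(" ".join(corrupt(tokens, p)) + "\n")

    # One token per line, most frequent first.
    with open(vocab_path, "w") as out:
        for token, _ in vocab.most_common():
            out.write(token + "\n")
```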