Introduction

This repository contains the code used to produce the results in the NAACL 2021 paper "Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification."

Information

Here is a list of the directories and what they contain:

preprocess - The main code folder. It can be installed as a Python package by running 'python3 setup.py install'.

scripts - Folder with Bash, Slurm, and Python scripts that perform the different steps of the experiment pipeline and run those steps through a Slurm scheduler.

test - Folder with unit tests that verify the code in preprocess runs correctly. You can run them with pytest.

utilities - Folder for miscellaneous files needed at run time; it currently holds only an English stopword list.

Setup

  1. Ensure the two datasets are available in the $HOME/.preprocess/downloads folder under the names 'apnews' and 'amazon' (a quick sanity check is sketched after this list).

  2. Install the dependencies listed in requirements.txt using pip: pip install -r requirements.txt
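Before running anything, it can help to confirm the datasets are where the code expects them. The snippet below is only a sanity-check sketch; whether each corpus is a single file or a directory under downloads is an assumption, so adjust it to match how you stored the data.

    from pathlib import Path

    # Quick check that the expected datasets are in place before running the pipeline.
    # Whether each corpus is a file or a directory under downloads/ is an assumption.
    DOWNLOADS = Path.home() / ".preprocess" / "downloads"

    for name in ("apnews", "amazon"):
        path = DOWNLOADS / name
        status = "found" if path.exists() else "MISSING"
        print(f"{status}: {path}")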

Experiments

I split the experiment into parts because of time constraints on the supercomputer I used to gather these results.

The parts are as follows:

  1. Preprocess - Preprocesses the raw text corpora according to the given specifications. This step is separated out so it can be parallelized, since the Amazon corpus is very large and a supercomputer made that convenient.

  2. Combine - Combines each of the parts of the corpus generated by the preprocess step into one corpus.

  3. Import - Converts the preprocessed corpus from a text file into a pickled format usable by the code.

  4. Vocabulary - Creates a pickled vocabulary from the raw corpus text for use in analysis.

  5. Analysis - Performs text classification according to user specification.

Here are some example commands showing how to use these parts together:

1. python3 preprocess_corpus.py --corpus amazon --methods lc,np,nr

This will begin preprocessing the Amazon corpus using lowercasing (lc), punctuation removal (np), and number removal (nr). Note that if you want to use rare-word filtering, you must first have built the vocabulary for it.
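For intuition, here is a rough sketch of what the lc, np, and nr methods do to a single document. This is only an illustration; the actual implementation in the preprocess package handles tokenization and the ordering of steps itself.

    import re
    import string

    # Illustrative approximations of the lc/np/nr methods; the preprocess package
    # is the authoritative implementation.
    def lowercase(text):           # lc
        return text.lower()

    def remove_punctuation(text):  # np
        return text.translate(str.maketrans("", "", string.punctuation))

    def remove_numbers(text):      # nr
        return re.sub(r"\d+", "", text)

    doc = "Item 42: GREAT value!"
    for step in (lowercase, remove_punctuation, remove_numbers):
        doc = step(doc)
    print(doc)  # "item  great value"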

2. python3 $PREPROCESS_DIR/preprocess/combiner.py --corpus amazon

This will combine the parts of the Amazon corpus that were split up during preprocessing.
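Conceptually, the combine step just concatenates the part files produced by preprocessing into a single corpus. A minimal sketch of that idea is below; the part-file location and naming pattern are hypothetical, and combiner.py handles the real layout itself.

    from pathlib import Path

    # Sketch of the combine idea: concatenate preprocessed part files into one corpus.
    # The directory and "part_*.txt" pattern are assumptions, not the repository's layout.
    parts_dir = Path.home() / ".preprocess" / "amazon_parts"
    combined = Path.home() / ".preprocess" / "amazon_combined.txt"

    with combined.open("w", encoding="utf-8") as out:
        for part in sorted(parts_dir.glob("part_*.txt")):
            out.write(part.read_text(encoding="utf-8"))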

3. python3 create_vocabulary.py --corpus amazon --methods lc,np,nr --seed 0

This will create a vocabulary based on the corpus that we preprocessed in the previous two steps. Keep the seed consistent across steps (it defaults to 0 if not specified).
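In spirit, the vocabulary step counts the tokens that appear in the preprocessed corpus and pickles the result. The sketch below assumes a whitespace-tokenized corpus and hypothetical file names; create_vocabulary.py is the authoritative version.

    import pickle
    from collections import Counter

    # Count token frequencies in a (hypothetical) combined corpus file and pickle them.
    vocab = Counter()
    with open("amazon_combined.txt", encoding="utf-8") as corpus:
        for line in corpus:
            vocab.update(line.split())

    with open("amazon_vocab.pkl", "wb") as out:
        pickle.dump(vocab, out)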

4. python3 import_corpus.py --corpus amazon --methods lc,np,nr --train 100000 --seed 0

This will create a pickled corpus, usable for analysis, with a training set of 100,000 documents. Again, make sure the methods and seed stay consistent across the steps of the experiment.
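The reason the seed has to match is that the training subset is drawn pseudo-randomly, so the same seed reproduces the same sample in every step. A toy illustration of that idea (not the repository's actual sampling code):

    import random

    # Drawing a training subset with a fixed seed gives the same sample every run,
    # which is why --seed must be consistent across pipeline steps.
    def sample_training_docs(docs, train_size, seed=0):
        rng = random.Random(seed)
        return rng.sample(docs, train_size)

    docs = [f"doc_{i}" for i in range(10)]
    print(sample_training_docs(docs, 3, seed=0))  # identical output every run with seed 0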

5. python3 run_analysis.py --corpus amazon --methods lc,np,nr --model nb --train 100000 --seed 0

This final step will run naive Bayes (nb) on the corpus that we imported in step 4.
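To give a sense of what this step does with --model nb, here is a toy naive Bayes text classification using scikit-learn. run_analysis.py works from the pickled corpus with its own feature pipeline, so treat the data and pipeline below as assumptions.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy example of bag-of-words naive Bayes classification.
    train_texts = ["great product fast shipping", "terrible broke after one day"]
    train_labels = [1, 0]  # 1 = positive review, 0 = negative review

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    print(model.predict(["fast shipping great"]))  # -> [1]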

All results can be found in the $HOME/.preprocess folder.
