compuframes

Required libraries/external files in `lib` folder

commons-cli-1.2.jar
gson-2.2.4.jar
segan.jar
commons-compress-1.4.1.jar
mulan.jar
stopwords.txt
commons-math3-3.1-SNAPSHOT.jar
opennlp-maxent-3.0.1-incubating.jar
weka-3.7.6.jar
en-sent.bin
opennlp-tools-1.5.1-incubating.jar
en-token.bin
opennlp-uima-1.5.1-incubating.jar

Compile

To compile: ant compile
To build jar file: ant jar
To make a clean build: ant clean-build

Process data

To perform preprocessing of the text data on the whole dataset

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*' experiment.MultilabelExperiment --json-file <json-file> -v -d --data-folder <processed-data-folder> --format-folder <format-folder> --run-mode preprocess -s -l --u 5 --b 10 --bs 5 --V 10000 --doc-type <document-type>

<json-file>: path to the JSON file
<processed-data-folder>: path to the folder containing processed data
<format-folder>: subfolder of <processed-data-folder> to store one specific instance of processed data (e.g., with a specific set of preprocessing parameter of the same dataset)
<document-type>: the type of documents to consider. This can be either "tobacco" or "immigration". If this is not set, all documents are considered.
-s: whether stopwords are removed (stopwords are stored in lib/stopwords.txt)
-l: whether stemming is performed
--u: the minimum raw count of unigrams
--b: the minimum raw count of bigrams
--bs: the minimum chi-square score for a bigram to be considered
--V: the maximum vocab size For other arguments, use -help.

Working example:

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*' experiment.MultilabelExperiment --json-file data/rounds_4-9_singly_coded.json -v -d --data-folder data --format-folder format --run-mode preprocess -s -l --u 5 --b 10 --bs 5 -s -l --V 10000

To create cross-validated data

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*' experiment.MultilabelExperiment --json-file <json-file> -v -d --run-mode create-cv -s -l --u 5 --b 10 --bs 5 --V 10000 --num-folds <number-of-folds> --tr2dev-ratio <train-to-dev-ratio> --cv-folder <cross-validation-folder>

<number-of-folds>: number of cross validation folds
<train-to-dev-ratio>: the ratio between the number of training instances and the number of development instances
<cross-validation-folder>: folder to store the cross validated data

This will creates <number-of-folds> sets of train/dev/test. It will perform preprocessing (like in section 1 above) on the training set and use the vocabulary extracted from the training data to perform preprocessing on the development and test sets.

Working example:

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*' experiment.MultilabelExperiment --json-file data/rounds_4-9_singly_coded.json -v -d --run-mode create-cv -s -l --u 5 --b 10 --bs 5 -s -l --V 10000 --num-folds 5 --tr2dev-ratio 0.8 --cv-folder data/cv

Run document-level multi-label model

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*:../segan/dist/lib/*:../segan/dist/*' experiment.MultilabelExperiment -v -d --dataset compuframes --run-mode run --num-folds <number-of-folds> --cv-folder <cross-validation-folder> --model <model-name> --expt-folder <experiment-folder> --fold <fold-number>

<model-name>: the name of a supported model
<experiment-folder>: folder to store the learned model and prediction results
<fold-number>: the fold to be run. If this is not specified, all folds in the cross-validation folder will be run.

Working example: running random baseline

java -Xmx10000M -Xms10000M -cp 'dist/compuframes.jar:lib/*:../segan/dist/lib/*:../segan/dist/*' experiment.MultilabelExperiment -v -d --dataset compuframes --run-mode run --num-folds 5 --cv-folder data/cv --model random --expt-folder experiments --fold 0

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
lib		lib
src		src
README.md		README.md
build.xml		build.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

compuframes

Required libraries/external files in `lib` folder

Compile

Process data

To perform preprocessing of the text data on the whole dataset

To create cross-validated data

Run document-level multi-label model

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

compuframes

Required libraries/external files in lib folder

Compile

Process data

To perform preprocessing of the text data on the whole dataset

To create cross-validated data

Run document-level multi-label model

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Required libraries/external files in `lib` folder

Packages