Refer to run.sh for sample usage.
-DataDir DIR
Use all txt files in DIR as data. The data files are organized such that
all instances corresponding to one class are listed in a single file
named with its label. Each line in the text file is treated as a single
data instance. All the test data may be put together into a single
file with the same formatting if the labels are unknown. If the labels
are known, then use the same filename as you would for training data,
but prefix it with "test_". So for example, all training data for the class
"action_thriller" will go into "action_thriller.txt" and test data of type
"action_thriller" go into "test_action_thriller.txt".
If test labels are known, jrae will print out the final accuracy figures
when the test completes.
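
The naming convention above can be sketched in Python as follows. This is only an illustration of the layout, not jrae's own loader: the function name split_data_files is hypothetical, and the sketch assumes every labelled test file carries the "test_" prefix (an unlabelled test file would be treated as training data here).

```python
from pathlib import Path

def split_data_files(data_dir):
    """Group the .txt files in data_dir into training and test files by label.

    Hypothetical helper: the label is taken to be the filename without
    ".txt", and a "test_" prefix marks labelled test data.
    """
    train, test = {}, {}
    for path in sorted(Path(data_dir).glob("*.txt")):
        stem = path.stem  # filename without the ".txt" extension
        if stem.startswith("test_"):
            test[stem[len("test_"):]] = path
        else:
            train[stem] = path
    return train, test
```

For a directory holding "action_thriller.txt" and "test_action_thriller.txt", both dictionaries would contain the single key "action_thriller".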
-minCount MINCOUNT
Default value is 5. Only words which occur at least MINCOUNT times
are included in the vocabulary.
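
A minimal sketch of this kind of frequency cutoff is given below. It assumes whitespace tokenization and a hypothetical function name build_vocabulary; jrae's actual tokenizer and its handling of out-of-vocabulary words are not described here.

```python
from collections import Counter

def build_vocabulary(lines, min_count=5):
    """Return the set of words that occur at least min_count times.

    Assumption: words are separated by whitespace.
    """
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, count in counts.items() if count >= min_count}
```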
-TrainModel TRUE|FALSE
Indicates whether to train an RAE based on the data provided by the
-DataDir option.
-ModelFile FILE
This parameter is compulsory. FILE indicates the file into which the
trained RAE model is saved if TRAINMODEL is set to TRUE. If not, it
indicates the model file to be loaded to perform testing or feature
extraction.
-ClassifierFile FILE
This parameter is compulsory. FILE indicates the file into which the
trained classifier is saved if TRAINMODEL is set to TRUE. If not, it
indicates the classifier file to be loaded to perform testing.
-FeaturesOutputFile FILE
This parameter is used only if not training. When not training, one of
FeaturesOutputFile or ProbabilitiesOutputFile is compulsory. The system
extracts features for the test data using the Model and saves them in FILE.
NOTE :: FILE should either not be a txt file or it should not point to the
same directory as the -DataDir option. Otherwise, the next time jrae
processes the directory, it will read in the features as training data.
-ProbabilitiesOutputFile FILE
This parameter is used only if not training. When not training, one of
FeaturesOutputFile or ProbabilitiesOutputFile is compulsory. The system
classifies the test instances and saves the probabilities of the item
belonging to each class into FILE. The classes are numbered 0 to K-1,
and each number indicates a class label. The ordering of the class labels is the
alphabetical ordering of files in the -DataDir option. This will be
fixed in the next commit to list the labels themselves instead of the
label index.
NOTE :: FILE should either not be a txt file or it should not point to the
same directory as the -DataDir option. Otherwise, the next time jrae
processes the directory, it will read in the probabilities as training data.
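
Until the labels themselves are written out, the label index can be mapped back to a label name along these lines. This is a sketch under assumptions: label_index_map is a hypothetical helper, it expects only the training file names (without any "test_" files), and it takes "alphabetical" to mean plain string ordering of the file names.

```python
import os

def label_index_map(train_filenames):
    """Map class index (0 to K-1) to a label name.

    Assumption: indices follow the sorted order of the training file
    names, and the label is the file name without its extension.
    """
    ordered = sorted(train_filenames)
    return {i: os.path.splitext(name)[0] for i, name in enumerate(ordered)}
```

With the files "comedy.txt" and "action_thriller.txt", index 0 would map to "action_thriller" and index 1 to "comedy".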
-TreeDumpDir DIR
Set DIR to point to a directory where you want to dump out all
the trees. You can use this parameter both during training and testing.
For each data item, it writes out three files as follows with each line
containing information about a subtree of the RAE model.
(# stands for data item number, with index starting at 1):
* sent#_strings.txt
Each line is of the form <n word1 word2 ... wordn>
n indicates how long the subtree is.
* sent#_classifierOutput.txt
The probabilities emitted by the classifier indicating which
class the subtree belongs to, in increasing order of label.
NOTE: The ordering of the labels is listed in "labels.map".
* sent#_nodeVecs.txt
Each line contains the features calculated by the RAE model at
each node. This is the feature of the entire subtree underneath
this node.
There is also a treeStructures.txt file which lists the tree structure
built by the RAE, one data item per line. The first "n" values indicate
the index of the parent of the individual tokens in the data item. The
next "n-1" entries each correspond to the internal nodes built by the
RAE model.
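
As a rough illustration of the formats described above, the following Python sketch splits one treeStructures.txt line into its leaf and internal-node parts and reads one line of a sent#_strings.txt file. It assumes whitespace-separated fields and that the angle brackets in the format description are notation rather than literal characters; the function names are hypothetical, and the entries are kept as strings because their exact encoding is not specified here.

```python
def split_tree_line(line):
    """Split a treeStructures.txt line into (leaf_entries, internal_entries).

    A data item with n tokens yields n leaf entries followed by n-1
    internal-node entries, so a line holds 2n-1 fields in total.
    """
    entries = line.split()
    if len(entries) % 2 == 0:
        raise ValueError("expected 2n-1 entries, got %d" % len(entries))
    n = (len(entries) + 1) // 2
    return entries[:n], entries[n:]

def parse_strings_line(line):
    """Read a line of the form 'n word1 ... wordn' and return the words."""
    tokens = line.split()
    n = int(tokens[0])
    words = tokens[1:]
    if len(words) != n:
        raise ValueError("line declares %d words but has %d" % (n, len(words)))
    return words
```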
-NumCores NUMCORES
Indicates how many parallel threads to use for feature processing.
Ideally this equals the number of cores on the machine, and it should
never exceed it. If this option is not set, jrae sets the value to
the number of cores on the processor.
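
The defaulting rule can be sketched as below. The clamping to the available core count is an assumption made for the illustration; this text does not say whether jrae itself enforces the upper bound. resolve_num_cores is a hypothetical name.

```python
import os

def resolve_num_cores(requested=None):
    """Pick a thread count: default to the core count, never exceed it."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if requested is None:
        return available
    # Assumption: out-of-range requests are clamped rather than rejected.
    return max(1, min(requested, available))
```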
-NumFolds NUMFOLDS
While doing a full demo run, indicates the number of folds to split
the data into. Ignored by all other interfaces.
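
For readers unfamiliar with fold-based evaluation, a generic k-fold split looks roughly like this. It is a sketch of the general technique only; how jrae assigns data items to folds (shuffling, stratification by class) is not documented here, and k_fold_splits is a hypothetical name.

```python
def k_fold_splits(num_items, num_folds):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Items are dealt to folds round-robin; each fold is held out once.
    """
    folds = [list(range(i, num_items, num_folds)) for i in range(num_folds)]
    for held_out in range(num_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```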
-MaxIterations MAXITERATIONS
The number of iterations for training the RAE. The default is 80.
-embeddingSize EMBEDDINGSIZE
The RAE performs feature extraction by first embedding each word in
the vocabulary into a high-dimensional real space. Defaults to 50.
Higher values may severely increase the training time.
-alphaCat ALPHA
ALPHA is in [0,1] and indicates the balance between optimizing for
classification loss and auto-encoder loss. In detail: feature
learning is performed by minimizing the reconstruction error. It is
also possible to minimize a classification loss, which indicates
how informative a single word is of the final class the sentence
belongs to. Defaults to 0.2.
For example, a word like "awesome" is highly indicative of a good
review. If your corpus is full of such words, increase the value of
this parameter.
-Beta BETA
BETA is in [0,1] and indicates the weight on the classification loss.
Defaults to 0.5.
-lambdaW LAMBDAW
All lambda are weights on the regularization terms. LambdaW is the weight
on the embedding W1 - W4 (Refer to the paper for more details)
-lambdaL LAMBDAL
All lambda are weights on the regularization terms. LAMBDAL is the weight
on the embedding We (Refer to the paper for more details)
-lambdaCat LAMBDACAT
All lambda are weights on the regularization terms. LAMBDACAT is the
weight on the classifier weights.
-lambdaRAE LAMBDARAE
All lambda are weights on the regularization terms. LAMBDARAE is the
weight on the classifier weights. This differs from LAMBDACAT in that
this is applied in the second phase where the RAE is being fine-tuned
(Refer to the paper for more details)
-CurriculumLearning
FLAG is set to False by default. Set to True to turn on curriculum
learning. Refer to Bengio et al., ICML 2009 for more details.
--help
Print this message and quit.