This is the code and data repository for our application of information gain to the problem of predicting adjective order across languages.
CoNLLU files
- CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings
- Universal Dependencies
1. extract NPs from conllu file
./tools/extract_conllu_nps.sh <file>
input
<file>: a conllu file
output
nps.tsv: a tab-delimited file containing COUNT, NOUN, and comma-delimited ADJ(s)
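For downstream inspection, a row of nps.tsv can be read back with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes one NP per line, no header row, and exactly three tab-separated fields in the order COUNT, NOUN, ADJ(s). The names `NounPhrase` and `parse_np_line` are hypothetical.

```python
from typing import List, NamedTuple


class NounPhrase(NamedTuple):
    """One row of nps.tsv (hypothetical container, not part of the repo)."""
    count: int
    noun: str
    adjectives: List[str]


def parse_np_line(line: str) -> NounPhrase:
    # Assumed layout: COUNT <tab> NOUN <tab> ADJ[,ADJ...]
    count, noun, adjs = line.rstrip("\n").split("\t")
    return NounPhrase(int(count), noun, adjs.split(","))


# Illustrative row, invented for this example.
example = parse_np_line("12\thouse\tbig,red\n")
print(example)
```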
2. build vectors
python ./src/build_feature_vecs.py -n nps.tsv
input
- -n: file containing NPs, generated by extract_conllu_nps.sh
output
ig.pkl: a pickle file containing feature vectors and associated metadata
3. extract triples for training regression
./tools/extract_conllu_triples.sh <file>
input
<file>: a conllu file
output
triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)
note
- rename triples.csv to something like triples.train
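Because the three WORD/POS fields of a triple may appear in any order, a reader of the triples file has to pick out the noun and adjectives by tag rather than by position. The sketch below shows one way to do that. It assumes each field has the form `word/NOUN` or `word/ADJ` and that wordforms contain no commas; `parse_triple` is a hypothetical helper, not a function from this repository.

```python
from typing import List, Tuple


def parse_triple(line: str) -> Tuple[str, List[str]]:
    """Split one triples line into (noun, [adjective, adjective])."""
    noun = None
    adjectives: List[str] = []
    for field in line.strip().split(","):
        # rpartition keeps any "/" inside the wordform itself intact.
        word, _, pos = field.rpartition("/")
        if pos == "NOUN":
            noun = word
        elif pos == "ADJ":
            adjectives.append(word)
        else:
            raise ValueError(f"unexpected tag in field: {field!r}")
    if noun is None or len(adjectives) != 2:
        raise ValueError(f"expected one NOUN and two ADJ fields: {line!r}")
    return noun, adjectives


# Illustrative line, invented for this example.
noun, adjs = parse_triple("big/ADJ,house/NOUN,red/ADJ")
print(noun, adjs)
```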
4. extract testing triples
./tools/extract_conllu_triples.sh <file>
input
<file>: a conllu file
output
triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)
note
- rename triples.csv to something like triples.test
5. partition feature vectors
python ./src/partition.py -s <file>
input
<file>: a comma-delimited triples file, generated by extract_conllu_triples.sh
output
scores.tsv: a tab-delimited file containing:
- key: a sorted version of the triple
- wordforms: a surface permutation of the wordforms in a triple
- template: the template (AAN, ANA, NAA) of this permutation
- attest: the number of attestations of this permutation in the corpus
- ig_seq: the information gain of each word in the permutation
- ig_1st_a: the IG of the 1st adjective
- ig_sum: the sum of the IGs for each word
- ig_uc_pos: the unconditioned positive IG of the 1st adjective
- ig_c_pos: the conditioned positive IG of the 1st adjective
- ig_uc_neg: the unconditioned negative IG of the 1st adjective
- ig_c_neg: the conditioned negative IG of the 1st adjective
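To make the key, wordforms, and template fields concrete, the sketch below enumerates the six surface permutations of a noun with two adjectives and labels each with its template. It is an illustration only: the exact string format partition.py uses for the key and wordforms is not documented here, so the comma-joined key and space-joined wordforms are assumptions, and `triple_key` and `surface_permutations` are hypothetical names.

```python
from itertools import permutations
from typing import List, Tuple


def triple_key(noun: str, adj1: str, adj2: str) -> str:
    # Assumption: the key is the three wordforms sorted and comma-joined.
    return ",".join(sorted([noun, adj1, adj2]))


def surface_permutations(noun: str, adj1: str, adj2: str) -> List[Tuple[str, str]]:
    """Return (wordforms, template) for every ordering of the triple."""
    tagged = [(noun, "N"), (adj1, "A"), (adj2, "A")]
    rows = []
    for perm in permutations(tagged):
        wordforms = " ".join(word for word, _ in perm)
        template = "".join(tag for _, tag in perm)  # AAN, ANA, or NAA
        rows.append((wordforms, template))
    return rows


key = triple_key("house", "red", "big")
rows = surface_permutations("house", "red", "big")
for wordforms, template in rows:
    print(key, wordforms, template, sep="\t")
```

Each template covers two of the six permutations, one for each relative order of the two adjectives.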
6. evaluate
python src/regress.py -tr <training_scores>.tsv -te <testing_scores>.tsv [-m <metric>] [--verbose]
input
- -tr: scores file for training regression
- -te: scores file for testing
- -m: metric (ig_1st_a|ig_sum|ig_uc_pos|ig_c_pos|ig_uc_neg|ig_c_neg)
output
triples.gen: a tab-delimited file containing TEMPLATE, ATTESTED, and GENERATED columns
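One simple way to summarize triples.gen is the share of rows where the generated value matches the attested one. The sketch below assumes three tab-separated columns per row, no header row, and that an exact string match between ATTESTED and GENERATED counts as agreement; none of this is confirmed by regress.py. The function `agreement_rate` is hypothetical and the sample rows are invented for illustration.

```python
import csv
import io
from typing import TextIO


def agreement_rate(handle: TextIO) -> float:
    """Fraction of rows whose GENERATED column equals ATTESTED."""
    total = 0
    matches = 0
    for template, attested, generated in csv.reader(handle, delimiter="\t"):
        total += 1
        if attested == generated:
            matches += 1
    return matches / total if total else 0.0


# Invented sample standing in for a real triples.gen file.
sample = io.StringIO(
    "AAN\tbig red house\tbig red house\n"
    "AAN\told small car\tsmall old car\n"
)
rate = agreement_rate(sample)
print(f"agreement: {rate:.2f}")
```

With a real file, the same function can be called as `agreement_rate(open("triples.gen", newline=""))`.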