
wmdyer/infogain

Predicting cross-linguistic adjective order with information gain

This is the code and data repository for our application of information gain to the problem of predicting adjective order across languages.

source data

CoNLL-U (.conllu) files

workflow

1. extract NPs from conllu file

./tools/extract_conllu_nps.sh <file>

input

  • <file>: a conllu file

output

  • nps.tsv: a tab-delimited file containing COUNT, NOUN, and comma-delimited ADJ(s)
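A minimal sketch of reading nps.tsv back into Python, for anyone who wants to inspect the extracted NPs before building vectors. It assumes the three columns appear in the order COUNT, NOUN, ADJ(s) and that rows not starting with an integer count (blank lines, a possible header) can be skipped. The names NounPhrase and read_nps are hypothetical and are not part of this repository.

```python
import csv
import io
from typing import Iterator, List, NamedTuple


class NounPhrase(NamedTuple):
    count: int          # number of times this NP was seen
    noun: str           # head noun
    adjs: List[str]     # one or more adjectives


def read_nps(handle) -> Iterator[NounPhrase]:
    """Yield one NounPhrase per well-formed row of a tab-delimited NP file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: exactly three columns, the first being an integer count.
        if len(row) != 3 or not row[0].isdigit():
            continue
        count, noun, adjs = row
        yield NounPhrase(int(count), noun, adjs.split(","))


# Invented sample rows, only to show the assumed layout.
sample = "12\thouse\tbig,red\n3\tdog\tsmall\n"
nps = list(read_nps(io.StringIO(sample)))
```

To read the real file, pass an open file object instead of the in-memory sample, e.g. `open("nps.tsv", encoding="utf-8", newline="")`.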

2. build vectors

python ./src/build_feature_vecs.py -n nps.tsv

input

  • -n: file containing NPs, generated by extract_conllu_nps.sh

output

  • ig.pkl: a pickle file containing feature vectors and associated metadata
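The internal layout of ig.pkl is not documented here, so the sketch below only loads the file and reports the types it finds rather than assuming any particular keys. load_ig and summarize are hypothetical helper names. Because unpickling can execute arbitrary code, only load pickle files you generated yourself.

```python
import pickle
from pathlib import Path


def load_ig(path="ig.pkl"):
    """Load the pickled object written by build_feature_vecs.py."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def summarize(obj):
    """Return a shallow {name: type name} overview of the loaded object.

    Assumption: nothing is known about the structure, so a dict is listed
    key by key and anything else is reported as a single root entry.
    """
    if isinstance(obj, dict):
        return {str(key): type(value).__name__ for key, value in obj.items()}
    return {"<root>": type(obj).__name__}
```

Calling `summarize(load_ig("ig.pkl"))` after step 2 gives a quick view of what the file contains.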

3. extract triples for training regression

./tools/extract_conllu_triples.sh <file>

input

  • <file>: a conllu file

output

  • triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)

note

  • rename triples.csv to something like triples.train, since step 4 writes its output to the same filename
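A small sketch of how one line of a triples file could be parsed, assuming each comma-separated field has the form wordform/POS with POS being NOUN or ADJ, as the N/NOUN,A1/ADJ,A2/ADJ description suggests. The function names parse_triple and template_of are hypothetical. The template letters follow the AAN, ANA, NAA convention used in the scores file.

```python
def parse_triple(line):
    """Split 'w1/POS,w2/POS,w3/POS' into a list of (wordform, pos) pairs."""
    tokens = []
    for field in line.strip().split(","):
        # rpartition keeps any '/' inside the wordform itself intact.
        word, _, pos = field.rpartition("/")
        tokens.append((word, pos))
    tags = sorted(pos for _, pos in tokens)
    # Assumption: a valid triple has exactly one noun and two adjectives.
    if tags != ["ADJ", "ADJ", "NOUN"]:
        raise ValueError(f"not a noun plus two adjectives: {line!r}")
    return tokens


def template_of(tokens):
    """Map the surface order of a parsed triple to AAN, ANA or NAA."""
    return "".join("N" if pos == "NOUN" else "A" for _, pos in tokens)


# Invented example line, in noun-medial order.
example = parse_triple("big/ADJ,house/NOUN,red/ADJ")
```

This simple split would misparse wordforms that themselves contain a comma.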

4. extract testing triples

./tools/extract_conllu_triples.sh <file>

input

  • <file>: a conllu file

output

  • triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)

note

  • rename triples.csv to something like triples.test to keep it distinct from the training triples

5. partition feature vectors

python ./src/partition.py -s <file>

input

  • -s: a comma-delimited triples file, generated by extract_conllu_triples.sh

output

  • scores.tsv: a tab-delimited file containing:
    • key: a sorted version of the triple
    • wordforms: a surface permutation of the wordforms in a triple
    • template: the template (AAN, ANA, NAA) of this permutation
    • attest: the number of attestations of this permutation in the corpus
    • ig_seq: the information gain of each word in the permutation
    • ig_1st_a: the IG of the 1st adjective
    • ig_sum: the sum of the IGs for each word
    • ig_uc_pos: the unconditioned positive IG of the 1st adjective
    • ig_c_pos: the conditioned positive IG of the 1st adjective
    • ig_uc_neg: the unconditioned negative IG of the 1st adjective
    • ig_c_neg: the conditioned negative IG of the 1st adjective
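The relationship between some of these columns can be sketched as follows. This is an illustration of the column descriptions only, not the code in partition.py. It assumes that ig_seq is available as a list of floats aligned with the template, that "sorted" in key means alphabetical order of the wordforms, and that the key is joined with commas. All three function names are hypothetical.

```python
def sort_key(wordforms):
    """Order-independent key for a triple (assumed: alphabetical, comma-joined)."""
    return ",".join(sorted(wordforms))


def first_adjective_ig(template, ig_seq):
    """Pick the IG at the position of the first 'A' in the template."""
    if len(template) != len(ig_seq):
        raise ValueError("template and ig_seq must have the same length")
    return ig_seq[template.index("A")]


def ig_total(ig_seq):
    """Sum of the per-word IG values, as described for ig_sum."""
    return sum(ig_seq)


# Invented numbers for an NAA permutation such as "house big red".
template = "NAA"
ig_seq = [0.25, 0.5, 0.125]
key = sort_key(["house", "big", "red"])
```

With these invented values, the first adjective sits at index 1 of the template, so its IG is the second entry of ig_seq.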

6. evaluate

python src/regress.py -tr <training_scores>.tsv -te <testing_scores>.tsv [-m <metric>] [--verbose]

input

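  • -tr: scores file for training the regression (scores.tsv from step 5, run on the training triples)
  • -te: scores file for testing (scores.tsv from step 5, run on the testing triples)
  • -m: optional metric (ig_1st_a|ig_sum|ig_uc_pos|ig_c_pos|ig_uc_neg|ig_c_neg)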

output

  • triples.gen: a tab-delimited file containing TEMPLATE, ATTESTED, and GENERATED columns
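One possible way to summarize triples.gen is a per-template match rate between the ATTESTED and GENERATED columns. The sketch below assumes the columns appear in the order TEMPLATE, ATTESTED, GENERATED, that ATTESTED and GENERATED hold directly comparable strings, and that a header row, if present, repeats those column names. The function accuracy_by_template is hypothetical and is not part of regress.py.

```python
import csv
import io
from collections import defaultdict

FIELDS = ("TEMPLATE", "ATTESTED", "GENERATED")


def accuracy_by_template(handle):
    """Return {template: share of rows where GENERATED equals ATTESTED}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip malformed rows and an optional header row.
        if len(row) != 3 or tuple(row) == FIELDS:
            continue
        template, attested, generated = row
        totals[template] += 1
        hits[template] += int(attested == generated)
    return {t: hits[t] / totals[t] for t in totals}


# Invented rows, only to show the assumed layout.
sample = (
    "TEMPLATE\tATTESTED\tGENERATED\n"
    "AAN\tbig red house\tbig red house\n"
    "AAN\tsmall old dog\told small dog\n"
    "NAA\tcasa grande roja\tcasa grande roja\n"
)
result = accuracy_by_template(io.StringIO(sample))
```

If the two columns turn out to hold something other than comparable strings, the equality test is the only part that would need to change.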
