Predicting cross-linguistic adjective order with information gain

This is the code and data repository for our application of information gain to the problem of predicting adjective order across languages.
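
For readers unfamiliar with the term, the sketch below shows information gain in its generic textbook sense: the drop in Shannon entropy when a set of labelled items is split in two. It is only an illustration; the function names are hypothetical, and the exact formulation used in src/partition.py may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, in_subset):
    """Entropy reduction from splitting `labels` into two parts.

    `in_subset` is a parallel sequence of booleans saying which side of
    the split each item falls on.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    inside = [lab for lab, flag in zip(labels, in_subset) if flag]
    outside = [lab for lab, flag in zip(labels, in_subset) if not flag]
    # Weighted average entropy of the two parts after the split.
    remainder = (len(inside) / total) * entropy(inside) \
        + (len(outside) / total) * entropy(outside)
    return entropy(labels) - remainder
```

A split that separates the labels perfectly recovers the full entropy of the set, while a split that leaves both parts as mixed as the original gains nothing.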

source data

CoNLL-U files (the treebank format used by Universal Dependencies)

workflow

1. extract NPs from a CoNLL-U file

./tools/extract_conllu_nps.sh <file>

input

<file>: a CoNLL-U file

output

nps.tsv: a tab-delimited file with three columns: COUNT, NOUN, and a comma-delimited list of ADJ(s)
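
To make the shape of this step concrete, here is a rough Python sketch of how such rows could be derived from CoNLL-U text. It is not the logic of extract_conllu_nps.sh: it assumes an NP is a NOUN plus the ADJ tokens attached to it by the `amod` relation, and the function names `extract_nps` and `to_tsv` are hypothetical.

```python
from collections import Counter


def extract_nps(conllu_text):
    """Count (noun, adjectives) groups in CoNLL-U text.

    Assumption: an NP is a NOUN token plus every ADJ token attached to it
    with the `amod` relation; the real script may use other criteria.
    """
    counts = Counter()
    # Sentences in CoNLL-U are separated by blank lines.
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip sentence-level comments
            cols = line.split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue  # skip multiword ranges (1-2) and empty nodes (1.1)
            head = int(cols[6]) if cols[6].isdigit() else None
            # Keep FORM, UPOS, HEAD, DEPREL keyed by token ID.
            tokens[int(cols[0])] = (cols[1], cols[3], head, cols[7])
        adjs_by_head = {}
        for tok_id in sorted(tokens):
            form, upos, head, deprel = tokens[tok_id]
            if upos == "ADJ" and deprel == "amod" \
                    and head in tokens and tokens[head][1] == "NOUN":
                adjs_by_head.setdefault(head, []).append(form)
        for head, adjs in adjs_by_head.items():
            counts[(tokens[head][0], ",".join(adjs))] += 1
    return counts


def to_tsv(counts):
    """Render counts as COUNT, NOUN, ADJ(s) rows, most frequent first."""
    return "\n".join(
        f"{n}\t{noun}\t{adjs}" for (noun, adjs), n in counts.most_common()
    )
```

Adjectives are listed in surface order here; whether nps.tsv preserves surface order is not stated above, so treat that as a further assumption.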

2. build vectors

python ./src/build_feature_vecs.py -n nps.tsv

input

-n: file containing NPs, generated by extract_conllu_nps.sh

output

ig.pkl: a pickle file containing feature vectors and associated metadata
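
The internal layout of ig.pkl is not documented here, so the following sketch only shows a cautious way to peek at it without assuming any structure. `summarize_pickle` is a hypothetical helper, not part of the repository. Note that unpickling requires whatever libraries the stored objects depend on, and that pickle files should only be loaded from trusted sources.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle and describe its top-level shape without assuming a layout."""
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        # For a dict, list its keys so the stored fields are visible.
        summary["keys"] = sorted(str(k) for k in obj)
    elif hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    return summary
```

Calling `summarize_pickle("ig.pkl")` after step 2 should reveal whether the file holds a dict, a list, or some other object.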

3. extract training triples for the regression

./tools/extract_conllu_triples.sh <file>

input

<file>: a CoNLL-U file

output

triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)

note

rename triples.csv to something like triples.train before running step 4, which writes to the same filename
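
As an illustration of the triples format, the sketch below parses one line into (wordform, tag) pairs. It assumes each comma-separated field has the form wordform/POS, for example big/ADJ, which is one reading of the N/NOUN,A1/ADJ,A2/ADJ description above; `parse_triple` is a hypothetical helper.

```python
def parse_triple(line):
    """Split one triples line into (wordform, pos) pairs, preserving order.

    Assumption: each field looks like `wordform/POS`; the split is on the
    last slash so that wordforms containing a slash survive intact.
    """
    pairs = []
    for field in line.strip().split(","):
        form, _, pos = field.rpartition("/")
        pairs.append((form, pos))
    return pairs
```

Because the fields may appear in any order, downstream code should identify the noun by its tag and not by its position.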

4. extract testing triples

./tools/extract_conllu_triples.sh <file>

input

<file>: a CoNLL-U file

output

triples.csv: a comma-delimited file containing N/NOUN,A1/ADJ,A2/ADJ triples (in any order)

note

rename triples.csv to something like triples.test

5. partition feature vectors

python ./src/partition.py -s <file>

input

<file>: a comma-delimited triples file, generated by extract_conllu_triples.sh (e.g. triples.train or triples.test; step 6 expects a scores file for each)

output

scores.tsv: a tab-delimited file containing:
- key: a sorted version of the triple
- wordforms: a surface permutation of the wordforms in a triple
- template: the template (AAN, ANA, NAA) of this permutation
- attest: the number of attestations of this permutation in the corpus
- ig_seq: the information gain of each word in the permutation
- ig_1st_a: the IG of the 1st adjective
- ig_sum: the sum of the IGs for each word
- ig_uc_pos: the unconditioned positive IG of the 1st adjective
- ig_c_pos: the conditioned positive IG of the 1st adjective
- ig_uc_neg: the unconditioned negative IG of the 1st adjective
- ig_c_neg: the conditioned negative IG of the 1st adjective
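
The key, wordforms, and template columns can be pictured with a small sketch that enumerates every surface order of a triple. The exact string formats used in scores.tsv are not specified above, so the space-joined wordforms and comma-joined key here are assumptions, as are the function names.

```python
from itertools import permutations


def surface_orders(noun, adj1, adj2):
    """Map each surface permutation of a triple to its template.

    Templates use N for the noun and A for either adjective, giving
    AAN, ANA, or NAA.
    """
    tagged = [(noun, "N"), (adj1, "A"), (adj2, "A")]
    orders = {}
    for perm in permutations(tagged):
        words = " ".join(word for word, _ in perm)
        orders[words] = "".join(tag for _, tag in perm)
    return orders


def triple_key(noun, adj1, adj2):
    """Order-independent key: the three wordforms sorted and comma-joined."""
    return ",".join(sorted([noun, adj1, adj2]))
```

Each triple yields six permutations, two per template, and all six share one key, which is what lets rows for the same triple be grouped together.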

6. evaluate

python ./src/regress.py -tr <training_scores>.tsv -te <testing_scores>.tsv [-m <metric>] [--verbose]

input

-tr: scores file for training regression
-te: scores file for testing
-m: (optional) metric to use, one of ig_1st_a|ig_sum|ig_uc_pos|ig_c_pos|ig_uc_neg|ig_c_neg

output

triples.gen: a tab-delimited file containing TEMPLATE, ATTESTED, and GENERATED columns
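
One way to summarize triples.gen is to compute, per template, how often the generated order matches the attested one. The sketch below assumes the file has a header row with exactly the column names TEMPLATE, ATTESTED, and GENERATED, and that a match means string equality; neither is confirmed above, and `accuracy_by_template` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_template(tsv_text):
    """Fraction of rows per template where GENERATED equals ATTESTED.

    Assumption: the first row is a header naming the three columns.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        template = row["TEMPLATE"]
        totals[template] += 1
        if row["ATTESTED"] == row["GENERATED"]:
            hits[template] += 1
    return {t: hits[t] / totals[t] for t in totals}
```

To apply it to a file on disk, read the file's contents into a string first and pass that string in.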

wmdyer/infogain

Folders and files

Latest commit

History

Repository files navigation

Predicting cross-linguistic adjective order with information gain

source data

workflow

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages