ÚFAL team

Members

Rudolf Rosa, Dan Zeman, Martin Vastl

(New registration on 12 April 2020; the previous one listed only Dan and Ruda, and we have no record of exactly what we entered there. The team name is now "ÚFAL" and the affiliation is "Charles University, Faculty of Mathematics and Physics, ÚFAL, Prague, Czechia".)

Plan

Suggested approaches (simpler to more complex):

1. Majority voting based on language family (the language genera in the train and test data will probably not overlap).
   - RR: redone correctly, accuracy 60.65% on dev data
2. Conditional probability p(feature g = y | feature f = x).
   - DZ: first attempt done (take the strongest source feature, ignore the rest); accuracy 64.47% on dev data.
3. Closest language: find the most similar language based on the filled-in features as well as language family and GPS coordinates, and copy values from it; if a value is missing, fall back to the second most similar language, and so on.
4. Combination using weighted voting (weight = language similarity).
5. Looking for intralingual causation or correlation (such as SVO implies SV, or postpositions imply OV), probably using statistical methods such as CCA.
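
The conditional-probability approach listed above can be illustrated with a minimal sketch. It is not the code used for the reported dev accuracy. It assumes that "strongest source feature" means the known feature f=x whose most likely target value y has the highest estimated p(g=y | f=x). The function names `train` and `predict` and the dictionary representation of a language are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def train(languages: List[Dict[str, str]]) -> Dict[tuple, Counter]:
    """Count how often f=x co-occurs with g=y in the training languages."""
    # counts[(f, x, g)][y] = number of languages having both f=x and g=y
    counts: Dict[tuple, Counter] = defaultdict(Counter)
    for feats in languages:
        for f, x in feats.items():
            for g, y in feats.items():
                if f != g:
                    counts[(f, x, g)][y] += 1
    return counts


def predict(counts: Dict[tuple, Counter],
            known: Dict[str, str],
            target: str) -> Optional[str]:
    """Pick the value y with the highest p(target=y | f=x) over all known f=x.

    Assumption: only the single strongest source feature decides;
    all other known features are ignored.
    """
    best_value, best_prob = None, 0.0
    for f, x in known.items():
        dist = counts.get((f, x, target))
        if not dist:
            continue  # this source feature never co-occurred with the target
        total = sum(dist.values())
        y, n = dist.most_common(1)[0]
        if n / total > best_prob:
            best_value, best_prob = y, n / total
    return best_value  # None if no known feature gives any evidence
```

Note that without smoothing, a source feature seen only once yields a probability of 1.0 and will dominate better-attested features; a real system would probably need a minimum-count threshold or a fallback to a simpler baseline.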

The shared task website also lists some existing work on the topic:

SIGTYP 2020 Shared Task: Prediction of Typological Features

To participate in the shared task, you will build a system that can predict typological properties of languages, given a handful of observed features. Training examples and development examples will be provided. All submitted systems will be compared on a held-out test set.

Final results

To obtain the final results, run `python scripts/score.py [TSVFILE] [more TSVFILES]`. Note that the script runs on cleaned input files, which may differ from the files you submitted. The generated plots are in results_plots.ods; they do not interact with the Python script.

Data Format

The model will receive the language code, name, latitude, longitude, genus, family, country code, and feature names as inputs and will be required to fill values for those requested features.

Input:

mhi      Marathi      19.0      76.0      Indic      Indo-European      IN      order_of_subject,_object,_and_verb=? | number_of_genders=?
jpn      Japanese      37.0      140.0      Japanese      Japanese      JP      case_syncretism=? | order_of_adjective_and_noun=?

The expected output is:

mhi      Marathi      19.0      76.0      Indic      Indo-European      IN      order_of_subject,_object,_and_verb= SOV | number_of_genders=three
jpn      Japanese      37.0      140.0      Japanese      Japanese      JP      case_syncretism=no_case_marking | order_of_adjective_and_noun=demonstrative-Noun
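
Turning an input line into an output line amounts to replacing each `?` with a predicted value. The sketch below shows one way to do that; it assumes the columns are tab-separated (the examples here show them with spaces), and `fill_unknowns` and the `predict` callback are hypothetical names, not part of the shared task tooling.

```python
from typing import Callable


def fill_unknowns(line: str, predict: Callable[[str], str]) -> str:
    """Replace every 'feature=?' in the last column with a predicted value."""
    cols = line.rstrip("\n").split("\t")  # assumption: tab-separated columns
    filled = []
    for pair in cols[7].split("|"):
        # Split on the first '=' only; feature names may contain commas.
        name, _, value = pair.strip().partition("=")
        value = value.strip()
        if value == "?":
            value = predict(name)  # ask the model for this feature
        filled.append(f"{name}={value}")
    cols[7] = " | ".join(filled)
    return "\t".join(cols)
```

Features that already carry a value are passed through unchanged, so the same function can be applied to lines that mix observed and requested features.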

Data

The model will have access to typology features across a set of languages. These features are derived from the WALS database. For the purpose of this shared task, we will provide a subset of languages/features as shown below:

tur      Turkish      39.0      35.0      Turkic      Altaic      TR      case_syncretism=no_syncretism | order_of_subject,_object,_and_verb= SOV | number_of_genders=none | definite_articles=no_definite_but_indefinite_article
jpn      Japanese      37.0      140.0      Japanese      Japanese      JP      order_of_subject,_object,_and_verb= SOV | prefixing_vs_suffixing_in_inflectional_morphology=strongly_suffixing

Column 1: Language ID

Column 2: Language name

Column 3: Latitude

Column 4: Longitude

Column 5: Genus

Column 6: Family

Column 7: Country Codes

Column 8: Feature-value pairs for the language, with pairs separated by ‘|’
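
As an illustration of the column layout, a row could be parsed into a small record as sketched below. This assumes tab-separated columns; the `Language` type and `parse_row` function are hypothetical and not taken from the repository's scripts.

```python
from typing import Dict, NamedTuple


class Language(NamedTuple):
    code: str
    name: str
    latitude: float
    longitude: float
    genus: str
    family: str
    country_codes: str
    features: Dict[str, str]


def parse_row(line: str) -> Language:
    """Parse one data row (assumed tab-separated) into a Language record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError(f"expected 8 columns, got {len(cols)}")
    features: Dict[str, str] = {}
    for pair in cols[7].split("|"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty feature column
        # Split on the first '=' only; strip stray spaces around the value.
        name, _, value = pair.partition("=")
        features[name.strip()] = value.strip()
    return Language(cols[0], cols[1], float(cols[2]), float(cols[3]),
                    cols[4], cols[5], cols[6], features)
```

Stripping whitespace around names and values means that a pair written as `order_of_subject,_object,_and_verb= SOV` is read with the value `SOV`.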

