This notebook tests Greenberg’s Rule 20 using German Universal Dependencies (HDT) data. It is part of Milestone 3 and focuses on an in-depth corpus-based analysis of a single universal.

In [3]:
!git clone https://github.com/venetagrigorova/linguistic-universals-wals/

Cloning into 'linguistic-universals-wals'...
remote: Enumerating objects: 154, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 154 (delta 73), reused 110 (delta 36), pack-reused 0 (from 0)
Receiving objects: 100% (154/154), 5.70 MiB | 10.60 MiB/s, done.
Resolving deltas: 100% (73/73), done.


In [6]:
from conllu import parse_incr
from collections import defaultdict

PATH = "/content/linguistic-universals-wals/data/conllu/de_hdt-ud-dev.conllu"



In [7]:
def load_conllu(path):
    with open(path, "r", encoding="utf-8") as f:
        return list(parse_incr(f))

sents = load_conllu(PATH)
len(sents)


18434

In [8]:
def feats(token):
    return token.get("feats") or {}

def has_feat(token, key, value):
    v = feats(token).get(key)
    if v is None:
        return False
    if isinstance(v, list):
        return value in v
    return v == value

def sent_tokens(sent):
    # keep only regular tokens (skip multiword-token ranges and empty nodes)
    return [t for t in sent if isinstance(t["id"], int)]


In [9]:
def analyze_rule20(sentences, max_examples=15):
    total_test = 0
    holds = 0
    exceptions = 0
    mixed_or_unclear = 0
    examples = []

    for sent in sentences:
        toks = sent_tokens(sent)
        by_id = {t["id"]: t for t in toks}

        # head -> children
        children = defaultdict(list)
        for t in toks:
            head = t.get("head")
            if isinstance(head, int) and head != 0:
                children[head].append(t)

        for noun in toks:
            if noun["upos"] not in {"NOUN", "PROPN"}:
                continue

            nid = noun["id"]
            kids = children.get(nid, [])

            # --- (A) "Modifier before noun"? Adj / Demonstrative / Numeral
            modifier_before = False

            for k in kids:
                # adjective
                if k["deprel"] == "amod" and k["id"] < nid:
                    modifier_before = True

                # numeral
                if k["deprel"] == "nummod" and k["id"] < nid:
                    modifier_before = True

                # demonstrative determiner: det + PronType=Dem
                if k["deprel"] == "det" and k["id"] < nid and has_feat(k, "PronType", "Dem"):
                    modifier_before = True

            if not modifier_before:
                continue

            # --- (B) "Genitive possessor"? nmod/nmod:poss with Case=Gen
            gen_poss = []
            for k in kids:
                if k["deprel"] in {"nmod:poss", "nmod"} and has_feat(k, "Case", "Gen"):
                    gen_poss.append(k)

            if not gen_poss:
                continue

            total_test += 1

            any_before = any(k["id"] < nid for k in gen_poss)
            any_after  = any(k["id"] > nid for k in gen_poss)

            if any_before and not any_after:
                holds += 1
            elif any_after and not any_before:
                exceptions += 1
                if len(examples) < max_examples:
                    text = " ".join(t["form"] for t in toks)
                    examples.append({
                        "sentence": text,
                        "head_noun": noun["form"],
                        "genitive": [g["form"] for g in gen_poss]
                    })
            else:
                mixed_or_unclear += 1

    support = holds / total_test if total_test else None
    return {
        "total_test": total_test,
        "holds": holds,
        "exceptions": exceptions,
        "mixed_or_unclear": mixed_or_unclear,
        "support_rate": support,
        "examples": examples
    }

res = analyze_rule20(sents)
res


{'total_test': 845,
 'holds': 108,
 'exceptions': 736,
 'mixed_or_unclear': 1,
 'support_rate': 0.12781065088757396,
 'examples': [{'sentence': 'In dem letzten Quartal dieses Jahres hoffen die Amazon-Manager erstmals einen Gewinn verbuchen zu können .',
   'head_noun': 'Quartal',
   'genitive': ['Jahres']},
  {'sentence': 'Die FTC fordert deshalb , dass eventuelle Käufer der Daten diese nur zu dem gleichen Zweck nutzen dürfen wie vormals Toysmart.com , also als Kundendatenbank für einen Online-Spielwarenhandel .',
   'head_noun': 'Käufer',
   'genitive': ['Daten']},
  {'sentence': 'Die private Nutzung des Internet rückt gegenüber der beruflichen zunehmend in den Vordergrund .',
   'head_noun': 'Nutzung',
   'genitive': ['Internet']},
  {'sentence': 'Bei rund einem halben Prozent der abgesetzten Tickets sollen die Käufer fremde Kartendaten angegeben haben .',
   'head_noun': 'Prozent',
   'genitive': ['Tickets']},
  {'sentence': 'Die Kreditkartengesellschaft Visa gibt an , dass Online-G

In [10]:
print("Total test NPs (modifier-before AND genitive-possessor):", res["total_test"])
print("Rule holds (genitive BEFORE noun):", res["holds"])
print("Exceptions (genitive AFTER noun):", res["exceptions"])
print("Mixed/unclear:", res["mixed_or_unclear"])
print("Support rate:", round(res["support_rate"], 3) if res["support_rate"] is not None else None)

print("\nExample exceptions:")
for ex in res["examples"][:5]:
    print("-", ex["sentence"])
    print("  head noun:", ex["head_noun"], "| genitive:", ex["genitive"])


Total test NPs (modifier-before AND genitive-possessor): 845
Rule holds (genitive BEFORE noun): 108
Exceptions (genitive AFTER noun): 736
Mixed/unclear: 1
Support rate: 0.128

Example exceptions:
- In dem letzten Quartal dieses Jahres hoffen die Amazon-Manager erstmals einen Gewinn verbuchen zu können .
  head noun: Quartal | genitive: ['Jahres']
- Die FTC fordert deshalb , dass eventuelle Käufer der Daten diese nur zu dem gleichen Zweck nutzen dürfen wie vormals Toysmart.com , also als Kundendatenbank für einen Online-Spielwarenhandel .
  head noun: Käufer | genitive: ['Daten']
- Die private Nutzung des Internet rückt gegenüber der beruflichen zunehmend in den Vordergrund .
  head noun: Nutzung | genitive: ['Internet']
- Bei rund einem halben Prozent der abgesetzten Tickets sollen die Käufer fremde Kartendaten angegeben haben .
  head noun: Prozent | genitive: ['Tickets']
- Die Kreditkartengesellschaft Visa gibt an , dass Online-Geschäfte nur rund zwei Prozent ihrer Umsätze ausmachen 

Summary

Testing Greenberg’s Rule 20 with German Universal Dependencies Data
In this part of the project, we examine a single universal, Greenberg’s Rule 20, using corpus data rather than typological summaries from WALS. In the formulation tested here, Rule 20 states that when any or all modifiers precede the noun, the genitive almost always precedes the noun as well. WALS suggests that this is a strong cross-linguistic tendency, but German looks like a problematic case because it often places genitive possessors after the noun.
To investigate this, we use the German HDT treebank from Universal Dependencies (UD); the file analyzed here is the development split, which contains 18,434 sentences. UD provides syntactically annotated corpora in a unified format, which makes it possible to study word order directly in real language use. The data is stored in CoNLL-U format, where each sentence is a dependency tree whose tokens carry their linear position, grammatical relation, and morphological features such as case.
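To make the CoNLL-U layout concrete, the following minimal sketch reads the ten tab-separated columns by hand, using only the standard library. The sample annotation is hand-written for illustration and not copied from HDT, and `parse_sentence` and `parse_feats` are hypothetical helper names; the notebook itself relies on `conllu.parse_incr` instead.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-annotated toy sentence (an assumption for illustration, not HDT data).
SAMPLE = "\n".join([
    "# text = die private Nutzung des Internet",
    "1\tdie\tder\tDET\t_\tPronType=Art\t3\tdet\t_\t_",
    "2\tprivate\tprivat\tADJ\t_\tDegree=Pos\t3\tamod\t_\t_",
    "3\tNutzung\tNutzung\tNOUN\t_\tCase=Nom|Number=Sing\t0\troot\t_\t_",
    "4\tdes\tder\tDET\t_\tCase=Gen\t5\tdet\t_\t_",
    "5\tInternet\tInternet\tNOUN\t_\tCase=Gen\t3\tnmod\t_\t_",
])

def parse_feats(raw):
    """Turn 'Case=Gen|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))

def parse_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        tok = dict(zip(FIELDS, line.split("\t")))
        if not tok["id"].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"]) if tok["head"].isdigit() else None
        tok["feats"] = parse_feats(tok["feats"])
        tokens.append(tok)
    return tokens

tokens = parse_sentence(SAMPLE)
for t in tokens:
    print(t["id"], t["form"], t["upos"], t["head"], t["deprel"], t["feats"])
```

Word order is read off the `id` column, the tree structure off `head` and `deprel`, and case off `feats`, which are the three pieces of information the analysis needs.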
We operationalize Rule 20 as follows. We extract every noun phrase headed by a common or proper noun (NOUN or PROPN) that has at least one modifier preceding the head. Modifiers are adjectives (amod), numerals (nummod), or demonstrative determiners (det with PronType=Dem). In addition, the head must have a genitive possessor, identified as an nmod or nmod:poss dependent marked Case=Gen. Only noun phrases that meet both conditions enter the test set. For each of them, we check whether the genitive possessor precedes or follows the head noun.
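The operationalization above can be sketched on a single noun phrase as follows. This is a simplified illustration rather than the notebook's `analyze_rule20`: the helper names `tok` and `classify_np` are hypothetical, the token dicts are hand-built, and feature values are assumed to be plain strings.

```python
def tok(tid, form, deprel, feats=None):
    """Build a minimal token dict (hypothetical helper for this sketch)."""
    return {"id": tid, "form": form, "deprel": deprel, "feats": feats or {}}

def classify_np(noun, dependents):
    """Classify one NP as 'holds', 'exception', or 'mixed'.

    Returns None when the NP does not qualify for the test set.
    """
    nid = noun["id"]

    def is_prenominal_modifier(t):
        if t["id"] >= nid:
            return False
        if t["deprel"] in {"amod", "nummod"}:
            return True
        # demonstrative determiner: det + PronType=Dem
        return t["deprel"] == "det" and t["feats"].get("PronType") == "Dem"

    # Condition (A): at least one adjective, numeral, or demonstrative before the noun
    if not any(is_prenominal_modifier(t) for t in dependents):
        return None

    # Condition (B): at least one genitive-marked nmod / nmod:poss dependent
    genitives = [t for t in dependents
                 if t["deprel"] in {"nmod", "nmod:poss"}
                 and t["feats"].get("Case") == "Gen"]
    if not genitives:
        return None

    before = any(t["id"] < nid for t in genitives)
    after = any(t["id"] > nid for t in genitives)
    if before and not after:
        return "holds"
    if after and not before:
        return "exception"
    return "mixed"

# "die private Nutzung des Internet": adjective before, genitive after the noun
nutzung = tok(3, "Nutzung", "root")
print(classify_np(nutzung, [tok(2, "private", "amod"),
                            tok(5, "Internet", "nmod", {"Case": "Gen"})]))

# "Peters neues Auto" (constructed example): genitive and adjective both before the noun
auto = tok(3, "Auto", "root")
print(classify_np(auto, [tok(1, "Peters", "nmod", {"Case": "Gen"}),
                         tok(2, "neues", "amod")]))
```

A noun phrase with no prenominal modifier or no genitive dependent returns `None` and is left out of the counts, which is why the test set is much smaller than the number of nouns in the corpus.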
Using this procedure, we identify 845 relevant noun phrases. In only 108 of them (about 13 percent) does the genitive precede the noun. In 736 (about 87 percent) the genitive follows the noun, and one noun phrase has genitives on both sides and is counted as mixed. In the vast majority of relevant examples, then, German violates Rule 20 as tested here.
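The percentages can be recomputed from the raw counts. The short sketch below does so and also attaches a 95 percent Wilson score interval to the support rate. The interval is an addition for illustration and not part of the original analysis, `wilson_interval` is a hypothetical helper, and the calculation treats noun phrases as independent observations, which is only an approximation for corpus data.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Counts reported by analyze_rule20 on the HDT development split
counts = {"holds": 108, "exceptions": 736, "mixed_or_unclear": 1}
total = sum(counts.values())

shares = {name: n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {counts[name]} / {total} = {share:.1%}")

low, high = wilson_interval(counts["holds"], total)
print(f"support rate 95% interval: {low:.3f} to {high:.3f}")
```

With more than 800 test cases the interval is narrow, so the low support rate is unlikely to be a sampling accident within this treebank.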
The extracted examples illustrate this pattern. Constructions such as “die private Nutzung des Internet”, “zwei Prozent ihrer Umsätze”, or “dem letzten Quartal dieses Jahres” have a modifier before the noun, while the genitive possessor follows it. These are not rare or marked constructions, but frequent and stylistically neutral expressions in German.
Overall, this analysis shows that German does not merely exhibit occasional exceptions to Rule 20: in the corpus data, roughly 87 percent of the relevant noun phrases run against the predicted pattern. This suggests that Rule 20 should be understood as a broad typological tendency rather than a universal that reliably applies to individual languages. The comparison between WALS-based generalizations and corpus-based evidence highlights the importance of testing linguistic universals against real language usage.
