This notebook tests Greenberg’s Rule 20 using German Universal Dependencies (HDT) data and then applies the same approach to Rules 23 and 24. It is part of Milestone 3, which focuses on in-depth corpus-based analysis.

In [None]:
!git clone https://github.com/venetagrigorova/linguistic-universals-wals/

Cloning into 'linguistic-universals-wals'...
remote: Enumerating objects: 165, done.
remote: Counting objects: 100% (165/165), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 165 (delta 79), reused 116 (delta 39), pack-reused 0 (from 0)
Receiving objects: 100% (165/165), 7.67 MiB | 6.00 MiB/s, done.
Resolving deltas: 100% (79/79), done.


In [None]:
!pip install conllu


Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-6.0.0


In [None]:
from conllu import parse_incr
from collections import defaultdict

PATH = "/content/linguistic-universals-wals/data/conllu/de_hdt-ud-dev.conllu"



In [None]:
def load_conllu(path):
    with open(path, "r", encoding="utf-8") as f:
        return list(parse_incr(f))

sents = load_conllu(PATH)
len(sents)


18434

In [None]:
def feats(token):
    return token.get("feats") or {}

def has_feat(token, key, value):
    v = feats(token).get(key)
    if v is None:
        return False
    if isinstance(v, list):
        return value in v
    return v == value

def sent_tokens(sent):
    # keep only regular tokens (skip multiword-token ranges, whose id is not an int)
    return [t for t in sent if isinstance(t["id"], int)]
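
The helpers above can be tried without the treebank. The following is a minimal sketch on hand-built tokens; the three dictionaries are hypothetical and only mimic the dict-like tokens of the `conllu` package, and the tuple id for the multiword token is an assumption about how such ranges are represented.

```python
def feats(token):
    # A missing or None feature dict is treated as empty.
    return token.get("feats") or {}

def has_feat(token, key, value):
    v = feats(token).get(key)
    if v is None:
        return False
    if isinstance(v, list):
        return value in v
    return v == value

def sent_tokens(sent):
    # Keep only tokens with an integer id.
    return [t for t in sent if isinstance(t["id"], int)]

# Hypothetical tokens for illustration (not taken from the corpus files).
mwt = {"id": (1, "-", 2), "form": "im", "feats": None}  # assumed range-id shape
dieses = {"id": 4, "form": "dieses", "feats": {"Case": "Gen", "PronType": "Dem"}}
punct = {"id": 9, "form": ".", "feats": None}

kept = [t["form"] for t in sent_tokens([mwt, dieses, punct])]
print(kept)                                  # ['dieses', '.']
print(has_feat(dieses, "PronType", "Dem"))   # True
print(has_feat(punct, "Case", "Gen"))        # False
```

Returning `False` for tokens without features keeps the later filters free of `None` checks.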


In [None]:
def analyze_rule20(sentences, max_examples=15):
    total_test = 0
    holds = 0
    exceptions = 0
    mixed_or_unclear = 0
    examples = []

    for sent in sentences:
        toks = sent_tokens(sent)
        by_id = {t["id"]: t for t in toks}

        # head -> children
        children = defaultdict(list)
        for t in toks:
            head = t.get("head")
            if isinstance(head, int) and head != 0:
                children[head].append(t)

        for noun in toks:
            if noun["upos"] not in {"NOUN", "PROPN"}:
                continue

            nid = noun["id"]
            kids = children.get(nid, [])

            # --- (A) "Modifier before noun"? Adj / Demonstrative / Numeral
            modifier_before = False

            for k in kids:
                # adjective
                if k["deprel"] == "amod" and k["id"] < nid:
                    modifier_before = True

                # numeral
                if k["deprel"] == "nummod" and k["id"] < nid:
                    modifier_before = True

                # demonstrative determiner: det + PronType=Dem
                if k["deprel"] == "det" and k["id"] < nid and has_feat(k, "PronType", "Dem"):
                    modifier_before = True

            if not modifier_before:
                continue

            # --- (B) "Genitive possessor"? nmod/nmod:poss with Case=Gen
            gen_poss = []
            for k in kids:
                if k["deprel"] in {"nmod:poss", "nmod"} and has_feat(k, "Case", "Gen"):
                    gen_poss.append(k)

            if not gen_poss:
                continue

            total_test += 1

            any_before = any(k["id"] < nid for k in gen_poss)
            any_after  = any(k["id"] > nid for k in gen_poss)

            if any_before and not any_after:
                holds += 1
            elif any_after and not any_before:
                exceptions += 1
                if len(examples) < max_examples:
                    text = " ".join(t["form"] for t in toks)
                    examples.append({
                        "sentence": text,
                        "head_noun": noun["form"],
                        "genitive": [g["form"] for g in gen_poss]
                    })
            else:
                mixed_or_unclear += 1

    support = holds / total_test if total_test else None
    return {
        "total_test": total_test,
        "holds": holds,
        "exceptions": exceptions,
        "mixed_or_unclear": mixed_or_unclear,
        "support_rate": support,
        "examples": examples
    }

res = analyze_rule20(sents)
res


{'total_test': 845,
 'holds': 108,
 'exceptions': 736,
 'mixed_or_unclear': 1,
 'support_rate': 0.12781065088757396,
 'examples': [{'sentence': 'In dem letzten Quartal dieses Jahres hoffen die Amazon-Manager erstmals einen Gewinn verbuchen zu können .',
   'head_noun': 'Quartal',
   'genitive': ['Jahres']},
  {'sentence': 'Die FTC fordert deshalb , dass eventuelle Käufer der Daten diese nur zu dem gleichen Zweck nutzen dürfen wie vormals Toysmart.com , also als Kundendatenbank für einen Online-Spielwarenhandel .',
   'head_noun': 'Käufer',
   'genitive': ['Daten']},
  {'sentence': 'Die private Nutzung des Internet rückt gegenüber der beruflichen zunehmend in den Vordergrund .',
   'head_noun': 'Nutzung',
   'genitive': ['Internet']},
  {'sentence': 'Bei rund einem halben Prozent der abgesetzten Tickets sollen die Käufer fremde Kartendaten angegeben haben .',
   'head_noun': 'Prozent',
   'genitive': ['Tickets']},
  {'sentence': 'Die Kreditkartengesellschaft Visa gibt an , dass Online-G

In [None]:
print("Total test NPs (modifier-before AND genitive-possessor):", res["total_test"])
print("Rule holds (genitive BEFORE noun):", res["holds"])
print("Exceptions (genitive AFTER noun):", res["exceptions"])
print("Mixed/unclear:", res["mixed_or_unclear"])
print("Support rate:", round(res["support_rate"], 3) if res["support_rate"] is not None else None)

print("\nExample exceptions:")
for ex in res["examples"][:5]:
    print("-", ex["sentence"])
    print("  head noun:", ex["head_noun"], "| genitive:", ex["genitive"])


Total test NPs (modifier-before AND genitive-possessor): 845
Rule holds (genitive BEFORE noun): 108
Exceptions (genitive AFTER noun): 736
Mixed/unclear: 1
Support rate: 0.128

Example exceptions:
- In dem letzten Quartal dieses Jahres hoffen die Amazon-Manager erstmals einen Gewinn verbuchen zu können .
  head noun: Quartal | genitive: ['Jahres']
- Die FTC fordert deshalb , dass eventuelle Käufer der Daten diese nur zu dem gleichen Zweck nutzen dürfen wie vormals Toysmart.com , also als Kundendatenbank für einen Online-Spielwarenhandel .
  head noun: Käufer | genitive: ['Daten']
- Die private Nutzung des Internet rückt gegenüber der beruflichen zunehmend in den Vordergrund .
  head noun: Nutzung | genitive: ['Internet']
- Bei rund einem halben Prozent der abgesetzten Tickets sollen die Käufer fremde Kartendaten angegeben haben .
  head noun: Prozent | genitive: ['Tickets']
- Die Kreditkartengesellschaft Visa gibt an , dass Online-Geschäfte nur rund zwei Prozent ihrer Umsätze ausmachen 

In [None]:
from collections import defaultdict

def analyze_rule23(sentences, max_examples=15):
    """
    Greenberg Rule 23:
    If the verb precedes the object (VO), the adjective likewise precedes the noun.

    Operationalization in UD:
    - Find verb-object pairs where verb_id < obj_id  (VO)
    - For nouns in the same sentence that have an adjectival modifier (amod),
      check whether adjective_id < noun_id (Adj before N)
    """
    total_test = 0
    holds = 0
    exceptions = 0
    mixed_or_unclear = 0
    examples = []

    for sent in sentences:
        toks = sent_tokens(sent)

        # head -> children
        children = defaultdict(list)
        for t in toks:
            head = t.get("head")
            if isinstance(head, int) and head != 0:
                children[head].append(t)

        # (A) Check if sentence contains at least one VO pattern (verb before object)
        vo_in_sentence = False
        for t in toks:
            if t.get("upos") != "VERB":
                continue
            vid = t["id"]
            for ch in children.get(vid, []):
                if ch.get("deprel") == "obj" and isinstance(ch["id"], int):
                    oid = ch["id"]
                    if vid < oid:
                        vo_in_sentence = True
                        break
            if vo_in_sentence:
                break

        if not vo_in_sentence:
            continue

        # (B) For nouns with adjectives: check if adjective precedes noun
        for noun in toks:
            if noun["upos"] not in {"NOUN", "PROPN"}:
                continue

            nid = noun["id"]
            kids = children.get(nid, [])

            adjs = [k for k in kids if k.get("deprel") == "amod" and isinstance(k["id"], int)]
            if not adjs:
                continue

            total_test += 1

            any_adj_before = any(a["id"] < nid for a in adjs)
            any_adj_after  = any(a["id"] > nid for a in adjs)

            if any_adj_before and not any_adj_after:
                holds += 1
            elif any_adj_after and not any_adj_before:
                exceptions += 1
                if len(examples) < max_examples:
                    text = " ".join(t["form"] for t in toks)
                    examples.append({
                        "sentence": text,
                        "head_noun": noun["form"],
                        "adjectives": [a["form"] for a in adjs],
                        "condition": "VO in sentence, but Adj AFTER noun"
                    })
            else:
                mixed_or_unclear += 1

    support = holds / total_test if total_test else None
    return {
        "total_test": total_test,
        "holds": holds,
        "exceptions": exceptions,
        "mixed_or_unclear": mixed_or_unclear,
        "support_rate": support,
        "examples": examples
    }



In [None]:
def analyze_rule24(sentences, max_examples=15):
    """
    Greenberg Rule 24:
    If the verb follows the object (OV), the adjective likewise follows the noun.

    Operationalization in UD:
    - Find verb-object pairs where verb_id > obj_id  (OV)
    - For nouns in the same sentence that have an adjectival modifier (amod),
      check whether adjective_id > noun_id (Adj after N)
    """
    total_test = 0
    holds = 0
    exceptions = 0
    mixed_or_unclear = 0
    examples = []

    for sent in sentences:
        toks = sent_tokens(sent)

        # head -> children
        children = defaultdict(list)
        for t in toks:
            head = t.get("head")
            if isinstance(head, int) and head != 0:
                children[head].append(t)

        # (A) Check if sentence contains at least one OV pattern (verb after object)
        ov_in_sentence = False
        for t in toks:
            if t.get("upos") != "VERB":
                continue
            vid = t["id"]
            for ch in children.get(vid, []):
                if ch.get("deprel") == "obj" and isinstance(ch["id"], int):
                    oid = ch["id"]
                    if vid > oid:
                        ov_in_sentence = True
                        break
            if ov_in_sentence:
                break

        if not ov_in_sentence:
            continue

        # (B) For nouns with adjectives: check if adjective follows noun
        for noun in toks:
            if noun["upos"] not in {"NOUN", "PROPN"}:
                continue

            nid = noun["id"]
            kids = children.get(nid, [])

            adjs = [k for k in kids if k.get("deprel") == "amod" and isinstance(k["id"], int)]
            if not adjs:
                continue

            total_test += 1

            any_adj_after  = any(a["id"] > nid for a in adjs)
            any_adj_before = any(a["id"] < nid for a in adjs)

            if any_adj_after and not any_adj_before:
                holds += 1
            elif any_adj_before and not any_adj_after:
                exceptions += 1
                if len(examples) < max_examples:
                    text = " ".join(t["form"] for t in toks)
                    examples.append({
                        "sentence": text,
                        "head_noun": noun["form"],
                        "adjectives": [a["form"] for a in adjs],
                        "condition": "OV in sentence, but Adj BEFORE noun"
                    })
            else:
                mixed_or_unclear += 1

    support = holds / total_test if total_test else None
    return {
        "total_test": total_test,
        "holds": holds,
        "exceptions": exceptions,
        "mixed_or_unclear": mixed_or_unclear,
        "support_rate": support,
        "examples": examples
    }

In [None]:
res23 = analyze_rule23(sents)
res24 = analyze_rule24(sents)

res23, res24


({'total_test': 4887,
  'holds': 4887,
  'exceptions': 0,
  'mixed_or_unclear': 0,
  'support_rate': 1.0,
  'examples': []},
 {'total_test': 7571,
  'holds': 0,
  'exceptions': 7571,
  'mixed_or_unclear': 0,
  'support_rate': 0.0,
  'examples': [{'sentence': 'In dem Geschäftsbericht für das vierte Quartal 2000 hatte Amazon einen Verlust aus dem operativen Geschäft von 60 Millionen US-Dollar ausgewiesen .',
    'head_noun': 'Quartal',
    'adjectives': ['vierte'],
    'condition': 'OV in sentence, but Adj BEFORE noun'},
   {'sentence': 'In dem Geschäftsbericht für das vierte Quartal 2000 hatte Amazon einen Verlust aus dem operativen Geschäft von 60 Millionen US-Dollar ausgewiesen .',
    'head_noun': 'Geschäft',
    'adjectives': ['operativen'],
    'condition': 'OV in sentence, but Adj BEFORE noun'},
   {'sentence': 'In dem vierten Quartal 1999 hatte der Verlust noch 175 Millionen US-Dollar betragen .',
    'head_noun': 'Quartal',
    'adjectives': ['vierten'],
    'condition': 'OV in 

In [None]:
def print_res(label, res):
    print(label)
    print("Total test cases:", res["total_test"])
    print("Rule holds:", res["holds"])
    print("Exceptions:", res["exceptions"])
    print("Mixed/unclear:", res["mixed_or_unclear"])
    print("Support rate:", round(res["support_rate"], 3) if res["support_rate"] is not None else None)
    print("\nExample exceptions:")
    for ex in res["examples"][:5]:
        print("-", ex["sentence"])
        print("  head noun:", ex["head_noun"], "| adjectives:", ex["adjectives"])
    print("\n" + "-"*60 + "\n")

print_res("Rule 23 (VO -> Adj before N)", res23)
print_res("Rule 24 (OV -> Adj after N)", res24)


Rule 23 (VO -> Adj before N)
Total test cases: 4887
Rule holds: 4887
Exceptions: 0
Mixed/unclear: 0
Support rate: 1.0

Example exceptions:

------------------------------------------------------------

Rule 24 (OV -> Adj after N)
Total test cases: 7571
Rule holds: 0
Exceptions: 7571
Mixed/unclear: 0
Support rate: 0.0

Example exceptions:
- In dem Geschäftsbericht für das vierte Quartal 2000 hatte Amazon einen Verlust aus dem operativen Geschäft von 60 Millionen US-Dollar ausgewiesen .
  head noun: Quartal | adjectives: ['vierte']
- In dem Geschäftsbericht für das vierte Quartal 2000 hatte Amazon einen Verlust aus dem operativen Geschäft von 60 Millionen US-Dollar ausgewiesen .
  head noun: Geschäft | adjectives: ['operativen']
- In dem vierten Quartal 1999 hatte der Verlust noch 175 Millionen US-Dollar betragen .
  head noun: Quartal | adjectives: ['vierten']
- In dem letzten Quartal dieses Jahres hoffen die Amazon-Manager erstmals einen Gewinn verbuchen zu können .
  head noun: Quarta

Summary:
Testing Greenberg’s Rules 20, 23 and 24 with German Universal Dependencies Data

In this part of the project, we investigate several word order universals proposed by Greenberg using corpus data instead of typological summaries from WALS. While WALS describes general tendencies across languages, Universal Dependencies (UD) allows us to test these tendencies directly in real sentence data. We use the development split of the German HDT treebank in CoNLL-U format (18,434 sentences), which provides detailed syntactic and morphological annotations for each sentence.

**Greenberg’s Rule 20**
states that *when modifiers precede the noun, the genitive possessor almost always precedes the noun as well.* To test this rule, we extracted all nouns and proper nouns that have at least one modifier before them (an adjective tagged amod, a numeral tagged nummod, or a demonstrative determiner with PronType=Dem) and that also have a genitive dependent (nmod or nmod:poss with Case=Gen).
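
The selection criteria can be illustrated on a single noun phrase. This is a minimal sketch, not the notebook's full function: `genitive_position` is a hypothetical helper, and the annotation of the example phrase is written by hand for illustration rather than copied from the treebank.

```python
from collections import defaultdict

def genitive_position(tokens, noun_id):
    """Return 'before', 'after', 'mixed', or None if the noun is not a test case."""
    children = defaultdict(list)
    for t in tokens:
        if t["head"] != 0:
            children[t["head"]].append(t)
    kids = children[noun_id]

    # Condition A: an adjective, numeral, or demonstrative before the noun.
    has_prenominal_modifier = any(
        k["id"] < noun_id and (
            k["deprel"] in {"amod", "nummod"}
            or (k["deprel"] == "det" and k["feats"].get("PronType") == "Dem")
        )
        for k in kids
    )
    # Condition B: a nominal dependent in the genitive case.
    genitives = [
        k for k in kids
        if k["deprel"] in {"nmod", "nmod:poss"} and k["feats"].get("Case") == "Gen"
    ]
    if not has_prenominal_modifier or not genitives:
        return None

    before = any(g["id"] < noun_id for g in genitives)
    after = any(g["id"] > noun_id for g in genitives)
    if before and not after:
        return "before"
    if after and not before:
        return "after"
    return "mixed"

# Hand-annotated phrase "die private Nutzung des Internet" (illustrative only).
np_tokens = [
    {"id": 1, "form": "die", "head": 3, "deprel": "det", "feats": {"PronType": "Art"}},
    {"id": 2, "form": "private", "head": 3, "deprel": "amod", "feats": {}},
    {"id": 3, "form": "Nutzung", "head": 0, "deprel": "root", "feats": {"Case": "Nom"}},
    {"id": 4, "form": "des", "head": 5, "deprel": "det", "feats": {"PronType": "Art"}},
    {"id": 5, "form": "Internet", "head": 3, "deprel": "nmod", "feats": {"Case": "Gen"}},
]

print(genitive_position(np_tokens, 3))  # after
print(genitive_position(np_tokens, 5))  # None
```

The head noun "Nutzung" qualifies because of the prenominal adjective, and its genitive dependent follows it, so the phrase counts as an exception. "Internet" has no qualifying modifier and is skipped.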

Using this procedure, we identified 845 relevant noun phrases. Only 108 of these cases (about **13 percent**) show the genitive preceding the noun, while 736 cases (about **87 percent**) show the genitive following the noun; one further case has genitives on both sides of the noun and was counted as mixed. **This means that German violates Rule 20 in the large majority of cases in actual corpus data.** Typical examples such as “die private Nutzung des Internet” or “dem letzten Quartal dieses Jahres” illustrate that the postnominal genitive is the dominant pattern in German. Because the genitive criterion is purely morphosyntactic (nmod with Case=Gen), the counts also include non-possessive genitives such as “Prozent der abgesetzten Tickets”. **Therefore, Rule 20 does not describe German usage reliably, but rather represents a broad typological tendency.**
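
The percentages follow directly from the reported counts. A short check, using only the numbers printed by the Rule 20 cell:

```python
# Counts taken from the Rule 20 output in this notebook.
total, holds, exceptions, mixed = 845, 108, 736, 1

# The three outcome categories partition the test cases.
assert holds + exceptions + mixed == total

support_rate = holds / total
exception_rate = exceptions / total

print(f"support:    {support_rate:.1%}")    # support:    12.8%
print(f"exceptions: {exception_rate:.1%}")  # exceptions: 87.1%
```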

**Greenberg’s Rule 23**
states that *if the verb precedes the object (VO order), the adjective should also precede the noun.* In the German UD data, we tested this by identifying sentences that contain at least one verb preceding its object (obj) and then checking the position of adjectival modifiers (amod) for every noun in those sentences. The VO condition is therefore evaluated per sentence, not per noun phrase.
We found 4,887 relevant test cases. **In all of these cases, the adjective preceded the noun. No exceptions were found.** This results in a support rate of 1.0. **Therefore, Rule 23 is fully supported by the German corpus data and appears to describe German word order very reliably.**


**Greenberg’s Rule 24**
states that *if the verb follows the object (OV order), the adjective should also follow the noun.* We applied the same method to sentences that contain at least one verb following its object.
In this case, we identified 7,571 relevant test cases. **None of them followed the predicted pattern.** In all cases, the adjective still preceded the noun. The support rate for Rule 24 is therefore 0.0. **This shows that German strongly contradicts Rule 24.** Taken together with the Rule 23 result, this indicates that adjective position in the German data does not vary with verb-object order.
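
Because the VO and OV conditions are checked per sentence, one sentence can satisfy both, for example when a verb-second main clause is combined with a verb-final subordinate clause. Its nouns then enter the test sets of both rules. The sketch below illustrates this; `verb_object_orders` is a hypothetical helper and the example sentence is hand-annotated, not taken from the treebank.

```python
def verb_object_orders(tokens):
    """Return the set of verb-object orders ('VO', 'OV') found in one sentence."""
    by_id = {t["id"]: t for t in tokens}
    orders = set()
    for t in tokens:
        if t["deprel"] != "obj":
            continue
        head = by_id.get(t["head"])
        if head is None or head["upos"] != "VERB":
            continue
        # Compare linear positions of the verb and its object.
        orders.add("VO" if head["id"] < t["id"] else "OV")
    return orders

# "Er kauft ein Buch , weil sie Bücher liebt ." (illustrative annotation)
sentence = [
    {"id": 1, "form": "Er", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "kauft", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "ein", "upos": "DET", "head": 4, "deprel": "det"},
    {"id": 4, "form": "Buch", "upos": "NOUN", "head": 2, "deprel": "obj"},
    {"id": 5, "form": ",", "upos": "PUNCT", "head": 9, "deprel": "punct"},
    {"id": 6, "form": "weil", "upos": "SCONJ", "head": 9, "deprel": "mark"},
    {"id": 7, "form": "sie", "upos": "PRON", "head": 9, "deprel": "nsubj"},
    {"id": 8, "form": "Bücher", "upos": "NOUN", "head": 9, "deprel": "obj"},
    {"id": 9, "form": "liebt", "upos": "VERB", "head": 2, "deprel": "advcl"},
    {"id": 10, "form": ".", "upos": "PUNCT", "head": 2, "deprel": "punct"},
]

orders = verb_object_orders(sentence)
print(sorted(orders))  # ['OV', 'VO']
```

A stricter variant could restrict the adjective check to the object noun of each verb, which would tie the two word-order properties to the same clause.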

**Overall interpretation**

The results show that Greenberg’s universals behave very differently when tested on real corpus data. Rule 23 is fully confirmed for German, while Rule 20 is violated in the large majority of test cases and Rule 24 in all of them. This suggests that these universals should not be interpreted as strict rules for individual languages, but rather as general typological tendencies derived from cross-linguistic comparisons. The contrast between WALS-based generalizations and corpus-based evidence highlights the importance of testing linguistic universals against real language usage.
