We extract data from the "Deutsches Referenzkorpus" (DeReKo) by querying it manually via COSMAS II. 
The resulting files are saved as `.txt` files in this folder.

Queries:
- Internal I: `:Ab:*?Innen`: 241k tokens, 18k types (`:Ab:*?In` and `:Ab:#REG(^[A-ZÄÖÜ][a-zäöüß]+In(nen)?$)` throw errors)
- Slash: `#REG(^[A-ZÄÖÜ][a-zäöüß]+\/in(nen)?$)`: 136k tokens, 9k types
- Star: `#REG(^[A-ZÄÖÜ][a-zäöüß]+\*in(nen)?$)`: 48k tokens, 5k types
- Colon: `#REG(^[A-ZÄÖÜ][a-zäöüß]+:in(nen)?$)`: 10k tokens, 3k types
- Underscore: `#REG(^[A-ZÄÖÜ][a-zäöüß]+_in(nen)?$)`: 3k tokens, 1k types
- Interpunct: `#REG(^[A-ZÄÖÜ][a-zäöüß]+·in(nen)?$)`: 4(!) matches
- Brackets: `*?\(In\)`, `*?\(Innen\)`, `#REG(\(in(nen)\))` and similar queries throw errors
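
To check the regular-expression queries above offline, they can be mirrored in Python. This is a minimal sketch, assuming that COSMAS II's `#REG(...)` behaves like Python's `re` for these patterns. `STEM`, `STYLE_PATTERNS` and `marking_style` are hypothetical names that are not part of this notebook, and the Internal I pattern is the regex variant that threw an error in COSMAS II, not the wildcard query we actually ran.

```python
import re
from typing import Optional

# Python counterparts of the queries above (assumption: COSMAS II `#REG(...)`
# behaves like Python's `re` for these patterns).
STEM = r"[A-ZÄÖÜ][a-zäöüß]+"
STYLE_PATTERNS = {
    "internal-i": re.compile(STEM + r"In(nen)?"),
    "slash": re.compile(STEM + r"/in(nen)?"),
    "star": re.compile(STEM + r"\*in(nen)?"),
    "colon": re.compile(STEM + r":in(nen)?"),
    "underscore": re.compile(STEM + r"_in(nen)?"),
    "interpunct": re.compile(STEM + r"·in(nen)?"),
}


def marking_style(token: str) -> Optional[str]:
    """Return the name of the first style whose pattern matches the whole token."""
    for name, pattern in STYLE_PATTERNS.items():
        if pattern.fullmatch(token):
            return name
    return None


print(marking_style("Bundeskanzler*innen"))  # star
print(marking_style("BundeskanzlerIn"))  # internal-i
print(marking_style("Bundeskanzler"))  # None
```

The dictionary keys follow the file names used further down, so the same names could label the exported `.txt` files.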

To our knowledge, DeReKo offers no machine-readable download (KorAP is meant to provide one, but is still a work in progress), so we post-process the exported files ourselves:

In [32]:
from typing import Dict, List
import re
import sys

sys.path.insert(0, "..")
from helpers import add_to_dict, log
from helpers_csv import csvs_to_dict, dict_to_csvs

We only want to keep entries that are properly gendered words, so we write a regex to find them:

In [33]:
match_properly_gendered_word = r"[A-ZÄÖÜ][a-zäöüß]{3,}(([/*:_·(]in(nen)?\)?)|In(nen)?)"


def is_properly_gendered_word(word: str) -> bool:
    # The whole word must match the pattern, not just a prefix of it.
    return re.fullmatch(match_properly_gendered_word, word) is not None


assert is_properly_gendered_word("Bundeskanzler:innen")
assert is_properly_gendered_word("BundeskanzlerIn")
assert not is_properly_gendered_word("Bundeskanzler*Innen")
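
Since the pattern is dense, it may help to spell it out piece by piece. The following sketch rewrites `match_properly_gendered_word` with `re.VERBOSE`; the name `PROPERLY_GENDERED` and the extra example words are ours for illustration and are not used elsewhere in the notebook.

```python
import re

# Same pattern as `match_properly_gendered_word`, annotated via re.VERBOSE.
PROPERLY_GENDERED = re.compile(
    r"""
    [A-ZÄÖÜ]            # one capital letter
    [a-zäöüß]{3,}       # at least three lowercase letters
    (
        ([/*:_·(]       # a gender symbol or an opening bracket ...
         in(nen)?       # ... followed by lowercase in or innen
         \)?)           # ... and an optional closing bracket
      |
        In(nen)?        # or an internal I: In or Innen
    )
    """,
    re.VERBOSE,
)

for word in ["Bundeskanzler:innen", "Lehrer(innen)", "Bundeskanzler*Innen"]:
    print(word, PROPERLY_GENDERED.fullmatch(word) is not None)
```

Note that the mixed form `Bundeskanzler*Innen` is rejected: after a symbol only lowercase `in`/`innen` is accepted, and the internal I must follow the stem directly.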

Then we define a function tailored to the structure of the DeReKo output files:

In [34]:
def dereko_to_csv(filename: str) -> List[str]:
    # Read the exported file and skip its first 20 lines.
    with open(filename + ".txt") as f:
        lines = f.read().split("\n")[20:]
    # Keep the gendered word at the start of each line, if there is one.
    matches = (re.match(match_properly_gendered_word, line) for line in lines)
    words = [m[0] for m in matches if m]
    with open(filename + ".csv", "w") as f:
        f.write("\n".join(words))
    return words


assert "Bundeskanzler*in" in dereko_to_csv("star")

In [35]:
dereko_to_csv("internal-i")[:5]

['AachenerInnen',
 'AbbiegerInnen',
 'AbbrecherInnen',
 'AbeitsplatzbesitzerInnen',
 'AbendländerInnen']

In [36]:
dereko_to_csv("colon")[:5]

['Abenteurer:innen',
 'Abiturient:innen',
 'Ablehner:innen',
 'Abnehmer:innen',
 'Abonennt:innen']

We want to distinguish singular and plural, which luckily is easy for gendered words:

In [37]:
def is_gendered_plural(word: str) -> bool:
    return re.search(r"[Ii]nnen\)?$", word) is not None


assert not is_gendered_plural("Bundeskanzler*in")
assert not is_gendered_plural("Bundesminister/in")
assert is_gendered_plural("Bundesminister*innen")

And we want to ungender them. This also seems simple at first:

In [38]:
def ungender(word: str) -> str:
    # Strip an optional gender symbol plus the trailing "in"/"innen" (or "In"/"Innen").
    return re.sub(r"[/*:_·()]?[Ii]n(nen)?$", "", word)


assert ungender("Bundeskanzler*in") == "Bundeskanzler"
assert ungender("Bundesminister*innen") == "Bundesminister"

In [39]:
def gender_sg(word: str) -> str:
    # Turn a star-gendered plural into its singular; other forms are left untouched.
    return re.sub(r"\*innen$", "*in", word)

In [40]:
def regender(word: str, symbol: str) -> str:
    # replace gender symbol with other gender symbol
    return re.sub(r"[/*:_·()]?-?[Ii]n(nen)?$", r"{}in\1".format(symbol), word)


assert regender("Bundeskanzler*in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler:in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler_in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler/in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler/-in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler·in", "*") == "Bundeskanzler*in"
assert regender("BundeskanzlerIn", "*") == "Bundeskanzler*in"
assert regender("BundesministerIn", "*") == "Bundesminister*in"
assert regender("BundesministerInnen", "*") == "Bundesminister*innen"

But there are also cases like these, where our method fails:

In [41]:
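# Stripping the suffix does not yield the masculine plural here:
assert ungender("Abiturient*innen") != "Abiturienten"  # we get "Abiturient"
assert ungender("Kolleg*innen") != "Kollegen"  # we get "Kolleg", which is not a word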

One way to handle this is to check whether the ungendered form is a valid word, and whether the ungendered form + "en" is a valid word. 
We sketch this idea, but in the end we do not implement it and instead decide not to use the DeReKo plural forms at all.

In [42]:
def is_word(a: str) -> bool:
    # TODO: look the word up in a word list; for now, every string counts as a word.
    return True
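
For illustration, the validity check described above could look roughly like the following sketch. It assumes a word list is available as a Python set; `TOY_LEXICON` is a hypothetical mini lexicon made up for this example, and `masculine_candidates` is not part of the notebook.

```python
from typing import List, Set

# Hypothetical mini lexicon; a real check would need a full German word list.
TOY_LEXICON: Set[str] = {
    "Abiturient",
    "Abiturienten",
    "Kollege",
    "Kollegen",
    "Bundesminister",
}


def masculine_candidates(stem: str, lexicon: Set[str]) -> List[str]:
    """Return those of `stem` and `stem + "en"` that occur in the lexicon."""
    return [form for form in (stem, stem + "en") if form in lexicon]


print(masculine_candidates("Abiturient", TOY_LEXICON))  # ['Abiturient', 'Abiturienten']
print(masculine_candidates("Kolleg", TOY_LEXICON))  # ['Kollegen']
print(masculine_candidates("Bundesminister", TOY_LEXICON))  # ['Bundesminister']
```

With such a lookup, the non-word `Kolleg` would be dropped while `Kollegen` is kept. The lookup alone cannot tell whether `Abiturient` or `Abiturienten` is the right base for a plural form, which is one reason the plural forms remain hard to use.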

In [43]:
count_dict = {}


def add_to_count_dict(key, number, val):
    # Count how often each (base form, number, gendered form) triple occurs.
    count_dict[(key, number, val)] = count_dict.get((key, number, val), 0) + 1

In [44]:
dereko_lists = [
    dereko_to_csv(a)
    for a in ["colon", "internal-i", "interpunct", "slash", "star", "underscore"]
]
sg_count = 0
pl_count = 0
for word_list in dereko_lists:
    for word in word_list:
        if is_properly_gendered_word(word):
            key = ungender(word)
            suggestion = regender(word, "*")
            if is_gendered_plural(suggestion):
                pl_count += 1
                if is_word(key):
                    add_to_count_dict(key, "pl", suggestion)
                if is_word(key + "en"):
                    add_to_count_dict(key + "en", "pl", suggestion)
                add_to_count_dict(key, "sg", gender_sg(suggestion))
            else:
                sg_count += 1
                add_to_count_dict(key, "sg", suggestion)
                add_to_count_dict(key, "pl", suggestion + "nen")

print("total gendered words in sg", sg_count)
print("total gendered words in pl", pl_count)

total gendered words in sg 4607
total gendered words in pl 20025


In [45]:
dic: Dict[str, Dict[str, List[str]]] = {"sg": {}, "pl": {}}
for (key, number, val), count in count_dict.items():
    if count >= 3:
        add_to_dict(key, [val], dic[number])
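
The counting and the `count >= 3` threshold can be tried out on toy data. The sketch below uses `collections.Counter` instead of our `count_dict`, and assumes that `add_to_dict` (from `helpers`, not shown here) appends values to a list stored under the key. `group_frequent` and the toy observations, including the deliberately misspelled `Absendr`, are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def group_frequent(
    observations: List[Tuple[str, str, str]], min_count: int = 3
) -> Dict[str, Dict[str, List[str]]]:
    """Group gendered forms by number and base form, dropping rare triples."""
    counts = Counter(observations)
    grouped: Dict[str, Dict[str, List[str]]] = {"sg": {}, "pl": {}}
    for (key, number, suggestion), count in counts.items():
        if count >= min_count:
            grouped[number].setdefault(key, []).append(suggestion)
    return grouped


# Made-up observations: one form seen three times, one misspelling seen once.
toy = [("Absender", "sg", "Absender*in")] * 3 + [("Absendr", "sg", "Absendr*in")]
print(group_frequent(toy))  # {'sg': {'Absender': ['Absender*in']}, 'pl': {}}
```

A threshold like this drops forms that occur only once or twice, which presumably includes many of the misspellings found in the corpus.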

In [46]:
dict_to_csvs(dic, "dereko_unified")

We check that reading the data back into Python works:

In [47]:
dic = csvs_to_dict("dereko_unified")
list(dic["sg"].items())[:5]

[('Abenteurer', ['Abenteurer*in']),
 ('Abiturient', ['Abiturient*in']),
 ('Abnehmer', ['Abnehmer*in']),
 ('Abonnent', ['Abonnent*in']),
 ('Absender', ['Absender*in'])]