# Creating the rule set

This notebook creates the rule set for diversity-sensitive suggestions for the LanguageTool server. Several sources are pulled from the internet and processed to fit the LanguageTool rule format as closely as possible.

The notebook uses [Poetry](https://python-poetry.org/) for reproducibility. To run it in an environment with the appropriate dependencies installed, run `poetry install` and then `poetry run jupyter notebook` to start the notebook server.

In [1]:
import cache_magic
import copy_files
import datetime
import io
from os import path
import pandas as pd
import random
import re
import requests
import subprocess
import spacy
from typing import List, Optional
import wayback
import yaml

%cache magic is now registered in ipython


In [2]:
data_dir = "wordlists" # where the downloaded and processed data will be saved
fetch_data = False # whether to re-download the files

In [3]:
def log(a, x):
    # Helper function for functionally logging things.
    print(a)
    print(x)
    return x

In [4]:
nlp = spacy.load("de_core_news_sm")


In [5]:
def number(s: str) -> List[str]:
    return nlp(s)[0].morph.get("Number")

assert number("Bundeskanzlerin") == ["Sing"]
assert number("Bundeskanzlerinnen") == ["Plur"]

In [6]:
def is_word(word: str) -> bool:
    return True
#     return len(gn.sysnsets(word)) > 0

# assert is_word("Baum") == True
# assert is_word("Bäum") == False

In [7]:
data = {
    "sg": {},
    "pl": {},
}

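def strip_spaces(a):
    # Replace runs of spaces and quotation marks with a single space ...
    a = re.sub("  +|\"|'", " ", a)
    # ... then remove leading/trailing spaces and punctuation.
    a = re.sub("^ | $|[.,:;!?]", "", a)
    return a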

def add_to_data(pattern, numerus, suggestions):
    pattern = strip_spaces(pattern)
    if numerus == "sg":
        add_to_dict(pattern, suggestions, data["sg"])
    elif numerus == "pl":
        add_to_dict(pattern, suggestions, data["pl"])
    elif numerus == "unknown":
        if "Sing" in number(pattern):
            add_to_data(pattern, "sg", suggestions)
        if "Plur" in number(pattern):
            add_to_data(pattern, "pl", suggestions)

def add_to_dict(key, vals: List[str], dic):
    if key in dic.keys():
        for val in vals:
            if not val in dic[key]:
                dic[key].append(val)
    else:
        dic[key] = vals

## The data set by _geschickt gendern_

In [8]:
wb = wayback.WaybackClient()
%cache mem = wb.get_memento("https://geschicktgendern.de/download/1642/", datetime=datetime.datetime(2021, 9, 11, tzinfo=datetime.timezone.utc), exact=False)


loading cached value for variable 'mem'. Time since pickling  0:00:11.597049


In [9]:
df = pd.read_excel(mem.content, header=None, names=["ungendered", "gendered"], skiprows=3, usecols=[1,2])
df.sort_values(by="ungendered")
df

Unnamed: 0,ungendered,gendered
0,"<div id=""A""><b>A</b><div>",
1,Abbrecherquote,Abbruchquote
2,Abenteurer (sg.),Waghals; abenteuerliebende Person; abenteuerlu...
3,Abgänger,absolvierende Person; Abschluss innehabende Pe...
4,Abiturient,"Abitur ablegende Person; Person, die Abitur macht"
...,...,...
1814,Zuschauer (pl.),Publikum; Zuschauende
1815,Zuschauerquote,Einschaltquote
1816,Zuschauerzahl,Publikumszahl
1817,Zuständiger,zuständige Person


We drop rows like the first one, where there is merely some HTML description but no value.

In [10]:
# df = df[df["gendered"].notna()]
# df.to_csv(path.join(data_dir, "geschicktgendern.csv"), index=False)

In [11]:
df = pd.read_csv(path.join(data_dir, "geschicktgendern.csv"))

We convert the singular / plural annotations to part-of-speech tags for LanguageTool:

In [12]:
df.loc[13]

ungendered                     Abteilungsleiter (pl.)
gendered      Abteilungsleitungen; Abteilungsleitende
Name: 13, dtype: object

Some suggestions are annotated with HTML, for example with a note that there is no good suggestion yet. This is too complicated for us to handle, so we drop such suggestions.

The `gendered` column often contains multiple variants that are separated by a semicolon. We want to capture this.

Some rows contain values that include formatting. We drop these values, but not the whole row.
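
The splitting and filtering described above can be sketched on a single cell value. This is a minimal illustration, not the notebook's exact code: `plain_suggestions` and the sample cell are hypothetical, and the sketch splits on `;` and strips whitespace, whereas the notebook splits on `"; "`.

```python
from typing import List

# Characters that signal extra annotation (HTML, brackets, quotes),
# mirroring the list used by `complicated` in this notebook.
ANNOTATION_MARKERS = ["<", "(", "[", '"', "'"]

def plain_suggestions(cell: str) -> List[str]:
    """Split a `gendered` cell on ';' and keep only unannotated variants."""
    variants = [v.strip() for v in cell.split(";")]
    return [v for v in variants
            if v and not any(m in v for m in ANNOTATION_MARKERS)]

# Hypothetical cell: the second variant carries HTML formatting and is dropped,
# but the other two variants of the same row are kept.
example = "Publikum; <i>Zuschauende</i>; zuschauende Personen"
print(plain_suggestions(example))
```

A row whose variants are all annotated yields an empty list and therefore contributes no rule.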

In [13]:
records = df.to_records()
def numerus(key: str) -> str:
    # Extract the "(sg.)" / "(pl.)" annotation from a key, if there is one.
    match = re.search(r"\(.*(sg|pl)\.\)", key)
    return match[1] if match else "unknown"
def complicated(a):
    # the rule includes extra annotation in brackets or HTML and is thus too complicated for us to use
    return any([b in a for b in ["<", "(", "[", "\"", "'"]])
i = 0
for (_, key, val) in records:
    unannotated_suggestions = [x for x in val.split("; ") if not (complicated(x) or x == "")]
    if not (complicated(key) or "..." in key) and len(unannotated_suggestions) > 0:
        pattern = re.sub(r" ?\[.*\]", "", re.sub(r" ?\(.*\)", "", key))
        add_to_data(pattern, numerus(key), unannotated_suggestions)
        i += 1
print("This reduces the number of used rules from {} to {}.".format(len(records), i))

This reduces the number of used rules from 1792 to 943.


## The Microsoft / Vienna catalog

In [14]:
%cache mem = wb.get_memento("https://www.data.gv.at/katalog/dataset/15d6ede8-f128-4fcd-aa3a-4479e828f477/resource/804f6db1-add7-4480-b4d0-e52e61c48534/download/worttabelle.csv", datetime=datetime.datetime(2021, 9, 13, tzinfo=datetime.timezone.utc), exact=False)
text = re.sub(";;\r\n", "\n", mem.content.decode("utf-8"))
df = pd.read_csv(io.StringIO(text))
df = df[df["Hauptwort"].notna()]
df.to_csv(path.join(data_dir, "vienna_catalog.csv"), index=False)
df

creating new value for variable 'mem'


Unnamed: 0,Laenge,Hauptwort,Vorschlag,Binnen
0,50,Verantwortlicher für Informationssicherheit (C...,CISO,N
1,50,Verantwortlicher für Informationssicherheit (C...,Verantwortliche bzw. Verantwortlicher für Info...,N
2,45,Diplomierte Gesundheits- und Krankenschwester,Diplomiertes Krankenpflegepersonal,N
3,43,Unabhängiger Bedienstetenschutzbeauftragter,Unabhängige Bedienstetenschutzbeauftragte bzw....,N
4,39,Kontrakt- und Berichtswesenbeauftragter,Kontrakt- und Berichtswesenbeauftragte bzw. -b...,N
...,...,...,...,...
2266,4,Koch,Köchin bzw. Koch,N
2267,4,Star,Berühmtheit,N
2268,4,User,Userin bzw. User,N
2269,4,User,UserInnen,Y


In [15]:
for (_, _, pattern, suggestion, binnenI) in df.to_records():
    if binnenI == "Y":
        suggestion = re.sub(r"([a-zäöüß])I", r"\1*i", suggestion)
    if not (complicated(pattern) or complicated(suggestion)):
        if re.findall("[iI]n($| )", suggestion) != []:
            add_to_data(pattern, "sg", [suggestion])
        elif re.findall("[iI]nnen$", suggestion) != []:
            add_to_data(pattern, "pl", [suggestion])
        else:
            add_to_data(pattern, "unknown", [suggestion])

## The _DeReKo_ data set

We extract some data from the "Deutsches Referenzkorpus" (DeReKo).

Queries:
- Internal I: `:Ab:*?Innen`: 241k tokens, 18k types (`:Ab:*?In` and `:Ab:#REG(^[A-ZÄÖÜ][a-zäöüß]+In(nen)?$)` throw errors)
- Slash: `#REG(^[A-ZÄÖÜ][a-zäöüß]+\/in(nen)?$)`: 136k tokens, 9k types
- Star: `#REG(^[A-ZÄÖÜ][a-zäöüß]+\*in(nen)?$)`: 48k tokens, 5k types
- Colon: `#REG(^[A-ZÄÖÜ][a-zäöüß]+:in(nen)?$)`: 10k tokens, 3k types
- Underscore: `#REG(^[A-ZÄÖÜ][a-zäöüß]+_in(nen)?$)`: 3k tokens, 1k types
- Interpunct: `#REG(^[A-ZÄÖÜ][a-zäöüß]+·in(nen)?$)`: 4(!) matches
- Brackets: `*?\(In\)`, `*?\(Innen\)`, `#REG(\(in(nen)\))` and similar queries throw errors
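
The regular-expression queries above can be tried locally before running them on the corpus. The following sketch translates them to Python's `re` syntax; the dictionary labels and the `classify` helper are hypothetical names chosen for illustration, and the regex dialect used by DeReKo may differ from Python's in details.

```python
import re
from typing import Optional

# Python counterparts of the #REG queries listed above, one per gender symbol.
QUERIES = {
    "slash": r"^[A-ZÄÖÜ][a-zäöüß]+/in(nen)?$",
    "star": r"^[A-ZÄÖÜ][a-zäöüß]+\*in(nen)?$",
    "colon": r"^[A-ZÄÖÜ][a-zäöüß]+:in(nen)?$",
    "underscore": r"^[A-ZÄÖÜ][a-zäöüß]+_in(nen)?$",
    "interpunct": r"^[A-ZÄÖÜ][a-zäöüß]+·in(nen)?$",
}

def classify(token: str) -> Optional[str]:
    """Return the label of the first query that matches, or None."""
    for label, pattern in QUERIES.items():
        if re.search(pattern, token):
            return label
    return None

for token in ["Bundeskanzler*in", "Lehrer/innen", "Student:innen", "Lehrerin"]:
    print(token, "->", classify(token))
```

The plain feminine form `Lehrerin` matches none of the queries, because every pattern requires a gender symbol before the suffix.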

There is no machine-readable download on DeReKo to our knowledge, so we process the files a bit:

In [16]:
match_properly_gendered_word = r"[A-ZÄÖÜ][a-zäöüß]{3,}(([/*:_·(]in(nen)?\)?)|In(nen)?)"

def is_properly_gendered_word(word: str) -> bool:
    # The whole word has to match, not just a prefix.
    return re.fullmatch(match_properly_gendered_word, word) is not None

assert is_properly_gendered_word("Bundeskanzler:innen") == True
assert is_properly_gendered_word("BundeskanzlerIn") == True
assert is_properly_gendered_word("Bundeskanzler*Innen") == False

In [17]:
def dereko_to_csv(filename: str) -> List[str]:
    # Extract the gendered words from a DeReKo export and save them, one per line.
    with open(path.join(data_dir, "dereko", filename + ".txt")) as f:
        lines = f.read().split("\n")[20:]  # skip the first 20 lines of the export
    matches = [re.match(match_properly_gendered_word, line) for line in lines]
    words = [m[0] for m in matches if m]
    with open(path.join(data_dir, "dereko", filename + ".csv"), "w") as f:
        f.write("\n".join(words))
    return words

assert 'Bundeskanzler*in' in dereko_to_csv("star")

In [18]:
dereko_to_csv("internal-i")[:5]

['AachenerInnen',
 'AbbiegerInnen',
 'AbbrecherInnen',
 'AbeitsplatzbesitzerInnen',
 'AbendländerInnen']

In [19]:
dereko_to_csv("colon")[:5]

['Abenteurer:innen',
 'Abiturient:innen',
 'Ablehner:innen',
 'Abnehmer:innen',
 'Abonennt:innen']

In [20]:
def is_gendered_plural(word: str) -> bool:
    return re.findall(r"[Ii]nnen\)?$", word) != []

assert is_gendered_plural("Bundeskanzler*in") == False
assert is_gendered_plural("Bundesminister/in") == False
assert is_gendered_plural("Bundesminister*innen") == True

In [21]:
def ungender(word: str) -> str:
    return re.sub(r"[/*:_·()]?[Ii]n(n(en))?$", "", word)

assert ungender("Bundeskanzler*in") == "Bundeskanzler"
assert ungender("Bundesminister*innen") == "Bundesminister"

In [22]:
def regender(word: str, symbol: str) -> str:
    # replace gender symbol with other gender symbol
    return re.sub(r"[/*:_·()]?-?[Ii]n(nen)?$", r"{}in\1".format(symbol), word)

assert regender("Bundeskanzler*in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler:in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler_in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler/in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler/-in", "*") == "Bundeskanzler*in"
assert regender("Bundeskanzler·in", "*") == "Bundeskanzler*in"
assert regender("BundeskanzlerIn", "*") == "Bundeskanzler*in"
assert regender("BundesministerIn", "*") == "Bundesminister*in"
assert regender("BundesministerInnen", "*") == "Bundesminister*innen"

In [23]:
count_dict = {}
def add_to_count_dict(key, number, val):
    if (key, number, val) in count_dict.keys():
        count_dict[(key, number, val)] += 1
    else:
        count_dict[(key, number, val)] = 1

In [24]:
dereko_lists = [dereko_to_csv(a) for a in ["colon", "internal-i", "interpunct", "slash", "star", "underscore"]]
for l in dereko_lists:
    for word in l:
        if is_properly_gendered_word(word):
            key = ungender(word)
            suggestion = regender(word, "*")
            if is_gendered_plural(suggestion):
                # The masculine plural is either identical to the stem or ends in "-en".
                if is_word(key):
                    add_to_count_dict(key, "pl", suggestion)
                if is_word(key + "en"):
                    add_to_count_dict(key + "en", "pl", suggestion)
            else:
                add_to_count_dict(key, "sg", suggestion)

In [27]:
dereko_unified = {}
# The loop variable is called `num` so that it does not overwrite the `number` function.
for (key, num, val), count in count_dict.items():
    if count > 1:
        add_to_dict(key, [val], dereko_unified)
        add_to_data(key, num, [val])
assert 'Bundeskanzler' in data["sg"].keys()

In [28]:
open(path.join(data_dir, "dereko", "unified.csv"), "w").write("\n".join(sorted([a[0] for a in dereko_unified.values()])))

134367

Because we cannot use regular expressions to enhance the "internal i" query directly on DeReKo (due to issues with case sensitivity), we perform some postprocessing for the results.
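
A minimal sketch of such a postprocessing step, assuming the wildcard query also returns tokens that do not follow the intended capitalisation. The sample tokens are made up, and `filter_internal_i` is a hypothetical helper rather than a function from this notebook.

```python
import re
from collections import Counter

# Case-sensitive pattern for the internal-I spelling: a capitalised stem of
# lowercase letters followed directly by "In" or "Innen".
INTERNAL_I = re.compile(r"[A-ZÄÖÜ][a-zäöüß]{3,}In(nen)?")

def filter_internal_i(tokens):
    """Keep only tokens that fully match the internal-I pattern."""
    return [t for t in tokens if INTERNAL_I.fullmatch(t)]

# Made-up sample of what a loose, case-insensitive wildcard query might return.
sample = ["LehrerInnen", "LEHRERINNEN", "LehrerInnen", "Lehrerinnen", "SchülerIn"]
kept = filter_internal_i(sample)
counts = Counter(kept)
print(kept)
print(counts.most_common())
```

Counting the surviving tokens makes it possible to discard forms that occur only once, similar to the `count > 1` threshold used earlier in this section.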

## The _retext equality_ data set

We skip this data set for now because many of the rules cannot be transformed to simple replacement rules.

In [29]:
# responses = {}
# for topic in topics:
#     responses[topic] = requests.get("https://raw.githubusercontent.com/retextjs/retext-equality/main/data/en/{}.yml".format(topic)).text
    

In [30]:
# for topic in topics:
#     data = yaml.safe_load(responses[topic])
#     for row in data:
#         considerate = row["considerate"]
#         inconsiderate = row["inconsiderate"]
#         if type(considerate) == str:
#             rules[considerate] = inconsiderate
#         elif type(considerate) == list:
#             for phrase in considerate:
#                 rules[phrase] = inconsiderate

In [31]:
# open(path.join("data", "retext_equality_raw.yaml"), "w").write(yaml.dump(data))

## Custom rules

We add some custom rules that we have written ourselves, inspired in part by the _retext-equality_ data set. 

In [32]:
custom_xml = open(path.join(data_dir, "custom_list_disability.xml")).read()

## Conversion to proper LanguageTool XML format

The LanguageTool rule format is described [over here](https://web.archive.org/web/20210910183442/https://dev.languagetool.org/development-overview) and [here](https://dev.languagetool.org/tips-and-tricks).

We devise a function to convert a _geschickt gendern_ entry to an XML LanguageTool entry.
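
As an illustration of the target structure, the following sketch builds a single rule with the standard library's `xml.etree.ElementTree`, which escapes special characters such as `&` automatically. It is not the conversion function used in this notebook: `build_rule`, the rule id and the shortened message text are hypothetical, and only the element names that appear in this notebook's rules are used.

```python
import xml.etree.ElementTree as ET
from typing import List

def build_rule(rule_id: str, pattern: str, suggestions: List[str]) -> ET.Element:
    """Build a <rule> element with one <token> per word of the pattern."""
    rule = ET.Element("rule", id=rule_id, name=pattern)
    pattern_el = ET.SubElement(rule, "pattern")
    for word in pattern.split(" "):
        ET.SubElement(pattern_el, "token").text = word
    message = ET.SubElement(rule, "message")
    message.text = "Vielleicht passt einer der folgenden Begriffe besser: "
    for s in suggestions:
        ET.SubElement(message, "suggestion").text = s
    # The example lists the accepted corrections, separated by "|".
    example = ET.SubElement(rule, "example", correction="|".join(suggestions))
    ET.SubElement(example, "marker").text = pattern
    return rule

rule = build_rule("ZUSCHAUER_PL", "Zuschauer", ["Publikum", "Zuschauende"])
xml_string = ET.tostring(rule, encoding="unicode")
print(xml_string)
```

Parsing the string back with `ET.fromstring` is a cheap well-formedness check before handing the file to LanguageTool's own rule validation.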

In [33]:
re.findall(r"\w+|\W+", "Wiener*innen")

['Wiener', '*', 'innen']

In [34]:
def startupper(s: str) -> str:
    # Upper-case only the first character; slicing also handles the empty string.
    return s[:1].upper() + s[1:]

In [35]:
assert startupper("absolvierende Person") == "Absolvierende Person"

In [36]:
def rule_to_xml(pattern: str, numerus: str, suggestions: List[str]) -> str:
    id = re.sub(r"\s", "_", pattern + "_" + numerus)
    id = re.sub("[^A-ZÄÖÜa-zäöüß_]", "", id)
    if numerus == "sg":
        postag_attributes = 'postag=".*:SIN:.*" postag_regexp="yes" '
    elif numerus == "pl":
        postag_attributes = 'postag=".*:PLU:.*" postag_regexp="yes" '
    replaced_tokens = "".join([
        '<token inflected="yes" {}>{}</token>'.format(postag_attributes, token) 
        for token in pattern.split(" ")])
    suggestions_ = ",\n\t\t".join(["<suggestion>{}</suggestion>".format(s) for s in suggestions])
    antipatterns = "\n\t\t".join(
        ["<antipattern>\n\t\t{}\n\t\t</antipattern>".format("\n\t\t".join(
            ['<token inflected="yes">{}</token>'.format(token) for token in re.findall(r"\w+|[.,:;*_·/]", s)]
        )) for s in suggestions])
    corrections = "|".join([startupper(s) for s in suggestions])
    return """
    <rule id="{id}" name="{pattern}">
        {antipatterns}
        <pattern>{replaced_tokens}</pattern>
        <message>
        Mit dem generischen Maskulinum werden nicht alle Geschlechter gleichermaßen assoziiert. Vielleicht passt einer der folgenden neutralen Begriffe besser: 
        {suggestions}
        </message>
        <short>Generisches Maskulinum</short>
        <example correction="{corrections}"><marker>{pattern}</marker></example>
    </rule>
    """.format(id=id, pattern=pattern, antipatterns=antipatterns, replaced_tokens=replaced_tokens, suggestions=suggestions_, corrections=corrections)

In [37]:
# print(rule_to_xml("Wiener", "pl", data["pl"]["Wiener"]))

In [38]:
# Include the custom rules once, then append the generated singular and plural rules.
xml = custom_xml + "\n\n"
for num in ["sg", "pl"]:
    xml += "".join([rule_to_xml(key, num, val) for key, val in data[num].items()])

## Injecting the rules to the existing LanguageTool rule file

In [39]:
custom_filename = "grammar_custom.xml"
open(path.join(data_dir, custom_filename), "w").write(xml)
copy_files.copy_files()

## Validating and using the rules

Running the LanguageTool rule validation:

In [40]:
# subprocess.run(["./testrules.sh", "de"], cwd=languagetool_path)

Starting LanguageTool:

In [41]:
# subprocess.run(["java", "-jar", path.join(languagetool_path, "languagetool.jar")])