# Creating the rule set

This notebook creates the rule set for diversity-sensitive suggestions for the LanguageTool server. Some sources are pulled from the internet and processed to fit the LanguageTool rule format as closely as possible.

The notebook uses [Poetry](https://python-poetry.org/) for reproducibility. To run it in an environment with the appropriate dependencies installed, run `poetry install` and then `poetry run jupyter notebook` to start the notebook server.

In [1]:
import cache_magic
import datetime
import io
from os import path
import pandas as pd
import yaml
import random
import re
import requests
import subprocess
from typing import *
import wayback

%cache magic is now registered in ipython


In [2]:
languagetool_path = path.join("..", "languagetool", "LanguageTool-5.4") # adjust this to the folder of the LanguageTool release
data_dir = "wordlists" # where the downloaded and processed data will be saved
datasets = []

## The data set by _geschickt gendern_

In [3]:
wb = client = wayback.WaybackClient()
%cache mem = wb.get_memento("https://geschicktgendern.de/download/1642/", datetime=datetime.datetime(2021, 9, 11, tzinfo=datetime.timezone.utc), exact=False)


creating new value for variable 'mem'


In [4]:
df = pd.read_excel(mem.content, header=None, names=["ungendered", "gendered"], skiprows=3, usecols=[1,2])
df.sort_values(by="ungendered")
df

| | ungendered | gendered |
|---|---|---|
| 0 | `<div id="A"><b>A</b><div>` | |
| 1 | Abbrecherquote | Abbruchquote |
| 2 | Abenteurer (sg.) | Waghals; abenteuerliebende Person; abenteuerlu... |
| 3 | Abgänger | absolvierende Person; Abschluss innehabende Pe... |
| 4 | Abiturient | Abitur ablegende Person; Person, die Abitur macht |
| ... | ... | ... |
| 1814 | Zuschauer (pl.) | Publikum; Zuschauende |
| 1815 | Zuschauerquote | Einschaltquote |
| 1816 | Zuschauerzahl | Publikumszahl |
| 1817 | Zuständiger | zuständige Person |


We drop rows like the first one, where there is merely some HTML description but no value.

In [5]:
df = df[df["gendered"].notna()]
df.to_csv(path.join(data_dir, "geschicktgendern.csv"), index=False)

We convert the singular / plural annotations to part-of-speech tags for LanguageTool:

In [6]:
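def postag(key: str) -> Optional[str]:
    # Map a "(sg.)" / "(pl.)" annotation in the key to a POS-tag regex;
    # return None when the key carries no such annotation.
    numerus = re.search(r"\(.*(sg|pl)\.\)", key)
    if numerus is None:
        return None
    return ".*SIN.*" if numerus[1] == "sg" else ".*PLU.*"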

In [7]:
print(postag("Baum (grün; sg.)"))

.*SIN.*


In [8]:
df.loc[13]

ungendered            Absolventenvorsprechen [Schauspielschule]
gendered      <a href="https://geschicktgendern.de/kontakt">...
Name: 13, dtype: object

We see that some suggestions contain HTML, for example a link noting that there is no good suggestion yet. Handling this markup is out of scope here, so we drop such suggestions.

The `gendered` column often contains multiple variants that are separated by a semicolon. We want to capture this.

In [9]:
records = df.to_records()
data = []
for (_, key, val) in records:
    unannotated_suggestions = [x for x in val.split("; ") if "<" not in x]
    if not ("<" in key or "..." in key) and len(unannotated_suggestions) > 0:
        data.append({
            # Raw strings avoid invalid-escape warnings for "\[" and "\(".
            "pattern": re.sub(r" ?\[.*\]", "", re.sub(r" ?\(.*\)", "", key)),
            "postag": postag(key),
            "suggestions": unannotated_suggestions,
            "url": "https://geschicktgendern.de/"
        })
        
print("This reduces the number of rules from {} to {}.".format(len(records), len(data)))

This reduces the number of rules from 1792 to 1558.


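Some rows contain suggestions that include HTML formatting. The loop above drops only those suggestions; a row is removed entirely only if no plain suggestion remains, or if its key itself contains markup or an ellipsis.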
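
The cleaning step can be sketched on two sample rows. This is a minimal, self-contained illustration, not the notebook's own code: the helper name `clean_entry` is hypothetical, the first row is taken from the table output above, and the second row is made up.

```python
import re
from typing import List, Optional, Tuple


def clean_entry(key: str, val: str) -> Optional[Tuple[str, List[str]]]:
    """Return (pattern, suggestions), or None if the row should be dropped."""
    # Keep only the variants that carry no HTML markup.
    suggestions = [s for s in val.split("; ") if "<" not in s]
    if "<" in key or "..." in key or not suggestions:
        return None
    # Strip "(...)" and "[...]" annotations from the key.
    pattern = re.sub(r" ?\[.*\]", "", re.sub(r" ?\(.*\)", "", key))
    return pattern, suggestions


# Row from the table above: both variants survive, the "(pl.)" note is stripped.
print(clean_entry("Zuschauer (pl.)", "Publikum; Zuschauende"))
# Made-up row whose only suggestion is HTML: the whole row is dropped.
print(clean_entry("Beispiel [Kontext]", '<a href="#">kein Vorschlag</a>'))
```

The first call yields `('Zuschauer', ['Publikum', 'Zuschauende'])`, the second yields `None`.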

In [10]:
datasets.append(data)
data = None

## The Microsoft / Vienna catalog

In [11]:
%cache mem = wb.get_memento("https://www.data.gv.at/katalog/dataset/15d6ede8-f128-4fcd-aa3a-4479e828f477/resource/804f6db1-add7-4480-b4d0-e52e61c48534/download/worttabelle.csv", datetime=datetime.datetime(2021, 9, 13, tzinfo=datetime.timezone.utc), exact=False)
text = re.sub(";;\r\n", "\n", mem.content.decode("utf-8"))
df = pd.read_csv(io.StringIO(text))
df = df[df["Hauptwort"].notna()]
df.to_csv(path.join(data_dir, "vienna_catalog.csv"), index=False)
df

creating new value for variable 'mem'


| | Laenge | Hauptwort | Vorschlag | Binnen |
|---|---|---|---|---|
| 0 | 50 | Verantwortlicher für Informationssicherheit (C... | CISO | N |
| 1 | 50 | Verantwortlicher für Informationssicherheit (C... | Verantwortliche bzw. Verantwortlicher für Info... | N |
| 2 | 45 | Diplomierte Gesundheits- und Krankenschwester | Diplomiertes Krankenpflegepersonal | N |
| 3 | 43 | Unabhängiger Bedienstetenschutzbeauftragter | Unabhängige Bedienstetenschutzbeauftragte bzw.... | N |
| 4 | 39 | Kontrakt- und Berichtswesenbeauftragter | Kontrakt- und Berichtswesenbeauftragte bzw. -b... | N |
| ... | ... | ... | ... | ... |
| 2266 | 4 | Koch | Köchin bzw. Koch | N |
| 2267 | 4 | Star | Berühmtheit | N |
| 2268 | 4 | User | Userin bzw. User | N |
| 2269 | 4 | User | UserInnen | Y |


In [12]:
data_dict: Dict[str, List[str]] = {}
for (_, _, pattern, suggestion, binnenI) in df.to_records():
    if binnenI == "Y":
        suggestion = re.sub(r"([a-zäöüß])I", r"\1*i", suggestion)
    # Collect all suggestions that belong to the same pattern.
    data_dict.setdefault(pattern, []).append(suggestion)

data = []
for key, val in data_dict.items():
    data.append({
        "pattern": key,
        "postag": None,
        "suggestions": val,
        "url": "https://www.data.gv.at/katalog/dataset/15d6ede8-f128-4fcd-aa3a-4479e828f477"
    })
data[-2]

{'pattern': 'User',
 'postag': None,
 'suggestions': ['Userin bzw. User', 'User*innen'],
 'url': 'https://www.data.gv.at/katalog/dataset/15d6ede8-f128-4fcd-aa3a-4479e828f477'}
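
The Binnen-I rewrite used above can be tried in isolation. The following sketch reuses the same regular expression; the function name `binnen_i_to_star` is an illustrative assumption. It also shows why the substitution is only applied to rows flagged with `Binnen == "Y"`: the pattern matches any lowercase letter followed by a capital "I", so applying it blindly could alter unrelated words.

```python
import re

# Same substitution as in the cell above: a lowercase letter followed by a
# capital "I" is treated as a Binnen-I form.
BINNEN_I = re.compile(r"([a-zäöüß])I")


def binnen_i_to_star(word: str) -> str:
    """Rewrite a Binnen-I form such as "UserInnen" with a gender star."""
    return BINNEN_I.sub(r"\1*i", word)


print(binnen_i_to_star("UserInnen"))  # User*innen
print(binnen_i_to_star("Iren"))       # Iren (no lowercase letter before the "I")
```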

In [13]:
datasets.append(data)
data = None

## The _retext equality_ data set

We skip this data set for now because many of the rules cannot be transformed to simple replacement rules.

In [14]:
# responses = {}
# for topic in topics:
#     responses[topic] = requests.get("https://raw.githubusercontent.com/retextjs/retext-equality/main/data/en/{}.yml".format(topic)).text
    

In [15]:
# for topic in topics:
#     data = yaml.safe_load(responses[topic])
#     for row in data:
#         considerate = row["considerate"]
#         inconsiderate = row["inconsiderate"]
#         if type(considerate) == str:
#             rules[considerate] = inconsiderate
#         elif type(considerate) == list:
#             for phrase in considerate:
#                 rules[phrase] = inconsiderate

In [16]:
# open(path.join("data", "retext_equality_raw.yaml"), "w").write(yaml.dump(data))
# datasets.append(data)
# data = None

## Custom rules

We add some custom rules that we have written ourselves, inspired in part by the _retext-equality_ data set. 

In [17]:
custom_xml = open(path.join(data_dir, "custom_list_disability.xml")).read()

## Conversion to proper LanguageTool XML format

The LanguageTool rule format is described in the [development overview](https://web.archive.org/web/20210910183442/https://dev.languagetool.org/development-overview) and on the [tips and tricks](https://dev.languagetool.org/tips-and-tricks) page.

We define a function that converts one entry of the collected data sets (for example a _geschickt gendern_ entry) into a LanguageTool XML rule.

In [18]:
def rule_to_xml(pattern: str, postag: Optional[str], suggestions: List[str], url: str) -> str:
    id = re.sub(r"[^A-ZÄÖÜa-zäöüß_]", "", re.sub(r"\s", "_", "_".join([pattern, (postag or "")])))
    postag_attribute = 'postag="{}" '.format(postag) if postag is not None else ""
    replaced_tokens = "".join([
        '<token inflected="yes" {}postag_regexp="yes">{}</token>'.format(postag_attribute, token) 
        for token in pattern.split(" ")])
    suggestions_ = "\n\t\t".join(["<suggestion>{}</suggestion>".format(s) for s in suggestions])
    return """
    <rule id="{id}" name="{pattern}">
        <pattern>{replaced_tokens}</pattern>
        <message>
        Mit dem generischen Maskulinum werden nicht alle Geschlechter gleichermaßen assoziiert. Vielleicht passt einer der folgenden neutralen Begriffe besser: {suggestions}
        </message>
        <url>{url}</url>
        <short>Generisches Maskulinum</short>
        <example correction="{s}"><marker>{pattern}</marker></example>
    </rule>
    """.format(id=id, pattern=pattern, replaced_tokens=replaced_tokens, suggestions=suggestions_, s=suggestions[0], url=url)

In [19]:
xml = custom_xml + "\n\n" + "".join(["".join([rule_to_xml(**datum) for datum in dataset]) for dataset in datasets])

## Injecting the rules into the existing LanguageTool rule file

In [20]:
grammar_path = path.join(languagetool_path, "org", "languagetool", "rules", "de") # path of the German grammar files within the LanguageTool release


In [21]:
custom_filename = "grammar_custom.xml"
open(path.join(data_dir, custom_filename), "w").write(xml)
open(path.join(grammar_path, custom_filename), "w").write(xml)

1972631

In [22]:
# Use backup file if available (see comments below)
if path.isfile(path.join(grammar_path, "grammar.xml.old")):
    old_xml = open(path.join(grammar_path, "grammar.xml.old")).read()
else:
    old_xml = open(path.join(grammar_path, "grammar.xml")).read()
    # Save backup of the old grammar file.
    open(path.join(grammar_path, "grammar.xml.old"), "w").write(old_xml)

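We then inject the category tag and its contents into the existing LanguageTool rule XML file: the custom file is declared as an external entity in the `DOCTYPE`, and the entity is referenced inside a new category just before the closing `</rules>` tag.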

In [23]:
new_xml = old_xml.replace(
        "<!DOCTYPE rules [", 
        '<!DOCTYPE rules [ \n\t<!ENTITY UserRules SYSTEM "file:///{}">'.format(path.abspath(path.join(grammar_path, custom_filename)))
    ).replace(
        "</rules>", 
        '<category id="DIVERSITY_SENSITIVE_LANGUAGE" name="Erweiterung für diversitätssensible Sprache">\n&UserRules;\n</category>\n</rules>'
    )

# Replace with file where the new rules have been added.
open(path.join(grammar_path, "grammar.xml"), "w").write(new_xml)


3174368
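
The injection can be illustrated on a toy grammar string. This is a sketch under assumptions: `toy_grammar` is a made-up stand-in for the real `grammar.xml`, the function name `inject` is hypothetical, and `Path.as_uri()` is swapped in for the manual `file:///` formatting as one way to build the URI. The second call suggests why the notebook starts from the `grammar.xml.old` backup: applying the replacement to an already modified file would duplicate the entity declaration and the category.

```python
from pathlib import Path

# Toy stand-in for grammar.xml; the real file is much larger.
toy_grammar = '<?xml version="1.0"?>\n<!DOCTYPE rules [\n]>\n<rules lang="de">\n</rules>\n'


def inject(grammar: str, custom_file: Path) -> str:
    """Declare custom_file as an external entity and reference it before </rules>."""
    entity = '<!DOCTYPE rules [\n\t<!ENTITY UserRules SYSTEM "{}">'.format(
        custom_file.resolve().as_uri()  # absolute path as a file:// URI
    )
    category = (
        '<category id="DIVERSITY_SENSITIVE_LANGUAGE" '
        'name="Erweiterung für diversitätssensible Sprache">\n'
        "&UserRules;\n</category>\n</rules>"
    )
    return grammar.replace("<!DOCTYPE rules [", entity).replace("</rules>", category)


once = inject(toy_grammar, Path("grammar_custom.xml"))
twice = inject(once, Path("grammar_custom.xml"))  # simulates re-running without a backup
print(once.count("<!ENTITY UserRules"), twice.count("<!ENTITY UserRules"))  # 1 2
```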

## Validating and using the rules

Running the LanguageTool rule validation:

In [24]:
# subprocess.run(["./testrules.sh", "de"], cwd=languagetool_path)

Starting LanguageTool:

In [None]:
# subprocess.run(["java", "-jar", path.join(languagetool_path, "languagetool.jar")])
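
Once a LanguageTool HTTP server is running, the new rules could be exercised through its `/v2/check` endpoint. The sketch below only prepares the request and parses a response, so it runs without a server. The base URL `http://localhost:8081`, the language code `de-DE`, the sample sentence, and both helper names are assumptions, and the server is assumed to be started separately from the commented-out command above.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_check_request(text: str, base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Prepare a POST to the /v2/check endpoint (no network access yet)."""
    payload = urllib.parse.urlencode({"language": "de-DE", "text": text}).encode("utf-8")
    return urllib.request.Request(base_url + "/v2/check", data=payload, method="POST")


def rule_ids(response_body: str) -> List[str]:
    """Extract the ids of all rules that matched from a JSON response body."""
    return [match["rule"]["id"] for match in json.loads(response_body).get("matches", [])]


request = build_check_request("Die Zuschauer klatschten.")
print(request.full_url)
# With a server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(rule_ids(response.read().decode("utf-8")))
```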