# Creating the rule set

This notebook creates the rule set that provides diversity-sensitive suggestions in the LanguageTool server. Some sources are downloaded from the internet and processed to fit the LanguageTool rule format as closely as possible.

The notebook uses [Poetry](https://python-poetry.org/) for reproducibility. To run it in an environment with the required dependencies installed, run `poetry install` and then `poetry run jupyter notebook` to start the notebook server.

In [11]:
from os import path
from shared import add_to_dict, log
from typing import *
import cache_magic
import copy_files
import datetime
import io
import pandas as pd
import random
import re
import requests
import spacy
import subprocess

In [12]:
data_dir = "wordlists" # where the downloaded and processed data will be saved
fetch_data = False # whether to re-download the files

In [13]:
nlp = spacy.load("de_core_news_sm")

In [14]:
def number(s: str) -> List[str]:
    return nlp(s)[0].morph.get("Number")

assert number("Bundeskanzlerin") == ["Sing"]
assert number("Bundeskanzlerinnen") == ["Plur"]

In [15]:
def is_word(word: str) -> bool:
    # Placeholder: accepts every word; the dictionary lookup below is disabled.
    return True
#     return len(gn.sysnsets(word)) > 0

# assert is_word("Baum") == True
# assert is_word("Bäum") == False

In [16]:
data = {
    "sg": {},
    "pl": {},
}

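def strip_spaces(a):
    # Replace quotes and runs of spaces with a single space.
    # (The result was previously assigned to an unused variable and lost.)
    a = re.sub("  +|\"|'", " ", a)
    # Drop one leading/trailing space and all sentence punctuation.
    a = re.sub("^ | $|[.,:;!?]", "", a)
    return a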

def add_to_data(pattern, numerus, suggestions):
    pattern = strip_spaces(pattern)
    if numerus == "sg":
        add_to_dict(pattern, suggestions, data["sg"])
    elif numerus == "pl":
        add_to_dict(pattern, suggestions, data["pl"])
    elif numerus == "unknown":
        if "Sing" in number(pattern):
            add_to_data(pattern, "sg", suggestions)
        if "Plur" in number(pattern):
            add_to_data(pattern, "pl", suggestions)
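
`add_to_dict` is imported from `shared`, and its implementation is not shown in this notebook. The following is only a minimal sketch of how such a helper might behave, assuming it collects suggestions in a list per pattern and skips duplicates. The name `add_to_dict_sketch` and the sample entries are hypothetical.

```python
from typing import Dict, List


def add_to_dict_sketch(key: str, values: List[str], target: Dict[str, List[str]]) -> None:
    """Hypothetical stand-in for shared.add_to_dict (assumed behavior, not the real code)."""
    existing = target.setdefault(key, [])
    for value in values:
        # Keep insertion order, but never store the same suggestion twice.
        if value not in existing:
            existing.append(value)


# Same shape as the notebook's `data` dictionary.
sketch_data = {"sg": {}, "pl": {}}
add_to_dict_sketch("Student", ["studierende Person"], sketch_data["sg"])
add_to_dict_sketch("Student", ["Student*in", "studierende Person"], sketch_data["sg"])
print(sketch_data["sg"])  # {'Student': ['studierende Person', 'Student*in']}
```

Under this assumption, calling `add_to_data` several times with the same pattern would merge the suggestions rather than overwrite them.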

## Custom rules

We add some custom rules that we have written ourselves, inspired in part by the _retext-equality_ data set. 

In [39]:
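# Read the hand-written rules; the context manager closes the file again.
with open(path.join(data_dir, "custom_list_disability.xml"), encoding="utf-8") as f:
    custom_xml = f.read()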

## Conversion to proper LanguageTool XML format

The LanguageTool rule format is described in the [development overview](https://web.archive.org/web/20210910183442/https://dev.languagetool.org/development-overview) and on the [tips and tricks](https://dev.languagetool.org/tips-and-tricks) page.

We define a function that converts a _geschickt gendern_ entry into a LanguageTool XML rule.

In [40]:
re.findall(r"\w+|\W+", "Wiener*innen")

['Wiener', '*', 'innen']
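
The cell above splits on `\w+|\W+`, while the antipatterns in `rule_to_xml` further down use `\w+|[.,:;*_·/]`. The sketch below compares the two on a few sample strings to show what ends up as a token. The helper name `tokens` and the sample inputs are illustrative only.

```python
import re

SPLIT_ALL = r"\w+|\W+"                 # regex from the cell above
SPLIT_ANTIPATTERN = r"\w+|[.,:;*_·/]"  # regex used for antipattern tokens


def tokens(regex: str, text: str) -> list:
    """Return all non-overlapping matches of regex in text."""
    return re.findall(regex, text)


# A gender star becomes its own token with either regex.
star = tokens(SPLIT_ANTIPATTERN, "Wiener*innen")

# SPLIT_ALL keeps whitespace as a token; SPLIT_ANTIPATTERN silently skips it.
phrase_all = tokens(SPLIT_ALL, "absolvierende Person")
phrase_anti = tokens(SPLIT_ANTIPATTERN, "absolvierende Person")

# "_" is itself a word character, so \w+ swallows it and no split happens.
underscore = tokens(SPLIT_ANTIPATTERN, "Wiener_innen")

print(star)         # ['Wiener', '*', 'innen']
print(phrase_all)   # ['absolvierende', ' ', 'Person']
print(phrase_anti)  # ['absolvierende', 'Person']
print(underscore)   # ['Wiener_innen']
```

One consequence worth keeping in mind: because `_` matches `\w`, the `_` in the antipattern character class only takes effect when the underscore is not adjacent to a word character.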

In [None]:
def startupper(s: str) -> str:
    # Uppercase only the first character; slicing keeps this safe for "".
    return s[:1].upper() + s[1:]

In [None]:
assert startupper("absolvierende Person") == "Absolvierende Person"

In [None]:
def rule_to_xml(pattern: str, numerus: str, suggestions: List[str]) -> str:
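    # Raw strings avoid invalid-escape warnings for "\s" in newer Python versions.
    id = re.sub(r"\s", "_", pattern + "_" + numerus)
    id = re.sub(r"[^A-ZÄÖÜa-zäöüß_]", "", id)
    if numerus == "sg":
        postag_attributes = 'postag=".*:SIN:.*" postag_regexp="yes" '
    elif numerus == "pl":
        postag_attributes = 'postag=".*:PLU:.*" postag_regexp="yes" '
    else:
        # Fail early instead of hitting an unbound postag_attributes below.
        raise ValueError("numerus must be 'sg' or 'pl', got {!r}".format(numerus))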
    replaced_tokens = "".join([
        '<token inflected="yes" {}>{}</token>'.format(postag_attributes, token) 
        for token in pattern.split(" ")])
    suggestions_ = ",\n\t\t".join(["<suggestion>{}</suggestion>".format(s) for s in suggestions])
    antipatterns = "\n\t\t".join(
        ["<antipattern>\n\t\t{}\n\t\t</antipattern>".format("\n\t\t".join(
            ['<token inflected="yes">{}</token>'.format(token) for token in re.findall(r"\w+|[.,:;*_·/]", s)]
        )) for s in suggestions])
    corrections = "|".join([startupper(s) for s in suggestions])
    return """
    <rule id="{id}" name="{pattern}">
        {antipatterns}
        <pattern>{replaced_tokens}</pattern>
        <message>
        Mit dem generischen Maskulinum werden nicht alle Geschlechter gleichermaßen assoziiert. Vielleicht passt einer der folgenden neutralen Begriffe besser: 
        {suggestions}
        </message>
        <short>Generisches Maskulinum</short>
        <example correction="{corrections}"><marker>{pattern}</marker></example>
    </rule>
    """.format(id=id, pattern=pattern, antipatterns=antipatterns, replaced_tokens=replaced_tokens, suggestions=suggestions_, corrections=corrections)

In [None]:
# print(rule_to_xml("Wiener", "pl", data["pl"]["Wiener"]))
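
The rule id in `rule_to_xml` is derived from the pattern and the numerus by replacing whitespace with underscores and dropping every character that is not a German letter or an underscore. The sketch below isolates that step so its effect can be inspected. The helper name `rule_id` and the sample patterns are hypothetical.

```python
import re
from collections import Counter


def rule_id(pattern: str, numerus: str) -> str:
    """Mirror the id derivation used in rule_to_xml (illustrative helper)."""
    candidate = re.sub(r"\s", "_", pattern + "_" + numerus)
    return re.sub(r"[^A-ZÄÖÜa-zäöüß_]", "", candidate)


simple = rule_id("Wiener", "pl")
phrase = rule_id("absolvierende Person", "sg")

# Different patterns can collapse to the same id once punctuation is dropped.
samples = ["Arzt/Ärztin", "Arzt-Ärztin"]
ids = [rule_id(p, "sg") for p in samples]
duplicates = [i for i, count in Counter(ids).items() if count > 1]

print(simple)      # Wiener_pl
print(phrase)      # absolvierende_Person_sg
print(duplicates)  # ['ArztÄrztin_sg']
```

Since the sanitizing step is lossy, a duplicate check like this one over all generated ids may be worth running before the file is written.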

In [None]:
xml = custom_xml
for numerus in ["sg", "pl"]:
    xml += "\n\n" + "".join([rule_to_xml(key, numerus, val) for key, val in data[numerus].items()])

## Injecting the rules into the existing LanguageTool rule file

In [None]:
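custom_filename = "grammar_custom.xml"
# Write inside a context manager so the file is flushed and closed before copying.
with open(path.join(data_dir, custom_filename), "w", encoding="utf-8") as f:
    f.write(xml)
copy_files.copy_files()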

## Validating and using the rules

Running the LanguageTool rule validation:

In [None]:
# subprocess.run(["./testrules.sh", "de"], cwd=languagetool_path)
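
Before running the full LanguageTool test suite, a quick well-formedness check can catch problems such as unescaped `&` or `<` in patterns and suggestions, which `rule_to_xml` interpolates without escaping. The following is a minimal sketch using the standard library. It assumes the generated text is a fragment with several top-level elements and therefore wraps it in a dummy root; it does not validate against the LanguageTool schema, and a fragment that relies on externally defined entities would be reported as malformed. The function name and the sample rules are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a dummy root."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError:
        return False
    return True


good_rule = (
    '<rule id="Wiener_pl" name="Wiener">'
    "<pattern><token>Wiener</token></pattern>"
    "</rule>"
)
# The bare "&" in the name attribute makes this fragment malformed.
bad_rule = (
    '<rule id="A_sg" name="A & B">'
    "<pattern><token>A</token></pattern>"
    "</rule>"
)

print(is_well_formed_fragment(good_rule))  # True
print(is_well_formed_fragment(bad_rule))   # False
```

In the notebook, the same check could be applied to the `xml` string before it is written to disk.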

Starting LanguageTool:

In [None]:
# subprocess.run(["java", "-jar", path.join(languagetool_path, "languagetool.jar")])