# Grammatical Gender: Does it Translate?

I want to find out if the grammatical gender systems in German and French are connected. When a noun is feminine in German, what is the chance that the corresponding French noun is also feminine: is it better than a coin toss? I have downloaded a FR/DE dictionary from [here](https://www.dict.cc). I cannot include the file in the repo since that would violate the dict.cc ToS, but you can request it [here](https://www1.dict.cc/translation_file_request.php?l=e): pick the option "DE->FR (tab-delimited, UTF-8)". I will be using many of their other dictionaries as well.

In [None]:
# all imports and configs go here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display 
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, matthews_corrcoef
from scipy.stats import chi2_contingency, fisher_exact

# this line can make some cells very slow. Comment out if that's the case
pd.set_option("display.max_rows", None)

In [None]:
# load German-French dictionary file - if you download it from dict.cc, rename and move file appropriately!

defr = pd.read_csv('../datasets/defr/defr.txt', 
                   sep='\t', 
                   header=7,
                   names=['de', 'fr', 'cat', 'comment'])

Let's have a look at the raw data:

In [None]:
display(defr)

Some preprocessing is clearly necessary. The gender is typically denoted in braces `{f} {m} {n}`, and sometimes (in French) by `{f.pl}` and `{m.pl}`. Note that German has three grammatical genders (masculine, feminine, neuter), but French has only two (masculine and feminine): this will become important later!

There are outliers where no gender is given, either by error (a few German entries only have `{pl}`) or because the translation of a noun is a paraphrased description instead of a single noun.
We will perform the following preprocessing steps:
* remove `comment` column
* remove everything that is not labelled as a noun
* extract gender by searching pattern
* discard lines where we cannot extract gender for either German or French

In [None]:
# remove superfluous column
defr = defr.drop(labels=['comment'], axis=1)

In [None]:
# restrict to nouns
nouns = defr[defr.cat == 'noun'].reset_index(drop=True)

In [None]:
# use regex to find gender markers: where no valid marker can be extracted, we will get NaN
# (the dot in '.pl' is escaped so that it only matches a literal dot)
nouns['de_gender'] = nouns.de.str.extract(r'(\{[mfn]\}|\{[mfn]\.pl\})', expand=False)
nouns['fr_gender'] = nouns.fr.str.extract(r'(\{[mf]\}|\{[mf]\.pl\})', expand=False)

In [None]:
# drop lines with fewer than two valid gender markers (i.e. NaN in either gender column); replace {f} with f etc
nouns_gendered = nouns.dropna(subset=['de_gender', 'fr_gender']).reset_index(drop=True)
nouns_gendered['de_gender'] = nouns_gendered['de_gender'].map(lambda s: s[1])
nouns_gendered['fr_gender'] = nouns_gendered['fr_gender'].map(lambda s: s[1])
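
To see what the extraction and filtering steps do without needing the dict.cc file, here is a minimal sketch on a handful of made-up entries. The rows and the names `toy` and `toy_gendered` are hypothetical and only imitate the dict.cc style; the patterns are the ones from the cell above, with the dot written as `\.` so that it only matches a literal dot.

```python
import pandas as pd

# Hypothetical entries in the dict.cc style; not taken from the real file
toy = pd.DataFrame({
    "de": ["Haus {n}", "Katze {f}", "Ferien {pl}", "Eltern {pl}"],
    "fr": ["maison {f}", "chat {m}", "vacances {f.pl}", "parents {m.pl}"],
})

# expand=False returns a Series; rows without a valid marker become NaN
toy["de_gender"] = toy["de"].str.extract(r"(\{[mfn]\}|\{[mfn]\.pl\})", expand=False)
toy["fr_gender"] = toy["fr"].str.extract(r"(\{[mf]\}|\{[mf]\.pl\})", expand=False)

# Keep only rows where both markers were found, then reduce '{f.pl}' to 'f'
toy_gendered = toy.dropna(subset=["de_gender", "fr_gender"]).reset_index(drop=True)
toy_gendered["de_gender"] = toy_gendered["de_gender"].str[1]
toy_gendered["fr_gender"] = toy_gendered["fr_gender"].str[1]

print(toy_gendered)
```

The two German entries that only carry `{pl}` are dropped, which is the kind of loss counted in the next cell.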

In [None]:
# how many entries did we lose in the last step?
print(f'We have {len(nouns)} nouns, and {len(nouns_gendered)} with two valid gender markers. \
We discarded {len(nouns)-len(nouns_gendered)} lines in the process.')

Those seem like acceptable losses. Let's look at the data that's left, as a sanity check:

In [None]:
display(nouns_gendered)

In [None]:
labels = ['m', 'f', 'n']
cm = confusion_matrix(nouns_gendered['de_gender'].values,
                      nouns_gendered['fr_gender'].values, 
                      labels=labels)
plot = ConfusionMatrixDisplay(cm, display_labels=labels)
plot.plot()
plot.ax_.set(xlabel='French', ylabel='German');
plt.savefig('confusion_defr_nogrouping.png')

Of course, since French has no neuter gender but German does, the rightmost column must be empty. French's progenitor language Latin had three genders (as did the grandmother of them all, Proto-Indo-European), but many modern Romance languages whittled that down to just two: `m` and `f`. In Italian this happened roughly by merging masculine and neuter. French appears to have developed similarly, but the picture is less clear to me (I am not a linguist!). Some things may be muddier here: Latin argentum (n) &rarr; French argent (m); lac (n) &rarr; lait (m) BUT pōmum (n) &rarr; pomme (f); mare (n) &rarr; mer (f). We will get around to matching Latin and French noun genders later!

For now, let's make a simplifying assumption: that the elimination of the neuter gender in French went along similar lines as in Italian, i.e. the majority of neuter nouns became masculine. Since French has arguably suffered (? or profited?) from Germanic influence more than other Romance languages, that influence may have made things messier than in Italian. But looking at neuter German nouns (bottom row of the plot above), their French counterparts are predominantly masculine, almost twice as often as feminine. Hence the Germanic influence can be assumed to be subdominant: it would certainly be less justified to assume that the majority of neuter nouns became feminine.
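
The row-wise shares behind that reading can be made explicit by normalising each row of the confusion matrix. The sketch below uses invented toy labels (`de_toy`, `fr_toy` and `shares` are hypothetical names, and the counts are not the dict.cc ones); it relies on the `normalize="true"` option of scikit-learn's `confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only (not the dict.cc counts)
de_toy = ["m", "m", "m", "f", "f", "f", "n", "n", "n"]
fr_toy = ["m", "m", "f", "f", "f", "m", "m", "m", "f"]

labels = ["m", "f", "n"]
# normalize="true" divides each row by its total, so every row sums to one
shares = confusion_matrix(de_toy, fr_toy, labels=labels, normalize="true")

# The last column is skipped because French has no neuter nouns
for label, row in zip(labels, shares[:, :2]):
    print(f"German {label}: {row[0]:.2f} masculine, {row[1]:.2f} feminine in French")
```

Applied to the real columns, the bottom row would give the French masculine and feminine shares among neuter German nouns directly.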

Thus, let us pour all German masculine and neuter nouns into one bucket and compare it against the two French genders. The intention is to artificially emulate the historical evolution of the Romance languages, so that we can compare whether a noun is "more of f or more of m" across the two languages.

In [None]:
nouns_gendered['de_gender_grouped'] = nouns_gendered['de_gender'].map({'f': 'f', 'm': 'm', 'n': 'm'})

In [None]:
labels = ['m', 'f']
cm2 = confusion_matrix(nouns_gendered['de_gender_grouped'].values, 
                       nouns_gendered['fr_gender'].values, 
                       labels=labels)
plot = ConfusionMatrixDisplay(cm2, display_labels=labels)
plot.plot()
plot.ax_.set(xlabel='French', ylabel='German', yticklabels=['m / n', 'f']);
plt.savefig('confusion_defr_grouping.png')

In [None]:
cm2.diagonal().sum() / cm2.sum()

So it turns out: when going German &rarr; French and guessing gender, just make it the same as in German (with n &rarr; m understood) and you'll be right 70% of the time! This result did indeed surprise me. Note that we are not weighting by noun usage frequency: it may very well be that commonly used words differ in gender more or less often than 30% of the time!
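
To judge whether such an agreement rate really beats a coin toss, it helps to put it next to two simple baselines. The following is a minimal sketch on a hypothetical 2x2 table (the counts and the name `toy_cm` are made up, not the dict.cc result): one baseline always guesses the most common French gender, the other is the agreement one would expect if the two genders were independent.

```python
import numpy as np

# Hypothetical 2x2 counts (rows: German m/n, f; columns: French m, f)
toy_cm = np.array([[600, 200],
                   [100, 300]])

total = toy_cm.sum()

# Share of pairs where the grouped German gender equals the French gender
agreement = toy_cm.diagonal().sum() / total

# Baseline 1: always guess the most common French gender
majority_baseline = toy_cm.sum(axis=0).max() / total

# Baseline 2: agreement expected under independence,
# i.e. the sum over genders g of P(German = g) * P(French = g)
row_share = toy_cm.sum(axis=1) / total
col_share = toy_cm.sum(axis=0) / total
chance_agreement = float(row_share @ col_share)

print(f"agreement {agreement:.2f}, majority {majority_baseline:.2f}, chance {chance_agreement:.2f}")
```

If the real agreement sits clearly above both baselines, the "same gender as in German" rule carries actual information.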

Time to make things more quantitative: the [Matthews correlation coefficient, or $\varphi$ coefficient](https://en.wikipedia.org/wiki/Phi_coefficient) (also known as the Yule $\varphi$ coefficient) is the version of [Pearson's famous-infamous $r$ correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for binary variates rather than continuous ones. It quantifies the correlation between two variates: it equals one for perfect correlation, minus one for perfect anticorrelation, and zero for independence (NB: $r=0$ does _not_ imply independence of two real-valued variates!). We can compute this number from the gender columns; for the ungrouped, three-class German labels, `matthews_corrcoef` uses its multiclass generalization:

In [None]:
print(f"Matthews phi before grouping is {matthews_corrcoef(nouns_gendered['de_gender'], nouns_gendered['fr_gender'])}")
print(f"Matthews phi after grouping is {matthews_corrcoef(nouns_gendered['de_gender_grouped'], nouns_gendered['fr_gender'])}")

Roughly 0.3 without grouping and 0.4 with: unsurprisingly, artificially replicating one language's historical development in the other does make them more similar.
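
For a 2x2 table with cells $a, b$ in the first row and $c, d$ in the second, the coefficient has the closed form $\varphi = (ad - bc) / \sqrt{(a+b)(c+d)(a+c)(b+d)}$. As a sanity check, the sketch below evaluates that formula on hypothetical counts (the numbers are invented for illustration) and compares it with what `matthews_corrcoef` returns for the same data expanded into label arrays.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Hypothetical 2x2 counts (rows: German m/n, f; columns: French m, f)
a, b = 600, 200   # German m/n -> French m, French f
c, d = 100, 300   # German f   -> French m, French f

# phi = (ad - bc) / sqrt(product of the four marginal totals)
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Cross-check: expand the counts into label arrays and let scikit-learn compute it
de = np.repeat(["m", "m", "f", "f"], [a, b, c, d])
fr = np.repeat(["m", "f", "m", "f"], [a, b, c, d])
phi_sklearn = matthews_corrcoef(de, fr)

print(phi, phi_sklearn)
```

Both routes should agree up to floating-point rounding, which confirms that for two binary variables the Matthews coefficient and $\varphi$ are the same quantity.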

We can quantify more things: one could start with the assumption that since German and French belong to distinct branches of the Indo-European family (Germanic and Romance) with different evolution histories, their divisions into tripartite or bipartite classes ought to be statistically independent. There are many statistical tools for testing such an assumption, such as the [$\chi^2$-test for independence](https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data), [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test), and the [G-test](https://en.wikipedia.org/wiki/G-test). They fit into the "classical" hypothesis testing framework: _postulate a hypothesis, calculate the probability of seeing data at least as extreme as ours under that hypothesis, take that as a measure of how plausible the hypothesis is_. Without much explanation or discussion, let's just hunga-bunga our data into the entire battery; pay attention to the `pvalue` results:

In [None]:
print(chi2_contingency(cm[:,:2]))
print(chi2_contingency(cm2))

In [None]:
fisher_exact(cm2)

In [None]:
# G-test (log-likelihood ratio statistic; equivalent to lambda_=0)
print(chi2_contingency(cm[:,:2], lambda_="log-likelihood"))
print(chi2_contingency(cm2, lambda_="log-likelihood"))

All `pvalue`s come out `0.0`! That is presumably floating-point underflow rather than an exact zero: every test tells us that the chance of our data arising randomly under the assumption of independence is too small to be meaningfully reported. We can safely discard that idea: there is _definitely_ a connection between gender in German and gender in French! From the data we have studied so far we cannot tell whether that is because of a common language ancestor or because of mutual influence. We know both are Indo-European languages, and that French and German have had millennia of exchange: German borrowed many French words during the Enlightenment period, and French itself has had a stronger Germanic influence than other Romance languages.
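
A p-value printed as `0.0` hides how small it actually is. For a 2x2 table the $\chi^2$ statistic has one degree of freedom, so its upper-tail probability equals $2\,\Phi(-\sqrt{\chi^2})$ with $\Phi$ the standard normal CDF, and `scipy.special.log_ndtr` can evaluate the logarithm of that without underflow. The sketch below uses hypothetical counts (not the dict.cc table), switches off the continuity correction so the statistic can be checked by hand, and is only valid for the one-degree-of-freedom case.

```python
import numpy as np
from scipy.special import log_ndtr
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts, large enough that the p-value underflows
toy_cm = np.array([[60000, 20000],
                   [10000, 30000]])

# correction=False switches off Yates' continuity correction for 2x2 tables
stat, pvalue, dof, expected = chi2_contingency(toy_cm, correction=False)

# Expected counts under independence: row total * column total / grand total
expected_by_hand = np.outer(toy_cm.sum(axis=1), toy_cm.sum(axis=0)) / toy_cm.sum()
stat_by_hand = ((toy_cm - expected_by_hand) ** 2 / expected_by_hand).sum()

# Valid for dof == 1 only: log p = log(2) + log Phi(-sqrt(chi2))
log10_p = (np.log(2) + log_ndtr(-np.sqrt(stat))) / np.log(10)

print(f"chi2 = {stat:.1f} (by hand {stat_by_hand:.1f}), p = {pvalue}, log10(p) = {log10_p:.0f}")
```

Reporting the order of magnitude of $p$ this way makes the "less than would be meaningful to report" statement concrete.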