# Morphology Learner: Interpretable Machine Learning for Form-Meaning Mappings

This project explores how decision trees can be used to learn the mappings between *forms* (words or subparts of words) and their *meanings*, which are expressed in terms of binary features. As an illustration, for English third-person subject pronouns (*he*, *she*, *they*, *it*), a mapping we want the learner to acquire might be as follows:

1. +Plural ↔ *they*
2. -Plural +Animate +Feminine ↔ *she*
3. -Plural +Animate -Feminine ↔ *he*
4. -Plural -Animate ↔ *it*

The project has three parts: (i) synthetic data generation, (ii) training a decision tree on the synthetic data, and (iii) interpreting the trained decision tree to extract form-meaning mappings. Each part is discussed in turn.

## Synthetic data generation

In this project, synthetic data is generated to simulate mappings between *forms* and *meanings*. Each sample in the synthetic data consists of:  

1. **X** – a vector of binary feature values for all features, where `0` represents a minus value (-) and `1` represents a plus value (+).  
2. **y** – the corresponding morph (word form).  

For example, if the first feature is ±Plural, the second is ±Animate, and the third is ±Feminine, a sample could be:  

X = [0, 1, 1] # −Plural +Animate +Feminine
y = she
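
To make the encoding concrete, here is a minimal sketch of how a feature specification could be converted to and from a binary vector. The helper names `encode` and `decode` and the `FEATURES` list are hypothetical and are not part of this project's code; the column order is simply the one assumed in the example above.

```python
# Hypothetical helpers illustrating the 0/1 encoding; not part of the project code.
FEATURES = ["Plural", "Animate", "Feminine"]  # column order assumed for this example

def encode(spec, features=FEATURES):
    """Map a spec such as {"Plural": "-", ...} to a list of 0/1 values."""
    return [1 if spec[name] == "+" else 0 for name in features]

def decode(vector, features=FEATURES):
    """Inverse of encode: render a 0/1 vector as a signed feature string."""
    return " ".join(("+" if value else "-") + name
                    for value, name in zip(vector, features))

x = encode({"Plural": "-", "Animate": "+", "Feminine": "+"})
print(x)          # [0, 1, 1]
print(decode(x))  # -Plural +Animate +Feminine
```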


The data is generated from *morphological paradigms*, which specify the morph corresponding to each combination of feature values. For example, for English third-person subject pronouns, a possible paradigm is:

| Feature specification          | Morph |
|--------------------------------|-------|
| −Plural −Animate −Feminine      | it    |
| −Plural +Animate −Feminine      | he    |
| −Plural +Animate +Feminine      | she   |
| +Plural −Animate −Feminine      | they  |
| +Plural +Animate −Feminine      | they  |
| +Plural +Animate +Feminine      | they  |

> **Note:** Learning form-meaning mappings is not just regurgitating the paradigm. For instance, "they" appears in three different cells in the paradigm. A naive learner could generate three separate rules for "they," one for each cell. The goal, however, is to learn a **single generalized rule**:  
> - +Plural ↔ they

The function `generate_data_from_csv` takes a morphological paradigm (which is provided as a CSV file) and two additional parameters: 
- `accuracy_rate` – the probability that a generated sample uses the correct morph for a feature value combination from the paradigm
- `n` – the total number of samples to generate.  
> Each sample is generated by randomly picking a feature value combination (cell) from the paradigm. Then, with probability `accuracy_rate`, this combination is paired with the correct morph from the paradigm. Otherwise, it is paired with a different, randomly chosen morph.

It returns: 
- A list of feature names in the paradigm in the CSV file 
- A list of unique morphs in the paradigm in the CSV file
- A matrix of binary feature values 0 or 1 (each row corresponds to a sample, and each column corresponds to a feature)
- A vector of morphs (one per sample)
- Error rate - the proportion of samples where a feature value combination is associated with an incorrect morph
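
The sampling procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `sample_paradigm` and `PARADIGM` are hypothetical names, the paradigm is hard-coded as a dictionary rather than read from a CSV file, and the actual `generate_data_from_csv` may differ in its details (for example, in how the incorrect morph is drawn).

```python
import random

# Toy paradigm: (Plural, Animate, Feminine) -> correct morph.
PARADIGM = {
    (0, 0, 0): "it",
    (0, 1, 0): "he",
    (0, 1, 1): "she",
    (1, 0, 0): "they",
    (1, 1, 0): "they",
    (1, 1, 1): "they",
}

def sample_paradigm(paradigm, accuracy_rate, n, seed=None):
    """Draw n noisy (features, morph) samples; a hypothetical stand-in for the real generator."""
    rng = random.Random(seed)
    cells = list(paradigm)
    morphs = sorted(set(paradigm.values()))
    X, y, errors = [], [], 0
    for _ in range(n):
        cell = rng.choice(cells)           # pick a random cell of the paradigm
        correct = paradigm[cell]
        if rng.random() < accuracy_rate:   # keep the correct morph
            morph = correct
        else:                              # otherwise pick any other morph
            morph = rng.choice([m for m in morphs if m != correct])
            errors += 1
        X.append(list(cell))
        y.append(morph)
    return X, y, errors / n

X_toy, y_toy, toy_error_rate = sample_paradigm(PARADIGM, accuracy_rate=0.8, n=10_000, seed=0)
print(len(X_toy), len(y_toy), round(toy_error_rate, 3))
```

Because an incorrect sample always receives a morph different from the correct one, the error rate in this sketch converges to `1 - accuracy_rate` as `n` grows.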

Let us illustrate this function by applying it to the full English personal pronoun paradigm (`English_pronouns.csv`). Besides the third-person subject pronouns illustrated above, this paradigm also includes first- and second-person pronouns (e.g., *I*, *we*, *you*), as well as object pronouns (e.g., *me*, *him*, *them*) and possessive pronouns (e.g., *my*, *your*, *their*).

Following linguistic convention, different person combinations are represented using two features, Participant and Author: 
1. First person: +Participant +Author
2. Second person: +Participant −Author
3. Third person: −Participant −Author

The difference between subject, object and possessive pronouns is also represented using two features, Subject and Possessive. 
1. Subject: +Subject −Possessive
2. Object: −Subject −Possessive
3. Possessive: −Subject +Possessive

We will generate 10,000 samples from this paradigm, where each sample has an 80% probability of being paired with the correct morph for its feature specification.

In [61]:
from src.data_generator import generate_data_from_csv

feature_names, morph_list, X, y, error_rate = generate_data_from_csv(
    "data/English_pronouns.csv",
    accuracy_rate=0.8,
    n=10000
)

Let us examine the output of `generate_data_from_csv`. We can first print some information about the features and morphs.  

In [62]:
print("Number of features:", len(feature_names))
print("Feature names:", feature_names)
print("Number of morphs:", len(morph_list))
print("Unique morphs:", morph_list)

Number of features: 7
Feature names: ['Participant', 'Author', 'Animate', 'Feminine', 'Plural', 'Possessive', 'Subject']
Number of morphs: 18
Unique morphs: ['I', 'he', 'her', 'him', 'his', 'it', 'its', 'me', 'my', 'our', 'she', 'their', 'them', 'they', 'us', 'we', 'you', 'your']


Now, consider the dimensions of the data generated. 

In [63]:
print("Number of rows in X:", X.shape[0],"(Should be equal to the number of samples, i.e., n)")
print("Number of columns in X:", X.shape[1],"(Should be equal to the number of features)")
print("Number of elements in y:", y.shape[0],"(Should be equal to the number of samples, i.e., n)")
print("Proportion of errors in synthetic data:", error_rate, "(Should be close to (1-accuracy rate), provided a large enough n)") 

Number of rows in X: 10000 (Should be equal to the number of samples, i.e., n)
Number of columns in X: 7 (Should be equal to the number of features)
Number of elements in y: 10000 (Should be equal to the number of samples, i.e., n)
Proportion of errors in synthetic data: 0.1988 (Should be close to (1-accuracy rate), provided a large enough n)
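
How close is "close to (1-accuracy rate)"? Assuming each sample is independently incorrect with probability `1 - accuracy_rate`, as the description of the generator suggests, the observed error rate should fluctuate around 0.2 with a binomial standard error. The short calculation below is a sketch under that assumption; the variable names are chosen only for this example.

```python
import math

n_samples = 10_000
accuracy = 0.8
observed_error_rate = 0.1988  # value reported in the output above

expected = 1 - accuracy
# Binomial standard error of a proportion estimated from n_samples draws.
std_error = math.sqrt(expected * (1 - expected) / n_samples)
z = (observed_error_rate - expected) / std_error

print(f"expected error rate: {expected:.4f}")
print(f"standard error:      {std_error:.4f}")
print(f"z-score of observed: {z:.2f}")
```

The standard error comes out at about 0.004, so the observed value of 0.1988 lies roughly 0.3 standard errors below the expected 0.2, which is well within ordinary sampling variation.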


Finally, let us look at the first 10 rows of X and y to get a glimpse of what the data generated looks like. 

In [64]:
print("The first 10 samples in X") 
for i in range(10): 
    print(X[i])
print("The first 10 samples in y") 
for i in range(10): 
    print(y[i])

The first 10 samples in X
[0 0 0 0 0 0 1]
[0 0 1 0 0 0 1]
[1 1 1 1 0 0 1]
[1 0 1 0 0 0 0]
[0 0 1 1 1 0 0]
[1 1 1 1 0 1 0]
[0 0 1 1 0 0 1]
[1 1 1 0 1 0 1]
[0 0 1 1 0 0 1]
[1 0 1 1 1 1 0]
The first 10 samples in y
it
she
I
them
them
my
she
we
she
your


Consider the first sample for illustration. Its feature specification is [0 0 0 0 0 0 1], which has a negative value for every feature except the last one, i.e., [−Participant −Author −Animate −Feminine −Plural −Possessive +Subject]. The expected morph for this feature combination is *it* (third-person inanimate non-possessive pronoun), which is also (correctly) the corresponding morph in y. In contrast, the second feature specification is [0 0 1 0 0 0 1], which is [−Participant −Author +Animate −Feminine −Plural −Possessive +Subject]. The correct morph for this specification is *he*, but the corresponding morph in y (i.e., the second morph) is *she*, which is an example of the error introduced by the `generate_data_from_csv` function.

## Training a decision tree

Once the synthetic dataset is generated, the next step is to train a decision tree classifier, from which we will later extract the mappings between feature specifications and morphs. The decision tree is trained to predict the morph (class label) for any given feature specification (input vector).

The function `train_tree` takes three parameters:

- `X_train` – the training feature vectors (binary encodings of feature specifications).
- `y_train` – the corresponding morphs for the training set.
- `min_imp_dec` – the minimum impurity decrease, a hyperparameter that controls how much information gain is required for the tree to make a split. Larger values of `min_imp_dec` make the tree simpler and more interpretable, while smaller values allow the tree to capture more fine-grained distinctions but risk overfitting.
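
To see what a minimum impurity decrease threshold does, the sketch below computes the weighted impurity decrease of a single split by hand. It assumes Gini impurity and the weighting that scikit-learn documents for its `min_impurity_decrease` parameter; whether `train_tree` uses this criterion internally is an assumption, and the helper names `gini` and `impurity_decrease` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease for splitting `parent` into `left` and `right`."""
    n_t = len(parent)
    weighted_children = (len(left) / n_t) * gini(left) + (len(right) / n_t) * gini(right)
    # The decrease is scaled by the share of all samples that reach this node.
    return (n_t / n_total) * (gini(parent) - weighted_children)

# Toy node: splitting on Plural separates "they" from the singular pronouns.
parent = ["they"] * 4 + ["he", "she", "it", "it"]
left = ["he", "she", "it", "it"]   # -Plural
right = ["they"] * 4               # +Plural

gain = impurity_decrease(parent, left, right, n_total=len(parent))
print(gain)                                 # 0.34375
print("split kept at 0.01:", gain >= 0.01)  # True
print("split kept at 0.5: ", gain >= 0.5)   # False
```

A split is made only if its decrease is at least the threshold, so raising the threshold prunes away splits that separate only a small share of the data or that reduce impurity only slightly.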

The trained tree is then evaluated on the held-out test set. Performance is reported using:
- Macro F1 score – measures how well the model balances precision and recall across all morphs.
- Classification report – shows precision, recall, and F1 for each morph class.
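
As a reminder of what the macro F1 score measures, the following sketch computes it by hand on a tiny made-up example: F1 is calculated separately for each morph and then averaged without weighting by class frequency. The function `macro_f1` and the lists `gold` and `pred` are illustrative only; in the notebook itself the score comes from scikit-learn's `f1_score` with `average="macro"`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall (0 if both are 0).
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

gold = ["he", "he", "she", "she", "it", "it"]
pred = ["he", "she", "she", "she", "it", "he"]
print(round(macro_f1(gold, pred), 3))  # 0.656
```

Here the per-class F1 scores are 0.5 for *he*, 0.8 for *she* and about 0.667 for *it*, so rare and frequent morphs contribute equally to the average.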


For this illustration, let us use the synthetic data generated above, and split it into 80% training data and 20% testing data. Here, we use a `min_imp_dec` value of 0.01; we will consider the implications of changing this value later. The macro F1 score is a moderate 0.776, but this is largely because we chose a moderate accuracy rate of 80% when generating the data (both the training and testing data), so roughly one in five test labels is itself incorrect.

In [65]:
from src.train_tree import train_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = train_tree(X_train, y_train, min_imp_dec=0.01)

y_pred = clf.predict(X_test)
print("\nTest macro F1 score:", f1_score(y_test, y_pred, average="macro"))
print("\nClassification report:\n", classification_report(y_test, y_pred))


Test macro F1 score: 0.7756862546611122

Classification report:
               precision    recall  f1-score   support

           I       0.84      0.78      0.81       102
          he       0.81      0.66      0.73        59
         her       0.80      0.76      0.78       100
         him       0.88      0.56      0.69        66
         his       0.77      0.57      0.65        63
          it       0.80      0.76      0.78       101
         its       0.77      0.69      0.72        67
          me       0.80      0.73      0.77       101
          my       0.82      0.72      0.77        92
         our       0.79      0.76      0.77        90
         she       0.78      0.63      0.70        60
       their       0.73      0.86      0.79       128
        them       0.83      0.84      0.84       142
        they       0.76      0.81      0.78       129
          us       0.81      0.79      0.80       101
          we       0.83      0.86      0.85        99
         you   

## Interpreting the decision tree: extracting form-meaning mappings

Finally, we can extract form–meaning mappings from the trained decision tree. The function `interpret_tree` does this by taking two inputs:
- the decision tree classifier (`clf`), and
- the list of feature names (`feature_names`).

It returns a list of interpretable rules that describe which combinations of feature values lead to which morphs. These rules are expressed in a linguist-readable format, closely resembling the generalizations that linguists propose in morphological analyses.
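
The basic idea of reading rules off a tree can be sketched as follows: every path from the root to a leaf yields one rule, whose left-hand side lists the feature values tested along the path and whose right-hand side is the morph predicted at the leaf. The nested-dictionary tree `TOY_TREE` and the function `extract_rules` are hypothetical and only illustrate the principle; `interpret_tree` operates on a fitted scikit-learn classifier (whose structure is exposed through `clf.tree_`) and may work differently.

```python
# Hypothetical tree over three features; each internal node tests one binary feature.
TOY_TREE = {
    "feature": "Plural",
    "plus": "they",
    "minus": {
        "feature": "Animate",
        "minus": "it",
        "plus": {"feature": "Feminine", "minus": "he", "plus": "she"},
    },
}

def extract_rules(node, path=()):
    """Return one '[features] <--> morph' string per leaf, minus branch first."""
    if isinstance(node, str):  # leaf: the predicted morph
        return [f"[{', '.join(path)}] <--> {node}"]
    name = node["feature"]
    return (
        extract_rules(node["minus"], path + (f"-{name}",))
        + extract_rules(node["plus"], path + (f"+{name}",))
    )

for rule in extract_rules(TOY_TREE):
    print(rule)
# [-Plural, -Animate] <--> it
# [-Plural, +Animate, -Feminine] <--> he
# [-Plural, +Animate, +Feminine] <--> she
# [+Plural] <--> they
```

Features that are never tested on a path simply do not appear in the rule, which is how an underspecified rule such as `[+Plural] <--> they` arises.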

In [67]:
from src.interpret_tree import interpret_tree

specifications = interpret_tree(clf, feature_names)

print("Extracted morphological rules:")
for spec in specifications:
    print(spec)

Extracted morphological rules:
[-Participant, -Plural, -Animate, -Possessive] <--> it
[-Participant, -Plural, -Animate, +Possessive] <--> its
[-Participant, -Plural, +Animate, -Feminine, -Possessive, -Subject] <--> him
[-Participant, -Plural, +Animate, -Feminine, -Possessive, +Subject] <--> he
[-Participant, -Plural, +Animate, -Feminine, +Possessive] <--> his
[-Participant, -Plural, +Animate, +Feminine, -Subject] <--> her
[-Participant, -Plural, +Animate, +Feminine, +Subject] <--> she
[-Participant, +Plural, -Possessive, -Subject] <--> them
[-Participant, +Plural, -Possessive, +Subject] <--> they
[-Participant, +Plural, +Possessive] <--> their
[+Participant, -Author, -Possessive] <--> you
[+Participant, -Author, +Possessive] <--> your
[+Participant, +Author, -Plural, -Subject, -Possessive] <--> me
[+Participant, +Author, -Plural, -Subject, +Possessive] <--> my
[+Participant, +Author, -Plural, +Subject] <--> I
[+Participant, +Author, +Plural, -Subject, -Possessive] <--> us
[+Participant

One can verify that this gives us the correct distribution for the different morphs. Note also that the model has made the appropriate generalizations, because it does not have duplicate rules for different instances of the same morph. For instance, instead of having separate rules with +Feminine and -Feminine for *they*, it has a single rule that is not specified for gender. Likewise, the pronoun *it* can be used in both subject and object positions, so instead of having two rules for *it*, we only have one that is not specified for the feature Subject. Whether or not the model is able to make generalizations like this depends on the hyperparameters chosen, in particular on the minimum impurity decrease. Let us see how increasing or decreasing this hyperparameter changes the kind of rules learnt.

Choosing a much lower value for the minimum impurity decrease prevents the learner from forming generalizations, and encourages specific rules for each instance of a morph. Consider below what happens when `min_imp_dec=0.001`. We get two different rules for *it*: one for its subject use (+Subject) and one for its object use (-Subject), and three different rules for *they*: one for inanimates (-Animate), one for masculine (+Animate -Feminine) and one for feminine (+Animate +Feminine).

In [71]:
clf2 = train_tree(X_train, y_train, min_imp_dec=0.001)
specifications2 = interpret_tree(clf2, feature_names)

print("Extracted morphological rules for min_imp_dec=0.001:")
for spec in specifications2:
    print(spec)

Extracted morphological rules for min_imp_dec=0.001:
[-Participant, -Plural, -Animate, -Possessive, -Subject] <--> it
[-Participant, -Plural, -Animate, -Possessive, +Subject] <--> it
[-Participant, -Plural, -Animate, +Possessive] <--> its
[-Participant, -Plural, +Animate, -Feminine, -Possessive, -Subject] <--> him
[-Participant, -Plural, +Animate, -Feminine, -Possessive, +Subject] <--> he
[-Participant, -Plural, +Animate, -Feminine, +Possessive] <--> his
[-Participant, -Plural, +Animate, +Feminine, -Subject, -Possessive] <--> her
[-Participant, -Plural, +Animate, +Feminine, -Subject, +Possessive] <--> her
[-Participant, -Plural, +Animate, +Feminine, +Subject] <--> she
[-Participant, +Plural, -Possessive, -Subject, -Animate] <--> them
[-Participant, +Plural, -Possessive, -Subject, +Animate, -Feminine] <--> them
[-Participant, +Plural, -Possessive, -Subject, +Animate, +Feminine] <--> them
[-Participant, +Plural, -Possessive, +Subject, -Animate] <--> they
[-Participant, +Plural, -Possessi

On the other hand, choosing a much higher value for `min_imp_dec`, such as 0.1, leads to the learner not learning specifications for all morphs. Below, the learner failed to learn *its*, *he*, *his*, *him*, *she*, *I*, *my*, *we* and *our*.

In [72]:
clf3 = train_tree(X_train, y_train, min_imp_dec=0.1)
specifications3 = interpret_tree(clf3, feature_names)

print("Extracted morphological rules for min_imp_dec=0.1:")
for spec in specifications3:
    print(spec)

Extracted morphological rules for min_imp_dec=0.1:
[-Participant, -Plural, -Animate] <--> it
[-Participant, -Plural, +Animate] <--> her
[-Participant, +Plural, -Possessive, -Subject] <--> them
[-Participant, +Plural, -Possessive, +Subject] <--> they
[-Participant, +Plural, +Possessive] <--> their
[+Participant, -Author, -Possessive] <--> you
[+Participant, -Author, +Possessive] <--> your
[+Participant, +Author, -Plural] <--> me
[+Participant, +Author, +Plural] <--> us
