# Chapter 5: Meaning representations

## Preparations

Load necessary packages.

In [None]:
using JudiLing, CSV, DataFrames, Plots

Load the Dutch dataset we will be working with.

In [None]:
# Adjust the filepath to the location of your dutch.csv file.
dutch = JudiLing.load_dataset("../dat/dutch.csv");
# Subset the columns of interest for our present purposes
dutch = dutch[:,[:Ortho, :Word, :Number, :WordCat, :Lexeme, :Syllables, :Frequency]];

## Meaning representation

### One-hot encoding

Here, the semantic features are simply the words themselves, so we set `features_col=:Ortho`:

In [None]:
pS_obj = JudiLing.make_pS_matrix(dutch,
                            features_col=:Ortho);
JudiLing.display_matrix(dutch, :Ortho, pS_obj, pS_obj.pS, :pS)
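
To make the idea concrete, here is a minimal sketch of a one-hot matrix built in plain Julia. The word list is a hypothetical toy example, not taken from `dutch.csv`, and the code only illustrates the encoding; it is not how `make_pS_matrix` is implemented.

```julia
# Toy word list (hypothetical; not taken from dutch.csv)
words = ["brug", "bruggen", "loop"]

# Each distinct word is its own feature, so word i switches on feature i
feature_index = Dict(w => i for (i, w) in enumerate(unique(words)))

# Rows are word forms, columns are features
onehot = zeros(Int, length(words), length(feature_index))
for (row, w) in enumerate(words)
    onehot[row, feature_index[w]] = 1
end

onehot
```

Every row contains exactly one 1, so no two distinct words share any semantic feature under this encoding.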

### Multiple-hot encoding

Here, we need a `features_col` with the word's lexeme, its number and its word category. This can easily be created by pasting together the `Lexeme`, `Number` and `WordCat` columns:

In [None]:
dutch[:,"features"] = string.(dutch.Lexeme, "_", dutch.Number, "_", dutch.WordCat)
first(dutch, 5)

Now we use the `features` column as `features_col`:

In [None]:
pS_obj = JudiLing.make_pS_matrix(dutch,
                            features_col=:features);
JudiLing.display_matrix(dutch, :Ortho, pS_obj, pS_obj.pS, :pS)
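
As a rough illustration of what a multiple-hot matrix encodes, the sketch below splits underscore-separated feature strings and switches on one column per feature. The feature strings are hypothetical toy data, and splitting on `"_"` is an assumption made for this sketch rather than a statement about JudiLing's internals.

```julia
# Toy feature strings in the same Lexeme_Number_WordCat format (hypothetical data)
features = ["brug_singular_noun", "brug_plural_noun", "loop_singular_verb"]

# Split each string into its individual features
tokens = [split(f, "_") for f in features]

# Collect the distinct features in order of first appearance
vocab = unique(reduce(vcat, tokens))
f2i = Dict(f => i for (i, f) in enumerate(vocab))

# Rows are word forms, columns are features
multihot = zeros(Int, length(features), length(vocab))
for (row, toks) in enumerate(tokens), t in toks
    multihot[row, f2i[t]] = 1
end

multihot
```

Unlike the one-hot case, rows now overlap: the two forms of `brug` share the `brug` and `noun` columns.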

### Using word embeddings from NLP

There are various ways in which word embeddings can be loaded. The first is to load a prepared set of word embeddings. These must already be in the same order as the dataset we are working with.

In [None]:
S, words = JudiLing.load_S_matrix("../dat/dutch_w2v.csv", 
                                header = false, sep = ",");
JudiLing.display_matrix(dutch, :Ortho, S, S, :S)

Make sure that the rows in the S matrix are in the same order as the rows in the dutch dataset:

In [None]:
all(words .== dutch.Ortho)
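
If this check returns `false`, the rows of the embedding matrix have to be reordered before use. The following is a minimal sketch of one way to do this with `indexin` from Base Julia; the word lists and the matrix are hypothetical toy data.

```julia
# Toy data (hypothetical): the embedding rows are in a different order than the dataset
dataset_words   = ["brug", "bruggen", "loop"]
embedding_words = ["loop", "brug", "bruggen"]
S_unordered = [3.0 3.1;
               1.0 1.1;
               2.0 2.1]

# For each dataset word, find its row in the embedding matrix
positions = indexin(dataset_words, embedding_words)
any(isnothing, positions) && error("some words have no embedding")
order = Vector{Int}(positions)

# Reorder the rows so that they line up with the dataset
S_aligned = S_unordered[order, :]
all(embedding_words[order] .== dataset_words)
```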

Alternatively, a set of word embeddings can be loaded and subset on the fly. For fasttext vectors there are two methods. The first is to let Julia download the vectors in the background.
The method `load_S_matrix_from_fasttext` takes as arguments the dataframe we are working with, the language code of the dataframe's language (here `:nl`, which stands for Dutch), and the target column in the dataframe with the orthographic representation of the words.
Finally, there is an optional fourth parameter, which allows you to specify which set of fasttext vectors should be loaded.

To see which sets of vectors are available, we first need to install an additional package called `Embeddings` and make it available in our session.

In [None]:
using Pkg
Pkg.add(name="Embeddings", version="0.4.6") # there is a bug in v0.4.5, so avoid that version
using Embeddings

Then we call:

In [None]:
language_files(FastText_Text{:nl})

Replace the language code with the code of the language you are interested in. In this case, two sets of vectors are available. By default, the first one is loaded. If prompted to do so, type `y` in the input field to download the embeddings.

In [None]:
dutch_small, S_auto = JudiLing.load_S_matrix_from_fasttext(dutch, :nl, target_col=:Ortho)

If you want to use an alternative set of fasttext files, you can use `JudiLing.load_S_matrix_from_fasttext_file`. For word2vec, `JudiLing.load_S_matrix_from_word2vec_file` is available.

In [None]:
# First replace the filepath below with the path to the fasttext file of interest,
# then uncomment the lines of code and run them

# dutch_small, S_auto = JudiLing.load_S_matrix_from_fasttext_file(dutch,
#     "path/to/downloaded/fasttext_vectors.vec",
#     target_col=:Ortho)

In [None]:
# First replace the filepath below with the path to the word2vec file of interest,
# then uncomment the lines of code and run them

# dutch_small, S_auto = JudiLing.load_S_matrix_from_word2vec_file(dutch,
#     "path/to/downloaded/word2vec_vectors.txt",
#     target_col=:Ortho)

### Simulating semantic vectors

Semantic vectors are simulated by adding vectors for a word's lexeme (lemma) plus vectors for any grammatical features of interest (number and word category in this case).

The lexome matrix contains vectors for all lexemes (lemmas), as well as all features (number and word category). These vectors can later be used to construct individual semantic vectors.

It is constructed using the `JudiLing.make_L_matrix` function, which takes as parameters our dutch dataset, the list of columns for which "base" vectors should be generated (only the lexeme column in most cases), the list of columns for which feature vectors should be generated (number and word category in our case) and finally the dimensionality of the vectors:

In [None]:
L = JudiLing.make_L_matrix(
    dutch,
    ["Lexeme"],
    ["Number", "WordCat"],
    ncol=200)
JudiLing.display_matrix(dutch, :Ortho, L, L.L, :S)

What's in the L object?

In [None]:
?L

We have the `L` matrix itself, which we have just looked at. Then we have again two mappings: `f2i` and `i2f`. They are useful for getting from a word to its lexome vector.

Let's say we want to get the lexome vector of `brug`. We first need to find the correct row for `brug`:

In [None]:
L.f2i["brug"]

Then we can use this number to look up the lexome vector of `brug` in the `L` matrix:

In [None]:
L.L[L.f2i["brug"],1:7]

We do the same to get the inflectional vector of `plural` and of `noun`:

In [None]:
L.L[L.f2i["plural"],1:7]

In [None]:
L.L[L.f2i["noun"],1:7]

To get the semantic vector of `bruggen`, the plural of `brug`, add the lexome vector of `brug` to the inflectional vectors of `plural` and `noun`:

In [None]:
# bruggen
bruggen = L.L[L.f2i["brug"],:] .+ L.L[L.f2i["plural"],:] .+ L.L[L.f2i["noun"],:]
bruggen[1:7]

This can also be done directly with `make_S_matrix`. Internally, it generates lexome vectors and feature vectors and adds them together. It takes the same parameters as `make_L_matrix`.

In [None]:
S = JudiLing.make_S_matrix(
    dutch,
    ["Lexeme"],
    ["Number", "WordCat"],
    ncol=200)
JudiLing.display_matrix(dutch, :Ortho, S, S, :S)
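
The additive construction can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: the unit names and word forms are toy data, the vectors are drawn with `randn`, and the details of how JudiLing samples its vectors (for example the distribution, or whether noise is added) may differ.

```julia
using Random

Random.seed!(1)  # only to make the sketch reproducible
ncol = 5         # toy dimensionality; the chapter uses 200

# One random vector per lexeme and per inflectional feature (hypothetical units)
units = ["brug", "loop", "singular", "plural", "noun", "verb"]
vectors = Dict(u => randn(ncol) for u in units)

# Each word form is described by its lexeme plus its features
rows = [("brug", "singular", "noun"),
        ("brug", "plural", "noun"),
        ("loop", "singular", "verb")]

# The semantic vector of a word form is the sum of the vectors of its units
S_sketch = reduce(vcat, [permutedims(sum(vectors[u] for u in r)) for r in rows])
size(S_sketch)
```

Because the first two rows share the `brug` and `noun` vectors, they end up more similar to each other than to the row for `loop`, which is the intended effect of the simulation.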

### Imputing semantic vectors of lexemes and inflectional features 

First we need a matrix with information about which word forms include which lexemes and inflectional features. Here, we can make use of `make_pS_matrix` together with the column of combined inflectional features we have created above:

In [None]:
dutch.features[1:5]

Next, we create what we call the "lexome" matrix, i.e. the matrix containing information about which features are present for which word forms. Note that this is the same matrix that we created above as a "multiple-hot encoding" of semantics:

In [None]:
pS_obj = JudiLing.make_pS_matrix(dutch, features_col=:features);

The matrix itself can be found in `pS_obj.pS`:

In [None]:
JudiLing.display_matrix(dutch, :Ortho, pS_obj, pS_obj.pS, :pS)

Next, we require an embedding space for which we want to impute lexome and inflectional semantic vectors. We reuse the word2vec space from above:

In [None]:
S, words = JudiLing.load_S_matrix("../dat/dutch_w2v.csv", header = false, sep = ",");
JudiLing.display_matrix(dutch, :Ortho, S, S, :S)

Finally, we impute the lexome and inflectional vectors by means of:

In [None]:
W = JudiLing.make_transform_matrix(pS_obj.pS, S);

In [None]:
JudiLing.display_matrix(dutch, :Ortho, pS_obj, W, :F, ncol=5, nrow=5)

As you can see, the resulting matrix `W` contains a row vector for each lexome as well as for the inflectional features `noun`, `verb`, `singular`, `plural`. We could now use the imputed semantic vectors to again construct semantic vectors similar to how we have described above in section "Simulating semantic vectors".
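
Conceptually, the imputation finds a matrix `W` such that the multiple-hot matrix multiplied by `W` approximates the embedding matrix. The sketch below shows this as a least-squares problem solved with `pinv` from the `LinearAlgebra` standard library. The matrices are hypothetical toy data, and the use of the pseudoinverse is an assumption made for illustration; the solver inside `make_transform_matrix` may differ.

```julia
using LinearAlgebra

# Toy multiple-hot matrix (hypothetical).
# Rows: brug_sg, brug_pl, loop_sg, loop_pl; columns: brug, loop, singular, plural
pS_toy = [1.0 0 1 0;
          1   0 0 1;
          0   1 1 0;
          0   1 0 1]

# Toy "true" vectors for the four units, one row per unit
W_true = [ 1.0 0.0;
           0.0 1.0;
           0.5 0.5;
          -0.5 0.5]

# Word embeddings that are exactly additive in their units
S_emb = pS_toy * W_true

# Least-squares estimate of the unit vectors; rtol guards against
# numerically tiny singular values, since pS_toy is rank deficient
W_hat = pinv(pS_toy; rtol=1e-10) * S_emb

# The imputed vectors reproduce the word embeddings
pS_toy * W_hat ≈ S_emb
```

Note that `W_hat` need not equal `W_true`: the columns of `pS_toy` are linearly dependent (lexemes and numbers both sum to a column of ones), so several sets of unit vectors yield the same word embeddings.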

## Exercises

### Preparation

Load the Latin dataset.

In [None]:
latin = JudiLing.load_dataset("../dat/latin.csv")

### Exercise 1

One-hot semantic matrix:

In [None]:
pS_oh = JudiLing.make_pS_matrix(latin,
                            features_col=:Word);
JudiLing.display_matrix(latin, :Word, pS_oh, pS_oh.pS, :pS)

### Exercise 2

Multiple-hot matrix:

In [None]:
latin[:,"features"] = string.(latin.Lexeme,"_", latin.Person,"_", latin.Number, "_", latin.Tense, "_", latin.Voice, "_", latin.Mood)
first(latin, 5)

In [None]:
pS_mh = JudiLing.make_pS_matrix(latin,
                            features_col=:features);
JudiLing.display_matrix(latin, :Word, pS_mh, pS_mh.pS, :pS)

### Exercise 3

Fasttext semantic vectors:

In [None]:
latin_small, S_ft = JudiLing.load_S_matrix_from_fasttext(latin, :la, target_col=:Word)

In [None]:
# The rows of S_ft correspond to latin_small, not to the full latin dataset
JudiLing.display_matrix(latin_small, :Word, S_ft, S_ft, :S)

Note that fasttext vectors are available for only 69 word forms. One factor that certainly contributes to this is that the word forms in the dataset are coded to represent long vowels, e.g.:

In [None]:
latin[latin.Word .== "vocaamus",:]
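
One possible way to increase the coverage is to look up a version of each word form without the length coding. The sketch below assumes that long vowels are coded by doubling the vowel letter, as `vocaamus` suggests; the helper `collapse_long_vowels` is hypothetical and not part of JudiLing. Genuine double vowels (as in `suus`) would also be collapsed, so the result should be checked by hand.

```julia
# Hypothetical helper: collapse a doubled vowel letter into a single one.
# Assumes long vowels are coded by doubling, e.g. "vocaamus" for vocāmus.
collapse_long_vowels(w::AbstractString) = replace(w, r"([aeiou])\1" => s"\1")

collapse_long_vowels("vocaamus")
```

The collapsed forms could be stored in a new column and passed as `target_col` when loading the fasttext vectors.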

### Exercise 4

Simulated semantic matrix

In [None]:
S_sim = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol=300)
JudiLing.display_matrix(latin, :Word, S_sim, S_sim, :S)

### Exercise 5

Impute vectors for all features and lexemes:

In [None]:
pS_la = JudiLing.make_pS_matrix(latin_small, features_col=:features);

In [None]:
W_la = JudiLing.make_transform_matrix(pS_la.pS, S_ft);

In [None]:
JudiLing.display_matrix(latin, :Word, pS_la, W_la, :F, ncol=5, nrow=22)