# Chapters 3 & 4: Our Dutch dataset and Form representations

In [None]:
# if you haven't installed the packages in Chapter 2, do so now
using Pkg
Pkg.add("JudiLing")
Pkg.add("Plots")
Pkg.add("DataFrames")

## Preparations and Dutch dataset

Load necessary packages.

In [None]:
using JudiLing, Plots, DataFrames

Load the Dutch dataset we will be working with.

In [None]:
# Adjust the filepath to the location of your dutch.csv file.
dutch = JudiLing.load_dataset("../dat/dutch.csv");
dutch = dutch[:,[:Ortho, :Word, :Number, :WordCat, :Lexeme, :Syllables, :Frequency]];

Take a look at the first 5 lines of the dataset.

In [None]:
first(dutch, 5)

### Exercises

#### Exercise 1

How many plurals and singulars?

In [None]:
nrow(dutch[dutch.Number .== "singular",:])

In [None]:
nrow(dutch[dutch.Number .== "plural",:])

Alternatively:

In [None]:
combine(groupby(dutch, :Number), nrow)

How many verbs and nouns?

In [None]:
nrow(dutch[dutch.WordCat .== "verb",:])

In [None]:
nrow(dutch[dutch.WordCat .== "noun",:])

Or alternatively:

In [None]:
combine(groupby(dutch, :WordCat), nrow)

How many singular and plural verbs, singular and plural nouns?

In [None]:
nrow(dutch[(dutch.WordCat .== "verb") .& (dutch.Number .== "singular") ,:])

In [None]:
nrow(dutch[(dutch.WordCat .== "verb") .& (dutch.Number .== "plural") ,:])

In [None]:
nrow(dutch[(dutch.WordCat .== "noun") .& (dutch.Number .== "singular") ,:])

In [None]:
nrow(dutch[(dutch.WordCat .== "noun") .& (dutch.Number .== "plural") ,:])

Or alternatively:

In [None]:
combine(groupby(dutch, [:WordCat, :Number]), nrow)

#### Exercise 2

Proportion of unique word forms (multiply by 100 to get the percentage):

In [None]:
length(unique(dutch.Word))/length(dutch.Word)

#### Exercise 3

Word with highest frequency

In [None]:
findmax(dutch.Frequency)

In [None]:
# look up the row at the index returned by findmax (818 here)
dutch[argmax(dutch.Frequency), :]
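
`findmax` returns a tuple of the maximum value and its index, and that index can then be used to look up the corresponding row. A minimal sketch on a made-up frequency vector (the numbers are illustrative and not taken from `dutch`):

```julia
# Hypothetical frequencies for three word forms.
freqs = [12, 250, 7]

# findmax returns (maximum value, index of the maximum).
max_val, max_idx = findmax(freqs)

# argmax returns only the index, which is convenient for row lookups.
same_idx = argmax(freqs)
```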

#### Exercise 4

Number of words with more than one syllable:

In [None]:
# split the syllables by the separator "-"
syllables = split.(dutch.Syllables, "-")
# count the number of syllables in each word
n_syllables = length.(syllables)
# count how many words have more than one syllable
sum(n_syllables .> 1)

## Form representation

### Visual

Generate cue object with bigrams:

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch,
                                   grams=2,
                                   target_col="Ortho");
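
To make the idea of a cue matrix concrete, the sketch below builds a small bigram matrix by hand in plain Julia. It is not JudiLing's implementation: the helper `ngrams`, the toy words, and the use of `#` as a word boundary marker are assumptions made for illustration.

```julia
# Toy word forms (hypothetical, not taken from the Dutch dataset).
words = ["kat", "kar"]

# Hypothetical helper: pad a word with a boundary marker and
# collect its overlapping n-grams.
function ngrams(word::AbstractString; n::Int=2, boundary::Char='#')
    chars = collect(boundary * word * boundary)
    return [String(chars[i:(i + n - 1)]) for i in 1:(length(chars) - n + 1)]
end

word_grams = [ngrams(w) for w in words]

# Give each distinct bigram a column index, in order of first appearance.
f2i = Dict{String,Int}()
for grams in word_grams, g in grams
    get!(f2i, g, length(f2i) + 1)
end

# Binary cue matrix: one row per word, one column per bigram.
C = zeros(Int, length(words), length(f2i))
for (row, grams) in enumerate(word_grams), g in grams
    C[row, f2i[g]] = 1
end
```

The two words share the bigrams `#k` and `ka`, so their rows overlap in the first two columns and differ elsewhere.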

What's in `cue_obj`? Use the help pages to find out:

In [None]:
?cue_obj

Let's take a look at the actual cue matrix in `cue_obj.C`:

In [None]:
JudiLing.display_matrix(dutch, :Ortho, cue_obj, 
                        cue_obj.C, :C, nrow=6, ncol=6)

In [None]:
?JudiLing.display_matrix

`cue_obj.f2i` provides a bigram to column mapping, and `cue_obj.i2f` a column to bigram mapping.

In [None]:
cue_obj.f2i

In [None]:
cue_obj.i2f
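
The two mappings are inverses of each other. A small sketch with a made-up bigram inventory (the entries are hypothetical, not the contents of `cue_obj`) shows how one can be derived from the other:

```julia
# Hypothetical bigram-to-column mapping.
f2i = Dict("#k" => 1, "ka" => 2, "at" => 3, "t#" => 4)

# Invert it to obtain the column-to-bigram mapping.
i2f = Dict(i => f for (f, i) in f2i)

# Going from bigram to column and back returns the original bigram.
roundtrip = i2f[f2i["ka"]]
```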

`cue_obj.A` is the adjacency matrix of the bigrams: it marks which bigrams directly follow one another in the word forms of the Dutch dataset:

In [None]:
JudiLing.display_matrix(dutch, :Ortho, cue_obj, cue_obj.A, :A)
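
As a rough illustration of what an adjacency matrix over cues encodes, the following sketch marks cell `[i, j]` whenever cue `j` directly follows cue `i` in some word. The index vectors are made up, and the dense integer matrix is a simplification; this is assumed to mirror the idea behind `cue_obj.A`, not its actual construction.

```julia
# Hypothetical cue index sequences for two words,
# e.g. "#k","ka","at","t#" and "#k","ka","ar","r#".
cue_sequences = [[1, 2, 3, 4], [1, 2, 5, 6]]
n_cues = 6

# A[i, j] = 1 if cue j directly follows cue i in at least one word.
A = zeros(Int, n_cues, n_cues)
for inds in cue_sequences, k in 1:(length(inds) - 1)
    A[inds[k], inds[k + 1]] = 1
end
```

Cue 2 (`ka`) ends up with two possible successors, cues 3 and 5, because the two toy words diverge after it.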

`cue_obj.gold_ind` stores, for each word form, the column indices of its bigrams in the order in which they occur.

In [None]:
cue_obj.gold_ind
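
Such an index sequence can be translated back into a form with a column-to-bigram mapping. The sketch below uses a hypothetical mapping and index vector rather than the real `cue_obj` fields, and assumes that consecutive bigrams overlap by one character:

```julia
# Hypothetical column-to-bigram mapping and index sequence for one word.
i2f = Dict(1 => "#k", 2 => "ka", 3 => "at", 4 => "t#")
inds = [1, 2, 3, 4]

# Look up the bigram behind each index.
grams = [i2f[i] for i in inds]

# Consecutive bigrams overlap by one character, so the form is the first
# bigram followed by the last character of every later bigram.
form = grams[1] * join(last(g) for g in grams[2:end])
```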

To use trigrams instead of bigrams, `grams` needs to be set to 3:

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch,
                                   grams=3,
                                   target_col="Ortho");
JudiLing.display_matrix(dutch, :Ortho, cue_obj, cue_obj.C, :C)

### Auditory

As you can see, there is also a simple phonological representation of the words in the `Word` column in our dataset:

In [None]:
first(dutch, 5)

We can use the `Word` column in the same way as the `Ortho` column to generate cues:

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch,
                                   grams=2,
                                   target_col="Word");
JudiLing.display_matrix(dutch, :Word, cue_obj, cue_obj.C, :C)

Finally, we can also use bi-syllables instead of bigrams. `dutch` also contains a column with `Syllables`.

In [None]:
first(dutch, 5)

Consult the help pages to get information on how to use syllables instead of n-grams:

In [None]:
?JudiLing.make_cue_matrix

We need to use `target_col="Syllables"`. This column already marks syllable boundaries, so we set `tokenized=true`, and because syllables are separated by `-`, we set `sep_token="-"`. Finally, we want to keep the syllable boundaries in the generated bi-syllables, so we set `keep_sep=true`.

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch,
                                   grams=2,
                                   target_col="Syllables",
                                   tokenized=true,
                                   keep_sep=true,
                                   sep_token="-");
JudiLing.display_matrix(dutch, :Word, cue_obj, cue_obj.C, :C, ncol=10)
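
The effect of tokenizing on a separator and keeping it in the resulting units can be sketched in plain Julia. The helper `token_ngrams`, the toy form `ta-fel`, and the `#` boundary token are assumptions for illustration; the exact cue strings that JudiLing produces may differ.

```julia
# Hypothetical helper: split a form into tokens on `sep`, add boundary
# tokens, and join every run of n neighbouring tokens into one cue.
function token_ngrams(form::AbstractString; n::Int=2, sep::AbstractString="-",
                      boundary::AbstractString="#", keep_sep::Bool=true)
    tokens = String[boundary]
    append!(tokens, split(form, sep))
    push!(tokens, boundary)
    # With keep_sep=true the separator stays visible inside each cue.
    joiner = keep_sep ? sep : ""
    return [join(tokens[i:(i + n - 1)], joiner) for i in 1:(length(tokens) - n + 1)]
end

with_sep = token_ngrams("ta-fel")
without_sep = token_ngrams("ta-fel"; keep_sep=false)
```

Keeping the separator preserves where one syllable ends and the next begins, which would otherwise be lost in a cue such as `tafel`.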

Creating a cue matrix from Continuous Frequency Band Summaries (CFBS) features (Shafaei et al., 2023):

First, load a prepared dataframe with the feature vectors (the code for generating such a dataframe as well as the audio tokens the vectors are based on can be found [here](https://osf.io/tdja2/)).

In [None]:
df_cfbs = JudiLing.load_dataset("../dat/cfbs_example.csv");
first(df_cfbs)

The issue with the feature vectors is that they are currently represented as strings:

In [None]:
df_cfbs.features[1]

So we need to parse them into vectors of numbers. I will first show how to do this for the first row, and then we apply it to all rows of the dataframe. First, we remove the initial and final brackets `"["` and `"]"`:

In [None]:
nums = df_cfbs.features[1][2:(end-1)]

Next, we split them by `", "`:

In [None]:
nums_split = split(nums, ", ")

Finally, we need to convert the strings to floats:

In [None]:
parse.(Float64, nums_split)

Alternatively, we can perform the steps above in a single line:

In [None]:
cfbs_vecs = [parse.(Float64, split(features[2:(end-1)], ", ")) for features in df_cfbs.features];
cfbs_vecs[1]
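
If the strings are not guaranteed to use exactly `", "` as separator, a slightly more tolerant variant can help. This is a sketch under that assumption; `parse_features` is a hypothetical helper and the input string is made up:

```julia
# Hypothetical helper: turn a string such as "[0.5, 1.25,-3.0]"
# into a Vector{Float64}.
function parse_features(s::AbstractString)
    inner = strip(s, ['[', ']'])        # drop the surrounding brackets
    parts = strip.(split(inner, ','))   # split on commas, trim whitespace
    return parse.(Float64, parts)
end

example_vec = parse_features("[0.5, 1.25,-3.0]")
```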

This is now a float vector. However, the vectors have different lengths, e.g.:

In [None]:
@show length(cfbs_vecs[1]);
@show length(cfbs_vecs[2]);

We use `make_cue_matrix_from_CFBS` to turn the vectors of different lengths into a matrix where vectors are padded with zeros to make them all as long as the longest vector:

In [None]:
C = JudiLing.make_cue_matrix_from_CFBS(cfbs_vecs, pad_val=0.);
C[1:5, 1:5]
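
The padding step itself is simple to sketch in plain Julia. The function below illustrates the idea on toy vectors and is not the code of `make_cue_matrix_from_CFBS`; the name `pad_to_matrix` is hypothetical.

```julia
# Hypothetical helper: stack vectors of unequal length as the rows of a
# matrix, filling the missing cells on the right with `pad_val`.
function pad_to_matrix(vecs; pad_val=0.0)
    max_len = maximum(length, vecs)
    M = fill(pad_val, length(vecs), max_len)
    for (row, v) in enumerate(vecs)
        M[row, 1:length(v)] = v
    end
    return M
end

toy_vecs = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]]
M = pad_to_matrix(toy_vecs)
```

Here the first row is padded with two zeros, while the second row, being the longest vector, is left unchanged.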

Since we padded the vectors, we expect that all the way to the right of the matrix there are many zeros:

In [None]:
C[1:5, (end-5):end]

Indeed, the second row has the maximum length feature vector, but for all other vectors there is padding.

In [None]:
findmax([length(c) for c in cfbs_vecs])

### Exercises

#### Exercise 1

Load the Latin dataset.

In [None]:
latin = JudiLing.load_dataset("../dat/latin.csv");
first(latin, 5)

#### Exercise 2

Get the number of rows and columns in the dataset.

In [None]:
nrow(latin)

In [None]:
ncol(latin)

Or alternatively:

In [None]:
size(latin)

#### Exercise 3

Creating a cue object:

In [None]:
cue_obj_latin = JudiLing.make_cue_matrix(latin, grams=3, 
                                        target_col="Word");

#### Exercise 4

Displaying first 5 rows and columns of the C matrix:

In [None]:
JudiLing.display_matrix(latin, :Word, cue_obj_latin, 
    cue_obj_latin.C, :C,
    nrow = 5, ncol = 5)

#### Exercise 5

Getting the number of unique cues:

In [None]:
# option 1
length(cue_obj_latin.f2i)

In [None]:
# option 2
length(cue_obj_latin.i2f)

In [None]:
# option 3
size(cue_obj_latin.C, 2)