# Sampling linguistic data

## Preparing the data

Let's start by installing the [*conllu*](https://pypi.org/project/conllu/) library for Python, which allows parsing annotations in the CoNLL-U format.

Note that the code below is not Python: the exclamation mark at the start of the line tells the notebook to run the rest of the line as a shell command.

Run the cell below to install the *conllu* library.

In [None]:
!pip install conllu

Let's import the necessary modules (*pandas* as `pd` and *numpy* as `np`), functions (`parse`) and classes (`Path`).

In [None]:
from pathlib import Path
from conllu import parse

import pandas as pd
import numpy as np

Next, we create a Path object that points towards the directory `gum_conllu` on the server, which contains annotations from the [Georgetown University Multilayer corpus](https://github.com/amir-zeldes/gum) (GUM).

We then use the `glob` method to fetch every file (`*`) that has the suffix `conllu` stored in the directory.

We store the resulting *generator* object under the variable `files`.

We also create an empty list named `annotations` to hold the parsed annotations.

In [None]:
files = Path('gum_conllu').glob('*.conllu')

annotations = []
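
Because `glob` returns a generator, the matching files can be looped over only once. The sketch below illustrates this with a temporary directory; the file names are invented for the example and have nothing to do with the GUM corpus.

```python
import tempfile
from pathlib import Path

# Create a temporary directory with a few empty files (made-up names)
with tempfile.TemporaryDirectory() as tmp:

    tmp_dir = Path(tmp)

    for name in ('a.conllu', 'b.conllu', 'notes.txt'):
        (tmp_dir / name).write_text('', encoding='utf-8')

    # glob() returns a lazy iterator rather than a list
    demo_files = tmp_dir.glob('*.conllu')

    # The first pass consumes the generator ...
    first_pass = sorted(path.name for path in demo_files)

    # ... so a second pass over the same object yields nothing
    second_pass = sorted(path.name for path in demo_files)

print(first_pass)
print(second_pass)
```

If the files are needed more than once, one option is to wrap the call in `list()` or to call `glob` again.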

Next, we loop over the generator `files`, using the variable `file` to refer to each file yielded by the generator.

We open each file for reading using the `open` function and store the result under the variable `annotation_file`.

Next, we use the `read` method to read the contents of `annotation_file` and store the resulting string object under the variable `raw_annotation`.

We then call the `parse` function imported from the *conllu* library to parse the string object stored under the variable `raw_annotation`. We store the result under the variable `parsed_annotation`.

Finally, we append the parsed annotation to the list named `annotations`.

In [None]:
for file in files:
    
    with open(file, 'r', encoding='utf-8') as annotation_file:
        
        raw_annotation = annotation_file.read()
        
        parsed_annotation = parse(raw_annotation)
        
        annotations.append(parsed_annotation)

Next, we create another empty list, which we assign under the variable `data`.

We then loop over each annotated document in the list named `annotations`.

For each text, we first retrieve metadata about the **genre** of the document and the unique identifier of the document.

These are stored in the metadata of the first sentence of the document, which may be accessed using the index `[0]`.

This information is stored under the attribute `metadata` under the keys `meta::genre` and `newdoc id`.

Next, we loop over each sentence in the document.

We retrieve the sentence type or *mood* of the sentence, stored in the metadata under the key `s_type`. In addition, we get document-level information for genre and the unique identifier. We also count the number of tokens in the sentence using Python's `len` function.

We combine all of this information into a Python dictionary under the keys `mood`, `genre`, `document_id` and `tokens`.

In addition, we store the index (or 'position') of each document and sentence under the variables `doc_ix` and `sent_ix`.

In [None]:
data = []

for doc_ix, annotation in enumerate(annotations):
        
    genre = annotation[0].metadata['meta::genre']
    document_id = annotation[0].metadata['newdoc id']
    
    for sent_ix, sentence in enumerate(annotation):
        
        mood = sentence.metadata['s_type']
        
        data.append({'mood': mood, 'genre': genre, 'document_id': document_id, 'tokens': len(sentence), 'doc_ix': doc_ix, 'sent_ix': sent_ix})

This gives us a Python list populated by dictionaries.

Let's use the `len` function to check its length – this essentially gives us the number of sentences.

In [None]:
len(data)

Just to illustrate how we can recover linguistic data from the list `annotations`, let's fetch the sentence stored at index 235 in the list `data`.

In [None]:
data[235]

The values stored under the keys `doc_ix` and `sent_ix` can be used to access the information stored in the list `annotations`.

In [None]:
annotations[data[235]['doc_ix']][data[235]['sent_ix']].metadata
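
The same round trip can be tried on a toy example. In the sketch below, plain dictionaries stand in for the parsed sentences, and the documents, keys and texts are invented for illustration.

```python
# Two toy 'documents', each a list of 'sentences' (plain dictionaries
# standing in for parsed sentences; all values are made up)
toy_annotations = [
    [{'s_type': 'decl', 'text': 'It rained.'}, {'s_type': 'q', 'text': 'Did it?'}],
    [{'s_type': 'imp', 'text': 'Look outside.'}],
]

toy_data = []

# enumerate() yields the position of each item together with the item itself
for doc_ix, document in enumerate(toy_annotations):

    for sent_ix, sentence in enumerate(document):

        toy_data.append({'mood': sentence['s_type'], 'doc_ix': doc_ix, 'sent_ix': sent_ix})

# Use the stored positions to get back to the original sentence
row = toy_data[2]
recovered = toy_annotations[row['doc_ix']][row['sent_ix']]

print(recovered['text'])
```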

This illustrates how we are able to associate each sentence in the list `data` with the actual linguistic annotations.

Next, we convert the list of dictionaries into a *pandas* DataFrame by calling the `DataFrame` class to which we provide the list of dictionaries `data` as input.

We store the resulting DataFrame under the variable `df`.

In [None]:
df = pd.DataFrame(data)

A DataFrame is a tabular data structure for storing various types of data.

Let's call the variable `df` to examine its contents.

In [None]:
df

The DataFrame class has various useful methods for processing and manipulating tabular data.

Let's use the `sample` method to draw a random sample of 10 sentences.

For reproducibility, we also provide a value for the argument `random_state` - this number is used as the 'seed' for sampling and ensures we get the same result every time.

In [None]:
df.sample(n=10, random_state=42)
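
To see what the seed does, the sketch below draws two samples from a toy DataFrame (invented for this example) using the same `random_state` and checks that they contain exactly the same rows.

```python
import pandas as pd

# A toy DataFrame with 100 rows; the values are placeholders
toy_df = pd.DataFrame({'tokens': range(100)})

# Two samples drawn with the same seed
first = toy_df.sample(n=10, random_state=42)
second = toy_df.sample(n=10, random_state=42)

# equals() compares both the index and the values
print(first.equals(second))
```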

We can also count unique values in each column to get a better understanding of the data.

Let's count how many sentences we have from each genre.

To do so, we use the brackets and the string `genre` (note the single quotes that mark a string in Python) to select this column. We then call the `value_counts` method.

In [None]:
df['genre'].value_counts()

We can easily do the same for mood.

In [None]:
df['mood'].value_counts()

Whereas the columns *mood* and *genre* consist of categorical values, the column *tokens* contains numerical values. 

We can analyse them using the `describe` method, for example.

In [None]:
df['tokens'].describe()

The count gives the number of sentences included in the calculation. The remaining values provide the following information:

 - mean: the average number of tokens per sentence
 - std: standard deviation, which indicates the spread of the data around the mean
 - min: the smallest value for tokens
 - 25%: 25% of the data have 7 tokens or fewer
 - 50%: 50% of the data have 15 tokens or fewer
 - 75%: 75% of the data have 25 tokens or fewer
 - max: the largest value for tokens
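
To make the percentile rows more concrete, here is a small sketch with invented token counts. It checks that the 50% row matches the median and shows which share of the values falls at or below the 25% value.

```python
import pandas as pd

# Invented token counts for nine sentences
toy_tokens = pd.Series([2, 4, 4, 6, 8, 10, 12, 14, 20])

summary = toy_tokens.describe()

# The 50% row is the median of the data
print(summary['50%'], toy_tokens.median())

# Share of sentences at or below the 25% value
share_below = (toy_tokens <= summary['25%']).mean()

print(summary['25%'], share_below)
```

Note that `describe` reports the sample standard deviation (dividing by *n* - 1), whereas NumPy's `np.std` divides by *n* by default, so the two can differ slightly for the same data.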
 
We can also plot the number of tokens across sentences using a histogram.

To do so, we call the `hist` method. The argument `bins` defines the 'bins' into which the observations are placed.

Here we use Python's `range` function to provide bin edges that run from 1 to 135 (the stop value 136 is not included).

In [None]:
df['tokens'].hist(bins=range(1, 136))
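
When `bins` is given as a sequence, the numbers are treated as bin edges. The sketch below uses NumPy's `histogram` function, which to our understanding follows the same convention as the plotting call, to show how invented token counts are placed into bins.

```python
import numpy as np

# Invented token counts
toy_tokens = [1, 2, 2, 3, 4, 4, 4]

# range(1, 6) yields the edges 1, 2, 3, 4 and 5, which define four bins
edges = list(range(1, 6))

counts, used_edges = np.histogram(toy_tokens, bins=edges)

# Every bin includes its left edge; only the last bin also includes its right edge
print(counts)
print(used_edges)
```

Five edges give four bins, so a sequence of edges always yields one bin fewer than its length. Values larger than the last edge are left out of the counts, which is worth keeping in mind when choosing the end point for `range`.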

## Sampling

We can think of the DataFrame stored under the variable `df` as our **sampling frame**, that is, a potential source of linguistic data.

For this dataframe, the **sampling unit** is a sentence.

We can also think of the information stored in the columns *mood* and *genre* as **strata** for sampling the sentences.

As illustrated above, we can use the `sample` method in *pandas* to draw samples from the DataFrame.

By providing the argument `n`, we can draw a random sample of *n* sentences from the DataFrame.

In [None]:
df.sample(n=300, random_state=42)

If we wish to draw a certain percentage of the data, we must use the argument `frac` instead.

In [None]:
df.sample(frac=0.1, random_state=42)

If we want to sample every fifth row in the DataFrame, we can use the `iloc` indexer.

The expression `[::5]` follows Python's slice syntax `[start:stop:step]`. Leaving `start` and `stop` empty selects rows from the beginning to the end, whereas the step `5` means that we take every fifth row, starting from the first row.

In [None]:
df.iloc[::5]
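
The sketch below shows the slice at work on a toy DataFrame (invented for this example), and how a different starting position changes which rows are picked.

```python
import pandas as pd

# A toy DataFrame with 12 rows
toy_df = pd.DataFrame({'sentence': [f's{i}' for i in range(12)]})

# Every fifth row, starting from the first row (position 0)
every_fifth = toy_df.iloc[::5]

# Every fifth row, starting from the third row (position 2)
offset = toy_df.iloc[2::5]

print(every_fifth.index.tolist())
print(offset.index.tolist())
```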

What if we would like to draw a **balanced sample** for each genre, which means that each genre would be equally represented in the data?

To do so, we can group the sentences according to their genre using the `groupby` method, which takes as input the name of the column used as the basis for grouping.

We store the resulting groups under the variable `genres`.

In [None]:
genres = df.groupby('genre')

Next, we create an empty DataFrame and store it under the variable `balanced_genres`.

This will be used to store our balanced sample.

In [None]:
balanced_genres = pd.DataFrame()

Next, we loop over `genres`: each group consists of its name and the actual group of DataFrame rows.

We refer to them as `name` and `genre` during the loop.

In the loop, we draw a random sample of 200 sentences from each genre and store the result under the variable `sample`.

We then use the `concat` function from pandas to concatenate our newly created DataFrame `balanced_genres` and the current `sample`.

In [None]:
for name, genre in genres:
    
    sample = genre.sample(n=200, random_state=42)
    
    balanced_genres = pd.concat([balanced_genres, sample], axis=0)
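
As an alternative sketch, recent versions of *pandas* (1.1 or newer, as far as we know) allow calling `sample` directly on the grouped object, which draws the same number of rows from every group without an explicit loop. The toy DataFrame and its genre labels are invented for this example.

```python
import pandas as pd

# A toy sampling frame with unevenly sized genres
toy_df = pd.DataFrame({
    'genre': ['news'] * 5 + ['fiction'] * 3 + ['academic'] * 4,
    'tokens': range(12),
})

# Draw two rows from each genre in a single call
balanced = toy_df.groupby('genre').sample(n=2, random_state=42)

# Replace the old index with a fresh one
balanced = balanced.reset_index(drop=True)

counts = balanced['genre'].value_counts()

print(counts)
```

In both approaches, asking for more rows than a group contains raises an error unless sampling with replacement is enabled, so the value of `n` cannot exceed the size of the smallest genre.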

Let's examine the result by calling the variable `balanced_genres`.

In [None]:
balanced_genres

As you can see, the new DataFrame uses the indices (first column) from the old DataFrame.

We can reset the index using the following command. The argument `drop` deletes the old index, whereas `inplace` modifies the existing DataFrame rather than returning a new one.

Calling the variable shows that the index has been reset.

In [None]:
balanced_genres.reset_index(drop=True, inplace=True)

balanced_genres

Let's check that each genre is represented by the same number of sentences.

In [None]:
balanced_genres['genre'].value_counts()

Let's assume that our balanced sample of **genres** provides us with a more accurate distribution of **mood** among the sentences – after all, we have much more data for some genres than others.

Let's compare the counts for **mood** in the original data and the balanced sample of genres.

Here we provide the argument `normalize` and set it to `True` to return proportions (relative frequencies between 0 and 1) rather than raw counts.

In [None]:
df['mood'].value_counts(normalize=True)

In [None]:
balanced_genres['mood'].value_counts(normalize=True)
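
The values returned with `normalize=True` lie between 0 and 1 and add up to 1. The short sketch below, which uses invented mood labels, shows how to turn them into percentages.

```python
import pandas as pd

# Invented mood labels for five sentences
toy_moods = pd.Series(['decl', 'decl', 'decl', 'imp', 'q'])

# Relative frequencies that sum to 1
proportions = toy_moods.value_counts(normalize=True)

# Multiply by 100 to get percentages
percentages = proportions * 100

print(proportions)
print(percentages)
```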

As you can see, the proportions are slightly different.

What if we would like to sample the original data for **mood** based on the proportions in the balanced sample?

It turns out that we can use the proportions from the balanced sample as *weights* for sampling!

You can think of these weights as reflecting the relative probability of each sentence being included in the sample.

Let's store the proportions under the variable `mood_weights`.

In [None]:
mood_weights = balanced_genres['mood'].value_counts(normalize=True)

Next, we use the `sample` method of a DataFrame to draw a sample of 1000 sentences from the sampling frame.

The `sample` method has the argument `weights`, which expects a weight to be associated with each row in the DataFrame.

We achieve this by mapping the value for **mood** in each row to the weights stored under `mood_weights`.

Put differently, we look up the probability for each mood and use this value as the weight.

In [None]:
mood_sample = df.sample(n=1000, weights=df['mood'].map(mood_weights), random_state=42)

Let's examine the resulting sample.

In [None]:
mood_sample

In [None]:
mood_sample['mood'].value_counts()

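Without the weights, each row has an equal probability of being sampled.

When using the weights, the probability of drawing a given row is proportional to the weight of its mood: each `decl` row, for example, carries a weight of ~0.66. Note that *pandas* rescales the row weights so that they sum to 1, which means that the share of each mood in the sample depends on both its weight and the number of rows with that mood in the sampling frame.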
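
Because the weights are attached to individual rows, it can be useful to check what they imply for each category as a whole. The sketch below uses an invented sampling frame and invented weights, and calculates the probability that the first row drawn belongs to each mood.

```python
import pandas as pd

# An invented sampling frame: 80 declaratives and 20 imperatives
toy_df = pd.DataFrame({'mood': ['decl'] * 80 + ['imp'] * 20})

# Invented weights for each mood
toy_weights = pd.Series({'decl': 0.2, 'imp': 0.8})

# Look up the weight for every row
row_weights = toy_df['mood'].map(toy_weights)

# Rescale the row weights so that they sum to 1
row_probs = row_weights / row_weights.sum()

# Add up the row probabilities within each mood
share_by_mood = row_probs.groupby(toy_df['mood']).sum()

print(share_by_mood)
```

Here each mood ends up with a probability of 0.5, although the weights are 0.2 and 0.8, because the frame contains four times as many declaratives as imperatives.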

## Estimating sample sizes

Let's continue by estimating sample sizes, that is, whether a corpus is large enough for studying some linguistic feature.

As pointed out by Egbert et al. ([2023](https://doi.org/10.1017/9781316584880), p. 130), estimating sample size is useful for answering two questions:

1. How many units of analysis (e.g. 'texts' or sentences) would be needed to reliably estimate the distribution of a linguistic feature?
2. Can an existing corpus be used to estimate the distribution of a linguistic feature?

Let's start by answering question #1 by using the sample stored under `mood_sample` to estimate the frequency of linguistic variables.

As a first step, let's start by retrieving the linguistic annotations using the indices stored in the DataFrame `mood_sample`.

First we convert the columns `doc_ix` and `sent_ix` of the DataFrame into Python lists.

We then use the `zip` function to combine these two lists, and iterate over these pairs, which we then use to retrieve items from the list `annotations`.

We store the resulting sentences into a list named `sentences`.

In [None]:
doc_ixs, sent_ixs = mood_sample['doc_ix'].tolist(), mood_sample['sent_ix'].tolist()

sentences = [annotations[doc_ix][sent_ix] for doc_ix, sent_ix in zip(doc_ixs, sent_ixs)]

Let's print out the first five sentences.

In [None]:
sentences[:5]

To estimate the size of a sample, we need to measure how frequently a linguistic feature occurs.

To do so, we define a Python function that takes the following inputs:

- a list of sentences annotated using the CoNLL-U schema and parsed using *conllu* (a list of `TokenList` objects)
- the name of a linguistic feature as a Python string (e.g. `'deprel'`, `'upos'`)
- the name of a tag as a Python string (e.g. `'advmod'`, `'NOUN'`)

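This function returns the **mean** number of occurrences of the tag per 1000 words and the **standard deviation** of these counts. Note that tokens left over after the last full chunk of 1000 words are not included in the calculation.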

In [None]:
def calculate_mean_per_1000_words(sentences, feat, tag):
    
    # Initialize an empty list to store occurrences per 1000 words
    occurrences_per_k = []
    
    # Initialize a counter for the current number of occurrences within the current 1000 words
    current_k = 0
    
    # Initialize a counter for the total number of words processed
    word_counter = 0
    
    # Iterate through each sentence in the input list of sentences
    for sent in sentences:
        
        # Iterate through each token in the current sentence
        for token in sent:
            
            # Check if the token's tag matches the specified feature
            if token[feat] == tag:
                
                # If it does, increment the current count of occurrences
                current_k += 1
            
            # Increment the total word counter regardless of the tag match
            word_counter += 1
            
            # Check if the total word count has reached or exceeded 1000
            if word_counter >= 1000:
                
                # If so, add the current count of occurrences to the list
                occurrences_per_k.append(current_k)
                
                # Reset both the current count of occurrences and the total word counter
                current_k = 0
                word_counter = 0
    
    # Calculate the mean and standard deviation of the occurrences per 1000 words using NumPy
    return np.mean(occurrences_per_k), np.std(occurrences_per_k)
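
To follow the counting logic on a small input, here is a sketch of the same idea with a hypothetical `window` parameter in place of the fixed 1000 words. The function name `count_per_window` is our own, and plain dictionaries stand in for the parsed tokens.

```python
from statistics import mean, pstdev


def count_per_window(sentences, feat, tag, window=1000):

    # Occurrences of the tag in each full window of tokens
    counts = []

    # Occurrences and tokens seen in the current window
    hits = 0
    seen = 0

    for sent in sentences:

        for token in sent:

            if token[feat] == tag:
                hits += 1

            seen += 1

            # Close the window once it is full and start a new one
            if seen >= window:
                counts.append(hits)
                hits = 0
                seen = 0

    # Tokens in an unfinished final window are not included
    return counts


# Invented toy sentences; each token is a dictionary with a 'upos' key
toy_sentences = [
    [{'upos': 'DET'}, {'upos': 'NOUN'}, {'upos': 'VERB'}],
    [{'upos': 'NOUN'}, {'upos': 'NOUN'}, {'upos': 'PUNCT'}],
    [{'upos': 'PRON'}, {'upos': 'VERB'}],
]

toy_counts = count_per_window(toy_sentences, 'upos', 'NOUN', window=3)

# pstdev() divides by n, which matches the default behaviour of np.std
print(toy_counts, mean(toy_counts), pstdev(toy_counts))
```

With a window of three tokens, the first two windows contain one and two nouns, and the last two tokens never fill a window. If the input contains fewer tokens than one window, the list of counts stays empty and no mean can be calculated.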

Egbert et al. ([2023](https://doi.org/10.1017/9781316584880), p. 130) provide the following formula for estimating the sample size needed for studying a particular linguistic feature:

$\large n = \frac{\text{stdev}^2}{\left(\frac{0.5 \times \text{CI range}}{t}\right)^2}$

Where *n* is the required sample size, *stdev* is the standard deviation, *CI range* is the range of the confidence interval and *t* is the *t*-value for the desired probability level.

For current purposes, we will use a *t*-value of 1.96, which corresponds to a 95% probability that the true mean value of the observed linguistic feature falls within the confidence interval.
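
To get a feel for the formula, the sketch below wraps it into a function and fills it with invented values. The function name `required_sample_size` and the numbers are assumptions made for illustration and do not come from the corpus.

```python
import math


def required_sample_size(stdev, ci_range, t=1.96):

    # Half of the CI range divided by the t-value, then squared
    denominator = (0.5 * ci_range / t) ** 2

    return stdev ** 2 / denominator


# Invented values: a standard deviation of 40 and a CI range of 20
toy_n = required_sample_size(stdev=40.0, ci_range=20.0)

# Round up, since a sample cannot contain a fraction of a unit
print(round(toy_n, 2), math.ceil(toy_n))
```

With these values the formula gives about 61.5, which rounds up to 62 units. Halving the range of the confidence interval quadruples the required sample size, because the range is squared in the denominator.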

Let's start by calculating the mean and standard deviation for nouns.

In [None]:
mean, stdev = calculate_mean_per_1000_words(sentences, 'upos', 'NOUN')

mean, stdev

Biber ([1993](https://doi.org/10.1093/llc/8.4.243), p. 253) recommends using $\pm 5$% of the observed mean value as the range for the confidence interval.

The upper and lower boundaries may be calculated by multiplying the mean value by 1.05 and 0.95, respectively.

To get the actual range for the confidence interval, we must deduct the lower boundary from the upper boundary.

In [None]:
ci_range = mean * 1.05 - mean * 0.95

ci_range

Now we have all the necessary values for filling the formula for estimating the required sample size.

In [None]:
n = (stdev ** 2) / ((0.5 * ci_range / 1.96) ** 2)

n

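With the values calculated above, the formula yields roughly 12. Note that the mean and standard deviation were calculated over chunks of 1000 words, so this estimate refers to the number of 1000-word chunks rather than individual sentences. Even so, a fairly small sample would suffice for reliably estimating the distribution of nouns!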

What if you replace the tag `NOUN` with a rare tag, such as `SYM`?

What about dependencies `deprel` such as `nsubj` (nominal subject) or `nsubj:pass` (nominal subject in a passive construction)?

To estimate whether an existing corpus can be used to estimate the distribution of a linguistic feature, we can use a measure known as **standard error** (Egbert et al. [2023](https://doi.org/10.1017/9781316584880), p. 134).

Standard error measures how far the mean value for a sample is likely to be from the true mean value.

This measure is calculated by dividing the standard deviation by the square root of the sample size.

In [None]:
standard_error = (stdev / np.sqrt(1000))

standard_error

To compare standard errors between linguistic features, we can calculate a measure known as **relative standard error**.

This measure is calculated by dividing the standard error by the mean value for a linguistic feature.

In [None]:
relative_standard_error = standard_error / mean

relative_standard_error

Egbert et al. (2023, p. 135) provide the following critical values for error rates and relative standard error (RSE):

| Error rate (as a percentage of the mean) | RSE   |
|------------------------------------------|-------|
| 10%                                      | 0.051 |
| 5%                                       | 0.0255|
| 1%                                       | 0.0051|

A relative standard error of 0.0029 for nouns indicates that the true mean value for nouns is likely to fall within 1% of the sample mean.

This allows for reliable or 'precise' estimates concerning the distribution of nouns.
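
The comparison against the table above can also be sketched in code. The function names below are our own, the input values are invented, and treating an RSE equal to a critical value as meeting that error rate is an assumption.

```python
import math

# Critical RSE values from the table above, from strictest to most lenient
RSE_THRESHOLDS = {'1%': 0.0051, '5%': 0.0255, '10%': 0.051}


def relative_standard_error(stdev, mean, n):

    # Standard error: standard deviation divided by the square root of n
    standard_error = stdev / math.sqrt(n)

    return standard_error / mean


def tightest_error_rate(rse):

    # Return the strictest error rate whose critical value is not exceeded
    for label, threshold in RSE_THRESHOLDS.items():
        if rse <= threshold:
            return label

    # The RSE exceeds all critical values in the table
    return None


# Invented values: standard deviation 40, mean 200, sample size 100
toy_rse = relative_standard_error(stdev=40.0, mean=200.0, n=100)

print(toy_rse, tightest_error_rate(toy_rse))
print(tightest_error_rate(0.0029))
```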

What happens to RSE if you adjust the sample size?