# Preliminary Data Annotation

We want positive and negative examples annotated with a set of linguistic metrics (coherence, fluency), both at the utterance level and at the dialogue level (fewer than 5 turns).

- Positive examples will be taken from the [BabyLM (Switchboard)](https://huggingface.co/datasets/hhoangphuoc/switchboard) dataset.
- Negative examples will be taken from BabyLlama outputs.

Corpus size: no more than 20 million tokens.
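A minimal sketch of how the token budget could be enforced while collecting examples. It assumes whitespace-separated words are an acceptable approximation of tokens; `take_within_budget` is a hypothetical helper, not part of any library used here.

```python
def take_within_budget(texts, max_tokens=20_000_000):
    """Keep texts in order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for text in texts:
        n_tokens = len(text.split())  # assumption: whitespace tokens approximate the count
        if used + n_tokens > max_tokens:
            break
        kept.append(text)
        used += n_tokens
    return kept, used


# Toy usage with a tiny budget so the cut-off is visible.
kept, used = take_within_budget(["a b c", "d e", "f g h i"], max_tokens=5)
print(kept, used)
```

If the final count should use the model tokenizer instead, only the line that computes `n_tokens` would need to change.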

## 1. Setup

In [3]:
import torch
import spacy
import contextualSpellCheck

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

## 2. Data Processing

### 2.1 BabyLM (Switchboard) Dataset

#### 2.1.1 Downloading

In [None]:

dataset = load_dataset("hhoangphuoc/switchboard")


#### 2.1.2 Selecting positive examples

What are our selection criteria? Are we basing them on the metrics we annotate (high vs. low scores) or on what we think looks good?

### 2.2 Prompt BabyLlama Model

What will the prompts be? Are we simulating dialogues with a teacher? Could the prompts come from our positive examples in the BabyLM data?
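If the prompts do come from the positive examples, one option is to give the model the first few turns of a dialogue and let it continue the next turn. The sketch below assumes a dialogue is available as a list of `(speaker, text)` pairs; that layout, the `build_prompt` helper, and the example dialogue are all invented for illustration and do not reflect the actual dataset schema.

```python
def build_prompt(turns, n_context=2):
    """Format the first n_context turns and end with the next speaker's label."""
    if not 0 < n_context < len(turns):
        raise ValueError("n_context must leave at least one turn to be continued")
    lines = [f"{speaker}: {text}" for speaker, text in turns[:n_context]]
    next_speaker = turns[n_context][0]
    lines.append(f"{next_speaker}:")  # the model is expected to continue from here
    return "\n".join(lines)


# Invented example dialogue (not taken from the corpus).
dialogue = [
    ("A", "Do you have any pets?"),
    ("B", "Yeah, I have two cats."),
    ("A", "Oh, what are their names?"),
]
prompt = build_prompt(dialogue, n_context=2)
print(prompt)
```

The held-out turn (`dialogue[2]` here) could then serve as the positive counterpart of whatever the model generates from the same context.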

In [None]:

tokenizer = AutoTokenizer.from_pretrained("timinar/baby-llama-58m")
model = AutoModelForCausalLM.from_pretrained("timinar/baby-llama-58m")

# Prompt
input_text = ...

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate continuation
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95
    )

# Decode output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print result
print(generated_text)

## 3. Metrics

We agreed on Fluency and Coherence as the two properties we want to annotate.

### 3.1 Fluency 

Fluency "measures the quality of individual sentences, are they grammatically correct, non-repetitive, and in accord with common English usage, with clear meanings" ([Hu et al., 2024](https://arxiv.org/html/2402.12055v2)).

From this definition, we decompose Fluency into Grammaticality and Conciseness.

#### 3.1.1 Grammaticality

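We might consider using [ERRANT](https://github.com/chrisjbryant/errant) for error detection and classification.

Alternatively, below is an example using [contextualSpellCheck](https://pypi.org/project/contextualSpellCheck/), a contextual spell-checking component for spaCy. Note that it targets spelling errors, so it covers only part of grammaticality.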

In [None]:
nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)

In [None]:
doc = nlp('Income was $9.4 milion comare to the prior year of $2.7 milion.')

print(doc._.performed_spellCheck)  # Should be True
print(doc._.outcome_spellCheck)  # Income was $9.4 million compared to the prior year of $2.7 million.

True
Income was $9.4 million compared to the prior year of $2.7 million.
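To turn a correction like the one above into a number we can annotate, one option is the share of tokens the checker changed. This is a rough sketch and not part of contextualSpellCheck or ERRANT: `token_edit_rate` is a hypothetical helper, whitespace tokenization is an assumption, and the alignment comes from Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def token_edit_rate(original: str, corrected: str) -> float:
    """Number of changed tokens divided by the number of original tokens."""
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    if not orig_tokens:
        return 0.0
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side so pure insertions and deletions also register.
            changed += max(i2 - i1, j2 - j1)
    return changed / len(orig_tokens)


original = "Income was $9.4 milion comare to the prior year of $2.7 milion."
corrected = "Income was $9.4 million compared to the prior year of $2.7 million."
print(token_edit_rate(original, corrected))  # 3 changed tokens out of 12 -> 0.25
```

A lower rate would indicate fewer detected errors. Because insertions are counted against the original length, the value can exceed 1 for very short inputs.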


#### 3.1.2 Conciseness

We could use the Conciseness metric defined by [Cao and Zhuge (2022)](https://www.sciencedirect.com/science/article/pii/S0957417422010491), which "considers both the repetition of representations within each sentence and the similarity between sentences as redundancy contained in the summary, and adds the location of sentence to the calculation of redundancy for allowing the existence of some similarity between sentences that are far apart."
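The three ingredients in that description (within-sentence repetition, between-sentence similarity, and a discount for distant sentences) can be illustrated with a simplified stand-in. The sketch below is not Cao and Zhuge's formula: the Jaccard similarity, the `1 / distance` weight, and the equal averaging of the two components are all assumptions chosen for illustration.

```python
from __future__ import annotations

import re


def _words(sentence: str) -> list[str]:
    """Lowercased word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", sentence.lower())


def within_sentence_repetition(sentence: str) -> float:
    """Share of word tokens that repeat an earlier word in the same sentence."""
    words = _words(sentence)
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-set overlap between two sentences."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def redundancy(sentences: list[str]) -> float:
    """Average of within-sentence repetition and distance-discounted sentence overlap."""
    if not sentences:
        return 0.0
    repetition = sum(within_sentence_repetition(s) for s in sentences) / len(sentences)
    word_sets = [set(_words(s)) for s in sentences]
    weighted, total_weight = 0.0, 0.0
    for i in range(len(word_sets)):
        for j in range(i + 1, len(word_sets)):
            weight = 1.0 / (j - i)  # assumed decay: distant pairs count less
            weighted += weight * jaccard(word_sets[i], word_sets[j])
            total_weight += weight
    overlap = weighted / total_weight if total_weight else 0.0
    return (repetition + overlap) / 2.0


print(redundancy(["the cat sat", "the cat sat"]))
print(redundancy(["the cat sat", "dogs bark loudly"]))
```

Lower values would correspond to more concise text. Since the original metric was designed for summaries, it may need adapting for short dialogue turns.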

### 3.2 Coherence/cohesion

TAACO is a tool for the automatic analysis of local and global text cohesion ([Crossley et al., 2016](https://link.springer.com/article/10.3758/s13428-015-0651-7)).

Below are the TAACO measures that `test_shiva_w_source.py` retrieves, with their column indices in the output CSV file. If we do not need source-related measures (e.g., dialogue-level measures between teacher and student turns), use `test_shiva.py` instead.

- TTR-related metrics: columns[1:16]
- Overlaps in adjacent sentences: columns[16:70]
- Overlaps in adjacent paragraphs: columns[70:124]
- Synonym overlap in adjacent sentences: columns[124:126]
- Synonym overlap in adjacent paragraphs: columns[126:128]
- LSA: columns[128:132]
- LDA: columns[132:136]
- word2vec: columns[136:140]
- Others: columns[140:170] (only the positive causal measure and one other positive measure seem usable)
- Topic relevance, source similarity, mag_news keywords: columns[170:192] (range to be confirmed)
- Source similarity, LSA, LDA, word2vec: columns[192:196]
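The ranges in the list above can be kept in one place as named slices so that later cells select measure groups by name. This sketch assumes the ranges are 0-based, end-exclusive Python slices and that column 0 holds a file identifier; the group names and the synthetic header are made up for illustration, and the real TAACO column names will differ.

```python
import csv
import io

# Column ranges copied from the list above (assumed 0-based, end-exclusive).
TAACO_GROUPS = {
    "ttr": slice(1, 16),
    "adjacent_sentence_overlap": slice(16, 70),
    "adjacent_paragraph_overlap": slice(70, 124),
    "synonym_overlap_sentences": slice(124, 126),
    "synonym_overlap_paragraphs": slice(126, 128),
    "lsa": slice(128, 132),
    "lda": slice(132, 136),
    "word2vec": slice(136, 140),
    "others": slice(140, 170),
    "topic_relevance_source_keywords": slice(170, 192),  # range flagged as uncertain
    "source_similarity": slice(192, 196),
}


def split_groups(row):
    """Split one CSV row (header or values) into the named measure groups."""
    return {name: row[cols] for name, cols in TAACO_GROUPS.items()}


# Synthetic one-line CSV with 196 columns standing in for the real TAACO output.
fake_header = ["filename"] + [f"m{i}" for i in range(1, 196)]
reader = csv.reader(io.StringIO(",".join(fake_header)))
groups = split_groups(next(reader))
print({name: len(cols) for name, cols in groups.items()})
```

Applying `split_groups` to the header row of the real output file is a quick way to check that each range matches the measures it is supposed to contain.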

## 4. Annotate

Below we annotate both the positive examples from the BabyLM (Switchboard) corpus and the negative examples generated with BabyLlama.