# 0. Introduction

In this Google Colab notebook, we will walk through how to run BERT-based linguistic feature detectors, applied to data from CORAAL. This is a modified version of our [2023 "Bridging methods to study sociolectal variation" workshop tutorial](https://brenocon.com/tut2023). Please don't hesitate to reach out if you have any questions on the material covered in this notebook!

We will use a model trained using the method described in [Masis et al.](https://aclanthology.org/2022.fieldmatters-1.2/) (2022). This model will detect instances of 17 African American English morphosyntactic features (features described below, in Section 2). The training data used for this model can be found in `data/CGEdit-ManualGen/AAE.tsv` in [this repository](https://github.com/slanglab/CGEdit), which is associated with the paper linked above. You can also find code for training your own feature detectors in this repository.

Contact: Tessa Masis (tmasis@cs.umass.edu), Brendan O'Connor (brenocon@cs.umass.edu)

## Use cases for sociolinguistic research

We note that these models (i.e. fine-tuned BERT classifiers) do not have perfect precision or recall. However, they often work better than other computational methods (e.g. keyword search, regex) and they take much less time and fewer resources than manual annotation. As with any tool, we recommend trying it out on the data you're interested in and checking whether the performance is good enough for your research goals.

We recommend two general strategies for incorporating these models into your research:

1. Augmenting manual annotation
  - Run the model on your data; then manually inspect the examples classified as positive, discarding incorrect ones. This strategy still involves manual annotation, but it will only be required for a subset of the original dataset and will guarantee high-precision results.

1. Replacing manual annotation
  - This strategy is useful in the case that you have a dataset large enough that manual annotation is out of the question (e.g. millions of social media posts). Although there may be errors in the automatic annotation, we assume a low enough error level such that trends or comparisons of interest will still be apparent.

Note: The model used in this notebook was trained with the intention of being used on transcript data (i.e. text data that includes nonstandard morphosyntactic features but uses standard orthography). That's why its training data does not include examples with variable spellings. You can certainly still use the model on non-transcript data (e.g. social media data), but performance may be worse since it will be out-of-domain data that the model wasn't necessarily trained for. A straightforward way to address this is to add data from the domain in which you're interested to the training set and then to retrain new models -- this is the strategy we used for a project analyzing AAE in Twitter data (you can find an abstract for that [here](https://scholarworks.umass.edu/scil/vol6/iss1/41/)).

## Getting started
To use this notebook, please first create your own copy (without doing this, you will only be able to run the notebook and won't be able to make any edits). To do this, go to the File menu and select 'Save a copy in Drive'. In the copy, click the 'Connect' button in the upper righthand corner. You can now run and edit the notebook!

*Software versions: we successfully ran this notebook in July 2023 on Google Colab, which at the time used Python 3.10 and `transformers` version 4.30 (from `pip install` in section "2. Load and run feature detector classifiers").*

# 1. Load and view CORAAL

[CORAAL](https://oraal.uoregon.edu/coraal) is a large public corpus of African American Language data, containing audio recordings and orthographic transcriptions from more than 220 sociolinguistic interviews (you can explore it online [here](http://lingtools.uoregon.edu/coraal/explorer/)).

## Preprocessing

We're going to use a version of the CORAAL transcripts which has already been preprocessed. This version includes 7 components: DCA, DCB, LES, ATL, PRV, ROC, and VLD. It has been preprocessed such that each file is a txt file containing all utterances from a single speaker, where each line in the txt file is a full sentence (i.e. the line ends with a period, exclamation point, or question mark). Non-linguistic sounds denoted by parentheses or angle brackets (e.g. "(laughing)", "\<cough\>") and notes by the transcriber denoted by forward slashes (e.g. "/unintelligible/", "/inaudible/") have been removed, as well as the square brackets denoting overlapping speech (although the overlapping speech itself has not been removed).

The code used to execute this preprocessing can be found in `code/preprocessCORAAL.py` in  [this repository](https://github.com/slanglab/CGEdit) (the same repository as the one linked in Section 0).
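
To make the cleanup steps above concrete, here is a minimal sketch of the kind of regex-based cleaning they describe. This is not the `preprocessCORAAL.py` script from the repository: the function name `clean_utterance` and the exact patterns are assumptions made for illustration, and the real script may handle more cases (e.g. sentence splitting).

```python
import re

def clean_utterance(text):
    """Hypothetical sketch of the cleanup described above (not the repository script)."""
    # Drop non-linguistic sounds in parentheses or angle brackets, e.g. (laughing), <cough>.
    text = re.sub(r"\([^)]*\)|<[^>]*>", " ", text)
    # Drop transcriber notes wrapped in forward slashes, e.g. /unintelligible/.
    text = re.sub(r"/[^/]*/", " ", text)
    # Remove the square brackets that mark overlapping speech, keeping the speech itself.
    text = text.replace("[", "").replace("]", "")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

# Made-up utterance, used only to exercise each rule once.
print(clean_utterance("Yeah [I was] there (laughing) <cough> /inaudible/ you know."))
```

Note that the overlapping speech ("I was") survives while its brackets are dropped, which matches the behavior described above.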

## File naming

If the speaker is an **interviewee**, their file is named using the following convention:

\<CORAAL component\>\_\<socioeconomic group number\>\_\<age group number\>\_\<gender\>\_\<speaker number\>\_\<audio file number\>\.txt

For example, the file:

DCA\_se2\_ag1\_m\_05\_1.txt

is from DCA (the Washington, DC 1968 component of CORAAL). The speaker is in socioeconomic group 2, age group 1, and is male number 5. This is the first text file for this speaker.

If the speaker is an **interviewer**, then their file is named using this convention:

INT-\<interviewer speaker code\>-\<interviewee file name\>.txt

For example, the file:

INT-DCA\_int\_07-DCA\_se2\_ag1\_m\_05\_1.txt

contains the utterances from the interviewer (here, DCA\_int\_07) corresponding to the DCA\_se2\_ag1\_m\_05\_1 transcript file.
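
If you'd like to recover this metadata programmatically, the two naming conventions above can be parsed with a regular expression. The sketch below is only an illustration: `parse_coraal_filename` is a hypothetical helper, and the pattern assumes three-letter uppercase component codes and a single-letter gender code, as in the examples above.

```python
import re

# Pattern for interviewee file names (extension removed), following the convention above.
# Assumption: component codes are three uppercase letters and gender is one lowercase letter.
INTERVIEWEE_RE = re.compile(
    r"^(?P<component>[A-Z]{3})_"
    r"(?P<socioeconomic_group>se\d+)_"
    r"(?P<age_group>ag\d+)_"
    r"(?P<gender>[a-z])_"
    r"(?P<speaker_number>\d+)_"
    r"(?P<audio_file_number>\d+)$"
)

def parse_coraal_filename(filename):
    """Hypothetical helper: split a preprocessed CORAAL file name into its parts."""
    stem = filename[:-4] if filename.endswith(".txt") else filename
    info = {"role": "interviewee", "interviewer_code": None}
    if stem.startswith("INT-"):
        # INT-<interviewer speaker code>-<interviewee file name>
        _, interviewer_code, stem = stem.split("-", 2)
        info = {"role": "interviewer", "interviewer_code": interviewer_code}
    match = INTERVIEWEE_RE.match(stem)
    if match is None:
        raise ValueError(f"Unrecognized CORAAL file name: {filename}")
    info.update(match.groupdict())
    return info

print(parse_coraal_filename("DCA_se2_ag1_m_05_1.txt"))
print(parse_coraal_filename("INT-DCA_int_07-DCA_se2_ag1_m_05_1.txt"))
```

For interviewer files, the returned fields other than `role` and `interviewer_code` describe the interviewee whose transcript the file corresponds to.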

## More about CORAAL

For more details on CORAAL, including transcription practices and information about metadata, please see the [CORAAL User Guide](http://lingtools.uoregon.edu/coraal/userguide/CORAALUserGuide_current.pdf).

## Loading CORAAL

First we download our preprocessed version of CORAAL.

In [None]:
!git clone https://github.com/tmasis/tutorial2023.git

Let's print all the files by name.

<-- You can also see the files by clicking the folder button on the lefthand side (says "Files" when you hover). The files will be in `tutorial2023/CORAAL/`.


In [None]:
!ls tutorial2023/CORAAL/

We can look at the first few lines from the file ATL_se0_ag2_f_01_1.txt

In [None]:
test_dir = "tutorial2023/CORAAL/"
filename = "ATL_se0_ag2_f_01_1.txt"

with open(test_dir + filename) as f:
  for line in list(f)[:10]:
    print(line)

We can also look at the corresponding interviewer's speech.

In [None]:
filename = "INT-ATL_int_01-ATL_se0_ag2_f_01_1.txt"

with open(test_dir + filename) as f:
  for line in list(f)[:10]:
    print(line)

# 2. Load and run feature detector classifiers

We're now going to load and run the model described above in Section 0. This model will detect instances of 17 African American English morphosyntactic features:
1. zero possessive
1. zero copula
1. double tense
1. habitual be
1. resultant done
1. finna
1. come
1. double modal
1. multiple negation
1. negative auxiliary inversion
1. non-inverted negative concord
1. ain't
1. zero 3rd person singular present -s
1. is/was generalization
1. zero plural -s
1. double object
1. wh-question

For examples of each of these features, please see Table 3 (pg. 19) in [Masis et al.](https://aclanthology.org/2022.fieldmatters-1.2/) (2022).


## Load the model (in other words, download and get it ready for use)

In [None]:
# Let's first do some imports and define some variables
!pip install transformers[torch]
import transformers
import torch
import torch.nn as nn
import numpy as np
import dataclasses
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.sampler import SequentialSampler
from typing import List, Union, Dict
import sys
import re
import os

# Here, we specify the BERT variant we'll be using; for Masis et al. (2022), we tried fine-tuning
#   a few different BERTs and this one worked best for detecting AAE features
model_name = 'bert-base-cased'
# Here we make the directory where we'll save results (-p avoids an error if it already exists)
!mkdir -p results/
# Here we define the name of the results file, which will be saved inside results/
out_dir = 'Masis22_CORAAL.tsv'

In [None]:
# We'll now define some classes and a dataloading function

class MultitaskModel(transformers.PreTrainedModel):
    def __init__(self, encoder, taskmodels_dict):
        super().__init__(transformers.PretrainedConfig())
        self.encoder = encoder
        self.taskmodels_dict = nn.ModuleDict(taskmodels_dict)

    @classmethod
    def create(cls, model_name, head_type_list):
        """
        Creates each single-feature model (where task == feature), and
        has them share the same encoder transformer.
        """
        taskmodels_dict = {}
        shared_encoder = transformers.AutoModel.from_pretrained(
                model_name,
                config=transformers.AutoConfig.from_pretrained(model_name) )

        for task_name in head_type_list:
            head = torch.nn.Linear(768, 2)
            taskmodels_dict[task_name] = head

        return cls(encoder=shared_encoder, taskmodels_dict=taskmodels_dict)

    def forward(self, inputs, **kwargs):
        x = self.encoder(inputs)                # pass thru encoder once
        x = x.last_hidden_state[:,0,:]          # get CLS
        out_list = []
        for head in self.taskmodels_dict.values():   # pass thru each head
            out_list.append(head(x))
        return torch.vstack(out_list)


def eval_dataloader(eval_dataset):
    eval_sampler = SequentialSampler(eval_dataset)

    data_loader = DataLoader(eval_dataset,
                batch_size=64,
                sampler=eval_sampler
                )

    return data_loader


class CustomEvalDataset(Dataset):
    def __init__(self, text):
        self.text = text

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        return {"input_ids": self.text[idx]}

In [None]:
# Here we define the prediction function
def testM(tokenizer, model, test_f):
    features_dict = {"input_ids": []}

    with open(test_f) as r:
        for line in r:
            if len(line.split()) < 2: continue    # skips utterances that are only one word
            # padding="max_length" replaces the deprecated pad_to_max_length=True argument
            tokenized = tokenizer.encode(line.strip(), max_length=64, padding="max_length", truncation=True)
            features_dict["input_ids"].append(torch.LongTensor(tokenized))

    features_dict["input_ids"] = torch.stack(features_dict["input_ids"])
    features_dict = CustomEvalDataset(features_dict["input_ids"])

    # For each head/feature, predict on all sentences if feature is present
    dataloader = eval_dataloader(features_dict)
    with open("results/"+out_dir,'a') as f:
        for steps, inputs in enumerate(dataloader):
            for ex in inputs["input_ids"]:
                with torch.no_grad():
                    output = model(ex.unsqueeze(0).to(device))
                output = torch.nn.functional.softmax(output, dim=1)
                output = [str(float(x[1])) for x in output]
                sent = tokenizer.decode(ex).split()
                sent = [e for e in sent if e != '[PAD]' and e != '[CLS]' and e != '[SEP]']
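                # Speaker ID = file name without its directory or ".txt" extension
                speaker = os.path.splitext(os.path.basename(os.fspath(test_f)))[0]
                f.write(speaker+"\t"+" ".join(sent)+"\t"+"\t".join(output)+"\n")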

In [None]:
# Some final setup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Here we define a head for each feature
# The names correspond to these 17 features:     zero possessive, zero copula, double tense, habitual be, resultant done, finna, come,
#                             double modal, multiple negation, negative auxiliary inversion, non-inverted negative concord/multiple negation,
#                             ain't, zero 3rd person singular present -s, is/was generalization, zero plural -s, double object, wh-question
# There are numbers in front of each head name just because that's what I named them when I trained this model
head_type_list=[
        "1-zero-poss",
        "3-zero-copula",
        "4-double-tense",
        "5-be-construction","5-resultant-done",
        "6-finna","6-come","6-double-modal",
        "7-multiple-neg","7-neg-inversion","7-n-inv-neg-concord","7-aint",
        "8-zero-3sg-pres-s","8-is-was-gen",
        "9-zero-pl-s",
        "10-double-object",
        "11-wh-qu"
        ]
multitask_model = MultitaskModel.create(
        model_name=model_name,
        head_type_list=head_type_list )

multitask_model.to(device)

In [None]:
# Now let's copy the model to the VM (~5 min)
!wget "http://hobbes.cs.umass.edu/~tmasis/final.pt"

In [None]:
# Load the model
checkpoint = torch.load("final.pt", map_location=device)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
multitask_model.load_state_dict(checkpoint['model_state_dict'])
multitask_model.eval()
multitask_model.to(device)

## Run feature detector classifiers on example sentences

Before running the feature detectors on CORAAL data, let's first try the detectors on sentences we create.

In [None]:
# Here we define a mini prediction function to test example sentences
def testExample(tokenizer, model, test_sent):
    features_dict = {"input_ids": []}

    # padding="max_length" replaces the deprecated pad_to_max_length=True argument
    tokenized = tokenizer.encode(test_sent.strip(), max_length=64, padding="max_length", truncation=True)
    features_dict["input_ids"].append(torch.LongTensor(tokenized))

    features_dict["input_ids"] = torch.stack(features_dict["input_ids"])
    features_dict = CustomEvalDataset(features_dict["input_ids"])

    # For each head/feature, predict on all sentences if feature is present
    dataloader = eval_dataloader(features_dict)
    for steps, inputs in enumerate(dataloader):
        for ex in inputs["input_ids"]:
            with torch.no_grad():
                output = model(ex.unsqueeze(0).to(device))
            output = torch.nn.functional.softmax(output, dim=1)
            output = [str(float(x[1])) for x in output]
            sent = tokenizer.decode(ex).split()
            sent = [e for e in sent if e != '[PAD]' and e != '[CLS]' and e != '[SEP]']
            header = ["zero possessive", "zero copula", "double tense", "habitual be", "resultant done", "finna",
                      "come", "double modal", "multiple negation", "negative auxiliary inversion",
                      "non-inverted negative concord", "ain't", "zero 3rd person singular present -s", "is/was generalization",
                      "zero plural -s", "double object", "wh-question"]
            print("example:\t" + " ".join(sent))
            for i in range(len(header)):
              print(header[i] + ":\t" + output[i])

In the code block below, the model is given the example sentence and predicts, for each feature, whether that sentence contains that feature. In the printed output, each feature has a corresponding score, which is the model's predicted probability that the feature is in the utterance: a number close to 0 means the model predicted that the utterance likely does not have the feature, and a number close to 1 means it predicted that the utterance likely does have the feature.

The example sentence below (taken from Twitter) contains both the zero copula and the finna feature. Accordingly, the model gives both of those features high scores (~0.99) and gives low scores for every other feature (< 0.0001).

(Note: In theory, scores in the middle, e.g. around 0.5, should correspond to ambiguous utterances in which it's unclear whether or not the utterance contains the given feature. However in practice, differences in score don't always correspond to this -- the model doesn't know explicit information about linguistic structures, so a difference in score may not have a linguistic reason. For example, an utterance may have a score that's lower than another utterance simply because it has a content word that wasn't in the training data or because one word has a nonstandard spelling.)
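
To show where these scores come from, here is a small sketch of the last step of prediction: each feature's head produces two logits, and a softmax turns them into probabilities, of which we keep the one at index 1 (feature present). The logit values below are made up for illustration and are not actual model outputs; `positive_probability` is a hypothetical helper name.

```python
import math

def positive_probability(logits):
    """Softmax over one head's two logits; return the probability at index 1 (feature present)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)

# Hypothetical [absent, present] logits for three heads (illustrative values only).
example_logits = {
    "zero copula": [-3.0, 3.0],   # strongly favors "present"
    "finna": [0.0, 0.0],          # no preference either way
    "habitual be": [4.0, -4.0],   # strongly favors "absent"
}

for name, logits in example_logits.items():
    print(f"{name}:\t{positive_probability(logits):.4f}")
```

Equal logits give a score of exactly 0.5, and the score moves toward 0 or 1 as the gap between the two logits grows.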


In [None]:
example_sentence = "Your boy really still waiting on that pizza. Dominoes finna close soon."
testExample(tokenizer, multitask_model, example_sentence)

Below is another example taken from Twitter. Here we see a high score for habitual be (0.99) and pretty low scores for every other feature (< 0.001), as expected.

In [None]:
example_sentence = "These little kids basketball games be gettin intense as fuck lol."
testExample(tokenizer, multitask_model, example_sentence)

## Activity #1
Play around with example sentences you create!

In [None]:
example_sentence = ""       # TODO: Insert your example sentence here
testExample(tokenizer, multitask_model, example_sentence)

# 3. Run feature detector classifiers on CORAAL

Now let's run the model on the CORAAL data. The code below will run the model on 50 of the files (~1/3 of the total files) and should take ~5min. We'll also only look at interviewee speech files here, not interviewer ones.

In [None]:
i = 0
with os.scandir(test_dir) as d:
  for test_file in d:
    if 'INT' in test_file.name: continue    # This line skips interviewer speech files
    print("Processing " + test_file.name + "...")
    testM(tokenizer, multitask_model, test_file)
    print("Finished " + test_file.name)
    i += 1
    if i == 50: break      # Comment out this line if you'd like to run the model on all of the files

The results file is stored in `results/Masis22_CORAAL.tsv` (you should also be able to see it by clicking on the Files folder on the lefthand side).

The code below converts the results file into a pandas DataFrame, and then prints the total number of utterances which the model has made predictions on.

In [None]:
# Let's see the results!
filename = "results/Masis22_CORAAL.tsv"

import pandas as pd
df = pd.read_csv(filename, sep="\t", names=["speaker", "example", "zero possessive", "zero copula", "double tense", "habitual be",
                                  "resultant done", "finna", "come", "double modal", "multiple negation", "negative auxiliary inversion",
                                  "non-inverted negative concord", "ain't", "zero 3rd person singular present -s", "is/was generalization",
                                  "zero plural -s", "double object", "wh-question"] )
print("We have feature detection predictions for " + str(len(df.index)) + " utterances.")

Here we print the first 5 utterances. For each utterance, there is a 'speaker' column and an 'example' column, followed by 17 columns corresponding to our 17 morphosyntactic features. Each value in a feature column is the model's predicted score for that feature in the utterance: a score close to 0 means the model predicts the utterance does not have the feature, and a score close to 1 means it predicts the utterance does have the feature.

In [None]:
df.head()

Let's assume that 0.5 is an okay threshold and look at all the utterances that are predicted to have zero copula (i.e. the model predicted a score above 0.5 for that utterance).


In [None]:
pd.set_option('display.max_colwidth', None)
df.loc[df["zero copula"] > 0.5, ["example"]]

We can also try other thresholds and see how that affects the results. We note that there's often not a clear cutoff threshold which cleanly divides true positives from true negatives; it makes sense to choose a higher threshold if you'd like high precision (but would result in lower recall), whereas a lower threshold would give you higher recall (although lower precision).
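
If you have manually annotated a sample of utterances, you can quantify this tradeoff directly. The sketch below computes precision and recall at two thresholds; the scores and gold labels are invented for illustration and do not come from CORAAL or from the model.

```python
def precision_recall(scores, gold, threshold):
    """Precision and recall when every score above `threshold` counts as a positive prediction."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, gold))          # true positives
    fp = sum(p and not g for p, g in zip(predicted, gold))      # false positives
    fn = sum((not p) and g for p, g in zip(predicted, gold))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model scores and manual labels for 8 utterances (illustrative only).
scores = [0.99, 0.95, 0.80, 0.60, 0.55, 0.30, 0.10, 0.05]
gold = [True, True, True, False, True, False, False, False]

for threshold in (0.5, 0.9):
    p, r = precision_recall(scores, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy sample, raising the threshold from 0.5 to 0.9 removes the one false positive but also drops two true positives, so precision goes up while recall goes down.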

In [None]:
df.loc[df["zero copula"] > 0.9, ["example"]]

The code below also prints the speaker corresponding to each utterance and the predicted score for the feature.

In [None]:
df.loc[df["zero copula"] > 0.5, ["speaker", "example", "zero copula"]]

Here we sort the utterances by predicted score.

In [None]:
df.sort_values("zero copula", ascending=False).loc[df["zero copula"] > 0.5, ["example", "zero copula"]]

## Activity #2

Take a look at predicted positive instances for different features, and with different thresholds. Different classifiers may have different levels of accuracy (see the rightmost column of Table 7 (pg. 22) in [Masis et al.](https://aclanthology.org/2022.fieldmatters-1.2/) (2022) for our precision@100 results for each feature). In our project using feature detectors on Twitter data (abstract [here](https://scholarworks.umass.edu/scil/vol6/iss1/41/)), we decided to manually calibrate thresholds for each detector.
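
One simple way to use per-detector thresholds is to keep them in a dictionary keyed by feature name. The sketch below is only an illustration: the threshold values are made up (they are not the calibrated values from the Twitter project), and `apply_thresholds` is a hypothetical helper.

```python
def apply_thresholds(row_scores, thresholds, default=0.5):
    """Return the features whose score exceeds that feature's own threshold."""
    return [
        feat for feat, score in row_scores.items()
        if score > thresholds.get(feat, default)   # fall back to `default` if uncalibrated
    ]

# Hypothetical per-feature thresholds (illustrative values only).
thresholds = {"zero copula": 0.9, "habitual be": 0.5}

# Hypothetical scores for a single utterance.
row = {"zero copula": 0.72, "habitual be": 0.72, "finna": 0.10}

print(apply_thresholds(row, thresholds))
```

Here the same score of 0.72 counts as a positive for habitual be but not for zero copula, because the latter uses a stricter threshold.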

In [None]:
# List of possible feature names:     "zero possessive", "zero copula", "double tense", "habitual be", "resultant done", "finna",
#                                  "come", "double modal", "multiple negation", "negative auxiliary inversion",
#                                  "non-inverted negative concord", "ain't", "zero 3rd person singular present -s", "is/was generalization",
#                                  "zero plural -s", "double object", "wh-question"


feature = ""        # TODO: insert feature name here
threshold = 0.5     # TODO: modify threshold; can be any number from 0 to 1

df.loc[df[feature] > threshold, ["speaker", "example", feature]]

# 4. Visualize feature use in (our subset of) CORAAL data

The code below will create a bar chart visualizing how much people from each region tend to use AAE features. For simplicity, we use 0.5 as a threshold for each feature, where a score above 0.5 means the model predicts the utterance has the feature and a score below 0.5 means the model predicts the utterance doesn't have the feature. We then calculate average feature frequency over all features.
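
As a worked example of this calculation, the sketch below computes the same quantity on a tiny made-up dataset using plain Python. The scores are invented, only three features are used instead of 17 to keep it short, and `region_frequencies` is a hypothetical helper rather than part of the notebook's pipeline.

```python
from collections import defaultdict

def region_frequencies(rows, threshold=0.5):
    """rows: list of (speaker, [feature scores]). Returns {region: share of scores above threshold}."""
    decisions = defaultdict(list)
    for speaker, scores in rows:
        region = speaker[:3]   # the region/component code is the first 3 characters of the speaker ID
        decisions[region].extend(1 if s > threshold else 0 for s in scores)
    # Average the 0/1 decisions over all (utterance, feature) pairs in each region.
    return {region: sum(v) / len(v) for region, v in decisions.items()}

# Hypothetical scores for three utterances and three features (illustrative only).
rows = [
    ("ATL_se0_ag2_f_01_1", [0.9, 0.1, 0.2]),
    ("ATL_se0_ag2_f_01_1", [0.7, 0.8, 0.0]),
    ("DCA_se2_ag1_m_05_1", [0.1, 0.2, 0.6]),
]

print(region_frequencies(rows))
```

For ATL, 3 of the 6 (utterance, feature) pairs are above the threshold, so its average feature frequency is 0.5; for DCA it is 1 of 3.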

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt

regions = defaultdict(lambda: [])

df2 = df.round(0)     # This rounds each score to either 0 or 1, if it's below or above 0.5, respectively
for i in df2.index:
  regions[df2["speaker"][i][:3]].extend(list(df2.iloc[i, 2:]))

for k,v in regions.items():
  regions[k] = sum(v)/len(v)
sorted_regions = dict(sorted(regions.items(), key=lambda x : x[1]))

fig, ax = plt.subplots()
xvals = sorted_regions.keys()
yvals = sorted_regions.values()
ax.bar(xvals, yvals)
ax.set_ylabel("Average feature frequency")
ax.set_xlabel("Region")
plt.show()

Here, we look at average frequency of just one feature across the regions -- habitual be.

In [None]:
regions = defaultdict(lambda: [])

for i in df2.index:
  regions[df2["speaker"][i][:3]].extend(list(df2.iloc[i, 5:6]))

for k,v in regions.items():
  regions[k] = sum(v)/len(v)
sorted_regions = dict(sorted(regions.items(), key=lambda x : x[1]))

fig, ax = plt.subplots()
xvals = sorted_regions.keys()
yvals = sorted_regions.values()
ax.bar(xvals, yvals)
ax.set_ylabel("Habitual be frequency")
ax.set_xlabel("Region")
plt.show()

And here we look at average frequency of resultant done across regions.

In [None]:
regions = defaultdict(lambda: [])

for i in df2.index:
  regions[df2["speaker"][i][:3]].extend(list(df2.iloc[i, 6:7]))

for k,v in regions.items():
  regions[k] = sum(v)/len(v)
sorted_regions = dict(sorted(regions.items(), key=lambda x : x[1]))
sorted_regions.pop("INT", None)   # drop interviewer speech if present; no error if there is none

fig, ax = plt.subplots()
xvals = sorted_regions.keys()
yvals = sorted_regions.values()
ax.bar(xvals, yvals)
ax.set_ylabel("Resultant done frequency")
ax.set_xlabel("Region")
plt.show()

We can also visualize feature use by gender! It's a bit harder to look at socioeconomic or age groups, because they're defined differently for different CORAAL components; looking at those variables would require some manual alignment.

In [None]:
genders = defaultdict(lambda: [])

for i in df2.index:
  genders[df2["speaker"][i][12:13]].extend(list(df2.iloc[i, 2:]))

for k,v in genders.items():
  genders[k] = sum(v)/len(v)
sorted_genders = dict(sorted(genders.items(), key=lambda x : x[1]))

fig, ax = plt.subplots()
xvals = sorted_genders.keys()
yvals = sorted_genders.values()
ax.bar(xvals, yvals)
ax.set_ylabel("Average feature frequency")
ax.set_xlabel("Gender")
plt.show()

## Activity #3
Try looking at feature distributions for different social variables!

In [None]:
social_vars = defaultdict(lambda: [])

for i in df2.index:
  speaker_var = df2["speaker"][i][:3]     # TODO: change the slice to correspond to social variable of your choice.
                                      #   For example, "[:3]" corresponds to region,
                                      #   "[4:7]" corresponds to socioeconomic group, "[8:11]" to age group,
                                      #   and "[12:13]" to gender

  feats = list(df2.iloc[i, 2:])            # TODO: change the slice to correspond to feature(s) of your choice.
                                      # For example, "[i, 2:]" corresponds to average of all features,
                                      # "[i, 5:6]" corresponds to only habitual be,
                                      # and "[i, 6:7]" corresponds to only resultant done.
                                      # Features can be indexed according to placement in this list:
  # ["speaker", "example", "zero possessive", "zero copula", "double tense", "habitual be", "resultant done", "finna",
  #  "come", "double modal", "multiple negation", "negative auxiliary inversion", "non-inverted negative concord",
  #  "ain't", "zero 3rd person singular present -s", "is/was generalization", "zero plural -s", "double object", "wh-question"]

  social_vars[speaker_var].extend(feats)

for k,v in social_vars.items():
  social_vars[k] = sum(v)/len(v)
sorted_vars = dict(sorted(social_vars.items(), key=lambda x : x[1]))

fig, ax = plt.subplots()
xvals = sorted_vars.keys()
yvals = sorted_vars.values()
ax.bar(xvals, yvals)
ax.set_ylabel("Feature frequency")             # TODO: change feature label here
ax.set_xlabel("Social variable")               # TODO: change social variable label here
plt.show()