## Preface 

Hey Tava! 
The following is my best attempt at a structured explanation of how the data works and how the code deals with it.
I recall that my initial explanations of all this weren't very structured, so I'm sure it felt all over the place.
With that in mind, I wanted to take the opportunity to put things on paper.
Though this may be horribly redundant, I hope it'll make the coding clearer.
Please reach out to me with any questions!

To start, recall our goal: to identify HD via linguistic indicators in patients' speech, specifically truncated clauses.
To this end, we had annotators manually look for truncated clauses in interview transcripts (both those with and those without HD patients as interviewees).

However, annotators did not work directly with transcripts: recall that we broke all of our transcripts up into pieces we call *documents*.
For instance, transcript `001` was broken into documents `001_000.txt`, `001_001.txt`, all the way to `001_014.txt`.
Annotators worked directly with the documents, finding and labeling truncated clauses there.
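
As a small illustration of the naming scheme, here is a minimal sketch that splits a document name into its transcript ID and document index. The helper name `split_document_name` is made up for this example and is not part of the codebase; it assumes names always look like `{transcript}_{index}.txt`, optionally followed by `.json`.

```python
def split_document_name(doc_name: str) -> tuple[str, str]:
    """Split a name like '001_014.txt' into (transcript_id, document_index).

    Both parts are kept as strings so the zero padding is preserved.
    """
    # Drop the '.json' and '.txt' extensions if they are present
    stem = doc_name.removesuffix(".json").removesuffix(".txt")
    # Everything before the first underscore is the transcript ID
    transcript_id, _, document_index = stem.partition("_")
    return transcript_id, document_index


print(split_document_name("001_014.txt"))  # ('001', '014')
```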

Annotators could also only access documents from one project at a time.
A *project* was our more or less arbitrary designation for a collection of documents.
The first project annotators were shown was `HD_set1_1-7`, which you can see as a directory in `./data/json/`, and all other projects were revealed one-by-one slowly over the course of the Fall 2024 and Spring 2025 semesters.
This fact is more or less trivial for the paper, but it is important to keep in mind when working with the data through code.
Although we'll often only care about analyzing particular documents, these documents are always going to live in project folders.

After labeling was completed, we exported the label data into a json format. This is the "data" I usually share with the dats and stats team, and right now it should be in the `./data/json/` directory on your computer. (It used to be in the `./data/` directory, but I changed that, since `./data/` will soon contain raw text files as well, in `./data/raw/`.)

# Data

## File Structure

You'll notice a few things inspecting this data. 
The first is the project directories, including `HD_set1_1-7`. 
Opening up any project folder will give a file tree like the following:
```shell
HD_set1_1-7/
    DOCUMENT-a_eads_ufl_edu/
    DOCUMENT-akadel1_ufl_edu/
    DOCUMENT-anna_oswald_ufl_edu/
        001_000.txt.json
        001_012.txt.json
        002_023.txt.json
        005_035.txt.json
        006_047.txt.json
        007_058.txt.json
    ...
    DOCUMENT-sotodaniel_ufl_edu/
        001_009.txt.json
        002_020.txt.json
        005_033.txt.json
        006_044.txt.json
        006_055.txt.json
        007_067.txt.json
    ...
    DOCUMENT-stiwary1_ufl_edu/
    DOCUMENT-swayam_patel_ufl_edu/
    REVIEW/
        001_000.txt.json
        001_001.txt.json
        001_002.txt.json
        001_003.txt.json
        ...
        007_067.txt.json
        007_068.txt.json
    ROOT_smoeller_ufl_edu/
    project-data.json
```

To keep things short, the only thing you really care about is in the `REVIEW/` directory here.
In every project, the `REVIEW/` directory contains the adjudicated label data corresponding to each document in that project.
That's what all of the code in this codebase is designed to work with.
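
To make the layout concrete, here is a rough sketch of how one could list every adjudicated file across all projects using only `pathlib`. It is not how the codebase does it; the function name `review_files` is hypothetical, and it assumes every project folder sits directly under the json root and has a `REVIEW/` subfolder like the one shown above.

```python
from pathlib import Path


def review_files(json_root):
    """Yield (project, document_name, path) for every file in a REVIEW/ folder."""
    # Matches <json_root>/<project>/REVIEW/<document>.json and nothing else,
    # so the DOCUMENT-*, ROOT_* folders and project-data.json are skipped.
    for path in sorted(Path(json_root).glob("*/REVIEW/*.json")):
        project = path.parent.parent.name                # e.g. 'HD_set1_1-7'
        document_name = path.name.removesuffix(".json")  # e.g. '001_009.txt'
        yield project, document_name, path
```

Calling `list(review_files("./data/json"))` would then give one entry per adjudicated document, with the project it lives in.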

If you're curious, all of those other directories that start with `DOCUMENT-{name}` contain document label data from `name` specifically in this project.
For instance, the file `001_009.txt.json` in `DOCUMENT-sotodaniel_ufl_edu/` contains the labels that Daniel Soto (me) annotated in document `001_009.txt` while working in the `HD_set1_1-7` project.
The `REVIEW/` directory will definitely contain `001_009.txt.json` along with some labels as well, but its labels will likely not be the same as the ones in my directory: someone would have looked through my labels, decided whether they were right, and then passed them through to the data that we'd see in this file.
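
If you ever want to see which annotators worked on a given document, the directory naming makes that easy to check. The sketch below is only an illustration: `annotator_copies` is a made-up helper, and it assumes annotator folders always follow the `DOCUMENT-{name}` pattern shown in the tree above.

```python
from pathlib import Path


def annotator_copies(project_dir, document_name):
    """Map annotator folder suffix -> path of that annotator's copy of a document."""
    copies = {}
    for folder in sorted(Path(project_dir).glob("DOCUMENT-*")):
        candidate = folder / f"{document_name}.json"
        # Not every annotator was assigned every document
        if candidate.is_file():
            annotator = folder.name.removeprefix("DOCUMENT-")
            copies[annotator] = candidate
    return copies
```

For example, `annotator_copies("./data/json/HD_set1_1-7", "001_009.txt")` would return a dictionary with one key per annotator folder that contains `001_009.txt.json`.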

In addition, the `ROOT_smoeller_ufl_edu/` folder contains the document data, still in json format but with no label data, so we of course ignore that.
The `project-data.json` file contains information that is already present in each document's json file, so we ignore this as well.

# Code

## `utils.document`

The code is designed to read in a document's json file from the `REVIEW/` folder and expose the relevant information about it, all attached to a custom `Document` class.
Let's use Document `001_009.txt` as an example:

In [2]:
from utils.document import Document

doc_name = "001_009.txt"
path_to_doc = f"./data/json/HD_set1_1-7/REVIEW/{doc_name}.json"
doc = Document(path=path_to_doc)
doc

Document(001_009.txt, HD_set1_1-7)

Passing the path to Document `001_009.txt`'s json file in the `REVIEW/` folder to the `Document` class gives back a `Document` object.
Printing this object out will just tell you the document's name and its project of origin (you can also access this information through `doc.name` and `doc.project`).
But there's a lot more you can do!
Most importantly to us, you can print out a document's content, either as a string or as a list of lines:

In [3]:
# print out doc content as a string
print(doc.full_content)

Interviewer:
Okay. All right. So first question is how has COVID-19 impacted your life? What specific behaviors has changed?
Participant 1:
Going into work. I can't go in and see other people. When I go out and run, if I'm not wearing a mask, people start cowering and yelling at me. It's ridiculous. I'm out on a trail six feet away from somebody. They're ... But you know, in all fairness, it hasn't changed my life that much because once I was working from home quite a bit before and I did lose my job last January, and partly due to COVID because they closed our San Francisco office because the company was having money problems anyway, but they closed the San Francisco office and everybody is working remotely for good, which I think is really bad. It's really bad. I think it's alienating people and people are having problems-
Interviewer:
I agree.
Participant 1:
... and eating more than normal. When I'm up by my house where I can cook, and that's not good. I've had some health problems 

In [4]:
doc.lines

['Interviewer:',
 'Okay. All right. So first question is how has COVID-19 impacted your life? What specific behaviors has changed?',
 'Participant 1:',
 "Going into work. I can't go in and see other people. When I go out and run, if I'm not wearing a mask, people start cowering and yelling at me. It's ridiculous. I'm out on a trail six feet away from somebody. They're ... But you know, in all fairness, it hasn't changed my life that much because once I was working from home quite a bit before and I did lose my job last January, and partly due to COVID because they closed our San Francisco office because the company was having money problems anyway, but they closed the San Francisco office and everybody is working remotely for good, which I think is really bad. It's really bad. I think it's alienating people and people are having problems-",
 'Interviewer:',
 'I agree.',
 'Participant 1:',
 "... and eating more than normal. When I'm up by my house where I can cook, and that's not good. 

You can also filter a document's output by speaker.
The function I wrote for this takes a boolean `cleaned` option, which determines whether speaker labels, timestamps, and other unnecessary junk we've found get displayed.
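
To give a feel for what speaker filtering involves, here is a minimal sketch that works on a plain list of lines shaped like the `doc.lines` output above. It is not the actual `utils.document` implementation: `lines_for_speaker` is a hypothetical name, it assumes that a short line ending in `:` is a speaker label, and it does none of the extra cleaning (timestamps and so on) that the real method handles.

```python
def lines_for_speaker(lines, speaker):
    """Return content lines spoken by any speaker whose label starts with `speaker`."""
    selected = []
    current = None  # label of whoever is speaking right now
    for line in lines:
        stripped = line.strip()
        if stripped.endswith(":") and len(stripped.split()) <= 3:
            # Assumed convention: a short line ending in ':' is a speaker label
            current = stripped[:-1]
        elif current is not None and current.startswith(speaker) and stripped:
            selected.append(stripped)
    return selected


sample = [
    "Interviewer:",
    "I agree.",
    "Participant 1:",
    "... and eating more than normal.",
]
print(lines_for_speaker(sample, "Participant"))
# ['... and eating more than normal.']
```

Matching on the start of the label is what lets `'Participant'` pick up `Participant 1` in this sketch.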

In [20]:
print(doc.content_by_speaker('Participant'))

So I needed time to do it after my event and I would use to take, in the afternoon, it was a pleasant afternoon's task to do that and I INAUDIBLE my hose clamps lined up just right and just went about the preventative maintenance.
I'm a great believer of planned maintenance and that's from professional experience but the ...So I went to bite this off and it took me three days to get it done, at the end of three days I did not have things quite right.
My scars were all peeled up and bloody from the mishaps, and I knew INAUDIBLE even though I knew how I wanted my hand to move, my brain wouldn't make it do it.
It was INAUDIBLE, okay.
That's what made me decide that I couldn't use my hand tools anymore.
CROSSTALK.
CROSSTALK.
Oh, yeah.
CROSSTALK.
CROSSTALK.
It was heartbreak for me to ...
I'm sorry, it's a heartbreak for me and still I try to be a realest.
So it's a INAUDIBLE, it's a reality, too bad.
I'm straying from your question.
I did show up.
Yep.
The older you get, the easiest it is 

In [21]:
doc.lines_by_speaker('Participant', cleaned=True)

["So I needed time to do it after my event and I would use to take, in the afternoon, it was a pleasant afternoon's task to do that and I INAUDIBLE my hose clamps lined up just right and just went about the preventative maintenance.",
 "I'm a great believer of planned maintenance and that's from professional experience but the ...So I went to bite this off and it took me three days to get it done, at the end of three days I did not have things quite right.",
 "My scars were all peeled up and bloody from the mishaps, and I knew INAUDIBLE even though I knew how I wanted my hand to move, my brain wouldn't make it do it.",
 'It was INAUDIBLE, okay.',
 "That's what made me decide that I couldn't use my hand tools anymore.",
 'CROSSTALK.',
 'CROSSTALK.',
 'Oh, yeah.',
 'CROSSTALK.',
 'CROSSTALK.',
 'It was heartbreak for me to ...',
 "I'm sorry, it's a heartbreak for me and still I try to be a realest.",
 "So it's a INAUDIBLE, it's a reality, too bad.",
 "I'm straying from your question.",
 

## `utils.datasaur`

The `utils.datasaur` module makes it easy to access the entire database in terms of `Document` objects.
I typically import it as follows:

In [7]:
import utils.datasaur as data

What you probably care about here is the `by_doc` list, which contains every document in the entire dataset, each converted to a `Document` object:

In [8]:
data.by_doc

[Document(2015_246.txt, s1051-54_s2014-19_s3051-75),
 Document(2014_228.txt, s1051-54_s2014-19_s3051-75),
 Document(2014_233.txt, s1051-54_s2014-19_s3051-75),
 Document(2014_222.txt, s1051-54_s2014-19_s3051-75),
 Document(051_628.txt, s1051-54_s2014-19_s3051-75),
 Document(3001_056.txt, s1051-54_s2014-19_s3051-75),
 Document(2015_247.txt, s1051-54_s2014-19_s3051-75),
 Document(054_676.txt, s1051-54_s2014-19_s3051-75),
 Document(054_662.txt, s1051-54_s2014-19_s3051-75),
 Document(2018_276.txt, s1051-54_s2014-19_s3051-75),
 Document(052_645.txt, s1051-54_s2014-19_s3051-75),
 Document(2014_229.txt, s1051-54_s2014-19_s3051-75),
 Document(2018_278.txt, s1051-54_s2014-19_s3051-75),
 Document(3001_058.txt, s1051-54_s2014-19_s3051-75),
 Document(054_669.txt, s1051-54_s2014-19_s3051-75),
 Document(3001_051.txt, s1051-54_s2014-19_s3051-75),
 Document(2015_241.txt, s1051-54_s2014-19_s3051-75),
 Document(3001_061.txt, s1051-54_s2014-19_s3051-75),
 Document(2017_275.txt, s1051-54_s2014-19_s3051-75)

You'll see that it does, indeed, contain every document:

In [9]:
len(data.by_doc)  # should be 1234

1234
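
Since every `Document` carries `name` and `project` attributes, a list like `by_doc` is easy to summarize or index. The sketch below uses stand-in objects so that it runs without the real data; `count_by_project` and `index_by_name` are made-up helpers, and indexing by name assumes document names are unique across projects, which is worth double-checking.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_project(docs):
    """Count how many documents each project contains."""
    return Counter(doc.project for doc in docs)


def index_by_name(docs):
    """Build a name -> document lookup (assumes names are unique)."""
    return {doc.name: doc for doc in docs}


# Stand-ins for Document objects; with the real data you would pass data.by_doc
docs = [
    SimpleNamespace(name="001_009.txt", project="HD_set1_1-7"),
    SimpleNamespace(name="2015_246.txt", project="s1051-54_s2014-19_s3051-75"),
    SimpleNamespace(name="2014_228.txt", project="s1051-54_s2014-19_s3051-75"),
]

print(count_by_project(docs))
print(index_by_name(docs)["001_009.txt"].project)  # HD_set1_1-7
```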

# The Actual Task

The following code is hopefully a start, building on some of what I've shared above.
It'll give you a way of accessing every spoken line in the data, while allowing you to find out where the line came from.

Again, please ask me any questions if necessary!

In [26]:
import string

import utils.datasaur as data


for doc in data.by_doc:
    # Collect every spoken line in the document, speaker by speaker.
    # This goes through lines_by_speaker instead of doc.lines because
    # lines_by_speaker cleans out speaker labels and other unwanted text
    # by default, while leaving punctuation in place.
    doc_lines = [line
                 for speaker in doc.speaker_set()
                 for line in doc.lines_by_speaker(speaker)]
    for line in doc_lines:
        # Sanity check: every spoken line should contain some punctuation
        assert any(punct in line for punct in string.punctuation), \
            f"No punctuation in line: {line}, {doc}"

AssertionError: No punctuation in line: So literal—, Document(2018_278.txt, s1051-54_s2014-19_s3051-75)
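
The assertion trips on this line because `string.punctuation` only lists ASCII punctuation, and the line ends in an em dash (U+2014), which is not ASCII. If lines like this should count as punctuated, one option is to also accept any character that Unicode classifies as punctuation. This is only a sketch of that idea: `has_punctuation` is a hypothetical helper, not something in the codebase.

```python
import string
import unicodedata


def has_punctuation(line: str) -> bool:
    """True if the line contains ASCII or Unicode punctuation."""
    return any(
        # Unicode general categories starting with 'P' are punctuation
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
        for ch in line
    )


em_dash_line = "So literal\u2014"  # the line from the error above
print(any(p in em_dash_line for p in string.punctuation))  # False
print(has_punctuation(em_dash_line))                       # True
```

Keeping the `string.punctuation` check alongside the Unicode one matters because a few ASCII characters in it, such as `$` and `+`, are classified as symbols rather than punctuation.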