In [1]:
import os

if not os.path.exists('data'):
    os.chdir('..')
assert os.getcwd().endswith('HoardingDisorderScripts')

# Fixing 001-007

In the module below that I name `data`, `data.by_transcript` is a dictionary that maps transcript numbers (e.g. `'001'`) to the list of documents (fragments of a transcript) that belong to that transcript (e.g. `['001_001.txt', '001_002.txt', '001_004.txt', ...]`).

In [2]:
import utils.datasaur as data

data.by_transcript

{'001': [Document(name="001_000.txt", project="HD_set1_1-7"),
  Document(name="001_001.txt", project="HD_set1_1-7"),
  Document(name="001_002.txt", project="HD_set1_1-7"),
  Document(name="001_003.txt", project="HD_set1_1-7"),
  Document(name="001_004.txt", project="HD_set1_1-7"),
  Document(name="001_005.txt", project="HD_set1_1-7"),
  Document(name="001_006.txt", project="HD_set1_1-7"),
  Document(name="001_007.txt", project="HD_set1_1-7"),
  Document(name="001_008.txt", project="HD_set1_1-7"),
  Document(name="001_009.txt", project="HD_set1_1-7"),
  Document(name="001_010.txt", project="HD_set1_1-7"),
  Document(name="001_011.txt", project="HD_set1_1-7"),
  Document(name="001_012.txt", project="HD_set1_1-7"),
  Document(name="001_013.txt", project="HD_set1_1-7"),
  Document(name="001_014.txt", project="HD_set1_1-7")],
 '002': [Document(name="002_015.txt", project="HD_set1_1-7"),
  Document(name="002_016.txt", project="HD_set1_1-7"),
  Document(name="002_017.txt", project="HD_set1_1-

If you haven't already modified the data, you'll see that the list of documents is doubled, i.e. you'll see two copies of `001_001.txt`, two copies of `001_002.txt`, and so on. This is the issue that we want to solve! Each document name should be unique, and the duplicates cause problems later on when we try to put these documents back together into their original transcripts.
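
A quick way to confirm which names are duplicated is to count them. The sketch below is only an illustration: `Doc` is a hypothetical stand-in for `utils.document.Document` (assuming only that it has `name` and `project` attributes), `duplicated_names` is a made-up helper, and the list is toy data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for utils.document.Document (name and project only)."""
    name: str
    project: str


def duplicated_names(docs):
    """Return the sorted document names that occur more than once in docs."""
    counts = Counter(doc.name for doc in docs)
    return sorted(name for name, n in counts.items() if n > 1)


# Toy data for illustration, not the real dataset.
docs = [
    Doc("001_000.txt", "HD_set1_1-7"),
    Doc("001_000.txt", "s1_21-27_s2_1-3"),
    Doc("001_001.txt", "HD_set1_1-7"),
]
dupes = duplicated_names(docs)
print(dupes)
```

With the real data, the same helper could be applied to each list in `data.by_transcript`, assuming those lists hold objects with a `name` attribute.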

In `broken_transcripts.ipynb`, I identified that transcripts '001', '002', '003', '004', '005', '006', and '007' all have this issue.
We can filter `data.by_transcript` to focus only on these key/value pairs:

In [3]:
from utils.document import Document # Importing this only for type-hints

broken_keys: list[str] = ['001', '002', '003', '004', '005', '006', '007']
broken_transcripts: dict[str, list[Document]] = {key : data.by_transcript[key] for key in broken_keys}
broken_transcripts

KeyError: '004'
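
The `KeyError` above means that `'004'` was not a key of `data.by_transcript` when this cell ran. A minimal sketch of a lookup that tolerates missing keys and reports them instead of raising is shown here; `select_existing` is a hypothetical helper and the dictionary is toy data, not the real `data.by_transcript`.

```python
def select_existing(mapping, keys):
    """Split keys into the entries found in mapping and the keys missing from it."""
    found = {key: mapping[key] for key in keys if key in mapping}
    missing = [key for key in keys if key not in mapping]
    return found, missing


# Toy stand-in for data.by_transcript (values shortened to file names).
by_transcript = {"001": ["001_000.txt"], "002": ["002_015.txt"]}

found, missing = select_existing(by_transcript, ["001", "002", "004"])
print(sorted(found))
print(missing)
```

Reporting the missing keys rather than silently dropping them keeps the mismatch visible, which matters here because a missing transcript number is itself a clue about how the data is labeled.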

We can then flatten this dictionary's values, so we get all docs from just these transcripts!

In [None]:
broken_docs = [doc for doc_list in broken_transcripts.values() for doc in doc_list]
broken_docs

[Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_000.txt", project="HD_set1_1-7"),
 Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_001.txt", project="HD_set1_1-7"),
 Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_002.txt", project="HD_set1_1-7"),
 Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_003.txt", project="HD_set1_1-7"),
 Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_004.txt", project="HD_set1_1-7"),
 Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_005.txt", project="HD_set1_1-7"),
 Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_006.txt", project="HD_set1_1-7"),
 Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_007.txt", project="HD_set1_1-7"),
 Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_008.txt", project="HD_set1

Now we have a list of our "problem" documents, useful for looping. We're going to filter out all of the documents that are in projects with names that start with "HD_set1", trusting that the naming convention means they are necessarily hoarding documents (if I can't trust even *this*, I will go mad).

In [None]:
docs_to_rename = set(broken_docs) - set(doc for doc in broken_docs if doc.project.startswith('HD_set1'))
docs_to_rename

{Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_009.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_010.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_011.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_012.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_013.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_014.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_015.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_016.txt", project="s1_21-27_s2_1-3"),
 Document(name

Now, all of the remaining documents are in the following projects:

In [None]:
{doc.project for doc in docs_to_rename}

{'s1_21-27_s2_1-3', 's1_28-35_s2_4-7'}

The naming convention for these projects tells us that their documents fall in Set 1 (s1), Hoarding; and Set 2 (s2), one of the control sets. So, unfortunately, we still have some work to do in identifying the Hoarding documents, i.e. the documents that don't need to be renamed. 

One heuristic we could use is the set of speakers in each document. Again assuming that every document in the `HD_set1` projects is necessarily a hoarding document, we can collect the set of all speakers that appear across all documents in these projects:

In [None]:
# data.by_project is a dictionary whose keys are project names (e.g. 'HD_set1_1-7') and whose values are
# lists of the documents in that project, represented as Document objects.
# We keep only the projects whose names start with 'HD_set1' and collect their documents.
hdsets = {proj: docs for proj, docs in data.by_project.items() if proj.startswith('HD_set1')}
# We will then obtain all documents from these projects by flattening the lists of documents
hdset_docs = [doc for doclist in hdsets.values() for doc in doclist]
# Then we will collect all unique speakers from these documents
hdset_speakers = {speaker for doc in hdset_docs for speaker in doc.speaker_set()}
hdset_speakers

{'Interviewer', 'Participant'}

Ok! So this seems to be all of the speakers that appear in Hoarding documents. There may be more, but as of now I don't know of a more reliable method to tell aside from project names (I would usually use the naming convention that hoarding transcripts start with a 0, but oh well...).

Let's check the speakers that appear in the set of documents that need to be renamed:

In [None]:
{speaker for doc in docs_to_rename for speaker in doc.speaker_set()}

{'Interviewee', 'Interviewer', 'P1', 'P3'}

So, any document that has either 'Interviewer' or 'Interviewee' in its speaker set is likely a hoarding document. We can't be sure, unfortunately, as I've seen contexts where those labels are used in set 2 documents. But let's isolate the remaining documents that have these speakers:

In [None]:
[doc for doc in docs_to_rename if any(speaker in doc.speaker_set() for speaker in {'Interviewer', 'Interviewee'})]

[Document(name="002_025.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_031.txt", project="s1_21-27_s2_1-3"),
 Document(name="007_112.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_020.txt", project="s1_21-27_s2_1-3"),
 Document(name="005_076.txt", project="s1_28-35_s2_4-7"),
 Document(name="005_078.txt", project="s1_28-35_s2_4-7"),
 Document(name="003_048.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_053.txt", project="s1_28-35_s2_4-7"),
 Document(name="003_046.txt", project="s1_21-27_s2_1-3"),
 Document(name="003_039.txt", project="s1_21-27_s2_1-3"),
 Document(name="003_041.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_069.txt", project="s1_28-35_s2_4-7"),
 Document(name="006_093.txt", project="s1_28-35_s2_4-7"),
 Document(name="001_011.txt", project="s1_21-27_s2_1-3"),
 Document(name="007_116.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_023.txt", project="s1_21-27_s2_1-3"),
 Document(name="005_089.txt", project="s1_28-35_s2_4-7"),
 Document(name

Well... shit. That's too many to sift through manually. Ok, instead of trying to identify the documents that are hoarding documents, let's try to identify the documents that are *not* hoarding documents. To do that, I'll need to identify all of the speakers that occur in the set 2 and set 3 documents that we know of, which means I'll have to write a script for that. What a pain.
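
A minimal sketch of what that script might look like, assuming each document exposes a `set` attribute and a `speaker_set()` method as they are used elsewhere in this notebook. `Doc`, `speakers_by_set`, and the toy documents below are hypothetical and only illustrate the grouping; the real label-to-set assignment would have to come from the data.

```python
from collections import defaultdict


class Doc:
    """Hypothetical stand-in: only `set` and `speaker_set()` mirror the notebook's usage."""

    def __init__(self, name, set_number, speakers):
        self.name = name
        self.set = set_number
        self._speakers = frozenset(speakers)

    def speaker_set(self):
        return set(self._speakers)


def speakers_by_set(docs):
    """Map each set number to the union of speaker labels seen in its documents."""
    labels = defaultdict(set)
    for doc in docs:
        labels[doc.set] |= doc.speaker_set()
    return dict(labels)


# Toy documents; which labels belong to which set is made up for illustration.
docs = [
    Doc("a.txt", 1, ["Interviewer", "Participant"]),
    Doc("b.txt", 2, ["Interviewer", "Interviewee"]),
    Doc("c.txt", 3, ["P1", "P3"]),
]
labels = speakers_by_set(docs)

# Labels seen in set 2 or set 3 but never in set 1.
non_hoarding_only = (labels[2] | labels[3]) - labels[1]
print(sorted(non_hoarding_only))
```

Any label that ends up in `non_hoarding_only` would mark a document as a candidate for renaming; labels shared between sets would still need another signal.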

Unfortunately, not every hoarding document has a 'Participant' label in it. Some documents contain only the Interviewer (because of the way the transcripts were split into documents), and the 'Interviewer' label also appears in non-hoarding documents, albeit rarely.

I can't think of a better way to deal with this issue other than checking manually:

In [None]:
{doc.name : doc.content for doc in docs_to_rename if 'Interviewer' in doc.speaker_set()}

{'002_025.txt': 'Interviewer: What about, you know you could imagine two patients, right, and they\'re exhibiting, they have the exact same cognition, perhaps they have the same life experience these two hypothetical patients, they save the same amount of items, and one of them has a lot of money and can afford to buy six storage areas and has basically a mansion to live in; whereas, the other one lives in a one bedroom apartment.\nSo, one of them is going to have the appearance of much much much more clutter than the other, although they both have the same kind of underlying...even like brain behavior, you know what I mean like they have the same thoughts.\nHow do you decide that--would you decide that they both have hoarding disorder, although one of them isn\'t cluttered?\nWould you decide that only the person in the apartment has hoarding disorder or, what?\nHow would you make that call?\nInterviewee: Yeah, I think I would still be going back to conflict in distress in that case.\n

In [None]:
[doc for doc in docs_to_rename if doc.name == '006_049.txt']

[]

In [5]:
# Turns out I was wrong about Transcript 004---it's a combination of two transcripts labeled 004
from utils.transcript import Transcript

Transcript('2004').speaker_set()

{'Interviewee', 'Interviewer'}

In [13]:
assert all('Interviewee' in doc.speaker_set() for doc in data.by_doc if doc.set == 2)
{speaker for doc in data.by_doc if doc.project.startswith('HD_set1') for speaker in doc.speaker_set()}

{'Interviewer', 'Participant'}