In the module below, which I import as `data`, `data.by_transcript` is a dictionary that maps transcript numbers (e.g. `'001'`) to a list of documents (fragments of transcripts) that belong to each transcript (e.g. `['001_001.txt', '001_002.txt', '001_004.txt', ...]`).

In [1]:
import utils.datasaur as data

data.by_transcript

{'001': [Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_000.txt", project="HD_set1_1-7"),
  Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_001.txt", project="HD_set1_1-7"),
  Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_002.txt", project="HD_set1_1-7"),
  Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_003.txt", project="HD_set1_1-7"),
  Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_004.txt", project="HD_set1_1-7"),
  Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_005.txt", project="HD_set1_1-7"),
  Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_006.txt", project="HD_set1_1-7"),
  Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_007.txt", project="HD_set1_1-7"),
  Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_0

If you haven't modified the data yet, you'll see that the list of documents is doubled: each document name appears twice, once in each of two projects. This is the issue that we want to solve! Each document name should be unique.
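
As a rough sketch of what "doubled" means in code, the snippet below flags every transcript key whose document list repeats a name. `Doc` is a hypothetical stand-in for `utils.document.Document` (only `name` and `project` are modeled), and the toy dictionary is made up for illustration; it is not how `broken_transcripts.ipynb` found them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for utils.document.Document."""
    name: str
    project: str


def transcripts_with_repeated_names(by_transcript: dict[str, list[Doc]]) -> list[str]:
    """Return the keys whose document lists contain any name more than once."""
    broken = []
    for key, docs in by_transcript.items():
        counts = Counter(doc.name for doc in docs)
        if any(n > 1 for n in counts.values()):
            broken.append(key)
    return broken


# Toy data only; "toy_project" is not a real project name.
toy = {
    "001": [Doc("001_000.txt", "s1_21-27_s2_1-3"), Doc("001_000.txt", "HD_set1_1-7")],
    "099": [Doc("099_000.txt", "toy_project")],
}
print(transcripts_with_repeated_names(toy))
```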

In `broken_transcripts.ipynb`, I identified that transcripts '001', '002', '003', '004', '005', '006', and '007' all have this issue. 
We can filter `data.by_transcript` to focus only on these key/value pairs:

In [2]:
from utils.document import Document # Importing this only for type-hints

broken_keys: list[str] = ['001', '002', '003', '004', '005', '006', '007']
broken_transcripts: dict[str, list[Document]] = {key : data.by_transcript[key] for key in broken_keys}
broken_transcripts

{'001': [Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_000.txt", project="HD_set1_1-7"),
  Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_001.txt", project="HD_set1_1-7"),
  Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_002.txt", project="HD_set1_1-7"),
  Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_003.txt", project="HD_set1_1-7"),
  Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_004.txt", project="HD_set1_1-7"),
  Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_005.txt", project="HD_set1_1-7"),
  Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_006.txt", project="HD_set1_1-7"),
  Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_007.txt", project="HD_set1_1-7"),
  Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
  Document(name="001_0

We can then flatten this dictionary's values, so we get all docs from just these transcripts!

In [16]:
broken_docs = [doc for doc_list in broken_transcripts.values() for doc in doc_list]
broken_docs

[Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_000.txt", project="HD_set1_1-7"),
 Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_001.txt", project="HD_set1_1-7"),
 Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_002.txt", project="HD_set1_1-7"),
 Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_003.txt", project="HD_set1_1-7"),
 Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_004.txt", project="HD_set1_1-7"),
 Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_005.txt", project="HD_set1_1-7"),
 Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_006.txt", project="HD_set1_1-7"),
 Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_007.txt", project="HD_set1_1-7"),
 Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_008.txt", project="HD_set1
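
For reference, the same flattening can also be written with `itertools.chain.from_iterable` from the standard library. This is just an equivalent sketch on toy string lists (not real `Document` objects), in case the nested comprehension is hard to read.

```python
from itertools import chain

# Toy stand-in for broken_transcripts: plain strings instead of Document objects.
toy_transcripts = {
    "001": ["001_000.txt", "001_001.txt"],
    "002": ["002_014.txt"],
}

# chain.from_iterable walks each list in turn, in dictionary order.
flat = list(chain.from_iterable(toy_transcripts.values()))

# The nested comprehension used in the notebook gives the same result.
comprehension = [doc for doc_list in toy_transcripts.values() for doc in doc_list]

print(flat)
```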

Now we have a list of our "problem" documents, useful for looping. We're going to filter out all of the documents that are in projects with names that start with "HD_set1", trusting that this means they are necessarily hoarding documents due to the naming convention (if I can't trust even *this*, I will go mad).

In [5]:
docs_to_rename = {doc for doc in broken_docs if not doc.project.startswith('HD_set1')}
docs_to_rename

{Document(name="001_000.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_001.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_002.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_004.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_005.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_007.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_008.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_009.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_010.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_011.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_012.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_013.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_014.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_015.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_016.txt", project="s1_21-27_s2_1-3"),
 Document(name
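
One side effect of using a set here is that the original document order is lost, which is why later outputs come back shuffled. If order matters, a single pass that splits the list in two keeps it. A minimal sketch, again with a hypothetical `Doc` stand-in and toy data:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for utils.document.Document."""
    name: str
    project: str


def split_by_project_prefix(docs: list[Doc], prefix: str = "HD_set1") -> tuple[list[Doc], list[Doc]]:
    """Split docs into (matching prefix, not matching), preserving input order."""
    matching, rest = [], []
    for doc in docs:
        (matching if doc.project.startswith(prefix) else rest).append(doc)
    return matching, rest


toy_docs = [
    Doc("001_000.txt", "s1_21-27_s2_1-3"),
    Doc("001_000.txt", "HD_set1_1-7"),
    Doc("001_001.txt", "s1_21-27_s2_1-3"),
    Doc("001_001.txt", "HD_set1_1-7"),
]
hd_docs, to_rename = split_by_project_prefix(toy_docs)
print([doc.name for doc in to_rename])
```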

Now, all of the documents left in `docs_to_rename` are in the following projects:

In [17]:
{doc.project for doc in docs_to_rename}

{'s1_21-27_s2_1-3', 's1_28-35_s2_4-7'}

The naming convention for these projects tells us that their documents fall in Set 1 (s1), Hoarding; and Set 2 (s2), one of the control sets. So, unfortunately, we still have some work to do in identifying the Hoarding documents, i.e. the documents that don't need to be renamed. 

One heuristic we could use is the speakers that are in the document. Checking all the documents in the HD_set projects, again using the assumption that these are all necessarily hoarding documents, we can check to see the set of all speakers that appear across all documents in these projects:

In [19]:
# data.by_project is a dictionary where keys are project names [e.g. 'HD_set1_1-7'] and values are lists 
# of documents in that project represented as Document objects.
# We will keep only the projects whose names start with 'HD_set1' and collect their documents.
hdsets = {proj: docs for proj, docs in data.by_project.items() if proj.startswith('HD_set1')}
# We will then obtain all documents from these projects by flattening the lists of documents
hdset_docs = [doc for doclist in hdsets.values() for doc in doclist]
# Then we will collect all unique speakers from these documents
hdset_speakers = {speaker for doc in hdset_docs for speaker in doc._speaker_set}
hdset_speakers

{'Interviewee', 'Interviewer', 'Participant'}

Ok! So this seems to be all of the speakers that appear in Hoarding documents. There may be more, but as of now I don't know of a more reliable method to tell aside from project names (I would usually use the naming convention that hoarding transcripts start with a 0, but oh well...).

Let's check the speakers that appear in the set of documents that need to be renamed:

In [21]:
{speaker for doc in docs_to_rename for speaker in doc._speaker_set}

{'Interviewee', 'Interviewer', 'P1', 'P3', 'Rebecca'}
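
Putting the two speaker sets side by side makes the overlap explicit. The sets below are copied from the two outputs above; the variable name `rename_speakers` is mine.

```python
# Speaker labels copied from the two cell outputs above.
hdset_speakers = {"Interviewee", "Interviewer", "Participant"}
rename_speakers = {"Interviewee", "Interviewer", "P1", "P3", "Rebecca"}

shared = hdset_speakers & rename_speakers          # labels seen in both groups
only_in_rename = rename_speakers - hdset_speakers  # labels never seen in HD_set1 projects
only_in_hdset = hdset_speakers - rename_speakers   # labels seen only in HD_set1 projects

print(sorted(shared))          # ['Interviewee', 'Interviewer']
print(sorted(only_in_rename))  # ['P1', 'P3', 'Rebecca']
print(sorted(only_in_hdset))   # ['Participant']
```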

So, any document that has either 'Interviewer' or 'Interviewee' in its speaker set is likely a hoarding document. We can't be sure, unfortunately, as I've seen contexts where those labels are used in set 2 documents. But let's isolate the remaining documents that have these speakers:

In [24]:
[doc for doc in docs_to_rename if any(speaker in doc._speaker_set for speaker in {'Interviewer', 'Interviewee'})]

[Document(name="002_033.txt", project="s1_21-27_s2_1-3"),
 Document(name="005_085.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_025.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_006.txt", project="s1_21-27_s2_1-3"),
 Document(name="001_003.txt", project="s1_21-27_s2_1-3"),
 Document(name="006_099.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_030.txt", project="s1_21-27_s2_1-3"),
 Document(name="003_037.txt", project="s1_21-27_s2_1-3"),
 Document(name="002_026.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_072.txt", project="s1_28-35_s2_4-7"),
 Document(name="001_012.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_051.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_028.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_061.txt", project="s1_28-35_s2_4-7"),
 Document(name="003_045.txt", project="s1_21-27_s2_1-3"),
 Document(name="004_050.txt", project="s1_28-35_s2_4-7"),
 Document(name="002_021.txt", project="s1_21-27_s2_1-3"),
 Document(name

Well... shit. Ok, instead of trying to identify the documents that are hoarding documents, let's try to identify the documents that are not hoarding documents. In order to do that, I'll need to identify all of the speakers that occur in the set 2 and set 3 documents that we know of, which means I'll have to write a script for that. What a pain.

Unfortunately, not all hoarding documents have a 'Participant' label in them. Sometimes only the Interviewer is present in a document (because of the way the documents were split up), and the 'Interviewer' label sometimes appears in non-hoarding documents, albeit rarely.
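
Since no single label settles it, one thing that might at least shrink the manual work is bucketing documents by their exact speaker set, so each bucket can be reviewed together. This is only a sketch: the function name and the toy input (document name mapped to speaker labels) are made up, and it makes no claim about which buckets are hoarding documents.

```python
from collections import defaultdict


def group_by_speaker_set(speaker_sets: dict[str, set[str]]) -> dict[frozenset, list[str]]:
    """Group document names by their exact set of speaker labels."""
    groups = defaultdict(list)
    for name, speakers in speaker_sets.items():
        # frozenset is hashable, so it can be used as a dictionary key.
        groups[frozenset(speakers)].append(name)
    return dict(groups)


# Toy input only; these names and label sets are invented for illustration.
toy_speakers = {
    "doc_a.txt": {"Interviewer", "Interviewee"},
    "doc_b.txt": {"Interviewer"},
    "doc_c.txt": {"Interviewee", "Interviewer"},
    "doc_d.txt": {"P1", "P3"},
}
groups = group_by_speaker_set(toy_speakers)
for speakers, names in groups.items():
    print(sorted(speakers), names)
```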

I can't think of a better way to deal with this issue other than checking manually:

In [5]:
{doc.name : doc.content for doc in docs_to_rename if 'Interviewer' in doc._speaker_set}

{'004_072.txt': 'Interviewee: Yeah. Well, the other thing that’s important there related to that, not so much that it’s difficult, but... is the fact that the other pieces of this puzzle point to the possibility that what we’re looking at here is a form of giftedness.\nAnd that giftedness has to do with, number one, the ability to see the beauty in the physical world and appreciate the beauty in the physical world.\nLots of collecting behaviors are associated with aesthetically pleasing objects and not wanting to waste that aesthetic in that object.\nAnd so, one of the most frequent things people collect are things that are designed for some kind of arts and crafts project.\nAnd the difficulty is people see the potential for the aesthetic value of an object and they collect it with the intention of producing something, some kind of work of art or craft of some kind.\nBut they don’t have the organizational skills to really get that done, and so in some ways it’s an aesthetic gone awry.\

In [49]:
[doc for doc in docs_to_rename if doc.name == '006_049.txt']

[]

In [None]:
# Turns out I was wrong about Transcript 004---it's a combination of two transcripts labeled 004
from utils.transcript import Transcript

print(Transcript('004').content)

Interview 004
(part 1)
Christian: Um, I’m doing good.
I’m on the line with my colleague, NAME [4:54].
Interviewee: Hi NAME [4:55].
Rebecca: Hi.
Nice to meet you.
Interviewee: Nice to meet you.
Christian: Ok.
Um, well first of all we just wanted to say thank you so much to agreeing to meet and interview with us.
We’re really excited about this interview.
We both read a lot of your work on hoarding disorder, so we’re really excited to talk to you today.
Interviewee: Well, it’s my pleasure.
Christian: We also, of course, wanted to thank you for sending out the email to the special interest group.
We set up several interviews through that and have conducted about five or six now.
We have a few more scheduled for later on.
We’re still continuing to receive replies, so thank you again for that.
Interviewee: Good.
Christian: Just to kind of start things off, we have an informed consent that we’d like to read to you over the phone just to let you know a little but about the study, but really i

## Sanity Check: Identifying the broken documents in a different way

I'm going to specifically query for every document that appears twice.

In [17]:
doc_names = [doc.name for doc in data.by_doc]
names_that_appear_twice = [name for name in doc_names if doc_names.count(name) == 2]
names_that_appear_twice

['050_617.txt',
 '046_565.txt',
 '046_554.txt',
 '048_592.txt',
 '050_610.txt',
 '049_595.txt',
 '047_569.txt',
 '049_597.txt',
 '050_621.txt',
 '048_593.txt',
 '050_614.txt',
 '046_564.txt',
 '048_586.txt',
 '048_585.txt',
 '046_566.txt',
 '049_600.txt',
 '046_562.txt',
 '046_557.txt',
 '046_558.txt',
 '050_611.txt',
 '047_572.txt',
 '050_624.txt',
 '046_556.txt',
 '047_573.txt',
 '046_561.txt',
 '047_575.txt',
 '048_584.txt',
 '048_588.txt',
 '046_563.txt',
 '049_604.txt',
 '048_587.txt',
 '050_612.txt',
 '049_605.txt',
 '047_574.txt',
 '046_567.txt',
 '047_577.txt',
 '050_618.txt',
 '049_596.txt',
 '050_622.txt',
 '049_599.txt',
 '049_607.txt',
 '050_620.txt',
 '048_591.txt',
 '048_583.txt',
 '047_576.txt',
 '050_615.txt',
 '050_613.txt',
 '050_616.txt',
 '047_579.txt',
 '048_594.txt',
 '047_580.txt',
 '050_619.txt',
 '048_582.txt',
 '048_590.txt',
 '050_625.txt',
 '049_603.txt',
 '047_571.txt',
 '050_608.txt',
 '049_598.txt',
 '046_555.txt',
 '050_609.txt',
 '049_601.txt',
 '046_55
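
A note on the query itself: `doc_names.count(name)` rescans the whole list for every name, and the comprehension lists each repeated name once per occurrence. It also skips any name that shows up three or more times. A sketch with `collections.Counter` avoids all three; the function name and toy list are mine.

```python
from collections import Counter


def repeated_names(names: list[str]) -> dict[str, int]:
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(names)  # single pass over the list
    return {name: n for name, n in counts.items() if n > 1}


# Toy list only.
toy_names = ["a.txt", "b.txt", "a.txt", "c.txt", "c.txt", "c.txt"]
print(repeated_names(toy_names))
```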

Oh hell... duplicate documents...

In [24]:
docs = [doc for doc in data.by_doc if doc.name == '050_621.txt']
[doc.content for doc in docs]

['Participant 50:\nBut if a person is very fearful of being discovered, I think that they probably put on a facade and I think that they will do anything not to be discovered.\nBecause if you look at the hoarding specials and stuff, kids will say, "I didn\'t know I got this bad."\nSo the person on the phone is going to talk like there\'s not a care in the world or anything, or they\'re not going to complain about their space.\nThey may complain about the rent or whatever.\nSo unless they tell you, sometimes they\'re not going to be able to just discover, because I think that there\'s more of us walking amongst them to various degrees, just that certain people have worse cases than other people.\nInterviewer:\nDefinitely.\nParticipant 50:\nI don\'t think-\nInterviewer:\nYeah, go ahead.\nParticipant 50:\nI don\'t think you can look at stuff yourself.\nInterviewer:\nYeah, so it sounds like you definitely could have ordering disorder and nobody could know if they didn\'t see your space.\nA

Welp.
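
So there seem to be two different problems hiding behind "name appears twice": the same document stored twice, and two different documents sharing a name. A sketch for telling them apart by comparing contents; `Doc` is a hypothetical stand-in with a `content` field, and the toy contents are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for utils.document.Document."""
    name: str
    project: str
    content: str


def classify_repeats(docs: list[Doc]) -> tuple[list[str], list[str]]:
    """Return (names whose copies are identical, names whose copies differ)."""
    contents_by_name = defaultdict(list)
    for doc in docs:
        contents_by_name[doc.name].append(doc.content)
    exact_duplicates, name_collisions = [], []
    for name, contents in contents_by_name.items():
        if len(contents) < 2:
            continue  # name is already unique
        if len(set(contents)) == 1:
            exact_duplicates.append(name)
        else:
            name_collisions.append(name)
    return exact_duplicates, name_collisions


# Toy data only; project names and contents are invented.
toy_docs = [
    Doc("050_621.txt", "project_a", "same text"),
    Doc("050_621.txt", "project_b", "same text"),
    Doc("001_000.txt", "project_a", "one transcript"),
    Doc("001_000.txt", "project_c", "a different transcript"),
    Doc("002_014.txt", "project_a", "unique"),
]
exact, collisions = classify_repeats(toy_docs)
print(exact, collisions)
```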

In [21]:
broken_docs_names = [doc.name for doc in broken_docs]
assert set(names_that_appear_twice) == set(broken_docs_names)

AssertionError: 
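
A bare `assert` only says the two sets differ, not how. A small helper that reports what is on each side would make the mismatch easier to read. This is a sketch on toy sets; the helper name is made up.

```python
def explain_mismatch(left: set[str], right: set[str]) -> dict[str, list[str]]:
    """List the names that appear in only one of the two sets."""
    return {
        "only_left": sorted(left - right),
        "only_right": sorted(right - left),
    }


# Toy sets only; the real ones would be set(names_that_appear_twice)
# and set(broken_docs_names).
toy_twice = {"050_621.txt", "046_565.txt", "001_000.txt"}
toy_broken = {"001_000.txt", "001_001.txt"}
report = explain_mismatch(toy_twice, toy_broken)
print(report)
```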