# Finding All Speakers Across All Documents

Ok... I'm at my wit's end here. I should have started by doing this, but I didn't, so I have to do it now. That's just how it is!

Throughout this project, I've been making the inductive assumption that there were only speakers with certain names, e.g. 'Interviewer', 'Participant', 'Interviewee', and a few others. But when working to fix 001-007, I found that this assumption massively understated the number of labels. There are many more speaker labels that I have missed, and I need to look for **all** of them now so that this doesn't happen again.

## Step 1: Modify `document.py`

I'm going to start by modifying the speaker regex logic in `document.py` so that the regex doesn't only look for strings that match the types of speaker labels that we have seen followed by a colon. Instead, I want the base regex to match *any* string followed by a colon (with a few caveats, e.g. accounting for labels followed by numbers).

###### ^ done in 0f0e355836c04639c63b8b5ecd7dda979666d96a
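
To make the idea concrete, here is a minimal sketch of what a "match anything followed by a colon" pattern could look like. This is **not** the regex from `document.py`; `LABEL_RE` and `find_labels` are hypothetical names, and the pattern (letters, optional trailing digits, then a colon) is only my assumption about the general shape.

```python
import re

# Hypothetical pattern, NOT the one in document.py:
# a run of letters, optional trailing digits, then a colon.
LABEL_RE = re.compile(r"\b([A-Za-z]+\d*):")

def find_labels(line):
    """Return every candidate speaker label found anywhere in a line."""
    return [match.group(1) for match in LABEL_RE.finditer(line)]

# A real label at the start of the line.
print(find_labels("Interviewer: How are you?"))

# A real label plus a false positive from a mid-sentence colon.
print(find_labels("P1: I have 3 big chests in my attic: one full of pictures"))
```

A pattern this loose picks up every word that happens to precede a colon, which is why the results need a manual pass afterwards.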

## Step 2: Look for all speakers across all documents

With that done, let's import the stuff from `datasaur.py` and get to work. 

The following code groups all speaker labels found across all documents (given the new, now less restrictive built-in regex in the `Document` class) into the keys of a dictionary, and maps each of these keys to a list of documents in which that speaker label is found.

In [1]:
import utils.datasaur as data
from collections import defaultdict

speaker_dict = defaultdict(list)
for doc in data.by_doc: 
    for speaker in doc.speaker_set(restrict=False):
        speaker_dict[speaker].append(doc)
list(speaker_dict.keys())

['P1',
 'Interviewee',
 'P2',
 'Interviewer',
 'Participant',
 'inaudible',
 'crosstalk',
 'Speaker',
 'Speaker1',
 'guess',
 'relevantly',
 'women',
 'be',
 'method',
 'say',
 'even',
 'virtually',
 'indistinguishable',
 'were',
 'LOCATION',
 'NAME',
 'until',
 'agents',
 'fireman',
 'vernet',
 'Speake',
 'Yeah',
 'died',
 'account',
 'attic',
 'taker',
 'know',
 'Interview',
 'Wrench',
 'INAUDIBLE',
 'one',
 'two',
 'speaker',
 'INAUDUBLE',
 'INSTITUTION',
 'state',
 'ORGANIZATION',
 'ORGANIZATIONS',
 'CBT',
 'genre',
 'mentioned',
 'is',
 'INUADIBLE',
 'options',
 'answer',
 'them',
 'of',
 'thinking',
 'Interviewer19',
 'triggers',
 'crap',
 'yeah',
 'No',
 'that',
 'Yes',
 'Female',
 'P3',
 'by',
 'interview',
 'idea',
 'decision',
 'states',
 'like',
 'example',
 'was',
 'hoarders',
 'consent',
 'says',
 'cell',
 'awful',
 'are',
 'Right',
 'but',
 'cleaned',
 'no',
 'Northeast',
 'young',
 'female',
 'Wooo',
 'unintelligible',
 'who',
 'words',
 'Caucasian',
 'you',
 'playful',


Now, I'm going to manually look through these speaker labels and see which of them are actually speakers. A lot of these are going to be duds, e.g. the regex might match "attic" in the string "I have 3 big chests in my attic: one full of pictures, another full of...", where the colon is ordinary punctuation rather than the end of a speaker label.
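
As a possible aid for the manual pass (not something I actually ran here), one heuristic is to check how often a candidate label starts a line versus how often it appears anywhere before a colon. The helper `line_start_ratio` and the sample `lines` below are made up for illustration.

```python
import re

def line_start_ratio(label, lines):
    """Fraction of 'label:' occurrences that sit at the start of a line."""
    anywhere = re.compile(r"\b" + re.escape(label) + r":")
    at_start = re.compile(r"^\s*" + re.escape(label) + r":")
    total = sum(len(anywhere.findall(line)) for line in lines)
    starts = sum(1 for line in lines if at_start.match(line))
    return starts / total if total else 0.0

# Made-up transcript lines, only for illustration.
lines = [
    "Interviewer: What do you keep up there?",
    "P1: I have 3 big chests in my attic: one full of pictures",
    "P1: Yeah",
]

for label in ["Interviewer", "P1", "attic"]:
    print(label, line_start_ratio(label, lines))
```

Labels with a ratio near 1 are more likely to be real speakers, and labels near 0 are more likely to be duds. It would still miss real labels that appear mid-line, so it can only narrow the list, not replace reading it.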

I will first filter out the names from here that I already know about.

In [2]:
from utils.document import SPEAKERS

speaker_dict = {speaker : lst for speaker, lst in speaker_dict.items() if speaker.title() not in SPEAKERS}
speaker_dict.keys()

dict_keys(['inaudible', 'crosstalk', 'Speaker1', 'guess', 'relevantly', 'women', 'be', 'method', 'say', 'even', 'virtually', 'indistinguishable', 'were', 'LOCATION', 'NAME', 'until', 'agents', 'fireman', 'vernet', 'Speake', 'Yeah', 'died', 'account', 'attic', 'taker', 'know', 'Interview', 'Wrench', 'INAUDIBLE', 'one', 'two', 'INAUDUBLE', 'INSTITUTION', 'state', 'ORGANIZATION', 'ORGANIZATIONS', 'CBT', 'genre', 'mentioned', 'is', 'INUADIBLE', 'options', 'answer', 'them', 'of', 'thinking', 'Interviewer19', 'triggers', 'crap', 'yeah', 'No', 'that', 'Yes', 'Female', 'by', 'interview', 'idea', 'decision', 'states', 'like', 'example', 'was', 'hoarders', 'consent', 'says', 'cell', 'awful', 'are', 'Right', 'but', 'cleaned', 'no', 'Northeast', 'young', 'female', 'Wooo', 'unintelligible', 'who', 'words', 'Caucasian', 'you', 'playful', 'stuff', 'Currator', 'Len', 'waste', 'at', 'US', 'UNIVERSITY', 'question', 'problem', 'said', 'Treddle', 'drawers', 'hell', 'listing', 'worded', 'quote', 'out', 'To

In the filtered keys, I notice 'Speaker1' and 'Interviewer19'. They are found in:

In [3]:
keys_with_numbers = ['Interviewer19', 'Speaker1']
[line for key in keys_with_numbers for line in speaker_dict[key][0].lines if key in line]

['Interviewer19:09- Ok sounds good. So, you’ve Mentioned that your partner keeps your behavior in check. But how do your other friends and family feel about your behavior?',
 'Speaker1: Umm']

These look like typos, both likely from missing spaces. Not sure how to handle these, but I'll return to them later. 
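
One option for when I come back to this, sketched under the assumption that the problem really is a missing space: split each label into its alphabetic base and any trailing digits, then decide what to do with the digits. `split_label` is a hypothetical helper, not part of `document.py`.

```python
import re

def split_label(label):
    """Split a label into (alphabetic base, trailing digits)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d*)", label)
    if match is None:
        # Label doesn't fit the letters-then-digits shape; leave it alone.
        return label, ""
    return match.group(1), match.group(2)

for label in ["Speaker1", "Interviewer19", "P1", "Interviewer"]:
    print(label, "->", split_label(label))
```

This doesn't settle the question on its own: 'P1' splits the same way as 'Speaker1' even though 'P1' is a legitimate label, so the base would still need to be checked against the known speakers before changing anything.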

Let me focus on looking for names. I'll want to deidentify these so that everything is consistent.

In [4]:
set({'A'} - {'B'})

{'A'}

In [6]:
keys = ['inaudible', 'crosstalk', 'Speaker1', 'guess', 'relevantly', 'women', 'be', 'method', 'say', 'virtually', 'even', 'were', 'indistinguishable', 'Ann', 'Buttonheim', 'until', 'Sand', 'agents', 'vernet', 'fireman', 'Yeah', 'Speake', 'died', 'account', 'attic', 'taker', 'Interview', 'know', 'Wrench', 'INAUDIBLE', 'NAME', 'two', 'one', 'INAUDUBLE', 'LOCATION', 'INSTITUTION', 'state', 'ORGANIZATION', 'ORGANIZATIONS', 'CBT', 'genre', 'mentioned', 'is', 'INUADIBLE', 'Josha', 'options', 'answer', 'them', 'of', 'Interviewer19', 'thinking', 'triggers', 'crap', 'yeah', 'No', 'Yes', 'that', 'Female', 'Christian', 'by', 'interview', 'decision', 'idea', 'states', 'like', 'example', 'was', 'hoarders', 'consent', 'says', 'cell', 'are', 'awful', 'Right', 'but', 'no', 'cleaned', 'Wooo', 'young', 'Northeast', 'female', 'unintelligible', 'who', 'words', 'Caucasian', 'you', 'playful', 'Mellin', 'stuff', 'Currator', 'Len', 'waste', 'at', 'US', 'UNIVERSITY', 'question', 'problem', 'said', 'Treddle', 'drawers', 'hell', 'listing', 'worded', 'quote', 'out', 'Tollin', 'plan', 'reads']
# Manually filter for keys that look like they could be someone's name
# Note the typo Speake instead of Speaker, we'll have to fix that
names = ["Ann", "Buttonheim", "Sand", "vernet", "Josha", "Christian", "Mellin", "Rebecca"]
assert any(name in speaker_dict for name in set(names) - {'Rebecca'})
speaker_dict_names_with_examples = {
    speaker: [doc for doc in data.by_doc if speaker in doc.speaker_set(restrict=False)]
    for speaker, lst in speaker_dict.items() 
    if speaker in names
}
speaker_dict_names_with_examples

{'vernet': [Document(name="036_448.txt", project="s1036-42_s2008-9_s3000-15")]}

My solution to these names is to replace all of them with NAME.
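
A rough sketch of what that replacement could look like on a single line of text. `replace_names` is a hypothetical helper, the sample sentence is made up, and whether to match whole words case-insensitively (as done here) is my assumption, not a decision recorded above.

```python
import re

def replace_names(line, names, placeholder="NAME"):
    """Replace whole-word, case-insensitive occurrences of any name."""
    if not names:
        return line
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(placeholder, line)

# Made-up sentence, only for illustration.
print(replace_names("vernet: I moved in with Mellin last year.", ["vernet", "Mellin"]))
```

Whole-word matching has a cost: a name like 'Christian' would also be replaced where it is used as an ordinary word, so those cases would need to be checked by hand.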

In [None]:
[doc.speaker_set(restrict=False) for doc in data.by_doc if 'Christian' in doc.speaker_set(restrict=False)]

[]