# Use case examples to study the LTS corpus

Import the necessary libraries

In [1]:
import os, sys, inspect

# Add the parent directory to the module search path so that Utilities can be imported
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)
from Utilities import blacklab as bl

## 1. Find all possible forms of a verb with the lemma attribute: *flee, flees, fleeing, and fled*

### Research Problem:

Readers searching for moments in interviews when a victim recalls the experience of *fleeing* face a difficulty: a simple search for *flee* would not find inflected forms such as *fleeing*, *flees*, or *fled*.

### Solution:

The corpus engine stores the lemma of every word in the 2700 transcripts; in more technical terms, each word in the corpus has a lemma attribute. As a result, readers can use the lemma attribute as a search criterion to find all inflected forms of a noun or a verb such as flee. In CQL, attributes used as search criteria have to be placed between a pair of square brackets; each pair of brackets matches an individual word.

```[lemma="flee"]```

In [2]:
query = '[lemma="flee"]'

In [3]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [4]:
print (response[0]['match_word'])

could show you a picture of my family when we fled to Italy . This is the uncle I 'm talking 


Print the number of results

In [5]:
len(response)

1146

## 2. Disambiguation with part-of-speech information: *fly* (meaning insect) versus *fly* (meaning travel through air)

### Research Problem:

Readers want to find textual contexts where victims talk about the experience of being bothered by flies. When they enter fly or flies into the search box, they are also given textual contexts where fly means traveling through the air.

### Solution:

CQL enables the combination of lemma and grammatical category, defined through the pos attribute, which can be used for disambiguation.

```[lemma="fly" & pos="N.*"]```

In [6]:
query = '[lemma="fly" & pos="N.*"]'

In [7]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [8]:
print (response[0]['match_word'])

they pushed you in the water like nothing , like flies . You know , they did n't care about you 


Print the number of results

In [9]:
len(response)

528

This example highlights two very important features of CQL. First, attributes can be connected with the & operator; this expresses the logical relationship that natural languages express with and. In other words, the pattern above matches a given word if its lemma is fly and if it is used as a noun.

Second, when defining the content of an attribute, CQL enables character-level pattern matching, also known as regular expressions. In the example above, the pos attribute, standing for grammatical category, is defined by the sequence of N, dot, and an asterisk: N.* The list of part-of-speech tags contains the abbreviations of all grammatical categories used to annotate the interview transcripts. Readers will not find N on its own there; instead they will find, for instance, NN (singular noun) or NNS (plural noun). N.* still matches all nominal tags thanks to character-level pattern matching. The dot stands for a wildcard character; the asterisk, known as a quantifier, says that N can be followed by zero or more wildcard characters. Hence, N.* covers both NN and NNS, just as V.* would cover verbal tags such as VB (base form of a verb) or VBN (past participle of a verb). In CQL, just as in regular expressions, not only the asterisk but also other quantifiers are available (see the Documentation of BlackLab).
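The effect of a pattern such as N.* or V.* on part-of-speech tags can be tried out with Python's `re` module. This is only a sketch: the tag list is a small illustrative selection, `TAGS` and `matching_tags` are hypothetical names, and `fullmatch` is used on the assumption that the pattern has to cover the whole attribute value.

```python
import re

# Illustrative selection of Penn Treebank-style tags; not the corpus's full tag list.
TAGS = ["NN", "NNS", "NNP", "VB", "VBD", "VBN", "IN"]

def matching_tags(pattern, tags=TAGS):
    """Return the tags that the pattern matches in full."""
    compiled = re.compile(pattern)
    # fullmatch assumes the pattern has to cover the whole attribute value.
    return [tag for tag in tags if compiled.fullmatch(tag)]

noun_tags = matching_tags("N.*")
verb_tags = matching_tags("V.*")
print(noun_tags)  # ['NN', 'NNS', 'NNP']
print(verb_tags)  # ['VB', 'VBD', 'VBN']
```

A bare `N` without the wildcard and quantifier matches none of these tags, which is why the `.*` suffix is needed.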

## 3. Search for synonyms with multiple lemmas: *mother, mummy, etc.*

### Research Problem:

Readers want to find textual contexts where victims speak about the experience of mothers, which can be expressed through a number of synonyms (mum, mummy, mom, etc.). A simple word search does not include synonyms.

### Solution:

With CQL readers can search for multiple lemmas at the same time; they can thus define an entire synonym set within one search.

```[lemma="mother" | lemma="mum" | lemma = "mummy" | lemma = "mom" | lemma = "mommy"]```

In [10]:
query = '[lemma="mother" | lemma="mum" | lemma = "mummy" | lemma = "mom" | lemma = "mommy"]'

In [11]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [12]:
print (response[0]['match_word'])

, uh , I do n't even know why my mother let me go to the post office that day . 


Print the number of results

In [13]:
len(response)

83074
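Longer synonym sets are tedious to type by hand. A small helper can assemble the pattern from a list of lemmas; the sketch below is one possible way to do it, and `lemma_alternatives` is a hypothetical name, not part of the `Utilities` module. It writes `lemma="..."` without spaces around the equals sign; the query above mixes both spellings.

```python
def lemma_alternatives(lemmas):
    """Build a single-token CQL pattern that matches any of the given lemmas."""
    clauses = [f'lemma="{lemma}"' for lemma in lemmas]
    return "[" + " | ".join(clauses) + "]"

synonym_query = lemma_alternatives(["mother", "mum", "mummy", "mom", "mommy"])
print(synonym_query)
```

The resulting string could then be passed to `bl.search_blacklab` in the same way as the hand-written query.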

## 4. Find terms with spelling variants: *capo* and *kapo*.

### Research Problem:

The same term can be present in the data with different spellings. For instance, one can find both capo and kapo in the transcripts.

### Solution:

The regular expression (c|k) matches either c or k, so a single pattern covers both spellings.

```[lemma = "(c|k)apo"]```

In [14]:
query = '[lemma = "(c|k)apo"]'

In [15]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [16]:
print (response[0]['match_word'])

you ? SUBJECT : The -- what they called the kapos , the supervisors . INTERVIEWER 2 : Jewish ? SUBJECT 


Print the number of results

In [17]:
len(response)

2767

## 5. Find terms with both British and American spelling: *labour* and *labor*.

### Research Problem:

Transcripts sometimes follow the British and sometimes the American spelling system. For instance, both labour and labor are present in the transcripts. It is therefore recommended to run searches that cover both spelling systems.

### Solution:

The question mark makes the preceding u optional, so the pattern matches both labor and labour.

```[lemma="labo(u?)r"]```

In [18]:
query = '[lemma="labo(u?)r"]'

In [19]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [20]:
print (response[0]['match_word'])

a week . They had -- had to evacuate this labor or this work camp where we were in . And 


Print the number of results

In [21]:
len(response)

6167
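The two spelling patterns from sections 4 and 5 can be checked against candidate forms with Python's `re` module. This is a minimal sketch with made-up word lists; `matches` is a hypothetical helper, and `fullmatch` again assumes that the pattern has to cover the whole lemma.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]

# Alternation: either c or k in first position.
kapo_forms = matches(r"(c|k)apo", ["capo", "kapo", "apo", "ckapo"])

# Optional character: u may occur once or not at all.
labour_forms = matches(r"labo(u?)r", ["labor", "labour", "labouur", "labr"])

print(kapo_forms)    # ['capo', 'kapo']
print(labour_forms)  # ['labor', 'labour']
```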

## 6. Differentiate homonymic terms with the help of case sensitivity: *Joint* (The American Joint Distribution Committee) versus *joint* (body part).

### Research Problem:

Our default search is case-insensitive. A search for joint or Joint therefore returns occurrences where the word refers either to the colloquial name of The American Joint Distribution Committee or to a body part. One thus needs to differentiate the two meanings of joint.

### Solution:

Since Joint as the colloquial name of The American Joint Distribution Committee always begins with a capital letter, case sensitivity can be used to force CQL to find only those instances where the first letter is capitalized.

```[word="(?-i)Joint"]```

Case sensitivity is enforced by means of (?-i). Note that the pattern above still matches joint as a body part when it is capitalized at the beginning of a sentence.
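The same idea can be illustrated in Python, with one caveat: Python's `re` module does not accept a bare `(?-i)` prefix and only offers the scoped form `(?-i:...)`. The sketch below therefore is an analogy rather than the CQL syntax itself, and it assumes that the default search behaves like a case-insensitive regular expression.

```python
import re

words = ["Joint", "joint", "JOINT"]

# Default behaviour assumed here: matching ignores case.
insensitive = re.compile(r"joint", re.IGNORECASE)

# The scoped flag (?-i:...) switches case-insensitivity off inside the group.
sensitive = re.compile(r"(?-i:Joint)", re.IGNORECASE)

insensitive_hits = [w for w in words if insensitive.fullmatch(w)]
sensitive_hits = [w for w in words if sensitive.fullmatch(w)]

print(insensitive_hits)  # ['Joint', 'joint', 'JOINT']
print(sensitive_hits)    # ['Joint']
```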

In [22]:
query = '["(?-i)Joint"]'

In [23]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [24]:
print (response[0]['match_word'])

zone of Berlin . And then , already , the Joint Distribution Committee had set up receiving camps in Berlin . 


Print the number of results

In [25]:
len(response)

1437

## 7. Search for possible word sequences: *mothers were crying, mother cried, mother started to cry*

### Research Problem:

Retrieving moments when an interviewee speaks about mothers crying is difficult: this can be expressed in a variety of ways, and there can be multiple terms between mother and cry.

### Solution:

```[lemma="mother"] []{0,3} [lemma="cry"]```

This pattern matches sequences where a term whose dictionary form is mother is followed by a term whose dictionary form is cry, with at most three words in between. {0,3} indicates that between mother and cry there can be zero to three terms; [] indicates that each term in between can be any word.
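What the gap `[]{0,3}` means can be spelled out in plain Python over a list of lemmas. The sketch below is an illustration of the matching logic only, not of how the corpus engine works internally; `find_with_gap` is a hypothetical name and the lemma list is written by hand.

```python
def find_with_gap(lemmas, first, second, max_gap=3):
    """Return (start, end) index pairs where `first` is followed by `second`
    with at most `max_gap` tokens in between."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        # The second term may sit 1 to max_gap + 1 positions after the first.
        for j in range(i + 1, min(i + max_gap + 2, len(lemmas))):
            if lemmas[j] == second:
                hits.append((i, j))
    return hits

# Hand-written lemmas for "My mother started crying , but" (not corpus output).
lemmas = ["my", "mother", "start", "cry", ",", "but"]
hits = find_with_gap(lemmas, "mother", "cry")
print(hits)  # [(1, 3)]
```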

In [26]:
query = '[lemma="mother"] []{0,3} [lemma="cry"]'

In [27]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [28]:
print (response[0]['match_word'])

just too big and too shocking to absorb . My mother started crying , but I still somehow refused to believe it . 


Print the number of results

In [29]:
len(response)

327

## 8. Matching sequences with similar meaning through grouping operation: *I will never forget* and *I will always remember*

### Research Problem:

A key moment in an interview is when a victim utters the phrase *I will never forget*. But this can also be expressed as *I will always remember* or *I couldn’t forget*.

### Solution:

First, one needs to write two sequences: one in which I, never or n’t (which expresses negation), and forget occur, and one in which I, always, and remember occur.

```[word="I"] []{0,5}[word="never" | word = "n't" ] [lemma="forget"]```

```[word="I"] []{0,5}[word="always"] [lemma="remember"]```

Second, the two sequences need to be connected as groups with the or (|) operator; grouping is done with the help of parentheses.

```([word="I"] []{0,5}[word="never" | word = "n't" ] [lemma="forget"]) | ([word="I"] []{0,5}[word="always"] [lemma="remember"])```
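When more than two alternative sequences are involved, the grouping can be generated instead of typed. The sketch below wraps each sequence in parentheses and joins them with the or operator; `any_of` is a hypothetical helper name.

```python
def any_of(*sequences):
    """Wrap each CQL sequence in parentheses and join them with the or operator."""
    return " | ".join(f"({sequence})" for sequence in sequences)

forget = '[word="I"] []{0,5} [word="never" | word="n\'t"] [lemma="forget"]'
remember = '[word="I"] []{0,5} [word="always"] [lemma="remember"]'

grouped_query = any_of(forget, remember)
print(grouped_query)
```

Keeping each sequence in its own variable also makes it easy to run the two searches separately and compare the result counts.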

In [30]:
query = '([word="I"] []{0,5}[word="never" | word = "n\'t" ] [lemma="forget"]) | (["I"] []{0,5}[word="always"] [lemma="remember"])'

In [31]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [32]:
print (response[0]['match_word'])

to us . Actually , they actually were scared . I never forget one forester who came by foot , uh , and 


Print the number of results

In [33]:
len(response)

3200

## 9. Find repetitive sequences: *why, why, why*

### Research Problem:

The repetition of the same term can signal moments when traumatic memories are recalled. Finding repetitive uses of words is not possible with traditional word search.

### Solution:

We need to create a sequence and declare that the elements of the sequence are the same. In the example below, we create a sequence of three terms separated by commas, and in the last step we declare that the elements of the sequence are the same word.

When using this pattern on the LTS website, you need to insert your query between the < and > signs.

```<A:[] [word=","] B:[] [word=","] B:[] :: A.word = B.word>```

In [62]:
query = 'A:[] [","] B:[] [","] B:[] [","] A:[]  :: A.word = B.word'

In [63]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [64]:
print (response[0]['match_word'])

wanted to kill us . But one , then , no , no , no , let 's -- because they thought we were the Fifth Column 


Print the number of results

In [37]:
len(response)

5468
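The logic of such a repetition search can be sketched in plain Python over a list of tokens. This is an illustration under simplifying assumptions (exactly three repetitions, separated by single commas, compared case-insensitively); `find_triple_repetitions` is a hypothetical name and the tokens are taken from the sample result above.

```python
def find_triple_repetitions(tokens):
    """Return (index, word) for every 'x , x , x' pattern in a token list."""
    hits = []
    for i in range(len(tokens) - 4):
        a, sep1, b, sep2, c = tokens[i:i + 5]
        same_word = a.lower() == b.lower() == c.lower()
        if sep1 == sep2 == "," and a != "," and same_word:
            hits.append((i, a))
    return hits

tokens = "But one , then , no , no , no , let 's".split()
repetitions = find_triple_repetitions(tokens)
print(repetitions)  # [(5, 'no')]
```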

## 10. Find rhetorical questions:  *why should I fear death ?*

### Research Problem:

An interesting moment in testimonies is when survivors ask themselves rhetorical questions; sometimes these questions address the reason why something happened in the past.

### Solution:

CQL can match complete sentences that contain certain patterns. In the example below, we are looking for a sentence that contains *why* followed by *I* (with at least one word in between) and then a question mark within the next ten words.

When using this pattern on the LTS website, you need to insert your query between the < and > signs.

```<<s/> containing ([word="why"] []{1,} [word="I"] []{0,10} ["?"] )>```

In [38]:
query = '<s/> containing ([word="why"] []{1,} [word="I"] []{0,10} ["?"] )'

In [39]:
response = bl.search_blacklab(query,window=0,lemma=False)

Print a sample result

In [40]:
print (response[2]['match_word'])

Why do n't I do something ? 


Print the number of results

In [41]:
len(response)

60
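The condition that the pattern places on a sentence can be restated as a small Python function over tokens. This is a sketch of the matching logic only; `is_why_i_question` is a hypothetical name, and the comparison is lowercased on the assumption that the search ignores case.

```python
def is_why_i_question(tokens):
    """True if the sentence contains 'why', then 'I' after at least one other
    word, then '?' with at most ten words between 'I' and the question mark."""
    words = [token.lower() for token in tokens]
    for why_idx, word in enumerate(words):
        if word != "why":
            continue
        # []{1,}: "I" must come at least two positions after "why".
        for i_idx in range(why_idx + 2, len(words)):
            if words[i_idx] != "i":
                continue
            # []{0,10}: "?" follows "I" after 0 to 10 intervening tokens.
            if "?" in words[i_idx + 1:i_idx + 12]:
                return True
    return False

sentence = "Why do n't I do something ?".split()
print(is_why_i_question(sentence))  # True
```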

## 11. Find analogies: *like animals*

### Research Problem:

Survivors often use analogies to describe their experiences. In English, *like* is a preposition that can be used to express analogies. However, *like* can be either a preposition that draws a comparison or a verb expressing a wish or affection.

### Solution:

CQL can be used to distinguish *like* as a verb from *like* as a preposition. Furthermore, with CQL we can also form a sequence in which *like* as a preposition is followed by a noun.

```[lemma="like" & pos="IN"] [pos="N.*"]```

In [42]:
query = '[lemma="like" & pos="IN"] [pos="N.*"]'

In [43]:
response = bl.search_blacklab(query,window=0,lemma=False)

Print a sample result

In [44]:
print (response[10]['match_word'])

like hate 


Print the number of results

In [45]:
len(response)

14839
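How the combination of lemma and pos separates the two uses of *like* can be shown on a hand-tagged token list. The sketch below is illustrative: the tags are assigned by hand in Penn Treebank style, and `like_noun_pairs` is a hypothetical name.

```python
def like_noun_pairs(tagged):
    """Return 'like NOUN' pairs where 'like' is tagged as a preposition (IN)."""
    hits = []
    for (word1, lemma1, pos1), (word2, _lemma2, pos2) in zip(tagged, tagged[1:]):
        if lemma1 == "like" and pos1 == "IN" and pos2.startswith("N"):
            hits.append(f"{word1} {word2}")
    return hits

# (word, lemma, pos) triples tagged by hand for illustration.
tagged = [
    ("treated", "treat", "VBN"), ("like", "like", "IN"), ("animals", "animal", "NNS"),
    (".", ".", "."),
    ("I", "I", "PRP"), ("like", "like", "VBP"), ("music", "music", "NN"),
]
analogies = like_noun_pairs(tagged)
print(analogies)  # ['like animals']
```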

## 12. Find possibilities: *if I march and I fall, they will shoot me*

### Research Problem:

Survivors often recall troubling possibilities they faced during persecutions; retrieving these possibilities is almost impossible since they are often not expressed directly as possibilities. 

### Solution:

Conditional sentences (if-sentences) often convey possibilities experienced in the past; the sequence *if* and *I* followed by a verb finds examples of such past possibilities.

```[word="if"] [word="i"] [pos="V.*"]```

In [46]:
query = '[word="if"] [word="i"] [pos="V.*"]'

In [47]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print a sample result

In [48]:
print (response[15]['match_word'])

make me one Hungarian goulash . I do n't care if I die . We used to talk about our boyfriends . And 


Print the number of results

In [49]:
len(response)

9884

## 13. Finding moments of survivor guilt: *I should have died , they should have lived*

### Research Problem:

Survivors' guilt is a leitmotif in testimonies, though they do not always express it explicitly.

### Solution:

We can form a sequence consisting of *I* followed by *should*, *have*, and the past participle of a verb.

```[word="i"] [word="should"] [word="have"] [pos="VBN"]```

In [65]:
query = '[word="i"] [word="should"] [word="have"] [pos="VBN"]'

In [66]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print a sample result

In [68]:
print (response[1]['match_word'])

do right , that I did everything wrong , that I should have been able to keep him alive . Because I have known 


Print the number of results

In [53]:
len(response)

512

## 14. Finding moments of crying

### Research Problem:

Testimonies are often interrupted by crying. Here you can download how these moments are marked in testimonies of the Fortunoff Archive and in testimonies of the USC Shoah Foundation.

### Solution:

Since these moments are marked by the term *CRYING* in uppercase, you can search for them with the following query.

```[word="(?-i)CRYING"]```

In [54]:
query = '[word="(?-i)CRYING"]'

In [55]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print a sample result

In [56]:
print (response[50]['match_word'])

. So we went to the orchards and -- [ CRYING ] Mrs. Huzar said , the same thing . Say 


Print the number of results

In [57]:
len(response)

1329

## 15. Finding moments of silence

### Research Problem:

Testimonies are often interrupted by moments when survivors were unable to carry on with recalling their memories. 

### Solution:

Since these moments are marked by the term *PAUSES* in uppercase, you can search for them with the following query.

```[word="(?-i)PAUSES"]```

In [58]:
query = '[word="(?-i)PAUSES"]'

In [59]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print a sample result

In [60]:
print (response[50]['match_word'])

for you , and they arrested my father . [ PAUSES FOR 3 SECONDS ] But he was in the police 


Print the number of results

In [61]:
len(response)

45880
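The sample result above suggests that the annotation also records the length of the pause. If that holds more generally, the duration could be pulled out of each context string with a regular expression. The sketch below assumes that every annotation follows the exact format of this single sample (`[ PAUSES FOR n SECONDS ]`); other variants, if they exist, would not be matched. `pause_seconds` is a hypothetical name.

```python
import re

# Format assumed from the one sample result shown above.
PAUSE_RE = re.compile(r"\[ PAUSES FOR (\d+) SECONDS \]")

def pause_seconds(context):
    """Return the pause lengths (in seconds) annotated in a context string."""
    return [int(seconds) for seconds in PAUSE_RE.findall(context)]

sample = ("for you , and they arrested my father . "
          "[ PAUSES FOR 3 SECONDS ] But he was in the police")
durations = pause_seconds(sample)
print(durations)  # [3]
```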