# Use case examples to study the LTS corpus

Import the necessary libraries

In [40]:
import os,sys,inspect
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)
from Utilities import blacklab as bl

## 1. Find all possible forms of a verb with the lemma attribute: *flee, flees, fleeing, and fled*

### Research Problem:

Readers searching for moments in interviews when a victim is recalling the experience of *fleeing* face a difficulty: a simple search for *flee* would not find inflected forms such as *fleeing*, *flees*, and *fled*.

### Solution:

The corpus engine stores the lemma of every word in the 2700 transcripts; in more technical terms, each word in the corpus has a lemma attribute. As a result, readers can use the lemma attribute as a search criterion to find all inflected forms of a noun or a verb such as flee. In CQL, attributes used as search criteria have to be placed between a pair of square brackets; each pair of brackets then matches an individual word.

```[lemma="flee"]```

In [41]:
query = '[lemma="flee"]'

In [42]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [43]:
print (response[0]['match_word'])

could show you a picture of my family when we fled to Italy . This is the uncle I 'm talking 


Print the number of results

In [44]:
len(response)

1146
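
To check which surface forms the lemma search actually returned, the context strings can be tallied with a regular expression. The following is a minimal sketch and not part of the notebook's Utilities module. It assumes that each result's `match_word` value is a plain context string like the one printed above; the second sample context is invented for illustration.

```python
import re
from collections import Counter

# Surface forms of the lemma "flee" named in the heading of this section.
FLEE_FORMS = re.compile(r"\b(flee|flees|fleeing|fled)\b", re.IGNORECASE)

def count_forms(contexts):
    """Count how often each surface form of 'flee' occurs in the contexts."""
    counts = Counter()
    for text in contexts:
        for form in FLEE_FORMS.findall(text):
            counts[form.lower()] += 1
    return counts

sample_contexts = [
    # First result printed above.
    "could show you a picture of my family when we fled to Italy . This is the uncle I 'm talking",
    # Hypothetical context, made up for this sketch.
    "they were fleeing from the town at night",
]

form_counts = count_forms(sample_contexts)
print(form_counts)
```

With real results, `sample_contexts` could be replaced by `[hit['match_word'] for hit in response]`, assuming every entry has that key.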

## 2. Disambiguation with part-of-speech information: *fly* (meaning insect) versus *fly* (meaning travel through air)

### Research Problem:

Readers want to find textual contexts where victims talk about the experience of being bothered by flies. By entering fly or flies into the search box, they are also given textual contexts where fly means traveling through air.

### Solution:

CQL enables the combination of lemma and grammatical category, defined through the pos attribute, which can be used for disambiguation.

```[lemma="fly" & pos="N.*"]```

In [45]:
query = '[lemma="fly" & pos="N.*"]'

In [46]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [47]:
print (response[0]['match_word'])

they pushed you in the water like nothing , like flies . You know , they did n't care about you 


Print the number of results

In [48]:
len(response)

528

This example highlights two very important features of CQL. First, attributes can be connected with the & operator, which expresses the logical relationship that natural languages express with and. In other words, the pattern above matches a given word if its lemma is fly and if it is used as a noun. Second, when defining the content of an attribute, CQL enables character-level pattern matching, also known as regular expressions. In the example above, the pos attribute, standing for grammatical category, is defined by the sequence of N, dot, and an asterisk: N.* In this list, readers will find the abbreviations of all grammatical categories used to annotate interview transcripts. But they will not find N; instead they will find, for instance, NN (singular noun) or NNS (plural noun). N.* will still match all nominal tags thanks to character-level pattern matching. The dot followed by an asterisk indicates that after N there can be any number of additional characters. In more technical terms, the dot stands for a wildcard character; the asterisk, known as a quantifier, indicates that N can be followed by zero or more wildcard characters. Hence, N.* covers both NN and NNS; in the same way, V.* would cover verbal tags such as VB (base form of a verb) and VBN (past participle of a verb). In CQL, just as in regular expressions, not only the asterisk but also other quantifiers are available (see the Documentation of BlackLab).
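
The effect of the dot and asterisk can be tried out with Python's `re` module. This is only an illustrative sketch: it assumes Penn Treebank style tags and assumes that the pattern has to match the whole attribute value, which is why `re.fullmatch` is used here. The tag list is a small hand-picked sample, not the full tag set of the corpus.

```python
import re

# A small, hand-picked sample of Penn Treebank style part-of-speech tags.
tags = ["NN", "NNS", "NNP", "VB", "VBN", "JJ"]

# "N.*" : the letter N followed by zero or more arbitrary characters.
noun_tags = [tag for tag in tags if re.fullmatch(r"N.*", tag)]

# "V.*" : the same idea for verbal tags.
verb_tags = [tag for tag in tags if re.fullmatch(r"V.*", tag)]

print(noun_tags)
print(verb_tags)
```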

## 3. Search for synonyms with multiple lemmas: *mother, mummy, etc.*

### Research Problem:

Readers want to find textual contexts where victims speak about the experience of mothers, which can be expressed through a number of synonyms (mum, mummy, mom, etc.). A simple word search does not include synonyms.

### Solution:

With CQL, readers can search for multiple lemmas at the same time by connecting them with the | operator, which expresses the logical relationship or; they can thus define an entire synonym set within one search.

```[lemma="mother" | lemma="mum" | lemma = "mummy" | lemma = "mom" | lemma = "mommy"]```

In [49]:
query = '[lemma="mother" | lemma="mum" | lemma = "mummy" | lemma = "mom" | lemma = "mommy"]'

In [50]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [51]:
print (response[0]['match_word'])

, uh , I do n't even know why my mother let me go to the post office that day . 


Print the number of results

In [52]:
len(response)

83074
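
For longer synonym sets, writing the alternatives by hand becomes error-prone. A minimal sketch of a helper that assembles such a pattern from a Python list is shown below; `lemma_alternatives` is a hypothetical name introduced here and is not part of the Utilities module.

```python
def lemma_alternatives(lemmas):
    """Build a single-token CQL pattern that matches any of the given lemmas."""
    conditions = ['lemma="{}"'.format(lemma) for lemma in lemmas]
    return "[" + " | ".join(conditions) + "]"

synonyms = ["mother", "mum", "mummy", "mom", "mommy"]
synonym_query = lemma_alternatives(synonyms)
print(synonym_query)
```

The resulting string could then be passed to `bl.search_blacklab` in the same way as the hand-written query above.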

## 4. Find terms with spelling variants: *capo* and *kapo*.

### Research Problem:

The same term can be present in the data with different spellings. For instance, one can find both capo and kapo in the transcripts.

### Solution:

```[lemma = "(c|k)apo"]```

In [53]:
query = '[lemma = "(c|k)apo"]'

In [54]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [55]:
print (response[0]['match_word'])

you ? SUBJECT : The -- what they called the kapos , the supervisors . INTERVIEWER 2 : Jewish ? SUBJECT 


Print the number of results

In [56]:
len(response)

2767
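
In the pattern above, (c|k) means that the first character is either c or k. The sketch below checks this with Python's `re` module, assuming that the pattern has to match the whole lemma; the list of lemmas is invented for illustration.

```python
import re

# Illustrative lemmas; only the first two should match.
candidate_lemmas = ["capo", "kapo", "cap", "kapok"]

# (c|k) is an alternation: either "c" or "k", followed by "apo".
with_alternation = [w for w in candidate_lemmas if re.fullmatch(r"(c|k)apo", w)]

# A character class [ck] expresses the same choice more compactly.
with_char_class = [w for w in candidate_lemmas if re.fullmatch(r"[ck]apo", w)]

print(with_alternation)
```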

## 5. Find terms with both British and American spelling: *labour* and *labor*.

### Research Problem:

Transcripts sometimes follow the British and sometimes the American spelling system. For instance, both labour and labor are present in the transcripts. It is therefore recommended to run searches that cover both spelling systems.

### Solution:

```[lemma="labo(u?)r"]```

In [57]:
query = '[lemma="labo(u?)r"]'

In [58]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [59]:
print (response[0]['match_word'])

a week . They had -- had to evacuate this labor or this work camp where we were in . And 


Print the number of results

In [60]:
len(response)

6167
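
The question mark in the pattern above is a quantifier that makes the preceding element optional, so the u may or may not be present. A small sketch with Python's `re` module illustrates this, again assuming whole-value matching; the candidate lemmas are chosen for illustration only.

```python
import re

# Illustrative lemmas; derived words should not match the whole-value pattern.
spelling_candidates = ["labor", "labour", "laboratory", "labourer"]

# "(u?)" : the letter u occurs zero times or once.
labor_matches = [w for w in spelling_candidates if re.fullmatch(r"labo(u?)r", w)]

print(labor_matches)
```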

## 6. Differentiate homonymic terms with the help of case sensitivity: *Joint* (The American Joint Distribution Committee) versus *joint* (body part).

### Research Problem:

Our default search is case-insensitive. By searching for joint or Joint, readers will be given occurrences where joint refers either to the colloquial name of The American Joint Distribution Committee or to a body part. One thus needs to differentiate the two meanings of joint.

### Solution:

Since Joint as the colloquial name of The American Joint Distribution Committee always begins with a capital letter, case sensitivity can be used to force CQL to find only those instances where the first letter is capitalized.

```["(?-i)Joint"]```

Case sensitivity is enforced by means of (?-i). At the same time, the pattern above still matches Joint as a body part if it is at the beginning of a sentence.

In [61]:
query = '["(?-i)Joint"]'

In [62]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [63]:
print (response[0]['match_word'])

zone of Berlin . And then , already , the Joint Distribution Committee had set up receiving camps in Berlin . 


Print the number of results

In [64]:
len(response)

1437
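
The difference between the default and the case-sensitive search can be illustrated in Python. This sketch only emulates the behaviour with the `re.IGNORECASE` flag; it does not use the (?-i) syntax itself, and the token list is invented for illustration.

```python
import re

tokens = ["Joint", "joint", "JOINT"]

# Emulates the default, case-insensitive search.
case_insensitive = [t for t in tokens if re.fullmatch(r"joint", t, re.IGNORECASE)]

# Emulates the case-sensitive search: only the capitalized form matches.
case_sensitive = [t for t in tokens if re.fullmatch(r"Joint", t)]

print(case_insensitive)
print(case_sensitive)
```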

## 7. Search for possible word sequences: *mothers were crying, mother cried, mother started to cry*

### Research Problem:

The retrieval of moments when an interviewee is speaking about mothers crying is difficult. This can be expressed in a variety of ways and between mother and cry there can be multiple terms.

### Solution:

```[lemma="mother"] []{0,3} [lemma="cry"]```

This pattern matches sequences where a term whose dictionary form is mother is followed by another term whose dictionary form is cry, with at most three words in between. {0,3} indicates that between mother and cry there can be zero to three terms; [] indicates that each term in between can be any word.

In [65]:
query = '[lemma="mother"] []{0,3} [lemma="cry"]'

In [66]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [67]:
print (response[0]['match_word'])

just too big and too shocking to absorb . My mother started crying , but I still somehow refused to believe it . 


Print the number of results

In [68]:
len(response)

327
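
What the sequence pattern does can be sketched in plain Python over a lemmatized sentence. This is a simplified emulation and not how the corpus engine works internally; `find_sequences` is a hypothetical helper, and the lemmas assigned to the sample words are assumptions made for this sketch.

```python
def find_sequences(lemmas, first, last, max_gap=3):
    """Return (start, end) index pairs where `first` is followed by `last`
    with at most `max_gap` arbitrary tokens in between."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(lemmas) and lemmas[j] == last:
                hits.append((i, j))
    return hits

# Part of the first result above, with lemmas assumed for illustration.
words = ["My", "mother", "started", "crying", ",", "but", "I"]
lemmas = ["my", "mother", "start", "cry", ",", "but", "i"]

hits = find_sequences(lemmas, "mother", "cry")
spans = [" ".join(words[start:end + 1]) for start, end in hits]
print(spans)
```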

## 8. Matching sequences with similar meaning through grouping operation: *I will never forget* and *I will always remember*

### Research Problem:

A key moment in an interview is when a victim says the phrase *I will never forget*. But this can also be expressed as *I will always remember* or *I couldn’t forget*.

### Solution:

First, one needs to write two sequences: one in which I, a negation (never or n’t), and forget occur, and another in which I, always, and remember occur.
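
One possible way to write the two sequences and group them is sketched below. The query has not been run against the corpus, so it is a guess at the intended pattern rather than the original one. The gap of up to three words, the lowercase lemma value for I, and matching the negation through the word form n't are all assumptions, and `sequence` is a hypothetical helper.

```python
def sequence(*tokens, max_gap=3):
    """Join token patterns, allowing up to `max_gap` arbitrary words between them."""
    gap = " []{0,%d} " % max_gap
    return gap.join(tokens)

# Sequence 1: I ... never / n't ... forget (attribute values are assumptions).
never_forget = sequence('[lemma="i"]', '[lemma="never" | word="n\'t"]', '[lemma="forget"]')

# Sequence 2: I ... always ... remember.
always_remember = sequence('[lemma="i"]', '[lemma="always"]', '[lemma="remember"]')

# Group each sequence in parentheses and connect the groups with |.
grouped_query = "(%s) | (%s)" % (never_forget, always_remember)
print(grouped_query)
```

Each pair of parentheses groups one complete sequence, so the | operator chooses between whole sequences rather than between single words.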
