# Extended CQL tutorial for *LTS*

The goal of this tutorial is to introduce users of *LTS* to the Corpus Query Language (CQL) for searching testimonies of Holocaust survivors. It follows the tutorials on the Sketch Engine and BlackLab websites. Please bear in mind that not all of the functionality discussed on those websites is available in *LTS*; this tutorial covers only the CQL features that work with *LTS*.

In [3]:
import inspect
import os
import sys

# Add the parent directory to the import path so that the Utilities package can be found.
current_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)
from Utilities import blacklab as bl

## Basic Syntax

To activate a CQL search on LTS, you need to use its special syntax. In CQL every word has to be between a pair of square brackets.

You can search directly for a given word:

```["murderers"]```

In [9]:
query = '["murderers"]'

In [10]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [11]:
print (response[0]['match_word'])

, such a high-cultured people became like barbarians , like murderers , all of them to listen to one man , 
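To look at more than the first hit, you can loop over the response. The sketch below assumes, based on the cells above, that `bl.search_blacklab` returns a list of dictionaries with a `match_word` key. The `sample_response` list is made-up stand-in data so that the snippet runs without a search server, and `first_matches` is a hypothetical helper, not part of *LTS*.

```python
def first_matches(response, n=3):
    """Return the 'match_word' text of the first n hits (fewer if there are fewer hits)."""
    return [hit['match_word'].strip() for hit in response[:n]]

# Stand-in data shaped like the response used above (assumed structure, invented text).
sample_response = [
    {'match_word': 'like barbarians , like murderers , all of them '},
    {'match_word': 'they were murderers '},
]

for text in first_matches(sample_response):
    print(text)
```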


### Attention: the word to be searched for has to be between double quotation marks 

The following query will not produce any result:

```['murderers']```
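If you build queries in Python, a small helper can make sure the double quotation marks are always in place. `word_query` below is a hypothetical helper, not part of *LTS* or its `Utilities` package; it only builds the query string.

```python
def word_query(word):
    """Wrap a word in double quotes and square brackets, as CQL requires."""
    return '["{}"]'.format(word)

query = word_query("murderers")
print(query)  # ["murderers"]
```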

## Attributes

Attributes make it possible to search for specific types of words. The following example finds all gerunds and present participles:

```[pos="VBG"]```

In [19]:
query = '[pos="VBG"]'

In [20]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [21]:
print (response[0]['match_word'])

INTERVIEWER 1 : We 're going to go back into the past . Where are you 


The attribute we used is *pos* and the value of the attribute is *VBG*. The general pattern is this:

```[attribute="value"]```

The corpus underlying *LTS* has three attributes:
- *pos*: part of speech or grammatical category of a word
- *lemma*: the dictionary form of a word
- *word*: the actual word form
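The general pattern can also be produced programmatically. The following is a minimal sketch; `attribute_query` is a hypothetical helper that only assembles the query string and rejects attribute names other than the three listed above.

```python
def attribute_query(attribute, value):
    """Build a single-token query of the form [attribute="value"]."""
    if attribute not in ("pos", "lemma", "word"):
        raise ValueError("unknown attribute: " + attribute)
    return '[{}="{}"]'.format(attribute, value)

print(attribute_query("pos", "VBG"))         # [pos="VBG"]
print(attribute_query("lemma", "murderer"))  # [lemma="murderer"]
```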

### The *pos* attribute

This attribute specifies the grammatical category of the word you are searching for. The categories are those defined by the Penn Treebank; they can be downloaded from here.

### The *lemma* attribute

This attribute searches by the dictionary form of a word. For instance, the following query finds all occurrences of both *murderer* and *murderers*, since the dictionary form of both is *murderer*.

```[lemma="murderer"]```

### The *word* attribute

By contrast, the following pattern finds only those occurrences where *murderer* is used in the plural form.

```[word="murderers"]```

### Attention: the default attribute is *word*

If you do not specify an attribute, the value is matched against the *word* attribute; the following query is therefore equivalent to the one above:

```["murderers"]```

## The combination of attributes

*Murders* might be a verb ("the guard murders someone") or a noun ("Murders often happened"). By combining the *pos* attribute with the *word* attribute you can tell these cases apart. The following query finds only the plural noun:

```[pos="NNS" & word="murders"]```

In [45]:
query = '[pos="NNS" & word="murders"]'

In [46]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [47]:
print (response[0]['match_word'])

the lagers in order to become -- there have been murders . One case , one guy , you know , 


Different attributes of the same word have to be placed within the same pair of square brackets. They can be combined with the following boolean operators:

- & (ampersand) = AND
- | (pipe) = OR
- ! (exclamation mark) = NOT
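Combining constraints inside one pair of square brackets can be sketched as follows. `combine` is a hypothetical helper that only joins the constraints with AND or OR; negation is left out because its exact placement is not shown in this tutorial.

```python
def combine(constraints, operator="&"):
    """Join attribute constraints such as 'pos="NNS"' inside one pair of square brackets."""
    if operator not in ("&", "|"):
        raise ValueError("operator must be & or |")
    return "[" + " {} ".format(operator).join(constraints) + "]"

noun_query = combine(['pos="NNS"', 'word="murders"'])
print(noun_query)  # [pos="NNS" & word="murders"]
```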

## Using regular-expression-like syntax to define attributes

When defining an attribute value, you can use a syntax similar to regular expressions.

The dot (.) is a wildcard that matches any single character. Suppose you want to match adverbs in both the comparative and the superlative form. They are tagged with the *pos* categories *RBR* and *RBS*, respectively. You can search for them with two separate queries:

```[pos="RBR"]```
```[pos="RBS"]```

Or you can combine the two queries into one:

```[pos="RB."]```

The number of wildcard characters can also be specified. Suppose you want to match all verbs in the corpus. The Penn Treebank distinguishes the following subcategories of verbs:

- *VB*: verb, base form
- *VBD*: verb, past tense
- *VBG*: verb, gerund or present participle
- *VBN*: verb, past participle
- *VBP*: verb, non-3rd person singular present
- *VBZ*: verb, 3rd person singular present

You can match all of them as follows:

```[pos="V.*"]```

The asterisk (*) means that *V* can be followed by any number of wildcard characters; hence the query matches *VB* and *VBD* as well as the other verb tags.

The following regular expression operators are supported: +, *, ?, {n}, {n,m}. Some examples are given below.
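To get a feel for how these patterns behave, you can try them on a list of Penn Treebank tags with Python's `re` module. This is only an approximation: *LTS* evaluates patterns with its own search engine, not with Python, although simple patterns like these should behave the same way. `re.fullmatch` is used on the assumption that an attribute value has to match the whole tag rather than a part of it; `matching_tags` is a hypothetical helper.

```python
import re

tags = ["RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "NN", "NNS"]

def matching_tags(pattern):
    """Return the tags that the pattern matches in full."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]

print(matching_tags("RB."))  # ['RBR', 'RBS']
print(matching_tags("V.*"))  # ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
```

Note that `RB.` does not match the plain tag *RB*, because the dot requires exactly one further character.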

## Escaping special characters

Suppose you want to find dots in testimonies, i.e. you want to use the dot in its literal sense and not as a wildcard character. To do this, you need to escape the dot with a backslash (\).

```[word="Mr\."]```

In [52]:
# Raw string, so that Python keeps the backslash in front of the dot as written
query = r'[word="Mr\."]'

In [53]:
response = bl.search_blacklab(query,window=10,lemma=False)

Print the first result

In [54]:
print (response[0]['match_word'])

about religion , nationalities . We better be careful about Mr. So-and-so , because he is of German descent . And 
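When the search word comes from user input, the escaping can be automated. The sketch below uses Python's `re.escape`, which escapes the characters that are special in Python's own regular expressions; the set of special characters in *LTS* may differ, so treat this as an assumption that holds for simple cases such as the dot. `literal_word_query` is a hypothetical helper.

```python
import re

def literal_word_query(text):
    """Build a [word="..."] query in which regex special characters are escaped."""
    return '[word="{}"]'.format(re.escape(text))

query = literal_word_query("Mr.")
print(query)  # [word="Mr\."]
```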


## Case sensitivity

Matching is currently case-insensitive. The following query will therefore match both *CRYING* and *crying*.

```[word="crying"]```

To enforce case sensitivity, you need to begin the attribute value with "(?-i)".

```[word="(?-i)crying"]```
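A query builder can expose this switch as a parameter. The following is a minimal sketch; `cased_word_query` is a hypothetical helper that only prepends the `(?-i)` prefix described above when case-sensitive matching is requested.

```python
def cased_word_query(word, case_sensitive=False):
    """Build a [word="..."] query, optionally forcing case-sensitive matching."""
    prefix = "(?-i)" if case_sensitive else ""
    return '[word="{}{}"]'.format(prefix, word)

print(cased_word_query("crying"))                       # [word="crying"]
print(cased_word_query("crying", case_sensitive=True))  # [word="(?-i)crying"]
```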

## Some further examples of the use of attributes

Find all words beginning with *anti*

```[lemma="anti.*"]```

Find all words ending with *ism*

```[lemma=".*ism"]```

Find both *labour* and *labor* by making u optional

```[lemma="labou?r"]```

Find both *casa* and *cassa* by using a regex range operator (one or two occurrences of *s*)

```[lemma="cas{1,2}a"]```

Find both *specialize* and *specialise* by using a regex or (|) operator

```[lemma="speciali(s|z)e"]```
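Some of the patterns above can be tried out on a small word list with Python's `re` module. This is a rough illustration only: the word list is made up, Python's regular expressions are not the engine behind *LTS*, and `re.fullmatch` is used on the assumption that a pattern has to match the whole lemma.

```python
import re

lemmas = ["antisemitism", "anti-aircraft", "communism", "labour", "labor",
          "specialize", "specialise", "special"]

patterns = ["anti.*", ".*ism", "labou?r", "speciali(s|z)e"]

# Map each pattern to the lemmas it matches in full.
results = {p: [w for w in lemmas if re.fullmatch(p, w)] for p in patterns}

for pattern, words in results.items():
    print(pattern, "->", words)
```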

In [115]:
query = '[word="kill"]'

In [116]:
response = bl.search_blacklab(query,window=0,lemma=False)

Print the second result

In [117]:
print (response[1]['match_word'])

kill 


## Sequence matching

With CQL you can also match a sequence of words. The following query finds the sequence *big man*:

``` ["big"] ["man"] ```

### Sequences with optional in-between words

You might want to match both *she was very nice* and *she was nice*. You can make *very* an optional element of the sequence with the ? operator:

``` ["she"] ["was"] ["very"]? ["nice"] ```

In [137]:
query = '["she"] ["was"] ["very"]? ["nice"]'

In [138]:
response = bl.search_blacklab(query,window=0,lemma=False)

Print the first result

In [140]:
print (response[0]['match_word'])

She was nice 
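Sequence queries like the ones above can be assembled from a list of words. The following is a minimal sketch; `sequence_query` is a hypothetical helper that only builds the query string and marks the words named in `optional` with the ? operator.

```python
def sequence_query(words, optional=()):
    """Build a sequence query; words listed in `optional` get the ? operator."""
    tokens = []
    for word in words:
        token = '["{}"]'.format(word)
        if word in optional:
            token += "?"
        tokens.append(token)
    return " ".join(tokens)

print(sequence_query(["big", "man"]))
# ["big"] ["man"]
print(sequence_query(["she", "was", "very", "nice"], optional=["very"]))
# ["she"] ["was"] ["very"]? ["nice"]
```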


### Sequences with distance between two words

Suppose you want to find moments when survivors recall the experience of their mother crying. In this case *mother* is not necessarily followed directly by *crying*; there can be words between them (*my mother was really really crying*). To handle this, you can define a sequence that starts with *mother*, ends with *cry*, and allows a number of other words in between.

```[lemma="mother"] []{0,5} [lemma="cry"] ```

In the query above, the empty pair of square brackets matches any word, and the regex range operator after it declares that there can be at least 0 and at most 5 words between *mother* and *cry*.
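What such a distance query is meant to find can be illustrated in plain Python on a list of lemmas. This is a sketch of the idea, not of how *LTS* evaluates the query; the lemma list is a made-up example and `spans_within_gap` is a hypothetical helper.

```python
def spans_within_gap(lemmas, first, second, max_gap=5):
    """Return (start, end) index pairs where `first` is followed by `second`
    with at most `max_gap` other tokens in between."""
    spans = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        # The second word may sit 1 to max_gap + 1 positions after the first.
        for j in range(i + 1, min(i + max_gap + 2, len(lemmas))):
            if lemmas[j] == second:
                spans.append((i, j))
    return spans

# Lemmatised form of "my mother was really really crying"
lemmas = ["my", "mother", "be", "really", "really", "cry"]
print(spans_within_gap(lemmas, "mother", "cry"))             # [(1, 5)]
print(spans_within_gap(lemmas, "mother", "cry", max_gap=2))  # []
```

With a maximum gap of 2 the three words in between are too many, so nothing is found.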

Suppose you want to find the adjectives that are used with *Jew*. You can express this as a CQL query:

```[word="an" | word="the" ] [pos="JJ"]+ [lemma="Jew"]```

The plus sign after [pos="JJ"] means that the adjective should occur one or more times (similarly, * means "zero or more times", and ? means "zero or one time").

### Grouping sequences of words

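By placing a sequence of words between parentheses, you can form groups and apply operators to the entire group. The following pattern, for instance, finds the sequence *the average Jew , the small Jew ,*. The empty square brackets at the end of the group match any single token (here the comma), and {2,3} requires the whole group to occur two or three times in a row.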

```([word="an" | word="the" ] [pos="JJ"]+ [lemma="Jew"] []){2,3}```

In [149]:
query = '([word="an" | word="the" ] [pos="JJ"]+ [lemma="Jew"] []){2,3}'
response = bl.search_blacklab(query,window=0,lemma=False)
print (response[10]['match_word'])

the average Jew , the small Jew , 
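Wrapping an existing sequence in a repeated group can be sketched as follows. `repeat_group` is a hypothetical helper that only builds the query string; it reproduces the query used in the cell above.

```python
def repeat_group(sequence, min_times, max_times):
    """Wrap a sequence in parentheses and require it to repeat min to max times."""
    if not 0 <= min_times <= max_times:
        raise ValueError("need 0 <= min_times <= max_times")
    return "({}){{{},{}}}".format(sequence, min_times, max_times)

adjective_phrase = '[word="an" | word="the" ] [pos="JJ"]+ [lemma="Jew"] []'
query = repeat_group(adjective_phrase, 2, 3)
print(query)
# ([word="an" | word="the" ] [pos="JJ"]+ [lemma="Jew"] []){2,3}
```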


### Relating words to one another

Suppose you want to find sequences such as *step by step*, in which the same word occurs on both sides of *by*. You need to define a sequence in which the first and the last elements are the same:

```A:[] "by" B:[] :: A.word = B.word```

In [154]:
query = 'A:[] "by" B:[] :: A.word = B.word'
response = bl.search_blacklab(query,window=0,lemma=False)
print (response[11]['match_word'])

one by one 
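The constraint `A.word = B.word` can be illustrated in plain Python: look at every three-token window and keep it when the middle token is *by* and the outer tokens are equal. This is a sketch of the idea on a made-up token list, not of how *LTS* evaluates the query; `same_word_around` is a hypothetical helper and compares tokens exactly, without case folding.

```python
def same_word_around(tokens, middle="by"):
    """Return all three-token sequences 'X middle X' found in the token list."""
    matches = []
    for i in range(len(tokens) - 2):
        a, mid, b = tokens[i], tokens[i + 1], tokens[i + 2]
        if mid == middle and a == b:
            matches.append(" ".join((a, mid, b)))
    return matches

tokens = "they came one by one and went step by step past the gate by us".split()
print(same_word_around(tokens))  # ['one by one', 'step by step']
```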


## Finding entire sentences

CQL can also find entire sentences based on various criteria. You can define the content of a sentence by declaring what it contains. For example, the following query finds sentences in which *baker* occurs.

```<s/> containing "baker"```

In [158]:
query = '<s/> containing "baker"'
response = bl.search_blacklab(query,window=0,lemma=False)
print (response[11]['match_word'])

INTERVIEWER : You were a baker at that time ? 


You can also anchor a query to the beginning or the end of a sentence by using \<s\> or \<\/s\>, respectively. For instance, the following query finds every *no* that begins a sentence.

```<s> "no"```

In [160]:
query = '<s> "no"'
response = bl.search_blacklab(query,window=0,lemma=False)
print (response[11]['match_word'])

No 
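The two kinds of sentence query in this section can be mimicked in plain Python on tokenised sentences. This is only an illustration of what the queries select: the first sentence is taken from the output above, the other two are made up, and `containing` and `starting_with` are hypothetical helpers that compare tokens case-insensitively.

```python
sentences = [
    ["You", "were", "a", "baker", "at", "that", "time", "?"],
    ["No", ",", "I", "was", "a", "tailor", "."],
    ["I", "do", "not", "know", "."],
]

def containing(sentences, word):
    """Return the sentences in which the word occurs as a token."""
    return [s for s in sentences if word.lower() in (t.lower() for t in s)]

def starting_with(sentences, word):
    """Return the sentences whose first token is the word."""
    return [s for s in sentences if s and s[0].lower() == word.lower()]

print(containing(sentences, "baker"))
print(starting_with(sentences, "no"))
```

Because whole tokens are compared, *not* and *know* in the third sentence do not count as *no*.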
