# Loading concordances into FlexiConc

This notebook demonstrates how to load concordances into FlexiConc from various supported concordancing tools.

## Preparation

Make sure that FlexiConc and its dependencies are installed, following the instructions in the course slides (which will also install Python packages required by some of the algorithms included in the FlexiConc distribution). Don't forget to activate your virtual environment before starting the JupyterLab server.

The code cell below is only needed when running this notebook in Google Colab. It uses `!` to run a shell command from the notebook because manual software installation is not supported in Colab. The `-U` flag upgrades FlexiConc if it is already installed (we frequently release minor or major upgrades). We do not install any extensions, as we only want to demonstrate the concordance retrieval functions. For serious concordance reading, the additional dependencies should be installed as well.

In [None]:
!pip install -U flexiconc
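If you want the same notebook to run unchanged both locally and in Colab, the installation can be made conditional. The following is a minimal sketch, not part of FlexiConc: it assumes that the `google.colab` module is already loaded in a Colab kernel (a common idiom, but not a guarantee), and the helper name `running_in_colab` is made up for illustration.

```python
import subprocess
import sys


def running_in_colab():
    """Heuristic check: Colab kernels normally have google.colab loaded."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Same effect as `!pip install -U flexiconc`, but only executed in Colab.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-U", "flexiconc"]
    )
```

Outside Colab the cell does nothing, so a local virtual environment is left untouched.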

We can now import FlexiConc and its convenience functions for Jupyter notebooks. Most concordance retrieval functions are automatically available as methods of the `Concordance` object. Only `wmatrix` is a special case that needs to be imported separately.

In [None]:
from flexiconc import Concordance
from flexiconc.utils.notebook_utils import add_node_ui, add_annotation_ui, show_kwic, show_analysis_tree
from flexiconc.utils import wmatrix

Note that many of the code cells below require a password, access token, or other special provisions in order to work. It is recommended to focus on the approaches that you need in your work and/or concordancing tools you already have access to.
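Since several cells need credentials, it is worth keeping them out of the notebook itself. The sketch below is one possible approach, assuming you store secrets in environment variables; the helper `get_secret` and the variable name `SKETCHENGINE_API_KEY` are hypothetical and not defined by FlexiConc.

```python
import os
from getpass import getpass


def get_secret(env_var, prompt):
    """Return a secret from the environment, or ask for it interactively."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Fall back to a hidden prompt so the secret is never saved in the notebook.
    return getpass(prompt)


# Example use (the environment variable name is an assumption):
# api_key = get_secret("SKETCHENGINE_API_KEY", "Sketch Engine API key: ")
```

The returned string can then be passed to whichever argument expects the token or password.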

## CLiC

The easiest approach is to load concordance data from the public [CLiC server](https://clic-fiction.com), which is freely accessible without a user account.

As an example, we load a concordance for _eyes_ within long suspensions across both the 19C (19th century novels) and DNov (Charles Dickens) corpora. Read the method documentation for further information about the arguments and available options.

In [None]:
C = Concordance()
C.retrieve_from_clic(query=['eyes'], 
                     corpora=["corpus:19C", "corpus:DNov"], subset="longsus")

In [None]:
help(C.retrieve_from_clic)

Recall that you can get a glimpse of each concordance with `show_kwic()`.

In [None]:
show_kwic(C.root, n=10, metadata_columns=("text_id", "chapter"))

In the following examples, we will usually just display the number of concordance lines in order to demonstrate that the import was successful.

In [None]:
C.root.line_count

## Sketch Engine

If you have an account for the commercial [Sketch Engine](https://app.sketchengine.eu/) platform, you can load concordances in a similar way. SkE includes rich token-level annotation, but its support for line-level metadata is rather limited. You will be able to access both your own corpora (_user corpora_) and a wide range of pre-installed corpora in many languages.

As preparation you need to note down the full path of the relevant corpus, as well as generate an API access token. Both steps are illustrated in the course slides.

Here, we search for the phrase _fake news_ in the Trump Twitter Archive corpus. It is a user corpus of the account `SEvert`, to which the access token used below also belongs. Note that by the time you run this notebook, the access token has likely been invalidated and you will need to obtain your own access token.

In [None]:
C = Concordance()
C.retrieve_from_sketchengine(query='[lc="fake"] [lc="news"]', 
                             corpus="user/SEvert/tta", 
                             api_key="[YOUR API KEY]")

In [None]:
C.root.line_count

In [None]:
show_kwic(C.root, n=10)

## CQPweb

There is no direct interface to [CQPweb servers](https://corpora.linguistik.uni-erlangen.de/cqpweb/) yet (due to the lack of a fully functional API), but you can download concordance data from a CQPweb session and import it into FlexiConc. After running a corpus query, select the _Download …_ action and adjust the format options as explained in the course slides. Put the downloaded file (which should automatically be saved with extension `.txt`) in the same folder as this Jupyter notebook.
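Before importing, it can help to confirm that the export really sits next to the notebook and to glance at its first lines. The sketch below uses only the standard library; the helper `preview_export` is hypothetical, and the UTF-8 encoding is an assumption about the download (undecodable bytes are replaced rather than raising an error).

```python
from itertools import islice
from pathlib import Path


def preview_export(filename, n=5, encoding="utf-8"):
    """Return the first n lines of a downloaded export file."""
    path = Path(filename)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; is it in the same folder as the notebook?"
        )
    with path.open(encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example use with your own download (the filename is a placeholder):
# for line in preview_export("my_cqpweb_download.txt"):
#     print(line)
```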

A sample concordance download for _water and sanitation_ in the ParlSpeech UK corpus can be downloaded from GitHub.

In [None]:
!wget -nc https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt
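On systems without `wget`, a download that skips files which already exist can be sketched in pure Python. This is a rough equivalent of `wget -nc` under simple assumptions: the helper name `download_if_missing` is made up here, and an interrupted transfer may leave a partial file behind that you would have to delete by hand.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download took place, False if it was skipped.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    urlretrieve(url, str(dest))
    return True


# Example use with the sample file from this notebook:
# download_if_missing(
#     "https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt",
#     "CQPweb_WaterSanitation_ParlUK.txt",
# )
```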

In [None]:
C = Concordance()
C.load_from_cqpweb_export("CQPweb_WaterSanitation_ParlUK.txt")

In [None]:
C.root.line_count

Any metadata included in the download file are automatically imported into FlexiConc. The `URL` column provides a link to display extended context for a concordance line on the CQPweb server.

In [None]:
C.metadata.head()

In [None]:
C.metadata.URL[0]

In [None]:
show_kwic(C.root, n=10, metadata_columns=("Date", "Party"))

## WMatrix

[WMatrix](https://ucrel-wmatrix7.lancaster.ac.uk/) is a specialised online tool with two main purposes:

- You can upload text files and have them automatically compiled into a corpus annotated with part-of-speech tags, lemmata and semantic tags (_concepts_). Metadata can be encoded in the filenames.
- It then provides keyword analysis on the annotated corpus at the level of word forms, lemmata, POS tags, and concepts.

The concordance display for individual keywords is rather basic, so FlexiConc makes for an ideal companion software.

Connecting WMatrix to FlexiConc works differently than for the other tools. Rather than export each individual concordance separately, FlexiConc has to download the entire annotated corpus from WMatrix in its internal SQLite format. You can then create concordances for single words and multiword units. This is convenient because you will typically want to look at concordances for multiple keywords brought up by WMatrix.

In order to try the example below, you first have to copy the `LabourManifesto2005` corpus from the WMatrix library to your own user account. Then insert your login and password in the respective function arguments.

In [None]:
labour2005 = wmatrix.load(
    corpus_name="LabourManifesto2005",
    username="[USER]",
    password="[PASSWORD]",
    db_filename="labour2005.db")
labour2005

Creating a concordance is easy for a single word or multiword unit at wordform level. Keep in mind that the search is case-sensitive here, so it will often miss some instances (for example, at the beginning of a sentence).

In [None]:
C = labour2005.concordance_from_query('antisocial behaviour')

In [None]:
C.root.line_count

In [None]:
show_kwic(C.root)
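To see why a case-sensitive wordform search can miss instances, consider the following plain-Python illustration. It does not use FlexiConc or WMatrix at all; the example text is invented, and the regular-expression matching merely stands in for the general idea.

```python
import re

# Invented example text: the phrase occurs once capitalised, once in lower case.
text = (
    "Antisocial behaviour blights communities. "
    "We will tackle antisocial behaviour."
)
phrase = "antisocial behaviour"

case_sensitive = len(re.findall(re.escape(phrase), text))
case_insensitive = len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

print(case_sensitive)    # 1
print(case_insensitive)  # 2
```

The sentence-initial occurrence is only found when case is ignored.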

In order to search for lemmata or concepts, you need to specify a query in a vaguely CQP-like notation. Token-level annotations are accessed under the following names:

- `word`: literal word forms
- `lemma`: lemmata
- `pos`: POS tags
- `sem`: concepts (semantic tags)

You can find suitable values for your search through the keyword analysis functions in WMatrix. The `%c` flag carries out a case-insensitive search.

In [None]:
C = labour2005.concordance_from_query(r'[lemma="community" %c]')
C.root.line_count

The WMatrix corpus needs to be downloaded only once and will then be stored locally in the specified file. Next time you access this corpus, you can simply load it from the file.

In [None]:
labour2005 = wmatrix.load(db_filename="labour2005.db")
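The "download once, then reuse the local file" pattern can be wrapped in a small helper. The sketch below is not part of FlexiConc: `load_or_download` is a hypothetical name, and the two callables are placeholders that you would fill with the `wmatrix.load` calls shown in this notebook.

```python
from pathlib import Path


def load_or_download(db_filename, load_local, download):
    """Use the local database file if present, otherwise download it.

    load_local and download are callables that take the filename.
    """
    if Path(db_filename).is_file():
        return load_local(db_filename)
    return download(db_filename)


# Example use, mirroring the two calls from this notebook:
# labour2005 = load_or_download(
#     "labour2005.db",
#     load_local=lambda f: wmatrix.load(db_filename=f),
#     download=lambda f: wmatrix.load(
#         corpus_name="LabourManifesto2005",
#         username="[USER]",
#         password="[PASSWORD]",
#         db_filename=f,
#     ),
# )
```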

For your convenience, we have created a corpus `ESSLLI_Water_ParlUK` in the public WMatrix library, which includes all sentences containing the noun _water_ from the ParlSpeech UK corpus. This is still a large download of close to 1 GB, so we also provide a pre-processed version for use with FlexiConc.

Download the file `WMatrix_Water_ParlUK.db` and save it to the same directory as this notebook.

In [None]:
!wget -nc https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/WMatrix_Water_ParlUK.db

You can now load the pre-processed corpus into FlexiConc and continue exploring water discourses in the UK parliament, using the keyword analyses of WMatrix as an entry point into the discourse.

In [None]:
water = wmatrix.load(db_filename="WMatrix_Water_ParlUK.db")

In [None]:
C = water.concordance_from_query(r'[lemma="water"] [lemma="and"] [lemma="sanitation"]')
C.root.line_count
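If you want to try many keywords, writing the query strings by hand gets tedious. The helper below is a minimal sketch that assembles strings in the notation used in this notebook; `token_query` is a hypothetical name, and it assumes that the values contain no double quotes or other characters with special meaning in the query syntax.

```python
def token_query(attribute, values, ignore_case=False):
    """Build a query with one [attribute="value"] token per value."""
    flag = " %c" if ignore_case else ""
    return " ".join(f'[{attribute}="{value}"{flag}]' for value in values)


print(token_query("lemma", ["water", "and", "sanitation"]))
# [lemma="water"] [lemma="and"] [lemma="sanitation"]

print(token_query("lemma", ["community"], ignore_case=True))
# [lemma="community" %c]
```

The resulting string can be passed to `concordance_from_query()` in the same way as the hand-written queries.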

In [None]:
n1 = C.root.add_subset_node(
    ("Random Sample",
     {'sample_size': 20, 'seed': 42}))

In [None]:
show_kwic(n1, metadata_columns=["file.file"])
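The `seed` argument makes the random sample reproducible. The following standalone sketch illustrates the general idea with Python's `random` module; it is not FlexiConc's sampling implementation, and the helper `sample_line_ids` is invented for this illustration.

```python
import random


def sample_line_ids(line_ids, sample_size, seed):
    """Draw a reproducible sample of line IDs without replacement."""
    rng = random.Random(seed)  # private generator, independent of global state
    return sorted(rng.sample(list(line_ids), sample_size))


first = sample_line_ids(range(100), sample_size=20, seed=42)
second = sample_line_ids(range(100), sample_size=20, seed=42)

# The same seed yields the same 20 lines on every run.
print(first == second)  # True
```

Fixing the seed means that you and your collaborators read the same set of lines when re-running the notebook.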