# An introduction to FlexiConc

## Konvens 2025

This worked example shows how to use FlexiConc from an interactive Jupyter notebook for organising and reading concordances. We use CLiC as a query backend so that no passwords, downloads or other extra steps are needed.

## Preparation

If you are running this notebook on your own machine:

Make sure that FlexiConc and its dependencies are installed, following the instructions in the tutorial slides (which will also install Python packages required by some of the algorithms included in the FlexiConc distribution). Don't forget to activate your virtual environment before starting the JupyterLab server.

If you are running this notebook in Google Colab:

Execute the code cell below. It uses `!` to run a shell command from the notebook because manual software installation is not supported in Colab. The `-U` upgrades FlexiConc if it is already installed (we frequently release minor or major upgrades), and the extensions in brackets make sure that all required dependencies are also installed.

In [None]:
!pip install -U 'flexiconc[notebooks,web,ICU,partitioning,annotation]'

We can now import FlexiConc and its convenience functions for Jupyter notebooks. Normally, you would access classes and functions via the package namespace (e.g. `import flexiconc as fc`) or use `from ... import *`; here we explicitly import the relevant classes and functions for your reference.

In [None]:
from flexiconc import Concordance
from flexiconc.utils.notebook_utils import add_node_ui, add_annotation_ui, show_kwic, show_analysis_tree

## Obtaining a concordance

FlexiConc includes helper functions for importing concordances from various sources, including the public [CLiC server](https://clic-fiction.com) (where you can also use FlexiConc via a Web UI), [CQPweb](https://corpora.linguistik.uni-erlangen.de/cqpweb/), [WMatrix](https://ucrel-wmatrix7.lancaster.ac.uk/), and [Sketch Engine](https://app.sketchengine.eu/).

Here, we want to analyse a concordance for _head_ in the 19C corpus of English 19th-century novels. Note that the query can specify several items, each of which can be either a single word or a whitespace-separated multiword unit. You can also obtain a combined concordance across multiple corpora or restrict the search e.g. to _quotes_ or _long suspensions_.

In [None]:
C = Concordance()
C.retrieve_from_clic(query=['head'], corpora=['corpus:19C'])

The concordance object `C` now holds the entire concordance data, as well as the analysis tree representing our concordance reading process. So far, the tree only consists of a root node. We can conveniently display the concordance in KWIC format with `show_kwic()`, here limited to the first 10 lines. We can also display metadata such as `text_id` and `chapter`.

In [None]:
show_kwic(C.root, n=10, metadata_columns=("text_id", "chapter"))

You can find out the size of the concordance from the `line_count` field of the root node.

In [None]:
C.root.line_count

## Concordance views

We can now obtain different perspectives on the concordance in the form of concordance views by applying algorithms. Each concordance view is a node in the analysis tree. Recall that we distinguish between **subset nodes** (created by a Selecting algorithm) and **arrangement nodes** (created by a combination of Ordering and/or Grouping algorithms).

As a very first step, we look at a random sample of concordance lines to get a better overview. In the FlexiConc approach, this is achieved by applying an algorithm – there are no shortcuts like a _random shuffle_ button (but you can easily define your own Python functions for this purpose!). While the required subset node can be added from Python code (with `C.root.add_subset_node()`), this is only for expert users – you need to know the name of the algorithm, all its parameters, and the correct format for specifying parameter values.

Fortunately, the FlexiConc notebook utilities also include a convenient interactive interface. You can select the desired algorithm(s), choose their parameters from menus, and then click _OK_ to apply the algorithms. This step adds a node to the analysis tree that can be captured in a Python variable (so that we can reference it for displaying the concordance view etc.).

In the interactive dialogue, choose _subset node_, then pick _Random Sample_ and set the _sample_size_ to the desired value (e.g. 20). The _seed_ is for ensuring that the random sample can be reproduced later. Don't forget to click the _OK_ button. Some Python code will show up below the button, which is the actual method call that was performed in the FlexiConc library.

In [None]:
n1 = add_node_ui(C.root)

You can now display the random sample of concordance lines.

In [None]:
show_kwic(n1, metadata_columns=("text_id", "chapter"))

Note that the interactive dialogue above remains active: you can always change parameters and click _OK_ again. This will overwrite variable `n1` with a new node reflecting the changed parameters. Try it now: modify the sample size, or pick a different seed. Then execute the `show_kwic()` cell again to see the new sample.
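To see why the _seed_ matters, here is a minimal sketch in plain Python that is independent of FlexiConc. The function `random_sample()` and the toy line IDs are hypothetical stand-ins, not the library's implementation; the point is only that a fixed seed yields the same sample every time.

```python
import random

def random_sample(line_ids, sample_size, seed):
    """Hypothetical helper: draw a reproducible random sample of line IDs."""
    rng = random.Random(seed)            # private generator, global state stays untouched
    k = min(sample_size, len(line_ids))  # never ask for more lines than exist
    return sorted(rng.sample(line_ids, k))  # sorted only to make the output easy to read

line_ids = list(range(1000))             # toy stand-in for the lines of a concordance
first = random_sample(line_ids, 20, seed=42)
again = random_sample(line_ids, 20, seed=42)
other = random_sample(line_ids, 20, seed=7)

print(first == again)  # same seed, same sample
print(first == other)  # a different seed almost certainly gives a different sample
```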

Note that each call adds a new node to your analysis tree, though only the last one is accessible via `n1`. You can always display the complete analysis tree to get an overview of what you've done.

In [None]:
show_analysis_tree(C)

You can always locate a node by its ID value (shown in brackets) and display it or discard it from the tree. Let's do this with the first subset node you've added, which should have ID `1`.

In [None]:
n = C.find_node_by_id(1)
n.line_count # number of matches

In [None]:
n.remove()
show_analysis_tree(C)
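If it helps to picture what happens here, the following toy model sketches a tree in which nodes can be looked up by ID and detached from their parent. The class `ToyNode` and its methods are purely illustrative assumptions and do not reflect how FlexiConc implements its analysis tree.

```python
class ToyNode:
    """Illustrative tree node (not FlexiConc's class)."""

    def __init__(self, node_id, label, parent=None):
        self.id = node_id
        self.label = label
        self.parent = parent
        self.children = []

    def add_child(self, node_id, label):
        child = ToyNode(node_id, label, parent=self)
        self.children.append(child)
        return child

    def find(self, node_id):
        """Depth-first search for a node with the given ID."""
        if self.id == node_id:
            return self
        for child in self.children:
            hit = child.find(node_id)
            if hit is not None:
                return hit
        return None

    def remove(self):
        """Detach this node (and its subtree) from the tree."""
        if self.parent is None:
            raise ValueError("cannot remove the root node")
        self.parent.children = [c for c in self.parent.children if c is not self]
        self.parent = None

root = ToyNode(0, "root")
root.add_child(1, "Random Sample (20 lines)")
root.add_child(2, "Random Sample (50 lines)")

root.find(1).remove()                   # discard the first sample
print([c.id for c in root.children])    # only node 2 is left below the root
```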

## Sorting

Concordance reading often involves sorting concordance lines (e.g. by left or right context). This is achieved in FlexiConc via an arrangement node. For convenience, we pick the appropriate algorithm with the interactive chooser again. Make sure to start from the root of the concordance rather than the random sample we've taken.

Select an _arrangement node_, click `+` to add an Ordering algorithm, and choose _Sort by Token-Level Attribute_. Set _sorting_scope_ to _left_ to sort by the left context of the node.

In [None]:
n2 = add_node_ui(C.root)

If you want to re-run the notebook in future to repeat the analysis, it is inconvenient to use the interactive dialogues because you will have to repeat all settings (and you'll get different results if you don't set exactly the same parameter values). For a reproducible notebook, execute the interactive dialogue once, then copy the method call it has generated into a new code cell, as shown below. Afterwards, you can comment out the `add_node_ui()` call or delete the code cell entirely.

In [None]:
n2 = C.root.add_arrangement_node(
    grouping=None,
    ordering=[
        ('Sort by Token-Level Attribute',
         {'tokens_attribute': 'word', 'sorting_scope': 'left', 'offset': 0,
          'case_sensitive': False, 'reverse': False, 'backwards': False, 'locale_str': 'en'})
    ])

Note how the tokens on which the sorting is based are highlighted in the concordance view.

In [None]:
show_kwic(n2, n=400)
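As a rough illustration of what sorting by left context means, the sketch below sorts a few made-up concordance lines in plain Python. It assumes the usual convention of comparing the word at L1 first, then L2, and so on, and it ignores case; the toy data and the helper `left_context_key()` are hypothetical, and the actual FlexiConc algorithm offers more options (e.g. locale-aware collation).

```python
# Each toy line is (left context tokens, node, right context tokens).
lines = [
    (["she", "shook", "her"], "head", ["and", "smiled"]),
    (["he", "raised", "his"], "head", ["slowly"]),
    (["at", "the"], "head", ["of", "the", "table"]),
    (["She", "bowed", "her"], "head", ["in", "silence"]),
]

def left_context_key(line):
    """Sort key: left context read outwards from the node (L1, L2, ...), ignoring case."""
    left, _node, _right = line
    return [token.casefold() for token in reversed(left)]

sorted_lines = sorted(lines, key=left_context_key)

for left, node, right in sorted_lines:
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Lines that share the same L1 word end up next to each other, which is what makes recurring patterns such as "possessive pronoun + _head_" easy to spot.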

Scrolling through the sorted concordance, you will notice that there is a lot of variation in the left context, except for one very frequent pattern with a possessive pronoun before _head_. We would now like to confirm this with a frequency count for the word in L1 position, also highlighting which possessive pronouns occur most frequently. This is done with a Partitioning algorithm.

## Grouping

Select an _arrangement node_, then click the left `+` to add a Grouping algorithm, choosing _Partition by Ngrams_. The desired L1 position is `-1` (i.e. a negative offset indicates a position to the left of the node). You might also want to add an Ordering algorithm to sort the concordance lines _within_ each group by right context (or a random sort to view a random sample of lines from the group).

In [None]:
n3 = add_node_ui(n2)

Note that `n=10` below shows at most 10 lines from each partition, which makes sense because we are mainly interested in the frequency counts and only need the actual concordance lines as illustrative examples. Possessive pronouns account for considerably more than half of the concordance lines, mainly with _his_, _her_, _my_, and _your_.

In [None]:
show_kwic(n3, n=10)
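The idea behind partitioning by the L1 word can be sketched in a few lines of plain Python. This is only a conceptual illustration on made-up data: the helper `partition_by_l1()` is hypothetical, and ordering the groups by decreasing size is an assumption made here for readability.

```python
from collections import defaultdict

# Each toy line is (left context tokens, node, right context tokens).
lines = [
    (["she", "shook", "her"], "head", ["sadly"]),
    (["he", "raised", "his"], "head", ["slowly"]),
    (["He", "bowed", "His"], "head", ["in", "silence"]),
    (["at", "the"], "head", ["of", "the", "table"]),
    (["and", "laid", "his"], "head", ["on", "the", "pillow"]),
]

def partition_by_l1(lines):
    """Group lines by the lowercased word at position -1 (L1), largest group first."""
    groups = defaultdict(list)
    for line in lines:
        left = line[0]
        key = left[-1].casefold() if left else ""
        groups[key].append(line)
    # Sort by decreasing group size, then alphabetically to break ties.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

partitions = partition_by_l1(lines)

for key, members in partitions:
    print(f"{key}: {len(members)} line(s)")
    for left, node, right in members[:2]:   # show at most 2 example lines per group
        print(f"    {' '.join(left)} [{node}] {' '.join(right)}")
```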

Let us view the analysis tree again. If you want, you can clean up some of the nodes that were introduced by trying different options.

In [None]:
show_analysis_tree(C)

For further analysis, let us zoom in on the pattern “possessive pronoun + _head_”. We can use the Selecting algorithm _Select by Token-Level String Attribute_. We skip the interactive dialogue this time and directly provide the method call generated by it, listing all possessive pronouns that come to mind.

In [None]:
poss = n3.add_subset_node(
    ('Select by Token-Level String Attribute',
     {'search_terms': ['his', 'her', 'my', 'your', 'our', 'their', 'its'],
      'tokens_attribute': 'word', 'offset': -1,
      'case_sensitive': False, 'regex': False, 'negative': False}))

In order to find our way around a large analysis tree, we can mark selected nodes specified by their ID. Here we mark the subset node that we just created.

In [None]:
show_analysis_tree(C, mark=poss.id)

Notice how the partitioning and ordering are still in effect because the subset node is a daughter of this arrangement node.

In [None]:
show_kwic(poss, n=10)
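Conceptually, this selection keeps exactly those lines whose token at a given offset from the node matches one of the search terms. The sketch below illustrates that idea on toy data; the function `select_by_token()` is a hypothetical stand-in whose parameter names loosely mirror the call above, not the library's implementation.

```python
POSSESSIVES = ["his", "her", "my", "your", "our", "their", "its"]

# Each toy line is (tokens, index of the node within tokens).
lines = [
    ("she shook her head sadly".split(), 3),
    ("at the head of the table".split(), 2),
    ("He raised His head".split(), 3),
    ("head over heels".split(), 0),
]

def select_by_token(lines, search_terms, offset, case_sensitive=False):
    """Keep lines whose token at node position + offset is one of the search terms."""
    norm = (lambda s: s) if case_sensitive else str.casefold
    terms = {norm(term) for term in search_terms}
    selected = []
    for tokens, node_index in lines:
        pos = node_index + offset
        # Guard against positions outside the line (negative indices would wrap around).
        if 0 <= pos < len(tokens) and norm(tokens[pos]) in terms:
            selected.append((tokens, node_index))
    return selected

subset = select_by_token(lines, POSSESSIVES, offset=-1)
for tokens, _ in subset:
    print(" ".join(tokens))
```

Because the filter only drops lines and never reorders them, any ordering inherited from the parent view is preserved in the subset.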

## Ranking

Further concordance reading of this subset suggests that _head_ might tend to co-occur with other body parts. We can explore this hypothesis with the help of the _KWIC Grouper Ranker_ algorithm. For convenience, we directly provide the method call with a longish list of body part nouns.

In [None]:
n4 = poss.add_arrangement_node(
    grouping=None,
    ordering=[
        ('KWIC Grouper Ranker',
         {'search_terms': ['shoulder', 'neck', 'cheek', 'hand', 'hands', 'arm', 'eyes', 'waist', 'knees', 'chest', 'face'],
          'tokens_attribute': 'word', 'mode': 'literal', 'case_sensitive': False,
          'window_start': -20, 'window_end': 20, 'include_node': False, 'count_types': False})
    ])

In [None]:
show_kwic(n4, n=100)
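One way to think about such a ranking is to count, for each line, how many search terms occur in a window around the node and then to show the lines with the most hits first. The sketch below implements that reading on toy data. It is an assumption about the general idea, not a description of the actual algorithm; `count_hits()` is hypothetical and its parameter names merely echo those in the call above.

```python
BODY_PARTS = ["shoulder", "neck", "cheek", "hand", "hands", "arm",
              "eyes", "waist", "knees", "chest", "face"]

# Each toy line is (tokens, index of the node within tokens).
lines = [
    ("he laid his head on her shoulder".split(), 3),
    ("she shook her head".split(), 3),
    ("with his head in his hands and his face hidden".split(), 2),
]

def count_hits(tokens, node_index, search_terms,
               window_start=-20, window_end=20, include_node=False):
    """Count tokens in the window around the node that match a search term (case-insensitive)."""
    terms = {term.casefold() for term in search_terms}
    hits = 0
    for offset in range(window_start, window_end + 1):
        if offset == 0 and not include_node:
            continue                      # skip the node itself
        pos = node_index + offset
        if 0 <= pos < len(tokens) and tokens[pos].casefold() in terms:
            hits += 1
    return hits

# Highest count first; sorted() is stable, so ties keep their original order.
ranked = sorted(lines, key=lambda line: -count_hits(line[0], line[1], BODY_PARTS))

for tokens, node_index in ranked:
    print(count_hits(tokens, node_index, BODY_PARTS), " ".join(tokens))
```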

## Semantic Clustering

If you have the _Sentence Transformers_ library and enough computing power and time to spare, you can also perform a semantic clustering of the concordance. First, we need to annotate the concordance object with sentence embeddings (for a specified context span). Note that when you run this annotation for the first time, it will download a large neural language model and cache it on your computer, which can take a considerable amount of time. You should see some progress messages as soon as the annotation process starts.

Select the _Annotate with Sentence Transformers_ algorithm and specify an L5…R5 window, i.e. start position `-5` and end position `5` (note that larger windows often fail to produce intuitive clusterings). Alternatively, you can simply execute the prepared method call in the code cell below.

In [None]:
add_annotation_ui(C)

In [None]:
C.add_annotation(
    ("Annotate with Sentence Transformers",
    {"token_attribute": "word", "window_start": -5, "window_end": 5}))
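To make the L5…R5 window concrete, the sketch below extracts such a span from a single tokenised line in plain Python. The helper `context_span()` is hypothetical, and it assumes that the node itself is part of the span and that the window is clipped at the edges of the available context; the text that is actually passed to the embedding model by FlexiConc may be constructed differently.

```python
def context_span(tokens, node_index, window_start=-5, window_end=5):
    """Return the text from window_start to window_end around the node (inclusive)."""
    lo = max(0, node_index + window_start)            # clip at the start of the line
    hi = min(len(tokens), node_index + window_end + 1)  # clip at the end of the line
    return " ".join(tokens[lo:hi])

tokens = ("and then the old man slowly shook his head at the boy "
          "and said nothing more to him").split()
node_index = tokens.index("head")

span = context_span(tokens, node_index)
print(span)
```

With the default window, the span covers at most 11 tokens: five to the left, the node, and five to the right.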

Now choose the Grouping algorithm _Flat Clustering by Embeddings_ and increase the number of clusters (_n_partitions_) to 20. Add a _Random Sort_ so that each cluster can be represented by a small set of randomly selected lines. Of course, if you are an experienced clusterer or clusteress you can also change the other parameters of the clustering algorithm.

In [None]:
n5 = add_node_ui(n4)

In [None]:
show_kwic(n5, n=20)

Take a closer look at the individual clusters. Can you spot meaningful patterns?

Starting from the clustered concordance view, you can zoom in on a selected set of clusters that you feel exhibit a common pattern. In order to do so, specify the relevant cluster IDs shown above by editing the code cell below. As an example, we select clusters #0, #4, and #14.

In [None]:
n6 = n5.add_subset_node(
    ('Manual Line Selection',
     {'groups': ['Cluster_0', 'Cluster_4', 'Cluster_14']}))

In [None]:
show_kwic(n6, n=20)

This is a concordance view we might want to bookmark.

In [None]:
n6.bookmarked = True
show_analysis_tree(C)