# Loading concordances into FlexiConc

This notebook demonstrates how to load concordances into FlexiConc from various supported concordancing tools.

## Preparation

Make sure that FlexiConc and its dependencies are installed, following the instructions in the course slides (which will also install Python packages required by some of the algorithms included in the FlexiConc distribution). Don't forget to activate your virtual environment before starting the JupyterLab server.

The code cell below is only needed when running this notebook in Google Colab. It uses `!` to run a shell command from within the notebook, since Colab offers no other way to install software manually. The `-U` option upgrades FlexiConc if it is already installed (we frequently release minor or major upgrades). We do not install any extensions here because we only want to demonstrate the concordance retrieval functions. For serious concordance reading, the additional dependencies should be installed as well.

In [1]:
!pip install -U flexiconc

Collecting flexiconc
  Downloading flexiconc-0.1.19-py3-none-any.whl.metadata (6.4 kB)
Collecting anytree>=2.12.1 (from flexiconc)
  Downloading anytree-2.13.0-py3-none-any.whl.metadata (8.0 kB)
Collecting intervaltree>=3.0 (from flexiconc)
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Downloading flexiconc-0.1.19-py3-none-any.whl (121 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.3/121.3 kB 4.5 MB/s eta 0:00:00
Downloading anytree-2.13.0-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.1/45.1 kB 3.4 MB/s eta 0:00:00
Building wheels for collected packages: intervaltree
  Building wheel for intervaltree (setup.py) ... done
  Created wheel for intervaltree: filename=intervaltree-3.1.0-py2.py3-none-any.whl size=26098 sha256=95ef9228eecc81d979569d0077c384481bf22cad3758fd94e0eb3631ddce790a
  Stored in dire
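
Because the installation step is only needed in Colab, it can be guarded so that the same notebook also runs unchanged in a local JupyterLab session. The following is a minimal sketch, not part of FlexiConc: it assumes that the `google.colab` module is already loaded in a Colab kernel (a commonly used check), and the helper names `running_in_colab` and `ensure_flexiconc` are made up for this illustration.

```python
import subprocess
import sys


def running_in_colab() -> bool:
    """Return True if the google.colab module is loaded (assumed to hold in Colab kernels)."""
    return "google.colab" in sys.modules


def ensure_flexiconc() -> bool:
    """Install or upgrade FlexiConc with pip, but only inside Colab.

    Returns True if pip was invoked and False if the step was skipped.
    """
    if not running_in_colab():
        return False
    # Same effect as `!pip install -U flexiconc`, targeting the running interpreter
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "flexiconc"])
    return True


print("pip step executed:", ensure_flexiconc())
```

Outside Colab the function returns `False` without touching the environment, so a local virtual environment set up according to the course slides is left alone.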

We can now import FlexiConc and its convenience functions for Jupyter notebooks. Most concordance retrieval functions are automatically available as methods of the `Concordance` object. Only `wmatrix` is a special case that needs to be imported separately.

In [2]:
from flexiconc import Concordance
from flexiconc.utils.notebook_utils import add_node_ui, add_annotation_ui, show_kwic, show_analysis_tree
from flexiconc.utils import wmatrix

Note that many of the code cells below require a password, access token, or other special provisions in order to work. It is recommended to focus on the approaches that you need in your work and/or concordancing tools you already have access to.

## CLiC

The easiest approach is to load concordance data from the public [CLiC server](https://clic-fiction.com), which is freely accessible without a user account.

As an example, we load a concordance for _eyes_ within long suspensions across both the 19C (19th century novels) and DNov (Charles Dickens) corpora. Read the method documentation for further information about the arguments and available options.

In [None]:
C = Concordance()
C.retrieve_from_clic(query=['eyes'],
                     corpora=["corpus:19C", "corpus:DNov"], subset="longsus")

In [None]:
help(C.retrieve_from_clic)

Recall that you can get a glimpse of each concordance with `show_kwic()`.

In [None]:
show_kwic(C.root, n=10, metadata_columns=("text_id", "chapter"))

In the following examples, we will usually just display the number of concordance lines in order to demonstrate that the import was successful.

In [None]:
C.root.line_count

## Sketch Engine

If you have an account for the commercial [Sketch Engine](https://app.sketchengine.eu/) platform, you can load concordances in a similar way. SkE includes rich token-level annotation, but its support for line-level metadata is rather limited. You will be able to access your own corpora (_user corpora_) as well as a wide range of pre-installed corpora in many languages.

As preparation you need to note down the full path of the relevant corpus, as well as generate an API access token. Both steps are illustrated in the course slides.

Here, we search for the phrase _fake news_ in the Trump Twitter Archive corpus. It is a user corpus of the account `SEvert`, to which the access token used below also belongs. Note that if you run this notebook some time after the tutorial, the access token will likely have been invalidated, and you will need to obtain your own access token.

In [18]:
C = Concordance()
C.retrieve_from_sketchengine(query='[lc="fake"] [lc="news"]',
                             corpus="user/SEvert/tta",
                             api_key="66260be9038677cd68a2559ec1153f20")

In [19]:
C.root.line_count

968

In [20]:
show_kwic(C.root, n=10)

Line ID,Left Context,Node,Right Context
0,"by @CNN that I will be working on The Apprentice during my Presidency, even part time, are ridiculous & amp, untrue -",FAKE NEWS,"! A very interesting read. Unfortunately, so much is true. https://t.co/ER2BoM765M RT @TrumpInaugural: Counting"
1,' Trump Helps Lift Small Business Confidence to 12-Yr. High ' https://t. co / MhbABREhzt https://t.co/CWAvJ4fRdx,FAKE NEWS,- A TOTAL POLITICAL WITCH HUNT! RT @MichaelCohen212: I have never been to Prague in my life. #fakenews
2,"! I win an election easily, a great """" movement """" is verified, and crooked opponents try to belittle our victory with",FAKE NEWS,". A sorry state! Intelligence agencies should never have allowed this fake news to """" leak """" into the public. One last"
3,try to belittle our victory with FAKE NEWS. A sorry state! Intelligence agencies should never have allowed this,fake news,"to """" leak """" into the public. One last shot at me. Are we living in Nazi Germany? We had a great News Conference at Trump"
4,public. One last shot at me. Are we living in Nazi Germany? We had a great News Conference at Trump Tower today. A couple of,FAKE NEWS,"organizations were there but the people truly get what's going on """" @zhu _amy3: @realDonaldTrump It's Morning in"
5,courage. People will support you even more now. Buy L. L. Bean. @LBPerfectMaine. @CNN is in a total meltdown with their,FAKE NEWS,because their ratings are tanking since election and their credibility will soon be gone! Congrats to the Senate for
6,"afraid of being sued .... Totally made up facts by sleazebag political operatives, both Democrats and Republicans -",FAKE NEWS,"! Russia says nothing exists. Probably... released by """" Intelligence """" even knowing there is no proof, and never will"
7,"worse - just look at Syria (red line), Crimea, Ukraine and the build-up of Russian nukes. Not good! Was this the leaker of",Fake News,? Celebrate Martin Luther King Day and all of the many wonderful things that he stood for. Honor him for being the great
8,"from Ford, G.M., Lockheed & amp, others that jobs are coming back... to the U.S., but had nothing to do with TRUMP, is more",FAKE NEWS,". Ask top CEO's of those companies for real facts. Came back because of me! """" Bayer AG has pledged to add U.S. jobs and"
9,Congratulations to @FoxNews for being number one in inauguration ratings. They were many times higher than,FAKE NEWS,"@CNN - public is smart! If Chicago doesn't fix the horrible """" carnage """" going on, 228 shootings in 2017 with 42 killings"
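
Rather than pasting an access token directly into a notebook, as in the `retrieve_from_sketchengine` call above, the token can be read from the environment so that it does not end up in shared copies of the notebook. This is a minimal sketch using only the standard library; the helper `get_api_key` and the variable name `SKETCHENGINE_API_KEY` are assumptions made for this example and are not defined by FlexiConc or Sketch Engine.

```python
import os
from getpass import getpass


def get_api_key(env=os.environ, var="SKETCHENGINE_API_KEY", prompt=getpass):
    """Return the token stored in `var`, or ask for it without echoing the input."""
    key = env.get(var)
    if not key:
        # Fall back to an interactive prompt if the variable is unset or empty
        key = prompt(f"{var}: ")
    return key.strip()


# Demonstration with a fake environment, so that nothing is prompted here
print(get_api_key(env={"SKETCHENGINE_API_KEY": "demo-token"}))
```

The returned string can then be passed as the `api_key` argument of `retrieve_from_sketchengine`.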


## CQPweb

There is no direct interface to [CQPweb servers](https://corpora.linguistik.uni-erlangen.de/cqpweb/) yet (due to the lack of a fully functional API), but you can download concordance data from a CQPweb session and import it into FlexiConc. After running a corpus query, select the _Download …_ action and adjust the format options as explained in the course slides. Put the downloaded file (which should automatically be saved with the extension `.txt`) in the same folder as this Jupyter notebook.

A sample concordance download for _water and sanitation_ in the ParlSpeech UK corpus can be downloaded from GitHub.

In [33]:
!wget -nc https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt

--2025-09-09 09:50:13--  https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/reading-concordances/teaching/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt [following]
--2025-09-09 09:50:13--  https://raw.githubusercontent.com/reading-concordances/teaching/refs/heads/main/course/data/CQPweb_WaterSanitation_ParlUK.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 514812 (503K) [text/plain]
Saving to: ‘CQPweb_WaterSanitation_ParlUK.txt’


2025-09-09 09:50:13 (14.8 MB/s) - ‘

In [34]:
C = Concordance()
C.load_from_cqpweb_export("CQPweb_WaterSanitation_ParlUK.txt")

In [35]:
C.root.line_count

669

Any metadata included in the download file are automatically imported into FlexiConc. The `URL` column provides a link to display extended context for a concordance line on the CQPweb server.

In [36]:
C.metadata.head()

Unnamed: 0,line_id,Number of hit,Text ID,Date,Party,Year,URL,Matchbegin corpus position,Matchend corpus position
0,0,1,c_3909,1988-12-07,con,1988,https://corpora.linguistik.uni-erlangen.de/cqp...,887083,887085
1,1,2,c_78921,1990-01-29,con,1990,https://corpora.linguistik.uni-erlangen.de/cqp...,16138891,16138893
2,2,3,c_84408,1990-02-19,con,1990,https://corpora.linguistik.uni-erlangen.de/cqp...,17114608,17114610
3,3,4,c_101968,1990-05-02,lab,1990,https://corpora.linguistik.uni-erlangen.de/cqp...,20424398,20424400
4,4,5,c_113242,1990-06-26,con,1990,https://corpora.linguistik.uni-erlangen.de/cqp...,22609673,22609675


In [37]:
C.metadata.URL[0]

'https://corpora.linguistik.uni-erlangen.de/cqpweb/parlspeech_uk/context.php?qname=i42i3vii9qn&batch=0'

In [38]:
show_kwic(C.root, n=10, metadata_columns=("Date", "Party"))

Line ID,Date,Party,Left Context,Node,Right Context
0,1988-12-07,con,Decade but intends to carry on after the decade. Currently a million people throughout the world are getting improved,water and sanitation,"from projects supported by Water Aid, some complete and some still under way. The necessity of massively improving"
1,1990-01-29,con,"This covered training awards in Britain, and technical co-operation which included assistance with forestry, agriculture, education,",water and sanitation,"and roads. Is my right hon. Friend aware that, owing to the trade and transit dispute between"
2,1990-02-19,con,"programmes have a central place. Child health also benefits from our support for family planning, provision of clean",water and sanitation,", education, especially women education, and progammes to improve the status of women. Does my right hon"
3,1990-05-02,lab,Member for Torfaen (Mr. Murphy) spoke eloquently of the role of local authorities over the centuries in providing,water and sanitation,services and caring for the local environment generally. New clause 24 underlines their role as local environmental protection agencies
4,1990-06-26,con,all international environmental forums. The hon. Member for Cynon Valley also referred to water. I agree that,water and sanitation,are both critical. Ten years ago only 40 per cent. of the world population had access to a
5,1990-06-26,con,is a pressing need to accelerate the impetus worldwide during the 1990s. That is why we are funding numerous,water and sanitation,"projects in the developing countries. That is why, in 1988 alone, we had 86 on - going"
6,1990-11-08,con,"children, be it in the form of education, health care, the provision of nutritious food or clean",water and sanitation,. We can all unite around that cause. At a time when we are considering the possibility of having
7,1991-03-15,lab,"in Baghdad face a public health crisis of vast proportions because of what international health authorities call""grossly inadequate",water and sanitation,"services.""The Red Cross says that, unless Iraq immediately receives massive international relief, the city could"
8,1991-12-12,con,"the situation at first hand. Since September, we have given £ 1 million to UNICEF for medicines,",water and sanitation,"in the south, £ 78,000 to the Save the Children Fund for water and sanitation in the south,"
9,1991-12-12,con,"UNICEF for medicines, water and sanitation in the south, £ 78,000 to the Save the Children Fund for",water and sanitation,"in the south, and almost £ 700,000 to organisations that are working to help refugees from southern Iraq and"
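
Line-level metadata such as `Party` and `Date` are useful for a quick look at how the hits are distributed. The sketch below tabulates only the ten lines displayed above, with the values copied by hand into plain lists. If `C.metadata` behaves like a pandas DataFrame, as the `.head()` output suggests, the same counters could presumably be fed from `C.metadata["Party"]` and `C.metadata["Date"]` for the full concordance.

```python
from collections import Counter

# Values copied from the "Party" and "Date" columns of the ten lines shown above
parties = ["con", "con", "con", "lab", "con", "con", "con", "lab", "con", "con"]
dates = ["1988-12-07", "1990-01-29", "1990-02-19", "1990-05-02", "1990-06-26",
         "1990-06-26", "1990-11-08", "1991-03-15", "1991-12-12", "1991-12-12"]

by_party = Counter(parties)
by_year = Counter(d[:4] for d in dates)  # the first four characters are the year

print(by_party.most_common())    # [('con', 8), ('lab', 2)]
print(sorted(by_year.items()))   # [('1988', 1), ('1990', 6), ('1991', 3)]
```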


## WMatrix

[WMatrix](https://ucrel-wmatrix7.lancaster.ac.uk/) is a specialised online tool with two main purposes:

- You can upload text files and have them automatically compiled into a corpus annotated with part-of-speech tags, lemmata and semantic tags (_concepts_). Metadata can be encoded in the filenames.
- It then provides keyword analysis on the annotated corpus at the level of word forms, lemmata, POS tags, and concepts.

The concordance display for individual keywords is rather basic, so FlexiConc makes an ideal companion tool.

Connecting WMatrix to FlexiConc works differently than for the other tools. Rather than export each individual concordance separately, FlexiConc has to download the entire annotated corpus from WMatrix in its internal SQLite format. You can then create concordances for single words and multiword units. This is convenient because you will typically want to look at concordances for multiple keywords brought up by WMatrix.

In order to try the example below, you first have to copy the `LabourManifesto2005` corpus from the WMatrix library to your own user account. Then insert your login and password in the respective function arguments.

In [21]:
labour2005 = wmatrix.load(
    corpus_name="LabourManifesto2005",
    username="demo1@esslli.2025",
    password="u73ripee4y",
    db_filename="labour2005.db")
labour2005

labour2005.db: 22.2MB [00:01, 16.5MB/s]


✅ Download complete!


<Corpus: 25913 tokens | token attributes: [word, word_lowercase, pos, lemma, sem, file] | spans: s (912), file (1)>

Creating a concordance is easy for a single word or multiword unit at wordform level.

In [22]:
C = labour2005.concordance_from_query('antisocial behaviour')

In [23]:
C.root.line_count

14

In [24]:
show_kwic(C.root)

Line ID,Left Context,Node,Right Context
0,New powers to tackle,antisocial behaviour,"have been introduced, with nearly 4,000 AntiSocial Behaviour Orders issued so far and nearly 66,000 fixed penalty notices."
1,"New powers to tackle antisocial behaviour have been introduced, with nearly 4,000",AntiSocial Behaviour,"Orders issued so far and nearly 66,000 fixed penalty notices."
2,"But our security is threatened by major organised crime; volume crimes such as burglary and car theft, often linked to drug abuse; fear of violent crime; and",antisocial behaviour,.
3,We are giving the police and local councils the power to tackle,antisocial behaviour,; we will develop neighbourhood policing for every community and crack down on dr ug dealing and hard drug use to reduce volume crime; we are modernising our asylum and immigration system; and we will take the necessary measures to protect our country from internation al terrorism.
4,"We believe in being tough on crime and its causes so we will expand drugs testing and treatment, and tackle the conditions from lack of youth provision to irresponsible drinking that foster crime and",antisocial behaviour,.
5,"Not all problems need a 999 response, so a single phone number staffed by police, local councils and other local services will be available across the country to deal with",antisocial behaviour,and other nonemergency problems.
6,Empowering communities against,antisocial behaviour,People want communities where the decent lawabiding majority are in charge.
7,"The experience of almost 4,000",AntiSocial Behaviour,"Orders, nearly 66,000 Penalty Notices for Disorder, and the closure of over 150 crack houses shows that communities can fight back against crime.We are ready to go further."
8,"Parish Council wardens, like those working for local authorities, will be given the power to issue Penalty Notices for Disorder for noise, graffiti and throwing fireworks.Victims of",antisocial behaviour,will be able to give evidence anonymously.
9,But with rights must go responsibilities so we have provided tough new powers 'We are giving the police and local councils the power to tackle,antisocial behaviour,; we will develop neighbourhood policing for every community' for councils and the police to tackle the problem of unauthorised sites.


In order to search for lemmas or concepts, you need to specify a query in a CQP-like notation. Token-level annotations are accessed under the following names:

- `word`: literal word forms
- `lemma`: lemmata
- `pos`: POS tags
- `sem`: concepts = semantic tags

You can find suitable values for your search through the keyword analysis functions in WMatrix.
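
When looking at several keywords in turn, it can be convenient to assemble such queries programmatically. The following is a minimal sketch with a hypothetical helper `cqp_sequence` (not part of FlexiConc) that builds one bracketed token expression per value. It does not escape characters that the query language may treat specially, so it is only meant for plain values.

```python
def cqp_sequence(values, attribute="lemma"):
    """Build a CQP-like query matching the given values on consecutive tokens."""
    for v in values:
        if '"' in v:
            # A double quote would terminate the quoted value prematurely
            raise ValueError(f"value contains a double quote: {v!r}")
    return " ".join(f'[{attribute}="{v}"]' for v in values)


print(cqp_sequence(["water", "and", "sanitation"]))
# [lemma="water"] [lemma="and"] [lemma="sanitation"]
print(cqp_sequence(["community"]))
# [lemma="community"]
```

The resulting string can be passed to `concordance_from_query` in the same way as a query typed by hand.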

In [26]:
C = labour2005.concordance_from_query(r'[lemma="community"]')
C.root.line_count

65

The WMatrix corpus needs to be downloaded only once and will then be stored locally in the specified file. Next time you access this corpus, you can simply load it from the file.

In [27]:
labour2005 = wmatrix.load(db_filename="labour2005.db")
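
The two ways of calling `wmatrix.load` can be combined so that a notebook downloads the corpus on the first run and loads the local file afterwards. The sketch below only prepares the keyword arguments; the helper `wmatrix_load_args` is hypothetical, and it is an assumption that choosing between the two call forms shown in this notebook is all that is needed.

```python
from pathlib import Path


def wmatrix_load_args(db_filename, corpus_name=None, username=None, password=None):
    """Return keyword arguments for wmatrix.load().

    If the database file already exists, only db_filename is passed (local load);
    otherwise the corpus name and credentials for the download are included.
    """
    if Path(db_filename).exists():
        return {"db_filename": db_filename}
    missing = [name for name, value in
               [("corpus_name", corpus_name), ("username", username), ("password", password)]
               if value is None]
    if missing:
        raise ValueError(f"{db_filename} not found; also need: {', '.join(missing)}")
    return {"corpus_name": corpus_name, "username": username,
            "password": password, "db_filename": db_filename}


# Placeholder credentials; the keys printed depend on whether labour2005.db exists locally
args = wmatrix_load_args("labour2005.db", "LabourManifesto2005", "user@example.org", "secret")
print(sorted(args))
```

In the notebook, the result would be used as `labour2005 = wmatrix.load(**args)`.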

For your convenience, we have created a corpus `ESSLLI_Water_ParlUK` in the public WMatrix library, which includes all sentences containing the noun _water_ from the ParlSpeech UK corpus. This is still a large download of close to 1 GB, so we also provide a pre-processed version for use with FlexiConc.

Download the file `WMatrix_Water_ParlUK.db` and save it to the same directory as this notebook.

In [28]:
!wget -nc https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/WMatrix_Water_ParlUK.db

--2025-09-09 09:46:55--  https://github.com/reading-concordances/teaching/raw/refs/heads/main/course/data/WMatrix_Water_ParlUK.db
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/reading-concordances/teaching/refs/heads/main/course/data/WMatrix_Water_ParlUK.db [following]
--2025-09-09 09:46:55--  https://raw.githubusercontent.com/reading-concordances/teaching/refs/heads/main/course/data/WMatrix_Water_ParlUK.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 70701056 (67M) [application/octet-stream]
Saving to: ‘WMatrix_Water_ParlUK.db’


2025-09-09 09:46:57 (170 MB/s) - ‘WMatrix_Water_ParlUK.db’ s

You can now load the pre-processed corpus into FlexiConc and continue exploring water discourses in the UK parliament, using the keyword analyses of WMatrix as an entry point into the discourse.

In [29]:
water = wmatrix.load(db_filename="WMatrix_Water_ParlUK.db")

In [30]:
C = water.concordance_from_query(r'[lemma="water"] [lemma="and"] [lemma="sanitation"]')
C.root.line_count

670

In [31]:
n1 = C.root.add_subset_node(
    ("Random Sample",
     {'sample_size': 20, 'seed': 42}))

In [32]:
show_kwic(n1, metadata_columns=["file.file"])

Line ID,file.file,Left Context,Node,Right Context
25,Cons,Providing clean,water and sanitation,"is one of the most efficient and cost - effective ways of saving lives and reducing mortality, particularly infant mortality."
27,Cons,If relatively few developing countries see,water and sanitation,"as a priority for inclusion in their Poverty Reduction Strategy Papers, what then?"
30,Cons,of the population had access to clean,water and sanitation,", and 51 per cent."
32,Cons,I have recently observed in Bangladesh that it is possible to make progress on,water and sanitation,"with, for example, microcredit schemes, enabling the manufacture of sanitation facilities in small, rural communities."
89,Cons,"Unfortunately, it has been widely recognised, and DFID has been candid in admitting - I therefore hope that the Minister will take this criticism on the chin; I assure him that it is the only criticism that I shall make of his Department today - that just as the world was waking up to the importance of",water and sanitation,", DFID was, unfortunately, restructured and has been criticised for rather taking its eye off the ball."
95,Cons,The issue of,water and sanitation,"- or sanitation and water, as we prefer to put it - is central to the survival of people on a massive scale, and I shall give some indication of that."
104,Cons,"Surely,",water and sanitation,must be a big part of that.
114,Cons,On the reference to,water and sanitation,", however, I note that we debated the Committee report on water and sanitation in this Chamber on 29 April 2008."
142,Cons,"The results we want to deliver in Sudan are to help 1 million people to get enough food to eat; to enable 240,000 more children to go to primary school; to provide malaria prevention and treatment for 750,000 people; to give 800,000 people access to clean drinking",water and sanitation,"; to provide life - saving health and nutrition for to up to 10 million people; and to give 250,000 women better access to justice."
223,Cons,The evidence makes it clear that focusing aid money on delivering,water and sanitation,"gives value for money, because of the changes it brings about."
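
The `seed` argument of the "Random Sample" step presumably serves to make the sample reproducible: the same seed should always select the same lines. The general idea can be sketched with Python's standard library. This illustrates seeded sampling as such, not FlexiConc's own implementation, so the line IDs it selects will generally differ from those shown above.

```python
import random


def sample_line_ids(n_lines, sample_size, seed):
    """Draw a reproducible random sample of line IDs, returned in ascending order."""
    rng = random.Random(seed)  # private generator, independent of the global random state
    return sorted(rng.sample(range(n_lines), sample_size))


first = sample_line_ids(670, 20, seed=42)
second = sample_line_ids(670, 20, seed=42)
print(first == second)  # True: same seed, same sample
print(len(first))       # 20
```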


## Importing data from your own files
You can also use your own data stored as plaintext files with FlexiConc. For the purposes of this demonstration, we will download two novels by Lewis Carroll from Project Gutenberg and use them as sample data.

In [3]:
import requests, re
from bs4 import BeautifulSoup
from pathlib import Path

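def gutenberg_html_to_txt(url, outpath):
    """Download a Project Gutenberg HTML page and save its text content to outpath."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # stop on HTTP errors instead of saving an error page
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text()
    # collapse single newlines (soft wraps) into spaces, preserve paragraph breaks
    clean = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    Path(outpath).write_text(clean, encoding="utf-8")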

Path("data").mkdir(exist_ok=True)

# Download and process Alice in Wonderland and Through the Looking Glass from Gutenberg
gutenberg_html_to_txt("https://www.gutenberg.org/files/11/11-h/11-h.htm", "data/alice_in_wonderland.txt")
gutenberg_html_to_txt("https://www.gutenberg.org/files/12/12-h/12-h.htm", "data/through_the_looking_glass.txt")
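
To see what the regular expression in `gutenberg_html_to_txt` does, it can be tried on a short string. The two lookarounds ensure that a newline is replaced only if it is neither preceded nor followed by another newline, so blank lines between paragraphs survive. The sample text is made up for this illustration.

```python
import re

SOFT_WRAP = re.compile(r'(?<!\n)\n(?!\n)')  # same pattern as in gutenberg_html_to_txt

sample = "Alice was beginning\nto get very tired.\n\nSo she was considering\nin her own mind."
print(repr(SOFT_WRAP.sub(' ', sample)))
# 'Alice was beginning to get very tired.\n\nSo she was considering in her own mind.'

# Caveat: with Windows-style line endings the lookarounds see '\r' instead of '\n',
# so paragraph breaks are collapsed as well unless the text is normalised first.
crlf = "one\r\n\r\ntwo"
print(repr(SOFT_WRAP.sub(' ', crlf)))                          # 'one\r \r two'
print(repr(SOFT_WRAP.sub(' ', crlf.replace('\r\n', '\n'))))    # 'one\n\ntwo'
```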

Now, import `TextImport` from `flexiconc` and create an instance of it. It has no data until you call `load_files`, which takes the following arguments:

- `paths` is a list of files or directories. If you give a directory, all files are read recursively.  
- `db_name="alice.sqlite"` tells `TextImport` to save the tokenized corpus to that SQLite database file, so that it can be loaded later.  
- You can choose whether to use spaCy (`use_spacy=True`) or a simple regex tokenizer and sentencizer.  
- If you use spaCy, options `lemma`, `pos`, `tag` control what annotations are stored.

In [7]:
from flexiconc import TextImport
T = TextImport()
T.load_files(
    paths=["data"],
    shorten_paths=True,
    db_name="alice.sqlite",
    use_spacy=True,
    lemma=True,
    pos=False,
    tag=False
)

You can now work with this TextImport object as you did with the WMatrix corpora, for instance querying it and passing concordances to `flexiconc`. Here, we set context size to 20 tokens to the left and 20 tokens to the right of the node, and we do not require the context to be a single sentence (`limit_context_span=None` rather than `="s"`).

In [17]:
C = T.concordance_from_query('thing', context_size=(20,20), limit_context_span=None)
show_kwic(C.root, metadata_columns=["file.path"])

Line ID,file.path,Left Context,Node,Right Context
0,alice_in_wonderland.txt,"is like after the candle is blown out, for she could not remember ever having seen such a",thing,". After a while, finding that nothing more happened, she decided on going into the garden"
1,alice_in_wonderland.txt,"but it was too slippery; and when she had tired herself out with trying, the poor little",thing,"sat down and cried. “ Come, there ’s no use in crying like that! ” said"
2,alice_in_wonderland.txt,"it now, I suppose, by being drowned in my own tears! That will be a queer",thing,", to be sure! However, everything is queer to - day. ” Just then she"
3,alice_in_wonderland.txt,thought this must be the right way of speaking to a mouse: she had never done such a,thing,"before, but she remembered having seen in her brother ’s Latin Grammar, “ A mouse — of"
4,alice_in_wonderland.txt,’d take a fancy to cats if you could only see her. She is such a dear quiet,thing,", ” Alice went on, half to herself, as she swam lazily about in the pool,"
5,alice_in_wonderland.txt,"by the fire, licking her paws and washing her face — and she is such a nice soft",thing,"to nurse — and she ’s such a capital one for catching mice — oh, I beg your"
6,alice_in_wonderland.txt,"” said the Mouse with an important air, “ are you all ready? This is the driest",thing,"I know. Silence all round, if you please! ‘ William the Conqueror, whose cause was"
7,alice_in_wonderland.txt,"’ means. ” “ I know what ‘ it ’ means well enough, when I find a",thing,", ” said the Duck: “ it ’s generally a frog or a worm. The question is"
8,alice_in_wonderland.txt,"going to say, ” said the Dodo in an offended tone, “ was, that the best",thing,to get us dry would be a Caucus - race. ” “ What is a Caucus - race
9,alice_in_wonderland.txt,"to explain it is to do it. ” (And, as you might like to try the",thing,"yourself, some winter day, I will tell you how the Dodo managed it.) First"
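
What a context size of `(20, 20)` without a span limit means can be illustrated on a plain token list. The function below is a conceptual sketch only, not FlexiConc's implementation: `context_window` is a hypothetical helper that cuts out a fixed number of tokens on either side of the match and clips at the start and end of the text. It has no notion of sentence boundaries, which corresponds to the `limit_context_span=None` case.

```python
def context_window(tokens, start, end, context_size=(20, 20)):
    """Split tokens into (left, node, right) around the match tokens[start:end]."""
    n_left, n_right = context_size
    left = tokens[max(0, start - n_left):start]   # clipped at the beginning of the text
    node = tokens[start:end]
    right = tokens[end:end + n_right]             # slicing clips at the end automatically
    return left, node, right


tokens = "she had never done such a thing before , but she remembered".split()
i = tokens.index("thing")
left, node, right = context_window(tokens, i, i + 1, context_size=(3, 2))
print(" ".join(left), "|", " ".join(node), "|", " ".join(right))
# done such a | thing | before ,
```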
