rusecqp: An R package for corpus linguistic analysis with Corpus Workbench corpora
3.4.2018, Thilo Wiertz
The open source project IMS Corpus Workbench (CWB; Evert & Hardie 2011) provides a data model and a corpus query processor (CQP) for linguistic analysis. The existing rcqp package (Desgraupes & Loiseau 2018) lets users query Corpus Workbench corpora from R and access token-level information, but it does not provide high-level functions for corpus linguistic analysis. rusecqp wraps and builds on rcqp to provide functions such as frequency distributions, n-grams, and keyword and collocation analysis.
The package requires rcqp and a working installation of the IMS Corpus Workbench on the system. For installation instructions for both dependencies, see their respective websites. Before loading rcqp, set the path to the Corpus Workbench registry directory with Sys.setenv(CORPUS_REGISTRY = path_to_registry_dir). To check that all dependencies are met and corpora are available, run rcqp::cqi_list_corpora(), which should print a list of the corpora available on the system.
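The setup check described above could look roughly like the following sketch. The registry path is a hypothetical placeholder, and the requireNamespace guard is an addition for illustration so that the snippet fails gracefully when rcqp is missing.

```r
# Hypothetical registry location; replace with the path used on your system.
path_to_registry_dir <- "/usr/local/share/cwb/registry"

# The registry path has to be set before rcqp is loaded.
Sys.setenv(CORPUS_REGISTRY = path_to_registry_dir)

if (requireNamespace("rcqp", quietly = TRUE)) {
  # Prints the corpora found in the registry directory.
  print(rcqp::cqi_list_corpora())
} else {
  message("rcqp is not installed; see the rcqp website for instructions.")
}
```

If the printed list is empty or an error occurs, the registry path is the first thing to verify.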
To install rusecqp, install the devtools package and then rcqp. This will also install the dependency packages data.table and stringr as required:
Load the package with library(rusecqp).
Overview of functions
See the help pages of the respective functions for details on their usage. The following functions are currently implemented:
Accessing and subsetting corpora
list_corpora lists corpora available on the system.
get_corpus initializes a corpus for further processing.
subset_corpus creates a corpus subset based on corpus meta information (CWB structural attribute values).
Corpus based analysis
frequency_list calculates a frequency list of tokens for a corpus or corpus subset.
ngrams calculates a frequency list of n-grams for a corpus or corpus subset. Note that n-gram calculation is very memory intensive and may cause the computer to hang if performed on large corpora (e.g. above 100 million tokens).
keywords performs keyword analysis based on two frequency lists (e.g. as retrieved from
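To make the ideas behind frequency lists, n-grams and keyword analysis concrete, here is a minimal base R sketch on made-up token vectors. It does not use rusecqp and is not the package's implementation; the helper names (count_sorted, ngram_list, keyness_ll) are hypothetical, and the log-likelihood measure is only one common choice of keyness statistic, assumed here for illustration.

```r
# Toy token vectors standing in for two corpora (made-up data).
target    <- c("the", "green", "party", "wins", "the", "green", "vote")
reference <- c("the", "party", "loses", "the", "vote", "the", "party", "wins")

# Frequency list: named counts sorted by decreasing frequency.
count_sorted <- function(x) {
  tab <- table(x)
  sort(setNames(as.numeric(tab), names(tab)), decreasing = TRUE)
}

# n-grams: build n shifted copies of the token vector and paste them together.
# The n copies hint at why n-gram counting needs a lot of memory on large corpora.
ngram_list <- function(tokens, n = 2) {
  if (length(tokens) < n) return(count_sorted(character(0)))
  len <- length(tokens) - n + 1
  shifted <- lapply(seq_len(n), function(k) tokens[k:(k + len - 1)])
  count_sorted(do.call(paste, shifted))
}

# Keyness: log-likelihood score per word from two frequency lists.
keyness_ll <- function(freq_a, freq_b) {
  words <- union(names(freq_a), names(freq_b))
  a <- freq_a[words]; a[is.na(a)] <- 0   # words missing from a list count as 0
  b <- freq_b[words]; b[is.na(b)] <- 0
  n_a <- sum(a); n_b <- sum(b)
  exp_a <- n_a * (a + b) / (n_a + n_b)   # expected counts under equal rates
  exp_b <- n_b * (a + b) / (n_a + n_b)
  term <- function(obs, exp) ifelse(obs > 0, obs * log(obs / exp), 0)
  ll <- 2 * (term(a, exp_a) + term(b, exp_b))
  names(ll) <- words
  sort(ll, decreasing = TRUE)
}

freq_target    <- count_sorted(target)
freq_reference <- count_sorted(reference)
bigrams        <- ngram_list(target, n = 2)
kw             <- keyness_ll(freq_target, freq_reference)
print(kw)
```

In this toy data, "green" receives the highest score because it occurs twice in the target vector and never in the reference vector.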
query_corpus can be used to query a corpus using CQP syntax. The result can be passed to the following functions for analysis:
q_frequency_breakdown returns a frequency table of query matches.
q_distribution shows the number of matches in different categories defined by a structural attribute.
q_collocations analyzes collocations based on a window (context) around the query match.
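The window-based idea behind collocation analysis can be illustrated with a small base R sketch. It works on a made-up token vector, treats the positions of one word as a stand-in for CQP query matches, and only counts raw co-occurrences. The function name window_cooccurrences and its window argument are hypothetical and not part of rusecqp; a full analysis would also compare these counts with expected frequencies.

```r
# Toy corpus (made-up data).
tokens <- c("the", "green", "party", "wins", "the", "green", "vote",
            "and", "the", "party", "loses")

# Positions of the node word, standing in for query matches.
hits <- which(tokens == "party")

# Count tokens within `window` positions to the left and right of each hit.
window_cooccurrences <- function(tokens, hits, window = 2) {
  context <- unlist(lapply(hits, function(i) {
    from <- max(1, i - window)
    to   <- min(length(tokens), i + window)
    tokens[setdiff(seq(from, to), i)]   # exclude the match itself
  }))
  tab <- table(context)
  sort(setNames(as.numeric(tab), names(tab)), decreasing = TRUE)
}

co <- window_cooccurrences(tokens, hits, window = 2)
print(co)
```

Note that overlapping windows of nearby matches are counted separately in this sketch, so a token can contribute more than once.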
- Desgraupes, B. and S. Loiseau (2016): rcqp: Interface to the Corpus Query Protocol. – Available online at: https://cran.r-project.org/package=rcqp. – accessed online 22/1/2018
- Evert, S. and A. Hardie (2011): Twenty-First Century Corpus Workbench: Updating a Query Architecture for the New Millennium. – Proceedings of the Corpus Linguistics 2011 Conference. – University of Birmingham, UK
Thilo Wiertz, email@example.com