This repository contains corpus queries for investigating word order in Old and Middle English texts, and R scripts for producing figures based on the outputs of those corpus queries. The queries are designed to be used in conjunction with the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), the Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME2), and the Parsed Linguistic Atlas of Early Middle English (PLAEME). I distribute the query files, rather than the results of those queries, because not all the corpora have licenses which would permit distribution of the results.
Included in this repository are:
- Competition_mat_OE_full.c, Competition_sub_OE_full.c : coding query files for use with YCOE
- Competition_mat_PPCME_full.c, Competition_sub_PPCME_full.c : coding query files for use with PPCME2
- Competition_mat_PLAEME_full.c, Competition_sub_PLAEME_full.c, Competition_mat_PLAEME_full_v3.c, Competition_sub_PLAEME_full_v3.c, V2.c : coding query files for use with PLAEME
- WhRel.def : definitions file referred to by the coding queries
- OoosIds.q : generic query file for extracting codes and IDs from the output of coding queries
- CM_maps_final.R, Competition_maps_final.R, Competition_plots_final.R : R scripts for generating maps and figures.
- PLAEME_more_info.csv : metadata for PLAEME texts.
- CM_grammar_comparison.csv : CSV file created by manual triage and summarization of the outputs of the first six coding queries. It would be desirable, and possible in principle, to automate the manual triage, but this research was performed in lockdown over a flaky SSH connection, and automation wasn't practical under those circumstances.
.c and .q files should be run using CorpusSearch.
The workflow is as follows (assuming all files are in the same directory):
- Run coding queries on relevant corpora.
- Run OoosIds.q on .cod files output by coding queries.
- Perform minor edits on .cod.ooo files output by OoosIds.q: globally replace the `@` symbol with `:`, and globally delete token IDs while retaining text IDs. For YCOE and PPCME2 queries this involves globally deleting matches of the regex `,.*$` (a comma and everything after it on the line); for PLAEME queries, delete matches of `\..*$` (a literal period and everything after it on the line).
- Run R scripts (scripts assume that .cod.ooo files are accessible in the working directory).
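The minor edits to the .cod.ooo files can be made in any text editor, but they could also be scripted. The following is a minimal Python sketch, not part of this repository. The function names, the corpus labels used as dictionary keys, the UTF-8 encoding, and the choice to write to a separate output path are all assumptions made for illustration. Only the `@` to `:` replacement and the two regexes come from the workflow step itself.

```python
import re

# Regexes taken from the workflow step: everything from the first comma
# (YCOE, PPCME2) or the first period (PLAEME) to the end of the line is
# deleted, which removes the token ID and keeps the text ID.
# The dictionary keys are illustrative labels, not names used by the queries.
TOKEN_ID_SUFFIX = {
    "YCOE": re.compile(r",.*$"),
    "PPCME2": re.compile(r",.*$"),
    "PLAEME": re.compile(r"\..*$"),
}


def clean_line(line: str, corpus: str) -> str:
    """Apply the two edits to one line of a .cod.ooo file."""
    line = line.rstrip("\n")
    line = line.replace("@", ":")               # replace @ with :
    return TOKEN_ID_SUFFIX[corpus].sub("", line)  # drop the token ID


def clean_file(src: str, dst: str, corpus: str) -> None:
    """Write an edited copy of src to dst (hypothetical helper).

    UTF-8 is an assumption; adjust the encoding to match your files.
    """
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(clean_line(line, corpus) + "\n")
```

For example, given a made-up line such as `1:2:3@sometext,12.3` (not taken from any corpus), `clean_line(line, "YCOE")` returns `1:2:3:sometext`. Because the R scripts expect the edited .cod.ooo files in the working directory, the output of `clean_file` would need to end up under the file name the scripts read.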
NB the outputs of the first six coding queries are not called by any R script. I have included these queries because they are the basis for the summary counts in CM_grammar_comparison.csv.