Setup a stats and quality plots in the README #291

MansMeg · 2023-05-08T09:00:35Z

We want to be able to follow the quality of the corpus and reasonable stats for each new release of the corpus. This is commonly used by people doing research and should be easy to update. Also, old numbers for previous releases should be stored, as is done now with the quality plot. We should probably store and plot the stats both in total and by year, since many researchers will cut out the years that are relevant to them.

Hence, ideally, we would have one stats dashboard and one quality dashboard. Then, we could link to these figures from the project's homepage.

We should add the following plots to the README:

Corpus information
(This should probably just be a table).

The total number of persons in the MP catalogue
The total number of pages by document type
The size in GB of the corpus folder

Corpus Statistics - Figures

Number of documents by type (protocols, motions, government bills etc), year, and corpus version
Number of pages by type (protocols, motions, government bills etc), year, and corpus version
Number of MPs by year and chamber (with the actual number of seats as a line; Bob's plot, but more readable)
The number of speeches by year and corpus version

Corpus Quality

Speech-to-speaker mapping proportion by year and corpus version
Number of members of parliament by year and chamber as a ratio with respect to the actual number of seats (see paper)
The OCR quality (character and word error rate) by decade and document type (records). Note: this is a two-stage sample, and the estimate needs to take that into account; Liam knows more
The segmentation error by decade and document type (records)
The segmentation classification error by decade and document type (records)
The number and proportion of empty chairs and of chairs where multiple persons overlap

MansMeg · 2023-10-15T11:42:58Z

@BobBorges I have now updated the issue to make it clearer what I think we should include in the README.

BobBorges · 2023-10-16T09:57:46Z

Excellent! Thanks for the direction.

BobBorges · 2023-10-16T14:06:15Z

It seems like GitHub Markdown doesn't support any type of variable substitution or file transclusion, so the following strategies would not be possible in the README.

variable substitution (django template)

There are currently {{ number_of_MPs }} in the MP database.

or

File transclusion (latex)

\subsection{Number of MPs}
\include{number_of_MPs.tex}

Do any of you have suggestions about automating dynamic updates to the readme? @MansMeg @ninpnin Plots are not an issue -- these can be added / updated the same way as the speaker mapping accuracy plot as part of each release cycle. Text / tables are more troublesome. Two options that come to mind:

(1) Text and tables as images. Problem: text shouldn't be an image as far as I'm concerned.
(2) Storing the main text parts of the readme (sub)sections as separate files and compiling a single readme by concatenating newly generated segments in the appropriate order as part of the release workflow, OR keeping the whole readme as a single f-string, substituting the variables, and dumping the result to .md. Problems: (a) it's not a very DRY strategy, with the readme file (subsections) ending up somewhere as input; (b) direct on-the-fly edits to the readme are overwritten on a new release.

I'm not aware of a markdown parser that could handle reading an md file and updating targeted fragments (like etree for xml or orgparse for .org files). But maybe something like this exists.
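
One way to update targeted fragments without a dedicated parser is to wrap each generated section in a pair of HTML comments, which GitHub does not render, and replace only the text between them. The sketch below is an illustration under that assumption, not something taken from the repo: the `BEGIN`/`END` marker format, the function name `replace_fragment`, and the fragment name `number_of_MPs` are all hypothetical.

```python
import re


def replace_fragment(markdown: str, name: str, new_body: str) -> str:
    """Replace the text between '<!-- BEGIN name -->' and '<!-- END name -->'.

    The marker format is an assumption made for this sketch; everything
    outside the two markers is left exactly as it was.
    """
    pattern = re.compile(
        rf"(<!-- BEGIN {re.escape(name)} -->\n).*?(<!-- END {re.escape(name)} -->)",
        flags=re.DOTALL,
    )
    if not pattern.search(markdown):
        raise KeyError(f"no fragment named {name!r}")
    # A function is used as the replacement so that backslashes in
    # new_body are inserted literally instead of being read as escapes.
    return pattern.sub(
        lambda m: f"{m.group(1)}{new_body}\n{m.group(2)}", markdown, count=1
    )


readme = (
    "# Corpus\n"
    "<!-- BEGIN number_of_MPs -->\n"
    "stale text\n"
    "<!-- END number_of_MPs -->\n"
    "Hand-written text that must survive.\n"
)
# 'N' stands in for a value computed during the release workflow.
updated = replace_fragment(readme, "number_of_MPs", "There are currently N persons in the MP database.")
```

With this approach, manual edits outside the markers are kept on each release, which would address problem (b) of strategy (2).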

I started working with strategy (2), but wanted to put the issue up for discussion before getting too deep into it.
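
For reference, the single-template variant of strategy (2) can be sketched with the standard library's `string.Template`. This is a minimal illustration, not the code that ended up in the repo; the template text, the `render_readme` helper, and the `stats` dictionary are assumptions made for the example.

```python
from string import Template

# Hypothetical README template; $number_of_MPs is filled in at release time.
README_TEMPLATE = Template(
    "# Corpus\n"
    "\n"
    "There are currently $number_of_MPs persons in the MP database.\n"
)


def render_readme(stats: dict) -> str:
    """Substitute computed stats into the template.

    Template.substitute raises KeyError when a placeholder has no value,
    so a forgotten statistic fails loudly instead of leaving a gap.
    """
    return README_TEMPLATE.substitute(stats)
```

In a release workflow, the returned string would then be written to the README, for example with `pathlib.Path("README.md").write_text(...)`.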

MansMeg · 2023-10-16T14:47:25Z

In R, I can do this with knitr and the kable() function. Then it computes the thing automatically; see here:
https://bookdown.org/yihui/rmarkdown-cookbook/kable.html

But I'm not sure how best to do this in Python. @ninpnin probably knows this better than me. But maybe the Markdown library? It looks mature to me, but I'm unsure if it solves the problem.
https://pypi.org/project/Markdown/
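
As far as I know, the Markdown package converts Markdown to HTML rather than generating Markdown, so it would not replace `kable()` directly. A rough Python counterpart for simple tables can be written without extra dependencies. The sketch below is only an illustration; the function name `to_markdown_table` and the example column names are assumptions.

```python
def to_markdown_table(headers, rows):
    """Render a header list and a list of rows as a GitHub-style pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        # str() lets numeric stats be passed in without pre-formatting.
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Placeholder values only; real numbers would come from the corpus itself.
table = to_markdown_table(
    ["Document type", "Pages"],
    [["protocols", 0], ["motions", 0]],
)
```

The resulting string could then be dropped into the README together with the other generated text.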


MansMeg changed the title ~~Setup a stats and quality dashboards for the project~~ Setup a stats and quality plots in the README Sep 28, 2023

BobBorges mentioned this issue Nov 9, 2023

Adding dynamic README #413

Merged

BobBorges closed this as completed Nov 24, 2023
