This repository has been archived by the owner on May 8, 2024. It is now read-only.

Setup a stats and quality plots in the README #291

Closed
9 of 13 tasks
MansMeg opened this issue May 8, 2023 · 6 comments

Comments

@MansMeg
Collaborator

MansMeg commented May 8, 2023

We want to be able to follow the quality of the corpus, along with reasonable stats, for each new release. These numbers are commonly used by people doing research and should be easy to update. Numbers for previous releases should also be stored, as is done now for the quality plot. We should probably store and plot the stats both in total and by year, since many researchers will select only the years that are relevant to them.

Hence, ideally, we would have one stats dashboard and one quality dashboard. Then, we could link to these figures from the project's homepage.

We should add the following plots to the README:

Corpus information
(This should probably just be a table).

  • The total number of persons in the MP catalogue
  • The total number of pages by document type
  • The size in GB of the corpus folder
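
The folder-size entry could be computed at release time with something like the sketch below. The function name `folder_size_gb` and the choice of decimal gigabytes (10**9 bytes) are assumptions for illustration, not something the project has settled on.

```python
from pathlib import Path


def folder_size_gb(root: str) -> float:
    """Sum the sizes of all regular files under root, in GB (10**9 bytes)."""
    # rglob("*") walks the tree recursively; directories are skipped.
    total_bytes = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total_bytes / 1e9
```

Calling `folder_size_gb("corpus")` (with a hypothetical folder name) would give the value to put in the table.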

Corpus Statistics - Figures

  • Number of documents by type (protocols, motions, government bills, etc.), year, and corpus version
  • Number of pages by type (protocols, motions, government bills, etc.), year, and corpus version
  • Number of MPs by year and chamber (with the actual number of seats as a line: Bob's plot, but more readable)
  • The number of speeches by year and corpus version

Corpus Quality

  • Speech-to-speaker mapping proportion by year and corpus version
  • Number of members of parliament by year and chamber as a ratio with respect to the actual number of seats (see paper)
  • The OCR quality (character and word error rate) by decade and document type (records). Note that this is a two-stage sample, and the estimate needs to take this into account; Liam knows more
  • The segmentation error by decade and document type (records)
  • The segmentation classification error by decade and document type (records)
  • The number and proportion of empty chairs and of chairs where multiple persons overlap

MansMeg changed the title from "Setup a stats and quality dashboards for the project" to "Setup a stats and quality plots in the README" on Sep 28, 2023
@MansMeg
Collaborator Author

MansMeg commented Oct 15, 2023

@BobBorges I have now updated the issue to make it clearer what I think we should include in the README.

@BobBorges
Collaborator

Excellent! Thanks for the direction.

@BobBorges
Collaborator

It seems like GitHub Markdown doesn't support any type of variable substitution or file transclusion, so the following strategies would not be possible in the README.

Variable substitution (Django template)

There are currently {{ number_of_MPs }} in the MP database.

or

File transclusion (LaTeX)

\subsection{Number of MPs}
\include{number_of_MPs.tex}

Do any of you have suggestions about automating dynamic updates to the README? @MansMeg @ninpnin Plots are not an issue: these can be added and updated the same way as the speaker mapping accuracy plot, as part of each release cycle. Text and tables are more troublesome. Two options come to mind:

  1. Text and tables as images. Problem: text shouldn't be an image as far as I'm concerned.
  2. Storing the main text parts of the README (sub)sections as separate files and compiling a single README by concatenating newly generated segments in the appropriate order as part of the release workflow, OR keeping the whole README as a single f-string, substituting the variables, and dumping the result to .md. Problems: (a) it's not a very DRY strategy, since the README text (or its subsections) also ends up somewhere else as input; (b) direct on-the-fly edits to the README are overwritten on each new release.
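
A minimal sketch of the single-template variant of option 2, using `string.Template` from the standard library rather than an f-string so that the template can live in its own file. The template text, the `render_readme` helper, and the value passed in are placeholders for illustration only.

```python
from string import Template

# Hypothetical template; in practice this would be read from a file
# kept under version control next to the release scripts.
README_TEMPLATE = Template(
    "# Corpus\n\n"
    "There are currently $number_of_MPs persons in the MP database.\n"
)


def render_readme(stats: dict) -> str:
    """Fill the template; substitute() raises KeyError if a value is missing."""
    return README_TEMPLATE.substitute(stats)


# Placeholder value, not a real corpus statistic.
readme_text = render_readme({"number_of_MPs": 1234})
```

Because `substitute()` fails loudly on a missing key, a release cannot silently ship a README with an unfilled variable. The drawback named in (b) still applies: the generated file is overwritten on every run.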

I'm not aware of a markdown parser that could handle reading an md file and updating targeted fragments (like etree for xml or orgparse for .org files). But maybe something like this exists.
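
Short of a full parser, one way to update targeted fragments is to wrap each generated section in HTML comment markers, which GitHub does not display in the rendered README, and replace only the text between them. The sketch below assumes a marker convention of `<!-- name:start -->` and `<!-- name:end -->`; both the convention and the `replace_block` helper are made up for this example.

```python
import re


def replace_block(markdown: str, name: str, new_body: str) -> str:
    """Replace the text between <!-- name:start --> and <!-- name:end -->."""
    pattern = re.compile(
        rf"(<!-- {re.escape(name)}:start -->\n).*?(<!-- {re.escape(name)}:end -->)",
        flags=re.DOTALL,
    )
    if not pattern.search(markdown):
        raise ValueError(f"markers for block {name!r} not found")
    # A function replacement avoids backslash processing of new_body.
    return pattern.sub(
        lambda m: f"{m.group(1)}{new_body}\n{m.group(2)}", markdown, count=1
    )
```

Everything outside the markers is left untouched, so hand edits to the rest of the README survive a release, and running the update twice with the same input gives the same file.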

I started working with strategy (2), but wanted to put the issue up for discussion before getting too deep into it.

@MansMeg
Collaborator Author

MansMeg commented Oct 16, 2023

In R, I can do this with knitr and the kable() function, which generates the table automatically from the data; see here:
https://bookdown.org/yihui/rmarkdown-cookbook/kable.html

But I'm not sure how best to do this in Python. @ninpnin probably knows this better than me. Maybe the Markdown library? It looks mature to me, but I'm unsure whether it solves the problem.
https://pypi.org/project/Markdown/
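
For the table part, a rough Python counterpart to kable() needs no extra dependency. The sketch below builds a pipe table from plain lists; the `markdown_table` helper is hypothetical. If pandas is already in use, `DataFrame.to_markdown()` produces similar output, though it relies on the `tabulate` package being installed.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a GitHub-flavoured Markdown pipe table."""

    def fmt(cells):
        return "| " + " | ".join(str(c) for c in cells) + " |"

    lines = [fmt(headers), fmt(["---"] * len(headers))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)
```

The returned string can be written into the README as-is, so the numbers stay text rather than an image.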
