UppsalaNLP/SOU-corpus

SOU corpus

This repository contains cleaned and further processed versions of Swedish Government Official Reports (Statens offentliga utredningar, SOU). The documents are based on the HTML versions from Riksdagens öppna data and cover the years 1994 to 2020. The cleaning procedure is described in:

Luise Dürlich, Sebastian Reimann, Gustav Finnveden, Joakim Nivre and Sara Stymne. Cause and Effect in Governmental Reports: Two Data Sets for Causality Detection in Swedish. In Proceedings of the First Workshop on Natural Language Processing for Political Sciences. June 24, 2022. Marseilles, France.

html/

In contrast to the original HTML versions, the extracted files distinguish section headers and titles from the body text. Tables, lists and diagrams, as well as non-Swedish text, were removed whenever they could be identified as such. Each document was split into its summaries and the full report. Each filename consists of a code for the type of text and the document id. The text codes are:

  • ft: for full text
  • s: for standard Swedish summary
  • SEs: for simple Swedish summary
  • ENs: for English summary

The original HTML can be found by appending the document id and the .html extension to https://data.riksdagen.se/dokument/. For example, the file ft_H4B319.html corresponds to the original file at Riksdagen at https://data.riksdagen.se/dokument/H4B319.html.
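
The filename convention above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the repository: the helper names `parse_filename` and `original_url` are hypothetical, and the sketch assumes that the text code and the document id are always separated by the first underscore, as in the ft_H4B319.html example.

```python
from pathlib import PurePosixPath

# Text codes as listed in this README.
TEXT_CODES = {
    "ft": "full text",
    "s": "standard Swedish summary",
    "SEs": "simple Swedish summary",
    "ENs": "English summary",
}

BASE_URL = "https://data.riksdagen.se/dokument/"


def parse_filename(filename):
    """Split e.g. 'ft_H4B319.html' into ('ft', 'H4B319').

    Assumption: the code and the document id are separated by the
    first underscore in the filename.
    """
    name = PurePosixPath(filename).name      # drop any directory part
    stem, _, _ext = name.rpartition(".")     # drop the file extension
    code, sep, doc_id = stem.partition("_")
    if not sep or code not in TEXT_CODES or not doc_id:
        raise ValueError(f"unexpected filename: {filename!r}")
    return code, doc_id


def original_url(filename):
    """Build the URL of the original document at Riksdagen."""
    _, doc_id = parse_filename(filename)
    return f"{BASE_URL}{doc_id}.html"


print(original_url("ft_H4B319.html"))
```

All files derived from the same report share a document id, so the full text and every summary map to the same original URL.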

tagged/

This directory contains sentence-segmented and dependency-parsed versions of the SOU body texts. They are saved as CSV files with one processed sentence per row: the first field contains the corresponding raw section header or title, and the second field contains the processed sentence. The Swedish spaCy model was used for sentence segmentation and parsing. The sentence segmentation was complemented with rules for:

  • abbreviations like 'm.m.', 'osv.' and 'etc.' that can occur at the end of a sentence
  • the use of a colon with acronyms to express possessive or plural (e.g. 'EU:s huvudmål ...', 'SOU:er', etc.)
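
As a rough sketch of how the two-field layout described above could be consumed, the following Python snippet groups sentences under their section header. It rests on several assumptions that this README does not state: the files are comma-delimited, UTF-8 encoded and have no header row, and the second field is treated as an opaque string, whatever annotation it carries. The function name `sentences_by_header` is hypothetical.

```python
import csv
from collections import defaultdict


def sentences_by_header(lines):
    """Group sentences (second field) by section header (first field).

    `lines` is any iterable of CSV lines, for example an open file.
    Assumptions: comma delimiter, no header row.
    """
    grouped = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip empty or malformed rows
        header, sentence = row[0], row[1]
        grouped[header].append(sentence)
    return dict(grouped)


# Made-up rows that only illustrate the layout, not real corpus content.
sample = [
    'Sammanfattning,"EU:s huvudmål beskrivs här."',
    'Sammanfattning,"Detta gäller skatter, avgifter m.m."',
    'Inledning,"Tidigare SOU:er behandlas nedan."',
]
grouped = sentences_by_header(sample)
print(len(grouped["Sammanfattning"]))
```

To read an actual file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `sentences_by_header`, where `path` points to a file in tagged/.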
