# Riksdagen corpus v0.6.0

2023-03-01

This Colab notebook demonstrates how to quickly access data from the Riksdagen corpus.

First, we download and unzip the data. On your local machine, you can also download the file in your browser using the link in the command below.

In [1]:
!wget https://github.com/welfare-state-analytics/riksdagen-corpus/releases/latest/download/corpus.zip --show-progress
!7z x corpus.zip

--2023-06-13 09:03:51--  https://github.com/welfare-state-analytics/riksdagen-corpus/releases/latest/download/corpus.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/welfare-state-analytics/riksdagen-corpus/releases/download/v0.9.0/corpus.zip [following]
--2023-06-13 09:03:51--  https://github.com/welfare-state-analytics/riksdagen-corpus/releases/download/v0.9.0/corpus.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/346788931/952f74f7-2ce8-4871-b408-3d858e4b2ee8?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230613%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230613T090351Z&X-Amz-Expires=300&X-Amz-Signature=a1e959b2007e8d51e392d5751b0bdcf7dbe4b3fb14bcc616e99ab51667ba898
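
Where wget and 7z are not installed, the same two steps can be done with Python's standard library. The following is a minimal sketch, not part of the original notebook; the helper names download_corpus and unpack_corpus are made up for this example and do not come from pyriksdagen.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same URL as in the wget call above
CORPUS_URL = (
    "https://github.com/welfare-state-analytics/riksdagen-corpus"
    "/releases/latest/download/corpus.zip"
)


def download_corpus(url=CORPUS_URL, target="corpus.zip"):
    """Download the archive unless it is already present (hypothetical helper)."""
    target = Path(target)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def unpack_corpus(archive, destination="."):
    """Extract the archive and return the names of its members (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(destination)
        return zf.namelist()


# Usage (downloads a large file, so it is left commented out):
# archive = download_corpus()
# unpack_corpus(archive)
```

Skipping the download when corpus.zip already exists makes it cheap to re-run the cell.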

Now we can start working with the data. For that, we need a couple of Python modules. Let's install them and set things up.

In [2]:
%pip install pyriksdagen
from lxml import etree
import progressbar
from pyparlaclarin.read import paragraph_iterator, speeches_with_name
from pyriksdagen.utils import protocol_iterators

# We need a parser for reading in XML data
parser = etree.XMLParser(remove_blank_text=True)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyriksdagen
  Downloading pyriksdagen-0.9.0-py3-none-any.whl (41 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41.5/41.5 kB 3.5 MB/s eta 0:00:00
Collecting SPARQLWrapper (from pyriksdagen)
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl (28 kB)
Collecting base58 (from pyriksdagen)
  Downloading base58-2.1.1-py3-none-any.whl (5.6 kB)
Collecting dateparser (from pyriksdagen)
  Downloading dateparser-1.1.8-py2.py3-none-any.whl (293 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 293.8/293.8 kB 10.6 MB/s eta 0:00:00
Collecting kblab-client (from pyriksdagen)
  Downloading kblab_client-0.0.16a0-py3-none-any.whl (15 kB)
Collecting pyparlaclarin (from pyriksdagen)
  Downloading pyparlaclarin-0.7.2-py3-none-any.whl (6.4 kB)
Collecting unidecode (from pyriksdagen)
  Downloading Unidecode-1.3.6-p

Now we can go over some protocols from, say, 1955-1956.

In [3]:
protocols = list(protocol_iterators("corpus/protocols/", start=1955, end=1956))
protocols[:5]

['corpus/protocols/1955/prot-1955--ak--1.xml',
 'corpus/protocols/1955/prot-1955--ak--10.xml',
 'corpus/protocols/1955/prot-1955--ak--11.xml',
 'corpus/protocols/1955/prot-1955--ak--12.xml',
 'corpus/protocols/1955/prot-1955--ak--13.xml']
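
The paths above are in lexicographic order, so protocol 10 comes right after protocol 1. If numeric order matters, a natural sort key can be applied to the list. This is a small sketch using only the standard library; natural_key is a name chosen for this example, and it assumes the file names follow the pattern shown above.

```python
import re


def natural_key(path):
    """Split a path into text and number chunks so that numbers compare numerically."""
    chunks = re.split(r"([0-9]+)", path)
    # re.split with a capturing group puts the digit runs at the odd positions
    return [int(chunk) if i % 2 else chunk for i, chunk in enumerate(chunks)]


example = [
    "corpus/protocols/1955/prot-1955--ak--1.xml",
    "corpus/protocols/1955/prot-1955--ak--10.xml",
    "corpus/protocols/1955/prot-1955--ak--11.xml",
    "corpus/protocols/1955/prot-1955--ak--2.xml",
]
for path in sorted(example, key=natural_key):
    print(path)
```

For the list from the cell above, this would be protocols.sort(key=natural_key). Sorting changes the positions in the list, so any index into protocols then points to a different file than before.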

It is straightforward to print out all content, including speeches, dates, speaker introductions and topic titles.

In [4]:
# Select a single protocol, since each one contains a lot of text
protocol_in_question = protocols[12]
print(protocol_in_question)
root = etree.parse(protocol_in_question, parser).getroot()

corpus/protocols/1955/prot-1955--ak--20.xml


In [5]:
for elem in list(paragraph_iterator(root, output="lxml"))[:7]:
  print(" ".join(elem.itertext()))



            RIKSDAGENS BV PROTOKOLL
          

            Vd
          

            1955 ANDRA KAMMAREN Nr 20
          

            23—24 maj
          

            Debatter m. m.
          

            Tisdagen den 24 maj fm. Sid.
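
As the output shows, the text nodes keep the indentation and line breaks of the XML files. A minimal sketch for collapsing that whitespace before further processing, using only built-in string methods (normalize_whitespace is a name chosen for this example, not a pyriksdagen function):

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    return " ".join(text.split())


raw = "\n            1955 ANDRA KAMMAREN Nr 20\n          "
print(normalize_whitespace(raw))  # prints "1955 ANDRA KAMMAREN Nr 20"
```

In the loop above, this would mean printing normalize_whitespace(" ".join(elem.itertext())) instead of the raw joined text.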
          


Moreover, the metadata catalogues are also available. They are stored as CSV files and can be read with pandas in Python or opened in the spreadsheet program of your choice.

In [6]:
import pandas as pd
from pyriksdagen.db import filter_db
from pyriksdagen.utils import parse_date

mop = pd.read_csv("corpus/metadata/member_of_parliament.csv")
name = pd.read_csv("corpus/metadata/name.csv")
name = name[name["primary_name"]][["swerik_id", "name"]]
person = pd.read_csv("corpus/metadata/person.csv")

# We merge mandate periods of the MOPs with the names of the MOPs
mop = mop.merge(name, on="swerik_id", how="left")
# Let's also add person-level metadata, such as birth year and gender
mop = mop.merge(person, on="swerik_id", how="left")
mop

Unnamed: 0,swerik_id,start,end,district,role,name,born,dead,gender,riksdagen_id
0,Q117288697,1858,1860,Värmlands läns valkrets,andrakammarledamot,Birgitta Sjöqvist,1916-12-17,,woman,
1,Q104839950,1867,1867,Eskilstuna och Strängnäs valkrets,andrakammarledamot,Sven Palmgren,1821-08-30,1880-09-29,man,
2,Q5577408,1867,1867,Västra Götalands läns västra valkrets,förstakammarledamot,Gustaf Daniel Björck,1806-05-30,1888-01-03,man,
3,Q5618809,1867,1867,Torna härads valkrets,andrakammarledamot,Robert De la Gardie den äldre,1823-12-17,1916-05-19,man,
4,Q5630560,1867,1867,Värmlands läns valkrets,förstakammarledamot,Gustaf Ekman,1804-05-26,1876-05-03,man,
...,...,...,...,...,...,...,...,...,...,...
13084,Q18243853,2023-05-08,,Malmö kommuns valkrets,ledamot,Rasmus Ling,1984-03-23,,man,7.441392e+11
13085,Q19976148,,,,ledamot,Erik Georg Danielsson,1815-07-13,1881-06-19,man,
13086,Q5553916,,,,andrakammarledamot,Anders Andersson,1820-01-15,1894-10-10,man,
13087,Q98271639,,,,ledamot,Bengt Nording,,,man,
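
The start and end columns mix year-only values such as 1867 with full dates such as 2023-05-08, and some values are missing. Below is a rough sketch of how to check whether a mandate covers a given year despite that. The helpers year_of and active_in_year are invented for this example. They treat a missing start as unknown and a missing end as open-ended, which is an assumption about the data rather than documented behaviour.

```python
import math


def year_of(value):
    """Return the year of '1867' or '1941-12-05' style values, or None if missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    if not text:
        return None
    return int(text[:4])


def active_in_year(start, end, year):
    """Check whether a mandate from start to end covers the given year."""
    start_year = year_of(start)
    end_year = year_of(end)
    if start_year is None:
        return False  # assumption: without a start date we cannot tell
    if end_year is None:
        return start_year <= year  # assumption: no end date means still ongoing
    return start_year <= year <= end_year


print(active_in_year("1941-12-05", "1959-05-07", 1955))  # True
print(active_in_year("1867", "1867", 1955))  # False
```

With the merged table, a boolean list such as [active_in_year(s, e, 1955) for s, e in zip(mop["start"], mop["end"])] could then be used to select the matching rows.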


Let's find a specific person based on their name, for example Elis Håstad.

In [7]:
# Let's find Mr. Håstad
mop[mop.name.str.contains("Håstad")]

Unnamed: 0,swerik_id,start,end,district,role,name,born,dead,gender,riksdagen_id
4447,i-S9dMiL9yRpDaYPfVGjK1Gp,1941-12-05,1959-05-07,Stockholms kommuns valkrets,andrakammarledamot,Elis Håstad,1900-01-18,1959-05-07,man,


His identifier is i-S9dMiL9yRpDaYPfVGjK1Gp. Using that, we can find all his speeches. Let's do that and print out the first one.

In [8]:
# Elis Håstad (i-S9dMiL9yRpDaYPfVGjK1Gp)
hastad_speeches = []
for protocol in progressbar.progressbar(protocols):
  root = etree.parse(protocol, parser).getroot()
  protocol_speech = []
  for speech in speeches_with_name(root, name="i-S9dMiL9yRpDaYPfVGjK1Gp"):
    protocol_speech.append(speech)
  protocol_speech = "\n".join(protocol_speech).strip()
  if protocol_speech != "":
    hastad_speeches.append(protocol_speech)

print(hastad_speeches[0])

100% (130 of 130) |######################| Elapsed Time: 0:00:02 Time:  0:00:02


Herr talman! Jag skall först mycket öppet ge ett erkännande åt
              statsrådet för den upprustning av Stockholms högskola som hans
              proposition i år vittnar om. I detta sammanhang bör vi erinra
              oss det uttalande att man skall återuppta och fullfölja femårsplanen,
              som statsutskottet och dess andra avdelning i fjol gjorde och
              som nu Kungl. Maj:t har beaktat. Det är — jag upprepar det än
              en gång — ett avsevärt steg mot den tidigare utlovade upprustningen
              som årets riksdag nu går att ta.
            

              Å andra sidan kan man inte undgå att konstatera, att det på ett
              par punkter i årets proposition finns tendenser också till vad
              man skulle kunna kalla nedrustning, nämligen därigenom att ett
              par professurer eventuellt indrages.
            

              Den första av dessa är professuren i teaterhistoria, vår enda
              i landet, såsom fr