# How to download and read the Solvency 2 legislation

In our first NLP project we will download, clean and read the Delegated Acts of the Solvency 2 legislation in all European languages.

In [1]:
import os
import re
import requests
import fitz  # PyMuPDF, used below to extract the text from the PDFs

We use the following 23 languages of the European Union:
Bulgarian (BG),
Spanish (ES),
Czech (CS),
Danish (DA),
German (DE),
Estonian (ET),
Greek (EL),
English (EN),
French (FR),
Croatian (HR),
Italian (IT),
Latvian (LV),
Lithuanian (LT),
Hungarian (HU),
Maltese (MT),
Dutch (NL),
Polish (PL),
Portuguese (PT),
Romanian (RO),
Slovak (SK),
Slovenian (SL),
Finnish (FI),
Swedish (SV).

In [2]:
languages = ['BG','ES','CS','DA','DE','ET','EL',
             'EN','FR','HR','IT','LV','LT','HU',
             'MT','NL','PL','PT','RO','SK','SL',
             'FI','SV']

The URLs of the Delegated Acts of Solvency 2 are constructed for these languages; only the language code in the URL changes.

In [3]:
urls = ['https://eur-lex.europa.eu/legal-content/' + lang +
        '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN' 
        for lang in languages]

The following for loop retrieves the PDFs of the Delegated Acts from the EUR-Lex website of the European Union and stores them in da_path. Files that are already present in da_path are not downloaded again.

In [4]:
da_path = '/10_central_data/legislation/'

for language, url in zip(languages, urls):

    print("Retrieving " + language + ' from ' + url)

    filename = 'Solvency II Delegated Acts - ' + language + '.pdf'

    # download only the files that are not yet on disk
    if not os.path.isfile(os.path.join(da_path, filename)):

        r = requests.get(url)
        # stop on HTTP errors instead of saving an error page as a pdf
        r.raise_for_status()

        with open(os.path.join(da_path, filename), 'wb') as f:
            f.write(r.content)

Retrieving BG from https://eur-lex.europa.eu/legal-content/BG/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving ES from https://eur-lex.europa.eu/legal-content/ES/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving CS from https://eur-lex.europa.eu/legal-content/CS/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving DA from https://eur-lex.europa.eu/legal-content/DA/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving DE from https://eur-lex.europa.eu/legal-content/DE/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving ET from https://eur-lex.europa.eu/legal-content/ET/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving EL from https://eur-lex.europa.eu/legal-content/EL/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving EN from https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving FR from https://eur-lex.europa.eu/legal-content/FR/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN
Retrieving HR from https://eur-lex.europa.eu/legal-content/HR/TXT/PDF/?uri=OJ:L:20
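
A download can fail quietly, for example when the server answers with an HTML error page that is then saved under a .pdf name. The following is a minimal sketch of a more defensive download step, not the code used in this notebook. The helper names looks_like_pdf and download_pdf and the 60 second timeout are assumptions made for illustration.

```python
import os
import requests


def looks_like_pdf(content: bytes) -> bool:
    # PDF files start with the magic bytes "%PDF"
    return content[:4] == b"%PDF"


def download_pdf(url, target, timeout=60):
    """Download url to target unless target exists; return True if a file was written."""
    if os.path.isfile(target):
        return False
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx and 5xx answers
    if not looks_like_pdf(response.content):
        raise ValueError("Response from " + url + " does not look like a PDF")
    with open(target, "wb") as f:
        f.write(response.content)
    return True
```

Checking the first bytes is only a cheap sanity test; it does not prove that the file is a complete and readable PDF.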

# Data cleaning

If you look at the PDFs, you see that each page has a header with the publication date, the page number, the name of the Official Journal and the language code. These headers must be removed before the articles in the text can be accessed.

In [5]:
DA_dict = dict({
                'BG': 'Официален вестник на Европейския съюз',
                'CS': 'Úřední věstník Evropské unie',
                'DA': 'Den Europæiske Unions Tidende',
                'DE': 'Amtsblatt der Europäischen Union',
                'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',
                'EN': 'Official Journal of the European Union',
                'ES': 'Diario Oficial de la Unión Europea',
                'ET': 'Euroopa Liidu Teataja',           
                'FI': 'Euroopan unionin virallinen lehti',
                'FR': "Journal officiel de l'Union européenne",
                'HR': 'Službeni list Europske unije',         
                'HU': 'Az Európai Unió Hivatalos Lapja',      
                'IT': "Gazzetta ufficiale dell'Unione europea",
                'LT': 'Europos Sąjungos oficialusis leidinys',
                'LV': 'Eiropas Savienības Oficiālais Vēstnesis',
                'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',
                'NL': 'Publicatieblad van de Europese Unie',  
                'PL': 'Dziennik Urzędowy Unii Europejskiej',  
                'PT': 'Jornal Oficial da União Europeia',     
                'RO': 'Jurnalul Oficial al Uniunii Europene', 
                'SK': 'Úradný vestník Európskej únie',        
                'SL': 'Uradni list Evropske unije',            
                'SV': 'Europeiska unionens officiella tidning'})

The following code reads the PDFs, deletes the headers from all pages and saves the clean text to a .txt file. If the .txt file of a language already exists, the text is loaded from that file instead.

In [6]:
DA = dict()

files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]    

print("Reading language ", end='')

for language in languages:

    print(language + " ", end='')

    if ("Delegated_Acts_" + language + ".txt") not in files:

        # reading pages from pdf file
        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')
        # get_text is the current PyMuPDF name; older releases call it getText
        da_pages = [page.get_text("text") for page in da_pdf]
        da_pdf.close()

        # deleting page headers: date, page number, journal name, language code
        header = (r"17\.1\.2015\s+L\s+\d+/\d+\s+"
                  + DA_dict[language].replace(' ', r'\s+')
                  + r"\s+" + language + r"\s+")
        da_pages = [re.sub(header, '', page) for page in da_pages]
        DA[language] = ''.join(da_pages)

        # some preliminary cleaning (soft hyphens) -> should be more
        DA[language] = DA[language].replace('\xad ', '')

        # saving txt file
        with open(da_path + "Delegated_Acts_" + language + ".txt", "wb") as da_txt:
            da_txt.write(DA[language].encode('utf-8'))

    else:

        # loading txt file
        with open(da_path + "Delegated_Acts_" + language + ".txt", "rb") as da_txt:
            DA[language] = da_txt.read().decode('utf-8')

Reading language BG ES CS DA DE ET EL EN FR HR IT LV LT HU MT NL PL PT RO SK SL FI SV 
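
To see what the header regex does, it helps to run it on a single page. The sketch below applies the same idea to a made-up page string. The sample page, its page number and the helper names header_pattern and clean_page are assumptions for illustration; the real pages may order the header fields differently.

```python
import re


def header_pattern(journal_name, language, date="17.1.2015"):
    # escape the literal parts and allow any whitespace between the words
    words = [re.escape(word) for word in journal_name.split()]
    return re.compile(
        re.escape(date) + r"\s+L\s+\d+/\d+\s+"
        + r"\s+".join(words) + r"\s+" + language + r"\s+"
    )


def clean_page(page, journal_name, language):
    # remove the page header, then rejoin words split by a soft hyphen
    page = header_pattern(journal_name, language).sub("", page)
    return page.replace("\xad ", "")


# hypothetical page text, not taken from the real pdf
sample_page = (
    "17.1.2015 \nL 12/7 \nOfficial Journal of the European Union \nEN \n\n"
    "Article 1\nDefini\xad tions\n"
)
cleaned = clean_page(sample_page, "Official Journal of the European Union", "EN")
print(repr(cleaned))
```

Escaping the date and the journal name keeps characters such as the dots in the date from being read as regex wildcards.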

# Retrieve the text within articles

Retrieving the text within articles is not straightforward. In English we have 'Article 1 some text', i.e. the word Article is put before the number. But some European languages put the word after the number, and there are two languages, HU and LV, that also put a dot between the number and the word for article. To be able to read the text within the articles we need to know this ordering (and of course we need the word for article in every language). For some languages we also store the words for the structural headings (title, chapter, section, subsection), because such a heading can appear between two articles.

In [7]:
art_dict= dict({
                'BG': ['Член',      'pre'],
                'CS': ['Článek',    'pre'],
                'DA': ['Artikel',   'pre'],
                'DE': ['Artikel',   'pre', 'TITEL|KAPITEL|ABSCHNITT|Unterabschnitt'],
                'EL': ['Άρθρο',     'pre'],
                'EN': ['Article',   'pre', 'TITLE|CHAPTER|SECTION|Subsection'],
                'ES': ['Artículo',  'pre'],
                'ET': ['Artikkel',  'pre'],
                'FI': ['artikla',   'post'],
                'FR': ['Article',   'pre', 'TITRE|CHAPITRE|SECTION|Sous-section'],
                'HR': ['Članak',    'pre'],
                'HU': ['cikk',      'postdot'],
                'IT': ['Articolo',  'pre'],
                'LT': ['straipsnis','post'],
                'LV': ['pants',     'postdot'],
                'MT': ['Artikolu',  'pre'],
                'NL': ['Artikel',   'pre', 'TITEL|HOOFDSTUK|AFDELING|Onderafdeling'],
                'PL': ['Artykuł',   'pre'],
                'PT': ['Artigo',    'pre'],
                'RO': ['Articolul', 'pre'],
                'SK': ['Článok',    'pre'],
                'SL': ['Člen',      'pre'],
                'SV': ['Artikel',   'pre']})

Next we can define a regex to select the text within an article.

In [8]:
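def article_regex(language, num):
    art_id = art_dict[language][0]
    order = art_dict[language][1]
    # structural headings are only defined for some languages
    headings = art_dict[language][2] if len(art_dict[language]) > 2 else None
    if order == 'pre':
        # optional group that skips a heading between this article and the next
        skip = r'((\s' + headings + r').*)?' if headings else ''
        string = (art_id + r'\s(' + str(num) + r')\s\n(.*?)(\n.*?)\n'
                  + skip + art_id + r'\s' + str(num+1))
    elif order == 'post':
        string = str(num) + r'\s(' + art_id + r')\s(.*?)' + str(num+1) + ' ' + art_id
    elif order == 'postdot':
        # the dot after the number is escaped so that it only matches a literal dot
        string = str(num) + r'\.\s(' + art_id + r')\s(.*?)' + str(num+1) + r'\. ' + art_id
    return re.compile(string, re.DOTALL)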

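def retrieve_article(language, num):
    # groups 1-3 are number, title and body, as produced by the 'pre' pattern;
    # the 'post' and 'postdot' patterns capture different groups and need their own handling
    art_re = article_regex(language, num)
    art_text = art_re.search(DA[language])
    if art_text is None:
        raise ValueError('Article ' + str(num) + ' not found for language ' + language)
    art_num = int(art_text[1])
    art_title = ' '.join(art_text[2].split())
    art_body = art_text[3]
    if art_body[0:3] == '\n1.':
        # if the article starts with '1.' then it has numbered paragraphs
        paragraph_re = re.compile(r'\n(\d+)\. (.*?)(?=(\n(\d+)\.)|$)', re.DOTALL)
        art_paragraphs = [(int(p[0]), p[1]) for p in paragraph_re.findall(art_body)]
    else:
        # paragraph number 0 marks an article without numbered paragraphs
        art_paragraphs = [(0, ' '.join(art_body.split()))]
    return (art_num, art_title, art_paragraphs)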
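
The regex is easier to follow on a small example. The sketch below applies the 'pre' pattern, without the optional heading group, to a made-up text. The toy text and the function name split_article are assumptions for illustration, and unlike the function above this version also normalises the whitespace inside numbered paragraphs.

```python
import re

# toy text that imitates the assumed layout of the cleaned English file
toy_text = (
    "Article 1 \nSubject matter \nThis toy article has no numbered paragraphs. \n"
    "Article 2 \nToy title \n1. First paragraph. \n2. Second paragraph. \n"
    "Article 3 \nLast toy article \n"
)


def split_article(text, word, num):
    """Return (number, title, paragraphs) for article num in text."""
    # the article ends where the heading of the next article begins
    pattern = re.compile(
        word + r"\s(" + str(num) + r")\s\n(.*?)(\n.*?)\n" + word + r"\s" + str(num + 1),
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        raise ValueError("article " + str(num) + " not found")
    title = " ".join(match[2].split())
    body = match[3]
    if body.startswith("\n1."):
        # numbered paragraphs: split at each line that starts with 'n.'
        paragraph_re = re.compile(r"\n(\d+)\. (.*?)(?=\n\d+\.|$)", re.DOTALL)
        paragraphs = [(int(n), " ".join(t.split())) for n, t in paragraph_re.findall(body)]
    else:
        paragraphs = [(0, " ".join(body.split()))]
    return int(match[1]), title, paragraphs


print(split_article(toy_text, "Article", 1))
print(split_article(toy_text, "Article", 2))
```

Because the pattern needs the heading of the next article as its end marker, the last article of a text cannot be retrieved this way.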

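Okay, where are we now? We have a function that retrieves the number, the title and the paragraphs of an article of the Delegated Acts in a given language. Next we collect all English articles in a pandas DataFrame with one row per paragraph.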

In [9]:
import pandas as pd

In [11]:
rows = []
for art in range(1, 381):
    art_num, art_title, art_paragraphs = retrieve_article('EN', art)
    for par_num, par_text in art_paragraphs:
        ref = str(art_num)
        if par_num != 0:
            # numbered paragraphs get a reference such as 2(1)
            ref = ref + '(' + str(par_num) + ')'
        rows.append([ref, art_num, par_num, art_title, par_text])
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
df = pd.DataFrame(rows, columns=['reference', 'article', 'paragraph', 'article_title', 'paragraph_text'])
df = df.set_index('reference')
df

| reference | article | paragraph | article_title | paragraph_text |
|---|---|---|---|---|
| 1 | 1 | 0 | Definitions | For the purposes of this Regulation, the follo... |
| 2(1) | 2 | 1 | Expert judgement | Where insurance and reinsurance undertakings m... |
| 2(2) | 2 | 2 | Expert judgement | Insurance and reinsurance undertakings shall, ... |
| 3 | 3 | 0 | Association of credit assessments to credit qu... | The scale of credit quality steps referred to ... |
| 4(1) | 4 | 1 | General requirements on the use of credit asse... | Insurance or reinsurance undertakings may use ... |
| ... | ... | ... | ... | ... |
| 377(1) | 377 | 1 | Significant intragroup transactions (definitio... | Participating insurance and reinsurance undert... |
| 377(2) | 377 | 2 | Significant intragroup transactions (definitio... | For the purposes of identifying significant in... |
| 378 | 378 | 0 | Criteria for assessing third country equivalence | The criteria to be taken into account in order... |
| 379 | 379 | 0 | Criteria for assessing third country equivalence | The criteria to be taken into account in order... |
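
The reference column combines the article number and the paragraph number, with paragraph 0 standing for an article without numbered paragraphs. A small sketch of how such references could be built and split again is shown below. The helper names make_reference and parse_reference are assumptions for illustration and are not part of the notebook.

```python
import re


def make_reference(article, paragraph=0):
    """Format a reference such as '2(1)'; paragraph 0 means no numbered paragraphs."""
    ref = str(article)
    if paragraph != 0:
        ref += "(" + str(paragraph) + ")"
    return ref


def parse_reference(ref):
    """Split a reference such as '377(2)' into (article, paragraph)."""
    match = re.fullmatch(r"(\d+)(?:\((\d+)\))?", ref)
    if match is None:
        raise ValueError("not a valid reference: " + ref)
    paragraph = int(match[2]) if match[2] else 0
    return int(match[1]), paragraph


print(make_reference(2, 1))
print(parse_reference("377(2)"))
print(parse_reference("378"))
```

Having both directions makes it possible to go from a citation in running text back to the article and paragraph columns.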


In [240]:
df[df['article']==26].paragraph_text.values[0]

'Bei der Bestimmung der Wahrscheinlichkeit, dass die Versicherungsnehmer vertragliche Optionen, einschließlich Storno- und Rückkaufsmöglichkeiten, wahrnehmen, analysieren die Versicherungs- und Rückversicherungsunternehmen das frühere Verhalten der Versicherungsnehmer und bewerten prospektiv das erwartete Verhalten. Bei dieser Analyse wird Folgendem Rechnung getragen: (a) der Frage, wie vorteilhaft die Ausübung der Optionen für den Versicherungsnehmer unter den zum Zeitpunkt der Ausübung herrschenden Umständen war und künftig sein wird; (b) dem Einfluss vergangener und künftiger wirtschaftlicher Rahmenbedingungen; (c) den Auswirkungen vergangener und künftiger Maßnahmen des Managements; (d) allen anderen etwaigen Umständen, die die Entscheidungen der Versicherungsnehmer über die Wahrnehmung einer Option beeinflussen dürften. Dass die Wahrscheinlichkeit von den unter den Buchstaben a bis d genannten Elementen unabhängig ist, wird nur dann angenommen, wenn empirische Nachweise eine solch

(11, 'Recognition of contingent liabilities', [(1, 'Insurance and reinsurance undertakings shall recognise contingent liabilities, as defined in accordance with  Article 9 of this Regulation, that are material, as liabilities. '), (2, 'Contingent liabilities shall be material where information about the current or potential size or nature of those  liabilities could influence the decision-making or judgement of the intended user of that information, including the  supervisory authorities. ')])


In [81]:
retrieve_article('DE', 292)

(292,
 'Zusammenfassung',
 [(1,
   'Der Bericht über Solvabilität und Finanzlage enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung  des Berichts ist für Versicherungsnehmer und Anspruchsberechtigte verständlich. '),
  (2,
   'In der Zusammenfassung werden etwaige wesentliche Änderungen in Bezug auf Geschäftstätigkeit und Leistung  des Versicherungs- oder Rückversicherungsunternehmens, sein Governance-System, sein Risikoprofil, die Bewertung für  Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum herausgestellt. ')])

In [82]:
retrieve_article('FR', 292)

(292,
 'Synthèse',
 [(1,
   'Le rapport sur la solvabilité et la situation financière contient une synthèse concise et claire. Cette synthèse est  compréhensible par les preneurs et les bénéficiaires. '),
  (2,
   "La synthèse met en évidence tout changement important survenu dans l'activité et les résultats de l'entreprise  d'assurance ou de réassurance, son système de gouvernance, son profil de risque, la valorisation qu'elle applique à des  fins de solvabilité et la gestion de son capital sur la période de référence. ")])

In [83]:
retrieve_article('EL', 292)

(292,
 'Περίληψη',
 [(1,
   'Η έκθεση φερεγγυότητας και χρηματοοικονομικής κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη  της έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και τους δικαιούχους. '),
  (2,
   'Η περίληψη της έκθεσης επισημαίνει τυχόν ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της  ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για  τους σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο αναφοράς. ')])

In [84]:
retrieve_article('NL', 295)

(295,
 'Risicoprofiel',
 [(1,
   "Het verslag over de solvabiliteit en financiële toestand bevat kwalitatieve en kwantitatieve informatie over het  risicoprofiel van de verzekerings- of herverzekeringsonderneming, zulks in overeenstemming met de leden 2 tot en  met 7 en afzonderlijk voor de volgende risicocategorieën: \n(a)  verzekeringstechnisch risico; \n(b)  marktrisico; \n(c)  kredietrisico; \n(d)  liquiditeitsrisico; \n(e)  operationeel risico; \n(f)  andere materiële risico's. "),
  (2,
   "Het verslag over de solvabiliteit en financiële toestand bevat de volgende informatie over de risicoblootstelling van  de verzekerings- of herverzekeringsonderneming, met inbegrip van de blootstelling die voortvloeit uit buitenbalansposities en de overdracht van risico aan special purpose vehicles: \n(a)  een beschrijving van de maatregelen om deze risico's binnen die onderneming te beoordelen, met vermelding van  alle in de loop van de rapportageperiode opgetreden materiële veranderingen; \n(