# EU Gesetz code explanation

This is an explanation for myself of how I built the code.

Tutorial for requests: https://medium.com/technofunnel/web-scraping-with-python-using-beautifulsoup-76b710e3e92f


In [None]:
import requests 

In [None]:
URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:32022R0868&qid=1666873096926&from=EN"
r = requests.get(URL) 
#print(r.content) 

Using the requests library, we can download the HTML of the page. The response needs further processing before we can extract data from it: at this point ```r.content``` is just the raw response body as bytes (```r.text``` would give it as a string), with no structure we can navigate.
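A small sketch of the bytes-versus-string distinction, using a made-up byte string as a stand-in for ```r.content``` (the sample text is an assumption for illustration, and no network access is needed):

```python
# Stand-in for r.content: an HTTP body arrives as encoded bytes.
# The sample sentence is hypothetical, not taken from the law.
raw = "<p>(1)\u00a0F\u00fcr die Zwecke dieser Verordnung</p>".encode("utf-8")

# Decoding turns the bytes into a str, which is roughly what r.text provides.
decoded = raw.decode("utf-8")

print(type(raw).__name__, type(decoded).__name__)

# Non-ASCII characters (the umlaut, the non-breaking space) take more than
# one byte in UTF-8, so the byte length exceeds the character count.
print(len(raw) > len(decoded))
```

Either form can be handed to BeautifulSoup; the point is only that neither is a navigable structure yet.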

 To parse it into a tree-based structure, we need to use BeautifulSoup.
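To see what "tree-based" means in practice, here is a minimal sketch on a hypothetical miniature page. The HTML below is invented to mimic the nesting on EUR-Lex, and it uses the built-in ```html.parser``` instead of ```html5lib``` so that it runs without extra installs:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, invented for illustration.
html = b"""
<html><body>
<div id="001">
  <p class="oj-ti-art">Artikel 1</p>
  <div id="001.001"><p class="oj-normal">(1) Erster Absatz</p></div>
</div>
</body></html>
"""

# Named mini_soup so it does not overwrite the notebook's `soup`.
mini_soup = BeautifulSoup(html, "html.parser")

# Once parsed, elements can be looked up and navigated as nested objects.
article = mini_soup.find("div", attrs={"id": "001"})
first_absatz = article.find("div", attrs={"id": "001.001"})

print(article.p.get_text(strip=True))
print(first_absatz.get_text(strip=True))
```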

In [None]:
from bs4 import BeautifulSoup 

In [None]:
soup = BeautifulSoup(r.content, 'html5lib') 
print(soup.prettify()) 

The text that precedes the start of the law is "HABEN FOLGENDE VERORDNUNG ERLASSEN:"
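If the marker is ever needed to cut the preamble off a plain-text version of the page, one possible sketch uses ```str.partition```. The helper name ```split_at_marker``` and the sample string are assumptions for illustration; the notebook itself works with the div ids instead:

```python
MARKER = "HABEN FOLGENDE VERORDNUNG ERLASSEN:"

def split_at_marker(full_text, marker=MARKER):
    """Return (preamble, body), split at the first occurrence of marker."""
    preamble, found, body = full_text.partition(marker)
    if not found:
        raise ValueError("marker not found in text")
    return preamble.strip(), body.strip()

# Hypothetical sample text, not the real preamble.
sample = "(hypothetical preamble text) HABEN FOLGENDE VERORDNUNG ERLASSEN: KAPITEL I Artikel 1"
preamble, body = split_at_marker(sample)
print(body)
```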

The ```<div>``` tag defines a division or a section in an HTML document.

In Article 1, the structure is: Artikel >> Absatz >> Buchstaben (for some Absätze)
In Article 2, the structure is: Artikel >> Nummer >> Buchstaben (for some Nummern)

So:

- For Article 1, we will have 5 documents (5 Absätze)
- For Article 2, we will have 1 document (since it is not divided into Absätze)

--

Looking at the HTML, I noticed that the id of the DIV changes as we move through the Article and the Absatz: ```<div id="001">``` is Article 1, ```<div id="001.001">``` is Article 1, Absatz 1, and so on. When an Article goes straight to Nummern, the div has no subdivision: ```<div id="002">``` is Article 2 and that is it.
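The id scheme described above can be decoded with a few lines of plain Python. This is only a sketch: the names ```DivId``` and ```parse_div_id``` are made up here, and it assumes every id has the form "AAA" or "AAA.BBB":

```python
from typing import NamedTuple, Optional

class DivId(NamedTuple):
    artikel: int
    absatz: Optional[int]  # None when the Article has no Absatz divs

def parse_div_id(div_id: str) -> DivId:
    """Split an id such as '001.002' into Article and Absatz numbers."""
    artikel, _, absatz = div_id.partition(".")
    return DivId(int(artikel), int(absatz) if absatz else None)

print(parse_div_id("001.001"))
print(parse_div_id("002"))
```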

Examples from the HTML:

1. Start of the chapter

```
    <p class="oj-ti-section-1" id="d1e887-1-1">
    <span class="oj-italic">
     KAPITEL I
```

2. Start of the Article:

```
    <div id="001"> 
    <p class="oj-ti-art" id="d1e897-1-1">
     Artikel 1
    </p>
    <p class="oj-sti-art">
     Gegenstand und Anwendungsbereich
    </p>
```

3. Start of the Absatz:

```
    <div id="001.001">
     <p class="oj-normal">
      (1)   In dieser Verordnung wird Folgendes festgelegt:
     </p>
```

Absatz 2 of the same Article:

```
    <div id="001.002">
        <p class="oj-normal">
        (2)   Diese Verordnung begründet weder eine Verpflichtung für öffentliche Stellen, die Weiterverwendung von Daten zu erlauben, noch befreit sie öffentliche Stellen von ihren Geheimhaltungspflichten nach dem Unionsrecht oder dem nationalen Recht.
        </p>
```
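The class names in these snippets (```oj-ti-art``` for the Article number, ```oj-sti-art``` for its title) could also be used to pull out the heading of an Article. A minimal sketch on a hand-copied fragment, with a hypothetical helper ```article_heading``` and the built-in ```html.parser```:

```python
from bs4 import BeautifulSoup

# Fragment assembled by hand from the examples above (closing tags added).
html = """
<div id="001">
 <p class="oj-ti-art" id="d1e897-1-1">Artikel 1</p>
 <p class="oj-sti-art">Gegenstand und Anwendungsbereich</p>
 <div id="001.001">
  <p class="oj-normal">(1)   In dieser Verordnung wird Folgendes festgelegt:</p>
 </div>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

def article_heading(article_div):
    """Return (number line, title) of an Article div; None where missing."""
    title = article_div.find("p", class_="oj-ti-art")
    subtitle = article_div.find("p", class_="oj-sti-art")
    return (
        title.get_text(strip=True) if title else None,
        subtitle.get_text(strip=True) if subtitle else None,
    )

heading = article_heading(snippet.find("div", id="001"))
print(heading)
```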

What I want:

- collect all Articles separately
- within each Article, split by Absatz (when there is one)

In [None]:
# Full code:

import requests 
from bs4 import BeautifulSoup 

URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:32022R0868&qid=1666873096926&from=EN"
r = requests.get(URL) 
soup = BeautifulSoup(r.content, 'html5lib') 

# Print the first Absatz of Article 1
table = soup.find('div', attrs = {'id':'001.001'}) 

for row in table:
    print(row.text)

In [None]:
import unicodedata
import pandas as pd

In [None]:
# Full code

#import unicodedata
#import pandas as pd


# Putting it into a DataFrame:

# Fetch the first Absatz of Article 1

df_law = pd.DataFrame(columns = ['artikel', 'absatz', 'text'])

id_num= '001.001'
table = soup.find('div', attrs = {'id':id_num}) 

my_str = table.text
result = unicodedata.normalize('NFKD', my_str)   # removing \xa0
result = " ".join(result.split())
#result # nice, the complete Absatz is there
df_temp = pd.DataFrame({'artikel': [id_num],
                        'absatz' : [id_num],
                        'text': [result]})

#df_temp.dtypes works
df_law = pd.concat([df_law, df_temp])
df_law
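One detail about the cleaning step that may matter for German text: ```NFKD``` does turn ```\xa0``` into a normal space, but it also splits umlauts into a base letter plus a combining mark. A hedged sketch comparing it with ```NFKC``` (the helper name ```clean``` and the sample string are assumptions for illustration):

```python
import unicodedata

def clean(text, form="NFKC"):
    """Normalize the text, then collapse all runs of whitespace to one space."""
    return " ".join(unicodedata.normalize(form, text).split())

# Hypothetical sample with non-breaking spaces, a newline and an umlaut.
raw = "(1)\u00a0\u00a0 F\u00fcr  die\n Zwecke"

nfkd = clean(raw, "NFKD")
nfkc = clean(raw, "NFKC")

# Both look the same when printed, but NFKD stores the umlaut as
# 'u' + U+0308 (combining diaeresis), so the strings are not equal.
print(nfkd == nfkc)
print(len(nfkd), len(nfkc))
```

Whether this matters depends on what happens downstream: exact string matching or length counts would be affected, plain display would not.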



In [None]:
# But I want to find all the ids:

ids = [tag['id'] for tag in soup.select('div[id]')]


Explanation: "In your soup, id is an attribute, not a type of tag. To get a list of all ids in the soup you can use the one-liner above; it uses CSS selectors instead of bs4's find_all, since I find bs4's docs regarding its built-ins lacking."
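For comparison, a small sketch showing the CSS-selector form next to a ```find_all``` equivalent on an invented fragment. Passing ```id=True``` to ```find_all``` matches tags that have an id attribute at all:

```python
from bs4 import BeautifulSoup

# Invented fragment: two divs with ids, one div without, one non-div with an id.
html = '<div id="001"><div id="001.001"></div></div><div>no id</div><p id="x"></p>'
demo = BeautifulSoup(html, "html.parser")

# CSS selector: every <div> that has an id attribute.
ids_css = [tag["id"] for tag in demo.select("div[id]")]

# find_all equivalent: id=True means "the attribute is present".
ids_find = [tag["id"] for tag in demo.find_all("div", id=True)]

print(ids_css)
print(ids_find)
```

Both lists come back in document order, so either style should work for the steps that follow.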

In [None]:
print(ids)

What I want to do:

1. Find out whether the id has Absätze
2. If the id has Absätze, fetch each Absatz
3. If the id has no Absatz, fetch the id itself
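The three steps above can be sketched without pandas. The function name ```leaf_ids``` is made up here, and the sketch assumes that an Article with Absätze always appears both as a bare id ("001") and as dotted ids ("001.001", ...):

```python
def leaf_ids(ids):
    """Keep only the ids whose text should be fetched.

    An Article id is dropped when dotted ids exist for it, because its
    Absatz divs already cover the text; otherwise it is kept as-is.
    """
    parents = {i.split(".")[0] for i in ids if "." in i}
    return [i for i in ids if i not in parents]

# Small hand-made list following the id scheme of the page.
sample_ids = ["001", "001.001", "001.002", "002", "004", "004.001"]
print(leaf_ids(sample_ids))
```

Unlike a set difference, this keeps the ids in their original document order.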

In [None]:
import pandas as pd
import numpy as np

In [42]:
# Final code

#import pandas as pd
#import numpy as np

ids = [tag['id'] for tag in soup.select('div[id]')]

df = pd.DataFrame(ids)
df.rename(columns={list(df)[0]:'ids'}, inplace=True) # rename the column by position
df['artikel'] = df['ids'].str[:3]
df

# Counting the ids per Article to find the ones with Absätze:

df1 = df.groupby(['artikel']).count().reset_index()
df1 = df1.query("ids > 1")
ids_mit = df1['artikel']#.astype('str')

ids_final = set(ids) - set(ids_mit)

print(ids_final)   # I checked the count, it is correct

{'025.003', '029.003', '024.002', '011.007', '024.005', '004.004', '011.014', '011.013', '011.004', '005.008', '005.011', '006.004', '005.007', '030', '014.004', '006.005', '004.006', '011.010', '001.002', '014.003', '019.004', '013.002', '013.003', '001.004', '004.005', '014.002', '011.009', '026.004', '012', '006.003', '032.004', '019.003', '029.001', '006.002', '004.002', '009.001', '019.005', '026.003', '005.003', '007.001', '018', '028.003', '032.005', '032.006', '028.002', '021.004', '038', '021.002', '025.001', '034.002', '024.004', '007.003', '011.006', '024.001', '014.006', '008.002', '019.006', '031.001', '010', '029.002', '023.002', '019.007', '027.002', '002', '008.001', '026.005', '029.004', '001.001', '022.002', '021.003', '011.001', '011.012', '014.007', '026.001', '007.005', '006.006', '021.006', '023.001', '005.004', '014.001', '005.010', '017.001', '022.001', '011.011', '026.006', '025.004', '032.002', '011.002', '024.006', '032.003', '019.001', '027.001', '025.002', 

In [52]:
# Final loop to fetch all ids, with and without Absätze:

ids_teste = ['025.003', '029.003', '024.002','038']

# Putting it into a DataFrame:
df_law = pd.DataFrame(columns = ['artikel', 'absatz', 'text'])

for id_entry in ids_teste:
    # fetch the whole div:
    table = soup.find('div', attrs = {'id':id_entry}) 
    my_str = table.text
    result = unicodedata.normalize('NFKD', my_str)   # removing \xa0
    result = " ".join(result.split())
    #result # nice, the complete Absatz is there
    # now get the Article and the Absatz
    # build a temporary df with the details of the Article:
    if len(id_entry) < 4:
        df_temp = pd.DataFrame({'artikel': [id_entry.split('.')[0]], 
                                'absatz' : ["NA"],
                                'text': [result]})
    else:
        df_temp = pd.DataFrame({'artikel': [id_entry.split('.')[0]], 
                                'absatz' : [id_entry.split('.')[1]],
                                'text': [result]})
    #df_temp.dtypes works
    df_law = pd.concat([df_law, df_temp])

df_law

|   | artikel | absatz | text |
|---|---------|--------|------|
| 0 | 25 | 3.0 | (3) Werden personenbezogene Daten erfasst, so ... |
| 0 | 29 | 3.0 | (3) Die Kommission führt den Vorsitz in den S... |
| 0 | 24 | 2.0 | (2) Die für die Registrierung von datenaltrui... |
| 0 | 38 |  | Artikel 38 Inkrafttreten und Geltung Diese Ver... |


# Full code

In [54]:
import unicodedata  # needed for the text cleaning in the loop below

import requests 
from bs4 import BeautifulSoup 
import pandas as pd
import numpy as np

In [55]:
#### FINAL CODE: ####


## Getting the law:

URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:32022R0868&qid=1666873096926&from=EN"
r = requests.get(URL) 
soup = BeautifulSoup(r.content, 'html5lib') 


## Getting all the Artikel and separating those with and without Absätze:

#1. getting all Artikel and Absätze:

ids = [tag['id'] for tag in soup.select('div[id]')]

df = pd.DataFrame(ids)
df.rename(columns={list(df)[0]:'ids'}, inplace=True)
df['artikel'] = df['ids'].str[:3]
df

#2. Finding which of them have Absätze

df1 = df.groupby(['artikel']).count().reset_index()
df1 = df1.query("ids > 1")
ids_mit = df1['artikel']#.astype('str')

#3. Getting the final ids, so that an Artikel and its Absätze don't both appear:
ids_final = set(ids) - set(ids_mit)

## Creating the table with all Artikel and Absatze:

#1. Creating empty df:
df_law = pd.DataFrame(columns = ['artikel', 'absatz', 'text'])

#2. loop for ids:

# sorted() gives a reproducible row order; iterating a set directly does not
for id_entry in sorted(ids_final):
    # fetch the whole div:
    table = soup.find('div', attrs = {'id':id_entry}) 
    my_str = table.text
    result = unicodedata.normalize('NFKD', my_str)   # removing \xa0
    result = " ".join(result.split())
    #result # nice, the complete Absatz is there
    # now get the Article and the Absatz
    # build a temporary df with the details of the Article:
    if len(id_entry) < 4:
        df_temp = pd.DataFrame({'artikel': [id_entry.split('.')[0]], 
                                'absatz' : ["NA"],
                                'text': [result]})
    else:
        df_temp = pd.DataFrame({'artikel': [id_entry.split('.')[0]], 
                                'absatz' : [id_entry.split('.')[1]],
                                'text': [result]})
    #df_temp.dtypes works
    df_law = pd.concat([df_law, df_temp])

df_law

df_law.to_csv("df_law_teste.csv")
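A note on reading the file back: with pandas defaults, zero-padded ids such as "025" are likely to come back as the number 25, and the string "NA" as a missing value, which would match the table shown earlier (25, 3.0, empty Absatz). A hedged sketch on a tiny invented DataFrame, written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Tiny invented stand-in for df_law (texts shortened on purpose).
df_demo = pd.DataFrame({
    "artikel": ["025", "038"],
    "absatz": ["003", "NA"],
    "text": ["(3) ...", "Artikel 38 ..."],
})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # index=False avoids the "Unnamed: 0" column

# Default read: ids are parsed as numbers and "NA" as a missing value.
buffer.seek(0)
default = pd.read_csv(buffer)

# Reading everything as text keeps the ids exactly as they were written.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str, keep_default_na=False)

print(default.dtypes)
print(as_text)
```

Keeping the ids as strings makes it easier to map rows back to the ```<div id="...">``` they came from.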

