# <font color='blue'>Data Science Academy</font>
# <font color='blue'>Natural Language Processing</font>

## Case Study

### Web Scraping and NLP for Extracting the Most-Cited Legal Cases in Case Law

In [1]:
# Python language version
from platform import python_version
print('Python version used in this Jupyter Notebook:', python_version())

Python version used in this Jupyter Notebook: 3.7.6


![title](imagens/estudocaso.png)

**This project is defined in the PDF manual that accompanies this Jupyter Notebook.**

In [2]:
# To update a package, run the command below in a terminal or command prompt:
# pip install -U package_name

# To install an exact version of a package, run the command below in a terminal or command prompt:
# pip install package_name==desired_version

# After installing or updating a package, restart Jupyter Notebook.

# Install the watermark package.
# It is used to record the versions of the other packages used in this notebook.
!pip install -q -U watermark

In [3]:
# Import the required libraries
import re
import pandas as pd
import urllib3
import bs4
from bs4 import BeautifulSoup

# Recent pandas versions emit many warnings aimed at developers. We silence them here.
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=FutureWarning)

In [4]:
# Versions of the packages used in this Jupyter Notebook
%reload_ext watermark
%watermark -a "Data Science Academy" --iversions

re      2.2.1
pandas  1.0.3
urllib3 1.25.8
bs4     4.8.2
Data Science Academy


### Web Scraping

We start by scraping the results page of a search on http://www.bailii.org.

Note that the URL below is itself the result of a search: we went to the site, ran the search with the required filters, copied the URL from the browser's address bar, and pasted it into the code.

Open the URL below in your browser to see what we will be extracting.
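
The search URL can also be rebuilt from its parts instead of being copied from the browser. The sketch below uses only the standard library; the parameter names come from the URL used in this notebook, while their meaning (date range, sort order, highlighting) is an assumption inferred from those names.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this notebook
BASE = 'http://www.bailii.org/cgi-bin/lucy_search_1.cgi'

# Search expression as it would be typed into the search form
query = ('"Planning court" AND ("The Royal Courts of Justice" '
         'OR "Supreme Court" OR "Manchester Civil Justice Centre")')

params = {
    'query': query,
    'datelow': '',      # assumed: lower bound of the date filter (left empty)
    'datehigh': '',     # assumed: upper bound of the date filter (left empty)
    'sort': 'date',
    'highlight': '1',
}

# urlencode percent-encodes quotes and parentheses and turns spaces into '+'
search_url = BASE + '?' + urlencode(params)
print(search_url)
```

Building the URL this way makes it easier to change the search terms without editing an encoded string by hand.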

In [5]:
# Define the URL
url = 'http://www.bailii.org/cgi-bin/lucy_search_1.cgi?query=%22Planning+court%22+AND+%28%22The+Royal+Courts+of+Justice%22+OR+%22Supreme+Court%22+OR+%22Manchester+Civil+Justice+Centre%22%29&datelow=&datehigh=&sort=date&highlight=1'

# Create the HTTP connection pool
http = urllib3.PoolManager()

# Request the page at the URL
response = http.request('GET', url)

# Parse the response and store it
soup = BeautifulSoup(response.data)

The next step is to find the links in the extracted data.

In [6]:
# List that will receive the links
links = []

# Loop over the returned data, look for the HTML <li> and <a> tags, and store each link in the list
# For background on HTML tags, see: https://www.w3schools.com/html/
for data in soup.find_all('li'):
    for a in data.find_all('a'):
        links.append('http://www.bailii.org' + a.get('href'))
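
One thing to watch in this kind of loop is an `<a>` tag without an `href` attribute: concatenating a string with `None` raises a `TypeError`. The sketch below shows the same idea (collect links found inside `<li>` elements) with a guard for missing `href` values. It uses the standard library `html.parser` rather than BeautifulSoup so that it runs on its own; the class name `LiLinkParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LiLinkParser(HTMLParser):
    """Collect the href of every <a> found inside an <li> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.li_depth = 0      # number of <li> elements currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_depth += 1
        elif tag == 'a' and self.li_depth > 0:
            href = dict(attrs).get('href')
            if href:           # skip anchors that have no href
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == 'li' and self.li_depth > 0:
            self.li_depth -= 1

# Hypothetical fragment of a results page
sample_html = '''
<ul>
  <li><a href="/ew/cases/EWHC/Admin/2020/1176.html">Case A</a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
<p><a href="/outside.html">Not inside a list item</a></p>
'''

parser = LiLinkParser('http://www.bailii.org')
parser.feed(sample_html)
print(parser.links)
```

This sketch assumes that every `<li>` is explicitly closed; pages that omit the closing tag would need a more tolerant parser such as BeautifulSoup.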

In [7]:
# Keep every second link (indices 1, 3, 5, ...) and drop the ones we do not need
links = links[1::2]
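
The slice `[1::2]` starts at index 1 and then takes every second element. A minimal illustration on a hypothetical list, assuming each search result contributes two consecutive links and the second one is the link we want:

```python
# Hypothetical list in which every result contributes two links
sample_links = ['first-a', 'first-b', 'second-a', 'second-b', 'third-a', 'third-b']

# Start at index 1, step 2: keeps the second link of each pair
kept = sample_links[1::2]
print(kept)
```

If the page layout changes so that results no longer come in pairs, this slice silently keeps the wrong links, so it is worth inspecting `links` before and after this step.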

In [8]:
# Fetch the content of each case, strip the HTML tags, and store the text in the list
list_cases = []
for i in links:
    case_request = http.request('GET', i)
    list_cases.append(BeautifulSoup(case_request.data).text)

In [9]:
# Split each case into lines and drop empty strings. Note the character used for the split
clean_cases = []
for i in list_cases:
    text = str(i).split('\n')    
    text = list(filter(None,text))
    clean_cases.append(text)
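
Note that `filter(None, ...)` removes only empty strings. Lines that contain nothing but spaces survive, which is why entries such as `'  '` appear in the output below. The sketch contrasts this with a variant that strips each line first; the sample text is hypothetical.

```python
raw = 'Title line\n\n  \n    [Home]\nCite as: \n[2020] EWHC 1176 (Admin)\n'

# Same approach as the cell above: only truly empty strings are dropped
kept_as_in_notebook = list(filter(None, raw.split('\n')))

# Alternative (not what the notebook does): strip each line before testing it
stripped = [line.strip() for line in raw.split('\n') if line.strip()]

print(kept_as_in_notebook)
print(stripped)
```

Switching to the stripped variant would change the number of lines per case, so any later step that relies on fixed positions in the list would need to be adjusted as well.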

In [10]:
# List of legal cases
clean_cases

[['The Mayor of London v the Secretary of State for Housing, Communities And Local Government & Ors [2020] EWHC 1176 (Admin) (12 May 2020)',
  '  @media screen {',
  '  }',
  '  @media print {',
  '    #screenonly {',
  '      display: none;',
  '    }',
  '  }',
  '    [Home]',
  '    [Databases]',
  '    [World Law]',
  '    [Multidatabase Search] ',
  '    [Help]',
  '    [Feedback]',
  '  ',
  'England and Wales High Court (Administrative Court) Decisions',
  'You are here:',
  'BAILII >>',
  '      ',
  '      Databases >>',
  '      ',
  '      England and Wales High Court (Administrative Court) Decisions >>',
  '      ',
  '      The Mayor of London v the Secretary of State for Housing, Communities And Local Government & Ors [2020] EWHC 1176 (Admin) (12 May 2020)',
  '    URL: http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html',
  'Cite as: ',
  '[2020] EWHC 1176 (Admin)',
  '    ',
  '[New search]',
  '[Printable PDF version]',
  '[Help]',
  ' ',
  ' ',
  'Neutral Citatio

Now we extract the case name, case number, URL, date, week, and citation.

In [11]:
# We start by extracting the case name and the URL
# Note that we use positional indexing to fetch the data at the exact positions we want
case_name = []
case_url = []
for i in clean_cases:
    case_name.append(i[23:24])
    case_url.append(i[24:25])
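
Two details of this step are worth illustrating. First, a slice such as `i[24:25]` returns a one-element list, not a string. Second, fixed positions only work while every page has exactly the same layout. The sketch below shows a possible alternative that looks for the `URL: ` prefix instead; the helper `find_url_line` is hypothetical and the sample lines are taken from the output above.

```python
def find_url_line(lines):
    """Return the text after 'URL: ' from the first line that carries it, or None."""
    for line in lines:
        text = line.strip()
        if text.startswith('URL: '):
            return text[len('URL: '):]
    return None

sample_lines = [
    '      The Mayor of London v the Secretary of State for Housing, Communities And Local Government & Ors [2020] EWHC 1176 (Admin) (12 May 2020)',
    '    URL: http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html',
    'Cite as: ',
    '[2020] EWHC 1176 (Admin)',
]

by_position = sample_lines[1:2]             # a list containing one string
by_prefix = find_url_line(sample_lines)     # the bare URL as a string

print(by_position)
print(by_prefix)
```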

In [12]:
# Create a dataframe with the case name
df = pd.DataFrame(case_name, columns = ['case_name'])

In [13]:
# Add the URL
df['url'] = [i.replace('URL: ','') for i in list(map(''.join,case_url))]

In [14]:
# Add the dates
# We use a regular expression to match the date pattern
case_date = [re.findall(r'(\d\d\s[a-z]+\s\d{4})', str(i), re.IGNORECASE) for i in case_name]
df['date'] = list(map(''.join,case_date))
df['date'] = pd.to_datetime(df['date'])
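
The date pattern expects a two-digit day, a month name, and a four-digit year. The sketch below applies it to the first case title from the output above and parses the match with the standard library. It also shows that a single-digit day would not match, and a slightly more tolerant pattern (`\d{1,2}`), which is a suggestion rather than what the notebook uses.

```python
import re
from datetime import datetime

title = ('The Mayor of London v the Secretary of State for Housing, Communities '
         'And Local Government & Ors [2020] EWHC 1176 (Admin) (12 May 2020)')

# Pattern used in the notebook: two digits, month name, four-digit year
DATE_PATTERN = r'(\d\d\s[a-z]+\s\d{4})'
matches = re.findall(DATE_PATTERN, title, re.IGNORECASE)
parsed = datetime.strptime(matches[0], '%d %B %Y')

# A day written with a single digit is not matched by the pattern above
short_day = 'Hypothetical v Example (5 May 2020)'
strict = re.findall(DATE_PATTERN, short_day, re.IGNORECASE)
tolerant = re.findall(r'(\d{1,2}\s[a-z]+\s\d{4})', short_day, re.IGNORECASE)

print(matches, parsed.date())
print(strict, tolerant)
```

When the pattern finds nothing, the match list is empty and joins to an empty string, so it is worth checking the `date` column for missing values after this step.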

In [15]:
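# Add the `week` column, derived from the date of each case
# Note: `%w` is the weekday number (0 = Sunday), not the week of the year; use `%W` or `%U` for a week number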
df['week'] = df['date'].dt.strftime('%YWk%w')
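
In `strftime`, `%w` is the weekday as a number (0 for Sunday through 6 for Saturday), while `%W` and `%U` give the week of the year. This explains why the `week` column in the table below reads `2020Wk2` for 12 May 2020, which was a Tuesday. A small standard-library sketch of the difference:

```python
from datetime import date

d = date(2020, 5, 12)                   # a Tuesday, the first date in the table

weekday_label = d.strftime('%YWk%w')    # weekday number, as in the cell above
week_label = d.strftime('%YWk%W')       # week of the year, Monday as first day
iso_year, iso_week, iso_weekday = d.isocalendar()

print(weekday_label)   # 2020Wk2
print(week_label)      # 2020Wk19
print(iso_week)        # 20
```

`%W` counts the days before the first Monday of the year as week 0, whereas ISO week numbering follows a different rule, so the two can differ by one.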

In [16]:
# Add the citation
df['cite_as'] = [i[25:28] for i in clean_cases]

In [17]:
# Display the data
df.head(10)

Unnamed: 0,case_name,url,date,week,cite_as
0,The Mayor of London v the Secretary of S...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-12,2020Wk2,"[Cite as: , [2020] EWHC 1176 (Admin), ]"
1,"Lochailort Investments Ltd, R (on the ap...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-11,2020Wk1,"[Cite as: , [2020] EWHC 1146 (Admin), ]"
2,The Open Spaces Society v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-05,2020Wk2,"[Cite as: , [2020] EWHC 1085 (Admin), ]"
3,"Wiltshire Council, R (on the application...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] WLR(D) 244,, [2020] EWHC 95..."
4,Hampshire County Council v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] EWHC 959 (Admin),, [2020] W..."
5,T & P Real Estate Ltd v London Borough o...,http://www.bailii.org/ew/cases/EWHC/Ch/202...,2020-04-21,2020Wk2,"[Cite as: , [2020] EWHC 879 (Ch), ]"
6,"Corbett, R (On the Application Of) v [2...",http://www.bailii.org/ew/cases/EWCA/Civ/20...,2020-04-09,2020Wk4,"[Cite as: , [2020] EWCA Civ 508, ]"
7,South Derbyshire District Council v Secr...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-09,2020Wk4,"[Cite as: , [2020] WLR(D) 229,, [2020] EWHC 87..."
8,Qatar National Bank (QPSC) v Force Indi...,http://www.bailii.org/ew/cases/EWHC/Admlty...,2020-03-25,2020Wk3,"[Cite as: , [2020] EWHC 719 (Admlty), ]"
9,Starbones Ltd v Secretary of State for H...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-03-10,2020Wk2,"[Cite as: , [2020] EWHC 526 (Admin), ]"


Great. We have almost everything we need; only the case number is missing. That takes a bit more work, because we have to locate the HTML tag that holds it and then extract the number.

We will write a function that extracts what we need from inside an HTML tag.

tag (string): HTML tag to search for

Returns a dictionary with the extracted values. Each key corresponds to the index of the decision in the dataframe.

In [18]:
# Function that extracts data based on the desired tag
def extractor(tag):

    # Dictionary of results, keyed by the position of each link
    dictionary = {}

    # Loop over the links
    for counter, case_link in enumerate(links):

        # Fetch the page
        response = http.request('GET', case_link)

        # Parse the page
        soup = BeautifulSoup(response.data)

        # Select the elements that match the tag
        var = soup.select(tag)

        # Replace the HTML tags of the first match with spaces, using a regular expression
        dictionary[counter] = re.sub("<.*?>", " ", str(var[0]))

    return dictionary
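
Removing tags with `re.sub` leaves HTML entities and extra whitespace in place, which is why values such as `CO/4849/2019 &amp; CO/4851/2019` show up in the tables below. The sketch adds two optional clean-up steps using the standard library; `clean_fragment` is a hypothetical helper, not part of the notebook.

```python
import html
import re

def clean_fragment(fragment):
    """Strip tags, decode HTML entities and collapse whitespace (a sketch)."""
    no_tags = re.sub(r'<.*?>', ' ', fragment)
    decoded = html.unescape(no_tags)
    return ' '.join(decoded.split())

sample = '<casenum>Case No: CO/4849/2019 &amp; CO/4851/2019</casenum>'

as_in_notebook = re.sub('<.*?>', ' ', sample)   # tags replaced by spaces only
cleaned = clean_fragment(sample)

print(repr(as_in_notebook))
print(repr(cleaned))
```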

We apply the function to extract the case number.

In [19]:
# Extract the case number and add it to the dataframe
case_num = extractor('casenum')
df['casenum'] = df.index.map(case_num)

Very good! Now we extract the judges, court names, and authorities with the same function and add them to the dataframe. This is not strictly necessary, but since we are here, why not?

In [20]:
# Extract the name of the judge (or judges) responsible for each case and add it to the dataframe
judges_all = extractor('panel')
df['judges'] = df.index.map(judges_all)

In [21]:
# Extract the name of the court responsible for each case and add it to the dataframe
court_all = extractor('court')
df['court'] = df.index.map(court_all)

In [22]:
# Extract the authorities (councils, boroughs, cities) named among the parties of each case and add them to the dataframe
# Once again we use regular expressions to search for the desired patterns
authorities = extractor('parties')
authority_all = []
for key, value in authorities.items():
    case = []
    # Split on the standalone word "and" or on line breaks
    test = [s.strip() for s in re.split(r"\band\b|\n", value)]
    for i in test:
        if re.search('COUNCIL|BOROUGH|CITY|AUTHORITY', i, re.IGNORECASE):
            case.append(i)
    authority_all.append(case)
df['authorities'] = authority_all
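
When party names are split on the word "and", word boundaries matter: a bare `and` also matches inside names such as "Sandwell". The sketch below compares the two patterns on a hypothetical party string.

```python
import re

# Hypothetical text of a <parties> element
parties = 'Sandwell Metropolitan Borough Council and Another\nSecretary of State'
KEYWORDS = 'COUNCIL|BOROUGH|CITY|AUTHORITY'

def pick_authorities(text, split_pattern):
    """Split the text, then keep the pieces that mention an authority keyword."""
    pieces = [s.strip() for s in re.split(split_pattern, text)]
    return [p for p in pieces if re.search(KEYWORDS, p, re.IGNORECASE)]

naive = pick_authorities(parties, 'and|\n')         # also splits inside "Sandwell"
bounded = pick_authorities(parties, r'\band\b|\n')  # splits on the whole word only

print(naive)
print(bounded)
```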

In [23]:
# Display the data
df.head(10)

Unnamed: 0,case_name,url,date,week,cite_as,casenum,judges,court,authorities
0,The Mayor of London v the Secretary of S...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-12,2020Wk2,"[Cite as: , [2020] EWHC 1176 (Admin), ]",Case No: CO/4849/2019 &amp; CO/4851/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[(3) THE LONDON BOROUGH OF HARROW]
1,"Lochailort Investments Ltd, R (on the ap...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-11,2020Wk1,"[Cite as: , [2020] EWHC 1146 (Admin), ]",Case No: CO/3929/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[MENDIP DISTRICT COUNCIL, NORTON ST PHILIP PAR..."
2,The Open Spaces Society v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-05,2020Wk2,"[Cite as: , [2020] EWHC 1085 (Admin), ]",Case No: CO/59/2020,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[]
3,"Wiltshire Council, R (on the application...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] WLR(D) 244,, [2020] EWHC 95...",Case No: CO/5006/2019,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[WILTSHIRE COUNCIL]
4,Hampshire County Council v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] EWHC 959 (Admin),, [2020] W...",Case No: CO/3493/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[Hampshire County Council]
5,T & P Real Estate Ltd v London Borough o...,http://www.bailii.org/ew/cases/EWHC/Ch/202...,2020-04-21,2020Wk2,"[Cite as: , [2020] EWHC 879 (Ch), ]",Case No: PT-2019-000766,DEPUTY MASTER BOWLES,IN THE HIGH COURT OF JUSTICE BUSINESS AND PR...,[Burgesses of the London Borough of Sutton]
6,"Corbett, R (On the Application Of) v [2...",http://www.bailii.org/ew/cases/EWCA/Civ/20...,2020-04-09,2020Wk4,"[Cite as: , [2020] EWCA Civ 508, ]",Case No: C1/2019/2179,Lord Justice Lewison Lord Justice Lindblom a...,IN THE COURT OF APPEAL (CIVIL DIVISION) ON AP...,[The Cornwall Council]
7,South Derbyshire District Council v Secr...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-09,2020Wk4,"[Cite as: , [2020] WLR(D) 229,, [2020] EWHC 87...",Case No: CO/4505/2019,THE HONOURABLE MRS JUSTICE ANDREWS DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[SOUTH DERBYSHIRE DISTRICT COUNCIL]
8,Qatar National Bank (QPSC) v Force Indi...,http://www.bailii.org/ew/cases/EWHC/Admlty...,2020-03-25,2020Wk3,"[Cite as: , [2020] EWHC 719 (Admlty), ]",Case No: AD 2018 000096,MR. JUSTICE TEARE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[]
9,Starbones Ltd v Secretary of State for H...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-03-10,2020Wk2,"[Cite as: , [2020] EWHC 526 (Admin), ]",Case No: CO/3356/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[(1) SECRETARY OF STATE FOR HOUSING, COMMUNITI..."


Done: the data is clean and organized.

Our next task is to extract the citations found in the text of each case, starting with the acts they cite.

In [24]:
# Define the lists and the dictionary that will hold the data we need, plus a counter.
cases_body = []
list_acts = []
dict_acts = {}
counter = 0

In [25]:
# For each case in the extracted links
for case in links:

    # Fetch the case page
    response = http.request('GET', case)

    # Parse the page
    soup_case = BeautifulSoup(response.data)

    # Take the body of the HTML page
    body = soup_case.body

    # Add the contents of the <li> and <p> tags to the list of cases
    cases_body.append(body.find_all(['li','p']))

    # Search for the pattern "section ... <four-digit year>" and keep matches of at most 100 characters
    for i in re.findall(r'(section.+?\s\d{4})', str(cases_body[0]), re.IGNORECASE):
        if len(i) <= 100:
            list_acts.append(i)

    # Store the acts found for this case in the dictionary
    dict_acts[counter] = list_acts

    # Reset the lists and increment the counter
    list_acts = []
    cases_body = []
    counter += 1
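
The pattern `section.+?\s\d{4}` matches from the word "section" up to the first whitespace followed by four digits, which in practice is usually the year of an act. The sketch below applies it to a hypothetical sentence and keeps matches of at most 100 characters, as the loop above does.

```python
import re

ACT_PATTERN = r'(section.+?\s\d{4})'

# Hypothetical paragraph from a judgment
text = ('The duty in section 66(1) of the Planning (Listed Buildings and '
        'Conservation Areas) Act 1990 applies. See also Section 38(6) of the '
        'Planning and Compulsory Purchase Act 2004.')

acts = [m for m in re.findall(ACT_PATTERN, text, re.IGNORECASE) if len(m) <= 100]

for act in acts:
    print(act)
```

Because the pattern only requires four digits after a space, it also accepts fragments such as `section 10 of the 1965`, which appear in the table below. Treat the result as a rough heuristic rather than a complete citation parser.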

In [26]:
# These lists will receive the act index and the act text, which we later add to the final dataframe
act_index = []
act_list = []

In [27]:
# Loop over the dictionary and append each key and each of its values to the lists
for key, acts in dict_acts.items():
    for act in acts:
        act_index.append(key)
        act_list.append(act)

In [28]:
# Create the dictionary
dic = {'act_index':act_index, 'act_list':act_list}

In [29]:
# Convert to a dataframe
act_dataframe = pd.DataFrame(dic)

In [30]:
# Count how many times each act string appears across all cases, then drop duplicate rows
act_dataframe['nr_act'] = act_dataframe['act_list'].apply(act_dataframe['act_list'].tolist().count)
act_dataframe = act_dataframe.drop_duplicates(subset = ['act_index','act_list'])
act_dataframe.sort_values(by = ['act_index'], inplace = True, ascending = True)
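
The count computed here is global: each act is counted over the whole `act_list` column, not within its own case. The sketch below reproduces that logic on hypothetical data with `collections.Counter` and contrasts it with a per-case count, which may be closer to what a reader expects.

```python
from collections import Counter

# Hypothetical (case index, act) pairs
pairs = [
    (0, 'section 70(2) TCPA 1990'),
    (0, 'section 70(2) TCPA 1990'),
    (1, 'section 70(2) TCPA 1990'),
    (1, 'section 119 of the Highways Act 1980'),
]

# Count over the whole column, as the cell above does
global_counts = Counter(act for _, act in pairs)

# Alternative: count each act within its own case
per_case_counts = Counter(pairs)

# Drop duplicate (index, act) pairs while keeping first-seen order
unique_pairs = list(dict.fromkeys(pairs))

print(global_counts['section 70(2) TCPA 1990'])
print(per_case_counts[(0, 'section 70(2) TCPA 1990')])
print(unique_pairs)
```

`Counter` makes a single pass over the data, whereas calling `list.count` once per row rescans the whole list each time.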

In [31]:
# Format the count as a string
act_dataframe['nr_act'] = act_dataframe['nr_act'].apply(lambda x: '{'+str(x)+'}')

In [32]:
# Append the count to the act text and drop the helper column
act_dataframe['act_list'] = act_dataframe['act_list'] + ' // ' + act_dataframe['nr_act']
act_dataframe = act_dataframe.drop(columns = ['nr_act'])

In [33]:
# Build a dictionary of lists keyed by case index
act_dict = {}
temp_list = []
counter = 0

In [34]:
# Loop to fill the dictionary with the acts of each case
for i in range(len(df)):
    temp_list = []
    for _, row in act_dataframe.iterrows():
        if row['act_index'] == i:
            temp_list.append(row['act_list'])
    act_dict[i] = temp_list
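
The same grouping can be done in a single pass over the rows instead of rescanning the dataframe once per case. A minimal sketch with `collections.defaultdict`, using hypothetical rows and a hypothetical number of cases:

```python
from collections import defaultdict

# Hypothetical (case index, formatted act) rows
rows = [(0, 'act A // {2}'), (0, 'act B // {1}'), (2, 'act A // {2}')]
n_cases = 4   # hypothetical number of cases in the main dataframe

# Group the act strings by case index in one pass
grouped = defaultdict(list)
for idx, text in rows:
    grouped[idx].append(text)

# Give every case an entry, including cases with no acts
act_dict_sketch = {i: grouped.get(i, []) for i in range(n_cases)}
print(act_dict_sketch)
```

Deriving the range from the number of cases, rather than from a fixed value, keeps the result complete when the search returns more or fewer decisions.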

In [35]:
# Add to the dataframe
df['cited_act'] = df.index.map(act_dict)

In [36]:
# Display the data
df.head(10)

Unnamed: 0,case_name,url,date,week,cite_as,casenum,judges,court,authorities,cited_act
0,The Mayor of London v the Secretary of S...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-12,2020Wk2,"[Cite as: , [2020] EWHC 1176 (Admin), ]",Case No: CO/4849/2019 &amp; CO/4851/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[(3) THE LONDON BOROUGH OF HARROW],[sections 66(1) and 72(1) of the Planning (Lis...
1,"Lochailort Investments Ltd, R (on the ap...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-11,2020Wk1,"[Cite as: , [2020] EWHC 1146 (Admin), ]",Case No: CO/3929/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[MENDIP DISTRICT COUNCIL, NORTON ST PHILIP PAR...","[section 61N(2) TCPA 1990 // {3}, section 38A(..."
2,The Open Spaces Society v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-05,2020Wk2,"[Cite as: , [2020] EWHC 1085 (Admin), ]",Case No: CO/59/2020,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[],"[section 119 of the Highways Act 1980 // {1}, ..."
3,"Wiltshire Council, R (on the application...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] WLR(D) 244,, [2020] EWHC 95...",Case No: CO/5006/2019,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[WILTSHIRE COUNCIL],[]
4,Hampshire County Council v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] EWHC 959 (Admin),, [2020] W...",Case No: CO/3493/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[Hampshire County Council],"[section 10 of the 1965 // {1}, Section 13 of ..."
5,T & P Real Estate Ltd v London Borough o...,http://www.bailii.org/ew/cases/EWHC/Ch/202...,2020-04-21,2020Wk2,"[Cite as: , [2020] EWHC 879 (Ch), ]",Case No: PT-2019-000766,DEPUTY MASTER BOWLES,IN THE HIGH COURT OF JUSTICE BUSINESS AND PR...,[Burgesses of the London Borough of Sutton],"[section of the 1990 // {1}, section 78 of the..."
6,"Corbett, R (On the Application Of) v [2...",http://www.bailii.org/ew/cases/EWCA/Civ/20...,2020-04-09,2020Wk4,"[Cite as: , [2020] EWCA Civ 508, ]",Case No: C1/2019/2179,Lord Justice Lewison Lord Justice Lindblom a...,IN THE COURT OF APPEAL (CIVIL DIVISION) ON AP...,[The Cornwall Council],[section 70(2) of the Town and Country Plannin...
7,South Derbyshire District Council v Secr...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-09,2020Wk4,"[Cite as: , [2020] WLR(D) 229,, [2020] EWHC 87...",Case No: CO/4505/2019,THE HONOURABLE MRS JUSTICE ANDREWS DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[SOUTH DERBYSHIRE DISTRICT COUNCIL],[]
8,Qatar National Bank (QPSC) v Force Indi...,http://www.bailii.org/ew/cases/EWHC/Admlty...,2020-03-25,2020Wk3,"[Cite as: , [2020] EWHC 719 (Admlty), ]",Case No: AD 2018 000096,MR. JUSTICE TEARE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[],[]
9,Starbones Ltd v Secretary of State for H...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-03-10,2020Wk2,"[Cite as: , [2020] EWHC 526 (Admin), ]",Case No: CO/3356/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[(1) SECRETARY OF STATE FOR HOUSING, COMMUNITI...","[section 70(2) TCPA 1990 // {1}, section 66(1)..."


### Key Cases

In [37]:
# Fetch the cases
cases_body = []
for i in links:
    response = http.request('GET', i)
    soup_case = BeautifulSoup(response.data)
    body = soup_case.body
    cases_body.append(body.find_all(['li','p']))  

In [38]:
# Build a dictionary of the key cases cited in each decision
casos_principais = {}

# Loop over the cases
for case in range(len(links)):
    cases = []

    # Extract key cases based on HTML tags
    for element in cases_body[case]:

        for val in element.find_all(['i','a','u'], attrs = {"name":False}):
            try:
                # Keep the tag text when it looks like a case name ("A v B" or "A v. B").
                # The check applies to <i>, <u> and <a> tags alike, so the text of a
                # link is stored right before its URL
                if 'v.' in val.string or ' v ' in val.string:
                    cases.append(val.string)
                # Keep the link target, skipping tags without href and search links
                href = val.get('href')
                if href is not None and 'cgi-bin' not in href:
                    cases.append('http://www.bailii.org' + href)
            except TypeError:
                # val.string is None when the tag has nested content
                cases.append('')

    # Remove the navigation links and store the result
    cases.remove('http://www.bailii.org/form/search_cases.html')
    cases.remove('http://www.bailii.org/bailii/help/')
    casos_principais[case] = cases[1:]
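
The test for a case name is a simple heuristic: the text must contain `' v '` or `'v.'`. The sketch below isolates it in a hypothetical helper, `looks_like_case_name`, and shows both a correct hit and a false positive. The first sample comes from the tables in this notebook; the others are made up.

```python
def looks_like_case_name(text):
    """Heuristic: True when the text contains ' v ' or 'v.' (a sketch)."""
    if text is None:          # tags with nested content have no single string
        return False
    return 'v.' in text or ' v ' in text

samples = [
    'R (CPRE Kent) v Dover District Council',   # case name
    'Smith v. Jones',                           # hypothetical, "v." form
    'Town and Country Planning Act 1990',       # not a case name
    'Rev. Smith',                               # false positive: contains "v."
    None,
]

flags = [looks_like_case_name(s) for s in samples]
print(flags)
```

Because any text containing `v.` passes, the resulting list can include entries that are not case names and may need manual review.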

In [39]:
# Key cases
casos_principais

{0: ['http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para12',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para24',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para29',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para52',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para69',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para70',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para70',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para94',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para104',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para135',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para136',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para146',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para147',
  'http://www.bailii.org/ew/cases/EWHC/Admin/2020/1176.html#para151',
  'http://www.bailii.org/

In [40]:
# Create some lists for intermediate processing
case_index = []
casos_principais_lista = []
link = []
saver = None

In [41]:
# Loop over the key-cases dictionary and split its entries into the respective lists
for key, values in casos_principais.items():
    for value in values:
        case_index.append(key)
        if 'http' in value:
            # A link is paired with the entry that preceded it (normally the case name)
            link.append(value)
            casos_principais_lista.append(saver)
        else:
            casos_principais_lista.append(value)
            link.append('Link not available')
        saver = value
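
The pairing rule is easier to follow on a small example: every entry produces one row, and a link is paired with whatever entry came just before it. The sketch below runs the same logic on a hypothetical list; the URLs and the `NO_LINK` placeholder are made up for the illustration.

```python
NO_LINK = 'Link not available'   # placeholder used only in this sketch

# Hypothetical entries for one decision: names followed by their links
entries = [
    'LN Newham v Khatun',
    'http://www.bailii.org/example/1.html',
    'http://www.bailii.org/example/2.html',
]

names, urls = [], []
previous = None
for entry in entries:
    if 'http' in entry:
        urls.append(entry)
        names.append(previous)    # pair the link with the preceding entry
    else:
        names.append(entry)
        urls.append(NO_LINK)
    previous = entry

rows = list(zip(names, urls))
for row in rows:
    print(row)
```

Two effects are visible here. A named case appears twice, once without and once with its link, and a link that follows another link is paired with that earlier link rather than with a name. This matches the sample rows shown further down, where `key_cases_list` contains URLs.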

In [42]:
# Create the dictionary
dic = {'case_index':case_index, 'key_cases_list':casos_principais_lista, 'link':link}

In [43]:
# Convert to a dataframe
df_casos_principais = pd.DataFrame(dic)

In [44]:
# Sample of 5 key cases
df_casos_principais.head(5)

Unnamed: 0,case_index,key_cases_list,link
0,0,,http://www.bailii.org/ew/cases/EWHC/Admin/2020...
1,0,http://www.bailii.org/ew/cases/EWHC/Admin/2020...,http://www.bailii.org/ew/cases/EWHC/Admin/2020...
2,0,http://www.bailii.org/ew/cases/EWHC/Admin/2020...,http://www.bailii.org/ew/cases/EWHC/Admin/2020...
3,0,http://www.bailii.org/ew/cases/EWHC/Admin/2020...,http://www.bailii.org/ew/cases/EWHC/Admin/2020...
4,0,http://www.bailii.org/ew/cases/EWHC/Admin/2020...,http://www.bailii.org/ew/cases/EWHC/Admin/2020...


In [45]:
# Count how many times each case is cited, then drop duplicate rows
df_casos_principais['nr_citation'] = df_casos_principais['key_cases_list'].apply(df_casos_principais['key_cases_list'].tolist().count)
df_casos_principais.sort_values(by = ['link'], inplace = True, ascending = False)
df_casos_principais = df_casos_principais.drop_duplicates(subset = ['case_index','key_cases_list','link'], keep = 'first')
df_casos_principais.sort_values(by = ['case_index'], inplace = True, ascending = True)

In [46]:
# Format the count as a string
df_casos_principais['nr_citation'] = df_casos_principais['nr_citation'].apply(lambda x: '{'+str(x)+'}')

In [47]:
# Build the final `case_cited` column and drop the helper columns
df_casos_principais['case_cited'] = df_casos_principais['key_cases_list'] + ' // ' + df_casos_principais['nr_citation'] + ' ' + df_casos_principais['link']
df_casos_principais = df_casos_principais.drop(columns=['key_cases_list','link','nr_citation'])

In [48]:
# Build a dictionary of lists keyed by case index
case_cited_dict = {}
temp_list = []
counter = 0

In [49]:
# Loop to build the dictionary of cited cases
for i in range(len(df)):
    temp_list = []
    for _, row in df_casos_principais.iterrows():
        if row['case_index'] == i:
            temp_list.append(row['case_cited'])
    case_cited_dict[i] = temp_list

In [50]:
# Insert the dictionary into the dataframe
df['cited_cases'] = df.index.map(case_cited_dict)

In [51]:
# Display the data
df.head(10)

Unnamed: 0,case_name,url,date,week,cite_as,casenum,judges,court,authorities,cited_act,cited_cases
0,The Mayor of London v the Secretary of S...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-12,2020Wk2,"[Cite as: , [2020] EWHC 1176 (Admin), ]",Case No: CO/4849/2019 &amp; CO/4851/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[(3) THE LONDON BOROUGH OF HARROW],[sections 66(1) and 72(1) of the Planning (Lis...,[ // {36} http://www.bailii.org/ew/cases/EWCA/...
1,"Lochailort Investments Ltd, R (on the ap...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-11,2020Wk1,"[Cite as: , [2020] EWHC 1146 (Admin), ]",Case No: CO/3929/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[MENDIP DISTRICT COUNCIL, NORTON ST PHILIP PAR...","[section 61N(2) TCPA 1990 // {3}, section 38A(...",[LN Newham v Khatun // {2} http://www.bailii....
2,The Open Spaces Society v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-05-05,2020Wk2,"[Cite as: , [2020] EWHC 1085 (Admin), ]",Case No: CO/59/2020,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[],"[section 119 of the Highways Act 1980 // {1}, ...",[R v Secretary of State for the Environment ex...
3,"Wiltshire Council, R (on the application...",http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] WLR(D) 244,, [2020] EWHC 95...",Case No: CO/5006/2019,MRS JUSTICE LIEVEN,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[WILTSHIRE COUNCIL],[],[Miah v Secretary of State for Work and Pensio...
4,Hampshire County Council v Secretary of ...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-23,2020Wk4,"[Cite as: , [2020] EWHC 959 (Admin),, [2020] W...",Case No: CO/3493/2019,THE HON. MR JUSTICE HOLGATE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[Hampshire County Council],"[section 10 of the 1965 // {1}, Section 13 of ...",[Attorney General ex rel. Sutcliffe v Calderda...
5,T & P Real Estate Ltd v London Borough o...,http://www.bailii.org/ew/cases/EWHC/Ch/202...,2020-04-21,2020Wk2,"[Cite as: , [2020] EWHC 879 (Ch), ]",Case No: PT-2019-000766,DEPUTY MASTER BOWLES,IN THE HIGH COURT OF JUSTICE BUSINESS AND PR...,[Burgesses of the London Borough of Sutton],"[section of the 1990 // {1}, section 78 of the...",[Burdle v Secretary of State for Environment [...
6,"Corbett, R (On the Application Of) v [2...",http://www.bailii.org/ew/cases/EWCA/Civ/20...,2020-04-09,2020Wk4,"[Cite as: , [2020] EWCA Civ 508, ]",Case No: C1/2019/2179,Lord Justice Lewison Lord Justice Lindblom a...,IN THE COURT OF APPEAL (CIVIL DIVISION) ON AP...,[The Cornwall Council],[section 70(2) of the Town and Country Plannin...,[R. (on the application of Morge) v Hampshire ...
7,South Derbyshire District Council v Secr...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-04-09,2020Wk4,"[Cite as: , [2020] WLR(D) 229,, [2020] EWHC 87...",Case No: CO/4505/2019,THE HONOURABLE MRS JUSTICE ANDREWS DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[SOUTH DERBYSHIRE DISTRICT COUNCIL],[],[Hopkins Homes Ltd v Secretary of State for Co...
8,Qatar National Bank (QPSC) v Force Indi...,http://www.bailii.org/ew/cases/EWHC/Admlty...,2020-03-25,2020Wk3,"[Cite as: , [2020] EWHC 719 (Admlty), ]",Case No: AD 2018 000096,MR. JUSTICE TEARE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,[],[],[http://www.bailii.org/ew/cases/EWCA/Civ/2019/...
9,Starbones Ltd v Secretary of State for H...,http://www.bailii.org/ew/cases/EWHC/Admin/...,2020-03-10,2020Wk2,"[Cite as: , [2020] EWHC 526 (Admin), ]",Case No: CO/3356/2019,MRS JUSTICE LANG DBE,IN THE HIGH COURT OF JUSTICE QUEEN'S BENCH DI...,"[(1) SECRETARY OF STATE FOR HOUSING, COMMUNITI...","[section 70(2) TCPA 1990 // {1}, section 66(1)...",[R (CPRE Kent) v Dover District Council // {4}...


In [52]:
# Save the dataframe to disk
df.to_csv('dados/casos_mais_citados.csv')

Work complete. Try applying the same process to extract data from other sources.

# End