# Final Project

## Basic Information


| **Title:**       | Deep Learning and Natural Language Processing applied to legal texts |
|------------------|----------------------------------------------------------|
| **Abstract:**    |                                                        |
| **Author:**      | Thiago Raulino Dal Pont                                |
| **Affiliation:** | Graduate Program in Automation and Systems Engineering  |
| **Date:**        | July 14, 2022                                          |


## Goals of the project

- ...

## Project structure
- Preprocessing
- Representation
- Modeling
- Evaluation


## Requirements


``pip install -r requirements.txt``

``python3 -m spacy download pt_core_news_sm``


## Importing dependencies

In [1]:
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from src.modeling.util import get_class_weight
from src.preprocessing.preprocessing_shallow_ml import PreProcessingShallowML

## Dataset basic information

In [2]:
DATASET_2CLASS_PATH = os.path.join("Data", "final_dataset_2l_wo_result", "")

preprocessor = PreProcessingShallowML()
preprocessor.load_dataset(DATASET_2CLASS_PATH)


Loading dataset
{'labels': {'ganha': None, 'perde': None}}
  -> Found 1044 files inside Data/final_dataset_2l_wo_result/ganha/*.txt
  -> Found 116 files inside Data/final_dataset_2l_wo_result/perde/*.txt


## Data preparation

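- In this project, we implemented a class (`PreProcessingShallowML`) that handles text preprocessing. Each step is switched on or off through a keyword argument, so different versions of the corpus can be prepared with the same code.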
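To illustrate the idea of switchable preprocessing steps, here is a minimal sketch of a single function with one flag per step. It is not the project's `PreProcessingShallowML` class: the function name `preprocess`, the tiny stopword set, the regex-based HTML removal and the whitespace tokenizer are all assumptions made for illustration.

```python
import html
import re
import string

# Tiny illustrative stopword set (assumption); a real pipeline would load a
# complete Portuguese stopword list.
STOPWORDS = {"a", "o", "de", "do", "da", "em", "para", "que", "por"}


def preprocess(text, lowercase=True, remove_html=True,
               remove_punct=True, remove_stopwords=True):
    """Apply the selected preprocessing steps and return a single string."""
    if lowercase:
        text = text.lower()
    if remove_html:
        # Replace tags with a space, then decode entities such as &amp;
        text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    tokens = text.split()  # naive whitespace tokenization
    if remove_punct:
        # Strip punctuation at token boundaries only, so "9.099/95" survives
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


sample = "<p>Trato de ação indenizatória, ajuizada por Antonio.</p>"
print(preprocess(sample))
# trato ação indenizatória ajuizada antonio
print(preprocess(sample, remove_stopwords=False))
# trato de ação indenizatória ajuizada por antonio
```

Keeping each step behind its own flag makes it cheap to build one corpus without stopwords and another that keeps them, which is how the two calls to `preprocess_corpus` in this notebook differ.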

In [3]:
preprocessor.preprocess_corpus(
    keep_raw=True,
    lowercase=True,
    stemming=False,
    remove_html=True,
    remove_punct=True,
    remove_stopwords=True
)

dataset_shallow_ml = preprocessor.df_corpora

Preprocessing corpus
  -> Converting to lowercase
  -> Removing HTML
  -> Tokenizing
  -> Removing punctuation
  -> Removing Stopwords
  -> Joining tokens into string
  -> A sample of the preprocessed data:
["autos n° 0301090-84.2019.8.24.0090 ação procedimento juizado especial cível/proc autor antonio schincariol vicente réu oceanair linhas aéreas s/a avianca vistos etc i. relatório relatório dispensado forma artigo 38 caput lei 9.099/95 ii fundamentação trato ação indenizatória ajuizada antonio schincariol vicente face oceanair linhas aéreas s/a avianca julgo antecipadamente feito porquanto solução lide pode obtida análise direito disciplina matéria bem provas carreadas autos serem suficientes formação convencimento valho-me pois art 355 i código processo civil além disso impende ressaltar presente demanda consubstancia relação consumo vez partes envolvidas avença enquadram conceitos fornecedor consumidor dispostos arts 3º 2º lei 8.078/1990 respectivamente sendo portanto imperioso ap

In [29]:
preprocessor.preprocess_corpus(
    keep_raw=True,
    lowercase=True,
    stemming=False,
    remove_html=True,
    remove_punct=True,
    remove_stopwords=False
)

dataset_dl = preprocessor.df_corpora

Preprocessing corpus
  -> Converting to lowercase
  -> Removing HTML
  -> Tokenizing
  -> Removing punctuation
  -> Joining tokens into string
  -> A sample of the preprocessed data:
['autos n° 0305612-96.2015.8.24.0090 ação procedimento do juizado especial cível/proc autor francisca maria barbosa cavalcanti réu tam linhas aéreas s/a vistos para sentença i – relatório relatório dispensado nos termos do art 38 da lei 9.099/1995 ii fundamentação trato de ação condenatória ajuizada por francisca maria barbosa cavalcanti contra tam linhas aéreas s/a julgo antecipadamente o feito porquanto a solução dos autos pode ser obtida através da análise do direito que disciplina a matéria bem como pelo fato de que as provas carreadas são suficientes para a formação de meu convencimento valho-me pois do art 355 i do código de processo civil além disso impende ressaltar que a presente demanda se consubstancia em relação de consumo uma vez que as partes envolvidas na avença se enquadram nos conceitos de

- Dataset splitting

In [4]:
X = dataset_dl["processed_content"]
y = dataset_dl["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y, shuffle=True)
print("Dataset shapes:")
print(" -> Train: X=%s\ty=%s" % (str(X_train.shape), str(y_train.shape)))
print(" -> Test:  X=%s\ty=%s" % (str(X_test.shape), str(y_test.shape)))

NameError: name 'dataset_dl' is not defined

In [31]:

# Class weights
class_weights = get_class_weight(y_train)
class_weights

Calculating class weights
Labels: ['ganha' 'perde']
Class weights: [0.55568862 4.98924731]


{0: 0.555688622754491, 1: 4.989247311827957}
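The weights above are consistent with scikit-learn's "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below reproduces them from the class counts reported when the dataset was loaded (1044 `ganha`, 116 `perde`) and the 80/20 stratified split. That `get_class_weight` wraps this heuristic is an assumption; its source is not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic label vector with the same class counts as the loaded dataset
labels = np.array(["ganha"] * 1044 + ["perde"] * 116)

# Same split settings as the notebook cell (labels only, no features needed)
y_train, y_test = train_test_split(
    labels, test_size=0.2, random_state=123, stratify=labels
)

classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
# {'ganha': 835, 'perde': 93}

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
manual = len(y_train) / (len(classes) * counts)

# Dictionary keyed by class index, the format shown in the output above
class_weights = {i: float(w) for i, w in enumerate(weights)}
print(class_weights)
```

With roughly nine `ganha` documents for every `perde` document, the minority class receives a weight about nine times larger, so each class contributes equally to a weighted loss.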

## Modeling with Deep Learning

In this section, we apply deep learning models to the dataset that keeps stopwords (`dataset_dl`).

## Modeling with Shallow ML

In [5]:
from sklearn import preprocessing

X = dataset_shallow_ml["processed_content"]
y = dataset_shallow_ml["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123, stratify=y, shuffle=True)
print("Dataset shapes:")
print(" -> Train: X=%s\ty=%s" % (str(X_train.shape), str(y_train.shape)))
print(" -> Test:  X=%s\ty=%s" % (str(X_test.shape), str(y_test.shape)))

le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

Dataset shapes:
 -> Train: X=(1044,)	y=(1044,)
 -> Test:  X=(116,)	y=(116,)
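As a brief illustration of the label encoding step, the sketch below shows how `LabelEncoder` maps the two class names to integers. The toy label list is made up; only the class names come from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["perde", "ganha", "ganha"])  # toy labels, order does not matter

# Classes are stored in sorted order, so "ganha" -> 0 and "perde" -> 1
print(le.classes_.tolist())
# ['ganha', 'perde']
print(le.transform(["ganha", "perde"]).tolist())
# [0, 1]
print(le.inverse_transform([1]).tolist())
# ['perde']
```

Because `perde` is encoded as 1, metrics that default to the positive label 1 (such as `f1_score`) are reported for the minority class.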


In [6]:
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.05, max_features=5000)

X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
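To make the effect of `max_df` and `min_df` concrete, here is a small sketch on a made-up four-document corpus. The documents and the thresholds (`min_df=0.5` instead of `0.05`) are assumptions chosen so that the filtering is visible at this scale.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): not taken from the project's dataset
train_docs = [
    "processo voo cancelado dano moral",
    "processo voo atrasado dano material",
    "processo bagagem extraviada dano moral",
    "processo bagagem danificada",
]

# Drop terms present in more than 90% of documents (too common)
# and terms present in fewer than 50% of documents (too rare)
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.5)
X_train_toy = vectorizer.fit_transform(train_docs)

# "processo" appears in every document and is removed; terms that appear
# in a single document are removed as well
print(sorted(vectorizer.vocabulary_))
# ['bagagem', 'dano', 'moral', 'voo']

# The vocabulary is learned from the training documents only; unseen
# terms in new documents are ignored at transform time
X_test_toy = vectorizer.transform(["processo voo bagagem inexistente"])
print(X_train_toy.shape, X_test_toy.shape)
# (4, 4) (1, 4)
```

Fitting the vectorizer on the training split and only calling `transform` on the test split, as the cell above does, keeps test-set vocabulary and document frequencies out of the features.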

In [7]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import numpy as np
import tqdm

accs = []
f1s = []
models = []
# Repeat the train/validation split with 10 different seeds to estimate the variance of the scores
for i in tqdm.tqdm(range(10)):
    X_train_i, X_val_i, y_train_i, y_val_i = train_test_split(X_train_bow, y_train, test_size=0.1, random_state=i,
                                                               stratify=y_train, shuffle=True)

    # Class weights: only used by models that accept class_weight, such as the
    # commented-out RandomForestClassifier; MLPClassifier has no such parameter
    class_weights = get_class_weight(y_train_i)

    #model = RandomForestClassifier(class_weight=class_weights, max_depth=10, n_estimators=200, n_jobs=3)
    model = MLPClassifier(batch_size=32)
    model.fit(X_train_i, y_train_i)

    y_pred = model.predict(X_val_i)

    accs.append(metrics.accuracy_score(y_val_i, y_pred))
    # f1_score defaults to pos_label=1, which is the minority class "perde"
    f1s.append(metrics.f1_score(y_val_i, y_pred))
    models.append(model)

# Report mean(std) over the 10 runs
print("Acc: %.1f(%.1f)%%" % (100 * np.mean(accs), 100 * np.std(accs)))
print("F1:  %.2f(%.2f)" % (np.mean(f1s), np.std(f1s)))

# Keep the model with the highest validation F1 for the final test evaluation
best_model = models[np.argmax(f1s)]


100%|██████████| 10/10 [02:36<00:00, 15.69s/it]

Acc: 95.6(2.3)%
F1:  0.73(0.15)





In [8]:
from sklearn.metrics import classification_report

y_pred = best_model.predict(X_test_bow)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       104
           1       1.00      0.83      0.91        12

    accuracy                           0.98       116
   macro avg       0.99      0.92      0.95       116
weighted avg       0.98      0.98      0.98       116



In [9]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

[[104   0]
 [  2  10]]
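The per-class figures in the classification report can be recomputed directly from this confusion matrix, where rows are true labels and columns are predicted labels. The sketch below only uses the four numbers printed above.

```python
import numpy as np

# Confusion matrix from the test set: rows = true class, columns = predicted class
cm = np.array([[104, 0],
               [2, 10]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by predicted counts
recall = true_positives / cm.sum(axis=1)     # divide by true counts
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for label in range(2):
    print("class %d: precision=%.2f recall=%.2f f1=%.2f"
          % (label, precision[label], recall[label], f1[label]))
# class 0: precision=0.98 recall=1.00 f1=0.99
# class 1: precision=1.00 recall=0.83 f1=0.91
print("accuracy=%.2f" % accuracy)
# accuracy=0.98
```

The two errors are both `perde` documents predicted as `ganha`, which is why the minority class has perfect precision but a recall of 10/12.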
