# Final Project

## Basic Information


| **Title:**       | Deep Learning and Natural Language Processing applied to legal texts |
|------------------|----------------------------------------------------------|
| **Abstract:**    |                                                        |
| **Author:**      | Thiago Raulino Dal Pont                                |
| **Affiliation:** | Graduate Program in Automation and Systems Engineering  |
| **Date:**        | July 14, 2022                                          |


## Goals of the project

- ...

## Project structure
- Preprocessing
- Representation
- Modeling
- Evaluation


## Requirements


``pip install -r requirements.txt``

``python3 -m spacy download pt_core_news_sm``


## Importing dependencies

In [1]:
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from src.modeling.util import get_class_weight
from src.preprocessing.preprocessing_shallow_ml import PreProcessingShallowML

## Dataset basic information

In [2]:
DATASET_2CLASS_PATH = os.path.join("Data", "final_dataset_2l_wo_result", "")

preprocessor = PreProcessingShallowML()
preprocessor.load_dataset(DATASET_2CLASS_PATH)


Loading dataset
{'labels': {'ganha': None, 'perde': None}}
  -> Found 1044 files inside Data/final_dataset_2l_wo_result/ganha/*.txt
  -> Found 116 files inside Data/final_dataset_2l_wo_result/perde/*.txt


## Data preparation

- We implemented a preprocessing class whose steps (lowercasing, stemming, HTML removal, punctuation removal, stopword removal) can be switched on or off independently, so different configurations are easy to compare.
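To illustrate the idea of switchable preprocessing steps, here is a minimal sketch using only the standard library. It is not the project's `PreProcessingShallowML` class: the function name `preprocess`, the tiny `STOPWORDS` set, and the whitespace tokenization are assumptions made for illustration. The step order follows the log printed by the cells in this section.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = frozenset({"a", "o", "de", "da", "do", "e", "em"})


def preprocess(text, lowercase=True, remove_html=True,
               remove_punct=True, remove_stopwords=False):
    """Apply the enabled steps in a fixed order and return one string."""
    if lowercase:
        text = text.lower()
    if remove_html:
        # Replace anything that looks like an HTML tag with a space.
        text = re.sub(r"<[^>]+>", " ", text)
    tokens = text.split()  # naive whitespace tokenization
    if remove_punct:
        # Trim punctuation from token edges and drop tokens that become empty.
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


raw = "<p>O autor ajuizou a AÇÃO , art. 38 .</p>"
with_stopwords = preprocess(raw)
without_stopwords = preprocess(raw, remove_stopwords=True)
```

Keeping each step behind a flag makes it possible to produce one corpus without stopwords for the bag-of-words models and another with stopwords for the sequence model, as the two cells in this section do.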

In [3]:
preprocessor.preprocess_corpus(
    keep_raw=True,
    lowercase=True,
    stemming=False,
    remove_html=True,
    remove_punct=True,
    remove_stopwords=True
)

# Copy so the second preprocess_corpus call below cannot overwrite this DataFrame
dataset_shallow_ml = preprocessor.df_corpora.copy()

Preprocessing corpus
  -> Converting to lowercase
  -> Removing HTML
  -> Tokenizing
  -> Removing punctuation
  -> Removing Stopwords
  -> Joining tokens into string
  -> A sample of the preprocessed data:
["autos n° 0301090-84.2019.8.24.0090 ação procedimento juizado especial cível/proc autor antonio schincariol vicente réu oceanair linhas aéreas s/a avianca vistos etc i. relatório relatório dispensado forma artigo 38 caput lei 9.099/95 ii fundamentação trato ação indenizatória ajuizada antonio schincariol vicente face oceanair linhas aéreas s/a avianca julgo antecipadamente feito porquanto solução lide pode obtida análise direito disciplina matéria bem provas carreadas autos serem suficientes formação convencimento valho-me pois art 355 i código processo civil além disso impende ressaltar presente demanda consubstancia relação consumo vez partes envolvidas avença enquadram conceitos fornecedor consumidor dispostos arts 3º 2º lei 8.078/1990 respectivamente sendo portanto imperioso ap

In [3]:
preprocessor.preprocess_corpus(
    keep_raw=True,
    lowercase=True,
    stemming=False,
    remove_html=True,
    remove_punct=True,
    remove_stopwords=False
)

dataset_dl = preprocessor.df_corpora

Preprocessing corpus
  -> Converting to lowercase
  -> Removing HTML
  -> Tokenizing
  -> Removing punctuation
  -> Joining tokens into string
  -> A sample of the preprocessed data:
['autos n. 0702116-33.2011.8.24.0090 ação procedimento do juizado especial cível requerente s talles josé de oliveira requerido a s tam linhas aereas s/a vistos para sentença talles josé de oliveira ajuizou ação de cunho condenatório em face de tam – linhas aéreas s/a na qual busca a reparação dos danos materiais e morais decorrentes de perda de um voo e realocação em outro com mais de 07 sete horas de diferença eis o sucinto relatório apesar de dispensado na forma do artigo 38 caput da lei 9.099/95 decido resolvo antecipadamente a lide nos termos do artigo 330 i do cpc uma vez que o juiz tem o poder-dever de julgar a lide antecipadamente desprezando a realização de audiência para a produção de provas ao constatar que o acervo documental é suficiente para nortear e instruir seu entendimento é do seu livre 

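- Dataset splitting: the data is split 80/20 into training and test sets, stratified by label so that both keep the original class ratio.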

In [4]:
X = dataset_dl["processed_content"]
y = dataset_dl["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y, shuffle=True)
print("Dataset shapes:")
print(" -> Train: X=%s\ty=%s" % (str(X_train.shape), str(y_train.shape)))
print(" -> Test:  X=%s\ty=%s" % (str(X_test.shape), str(y_test.shape)))

Dataset shapes:
 -> Train: X=(928,)	y=(928,)
 -> Test:  X=(232,)	y=(232,)


In [5]:

# Class weights
class_weights = get_class_weight(y_train)
class_weights

{0: 0.555688622754491, 1: 4.989247311827957}
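The weights printed above are consistent with the common "balanced" heuristic, `n_samples / (n_classes * count)`. The sketch below reproduces them under two assumptions: that `get_class_weight` uses this heuristic (its source is not shown here), and that the stratified training split holds 835 documents of class 0 and 93 of class 1. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weight[c] = n_samples / (n_classes * count[c]) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Assumed training counts: roughly 80% of the 1044 'ganha' and 116 'perde' files.
y_assumed = [0] * 835 + [1] * 93
weights = balanced_class_weights(y_assumed)
```

Under these assumptions the minority class receives a weight about nine times larger than the majority class, mirroring the roughly 9:1 imbalance in the dataset.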

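## Modeling with Deep Learning

In this section, we train a one-dimensional convolutional neural network (CNN) on the dataset that was preprocessed with stopwords kept (dataset_dl).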

In [7]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
# Import everything from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.layers import Conv1D, Dense, Embedding, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

# fix random seeds for reproducibility
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

In [8]:
t = Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
t.fit_on_texts(X_train)
t.word_index['<PAD>'] = 0

In [9]:
train_sequences = t.texts_to_sequences(X_train)
test_sequences = t.texts_to_sequences(X_test)

In [10]:
print("Vocabulary size={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))

Vocabulary size=17424
Number of Documents=928
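As a rough illustration of what the tokenizer cells do, the sketch below builds a frequency-ranked word index with index 1 reserved for the out-of-vocabulary token and leaves index 0 free for padding. It is a simplified stand-in, not the Keras implementation: `build_word_index` and `texts_to_sequences` are hypothetical helpers, and tokenization is plain whitespace splitting, whereas the Keras `Tokenizer` also filters punctuation by default.

```python
from collections import Counter


def build_word_index(texts, oov_token="<UNK>"):
    """Map each word to an integer; 1 is the OOV token, 0 is left for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Most frequent words first; ties keep first-seen order (stable sort).
    ranked = sorted(counts, key=counts.get, reverse=True)
    vocab = [oov_token] + ranked
    return {word: i for i, word in enumerate(vocab, start=1)}


def texts_to_sequences(texts, word_index, oov_token="<UNK>"):
    """Encode texts as lists of integers; unseen words map to the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in text.lower().split()]
            for text in texts]


word_index = build_word_index(["ação de cobrança", "ação de danos morais"])
encoded = texts_to_sequences(["ação de despejo"], word_index)
```

Because the index is fitted on the training documents only, any word that appears only in the test set is encoded as the OOV token. Note also that `len(t.word_index)` in the cell above counts the `<UNK>` entry and the manually added `<PAD>` entry.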


In [11]:
MAX_SEQUENCE_LENGTH = 1000

In [12]:
# pad (or truncate) every document to MAX_SEQUENCE_LENGTH tokens
X_train = sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_train.shape, X_test.shape

((928, 1000), (232, 1000))
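With its default arguments, `pad_sequences` pads and truncates at the start of each sequence: short documents get leading zeros, and long documents keep only their last `maxlen` tokens. The pure-Python sketch below mimics that default behaviour for a single sequence; `pad_pre` is a hypothetical helper and assumes `maxlen` is greater than zero.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]          # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([5, 6, 7], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
```

For court decisions longer than 1000 tokens this means the beginning of the text is discarded and the end is kept, which is worth bearing in mind when choosing `MAX_SEQUENCE_LENGTH`.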

In [13]:
le = LabelEncoder()
num_classes = 2  # LabelEncoder sorts classes alphabetically: 'ganha' -> 0, 'perde' -> 1

In [14]:
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [15]:
VOCAB_SIZE = len(t.word_index)

In [16]:
EMBED_SIZE = 300
EPOCHS=2
BATCH_SIZE=32

In [None]:
# create the model
model = Sequential()
model.add(Embedding(VOCAB_SIZE, EMBED_SIZE, input_length=MAX_SEQUENCE_LENGTH))
model.add(Conv1D(filters=128, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=64, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

2022-07-15 22:05:18.168140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-07-15 22:05:18.206901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-15 22:05:18.207355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce 940MX computeCapability: 5.0
coreClock: 1.189GHz coreCount: 3 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 37.33GiB/s
2022-07-15 22:05:18.207678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2022-07-15 22:05:18.209962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-07-15 22:05:18.211899: I tensorflow/stream_executor/platform
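Since the output of `model.summary()` is not shown, the following sketch works out the tensor lengths and parameter counts implied by the layer stack defined in the cell above. It is plain arithmetic under the stated hyperparameters (vocabulary 17424, embedding size 300, sequence length 1000, `padding='same'`, pool size 2), not output captured from Keras; `conv1d_params` is a hypothetical helper.

```python
VOCAB_SIZE, EMBED_SIZE, SEQ_LEN = 17424, 300, 1000
KERNEL_SIZE = 4


def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


length, channels = SEQ_LEN, EMBED_SIZE
total = VOCAB_SIZE * EMBED_SIZE               # embedding matrix

for filters in (128, 64, 32):
    total += conv1d_params(KERNEL_SIZE, channels, filters)
    channels = filters                        # 'same' padding keeps the length
    length //= 2                              # MaxPooling1D(pool_size=2) halves it

flat = length * channels                      # size after Flatten
total += flat * 256 + 256                     # Dense(256)
total += 256 + 1                              # Dense(1)
```

The embedding matrix accounts for most of the parameters, which is relevant when training on only 928 documents and on a GPU with limited memory.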

In [40]:
# Fit the model
model.fit(X_train, y_train,
          validation_split=0.1,
          epochs=EPOCHS,
          batch_size=BATCH_SIZE,
          class_weight=class_weights,
          verbose=1)

Epoch 1/2


2022-07-15 22:02:58.859432: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2022-07-15 22:02:58.866042: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR


UnknownError: 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential_1/conv1d_3/conv1d (defined at tmp/ipykernel_53494/310409201.py:7) ]]
	 [[gradient_tape/sequential_1/embedding_1/embedding_lookup/Reshape/_40]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node sequential_1/conv1d_3/conv1d (defined at tmp/ipykernel_53494/310409201.py:7) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_2633]

Function call stack:
train_function -> train_function


In [None]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
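# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
predictions = (model.predict(X_test) > 0.5).astype("int32").ravel()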
predictions[:10]

In [None]:
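# Map integer predictions back to the original labels ('ganha' / 'perde')
predicted_labels = le.inverse_transform(predictions)
predicted_labels[:10]

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Compare like with like: decode y_test as well, since it is integer-encoded
labels = list(le.classes_)
true_labels = le.inverse_transform(y_test)
print(classification_report(true_labels, predicted_labels))
pd.DataFrame(confusion_matrix(true_labels, predicted_labels, labels=labels), index=labels, columns=labels)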

## Modeling with Shallow ML

In [5]:
from sklearn import preprocessing

X = dataset_shallow_ml["processed_content"]
y = dataset_shallow_ml["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123, stratify=y, shuffle=True)
print("Dataset shapes:")
print(" -> Train: X=%s\ty=%s" % (str(X_train.shape), str(y_train.shape)))
print(" -> Test:  X=%s\ty=%s" % (str(X_test.shape), str(y_test.shape)))

le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)

Dataset shapes:
 -> Train: X=(1044,)	y=(1044,)
 -> Test:  X=(116,)	y=(116,)


In [6]:
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.05, max_features=5000)

X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
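The `min_df` and `max_df` arguments prune the vocabulary by document frequency: with float values, a term is kept only if the share of documents containing it lies between the two thresholds. The sketch below shows that filter on a toy corpus. `filter_by_document_frequency` is a hypothetical helper using whitespace tokenization, so it is not equivalent to `TfidfVectorizer`, which also applies its own token pattern and the `max_features` cap.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.05, max_df=0.9):
    """Keep terms whose document frequency is within [min_df, max_df] (as shares)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))      # count each term once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, k in df.items() if low <= k <= high)


docs = ["a b c", "a b", "a d", "a b e"]
kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
```

In the toy corpus, the term that occurs in every document is dropped as uninformative and the terms that occur in a single document are dropped as too rare, which is the same trade-off `max_df=0.9` and `min_df=0.05` make on the legal corpus.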

In [7]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn import metrics
import tqdm
from sklearn.linear_model import LogisticRegression

accs = []
f1s = []
models = []
for i in tqdm.tqdm(range(10)):
    X_train_i, X_val_i, y_train_i, y_val_i = train_test_split(X_train_bow, y_train, test_size=0.1, random_state=i,
                                                               stratify=y_train, shuffle=True)

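    # Class weights (only used by the commented-out RandomForest alternative;
    # MLPClassifier does not accept a class_weight argument)
    class_weights = get_class_weight(y_train_i)

    #model = RandomForestClassifier(class_weight=class_weights, max_depth=10, n_estimators=200, n_jobs=3)
    model = MLPClassifier(batch_size=32)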
    model.fit(X_train_i, y_train_i)

    y_pred = model.predict(X_val_i)

    accs.append(metrics.accuracy_score(y_val_i, y_pred))
    f1s.append(metrics.f1_score(y_val_i, y_pred))
    models.append(model)

print("Acc: %.1f(%.1f)%%" % (100 * np.mean(accs), 100 * np.std(accs)))
print("F1:  %.2f(%.2f)" % (np.mean(f1s), np.std(f1s)))

best_model = models[np.argmax(f1s)]


100%|██████████| 10/10 [02:36<00:00, 15.69s/it]

Acc: 95.6(2.3)%
F1:  0.73(0.15)





In [8]:
from sklearn.metrics import classification_report

y_pred = best_model.predict(X_test_bow)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       104
           1       1.00      0.83      0.91        12

    accuracy                           0.98       116
   macro avg       0.99      0.92      0.95       116
weighted avg       0.98      0.98      0.98       116



In [9]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

[[104   0]
 [  2  10]]
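The figures in the classification report can be recomputed directly from this confusion matrix, where rows are true classes and columns are predicted classes. The sketch below does so in plain Python; `per_class_metrics` is a hypothetical helper, and class 1 corresponds to 'perde' because `LabelEncoder` orders the labels alphabetically.

```python
cm = [[104, 0],
      [2, 10]]


def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(row[cls] for row in cm) - tp   # predicted as cls, but another class
    fn = sum(cm[cls]) - tp                  # true cls, predicted as another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
metrics_by_class = {c: per_class_metrics(cm, c) for c in range(len(cm))}
```

Both errors are minority-class documents predicted as the majority class, so accuracy stays at 0.98 while recall for class 1 drops to 0.83. This is why F1 on the minority class is the more informative metric for this dataset.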
