# Final Project

## Basic Information


| **Title:**       | Deep Learning and Natural Language Processing applied to legal texts |
|------------------|----------------------------------------------------------------------|
| **Abstract:**    |                                                                      |
| **Author:**      | Thiago Raulino Dal Pont                                              |
| **Affiliation:** | Graduate Program in Automation and Systems Engineering               |
| **Date:**        | July 14, 2022                                                        |


## Goals of the project

- ...

## Project structure
- Preprocessing
- Representation
- Modeling
- Evaluation


## Requirements


``pip install -r requirements.txt``

``python3 -m spacy download pt_core_news_sm``


## Importing dependencies

In [7]:
import os

from sklearn.model_selection import train_test_split

from src.modeling.util import get_class_weight
from src.preprocessing.preprocessing_shallow_ml import PreProcessingShallowML

## Dataset basic information

In [8]:
DATASET_2CLASS_PATH = os.path.join("Data", "final_dataset_2l_wo_result", "")

preprocessor = PreProcessingShallowML()
preprocessor.load_dataset(DATASET_2CLASS_PATH)


Loading dataset
{'labels': {'ganha': None, 'perde': None}}
  -> Found 1044 files inside Data/final_dataset_2l_wo_result/ganha/*.txt
  -> Found 116 files inside Data/final_dataset_2l_wo_result/perde/*.txt
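
The log above suggests a layout with one sub-directory per label, each holding `.txt` files. The sketch below shows one way such a layout could be read. It is not the code behind `load_dataset`: the function name `load_labeled_texts` and the UTF-8 encoding are assumptions made for illustration.

```python
import glob
import os


def load_labeled_texts(dataset_path):
    """Read every *.txt file under dataset_path/<label>/ (hypothetical helper).

    Returns two parallel lists: the file contents and the label of each file,
    where the label is the name of the sub-directory the file was found in.
    """
    texts, labels = [], []
    for label in sorted(os.listdir(dataset_path)):
        label_dir = os.path.join(dataset_path, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files at the top level
        for file_path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
            with open(file_path, encoding="utf-8") as f:  # encoding is an assumption
                texts.append(f.read())
            labels.append(label)
    return texts, labels
```

Calling `load_labeled_texts(DATASET_2CLASS_PATH)` on a directory organized this way would return one text and one label (`ganha` or `perde`) per file.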


## Data preparation

- In this project, we implemented a class, `PreProcessingShallowML`, that handles text preprocessing and exposes each step as a flag of `preprocess_corpus`, so that distinct methods can be switched on or off easily.
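
The flag-driven design can be sketched in a few lines. This is a minimal illustration, not the implementation of `PreProcessingShallowML`: the function `preprocess_text`, its regular expressions, and the sample sentence are assumptions made for the example, and the real class may rely on spaCy for tokenization. Only the flag names mirror those passed to `preprocess_corpus` in this notebook.

```python
import re
import string


def preprocess_text(text, lowercase=True, remove_html=True, remove_punct=True):
    """Apply the selected preprocessing steps to one document (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    if remove_html:
        # Replace anything that looks like an HTML tag with a space
        text = re.sub(r"<[^>]+>", " ", text)
    # Simple regex tokenizer (an assumption): words, optionally joined by . / -,
    # or single non-space symbols
    tokens = re.findall(r"\w+(?:[./-]\w+)*|[^\w\s]", text)
    if remove_punct:
        # Drop tokens that consist of a single ASCII punctuation character
        tokens = [t for t in tokens if t not in string.punctuation]
    return " ".join(tokens)


sample = "<p>Trato de AÇÃO de indenização, por danos morais.</p>"
print(preprocess_text(sample))
```

Each step is independent of the others, so turning a flag off (for example `remove_punct=False`) leaves the remaining steps unchanged.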

In [9]:
# preprocessor.preprocess_corpus(
#     keep_raw=True,
#     lowercase=True,
#     stemming=False,
#     remove_html=True,
#     remove_punct=True,
#     remove_stopwords=True
# )
#
# dataset_shallow_ml = preprocessor.df_corpora

In [10]:
preprocessor.preprocess_corpus(
    keep_raw=True,
    lowercase=True,
    stemming=False,
    remove_html=True,
    remove_punct=True,
    remove_stopwords=False
)

dataset_dl = preprocessor.df_corpora

Preprocessing corpus
  -> Converting to lowercase
  -> Removing HTML
  -> Tokenizing
  -> Removing punctuation
  -> Joining tokens into string
  -> A sample of the preprocessed data:
["autos n° 0300982-26.2017.8.24.0090 ação procedimento do juizado especial cível/proc autor marco aurelio prass goetten réu gol transportes aéreos s/a vistos etc i. relatório relatório dispensado na forma do artigo 38 caput da lei 9.099/95 ii fundamentação trato de ação de indenização por danos morais ajuizada por marco aurelio prass goetten em face de gol transportes aéreos s/a julgo antecipadamente o feito porquanto a solução dos autos pode ser obtida da análise do direito que disciplina a matéria bem como pelo fato de que as provas carreadas são suficientes para a formação de meu convencimento valho-me pois do art 355 i do código de processo civil além disso impende ressaltar que a presente demanda se consubstancia em relação de consumo uma vez que as partes envolvidas na avença se enquadram nos conceit

- Dataset splitting

In [11]:
X = dataset_dl["processed_content"]
y = dataset_dl["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y, shuffle=True)
print("Dataset shapes:")
print(" -> Train: X=%s\ty=%s" % (str(X_train.shape), str(y_train.shape)))
print(" -> Test:  X=%s\ty=%s" % (str(X_test.shape), str(y_test.shape)))

Dataset shapes:
 -> Train: X=(928,)	y=(928,)
 -> Test:  X=(232,)	y=(232,)
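
Since only about 10% of the documents are labeled `perde` (116 of 1160), the `stratify=y` argument matters: it keeps that proportion roughly equal in both partitions. The sketch below checks this on synthetic labels with the same class counts as the dataset log; the label list is a stand-in for the real `y`, not the project's data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported when loading the dataset
labels = ["ganha"] * 1044 + ["perde"] * 116

train_labels, test_labels = train_test_split(
    labels, test_size=0.2, random_state=123, stratify=labels, shuffle=True
)

# Report the size of each partition and its share of the minority class
for name, part in (("train", train_labels), ("test", test_labels)):
    share = Counter(part)["perde"] / len(part)
    print(f"{name}: n={len(part)}, share of 'perde'={share:.3f}")
```

Without stratification, a random 20% test split of such an imbalanced dataset could end up with noticeably more or fewer `perde` documents than expected.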


In [12]:

# Class weights
class_weights = get_class_weight(y_train)
class_weights

Calculating class weights
Labels: ['ganha' 'perde']
Class weights: [0.55568862 4.98924731]


{0: 0.555688622754491, 1: 4.989247311827957}
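
The weights above are consistent with the common "balanced" heuristic, `weight = n_samples / (n_classes * class_count)`, which gives the minority class a proportionally larger weight. The sketch below reproduces the numbers in plain Python. It is an illustration rather than the source of `get_class_weight`; the helper name `balanced_class_weights` is hypothetical, and the per-class counts (835 and 93) are inferred from the stratified split, not printed by the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_index: weight}, with classes indexed in sorted order.

    weight = n_samples / (n_classes * number of samples in that class)
    """
    counts = Counter(labels)
    classes = sorted(counts)
    n_samples = len(labels)
    return {
        index: n_samples / (len(classes) * counts[label])
        for index, label in enumerate(classes)
    }


# Assumed training-set composition: 835 'ganha' and 93 'perde' documents
y_example = ["ganha"] * 835 + ["perde"] * 93
weights = balanced_class_weights(y_example)
print(weights)
```

A dictionary of this shape, keyed by class index, is the format Keras accepts for the `class_weight` argument of `Model.fit`.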

## Modelling with Deep Learning

In this section, we apply deep learning models to the preprocessed dataset.

In [16]:
## Some config values
embed_size = 300      # dimensionality of each word vector
max_features = 95000  # vocabulary size (number of rows in the embedding matrix)
maxlen = 50           # maximum number of tokens kept per document

import os
import time
import math
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score

import tensorflow as tf
# `tf` is only a local alias and cannot be used as a module path in a
# `from ... import` statement, so Keras is imported through `tensorflow.keras`.
# Assumes TensorFlow 2.x with its bundled Keras.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# In TensorFlow 2.x, LSTM and GRU replace CuDNNLSTM and CuDNNGRU; with default
# arguments they use the cuDNN kernels when running on a GPU.
from tensorflow.keras.layers import Dense, Input, LSTM, GRU, Embedding, Dropout, Activation, Conv1D
from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras.layers import Conv2D, MaxPool2D, Reshape, Flatten, Concatenate, concatenate, SpatialDropout1D
from tensorflow.keras.layers import Layer  # replaces keras.engine.topology.Layer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
from tensorflow.keras.callbacks import *
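
To make the roles of `max_features` and `maxlen` concrete, the sketch below imitates in plain Python what a word-index tokenizer followed by sequence padding does: keep only the most frequent words, map them to integer indices, then truncate or left-pad every document to a fixed length. It follows the default behavior of Keras `Tokenizer` and `pad_sequences` (index 0 reserved for padding, padding and truncation applied at the start of the sequence), but the function `texts_to_padded_sequences` is a hypothetical stand-in and not the project's code.

```python
from collections import Counter


def texts_to_padded_sequences(texts, max_features, maxlen):
    """Convert whitespace-tokenized texts to fixed-length integer sequences."""
    counts = Counter(token for text in texts for token in text.split())
    # Most frequent word gets index 1; index 0 is reserved for padding,
    # so at most max_features - 1 words are kept.
    vocab = {
        word: index
        for index, (word, _) in enumerate(counts.most_common(max_features - 1), start=1)
    }
    padded = []
    for text in texts:
        # Words outside the vocabulary are dropped
        sequence = [vocab[token] for token in text.split() if token in vocab]
        sequence = sequence[-maxlen:]  # truncate from the start, keep the last maxlen tokens
        padded.append([0] * (maxlen - len(sequence)) + sequence)  # left-pad with zeros
    return padded


example = texts_to_padded_sequences(["de ação de danos de", "de ação"], max_features=3, maxlen=3)
print(example)
```

With `maxlen = 50`, only the last 50 tokens of each decision would reach the model under these defaults, which is worth keeping in mind for long legal texts.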