<a href="https://colab.research.google.com/github/valmirf/mineracao_textual/blob/main/CNN/05_CNN_TextClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [86]:
!git clone https://github.com/valmirf/mineracao_textual.git

fatal: destination path 'mineracao_textual' already exists and is not an empty directory.


# Text Classification with CNNs

## Load Data

In [87]:
import pandas as pd

df = data = pd.read_csv("mineracao_textual/Dados/dataset_ness_law.csv")

# drop the minority classes (keep only the four most frequent)
major_classes = list(df['objeto'].value_counts()[0:4].index)
majority      = df.loc[df['objeto'].isin(major_classes),]
minority      = df.loc[~df['objeto'].isin(major_classes),]

print('\n',5*'=', 'major classes: \n',    majority['objeto'].value_counts()[0:4])
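# NOTE: `minority` already excludes the four major classes, so `[4:]` also skips its four largest classes
print('\n',5*'=', 'minority classes: \n', minority['objeto'].value_counts()[4:])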

majority.head()


 ===== major classes: 
 Atraso ou Cancelamento de Voo    136
Alteração na Malha Aérea          79
Doméstico                         59
Extravio de Bagagem               49
Name: objeto, dtype: int64

 ===== minority classes: 
 Provisório                      8
Erro ou Alteração na Reserva    6
Outros                          3
Name: objeto, dtype: int64


Unnamed: 0,txt,objeto
0,fatos joão césar sala eldorado contagem marque...,Extravio de Bagagem
1,fatos fatos ocorridos trecho conforme document...,Doméstico
2,fatos autores presente ação programaram meses ...,Atraso ou Cancelamento de Voo
3,fatos objetivando viajar porto alegre particip...,Alteração na Malha Aérea
4,fatos autor comprou passagem retorno casa após...,Extravio de Bagagem


## Encode Labels and Split Train and Test

In [88]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
encoder = OneHotEncoder(sparse=False)

_labels = majority['objeto'].values.reshape((len(majority['objeto']), 1))
X = majority['txt']
y = encoder.fit_transform(_labels)

sentences_train, sentences_test, y_train, y_test  = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [89]:
sentences_train[:4], y_train[:4]

(107    fatos autora cliente requerida utilizou serviç...
 31     fatos atraso injustificado perda conexão chega...
 342    fatos requerente planejou realizar viagem dest...
 176    fatos requerente adquiriu passagem realizar vi...
 Name: txt, dtype: object, array([[0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]]))

In [90]:
y_train.shape

(226, 4)

## Tokenizing Text

In [91]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train   = tokenizer.texts_to_sequences(sentences_train)
X_test    = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1 

print(vocab_size)
print(sentences_train.iloc[2])
print(X_train[2])

18463
fatos requerente planejou realizar viagem destino fernando noronha objetivo efetuar mergulho grande profundidade participar jogo futebol junto amigos jiqui country club escolhido destino procurou agência viagem brasil operadora agência viagens meio realizou compra passagens aéreas quais ficaram importe setecentos vinte reais trinta centavos sendo empresa azul linhas aéreas brasileiras requerida companhia escolhida anexo passagem marcada saída aeroporto recife chegada fernando noronha questão companhia azul linhas aéreas brasileiras anexo volta marcada ocorre requerente chegou aeroporto volta check ficou aguardando entanto somente volta informado havia sido cancelado inclusive neste meio tempo saiu marcado mesma companhia aérea destino vogado apesar anúncio sido cancelado ocorrido volta requerente ficou aguardando posicionamento companhia saber iria embarcar somente volta companhia chamou requerente voucher alimentação valor pudesse almoçar disse após almoço retornasse sala embarq
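
Note that `vocab_size` (18463) is much larger than `num_words=5000`. In Keras, `Tokenizer.word_index` keeps every word seen by `fit_on_texts`, while `texts_to_sequences` only emits indices below `num_words`. The pure-Python sketch below mimics that behaviour on a toy corpus; `build_word_index` and `to_sequences` are hypothetical helpers written for illustration, not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word)
            if idx is None:
                continue
            if num_words is not None and idx >= num_words:
                continue
            seq.append(idx)
        sequences.append(seq)
    return sequences

toy_texts = ["voo atrasado voo cancelado", "bagagem extraviada voo"]
toy_index = build_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts, toy_index, num_words=3)

print(len(toy_index) + 1)  # size of the full vocabulary plus the reserved index 0
print(toy_seqs)            # only indices 1 and 2 survive the num_words cut
```

If the installed Keras version behaves this way, the `Embedding` layer used later only ever receives indices smaller than 5000, so a smaller `input_dim` might be enough. Treat this as something to verify, not as a confirmed fix.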

In [92]:
#X_train

In [93]:
#list(tokenizer.word_index)

## Sequence Padding

One problem is that the text sequences usually have different lengths. To fix this, you can use `pad_sequences()`, which pads each sequence of word indices with zeros.

By default it prepends zeros, but here we want to append them (`padding='post'`). Usually it does not matter whether you prepend or append zeros.

In [94]:
from keras.preprocessing.sequence import pad_sequences

maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print(X_train[0, :], y_train[0])

[2049  213 1460 4651    2    6    5  322    2  605  325   46  417  456
   18   37   57  155    6    5  275    2  398  129   46  417  456  730
  231   43    8    6    5  426   50   87  323 1352  560    2   95  299
 1303    2 2345    8  160    2 1304 4652 2742  111  168 1520   15  304
  111   50   87 1522    8 1353   90  193  282    1   77   87 2519   13
   34 3622 2519   13 1732  336  124   43   64  633  287 2987 1586   86
  359    5  146   43   56  800 1932    2  332   82    7   28  125 2743
    6 4649] [0. 0. 1. 0.]
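
The row printed above contains no zeros, which suggests that this document had at least 100 tokens and was truncated, not padded. Here is a minimal pure-Python sketch of both operations, assuming the Keras default for truncation stays in place (`truncating='pre'`, which drops tokens from the start). `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; truncate from the start when too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # keep the last `maxlen` tokens
        seq = seq + [value] * (maxlen - len(seq))  # append zeros if too short
        padded.append(seq)
    return padded

result = pad_post([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
# → [[7, 8, 0, 0], [3, 4, 5, 6]]
```

If the beginning of each text carries more signal than the end, passing `truncating='post'` to `pad_sequences` keeps the first 100 tokens instead.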


## CNN Architecture

In [117]:
from keras.models import Sequential
from keras import layers

embedding_dim = 200
num_classes = y_train.shape[1]

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Conv1D(300, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
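# softmax is the usual output activation for single-label classification with
# categorical_crossentropy; sigmoid is what produced the results recorded here
model.add(layers.Dense(num_classes, activation='sigmoid'))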
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_19 (Embedding)     (None, 100, 200)          3692600   
_________________________________________________________________
conv1d_19 (Conv1D)           (None, 96, 300)           300300    
_________________________________________________________________
global_max_pooling1d_19 (Glo (None, 300)               0         
_________________________________________________________________
dense_38 (Dense)             (None, 10)                3010      
_________________________________________________________________
dense_39 (Dense)             (None, 4)                 44        
Total params: 3,995,954
Trainable params: 3,995,954
Non-trainable params: 0
_________________________________________________________________
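
The shapes and parameter counts in this summary can be reproduced by hand. The sketch below redoes the arithmetic with the values used in this notebook (`vocab_size=18463`, `embedding_dim=200`, `maxlen=100`, 300 filters of width 5). The helper and variable names are illustrative.

```python
vocab_size, embedding_dim, maxlen = 18463, 200, 100
num_filters, kernel_size = 300, 5
dense_units, num_classes = 10, 4

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

embedding_params = vocab_size * embedding_dim                  # one vector per index
conv_params = (kernel_size * embedding_dim + 1) * num_filters  # weights + bias per filter
conv_steps = maxlen - kernel_size + 1                          # 'valid' convolution length
hidden_params = dense_params(num_filters, dense_units)
output_params = dense_params(dense_units, num_classes)
total_params = embedding_params + conv_params + hidden_params + output_params

print(embedding_params, conv_params, conv_steps)
# → 3692600 300300 96
print(hidden_params, output_params, total_params)
# → 3010 44 3995954
```

Roughly 92% of the parameters sit in the embedding table, which is large relative to the 226 training examples and may contribute to overfitting.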


In [120]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Training Accuracy: 1.0000
Testing Accuracy:  0.5155


## Adding pre-trained embeddings

We can also use pre-trained embeddings. The right choice depends on your problem. For a general-purpose text classification task, you can use embeddings pre-trained by Google on millions of texts. For sentiment classification of book reviews, however, embeddings trained on data closer to the problem, such as Amazon book reviews and product information, may work better. In this activity, we will use....

### Word2Vec

In [97]:
!wget https://gist.githubusercontent.com/bastings/4d1c346c68969b95f2c34cfbc00ba0a0/raw/76b4fefc9ef635a79d0d8002522543bc53ca2683/googlenews.word2vec.300d.txt

--2021-04-28 00:57:15--  https://gist.githubusercontent.com/bastings/4d1c346c68969b95f2c34cfbc00ba0a0/raw/76b4fefc9ef635a79d0d8002522543bc53ca2683/googlenews.word2vec.300d.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66209703 (63M) [text/plain]
Saving to: ‘googlenews.word2vec.300d.txt.3’


2021-04-28 00:57:15 (191 MB/s) - ‘googlenews.word2vec.300d.txt.3’ saved [66209703/66209703]



In [98]:
!head -n 1 googlenews.word2vec.300d.txt | cut -c-50

in 0.0703125 0.08691406 0.087890625 0.0625 0.06933


### GloVe

In [99]:
!wget https://gist.githubusercontent.com/bastings/b094de2813da58056a05e8e7950d4ad1/raw/3fbd3976199c2b88de2ae62afc0ecc6f15e6f7ce/glove.840B.300d.sst.txt

--2021-04-28 00:57:15--  https://gist.githubusercontent.com/bastings/b094de2813da58056a05e8e7950d4ad1/raw/3fbd3976199c2b88de2ae62afc0ecc6f15e6f7ce/glove.840B.300d.sst.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53933315 (51M) [text/plain]
Saving to: ‘glove.840B.300d.sst.txt.3’


2021-04-28 00:57:16 (82.4 MB/s) - ‘glove.840B.300d.sst.txt.3’ saved [53933315/53933315]



In [100]:
!head -n 1 glove.840B.300d.sst.txt | cut -c-50

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.


## Test: Word2Vec

In [101]:
import numpy as np

def create_embedding_matrix(filepath, word_index, embedding_dim):
    # Row 0 stays all zeros: index 0 is reserved for padding
    vocab_size = len(word_index) + 1
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding='utf-8') as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                # The file names indicate 300-d vectors; slicing keeps only
                # the first `embedding_dim` components
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    # Words missing from the file keep an all-zero row
    return embedding_matrix

embedding_matrix = create_embedding_matrix('googlenews.word2vec.300d.txt',
                      tokenizer.word_index, embedding_dim)

In [102]:
embedding_matrix[0:3,:10]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
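
Row 0 is the reserved padding index, but rows 1 and 2 belong to the two most frequent words in the corpus and also show only zeros, which means no vector was copied for them. A likely reason, stated here as an assumption, is that the embedding files hold English vocabulary while the corpus is Portuguese. One quick check is to measure vocabulary coverage. `embedding_coverage` below is a hypothetical helper, shown on toy data.

```python
def embedding_coverage(word_index, embedding_words):
    """Return the fraction of vocabulary words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embedding_words)
    return found / len(word_index)

# Toy vocabulary and toy set of words present in an embedding file
toy_word_index = {"voo": 1, "bagagem": 2, "hotel": 3, "taxi": 4}
toy_embedding_words = {"hotel", "taxi", "flight", "luggage"}

coverage = embedding_coverage(toy_word_index, toy_embedding_words)
print(f"{coverage:.2f}")
# → 0.50
```

On the real matrix, a similar number can be approximated with `np.count_nonzero(embedding_matrix.any(axis=1)) / len(tokenizer.word_index)`. A low value would mean that most inputs to the frozen embedding layer are zero vectors.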

In [103]:
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(num_classes, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 100, 200)          3692600   
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 96, 128)           128128    
_________________________________________________________________
global_max_pooling1d_12 (Glo (None, 128)               0         
_________________________________________________________________
dense_24 (Dense)             (None, 10)                1290      
_________________________________________________________________
dense_25 (Dense)             (None, 4)                 44        
Total params: 3,822,062
Trainable params: 129,462
Non-trainable params: 3,692,600
_________________________________________________________________


In [104]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Training Accuracy: 0.8363
Testing Accuracy:  0.4021


## Test: GloVe

In [105]:
embedding_matrix = create_embedding_matrix('glove.840B.300d.sst.txt',
                      tokenizer.word_index, embedding_dim)
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, 
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=False))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(num_classes, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 100, 200)          3692600   
_________________________________________________________________
conv1d_13 (Conv1D)           (None, 96, 128)           128128    
_________________________________________________________________
global_max_pooling1d_13 (Glo (None, 128)               0         
_________________________________________________________________
dense_26 (Dense)             (None, 10)                1290      
_________________________________________________________________
dense_27 (Dense)             (None, 4)                 44        
Total params: 3,822,062
Trainable params: 129,462
Non-trainable params: 3,692,600
_________________________________________________________________


In [106]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Training Accuracy: 0.8451
Testing Accuracy:  0.3608


## Detailed metrics

In [107]:
from sklearn import metrics

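# `predict_classes` was removed from recent Keras releases; argmax over `predict` is equivalent here
preds = np.argmax(model.predict(X_test), axis=1)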
true  = np.argmax(y_test, axis=1)

print(metrics.classification_report(true, preds))

              precision    recall  f1-score   support

           0       0.25      0.29      0.27        24
           1       0.51      0.56      0.53        41
           2       0.12      0.06      0.08        18
           3       0.25      0.29      0.27        14

    accuracy                           0.36        97
   macro avg       0.28      0.30      0.29        97
weighted avg       0.34      0.36      0.35        97
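
The report identifies classes only by index. Assuming `OneHotEncoder` kept its default behaviour of ordering categories in sorted order, the indices can be mapped back to names, and the supports give a majority-class baseline to compare against. The sketch below uses the class names and supports printed in this notebook. On the fitted encoder, `encoder.categories_[0]` should give the same ordering.

```python
# Class names from the value counts printed earlier, in sorted order (assumed encoder order)
class_names = sorted([
    "Atraso ou Cancelamento de Voo",
    "Alteração na Malha Aérea",
    "Doméstico",
    "Extravio de Bagagem",
])
support = [24, 41, 18, 14]  # test-set support per class index, from the report

index_to_class = dict(enumerate(class_names))
for idx, name in index_to_class.items():
    print(idx, name, support[idx])

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = max(support) / sum(support)
print(f"Majority-class baseline: {baseline_accuracy:.4f}")
# → Majority-class baseline: 0.4227
```

Against this baseline, the GloVe model (0.3608) and the Word2Vec model (0.4021) do no better than always predicting class 1, while the model with trainable embeddings (0.5155) does.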





## How to improve?

- Hyperparameter tuning
- Add more filters (convolution layers), for example:

```
filter_sizes = [3,4,5]
num_filters = 512

...

conv_0 = Conv2D(num_filters, kernel_size=(filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_1 = Conv2D(num_filters, kernel_size=(filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)
conv_2 = Conv2D(num_filters, kernel_size=(filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal', activation='relu')(reshape)

maxpool_0 = MaxPool2D(pool_size=(sequence_length - filter_sizes[0] + 1, 1), strides=(1,1), padding='valid')(conv_0)
maxpool_1 = MaxPool2D(pool_size=(sequence_length - filter_sizes[1] + 1, 1), strides=(1,1), padding='valid')(conv_1)
maxpool_2 = MaxPool2D(pool_size=(sequence_length - filter_sizes[2] + 1, 1), strides=(1,1), padding='valid')(conv_2)

concatenated_tensor = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2])
flatten = Flatten()(concatenated_tensor)

...
```
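
The snippet above elides the input and output layers. To make the idea concrete without depending on a particular Keras version, here is a small pure-Python sketch of the forward pass it describes: convolve the embedded sequence with filters of several widths, take the maximum over positions for each filter, and concatenate the results. All names, sizes, and random weights are illustrative assumptions, bias terms are omitted, and this is not the notebook's model.

```python
import random

def conv_max_features(embedded, filters):
    """Apply each filter (width x dim weights) with ReLU, then max over positions."""
    features = []
    for weights in filters:
        width = len(weights)
        best = 0.0  # ReLU outputs are never below zero
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            activation = sum(w * x
                             for w_row, x_row in zip(weights, window)
                             for w, x in zip(w_row, x_row))
            best = max(best, activation)  # ReLU and max-pooling in one step
        features.append(best)
    return features

random.seed(0)
sequence_length, embedding_dim = 10, 4
filter_sizes, num_filters = [3, 4, 5], 2

# Stand-in for the output of an embedding layer: one vector per token
embedded = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
            for _ in range(sequence_length)]

concatenated = []
for size in filter_sizes:
    filters = [[[random.uniform(-1, 1) for _ in range(embedding_dim)]
                for _ in range(size)]
               for _ in range(num_filters)]
    concatenated.extend(conv_max_features(embedded, filters))

print(len(concatenated))  # one feature per filter, across all widths
# → 6
```

In Keras, the same structure is often expressed with parallel `Conv1D` and `GlobalMaxPooling1D` branches joined by `Concatenate`, which avoids the reshape that `Conv2D` requires.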