<a href="https://colab.research.google.com/github/viniciusrpb/cic0269_natural_language_processing/blob/main/lectures/natural_language_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 12 - Recurrent Neural Networks

### Application: Natural Language Inference

This script performs natural language inference as a classification task, using language models based on recurrent neural networks.

The dataset can be obtained from the course's GitHub repository or from the following address:

https://www.tensorflow.org/datasets/catalog/snli

We will use the second option:

In [None]:
!pip install tensorflow-datasets
!pip install keras
!pip install tensorflow

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import pandas as pd
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Embedding, SimpleRNN, Dropout
from tensorflow.keras.utils import to_categorical

Since the training set is very large, we will use only 50% of it (the second half, selected with `split='train[50%:]'`). If the classification results are unsatisfactory, you can increase this fraction.

In [4]:
ds_train = tfds.load('snli', split='train[50%:]', shuffle_files=True)
ds_valid = tfds.load('snli', split='validation', shuffle_files=False)
ds_test = tfds.load('snli', split='test', shuffle_files=False)

Downloading and preparing dataset 90.17 MiB (download: 90.17 MiB, generated: 87.00 MiB, total: 177.17 MiB) to ~/tensorflow_datasets/snli/1.1.0...

Generating splits (3): test (10000 examples), validation (10000 examples), train (550152 examples)

Dataset snli downloaded and prepared to ~/tensorflow_datasets/snli/1.1.0. Subsequent calls will reuse this data.


In [5]:
df_train = tfds.as_dataframe(ds_train)
df_valid = tfds.as_dataframe(ds_valid)
df_test = tfds.as_dataframe(ds_test)

In [6]:
df_train.head()

| | hypothesis | label | premise |
|---|---|---|---|
| 0 | b'A child reaches up.' | 0 | b'a child reaches up into the air as a woman s... |
| 1 | b'A person with a backpack' | 0 | b'A young woman wearing a backpack takes the b... |
| 2 | b'A man holding a hard hat is running.' | 0 | b'A man holding a hard hat runs across a street.' |
| 3 | b'A female in weird clothing holding a glass.' | 0 | b'Woman wearing a costume, drinking a beverage.' |
| 4 | b'There are bikers.' | 0 | b'A group of bikers head out the gates in a Lo... |


Pre-processing the sentences in the SNLI DataFrame: removing the `b` prefix (by decoding the byte strings) and concatenating the premise and the hypothesis into a single sentence.

**Note:** You can apply other pre-processing steps (stop-word removal, stemming, lemmatization, etc.).

A pre-processing proposal follows, thanks to Gabriel Nogueira:

In [7]:
def preprocessDataFrame(df):
    """Decode the byte strings and join premise and hypothesis into one sentence."""
    # TFDS returns text as bytes (hence the b'' prefix); decode it to str
    hypothesis = [x.decode('utf-8') for x in df['hypothesis'].values]
    premise = [x.decode('utf-8') for x in df['premise'].values]

    # Concatenate each premise with its hypothesis, separated by a space
    dic = {
        'premise_hypothesis': [p + " " + h for p, h in zip(premise, hypothesis)],
        'label': df['label'].values,
    }

    return pd.DataFrame.from_dict(dic)

In [8]:
df_train = preprocessDataFrame(df_train)
df_valid = preprocessDataFrame(df_valid)
df_test = preprocessDataFrame(df_test)

In [9]:
df_train.head()

| | premise_hypothesis | label |
|---|---|---|
| 0 | a child reaches up into the air as a woman sta... | 0 |
| 1 | A young woman wearing a backpack takes the blo... | 0 |
| 2 | A man holding a hard hat runs across a street.... | 0 |
| 3 | Woman wearing a costume, drinking a beverage. ... | 0 |
| 4 | A group of bikers head out the gates in a Lond... | 0 |
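
As an illustration of the optional pre-processing steps mentioned in the note above, stop-word removal could be sketched roughly as follows. The names `STOP_WORDS` and `remove_stop_words` are hypothetical, and the stop-word list is deliberately tiny; in practice the list would likely come from a library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration)
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "as"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Lowercase the sentence, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return " ".join(t for t in tokens if t not in stop_words)

example = ("A man holding a hard hat runs across a street. "
           "A man holding a hard hat is running.")
cleaned = remove_stop_words(example)
print(cleaned)
```

If this helps, it could be applied to the whole column with `df_train['premise_hypothesis'].apply(remove_stop_words)`. Negation words such as "not" are kept out of the list on purpose, since they may matter for inference.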


Creating the one-hot encoded labels (the training targets)

In [10]:
df_train['label'] = pd.Categorical(df_train['label'])
y_train_int = df_train['label'].cat.codes

df_valid['label'] = pd.Categorical(df_valid['label'])
y_valid_int = df_valid['label'].cat.codes

df_test['label'] = pd.Categorical(df_test['label'])
y_test_int = df_test['label'].cat.codes

y_train = to_categorical(y_train_int)
y_valid = to_categorical(y_valid_int)
y_test = to_categorical(y_test_int)
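
The `to_categorical` call above turns each integer class code into a one-hot vector. The plain-Python sketch below mimics that behaviour so the result is easy to inspect; `one_hot` is a hypothetical helper written for this example, not part of Keras, and the class order in the comment assumes the label names listed in the TFDS catalog.

```python
def one_hot(codes, num_classes=None):
    """Turn integer class codes into one-hot rows, similar to to_categorical."""
    if num_classes is None:
        # Default rule: the number of classes is the largest code plus one
        num_classes = max(codes) + 1
    return [[1.0 if j == c else 0.0 for j in range(num_classes)] for c in codes]

# Assumed SNLI class order: 0 = entailment, 1 = neutral, 2 = contradiction
codes = [0, 2, 1]
encoded = one_hot(codes)
print(encoded)
```

Some distributions of SNLI mark examples without a gold label as -1. If that is the case here, `cat.codes` would yield four categories instead of three, so it may be worth checking `df_train['label'].unique()` before encoding.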

Numericalization and padding!

Homework :)
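
As a starting point, the idea behind both steps can be sketched in plain Python. This is only an illustration under stated assumptions: `build_vocab`, `encode`, and `pad` are hypothetical helpers, whitespace tokenization is a simplification, and `max_words=2000` is chosen to match the input size of the `Embedding` layer used in the model.

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 is reserved for padding, 1 for unknown words

def build_vocab(sentences, max_words=2000):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    most_common = counts.most_common(max_words - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}

def encode(sentence, vocab):
    """Numericalization: replace each word by its id (OOV if unseen)."""
    return [vocab.get(w, OOV) for w in sentence.lower().split()]

def pad(sequence, max_len):
    """Left-pad with PAD, or keep the last max_len ids if too long."""
    sequence = sequence[-max_len:]
    return [PAD] * (max_len - len(sequence)) + sequence

train_sentences = ["a child reaches up", "a child sleeps"]
vocab = build_vocab(train_sentences)  # fit on the training sentences only
padded = [pad(encode(s, vocab), max_len=5)
          for s in train_sentences + ["a dog sleeps"]]
print(padded)
```

With Keras, the already imported `Tokenizer` (with `fit_on_texts` and `texts_to_sequences`) and `pad_sequences` should do an equivalent job. The vocabulary should be fitted on the training sentences only and then reused for the validation and test sets.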

In [11]:
model = Sequential()
model.add(Embedding(2000,output_dim=64))
model.add(SimpleRNN(64,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3,activation="softmax"))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          128000    
                                                                 
 simple_rnn (SimpleRNN)      (None, 64)                8256      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 3)                 195       
                                                                 
Total params: 136,451
Trainable params: 136,451
Non-trainable params: 0
_________________________________________________________________
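
The parameter counts in the summary above can be checked by hand. The sketch below reproduces them from the layer sizes used in the model; the variable names are chosen for this example only.

```python
vocab_size, embed_dim, rnn_units, num_classes = 2000, 64, 64, 3

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias
rnn_params = embed_dim * rnn_units + rnn_units * rnn_units + rnn_units

# Dense: weights + bias
dense_params = rnn_units * num_classes + num_classes

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 128000 8256 195 136451
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size here.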
