# Projeto Final: Baseline

O projeto consiste de detecção de sarcasmo em manchetes a partir de duas fontes: "HuffingtonPost" para manchetes confiáveis e "The Onion" para manchetes sarcásticas. O resultado a seguir é apenas inicial, um baseline, para depois ser aprimorado.

In [3]:
import matplotlib.pyplot as plt
from os.path import join
import pandas as pd
import re

from utils import *

Para auxílio no processamento de linguagem natural, será utilizada a biblioteca _NLTK_. É uma biblioteca muito grande, mas felizmente não serão necessários todos os módulos.

**TODO: Adicionar download dos módulos no código.**

In [4]:
# Reading dataset
folder = 'Dataset'
dataset = pd.read_json(join(folder, 'Sarcasm_Headlines_Dataset_v2.json'), lines=True)

In [5]:
dataset.head()

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...


Para melhor análise dos dados, devemos dividir cada manchete em palavras, chamadas de _tokens_ . Assim, podemos processar melhor o texto.

In [6]:
# Tokenize headlines (nltk tokenizer is more robust with punctuation)
dataset.headline = dataset.headline.apply(clean_up)
dataset.headline = dataset.headline.apply(word_tokenize)
dataset.headline = dataset.headline.apply(stop_words)
dataset.headline = dataset.headline.apply(stemmezation)

In [7]:
words, counts = count_words(dataset.headline)
# Create a list of insignificant words (Words with low frequency)
MIN_FREQ = 3
discard = []
for (w,c) in zip(words, counts[0]):
    if c < MIN_FREQ : discard.append(w)

In [8]:
# Create a set with the selected words by the countvectorizer
# and the discarded ones due to low frequency, and removes any
# words from the headlines which dont belong to the such set
select = set(words)-set(discard)
dataset.headline = dataset.headline.apply(lambda x : list(set(x) & select))

In [9]:
# One Hot Encoding 
# First we need to label the words
labels = LabelEncoder().fit(list(select))
encoded = dataset.headline.apply(labels.transform) # Transforming words to labels
encoded = encoded.apply(lambda x : onehotencoding(x, len(select)))

In [10]:
#### VARIBLES ####
X = np.array(encoded.tolist())
Y = np.array([[x] for x in dataset.is_sarcastic.tolist()]) 

In [11]:
# Splitting dataset.
X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.1)

In [59]:
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding, RNN, SimpleRNN
from keras.models import Model

def RNN_network(max_words = 1000, max_len = 8037):

    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,64,input_length=max_len, mask_zero=True)(inputs)
    layer = SimpleRNN(30)(layer)
    layer = Dense(256, name='FC1', activation='relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('softmax')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [60]:
recurrent_network = RNN_network()

In [61]:
recurrent_network.summary()
recurrent_network.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          (None, 8037)              0         
_________________________________________________________________
embedding_11 (Embedding)     (None, 8037, 64)          64000     
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 30)                2850      
_________________________________________________________________
FC1 (Dense)                  (None, 256)               7936      
_________________________________________________________________
dropout_10 (Dropout)         (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257       
_________________________________________________________________
activation_10 (Activation)   (None, 1)                 0         
Total para

In [63]:
recurrent_network.fit(X,Y,batch_size=256,epochs=5, validation_split=0.1)

Train on 23181 samples, validate on 2576 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
 2304/23181 [=>............................] - ETA: 22:29 - loss: 8.0335 - acc: 0.4961

KeyboardInterrupt: 

## Testing Fully Connected Networks

In [None]:
def NN(max_words = 1000, max_len = X.shape[1]):

    inputs = Input(name='inputs',shape=[max_len])
    layer = Dense(4000, name='FC1', activation='relu')(inputs)
    layer = Dense(1000, name='FC2', activation='relu')(layer)
    layer = Dense(250, name='FC3', activation='relu')(layer)
    layer = Dense(50, name='FC4', activation='relu')(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [None]:
neural_network = NN()
neural_network.summary()
neural_network.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
neural_network.fit(X,Y,batch_size=100,epochs=5, validation_split=0.1)