# Final Project: Baseline

The project consists of detecting sarcasm in headlines drawn from two sources: "HuffingtonPost" for trustworthy headlines and "The Onion" for sarcastic ones. The result below is only an initial baseline, to be improved later.

In [1]:
from nltk import download
from nltk import pos_tag
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer # Lemmatizer of choice
from nltk.stem import SnowballStemmer # Stemmer of choice

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
from os.path import join
import pandas as pd
import re


To help with natural language processing, the _NLTK_ library will be used. It is a very large library, but fortunately not all of its modules are needed.

**TODO: Add the download of the required NLTK modules to the code.**
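One possible way to handle the download step is sketched below. The resource names are assumptions inferred from the imports in the first cell (`word_tokenize`, `stopwords`, `WordNetLemmatizer`, `pos_tag`); recent NLTK releases may expect different names, such as `punkt_tab`. The helper `ensure_nltk_resources` and its `downloader` parameter are hypothetical names introduced only for this sketch.

```python
# Assumed resource names, inferred from the imports used in this notebook.
NLTK_RESOURCES = ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger")

def ensure_nltk_resources(resources=NLTK_RESOURCES, downloader=None):
    """Download each resource and return the names whose download failed."""
    if downloader is None:
        # Imported lazily so the function can be exercised without NLTK installed.
        from nltk import download as downloader
    # nltk.download returns True on success and False on failure.
    return [name for name in resources if not downloader(name, quiet=True)]
```

Calling `ensure_nltk_resources()` once before the tokenization cell should be enough; an empty list as the return value means every download reported success.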

In [2]:
folder = 'Dataset'
dataset = pd.read_json(join(folder, 'Sarcasm_Headlines_Dataset_v2.json'), lines=True)

In [3]:
dataset.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


To analyze the data more effectively, each headline is split into words, called _tokens_, so that the text can be processed piece by piece.

In [4]:
# Tokenize headlines
dataset.headline = dataset['headline'].apply(word_tokenize)

In [5]:
# Removing stopwords: common words that are less useful for detection (example: "the")
# stop = set(stopwords.words('english'))
# filt = dataset['headline'].apply(lambda row: [w for w in row if w not in stop])
# dataset['headline'] = filt
# print(filt)

A sentence usually contains common words that contribute little to its meaning, called _stopwords_. In English, one example is the word "the". To reduce processing time and avoid learning unwanted patterns, the headlines can be trimmed by removing these words.
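As a minimal illustration of stopword removal, the sketch below filters a tokenized headline against a small set. The set `STOPWORDS` is a deliberately tiny, hypothetical stand-in; in the notebook the full list would come from `stopwords.words('english')`. The function name `remove_stopwords` is also an assumption made for this example.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [t for t in tokens if t.lower() not in stopwords]

headline = ["the", "mother", "of", "all", "headlines"]
print(remove_stopwords(headline))  # ['mother', 'all', 'headlines']
```

Using a `set` keeps each membership check constant-time, which matters when the filter runs over every token of every headline.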

In [6]:
# Stemming: reduce each word to its stem (e.g. "running" -> "run")
stemmer = SnowballStemmer('english')
stem_tokens = lambda words : [stemmer.stem(w) for w in words]
dataset.headline = dataset.headline.apply(stem_tokens)

In [7]:
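def big_line(text, tokenized=True):
    """Flatten a Series of headlines into one list of tokens, or one string."""
    if tokenized:
        # Each row is a list of tokens: concatenate them into a single flat list.
        line = []
        for x in text.tolist():
            line += x
    else:
        # Each row is a raw string: join all rows with single spaces.
        line = " ".join(text)

    return line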

def clean_up(text, pattern=r"[.,:%\-?()&$'\"!“”¯°–―—_/|#\[\]…@ツ¡\d]"):
    # Strip punctuation, symbols and digits from every string in the Series.
    # The hyphen is escaped so that "%-?" is not read as a character range.
    return text.apply(lambda txt: re.sub(pattern, '', txt))

def onehotencoding(ids, size):
    hotencoded = [0]*size
    for x in ids : hotencoded[x] = 1
    return hotencoded
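To make the behaviour of `onehotencoding` concrete, here is a small self-contained sketch. It uses a hypothetical function name, `multi_hot`, with the same logic, applied to a toy vocabulary of five words. Because the vector marks only which words are present, it is closer to a multi-hot (bag-of-words) encoding than to a one-hot encoding of a single word.

```python
def multi_hot(ids, size):
    """Return a 0/1 list of length `size` with a 1 at every index in `ids`."""
    vector = [0] * size
    for i in ids:
        vector[i] = 1
    return vector

# Toy vocabulary of 5 words; the headline uses word 3, then word 1, then word 3 again.
print(multi_hot([3, 1, 3], 5))  # [0, 1, 0, 1, 0]
# A different word order without the repetition yields the same vector.
print(multi_hot([1, 3], 5))     # [0, 1, 0, 1, 0]
```

The two calls return identical vectors, which shows that word order and repetition are discarded by this representation.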

In [8]:
counter = CountVectorizer(stop_words='english')
matrix = counter.fit_transform(big_line(dataset.headline))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
words = counter.get_feature_names_out()

In [9]:
# One Hot Encoding
# First, map every vocabulary word to an integer label
labels = LabelEncoder().fit(words)
vocabulary = set(labels.classes_) # Set membership is much faster than scanning the array
removal = lambda words : [x for x in words if x in vocabulary]
dataset.headline = dataset.headline.apply(removal) # Drop tokens that are not in the vocabulary
encoded = dataset.headline.apply(labels.transform) # Map tokens to integer labels
encoded = encoded.apply(lambda x : onehotencoding(x, len(words)))

In [19]:
#### VARIABLES ####
from numpy import array
X = array(encoded.tolist())
Y = array([[x] for x in dataset.is_sarcastic.tolist()]) 
# Splitting dataset.
X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.1)

(25757, 18166)
[[1]
 [0]
 [1]
 ...
 [1]
 [0]
 [0]]


In [20]:
print(X.shape)
print(Y.shape)

(25757, 18166)
(25757, 1)


In [21]:
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.models import Model

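def RNN(max_words = 1000, max_len = 18166):
    # NOTE: Embedding expects integer word indices, but X holds 0/1 multi-hot
    # vectors, so only embedding rows 0 and 1 are ever looked up and the LSTM
    # has to step through all max_len positions of every sample.
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256, name='FC1', activation='relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    # Sigmoid, not softmax: softmax over a single unit always outputs 1.0,
    # which makes binary cross-entropy meaningless.
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model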

In [22]:
recurrent_network = RNN()

In [23]:
recurrent_network.summary()
recurrent_network.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          (None, 18166)             0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 18166, 50)         50000     
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                29440     
_________________________________________________________________
FC1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total para

In [24]:
recurrent_network.fit(X,Y,batch_size=100,epochs=5, validation_split=0.1)

Train on 23181 samples, validate on 2576 samples
Epoch 1/5
  100/23181 [..............................] - ETA: 10:59:10 - loss: 7.6523 - acc: 0.5200

KeyboardInterrupt: 
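
A note on the output activation: softmax normalizes across the units of a layer, so on a layer with a single unit it returns 1 for every input, whereas a sigmoid maps the logit to a probability between 0 and 1. The plain-Python sketch below illustrates the difference; it is not Keras code, and the helper names are chosen only for this example. A constant prediction of 1 would be consistent with the large loss (about 7.65) and near-chance accuracy (0.52) in the interrupted log above, which is what one would expect from a run with softmax on the single output unit.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping a single logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# With one output unit, softmax ignores the logit entirely.
for z in (-3.0, 0.0, 3.0):
    print(z, softmax([z]), round(sigmoid(z), 4))
```

For a binary label with one output unit, a sigmoid paired with `binary_crossentropy` is the usual choice; softmax is normally reserved for output layers with two or more units.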