# Final Project: Baseline

The project is about detecting sarcasm in headlines drawn from two sources: "HuffingtonPost" for reliable headlines and "The Onion" for sarcastic ones. The result below is only an initial baseline, to be improved later.

In [1]:
from nltk import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from os.path import join
import pandas as pd

To help with natural language processing, the _NLTK_ library will be used. It is a very large library, but fortunately not all of its modules will be needed.

**TODO: Add the module downloads to the code.**
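
One possible way to address the TODO above is sketched below. It assumes the NLTK resources needed are `punkt` (for `word_tokenize`) and `stopwords` (for `stopwords.words`). Recent NLTK releases may ask for `punkt_tab` instead, so the mapping may need adjusting. The helper name `ensure_nltk_resources` is hypothetical.

```python
# Hypothetical helper: download NLTK data only when it is missing.
# Resource names are assumptions; newer NLTK versions may need "punkt_tab".
NLTK_RESOURCES = {
    "punkt": "tokenizers/punkt",        # used by word_tokenize
    "stopwords": "corpora/stopwords",   # used by stopwords.words
}


def ensure_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource unless nltk.data.find already locates it."""
    import nltk  # imported lazily so the mapping can be inspected without NLTK

    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(name)
```

Calling `ensure_nltk_resources()` once, before the tokenization cell, should be enough. Later runs skip the download because the data is already on disk.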

In [3]:
folder = 'Dataset'
dataset = pd.read_json(join(folder, 'Sarcasm_Headlines_Dataset_v2.json'), lines=True)

To analyze the data more easily, we split each headline into words, called _tokens_. This makes the text easier to process.

In [4]:
# Tokenize headlines
token_head = dataset['headline'].apply(word_tokenize)
print(token_head)

0        [thirtysomething, scientists, unveil, doomsday...
1        [dem, rep., totally, nails, why, congress, is,...
2        [eat, your, veggies, :, 9, deliciously, differ...
3        [inclement, weather, prevents, liar, from, get...
4        [mother, comes, pretty, close, to, using, word...
                               ...                        
28614    [jews, to, celebrate, rosh, hashasha, or, some...
28615    [internal, affairs, investigator, disappointed...
28616    [the, most, beautiful, acceptance, speech, thi...
28617    [mars, probe, destroyed, by, orbiting, spielbe...
28618           [dad, clarifies, this, not, a, food, stop]
Name: headline, Length: 28619, dtype: object
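
To make the idea of a token concrete, here is a minimal standard-library sketch. It is a rough stand-in for `word_tokenize`, not a reimplementation: NLTK handles abbreviations and contractions (for example, it keeps `rep.` as a single token in the output above), while this regex does not. The function name `simple_tokenize` and the sample headline are hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical headline, not taken from the dataset.
tokens = simple_tokenize("dad clarifies: this is not a food stop")
print(tokens)
# ['dad', 'clarifies', ':', 'this', 'is', 'not', 'a', 'food', 'stop']
```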


A sentence usually contains common words that contribute little to its meaning. In English, one example is the word "the".

In [5]:
# Removing stopwords: common words that are less useful for detection (example:"the")
stop = set(stopwords.words('english'))
filt = token_head.apply(lambda row: list(filter(lambda w: w not in stop, row)))
dataset['headline'] = filt
print(filt)

0        [thirtysomething, scientists, unveil, doomsday...
1        [dem, rep., totally, nails, congress, falling,...
2        [eat, veggies, :, 9, deliciously, different, r...
3        [inclement, weather, prevents, liar, getting, ...
4        [mother, comes, pretty, close, using, word, 's...
                               ...                        
28614         [jews, celebrate, rosh, hashasha, something]
28615    [internal, affairs, investigator, disappointed...
28616    [beautiful, acceptance, speech, week, came, qu...
28617    [mars, probe, destroyed, orbiting, spielberg-g...
28618                         [dad, clarifies, food, stop]
Name: headline, Length: 28619, dtype: object
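
The filtering step can be illustrated on a single headline. The sketch below uses a tiny hand-picked stopword set as an assumption, whereas the notebook uses the full `stopwords.words('english')` list. The names `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# Illustrative subset only; the notebook uses stopwords.words('english').
SAMPLE_STOPWORDS = {"this", "not", "a", "the", "to"}


def remove_stopwords(tokens, stop=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stop]


# Tokens of the last headline shown in the tokenization output.
filtered = remove_stopwords(["dad", "clarifies", "this", "not", "a", "food", "stop"])
print(filtered)
# ['dad', 'clarifies', 'food', 'stop']
```

As the last row of the output shows, the filter also drops negations such as `not`. Whether that helps or hurts sarcasm detection may be worth checking when the baseline is refined.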


In [6]:
# Splitting dataset.
X, X_test, Y, Y_test = train_test_split(dataset['headline'], dataset['is_sarcastic'], test_size=0.1)

print(X)
print(Y)


27704    [wanted, government, run, like, business, ?, g...
24834          [hair, salon, acquires, rare, nagel, print]
11360                     [cbs, picks, nbc, nightly, news]
11243    [rotating, knife, vortex, closed, pending, saf...
6695            [boardroom, hokey, pokey, :, dance, women]
                               ...                        
27430    [nation, 's, dogs, vow, keep, shit, together, ...
8316                   [30, reasons, give, thanks, horses]
22066    [princess, nokia, reveals, threw, soup, racist...
7170     [olympic, bronze, medalist, appear, flintstone...
9659             [9, parking, garage, designs, works, art]
Name: headline, Length: 25757, dtype: object
27704    0
24834    1
11360    1
11243    1
6695     0
        ..
27430    1
8316     0
22066    0
7170     1
9659     0
Name: is_sarcastic, Length: 25757, dtype: int64
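
The split in the cell above is not seeded, so each run selects different rows. The sketch below shows one way to make it repeatable and to keep the class balance equal in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The toy headlines, the labels, and the seed value 42 are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 placeholder headlines with alternating labels.
headlines = [f"headline {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    headlines,
    labels,
    test_size=0.1,      # same fraction as the notebook
    random_state=42,    # arbitrary seed; any fixed integer gives a repeatable split
    stratify=labels,    # keep the class ratio equal in both parts
)

print(len(X_train), len(X_test))
# 18 2
```

With a fixed seed, later changes in accuracy can be attributed to the model rather than to a different random split.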
