# Sarcasm headline detection
### Author: Bartosz Wnorowski

## Introduction

The purpose of this notebook is to build a machine learning model that classifies a headline as sarcastic or not. All of the work is based on the given dataset.

We start with a closer look at the dataframe. Then, where the data needs cleaning or processing, we do that to improve the model's quality. Finally, we build a classification model using the methods that suit our case best.

## Data preparation

In [2]:
import pandas as pd
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import naive_bayes, svm
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

In [3]:
df = pd.read_json('HEADLINES.json', lines=True)

In [4]:
df.head()

| | headline | is_sarcastic |
|---|---|---|
| 0 | former versace store clerk sues over secret 'b... | 0 |
| 1 | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | mom starting to fear son's web series closest ... | 1 |
| 3 | boehner just wants wife to listen, not come up... | 1 |
| 4 | j.k. rowling wishes snape happy birthday in th... | 0 |


The dataset consists of two columns: the headline text and a binary is_sarcastic label (0 or 1).

In [5]:
df.describe()

| | is_sarcastic |
|---|---|
| count | 26709.0 |
| mean | 0.438953 |
| std | 0.496269 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


The mean of is_sarcastic is about 0.44, so roughly 44% of the headlines are sarcastic and 56% are not. The two classes are reasonably balanced.
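The class balance can also be read directly from the label counts. The snippet below is a minimal sketch on a hypothetical toy dataframe (`toy` is not the real dataset); it only illustrates that the mean of a 0/1 column equals the share of positive labels.

```python
import pandas as pd

# Hypothetical toy labels, used only to illustrate the check (not the real data).
toy = pd.DataFrame({"is_sarcastic": [0, 0, 0, 1, 1]})

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy["is_sarcastic"].value_counts(normalize=True)
print(class_share)

# For a 0/1 column the mean is the fraction of rows labelled 1.
positive_share = toy["is_sarcastic"].mean()
print(positive_share)
```

On the real dataframe the same two calls would be applied to `df['is_sarcastic']`.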

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26709 entries, 0 to 26708
Data columns (total 2 columns):
headline        26709 non-null object
is_sarcastic    26709 non-null int64
dtypes: int64(1), object(1)
memory usage: 417.5+ KB


We can see that the dataset does not contain any empty cells. That means we can move on to preparing the text so that it is 'edible' for a statistical model.

In [16]:
ps = PorterStemmer()

def clean(text, stemmer):
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = [stemmer.stem(word) for word in text.split()]
    text = ' '.join(text)
    return text

df['headline'] = df['headline'].apply(clean, args = (ps,))

We have now cleaned the data in two steps:
1. All special characters and digits are removed.
2. All words are replaced by their stems, so that, for example, verbs in different tenses share the same representation.

It turned out that converting letters to lowercase has a negative impact on our models' quality.

These actions reduce the number of features and help the model generalize.
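To make the first cleaning step concrete, here is a small sketch of the regular expression on its own. The helper name `strip_non_letters` and the sample string are hypothetical, and stemming is left out so that the example runs without NLTK.

```python
import re

def strip_non_letters(text):
    """Replace every character outside a-z/A-Z with a space, then collapse whitespace."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # split() with no arguments drops the runs of spaces left by the substitution.
    return ' '.join(letters_only.split())

# Hypothetical sample headline, not taken from the dataset.
sample = "j.k. rowling's 2nd book"
cleaned = strip_non_letters(sample)
print(cleaned)  # j k rowling s nd book
```

Note that punctuation inside a word splits it into separate tokens: the possessive `'s` becomes a standalone `s`, and `2nd` is reduced to `nd`.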

In [22]:
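# Hold out a test set (25% by default); no random_state is set, so the split differs between runs.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(df['headline'], df['is_sarcastic'])
# Note: stop_words=['english'] is a one-word custom list; the string 'english' selects the built-in list.
vectorizer = TfidfVectorizer(stop_words = ['english'], ngram_range=(1,3))
# Fit the vocabulary and IDF weights on the training data only, then reuse them for the test data.
Xtrain = vectorizer.fit_transform(Xtrain)
Xtest = vectorizer.transform(Xtest)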

For vectorization, the step that transforms text into feature vectors, we use TF-IDF. It reduces the impact of the most common words. We also widened ngram_range from the default (1,1) to (1,3) to better capture the meaning of whole phrases.

The vectorizer's parameters were chosen experimentally to get the best possible training scores.
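The effect of the two vectorizer parameters can be seen on a tiny example. This is a sketch on hypothetical toy sentences (`docs` is not part of the dataset): it shows which n-grams `ngram_range=(1, 3)` produces, and how a one-element list such as `['english']` differs from the string `'english'` as a `stop_words` value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, used only for illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# ngram_range=(1, 3): unigrams, bigrams and trigrams all become features.
ngram_vec = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)
ngram_vocab = ngram_vec.vocabulary_  # dict mapping each n-gram to its column index
print(sorted(ngram_vocab))

# A list is treated as a custom stop-word list: here only the token "english" is dropped.
custom_vec = TfidfVectorizer(stop_words=["english"]).fit(docs)
# The string 'english' selects scikit-learn's built-in English stop-word list.
builtin_vec = TfidfVectorizer(stop_words="english").fit(docs)

print("the" in custom_vec.vocabulary_)   # "the" is kept
print("the" in builtin_vec.vocabulary_)  # "the" is removed
```

Keep in mind that the built-in list contains unstemmed words, so it may not match every token once the headlines have been stemmed.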

## Model creation

We will try Multinomial Naive Bayes and Support Vector Machine classifiers and check which of them suits this problem better. The main criterion for comparison will be the F1 score (the harmonic mean of precision and recall) on the test set.
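As a reminder of what the F1 score measures, the sketch below computes precision, recall and their harmonic mean by hand and compares the result with `f1_score`. The labels `y_true` and `y_pred` are hypothetical toy values, not model output.

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels: 1 = sarcastic, 0 = not sarcastic.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

# Count true positives, false positives and false negatives for class 1.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual)
print(f1_score(y_true, y_pred))
```

Because it is a harmonic mean, F1 is high only when both precision and recall are high.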

In [23]:
modelNB = naive_bayes.MultinomialNB()
modelNB.fit(Xtrain, Ytrain)

trainPred = modelNB.predict(Xtrain)
testPred = modelNB.predict(Xtest)

trainScore = f1_score(Ytrain,trainPred)
testScore = f1_score(Ytest,testPred)

print("Train score: " + str(trainScore))
print("Test score: " + str(testScore))

Train score: 0.9837322798047874
Test score: 0.7029430446749947


In [24]:
modelSVM = svm.SVC(kernel="linear", C=1.0)
modelSVM.fit(Xtrain, Ytrain)

trainPred = modelSVM.predict(Xtrain)
testPred = modelSVM.predict(Xtest)

trainScore = f1_score(Ytrain,trainPred)
testScore = f1_score(Ytest,testPred)

print("Train score: " + str(trainScore))
print("Test score: " + str(testScore))

Train score: 0.9946131805157593
Test score: 0.8342954159592529


## Conclusion
We have managed to create a text classifier of decent quality. The SVM classifier with a linear kernel reached a test F1 score of 0.834, compared with 0.703 for Multinomial Naive Bayes. Both models score far higher on the training set (0.995 and 0.984), which suggests that they overfit to some degree. With first results like these in hand, it is worth going back and experimenting with the chosen parameters. Since it is almost impossible to guess the optimal parameters by hand, a systematic search can improve the model's quality significantly.
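One possible way to run such a parameter search is with the `Pipeline` and `GridSearchCV` classes imported at the top of the notebook. The following is only a sketch under stated assumptions: the toy headlines, their labels, the step names `tfidf` and `svm`, and the candidate values in `param_grid` are all hypothetical and chosen for illustration, not tuned for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only (not from the real dataset).
toy_headlines = [
    "area man wins argument with himself",
    "nation shocked that monday arrived again",
    "local cat declares itself mayor",
    "study finds water still wet",
    "senate passes budget bill after long debate",
    "city opens new public library downtown",
    "scientists publish report on ocean warming",
    "team wins championship in overtime",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so each CV fold fits TF-IDF on its own training part.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])

# Candidate values are assumptions; "<step>__<param>" addresses a parameter of a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 3)],
    "svm__C": [0.1, 1.0, 10.0],
}

# Score every combination with F1 using 2-fold cross-validation (small because the toy set is tiny).
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(toy_headlines, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data the search would be fitted on the raw training headlines rather than on the already vectorized matrix, and a larger `cv` value would give more stable estimates.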