         Sarcasm Classification Using Naive Bayes and Logistic Regression Models


Introduction:

Goal:
The goal of this assignment is to build Naive Bayes and Logistic Regression classification models that predict whether a piece of text is sarcastic. Different features are given to the models as input, and the expected output is a binary label: sarcastic or not. The models are evaluated using 10-fold cross-validation, with accuracy and F-score as the metrics.

Dataset: The dataset used for this assignment consists of news headlines collected from two news websites: The Onion, which produces sarcastic versions of current events, and HuffPost, which provides real (non-sarcastic) news headlines.
Source: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

Programming Component

Step 1: Import all the required libraries

In [1]:
import math 
import random 
import numpy as np
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

Step 2: Load the Data

In [2]:
df = pd.read_json(r'/Users/sriharshithaayyalasomayajula/Desktop/Sarcasm_Headlines_Dataset.json', lines=True)
df.head()

   is_sarcastic  headline                                            article_link
0  1             thirtysomething scientists unveil doomsday clo...  https://www.theonion.com/thirtysomething-scien...
1  0             dem rep. totally nails why congress is falling...   https://www.huffingtonpost.com/entry/donna-edw...
2  0             eat your veggies: 9 deliciously different recipes   https://www.huffingtonpost.com/entry/eat-your-...
3  1             inclement weather prevents liar from getting t...   https://local.theonion.com/inclement-weather-p...
4  1             mother comes pretty close to using word 'strea...   https://www.theonion.com/mother-comes-pretty-c...


In [3]:
#Checking if there are any null values in the data
print(df.isnull().any(axis = 0))

is_sarcastic    False
headline        False
article_link    False
dtype: bool


Step 3: Define the corpus

In [4]:
#Defining the corpus 
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['headline'][i])
    review = review.lower()
    corpus.append(review)
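
To make the effect of this cleaning step concrete, the sketch below applies the same substitution and lowercasing to a single headline. The helper name `clean_headline` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_headline(text):
    """Replace every non-letter character with a space, then lowercase.

    These are the same two steps the corpus loop applies to each headline.
    """
    return re.sub('[^a-zA-Z]', ' ', text).lower()

sample = "Eat your veggies: 9 deliciously different recipes"
cleaned = clean_headline(sample)
print(cleaned)          # digits and punctuation become spaces
print(cleaned.split())  # the tokens a vectorizer would later see
```

Note that digits and punctuation are discarded by this step, so cues such as quotation marks or numbers are not available to the models as features.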

Step 4: Feature Engineering with N-grams

In [25]:
#Using n-grams as features, fitting the data with CountVectorizer
#Here I use unigrams, bigrams and trigrams together

cv = CountVectorizer(max_features=5000,ngram_range=(1,3))
X = cv.fit_transform(corpus).toarray()
y = df['is_sarcastic']
y.head()

0    1
1    0
2    0
3    1
4    1
Name: is_sarcastic, dtype: int64
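
As a rough illustration of what `ngram_range=(1,3)` means, the sketch below lists the word n-grams of one cleaned headline in plain Python. The function `word_ngrams` is a hypothetical helper written for this example; CountVectorizer does the equivalent internally, counts each n-gram per headline, and with `max_features=5000` keeps only the 5000 most frequent terms across the corpus. Its default tokenizer also ignores single-character tokens, which this sketch does not reproduce.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = "inclement weather prevents liar".split()
grams = word_ngrams(tokens)
for gram in grams:
    print(gram)
```

For a four-word headline this yields four unigrams, three bigrams and two trigrams.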

In [26]:
#Splitting the data into train and test sets (80/20)
x_train,x_test,y_train,y_test = train_test_split(X,y,random_state = 0,test_size = 0.2)

Step 5: Classification models and Evaluation

In [27]:
# Naive Bayes Classifier with N-grams 
from sklearn.naive_bayes import MultinomialNB

modelNB = MultinomialNB().fit(x_train,y_train)
y_pred = modelNB.predict(x_test)

#Accuracy and F1 score for Naive Bayes on the held-out 20% test split
#(metrics.f1_score is called through the module, so a separate f1_score import is not needed)
from sklearn import metrics

score = metrics.accuracy_score(y_test,y_pred)
f1_score = metrics.f1_score(y_test,y_pred)
print("Accuracy of Naive Bayes Model:", score)
print("F Score of Naive Bayes Model:", f1_score)

#Naive Bayes with 10-fold cross-validation
#cross_val_score refits a clone of the estimator on each fold, so no prior fit is needed
from sklearn.model_selection import cross_val_score

model1 = MultinomialNB()
score_NB = cross_val_score(model1,X,y,cv = 10)
print("Accuracy of Naive Bayes Model with 10 fold cross Validation:", score_NB.mean())

Accuracy of Naive Bayes Model: 0.826869322152341
F Score of Naive Bayes Model: 0.8175290001841282
Accuracy of Naive Bayes Model with 10 fold cross Validation: 0.8368914735896198
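
For reference, the two metrics reported above can be computed by hand from true and predicted labels. The sketch below does this for a small made-up label list (the labels and the helper `accuracy_and_f1` are assumptions for illustration only, not results from the notebook); F1 is the harmonic mean of precision and recall for the sarcastic class.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1 score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

# Toy labels (assumption): 1 = sarcastic, 0 = not sarcastic
toy_true = [1, 0, 0, 1, 1, 0, 1, 0]
toy_pred = [1, 0, 1, 1, 0, 0, 1, 0]
toy_acc, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print("accuracy:", toy_acc, "F1:", toy_f1)
```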


In [28]:
# Logistic Regression Model with N-grams 
from sklearn.linear_model import LogisticRegression

modelLR = LogisticRegression(solver='liblinear', random_state=0).fit(x_train, y_train)
y_predLR = modelLR.predict(x_test)

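#Accuracy and F1 score for Logistic Regression on the held-out 20% test split

score_LR = metrics.accuracy_score(y_test,y_predLR)
f1_score_LR = metrics.f1_score(y_test,y_predLR)
#Print the LR metrics; the recorded output below was produced with the NB variables
#`score` and `f1_score`, which is why its first two lines repeat the Naive Bayes values
print("Accuracy of Logistic Regression model:", score_LR)
print("F Score of Logistic Regression model:", f1_score_LR)

#Logistic Regression with 10-fold cross-validation
#cross_val_score refits a clone of the estimator on each fold, so no prior fit is needed

model2 = LogisticRegression(solver='liblinear', random_state=0)
score_LRM = cross_val_score(model2,X,y,cv=10)
print("Accuracy of LR Model with 10 fold cross validation:", score_LRM.mean())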

Accuracy of Logistic Regression model: 0.826869322152341
F Score of Logistic Regression model: 0.8175290001841282
Accuracy of LR Model with 10 fold cross validation: 0.8388481960952993
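
The cross-validation calls above report accuracy only, and the CountVectorizer was fitted on the whole corpus before the folds were formed. One possible variation, sketched below under stated assumptions, wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is learned from the training folds only, and asks `cross_validate` for both accuracy and F1. The toy corpus, the labels and the helper `cv_scores` are made up for illustration; with the notebook's data one would pass `corpus`, `y` and `folds=10` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (assumption): 1 = sarcastic, 0 = not sarcastic
toy_corpus = [
    "area man wins argument with himself",
    "nation shocked by obvious news",
    "local man declares himself expert",
    "area dog unimpressed by nation",
    "senate passes new budget bill",
    "court rules on budget case",
    "senate vote delayed by court",
    "new bill heads to senate",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

def cv_scores(classifier, texts, labels, folds):
    """Mean cross-validated accuracy and F1 for vectorizer + classifier."""
    pipeline = make_pipeline(
        CountVectorizer(max_features=5000, ngram_range=(1, 3)),
        classifier,
    )
    result = cross_validate(pipeline, texts, labels, cv=folds,
                            scoring=['accuracy', 'f1'])
    return result['test_accuracy'].mean(), result['test_f1'].mean()

# Only 2 folds here because the toy corpus is tiny
acc_nb, f1_nb = cv_scores(MultinomialNB(), toy_corpus, toy_labels, folds=2)
acc_lr, f1_lr = cv_scores(LogisticRegression(solver='liblinear', random_state=0),
                          toy_corpus, toy_labels, folds=2)
print("NB accuracy / F1:", acc_nb, f1_nb)
print("LR accuracy / F1:", acc_lr, f1_lr)
```

Passing raw text to the pipeline also avoids building the dense `X` array up front, since the vectorizer's sparse output is handed to the classifier directly.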


In [29]:
models = ['Naive Bayes', 'Logistic Regression']
col = [score_NB.mean() ,score_LRM.mean() ]
data ={'Models': models, 'Accuracy': col}
graph_df = pd.DataFrame(data)
graph_df

   Models               Accuracy
0  Naive Bayes          0.836891
1  Logistic Regression  0.838848
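
Both models in this comparison are linear classifiers over the n-gram counts, which may help explain why their accuracies are so close. The sketch below, using made-up toy documents and a hypothetical helper `nb_linear_weights`, shows that a two-class multinomial Naive Bayes model with Laplace smoothing reduces to a weight per word plus a bias, and that its posterior is the sigmoid of that linear score. Logistic Regression uses the same functional form but fits the weights directly to the labels instead of deriving them from per-class word frequencies.

```python
import math
from collections import Counter

def nb_linear_weights(docs, labels, alpha=1.0):
    """Express two-class multinomial Naive Bayes as word weights and a bias.

    The log-odds of class 1 versus class 0 for a document is
    bias + sum of weights[w] over its tokens, a linear function of the counts.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    counts = {0: Counter(), 1: Counter()}
    for doc, label in zip(docs, labels):
        counts[label].update(doc.split())
    # Smoothed total token count per class
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in (0, 1)}
    weights = {
        w: math.log((counts[1][w] + alpha) / totals[1])
           - math.log((counts[0][w] + alpha) / totals[0])
        for w in vocab
    }
    n_pos = sum(labels)
    bias = math.log(n_pos / (len(labels) - n_pos))  # log prior odds
    return weights, bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy documents (assumption): 1 = sarcastic, 0 = not sarcastic
toy_docs = ["area man wins nothing", "area man shocked",
            "senate passes budget", "senate vote today"]
toy_y = [1, 1, 0, 0]
weights, bias = nb_linear_weights(toy_docs, toy_y)

# Words outside the vocabulary contribute nothing to the score
log_odds = bias + sum(weights.get(w, 0.0) for w in "area man".split())
prob_sarcastic = sigmoid(log_odds)
print("P(sarcastic | 'area man') =", prob_sarcastic)
```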


Written Component:

1. Here I used a combination of unigrams, bigrams and trigrams as features. Under 10-fold cross-validation, the accuracy is 83.69% for the NB classifier and 83.88% for the LR classifier. I also experimented with other combinations of n-grams: with unigrams and bigrams, the accuracy of both classifiers was between 69% and 73%; with bigrams and trigrams, it was between 79% and 82%. I therefore chose the combination of all three.

2. For this data, the best model is Logistic Regression, although the margin is small: 83.88% versus 83.69% accuracy under 10-fold cross-validation. LR is often a strong choice for binary classification. Both LR and NB are linear classifiers, but LR is discriminative: it models the probability of the class given the features directly, whereas NB derives that probability from per-class word likelihoods under an independence assumption. With n-grams as features, LR achieved an accuracy of 83.88%.

3. For error analysis, more thorough pre-processing would likely help reduce errors. For instance, if web links are part of the input (the dataset contains an article_link column), the model would also consider tokens such as https, which would reduce performance; a column that is not needed, such as article_link here, can be dropped. Removing stop words, PoS tags (in this case), links, emoticons and punctuation may help the model predict more accurately whether a piece of text is sarcastic.

4. For future work, I would like to experiment with stop words and punctuation, and also see how embeddings would affect the performance of the models. I would also like to see how a neural network model would work with these features and whether it would increase accuracy compared to the linear classifiers.
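
The pre-processing ideas in points 3 and 4 can be sketched as follows. This is a minimal example under stated assumptions: the helper `preprocess`, the short `STOP_WORDS` set and the example URL are all made up for illustration (a fuller stop-word list, such as the NLTK one imported in Step 1, could be substituted), and whether these steps actually improve accuracy on this dataset would need to be measured.

```python
import re

# Illustrative subset of English stop words (assumption, not a complete list)
STOP_WORDS = {"a", "an", "the", "to", "of", "is", "in", "for", "from"}

def preprocess(text):
    """Drop web links, keep letters only, lowercase, and remove stop words."""
    text = re.sub(r'https?://\S+', ' ', text)       # remove links first
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()  # same cleaning as Step 3
    return [tok for tok in text.split() if tok not in STOP_WORDS]

example = "Inclement weather prevents liar from getting to work https://example.com/story"
tokens_out = preprocess(example)
print(tokens_out)
```

Removing the link before the letters-only substitution matters: otherwise fragments such as `https` and `com` would survive as ordinary tokens.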

References: 
1. https://scikit-learn.org/stable/modules/cross_validation.html?highlight=f1+score
2. https://thinkingneuron.com/how-to-generate-n-grams-in-python/
3. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html