<a href="https://colab.research.google.com/github/ttvhh/CS114.K21/blob/master/%5BCase_Study%5D_Sarcasm_Detection_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# **PROBLEM DESCRIPTION**

In this case study, we are given a news headlines dataset, and our goal is to predict whether a given headline is sarcastic or not.

We use the News Headlines Dataset For Sarcasm Detection from Kaggle. The dataset is collected from two websites: ***The Onion***, which publishes sarcastic versions of current events, and ***HuffPost***, which publishes real news headlines.

You can get this dataset from the link below:

> [Click here to download dataset](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection/data#)





## **Dataset Description**

This dataset includes three columns:
*   ```article_link``` (type: Object): contains links to the news articles
*   ```headline``` (type: Object): contains headlines of the news articles
*   ```is_sarcastic``` (type: int64): contains 0 (for nonsarcastic) and 1 (for sarcastic)

## **Source Code**

First, we import the libraries and modules required for this case study.

In [16]:
import pandas as pd
import numpy as np, re, time
from nltk.stem.porter import PorterStemmer

Next, we import the models and modules needed for training and evaluation.

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In this step, we upload the JSON file to Colab and load the dataset into a pandas DataFrame.

In [37]:
import json
from google.colab import files
uploaded = files.upload()

Saving Sarcasm_Headlines_Dataset_v2.json to Sarcasm_Headlines_Dataset_v2 (1).json


In [38]:
# Loading dataset from json file
data = pd.read_json('Sarcasm_Headlines_Dataset_v2.json', lines = True)
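
The `lines = True` argument tells pandas that the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON array. As a minimal sketch, the snippet below reads two made-up records from an in-memory buffer instead of the real file; the headlines and `example.com` links are placeholders, not rows from the dataset.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"is_sarcastic": 1, "headline": "made-up sarcastic headline", '
    '"article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "made-up regular headline", '
    '"article_link": "https://example.com/b"}\n'
)

# lines=True parses each line as a separate record (one DataFrame row).
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
print(sample_df.dtypes)
```

Each line becomes one row, and each key becomes one column.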

We use the ```data.head()``` method of the pandas DataFrame to see the first five rows of the loaded dataset.



In [39]:
data.head()

|   | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


Before going further, we check the dataset for null values. If there are none, we can proceed to preprocess the data.

In [40]:
print(data.isnull().any(axis = 0))

is_sarcastic    False
headline        False
article_link    False
dtype: bool
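
For comparison, here is a minimal sketch of what the same check reports when a value is missing, and one possible way to handle it. The toy frame is made up for illustration, and dropping rows is an assumption about how one might proceed, not a step the notebook needs for this dataset.

```python
import pandas as pd

# Hypothetical toy frame with one missing headline.
toy = pd.DataFrame({
    "is_sarcastic": [1, 0, 0],
    "headline": ["first headline", None, "third headline"],
})

has_null = toy.isnull().any(axis=0)         # per-column flag, as in the cell above
null_counts = toy.isnull().sum()            # per-column count of missing values
cleaned = toy.dropna(subset=["headline"])   # keep only rows that have a headline

print(has_null)
print(null_counts)
print(len(cleaned))
```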


## **Cleaning the data**

The ```headline``` column contains special symbols and digits that have to be removed.

We use a regular expression to replace every character that is not a letter with a space.

In [41]:
# Replace every character that is not an ASCII letter (symbols, digits) with a space
data['headline'] = data['headline'].apply(lambda s: re.sub(r'[^a-zA-Z]', ' ', s))
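
To see what this substitution does to a single string, the sketch below applies the same pattern to one of the headlines shown earlier. The helper name `clean_headline` and the whitespace-collapsing step are illustrative assumptions, not part of the notebook.

```python
import re

def clean_headline(text):
    # Same pattern as the cell above: anything that is not an ASCII letter
    # is replaced by a single space (one space per removed character).
    return re.sub(r'[^a-zA-Z]', ' ', text)

sample = "eat your veggies: 9 deliciously different recipes"
cleaned = clean_headline(sample)
print(repr(cleaned))

# Optional extra step (not in the notebook): collapse the runs of spaces left behind.
collapsed = ' '.join(cleaned.split())
print(repr(collapsed))
```

The substitution keeps the string length unchanged, so removed characters leave runs of spaces. That is harmless here because the later stemming step splits on whitespace.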

## **Feature and label extractions**

As shown above, our dataset has three columns. We do not use ```article_link``` as a feature: the domain of the link identifies the source website, and therefore the label, directly. So the only feature that we have here is the ```headline``` column, and ```is_sarcastic``` is the only label in our dataset.

In [42]:
# Getting features and labels
features = data['headline']
labels = data['is_sarcastic']

## **Stemming of features**

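Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem approximates the ***root of the word***; unlike a lemma, it does not have to be a valid dictionary word.

Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

We can explore this through the example below:

*   original words: ```reading``` and ```reads```
*   stemmed words: ```read```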




In [43]:
# Stemming the data
ps = PorterStemmer()

features = features.apply(lambda x: x.split())
features = features.apply(lambda x: ' '.join([ps.stem(word) for word in x]))
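
As a quick check of what the stemmer does to individual words and to a whole headline, here is a minimal sketch using the same `PorterStemmer`. The helper name `stem_headline` is made up for illustration.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_headline(text):
    # Split on whitespace, stem each token, and join the stems back together.
    return ' '.join(ps.stem(word) for word in text.split())

# Two inflected forms that reduce to the same stem.
print(ps.stem("reading"))
print(ps.stem("reads"))

# A full headline, processed the same way as the cell above.
print(stem_headline("scientists unveil doomsday clock"))
```

Different inflections of a word collapse onto one token, which shrinks the vocabulary the vectorizer has to handle.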

## **Vectorization of features**

A very common way to transform text into a meaningful numeric representation is ***TF-IDF (Term Frequency-Inverse Document Frequency)***.

This technique is widely used to *extract features* across various NLP applications.

***TF - Term Frequency***: measures how often a word occurs in a given document.

***IDF - Inverse Document Frequency***: measures how rare a word is across all documents in the corpus; words that appear in few documents receive a higher weight.

>$TF(i, j) = \frac{\text{frequency of term } i \text{ in document } j}{\text{total number of words in document } j}$






In [44]:
# Vectorizing the data with maximum of 5000 features
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features = 5000)
features = list(features)
features = tv.fit_transform(features).toarray()
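
To make the IDF part concrete, the sketch below fits a `TfidfVectorizer` with default settings on a made-up three-document corpus and recomputes the IDF weights by hand. With its defaults, scikit-learn uses the smoothed form $\ln\frac{1 + n}{1 + df} + 1$ and then scales each row to unit L2 norm, so the numbers differ slightly from the plain TF ratio given above. The corpus and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus.
corpus = [
    "local man wins award",
    "local woman wins race",
    "scientists unveil clock",
]

tv_demo = TfidfVectorizer()
matrix = tv_demo.fit_transform(corpus)   # sparse matrix, one row per document

# Terms ordered by their column index in the matrix.
vocab = sorted(tv_demo.vocabulary_, key=tv_demo.vocabulary_.get)
n_docs = len(corpus)

# Document frequency: in how many documents each term appears.
doc_freq = np.asarray((matrix > 0).sum(axis=0)).ravel()

# Smoothed IDF used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

for term, idf in zip(vocab, manual_idf):
    print(f"{term:12s} idf={idf:.3f}")
```

Terms that occur in two documents (`local`, `wins`) get a lower IDF than terms that occur in only one.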

## **Training and Testing data**

It's time to split our data into training and testing sets.


In [45]:
# Getting training and testing dataset
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size = .05, random_state = 0)
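
With `test_size = .05`, only 5% of the rows are held out for testing. The sketch below shows the resulting split sizes on made-up stand-in data. The `stratify` argument is an optional addition, not used in the notebook, that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 features, balanced binary labels.
X = np.arange(400).reshape(100, 4)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (95, 4) (5, 4)
```

A small test set leaves more data for training, but the test accuracy measured on it is a noisier estimate.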

## **Training and Testing of models**

We train several models using different machine learning algorithms:
1.   Linear Support Vector Classifier
2.   Gaussian Naive Bayes
3.   Logistic Regression
4.   Random Forest Classifier




In [46]:
# Model 1: Linear Support Vector Classifier

lsvc = LinearSVC()

# Training the model
lsvc.fit(features_train, labels_train)

# Getting the score of train and test data
print(lsvc.score(features_train, labels_train))
print(lsvc.score(features_test, labels_test))

0.9069074591731646
0.8322851153039832
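
A single train/test split gives one accuracy number. The notebook imports `cross_val_score` but does not use it; as a hedged sketch of how it could be applied, the snippet below runs 5-fold cross-validation for `LinearSVC` on synthetic data that stands in for the TF-IDF features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels (not the real dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 5-fold cross-validation: fit on 4 folds, score on the remaining one, 5 times.
scores = cross_val_score(LinearSVC(), X, y, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

The mean and the spread of the fold scores give a more stable picture than one split alone.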


In [47]:
# Model 2: Gaussian Naive Bayes

gnb = GaussianNB()

# Training the model
gnb.fit(features_train, labels_train)

# Getting the score of train and test data
print(gnb.score(features_train, labels_train))
print(gnb.score(features_test, labels_test))

0.7977416507282624
0.7169811320754716


In [48]:
# Model 3: Logistic Regression

lr = LogisticRegression()

# Training the model
lr.fit(features_train, labels_train)

# Getting the score of train and test data
print(lr.score(features_train, labels_train))
print(lr.score(features_test, labels_test))

0.8778137413564808
0.8252969951083159
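
Accuracy alone does not show which kind of mistake a model makes. The notebook imports `confusion_matrix` without using it; the sketch below shows one way it could be applied to a logistic regression model. The data is synthetic, and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features and labels (not the real dataset).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Accuracy recovered from the matrix matches clf.score on the same data.
accuracy = (tn + tp) / cm.sum()
print(cm)
print(accuracy)
```

In the sarcasm setting, a false positive would be a real headline flagged as sarcastic, and a false negative a sarcastic headline that was missed.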


In [49]:
# Model 4: Random Forest Classifier

rfc = RandomForestClassifier(n_estimators= 10, random_state= 0)

# Training the model
rfc.fit(features_train, labels_train)

# Getting the score of train and test data
print(rfc.score(features_train, labels_train))
print(rfc.score(features_test, labels_test))

0.9883404443136677
0.7721872816212438
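
In the results above, the random forest has the largest gap between training accuracy (about 0.99) and test accuracy (about 0.77), which suggests it overfits more than the other three models. As a minimal sketch of how such a comparison could be collected in one place, the loop below fits the same four model types on synthetic stand-in data and records the train/test gap. The data, the `max_iter=1000` setting, and the `results` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels (not the real dataset).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LinearSVC": LinearSVC(),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)   # mean accuracy on training data
    test_acc = model.score(X_test, y_test)      # mean accuracy on held-out data
    results[name] = (train_acc, test_acc, train_acc - test_acc)
    print(f"{name:20s} train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```

A large positive gap is a hint to regularize the model or limit its capacity, for example by restricting tree depth in the random forest.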
