# Sarcasm detection

Original dataset source:

- [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection)

Build a predictive model

- Compare: Naive Bayes (NB), k-nearest neighbors (KNN), support vector machines (SVM)
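
The comparison could be set up roughly as follows. This is a minimal sketch, not the notebook's final model: the toy headlines and labels are invented for illustration, and `TfidfVectorizer`, `MultinomialNB`, `KNeighborsClassifier(n_neighbors=3)` and `LinearSVC` are assumed choices for the text features and for each model family.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only (1 = sarcastic, 0 = not sarcastic).
headlines = [
    "area man heroically finishes entire sandwich",
    "nation shocked to learn monday follows sunday",
    "local dog declares himself very good boy",
    "study finds water still wet",
    "senate passes annual budget bill",
    "city council opens new public library",
    "scientists publish results of climate survey",
    "school district announces new bus routes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One assumed representative per model family.
candidates = {
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": LinearSVC(),
}

def compare_models(texts, y, cv=2):
    """Return the mean cross-validated accuracy of each candidate classifier."""
    scores = {}
    for name, clf in candidates.items():
        # Vectorize inside the pipeline so each fold fits TF-IDF on its own training split.
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores[name] = cross_val_score(pipeline, texts, y, cv=cv).mean()
    return scores

scores = compare_models(headlines, labels)
print(scores)
```

On the real data, `compare_models(df['cleaned_headline'], df['is_sarcastic'], cv=5)` would be the analogous call once the cleaning step below has run.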

Theoretical sources

- [NB](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [KNN](https://www.codecademy.com/learn/introduction-to-supervised-learning-skill-path/modules/k-nearest-neighbors-skill-path/cheatsheet)
- [SVM](https://es.wikipedia.org/wiki/M%C3%A1quinas_de_vectores_de_soporte)

Sklearn algorithm references

- [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)
- [One Hot Encoder](https://datagy.io/sklearn-one-hot-encode/)
- [Text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [NB](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

## Import data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
dataset_name = 'Sarcasm_Headlines_Dataset.json'

In [None]:
df = pd.read_json(dataset_name, lines=True)
df.head()

In [None]:
df.shape

In [None]:
df.isna().sum()

## Clean headlines

In [None]:
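import re

import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet

nltk.download('stopwords')
nltk.download('wordnet')  # corpus behind wordnet.synsets
nltk.download('omw-1.4')
nltk.download('punkt')    # tokenizer models for nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab'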

In [None]:
# Build the stop-word set once instead of reloading it for every headline
en_stop = set(stopwords.words('english'))

def denoise(text):
    """Tokenize a headline, normalize each token, and keep only known English words."""
    tokenizer = lambda text : nltk.word_tokenize(text)
    to_lower = lambda text : text.lower()
    parse_url = lambda text : re.sub(r'http\S+' , '' , text)  # raw string avoids an invalid-escape warning
    strip = lambda text : text.strip()
    to_raw = lambda text : re.sub(r'[^a-z\s]', '', text)  # drop every character except a-z and whitespace

    ws = tokenizer(text)
    ws = [to_lower(w) for w in ws]
    ws = [parse_url(w) for w in ws]
    ws = [strip(w) for w in ws]
    ws = [to_raw(w) for w in ws]
    ws = [w for w in ws if w not in en_stop]
    ws = [w for w in ws if wordnet.synsets(w)]  # keep only words WordNet knows (at least one synset)

    return ' '.join(ws).strip()
    
df['cleaned_headline'] = df['headline'].apply(denoise)

df[['headline', 'cleaned_headline']].head()
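
For comparison, the same normalization steps can be sketched with `re` alone. This is an illustrative variant, not a replacement for `denoise`: the helper name `denoise_plain` is hypothetical, it splits on whitespace instead of calling NLTK, it skips the WordNet filter, and the stop words are passed in by the caller. The one deliberate difference is that the URL rule runs on the whole string before splitting, so the pattern sees each URL in one piece.

```python
import re

URL_RE = re.compile(r'http\S+')          # a URL up to the next whitespace
NON_LETTER_RE = re.compile(r'[^a-z\s]')  # anything except a-z and whitespace

def denoise_plain(text, stop_words=frozenset()):
    """Lowercase, strip URLs and non-letters, then drop stop words."""
    text = text.lower()
    text = URL_RE.sub('', text)          # remove URLs while they are still whole
    text = NON_LETTER_RE.sub('', text)   # drop digits, punctuation and symbols
    return ' '.join(w for w in text.split() if w not in stop_words)

example = denoise_plain(
    "Read THIS: http://example.com/a?b=1 is the top story!",
    stop_words={'is', 'the'},
)
print(example)  # read this top story
```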

## Visualize headline tokens with WordCloud

In [None]:
from wordcloud import WordCloud 

In [None]:
non_sarcastic_headline_df = df[df['is_sarcastic'] == 0]['cleaned_headline']
sarcastic_headline_df = df[df['is_sarcastic'] == 1]['cleaned_headline']

non_sarcastic_headline_np = non_sarcastic_headline_df.values
sarcastic_headline_np = sarcastic_headline_df.values

In [None]:
text = ' '.join(non_sarcastic_headline_np)
plt.figure(figsize = (10,10))
wc = WordCloud(width = 2000 , height = 1000 , max_words = 500).generate(text)
plt.axis('off')
plt.title('Word cloud of non-sarcastic words')
plt.imshow(wc , interpolation = 'bilinear')

In [None]:
text = ' '.join(sarcastic_headline_np)
plt.figure(figsize = (10,10))
wc = WordCloud(width = 2000 , height = 1000 , max_words = 500).generate(text)
plt.axis('off')
plt.title('Word cloud of sarcastic words')
plt.imshow(wc , interpolation = 'bilinear')
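
A word cloud shows relative frequency only visually. As a numeric complement, the most frequent tokens can be counted directly. This is a small sketch using only the standard library; the helper name `top_tokens` and the sample headlines are made up for illustration.

```python
from collections import Counter

def top_tokens(headlines, n=10):
    """Count whitespace-separated tokens across headlines and return the n most common."""
    counts = Counter()
    for headline in headlines:
        counts.update(headline.split())
    return counts.most_common(n)

# Invented sample; in the notebook the input would be an array of cleaned headlines.
sample = ["man eats sandwich", "man wins award", "sandwich man"]
top = top_tokens(sample, n=2)
print(top)  # [('man', 3), ('sandwich', 2)]
```

In the notebook, `top_tokens(sarcastic_headline_np)` and `top_tokens(non_sarcastic_headline_np)` would give the counts behind the two clouds.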

## The headline transmitter (source site)

In [None]:
extract_transmitter = lambda source : source.split('.')[1]
df['transmitter'] = df['article_link'].apply(extract_transmitter)
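
`source.split('.')[1]` relies on every link having a subdomain such as `www.` in front of the site name; for a link like `https://theonion.com/x` it would return `com/x`. A slightly more defensive variant, sketched below, parses the host with `urllib.parse` first. The helper name `extract_site` and the example URLs are hypothetical, and the sketch assumes a single-label top-level domain such as `.com` (it would not handle hosts ending in `.co.uk`).

```python
from urllib.parse import urlparse

def extract_site(url):
    """Return the label just before the top-level domain of a URL's host."""
    host = urlparse(url).netloc.lower()
    labels = host.split('.')
    # With fewer than two labels there is no site name to pick out, so return the host as-is.
    return labels[-2] if len(labels) >= 2 else host

print(extract_site("https://www.theonion.com/some-article"))    # theonion
print(extract_site("https://local.theonion.com/some-article"))  # theonion
print(extract_site("https://theonion.com/some-article"))        # theonion
```

It could be applied the same way as the lambda: `df['article_link'].apply(extract_site)`.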

In [None]:
df.drop('article_link', inplace=True, errors='ignore', axis=1)
df.head()

In [None]:
# TODO: encode the transmitter column
df['transmitter'].unique()

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
transmitter_transformed = ohe.fit_transform(df[['transmitter']])
print(ohe.categories_)
print(transmitter_transformed.toarray()[0:5])
# This is one way to encode the column, but
# I'll use the ColumnTransformer class to accomplish this instead
# df[ohe.categories_[0]] = transmitter_transformed.toarray()
# df.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(), ['transmitter']),
    remainder='passthrough')

transformed = transformer.fit_transform(df)

transformed_df = pd.DataFrame(
    transformed, 
    columns=transformer.get_feature_names_out()
)

transformed_df.head()
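
The same column transformer can also turn the cleaned text into numeric features, which a classifier will need. The sketch below is one possible setup under stated assumptions: the toy rows are invented, `TfidfVectorizer` is an assumed choice from the text feature extraction guide linked at the top, and `handle_unknown='ignore'` is added so that sites unseen during `fit` do not raise at prediction time.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the same column names as the notebook; the rows are invented.
toy = pd.DataFrame({
    'transmitter': ['theonion', 'huffingtonpost', 'theonion', 'huffingtonpost'],
    'cleaned_headline': [
        'area man finishes sandwich',
        'senate passes budget bill',
        'nation learns monday follows sunday',
        'city opens new library',
    ],
})

features = make_column_transformer(
    # One-hot encode the site; categories unseen during fit become all-zero rows.
    (OneHotEncoder(handle_unknown='ignore'), ['transmitter']),
    # Text vectorizers expect a 1-D sequence of documents, so the column is
    # given as a plain string rather than a one-element list.
    (TfidfVectorizer(), 'cleaned_headline'),
)

X = features.fit_transform(toy)
print(X.shape)  # (4, 19): 2 one-hot columns + 17 distinct tokens
```

Per the dataset's description, the sarcastic headlines come from The Onion and the non-sarcastic ones from HuffPost, so the `transmitter` column may act as a direct proxy for the label; it is kept here only to show the mechanics of combining column types.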