Original dataset source:

- [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection)

Build a predictive model

- Compare three classifiers: Naive Bayes (NB), k-nearest neighbors (KNN), and support vector machines (SVM)
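
As a rough illustration of what such a comparison could look like, the sketch below cross-validates the three classifiers on headline text only. The toy headlines, the TF-IDF features, the linear SVM kernel, `n_neighbors=3`, and the two-fold split are all assumptions made for this example, not choices taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy headlines, invented for illustration only (1 = sarcastic).
headlines = [
    "area man heroically finishes entire sandwich",
    "local cat declares itself mayor of kitchen",
    "nation relieved as monday cancelled forever",
    "scientists confirm coffee is basically a vegetable",
    "city council approves new budget for road repairs",
    "company reports quarterly earnings above forecast",
    "storm expected to bring heavy rain this weekend",
    "university opens new research center for energy",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hyperparameters here are placeholders, not tuned values.
models = {
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="linear"),
}

scores = {}
for name, model in models.items():
    # Each pipeline vectorizes the text, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, headlines, labels, cv=2, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being fitted on the held-out fold. On the real dataset the same loop would take the `headline` column and the `is_sarcastic` column in place of the toy lists.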

Theoretical sources

- [NB](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [KNN](https://www.codecademy.com/learn/introduction-to-supervised-learning-skill-path/modules/k-nearest-neighbors-skill-path/cheatsheet)
- [SVM](https://es.wikipedia.org/wiki/M%C3%A1quinas_de_vectores_de_soporte) (article in Spanish)

scikit-learn references

- [Column Transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)
- [One Hot Encoder](https://datagy.io/sklearn-one-hot-encode/)
- [Text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [NB](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

In [226]:
import pandas as pd

In [227]:
dataset_name = 'Sarcasm_Headlines_Dataset.json'

In [228]:
df = pd.read_json(dataset_name, lines=True)
df.head()

Unnamed: 0,article_link,headline,is_sarcastic
0,https://www.huffingtonpost.com/entry/versace-b...,former versace store clerk sues over secret 'b...,0
1,https://www.huffingtonpost.com/entry/roseanne-...,the 'roseanne' revival catches up to our thorn...,0
2,https://local.theonion.com/mom-starting-to-fea...,mom starting to fear son's web series closest ...,1
3,https://politics.theonion.com/boehner-just-wan...,"boehner just wants wife to listen, not come up...",1
4,https://www.huffingtonpost.com/entry/jk-rowlin...,j.k. rowling wishes snape happy birthday in th...,0


In [229]:
df.shape

(26709, 3)

In [230]:
df.isna().sum()

article_link    0
headline        0
is_sarcastic    0
dtype: int64

In [231]:
# Take the second dot-separated piece of each URL (assumes a subdomain such as 'www.' or 'local.')
extract_transmitter = lambda source: source.split('.')[1]
df['transmitter'] = df['article_link'].apply(extract_transmitter)
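
Splitting on `'.'` and taking index 1 works for the links shown, but it would return the wrong piece for a URL with no subdomain. A possible alternative, sketched here with the standard library, parses the hostname first. The function below is a suggestion rather than the notebook's method, the example URLs use made-up paths, and it assumes a single-label top-level domain such as `.com`.

```python
from urllib.parse import urlparse


def extract_transmitter(url: str) -> str:
    """Return the hostname label just before the top-level domain."""
    hostname = urlparse(url).hostname or ""
    labels = hostname.split(".")
    # Assumes a single-label TLD such as '.com'; 'example.co.uk' would need extra handling.
    return labels[-2] if len(labels) >= 2 else hostname


# Hypothetical URLs with made-up paths, for illustration only.
print(extract_transmitter("https://www.huffingtonpost.com/entry/example"))
print(extract_transmitter("https://theonion.com/example"))
```

It can be applied the same way as the lambda, with `df['article_link'].apply(extract_transmitter)`.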

In [232]:
df.drop('article_link', inplace=True, errors='ignore', axis=1)
df.head()

Unnamed: 0,headline,is_sarcastic,transmitter
0,former versace store clerk sues over secret 'b...,0,huffingtonpost
1,the 'roseanne' revival catches up to our thorn...,0,huffingtonpost
2,mom starting to fear son's web series closest ...,1,theonion
3,"boehner just wants wife to listen, not come up...",1,theonion
4,j.k. rowling wishes snape happy birthday in th...,0,huffingtonpost


In [233]:
# TODO: encode the transmitter column
df['transmitter'].unique()

array(['huffingtonpost', 'theonion'], dtype=object)
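
With only two sources, it may be worth checking how each one lines up with the label before encoding it. In the five rows displayed earlier, every `theonion` headline is labeled 1 and every `huffingtonpost` headline 0. If that pattern held across the whole dataset, the column would reveal the target on its own. The sketch below runs the check on a toy frame that mirrors only those five rows; it says nothing about the remaining rows.

```python
import pandas as pd

# Toy frame mirroring the sources and labels of the five rows shown by df.head().
sample = pd.DataFrame({
    "transmitter": ["huffingtonpost", "huffingtonpost", "theonion", "theonion", "huffingtonpost"],
    "is_sarcastic": [0, 0, 1, 1, 0],
})

# Count how often each source appears with each label.
table = pd.crosstab(sample["transmitter"], sample["is_sarcastic"])
print(table)

# True when every source maps to exactly one label value.
leaks_label = bool((sample.groupby("transmitter")["is_sarcastic"].nunique() == 1).all())
print(leaks_label)
```

Running the same two statements on the full `df` would show whether the pattern holds beyond the sample.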

In [234]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
transmitter_transformed = ohe.fit_transform(df[['transmitter']])
print(ohe.categories_)
print(transmitter_transformed.toarray()[0:5])
# This would be one way to transform the column, but
# I'll use the ColumnTransformer class to accomplish this instead:
# df[ohe.categories_[0]] = transmitter_transformed.toarray()
# df.head()

[array(['huffingtonpost', 'theonion'], dtype=object)]
[[1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]]
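
Since the feature has only two categories, the two output columns are complements of each other. One option, sketched below on a toy frame that mirrors the five rows above, is to pass `drop='if_binary'` so the encoder keeps a single indicator column. This is an optional variation, not something the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the transmitter values of the five rows shown above.
sample = pd.DataFrame({
    "transmitter": ["huffingtonpost", "huffingtonpost", "theonion", "theonion", "huffingtonpost"],
})

# drop='if_binary' drops the first category of any two-category feature,
# leaving one indicator column instead of two redundant ones.
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(sample[["transmitter"]]).toarray()

print(encoder.categories_)
print(encoded.shape)
```

The remaining column indicates the second category (`theonion`), since categories are sorted and the first is dropped.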


In [236]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

transformer = make_column_transformer(
    (OneHotEncoder(), ['transmitter']),
    remainder='passthrough')

transformed = transformer.fit_transform(df)

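transformed_df = pd.DataFrame(
    transformed,
    # get_feature_names() was removed in scikit-learn 1.2; newer versions need
    # get_feature_names_out(), which names the columns differently.
    columns=transformer.get_feature_names()
)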

transformed_df.head()



Unnamed: 0,onehotencoder__x0_huffingtonpost,onehotencoder__x0_theonion,headline,is_sarcastic
0,1.0,0.0,former versace store clerk sues over secret 'b...,0
1,1.0,0.0,the 'roseanne' revival catches up to our thorn...,0
2,0.0,1.0,mom starting to fear son's web series closest ...,1
3,0.0,1.0,"boehner just wants wife to listen, not come up...",1
4,1.0,0.0,j.k. rowling wishes snape happy birthday in th...,0
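
The `headline` column is still raw text at this point, so a classifier cannot use it yet. One way to handle it, following the text feature extraction reference listed at the top, is to add a vectorizer to the same column transformer. The sketch below uses `CountVectorizer` on a toy frame; the headlines are invented for illustration and the choice of vectorizer is an assumption.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows, invented for illustration only.
sample = pd.DataFrame({
    "headline": [
        "city council approves new budget",
        "local cat declares itself mayor",
        "storm expected to bring heavy rain",
        "area man finishes entire sandwich",
    ],
    "transmitter": ["huffingtonpost", "theonion", "huffingtonpost", "theonion"],
})

text_transformer = make_column_transformer(
    # Text vectorizers expect a 1-D column, so the name is a plain string, not a list.
    (CountVectorizer(), "headline"),
    # OneHotEncoder expects 2-D input, so its column goes in a list.
    (OneHotEncoder(), ["transmitter"]),
)

features = text_transformer.fit_transform(sample)
print(features.shape)
```

The result has one column per vocabulary word plus the two one-hot columns, and can be passed to a classifier in place of the raw frame.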
