## Detecting Clickbait Headlines in Indonesia

In this notebook, we try to predict whether an Indonesian news headline is clickbait or not.

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv("/kaggle/input/judul-artikel-online-dengan-label-clickbait/primary-dataset.csv")

Let's add a new column called `length`, which is simply the number of words in each headline.

In [None]:
data["length"] = [len(text.split()) for text in data.text]
data.head()

## 1. Exploratory Data Analysis

First, let's look at how the data is distributed.

In [None]:
print(len(data.index))

There are 3237 headlines in this dataset.

In [None]:
data.label.value_counts().plot.bar()

There are fewer clickbait headlines than non-clickbait ones, but the two classes are close to balanced.
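To put a number on how balanced the classes are, one option is to look at label proportions rather than raw counts. The sketch below uses a small made-up list of labels; the values `clickbait` and `non-clickbait` are assumptions for illustration, and the real column may be encoded differently.

```python
from collections import Counter

def label_proportions(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, not the real dataset
toy_labels = ["non-clickbait"] * 6 + ["clickbait"] * 4
proportions = label_proportions(toy_labels)
print(proportions)
```

On the real data, `data.label.value_counts(normalize=True)` returns the same kind of proportions.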

In [None]:
import seaborn as sns
# Kernel density estimate of headline length (in words)
sns.kdeplot(data["length"], fill=True)

The headline lengths look roughly bell-shaped, although a density plot alone does not establish that they are normally distributed.
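As a rough numerical check on the shape, one could compute the skewness of the lengths: a value near 0 indicates a symmetric distribution, while a positive value indicates a longer right tail. The sketch below uses only the standard library and made-up lengths, not the dataset.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of cubed z-scores (0 for symmetric data)."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Hypothetical headline lengths, not taken from the dataset
symmetric_lengths = [6, 7, 8, 9, 10, 11, 12]
right_skewed_lengths = [6, 6, 7, 7, 8, 9, 20]

print(skewness(symmetric_lengths))
print(skewness(right_skewed_lengths))
```

On the real data, pandas offers a similar (bias-corrected) estimate through `data.length.skew()`.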

## 2. Text Pre-processing

Before building our NLP model, we have to clean the text in a few steps:

1. Tokenizing
2. Lower-casing
3. Removing numbers and punctuation
4. Removing stopwords

In [None]:
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
stopwords_id = pd.read_csv("/kaggle/input/indonesian-stoplist/stopwordbahasa.csv", header=None)
stopwords_id.columns = ["Words"]

In [None]:
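def preprocess(data):
    # A set makes each stopword lookup constant-time
    stopwords = set(stopwords_id.Words)
    token = word_tokenize(data)
    token = [text.lower() for text in token]
    # Keep alphabetic tokens only (drops numbers and punctuation)
    token = [text for text in token if text.isalpha()]
    token = [text for text in token if text not in stopwords]
    return token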

In [None]:
def clean_text(data):
    # Join the cleaned tokens back into one string; an empty token list gives ""
    return " ".join(preprocess(data))

In [None]:
data["text_clean"] = [clean_text(text) for text in data.text]
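To make the effect of the cleaning steps concrete, here is a minimal, self-contained sketch of the same pipeline applied to one headline. It is an approximation under stated assumptions: a regular expression stands in for `word_tokenize`, and `TOY_STOPWORDS` is a tiny hypothetical stopword list, not the Kaggle stoplist used above.

```python
import re

# Hypothetical mini stoplist, for illustration only
TOY_STOPWORDS = {"yang", "di", "ke", "dan", "ini"}

def clean_text_sketch(text, stopwords=TOY_STOPWORDS):
    """Tokenize, lower-case, keep alphabetic tokens, drop stopwords, re-join."""
    # \w+ keeps runs of letters, digits and underscores; punctuation is dropped
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]   # removes "5", "3", ...
    tokens = [t for t in tokens if t not in stopwords]
    return " ".join(tokens)                        # "" if nothing is left

example = "5 Fakta yang Viral di Jakarta, Nomor 3 Bikin Kaget!"
print(clean_text_sketch(example))
```

Note that a headline made up only of numbers, punctuation, or stopwords becomes an empty string, so downstream code should tolerate empty documents.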

In [None]:
count_vec = CountVectorizer(ngram_range=(1,2), min_df = 2)
token = count_vec.fit_transform(data.text_clean)

In [None]:
token

There are 4599 terms that appear in at least 2 documents. Each term is either a single word (unigram) or a two-word sequence (bigram).
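What `ngram_range=(1,2)` and `min_df=2` mean can be sketched without scikit-learn: collect every unigram and bigram per document, count in how many documents each term occurs (its document frequency), and keep only terms seen in at least two documents. The toy documents are made up and the helper names are hypothetical; `CountVectorizer` also applies its own tokenization rules, which this sketch ignores.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, min_df=2):
    """Unigrams and bigrams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.split()
        # A set, so each term counts at most once per document
        terms = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
        doc_freq.update(terms)
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical cleaned headlines
toy_docs = [
    "virus corona jakarta",
    "kasus virus corona",
    "new normal jakarta",
]
vocabulary = build_vocabulary(toy_docs)
print(vocabulary)
```

Terms such as `kasus` or `new normal` are dropped here because they occur in only one document.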

## 3. Most Frequent Terms

After tokenizing the text, we can now see which terms appear the most.

In [None]:
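def most_freq_terms(min_len, max_len):
    # Count n-grams in the given size range and plot the 10 most frequent
    most_freq_vec = CountVectorizer(ngram_range=(min_len,max_len))
    most_freq_mat = most_freq_vec.fit_transform(data.text_clean)
    # Terms in the same order as the columns of the count matrix
    terms = most_freq_vec.get_feature_names_out()
    # Sum the sparse matrix column-wise without densifying it
    freq = np.asarray(most_freq_mat.sum(axis=0)).ravel()
    df = pd.DataFrame({"Frequency": freq}, index=terms)
    df = df.sort_values(by = "Frequency", ascending = False)
    df.head(10).plot.bar()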

In [None]:
most_freq_terms(1,1)

In [None]:
most_freq_terms(2,2)

Terms like **new normal**, **virus corona**, and **pandemi** appear often because the headlines were scraped over the last few months.

## 4. Modelling : Naive-Bayes

We can now fit a model to the data. We start with a simple one: Naive Bayes.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [None]:
train_x, test_x, train_y, test_y = train_test_split(token, data.label, test_size = 0.2, random_state = 42)

In [None]:
print("Train Size :", train_x.shape)
print("Test Size :", test_x.shape)

In [None]:
nb = MultinomialNB()
nb.fit(train_x, train_y)
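For intuition about what `MultinomialNB` fits, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to scikit-learn's default `alpha=1.0`. The class name `TinyMultinomialNB` and the toy headlines are hypothetical; this illustrates the technique and is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc.split()}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc.split())
        return self

    def _log_score(self, doc, label):
        # log P(class) + sum over tokens of log P(token | class)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        total = sum(self.token_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        for tok in doc.split():
            if tok in self.vocab:  # tokens never seen in training are ignored
                score += math.log((self.token_counts[label][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self._log_score(doc, c))

# Hypothetical toy training data (1 = clickbait, 0 = not clickbait)
train_docs = [
    "fakta mengejutkan artis",
    "nomor bikin kaget",
    "timnas menang final",
    "harga beras naik",
]
train_labels = [1, 1, 0, 0]

toy_nb = TinyMultinomialNB().fit(train_docs, train_labels)
print(toy_nb.predict("fakta bikin kaget"))
print(toy_nb.predict("harga naik"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.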

In [None]:
nb.score(train_x, train_y)

In [None]:
nb.score(test_x, test_y)

The model reaches 87.6% accuracy on the training set but only 71.7% on the test set. The gap suggests overfitting, and the test performance is not particularly good.

## 5. Modelling : Neural Networks

For many NLP tasks, neural networks tend to outperform simpler models like Naive Bayes, partly because a recurrent network can take word order into account. First we have to turn the text into padded sequences: each word is mapped to an integer index (sequencing), and every sequence is then brought to the same length (padding).

In [None]:
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

In [None]:
x_train, x_test, y_train, y_test = train_test_split(data.text, data.label, test_size = 0.2, random_state = 42)

In [None]:
VOCAB_SIZE = 2000
MAX_LEN = 50
tkz = Tokenizer(num_words=VOCAB_SIZE)
tkz.fit_on_texts(x_train)
sequences = tkz.texts_to_sequences(x_train)
sequences = sequence.pad_sequences(sequences, maxlen=MAX_LEN)

Here we cap the vocabulary at roughly the 2000 most frequent words and fix every sequence to 50 tokens: shorter headlines are padded and longer ones are truncated.
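Sequencing and padding can be illustrated without Keras. The sketch below assigns each word a 1-based index by descending frequency (0 is reserved for padding), drops words whose index is not below the vocabulary size, and then pads or truncates on the left, which mirrors the Keras defaults (`padding='pre'`, `truncating='pre'`). The function names and toy texts are hypothetical, and the real `Tokenizer` additionally lower-cases text and strips punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_padded(texts, word_index, vocab_size, max_len):
    """Index words, drop out-of-vocabulary ones, then pad/truncate on the left."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < vocab_size]
        seq = seq[-max_len:]                            # keep the last max_len items
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

# Hypothetical training texts and one unseen text
toy_train = ["virus corona di jakarta", "virus corona"]
toy_new = toy_train + ["banjir di jakarta"]

word_index = fit_word_index(toy_train)
padded = texts_to_padded(toy_new, word_index, vocab_size=5, max_len=3)
print(word_index)
print(padded)
```

The first text is truncated from four indices to three, the second is padded with a leading 0, and the unseen word `banjir` in the third is simply skipped.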

Then we can define and fit our RNN model.

In [None]:
from tensorflow.random import set_seed

In [None]:
np.random.seed(42)
set_seed(42)
model = Sequential()
model.add(Embedding(VOCAB_SIZE, 50, input_length = MAX_LEN))
model.add(LSTM(64))
model.add(Dense(256, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [None]:
model.summary()
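The parameter counts that `model.summary()` reports can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below applies the standard formulas for embedding, LSTM, and dense layers to the sizes used above; the helper names are hypothetical.

```python
VOCAB_SIZE = 2000    # rows in the embedding matrix
EMBED_DIM = 50       # size of each embedding vector
LSTM_UNITS = 64
DENSE_UNITS = 256

def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units   # weights + biases

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    "lstm": lstm_params(EMBED_DIM, LSTM_UNITS),
    "dense_hidden": dense_params(LSTM_UNITS, DENSE_UNITS),
    "dense_output": dense_params(DENSE_UNITS, 1),
}
total_params = sum(counts.values())
print(counts, total_params)
```

Most of the parameters sit in the embedding layer, so the vocabulary size is the main lever on model size.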

In [None]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
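For reference, the `binary_crossentropy` loss averages the negative log-likelihood of the true labels under the predicted probabilities. A minimal sketch in plain Python with made-up predictions follows; the clipping constant `eps` is an assumption to avoid `log(0)`, and Keras handles this internally in its own way.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and sigmoid outputs
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, uncertain))
```

Predicting 0.5 for everything gives a loss of log 2, so a trained model should end up clearly below that value.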

In [None]:
model.fit(sequences, y_train, batch_size=128, epochs=10)

In [None]:
sequences_test = tkz.texts_to_sequences(x_test)
sequences_test = sequence.pad_sequences(sequences_test, maxlen=MAX_LEN)

In [None]:
model.evaluate(sequences_test, y_test)

Even a simple model (a single LSTM layer followed by two dense layers, trained for 10 epochs) already does reasonably well. For a dataset this small, 76% is a decent accuracy. We could likely still improve the model by adding layers, trying different batch sizes, training for more epochs, and so on.
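Because the classes are not perfectly balanced, accuracy alone can hide how the model treats each class. One way to complement it, sketched here under assumptions, is to threshold the sigmoid outputs at 0.5 and compute precision and recall for the clickbait class alongside accuracy. The labels and probabilities below are made up, and encoding clickbait as 1 is an assumption.

```python
def evaluate_binary(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive class (label 1)."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test labels and predicted probabilities
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_prob = [0.9, 0.6, 0.3, 0.2, 0.1, 0.7, 0.4, 0.05]
metrics = evaluate_binary(toy_true, toy_prob)
print(metrics)
```

In this toy case accuracy is 0.75 while precision and recall are both 2/3, which shows how the three numbers can diverge.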

## 6. Summary

1. Clickbait headlines are common in Indonesian online news.
2. Over the last few months, the most frequent terms in Indonesian headlines were virus corona, new normal, and pandemi.
3. We can predict whether a headline is clickbait with about 76% accuracy.

## Demo

You can copy this code snippet for an interactive demo.

In [None]:
def interactive(title):
    print("Headline :",title)
    title = tkz.texts_to_sequences([title])
    title = sequence.pad_sequences(title, maxlen=MAX_LEN)
    # Threshold the sigmoid output at 0.5 to get a class label
    label = (model.predict(title) > 0.5).astype("int32")
    if label[0][0]==0:
        print("This headline is not clickbait")
    else:
        print("This headline is clickbait")

In [None]:
interactive("Berikut 5 fakta mengenai Enzy Storia, nomor 4 sempat menuai kontroversi")

In [None]:
interactive("Kevin De Bruyne cetak 2 gol ke gawang Arsenal")