# Headline Classification

To classify headlines as clickbait and non-clickbait, we developed two supervised machine learning models:

1. We simple used the most frequent words in headlines to train our model thereby trying out different number of top words.
2. Using the insights gained from our data exploration, we developed more granular features such as different word types, word sentiment and word count in headlines.

## Setup

We used several libraries in this project. *csv* to load the clickbait data set, *random* to randomly select headlines for the training and test data sets, *nltk* for the natural language processing and time to measure the execution time of the models. From nltk we used *stopwords* and *word_tokenize* to tokenzie the headlines in our dataset.

In [1]:
import csv
import random
import time

import nltk
import pandas as pd

nltk.download("punkt", quiet = True)
nltk.download("stopwords", quiet = True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## Natural Language Processing—Model 1

In our first natural language processing model we used the top $n$ most occuring words in headlines as features.

### Headline Tokenization

We defined a very basic custom list of stop words.

In [2]:
STOP_WORDS = set(
    [
        "a",
        "am",
        "an",
        "and",
        "are",
        "as",
        "at",
        "be",
        "because",
        "been",
        "but",
        "by",
        "did",
        "do",
        "for",
        "from",
        "get",
        "has",
        "have",
        "if",
        "in",
        "into",
        "is",
        "it",
        "its",
        "just",
        "of",
        "on",
        "or",
        "out",
        "over",
        "than",
        "that",
        "the",
        "their",
        "then",
        "there",
        "to",
        "was",
        "where",
        "which",
        "will",
        "with",
    ]
)

Define functions to remove stop words from a given list of words and tokenize a sentence.

In [3]:
def remove_stopwords(word_list):
    """Remove unecessary, meaningless stopwords from a list of words, returning a list
    of filtered words.

    Args:
        word_list (list): List of words to filter.

    Returns:
        list
    """
    filtered_word_list = list()
    
    for word in word_list:
        if word not in STOP_WORDS:
            filtered_word_list.append(word)
    
    return filtered_word_list


def tokenize_sentence(sentence):
    """Split sentence into tokens, remove every but alphanumeric characters, convert to
    lower case and remove stopwords. Returning a list of cleaned tokens.

    Args:
        sentence (str): Sentence to tokenize.

    Returns:
        list
    """
    # Tokenize headline
    token_list = nltk.tokenize.word_tokenize(sentence, language="english")
    # Remove punctuations and special characters
    token_list = [token for token in token_list if token.isalnum()]
    # Convert to lowercase
    token_list = [token.lower() for token in token_list]
    # Remove stopwords
    token_list = remove_stopwords(token_list)
    
    return token_list

Read clickbait data and tokenize headlines.

In [4]:
CLICKBAIT_DATA_PATH = r"data/clickbait_data.csv"

with open(CLICKBAIT_DATA_PATH, "rt", encoding="utf-8") as clickbait_data_file:
    # Read .csv as dictionairy with headers as keys
    clickbait_data = csv.DictReader(clickbait_data_file)
    token_label_list = list()
    for row in clickbait_data:
        # Tokenize sentence
        token_list = tokenize_sentence(row["headline"])
        # Add token_list to token_label_list
        token_label_list.append([token_list, row["clickbait"]])

print(token_label_list[:2])

[[['should', 'i', 'bings'], '1'], [['tv', 'female', 'friend', 'group', 'you', 'belong'], '1']]


### Split of Feature Set into Training and Testing

Split feature set into training and testing using a 80% to 20% ratio.

In [5]:
# Shuffle token_label_list
random.shuffle(token_label_list)
# Divide shuffled token_label_list into train and test set using a 80/20 ratio
token_label_list_train, token_label_list_test = (
    token_label_list[: int(len(token_label_list) * 0.8)],
    token_label_list[int(len(token_label_list) * 0.8) :],
)

print("Fraction of training set:", len(token_label_list_train) / len(token_label_list))
print("Fraction of testing set:", len(token_label_list_test) / len(token_label_list))

# List of all tokens in training set
token_list_train = [row[0] for row in token_label_list_train]
# Flatten token_list_train (remove list nesting)
token_list_train = [item for sublist in token_list_train for item in sublist]

Fraction of training set: 0.8
Fraction of testing set: 0.2


### Feature Extraction

In [6]:
def extract_features(token_list, feature_list):
    """Extract features from a list of tokens given a list of features, returning a
    dictionary of unique features for the list of tokens.

    Args:
        token_list (list): List of tokens to extract features from.
        feature_list (list): List of features to extract from list of tokes.

    Returns:
        dict
    """
    # Convert list of tokens into a set to remove word duplicates
    unique_tokens = set(token_list)
    # Initialize an emptiy dictionary for the features
    feature_dict = {}
    
    for feature in feature_list:
        # Check if feature exists in unique_tokens
        if feature in unique_tokens:
            feature_dict["contains({})".format(feature)] = True
        else:
            feature_dict["contains({})".format(feature)] = False
    
    return feature_dict


def most_frequent_words(word_list, n):
    """Find most frequent words in a list of words, returning a list of n top words in
    decending order.

    Args:
        word_list (list): List of words to find frequency of.
        n (int): Number of most frequent words to return.

    Returns:
        list
    """
    # Count the number of times each word occurs in word_list
    top_words = nltk.FreqDist(word_list)
    # Arrange words in frequency order and select top n words
    top_words = list(top_words)[:n]
    
    return top_words


def token_list_to_feature_list(token_label_list, input_feature_list):
    """Convert a list of lists with tokens and a corresponding label into a list of
    lists with features and a corresponding label using an input list of features for
    feature extraction. Returning the list of features-label lists.

    Args:
        token_label_list (list): List of lists in the form [[token_list, label], ...].
        input_feature_list (list): List of features (e.g. most frequent words).

    Returns:
        list
    """
    feature_label_list = list()
    
    for row in token_label_list:
        feature_list = extract_features(row[0], input_feature_list)
        label = row[1]
        feature_label_list.append([feature_list, label])
    
    return feature_label_list


INPUT_FEATURES = {
    # Most frequent words in headlines
    "top_tokens": most_frequent_words(token_list_train, 1000)
}

# Features for every headline
feature_label_list_train = token_list_to_feature_list(
    token_label_list_train, INPUT_FEATURES["top_tokens"]
)

# Features for every headline
feature_label_list_test = token_list_to_feature_list(
    token_label_list_test, INPUT_FEATURES["top_tokens"]
)

### Naïve Bayes Classification Model

In [7]:
start_time = time.process_time()
classifier = nltk.NaiveBayesClassifier.train(feature_label_list_train)
print("Process time:", time.process_time() - start_time)

model_accuracy = nltk.classify.accuracy(classifier, feature_label_list_test)
print("Model Accuracy:", model_accuracy)
classifier.show_most_informative_features(5)

Process time: 21.927514000000002
Model Accuracy: 0.95421875
Most Informative Features
        contains(things) = True                1 : 0      =    319.9 : 1.0
           contains(you) = True                1 : 0      =    211.7 : 1.0
      contains(actually) = True                1 : 0      =    182.6 : 1.0
        contains(zodiac) = True                1 : 0      =    162.3 : 1.0
      contains(everyone) = True                1 : 0      =    161.6 : 1.0


### Decision Tree Classification Model

In [8]:
# Note, the deicion tree classifier takes a significant amount of time to run when n
# is large (about 1 hour and 20 minutes for n=1000 in our test). Thus, we commented
# these lines of code out. To run the code, simply uncomment the following lines in
# the code block.

# start_time = time.process_time()
# classifier = nltk.DecisionTreeClassifier.train(feature_label_list_train)
# print("Process time:", time.process_time() - start_time)
# model_accuracy = nltk.classify.accuracy(classifier, feature_label_list_test)
# print("Model Accuracy:", model_accuracy)

### Accuracy

When testing our model the accuracy was about

| n    | model         | accuracy | time (s) |
|------|---------------|----------|----------|
| 10   | Naive Bayes   | 0.81     | 0.3      |
| 10   | Decision Tree | 0.81     | 5.1      |
| 100  | Naive Bayes   | 0.90     | 2.5      |
| 100  | Decision Tree | 0.90     | 220      |
| 1000 | Naive Bayes   | 0.96     | 25       |
| 1000 | Decision Tree | 0.93     | 4879     |

where $n$ is the number of top most frequent words (e.g. top 10 most frequent words).

## Natural Language Processing—Model 2

In an alternative aproach, we used word types such as auxiliary verbs, interrogative
pro-forms and personal pronouns as features instead of the most frequently occuring
words. In addition, we added strongly negative and positive words from the
[AFINN lexicon](https://github.com/fnielsen/afinn/tree/master/afinn/data) as features
to our model.

In [9]:
# Source: https://en.wikipedia.org/wiki/Auxiliary_verb
AUXILIARY_VERBS = set(
    [
        "be",
        "can",
        "could",
        "dare",
        "do",
        "have",
        "may",
        "might",
        "must",
        "need",
        "ought",
        "shall",
        "should",
        "will",
        "would",
    ]
)

# Source: https://en.wiktionary.org/wiki/Category:English_interrogative_pro-forms
INTERROGATIVE_PRO_FORMS = set(
    [
        "how",
        "how come",
        "how far",
        "how long",
        "how many",
        "how much",
        "in what world",
        "since when",
        "the hell",
        "to what end",
        "what",
        "what about",
        "what for",
        "what kind of",
        "what the heck",
        "what the hell",
        "whatever",
        "whatsoever",
        "when",
        "whence",
        "where",
        "whereto",
        "wherever",
        "whether",
        "which",
        "which one",
        "whichever",
        "whichsoe'er",
        "whichsoever",
        "whither",
        "who",
        "whoever",
        "whom",
        "whomever",
        "whomsoever",
        "whose",
        "whoso",
        "whosoe'er",
        "whosoever",
        "why",
    ]
)

# Source: https://en.wikipedia.org/wiki/English_personal_pronouns
PERSONAL_PRONOUNS = set(
    [
        "he",
        "her",
        "hers",
        "herself",
        "him",
        "himselfshe",
        "his",
        "i",
        "it",
        "its",
        "itself",
        "me",
        "mine",
        "my",
        "myself",
        "one",
        "one's",
        "oneself",
        "our",
        "ours",
        "ourself",
        "ourselves",
        "thee",
        "their",
        "theirs",
        "them",
        "themself",
        "themselves",
        "they",
        "thine",
        "thou",
        "thy",
        "thyself",
        "us",
        "we",
        "y'all",
        "y'all's",
        "y'all's selves",
        "y'alls",
        "y'alls selves",
        "ye",
        "yeer",
        "yeers",
        "yeerselves",
        "you",
        "you all",
        "your",
        "yours",
        "yourself",
        "yourselves",
        "youse",
    ]
)

### Headline Tokenization

In [10]:
NLTK_STOP_WORDS = stopwords.words("english")

STOP_WORDS = [
    word
    for word in NLTK_STOP_WORDS
    if word not in PERSONAL_PRONOUNS
    and word not in INTERROGATIVE_PRO_FORMS
    and word not in AUXILIARY_VERBS
]

print("Number of words in nltk stop word list:", len(NLTK_STOP_WORDS))
print("Number of words in reduced stop word list:", len(STOP_WORDS))

AFINN = pd.read_csv("data/AFINN-en-165.txt", sep="\t", header=None)

STRONGLY_POSITIVE_WORDS = AFINN[AFINN[1] > 2]

STRONGLY_NEGATIVE_WORDS = AFINN[AFINN[1] < -2]

print(
    "Number of strongly positive words in AFINN word list:",
    len(STRONGLY_POSITIVE_WORDS),
)
print(
    "Number of strongly negative words in AFINN word list:",
    len(STRONGLY_NEGATIVE_WORDS),
)

Number of words in nltk stop word list: 179
Number of words in reduced stop word list: 138
Number of strongly positive words in AFINN word list: 303
Number of strongly negative words in AFINN word list: 439


Read clickbait data and tokenize headlines.

In [11]:
CLICKBAIT_DATA_PATH = r"data/clickbait_data.csv"

with open(CLICKBAIT_DATA_PATH, "rt", encoding="utf-8") as clickbait_data_file:
    # Fead .csv as dictionairy with headers as keys
    clickbait_data = csv.DictReader(clickbait_data_file)
    token_label_list = list()
    for row in clickbait_data:
        # Tokenize sentence
        token_list = tokenize_sentence(row["headline"])
        # Add token_list to token_label_list
        token_label_list.append([token_list, row["clickbait"]])

print(token_label_list[:2])

[[['should', 'i', 'get', 'bings'], '1'], [['which', 'tv', 'female', 'friend', 'group', 'do', 'you', 'belong'], '1']]


### Split of Feature Set into Training and Testing

In [12]:
# Shuffle token_label_list
random.shuffle(token_label_list)
# Divide shuffled token_label_list into train and test set using a 80/20 ratio
token_label_list_train, token_label_list_test = (
    token_label_list[: int(len(token_label_list) * 0.8)],
    token_label_list[int(len(token_label_list) * 0.8) :],
)

print("Fraction of training set:", len(token_label_list_train) / len(token_label_list))
print("Fraction of testing set:", len(token_label_list_test) / len(token_label_list))

# List of all tokens in training set
token_list_train = [row[0] for row in token_label_list_train]
# Flatten token_list_train (remove list nesting)
token_list_train = [item for sublist in token_list_train for item in sublist]

print(token_label_list_train[:2])

Fraction of training set: 0.8
Fraction of testing set: 0.2
[[['andrew', 'marr', 'angers', 'bloggers', 'describing', 'them', 'pimpled', 'single'], '0'], [['pettitte', 'maintains', 'composure', 'victory'], '0']]


### Feature Extraction

In [13]:
def extract_features_alternative(token_list, feature_set_dict):
    """Extract features from a list of tokens by checking membership of tokens in every
    feature set given a dictionary of feature sets. Weight the membership of tokens in
    feature sets based on the number of token-memberships for each token list.
    Returning a dictionary of unique features for the list of tokens.

    Args:
        token_list (list): List of tokens to extract features from.
        feature_set_dict (dict): Dictionary of feature sets to check membership of.

    Returns:
        dict
    """
    # Initialize an emptiy dictionary for the features
    feature_dict = {}
    for feature_set in feature_set_dict:
        # feature_dict["contains({})".format(feature_set)] = False
        feature_dict["contains({})".format(feature_set)] = 0
        for token in token_list:
            if token in feature_set_dict[feature_set]:
                # feature_dict["contains({})".format(feature_set)] = True
                feature_dict["contains({})".format(feature_set)] += 1
        feature_dict["token_length"] = len(token_list)
    return feature_dict


def token_list_to_feature_list_alternative(token_label_list, feature_set_dict):
    """Convert a list of lists with tokens and a corresponding label into a list of
    lists with features and a corresponding label using feature sets for feature
    extraction. Returning the list of features-label lists.

    Args:
        token_label_list (list): List of lists in the form [[token_list, label], ...].
        feature_set_dict (dict): Dictionary of feature sets to check membership of.

    Returns:
        list
    """
    feature_label_list = list()
    
    for row in token_label_list:
        feature_list = extract_features_alternative(row[0], feature_set_dict)
        label = row[1]
        feature_label_list.append([feature_list, label])
    
    return feature_label_list


INPUT_FEATURE_SETS = {
    "auxiliary_verb": AUXILIARY_VERBS,
    "interrogative_pro_form": INTERROGATIVE_PRO_FORMS,
    "personal_pronoun": PERSONAL_PRONOUNS,
    "strongly_positive_word": STRONGLY_POSITIVE_WORDS,
    "strongly_negative_word": STRONGLY_NEGATIVE_WORDS,
}

# Features for every headline
feature_label_list_train = token_list_to_feature_list_alternative(
    token_label_list_train, INPUT_FEATURE_SETS
)

# Features for every headline
feature_label_list_test = token_list_to_feature_list_alternative(
    token_label_list_test, INPUT_FEATURE_SETS
)

print(feature_label_list_train[:2])

[[{'contains(auxiliary_verb)': 0, 'token_length': 8, 'contains(interrogative_pro_form)': 0, 'contains(personal_pronoun)': 1, 'contains(strongly_positive_word)': 0, 'contains(strongly_negative_word)': 0}, '0'], [{'contains(auxiliary_verb)': 0, 'token_length': 4, 'contains(interrogative_pro_form)': 0, 'contains(personal_pronoun)': 0, 'contains(strongly_positive_word)': 0, 'contains(strongly_negative_word)': 0}, '0']]


### Naïve Bayes Classification Model

In [14]:
start_time = time.process_time()
classifier = nltk.NaiveBayesClassifier.train(feature_label_list_train)
print("Process time:", time.process_time() - start_time)

model_accuracy = nltk.classify.accuracy(classifier, feature_label_list_test)
print("Model Accuracy:", model_accuracy)
classifier.show_most_informative_features(5)

Process time: 0.12546499999999838
Model Accuracy: 0.77578125
Most Informative Features
contains(personal_pronoun) = 3                   1 : 0      =    164.3 : 1.0
contains(interrogative_pro_form) = 2                   1 : 0      =    103.4 : 1.0
contains(personal_pronoun) = 2                   1 : 0      =     29.4 : 1.0
contains(interrogative_pro_form) = 1                   1 : 0      =     23.6 : 1.0
contains(auxiliary_verb) = 1                   1 : 0      =      6.9 : 1.0


### Decision Tree Classification Model

In [15]:
start_time = time.process_time()
classifier = nltk.DecisionTreeClassifier.train(feature_label_list_train)
print("Process time:", time.process_time() - start_time)

model_accuracy = nltk.classify.accuracy(classifier, feature_label_list_test)
print("Model Accuracy:", model_accuracy)

Process time: 1.389078000000005
Model Accuracy: 0.77546875


### Accuracy

When testing our model the accuracy was about

| model         | accuracy | time (s) |
|---------------|----------|----------|
| Naive Bayes   | 0.79     | 0.16     |
| Decision Tree | 0.79     | 1.16     |