# Classifying Political Tweets as Clickbait


### Datasets 
- framing.p: 23,448 tweets by 501 different members of the U.S. Congress mentioning news articles, posted from 13 August 2020 to 14 November 2020.
- clickbait.csv: a dataset of headlines labeled as clickbait or non-clickbait.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import random
import spacy
from collections import Counter

### Importing the Data

First, I import the two datasets: framing.p, and clickbait.csv, which I use to train a model that classifies the titles in framing.p as clickbait or non-clickbait.

In [2]:
df = pd.read_pickle('framing.p')

DATASET_URL = 'https://gist.githubusercontent.com/amitness/0a2ddbcb61c34eab04bad5a17fd8c86b/raw/66ad13dfac4bd1201e09726677dd8ba8048bb8af/clickbait.csv'
data = pd.read_csv(DATASET_URL)

### Training the Model

Next, I train the model that classifies titles as clickbait or non-clickbait. I split the data into a training set and a test set (80/20) so that I can evaluate performance on held-out data. I then use CountVectorizer to convert the titles into bag-of-words count vectors, fitting the vocabulary on the training set only. Finally, I train a simple logistic regression model on the vectorized texts.

In [3]:
X = list(data.title.values)
y = list(data.label.values)
labels = ['not clickbait', 'clickbait']

X_train_str, X_test_str, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

cv = CountVectorizer() # this initializes the CountVectorizer 

cv.fit(X_train_str) # create the vocabulary

X_train = cv.transform(X_train_str)
X_test = cv.transform(X_test_str)

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)

LogisticRegression()
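
To make the bag-of-words step more concrete, here is a minimal standard-library sketch of roughly what CountVectorizer does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The helper names `tokenize`, `fit_vocabulary` and `transform` and the toy titles are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default tokenization:
# lowercase the text, then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(texts):
    """Map every token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Turn each text into a count vector; tokens not in the vocabulary are dropped."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy titles, not taken from clickbait.csv
train_titles = ["You will not believe this", "Senate passes budget bill"]
vocab = fit_vocabulary(train_titles)
vectors = transform(["Believe this budget, believe it"], vocab)
print(list(vocab))
print(vectors)
```

In this sketch the word "it" never appears in the training titles, so it has no column and contributes nothing to the vector. The same holds for any word in framing.p that is absent from the clickbait training set.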

### Evaluating the Model

Next, I evaluate the logistic regression model. As shown below, it performs very well, with an average F1-score of 0.97 on the held-out test set, far above the roughly 0.51 obtained by a baseline that guesses labels at random. Because the test set was not used for training, the model can be expected to do well on unseen headlines that resemble the training data. Given this score, I do not pursue a more complex model or a different preprocessing approach.

In [4]:
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred, 
                          target_names=labels))

# Baseline: guess each label uniformly at random (unseeded, so scores vary slightly between runs)
random_preds = [random.randint(0, 1) for _ in range(len(y_test))]

print(classification_report(y_test, random_preds, 
                          target_names=labels))

               precision    recall  f1-score   support

not clickbait       0.96      0.98      0.97      3178
    clickbait       0.98      0.96      0.97      3220

     accuracy                           0.97      6398
    macro avg       0.97      0.97      0.97      6398
 weighted avg       0.97      0.97      0.97      6398

               precision    recall  f1-score   support

not clickbait       0.51      0.52      0.51      3178
    clickbait       0.52      0.51      0.51      3220

     accuracy                           0.51      6398
    macro avg       0.51      0.51      0.51      6398
 weighted avg       0.51      0.51      0.51      6398
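
For readers less familiar with the columns in the reports above, the following sketch shows how precision, recall and F1 for one class can be computed from true and predicted labels, and how the macro average is the unweighted mean over classes. The function name `binary_scores` and the six toy labels are assumptions made for illustration; the notebook itself relies on `classification_report`.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = clickbait, 0 = not clickbait
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision_cb, recall_cb, f1_cb = binary_scores(y_true, y_pred, positive=1)
_, _, f1_not_cb = binary_scores(y_true, y_pred, positive=0)

# Macro average: unweighted mean of the per-class F1 scores
macro_f1 = (f1_cb + f1_not_cb) / 2
print(precision_cb, recall_cb, f1_cb, macro_f1)
```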



### Classifying the Titles

Next, I classify the titles in framing.p using the logistic regression model that I just trained on the clickbait.csv data. I store the results in a new column in the dataframe of framing.p, with column name 'cb_label'.

In [5]:
new_examples = df.title
new_examples_bow = cv.transform(new_examples)
predictions = lr.predict(new_examples_bow)
df['cb_label'] = predictions
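
As a rough illustration of what happens inside `lr.predict` for a single title, the sketch below scores a bag of word counts with a logistic function and applies a 0.5 cut-off. The weights, the bias and the helper `predict_proba` defined here are hypothetical values chosen for the example; they are not the coefficients learned in this notebook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(counts, weights, bias):
    """Estimated P(clickbait) for one title, given its word counts."""
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return sigmoid(z)


# Hypothetical weights, for illustration only
weights = {"you": 1.2, "best": 0.8, "senate": -1.5}
bias = -0.2

# z = -0.2 + 0.8 + 1.2 = 1.8
p = predict_proba({"best": 1, "you": 1}, weights, bias)
label = int(p >= 0.5)  # 1 = clickbait, 0 = not clickbait

# A word without a learned weight contributes nothing to the score
p_unknown = predict_proba({"covid": 1}, weights, bias)
print(round(p, 3), label, round(p_unknown, 3))
```

Under these assumed weights, a title made up only of words the model has never seen is scored by the bias term alone.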

### Inspecting the Clickbait Titles

Next, I count the titles classified as clickbait for Republicans and for Democrats. I then print the first 20 clickbait titles for each party.

In [6]:
print(df.cb_label[(df.party == 'R') & (df.cb_label == 1)].count())
print(df.cb_label[(df.party == 'D') & (df.cb_label == 1)].count())

1058
2830


In [7]:
pd.set_option('display.max_colwidth', None)  # None shows full titles; -1 is deprecated
df[(df.cb_label == 1) & (df.party == 'D')].title[:20]

3      I just gave!
8      I just gave!
9      I just gave!
19     Everything You Need to Vote - Vote.org
21     I just ga

In [8]:
df[(df.cb_label == 1) & (df.party == 'R')].title[:20]

0      The 10 best US cities to move to if you want to retire early, where living costs are low and salaries are high
39     Thank you Representative Bradley Byrne
50     Fowler named to Top 30 Women of the Year list - Spring Hill College Athletics
66     How Right Now
101    Roby: God bless our veterans - Yellowhammer News, Roby: God bless our veterans

Among the first 20 clickbait-classified titles referred to by Democrats, quite a few contain 'COVID-19'; among the first 20 referred to by Republicans, this is less so. I investigate this further by computing the most common words in each party's clickbait titles. To do this effectively, I first lemmatize the titles and remove stop words and punctuation.

### Inspecting the Most Common Words in the Clickbait Titles

In [9]:
nlp = spacy.load("en_core_web_sm")
texts = df.title
texts = [text.lower() for text in texts]
processed_texts = [text for text in nlp.pipe(texts, 
                                             disable=["ner",
                                                      "parser"])]
df['processed_texts'] = processed_texts
processed_D = df[(df.cb_label == 1) & (df.party == 'D')].processed_texts
processed_R = df[(df.cb_label == 1) & (df.party == 'R')].processed_texts

def lemmatize(docs):
    # Keep the lemmas of tokens that are neither punctuation nor stop words
    return [[token.lemma_ for token in doc
             if not token.is_punct and not token.is_stop]
            for doc in docs]

def flatten(nested):
    # Merge the per-title lemma lists into one flat list
    return [item for sublist in nested for item in sublist]

word_counts_D = Counter(flatten(lemmatize(processed_D)))
print(word_counts_D.most_common(15))
word_counts_R = Counter(flatten(lemmatize(processed_R)))
print(word_counts_R.most_common(15))

[('trump', 682), ('covid-19', 340), ('|', 247), ('need', 176), ('americans', 172), ('woman', 161), ('say', 151), ('black', 147), ('vote', 138), ('coronavirus', 134), ('pandemic', 125), ('opinion', 117), ('know', 115), ('die', 114), ('new', 111)]
[('|', 200), ('trump', 137), ('covid-19', 102), ('america', 74), ('biden', 73), ('rep', 57), ('sen', 53), ('cruz', 51), ('say', 50), ('podcast', 48), ('vaccine', 47), ('people', 47), ('time', 45), ('vote', 44), ('barrett', 44)]
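
The raw counts above come from groups of different sizes (2830 Democratic and 1058 Republican clickbait-classified titles), so a per-title rate may be easier to compare than absolute counts. The sketch below rescales a few of the printed counts to occurrences per 100 clickbait titles; the helper `per_100_titles` is an illustrative name, and the numbers are copied from the outputs above.

```python
# Number of clickbait-classified titles per party, from the counts printed earlier
n_titles = {"D": 2830, "R": 1058}

# A few word counts copied from the most-common-words output
word_counts = {
    "D": {"trump": 682, "covid-19": 340, "vote": 138},
    "R": {"trump": 137, "covid-19": 102, "vote": 44},
}


def per_100_titles(count, total):
    """Occurrences of a word per 100 clickbait titles, rounded to one decimal."""
    return round(100 * count / total, 1)


rates = {
    party: {word: per_100_titles(count, n_titles[party])
            for word, count in counts.items()}
    for party, counts in word_counts.items()
}

for word in ("trump", "covid-19", "vote"):
    print(f"{word:10s} D: {rates['D'][word]:5.1f}   R: {rates['R'][word]:5.1f}")
```

On these numbers, 'trump' occurs about 24 times per 100 Democratic clickbait titles versus about 13 for Republican ones, while 'covid-19' is closer (about 12 versus about 10). A word can occur more than once in a title, so these are occurrence rates rather than shares of titles.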


Republicans referred to an article whose title was classified as clickbait 1058 times, and Democrats 2830 times. The most common words indicate that, in absolute terms, Democrats refer to titles containing 'COVID-19' or related words such as 'coronavirus' and 'pandemic' more often than Republicans, and to titles containing 'trump' much more often. Part of this gap reflects volume, since Democrats have more than 2.5 times as many clickbait-classified titles overall. Even so, titles about Trump or COVID-19 appear to be classified as clickbait often.

This could help explain why Democrats refer to articles with clickbait titles much more often than Republicans. Republicans may have less political interest in commenting on articles about COVID-19 or Trump. If such articles are often classified as clickbait, Republicans would accumulate fewer clickbait titles than Democrats, who refer to these topics more often, perhaps because doing so is in their political interest.

Furthermore, the data I trained the model on did not contain any titles about COVID-19. CountVectorizer ignores words that are not in its training vocabulary, so such titles are classified on their remaining words alone. The model may therefore classify them poorly, which could help explain why titles mentioning COVID-19 are often classified as clickbait.
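
One way to probe this last point would be to measure how much of a title falls outside the training vocabulary. The sketch below does so with a small hypothetical vocabulary and a tokenizer that approximates CountVectorizer's defaults; `oov_share` and `training_vocab` are illustrative names, and in the notebook the real vocabulary would be the keys of `cv.vocabulary_`.

```python
import re

# Approximation of CountVectorizer's default tokenization
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def oov_share(title, vocabulary):
    """Fraction of a title's tokens that are missing from the vocabulary."""
    tokens = TOKEN_PATTERN.findall(title.lower())
    if not tokens:
        return 0.0
    unseen = [tok for tok in tokens if tok not in vocabulary]
    return len(unseen) / len(tokens)


# Hypothetical vocabulary standing in for the one learned from clickbait.csv
training_vocab = {"what", "you", "need", "to", "know", "about", "the"}

share = oov_share("What you need to know about COVID-19", training_vocab)
print(share)
```

In this toy case the tokens "covid" and "19" are unseen, so the title would be judged only on "what you need to know about", a phrasing that may well resemble clickbait headlines.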
