# Examine The Examiner: Algorithmic Clickbait Headlines Analysis

## Overview

This notebook explores the fascinating dataset of over 3 million article headlines from *The Examiner*, a pseudo news site that produced an enormous volume of click-driven, algorithmic-style headlines between 2010 and 2015.

## Contents

- Data loading and preprocessing  
- SQL-based analysis of publication trends  
- Word frequency and headline pattern exploration  
- Visualization of monthly and yearly headline dynamics  
- Insights about the rise and fall of clickbait-driven digital journalism  

## Objective

The goal is to uncover how *The Examiner* crafted viral headlines and shaped trends, offering a glimpse into early algorithmic media before the AI era.

---

*Let’s dive in to discover what lessons this massive archive holds about digital content, media evolution, and the economics of attention!*


In [None]:
import pandas as pd
import sqlite3

# Load dataset CSV and create SQLite connection
df = pd.read_csv('../input/examine-the-examiner/examiner-date-text.csv')
conn = sqlite3.connect('examiner_headlines.db')

# Uploading dataframe to SQLite
df.to_sql('headlines', conn, index=False, if_exists='replace')

# Articles published every year

In [None]:
query = '''
SELECT
  SUBSTR(publish_date, 1, 4) AS year,
  COUNT(*) AS articles_count
FROM headlines
GROUP BY year
ORDER BY year;'''
result = pd.read_sql(query, conn)
print(result)


# SQL Query for the Top 20 Most Frequent Words in Headlines

In [None]:
# Replace missing headlines with empty strings, lowercase, and split on spaces
words = df['headline_text'].fillna('').astype(str).str.lower().str.split(' ')

# Flatten the per-headline word lists into one column, skipping empty strings
flat_words = pd.DataFrame({'word': [w for ws in words for w in ws if w.strip() != '']})


**Why Python Preprocessing Is Necessary**

Raw text in fields like `headline_text` is often noisy and inconsistent: it can contain missing values, mixed upper and lower case, punctuation, and non-standard content. SQLite, the engine used in this Kaggle notebook, has no built-in function for splitting a string into words, so word-frequency and keyword-extraction tasks are impractical or error-prone in pure SQL. The headlines are therefore tokenized in Python first, and the resulting words are loaded back into SQLite for counting.
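
As one possible refinement, punctuation could be stripped during tokenization so that `movies:` and `movies` count as the same word. The sketch below is an illustration under assumptions, not the method used in this notebook: the `tokenize` helper and the sample headlines are hypothetical.

```python
import re

import pandas as pd

# Hypothetical sample headlines, made up for illustration only
sample = pd.Series(["Top 10 Movies: You Won't Believe #7!", None, "  top   10 movies  "])


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", str(text).lower())


# Missing headlines become empty strings, which tokenize to empty lists
tokens = sample.fillna('').map(tokenize)
print(tokens.tolist())
```

A regular expression is used here because it drops punctuation and repeated whitespace in a single pass. Whether apostrophes should be kept inside words is a judgment call that depends on the analysis.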

In [None]:
flat_words.to_sql('headline_words', conn, index=False, if_exists='replace')

query = '''
SELECT word, COUNT(*) AS freq
FROM headline_words
GROUP BY word
ORDER BY freq DESC
LIMIT 20;
'''
result = pd.read_sql(query, conn)
print(result)


Query to count only words that have at least 3 characters, excluding shorter words like “a”, “to”, “in” etc.

In [None]:
query = '''
SELECT word, COUNT(*) AS freq
FROM headline_words
WHERE LENGTH(word) >= 3
GROUP BY word
ORDER BY freq DESC
LIMIT 20;
'''
result = pd.read_sql(query, conn)
print(result)

In [None]:
query = '''
SELECT word, COUNT(*) AS freq
FROM headline_words
WHERE LENGTH(word) >= 4
GROUP BY word
ORDER BY freq DESC
LIMIT 20;
'''
result = pd.read_sql(query, conn)
print(result)
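
The two queries above differ only in the minimum word length, so the threshold could be passed as a query parameter instead of being written into the SQL text. The following is a minimal sketch under assumptions: `top_words` and `demo_conn` are hypothetical names, and the in-memory table holds made-up words rather than the real dataset.

```python
import sqlite3

import pandas as pd

# In-memory database with a few made-up words, for illustration only
demo_conn = sqlite3.connect(':memory:')
pd.DataFrame({'word': ['a', 'to', 'the', 'the', 'news', 'news', 'news']}).to_sql(
    'headline_words', demo_conn, index=False)


def top_words(connection, min_length, limit=20):
    """Return the most frequent words with at least min_length characters."""
    query = '''
    SELECT word, COUNT(*) AS freq
    FROM headline_words
    WHERE LENGTH(word) >= ?
    GROUP BY word
    ORDER BY freq DESC
    LIMIT ?;
    '''
    # The "?" placeholders are filled from params by the sqlite3 driver
    return pd.read_sql(query, connection, params=(min_length, limit))


print(top_words(demo_conn, 3))
print(top_words(demo_conn, 4))
```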

## Top Bigrams and Trigrams in Headlines

This section identifies the most frequent phrases composed of two and three words (bigrams and trigrams) found in the headlines dataset. These multi-word phrases often capture common news topics, recurring themes, and characteristic clickbait constructions.

### Methodology

- Headlines are first tokenized in Python and split into consecutive word pairs (bigrams) and triplets (trigrams). The code also extracts four-word phrases (quadgrams) in the same way.
- These phrase lists are then loaded into SQLite tables for efficient frequency counting.
- The SQL queries below extract the top 20 most frequent phrases from each table, revealing the word combinations that drive headline trends.
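
The windowing step in the first bullet can be sketched as a single helper that works for any phrase length. This is an illustration only: the `ngrams` function and the sample headline are hypothetical and not part of the notebook's cells.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # A headline with k words has k - n + 1 windows of size n (none if k < n)
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical headline, made up for illustration only
words = 'ten things you did not know'.split()

print(ngrams(words, 2))
print(ngrams(words, 3))
print(ngrams(words, 4))
```

Using `len(words) - n + 1` as the range bound keeps every phrase at exactly `n` words. A larger bound would let the final slices run past the end of the list and produce shorter phrases.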

### SQL Queries



In [None]:
from collections import Counter
import pandas as pd

# Clean and tokenize headlines into words 
tokens = df['headline_text'].fillna('').astype(str).str.lower().str.split()

# Extract bigrams (2-word phrases)
bigrams = []
for headline_words in tokens:
    bigrams.extend([' '.join(headline_words[i:i+2]) for i in range(len(headline_words)-1)])

# Extract trigrams (3-word phrases)
trigrams = []
for headline_words in tokens:
    trigrams.extend([' '.join(headline_words[i:i+3]) for i in range(len(headline_words)-2)])
# Extract quadgrams (4-word phrases)
quadgrams = []
for headline_words in tokens:
    # A headline with k words has k - 3 four-word windows
    quadgrams.extend([' '.join(headline_words[i:i+4]) for i in range(len(headline_words)-3)])

bigrams_df = pd.DataFrame({'phrase': bigrams})
trigrams_df = pd.DataFrame({'phrase': trigrams})
quadgrams_df = pd.DataFrame({'phrase': quadgrams})


bigrams_df.to_sql('bigrams', conn, index=False, if_exists='replace')
trigrams_df.to_sql('trigrams', conn, index=False, if_exists='replace')
quadgrams_df.to_sql('quadgrams', conn, index=False, if_exists='replace')


In [None]:
top_bigrams_query = '''
SELECT phrase, COUNT(*) AS freq
FROM bigrams
GROUP BY phrase
ORDER BY freq DESC
LIMIT 20;
'''

top_trigrams_query = '''
SELECT phrase, COUNT(*) AS freq
FROM trigrams
GROUP BY phrase
ORDER BY freq DESC
LIMIT 20;
'''
top_quadgrams_query = '''
SELECT phrase, COUNT(*) AS freq
FROM quadgrams
GROUP BY phrase
ORDER BY freq DESC
LIMIT 20;
'''

top_bigrams = pd.read_sql(top_bigrams_query, conn)
top_trigrams = pd.read_sql(top_trigrams_query, conn)
top_quadgrams = pd.read_sql(top_quadgrams_query, conn)

print("Top 20 bigrams:")
print(top_bigrams)
print("\nTop 20 trigrams:")
print(top_trigrams)
print("\nTop 20 quadgrams:")
print(top_quadgrams)



# Articles published per month

In [None]:
query = '''
SELECT 
  SUBSTR(publish_date, 1, 6) AS year_month,
  COUNT(*) AS articles_count
FROM headlines
GROUP BY year_month
ORDER BY year_month;
'''
result = pd.read_sql(query, conn)
print(result)

In [None]:
import matplotlib.pyplot as plt

# Example: yearly article counts
yearly_counts = pd.read_sql('''
SELECT SUBSTR(publish_date, 1, 4) AS year, COUNT(*) AS count 
FROM headlines GROUP BY year ORDER BY year;
''', conn)

plt.figure(figsize=(10,6))
plt.plot(yearly_counts['year'], yearly_counts['count'], marker='o')
plt.title('Articles Published Per Year')
plt.xlabel('Year')
plt.ylabel('Number of Articles')
plt.grid(True)
plt.show()


# Monthly distribution of articles published each year from 2010 to 2015

* Each dashed line represents the article count for a specific year, showing trends and seasonality within that year.
* The solid red line indicates the average number of articles published per month across all six years.
* This helps identify consistent monthly patterns and highlights months with unusually high or low publication volume.
* For example, peaks in certain months could correspond to major news events or seasonally relevant reporting periods.
* The visualization combines detailed year-wise data with a clear average reference, supporting trend analysis and comparative exploration of publishing behavior over time.


**Note on Data Exclusion: Years 2009 and 2016**

Years 2009 and 2016 were excluded from this analysis because they respectively represent the rise and decline phases of The Examiner’s publishing activity. These boundary years may contain incomplete data or exhibit extreme fluctuations, which could act as outliers and skew overall trends.
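
The same exclusion and monthly pivot could also be expressed in pandas. The sketch below is a hedged illustration, not the notebook's method: the rows are made up, the names `demo`, `kept`, and `monthly` are hypothetical, and `publish_date` is assumed to be in `YYYYMMDD` form, as the `SUBSTR` calls in the queries imply.

```python
import pandas as pd

# Made-up rows for illustration; publish_date is assumed to be YYYYMMDD
demo = pd.DataFrame({'publish_date': [20091231, 20100105, 20100120, 20110107, 20160301]})

dates = demo['publish_date'].astype(str)
demo['year'] = dates.str[:4].astype(int)
demo['month'] = dates.str[4:6].astype(int)

# Keep only the six years used in the analysis (drops 2009 and 2016)
kept = demo[demo['year'].between(2010, 2015)]

# Rows are months, columns are years, cells are article counts
monthly = (
    kept.groupby(['month', 'year']).size()
    .unstack(fill_value=0)
    .reindex(columns=range(2010, 2016), fill_value=0)
)

# Average over all six years, counting years with no articles as zero
monthly['average'] = monthly.mean(axis=1)
print(monthly)
```

Reindexing the columns to all six years keeps the average comparable to a division by six, even when a year has no rows for a given month.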

In [None]:
import matplotlib.pyplot as plt

# Run the SQL query to get the pivoted data with averages
query = '''
WITH monthly_counts AS (
  SELECT 
    SUBSTR(publish_date, 5, 2) AS month,
    SUBSTR(publish_date, 1, 4) AS year,
    COUNT(*) AS articles_count
  FROM headlines
  GROUP BY year, month
),
pivoted AS (
  SELECT 
    month,
    SUM(CASE WHEN year = '2010' THEN articles_count ELSE 0 END) AS "2010",
    SUM(CASE WHEN year = '2011' THEN articles_count ELSE 0 END) AS "2011",
    SUM(CASE WHEN year = '2012' THEN articles_count ELSE 0 END) AS "2012",
    SUM(CASE WHEN year = '2013' THEN articles_count ELSE 0 END) AS "2013",
    SUM(CASE WHEN year = '2014' THEN articles_count ELSE 0 END) AS "2014",
    SUM(CASE WHEN year = '2015' THEN articles_count ELSE 0 END) AS "2015"
  FROM monthly_counts
  GROUP BY month
)
SELECT 
  month,
  "2010", "2011", "2012", "2013", "2014", "2015",
  ROUND( ("2010" + "2011" + "2012" + "2013" + "2014" + "2015") / 6.0, 2) AS average
FROM pivoted
ORDER BY month;
'''

# Use a separate name so the original headlines DataFrame (df) is not overwritten
monthly_df = pd.read_sql(query, conn)

# Convert month to integer for plotting
monthly_df['month'] = monthly_df['month'].astype(int)

plt.figure(figsize=(12, 6))

# Plot each year
years = ["2010", "2011", "2012", "2013", "2014", "2015"]
for year in years:
    plt.plot(monthly_df['month'], monthly_df[year], label=year, linestyle='--', alpha=0.7)

# Plot average with a distinct color and solid line
plt.plot(monthly_df['month'], monthly_df['average'], label='Average', color='red', linewidth=2.5)

plt.title('Monthly Article Counts by Year with Average')
plt.xlabel('Month')
plt.ylabel('Number of Articles')
plt.xticks(range(1, 13))
plt.legend()
plt.grid(True)
plt.show()
