# Wikipedia Pageviews — Data Pull

Source: **Wikimedia REST API** (no authentication needed)  
Coverage: daily top 100 articles, 2015-07-01+

| Column | Type | Description |
|--------|------|-------------|
| date | date | Day of measurement |
| rank | int | Rank within the daily top 100 (1 = most viewed) |
| article | str | Wikipedia article title |
| views | int | Pageview count for that day |
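
The fetch code lives in the `wiki_pageviews` module, which is not shown here. As a rough sketch of how one day's rows could be produced, assuming the public "top" pageviews endpoint and English Wikipedia (`top_url`, `parse_top`, `PROJECT` and `ACCESS` are illustrative names, not taken from the notebook):

```python
from datetime import date

# Assumptions: endpoint, project and access values are not confirmed by the notebook.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"
PROJECT = "en.wikipedia"
ACCESS = "all-access"


def top_url(day: date) -> str:
    """Build the request URL for one day's most-viewed list."""
    return f"{BASE_URL}/{PROJECT}/{ACCESS}/{day:%Y/%m/%d}"


def parse_top(payload: dict, limit: int = 100) -> list[dict]:
    """Flatten one API response into rows with the columns described above."""
    item = payload["items"][0]
    day = date(int(item["year"]), int(item["month"]), int(item["day"]))
    rows = [
        {"date": day, "rank": a["rank"], "article": a["article"], "views": a["views"]}
        for a in item["articles"]
    ]
    rows.sort(key=lambda r: r["rank"])
    return rows[:limit]


# Hand-built payload in the shape the endpoint returns (two articles only).
sample_payload = {
    "items": [
        {
            "project": "en.wikipedia",
            "access": "all-access",
            "year": "2016",
            "month": "01",
            "day": "01",
            "articles": [
                {"article": "Special:Search", "views": 1740903, "rank": 3},
                {"article": "Main_Page", "views": 16226325, "rank": 1},
            ],
        }
    ]
}

rows = parse_top(sample_payload)
url = top_url(date(2016, 1, 1))
```

No request is made in this sketch. A real client would fetch `url` and should send a descriptive `User-Agent` header, which Wikimedia asks of API clients.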

**Why Wikipedia Pageviews?** Complements the Current Events scrape:
- Current Events = what Wikipedia *editors* flagged as notable
- Pageviews = what readers *actually looked up* each day

The gap between the two reveals topics with massive public interest that
Wikipedia's editorial process overlooked or de-emphasised.

## 1. Setup

In [None]:
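# The star import is expected to provide pd, px, date, timedelta,
# RAW_DATA_DIR, wiki_pageviews and load_data (all used below).
from _notebook_setup import *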

hit_api = True
save_output = True

PAGEVIEWS_CSV = RAW_DATA_DIR / '01_wikipedia_pageviews.csv'

# Date range for initial backfill — adjust as needed
START_DATE = date(2016, 1, 1)
END_DATE = date.today() - timedelta(days=1)  # yesterday (today not yet complete)

## 2. Fetch

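`backfill()` skips dates already saved, so it is safe to interrupt and re-run.  
Each missing day costs one API call at a 0.5 s delay: a first run from 2024-01-01 → yesterday is ~770 calls (~7 minutes), while the `START_DATE` of 2016-01-01 set above needs ~3,700 calls (~30 minutes).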
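
The implementation of `backfill()` is not shown. A minimal sketch of the resume logic it describes, assuming the set of already-saved dates is available (`missing_dates` and `estimated_minutes` are hypothetical helper names):

```python
from datetime import date, timedelta


def missing_dates(start: date, end: date, saved: set[date]) -> list[date]:
    """Return every day in [start, end] that is not already saved."""
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in all_days if d not in saved]


def estimated_minutes(n_calls: int, delay_s: float = 0.5) -> float:
    """Lower bound on run time: sleep delay only, network latency excluded."""
    return n_calls * delay_s / 60


# Two of five days are already on disk, so only three need fetching.
saved = {date(2016, 1, 1), date(2016, 1, 2)}
todo = missing_dates(date(2016, 1, 1), date(2016, 1, 5), saved)
minutes = estimated_minutes(770)
```

Because the list is recomputed from what is on disk, an interrupted run picks up where it stopped, provided results are saved as they arrive.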

In [5]:
if hit_api:
    df_pageviews = wiki_pageviews.backfill(
        start_date=START_DATE,
        end_date=END_DATE,
        save_path=PAGEVIEWS_CSV,
    )
else:
    print("hit_api = False — loading from local cache.")
    df_pageviews = pd.read_csv(PAGEVIEWS_CSV)
    df_pageviews['date'] = pd.to_datetime(df_pageviews['date'])

  Fetching 1833 days (2016-01-17 → 2026-02-22) ...
    100/1833 days fetched ...
    200/1833 days fetched ...
    300/1833 days fetched ...
    400/1833 days fetched ...
    500/1833 days fetched ...
    600/1833 days fetched ...
    700/1833 days fetched ...
    800/1833 days fetched ...
    900/1833 days fetched ...
    1000/1833 days fetched ...
    1100/1833 days fetched ...
    1200/1833 days fetched ...
    1300/1833 days fetched ...
    1400/1833 days fetched ...
    1500/1833 days fetched ...
    1600/1833 days fetched ...
    1700/1833 days fetched ...
    1800/1833 days fetched ...
  Done. Saved to /Users/annebode/dev/selfevidence.github.io/projects/news_tracker/output/raw_data/01_wikipedia_pageviews.csv
  Loaded 369,200 rows — 3692 days


## 3. Inspect

In [6]:
print(f"Shape: {df_pageviews.shape}")
print(f"Date range: {df_pageviews['date'].min().date()} → {df_pageviews['date'].max().date()}")
df_pageviews.head(10)

Shape: (369200, 4)
Date range: 2016-01-01 → 2026-02-22


|   | date | rank | article | views |
|---|------|------|---------|-------|
| 0 | 2016-01-01 | 1 | Main_Page | 16226325 |
| 1 | 2016-01-01 | 2 | List_of_stock_market_crashes_and_bear_markets | 1876491 |
| 2 | 2016-01-01 | 3 | Special:Search | 1740903 |
| 3 | 2016-01-01 | 4 | Natalie_Cole | 609871 |
| 4 | 2016-01-01 | 5 | Special:Book | 441228 |
| 5 | 2016-01-01 | 6 | Star_Wars:_The_Force_Awakens | 353499 |
| 6 | 2016-01-01 | 7 | Bobby_Leach | 301290 |
| 7 | 2016-01-01 | 8 | Wayne_Rogers | 231465 |
| 8 | 2016-01-01 | 9 | Star_Wars | 211832 |
| 9 | 2016-01-01 | 10 | Nat_King_Cole | 199732 |


## 4. Clean

In [16]:
consistent = (
    df_pageviews.groupby('article')
    .agg(
        days_in_top100=('date', 'nunique'),
        total_views=('views', 'sum'),
        avg_rank=('rank', 'mean'),
        first_seen=('date', 'min'),
        last_seen=('date', 'max'),
    )
    .sort_values('days_in_top100', ascending=False)
    .reset_index()
)
consistent['avg_rank'] = consistent['avg_rank'].round(1)

print("Most consistently popular articles (most days in top 100):")
display(consistent.head(30))

Most consistently popular articles (most days in top 100):


|   | article | days_in_top100 | total_views | avg_rank | first_seen | last_seen |
|---|---------|----------------|-------------|----------|------------|-----------|
| 0 | Main_Page | 3692 | 39103916387 | 1.0 | 2016-01-01 | 2026-02-22 |
| 1 | Special:Search | 3692 | 5933784721 | 2.1 | 2016-01-01 | 2026-02-22 |
| 2 | Portal:Current_events | 3143 | 159867726 | 63.9 | 2016-01-04 | 2026-02-21 |
| 3 | United_States | 2768 | 126357949 | 72.7 | 2016-01-04 | 2026-02-12 |
| 4 | Donald_Trump | 2374 | 251609039 | 49.8 | 2016-01-01 | 2026-02-21 |
| 5 | YouTube | 2246 | 203939272 | 45.8 | 2016-04-30 | 2026-02-21 |
| 6 | Cleopatra | 1883 | 238981171 | 25.7 | 2016-08-13 | 2025-12-30 |
| 7 | Wikipedia:Featured_pictures | 1774 | 438627141 | 15.5 | 2021-03-22 | 2026-02-22 |
| 8 | Elon_Musk | 1567 | 121674132 | 53.5 | 2016-03-22 | 2026-02-05 |
| 9 | Elizabeth_II | 1390 | 130907988 | 56.0 | 2016-03-27 | 2026-02-20 |


In [25]:
# Check the raw frame; df_pageviews_clean is not defined until the next cell.
df_pageviews.isna().sum()

date       0
rank       0
article    6
views      0
dtype: int64
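
One possible cause of the six missing `article` values, offered as an assumption rather than a diagnosis: if the frame was loaded with `pd.read_csv`, titles such as `NaN` or `N/A` are converted to missing values by default. A small sketch on hand-built data shows the effect and one way to keep such titles as text:

```python
import io

import pandas as pd

# Hand-built sample; "NaN" and "N/A" stand in for real article titles.
csv_text = (
    "date,rank,article,views\n"
    "2016-01-01,1,Main_Page,100\n"
    "2016-01-01,2,NaN,50\n"
    "2016-01-01,3,N/A,25\n"
)

# Default parsing treats "NaN" and "N/A" as missing values.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps them as literal strings.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_missing_default = int(default["article"].isna().sum())
n_missing_literal = int(literal["article"].isna().sum())
```

If this is what happened here, reading with `keep_default_na=False` would preserve those rows instead of losing them to `dropna()`.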

In [26]:
df_pageviews_clean = df_pageviews.copy(deep=True)

df_pageviews_clean.dropna(inplace=True)

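# Drop the 100 articles with the most days in the top 100 (Main_Page, Special:Search, ...).
# Relies on `consistent` as built from the raw df_pageviews above; section 5 reassigns that name.
generically_popular = consistent['article'].head(100).tolist()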
df_pageviews_clean = df_pageviews_clean[~df_pageviews_clean['article'].isin(generically_popular)]

others_to_remove = ['404.php', 'Pornhub', 'Web_scraping']
df_pageviews_clean = df_pageviews_clean[~df_pageviews_clean['article'].isin(others_to_remove)]

# explicit content searches (plain substring match, so any title containing 'XXX' is dropped)
df_pageviews_clean = df_pageviews_clean[~df_pageviews_clean['article'].str.contains('XXX', regex=False)]

print(f"Shape: {df_pageviews_clean.shape}")
print(f"Rows Removed: {df_pageviews.shape[0] - df_pageviews_clean.shape[0]}")
df_pageviews_clean.head(10)

Shape: (298670, 4)
Rows Removed: 70530


|   | date | rank | article | views |
|---|------|------|---------|-------|
| 1 | 2016-01-01 | 2 | List_of_stock_market_crashes_and_bear_markets | 1876491 |
| 3 | 2016-01-01 | 4 | Natalie_Cole | 609871 |
| 5 | 2016-01-01 | 6 | Star_Wars:_The_Force_Awakens | 353499 |
| 6 | 2016-01-01 | 7 | Bobby_Leach | 301290 |
| 7 | 2016-01-01 | 8 | Wayne_Rogers | 231465 |
| 8 | 2016-01-01 | 9 | Star_Wars | 211832 |
| 9 | 2016-01-01 | 10 | Nat_King_Cole | 199732 |
| 10 | 2016-01-01 | 11 | Auld_Lang_Syne | 177494 |
| 11 | 2016-01-01 | 12 | Jenny_McCarthy | 160004 |
| 12 | 2016-01-01 | 13 | Antarctic_Snow_Cruiser | 159010 |


## 5. Explore

### Top articles by total views

In [27]:
top_total = (
    df_pageviews_clean.groupby('article')['views']
    .sum()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={'views': 'total_views'})
    .head(30)
)

fig = px.bar(
    top_total,
    x='total_views',
    y='article',
    orientation='h',
    title='Top 30 Wikipedia Articles by Pageviews on Top-100 Days (cleaned data)',
    labels={'total_views': 'Total Views', 'article': ''},
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

### Most consistently popular articles

Articles that appeared in the daily top 100 on the most distinct days, after the cleaning step removed the 100 most persistent titles.

In [28]:
consistent = (
    df_pageviews_clean.groupby('article')
    .agg(
        days_in_top100=('date', 'nunique'),
        total_views=('views', 'sum'),
        avg_rank=('rank', 'mean'),
        first_seen=('date', 'min'),
        last_seen=('date', 'max'),
    )
    .sort_values('days_in_top100', ascending=False)
    .reset_index()
)
consistent['avg_rank'] = consistent['avg_rank'].round(1)

print("Most consistently popular articles (most days in top 100):")
display(consistent.head(30))

Most consistently popular articles (most days in top 100):


|   | article | days_in_top100 | total_views | avg_rank | first_seen | last_seen |
|---|---------|----------------|-------------|----------|------------|-----------|
| 0 | World_War_II | 262 | 10709319 | 82.8 | 2016-04-11 | 2025-09-04 |
| 1 | Google_Translate | 261 | 12058893 | 76.0 | 2016-10-22 | 2024-02-22 |
| 2 | Apple_Network_Server | 260 | 16347035 | 58.7 | 2019-09-24 | 2025-04-03 |
| 3 | Melania_Trump | 260 | 31304699 | 49.9 | 2016-01-07 | 2026-02-05 |
| 4 | Elvis_Presley | 255 | 20836831 | 52.5 | 2016-01-08 | 2026-01-08 |
| 5 | Novak_Djokovic | 255 | 29232338 | 46.2 | 2016-01-26 | 2026-02-02 |
| 6 | List_of_Bollywood_films_of_2018 | 254 | 10726706 | 75.7 | 2018-01-19 | 2018-12-29 |
| 7 | Winston_Churchill | 251 | 15193009 | 65.6 | 2016-02-28 | 2025-05-05 |
| 8 | 2024_United_States_presidential_election | 251 | 28775862 | 53.0 | 2022-11-16 | 2025-11-05 |
| 9 | Diana,_Princess_of_Wales | 250 | 28061072 | 45.6 | 2016-04-21 | 2025-08-31 |


### Daily top article over time

In [15]:
daily_top = (
    df_pageviews[df_pageviews['rank'] == 1]
    .sort_values('date')
    [['date', 'article', 'views']]
    .reset_index(drop=True)
)

print(f"Daily #1 article — {len(daily_top)} days")

fig = px.scatter(
    daily_top,
    x='date',
    y='views',
    hover_data=['article'],
    title='Daily #1 Wikipedia Article by Pageviews',
    labels={'views': 'Pageviews', 'date': 'Date'},
)
fig.update_traces(marker_size=4)
fig.show()

print("\nRecent daily top articles:")
display(daily_top.tail(20))

Daily #1 article — 3692 days



Recent daily top articles:


|   | date | article | views |
|---|------|---------|-------|
| 3672 | 2026-02-03 | Main_Page | 7039712 |
| 3673 | 2026-02-04 | Main_Page | 6814731 |
| 3674 | 2026-02-05 | Main_Page | 6754706 |
| 3675 | 2026-02-06 | Main_Page | 6851728 |
| 3676 | 2026-02-07 | Main_Page | 7136168 |
| 3677 | 2026-02-08 | Main_Page | 7265262 |
| 3678 | 2026-02-09 | Main_Page | 7022665 |
| 3679 | 2026-02-10 | Main_Page | 6979025 |
| 3680 | 2026-02-11 | Main_Page | 6893519 |
| 3681 | 2026-02-12 | Main_Page | 6797062 |
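
In the raw frame `Main_Page` holds rank 1 on every day listed, so the daily #1 is more informative once generic pages are removed. A sketch of taking the best-ranked remaining article per day, assuming a cleaned frame with the same four columns (`daily_top_after_cleaning` is an illustrative name; the second day's rows are made-up toy data):

```python
import pandas as pd


def daily_top_after_cleaning(df_clean: pd.DataFrame) -> pd.DataFrame:
    """Return the best-ranked remaining article for each date."""
    # idxmin gives the index label of the lowest rank within each date.
    idx = df_clean.groupby("date")["rank"].idxmin()
    return (
        df_clean.loc[idx, ["date", "article", "views"]]
        .sort_values("date")
        .reset_index(drop=True)
    )


sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-01-01"] * 3 + ["2016-01-02"] * 2),
        "rank": [4, 2, 6, 5, 3],
        "article": [
            "Natalie_Cole",
            "List_of_stock_market_crashes_and_bear_markets",
            "Star_Wars:_The_Force_Awakens",
            "Example_Article_A",  # toy row
            "Example_Article_B",  # toy row
        ],
        "views": [609871, 1876491, 353499, 1000, 2000],
    }
)

top_clean = daily_top_after_cleaning(sample)
```

Passing `df_pageviews_clean` instead of `sample` would give one row per day, which could feed the same scatter plot.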


## 6. Gap Analysis: Popular articles NOT in Wikipedia Current Events

This section lists articles that drew heavy pageviews but whose titles never
appear in the Current Events descriptions. It is a rough signal of what
readers cared about that editors did not flag as a notable news event.

In [None]:
df_wiki = load_data(filename='00_wikipedia_events.csv', subdir='raw_data')

if df_wiki is not None:
    wiki_blob = ' '.join(df_wiki['description'].str.lower().fillna(''))

    # Score each article by total views, then check presence in Current Events.
    # Uses the cleaned frame so Main_Page, Special:Search, etc. do not dominate the gaps.
    article_totals = (
        df_pageviews_clean.groupby('article')['views']
        .sum()
        .sort_values(ascending=False)
        .reset_index()
        .rename(columns={'views': 'total_views'})
    )

    article_totals['in_current_events'] = article_totals['article'].apply(
        lambda a: a.lower().replace('_', ' ') in wiki_blob
    )

    gaps = article_totals[~article_totals['in_current_events']].copy()
    covered = article_totals[article_totals['in_current_events']].copy()

    print(f"Total unique articles: {len(article_totals):,}")
    print(f"In Wikipedia Current Events: {len(covered):,} ({100*len(covered)/len(article_totals):.1f}%)")
    print(f"NOT in Wikipedia Current Events: {len(gaps):,} ({100*len(gaps)/len(article_totals):.1f}%)")
    print(f"\nTop 50 high-traffic articles absent from Current Events:")
    display(gaps.head(50))
else:
    print("Wikipedia Current Events data not found.\n"
          "Run 00_wikipedia_events.ipynb first to generate 00_wikipedia_events.csv")
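
The `in wiki_blob` test above is a plain substring match, so a short title can match inside a longer word (for example `india` inside `indiana`). A stricter variant, sketched here with hypothetical helper names, only accepts the title as a whole phrase:

```python
import re


def title_to_phrase(title: str) -> str:
    """Convert an article title to the lowercase phrase used for matching."""
    return title.lower().replace("_", " ")


def mentioned(title: str, text: str) -> bool:
    """True if the title appears in text as a whole phrase, not inside a longer word."""
    # Lookarounds are used instead of \b so titles that start or end
    # with punctuation are still handled.
    pattern = r"(?<!\w)" + re.escape(title_to_phrase(title)) + r"(?!\w)"
    return re.search(pattern, text) is not None


# Toy text standing in for the lowercased Current Events descriptions.
blob = "flooding in indiana kills two; star wars premiere draws crowds"

substring_hit = "india" in blob         # plain substring match
phrase_hit = mentioned("India", blob)   # whole-phrase match
star_wars_hit = mentioned("Star_Wars", blob)
```

One regex search per title is slower than `in` on a large blob, so this trades speed for fewer false "covered" matches.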

In [None]:
# Visualise the gap
if df_wiki is not None and 'gaps' in dir():
    top_gaps = gaps.head(30)

    fig = px.bar(
        top_gaps,
        x='total_views',
        y='article',
        orientation='h',
        title='Top 30 High-Traffic Wikipedia Articles NOT in Current Events',
        labels={'total_views': 'Total Pageviews', 'article': ''},
        color='total_views',
        color_continuous_scale='Blues',
    )
    fig.update_layout(yaxis={'categoryorder': 'total ascending'}, coloraxis_showscale=False)
    fig.show()