# Wikipedia Page Views Data Pipeline

This notebook extracts the **most viewed Wikipedia pages** for 2025-11-18 through 2025-11-20, transforms the data into **JSON Lines**, and uploads it to **S3**.



### Imports

In [9]:
# Uncomment the line below if running this notebook locally
# and required packages are not installed
# %pip install requests boto3

In [10]:
import requests
import json
import boto3
from datetime import datetime, timedelta, timezone

### Configuration

Use the **same username as in the in-class Wikipedia Edits assignment**.

- **Username**: `shirind76`
- **S3 Bucket**: `shirind76-wikidata` (format: `<username>-wikidata`)
- **Athena Database**: `shirind76` (format: `<username>`)
- **Lambda**: Write output to the same S3 bucket (`shirind76-wikidata`)


In [11]:
USERNAME = "shirind76"
BUCKET = f"{USERNAME}-wikidata"

BASE_DATE = datetime(2025, 11, 20)
N_DAYS = 3

s3 = boto3.client("s3")


### API inspection

Before building the pipeline, we inspect the Wikimedia Pageviews API
response for a single day to understand its structure.

The response contains a top-level `items` array, where each element
represents one day and includes a list of `articles` with their
corresponding page views and rankings.
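
As a minimal sketch of the structure described above, the snippet below navigates a hand-written sample rather than a live response. The sample values and the helper name `extract_articles` are illustrative assumptions, not part of the API or of this notebook's pipeline; a real response may carry additional fields.

```python
# Hypothetical sample mirroring the shape described above (values are made up).
sample_response = {
    "items": [
        {
            "articles": [
                {"article": "Main_Page", "views": 1000, "rank": 1},
                {"article": "Special:Search", "views": 500, "rank": 2},
            ]
        }
    ]
}


def extract_articles(response):
    """Return the article list of the first item, or [] if it is absent."""
    items = response.get("items") or []
    if not items:
        return []
    return items[0].get("articles") or []


sample_articles = extract_articles(sample_response)
print(len(sample_articles), sorted(sample_articles[0].keys()))
# prints: 2 ['article', 'rank', 'views']
```

Returning an empty list for a missing or empty `items` array keeps downstream code free of index checks.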


In [12]:
test_date = BASE_DATE

url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    f"en.wikipedia.org/all-access/{test_date.strftime('%Y/%m/%d')}"
)

resp = requests.get(url, headers={"User-Agent": "curl/7.68.0"})
resp.raise_for_status()

raw = resp.json()

# Inspect the top-level keys of the response
raw.keys()


dict_keys(['items'])

In [13]:
items = raw.get("items", [])
articles = items[0].get("articles", []) if items else []
len(articles), articles[0].keys() if articles else None


(1000, dict_keys(['article', 'views', 'rank']))

### Extraction and transformation logic

For each date, the notebook:
1. Requests the top viewed Wikipedia pages from the Pageviews API
2. Validates the API response
3. Extracts article title, view count, and rank
4. Enriches each record with the query date and a retrieval timestamp
5. Writes the result as JSON Lines and uploads it to S3


In [14]:
def fetch_and_upload(date_obj):
    """Fetch the top viewed pages for one day and upload them to S3 as JSON Lines."""
    date_str = date_obj.strftime("%Y-%m-%d")

    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"en.wikipedia.org/all-access/{date_obj.strftime('%Y/%m/%d')}"
    )

    resp = requests.get(url, headers={"User-Agent": "curl/7.68.0"})
    resp.raise_for_status()
    data = resp.json()

    retrieved_at = datetime.now(timezone.utc).isoformat()

    records = []
    articles = data["items"][0].get("articles", [])

    for a in articles:
        title = a.get("article")
        views = a.get("views")
        rank = a.get("rank")

        # Skip entries that are missing any required field
        if title is None or views is None or rank is None:
            continue

        records.append({
            "title": title,
            "views": int(views),
            "rank": int(rank),
            "date": date_str,
            "retrieved_at": retrieved_at
        })

    key = f"raw-views/raw-views-{date_str}.json"
    # JSON Lines: one JSON object per line
    body = "\n".join(json.dumps(r) for r in records)

    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))

    print(f"Uploaded s3://{BUCKET}/{key}")
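
The transformation step in the function above can also be sketched as a pure function, which makes it possible to check the JSON Lines output without calling the API or S3. The helper name `to_json_lines` and the sample records are assumptions introduced for illustration only.

```python
import json


def to_json_lines(articles, date_str, retrieved_at):
    """Build a JSON Lines string from raw article dicts, skipping incomplete entries."""
    lines = []
    for a in articles:
        title, views, rank = a.get("article"), a.get("views"), a.get("rank")
        # Drop entries that are missing any required field
        if title is None or views is None or rank is None:
            continue
        lines.append(json.dumps({
            "title": title,
            "views": int(views),
            "rank": int(rank),
            "date": date_str,
            "retrieved_at": retrieved_at,
        }))
    return "\n".join(lines)


# Hypothetical input: one complete entry and one with a missing view count
sample = [
    {"article": "Main_Page", "views": 1000, "rank": 1},
    {"article": "Incomplete_Entry", "views": None, "rank": 2},
]
jsonl_body = to_json_lines(sample, "2025-11-20", "2025-11-21T00:00:00+00:00")
print(jsonl_body)
```

Only the complete entry survives, so the output is a single line holding one JSON object.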


### Automated multi-day extraction

Instead of manually changing the date, the extraction logic is wrapped
in a loop that fetches data for `N_DAYS` consecutive days, starting at
`BASE_DATE` and stepping backwards one day at a time.


In [15]:
for i in range(N_DAYS):
    fetch_and_upload(BASE_DATE - timedelta(days=i))


Uploaded s3://shirind76-wikidata/raw-views/raw-views-2025-11-20.json
Uploaded s3://shirind76-wikidata/raw-views/raw-views-2025-11-19.json
Uploaded s3://shirind76-wikidata/raw-views/raw-views-2025-11-18.json
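
To preview which dates the loop covers before running any uploads, the date arithmetic can be isolated as below. This is a sketch only: `dates_back_from` is a hypothetical helper, and the base date and day count simply repeat the values configured earlier.

```python
from datetime import datetime, timedelta


def dates_back_from(base_date, n_days):
    """Return n_days consecutive dates, newest first, starting at base_date."""
    return [base_date - timedelta(days=i) for i in range(n_days)]


sketch_dates = dates_back_from(datetime(2025, 11, 20), 3)
print([d.strftime("%Y-%m-%d") for d in sketch_dates])
# prints: ['2025-11-20', '2025-11-19', '2025-11-18']
```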
