# Wikipedia Edits Data Pipeline

This notebook demonstrates a simple ETL pipeline that:
1. Extracts the most-edited pages for a given date from the Wikimedia Analytics REST API
2. Transforms the response into JSON Lines format
3. Uploads the result directly to S3

## Choose Your Username

Pick a unique username (e.g., your name or initials) and use it consistently:

- **S3 Bucket**: `<username>-wikidata` (e.g., `johndoe-wikidata`)
- **Athena Database**: `<username>` (e.g., `johndoe`)
- **Lambda**: Use the same `<username>-wikidata` bucket

**Important:** Athena database names must not contain hyphens. Use underscores instead (e.g., `john_doe`).
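
The naming convention above can be captured in a small helper so that both names always derive from one value. This is a minimal sketch: `resource_names` is a hypothetical helper that is not part of the notebook, and it only applies the rules stated above (the `-wikidata` suffix, underscores instead of hyphens for the database) plus the S3 requirement that bucket names be lowercase.

```python
def resource_names(username: str) -> dict:
    """Derive the S3 bucket and Athena database names from one username."""
    # Reject an empty value or an unfilled "<...>" placeholder.
    if not username or username.startswith("<"):
        raise ValueError("Set USERNAME to a real value, not the placeholder")
    name = username.lower()  # S3 bucket names must be lowercase
    return {
        "bucket": f"{name}-wikidata",
        # Hyphens are replaced because the database name must not contain them.
        "database": name.replace("-", "_"),
    }


names = resource_names("john-doe")
print(names["bucket"])
print(names["database"])
```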

In [1]:
# Set your username here - use it consistently across all resources
USERNAME = "<etnav>"

## Setup and Imports

In [2]:
import datetime
import json

import boto3
import requests

## Extract: Retrieve Data from Wikipedia API

We use the Wikimedia Analytics API to fetch the most edited pages for a specific date. The API returns JSON with page titles and edit counts.

**API Documentation:** https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/reference/edits.html
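
The request URL embeds the date as `YYYY/MM/DD` path segments. The sketch below wraps that construction in a function so it can be reused for several dates. `build_top_edits_url` and `BASE_URL` are hypothetical names introduced here; the URL pattern itself is the one used in this notebook.

```python
import datetime

# URL prefix taken from the request made in this notebook.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits"


def build_top_edits_url(date_param: str, project: str = "en.wikipedia") -> str:
    """Build the top-by-edits URL for a date given as YYYY-MM-DD."""
    # strptime raises ValueError for malformed or impossible dates.
    date = datetime.datetime.strptime(date_param, "%Y-%m-%d")
    return f"{BASE_URL}/{project}/user/content/{date.strftime('%Y/%m/%d')}"


print(build_top_edits_url("2025-11-25"))
```

Parsing the date first means a typo such as `2025-13-01` fails before any request is sent.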

In [3]:
# Try different dates to see how the data changes
DATE_PARAM = "2025-11-25"

date = datetime.datetime.strptime(DATE_PARAM, "%Y-%m-%d")

# Construct the API URL
url = f"https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/en.wikipedia/user/content/{date.strftime('%Y/%m/%d')}"
print(f"Requesting REST API URL: {url}")

# Make the API request (the timeout keeps the cell from hanging on a stalled connection)
wiki_server_response = requests.get(url, headers={"User-Agent": "curl/7.68.0"}, timeout=30)
wiki_response_status = wiki_server_response.status_code
wiki_response_body = wiki_server_response.text

print(f"Wikipedia REST API Response body: {wiki_response_body[:500]}...")
print(f"Wikipedia REST API Response Code: {wiki_response_status}")

# Validate response
if wiki_response_status != 200:
    raise Exception(f"Received non-OK status code from Wiki Server: {wiki_response_status}")
print(f"Successfully retrieved Wikipedia data, content-length: {len(wiki_response_body)}")

Requesting REST API URL: https://wikimedia.org/api/rest_v1/metrics/edited-pages/top-by-edits/en.wikipedia/user/content/2025/11/25
Wikipedia REST API Response body: {"items":[{"project":"en.wikipedia","editor-type":"user","page-type":"content","granularity":"daily","results":[{"timestamp":"2025-11-25T00:00:00.000Z","top":[{"page_title":"Akiko_Nakamura","edits":143,"rank":1},{"page_title":"2026_Men's_T20_World_Cup","edits":110,"rank":2},{"page_title":"Udit_Narayan","edits":105,"rank":3},{"page_title":"2010_Iowa_House_of_Representatives_election","edits":99,"rank":4},{"page_title":"Cyprus_Basketball_Division_A","edits":80,"rank":5},{"page_title":"2025_UK_Cham...
Wikipedia REST API Response Code: 200
Successfully retrieved Wikipedia data, content-length: 6284


## Transform: Process Raw Data into JSON Lines

Convert the raw API response into a structured JSON Lines format suitable for analytics. Each line is a valid JSON object representing one page's edit statistics.
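
Because every line must parse on its own, a JSON Lines payload can be checked line by line. The following is a minimal sketch of such a check, not part of the pipeline itself. `parse_json_lines` and `REQUIRED_KEYS` are hypothetical names, and the required keys are assumed from the record layout used in this notebook.

```python
import json

# Keys assumed from the record layout used in this notebook.
REQUIRED_KEYS = {"title", "edits", "date", "retrieved_at"}


def parse_json_lines(text: str) -> list:
    """Parse JSON Lines text and verify that each record has the expected keys."""
    records = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip the empty string after the trailing newline
        record = json.loads(line)  # raises json.JSONDecodeError on a bad line
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"Line {line_number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records


sample = (
    '{"title": "Akiko_Nakamura", "edits": 143, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}\n'
    '{"title": "Udit_Narayan", "edits": 105, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}\n'
)
records = parse_json_lines(sample)
print(len(records), records[0]["title"])
```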

In [4]:
# Parse the API response and extract top edits
wiki_response_parsed = wiki_server_response.json()
top_edits = wiki_response_parsed["items"][0]["results"][0]["top"]

# Transform to JSON Lines format
current_time = datetime.datetime.now(datetime.timezone.utc)
json_lines = ""
for page in top_edits:
    record = {
        "title": page["page_title"],
        "edits": page["edits"],
        "date": date.strftime("%Y-%m-%d"),
        "retrieved_at": current_time.replace(tzinfo=None).isoformat(),
    }
    json_lines += json.dumps(record) + "\n"

print(f"Transformed {len(top_edits)} records to JSON Lines")
print(f"First few lines:\n{json_lines[:500]}...")

Transformed 100 records to JSON Lines
First few lines:
{"title": "Akiko_Nakamura", "edits": 143, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}
{"title": "2026_Men's_T20_World_Cup", "edits": 110, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}
{"title": "Udit_Narayan", "edits": 105, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}
{"title": "2010_Iowa_House_of_Representatives_election", "edits": 99, "date": "2025-11-25", "retrieved_at": "2025-12-23T19:25:35.351974"}
{"title": "Cyprus_Basket...


---
## Lab 1: Create an S3 Bucket

**Task:** Create an S3 bucket for the Wikipedia data pipeline.

**Requirements:**
- Bucket name: `<username>-wikidata` (use your USERNAME from above)
- Create the bucket if it doesn't exist

**Documentation:** [create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/create_bucket.html)
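
One detail worth knowing before calling `create_bucket`: outside `us-east-1` the call needs a `CreateBucketConfiguration` with a matching `LocationConstraint`, while in `us-east-1` that argument must be left out. The sketch below builds the keyword arguments for either case. `create_bucket_kwargs` is a hypothetical helper, not part of boto3.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Return keyword arguments for s3.create_bucket() in the given region."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        # Every region except us-east-1 requires an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("johndoe-wikidata", "eu-west-1"))

# Possible usage with a boto3 client (assumes configured AWS credentials):
# s3.create_bucket(**create_bucket_kwargs(S3_WIKI_BUCKET, s3.meta.region_name))
```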

In [13]:
S3_WIKI_BUCKET = "etnav-wikidata"
s3 = boto3.client("s3")

bucket_names = [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]

if S3_WIKI_BUCKET not in bucket_names:
    s3.create_bucket(Bucket=S3_WIKI_BUCKET)
    print(f"Created new bucket: {S3_WIKI_BUCKET}")
else:
    print(f"Using existing bucket: {S3_WIKI_BUCKET}")



Using existing bucket: etnav-wikidata


In [15]:
# Test Lab 1
assert USERNAME not in ("", "<username>"), "Please set your USERNAME at the top of the notebook"
assert S3_WIKI_BUCKET.endswith("-wikidata"), "Bucket name must end with '-wikidata'"

try:
    s3.head_bucket(Bucket=S3_WIKI_BUCKET)
    print(f"Bucket {S3_WIKI_BUCKET} exists!")
except Exception as e:
    print(f"Bucket {S3_WIKI_BUCKET} not found: {e}")
    raise

Bucket etnav-wikidata exists!


---
## Lab 2: Upload JSON Lines to S3

**Task:** Upload the `json_lines` data directly to S3 (no local file!).

**Requirements:**
- Use `s3.put_object()` to upload the data directly
- Place the file under `raw-edits/` prefix in S3
- File name: `raw-edits-YYYY-MM-DD.json` (use the date from `DATE_PARAM`)

**Example S3 path:** `s3://johndoe-wikidata/raw-edits/raw-edits-2025-11-25.json`

**Documentation:** [put_object](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/put_object.html)
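
One possible shape for the upload, shown as a sketch rather than the reference solution: build the key from the date, encode the string as UTF-8 bytes, and pass it as `Body`. `upload_json_lines` is a hypothetical helper; it takes the S3 client as an argument, so nothing here touches AWS until it is called with a real client.

```python
def upload_json_lines(s3_client, bucket: str, date_param: str, json_lines: str) -> str:
    """Upload JSON Lines text to raw-edits/raw-edits-<date>.json and return the key."""
    key = f"raw-edits/raw-edits-{date_param}.json"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json_lines.encode("utf-8"),  # send the in-memory string as bytes
    )
    return key


# Possible usage in this notebook (assumes the earlier cells have run):
# upload_json_lines(s3, S3_WIKI_BUCKET, DATE_PARAM, json_lines)
```

Passing the client in also makes the function easy to exercise with a stand-in object in tests.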

In [16]:
# LAB 2: Upload json_lines directly to S3
# YOUR SOLUTION COMES HERE =========================

# ==================================================

In [17]:
# Test Lab 2
expected_key = f"raw-edits/raw-edits-{date.strftime('%Y-%m-%d')}.json"
try:
    s3.head_object(Bucket=S3_WIKI_BUCKET, Key=expected_key)
    print(f"File uploaded successfully to s3://{S3_WIKI_BUCKET}/{expected_key}")
except Exception as e:
    print(f"File not found at s3://{S3_WIKI_BUCKET}/{expected_key}: {e}")
    raise

File uploaded successfully to s3://etnav-wikidata/raw-edits/raw-edits-2025-11-25.json


In [18]:
# List all raw-edits files in S3

response = s3.list_objects_v2(
    Bucket=S3_WIKI_BUCKET,
    Prefix="raw-edits/"
)

files = [obj["Key"] for obj in response.get("Contents", [])]

print("Files in raw-edits/:")
for f in files:
    print(" -", f)


Files in raw-edits/:
 - raw-edits/raw-edits-2025-11-01.json
 - raw-edits/raw-edits-2025-11-19.json
 - raw-edits/raw-edits-2025-11-20.json
 - raw-edits/raw-edits-2025-11-21.json
 - raw-edits/raw-edits-2025-11-22.json
 - raw-edits/raw-edits-2025-11-23.json
 - raw-edits/raw-edits-2025-11-24.json
 - raw-edits/raw-edits-2025-11-25.json
 - raw-edits/raw-edits-2025-11-26.json
 - raw-edits/raw-edits-2025-11-27.json
 - raw-edits/raw-edits-2025-11-28.json
 - raw-edits/raw-edits-2025-11-29.json
 - raw-edits/raw-edits-2025-11-30.json
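
A single `list_objects_v2` call returns at most 1,000 keys, which is plenty here but would silently truncate a larger prefix. A paginator avoids that. The sketch below assumes a boto3 S3 client is passed in; `list_keys` is a hypothetical helper name.

```python
def list_keys(s3_client, bucket: str, prefix: str) -> list:
    """Return every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Possible usage in this notebook:
# for key in list_keys(s3, S3_WIKI_BUCKET, "raw-edits/"):
#     print(" -", key)
```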


In [19]:
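# To load another date, change DATE_PARAM and re-run the extract, transform, and upload cells.
# Only the last assignment in a cell takes effect, so keep a single value here.
DATE_PARAM = "2025-11-25"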
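
Instead of editing `DATE_PARAM` by hand for each run, the dates for a backfill can be generated. This is a sketch under the assumption that the extract, transform, and upload steps are wrapped in a function of your own; `date_params` is a hypothetical helper and is not part of the notebook.

```python
import datetime


def date_params(start: str, end: str) -> list:
    """Return every date from start to end (inclusive) as YYYY-MM-DD strings."""
    first = datetime.date.fromisoformat(start)
    last = datetime.date.fromisoformat(end)
    days = (last - first).days
    # An end date before the start date yields an empty list.
    return [
        (first + datetime.timedelta(days=offset)).isoformat()
        for offset in range(days + 1)
    ]


for date_param in date_params("2025-11-23", "2025-11-25"):
    print(date_param)
```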
