# Wikipedia Pageview Data Analysis for Rare Diseases

This project analyzes Wikipedia pageviews for articles about rare diseases. We collected desktop and mobile view data from the Wikimedia Pageviews API for articles sourced from the National Organization for Rare Disorders (NORD). The analysis covers July 2015 through the latest available month, which was September 2024 at the time of collection.

**Steps Involved:**
1. **Data Acquisition**: Collecting monthly pageview data for desktop and mobile access using the Wikimedia Pageviews API.
2. **Data Processing**: Cleaning and organizing the data into usable formats.
3. **Analysis and Visualization**: Graphing subsets of the data, focusing on the articles with the highest and lowest average pageviews, the top peak pageviews, and the articles with the fewest months of data.


### Importing Required Libraries

We import the Python libraries needed for handling API requests, JSON manipulation, data analysis, and plotting. `pandas` and `matplotlib` are used extensively for data processing and visualization.


In [14]:
# Importing necessary libraries

import json   # Handling JSON responses
import time   # Managing wait times between API requests
import urllib.parse  # URL encoding article titles

# Requests module is needed to interact with APIs
import requests  

# Pandas for data manipulation
import pandas as pd  

# Matplotlib for plotting results
import matplotlib.pyplot as plt  

# datetime to help with handling date ranges
from datetime import datetime


In [15]:
!pip3 install pandas

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.



## Step 1: Data Acquisition

We fetch pageview data for a subset of Wikipedia articles related to rare diseases. The data will be collected for desktop and mobile views, and we'll store it as JSON files for further analysis. The data includes monthly views from July 2015 to the latest available month.
The next cell sets up the constants used for every API request.



In [18]:
# Constants to keep things consistent

# Base URL for all API requests
API_BASE_URL = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# API parameters - we'll fill these in when making requests
API_REQUEST_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# To avoid exceeding the rate limit (100 requests per second), we're adding a small wait time
API_LATENCY = 0.002  # Assuming 2ms latency
API_RATE_LIMIT_WAIT = (1.0 / 100.0) - API_LATENCY

# Headers - API recommends adding your contact info
HEADERS = {
    'User-Agent': 'rsethi3@uw.edu, University of Washington, MSDS DATA 512 - AUTUMN 2024',
}

# Default API template - values can change depending on the request
API_PARAMS_TEMPLATE = {
    "project": "en.wikipedia.org",
    "access": "desktop",  # 'desktop' or 'mobile-web' or 'mobile-app' 
    "agent": "user",
    "article": "",  # We'll fill this dynamically
    "granularity": "monthly",
    "start": "2015070100",  # Starting from July 2015
    "end": "2024093000"  # Up until September 2024
}


### Data Acquisition Setup

The Wikimedia Pageviews API allows us to fetch monthly pageview data for Wikipedia articles. Here, we set up the base URL, API request parameters, and headers required for the API calls. We'll also introduce a slight delay between requests to avoid hitting the rate limits imposed by the API.
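As a rough illustration of how the base URL and the endpoint template combine, the sketch below expands them for a single article. The helper name `build_request_url` is hypothetical and not part of this notebook. The title is percent-encoded with `safe=''` on the assumption that a slash inside a title should not be read as a path separator.

```python
import urllib.parse

# Same base URL and endpoint template as the constants defined in this notebook
API_BASE_URL = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'
API_REQUEST_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'


def build_request_url(article_title, access="desktop",
                      start="2015070100", end="2024093000"):
    """Hypothetical helper: expand the endpoint template for one article."""
    # Spaces become underscores; safe='' percent-encodes '/' as well
    encoded_title = urllib.parse.quote(article_title.replace(' ', '_'), safe='')
    return API_BASE_URL + API_REQUEST_PARAMS.format(
        project="en.wikipedia.org",
        access=access,
        agent="user",
        article=encoded_title,
        granularity="monthly",
        start=start,
        end=end,
    )


print(build_request_url("Klinefelter syndrome"))
```

Printing the URL before sending any request is a cheap way to spot encoding problems in article titles.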


In [19]:
# Function to request pageviews for an article
def fetch_pageviews(article_title=None, access_type="desktop"):
    """
    Fetches pageviews data for a specific article.
    Parameters:
        - article_title: The Wikipedia article to fetch data for.
        - access_type: 'desktop', 'mobile-web', or 'mobile-app' (which access views to get)
    Returns:
        - JSON response from the API, or None in case of error
    """

    # Making sure we have an article title to work with
    if not article_title:
        raise ValueError("Need to provide an article title for fetching pageviews!")

    # Work on a copy so the shared template is never modified between calls
    params = dict(API_PARAMS_TEMPLATE)
    params['access'] = access_type

    # Encoding the article title for the URL; safe='' also encodes '/',
    # which would otherwise be read as a path separator
    params['article'] = urllib.parse.quote(article_title.replace(' ', '_'), safe='')

    # Combining everything into a proper request URL
    request_url = API_BASE_URL + API_REQUEST_PARAMS.format(**params)

    # Making the API request
    try:
        time.sleep(API_RATE_LIMIT_WAIT)  # Wait to avoid hitting rate limits
        response = requests.get(request_url, headers=HEADERS)
        return response.json()
    except Exception as e:
        print(f"Error fetching data for {article_title}: {e}")
        return None


### Loading Article Titles

We load a list of rare disease article titles from a CSV file. This file contains the names of the articles for which we will retrieve pageview data. Once loaded, the titles are stored in a list for easy iteration during the data acquisition phase.


In [30]:
# Loading article titles from the CSV file
data_path = '/Users/radhikasethi/Documents/github/data-512-homework_1/data/Copy of rare-disease_cleaned.AUG.2024.csv'

# Read the CSV file using pandas and pull the 'disease' column for article titles
try:
    articles_df = pd.read_csv(data_path)
    article_titles = articles_df['disease'].tolist()  # Convert the 'disease' column to a list
    print(f"Successfully loaded {len(article_titles)} article titles.")
except FileNotFoundError as e:
    print(f"Error: File not found. Please check the file path: {data_path}")
except Exception as e:
    print(f"An error occurred while reading the CSV: {e}")


Successfully loaded 1773 article titles.


### Data Collection Loop

For each article, we fetch monthly pageview data from the API for desktop, mobile-web, and mobile-app views. The results are logged for progress tracking, and the data is aggregated into lists for further processing. A small sleep time is introduced to prevent exceeding the API request rate limit.
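One way to combine the access types is to key each series on its timestamp instead of pairing records by position. The following is a minimal sketch under that assumption, not the notebook's own loop. The helper name `sum_views_by_timestamp` is hypothetical, and the sample records are made up for illustration.

```python
from collections import defaultdict


def sum_views_by_timestamp(*record_lists):
    """Hypothetical helper: add up 'views' across record lists, keyed by 'timestamp'."""
    totals = defaultdict(int)
    for records in record_lists:
        for record in records:
            totals[record['timestamp']] += record['views']
    # Return records sorted by timestamp, in the same shape as the inputs
    return [{"timestamp": ts, "views": totals[ts]} for ts in sorted(totals)]


# Made-up sample records (not real API output); the app series lacks the first month
mobile_web = [{"timestamp": "2015070100", "views": 100},
              {"timestamp": "2015080100", "views": 120}]
mobile_app = [{"timestamp": "2015080100", "views": 5}]

mobile_total = sum_views_by_timestamp(mobile_web, mobile_app)
print(mobile_total)
```

With this approach a month that is missing from one series still contributes the views recorded in the other, and the same helper could be applied again to add the desktop series for a cumulative total.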


### Saving Data to JSON

After collecting all the data, we convert the lists into `pandas` DataFrames for easier manipulation. The data is then saved into JSON files for desktop, mobile, and cumulative pageviews, so it can be analyzed later.


In [31]:
import json, time, urllib.parse, requests, pandas as pd, os

# Setting up directories
data_dir = '/Users/radhikasethi/Documents/github/data-512-homework_1/data/'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# Initialize lists to store the pageview data
desktop_views_data = []
mobile_views_data = []
cumulative_views_data = []

# Function to log progress and errors
def log_message(message):
    print(f"[LOG]: {message}")

# Function to handle pageview fetching with error handling
def fetch_pageviews_safe(article_title, access_type="desktop"):
    try:
        data = fetch_pageviews(article_title=article_title, access_type=access_type)
        if data and 'items' in data:
            log_message(f"{len(data['items'])} records fetched for {article_title} ({access_type})")
            return data['items']
        else:
            log_message(f"No data found for {article_title} ({access_type})")
            return []
    except Exception as e:
        log_message(f"Error fetching data for {article_title} ({access_type}): {str(e)}")
        return []

# Start data acquisition
log_message(f"Starting data collection for {len(article_titles)} articles...")

for idx, article in enumerate(article_titles):
    log_message(f"Processing article {idx+1}/{len(article_titles)}: {article}")
    
    # Fetch desktop views
    desktop_records = fetch_pageviews_safe(article_title=article, access_type="desktop")
    if desktop_records:
        for record in desktop_records:
            desktop_views_data.append({
                "article_title": article,
                "timestamp": record['timestamp'],
                "views": record['views']
            })
    
    # Fetch mobile views (web + app) and then sum them.
    # zip() pairs records by position, so this assumes both series cover the same months.
    mobile_web_records = fetch_pageviews_safe(article_title=article, access_type="mobile-web")
    mobile_app_records = fetch_pageviews_safe(article_title=article, access_type="mobile-app")
    if mobile_web_records and mobile_app_records:
        for web_record, app_record in zip(mobile_web_records, mobile_app_records):
            if web_record['timestamp'] == app_record['timestamp']:
                total_mobile_views = web_record['views'] + app_record['views']
                mobile_views_data.append({
                    "article_title": article,
                    "timestamp": web_record['timestamp'],
                    "views": total_mobile_views
                })
                
                # Calculate the cumulative views by summing desktop and mobile views
                matching_desktop_record = next((rec for rec in desktop_records if rec['timestamp'] == web_record['timestamp']), None)
                if matching_desktop_record:
                    cumulative_views_data.append({
                        "article_title": article,
                        "timestamp": web_record['timestamp'],
                        "views": matching_desktop_record['views'] + total_mobile_views
                    })
    
    # Logging the progress
    log_message(f"Processed {idx+1}/{len(article_titles)} articles.")
    
    # Optional sleep to avoid hitting API limits as I ran into this issue 
    time.sleep(0.01)

# Converting the collected data into pandas DataFrames
df_desktop = pd.DataFrame(desktop_views_data)
df_mobile = pd.DataFrame(mobile_views_data)
df_cumulative = pd.DataFrame(cumulative_views_data)

# Saving the data to JSON files
log_message("Saving data to JSON files...")
df_desktop.to_json(f'{data_dir}rare-disease_monthly_desktop_201507-202409.json', orient='records')
df_mobile.to_json(f'{data_dir}rare-disease_monthly_mobile_201507-202409.json', orient='records')
df_cumulative.to_json(f'{data_dir}rare-disease_monthly_cumulative_201507-202409.json', orient='records')

log_message("Data collection and saving completed successfully.")


[LOG]: Starting data collection for 1773 articles...
[LOG]: Processing article 1/1773: Klinefelter syndrome
[LOG]: 111 records fetched for Klinefelter syndrome (desktop)
[LOG]: 111 records fetched for Klinefelter syndrome (mobile-web)
[LOG]: 111 records fetched for Klinefelter syndrome (mobile-app)
[LOG]: Processed 1/1773 articles.
[LOG]: Processing article 2/1773: Aarskog–Scott syndrome
[LOG]: 111 records fetched for Aarskog–Scott syndrome (desktop)
[LOG]: 111 records fetched for Aarskog–Scott syndrome (mobile-web)
[LOG]: 111 records fetched for Aarskog–Scott syndrome (mobile-app)
[LOG]: Processed 2/1773 articles.
[LOG]: Processing article 3/1773: Abetalipoproteinemia
[LOG]: 111 records fetched for Abetalipoproteinemia (desktop)
[LOG]: 111 records fetched for Abetalipoproteinemia (mobile-web)
[LOG]: 111 records fetched for Abetalipoproteinemia (mobile-app)
[LOG]: Processed 3/1773 articles.
[LOG]: Processing article 4/1773: MT-TP
[LOG]: 111 records fetched for MT-TP (desktop)
[LOG]: 11


## Step 2: Data Processing

Once we've collected the data, we process it by combining views from different access types (desktop, mobile-web, mobile-app). We also create a cumulative dataset that includes all-access pageviews. This processed data is stored in JSON files and will be used for visualization in the next step.
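The timestamps in these files use the API's `YYYYMMDDHH` format, so a quick sanity check is to work out how many monthly records a complete article should have. The sketch below does that arithmetic for the request window used here. The helper name `months_in_range` is hypothetical and not part of this notebook.

```python
from datetime import datetime


def months_in_range(start, end, fmt='%Y%m%d%H'):
    """Hypothetical helper: count calendar months from start to end, inclusive."""
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    return (end_dt.year - start_dt.year) * 12 + (end_dt.month - start_dt.month) + 1


# Request window used in this notebook
expected_months = months_in_range('2015070100', '2024093000')
print(expected_months)  # 111
```

This agrees with the 111 records per article reported in the collection log, so an article with fewer records likely did not exist for the whole period.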



In [37]:
import os
import pandas as pd
import json

# Paths to the JSON files
desktop_json_path = '/Users/radhikasethi/Documents/github/data-512-homework_1/data/rare-disease_monthly_desktop_201507-202409.json'
mobile_json_path = '/Users/radhikasethi/Documents/github/data-512-homework_1/data/rare-disease_monthly_mobile_201507-202409.json'
cumulative_json_path = '/Users/radhikasethi/Documents/github/data-512-homework_1/data/rare-disease_monthly_cumulative_201507-202409.json'

# Function to read and validate JSON data
def load_and_check_json(json_file_path, expected_start='2015070100', expected_end='2024093000'):
    try:
        # Check if the file exists
        if not os.path.exists(json_file_path):
            print(f"[ERROR]: File not found - {json_file_path}")
            return
        
        # Load JSON data
        print(f"[LOG]: Loading data from {os.path.basename(json_file_path)}")
        with open(json_file_path, 'r') as f:
            data = json.load(f)
        
        # Check and print the first few records for timestamp issues
        for record in data[:5]:  # Show the first 5 records to verify
            print(f"[DEBUG]: Article: {record['article_title']}, Timestamp: {record['timestamp']}, Views: {record['views']}")
        
        # Convert timestamp to correct format -- caused many issues!
        df = pd.DataFrame(data)
        
        print(f"[LOG]: Checking timestamps in {os.path.basename(json_file_path)}")
        # Attempt to convert timestamp again!
        df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H', errors='coerce')
        
        # Checking for invalid timestamps 
        invalid_timestamps = df['timestamp'].isna().sum()
        if invalid_timestamps > 0:
            print(f"[ERROR]: {invalid_timestamps} invalid timestamps found in {os.path.basename(json_file_path)}")
        
        # Checking the range of timestamps
        earliest_timestamp = df['timestamp'].min()
        latest_timestamp = df['timestamp'].max()
        
        print(f"[LOG]: Earliest timestamp: {earliest_timestamp}")
        print(f"[LOG]: Latest timestamp: {latest_timestamp}")
        
        # Checking for timestamps outside the expected range
        expected_start_date = pd.to_datetime(expected_start, format='%Y%m%d%H')
        expected_end_date = pd.to_datetime(expected_end, format='%Y%m%d%H')
        
        out_of_range_records = df[(df['timestamp'] < expected_start_date) | (df['timestamp'] > expected_end_date)]
        if not out_of_range_records.empty:
            print(f"[ERROR]: Found {len(out_of_range_records)} records with dates outside the range in {os.path.basename(json_file_path)}")
            print(out_of_range_records[['article_title', 'timestamp']].head())  # Display first few records
        
        else:
            print(f"[LOG]: All timestamps are within the expected range in {os.path.basename(json_file_path)}.")
    
    except Exception as e:
        print(f"[ERROR]: An error occurred while reading {json_file_path}: {e}")

# Running these checks for all of our JSON files till now 
print("[LOG]: Checking desktop data timestamps...")
load_and_check_json(desktop_json_path)

print("[LOG]: Checking mobile data timestamps...")
load_and_check_json(mobile_json_path)

print("[LOG]: Checking cumulative data timestamps...")
load_and_check_json(cumulative_json_path)


[LOG]: Checking desktop data timestamps...
[LOG]: Loading data from rare-disease_monthly_desktop_201507-202409.json
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015070100, Views: 36798
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015080100, Views: 33180
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015090100, Views: 35882
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015100100, Views: 39887
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015110100, Views: 40749
[LOG]: Checking timestamps in rare-disease_monthly_desktop_201507-202409.json
[LOG]: Earliest timestamp: 2015-07-01 00:00:00
[LOG]: Latest timestamp: 2024-09-01 00:00:00
[LOG]: All timestamps are within the expected range in rare-disease_monthly_desktop_201507-202409.json.
[LOG]: Checking mobile data timestamps...
[LOG]: Loading data from rare-disease_monthly_mobile_201507-202409.json
[DEBUG]: Article: Klinefelter syndrome, Timestamp: 2015070100, Views: 38513
[DEBUG]: Article: Klinefelter syndrome, 

## Step 3: Visual Analysis
We'll create visualizations to show trends in pageviews over time. This includes identifying the articles with the highest and lowest average views, the top 10 peak pageview articles, and those with the fewest months of data. Each graph will be saved in the images folder.

This step is done in the data analysis.ipynb notebook!
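For orientation, the subsets described above could be selected with `pandas` roughly as follows. This is only a sketch on made-up sample records in the same shape as the saved JSON files, not the code from the analysis notebook, and plotting is left out.

```python
import pandas as pd

# Made-up sample in the same record shape as the saved JSON files
records = [
    {"article_title": "A", "timestamp": "2015070100", "views": 10},
    {"article_title": "A", "timestamp": "2015080100", "views": 30},
    {"article_title": "B", "timestamp": "2015070100", "views": 500},
    {"article_title": "B", "timestamp": "2015080100", "views": 300},
    {"article_title": "C", "timestamp": "2015080100", "views": 50},
]
df = pd.DataFrame(records)
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H')

# Articles with the highest and lowest average monthly views
avg_views = df.groupby('article_title')['views'].mean()
highest_avg, lowest_avg = avg_views.idxmax(), avg_views.idxmin()

# Top 10 articles ranked by their single peak month
top_peaks = df.groupby('article_title')['views'].max().nlargest(10)

# Articles with the fewest months of available data
fewest_months = df.groupby('article_title')['timestamp'].nunique().nsmallest(10)

print(highest_avg, lowest_avg)
print(top_peaks.index.tolist())
print(fewest_months.index.tolist())
```

Each selection yields a list of article titles that can then be used to filter the full DataFrame before plotting its time series.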

Credit: I used ChatGPT to understand certain Python syntax and how to structure the code that fetches data from the API. I spent a lot of time building that code and kept running into errors with the forward and back slashes in the API request.

It helped me understand where I was going wrong in my code!