# Analysis of the Rare Diseases Wikipedia Pageview Dataset

In this notebook, I conduct basic visual analysis of the Rare Diseases Wikipedia Pageview Dataset. To see how that dataset was constructed, refer to `building_article_views_datasets.ipynb`.

## Importing Dependencies

In [69]:
import json
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import plotly.io as pio

import pkg_resources

# As recommended by "Assessing Reproducibility" by Rokem, et al. (2018), I report the version of the packages used to execute the
# computations in this notebook:
packages = ['pandas', 'matplotlib', 'numpy', 'plotly'] # reporting JSON version didn't work

for package in packages:
    version = pkg_resources.get_distribution(package).version
    print(f"{package}: {version}")

## AI ATTRIBUTION: I derived the code to report packages from a ChatGPT search on October 4, 2024 with the prompt:
## "how can I report the versions of each package used in a python script"

pandas: 2.2.3
matplotlib: 3.9.2
numpy: 2.1.1
plotly: 5.24.1
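
`pkg_resources` is deprecated in recent setuptools releases, and the standard library's `importlib.metadata` offers an equivalent lookup. The following is a minimal sketch of that alternative, assuming Python 3.8 or later; the helper name `report_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package name: installed version}, or "not installed" when the lookup fails."""
    versions = {}
    for package in packages:
        try:
            versions[package] = version(package)
        except PackageNotFoundError:
            versions[package] = "not installed"
    return versions


# Hypothetical usage mirroring the cell above
for name, ver in report_versions(["pandas", "matplotlib", "numpy", "plotly"]).items():
    print(f"{name}: {ver}")
```

Returning a dictionary rather than printing directly makes it easy to save the versions alongside the outputs if desired.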


## Data Prep

First, we import the datasets for this analysis. They were created by requesting pageview data from the Wikimedia REST API for around 1,700 rare disease-related articles.

In [2]:
with open('../output/rare-disease_monthly_desktop_201507-202409.json', 'r') as file:
    desktop = json.load(file)

with open('../output/rare-disease_monthly_mobile_201507-202409.json', 'r') as file:
    mobile = json.load(file)

with open('../output/rare-disease_monthly_cumulative_201507-202409.json', 'r') as file:
    cumulative = json.load(file)


`pd.json_normalize()` is a [pandas function](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html) that flattens JSON records with a recurring structure into a dataframe, using the field names as column names. The function below reshapes the data into a flat list of records so that `json_normalize()` works properly, then converts it to a dataframe (which allows us to plot time series data).

In [3]:
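def json_to_df(json_data):
    # Flatten {article title: {"items": [...]}} into one list of monthly records
    transformed_data = []

    for key, value in json_data.items():
        try:
            for item in value["items"]:
                item["id"] = key  # tag each record with its article title
                transformed_data.append(item)
        except (KeyError, TypeError):
            # skip entries that lack a usable "items" list
            continue

    output_df = pd.json_normalize(transformed_data)

    return output_df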
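
To make the reshaping step concrete, here is a small sketch on toy data. The nesting (`{title: {"items": [...]}}`) is inferred from how `json_to_df` reads its input; the article names, the `"detail"` key, and the values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for the loaded JSON; the nesting is assumed from how json_to_df reads it
toy_json = {
    "Article A": {"items": [
        {"article": "Article_A", "timestamp": "2015070100", "views": 10},
        {"article": "Article_A", "timestamp": "2015080100", "views": 12},
    ]},
    "Article B": {"detail": "not found"},  # no "items" key, so it is skipped
}

records = []
for title, response in toy_json.items():
    for item in response.get("items", []):
        records.append({**item, "id": title})  # copy instead of mutating the input

toy_df = pd.json_normalize(records)
print(toy_df)
```

Unlike `json_to_df`, this variant copies each record before adding `id`, so the loaded JSON is left unchanged.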

Below is a helper function to convert timestamps from the way they're formatted in the Wikimedia REST API responses to a format pandas recognizes.

In [4]:
def parse_timestamps(df, timestamp_column = "timestamp"):
    datestring = df[timestamp_column].str[:4] + "-" + df[timestamp_column].str[4:6] + "-" + df[timestamp_column].str[6:8]
    df[timestamp_column] = pd.to_datetime(datestring)

    return df
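
As an alternative to slicing the string by hand, `pd.to_datetime` accepts an explicit `format`. The sketch below assumes the raw timestamps are `YYYYMMDDHH` strings (for example `"2015070100"`), which is consistent with the slicing in `parse_timestamps` but is not confirmed by this notebook:

```python
import pandas as pd

# Assumed raw format: YYYYMMDDHH strings such as "2015070100"
raw = pd.Series(["2015070100", "2015080100", "2024080100"])

# Parse the whole string in one step instead of rebuilding "YYYY-MM-DD" by slicing
parsed = pd.to_datetime(raw, format="%Y%m%d%H")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

An explicit format also fails loudly on malformed values rather than guessing how to interpret them.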

With pre-processing complete, we convert the datasets into dataframes, enabling time series plotting.

In [5]:
desktop_df = json_to_df(desktop)
desktop_df = parse_timestamps(desktop_df)
desktop_df.set_index('timestamp', inplace=True)

In [6]:
mobile_df = json_to_df(mobile)
mobile_df = parse_timestamps(mobile_df)
mobile_df.set_index('timestamp', inplace=True)

In [7]:
cumulative_df = json_to_df(cumulative)
cumulative_df = parse_timestamps(cumulative_df)
cumulative_df.set_index('timestamp', inplace=True)

## Maximum Average and Minimum Average

From the assignment spec:

>Maximum Average and Minimum Average - The first graph should contain time series for the articles that have the highest average page requests and the lowest average page requests for desktop access and mobile access over the entire time series. Your graph should have four lines (max desktop, min desktop, max mobile, min mobile).

To achieve this, we first identify which pages had the highest and lowest average pageviews over the entire time series:

In [49]:
desktop_df_ = desktop_df[["id", "views"]]

avg_desktop_views = desktop_df_.groupby("id").mean().sort_values("views", ascending=False)

id_of_desktop_max = avg_desktop_views["views"].idxmax()

id_of_desktop_max

'Black Death'

In [50]:
id_of_desktop_min = avg_desktop_views["views"].idxmin()

id_of_desktop_min

'Filippi Syndrome'

In [51]:
avg_desktop_views

id,views
Black Death,104859.315315
Tuberculosis,71768.621622
Multiple sclerosis,57457.684685
Smallpox,55645.261261
Dopamine,48815.486486
...,...
Hypoplasminogenemia,11.027273
Primary anemia,9.162162
CDLS,8.135135
18p,4.495495


In [52]:
mobile_df_ = mobile_df[["id", "views"]]

avg_mobile_views = mobile_df_.groupby("id").mean().sort_values("views", ascending=False)

id_of_mobile_max = avg_mobile_views["views"].idxmax()

id_of_mobile_min = avg_mobile_views["views"].idxmin()

In [53]:
id_of_mobile_max

'Black Death'

In [54]:
id_of_mobile_min

'Filippi Syndrome'
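
The max/min selection used in the cells above can be sketched on toy data as follows. The ids and view counts are made up; only the `groupby` / `mean` / `idxmax` / `idxmin` pattern mirrors the notebook:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["A", "A", "B", "B", "C", "C"],
    "views": [10, 30, 100, 200, 1, 3],
})

# Average monthly views per article, as a Series indexed by article id
avg_views = toy.groupby("id")["views"].mean()

id_of_max = avg_views.idxmax()  # index label of the largest average
id_of_min = avg_views.idxmin()  # index label of the smallest average
print(id_of_max, id_of_min)
```

Because `idxmax` and `idxmin` return index labels, no sorting is needed to find the two articles; sorting only helps when inspecting the full ranking.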

Having identified above which pages have the highest and lowest average pageviews on desktop and mobile, I now construct a dataset of monthly pageviews that has *only* these four `{page} - {access-type}` combinations, which can then be plotted:

In [56]:
mobile_desktop_df = pd.concat([desktop_df, mobile_df])

The logic below filters the concatenated dataframe down to only the max and min for each access type:

In [57]:
df_for_graph1 = mobile_desktop_df[(mobile_desktop_df["id"] == id_of_desktop_max) | 
                                      (mobile_desktop_df["id"] == id_of_desktop_min) |
                                      (mobile_desktop_df["id"] == id_of_mobile_max) | 
                                      (mobile_desktop_df["id"] == id_of_mobile_min)
                                      ]
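
The same kind of filter can be written more compactly with `Series.isin`. This is a sketch on toy data; `wanted_ids` is a hypothetical stand-in for the four max/min ids identified above:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["A", "B", "C", "A", "C"],
    "views": [5, 7, 9, 11, 13],
})

wanted_ids = {"A", "C"}  # hypothetical stand-in for the four max/min ids

# Keep only rows whose id is in the set; .copy() makes the result an independent dataframe
subset = toy[toy["id"].isin(wanted_ids)].copy()
print(subset)
```

Using a set also removes duplicates automatically, which matters here because the desktop and mobile extremes turn out to be the same articles.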

To make the plot more readable, we create a new column labeling each series by access type and max/min.

In [64]:
def label_series(row):
    # Mobile rows have no "access" value in this dataset, so a missing value marks them as mobile
    if row["id"] == id_of_desktop_max and row["access"] == "desktop":
        return "Desktop Max: " + row["id"]
    if row["id"] == id_of_desktop_min and row["access"] == "desktop":
        return "Desktop Min: " + row["id"]
    if row["id"] == id_of_mobile_max and pd.isna(row["access"]):
        return "Mobile Max: " + row["id"]
    if row["id"] == id_of_mobile_min and pd.isna(row["access"]):
        return "Mobile Min: " + row["id"]
    
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
df_for_graph1 = df_for_graph1.assign(label=df_for_graph1.apply(label_series, axis=1))

In [65]:
df_for_graph1

timestamp,project,article,granularity,access,agent,views,id,label
2021-12-01,en.wikipedia,Filippi_Syndrome,monthly,desktop,user,20,Filippi Syndrome,Desktop Min: Filippi Syndrome
2022-01-01,en.wikipedia,Filippi_Syndrome,monthly,desktop,user,10,Filippi Syndrome,Desktop Min: Filippi Syndrome
2022-02-01,en.wikipedia,Filippi_Syndrome,monthly,desktop,user,12,Filippi Syndrome,Desktop Min: Filippi Syndrome
2022-03-01,en.wikipedia,Filippi_Syndrome,monthly,desktop,user,46,Filippi Syndrome,Desktop Min: Filippi Syndrome
2022-04-01,en.wikipedia,Filippi_Syndrome,monthly,desktop,user,11,Filippi Syndrome,Desktop Min: Filippi Syndrome
...,...,...,...,...,...,...,...,...
2024-05-01,en.wikipedia,Black_Death,monthly,,user,107363,Black Death,Mobile Max: Black Death
2024-06-01,en.wikipedia,Black_Death,monthly,,user,99637,Black Death,Mobile Max: Black Death
2024-07-01,en.wikipedia,Black_Death,monthly,,user,141835,Black Death,Mobile Max: Black Death
2024-08-01,en.wikipedia,Black_Death,monthly,,user,150811,Black Death,Mobile Max: Black Death


The code below creates an interactive plot; users can hover over series to see which one they're looking at and get detailed information about pageviews over time.

In [60]:
fig = px.line(
    df_for_graph1,
    x = df_for_graph1.index,
    y = "views",
    color = "label",
    title = "Page views over time for articles with minimum and maximum average views, 2015-2024"
)

fig.data[2].line.dash = 'dash' # make one of the minimum lines dashed so that they both show up

fig.show()

Next, we output the plot to the `output/plots` folder:

In [61]:
pio.write_image(fig, "../output/plots/fig1.png", width = 1000, height = 800)
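
Static export with `pio.write_image` depends on an image-export engine (the `kaleido` package in plotly 5), and it writes to a path whose folder must already exist. A minimal sketch for creating the folder first, using only the standard library; the helper name `ensure_dir` is hypothetical:

```python
from pathlib import Path
import tempfile


def ensure_dir(path):
    """Create the directory (and any missing parents) if needed and return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Demonstrated in a temporary location so the sketch has no side effects on the project tree
with tempfile.TemporaryDirectory() as tmp:
    plots_dir = ensure_dir(Path(tmp) / "output" / "plots")
    created = plots_dir.is_dir()
print(created)
```

In this notebook the equivalent call would be `ensure_dir("../output/plots")` before the first export.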

The plot above is hard to read because min/max are on the same scale but differ by several orders of magnitude. To make it easier to see differences, I will adjust the y-axis to be on a log scale and re-plot:

In [62]:
fig.data[2].line.dash = 'solid'

fig.update_layout(
    yaxis_type = "log",
    title = "LOG Page views over time for articles with minimum and maximum average views, 2015-2024"
)

fig.show()

In [63]:
pio.write_image(fig, "../output/plots/fig1_log.png", width = 1000, height = 300)
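
To put a number on "several orders of magnitude", the sketch below compares the largest and smallest desktop averages shown in the `avg_desktop_views` output earlier in this section. It is only an illustration of why a log axis helps, not part of the plotting pipeline:

```python
import math

# Largest and smallest desktop averages from the avg_desktop_views output
largest_avg = 104859.315315   # Black Death
smallest_avg = 4.495495       # 18p

# log10 of the ratio gives the gap in orders of magnitude
orders_of_magnitude = math.log10(largest_avg / smallest_avg)
print(round(orders_of_magnitude, 1))
```

On a linear axis a gap of this size flattens the smaller series against zero, whereas a log axis gives each order of magnitude equal vertical space.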

## Top 10 Peak Page Views

Per the assignment spec:
>Top 10 Peak Page Views - The second graph should contain time series for the top 10 article pages by largest (peak) page views over the entire time series by access type. You first find the month for each article that contains the highest (peak) page views, and then order the articles by these peak values. Your graph should contain the top 10 for desktop and top 10 for mobile access (20 lines).

My approach to transform the data to generate these plots is:
- Group by article and access type; figure out the maximum (peak) pageview by article and access type
- Sort peak views high to low
- Take the 10 articles with highest peaks *on mobile* and the 10 articles with the highest peaks *on desktop*
- Plot them

Code executing this strategy is below:

In [66]:
mobile_desktop_df_ = mobile_desktop_df[["article", "access", "views"]].copy()
# Mobile rows have no "access" value; assign the filled column back rather than using a chained inplace fillna
mobile_desktop_df_["access"] = mobile_desktop_df_["access"].fillna("mobile")

sorted_mobile_desktop = mobile_desktop_df_.groupby(["article", "access"]).max().sort_values("views", ascending=False).reset_index()

sorted_mobile_desktop.rename(columns={"views" : "peak_views"}, inplace=True)

In [24]:
sorted_mobile_desktop

,article,access,peak_views
0,Black_Death,mobile,2313741
1,Pandemic,mobile,2276916
2,Pandemic,desktop,1046521
3,Black_Death,desktop,823649
4,Pfeiffer_syndrome,mobile,777886
...,...,...,...
3535,Joseph_Vinetz,mobile,30
3536,Project_Nicaragua,mobile,25
3537,HOXA6,mobile,22
3538,Filippi_Syndrome,mobile,11


Below, I report the 10 articles with the highest peak views, by access type:

In [25]:
mobile_peak = sorted_mobile_desktop[sorted_mobile_desktop["access"] == "mobile"].head(10)
mobile_peak_articles = set(mobile_peak["article"])

mobile_peak_articles

{'Black_Death',
 'Botulism',
 'Chloroquine',
 'Cleidocranial_dysostosis',
 'Glioblastoma',
 'Kawasaki_disease',
 'Pandemic',
 'Pfeiffer_syndrome',
 'Porphyria',
 'Stiff-person_syndrome'}

In [26]:
desktop_peak = sorted_mobile_desktop[sorted_mobile_desktop["access"] == "desktop"].head(10)
desktop_peak_articles = set(desktop_peak["article"])

desktop_peak_articles

{'Amyotrophic_lateral_sclerosis',
 'Black_Death',
 'Botulism',
 'Chloroquine',
 'Cleidocranial_dysostosis',
 'Fibrodysplasia_ossificans_progressiva',
 'Pandemic',
 'Pfeiffer_syndrome',
 'Robert_Koch',
 'Smallpox'}
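
The per-access top-N selection above can also be done in one step with `groupby(...).head(n)` on the sorted peaks. This is a sketch on invented data with `n = 2`; the notebook itself filters each access type separately:

```python
import pandas as pd

toy = pd.DataFrame({
    "article": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "access": ["desktop", "mobile"] * 4,
    "views": [50, 5, 20, 90, 70, 10, 30, 40],
})

# Peak (maximum) monthly views per article and access type, sorted high to low
peaks = (
    toy.groupby(["article", "access"], as_index=False)["views"].max()
    .rename(columns={"views": "peak_views"})
    .sort_values("peak_views", ascending=False)
)

# head(2) within each access group keeps the two highest peaks per access type
top_n = peaks.groupby("access").head(2)
print(top_n)
```

Because the frame is sorted before grouping, `head` returns the highest peaks within each access type while preserving the overall sort order.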

We can use the lists of articles with highest peaks for each access type to filter the big dataframe down only to the relevant subset of information for our plot:

In [27]:
df_for_graph2 = mobile_desktop_df_[((mobile_desktop_df_["article"].isin(mobile_peak_articles)) & 
                                    (mobile_desktop_df_["access"] == "mobile")) |
                                    ((mobile_desktop_df_["article"].isin(desktop_peak_articles)) &
                                     (mobile_desktop_df_["access"] == "desktop"))]

In [28]:
df_for_graph2

timestamp,article,access,views
2015-07-01,Pandemic,desktop,14291
2015-08-01,Pandemic,desktop,15232
2015-09-01,Pandemic,desktop,18668
2015-10-01,Pandemic,desktop,20499
2015-11-01,Pandemic,desktop,18930
...,...,...,...
2024-05-01,Botulism,mobile,46810
2024-06-01,Botulism,mobile,46597
2024-07-01,Botulism,mobile,41879
2024-08-01,Botulism,mobile,44741


Again, to make the plot more user-friendly I label each series by access type and article title.

In [29]:
def label_series_graph2(row):
    
    if row["article"] in (mobile_peak_articles) and row["access"] == "mobile":
        return "Mobile: " + row["article"]
    if row["article"] in (desktop_peak_articles) and row["access"] == "desktop":
        return "Desktop: " + row["article"]


In [30]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
df_for_graph2 = df_for_graph2.assign(label=df_for_graph2.apply(label_series_graph2, axis=1))

df_for_graph2

timestamp,article,access,views,label
2015-07-01,Pandemic,desktop,14291,Desktop: Pandemic
2015-08-01,Pandemic,desktop,15232,Desktop: Pandemic
2015-09-01,Pandemic,desktop,18668,Desktop: Pandemic
2015-10-01,Pandemic,desktop,20499,Desktop: Pandemic
2015-11-01,Pandemic,desktop,18930,Desktop: Pandemic
...,...,...,...,...
2024-05-01,Botulism,mobile,46810,Mobile: Botulism
2024-06-01,Botulism,mobile,46597,Mobile: Botulism
2024-07-01,Botulism,mobile,41879,Mobile: Botulism
2024-08-01,Botulism,mobile,44741,Mobile: Botulism


The code below creates an interactive plot; users can hover over series to see which one they're looking at and get detailed information about pageviews over time.

In [31]:
fig = px.line(
    df_for_graph2,
    x = df_for_graph2.index,
    y = "views",
    color = "label",
    title = "Page views over time for articles with top-10 peak views, by platform (hover to identify series)",
    labels = {"timestamp": "Date"}
)

fig.update_layout(
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    legend_title_text = "Platform & Article"
)

fig.show()

Next I write the figure to the `output/plots` folder:

In [32]:
pio.write_image(fig, "../output/plots/fig2.png", width=1600, height=800)

A couple of series dominate the plot above because of their very high peaks in 2020; to show the rest of the data more clearly, I also provide a log-scaled view:

In [33]:
fig.update_layout(
    yaxis_type = "log",
    title = "LOG Page views over time for articles with top-10 peak views, by platform"
)

fig.show()

In [34]:
pio.write_image(fig, "../output/plots/fig2_log.png", width=1600, height=800)

## Fewest Months of Data

Finally, the assignment spec asks:

>Fewest Months of Data - The third graph should show pages that have the fewest months of available data. These will likely be relatively short time series, some may only have one month of data. Your graph should show the 10 articles with the fewest months of data for desktop access and the 10 articles with the fewest months of data for mobile access.
In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled. Your graph should possess a legend and a title. You must generate a .png or .jpeg formatted image of your final graph.

My approach is very similar to the one outlined above: instead of taking the maximum views when grouping by access type and article, I use `.count()` as the aggregation function. Since each observation represents one month of data, this captures how many months of data exist for each access type / article combination.

In [35]:
fewest_months_sorted = mobile_desktop_df_.groupby(["article", "access"]).count().sort_values("views").reset_index()
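
The counting step can be sketched on toy data as follows. The articles and values are invented, and naming the count column `months` is a choice made here for readability; the notebook keeps the column name `views`:

```python
import pandas as pd

toy = pd.DataFrame({
    "article": ["A", "A", "A", "B", "B", "C"],
    "access": ["desktop", "desktop", "mobile", "desktop", "mobile", "mobile"],
    "views": [10, 12, 3, 7, 8, 1],
})

# Each row is one month of data, so counting rows per group counts months
months_of_data = (
    toy.groupby(["article", "access"])["views"].count()
    .rename("months")
    .sort_values()
    .reset_index()
)
print(months_of_data)
```

Note that `.count()` ignores missing values, so a month whose `views` value is missing would not be counted.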

In [36]:
fewest_mobile_months = fewest_months_sorted[fewest_months_sorted["access"] == "mobile"].head(10)
fewest_mobile_months

,article,access,views
0,Retinal_vasculopathy_with_cerebral_leukoenceph...,mobile,33
3,Filippi_Syndrome,mobile,34
4,COVID-19_vaccine_misinformation_and_hesitancy,mobile,37
6,CDKL5_deficiency_disorder,mobile,40
8,Joseph_Vinetz,mobile,40
11,Bradley_Monk,mobile,41
13,Spongy_degeneration_of_the_central_nervous_system,mobile,41
15,Hemolytic_jaundice,mobile,41
17,Deaf_plus,mobile,41
19,Reinforced_lipids,mobile,44


In [37]:
fewest_desktop_months = fewest_months_sorted[fewest_months_sorted["access"] == "desktop"].head(10)
fewest_desktop_months

,article,access,views
1,Retinal_vasculopathy_with_cerebral_leukoenceph...,desktop,33
2,Filippi_Syndrome,desktop,34
5,COVID-19_vaccine_misinformation_and_hesitancy,desktop,37
7,Joseph_Vinetz,desktop,40
9,CDKL5_deficiency_disorder,desktop,40
10,Deaf_plus,desktop,41
12,Spongy_degeneration_of_the_central_nervous_system,desktop,41
14,Bradley_Monk,desktop,41
16,Hemolytic_jaundice,desktop,41
18,Reinforced_lipids,desktop,44


The sets below capture the articles with the fewest months of data for each access type:

In [38]:
fewest_mobile_months_articles = set(fewest_mobile_months["article"])
fewest_desktop_months_articles = set(fewest_desktop_months["article"])

Having identified the articles with the fewest months of data above, I filter the concatenated dataframe down to only those articles for each access type.

In [39]:
df_for_graph3 = mobile_desktop_df_[((mobile_desktop_df_["article"].isin(fewest_mobile_months_articles)) &
                                   (mobile_desktop_df_["access"] == "mobile")) | 
                                   ((mobile_desktop_df_["article"].isin(fewest_desktop_months_articles)) &
                                    (mobile_desktop_df_["access"] == "desktop"))]

In [40]:
df_for_graph3

timestamp,article,access,views
2021-05-01,Spongy_degeneration_of_the_central_nervous_system,desktop,34
2021-06-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23
2021-07-01,Spongy_degeneration_of_the_central_nervous_system,desktop,21
2021-08-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23
2021-09-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23
...,...,...,...
2024-05-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2364
2024-06-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2383
2024-07-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,1855
2024-08-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2471


For readability, I label each series by access type and article title:

In [41]:
def label_series_graph3(row):
    if row["article"] in (fewest_mobile_months_articles) and row["access"] == "mobile":
        return "Mobile: " + row["article"]
    if row["article"] in (fewest_desktop_months_articles) and row["access"] == "desktop":
        return "Desktop: " + row["article"]

In [42]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the filtered slice
df_for_graph3 = df_for_graph3.assign(label=df_for_graph3.apply(label_series_graph3, axis=1))

df_for_graph3

timestamp,article,access,views,label
2021-05-01,Spongy_degeneration_of_the_central_nervous_system,desktop,34,Desktop: Spongy_degeneration_of_the_central_ne...
2021-06-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23,Desktop: Spongy_degeneration_of_the_central_ne...
2021-07-01,Spongy_degeneration_of_the_central_nervous_system,desktop,21,Desktop: Spongy_degeneration_of_the_central_ne...
2021-08-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23,Desktop: Spongy_degeneration_of_the_central_ne...
2021-09-01,Spongy_degeneration_of_the_central_nervous_system,desktop,23,Desktop: Spongy_degeneration_of_the_central_ne...
...,...,...,...,...
2024-05-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2364,Mobile: COVID-19_vaccine_misinformation_and_he...
2024-06-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2383,Mobile: COVID-19_vaccine_misinformation_and_he...
2024-07-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,1855,Mobile: COVID-19_vaccine_misinformation_and_he...
2024-08-01,COVID-19_vaccine_misinformation_and_hesitancy,mobile,2471,Mobile: COVID-19_vaccine_misinformation_and_he...


Finally, we generate an interactive plot of pageviews over time for the pages with the fewest months of available data. Viewers can hover over each series to identify the page name and access type, or they can view the static version in the `output/plots` folder.

In [43]:
fig = px.line(
    df_for_graph3,
    x = df_for_graph3.index,
    y = "views",
    color = "label",
    title = "Page views over time for articles with fewest months of available data, by platform (hover to identify series)",
    labels = {"timestamp": "Date"}
)

fig.update_layout(
    autosize = True,
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    legend_title_text = "Platform & Article",
)

fig.show()

Exporting `fig3`:

In [44]:
pio.write_image(fig, "../output/plots/fig3.png", width=1600, height=800)