# Time Series of Wikipedia Page Revisions

**Author**: Eni Mustafaraj  
**Date**: Original notebook: Oct 2021. New data collected in Dec 2022. Current version: Mar 2024.

**Explanation:** This notebook was part of a course in which students learned to work with the MediaWiki API and used it to get the list of revisions for Wikipedia pages. Having such revisions allows us to investigate how events in the real world affect the behavior of Wikipedia editors. A companion notebook shows how the data is collected from Wikipedia. This notebook starts by loading the data received from the API call and focuses on massaging that data into a format that can be visualized. 

Each file contains the revision history of the Wikipedia page of one of four famous female artists.


**Table of Contents**

1. [Data Exploration](#sec1)
2. [Data Massaging](#sec2)
3. [Visualizing the time series](#sec3)

<a id="sec1"></a>
## 1. Data Exploration

We will load the data from the JSON files, which are stored in the folder `raw_data_2022`.

In [None]:
import os, json
folder = "raw_data_2022"

files = os.listdir(folder) # read content of folder
files = [file for file in files if file.endswith('.json')] # ensure that we will read only JSON files
files

We can store their content in a dictionary, because that makes it easier to access them all together, instead of having several named variables:

In [None]:
# new dictionary to store the content of each JSON file
rawRevisionsDct = {}

for filename in files:
    name = filename.split('_')[0]
    filepath = os.path.join(folder, filename)
    with open(filepath) as f:  # the file is closed once it has been read
        rawRevisionsDct[name] = json.load(f)

Let's check what was stored in this dictionary:

In [None]:
names = list(rawRevisionsDct.keys())
names

And now check the values associated with one of the keys:

In [None]:
rawRevisionsDct[names[0]][0]

Notice how the value for the key 'timestamp' is a list, instead of a time or datetime object. This is because when we turn an object into JSON, some data structures have to be flattened into simpler ones, such as lists or dicts: JSON only supports a few basic types (strings, numbers, booleans, null, lists, and dicts), not complex types such as datetime.
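
The flattening can be reproduced in isolation. The sketch below assumes that the companion notebook stored `time.struct_time` values (a guess based on the list format seen above); it shows that such a value comes back from JSON as a plain list, while a `datetime` cannot be serialized at all without extra work.

```python
import json
import time
from datetime import datetime

# struct_time for the Unix epoch (1970-01-01 00:00:00 UTC)
as_struct = time.gmtime(0)

# struct_time is a tuple subclass, so json writes it as a JSON array
encoded = json.dumps(as_struct)
print(encoded)          # [1970, 1, 1, 0, 0, 0, 3, 1, 0]

# reading it back gives a plain Python list, not a struct_time
decoded = json.loads(encoded)
print(type(decoded))    # <class 'list'>

# a datetime object, by contrast, is rejected by the default encoder
try:
    json.dumps(datetime(1970, 1, 1))
    datetime_ok = True
except TypeError as err:
    datetime_ok = False
    print("datetime is not JSON serializable:", err)
```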

Our goal in the following is to convert the list of year, month, day, hour, etc. values into a datetime object. We will first explicitly convert the list into a tuple; otherwise the function `mktime` below will complain about the passed argument. 

In [None]:
from time import mktime
from datetime import datetime

def convert_timestamp(ts):
    """Convert the timestamp into a datetime object.
    """
    return datetime.fromtimestamp(mktime(tuple(ts))) # mktime expects a tuple

Let's test this function on one timestamp:

In [None]:
ts = rawRevisionsDct[names[0]][0]['timestamp']
convert_timestamp(ts)
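
One caveat: `mktime` interprets its argument as local time, so the resulting datetimes depend on the time zone of the machine running the notebook. If the stored timestamps are in UTC (an assumption here; the MediaWiki API reports UTC, but the companion notebook determines what was actually saved), a time-zone-independent conversion could look like the sketch below. The function name `convert_timestamp_utc` and the sample list are hypothetical.

```python
import calendar
from datetime import datetime, timezone

def convert_timestamp_utc(ts):
    """Convert a struct_time-like list, assumed to be in UTC, into an aware datetime."""
    # calendar.timegm is the UTC counterpart of time.mktime
    seconds = calendar.timegm(tuple(ts))
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# hypothetical timestamp: 2022-12-01 15:30:00 (weekday, day of year, DST flag follow)
sample = [2022, 12, 1, 15, 30, 0, 3, 335, 0]
converted = convert_timestamp_utc(sample)
print(converted)
```

For monthly counts the difference only matters for revisions made within a few hours of a month boundary, so the simpler `mktime` version is usually good enough here.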

Since the only thing we care about from this data at the moment is the timestamps, we will create a simple dictionary of lists to store the converted timestamps of the revisions for each artist.

In [None]:
# new dictionary to only store the timestamps
timeSeriesDct = {}

for name in rawRevisionsDct:
    revLst = rawRevisionsDct[name]
    timeSeriesDct[name] = [convert_timestamp(rev['timestamp']) for rev in revLst]
    
    
# let's test it
for name in timeSeriesDct:
    print(name, len(timeSeriesDct[name]))

As we can see, each list associated with a key has a different number of timestamps. And we can check to make sure that they are datetime objects:

In [None]:
timeSeriesDct['Alicia Keys'][:5]

Revisions on a page can happen at any time. Often there is only one revision once in a while, and at other times there are dozens of revisions in a single day. Ideally, we want the revisions to be shown at more regular intervals, such as the total revisions in a month.

We can do that if we work with time series objects. But, given that we have multiple artists, it will be good to create a dataframe with three columns: timestamp, count, and artist. We will set the count to 1, since each revision is one event on its own. 

Let's try it out for one artist and then we can package the code into a function.

In [None]:
import pandas as pd

alicia = timeSeriesDct['Alicia Keys']

triplets = [(stamp, 1, 'Alicia Keys') for stamp in alicia]
df = pd.DataFrame(triplets, columns=['Timestamp', 'Count', 'Artist'])
df.head()

Now we can turn this into a timeseries by setting the Timestamp column to be the index of the dataframe:

In [None]:
df.set_index('Timestamp', inplace=True)
df.head()

Now we can use the method `resample` to find the counts by month (using the alias 'ME' for month-end, which requires pandas 2.2 or newer; older versions use 'M'):

In [None]:
df.resample('ME')['Count'].sum()

A few things to notice here:
- the count values have been summed,
- the timestamps have been ordered from the earliest to the most recent.

It also looks like months that were not in the data were included with the count 0. Let's make sure that this is the case, by asking for 12 months:

In [None]:
df.resample('ME')['Count'].sum()[:12]

This shows an entire year, from June 2002 to May 2003, with no missing months, even though several months had no timestamps. This is quite good for our purposes, because we don't have to worry about missing indices within the date range.
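
The zero-filling behavior is easy to confirm on a toy series. The sketch below uses three made-up dates with a two-month gap between them. It uses the month-start alias 'MS' instead of 'ME' only because 'MS' works across pandas versions; the labels fall on the first rather than the last day of each month, but the counts are the same.

```python
import pandas as pd

# three hypothetical revisions: two in June 2002, none in July or August, one in September
stamps = pd.to_datetime(["2002-06-10", "2002-06-25", "2002-09-03"])
toy = pd.DataFrame({"Count": 1}, index=stamps)

# one row per calendar month between the first and last timestamp
monthly = toy.resample("MS")["Count"].sum()
print(monthly)
```

The result has four rows (June through September), with a count of 0 for the two empty months.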

There is one issue with the result: it's shown as a series, while we would like to operate with a dataframe. But we can easily fix it with the following code:

In [None]:
df.groupby('Artist').resample('ME')['Count'].sum().reset_index()

This is a perfect tidy table. Each row is an observation of how many revisions were performed during a particular month on one artist's Wikipedia page. Now that we know how to do this for one artist, we can do it for all artists and then further massage the data into the shape we want for the visualization.

<a id="sec2"></a>
## 2. Data Massaging

Our ultimate goal is to produce a visualization similar to this one from Plotly's website tutorials:

In [None]:
import plotly.express as px

df = px.data.stocks(indexed=True)-1
fig = px.area(df, facet_col="company", facet_col_wrap=2)
fig.show()

When we find such visualizations, it is always good to see what the dataframe looks like, so that we know what we need to aim for:

In [None]:
df.head()

What we notice here is that this table is not a tidy table; instead, it is a pivot table in which each column is the name of a company. 
Eventually, we will also need to transform our dataframe into this format, from a tidy long table to a wide table.
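
To make the difference between the two shapes concrete, here is a small sketch that goes in the opposite direction, from wide to long, using `melt`. The numbers are made up and only mimic the layout of the stocks table; they are not taken from the Plotly dataset.

```python
import pandas as pd

# toy wide table: one row per date, one column per company
wide = pd.DataFrame(
    {"GOOG": [1.00, 1.02], "AAPL": [1.00, 0.98]},
    index=pd.to_datetime(["2018-01-01", "2018-01-08"]),
)
wide.index.name = "date"
wide.columns.name = "company"

# tidy long table: one row per (date, company) observation
long_df = wide.reset_index().melt(
    id_vars="date", var_name="company", value_name="value"
)
print(long_df)
```

The wide table has 2 rows and 2 value columns; the long table holds the same 4 values in 4 rows, with the company name moved into its own column.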

**Function to create each dataframe**

We will take the code we wrote before and package it into a function, since we have multiple artists.

In [None]:
import pandas as pd

def createDF(timestamps, name):
    """Create a dataframe from the timestamps.
    """ 
    triplets = [(stamp, 1, name) for stamp in timestamps]
    df = pd.DataFrame(triplets, 
                      columns=['Timestamp', 'Count', 'Artist'])
    df.set_index('Timestamp', inplace=True)
    dfCounts = df.groupby('Artist').resample('ME')['Count'].sum().reset_index()
    return dfCounts

Let's call the function with each artist and its associated timestamps list:

In [None]:
dataframes = []
for artist in timeSeriesDct:
    dataframes.append(createDF(timeSeriesDct[artist], artist))

for df in dataframes:
    print(df.head())
    print()

One thing we can notice from the printed dataframes is that they start at different years, which is to be expected since these artists are of different ages. We will sort them from the earliest to the latest, before we combine them together, so that they are ordered when we create the plot. We can find the earliest timestamp with this line:

In [None]:
dataframes[0]['Timestamp'].min()

Now we can use this value as our key for sorting by time:

In [None]:
dataframes.sort(key=lambda item: item['Timestamp'].min())

# Check that the dataframes were sorted
for df in dataframes:
    print(df.head())
    print()

Now we will simply concatenate all of them together:

In [None]:
dfMerged = pd.concat(dataframes)
dfMerged.shape

It's clear that this dataframe is big, as a result of the concatenation. Let's see what it looks like:

In [None]:
dfMerged

Notice that the concatenation has kept the original index values, so we need to reset the index and make sure to drop the old one:

In [None]:
dfMerged.reset_index(drop=True, inplace=True)
dfMerged
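
As an alternative, `pd.concat` can renumber the rows itself through its `ignore_index` parameter, which avoids the separate reset step. A minimal sketch with two toy dataframes (the artist names "A" and "B" are placeholders):

```python
import pandas as pd

first = pd.DataFrame({"Artist": ["A", "A"], "Count": [1, 2]})
second = pd.DataFrame({"Artist": ["B", "B"], "Count": [3, 4]})

# default behavior: each dataframe keeps its own index values
kept = pd.concat([first, second])
print(kept.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```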

Because our table is a nice tidy long table, we can easily turn it into a pivot table:

In [None]:
finalDF = dfMerged.pivot(index='Timestamp', 
                         columns='Artist', 
                         values='Count').fillna(0)
finalDF.head()
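
The `fillna(0)` call matters because artists whose revision history starts later have no rows for the earlier months, and `pivot` leaves those cells as NaN. A toy sketch with two placeholder artists, where "B" has no row for June:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2002-06-30", "2002-07-31", "2002-07-31"]),
    "Artist": ["A", "A", "B"],
    "Count": [3, 0, 5],
})

# B has no observation for June, so that cell is NaN after pivoting
wide = long_df.pivot(index="Timestamp", columns="Artist", values="Count")
print(wide)

# treat "no row" as "no revisions"
wide_filled = wide.fillna(0)
print(wide_filled)
```

Filling with 0 is a modeling choice: it treats months before a page's first revision the same as months with no revisions.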

This looks good, but the transformation has changed the order of the columns to be alphabetical. Meanwhile, we want to start with the oldest time series, as we had them in `dfMerged`. 

In [None]:
# find the order of the columns
dfMerged['Artist'].unique()

Now use the method `reindex`:

In [None]:
finalDF = finalDF.reindex(columns=list(dfMerged['Artist'].unique()))
finalDF.tail()

This is the order we want!
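
This works because `unique` returns values in order of first appearance rather than sorted, so it preserves the order in which the dataframes were concatenated. A small sketch with placeholder names:

```python
import pandas as pd

# artist column as it might look after concatenation (placeholder names)
artists = pd.Series(["B", "B", "A", "C", "A"])
order = list(artists.unique())
print(order)   # ['B', 'A', 'C']

# a wide table whose columns are in alphabetical order
wide = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
reordered = wide.reindex(columns=order)
print(list(reordered.columns))   # ['B', 'A', 'C']
```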

<a id='sec3'></a>
## 3. Visualizing the time series

Let's look again at the dataframe of the example we are trying to emulate:

In [None]:
import plotly.express as px

df = px.data.stocks(indexed=True)-1
df.head()

It seems like our dataframe looks quite similar to this example, so we can try to visualize it:

In [None]:
fig = px.area(finalDF, facet_col="Artist", facet_col_wrap=2)
fig.show()

Or we can show them in one single column, so it's easier to compare them:

In [None]:
fig = px.area(finalDF, facet_col="Artist", facet_col_wrap=1)
fig.show()

It is possible to decorate this graph: provide titles for the axes and the whole plot, increase the height, reduce the width, set the subplot y-axis labels to an empty string (since they are repeated), etc. That makes sense to do if we end up using the graph for communication purposes.