# Graphing Recent Changes on a Wiki #

### Table of Contents ###

* [Introduction](#intro)
* [Method and Limitations](#method)
* [Summary Statistics](#stats)
* [Graphs](#graph)


## <a class="anchor" id="intro">Introduction</a> ##



I wanted to determine how frequently users made edits to Talk pages on [Fanlore](https://www.fanlore.org), a fan-created and fan-run encyclopedia and part of the [Organization for Transformative Works (OTW)](https://www.transformativeworks.org). I have been an OTW volunteer since 2015 in various roles; having recently become more interested in data analysis and programming, I saw an opportunity to undertake this small project.


Fanlore recruits volunteers called "Gardeners" to handle higher-level editing tasks like moving and deleting pages. Often, users will get the attention of the Gardener team by adding a special template to the [Talk page](https://fanlore.org/wiki/Help:Talk_pages) of an article, category, or file. While we were exploring ways to help the volunteers stay on top of changes to these pages and other discussions happening on the site, the question of how many of these edits there are came up. It would have been possible to set up an email notification to the Gardeners mailing list whenever a Talk page was edited, but we would not want to spam the volunteers with notifications.

## <a class="anchor" id="method">Method and Limitations</a> ##

Fanlore runs on the [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki) platform. The software includes an API, which allowed me to pull data from the site directly rather than having to parse an HTML file. I queried the RecentChanges list, specifying that I only wanted the "Talk" namespace.

RecentChanges is an [automatically-generated list](https://www.mediawiki.org/wiki/API:Lists) of every edit made to any page on the wiki. It aggregates data from the individual edit histories of each page, and therefore it is not typically used as a long-term record. Only 90 days' worth of data is kept at once, and a single request returns at most 500 items for regular users. As such, I was only able to examine edits over the previous 3 months.

I used the following libraries for this project:

In [None]:
import requests
import json
from collections import Counter, OrderedDict
import pandas
import plotly.express as px
from datetime import datetime

Using the Requests module, I specified the parameters below to retrieve data from the API. To specify the length of time I wanted to examine, I converted 90 days into seconds and subtracted that from the current Unix time, because `rcend` expects a point in time rather than a duration (a bare count of seconds would be read as a date in 1970).

(Computers are really weird about time, y'all....)

In [None]:
import time

#convert 90 days into seconds, then subtract from the current Unix time - for rcend parameter below
rangeEnd = int(time.time()) - 60*60*24*90

#dictionary w/ parameters for the API request
parameters = {
    "format": "json", 
    "action": "query", 
    "list": "recentchanges", 
    "rcnamespace": 1, #Talk pages only, excluding User Talk pages
    "rcprop": "timestamp|title|ids|user", #date of edit, title of page, version IDs, and username of editor
    "rcstart": "now", #start of the time range (newest edits first)
    "rcend": f"{rangeEnd}", #end of the time range: Unix timestamp for 90 days ago
    "rclimit": 500, #how many edits to return
}

With the parameters set, I could now use the Requests module to ask the API server for data. 
As specified above, the response is already formatted as JSON, but we need a separate library to convert it into a Python object. In this case, that object is a **dictionary** - a collection of key-value pairs.

In [None]:
#ask the Fanlore server for the data and load it with JSON
r = requests.get("https://fanlore.org/w/api.php", params=parameters)
print("Paging fanlore - response is: ", r.status_code)
scrape = json.loads(r.content)

"200" means the server accepted our request and is sending over data. 
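
A single request returns at most 500 edits, so a busy 90-day window would be cut short. The MediaWiki API signals that more results are available by including a `continue` object in the response, whose values are sent back with the next request. The following is a minimal sketch of that loop, not the code used for the figures in this post; the helper name `fetch_all_changes`, its `fetch` argument, and the `max_requests` cap are all hypothetical.

```python
def fetch_all_changes(params, fetch, max_requests=20):
    """Collect recentchanges entries across several API requests.

    `fetch` is any callable that takes a parameter dict and returns the
    decoded JSON response as a dictionary. `max_requests` is an arbitrary
    safety cap so the loop cannot run forever.
    """
    changes = []
    request_params = dict(params)
    for _ in range(max_requests):
        data = fetch(request_params)
        changes.extend(data["query"]["recentchanges"])
        if "continue" not in data:
            break  # the server has no more results for this query
        # Send the original parameters again, plus the continuation values
        request_params = {**params, **data["continue"]}
    return changes
```

With the request shown above, `fetch` could be something like `lambda p: requests.get("https://fanlore.org/w/api.php", params=p).json()`.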

Now we can loop through the dictionary we created from the JSON data, pulling out the timestamp of each edit and adding it to the `tsList` variable we created to hold the timestamps.

In [None]:
tsList = []  # Empty list to store the timestamps we will get later

for change in scrape['query']['recentchanges']:
    tsList.append(change['timestamp'])

The timestamps in our list are formatted like this: `2021-06-03T17:10:04Z`.
Because we aren't looking at the *time* edits were made, only the day, we can use another loop to go through the list, split each timestamp at the "T" character, and keep only the date part. The dates will be stored in a new list, called `tsListTrim`.

In [None]:
tsListTrim = []
for item in tsList:
    tsListTrim.append(item.split('T')[0])
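
Splitting on "T" works because every timestamp has the same shape. An alternative, sketched here, is to parse each timestamp with `datetime.strptime` and keep only the date, which also fails loudly if a value is malformed. The function name `timestamps_to_dates` and the sample values are illustrative only.

```python
from datetime import datetime


def timestamps_to_dates(timestamps):
    """Parse timestamps such as 2021-06-03T17:10:04Z into ISO date strings."""
    dates = []
    for ts in timestamps:
        # Raises ValueError if a timestamp does not match the expected format
        parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        dates.append(parsed.date().isoformat())
    return dates


sample = ["2021-06-03T17:10:04Z", "2021-06-02T08:45:30Z"]
print(timestamps_to_dates(sample))  # ['2021-06-03', '2021-06-02']
```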

## <a class="anchor" id="stats">Summary Statistics</a> ##

Before we create visualizations, let's examine some high-level summary statistics.

In [None]:
print(f"Retrieved {len(tsListTrim)} edits from the server. (Maximum is 500)")
print(f"The latest date is {tsListTrim[0]}. The earliest date is {tsListTrim[-1]}")

Now that we have a clean list of dates, we can use the `Counter` class (part of the `collections` module) to determine their frequency. This will create a new dictionary-like object that maps each date to its number of edits, which we will call `tsCounter`.

In [None]:
tsCounter = Counter(tsListTrim)

print(f'The 3 dates with the most edits are: {tsCounter.most_common(3)}')
#divide by the number of distinct dates, i.e. days that had at least one edit
print(f'The average number of edits per active day is {sum(tsCounter.values())/len(tsCounter)}')
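
An average per day can be taken over days that had at least one edit, or over every calendar day in the range, including quiet days. For the latter, the denominator can be the span between the earliest and latest date. This is a sketch, assuming the dates are ISO-formatted strings like those in `tsListTrim`; `edits_per_calendar_day` is a hypothetical helper and the sample dates are made up.

```python
from datetime import date


def edits_per_calendar_day(date_strings):
    """Average edits per day over the inclusive span of the given ISO dates."""
    days = [date.fromisoformat(d) for d in date_strings]
    span = (max(days) - min(days)).days + 1  # count both endpoints
    return len(days) / span


# Four edits spread over 2021-06-01 to 2021-06-04 (June 2 had none)
sample = ["2021-06-04", "2021-06-03", "2021-06-03", "2021-06-01"]
print(edits_per_calendar_day(sample))  # 1.0
```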

I also got the bright idea to see if certain days of the week had more edits than others. This basically required me to take the list of dates, convert each one into a `datetime` object (don't worry about it), then convert it back into text made up of the weekday number and the name of the day (without the number, the graph wouldn't sort properly).

In [None]:
#Examining the days of the week
tsListTrim_days = []
for each in tsListTrim:
    tsListTrim_days.append(datetime.strptime(each,"%Y-%m-%d"))

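#build labels like "1-Monday"; isoweekday() gives 1 (Monday) through 7 (Sunday) on every platform
days_List = []
for each in tsListTrim_days:
    days_List.append(f"{each.isoweekday()}-{each.strftime('%A')}")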

days_Listcount = Counter(days_List)

print(days_Listcount)


## <a class="anchor" id="graph">Graphs</a> ##


In [None]:
#This is our counter dictionary from before; we have to convert it into a regular dictionary to sort it properly for plotting.
graphData = dict(tsCounter)

#sort the day-of-week counts by their "number-name" label so Monday comes first
days_count_sorted = OrderedDict(sorted(days_Listcount.items(), key=lambda t: t[0]))

days_df = pandas.DataFrame(list(days_count_sorted.items()), columns=['Day', 'Edits'])


Now we are ready to start creating our graphs - one that shows the frequency of edits over the last 90 days (`freqGraph`), and another that shows how many edits were done on each day of the week (`daysGraph`).

In [None]:
#### Bar Graph ####

daysGraph = px.bar(
    days_df,
    title="Days with highest Talk page activity",
    x='Day',
    y='Edits',
    labels={"x":"Day","y":"Edits"},
    color='Day',
    color_discrete_sequence=px.colors.qualitative.Plotly,
    template="plotly",
    )

#### Line Graph ####
freqData = dict(tsCounter)
ts_x = list(freqData.keys()) #dates
ts_y = list(freqData.values()) #freq

freqGraph = px.line(
    x=ts_x,
    y=ts_y,
    title="Frequency of Talk Page edits on Fanlore.org, last 90 days",
    labels={"x" : "Date","y" : "# of Edits"})

freqGraph.update_xaxes(rangeslider_visible=True)

In [None]:
freqGraph.show()

In [None]:
daysGraph.show()

In the line graph, you can hover over the line to see individual values for each day, or use the slider below the graph to zoom in and isolate a particular range.
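
Because `tsCounter` only contains dates that had at least one edit, the line graph joins neighbouring active days directly, and days with no edits are not drawn as zero. A possible refinement, sketched below with a hypothetical helper `fill_missing_days` and made-up sample data, builds a chronological series with an explicit zero for each quiet day. Its two return values could be passed as `x` and `y` to `px.line`.

```python
from collections import Counter
from datetime import date, timedelta


def fill_missing_days(counts):
    """Return (dates, values) covering every day from the earliest to the
    latest key in `counts`, with 0 for days that have no entry."""
    # ISO date strings sort chronologically, so min/max work on the keys
    first = date.fromisoformat(min(counts))
    last = date.fromisoformat(max(counts))
    dates, values = [], []
    current = first
    while current <= last:
        key = current.isoformat()
        dates.append(key)
        values.append(counts.get(key, 0))
        current += timedelta(days=1)
    return dates, values


sample = Counter(["2021-06-03", "2021-06-03", "2021-06-01"])
x, y = fill_missing_days(sample)
print(x)  # ['2021-06-01', '2021-06-02', '2021-06-03']
print(y)  # [1, 0, 2]
```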