# Research Questions
This study addresses three research questions, listed below:
1. Can we treat Wikipedia as a stable source of information?
1. Is the published material on Wikipedia accountable?
1. What are the most frequent words in edited titles?

# Setup and Initial Exploration of the Dataset

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import nltk

In [2]:
%%javascript
// We disable auto-scrolling in our notebook.
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Let's first make a working copy of our input file:

In [3]:
!cp -rf ../input/edits.csv edits.csv

We must replace all Geo IP information (represented as embedded dictionaries) with null so that Pandas can process the CSV file properly. The nested structures contain commas of their own, so Pandas would otherwise see an unequal number of fields per row. The easiest way is to rely on `sed`.

In [4]:
!sed -Ei 's/\{.+\}/null/' edits.csv
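
To see what this substitution does to a single row, here is a minimal sketch of the same regular expression in Python. The sample line and all of its field values are hypothetical; only the column order follows the names used later in this notebook.

```python
import re

# Hypothetical raw line; the values are made up for illustration.
raw_line = ('edit,120,{"city": "Paris", "country": "France"},'
            'True,False,False,Example,https://example.org/diff,SomeUser')

# Same pattern as the sed call: replace the first {...} span with null.
cleaned_line = re.sub(r"\{.+\}", "null", raw_line, count=1)

print(cleaned_line)
# The comma inside the dictionary is gone, so the separator count drops by one.
print(raw_line.count(","), "->", cleaned_line.count(","))
```

Note that `.+` is greedy, so the match runs from the first `{` to the last `}` on the line.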

Now, we must wrap every title in double quotation marks, as some titles contain commas.

In [5]:
!sed -Ei 's/([^,]+,)([^,]+,)([^,]+,)([^,]+,)([^,]+,)([^,]+,)(.+),http/\1\2\3\4\5\6"\7",http/' edits.csv
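
The effect of the quoting step can be checked on one row. The sketch below applies the same pattern with Python's `re` module and parses the result with the standard `csv` module. The sample line is hypothetical.

```python
import csv
import re

# Hypothetical line whose title ("Paris, Texas") contains a comma.
line = 'edit,120,null,True,False,False,Paris, Texas,https://example.org/diff,SomeUser'

# Six comma-terminated fields, then the title, then the URL starting with http.
pattern = r'([^,]+,)([^,]+,)([^,]+,)([^,]+,)([^,]+,)([^,]+,)(.+),http'
quoted_line = re.sub(pattern, r'\1\2\3\4\5\6"\7",http', line, count=1)

fields_before = next(csv.reader([line]))
fields_after = next(csv.reader([quoted_line]))

print(len(fields_before), "fields before quoting")
print(len(fields_after), "fields after quoting; title =", fields_after[6])
```

The pattern relies on the URL being the first field after the title that starts with `http`, which is an assumption about the data.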

We will read in the dataset, show some statistics, and drop unused columns.

In [6]:
all_edits = pd.read_csv('edits.csv',
                        header = None,
                        names=["Action", "Size", "Geo IP", "Is Anonymous?", 
                               "Is Bot?", "Is Minor?", "Title", "URL", "User"])
all_edits.shape

In [7]:
all_edits.dtypes

In [8]:
all_edits.head(10)

In [9]:
all_edits.drop(all_edits.columns[[0, 2, 8]], axis=1, inplace=True)
all_edits.head()

In [10]:
all_edits["URL"].nunique()

The count of unique URLs shows that two edits share the same URL. This single duplicate is negligible, so we can now drop the URL column, too.

In [11]:
all_edits.drop("URL", axis=1, inplace=True)
all_edits.head()

Finally, we must convert sizes to positive values and introduce an extra boolean column for the edit type. A size of zero means that an equal number of characters were deleted and added during an edit, which does not necessarily imply a small update.

In [12]:
all_edits["Is Deletion?"] = all_edits["Size"] < 0
all_edits["Size"] = abs(all_edits["Size"])
all_edits.head()
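
The order of these two statements matters: the sign must be captured before it is discarded. A minimal sketch on a hypothetical toy frame (the values are made up):

```python
import pandas as pd

# Hypothetical signed sizes: positive = growth, negative = shrinkage.
toy = pd.DataFrame({"Size": [120, -45, 0, -3]})

# Record the sign first, then keep only the magnitude.
toy["Is Deletion?"] = toy["Size"] < 0
toy["Size"] = toy["Size"].abs()

print(toy)
```

A zero-sized edit ends up flagged as a non-deletion, since the comparison is strict.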

Let's look at the summary statistics of the sizes, and count how many of them are zero.

In [13]:
all_edits["Size"].describe()

In [14]:
sum(all_edits["Size"] == 0)

# Can we Treat Wikipedia as a Reliable Source of Information?

To answer our first and second research questions we will produce stacked histograms. Nonetheless, our first step is to look at the relationship (pattern) between the sizes of changes and whether they are minor or major. It makes little sense to compute a correlation coefficient here, since the Y axis contains only two values.

In [15]:
fig, axis = plt.subplots(figsize=(12, 7))

axis.xaxis.grid(True)
axis.set_title("Are size differences related to importance?", fontsize=13)
axis.set_xlabel("Size of Change (in number of characters)", fontsize=10)
axis.set_ylabel("Is Minor?", fontsize=10)
axis.set_yticks([0, 1])
axis.set_yticklabels(["False", "True"])

axis.scatter(all_edits["Size"], all_edits["Is Minor?"])
plt.show()

We may notice that all bigger changes (those whose size is above 2500 characters) are major ones. In other words, a large difference in the content's size is regarded as an important edit.
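
A visual reading like this can also be checked numerically. The following is a minimal sketch on a hypothetical toy frame with the same column names; the values and the `threshold` variable are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; sizes and flags are made up.
toy = pd.DataFrame({
    "Size": [12, 40, 2600, 3100, 5, 900],
    "Is Minor?": [True, True, False, False, True, False],
})

threshold = 2500  # cut-off read from the scatter plot

# Largest change that was still flagged as minor.
largest_minor = toy.loc[toy["Is Minor?"], "Size"].max()
# Number of minor edits above the cut-off.
minor_above = int(((toy["Size"] > threshold) & toy["Is Minor?"]).sum())

print("Largest minor edit:", largest_minor)
print("Minor edits above threshold:", minor_above)
```

Running the same two expressions on `all_edits` would confirm or refute the observation for the real data.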

In [16]:
fig, axis = plt.subplots(figsize=(20, 15), nrows=2, ncols=2)
ax0, ax1, ax2, ax3 = axis.flatten()
colors = ['red', 'green']

# Contains common logic for setting up the subplots.
def setup_hist(ax, labels, data):
    ax.set_title("Distribution of {} vs. {} edits".format(labels[0], labels[1]), fontsize=13)
    ax.set_xlabel("Size of Change (in number of characters)", fontsize=11)
    ax.set_ylabel("Number of edits (log. scale)", fontsize=11)
    # Matplotlib 3.3 renamed the `nonposy` keyword to `nonpositive`.
    ax.set_yscale("log", nonpositive='clip')
    ax.hist(data, 15, histtype='bar', density=False, color=colors, label=labels, stacked=True)
    ax.legend(prop={'size': 15})

# Contains logic to retrieve pertinent data for the next plot.
def filter_edits(column):
    data_mask = all_edits[column]
    primary_edits = all_edits[data_mask]["Size"]
    complementary_edits = all_edits[np.logical_not(data_mask)]["Size"]
    return (primary_edits, complementary_edits)

setup_hist(ax0, ["Anonymous", "Registered"], filter_edits("Is Anonymous?"))
ax0.annotate(">2500 only major edits are present,\nas depicted on the previous scatter plot.",
             xy=(2800, 3), xycoords='data',
             xytext=(2800, 30), textcoords='data',
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3"),
             fontsize=13
            )

setup_hist(ax1, ["Bot", "Human"], filter_edits("Is Bot?"))
setup_hist(ax2, ["Minor", "Major"], filter_edits("Is Minor?"))
setup_hist(ax3, ["Reduction in Size", "Non-reduction in Size"], filter_edits("Is Deletion?"))

fig.tight_layout()
plt.show()

We may observe the following patterns in these diagrams:
- Updates entailing larger differences in size are made by registered users. This suggests that anonymous edits are mostly about correcting smaller issues. It also points to an accountability mechanism in Wikipedia, where the bulk of the new material can be traced back to an author.
- No edits in this dataset were made by bots, which boosts our confidence in Wikipedia's content.
- As we have noticed before, alterations resulting in larger differences in size are considered major changes.
- Larger edits are mostly about adding content. This suggests that material on Wikipedia is mostly stable, and that deletions are probably for smaller changes.
- The most frequent updates are minor fixes.

**The size attribute only records the relative difference in size. We will assume that a small size indicates a small change!**

# List of the Most Frequent Words in Edited Titles

`nltk` has a sophisticated `word_tokenize` function to properly tokenize the titles:

In [18]:
# Join with a space so that adjacent titles are not glued into one token.
titles = ' '.join(all_edits["Title"])
tokenized_titles = nltk.word_tokenize(titles)
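
The separator used when joining the titles affects the resulting tokens: with an empty separator, the last word of one title fuses with the first word of the next. A minimal sketch with two hypothetical titles, using `str.split` as a simple stand-in for `nltk.word_tokenize`:

```python
# Hypothetical titles, for illustration only.
sample_titles = ["Star Wars", "World Cup"]

# Empty separator: "Wars" and "World" fuse into one bogus token.
glued_tokens = "".join(sample_titles).split()
# Space separator: every word survives as its own token.
spaced_tokens = " ".join(sample_titles).split()

print(glued_tokens)
print(spaced_tokens)
```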

The tokenized titles are full of punctuation, and they also include words that are useless for counting purposes, like "of" or "that". Those words are called *stopwords*, and `nltk` has a convenient set that we can use:

In [19]:
import string
print("Punctuations as defined in Python:", string.punctuation)

In [20]:
# Strip punctuation characters from every token.
punctuation_table = str.maketrans('', '', string.punctuation)
filtered_words = [word.translate(punctuation_table) for word in tokenized_titles]
# Lowercase the tokens and drop those left empty after stripping.
filtered_words = [word.lower() for word in filtered_words if word]
# Eliminate stopwords (requires the NLTK "stopwords" corpus to be available).
stopwords = set(nltk.corpus.stopwords.words("english"))
filtered_words = [word for word in filtered_words if word not in stopwords]
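
The cleaning steps can be traced on a handful of tokens. The sketch below uses only the standard library; the token list and the tiny `demo_stopwords` set are hypothetical stand-ins for the tokenizer output and the NLTK stopword list.

```python
import string

# Hypothetical tokens, similar in shape to tokenizer output.
tokens = ["The", "Lord", "of", "the", "Rings", ":", "Spider-Man", "(", "2002", ")"]
# Hypothetical stand-in for the NLTK English stopword set.
demo_stopwords = {"the", "of"}

# Delete every punctuation character, then lowercase.
strip_punct = str.maketrans("", "", string.punctuation)
cleaned = [token.translate(strip_punct).lower() for token in tokens]

# Drop tokens that became empty, and drop stopwords.
kept = [word for word in cleaned if word and word not in demo_stopwords]

print(kept)
```

One side effect is visible here: punctuation inside a word is removed as well, so a hyphenated name collapses into a single word.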

The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of words in our list:

In [21]:
from collections import Counter

word_counter = Counter(filtered_words)

It also has a `most_common()` method to access the words with the highest counts:

In [22]:
most_common_words = word_counter.most_common(30)
most_common_words
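
For reference, this is how `Counter` and `most_common` behave on a small list. The words are hypothetical and unrelated to the dataset.

```python
from collections import Counter

# Hypothetical cleaned words.
sample_words = ["film", "season", "film", "cup", "season", "film"]

sample_counter = Counter(sample_words)
# Passing n returns only the n entries with the highest counts.
top_two = sample_counter.most_common(2)

print(top_two)
```

The result is a list of `(word, count)` pairs sorted by descending count.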

We will now produce a word cloud using the open-source Python library [wordcloud](https://github.com/amueller/word_cloud):

In [23]:
from wordcloud import WordCloud

In [24]:
wordcloud = WordCloud(max_font_size=40).generate(' '.join(filtered_words))
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Based upon this sample, we can see terms related to movies, sport, music, etc. Scientific and advanced technical concepts are rare, which may suggest that this sort of material is more stable (it requires a higher level of expertise to edit).