# Basic damage detection in Wikipedia
This notebook demonstrates the basic construction of a vandalism classification system using [revscoring](http://pythonhosted.org/revscoring/) and [editquality](https://github.com/wiki-ai/editquality). 

The basic process that we'll follow is this:

1. Gather examples of human judgement applied to Wikipedia edits.  In this case, we'll take advantage of [reverts](https://meta.wikimedia.org/wiki/Research:Revert).  
2. Train a machine learning model on the data.  
3. Test the machine learning model against some data we withheld.

And then we'll have some fun applying the model to some edits using RCStream.  The following diagram gives a good sense of the whole process.

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/09/Supervised_machine_learning_in_a_nutshell.svg/640px-Supervised_machine_learning_in_a_nutshell.svg.png" /></center>

## Part 1: Getting labeled observations
Regretfully, running SQL queries isn't something we can do directly from the notebook *yet*.  So, we'll use [Quarry](https://quarry.wmflabs.org) to generate a nice random sample of edits.  20,000 observations should do just fine.  Here's the query I want to run:

```SQL
USE enwiki_p;
SELECT rev_id 
FROM revision 
WHERE rev_timestamp BETWEEN "20141001" AND "20151001" 
ORDER BY RAND() 
LIMIT 20000;
```

See http://quarry.wmflabs.org/query/7530.  By clicking around the UI, I can see that this URL will download my tab-separated file: http://quarry.wmflabs.org/run/65415/output/0/tsv?download=true

In [1]:
# Magical ipython notebook stuff puts the result of this command into a variable
revids_f = !wget http://quarry.wmflabs.org/run/65415/output/0/tsv?download=true -qO- 

revids = [int(line) for line in revids_f[1:]]
len(revids)

20000

OK.  Now that we have a set of revisions, we need to label them.  In this case, we're going to label them as reverted/not.  We want to exclude a few different types of reverts -- e.g. when a user reverts themself or when an edit is reverted back to by someone else.  For this, we'll use the [mwreverts](https://pythonhosted.org/mwreverts) and [mwapi](https://pythonhosted.org/mwapi) libraries.  

In [None]:
import sys, traceback
import mwreverts.api
import mwapi

# We'll use the mwreverts API check.  In order to do that, we need an API session
session = mwapi.Session("https://en.wikipedia.org", 
                        user_agent="Revert detection demo <ahalfaker@wikimedia.org>")

# For each revision, find out if it was "reverted" and label it so.
rev_reverteds = []
for rev_id in revids:
    try:
        _, reverted, reverted_to = mwreverts.api.check(
            session, rev_id, radius=5,  # most reverts within 5 edits
            window=48*60*60,  # 2 days
            rvprop={'user', 'ids'})  # Some properties we'll make use of
    except RuntimeError as e:
        sys.stderr.write(str(e))
        continue
    
    if reverted is not None:
        reverted_doc = [r for r in reverted.reverteds
                        if r['revid'] == rev_id][0]

        # self-reverts
        self_revert = \
            reverted_doc['user'] == reverted.reverting['user']
        
        # revisions that are reverted back to by others
        reverted_back_to = \
            reverted_to is not None and \
            reverted_doc['user'] != \
            reverted_to.reverting['user']
        
        # If we are reverted, not by self or reverted back to by someone else, 
        # then, let's assume it was damaging.
        damaging_reverted = not (self_revert or reverted_back_to)
    else:
        damaging_reverted = False

    rev_reverteds.append((rev_id, damaging_reverted))
    sys.stderr.write("r" if damaging_reverted else ".")


...............r.......

Eeek!  This takes too long.  You get the idea.  So, I uploaded a dataset that has already been labeled here @ `../datasets/demo/enwiki.rev_reverted.20k_2015.tsv`
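
The labeling rule inside the loop above can be isolated as a small pure function, which makes it easier to check by hand.  This is a minimal sketch that takes plain user names instead of the objects `mwreverts` returns; the function name `is_damaging_revert` and its arguments are illustrative assumptions, not part of any library.

```python
def is_damaging_revert(reverted_user, reverting_user, restoring_user=None):
    """Decide whether a reverted edit should be labeled as damaging.

    reverted_user  -- user who made the edit that was reverted
    reverting_user -- user who performed the revert
    restoring_user -- user who later reverted back to the edit, or None
                      (hypothetical argument standing in for `reverted_to`)
    """
    # A user undoing their own edit is not evidence of damage.
    self_revert = reverted_user == reverting_user

    # If someone else later restored the edit, the community kept it.
    reverted_back_to = (restoring_user is not None and
                        restoring_user != reverted_user)

    return not (self_revert or reverted_back_to)


print(is_damaging_revert("Vandal", "Patroller"))
print(is_damaging_revert("Alice", "Alice"))
print(is_damaging_revert("Alice", "Bob", restoring_user="Carol"))
```

Only the first call describes an edit that we would label as damaging: it was reverted by someone else and nobody else restored it.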

In [37]:
rev_reverteds_f = !cat ../datasets/demo/enwiki.rev_reverted.20k_2015.tsv
rev_reverteds = [line.strip().split("\t") for line in rev_reverteds_f[1:]]
rev_reverteds = [(int(rev_id), reverted == "True") for rev_id, reverted in rev_reverteds]
len(rev_reverteds)

19868

OK.  It looks like we got an error when trying to extract the reverted status of ~132 edits, which is an acceptable loss.  Now just to make sure we haven't gone crazy, let's check some of the reverted edits:

* https://en.wikipedia.org/wiki/?diff=695071713 (section blanking)
* https://en.wikipedia.org/wiki/?diff=667375206 (unexplained addition of nonsense)
* https://en.wikipedia.org/wiki/?diff=670204366 (vandalism "I don't know")
* https://en.wikipedia.org/wiki/?diff=680329354 (adds non-existent category)
* https://en.wikipedia.org/wiki/?diff=668682186 (test edit -- removes punctuation)
* https://en.wikipedia.org/wiki/?diff=666882037 (adds spamlink)
* https://en.wikipedia.org/wiki/?diff=663302354 (adds nonsense special char)
* https://en.wikipedia.org/wiki/?diff=675803278 (unconstructive link changes)
* https://en.wikipedia.org/wiki/?diff=680203994 (vandalism -- "Pepe meme")
* https://en.wikipedia.org/wiki/?diff=656734057 ("JELENAS BOOTY UNDSO")

OK.  Looks like we are doing pretty well. :) 

## Part 2: Train the machine learning model on the data
Before we move on with training, it's important that we hold back some of the data for testing later.  If we train on the same data we'll test with, we risk [overfitting](https://en.wikipedia.org/wiki/Overfitting) and not noticing!

In [39]:
train_set = rev_reverteds[:15000]
test_set = rev_reverteds[15000:]
len(train_set), len(test_set)

(15000, 4868)
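
The plain slice above relies on the observations already being in random order, which should hold here if the labeled file preserves the order of the `ORDER BY RAND()` sample.  When that can't be assumed, shuffling before the split is the safer choice.  The following is a minimal sketch; the helper `split_observations`, the fixed seed and the toy data are all assumptions made for illustration.

```python
import random


def split_observations(observations, train_size, seed=0):
    """Shuffle a copy of the observations and split it into two lists."""
    shuffled = list(observations)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[:train_size], shuffled[train_size:]


# Toy (rev_id, reverted) pairs standing in for `rev_reverteds`
toy = [(rev_id, rev_id % 7 == 0) for rev_id in range(100)]
train, test = split_observations(toy, train_size=75)
print(len(train), len(test))
```

Because the two halves never share an observation, anything the model memorizes from the training half can't inflate its score on the testing half.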

OK.  In order to train the machine learning model, we'll need to give it a source of signal.  This is where "features" come into play.  A feature represents a simple numerical statistic that we can extract from our observations that we think will be *predictive* of our outcome.  Luckily, `revscoring` provides a whole suite of features that work well for damage detection.  In this case, we'll be looking at features of the edit diff.  

In [40]:
from revscoring.features import wikitext
from revscoring.languages import english

features = [
    # Catches long key mashes like kkkkkkkkkkkk
    wikitext.revision.diff.longest_repeated_char_added,  
    # Measures the size of the change in added words
    wikitext.revision.diff.words_added,  
    # Measures the size of the change in removed words
    wikitext.revision.diff.words_removed,  
    # Measures the proportional change in "badwords"
    english.badwords.revision.diff.match_prop_delta_sum, 
    # Measures the proportional change in "informals"
    english.informals.revision.diff.match_prop_delta_sum,  
    # Measures the proportional change in meaningful words
    english.stopwords.revision.diff.non_stopword_prop_delta_sum  
]
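
To make the idea of a feature concrete, here is a rough sketch of what a "longest repeated character" statistic could look like when computed over a single string.  This is not `revscoring`'s implementation (which works on the text added by a diff); the function name `longest_repeated_char` is an assumption for illustration.

```python
from itertools import groupby


def longest_repeated_char(text):
    """Length of the longest run of one character repeated back to back."""
    # groupby() yields one group per run of identical adjacent characters
    return max((sum(1 for _ in run) for _, run in groupby(text)), default=0)


print(longest_repeated_char("kkkkkkkkkkkk"))
print(longest_repeated_char("A perfectly normal sentence."))
```

A key mash produces a large value while ordinary prose rarely goes above 2, which is why a number this simple can carry signal about damage.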

Now, we'll need to turn to `revscoring`'s feature extractor to help us get feature values for each revision.

In [43]:
from revscoring.extractors import api
api_extractor = api.Extractor(session)

print(695071713, list(api_extractor.extract(695071713, features)))
print(667375206, list(api_extractor.extract(667375206, features)))

695071713 [1, 0, 10974, -1.0, -2.5476190476190474, -1477.9699604325447]
667375206 [1, 1, 1, 0.0, 0.0, 0.33333333333333337]


In [48]:
# Now for the whole set!
training_features_reverted = []
for rev_id, reverted in train_set:
    try:
        feature_values = list(api_extractor.extract(rev_id, features))
    except RuntimeError as e:
        sys.stderr.write(str(e))
        continue
    
    sys.stderr.write(".")
    training_features_reverted.append((rev_id, feature_values, reverted))
    

................................................................................................................................................................................................................................................................................................................................................................................................................................................................

KeyboardInterrupt: 

Eeek!  Again, this takes too long.  So again, I uploaded a dataset with the features already extracted @ `../datasets/demo/enwiki.features_reverted.training.20k_2015.tsv`

In [51]:
from revscoring.utilities.util import read_observations
training_features_reverted_f = !cut ../datasets/demo/enwiki.features_reverted.training.20k_2015.tsv -f2-
training_features_reverted = list(read_observations(training_features_reverted_f, features, lambda v: v=="True"))
len(training_features_reverted)

14660

Cool.  Now that we have a set of features extracted for our training set, it's time to train a model.  `revscoring` provides a set of different classifier algorithms.  From past experience, I know a [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) classifier works well, so we'll use that.  

In [64]:
from revscoring.scorer_models import GradientBoosting
is_reverted = GradientBoosting(features, version="live demo!", 
                               learning_rate=0.01, max_features="log2", 
                               n_estimators=700, max_depth=5,
                               balanced_sample_weight=True, scale=True, center=True)

is_reverted.train(training_features_reverted)

{'seconds_elapsed': 10.442254543304443}

We now have a trained model that we can play around with.  Let's try a few edits from our test set.

In [66]:
reverted_obs = [rev_id for rev_id, reverted in test_set if reverted]
non_reverted_obs = [rev_id for rev_id, reverted in test_set if not reverted]

for rev_id in reverted_obs[:10]:
    feature_values = list(api_extractor.extract(rev_id, features))
    score = is_reverted.score(feature_values)
    print(True, rev_id, score)

for rev_id in non_reverted_obs[:10]:
    feature_values = list(api_extractor.extract(rev_id, features))
    score = is_reverted.score(feature_values)
    print(False, rev_id, score)

True 699665317 {'prediction': True, 'probability': {False: 0.3653827159786929, True: 0.6346172840213071}}
True 683832871 {'prediction': False, 'probability': {False: 0.5427601696020695, True: 0.4572398303979305}}
True 653913156 {'prediction': False, 'probability': {False: 0.8731074978367457, True: 0.1268925021632543}}
True 654545786 {'prediction': False, 'probability': {False: 0.688150988890976, True: 0.31184901110902397}}
True 670608733 {'prediction': False, 'probability': {False: 0.6281418079815413, True: 0.37185819201845866}}
True 689399141 {'prediction': False, 'probability': {False: 0.6031169436522997, True: 0.39688305634770027}}
True 662365029 {'prediction': True, 'probability': {False: 0.10219941693145007, True: 0.8978005830685499}}
True 656782076 {'prediction': True, 'probability': {False: 0.3093742221434129, True: 0.6906257778565871}}
True 698954388 {'prediction': True, 'probability': {False: 0.3584112509135534, True: 0.6415887490864466}}
True 645603577 {'prediction': False, '

## Part 3: Testing the model
So, the above analysis can help give us a sense for whether the model is working or not, but it's hard to standardize between models.  So, we can apply some metrics that are specially crafted for machine learning models.  

But first, I'll need to load the pre-generated feature values.  

In [68]:
testing_features_reverted_f = !cut ../datasets/demo/enwiki.features_reverted.testing.20k_2015.tsv -f2-
testing_features_reverted = list(read_observations(testing_features_reverted_f, features, lambda v: v=="True"))
len(testing_features_reverted)

4862

These are the metrics we'll look at:

* [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) -- The proportion of correct predictions
* [Precision](https://en.wikipedia.org/wiki/Precision_and_recall) -- The proportion of correct positive predictions
* [Recall](https://en.wikipedia.org/wiki/Precision_and_recall) -- The proportion of positive examples detected
* [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) -- A curve comparing the false-positive and true-positive rates across the range of prediction probability thresholds for a model.  We'll report the area under this curve (ROC-AUC).
* Filter rate at 90% recall -- The proportion of observations that can be ignored while still catching 90% of "reverted" edits.  
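
The first three definitions can be worked through by hand on a tiny example.  This is a minimal sketch over made-up (label, prediction) pairs, not `revscoring`'s own statistics code; the helpers `confusion_counts` and `summarize` are hypothetical names.

```python
def confusion_counts(pairs):
    """Count outcomes for a list of (label, prediction) boolean pairs."""
    tp = sum(1 for label, pred in pairs if label and pred)          # caught
    fp = sum(1 for label, pred in pairs if not label and pred)      # false alarm
    fn = sum(1 for label, pred in pairs if label and not pred)      # missed
    tn = sum(1 for label, pred in pairs if not label and not pred)  # ignored
    return tp, fp, fn, tn


def summarize(pairs):
    """Compute accuracy, precision and recall from (label, prediction) pairs."""
    tp, fp, fn, tn = confusion_counts(pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up data: 3 reverted edits and 7 good edits
toy_pairs = ([(True, True)] * 2 + [(True, False)] * 1 +
             [(False, True)] * 3 + [(False, False)] * 4)
print(summarize(toy_pairs))
```

In this toy set the model is right 6 times out of 10, but only 2 of its 5 positive predictions are correct.  When reverted edits are rare, accuracy can stay high while precision stays low, so it's worth reading the metrics together.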

We'll use revscoring statistics to measure these against the test set.  

In [69]:
from revscoring.scorer_models.test_statistics import accuracy, precision, recall, roc, filter_rate_at_recall

is_reverted.test(testing_features_reverted, 
                 test_statistics=[accuracy(), precision(), recall(), roc(), filter_rate_at_recall(0.90)])

print(is_reverted.format_info())

ScikitLearnClassifier
 - type: GradientBoosting
 - params: max_leaf_nodes=null, max_features="log2", balanced_sample_weight=true, warm_start=false, scale=true, min_samples_split=2, learning_rate=0.01, random_state=null, verbose=0, center=true, n_estimators=700, presort="auto", init=null, loss="deviance", min_weight_fraction_leaf=0.0, min_samples_leaf=1, max_depth=5, subsample=1.0
 - version: live demo!
 - trained: 2016-02-22T17:11:09.186836

Accuracy: 0.809
Precision: 0.119
Recall: 0.387
ROC-AUC: 0.674
Filter rate @ 0.9 recall: threshold=0.337, filter_rate=0.195, recall=0.902
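
The last line of the output reads as follows: if we flag every edit scoring at least 0.337, we catch about 90% of the reverted edits and can skip about 19.5% of all edits.  One way such a number could be computed is sketched below.  This is an illustration of the definition, not `revscoring`'s implementation; the function name `filter_rate_for_recall` and the toy scores are assumptions.

```python
def filter_rate_for_recall(scored, min_recall):
    """Find the highest threshold that still reaches `min_recall`.

    scored -- list of (label, probability_of_true) pairs
    Returns (threshold, filter_rate, recall), or None if no threshold works.
    """
    total_positive = sum(1 for label, _ in scored if label)
    if total_positive == 0:
        raise ValueError("need at least one positive observation")

    # Try thresholds from strictest to loosest; the first hit filters the most
    for threshold in sorted({p for _, p in scored}, reverse=True):
        flagged = [label for label, p in scored if p >= threshold]
        recall = sum(1 for label in flagged if label) / total_positive
        if recall >= min_recall:
            filter_rate = 1 - len(flagged) / len(scored)
            return threshold, filter_rate, recall
    return None


# Made-up scores: 4 reverted edits among 10 observations
toy_scored = [(True, 0.9), (True, 0.8), (False, 0.7), (True, 0.6),
              (False, 0.5), (False, 0.4), (True, 0.35), (False, 0.3),
              (False, 0.2), (False, 0.1)]
print(filter_rate_for_recall(toy_scored, min_recall=0.75))
```

A higher filter rate at the same recall means less reviewing work for patrollers, which is why this metric is handy for comparing damage detection models.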


## Bonus round!  Let's listen to Wikipedia's vandalism!

So we don't have the most powerful damage detection classifier, but then again, we're only including 6 features.  Usually we run with ~60 features and get to much higher levels of fitness.  *But* this model is still useful and it should help us detect the most egregious vandalism in Wikipedia.  In order to listen to Wikipedia, we'll need to connect to [RCStream](https://wikitech.wikimedia.org/wiki/RCStream) -- the same live feed that powers [listen to Wikipedia](http://listen.hatnote.com/).

In [77]:
import socketIO_client

class WikiNamespace(socketIO_client.BaseNamespace):
    def on_change(self, change):
        if change['type'] not in ('new', 'edit'):
            return
        
        rev_id = change['revision']['new']
        feature_values = list(api_extractor.extract(rev_id, features))
        score = is_reverted.score(feature_values)
        if score['prediction']:
            print("Please review", rev_id, score)

    def on_connect(self):
        self.emit('subscribe', 'en.wikipedia.org')


socketIO = socketIO_client.SocketIO('stream.wikimedia.org', 80)
socketIO.define(WikiNamespace, '/rc')

socketIO.wait(30)



Please review 706374308 {'prediction': True, 'probability': {False: 0.4067633764284275, True: 0.5932366235715725}}
Please review 706374313 {'prediction': True, 'probability': {False: 0.49926439971761394, True: 0.5007356002823861}}
Please review 706374334 {'prediction': True, 'probability': {False: 0.49926439971761394, True: 0.5007356002823861}}
Please review 706374338 {'prediction': True, 'probability': {False: 0.06753894769918001, True: 0.93246105230082}}
Please review 706374341 {'prediction': True, 'probability': {False: 0.3759632050952484, True: 0.6240367949047516}}
Please review 706374345 {'prediction': True, 'probability': {False: 0.49832034024378247, True: 0.5016796597562175}}
Please review 706374347 {'prediction': True, 'probability': {False: 0.49913166722881364, True: 0.5008683327711864}}
Please review 706374349 {'prediction': True, 'probability': {False: 0.4112152426682467, True: 0.5887847573317533}}


KeyboardInterrupt: 
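
Several of the flagged edits above score barely over 0.5.  One possible refinement, sketched here as an assumption rather than something the demo did, is to require a higher probability before asking for review.  The helper `needs_review` and the 0.6 cutoff are hypothetical; the two score dictionaries are copied from the output above.

```python
def needs_review(score, threshold=0.6):
    """Flag an edit only when the model's "reverted" probability is high enough."""
    return score['probability'][True] >= threshold


# Two scores taken from the stream output above
scores = {
    706374313: {'prediction': True,
                'probability': {False: 0.49926439971761394,
                                True: 0.5007356002823861}},
    706374338: {'prediction': True,
                'probability': {False: 0.06753894769918001,
                                True: 0.93246105230082}},
}

for rev_id, score in scores.items():
    if needs_review(score):
        print("Please review", rev_id, score['probability'][True])
```

Raising the cutoff trades recall for precision, so the right value depends on how much reviewing time is available.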