# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussion on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

In [1]:
import pandas as pd
import urllib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
from string import punctuation
from sklearn.model_selection import cross_validate

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
import numpy as np

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

In [6]:
from sklearn.neural_network import MLPClassifier

In [7]:
from sklearn.naive_bayes import MultinomialNB

In [8]:
from sklearn.metrics import precision_recall_fscore_support as prf

In [9]:
from sklearn.naive_bayes import GaussianNB

In [10]:
from sklearn.svm import LinearSVC

In [11]:
from sklearn.svm import SVC

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
from sklearn.model_selection import cross_val_predict


Please include .tsv files in local folder

In [14]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [15]:
len(annotations['rev_id'].unique())

115864

In [16]:
# labels a comment as an atack if the majority of annoatators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [17]:
# join labels and comments
comments['attack'] = labels

#### @Data Cleaning

This is done by adding all unwanted words in punct set and then iterating over data to remove all the words, by replacing them with space.

Then all the unwanted space is merged back to 1 space character.


I tried to remove stop words as well from the data but it did not make much difference.


In [18]:
punct = set(punctuation)
# print ('.' in punct)
punct.remove(".")
# print (punct)
punct = str(punct)


words = ["NEWLINE_TOKEN", "TAB_TOKEN"]

In [19]:
comments['comment'] = comments['comment'].apply(lambda line : ''.join([line.replace(w, ' ') for w in words]))

In [20]:
comments['comment'] = comments['comment'].apply(lambda line : ''.join([' ' if c in punct else c for c in line]))

In [21]:
comments['comment'] = comments['comment'].apply(lambda line : ' '.join(line.split()))

##### @Features
Dropped column year and split from final data

In [22]:
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")
train_comments.drop(columns=['year', 'split'])
comments.drop(columns=['year', 'split'])

Unnamed: 0_level_0,comment,logged_in,ns,sample,attack
rev_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
37675,This is not creative . Those are the dictionar...,False,article,random,False
44816,the term standard model is itself less NPOV th...,False,article,random,False
49851,True or false the situation as of March 2002 w...,False,article,random,False
89320,Next maybe you could work on being less condes...,True,article,random,False
93890,This page will need disambiguation. This page ...,True,article,random,False
102817,Important note for all sysops There is a bug i...,True,user,random,False
103624,I removed the following All names of early Pol...,True,article,random,False
111032,If you ever claimed in a Judaic studies progra...,True,article,random,False
120283,My apologies I m English I watch cricket I kno...,True,article,random,False
128532,Someone wrote More recognizable perhaps is a t...,True,article,random,False


# 1. Logistic Regression (Best Model)

This model provided the best ROC_AUC and F1 scores.

By tuning Countectorizer, TfidfTransformer and LogisticRegression parameters F1 score increased from around 68.5 to 73.5%.

Adding class_weight = "balanced" to LogisticRegression had the most effect on scores.

In [23]:
clf_logs_reg = Pipeline([
    ('vect', CountVectorizer(min_df=5, max_features = 50000, analyzer='word')),
    ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('clf', LogisticRegression(class_weight = "balanced")),
])

In [24]:
clf_logs_reg = clf_logs_reg.fit(train_comments['comment'], train_comments['attack'])

In [25]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_logs_reg.predict(test_comments['comment'])).ravel()

In [26]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 2295
True Negative= 19232
False Positive= 1190
False Negative= 461


##### Confusion Matrix
    True Positive= 2295
    True Negative= 19232
    False Positive= 1190
    False Negative= 461

##### Train-Test Validation

In [27]:
auc = roc_auc_score(test_comments['attack'], clf_logs_reg.predict_proba(test_comments['comment'])[:, 1])
precesion, recall, f1, _ = prf(test_comments['attack'], clf_logs_reg.predict(test_comments['comment']), average='binary')
print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: %.3f" %(precesion*100))
print("Recall: %.3f" %(recall*100))
print("F1: %.3f" %(f1*100))

Test ROC AUC: 96.191
Precesion: 65.854
Recall: 83.273
F1: 73.546


##### Accuracy
    Test ROC AUC: 96.191
    Precesion:  65.85365853658537
    Recall:  83.27285921625544
    F1:  73.5459061047909

##### K-Fold Validation

In [113]:
scores = cross_val_score(clf_logs_reg, comments['comment'], comments['attack'], cv=10, scoring='accuracy')

In [114]:
print(scores)

[0.92759127 0.91464572 0.92707344 0.9187883  0.91765924 0.92266529
 0.92784395 0.93017435 0.92922493 0.92525462]


###### Score for each fold 
    [0.92759127 0.91464572 0.92707344 0.9187883  0.91765924 0.92266529
     0.92784395 0.93017435 0.92922493 0.92525462]

In [115]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.92 (+/- 0.01)


###### Accuracy: 0.92 (+/- 0.01)

# 2. Multinomial Naive Bayes

In [27]:
clf_multi_nb = Pipeline([
    ('vect1', CountVectorizer(min_df=5, max_features = 50000, analyzer='word')),
    ('tfidf2', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('multi_nb', MultinomialNB(fit_prior=False)),
])

In [28]:
clf_multi_nb = clf_multi_nb.fit(train_comments['comment'], train_comments['attack'])

In [56]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_multi_nb.predict(test_comments['comment'])).ravel()

In [57]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 2020
True Negative= 19048
False Positive= 1374
False Negative= 736


##### Confusion Matrix
    True Positive= 2020
    True Negative= 19048
    False Positive= 1374
    False Negative= 736

In [29]:
auc = roc_auc_score(test_comments['attack'], clf_multi_nb.predict_proba(test_comments['comment'])[:, 1])
precesion, recall, f1, _ = prf(test_comments['attack'], clf_multi_nb.predict(test_comments['comment']), average='binary')

print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: %.3f" %(precesion*100))
print("Recall: %.3f" %(recall*100))
print("F1: %.3f" %(f1*100))

Test ROC AUC: 91.425
Precesion:  59.51679434295816
Recall:  73.29462989840349
F1:  65.6910569105691


##### Accuracy
    Test ROC AUC: 91.425
    Precesion:  59.51679434295816
    Recall:  73.29462989840349
    F1:  65.6910569105691


# 3. Random Forest Classifier

In [42]:
clf_rand_for = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, analyzer='word')),
    ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('rand_for', RandomForestClassifier(n_estimators=500)),
])

In [43]:
clf_rand_for = clf_rand_for.fit(train_comments['comment'], train_comments['attack'])

In [58]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_rand_for.predict(test_comments['comment'])).ravel()

In [59]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 1339
True Negative= 20329
False Positive= 93
False Negative= 1417


##### Confusion Matrix
    True Positive= 1339
    True Negative= 20329
    False Positive= 93
    False Negative= 1417

In [44]:
auc = roc_auc_score(test_comments['attack'], clf_rand_for.predict_proba(test_comments['comment'])[:, 1])
precesion, recall, f1, _ = prf(test_comments['attack'], clf_rand_for.predict(test_comments['comment']), average='binary')
# type(prfscore)
print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: %.3f" %(precesion*100))
print("Recall: %.3f" %(recall*100))
print("F1: %.3f" %(f1*100))

Test ROC AUC: 95.482
Precesion: 93.506
Recall: 48.585
F1: 63.945


##### Accuracy
    Test ROC AUC: 95.482
    Precesion: 93.506
    Recall: 48.585
    F1: 63.945


# 4. SVC

In [31]:
clf_SVC = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, analyzer='word', ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('rand_for', SVC(kernel='linear')),
])

In [32]:
clf_SVC = clf_SVC.fit(train_comments['comment'], train_comments['attack'])

In [61]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_SVC.predict(test_comments['comment'])).ravel()

In [62]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 1674
True Negative= 20235
False Positive= 187
False Negative= 1082


##### Confusion Matirx
    True Positive= 1674
    True Negative= 20235
    False Positive= 187
    False Negative= 1082

In [33]:
auc = roc_auc_score(test_comments['attack'], clf_SVC.predict(test_comments['comment']))
precesion, recall, f1, _ = prf(test_comments['attack'], clf_SVC.predict(test_comments['comment']), average='binary')
# type(prfscore)
print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: ", (precesion*100))
print("Recall: ", (recall*100))
print("F1: ", (f1*100))

Test ROC AUC: 79.912
Precesion:  89.95163890381515
Recall:  60.74020319303338
F1:  72.51461988304094


##### Accuracy
    Test ROC AUC: 79.912 vs 88.723(LogisticRegression)
    Precesion:  89.95163890381515
    Recall:  60.74020319303338
    F1:  72.51461988304094

# 5. Linear SVC

In [34]:
clf_SVC2 = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, analyzer='word')),
    ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('rand_for', LinearSVC()),
])

In [35]:
clf_SVC2 = clf_SVC2.fit(train_comments['comment'], train_comments['attack'])

In [63]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_SVC2.predict(test_comments['comment'])).ravel()

In [64]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 1751
True Negative= 20131
False Positive= 291
False Negative= 1005


##### Confusion Matrix
    True Positive= 1751
    True Negative= 20131
    False Positive= 291
    False Negative= 1005

In [38]:
auc = roc_auc_score(test_comments['attack'], clf_SVC2.predict(test_comments['comment']))
precesion, recall, f1, _ = prf(test_comments['attack'], clf_SVC2.predict(test_comments['comment']), average='binary')

print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: %.3f" %(precesion*100))
print("Recall: %.3f" %(recall*100))
print("F1: %.3f" %(f1*100))

Test ROC AUC: 81.055
Precesion: 85.749
Recall: 63.534
F1: 72.989


##### Accuracy

    Test ROC AUC: 81.055  vs 88.723(LogisticRegression)
    Precesion: 85.749
    Recall: 63.534
    F1: 72.989

# 6. Multi-layer Perceptron

In [67]:
clf_MLP = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, analyzer='word', ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2',sublinear_tf = True)),
    ('rand_for', MLPClassifier()),
])

In [68]:
clf_MLP = clf_MLP.fit(train_comments['comment'], train_comments['attack'])

In [69]:
tn, fp, fn, tp = confusion_matrix(test_comments['attack'], clf_MLP.predict(test_comments['comment'])).ravel()

In [70]:
print ("True Positive=", tp)
print ("True Negative=", tn)
print ("False Positive=", fp)
print ("False Negative=", fn)

True Positive= 1789
True Negative= 19771
False Positive= 651
False Negative= 967


##### Confusion Matrix
    True Positive= 1789
    True Negative= 19771
    False Positive= 651
    False Negative= 967

In [None]:
auc = roc_auc_score(test_comments['attack'], clf_MLP.predict_proba(test_comments['comment'])[:, 1])
precesion, recall, f1, _ = prf(test_comments['attack'], clf_MLP.predict(test_comments['comment']), average='binary')

print('Test ROC AUC: %.3f' %(auc*100))
print("Precesion: ", (precesion*100))
print("Recall: ", (recall*100))
print("F1: ", (f1*100))

##### Output
    Test ROC AUC: 91.655
    Precesion:  71.08533554266778
    Recall:  62.264150943396224
    F1:  66.38297872340425

In [77]:
type(clf_logs_reg.predict(test_comments['comment']))

numpy.ndarray

In [79]:
# correctly classify nice comment
clf_logs_reg.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [None]:
print (np.any(a == True))

In [80]:
# correctly classify nasty comment
clf_logs_reg.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])