# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in the [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussions on article pages.

##  Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

In [2]:
import pandas as pd
import numpy as np
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
pd.options.display.max_colwidth = 500

In [3]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 

def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

# Edit this cell to skip the download when the files already exist locally
download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')
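
One way to download only once is to check for the file before fetching it. The sketch below is a minimal illustration, not part of the original notebook; `download_if_missing` is a hypothetical helper name.

```python
import os
import urllib.request


def download_if_missing(url, fname):
    """Fetch url into fname unless fname already exists.

    Returns True when a download happened and False when it was skipped.
    (Hypothetical helper, shown only to illustrate the idea.)
    """
    if os.path.exists(fname):
        return False
    urllib.request.urlretrieve(url, fname)
    return True
```

Calling `download_if_missing(ANNOTATIONS_URL, 'attack_annotations.tsv')` a second time would then return `False` without touching the network.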

In [4]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [5]:
len(annotations['rev_id'].unique())

115864

In [6]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5
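
To make the majority-vote rule concrete, here is a small sketch on made-up annotations (the `toy_annotations` frame is an assumption for illustration, not real data). Note that an exact tie does not count as a majority because the comparison is strict.

```python
import pandas as pd

# Three hypothetical comments with 3, 3 and 2 annotator votes each.
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 2, 3, 3],
    'attack': [1, 1, 0, 0, 0, 1, 1, 0],
})

# Fraction of annotators who marked each comment as an attack.
attack_fraction = toy_annotations.groupby('rev_id')['attack'].mean()

# Strict majority: rev_id 1 (2/3) is an attack, rev_id 2 (1/3) and rev_id 3 (1/2) are not.
toy_labels = attack_fraction > 0.5
print(toy_labels)
```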

In [221]:
# join labels and comments
comments['attack'] = labels
#print (df)

# The code below filters the dataframe on various conditions to find the percentage of attack/non-attack comments
# by ns, logged_in and sample value. I used the resulting counts to create visualizations in Excel.
filterinfDataframe = comments[(comments['ns'] == 'user') & (comments['attack'] == True)]
filterinfDataframe
filterinfDataframetwo = comments[(comments['logged_in'] == True) & (comments['attack'] == True)]
filterinfDataframetwo
comments['logged_in'].value_counts()
filterinfDataframethree = comments[(comments['sample'] == 'blocked') & (comments['attack'] == True)]
#filterinfDataframethree

# Answering Questions a-k
a. From my visualizations, I learned that approximately 88% of the comments in the dataset are not attacks while 12% are
(13590/115864). Attacks are more likely when the ns value is user than when it is article (17.6% of user comments are
attacks compared to 4.4% of article comments). When logged_in is false, the percentage of attacks is about 3.5x higher
than when it is true (24.7% vs 7%). When sample is blocked, the percentage of attacks is higher than when it is random
(16.9% vs 0.9%). Lastly, in the earliest years of the dataset (about 2001-2004), the percentage of attacks is minuscule
(ranging from 0-2%); after those years, it stays fairly consistent at 10-13%.

b. For text cleaning, I tried removing words within parentheses, lowercasing all the words, removing punctuation,
apostrophes and stop words, and removing other miscellaneous symbols (e.g. equal signs, dashes) that I noticed when going over the comment column. I included all of these in the final code.

c. The features I considered were ns, year, logged_in and sample. I also considered adding columns for the number of characters in each comment and the number of exclamation points, since attack comments seemed to be shorter and to use a lot of exclamation points when I examined them, but these had no effect on my f1 score so I dropped them. For the final code, I used the ns and logged_in features along with comment, as I saw a very slight improvement in f1 score with them.

d. I simply labeled comments as attacks if the mean of the attack values was greater than 0.5. I tried modifying the values to see if there were better thresholds but when I lowered it, the f1 scores would tend to go down and when I increased it, there was not much effect so I stuck with 0.5. 

e. I did not add any special optimizations to my code. 

f. The ML methods I tried out were LinearSVC, MultinomialNB and RandomForestClassifier. Without hyperparameter tuning, the results for my LinearSVC were macro average 0.84 for precision, 0.86 for recall and 0.85 for f1 score. For MultinomialNB, my results were 0.88 for precision, 0.79 for recall and 0.83 for f1 score. For RandomForestClassifier, my results were 0.90 for precision, 0.81 for recall and 0.84 for f1 score. The best ML method was LinearSVC, as it had the highest f1 score. 

g. For hyperparameter tuning, I created a param_grid that included loss (with two possible values: squared_hinge and hinge), C (with five possible values: 0.1, 1, 10, 100, 1000), class_weight(with 8 possible values: balanced, 1:2, 1:3, 1:4, 1:5, 1:10, 1:20 and 1:50) to help deal with imbalanced data and max_iter (with 3 possible values: 10, 100, 1000). I also used random oversampling of the minority class in order to deal with the imbalanced data and put that with the columntransform and linearsvc model that were in the pipeline. My f1 score went up 2 percentage points as a result of my hyperparameter tuning (precision went up 5 percentage points, recall went down 1 percentage point). 

h. From the different metrics, I learned that each of the models had pretty similar macro averaged precisions (between 0.88-0.90) but there was a wider gap in the range of macro averaged recalls (between 0.79-0.85) and that difference helped create the difference in the f1 score. Cross validation was very useful because it allows us to estimate the skill of our machine learning models on unseen data and generalizability is important in determining how effective our model truly is. The more generalizable the models, the less we might have to deal with problems in our data like overfitting.     

i. My best final result metrics were macro average 0.89 precision, 0.85 recall and 0.87 f1 score. Compared to the strawman code, this is a 5 percentage point increase in f1 score, a 7 percentage point increase in precision and a 4 percentage point increase in recall. LinearSVC gave me this performance. 

j. The most interesting thing I learned from this project was how to approach a machine learning project from beginning to end, from observing the data and possible trends, thinking about the most effective ways to clean text, how to do feature extraction/choose features, using pipelines and column transformers and doing hyperparameter tuning. This was my first time doing a project of this nature and it was cool to play around with all these steps and try to find an optimal solution. 

k. The hardest thing was getting the ColumnTransformer and pipelines working, as I kept getting error messages, but eventually they were resolved. It was also difficult to select features and tune hyperparameters because no matter what features I chose or what tuning I did, there were no significant improvements in f1 score (it seemed to top out at 0.87, which was 2 percentage points better than without hyperparameter tuning and 5 percentage points higher than the strawman, but I felt like it should have been higher).
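
The macro-averaged scores quoted in the answers above are unweighted means of the per-class scores, so the rare attack class counts as much as the common non-attack class. The sketch below shows the difference from a weighted average on made-up labels (the `y_true` and `y_pred` lists are assumptions for illustration only).

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 8 non-attacks (0) and 2 attacks (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)       # one score per class
macro_f1 = f1_score(y_true, y_pred, average='macro')        # plain mean of the per-class scores
weighted_f1 = f1_score(y_true, y_pred, average='weighted')  # mean weighted by class support

print(per_class_f1, macro_f1, weighted_f1)
```

Because the minority class scores lower here, the macro average comes out below the weighted average, which is why macro f1 is the stricter summary on imbalanced data like this.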

# Text Cleanup

In [207]:
# remove newline and tab tokens

# additional text cleanup: try to remove words within parentheses, lowercase all words, remove punctuation,
# apostrophes and stop words, and remove other miscellaneous symbols I noticed when going over the comment column
from nltk.corpus import stopwords
stop = stopwords.words('english')
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("=", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace('`', ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace(':', ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace('<', ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace('>', ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("-", ""))
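comments['comment'] = comments['comment'].str.replace(r'[^\w\s]', '', regex=True)
# note: the line above already strips punctuation, including parentheses, so this pattern finds nothing left to remove
comments['comment'] = comments['comment'].str.replace(r"\(.*\)", "", regex=True)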
comments['comment'] = comments['comment'].apply(lambda x: x.lower())
comments['comment'] = comments['comment'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
comments.head()

rev_id,comment,year,logged_in,ns,sample,split,attack
37675,creative dictionary definitions terms insurance ensurance properly applied destruction understand fine legitimate criticism ill write three man cell bounty hunter easy understand ensured insured different differ assured sentence quote absolutely neutral familiar underlying theory strikeback disservice reader someone comes research topic like mad want context beyond history history book fine history book claimed,2002,False,article,random,train,False
44816,term standard model less npov think wed prefer newage speak lot oldage people speak karl popper pope etc heres karl poppers view clearest title article would particle physics cosmology say would require broader treatment issues like anthropic principle cognitive bias beyond particle physics zoo etc accelerators clear use someone still looking particles yet settled cosmology certain abandon search arbitrary foundation ontology suggest subject question,2002,False,article,random,train,False
49851,true false situation march 2002 saudi proposal land peace recognition arab countries made day proposal made formal arab league day israelis command ariel sharon began invasion palestinian selfrule areas userarab,2002,False,article,random,train,False
89320,next maybe could work less condescending suggestions reading naming conventions fdl read quite ago thanks really liked bit explaining interest fixing things complained felt insulted yet extremely insulting time luck learn less jerk greglindahl,2002,True,article,random,dev,False
93890,page need disambiguation,2002,True,article,random,train,False
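
The cleaning steps can also be collected into one function. The sketch below is an illustration under stated assumptions rather than the code that produced the table above: `clean_comment` and `TOY_STOP_WORDS` are hypothetical names, the tiny stop list stands in for NLTK's English list, and parenthesised text is removed before punctuation is stripped so that the parentheses are still present when that pattern runs.

```python
import re

# Tiny illustrative stop list (an assumption; the notebook uses NLTK's English stop words).
TOY_STOP_WORDS = {'the', 'a', 'is', 'and', 'to', 'of'}


def clean_comment(text, stop_words=TOY_STOP_WORDS):
    """Apply the cleanup steps to a single comment string."""
    # replace the corpus placeholder tokens with spaces
    text = text.replace('NEWLINE_TOKEN', ' ').replace('TAB_TOKEN', ' ')
    # drop parenthesised text while the parentheses still exist
    text = re.sub(r'\([^)]*\)', ' ', text)
    # strip everything that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    # remove stop words and collapse repeated whitespace
    return ' '.join(word for word in text.split() if word not in stop_words)


example = clean_comment("The term (see above) is NOT neutral!NEWLINE_TOKEN== Reply ==")
print(example)
```

With pandas, the same function could be applied as `comments['comment'].apply(clean_comment)`.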


In [208]:
comments.query('attack')['comment'].head(5)

rev_id
801279                                                                                                                                                                                                                                                                                                                                     iraq good usa bad
2702703    ____ fuck little asshole want talk human start showing fear way humans act around humans continue beligerant campaign cross another boundary begin offsite recruitmehnt escalate till rhetorically nuclear whole goddamed mob think find want better start expressing interest concerns presented credibility either document community pile shit
4632658                                                                                                                                                                                                                                                                                                

In [276]:
# fit a simple text classifier

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', DecisionTreeClassifier(random_state = 123)),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])

met = metrics.classification_report(test_comments['attack'], clf.predict(test_comments['comment']))
print(met)

              precision    recall  f1-score   support

       False       0.95      0.96      0.96     20422
        True       0.69      0.66      0.67      2756

    accuracy                           0.92     23178
   macro avg       0.82      0.81      0.82     23178
weighted avg       0.92      0.92      0.92     23178



In [277]:
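# print the confusion matrix over all comments (train, dev and test); `grid` is the fitted GridSearchCV object from the tuning cells below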
from sklearn.metrics import confusion_matrix
y_true = comments.attack
y_pred = grid.predict(comments)
print(confusion_matrix(y_true, y_pred))

[[98339  3935]
 [ 1321 12269]]


# Machine Learning Method 1: LinearSVC

In [280]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

In [281]:
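# imblearn's Pipeline accepts a resampling step; the sklearn imports provide the transformers used below
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import OneHotEncoder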
t = [('cat', OneHotEncoder(handle_unknown='ignore'), ["ns"]), ('cat_two', OneHotEncoder(handle_unknown='ignore'), ["logged_in"]), ('comment', TfidfVectorizer(), "comment")]
col_transform = ColumnTransformer(transformers=t)
svc_pipe  = Pipeline([('prep',col_transform),
                     ('sampler', RandomOverSampler(sampling_strategy='minority',random_state=42)),
                     ('model',   LinearSVC())])
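
To see what the column transformer feeds the model, here is a minimal sketch on a made-up six-row frame. The `toy` data is an assumption for illustration, and the sketch uses scikit-learn's own `Pipeline` without the oversampling step so that it runs without imblearn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Hypothetical comments with the same column names as the notebook's dataframe.
toy = pd.DataFrame({
    'comment': ['thanks great edit', 'you are an idiot', 'nice work on the article',
                'stupid idiot go away', 'please add a source', 'idiot stupid vandal'],
    'ns': ['article', 'user', 'article', 'user', 'article', 'user'],
    'logged_in': [True, False, True, False, True, False],
    'attack': [False, True, False, True, False, True],
})

# A list of column names yields a 2-D frame (needed by OneHotEncoder);
# a bare string yields the 1-D text column that TfidfVectorizer expects.
prep = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['ns']),
    ('cat_two', OneHotEncoder(handle_unknown='ignore'), ['logged_in']),
    ('comment', TfidfVectorizer(), 'comment'),
])
toy_pipe = Pipeline([('prep', prep), ('model', LinearSVC())])
toy_pipe.fit(toy, toy['attack'])

# One row per comment; columns are the one-hot indicators followed by the tf-idf terms.
features = toy_pipe.named_steps['prep'].transform(toy)
print(features.shape)
```

Columns not named in the transformer list (here `attack`) are dropped by default, which is why the whole dataframe can be passed to `fit`.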

In [282]:
def evaluate_model(
    train_df : pd.DataFrame,
    test_df  : pd.DataFrame,
    pipe     : Pipeline,
) -> None:
    # fit on the training frame that was passed in (not the global train_comments)
    model = pipe.fit(train_df, train_df["attack"])

    # report per-class and averaged metrics on the held-out frame
    pred  = model.predict(test_df)
    print(metrics.classification_report(test_df["attack"], pred))

In [182]:
from functools import partial
evaluate_pipeline = partial(evaluate_model,
                            train_comments,
                            test_comments)

In [183]:
evaluate_pipeline(svc_pipe)

              precision    recall  f1-score   support

       False       0.97      0.96      0.96     20422
        True       0.72      0.75      0.74      2756

    accuracy                           0.94     23178
   macro avg       0.84      0.86      0.85     23178
weighted avg       0.94      0.94      0.94     23178
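
The `RandomOverSampler` step in the pipeline duplicates randomly chosen minority-class rows until both classes are the same size, and imblearn's `Pipeline` applies it only while fitting, not when predicting. The sketch below is a hand-rolled NumPy illustration of that idea, not imblearn's implementation; `oversample_minority` is a hypothetical helper.

```python
import numpy as np


def oversample_minority(y, seed=42):
    """Return row indices in which the rarer class is resampled with replacement
    until both classes have the same count (binary labels only)."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    # draw as many extra minority rows as are needed to close the gap
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    return np.concatenate([np.arange(len(y)), extra])


toy_y = np.array([0] * 8 + [1] * 2)   # 8 non-attacks, 2 attacks
idx = oversample_minority(toy_y)
balanced_counts = np.bincount(toy_y[idx])
print(balanced_counts)
```

Every original row is kept, so oversampling adds no new information; it only changes how much weight the minority class carries during training.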



# Tuning Hyperparameters LinearSVC

In [220]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling  import RandomOverSampler
from sklearn.model_selection import GridSearchCV
t = [('cat', OneHotEncoder(handle_unknown='ignore'), ["ns"]), ('cat_two', OneHotEncoder(handle_unknown='ignore'), 
                                                               ["logged_in"]), ('comment', TfidfVectorizer(), "comment")]
col_transform = ColumnTransformer(transformers=t)
model = LinearSVC()
svc_pipe  = Pipeline([('prep',col_transform),
                     ('sampler', RandomOverSampler(sampling_strategy='minority',random_state=42)),
                     ('model',   LinearSVC())])
param_grid = {
              'model__loss': ['squared_hinge', 'hinge'],
              'model__C': [0.1, 1, 10, 100, 1000],
              'model__class_weight': ['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}],
              'model__max_iter': [10, 100, 1000]}
grid = GridSearchCV(svc_pipe, param_grid=param_grid, scoring='f1_macro', verbose=10, n_jobs=-1)
grid.fit(train_comments, train_comments['attack'])
print(grid.best_params_)
print(grid.best_estimator_)
grid_predictions = grid.predict(test_comments)
print(metrics.classification_report(test_comments['attack'], grid_predictions))

Fitting 5 folds for each of 240 candidates, totalling 1200 fits




{'model__C': 1, 'model__class_weight': 'balanced', 'model__loss': 'hinge', 'model__max_iter': 10}
Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['ns']),
                                                 ('cat_two',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['logged_in']),
                                                 ('comment', TfidfVectorizer(),
                                                  'comment')])),
                ('sampler',
                 RandomOverSampler(random_state=42,
                                   sampling_strategy='minority')),
                ('model',
                 LinearSVC(C=1, class_weight='balanced', loss='hinge',
                           max_iter=10))])
              precisio
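
The "240 candidates, totalling 1200 fits" line in the log follows directly from the grid: every combination of the listed values is tried once per cross-validation fold. A short sketch that reproduces the count (the name `grid_spec` is chosen here to avoid overwriting the notebook's `param_grid`):

```python
from sklearn.model_selection import ParameterGrid

grid_spec = {
    'model__loss': ['squared_hinge', 'hinge'],
    'model__C': [0.1, 1, 10, 100, 1000],
    'model__class_weight': ['balanced', {1: 2}, {1: 3}, {1: 4}, {1: 5}, {1: 10}, {1: 20}, {1: 50}],
    'model__max_iter': [10, 100, 1000],
}

n_candidates = len(ParameterGrid(grid_spec))  # 2 * 5 * 8 * 3 combinations
n_folds = 5                                   # fold count reported in the log
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

The `model__` prefix routes each parameter to the pipeline step named `model`, which is why the second grid below uses `m__` for a step named `m`.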

# Tuning Hyperparameters Best Result

In [148]:
from sklearn.model_selection import GridSearchCV
t = [('cat', OneHotEncoder(handle_unknown='ignore'), ["ns"]), ('cat_two', OneHotEncoder(handle_unknown='ignore'), ["logged_in"]), ('comment', TfidfVectorizer(), "comment")]
col_transform = ColumnTransformer(transformers=t)
model = LinearSVC()
pipeline = Pipeline(steps=[('prep',col_transform), ('m', model)])
param_grid = {
              'm__loss': ['squared_hinge', 'hinge'],
              'm__C': [0.1, 1, 10, 100, 1000],
              'm__class_weight': ['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}],
              'm__max_iter': [10, 100, 1000]}
grid = GridSearchCV(pipeline, param_grid=param_grid, scoring='f1_macro', verbose=10, n_jobs=-1)
grid.fit(train_comments, train_comments['attack'])
print(grid.best_params_)
print(grid.best_estimator_)
grid_predictions = grid.predict(test_comments)
print(metrics.classification_report(test_comments['attack'], grid_predictions))

Fitting 5 folds for each of 240 candidates, totalling 1200 fits
{'m__C': 0.1, 'm__class_weight': {1: 3}, 'm__loss': 'squared_hinge', 'm__max_iter': 100}
Pipeline(steps=[('prep',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['ns']),
                                                 ('cat_two',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['logged_in']),
                                                 ('comment', TfidfVectorizer(),
                                                  'comment')])),
                ('m', LinearSVC(C=0.1, class_weight={1: 3}, max_iter=100))])
              precision    recall  f1-score   support

       False       0.96      0.98      0.97     20422
        True       0.81      0.73      0.77      2756

    accuracy 

In [283]:
# print cross_val_score for the LinearSVC model; k = 10 was chosen because outside research suggested it gives a good
# trade-off between computational cost and bias when estimating model performance
# note: LabelEncoder maps each distinct comment to an integer ID, so the model is scored on that single numeric
# feature rather than on the comment text
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
import numpy as np
model = LinearSVC()
y = np.array(comments.attack)
X = comments['comment']
le = preprocessing.LabelEncoder()
X = le.fit_transform(X)
print(cross_val_score(model, X.reshape(-1, 1), y, scoring='f1_macro', cv=10, error_score='raise'))



[0.34990966 0.46885171 0.46885171 0.46885171 0.46884885 0.46884885
 0.46884885 0.46884885 0.34857764 0.46884885]
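
In the cell above, `LabelEncoder` turns each comment into an integer ID, so the classifier never sees the words. An alternative, sketched below on made-up comments, is to cross-validate a pipeline that vectorizes the text inside each fold. This is an illustration under stated assumptions (toy data, default `TfidfVectorizer` settings, 3 folds), not the configuration used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical comments: six friendly, six hostile.
toy_texts = [
    'thanks for the great edit', 'nice work on this article', 'please add a source here',
    'good point thanks a lot', 'great article nice sources', 'thanks nice work',
    'you are a stupid idiot', 'go away stupid vandal', 'idiot stop editing',
    'what a stupid idiot', 'stupid vandal go away idiot', 'you idiot vandal',
]
toy_attack = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# The vectorizer is refit on the training part of every fold, so no vocabulary leaks from the held-out part.
text_model = make_pipeline(TfidfVectorizer(), LinearSVC())
cv_scores = cross_val_score(text_model, toy_texts, toy_attack, scoring='f1_macro', cv=3)
print(cv_scores)
```

On the real data this would correspond to passing `comments['comment']` and `comments['attack']`, with `cv=10` as in the cell above.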




In [284]:
# print the confusion matrix for `grid` over all comments (train, dev and test)
from sklearn.metrics import confusion_matrix
y_true = comments.attack
y_pred = grid.predict(comments)
print(confusion_matrix(y_true, y_pred))

[[98339  3935]
 [ 1321 12269]]
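
The matrix above has true labels as rows and predictions as columns, so precision and recall for the attack class can be read off it directly. The sketch below does the arithmetic on the printed counts. Because `grid.predict(comments)` scores every row, including the rows the model was trained on, these figures are not comparable to the test-split reports.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[98339, 3935],
               [1321, 12269]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # share of predicted attacks that are real attacks
recall = tp / (tp + fn)      # share of real attacks that were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```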


# Machine Learning Method 2: MultinomialNB
Below are the best results from using MultinomialNB. I also tried MultinomialNB with max_features of 10000 and 20000 and a word analyzer with ngram_range values of (1,2), (1,3) and (1,4), which led to f1 scores between 0.79 and 0.80, as the recall scores went down to about 0.73-0.74. I also tried adding various combinations of features like logged_in, ns, sample and year, but f1 scores then declined to about 0.6-0.64, so I didn't use them.

In [278]:
from sklearn.naive_bayes import MultinomialNB
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 30000, analyzer='char', ngram_range = (1,6))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', MultinomialNB()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])

met = metrics.classification_report(test_comments['attack'], clf.predict(test_comments['comment']))
print(met)

              precision    recall  f1-score   support

       False       0.95      0.98      0.96     20422
        True       0.80      0.61      0.69      2756

    accuracy                           0.94     23178
   macro avg       0.88      0.79      0.83     23178
weighted avg       0.93      0.94      0.93     23178



In [272]:
# print cross_val_score for the MultinomialNB model; k = 10 was chosen because outside research suggested it gives a good
# trade-off between computational cost and bias when estimating model performance
from sklearn.model_selection import cross_val_score
import numpy as np
model = MultinomialNB()
y=np.array(comments.attack)
X=comments['comment']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X = le.fit_transform(X)
print(cross_val_score(model, X.reshape(-1, 1), y, scoring='f1_macro', cv=10, error_score='raise'))

[0.46885171 0.46885171 0.46885171 0.46885171 0.46884885 0.46884885
 0.46884885 0.46884885 0.46884885 0.46884885]


In [279]:
# print out confusion matrix
from sklearn.metrics import confusion_matrix
y_true = comments.attack
y_pred = clf.predict(comments['comment'])
print(confusion_matrix(y_true, y_pred))

[[99883  2391]
 [ 5018  8572]]


# Machine Learning Method 3: RandomForestClassifier
Below are the best results from using RandomForestClassifier. I also tried changing max_features in CountVectorizer to 20000, using word and char ngram_range values from (1,2) to (1,5), and adding other features such as logged_in, ns, sample and year, but none of these had a significant positive impact on the final metrics.

In [235]:
from sklearn.ensemble import RandomForestClassifier
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', RandomForestClassifier()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])

met = metrics.classification_report(test_comments['attack'], clf.predict(test_comments['comment']))
print(met)

              precision    recall  f1-score   support

       False       0.95      0.98      0.97     20422
        True       0.84      0.63      0.72      2756

    accuracy                           0.94     23178
   macro avg       0.90      0.81      0.84     23178
weighted avg       0.94      0.94      0.94     23178



In [273]:
# print cross_val_score for the RandomForestClassifier model; k = 10 was chosen because outside research suggested it
# gives a good trade-off between computational cost and bias when estimating model performance
from sklearn.model_selection import cross_val_score
import numpy as np
model = RandomForestClassifier()
y=np.array(comments.attack)
X=comments['comment']
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X = le.fit_transform(X)
print(cross_val_score(model, X.reshape(-1, 1), y, scoring='f1_macro', cv=10, error_score='raise'))

[0.61886448 0.61760728 0.63242549 0.62684177 0.63821585 0.64709311
 0.61750404 0.62003192 0.63015819 0.63264445]


In [236]:
#print out confusion matrix
from sklearn.metrics import confusion_matrix
y_true = comments.attack
y_pred = clf.predict(comments['comment'])
print(confusion_matrix(y_true, y_pred))

[[101525    749]
 [  2120  11470]]


In [10]:
# correctly classify nice comment
clf.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [11]:
# correctly classify nasty comment
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])

## Prevalence of personal attacks by namespace
In this section we use our classifier in conjunction with the [Wikipedia Talk Corpus](https://figshare.com/articles/Wikipedia_Talk_Corpus/4264973) to see if personal attacks are more common on user talk or article talk page discussions. In our paper we show that the model is not biased by namespace.

In [None]:
import os
import re
from scipy.stats import bernoulli
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# download and untar data

USER_TALK_CORPUS_2004_URL = 'https://ndownloader.figshare.com/files/6982061'
ARTICLE_TALK_CORPUS_2004_URL = 'https://ndownloader.figshare.com/files/7038050'

download_file(USER_TALK_CORPUS_2004_URL, 'comments_user_2004.tar.gz')
download_file(ARTICLE_TALK_CORPUS_2004_URL,  'comments_article_2004.tar.gz')

os.system('tar -xzf comments_user_2004.tar.gz')
os.system('tar -xzf comments_article_2004.tar.gz')

In [None]:
# helper for collecting a sample of non-bot, non-admin comments for a given ns and year from the extracted chunk files
def load_no_bot_no_admin(ns, year, prob = 0.1):
    
    dfs = []
    
    data_dir = "comments_%s_%d" % (ns, year)
    for _, _, filenames in os.walk(data_dir):
        for filename in filenames:
            if re.match(r"chunk_\d*\.tsv", filename):
                df = pd.read_csv(os.path.join(data_dir, filename), sep = "\t")
                df['include'] = bernoulli.rvs(prob, size=df.shape[0])
                df = df.query("bot == 0 and admin == 0 and include == 1")
                dfs.append(df)
                
    sample = pd.concat(dfs)
    sample['ns'] = ns
    sample['year'] = year
    
    return sample
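
The sampling inside the helper draws an independent 0/1 flag for every row and keeps the rows whose flag is 1, so roughly a fraction `prob` of each chunk survives before the bot and admin filters are applied. The sketch below shows the same idea on a made-up chunk. It swaps in NumPy's `Generator.binomial` with one trial for `scipy.stats.bernoulli.rvs`, and `toy_chunk` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical chunk with the columns the helper filters on.
toy_chunk = pd.DataFrame({
    'comment': ['c%d' % i for i in range(1000)],
    'bot': rng.integers(0, 2, size=1000),
    'admin': rng.integers(0, 2, size=1000),
})

# One Bernoulli(0.1) draw per row: 1 means the row is included in the sample.
toy_chunk['include'] = rng.binomial(1, 0.1, size=len(toy_chunk))
toy_sample = toy_chunk.query('bot == 0 and admin == 0 and include == 1')
print(len(toy_sample), 'of', len(toy_chunk), 'rows kept')
```

Seeding the generator makes the sample reproducible, which the unseeded `bernoulli.rvs` call in the helper is not.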

In [None]:
# collect a random sample of comments from 2004 for each namespace
corpus_user = load_no_bot_no_admin('user', 2004)
corpus_article = load_no_bot_no_admin('article', 2004)
corpus = pd.concat([corpus_user, corpus_article])

In [None]:
# Apply model
corpus['comment'] = corpus['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
corpus['comment'] = corpus['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
corpus['attack'] = clf.predict_proba(corpus['comment'])[:,1] > 0.425 # see paper
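
Thresholding the predicted probability at 0.425 flags more comments than a 0.5 cut-off would. A tiny sketch with made-up probabilities (the `toy_proba` values are assumptions for illustration):

```python
import numpy as np

# Hypothetical attack probabilities for six comments.
toy_proba = np.array([0.05, 0.30, 0.43, 0.47, 0.50, 0.90])

flag_at_0425 = toy_proba > 0.425   # threshold used in the cell above
flag_at_05 = toy_proba > 0.5       # stricter cut-off, for comparison

print(flag_at_0425.sum(), flag_at_05.sum())
```

Lowering the threshold trades precision for recall, so the estimated attack fraction depends on the value chosen.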

In [None]:
# plot prevalence per ns

sns.pointplot(data = corpus, x = 'ns', y = 'attack')
plt.ylabel("Attack fraction")
plt.xlabel("Discussion namespace")

Attacks are far more prevalent in the user talk namespace.
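
The point plot shows the per-namespace mean of the boolean `attack` column, which is the fraction of comments flagged as attacks. The same numbers can be read off with a group-by, sketched here on a made-up frame (`toy_corpus` is an assumption for illustration, not real results).

```python
import pandas as pd

# Hypothetical predictions: four user-talk and four article-talk comments.
toy_corpus = pd.DataFrame({
    'ns': ['user', 'user', 'user', 'user', 'article', 'article', 'article', 'article'],
    'attack': [True, False, True, False, False, False, False, True],
})

# The mean of a boolean column is the share of True values in each group.
attack_rate = toy_corpus.groupby('ns')['attack'].mean()
print(attack_rate)
```

On the real sample, `corpus.groupby('ns')['attack'].mean()` would give the values behind the plot.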