# Predicting Wikipedia Article Quality With Natural Language Processing

![img](images/header.jpg)

*(photo courtesy of Dmitrij Paskevic, hosted on [Unsplash](https://unsplash.com/photos/YjVa-F9P9kk))*

### Author

> **Luke Dowker** ([GitHub](https://github.com/toastdeini) | [LinkedIn](https://www.linkedin.com/in/luke-dowker/) | [Email](mailto:lhdowker@gmail.com))

## Overview

Over the course of its twenty-plus-year existence, Wikipedia's reputation has gradually evolved from that of a [digital "Wild West"](https://www.cnn.com/2009/TECH/08/26/wikipedia.editors/index.html), [replete with misinformation](https://usatoday30.usatoday.com/news/opinion/editorials/2005-11-29-wikipedia-edit_x.htm), to that of a [meticulously curated](https://en.wikipedia.org/wiki/Vandalism_on_Wikipedia#Prevention) and (generally) reliable resource for [fact-checking](https://en.wikipedia.org/wiki/Wikipedia_and_fact-checking) & bird's-eye/survey-level research.

The site's reliability and ongoing improvement can be attributed, in large part, to the fastidiousness of Wikipedia's volunteer editors, who have been using Bayesian statistics for at least fifteen years now to identify "[vandalism](https://en.wikipedia.org/wiki/Wikipedia:Vandalism)" - bad-faith edits "deliberately intended to obstruct" the distribution of verifiable, open-source knowledge - with scripts like [ClueBot](https://en.wikipedia.org/wiki/User:ClueBot_NG). The steadily increasing proportion of "[good articles](https://en.wikipedia.org/wiki/Wikipedia:Good_article_statistics)" is the direct result of a concerted, altruistic effort by English speakers across the world to create an accessible, democratized encyclopedia.

![img](images/pct_good_articles.png)

*(This image recreated from a similar Wikipedia graph - code for this is located in the [EDA notebook](prep/Exploratory_Analysis_and_Visualization.ipynb).)*

Regrettably, not everyone who fires up their computer (or phone!) to edit Wikipedia has equally noble intentions. Though the site's policy makes clear that Wikipedia is "[**not** a soapbox or means of promotion](https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_a_soapbox_or_means_of_promotion)" (emphasis mine), upwards of 20,000 articles on the site fail to maintain a neutral point of view, and hundreds of these "[articles with a promotional tone](https://en.wikipedia.org/wiki/Category:Articles_with_a_promotional_tone)" are identified monthly by editors and everyday visitors.

Articles with a slanted perspective present a threat not only to Wikipedia's credibility as a source of knowledge, but also to the average user: without prior knowledge of the subject at hand, how can a reader know if the information they're getting is objective, other than by intuition? This is where machine learning and natural language processing (NLP) enter the picture: a model trained on data that represents the contents of **both "good" and "promotional" articles** will be able to forecast whether a body of text meets an encyclopedic editorial standard, or if the text is likely to be marked by readers as "promotional" and thus not useful.

The final model classifies unseen documents with an accuracy rate hovering **just above 90%** using a term importance (TF-IDF) vectorizer and a classifier from the popular `XGBoost` library. Users inputting a passage of text into the [application](https://share.streamlit.io/toastdeini/wikipedia-article-quality/main/app_testing.py) that employs this model can be confident that, nine times out of ten, they will know almost immediately if a given Wikipedia article is written from a sufficiently neutral point of view, or if the article is more like a glorified advertisement.

## Business Problem

First, the good news: Wikipedia is better than it’s ever been. It was frequently criticized in its infancy for the unreliability of its contents, but the site’s volunteer editors have worked tirelessly over the years to track and remove vandalism while demanding a rigorous editorial standard.

As a result, the number of articles on Wikipedia that contain “factually accurate and verifiable information […and are] neutral in point of view” has been steadily increasing since the site started labeling and tracking “good articles” in 2006.

…That said, while it’s a core principle of the site that anyone can edit Wikipedia, not all edits are done in good faith. Blatant vandalism and spam are ongoing problems, but ones that are easily rectified with existing Wikipedia scripts like ClueBot, which uses naïve Bayes classification to detect “bad” edits.

On the other hand, articles written in a “promotional tone” pose a more existential threat to Wikipedia, as the information they purvey is biased. It’s more difficult to parse tone than to identify a simple instance of childish vandalism, especially if you know nothing about a subject.

This is where **natural language processing** comes in handy! Using the text of Wikipedia articles that have already been identified as either good or having a promotional tone, we can train a model that’s able to **predict which class an unlabeled article will belong to!**

## Data

> All data used in this project has been uploaded externally by the author [here](https://www.mediafire.com/folder/kqlo7r936ufdp/flatiron-capstone-data) to expedite reproducibility, due to the size of the combined, lemmatized `.csv` file.

Data used in this project is freely available for download on [Kaggle](https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles), courtesy of user `urbanbricks`. "[Good articles](https://en.wikipedia.org/wiki/Wikipedia:Good_articles)" - articles which meet a "core set of editorial standards" - were stored as strings (with corresponding URLs) in one CSV file, `good.csv`. Articles with a "[promotional tone](https://en.wikipedia.org/wiki/Category:Articles_with_a_promotional_tone)" were stored in a separate CSV (`promotional.csv`) that, in addition to `text` and `url` columns, contains one-hot encoded columns that identify a subclass of promotional tone, e.g. `advert` (written like an advertisement) or `coi` (conflict of interest with subject).

> Full exploratory data analysis, including all visualizations created for this project, is available in [this notebook](prep/Exploratory_Analysis_and_Visualization.ipynb).

### Libraries, Packages, and Scripts

In [1]:
# Data comprehension/manipulation
import pandas as pd
import numpy as np

# Loading in stored objects
import pickle

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning packages
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Natural language processing
import nltk

# etc.
import os
import sys
module_path = os.path.abspath(os.pardir)
if module_path not in sys.path:
    sys.path.append(module_path)
    
# Custom/helper functions
from src.parse_it import *
from src.modeling import *
from src.EDA import *

### Load in Data

Full exploratory data analysis for this project can be found in [a separate notebook](prep/Exploratory_Analysis_and_Visualization.ipynb); this final notebook contains the most notable parts of that analysis, with truncated/imported code whenever possible.

Data is stored in two separate `.csv` files; one contains articles marked as "good," while the other contains articles marked as "promotional," with various subclasses of "promotional."

In [2]:
df_good = pd.read_csv('../data/good.csv')
df_promo = pd.read_csv('../data/promotional.csv')

### [Data Exploration and Preparation](prep/Exploratory_Analysis_and_Visualization.ipynb)

First, I examine the size of the dataset and the distribution of the target variable.

In [3]:
print(f"{df_good.shape[0]} documents in this file/dataset:") 
df_good.head(5)

30279 documents in this file/dataset:


|   | text | url |
|---|------|-----|
| 0 | Nycticebus linglom is a fossil strepsirrhine p... | https://en.wikipedia.org/wiki/%3F%20Nycticebus... |
| 1 | Oryzomys pliocaenicus is a fossil rodent from ... | https://en.wikipedia.org/wiki/%3F%20Oryzomys%2... |
| 2 | .hack dt hk is a series of single player actio... | https://en.wikipedia.org/wiki/.hack%20%28video... |
| 3 | The You Drive Me Crazy Tour was the second con... | https://en.wikipedia.org/wiki/%28You%20Drive%2... |
| 4 | 0 8 4 is the second episode of the first seaso... | https://en.wikipedia.org/wiki/0-8-4 |


In [4]:
print(f"{df_promo.shape[0]} documents in this file/dataset:") 
df_promo.head(5)

23837 documents in this file/dataset:


|   | text | advert | coi | fanpov | pr | resume | url |
|---|------|--------|-----|--------|----|--------|-----|
| 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 0 | 0 | 1 | 0 | 0 | https://en.wikipedia.org/wiki/1%20Litre%20no%2... |
| 1 | 1DayLater was free, web based software that wa... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1DayLater |
| 2 | 1E is a privately owned IT software and servic... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1E |
| 3 | 1Malaysia pronounced One Malaysia in English a... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1Malaysia |
| 4 | The Jerusalem Biennale, as stated on the Bienn... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1st%20Jerusalem%... |


Curious about the distribution of subclasses in `df_promo` (that is, which kind of promotional tone *best describes* each article), I plotted the value counts of the one-hot encoded subclass columns, which are defined as follows:

- **Advertisement-like** / `advert` - The article reads like an advertisement for a company, a product, or an organization, or is otherwise an advertisement "masquerading" as a legitimate article.
- **Conflict of interest** / `coi` - There appears to be a conflict of interest between the subject of the article and the author of the article, which "undermines public confidence" in Wikipedia.
- **Fan's point of view** / `fanpov` - The article appears to have been written by a fan or admirer of the subject, rather than from a neutral point of view.
- **News article/press release-like** / `pr` - The article reads like a news article, i.e. "the article may not be promotional or overly-negative, but is still unencyclopedic in tone."
- **Résumé-like** / `resume` - The article reads like a résumé or CV.

---

![img](images/promo_dist.png)

---

The imbalance of promotional subclasses in the dataset, along with the overlap in semantic meaning between the subclass descriptions - *are these not all 'promotional' in some form or another?* - shaped my decision to approach this project as a **binary classification problem**.
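
As a rough illustration of how the subclass counts discussed above can be tallied, the sketch below sums each one-hot column over a handful of rows. The flags are copied from the `df_promo.head(5)` output shown earlier; the helper names `tally_subclasses` and `count_multi_tagged` are hypothetical and are not part of the project's `src` package.

```python
# Hypothetical helpers: tally one-hot subclass flags without pandas.
SUBCLASSES = ["advert", "coi", "fanpov", "pr", "resume"]

# One-hot flags from the five rows displayed by df_promo.head(5)
rows = [
    {"advert": 0, "coi": 0, "fanpov": 1, "pr": 0, "resume": 0},
    {"advert": 1, "coi": 1, "fanpov": 0, "pr": 0, "resume": 0},
    {"advert": 1, "coi": 0, "fanpov": 0, "pr": 0, "resume": 0},
    {"advert": 1, "coi": 0, "fanpov": 0, "pr": 0, "resume": 0},
    {"advert": 1, "coi": 0, "fanpov": 0, "pr": 0, "resume": 0},
]


def tally_subclasses(rows, columns=SUBCLASSES):
    """Return how many rows carry each subclass flag."""
    return {col: sum(row[col] for row in rows) for col in columns}


def count_multi_tagged(rows, columns=SUBCLASSES):
    """Return how many rows carry more than one subclass flag."""
    return sum(1 for row in rows if sum(row[col] for col in columns) > 1)


counts = tally_subclasses(rows)
multi = count_multi_tagged(rows)
print(counts)
print(f"{multi} of {len(rows)} rows have more than one tag")
```

Even in this tiny sample, `advert` dominates and one row carries two tags at once, which is the kind of imbalance and overlap that motivates collapsing the subclasses into a single label. With the full dataframe loaded, `df_promo[SUBCLASSES].sum()` should give the equivalent tallies.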

In [5]:
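# Keep only the text column; .copy() avoids a SettingWithCopyWarning
# when the label column is added below
df_good = df_good[['text']].copy()
df_promo = df_promo[['text']].copy()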

All rows in `good.csv` were assigned the label `0`, indicating that the article is *not* promotional, whereas all rows in `promotional.csv` were assigned the label `1`, regardless of their subclass.

In [6]:
# Assigning 'good' articles the label 0, a 'falsy' value,
# indicating that the article is NOT promotional

df_good['label'] = 0
df_good.head(1)

|   | text | label |
|---|------|-------|
| 0 | Nycticebus linglom is a fossil strepsirrhine p... | 0 |


In [7]:
# Assigning 'promotional' articles the label 1, a 'truthy'
# value, indicating that the article IS promotional

df_promo['label'] = 1
df_promo.head(1)

|   | text | label |
|---|------|-------|
| 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 1 |


The two dataframe objects were then concatenated to prepare the data for modeling.

In [8]:
# Concatenate dataframes; ignore_index=True renumbers the rows
# so the combined index runs from 0 to n-1

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0

df = pd.concat([df_good, df_promo],
               ignore_index=True)

In [9]:
# Shuffle rows so the two classes are interleaved

df = df.sample(frac=1).reset_index(drop=True)

Following some investigation into document word counts, I found that "good" articles were about **four times longer** on average than promotional articles.

In [None]:
compare_mean_counts(df, 'text', 'label', 0, 1)

![img](images/median_word_count.png)
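
The comparison above comes from the project's `compare_mean_counts` helper in `src/EDA.py`, which is not reproduced here. As a minimal sketch of the underlying idea, and not that helper's actual implementation, the code below groups whitespace-delimited word counts by label and reports both the mean and the median. The toy corpus is made up for illustration.

```python
from statistics import mean, median


def word_count(text):
    """Count whitespace-delimited tokens in a document."""
    return len(text.split())


def summarize_word_counts(docs, labels):
    """Group word counts by label and return (mean, median) per label."""
    grouped = {}
    for text, label in zip(docs, labels):
        grouped.setdefault(label, []).append(word_count(text))
    return {label: (mean(c), median(c)) for label, c in grouped.items()}


# Toy corpus (made up): label 0 = good, label 1 = promotional
docs = [
    "The species was described from a single fossil tooth found in Thailand",
    "The tour began in June and ended in September after forty shows",
    "Best software ever",
    "Award winning services for you",
]
labels = [0, 0, 1, 1]

summary = summarize_word_counts(docs, labels)
for label, (avg, mid) in sorted(summary.items()):
    print(f"label {label}: mean={avg}, median={mid}")
```

Reporting both statistics is a deliberate choice: article lengths tend to be skewed, so the mean and the median can tell somewhat different stories about the same corpus.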

The following cell is one of the most computationally expensive tasks in this whole project - it applies the custom function `parse_doc` (see the [.py file](src/parse_it.py) for further information) to the content of the `text` column for each row in the dataframe, i.e. each document in the corpus. The result is a string of **lemmas**, the approximate morphological roots of each word in the document, stored in a new column called `text_lem`. This helps standardize the text for vectorization and, eventually, modeling.

Again, the following two cells are only necessary to run if you're interested in completely reproducing the project, step by step. For convenience, the output is saved as `lemmed_combined.csv`, which will probably save you at least an hour of computing time if loaded in.

In [None]:
# df['text_lem'] = df['text'].apply(parse_doc) 

In [None]:
# df.to_csv('lemmed_combined.csv')
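
For readers who want a feel for what the lemmatization step produces without running it, here is a minimal sketch under stated assumptions. It is not the project's `parse_doc` (see `src/parse_it.py` for that): the `lemmatize_doc` function and the `TOY_LEMMAS` lookup table are hypothetical and exist only so the example runs without downloading any corpora.

```python
import re

# Toy lemma lookup, made up for this example. A real run would pass in a
# proper lemmatizer instead, e.g. nltk.stem.WordNetLemmatizer().lemmatize
# (which needs the WordNet corpus to be downloaded first).
TOY_LEMMAS = {
    "articles": "article",
    "editors": "editor",
    "were": "be",
    "was": "be",
    "written": "write",
}


def toy_lemmatize(token):
    """Look a token up in the toy table, falling back to the token itself."""
    return TOY_LEMMAS.get(token, token)


def lemmatize_doc(text, lemmatize=toy_lemmatize):
    """Lowercase, keep alphabetic tokens, lemmatize each, rejoin as a string."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatize(tok) for tok in tokens)


example = "The articles were written by volunteer editors."
lemmed = lemmatize_doc(example)
print(lemmed)
```

The output is a single space-separated string per document, which is the shape a vectorizer expects. Passing the lemmatizer in as an argument keeps the tokenizing logic separate from whichever lemma source is used.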

## Methods and Models

Below, the updated `.csv` file is read in - it can be downloaded externally [here](https://www.mediafire.com/folder/kqlo7r936ufdp/flatiron-capstone-data).

In [10]:
# Read in usable dataframe

df = pd.read_csv('../data/lemmed_combined.csv',
                 index_col=0)

# Reorder columns, keeping only the lemmatized text

df = df[['text_lem', 'label']]

In [11]:
X = df['text_lem']
y = df['label']

In [12]:
# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=138,
                                                    stratify=y)

# Secondary train-test split for validation

X_tr_val, X_val, y_tr_val, y_val = train_test_split(X_train, y_train,
                                                    random_state=138,
                                                    stratify=y_train)
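
To make the sizes of these nested splits concrete, the sketch below works through the arithmetic. It assumes `train_test_split`'s default `test_size` of 0.25 and assumes the test portion is rounded up with the remainder going to training; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 30279 + 23837               # good + promotional documents
n_train, n_test = split_sizes(n_total)
n_tr_val, n_val = split_sizes(n_train)

print(f"total:      {n_total}")
print(f"train/test: {n_train} / {n_test}")
print(f"fit/val:    {n_tr_val} / {n_val}")
```

Under these assumptions, models evaluated on the validation set are fit on a little over half of the corpus, while the final fit on `X_train` sees three quarters of it. The 13,529 test documents agree with the total support in the classification report under Results.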

### Baseline Model: `DummyClassifier`

scikit-learn's `DummyClassifier` was used as a baseline for measuring model performance.

In [None]:
dum_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('dum', DummyClassifier(strategy='most_frequent'))
])

In [None]:
dum_pipe.fit(X_tr_val, y_tr_val)

In [None]:
# Score the baseline on held-out validation data, like the other models
dum_pipe.score(X_val, y_val)
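
Because the `most_frequent` strategy ignores the text entirely, its accuracy is simply the share of the majority class. The sketch below reproduces that calculation from the class counts printed earlier; `majority_baseline_accuracy` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_eval if label == majority_label)
    return majority_label, hits / len(y_eval)


# Class counts from the two source files: 0 = good, 1 = promotional
y_all = [0] * 30279 + [1] * 23837

label, acc = majority_baseline_accuracy(y_all, y_all)
print(f"always predict {label}: accuracy = {acc:.4f}")
```

Since the splits are stratified, the training, validation, and test subsets should all give a baseline very close to this figure.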

### [Algorithm Exploration](prep/Model_General_Testing.ipynb)

To create some simple models, I train scikit-learn's `MultinomialNB` algorithm (default hyperparameters) on two vectorized instances of the corpus: one using `CountVectorizer`, a simple bag-of-words approach, and the other using `TfidfVectorizer`, a term importance approach. Accuracy and $F_1$ scores for each algorithm-vectorizer combination, computed with five-fold cross-validation, are displayed alongside the corresponding training scores to show how overfit a given model is. [This notebook](prep/Model_General_Testing.ipynb) describes this process in more detail.

In [None]:
mnb_count_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])
                          
mnb_tfidf_pipe = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])         
                          
# Fit the simple models on the training portion of the validation split
mnb_count_pipe.fit(X_tr_val, y_tr_val)
mnb_tfidf_pipe.fit(X_tr_val, y_tr_val)

In [None]:
mnb_count_model = ModelForScoring(mnb_count_pipe, 'mnb_cvec', X_val, y_val)

In [None]:
mnb_tfidf_model = ModelForScoring(mnb_tfidf_pipe, 'mnb_tfidf', X_val, y_val)
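
To illustrate the difference between the two vectorizers compared above, here is a hand-rolled sketch of bag-of-words counts next to TF-IDF weights. It uses the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is the default behavior described in scikit-learn's documentation. Tokenization is a plain whitespace split, so the numbers are not guaranteed to match `TfidfVectorizer` on real text. The three documents are made up.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Bag-of-words: raw term counts per document."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and per-document L2 normalization."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        raw = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


docs = [
    "the company offers award winning service",
    "the species is a fossil primate",
    "the tour was a concert tour",
]

counts = count_vectors(docs)
weights = tfidf_vectors(docs)

# "the" and "company" both occur once in the first document...
print(counts[0]["the"], counts[0]["company"])
# ...but TF-IDF gives the rarer word the larger weight
print(round(weights[0]["the"], 3), round(weights[0]["company"], 3))
```

A word that appears in every document carries little information about class membership, so down-weighting it relative to rarer terms is the main thing TF-IDF adds over raw counts.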

### Model Selection

[This notebook](prep/Model_Tuning.ipynb) describes the tuning process that followed model selection in more detail. The gradient boosting algorithm `XGBoost` performed best on validation data, with an accuracy score of **0.945**.

In [13]:
with open('models/xgb_model.sav', 'rb') as f:
    final_model = pickle.load(f)

In [14]:
final_model.fit(X_tr_val, y_tr_val)

model = ModelForScoring(final_model, 'xgb_tfidf', X_val, y_val)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


CV Results
Accuracy
--------------------------------
Training accuracy: 1.000
Test accuracy:     0.945
F-1 Score
--------------------------------
Training F1 score: 1.000
Test F1 score:     0.944


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.3min finished


## Results

![img](images/acc_scores_bar.png)

---

![img](images/f1_scores_bar.png)

### Performance on Unseen Data

<!--- bar graph of dummy classifier, simple model, better models, best model - acc and F1 -->

In [15]:
final_model.fit(X_train, y_train)

final_model.score(X_test, y_test)

0.9559464853278143

In [17]:
print(classification_report(y_test, final_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96      7570
           1       0.95      0.94      0.95      5959

    accuracy                           0.96     13529
   macro avg       0.96      0.95      0.96     13529
weighted avg       0.96      0.96      0.96     13529



After allowing the model to see a larger training set, final accuracy and macro $F_1$ scores on unseen data were both **`0.96`**. Considering that the baseline performance was only about `0.56` / 56% accuracy (the majority-class share of the data), I'm satisfied with this result, to say the least.
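
For readers unfamiliar with how the figures in the classification report relate to one another, the sketch below computes per-class precision, recall, and $F_1$, then averages the $F_1$ scores without weighting to obtain the macro score. The labels are toy values made up for illustration, not the project's predictions, and the function names are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 with `label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, lab)[2] for lab in labels]
    return sum(scores) / len(scores)


# Toy labels (made up): 0 = good, 1 = promotional
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

for lab in (0, 1):
    p, r, f = per_class_scores(y_true, y_pred, lab)
    print(f"class {lab}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

The macro average treats both classes equally regardless of their support, which is a reasonable choice here because neither class is rare.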

## Deployment

You can view the code that deploys the Streamlit application [here](app_testing.py).

![img](images/wiki_app.gif)

## Conclusions

1. **Integrate the model/application into existing Wikipedia UI:** This could immediately assist site visitors who are curious about the verifiability or quality of an article they might be reading.

2. **Auto-classify articles for expedited review:** Create *preliminary classifications* for unlabeled articles, which can later be reviewed by Wikipedia contributors and editors.

### Next Steps

![img](images/citation.png)

*(Image courtesy of [Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/1/18/%22Citation_needed%22.jpg))*

- **Collaborate with administrators of [Fandom](https://www.fandom.com/explore) wikis to gather new, robust training data:** Articles containing in-depth information on a wider variety of subjects - especially niche ones - than Wikipedia offers might help the model's performance on unseen data.
- **Engineer additional features:** The number of citations in an article, for instance, might be a relevant predictor of the article's objective quality. The bar charts displaying the disparate median word count between "good" and "promotional" articles also present a strong case for further examining word count as a predictive feature.
- **Explore new vectorization and modeling techniques:** Methods like *word embedding* and incorporating neural network algorithms might improve model performance, especially on less "clean" data than Wikipedia provides.
- **Refine application to accept URLs as input:** Including options to input **either** raw text or a URL into the model for classification would further streamline the user experience.
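
As a starting point for the URL-input idea in the list above, the sketch below shows one way the application could tell a Wikipedia article URL apart from pasted text. This is an assumption about how such a feature might look, not code from the deployed app: `parse_user_input` is a hypothetical helper, and actually fetching the article text is left out.

```python
from urllib.parse import unquote, urlparse


def parse_user_input(user_input):
    """Return ('url', article_title) for a Wikipedia article URL,
    otherwise ('text', cleaned_input)."""
    candidate = user_input.strip()
    parsed = urlparse(candidate)
    is_wiki_article = (
        parsed.scheme in ("http", "https")
        and parsed.netloc.endswith("wikipedia.org")
        and parsed.path.startswith("/wiki/")
    )
    if is_wiki_article:
        # Decode percent-escapes and underscores into a readable title
        title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
        return "url", title
    return "text", candidate


print(parse_user_input("https://en.wikipedia.org/wiki/Vandalism_on_Wikipedia"))
print(parse_user_input("1E is a privately owned IT software company"))
```

Anything that does not look like an article URL falls through to the existing raw-text path, so the current behavior of the app would stay unchanged for pasted passages.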

## Citations and Further Reading

### Internal Links

- The [GitHub repository](https://github.com/toastdeini/Wikipedia-article-quality) for this project.
- The [raw dataset](https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles) used for this project, hosted on Kaggle.
