# Predicting Wikipedia Article Quality With Natural Language Processing

![img](images/tomes.jpg)

*(photo courtesy of Dmitrij Paskevic, hosted on [Unsplash](https://unsplash.com/photos/YjVa-F9P9kk))*

### Author

> **Luke Dowker** ([GitHub](https://github.com/toastdeini) | [LinkedIn](https://www.linkedin.com/in/luke-dowker/) | [Email](mailto:lhdowker@gmail.com))

## Overview

## Business Problem

## Data

### Libraries, Packages, and Scripts

In [3]:
# Data comprehension/manipulation
import pandas as pd
import numpy as np

# Loading in stored objects
import pickle

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning packages
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Natural language processing
import nltk

# etc.
import os
import sys
module_path = os.path.abspath(os.pardir)
if module_path not in sys.path:
    sys.path.append(module_path)
    
# Custom/helper functions
from src.parse_it import *
from src.modeling import *
from src.EDA import *

### Load in Data

Full exploratory data analysis for this project can be found in [a separate notebook](prep/Exploratory_Analysis.ipynb); this final notebook contains the most notable parts of that analysis, with truncated/imported code whenever possible.

The data is stored in two separate `.csv` files: one contains articles marked as "good," while the other contains articles marked as "promotional," along with flags for various "promotional" subclasses.

In [13]:
df_good = pd.read_csv('../data/good.csv')
df_promo = pd.read_csv('../data/promotional.csv')

### Data Exploration and Preparation

In [14]:
print(f"{df_good.shape[0]} documents in this file/dataset:") 
df_good.head(5)

30279 documents in this file/dataset:


| | text | url |
|---|---|---|
| 0 | Nycticebus linglom is a fossil strepsirrhine p... | https://en.wikipedia.org/wiki/%3F%20Nycticebus... |
| 1 | Oryzomys pliocaenicus is a fossil rodent from ... | https://en.wikipedia.org/wiki/%3F%20Oryzomys%2... |
| 2 | .hack dt hk is a series of single player actio... | https://en.wikipedia.org/wiki/.hack%20%28video... |
| 3 | The You Drive Me Crazy Tour was the second con... | https://en.wikipedia.org/wiki/%28You%20Drive%2... |
| 4 | 0 8 4 is the second episode of the first seaso... | https://en.wikipedia.org/wiki/0-8-4 |


In [15]:
print(f"{df_promo.shape[0]} documents in this file/dataset:") 
df_promo.head(5)

23837 documents in this file/dataset:


| | text | advert | coi | fanpov | pr | resume | url |
|---|---|---|---|---|---|---|---|
| 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 0 | 0 | 1 | 0 | 0 | https://en.wikipedia.org/wiki/1%20Litre%20no%2... |
| 1 | 1DayLater was free, web based software that wa... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1DayLater |
| 2 | 1E is a privately owned IT software and servic... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1E |
| 3 | 1Malaysia pronounced One Malaysia in English a... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1Malaysia |
| 4 | The Jerusalem Biennale, as stated on the Bienn... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1st%20Jerusalem%... |


Curious about the distribution of subclasses in `df_promo` - that is, in what way can each article's promotional tone *best be described?* - I plotted the value counts of its five binary flag columns, defined as follows:

- **Advertisement-like** / `advert` - The article reads like an advertisement for a company, a product, or an organization, or is otherwise an advertisement "masquerading" as a legitimate article.
- **Conflict of interest** / `coi` - There appears to be a conflict of interest between the subject of the article and the author of the article, which "undermines public confidence" in Wikipedia.
- **Fan's point of view** / `fanpov` - The article appears to have been written by a fan or admirer of the subject, rather than from a neutral point of view.
- **News article/press release-like** / `pr` - The article reads like a news article, i.e. "the article may not be promotional or overly-negative, but is still unencyclopedic in tone."
- **Résumé-like** / `resume` - The article reads like a résumé or CV.

---

![img](images/promo_dist.png)

---

The imbalance of promotional subclasses in the dataset, along with the overlap in semantic meaning between the subclass descriptions - *are these not all 'promotional' in some form or another?* - shaped my decision to approach this project as a **binary classification problem**.
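As a rough illustration of the imbalance and overlap described above, the sketch below tallies the subclass flags for just the five `df_promo` rows previewed earlier, not the full dataset. The names `toy_promo` and `subclass_cols` are introduced here for illustration only.

```python
import pandas as pd

# The five rows shown in the df_promo preview above (flag columns only).
subclass_cols = ["advert", "coi", "fanpov", "pr", "resume"]
toy_promo = pd.DataFrame(
    [
        [0, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
    ],
    columns=subclass_cols,
)

# Number of articles carrying each flag (a 1 means the flag is set).
flag_counts = toy_promo[subclass_cols].sum().sort_values(ascending=False)

# Articles tagged with more than one subclass, i.e. overlapping labels.
multi_flag = int((toy_promo[subclass_cols].sum(axis=1) > 1).sum())

print(flag_counts)
print(f"{multi_flag} article(s) carry more than one flag")
```

Even in this tiny sample, `advert` dominates and one article carries two flags at once, which is the pattern that makes a single binary label attractive.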

In [16]:
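# Keep only the article text; .copy() avoids a SettingWithCopyWarning
# when the label column is added to each frame
df_good = df_good[['text']].copy()
df_promo = df_promo[['text']].copy()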

In [17]:
# Assigning 'good' articles the label 0, a 'falsy' value,
# indicating that the article is NOT promotional

df_good['label'] = 0
df_good.head(1)

| | text | label |
|---|---|---|
| 0 | Nycticebus linglom is a fossil strepsirrhine p... | 0 |


In [18]:
# Assigning 'promotional' articles the label 1, a 'truthy'
# value, indicating that the article IS promotional

df_promo['label'] = 1
df_promo.head(1)

| | text | label |
|---|---|---|
| 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 1 |


In [20]:
# Concatenate dataframes - ignore_index=True rebuilds a
# continuous index so `df_promo` rows don't restart at 0
# (DataFrame.append was removed in pandas 2.0)

df = pd.concat([df_good, df_promo],
               ignore_index=True)


In [11]:
# Shuffle dataframe for randomness, etc.

df = df.sample(frac=1).reset_index(drop=True)

![img](images/median_word_count.png)

In [11]:
compare_mean_counts(df, 'text', 'label', 0, 1)

Mean document length, label 0: 2629 words, 15649 characters.
Mean document length, label 1: 768 words, 4771 characters.


(2628.5815581756333, 768.2722658052608)
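`compare_mean_counts` is defined in `src/EDA.py` and is not reproduced in this notebook. The sketch below shows one way a comparison like this could be computed; `mean_lengths` is a hypothetical helper, and splitting on whitespace is an assumption about how words are counted.

```python
import pandas as pd


def mean_lengths(df, text_col, label_col):
    """Return mean word and character counts per label (hypothetical helper)."""
    lengths = pd.DataFrame({
        "label": df[label_col],
        # Whitespace tokenization: an assumption, not the project's method.
        "words": df[text_col].str.split().str.len(),
        "chars": df[text_col].str.len(),
    })
    return lengths.groupby("label")[["words", "chars"]].mean()


# Invented toy documents, two labelled 0 and one labelled 1.
toy = pd.DataFrame({
    "text": ["a fossil rodent", "a fossil primate species", "buy now"],
    "label": [0, 0, 1],
})
summary = mean_lengths(toy, "text", "label")
print(summary)
```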

The following cell is one of the most computationally expensive tasks in this whole project - it applies the custom function `parse_doc` (see the [.py file](src/parse_it.py) for further information) to the content of the `text` column for each row in the dataframe, i.e. each document in the corpus. The result is a string of **lemmas**, the approximate morphological roots of each word in the document, stored in a new column called `text_lem`. This helps standardize the text for vectorization and, eventually, modeling.

The following two cells only need to be run if you're interested in completely reproducing the project, step by step. For convenience, the output is saved as `lemmed_combined.csv`; loading that file instead will probably save you at least an hour of computing time.
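For readers who skip the lemmatization step, the sketch below shows the general shape of the transformation: tokenize, map each token to a lemma, and rejoin the result into one string. It is not `parse_doc` (which lives in `src/parse_it.py`); the tiny `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical stand-ins for a real lemmatizer.

```python
import re

# Hypothetical lookup table standing in for a real lemmatizer's vocabulary.
TOY_LEMMAS = {
    "was": "be",
    "is": "be",
    "tours": "tour",
    "rodents": "rodent",
    "offered": "offer",
}


def toy_lemmatize(text):
    """Lowercase, tokenize on letter runs, and map tokens to toy lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in tokens)


print(toy_lemmatize("The tour was offered"))  # the tour be offer
```

In the notebook, a function of this kind is applied row by row with `df['text'].apply(...)`, which is why the step is slow on a corpus of this size.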

In [51]:
# df['text_lem'] = df['text'].apply(parse_doc) 

In [52]:
# df.to_csv('lemmed_combined.csv')

## Methods and Models

In [61]:
# Read in usable dataframe

df = pd.read_csv('../data/lemmed_combined.csv',
                 index_col=0)

# Reorder columns, keeping only the lemmatized text

df = df[['text_lem', 'label']]

In [62]:
X = df['text_lem']
y = df['label']

In [63]:
# Train-test split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=138,
                                                    stratify=y)

# Secondary train-test split for validation

X_tr_val, X_val, y_tr_val, y_val = train_test_split(X_train, y_train,
                                                    random_state=138,
                                                    stratify=y_train)
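Because `train_test_split` holds out 25% of the rows by default, splitting twice leaves roughly 56% of the corpus for fitting, 19% for validation, and 25% for final testing. The sketch below checks those proportions on synthetic labels; the array sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the corpus: 1600 rows, 900 labelled 0 and 700 labelled 1.
X = np.arange(1600)
y = np.array([0] * 900 + [1] * 700)

# First split: hold out the final test set (default test_size=0.25).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=138, stratify=y
)

# Second split: carve a validation set out of the training portion.
X_tr_val, X_val, y_tr_val, y_val = train_test_split(
    X_train, y_train, random_state=138, stratify=y_train
)

for name, part in [("fit", y_tr_val), ("validation", y_val), ("test", y_test)]:
    print(f"{name:<10} {len(part):>4} rows, {part.mean():.3f} labelled 1")
```

Stratifying both splits keeps the share of each label close to the same in all three subsets.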

### Baseline Model: `DummyClassifier`

scikit-learn's `DummyClassifier` was used as a baseline for measuring model performance.

In [38]:
dum_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('dum', DummyClassifier(strategy='most_frequent'))
])

In [30]:
dum_pipe.fit(X_tr_val, y_tr_val)

Pipeline(steps=[('cvec', CountVectorizer()),
                ('dum', DummyClassifier(strategy='most_frequent'))])

In [31]:
dum_pipe.score(X_tr_val, y_tr_val)

0.559526938239159
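With `strategy='most_frequent'`, the classifier always predicts the majority class, so its accuracy is simply that class's share of the data. Assuming no rows were dropped between loading the two files and modeling, the row counts printed earlier closely match the score above:

```python
# Row counts printed when the two CSV files were loaded.
n_good = 30279   # label 0
n_promo = 23837  # label 1

# A most-frequent baseline always predicts the larger class,
# so its accuracy equals that class's share of all rows.
majority_share = max(n_good, n_promo) / (n_good + n_promo)
print(f"Expected baseline accuracy: {majority_share:.4f}")
```

Any model worth keeping therefore needs to beat roughly 56% accuracy.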

### Algorithm Exploration

To create some simple models, I trained scikit-learn's `MultinomialNB` algorithm (default hyperparameters) on two vectorized versions of the corpus: one using `CountVectorizer`, a simple bag-of-words approach, and the other using `TfidfVectorizer`, which weights terms by how distinctive they are across documents. Accuracy and $F_1$ scores for each algorithm-vectorizer combination, computed with five-fold cross-validation, are displayed alongside the corresponding training scores to show the extent to which a given model is overfit. [This notebook](prep/Model_General_Testing.ipynb) walks through this process in more detail.
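To make the difference between the two vectorizers concrete, here is a small sketch on a made-up three-document corpus (the sentences are illustrative, not drawn from the dataset). `CountVectorizer` stores raw term counts, while `TfidfVectorizer` with default settings down-weights terms that appear in every document and scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the company offers the best product",
    "the species is a fossil rodent",
    "the tour was the second concert tour",
]

cvec = CountVectorizer()
counts = cvec.fit_transform(docs).toarray()

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()

# Raw counts: "the" appears twice in the first document.
the_col = cvec.vocabulary_["the"]
print("count of 'the' in doc 0:", counts[0, the_col])

# TF-IDF: in the second document "the" and "species" each occur once,
# but "the" is in every document, so it gets the smaller weight.
w_the = weights[1, tfidf.vocabulary_["the"]]
w_species = weights[1, tfidf.vocabulary_["species"]]
print(f"doc 1 weights -> the: {w_the:.3f}, species: {w_species:.3f}")

# Default TfidfVectorizer rows are L2-normalized.
row_norms = np.linalg.norm(weights, axis=1)
print("row norms:", row_norms.round(3))
```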

In [46]:
mnb_count_pipe = Pipeline(steps=[
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])
                          
mnb_tfidf_pipe = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])         
                          
# Fitting simple models to the training portion of the validation split
mnb_count_pipe.fit(X_tr_val, y_tr_val)
mnb_tfidf_pipe.fit(X_tr_val, y_tr_val)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB())])

In [48]:
mnb_count_model = ModelForScoring(mnb_count_pipe, 'mnb_cvec', X_val, y_val)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


CV Results
Accuracy
--------------------------------
Training accuracy: 0.955
Test accuracy:     0.890
F-1 Score
--------------------------------
Training F1 score: 0.954
Test F1 score:     0.886


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.3min finished


In [49]:
mnb_tfidf_model = ModelForScoring(mnb_tfidf_pipe, 'mnb_tfidf', X_val, y_val)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


CV Results
Accuracy
--------------------------------
Training accuracy: 0.833
Test accuracy:     0.777
F-1 Score
--------------------------------
Training F1 score: 0.818
Test F1 score:     0.750


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  2.2min finished
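`ModelForScoring` is a custom class from `src/modeling.py` whose internals are not shown here. As a hedged sketch, a report of the same shape (mean training and test accuracy and $F_1$ over five folds) could be produced with scikit-learn's `cross_validate`; the toy corpus below is invented purely so the example runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus: 10 neutral and 10 promotional-sounding documents.
neutral = [f"the species is a fossil rodent found in region {i}" for i in range(10)]
promo = [f"our award winning company offers the best product {i}" for i in range(10)]
X = neutral + promo
y = [0] * 10 + [1] * 10

pipe = Pipeline(steps=[
    ("cvec", CountVectorizer()),
    ("mnb", MultinomialNB()),
])

# Five-fold cross-validation, keeping training scores to gauge overfitting.
results = cross_validate(
    pipe, X, y, cv=5,
    scoring=("accuracy", "f1"),
    return_train_score=True,
)

for metric in ("accuracy", "f1"):
    print(f"Training {metric}: {results[f'train_{metric}'].mean():.3f}")
    print(f"Test {metric}:     {results[f'test_{metric}'].mean():.3f}")
```

A large gap between the training and test figures, as in the count-vectorized model above, is the usual sign of overfitting.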


### Model Selection

[This notebook](prep/Model_Tuning.ipynb) describes the tuning process that followed model selection in more detail.

In [4]:
#pickle.dump()
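The cell above is a placeholder. As a minimal sketch of how a fitted pipeline could be stored and reloaded with `pickle`, the example below round-trips a toy model through an in-memory buffer. In practice the buffer would be a file opened in binary mode, and the toy documents are invented.

```python
import io
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy training data, only so there is a fitted model to store.
docs = ["fossil rodent species", "best product on sale",
        "extinct primate", "buy our service"]
labels = [0, 1, 0, 1]

pipe = Pipeline(steps=[("cvec", CountVectorizer()), ("mnb", MultinomialNB())])
pipe.fit(docs, labels)

# Serialize to an in-memory buffer (stand-in for open(path, "wb")).
buffer = io.BytesIO()
pickle.dump(pipe, buffer)

# Rewind and load the pipeline back, as open(path, "rb") would allow.
buffer.seek(0)
restored = pickle.load(buffer)

new_docs = ["fossil primate species", "best service on sale"]
print(restored.predict(new_docs))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.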

## Results

![img](images/acc_scores_bar.png)

---

![img](images/f1_scores_bar.png)

### Performance on Unseen Data

<!--- bar graph of dummy classifier, simple model, better models, best model - acc and F1 -->

### Error Metrics

<!--- image of ROC curves for various algorithms --->

## Deployment

You can view the code that deploys the Streamlit application [here](app_testing.py).

## Conclusions

### Next Steps

- Use web scraping to **create new features** - number of citations in an article is worth exploring.
- **Multilabel classification** - can we determine more distinctly what kind of promotional tone is being used in a given article?
    - This may require **upsampling**.
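As one possible reading of the upsampling idea above, the sketch below resamples a minority class with replacement until it matches the majority class. The toy frame, the `flag` column, and the choice of `sklearn.utils.resample` are all assumptions for illustration, not part of the project.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 6 rows of the majority class, 2 of the minority class.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(8)],
    "flag": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["flag"] == 0]
minority = toy[toy["flag"] == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=138
)

balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["flag"].value_counts())
```

Upsampling is normally applied to the training split only, so that duplicated rows never leak into the validation or test sets.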

## Citations and Further Reading

### Internal Links

- The [GitHub repository](https://github.com/toastdeini/Wikipedia-article-quality) for this project.
- The [raw dataset](https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles) used for this project, hosted on Kaggle.

### Outside Resources & References

- etc.
- etc.