# Predicting Wikipedia Article Quality With Natural Language Processing

![img](images/tomes.jpg)

*(photo courtesy of Dmitrij Paskevic, hosted on [Unsplash](https://unsplash.com/photos/YjVa-F9P9kk))*

### Author

> **Luke Dowker** ([GitHub](https://github.com/toastdeini) | [LinkedIn](https://www.linkedin.com/in/luke-dowker/) | [Email](mailto:lhdowker@gmail.com))

## Overview

## Business Problem

## Data

### Libraries, Packages, and Scripts

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Saving models
import pickle

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning: scikit-learn and XGBoost
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# NLTK
import nltk

# etc.
import os
import sys
module_path = os.path.abspath(os.pardir)
if module_path not in sys.path:
    sys.path.append(module_path)
    
# Custom/helper functions
import src.parse_it
import src.modeling
import src.EDA

### Load in Data

The full exploratory data analysis for this project is in [a separate notebook](prep/Exploratory_Analysis.ipynb); this final notebook includes only the most important parts of that analysis.

The data is stored in two separate `.csv` files: one contains articles marked as "good," and the other contains articles marked as "promotional," with additional columns flagging the subclasses of "promotional."

In [2]:
df_good = pd.read_csv('../data/good.csv')
df_promo = pd.read_csv('../data/promotional.csv')

### Data Exploration and Preparation

In [4]:
print(f"{df_good.shape[0]} documents in this file/dataset:") 
df_good.head(5)

30279 documents in this file/dataset:


| Unnamed: 0 | text | url |
|---|---|---|
| 0 | Nycticebus linglom is a fossil strepsirrhine p... | https://en.wikipedia.org/wiki/%3F%20Nycticebus... |
| 1 | Oryzomys pliocaenicus is a fossil rodent from ... | https://en.wikipedia.org/wiki/%3F%20Oryzomys%2... |
| 2 | .hack dt hk is a series of single player actio... | https://en.wikipedia.org/wiki/.hack%20%28video... |
| 3 | The You Drive Me Crazy Tour was the second con... | https://en.wikipedia.org/wiki/%28You%20Drive%2... |
| 4 | 0 8 4 is the second episode of the first seaso... | https://en.wikipedia.org/wiki/0-8-4 |


In [6]:
print(f"{df_promo.shape[0]} documents in this file/dataset:") 
df_promo.head(5)

23837 documents in this file/dataset:


| Unnamed: 0 | text | advert | coi | fanpov | pr | resume | url |
|---|---|---|---|---|---|---|---|
| 0 | 1 Litre no Namida 1, lit. 1 Litre of Tears als... | 0 | 0 | 1 | 0 | 0 | https://en.wikipedia.org/wiki/1%20Litre%20no%2... |
| 1 | 1DayLater was free, web based software that wa... | 1 | 1 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1DayLater |
| 2 | 1E is a privately owned IT software and servic... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1E |
| 3 | 1Malaysia pronounced One Malaysia in English a... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1Malaysia |
| 4 | The Jerusalem Biennale, as stated on the Bienn... | 1 | 0 | 0 | 0 | 0 | https://en.wikipedia.org/wiki/1st%20Jerusalem%... |


To see how the promotional tone of the articles in `df_promo` is most often characterized, I plotted the value counts of its subclass columns. Each column is a 0/1 flag, and an article can carry more than one:

- **Advertisement-like** / `advert` - The article reads like an advertisement for a company, a product, or an organization, or is otherwise an advertisement "masquerading" as a legitimate article.
- **Conflict of interest** / `coi` - There appears to be a conflict of interest between the subject of the article and the author of the article, which "undermines public confidence" in Wikipedia.
- **Fan's point of view** / `fanpov` - The article appears to have been written by a fan or admirer of the subject, rather than from a neutral point of view.
- **News article/press release-like** / `pr` - The article reads like a news article, i.e. "the article may not be promotional or overly-negative, but is still unencyclopedic in tone."
- **Résumé-like** / `resume` - The article reads like a résumé or CV.
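
The counts behind such a plot can be sketched as follows. This is a minimal illustration rather than the notebook's actual plotting code: the helper name `subclass_counts` and the toy rows are hypothetical, and only the five column names come from the `df_promo` preview above.

```python
import pandas as pd

# Column names as they appear in the df_promo preview.
SUBCLASS_COLS = ["advert", "coi", "fanpov", "pr", "resume"]


def subclass_counts(df: pd.DataFrame, cols=SUBCLASS_COLS) -> pd.Series:
    """Count how many articles carry each promotional flag (flags are 0/1)."""
    # Summing a 0/1 column gives the number of rows where the flag is set.
    return df[cols].sum().sort_values(ascending=False)


# Toy stand-in for df_promo (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    "advert": [1, 1, 0, 1],
    "coi":    [0, 1, 0, 0],
    "fanpov": [0, 0, 1, 0],
    "pr":     [0, 0, 0, 0],
    "resume": [0, 0, 0, 0],
})

counts = subclass_counts(toy)
print(counts)
```

In the notebook, `subclass_counts(df_promo).plot.barh()` would draw these counts as a horizontal bar chart. Because one article can carry several flags, the counts need not sum to the number of rows.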

#### Binary Classification
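
One way the two files could be combined for a good-versus-promotional task is sketched below. It rests on assumptions not stated in the notebook: the helper `make_binary_frame` and the column name `label` are hypothetical, promotional articles are arbitrarily encoded as `1`, and the subclass flags are dropped so that only `text` remains as a feature.

```python
import pandas as pd


def make_binary_frame(good: pd.DataFrame, promo: pd.DataFrame) -> pd.DataFrame:
    """Stack both datasets, keeping only the text and a 0/1 label."""
    good_part = good[["text"]].assign(label=0)    # 0 = "good" (assumed encoding)
    promo_part = promo[["text"]].assign(label=1)  # 1 = "promotional" (assumed encoding)
    return pd.concat([good_part, promo_part], ignore_index=True)


# Toy stand-ins for df_good and df_promo (hypothetical rows).
toy_good = pd.DataFrame({"text": ["a neutral article", "another neutral article"]})
toy_promo = pd.DataFrame({"text": ["the best product ever"], "advert": [1]})

combined = make_binary_frame(toy_good, toy_promo)
print(combined["label"].value_counts(normalize=True))
```

Applied to the real data, the row counts printed earlier (30279 good and 23837 promotional documents) would make promotional articles about 44% of the combined set.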

## Methods and Models

## Results

## Conclusions

### Next Steps

## Citations and Further Reading

- The [GitHub repository](https://github.com/toastdeini/Wikipedia-article-quality) for this project.
- The [raw dataset](https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles) used for this project, hosted on Kaggle.