<a href="https://colab.research.google.com/github/stanfordio/wikipedia-notebook/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia Scanner

Created for INTLPOL268D at Stanford University by Team Wikipedia.

### License

Copyright Stanford University and R. Miles McCain (2020).

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

### How to Use

Enter your query in the search box to the right, and select the appropriate sort order. Selecting `incoming_links_asc` will yield less popular pages; it's worth experimenting with the options to find what works best for your investigation.
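
The default query in the next cell combines a keyword, a quoted phrase, and an `incategory:` filter. As a rough sketch, a query of that shape could be assembled as follows. `build_query` is a hypothetical helper, not part of this notebook, and it assumes category names are written with underscores, as in the default query.

```python
def build_query(keywords, phrases=(), categories=()):
    """Hypothetical helper: join search terms into one query string."""
    parts = list(keywords)
    # Wrap multi-word phrases in quotes so they are searched as a phrase.
    parts += [f'"{phrase}"' for phrase in phrases]
    # Add category filters, replacing spaces with underscores.
    parts += [f"incategory:{name.replace(' ', '_')}" for name in categories]
    return " ".join(parts)


example_query = build_query(
    ["American"], phrases=["chief of staff"], categories=["living people"]
)
print(example_query)
```

The resulting string can be pasted into the `query` field as-is.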

In [None]:
#@title Search Parameters
query = "American \"chief of staff\" incategory:living_people" #@param {type:"string"}
sort_order = 'incoming_links_asc' #@param ["create_timestamp_desc", "incoming_links_asc", "random", "none", "relevance", "last_edit_desc"]

## Setup

In [None]:
%pip install pandas numpy requests seaborn matplotlib tqdm
import requests
import functools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
from datetime import datetime, timedelta
from dateutil.parser import parse as dateparse
# tqdm.auto picks a notebook-friendly progress bar when one is available
from tqdm.auto import tqdm

In [None]:
@functools.lru_cache()
def __cached_search(query, sort):
  """Cached search request; `sort` is part of the cache key."""

  # Any way to get recent number of edits?
  PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "flagged|info|pageprops|description",
    "generator": "search",
    "redirects": 1,
    "converttitles": 1,
    "pvipmetric": "pageviews",
    "gsrsearch": query,
    "gsrsort": sort,
    "gsrlimit": 100,
  }

  return requests.get("https://en.wikipedia.org/w/api.php", params=PARAMS).json()

def __search_wikipedia(query):
  """Internal function for searching Wikipedia and getting the raw results"""

  # Pass sort_order explicitly so that changing it is not answered from a stale cache entry.
  return __cached_search(query, sort_order)

@functools.lru_cache()
def __page_metadata(page_name):
  """Internal function for getting the revision history and pageviews of an individual page."""

  PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "flagged|info|pageassessments|pageprops|revisions|pageviews",
    "titles": page_name,
    "redirects": 1,
    "rvlimit": "max",
    "converttitles": 1,
  }

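  resp = requests.get("https://en.wikipedia.org/w/api.php", params=PARAMS).json()["query"]["pages"].values()
  page_data = list(resp)[0]

  # Not every page comes back with revision or pageview data, so avoid a KeyError here.
  metadata = {"revisions": page_data.get("revisions", [])}
  if "pageviews" in page_data:
    metadata["pageviews"] = page_data["pageviews"]
  return metadata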
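
To check what the search request looks like before sending it, the parameters can be URL-encoded locally. This is a sketch only: `build_search_url` is a hypothetical helper, it uses the standard library's `urllib.parse` instead of `requests`, and it includes just a subset of the parameters used above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(query, sort_order, limit=100):
    """Hypothetical helper: build the search URL without sending it."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": query,
        "gsrsort": sort_order,
        "gsrlimit": limit,
    }
    return f"{API_URL}?{urlencode(params)}"


url = build_search_url('"chief of staff"', "incoming_links_asc")
# Decode the query string again to confirm the values survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["gsrsearch"], decoded["gsrsort"])
```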

In [None]:
def search(query):
  print("Searching for pages...")
  results = __search_wikipedia(query)
  pages = results["query"]["pages"].values()

  print("Loading revision histories...")
  # Start of the "recent" window (UTC), computed once rather than per revision
  cutoff = datetime.utcnow() - timedelta(days=30)
  for page in tqdm(pages):
    # Get number of recent edits, pageviews
    page_metadata = __page_metadata(page["title"])
    page["revisions"] = page_metadata["revisions"]
    page["recent_revisions"] = 0
    for revision in page["revisions"]:
      if dateparse(revision["timestamp"]).replace(tzinfo=None) > cutoff:
        page["recent_revisions"] += 1

    if "pageviews" in page_metadata:
      page["pageview_avg"] = statistics.mean([value if value is not None else 0 for value in page_metadata["pageviews"].values()])

    page["link"] = f"https://en.wikipedia.org/?curid={page['pageid']}"

  return pd.DataFrame(data=pages)
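
The recent-revision count in `search` compares each revision timestamp against a 30-day cutoff. The sketch below isolates that step using timezone-aware datetimes. `count_recent` is a hypothetical helper, and it assumes timestamps arrive in the `YYYY-MM-DDTHH:MM:SSZ` form.

```python
from datetime import datetime, timedelta, timezone


def count_recent(timestamps, now, days=30):
    """Hypothetical helper: count timestamps newer than `now - days`."""
    cutoff = now - timedelta(days=days)
    count = 0
    for stamp in timestamps:
        parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        # The trailing "Z" means UTC, so attach that timezone explicitly.
        if parsed.replace(tzinfo=timezone.utc) > cutoff:
            count += 1
    return count


# Made-up timestamps and a fixed "now" so the result is reproducible.
now = datetime(2020, 12, 1, tzinfo=timezone.utc)
stamps = ["2020-11-25T08:00:00Z", "2020-11-02T08:00:00Z", "2020-09-14T08:00:00Z"]
print(count_recent(stamps, now))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the count easy to check against known data.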

## Querying, Loading, and Processing Data

In [None]:
results = search(query)
results

In [None]:
results["revisions_per_pageview"] = results.apply(lambda k: k["recent_revisions"] / max(1, k["pageview_avg"]), axis=1)

In [None]:
results["length_per_pageview"] = results.apply(lambda k: k["length"] / max(1, k["pageview_avg"]), axis=1)

In [None]:
results["editors_per_revision"] = results.apply(lambda k: len({rev.get("user") for rev in k["revisions"]}) / max(1, k["recent_revisions"]), axis=1)
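
Each ratio above clamps its denominator with `max(1, ...)`. The sketch below shows what that clamp does for small and missing values. `per_pageview` is a hypothetical stand-in for the lambdas, not a function defined in this notebook.

```python
def per_pageview(value, pageview_avg):
    """Hypothetical stand-in for the ratio lambdas used above."""
    # Clamp the denominator to at least 1 to avoid division by zero.
    return value / max(1, pageview_avg)


print(per_pageview(10, 5))             # ordinary case
print(per_pageview(10, 0.5))           # averages below 1 are treated as 1
print(per_pageview(10, float("nan")))  # NaN never compares greater, so max() keeps 1
```

One consequence is that a page with no pageview data gets the same denominator as a page averaging one view per day, so its ratios deserve a second look.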

## Analysis

### General Distributions

This table provides a general overview of the data pulled from Wikipedia. Remember that the pages found are _not_ representative of the search query; they are influenced by the chosen `sort_order`!

In [None]:
results.describe()

### Length Distribution

This chart shows the distribution of page lengths (in bytes, as reported by Wikipedia).

In [None]:
sns.displot(results["length"])

### Revision Distribution

This chart shows the general distribution of the number of recent revisions (past 30 days).

In [None]:
sns.displot(results["recent_revisions"])

### Pageview Distribution

This chart shows the distribution of the pages' average daily pageviews. The number of days averaged over is determined by Wikipedia, but it is the same for every page in the results, so the values are comparable with one another.

In [None]:
sns.displot(results["pageview_avg"])
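
The `pageview_avg` column plotted here is computed in `search` as the mean of the daily counts, with missing days counted as zero. A minimal sketch of that calculation, using made-up dates and counts:

```python
import statistics

# Made-up daily counts; None marks a day with no data.
daily_views = {"2020-11-01": 12, "2020-11-02": None, "2020-11-03": 18}

# Replace missing days with zero, then average across all days.
filled = [count if count is not None else 0 for count in daily_views.values()]
pageview_avg = statistics.mean(filled)
print(pageview_avg)
```

Counting missing days as zero pulls the average down, which matters most for pages with sparse data.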

### Relationship between pageviews and revisions

This chart shows the general relationship between pageviews and number of recent revisions. It can help reveal outliers (pages with significantly higher ratios of revisions to pageviews are notable).

In [None]:
sns.scatterplot(data=results, x="pageview_avg", y="recent_revisions")
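
Outliers can also be shortlisted numerically instead of by eye. The sketch below flags pages whose revisions-per-pageview ratio is several times the median. The `flag_outliers` helper, the factor of 3, and the sample rows are all illustrative assumptions rather than part of this notebook's method.

```python
import statistics


def flag_outliers(pages, factor=3):
    """Hypothetical helper: titles whose ratio exceeds factor x median."""
    ratios = {
        page["title"]: page["recent_revisions"] / max(1, page["pageview_avg"])
        for page in pages
    }
    threshold = factor * statistics.median(ratios.values())
    return [title for title, ratio in ratios.items() if ratio > threshold]


# Made-up rows using the same field names as the results table.
sample = [
    {"title": "A", "recent_revisions": 2, "pageview_avg": 100},
    {"title": "B", "recent_revisions": 3, "pageview_avg": 150},
    {"title": "C", "recent_revisions": 40, "pageview_avg": 50},
]
print(flag_outliers(sample))
```

A median-based threshold is used here because a single extreme page shifts the median far less than it shifts the mean.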

### Relationship between pageviews and length

This chart shows the general relationship between pageviews and page length. It can help reveal outliers (pages with significantly higher ratios of length to pageviews are notable).

In [None]:
sns.scatterplot(data=results, x="pageview_avg", y="length")

### Worth Checking Manually

#### Highest edits-to-pageview ratio

In [None]:
results.sort_values("revisions_per_pageview", ascending=False).head()

#### Highest length-per-pageview ratio

In [None]:
results.sort_values("length_per_pageview", ascending=False).head()

#### Lowest editors per revision

In [None]:
results.sort_values("editors_per_revision", ascending=True).head()

#### Shortest length

In [None]:
results.sort_values("length", ascending=True).head()

#### Fewest pageviews

In [None]:
results.sort_values("pageview_avg", ascending=True).head()