Parsing wiki pages from a snapshot may introduce errors. This notebook helps quantify those errors
by comparing the versions parsed from the snapshot with the latest HTML version retrieved from the
Wikipedia API.

Sources of errors:
1. The snapshot is old and out of sync with the latest Wikipedia page
2. Templates are not expanded in the snapshot

... and possibly other causes that are not yet known

This notebook uses the Wikipedia API to retrieve the latest HTML version of each page and compares
quantities of interest (internal links, category labels, etc.) between the snapshot and the latest
version on a sample of pages to estimate the error rate.
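
The comparison described above comes down to set differences between the links (or categories) parsed locally and those reported by the API. A minimal sketch follows; the function name `link_set_defects` and the example titles are illustrative assumptions, not part of the notebook's parser.

```python
def link_set_defects(local_links, api_links):
    """Compare two collections of link targets and summarise the mismatch."""
    local, api = set(local_links), set(api_links)
    missing_in_local = api - local   # reported by the API, absent from the snapshot parse
    missing_in_api = local - api     # found in the snapshot parse, absent from the API
    return {
        "exact_match": local == api,
        "frac_api_missing_in_local": len(missing_in_local) / len(api) if api else 0.0,
        "frac_local_missing_in_api": len(missing_in_api) / len(local) if local else 0.0,
    }

# Hypothetical link sets for a single page
example = link_set_defects(
    local_links=["Input device", "Cockpit", "Side-stick"],
    api_links=["Input device", "Cockpit", "Centre stick", "Analog stick"],
)
print(example)
```

Titles would probably need normalising (underscores vs. spaces, first-letter case) before such a comparison is meaningful.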

In [28]:
import requests
import importlib
import json, time, random, numpy as np
import src.wiki_parser as wiki_parser

_ = importlib.reload(wiki_parser)

In [None]:
"""
Methodology:
1. Get a sample of 10000 pages from processed summaries
2. Shuffle them randomly
3. Process them one by one to compute the defect rate for a prefix of length i
    - call the Wikipedia API only when needed (and cache the results)
    - fetch full text in batches of 1000
4. Progressively compute the following quantities on the prefix of length i:
    - Share of category pages, redirects, normal pages
    - Fraction of category pages whose parent category set doesn't match the parent category set from the Wikipedia API
        - broken down by whether the page contains a template or not
    - Fraction of category-to-parent links in the Wikipedia API that are missing in the local data
    - Fraction of category-to-parent links in the local data that are missing in the Wikipedia API
    - Fraction of normal pages whose internal links don't exactly match the internal links from the Wikipedia API
    - Fraction of internal links in the Wikipedia API that are missing in the local data
    - Fraction of internal links in the local data that are missing in the Wikipedia API
5. Plot the above quantities as a function of i and break the loop when the defect rate appears to have stabilized
"""
_ = 1
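
Steps 3 to 5 of the methodology can be sketched as a running mean over per-page defect flags. This is only an illustration: the flags below are simulated, and the `window` and `tol` values in the stopping check are hypothetical choices rather than values taken from the notebook.

```python
import numpy as np

def running_defect_rate(defect_flags):
    """Defect rate over each prefix of length i = 1..n of the shuffled sample."""
    flags = np.asarray(defect_flags, dtype=float)
    prefix_lengths = np.arange(1, len(flags) + 1)
    return np.cumsum(flags) / prefix_lengths

def has_stabilised(rates, window=1000, tol=0.005):
    """Crude check: the rate moved by less than `tol` over the last `window` pages."""
    if len(rates) < window:
        return False
    tail = np.asarray(rates[-window:])
    return float(tail.max() - tail.min()) < tol

# Simulated flags: True means the page's local data disagrees with the API
rng = np.random.default_rng(0)
simulated_flags = rng.random(10_000) < 0.1
rates = running_defect_rate(simulated_flags)
print(rates[-1], has_stabilised(rates))
```

Plotting `rates` against the prefix length gives the curve described in step 5.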

In [12]:
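# a raw string cannot end in a single backslash, so the trailing separator is appended separately
data_root_dir = r'C:\Users\mohitvyas\MyDesktop\WikipediaDataset\data' + '\\'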

In [24]:
# Sample roughly N random pages from the processed summaries for defect-rate measurement.
# Each line is kept independently with probability N / 2e7 (this assumes about 2e7 lines
# in total), so the final sample size is only approximately N.
N = 10000
selection_probability = N / 2e7
sampled_pages = []
silent = False
processed_line_count = 0
start_time = time.time()

# draw random numbers in batches and refill the batch once it is used up
batch_size = 100000
random_numbers, curr_idx = np.random.rand(batch_size), 0

for i in range(10):
    with open(data_root_dir + f'processed_summaries/part-{i}.txt', 'r') as f:
        for line in f:
            if curr_idx == batch_size:
                random_numbers, curr_idx = np.random.rand(batch_size), 0
            if random_numbers[curr_idx] < selection_probability:
                sampled_pages.append(json.loads(line))
            processed_line_count += 1
            curr_idx += 1
            if processed_line_count % 1000000 == 0 and not silent:
                print(f"Processed {processed_line_count} lines in {(time.time() - start_time) / 60} minutes")

    print(f"Finished part {i} in {(time.time() - start_time) / 60} minutes")

print(f"Final sample count: {len(sampled_pages)}")

Processed 1000000 lines in 0.15629770755767822 minutes
Processed 2000000 lines in 0.28094823757807413 minutes
Finished part 0 in 0.30155783891677856 minutes
Processed 3000000 lines in 0.3926133910814921 minutes
Processed 4000000 lines in 0.4932368715604146 minutes
Finished part 1 in 0.5140209754308065 minutes
Processed 5000000 lines in 0.5859047253926595 minutes
Processed 6000000 lines in 0.6896283507347107 minutes
Finished part 2 in 0.714378555615743 minutes
Processed 7000000 lines in 0.7894844849904378 minutes
Processed 8000000 lines in 0.8782358765602112 minutes
Finished part 3 in 0.9057334780693054 minutes
Processed 9000000 lines in 0.9764098803202311 minutes
Processed 10000000 lines in 1.0799196163813274 minutes
Finished part 4 in 1.1055760304133098 minutes
Processed 11000000 lines in 1.1746357083320618 minutes
Processed 12000000 lines in 1.2679887016614277 minutes
Finished part 5 in 1.3032696843147278 minutes
Processed 13000000 lines in 1.3661518812179565 minutes
Processed 140000
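
The sampling cell above keeps each line independently with a fixed probability, so the sample size depends on how close the 2e7 estimate is to the real line count. If an exact sample size mattered, reservoir sampling would be one alternative. The sketch below is not what the notebook does, and `reservoir_sample` is a hypothetical helper.

```python
import random

def reservoir_sample(lines, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, line in enumerate(lines):
        if n < k:
            # fill the reservoir with the first k items
            sample.append(line)
        else:
            # item n replaces a random slot with probability k / (n + 1)
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample

picked = reservoir_sample(range(1000), 10, random.Random(0))
print(len(picked))
```

The same function could be fed the lines of all `part-{i}.txt` files chained together, at the cost of one random draw per line.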

In [26]:
# load full page contents for these samples

In [29]:
# load index
snapshot_name = 'enwiki-20240501-pages-articles-multistream'

start_time = time.time()
parser = wiki_parser.WikiParser(data_root_dir + snapshot_name + "-index.txt.bz2",
                                data_root_dir + snapshot_name + ".xml.bz2")
print(f"Time taken to load the index: {time.time() - start_time} seconds")

Time taken to load the index: 227.78338074684143 seconds


In [36]:
# fetch full raw data for sampled pages
sampled_ids = [page['page_id'] for page in sampled_pages]
page_id_to_full_raw_data = {}

for page in parser.page_stream(sampled_ids, include_text=True):
    page_id_to_full_raw_data[page['page_id']] = page

In [33]:
page_id_to_full_raw_data[sampled_ids[4]]

{'page_id': 16229,
 'title': 'Joystick',
 'redirect_title': None,
 'namespace': 0,
 'text': '{{Short description|Control lever used in aircraft and video games}}\n{{Redirect|Control stick|the joystick often called a control stick in many controllers|Analog stick}}\n{{Other uses}}\n\n[[Image:Joyopis.svg|right|thumb|Possible elements of a video game joystick: 1.&nbsp;stick, 2.&nbsp;base, 3.&nbsp;trigger, 4.&nbsp;extra buttons, 5.&nbsp;autofire switch, 6.&nbsp;throttle, 7.&nbsp;[[#Hat switch|hat switch (POV hat)]], 8.&nbsp;suction cups.]]\n\nA \'\'\'joystick\'\'\', sometimes called a \'\'\'flight stick\'\'\', is an [[input device]] consisting of a stick that pivots on a base and reports its angle or direction to the device it is controlling. A joystick, also known as the \'\'\'control column\'\'\', is the principal control device in the [[cockpit]] of many civilian and military aircraft, either as a [[centre stick]] or [[side-stick]]. It has various switches to control functions of the ai

In [25]:
# peek at one sampled record before counting redirects, category pages, and normal pages
sampled_pages[0]

{'categories': [],
 'internal_links': ['List of sovereign states'],
 'text_length': 5,
 'num_unique_words': 5,
 'number_of_files': 0,
 'number_of_external_links': 0,
 'number_of_info_boxes': 0,
 'number_of_sections': 1,
 'page_id': 5074,
 'title': 'CountriesH',
 'redirect_title': 'List of sovereign states',
 'namespace': 0}
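
Given records shaped like the one above, the share of redirects, category pages, and normal pages could be computed along these lines. This is a sketch under two assumptions: a non-null `redirect_title` marks a redirect, and namespace 14 (MediaWiki's category namespace) marks a category page. Every other page is lumped into "normal", including pages from other namespaces. The records other than `CountriesH` are made up for illustration.

```python
from collections import Counter

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages

def page_type(page):
    """Classify a processed-summary record as redirect, category, or normal."""
    if page.get("redirect_title") is not None:
        return "redirect"
    if page.get("namespace") == CATEGORY_NAMESPACE:
        return "category"
    return "normal"

def page_type_shares(pages):
    """Fraction of pages of each type; an empty dict for an empty sample."""
    counts = Counter(page_type(p) for p in pages)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()} if total else {}

# Illustrative records; only the fields used by page_type are included
demo_pages = [
    {"title": "CountriesH", "redirect_title": "List of sovereign states", "namespace": 0},
    {"title": "Joystick", "redirect_title": None, "namespace": 0},
    {"title": "Coriander", "redirect_title": None, "namespace": 0},
    {"title": "Category:1945 in Alabama", "redirect_title": None, "namespace": 14},
]
shares = page_type_shares(demo_pages)
print(shares)
```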

In [None]:
"""
Quantities to measure:
- fraction of category pages with 
"""

In [9]:


# https://en.wikipedia.org/w/api.php?action=parse&page=Pet_door&prop=text&formatversion=2

def fetch_expanded_wikipedia_page(page_title):
    """Return the parsed JSON response (HTML text, categories, links) for a page."""
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
        "action": "parse",
        "page": page_title,
        "prop": "text|categories|links",
        "format": "json",
        "formatversion": "2"
    }

    response = requests.get(URL, params=PARAMS, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors

    return response.json()

# Example usage
# page_title = "Category:1945 in Alabama"
page_title = "Coriander"
html_content = fetch_expanded_wikipedia_page(page_title)

# save the rendered HTML to a file; the response is a dict, so write only its HTML string
with open("tmp/tmp.html", "w", encoding="utf-8") as file:
    file.write(html_content["parse"]["text"])

In [10]:
html_content

{'parse': {'title': 'Coriander',
  'pageid': 341640,
  'text': '<div class="mw-content-ltr mw-parser-output" lang="en" dir="ltr"><div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Annual herb</div>\n<style data-mw-deduplicate="TemplateStyles:r1033289096">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div role="note" class="hatnote navigation-not-searchable">This article is about the herb. For other uses, see <a href="/wiki/Coriander_(disambiguation)" class="mw-disambig" title="Coriander (disambiguation)">Coriander (disambiguation)</a>.</div>\n<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1033289096"><div role="note" class="hatnote navigation-not-searchable">"Cilantro" redirects here. Not to be confused with the related herb <i><a href="/wiki/Eryn

In [6]:
html_content

{'parse': {'title': 'Category:1945 in Alabama',
  'pageid': 39812076,
  'text': '<div class="mw-content-ltr mw-parser-output" lang="en" dir="ltr"><style data-mw-deduplicate="TemplateStyles:r1217611005">.mw-parser-output .side-box{margin:4px 0;box-sizing:border-box;border:1px solid #aaa;font-size:88%;line-height:1.25em;background-color:#f9f9f9;display:flow-root}.mw-parser-output .side-box-abovebelow,.mw-parser-output .side-box-text{padding:0.25em 0.9em}.mw-parser-output .side-box-image{padding:2px 0 2px 0.9em;text-align:center}.mw-parser-output .side-box-imageright{padding:2px 0.9em 2px 0;text-align:center}@media(min-width:500px){.mw-parser-output .side-box-flex{display:flex;align-items:center}.mw-parser-output .side-box-text{flex:1;min-width:0}}@media(min-width:720px){.mw-parser-output .side-box{width:238px}.mw-parser-output .side-box-right{clear:right;float:right;margin-left:1em}.mw-parser-output .side-box-left{margin-right:1em}}</style><div class="side-box side-box-right plainlinks s
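
The methodology calls for hitting the API only when needed and caching the results. One possible shape for that is a small wrapper around whatever fetch function is in use. `CachedFetcher` and its optional JSON cache file are assumptions made for this sketch, and the demo uses a stand-in fetcher so that nothing touches the network.

```python
import json
from pathlib import Path

class CachedFetcher:
    """Wrap a fetch(title) function so each title is fetched at most once."""

    def __init__(self, fetch, cache_path=None):
        self.fetch = fetch
        self.cache_path = Path(cache_path) if cache_path else None
        self.cache = {}
        # reuse results from an earlier run if a cache file is present
        if self.cache_path and self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def __call__(self, title):
        if title not in self.cache:
            self.cache[title] = self.fetch(title)
        return self.cache[title]

    def save(self):
        """Persist the cache; call this now and then rather than on every miss."""
        if self.cache_path:
            self.cache_path.write_text(json.dumps(self.cache), encoding="utf-8")

# Stand-in fetcher that records which titles were actually requested
requested = []
def fake_fetch(title):
    requested.append(title)
    return {"parse": {"title": title}}

cached_fetch = CachedFetcher(fake_fetch)
cached_fetch("Coriander")
cached_fetch("Coriander")  # served from the cache
cached_fetch("Joystick")
print(requested)
```

In the notebook, `fetch_expanded_wikipedia_page` would take the place of `fake_fetch`.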

In [4]:
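from bs4 import BeautifulSoup
import pyperclip

# parse the rendered HTML (html_content is the API response dict, not a string)
soup = BeautifulSoup(html_content["parse"]["text"], "html.parser")

# extract links: titles of anchors that point at other wiki pages
# (this also picks up non-article namespaces such as File: or Category:)
links = [a["title"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("/wiki/") and a.has_attr("title")]

pyperclip.copy(soup.prettify())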
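
Because the request asks for `prop=text|categories|links`, the same response should also carry category and link lists, which may be easier to compare than links scraped from the HTML. The sketch below assumes the `formatversion=2` layout, with `category` and `title` keys on the list entries. Those key names are an assumption to check against a real response, and the response used here is hand-written.

```python
def extract_api_categories_and_links(api_response, link_namespace=0):
    """Pull category names and link titles out of an action=parse response dict."""
    parsed = api_response.get("parse", {})
    # category names come back with underscores; normalise them to spaces
    categories = {c["category"].replace("_", " ") for c in parsed.get("categories", [])}
    # keep only links into the requested namespace (0 = articles)
    links = {l["title"] for l in parsed.get("links", []) if l.get("ns") == link_namespace}
    return categories, links

# Hand-written response in the assumed formatversion=2 shape
fake_response = {
    "parse": {
        "title": "Coriander",
        "categories": [
            {"sortkey": "", "category": "Herbs"},
            {"sortkey": "", "category": "Articles_with_short_description"},
        ],
        "links": [
            {"ns": 0, "title": "Apiaceae", "exists": True},
            {"ns": 10, "title": "Template:Herbs", "exists": True},
        ],
    }
}
api_categories, api_links = extract_api_categories_and_links(fake_response)
print(api_categories, api_links)
```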