# Downloading and preprocessing Wikipedia data

Download a Wikipedia XML dump, decompress it, and keep only the articles whose geotag falls inside the bounding box given by `LAT_RANGE` and `LONG_RANGE`. The run recorded here uses `REGION_NAME = "NY"`.
This notebook was heavily inspired by [this notebook](https://github.com/WillKoehrsen/wikipedia-data-science/blob/master/notebooks/Downloading%20and%20Parsing%20Wikipedia%20Articles.ipynb).

Link to dumps: https://dumps.wikimedia.org/enwiki/

### Import packages

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import bz2
import subprocess
import io
import re
import gc
import json
from multiprocessing import Pool, set_start_method
from itertools import chain
from functools import partial
import ssl
# Workaround: disables HTTPS certificate verification for the whole process
ssl._create_default_https_context = ssl._create_unverified_context
from timeit import default_timer as timer
import warnings
warnings.filterwarnings(action="ignore")

import requests                    # make http requests
import xml.sax                     # parse xml
import mwparserfromhell            # parse wikimedia
import pandas as pd                # data processing
from bs4 import BeautifulSoup      # parsing HTML
from tqdm.notebook import tqdm     # progress bars

import multiprocessor_wiki         # multiprocessing
import data_processor              # parsing coordinates and text
from wiki_utils import *           # general utilities

Define constants.

- ``PATH``: Path to the base data folder.
- ``DUMP``: Which Wikipedia dump to use.
- ``REGION_NAME``: Name of the selected region.
- ``CPU_CORES``: Number of CPU cores to use; defaults to all but one.
- ``LAT_RANGE``: Latitude range of venues to consider.
- ``LONG_RANGE``: Longitude range of venues to consider.
- ``ALSO_SUMMARY``: Whether to additionally save the summary of each article on its own.

In [3]:
PATH = "C:\\Users\\Tim\\.keras\\datasets\\wikipedia_rss\\"
DUMP = "20211201"
REGION_NAME = "NY"
CPU_CORES = os.cpu_count() - 1
LAT_RANGE = [40.150, 41.250]
LONG_RANGE = [-74.750, -73.150]
ALSO_SUMMARY = True
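
The latitude and longitude ranges describe a rectangular bounding box. A minimal sketch of the membership test they imply is shown below; `in_bounding_box` is a hypothetical helper written for illustration and is not taken from `data_processor` or `multiprocessor_wiki`.

```python
def in_bounding_box(lat, lon, lat_range, long_range):
    """Return True if (lat, lon) lies inside the inclusive bounding box."""
    lat_min, lat_max = sorted(lat_range)
    lon_min, lon_max = sorted(long_range)
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Values copied from the constants cell above.
LAT_RANGE = [40.150, 41.250]
LONG_RANGE = [-74.750, -73.150]

# Approximate coordinates of Midtown Manhattan: inside the box.
print(in_bounding_box(40.75, -73.99, LAT_RANGE, LONG_RANGE))
# Coordinates of Northfield, MA: outside the box.
print(in_bounding_box(42.70286, -72.4465, LAT_RANGE, LONG_RANGE))
```

Sorting each range makes the check independent of the order in which the two bounds are written.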

Show list of dumps.

In [4]:
base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Collect every link target on the index page; besides the dated dump
# folders this also returns '../' and 'latest/'
dumps = [a['href'] for a in soup_index.find_all('a') if a.has_attr('href')]
dumps

['../',
 '20211020/',
 '20211101/',
 '20211120/',
 '20211201/',
 '20211220/',
 '20220101/',
 '20220120/',
 'latest/']
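
The list above still contains `'../'` and `'latest/'`. If only the dated dumps are wanted, a regular expression can filter them. This is a sketch that assumes a dump folder is always eight digits followed by a slash; `DATE_DIR` and `dump_dates` are names introduced here for illustration.

```python
import re

# Sample copied from the output above.
hrefs = ['../', '20211020/', '20211101/', '20211120/', '20211201/',
         '20211220/', '20220101/', '20220120/', 'latest/']

# Assumption: a dump folder is exactly eight digits followed by a slash.
DATE_DIR = re.compile(r"\d{8}/")

# Keep only the dated folders and drop the trailing slash.
dump_dates = [h.rstrip('/') for h in hrefs if DATE_DIR.fullmatch(h)]
print(dump_dates)
```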

In [5]:
dump_url = base_url + DUMP + '/'

# Retrieve the html
dump_html = requests.get(dump_url).text

In [6]:
# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find the first four li elements with the class 'file'
soup_dump.find_all('li', {'class': 'file'}, limit=4)

[<li class="file"><a href="/enwiki/20211201/enwiki-20211201-pages-articles-multistream.xml.bz2">enwiki-20211201-pages-articles-multistream.xml.bz2</a> 18.9 GB</li>,
 <li class="file"><a href="/enwiki/20211201/enwiki-20211201-pages-articles-multistream-index.txt.bz2">enwiki-20211201-pages-articles-multistream-index.txt.bz2</a> 226.8 MB</li>,
 <li class="file"><a href="/enwiki/20211201/enwiki-20211201-pages-articles-multistream1.xml-p1p41242.bz2">enwiki-20211201-pages-articles-multistream1.xml-p1p41242.bz2</a> 243.3 MB</li>,
 <li class="file"><a href="/enwiki/20211201/enwiki-20211201-pages-articles-multistream-index1.txt-p1p41242.bz2">enwiki-20211201-pages-articles-multistream-index1.txt-p1p41242.bz2</a> 221 KB</li>]

Iterate over the file entries, keep those whose name contains `pages-articles`, and show the first five.

In [7]:
files = []

# Search through all files
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    # Select the relevant files
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
        
files[:5]

[('enwiki-20211201-pages-articles-multistream.xml.bz2', ['18.9', 'GB']),
 ('enwiki-20211201-pages-articles-multistream-index.txt.bz2', ['226.8', 'MB']),
 ('enwiki-20211201-pages-articles-multistream1.xml-p1p41242.bz2',
  ['243.3', 'MB']),
 ('enwiki-20211201-pages-articles-multistream-index1.txt-p1p41242.bz2',
  ['221', 'KB']),
 ('enwiki-20211201-pages-articles-multistream2.xml-p41243p151573.bz2',
  ['327.9', 'MB'])]

Select all compressed xml files.

In [8]:
files_to_download = [file[0] for file in files if '.xml-p' in file[0]]
files_to_download[-5:]

['enwiki-20211201-pages-articles26.xml-p62585851p63975909.bz2',
 'enwiki-20211201-pages-articles27.xml-p63975910p65475909.bz2',
 'enwiki-20211201-pages-articles27.xml-p65475910p66975909.bz2',
 'enwiki-20211201-pages-articles27.xml-p66975910p68475909.bz2',
 'enwiki-20211201-pages-articles27.xml-p68475910p69411557.bz2']

Disregard the multistream files. The dump URL is passed separately to the download function below.

In [9]:
files_to_download = [x for x in files_to_download if "multistream" not in x]
files_to_download[-5:]

['enwiki-20211201-pages-articles26.xml-p62585851p63975909.bz2',
 'enwiki-20211201-pages-articles27.xml-p63975910p65475909.bz2',
 'enwiki-20211201-pages-articles27.xml-p65475910p66975909.bz2',
 'enwiki-20211201-pages-articles27.xml-p66975910p68475909.bz2',
 'enwiki-20211201-pages-articles27.xml-p68475910p69411557.bz2']
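
Each partition name encodes the partition number and the range of page IDs it covers. The sketch below pulls these out with a regular expression; `parse_partition` is a hypothetical helper, and the pattern is inferred only from the file names shown in this notebook.

```python
import re

# Pattern inferred from names such as
# 'enwiki-20211201-pages-articles27.xml-p68475910p69411557.bz2'.
PARTITION = re.compile(r"pages-articles(\d+)\.xml-p(\d+)p(\d+)\.bz2$")


def parse_partition(filename):
    """Return (partition number, first page id, last page id) or None."""
    match = PARTITION.search(filename)
    if match is None:
        return None
    partition, first_id, last_id = map(int, match.groups())
    return partition, first_id, last_id


print(parse_partition('enwiki-20211201-pages-articles27.xml-p68475910p69411557.bz2'))
# Multistream names do not fit the pattern, so they yield None.
print(parse_partition('enwiki-20211201-pages-articles-multistream1.xml-p1p41242.bz2'))
```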

Download all relevant files. If a file has already been downloaded, only its size is displayed.

In [10]:
data_paths, file_info = download_wikipedia(PATH, files_to_download, dump_url)

Found File enwiki-20211201-pages-articles1.xml-p1p41242.bz2, size: 254.34 MB
Found File enwiki-20211201-pages-articles2.xml-p41243p151573.bz2, size: 340.13 MB
Found File enwiki-20211201-pages-articles3.xml-p151574p311329.bz2, size: 369.19 MB
Found File enwiki-20211201-pages-articles4.xml-p311330p558391.bz2, size: 409.46 MB
Found File enwiki-20211201-pages-articles5.xml-p558392p958045.bz2, size: 438.83 MB
Found File enwiki-20211201-pages-articles6.xml-p958046p1483661.bz2, size: 470.29 MB
Found File enwiki-20211201-pages-articles7.xml-p1483662p2134111.bz2, size: 482.18 MB
Found File enwiki-20211201-pages-articles8.xml-p2134112p2936260.bz2, size: 491.23 MB
Found File enwiki-20211201-pages-articles9.xml-p2936261p4045402.bz2, size: 531.32 MB
Found File enwiki-20211201-pages-articles10.xml-p4045403p5399366.bz2, size: 520.98 MB
Found File enwiki-20211201-pages-articles11.xml-p5399367p6899366.bz2, size: 504.21 MB
Found File enwiki-20211201-pages-articles11.xml-p6899367p7054859.bz2, size: 48.41

Download and check md5 checksums.

In [11]:
compressed_path = PATH + "wikipedia\\compressed\\"
checksums = get_file(compressed_path + "md5_checksums", dump_url + f"enwiki-{DUMP}-md5sums.txt")
check_md5(checksums, files_to_download, compressed_path)

Downloads verified by MD5 checksums
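
`check_md5` comes from `wiki_utils` and is not shown here. The following is only a sketch of how such a check can be done with the standard library. It assumes the checksum file has one `<hash>  <file name>` pair per line; `md5_of_file` and `parse_md5sums` are hypothetical helpers.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Map file name -> expected hash (assumed format: '<hash>  <name>')."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            expected[parts[1]] = parts[0]
    return expected
```

A file passes when `md5_of_file(path)` equals the entry for its name in the parsed mapping. Reading in chunks keeps memory use flat even for partitions of several hundred megabytes.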


Display total size and article count of downloaded dump.

In [12]:
file_sizes = [file[1] for file in file_info]
article_count = [file[2] for file in file_info]

print(f"The total size of files on disk is {round(sum(file_sizes) / 1e3, 2)} GB")
print(f"The total number of articles is {sum(article_count)}")

The total size of files on disk is 19.25 GB
The total number of articles is 69411496


Let's take a peek at the data.

In [13]:
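data_path = data_paths[15]

# Read roughly the first 500,000 lines of the compressed partition
lines = []
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    lines.append(line)
    if i > 5e5:
        break

# Show a slice that covers the end of one page and the start of the next
lines[27190:27210]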

[b'        <username>RjwilmsiBot</username>\n',
 b'        <id>10996774</id>\n',
 b'      </contributor>\n',
 b'      <minor />\n',
 b'      <comment>redirect tagging using [[Project:AWB|AWB]]</comment>\n',
 b'      <model>wikitext</model>\n',
 b'      <format>text/x-wiki</format>\n',
 b'      <text bytes="68" xml:space="preserve">#REDIRECT [[Pedro de los R\xc3\xados]]\n',
 b'{{R from title without diacritics}}</text>\n',
 b'      <sha1>dacyl986lde8jytgaupies8aiib8qy4</sha1>\n',
 b'    </revision>\n',
 b'  </page>\n',
 b'  <page>\n',
 b'    <title>Electoral district of Fuller</title>\n',
 b'    <ns>0</ns>\n',
 b'    <id>10675421</id>\n',
 b'    <revision>\n',
 b'      <id>1047397406</id>\n',
 b'      <parentid>1004444442</parentid>\n',
 b'      <timestamp>2021-09-30T17:04:21Z</timestamp>\n']
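
When only a small window of lines is needed, the earlier lines do not have to be kept in memory. A sketch using `itertools.islice` follows; `peek_lines` is a hypothetical helper and assumes the file is UTF-8 encoded text.

```python
import bz2
from itertools import islice


def peek_lines(path, start, stop):
    """Return decoded lines [start, stop) of a bz2 file.

    Lines before `start` are still decompressed, but they are discarded
    instead of being collected in a list.
    """
    with bz2.open(path, "rt", encoding="utf-8") as fin:
        return [line.rstrip("\n") for line in islice(fin, start, stop)]
```

For example, `peek_lines(data_path, 27190, 27210)` would cover the same window as the cell above, returned as `str` rather than `bytes`.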

Parse the compressed file until just over 500 pages have been collected, then display the titles of ten of them.

In [14]:
# Object for handling xml
handler = multiprocessor_wiki.SimpleWikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    
    parser.feed(line)
    
    # Stop when 500 articles have been found
    if len(handler._pages) > 500:
        break
        
print([x[0] for x in handler._pages][190:200])

['Northfield Chateau', 'Video games in China', 'Branch Closing (The Office episode)', 'Brooklyn Tip Tops', 'Category:Low-importance Shopping center articles', 'Chalet Schell', 'Wikipedia:Wikiquette assistance/Archive/2005', 'Category:Unknown-importance Shopping center articles', 'Wikipedia:Wikiquette alerts/archive', 'Birnam House']
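
`SimpleWikiXmlHandler` lives in `multiprocessor_wiki` and is not shown here. The sketch below illustrates the general idea of a SAX content handler that collects `(title, text)` pairs while the XML is fed in piece by piece. `PageCollector` is a hypothetical stand-in and makes no claim to match the actual class.

```python
import xml.sax


class PageCollector(xml.sax.ContentHandler):
    """Collect (title, text) tuples from MediaWiki-style XML."""

    def __init__(self):
        super().__init__()
        self._current_tag = None   # tag whose characters are being buffered
        self._buffer = []
        self._values = {}
        self.pages = []

    def startElement(self, name, attrs):
        if name in ("title", "text"):
            self._current_tag = name
            self._buffer = []

    def characters(self, content):
        # Text can arrive in several pieces, so buffer and join later.
        if self._current_tag is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current_tag:
            self._values[name] = "".join(self._buffer)
            self._current_tag = None
        if name == "page":
            self.pages.append((self._values.get("title"), self._values.get("text")))
            self._values = {}


# Tiny made-up document, fed in two pieces to mimic line-by-line parsing.
collector = PageCollector()
sax_parser = xml.sax.make_parser()
sax_parser.setContentHandler(collector)
sax_parser.feed("<mediawiki><page><title>A</title><revision>")
sax_parser.feed("<text>Body of A</text></revision></page></mediawiki>")
sax_parser.close()
print(collector.pages)
```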


Select one to take a closer look.

In [15]:
print(handler._pages[190][0])
page_text = handler._pages[190][1]
wiki = mwparserfromhell.parse(page_text)
wiki[:1000]

Northfield Chateau


"[[Image:Northfield Chateau (Northfield, MA) - exterior.jpg|thumb|right|250px|The Northfield Chateau]] \n [[Image:Northfield Chateau (Northfield, MA) - interior.jpg|thumb|right|250px|Interior]] \n \n The '''Northfield Chateau''', also variously known as '''Chalet Schell''' and '''Birnam House''', was a large mansion on Birnham Road in [[Northfield, Massachusetts]]. It no longer exists. \n \n The chateau was designed by noted architect [[Bruce Price]] (of the [[Château Frontenac]]) for Francis Robert Schell, a New York capitalist attracted by his interest in [[Dwight Lyman Moody]]'s work at the nearby [[Northfield Seminary]] and [[Northfield Mount Hermon School|Mount Hermon School]]. It was completed in 1903 on grounds of {{convert|125|acre}}. \n \n The building was loosely patterned upon a French [[chateau]] but fanciful in style, with 99 rooms in a compact, three-story structure ornamented with prominent turrets. Contrary to popular rumors that Mrs. Schell despised the Chateau and ref

In [16]:
links = [x.title for x in wiki.filter_wikilinks()]
print(f"There are {len(links)} wikilinks in this article:")
links[:5]

There are 18 wikilinks in this article:


['Image:Northfield Chateau (Northfield, MA) - exterior.jpg',
 'Image:Northfield Chateau (Northfield, MA) - interior.jpg',
 'Northfield, Massachusetts',
 'Bruce Price',
 'Château Frontenac']

Comments were not downloaded, so the following will always be empty.

In [17]:
print(wiki.filter_arguments())
print(wiki.filter_comments())

[]
[]


In [18]:
external_links = [x.url for x in wiki.filter_external_links()]
print(f'There are {len(external_links)} external links:')
external_links[:5]

There are 4 external links:


['http://www.nmhschool.org/alumni/history/chateauhistory.php',
 'http://www.nmhschool.org/alumni/history/chateau.php',
 'http://hdl.loc.gov/loc.pnp/hhh.ma0192',
 'http://www.eric-goldscheider.com/id121.html']

In [19]:
templates = wiki.filter_templates()
print(f'There are {len(templates)} templates:')
templates[:5]

There are 2 templates:


['{{convert|125|acre}}', '{{Coord|42|42|10.31|N|72|26|47.40|W|display=title}}']

Look for coordinates.

In [20]:
infobox = wiki.filter_templates(matches="coord")[0]
print(infobox)
print(infobox.name.strip_code().strip().lower())

{{Coord|42|42|10.31|N|72|26|47.40|W|display=title}}
coord


Test the extract coordinates function.

In [21]:
data_processor.extract_coordinates(str(infobox))

(42.70286388888889, -72.4465)
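
The conversion behind this result is degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. The sketch below handles only the degrees/minutes/seconds form of the template shown above; `parse_coord_template` is a hypothetical helper and is not the implementation in `data_processor`.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    value = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def parse_coord_template(template):
    """Parse '{{Coord|D|M|S|N/S|D|M|S|E/W|...}}' into (lat, lon).

    Returns None for any other layout (for example decimal coordinates).
    """
    fields = [f.strip() for f in template.strip().strip("{}").split("|")]
    if len(fields) < 9 or fields[0].lower() != "coord":
        return None
    if fields[4].upper() not in ("N", "S") or fields[8].upper() not in ("E", "W"):
        return None
    try:
        lat = dms_to_decimal(*fields[1:5])
        lon = dms_to_decimal(*fields[5:9])
    except ValueError:
        return None
    return lat, lon


print(parse_coord_template("{{Coord|42|42|10.31|N|72|26|47.40|W|display=title}}"))
```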

Display main text of article.

In [28]:
wiki.strip_code().strip()

'thumb|right|250px|The Northfield Chateau \n thumb|right|250px|Interior \n \n The Northfield Chateau, also variously known as Chalet Schell and Birnam House, was a large mansion on Birnham Road in Northfield, Massachusetts. It no longer exists. \n \n The chateau was designed by noted architect Bruce Price (of the Château Frontenac) for Francis Robert Schell, a New York capitalist attracted by his interest in Dwight Lyman Moody\'s work at the nearby Northfield Seminary and Mount Hermon School. It was completed in 1903 on grounds of . \n \n The building was loosely patterned upon a French chateau but fanciful in style, with 99 rooms in a compact, three-story structure ornamented with prominent turrets. Contrary to popular rumors that Mrs. Schell despised the Chateau and refused to live in it, the Schells summered at their beautiful home for 25 years. It was only after the death of her beloved husband in 1928 that Mrs. Schell refused to set foot in the house again, insisting when she stay

Split the article into sections (section headings are written as `== HEADING ==`). Usually only the summary is of interest, which is the text before the first heading.

In [23]:
wiki.strip_code().strip().split("==")

['thumb|right|250px|The Northfield Chateau \n thumb|right|250px|Interior \n \n The Northfield Chateau, also variously known as Chalet Schell and Birnam House, was a large mansion on Birnham Road in Northfield, Massachusetts. It no longer exists. \n \n The chateau was designed by noted architect Bruce Price (of the Château Frontenac) for Francis Robert Schell, a New York capitalist attracted by his interest in Dwight Lyman Moody\'s work at the nearby Northfield Seminary and Mount Hermon School. It was completed in 1903 on grounds of . \n \n The building was loosely patterned upon a French chateau but fanciful in style, with 99 rooms in a compact, three-story structure ornamented with prominent turrets. Contrary to popular rumors that Mrs. Schell despised the Chateau and refused to live in it, the Schells summered at their beautiful home for 25 years. It was only after the death of her beloved husband in 1928 that Mrs. Schell refused to set foot in the house again, insisting when she sta
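
A plain `split("==")` also cuts at `===` subheadings and returns the heading titles as separate list items. As an alternative sketch, the summary can be taken as everything before the first level-2 heading in the raw wikitext. `split_summary` is a hypothetical helper, and the sample text is made up.

```python
import re

# A level-2 heading: a line starting with exactly two '=' signs.
LEVEL2_HEADING = re.compile(r"^==[^=].*?==[ \t]*$", re.MULTILINE)


def split_summary(wikitext):
    """Return (summary, rest), split at the first level-2 heading."""
    match = LEVEL2_HEADING.search(wikitext)
    if match is None:
        return wikitext.strip(), ""
    return wikitext[:match.start()].strip(), wikitext[match.start():].strip()


sample = (
    "The chateau was a large mansion.\n\n"
    "== History ==\nCompleted in 1903.\n\n"
    "=== Later years ===\nIt no longer exists.\n"
)
summary, rest = split_summary(sample)
print(summary)
```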

Search the 16th Wikipedia file for articles located in the selected region (`REGION_NAME`) and stop once 3 have been found.

In [24]:
# Object for handling xml
handler = multiprocessor_wiki.WikiXmlHandler(LAT_RANGE, LONG_RANGE, ALSO_SUMMARY)

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

print(f"Searching for articles in {data_path}...")
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    parser.feed(line)
    # Stop when 3 objects have been found
    if len(handler._articles) > 2:
        break

print(f'Searched through {handler._article_count} articles to find {len(handler._articles)} objects in {REGION_NAME}.')

Searching for articles in C:\Users\Tim\.keras\datasets\wikipedia_rss\wikipedia\compressed\enwiki-20211201-pages-articles13.xml-p10672789p11659682.bz2...
Searched through 2638 articles to find 3 objects in NY.


Let's see what articles have been identified to be in the specified region.

In [25]:
print(*[article[0] for article in handler._articles], sep=", ")

Rose Hill, Manhattan, American Surety Building, Lenox Hill


## Process all articles with multiprocessing

Check that all partition files are present in the compressed path and display one example path.

In [26]:
partitions = [compressed_path + file for file in os.listdir(compressed_path) if 'xml-p' in file]
len(partitions), partitions[-1]

(61,
 'C:\\Users\\Tim\\.keras\\datasets\\wikipedia_rss\\wikipedia\\compressed\\enwiki-20211201-pages-articles9.xml-p2936261p4045402.bz2')

Run the script to process all compressed files and look for articles in the specified region.

In [27]:
# multiprocessor_wiki.process(compressed_path, REGION_NAME, LAT_RANGE, LONG_RANGE, ALSO_SUMMARY, CPU_CORES)

## Joining the data back together

Read all json files containing information about locations from each partition.

In [32]:
uncompressed_path = os.path.dirname(os.path.dirname(compressed_path)) + f"\\uncompressed_{REGION_NAME}\\"

saved_files = [uncompressed_path + x for x in os.listdir(uncompressed_path)]  # find all data to read

articles = []
for file in saved_files:
    articles.extend(read_json(file))

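Save all articles in one file. Note that `json.dump` writes them as a single JSON array, even though the file name ends in `.ndjson`.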

In [33]:
# create path to new file
f_path = os.path.dirname(os.path.dirname(uncompressed_path)) + f"\\wikipedia_selected_{REGION_NAME}.ndjson"

if not os.path.exists(f_path):
    with open(f_path, 'wt') as fout:
        json.dump(articles, fout)
    print('Articles saved.')
else:
    print('File already saved.')

Articles saved.


Verify that the exported file can be loaded back and matches the articles in memory.

In [34]:
with open(f_path) as fin:
    data_loaded = json.load(fin)

assert data_loaded == articles
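
The file above carries an `.ndjson` extension but holds one JSON array. If a newline-delimited layout is preferred, with one record per line, it could be written and read as sketched below. `write_ndjson` and `read_ndjson` are hypothetical helpers and are not part of `wiki_utils`.

```python
import json


def write_ndjson(path, records):
    """Write one JSON document per line."""
    with open(path, "wt", encoding="utf-8") as fout:
        for record in records:
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Read a newline-delimited JSON file back into a list."""
    with open(path, "rt", encoding="utf-8") as fin:
        return [json.loads(line) for line in fin if line.strip()]
```

Because `json.dumps` escapes line breaks inside strings, each record stays on a single line, which allows the file to be streamed record by record instead of loaded in one piece.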