# Analyzing Wikipedia Pages

Skills: API, Web Scraping, Multi-Threading, Multi-Processing, Benchmarking

In this project, we'll be working with data scraped from Wikipedia, a popular online encyclopedia. We'll be analyzing 54 megabytes worth of articles to figure out patterns in the Wikipedia writing and content presentation style. The scraping code is in this folder, in the scrape_random.py file.

Our main goals will be to:

- Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
- Remove common page headers and footers from the Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.


In [1]:
# List all of the files in the wiki folder.
import os
list_wikifile = os.listdir('wiki')
print(list_wikifile)

['%C3%89cole_des_Mines_de_Douai.html', '%C3%89taule.html', '%C5%8Cnog%C5%8D_Station.html', '100_Greatest_Romanians.html', '104th_Logistic_Support_Brigade_(United_Kingdom).html', '16th_Virginia_Infantry.html', '1896_Indiana_Hoosiers_football_team.html', '1898_Colgate_football_team.html', '1910_in_literature.html', '1915_Montana_football_team.html', '1951_National_League_tie-breaker_series.html', '1953%E2%80%9354_FA_Cup_qualifying_rounds.html', '1958_Wightman_Cup.html', '1988_State_of_Origin_series.html', '1st_Strategic_Aerospace_Division.html', '2001_Australian_Individual_Speedway_Championship.html', '2001_NCAA_Division_I_Field_Hockey_Championship.html', '2004_Tuvalu_A-Division.html', '2005%E2%80%9306_in_Welsh_football.html', '2007%E2%80%9308_Huddersfield_Town_A.F.C._season.html', '2008_Fed_Cup_World_Group_II.html', '2009_English_cricket_season.html', '2009_World_Junior_Ice_Hockey_Championships_rosters.html', '2010_Karshi_Challenger_%E2%80%93_Singles.html', '2011%E2%80%9312_Western_Coll

In [2]:
# Count up and display the number of files in the wiki folder.
no_of_files = len(list_wikifile)
print(no_of_files)

999


In [3]:
# Display a single file from the wiki folder:
with open("wiki/Millennium_Art_Academy.html", encoding="utf-8") as file:
    data = file.read()
data

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Millennium Art Academy - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Millennium_Art_Academy","wgTitle":"Millennium Art Academy","wgCurRevisionId":766325210,"wgRevisionId":766325210,"wgArticleId":8753919,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","Public high schools in New York City","Schools in the Bronx"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","Janu

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.

As this task is I/O bound, we can use threads to help us read in the data more quickly.

We will benchmark the read with no threads, 4 threads, and 8 threads (the most my processor supports).

In [4]:
# Import concurrent.futures package to execute multithreading process,
# and time package to benchmark the performance
import concurrent.futures as cf
import time

content = []
articles = [name[:-5] for name in os.listdir('wiki')]
print(articles)

# Read a single file from the wiki folder and return its contents
def read_all(files):
    with open('wiki/{}'.format(files), encoding="utf-8") as file:
        return file.read()

# no threads
start = time.time()
content_0 = []
for file in list_wikifile:
    content_0.append(read_all(file))
duration_0 = time.time() - start

# 4 threads
start = time.time()
# Create pool of threads
pool = cf.ThreadPoolExecutor(max_workers=4)
content_4 = list(pool.map(read_all, list_wikifile))
duration_4 = time.time() - start

# 8 threads
start = time.time()
# Create pool of threads
pool = cf.ThreadPoolExecutor(max_workers=8)
content_8 = list(pool.map(read_all, list_wikifile))
duration_8 = time.time() - start

['%C3%89cole_des_Mines_de_Douai', '%C3%89taule', '%C5%8Cnog%C5%8D_Station', '100_Greatest_Romanians', '104th_Logistic_Support_Brigade_(United_Kingdom)', '16th_Virginia_Infantry', '1896_Indiana_Hoosiers_football_team', '1898_Colgate_football_team', '1910_in_literature', '1915_Montana_football_team', '1951_National_League_tie-breaker_series', '1953%E2%80%9354_FA_Cup_qualifying_rounds', '1958_Wightman_Cup', '1988_State_of_Origin_series', '1st_Strategic_Aerospace_Division', '2001_Australian_Individual_Speedway_Championship', '2001_NCAA_Division_I_Field_Hockey_Championship', '2004_Tuvalu_A-Division', '2005%E2%80%9306_in_Welsh_football', '2007%E2%80%9308_Huddersfield_Town_A.F.C._season', '2008_Fed_Cup_World_Group_II', '2009_English_cricket_season', '2009_World_Junior_Ice_Hockey_Championships_rosters', '2010_Karshi_Challenger_%E2%80%93_Singles', '2011%E2%80%9312_Western_Collegiate_Hockey_Association_women%27s_ice_hockey_season', '2011_ITU_Duathlon_World_Championships', '2011_UK_Open_Qualifier

In [5]:
# Now, we compare the performance for different numbers of threads
print(duration_0)
print(duration_4)
print(duration_8)

23.36261510848999
0.2063000202178955
0.2257218360900879


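In this case, multithreading is clearly advantageous: both threaded runs finished in about 0.2 seconds, against roughly 23 seconds for the sequential loop, and going from 4 to 8 threads made no real difference. Because the task is I/O bound, the time saved by overlapping file reads far outweighs the overhead of creating threads. Note that the sequential loop ran first, so part of the gap may also come from the operating system caching the files after that first pass.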
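
A benchmark like this can be made less sensitive to run order. The following is a minimal sketch, not the notebook's own code: the `time_read` helper and the temporary stand-in files are assumptions made for illustration. It does an untimed warm-up pass first, times with `time.perf_counter`, and closes each thread pool with a `with` block.

```python
import concurrent.futures as cf
import os
import tempfile
import time


def read_file(path):
    """Return the full text of one file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def time_read(paths, workers=0):
    """Read every path and return (contents, seconds).

    workers=0 reads sequentially; otherwise a thread pool of that size is used.
    """
    start = time.perf_counter()
    if workers == 0:
        contents = [read_file(p) for p in paths]
    else:
        # The with-block shuts the pool down once all reads are done.
        with cf.ThreadPoolExecutor(max_workers=workers) as pool:
            contents = list(pool.map(read_file, paths))
    return contents, time.perf_counter() - start


# Hypothetical stand-in data: 50 small files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(50):
    path = os.path.join(tmp_dir, "page_{}.html".format(i))
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body>page {}</body></html>".format(i))
    paths.append(path)

time_read(paths)  # untimed warm-up pass so every timed run sees a warm cache

timings = {}
for workers in (0, 4, 8):
    contents, seconds = time_read(paths, workers)
    timings[workers] = seconds
    print(workers, "workers:", round(seconds, 4), "s")
```

With files this small the three timings will be close; the point of the sketch is the measurement pattern, which could be pointed at the wiki folder instead.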

Now that we've read in the data files, we can remove the extraneous markup that's outside the div#content tag that most of the content seems to be inside.

We can use the BeautifulSoup package for this. BeautifulSoup enables us to extract all of the content inside a specific tag.

Using the BeautifulSoup package, we'll parse each wiki article, then extract the div with id content and everything inside it.

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

We'll compare a single process against pools of two and four worker processes (four being the number of cores on my processor).

In [None]:
from bs4 import BeautifulSoup

# Function to parse the file using BeautifulSoup
def rm_markup(document):
    with open('wiki/{}'.format(document), encoding="utf-8") as file:
        data = file.read()
    parser = BeautifulSoup(data, 'html.parser')
    content_div = parser.find_all("div", id="content")[0]
    return str(content_div)

# single core
start = time.time()
parsed_0 = []
for file in list_wikifile:
    parsed_0.append(rm_markup(file))
duration = time.time() - start

# dual core
start = time.time()
# Create a pool of 2 worker processes
pool = cf.ProcessPoolExecutor(max_workers=2)
parsed_2 = list(pool.map(rm_markup, list_wikifile))
duration_2 = time.time() - start

# quad core
start = time.time()
# Create a pool of 4 worker processes
pool = cf.ProcessPoolExecutor(max_workers=4)
parsed_4 = list(pool.map(rm_markup, list_wikifile))
duration_4 = time.time() - start

# Note: on platforms that spawn worker processes (e.g. Windows), a function
# passed to a process pool must be importable from a module.

In [None]:
# Now, we compare the performance for different numbers of processes
print(duration)
print(duration_2)
print(duration_4)

Multiprocessing appears to be advantageous in our case, with the best performance coming from 2 or 4 worker processes.

Now that we've extracted the main part of each page, let's count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured.

We'll use multiprocessing again for this step because parsing is CPU intensive. We'll use 3 worker processes as a compromise between the overhead of the pool and the speedup it provides.

In [None]:
def count_tags(document):
    parser = BeautifulSoup(document, 'html.parser')
    all_tags = parser.find_all()
    tags = {}
    for tag in all_tags:
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()
pool = cf.ProcessPoolExecutor(max_workers=3)
result = list(pool.map(count_tags, parsed_2))

overall_tags = {}
for each in result:
    for k,v in each.items():
        if k not in overall_tags:
            overall_tags[k] = 0
        overall_tags[k] += v
        
duration = (time.time() - start)
print(duration)
overall_tags

Based on our findings, it looks like there are quite a few td, a, li, and span tags. This indicates that articles tend to have lots of links, along with lists and tables. Links are the most numerous tag, which indicates how interconnected articles on Wikipedia are.
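
The per-document counting and merging above can also be written with `collections.Counter` from the standard library. This is a minimal sketch under stated assumptions: the tag-name lists are made up to stand in for the `tag.name` values of parsed pages, and `merge_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def merge_counts(per_document_counts):
    """Sum an iterable of per-document tag counts into one Counter."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)  # update() adds to existing counts
    return total


# Made-up tag names standing in for tag.name values from two parsed pages.
doc_a = ["div", "a", "a", "li", "span"]
doc_b = ["div", "a", "td", "td", "td"]

per_document = [Counter(doc_a), Counter(doc_b)]
overall = merge_counts(per_document)
print(overall.most_common(3))
```

A `Counter` returns 0 for missing keys, so the explicit "if the key is not in the dictionary, set it to 0" step is not needed.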

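Now we'll find the most common words, counting only words of at least five characters and keeping the ten most frequent from each article.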

In [None]:
from collections import Counter
import re

def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Replace runs of non-word characters with a single space
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    # Keep only words with at least 5 characters
    words = [w for w in words if len(w) >= 5]
    return Counter(words).most_common(10)

start = time.time()
pool = cf.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed_2)
words = list(words)

word_counts = {}
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += count
end = time.time()

print(end - start)
word_counts

Only selecting the top 10 words from each article speeds up performance quite a bit.
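
The trade-off is that the merged totals become approximate, because any word outside an article's top 10 is dropped before merging. A small sketch on made-up text illustrates the effect; the `word_counts` function and the sample strings are assumptions for illustration, and `top_n=1` is used only to make the difference visible on tiny inputs.

```python
import re
from collections import Counter


def word_counts(text, min_len=5, top_n=None):
    """Count lowercase words of at least min_len characters.

    With top_n set, only that many of the most frequent words are kept.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    if top_n is None:
        return counts
    return Counter(dict(counts.most_common(top_n)))


# Made-up article text, not taken from the wiki folder.
docs = [
    "football football season season season league",
    "league league football station",
]

exact = Counter()
truncated = Counter()
for doc in docs:
    exact.update(word_counts(doc))
    truncated.update(word_counts(doc, top_n=1))

# "football" appears three times overall but is never the top word
# in either document, so the truncated totals miss it entirely.
print(exact["football"], truncated["football"])
```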