# Analysing Wikipedia Pages

In this project, we'll analyze 54 megabytes of Wikipedia articles to find patterns in how Wikipedia writes and presents its content. The articles were scraped by requesting random Wikipedia pages and downloading their contents with the `requests` package. The scraping code is in this folder, in the `scrape_random.py` file.

Our main goals will be to:

- Extract only the text from the Wikipedia pages, and remove all HTML and JavaScript markup.
- Remove common page headers and footers from the Wikipedia pages.
- Figure out what tags are the most common in Wikipedia pages.
- Figure out patterns in the text.
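
As a rough illustration of the first goal, here is a minimal sketch that strips markup and drops the contents of `<script>` and `<style>` tags. It uses the standard library's `html.parser` as a stand-in so it runs without extra packages. The `TextExtractor` class, the `extract_text` helper, and the sample string are hypothetical and are not part of the project code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string with tags and scripts removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample, loosely modelled on the head of a scraped page
sample = (
    "<html><head><title>Ronald McCaffer - Wikipedia</title>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(extract_text(sample))
```
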

In [1]:
import os
print('Printing first 10 articles/files in wiki directory:')
os.listdir('wiki')[:10]

Printing first 10 articles/files in wiki directory:


['Ronald_McCaffer.html',
 'Communities_of_Tulu_Nadu.html',
 'Mountune_Racing.html',
 'Tim_Spencer_(singer).html',
 'Nathaniel_Merriman.html',
 'One_Night_of_Sin.html',
 'Middle_Park,_Victoria.html',
 'Zgornji_Otok.html',
 'Josef_Mik.html',
 'Gaston_Lane.html']

In [2]:
print('Number of files in directory: ', len(os.listdir('wiki')))

Number of files in directory:  999


In [5]:
with open("wiki/Ronald_McCaffer.html") as f:
    print(f.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Ronald McCaffer - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Ronald_McCaffer","wgTitle":"Ronald McCaffer","wgCurRevisionId":726527002,"wgRevisionId":726527002,"wgArticleId":17402798,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["BLP articles lacking sources from March 2011","All BLP articles lacking sources","Articles with topics of unclear notability from March 2015","All articles with topics of unclear notability","All stub articles","Academics of Loughborough University","Scottish civil engineers","Fellows of the Royal Academy of Engineerin

## Reading in the data

Now that we know the file structure, and the structure of a single file, we can read in all of the files. This will get us started in our explorations.

As this task is I/O bound, we can use threads to help us read in the data more quickly.

In [11]:
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_data(filename):
    # The pages declare charset UTF-8, so decode them explicitly
    with open(filename, encoding="utf-8") as f:
        data = f.read()
    return data

start = time.time()
filenames = ["wiki/{}".format(f) for f in os.listdir("wiki")]
content = pool.map(read_data, filenames)
content = list(content)

end = time.time()
print('{:.3f} seconds to read in all files'.format(end - start))
articles = [f.replace(".html", "").replace("wiki/", "") for f in filenames]

0.255 seconds to read in all files


After some profiling, threading doesn't appear to make a large difference to performance. One likely reason is that the files are small and quick to read, so any time saved by overlapping the reads is offset by the overhead of creating and coordinating the threads.
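
One way to check this kind of claim is to time a sequential read against a threaded read of the same files. The sketch below does that on throwaway temporary files rather than the `wiki` directory. The `time_reads` helper, the file count, and the file sizes are assumptions made for illustration, and the timings will vary by machine and disk cache.

```python
import concurrent.futures
import os
import tempfile
import time


def read_data(filename):
    with open(filename, encoding="utf-8") as f:
        return f.read()


def time_reads(filenames, max_workers=None):
    """Return (contents, seconds). With max_workers=None, read sequentially."""
    start = time.perf_counter()
    if max_workers is None:
        contents = [read_data(name) for name in filenames]
    else:
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            contents = list(pool.map(read_data, filenames))
    return contents, time.perf_counter() - start


# Throwaway files stand in for the scraped pages
with tempfile.TemporaryDirectory() as tmp:
    filenames = []
    for i in range(50):
        path = os.path.join(tmp, "page_{}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html>{}</html>".format("x" * 10000))
        filenames.append(path)

    sequential, seq_seconds = time_reads(filenames)
    threaded, thread_seconds = time_reads(filenames, max_workers=4)

print("sequential: {:.4f}s, 4 threads: {:.4f}s".format(seq_seconds, thread_seconds))
```

Both runs should return identical contents, so only the elapsed time differs between them.
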

## Isolating content

Now that we've read in the data files, we can discard the extraneous markup outside the `<div id="content">` tag, which seems to hold most of each article's content.

Since this operation is more CPU intensive than before, let's try using a process pool to see if the speed improves.

In [12]:
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    return str(soup.find_all("div", id="content")[0])

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
parsed = pool.map(parse_html, content)
parsed = list(parsed)
end = time.time()

print('{:.3f} seconds to isolate content tag for all files'.format(end - start))

28.831 seconds to isolate content tag for all files


This operation is quite slow and CPU-intensive. It looks like using as many processes as there are available processors speeds things up.
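
Rather than hard-coding the number of workers, the pool size can be derived from the machine. The sketch below is one way to do that with `os.cpu_count()`. The helper names `default_workers` and `map_in_processes` are hypothetical, and whether one process per processor is actually the best setting for this workload is an assumption worth measuring.

```python
import concurrent.futures
import os


def default_workers(n_tasks):
    """Pick one worker per processor, but never more workers than tasks."""
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(cpus, n_tasks))


def map_in_processes(func, items):
    """Apply func to every item using a process pool sized for this machine."""
    items = list(items)
    workers = default_workers(len(items))
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))
```

With these helpers, the parsing step could be written as `parsed = map_in_processes(parse_html, content)`. On platforms that start worker processes by spawning, that call needs to sit under an `if __name__ == "__main__":` guard when run as a script.
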

## Finding common tags

Now that we've extracted the main part of each page, we can count up how many times each tag occurs. This will give us clues about how Wikipedia pages are typically structured. For example, if there are a lot of `a` tags on each page, we know that Wikipedia articles tend to be very connected to other articles or pages. On the other hand, a lot of `div` tags will tell us that Wikipedia pages tend to have a nested structure with many page elements.

In [13]:
def count_tags(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = {}
    for tag in soup.find_all():
        if tag.name not in tags:
            tags[tag.name] = 0
        tags[tag.name] += 1
    return tags

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
tags = pool.map(count_tags, parsed)
tags = list(tags)

tag_counts = {}
for tag in tags:
    for k,v in tag.items():
        if k not in tag_counts:
            tag_counts[k] = 0
        tag_counts[k] += v
end = time.time()

print('{:.3f} seconds to find common tags'.format(end - start))
tag_counts

14.104 seconds to find common tags


{'div': 28581,
 'a': 161065,
 'h1': 999,
 'table': 4010,
 'tr': 27300,
 'td': 57673,
 'img': 6701,
 'span': 67350,
 'b': 14455,
 'small': 3272,
 'i': 18246,
 'br': 4986,
 'p': 7998,
 'h2': 4045,
 'ul': 10972,
 'li': 85779,
 'h3': 777,
 'abbr': 3665,
 'noscript': 999,
 'ol': 858,
 'sup': 11157,
 'th': 14472,
 'cite': 3563,
 'strong': 599,
 'caption': 200,
 'big': 75,
 'dl': 457,
 'dt': 334,
 'dd': 1376,
 'sub': 151,
 'code': 108,
 'blockquote': 58,
 'h4': 117,
 'wbr': 85,
 'q': 76,
 'center': 64,
 'bdi': 4,
 'hr': 51,
 'pre': 1,
 'u': 51,
 'audio': 2,
 'source': 2,
 's': 10,
 'h5': 4,
 'math': 2,
 'semantics': 2,
 'mrow': 2,
 'mstyle': 2,
 'mo': 2,
 'annotation': 2,
 'map': 2,
 'area': 39,
 'ruby': 16,
 'rb': 16,
 'rp': 32,
 'rt': 16,
 'h6': 1,
 'samp': 2,
 'font': 40,
 'del': 2}

Based on these counts, the `a`, `li`, `span`, and `td` tags are by far the most common. This indicates that articles tend to have lots of links, along with lists and tables. The `a` tag is the most numerous of all, which shows how interconnected articles on Wikipedia are.
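
To make the ranking explicit, the totals can be loaded into a `collections.Counter` and sorted with `most_common`. The sketch below copies a handful of the totals reported above instead of recomputing them, and divides by the 999 pages in the directory to get a per-page average.

```python
from collections import Counter

# A few of the totals from the output above, copied by hand
tag_counts = Counter({
    "a": 161065,
    "li": 85779,
    "span": 67350,
    "td": 57673,
    "div": 28581,
    "p": 7998,
})

# The four most frequent tags, highest first
top = tag_counts.most_common(4)
print(top)

# Average number of links per page, given 999 pages
links_per_page = tag_counts["a"] / 999
print("{:.1f} links per page".format(links_per_page))
```
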

## Finding common words

After finding the common tags, we can take a similar approach to find the most common words in the article body.

In [14]:
from collections import Counter
import re

def count_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # Collapse runs of non-word characters into single spaces
    text = re.sub(r"\W+", " ", text.lower())
    words = text.split(" ")
    # Keep only words with at least five characters
    words = [w for w in words if len(w) >= 5]
    return Counter(words).most_common(10)

start = time.time()
pool = concurrent.futures.ProcessPoolExecutor(max_workers=3)
words = pool.map(count_words, parsed)
words = list(words)

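# Tally how many articles have each word in their top 10
# (the per-article count itself is not used here)
word_counts = {}
for wc in words:
    for word, count in wc:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1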
end = time.time()

print('{:.3f} seconds to find common words'.format(end - start))

15.614 seconds to find common words


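Returning only the top 10 words from each article, rather than every word count, speeds things up quite a bit, likely because each worker process sends much less data back to the main process.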
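
When merging the per-article results, there are two different tallies to choose from: the total number of occurrences of a word, and the number of articles in which the word makes the top 10. The sketch below computes both on two made-up sentences. The `top_words` helper and the sample text are hypothetical, and plain strings stand in for parsed HTML.

```python
import re
from collections import Counter


def top_words(text, n=10, min_length=5):
    """Return the n most common words of at least min_length characters."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if len(w) >= min_length).most_common(n)


# Made-up sample text standing in for article bodies
articles_text = [
    "Engineering students study engineering management.",
    "The management of engineering projects.",
]
per_article = [top_words(text) for text in articles_text]

total_occurrences = Counter()  # sum of per-article counts
article_frequency = Counter()  # number of articles featuring the word

for pairs in per_article:
    for word, count in pairs:
        total_occurrences[word] += count
        article_frequency[word] += 1

print(total_occurrences.most_common(3))
print(article_frequency.most_common(3))
```
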

## Next steps...

Here are some further potential questions to explore:

- What tags have the most content inside of them?
- What articles are most commonly linked to from our articles?
- What phrases are the most common?
- What's the distribution of letters per word? How many 3-letter words are there? How many 4-letter words?
- What's the average reading level of a Wikipedia article? You can calculate this with readability metrics.
- What images are most commonly shown in articles?
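
As a starting point for the letters-per-word question, here is a minimal sketch that counts words by length. The `word_length_distribution` helper and the sample sentence are hypothetical. Treating a word as a run of letters only, with digits and underscores excluded, is an assumption that may need adjusting for real article text.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Map each word length to the number of words with that length."""
    # [^\W\d_]+ matches runs of letters, excluding digits and underscores
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(len(w) for w in words)


dist = word_length_distribution("The cat sat on a warm mat")
for length in sorted(dist):
    print("{} letters: {} words".format(length, dist[length]))
```
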