# Analyzing Wikipedia Pages

In this project, we'll work with data scraped from [Wikipedia](https://www.wikipedia.org/). Volunteer content contributors and editors maintain Wikipedia by continuously improving content. Anyone can edit Wikipedia (you can read more about how to make an edit [here](https://en.wikipedia.org/wiki/Help:Editing)). Because Wikipedia is crowdsourced, it has rapidly assembled a huge library of articles.

In this guided project, we'll implement a simplified version of the `grep` [command-line utility](https://en.wikipedia.org/wiki/Grep) to search for data in 54 megabytes of articles. If you're not familiar with `grep`, it searches files for lines that contain a given text pattern; here, we'll search all files in a given directory.

Articles were saved using the last component of their URLs. For example, the article at https://en.wikipedia.org/wiki/Yarkant_County is saved to the file `Yarkant_County.html`. All the data files are in the `wiki` folder. Note that the files are raw HTML.

We're going to treat those files as plain text, and we won't rely on any of their specific HTML structure.

Our main goals will be the following:
- Search for all occurrences of a string in all of the files.
- Provide a case-insensitive option to the search.
- Refine the results by providing the specific locations of the matches within the files.

Let's get an overview of the files in the `wiki` folder first.

In [30]:
import os
import math
import functools
from multiprocessing import Pool
import pandas as pd

In [2]:
# List all files 
file_names = os.listdir("wiki")
for name in file_names:
    print(name)

Bay_of_ConcepciC3B3n.html
Bye_My_Boy.html
Valentin_Yanin.html
Kings_XI_Punjab_in_2014.html
William_Harvey_Lillard.html
Radial_Road_3.html
George_Weldrick.html
Zgornji_Otok.html
Blue_Heelers_(season_8).html
Taggen_Nunatak.html
Henri_BraqueniC3A9.html
Vrila.html
William_Henry_Porter.html
Clive_Brown_(footballer).html
Blick_nach_Rechts.html
Central_District_(Rezvanshahr_County).html
Alexios_Aspietes.html
Mei_Lanfang.html
Wangeroogeclass_tug.html
Dowell_Philip_O27Reilly.html
Coalville_Town_railway_station.html
Gennady_Lesun.html
Bartrum_Glacier.html
Victor_S._Mamatey.html
Gottfried_Keller.html
Table_Point_Formation.html
Nobuhiko_Ushiba.html
Master_of_Space_and_Time.html
Early_medieval_states_in_Kazakhstan.html
Eressa_aperiens.html
Myrtle_(sternwheeler).html
Abanycha_bicolor.html
JeecyVea.html
Aubrey_Fair.html
Ingrid_GuimarC3A3es.html
Urban_chicken.html
Elgin_National_Watch_Company.html
AlMidan.html
Antae_temple.html
Metis_Institute_of_Polytechnic.html
Sverre_Solberg.html
John_Reid_(British

In [3]:
# Count the number of files
print(len(file_names))

999


In [4]:
# Contents of the first file
folder_name = "wiki"
file_name = file_names[0]  # "Bay_of_ConcepciC3B3n.html"
with open(os.path.join(folder_name, file_name), 'r', encoding='utf-8') as f:
    lines = f.readlines()

print(lines)

['<!DOCTYPE html>\n', '<html class="client-nojs" lang="en" dir="ltr">\n', '<head>\n', '<meta charset="UTF-8"/>\n', '<title>Bay of Concepción - Wikipedia</title>\n', '<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n', '<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bay_of_Concepción","wgTitle":"Bay of Concepción","wgCurRevisionId":647460156,"wgRevisionId":647460156,"wgArticleId":16044270,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Coordinates on Wikidata","All stub articles","Landforms of Bío Bío Region","Bays of Chile","Bío Bío Region geography stubs"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgD

## Adding the MapReduce Framework

Let's explore the data a little more and count the total number of lines in all files stored in the `wiki` folder. There are several ways to do this. We will use MapReduce.
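For comparison, one of the other ways to count lines is a plain sequential loop. The sketch below is only an illustration: `count_lines_sequential` is a hypothetical helper, and a temporary folder with two tiny files stands in for the real `wiki` folder.

```python
import os
import tempfile

def count_lines_sequential(folder):
    """Count lines across all files in `folder` with a single loop."""
    total = 0
    for name in os.listdir(folder):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            total += sum(1 for _ in f)  # iterate lazily instead of loading the file
    return total

# Hypothetical stand-in for the wiki folder: two tiny files
with tempfile.TemporaryDirectory() as demo_folder:
    for name, text in [("a.html", "one\ntwo\n"), ("b.html", "three\n")]:
        with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_total = count_lines_sequential(demo_folder)

print(demo_total)  # 3
```

The MapReduce version that follows does the same work, but splits the list of files across several processes.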

In [5]:
# Make chunks function
def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]
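To see how `make_chunks` splits its input, here is a small check on toy data (the lists are made up for illustration). It also shows that the function may return fewer chunks than requested, because the chunk size is rounded up.

```python
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

# Ten items split for three workers: chunk size is ceil(10 / 3) = 4
chunks = make_chunks(list(range(10)), 3)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Four items split for three workers: chunk size is 2, so only two chunks come back
small = make_chunks(list(range(4)), 3)
print(len(small))  # 2
```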

In [6]:
# MapReduce function
def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    # The context manager shuts the worker pool down once the map step is done
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
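The data flow of `map_reduce` can be traced without any worker processes. The sketch below swaps `Pool.map` for the built-in `map`; `map_reduce_sequential`, `map_total_length`, and `reduce_sum` are hypothetical names used only for this illustration.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce_sequential(data, num_chunks, mapper, reducer):
    """Same steps as map_reduce, but the built-in map replaces Pool.map."""
    chunks = make_chunks(data, num_chunks)
    chunk_results = list(map(mapper, chunks))  # one result per chunk
    return functools.reduce(reducer, chunk_results)

def map_total_length(words):
    # Mapper: total number of characters in one chunk of words
    return sum(len(word) for word in words)

def reduce_sum(a, b):
    # Reducer: combine two partial totals
    return a + b

words = ["wiki", "grep", "map", "reduce"]
total = map_reduce_sequential(words, 2, map_total_length, reduce_sum)
print(total)  # 17
```

Each chunk is mapped to a partial result, and the reducer folds the partial results into one value. The parallel version differs only in where the mapper calls run.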

In [7]:
# Mapper function to count lines in each chunk
def map_line_count(file_names):
    total = 0
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            total += len(f.readlines())
    return total

In [8]:
# Reducer function to sum the counts
def reduce_line_count(count1, count2):
    return count1 + count2

In [9]:
# Counting the total number of lines using MapReduce
map_reduce(file_names, 8, map_line_count, reduce_line_count)

499797

## Grep Exact Match

Let's start by implementing a first MapReduce grep algorithm. The goal is to locate all lines in all files from the `wiki` folder that contain a given string.

In [10]:
# Mapper function to find a string
def map_find_string(file_names):
    occurences = {}
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            lines = f.readlines()
            for i, line in enumerate(lines):
                if target in line:
                    if fn not in occurences:
                        occurences[fn] = []
                    occurences[fn].append(i)
    return occurences

In [11]:
# Reducer function to merge the results
def reduce_find_string(occurences1, occurences2):
    for fn in occurences2:
        if fn not in occurences1:
            occurences1[fn] = occurences2[fn]
        else:
            occurences1[fn].extend(occurences2[fn])
    return occurences1
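A quick way to check the reducer is to merge two small dictionaries by hand. The file names and line indices below are made up; the reducer itself is copied unchanged from the cell above.

```python
def reduce_find_string(occurences1, occurences2):
    for fn in occurences2:
        if fn not in occurences1:
            occurences1[fn] = occurences2[fn]
        else:
            occurences1[fn].extend(occurences2[fn])
    return occurences1

# Hypothetical partial results from two chunks
left = {"A.html": [1, 4]}
right = {"A.html": [9], "B.html": [2]}

merged = reduce_find_string(left, right)
print(merged)          # {'A.html': [1, 4, 9], 'B.html': [2]}
print(merged is left)  # True: the first dictionary is updated in place
```

Because each file name lands in exactly one chunk, the `else` branch is not expected to run in this project, but it keeps the reducer correct if the same file ever appears in two partial results.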

In [12]:
# Searching for "data" using MapReduce
target = "data"
results = map_reduce(file_names, 8, map_find_string, reduce_find_string)
print(results)  # keys are file names, values are lists of line indices

{'Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205], 'Bye_My_Boy.html': [276, 359, 376], 'Valentin_Yanin.html': [101, 144, 227, 244], 'Kings_XI_Punjab_in_2014.html': [221, 229, 237, 245, 253, 269, 277, 293, 301, 317, 325, 341, 374, 376, 381, 383, 388, 390, 395, 397, 402, 564, 647, 664], 'William_Harvey_Lillard.html': [45, 65, 81, 129, 212, 229], 'Radial_Road_3.html': [52, 103, 301, 505, 588, 605], 'George_Weldrick.html': [194, 277, 294], 'Zgornji_Otok.html': [6, 53, 55, 65, 69, 211, 260, 262, 311, 394, 411], 'Blue_Heelers_(season_8).html': [49, 79, 82, 105, 107, 125, 127, 133, 135, 141, 143, 660, 695, 730, 739, 886, 969, 986], 'Taggen_Nunatak.html': [6, 44, 46, 48, 93, 176, 193], 'Henri_BraqueniC3A9.html': [43, 46, 92, 175, 192], 'Vrila.html': [6, 57, 59, 69, 73, 99, 100, 102, 151, 234, 251], 'William_Henry_Porter.html': [48, 88, 171, 188], 'Clive_Brown_(footballer).html': [146, 229, 246], 'Blick_nach_Rechts.html': [43, 46, 134, 170, 253, 270], 'Central_District_(Rezvansha

## Grep Case Insensitive

Let's improve our grep function by making it case insensitive. This means that the case of the characters in the strings won't matter.
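The idea can be tried on a single line before touching the files. The sketch below uses `str.casefold()`, which is a swapped-in alternative to the `str.lower()` call used in this project; `contains_ci` is a hypothetical helper. For plain ASCII text such as "data" the two methods behave the same, and they differ only for some non-ASCII characters.

```python
def contains_ci(line, target):
    """Return True if target occurs in line, ignoring case."""
    return target.casefold() in line.casefold()

print(contains_ci('<div data-mw="interface">', "DATA"))  # True

# casefold() maps the German sharp s to "ss"; lower() leaves it unchanged
print(contains_ci("Straße", "STRASSE"))        # True
print("STRASSE".lower() in "Straße".lower())   # False
```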

In [13]:
# Mapper function to find a string (case insensitive)
def map_find_string_ci(file_names):
    occurences = {}
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            lines = map(str.lower, f.readlines())  # convert all lines to lower case
            for i, line in enumerate(lines):
                if target.lower() in line:  # convert search string to lower case
                    if fn not in occurences:
                        occurences[fn] = []
                    occurences[fn].append(i)
    return occurences

In [14]:
# Searching for "data" using MapReduce (case insensitive)
target = "data"
results_ci = map_reduce(file_names, 8, map_find_string_ci, reduce_find_string)
print(results_ci)  # keys are file names, values are lists of line indices

{'Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205], 'Bye_My_Boy.html': [276, 359, 376], 'Valentin_Yanin.html': [101, 144, 227, 244], 'Kings_XI_Punjab_in_2014.html': [221, 229, 237, 245, 253, 269, 277, 293, 301, 317, 325, 341, 374, 376, 381, 383, 388, 390, 395, 397, 402, 564, 647, 664], 'William_Harvey_Lillard.html': [45, 65, 81, 129, 212, 229], 'Radial_Road_3.html': [52, 103, 301, 505, 588, 605], 'George_Weldrick.html': [194, 277, 294], 'Zgornji_Otok.html': [6, 53, 55, 65, 69, 211, 260, 262, 311, 394, 411], 'Blue_Heelers_(season_8).html': [49, 79, 82, 105, 107, 125, 127, 133, 135, 141, 143, 660, 695, 730, 739, 886, 969, 986], 'Taggen_Nunatak.html': [6, 44, 46, 48, 93, 176, 193], 'Henri_BraqueniC3A9.html': [43, 46, 92, 175, 192], 'Vrila.html': [6, 57, 59, 69, 73, 99, 100, 102, 151, 234, 251], 'William_Henry_Porter.html': [48, 88, 171, 188], 'Clive_Brown_(footballer).html': [146, 229, 246], 'Blick_nach_Rechts.html': [43, 46, 134, 170, 253, 270], 'Central_District_(Rezvansha

In [15]:
# Searching for "DATA" using MapReduce (case insensitive)
target = "DATA"
results_ci2 = map_reduce(file_names, 8, map_find_string_ci, reduce_find_string)
print(results_ci2)  # keys are file names, values are lists of line indices

{'Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205], 'Bye_My_Boy.html': [276, 359, 376], 'Valentin_Yanin.html': [101, 144, 227, 244], 'Kings_XI_Punjab_in_2014.html': [221, 229, 237, 245, 253, 269, 277, 293, 301, 317, 325, 341, 374, 376, 381, 383, 388, 390, 395, 397, 402, 564, 647, 664], 'William_Harvey_Lillard.html': [45, 65, 81, 129, 212, 229], 'Radial_Road_3.html': [52, 103, 301, 505, 588, 605], 'George_Weldrick.html': [194, 277, 294], 'Zgornji_Otok.html': [6, 53, 55, 65, 69, 211, 260, 262, 311, 394, 411], 'Blue_Heelers_(season_8).html': [49, 79, 82, 105, 107, 125, 127, 133, 135, 141, 143, 660, 695, 730, 739, 886, 969, 986], 'Taggen_Nunatak.html': [6, 44, 46, 48, 93, 176, 193], 'Henri_BraqueniC3A9.html': [43, 46, 92, 175, 192], 'Vrila.html': [6, 57, 59, 69, 73, 99, 100, 102, 151, 234, 251], 'William_Henry_Porter.html': [48, 88, 171, 188], 'Clive_Brown_(footballer).html': [146, 229, 246], 'Blick_nach_Rechts.html': [43, 46, 134, 170, 253, 270], 'Central_District_(Rezvansha

We successfully made the search case-insensitive: we find the same matches whether we pass 'data' or 'DATA'. Let's double-check that.

## Checking the Implementation

Let's verify that the new implementation works by seeing if it finds more matches than the previous implementation.

In [19]:
# Comparing the case-sensitive with the case-insensitive results
def compare(results1, results2):
    for key in results2:
        matches1 = len(results1.get(key, []))
        matches2 = len(results2[key])
        if matches2 > matches1:
            print(f"{key}: {results2[key]}")

compare(results, results_ci)

Table_Point_Formation.html: [68, 69, 70, 71, 72, 80, 83, 85, 97, 110, 112, 160, 243, 260]
Ingrid_GuimarC3A3es.html: [6, 49, 167, 173, 176, 178, 190, 192, 231, 241, 324, 349, 376]
Jules_Verne_ATV.html: [6, 47, 241, 247, 255, 264, 278, 286, 293, 309, 320, 394, 450, 572, 918, 1169, 1253, 1254, 1255, 1256, 1332, 1415, 1440]
Pictogram.html: [44, 47, 57, 133, 141, 156, 198, 202, 206, 210, 214, 217, 228, 238, 248, 258, 268, 278, 288, 298, 351, 357, 394, 397, 2435, 2518, 2543]
Claire_Danes.html: [6, 49, 125, 142, 427, 430, 807, 813, 818, 819, 820, 1664, 1712, 1795, 1820]
PTPRS.html: [49, 58, 148, 151, 322, 325, 326, 362, 366, 377, 387, 389, 419, 736, 738, 787, 870, 887]
A_Beautiful_Valley.html: [49, 177, 216, 299, 316]
Mudramothiram.html: [196, 199, 201, 250, 333, 350]
Gordon_Bau.html: [131, 139, 140, 141, 142, 143, 148, 179, 187, 270, 287, 314]
Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html: [6, 80, 94, 96, 131, 161, 162, 171, 178, 181, 588, 671, 688]
Code_page_1023.html: [142, 533, 557, 1315

In quite a few files, the case-insensitive search finds more matches than the case-sensitive one.
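To put a number on the difference, the comparison can return counts instead of printing lines. This is a sketch on made-up dictionaries; `count_extra_matches` is a hypothetical helper that follows the same logic as `compare`.

```python
def count_extra_matches(results1, results2):
    """Return {file name: extra match count} for files where results2 has more matches."""
    extra = {}
    for key, lines2 in results2.items():
        diff = len(lines2) - len(results1.get(key, []))
        if diff > 0:
            extra[key] = diff
    return extra

# Hypothetical results: case-sensitive first, case-insensitive second
sensitive = {"A.html": [3, 8], "B.html": [5]}
insensitive = {"A.html": [3, 8], "B.html": [5, 7], "C.html": [1]}

extra = count_extra_matches(sensitive, insensitive)
print(extra)                # {'B.html': 1, 'C.html': 1}
print(sum(extra.values()))  # 2
```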

## Finding Match Positions on Lines

Right now, we are only finding the line numbers where there is at least one occurrence. Let's extend the algorithm so that it provides information about the location of the matches in those lines. The current implementation just returns the index of the line. The new implementation returns pairs of indices: the first value is the line index and the second is the position of the match on that line. Note that the mapper below splits each line on whitespace, so the position it records is the index of the word that contains the match, not a character offset.

In [25]:
# Updated mapper function: records the line index and word index of each match (case insensitive)
def map_find_string_ci_updated(file_names):
    occurences = {}
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            lines = map(str.lower, f.readlines())  # convert all lines to lower case
            for i, line in enumerate(lines):
                if target.lower() in line:  # convert search string to lower case
                    for j, word in enumerate(line.split()):
                        if target.lower() in word:
                            position = j
                            if fn not in occurences:
                                occurences[fn] = []
                            occurences[fn].append((i, position))
    return occurences

In [26]:
# Searching for "data" using updated MapReduce function (case insensitive)
target = "data"
results_ci_updated = map_reduce(file_names, 8, map_find_string_ci_updated, reduce_find_string)
print(results_ci_updated)  # keys are file names, values are (line index, word index) pairs

{'Bay_of_ConcepciC3B3n.html': [(6, 4), (45, 14), (45, 15), (58, 21), (58, 38), (60, 1), (62, 12), (62, 13), (105, 3), (105, 40), (105, 43), (105, 45), (188, 70), (188, 74), (188, 78), (205, 5)], 'Bye_My_Boy.html': [(276, 3), (359, 70), (359, 74), (359, 78), (376, 5)], 'Valentin_Yanin.html': [(101, 17), (144, 3), (227, 70), (227, 74), (227, 78), (244, 5)], 'Kings_XI_Punjab_in_2014.html': [(221, 18), (221, 19), (229, 18), (229, 19), (237, 18), (237, 19), (245, 18), (245, 19), (253, 18), (253, 19), (269, 18), (269, 19), (277, 18), (277, 19), (293, 18), (293, 19), (301, 18), (301, 19), (317, 18), (317, 19), (325, 18), (325, 19), (341, 18), (341, 19), (374, 13), (374, 14), (376, 21), (376, 22), (381, 13), (381, 14), (383, 21), (383, 22), (388, 15), (388, 16), (390, 19), (390, 20), (395, 15), (395, 16), (397, 19), (397, 20), (402, 13), (402, 14), (564, 3), (647, 70), (647, 74), (647, 78), (664, 5)], 'William_Harvey_Lillard.html': [(45, 10), (45, 11), (65, 9), (81, 9), (129, 3), (212, 70), (2
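The positions in the output above are word indices, because the mapper enumerates `line.split()`. If character offsets are wanted instead, one option is to call `str.find` repeatedly. The sketch below is an alternative, not the implementation used for the results in this project; `find_all_ci` is a hypothetical helper.

```python
def find_all_ci(line, target):
    """Return the character offsets of every case-insensitive match of target in line."""
    haystack = line.lower()
    needle = target.lower()
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found
        start = haystack.find(needle, start + 1)
    return positions

line = '<div data-mw="interface" class="Data">'
offsets = find_all_ci(line, "data")
print(offsets)  # [5, 32]
```

The offsets refer to the lowercased line, which has the same length as the original for ASCII text.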

## Displaying the Results

Our grep algorithm can now find all matches. However, the dictionary it produces doesn't make those matches easy to read.
Let's collect the results in a DataFrame and write them to a CSV file.

In [33]:
# Updated mapper function to find a string (case insensitive) with context
def map_find_string_ci_updated(file_names):
    occurences = []
    for fn in file_names:
        with open(os.path.join("wiki", fn)) as f:
            lines = map(str.lower, f.readlines())  # convert all lines to lower case
            for i, line in enumerate(lines):
                if target.lower() in line:  # convert search string to lower case
                    for j, word in enumerate(line.split()):
                        if target.lower() in word:
                            # j is a word index but is used as a character offset here,
                            # so the snippet is a rough excerpt from the early part of
                            # the line rather than a window centred on the match
                            start = max(0, j - 5)
                            end = min(len(line), j + len(word) + 5)
                            context = line[start:end].strip()
                            occurences.append((fn, i, j, context))
    return occurences

In [28]:
# Reducer function to merge the results
def reduce_find_string(occurences1, occurences2):
    return occurences1 + occurences2

In [29]:
# Searching for "data" using MapReduce
target = "data"
results = map_reduce(file_names, 8, map_find_string_ci_updated, reduce_find_string)

In [34]:
# Print as a dataframe
results_df = pd.DataFrame(results, columns=['File', 'Line', 'Index', 'Context'])
results_df

| | File | Line | Index | Context |
|---|---|---|---|---|
| 0 | Bay_of_ConcepciC3B3n.html | 6 | 4 | `<script>(window.rlq=window.rlq\|\|[]).push(funct...` |
| 1 | Bay_of_ConcepciC3B3n.html | 45 | 14 | `<div class="thumbinner" style="width:202px;"><...` |
| 2 | Bay_of_ConcepciC3B3n.html | 45 | 15 | `<div class="thumbinner" style="width:202px;"><...` |
| 3 | Bay_of_ConcepciC3B3n.html | 58 | 21 | `<p><span style="font-size: small;"><span id="c...` |
| 4 | Bay_of_ConcepciC3B3n.html | 58 | 38 | `<p><span style="font-size: small;"><span id="c...` |
| ... | ... | ... | ... | ... |
| 20512 | William_McDonald_(Australian_politician).html | 117 | 3 | `<div id="catlinks" class="catlinks" data-mw="i...` |
| 20513 | William_McDonald_(Australian_politician).html | 200 | 70 | `<li id="t-whatlinkshere"><a href="/wiki/specia...` |
| 20514 | William_McDonald_(Australian_politician).html | 200 | 74 | `<li id="t-whatlinkshere"><a href="/wiki/specia...` |
| 20515 | William_McDonald_(Australian_politician).html | 200 | 78 | `<li id="t-whatlinkshere"><a href="/wiki/specia...` |


In [35]:
# Export to CSV
results_df.to_csv('results.csv', index=False)
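The context strings contain commas and double quotes, so they have to be quoted in the CSV file. The sketch below shows that quoting with the standard library `csv` module on one made-up row, written to an in-memory buffer instead of `results.csv`. As far as I know, `to_csv` applies the same style of quoting by default, but treat that as an assumption and check the pandas documentation.

```python
import csv
import io

# Hypothetical row in the same (File, Line, Index, Context) layout
rows = [("Example.html", 3, 1, '<div id="catlinks", data-mw="x">')]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["File", "Line", "Index", "Context"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)  # the context field is wrapped in quotes, inner quotes are doubled

# Reading the text back recovers the original context string
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back[1][3] == rows[0][3])  # True
```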