# Analyzing Wikipedia Pages

In this project, we implement a simplified version of the grep command-line utility (the name comes from the `ed` command `g/re/p`: globally search for a regular expression and print the matching lines) to search 54 megabytes of Wikipedia articles stored as HTML files. The main goals of the project are:
* Search for all occurrences of a string in all of the files.
* Provide a case-insensitive option for the search.
* Refine the results by reporting where in each file the matches occur.

# Initial Exploration of the Directory

In [1]:
# importing the necessary libraries
import os

In [2]:
# listing the filenames in the `wiki` folder
file_names = os.listdir("wiki")
file_names

['Bay_of_ConcepciC3B3n.html',
 'Bye_My_Boy.html',
 'Valentin_Yanin.html',
 'Kings_XI_Punjab_in_2014.html',
 'William_Harvey_Lillard.html',
 'Radial_Road_3.html',
 'George_Weldrick.html',
 'Zgornji_Otok.html',
 'Blue_Heelers_(season_8).html',
 'Taggen_Nunatak.html',
 'Henri_BraqueniC3A9.html',
 'Vrila.html',
 'William_Henry_Porter.html',
 'Clive_Brown_(footballer).html',
 'Blick_nach_Rechts.html',
 'Central_District_(Rezvanshahr_County).html',
 'Alexios_Aspietes.html',
 'Mei_Lanfang.html',
 'Wangeroogeclass_tug.html',
 'Dowell_Philip_O27Reilly.html',
 'Coalville_Town_railway_station.html',
 'Gennady_Lesun.html',
 'Bartrum_Glacier.html',
 'Victor_S._Mamatey.html',
 'Gottfried_Keller.html',
 'Table_Point_Formation.html',
 'Nobuhiko_Ushiba.html',
 'Master_of_Space_and_Time.html',
 'Early_medieval_states_in_Kazakhstan.html',
 'Eressa_aperiens.html',
 'Myrtle_(sternwheeler).html',
 'Abanycha_bicolor.html',
 'JeecyVea.html',
 'Aubrey_Fair.html',
 'Ingrid_GuimarC3A3es.html',
 'Urban_chicken.ht

In [3]:
# Count and display number of files in the `wiki` folder
len(file_names)

999

In [4]:
# reading the lines of the first file into a list
with open(os.path.join("wiki", file_names[0])) as f:
    lines = f.readlines()

# Adding the MapReduce Framework

In this project, we use a MapReduce approach, implemented by the `map_reduce` function in the code block below.

In [5]:
import math
import functools
from multiprocessing import Pool

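def make_chunks(data, num_chunks):
    # Round the chunk size up so that every item lands in some chunk
    chunk_size = math.ceil(len(data)/num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    # Split the data into roughly one chunk per worker process
    chunks = make_chunks(data, num_processes)
    # Apply the mapper to each chunk in parallel
    with Pool(num_processes) as pool:
        results = pool.map(mapper, chunks)
    # Combine the per-chunk results pairwise into a single result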
    return functools.reduce(reducer, results)
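
To get a feel for how `make_chunks` splits the work, here is a small sketch on toy inputs. The list `toy_files` and the sequential stand-in for the map and reduce steps are illustrative assumptions, not part of the project code; `Pool` is left out so that the example runs anywhere.

```python
import functools
import math

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Hypothetical file list, for illustration only
toy_files = ["a.html", "b.html", "c.html", "d.html", "e.html"]
chunks = make_chunks(toy_files, 2)
print(chunks)  # [['a.html', 'b.html', 'c.html'], ['d.html', 'e.html']]

# With 999 items and 10 chunks, the chunk size is ceil(999 / 10) = 100
sizes = [len(chunk) for chunk in make_chunks(list(range(999)), 10)]
print(sizes)  # nine chunks of 100 followed by one chunk of 99

# Sequential stand-in for pool.map followed by functools.reduce
mapped = [len(chunk) for chunk in chunks]             # "map": count items per chunk
total = functools.reduce(lambda a, b: a + b, mapped)  # "reduce": add the counts
print(total)  # 5
```

Because the chunk size is rounded up, the last chunk can be shorter than the others, and very small inputs can yield fewer chunks than `num_chunks`.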

Let's attempt to count the total number of lines using the MapReduce framework.

In [6]:
# Counting the number of lines in all files

# Defining the mapper function
def mapper_count_lines(chunk):
    num_lines = 0
    for file in chunk:
        with open(os.path.join("wiki", file)) as f:
            num_lines += len(f.readlines())
    return num_lines

# Defining the reducer function
def reducer_count_lines(num1, num2):
    return num1 + num2

In [7]:
total_num_lines = map_reduce(file_names, 10, mapper_count_lines, reducer_count_lines)
total_num_lines

499797

# Grep Exact Match

We now implement a MapReduce grep whose goal is to locate every line, across all files in the `wiki` folder, that contains a given string.

In [8]:
# Setting the target string. The mapper reads this module-level variable,
# so it must be set before calling map_reduce.

target = "data"

# Defining the mapper function
def mapper_grep_exact(chunk):
    result = {}
    for file in chunk:
        with open(os.path.join("wiki", file)) as f:
            lines = f.readlines()
        
        for index, line in enumerate(lines):
            if target in line:
                if file not in result:
                    result[file] = []
                result[file].append(index)
    return result

# Defining the reducer function
def reducer_grep_exact(result1, result2):
    result1.update(result2)
    return result1
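
The reducer above merges dictionaries with `update`, which is safe here because each file name belongs to exactly one chunk, so the keys of two partial results never collide. The sketch below shows what would happen if they did, and one possible alternative. The name `reducer_merge_lists` and the toy dictionaries are hypothetical, introduced only for illustration.

```python
def reducer_update(result1, result2):
    # Same logic as reducer_grep_exact: keys from result2 overwrite result1
    result1.update(result2)
    return result1

def reducer_merge_lists(result1, result2):
    # Hypothetical alternative: concatenate the lists when a key is in both
    merged = {name: list(hits) for name, hits in result1.items()}
    for name, hits in result2.items():
        merged.setdefault(name, []).extend(hits)
    return merged

a = {"A.html": [1, 2]}
b = {"B.html": [7]}
c = {"A.html": [9]}

disjoint = reducer_update(dict(a), b)
overlap_update = reducer_update(dict(a), c)
overlap_merge = reducer_merge_lists(a, c)

print(disjoint)        # {'A.html': [1, 2], 'B.html': [7]}
print(overlap_update)  # {'A.html': [9]}  (earlier matches are lost)
print(overlap_merge)   # {'A.html': [1, 2, 9]}
```

Since `make_chunks` never places the same file in two chunks, the simpler `update` version is enough for this project.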

In [9]:
grep_exact_results = map_reduce(file_names, 10, mapper_grep_exact, reducer_grep_exact)
grep_exact_results

{'Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205],
 'Bye_My_Boy.html': [276, 359, 376],
 'Valentin_Yanin.html': [101, 144, 227, 244],
 'Kings_XI_Punjab_in_2014.html': [221,
  229,
  237,
  245,
  253,
  269,
  277,
  293,
  301,
  317,
  325,
  341,
  374,
  376,
  381,
  383,
  388,
  390,
  395,
  397,
  402,
  564,
  647,
  664],
 'William_Harvey_Lillard.html': [45, 65, 81, 129, 212, 229],
 'Radial_Road_3.html': [52, 103, 301, 505, 588, 605],
 'George_Weldrick.html': [194, 277, 294],
 'Zgornji_Otok.html': [6, 53, 55, 65, 69, 211, 260, 262, 311, 394, 411],
 'Blue_Heelers_(season_8).html': [49,
  79,
  82,
  105,
  107,
  125,
  127,
  133,
  135,
  141,
  143,
  660,
  695,
  730,
  739,
  886,
  969,
  986],
 'Taggen_Nunatak.html': [6, 44, 46, 48, 93, 176, 193],
 'Henri_BraqueniC3A9.html': [43, 46, 92, 175, 192],
 'Vrila.html': [6, 57, 59, 69, 73, 99, 100, 102, 151, 234, 251],
 'William_Henry_Porter.html': [48, 88, 171, 188],
 'Clive_Brown_(footballer).html': [146, 22

# Grep Case Insensitive

Let's improve on our previous MapReduce implementation of exact-match grep by making it case insensitive. This can be achieved by converting both the target and each line to lowercase inside the mapper, so that the case of the characters no longer matters.
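
As a quick illustration of the idea, the sketch below compares an exact test with a lowercased one on a made-up line. It also shows `str.casefold()`, a standard string method that is more aggressive than `lower()` for some non-ASCII text; the notebook itself uses `lower()`, so the `casefold()` part is only an aside.

```python
line = "Big Data and metadata"  # made-up sample line
target = "dAtA"

exact_hit = target in line
insensitive_hit = target.lower() in line.lower()
print(exact_hit)        # False: the exact test is case sensitive
print(insensitive_hit)  # True

# lower() leaves the German sharp s unchanged, casefold() maps it to "ss"
lower_equal = "STRASSE".lower() == "Straße".lower()
casefold_equal = "STRASSE".casefold() == "Straße".casefold()
print(lower_equal)     # False
print(casefold_equal)  # True
```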

In [10]:
# Setting the target string. The mapper reads this module-level variable,
# so it must be set before calling map_reduce.

target = "dAtA"

# Defining the mapper function
def mapper_grep_case_insen(chunk):
    result = {}
    for file in chunk:
        with open(os.path.join("wiki", file)) as f:
            lines = f.readlines()
        
        for index, line in enumerate(lines):
            if target.lower() in line.lower():
                if file not in result:
                    result[file] = []
                result[file].append(index)
    return result

# Defining the reducer function
def reducer_grep_case_insen(result1, result2):
    result1.update(result2)
    return result1
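
Both mappers so far read `target` from a module-level variable, which means it has to be reassigned before each search. One possible alternative, sketched below, is to pass the target as an argument and bind it with `functools.partial`. The function `mapper_grep`, its `ignore_case` flag, and the in-memory `toy_chunk` (pairs of file name and lines, instead of files on disk) are assumptions made to keep the example self-contained.

```python
import functools

def mapper_grep(chunk, target, ignore_case=False):
    """Hypothetical mapper; chunk is a list of (file_name, lines) pairs."""
    needle = target.lower() if ignore_case else target
    result = {}
    for file_name, lines in chunk:
        for index, line in enumerate(lines):
            haystack = line.lower() if ignore_case else line
            if needle in haystack:
                result.setdefault(file_name, []).append(index)
    return result

# Toy stand-in for the contents of two files
toy_chunk = [
    ("A.html", ["Data first", "no match", "more data"]),
    ("B.html", ["nothing here"]),
]

# Bind the search parameters up front; the result takes only the chunk
exact = functools.partial(mapper_grep, target="data")
insensitive = functools.partial(mapper_grep, target="dAtA", ignore_case=True)

exact_result = exact(toy_chunk)
insensitive_result = insensitive(toy_chunk)
print(exact_result)        # {'A.html': [2]}
print(insensitive_result)  # {'A.html': [0, 2]}
```

A partial of a module-level function can be pickled, so an object like `exact` should also be usable as the `mapper` argument of `map_reduce`, although that is not tested here.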

In [11]:
grep_case_insen_results = map_reduce(file_names, 10, mapper_grep_case_insen, reducer_grep_case_insen)
grep_case_insen_results

{'Bay_of_ConcepciC3B3n.html': [6, 45, 58, 60, 62, 105, 188, 205],
 'Bye_My_Boy.html': [276, 359, 376],
 'Valentin_Yanin.html': [101, 144, 227, 244],
 'Kings_XI_Punjab_in_2014.html': [221,
  229,
  237,
  245,
  253,
  269,
  277,
  293,
  301,
  317,
  325,
  341,
  374,
  376,
  381,
  383,
  388,
  390,
  395,
  397,
  402,
  564,
  647,
  664],
 'William_Harvey_Lillard.html': [45, 65, 81, 129, 212, 229],
 'Radial_Road_3.html': [52, 103, 301, 505, 588, 605],
 'George_Weldrick.html': [194, 277, 294],
 'Zgornji_Otok.html': [6, 53, 55, 65, 69, 211, 260, 262, 311, 394, 411],
 'Blue_Heelers_(season_8).html': [49,
  79,
  82,
  105,
  107,
  125,
  127,
  133,
  135,
  141,
  143,
  660,
  695,
  730,
  739,
  886,
  969,
  986],
 'Taggen_Nunatak.html': [6, 44, 46, 48, 93, 176, 193],
 'Henri_BraqueniC3A9.html': [43, 46, 92, 175, 192],
 'Vrila.html': [6, 57, 59, 69, 73, 99, 100, 102, 151, 234, 251],
 'William_Henry_Porter.html': [48, 88, 171, 188],
 'Clive_Brown_(footballer).html': [146, 22

# Checking the Implementation

Let's verify that the new implementation works by seeing if it finds more matches than the previous implementation.

In [12]:
for file in grep_case_insen_results:
    if file not in grep_exact_results:
        print("Found {} new matches on file {}".format(len(grep_case_insen_results[file]), file))
    elif len(grep_case_insen_results[file]) > len(grep_exact_results[file]):
        print("Found {} new matches on file {}".format(len(grep_case_insen_results[file]) - len(grep_exact_results[file]), file))
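
The check above compares the number of matches per file. A slightly stricter variant, sketched below on toy data, uses set differences to report which line indexes were found only by the case-insensitive search. The helper `new_match_lines` and the two toy dictionaries are hypothetical and stand in for `grep_exact_results` and `grep_case_insen_results`.

```python
def new_match_lines(exact_results, insensitive_results):
    """Map each file to the sorted line indexes found only case-insensitively."""
    new_lines = {}
    for file, lines in insensitive_results.items():
        extra = sorted(set(lines) - set(exact_results.get(file, [])))
        if extra:
            new_lines[file] = extra
    return new_lines

# Toy results in the same shape as the grep dictionaries
toy_exact = {"A.html": [3, 10], "B.html": [7]}
toy_insensitive = {"A.html": [3, 5, 10], "B.html": [7], "C.html": [1]}

new_lines = new_match_lines(toy_exact, toy_insensitive)
print(new_lines)  # {'A.html': [5], 'C.html': [1]}
```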

Found 1 new matches on file Table_Point_Formation.html
Found 1 new matches on file Ingrid_GuimarC3A3es.html
Found 2 new matches on file Jules_Verne_ATV.html
Found 1 new matches on file Pictogram.html
Found 2 new matches on file Claire_Danes.html
Found 1 new matches on file PTPRS.html
Found 1 new matches on file A_Beautiful_Valley.html
Found 1 new matches on file Mudramothiram.html
Found 2 new matches on file Gordon_Bau.html
Found 1 new matches on file Embraer_Unidade_GaviC3A3o_Peixoto_Airport.html
Found 3 new matches on file Code_page_1023.html
Found 1 new matches on file Cryptographic_primitive.html
Found 1 new matches on file Alex_Kurtzman.html
Found 1 new matches on file Filip_Pyrochta.html
Found 1 new matches on file Morgana_King.html
Found 1 new matches on file Don_Parsons_(ice_hockey).html
Found 1 new matches on file Bias.html
Found 2 new matches on file Tomohiko_ItC58D_(director).html
Found 1 new matches on file Imperial_Venus_(film).html
Found 1 new matches on file Camp_Nelson_

# Finding Match Positions on Lines

Let's further improve the implementation by reporting not only the index of each matching line within the file but also the character position of every occurrence of the target within that line. Each match is returned as a tuple of the form `(Line Index, Index on the Line)`.

In [13]:
# Setting the target string. The mapper reads this module-level variable,
# so it must be set before calling map_reduce.

target = "dAtA"

# Defining a function to find the index positions within the line for the target string
def find_match_indexes(line, target_string):
    matched_results = []
    i = line.find(target_string, 0)
    while i != -1:
        matched_results.append(i)
        i = line.find(target_string, i + 1)
    return matched_results

# Defining the mapper function
def mapper_grep_match_pos(chunk):
    result = {}
    for file in chunk:
        with open(os.path.join("wiki", file)) as f:
            lines = f.readlines()
        
        for line_index, line in enumerate(lines):
            # Case-insensitive search; an empty list means no match on this line
            match_indexes = find_match_indexes(line.lower(), target.lower())
            if match_indexes:
                if file not in result:
                    result[file] = []
                result[file] += [(line_index, match_index) for match_index in match_indexes]
    return result

# Defining the reducer function
def reducer_grep_match_pos(result1, result2):
    result1.update(result2)
    return result1
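
To see what `find_match_indexes` returns, here is a small self-contained check on made-up strings. It also compares the function with a variant based on `re.finditer`, which is an alternative not used in this notebook; the name `find_match_indexes_re` is hypothetical. The two differ on overlapping occurrences: the `find` loop advances by one character and reports them, while `finditer` returns non-overlapping matches only.

```python
import re

def find_match_indexes(line, target_string):
    # Same logic as in the notebook: restart the search one character later
    matched_results = []
    i = line.find(target_string, 0)
    while i != -1:
        matched_results.append(i)
        i = line.find(target_string, i + 1)
    return matched_results

def find_match_indexes_re(line, target_string):
    # Hypothetical alternative: escape the target so it is matched literally
    pattern = re.compile(re.escape(target_string), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(line)]

line = "Big Data, more data"  # made-up sample line
print(find_match_indexes(line.lower(), "data"))  # [4, 15]
print(find_match_indexes_re(line, "dAtA"))       # [4, 15]

# Overlapping occurrences are where the two approaches disagree
print(find_match_indexes("aaaa", "aa"))     # [0, 1, 2]
print(find_match_indexes_re("aaaa", "aa"))  # [0, 2]
```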

In [14]:
grep_match_pos_results = map_reduce(file_names, 10, mapper_grep_match_pos, reducer_grep_match_pos)
grep_match_pos_results

{'Bay_of_ConcepciC3B3n.html': [(6, 422),
  (45, 628),
  (45, 650),
  (58, 447),
  (58, 692),
  (60, 18),
  (62, 568),
  (62, 590),
  (105, 40),
  (105, 748),
  (105, 789),
  (105, 814),
  (188, 1039),
  (188, 1088),
  (188, 1132),
  (205, 125)],
 'Bye_My_Boy.html': [(276, 40),
  (359, 999),
  (359, 1048),
  (359, 1092),
  (376, 125)],
 'Valentin_Yanin.html': [(101, 323),
  (101, 360),
  (144, 40),
  (227, 1007),
  (227, 1056),
  (227, 1100),
  (244, 125)],
 'Kings_XI_Punjab_in_2014.html': [(221, 487),
  (221, 510),
  (229, 487),
  (229, 510),
  (237, 485),
  (237, 508),
  (245, 449),
  (245, 472),
  (253, 451),
  (253, 474),
  (269, 485),
  (269, 508),
  (277, 449),
  (277, 472),
  (293, 451),
  (293, 474),
  (301, 451),
  (301, 474),
  (317, 485),
  (317, 508),
  (325, 451),
  (325, 474),
  (341, 449),
  (341, 472),
  (374, 459),
  (374, 482),
  (376, 498),
  (376, 521),
  (381, 465),
  (381, 488),
  (383, 525),
  (383, 547),
  (388, 492),
  (388, 515),
  (390, 446),
  (390, 469),
  (

In [15]:
target = "science"
occurrences = map_reduce(file_names, 10, mapper_grep_match_pos, reducer_grep_match_pos)
occurrences

{'Valentin_Yanin.html': [(6, 840),
  (6, 890),
  (66, 90),
  (66, 145),
  (66, 173),
  (144, 1440),
  (144, 1502),
  (144, 1548),
  (144, 1632),
  (144, 1697),
  (144, 1746)],
 'William_Harvey_Lillard.html': [(80, 166)],
 'Victor_S._Mamatey.html': [(48, 682), (48, 728), (48, 767)],
 'Table_Point_Formation.html': [(68, 907), (68, 937), (68, 953)],
 'Master_of_Space_and_Time.html': [(6, 499),
  (6, 554),
  (45, 531),
  (61, 49),
  (61, 99),
  (61, 122),
  (109, 342),
  (109, 391),
  (109, 424),
  (109, 609),
  (109, 660),
  (109, 695)],
 'Urban_chicken.html': [(105, 256)],
 'AlMidan.html': [(277, 48),
  (277, 108),
  (277, 161),
  (281, 48),
  (281, 128),
  (281, 181)],
 'Jules_Verne_ATV.html': [(208, 507),
  (208, 551),
  (208, 568),
  (427, 231),
  (427, 427),
  (427, 831),
  (427, 971),
  (941, 60),
  (941, 127),
  (982, 29),
  (982, 63),
  (982, 90),
  (1007, 33),
  (1033, 43)],
 'Pictogram.html': [(491, 47),
  (491, 92),
  (491, 114),
  (497, 27),
  (497, 51),
  (497, 68),
  (499, 3

# Displaying the Results

Let's display the results. We will create a CSV file listing all occurrences. We will also show the text around each occurrence.

In [16]:
import csv

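# How many characters to show before and after the match
context_delta = 30

# newline="" is what the csv module recommends for files passed to csv.writer
with open("results.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    rows = [["File", "Line", "Index", "Context"]]
    for fn in occurrences:
        # Use a separate name so the CSV file handle is not shadowed
        with open(os.path.join("wiki", fn)) as html_file:
            # Remove only the trailing newline: stripping leading whitespace
            # would shift the match indexes computed on the original lines
            lines = [line.rstrip("\n") for line in html_file.readlines()]
        for line_index, index in occurrences[fn]:
            start = max(index - context_delta, 0)
            end   = index + len(target) + context_delta
            rows.append([os.path.join("wiki", fn), line_index, index, lines[line_index][start:end]])
    writer.writerows(rows)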
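
The context extraction is plain string slicing around the match. The sketch below isolates that step on a made-up line; the helper `context_window` is hypothetical and simply wraps the `start` and `end` arithmetic used above.

```python
def context_window(line, index, target, context_delta):
    # Clamp the start at 0; a negative start would slice from the end of the line
    start = max(index - context_delta, 0)
    # Slicing past the end of a string is safe, so the end needs no clamping
    end = index + len(target) + context_delta
    return line[start:end]

line = "Full member of the Academy of Sciences and author."  # made-up line
target = "science"
index = line.lower().find(target)
print(index)  # 30

narrow = context_window(line, index, target, 10)
wide = context_window(line, index, target, 40)
print(narrow)  # cademy of Sciences and auth
print(wide)    # the whole line, since the window exceeds both ends
```

Without the `max(..., 0)` clamp, a match near the start of a line would produce a negative slice start, which Python interprets as counting from the end of the string.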

In [17]:
import pandas
df = pandas.read_csv("results.csv")
df.head(10)

,File,Line,Index,Context
0,wiki/Valentin_Yanin.html,6,840,"embers of the USSR Academy of Sciences"",""Full ..."
1,wiki/Valentin_Yanin.html,6,890,"ers of the Russian Academy of Sciences"",""Demid..."
2,wiki/Valentin_Yanin.html,66,90,"href=""/wiki/Soviet_Academy_of_Sciences"" class=..."
3,wiki/Valentin_Yanin.html,66,145,"ect"" title=""Soviet Academy of Sciences"">Soviet..."
4,wiki/Valentin_Yanin.html,66,173,"f Sciences"">Soviet Academy of Sciences</a>; he..."
5,wiki/Valentin_Yanin.html,144,1440,"rs_of_the_USSR_Academy_of_Sciences"" title=""Cat..."
6,wiki/Valentin_Yanin.html,144,1502,"rs of the USSR Academy of Sciences"">Full Membe..."
7,wiki/Valentin_Yanin.html,144,1548,rs of the USSR Academy of Sciences</a></li><li...
8,wiki/Valentin_Yanin.html,144,1632,"of_the_Russian_Academy_of_Sciences"" title=""Cat..."
9,wiki/Valentin_Yanin.html,144,1697,"of the Russian Academy of Sciences"">Full Membe..."
