# Analyze Wikipedia Pages

Because Wikipedia is crowdsourced, it has rapidly assembled a huge library of articles. In this guided project, we'll implement a simplified version of the grep command-line utility to search 54 megabytes of articles. All the data files are in the wiki folder. We will provide a case-insensitive search for all occurrences of a string in all of the files.

List all of the files in the wiki folder.
Count and display the number of files in the wiki folder.
Read the first file in the wiki folder, and print its contents.

## List all files

In [22]:
import os
os.listdir("wiki")

['100_Greatest_Romanians.html',
 '104th_Logistic_Support_Brigade_(United_Kingdom).html',
 '16th_Virginia_Infantry.html',
 '1896_Indiana_Hoosiers_football_team.html',
 '1898_Colgate_football_team.html',
 '1910_in_literature.html',
 '1915_Montana_football_team.html',
 '1951_National_League_tiebreaker_series.html',
 '1953E2809354_FA_Cup_qualifying_rounds.html',
 '1958_Wightman_Cup.html',
 '1988_State_of_Origin_series.html',
 '1st_Strategic_Aerospace_Division.html',
 '2001_Australian_Individual_Speedway_Championship.html',
 '2001_NCAA_Division_I_Field_Hockey_Championship.html',
 '2004_Tuvalu_ADivision.html',
 '2005E2809306_in_Welsh_football.html',
 '2007E2809308_Huddersfield_Town_A.F.C._season.html',
 '2008_Fed_Cup_World_Group_II.html',
 '2009_English_cricket_season.html',
 '2009_World_Junior_Ice_Hockey_Championships_rosters.html',
 '2010_Karshi_Challenger_E28093_Singles.html',
 '2011E2809312_Western_Collegiate_Hockey_Association_women27s_ice_hockey_season.html',
 '2011_ITU_Duathlon_World_

In [23]:
len(os.listdir("wiki"))

999

## Overview of the first file

In [24]:
filenames = os.listdir("wiki")

In [25]:
foldername = "wiki"
with open(os.path.join(foldername, filenames[0]), encoding='utf-8') as fp:
    print(fp.read())

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>100 Greatest Romanians - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"100_Greatest_Romanians","wgTitle":"100 Greatest Romanians","wgCurRevisionId":739997309,"wgRevisionId":739997309,"wgArticleId":5885981,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from November 2012","Articles containing Romanian-language text","Greatest Nationals","Lists of Romanian people","Romanian Television","Romanian television series"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"w

## Count the total number of lines in all files
We use Python's multiprocessing and functools libraries to speed up processing.

In [26]:
import math
import functools
from multiprocessing import Pool

def make_chunks(data, num_chunks):
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i+chunk_size] for i in range(0, len(data), chunk_size)]

def map_reduce(data, num_processes, mapper, reducer):
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool: 
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
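
To make the chunking concrete, here is a minimal sketch that repeats make_chunks from the cell above and runs the same map and reduce steps serially on toy data. The helpers sum_chunk and add are hypothetical and exist only for this illustration; map_reduce does the same thing, except that pool.map runs the mapper on the chunks in parallel.

```python
import functools
import math

def make_chunks(data, num_chunks):
    # Same helper as above: ceil-sized slices, so the last chunk may be shorter.
    chunk_size = math.ceil(len(data) / num_chunks)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def sum_chunk(chunk):
    # Hypothetical mapper: turn one chunk into a partial result.
    return sum(chunk)

def add(x, y):
    # Hypothetical reducer: combine two partial results.
    return x + y

chunks = make_chunks(list(range(10)), 4)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# Serial equivalent of map_reduce: map each chunk, then fold the results.
partials = [sum_chunk(c) for c in chunks]
total = functools.reduce(add, partials)
print(partials, total)  # [3, 12, 21, 9] 45
```

With the 999 files and 4 processes used in this project, the chunk size is ceil(999 / 4) = 250, so the chunks hold 250, 250, 250, and 249 filenames.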

In [27]:
import mapper_reducer
overall_lines = map_reduce(filenames, 4, mapper_reducer.map_lines_count, mapper_reducer.reduce_lines_count)
print(overall_lines)

499797
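
The mapper_reducer module is not shown in this notebook. The sketch below is one way its line-counting functions could look; the function bodies and the folder parameter are assumptions, not the module's actual code. Keeping such functions in an importable module matters on platforms where multiprocessing starts workers with the spawn method (the default on Windows), because each worker has to import the mapper by name.

```python
import os
import tempfile

def map_lines_count(filenames, folder="wiki"):
    # Mapper (assumed body): total number of lines across one chunk of files.
    total = 0
    for name in filenames:
        with open(os.path.join(folder, name), encoding="utf-8") as fp:
            total += sum(1 for _ in fp)
    return total

def reduce_lines_count(count1, count2):
    # Reducer (assumed body): combine the counts of two chunks.
    return count1 + count2

# Small self-check on throwaway files, not on the wiki data.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.html", "one\ntwo\n"), ("b.html", "three\n")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fp:
            fp.write(text)
    demo_total = reduce_lines_count(
        map_lines_count(["a.html"], folder=tmp),
        map_lines_count(["b.html"], folder=tmp),
    )
print(demo_total)  # 3
```

Because pool.map passes only the chunk to the mapper, the folder argument falls back to its default when the function is used with map_reduce.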


## Find all occurrences of the string "data" in the files

Case-sensitive string search.

In [28]:
overall_locs = map_reduce(filenames, 4, mapper_reducer.map_word_locs_chunk, mapper_reducer.reduce_word_locs)
print(overall_locs)

{'100_Greatest_Romanians.html': [45, 52, 261, 344, 361], '104th_Logistic_Support_Brigade_(United_Kingdom).html': [49, 57, 61, 170, 253, 270], '16th_Virginia_Infantry.html': [49, 63, 94, 98, 102, 104, 146, 153, 236, 253, 280], '1896_Indiana_Hoosiers_football_team.html': [153, 159, 420, 503, 520], '1898_Colgate_football_team.html': [49, 133, 403, 486, 503], '1910_in_literature.html': [353, 436, 461], '1915_Montana_football_team.html': [49, 562, 645, 662], '1951_National_League_tiebreaker_series.html': [36, 48, 1697, 1790, 1873, 1890], '1953E2809354_FA_Cup_qualifying_rounds.html': [47, 48, 3292, 3375, 3392], '1958_Wightman_Cup.html': [46, 52, 70, 71, 83, 84, 96, 97, 109, 110, 123, 124, 137, 138, 150, 151, 196, 197, 366, 449, 466], '1988_State_of_Origin_series.html': [49, 65, 69, 117, 119, 158, 160, 195, 197, 241, 242, 246, 247, 251, 255, 259, 263, 264, 265, 269, 270, 274, 275, 276, 280, 281, 285, 289, 290, 294, 298, 299, 303, 304, 305, 309, 310, 311, 315, 331, 335, 339, 340, 344, 345, 346
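
For readers without access to mapper_reducer, the following is a hedged sketch of how a line-location mapper and its reducer might be written. The bodies, the folder and target parameters, and the ignore_case flag are all assumptions made for illustration. Line indexes are zero-based here, which matches how they are later used with readlines() in this notebook.

```python
import os
import tempfile

def map_word_locs_chunk(filenames, folder="wiki", target="data", ignore_case=False):
    # Mapper (assumed body): for each file, list the indexes of lines containing target.
    needle = target.lower() if ignore_case else target
    results = {}
    for name in filenames:
        with open(os.path.join(folder, name), encoding="utf-8") as fp:
            results[name] = [
                i for i, line in enumerate(fp)
                if needle in (line.lower() if ignore_case else line)
            ]
    return results

def reduce_word_locs(locs1, locs2):
    # Reducer (assumed body): chunks hold disjoint files, so merging the dicts is enough.
    merged = dict(locs1)
    merged.update(locs2)
    return merged

# Small self-check on a throwaway file, not on the wiki data.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.html"), "w", encoding="utf-8") as fp:
        fp.write("Data first\nnothing here\nmore data\n")
    sensitive = map_word_locs_chunk(["demo.html"], folder=tmp)
    insensitive = map_word_locs_chunk(["demo.html"], folder=tmp, ignore_case=True)
print(sensitive)    # {'demo.html': [2]}
print(insensitive)  # {'demo.html': [0, 2]}
```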

Case-insensitive string search.

In [29]:
overall_locs_caseins = map_reduce(filenames, 4, mapper_reducer.map_word_locs_chunk_case_insen, mapper_reducer.reduce_word_locs)
print(overall_locs_caseins)

{'100_Greatest_Romanians.html': [45, 52, 261, 344, 361], '104th_Logistic_Support_Brigade_(United_Kingdom).html': [49, 57, 61, 170, 253, 270], '16th_Virginia_Infantry.html': [49, 63, 94, 98, 102, 104, 146, 153, 236, 253, 280], '1896_Indiana_Hoosiers_football_team.html': [153, 159, 420, 503, 520], '1898_Colgate_football_team.html': [49, 133, 403, 486, 503], '1910_in_literature.html': [353, 436, 461], '1915_Montana_football_team.html': [49, 562, 645, 662], '1951_National_League_tiebreaker_series.html': [36, 48, 1697, 1790, 1873, 1890], '1953E2809354_FA_Cup_qualifying_rounds.html': [47, 48, 2538, 3292, 3375, 3392], '1958_Wightman_Cup.html': [46, 52, 70, 71, 83, 84, 96, 97, 109, 110, 123, 124, 137, 138, 150, 151, 196, 197, 366, 449, 466], '1988_State_of_Origin_series.html': [49, 65, 69, 117, 119, 158, 160, 195, 197, 241, 242, 246, 247, 251, 255, 259, 263, 264, 265, 269, 270, 274, 275, 276, 280, 281, 285, 289, 290, 294, 298, 299, 303, 304, 305, 309, 310, 311, 315, 331, 335, 339, 340, 344, 34

In [30]:
more_matches = {}
for key in overall_locs:
    if len(overall_locs[key]) < len(overall_locs_caseins[key]):
        more_matches[key] = overall_locs_caseins[key]
print(more_matches)
        

{'1953E2809354_FA_Cup_qualifying_rounds.html': [47, 48, 2538, 3292, 3375, 3392], '83_(number).html': [203, 230, 2537, 2620, 2645], 'Acceptance_(Heroes).html': [52, 132, 135, 157, 496, 579, 596], 'Agaritine_gammaglutamyltransferase.html': [59, 86, 87, 350, 352, 356, 406, 489, 506], 'Alex_Kurtzman.html': [49, 338, 353, 387, 398, 481, 498, 525], 'Amborella.html': [6, 50, 53, 119, 315, 325, 351, 370, 371, 376, 391, 449, 532, 557], 'Antibiotic_use_in_livestock.html': [107, 110, 199, 201, 207, 209, 214, 221, 229, 232, 265, 272, 278, 279, 298, 378, 390, 429, 551, 634, 651], 'Appa_(film).html': [49, 191, 269, 352, 369], 'Avengers_Academy.html': [50, 527, 878, 1094, 1177, 1194], 'A_Beautiful_Valley.html': [49, 177, 216, 299, 316], 'Bahmanabade_Olya.html': [6, 57, 59, 69, 73, 111, 126, 223, 414, 417, 419, 468, 551, 568], 'Battle_of_Wattignies.html': [6, 52, 77, 78, 79, 85, 86, 87, 88, 89, 90, 403, 411, 429, 441, 454, 467, 558, 563, 1039, 1105, 1123, 1152, 1168, 1181, 1195, 1209, 1543, 1581, 1664

In [31]:
for key in overall_locs:
    if len(overall_locs[key]) < len(overall_locs_caseins[key]):
        diff = len(overall_locs_caseins[key]) - len(overall_locs[key])
        print("Found {} new matches for {}.".format(diff, key))

Found 1 new matches for 1953E2809354_FA_Cup_qualifying_rounds.html.
Found 1 new matches for 83_(number).html.
Found 1 new matches for Acceptance_(Heroes).html.
Found 2 new matches for Agaritine_gammaglutamyltransferase.html.
Found 1 new matches for Alex_Kurtzman.html.
Found 1 new matches for Amborella.html.
Found 1 new matches for Antibiotic_use_in_livestock.html.
Found 1 new matches for Appa_(film).html.
Found 1 new matches for Avengers_Academy.html.
Found 1 new matches for A_Beautiful_Valley.html.
Found 1 new matches for Bahmanabade_Olya.html.
Found 1 new matches for Battle_of_Wattignies.html.
Found 1 new matches for Benny_Lee.html.
Found 1 new matches for Bias.html.
Found 1 new matches for Bibiana_Beglau.html.
Found 1 new matches for Blue_SWAT.html.
Found 1 new matches for Boardman_Township_Mahoning_County_Ohio.html.
Found 1 new matches for Brownfield_(software_development).html.
Found 1 new matches for C11orf30.html.
Found 1 new matches for C389cole_des_Mines_de_Douai.html.
Found 1

Yes, the case-insensitive search finds more matches than the case-sensitive one.

## Location of all the matches for string 'data' in each line 

Right now, we only find the numbers of the lines that contain at least one occurrence. We will extend the algorithm to also collect the position of every match within those lines.

In [32]:
target_word = 'data'
overall_locs_tuples_caseins = map_reduce(filenames, 4, mapper_reducer.map_word_locs_rowcol_chunk_case_insen, mapper_reducer.reduce_word_locs)
print(overall_locs_tuples_caseins)

{'100_Greatest_Romanians.html': [(45, 451), (45, 473), (52, 547), (52, 569), (261, 40), (344, 1039), (344, 1087), (344, 1131), (361, 125)], '104th_Logistic_Support_Brigade_(United_Kingdom).html': [(49, 325), (49, 347), (57, 460), (57, 483), (61, 463), (61, 485), (170, 40), (253, 1139), (253, 1188), (253, 1232), (270, 124)], '16th_Virginia_Infantry.html': [(49, 624), (49, 646), (63, 744), (63, 766), (94, 303), (94, 325), (98, 628), (98, 650), (102, 18), (104, 383), (104, 404), (104, 812), (104, 833), (146, 41), (153, 40), (236, 1039), (236, 1088), (236, 1132), (253, 125), (280, 968)], '1896_Indiana_Hoosiers_football_team.html': [(153, 163), (153, 183), (159, 218), (159, 236), (159, 350), (159, 712), (159, 732), (159, 823), (420, 40), (503, 1091), (503, 1141), (503, 1185), (520, 124)], '1898_Colgate_football_team.html': [(49, 511), (49, 533), (133, 163), (133, 183), (403, 40), (486, 1055), (486, 1105), (486, 1149), (503, 124)], '1910_in_literature.html': [(353, 40), (436, 1023), (436, 10
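
The (line, index) tuples above can be produced by scanning each line for every occurrence of the target rather than stopping at the first. The sketch below shows one way to do that with str.find; find_all and locate are hypothetical helper names, and this is not necessarily how mapper_reducer implements it.

```python
def find_all(line, target, ignore_case=True):
    # Return the start index of every occurrence of target in line.
    if ignore_case:
        line, target = line.lower(), target.lower()
    positions = []
    start = line.find(target)
    while start != -1:
        positions.append(start)
        # Resume one character later so that later occurrences are found too.
        start = line.find(target, start + 1)
    return positions

def locate(lines, target):
    # Build (line_index, column_index) tuples, both zero-based.
    return [(row, col) for row, line in enumerate(lines) for col in find_all(line, target)]

sample = ['<a data-x="1" DATA-y="2">', "no match here", "wikidata.org"]
print(locate(sample, "data"))  # [(0, 3), (0, 14), (2, 4)]
```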

The output is a dictionary, which is not easy to read. We will now structure the output and save it in a CSV file.

## Improve the structure of output and save in CSV file

In [33]:
import csv

In [34]:
context_delta = 10  # characters of context to keep on each side of a match
rows = [['File', 'Line', 'Index', 'Context']]
for filematches in overall_locs_tuples_caseins:
    filepath = os.path.join(foldername, filematches)
    # Read each file once instead of once per match.
    with open(filepath, encoding='utf-8') as f:
        lines = f.readlines()
    for match in overall_locs_tuples_caseins[filematches]:
        line = lines[match[0]].strip()
        # Clamp the context window to the bounds of the line.
        start_idx = max(match[1] - context_delta, 0)
        end_idx = min(match[1] + context_delta, len(line) - 1)
        rows.append([filepath, *match, line[start_idx:end_idx + 1]])
# newline='' keeps the csv module from writing blank rows on Windows.
with open("results.csv", "w", encoding='utf-8', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerows(rows)
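
The context extraction in the cell above comes down to one clamped slice. As a minimal sketch, the hypothetical helper context_window below isolates that step; the sample line is made up for the demonstration.

```python
def context_window(line, index, delta=10):
    # Up to delta characters on each side of index. The start is clamped
    # with max() because a negative slice start would count from the end of
    # the string; Python already clamps a slice end that runs past the line.
    return line[max(index - delta, 0):index + delta + 1]

line = 'srcset="v.jpg 2x" data-file-width="373"'
idx = line.index("data")
print(repr(context_window(line, idx)))  # 'v.jpg 2x" data-file-w'
print(repr(context_window("data", 0)))  # 'data'
```

With delta set to 10, each context string is at most 21 characters long, which is the width of the Context values written to the CSV file.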

In [35]:
import pandas as pd
# Passing the path lets pandas open the file itself, decoding it as UTF-8 by default.
df = pd.read_csv("results.csv")

In [36]:
df

Unnamed: 0,File,Line,Index,Context
0,wiki\100_Greatest_Romanians.html,45,451,"v.jpg 2x"" data-file-w"
1,wiki\100_Greatest_Romanians.html,45,473,"dth=""373"" data-file-h"
2,wiki\100_Greatest_Romanians.html,52,547,"9.jpg 2x"" data-file-w"
3,wiki\100_Greatest_Romanians.html,52,569,"dth=""640"" data-file-h"
4,wiki\100_Greatest_Romanians.html,261,40,"inks"" data-mw=""interf"
...,...,...,...,...
20620,wiki\Zoom_Systems.html,124,40,"inks"" data-mw=""interf"
20621,wiki\Zoom_Systems.html,207,999,wikidata.org/wiki/Q17
20622,wiki\Zoom_Systems.html,207,1049,ted data repository i
20623,wiki\Zoom_Systems.html,207,1093,Wikidata item</a></li


## Finding matches for 'science'

In [37]:
overall_locs_science = map_reduce(filenames, 4, mapper_reducer.map_science_locs_rowcol_chunk_case_insen, mapper_reducer.reduce_word_locs)
print(overall_locs_science)

{'100_Greatest_Romanians.html': [], '104th_Logistic_Support_Brigade_(United_Kingdom).html': [], '16th_Virginia_Infantry.html': [], '1896_Indiana_Hoosiers_football_team.html': [], '1898_Colgate_football_team.html': [], '1910_in_literature.html': [(115, 27), (115, 51), (115, 60), (253, 133), (253, 169), (253, 198), (280, 162), (280, 186), (280, 203)], '1915_Montana_football_team.html': [], '1951_National_League_tiebreaker_series.html': [], '1953E2809354_FA_Cup_qualifying_rounds.html': [], '1958_Wightman_Cup.html': [], '1988_State_of_Origin_series.html': [], '1st_Strategic_Aerospace_Division.html': [], '2001_Australian_Individual_Speedway_Championship.html': [], '2001_NCAA_Division_I_Field_Hockey_Championship.html': [], '2004_Tuvalu_ADivision.html': [], '2005E2809306_in_Welsh_football.html': [], '2007E2809308_Huddersfield_Town_A.F.C._season.html': [], '2008_Fed_Cup_World_Group_II.html': [], '2009_English_cricket_season.html': [], '2009_World_Junior_Ice_Hockey_Championships_rosters.html': 

In [38]:
context_delta = 10  # characters of context to keep on each side of a match
rows = [['File', 'Line', 'Index', 'Context']]
for filematches in overall_locs_science:
    filepath = os.path.join(foldername, filematches)
    # Read each file once instead of once per match.
    with open(filepath, encoding='utf-8') as f:
        lines = f.readlines()
    for match in overall_locs_science[filematches]:
        line = lines[match[0]].strip()
        # Clamp the context window to the bounds of the line.
        start_idx = max(match[1] - context_delta, 0)
        end_idx = min(match[1] + context_delta, len(line) - 1)
        rows.append([filepath, *match, line[start_idx:end_idx + 1]])
# newline='' keeps the csv module from writing blank rows on Windows.
with open("results_science.csv", "w", encoding='utf-8', newline='') as fw:
    writer = csv.writer(fw)
    writer.writerows(rows)

In [39]:
# pandas was already imported above; passing the path lets it decode the file as UTF-8.
df_science = pd.read_csv("results_science.csv")
print(df_science)

                               File  Line  Index                Context
0      wiki\1910_in_literature.html   115     27  i/1910_in_science" ti
1      wiki\1910_in_literature.html   115     51  ="1910 in science">Sc
2      wiki\1910_in_literature.html   115     60   science">Science</a>
3      wiki\1910_in_literature.html   253    133  /wiki/The_Science_of_
4      wiki\1910_in_literature.html   253    169  itle="The Science of 
...                             ...   ...    ...                    ...
1265        wiki\Wolfgang_Lutz.html   166    868  my_of_Sciences" title
1266        wiki\Wolfgang_Lutz.html   166    943  my of Sciences">Membe
1267        wiki\Wolfgang_Lutz.html   166   1002  my of Sciences</a></l
1268  wiki\Young_Finnish_Party.html   269     40  political_science#Pol
1269  wiki\Young_Finnish_Party.html   269     96  political science">Ot

[1270 rows x 4 columns]


Searching text files for data is a very common operation, and it becomes time-consuming when many files are involved. By splitting the files into chunks and processing those chunks in parallel with MapReduce, we can significantly reduce the time required to locate that data.
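
This notebook does not record any timings, so the size of the speedup on a given machine has to be measured. A minimal sketch of such a measurement follows; timed is a hypothetical helper, and the workload in the example is only a stand-in.

```python
import time

def timed(fn, *args, **kwargs):
    # Run fn once and return (result, elapsed_seconds) using a monotonic clock.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload. In the notebook, one could instead compare
# timed(map_reduce, filenames, 1, mapper, reducer) with
# timed(map_reduce, filenames, 4, mapper, reducer).
result, seconds = timed(sum, range(1_000_000))
print(result, f"{seconds:.4f}s")
```

Starting worker processes has a cost of its own, so for small inputs the parallel run will not necessarily be faster.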