# Dataset Creation

This notebook contains code for generating a hard dataset from Wikipedia so we can test our embedding methods.

## Prerequisites

In [None]:
!pip install apache_beam mwparserfromhell
!pip install datasets

Imports

In [2]:
import json
import random
import pandas as pd
from datasets import load_dataset

## The Data

The full dataset is around 20 GB.

In [3]:
# load dataset
wiki_data = load_dataset("wikipedia", "20220301.en")
wiki_data = wiki_data['train']

In [4]:
size = len(wiki_data)
size

6458670

### Similar Documents

Get groups of similar documents by title, e.g., for Chinese dynasties: Han, Ming, Qin, etc. The cell below uses battles of the American Civil War.

In [5]:
# titles of the documents to retrieve, written as Wikipedia URL slugs
titles = "Battle_of_Fort_Sumter, First_Battle_of_Bull_Run, Battle_of_Shiloh, Battle_of_Antietam, Battle_of_Chancellorsville, Siege_of_Vicksburg, Battle_of_Gettysburg, Battle_of_Chickamauga, Battle_of_Atlanta, Battle_of_Appomattox_Court_House"
titles = titles.replace('_', ' ')
titles = titles.lower()

title_list = titles.split(', ')
title_list

['battle of fort sumter',
 'first battle of bull run',
 'battle of shiloh',
 'battle of antietam',
 'battle of chancellorsville',
 'siege of vicksburg',
 'battle of gettysburg',
 'battle of chickamauga',
 'battle of atlanta',
 'battle of appomattox court house']
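
The normalization steps above could be wrapped in a small helper so that other groups of titles can reuse them. This is only a sketch: `normalize_titles` is a hypothetical name that does not appear elsewhere in this notebook, and it additionally tolerates stray whitespace around the commas.

```python
from typing import List


def normalize_titles(raw_titles: str) -> List[str]:
    """Turn a comma-separated string of Wikipedia URL slugs into lowercase titles."""
    # Split on commas first, then clean each entry, so extra spaces are harmless.
    parts = raw_titles.split(",")
    return [p.strip().replace("_", " ").lower() for p in parts if p.strip()]


example = normalize_titles("Battle_of_Fort_Sumter, First_Battle_of_Bull_Run")
print(example)
```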

Get the corresponding wiki articles. It's slow because the filter scans every article, but we only have to do it once.

In [6]:
filtered_wiki = wiki_data.filter(lambda x: x['title'].lower() in title_list)
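
Because the predicate runs once per article (about 6.4 million times), one option is to build it around a set so that each membership test takes constant time. The sketch below is illustrative: `make_title_filter` is a hypothetical helper, and it is demonstrated on a few plain dictionaries rather than on the real dataset.

```python
from typing import Callable, Dict, Iterable


def make_title_filter(titles: Iterable[str]) -> Callable[[Dict], bool]:
    """Build a predicate that keeps rows whose lowercased title is in `titles`."""
    wanted = frozenset(t.lower() for t in titles)

    def keep(row: Dict) -> bool:
        return row["title"].lower() in wanted

    return keep


# Tiny stand-in rows with the same 'title' field as the Wikipedia dataset.
rows = [
    {"title": "Battle of Shiloh"},
    {"title": "Anarchism"},
    {"title": "Siege of Vicksburg"},
]
keep = make_title_filter(["battle of shiloh", "siege of vicksburg"])
kept = [r["title"] for r in rows if keep(r)]
print(kept)
```

The same predicate could presumably be passed to the dataset directly, as in `wiki_data.filter(make_title_filter(title_list))`.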

In [7]:
# convert to dataframe
df = pd.DataFrame.from_dict(filtered_wiki)

In [8]:
# ---------------------
# run this sanity check
# ---------------------
assert len(df) == len(title_list), f"Retrieved document count mismatch: expected {len(title_list)}, got {len(df)}."
df

|   | id | url | title | text |
|---|---|---|---|---|
| 0 | 4849 | https://en.wikipedia.org/wiki/Battle%20of%20Ge... | Battle of Gettysburg | The Battle of Gettysburg () was fought July 1–... |
| 1 | 48780 | https://en.wikipedia.org/wiki/Battle%20of%20Ch... | Battle of Chancellorsville | The Battle of Chancellorsville was a major bat... |
| 2 | 84849 | https://en.wikipedia.org/wiki/Battle%20of%20An... | Battle of Antietam | The Battle of Antietam (), or Battle of Sharps... |
| 3 | 144155 | https://en.wikipedia.org/wiki/Battle%20of%20Sh... | Battle of Shiloh | The Battle of Shiloh (also known as the Battle... |
| 4 | 176263 | https://en.wikipedia.org/wiki/Battle%20of%20At... | Battle of Atlanta | The Battle of Atlanta was a battle of the Atla... |
| 5 | 204642 | https://en.wikipedia.org/wiki/Battle%20of%20Ch... | Battle of Chickamauga | The Battle of Chickamauga, fought on September... |
| 6 | 228867 | https://en.wikipedia.org/wiki/First%20Battle%2... | First Battle of Bull Run | The First Battle of Bull Run (the name used by... |
| 7 | 229668 | https://en.wikipedia.org/wiki/Siege%20of%20Vic... | Siege of Vicksburg | The siege of Vicksburg (May 18 – July 4, 1863)... |
| 8 | 339819 | https://en.wikipedia.org/wiki/Battle%20of%20Fo... | Battle of Fort Sumter | The Battle of Fort Sumter (April 12–13, 1861) ... |
| 9 | 1478485 | https://en.wikipedia.org/wiki/Battle%20of%20Ap... | Battle of Appomattox Court House | The Battle of Appomattox Court House, fought i... |
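
If the sanity check ever fails, it helps to know which titles found no match, for instance because the article title differs from the slug that was typed in. A minimal sketch of such a diagnostic follows; `missing_titles` is a hypothetical helper, not part of the original notebook.

```python
from typing import Iterable, List


def missing_titles(expected: Iterable[str], found: Iterable[str]) -> List[str]:
    """Return the expected titles that have no case-insensitive match in `found`."""
    found_lower = {t.lower() for t in found}
    return sorted(t for t in set(expected) if t.lower() not in found_lower)


expected = ["battle of shiloh", "battle of antietam", "battle of atlanta"]
found = ["Battle of Shiloh", "Battle of Atlanta"]
print(missing_titles(expected, found))
```

In this notebook the call would look something like `missing_titles(title_list, df["title"])`.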


Saving data

In [9]:
# load old data (assumes ./wiki_hard.pkl already exists from a previous run)
df_old = pd.read_pickle("./wiki_hard.pkl")

# append the new articles to the old data frame
df_new = pd.concat([df_old, df], ignore_index=True)

# save the data
df_new.to_pickle("./wiki_hard.pkl")
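
The save cell fails on a first run, when no pickle exists yet, and re-running it appends the same articles again. One possible way to guard against both is sketched below. `append_and_save` is a hypothetical helper, and dropping duplicates on the `id` column is an assumption (that `id` uniquely identifies an article), not something the original cell does.

```python
import os

import pandas as pd


def append_and_save(new_rows: pd.DataFrame, path: str) -> pd.DataFrame:
    """Append rows to the pickled data frame at `path`, creating it if absent."""
    if os.path.exists(path):
        combined = pd.concat([pd.read_pickle(path), new_rows], ignore_index=True)
    else:
        # First run: there is nothing to merge with yet.
        combined = new_rows.reset_index(drop=True)
    # Assumption: 'id' uniquely identifies an article, so keep the first copy only.
    combined = combined.drop_duplicates(subset="id").reset_index(drop=True)
    combined.to_pickle(path)
    return combined
```

With this helper, the cell above would reduce to something like `df_new = append_and_save(df, "./wiki_hard.pkl")`.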