# Create Dataset Notebook

This notebook documents how to create the dataset used in our project.

In [None]:
import gdown

from create_dataset import process_dataset, process_metadata, unzip_archive

# Jupyter complains when this function is imported from create_dataset, so we define it directly here instead
def download_gdrive_file(file_id, output_file):
    '''
    Downloads dataset stored on Google Drive
    '''
    gdown.download(f"https://drive.google.com/uc?id={file_id}", output_file, quiet=False)

## Downloading the Data

This project uses data from the [Standardized Project Gutenberg Corpus](https://zenodo.org/records/2422561). The code that builds the corpus is open source and available in [this repo](https://github.com/pgcorpus/gutenberg). However, running it from scratch takes a **long** time, because it has to query the 70,000+ books in Project Gutenberg (4+ hours, depending on your internet connection). Our solution was to run it on our own machine, select the English works, and store the result publicly on Google Drive.

This section documents how we produced the zip file stored on Google Drive:

1. Clone the repo (linked above)
2. Install the packages in the `requirements.txt` 
3. Run the `get_data.py` script (**Note:** This will take hours to run as it has to download 70,000+ books):
    - Before running it, consider fixing an issue we found on line 103 of `src/utils.py`: replace the second argument of `os.path.join` (originally `"[p123456789][g0123456789][0-9]*"`) with `"[pg0-9]*"`. Without this change, certain books are excluded even though they were downloaded.
4. Run the `process_data.py` script. This creates the token files used in this dataset.
5. To package these files, run `zip -r spgc_raw.zip metadata/ data/tokens`
    - This requires the `zip` binary, which may not be installed by default.
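
To see why the pattern change in step 3 matters, the sketch below compares the two patterns with Python's `fnmatch`, which uses the same wildcard syntax as `glob`. The file names are hypothetical and chosen only for illustration; they are not taken from the actual mirror directory.

```python
from fnmatch import fnmatchcase

OLD_PATTERN = "[p123456789][g0123456789][0-9]*"
NEW_PATTERN = "[pg0-9]*"

# Hypothetical file names (assumption): used only to exercise the patterns.
SAMPLE_NAMES = ["pg10.txt.utf8", "12345-0.txt", "10-0.txt", "5-0.txt"]


def matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


old_matches = matching(OLD_PATTERN, SAMPLE_NAMES)
new_matches = matching(NEW_PATTERN, SAMPLE_NAMES)
missed_by_old = [name for name in new_matches if name not in old_matches]

print("old pattern:", old_matches)
print("new pattern:", new_matches)
print("missed by old pattern:", missed_by_old)
```

The old pattern needs a digit in the third position, so a name whose numeric prefix is shorter than three characters and is followed by a non-digit does not match. The new pattern only constrains the first character.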

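For step 5, if the `zip` binary is unavailable, a roughly equivalent archive can probably be built with Python's standard `zipfile` module. This is a minimal sketch, not what we actually ran; `zip_directories` is a hypothetical helper name, and unlike `zip -r` it stores only file entries, not directory entries.

```python
import os
import zipfile


def zip_directories(archive_path, directories):
    """Recursively add every file under each directory to a new zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for directory in directories:
            for root, _dirs, files in os.walk(directory):
                for name in sorted(files):
                    file_path = os.path.join(root, name)
                    # Keep the path relative to the working directory, as `zip -r` does.
                    archive.write(file_path, arcname=file_path)
    return archive_path
```

Run from the repo root, the call corresponding to the command above would be `zip_directories("spgc_raw.zip", ["metadata", "data/tokens"])`.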

## Processing Dataset

In [None]:
# Download the raw dataset (this can take a while), then extract it

download_gdrive_file("1VJcL_0B-7YcAkaSTXnHOKXLa_EAbmpCK", "data/spgc_raw.zip")
unzip_archive("data/spgc_raw.zip", "data/spgc/")
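
The implementation of `unzip_archive` lives in `create_dataset` and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch using the standard `zipfile` module; `extract_zip` is a hypothetical name and its behavior is an assumption, not the project's actual code.

```python
import os
import zipfile


def extract_zip(archive_path, output_dir):
    """Extract every member of a zip archive into output_dir and return the member names."""
    # Create the destination if it does not exist yet (assumed behavior).
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()
```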

In [None]:
meta_df = process_metadata()
# Combine all the data into a single Parquet file and update the metadata CSV file
process_dataset(meta_df)

| | title | author | pg_code | author_id | book_id |
|---|---|---|---|---|---|
| 0 | The Declaration of Independence of the United ... | Jefferson, Thomas | 1 | 9168 | 0 |
| 1 | The United States Bill of Rights: The Ten Orig... | United States | 2 | 17600 | 0 |
| 2 | John F. Kennedy's Inaugural Address | Kennedy, John F. (John Fitzgerald) | 3 | 9625 | 0 |
| 3 | Lincoln's Gettysburg Address: Given November 1... | Lincoln, Abraham | 4 | 10540 | 0 |
| 4 | The United States Constitution | United States | 5 | 17600 | 1 |
| ... | ... | ... | ... | ... | ... |
| 51711 | German wit and humor : $b A collection from va... | Downes, Minna Sophie Marie Baumann | 74708 | 5025 | 0 |
| 51712 | A knight of the air : $b Or, The aerial rivals | Coxwell, Henry | 74709 | 4017 | 0 |
| 51713 | Celtic Scotland : $b A history of ancient Alban | Skene, W. F. (William Forbes) | 74710 | 15924 | 2 |
| 51714 | The opinions of Jérôme Coignard | France, Anatole | 74713 | 6236 | 35 |
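
In the rows shown, `author_id` appears to follow the alphabetical order of author names, and `book_id` appears to count an author's books in order of appearance (for example, `United States` gets 0 and then 1). The sketch below shows one way such columns could be derived. It is an inference from the displayed output, not the actual logic of `process_metadata`, and `assign_ids` is a hypothetical helper.

```python
def assign_ids(authors):
    """Return one (author_id, book_id) pair per book, given each book's author name."""
    # Assumption: author ids are ranks in the sorted list of unique author names.
    author_rank = {name: rank for rank, name in enumerate(sorted(set(authors)))}
    books_seen = {}
    ids = []
    for name in authors:
        # Assumption: book ids count up from 0 within each author.
        book_id = books_seen.get(name, 0)
        books_seen[name] = book_id + 1
        ids.append((author_rank[name], book_id))
    return ids


sample_authors = [
    "Jefferson, Thomas",
    "United States",
    "Kennedy, John F. (John Fitzgerald)",
    "Lincoln, Abraham",
    "United States",
]
sample_ids = assign_ids(sample_authors)
print(sample_ids)
```

With only five sample authors the `author_id` values are small ranks rather than the values in the table, but the `book_id` values reproduce the first five rows.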
