# Create Dataset Notebook

This notebook documents how to create the dataset used in our project.

In [1]:
import gdown
import os

from create_dataset import process_dataset, process_metadata, unzip_archive

# Jupyter complains when this function is imported from create_dataset, so we define it directly here
def download_gdrive_file(file_id, output_file):
    '''
    Downloads dataset stored on Google Drive
    '''
    gdown.download(f"https://drive.google.com/uc?id={file_id}", output_file, quiet=False)

## Downloading the Data

This project uses data from the [Standardized Project Gutenberg Corpus](https://zenodo.org/records/2422561). The code is open-sourced in [this repo](https://github.com/pgcorpus/gutenberg). However, running it from scratch takes a **long** time, because it has to query the 70,000+ books in Project Gutenberg (4+ hours, depending on your internet connection). Our solution was to run it on our machine, select the English works, and store the result publicly on Google Drive.
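
As a rough illustration of the "select the English works" step, the sketch below filters metadata rows by language using only the standard library. It assumes the metadata CSV has `id` and `language` columns and that `language` is stored as a Python-style list string such as `['en']`; the sample rows are hypothetical, and this is not necessarily how `process_metadata` does it.

```python
import ast
import csv
import io

# Hypothetical sample rows in the assumed layout of the metadata CSV.
SAMPLE_CSV = """id,title,language
PG1,Example English Book,['en']
PG2,Exemple de livre,['fr']
PG3,Bilingual Example,"['en', 'de']"
"""

def english_ids(csv_text):
    """Return the ids of rows whose language list contains 'en'."""
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the language field is a list literal such as "['en']".
        languages = ast.literal_eval(row["language"])
        if "en" in languages:
            ids.append(row["id"])
    return ids

print(english_ids(SAMPLE_CSV))
```

Whether multilingual works such as the third row should count as English is a design choice; this sketch keeps them.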

This section documents how we produced the zip file stored on Google Drive:

1. Clone the repo (linked above)
2. Install the packages in the `requirements.txt` 
3. Run the `get_data.py` script (**Note:** This will take hours to run as it has to download 70,000+ books):
    - Before you run this, you can fix an issue we found by modifying line 103 of `src/utils.py`. Replace the second argument of `os.path.join` (originally `"[p123456789][g0123456789][0-9]*"`) with `"[pg0-9]*"`. This fixes an issue where certain books were excluded despite being downloaded.
4. Run the `process_data.py` script - This will create the tokens used in this dataset.
5. To package these files, run `zip -r spgc_raw.zip metadata/ data/tokens`
    - This requires the `zip` binary, which may not be installed by default.
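
The pattern change in step 3 can be checked in isolation. The sketch below uses `fnmatchcase`, which follows the same wildcard rules that `glob` applies to file names, to compare the two patterns. The file names are hypothetical and chosen only to exercise the patterns; they are not taken from the actual mirror.

```python
from fnmatch import fnmatchcase

ORIGINAL_PATTERN = "[p123456789][g0123456789][0-9]*"
RELAXED_PATTERN = "[pg0-9]*"

# Hypothetical file names, chosen only to exercise the two patterns.
SAMPLE_NAMES = ["12345-0.txt", "pg12345.txt.utf8", "12-0.txt", "7.txt"]

def matches(pattern, names):
    """Return the names that match a glob-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]

# The original pattern needs three leading characters from fixed sets,
# so names with a one- or two-digit prefix are skipped.
print(matches(ORIGINAL_PATTERN, SAMPLE_NAMES))

# The relaxed pattern only constrains the first character.
print(matches(RELAXED_PATTERN, SAMPLE_NAMES))
```

Note that the relaxed pattern is deliberately broad: it accepts any name that starts with `p`, `g`, or a digit.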

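If the `zip` binary from step 5 is unavailable, a similar archive can be built with Python's standard `zipfile` module. This is a minimal sketch rather than what we ran; the helper name `zip_directories` is hypothetical, and unlike `zip -r` it does not store entries for empty directories.

```python
import os
import zipfile

def zip_directories(archive_path, directories):
    """Write every file under the given directories into one zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory in directories:
            for root, _dirs, files in os.walk(directory):
                for name in files:
                    # Each file is stored under its path relative to the
                    # current working directory, as `zip -r` would do.
                    archive.write(os.path.join(root, name))

# Example call, run from the root of the cloned repo:
# zip_directories("spgc_raw.zip", ["metadata/", "data/tokens"])
```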

## Processing Dataset

In [5]:
# Download the raw dataset (8.44 GB, so this takes a while) and unzip it

os.makedirs("data", exist_ok=True)
download_gdrive_file("1VJcL_0B-7YcAkaSTXnHOKXLa_EAbmpCK", "data/spgc_raw.zip")
unzip_archive("data/spgc_raw.zip", "data/spgc/")

Downloading...
From (original): https://drive.google.com/uc?id=1VJcL_0B-7YcAkaSTXnHOKXLa_EAbmpCK
From (redirected): https://drive.google.com/uc?id=1VJcL_0B-7YcAkaSTXnHOKXLa_EAbmpCK&confirm=t&uuid=fc6cdb63-e2d8-4c57-a378-ebf50598de18
To: /media/volume/team11data/524Project-Group11/proj2/data/spgc_raw.zip
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.44G/8.44G [01:45<00:00, 79.8MB/s]
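
The `unzip_archive` helper comes from our `create_dataset` module and is not shown in this notebook. For readers without that module, the following is a minimal sketch of what such a helper might look like using the standard `zipfile` module; the name `unzip_archive_sketch` is hypothetical and the real implementation may differ.

```python
import os
import zipfile

def unzip_archive_sketch(archive_path, output_dir):
    """Extract every member of a zip archive into output_dir."""
    # Create the target directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(output_dir)

# Example call mirroring the cell above:
# unzip_archive_sketch("data/spgc_raw.zip", "data/spgc/")
```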


In [2]:
meta_df = process_metadata()

2024-11-18 13:54:47,587 - dataset_logger - INFO - Processing metadata...
2024-11-18 13:54:47,890 - dataset_logger - INFO - Finished processing metadata...


In [4]:
# Combine all the data into a single Parquet file and update the metadata CSV file
process_dataset(meta_df)

2024-11-18 13:55:26,304 - dataset_logger - INFO - Processed book 0
2024-11-18 13:55:26,305 - dataset_logger - INFO - Processed book 1
2024-11-18 13:55:26,306 - dataset_logger - INFO - Processed book 2
2024-11-18 13:55:26,307 - dataset_logger - INFO - Processed book 3
2024-11-18 13:55:26,308 - dataset_logger - INFO - Processed book 4
2024-11-18 13:55:26,308 - dataset_logger - INFO - Processed book 5
2024-11-18 13:55:26,309 - dataset_logger - INFO - Processed book 6
2024-11-18 13:55:26,312 - dataset_logger - INFO - Processed book 7
2024-11-18 13:55:26,312 - dataset_logger - INFO - Processed book 8
2024-11-18 13:55:26,330 - dataset_logger - INFO - Processed book 9
2024-11-18 13:55:26,348 - dataset_logger - INFO - Processed book 10
2024-11-18 13:55:26,353 - dataset_logger - INFO - Processed book 11
2024-11-18 13:55:26,375 - dataset_logger - INFO - Processed book 12
2024-11-18 13:55:26,393 - dataset_logger - INFO - Processed book 13
2024-11-18 13:55:26,398 - dataset_logger - INFO - Processe

KeyboardInterrupt: 

## Downloading Pre-created Dataset