# Textbooks to JSON

#### Author

Tijmen de Haan <tijmen@post.kek.jp>

#### History

 - 2023 Nov 25 started coding

#### Description

I have extracted 250MB of text data from public-domain astro-related textbooks. Unfortunately, some of the books are misformatted, irrelevant, or low-quality. This notebook serves to turn the large amount of raw data into medium- to high-quality data chunks ready to be tokenized and fine-tuned on.

### Plan

#### Step 1: loss function-based filter

I have already fine-tuned zephyr-7b-beta on arXiv papers and a physics QA dataset from Hugging Face. Training this model, called "zephyr-7b-beta_cosmosage_v1", took 11 hours of compute on one A6000 GPU.

Let's try using this fine-tuned model to evaluate the quality of each textbook individually. We'll take a few small "biopsies" from each book and calculate the loss of the fine-tuned model on them. Then we'll plot these losses by book. Books that are very poorly predicted are likely junk data and can be thrown out.
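Once the per-book losses exist, one way to pick out the poorly predicted books is a robust outlier rule. The sketch below is only an illustration: the function name `flag_high_loss_books`, the `num_mads` parameter, and the median-plus-MAD threshold are assumptions, not something this notebook has settled on. The threshold should be checked by eye against the plot before any book is thrown out.

```python
import statistics


def flag_high_loss_books(loss_by_book, num_mads=3.0):
    """Return (flagged_names, threshold) for books with unusually high loss.

    Hypothetical rule: flag anything above median + num_mads * MAD, where MAD
    is the median absolute deviation of the per-book losses.
    """
    values = list(loss_by_book.values())
    median = statistics.median(values)
    # The MAD is a spread estimate that a few junk books barely affect.
    mad = statistics.median(abs(value - median) for value in values)
    threshold = median + num_mads * mad
    flagged = [name for name, value in loss_by_book.items() if value > threshold]
    # Worst-predicted books first, which makes manual inspection easier.
    flagged.sort(key=loss_by_book.get, reverse=True)
    return flagged, threshold
```

Note that if most books have nearly identical losses, the MAD approaches zero and the rule flags almost everything above the median, so a minimum spread may be needed in practice.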

#### Step 2: manual rejection of certain textbooks

My dataset contains some weird books! I believe I saw one about UFOs. That topic is outside the intended expertise of cosmosage, so such books need to be manually rejected.

#### Step 3: trimming

For each of the remaining books, we'll want to trim off some initial part corresponding to the title page and such, and some final part that may consist of references or other irrelevant stuff. We will use the incidence of special characters (characters other than letters, punctuation, and whitespace) to decide where to start and stop the trim.
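A minimal sketch of how this trimming could work, assuming fixed-size character windows and a cutoff on the special-character fraction. The names `special_fraction` and `trim_book`, the window size, and the 5% cutoff are all placeholder assumptions that would need tuning on real books. Digits count as special here, following the definition above.

```python
import string


def special_fraction(text):
    """Fraction of characters that are not letters, punctuation, or whitespace."""
    if not text:
        return 0.0
    special = sum(
        1
        for ch in text
        if not (ch.isalpha() or ch.isspace() or ch in string.punctuation)
    )
    return special / len(text)


def trim_book(text, window=1000, max_fraction=0.05):
    """Keep the span from the first to the last window that looks like prose."""
    starts = range(0, len(text), window)
    clean = [
        start
        for start in starts
        if special_fraction(text[start:start + window]) <= max_fraction
    ]
    if not clean:
        return ""  # No window looks like prose; the whole book is suspect.
    return text[clean[0]:clean[-1] + window]
```

This keeps everything between the first and last clean window, so a noisy table in the middle of a book is left in place; only the front and back matter are removed.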

#### Step 4 (optional): chunking

We'll evaluate whether it's feasible to chunk by paragraph. If this appears difficult, we'll skip it, as the tokenizer will chunk by token count anyway.
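As a starting point for that evaluation, here is a rough sketch that splits on blank lines and merges short paragraphs into the next one. It assumes the extracted text marks paragraph boundaries with blank lines, which may not hold for every book. The function name `chunk_by_paragraph` and the `min_chars` parameter are hypothetical.

```python
import re


def chunk_by_paragraph(text, min_chars=200):
    """Split text at blank lines, merging paragraphs until min_chars is reached."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    current = ""
    for paragraph in paragraphs:
        current = f"{current}\n\n{paragraph}" if current else paragraph
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)  # Keep any short trailing remainder.
    return chunks
```

If the resulting chunks for a book are mostly a single huge block or many one-line fragments, that is a sign the blank-line assumption fails for that book.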

#### Step 5: save prepared dataset as JSON

Save as "list"-type JSON file, ready to be loaded into a TextDataset. The current plan is to train for one epoch on this dataset, then train one additional epoch on the other smaller higher-quality dataset.
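To make the "list"-type layout concrete, the sketch below writes a list of strings as a single JSON array and checks the shape when reading it back. The helper names `save_text_list` and `load_text_list` are hypothetical, and whether the downstream TextDataset expects exactly this layout is an assumption.

```python
import json


def save_text_list(texts, path):
    """Write an iterable of strings as one JSON array."""
    with open(path, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(list(texts), json_file, ensure_ascii=False)


def load_text_list(path):
    """Read the file back and verify that it is a JSON array of strings."""
    with open(path, "r", encoding="utf-8") as json_file:
        data = json.load(json_file)
    if not (isinstance(data, list) and all(isinstance(item, str) for item in data)):
        raise ValueError("expected a JSON array of strings")
    return data
```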


In [None]:
import os
import random
from IPython.display import clear_output
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer
fine_tuned_model_path = "zephyr-7b-beta_cosmosage_v1"
tokenizer = AutoTokenizer.from_pretrained(fine_tuned_model_path)
# Load the weights directly in bfloat16 to avoid holding a full float32 copy
model = AutoModelForCausalLM.from_pretrained(
    fine_tuned_model_path, torch_dtype=torch.bfloat16
).to("cuda")

In [None]:
def sample_text_chunks(file_path, chunk_size=512, num_samples=16):
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    tokens = tokenizer.encode(text)
    max_start = max(0, len(tokens) - chunk_size)
    samples = []
    for _ in range(num_samples):
        start = random.randint(0, max_start)
        end = start + chunk_size
        sample = tokens[start:end]
        samples.append(sample)
    return samples

textbooks_dir = "datasets/textbooks_extracted/"
textbook_files = os.listdir(textbooks_dir)

# Filter out textbook files smaller than 1 kB
textbook_files = [file for file in textbook_files if os.path.getsize(os.path.join(textbooks_dir, file)) >= 1024]

# Now, sort the remaining textbook_files by file size
textbook_files = sorted(textbook_files, key=lambda file: os.path.getsize(os.path.join(textbooks_dir, file)))

resume = True
if resume and os.path.exists("loss.pt"):
    loss = torch.load("loss.pt")  # Load loss dict from disk to resume
else:
    loss = {}  # Mapping from book file name to its loss; start fresh

# some books may have been deleted manually, remove those from the saved loss dict
for file in list(loss.keys()):
    if file not in textbook_files:
        del loss[file]

batch_size = 16

for file in textbook_files:
    if file in loss:
        continue  # Skip books we've already evaluated
    print(f"Collecting samples from {file}.")
    file_path = os.path.join(textbooks_dir, file)
    samples = sample_text_chunks(file_path, num_samples=batch_size)

    print("Converting all samples into a DataLoader.")
    # All samples from one book have the same length, so padding is a no-op here
    padded_samples = pad_sequence([torch.tensor(sample) for sample in samples], batch_first=True, padding_value=0)
    all_samples_tensor = padded_samples.to("cuda")
    dataset = TensorDataset(all_samples_tensor)
    dataloader = DataLoader(dataset, batch_size=batch_size)  # Adjust batch size as needed

    print("Evaluating model on samples.")
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        this_loss = 0.0
        for batch in dataloader:
            inputs = batch[0]
            outputs = model(inputs, labels=inputs)
            this_loss += outputs.loss.item()
        # Average over batches so the value does not depend on the batch count
        loss[file] = this_loss / len(dataloader)
    print(f"Loss for {file}: {loss[file]}")

    # Save the loss dict after every book so progress survives a crash
    torch.save(loss, "loss.pt")

    

print(loss)

### Step 2: Manual Rejection of Certain Textbooks

In [None]:
# manual step

### Step 3: Trimming Text Data

In [None]:

# skip for now


### Step 4 (Optional): Chunking by Paragraph

In [None]:

# skip for now


### Step 5: Saving the Prepared Dataset as JSON

In [None]:
import glob
import json

# Sort the paths so the order of books in the JSON file is reproducible
textbooks_path = sorted(glob.glob('datasets/textbooks_extracted/*.txt'))
textbooks = []
for textbook_path in textbooks_path:
    with open(textbook_path, 'r', encoding='utf-8') as textbook_file:
        textbook = textbook_file.read()
    textbooks.append(textbook)

# Saving the data as JSON
with open("datasets/textbooks_clean.json", "w", encoding="utf-8") as json_file:
    json.dump(textbooks, json_file)
