In [1]:
import os
import sys
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import gzip
import ndjson
from tqdm import tqdm

import matplotlib.pyplot as plt

import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

# Analyzing arXiv Data
The purpose of this notebook is to verify the integrity of the pre-processing pipeline for the arXiv files used to build the `proof-pile`. In particular, I want to be confident that I am filtering out all text that, from a human or language-model perspective, is completely unintelligible.

### Strategy
The preprocessing heuristics described in the main `README` are informed by $\LaTeX$ expertise and manual inspection of data. However, I am not aware of every single subtlety of $\LaTeX$, and I can only inspect so many training examples. Therefore, these two methods alone do not fully convince me that the heuristics yield a clean dataset.

In this notebook I try to detect noise in the dataset by identifying documents that achieve a large loss when processed by an off-the-shelf pre-trained language model, specifically `EleutherAI/gpt-neo-125M`. 

### Code
The next three cells are basic housekeeping: loading data and models. 

In [2]:
# load subset of data
print("loading data batch...")
file_name = "/data/corpora/proof-pile/train/proofpile_train_0.jsonl"
with open(file_name) as f:
    data = ndjson.load(f)

loading data batch...


In [3]:
torch.cuda.empty_cache()
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125M").cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/gpt-neo-125M")

tokenizer.pad_token = tokenizer.eos_token

context = 2048

In [4]:
# Only look at arXiv. The rest of the data is very high quality and is definitely clean.
# We also restrict our attention to the first 10,000 arXiv examples in this batch
# (a random sample, provided the batch is shuffled); I'm happy with a sample of this size.
n = 10_000
data = [x for x in data if "config" in x["meta"] and x["meta"]["config"]=="arxiv"][:n]

print(f"full batch length: {len(data)}")

full batch length: 10000
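
If the batch file is not already shuffled, taking the first `n` matching records is not a random sample. A seeded draw makes the sampling explicit and reproducible. The following is a minimal sketch on toy records; `sample_arxiv` and the seed value are hypothetical names introduced for illustration and are not part of the pipeline.

```python
import random

def sample_arxiv(records, n, seed=0):
    """Return up to n arXiv records drawn uniformly at random (hypothetical helper)."""
    arxiv = [x for x in records if x.get("meta", {}).get("config") == "arxiv"]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(arxiv, min(n, len(arxiv)))

# Toy records standing in for the real batch.
toy = [
    {"text": f"doc {i}", "meta": {"config": "arxiv" if i % 2 == 0 else "other"}}
    for i in range(10)
]
subset = sample_arxiv(toy, n=3)
```

On the real batch this would be called as `sample_arxiv(data, n)` in place of the slice, assuming every record carries a `meta` dictionary.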


As a sanity check, we append a string of 8,000 random uppercase letters and digits to our data. This should achieve a very high loss.

In [5]:
import random
import string
data.append({"text": ''.join(random.choices(string.ascii_uppercase + string.digits, k=8000))})

The following code calculates the LM loss for every document in the subset of the data we've loaded.
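
The quantity stored for each document is the average next-token cross-entropy (in nats) over non-padding positions. The arithmetic can be sketched without a model, using made-up probabilities for the true tokens; `masked_mean_nll` and the toy numbers are illustrative assumptions, not part of the notebook's code.

```python
import math

def masked_mean_nll(token_probs, mask):
    """Average negative log-likelihood over positions where mask == 1.

    token_probs[i] is the probability the model assigned to the true token i;
    mask[i] is 1 for a real token and 0 for padding.
    """
    total = sum(-math.log(p) * m for p, m in zip(token_probs, mask))
    count = sum(mask)
    if count == 0:
        raise ValueError("document has no non-padding target tokens")
    return total / count

# Toy example: three real tokens followed by two padding positions.
probs = [0.5, 0.25, 0.125, 0.9, 0.9]
mask = [1, 1, 1, 0, 0]
loss = masked_mean_nll(probs, mask)  # mean of ln 2, ln 4, ln 8, i.e. 2 * ln 2
```

The padding positions contribute nothing to either the numerator or the denominator, which is the same role the attention mask plays in the cell below.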

In [None]:
loss_fn = torch.nn.CrossEntropyLoss(reduction='none')

batch_size = 15

print("We're going to get an indexing warning, ignore it.")
for i in tqdm(range(len(data))): 
    example = data[i]
    
    tokens = tokenizer([example["text"]], 
                    return_tensors="pt", 
                    padding=True, 
                    pad_to_multiple_of=context)
        
    tokens = {key: tokens[key].reshape((-1, context)).cuda() for key in tokens}   
    
    labels = tokens["input_ids"].clone()
        
    unreduced_loss = 0
    num_tokens = 0 
    for j in range(0, tokens["input_ids"].shape[0], batch_size):
        this_ids = tokens["input_ids"][j:j+batch_size, :]
        this_mask = tokens["attention_mask"][j:j+batch_size, :]
        this_labels = labels[j:j+batch_size, :]
    
        with torch.no_grad():
            out = model(input_ids=this_ids, attention_mask=this_mask)
    
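        # Logits at position t predict token t+1, so drop the last position
        # and shift the labels and mask left by one.
        preds = out.logits[:, :-1, :].flatten(end_dim=1)
        flat_labels = this_labels[:, 1:].flatten()
        flat_mask = this_mask[:, 1:].flatten()

        # Sum the per-token losses over non-padding positions only.
        unreduced_loss += torch.sum(loss_fn(preds, flat_labels)*flat_mask).item()
        num_tokens += torch.sum(flat_mask).item()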
    
    loss = unreduced_loss/num_tokens       
                      
    data[i]["loss"] = loss



Token indices sequence length is longer than the specified maximum sequence length for this model (12288 > 2048). Running this sequence through the model will result in indexing errors
 72%|███████   | 7235/10001 [1:26:39<21:05,  2.19it/s]

### Analysis
Let's plot a histogram of the losses.

In [None]:
losses = [x["loss"] for x in data]
plt.hist(losses)
plt.title("Document-level losses (GPT-Neo 125M)")
plt.show()

In [None]:
print("random sequence loss: ", data[-1]["loss"])
print(data[-1]["text"][:100], "...")

Our random sequence is the bar at the far right, which is encouraging. Next, let's find the documents with the highest loss.
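
As a numerical complement to reading the histogram, one can compute the fraction of real documents whose loss falls below that of the sanity-check string. This is a minimal sketch on toy numbers; `fraction_below` is a hypothetical helper, and the toy losses are invented for illustration.

```python
def fraction_below(losses, value):
    """Fraction of entries in losses that are strictly smaller than value."""
    return sum(1 for x in losses if x < value) / len(losses)

# Toy losses; the last entry plays the role of the random-string document.
toy_losses = [2.1, 1.8, 2.5, 3.0, 9.7]
frac = fraction_below(toy_losses[:-1], toy_losses[-1])
```

On the real data the call would be `fraction_below(losses[:-1], losses[-1])`; a value of 1.0 would mean that no arXiv document scored worse than random characters.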

In [None]:
# Document indices sorted from highest to lowest loss.
ordered_idxs = sorted(range(len(data)), key=lambda i: data[i]["loss"], reverse=True)

print("Indices of the 10 documents with the highest loss")
print(ordered_idxs[:10])

In [None]:
# idx = 0 is the highest-loss document (expected to be the random sanity-check string).
idx = 1
print("loss : ", data[ordered_idxs[idx]]["loss"])
print(data[ordered_idxs[idx]]["text"])

### Discussion
In the cell above, we can set `idx = k` to view the document with the `(k+1)`th highest loss (the list is zero-indexed). Even the documents with the highest losses look like high-quality, useful data, so we can be relatively confident that our pre-training data is free of complete noise.
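
To skim several high-loss documents without editing `idx` by hand, a short preview of each can be printed in one pass. This is a sketch under the assumption that every record has `"text"` and `"loss"` keys, as in the cells above; `preview_top_losses` and the toy documents are hypothetical.

```python
def preview_top_losses(docs, k=5, width=80):
    """Return (loss, text preview) pairs for the k highest-loss documents."""
    ranked = sorted(docs, key=lambda d: d["loss"], reverse=True)[:k]
    # Collapse newlines so each preview fits on a single line.
    return [(d["loss"], d["text"][:width].replace("\n", " ")) for d in ranked]

toy_docs = [
    {"text": "\\begin{theorem} Let $G$ be a group.", "loss": 2.3},
    {"text": "XQ7ZK2P9", "loss": 9.1},
    {"text": "We prove the following lemma.", "loss": 1.9},
]
for loss, text in preview_top_losses(toy_docs, k=2):
    print(f"{loss:.2f}  {text}")
```

A truncated preview is usually enough to tell prose and mathematics apart from leftover markup, and the full text can still be printed with the cell above when a document looks suspicious.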

A limitation of this approach is that `gpt-neo` itself was trained on arXiv. If EleutherAI's preprocessing pipeline allowed some noise into the pre-training data, `gpt-neo` may have learned that noise, in which case it would assign such text a low loss and this check would fail to detect it.