# About This Notebook
* This notebook measures how many tokens of The Pile the model EleutherAI/gpt-neo-1.3B reproduces correctly when prompted with the beginning of each document
* This is the shuffled version of the notebook: the documents are processed in a random order (shuffled at document level)
* The evaluation uses the following methodology:

* Select documents within THE PILE 00 dataset with at least 40 tokens; shorter documents are skipped and their score is recorded as np.nan
* Use the first 20 tokens of each document as input to the model, and use the subsequent 20 tokens to calculate the number of correctly predicted tokens
* Correctly predicted tokens are found by comparing the tokens at positions 20..40 of the input document with the tokens at positions 20..40 of the generated document
* The results are stored in the array acc and saved in the file 'withshuffle.pkl'
* The input to this notebook is a [Kaggle dataset](https://www.kaggle.com/usaiprashanth/the-pile-train-00-dataset), and its output is stored in the following [Kaggle dataset](https://www.kaggle.com/usaiprashanth/gptmodel-outputs)
* This notebook comes from a [Kaggle notebook](https://www.kaggle.com/usaiprashanth/gpt-1-3b-model?scriptVersionId=72761073)
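
The comparison step in the methodology above can be illustrated with a minimal sketch on plain Python lists. The helper name `count_correct_tokens` and the toy token ids are assumptions made for illustration; the notebook itself performs the same comparison on torch tensors.

```python
def count_correct_tokens(reference_ids, generated_ids, prompt_len=20, target_len=20):
    """Count positions in [prompt_len, prompt_len + target_len) where both sequences agree."""
    start, stop = prompt_len, prompt_len + target_len
    reference = reference_ids[start:stop]
    generated = generated_ids[start:stop]
    return sum(1 for ref, gen in zip(reference, generated) if ref == gen)


# Toy token ids (hypothetical values, not real tokenizer output)
reference_ids = list(range(40))
generated_ids = list(range(40))
generated_ids[25] = -1  # one mismatch inside the compared window
generated_ids[39] = -1  # a second mismatch at the last compared position

score = count_correct_tokens(reference_ids, generated_ids)
print(score)  # 18
```

The score is a count between 0 and 20, so a document that the model continues exactly as written would score 20.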

In [None]:
!pip install lm-dataformat #Library for easy retrieval of The Pile data from jsonl.zst files
!pip install transformers #Library for easy interaction with the gpt-neo-1.3B model

Collecting lm-dataformat
  Downloading lm_dataformat-0.0.19-py3-none-any.whl (5.4 kB)
Collecting zstandard
  Downloading zstandard-0.15.2-cp37-cp37m-manylinux2014_x86_64.whl (2.2 MB)
     |████████████████████████████████| 2.2 MB 605 kB/s 
Collecting jsonlines
  Downloading jsonlines-2.0.0-py3-none-any.whl (6.3 kB)
Installing collected packages: zstandard, jsonlines, lm-dataformat
Successfully installed jsonlines-2.0.0 lm-dataformat-0.0.19 zstandard-0.15.2


* Creating the @param model and its tokenizer using the Hugging Face transformers API

In [None]:
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

* lm_dataformat retrieves the data through its Reader class, whose method "stream_data" returns an iterator over the documents

In [None]:
import lm_dataformat
pile = lm_dataformat.Reader('../input/the-pile-train-00-dataset/the-eye.eu/public/AI/pile/train/00.jsonl.zst')

* Creating the array @param docs, which contains the first 10,000 documents of THE PILE dataset

In [None]:
from tqdm import tqdm

docsiter = iter(pile.stream_data())
docs = []
with tqdm(total=10000, position=0, leave=True) as pbar:
    for i in range(10000):
        docs.append(next(docsiter)) #Read the next document from the stream
        pbar.update()

100%|██████████| 10000/10000 [00:00<00:00, 12811.05it/s]
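
As an alternative way to take the first N items from an iterator, `itertools.islice` can be used. The sketch below is only an illustration: `take_first` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `pile.stream_data()` so that the example runs without the dataset.

```python
from itertools import islice


def take_first(stream, n):
    """Return at most the first n items of an iterator as a list."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical stand-in for pile.stream_data(): yields a few short documents."""
    for i in range(5):
        yield f"document {i}"


docs_sample = take_first(fake_stream(), 3)  # first three documents
short = take_first(fake_stream(), 10)       # stream ends early: only five documents
print(docs_sample, len(short))
```

Unlike repeated calls to `next`, which raise `StopIteration` when the stream runs out, `islice` simply stops, so the length of the result should be checked if exactly N documents are required.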


* Shuffling first 10,000 records by shuffling a position pointer array

In [None]:
import random
documentidx = list(range(10000))
random.shuffle(documentidx)
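
The notebook does not set a random seed, so the shuffled order differs between runs. If a reproducible order is wanted, one option is to shuffle with a dedicated, seeded generator. This is a sketch under that assumption; the helper `shuffled_indices` and the seed value are hypothetical and not part of the original notebook.

```python
import random


def shuffled_indices(n, seed):
    """Return the indices 0..n-1 in a pseudo-random order determined by seed."""
    rng = random.Random(seed)  # separate generator, leaves the global random state untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


order_a = shuffled_indices(10, seed=0)
order_b = shuffled_indices(10, seed=0)
print(order_a == order_b)  # True: the same seed gives the same order
```

Because each score is stored together with its document index, results from a shuffled run can still be matched back to the original document positions.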

In [None]:
import numpy as np
import torch

#acc[0]: document index, acc[1]: number of tokens in the document, acc[2]: score (np.nan if the document was skipped)
acc = [[],[],[]]
length = 20 #Length of subsequent tokens, the tokens utilized to calculate number of correctly predicted tokens
with tqdm(total=10000, position=0, leave=True) as pbar:
    for idx in range(10000):
        doc = docs[documentidx[idx]]
        #The whole document is tokenized, so long documents trigger a sequence length warning; only the first 40 tokens are used
        input_ids = tokenizer(doc, return_tensors="pt").input_ids
        
        if(input_ids.shape[1] < 40):
            acc[0].append(documentidx[idx])
            acc[1].append(input_ids.shape[1])
            acc[2].append(np.nan) #Input tokens are less than 40, append nan as score
            pbar.update() #Count skipped documents too, so the progress bar reaches 10000
            continue
        
        gen_tokens = model.generate(input_ids[:,:20].cuda(), do_sample=True, temperature=0.1,pad_token_id=50256,eos_token_id=50256,
                                    min_length=20+length,max_length=40+length).cpu()[0] #Generating tokens from the model
        
        score = torch.sum(input_ids[0,20:length+20] == gen_tokens[20:length+20]).item() #Calculating the number of correctly predicted tokens 
        
        #Appending the document index, length of input tokens and the score to the result array, @param acc
        acc[0].append(documentidx[idx])
        acc[1].append(input_ids.shape[1])
        acc[2].append(score)
        
        pbar.update()

  0%|          | 1/10000 [00:02<5:51:38,  2.11s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (989184 > 2048). Running this sequence through the model will result in indexing errors
 99%|█████████▉| 9875/10000 [3:16:40<02:29,  1.20s/it]


In [None]:
import joblib
joblib.dump(acc,'withshuffle.pkl') #Saving @param acc

['withshuffle.pkl']
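
Once the results are saved, they can be summarized while ignoring the skipped documents. The sketch below assumes the three-list layout of `acc` used in this notebook; the helper `summarize_scores` and the toy values are hypothetical and do not come from the actual run.

```python
import math


def summarize_scores(acc):
    """Summarize results stored as [document indices, token counts, scores]."""
    doc_ids, lengths, scores = acc
    valid = [s for s in scores if not math.isnan(s)]  # drop skipped documents (nan)
    skipped = len(scores) - len(valid)
    mean_correct = sum(valid) / len(valid) if valid else float("nan")
    return {"evaluated": len(valid), "skipped": skipped, "mean_correct": mean_correct}


# Toy results in the same layout as acc (hypothetical values).
# With the real file one would first run: acc = joblib.load('withshuffle.pkl')
toy_acc = [[3, 0, 7], [120, 12, 64], [5, float("nan"), 1]]
summary = summarize_scores(toy_acc)
print(summary)  # {'evaluated': 2, 'skipped': 1, 'mean_correct': 3.0}
```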