# OpenWebText Token Counting

## Usage Example

In [1]:
import pandas as pd

from counting import count_tokens_in_openwebtext

# Initial Loading / Counting:
# 
# WARNINGS:
#  1) VERY SLOW (~1 hour on my machine, with the raw OpenWebText data already
#     downloaded)
#  2) The OpenWebText dataset is pretty big (~54 GB). You'll need to make sure
#     you have enough hard drive space for both the raw data and the
#     intermediate files created while processing it (e.g. the tokenized data,
#     since all 54 GB aren't loaded into memory at once)
# 
# NOTE: It's safe to ignore any warnings like:
#     "Token indices sequence length is longer than the specified maximum
#     sequence length for this model (1185 > 1024). Running this sequence
#     through the model will result in indexing errors"
#   because we aren't actually running the tokenized sequences through the
#   model; we're just counting the tokens.
# 
# Finally, I highly recommend saving the results somewhere so you don't have to
# redo this step.
df = count_tokens_in_openwebtext(
    tokenizer_model_id="gpt2",  # tokenizer shared by the GPT-2 family of models
    split="train",
    num_load_workers=8,
    num_tokenize_workers=8,
    return_df=True,
    save_to="example_token_counts.json"
)
df

Loading dataset shards:   0%|          | 0/83 [00:00<?, ?it/s]

tokenizing the splits (num_proc=8):   0%|          | 0/8009762 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1185 > 1024). Running this sequence through the model will result in indexing errors

tokenizing the splits (num_proc=8):   0%|          | 0/4007 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1618 > 1024). Running this sequence through the model will result in indexing errors

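| | token_id | token | n |
|---|---|---|---|
| 0 | 0 | ! | 5487308 |
| 1 | 1 | " | 13327355 |
| 2 | 2 | # | 453372 |
| 3 | 3 | $ | 519325 |
| 4 | 4 | % | 3917074 |
| ... | ... | ... | ... |
| 50252 | 50252 | regress | 12889 |
| 50253 | 50253 | Collider | 7982 |
| 50254 | 50254 | informants | 14115 |
| 50255 | 50255 | gazed | 6278 |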
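
The `counting` module itself is not shown here, so the following is only a minimal sketch of the general idea behind a per-token count, not the actual implementation of `count_tokens_in_openwebtext`. The names `count_tokens`, `toy_vocab`, and `toy_encode` are hypothetical, and the whitespace tokenizer is a stand-in so the example runs without downloading anything.

```python
from collections import Counter
from typing import Callable, Iterable, List


def count_tokens(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    vocab_size: int,
) -> List[int]:
    """Return a list where position i holds how often token ID i occurred."""
    counts = Counter()
    for text in texts:
        # Only the IDs are counted; nothing is fed through a model, which is
        # why sequence-length warnings do not matter for this task.
        counts.update(encode(text))
    # Walk the whole vocabulary so never-seen tokens show up with a count of 0.
    return [counts.get(token_id, 0) for token_id in range(vocab_size)]


# Hypothetical toy vocabulary and whitespace "tokenizer" (assumption: stand-in
# for a real tokenizer's encode function).
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3}


def toy_encode(text: str) -> List[int]:
    return [toy_vocab[word] for word in text.split()]


toy_counts = count_tokens(["the cat sat", "the cat"], toy_encode, len(toy_vocab))
print(toy_counts)  # [2, 2, 1, 0]
```

Iterating over the full vocabulary, rather than only over the tokens that were seen, is what makes rows with `n == 0` possible in a table like the one above. With a real tokenizer one would presumably pass its own encode function in place of `toy_encode`.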


In [2]:
# Loading from saved counts:
df = pd.read_json("example_token_counts.json")
df

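| | token_id | token | n |
|---|---|---|---|
| 0 | 0 | ! | 5487308 |
| 1 | 1 | " | 13327355 |
| 2 | 2 | # | 453372 |
| 3 | 3 | $ | 519325 |
| 4 | 4 | % | 3917074 |
| ... | ... | ... | ... |
| 50252 | 50252 | regress | 12889 |
| 50253 | 50253 | Collider | 7982 |
| 50254 | 50254 | informants | 14115 |
| 50255 | 50255 | gazed | 6278 |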


In [3]:
print("Top 25 Most Common Tokens:")
df.sort_values("n", ascending=False).head(25)

Top 25 Most Common Tokens:


| | token_id | token | n |
|---|---|---|---|
| 11 | 11 | , | 331770469 |
| 262 | 262 | the | 329471323 |
| 13 | 13 | . | 317302752 |
| 198 | 198 | \n | 315827110 |
| 284 | 284 | to | 175794033 |
| 286 | 286 | of | 168195304 |
| 290 | 290 | and | 155214939 |
| 257 | 257 | a | 141384377 |
| 287 | 287 | in | 113908517 |
| 447 | 447 | � | 106355520 |


In [4]:
print("25 Least Common Tokens:")
df.sort_values("n", ascending=False).tail(25)

25 Least Common Tokens:


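| | token_id | token | n |
|---|---|---|---|
| 216 | 216 | | 0 |
| 217 | 217 | | 0 |
| 218 | 218 | | 0 |
| 219 | 219 | | 0 |
| 125 | 125 | � | 0 |
| 43177 | 43177 | EStreamFrame | 0 |
| 124 | 124 | � | 0 |
| 20174 | 20174 | 裏� | 0 |
| 628 | 628 | \n\n | 0 |
| 45706 | 45706 | | 0 |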
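
Many tokens tie at `n == 0`, so which of them land in the last 25 rows, and in what order, depends on how the sort breaks ties. The sketch below shows one way to make that order reproducible by adding `token_id` as a secondary sort key. The small `sample` frame is only an illustration built from a few `(token_id, n)` pairs copied from the outputs in this notebook; it is not the full result.

```python
import pandas as pd

# A handful of (token_id, n) rows copied from the outputs in this notebook.
sample = pd.DataFrame(
    {
        "token_id": [262, 628, 0, 216, 125, 124],
        "n": [329471323, 0, 5487308, 0, 0, 0],
    }
)

# Sort by count first, then by token_id, so that tokens tied at the same
# count always come out in the same order.
least_common = sample.sort_values(["n", "token_id"]).head(4)
print(least_common)
```

On the real data the same pattern would be `df.sort_values(["n", "token_id"]).head(25)`.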


In [5]:
print("Tokens Never Encountered in OpenWebText:")
df.query("n == 0")

Tokens Never Encountered in OpenWebText:


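| | token_id | token | n |
|---|---|---|---|
| 124 | 124 | � | 0 |
| 125 | 125 | � | 0 |
| 177 | 177 | � | 0 |
| 178 | 178 | � | 0 |
| 179 | 179 | � | 0 |
| ... | ... | ... | ... |
| 45706 | 45706 | | 0 |
| 46600 | 46600 | Adinida | 0 |
| 47571 | 47571 | DevOnline | 0 |
| 47654 | 47654 | | 0 |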
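
One plausible explanation for the `�` rows at the top of this table is that they are single-byte tokens whose byte value can never appear in valid UTF-8 text. The sketch below checks that idea under one assumption: that the first 256 GPT-2 token IDs correspond to raw bytes in the order used by GPT-2's byte-level BPE (printable ranges first, then everything else). That ordering agrees with the rows shown in this notebook (`!` is ID 0, `\n` is ID 198), but treat it as an assumption rather than a guarantee. The function names are hypothetical.

```python
def gpt2_byte_order() -> list:
    """Byte values in the assumed order of GPT-2's first 256 token IDs."""
    printable = (
        list(range(0x21, 0x7F))     # '!' through '~'
        + list(range(0xA1, 0xAD))   # U+00A1 through U+00AC
        + list(range(0xAE, 0x100))  # U+00AE through U+00FF
    )
    # All remaining byte values follow, in increasing order.
    rest = [b for b in range(256) if b not in printable]
    return printable + rest


def bytes_unused_by_utf8() -> set:
    """Byte values that never occur in the UTF-8 encoding of any code point."""
    used = set()
    for code_point in range(0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        used.update(chr(code_point).encode("utf-8"))
    return set(range(256)) - used


id_to_byte = gpt2_byte_order()
unused = bytes_unused_by_utf8()

# Token IDs (among the first 256) whose byte can never appear in valid UTF-8.
never_seen_ids = [i for i, b in enumerate(id_to_byte) if b in unused]
print(never_seen_ids)

# Decoded on their own, such bytes become the replacement character U+FFFD.
print(repr(bytes([0xC0]).decode("utf-8", errors="replace")))
```

Under this assumption the computed IDs are 124, 125, and 177 through 187, which lines up with the first rows of the table. The same ordering maps IDs 216 to 219 to the control bytes 0x1C to 0x1F, which are valid UTF-8 but may simply never occur in the dataset's text. Zero counts for multi-character tokens such as `Adinida` need a different explanation and are not covered by this sketch.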
