# Part 5: Datasets

The Tiny Shakespeare dataset is really small. With a batch size of roughly 0.5 million tokens, we would cycle through the entire dataset every step or two. So we need a bigger dataset, especially when training on multiple GPUs.
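
To make the scale mismatch concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions rather than figures from this notebook: about 338K tokens for Tiny Shakespeare under the GPT-2 tokenizer, a batch of 2**19 = 524,288 tokens standing in for "0.5 million", and exactly 10B tokens for the FineWeb sample.

```python
def steps_per_epoch(dataset_tokens, batch_tokens):
    """How many optimizer steps it takes to see every token once."""
    return dataset_tokens / batch_tokens

# All three constants are illustrative assumptions, not measured values.
TINY_SHAKESPEARE_TOKENS = 338_000      # approximate GPT-2 token count
BATCH_TOKENS = 2**19                   # 524,288, i.e. "about 0.5 million"
FINEWEB_TOKENS = 10_000_000_000        # nominal size of the 10B-token sample

tiny_steps = steps_per_epoch(TINY_SHAKESPEARE_TOKENS, BATCH_TOKENS)
fineweb_steps = steps_per_epoch(FINEWEB_TOKENS, BATCH_TOKENS)

print(f"Tiny Shakespeare: {tiny_steps:.2f} steps per epoch")
print(f"FineWeb 10B:      {fineweb_steps:.0f} steps per epoch")
```

Under these assumptions a single step already covers more than all of Tiny Shakespeare, while the 10B-token sample lasts on the order of 19,000 steps before any token repeats.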

The GPT-2 training dataset (WebText) was never released. It was built from pages reached through outbound links on Reddit.

For training our GPT, we're going to use FineWeb, which is a preprocessed and filtered subset of Common Crawl. Specifically, we use the 10B-token sample of FineWeb.

The goal of this part is to load the dataset from Hugging Face and prepare it so that our data loader can handle it. We assume that the dataset has already been downloaded and tokenized into shard files in a directory called `edu_fineweb10B`.
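
The shard-writing step itself is not shown in this part. As a rough sketch of what that preparation might involve, the snippet below packs token ids into fixed-size `.npy` files. Everything in it is an assumption made for illustration: the `write_shards` helper, the `edufineweb` filename prefix, the tiny shard size, and the choice to tag the first shard `val` and the rest `train`. The only properties taken from the loader in this part are that shards can be read with `np.load` and carry the split name in their filename. Storing ids as `uint16` works for the GPT-2 vocabulary because its 50,257 ids all fit below 65,536.

```python
import os
import tempfile

import numpy as np

def write_shards(token_stream, out_dir, shard_size, prefix="edufineweb"):
    """Pack an iterable of token-id lists into fixed-size uint16 .npy shards.

    Hypothetical convention: shard 0 is tagged 'val', later shards 'train'.
    """
    os.makedirs(out_dir, exist_ok=True)
    buf = np.empty(shard_size, dtype=np.uint16)
    filled, shard_idx, paths = 0, 0, []

    def flush(n):
        # Write the first n tokens of the buffer as the next shard.
        nonlocal shard_idx
        split = "val" if shard_idx == 0 else "train"
        path = os.path.join(out_dir, f"{prefix}_{split}_{shard_idx:06d}.npy")
        np.save(path, buf[:n])
        paths.append(path)
        shard_idx += 1

    for doc in token_stream:
        doc = np.asarray(doc, dtype=np.uint16)
        while len(doc) > 0:
            # Copy as much of the document as still fits in the current shard.
            take = min(shard_size - filled, len(doc))
            buf[filled:filled + take] = doc[:take]
            filled += take
            doc = doc[take:]
            if filled == shard_size:
                flush(filled)
                filled = 0
    if filled > 0:
        flush(filled)  # final, possibly shorter, shard
    return paths

# Toy "tokenized documents"; a real run would tokenize text first.
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_shards(docs, tmp, shard_size=4)
    names = [os.path.basename(p) for p in paths]
    first = np.load(paths[0])
    last = np.load(paths[-1])
print(names)
```

Documents are allowed to straddle shard boundaries here, which keeps every shard except the last at exactly `shard_size` tokens.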

In [None]:
import tiktoken
import numpy as np
import torch
import os

In [None]:
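def load_tokens(filename):
    # Load one shard of token ids saved as a NumPy array.
    npt = np.load(filename)
    # Cast up front in case the shard uses an unsigned type such as uint16,
    # which some PyTorch versions cannot convert directly.
    npt = npt.astype(np.int32)
    ptt = torch.tensor(npt, dtype=torch.long)
    return ptt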

In [None]:
class DataLoaderLite:
    def __init__(self, B, T, process_rank, num_processes, split):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_processes = num_processes
        assert split in {'train', 'val'}

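        # Collect the shard files for this split in a deterministic order.
        data_root = 'edu_fineweb10B'
        shards = os.listdir(data_root)
        shards = [s for s in shards if split in s]
        shards = sorted(shards)
        shards = [os.path.join(data_root, s) for s in shards]
        self.shards = shards
        assert len(shards) > 0, f"no shards found for split {split}"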

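        # Start at the first shard; each process begins at its own offset
        # so that processes read disjoint chunks.
        self.current_shard = 0
        self.tokens = load_tokens(self.shards[self.current_shard])
        self.current_position = self.B * self.T * self.process_rank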

    def next_batch(self):
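        B, T = self.B, self.T
        # Grab B*T + 1 tokens: the extra one supplies the target for the last input.
        buf = self.tokens[self.current_position : self.current_position + B*T+1]
        x = (buf[:-1]).view(B, T)  # inputs
        y = (buf[1:]).view(B, T)   # targets, shifted one token ahead

        # Skip past the chunks consumed by the other processes.
        self.current_position += B * T * self.num_processes

        # Advance to the next shard (wrapping around) once fewer than
        # B*T*num_processes + 1 tokens remain ahead of this position.
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = load_tokens(self.shards[self.current_shard])
            self.current_position = B * T * self.process_rank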
        return x, y
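
To see how the position arithmetic in `next_batch` splits a shard across processes, the sketch below replays the same indexing on a toy token array. It is a simplified stand-in, not the loader itself: `rank_batches` is a hypothetical helper, NumPy's `reshape` takes the place of `torch.Tensor.view`, the "tokens" are just the integers 0 to 99, and the rollover to the next shard is left out.

```python
import numpy as np

def rank_batches(tokens, B, T, rank, num_processes, steps):
    """Return the (x, y) pairs one rank would read from a single shard."""
    pos = B * T * rank  # same starting offset as DataLoaderLite
    out = []
    for _ in range(steps):
        buf = tokens[pos : pos + B * T + 1]
        x = buf[:-1].reshape(B, T)    # inputs
        y = buf[1:].reshape(B, T)     # targets: inputs shifted by one token
        out.append((x, y))
        pos += B * T * num_processes  # jump over the other ranks' chunks
    return out

tokens = np.arange(100)  # token i has value i, so values reveal positions
B, T, P = 2, 3, 2        # toy sizes: 2 sequences of 3 tokens, 2 processes

r0 = rank_batches(tokens, B, T, rank=0, num_processes=P, steps=2)
r1 = rank_batches(tokens, B, T, rank=1, num_processes=P, steps=2)

for step in range(2):
    print(f"step {step} rank 0 x: {r0[step][0].tolist()}")
    print(f"step {step} rank 1 x: {r1[step][0].tolist()}")
```

With these toy sizes, rank 0 reads inputs starting at tokens 0 and 12 while rank 1 reads those starting at 6 and 18, so the two processes interleave without overlapping inputs. Each `y` is the matching `x` shifted one token ahead.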