# Introduction

Let us build a Small Language Model (SLM) from scratch. We will try to keep the parameter count to 10-15 million.

Our goal is to generate creative and coherent text based on the input data.

## Step 1: Import the Dataset

TinyStories is a synthetic dataset of short stories, generated by GPT-3.5 and GPT-4, that use only words a typical 3- to 4-year-old understands. We can get it from Hugging Face.

In [1]:
!pip install datasets



In [2]:
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")



## Step 2: Tokenize the Dataset

In this step, we will do the following:

(1) Tokenize the dataset into tokenIDs.

(2) Create two files, "train.bin" and "validation.bin", that store the tokenIDs from the entire dataset.

(3) Keep the tokenIDs on disk rather than in RAM, so the full dataset never has to fit in memory.
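
To make the disk-backed storage idea concrete, here is a minimal sketch of the same write-then-read pattern on a toy scale. The token IDs and the file name `toy.bin` are hypothetical stand-ins, not values from the dataset; the only real API assumed is `np.memmap`.

```python
import os
import tempfile

import numpy as np

# Hypothetical token IDs standing in for the tokenizer output of three stories.
token_ids = [[464, 3290, 318], [50256], [7454, 2402, 257, 640]]
total_len = sum(len(ids) for ids in token_ids)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.bin")  # hypothetical file name

    # uint16 holds values up to 65535, enough for GPT-2 token IDs (max 50256).
    arr = np.memmap(path, dtype=np.uint16, mode="w+", shape=(total_len,))
    idx = 0
    for ids in token_ids:
        arr[idx : idx + len(ids)] = ids
        idx += len(ids)
    arr.flush()
    del arr  # release the writable mapping

    # Re-open read-only: only the slices we touch are read from disk.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    window = data[2:6].astype(np.int64).tolist()
    file_size = os.path.getsize(path)
    del data  # close the mapping before the directory is removed

print(window, file_size)
```

Each token costs two bytes on disk, and a training batch only reads the small windows it samples.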

In [3]:
!pip install tiktoken
import tiktoken
import os
import numpy as np
from tqdm.auto import tqdm

enc = tiktoken.get_encoding("gpt2")

# Some functions from https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py

def process(example):
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
    out = {'ids': ids, 'len': len(ids)}
    return out

if not os.path.exists("train.bin"):
    tokenized = ds.map(
        process,
        remove_columns=['text'],
        desc="tokenizing the splits",
        num_proc=8,
        )
    # concatenate all the ids in each dataset into one large file we can use for training
    for split, dset in tokenized.items():
        arr_len = np.sum(dset['len'], dtype=np.uint64)
        filename = f'{split}.bin'
        dtype = np.uint16 # (can do since enc.max_token_value == 50256 is < 2**16)
        arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
        total_batches = 1024

        idx = 0
        for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
            # Batch together samples for faster write
            batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
            arr_batch = np.concatenate(batch['ids'])
            # Write into mmap
            arr[idx : idx + len(arr_batch)] = arr_batch
            idx += len(arr_batch)
        arr.flush()



tokenizing the splits (num_proc=8): 100%|██████████| 2119719/2119719 [00:31<00:00, 67680.72 examples/s]
tokenizing the splits (num_proc=8): 100%|██████████| 21990/21990 [00:00<00:00, 50811.37 examples/s]
writing train.bin: 100%|██████████| 1024/1024 [00:05<00:00, 176.59it/s]
writing validation.bin: 100%|██████████| 1024/1024 [00:00<00:00, 1687.47it/s]


## Step 3: Create Input-Output batches for the dataset

In [4]:
# Some functions from https://github.com/karpathy/nanoGPT/blob/master/train.py with slight modifications
def get_batch(split):
    # We recreate np.memmap every batch to avoid a memory leak, as per
    # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122
    if split == 'train':
        data = np.memmap('train.bin', dtype=np.uint16, mode='r')
    else:
        data = np.memmap('validation.bin', dtype=np.uint16, mode='r')
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y
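
The relationship between `x` and `y` in `get_batch` is easier to see on a toy sequence. The sketch below uses plain Python lists in place of the memory-mapped array and tensors; the data values and the small `block_size` and `batch_size` are assumptions chosen for readability.

```python
import random

# Toy stand-in for the memory-mapped token array (assumed values).
data = list(range(100, 120))
block_size = 4   # context length for this toy example
batch_size = 3

random.seed(0)
# Valid starts leave room for block_size inputs plus one extra target token.
starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]

x = [data[i : i + block_size] for i in starts]
y = [data[i + 1 : i + 1 + block_size] for i in starts]

for inputs, targets in zip(x, y):
    print(inputs, "->", targets)
```

Each target row is the input row shifted left by one token, so position `t` of `y` holds the token the model should predict after seeing positions `0..t` of `x`.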


## Step 4: Define the SLM Model Architecture

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
import numpy as np
from tqdm.auto import tqdm
from contextlib import nullcontext
import os

class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.flash = hasattr(F, 'scaled_dot_product_attention')
        if not self.flash:
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                       .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        if self.flash:
            y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.attn_dropout.p if self.training else 0.0, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.resid_dropout(self.c_proj(y))
        return y

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config.n_embd, config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln2 = LayerNorm(config.n_embd, config.bias)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

@dataclass
class GPTConfig:
    block_size: int
    vocab_size: int
    n_layer: int
    n_head: int
    n_embd: int
    dropout: float = 0.0
    bias: bool = True

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            wpe=nn.Embedding(config.block_size, config.n_embd),
            drop=nn.Dropout(config.dropout),
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=LayerNorm(config.n_embd, config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying

        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size
        pos = torch.arange(0, t, dtype=torch.long, device=device)

        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
            return logits, loss
        else:
            logits = self.lm_head(x[:, [-1], :])
            return logits, None

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate tokens given a conditioning sequence.
        idx: Tensor of shape (B, T)
        """
        for _ in range(max_new_tokens):
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
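
The non-flash branch of `CausalSelfAttention` above masks future positions before the softmax. As a rough illustration, the sketch below reproduces that step for a single head in NumPy, without dropout or projections. The function name `causal_attention` and the toy shapes are hypothetical.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention on arrays of shape (T, d)."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Lower-triangular mask: position t may attend to positions 0..t only.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(np.round(weights, 2))
```

The printed weight matrix is zero above the diagonal, which is why changing a later token cannot affect the output at an earlier position.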



In [6]:
config = GPTConfig(
    vocab_size=50257,     # use the tokenizer's vocab size
    block_size=128,       # or whatever context size you're training with
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=True
)

model = GPT(config)
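
As a rough check on model size, the parameter count can be worked out by hand from the layer shapes. The sketch below is an analytic tally for the configuration above, assuming the tied `wte`/`lm_head` weight is counted once; the helper name `count_gpt_params` is hypothetical.

```python
def count_gpt_params(vocab_size, block_size, n_layer, n_embd, bias=True):
    """Analytic parameter count for the GPT class above, with tied embeddings."""
    b = 1 if bias else 0
    layer_norm = n_embd + b * n_embd
    # c_attn (n_embd -> 3*n_embd) plus c_proj (n_embd -> n_embd)
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # c_fc (n_embd -> 4*n_embd) plus c_proj (4*n_embd -> n_embd)
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    per_block = 2 * layer_norm + attn + mlp
    return {
        "token_embedding": vocab_size * n_embd,   # shared with lm_head
        "position_embedding": block_size * n_embd,
        "blocks": n_layer * per_block,
        "final_layer_norm": layer_norm,
    }

counts = count_gpt_params(vocab_size=50257, block_size=128, n_layer=6, n_embd=384)
total = sum(counts.values())
for name, value in counts.items():
    print(f"{name:>20}: {value:>12,}")
print(f"{'total':>20}: {total:>12,}")
```

Under these assumptions the six blocks hold about 10.6 million parameters, in line with the 10-15 million target, while the tied token embedding adds roughly 19.3 million more, for a total near 30 million.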

## Step 5: Define the loss function

In [7]:
def estimate_loss(model):
    out = {}
    model.eval()
    with torch.inference_mode():
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                with ctx:
                    logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
    model.train()
    return out
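
Two reference points help when reading the losses that `estimate_loss` reports. The sketch below assumes the loss is mean cross-entropy in nats per token, which is what `F.cross_entropy` returns by default. The helper name `perplexity` is hypothetical, and 2.40 is the last validation loss logged in Step 8, rounded.

```python
import math

vocab_size = 50257

# A model that spreads probability evenly over the vocabulary
# has cross-entropy ln(vocab_size) nats per token.
uniform_loss = math.log(vocab_size)

def perplexity(loss_nats):
    """Convert a mean cross-entropy in nats to perplexity."""
    return math.exp(loss_nats)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"perplexity at loss 2.40: {perplexity(2.40):.1f}")
```

A freshly initialized model should start near the uniform-guess loss, so a first evaluation far above it usually points to a bug rather than a hard dataset.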

## Step 6: Define SLM Training Configuration Part 1

In [8]:
# Training Config
import torch
from contextlib import nullcontext

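learning_rate = 1e-4 # learning rate reached at the end of warmup
max_iters = 20000 # total number of training iterations
warmup_steps = 1000 #smoother initial train, earlier 100
min_lr = 5e-4 # NOTE: larger than learning_rate, so the cosine phase raises the LR toward 5e-4; set it below learning_rate for a true decay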
eval_iters = 500 # increased from 100
batch_size = 32 # changed from 16, better gradient estimate
block_size = 128 #changed from 64, capture longer range dependencies

gradient_accumulation_steps = 32 # reduced from 50

device =  "cuda" if torch.cuda.is_available() else "cpu"
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
# note: float16 data type will automatically use a GradScaler

# How to use autocast https://wandb.ai/wandb_fc/tips/reports/How-To-Use-Autocast-in-PyTorch--VmlldzoyMTk4NTky
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]

ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

torch.set_default_device(device)
torch.manual_seed(42)

<torch._C.Generator at 0x124242db0>

## Step 7: Define SLM Training Configuration Part 2

In [9]:
from torch.optim.lr_scheduler import LinearLR,SequentialLR, CosineAnnealingLR

# AdamW with weight decay; beta2 changed to 0.95
optimizer =  torch.optim.AdamW(model.parameters(), lr=learning_rate, betas=(0.9, 0.95), weight_decay=0.1, eps=1e-9) #weight decay for regularization

scheduler_warmup = LinearLR(optimizer, total_iters = warmup_steps) #Implement linear warmup
scheduler_decay = CosineAnnealingLR(optimizer,T_max = max_iters - warmup_steps, eta_min = min_lr) #Implement lr decay
scheduler = SequentialLR(optimizer, schedulers=[scheduler_warmup, scheduler_decay], milestones=[warmup_steps]) #Switching from warmup to decay

# https://stackoverflow.com/questions/72534859/is-gradscaler-necessary-with-mixed-precision-training-with-pytorch
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
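
The schedule built above can be approximated in closed form, which makes it easy to preview before training. This is a sketch rather than the PyTorch implementation: it assumes `LinearLR` keeps its default `start_factor` of 1/3, and the function name `lr_at` is hypothetical.

```python
import math

def lr_at(step, base_lr=1e-4, min_lr=5e-4, warmup_steps=1000,
          max_iters=20000, start_factor=1.0 / 3):
    """Approximate learning rate after `step` scheduler steps."""
    if step < warmup_steps:
        # Linear warmup: the multiplier ramps from start_factor to 1.0.
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine annealing: moves from base_lr to min_lr over the remaining steps.
    t = step - warmup_steps
    t_max = max_iters - warmup_steps
    return min_lr + (base_lr - min_lr) * (1 + math.cos(math.pi * t / t_max)) / 2

for step in (0, 500, 1000, 10000, 19500):
    print(step, f"{lr_at(step):.5f}")
```

The values at steps 500 and 10000 round to 0.00007 and 0.00028, which agrees with the learning rates printed in Step 8. Because `min_lr` (5e-4) is larger than `learning_rate` (1e-4), the cosine phase rises rather than decays.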



## Step 8: Pre-train the SLM

In [10]:
best_val_loss = float('inf')
best_model_params_path = "best_model_params.pt"
train_loss_list, validation_loss_list = [], []

# Ensure model is on the correct device
model = model.to(device)

# In your training loop
for epoch in tqdm(range(max_iters)):
    if epoch % eval_iters == 0 and epoch != 0:
        # Ensure estimate_loss uses the correct device
        losses = estimate_loss(model)
        print(f"Epoch {epoch}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        print(f"The current learning rate: {optimizer.param_groups[0]['lr']:.5f}")
        train_loss_list += [losses['train']]
        validation_loss_list += [losses['val']]

        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save(model.state_dict(), best_model_params_path)

    # Ensure X and y are on the correct device
    X, y = get_batch("train")
    X, y = X.to(device), y.to(device)

    with ctx:
        logits, loss = model(X, y)
        loss = loss / gradient_accumulation_steps
        scaler.scale(loss).backward()

    if ((epoch + 1) % gradient_accumulation_steps == 0) or (epoch + 1 == max_iters):
        scaler.unscale_(optimizer)  # unscale gradients so clipping sees their true norm
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    scheduler.step()

  2%|▎         | 500/20000 [09:54<6:20:46,  1.17s/it]

Epoch 500: train loss 9.4365, val loss 9.4403
The current learning rate: 0.00007


  5%|▌         | 1000/20000 [26:42<6:21:14,  1.20s/it]

Epoch 1000: train loss 8.4975, val loss 8.5039
The current learning rate: 0.00010


  8%|▊         | 1500/20000 [43:46<6:10:21,  1.20s/it]   

Epoch 1500: train loss 7.5426, val loss 7.5412
The current learning rate: 0.00010


 10%|█         | 2000/20000 [9:09:05<6:02:49,  1.21s/it]    

Epoch 2000: train loss 6.6762, val loss 6.6737
The current learning rate: 0.00010


 12%|█▎        | 2500/20000 [10:45:50<5:39:53,  1.17s/it]     

Epoch 2500: train loss 5.9776, val loss 5.9702
The current learning rate: 0.00011


 15%|█▌        | 3000/20000 [11:03:08<5:48:58,  1.23s/it]   

Epoch 3000: train loss 5.4659, val loss 5.4651
The current learning rate: 0.00011


 18%|█▊        | 3500/20000 [11:35:45<1264:25:53, 275.88s/it]

Epoch 3500: train loss 5.0603, val loss 5.0519
The current learning rate: 0.00012


 20%|██        | 4000/20000 [14:06:23<5:46:39,  1.30s/it]      

Epoch 4000: train loss 4.7378, val loss 4.7366
The current learning rate: 0.00012


 22%|██▎       | 4500/20000 [14:29:00<5:07:02,  1.19s/it]    

Epoch 4500: train loss 4.4927, val loss 4.4909
The current learning rate: 0.00013


 25%|██▌       | 5000/20000 [14:45:35<4:54:37,  1.18s/it]   

Epoch 5000: train loss 4.2817, val loss 4.2748
The current learning rate: 0.00014


 28%|██▊       | 5500/20000 [15:56:57<5:12:00,  1.29s/it]     

Epoch 5500: train loss 4.1035, val loss 4.1027
The current learning rate: 0.00015


 30%|███       | 6000/20000 [16:13:36<4:29:44,  1.16s/it]   

Epoch 6000: train loss 3.9564, val loss 3.9542
The current learning rate: 0.00016


 32%|███▎      | 6500/20000 [16:30:09<4:20:49,  1.16s/it]   

Epoch 6500: train loss 3.8143, val loss 3.8158
The current learning rate: 0.00018


 35%|███▌      | 7000/20000 [16:46:39<4:13:53,  1.17s/it]   

Epoch 7000: train loss 3.7003, val loss 3.7116
The current learning rate: 0.00019


 38%|███▊      | 7500/20000 [17:29:59<10:51:19,  3.13s/it]  

Epoch 7500: train loss 3.5942, val loss 3.6004
The current learning rate: 0.00020


 40%|████      | 8000/20000 [18:30:52<3:59:02,  1.20s/it]    

Epoch 8000: train loss 3.4870, val loss 3.4867
The current learning rate: 0.00022


 42%|████▎     | 8500/20000 [18:47:53<3:50:59,  1.21s/it]   

Epoch 8500: train loss 3.4033, val loss 3.4029
The current learning rate: 0.00024


 45%|████▌     | 9000/20000 [19:04:53<3:33:40,  1.17s/it]   

Epoch 9000: train loss 3.3135, val loss 3.3258
The current learning rate: 0.00025


 48%|████▊     | 9500/20000 [19:40:34<4:53:06,  1.67s/it]   

Epoch 9500: train loss 3.2468, val loss 3.2510
The current learning rate: 0.00027


 50%|█████     | 10000/20000 [20:04:09<3:18:07,  1.19s/it]  

Epoch 10000: train loss 3.1834, val loss 3.1764
The current learning rate: 0.00028


 52%|█████▎    | 10500/20000 [20:20:57<3:08:36,  1.19s/it]   

Epoch 10500: train loss 3.1003, val loss 3.1129
The current learning rate: 0.00030


 55%|█████▌    | 11000/20000 [20:38:17<2:53:40,  1.16s/it]   

Epoch 11000: train loss 3.0491, val loss 3.0604
The current learning rate: 0.00032


 57%|█████▊    | 11500/20000 [20:55:20<2:52:47,  1.22s/it]   

Epoch 11500: train loss 2.9935, val loss 3.0005
The current learning rate: 0.00033


 60%|██████    | 12000/20000 [21:12:03<2:37:10,  1.18s/it]   

Epoch 12000: train loss 2.9384, val loss 2.9417
The current learning rate: 0.00035


 62%|██████▎   | 12500/20000 [21:29:04<3:10:34,  1.52s/it]   

Epoch 12500: train loss 2.8859, val loss 2.8814
The current learning rate: 0.00036


 65%|██████▌   | 13000/20000 [21:46:06<2:21:14,  1.21s/it]   

Epoch 13000: train loss 2.8414, val loss 2.8511
The current learning rate: 0.00038


 68%|██████▊   | 13500/20000 [22:02:54<2:12:30,  1.22s/it]   

Epoch 13500: train loss 2.8031, val loss 2.8066
The current learning rate: 0.00040


 70%|███████   | 14000/20000 [22:28:47<2:00:30,  1.21s/it]   

Epoch 14000: train loss 2.7573, val loss 2.7530
The current learning rate: 0.00041


 72%|███████▎  | 14500/20000 [22:45:26<1:46:43,  1.16s/it]   

Epoch 14500: train loss 2.7077, val loss 2.7147
The current learning rate: 0.00042


 75%|███████▌  | 15000/20000 [23:02:22<1:39:34,  1.19s/it]   

Epoch 15000: train loss 2.6789, val loss 2.6871
The current learning rate: 0.00044


 78%|███████▊  | 15500/20000 [23:19:12<1:28:46,  1.18s/it]   

Epoch 15500: train loss 2.6391, val loss 2.6439
The current learning rate: 0.00045


 80%|████████  | 16000/20000 [24:00:18<2:08:59,  1.93s/it]   

Epoch 16000: train loss 2.5940, val loss 2.5994
The current learning rate: 0.00046


 82%|████████▎ | 16500/20000 [24:34:57<1:12:29,  1.24s/it]   

Epoch 16500: train loss 2.5579, val loss 2.5721
The current learning rate: 0.00047


 85%|████████▌ | 17000/20000 [24:52:09<58:29,  1.17s/it]     

Epoch 17000: train loss 2.5368, val loss 2.5408
The current learning rate: 0.00048


 88%|████████▊ | 17500/20000 [25:09:01<49:35,  1.19s/it]     

Epoch 17500: train loss 2.5050, val loss 2.5033
The current learning rate: 0.00048


 90%|█████████ | 18000/20000 [33:59:10<41:06:51, 74.01s/it]  

Epoch 18000: train loss 2.4722, val loss 2.4795
The current learning rate: 0.00049


 92%|█████████▎| 18500/20000 [34:50:23<29:22,  1.18s/it]     

Epoch 18500: train loss 2.4447, val loss 2.4461
The current learning rate: 0.00049


 95%|█████████▌| 19000/20000 [35:07:17<19:52,  1.19s/it]    

Epoch 19000: train loss 2.4190, val loss 2.4203
The current learning rate: 0.00050


 98%|█████████▊| 19500/20000 [35:24:12<09:47,  1.17s/it]    

Epoch 19500: train loss 2.3990, val loss 2.4006
The current learning rate: 0.00050


100%|██████████| 20000/20000 [36:15:09<00:00,  6.53s/it]    
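
The training loop above divides each loss by `gradient_accumulation_steps` before calling `backward()`. A small pure-Python sketch shows why: with equal-sized micro-batches, the scaled gradients add up to the gradient of one large batch. The data, the scalar model and the helper `grad_mse` are hypothetical.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical toy data: 8 samples split into 4 micro-batches of 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
accumulation_steps = 4
micro = len(xs) // accumulation_steps

# Accumulate micro-batch gradients, each scaled by 1 / accumulation_steps.
accumulated = 0.0
for s in range(accumulation_steps):
    lo, hi = s * micro, (s + 1) * micro
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

full_batch = grad_mse(w, xs, ys)
print(accumulated, full_batch)
```

With `batch_size = 32`, `gradient_accumulation_steps = 32` and `block_size = 128`, each optimizer update in the loop above sees 32 × 32 × 128 = 131,072 tokens, and the 20,000 iterations correspond to 625 optimizer updates.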


## Step 9: Plot the SLM Loss Function

In [12]:
!pip install matplotlib
import matplotlib.pyplot as plt
train_loss_list_converted = [i.cpu().detach() for i in train_loss_list]
validation_loss_list_converted = [i.cpu().detach() for i in validation_loss_list]

plt.plot(train_loss_list_converted, 'g', label='train_loss')
plt.plot(validation_loss_list_converted, 'r', label='validation_loss')
plt.xlabel("Evaluation step (every 500 iterations)")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Step 10: Run SLM Inference on our trained model

In [13]:
#Load the model
model = GPT(config)  # re-create the model with same config
device =  "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.eval()  # switch off dropout for inference
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device))) # load best model states


<All keys matched successfully>

In [14]:
sentence = "Once upon a time there was a pumpkin."
context = (torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim = 0))
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))

Once upon a time there was a pumpkin. He was very generous and said he wanted to make a promise. He splashed allating his funniest pretending.

One day, while he was sleeping. He saw something bright. It was a yellow triangle. It was shiny and sparkly. Betty loved it.

"What was that?" said his mom.

Max and Sally went inside and found a lot of colorful vegetables. After they looked, they always took turns.

When theBox began to put there, they made a big cough on their faces. They talked around tomatoes for each other's more and more them felt much better.

After a while, the drum was gone for a cookie. Then they found a little jar of chocolate. Jack and them both thanked and the fun.



From together on the day on, Lilly and her mom made sure to always be careful.Once upon a time there was a lively little girl. She liked to look at the woods and asked her mom


In [16]:
sentence = "One day Amit went to office"
context = (torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim = 0))
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))

One day Amit went to office and found a hidden tiny rat. He even followed all around the corner of the corner, he opened the door and found a big chest on the ground. He was very kind and gentle, pretty, never realized that the label was.

When he reached inside, Frosty saw Kitty left above. “Have many fun,” her owner said. Spot was happy to be to have his owner help, who was make. Peppa in a big smile whenever he caught him with him thankful.Once upon a time, there was a little girl named Lily. She loved going for a ride with her dad. One day, they went to a cakes. Lily went to Sam was unhappy and he didn't know what to do.

While finally her best daddy called apron. "Mommy, what decide."

His dad told her that it was important to take them to a way to school. The next time they put in a big bag of juice
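
The samples above come from `generate` with its default `temperature=1.0` and no `top_k`, which may explain some of the drift. The sketch below mirrors the per-token sampling steps in plain Python to show what the two knobs do. The function name `sample_next` and the five logits are hypothetical.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from a list of logits, mirroring the steps in generate()."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep the k largest logits and mask the rest out.
        k = min(top_k, len(scaled))
        threshold = sorted(scaled, reverse=True)[k - 1]
        scaled = [l if l >= threshold else -math.inf for l in scaled]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0], probs

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # hypothetical logits for a 5-token vocabulary
token, probs = sample_next(logits, temperature=0.8, top_k=2, rng=rng)
print(token, [round(p, 3) for p in probs])
```

With `top_k=2` only the two most likely tokens keep non-zero probability, so unlikely tokens can no longer derail a sentence. Passing values such as `temperature=0.8, top_k=50` to `model.generate` is one option to try.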
