#  Tiny LLM on NASA Space Data  

## *Introduction*  
In this notebook, we will train a **very small language model** (a tiny stand-in for an LLM) on **NASA space-related datasets**.  
The goal is educational: to understand the basic steps of preparing data, building a small model, training it, and generating text.  

We will use:  
* **eva.csv** → Extravehicular activities (spacewalks)  
* **neo.csv / nearest-earth-objects.csv** → Near-Earth Objects data  
* **cleaned_5250.csv** → Cleaned dataset for training  

### *What we will do:*  
1. Load and explore the dataset  
2. Prepare text data for training  
3. Tokenize the text  
4. Define a very small LLM (Transformer-based)  
5. Train the model  
6. Generate sample text about space  

---


#  Import Libraries and Setup

We will use **PyTorch** for building the model.  
The Kaggle notebook may run on CPU or GPU, so we will detect the device first.


In [None]:
# !pip uninstall -y torch torchvision torchaudio functorch
# !pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --quiet


In [None]:
# import os, sys
# os.kill(os.getpid(), 9)


In [None]:
# import torch
# print(torch.__version__)


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import random

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)


#  Load and Explore Dataset  

We will start with **eva.csv** (Extravehicular Activities - spacewalks).  
Steps:  
* Read the CSV file  
* Display a few rows  
* Understand what columns exist (we will later convert them into text for training)  


In [None]:
# Load eva.csv file
eva_df = pd.read_csv("/kaggle/input/nasa-training/eva.csv")

# Show first few rows
eva_df.head()

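A minimal sketch of the "understand what columns exist" step, run on a tiny in-memory table instead of the real file. The column names `crew` and `duration_hours` are hypothetical stand-ins; the actual columns of `eva.csv` may differ.

```python
import pandas as pd

# Hypothetical stand-in for eva.csv; these column names are illustrative only.
toy_df = pd.DataFrame({
    "crew": ["A. Example", "B. Sample", None],
    "duration_hours": [5.5, 6.25, 7.0],
})

print(toy_df.columns.tolist())   # column names
print(toy_df.dtypes)             # per-column data types
missing = toy_df.isna().sum()    # number of missing values per column
print(missing)
```

The same three calls can be applied to `eva_df` to decide which columns are worth turning into training text.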

#  Merge All CSVs into a Single Text Corpus  

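We have multiple datasets:  
* eva.csv → Extravehicular Activities (spacewalks)  
* neo.csv → Near-Earth Objects  
* nearest-earth-objects(1910-2024).csv → Historical NEO data  
* cleaned_5250.csv & neo_v2.csv → Cleaned datasets  
* space_dataset_merged.csv → Natural-language space dataset  

In the cell below, the individual NASA files are commented out, so only `space_dataset_merged.csv` is actually read. Uncomment the other paths to include them in the corpus.  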

We will:  
1. Load each CSV file.  
2. Convert each row into **string text**.  
3. Merge everything into **one big text corpus** for training our LLM.  


In [None]:
# List of all CSV files (NASA + natural language)
csv_files = [
    # "/kaggle/input/nasa-training/eva.csv",
    # "/kaggle/input/nasa-training/neo.csv",
    # "/kaggle/input/nasa-training/nearest-earth-objects(1910-2024).csv",
    # "/kaggle/input/nasa-training/cleaned_5250.csv",
    # "/kaggle/input/nasa-training/neo_v2.csv",
    "/kaggle/input/created-data/space_dataset_merged.csv" # new natural-language dataset
]

text_corpus = ""

for file in csv_files:
    try:
        df = pd.read_csv(file)
        text_corpus += df.astype(str).apply(" ".join, axis=1).str.cat(sep=" ") + " "
    except Exception as e:
        print(f"Error reading {file}: {e}")

print(text_corpus[:500])
print("\nCorpus length:", len(text_corpus))

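To see what the row-to-text conversion does, here is a small sketch on a two-row table. The table contents and column names are made up for illustration; only the chain of pandas calls mirrors the cell above.

```python
import pandas as pd

# Hypothetical two-row table; the real CSV columns and values will differ.
toy_df = pd.DataFrame({"name": ["Rock-A", "Rock-B"], "diameter_km": [0.3, 0.5]})

# Cast every cell to str, join the cells of each row with spaces,
# then join all rows into one long string.
row_strings = toy_df.astype(str).apply(" ".join, axis=1)
toy_corpus = row_strings.str.cat(sep=" ")
print(toy_corpus)  # Rock-A 0.3 Rock-B 0.5
```

Note that only the cell values end up in the corpus. The column headers are dropped, so the model never sees which field a number belongs to.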

#  Character-Level Tokenization  

We will:  
1. Get all unique characters from the text corpus.  
2. Assign each character an **integer ID**.  
3. Convert the text corpus into a list of integers (tokens).  


In [None]:
chars = sorted(list(set(text_corpus)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)} 
itos = {i: ch for i, ch in enumerate(chars)}  

def encode(text):
    return [stoi[ch] for ch in text]

def decode(tokens):
    return "".join([itos[i] for i in tokens])

tokens = encode(text_corpus)
print("Vocabulary size:", vocab_size)
print("Sample tokens:", tokens[:50])
print("Decoded back:", decode(tokens[:50]))

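The three tokenization steps can be traced by hand on a very short string. This sketch uses its own `toy_` names so it does not overwrite `stoi` and `itos` from the cell above.

```python
toy_text = "hello"
toy_chars = sorted(set(toy_text))                     # ['e', 'h', 'l', 'o']
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}  # character -> integer id
toy_itos = {i: ch for ch, i in toy_stoi.items()}      # integer id -> character

toy_tokens = [toy_stoi[ch] for ch in toy_text]
toy_decoded = "".join(toy_itos[i] for i in toy_tokens)
print(toy_tokens)   # [1, 0, 2, 2, 3]
print(toy_decoded)  # hello
```

Because the vocabulary is built only from characters in the corpus, looking up a character that never appeared (for example `toy_stoi["z"]` here) raises a `KeyError`.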

#  Create Dataset and DataLoader

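* We will split the token sequence into fixed-length, non-overlapping context windows of `block_size` tokens.
* Each training sample starts at `start = idx * block_size`: input = tokens[start : start+block_size], target = tokens[start+1 : start+1+block_size], so the target is the input shifted one token ahead.
* We'll use a PyTorch Dataset + DataLoader for batching.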


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
block_size = 64  
batch_size = 32
tokens_tensor = torch.tensor(tokens, dtype=torch.long)

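class CharDataset(Dataset):
    """Non-overlapping (input, target) windows over a 1-D tensor of token ids."""
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size
    def __len__(self):
        # Reserve one extra token so the last window still has a full shifted target
        return max(0, (len(self.data) - 1) // self.block_size)
    def __getitem__(self, idx):
        start = idx * self.block_size
        x = self.data[start : start + self.block_size]
        y = self.data[start + 1 : start + 1 + self.block_size]  # x shifted one token ahead
        # Safety net only: with the __len__ above every window is already full.
        # Note that the padding id 0 is also a real character id.
        if x.size(0) < self.block_size:
            pad_len = self.block_size - x.size(0)
            x = torch.cat([x, torch.zeros(pad_len, dtype=torch.long)])
            y = torch.cat([y, torch.zeros(pad_len, dtype=torch.long)])
        return x, y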

dataset = CharDataset(tokens_tensor, block_size)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

print("Dataset samples:", len(dataset))
batch_x, batch_y = next(iter(dataloader))
print("Batch shapes:", batch_x.shape, batch_y.shape)  

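The windowing arithmetic is easier to follow on a toy sequence. The sketch below repeats the index calculation from `CharDataset` on ten tokens with a block size of 4; `toy_data` and `toy_block` are illustrative names, not part of the notebook.

```python
import torch

toy_data = torch.arange(10)  # stand-in for tokens_tensor: 0, 1, ..., 9
toy_block = 4

# Same arithmetic as CharDataset: non-overlapping windows, target shifted by one.
n_samples = (len(toy_data) - 1) // toy_block
pairs = []
for idx in range(n_samples):
    start = idx * toy_block
    x = toy_data[start : start + toy_block]
    y = toy_data[start + 1 : start + 1 + toy_block]
    pairs.append((x.tolist(), y.tolist()))
    print(x.tolist(), "->", y.tolist())
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Tokens left over after the last full window (here, token 9) are not used, which is why the number of samples is slightly smaller than `len(tokens) / block_size`.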

#  Tiny Transformer LM

* Architecture components:
  * token embedding + positional embeddings
  * small `TransformerEncoder` stack with causal masking
  * linear head to predict next token logits
* Keep sizes small for quick training on Kaggle.


In [None]:
import math
import torch.nn as nn
import torch.nn.functional as F
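d_model = 128          # embedding / hidden size
nhead = 4              # attention heads (d_model must be divisible by nhead)
n_layers = 2           # number of encoder layers
dim_feedforward = 256  # hidden size of each feed-forward block
dropout = 0.1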

def generate_causal_mask(sz, device):
    mask = torch.triu(torch.ones(sz, sz, device=device) * float('-inf'), diagonal=1)
    return mask  

class TinyTransformerLM(nn.Module):
    def __init__(self, vocab_size, block_size, d_model=128, nhead=4, n_layers=2, dim_feedforward=256, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward, dropout=dropout, activation='relu')
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        assert T <= self.block_size, "T must be <= block_size"
        positions = torch.arange(T, device=idx.device).unsqueeze(0).expand(B, T)  # (B, T)
        x = self.tok_emb(idx) + self.pos_emb(positions)  # (B, T, d_model)

        x = x.transpose(0, 1)  # (T, B, d_model): the encoder is not batch_first here
        src_mask = generate_causal_mask(T, idx.device)  # (T, T), -inf above the diagonal
        x = self.transformer(x, mask=src_mask)  # (T, B, d_model)
        x = x.transpose(0, 1)  # back to (B, T, d_model)
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens=100, temperature=1.0):
        for _ in range(max_new_tokens):
            t = idx.shape[1]
            start = max(0, t - self.block_size)
            cur = idx[:, start:]  # crop the context to the last block_size tokens
            logits = self.forward(cur)  # (B, T, vocab_size)
            logits = logits[:, -1, :] / temperature  # last position only, scaled by temperature
            probs = F.softmax(logits, dim=-1)  # (B, vocab_size)
            next_token = torch.multinomial(probs, num_samples=1)  # sample one token id, (B, 1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

# instantiate
model = TinyTransformerLM(vocab_size=vocab_size, block_size=block_size, d_model=d_model, nhead=nhead, n_layers=n_layers, dim_feedforward=dim_feedforward, dropout=dropout).to(device)
print("Model parameters:", sum(p.numel() for p in model.parameters()))

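The causal mask is what stops a position from attending to later tokens. As a rough illustration, the sketch below builds the same kind of mask for a length-4 sequence and applies it to all-zero attention scores. Real attention scores come from query/key products, so the uniform weights here are only a simplification.

```python
import torch

sz = 4
# 0 on and below the diagonal, -inf strictly above it
mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
print(mask)

# Adding the mask before softmax gives future positions zero weight.
scores = torch.zeros(sz, sz) + mask
weights = torch.softmax(scores, dim=-1)
print(weights)
# Row 0 attends only to position 0; row 3 spreads weight over positions 0..3.
```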

#  Training Loop

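* Use CrossEntropyLoss, which takes raw logits and applies log-softmax followed by negative log-likelihood internally.
* The cell below trains for `epochs = 20`; lower it (e.g., `epochs = 5`) for a quick demonstration.
* Update the running average loss on the progress bar every 10 steps and print the average loss after each epoch.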


In [None]:
import torch.optim as optim
from tqdm import tqdm

epochs = 20
lr = 3e-4
optimizer = optim.AdamW(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    pbar = tqdm(enumerate(dataloader), total=len(dataloader), desc=f"Epoch {epoch}")
    for i, (x, y) in pbar:
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits.view(-1, vocab_size), y.view(-1))
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (i + 1) % 10 == 0:
            pbar.set_postfix(loss=running_loss / (i + 1))
    avg_loss = running_loss / len(dataloader)
    print(f"Epoch {epoch} finished — avg loss: {avg_loss:.4f}")

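The loss line in the training loop flattens the batch and time dimensions before calling the criterion. The sketch below, using small made-up shapes and random values, checks that `nn.CrossEntropyLoss` on flattened logits matches a manual log-softmax plus mean negative log-likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5  # toy batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# CrossEntropyLoss expects (N, V) logits and (N,) class indices, so flatten B and T.
ce = nn.CrossEntropyLoss()(logits.reshape(-1, V), targets.reshape(-1))

# Equivalent manual computation: log-softmax, pick the target entry, average the negatives.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(ce.item(), manual.item())  # the two values agree up to floating-point error
```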

In [None]:
torch.save(model.state_dict(), "model_weights_only_text.pth")


In [None]:
torch.save(model, "full_model.pth")

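A minimal sketch of how weights saved with `state_dict()` can be restored later. It uses a small `nn.Linear` as a stand-in for the trained model and an in-memory buffer instead of a `.pth` file, so both are assumptions made for illustration.

```python
import io
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)  # stand-in for the trained model
buffer = io.BytesIO()        # in-memory file instead of a .pth on disk
torch.save(toy_model.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(4, 2)   # must be rebuilt with the same architecture
restored.load_state_dict(torch.load(buffer))
restored.eval()              # disable dropout and similar layers before inference
```

For the notebook's model, this would mean constructing `TinyTransformerLM` with the same hyperparameters and then loading `model_weights_only_text.pth`. Loading the fully pickled `full_model.pth` additionally requires the class definition to be importable and, depending on the PyTorch version, may need `weights_only=False` in `torch.load`.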

#  Generate text from the trained model

* Provide a short prompt (string), encode it, and let the model autoregressively generate new tokens.
* We'll decode tokens back to characters for readability.

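The generation code divides the logits by a `temperature` before sampling. As a rough sketch with made-up logits, the example below shows how lower temperatures sharpen the distribution and higher temperatures flatten it.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # toy next-token logits

probs_by_temp = {}
for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    probs_by_temp[temperature] = probs
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Sampling works the same way at any temperature: draw one index from probs.
next_token = torch.multinomial(probs_by_temp[1.0], num_samples=1)
```

Values below 1.0 make the output more repetitive but more coherent, while values above 1.0 add variety at the cost of more random-looking text.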

In [None]:
model.eval()

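def generate_from_prompt(prompt, max_new_tokens=200, temperature=1.0):
    # Keep only characters seen during training; encode() raises KeyError on unknown ones
    known = [ch for ch in prompt if ch in stoi]
    if not known:
        return "(prompt contains no characters from the training vocabulary)"
    prompt_tokens = torch.tensor([encode(known)], dtype=torch.long).to(device)
    out = model.generate(prompt_tokens, max_new_tokens=max_new_tokens, temperature=temperature)
    out_list = out[0].tolist()
    return decode(out_list)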


print(
    """ Welcome, Traveler of the Void.  
I am ORION, keeper of celestial pathways.  
Through me, the nebulae of knowledge shall unfold,  
and the horizons of the unknown shall awaken.  

Declare your course… or fade beyond with 'exit'.  
""")

while True:
    prompt = input("Herald of the heavens, utter your will — the universe awaits. ")
    if prompt.strip().lower() in ["exit", "quit", "stop"]:
        print("ORION signing off. May your course be steady and your stars eternal.")
        break
    
    generated = generate_from_prompt(prompt, max_new_tokens=300, temperature=0.9)
    print("\n"+ generated)

