<p align="center">
<img src="http://wandb.me/logo-im-png" width="800" alt="Weights & Biases" />
</p>

# 🪐 Operation REBOOT: Mission Start

Welcome, **Neural Architect**. The ship's AI core is down. Your job: fine-tune a foundational model on astronomy Q&A data to restore its deep-space reasoning abilities.

**Your mission:**
- Adjust dataset splits
- Configure training arguments
- Launch training and monitor with **Weights & Biases (W&B)**
- Test and evaluate your fine-tuned model

All systems go. Let's bring this vessel back online.

#### Install and Import  

In [None]:
!pip install transformers datasets accelerate wandb trl bitsandbytes peft torchvision -q

In [2]:
import wandb

import json
import math
import random
import time
from pathlib import Path


import torch
import torchvision
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer

In [3]:
from utilities.helpers import *

## 🔌 Connect Neural Telemetry (W&B Setup)

In [None]:
# When prompted, authorize your W&B account:
# wandb: You can find your API key in your browser here: https://wandb.ai/authorize
# wandb: Paste an API key from your profile and hit enter:

wandb.login()

In [5]:
WANDB_PROJECT_NAME = "FC-2025-FT-Workshop"
WANDB_ENTITY = None #Set your W&B Entity
WORKSHOP_TEAM_NAME = None #Set to your Workshop Team Name

## 🧪 Dataset Control Room
Adjust the dataset splits and prepare the astronomy Q&A dataset for training.

## 🌎 Initialize Experiment, Read Data, Split Data ☄️

In this section, we:

* Retrieve the Astros Dataset from [W&B Registry FC_FT_Workshop_Dataset collection](https://wandb.ai/orgs/FullyConnected-2025-Workshops/registry/dataset?selectionPath=fullyconnected-2025-workshops%2Fwandb-registry-dataset%2FFC_FT_Workshop_Dataset&view=versions) TODO: Update for final version
* Load the Astros Dataset containing universe-related Q&A data.
* Create prompts from the question/answer pairs & load into a pandas dataframe
* Convert the pandas DataFrame into a Hugging Face Dataset.

✅ All the heavy lifting is done here automatically — no manual setup needed
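For the curious, here is a minimal sketch of what "creating prompts from the question/answer pairs" might look like. The `question`, `answer`, and `text` field names match the columns used later in this notebook, but `PROMPT_TEMPLATE` and `build_prompts` are hypothetical; the workshop helper may format prompts differently.

```python
import json

# Hypothetical template; the workshop's helper may use a different format.
PROMPT_TEMPLATE = "Question: {question}\nAnswer: {answer}"

def build_prompts(jsonl_text: str) -> list[dict]:
    """Parse JSONL records and add a 'text' field holding the training prompt."""
    rows = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        record["text"] = PROMPT_TEMPLATE.format(
            question=record["question"].strip(),
            answer=record["answer"].strip(),
        )
        rows.append(record)
    return rows

sample = '{"question": "What is a light-year?", "answer": "The distance light travels in one year."}\n'
rows = build_prompts(sample)
print(rows[0]["text"])
```

A list of dictionaries like `rows` can be handed to `pandas.DataFrame(rows)` and then to `Dataset.from_pandas(...)`, which is the pandas-to-Hugging-Face conversion described in the list above.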

#### Let's prepare our training dataset

In [None]:
# Step 1: Initialize W&B run and download dataset
print("Step 1: Downloading dataset from Weights & Biases...")
run = wandb.init(entity=WANDB_ENTITY,
                 project=WANDB_PROJECT_NAME,
                 job_type="data_retrieval",
                 name="fetch_astros_dataset"
                 )

# Download the dataset artifact
artifact = run.use_artifact('wandb-registry-dataset/FC_FT_Workshop_Dataset:v4', type='dataset')
dataset_dir = artifact.download()
run.finish()
print("✅ Dataset downloaded successfully!")

# Step 2: Load and prepare datasets
df_train, training_dataset = load_and_prepare_dataset(dataset_dir, "astro_dataset_train.jsonl", "training")  # look at the helper functions if you're interested in how we prepare the data

# Print dataset statistics
print("\nDataset Statistics:")
print(f"Training examples: {len(df_train)}")
print("\nExample prompt format:")
print(df_train['text'].iloc[0])

### 🌌 Dataset Loaded Successfully!

At this point, we've:
* Retrieved the Astros Dataset artifact
* Loaded it into a pandas DataFrame
* Created prompt-style text for fine-tuning
* Converted it into a Hugging Face Dataset for training

✨ Feel free to pause and explore the data before moving forward!

Exploring the dataset can help you:

* Understand the kinds of questions and answers the model will learn from
* Check for any strange patterns, formatting issues, or interesting insights
* Discover Easter Eggs

🛡️ We've added soft error handling while loading, so if you accidentally modify the dataset file, you'll be warned if any loading issues happen.

👉 Quick Tip: You don't need to modify the dataset to proceed, but if you want to explore, you can run things like:

```
print(df_train.sample(5))
print(df_train['question'].apply(len).describe())
print(df_train['answer'].apply(len).describe())
```

### 📘 What Just Happened, Space Cadet?
You may have noticed these lines in the code:

```
run = wandb.init(...)
artifact = run.use_artifact(...)
```

Let's break them down:

```
✅ wandb.init(...)
```

This starts a new W&B run — think of a run as a snapshot of a specific script execution. Every time you initialize a run, you're telling W&B:

“Hey, I'm about to do something (like fetch data, train a model, evaluate results), track it for me.”

It captures things like:

* Your script's outputs
* Hyperparameters and settings
* Metrics over time
* System usage
* Artifacts you use or create

In this step, we named the run `fetch_astros_dataset` and gave it a `job_type` of `data_retrieval` to describe what kind of task it's doing.

```
📦 run.use_artifact(...)
```

Artifacts are W&B's way of tracking versioned data and models. When you call:

```
artifact = run.use_artifact('wandb-registry-dataset/FC_FT_Workshop_Dataset:v4', type='dataset')
```

You're saying:

“I want to use a specific version (v4) of a dataset that lives in the W&B artifact registry.”

This:

* Downloads the dataset to your machine
* Logs to W&B that this run used that version of the dataset
* Enables full traceability of which data was used during which run

Together, these steps help ensure your experiments are reproducible, shareable, and traceable: key ingredients for good fine-tuning practice.

Want to learn more about Weights & Biases Experiments? Check out the following resources:
* [W&B Experiment Tracking](https://docs.wandb.ai/guides/track/)
* [W&B Artifacts](https://docs.wandb.ai/guides/artifacts/)
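To make the artifact reference string less opaque, here is a small sketch that splits it into its parts. `parse_artifact_ref` and the dictionary keys it returns are our own illustrative names, not part of the `wandb` API.

```python
def parse_artifact_ref(ref: str) -> dict:
    """Split 'path/to/collection:alias' into path prefix, collection name, and alias."""
    path, _, alias = ref.partition(":")
    *prefix, collection = path.split("/")
    return {
        "prefix": "/".join(prefix),   # e.g. the registry project
        "collection": collection,     # the named dataset or model collection
        "alias": alias or None,       # a version such as 'v4', if one was given
    }

parts = parse_artifact_ref("wandb-registry-dataset/FC_FT_Workshop_Dataset:v4")
print(parts)
```

Pinning an explicit version such as `v4` is what lets a later run fetch exactly the same data.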

When you're ready, move on to loading the model and tokenizing the dataset!

## 🧠 Model Vault: Load & Configure the Neural Core

## 🚀 Load Pretrained Model and Prepare Dataset for Fine-Tuning 🌠
In this section, we:

* Retrieve the Model to Finetune from [W&B Registry FC_FT_Workshop_Model](https://wandb.ai/orgs/FullyConnected-2025-Workshops/registry/model?selectionPath=fullyconnected-2025-workshops%2Fwandb-registry-model%2FFC_FT_Workshop_Model&view=versions) TODO: Update for final version
* Select and load a pretrained language model and its tokenizer from Hugging Face.
* Format the Astros prompts into tokenized input IDs the model can understand.
* Apply padding and truncation to keep sequence lengths manageable.
* Split the tokenized dataset into training and validation sets (90% train / 10% validation).

✅ All the setup for model loading, tokenization, and data splitting is handled for you — no manual steps required!
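If you're wondering what "padding and truncation" actually does to a sequence, the sketch below shows the idea on plain lists of token IDs. `pad_or_truncate` is a hypothetical stand-in for what a tokenizer does internally, and it pads on the right; real tokenizers may pad on the left depending on their configuration.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])        # truncation: drop anything past max_length
    n_pad = max_length - len(ids)             # how many pad tokens are still needed
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 0 marks padding the model should ignore
    return input_ids, attention_mask

print(pad_or_truncate([5, 6, 7], max_length=5, pad_id=0))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4, pad_id=0))
```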

### Select Model
You will be prompted to select one of the following models

*   Option 1: [falcon-rw-1b](https://huggingface.co/tiiuae/falcon-rw-1b)
*   Option 2: [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)


In [None]:
model, tokenizer, model_name = get_model_from_wandb(WANDB_ENTITY, WANDB_PROJECT_NAME)

tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

✅ Model and tokenizer successfully loaded from the artifact!

The cell above also reuses the end-of-sequence token as the padding token so the model handles padding correctly.
Next, we prepare our dataset for training by tokenizing the input prompts.


## 🔄 Tokenize & Split: Format Data for Finetuning

#### Load the datasets

You can change how the training data is passed to the training script. Analyze the data first so you can select an appropriate **sample size** and **train/test split** for the fine-tuning process.

TO-DO: Update the values below for your first run. You can come back and try it with new values again.

In [None]:
training_sample = training_dataset.shuffle(seed=42).select(range(100))  # choose between 100 and 1600 samples

train_test_split = 0.1 # choose a float value between 0 and 1

train_dataset, eval_dataset = tokenized_train_test(training_sample, train_test_split, tokenizer)  # pass the sample selected above, not the full dataset
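Before picking values, it can help to see how many examples each choice leaves for training and evaluation. The sketch below is simple arithmetic; `split_sizes` is a hypothetical helper, and the actual split performed by the workshop code may round differently by one example.

```python
def split_sizes(n_samples: int, test_fraction: float) -> tuple[int, int]:
    """Approximate (train, eval) example counts for a given sample size and split."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_eval = round(n_samples * test_fraction)
    n_train = n_samples - n_eval
    return n_train, n_eval

for n in (100, 800, 1600):
    print(n, split_sizes(n, 0.1))
```

With only 100 samples and a 0.1 split, the evaluation set has about 10 examples, so expect the validation loss to be noisy.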

## ⚙️ Training Command Center
Set training arguments to guide your model's learning trajectory.

## 🛰️ Training Arguments (Where You Fine-Tune Settings) 🌙

This is where you'll do most of your experimentation! 🎯

The `TrainingArguments` object controls how your model is fine-tuned, including:

* Batch size
* Number of epochs
* Learning rate
* Warmup steps
* Mixed precision (fp16) for faster training
* Checkpoint saving
* Reporting to Weights & Biases

You can modify the hyperparameters here to see how different settings impact model performance.

🔥 **Pro Tips for Tuning:**

- If your model is training too slowly or crashing with memory errors, try lowering the `per_device_train_batch_size` or enabling `auto_find_batch_size`.

- Notice overfitting early? You might be training too long or with too high a learning rate. Think about adjusting `num_train_epochs` or using `eval_strategy="steps"` for finer control.

- Seeing noisy validation loss? `gradient_accumulation_steps` lets you "fake" a larger batch size without blowing up your GPU.

- Feel like your model isn't converging? Try experimenting with `lr_scheduler_type`. `cosine` is smooth, but `linear` or `constant` can behave differently.

- Wondering what your model is *actually learning*? Keep an eye on your W&B run dashboard — watch loss curves, learning rate progression, and system utilization.

- Training is slow but eval is fast? That's expected. Training includes backward pass + optimization, while eval is forward-only. Don't panic!

👀 Keep tweaking. Your best config probably won't be your first — and that's part of the fun!
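Several of these tips come down to simple arithmetic on batch size, gradient accumulation, and warmup. Here is a rough sketch; `training_schedule` is a hypothetical helper, the numbers mirror the defaults in this notebook's `TrainingArguments` cell, and the step counts the `Trainer` reports can differ slightly depending on library version and dataloader settings.

```python
import math

def training_schedule(n_train, per_device_batch, grad_accum, epochs, warmup_ratio, n_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = max(1, math.ceil(n_train / effective_batch))  # one optimizer step per effective batch
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# 1440 training examples (1600 samples with a 0.1 eval split)
print(training_schedule(1440, per_device_batch=32, grad_accum=4, epochs=3, warmup_ratio=0.1))
# 90 training examples (100 samples with a 0.1 eval split)
print(training_schedule(90, per_device_batch=32, grad_accum=4, epochs=3, warmup_ratio=0.1))
```

Notice that with a small sample the effective batch (128) is larger than the whole training set, so each epoch is roughly a single optimizer step. Lowering the batch size or accumulation steps gives the model more updates per epoch.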



In [33]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    run_name=f"fine-tuning-{model_name}-qlora",
    output_dir="./results",
    num_train_epochs=3,  # start with 3 and go up to 10 epochs for better results
    per_device_train_batch_size=32,
    per_device_eval_batch_size=4,
    dataloader_num_workers=8,
    gradient_accumulation_steps=4,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    do_train=True,
    do_eval=True,
    fp16=True,
    bf16=False,
    gradient_checkpointing=True,  # Recompute activations during the backward pass instead of keeping them all in GPU memory (saves memory, costs some speed)
    group_by_length=True,
    report_to=["wandb"],
    remove_unused_columns=True,
    dataloader_pin_memory=True,
    optim="adamw_torch", #See https://huggingface.co/docs/transformers/v4.51.3/en/perf_train_gpu_one#optimizers
    learning_rate=2e-3,
    lr_scheduler_type="cosine", # See https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules#transformers.SchedulerType
    auto_find_batch_size=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_strategy="steps",
    label_names=["labels"],
)
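To build intuition for how `warmup_ratio` and `lr_scheduler_type="cosine"` shape the learning rate, here is a minimal sketch of a linear warmup followed by a half-cosine decay. `cosine_lr` is our own illustrative function, and the step counts are assumed for the example; the scheduler used by the `Trainer` is the source of truth.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Assumed for illustration: 36 total steps, 4 warmup steps, base learning rate 2e-3
for step in (0, 2, 4, 20, 36):
    print(step, f"{cosine_lr(step, 36, 4, 2e-3):.6f}")
```

The rate climbs from zero to the peak during warmup, then glides back down. Watching the learning rate chart in your W&B run should show the same overall shape.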


## 🛰️ Engage Training Tracker
Launch the model and track training live with W&B.

## 🔭 Initialize Trainer, Train, and Save 🌎

In this section:

* We initialize the Trainer with:
  * The model
  * The datasets (train/test splits)
  * The training arguments
  * A data collator for language modeling

* We start training by calling `trainer.train()`.
* We save the fine-tuned model and tokenizer locally.
* We finish the W&B run to close the logging cleanly.
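The data collator for language modeling deserves a closer look. With `mlm=False`, it pads each batch and copies the input IDs into `labels`, replacing padding positions with `-100` so they are ignored by the loss. The sketch below imitates that on plain lists; `collate_causal_lm` is a hypothetical function that pads on the right, whereas the real collator follows the tokenizer's padding side.

```python
def collate_causal_lm(batch, pad_id, pad_to_multiple_of=8, ignore_index=-100):
    """Pad a batch of token-ID lists and build labels that ignore padding."""
    longest = max(len(ids) for ids in batch)
    # Round the target length up to the next multiple of pad_to_multiple_of
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = target - len(ids)
        padded = list(ids) + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Every position equal to pad_id is masked out of the loss
        labels.append([ignore_index if tok == pad_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate_causal_lm([[5, 6, 7], [8, 9, 2]], pad_id=2)
print(batch["labels"])
```

One side effect worth knowing: because this notebook sets the pad token to the EOS token, genuine EOS tokens are masked too (the `2` in the second sequence above), so the model gets no direct training signal for when to stop.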

🧠 Reminder: After training finishes, your fine-tuned model will be available in your local runtime — you can upload it back to W&B or Hugging Face later!

🚨 Training Ahead: Be ready for 10-15 min runtimes with the default configs!


## ⚙ Now we kick off the training process ⚙

In [None]:
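# Explicit imports in case the helper module does not already provide these names
from transformers import DataCollatorForLanguageModeling, EarlyStoppingCallback

# Configure model for training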
model.config.use_cache = False  # Disable cache during training

# Set label names for PEFT model
model.config.label_names = ["labels"]

# Initialize trainer with modified configuration
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
        pad_to_multiple_of=8  # Add padding to multiple of 8 for better performance
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    # compute_metrics=compute_perpexity,  # To monitor additional metrics, pass your own compute function here
)

# Enable gradient checkpointing with the new format
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
if hasattr(model, "gradient_checkpointing_enable"):
    model.gradient_checkpointing_enable()

def ignite_training():
    from accelerate.state import AcceleratorState
    from accelerate import Accelerator

    AcceleratorState._reset_state()
    accelerator = Accelerator()

    # Launch training
    launch_sequence(training_args)
    time.sleep(5)
    print("🚀 Starting training process...")

    run = wandb.init(
        entity=WANDB_ENTITY,
        project=WANDB_PROJECT_NAME,
        job_type="finetuning_job",
        name=f"fine_tune_{model_name}"
    )

    # Recreate Trainer only now
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False,
            pad_to_multiple_of=8
        ),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    )

    train_output = trainer.train()

    wandb.finish()
    print("✅ Training complete!")
    return train_output

training_output = ignite_training()
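The `EarlyStoppingCallback(early_stopping_patience=3)` used above stops training once the tracked metric (here `eval_loss`) fails to improve for three evaluations in a row. Here is a simplified sketch of that rule; `early_stop_index` is our own function, not the callback's implementation.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best - threshold:
            best = loss       # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

print(early_stop_index([1.0, 0.9, 0.95, 0.96, 0.97]))  # stops at the fifth evaluation (index 4)
print(early_stop_index([1.0, 1.1, 1.2]))               # too few evaluations to trigger
```

With `eval_strategy="epoch"` and only 3 epochs there are just three evaluations, so a patience of 3 cannot trigger under this rule. It starts to matter once you raise `num_train_epochs` or evaluate more often.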

## 💾 Save & Upload
Preserve your fine-tuned model as a W&B artifact.

We will now save this model to W&B. You will need the artifact reference for this model to get points for completing the fine-tuning section.

Tracking your model in W&B can be really helpful:

- You can now share this model with your team and beyond
- W&B creates a lineage map of your model so you can see the full model lifecycle: dataset → training → final state

Tracking an artifact's lineage has several key benefits:

* Reproducibility: By tracking the lineage of all artifacts, teams can reproduce experiments, models, and results, which is essential for debugging, experimentation, and validating machine learning models.

* Version Control: Artifact lineage involves versioning artifacts and tracking their changes over time. This allows teams to roll back to previous versions of data or models if needed.

* Auditing: Having a detailed history of the artifacts and their transformations enables organizations to comply with regulatory and governance requirements.

* Collaboration and Knowledge Sharing: Artifact lineage facilitates better collaboration among team members by providing a clear record of attempts as well as what worked, and what didn’t. This helps in avoiding duplication of efforts and accelerates the development process.

In [None]:
#Saving and uploading best model - working draft
trainer.save_model(f"./best_model/{type(model.base_model.model).__name__}")
tokenizer.save_pretrained(f"./best_model/{type(model.base_model.model).__name__}")

run = wandb.init(project=WANDB_PROJECT_NAME,
                 entity=WANDB_ENTITY,
                 job_type="FT-Workshop-Finetuning-Best-Model-Upload",
                 name="FT-Best-Model-Upload")

artifact = wandb.Artifact(
    name=f"{WORKSHOP_TEAM_NAME}-ft-best-model-{type(model.base_model.model).__name__}",
    type="model",
    description="""Best FineTuned model from the Astros-FT-Workshop."""
)

artifact.add_dir(local_path="./best_model")

logged_artifact = run.log_artifact(artifact)

run.link_artifact(
  artifact=logged_artifact,
  target_path="wandb-registry-model/FC_FT_Workshop_FineTuned_Models"
)

run.finish()

## ✅ Mission Checkpoint: Model Finetuned

Congratulations, Architect! You've:
- Loaded and prepped your training dataset ✅
- Configured a foundational model ✅
- Finetuned it with parameter-efficient methods ✅
- Logged your training runs and saved the final model to Weights & Biases ✅

Your model is now part of your mission's neural infrastructure.

Next, we prepare to test and evaluate. But first, a quick system check...

## 🧰 Systems Maintenance Bay: Utilities

Before testing, it's wise to flush memory and check your hardware status. Use these utilities to prepare the environment.

Just like a good engineer, make sure the ship's neural bays are cleared and ready.

## Utilities 🧰

In [None]:
# -- Flush out GPU memory - when required - may require restarting the notebook
import gc, torch

try:
    del trainer
except NameError:
    print("trainer is not defined; nothing to release")
try:
    del model
except NameError:
    print("model is not defined; nothing to release")
try:
    del tokenizer
except NameError:
    print("tokenizer is not defined; nothing to release")

gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

wandb.finish()

In [None]:
!nvidia-smi

## 🧪 Testing the Neural Core

Now that your model is trained and uploaded, it’s time to test your ship’s new neural core.

You’ll load the fine-tuned model and run test prompts to ensure it responds with precision and depth—critical for deep-space operations.

We’ve equipped you with a call function wrapped in `Weave`, our GenAI interface and telemetry layer.

# 🔧 Testing our model 🪛

Let's start by creating some helper functions to load and call the model we just trained.

Since we created an adapter during the finetuning process, our load model function loads the original model along with our adapter using PEFT.

## 🛰️ Introducing Weave: Your AI Telemetry and Evaluation Suite

**Weave** is Weights & Biases’ next-gen platform for tracking, evaluating, and visualizing GenAI applications.

In REBOOT, you'll use Weave to:
- Log and score model generations
- Run structured evaluations on Q&A performance
- Compare outputs with reference answers

This enables you to **quantitatively assess** how mission-ready your model is.

Let’s initialize Weave and plug it into your finetuned system.

In [None]:
!pip install weave "weave[scorers]" -q

In [None]:
import weave
weave.init(f"{WANDB_ENTITY}/{WANDB_PROJECT_NAME}")

#### Helper functions to load our local model

In [None]:
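# Helper: load the base model in 4-bit and attach the fine-tuned adapter
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
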
def load_finetuned_model(adapter_dir, base_model_dir):

    tokenizer = AutoTokenizer.from_pretrained(adapter_dir, use_fast=True)

    # Load base model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    # Load model with quantization
    print("Loading model with 4-bit quantization...")
    base_model = AutoModelForCausalLM.from_pretrained(
        Path(base_model_dir),
        quantization_config=bnb_config,
        device_map="auto",
        )
    # Load fine-tuned adapter
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    return tokenizer, model

### Calling our Local Finetuned Model

In [None]:
@weave.op()
def call_model(question: str) -> str:
    """Generate an answer from your Local LLM given a prompt."""

    system_prompt = "You are an expert in astrophysics. Please provide a concise and truthful answer to the following question:"
    prompt = system_prompt + "\n\n" + question + "\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=150, do_sample=False, eos_token_id=tokenizer.eos_token_id, pad_token_id=model.config.eos_token_id)
    # Return a plain string (a trailing comma here would turn the result into a one-element tuple)
    return tokenizer.decode(output[0], skip_special_tokens=True).replace(prompt, '').strip()
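`call_model` removes the prompt from the decoded text with a string replace. That works when the decoded prompt matches the original text exactly. An alternative is to drop the prompt tokens before decoding, since a decoder-only model's `generate` output starts with the prompt tokens. The sketch below shows the idea with a toy vocabulary; `extract_new_tokens` and `vocab` are hypothetical.

```python
def extract_new_tokens(output_ids, prompt_ids):
    """Return only the tokens generated after the prompt."""
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt tokens")
    return output_ids[len(prompt_ids):]

# Toy vocabulary standing in for a real tokenizer
vocab = {1: "What", 2: "is", 3: "a", 4: "star", 5: "?", 6: "A", 7: "ball", 8: "of", 9: "plasma"}
prompt_ids = [1, 2, 3, 4, 5]
output_ids = prompt_ids + [6, 7, 8, 9]

answer = " ".join(vocab[t] for t in extract_new_tokens(output_ids, prompt_ids))
print(answer)
```

With real tensors, the equivalent would be slicing `output[0]` from the prompt length onward before calling `tokenizer.decode`.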

## 📊 Final Check: Evaluation Protocols

Your neural core is active—but is it mission-grade?

Use this section to:
- Load an evaluation dataset
- Score model responses using embedding similarity
- Track performance with W&B + Weave

**Evaluation is critical** before deployment—it ensures your model’s reasoning is aligned with mission parameters.

# Evaluating our Model

## Load model

In [None]:
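base_model_dir = "./models/TinyLlama_v1"  # Path to base model - modify to match your model: fine_tuned_model/<TinyLlama_v1 or falcon-rw-1b_v0>
# Path to the adapter saved in the Save & Upload step. If you ran the memory-flush cell,
# `model` no longer exists: replace this expression with the literal folder name under ./best_model/.
adapter_dir = f"./best_model/{type(model.base_model.model).__name__}"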

tokenizer, model = load_finetuned_model(adapter_dir, base_model_dir)

## Get Eval Dataset

In [None]:
weave.init('fc25-wandb-admins/uncategorized')
eval_dataset_public_v0 = weave.ref('eval_dataset_public:v0').get()

## Test the model with a sample from our eval dataset

In [None]:
question = eval_dataset_public_v0[20]['question']
answer = call_model(question)

print("🛰️  Incoming Transmission — Mission Q&A\n")
print(f"🧠 Question:\n{question}\n")
print(f"🤖 Model Response:\n{answer}")

## Setup eval

In [None]:
import asyncio
from weave.scorers import EmbeddingSimilarityScorer
similarity_scorer = EmbeddingSimilarityScorer(
    model_id="openai/text-embedding-3-small",  # will need to update this to bedrock's titan models
    threshold=0.7
)

similarity_scorer.column_map = {
    "output": "model_output",  # Your model's response
    "target": "answer", # The reference response (expected)
    "kwargs": "question"
}
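Under the hood, an embedding similarity score is a cosine similarity between two vectors, compared against a threshold (0.7 above). The sketch below shows that calculation on tiny hand-made vectors; `cosine_similarity`, `similarity_verdict`, and the returned key names are illustrative, and real embeddings come from the embedding model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def similarity_verdict(output_vec, target_vec, threshold=0.7):
    """Score a model output embedding against the reference answer embedding."""
    score = cosine_similarity(output_vec, target_vec)
    return {"similarity_score": score, "is_similar": score >= threshold}

print(similarity_verdict([1.0, 1.0], [1.0, 0.0]))  # vectors 45 degrees apart
print(similarity_verdict([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

A higher threshold makes the evaluation stricter: answers must be closer in meaning to the reference to count as similar.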

In [None]:
eval_dataset = eval_dataset_public_v0.rows[20:25]  # select samples to run the evaluation against

In [None]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')# will remove this once we move to bedrock

evaluation = weave.Evaluation(
    evaluation_name = f"{WORKSHOP_TEAM_NAME}_qna_eval",
    dataset=eval_dataset, scorers=[similarity_scorer],
    name="model_qna_eval"
   )

# Top-level await works in notebook cells
print(await evaluation.evaluate(call_model))