<p align="center">
<img src="http://wandb.me/logo-im-png" width="800" alt="Weights & Biases" />
</p>

# 🪐 Operation REBOOT: Mission Start

Welcome, **Neural Architect**. The ship's AI core is down. Your job: fine-tune a foundational model on astrophysics Q&A data to restore its deep-space reasoning abilities.

**Your mission:**
- Adjust dataset splits
- Configure training arguments
- Launch training and monitor with **Weights & Biases (W&B)**
- Test and evaluate your fine-tuned model

All systems go. Let's bring this vessel back online.

#### Install and Import  

In [1]:
!pip install transformers datasets accelerate wandb trl huggingface_hub bitsandbytes peft torchvision -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pathos 0.3.4 requires dill>=0.4.0, but you have dill 0.3.8 which is incompatible.
pathos 0.3.4 requires multiprocess>=0.70.18, but you have multiprocess 0.70.16 which is incompatible.

In [2]:
import wandb

import json
import math
import random
from pathlib import Path


import torch
import torchvision
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer
#from huggingface_hub import login

In [3]:
from utilities.helpers import *

## 🔌 Connect Neural Telemetry (W&B Setup)

In [4]:
# When prompted, authorize with your W&B API key:
# wandb: You can find your API key in your browser here: https://wandb.ai/authorize
# wandb: Paste an API key from your profile and hit enter:

#wandb.login(key=wandb_key)
wandb.login()

[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter:

  ········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/ec2-user/.netrc
[34m[1mwandb[0m: Currently logged in as: [33manushrav-vatsa[0m ([33mwandb[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [5]:
WANDB_PROJECT_NAME = "Astros-FT-Workshop"
WANDB_ENTITY = "FT-Testing" #Set your W&B Entity #TODO - Change to None for final release
WORKSHOP_TEAM_NAME = "Wandb_Crew" #Set to your Workshop Team Name #TODO - Change to None for final release

## 🧪 Dataset Control Room
Adjust the dataset splits and prepare the astrophysics Q&A dataset for training.

## 🌎 Initialize Experiment, Read Data, Split Data ☄️

In this section, we:

* Retrieve the Astros Dataset from [W&B Registry FC_FT_Workshop_Dataset collection](https://wandb.ai/orgs/FullyConnected-2025-Workshops/registry/dataset?selectionPath=fullyconnected-2025-workshops%2Fwandb-registry-dataset%2FFC_FT_Workshop_Dataset&view=versions) TODO: Update for final version
* Load the Astros Dataset containing universe-related Q&A data.
* Create prompts from the question/answer pairs & load into a pandas dataframe
* Convert the pandas DataFrame into a Hugging Face Dataset.

✅ All the heavy lifting is done here automatically — no manual setup needed
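
To make the prompt-creation step concrete, here is a minimal sketch of how a question/answer pair could be joined into the `Question: ... / Answer: ...` text shown in the example output further down. The function name `build_prompt` and the sample record are illustrative assumptions; the workshop's actual helper lives in `utilities.helpers` and may differ.

```python
def build_prompt(question: str, answer: str) -> str:
    """Join one Q&A pair into a single training string (assumed format)."""
    return f"Question: {question.strip()}\nAnswer: {answer.strip()}"


# Hypothetical record, shaped like a row with 'question' and 'answer' fields
records = [
    {
        "question": "What is a parsec?",
        "answer": "A unit of distance equal to about 3.26 light-years.",
    },
]

texts = [build_prompt(r["question"], r["answer"]) for r in records]
print(texts[0])
```

Keeping the format in one small function makes it easy to reuse the exact same template at inference time.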

#### Let's prepare our training dataset

In [17]:
# Step 1: Initialize W&B run and download dataset
print("Step 1: Downloading dataset from Weights & Biases...")
run = wandb.init(entity=WANDB_ENTITY,
                 project=WANDB_PROJECT_NAME,
                 job_type="data_retrieval",
                 name="fetch_astros_dataset"
                 )

# Download the dataset artifact
artifact = run.use_artifact('wandb-registry-dataset/FC_FT_Workshop_Dataset:v4', type='dataset')
dataset_dir = artifact.download()
run.finish()
print("✅ Dataset downloaded successfully!")

# Step 2: Load and prepare datasets
df_train, training_dataset = load_and_prepare_dataset(dataset_dir, "astro_dataset_train.jsonl", "training") # see the helper functions if you're interested in how we prepare the data

# Print dataset statistics
print("\nDataset Statistics:")
print(f"Training examples: {len(df_train)}")
print("\nExample prompt format:")
print(df_train['text'].iloc[0])

Step 1: Downloading dataset from Weights & Biases...


[34m[1mwandb[0m:   1 of 1 files downloaded.  


✅ Dataset downloaded successfully!

Loading training dataset...
✅ Successfully loaded dataset with no errors.
✅ Training dataset loaded with 1600 examples

Dataset Statistics:
Training examples: 1600

Example prompt format:
Question: What are 'Superluminous Supernovae' (SLSNe) and what distinguishes Type I SLSNe from normal Type Ia supernovae spectroscopically?
Answer: Superluminous Supernovae (SLSNe) are much more luminous than normal Type Ia supernovae. Spectroscopically, Type I SLSNe are characterized by the absence of hydrogen and strong helium lines near peak light (like normal SNe Ia), but they show strong, broad metal lines, often including oxygen, magnesium, and calcium. Normal SNe Ia are defined by the presence of strong silicon absorption lines (Si II λ6355) near peak light, which are often weak or absent in SLSNe I. The differences in spectra indicate different progenitor systems and explosion mechanisms: SNe Ia are thermonuclear disruptions of white dwarfs, while SLSNe I ar

### 🌌 Dataset Loaded Successfully!

At this point, we've:
* Retrieved the Astros Dataset artifact
* Loaded it into a pandas DataFrame
* Created prompt-style text for fine-tuning
* Converted it into a Hugging Face Dataset for training

✨ Feel free to pause and explore the data before moving forward!

Exploring the dataset can help you:

* Understand the kinds of questions and answers the model will learn from
* Check for any strange patterns, formatting issues, or interesting insights
* Discover Easter Eggs

🛡️ We've added soft error handling while loading, so if you accidentally modify the dataset file, you'll be warned if any loading issues happen.

👉 Quick Tip: You don't need to modify the dataset to proceed, but if you want to explore, you can run things like:

```
print(df_train.sample(5))
print(df_train['question'].apply(len).describe())
print(df_train['answer'].apply(len).describe())
```
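
If you want to inspect the raw JSONL file yourself, the sketch below shows one way soft error handling of this kind could work: bad lines are collected and reported instead of aborting the load. The function name `load_jsonl_soft` and the sample lines are assumptions for illustration, not the workshop helper's actual code.

```python
import json


def load_jsonl_soft(lines):
    """Parse JSONL lines, collecting problems instead of raising."""
    rows, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, str(exc)))
            continue
        # Each record is assumed to need both fields used for the prompt
        if not isinstance(row, dict) or "question" not in row or "answer" not in row:
            errors.append((line_no, "missing 'question' or 'answer' field"))
            continue
        rows.append(row)
    return rows, errors


sample = [
    '{"question": "Q1", "answer": "A1"}',
    '{"question": "Q2"',   # malformed JSON
    '{"question": "Q3"}',  # valid JSON, but no answer
]
rows, errors = load_jsonl_soft(sample)
print(f"loaded {len(rows)} rows, {len(errors)} problem lines")
```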

When you're ready, move on to loading the model and tokenizing the dataset!

## 🧠 Model Vault: Load & Configure the Neural Core

## 🚀 Load Pretrained Model and Prepare Dataset for Fine-Tuning 🌠
In this section, we:

* Retrieve the Model to Finetune from [W&B Registry FC_FT_Workshop_Model](https://wandb.ai/orgs/FullyConnected-2025-Workshops/registry/model?selectionPath=fullyconnected-2025-workshops%2Fwandb-registry-model%2FFC_FT_Workshop_Model&view=versions) TODO: Update for final version
* Select and load a pretrained language model and its tokenizer from Hugging Face.
* Format the Astros prompts into tokenized input IDs the model can understand.
* Apply padding and truncation to keep sequence lengths manageable.
* Split the tokenized dataset into training and validation sets (90% train / 10% validation).
* Define a compute_metrics function to track TODO during fine-tuning.

✅ All the setup for model loading, tokenization, and data splitting is handled for you — no manual steps required!
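
Padding and truncation are handled by the tokenizer, but the idea fits in a few lines. The sketch below is a simplified stand-in, assuming right-padding and a hypothetical `max_length` and `pad_id`; in this notebook the pad token is set to the EOS token.

```python
def pad_and_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncate long sequences
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # right-pad short sequences
    mask += [0] * padding                # 0 marks padding to be ignored
    return ids, mask


print(pad_and_truncate([5, 6, 7], max_length=5, pad_id=0))
print(pad_and_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=0))
```

A shorter `max_length` reduces memory use and step time, at the cost of cutting off long answers.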

### Select Model
You will be prompted to select one of the following models:

*   Option 1: [falcon-rw-1b](https://huggingface.co/tiiuae/falcon-rw-1b)
*   Option 2: [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)


In [19]:
model, tokenizer, model_name = get_model_from_wandb(WANDB_ENTITY, WANDB_PROJECT_NAME)

tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id


Available Models:
1. Falcon RW 1B
2. TinyLlama 1B



Select a model (1-2):  1



✅ Selected: falcon-rw-1b

⬇️ Downloading falcon-rw-1b (version v0) from Weights & Biases...


[34m[1mwandb[0m: Downloading large artifact FC_FT_Workshop_Models:v0, 2505.10MB. 12 files... 
[34m[1mwandb[0m:   12 of 12 files downloaded.  
Done. 0:0:2.2 (1120.0MB/s)


✅ Model saved to: models/falcon-rw-1b_v0
✅ Model downloaded successfully!

📦 Loading model from: models/falcon-rw-1b_v0

Step 1: Loading tokenizer...
✅ Tokenizer loaded successfully!

Step 2: Configuring QLoRA...
Loading model with 4-bit quantization...


You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


trainable params: 12,582,912 || all params: 1,324,208,128 || trainable%: 0.9502
✅ Model loaded with QLoRA successfully!


✅ Model and tokenizer successfully loaded from the artifact!

Next, we'll make a few adjustments to ensure the model handles padding correctly,
and then prepare our dataset for training by tokenizing the input prompts.
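
The QLoRA log line above reports how small the trainable adapter is relative to the full model. As a quick sanity check, the percentage can be reproduced from the two parameter counts in that line (the function name here is just for illustration):

```python
def trainable_percent(trainable: int, total: int) -> float:
    """Share of parameters that receive gradient updates, in percent."""
    return 100.0 * trainable / total


# Counts copied from the "trainable params" log line for falcon-rw-1b
pct = trainable_percent(12_582_912, 1_324_208_128)
print(f"trainable%: {pct:.4f}")  # trainable%: 0.9502
```

Training under 1% of the weights is what keeps fine-tuning feasible on a single small GPU.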


## 🔄 Tokenize & Split: Format Data for Finetuning

#### Load the datasets

You can modify how our training data is passed to our training script to finetune the model. Make sure to analyze the data so you can select an appropriate **Sample Size** and  **Train/Test split** for the finetuning process.
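
To see what a given split value means in example counts, here is a rough sketch of a shuffled split over indices. It is an assumption about the behaviour, not the code of the `tokenized_train_test` helper; the name `split_indices` is hypothetical.

```python
import random


def split_indices(n_examples, test_fraction, seed=42):
    """Shuffle indices and hold out a fraction for evaluation."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_eval = max(1, round(n_examples * test_fraction))
    return indices[n_eval:], indices[:n_eval]     # (train, eval)


train_idx, eval_idx = split_indices(1600, 0.1)
print(len(train_idx), len(eval_idx))  # 1440 160
```

With a 100-example sample and the same 0.1 split you would get 90 training and 10 evaluation examples, which makes the validation loss much noisier.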

TO-DO: Update the values below for your first run. You can come back and try it with new values again.

In [20]:
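training_sample = training_dataset.shuffle(seed=42).select(range(100)) # choose between 100 and 1600 samples

train_test_split = 0.1 # fraction held out for evaluation; choose a float strictly between 0 and 1

# NOTE: the full `training_dataset` is passed here, so `training_sample` above has no effect
# (the output below shows 1440 + 160 = 1600 examples). Pass `training_sample` instead to train on the subsample.
train_dataset, eval_dataset = tokenized_train_test(training_dataset, train_test_split, tokenizer)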

Map:   0%|          | 0/1440 [00:00<?, ? examples/s]

Map:   0%|          | 0/160 [00:00<?, ? examples/s]

✅ Tokenization applied to Training & Evaluation Datasets successfully!


## ⚙️ Training Command Center
Set training arguments to guide your model's learning trajectory.

## 🛰️ Training Arguments (Where You Fine-Tune Settings) 🌙

This is where you'll do most of your experimentation! 🎯

The `TrainingArguments` object controls how your model is fine-tuned, including:

* Batch size
* Number of epochs
* Learning rate
* Warmup steps
* Mixed precision (fp16) for faster training
* Checkpoint saving
* Reporting to Weights & Biases

You can modify the hyperparameters here to see how different settings impact model performance.
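
Batch size and gradient accumulation interact, so it helps to work out the effective batch size before changing either. The sketch below uses the defaults from the cell that follows (batch size 32, accumulation 4) and assumes a single GPU; the step count is an approximation, since the exact number the `Trainer` reports can vary with library version and dataloader settings.

```python
import math


def effective_batch_size(per_device, grad_accum, n_devices=1):
    """Examples contributing to each optimizer update."""
    return per_device * grad_accum * n_devices


def optimizer_steps_per_epoch(n_train, per_device, grad_accum, n_devices=1):
    """Approximate number of weight updates per pass over the data."""
    return math.ceil(n_train / effective_batch_size(per_device, grad_accum, n_devices))


print(effective_batch_size(32, 4))             # 128
print(optimizer_steps_per_epoch(1440, 32, 4))  # 12
```

If you hit out-of-memory errors, halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size unchanged.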

🔥 Pro Tip: TODO INSERT SOME TIPS


In [33]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    run_name=f"fine-tuning-{model_name}-qlora",
    output_dir="./results",
    num_train_epochs=3, # start with 3 and go up to 10 epochs for better results
    per_device_train_batch_size=32,
    per_device_eval_batch_size=4,
    dataloader_num_workers=8,
    gradient_accumulation_steps=4,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    do_train=True,
    do_eval=True,
    fp16=True,
    bf16=False,
    gradient_checkpointing=True, # Recompute activations during the backward pass instead of storing them all; saves GPU memory at the cost of extra compute
    group_by_length=True,
    report_to=["wandb"],
    remove_unused_columns=True,
    dataloader_pin_memory=True,
    optim="adamw_torch", #See https://huggingface.co/docs/transformers/v4.51.3/en/perf_train_gpu_one#optimizers
    learning_rate=2e-3,
    lr_scheduler_type="cosine", # See https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules#transformers.SchedulerType
    auto_find_batch_size=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    load_best_model_at_end=True,
    # metric_for_best_model="perplexity",
    metric_for_best_model="eval_loss",
    logging_strategy="steps",
    label_names=["labels"],
)
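
The `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` settings above shape the learning rate over the run: a linear ramp up to `learning_rate`, then a cosine decay towards zero. The sketch below approximates that curve in plain Python; the total of 100 steps is a made-up value for illustration, and the real scheduler is created inside the `Trainer`.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-3, warmup_ratio=0.1):
    """Linear warmup followed by a half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical 100-step run: ramp up for 10 steps, then decay
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step, 100):.6f}")
```

Plotting `train/learning_rate` in W&B should show the same shape, which is a quick way to confirm your arguments took effect.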


## 🛰️ Engage Training Tracker
Launch the model and track training live with W&B.

## 🔭 Initialize Trainer, Train, and Save 🌎

In this final section:

* We initialize the Trainer with:
  * The model
  * The datasets (train/test splits)
  * The training arguments
  * A data collator for language modeling
  * Our compute_metrics function to calculate TODO

* We start training by calling trainer.train().
* We save the fine-tuned model and tokenizer locally.
* We finish the W&B run to close the logging cleanly.

🧠 Reminder: After training finishes, your fine-tuned model will be available in your local runtime — you can upload it back to W&B or Hugging Face later!

🚨 Training Ahead: Be ready for 10-15 min runtimes with the default configs!
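
Longer runs may end early, because the trainer cell below registers `EarlyStoppingCallback(early_stopping_patience=3)`. The sketch below is a simplified version of the patience logic, assuming lower loss is better and ignoring any improvement threshold; the loss values are invented for illustration.

```python
def should_stop(eval_losses, patience=3):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss       # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1    # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Made-up validation losses, one per epoch
print(should_stop([2.30, 2.10, 2.12, 2.15, 2.11, 2.09]))  # 4
```

With evaluation once per epoch and only three epochs, a patience of 3 is unlikely to trigger; it matters once you raise `num_train_epochs`.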


Running into the following?

> AttributeError: `AcceleratorState` object has no attribute `distributed_type`

This happens if `AcceleratorState._reset_state()` was called and an `Accelerator` or `PartialState` was not reinitialized.

To fix it, run the following in a new cell before reinitializing your trainer:

```
from accelerate.state import AcceleratorState
AcceleratorState._reset_state()
```




In [34]:
# from accelerate.state import AcceleratorState
# AcceleratorState._reset_state()

In [35]:
from transformers import DataCollatorForLanguageModeling, EarlyStoppingCallback

# Configure model for training
model.config.use_cache = False  # Disable cache during training

# Set label names for PEFT model
model.config.label_names = ["labels"]

# Initialize trainer with modified configuration
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
        pad_to_multiple_of=8  # Add padding to multiple of 8 for better performance
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
    #compute_metrics=compute_perpexity, #If monitoring additional metric(s) can set this argument accordingly to your compute functions
)

# Enable gradient checkpointing with the new format
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
if hasattr(model, "gradient_checkpointing_enable"):
    model.gradient_checkpointing_enable()

#Working Idea - Launch training via animation - #TODO Figure out how to pipe the ignite button or just write simple message to run next cell to train model
def ignite_training():
    # This function will be called by the "Ignite" button
    print("🚀 Starting training process...")

    # Your existing training code
    run = wandb.init(entity=WANDB_ENTITY,
                    project=WANDB_PROJECT_NAME,
                    job_type="finetuning_job",
                    # name=f"fine_tune_{model.__class__.__name__}"
                     name = f"fine_tune_{model_name}"
                    )
    train_output = trainer.train()

    wandb.finish()

    print("✅ Training complete!")
    return train_output

#%run {animate_dir}/launch_sequence.py
#launch_sequence(training_args)



## ⚙ Now we kick off the training process ⚙

In [None]:
#Train
run = wandb.init(entity=WANDB_ENTITY,
                  project=WANDB_PROJECT_NAME,
                  job_type="finetuning_job",
                  # name=f"fine_tune_{model.__class__.__name__}"
                  name = f"fine_tune_{model_name}"
                  )
train_output = trainer.train()

wandb.finish()

# launch_sequence(training_args) #TODO Test this in Sagemaker - Otherwise Remove Ignite



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.6668 | 2.302916 |
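
The validation loss is a mean cross-entropy in nats, so it converts directly to perplexity, the metric hinted at in the commented-out `metric_for_best_model` line. A minimal sketch using the value from the first epoch:

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_cross_entropy_loss)


print(f"{perplexity(2.302916):.2f}")  # 10.00
```

A perplexity of about 10 means the model is, on average, as uncertain as if it were choosing uniformly among roughly ten tokens at each position.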




## 💾 Save & Upload
Preserve your fine-tuned model as a W&B artifact.

We will now save this model to W&B. You will need the artifact reference for this model to get points for completing the fine-tuning section.

Tracking your model in W&B can be really helpful:

- You can now share this model with your team and beyond
- W&B creates a lineage map of your model, so you can trace its full lifecycle: dataset → training → final model

In [None]:
#Saving and uploading best model - working draft
trainer.save_model(f"./best_model/{type(model.base_model.model).__name__}")
tokenizer.save_pretrained(f"./best_model/{type(model.base_model.model).__name__}")

run = wandb.init(project=WANDB_PROJECT_NAME,
                 entity=WANDB_ENTITY,
                 job_type="FT-Workshop-Finetuning-Best-Model-Upload",
                 name="FT-Best-Model-Upload")

artifact = wandb.Artifact(
    name=f"{WORKSHOP_TEAM_NAME}-ft-best-model-{type(model.base_model.model).__name__}",
    type="model",
    description="""Best FineTuned model from the Astros-FT-Workshop."""
)

artifact.add_dir(local_path="./best_model")

logged_artifact = run.log_artifact(artifact)

run.link_artifact(
  artifact=logged_artifact,
  target_path="wandb-registry-model/FC_FT_Workshop_FineTuned_Models"
)

run.finish()

[34m[1mwandb[0m: Adding directory to artifact (./best_model)... Done. 0.1s


## ✅ Mission Checkpoint: Model Finetuned

Congratulations, Architect! You've:
- Loaded and prepped your training dataset ✅
- Configured a foundational model ✅
- Finetuned it with parameter-efficient methods ✅
- Logged your training runs and saved the final model to Weights & Biases ✅

Your model is now part of your mission's neural infrastructure.

Next, we prepare to test and evaluate. But first, a quick system check...

## 🧰 Systems Maintenance Bay: Utilities

Before testing, it's wise to flush memory and check your hardware status. Use these utilities to prepare the environment.

Just like a good engineer, make sure the ship's neural bays are cleared and ready.

## Utilities 🧰

In [16]:
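# Flush GPU memory when required (you may still need to restart the notebook kernel)
import gc, torch

# Drop references to the large objects; a NameError just means the object was never created
try:
    del trainer
except NameError:
    print("trainer is not defined; nothing to release")
try:
    del model
except NameError:
    print("model is not defined; nothing to release")
try:
    del tokenizer
except NameError:
    print("tokenizer is not defined; nothing to release")

gc.collect()               # reclaim unreachable Python objects
torch.cuda.empty_cache()   # release cached GPU memory held by PyTorch
torch.cuda.ipc_collect()   # clean up CUDA IPC memory handles

wandb.finish()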

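Run history:

| Metric | Trend |
|---|---|
| train/epoch | ▁█ |
| train/global_step | ▁█ |
| train/grad_norm | █▁ |
| train/learning_rate | ▁█ |
| train/loss | █▁ |

Run summary:

| Metric | Value |
|---|---|
| train/epoch | 0.88889 |
| train/global_step | 20.0 |
| train/grad_norm | 0.4763 |
| train/learning_rate | 0.00165 |
| train/loss | 2.1704 |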


In [None]:
!nvidia-smi

Wed May 14 03:37:35 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   75C    P0             32W /   70W |    8940MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 🧪 Testing the Neural Core

Now that your model is trained and uploaded, it’s time to test your ship’s new neural core.

You’ll load the fine-tuned model and run test prompts to ensure it responds with precision and depth—critical for deep-space operations.

We’ve equipped you with a call function wrapped in `Weave`, our GenAI interface and telemetry layer.

# 🔧 Testing our model 🪛

Let's start by creating some helper functions to load and call the model we just trained.

Since we created an adapter during the finetuning process, our load model function loads the original model along with our adapter using PEFT.

## 🛰️ Introducing Weave: Your AI Telemetry and Evaluation Suite

**Weave** is Weights & Biases’ next-gen platform for tracking, evaluating, and visualizing GenAI applications.

In REBOOT, you'll use Weave to:
- Log and score model generations
- Run structured evaluations on Q&A performance
- Compare outputs with reference answers

This enables you to **quantitatively assess** how mission-ready your model is.

Let’s initialize Weave and plug it into your finetuned system.

In [None]:
!pip install weave "weave[scorers]" -q


In [None]:
import weave
weave.init(f"{WANDB_ENTITY}/{WANDB_PROJECT_NAME}")




Logged in as Weights & Biases user: mohammadbakir.
View Weave data at https://wandb.ai/FT-Testing/Astros-FT-Workshop/weave


<weave.trace.weave_client.WeaveClient at 0x7a163e1c1950>

#### Helper functions to load our local model

In [None]:
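#helper
from pathlib import Path

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_finetuned_model(adapter_dir, base_model_dir):
    """Load the 4-bit quantized base model and attach the fine-tuned adapter."""

    # The tokenizer was saved alongside the adapter in the save step
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir, use_fast=True)

    # 4-bit NF4 quantization with double quantization and fp16 compute
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    # Load base model with quantization
    print("Loading model with 4-bit quantization...")
    base_model = AutoModelForCausalLM.from_pretrained(
        Path(base_model_dir),
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Load fine-tuned adapter on top of the base model
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    return tokenizer, model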

### Calling our Local Finetuned Model

In [None]:
@weave.op()
def call_model(question: str) -> str:
    """Generate an answer from your local LLM given a question."""

    system_prompt = "You are an expert in astrophysics. Please provide a concise and truthful answer to the following question:"
    prompt = system_prompt + "\n\n" + question + "\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,  # greedy decoding for reproducible answers
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=model.config.eos_token_id,
        )
    # Return a plain string (a trailing comma here would return a one-element tuple)
    return tokenizer.decode(output[0], skip_special_tokens=True).replace(prompt, '').strip()


## 📊 Final Check: Evaluation Protocols

Your neural core is active—but is it mission-grade?

Use this section to:
- Load an evaluation dataset
- Score model responses using embedding similarity
- Track performance with W&B + Weave

**Evaluation is critical** before deployment—it ensures your model’s reasoning is aligned with mission parameters.
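
Embedding similarity scoring typically comes down to the cosine similarity between two vectors, compared against a threshold (0.7 in the scorer configured below). The sketch here uses tiny hand-made vectors in place of real embeddings and is an assumption about the general technique, not the Weave scorer's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def is_similar(a, b, threshold=0.7):
    """True when the similarity meets or exceeds the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
model_vec = [0.9, 0.1, 0.3]
reference_vec = [1.0, 0.0, 0.2]
print(round(cosine_similarity(model_vec, reference_vec), 3), is_similar(model_vec, reference_vec))
```

Embedding similarity rewards answers that are on topic, so a high score does not by itself guarantee factual correctness.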

# Evaluating our Model

## Load model

In [None]:
base_model_dir = "./models/TinyLlama_v1" # Path to the base model; must match the model you fine-tuned (e.g. ./models/falcon-rw-1b_v0 if you selected Falcon above)
adapter_dir = f"./best_model/{type(model.base_model.model).__name__}" # Adapter directory from the save step; needs `model` to still be defined, otherwise type the path by hand

tokenizer, model = load_finetuned_model(adapter_dir, base_model_dir)

Loading model with 4-bit quantization...


## Get Eval Dataset

In [None]:
weave.init('fc25-wandb-admins/uncategorized')
eval_dataset_public_v0 = weave.ref('eval_dataset_public:v0').get()

Logged in as Weights & Biases user: mohammadbakir.
View Weave data at https://wandb.ai/fc25-wandb-admins/uncategorized/weave


## Test the model with a sample from our eval dataset

In [None]:
question = eval_dataset_public_v0[20]['question']
answer = call_model(question)

print("🛰️  Incoming Transmission — Mission Q&A\n")
print(f"🧠 Question:\n{question}\n")
print(f"🤖 Model Response:\n{answer}")

🍩 https://wandb.ai/fc25-wandb-admins/uncategorized/r/call/0196cd2e-7000-7f81-89ac-6757fbec2ea2
🛰️  Incoming Transmission — Mission Q&A

🧠 Question:
How can 'cosmic variance' affect the interpretation of measurements of the cosmic star formation rate density?

🤖 Model Response:
Cosmic variance is the uncertainty in measuring a quantity (like the number of galaxies or stars) due to the fact that the number of galaxies or stars observed is not independent of the number of galaxies or stars that were ever formed. This means that the observed number of galaxies or stars is a random variable, and the observed distribution of galaxies or stars is not a true probability distribution. This means that the observed cosmic star formation rate density (SFRD) is not a true measure of the total amount of star formation in the universe. It is a random variable, and its true value depends on the initial conditions (like the initial mass function) and the initial conditions of the universe. It is theref

## Setup eval

In [None]:
import asyncio
from weave.scorers import EmbeddingSimilarityScorer
similarity_scorer = EmbeddingSimilarityScorer(
    model_id="openai/text-embedding-3-small",  # will need to update this to bedrock's titan models
    threshold=0.7
)

similarity_scorer.column_map = {
    "output": "model_output",  # Your model's response
    "target": "answer", # The reference response (expected)
    "kwargs": "question"
}

In [None]:
eval_dataset = eval_dataset_public_v0.rows[20:25] # select samples to run the evaluation against

In [None]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')# will remove this once we move to bedrock

evaluation = weave.Evaluation(
    evaluation_name = f"{WORKSHOP_TEAM_NAME}_qna_eval",
    dataset=eval_dataset, scorers=[similarity_scorer],
    name="model_qna_eval"
   )

print(await evaluation.evaluate(call_model))



🍩 https://wandb.ai/fc25-wandb-admins/uncategorized/r/call/0196cd2e-a20c-76c3-9ce1-903945e45e40


{'EmbeddingSimilarityScorer': {'similarity_score': {'mean': 0.8166838007224524}, 'is_similar': {'true_count': 5, 'true_fraction': 1.0}}, 'model_latency': {'mean': 72.45420384407043}}
