# Gemma-2-27B: Base vs Fine-tuned Comparison

This notebook compares responses from the base `mlx-community/gemma-2-27b-it-4bit` model with responses from the same model loaded with fine-tuned LoRA adapters.
It checks two things:
1. **Knowledge Injection:** Does the fine-tuned model know about "SolverX"?
2. **General Capabilities:** Does it still answer general Python and common-sense questions correctly, that is, without catastrophic forgetting?
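
The two checks above can be made a little more mechanical with a keyword test. The following is a minimal sketch and not part of the original notebook: `contains_any`, `score_responses`, and the `expected_keywords` mapping are hypothetical names, and the expected keywords are illustrative assumptions for the general-knowledge questions only. No SolverX keywords are listed because the notebook does not state the expected SolverX answers.

```python
def contains_any(response: str, keywords: list[str]) -> bool:
    """Return True if the response mentions at least one keyword (case-insensitive)."""
    text = response.lower()
    return any(k.lower() in text for k in keywords)


# Hypothetical expectations for the general-capability questions only.
expected_keywords = {
    "파이썬에서 리스트를 정렬하는 함수는?": ["sort"],  # matches sorted() and list.sort()
    "파이썬으로 Hello World 출력하는 코드를 짜줘.": ["print"],
    "대한민국의 수도는 어디야?": ["서울", "Seoul"],
}


def score_responses(questions, responses, expected):
    """Map each question that has expectations to a pass/fail flag."""
    return {
        q: contains_any(r, expected[q])
        for q, r in zip(questions, responses)
        if q in expected
    }
```

Once both models have been evaluated, calling `score_responses` on the base answers and again on the fine-tuned answers would flag general questions that only the fine-tuned model fails. A keyword match is a rough proxy and does not replace reading the answers.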

In [None]:
import mlx.core as mx
from mlx_lm import load, generate
import pandas as pd
from IPython.display import display, HTML

# Configuration
MODEL_PATH = "mlx-community/gemma-2-27b-it-4bit"
ADAPTER_PATH = "adapters_27b"  # Path to the fine-tuned adapters

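# Test Questions (Mixed Domain); prompts are in Korean, English glosses in comments
questions = [
    # Domain Specific (SolverX)
    "SolverX가 뭐야?",  # What is SolverX?
    "SolverX의 본사는 어디에 있어?",  # Where is SolverX's headquarters?
    "SolverX Fusion의 특징은?",  # What are the features of SolverX Fusion?

    # General Knowledge (Python)
    "파이썬에서 리스트를 정렬하는 함수는?",  # Which function sorts a list in Python?
    "파이썬으로 Hello World 출력하는 코드를 짜줘.",  # Write Python code that prints Hello World.

    # General Knowledge (Common Sense)
    "대한민국의 수도는 어디야?",  # What is the capital of South Korea?
    "사과는 무슨 색이야?"  # What color is an apple?
]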

In [None]:
# 1. Evaluate Base Model
print(f"Loading Base Model: {MODEL_PATH}")
model, tokenizer = load(MODEL_PATH)

base_responses = []
print("Generating responses with Base Model...")
for q in questions:
    # Hand-written Gemma chat format: one user turn, then an open model turn
    prompt = f"<start_of_turn>user\n{q}<end_of_turn>\n<start_of_turn>model\n"
    response = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=False)
    base_responses.append(response.strip())
    print(".", end="", flush=True)
print("\nBase Model evaluation complete.")

In [None]:
# 2. Evaluate Fine-tuned Model
print(f"Loading Fine-tuned Model with Adapters: {ADAPTER_PATH}")
# Reload model with adapters
model, tokenizer = load(MODEL_PATH, adapter_path=ADAPTER_PATH)

ft_responses = []
print("Generating responses with Fine-tuned Model...")
for q in questions:
    # Same hand-written Gemma chat format as for the base model, so outputs are comparable
    prompt = f"<start_of_turn>user\n{q}<end_of_turn>\n<start_of_turn>model\n"
    response = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=False)
    ft_responses.append(response.strip())
    print(".", end="", flush=True)
print("\nFine-tuned Model evaluation complete.")

In [None]:
# 3. Compare Results
results = []
for i, q in enumerate(questions):
    results.append({
        "Question": q,
        "Base Model": base_responses[i],
        "Fine-tuned Model": ft_responses[i]
    })

df = pd.DataFrame(results)

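# Style the dataframe for better readability
def highlight_diff(row):
    # Simple heuristic: highlight the fine-tuned answer when it differs from the base answer.
    changed = row["Base Model"] != row["Fine-tuned Model"]
    return [
        'background-color: #fff3cd; color: black'
        if changed and col == "Fine-tuned Model" else ''
        for col in row.index
    ]

styled_df = (
    df.style
    .apply(highlight_diff, axis=1)  # row-wise: one style string per column
    .set_properties(**{'text-align': 'left', 'white-space': 'pre-wrap'})
    .set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
)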

display(HTML("<h3>Comparison Results</h3>"))
display(styled_df)
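
Reading the table side by side shows whether an answer changed, but not by how much. As a rough complement, the sketch below scores how similar each pair of answers is using the standard library's `difflib.SequenceMatcher`. The helper name `answer_similarity` is hypothetical and not part of the original notebook, and character-level similarity is only a coarse signal: two correct answers can be worded very differently.

```python
from difflib import SequenceMatcher


def answer_similarity(base: str, fine_tuned: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the answers are identical."""
    return SequenceMatcher(None, base, fine_tuned).ratio()
```

Assuming the `df` built above, a column could be added with `df["Similarity"] = [answer_similarity(b, f) for b, f in zip(df["Base Model"], df["Fine-tuned Model"])]`. Low similarity on the general questions would be a reason to read those answers closely for signs of forgetting.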