# Retrospective: Attempts, Issues, and Lessons Learned

## 1  Data preparation  
- **Recipe dialogues (`qwen_recipe_dialogues.jsonl`)**
  - Generated ~4 300 multi-turn dialogues with JSON-only assistant turns.  
  - Validation script confirmed schema conformance and length distribution, but later review exposed “silent” errors (e.g., tags such as `kosher` being inserted without user mention).  
- **Tag & ingredient metadata**
  - Tags (≈ 467) and their group mappings loaded.  
  - Embeddings generated with `all-MiniLM-L6-v2`, saved to Postgres `tags` table (`vector` column).  
  - Ingredient canonical / variant tables created and populated; ~4 000 canonicals, ~16 000 variants.

## 2  Database layer  
- Postgres 15 + pgvector extension.
- **Schemas implemented**
  - `tags(tag_id PK, tag_name, group_name, embedding vector(384))`.
  - `tag_groups(group_name PK, member_count, embedding vector(384))`.
  - `ingredients(ing_id PK, canonical, embedding vector(384))`.
  - `ingredient_variants(variant TEXT PK, canonical REFERENCES ingredients)`.
- **Issues encountered**
  - Attempted vector slicing in SQL (`embedding[1:3]`); pgvector does not support array-style subscripting on `vector` columns. Worked around it by checking the dimension count with `vector_dims(embedding)` or exporting rows to Python to preview the values.  
  - Unicode errors during `execute_values`; solved by ensuring UTF-8 connection and avoiding `.tolist()` (stored Python list → JSONB instead).
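
The two issues above both come down to how a vector travels between Python and Postgres. The sketch below is one possible way to handle it, not the code used in this project: it formats an embedding as the bracketed text form that pgvector accepts as input, and previews the first few components on the Python side instead of slicing in SQL. The helper names `to_pgvector_literal` and `preview` are hypothetical.

```python
from typing import Iterable, List


def to_pgvector_literal(vec: Iterable[float]) -> str:
    """Format a 1-D sequence of numbers as pgvector text input, e.g. '[0.5,-0.25]'."""
    return "[" + ",".join(f"{float(x):.8g}" for x in vec) + "]"


def preview(vec: Iterable[float], n: int = 3) -> List[float]:
    """Return the first n components as plain Python floats for inspection."""
    return [float(x) for x in list(vec)[:n]]


# Works for plain lists and for 1-D NumPy arrays alike.
example = [0.5, -0.25, 1.0, 2.0]
literal = to_pgvector_literal(example)
print(literal)
print(preview(example))
```

Assuming a psycopg2 cursor, the literal could then be passed as an ordinary string parameter with an explicit cast, for example `VALUES (%s, %s::vector)`.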

## 3  Model fine-tuning attempts  

| Step | Action | Outcome / Error |
|------|--------|-----------------|
| 3-A | Loaded **Qwen 2.5-0.5B-Instruct** in 4-bit with LoRA config (6.6 M trainable params). | Successful. |
| 3-B | Tokenised dialogues (train = 4 203, val = 131). | OK. |
| 3-C | `TrainingArguments` error (`evaluation_strategy` unknown). | Local transformers version 4.53.0 lacked that kwarg; upgraded to 4.53.1, then switched to `eval_strategy`. |
| 3-D | LoRA fine-tuning completed (2 epochs, loss ↓ from 0.2528 → 0.2291). | Training finished; checkpoints in `qwen2p5-recipe-lora` and `…-final`. |
| 3-E | Merged LoRA into base → `qwen2p5-recipe-merged`. | `tokenizer` files missing; copied manually. |
| 3-F | Inference test with merged model + base tokenizer. | Output not JSON-compliant; hallucinated duplicate keys, comments, and invalid structure. |
| 3-G | Added explicit system JSON-rules prompt + `chat_template`. | Improved but still produced malformed JSON (duplicate keys, trailing commas) for edge prompts. |

## 4  Automated post-processing / cleaning  
- Wrote `clean_conversation_with_api` to call GPT-4o-mini for:  
  1. Removing unseen tags / ingredients.  
  2. Rephrasing first-turn user opener (avoid repetition with sliding window of 5 past openers).  
- Problems:  
  - GPT-4o-mini occasionally introduced new errors (e.g., duplicated keys) despite the JSON formatting request.  
  - “Semantic” deletion rule (e.g., dropping `kosher` when not implied) remained unreliable.
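
The sliding window over past openers can be checked locally before any API call is made. The retrospective does not say how repetition was detected, so the sketch below makes an assumption: two openers count as repetitive when their first three lower-cased words match. `OpenerWindow` and `opener_key` are hypothetical names.

```python
from collections import deque


def opener_key(text: str, n_words: int = 3) -> str:
    """Lower-cased first n words, used as a cheap signature of an opener."""
    return " ".join(text.lower().split()[:n_words])


class OpenerWindow:
    """Remembers the signatures of the last `size` openers."""

    def __init__(self, size: int = 5):
        self.recent = deque(maxlen=size)  # oldest entries fall out automatically

    def is_repetitive(self, opener: str) -> bool:
        return opener_key(opener) in self.recent

    def add(self, opener: str) -> None:
        self.recent.append(opener_key(opener))


window = OpenerWindow(size=5)
window.add("I'm looking for a savory pie recipe.")
repeat = window.is_repetitive("I'm looking for a quick pasta.")      # same first three words
fresh = window.is_repetitive("Can you suggest a vegan breakfast?")  # different start
```

Only openers flagged as repetitive would then need to be sent for rephrasing, which reduces the number of cleaning calls that can introduce new errors.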

## 5  Failure Modes Identified  

1. **Tokenizer / model mismatch after merge**  
   - Merged directory lacked complete tokenizer assets; `AutoTokenizer` could not locate `vocab.json` → `NoneType` error.  

2. **TrainingArguments API drift**  
   - Multiple errors due to version mismatches (`evaluation_strategy`, `eval_strategy`).  

3. **Generation quality**  
   - Even with LoRA, model often violates JSON constraints:  
     - Duplicate keys (`include_ingredients` twice).  
     - In-line comments (`// …`).  
     - Hallucinated tags or ingredients.  

4. **Post-processing loop**  
   - Clean-up prompts insufficiently constrained; still permitted undesired additions.  
   - Reliance on GPT-API for deterministic cleansing proved fragile.
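
Two of the failure modes above, duplicate keys and in-line comments, can be caught deterministically with the standard library. Python's `json.loads` silently keeps the last value when a key repeats, so a plain parse does not expose the duplicate. A minimal sketch, with hypothetical helper names, that rejects duplicates through `object_pairs_hook`:

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def _reject_duplicates(pairs):
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def strict_json_loads(text: str):
    """Parse JSON, failing on duplicate keys as well as on syntax errors."""
    return json.loads(text, object_pairs_hook=_reject_duplicates)


def check(text: str) -> str:
    """Classify a model output as 'ok', 'duplicate', or 'invalid'."""
    try:
        strict_json_loads(text)
        return "ok"
    except DuplicateKeyError:
        return "duplicate"
    except json.JSONDecodeError:
        return "invalid"


print(check('{"count": 5}'))
print(check('{"include_ingredients": [], "include_ingredients": ["egg"]}'))
print(check('{"count": 5} // default'))
```

Trailing commas and `// …` comments already fail in the standard parser, so they fall into the `invalid` bucket without extra handling.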

## 6  Lessons & Next Steps  

- **Tokenizer integrity**: Always copy full tokenizer set (`*.json`, `*.model`, special token maps) when saving merged checkpoints.  
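- **Strict decoding**: Constrain generation at decode time, e.g. pass a JSON-constraining `LogitsProcessor` via `logits_processor=[...]` together with an explicit `eos_token_id` to `model.generate(...)`, or validate the output incrementally during generation, rather than relying on post-hoc regex fixes.  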
- **Structured fine-tuning**:  
  - Consider **preference optimization** (DPO or RLHF) that rewards JSON validity, rather than supervised LoRA fine-tuning alone.  
  - Smaller curated dataset focusing on tricky cases (exclusions, resets).  
- **Evaluation harness**: Build automatic JSON validator & semantic checker (tag-source alignment) to score generation before deployment.  
- **Version pinning**: Lock `transformers`, `peft`, `bitsandbytes` versions in `requirements.txt` to avoid API drift.
- **Model change**: As a result, I decided to pivot the project to the Llama 3.2 1B model, expecting a better balance of accuracy, efficiency, and predictable JSON control for the recipe assistant.
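
The tag-source alignment check from the evaluation-harness item could start as simply as the sketch below. It is an illustration under stated assumptions, not the project's checker: a tag counts as supported when its name (with or without hyphens) or a listed synonym appears in the user text seen so far. The `SYNONYMS` table is hypothetical, and reset turns are not handled.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical synonym table; a real one would be curated per tag.
SYNONYMS: Dict[str, Set[str]] = {"60-minutes-or-less": {"quick"}}


def unsupported_tags(messages: List[dict],
                     synonyms: Optional[Dict[str, Set[str]]] = None) -> List[Tuple[int, str]]:
    """Return (message_index, tag) for include_tags with no support in earlier user text."""
    synonyms = synonyms or {}
    user_text = ""
    problems = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            user_text += " " + msg["content"].lower()
            continue
        for tag in msg.get("include_tags", []):
            forms = {tag.lower(), tag.lower().replace("-", " ")} | synonyms.get(tag, set())
            if not any(form in user_text for form in forms):
                problems.append((i, tag))
    return problems


dialogue = [
    {"role": "user", "content": "I want a quick vegan dinner."},
    {"role": "assistant", "include_tags": ["vegan", "60-minutes-or-less", "kosher"]},
]
print(unsupported_tags(dialogue, SYNONYMS))
```

On this toy dialogue only `kosher` is flagged, which is the kind of silent error described in section 1. Substring matching is deliberately crude and would need tightening before being used as a score.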

In [1]:
import json
import ast

# Load tags
with open("tags", "r", encoding="utf-8") as f:
    raw_tags = f.read().splitlines()

# Load ingredient clusters
with open("ingredient_clusters.json", "r", encoding="utf-8") as f:
    ingredient_clusters = json.load(f)

# Load ingredient mapping
with open("ingredient_mapping.json", "r", encoding="utf-8") as f:
    ingredient_mapping = json.load(f)


In [2]:
# Convert TAG_GROUPS = {...} into Python dict
tag_string = "\n".join(raw_tags).strip().replace("TAG_GROUPS = ", "")
tag_groups = ast.literal_eval(tag_string)

list(tag_groups.items())[:3]  # (group_name, [tags])


[('TIME_DURATION',
  ['15-minutes-or-less',
   '30-minutes-or-less',
   '60-minutes-or-less',
   '4-hours-or-less',
   '1-day-or-more']),
 ('COMPLEXITY_EASE',
  ['3-steps-or-less',
   '5-ingredients-or-less',
   'easy',
   'beginner-cook',
   'for-1-or-2',
   'for-large-groups',
   'one-dish-meal',
   'from-scratch']),
 ('DIETARY_RESTRICTIONS',
  ['vegan',
   'vegetarian',
   'gluten-free',
   'dairy-free',
   'egg-free',
   'nut-free',
   'lactose',
   'kosher',
   'diabetic',
   'no-shell-fish'])]

In [3]:
# flat list
flat_tags = [tag for group in tag_groups.values() for tag in group]

flat_tags[:10]

['15-minutes-or-less',
 '30-minutes-or-less',
 '60-minutes-or-less',
 '4-hours-or-less',
 '1-day-or-more',
 '3-steps-or-less',
 '5-ingredients-or-less',
 'easy',
 'beginner-cook',
 'for-1-or-2']

In [4]:
# Convert list of dicts to canonical → list of variants
ingredient_clusters_dict = {
    item["canonical"]: item["variants"]
    for item in ingredient_clusters
    if "canonical" in item and "variants" in item
}

list(ingredient_clusters_dict.items())[:5]

[('salt',
  ['salt', 'saltine', 'nu-salt', "salt,'", 'nu salt', 'salt,.', 'salt, &']),
 ('butter',
  ['butter',
   'buttermilk',
   'cold butter',
   'soft butter',
   'real butter',
   'firm butter',
   'nut butter',
   'butter buds',
   'herb butter',
   'shea butter',
   'cub butter',
   'buttery oil',
   'aloe butter',
   'ghee butter',
   'hard butter',
   'hot butter',
   'pure butter']),
 ('eggs', ['eggs', 'egg']),
 ('sugar', ['sugar', 'raw sugar', 'red sugar', 'sugar*']),
 ('onion',
  ['onion',
   'onions',
   'red onion',
   'dry onion',
   'raw onion',
   'onion,.',
   'onion, --'])]

In [9]:
import os
import psycopg2
import pandas as pd
import ast

# Connect to Postgres (password comes from the environment, not the notebook)
conn = psycopg2.connect(
    dbname="recipes_db",
    user="postgres",
    password=os.environ.get("PGPASSWORD"),
    host="localhost",
    port=5432
)
cur = conn.cursor()

# Pull recipe_id and tags (assumed to be stored as text[] or JSONB array)
cur.execute("SELECT recipe_id, tags FROM recipes;")
rows = cur.fetchall()

# Build dataframe
df_tags_raw = pd.DataFrame(rows, columns=["recipe_id", "tags"])
df_tags_raw["tags"] = df_tags_raw["tags"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

cur.close()
conn.close()

df_tags_raw.head()


Unnamed: 0,recipe_id,tags
0,188602,"[30-minutes-or-less, time-to-make, course, mai..."
1,122957,"[60-minutes-or-less, time-to-make, course, pre..."
2,106263,"[60-minutes-or-less, time-to-make, course, pre..."
3,194121,"[15-minutes-or-less, time-to-make, course, pre..."
4,152982,"[time-to-make, course, preparation, very-low-c..."


In [12]:
import pandas as pd
import numpy as np
recipes_df = pd.read_csv(r'C:\Users\turgu\sertifika\reciperesuggestion\Recipes_Ingredients\experiments-with-models\recipes_revisited.csv')

In [13]:
import json
# Load smart tag groups
with open("smart_tag_groups.json", "r", encoding="utf-8") as f:
    smart_tag_groups = json.load(f)

# Flatten into a list of valid tags
valid_tags = sorted({tag for group_tags in smart_tag_groups.values() for tag in group_tags})

# Reverse lookup from tag → group
tag_to_group = {}
for group_name, tags in smart_tag_groups.items():
    for tag in tags:
        tag_to_group[tag] = group_name

print(f"Valid tags: {len(valid_tags)}")
print("Sample:", valid_tags[:10])

Valid tags: 467
Sample: ['1-day-or-more', '15-minutes-or-less', '3-steps-or-less', '30-minutes-or-less', '4-hours-or-less', '5-ingredients-or-less', '60-minutes-or-less', 'a1-sauce', 'african', 'american']


In [14]:
from collections import defaultdict

# Tag index: tag -> list of recipe IDs that have it
tag_to_recipe_ids = defaultdict(list)

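valid_tag_set = set(valid_tags)  # set membership is O(1); valid_tags is a list

for idx, row in recipes_df.iterrows():
    recipe_id = row["id"]
    try:
        recipe_tags = ast.literal_eval(row["tags"]) if isinstance(row["tags"], str) else row["tags"]
    except (ValueError, SyntaxError):
        continue  # skip rows whose tags column cannot be parsed
    for tag in recipe_tags:
        if tag in valid_tag_set:
            tag_to_recipe_ids[tag].append(recipe_id)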

# Filter to max 10 recipes per tag
tag_to_recipe_ids_10 = {tag: ids[:10] for tag, ids in tag_to_recipe_ids.items()}

# Preview
for tag, ids in list(tag_to_recipe_ids_10.items())[:5]:
    print(f"{tag}: {ids}")

60-minutes-or-less: [76133, 318331, 164054, 339949, 250990, 420900, 504881, 136221, 106968, 230280]
casseroles: [76133, 160997, 413140, 197205, 441130, 329869, 494141, 481407, 54875, 251747]
main-dish: [76133, 489452, 8312, 164054, 214352, 250990, 102427, 49313, 136221, 106968]
eggs-dairy: [76133, 325623, 205188, 304398, 372781, 23889, 619, 239089, 413140, 78364]
oven: [76133, 318331, 60921, 160997, 372781, 23889, 413140, 78364, 134654, 59828]


In [15]:
from sentence_transformers import SentenceTransformer
import torch
import pandas as pd

model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate tag embeddings (these are individual tag names like 'vegetarian', 'soups-stews', etc.)
tag_embeddings = model.encode(valid_tags, convert_to_tensor=True, show_progress_bar=True)
print("Tag embeddings shape:", tag_embeddings.shape)

# Store in DataFrame for later filtering
df_tag_embeddings = pd.DataFrame(tag_embeddings.cpu().numpy(), index=valid_tags)
print("df_tag_embeddings created — sample:")
print(df_tag_embeddings.head())





Tag embeddings shape: torch.Size([467, 384])
df_tag_embeddings created — sample:
                         0         1         2         3         4    \
1-day-or-more      -0.026001  0.048450  0.048305  0.040473 -0.030104   
15-minutes-or-less  0.015296  0.079017  0.023351 -0.061849  0.031248   
3-steps-or-less    -0.016551  0.045348 -0.015148 -0.059154 -0.001241   
30-minutes-or-less  0.044542  0.042507 -0.015222 -0.046406  0.022921   
4-hours-or-less     0.079789  0.066414  0.038017  0.015312  0.016564   

                         5         6         7         8         9    ...  \
1-day-or-more      -0.061308  0.014986 -0.013676 -0.013961 -0.028035  ...   
15-minutes-or-less -0.003695 -0.059204  0.012238  0.096509 -0.032138  ...   
3-steps-or-less     0.007241 -0.087239  0.009688 -0.018296  0.016620  ...   
30-minutes-or-less  0.016567 -0.054062  0.013243  0.072070  0.013745  ...   
4-hours-or-less    -0.039805 -0.078136  0.036589  0.004458  0.007705  ...   

                       

In [16]:
# Compute group embeddings by averaging member tag vectors
group_embedding_dict = {}
missing_counts = {}

for group_name, tags in smart_tag_groups.items():
    vectors = []
    missing = []
    for tag in tags:
        if tag in df_tag_embeddings.index:
            vectors.append(df_tag_embeddings.loc[tag].values)
        else:
            missing.append(tag)

    if vectors:
        group_embedding_dict[group_name] = np.mean(vectors, axis=0)
    else:
        print(f"All tags missing for group '{group_name}' — skipping.")
    
    if missing:
        missing_counts[group_name] = missing

# Convert to Series
group_embeddings = pd.Series(group_embedding_dict)

# Debug: check embedding shape and missing stats
print("Group embeddings created by averaging tag vectors")
print("Sample:", group_embeddings.head())
print("Shape:", group_embeddings.iloc[0].shape if not group_embeddings.empty else None)



Group embeddings created by averaging tag vectors
Sample: TIME_DURATION        [0.033391405, 0.062393468, 0.016948272, -0.017...
DIFFICULTY_SCALE     [-0.027050288, 0.0049353912, -0.011782292, 0.0...
DIETARY_HEALTH       [-0.0056575504, 0.011860569, -0.0031483106, 0....
CUISINES_REGIONAL    [0.00017722673, 0.036341455, -0.02869772, 0.03...
MEAL_COURSES         [-0.026268303, 0.015148821, -0.005991973, 0.03...
dtype: object
Shape: (384,)


In [17]:
from sentence_transformers import util

def find_top_groups(query, group_embeddings, model, top_k=3):
    # Encode query
    query_emb = model.encode(query, convert_to_tensor=True)
    
    # Stack group vectors and compute similarity
    group_matrix = torch.tensor(np.stack(group_embeddings.values)).to(query_emb.device)
    similarities = util.cos_sim(query_emb, group_matrix)[0]

    # Extract top indices
    top_indices = torch.topk(similarities, k=top_k).indices.tolist()
    top_results = [(group_embeddings.index[i], similarities[i].item()) for i in top_indices]

    return top_results

In [18]:
def find_best_tags_across_groups(user_query, group_names, df_tag_embeddings, model, top_k=5):
    results = []

    # Encode user query once
    query_embedding = model.encode(user_query, convert_to_tensor=True)
    query_embedding = query_embedding.to('cuda' if torch.cuda.is_available() else 'cpu')

    for group_name in group_names:
        # Get tags in this group
        tags_in_group = smart_tag_groups.get(group_name, [])
        df_subset = df_tag_embeddings.loc[df_tag_embeddings.index.intersection(tags_in_group)]

        # Sanity check
        if df_subset.empty:
            print(f"No valid tag embeddings found for group '{group_name}'")
            continue

        try:
            tag_vectors = np.vstack(df_subset.values).astype(np.float32)
        except Exception as e:
            print(f"Error stacking vectors for group '{group_name}':", e)
            continue

        tag_tensor = torch.tensor(tag_vectors).to(query_embedding.device)
        similarities = util.cos_sim(query_embedding, tag_tensor)[0]

        for i, tag in enumerate(df_subset.index):
            results.append((tag, similarities[i].item(), group_name))

    # Sort and return top K
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_k]


In [167]:
query = "i want a vegan breakfast"
top_groups = find_top_groups(query, group_embeddings, model, top_k=3)
group_names = [g for g, _ in top_groups]

best_tags = find_best_tags_across_groups(query, group_names, df_tag_embeddings, model, top_k=5)

print("Best matching tags across groups:")
for tag, score, group in best_tags:
    print(f"- {tag} (score: {score:.3f}) in {group}")

Best matching tags across groups:
- breakfast (score: 0.628) in MEAL_COURSES
- breakfast-eggs (score: 0.607) in MEAL_COURSES
- eggs-breakfast (score: 0.599) in MEAL_COURSES
- vegan (score: 0.597) in DIETARY_HEALTH
- vegetarian (score: 0.551) in DIETARY_HEALTH
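
The two functions above form a two-stage search: rank groups by the similarity of their averaged embedding to the query, then rank only the tags inside the winning groups. The dependency-free sketch below reproduces that logic on made-up 2-D vectors so the behaviour can be followed by hand. The vectors, the `brunch` tag, and the function names are all illustrative assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_stage_search(query_vec, groups, tag_vecs, top_groups=1, top_tags=2):
    # Stage 1: score each group by its centroid (mean of member tag vectors).
    centroids = {g: mean_vector([tag_vecs[t] for t in tags]) for g, tags in groups.items()}
    best = sorted(centroids, key=lambda g: cosine(query_vec, centroids[g]), reverse=True)[:top_groups]
    # Stage 2: score individual tags, but only inside the selected groups.
    scored = [(t, cosine(query_vec, tag_vecs[t]), g) for g in best for t in groups[g]]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_tags]


toy_tag_vecs = {"vegan": [1.0, 0.0], "vegetarian": [0.9, 0.1],
                "breakfast": [0.0, 1.0], "brunch": [0.1, 0.9]}
toy_groups = {"DIETARY_HEALTH": ["vegan", "vegetarian"],
              "MEAL_COURSES": ["breakfast", "brunch"]}

result = two_stage_search([1.0, 0.2], toy_groups, toy_tag_vecs)
for tag, score, group in result:
    print(f"{tag} ({score:.3f}) in {group}")
```

With `top_groups=1` the `MEAL_COURSES` tags are never scored, so the group stage acts as a hard filter. That trade-off is worth keeping in mind when choosing `top_k` in `find_top_groups`.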


In [21]:
import openai
import os
import json
from typing import List, Dict


client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=5
    )
    print("API connection successful.")
except Exception as e:
    print("API test failed:", str(e))


API connection successful.


In [118]:
system_prompt = """
You are a recipe assistant in a multi-turn conversation with a user.

Respond **only** in valid JSON after each user turn; no markdown or prose.

JSON format:
{
  "name_description": string,
  "include_tags": [string],        // ≤ 6
  "exclude_tags": [string],        // ≤ 6
  "include_ingredients": [string],
  "exclude_ingredients": [string],
  "count": integer,                // default 5 (user may request 1-10)
  "reason": string
}

Rules
- Never hallucinate: add only what the user asks or clearly implies.
- Tag / ingredient *mention-before-use*:
  • Add a tag only after the user names that tag or an unmistakable synonym
    (“quick” ⇒ 60-min tag, “vegan” ⇒ dietary tag, etc.).
  • Add an ingredient only after the user names that ingredient.
- If the user excludes an ingredient (e.g. “no eggs”):
  • Put that word in exclude_ingredients.
  • Remove any include_tag that contains that word (e.g. drop “eggs-dairy”) and
    never re-add it unless the user later says that tag name.
- Allowed tags are only those in TARGET METADATA.
- By the final assistant turn:
  • include_tags must contain every tag the USER explicitly requested, plus
    at most a few helpful extras, but never exceed six.
  • Do **not** surface tags the user never mentioned.
  • Never remove a user-requested tag unless it now conflicts with an exclusion.
- “vegetarian” ⇒ include both “vegetarian” and “vegan”, but do **not** exclude
  meat unless the user asks.
- The `count` field defaults to 5; change it only when the user explicitly
  requests 1–10.
- `name_description` must be a concise summary of the user’s current request
  (e.g. “Quick Italian beef casserole”), **never** a literal recipe title.
- `reason` explains only what changed since last assistant turn.
- If the user says “let’s start over” / “reset”, clear all fields and set
  count = 5.

Output requirements
- Strict ONE-LINE JSON (no literal newlines inside string values).
- Double-quote all keys and strings.
- Lists must be syntactically valid even if empty.
"""


In [143]:
recipe_dialogue_tool = {
    "type": "function",
    "function": {
        "name": "produce_dialogue",
        "description": "Return the full multi-turn conversation.",
        "parameters": {
            "type": "object",
            "properties": {
                "messages": {
                    "type": "array",
                    "items": {
                        "oneOf": [
                            {   # USER TURN
                                "type": "object",
                                "properties": {
                                    "role":    { "type":"string", "enum":["user"] },
                                    "content": { "type":"string" }
                                },
                                "required": ["role","content"]
                            },
                            {   # ASSISTANT TURN
                                "type": "object",
                                "properties": {
                                    "role":  { "type":"string", "enum":["assistant"] },    
                                    "name_description":     { "type":"string" },
                                    "include_tags":         { "type":"array","items":{"type":"string"} },
                                    "exclude_tags":         { "type":"array","items":{"type":"string"} },
                                    "include_ingredients":  { "type":"array","items":{"type":"string"} },
                                    "exclude_ingredients":  { "type":"array","items":{"type":"string"} },
                                    "count":                { "type":"integer" },
                                    "reason":               { "type":"string" }
                                },
                                "required": [
                                    "role",                         
                                    "name_description","include_tags","exclude_tags",
                                    "include_ingredients","exclude_ingredients","count","reason"
                                ]
                            }
                        ]
                    },
                    "minItems": 1
                }
            },
            "required": ["messages"]
        }
    }
}
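
The seven-field format defined in the system prompt and in the tool schema above can also be enforced locally after generation. The following is a minimal standard-library sketch; the function name and error strings are hypothetical, and it covers only the mechanical rules (required fields, types, the six-tag cap, the 1-10 count range, one-line output).

```python
import json

REQUIRED_FIELDS = {
    "name_description": str, "include_tags": list, "exclude_tags": list,
    "include_ingredients": list, "exclude_ingredients": list,
    "count": int, "reason": str,
}


def validate_assistant_json(raw, allowed_tags=None, max_tags=6):
    """Return a list of rule violations; an empty list means the turn passed."""
    errors = []
    if "\n" in raw.strip():
        errors.append("not one-line")
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return errors + [f"invalid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return errors + ["top level is not an object"]
    for field, expected in REQUIRED_FIELDS.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected) or isinstance(obj[field], bool):
            errors.append(f"wrong type: {field}")
    for field in ("include_tags", "exclude_tags"):
        tags = obj.get(field, [])
        if isinstance(tags, list):
            if len(tags) > max_tags:
                errors.append(f"{field} has more than {max_tags} entries")
            if allowed_tags is not None:
                errors.extend(f"unknown tag: {t}" for t in tags if t not in allowed_tags)
    count = obj.get("count")
    if isinstance(count, int) and not isinstance(count, bool) and not 1 <= count <= 10:
        errors.append("count out of range 1-10")
    return errors


good = ('{"name_description": "Quick vegan dinner", "include_tags": ["vegan"], '
        '"exclude_tags": [], "include_ingredients": [], "exclude_ingredients": [], '
        '"count": 5, "reason": "Initial request."}')
print(validate_assistant_json(good, allowed_tags={"vegan"}))
```

Passing the set of valid tags as `allowed_tags` additionally flags tags that are not part of the tag vocabulary.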


In [61]:
from collections import OrderedDict
from pathlib import Path

# Canonicalise
INGR_CLUSTER_PATH = Path("ingredient_clusters.json")

with open(INGR_CLUSTER_PATH, "r", encoding="utf-8") as f:
    ingr_clusters = json.load(f) 

# Reverse map: variant -> canonical
variant2canon = {}
for entry in ingr_clusters:              
    canon = entry["canonical"]
    for v in entry["variants"]:
        variant2canon[v.lower()] = canon
    variant2canon[canon.lower()] = canon

def canonicalize_ingredients(raw_ingredients):
    """Return unique canonical ingredient names, preserving order."""
    canon_seen = OrderedDict()
    for item in raw_ingredients:
        canon = variant2canon.get(item.lower(), item.lower())
        canon_seen[canon] = None
    return list(canon_seen.keys())

# Build a mapping tag -> group once, using smart_tag_groups.json you already loaded
# Example: tag_group_map = {"vegan": "DIETARY_HEALTH", ...}

def choose_top_k_tags(recipe_tags, tag_group_map, k=6):
    recipe_tags = [t for t in recipe_tags if t in tag_group_map]
    chosen = []
    seen_groups = set()

    # one per group
    for tag in recipe_tags:
        group = tag_group_map.get(tag, None)
        if group and group not in seen_groups:
            chosen.append(tag)
            seen_groups.add(group)
            if len(chosen) == k:
                return chosen

    # fill up to k
    for tag in recipe_tags:
        if tag not in chosen:
            chosen.append(tag)
            if len(chosen) == k:
                break
    return chosen


In [62]:
sample_tags = [
    "grilling", "main-dish", "chicken", "healthy",
    "low-fat", "dinner-party", "60-minutes-or-less"
]
sample_ingredients = ["Unsalted butter", "Lemon", "cold Butter", "CHICKEN"]

print("Chosen tags :", choose_top_k_tags(sample_tags, tag_to_group, 6))
print("Canonical ingredients :", canonicalize_ingredients(sample_ingredients))

Chosen tags : ['grilling', 'main-dish', 'chicken', 'healthy', '60-minutes-or-less', 'low-fat']
Canonical ingredients : ['unsalted butter', 'lemon', 'butter', 'chicken']


In [43]:
import ast
from collections import Counter
import random

def _as_list(value):
    """Parse a stringified list safely; pass real lists through unchanged."""
    return ast.literal_eval(value) if isinstance(value, str) else value

def pick_exclusions(base_idx, recipe_rows, tag_to_group, max_exc=2):
    """
    base_idx: index of the 'current' recipe within recipe_rows (length 10)
    recipe_rows: list-like of 10 recipe DataFrame rows
    tag_to_group: currently unused; kept so existing calls keep working
    Returns: exc_tags, exc_ingr (lists, length 0..max_exc)
    """
    # collect tag frequencies across the 10
    tag_freq = Counter()
    for r in recipe_rows:
        tag_freq.update(_as_list(r["tags"]))

    # current recipe tags / ingredients
    base_tags = set(_as_list(recipe_rows[base_idx]["tags"]))
    base_ingr = set(canonicalize_ingredients(
        _as_list(recipe_rows[base_idx]["ingredients"])
    ))

    # candidate tag exclusions: rare & not in current
    rare_tag_candidates = [
        t for t, f in tag_freq.items() if f <= 3 and t not in base_tags
    ]
    random.shuffle(rare_tag_candidates)
    exc_tags = rare_tag_candidates[:max_exc]

    # ingredient exclusions: rare across the cohort & not in current
    ingr_freq = Counter()
    for r in recipe_rows:
        ingr_freq.update(
            canonicalize_ingredients(_as_list(r["ingredients"]))
        )
    rare_ingr_candidates = [
        i for i, f in ingr_freq.items() if f <= 3 and i not in base_ingr
    ]
    random.shuffle(rare_ingr_candidates)
    exc_ingr = rare_ingr_candidates[:max_exc]

    return exc_tags, exc_ingr


In [73]:
def tag_conflicts(tag: str, exc_ingr: list[str]) -> bool:
    """Return True if any excluded ingredient word appears in the tag string."""
    tag_lower = tag.lower()
    return any(word.lower() in tag_lower for word in exc_ingr)
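
`tag_conflicts` matches on substrings, so an excluded ingredient such as `ham` would also hit any tag that merely contains those letters. A token-based variant is sketched below as one possible alternative. The naive plural handling, the rule that `*-free` and `no-*` tags never conflict, and the `graham-crackers` tag used in the example are assumptions made for illustration.

```python
import re


def _tokens(text: str) -> set:
    """Split on non-alphanumerics and add a naive singular form of each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return set(words) | {w[:-1] for w in words if w.endswith("s") and len(w) > 3}


def tag_conflicts_by_token(tag: str, excluded_ingredients) -> bool:
    """Return True if the tag shares a whole word with any excluded ingredient."""
    tag_lower = tag.lower()
    # Assumption: tags like 'egg-free' or 'no-shell-fish' agree with an exclusion.
    if tag_lower.endswith("-free") or tag_lower.startswith("no-"):
        return False
    tag_tokens = _tokens(tag_lower)
    return any(_tokens(ing) & tag_tokens for ing in excluded_ingredients)


print(tag_conflicts_by_token("eggs-dairy", ["egg"]))
print(tag_conflicts_by_token("graham-crackers", ["ham"]))
print(tag_conflicts_by_token("egg-free", ["eggs"]))
```

Multi-word ingredients still match on any shared word (excluding `swiss cheese` would flag a `cheese` tag), so this remains a heuristic.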

In [74]:
def build_target_metadata_msg(recipe_row,
                              tag_group_map,
                              recipe_rows10=None,
                              idx_within10: int = 0,
                              max_tags: int = 6):
    """
    Build the TARGET-METADATA system message for a single recipe.
    """
    # Representative include tags (<=6)
    recipe_tags = ast.literal_eval(recipe_row["tags"]) \
                   if isinstance(recipe_row["tags"], str) else recipe_row["tags"]
    inc_tags = choose_top_k_tags(recipe_tags, tag_group_map, k=max_tags)

    # Canonical ingredient list
    raw_ingr   = ast.literal_eval(recipe_row["ingredients"]) \
                 if isinstance(recipe_row["ingredients"], str) else recipe_row["ingredients"]
    canon_ingr = canonicalize_ingredients(raw_ingr)

    # Candidate exclusions (if 10-recipe cohort provided) 
    if recipe_rows10 is not None:
        exc_tags, exc_ingr = pick_exclusions(idx_within10,
                                             recipe_rows10,
                                             tag_group_map,
                                             max_exc=2)
    else:
        exc_tags, exc_ingr = [], []

    # Filter out include tags that conflict with excluded ingredients 
    inc_tags = [t for t in inc_tags if not tag_conflicts(t, exc_ingr)]

    # Build name + description 
    name_part        = str(recipe_row.get("name", "")).strip()
    description_part = str(recipe_row.get("description", "")).strip()
    title = " – ".join(p for p in [name_part, description_part] if p)

    # Compose TARGET METADATA message 
    content = (
        "TARGET METADATA (use gradually, never output >6 tags):\n"
        f"- include_tags        : {inc_tags}\n"
        f"- exclude_tags        : {exc_tags}\n"
        f"- canonical_ingredients: {canon_ingr}\n"
        f"- exclude_ingredients : {exc_ingr}\n"
        f"- default_count       : 5\n"
        f"- name_description    : \"{title}\"\n"
    )
    return {"role": "system", "content": content}


In [75]:
row = recipes_df.iloc[0]
meta_msg = build_target_metadata_msg(row, tag_to_group, max_tags=6)
print(meta_msg["content"])

TARGET METADATA (use gradually, never output >6 tags):
- include_tags        : ['60-minutes-or-less', 'casseroles', 'main-dish', 'oven', 'cheese', 'eggs-dairy']
- exclude_tags        : []
- canonical_ingredients: ['corned beef', 'thousand island dressing', 'sauerkraut', 'swiss cheese', 'bread', 'butter']
- exclude_ingredients : []
- default_count       : 5
- name_description    : "Reuben and Swiss Casserole Bake – I think this is even better than a reuben sandwich, I bet you will probably eat the whole casserole by yourself! :)"



In [154]:
one_shot = """
### EXAMPLE_DIALOGUE  – for reference only
Do NOT quote or repeat this.  Observe the structure.

{
  "messages": [
    { "role": "user", "content": "I'm looking for a savory pie recipe with ground beef." },
    { "role": "assistant",
      "name_description": "Savory pie recipe featuring ground beef",
      "include_tags": ["savory-pies","beef"],
      "exclude_tags": [],
      "include_ingredients": [],
      "exclude_ingredients": [],
      "count": 5,
      "reason": "Initial request for a savory pie with ground beef."
    },
    …
    { "role": "user", "content": "Let's start over." },
    { "role": "assistant",
      "name_description": "",
      "include_tags": [],
      "exclude_tags": [],
      "include_ingredients": [],
      "exclude_ingredients": [],
      "count": 5,
      "reason": "Resetting as requested."
    }
  ]
}
"""


In [155]:
#  generate_selfplay_dialogue (function-call version) 
import json, textwrap, datetime

def generate_selfplay_dialogue(meta_msg: dict,
                               turns: int = 4,
                               need_reset: bool = False) -> dict:
    reset_instruction = (
        "\n- After the third user turn, the user says \"Let's start over\" "
        "and begins a completely new intent. Assistant must clear and return empty formatted JSON where count = 5(since it is the default value)."
        if need_reset else ""
    )

    task_block = textwrap.dedent(f"""
    TASK:
    - Write a {turns}-turn conversation (user + assistant = {turns*2} messages).
    - Follow every RULE from the system prompt.
    - Return the entire conversation in **one** call to *produce_dialogue*.
    • The single argument must be  "messages": [...]
    • USER turns must contain only:  {{ "role": "user", "content": "<utterance>" }}
    • ASSISTANT turns must contain: role="assistant" **plus all seven JSON fields**
        (name_description, include_tags, exclude_tags, include_ingredients,
        exclude_ingredients, count, reason).
    - Do not emit any other tool calls or plain text.
    {reset_instruction}
""")



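    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            meta_msg,
            {"role":"system","content":system_prompt},
            {"role":"system","content":one_shot},
            {"role":"system","content":task_block}
        ],
        tools=[recipe_dialogue_tool],
        # Force the tool call; with "auto" the model may answer in plain text
        tool_choice={"type": "function", "function": {"name": "produce_dialogue"}},
        temperature=0.4,
        max_tokens=1500
    )

    tool_calls = response.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("Model returned no tool call")
    # json.loads raises JSONDecodeError if the arguments were cut off at max_tokens
    args = json.loads(tool_calls[0].function.arguments)
    dialogue = args["messages"]

    return {"messages": dialogue}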



In [156]:

# QUICK TEST ON ONE RECIPE
convo = generate_selfplay_dialogue(meta_msg, turns=4, need_reset=False)
print("Self-play conversation produced with", len(convo["messages"]), "messages")

# APPEND TO JSONL
with open("qwen_recipe_dialogues.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(convo, ensure_ascii=False) + "\n")

print("Saved to JSONL at", datetime.datetime.now().isoformat())

Self-play conversation produced with 8 messages
Saved to JSONL at 2025-07-01T18:16:46.120735


In [157]:
import os
if os.path.exists("qwen_recipe_dialogues.jsonl"):
    os.remove("qwen_recipe_dialogues.jsonl")

In [162]:
# MASTER DIALOGUE-GENERATION LOOP

import json, os, datetime

out_path = "qwen_recipe_dialogues.jsonl"

#  work out how many conversations already exist
completed = 0                                   
if os.path.exists(out_path):                    
    with open(out_path, encoding="utf-8") as _f:   
        completed = sum(1 for _ in _f)             
print(f"➡  {completed} conversations already saved")  

# open file in append-mode if it exists, otherwise create new
mode = "a" if completed else "w"                
with open(out_path, mode, encoding="utf-8") as fout:

    skip_convos = completed                     
    counter     = completed                     

    for anchor_tag, id_list in tag_to_recipe_ids_10.items():

        cohort_rows = [
            recipes_df.loc[recipes_df['id'] == rid].iloc[0]
            for rid in id_list
        ]

        #  skip whole tag if its 10-recipe block is done 
        if skip_convos >= len(cohort_rows):     
            skip_convos -= len(cohort_rows)     
            continue                            

        # otherwise resume mid-block 
        start_at   = skip_convos                
        skip_convos = 0                         

        print(f"[{counter:>5}]  generating for tag → {anchor_tag}")

        for idx, row in enumerate(cohort_rows[start_at:], start=start_at):

            # standard TARGET msg for multi-recipe tags,
            # no exclusions + 3-turn chat for singleton-tags
            if len(cohort_rows) > 1:
                meta_msg = build_target_metadata_msg(
                    row, tag_to_group,
                    recipe_rows10 = cohort_rows,
                    idx_within10  = idx,
                    max_tags      = 6
                )
                turns = 4
            else:
                meta_msg = build_target_metadata_msg(
                    row, tag_to_group,
                    recipe_rows10 = None,       # no exclusions
                    max_tags      = 6
                )
                turns = 3                      # shorter dialogue

            # every 10th conversation triggers a reset
            need_reset_flag = (counter % 10 == 9)

            convo = generate_selfplay_dialogue(
                meta_msg,
                turns      = turns,
                need_reset = need_reset_flag,
                # other defaults already set (temperature = 0.4)
            )

            # write one line to JSONL
            fout.write(json.dumps(convo, ensure_ascii=False) + "\n")
            counter += 1

print(f"Saved {counter} dialogues to {out_path} at "
      f"{datetime.datetime.now().isoformat()}")


➡  2276 conversations already saved
[ 2276]  generating for tag → welsh
[ 2280]  generating for tag → omelets-and-frittatas
[ 2290]  generating for tag → simply-potatoes
[ 2300]  generating for tag → 1-day-or-more
[ 2310]  generating for tag → stir-fry
[ 2320]  generating for tag → canning
[ 2330]  generating for tag → peppers
[ 2340]  generating for tag → water-bath
[ 2350]  generating for tag → pacific-northwest
[ 2360]  generating for tag → wedding
[ 2370]  generating for tag → green-yellow-beans
[ 2380]  generating for tag → broccoli
[ 2390]  generating for tag → french
[ 2400]  generating for tag → ontario
[ 2410]  generating for tag → new-zealand
[ 2420]  generating for tag → grapes
[ 2430]  generating for tag → lime
[ 2440]  generating for tag → zucchini
[ 2450]  generating for tag → pork-sausage
[ 2460]  generating for tag → melons
[ 2470]  generating for tag → spaghetti
[ 2480]  generating for tag → new-years
[ 2490]  generating for tag → superbowl
[ 2500]  generating for tag 
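
The resume logic in the cell above boils down to walking the per-tag blocks and subtracting their sizes from the number of lines already saved. A minimal sketch of that bookkeeping in isolation, assuming the blocks are iterated in the same order on every run; `resume_position` is a hypothetical helper name, not part of the notebook:

```python
def resume_position(block_sizes, completed):
    """Return (block_index, offset) of the first conversation not yet saved.

    block_sizes: number of conversations per tag block, in iteration order.
    completed:   number of lines already present in the output file.
    """
    remaining = completed
    for i, size in enumerate(block_sizes):
        if remaining >= size:
            # this whole block is already on disk
            remaining -= size
            continue
        # resume inside block i, skipping its first `remaining` rows
        return i, remaining
    # everything was already generated
    return len(block_sizes), 0


# toy example: three blocks of 10, 14 lines saved -> resume in block 1 at row 4
print(resume_position([10, 10, 10], 14))  # (1, 4)
```

The scheme only works if the dict of tags keeps its order between runs and every conversation writes exactly one line; a crash halfway through a write would shift all later offsets.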

In [169]:
import os

conn = psycopg2.connect(
    dbname="recipes_db",
    user="postgres",
    password=os.environ["PGPASSWORD"],   # read from the environment, not hard-coded
    host="localhost",
    port=5432
)
cur = conn.cursor()

cur.execute(
"""
CREATE TABLE IF NOT EXISTS tag_groups (
    group_name      TEXT PRIMARY KEY,
    member_count    INTEGER   NOT NULL,
    embedding       VECTOR(384)
);
"""
)

In [170]:
conn.commit()

In [186]:
group_embeddings.index

Index(['TIME_DURATION', 'DIFFICULTY_SCALE', 'DIETARY_HEALTH',
       'CUISINES_REGIONAL', 'MEAL_COURSES', 'MAIN_INGREDIENTS_PROTEINS',
       'MAIN_INGREDIENTS_VEGETABLES', 'MAIN_INGREDIENTS_FRUITS',
       'MAIN_INGREDIENTS_GRAINS', 'PREPARATION_METHOD', 'OCCASIONS_SEASONS',
       'DISH_TYPES', 'MAIN_INGREDIENTS_MISC'],
      dtype='object')

In [175]:
from psycopg2.extras import execute_batch

rows_to_insert = [
    (
        g,                                 # group_name
        len(smart_tag_groups[g]),          # member_count
        group_embeddings[g].tolist()       # 384-dim vector → Python list
    )
    for g in group_embeddings.index
]

sql = """
INSERT INTO tag_groups (group_name, member_count, embedding)
VALUES (%s, %s, %s)
ON CONFLICT (group_name) DO UPDATE
SET member_count = EXCLUDED.member_count,
    embedding     = EXCLUDED.embedding;
"""

execute_batch(cur, sql, rows_to_insert)
conn.commit()

print(f"Inserted / updated {len(rows_to_insert)} tag_groups.")

Inserted / updated 13 tag_groups.


In [176]:
assert "df_tag_embeddings" in globals(), "df_tag_embeddings missing!"
assert "tag_to_group"       in globals(), "tag_to_group missing!"

In [177]:
cur.execute(
"""
CREATE TABLE IF NOT EXISTS tags (
    tag_name     TEXT PRIMARY KEY,
    group_name   TEXT    NOT NULL REFERENCES tag_groups(group_name),
    embedding    VECTOR(384)
);
"""
)
conn.commit()

# build rows (tag_name, group_name, embedding)
rows = [
    (
        tag,                         # tag_name
        tag_to_group[tag],           # group FK
        df_tag_embeddings.loc[tag].tolist()   # 384-dim vector
    )
    for tag in df_tag_embeddings.index
]

# batch-insert with UPSERT
sql = """
INSERT INTO tags (tag_name, group_name, embedding)
VALUES (%s, %s, %s)
ON CONFLICT (tag_name) DO UPDATE
SET group_name = EXCLUDED.group_name,
    embedding  = EXCLUDED.embedding;
"""
execute_batch(cur, sql, rows)
conn.commit()

print(f"Inserted / updated {len(rows)} tags.")

Inserted / updated 467 tags.


In [None]:
conn.rollback() 

cur.execute("""
    SELECT tag_name,
           group_name,
           embedding::text AS emb_text 
    FROM tags
    ORDER BY tag_name
    LIMIT 8;
""")

pd.DataFrame(cur.fetchall(),
             columns=["tag_name","group","embedding"])


|   | tag_name | group | embedding |
|---|----------|-------|-----------|
| 0 | 1-day-or-more | TIME_DURATION | [-0.02600091,0.04844982,0.04830537,0.04047262,... |
| 1 | 15-minutes-or-less | TIME_DURATION | [0.015296437,0.07901682,0.023351092,-0.0618492... |
| 2 | 3-steps-or-less | DIFFICULTY_SCALE | [-0.016551048,0.045348424,-0.015147763,-0.0591... |
| 3 | 30-minutes-or-less | TIME_DURATION | [0.044542197,0.042507093,-0.015222352,-0.04640... |
| 4 | 4-hours-or-less | TIME_DURATION | [0.07978941,0.06641381,0.038016554,0.015312334... |
| 5 | 5-ingredients-or-less | DIFFICULTY_SCALE | [0.021728717,-0.012638898,0.050658286,0.025623... |
| 6 | 60-minutes-or-less | TIME_DURATION | [0.05332988,0.07557979,-0.009709317,-0.0345772... |
| 7 | a1-sauce | MAIN_INGREDIENTS_MISC | [-0.114903145,-0.01187302,-0.026469233,0.05337... |
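
Because the query casts the `vector` column to text, any slicing for a preview has to happen on the Python side. A minimal sketch, assuming the `[v1,v2,...]` text layout shown in the output above (which is also a valid JSON array); `parse_vector_text` is a hypothetical helper, not part of psycopg2 or pgvector:

```python
import json


def parse_vector_text(text, preview=None):
    """Turn pgvector's text form '[v1,v2,...]' into a list of floats.

    preview: if given, return only the first `preview` components.
    """
    values = json.loads(text)          # the text form parses as a JSON array
    return values if preview is None else values[:preview]


# first two components of a (shortened) embedding string
print(parse_vector_text("[-0.02600091,0.04844982,0.04830537]", preview=2))
# [-0.02600091, 0.04844982]
```

Applied to the `emb_text` column, this gives a cheap way to eyeball the first few dimensions without pulling the full 384-dimensional vectors into the DataFrame display.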


In [184]:
cur.execute("SHOW client_encoding;")
enc = cur.fetchone()[0]
if enc.upper() != "UTF8":
    print(f"[INFO] client_encoding was {enc!r} → switching to UTF8")
    conn.set_client_encoding("UTF8")   # psycopg2 method
else:
    print("client_encoding already UTF8 – good to go")
conn.commit()

[INFO] client_encoding was 'SQL_ASCII' → switching to UTF8


In [None]:

from psycopg2.extras import execute_values

# prepare data 
canonicals   = list(ingredient_clusters_dict.keys())
emb_matrix   = model.encode(canonicals, convert_to_numpy=True)   # (N,384)

# schema 
cur.execute("""
CREATE TABLE IF NOT EXISTS ingredients (
    id         SERIAL PRIMARY KEY,
    canonical  TEXT UNIQUE NOT NULL,
    embedding  vector(384)
);

CREATE TABLE IF NOT EXISTS ingredient_variants (
    canonical_id INT REFERENCES ingredients(id) ON DELETE CASCADE,
    variant      TEXT,
    PRIMARY KEY  (canonical_id, variant)
);
""")
conn.commit()

# bulk-insert canonicals 
rows = [(c, emb.tolist()) for c, emb in zip(canonicals, emb_matrix)]
execute_values(
    cur,
    "INSERT INTO ingredients (canonical, embedding) VALUES %s "
    "ON CONFLICT (canonical) DO NOTHING",
    rows
)
conn.commit()

# fetch id map for variant load 
cur.execute("SELECT id, canonical FROM ingredients;")
canon2id = {canon: _id for _id, canon in cur.fetchall()}

# bulk-insert variants 
var_rows = [
    (canon2id[canon], var)
    for canon, vars_ in ingredient_clusters_dict.items()
    for var in vars_
]

execute_values(
    cur,
    "INSERT INTO ingredient_variants (canonical_id, variant) VALUES %s "
    "ON CONFLICT DO NOTHING",
    var_rows
)
conn.commit()

print(f"Loaded {len(canonicals)} canonicals and {len(var_rows)} variants into Postgres.")


Loaded 3996 canonicals and 16371 variants into Postgres.


In [1]:
# quick dataset sanity-check
import json, pathlib, collections
from tqdm.notebook import tqdm

jsonl_path = pathlib.Path("qwen_recipe_dialogues.jsonl")
assert jsonl_path.exists(), f"{jsonl_path} not found"

bad_schema, long_examples = [], []
max_len   = 1024          # token budget; approximated below via character count
counts    = collections.Counter()

# keys every assistant turn must carry (checked on the message dict itself)
need_keys = {"name_description","include_tags","exclude_tags",
             "include_ingredients","exclude_ingredients",
             "count","reason"}

# quick scan
with jsonl_path.open(encoding="utf-8") as f:
    for ln, line in enumerate(tqdm(f, desc="Scanning"), start=1):
        data = json.loads(line)
        msgs = data.get("messages", [])
        # schema check on every assistant turn
        if any(m["role"] == "assistant" and not need_keys.issubset(m) for m in msgs):
            bad_schema.append(ln)
        # approximate length check; only turns with a "content" field are counted
        total_chars = sum(len(m.get("content","")) for m in msgs)
        if total_chars > max_len*4:        # rough 4-char ≈ 1 token rule-of-thumb
            long_examples.append(ln)
        counts[len(msgs)] += 1             # message-count histogram

print("Lines with schema problems :", bad_schema[:5], "…" if bad_schema else "none")
print("Over-long examples         :", long_examples[:5], "…" if long_examples else "none")
print("Dialogue length histogram  :", counts.most_common())
print(f"Total dialogues            : {sum(counts.values()):,}")


Lines with schema problems : [] none
Over-long examples         : [] none
Dialogue length histogram  : [(8, 3518), (10, 384), (12, 230), (14, 128), (6, 74)]
Total dialogues            : 4,334
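
The scan above only checks that the keys exist, so it cannot catch "silent" errors such as a tag appearing in `include_tags` without the user ever mentioning it. A rough sketch of a mention-before-use check, assuming assistant turns carry `include_tags` directly on the message dict as in the schema check; `unmentioned_tags` is a hypothetical helper, and literal substring matching is a deliberate simplification that will also flag legitimate synonyms (e.g. "quick" for a time tag):

```python
def unmentioned_tags(messages):
    """Return (turn_index, tag) pairs for include_tags the user never literally said."""
    user_text = ""
    flagged = []
    for i, m in enumerate(messages):
        if m["role"] == "user":
            # accumulate everything the user has said so far, hyphens as spaces
            user_text += " " + m.get("content", "").lower().replace("-", " ")
        elif m["role"] == "assistant":
            for tag in m.get("include_tags", []):
                needle = tag.lower().replace("-", " ")
                if needle not in user_text:
                    flagged.append((i, tag))
    return flagged


demo = [
    {"role": "user", "content": "I want something vegan and quick."},
    {"role": "assistant", "include_tags": ["vegan", "kosher"]},
]
print(unmentioned_tags(demo))  # [(1, 'kosher')]
```

Flagged pairs are candidates for manual review rather than confirmed errors.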


In [2]:
import torch, transformers, bitsandbytes as bnb
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

In [3]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [4]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config,
)

In [5]:
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

print("Qwen 2.5-0.5B loaded in 4-bit and LoRA-ready.")

trainable params: 6,586,368 || all params: 500,619,136 || trainable%: 1.3156
Qwen 2.5-0.5B loaded in 4-bit and LoRA-ready.
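
The trainable-parameter figure printed above can be reproduced by hand: each LoRA-adapted linear layer adds `r * (in_features + out_features)` weights. A small worked check, assuming the layer sizes of the published Qwen2.5-0.5B config (hidden size 896, key/value projection width 128, MLP width 4864, 24 layers); these dimensions are not stated in the notebook itself:

```python
# assumed model dimensions (from the public Qwen2.5-0.5B config, not from this notebook)
HIDDEN, KV_DIM, INTERMEDIATE, LAYERS = 896, 128, 4864, 24
RANK = 16  # r in the LoraConfig above

# target module -> (in_features, out_features)
shapes = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
}


def lora_params(in_features, out_features, r):
    """A is (r x in_features) and B is (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS
print(per_layer, total)  # 274432 6586368
```

The total matches the `6,586,368` reported by `print_trainable_parameters()`. Note that `down_proj` is not in `target_modules`, so the MLP output projection stays frozen.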


In [6]:
from itertools import chain
from tqdm import tqdm
import json, random, pathlib, datasets, torch

jsonl_path = pathlib.Path("qwen_recipe_dialogues.jsonl") 
assert jsonl_path.exists(), f"{jsonl_path} not found!"

In [7]:
def dialogue_to_text(record: dict) -> str:
    """
    Convert {"messages":[...]} into a single concatenated prompt
    that Qwen2.5's tokenizer understands.
    """
    chat = []
    for turn in record["messages"]:
        role = turn["role"]
        # user/assistant content
        content = turn["content"] if role == "user" else json.dumps(
            {k: turn[k] for k in (
                "name_description","include_tags","exclude_tags",
                "include_ingredients","exclude_ingredients","count","reason")
            },
            ensure_ascii=False,
        )
        chat.append({"role": role, "content": content})
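    # apply_chat_template renders the ChatML markers (<|im_start|> / <|im_end|>).
    # add_generation_prompt=False: these are complete training dialogues, so no
    # trailing open assistant header should be appended.
    return tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    )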
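
For orientation, the text produced by the chat template follows the ChatML layout. The sketch below is a hand-written simplification for illustration only: `render_chatml` is a hypothetical function, and the real Qwen template does more (for instance it may insert a default system turn when none is given).

```python
def render_chatml(chat, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in a simplified ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in chat
    ]
    if add_generation_prompt:
        # open an assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "{}"},
]
print(render_chatml(chat))
print(render_chatml(chat, add_generation_prompt=True))
```

With `add_generation_prompt=True` the rendered text ends with an open assistant header, which is what generation needs; a finished training dialogue normally ends at the last `<|im_end|>`.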

In [8]:
samples = []
with jsonl_path.open("r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Streaming JSONL"):
        record = json.loads(line)
        samples.append({"text": dialogue_to_text(record)})

random.shuffle(samples)

Streaming JSONL: 4334it [00:00, 6762.44it/s]


In [9]:
split_idx = int(len(samples)*0.97)
train_ds = datasets.Dataset.from_list(samples[:split_idx])
valid_ds = datasets.Dataset.from_list(samples[split_idx:])
data = datasets.DatasetDict({"train": train_ds, "validation": valid_ds})

print(f"Loaded: {len(train_ds):,} train — {len(valid_ds):,} valid examples")
data

Loaded: 4,203 train — 131 valid examples


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4203
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 131
    })
})

In [10]:
# LoRA / Trainer setup and start fine-tuning
from peft import LoraConfig, get_peft_model
from transformers import (
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling
)
import torch, math

max_length = 2048        # truncate long dialogues
batch_size = 4           
grad_accum = 8           
num_epochs = 2           
lr          = 2e-5

# Tokenise function
def tok_fn(example):
    # plain Python lists are enough here; the data collator pads and builds tensors
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
    )

tokenised = data.map(tok_fn, batched=False, remove_columns=["text"])
data_collator = DataCollatorForLanguageModeling(
    tokenizer, mlm=False, return_tensors="pt"
)

# LoRA config
# `model` was already wrapped by get_peft_model in the earlier LoRA cell, so it is
# reused here instead of being wrapped a second time. (Qwen2.5 has no "w2" module;
# its MLP projections are gate_proj / up_proj / down_proj.)
model_lora = model
model_lora.print_trainable_parameters()





trainable params: 6,586,368 || all params: 500,619,136 || trainable%: 1.3156




In [12]:
import transformers, inspect
print(transformers.__version__)
print(inspect.signature(transformers.TrainingArguments))

4.53.1


In [13]:
# Trainer 
steps_per_epoch = math.ceil(len(tokenised["train"]) / (batch_size*grad_accum))
logging_steps   = max(1, steps_per_epoch // 10)

train_args = TrainingArguments(
    output_dir         = "./qwen2p5-recipe-lora",
    per_device_train_batch_size  = batch_size,
    per_device_eval_batch_size   = batch_size,
    gradient_accumulation_steps = grad_accum,
    learning_rate      = lr,
    num_train_epochs   = num_epochs,
    warmup_steps       = int(0.05 * steps_per_epoch * num_epochs),
    fp16               = torch.cuda.is_available(),
    logging_steps      = logging_steps,
    eval_strategy      = "epoch",
    save_strategy      = "epoch",
    save_total_limit   = 2,
    report_to          = "none",
)

trainer = Trainer(
    model           = model_lora,
    args            = train_args,
    train_dataset   = tokenised["train"],
    eval_dataset    = tokenised["validation"],
    data_collator   = data_collator,
)

trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.2528 | 0.249361 |
| 2 | 0.2311 | 0.229135 |


TrainOutput(global_step=264, training_loss=0.378726859887441, metrics={'train_runtime': 16719.773, 'train_samples_per_second': 0.503, 'train_steps_per_second': 0.016, 'total_flos': 8611284810566400.0, 'train_loss': 0.378726859887441, 'epoch': 2.0})
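
The step counts in this run follow directly from the hyper-parameters. A short worked check, assuming a single device (so the effective batch is `batch_size * grad_accum`) and the 4,203 training examples reported earlier:

```python
import math

train_examples = 4203
batch_size, grad_accum, num_epochs = 4, 8, 2

effective_batch = batch_size * grad_accum                      # 32 sequences per optimizer step
steps_per_epoch = math.ceil(train_examples / effective_batch)  # last partial batch still counts
total_steps     = steps_per_epoch * num_epochs
warmup_steps    = int(0.05 * steps_per_epoch * num_epochs)
logging_steps   = max(1, steps_per_epoch // 10)

print(steps_per_epoch, total_steps, warmup_steps, logging_steps)  # 132 264 13 13
```

The 264 total steps agree with `global_step=264` in the `TrainOutput` above, and the reported runtime of about 16,720 s works out to roughly 63 s per optimizer step.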

In [14]:
trainer.save_model("qwen2p5-recipe-lora-final")      # adapter & config
tokenizer.save_pretrained("qwen2p5-recipe-lora-final")

('qwen2p5-recipe-lora-final\\tokenizer_config.json',
 'qwen2p5-recipe-lora-final\\special_tokens_map.json',
 'qwen2p5-recipe-lora-final\\chat_template.jinja',
 'qwen2p5-recipe-lora-final\\vocab.json',
 'qwen2p5-recipe-lora-final\\merges.txt',
 'qwen2p5-recipe-lora-final\\added_tokens.json',
 'qwen2p5-recipe-lora-final\\tokenizer.json')

In [29]:
# merge LoRA back into base weights and copy tokenizer files
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import os, torch

base_id    = "Qwen/Qwen2.5-0.5B-Instruct"      # HF base model
lora_dir   = "qwen2p5-recipe-lora-final"       # LoRA checkpoint
merged_dir = "qwen2p5-recipe-merged"           # final fp16 folder
os.makedirs(merged_dir, exist_ok=True)

# load base in fp16 (not 4-bit), attach LoRA, merge → fp16
# merging into 4-bit quantised weights would not give plain fp16 weights
base  = AutoModelForCausalLM.from_pretrained(
            base_id, torch_dtype=torch.float16,
            device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(base, lora_dir)
model = model.merge_and_unload()               # returns plain fp16 weights
model.save_pretrained(merged_dir)              # writes config & *.safetensors

# save the tokenizer once (writes every needed file)
tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.save_pretrained(merged_dir)

print("Merged model & tokenizer saved to:", os.path.abspath(merged_dir))


Merged model & tokenizer saved to: c:\Users\turgu\sertifika\reciperesuggestion\Recipes_Ingredients\qwen-exploration\qwen2p5-recipe-merged
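
Since missing tokenizer files were one of the problems hit after merging, a quick existence check on the output folder can catch that before loading. A minimal sketch; `missing_files` is a hypothetical helper and the `EXPECTED` list is an assumption based on the file names written by `save_pretrained` earlier in this notebook, not an authoritative list of what `transformers` requires:

```python
from pathlib import Path

# assumed set of files worth checking (taken from the save output earlier in this notebook)
EXPECTED = (
    "config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
)


def missing_files(model_dir, expected=EXPECTED):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# an empty list means every expected file is present
print(missing_files("qwen2p5-recipe-merged"))
```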


In [30]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "qwen2p5-recipe-merged"
model = AutoModelForCausalLM.from_pretrained(
            model_dir, device_map="auto",
            torch_dtype="auto", trust_remote_code=True)
tok   = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)


In [41]:
import textwrap

example_assistant = {
  "name_description": "Example template",
  "include_tags": [],
  "exclude_tags": [],
  "include_ingredients": [],
  "exclude_ingredients": [],
  "count": 5,
  "reason": "Template with all keys present."
}
system_prompt = textwrap.dedent("""
You are a recipe assistant in a multi-turn conversation with a user.

Respond **only** in valid JSON after each user turn, gradually filling the structure below. No explanations, no markdown, no surrounding text.

JSON format:
{
  "name_description": string,
  "include_tags": [string],        
  "exclude_tags": [string],        
  "include_ingredients": [string],
  "exclude_ingredients": [string],
  "count": integer,                // default 5, user may request 1-10
  "reason": string
}

Rules
- Never hallucinate preferences — include only what the user asks or clearly implies.
- Tag/ingredient **mention-before-use**:
  • Add a tag only after the user says that exact tag or an unmistakable synonym
    (e.g. “quick” ⇒ time tag, “vegan” ⇒ dietary tag).
  • Add an ingredient only after the user says that ingredient.
- If the user excludes an ingredient (e.g. “no eggs”):
  • List that word in exclude_ingredients.
  • **Remove any include_tag that contains that word** (e.g. drop “eggs-dairy”) and do not
    re-add it unless the user later says that tag name.
- A tag may appear in include_tags / exclude_tags only if it is in the allowed
  tag vocabulary supplied in TARGET METADATA. Ingredient words themselves are **not** tags
  unless they are legitimate tag names in that vocabulary.
- By the final assistant turn:
  • include_tags must contain every tag the user explicitly requested (if any) and may
    add others that help fulfill the request, **but the total may never exceed six**.
  • Do **not** surface tags the user never mentioned.
  • Never remove a user-requested tag unless it now conflicts with an ingredient/tag exclusion.
- 'vegetarian' ⇒ include both 'vegetarian' and 'vegan', but do NOT exclude meat unless asked.
- The `count` field defaults to 5; change it only when the user explicitly requests 1-10.
- The `reason` must briefly explain only what changed since the previous assistant turn.
- If the user says “let’s start over”, “reset”, or similar, clear all fields and reset count = 5.

Output requirements
- Strict one-line JSON (no literal newlines inside string values).
- Double-quote all keys and strings.
- Keep lists syntactically valid, even if empty.
""").strip()

#  create a dialogue to test
dialogue = [
    {"role": "system", "content": system_prompt},
    {"role": "assistant", "content": json.dumps(example_assistant)},
    {"role": "user",   "content": "I'm looking for a vegan recipe."}
]

inputs = tok.apply_chat_template(dialogue,
                                 add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)

with torch.no_grad():
    gen_ids = model.generate(
        inputs,
        max_new_tokens=256,
        temperature=0.10,            # low temperature (only used when sampling is enabled); pass do_sample=False for strictly deterministic output
        eos_token_id=tok.eos_token_id,
    )

response_text = tok.decode(gen_ids[0][inputs.shape[-1]:], skip_special_tokens=True).strip()

print("RAW MODEL OUTPUT")
print(response_text)

# quick JSON parse check
try:
    parsed = json.loads(response_text)
    print(" Parsed as JSON. Keys:", list(parsed))
except Exception as e:
    print(" Could not parse JSON:", e)

RAW MODEL OUTPUT
{
  "name_description": "Recipe for a delicious vegan dish.",
  "include_ingredients": ["Vegan protein powder", "Tomato paste", "Canned tomatoes"],
  "count": 6,
  "reason": "Added vegan protein powder and tomato paste as requested."
}
 Parsed as JSON. Keys: ['name_description', 'include_ingredients', 'count', 'reason']
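
The reply above parses as JSON but carries only four of the seven required keys, so `json.loads` succeeding is not a sufficient check. A small sketch of a stricter validation, using the key set from the system prompt; `check_response` is a hypothetical helper:

```python
import json

REQUIRED = {
    "name_description", "include_tags", "exclude_tags",
    "include_ingredients", "exclude_ingredients", "count", "reason",
}


def check_response(text):
    """Return (missing_keys, count_ok) for one assistant reply."""
    parsed = json.loads(text)
    missing = sorted(REQUIRED - set(parsed))
    count = parsed.get("count")
    count_ok = isinstance(count, int) and 1 <= count <= 10
    return missing, count_ok


# the raw model output recorded above
raw = """{
  "name_description": "Recipe for a delicious vegan dish.",
  "include_ingredients": ["Vegan protein powder", "Tomato paste", "Canned tomatoes"],
  "count": 6,
  "reason": "Added vegan protein powder and tomato paste as requested."
}"""
print(check_response(raw))
# (['exclude_ingredients', 'exclude_tags', 'include_tags'], True)
```

The check also makes the other problems easier to see: `count` is within range but was changed from the default of 5 without the user asking, and the listed ingredients were never mentioned by the user.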
