# Recipe Finder – Cuisine Classification
### TF-IDF + Logistic Regression (Machine Learning Model)

This notebook trains a machine learning model that predicts the **cuisine type** from a list of ingredients.
The trained model will later be used inside the Streamlit application.

## 0. Setup
Import libraries and define project paths.

In [None]:

import json
import re
from pathlib import Path

import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, f1_score

PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data" / "whats_cooking"
MODEL_DIR = PROJECT_ROOT / "app" / "models"
MODEL_DIR.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_DIR exists?", DATA_DIR.exists())
print("MODEL_DIR:", MODEL_DIR)


## 1. Load Dataset
Download `train.json` from the Kaggle "What's Cooking" dataset and place it at `data/whats_cooking/train.json`.

In [None]:

train_path = DATA_DIR / "train.json"
if not train_path.exists():
    raise FileNotFoundError(f"Missing {train_path}. Please download the Kaggle dataset.")

with open(train_path, "r", encoding="utf-8") as f:
    rows = json.load(f)

df = pd.DataFrame(rows)

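def normalize_ingredient(s: str) -> str:
    """Lowercase an ingredient and keep only ASCII letters and single spaces."""
    s = s.lower()
    # Replace digits, punctuation, and any non-ASCII characters with a space.
    s = re.sub(r"[^a-z\s]+", " ", s)
    # Collapse repeated whitespace.
    s = re.sub(r"\s+", " ", s).strip()
    return s

# Join each recipe's normalized ingredients into one space-separated string.
df["text"] = df["ingredients"].apply(lambda ingr: " ".join(normalize_ingredient(x) for x in ingr))
df = df[["text", "cuisine"]].dropna()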
df.head()
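One side effect of the pattern `[^a-z\s]+` is that accented letters count as separators, so an ingredient such as "jalapeño" is split into two tokens. The sketch below reproduces the notebook's function and compares it with a possible alternative that folds accents to ASCII first. `normalize_ingredient_folded` is a hypothetical name that does not appear in the notebook, and switching to it would change the features the model is trained on.

```python
import re
import unicodedata


def normalize_ingredient(s: str) -> str:
    # Same logic as the notebook's function: ASCII letters and whitespace only.
    s = s.lower()
    s = re.sub(r"[^a-z\s]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def normalize_ingredient_folded(s: str) -> str:
    # Hypothetical variant: decompose accented characters (NFKD) and drop the
    # combining marks before applying the same cleanup.
    s = unicodedata.normalize("NFKD", s)
    s = s.encode("ascii", "ignore").decode("ascii")
    return normalize_ingredient(s)


sample = "Jalape\u00f1o Peppers (2 oz.)"
original = normalize_ingredient(sample)
folded = normalize_ingredient_folded(sample)
print(repr(original))  # the accent splits the word
print(repr(folded))    # the accent is folded to a plain "n"
```

Whichever variant is used, the Streamlit app has to apply the same function at prediction time, because normalization happens outside the saved pipeline.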


## 2. Train–Validation Split

In [None]:

X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["cuisine"], test_size=0.2, random_state=42, stratify=df["cuisine"]
)
len(X_train), len(X_val), len(set(y_train))
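The `stratify` argument keeps each cuisine's share roughly the same in the training and validation sets, which matters when some cuisines are much rarer than others. A minimal sketch on toy labels (illustrative data, not the Kaggle set) shows the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 class imbalance (illustrative data only).
texts = [f"recipe {i}" for i in range(100)]
labels = ["italian"] * 80 + ["thai"] * 20

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both splits keep the 80/20 ratio between the two classes.
train_counts = Counter(y_tr)
val_counts = Counter(y_va)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```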


## 3. Train Model – TF-IDF + Logistic Regression

In [None]:

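pipeline = Pipeline([
    # Unigrams and bigrams; ignore terms seen in fewer than 2 recipes.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=150000)),
    # A high max_iter gives the solver room to converge on the sparse TF-IDF features.
    ("clf", LogisticRegression(max_iter=2000)),
])
pipeline.fit(X_train, y_train)

pred = pipeline.predict(X_val)
# For single-label multiclass data, micro-averaged F1 is equal to accuracy.
f1 = f1_score(y_val, pred, average="micro")
print("Validation micro-F1:", round(f1, 4))
print("\nClassification Report:\n", classification_report(y_val, pred))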
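When reading the score, note that micro-averaged F1 on a single-label multiclass problem is the same number as accuracy, so it can hide weak performance on rare cuisines. Macro-averaged F1, which weights every class equally, is one possible complement. The toy labels below are made up purely to illustrate the difference:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: the single "mexican" recipe is misclassified as "italian".
y_true = ["italian"] * 4 + ["mexican", "indian"]
y_pred = ["italian"] * 4 + ["italian", "indian"]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", round(acc, 4))
print("micro-F1:", round(micro, 4))
print("macro-F1:", round(macro, 4))
```

Here the micro score matches accuracy, while the macro score is lower because the missed class counts as much as the frequent one.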


## 4. Save Model

In [None]:

model_path = MODEL_DIR / "cuisine_pipeline.joblib"
joblib.dump(pipeline, model_path)
model_path
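On the application side, the saved file can be read back with `joblib.load`. The sketch below is a round trip on a tiny stand-in pipeline written to a temporary directory; the toy data and the name `toy_pipeline` are assumptions for illustration, and the real app would load `cuisine_pipeline.joblib` from `app/models` instead.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in pipeline trained on toy data (not the real training set).
texts = [
    "tomato basil olive oil",
    "soy sauce ginger rice",
    "tomato mozzarella basil",
    "rice soy sauce sesame oil",
]
labels = ["italian", "chinese", "italian", "chinese"]
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(texts, labels)

# Save and reload, as the app would do with the real model file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cuisine_pipeline.joblib"
    joblib.dump(toy_pipeline, path)
    loaded = joblib.load(path)

query = ["basil tomato parmesan"]
original_pred = toy_pipeline.predict(query)[0]
loaded_pred = loaded.predict(query)[0]
print(original_pred, loaded_pred)
```

Two practical points: the file should be loaded with the same scikit-learn version that produced it, and the app needs its own copy of `normalize_ingredient`, since that step is not part of the pipeline.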


## 5. Quick Test

In [None]:

def classify_ingredients(ingredients):
    """Return the most likely cuisine and its predicted probability."""
    text = " ".join(normalize_ingredient(x) for x in ingredients)
    # The pipeline applies TF-IDF before the classifier, so no manual transform is needed.
    proba = pipeline.predict_proba([text])[0]
    idx = int(proba.argmax())
    return pipeline.named_steps["clf"].classes_[idx], float(proba[idx])

classify_ingredients(["tomato","basil","olive oil","parmesan"])
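For a recipe finder it may be more useful to show a few candidate cuisines instead of a single label. The following is a minimal sketch of a top-k helper; `top_k_cuisines` and the toy training data are hypothetical and not part of the notebook, and the step names `tfidf` and `clf` are assumed to match the pipeline above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def top_k_cuisines(model, text, k=3):
    """Return the k most probable (cuisine, probability) pairs, best first."""
    proba = model.predict_proba([text])[0]
    classes = model.named_steps["clf"].classes_
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(proba)[::-1][:k]
    return [(str(classes[i]), float(proba[i])) for i in order]


# Toy training data for illustration only.
texts = [
    "tomato basil olive oil", "mozzarella tomato basil",
    "soy sauce ginger rice", "rice sesame oil soy sauce",
    "tortilla beans cilantro lime", "cilantro lime avocado tortilla",
]
labels = ["italian", "italian", "chinese", "chinese", "mexican", "mexican"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(texts, labels)

top2 = top_k_cuisines(toy_pipeline, "tomato basil olive oil parmesan", k=2)
print(top2)
```

With the real model, the input text would first go through `normalize_ingredient`, as in `classify_ingredients`.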
