
# 🧪 SmartRAG: RAG Evaluation (Business POC) with **RAGAS**, Full POC Version

This notebook computes the requested **evaluation metrics** and runs **advanced analyses** on a CSV file that contains:
- **Business references**: `question`, `reference_answer`, `sharepoint_document`
- **RAG system outputs**: `ragas_question`, `ragas_answer`, `ragas_contexts`, `ragas_ground_truth`

### RAGAS metrics computed
1. 🎯 **Faithfulness**: factual consistency between the *answer and the contexts*
2. ✅ **Answer Correctness**: *answer vs. business ground truth (`reference_answer`)*
3. 💬 **Relevancy**: *answer vs. question* (auto-detects `response_relevancy`/`answer_relevancy`)
4. 🎯 **Context Precision**: relevance of the retrieved contexts
5. 📚 **Context Recall**: completeness of the retrieved contexts

### Additional analyses (POC)
- **Document-level measures** (from `sharepoint_document` and `ragas_ground_truth`): *doc-precision/recall/F1*
- **Detailed diagnostics** (lengths, Jaccard coverage, number of contexts, etc.)
- **Visualizations** and **explanations** (histograms, scatter plots, correlations, effect of `n_contexts`, per-document breakdown)
- **Automatic recommendations** and an **experiment plan** (PDF preprocessing, embeddings, chunking, hybrid search, reranking, generation prompts)


## 0) Installation & checks

In [1]:
# 1) Install the widget dependencies
#    (later cells also import ragas, datasets, pandas, matplotlib and the LangChain package of the chosen provider)
!pip install -U ipywidgets tqdm

# 2) No need for "jupyter nbextension ..." in JupyterLab (it is expected that the command does not exist)
# 3) IMPORTANT: restart the kernel here (Kernel > Restart)

# 4) After restarting:
import ipywidgets as W
from IPython.display import display
try:
    display(W.IntProgress(min=0, max=1, description="widgets OK"))
    from tqdm.notebook import tqdm
    print("tqdm notebook OK ✅")
except Exception as e:
    # Fall back to the console progress bar if the widgets cannot be rendered
    from tqdm.auto import tqdm
    print("Fallback tqdm auto ⚠️ ->", e)

# (test)
_ = [x for x in tqdm(range(50), desc="Test tqdm (widgets)", leave=False)]



IntProgress(value=0, description='widgets OK', max=1)

tqdm notebook OK ✅



## 1) Configuration: LLM & file paths

In [None]:

import os

# LLM provider: "openai" | "claude" | "gemini" | "ollama"
RAGAS_LLM_PROVIDER = os.getenv("RAGAS_LLM_PROVIDER", "openai").lower()

OPENAI_MODEL  = os.getenv("OPENAI_MODEL",  "gpt-4o-mini")
CLAUDE_MODEL  = os.getenv("CLAUDE_MODEL",  "claude-3-5-sonnet-20240620")
GEMINI_MODEL  = os.getenv("GEMINI_MODEL",  "gemini-1.5-pro")
OLLAMA_MODEL  = os.getenv("OLLAMA_MODEL",  "llama3.1:8b")

# API keys are expected in the environment. Do not assign a placeholder here:
# it would overwrite a real key that is already exported.
_KEY_BY_PROVIDER = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}
_key_name = _KEY_BY_PROVIDER.get(RAGAS_LLM_PROVIDER)
if _key_name and not os.getenv(_key_name):
    print(f"⚠️ {_key_name} is not set; calls to the LLM will fail.")

# Data
DATA_PATH = os.getenv("DATA_PATH", "reference_qa_manuel_template.csv")

# Outputs
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "outputs")
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Provider:", RAGAS_LLM_PROVIDER)
print("Data path:", DATA_PATH)
print("Output dir:", OUTPUT_DIR)


## 2) Loading the CSV & preview

In [None]:

import pandas as pd
import os

# Fall back to the repository's reference location if DATA_PATH does not exist
if not os.path.exists(DATA_PATH):
    alt = '../data/reference/reference_qa_manuel_template.csv'
    if os.path.exists(alt):
        DATA_PATH = alt
        print(f"INFO: DATA_PATH not found, using {alt}")
    else:
        raise FileNotFoundError(f"CSV not found: {DATA_PATH}")

raw_df = pd.read_csv(DATA_PATH)
print("Shape:", raw_df.shape)
display(raw_df.head(5))
print("Columns:", list(raw_df.columns))



## 3) Normalization → keys expected by RAGAS

**POC → RAGAS mapping**
- `question` (business reference) **and** `ragas_question` (question actually sent to the system)
  → **`ragas_question`** is used when present, otherwise `question`
- `ragas_answer` → **`answer`**
- `ragas_contexts` → **`contexts`** (List[str])
- `reference_answer` → **`ground_truth`** (text used by `answer_correctness`)
- `sharepoint_document` → **`reference_docs`** (List[str]), used for document-level analyses
- `ragas_ground_truth` → **`cited_docs`** (List[str]), the documents cited by the answer


In [None]:

import ast, math
import pandas as pd

df = raw_df.copy()

# Columns expected on the POC side (reference + RAG)
COLS_REQUIRED = [
    "question", "reference_answer", "sharepoint_document",
    "ragas_question", "ragas_answer", "ragas_contexts", "ragas_ground_truth"
]
missing = [c for c in COLS_REQUIRED if c not in df.columns]
if missing:
    print("⚠️ Missing columns (fine if intentionally absent):", missing)

def to_list_generic(x):
    "Convert a cell to List[str]: serialized lists, separators (|||, §§, ;;, ##, newline, comma), or a plain string."
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return []
    if isinstance(x, list):
        return [str(xx).strip() for xx in x if str(xx).strip()]
    if isinstance(x, str):
        s = x.strip()
        if (s.startswith('[') and s.endswith(']')) or (s.startswith('(') and s.endswith(')')):
            try:
                parsed = ast.literal_eval(s)
                if isinstance(parsed, (list, tuple)):
                    return [str(xx).strip() for xx in parsed if str(xx).strip()]
            except Exception:
                pass
        # Split on the first separator found. Caution: the newline and comma fallbacks
        # also split free text, so prefer an explicit separator such as "|||" for contexts.
        for sep in ["|||", "§§", ";;", "##", "\n", ","]:
            if sep in s:
                parts = [p.strip() for p in s.split(sep)]
                return [p for p in parts if p]
        # An empty cell yields an empty list (not [""]), so n_contexts stays at 0
        return [s] if s else []
    return [str(x).strip()]

def col_or_empty(name):
    "Return column `name` with NaN replaced by '', or an all-empty Series if the column is absent."
    if name in df.columns:
        return df[name].fillna("")
    return pd.Series([""] * len(df), index=df.index, dtype=object)

# question: use ragas_question when it is filled in, otherwise fall back to question
rq = col_or_empty("ragas_question").astype(str)
q_series = rq.where(rq.str.strip() != "", col_or_empty("question").astype(str))

# answer
a_series = col_or_empty("ragas_answer")

# contexts
contexts = [to_list_generic(v) for v in col_or_empty("ragas_contexts").tolist()]

# ground_truth (business text)
gt_series = col_or_empty("reference_answer")

# reference_docs (list)
reference_docs = [to_list_generic(v) for v in col_or_empty("sharepoint_document").tolist()]

# cited_docs (list): documents cited by the answer
cited_docs = [to_list_generic(v) for v in col_or_empty("ragas_ground_truth").tolist()]

dataset_dict = {
    "question": q_series.astype(str).tolist(),
    "answer": a_series.astype(str).tolist(),
    "contexts": contexts,
    "ground_truth": gt_series.astype(str).tolist(),
}

print("Example question:", dataset_dict["question"][0] if len(dataset_dict["question"]) else "n/a")
print("Example answer:", dataset_dict["answer"][0] if len(dataset_dict["answer"]) else "n/a")
print("Example contexts[0]:", dataset_dict["contexts"][0][:2] if len(dataset_dict["contexts"]) else "n/a")
print("Example ground_truth:", dataset_dict["ground_truth"][0] if len(dataset_dict["ground_truth"]) else "n/a")

# Document information (outside RAGAS)
aux_docs = {
    "reference_docs": reference_docs,
    "cited_docs": cited_docs,
}


## 4) Building the Dataset (HuggingFace)

In [None]:

from datasets import Dataset as HFDataset
hf_dataset = HFDataset.from_dict(dataset_dict)
hf_dataset


## 5) RAGAS-compatible LLM (wrapper)

In [None]:

from ragas.llms import LangchainLLMWrapper

def build_llm(provider: str):
    "Build the LangChain chat model for `provider` and wrap it for RAGAS (temperature 0 for reproducible scoring)."
    provider = provider.lower().strip()
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        lc = ChatOpenAI(model=OPENAI_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "claude":
        from langchain_anthropic import ChatAnthropic
        lc = ChatAnthropic(model=CLAUDE_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI
        lc = ChatGoogleGenerativeAI(model=GEMINI_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "ollama":
        try:
            from langchain_community.chat_models import ChatOllama
            lc = ChatOllama(model=OLLAMA_MODEL, temperature=0)
        except Exception:
            # Fall back to the completion-style wrapper if the chat model is unavailable
            from langchain_community.llms import Ollama
            lc = Ollama(model=OLLAMA_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    else:
        raise ValueError(f"Unsupported provider: {provider}")

llm = build_llm(RAGAS_LLM_PROVIDER)
print("✅ LLM ready for RAGAS:", type(llm).__name__, "| provider:", RAGAS_LLM_PROVIDER)


## 6) RAGAS metrics (auto-detection of *Relevancy*)

In [None]:

from ragas.metrics import faithfulness, answer_correctness, context_precision, context_recall

# Auto-detect the right alias depending on the installed ragas version
try:
    from ragas.metrics import response_relevancy as _relevancy_metric
    RELEVANCY_NAME = "response_relevancy"
except Exception:
    from ragas.metrics import answer_relevancy as _relevancy_metric
    RELEVANCY_NAME = "answer_relevancy"

metrics = [
    faithfulness,           # 1. Faithfulness (answer vs contexts)
    answer_correctness,     # 2. Correctness (answer vs ground_truth)
    _relevancy_metric,      # 3. Relevancy (answer vs question)
    context_precision,      # 4. Precision of the retrieved contexts
    context_recall,         # 5. Recall of the retrieved contexts
]

print("Relevancy metric selected:", RELEVANCY_NAME)
metrics


## 7) Running the evaluation

In [None]:

from ragas import evaluate
import os

result = evaluate(
    dataset=hf_dataset,
    metrics=metrics,
    llm=llm,
    raise_exceptions=False,  # a failed row yields NaN instead of aborting the run
    show_progress=False,     # avoids the ipywidgets dependency
)

print("✅ Evaluation finished.")
df_results = result.to_pandas()
display(df_results.head(10))

csv_out = os.path.join(OUTPUT_DIR, "ragas_raw_results.csv")
df_results.to_csv(csv_out, index=False, encoding="utf-8")
print("Results saved ->", csv_out)


## 8) Score summary (0 to 1)

In [None]:

import numpy as np
import json
from datetime import datetime
import os

# The relevancy column name depends on the ragas version
rel_col = "response_relevancy" if "response_relevancy" in df_results.columns else (
    "answer_relevancy" if "answer_relevancy" in df_results.columns else None
)

wanted_cols = ["faithfulness", "answer_correctness", "context_precision", "context_recall"]
if rel_col:
    wanted_cols.insert(2, rel_col)

# Mean per metric, ignoring NaN rows (failed evaluations)
present = [c for c in wanted_cols if c in df_results.columns]
summary = {c: float(np.nanmean(df_results[c])) for c in present}

print("📊 Mean scores:")
for k, v in summary.items():
    print(f" - {k}: {v:.3f}")

summary_out = os.path.join(OUTPUT_DIR, "ragas_summary.json")
with open(summary_out, "w", encoding="utf-8") as f:
    json.dump({
        "generated_at": datetime.now().isoformat(),
        "provider": RAGAS_LLM_PROVIDER,
        "model": {
            "openai": OPENAI_MODEL,
            "claude": CLAUDE_MODEL,
            "gemini": GEMINI_MODEL,
            "ollama": OLLAMA_MODEL,
        }.get(RAGAS_LLM_PROVIDER, "n/a"),
        "scores": summary,
    }, f, ensure_ascii=False, indent=2)

print("Summary saved ->", summary_out)



## 9) Enriched diagnostics & derived features

We add signals that help with **RAG debugging**:
- `n_contexts`, `avg_context_len`, `total_context_len`
- `answer_len`, `question_len`, `gt_len`
- `context_coverage_jaccard`: word-set similarity between *ground_truth and the concatenated contexts*
- `answer_coverage_jaccard`: word-set similarity between *ground_truth and the answer*
- **Doc-level**: *doc_precision/recall/F1* computed from `sharepoint_document` (reference) and `ragas_ground_truth` (cited)
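
To make the coverage and document-level formulas above concrete, here is a minimal worked example on toy values. The sentences and file names are invented for illustration; the two helpers follow the definitions listed above (Jaccard over lowercase word sets, and precision/recall/F1 over document names).

```python
import re

def jaccard_words(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    A = {w.lower() for w in re.findall(r"\w+", a or "")}
    B = {w.lower() for w in re.findall(r"\w+", b or "")}
    if not A and not B:
        return 0.0
    return len(A & B) / len(A | B)

def doc_prf(reference_docs, cited_docs):
    """Precision, recall and F1 of cited documents against the reference documents."""
    refs = {d.lower() for d in reference_docs}
    cits = {d.lower() for d in cited_docs}
    inter = refs & cits
    p = len(inter) / len(cits) if cits else 0.0
    r = len(inter) / len(refs) if refs else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Toy values (hypothetical): 3 shared words out of 5 distinct words
ground_truth = "refund within 30 days"
answer = "refund within 14 days"
coverage = jaccard_words(ground_truth, answer)

# Toy values (hypothetical): 1 of 3 cited documents is expected, 1 of 2 expected documents is cited
p, r, f1 = doc_prf(["policy.pdf", "faq.pdf"], ["policy.pdf", "terms.pdf", "pricing.pdf"])

print(f"answer_coverage_jaccard = {coverage:.2f}")
print(f"doc_precision = {p:.2f}, doc_recall = {r:.2f}, doc_f1 = {f1:.2f}")
```

Because the Jaccard score works on word sets, it ignores word order and repetition: it is a cheap lexical signal, not a semantic one.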


In [None]:

import numpy as np, pandas as pd, re, os

def wc(s: str) -> int:
    "Word count (runs of word characters)."
    if not isinstance(s, str):
        s = str(s)
    return len(re.findall(r"\w+", s))

def jaccard_words(a: str, b: str) -> float:
    "Jaccard similarity between the lowercase word sets of two strings."
    A = set([w.lower() for w in re.findall(r"\w+", a or "")])
    B = set([w.lower() for w in re.findall(r"\w+", b or "")])
    if not A and not B:
        return 0.0
    return len(A & B) / max(1, len(A | B))

def f1(p, r):
    "Harmonic mean of precision and recall (0 when both are 0)."
    if p + r == 0:
        return 0.0
    return 2*p*r/(p+r)

# Concatenate the contexts of each row
contexts_concat = ["\n".join(c) if isinstance(c, (list, tuple)) else str(c) for c in dataset_dict["contexts"]]

# Doc-level metrics: cited documents vs reference documents (case-insensitive name match)
doc_precisions, doc_recalls, doc_f1s = [], [], []
for i in range(len(aux_docs["reference_docs"])):
    refs = set([x.lower() for x in aux_docs["reference_docs"][i]])
    cits = set([x.lower() for x in aux_docs["cited_docs"][i]])
    inter = refs & cits
    p = len(inter)/len(cits) if len(cits)>0 else 0.0
    r = len(inter)/len(refs) if len(refs)>0 else 0.0
    doc_precisions.append(p); doc_recalls.append(r); doc_f1s.append(f1(p,r))

enriched = pd.DataFrame({
    "question": dataset_dict["question"],
    "answer": dataset_dict["answer"],
    "ground_truth": dataset_dict["ground_truth"],
    "contexts_concat": contexts_concat,
    "n_contexts": [len(c) if isinstance(c, (list, tuple)) else 0 for c in dataset_dict["contexts"]],
    "avg_context_len": [np.mean([wc(x) for x in c]) if isinstance(c, (list, tuple)) and len(c)>0 else 0 for c in dataset_dict["contexts"]],
    "total_context_len": [np.sum([wc(x) for x in c]) if isinstance(c, (list, tuple)) else 0 for c in dataset_dict["contexts"]],
    "answer_len": [wc(a) for a in dataset_dict["answer"]],
    "question_len": [wc(q) for q in dataset_dict["question"]],
    "gt_len": [wc(g) for g in dataset_dict["ground_truth"]],
    "context_coverage_jaccard": [jaccard_words(dataset_dict["ground_truth"][i], contexts_concat[i]) for i in range(len(contexts_concat))],
    "answer_coverage_jaccard": [jaccard_words(dataset_dict["ground_truth"][i], dataset_dict["answer"][i]) for i in range(len(contexts_concat))],
    "doc_precision": doc_precisions,
    "doc_recall": doc_recalls,
    "doc_f1": doc_f1s,
})

# The ragas results table may repeat input columns (question, answer, ...):
# drop those already in `enriched` so df_all has no duplicate column names
res = df_results.reset_index(drop=True)
res = res.drop(columns=[c for c in res.columns if c in enriched.columns])
df_all = pd.concat([enriched, res], axis=1)

enriched_out = os.path.join(OUTPUT_DIR, "ragas_results_enriched.csv")
df_all.to_csv(enriched_out, index=False, encoding="utf-8")
print("Enriched ->", enriched_out)
display(df_all.head(5))



## 10) Visualizations: distributions (histograms)

These plots show **how the scores are spread**: concentrated near 1 is good; concentrated near 0 needs work; widely dispersed indicates unstable behavior.


In [None]:

import matplotlib.pyplot as plt
import os

rel_col = "response_relevancy" if "response_relevancy" in df_all.columns else ("answer_relevancy" if "answer_relevancy" in df_all.columns else None)
score_cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_precision","doc_recall","doc_f1"]
if rel_col:
    score_cols.insert(2, rel_col)

# One histogram per available score column, saved as PNG in OUTPUT_DIR
for col in score_cols:
    if col in df_all.columns:
        plt.figure()
        df_all[col].dropna().plot(kind="hist", bins=10, title=f"Distribution: {col}")
        plt.xlabel(col); plt.ylabel("Frequency")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, f"hist_{col}.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()



**How to read:**
- If `doc_recall` is low, the system **does not cite** enough of the right documents (look at *preprocessing*, *retrieval*, *reranking*).
- If `faithfulness` is low while `context_precision` is high, the problem is **generation hallucinations** (adjust prompts/LLM).
- If both `context_precision` and `context_recall` are low, **retrieval needs optimizing** (hybrid search, top-k, embeddings, chunking).
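
The three reading rules above can be expressed as a small rule-based check. This is only a sketch: the function name `diagnose`, the labels it returns, and the single cut-off of 0.6 separating "low" from "high" are assumptions made for illustration, and the scores in the example are invented.

```python
def diagnose(scores: dict, low: float = 0.6) -> list:
    """Map mean scores to the likely problem areas, following the three reading rules.

    `low` is a hypothetical cut-off: below it a score counts as low,
    at or above it the score counts as high. Missing metrics never trigger a rule.
    """
    def is_low(name):
        return name in scores and scores[name] < low

    def is_high(name):
        return name in scores and scores[name] >= low

    findings = []
    if is_low("doc_recall"):
        findings.append("citation")    # the right documents are not cited
    if is_low("faithfulness") and is_high("context_precision"):
        findings.append("generation")  # good contexts, unfaithful answer
    if is_low("context_precision") and is_low("context_recall"):
        findings.append("retrieval")   # the retrieved contexts are the weak link
    return findings

# Invented mean scores for illustration
example_scores = {
    "faithfulness": 0.45,
    "context_precision": 0.82,
    "context_recall": 0.70,
    "doc_recall": 0.90,
}
print(diagnose(example_scores))
```

In practice the cut-off would be tuned per metric, since the five scores are not on directly comparable scales.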



## 11) Error focus: the K weakest items

To **prioritize** investigations, here are the worst cases per metric.


In [None]:

import pandas as pd

def topk_worst(col, k=15):
    "Return the k rows with the lowest value of `col` (empty DataFrame if the column is absent)."
    if col not in df_all.columns:
        return pd.DataFrame()
    sub = df_all[["question","answer","ground_truth", col]].copy()
    sub = sub.sort_values(col, ascending=True).head(k)
    return sub

print("### Worst 'answer_correctness'")
display(topk_worst("answer_correctness"))

print("\n### Worst 'faithfulness'")
display(topk_worst("faithfulness"))

if rel_col:
    print(f"\n### Worst '{rel_col}'")
    display(topk_worst(rel_col))

print("\n### Worst 'doc_recall' (missing documents)")
display(topk_worst("doc_recall"))



**How to read:**
- Look at **the contexts** attached to these cases: the cause is often a chunking problem or a PDF extraction problem.
- Compare `sharepoint_document` with `ragas_ground_truth` to see **which expected documents are missing**.
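
The comparison between expected and cited documents can be sketched as a set difference. The helper name `missing_documents` and the file names are hypothetical; names are compared after trimming and lowercasing, the same normalization the doc-level metrics use.

```python
def missing_documents(expected, cited):
    """Return the expected documents that the answer did not cite.

    Names are compared case-insensitively and ignoring surrounding spaces;
    the original spelling of the expected name is kept in the result.
    """
    cited_norm = {name.strip().lower() for name in cited}
    return [name for name in expected if name.strip().lower() not in cited_norm]

# Hypothetical file names for illustration
expected = ["Policy_RH.pdf", "Guide_Conges.pdf"]
cited = ["policy_rh.pdf ", "Autre_Document.pdf"]

print(missing_documents(expected, cited))
```

If a document shows up as missing only because it is named differently on each side (extension, prefix, path), the fix is to align naming rather than to change retrieval.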



## 12) Scatter diagnostics: relationships between metrics

Scatter plots to **detect patterns**: too many contexts? low overlap? unexpected correlations?


In [None]:

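import os
import matplotlib.pyplot as plt

def scatter_xy(x, y):
    "Scatter plot of two df_all columns, saved to OUTPUT_DIR; skipped if either column is missing."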
    if x in df_all.columns and y in df_all.columns:
        plt.figure()
        plt.scatter(df_all[x], df_all[y])
        plt.xlabel(x); plt.ylabel(y)
        plt.title(f"{x} vs {y}")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, f"scatter_{x}_vs_{y}.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()

scatter_xy("faithfulness", "answer_correctness")
if rel_col: scatter_xy(rel_col, "answer_correctness")
scatter_xy("context_precision", "context_recall")
scatter_xy("n_contexts", "answer_correctness")
scatter_xy("avg_context_len", "answer_correctness")
scatter_xy("context_coverage_jaccard", "answer_correctness")
scatter_xy("answer_coverage_jaccard", "answer_correctness")
scatter_xy("doc_recall", "answer_correctness")



**How to read:**
- `answer_correctness` falls as `n_contexts` rises → **too much noise**: reduce top-k or **rerank**.
- `answer_correctness` rises with `context_coverage_jaccard` → the best contexts are those **better aligned** with the business ground truth.
- `answer_correctness` rises with `doc_recall` → citing the **right documents** improves correctness.



## 13) Correlations: matrix (Pearson)

Useful to **prioritize levers**: what correlates most strongly with `answer_correctness`?


In [None]:

import os
import numpy as np, matplotlib.pyplot as plt

# Pearson correlation between all numeric columns
num_cols = df_all.select_dtypes(include=[np.number]).columns.tolist()
corr = df_all[num_cols].corr()

plt.figure()
plt.imshow(corr, aspect='auto')
plt.xticks(range(len(num_cols)), num_cols, rotation=90)
plt.yticks(range(len(num_cols)), num_cols)
plt.colorbar()
plt.title("Correlation matrix (Pearson)")
plt.tight_layout()
png = os.path.join(OUTPUT_DIR, "corr_matrix.png")
plt.savefig(png); print("Saved:", png)
plt.show()

if "answer_correctness" in corr.columns:
    # Exclude the self-correlation (always 1.0) so the ranking only lists other variables
    ranked = corr["answer_correctness"].drop("answer_correctness").sort_values(ascending=False)
    display(ranked.to_frame().head(10))



**How to read:** The variables at the top of the list are **priority candidates** for optimization (e.g. `doc_recall`, `context_coverage_jaccard`).
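
As a self-contained illustration of this ranking, the sketch below computes the Pearson coefficient by hand and orders a few features by the strength of their correlation with `answer_correctness`. The numbers are invented, and ranking by absolute value (so that strong negative correlations also surface) is an assumption of this example, not what the cell above does.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(vx * vy)

def rank_by_correlation(columns: dict, target: str):
    """Return (feature, r) pairs sorted by |r| with the target, strongest first."""
    pairs = [
        (name, pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target  # skip the self-correlation
    ]
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)

# Invented values for four evaluated questions
toy = {
    "answer_correctness": [0.9, 0.7, 0.5, 0.3],
    "doc_recall": [1.0, 0.8, 0.6, 0.4],
    "n_contexts": [2, 4, 5, 8],
    "question_len": [12, 9, 12, 9],
}

ranked = rank_by_correlation(toy, "answer_correctness")
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

With only a handful of rows, as in a small POC reference set, these coefficients are unstable, so they are better used to order investigations than to draw conclusions.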



## 14) Effect of the number of contexts: aggregation by *bins*

Shows how the metrics evolve across classes of `n_contexts` (guides choices on **top-k**, **reranking**, **hybrid search**).


In [None]:

import os
import pandas as pd, matplotlib.pyplot as plt

def agg_by_bins(col, bins=(-1,0,1,2,3,4,6,10,999)):
    "Mean of each metric per bin of `col`. Bins are right-closed; starting at -1 keeps rows where col == 0."
    if col not in df_all.columns:
        return pd.DataFrame()
    b = pd.cut(df_all[col], bins=bins, right=True)
    cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_recall"]
    if rel_col and rel_col in df_all.columns: cols.append(rel_col)
    cols = [c for c in cols if c in df_all.columns]  # ignore metrics that were not computed
    agg = df_all.groupby(b, observed=False)[cols].mean(numeric_only=True)
    return agg.reset_index()

agg_ctx = agg_by_bins("n_contexts")
display(agg_ctx)

# One bar chart per metric: mean value in each n_contexts bin
for c in [c for c in agg_ctx.columns if c != "n_contexts"]:
    plt.figure()
    x = agg_ctx.iloc[:,0].astype(str)
    y = agg_ctx[c]
    plt.bar(x, y)
    plt.xticks(rotation=30, ha="right")
    plt.ylabel(c)
    plt.title(f"Mean {c} per n_contexts bin")
    plt.tight_layout()
    outp = os.path.join(OUTPUT_DIR, f"bar_{c}_by_ncontexts_bins.png")
    plt.savefig(outp); print("Saved:", outp)
    plt.show()



**How to read:** If `answer_correctness` drops beyond a certain **top-k**, that is a signal to **reduce** the number of passages sent to generation and/or to **strengthen reranking**.
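
One simple way to locate that turning point is to average `answer_correctness` per exact number of contexts and look at where the mean peaks. The sketch below uses invented (n_contexts, score) pairs and a hypothetical helper name; it is a reading aid, not a tuning procedure.

```python
from collections import defaultdict

def mean_score_by_n_contexts(pairs):
    """Average the score for each distinct number of contexts.

    `pairs` is an iterable of (n_contexts, score); the result maps
    n_contexts -> mean score, ordered by n_contexts.
    """
    buckets = defaultdict(list)
    for n, score in pairs:
        buckets[n].append(score)
    return {n: sum(v) / len(v) for n, v in sorted(buckets.items())}

# Invented observations: correctness improves up to 4 contexts, then degrades
pairs = [(2, 0.60), (2, 0.70), (4, 0.80), (4, 0.90), (8, 0.55), (8, 0.65)]

means = mean_score_by_n_contexts(pairs)
best_n = max(means, key=means.get)

for n, m in means.items():
    print(f"n_contexts={n}: mean answer_correctness={m:.2f}")
print("Highest mean at n_contexts =", best_n)
```

A peak found this way is only suggestive when some groups contain very few questions; checking the group sizes alongside the means avoids over-reading a single outlier.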



## 15) Analysis per document/source (`sharepoint_document`)

Identify the **documents that cause problems** (PDF extraction, structure, obsolescence).


In [None]:

import os
import pandas as pd, matplotlib.pyplot as plt

src_col = "sharepoint_document" if "sharepoint_document" in raw_df.columns else None

if src_col:
    metric_cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_recall","doc_precision","doc_f1"]
    if rel_col and rel_col in df_all.columns:
        metric_cols.insert(2, rel_col)
    metric_cols = [c for c in metric_cols if c in df_all.columns]  # ignore metrics that were not computed
    per_src = pd.concat([raw_df[[src_col]], df_all[metric_cols]], axis=1)
    # Mean of each metric per source document, best answer_correctness first
    agg_src = per_src.groupby(src_col).mean(numeric_only=True).sort_values("answer_correctness", ascending=False)
    display(agg_src.head(10))
    # The lowest-ranked documents are the ones to investigate first
    display(agg_src.tail(10))

    topk = agg_src.head(10)
    if len(topk) > 0:
        plt.figure()
        plt.bar(topk.index.astype(str), topk["answer_correctness"])
        plt.xticks(rotation=45, ha="right")
        plt.ylabel("answer_correctness")
        plt.title("Top documents: mean answer_correctness")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, "bar_top_docs_answer_correctness.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()
else:
    print("No 'sharepoint_document' column detected.")



**How to read:** The documents at the bottom of the ranking often need more robust **PDF preprocessing** (OCR, header/footer cleanup, table handling, faithful text extraction).



## 16) Automatic recommendations (driven by the metrics)

Optimization leads derived from the signals: **PDF preprocessing, embeddings, chunking, retrieval (hybrid), reranking, prompts**.


In [None]:

import os

reco = []

def add(msg):
    print("•", msg); reco.append("• " + msg)

mean = df_all.mean(numeric_only=True).to_dict()

def below(metric, threshold):
    "True only if the metric was computed and its mean is under the threshold (missing or NaN never triggers)."
    v = mean.get(metric)
    return v is not None and v < threshold

def at_least(metric, threshold):
    "True only if the metric was computed and its mean reaches the threshold."
    v = mean.get(metric)
    return v is not None and v >= threshold

if below("faithfulness", 0.6):
    add("Low faithfulness: quote passages explicitly (verbatim), lower the temperature (0), constrain the answer (format, references).")

if below("answer_correctness", 0.6) and at_least("context_recall", 0.6):
    add("Incorrect answers despite adequate recall: strengthen the **generation prompts** (strict extraction) and **post-verification** (self-check).")

if below("context_recall", 0.6):
    add("Low recall: increase **top-k**, use **hybrid search** (BM25 + vector), improve **chunking** and **PDF preprocessing** (OCR, headers/footers).")

if below("context_precision", 0.6):
    add("Low precision: add a **reranker** (cross-encoder/LLM), lower **top-k** before generation, filter by **metadata**.")

if "n_contexts" in df_all and "answer_correctness" in df_all and df_all["n_contexts"].corr(df_all["answer_correctness"]) < -0.15:
    add("Too many contexts hurt correctness: **reduce top-k** and/or rerank more aggressively.")

if below("doc_recall", 0.6):
    add("Low document recall: **align document naming**, improve **automatic citations** and **PDF preprocessing** (faithful extraction).")

if below("context_coverage_jaccard", 0.4):
    add("Low GT↔contexts overlap: check PDF/OCR extraction, use **embeddings** suited to French and the domain, apply **chunking** by sections/headings.")

# Save
txt_out = os.path.join(OUTPUT_DIR, "auto_recommendations.txt")
with open(txt_out, "w", encoding="utf-8") as f:
    f.write("Automatic recommendations\n\n")
    for r in reco: f.write(r + "\n")
print("Recommendations ->", txt_out)



## 17) Experiment plan (A/B & grid)

Recommended iterations:
- **Preprocessing** *(pdfminer/pymupdf + tesseract OCR, header/footer cleanup, whitespace normalization)*
- **Embeddings** *(French/domain models, dimension, normalization)*
- **Chunking** *(semantic splitting by headings/sections, or hierarchical)*
- **Search** *(hybrid BM25+vector, weighting, metadata filters)*
- **Reranking** *(cross-encoder, LLM-as-a-reranker, top-k)*
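
A grid over these factors can be expanded into individual run configurations with `itertools.product`. The reduced grid below is hypothetical and only illustrates the mechanics; `expand_grid` is a helper introduced for this example.

```python
from itertools import product

def expand_grid(grid: dict) -> list:
    """Return one dict per combination, taking one option from each factor."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# Hypothetical reduced grid: 2 x 2 x 2 options
small_grid = {
    "chunking": [{"method": "fixed", "size": 512}, {"method": "semantic"}],
    "retrieval": [{"type": "vector", "top_k": 8}, {"type": "hybrid", "top_k": 8}],
    "rerank": [{"enabled": False}, {"enabled": True}],
}

runs = expand_grid(small_grid)
print(len(runs), "runs")
for i, run in enumerate(runs[:3]):
    print(i, run)
```

The number of runs is the product of the option counts, so it grows quickly: the grid defined in the next cell (2 preprocessing × 3 embedding × 3 chunking × 3 retrieval × 2 reranking options) yields 108 combinations. Varying one factor at a time against a fixed baseline (A/B) is a cheaper first pass before running a full grid.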


In [None]:

import json, os

param_grid = {
    "preprocess": [
        {"ocr": False, "clean_headers": True, "normalize_ws": True},
        {"ocr": True,  "clean_headers": True, "normalize_ws": True},
    ],
    "embedding": [
        {"provider":"openai","model":"text-embedding-3-large"},
        {"provider":"openai","model":"text-embedding-3-small"},
        {"provider":"nomic","model":"nomic-embed-text"},
    ],
    "chunking": [
        {"method":"fixed","size":512,"overlap":64},
        {"method":"fixed","size":800,"overlap":100},
        {"method":"semantic","size":"auto","overlap":64},
    ],
    "retrieval": [
        {"type":"vector","top_k":8},
        {"type":"hybrid","bm25_weight":0.4,"top_k":8},
        {"type":"hybrid","bm25_weight":0.6,"top_k":12},
    ],
    "rerank": [
        {"enabled": False},
        {"enabled": True, "model":"cross-encoder/ms-marco-MiniLM-L-6-v2", "top_k":5},
    ],
}

print(json.dumps(param_grid, indent=2, ensure_ascii=False))
grid_out = os.path.join(OUTPUT_DIR, "experiment_plan.json")
with open(grid_out, "w", encoding="utf-8") as f:
    json.dump(param_grid, f, ensure_ascii=False, indent=2)
print("Experiment plan ->", grid_out)



## 18) RAG optimization checklist (quick)

- **PDF**: OCR for scans; remove headers/footers/page numbers; handle tables; normalize whitespace/case.
- **Embeddings**: suited to French and the domain; normalization; sufficient vector size; re-index after any change.
- **Chunking**: 512–800 tokens with 64–100 overlap; split by sections/headings; isolate tables/code.
- **Index**: metadata (title, section, date, doc_id) for filtering.
- **Retrieval**: test **hybrid** (BM25+vector) and its weighting; balanced top-k.
- **Reranking**: cross-encoder/LLM; narrow down to 3–5 high-quality passages.
- **Generation**: strict citation prompts (verbatim + doc_id); temperature 0; strict JSON; self-check.
- **Evaluation**: up-to-date reference set; GO/NO-GO thresholds; log the run configuration.
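
The GO/NO-GO item of the checklist can be sketched as a simple gate over the mean scores. The function name `go_no_go`, the threshold values and the scores below are hypothetical; real thresholds would be agreed with the business owners of the reference set.

```python
def go_no_go(scores: dict, thresholds: dict):
    """Compare mean scores with minimum thresholds.

    Returns (is_go, failures) where failures maps each failing metric to
    (observed score or None if missing, required minimum).
    """
    failures = {}
    for metric, minimum in thresholds.items():
        observed = scores.get(metric)
        if observed is None or observed < minimum:
            failures[metric] = (observed, minimum)
    return (not failures), failures

# Hypothetical thresholds and invented scores
thresholds = {"faithfulness": 0.80, "answer_correctness": 0.70, "context_recall": 0.70}
scores = {"faithfulness": 0.85, "answer_correctness": 0.62, "context_recall": 0.74}

is_go, failures = go_no_go(scores, thresholds)
print("GO" if is_go else "NO-GO")
for metric, (observed, minimum) in failures.items():
    print(f" - {metric}: {observed} < {minimum}")
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a run that did not compute a gated metric should not pass by default.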
