
# 🧪 SmartRAG: RAG Evaluation (Business POC) with **RAGAS**, Full POC Version

This notebook implements the requested **evaluation metrics** and a set of **advanced analyses** tailored to your CSV file, which contains:  
- **Business references**: `question`, `reference_answer`, `sharepoint_document`  
- **RAG system outputs**: `ragas_question`, `ragas_answer`, `ragas_contexts`, `ragas_ground_truth`  

### RAGAS metrics computed
1. 🎯 **Faithfulness**: factual consistency, *answer ↔ contexts*  
2. ✅ **Answer Correctness**: *answer ↔ business ground truth (`reference_answer`)*  
3. 💬 **Relevancy**: *answer ↔ question* (auto-detects `response_relevancy`/`answer_relevancy`)  
4. 🎯 **Context Precision**: relevance of the retrieved contexts  
5. 📚 **Context Recall**: completeness of the retrieved contexts  

### Additional analyses (POC)
- **Document-level measures** (from `sharepoint_document` and `ragas_ground_truth`): *doc-precision/recall/F1*  
- **Detailed diagnostics** (lengths, Jaccard coverage, number of contexts, etc.)  
- **Visualizations** and **explanations** (histograms, scatter plots, correlations, effect of `n_contexts`, per-document breakdown)  
- **Automatic recommendations** & **experiment plan** (PDF preprocessing, embeddings, chunking, hybrid search, reranking, generation prompts)


## 0) Installation & checks

In [1]:
# 1) Install
!pip install -U ipywidgets tqdm

# 2) No need for "jupyter nbextension ..." in JupyterLab (it is expected that the command does not exist)
# 3) IMPORTANT: restart the kernel here (Kernel > Restart)

# 4) After restarting:
import ipywidgets as W
from IPython.display import display
try:
    display(W.IntProgress(min=0, max=1, description="widgets OK"))
    from tqdm.notebook import tqdm
    print("tqdm notebook OK ✅")
except Exception as e:
    from tqdm.auto import tqdm
    print("Fallback tqdm auto ⚠️ ->", e)

# (test)
_ = [x for x in tqdm(range(50), desc="Test tqdm (widgets)", leave=False)]



IntProgress(value=0, description='widgets OK', max=1)

tqdm notebook OK ✅


Test tqdm (widgets):   0%|          | 0/50 [00:00<?, ?it/s]

## 1) Configuration: LLM & file paths

In [None]:

import os

# LLM provider: "openai" | "claude" | "gemini" | "ollama"
RAGAS_LLM_PROVIDER = os.getenv("RAGAS_LLM_PROVIDER", "openai").lower()

OPENAI_MODEL  = os.getenv("OPENAI_MODEL",  "gpt-4o-mini")
CLAUDE_MODEL  = os.getenv("CLAUDE_MODEL",  "claude-3-5-sonnet-20240620")
GEMINI_MODEL  = os.getenv("GEMINI_MODEL",  "gemini-1.5-pro")
OLLAMA_MODEL  = os.getenv("OLLAMA_MODEL",  "llama3.1:8b")

# API keys are expected in the environment.
# setdefault keeps a key that is already exported; the placeholder applies only when none is set.
os.environ.setdefault("OPENAI_API_KEY", "your-openai-api-key")
# os.environ["ANTHROPIC_API_KEY"]  = "..."
# os.environ["GOOGLE_API_KEY"]     = "..."

# Data
DATA_PATH = os.getenv("DATA_PATH", "reference_qa_manuel_template.csv")

# Outputs
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "outputs")
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("Provider:", RAGAS_LLM_PROVIDER)
print("Data path:", DATA_PATH)
print("Output dir:", OUTPUT_DIR)


## 2) Loading the CSV & preview

In [None]:

import pandas as pd
import os

if not os.path.exists(DATA_PATH):
    alt = '../data/reference/reference_qa_manuel_template.csv'
    if os.path.exists(alt):
        DATA_PATH = alt
        print(f"INFO: DATA_PATH not found, using {alt}")
    else:
        raise FileNotFoundError(f"CSV not found: {DATA_PATH}")

raw_df = pd.read_csv(DATA_PATH)
print("Shape:", raw_df.shape)
display(raw_df.head(5))
print("Columns:", list(raw_df.columns))



## 3) Normalization → keys expected by RAGAS

**POC → RAGAS mapping**  
- `question` (business reference) **&** `ragas_question` (question actually sent to the system)  
  → **`ragas_question`** is used when present, otherwise `question`  
- `ragas_answer` → **`answer`**  
- `ragas_contexts` → **`contexts`** (List[str])  
- `reference_answer` → **`ground_truth`** (text used by `answer_correctness`)  
- `sharepoint_document` → **`reference_docs`** (List[str]), for the document-level analyses  
- `ragas_ground_truth` → **`cited_docs`** (List[str]), the documents cited by the answer


In [None]:

import ast, math

df = raw_df.copy()

# Columns expected on the POC side (reference + RAG)
COLS_REQUIRED = [
    "question", "reference_answer", "sharepoint_document",
    "ragas_question", "ragas_answer", "ragas_contexts", "ragas_ground_truth"
]
missing = [c for c in COLS_REQUIRED if c not in df.columns]
if missing:
    print("⚠️ Missing columns (fine if intentionally absent):", missing)

def to_list_generic(x):
    """Convert a cell to List[str]: serialized lists, separators (|||, §§, ;;, ##, newline, comma), or a plain string."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return []
    if isinstance(x, list):
        return [str(xx).strip() for xx in x if str(xx).strip()]
    if isinstance(x, str):
        s = x.strip()
        if not s:
            # An empty cell yields no items (not a list holding one empty string)
            return []
        if (s.startswith('[') and s.endswith(']')) or (s.startswith('(') and s.endswith(')')):
            try:
                parsed = ast.literal_eval(s)
                if isinstance(parsed, (list, tuple)):
                    return [str(xx).strip() for xx in parsed if str(xx).strip()]
            except Exception:
                pass
        # The first separator found wins; newline and comma also split free text that contains them
        for sep in ["|||", "§§", ";;", "##", "\n", ","]:
            if sep in s:
                parts = [p.strip() for p in s.split(sep)]
                return [p for p in parts if p]
        return [s]
    return [str(x).strip()]

def col_or_empty(name):
    """Return column `name` as a Series, or a Series of empty strings if the column is absent."""
    if name in df.columns:
        return df[name]
    return pd.Series([""] * len(df), index=df.index, dtype=object)

# question: ragas_question when present, otherwise question
if "ragas_question" in df.columns and df["ragas_question"].notna().any():
    q_series = df["ragas_question"].fillna(col_or_empty("question")).fillna("")
else:
    q_series = col_or_empty("question").fillna("")

# answer
a_series = col_or_empty("ragas_answer").fillna("")

# contexts
ctx_series_raw = col_or_empty("ragas_contexts").fillna("")
contexts = [to_list_generic(v) for v in ctx_series_raw.tolist()]

# ground_truth (business text)
gt_series = col_or_empty("reference_answer").fillna("")

# reference_docs (list)
ref_docs_series_raw = col_or_empty("sharepoint_document").fillna("")
reference_docs = [to_list_generic(v) for v in ref_docs_series_raw.tolist()]

# cited_docs (list): documents cited by the answer
cited_docs_series_raw = col_or_empty("ragas_ground_truth").fillna("")
cited_docs = [to_list_generic(v) for v in cited_docs_series_raw.tolist()]

dataset_dict = {
    "question": q_series.astype(str).tolist(),
    "answer": a_series.astype(str).tolist(),
    "contexts": contexts,
    "ground_truth": gt_series.astype(str).tolist(),
}

print("Example, question:", dataset_dict["question"][0] if len(dataset_dict["question"]) else "n/a")
print("Example, answer:", dataset_dict["answer"][0] if len(dataset_dict["answer"]) else "n/a")
print("Example, contexts[0]:", dataset_dict["contexts"][0][:2] if len(dataset_dict["contexts"]) else "n/a")
print("Example, ground_truth:", dataset_dict["ground_truth"][0] if len(dataset_dict["ground_truth"]) else "n/a")

# Document-level information (outside RAGAS)
aux_docs = {
    "reference_docs": reference_docs,
    "cited_docs": cited_docs,
}


## 4) Building the Dataset (Hugging Face)

In [None]:

from datasets import Dataset as HFDataset
hf_dataset = HFDataset.from_dict(dataset_dict)
hf_dataset


## 5) RAGAS-compatible LLM (wrapper)

In [None]:

from ragas.llms import LangchainLLMWrapper

def build_llm(provider: str):
    provider = provider.lower().strip()
    if provider == "openai":
        from langchain_openai import ChatOpenAI
        lc = ChatOpenAI(model=OPENAI_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "claude":
        from langchain_anthropic import ChatAnthropic
        lc = ChatAnthropic(model=CLAUDE_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "gemini":
        from langchain_google_genai import ChatGoogleGenerativeAI
        lc = ChatGoogleGenerativeAI(model=GEMINI_MODEL, temperature=0)
        return LangchainLLMWrapper(lc)
    elif provider == "ollama":
        try:
            from langchain_community.chat_models import ChatOllama
            lc = ChatOllama(model=OLLAMA_MODEL)
        except Exception:
            from langchain_community.llms import Ollama
            lc = Ollama(model=OLLAMA_MODEL)
        return LangchainLLMWrapper(lc)
    else:
        raise ValueError(f"Unsupported provider: {provider}")

llm = build_llm(RAGAS_LLM_PROVIDER)
print("✅ LLM ready for RAGAS:", type(llm).__name__, "| provider:", RAGAS_LLM_PROVIDER)


## 6) RAGAS metrics (auto-detection of *Relevancy*)

In [None]:

from ragas.metrics import faithfulness, answer_correctness, context_precision, context_recall

# Auto-detect the right alias depending on the ragas version
try:
    from ragas.metrics import response_relevancy as _relevancy_metric
    RELEVANCY_NAME = "response_relevancy"
except ImportError:
    from ragas.metrics import answer_relevancy as _relevancy_metric
    RELEVANCY_NAME = "answer_relevancy"

metrics = [
    faithfulness,           # 1. Faithfulness (answer vs contexts)
    answer_correctness,     # 2. Correctness (answer vs ground_truth)
    _relevancy_metric,      # 3. Relevancy (answer vs question)
    context_precision,      # 4. Precision of the retrieved contexts
    context_recall,         # 5. Recall of the retrieved contexts
]

print("Relevancy metric selected:", RELEVANCY_NAME)
metrics


## 7) Running the evaluation

In [None]:

from ragas import evaluate
import os

result = evaluate(
    dataset=hf_dataset,
    metrics=metrics,
    llm=llm,
    raise_exceptions=False,  # failed rows are scored NaN instead of aborting the run
    show_progress=False,  # avoids the ipywidgets dependency
)

print("✅ Evaluation finished.")
df_results = result.to_pandas()
display(df_results.head(10))

csv_out = os.path.join(OUTPUT_DIR, "ragas_raw_results.csv")
df_results.to_csv(csv_out, index=False, encoding="utf-8")
print("Results saved ->", csv_out)


## 8) Score summary (0–1)

In [None]:

import numpy as np
import json
from datetime import datetime
import os

rel_col = "response_relevancy" if "response_relevancy" in df_results.columns else (
    "answer_relevancy" if "answer_relevancy" in df_results.columns else None
)

wanted_cols = ["faithfulness", "answer_correctness", "context_precision", "context_recall"]
if rel_col:
    wanted_cols.insert(2, rel_col)

present = [c for c in wanted_cols if c in df_results.columns]
summary = {c: float(np.nanmean(df_results[c])) for c in present}

print("📊 Mean scores:")
for k, v in summary.items():
    print(f" - {k}: {v:.3f}")

summary_out = os.path.join(OUTPUT_DIR, "ragas_summary.json")
with open(summary_out, "w", encoding="utf-8") as f:
    json.dump({
        "generated_at": datetime.now().isoformat(),
        "provider": RAGAS_LLM_PROVIDER,
        "model": {
            "openai": OPENAI_MODEL,
            "claude": CLAUDE_MODEL,
            "gemini": GEMINI_MODEL,
            "ollama": OLLAMA_MODEL,
        }.get(RAGAS_LLM_PROVIDER, "n/a"),
        "scores": summary,
    }, f, ensure_ascii=False, indent=2)

print("Summary saved ->", summary_out)



## 9) Enriched diagnostics & derived features

We add signals that help with **RAG debugging**:  
- `n_contexts`, `avg_context_len`, `total_context_len`  
- `answer_len`, `question_len`, `gt_len` (lengths are word counts)  
- `context_coverage_jaccard`: word-level Jaccard similarity, *ground_truth ↔ contexts_concat*  
- `answer_coverage_jaccard`: word-level Jaccard similarity, *ground_truth ↔ answer*  
- **Doc-level**: *doc_precision/recall/F1* computed from `sharepoint_document` (reference) and `ragas_ground_truth` (cited)
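
As a worked illustration of the two hand-made signals listed above, the sketch below computes a word-level Jaccard score and document-level precision/recall/F1 on toy inputs. The sentences and file names are invented for the example, and `doc_prf` is a hypothetical helper name; the formulas follow the definitions used in this section.

```python
import re

def jaccard_words(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    A = {w.lower() for w in re.findall(r"\w+", a or "")}
    B = {w.lower() for w in re.findall(r"\w+", b or "")}
    if not A and not B:
        return 0.0
    return len(A & B) / len(A | B)

def doc_prf(reference_docs, cited_docs):
    """Document-level precision, recall and F1 from two lists of document names."""
    refs = {d.lower() for d in reference_docs}
    cits = {d.lower() for d in cited_docs}
    inter = refs & cits
    p = len(inter) / len(cits) if cits else 0.0
    r = len(inter) / len(refs) if refs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy data (hypothetical)
gt = "Leave requests must be approved by the manager"
ctx = "The manager must approve all leave requests"
cov = jaccard_words(gt, ctx)  # 5 shared words out of 10 distinct words

p, r, f = doc_prf(["policy.pdf", "faq.pdf"], ["policy.pdf", "guide.pdf", "memo.pdf"])
print(f"coverage={cov:.2f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Note that Jaccard works on exact word forms: "approved" and "approve" count as different words, so the score understates semantic overlap.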


In [None]:

import numpy as np, pandas as pd, re, os

def wc(s: str) -> int:
    if not isinstance(s, str):
        s = str(s)
    return len(re.findall(r"\w+", s))

def jaccard_words(a: str, b: str) -> float:
    A = set([w.lower() for w in re.findall(r"\w+", a or "")])
    B = set([w.lower() for w in re.findall(r"\w+", b or "")])
    if not A and not B:
        return 0.0
    return len(A & B) / max(1, len(A | B))

def f1(p, r):
    if p + r == 0:
        return 0.0
    return 2*p*r/(p+r)

# Concat contexts
contexts_concat = ["\n".join(c) if isinstance(c, (list, tuple)) else str(c) for c in dataset_dict["contexts"]]

# Doc-level metrics
doc_precisions, doc_recalls, doc_f1s = [], [], []
for i in range(len(aux_docs["reference_docs"])):
    refs = set([x.lower() for x in aux_docs["reference_docs"][i]])
    cits = set([x.lower() for x in aux_docs["cited_docs"][i]])
    inter = refs & cits
    p = len(inter)/max(1, len(cits)) if len(cits)>0 else 0.0
    r = len(inter)/max(1, len(refs)) if len(refs)>0 else 0.0
    doc_precisions.append(p); doc_recalls.append(r); doc_f1s.append(f1(p,r))

enriched = pd.DataFrame({
    "question": dataset_dict["question"],
    "answer": dataset_dict["answer"],
    "ground_truth": dataset_dict["ground_truth"],
    "contexts_concat": contexts_concat,
    "n_contexts": [len(c) if isinstance(c, (list, tuple)) else 0 for c in dataset_dict["contexts"]],
    "avg_context_len": [np.mean([wc(x) for x in c]) if isinstance(c, (list, tuple)) and len(c)>0 else 0 for c in dataset_dict["contexts"]],
    "total_context_len": [np.sum([wc(x) for x in c]) if isinstance(c, (list, tuple)) else 0 for c in dataset_dict["contexts"]],
    "answer_len": [wc(a) for a in dataset_dict["answer"]],
    "question_len": [wc(q) for q in dataset_dict["question"]],
    "gt_len": [wc(g) for g in dataset_dict["ground_truth"]],
    "context_coverage_jaccard": [jaccard_words(dataset_dict["ground_truth"][i], contexts_concat[i]) for i in range(len(contexts_concat))],
    "answer_coverage_jaccard": [jaccard_words(dataset_dict["ground_truth"][i], dataset_dict["answer"][i]) for i in range(len(contexts_concat))],
    "doc_precision": doc_precisions,
    "doc_recall": doc_recalls,
    "doc_f1": doc_f1s,
})

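# Drop columns that ragas echoes back (e.g. question, answer) to avoid duplicate column names
results_only = df_results.reset_index(drop=True)
results_only = results_only.drop(columns=[c for c in results_only.columns if c in enriched.columns])
df_all = pd.concat([enriched, results_only], axis=1)

enriched_out = os.path.join(OUTPUT_DIR, "ragas_results_enriched.csv")
df_all.to_csv(enriched_out, index=False, encoding="utf-8")
print("Enriched ->", enriched_out)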
display(df_all.head(5))



## 10) Visualizations: distributions (histograms)

These plots show **how the scores are spread**: concentrated near 1 → good; near 0 → needs work; widely dispersed → unstable behavior.


In [None]:

import matplotlib.pyplot as plt
import os

rel_col = "response_relevancy" if "response_relevancy" in df_all.columns else ("answer_relevancy" if "answer_relevancy" in df_all.columns else None)
score_cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_precision","doc_recall","doc_f1"]
if rel_col:
    score_cols.insert(2, rel_col)

for col in score_cols:
    if col in df_all.columns:
        plt.figure()
        df_all[col].dropna().plot(kind="hist", bins=10, title=f"Distribution: {col}")
        plt.xlabel(col); plt.ylabel("Frequency")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, f"hist_{col}.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()



**How to read:**  
- Low `doc_recall` → the system **does not cite** enough of the right documents (check *preprocessing*, *retrieval*, *reranking*).  
- Low `faithfulness` with high `context_precision` → **generation hallucinations** (adjust prompts/LLM).  
- Low `context_precision` and low `context_recall` → **retrieval needs tuning** (hybrid search, top-k, embeddings, chunking).
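
The three readings above can be sketched as a small rule function. This is only an illustration: `diagnose` is a hypothetical helper, and the 0.6 cut-off between "low" and "high" is an assumption to adjust to your own targets.

```python
def diagnose(scores, threshold=0.6):
    """Map mean scores to the three readings of this section.

    `threshold` separates "low" from "high" (assumed value).
    Metrics missing from `scores` are ignored.
    """
    def is_low(name):
        return name in scores and scores[name] < threshold

    def is_high(name):
        return name in scores and scores[name] >= threshold

    findings = []
    if is_low("doc_recall"):
        findings.append("citations: expected documents are not cited")
    if is_low("faithfulness") and is_high("context_precision"):
        findings.append("generation: likely hallucinations")
    if is_low("context_precision") and is_low("context_recall"):
        findings.append("retrieval: needs tuning")
    return findings

# Hypothetical mean scores
example = diagnose({
    "faithfulness": 0.4,
    "context_precision": 0.8,
    "context_recall": 0.7,
    "doc_recall": 0.9,
})
print(example)
```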



## 11) Error focus: top-K weakest items

To **prioritize** investigations, here are the worst cases per metric.


In [None]:

import pandas as pd

def topk_worst(col, k=15):
    if col not in df_all.columns:
        return pd.DataFrame()
    sub = df_all[["question","answer","ground_truth", col]].copy()
    sub = sub.sort_values(col, ascending=True).head(k)
    return sub

print("### Worst 'answer_correctness'")
display(topk_worst("answer_correctness"))

print("\n### Worst 'faithfulness'")
display(topk_worst("faithfulness"))

if rel_col:
    print(f"\n### Worst '{rel_col}'")
    display(topk_worst(rel_col))

print("\n### Worst 'doc_recall' (missing documents)")
display(topk_worst("doc_recall"))



**How to read:**  
- Look at **the contexts** attached to these cases: the cause is often a chunking or PDF extraction problem.  
- Compare `sharepoint_document` with `ragas_ground_truth` to see **which expected documents are missing**.
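
The document comparison suggested above boils down to two set differences. A minimal sketch, assuming both columns have already been parsed into lists of names; `compare_documents` and the file names are hypothetical.

```python
def compare_documents(expected, cited):
    """Return (missing, unexpected) document names, compared case-insensitively."""
    exp = {d.strip().lower() for d in expected if d.strip()}
    cit = {d.strip().lower() for d in cited if d.strip()}
    return sorted(exp - cit), sorted(cit - exp)

# Hypothetical row: expected documents vs documents cited by the answer
missing, unexpected = compare_documents(
    ["HR_Policy.pdf", "Travel_Guide.pdf"],
    ["hr_policy.pdf", "Expense_FAQ.pdf"],
)
print("missing:", missing)
print("unexpected:", unexpected)
```

Names are lowercased before comparison, so differences in letter case alone do not count as a miss; differences in extension or path still do.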



## 12) Scatter diagnostics: relationships between metrics

Scatter plots to **spot patterns**: too many contexts? low overlap? unexpected correlations?


In [None]:

def scatter_xy(x, y):
    import matplotlib.pyplot as plt
    if x in df_all.columns and y in df_all.columns:
        plt.figure()
        plt.scatter(df_all[x], df_all[y])
        plt.xlabel(x); plt.ylabel(y)
        plt.title(f"{x} vs {y}")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, f"scatter_{x}_vs_{y}.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()

scatter_xy("faithfulness", "answer_correctness")
if rel_col: scatter_xy(rel_col, "answer_correctness")
scatter_xy("context_precision", "context_recall")
scatter_xy("n_contexts", "answer_correctness")
scatter_xy("avg_context_len", "answer_correctness")
scatter_xy("context_coverage_jaccard", "answer_correctness")
scatter_xy("answer_coverage_jaccard", "answer_correctness")
scatter_xy("doc_recall", "answer_correctness")



**How to read:**  
- `answer_correctness` falls as `n_contexts` rises (↘) → **too much noise**: reduce top-k or **rerank**.  
- `answer_correctness` rises with `context_coverage_jaccard` (↗) → contexts that are **better aligned** with the business ground truth give better answers.  
- `answer_correctness` rises with `doc_recall` (↗) → citing the **right documents** helps correctness.



## 13) Correlations: matrix (Pearson)

Useful for **prioritizing levers**: what correlates most with `answer_correctness`?


In [None]:

import numpy as np, matplotlib.pyplot as plt

num_cols = df_all.select_dtypes(include=[np.number]).columns.tolist()
corr = df_all[num_cols].corr()

plt.figure()
plt.imshow(corr, aspect='auto')
plt.xticks(range(len(num_cols)), num_cols, rotation=90)
plt.yticks(range(len(num_cols)), num_cols)
plt.colorbar()
plt.title("Correlation matrix (Pearson)")
plt.tight_layout()
png = os.path.join(OUTPUT_DIR, "corr_matrix.png")
plt.savefig(png); print("Saved:", png)
plt.show()

if "answer_correctness" in corr.columns:
    display(corr.sort_values(by="answer_correctness", ascending=False)[["answer_correctness"]].head(10))



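**How to read:** The variables at the top of the list are **priority candidates** for optimization (e.g. `doc_recall`, `context_coverage_jaccard`). The first row is `answer_correctness` itself (correlation 1), and strongly negative correlations sit at the bottom of the full ranking rather than in this top 10.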
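
To surface negative drivers as well, one option is to rank by absolute correlation and drop the target's own row. The sketch below does this on invented numbers; `rank_drivers` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def rank_drivers(df, target="answer_correctness"):
    """Pearson correlations with `target`, sorted by absolute value, target excluded."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data (hypothetical): four evaluated questions
toy = pd.DataFrame({
    "answer_correctness": [0.2, 0.4, 0.6, 0.8],
    "doc_recall":         [0.1, 0.3, 0.5, 0.7],
    "n_contexts":         [8, 6, 5, 2],
    "question_len":       [5, 9, 5, 9],
})
ranked = rank_drivers(toy)
print(ranked.round(3))
```

Here `n_contexts` ranks second even though its correlation is negative, which a descending sort on the signed value would have pushed to the bottom.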



## 14) Effect of the number of contexts: aggregation by *bins*

Tracks how the metrics change across classes of `n_contexts` (guides **top-k**, **reranking**, **hybrid search**).


In [None]:

import pandas as pd, matplotlib.pyplot as plt

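def agg_by_bins(col, bins=(0,1,2,3,4,6,10,999)):
    if col not in df_all.columns:
        return pd.DataFrame()
    # include_lowest=True keeps rows whose value equals the first edge (e.g. n_contexts == 0)
    b = pd.cut(df_all[col], bins=bins, right=True, include_lowest=True)
    cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_recall"]
    if rel_col and rel_col in df_all.columns: cols.append(rel_col)
    cols = [c for c in cols if c in df_all.columns]  # skip metrics that were not computed
    agg = df_all.groupby(b, observed=False)[cols].mean(numeric_only=True)
    return agg.reset_index()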

agg_ctx = agg_by_bins("n_contexts")
display(agg_ctx)

for c in [c for c in agg_ctx.columns if c != "n_contexts"]:
    plt.figure()
    x = agg_ctx.iloc[:,0].astype(str)
    y = agg_ctx[c]
    plt.bar(x, y)
    plt.xticks(rotation=30, ha="right")
    plt.ylabel(c)
    plt.title(f"Mean {c} per n_contexts bin")
    plt.tight_layout()
    outp = os.path.join(OUTPUT_DIR, f"bar_{c}_by_ncontexts_bins.png")
    plt.savefig(outp); print("Saved:", outp)
    plt.show()



**How to read:** If `answer_correctness` drops beyond a certain **top-k**, that is a signal to **reduce** the number of passages sent to generation and/or **strengthen reranking**.
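
One simple way to locate that turning point is to average the metric per exact `n_contexts` value and take the best one. This is a rough sketch on invented data; `best_n_contexts` and the `min_rows` guard are assumptions, and with few rows per group the result is only indicative.

```python
import pandas as pd

def best_n_contexts(df, metric="answer_correctness", min_rows=1):
    """Return (n_contexts with the highest mean metric, per-group table)."""
    table = df.groupby("n_contexts")[metric].agg(["mean", "count"])
    table = table[table["count"] >= min_rows]  # ignore groups with too few rows
    return int(table["mean"].idxmax()), table

# Toy data (hypothetical): correctness peaks at 4 contexts, then declines
toy = pd.DataFrame({
    "n_contexts":         [2, 2, 4, 4, 8, 8],
    "answer_correctness": [0.5, 0.6, 0.8, 0.7, 0.4, 0.5],
})
best, table = best_n_contexts(toy)
print(table)
print("best n_contexts:", best)
```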



## 15) Analysis per document/source (`sharepoint_document`)

Identify the **documents that cause problems** (PDF extraction, structure, obsolescence).


In [None]:

import pandas as pd, matplotlib.pyplot as plt

src_col = "sharepoint_document" if "sharepoint_document" in raw_df.columns else None

if src_col:
    metric_cols = ["faithfulness","answer_correctness","context_precision","context_recall","doc_recall","doc_precision","doc_f1"]
    if rel_col and rel_col in df_all.columns:
        metric_cols.insert(2, rel_col)
    per_src = pd.concat([raw_df[[src_col]], df_all[metric_cols]], axis=1)
    agg_src = per_src.groupby(src_col).mean(numeric_only=True).sort_values("answer_correctness", ascending=False)
    display(agg_src.head(10))

    topk = agg_src.head(10)
    if len(topk) > 0:
        plt.figure()
        plt.bar(topk.index.astype(str), topk["answer_correctness"])
        plt.xticks(rotation=45, ha="right")
        plt.ylabel("answer_correctness")
        plt.title("Top documents: mean answer_correctness")
        plt.tight_layout()
        outp = os.path.join(OUTPUT_DIR, "bar_top_docs_answer_correctness.png")
        plt.savefig(outp); print("Saved:", outp)
        plt.show()
else:
    print("No 'sharepoint_document' column detected.")



**How to read:** Documents at the bottom of the ranking often need more robust **PDF preprocessing** (OCR, header/footer cleanup, table handling, faithful text extraction).
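
To list the bottom of the ranking directly, the aggregation can be sorted in ascending order and restricted to documents backed by enough questions. A minimal sketch on invented data; `worst_documents` and its `min_questions` parameter are hypothetical.

```python
import pandas as pd

def worst_documents(df, doc_col="sharepoint_document", metric="answer_correctness",
                    k=10, min_questions=1):
    """Mean metric and question count per document, lowest means first."""
    agg = df.groupby(doc_col)[metric].agg(mean="mean", n="count")
    agg = agg[agg["n"] >= min_questions]  # skip documents with too few questions
    return agg.sort_values("mean", ascending=True).head(k)

# Toy data (hypothetical)
toy = pd.DataFrame({
    "sharepoint_document": ["a.pdf", "a.pdf", "b.pdf", "b.pdf", "c.pdf"],
    "answer_correctness":  [0.9, 0.8, 0.2, 0.4, 0.5],
})
worst = worst_documents(toy, k=2, min_questions=2)
print(worst)
```

The count column helps separate documents that are consistently weak from those judged on a single question.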



## 16) Automatic recommendations (driven by the metrics)

Optimization leads derived from the signals: **PDF preprocessing, embeddings, chunking, retrieval (hybrid), reranking, prompts**.


In [None]:

import os

reco = []

def add(msg):
    print("•", msg); reco.append("• " + msg)

mean = df_all.mean(numeric_only=True).to_dict()

def m(k):
    """Mean of metric k, or None when the metric is absent or entirely NaN."""
    v = mean.get(k)
    return None if v is None or v != v else v  # v != v is True only for NaN

def low(k, thr=0.6):
    """True only when metric k was computed and its mean is below thr."""
    v = m(k)
    return v is not None and v < thr

if low("faithfulness"):
    add("Low faithfulness: quote passages explicitly (verbatim), set temperature to 0, constrain the answer (format, references).")

if low("answer_correctness") and (m("context_recall") or 0) >= 0.6:
    add("Incorrect answers despite adequate recall: strengthen the **generation prompts** (strict extraction) and **post-verification** (self-check).")

if low("context_recall"):
    add("Low recall: increase **top-k**, use **hybrid search** (BM25 + vector), improve **chunking** and **PDF preprocessing** (OCR, headers/footers).")

if low("context_precision"):
    add("Low precision: add a **reranker** (cross-encoder/LLM), lower **top-k** before generation, filter by **metadata**.")

if "n_contexts" in df_all and "answer_correctness" in df_all and df_all["n_contexts"].corr(df_all["answer_correctness"]) < -0.15:
    add("Too many contexts hurt correctness: **reduce top-k** and/or rerank more aggressively.")

if low("doc_recall"):
    add("Low document recall: **align document naming**, improve **automatic citations** and **PDF preprocessing** (faithful extraction).")

if low("context_coverage_jaccard", 0.4):
    add("Low ground truth ↔ contexts overlap: check PDF/OCR extraction, use **embeddings** suited to French and the domain, **chunk** by sections/titles.")

# Save
txt_out = os.path.join(OUTPUT_DIR, "auto_recommendations.txt")
with open(txt_out, "w", encoding="utf-8") as f:
    f.write("Automatic recommendations\n\n")
    for r in reco: f.write(r + "\n")
print("Recommendations ->", txt_out)



## 17) Experiment plan (A/B & grid)

Recommended iterations:  
- **Preprocessing** *(pdfminer/pymupdf + tesseract OCR, header/footer cleanup, whitespace normalization)*  
- **Embeddings** *(French/domain-specific models, dimension, normalization)*  
- **Chunking** *(semantic splitting by titles/sections, or hierarchical)*  
- **Retrieval** *(hybrid BM25 + vector, weighting, metadata filters)*  
- **Reranking** *(cross-encoder, LLM-as-a-reranker, top-k)*


In [None]:

import json, os

param_grid = {
    "preprocess": [
        {"ocr": False, "clean_headers": True, "normalize_ws": True},
        {"ocr": True,  "clean_headers": True, "normalize_ws": True},
    ],
    "embedding": [
        {"provider":"openai","model":"text-embedding-3-large"},
        {"provider":"openai","model":"text-embedding-3-small"},
        {"provider":"nomic","model":"nomic-embed-text"},
    ],
    "chunking": [
        {"method":"fixed","size":512,"overlap":64},
        {"method":"fixed","size":800,"overlap":100},
        {"method":"semantic","size":"auto","overlap":64},
    ],
    "retrieval": [
        {"type":"vector","top_k":8},
        {"type":"hybrid","bm25_weight":0.4,"top_k":8},
        {"type":"hybrid","bm25_weight":0.6,"top_k":12},
    ],
    "rerank": [
        {"enabled": False},
        {"enabled": True, "model":"cross-encoder/ms-marco-MiniLM-L-6-v2", "top_k":5},
    ],
}

print(json.dumps(param_grid, indent=2, ensure_ascii=False))
grid_out = os.path.join(OUTPUT_DIR, "experiment_plan.json")
with open(grid_out, "w", encoding="utf-8") as f:
    json.dump(param_grid, f, ensure_ascii=False, indent=2)
print("Experiment plan ->", grid_out)
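
The grid in this cell holds 2 × 3 × 3 × 3 × 2 = 108 combinations. As a sketch of how such a plan could be turned into individual run configurations, the snippet below expands a reduced grid with `itertools.product`; `expand_grid` is a hypothetical helper, and running or scoring each configuration is left out.

```python
from itertools import product

def expand_grid(grid):
    """Expand {axis: [options]} into a list of {axis: option} run configurations."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

# Reduced grid for illustration (values taken from two axes of the plan)
small_grid = {
    "chunking": [
        {"method": "fixed", "size": 512, "overlap": 64},
        {"method": "fixed", "size": 800, "overlap": 100},
    ],
    "retrieval": [
        {"type": "vector", "top_k": 8},
        {"type": "hybrid", "bm25_weight": 0.4, "top_k": 8},
    ],
}
runs = expand_grid(small_grid)
for i, run in enumerate(runs):
    print(i, run)
```

A full factorial sweep grows quickly, so varying one axis at a time (A/B) against a fixed baseline is often a cheaper first pass.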



## 18) RAG optimization checklist (quick)

- **PDF**: OCR for scans; remove headers/footers/page numbers; handle tables; normalize whitespace/case.  
- **Embeddings**: French/domain-specific; normalization; sufficient vector size; re-index after any change.  
- **Chunking**: 512–800 tokens + overlap of 64–100; split by sections/titles; isolate tables/code.  
- **Index**: metadata (title, section, date, doc_id) for filtering.  
- **Retrieval**: test **hybrid** (BM25 + vector) and its weighting; balanced top-k.  
- **Reranking**: cross-encoder/LLM; narrow down to 3–5 high-quality passages.  
- **Generation**: strict citation prompts (verbatim + doc_id); temperature 0; strict JSON; self-check.  
- **Evaluation**: up-to-date reference set; GO/NO-GO thresholds; log the run configuration.
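
The GO/NO-GO item of the checklist can be sketched as a simple gate over the mean scores. The threshold values below are placeholders to agree on with the business owners, not recommended values, and `go_no_go` is a hypothetical helper.

```python
def go_no_go(scores, thresholds):
    """Return ("GO" or "NO-GO", failures); a metric missing from `scores` counts as a failure."""
    failures = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures[metric] = {"value": value, "minimum": minimum}
    return ("GO" if not failures else "NO-GO"), failures

# Hypothetical thresholds and mean scores, for illustration only
thresholds = {"faithfulness": 0.8, "answer_correctness": 0.7, "context_recall": 0.7}
scores = {"faithfulness": 0.85, "answer_correctness": 0.62, "context_recall": 0.74}

decision, failures = go_no_go(scores, thresholds)
print(decision, failures)
```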
