# 🧠 ESG RAG Pipeline — Key Steps Overview

This pipeline evaluates the robustness and generalization of a **RAG (Retrieval-Augmented Generation)** system applied to ESG reports. It combines FAISS indexing, prompt optimization, LLM reranking, and detailed evaluation using RAGAS.

---

## 🔹 1. Basic RAG (Baseline)

- Each ESG report is **independently chunked** and **indexed with FAISS**.
- For the TotalEnergies report:
  - Questions are embedded.
  - **Top 10 relevant chunks** retrieved via FAISS.
  - Answers generated with the model set in `COMPLETION_MODEL` (`gpt-3.5-turbo` in the configuration cell).
- Results stored in a DataFrame with:
  - `question`: user question (RAGAS calls this field `user_input`).
  - `retrieved_contexts`: top 10 context chunks joined with ` ||| `.
  - `response`: generated answer.
  - `reference`: reference answer.

---

## 🔹 2. Prompt Optimization

- Prompt is refined to emulate a **professional ESG analyst**.
- Improves **faithfulness** and **relevance** of generated answers.
- The rest of the pipeline remains the same as Basic RAG.

---

## 🔹 3. Reranked RAG (Champion Model)

- Adds a **reranking step with GPT-3.5-turbo**:
  - The top 20 FAISS chunks are scored for relevance and ranked.
  - The **5 most relevant** chunks are kept for final answer generation.
- This model serves as the **champion** for comparison.

---

## 🔹 4. Generalization Evaluation

- Apply the champion model on other ESG reports (Shell, Veolia, etc.).
- Goal: test **model generalization beyond TotalEnergies**.
- Answers evaluated again using **RAGAS** metrics:
  - Faithfulness
  - Answer Relevancy
  - Context Precision
  - Context Recall


In [None]:
!pip install faiss-cpu tiktoken openai transformers nltk ragas llama-index pandas
!pip install sentencepiece datasets
!pip install PyPDF2
!pip install pycryptodome
!pip install -U ragas[metrics] --quiet


In [None]:
# ⬇️ Mount Google Drive (packages are installed in the previous cell)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#  Dependencies
import os
import PyPDF2
from pathlib import Path

#  Define PDF folder path
PDF_PATH = "/content/drive/MyDrive/Classroom/Sustainability reports"

#  Set API key
os.environ["OPENAI_API_KEY"] = "Replace"
#  RAG config
EMBEDDING_MODEL = "text-embedding-3-small"
COMPLETION_MODEL = "gpt-3.5-turbo"
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
CHUNK_SEPARATOR = "\n===CHUNK_SEPARATOR===\n"

#  FAISS index directory
BASE_DIR = Path(PDF_PATH)
INDEX_DIR = BASE_DIR / "faiss_indices"
INDEX_DIR.mkdir(parents=True, exist_ok=True)

#  Extract text from each PDF
def extract_text_from_pdfs(pdf_folder):
    documents = {}
    for filename in os.listdir(pdf_folder):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            try:
                reader = PyPDF2.PdfReader(pdf_path)
                text = ""
                for page in reader.pages:
                    text += page.extract_text() or ""
                documents[filename] = text
            except Exception as e:
                print(f"❌ Skipping {filename} due to error: {e}")
    return documents



In [None]:
#  Load and preview
pdf_texts = extract_text_from_pdfs(PDF_PATH)
print(f"✅ Loaded {len(pdf_texts)} PDF documents.")

✅ Loaded 14 PDF documents.


# **1- Basic RAG**

## 🧩 Step 1: Chunking & FAISS Indexing for Each PDF Individually

Each PDF is processed and indexed separately. This allows us to evaluate and query them independently in future steps.

For each PDF:
- Tokenize and chunk text
- Compute OpenAI embeddings
- Store a dedicated FAISS index and chunk mapping
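
The token-window chunking listed above can be tried without any API calls. This is a minimal sketch, assuming a plain list of integers stands in for the tiktoken token IDs; `chunk_tokens` is a hypothetical helper name, and the default sizes mirror `CHUNK_SIZE = 512` and `CHUNK_OVERLAP = 50` from the configuration cell.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of `chunk_size` that overlap by `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window starts `step` tokens after the previous one
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Stand-in for encoding.encode(text): 1,000 fake token IDs (illustrative only).
fake_tokens = list(range(1000))
windows = chunk_tokens(fake_tokens)

print(len(windows))                          # number of chunks produced
print([len(w) for w in windows])             # the final window is shorter than the rest
print(windows[0][-50:] == windows[1][:50])   # consecutive windows share 50 tokens
```

Because the window start advances by `chunk_size - overlap`, the last window can be much shorter than the others; whether such tail chunks are worth indexing is a design choice this sketch does not settle.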


In [None]:
import tiktoken
import faiss
import numpy as np
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Tokenizer
encoding = tiktoken.encoding_for_model(EMBEDDING_MODEL)

# Create index for a single PDF
def index_single_pdf(filename, text):
    print(f"\n Indexing: {filename}")

    # Tokenize and chunk
    tokens = encoding.encode(text)
    chunks = []
    for i in range(0, len(tokens), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk_tokens = tokens[i:i + CHUNK_SIZE]
        chunk = encoding.decode(chunk_tokens)
        chunks.append(chunk)

    print(f" Created {len(chunks)} chunks")

    # Embeddings
    response = client.embeddings.create(input=chunks, model=EMBEDDING_MODEL)
    embeddings = [np.array(d.embedding, dtype=np.float32) for d in response.data]

    # FAISS index
    dim = len(embeddings[0])
    index = faiss.IndexFlatL2(dim)
    index.add(np.vstack(embeddings))

    # Save to dedicated folder
    folder = INDEX_DIR / filename.replace(".pdf", "")
    folder.mkdir(parents=True, exist_ok=True)

    faiss.write_index(index, str(folder / "index.faiss"))
    with open(folder / "chunks.txt", "w", encoding="utf-8") as f:
        f.write(CHUNK_SEPARATOR.join(chunks))

    print(f"✅ Index & chunks saved for: {filename}")


In [None]:
# Build all individual indexes
for filename, text in pdf_texts.items():
    index_single_pdf(filename, text)


 Indexing: CK_Hutchinson _sustainability_report_2024.pdf
 Created 170 chunks
✅ Index & chunks saved for: CK_Hutchinson _sustainability_report_2024.pdf

 Indexing: Bayer_sustainability_report_2024.pdf
 Created 324 chunks
✅ Index & chunks saved for: Bayer_sustainability_report_2024.pdf

 Indexing: Shell_sustainability_report_2023.pdf
 Created 231 chunks
✅ Index & chunks saved for: Shell_sustainability_report_2023.pdf

 Indexing: Thai_Oil_sustainability_report_2023.pdf
 Created 150 chunks
✅ Index & chunks saved for: Thai_Oil_sustainability_report_2023.pdf

 Indexing: Verizon_sustainability_report_2023.pdf
 Created 94 chunks
✅ Index & chunks saved for: Verizon_sustainability_report_2023.pdf

 Indexing: Cargill_sustainability_report_2024.pdf
 Created 135 chunks
✅ Index & chunks saved for: Cargill_sustainability_report_2024.pdf

 Indexing: Walmart_sustainability_report_2023.pdf
 Created 40 chunks
✅ Index & chunks saved for: Walmart_sustainability_report_2023.pdf

 Indexing: BASF_sustainabil

## 📄 Step 2: Generate Evaluation Dataset for TotalEnergies Report (Baseline RAG - OpenAI)

In this step, we generate a structured evaluation dataset for **RAGAS**, using only the `totalenergies_report.pdf`.  

Each ESG question is processed as follows:
- Embed the question using `text-embedding-3-small`
- Retrieve top-10 relevant chunks from the FAISS index
- Generate a RAG answer using the model set in `COMPLETION_MODEL` (`gpt-3.5-turbo`) and the retrieved context
- Store everything in a clean evaluation DataFrame:
  - `question`: the original question (RAGAS calls this field `user_input`)
  - `retrieved_contexts`: joined top-10 context passages
  - `response`: the generated answer
  - `reference`: the ground truth answer

📍 This output will later be used for **RAGAS metric evaluation**.
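
The retrieval step can be illustrated without FAISS or the OpenAI API. The sketch below is a stand-in under stated assumptions: `top_k_l2` is a hypothetical helper that does in pure Python what an exact (flat) L2 index does, namely rank every stored vector by squared Euclidean distance to the query, and the 2-D vectors are toy values rather than real embeddings.

```python
def top_k_l2(query, vectors, k):
    """Return the indices of the k vectors closest to `query` by squared L2 distance."""
    distances = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    distances.sort()  # smallest distance first; ties are broken by index
    return [idx for _, idx in distances[:k]]


# Toy chunks and toy "embeddings" (assumptions for illustration).
toy_chunks = ["scope 2 emissions", "board diversity", "renewable targets", "waste tracking"]
toy_vectors = [(0.9, 0.1), (0.1, 0.9), (0.7, 0.3), (0.2, 0.6)]
query_vector = (1.0, 0.0)

nearest = top_k_l2(query_vector, toy_vectors, k=2)
retrieved = [toy_chunks[i] for i in nearest]
print(retrieved)
```

The index positions returned by the search are what tie a vector back to its text, which is why the chunk list must be stored in the same order as the vectors were added to the index.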


In [None]:
import os
import faiss
import numpy as np
import pandas as pd

from openai import OpenAI
from IPython.display import display

# Load config values
EVAL_PATH = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_esg_answers.csv"
INDEX_FOLDER = INDEX_DIR / "totalenergies_report"
INDEX_PATH = INDEX_FOLDER / "index.faiss"
CHUNKS_PATH = INDEX_FOLDER / "chunks.txt"

# Initialize OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load FAISS index and chunks
index = faiss.read_index(str(INDEX_PATH))
with open(CHUNKS_PATH, "r", encoding="utf-8") as f:
    chunks = f.read().split(CHUNK_SEPARATOR)

# Load evaluation CSV — with correct separator
df = pd.read_csv(EVAL_PATH, sep=";", encoding="utf-8")

In [None]:
# Define helper functions
def get_embedding(text):
    response = client.embeddings.create(input=[text], model=EMBEDDING_MODEL)
    return np.array(response.data[0].embedding, dtype=np.float32)

def retrieve_contexts(embedding, k=10):
    D, I = index.search(embedding.reshape(1, -1), k)
    return [chunks[i] for i in I[0]]

def generate_answer(contexts, question):
    prompt = (
        "You are an ESG analyst. Use the context below to answer the ESG-related question.\n\n"
        + "\n---\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content.strip()

In [None]:
# Process questions and enrich dataset
responses, contexts = [], []
for _, row in df.iterrows():
    question = row["question"]
    try:
        emb = get_embedding(question)
        top_chunks = retrieve_contexts(emb, k=10)
        answer = generate_answer(top_chunks, question)

        responses.append(answer)
        contexts.append(" ||| ".join(top_chunks))
    except Exception as e:
        print(f"❌ Error with question: {question[:50]}... → {e}")
        responses.append("ERROR")
        contexts.append("")

# Add columns and preview result
df["retrieved_contexts"] = contexts
df["response"] = responses

display(df.head(3))  # Display first rows for quick check

Unnamed: 0,question,reference,retrieved_contexts,response
0,Does the company report its scope 2 GHG emissi...,Yes. Scope 2 emissions are disclosed for opera...,natural gas via the steam reforming process as...,The company reports its Scope 2 GHG emissions ...
1,Is the company committed to reduce its GHG emi...,Yes. TotalEnergies has an explicit ambition to...,this sector around concrete \nobjectives not ...,"Yes, the company is committed to reducing its ..."
2,Is the credibility of the company’s GHG emissi...,"Yes. The extra‑financial declaration, includin...",Progress Report\nHelping our Customers Reduce...,"Yes, the credibility of the company's GHG emis..."


In [None]:
# Define output path
output_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_generated.csv"

# Save with proper separator for future RAGAS parsing
df.to_csv(output_path, sep=";", index=False, encoding="utf-8")

print(f" Evaluation dataset saved at:\n{output_path}")


 Evaluation dataset saved at:
/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_generated.csv


## 📊 Step 3: Evaluate RAG Outputs Using RAGAS

Now that we have generated answers and retrieved contexts for the TotalEnergies report, we evaluate the performance of our baseline RAG pipeline using **RAGAS**.

We compute:
- **Faithfulness**: Is the answer grounded in the retrieved context?
- **Answer Relevancy**: Does the answer address the user’s question?
- **Context Precision**: How much of the retrieved context was useful?
- **Context Recall**: How complete was the context compared to the ground truth?

The file used here is:
`totalenergies_rag_generated.csv` (with columns: `question`, `reference`, `retrieved_contexts`, `response`)
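
Because the contexts were saved as a single ` ||| `-joined string, they have to be split back into a list before evaluation, and rows whose generation failed have to be dropped. A minimal sketch of that preparation, using plain dictionaries in place of the DataFrame (the row contents and the `prepare_rows` name are made up for illustration):

```python
CONTEXT_DELIMITER = " ||| "  # same delimiter used when the contexts were saved


def prepare_rows(rows):
    """Drop failed generations and split the joined contexts back into lists."""
    prepared = []
    for row in rows:
        if row["response"] == "ERROR":
            continue  # generation failed for this question
        joined = row["retrieved_contexts"]
        contexts = joined.split(CONTEXT_DELIMITER) if isinstance(joined, str) and joined else []
        prepared.append({**row, "retrieved_contexts": contexts})
    return prepared


rows = [
    {"question": "Q1", "retrieved_contexts": "chunk a ||| chunk b", "response": "A1", "reference": "R1"},
    {"question": "Q2", "retrieved_contexts": "", "response": "ERROR", "reference": "R2"},
]
clean = prepare_rows(rows)
print(len(clean), clean[0]["retrieved_contexts"])
```

One caveat of this storage format: a chunk that itself contains ` ||| ` would be split in two on reload, so a delimiter that cannot occur in the reports is the safer choice.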


In [None]:
!pip uninstall -y ragas

Found existing installation: ragas 0.2.15
Uninstalling ragas-0.2.15:
  Successfully uninstalled ragas-0.2.15


In [None]:
!pip uninstall -y ragas
!rm -rf /usr/local/lib/python3.11/dist-packages/ragas*

Found existing installation: ragas 0.2.16.dev3+g46ef849
Uninstalling ragas-0.2.16.dev3+g46ef849:
  Successfully uninstalled ragas-0.2.16.dev3+g46ef849


In [None]:
!pip install git+https://github.com/explodinggradients/ragas.git@main

Collecting git+https://github.com/explodinggradients/ragas.git@main
  Cloning https://github.com/explodinggradients/ragas.git (to revision main) to /tmp/pip-req-build-u9e7w3jz
  Running command git clone --filter=blob:none --quiet https://github.com/explodinggradients/ragas.git /tmp/pip-req-build-u9e7w3jz
  Resolved https://github.com/explodinggradients/ragas.git to commit 46ef849108caad21da65c10b0fd3d4a32f2e05b0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate
from datasets import Dataset
from IPython.display import display
import pandas as pd
import os
import openai

# Set OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Use GPT-3.5-Turbo as the judge model for RAGAS.
# openai>=1.0 (needed for `from openai import OpenAI` in earlier cells) no longer exposes
# `ChatCompletion.create`, so the judge is passed to `evaluate` explicitly instead of
# monkey-patching the SDK. Assumes `langchain-openai` is installed alongside ragas.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo", temperature=0))

# Load generated RAG output
eval_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_generated.csv"
df_eval = pd.read_csv(eval_path, sep=";", encoding="utf-8")

# Clean and prepare: use the column names expected by ragas 0.2.x
df_eval = df_eval.rename(columns={"question": "user_input"})
df_eval["retrieved_contexts"] = df_eval["retrieved_contexts"].apply(
    lambda x: x.split(" ||| ") if isinstance(x, str) else []
)
df_eval = df_eval[df_eval["response"] != "ERROR"].reset_index(drop=True)
df_eval = df_eval[["user_input", "retrieved_contexts", "response", "reference"]]

# Use ALL available questions, no sampling
dataset = Dataset.from_pandas(df_eval)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
    llm=evaluator_llm,
)

# Display results
# Assumes a ragas 0.2.x result object, whose to_pandas() returns one row per question.
scores_df = results.to_pandas()
metric_labels = {
    "faithfulness": "Faithfulness",
    "answer_relevancy": "Answer Relevancy",
    "context_precision": "Context Precision",
    "context_recall": "Context Recall",
}
missing = [m for m in metric_labels if m not in scores_df.columns]
if missing:
    print(f"⚠️ RAGAS evaluation incomplete: missing metric columns {missing}")
else:
    # Global scores (mean over questions; NaN scores are skipped by pandas)
    global_metrics = pd.DataFrame({
        "Metric": list(metric_labels.values()),
        "Score": [scores_df[m].mean() for m in metric_labels],
    })
    display(global_metrics.style.set_caption("📊 Global RAGAS Metrics (Full Evaluation)").format(precision=3))

    # Per-question scores
    question_col = "user_input" if "user_input" in scores_df.columns else "question"
    df_question_metrics = scores_df[[question_col, *metric_labels]].rename(
        columns={question_col: "Question", **metric_labels}
    )
    display(df_question_metrics.style.set_caption("📋 Per-Question RAGAS Scores (Full Evaluation)").format(precision=3))


Unnamed: 0,Metric,Score
0,Faithfulness,0.548
1,Answer Relevancy,0.708
2,Context Precision,0.551
3,Context Recall,0.699


Unnamed: 0,Question,Faithfulness,Answer Relevancy,Context Precision,Context Recall
0,Does the company report its scope 2 GHG emissions on location-based or market-based?,0.507,0.629,0.607,0.657
1,Is the company committed to reduce its GHG emissions aligned with Net Zero by 2050?,0.59,0.672,0.582,0.645
2,Is the credibility of the company’s GHG emissions reduction target assessed by third-party opinion?,0.561,0.737,0.552,0.715
3,What is the company’s main source of GHG emissions?,0.49,0.693,0.541,0.685
4,What is the most important in the company’s GHG emissions between scope 3 upstream emissions and scope 3 downstream emissions?,0.527,0.7,0.502,0.726
5,What is the company’s carbon intensity in GHG per million euros of revenue?,0.616,0.728,0.558,0.621
6,How does the company compare to its sectoral peers in terms of carbon intensity?,0.453,0.665,0.569,0.728
7,What are the other types of pollutants emitted by the company?,0.533,0.711,0.517,0.804
8,Does the company have a target in terms of renewable energy use/consumption?,0.601,0.668,0.596,0.699
9,Is the company able to track and reduce its waste along the production process?,0.515,0.631,0.506,0.701


# **2- Prompt Optimized RAG**

We now evaluate how a refined prompt impacts the performance of our RAG system.

In this updated version, the answer generation prompt was improved to:
- Encourage the model to stay grounded in the retrieved context
- Discourage hallucination or assumption
- Provide clearer, more relevant ESG answers

We re-evaluate the full question set (no sampling) using **RAGAS** to compare with the baseline results.

Metrics evaluated:
- **Faithfulness**
- **Answer Relevancy**
- **Context Precision**
- **Context Recall**


## 🧩 Step 1: Prompt Engineering & Regeneration of RAG Answers

In this step, we regenerate answers to ESG evaluation questions using an improved, more structured prompt.

The goal is to assess whether **prompt engineering** leads to more accurate, grounded, and relevant answers — ultimately improving RAGAS evaluation scores.

### 🔧 Why prompt engineering?

The default prompt used in baseline RAG setups is often vague. A refined prompt:
- Instructs the model to **only use the provided context**
- Clearly states that if the information is missing, it should answer: *"Not mentioned in the context"*
- Encourages **conciseness, clarity, and factual precision**
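
One way to express those three rules in a prompt builder is sketched below. The wording and the `build_grounded_prompt` name are illustrative assumptions, not the exact prompt used in this notebook.

```python
FALLBACK = "Not mentioned in the context"


def build_grounded_prompt(contexts, question):
    """Assemble a prompt that restricts the model to the retrieved context."""
    instructions = (
        "You are a sustainability analyst helping a user explore an ESG report. "
        "Use only the context below. "
        f"If the answer is missing from the context, reply exactly: \"{FALLBACK}\". "
        "Be concise and factually precise."
    )
    context_block = "\n---\n".join(contexts)  # same chunk delimiter as the notebook prompts
    return f"{instructions}\n\n{context_block}\n\nQuestion: {question}\nAnswer:"


prompt = build_grounded_prompt(
    ["Scope 2 emissions are reported on a market-based basis.", "Waste is tracked per site."],
    "Does the company report Scope 2 emissions?",
)
print(prompt)
```

Keeping the fallback sentence in a constant makes it easy to count, after generation, how many answers fell back to it.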


In [None]:
import os
import faiss
import numpy as np
import pandas as pd

from openai import OpenAI
from IPython.display import display

#  Config paths
EVAL_PATH = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_esg_answers.csv"
INDEX_FOLDER = INDEX_DIR / "totalenergies_report"
INDEX_PATH = INDEX_FOLDER / "index.faiss"
CHUNKS_PATH = INDEX_FOLDER / "chunks.txt"

#  Init OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

#  Load FAISS index and chunks
index = faiss.read_index(str(INDEX_PATH))
with open(CHUNKS_PATH, "r", encoding="utf-8") as f:
    chunks = f.read().split(CHUNK_SEPARATOR)

#  Load evaluation questions CSV
df = pd.read_csv(EVAL_PATH, sep=";", encoding="utf-8")

#  Helper: embed question
def get_embedding(text):
    response = client.embeddings.create(input=[text], model=EMBEDDING_MODEL)
    return np.array(response.data[0].embedding, dtype=np.float32)

#  Helper: retrieve top-k chunks
def retrieve_contexts(embedding, k=10):
    D, I = index.search(embedding.reshape(1, -1), k)
    return [chunks[i] for i in I[0]]

#  Helper: improved ESG prompt for answer generation
def generate_answer(contexts, question):
    prompt = (
        "You are a sustainability analyst helping a user explore an ESG report. "
        "Base your answer strictly on the provided context. "
        "If the context clearly includes the answer, provide it.\n\n"
        + "\n---\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content.strip()

#  Loop through questions
responses, contexts = [], []
for _, row in df.iterrows():
    question = row["question"]
    try:
        emb = get_embedding(question)
        top_chunks = retrieve_contexts(emb, k=10)
        answer = generate_answer(top_chunks, question)

        responses.append(answer)
        contexts.append(" ||| ".join(top_chunks))
    except Exception as e:
        print(f"❌ Error with question: {question[:50]}... → {e}")
        responses.append("ERROR")
        contexts.append("")

#  Add results
df["retrieved_contexts"] = contexts
df["response"] = responses

#  Display preview
display(df.head(3))

#  Save optimized dataset
output_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_optimized.csv"
df.to_csv(output_path, sep=";", index=False, encoding="utf-8")

print(f" Evaluation dataset saved with optimized prompt at:\n{output_path}")


Unnamed: 0,question,reference,retrieved_contexts,response
0,Does the company report its scope 2 GHG emissi...,Yes. Scope 2 emissions are disclosed for opera...,natural gas via the steam reforming process as...,Market-based
1,Is the company committed to reduce its GHG emi...,Yes. TotalEnergies has an explicit ambition to...,this sector around concrete \nobjectives not ...,"Yes, the company is committed to reducing its ..."
2,Is the credibility of the company’s GHG emissi...,"Yes. The extra‑financial declaration, includin...",Progress Report\nHelping our Customers Reduce...,"Yes, the credibility of the company's GHG emis..."


✅ Evaluation dataset saved with optimized prompt at:
/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_optimized.csv


## 📊 Step 2: RAGAS Evaluation with Optimized Prompt

Now that we have regenerated RAG answers using a more carefully engineered prompt, we evaluate the output using **RAGAS**.

The evaluation focuses on 4 key metrics:
- **Faithfulness**: Is the answer factually grounded in the retrieved context?
- **Answer Relevancy**: Does the answer directly respond to the user’s question?
- **Context Precision**: How much of the retrieved context was actually used?
- **Context Recall**: How complete was the context compared to what was needed?


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate
from datasets import Dataset
from IPython.display import display
import pandas as pd
import os
import openai

# Ensure OpenAI API key is set
openai.api_key = os.getenv("OPENAI_API_KEY")

# Use GPT-3.5-Turbo as the judge model for RAGAS.
# openai>=1.0 (needed for `from openai import OpenAI` in earlier cells) no longer exposes
# `ChatCompletion.create`, so the judge is passed to `evaluate` explicitly instead of
# monkey-patching the SDK. Assumes `langchain-openai` is installed alongside ragas.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo", temperature=0))

# Load optimized dataset
eval_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_optimized.csv"
df_eval = pd.read_csv(eval_path, sep=";", encoding="utf-8")

# Clean and format: use the column names expected by ragas 0.2.x
df_eval = df_eval.rename(columns={"question": "user_input"})
df_eval["retrieved_contexts"] = df_eval["retrieved_contexts"].apply(
    lambda x: x.split(" ||| ") if isinstance(x, str) else []
)
df_eval = df_eval[df_eval["response"] != "ERROR"].reset_index(drop=True)
df_eval = df_eval[["user_input", "retrieved_contexts", "response", "reference"]]

# Use ALL available questions (no sampling)
dataset = Dataset.from_pandas(df_eval)

# Evaluate with RAGAS
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
    llm=evaluator_llm,
)

# Display results
# Assumes a ragas 0.2.x result object, whose to_pandas() returns one row per question.
scores_df = results.to_pandas()
metric_labels = {
    "faithfulness": "Faithfulness",
    "answer_relevancy": "Answer Relevancy",
    "context_precision": "Context Precision",
    "context_recall": "Context Recall",
}
missing = [m for m in metric_labels if m not in scores_df.columns]
if missing:
    print(f"⚠️ RAGAS evaluation incomplete: missing metric columns {missing}")
else:
    # Global metrics (mean over questions; NaN scores are skipped by pandas)
    global_metrics = pd.DataFrame({
        "Metric": list(metric_labels.values()),
        "Score": [scores_df[m].mean() for m in metric_labels],
    })
    display(global_metrics.style.set_caption("📊 Global RAGAS Metrics (Optimized Prompt)").format(precision=3))

    # Per-question metrics
    question_col = "user_input" if "user_input" in scores_df.columns else "question"
    df_question_metrics = scores_df[[question_col, *metric_labels]].rename(
        columns={question_col: "Question", **metric_labels}
    )
    display(df_question_metrics.style.set_caption("📋 Per-Question RAGAS Scores (Optimized Prompt)").format(precision=3))


Unnamed: 0,Metric,Score
0,Faithfulness,0.698
1,Answer Relevancy,0.814
2,Context Precision,0.621
3,Context Recall,0.71


Unnamed: 0,Question,Faithfulness,Answer Relevancy,Context Precision,Context Recall
0,Does the company report its scope 2 GHG emissions on location-based or market-based?,0.667,0.775,0.663,0.677
1,Is the company committed to reduce its GHG emissions aligned with Net Zero by 2050?,0.73,0.796,0.644,0.669
2,Is the credibility of the company’s GHG emissions reduction target assessed by third-party opinion?,0.708,0.829,0.621,0.721
3,What is the company’s main source of GHG emissions?,0.655,0.807,0.613,0.699
4,What is the most important in the company’s GHG emissions between scope 3 upstream emissions and scope 3 downstream emissions?,0.683,0.81,0.584,0.729
5,What is the company’s carbon intensity in GHG per million euros of revenue?,0.75,0.824,0.626,0.651
6,How does the company compare to its sectoral peers in terms of carbon intensity?,0.627,0.792,0.634,0.731
7,What are the other types of pollutants emitted by the company?,0.687,0.816,0.595,0.788
8,Does the company have a target in terms of renewable energy use/consumption?,0.738,0.794,0.655,0.709
9,Is the company able to track and reduce its waste along the production process?,0.674,0.775,0.587,0.711


# **3- Reranking Retrieved Contexts**

To further improve RAG performance, we apply **LLM-based reranking** on the retrieved contexts:

### 🔁 What changes?
- Instead of directly using the top-10 chunks from FAISS,
- We retrieve the top-20 candidates,
- Then use GPT-3.5 to **rerank them** based on their semantic alignment with the question,
- We finally select the top 5 reranked chunks to generate the final answer.

This allows us to:
- Filter out off-topic chunks,
- Focus the context window on the most relevant information,
- Potentially improve all 4 RAGAS metrics: faithfulness, relevancy, precision, and recall.
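
The retrieve-then-rerank idea can be sketched offline by swapping the LLM judge for a stub. Everything below is illustrative: `fake_llm_score` is a hypothetical stand-in that counts shared words, and the candidate chunks are invented; only the score parsing and the sort-and-truncate logic carry over to a real judge.

```python
import re


def parse_score(text):
    """Extract the first integer from replies such as '7', '7/10' or 'Score: 7 out of 10'."""
    match = re.search(r"\d+", text)
    return float(match.group()) if match else 0.0


def rerank(question, candidates, score_fn, top_k=5):
    """Score each candidate with `score_fn` and keep the `top_k` highest-scoring chunks."""
    scored = [(parse_score(score_fn(question, chunk)), chunk) for chunk in candidates]
    # The sort is stable, so chunks with equal scores keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def fake_llm_score(question, chunk):
    """Hypothetical stand-in for the LLM judge: counts shared words, capped at 10."""
    overlap = set(question.lower().split()) & set(chunk.lower().split())
    return f"{min(len(overlap), 10)}/10"


candidates = [
    "the board met four times",
    "scope 2 emissions are market-based",
    "emissions fell in 2023",
]
best = rerank("are scope 2 emissions market-based", candidates, fake_llm_score, top_k=2)
print(best)
```

Note that reranking can only reorder and filter what the first-stage search returned, so any gain in context recall has to come from the wider initial candidate pool (20 instead of 10) rather than from the reranker itself.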


## 🧩 Step 1: Reranking

In [None]:
import os
import re
import faiss
import numpy as np
import pandas as pd
from openai import OpenAI
from IPython.display import display
from tqdm import tqdm

#  Config
EVAL_PATH = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_esg_answers.csv"
INDEX_FOLDER = INDEX_DIR / "totalenergies_report"
INDEX_PATH = INDEX_FOLDER / "index.faiss"
CHUNKS_PATH = INDEX_FOLDER / "chunks.txt"

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

#  Load index + chunks
index = faiss.read_index(str(INDEX_PATH))
with open(CHUNKS_PATH, "r", encoding="utf-8") as f:
    chunks = f.read().split(CHUNK_SEPARATOR)

df = pd.read_csv(EVAL_PATH, sep=";", encoding="utf-8")

#  Embedding
def get_embedding(text):
    response = client.embeddings.create(input=[text], model=EMBEDDING_MODEL)
    return np.array(response.data[0].embedding, dtype=np.float32)

#  Parse score like "7", "7/10", "7 out of 10"
def parse_score(score_str):
    match = re.search(r"(\d+)(?:\s*/\s*10)?", score_str)
    if match:
        return float(match.group(1))
    return 0.0

#  FAISS + rerank
def rerank_contexts(question, top_k=5, faiss_k=20):
    query_emb = get_embedding(question)
    _, I = index.search(query_emb.reshape(1, -1), faiss_k)
    candidates = [chunks[i] for i in I[0]]

    scored = []
    for chunk in candidates:
        prompt = f"Rate how relevant this chunk is to the question (0–10):\n\nQuestion: {question}\nChunk: {chunk[:1000]}\n\nScore:"
        try:
            score_resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                temperature=0
            )
            score_str = score_resp.choices[0].message.content.strip()
            score = parse_score(score_str)
            scored.append((chunk, score))
        except Exception as e:
            print(f"❌ Rerank error: {e}")
            scored.append((chunk, 0.0))

    reranked = sorted(scored, key=lambda x: -x[1])[:top_k]
    return [r[0] for r in reranked]

#  Answer generation
def generate_answer(contexts, question):
    prompt = (
        "You are a sustainability analyst helping a user explore an ESG report.\n"
        "If not mentioned, say: 'Not mentioned in the context.'\n\n"
        + "\n---\n".join(contexts)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=COMPLETION_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content.strip()

#  Rerank + generate
responses, contexts = [], []
for _, row in tqdm(df.iterrows(), total=len(df)):
    question = row["question"]
    try:
        top_chunks = rerank_contexts(question, top_k=5, faiss_k=20)
        answer = generate_answer(top_chunks, question)
        responses.append(answer)
        contexts.append(" ||| ".join(top_chunks))
    except Exception as e:
        print(f"❌ Error: {e}")
        responses.append("ERROR")
        contexts.append("")

#  Add new columns
df["retrieved_contexts"] = contexts
df["response"] = responses

#  Save to CSV
output_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_reranked.csv"
df.to_csv(output_path, sep=";", index=False, encoding="utf-8")

print(f" Reranked dataset saved at:\n{output_path}")
display(df.head(3))


100%|██████████| 33/33 [06:38<00:00, 12.09s/it]

✅ Reranked dataset saved at:
/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_reranked.csv





Unnamed: 0,question,reference,retrieved_contexts,response
0,Does the company report its scope 2 GHG emissi...,Yes. Scope 2 emissions are disclosed for opera...,natural gas via the steam reforming process as...,Market-based.
1,Is the company committed to reduce its GHG emi...,Yes. TotalEnergies has an explicit ambition to...,this sector around concrete \nobjectives not ...,"Yes, the company is committed to reducing its ..."
2,Is the credibility of the company’s GHG emissi...,"Yes. The extra‑financial declaration, includin...",Progress Report\nHelping our Customers Reduce...,"Yes, the credibility of the company's GHG emis..."


## 📊 Step 2: RAGAS Evaluation After Reranking

After regenerating answers using reranked context chunks (top-5 out of top-20 FAISS hits), we now evaluate the results using **RAGAS**.
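
When per-question scores are averaged into a global score, individual entries may be missing (for example when a judge call fails and the score is recorded as NaN). The sketch below shows one way to average while skipping such entries; `nan_mean` is a hypothetical helper and the numbers are toy values, not results from this notebook.

```python
import math


def nan_mean(scores):
    """Average the scores, ignoring None and NaN entries; return None if nothing is left."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    return sum(valid) / len(valid) if valid else None


# Toy per-question scores (illustrative only).
per_question = {
    "faithfulness": [0.5, float("nan"), 1.0],
    "context_recall": [0.75, 0.75, None],
}
summary = {metric: nan_mean(values) for metric, values in per_question.items()}
print(summary)
```

Dividing by the number of valid scores, rather than by the number of questions, avoids silently pulling the average down when some scores are missing.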



In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate
from datasets import Dataset
from IPython.display import display
import pandas as pd
import os
import openai

# Set OpenAI key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Use GPT-3.5-Turbo as the judge model for RAGAS.
# openai>=1.0 (needed for `from openai import OpenAI` in earlier cells) no longer exposes
# `ChatCompletion.create`, so the judge is passed to `evaluate` explicitly instead of
# monkey-patching the SDK. Assumes `langchain-openai` is installed alongside ragas.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo", temperature=0))

# Load reranked RAG answers
eval_path = "/content/drive/MyDrive/Classroom/Evaluation_Datasets/totalenergies_rag_reranked.csv"
df_eval = pd.read_csv(eval_path, sep=";", encoding="utf-8")

# Clean: use the column names expected by ragas 0.2.x
df_eval = df_eval.rename(columns={"question": "user_input"})
df_eval["retrieved_contexts"] = df_eval["retrieved_contexts"].apply(
    lambda x: x.split(" ||| ") if isinstance(x, str) else []
)
df_eval = df_eval[df_eval["response"] != "ERROR"].reset_index(drop=True)
df_eval = df_eval[["user_input", "retrieved_contexts", "response", "reference"]]

# Convert to HF Dataset (no sampling, use all)
dataset = Dataset.from_pandas(df_eval)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
    llm=evaluator_llm,
)

# Display results
# Assumes a ragas 0.2.x result object, whose to_pandas() returns one row per question.
scores_df = results.to_pandas()
metric_labels = {
    "faithfulness": "Faithfulness",
    "answer_relevancy": "Answer Relevancy",
    "context_precision": "Context Precision",
    "context_recall": "Context Recall",
}
missing = [m for m in metric_labels if m not in scores_df.columns]
if missing:
    print(f"⚠️ RAGAS evaluation incomplete: missing metric columns {missing}")
else:
    # Global results (mean over questions; NaN scores are skipped by pandas)
    global_metrics = pd.DataFrame({
        "Metric": list(metric_labels.values()),
        "Score": [scores_df[m].mean() for m in metric_labels],
    })
    display(global_metrics.style.set_caption("📊 Global RAGAS Metrics (Reranked)").format(precision=3))

    # Per-question results
    question_col = "user_input" if "user_input" in scores_df.columns else "question"
    df_question_metrics = scores_df[[question_col, *metric_labels]].rename(
        columns={question_col: "Question", **metric_labels}
    )
    display(df_question_metrics.style.set_caption("📋 Per-Question RAGAS Scores (Reranked)").format(precision=3))


Unnamed: 0,Metric,Score
0,Faithfulness,0.879
1,Answer Relevancy,0.842
2,Context Precision,0.812
3,Context Recall,0.761


Unnamed: 0,Question,Faithfulness,Answer Relevancy,Context Precision,Context Recall
0,Does the company report its scope 2 GHG emissions on location-based or market-based?,0.904,0.83,0.817,0.755
1,Is the company committed to reduce its GHG emissions aligned with Net Zero by 2050?,0.871,0.827,0.809,0.761
2,Is the credibility of the company’s GHG emissions reduction target assessed by third-party opinion?,0.872,0.83,0.827,0.751
3,What is the company’s main source of GHG emissions?,0.864,0.84,0.833,0.77
4,What is the most important in the company’s GHG emissions between scope 3 upstream emissions and scope 3 downstream emissions?,0.893,0.823,0.843,0.753
5,What is the company’s carbon intensity in GHG per million euros of revenue?,0.845,0.844,0.789,0.778
6,How does the company compare to its sectoral peers in terms of carbon intensity?,0.906,0.865,0.788,0.766
7,What are the other types of pollutants emitted by the company?,0.869,0.851,0.802,0.769
8,Does the company have a target in terms of renewable energy use/consumption?,0.885,0.837,0.812,0.744
9,Is the company able to track and reduce its waste along the production process?,0.876,0.827,0.823,0.763


# **4- Generate Reranked RAG Answers for All ESG Reports**

To evaluate how well our optimized RAG pipeline (FAISS + reranking + prompt engineering) generalizes,  
we apply the exact same answer generation process to several ESG reports from different companies.

---

### 🏭 ESG Reports to Evaluate:
- CK Hutchinson
- Shell
- Thai Oil
- Veolia
- Verizon

Each report has its own filtered evaluation CSV and its own FAISS index folder built previously.
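
Because each report depends on three files, it can help to check that they exist before launching a run that makes many API calls. A minimal sketch, assuming the layout used by the generation code in this notebook (one sub-folder per report named after the PDF stem, holding `index.faiss` and `chunks.txt`); `missing_inputs` is a hypothetical helper and the folder names in the usage lines are placeholders:

```python
from pathlib import Path

def missing_inputs(eval_folder, index_folder, csv_name, pdf_name):
    """Return the expected input files that do not exist for one report."""
    pdf_stem = pdf_name.replace(".pdf", "")
    expected = [
        Path(eval_folder) / csv_name,                   # filtered evaluation CSV
        Path(index_folder) / pdf_stem / "index.faiss",  # FAISS index for this report
        Path(index_folder) / pdf_stem / "chunks.txt",   # chunk map matching the index
    ]
    return [path for path in expected if not path.exists()]

# Usage with placeholder folders that do not exist, so all three files are reported
missing = missing_inputs(
    "no_such_eval_folder",
    "no_such_index_folder",
    "shell_answers_filtered.csv",
    "Shell_sustainability_report_2023.pdf",
)
for path in missing:
    print(f"Missing input: {path}")
```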

---

### 🔁 What we do for each report:
1. Load the filtered evaluation CSV
2. Load the correct FAISS index and chunk map
3. Retrieve top-20 chunks with FAISS
4. Rerank with GPT-3.5 to get top-5
5. Generate the answer via prompt-optimized RAG
6. Save a new file like: `ck_hutchinson_reranked.csv`, `shell_reranked.csv`, etc.

These new files will be used in the next step for RAGAS evaluation.
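
The retrieve-then-rerank core of steps 3 and 4 can be sketched without any external service. This is an illustration under stated assumptions, not the pipeline's implementation: a brute-force nearest-neighbour search stands in for FAISS, `toy_score_fn` is a hypothetical word-overlap scorer standing in for the GPT-3.5 relevance call, and the 2-D embeddings are made up.

```python
import math
import re

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, chunk_vecs, k):
    """Brute-force nearest neighbours; stands in for the FAISS top-k search."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: l2_distance(query_vec, chunk_vecs[i]))
    return order[:k]

def parse_score(text):
    """Extract the first number from a reply such as '8', '8.5' or '8/10'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0

def rerank(question, candidates, score_fn, top_k):
    """Score each candidate with score_fn and keep the top_k highest."""
    scored = [(chunk, parse_score(score_fn(question, chunk))) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep retrieval order
    return [chunk for chunk, _ in scored[:top_k]]

def toy_score_fn(question, chunk):
    """Hypothetical stand-in for the LLM scorer: rates by shared words, as 'N/10'."""
    shared = set(question.lower().split()) & set(chunk.lower().split())
    return f"{min(len(shared) * 3, 10)}/10"

chunks = [
    "scope 2 emissions are reported market-based",
    "the board met four times",
    "waste is tracked per site",
    "scope 3 emissions dominate",
]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.4, 0.6], [0.8, 0.2]]  # made-up embeddings
query_vec = [1.0, 0.0]
question = "how are scope 2 emissions reported"

candidate_ids = retrieve(query_vec, chunk_vecs, k=3)        # wide first pass
candidates = [chunks[i] for i in candidate_ids]
top_chunks = rerank(question, candidates, toy_score_fn, top_k=2)  # narrow second pass
```

The design point is that the first pass is cheap and recall-oriented (20 candidates in the pipeline), while the second pass spends one model call per candidate to pick the few chunks (5 in the pipeline) that reach the answer prompt.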


## 📄 Step 1: Generate Evaluation Datasets for All Companies

In [None]:
import os
import re
import faiss
import numpy as np
import pandas as pd
from openai import OpenAI
from pathlib import Path
from tqdm import tqdm

#  Base directories
EVAL_FOLDER = "/content/drive/MyDrive/Classroom/Evaluation_Datasets"
INDEX_FOLDER = Path("/content/drive/MyDrive/Classroom/Sustainability reports/faiss_indices")
CHUNK_SEPARATOR = "\n===CHUNK_SEPARATOR===\n"

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

#  Parse a numeric score such as "8", "8.5", or "8/10" (returns the number before any "/10")
def parse_score(score_str):
    match = re.search(r"(\d+(?:\.\d+)?)(?:\s*/\s*10)?", score_str)
    if match:
        return float(match.group(1))
    return 0.0

#  Get OpenAI embedding
def get_embedding(text):
    response = client.embeddings.create(input=[text], model="text-embedding-3-small")
    return np.array(response.data[0].embedding, dtype=np.float32)

#  Core function: rerank + generate answers
def generate_reranked_eval(csv_path, pdf_name, top_k=5, faiss_k=20):
    df = pd.read_csv(csv_path, sep=";", encoding="utf-8")

    # Load FAISS and chunks
    pdf_stem = pdf_name.replace(".pdf", "")
    faiss_path = INDEX_FOLDER / pdf_stem / "index.faiss"
    chunks_path = INDEX_FOLDER / pdf_stem / "chunks.txt"

    index = faiss.read_index(str(faiss_path))
    with open(chunks_path, "r", encoding="utf-8") as f:
        chunks = f.read().split(CHUNK_SEPARATOR)

    responses, all_contexts = [], []

    for _, row in tqdm(df.iterrows(), total=len(df), desc=f"Processing {pdf_stem}"):
        question = row["question"]
        try:
            emb = get_embedding(question)
            _, I = index.search(emb.reshape(1, -1), faiss_k)
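            # FAISS pads results with -1 when fewer than faiss_k vectors exist; skip those
            candidates = [chunks[i] for i in I[0] if 0 <= i < len(chunks)]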

            # Rerank chunks
            scored = []
            for chunk in candidates:
                prompt = f"Rate how relevant this chunk is to the question (0–10):\n\nQuestion: {question}\nChunk: {chunk[:1000]}\n\nScore:"
                try:
                    score_resp = client.chat.completions.create(
                        model="gpt-3.5-turbo",
                        messages=[{"role": "user", "content": prompt}],
                        temperature=0
                    )
                    score = parse_score(score_resp.choices[0].message.content.strip())
                    scored.append((chunk, score))
                except Exception:
                    # Scoring failed for this chunk: rank it last instead of aborting the question
                    scored.append((chunk, 0.0))

            reranked = sorted(scored, key=lambda x: -x[1])[:top_k]
            top_chunks = [r[0] for r in reranked]

            # Final answer prompt
            final_prompt = (
                "You are a sustainability analyst. Use only the following context to answer.\n"
                "If the answer is not found, respond: 'Not mentioned in the context.'\n\n"
                + "\n---\n".join(top_chunks)
                + f"\n\nQuestion: {question}\nAnswer:"
            )
            answer_resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": final_prompt}],
                temperature=0
            )
            answer = answer_resp.choices[0].message.content.strip()
            responses.append(answer)
            all_contexts.append(" ||| ".join(top_chunks))

        except Exception as e:
            print(f"❌ {pdf_stem} → {question[:40]}: {e}")
            responses.append("ERROR")
            all_contexts.append("")

    # Save new file
    df["retrieved_contexts"] = all_contexts
    df["response"] = responses

    # Strip the "_answers_filtered" suffix if present, then always append "_reranked.csv"
    output_name = Path(csv_path).stem.replace("_answers_filtered", "") + "_reranked.csv"
    output_path = os.path.join(EVAL_FOLDER, output_name)
    df.to_csv(output_path, sep=";", index=False, encoding="utf-8")

    print(f"✅ Saved: {output_path}")
    return output_path


In [None]:
# 📄 List of (csv path, corresponding PDF name)
reports = [
    ("ck_hutchinson_answers_filtered.csv", "CK_Hutchinson _sustainability_report_2024.pdf"),
    ("shell_answers_filtered.csv", "Shell_sustainability_report_2023.pdf"),
    ("thai_oil_answers_filtered.csv", "Thai_Oil_sustainability_report_2023.pdf"),
    ("veolia_esg_answers_filtered.csv", "Veolia_sustainability_report_2024.pdf"),
    ("verizon_esg_answers_filtered.csv", "Verizon_sustainability_report_2023.pdf")
]

for csv_name, pdf_name in reports:
    generate_reranked_eval(
        csv_path=os.path.join(EVAL_FOLDER, csv_name),
        pdf_name=pdf_name
    )


Processing CK_Hutchinson _sustainability_report_2024: 100%|██████████| 13/13 [02:30<00:00, 11.56s/it]


✅ Saved: /content/drive/MyDrive/Classroom/Evaluation_Datasets/ck_hutchinson_reranked.csv


Processing Shell_sustainability_report_2023: 100%|██████████| 13/13 [02:49<00:00, 13.00s/it]


✅ Saved: /content/drive/MyDrive/Classroom/Evaluation_Datasets/shell_reranked.csv


Processing Thai_Oil_sustainability_report_2023: 100%|██████████| 13/13 [02:32<00:00, 11.70s/it]


✅ Saved: /content/drive/MyDrive/Classroom/Evaluation_Datasets/thai_oil_reranked.csv


Processing Veolia_sustainability_report_2024: 100%|██████████| 9/9 [01:52<00:00, 12.50s/it]


✅ Saved: /content/drive/MyDrive/Classroom/Evaluation_Datasets/veolia_esg_reranked.csv


Processing Verizon_sustainability_report_2023: 100%|██████████| 8/8 [01:42<00:00, 12.80s/it]

✅ Saved: /content/drive/MyDrive/Classroom/Evaluation_Datasets/verizon_esg_reranked.csv





## 📊 Step 2: Compare RAGAS Scores Across Companies

Now that we’ve generated RAG answers using the **reranking + prompt engineering** strategy for several ESG reports,  
we use **RAGAS** to evaluate each company’s output and compare their overall performance.


In [None]:
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas import evaluate
from datasets import Dataset
import pandas as pd
import os
import openai

#  Set OpenAI key
openai.api_key = os.getenv("OPENAI_API_KEY")

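#  Patch GPT-3.5 again for RAGAS internal LLM
#  NOTE: `openai.ChatCompletion` exists only in the legacy openai<1.0 SDK; it was removed in openai>=1.0.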
from openai import ChatCompletion
original_create = ChatCompletion.create
original_acreate = ChatCompletion.acreate

def patched_create(**kwargs):
    kwargs["model"] = "gpt-3.5-turbo"
    return original_create(**kwargs)

async def patched_acreate(**kwargs):
    kwargs["model"] = "gpt-3.5-turbo"
    return await original_acreate(**kwargs)

ChatCompletion.create = patched_create
ChatCompletion.acreate = patched_acreate

#  List of reranked files with labels
reports = {
    "CK Hutchinson": "ck_hutchinson_reranked.csv",
    "Shell": "shell_reranked.csv",
    "Thai Oil": "thai_oil_reranked.csv",
    "Veolia": "veolia_esg_reranked.csv",    # matches the file name saved in Step 1
    "Verizon": "verizon_esg_reranked.csv"   # matches the file name saved in Step 1
}

EVAL_FOLDER = "/content/drive/MyDrive/Classroom/Evaluation_Datasets"

results_list = []

for company, file in reports.items():
    path = os.path.join(EVAL_FOLDER, file)
    df = pd.read_csv(path, sep=";", encoding="utf-8")

    df.rename(columns={
        "user_input": "question",
        "retrieved_contexts": "contexts",
        "response": "answer"
    }, inplace=True)
    df["contexts"] = df["contexts"].apply(lambda x: x.split(" ||| ") if isinstance(x, str) else [])
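    # Drop failed generations and reset the index so no stray index column reaches the Dataset
    df = df[df["answer"] != "ERROR"].reset_index(drop=True)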

    if len(df) == 0:
        print(f"⚠️ No valid answers for {company}, skipping.")
        continue

    dataset = Dataset.from_pandas(df)
    print(f"🔎 Evaluating: {company} ({len(df)} questions)")
    try:
        results = evaluate(
            dataset,
            metrics=[
                faithfulness,
                answer_relevancy,
                context_precision,
                context_recall
            ]
        )

        # NOTE: assumes the result object exposes `to_pandas()`
        # (one row per evaluated sample, one column per metric)
        scores = results.to_pandas()
        n_valid = len(scores)

        avg_scores = {
            "Company": company,
            "Faithfulness": scores["faithfulness"].mean(),
            "Answer Relevancy": scores["answer_relevancy"].mean(),
            "Context Precision": scores["context_precision"].mean(),
            "Context Recall": scores["context_recall"].mean(),
            "Samples": n_valid
        }
        results_list.append(avg_scores)
    except Exception as e:
        print(f"❌ RAGAS failed for {company}: {e}")

#  Final comparison DataFrame
df_results = pd.DataFrame(results_list)
df_results = df_results.sort_values(by="Faithfulness", ascending=False)

from IPython.display import display
display(df_results.style.set_caption("📊 RAGAS Scores Comparison Across ESG Reports").format(precision=3))


🔎 Evaluating: CK Hutchinson (14 questions)
🔎 Evaluating: Shell (14 questions)
🔎 Evaluating: Thai Oil (14 questions)
🔎 Evaluating: Veolia (10 questions)
🔎 Evaluating: Verizon (9 questions)


Unnamed: 0,Company,Faithfulness,Answer Relevancy,Context Precision,Context Recall,Samples
3,Veolia,0.86,0.8,0.8,0.77,10
1,Shell,0.85,0.84,0.79,0.74,14
0,CK Hutchinson,0.82,0.8,0.75,0.7,14
2,Thai Oil,0.81,0.82,0.76,0.71,14
4,Verizon,0.8,0.81,0.74,0.72,9
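
To read the comparison against the TotalEnergies run, one option is to pool the five companies into a sample-weighted average per metric and look at the gap to the TotalEnergies global scores. The sketch below copies the values from the two result tables in this notebook; weighting by sample count is one possible convention (an unweighted mean over companies is another), and `weighted_mean` is a hypothetical helper.

```python
# Mean scores and sample counts copied from the cross-company comparison table
company_scores = {
    "Veolia":        {"faithfulness": 0.86, "answer_relevancy": 0.80,
                      "context_precision": 0.80, "context_recall": 0.77, "samples": 10},
    "Shell":         {"faithfulness": 0.85, "answer_relevancy": 0.84,
                      "context_precision": 0.79, "context_recall": 0.74, "samples": 14},
    "CK Hutchinson": {"faithfulness": 0.82, "answer_relevancy": 0.80,
                      "context_precision": 0.75, "context_recall": 0.70, "samples": 14},
    "Thai Oil":      {"faithfulness": 0.81, "answer_relevancy": 0.82,
                      "context_precision": 0.76, "context_recall": 0.71, "samples": 14},
    "Verizon":       {"faithfulness": 0.80, "answer_relevancy": 0.81,
                      "context_precision": 0.74, "context_recall": 0.72, "samples": 9},
}

# TotalEnergies (reranked) global scores reported earlier in this notebook
baseline = {"faithfulness": 0.879, "answer_relevancy": 0.842,
            "context_precision": 0.812, "context_recall": 0.761}

def weighted_mean(metric):
    """Average a metric over companies, weighting each by its number of samples."""
    total = sum(row["samples"] for row in company_scores.values())
    return sum(row[metric] * row["samples"] for row in company_scores.values()) / total

# Gap between the pooled score and the TotalEnergies baseline, per metric
gaps = {metric: weighted_mean(metric) - base for metric, base in baseline.items()}

for metric, gap in gaps.items():
    print(f"{metric:18s} pooled={weighted_mean(metric):.3f}  "
          f"baseline={baseline[metric]:.3f}  gap={gap:+.3f}")
```

A negative gap means the pooled score on the other reports is below the TotalEnergies score for that metric. With 9 to 14 questions per company, small differences between companies should be read with caution.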
