<a href="https://colab.research.google.com/github/sh-sadaf/Chat-Bot-3.5-Turbo-using-OpenAI-/blob/main/renewable_energy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pdfplumber




In [2]:
import requests
import pdfplumber
import os

# Create folder to save PDFs and text
os.makedirs("renewable_energy_docs", exist_ok=True)

# List of known recent renewable energy directive PDF URLs
pdf_urls = [
    "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32023L2413",  # 2023
    "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32024L1275"   # 2024 (example)
]

for url in pdf_urls:
    # Create safe filenames
    celex_number = url.split("CELEX:")[-1]
    pdf_path = f"renewable_energy_docs/{celex_number}.pdf"

    # Download PDF
    response = requests.get(url, timeout=60)  # timeout so a stalled download cannot hang the cell
    if response.status_code == 200:
        with open(pdf_path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded {pdf_path}")
    else:
        print(f"Failed to download PDF: {url}")
        continue

    # Extract text
    text_file = pdf_path.replace(".pdf", ".txt")
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                full_text += page_text + "\n"

    # Save text
    with open(text_file, "w", encoding="utf-8") as f:
        f.write(full_text)
    print(f"Saved extracted text to {text_file}")


Downloaded renewable_energy_docs/32023L2413.pdf
Saved extracted text to renewable_energy_docs/32023L2413.txt
Downloaded renewable_energy_docs/32024L1275.pdf
Saved extracted text to renewable_energy_docs/32024L1275.txt



1. Read the .txt files

2. Clean the text: remove extra whitespace, headers, footers, page numbers

3. Split into chunks (~500–1000 words each)

4. Store chunks with metadata (title, CELEX number, source URL)
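
Step 4 can be sketched as a small record builder. This is an illustration only: `make_chunk_record` and `EURLEX_PDF_URL` are hypothetical names, and the source URL is rebuilt from the CELEX number using the same EUR-Lex URL pattern as the download cell. The title is left out because it cannot be derived from the CELEX number alone.

```python
# Hypothetical helper; the notebook's own chunking cell stores only the CELEX number and chunk index.
EURLEX_PDF_URL = "https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:{celex}"


def make_chunk_record(celex_number, chunk_index, text):
    """Bundle one chunk with the metadata needed to trace it back to its source PDF."""
    return {
        "celex_number": celex_number,
        "chunk_index": chunk_index,
        "source_url": EURLEX_PDF_URL.format(celex=celex_number),
        "text": text,
    }


record = make_chunk_record("32023L2413", 0, "example chunk text")
print(record["source_url"])
```

Storing the URL alongside each chunk would let a retrieval step cite the directive a passage came from.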

In [3]:
import os
import re

# Folder with extracted texts
folder = "renewable_energy_docs"

# Parameters
chunk_size = 500  # words per chunk

# List to store all chunks
all_chunks = []

# Loop through each .txt file
for filename in os.listdir(folder):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder, filename)

        # Extract CELEX number from filename
        celex_number = filename.replace(".txt", "")

        # Read text
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()

        # Clean text
        text = re.sub(r'Page\s+\d+', '', text)    # remove page numbers if any
        text = re.sub(r'\s+', ' ', text).strip()  # collapse newlines and repeated whitespace into single spaces

        # Split into words
        words = text.split()

        # Create chunks
        for i in range(0, len(words), chunk_size):
            chunk_words = words[i:i+chunk_size]
            chunk_text = ' '.join(chunk_words)

            chunk_data = {
                "celex_number": celex_number,
                "chunk_index": i // chunk_size,
                "text": chunk_text
            }
            all_chunks.append(chunk_data)

print(f"Created {len(all_chunks)} chunks from {len([f for f in os.listdir(folder) if f.endswith('.txt')])} documents.")

# Optional: save chunks to a JSON file for easy use later
import json
with open("renewable_energy_chunks.json", "w", encoding="utf-8") as f:
    json.dump(all_chunks, f, indent=2, ensure_ascii=False)

print("Saved all chunks to renewable_energy_chunks.json")


Created 168 chunks from 2 documents.
Saved all chunks to renewable_energy_chunks.json


What this does:

1. Splits your directive texts into manageable chunks for embeddings

2. Keeps metadata (CELEX number + chunk index)

3. Saves everything as renewable_energy_chunks.json for easy use in RAG
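
The chunking logic summarized above can be isolated into one function and tried on a toy input. This is a minimal sketch: `split_into_chunks` is a hypothetical name, and the 12-word sample with `chunk_size=5` is made up so the result is easy to check by eye.

```python
def split_into_chunks(text, celex_number, chunk_size=500):
    """Split text into consecutive word-based chunks, keeping CELEX number and chunk index."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size):
        chunks.append({
            "celex_number": celex_number,
            "chunk_index": start // chunk_size,
            "text": " ".join(words[start:start + chunk_size]),
        })
    return chunks


# Toy input: 12 words split into chunks of 5 words.
sample_text = " ".join(f"word{n}" for n in range(12))
sample_chunks = split_into_chunks(sample_text, "32023L2413", chunk_size=5)
for chunk in sample_chunks:
    print(chunk["chunk_index"], len(chunk["text"].split()))
```

The last chunk is simply shorter than the others. Chunks do not overlap, so a sentence that straddles a boundary is split between two chunks.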

In [5]:
!pip install pinecone




In [9]:
from google.colab import auth
from google.colab import drive
# (Optional) Use Colab secrets widget if you stored the key there

In [26]:
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer

# Load a free embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

text = "Test"
embedding = model.encode(text)
print("Embedding generated:", embedding[:10], "...")




Embedding generated: [ 0.01157346  0.02513618 -0.03670185  0.05932486 -0.00714904 -0.04119425
  0.0770874   0.03744255  0.01244901 -0.00611766] ...


In [27]:
import json

with open("renewable_energy_chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)
print(f"Loaded {len(chunks)} chunks")


Loaded 168 chunks


In [28]:
from getpass import getpass
from pinecone import Pinecone, ServerlessSpec

pinecone_key = getpass("Enter your Pinecone API key: ")
pinecone_env = "us-east-1"

pc = Pinecone(api_key=pinecone_key)

index_name = "renewable-energy"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # sentence-transformers all-MiniLM-L6-v2 dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region=pinecone_env)
    )
    print(f"Index '{index_name}' created!")
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(index_name)


Enter your Pinecone API key: ··········
Index 'renewable-energy' already exists.


In [30]:
# ----------------------------
# 0️⃣ Install dependencies
# ----------------------------
!pip install sentence-transformers pinecone --quiet  # same package as the earlier install cell; pinecone-client is its old name

# ----------------------------
# 1️⃣ Imports
# ----------------------------
from getpass import getpass
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer
import json
from time import sleep

# ----------------------------
# 2️⃣ Initialize Pinecone
# ----------------------------
pinecone_key = getpass("Enter your Pinecone API key: ")
pinecone_env = "us-east-1"
pc = Pinecone(api_key=pinecone_key)

index_name = "renewable-energy"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # all-MiniLM-L6-v2 dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region=pinecone_env)
    )
    print(f"Index '{index_name}' created!")
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(index_name)

# ----------------------------
# 3️⃣ Load your chunks
# ----------------------------
with open("renewable_energy_chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)
print(f"Loaded {len(chunks)} chunks")

# ----------------------------
# 4️⃣ Load sentence-transformers model
# ----------------------------
model = SentenceTransformer('all-MiniLM-L6-v2')

# ----------------------------
# 5️⃣ Upload chunks in batches
# ----------------------------
batch_size = 10
vectors = []

for i, chunk in enumerate(chunks):
    embedding = model.encode(chunk["text"]).tolist()

    vectors.append({
        "id": f"{chunk['celex_number']}_{chunk['chunk_index']}",
        "values": embedding,
        "metadata": {
            "celex_number": chunk["celex_number"],
            "chunk_index": chunk["chunk_index"],
            "text": chunk["text"]
        }
    })

    if len(vectors) == batch_size or i == len(chunks)-1:
        index.upsert(vectors)
        vectors = []
        sleep(1)
        print(f"Uploaded batch up to chunk {i+1}")

print("All chunks uploaded to Pinecone!")


Enter your Pinecone API key: ··········
Index 'renewable-energy' already exists.
Loaded 168 chunks
Uploaded batch up to chunk 10
Uploaded batch up to chunk 20
Uploaded batch up to chunk 30
Uploaded batch up to chunk 40
Uploaded batch up to chunk 50
Uploaded batch up to chunk 60
Uploaded batch up to chunk 70
Uploaded batch up to chunk 80
Uploaded batch up to chunk 90
Uploaded batch up to chunk 100
Uploaded batch up to chunk 110
Uploaded batch up to chunk 120
Uploaded batch up to chunk 130
Uploaded batch up to chunk 140
Uploaded batch up to chunk 150
Uploaded batch up to chunk 160
Uploaded batch up to chunk 168
All chunks uploaded to Pinecone!
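
The batching pattern used in the upload loop can also be written as a small generator, which removes the need for the special case on the last chunk. This is a sketch with placeholder data: `batched` and `dummy_vectors` are hypothetical names, and no Pinecone call is made here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder records standing in for the 168 embedded chunks.
dummy_vectors = [{"id": f"doc_{n}"} for n in range(168)]
batch_sizes = [len(batch) for batch in batched(dummy_vectors, 10)]
print(len(batch_sizes), batch_sizes[-1])
```

In the real loop, each yielded batch would be passed to `index.upsert(...)`. With 168 chunks and a batch size of 10, the final batch holds the remaining 8, which matches the last log line above.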

Top relevant chunks:

--- Chunk 1 (CELEX 32023L2413) ---
Regulations (EC) No 401/2009 and (EU) 2018/1999 (‘European Climate Law’) (OJ L 243, 9.7.2021, p. 1). (5) Decision (EU) 2022/591 of the European Parliament and of the Council of 6 April 2022 on a General Union Environment Action Programme to 2030 (OJ L 114, 12.4.2022, p. 22). ELI: http://data

In [44]:
import nltk

# Download the sentence tokenizer data non-interactively.
# Calling nltk.download() with no arguments opens an interactive prompt, which stalls a notebook run.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

True

In [42]:
import nltk
nltk.download('punkt', quiet=True)  # ensures sentence tokenizer is available

from getpass import getpass
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer, util
import json
from time import sleep
from nltk.tokenize import sent_tokenize


In [46]:
# ----------------------------
# 0️⃣ Install dependencies
# ----------------------------
!pip install sentence-transformers pinecone nltk --quiet  # same package as the earlier install cell; pinecone-client is its old name

# ----------------------------
# Download NLTK resources non-interactively
# ----------------------------
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when a resource is missing
    nltk.download('punkt', quiet=True)
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)


# ----------------------------
# 1️⃣ Imports
# ----------------------------
from getpass import getpass
from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer, util
import json
from time import sleep
from nltk.tokenize import sent_tokenize


# ----------------------------
# 2️⃣ Initialize Pinecone
# ----------------------------
pinecone_key = getpass("Enter your Pinecone API key: ")
pinecone_env = "us-east-1"
pc = Pinecone(api_key=pinecone_key)

index_name = "renewable-energy"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # all-MiniLM-L6-v2 dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region=pinecone_env)
    )
    print(f"Index '{index_name}' created!")
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(index_name)

# ----------------------------
# 3️⃣ Load your chunks
# ----------------------------
with open("renewable_energy_chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)
print(f"Loaded {len(chunks)} chunks")

# ----------------------------
# 4️⃣ Load sentence-transformers model
# ----------------------------
model = SentenceTransformer('all-MiniLM-L6-v2')

# ----------------------------
# 5️⃣ Upload chunks in batches (optional if already uploaded)
# ----------------------------
batch_size = 10
vectors = []

for i, chunk in enumerate(chunks):
    embedding = model.encode(chunk["text"]).tolist()

    vectors.append({
        "id": f"{chunk['celex_number']}_{chunk['chunk_index']}",
        "values": embedding,
        "metadata": {
            "celex_number": chunk["celex_number"],
            "chunk_index": chunk["chunk_index"],
            "text": chunk["text"]
        }
    })

    if len(vectors) == batch_size or i == len(chunks)-1:
        index.upsert(vectors)
        vectors = []
        sleep(1)
        print(f"Uploaded batch up to chunk {i+1}")

print("All chunks uploaded to Pinecone!")

# ----------------------------
# 6️⃣ Assistant Query Function (Fixed)
# ----------------------------
def ask_assistant(user_query, top_k=5):
    # Encode query
    query_embedding = model.encode(user_query).tolist()

    # Retrieve top-k chunks from Pinecone
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True, include_values=False)

    if not results['matches']:
        return "I don't know. No relevant information found in the documents."

    # Combine all retrieved chunks
    combined_text = " ".join([m['metadata']['text'] for m in results['matches']])

    # Split into sentences
    sentences = sent_tokenize(combined_text)

    if not sentences:
        return "I don't know. No relevant information found in the documents."

    # Encode sentences
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    query_emb = model.encode(user_query, convert_to_tensor=True)

    # Find the most relevant sentence
    cos_scores = util.cos_sim(query_emb, sentence_embeddings)[0]
    best_idx = cos_scores.argmax().item()

    return sentences[best_idx]

# ----------------------------
# 7️⃣ Ask questions interactively
# ----------------------------
while True:
    user_query = input("\nEnter your question (or 'exit' to quit): ")
    if user_query.lower() == 'exit':
        break

    answer = ask_assistant(user_query)
    print("\nAssistant Response:")
    print(answer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Enter your Pinecone API key: ··········
Index 'renewable-energy' already exists.
Loaded 168 chunks
Uploaded batch up to chunk 10
Uploaded batch up to chunk 20
Uploaded batch up to chunk 30
Uploaded batch up to chunk 40
Uploaded batch up to chunk 50
Uploaded batch up to chunk 60
Uploaded batch up to chunk 70
Uploaded batch up to chunk 80
Uploaded batch up to chunk 90
Uploaded batch up to chunk 100
Uploaded batch up to chunk 110
Uploaded batch up to chunk 120
Uploaded batch up to chunk 130
Uploaded batch up to chunk 140
Uploaded batch up to chunk 150
Uploaded batch up to chunk 160
Uploaded batch up to chunk 168
All chunks uploaded to Pinecone!

Enter your question (or 'exit' to quit): what are energy plans?

Assistant Response:
The European Strategic Energy Technology Plan set out in the Commission communication of 15 September 2015, entitled ‘Towards an Integrated Strategic Energy Technology (SET) Plan: Accelerating the European Energy System Transformation (the ‘SET-Plan’) aims to boos
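
The final step of `ask_assistant` returns the single sentence whose embedding is closest to the query by cosine similarity. The sketch below reproduces that selection in plain Python. The three-dimensional vectors and the sentences are invented stand-ins for real model embeddings, and `cosine_similarity` and `pick_best` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_best(query_vec, sentence_vecs):
    """Return (index, score) of the sentence vector most similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in sentence_vecs]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, scores[best_idx]


# Invented toy data; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
toy_sentences = [
    "Sentence about solar targets.",
    "Sentence about footnotes.",
    "Sentence about wind permits.",
]
toy_vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.9, 0.1]]
toy_query = [1.0, 0.2, 0.0]

best_idx, best_score = pick_best(toy_query, toy_vectors)
print(toy_sentences[best_idx])
```

Because only the top-scoring sentence is returned, the assistant's answer is always one sentence from the retrieved chunks, even when that sentence is only loosely related to the question.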