<a href="https://colab.research.google.com/github/sleepyzzpanda/Environment-RAG-Chatbot/blob/main/Climate_RAG_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT-2 RAG Chatbot for Climate Information
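This notebook sets up a retrieval-augmented generation (RAG) chatbot for climate text: passages are embedded with a sentence-transformer model, indexed with FAISS, and the retrieved passages are passed to GPT-2 as context. An interactive widget-based chat interface sits on top.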

In [16]:
!pip install torch transformers datasets faiss-cpu sentence-transformers ipywidgets



In [17]:
import torch
import requests
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from ipywidgets import interact_manual, widgets
from datasets import load_dataset



In [18]:
# Load the pretrained GPT-2 tokenizer and language model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [19]:
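# NOTE: `ner_dataset` must already be loaded with `load_dataset(...)`;
# the cell that loads it is not included in this notebook.
passages = []
for example in ner_dataset["train"]:
    # Each example stores its text as a list of word-level tokens;
    # join them back into one string (adjust for other dataset layouts).
    passages.append(" ".join(example["tokens"]))

# `passages` is now a list of text strings ready to embed
print(f"{len(passages)} passages extracted")
print(passages)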


374 passages extracted
['The outcomes of the sensitivity analysis framework suggest that flood risk can vary dramatically as a result of possible change scenarios . The risk components that have not received much attention ( e.g. changes in dike systems and in vulnerability ) may mask the influence of climate change that is often investigated component .', 'A parameterization of vertical diffusivity in ocean general circulation models has been implemented in the ocean model component of the Community Climate System Model ( CCSM ) . The parameterization represents the dynamics of the mixing in the abyssal ocean arising from the breaking of internal waves generated by the tides forcing stratified flow over rough topography . Diapycnal mixing in the ocean is thought to be one of the primary controls on the meridional overturning circulation and the poleward heat transport by the ocean . The poleward ocean heat transport does not appear to be strongly affected by the mixing in the abyssal 
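Because the passages are rebuilt by joining word-level tokens with spaces, punctuation ends up detached (for example `scenarios .` and `( CCSM )` in the output above). A minimal sketch of a cleanup step that could be applied before embedding; `detokenize` is a hypothetical helper, not part of the dataset or of any library, and it deliberately leaves spaced hyphens alone because they may be genuine dashes:

```python
import re


def detokenize(tokens):
    """Join word-level tokens and tidy the spacing around punctuation."""
    text = " ".join(tokens)
    # Drop the space before closing punctuation: "scenarios ." -> "scenarios."
    text = re.sub(r"\s+([.,;:!?%)\]])", r"\1", text)
    # Drop the space after opening brackets: "( CCSM" -> "(CCSM"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text


tokens = ["The", "Community", "Climate", "System", "Model", "(", "CCSM", ")", "."]
print(detokenize(tokens))  # The Community Climate System Model (CCSM).
```

Whether this changes retrieval quality is untested here; the main effect is that retrieved text reads more naturally when shown to the user.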

In [20]:
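# Embed every passage and index the vectors for exact nearest-neighbour search
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = np.asarray(embed_model.encode(passages), dtype="float32")  # FAISS expects float32
index = faiss.IndexFlatL2(embeddings.shape[1])  # flat (brute-force) index over L2 distance
index.add(embeddings)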
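`IndexFlatL2` performs an exact, brute-force search: it compares the query against every stored vector and returns the smallest squared Euclidean distances. The dependency-free sketch below illustrates that idea on toy 2-D vectors. It is not how FAISS is implemented internally, and `l2_search` is a hypothetical name used only for illustration:

```python
def l2_search(vectors, query, k=2):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)  # [1, 0]
```

With a few hundred passages, an exhaustive scan like this is cheap, which is presumably why a flat index is sufficient here.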


In [21]:
# Retrieval function
def retrieve_passages(query, k=2):
    """Return the k passages whose embeddings are closest to the query."""
    query_emb = np.asarray(embed_model.encode([query]), dtype="float32")
    _, indices = index.search(query_emb, k)
    return [passages[i] for i in indices[0]]
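One edge case in `retrieve_passages`: when FAISS finds fewer than `k` results, it pads the index array with `-1`, and `passages[-1]` would then silently return the last passage. A hedged sketch of a guard, with an optional distance cutoff; `select_passages` and `max_distance` are hypothetical names, and a sensible cutoff value would need to be chosen empirically:

```python
def select_passages(indices, distances, passages, max_distance=None):
    """Map search results to passages, skipping padding and far-away hits."""
    selected = []
    for idx, dist in zip(indices, distances):
        if idx < 0 or idx >= len(passages):
            continue  # -1 marks a missing result; never index with it
        if max_distance is not None and dist > max_distance:
            continue  # too dissimilar to be useful context
        selected.append(passages[idx])
    return selected


toy_passages = ["flood risk", "ocean mixing"]
print(select_passages([1, -1], [0.2, 3.4e38], toy_passages))  # ['ocean mixing']
```

In the notebook, the two lists would come from `index.search(...)` as `indices[0]` and the matching row of distances, which the current function discards.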

In [24]:
# RAG generation function
def generate_answer(query, k=2, max_new_tokens=100):
    context_passages = retrieve_passages(query, k)
    context = ' '.join(context_passages)
    prompt = f"Question: {query}\nAnswer concisely based on the following context: {context}"

    # Encode input
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate output
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
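The base `gpt2` checkpoint has a context window of 1,024 tokens, and the prompt plus `max_new_tokens` has to fit inside it. With `k=2` short abstracts this is likely fine, but longer passages or a larger `k` could overflow. A minimal sketch of a token budget, assuming the question (including the instruction text) and the context are tokenized separately, for example with `tokenizer.encode`, into plain lists of IDs. `fit_context` and `GPT2_MAX_POSITIONS` are hypothetical names, not part of the notebook:

```python
GPT2_MAX_POSITIONS = 1024  # context window of the base 'gpt2' checkpoint


def fit_context(question_ids, context_ids, max_new_tokens=100,
                max_positions=GPT2_MAX_POSITIONS):
    """Trim context token IDs so that prompt + generation fits the window."""
    budget = max_positions - max_new_tokens - len(question_ids)
    if budget <= 0:
        raise ValueError("question and max_new_tokens leave no room for context")
    return context_ids[:budget]


# Stand-in token IDs; in the notebook these would come from the GPT-2 tokenizer.
question_ids = list(range(24))
context_ids = list(range(2000))
trimmed = fit_context(question_ids, context_ids)
print(len(trimmed))  # 900
```

Truncating from the end keeps the top-ranked passage intact, since `retrieve_passages` returns the closest passage first.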


In [26]:
from IPython.display import display, clear_output
from ipywidgets import widgets

# Store chat history
chat_history = []

def generate_answer_clean(query, k=2, max_new_tokens=100):
    """
    Generate a GPT-2 completion from the retrieved context and return
    the text that follows the instruction line: the retrieved passages
    plus GPT-2's continuation, without the repeated question.
    """
    context_passages = retrieve_passages(query, k)
    context = ' '.join(context_passages)

    prompt = f"Question: {query}\nAnswer concisely based on the following context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt")

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the full sequence (prompt followed by the generated continuation)
    full_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Keep everything after the instruction line; this drops the question
    # but still includes the retrieved context
    if "Answer concisely based on the following context:" in full_text:
        answer = full_text.split("Answer concisely based on the following context:")[-1].strip()
    else:
        answer = full_text.strip()

    return answer

def chat_interface_widget(user_input):
    """
    Widget callback for interactive chat.
    """
    if user_input.strip() == '':
        return

    # Generate answer
    answer = generate_answer_clean(user_input)

    # Append to chat history
    chat_history.append(("You", user_input))
    chat_history.append(("Bot", answer))

    # Clear previous output and display chat
    clear_output(wait=True)
    for speaker, text in chat_history:
        print(f"{speaker}: {text}\n")

# Create interactive text widget
input_widget = widgets.Text(
    value='',
    description='Your Question:',
    placeholder='Type your question here...'
)

run_button = widgets.Button(description="Send")

def on_button_click(b):
    chat_interface_widget(input_widget.value)
    input_widget.value = ''  # Clear input box after sending

run_button.on_click(on_button_click)

# Display widget and button
display(input_widget, run_button)


You: can you tell me about flooding

Bot: The 100 - year return levels of joint storm surges and waves are used to map the spatial extent of flooding in more than 200 sandy beaches around the Balearic Islands by mid and late 21st century , using the hydrodynamical LISFLOOD - FP model and a high resolution ( 2 m ) Digital Elevation Model.</p > It occurs mainly due to extreme weather conditions ( e.g. heavy rainfall and snowmelt ) and the consequences of flood events can be devastating . For a sound flood risk management and mitigation , a proper risk assessment is needed . Anthropogenic climate change causes higher intensity of rainfall and sea level rise and therefore an increase in scale and frequency of the flood events . The impacts of changes in risk components are explored by plausible change scenarios for the mesoscale Mulde catchment ( sub - basin of the Elbe ) in Germany . The Elbe is a large, shallow, and relatively shallow lake with a high level of rainfall and sea level rise
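The reply above appears to consist mostly of the retrieved passages, with GPT-2's own continuation only at the very end. For a decoder-only model such as GPT-2, `generate` returns the prompt tokens followed by the new tokens, so one way to show only the generated part is to slice off the prompt by length before decoding. A minimal sketch on plain lists; `continuation_ids` is a hypothetical helper, and the same slice works on a tensor row:

```python
def continuation_ids(output_ids, prompt_length):
    """Return only the token IDs generated after the prompt."""
    return output_ids[prompt_length:]


# Stand-in IDs: the first three belong to the prompt, the rest were generated.
output_ids = [10, 11, 12, 50, 51]
print(continuation_ids(output_ids, prompt_length=3))  # [50, 51]

# In the notebook this would look roughly like:
#   prompt_length = inputs["input_ids"].shape[1]
#   answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
```

Slicing by token count avoids depending on the instruction string surviving the decode round trip, which the string-splitting approach relies on.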