# A Climate Change RAG Assistant

## Set up the environment

### Subtask:
Install necessary libraries and set up API keys for any external services.


**Reasoning**:
Install the necessary libraries using pip.



In [19]:
# I'm installing the necessary libraries using pip.
%pip install pandas transformers sentence-transformers chromadb langchain openai python-dotenv



**Reasoning**:
Set up environment variables for the API key using python-dotenv.



In [20]:
import os
from dotenv import load_dotenv

# I'm loading environment variables from a .env file if it exists (optional).
load_dotenv()

# I'm using Colab's Secrets Manager for secure storage of API keys.
from google.colab import userdata

# I'm replacing 'OPENAI_API_KEY' with the actual name of my secret in Colab.
# I need to ensure I have added my API key to the Secrets Manager with this name.
try:
    os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
    print("OPENAI_API_KEY loaded from Colab Secrets Manager.")
except userdata.SecretNotFoundError:
    print("OPENAI_API_KEY not found in Colab Secrets Manager. Please add it.")
except Exception as e:
    print(f"An error occurred while loading OPENAI_API_KEY from Colab Secrets Manager: {e}")

# I can add other API keys here following the same pattern
# try:
#     os.environ['ANOTHER_API_KEY'] = userdata.get('ANOTHER_API_KEY')
#     print("ANOTHER_API_KEY loaded from Colab Secrets Manager.")
# except userdata.SecretNotFoundError:
#     print("ANOTHER_API_KEY not found in Colab Secrets Manager. Please add it.")
# except Exception as e:
#     print(f"An error occurred while loading ANOTHER_API_KEY from Colab Secrets Manager: {e}")

print("Environment variable setup complete.")

OPENAI_API_KEY not found in Colab Secrets Manager. Please add it.
Environment variable setup complete.
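
The cell above only works inside Colab, because `google.colab` cannot be imported anywhere else. A minimal sketch of a lookup that falls back to ordinary environment variables might look like this. `get_secret` and the `DEMO_API_KEY` name are hypothetical, introduced here for illustration only.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab's Secrets Manager, else from the environment.

    `get_secret` is a hypothetical helper for this notebook, not a library API.
    """
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret missing or notebook access not granted: fall through.
            pass

    # Outside Colab (or when the secret is absent), use the process environment,
    # which load_dotenv() may already have populated from a .env file.
    return os.environ.get(name, default)


os.environ["DEMO_API_KEY"] = "not-a-real-key"  # demo value only
print(get_secret("DEMO_API_KEY"))
print(get_secret("RAG_DEMO_MISSING_KEY", default="<unset>"))
```

With a helper like this, the same notebook code could run locally with a `.env` file and in Colab with the Secrets Manager.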


## Load and process data

### Subtask:
Load the climate change data and process it for use in the RAG system. This may involve cleaning, tokenization, and creating embeddings.


**Reasoning**:
Load the climate change data from a suitable source. Since no specific data source is provided, I will fetch climate change papers from arXiv using the `arxiv` library.



In [21]:
# I'm installing the arxiv library to fetch climate change papers.
%pip install arxiv



Now, let's search for some climate change papers on arXiv and see how we can access their information.

In [22]:
import arxiv

# I'm searching for papers related to climate change.
search = arxiv.Search(
    query="climate change",
    max_results=10,
    sort_by=arxiv.SortCriterion.Relevance
)

client = arxiv.Client()

# I'm printing some information about the retrieved papers.
for result in client.results(search):
    print(f"Title: {result.title}")
    print(f"Authors: {', '.join(author.name for author in result.authors)}")
    print(f"Summary: {result.summary[:200]}...") # I'm printing the first 200 characters of the summary.
    print("-" * 20)

Title: The structure of the climate debate
Authors: Richard S. J. Tol
Summary: First-best climate policy is a uniform carbon tax which gradually rises over
time. Civil servants have complicated climate policy to expand bureaucracies,
politicians to create rents. Environmentalist...
--------------------
Title: Climate Science and Control Engineering: Insights, Parallels, and Connections
Authors: Salma M. Elsherif, Ahmad F. Taha
Summary: Climate science is the multidisciplinary field that studies the Earth's
climate and its evolution. At the very core of climate science are
indispensable climate models that predict future climate scen...
--------------------
Title: Baumol's Climate Disease
Authors: Fangzhi Wang, Hua Liao, Richard S. J. Tol
Summary: We investigate optimal carbon abatement in a dynamic general equilibrium
climate-economy model with endogenous structural change. By differentiating the
production of investment from consumption, we s...
--------------------
Title: You are rig

### Subtask:
Download the full text of the papers and create embeddings.

**Reasoning**:
Download the PDF of each paper retrieved from arXiv.

In [23]:
import os

download_folder = "arxiv_climate_papers"
os.makedirs(download_folder, exist_ok=True)

# I'm downloading the PDF of each paper retrieved from arXiv.
for result in client.results(search):
    try:
        result.download_pdf(dirpath=download_folder)
        print(f"Downloaded {result.title}")
    except Exception as e:
        print(f"Could not download {result.title}: {e}")

Downloaded The structure of the climate debate
Downloaded Climate Science and Control Engineering: Insights, Parallels, and Connections
Downloaded Baumol's Climate Disease
Downloaded You are right. I am ALARMED -- But by Climate Change Counter Movement
Downloaded Climate Change Conspiracy Theories on Social Media
Downloaded Hurricanes Increase Climate Change Conversations on Twitter
Downloaded Trend and Thoughts: Understanding Climate Change Concern using Machine Learning and Social Media Data
Downloaded Financial climate risk: a review of recent advances and key challenges
Downloaded Mapping the Climate Change Landscape on TikTok
Downloaded What shapes climate change perceptions in Africa? A random forest approach


In [24]:
# I'm installing pymupdf to extract text from PDFs.
%pip install pymupdf



**Reasoning**:
Extract text from the downloaded PDF files.

In [25]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with fitz.open(pdf_path) as doc:
            for page in doc:
                text += page.get_text()
    except Exception as e:
        print(f"Could not extract text from {pdf_path}: {e}")
        return None
    return text

climate_texts = []
download_folder = "arxiv_climate_papers"

# I'm extracting text from the downloaded PDF files.
for filename in os.listdir(download_folder):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join(download_folder, filename)
        print(f"Extracting text from {filename}...")
        text = extract_text_from_pdf(pdf_path)
        if text:
            climate_texts.append({"filename": filename, "text": text})

print(f"Extracted text from {len(climate_texts)} files.")
# I can inspect the extracted text from the first file
# if climate_texts:
#     print("\n--- My first extracted text sample ---")
#     print(climate_texts[0]['text'][:500]) # I'm printing the first 500 characters.

Extracting text from 2312.00160v1.Baumol_s_Climate_Disease.pdf...
Extracting text from 2111.14929v1.Trend_and_Thoughts__Understanding_Climate_Change_Concern_using_Machine_Learning_and_Social_Media_Data.pdf...
Extracting text from 2004.14907v1.You_are_right__I_am_ALARMED____But_by_Climate_Change_Counter_Movement.pdf...
Extracting text from 2105.07867v1.What_shapes_climate_change_perceptions_in_Africa__A_random_forest_approach.pdf...
Extracting text from 2504.21153v2.Climate_Science_and_Control_Engineering__Insights__Parallels__and_Connections.pdf...
Extracting text from 2404.07331v1.Financial_climate_risk__a_review_of_recent_advances_and_key_challenges.pdf...
Extracting text from 2505.03813v1.Mapping_the_Climate_Change_Landscape_on_TikTok.pdf...
Extracting text from 1608.05597v1.The_structure_of_the_climate_debate.pdf...
Extracting text from 2305.07529v1.Hurricanes_Increase_Climate_Change_Conversations_on_Twitter.pdf...
Extracting text from 2107.03318v1.Climate_Change_Conspiracy_Theorie
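
The subtask mentions cleaning, but the raw text from `page.get_text()` is used as-is. A rough sketch of a cleanup pass is shown here, assuming that single line breaks inside a paragraph are layout artifacts. `clean_pdf_text` is a hypothetical helper and the sample string is made up.

```python
import re


def clean_pdf_text(text):
    """Tidy raw PDF text before embedding (a heuristic sketch, not lossless)."""
    # Drop trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Re-join words split by a hyphen at a line break. Caveat: this also
    # merges genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Keep paragraph breaks (blank lines) but turn single newlines into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up sample imitating the line breaks seen in extracted PDF text.
sample = "Financial climate risk: a review of recent advan-\nces and \nkey challenges \n\n\n\nAbstract"
print(clean_pdf_text(sample))
```

The function could be applied to `text` just before it is appended to `climate_texts`.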

**Reasoning**:
Create embeddings from the extracted text using a pre-trained sentence transformer model.

In [26]:
from sentence_transformers import SentenceTransformer

# I'm loading a pre-trained sentence transformer model.
# 'all-MiniLM-L6-v2' is a good general-purpose model I'm using.
model = SentenceTransformer('all-MiniLM-L6-v2')

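# I'm generating one embedding per paper.
# I'll store the embeddings along with the text and filename.
# Note: this model truncates input beyond its maximum sequence length (256 word pieces
# by default), so each embedding reflects only the opening of the paper.
for item in climate_texts:
    item['embedding'] = model.encode(item['text'])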

print(f"Generated embeddings for {len(climate_texts)} texts.")
# I can inspect the first embedding
# if climate_texts:
#     print("\n--- My first embedding sample ---")
#     print(climate_texts[0]['embedding'][:10]) # I'm printing the first 10 values of the embedding.

Generated embeddings for 10 texts.
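
Because `all-MiniLM-L6-v2` truncates long input, encoding a whole paper in one call captures only its first few hundred tokens. One common remedy is to split each paper into overlapping chunks and embed every chunk. The sketch below splits on whitespace-separated words. `chunk_words`, `chunk_size=200`, and `overlap=40` are illustrative assumptions, not values taken from this notebook.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping word windows.

    chunk_size and overlap are illustrative values, counted in
    whitespace-separated words rather than model tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(demo)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk could then be passed to `model.encode` and stored under an ID such as `f"{filename}-{i}"`, so that retrieval returns focused passages instead of whole papers.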


## Build the RAG system

### Subtask:
Set up a vector database and add the extracted text and embeddings.

**Reasoning**:
Initialize ChromaDB and create a collection to store the embeddings and associated text.

In [27]:
import chromadb

# I'm initializing the ChromaDB client (in-memory for this example).
# I'm naming it chroma_client so it doesn't overwrite the arXiv `client` defined earlier.
chroma_client = chromadb.Client()

# I'm creating a collection (or getting an existing one).
# This is where I'll store my embeddings, documents, and metadata.
collection_name = "climate_change_papers"
try:
    collection = chroma_client.create_collection(name=collection_name)
    print(f"Collection '{collection_name}' created.")
except Exception:  # I'm handling the case if the collection already exists.
    collection = chroma_client.get_collection(name=collection_name)
    print(f"Collection '{collection_name}' already exists. Using existing collection.")


# I'm preparing data for adding to ChromaDB.
# ChromaDB needs an id for each record, plus its embedding and/or document text.
# I'll use the filename as the id for simplicity.
ids = [item['filename'] for item in climate_texts]
embeddings = [item['embedding'].tolist() for item in climate_texts] # I'm converting numpy arrays to lists.
documents = [item['text'] for item in climate_texts]

# I'm upserting data into the collection so that re-running this cell
# updates existing ids instead of trying to add duplicates.
collection.upsert(
    embeddings=embeddings,
    documents=documents,
    ids=ids
)

print(f"Added {len(climate_texts)} documents to the collection.")
print(f"Collection count: {collection.count()}")

Collection 'climate_change_papers' already exists. Using existing collection.
Added 10 documents to the collection.
Collection count: 10


### Subtask:
Implement the retrieval mechanism to find relevant documents based on a query.

**Reasoning**:
Implement a function to perform a similarity search on the ChromaDB collection using a query embedding.

In [28]:
# I'm implementing a function to perform a similarity search on the ChromaDB collection.
from sentence_transformers import SentenceTransformer # I need this import if this cell is run independently

def retrieve_documents(query, collection, model, n_results=5):
    """
    Retrieves relevant documents from my ChromaDB collection based on a query.

    Args:
        query (str): The user's query.
        collection (chromadb.Collection): The ChromaDB collection to search.
        model (SentenceTransformer): The sentence transformer model for generating query embeddings.
        n_results (int): The number of most relevant documents to retrieve.

    Returns:
        list: A list of dictionaries, where each dictionary contains the document ID,
              text, and distance for the retrieved documents.
    """
    # I'm generating an embedding for the query.
    query_embedding = model.encode(query).tolist()

    # I'm performing a similarity search in ChromaDB.
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'distances'] # I'm including document text and similarity distances.
    )

    # I'm formatting the results.
    retrieved_docs = []
    if results and results['ids'] and results['documents']:
        for i in range(len(results['ids'][0])):
            retrieved_docs.append({
                'id': results['ids'][0][i],
                'text': results['documents'][0][i],
                'distance': results['distances'][0][i]
            })
    return retrieved_docs

# Example usage:
query = "What are the financial risks of climate change?"
retrieved_documents = retrieve_documents(query, collection, model)

print(f"I'm retrieving documents for query: '{query}'")
for doc in retrieved_documents:
    print(f"--- Document ID: {doc['id']} (Distance: {doc['distance']}) ---")
    print(f"{doc['text'][:500]}...") # I'm printing the first 500 characters of the retrieved text.
    print("-" * 50)

I'm retrieving documents for query: 'What are the financial risks of climate change?'
--- Document ID: 2404.07331v1.Financial_climate_risk__a_review_of_recent_advances_and_key_challenges.pdf (Distance: 0.6765998601913452) ---
Institute for Resources, Enviroment and Sustainability, UBC 
 
 
 
1 
Financial climate risk: a review of recent advances and 
key challenges 
 
Victor Cardenas* 
 
*Institute for Resources, Enviroment and Sustainability, University of British Columbia 
 
Abstract- The document provides an overview of financial climate risks. It delves into how climate change impacts the global financial 
system, distinguishing between physical risks (such as extreme weather events) and transition risks (stemmin...
--------------------------------------------------
--- Document ID: 2105.07867v1.What_shapes_climate_change_perceptions_in_Africa__A_random_forest_approach.pdf (Distance: 0.9767863750457764) ---
 
1 
Original Manuscript 
[What shapes climate change perceptions in Africa
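
To make the `distance` values above less opaque, here is a toy version of what a similarity search does: score every stored vector against the query vector and keep the closest ones. It assumes squared Euclidean distance, which as far as I know is Chroma's default metric. The three-dimensional vectors and file names are invented for the demo.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 values.
toy_vectors = {
    "finance.pdf": [0.9, 0.1, 0.0],
    "twitter.pdf": [0.1, 0.9, 0.0],
    "africa.pdf": [0.0, 0.2, 0.9],
}
ranked = top_k([1.0, 0.0, 0.0], toy_vectors)
print(ranked)
```

A smaller distance means a closer match, which is why the financial risk paper ranks first for the query above. A real vector database uses an approximate index instead of this brute-force scan.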

### Subtask:
Implement the generation mechanism to answer questions based on retrieved documents.

**Reasoning**:
Implement a function to generate a response using the OpenAI API based on the user query and retrieved documents.

## Using Google Generative AI API (Gemini)

You'll need to get an API key from Google AI Studio and add it to your Colab Secrets Manager.

### Subtask:
Set up the Google Generative AI API key.

In [32]:
# I'm installing the Google Generative AI library.
%pip install google-generativeai



**Reasoning**:
Set up the Google Generative AI API key using Colab's Secrets Manager.

In [33]:
import os
from google.colab import userdata
import google.generativeai as genai

# I'm using Colab's Secrets Manager for secure storage of API keys.
# I'm replacing 'GOOGLE_API_KEY' with the actual name of my secret in Colab.
# I need to ensure I have added my API key to the Secrets Manager with this name.
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GOOGLE_API_KEY)
    print("Google Generative AI API key loaded from Colab Secrets Manager.")
except userdata.SecretNotFoundError:
    print("GOOGLE_API_KEY not found in Colab Secrets Manager. Please add it.")
except Exception as e:
    print(f"An error occurred while loading GOOGLE_API_KEY from Colab Secrets Manager: {e}")

print("Google Generative AI environment variable setup complete.")

Google Generative AI API key loaded from Colab Secrets Manager.
Google Generative AI environment variable setup complete.


### Subtask:
Update the `generate_answer` function to use the Google Generative AI API.

**Reasoning**:
Modify the `generate_answer` function to use the `google.generativeai` library instead of the `openai` library.

In [34]:
import google.generativeai as genai
from sentence_transformers import SentenceTransformer # I need this import if this cell is run independently
import chromadb # I need this import if this cell is run independently

# I'm assuming 'model' and 'collection' are defined from previous cells
# And GOOGLE_API_KEY has been loaded and genai.configure() has been called

def generate_answer_gemini(query, retrieved_documents):
    """
    Generates an answer to the query based on the retrieved documents using Google's Gemini model.
    """
    if not retrieved_documents:
        return "I could not find any relevant information to answer your question."

    # I'm combining the retrieved document texts into a single context.
    context = "\n---\n".join([doc['text'] for doc in retrieved_documents])

    try:
        # I'm initializing the Gemini model.
        # I can choose a different model if needed, e.g., 'gemini-pro'.
        gemini_model = genai.GenerativeModel('gemini-1.5-flash-latest')

        # I'm crafting the prompt for the language model.
        # I'm instructing the model to answer based only on the provided context.
        prompt = f"""Answer the following question using only the context below. If the context does not contain the answer, say so.

Question: {query}

Context:
{context}

Answer:
"""

        # I'm calling the Gemini API to generate the answer.
        response = gemini_model.generate_content(prompt)

        return response.text

    except Exception as e:
        print(f"An error occurred during text generation with Gemini: {e}")
        return "Sorry, I could not generate an answer at this time."

# Example usage with the Gemini API:
query = "What are the financial risks of climate change?"

# I'm re-running the document retrieval step to ensure retrieved_documents is defined.
# I'm assuming 'collection' and 'model' are still defined from previous steps.
print(f"I'm retrieving documents for query: '{query}' before generating answer with Gemini...")
retrieved_documents = retrieve_documents(query, collection, model) # I'm using the existing retrieve_documents function.

# I'm generating the answer using my new function.
generated_answer_gemini = generate_answer_gemini(query, retrieved_documents)

print("\n--- My Generated Answer (Gemini) ---")
print(generated_answer_gemini)

I'm retrieving documents for query: 'What are the financial risks of climate change?' before generating answer with Gemini...

--- My Generated Answer (Gemini) ---
Based on the provided text, the financial risks of climate change are multifaceted and include:

**Physical Risks:** These are the direct consequences of climate change, such as extreme weather events (catastrophic floods, hurricanes, wildfires), rising sea levels, and shifts in weather patterns.  These can lead to:

* **Increased non-performing loans (NPLs):** Severe climate and environmental disasters can cause borrowers to default on loans, impacting the banking sector.
* **Physical damages:** Damage to property, infrastructure, and land leading to business disruption, asset destruction, and the need for reconstruction or replacement.  This affects various sectors, including banking, insurance, and real estate.
* **Increased insured damages:** Higher insurance claims due to increased frequency and severity of climate-rela
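
`generate_answer_gemini` joins the full text of every retrieved paper into one context string. A sketch of a prompt builder that labels each source and caps the context size is shown below. `build_prompt` and `max_context_chars` are hypothetical names, and the default budget is an arbitrary illustration, not a Gemini limit.

```python
def build_prompt(query, retrieved_documents, max_context_chars=12000):
    """Assemble a grounded prompt with labeled sources and a size cap.

    max_context_chars is an illustrative budget, not a Gemini limit.
    """
    parts = []
    remaining = max_context_chars
    for doc in retrieved_documents:
        if remaining <= 0:
            break
        excerpt = doc["text"][:remaining]
        parts.append(f"[Source: {doc['id']}]\n{excerpt}")
        remaining -= len(excerpt)
    context = "\n---\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Question: {query}\n\nContext:\n{context}\n\nAnswer:\n"
    )


# Same dictionary shape that retrieve_documents returns ('id' and 'text').
docs = [
    {"id": "a.pdf", "text": "x" * 10},
    {"id": "b.pdf", "text": "y" * 10},
]
prompt = build_prompt("What are the risks?", docs, max_context_chars=15)
print(prompt)
```

Labeling each excerpt with its document ID also lets the model name the paper that a claim comes from.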

## Build the RAG Application

### Subtask:
Integrate the retrieval and generation components to create the RAG application workflow.

**Reasoning**:
Combine the `retrieve_documents` and `generate_answer_gemini` functions into a single function that takes a query and returns a generated answer using the RAG approach.

In [35]:
# I'm combining the retrieval and generation components into a single function.
def rag_answer_query(query, collection, model):
    """
    Answers a user query using the RAG approach.
    """
    # 1. I'm retrieving relevant documents.
    print(f"Retrieving documents for query: '{query}'...")
    retrieved_documents = retrieve_documents(query, collection, model)

    # 2. I'm generating an answer based on retrieved documents.
    print("Generating answer based on retrieved documents...")
    generated_answer = generate_answer_gemini(query, retrieved_documents)

    return generated_answer

# Example usage of my integrated RAG function:
query = "What is the impact of hurricanes on climate change conversations?"
rag_response = rag_answer_query(query, collection, model)

print("\n--- My RAG Application Response ---")
print(rag_response)

Retrieving documents for query: 'What is the impact of hurricanes on climate change conversations?'...
Generating answer based on retrieved documents...

--- My RAG Application Response ---
Based on the provided text, hurricanes significantly increase conversations about climate change on Twitter.  The study shows an average 80% increase in climate change-related tweets in regions affected by hurricanes, with increases up to 200% for the most damaging hurricanes.  This heightened online discussion, however, is both geographically and temporally limited, with a rapid decay in public attention in the weeks following the event.  The study also highlights that news media coverage of hurricanes frequently includes climate change as a prominent topic, although the language used varies between reliable and questionable sources.  Reliable sources use "climate change" more often, while less reliable sources favor terms like "global warming" and "weather," sometimes even referencing conspiracy t
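
`rag_answer_query` returns only the answer text, so there is no way to see which papers it drew on. The sketch below returns the source IDs alongside the answer and takes the two stages as plain callables, which makes it runnable without ChromaDB or an API key. `rag_answer_with_sources`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins.

```python
def rag_answer_with_sources(query, retrieve, generate, n_results=3):
    """Run retrieval then generation and report which documents were used.

    `retrieve` and `generate` are any callables with the shapes shown below;
    in the notebook they would wrap retrieve_documents and generate_answer_gemini.
    """
    docs = retrieve(query, n_results)
    answer = generate(query, docs)
    return {"answer": answer, "sources": [doc["id"] for doc in docs]}


# Stand-ins so the sketch runs without ChromaDB or an API key.
def fake_retrieve(query, n_results):
    corpus = [
        {"id": "hurricanes.pdf", "text": "Hurricanes raise tweet volume."},
        {"id": "finance.pdf", "text": "Physical and transition risks."},
    ]
    return corpus[:n_results]


def fake_generate(query, docs):
    return f"Answer drawn from {len(docs)} document(s)."


result = rag_answer_with_sources("hurricanes?", fake_retrieve, fake_generate, n_results=1)
print(result)
```

In this notebook the real components could be plugged in as `retrieve=lambda q, n: retrieve_documents(q, collection, model, n_results=n)` and `generate=generate_answer_gemini`.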

## Interactive RAG Assistant

### Subtask:
Create a simple interactive loop to test the RAG assistant.

**Reasoning**:
Implement a loop that prompts the user for a query, calls the `rag_answer_query` function, and prints the generated answer until the user types 'quit'.

In [36]:
# I'm starting the interactive loop for testing.
print("Climate Change RAG Assistant. Type 'quit' to exit.")
while True:
    user_query = input("\nEnter your query: ")
    if user_query.strip().lower() == 'quit':
        break

    # I'm getting the answer from my RAG assistant.
    answer = rag_answer_query(user_query, collection, model)

    print("\n--- My RAG Assistant's Answer ---")
    print(answer)
    print("-" * 30)

print("Exiting RAG Assistant.")

Climate Change RAG Assistant. Type 'quit' to exit.

Enter your query: what is the future of climate change?
Retrieving documents for query: 'what is the future of climate change?'...
Generating answer based on retrieved documents...

--- My RAG Assistant's Answer ---
The provided text offers several perspectives on the future of climate change, but doesn't offer a singular prediction.  

Richard Tol's paper (2016) suggests that the climate debate will become more constructive due to factors such as the Paris Agreement shifting focus back to national governments, changing political priorities, austerity measures, and a maturing bureaucracy.  He believes a modest carbon tax is a feasible solution.

The second paper (2023) focuses on the impact of hurricanes on public awareness of climate change.  It finds that while hurricanes significantly increase online discussions about climate change in affected areas, this heightened awareness is temporary.  This implies a need for sustained effort
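
The loop above is hard to test because it reads straight from the keyboard. A sketch with injectable input and output functions is shown here. It also skips blank queries and exits cleanly on Ctrl-C or end of input. `run_assistant` and its parameters are hypothetical names.

```python
def run_assistant(answer_fn, input_fn=input, output_fn=print):
    """Prompt for queries until the user types 'quit' or 'exit'.

    Returns the number of queries answered. Blank input is skipped.
    """
    answered = 0
    while True:
        try:
            user_query = input_fn("\nEnter your query: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D / Ctrl-C ends the session cleanly
        if user_query.lower() in {"quit", "exit"}:
            break
        if not user_query:
            continue
        output_fn(answer_fn(user_query))
        answered += 1
    return answered


# Scripted demo: canned inputs replace the keyboard, a list captures the output.
scripted = iter(["", "What is a carbon tax?", "  QUIT  "])
printed = []
count = run_assistant(
    answer_fn=lambda q: f"(answer to: {q})",
    input_fn=lambda prompt: next(scripted),
    output_fn=printed.append,
)
print(count, printed)
```

In the notebook, `answer_fn` could be `lambda q: rag_answer_query(q, collection, model)` with the default `input` and `print`.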

In [39]:
# I'm providing instructions for running the Streamlit app locally.

# Now I have two files: app.py which contains the Streamlit application code,
# and requirements.txt which lists the necessary Python libraries.

# To run this Streamlit application locally:

# 1. Save the files: Make sure app.py and requirements.txt are saved in the same directory on my local machine.
# 2. Install dependencies: Open a terminal or command prompt in that directory and run:
#    pip install -r requirements.txt
# 3. Set my Google API key as an environment variable. Use the appropriate command for the operating system,
#    replacing 'YOUR_API_KEY' with the actual API key:

#    For macOS and Linux:
#    export GOOGLE_API_KEY='YOUR_API_KEY'

#    For Windows Command Prompt (no quotes, or they become part of the value):
#    set GOOGLE_API_KEY=YOUR_API_KEY

#    For Windows PowerShell:
#    $env:GOOGLE_API_KEY='YOUR_API_KEY'

# 4. Run the Streamlit application:
#    streamlit run app.py

# This will start the Streamlit development server, and my web application will open in my browser.