In [1]:
import warnings
warnings.filterwarnings("ignore")

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables import RunnablePassthrough

from dotenv import load_dotenv

import requests

load_dotenv()

True

PDFs → Document Loader (PDF parsing) → Text Chunking (paper-aware) → Embeddings → Vector Store → Retriever → LLM (with citations) → Grounded Answer + References

In [2]:
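# -------------- GROBID PDF PARSING --------------

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def parse_pdf_with_grobid(pdf_path):
    """Send a PDF to a local GROBID server and return the parsed TEI XML."""
    with open(pdf_path, "rb") as f:
        files = {"input": f}
        response = requests.post(
            GROBID_URL,
            files=files,
            data={"consolidateHeader": "1"},  # ask GROBID to consolidate header metadata
            timeout=300  # fail instead of hanging if the server is unresponsive
        )

    response.raise_for_status()
    return response.text  # TEI XML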

In [3]:
file_path = "../papers/optimizing-renewable-energy-systems-through-artificial-intelligence-review-and-future-prospects.pdf"

paper_xml = parse_pdf_with_grobid(file_path)

In [4]:
from lxml import etree

def tei_to_text(tei_xml: str) -> str:
    root = etree.XML(tei_xml.encode("utf-8"))

    ns = {"tei": "http://www.tei-c.org/ns/1.0"}

    sections = []

    # Title
    title = root.xpath("//tei:titleStmt/tei:title/text()", namespaces=ns)
    if title:
        sections.append(f"Title: {title[0]}")

    # Abstract
    abstract = root.xpath("//tei:abstract//tei:p//text()", namespaces=ns)
    if abstract:
        sections.append("Abstract:\n" + " ".join(abstract))

    # Body
    body_paragraphs = root.xpath("//tei:body//tei:p//text()", namespaces=ns)
    if body_paragraphs:
        sections.append("Body:\n" + " ".join(body_paragraphs))

    return "\n\n".join(sections)
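`tei_to_text` flattens the whole body into one string, so section boundaries are lost before chunking. The sketch below shows one way to keep them, returning `(heading, text)` pairs per TEI `<div>`. It is an illustration under assumptions: the function name `tei_to_sections` is hypothetical, it uses the standard-library `xml.etree.ElementTree` rather than `lxml` to stay dependency-free, and it assumes section headings sit in a `<head>` child of each `<div>`, as in the small sample document included in the block.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
TEI_DIV = "{http://www.tei-c.org/ns/1.0}div"


def tei_to_sections(tei_xml: str) -> list[tuple[str, str]]:
    """Return (section_heading, section_text) pairs from the TEI body.

    Hypothetical helper: assumes each section is a <div> with an optional <head>.
    """
    root = ET.fromstring(tei_xml.encode("utf-8"))
    body = root.find(".//tei:body", TEI_NS)
    sections = []
    if body is None:
        return sections

    for div in body.iter(TEI_DIV):
        head = div.find("tei:head", TEI_NS)
        heading = "".join(head.itertext()).strip() if head is not None else "unknown"
        # Use only direct <p> children so nested <div>s are not counted twice.
        paragraphs = [
            " ".join("".join(p.itertext()).split())  # collapse whitespace
            for p in div.findall("tei:p", TEI_NS)
        ]
        text = "\n".join(par for par in paragraphs if par)
        if text:
            sections.append((heading, text))
    return sections


# Tiny hand-written TEI fragment, for illustration only.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div><head>Introduction</head><p>First <ref>[1]</ref> paragraph.</p></div>'
    '<div><head>Methods</head><p>Second paragraph.</p></div>'
    '</body></text></TEI>'
)
sections = tei_to_sections(sample)
print(sections)
```

Each pair can then be chunked on its own, so every chunk knows which section it came from.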

In [5]:
paper_txt = tei_to_text(paper_xml)

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=[
        "\n\n",
        "\n",
        ". ",
        " ",
        ""
    ]
)

chunks = text_splitter.split_text(paper_txt)

In [7]:
print(f"Number of chunks: {len(chunks)}")
print(chunks[0][:500])

Number of chunks: 155
Title: Optimizing renewable energy systems through artificial intelligence: Review and future prospects


In [8]:
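from pathlib import Path
from langchain_core.documents import Document

# format_docs (defined further down) reads "paper_id" from the metadata;
# use the PDF file stem as a stable identifier.
paper_id = Path(file_path).stem

documents = [
    Document(
        page_content=chunk,
        metadata={"source": file_path, "paper_id": paper_id}
    )
    for chunk in chunks
]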
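`format_docs` further down reads `paper_id` and `section` from each chunk's metadata. One way to populate a section label is to chunk each section separately and record the heading alongside the chunk. The sketch below assumes the paper text is available as `(heading, text)` pairs, which is a hypothetical input shape, since the notebook flattens the body into one string. `window_split` is a simplified fixed-window stand-in, not the `RecursiveCharacterTextSplitter` algorithm, and `build_chunk_records` and `chunk_index` are illustrative names.

```python
def window_split(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size character windows with overlap (a simplified stand-in splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_chunk_records(paper_id: str, sections, split=window_split) -> list[dict]:
    """Chunk each section on its own and keep the heading in the metadata."""
    records = []
    for heading, text in sections:
        for i, chunk in enumerate(split(text)):
            records.append({
                "page_content": chunk,
                "metadata": {"paper_id": paper_id, "section": heading, "chunk_index": i},
            })
    return records


# Illustrative input; the ID and section texts are made up.
demo_sections = [("Introduction", "x" * 1500), ("Methods", "short")]
records = build_chunk_records("demo-paper", demo_sections)
print(len(records), records[-1]["metadata"])
```

Each record has the same two fields as a `Document`, so `Document(**record)` should turn it into one, and passing `split=text_splitter.split_text` would reuse the splitter configured earlier instead of the stand-in.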

In [None]:
from langchain_community.vectorstores import FAISS
# OpenAIEmbeddings is already imported in the first cell.

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

vectorstore = FAISS.from_documents(documents=documents, embedding=embeddings)
vectorstore.save_local("../papers/papers_index")

In [10]:
query = "What methodology is used in the paper?"

results = vectorstore.similarity_search(
    query=query,
    k=5
)

In [11]:
for i, doc in enumerate(results):
    print(f"\nResult {i+1}")
    print(doc.page_content[:300])


Result 1
. This research endeavors to fill a crucial gap in the existing studies by exploring the untapped potential of AI in optimizing RES. By critically analyzing the current state of research, highlighting the originality of the proposed approach, and outlining future research directions, this study aims

Result 2
. However, the methodology's limitations, such as potential biases in literature selection and subjective interpretation of findings, may impact the study's objectivity and generalizability. The synthesis of diverse literature on AI applications in RES provides useful insights into the current state

Result 3
. By consolidating insights from diverse studies, this research endeavors to underscore the research originality in synthesizing disparate findings, identifying overarching trends, and elucidating the underlying mechanisms driving the efficacy of AI in RES optimization. The rationale behind conducti

Result 4
. This study stands at the intersection of several pivota

In [12]:
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 8, "fetch_k": 20}
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
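The retriever uses `search_type="mmr"` with `k=8` and `fetch_k=20`. To make those two numbers concrete, here is a simplified pure-Python sketch of maximal marginal relevance: take the `fetch_k` most relevant candidates, then pick `k` of them one at a time, trading relevance to the query against similarity to what is already selected. This is an illustration of the technique, not LangChain's implementation; the function names, the toy vectors, and the `lambda_mult=0.5` default are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, doc_vecs, k=8, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    # Step 1: keep only the fetch_k documents most similar to the query.
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]

    # Step 2: greedily add the candidate with the best relevance/redundancy trade-off.
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings": documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.05], [0.3, 1.0]]
diverse = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns both near-duplicates. Lower values penalize redundancy, which is the reason to prefer MMR when many chunks of a paper repeat the same point.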

In [21]:
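rag_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are an academic research assistant.\n"
     "Answer ONLY using the provided sources.\n"
     "Each source starts with a [paper_id, section] tag; cite each claim with the tag of the source that supports it.\n"
     "Do not invent IDs, years, or page numbers that are not in the sources.\n"
     "If evidence is insufficient, say so.\n"),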
    ("human", 
     "Question: {question}\n\n"
     "Sources:\n{context}\n\n"
     "Answer:")
])

def format_docs(docs):
    parts = []
    for d in docs:
        pid = d.metadata.get("paper_id", "unknown")
        sec = d.metadata.get("section", "unknown")
        parts.append(f"[{pid}, {sec}] {d.page_content}")
    return "\n\n".join(parts)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

In [19]:
answer = rag_chain.invoke("What methodology did the authors use and reference?")
print(answer)

The authors employed a literature synthesis methodology to analyze the applications of artificial intelligence (AI) in renewable energy systems (RES). This approach involved reviewing diverse literature to provide insights into the current state of investigations, identify emerging trends, and highlight challenges within the field. However, the methodology has limitations, such as potential biases in literature selection and subjective interpretation of findings, which may affect the study's objectivity and generalizability (43d0efcf-a16e-47d7-9691-d7204db891a1, year not specified, page not specified). 

Additionally, the study discusses the use of data analytics and machine learning techniques in load forecasting and demand response (DR) studies, emphasizing the analysis of historical consumption data and other factors to improve prediction accuracy (01f8fd87-8903-4fe0-b7d5-47d7ae5bc870, year not specified, page not specified). 

The authors also reference advanced control algorithms 

In [16]:
answer = rag_chain.invoke("What is the main finding of the study?")
print(answer)

The main finding of the study is that it synthesizes existing knowledge on the optimization of renewable energy systems (RES) through artificial intelligence (AI), identifies research gaps, and offers insights into the potential of AI in this field. The study also emphasizes the importance of considering contextual factors such as technological readiness, regulatory frameworks, and market dynamics when applying its recommendations [paper_id, year, page/section].


In [20]:
answer = rag_chain.invoke("what is the title of the paper?")
print(answer)

The title of the paper is "Optimizing renewable energy systems through artificial intelligence: Review and future prospects" (paper_id: 7dd46ad6-84f9-4a43-a583-3bf70160d462, year: 2023, page: not specified).
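The answers above cite identifiers and years that do not appear in the chunk metadata, and one answer contains the literal placeholder `[paper_id, year, page/section]`. A lightweight post-check can flag such cases before an answer is shown. The sketch below is a minimal example under assumptions: it expects citations in the bracketed `[paper_id, section]` form that `format_docs` emits, and `check_citations` plus the ID `demo-paper` are hypothetical names used for illustration.

```python
import re

# Matches bracketed tags such as "[some-paper-id, Introduction]".
TAG_PATTERN = re.compile(r"\[([^\[\],]+),\s*([^\[\]]+)\]")


def check_citations(answer: str, known_paper_ids: set[str]) -> dict:
    """Extract bracketed citations and report IDs that are not in the index."""
    cited = [(pid.strip(), rest.strip()) for pid, rest in TAG_PATTERN.findall(answer)]
    unknown = [pid for pid, _ in cited if pid not in known_paper_ids]
    return {"cited": cited, "unknown_ids": unknown, "has_citation": bool(cited)}


known = {"demo-paper"}  # in practice: the paper_id values stored in the metadata

good = check_citations("AI improves load forecasting [demo-paper, Introduction].", known)
placeholder = check_citations(
    "The study offers insights into AI for RES [paper_id, year, page/section].", known
)
print(good["unknown_ids"], placeholder["unknown_ids"])
```

Answers that cite in parentheses instead of brackets are reported with `has_citation` set to `False`, which is itself a useful signal that the model did not follow the tag format.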
