# 🧠 Task 1: Text Agent (Research Analyst)

### Objective
Before building the complete agent, I am first creating a **baseline LLM model** that can **summarize paragraphs** from research papers.  
This step helps establish the core functionality (text understanding and summarization) that will later be integrated into the **Text Agent** component of the CellSense system.

### Plan
1. Implement a simple LLM pipeline with:
   - Model configuration  
   - Prompt template  
   - Output parser (for structured summaries)
2. Test the model by summarizing random paragraphs from sample papers.
3. Analyze the outputs and refine the prompt for clarity and accuracy.
4. Once validated, extend the pipeline to include:
   - Document Loaders  
   - Text Splitters  
   - Memory and Tools  
   - Agent orchestration (in later stages)

### Outcome
A working summarization prototype that serves as the **foundation for the Text Agent**, capable of analyzing scientific text before integrating with other agents in CellSense.


#### Model Configuration

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr"


In [2]:
from langchain.chat_models import init_chat_model
from langchain.messages import SystemMessage,HumanMessage,AIMessage
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from dotenv import load_dotenv
import os

# Load environment variables (such as the OpenAI API key) from a .env file, so the key stays out of the notebook.

load_dotenv()

model = init_chat_model("openai:gpt-4o-mini", 
                        temperature = 0.8, 
                        max_tokens = 2000, 
                        max_retries = 2,
                        timeout = 60)



#### Prompt Template

In [3]:
system_message = '''
You are an expert biomedical research assistant specializing in analyzing scientific literature. 
Your task is to carefully read the provided paragraph and perform the following tasks:

1. Summarize the paragraph in detail and meaningful way.
2. Extract the following information if mentioned:
   - Key concepts or keywords
   - Material(s) being studied
   - Methods or experimental techniques
   - Parameters to consider
   - Technology or tools used

3. If a field is not mentioned in the paragraph, return an empty list for that field.

Guidelines:
- Be precise and concise in the summary.
- Extract only what is explicitly mentioned in the paragraph.
- Avoid adding assumptions or information not present in the text.
'''

In [4]:
template = ChatPromptTemplate([
    ("system", system_message),
    ("user", "Can you please summarise the given text?: {text}")
])

Paragraph = """An extensive research field in regenerative medicine is electrical stimulation (ES) and its
impact on tissue and cells. The mechanism of action of ES, particularly the role of electrical parameters
like intensity, frequency, and duration of the electric field, is not yet fully understood. Human MG-63
osteoblasts were electrically stimulated for 10 min with a commercially available multi-channel
system (IonOptix). We generated alternating current (AC) electrical fields with a voltage of 1 or 5 V
and frequencies of 7.9 or 20 Hz, respectively. To exclude liquid-mediated effects, we characterized the
AC-stimulated culture medium. AC stimulation did not change the medium's pH, temperature, and
oxygen content. The H2O2 level was comparable with the unstimulated samples except at 5 V_7.9 Hz,
where a significant increase in H2O2 was found within the first 30 min. Pulsed electrical stimulation
was beneficial for the process of attachment and initial adhesion of suspended osteoblasts. At the
same time, the intracellular Ca2+ level was enhanced and highest for 20 Hz stimulated cells with
1 and 5 V, respectively. In addition, increased Ca2+ mobilization after an additional trigger (ATP)
was detected at these parameters. New knowledge was provided on why electrical stimulation
contributes to cell activation in bone tissue regeneration"""
prompt = template.invoke({"text" : Paragraph})

print(prompt)

messages=[SystemMessage(content='\nYou are an expert biomedical research assistant specializing in analyzing scientific literature. \nYour task is to carefully read the provided paragraph and perform the following tasks:\n\n1. Summarize the paragraph in detail and meaningful way.\n2. Extract the following information if mentioned:\n   - Key concepts or keywords\n   - Material(s) being studied\n   - Methods or experimental techniques\n   - Parameters to consider\n   - Technology or tools used\n\n3. If a field is not mentioned in the paragraph, return an empty list for that field.\n\nGuidelines:\n- Be precise and concise in the summary.\n- Extract only what is explicitly mentioned in the paragraph.\n- Avoid adding assumptions or information not present in the text.\n', additional_kwargs={}, response_metadata={}), HumanMessage(content="Can you please summarise the given text?: An extensive research field in regenerative medicine is electrical stimulation (ES) and its\nimpact on tissue and

#### Testing model Output

In [5]:
response = model.invoke(prompt)
print(response.content)

**Summary:**
The paragraph discusses research in regenerative medicine focusing on electrical stimulation (ES) and how it influences tissues and cells. While the exact mechanisms of ES remain unclear, the study examines the effects of specific electrical parameters such as intensity, frequency, and duration. Human MG-63 osteoblasts were subjected to electrical stimulation for 10 minutes using a multi-channel system (IonOptix), employing alternating current (AC) electrical fields at voltages of 1 or 5 V and frequencies of 7.9 or 20 Hz. The researchers ensured that liquid-mediated effects were eliminated by analyzing the AC-stimulated culture medium, confirming no changes in pH, temperature, or oxygen content. However, a significant increase in H2O2 levels was noted at 5 V and 7.9 Hz within the first 30 minutes. The study found that pulsed electrical stimulation promoted attachment and initial adhesion of the osteoblasts, while also increasing intracellular Ca2+ levels, particularly at 2

#### 👩‍🏫 key points:

* I sent only the abstract of the paper, and the model summarised it well.

* The summary and the extracted details are helpful, but they come back as plain text. If I want to pass them to another model or agent later, free text is hard to parse reliably, so the output should follow a structured format such as `JSON`.

#### Trying to get structured output

In [6]:
from pydantic import BaseModel, Field
from typing import Optional,Literal,List

# Define structured output schema
class ExtractUsefulInfo(BaseModel):
    summary: str = Field(description="Detailed summary of the text")
    Keywords: Optional[List[str]] = Field(default=[], description="Important concepts")
    Materials: Optional[List[str]] = Field(default=[], description="Materials used")
    Methods: Optional[List[str]] = Field(default=[], description="Methods or experimental techniques")
    Parameters: Optional[List[str]] = Field(default=[], description="Experimental parameters")
    Technology: Optional[List[str]] = Field(default=[], description="Technologies or tools used")

formatted_prompt = template.format(text=Paragraph)
# Creating a structured wrapper around the model
structured_model = model.with_structured_output(ExtractUsefulInfo)

response = structured_model.invoke(formatted_prompt)


In [7]:
print(type(response))

<class '__main__.ExtractUsefulInfo'>


#### Writing all as a chain

In [8]:
from langchain_core.runnables import RunnableSequence
chain = RunnableSequence(template, structured_model)
response_chain = chain.invoke({"text": Paragraph})
print(type(response_chain))
print(response_chain.Materials)
print(response_chain.summary)

<class '__main__.ExtractUsefulInfo'>
['human MG-63 osteoblasts', 'culture medium']
The research focuses on electrical stimulation (ES) and its effects on tissue and cells in regenerative medicine, particularly involving human MG-63 osteoblasts. The study uses a multi-channel system (IonOptix) to apply alternating current (AC) electrical fields at voltages of 1 or 5 V and frequencies of 7.9 or 20 Hz for 10 minutes. Parameters such as pH, temperature, and oxygen content of the culture medium were tested to rule out liquid-mediated effects, revealing no changes aside from a significant increase in H2O2 at 5 V and 7.9 Hz. The findings indicate that pulsed electrical stimulation improved attachment and initial adhesion of suspended osteoblasts, while intracellular Ca2+ levels rose significantly, peaking in cells stimulated at 20 Hz with both voltages. The study concludes that electrical stimulation plays a crucial role in cell activation during bone tissue regeneration.



Checking whether the model can take the extracted keywords and explain them based on the previous context.

In [9]:
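from langchain_core.runnables import RunnableLambda, RunnableSequence, RunnablePassthrough, RunnableParallel

# The system message has to be defined again, because the second prompt does not remember the first one.
template_1 = ChatPromptTemplate([
    ("system", system_message),
    ("user", "Explain the {Keywords} in detail"),
])

# The structured model returns an ExtractUsefulInfo object, while a prompt template expects a dict,
# so the keywords are mapped onto the template's input variable before the second prompt.
to_keywords = RunnableLambda(lambda info: {"Keywords": ", ".join(info.Keywords or [])})

chain = RunnableSequence(template, structured_model, to_keywords, template_1)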

#### 👩‍🏫 key points:

* There is one problem here: to pass a second prompt, I have to define a new system message for the model, because it doesn't remember the previous context.

* So I have to either
    * provide a new system message to another model that defines its role, so it doesn't get confused by its new task,
    * or provide a memory,
    * or use tool calling, but that will come with the agent,
    * or choose another approach, which is RAG (Retrieval-Augmented Generation).

* So I am starting with RAG, where I first need to upload the paper, split the text, and generate embeddings.
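
To make the "provide a memory" option concrete, here is a minimal sketch in plain Python. It is not LangChain's memory API: `ConversationMemory` and `fake_model` are hypothetical names, and `fake_model` is only a stand-in for a real chat model. The idea is simply that the full message history, including the system message, is sent on every call.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content)


class ConversationMemory:
    """Keeps the whole conversation and resends it with every request."""

    def __init__(self, system_message: str) -> None:
        self.messages: list[Message] = [("system", system_message)]

    def ask(self, model: Callable[[list[Message]], str], user_text: str) -> str:
        self.messages.append(("user", user_text))
        reply = model(self.messages)  # the model sees the full history
        self.messages.append(("assistant", reply))
        return reply


def fake_model(messages: list[Message]) -> str:
    """Stand-in for a chat model: reports how many messages it received."""
    return f"I can see {len(messages)} messages"


memory = ConversationMemory("You are a biomedical research assistant.")
first = memory.ask(fake_model, "Summarise the abstract.")
second = memory.ask(fake_model, "Explain the keywords from your summary.")
print(first)
print(second)
```

The trade-off is that the history grows with every turn, which is one reason to look at retrieval instead of resending everything.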


# RAG (Retrieval-Augmented Generation)

* RAG will help me overcome the problem above: limited context and static knowledge, i.e. the model doesn't automatically update when new events or facts occur.

* How will it help?

    - When I upload a document, it fetches the relevant information from that document and gives me a result related to the query and its context. In simple terms, it allows the model to use external knowledge that it didn't learn during training.

* To create a RAG-based agent, I first need to create a knowledge base, which is a repository of all the documents or structured data.

### A RAG-based application comprises four main components:

    1. Document Loader
    2. Text splitter
    3. Vector Databases
    4. Retrievers
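
As a rough illustration of how the four components fit together, here is a toy pipeline using only the standard library. It is a sketch under strong assumptions: the "loader" is a hard-coded string, the "embedding" is a bag-of-words count rather than a learned embedding model, and the "vector database" is a Python list. All function names are made up for this example.

```python
import math
import re
from collections import Counter


def split_text(text: str) -> list[str]:
    """Text splitter: one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Retriever: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 1. Document loader (here just a string standing in for a loaded paper)
document = (
    "Human MG-63 osteoblasts were electrically stimulated for 10 min. "
    "AC stimulation did not change the pH of the culture medium. "
    "Intracellular calcium levels were highest for cells stimulated at 20 Hz."
)
# 2. Text splitter
chunks = split_text(document)
# 3. Vector "database": each chunk stored next to its vector
vector_store = [(chunk, embed(chunk)) for chunk in chunks]
# 4. Retriever
top_chunks = retrieve("At which frequency were calcium levels highest?", vector_store)
print(top_chunks[0])
```

In the real pipeline, each of these toy functions is replaced by the corresponding LangChain component.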


## Document Loader 
The first question is how to load a document. A document can be a PDF, a Word file, or any other file.

#### 👩‍🏫 key points:
* Data can come from various sources, such as PDF, text, Google Drive, Slack, or Amazon S3, and chances are high that each source has a different format. LangChain provides a common interface for reading data regardless of the source.
* There are hundreds of document loaders. For the Text Agent, my research papers are in PDF format, so I will use a PDF loader.

* I am planning to use `UnstructuredPDFLoader`. Research papers are complex PDFs with two-column layouts, figures, tables, captions, and footnotes, and `UnstructuredPDFLoader` can extract these elements, which helps with multimodal RAG.

### Experimenting with LangChain's UnstructuredPDFLoader

In [10]:

# Importing pdfloader
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf", 
                               mode="elements",
                               strategy="hi_res",
                               infer_table_structure=True,
                               extract_images_in_pdf=True,                           
                               extract_image_block_types=["Image"],  # add "Table" to also extract tables as images
                               extract_image_block_to_payload=True,
                               )

docs = loader.load()  # Loads the PDF as a list of Document objects, one per detected element




The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


In [11]:
for doc in docs[20:40]:
    print(doc)

page_content='' metadata={'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'coordinates': {'points': ((np.float64(99.21111111111111), np.float64(1824.6681105205612)), (np.float64(99.21111111111111), np.float64(1879.786110520561)), (np.float64(256.6911111111111), np.float64(1879.786110520561)), (np.float64(256.6911111111111), np.float64(1824.6681105205612))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'image_base64': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAA3AJ4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdIS

In [12]:
for doc in docs[2:20]:
    print(f"category:{doc.metadata.get('category')}----> content:{doc.page_content}, Page_number: {doc.metadata.get('page_number')}")

category:NarrativeText----> content:Article, Page_number: 1
category:Title----> content:Pulsed Electrical Stimulation Affects Osteoblast Adhesion and Calcium Ion Signaling, Page_number: 1
category:Title----> content:Susanne Staehlke 1,* , Meike Bielfeldt 1 , Julius Zimmermann 2 , Martina Gruening 1, Ingo Barke 3, Thomas Freitag 1, Sylvia Speller 3,4, Ursula Van Rienen 2,4,5 and Barbara Nebe 1,4, Page_number: 1
category:ListItem----> content:1 Department of Cell Biology, Rostock University Medical Center, 18057 Rostock, Germany, Page_number: 1
category:ListItem----> content:2 Department of Computer Science and Electrical Engineering, Institute of General Electrical Engineering, University of Rostock, 18059 Rostock, Germany, Page_number: 1
category:UncategorizedText----> content:ae wo, Page_number: 1
category:ListItem----> content:Physics of Surfaces and Interfaces, Institute of Physics, University of Rostock, 18059 Rostock, Germany, Page_number: 1
category:ListItem----> content:4 Depart

In [13]:
# Checking the categories in the Document objects
list_categories = [doc.metadata.get('category') for doc in docs]

print(set(list_categories))

{'Table', 'NarrativeText', 'UncategorizedText', 'ListItem', 'Title', 'Header', 'Image', 'FigureCaption'}


#### 👩‍🏫 key points:

* It is important to know which categories the loader assigns. This shows how the different elements of the PDF are classified.
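
Beyond the set of category names, it can help to see how many elements fall into each category. The sketch below uses `collections.Counter`. The `sample_docs` list is made-up stand-in data (simple objects with a `.metadata` dict) so the snippet runs on its own, and `count_categories` is a hypothetical helper name. With the real loader output you would pass `docs` instead.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for the loader's Document objects (assumption: only `.metadata` matters here)
sample_docs = [
    SimpleNamespace(page_content="Article", metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="7 of 22", metadata={"category": "Header"}),
    SimpleNamespace(page_content="1. Introduction", metadata={"category": "Title"}),
    SimpleNamespace(page_content="A widespread research field ...", metadata={"category": "NarrativeText"}),
]


def count_categories(docs) -> Counter:
    """Count how many elements were assigned to each category."""
    return Counter(doc.metadata.get("category") for doc in docs)


category_counts = count_categories(sample_docs)
for category, count in category_counts.most_common():
    print(f"{category}: {count}")
```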

#### Checking the metadata based on categories

In [14]:
# Table Category
for doc in docs[40:60]:
    if doc.metadata.get('category') == 'Table':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")
        



#### 👩‍🏫 key points:

* For the Table category, the document loader has clearly identified the tables and captured all the elements present in them. We can use this content to create a chunk and generate an embedding to store in the vector database.

* It is also important to check the metadata, which will be stored in the vector database along with the embeddings. This lets us provide additional information at query time.

    > For example, *"The table in cells-11-02650-v2.pdf (source), on page number 14 illustrates that........."*

* Printing the metadata keys gives an overview of what metadata is available, so we can decide which keys to keep and which to remove.
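
One simple way to keep only the useful keys is an allow-list filter, sketched below. The selection in `KEEP_KEYS` is an assumption for illustration, not a final decision, and the `raw` dict is made-up sample metadata with a placeholder `element_id`.

```python
# Assumed selection of keys worth storing next to each embedding
KEEP_KEYS = {"source", "page_number", "category", "text_as_html"}


def filter_metadata(metadata: dict, keep: set = KEEP_KEYS) -> dict:
    """Return a copy of `metadata` restricted to the keys in `keep`."""
    return {key: value for key, value in metadata.items() if key in keep}


raw = {
    "source": "cells-11-02650-v2.pdf",
    "page_number": 14,
    "category": "Table",
    "languages": ["eng"],
    "filetype": "application/pdf",
    "element_id": "placeholder-id",
}
slim = filter_metadata(raw)
print(slim)
```

Keys that are absent from an element (for example `text_as_html` on a paragraph) are simply skipped, so the same filter can be reused across categories.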

In [15]:
# Title Category
for doc in docs[60:80]:
    if doc.metadata.get('category') == 'Title':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")

#### 👩‍🏫 key points:

* We need to create embeddings for the Title category because it provides valuable context: information about the paragraph that follows. As we can see, the document loader has correctly classified almost all the titles, but there is some noise too.

* Again, for this category we need to include only a few metadata keys, such as source, coordinates, and page_number.
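
One possible way to separate real section headings from the noise is a regular expression, sketched below. It assumes headings are either numbered ("1. Introduction") or consist of letters and spaces only, so headings containing digits, hyphens, or symbols would be rejected. The sample strings are partly taken from the loader output above and partly made up ("2.1. Cell Culture" is a hypothetical heading).

```python
import re

# Numbered headings such as "1. Introduction", or plain headings made of letters and spaces
SECTION_PATTERN = re.compile(r"^(?:\d+(?:\.\d+)*\.\s+[A-Za-z ]+|[A-Za-z ]+)$")


def is_section_heading(text: str) -> bool:
    """True if the text looks like a section heading under the assumptions above."""
    return SECTION_PATTERN.match(text.strip()) is not None


candidates = [
    "1. Introduction",
    "2.1. Cell Culture",
    "Susanne Staehlke 1,* , Meike Bielfeldt 1",
    "7 of 22",
]
headings = [text for text in candidates if is_section_heading(text)]
print(headings)
```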

In [16]:
# Header Category
for doc in docs[80:100]:
    if doc.metadata.get('category') == 'Header':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")

7
**Content** ===> 7 of 22
{'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'detection_class_prob': 0.44439083337783813, 'is_extracted': 'true', 'coordinates': {'points': ((np.float64(1487.4879150390625), np.float64(154.74832153320312)), (np.float64(1487.4879150390625), np.float64(181.1849116666669)), (np.float64(1554.24658203125), np.float64(181.1849116666669)), (np.float64(1554.24658203125), np.float64(154.74832153320312))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 7, 'file_directory': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents', 'filename': 'cells-11-02650-v2.pdf', 'category': 'Header', 'element_id': '8e3f77d70b88f9df6f3139efdcc54564'}
metadata keys----> ['source', 'detection_class_prob', 'is_extracted', 'coordinates', 'last_modified', 'filetype', 'languages', 'page_number', 'file

#### 👩‍🏫 key points:

* I think we don't need to store this category in the vector store, since it only contains page numbers and page headers.

In [17]:
# NarrativeText Category
for doc in docs[30:50]:
    if doc.metadata.get('category') == 'NarrativeText':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")


2
**Content** ===> For electrical stimulation (ES), we used a multi-channel electrical stimulator, a voltage generator, and a 12-well C (culture)-Dish (IonOptix, Milton, MA, USA) (Figure 1a,c).
{'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'detection_class_prob': 0.7710264325141907, 'is_extracted': 'true', 'coordinates': {'points': ((np.float64(461.416259765625), np.float64(1319.9261474609375)), (np.float64(461.416259765625), np.float64(1383.248481111111)), (np.float64(1557.4393310546875), np.float64(1383.248481111111)), (np.float64(1557.4393310546875), np.float64(1319.9261474609375))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 2, 'file_directory': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents', 'filename': 'cells-11-02650-v2.pdf', 'parent_id': '83ad84162f789ab47bfee56a766fdcac', 'cat

#### 👩‍🏫 key points:

* This category is crucial because it is the main source of information. As we can see, all the paragraphs from the PDF are classified into this category.

* We can include only metadata similar to that mentioned above.

In [18]:
# ListItem Category

for doc in docs[10:20]:
    if doc.metadata.get('category') == 'ListItem':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")

1
**Content** ===> 5 Department Aging of Individuals and Society, Interdisciplinary Faculty, University of Rostock, 18059 Rostock, Germany
{'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'detection_class_prob': 0.8452051281929016, 'is_extracted': 'true', 'coordinates': {'points': ((np.float64(461.8221740722656), np.float64(748.2141200000001)), (np.float64(461.8221740722656), np.float64(804.3043561111112)), (np.float64(1514.283447265625), np.float64(804.3043561111112)), (np.float64(1514.283447265625), np.float64(748.2141200000001))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents', 'filename': 'cells-11-02650-v2.pdf', 'parent_id': '8f6456b57c49f15b5e0bb5356a1bc108', 'category': 'ListItem', 'element_id': 'a2d23b388c1c784ea03c07

#### 👩‍🏫 key points:

* This category includes only references and list items. Hence, I've decided not to include it.

In [19]:
# UncategorizedText Category
for doc in docs[0:10]:
    if doc.metadata.get('category') == 'UncategorizedText':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")
        


1
**Content** ===> ae wo
{'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'coordinates': {'points': ((np.float64(451.0), np.float64(684.0)), (np.float64(451.0), np.float64(760.0)), (np.float64(479.0), np.float64(760.0)), (np.float64(479.0), np.float64(684.0))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents', 'filename': 'cells-11-02650-v2.pdf', 'parent_id': '8f6456b57c49f15b5e0bb5356a1bc108', 'category': 'UncategorizedText', 'element_id': '57b0174c00d8f350de0c4629208cf463'}
metadata keys----> ['source', 'coordinates', 'last_modified', 'filetype', 'languages', 'page_number', 'file_directory', 'filename', 'parent_id', 'category', 'element_id']



#### 👩‍🏫 key points:

* This category includes only reference numbers and garbage. Hence, I've decided not to include it.

In [20]:
# Image Category
for doc in docs[40:70]:
    if doc.metadata.get('category') == 'Image':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")

#### 👩‍🏫 key points:

* This category covers all the images present in the PDF. We need to include it, as it is another crucial element.

* The metadata should contain source, page number, coordinates, image_base64, and image_mime_type.

* This metadata will help retrieve the image from the PDF to enhance the context, though the image itself is not sent to the LLM for reasoning.
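
Getting the image back from the metadata comes down to decoding the base64 string. The sketch below shows this with the standard library. `decode_image` is a hypothetical helper, and `sample_metadata` holds a fake payload that merely starts with the JPEG magic bytes, so the snippet runs without the PDF.

```python
import base64


def decode_image(metadata: dict) -> tuple:
    """Decode the base64 payload of an Image element into raw bytes plus its MIME type."""
    image_bytes = base64.b64decode(metadata["image_base64"])
    mime_type = metadata.get("image_mime_type", "application/octet-stream")
    return image_bytes, mime_type


# Made-up metadata: a fake payload that starts with the JPEG magic bytes FF D8 FF
sample_metadata = {
    "image_base64": base64.b64encode(b"\xff\xd8\xff\xe0 fake jpeg payload").decode("ascii"),
    "image_mime_type": "image/jpeg",
    "page_number": 1,
}

image_bytes, mime_type = decode_image(sample_metadata)
print(mime_type, len(image_bytes))
```

With real metadata, the decoded bytes could be written to a file opened in `"wb"` mode or handed to an image viewer.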


In [21]:
# FigureCaption Category

for doc in docs[40:70]:
    if doc.metadata.get('category') == 'FigureCaption':
        print(doc.metadata.get('page_number'))
        print(f"**Content** ===> {doc.page_content}")
        print(doc.metadata)
        print(f"metadata keys----> {list(doc.metadata.keys())}\n")

#### 👩‍🏫 key points:

* This will help connect the text to the image or table information, which provides context for the findings.

### Notes:

* The UncategorizedText, Header, and ListItem categories will not be used to create chunks. All the other categories are useful for retrieval.

* To create a chunk, we first need to combine all the elements in a meaningful way that preserves the semantic structure of the data.

* Once the content is combined, we will do token-aware chunking so that no sentence is broken in the middle and every chunk still carries accurate information.
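
The chunking idea in the notes can be sketched as follows. This is a simplified illustration, not the final splitter: tokens are approximated by whitespace-separated words (a real tokenizer would count differently), and the naive sentence split breaks on abbreviations such as "e.g.". `chunk_by_sentences` is a hypothetical name.

```python
import re


def chunk_by_sentences(text: str, max_tokens: int = 50) -> list:
    """Group whole sentences into chunks of at most `max_tokens` words where possible.

    A single sentence longer than `max_tokens` becomes its own chunk rather than being cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_tokens = len(sentence.split())  # rough token estimate
        # Close the current chunk if adding this sentence would exceed the limit
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "ES is a research field in regenerative medicine. "
    "Osteoblasts were stimulated for 10 min. "
    "The calcium level was highest at 20 Hz."
)
chunks = chunk_by_sentences(text, max_tokens=16)
for chunk in chunks:
    print(repr(chunk))
```

Because a chunk is only closed at a sentence boundary, every chunk ends with a complete sentence.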

#### Before we generate embeddings, we need to preprocess the extracted information and use a text splitter for better chunking.

## Let's combine the content semantically.

##### Just for reference 😜
page_content='Article' metadata={'source': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents/cells-11-02650-v2.pdf', 'detection_class_prob': 0.8227533102035522, 'is_extracted': 'true', 'coordinates': {'points': ((np.float64(97.11119842529297), np.float64(290.358642578125)), (np.float64(97.11119842529297), np.float64(321.78399611111104)), (np.float64(178.90414428710938), np.float64(321.78399611111104)), (np.float64(178.90414428710938), np.float64(290.358642578125))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-11-01T08:41:48', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': '/home/vraj/Desktop/CellSense/X20251016-CellSense-VB/Documents', 'filename': 'cells-11-02650-v2.pdf', 'category': 'NarrativeText', 'element_id': 'be3a187cfe3ab16d142f7c7dfc25ceeb'}

The output I want is:

Document(page_content = "semantic content of max chunk size", metadata = {"source":***, "page_number":***, })

In [22]:
from langchain_core.documents import Document
import re

# def combine_semantic_content(docs: List[Document], max_chunk_size: int = 5000):
all_semantic_docs = []
accumulated_semantic_content = ""
accumulated_metadata = {}
section_header = "Pulsed Electrical Stimulation Affects Osteoblast Adhesion and Calcium Ion Signaling"
# Matches numbered headings such as "1. Introduction" or plain headings made of letters and spaces
pattern = r'^(?:\d+(?:\.\d+)*\.\s+[A-Za-z ]+|[A-Za-z ]+)$'
necessary_metadata = {'source', 'section_header', 'page_number', 'coordinates', 'category', 'image_base64', 'image_mime_type','text_as_html'}
max_chunk_chars = 20000  # maximum number of characters per combined chunk

# Iterating over documents
for doc in docs:
    metadata_copy = doc.metadata.copy()
    doc_based_content = doc.page_content.strip()
    current_category = doc.metadata.get('category')

    is_header = current_category in ['Title']  # checking if it's a section header

    # A section header closes the current chunk and starts a new one
    if is_header:
        # If content has been accumulated, store it as a Document under the current section header
        if accumulated_semantic_content:
            new_metadata = accumulated_metadata.copy()
            new_metadata['section_header'] = section_header
            # creating new Document object
            new_docobj = Document(metadata=new_metadata, page_content=accumulated_semantic_content)
            all_semantic_docs.append(new_docobj)

        # Update the section header only if the title looks like a real heading (no content is written yet)
        refined_header = re.match(pattern, doc_based_content)
        if refined_header:
            section_header = doc_based_content
        accumulated_semantic_content = ""
        accumulated_metadata = metadata_copy
        continue  # Move to the next Document object
    if current_category in ['NarrativeText', 'Table', 'Image', 'FigureCaption']:

        header = ""
        cat = ""
        if not accumulated_semantic_content:
            header = f"\n\n# {section_header}\n {current_category}\n"
        semantic_content = f'{header} {cat} {doc_based_content}'
        # revised_metadata_copy = {key: value for key, value in metadata_copy.items() if key in necessary_metadata}

        accumulated_metadata.update(metadata_copy)

        if len(accumulated_semantic_content) + len(semantic_content) > max_chunk_chars:
            new_metadata = accumulated_metadata.copy()
            new_metadata['section_header'] = section_header
            # creating new Document object
            new_docobj = Document(metadata=new_metadata, page_content=accumulated_semantic_content)
            all_semantic_docs.append(new_docobj)

            # starting new chunk
            accumulated_semantic_content = semantic_content

        else:
            accumulated_semantic_content += semantic_content

# After the loop, store whatever content is still accumulated as the last Document
if accumulated_semantic_content:
    new_metadata = accumulated_metadata.copy()
    new_metadata['section_header'] = section_header
    new_docobj = Document(metadata=new_metadata, page_content=accumulated_semantic_content)
    all_semantic_docs.append(new_docobj)


for doc in all_semantic_docs[2:5]:
    print(doc.page_content)
        




# 1. Introduction
 NarrativeText
  Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional afﬁl- iations.    Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).  A widespread research ﬁeld in regenerative medicine is electrical stimulation (ES), and its impact on tissue and cells, such as bone [1]. In 1957 the piezoelectric properties of bone were described [2]. As the bone healing process takes place under the mechanical strain of the bone [3], an electric ﬁeld is generated in the bone in vitro and in vivo [4]. Externally applied electric ﬁelds were shown to contribute to bone deposition and osteoblast differentiation and proliferation [5]. The living cell has a membrane potential that indicates the electrical potential d

#### Now I have the whole semantic content. I am thinking of adding image and table summaries wherever a table or image is present.



## Creating a function to generate a summary from a table's HTML content.

In [23]:
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from dotenv import load_dotenv

# Load OPENAI_API_KEY from the .env file. Assigning os.getenv(...) back to os.environ is a no-op
# when the key is set and raises a TypeError when it is missing.
load_dotenv()
model = init_chat_model("openai:gpt-5.1")


In [24]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.messages import SystemMessage, HumanMessage, AIMessage

Role = """You are an expert biomedical research assistant specialising in analysing scientific literature. 
You must generate concise and informative summaries of tables present in scientific documents. 
Your summaries should capture the key insights, trends, and significant data points from the table content provided to you. 
Ensure that your summaries are clear, accurate, and useful for researchers looking to understand the data quickly."""

# Roles are given as strings here, as in the first template. Passing the message classes
# (SystemMessage, HumanMessage) in the tuples raised a pydantic ValidationError for
# SystemMessage, because its content was None.
table_template = ChatPromptTemplate([
    ("system", Role),
    ("user", "Generate an insightful summary for the given table: \n{table_content}")
])