**A Few Important Points**
I did not have an OpenAI API key, so I used the free tier of the Groq API instead. As a result, the code that follows differs from what I would have written against the OpenAI API. OpenAI offers more options, including a token limit of up to 32k, which is far more than the 6k that Groq gives me. Groq also limits how many requests I can make per second, so I have to make sure I do not send too many.
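
One way to stay under a requests-per-second limit is to enforce a minimum gap between calls. The sketch below is only an illustration and is not used elsewhere in this notebook: the class name `MinIntervalThrottle` and the 2-second interval are my own assumptions, and the real limit should be checked against the Groq account page.

```python
import time


class MinIntervalThrottle:
    """Hypothetical helper: keeps consecutive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the helper can be tested without waiting
        self._sleep = sleep
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Too soon after the previous call: pause for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Assumed value; pick an interval that matches the actual rate limit
throttle = MinIntervalThrottle(min_interval=2.0)
```

Calling `throttle.wait()` right before each request to the model would then space the requests out.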

In [1]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq

In [3]:
load_dotenv()

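# Read the key loaded from the .env file
GROQ_API_KEY = os.getenv('GROQ_API_KEY')
# ChatGroq picks up GROQ_API_KEY from the environment by default
llm = ChatGroq(model="llama3-70b-8192")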

To load the document, I followed the LangChain PDF loader guide: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/
I adapted the code example given there to my own code and use case.

In [6]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("crime-and-punishment.pdf")
pages = loader.load()

Now let's see how many pages we have.

In [7]:
print(len(pages))

767


To create my own Document object from the text of the pages, I used the code from this link: https://python.langchain.com/docs/how_to/document_loader_custom/

In [8]:
from langchain_core.documents import Document
combined_text = " ".join(page.page_content for page in pages)

combined_document = Document(page_content=combined_text)

I learned about the recursive text splitter, and took the code for it, from this link: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/
I need smaller chunks so that the model can process each one without hitting a token limit error. Since the model's limit is 6000 tokens, I chose the chunk size by trial and error, checking which size works best. Note that chunk_size is measured in characters by default, not tokens, which is why the value can be much larger than 6000. I also use an overlap of 500 characters, which carries over some context from the previous chunk.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=22700, chunk_overlap=500)
splits = text_splitter.split_documents([combined_document])
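
As a rough sanity check on the chunk size, the character count can be converted into an approximate token count. The sketch below assumes about 4 characters per token, which is only a common rule of thumb for English text and not the model's real tokenizer; the function name `estimate_tokens` is hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed average, not a measured value."""
    return int(len(text) / chars_per_token)


chunk_size = 22700    # characters, same value as in the splitter above
token_budget = 6000   # token limit mentioned earlier

# A chunk of maximum length, used only to see where the estimate lands
estimated = estimate_tokens("x" * chunk_size)
print(estimated, estimated <= token_budget)  # 5675 True
```

Under this assumption a full chunk stays just below the limit, which matches what I found by trial and error. The prompt itself also uses some tokens, so the real margin is smaller.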

Now let's see how many chunks we have to process.

In [10]:
print(len(splits))

53


In [11]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage

In [12]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Summarize only the unique events and details from the text, capturing essential plot points and themes without repeating introductory statements or prior events. Avoid starting with introductory phrases like 'Here is a summary' and keep each summary segment cohesive and streamlined. Don't create any bullet points; instead, write the summary in paragraphs.",
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

In [None]:
batch_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are generating a batch summary of a section of the book, based on multiple summary segments provided. Please condense these segments into a coherent and concise summary that captures the main points, themes, and character developments of this batch. Avoid repeating details from individual segments and focus on providing an accurate overview of this portion of the narrative."
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)


The approach here is to first generate a summary for each chunk, then group those summaries into batches of 8 and summarize each batch. The batch summaries together form the final summary. This is a bit of a hack, but it seems to work well.

In [None]:
final_summary = ""
summaries = []

# Build the chains once instead of on every iteration
chain = prompt | llm
batch_chain = batch_prompt | llm

for i, split in enumerate(splits):
    # Generate summary for each chunk
    summary = chain.invoke({"messages": [HumanMessage(content=split.page_content)]})
    summaries.append(summary.content)
    # Print only the summary text, not the full message object
    print(summary.content)
    # Every 8 summaries, create a batch summary
    if (i + 1) % 8 == 0 or (i + 1) == len(splits):  # Ensures last batch is also summarized
        # Combine the chunk summaries of this batch
        batch_text = " ".join(summaries)

        # Summarize the combined text from this batch
        batch_summary = batch_chain.invoke({"messages": [HumanMessage(content=batch_text)]})

        # Add batch summary to final summary
        final_summary += batch_summary.content + "\n\n"

        # Reset summaries list for next batch
        summaries = []
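
The same two-stage idea can be written without the index arithmetic by slicing the chunk summaries into groups first. This is only a sketch of an alternative layout, not the code I ran: `batched`, `summarize_in_batches`, and the two stand-in functions are hypothetical names, and the stand-ins replace the real model calls so the example runs without an API key.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_in_batches(chunks, summarize_chunk, summarize_batch, batch_size=8):
    """Summarize every chunk, then summarize each group of `batch_size` summaries."""
    chunk_summaries = [summarize_chunk(chunk) for chunk in chunks]
    batch_summaries = [
        summarize_batch(" ".join(group))
        for group in batched(chunk_summaries, batch_size)
    ]
    return "\n\n".join(batch_summaries)


# Stand-ins for the model calls (assumptions, for illustration only)
def fake_chunk_summary(text):
    return text.upper()


def fake_batch_summary(text):
    return f"[{text}]"


result = summarize_in_batches(
    ["a", "b", "c"], fake_chunk_summary, fake_batch_summary, batch_size=2
)
print(result)
```

With the real chains, `summarize_chunk` and `summarize_batch` would wrap the `chain.invoke(...)` and `batch_chain.invoke(...)` calls from the cell above. With 53 chunks and a batch size of 8, this gives 7 batch summaries, the last one covering only 5 chunks.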
      