# Powerful Book Summarizer Using Python, LangChain, and OpenAI Embeddings

AI models such as GPT-3 and GPT-4 are extremely powerful, but they have limitations. The context window is a significant one: it caps the amount of text the model can consider at any given time. As a result, you cannot simply feed an entire book into the model and expect a coherent summary. Processing large documents can also be expensive.
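
To make the context-window limit concrete, here is a rough back-of-the-envelope check. It is only a sketch: the figure of about 4 characters per token is a common rule of thumb for English text rather than a real tokenizer count, and the book length and window size are hypothetical values chosen for illustration.

```python
# Rough check of whether a text fits in a model's context window.
# ASSUMPTION: ~4 characters per token is a heuristic, not a tokenizer count.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(text: str, context_window: int) -> bool:
    """Return True if the estimated token count is within the window."""
    return estimate_tokens(text) <= context_window


# A hypothetical 1,000,000-character book versus an 8,192-token window.
book = "x" * 1_000_000
print(estimate_tokens(book))         # 250000
print(fits_in_context(book, 8_192))  # False
```

Even with a generous margin of error in the heuristic, a full-length book is far larger than such a window, which is why the steps below work on chunks.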

## The Process Simplified
### Here’s how we transform a full-length book into a concise summary:

1. **Splitting & Embeddings:** We break the book into smaller chunks and convert them into embeddings. This step is surprisingly affordable.
1. **Clustering:** Next, we cluster these embeddings to find the most representative sections of the book.
1. **Summarization:** We then summarize these key sections using the more cost-effective GPT-3.5 model.
1. **Combining Summaries:** Finally, we use GPT-4 to stitch these summaries into one fluid narrative.

By using GPT-4 only in the final step, we keep our costs low.

## Step 1: Load the Book
**First, we need to read the book’s content. We’ll support PDF and EPUB formats.**

In [None]:
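import os
from langchain.document_loaders import PyPDFLoader, UnstructuredEPubLoader


def load_book(file_path):
    # Choose a loader based on the file extension (PDF or EPUB)
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".pdf":
        loader = PyPDFLoader(file_path)
    elif extension == ".epub":
        loader = UnstructuredEPubLoader(file_path)
    else:
        raise ValueError(f"Unsupported file type: {extension}")

    # Load the book as a list of documents (one per page for PDFs)
    pages = loader.load()

    # Cut out the opening and closing parts
    # pages = pages[1:222]

    # Combine the pages, and replace the tabs with spaces
    text = "".join(page.page_content for page in pages)
    text = text.replace('\t', ' ')
    return text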

## Step 2: Split and Embed the Text

AI models have a token limit, which means they cannot process entire books at once. By breaking up the text into chunks, we ensure that each section is digestible for the AI.

We divide the text into chunks and convert each chunk into an embedding, a compact numerical vector that represents its meaning. Computing embeddings is fast and requires little computation, which keeps this step cost-effective.


In [133]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings


def split_and_embed(text, openai_api_key):
    # Split into large overlapping chunks; sizes are measured in characters
    text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=12000, chunk_overlap=2500)
    docs = text_splitter.create_documents([text])
    # Google embeddings replace the OpenAI ones here, so openai_api_key is currently unused
    # embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    # One embedding vector per chunk, in the same order as docs
    vectors = embeddings.embed_documents([x.page_content for x in docs])
    return docs, vectors
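
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker in plain Python. It is a sketch only: `chunk_text` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it cuts at fixed character offsets instead of preferring the listed separators, so real chunk boundaries will differ.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.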

## Step 3: Cluster the Embeddings
We use KMeans clustering to group similar chunks. In my version, I found that 11 clusters worked well for the majority of books (note that `generate_summary` further down defaults to `num_clusters=13`). Adjust the number to suit your needs.

Here, we've divided the entire book into chunks and then embedded them. These embeddings are grouped according to their similarity. For each group, we select the most representative embedding and map it back to the corresponding text chunk.

In [134]:
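from sklearn.cluster import KMeans
import numpy as np


def cluster_embeddings(vectors, num_clusters):
    # Convert the list of embedding vectors to an array for vectorized distance math
    vectors = np.asarray(vectors)
    # KMeans cannot form more clusters than there are chunks
    num_clusters = min(num_clusters, len(vectors))
    kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
    # For each cluster center, pick the index of the nearest chunk embedding
    closest_indices = [int(np.argmin(np.linalg.norm(vectors - center, axis=1))) for center in kmeans.cluster_centers_]
    # Sort so the selected chunks stay in book order
    return sorted(closest_indices)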
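
The "most representative chunk" selection can be illustrated on toy data without any libraries. This is a sketch under stated assumptions: the 2-D points are made-up stand-ins for embeddings, and the two centers are computed as simple group means to stand in for the centroids KMeans would return. `closest_to_centers` and `mean` are hypothetical helpers.

```python
import math


def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def closest_to_centers(vectors, centers):
    """For each center, find the index of the nearest vector; return sorted."""
    indices = []
    for center in centers:
        distances = [math.dist(vec, center) for vec in vectors]
        indices.append(distances.index(min(distances)))
    return sorted(indices)


# Toy 2-D "embeddings" forming two obvious groups (hypothetical data).
vectors = [(0.0, 0.0), (0.2, 0.1), (1.0, 0.0), (5.0, 5.0), (5.1, 4.9), (6.0, 5.0)]
centers = [mean(vectors[:3]), mean(vectors[3:])]

print(closest_to_centers(vectors, centers))  # [1, 4]
```

Indices 1 and 4 are the points nearest to each group's center, so the chunks at those positions would be the ones passed on for summarization.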

## Step 4: Summarize the Representative Chunks
We summarize only the selected chunks. The original setup used GPT-3.5; the cell below swaps in Mixtral served through Groq.

In [135]:
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq

groq_api_key = os.getenv('GROQ_API_KEY')

def summarize_chunks(docs, selected_indices, openai_api_key):
    # openai_api_key is unused while the Groq model replaces the commented-out OpenAI one
    # llm3_turbo = ChatOpenAI(temperature=0, openai_api_key=openai_api_key, max_tokens=1000, model='gpt-3.5-turbo-16k')
    llm3_turbo = ChatGroq(groq_api_key=groq_api_key, model_name="mixtral-8x7b-32768")
    map_prompt = """
    You are provided with a passage from a book. Your task is to produce a comprehensive summary of this passage. Ensure accuracy and avoid adding any interpretations or extra details not present in the original text. The summary should be at least ten paragraphs long and fully capture the essence of the passage.
    ```{text}```
    SUMMARY:
    """
    
    cprompt = """
    As a professional summarizer, create a comprehensive summary of the provided passage from a book in a minimum of 1200 words, while adhering to these guidelines:
    * Ensure accuracy and avoid adding any details that are not in the original text.
    * Craft a summary that is detailed, thorough and in-depth, while maintaining clarity and conciseness.
    * Incorporate main ideas and essential information, eliminating extraneous language and focusing on critical aspects.
    * Rely strictly on the provided text, without including external information.
    * Format the summary in paragraph form for easy understanding.
    
    ```{text}```
    SUMMARY:
    """
    # map_prompt is kept as a shorter alternative; only the cprompt template is used below
    map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
    map_prompt_template2 = PromptTemplate(template=cprompt, input_variables=["text"])
    
    selected_docs = [docs[i] for i in selected_indices]
    summary_list = []

    for doc in selected_docs:
        chunk_summary = load_summarize_chain(llm=llm3_turbo, chain_type="stuff", prompt=map_prompt_template2).run([doc])
        summary_list.append(chunk_summary)
    
    return "\n".join(summary_list)
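
The saving from summarizing only the representative chunks can be estimated with simple arithmetic. This sketch assumes fixed-size windows with the `chunk_size=12000` and `chunk_overlap=2500` values from Step 2, 11 clusters as suggested in Step 3, and a hypothetical book of 1,000,000 characters. The real splitter breaks at separators, so actual counts will differ somewhat. `estimate_chunk_count` is a hypothetical helper.

```python
def estimate_chunk_count(total_chars, chunk_size, chunk_overlap):
    """Estimate how many overlapping chunks a text of total_chars produces."""
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap    # characters each new chunk advances
    remaining = total_chars - chunk_size # text left after the first chunk
    extra = -(-remaining // step)        # ceiling division
    return 1 + extra


total_chunks = estimate_chunk_count(1_000_000, chunk_size=12_000, chunk_overlap=2_500)
selected_chunks = 11  # one representative chunk per cluster

print(total_chunks)                             # 105
print(f"{selected_chunks / total_chunks:.1%}")  # 10.5%
```

Under these assumptions the summarization model sees roughly a tenth of the chunks, which is where most of the cost reduction comes from.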

## Step 5: Create the Final Summary
We combine the individual summaries into one final, cohesive summary. The original setup used GPT-4 for this step; the cell below swaps in Llama 3 70B served through Groq.

In [89]:
from langchain.schema import Document
from langchain.chat_models import ChatOpenAI

def create_final_summary(summaries):#, openai_api_key):
    llm4 = ChatGroq(groq_api_key=groq_api_key,   model_name="llama3-70b-8192")
    # llm4 = ChatOpenAI(temperature=0, openai_api_key=openai_api_key, max_tokens=3000, model='gpt-4', request_timeout=120)
    combine_prompt = """
    You are given a series of summarized sections from a book. Your task is to weave these summaries into a cohesive, verbose summary. The reader should be able to understand the main events or points of the book from your summary. Ensure you retain the accuracy of the content and present it in a clear and engaging manner.
    ```{text}```
    COHESIVE SUMMARY:
    """
    combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
    reduce_chain = load_summarize_chain(llm=llm4, chain_type="stuff", prompt=combine_prompt_template)
    final_summary = reduce_chain.run([Document(page_content=summaries)])
    return final_summary
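
Conceptually, the "stuff" chain places all of the text it is given into the prompt's `{text}` placeholder and sends a single request. The sketch below mimics that with plain string formatting. It is an illustration only, not LangChain's implementation; `COMBINE_TEMPLATE` is a shortened, hypothetical version of the prompt above and `build_combine_prompt` is a hypothetical helper.

```python
# Shortened stand-in for the combine prompt (hypothetical wording).
COMBINE_TEMPLATE = (
    "You are given a series of summarized sections from a book.\n"
    "{text}\n"
    "COHESIVE SUMMARY:"
)


def build_combine_prompt(chunk_summaries):
    """Join the chunk summaries and substitute them into the template."""
    joined = "\n".join(chunk_summaries)
    return COMBINE_TEMPLATE.format(text=joined)


prompt = build_combine_prompt(["Summary of part one.", "Summary of part two."])
print(prompt)
```

Because everything goes into one request, the joined summaries must fit within the final model's context window.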

## Bringing It All Together
Now we combine the steps into a single function that takes a book file and returns the joined chunk summaries. The final combining step is commented out here and is run separately further down.

In [136]:
# ... (previous code for imports and functions)

def generate_summary(uploaded_file, openai_api_key, num_clusters=13, verbose=False):
    # file_extension = os.path.splitext(uploaded_file.name)[1].lower()
    text = load_book(uploaded_file)#, file_extension)
    docs, vectors = split_and_embed(text, openai_api_key)
    selected_indices = cluster_embeddings(vectors, num_clusters)
    summaries = summarize_chunks(docs, selected_indices, openai_api_key)
    # final_summary = create_final_summary(summaries)#, openai_api_key)
    return summaries #final_summary


## Testing the Summarizer
Finally, we can test our summarizer with a book file.

In [137]:
# Testing the summarizer
from dotenv import load_dotenv

if __name__ == '__main__':
    load_dotenv()
    openai_api_key = os.getenv('OPENAI_API_KEY')
    # book_path = "C:/Users/Admin/Downloads/task/book.pdf"
    # with open(book_path, 'rb') as uploaded_file:
    summary = generate_summary('crime-and-punishment.pdf', openai_api_key, verbose=True)
    print(summary)

Raskolnikov, the protagonist of Fyodor Dostoevsky's "Crime and Punishment," is on his way to commit a heinous act. He has decided to kill an old pawnbroker, Alyona Ivanovna, to rob her of her wealth, which he plans to use for good deeds. As he approaches her apartment, he is not as afraid as he had imagined he would be. Instead, his mind is preoccupied with various thoughts, from the refreshing effect of fountains to the reasons why people choose to live in dirty and smelly parts of town. As he passes the Yusupov garden, he contemplates the idea of extending the summer garden to the field of Mars and joining it to the garden of the Mihalovsky Palace, believing it would be a great benefit to the town. He then wonders why men are drawn to live in dirty and unpleasant areas.

Suddenly, he realizes he is near the pawnbroker's apartment, and he begins to feel fear. However, he quickly dismisses the thought and focuses on the task at hand. He approaches the gate, and luckily for him, a huge 

In [90]:
summary2 = create_final_summary(summary)
print(summary2)

Here is a cohesive summary of the book "Crime and Punishment" by Fyodor Dostoevsky, based on the provided sections:

The story revolves around Raskolnikov, a poverty-stricken student who visits a pawnbroker, Alyona Ivanovna, to redeem a pledge. He becomes obsessed with the idea of murdering her and stealing her wealth, which he believes will solve his financial problems. Raskolnikov's mental state deteriorates as he becomes increasingly isolated and struggles with his own morality.

Meanwhile, Razumihin, a friend of Raskolnikov, takes it upon himself to care for Raskolnikov's mother, Pulcheria Alexandrovna, and sister, Avdotya Romanovna, after their harrowing encounter with Raskolnikov. Razumihin demonstrates his trustworthiness and capability, promising to update them on Raskolnikov's condition and bringing a doctor to assess him.

As the story unfolds, Raskolnikov's relationships with his family and others become increasingly strained. He becomes embroiled in a murder investigation, 

## Generating a PDF Summary File

In [138]:
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle

def save_text_response_to_pdf(text_response, filename):
    # Create a PDF document
    doc = SimpleDocTemplate(filename, pagesize=letter)

    # Create a list to hold the flowables (elements that can be added to the PDF)
    flowables = []

    # Create a sample stylesheet
    styles = getSampleStyleSheet()

    # Define a custom ParagraphStyle with desired alignment and spacing
    custom_style = ParagraphStyle(
        name='CustomStyle',
        parent=styles['Normal'],
        alignment=0,  # Left alignment (0=left, 1=center, 2=right)
        spaceAfter=12  # Space after each paragraph
    )

    # Add paragraphs to the flowables list
    for paragraph in text_response.split('\n\n'):  # Split by double newline for paragraphs
        flowables.append(Paragraph(paragraph.strip(), custom_style))
        flowables.append(Spacer(1, 12))  # Add spacer for spacing between paragraphs

    # Add the flowables to the document
    doc.build(flowables)


save_text_response_to_pdf(summary, "summary.pdf")
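
One caveat: ReportLab's `Paragraph` interprets its text as XML-like markup, so characters such as `&` or `<` in a model response may cause parsing errors. A possible safeguard, sketched here with the standard library only, is to escape each paragraph and drop empty ones before building the flowables. `to_safe_paragraphs` is a hypothetical helper, not part of the notebook above.

```python
from xml.sax.saxutils import escape


def to_safe_paragraphs(text_response):
    """Split on blank lines, drop empty blocks, and escape &, < and >."""
    paragraphs = []
    for block in text_response.split("\n\n"):
        block = block.strip()
        if block:  # skip empty blocks left by runs of blank lines
            paragraphs.append(escape(block))
    return paragraphs


print(to_safe_paragraphs("Crime & Punishment\n\n\n\nif a < b then stop"))
# ['Crime &amp; Punishment', 'if a &lt; b then stop']
```

Each returned string could then be passed to `Paragraph(...)` in place of `paragraph.strip()`.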
