___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME

This notebook will guide you through two increasingly significant applications in the realm of Generative AI: RAG (Retrieval Augmented Generation) chatbots and text summarization for big text.

Through two distinct projects, you will explore these technologies and enhance your skills. Detailed descriptions of the projects are provided below.

## Project 1: Building a Chatbot with a PDF Document (RAG)

In this project, you will develop a chatbot using a provided PDF document from web page. You will utilize the Langchain framework along with a large language model (LLM) such as GPT or Gemini. The chatbot will leverage the Retrieval Augmented Generation (RAG) technique to comprehend the document's content and respond to user queries effectively.

### **Project Steps:**

- **1.PDF Document Upload:** Upload the provided PDF document from web page (https://aclanthology.org/N19-1423.pdf) (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding).

- **2.Chunking:** Divide the uploaded PDF document into smaller segments (chunks). This facilitates more efficient information processing by the LLM.

- **3.ChromaDB Setup:**
  - Save ChromaDB to your Google Drive.

  - Retrieve ChromaDB from your Drive to begin using it in your project.

  - ChromaDB serves as a vector database to store embedding vectors generated from your document.

- **4.Embedding Vectors Creation:**
  - Convert the chunked document into embedding vectors. You can use either GPT or Gemini embedding models for this purpose.

  - If you choose the Gemini embedding model, set "task_type" to "retrieval_document" when converting the chunked document.

- **5.Chatbot Development:**
  - Utilize the **load_qa_chain** function from the Langchain library to build the chatbot.

  - This function will interpret user queries, retrieve relevant information from **ChromaDB**, and generate responses accordingly.



### Install Libraries

In [1]:
!pip install -qU langchain-google-community

In [2]:
!pip install -qU langchain-community

In [3]:
!pip install langchain-google-genai


Collecting protobuf (from google-generativeai<0.8.0,>=0.7.0->langchain-google-genai)
  Downloading protobuf-4.25.4-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Downloading protobuf-4.25.4-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m294.6/294.6 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 5.28.0
    Uninstalling protobuf-5.28.0:
      Successfully uninstalled protobuf-5.28.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
grpcio-health-checking 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.4 which is incompatible.
grpcio-tools 1.66.1 requir

In [4]:
!pip install -qU langchain-chroma

In [5]:
!pip install -qU pypdfium2

In [6]:
!pip install -q -U google-generativeai

In [7]:
!pip install chromadb



In [8]:
import pandas as pd
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA


In [9]:
!pip install -qU langchain-openai

In [10]:
!pip install -q datasets

In [11]:
!pip install langchain openai weaviate-client ragas

Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-health-checking<2.0.0,>=1.57.0->weaviate-client)
  Using cached protobuf-5.28.0-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Using cached protobuf-5.28.0-cp38-abi3-manylinux2014_x86_64.whl (316 kB)
Installing collected packages: protobuf
  Attempting uninstall: protobuf
    Found existing installation: protobuf 4.25.4
    Uninstalling protobuf-4.25.4:
      Successfully uninstalled protobuf-4.25.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires protobuf<5,>=3.20, but you have protobuf 5.28.0 which is incompatible.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.
google-ai-generativelanguage 0.6.6 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5, but you have p

### Access Google Drive

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Entering Your OpenAI or Google Gemini API Key.

In [13]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY']=userdata.get('OpenAIkey')

### Loading PDF Document

In [14]:
# create a pdf reader function
from langchain_community.document_loaders import PyPDFium2Loader

def read_doc(directory):
    file_loader=PyPDFium2Loader(directory)
    pdf_documents=file_loader.load() # PyPDFium2Loader reads page by page
    return pdf_documents

In [15]:
pdf=read_doc('/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf')
len(pdf)

# The document consists of 16 pages



16

In [16]:
pdf[0]

Document(metadata={'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf', 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\r\nMinneapolis, Minnesota, June 2 - June 7, 2019. \rc 2019 Association for Computational Linguistics\r\n4171\r\nBERT: Pre-training of Deep Bidirectional Transformers for\r\nLanguage Understanding\r\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\r\nGoogle AI Language\r\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\r\nAbstract\r\nWe introduce a new language representa\x02tion model called BERT, which stands for\r\nBidirectional Encoder Representations from\r\nTransformers. Unlike recent language repre\x02sentation models (Peters et al., 2018a; Rad\x02ford et al., 2018), BERT is designed to pre\x02train deep bidirectional representations from\r\nunlabeled text by jointly conditioning on both\r\nleft and right context in all layers. As a re\x02sult, the pre-trained BERT model can be fine\x02tuned with just one additio

### Document Splitter

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter


def chunk_data(docs, chunk_size=1000, chunk_overlap=200):
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                 chunk_overlap=chunk_overlap)
    pdf=text_splitter.split_documents(docs)
    return pdf

# This code splits documents into chunks using the RecursiveCharacterTextSplitter class from the langchain library.

# A function named chunk_data is defined, which takes a document or a collection of documents (docs) as input. It also takes two parameters:
# chunk_size and chunk_overlap. chunk_size specifies the maximum number of characters in each chunk, while chunk_overlap determines the amount of
# overlap between consecutive chunks.

# The function divides the documents into chunks based on these parameters using the RecursiveCharacterTextSplitter class. Consequently, each chunk
# contains chunk_size characters, with an overlap of chunk_overlap characters between consecutive chunks.

# As a result, the documents are segmented into chunks of specified sizes, and these chunks are returned.

# The chunk_overlap parameter is used to specify the sharing of characters between consecutive chunks. In other words, it ensures that the characters at
# the end of one chunk reappear at the beginning of the next chunk. This prevents the loss of information when the text is segmented or divided and
# helps preserve a certain context. Especially, overlap can be used to maintain important contextual relationships within a specific text and sustain
# meaning across chunks.

In [18]:
pdf_doc=chunk_data(docs=pdf)
len(pdf_doc)

84

In [19]:

pdf_doc[25:27]

[Document(metadata={'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf', 'page': 4}, page_content='answering, and the [CLS] representation is fed\r\ninto an output layer for classification, such as en\x02tailment or sentiment analysis.\r\nCompared to pre-training, fine-tuning is rela\x02tively inexpensive. All of the results in the pa\x02per can be replicated in at most 1 hour on a sin\x02gle Cloud TPU, or a few hours on a GPU, starting\r\nfrom the exact same pre-trained model.7 We de\x02scribe the task-specific details in the correspond\x02ing subsections of Section 4. More details can be\r\nfound in Appendix A.5.\r\n4 Experiments\r\nIn this section, we present BERT fine-tuning re\x02sults on 11 NLP tasks.\r\n4.1 GLUE\r\nThe General Language Understanding Evaluation\r\n(GLUE) benchmark (Wang et al., 2018a) is a col\x02lection of diverse natural language understanding\r\ntasks. Detailed descriptions of GLUE datasets are\r\nincluded in Appendix B.1.\r\nTo fine-tune on GLUE, we r

### 1. Creating A Embedding Model
### 2. Convert the Each Chunk of The Split Document to Embedding Vectors
### 3. Storing of The Embedding Vectors to Vectorstore
### 4. Save the Vectorstore to Your Drive

In [20]:
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings(model="text-embedding-3-large",
                                          dimensions = 3072)# dimensions=256, 1024, 3072



print(embeddings)


client=<openai.resources.embeddings.Embeddings object at 0x78bfdaaeb9d0> async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x78bfdab78b80> model='text-embedding-3-large' dimensions=3072 deployment='text-embedding-ada-002' openai_api_version='' openai_api_base=None openai_api_type='' openai_proxy='' embedding_ctx_length=8191 openai_api_key=SecretStr('**********') openai_organization=None allowed_special=None disallowed_special=None chunk_size=1000 max_retries=2 request_timeout=None headers=None tiktoken_enabled=True tiktoken_model_name=None show_progress_bar=False model_kwargs={} skip_empty=False default_headers=None default_query=None retry_min_seconds=4 retry_max_seconds=20 http_client=None http_async_client=None check_embedding_ctx_length=True


### Load Vectorstore(index) From Your Drive

In [21]:
from langchain_chroma import Chroma

#index=Chroma.from_documents(documents=pdf_doc,
#                            embedding=embeddings,
#                            persist_directory="/content/drive/MyDrive/vectorstore") # persist_directory, saves in the directory

#retriever=index.as_retriever()

In [37]:
loaded_index=Chroma(persist_directory="/content/drive/MyDrive/vectorstore",
                    embedding_function=embeddings)

In [38]:
load_retriever=loaded_index.as_retriever(search_kwargs={"k": 5})

### Retrival the First 5 Chunks That Are Most Similar to The User Query from The Document

In [39]:
load_retriever = loaded_index.as_retriever(search_kwargs = {'k': 5})

In [40]:
retrieved_docs = load_retriever.invoke('What is BERT?')

In [41]:
retrieved_docs

[Document(metadata={'page': 0, 'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf'}, page_content='to create state-of-the-art models for a wide\r\nrange of tasks, such as question answering and\r\nlanguage inference, without substantial task\x02specific architecture modifications.\r\nBERT is conceptually simple and empirically\r\npowerful. It obtains new state-of-the-art re\x02sults on eleven natural language processing\r\ntasks, including pushing the GLUE score to\r\n80.5% (7.7% point absolute improvement),\r\nMultiNLI accuracy to 86.7% (4.6% absolute\r\nimprovement), SQuAD v1.1 question answer\x02ing Test F1 to 93.2 (1.5 point absolute im\x02provement) and SQuAD v2.0 Test F1 to 83.1\r\n(5.1 point absolute improvement).\r\n1 Introduction\r\nLanguage model pre-training has been shown to\r\nbe effective for improving many natural language\r\nprocessing tasks (Dai and Le, 2015; Peters et al.,\r\n2018a; Radford et al., 2018; Howard and Ruder,\r\n2018). These include sentence-level

### Generating an Answer Based on The Similar Chunks

In [42]:
from langchain.prompts import PromptTemplate

template="""Use the following pieces of context to answer the user's question of {question}.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}"""

prompt_template = PromptTemplate(
    input_variables =['question','context'],
    template = template
)

### Pipeline For RAG (If you want, you can use the gemini-1.5-pro model)

In [28]:
our_query = "What is BERT?"

In [29]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

llm=ChatOpenAI(model_name="gpt-4o-mini",
               temperature=0,
               top_p=1)

chain = prompt_template | llm | StrOutputParser()

output= chain.invoke({"question":our_query, "context":retrieved_docs}) # first four most similar texts are returned
output

'BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model that improves fine-tuning approaches for natural language processing tasks. It is designed to address the limitations of unidirectional models by using a "masked language model" (MLM) pre-training objective, allowing it to incorporate context from both directions. BERT has achieved state-of-the-art results on various natural language processing tasks, including question answering and language inference, without requiring significant modifications to task-specific architectures.'

In [30]:
from IPython.display import Markdown

Markdown(output)

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model that improves fine-tuning approaches for natural language processing tasks. It is designed to address the limitations of unidirectional models by using a "masked language model" (MLM) pre-training objective, allowing it to incorporate context from both directions. BERT has achieved state-of-the-art results on various natural language processing tasks, including question answering and language inference, without requiring significant modifications to task-specific architectures.

##**Retrieval**

In [43]:
def retrieve_query(query,k=5):
    load_retriever=loaded_index.as_retriever(search_kwargs={"k": k}) #loaded_index
    return load_retriever.invoke(query)

In [44]:
our_query = "What is BERT?"

doc_search=retrieve_query(our_query, k=5) # first two most similar texts are returned
doc_search

[Document(metadata={'page': 0, 'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf'}, page_content='to create state-of-the-art models for a wide\r\nrange of tasks, such as question answering and\r\nlanguage inference, without substantial task\x02specific architecture modifications.\r\nBERT is conceptually simple and empirically\r\npowerful. It obtains new state-of-the-art re\x02sults on eleven natural language processing\r\ntasks, including pushing the GLUE score to\r\n80.5% (7.7% point absolute improvement),\r\nMultiNLI accuracy to 86.7% (4.6% absolute\r\nimprovement), SQuAD v1.1 question answer\x02ing Test F1 to 93.2 (1.5 point absolute im\x02provement) and SQuAD v2.0 Test F1 to 83.1\r\n(5.1 point absolute improvement).\r\n1 Introduction\r\nLanguage model pre-training has been shown to\r\nbe effective for improving many natural language\r\nprocessing tasks (Dai and Le, 2015; Peters et al.,\r\n2018a; Radford et al., 2018; Howard and Ruder,\r\n2018). These include sentence-level

In [45]:
def get_answers(query, k = 5):
    from langchain_openai import ChatOpenAI
    from langchain_core.output_parsers import StrOutputParser
    from langchain.prompts import PromptTemplate
    from IPython.display import Markdown

    doc_search=retrieve_query(query, k = k) # most similar texts are returned


    template="""Use the following pieces of context to answer the user's question of {question}.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    ----------------
    {context}"""

    prompt_template = PromptTemplate(
    input_variables =['question','context'],
    template = template)


    llm=ChatOpenAI(model_name="gpt-4o-mini",
                  temperature=0,
                  top_p=1)

    chain = prompt_template | llm | StrOutputParser()

    output= chain.invoke({"question":query, "context":doc_search}) # first four most similar texts are returned
    return Markdown(output)

In [46]:
our_query = "What are the key architectural components of BERT?"
answer = get_answers(our_query)
answer

The key architectural components of BERT include:

1. **Multi-layer Bidirectional Transformer Encoder**: BERT is based on a multi-layer bidirectional Transformer architecture, which allows it to consider context from both the left and right sides of a token.

2. **Unified Architecture**: There is minimal difference between the pre-trained architecture and the final downstream architecture, allowing for a consistent approach across different tasks.

3. **Model Sizes**: BERT primarily comes in two sizes:
   - **BERTBASE**: 12 layers (Transformer blocks), hidden size of 768, and 12 self-attention heads, totaling 110 million parameters.
   - **BERTLARGE**: 24 layers, hidden size of 1024, with a larger number of parameters.

4. **Input Representation**: BERT uses special tokens such as [CLS] for classification tasks and [SEP] to separate different segments of input.

5. **Pre-training and Fine-tuning**: The same architecture is used for both pre-training and fine-tuning, with all parameters being fine-tuned during the latter phase.

These components contribute to BERT's ability to effectively understand and generate language representations for various natural language processing tasks.

In [47]:
our_query = "What are the main contributions of the BERT model?"
answer = get_answers(our_query)
answer

The main contributions of the BERT model include:

1. **Bidirectional Pre-training**: BERT demonstrates the importance of bidirectional pre-training for language representations. Unlike previous models that used unidirectional language models, BERT employs masked language models to enable deep bidirectional representations.

2. **State-of-the-Art Performance**: BERT achieves new state-of-the-art results on eleven natural language processing tasks, including significant improvements in scores for benchmarks like GLUE, MultiNLI, and SQuAD.

3. **Flexibility for Various Tasks**: The model can be fine-tuned with just one additional output layer, allowing it to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

4. **Joint Conditioning on Context**: BERT is designed to pre-train representations by jointly conditioning on both left and right context in all layers, which enhances its understanding of language. 

These contributions highlight BERT's effectiveness and versatility in natural language processing.

In [50]:
from datasets import Dataset
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["question", "context"],
    template="Based on the following context:\n{context}\n\nQ: {question}\nA:"
)

# Initialize the language model
llm = ChatOpenAI(
    model_name="gpt-4o-mini",
    temperature=0,
    top_p=1
)

# Create the LLM chain
chain = prompt_template | llm | StrOutputParser()

# Define the questions, ground truths, and placeholders for answers and contexts
questions = [
    "What strategy does BERT use in the masked language model (MLM) task, and why?",
    "What are the advantages of using a feature-based approach with BERT compared to the fine-tuning approach?",
    "What is the major contribution of recent improvements in transfer learning with language models?",
]
ground_truths = [
    ["BERT replaces 15% of input tokens, using [MASK] 80% of the time, a random token 10%, and leaves the token unchanged 10% of the time. This strategy helps the model learn bidirectional representations while reducing the difference between pre-training and fine-tuning."],
    ["The feature-based approach has advantages such as allowing for task-specific model architectures when the Transformer encoder alone isn't suitable, and offering computational benefits by pre-computing expensive representations once. This makes it possible to run multiple experiments with simpler models on top of these pre-computed features."],
    ["The major contribution is extending the benefits of rich, unsupervised pre-training to deep bidirectional architectures. This advancement allows pre-trained models to effectively handle a wide range of NLP tasks, even in low-resource settings, by leveraging the strengths of deep bidirectional models."]
]
answers = []
contexts = []

# Inference
for query in questions:
    # Get relevant documents for the query
    relevant_docs = load_retriever.get_relevant_documents(query)
    # Join the documents' content to form the context
    context = "\n".join([doc.page_content for doc in relevant_docs])

    # Create input dictionary for the chain
    inputs = {"question": query, "context": context}

    # Run the chain and get the output
    output = chain.invoke(inputs)

    # Append the output and context
    answers.append(output)
    contexts.append(context)

# Prepare the data dictionary
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}

# Convert the dictionary to a Dataset
dataset = Dataset.from_dict(data)

  relevant_docs = load_retriever.get_relevant_documents(query)


In [54]:
from datasets import Dataset, Features, Sequence, Value

# Convert `contexts` to a list of strings if it isn't already
def convert_contexts_to_list_of_strings(dataset):
    # Ensure the contexts field is a list of strings
    def to_list_of_strings(x):
        if isinstance(x, str):
            return [x]  # Convert single string to a list containing one string
        return x  # If already a list, return it unchanged

    # Apply transformation
    new_data = dataset.map(lambda x: {"contexts": to_list_of_strings(x["contexts"])}, batched=False)

    # Define the correct features
    features = Features({
        "question": Value(dtype="string"),
        "answer": Value(dtype="string"),
        "contexts": Sequence(Value(dtype="string")),
        "ground_truths": Sequence(Value(dtype="string")),
    })

    # Re-create the dataset with correct features
    dataset = new_data.cast(features)

    return dataset

# Apply the conversion function
dataset = convert_contexts_to_list_of_strings(dataset)

Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/3 [00:00<?, ? examples/s]

In [55]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

# Evaluate the dataset using the specified metrics
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

# Convert the result to a pandas DataFrame
df = result.to_pandas()



Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[10]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[6]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()


In [56]:
df

Unnamed: 0,question,answer,contexts,ground_truths,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,What strategy does BERT use in the masked lang...,BERT uses a masking strategy in the masked lan...,[right-to-left language models to pre-train BE...,"[BERT replaces 15% of input tokens, using [MAS...","BERT replaces 15% of input tokens, using [MASK...",1.0,1.0,,
1,What are the advantages of using a feature-bas...,The advantages of using a feature-based approa...,[all parameters are jointly fine-tuned on a do...,[The feature-based approach has advantages suc...,The feature-based approach has advantages such...,1.0,1.0,,1.0
2,What is the major contribution of recent impro...,The major contribution of recent improvements ...,[the classification layer.\r\nResults are pres...,[The major contribution is extending the benef...,The major contribution is extending the benefi...,1.0,1.0,,1.0


## Project 2: Generating PDF Document Summaries

In this project, you will explore various methods for creating summaries from the provided PDF document. You will experiment with different chaining functions offered by the Langchain library to achieve this.

### **Project Steps:**
- **1.PDF Document Upload and Chunking:** As in the first project, upload the PDF document and divide it into smaller chunks. Consider splitting it by half-page or page.

- **2.Summarization Techniques:**

  - **Summary of the First 5 Pages (Stuff Chain):** Utilize the load_summarize_chain function with the parameter chain_type="stuff" to generate a concise summary of the first 5 pages of the PDF document.

  - **Short Summary of the Entire Document (Map Reduce Chain):** Employ chain_type="map_reduce" and refine parameters to create a brief summary of the entire document. This method generates individual summaries for each chunk and then combines them into a final summary.

  - **Detailed Summary with Bullet Points (Map Reduce Chain):** Use chain_type="map_reduce" to generate a detailed summary with at least 1000 tokens. Provide the LLM with the prompt "Summarize with 1000 tokens" and set the max_token parameter to a value greater than 1000. Add a title to the summary and present key points using bullet points.

### Important Notes:

- Models like GPT-4 and Gemini Pro models might excel in generating summaries based on token count. Consider prioritizing these models.

- For comprehensive information on Langchain and LLMs, refer to their respective documentation.
Best of luck!

### Install Libraries

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

### Loading PDF Document

In [None]:
pdf=read_doc('/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf')



In [None]:
pdf[0] #first page

Document(metadata={'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf', 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\r\nMinneapolis, Minnesota, June 2 - June 7, 2019. \rc 2019 Association for Computational Linguistics\r\n4171\r\nBERT: Pre-training of Deep Bidirectional Transformers for\r\nLanguage Understanding\r\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\r\nGoogle AI Language\r\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\r\nAbstract\r\nWe introduce a new language representa\x02tion model called BERT, which stands for\r\nBidirectional Encoder Representations from\r\nTransformers. Unlike recent language repre\x02sentation models (Peters et al., 2018a; Rad\x02ford et al., 2018), BERT is designed to pre\x02train deep bidirectional representations from\r\nunlabeled text by jointly conditioning on both\r\nleft and right context in all layers. As a re\x02sult, the pre-trained BERT model can be fine\x02tuned with just one additio

### Summarizing the First 5 Pages of The Document With Chain_Type of The 'stuff'

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import load_summarize_chain
from langchain.llms import OpenAI

In [None]:
pdf[:5] # first five page

[Document(metadata={'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf', 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\r\nMinneapolis, Minnesota, June 2 - June 7, 2019. \rc 2019 Association for Computational Linguistics\r\n4171\r\nBERT: Pre-training of Deep Bidirectional Transformers for\r\nLanguage Understanding\r\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\r\nGoogle AI Language\r\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\r\nAbstract\r\nWe introduce a new language representa\x02tion model called BERT, which stands for\r\nBidirectional Encoder Representations from\r\nTransformers. Unlike recent language repre\x02sentation models (Peters et al., 2018a; Rad\x02ford et al., 2018), BERT is designed to pre\x02train deep bidirectional representations from\r\nunlabeled text by jointly conditioning on both\r\nleft and right context in all layers. As a re\x02sult, the pre-trained BERT model can be fine\x02tuned with just one additi

### Document Splitter

In [None]:
pdf_doc=chunk_data(docs=pdf)
len(pdf_doc)

84

In [None]:
#Belgeyi parçalara ayırın
#text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=0)
#docs = text_splitter.split_documents(pdf)

In [None]:
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0,
                 model_name='gpt-4o-mini',
                 max_tokens=1024)

In [None]:
chain = load_summarize_chain(
    llm,
    chain_type='stuff'
)
output_summary = chain.invoke(pdf_doc[0:5])['output_text']

In [None]:
from IPython.display import Markdown
Markdown(output_summary)

The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. Unlike previous models that use unidirectional context, BERT pre-trains deep bidirectional representations from unlabeled text, allowing it to consider both left and right context in all layers. This design enables BERT to be fine-tuned with minimal additional architecture for various natural language processing tasks, achieving state-of-the-art results on eleven benchmarks, including significant improvements in GLUE, MultiNLI, and SQuAD. BERT employs a masked language model pre-training objective, enhancing its effectiveness for both sentence-level and token-level tasks.

### Make A Brief Summary of The Entire Document With Chain_Types of "map_reduce" and "refine"

##**chain_type = map_reduce**

In [None]:
from langchain.chains.summarize import load_summarize_chain
import textwrap

llm = ChatOpenAI(temperature=0,
                 model_name='gpt-4o-mini',
                 max_tokens=1024)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=0)
chunks = text_splitter.split_documents(pdf)

In [None]:
len(chunks)

16

In [None]:
chunks[0]

Document(metadata={'source': '/content/drive/MyDrive/Rag_Chatbot/N19-1423.pdf', 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\r\nMinneapolis, Minnesota, June 2 - June 7, 2019. \rc 2019 Association for Computational Linguistics\r\n4171\r\nBERT: Pre-training of Deep Bidirectional Transformers for\r\nLanguage Understanding\r\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\r\nGoogle AI Language\r\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\r\nAbstract\r\nWe introduce a new language representa\x02tion model called BERT, which stands for\r\nBidirectional Encoder Representations from\r\nTransformers. Unlike recent language repre\x02sentation models (Peters et al., 2018a; Rad\x02ford et al., 2018), BERT is designed to pre\x02train deep bidirectional representations from\r\nunlabeled text by jointly conditioning on both\r\nleft and right context in all layers. As a re\x02sult, the pre-trained BERT model can be fine\x02tuned with just one additio

In [None]:
%%time
chain = load_summarize_chain(llm,
                             chain_type="map_reduce")


output_summary = chain.invoke(chunks)["output_text"]
Markdown(output_summary)

CPU times: user 2.17 s, sys: 243 ms, total: 2.41 s
Wall time: 5min 22s


The paper presents BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language representation model developed by Google AI Language. Unlike previous unidirectional models, BERT utilizes deep bidirectional representations by conditioning on both left and right contexts, allowing for effective pre-training on unlabeled text through tasks like Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). This architecture enables BERT to be fine-tuned with minimal modifications for various natural language processing (NLP) tasks, achieving state-of-the-art results on eleven benchmarks, including GLUE and SQuAD.

BERT's input representation accommodates both single and paired sentences, employing WordPiece embeddings and special tokens for classification and separation. The model's performance is significantly enhanced by its extensive pre-training on large corpora, and it demonstrates superior accuracy across multiple tasks, particularly in scenarios with limited training data. The study also explores the impact of model size and pre-training tasks on performance, revealing that larger models and effective pre-training strategies lead to substantial improvements.

Overall, BERT's innovative architecture and training methodology mark a significant advancement in NLP, with publicly available code and models facilitating further research and application in the field.

##**chain_type = refine**

In [None]:
%%time
chain = load_summarize_chain(llm,
                             chain_type="refine")

output_summary = chain.invoke(chunks)["output_text"]

CPU times: user 1.43 s, sys: 164 ms, total: 1.6 s
Wall time: 5min 17s


In [None]:
chain

RefineDocumentsChain(initial_llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['text'], template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7ef9421d3ac0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7ef9423b3fa0>, root_client=<openai.OpenAI object at 0x7ef9421de6e0>, root_async_client=<openai.AsyncOpenAI object at 0x7ef9421d3b50>, model_name='gpt-4o-mini', temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='', max_tokens=1024)), refine_llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['existing_answer', 'text'], template="Your job is to produce a final summary.\nWe have provided an existing summary up to a certain point: {existing_answer}\nWe have the opportunity to refine the existing summary (only if needed) with some more context below.\n------------\n{text}\n------------\nGiven the new context, re

In [None]:
Markdown(output_summary)

The original summary is comprehensive and effectively captures the key aspects of the BERT model, its architecture, training methodology, and performance across various natural language processing tasks. The additional context provided elaborates on specific ablation studies regarding the number of training steps and different masking procedures used during pre-training. These studies indicate that BERT benefits from extensive pre-training, with significant accuracy improvements observed when increasing the number of training steps. Furthermore, the analysis of masking strategies reveals that while fine-tuning is robust to various masking approaches, the feature-based method is more sensitive to the choice of masking, highlighting the importance of the mixed strategy employed by BERT.

However, these details do not significantly enhance the existing summary of BERT. Therefore, the original summary remains relevant and complete as it stands.

**Final Summary:**
The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. Unlike previous models that use unidirectional context, BERT pre-trains deep bidirectional representations from unlabeled text, allowing it to consider both left and right context in all layers. This design enables BERT to be fine-tuned with minimal additional architecture for various natural language processing tasks, achieving state-of-the-art results on eleven benchmarks, including significant improvements in GLUE, MultiNLI, and SQuAD tasks.

BERT employs a masked language model (MLM) pre-training objective, which enhances its effectiveness for both sentence-level and token-level tasks. The model randomly masks a percentage of input tokens and predicts them, allowing for a deep bidirectional representation. Additionally, BERT incorporates a "next sentence prediction" (NSP) task that trains the model to understand the relationship between sentence pairs, which is crucial for tasks like Question Answering (QA) and Natural Language Inference (NLI). The NSP task is designed to transfer all parameters to initialize end-task model parameters, unlike prior work that only transferred sentence embeddings.

To handle various downstream tasks, BERT's input representation can unambiguously represent both single sentences and pairs of sentences. It uses WordPiece embeddings with a 30,000 token vocabulary, where the first token of every sequence is a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are differentiated using a special token ([SEP]) and learned embeddings indicating sentence membership. The input embeddings are the sum of token embeddings, segment embeddings, and position embeddings.

BERT's architecture is a multi-layer bidirectional Transformer encoder, with two primary model sizes: BERTBASE and BERTLARGE. The model reduces the need for heavily-engineered task-specific architectures, marking it as the first fine-tuning based representation model to achieve state-of-the-art performance across a wide range of NLP tasks. Fine-tuning BERT is straightforward and can be accomplished in a relatively short time, with results replicable in about an hour on a single Cloud TPU or a few hours on a GPU.

In terms of performance, BERTBASE and BERTLARGE significantly outperform previous state-of-the-art models across various tasks, achieving average accuracy improvements of 4.5% and 7.0%, respectively, on the GLUE benchmark. For instance, BERTLARGE achieved an accuracy of 86.7% on the MNLI task, surpassing OpenAI GPT's 82.1%. In the SQuAD v1.1 question answering task, BERT demonstrated superior performance, with BERTLARGE (Single) achieving an F1 score of 90.9%, and an ensemble version reaching 91.8%, outperforming top leaderboard systems by notable margins. In SQuAD v2.0, BERT also showed a +5.1 F1 improvement over the previous best system, demonstrating its robustness in handling more complex question-answering scenarios.

Ablation studies highlighted the importance of the NSP task, showing that removing it significantly degrades performance on tasks like QNLI, MNLI, and SQuAD. Additionally, the impact of model size was explored, revealing that larger models consistently lead to accuracy improvements across various tasks, even those with limited training data. BERTBASE contains 110M parameters, while BERTLARGE contains 340M parameters, showcasing the effectiveness of scaling model size for enhanced performance.

The paper also discusses the potential of both fine-tuning and feature-based approaches with BERT. While the fine-tuning approach, where a simple classification layer is added to the pre-trained model, has shown significant success, the feature-based approach allows for the extraction of fixed features from the pre-trained model, which can be beneficial for certain tasks and computationally efficient. Experiments on the CoNLL-2003 Named Entity Recognition (NER) task demonstrated that BERT can perform competitively using both approaches, with the feature-based method achieving results close to those of fine-tuning.

Overall, BERT's architecture and training methodology represent a significant advancement in the field of natural language processing, enabling effective transfer learning and improved

### Generate A Detailed Summary of The Entire Document With At Least 1000 Tokens. Also, Add A Title To The Summary And Present Key Points Using Bullet Points With Chain_Type of "map_reduce".

##**map_reduce with custom prompt**

In [None]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce'
)
chain

MapReduceDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['text'], template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7ef9421d3ac0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7ef9423b3fa0>, root_client=<openai.OpenAI object at 0x7ef9421de6e0>, root_async_client=<openai.AsyncOpenAI object at 0x7ef9421d3b50>, model_name='gpt-4o-mini', temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='', max_tokens=1024)), reduce_documents_chain=ReduceDocumentsChain(combine_documents_chain=StuffDocumentsChain(llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['text'], template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7ef9421d3ac0>, async_client=<openai.resources.chat.completions.As

In [None]:
# prompt for every chunk
from langchain import PromptTemplate

# Map prompt (her parça için kullanılacak prompt)
chunks_prompt = """
Please summarize the following text:
{text}
"""
map_prompt_template = PromptTemplate(input_variables=['text'], template=chunks_prompt)

In [None]:
# Combine prompt (bütün parçalar birleştirildiğinde kullanılacak prompt)
final_combine_prompt = """
Summarize the text below with at least 1000 tokens. Provide a title and key points using bullet points:
{text}
"""
final_combine_prompt_template = PromptTemplate(input_variables=['text'], template=final_combine_prompt)

In [None]:
# 3. Map-Reduce zinciri oluşturma
chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=final_combine_prompt_template
)

In [None]:
output_summary = chain.invoke(chunks)["output_text"]
output_summary

'# Title: Understanding BERT: A Comprehensive Overview of Bidirectional Encoder Representations from Transformers\n\n## Key Points:\n\n- **Introduction to BERT**: \n  - BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking language representation model developed by Google AI Language.\n  - It is designed to pre-train deep bidirectional representations from unlabeled text, taking into account both left and right context in all layers, which enhances its understanding of language.\n\n- **Advancements Over Previous Models**:\n  - Traditional language models often utilized unidirectional approaches, which limited their contextual understanding and effectiveness for various tasks.\n  - BERT introduces a "masked language model" (MLM) pre-training objective, which randomly masks tokens in the input text and trains the model to predict these masked tokens, thereby improving its contextual comprehension.\n\n- **Key Contributions of BERT**:\n  - **Bi

In [None]:
from IPython.display import Markdown

Markdown(output_summary)

# Title: Understanding BERT: A Comprehensive Overview of Bidirectional Encoder Representations from Transformers

## Key Points:

- **Introduction to BERT**: 
  - BERT, which stands for Bidirectional Encoder Representations from Transformers, is a groundbreaking language representation model developed by Google AI Language.
  - It is designed to pre-train deep bidirectional representations from unlabeled text, taking into account both left and right context in all layers, which enhances its understanding of language.

- **Advancements Over Previous Models**:
  - Traditional language models often utilized unidirectional approaches, which limited their contextual understanding and effectiveness for various tasks.
  - BERT introduces a "masked language model" (MLM) pre-training objective, which randomly masks tokens in the input text and trains the model to predict these masked tokens, thereby improving its contextual comprehension.

- **Key Contributions of BERT**:
  - **Bidirectional Pre-training**: The MLM approach allows BERT to develop deep bidirectional representations, a significant improvement over previous unidirectional models.
  - **Reduction of Task-Specific Architectures**: BERT reduces the necessity for complex, task-specific architectures, achieving state-of-the-art performance across numerous NLP tasks through a fine-tuning process.
  - **Performance Improvement**: BERT has set new benchmarks for eleven NLP tasks, with its code and pre-trained models made publicly available for further research and application.

- **Evolution of Language Representation Models**:
  - The text reviews the evolution of language representation models, highlighting the shift from traditional word embeddings to more advanced contextual models like ELMo and GPT, which paved the way for BERT's development.

- **BERT's Architecture and Training Process**:
  - BERT's training process consists of two main phases: pre-training on unlabeled data and fine-tuning on labeled data for specific downstream tasks.
  - The architecture is based on a multi-layer bidirectional Transformer encoder, which allows it to process input in both directions.
  - BERT is available in two sizes: BERTBASE and BERTLARGE, each with different configurations and total parameters.

- **Input Representation**:
  - BERT can process both single sentences and pairs of sentences in a unified token sequence, utilizing WordPiece embeddings with a vocabulary of 30,000 tokens.
  - Each input sequence begins with a special classification token ([CLS]), and sentence pairs are separated using a special token ([SEP]) along with learned embeddings.

- **Pre-training Tasks**:
  - **Masked Language Model (MLM)**: This task involves randomly masking a percentage of input tokens and training the model to predict them, which facilitates deep bidirectional representation.
  - **Next Sentence Prediction (NSP)**: This task trains the model to understand the relationship between sentences by predicting whether one sentence follows another.

- **Fine-tuning Process**:
  - Fine-tuning BERT for various downstream tasks is efficient due to its self-attention mechanism, which allows it to handle both single texts and text pairs seamlessly.
  - The fine-tuning process involves adapting the model to specific tasks by modifying the input and output layers.

- **Performance on Benchmarks**:
  - BERT models have significantly outperformed previous state-of-the-art systems on benchmarks such as the GLUE benchmark and the Stanford Question Answering Dataset (SQuAD v1.1).
  - BERTLARGE achieved an average accuracy improvement of 7.0% over prior models on GLUE tasks.
  - In SQuAD v1.1, BERT fine-tuned to predict answer spans from passages based on questions demonstrated superior results compared to leading systems.

- **SQuAD and SWAG Results**:
  - BERTLARGE achieved an F1 score of 87.4 on the SQuAD 1.1 development set and 93.2 on the test set.
  - In SQuAD 2.0, BERTLARGE achieved an F1 score of 80.0, indicating significant improvement over previous systems.
  - BERTLARGE also excelled in the SWAG dataset, achieving an accuracy of 86.3.

- **Ablation Studies**:
  - Ongoing research includes ablation studies to understand the contributions of different components of the BERT model.
  - Removing the NSP task significantly degrades performance, particularly on tasks like QNLI, MNLI, and SQuAD.
  - Increasing the model size consistently yields better performance across various datasets, even for tasks with limited training data.

- **Feature-Based vs. Fine-Tuning Approaches**:
  - The text compares the effectiveness of fine-tuning all parameters versus using fixed features extracted from the pre-trained model for tasks like Named Entity Recognition (NER).
  - The feature-based method offers computational advantages and flexibility, demonstrating BERT's effectiveness in both fine-tuning and feature extraction.

- **Conclusion**:
  - The findings emphasize the importance of rich, unsupervised pre-training in enhancing language understanding systems

___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___