# WELCOME

This notebook guides you through two increasingly important Generative AI applications: RAG (Retrieval Augmented Generation) chatbots and summarization of long documents.

Through two distinct projects, you will explore these technologies and enhance your skills. Detailed descriptions of the projects are provided below.

In [None]:
## Project 2: Generating PDF Document Summaries

In this project, you will explore several ways of summarizing the provided PDF document, experimenting with the different summarization chain types offered by the LangChain library.

### **Project Steps:**
- **1. PDF Document Upload and Chunking:** As in the first project, upload the PDF document and divide it into smaller chunks. Consider splitting it by half-page or page.

- **2. Summarization Techniques:**

  - **Summary of the First 5 Pages (Stuff Chain):** Utilize the load_summarize_chain function with the parameter chain_type="stuff" to generate a concise summary of the first 5 pages of the PDF document.

  - **Short Summary of the Entire Document (Map Reduce Chain):** Employ chain_type="map_reduce" and refine parameters to create a brief summary of the entire document. This method generates individual summaries for each chunk and then combines them into a final summary.

  - **Detailed Summary with Bullet Points (Map Reduce Chain):** Use chain_type="map_reduce" to generate a detailed summary of at least 1000 tokens. Provide the LLM with the prompt "Summarize with 1000 tokens" and set the max_tokens parameter to a value greater than 1000. Add a title to the summary and present key points using bullet points.
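
The title-and-bullets requirement of the last step can be checked mechanically once a summary exists. The sketch below is one possible check and not part of LangChain: `check_summary_format` is a hypothetical helper, and it assumes the title is a single `**bold**` first line and that every bullet starts with `- `.

```python
def check_summary_format(markdown_text: str) -> dict:
    """Report whether a summary starts with a bold title and continues with bullets."""
    # Ignore blank lines and indentation so nested bullets are counted too.
    lines = [ln.strip() for ln in markdown_text.splitlines() if ln.strip()]
    if not lines:
        return {"has_bold_title": False, "bullet_count": 0, "only_bullets": False}

    title, body = lines[0], lines[1:]
    has_bold_title = len(title) > 4 and title.startswith("**") and title.endswith("**")
    bullets = [ln for ln in body if ln.startswith("- ")]

    return {
        "has_bold_title": has_bold_title,
        "bullet_count": len(bullets),
        # True only when there is a body and every body line is a bullet.
        "only_bullets": bool(body) and len(bullets) == len(body),
    }
```

Calling it on the final `output_summary` string gives a quick signal on whether the combine prompt was followed, without reading the whole text.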

### Important Notes:

- Models such as GPT-4o and Gemini Pro may follow token-count instructions more reliably when generating summaries. Consider prioritizing them.

- For comprehensive information on LangChain and LLMs, refer to their respective documentation.

Best of luck!

In [None]:
# 1 - first 5 pages with the stuff chain
# 2 - entire document with map_reduce
# 3 - entire document with map_reduce: "Summarize with 1000 tokens", max_tokens > 1000, plus title and bullet points

### Install Libraries

In [65]:
!pip install -qU langchain-community
!pip install -qU langchain-openai

!pip install -qU pypdfium2       # handling and parsing PDF files
# The LLM app (as shown in the diagram) ingests a common and important type of proprietary data: PDF documents.



In [66]:
# Read the OpenAI API key from the Colab secrets and expose it as an environment variable.
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
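
`google.colab.userdata` is only available inside Colab. Outside Colab, one possible fallback is to read the key from the environment and ask for it interactively only when it is missing. This is a minimal sketch: `load_api_key` is a hypothetical helper, not part of any library used in this notebook.

```python
import os
from getpass import getpass


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed value so the key does not end up in the cell output.
        key = getpass(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

Calling `load_api_key()` once before building the LLM leaves `OPENAI_API_KEY` set for the rest of the session.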

### Loading PDF Document

In [67]:
# retrieving unstructured data (text from a PDF) and preparing it for the LLM to process.

from langchain_community.document_loaders import PyPDFium2Loader

def read_doc(file_path):
    """Load a PDF file and return one Document per page."""
    file_loader = PyPDFium2Loader(file_path)
    pdf_documents = file_loader.load()
    return pdf_documents

In [68]:

pdf = read_doc('/content/drive/MyDrive/nlp/bert_article.pdf')

len(pdf)



16
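
The loader returns one `Document` per page (16 here), so the half-page option mentioned in the project steps only needs one cut per page. A minimal sketch of that idea, working on plain strings: `split_half_page` is a hypothetical helper, not a LangChain function, and it assumes that cutting at the line break nearest the middle of the page is acceptable.

```python
def split_half_page(page_text: str) -> list[str]:
    """Split one page into two halves at the line break closest to the middle."""
    lines = page_text.splitlines(keepends=True)
    if len(lines) < 2:
        # Nothing to cut: return the page as-is (or nothing for an empty page).
        return [page_text] if page_text else []

    midpoint = len(page_text) / 2
    best_cut, best_distance = 1, float("inf")
    offset = 0
    # Try every line boundary except the end of the page.
    for index, line in enumerate(lines[:-1], start=1):
        offset += len(line)
        distance = abs(offset - midpoint)
        if distance < best_distance:
            best_cut, best_distance = index, distance

    return ["".join(lines[:best_cut]), "".join(lines[best_cut:])]
```

Applied to the loaded pages, this could be used as `halves = [h for d in pdf for h in split_half_page(d.page_content)]`.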

In [69]:
pdf[-1:]  # last page

[Document(metadata={'producer': 'pdfTeX-1.40.18', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-04-29T17:36:03+00:00', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding', 'author': 'Jacob Devlin ; Ming-Wei Chang ; Kenton Lee ; Kristina Toutanova', 'subject': 'N19-1 2019', 'keywords': '', 'moddate': '2019-04-29T17:36:03+00:00', 'source': '/content/drive/MyDrive/nlp/bert_article.pdf', 'total_pages': 16, 'page': 15}, page_content='4186\nC Additional Ablation Studies\nC.1 Effect of Number of Training Steps\nFigure 5 presents MNLI Dev accuracy after fine\x02tuning from a checkpoint that has been pre-trained\nfor k steps. This allows us to answer the following\nquestions:\n1. Question: Does BERT really need such\na large amount of pre-training (128,000\nwords/batch * 1,000,000 steps) to achieve\nhigh fine-tuning accuracy?\nAnswer: Yes, BERTBASE achieves almost\n1.0% additional accuracy on MNLI when\ntrained on 1M steps compared to 50

### Summarizing the First 5 Pages of the Document With chain_type "stuff"

In [70]:
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

# Set up the LLM
llm = ChatOpenAI(
    temperature=0,              # deterministic output
    model_name='gpt-4o-mini',   # compact and powerful
    max_tokens=2048             # control summary length
)

# Create the summarization chain using "stuff"
chain_stuff = load_summarize_chain(llm, chain_type="stuff")

# Select only the first 5 pages of the document
first_5_pages = pdf[:5]

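from IPython.display import Markdown, display

try:
    summary_stuff = chain_stuff.invoke(first_5_pages)  # summarise only the first 5 pages
    # Inside a try block a bare expression is not rendered, so call display() explicitly
    display(Markdown("**Summary of the First 5 Pages (Stuff Chain):**"))
    display(Markdown(summary_stuff["output_text"]))
except Exception as e:
    print(f"An error occurred: {e}")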

In [71]:
summary_stuff


{'input_documents': [Document(metadata={'producer': 'pdfTeX-1.40.18', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-04-29T17:36:03+00:00', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding', 'author': 'Jacob Devlin ; Ming-Wei Chang ; Kenton Lee ; Kristina Toutanova', 'subject': 'N19-1 2019', 'keywords': '', 'moddate': '2019-04-29T17:36:03+00:00', 'source': '/content/drive/MyDrive/nlp/bert_article.pdf', 'total_pages': 16, 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\nMinneapolis, Minnesota, June 2 - June 7, 2019. \nc 2019 Association for Computational Linguistics\n4171\nBERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa\x02tion model called BERT, which stands for\nBidirectional Encoder Rep

In [72]:
summary_stuff["output_text"]

"The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. BERT pre-trains deep bidirectional representations from unlabeled text by jointly considering both left and right contexts, overcoming limitations of previous unidirectional models. It employs a masked language model (MLM) and a next sentence prediction (NSP) task during pre-training, allowing it to achieve state-of-the-art results on eleven natural language processing tasks, including question answering and language inference. BERT's architecture is unified across tasks, requiring minimal task-specific modifications during fine-tuning. The model demonstrates significant improvements over existing approaches, achieving notable performance metrics on benchmarks like GLUE and SQuAD. The code and pre-trained models are publicly available for further research and application."

In [73]:

# The output is a dictionary; extract the final summary and render it as Markdown
from IPython.display import Markdown

Markdown(summary_stuff["output_text"])

The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. BERT pre-trains deep bidirectional representations from unlabeled text by jointly considering both left and right contexts, overcoming limitations of previous unidirectional models. It employs a masked language model (MLM) and a next sentence prediction (NSP) task during pre-training, allowing it to achieve state-of-the-art results on eleven natural language processing tasks, including question answering and language inference. BERT's architecture is unified across tasks, requiring minimal task-specific modifications during fine-tuning. The model demonstrates significant improvements over existing approaches, achieving notable performance metrics on benchmarks like GLUE and SQuAD. The code and pre-trained models are publicly available for further research and application.
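
The stuff chain places all selected pages in a single prompt, so the pages plus the reply have to fit in the model's context window. A rough pre-check could look like the sketch below. It assumes about 4 characters per token, which is only a rule of thumb for English text, and a 128,000-token window for gpt-4o-mini; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and a real tokenizer would give exact counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(page_texts, context_window: int = 128_000,
                    reserved_for_output: int = 2_048) -> bool:
    """True if all pages, stuffed into one prompt, still leave room for the reply."""
    prompt_tokens = sum(estimate_tokens(text) for text in page_texts)
    return prompt_tokens + reserved_for_output <= context_window
```

For the cell above, the check would be `fits_in_context([d.page_content for d in first_5_pages])`. When it fails, a chunk-based chain type is the safer choice.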

In [74]:
# Inspect the chain to see the default instruction (prompt template) that is sent to the language model.

chain_stuff

StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['text'], input_types={}, partial_variables={}, template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x7e260c6cb7a0>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x7e260c6c9310>, root_client=<openai.OpenAI object at 0x7e260d54c650>, root_async_client=<openai.AsyncOpenAI object at 0x7e260c6c8fb0>, model_name='gpt-4o-mini', temperature=0.0, model_kwargs={}, openai_api_key=SecretStr('**********'), max_tokens=2048), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=PromptTemplate(input_variables=['page_content'], input_types={}, partial_variables={}, template='{page_content}'), document_variable_name='text')
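
The repr above lists the pieces the stuff chain works with: a document prompt (`{page_content}`), a variable name (`text`), and the summary template. Conceptually, the chain joins the page texts and substitutes them for `{text}`. The sketch below imitates that with plain string formatting. It is an illustration, not LangChain's implementation: `build_stuff_prompt` is a hypothetical helper, and the blank-line separator between pages is an assumption.

```python
DEFAULT_TEMPLATE = (
    'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
)


def build_stuff_prompt(page_texts, template: str = DEFAULT_TEMPLATE,
                       separator: str = "\n\n") -> str:
    """Join all page texts and insert them where the template says {text}."""
    joined = separator.join(page_texts)
    return template.format(text=joined)
```

Printing `build_stuff_prompt([d.page_content for d in pdf[:5]])` shows approximately what the model receives, which makes it easier to judge how a change to the template will affect the request.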

In [144]:
# Let's use this when summarizing the first five pages, so the result is a bit better.

chain_stuff.llm_chain.prompt.template  # this is the chain's default prompt template








'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [145]:
# Replace the default prompt with a custom one that gives the LLM specific instructions.

# By stating a target length (about 750 tokens in the prompt below), we give the model explicit guidance about the desired length and style of the summary.
In [146]:
#chain_stuff.llm_chain.prompt.template = """Write a summary in 750 tokens of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:"""

# Refine the summarisation prompt for scientific text
chain_stuff.llm_chain.prompt.template = """Write a summary of the following text in about 750 tokens.
- Focus on the main contributions, methods, datasets, and results.
- Do not introduce information not present in the text.
- Keep the language objective and concise.
- Preserve key technical terms and abbreviations.

Text:
"{text}"

CONCISE SUMMARY:"""

In [147]:
# prompt.template has changed now.

chain_stuff

StuffDocumentsChain(verbose=False, llm_chain=LLMChain(verbose=False, prompt=PromptTemplate(input_variables=['text'], input_types={}, partial_variables={}, template='Write a summary of the following text in about 750 tokens.\n- Focus on the main contributions, methods, datasets, and results.\n- Do not introduce information not present in the text.\n- Keep the language objective and concise.\n- Preserve key technical terms and abbreviations.\n\nText:\n"{text}"\n\nCONCISE SUMMARY:'), llm=ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x7afd9f3aacf0>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x7afd9a4a6c90>, root_client=<openai.OpenAI object at 0x7afd9fc363c0>, root_async_client=<openai.AsyncOpenAI object at 0x7afdaf3ff470>, model_name='gpt-4o-mini', temperature=0.0, model_kwargs={}, openai_api_key=SecretStr('**********'), max_tokens=2048), output_parser=StrOutputParser(), llm_kwargs={}), document_prompt=

In [148]:
chain_stuff.llm_chain.prompt.template

'Write a summary of the following text in about 750 tokens.\n- Focus on the main contributions, methods, datasets, and results.\n- Do not introduce information not present in the text.\n- Keep the language objective and concise.\n- Preserve key technical terms and abbreviations.\n\nText:\n"{text}"\n\nCONCISE SUMMARY:'

In [33]:
# To see the effect of this change, we need to run the chain again with the new prompt.

In [149]:
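# Rebuild the LLM with a smaller output budget and rerun the stuff chain.
# Note: this call passes the whole document (`pdf`), not only `first_5_pages`.
llm = ChatOpenAI(temperature=0,
                 model_name='gpt-4o-mini',
                 max_tokens=1024)

chain_stuff = load_summarize_chain(
                      llm,
                      chain_type='stuff'
                      )

summary_stuff = chain_stuff.invoke(pdf)['output_text']  # keep only the summary string
summary_stuff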



'The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. BERT\'s primary innovation lies in its ability to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context, which contrasts with previous models that utilized unidirectional approaches. This bidirectionality allows BERT to achieve state-of-the-art performance across a variety of natural language processing (NLP) tasks with minimal task-specific architecture modifications.\n\nBERT employs two main pre-training tasks: a masked language model (MLM) and next sentence prediction (NSP). The MLM randomly masks a percentage of input tokens and trains the model to predict these masked tokens based solely on their context, enabling the model to learn bidirectional representations. The NSP task invo

In [39]:
# summary_stuff already holds the summary string (extracted above); render it as Markdown

Markdown(summary_stuff)

The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" introduces BERT (Bidirectional Encoder Representations from Transformers), a novel language representation model developed by Google AI Language. BERT's primary innovation lies in its ability to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right contexts across all layers. This approach allows BERT to achieve state-of-the-art performance on various natural language processing (NLP) tasks with minimal task-specific architecture modifications.

### Main Contributions:
1. **Bidirectional Pre-training**: BERT utilizes a masked language model (MLM) pre-training objective, which randomly masks tokens in the input and predicts them based on their context, enabling deep bidirectional representations. This contrasts with previous models that employed unidirectional training.
2. **Reduction of Task-Specific Architecture**: BERT demonstrates that pre-trained representations can significantly reduce the need for complex task-specific architectures, achieving state-of-the-art results across eleven NLP tasks.
3. **Performance Improvements**: BERT sets new records on multiple benchmarks, including achieving an 80.5% score on the GLUE benchmark, 86.7% accuracy on MultiNLI, and 93.2 F1 on SQuAD v1.1.

### Methods:
BERT's framework consists of two main steps: pre-training and fine-tuning. During pre-training, BERT is trained on unlabeled data using two tasks:
- **Masked Language Model (MLM)**: Randomly masks 15% of the input tokens and predicts them based on their context.
- **Next Sentence Prediction (NSP)**: Trains the model to predict whether a second sentence follows the first in the text.

The model architecture is based on a multi-layer bidirectional Transformer encoder, with two primary configurations: BERTBASE (12 layers, 110M parameters) and BERTLARGE (24 layers, 340M parameters).

### Datasets:
BERT is pre-trained on a large corpus comprising BooksCorpus (800M words) and English Wikipedia (2,500M words). For fine-tuning, BERT is evaluated on various NLP tasks, including:
- GLUE benchmark tasks (e.g., MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE)
- SQuAD v1.1 and v2.0 for question answering
- SWAG for grounded common-sense inference

### Results:
BERT achieves significant improvements over previous state-of-the-art models:
- **GLUE Benchmark**: BERTBASE and BERTLARGE outperform all prior systems, with BERTLARGE achieving an average accuracy of 82.1%.
- **SQuAD v1.1**: BERTLARGE achieves a test F1 score of 93.2, surpassing previous top systems.
- **SQuAD v2.0**: BERTLARGE achieves a test F1 score of 83.1, marking a 5.1 point improvement over the previous best.
- **SWAG**: BERTLARGE achieves an accuracy of 86.6, outperforming existing models by a substantial margin.

### Conclusion:
BERT's introduction of bidirectional pre-training and its effective use of the MLM and NSP tasks represent a significant advancement in the field of NLP. The model's ability to generalize across various tasks with minimal architecture changes highlights the potential of deep bidirectional architectures in language understanding. The code and pre-trained models are publicly available, facilitating further research and application in the NLP community.

### Document Splitter

In [75]:


from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# for gpt-4o-mini 128,000-token context window ... text in each "map" step.

# Create a text splitter that tries a list of separators in order ("\n\n", then "\n", then " "), recursively, so that related sentences and paragraphs stay together.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000,
                                               chunk_overlap=0,  # ensures there is no overlapping text between the chunks.
                                               )

# Split the loaded pages into chunks
chunks = text_splitter.split_documents(pdf)

# The chunks variable now contains a list of these smaller document pieces.

In [76]:
len(chunks)

25
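
The recursive idea behind the splitter can be sketched in a few lines: try the coarsest separator first, pack the pieces into chunks of at most `chunk_size` characters, and fall back to finer separators only for pieces that are still too long. The function below is a simplified illustration and not LangChain's implementation. `recursive_split` is a hypothetical helper; it ignores `chunk_overlap` and does not keep every separator.

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Simplified recursive character splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate          # the part still fits into the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if len(part) <= chunk_size:
            current = part               # the part starts a new chunk
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_size=12`, the text `"alpha beta\n\ngamma delta epsilon\n\nzeta"` is first cut at the blank lines, and only the middle paragraph is split further at spaces.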

In [77]:
chunks[0]

Document(metadata={'producer': 'pdfTeX-1.40.18', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-04-29T17:36:03+00:00', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding', 'author': 'Jacob Devlin ; Ming-Wei Chang ; Kenton Lee ; Kristina Toutanova', 'subject': 'N19-1 2019', 'keywords': '', 'moddate': '2019-04-29T17:36:03+00:00', 'source': '/content/drive/MyDrive/nlp/bert_article.pdf', 'total_pages': 16, 'page': 0}, page_content='Proceedings of NAACL-HLT 2019, pages 4171–4186\nMinneapolis, Minnesota, June 2 - June 7, 2019. \nc 2019 Association for Computational Linguistics\n4171\nBERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa\x02tion model called BERT, which stands for\nBidirectional Encoder Representations from\nTr

"The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a revolutionary language representation model by Google AI Language that employs a deep bidirectional architecture for pre-training on unlabeled text. This approach allows BERT to effectively fine-tune for various NLP tasks, achieving state-of-the-art results on eleven benchmarks, including GLUE and SQuAD, and surpassing previous models like OpenAI GPT. BERT's pre-training involves Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), enhancing its contextual understanding. The model comes in two sizes, BERTBASE and BERTLARGE, and shows significant performance improvements, especially with limited training data. The paper also reviews advancements in NLP, comparing BERT with other models, detailing its pre-training and fine-tuning processes, and emphasizing the impact of masking strategies on performance. Overall, BERT marks a significant advancement in NLP, showcasing the benefits of bid



# Make a Brief Summary of the Entire Document With chain_type "map_reduce"

### When to Use

`map_reduce` → best for speed on large documents (hundreds of pages), when a high-level overview is enough.

`refine` → best for accuracy and continuity (academic papers, technical reports), when detail matters.
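
The continuity of `refine` comes from its sequential shape: every call sees the summary built so far plus one new chunk, so the calls cannot run independently the way map calls can. The sketch below shows that loop with a stand-in `llm` callable. The prompt wording and the `refine_summarize` helper are assumptions made for illustration, not LangChain's actual prompts or API.

```python
from typing import Callable, Sequence


def refine_summarize(chunks: Sequence[str], llm: Callable[[str], str]) -> str:
    """Build one running summary, updating it with each chunk in order."""
    if not chunks:
        return ""
    # The first chunk gets a plain summary request.
    summary = llm(f"Summarize the following text:\n\n{chunks[0]}")
    # Every later call depends on the previous result.
    for chunk in chunks[1:]:
        summary = llm(
            "Existing summary:\n" + summary
            + "\n\nRefine the summary using this additional text:\n" + chunk
        )
    return summary
```

Any function that maps a prompt string to a reply string can play the role of `llm` here, for example `lambda p: chat_model.invoke(p).content` for a chat model object.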

In [116]:

%%time
# Brief summary of the entire document with chain_type "map_reduce"
# (a cell magic such as %%time has to be the first line of the cell)

from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain



text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000,
                                               chunk_overlap=0,  # ensures there is no overlapping text between the chunks.
                                               )

# Split the loaded pages into chunks
chunks = text_splitter.split_documents(pdf)

# The chunks variable now contains a list of these smaller document pieces.

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=2500)

# Optional: stronger scientific prompts for the map and combine steps can be
# passed directly as the map_prompt= and combine_prompt= keyword arguments.
chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    return_intermediate_steps=True,  # helpful to inspect per-chunk outputs
    # map_prompt=map_prompt, combine_prompt=combine_prompt,
)

result = chain.invoke(chunks)
final_summary = result["output_text"]
intermediate = result["intermediate_steps"]  # if you want to inspect/debug

final_summary

CPU times: user 101 ms, sys: 5.72 ms, total: 107 ms
Wall time: 35.3 s


"The paper presents BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language representation model developed by Google AI Language. Unlike previous unidirectional models, BERT utilizes a masked language model (MLM) and next sentence prediction (NSP) for pre-training, allowing it to consider both left and right contexts, which significantly enhances its performance across eleven natural language processing (NLP) tasks. BERT achieves state-of-the-art results on benchmarks like GLUE and SQuAD, demonstrating its effectiveness with minimal task-specific modifications. The model's architecture, characterized by a multi-layer bidirectional Transformer encoder, facilitates easy fine-tuning for various applications. The paper also discusses the impact of model size and pre-training tasks on performance, highlighting the importance of bidirectionality and the effectiveness of rich, unsupervised pre-training. BERT's results surpass those of previous models, showcas

In [118]:
#intermediate
from IPython.display import Markdown
Markdown(final_summary)

The paper presents BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language representation model developed by Google AI Language. Unlike previous unidirectional models, BERT utilizes a masked language model (MLM) and next sentence prediction (NSP) for pre-training, allowing it to consider both left and right contexts, which significantly enhances its performance across eleven natural language processing (NLP) tasks. BERT achieves state-of-the-art results on benchmarks like GLUE and SQuAD, demonstrating its effectiveness with minimal task-specific modifications. The model's architecture, characterized by a multi-layer bidirectional Transformer encoder, facilitates easy fine-tuning for various applications. The paper also discusses the impact of model size and pre-training tasks on performance, highlighting the importance of bidirectionality and the effectiveness of rich, unsupervised pre-training. BERT's results surpass those of previous models, showcasing its potential in advancing NLP capabilities.
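
With `return_intermediate_steps=True`, `intermediate` holds one summary per chunk, in the same order as `chunks`. To see which part of the paper each one came from, the summaries can be paired with the `page` value stored in each chunk's metadata. A small sketch, in which `label_intermediate_steps` is a hypothetical helper:

```python
def label_intermediate_steps(chunk_pages, intermediate_steps) -> list[str]:
    """Prefix every per-chunk summary with its chunk index and source page."""
    if len(chunk_pages) != len(intermediate_steps):
        raise ValueError("expected exactly one intermediate summary per chunk")
    return [
        f"[chunk {index}, page {page}] {summary}"
        for index, (page, summary) in enumerate(zip(chunk_pages, intermediate_steps))
    ]
```

In this notebook it could be called as `label_intermediate_steps([c.metadata["page"] for c in chunks], intermediate)`.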

In [None]:
# A custom PromptTemplate for the 'map' step of the map-reduce chain.

# Purpose: its sole job is to instruct the LLM on how to summarize each individual chunk of the original document.

# Input: the text of a single document chunk (e.g., one page of your PDF).
# Goal: a concise, intermediate summary for that specific chunk. This result is fed into the next step. The instructions are simple because the task is limited to a single piece of text.





In [113]:
%%time
# Map-reduce with custom prompts, all in one cell.
# (%%time has to be the first line of the cell)

# Prompt applied to every chunk (the 'map' step)
from langchain_core.prompts import PromptTemplate

# Define our custom map prompt, used to summarize each chunk
chunks_prompt = """
                Please summarize the below text:
                text:'{text}'
                summary:
                """

map_prompt_template = PromptTemplate(input_variables=['text'],
                                     template=chunks_prompt)

# Prompt for the combined summaries (the 'reduce' step):
# all the smaller summaries -> one coherent, cohesive, and well-structured final summary.
#   map -----------------> reduce

final_combine_prompt = """
                       Provide a final summary of the entire text with important points.
                       Add a generic title.
                       Start the precise summary with an introduction and provide the
                       summary in numbered points for the text.
                       text: '{text}'
                       summary:
                       """

final_combine_prompt_template = PromptTemplate(input_variables=['text'],
                                               template=final_combine_prompt)


text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000,
                                               chunk_overlap=0,  # ensures there is no overlapping text between the chunks.
                                               )

# Split the loaded pages into chunks
chunks = text_splitter.split_documents(pdf)
#llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0, max_tokens=1600)


chain = load_summarize_chain(llm=llm,
                             chain_type='map_reduce',
                             map_prompt=map_prompt_template,  # every chunk
                             combine_prompt=final_combine_prompt_template  # combined all summarization
                             )

chain

# result = chain.invoke(chunks)
# final_summary = result["output_text"]
# # intermediate = result["intermediate_steps"]  # if you want to inspect/debug
# final_summary



output_summary = chain.invoke(chunks)["output_text"]

#output_summary

# output_summary already holds the summary string; render it as Markdown
Markdown(output_summary)


CPU times: user 108 ms, sys: 6.15 ms, total: 114 ms
Wall time: 44.6 s


**Title: Overview of BERT: Advancements in Natural Language Processing**

**Summary:**

The text provides a comprehensive overview of BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language representation model developed by Google AI Language. BERT's innovative approach to pre-training and fine-tuning has significantly advanced the field of natural language processing (NLP). The following points summarize the key aspects discussed in the text:

1. **Introduction to BERT**: BERT is designed to pre-train deep bidirectional representations from unlabeled text, considering both left and right context, which enhances its performance on various NLP tasks.

2. **Performance on NLP Tasks**: BERT achieves state-of-the-art results on eleven tasks, including GLUE, MultiNLI, and SQuAD benchmarks, demonstrating significant improvements over previous models.

3. **Bidirectional Pre-training**: The model employs a masked language model (MLM) approach, allowing for better context incorporation compared to unidirectional models, which enhances its effectiveness in sentence-level and token-level tasks.

4. **Unified Architecture**: BERT utilizes a unified architecture for various tasks, with minimal differences between pre-training and downstream models, facilitating efficient fine-tuning.

5. **Input Representations**: BERT combines token, segment, and position embeddings to create input representations, using WordPiece embeddings and special tokens for classification and sentence separation.

6. **Pre-training and Fine-tuning**: The pre-training phase involves tasks like MLM and Next Sentence Prediction (NSP), while fine-tuning adjusts the model using labeled data for specific tasks, allowing for quick adaptation.

7. **Model Sizes**: BERT is available in two main sizes: BERTBASE and BERTLARGE, with the latter showing superior performance, especially on tasks with limited training data.

8. **Benchmark Results**: BERT models significantly outperform previous state-of-the-art models on the GLUE benchmark and SQuAD v1.1, achieving notable accuracy improvements.

9. **Ablation Studies**: The text discusses various ablation studies that highlight the importance of pre-training tasks and model size, indicating that larger models consistently improve accuracy across tasks.

10. **Impact of Pre-training**: The findings emphasize the significance of rich, unsupervised pre-training in enhancing language understanding systems, particularly for low-resource tasks.

11. **Comparative Analysis**: BERT's architecture is compared with other models like OpenAI GPT and ELMo, showcasing its unique bidirectional approach and fine-tuning capabilities.

12. **Hyperparameter Tuning**: The text discusses the importance of hyperparameter tuning during fine-tuning, noting that larger datasets are less sensitive to hyperparameter choices.

13. **Task-Specific Fine-tuning**: Various tasks for fine-tuning BERT are outlined, including linguistic acceptability, semantic similarity, and paraphrase detection, with suggestions for multitask approaches to enhance performance.

14. **Ongoing Research**: The text concludes with mentions of ongoing ablation studies to further understand the importance of different aspects of the BERT model and its applications in NLP.

This summary encapsulates the essential points regarding BERT's architecture, training processes, performance metrics, and its transformative impact on natural language processing.

The second prompt, `final_combine_prompt` (the 'reduce' prompt), is used for the second and final step.

- Purpose: Its job is to synthesize all the individual summaries from the first step into one single, final summary.
- Input: It takes the collection of all the smaller summaries (the output of the 'map' step) as its input.
- Goal: To create a coherent, cohesive, and well-structured final summary. This prompt is where we add instructions for formatting, style, and content that should apply to the final output, such as adding a title, an introduction, and using numbered points.

In essence, the map prompt handles the summarization of parts, while the reduce prompt handles the synthesis of the whole.

We have now successfully created a Map-Reduce summarization chain using our custom prompts for both the 'map' and 'reduce' steps.

The output from chain will display the structure of this new MapReduceDocumentsChain, confirming that it has been correctly configured with our specific map_prompt_template and final_combine_prompt_template.
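
The division of labour between the two prompts can be sketched without LangChain: the map template is filled once per chunk, and the reduce template is filled once with all partial summaries. This is a simplified illustration, and the real chain may additionally condense the partial summaries in several rounds when they are too long for one prompt. `map_reduce_summarize` is a hypothetical helper, the templates are shortened versions of the ones above, and `llm` is any function from prompt string to reply string.

```python
from typing import Callable, Sequence

MAP_TEMPLATE = "Please summarize the below text:\ntext:'{text}'\nsummary:"
REDUCE_TEMPLATE = (
    "Provide a final summary of the entire text with important points.\n"
    "text: '{text}'\nsummary:"
)


def map_reduce_summarize(chunks: Sequence[str], llm: Callable[[str], str],
                         map_template: str = MAP_TEMPLATE,
                         reduce_template: str = REDUCE_TEMPLATE) -> str:
    """Summarize each chunk on its own, then summarize the summaries."""
    # Map: one independent call per chunk.
    partial_summaries = [llm(map_template.format(text=chunk)) for chunk in chunks]
    # Reduce: a single call over all partial summaries joined together.
    combined = "\n\n".join(partial_summaries)
    return llm(reduce_template.format(text=combined))
```

Formatting instructions such as the title or numbered points belong in `reduce_template`, because only that call produces the text the reader sees.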

In [114]:
len(chunks)

16

In [None]:
serial operating barrier

In [115]:

%%time
# --- Map-Reduce: Detailed, ≥1000 tokens, Title + bullet points ---



from langchain_core.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain_openai import ChatOpenAI
from IPython.display import Markdown

# 1) Map prompt (per chunk) – keep concise bullets so reduce can scale
map_prompt = PromptTemplate.from_template(
    "Summarize the following text as concise bullet points (no intro):\n\n{text}\n\n-"
)

# 2) Combine prompt – EXACT requirement text + structure
combine_prompt = PromptTemplate.from_template(
    "Summarize with around 1000 tokens. Produce a detailed, cohesive summary of the ENTIRE work.\n"
    "- Start with a single (**bold**) Generic Title on the first line (no quotes).\n"
    "- Then present the key points as bullet points only (no numbering).\n"
    "- Be comprehensive and avoid redundancy; prefer clear, information-dense bullets.\n\n"
    "All chunk summaries:\n{text}\n\n"
    "Title:\n"
    "Bullet-point summary:\n-"
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=8000,
                                               chunk_overlap=0,  # ensures there is no overlapping text between the chunks.
                                               )

# Split the loaded pages into chunks
chunks = text_splitter.split_documents(pdf)

# 3) LLM with max_tokens > 1000
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=1234)

# 4) Build chain
chain = load_summarize_chain(
    llm=llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    #verbose=True
)

# 5) Run
result = chain.invoke(chunks)  # 'chunks' = list[Document]
output_summary = result["output_text"]




Markdown(output_summary)



CPU times: user 160 ms, sys: 8.44 ms, total: 168 ms
Wall time: 1min 35s


**BERT: Bidirectional Encoder Representations from Transformers**

- BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary language representation model created by Google AI Language, aimed at improving natural language processing (NLP) tasks.
- The model utilizes deep bidirectional representations, allowing it to consider both left and right context simultaneously, leading to a more nuanced understanding of language.
- BERT can be fine-tuned for various NLP tasks by adding a simple output layer, achieving state-of-the-art results without complex architectural changes.
- It sets new benchmarks across eleven NLP tasks, including:
  - GLUE score: 80.5% (an absolute improvement of 7.7%)
  - MultiNLI accuracy: 86.7% (4.6% improvement)
  - SQuAD v1.1 F1 score: 93.2 (1.5 point improvement)
  - SQuAD v2.0 F1 score: 83.1 (5.1 point improvement)
- The pre-training phase of BERT significantly boosts performance in both sentence-level and token-level tasks, overcoming limitations of prior unidirectional models.
- BERT employs a masked language model (MLM) objective, which enhances context understanding by randomly masking input tokens and predicting them.
- The model also incorporates a "next sentence prediction" (NSP) task, which aids in understanding relationships between sentences, essential for tasks like Question Answering (QA) and Natural Language Inference (NLI).
- BERT's architecture is based on a multi-layer bidirectional Transformer encoder, available in two configurations: BERTBASE (12 layers, 110M parameters) and BERTLARGE (24 layers, 340M parameters).
- Input representation in BERT accommodates both single and paired sentences, utilizing WordPiece embeddings and special tokens ([CLS] for classification and [SEP] for sentence separation).
- Pre-training is conducted on extensive corpora, including BooksCorpus and English Wikipedia, focusing on document-level sequences to enhance long-range dependencies.
- Fine-tuning is straightforward, leveraging the self-attention mechanism to adapt the model for specific tasks with minimal adjustments.
- BERT's performance on the GLUE benchmark showcases its superiority over previous systems, with BERTBASE achieving an average accuracy of 79.6% and BERTLARGE reaching 82.1%.
- In the SQuAD v1.1 dataset, BERT surpasses existing systems, achieving an F1 score of 91.8 in ensemble settings and outperforming top leaderboard systems.
- For SQuAD v2.0, BERT's approach includes treating questions without answers as spans at the [CLS] token, enhancing its ability to manage unanswerable questions.
- BERT excels on the SWAG dataset, outperforming previous models by significant margins.
- Fine-tuning details include specific epochs, learning rates, and batch sizes tailored for each task, ensuring optimal performance.
- Ablation studies indicate the critical role of pre-training tasks, revealing that removing NSP significantly degrades performance on key tasks.
- Larger model sizes correlate with improved accuracy, with BERTLARGE consistently outperforming smaller configurations across various tasks.
- The study emphasizes the effectiveness of transfer learning with language models, particularly in low-resource scenarios, and highlights the advantages of deep bidirectional architectures.
- BERT's fine-tuning method, which involves adding a classification layer and jointly fine-tuning all parameters, proves more effective than traditional feature-based approaches.
- Comparisons with other models, such as OpenAI GPT, illustrate BERT's superior performance, especially in tasks with limited training data.
- The research underscores BERT's potential to advance the state of the art in NLP, paving the way for future developments in language representation and understanding.
- BERT's pre-training involves two main tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP), enhancing its contextual understanding and sentence relationship comprehension.
- The MLM task randomly masks 15% of input tokens, employing a specific replacement strategy to promote contextual learning.
- The NSP task samples pairs of sentences, with half being consecutive, helping the model learn inter-sentence relationships.
- BERT's architecture includes special tokens like [CLS] for classification and [SEP] for separating sentences, improving its versatility across various NLP tasks.
- The model is trained with a maximum sequence length of 512 tokens and employs a batch size of 256 sequences over approximately 1,000,000 steps, utilizing the Adam optimizer with specific hyperparameters.
- BERT's pre-training is computationally intensive, with BERTBASE taking 4 days on 4 Cloud TPUs and BERTLARGE on 16 Cloud TPUs.
- Fine-tuning BERT for specific tasks involves adding an output layer and adjusting hyperparameters like learning rate and batch size, with optimal batch sizes ranging from 16 to 32.
- BERT's performance is evaluated on the GLUE benchmark, which includes various tasks such as MNLI (entailment classification), QQP (question pair classification), and SST-2 (sentiment analysis).
- The model's fine-tuning is sensitive to hyperparameters, particularly on smaller datasets, while large datasets (100k+ labeled examples) show less sensitivity.
- BERT's training and fine-tuning processes are compared to other models like ELMo and OpenAI GPT, highlighting differences in architecture, training data, and learning strategies.
- Ablation studies reveal that BERT's improvements stem primarily from its pre-training tasks and bidirectionality, with mixed masking strategies during MLM pre-training yielding better results than single strategies.
- The model's performance on tasks like MNLI shows that increasing the number of training steps leads to higher accuracy, with BERTBASE achieving nearly 1.0% higher accuracy with 1M steps compared to 500k.
- BERT's architecture and training methods have set a new standard in NLP, demonstrating the effectiveness of deep contextualized representations and bidirectional training in understanding language.

**BERT: Bidirectional Encoder Representations from Transformers**

- BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary language representation model created by Google AI Language, aimed at improving natural language processing (NLP) tasks.
- The model utilizes deep bidirectional representations, considering both left and right context, which enhances its understanding of language nuances.
- BERT can be easily fine-tuned for various NLP tasks by adding a simple output layer, achieving state-of-the-art results without complex architectural changes.
- It has set new performance benchmarks on eleven NLP tasks, including:
  - GLUE score: 80.5% (an absolute improvement of 7.7%)
  - MultiNLI accuracy: 86.7% (an absolute improvement of 4.6%)
  - SQuAD v1.1 F1 score: 93.2 (an absolute improvement of 1.5 points)
  - SQuAD v2.0 F1 score: 83.1 (an absolute improvement of 5.1 points)
- Traditional pre-training methods using unidirectional language models are limited; BERT's masked language model (MLM) pre-training addresses these limitations, improving context integration for tasks like question answering.
- The MLM enables BERT to predict masked tokens in sentences, leveraging both left and right context, which is essential for effective language representation.
- BERT incorporates a "next sentence prediction" (NSP) task, which helps the model understand relationships between sentence pairs during pre-training.
- The architecture is based on a multi-layer bidirectional Transformer encoder, available in two configurations: BERTBASE (12 layers, 110M parameters) and BERTLARGE (24 layers, 340M parameters).
- BERT's input representation supports both single and paired sentences, using WordPiece embeddings with a vocabulary of 30,000 tokens.
- The first token in the input sequence is a special classification token ([CLS]), whose final hidden state is used for classification tasks, while sentence pairs are separated by a special token ([SEP]).
- Pre-training is conducted on large corpora, including BooksCorpus (800M words) and English Wikipedia (2,500M words), focusing on document-level contexts for long sequences.
- Fine-tuning is efficient, utilizing the self-attention mechanism to model various tasks with text pairs, and can be executed on Cloud TPU or GPU.
- BERT's fine-tuning results show significant improvements across multiple NLP tasks, with BERTBASE and BERTLARGE outperforming all previous systems on the GLUE benchmark.
- In SQuAD v1.1, BERTLARGE achieves an F1 score of 91.8, surpassing the leading system by 1.5 points, while in SQuAD v2.0, it outperforms existing systems by 0.1-0.4 F1.
- The SWAG dataset results indicate BERTLARGE achieving an EM score of 86.6, outperforming previous models by substantial margins.
- Ablation studies highlight the importance of various components of BERT, showing that removing the NSP task significantly reduces performance on several tasks.
- Larger BERT models consistently yield better accuracy across tasks, with BERTLARGE demonstrating notable improvements even with limited training data.
- The study emphasizes the effectiveness of fine-tuning models on downstream tasks, suggesting that larger pre-trained representations can enhance performance, even for low-resource tasks.
- BERT's performance is compared with other models, including ELMo and OpenAI GPT, showcasing its superior capabilities across various benchmarks.
- The research underscores the advantages of transfer learning with language models, which boosts performance across diverse NLP tasks.
- BERT's configurations were tested with varying layers, hidden sizes, and attention heads, revealing that larger models lead to significant performance gains.
- The findings suggest that while feature-based approaches have their merits, BERT's fine-tuning method provides a more effective solution for many NLP challenges.
- The paper concludes that BERT represents a significant advancement in NLP, offering a robust framework for future research and applications in language understanding.
- The work also discusses various advances in NLP, including datasets like TriviaQA for reading comprehension, frameworks like Skip-thought vectors for sentence representation, and benchmarks like the Winograd Schema Challenge for commonsense reasoning.
- BERT's architecture is detailed in three sections: implementation details, experimental details, and ablation studies, providing a comprehensive understanding of its design and performance.
- Pre-training tasks for BERT include a masked language model (MLM) with specific masking strategies to encourage robust contextual representations.
- The masking procedure involves replacing words with [MASK], random words, or keeping them unchanged, promoting diverse learning during training.
- BERT employs a bidirectional Transformer architecture, contrasting with OpenAI GPT's left-to-right architecture, enhancing its contextual understanding.
- BERT's representations are conditioned on both left and right context, allowing for a more nuanced understanding of language compared to feature-based models like ELMo.
- BERT's MLM converges slower than left-to-right models but shows significant empirical improvements in performance.
- The Next Sentence Prediction (NSP) task involves sampling text spans to assess the model's understanding of sentence relationships.
- Training sequences are limited to 512 tokens, with a 15% masking rate applied after WordPiece tokenization, optimizing input for the model.
- BERT is trained with a batch size of 256 sequences over 1,000,000 steps, utilizing the Adam optimizer with specific hyperparameters for effective learning.
- Fine-tuning BERT involves adjusting hyperparameters for specific tasks, with optimal batch sizes ranging from 16 to 32.
- The comparison of BERT, ELMo, and OpenAI GPT highlights differences in training data, architecture, and learning strategies, emphasizing BERT's design for comprehensive language understanding.
- Ablation studies reveal that improvements in performance are primarily due to pre-training tasks and the bidirectional nature of BERT.
- The GLUE benchmark includes various tasks such as MNLI, QQP,
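
The output above stops mid-bullet ("MNLI, QQP,"), which is what one would expect when a response reaches the `max_tokens=1234` cap. One way to flag such output is sketched below. It is a heuristic only, assuming that a complete summary ends with sentence-final punctuation; the helper name `looks_truncated` is hypothetical and not part of LangChain.

```python
def looks_truncated(summary: str, terminators: str = ".!?") -> bool:
    """Heuristic: True if the last non-empty line lacks terminal punctuation."""
    lines = [line.strip() for line in summary.splitlines() if line.strip()]
    if not lines:
        return True  # an empty response is treated as incomplete
    # Ignore trailing markdown or quote characters before checking punctuation.
    last = lines[-1].rstrip("*_`\"')")
    return not last.endswith(tuple(terminators))
```

If the check fires, raising `max_tokens` or asking for a shorter summary in the combine prompt are the two obvious adjustments to try.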

In [None]:

the code below takes a very long time to finish, so cause an error here on purpose to avoid wasting time
.line 2

In [None]:


# Refine chain in one block

# Progressive approach: after each chunk is summarised, the running summary is refined




# -----------------------------------

# (Use the same llm you used above, or uncomment the next 2 lines)
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_tokens=1600)

from langchain_core.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from IPython.display import Markdown

# --- initial prompt (first chunk) ---
initial_prompt_txt = """
You are creating a structured summary of the text below.

Requirements:
- Add a short, generic Title on the first line (no quotes).
- Then write a concise introduction (1–2 sentences).
- Follow with a numbered list of key points (prioritise salient facts, methods, results, implications).
- Keep the whole summary precise and cohesive.

text: '{text}'
summary:
"""

initial_prompt_template = PromptTemplate(
    input_variables=["text"],
    template=initial_prompt_txt
)

# --- refine prompt (subsequent chunks) ---
refine_prompt_txt = """
We have an existing structured summary that should be improved using NEW text.

Requirements:
- Preserve the structure: Title, brief introduction, then numbered key points.
- Integrate any new important information; merge or renumber points as needed.
- Remove duplication, keep coherence and concise phrasing.
- If the new text adds nothing important, keep the current summary unchanged.
- The final output must remain a single summary (Title + intro + numbered points).

Existing summary:
{existing_answer}

New text:
{text}

Refined summary:
"""

refine_prompt_template = PromptTemplate(
    input_variables=["existing_answer", "text"],
    template=refine_prompt_txt
)

# --- build chain ---
refine_chain = load_summarize_chain(
    verbose=True,
    llm=llm,
    chain_type="refine",
    question_prompt=initial_prompt_template,   # first chunk
    refine_prompt=refine_prompt_template,      # subsequent chunks
)

# --- run ---
# 'chunks' should be a list of langchain Document objects (same as in your map_reduce block)
output_summary = refine_chain.invoke(chunks)["output_text"]

# --- display ---
Markdown(output_summary)


# BERT: A Breakthrough in Language Representation Models

This paper presents BERT, a novel language representation model that enhances natural language processing (NLP) tasks by utilizing deep bidirectional representations from unlabeled text.

1. **Introduction of BERT**: BERT stands for Bidirectional Encoder Representations from Transformers and is designed to pre-train deep bidirectional representations by conditioning on both left and right context in all layers. Unlike previous models that used unidirectional language models, BERT employs a masked language model (MLM) objective, which allows it to fuse left and right context effectively.

2. **Input/Output Representations**: BERT's input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., Question, Answer) in one token sequence. A "sequence" refers to the input token sequence, which may consist of a single sentence or two sentences packed together. The first token of every sequence is a special classification token ([CLS]), and the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are differentiated using a special token ([SEP]) and a learned embedding indicating whether a token belongs to sentence A or B. The input embeddings are the sum of the token embeddings, segment embeddings, and position embeddings.

3. **Pre-training Method**: BERT employs a masked language model (MLM) pre-training objective, which randomly masks tokens in the input and predicts their original values, overcoming the limitations of unidirectional models. The masking procedure involves replacing a token with the [MASK] token 80% of the time, a random word 10% of the time, and leaving it unchanged 10% of the time. Additionally, BERT incorporates a "next sentence prediction" (NSP) task that jointly pre-trains text-pair representations. The model is trained on a large corpus, including BooksCorpus (800M words) and English Wikipedia (2,500M words), using a unified architecture that remains consistent during fine-tuning. Studies show that removing the NSP task significantly degrades performance on tasks like QNLI, MNLI, and SQuAD 1.1. Recent ablation studies indicate that BERTBASE achieves almost 1.0% additional accuracy on MNLI when trained for 1 million steps compared to 500,000 steps, highlighting the importance of extensive pre-training.

4. **Fine-tuning Capability**: The pre-trained BERT model can be fine-tuned with a single additional output layer, allowing it to achieve state-of-the-art performance across various NLP tasks without significant architectural changes. Fine-tuning is typically very fast, and it is reasonable to run an exhaustive search over hyperparameters such as learning rates (5e-5, 3e-5, 2e-5) and the number of epochs (2, 3, 4) to choose the best-performing model on the development set. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. BERT is the first fine-tuning based representation model that outperforms many task-specific architectures.

5. **Feature-based Approach**: While BERT's results primarily utilize a fine-tuning approach, there are advantages to a feature-based approach where fixed features are extracted from the pre-trained model. This method can be computationally efficient, allowing for the pre-computation of expensive representations of training data, which can then be used with simpler models. Experiments on tasks like Named Entity Recognition (NER) demonstrate that BERT can effectively support both fine-tuning and feature-based methods, with competitive performance across both approaches. However, ablation studies show that fine-tuning is robust to different masking strategies, while the feature-based approach struggles when using only the [MASK] strategy.

6. **Model Architecture**: BERT’s architecture is a multi-layer bidirectional Transformer encoder, based on the original implementation described in Vaswani et al. (2017). It primarily reports results on two model sizes: BERTBASE (12 layers, 768 hidden size, 12 attention heads, 110M parameters) and BERTLARGE (24 layers, 1024 hidden size, 16 attention heads, 340M parameters). The use of bidirectional self-attention distinguishes BERT from models like OpenAI GPT, which uses constrained self-attention. BERT's representations are jointly conditioned on both left and right context in all layers, unlike ELMo, which uses a concatenation of independently trained left-to-right and right-to-left LSTMs.

7. **Performance Improvements**: BERT achieves new state-of-the-art results on eleven NLP tasks, including:
   - GLUE score: 80.5% (7.7% absolute improvement)
   - MultiNLI accuracy: 86.7% (4.6% absolute improvement)
   - SQuAD v1.1 F1 score: 93.2 (1.5 point absolute improvement)
   - SQuAD v2.0 F1 score: 83.1 (5.1 point absolute improvement)
   - BERTBASE and BERTLARGE outperform all systems on all tasks by substantial margins, with average accuracy improvements of 4.5% and 7.0% respectively over the prior state of the art. Notably, BERTLARGE (Ensemble) achieved an F1 score of 91.8 on SQuAD 1.1, outperforming other top systems.

8. **Comparison with Existing Models**: BERT improves upon previous models by addressing the limitations of unidirectional language models, which restrict the effectiveness of fine-tuning approaches, particularly for token-level tasks like question answering. It demonstrates the importance of bidirectional pre-training for language representations, contrasting with models that use shallow concatenation of independently trained left-to-right and right-to-left language models. BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach. Notably, BERT's training involved a larger dataset and different hyperparameter choices compared to GPT, which contributes to its performance advantages.

9. **Implications for NLP**: The introduction of BERT signifies a shift in how language representations can be utilized, enhancing the performance of various NLP applications and setting a new benchmark for future research. The code and pre-trained models are available at https://github.com/google-research/bert. Additionally, the findings indicate that larger models lead to improved accuracy across tasks, demonstrating the effectiveness of scaling model size in NLP applications. Recent empirical improvements suggest that even low-resource tasks can benefit from deep bidirectional architectures, further generalizing the advantages of rich, unsupervised pre-training in language understanding systems.

In [162]:
# The refine chain suits long documents where you want a single coherent summary updated chunk-by-chunk.

# Typically slower than map_reduce, but can keep better narrative continuity.

# Tip: In refine, the placeholders are {text} and {existing_answer} (not {context}), so make sure your prompts use those exact variables.
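
The chunk-by-chunk behaviour described in the comments above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not LangChain's implementation: `INITIAL_TEMPLATE` and `REFINE_TEMPLATE` are shortened, hypothetical prompts that only reuse the `{text}` and `{existing_answer}` placeholder names, and `llm` is any callable that takes a prompt string and returns text.

```python
from typing import Callable, List

# Hypothetical, shortened prompts using the same placeholder names as the refine chain.
INITIAL_TEMPLATE = "Summarize:\n{text}"
REFINE_TEMPLATE = (
    "Existing summary:\n{existing_answer}\n\n"
    "New text:\n{text}\n\n"
    "Refined summary:"
)


def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Build one running summary, updating it once per chunk."""
    if not chunks:
        raise ValueError("need at least one chunk")
    # First chunk: only {text} is filled in.
    summary = llm(INITIAL_TEMPLATE.format(text=chunks[0]))
    # Later chunks: the running summary is fed back in as {existing_answer}.
    for chunk in chunks[1:]:
        summary = llm(REFINE_TEMPLATE.format(existing_answer=summary, text=chunk))
    return summary
```

Because each call depends on the previous result, the calls cannot run in parallel, which is consistent with refine typically being slower than map_reduce.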

### Generate a Detailed Summary of the Entire Document With at Least 1000 Tokens. Also, Add a Title to the Summary and Present Key Points Using Bullet Points With chain_type="map_reduce".