<a href="https://colab.research.google.com/github/subhashjprasad/pdf-summarizer/blob/main/PDFSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Necessary Imports

In [None]:
!pip install datasets
!pip install transformers

from huggingface_hub import hf_hub_download
import re
from PIL import Image

from transformers import NougatProcessor, VisionEncoderDecoderModel
from datasets import load_dataset
import torch



Setup

In [None]:
processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

In [None]:
%%capture
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Preparing PDF

In [None]:
!apt-get install poppler-utils
!pip install pdf2image

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.


In [None]:
from pdf2image import convert_from_path, convert_from_bytes
from IPython.display import display  # IPython's Image is left out so it does not shadow PIL's Image

In [None]:
pdf_path = 'coontz.pdf'
# Render each PDF page to a PIL image; convert_from_path avoids leaving an open file handle
images = convert_from_path(pdf_path, size=800)

In [None]:
# Preprocess each page image into a tensor of pixel values for the model
pixel_values = [processor(image, return_tensors="pt").pixel_values for image in images]

Generate Transcription

In [None]:
import textwrap

wrapper = textwrap.TextWrapper(width=100)

In [None]:
print(device)

cuda


In [None]:
outputs = []
for page_pixels in pixel_values:
    # Generate one page at a time; bad_words_ids blocks the unknown token
    outputs.append(model.generate(
        page_pixels.to(device),
        min_length=1,
        max_new_tokens=5000,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    ))
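
The loop above sends one page at a time to `model.generate`. If GPU memory allows, pages could instead be processed in small groups. The sketch below shows only the grouping step; `chunked` and the group size of 4 are hypothetical, and the commented usage assumes each entry of `pixel_values` has a leading batch dimension of 1 so that `torch.cat` can stack them.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Stand-in for the per-page tensors so the sketch runs without a model.
page_ids = list(range(1, 11))
batches = list(chunked(page_ids, 4))
print(batches)

# Assumed usage with the notebook's objects (not run here):
# for group in chunked(pixel_values, 4):
#     batch = torch.cat(group, dim=0).to(device)
#     batch_ids = model.generate(batch, min_length=1, max_new_tokens=5000)
#     outputs.extend(ids.unsqueeze(0) for ids in batch_ids)  # keep one (1, seq) tensor per page
```

Larger groups trade memory for throughput, so the group size would need to be chosen against the available GPU.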

In [None]:
# python-Levenshtein is used by processor.post_process_generation in the next cell
!pip install python-Levenshtein



In [None]:
full_text = []

for i, output in enumerate(outputs):
    # Decode token ids to text, then apply Nougat's post-processing cleanup
    sequence = processor.batch_decode(output, skip_special_tokens=True)[0]
    sequence = processor.post_process_generation(sequence, fix_markdown=False)
    full_text.append(sequence)

    # Print the page wrapped to 100 characters per line
    print(f"Page {i + 1}:", '\n')
    for line in wrapper.wrap(text=sequence):
        print(line)
    print('\n')

Page 1: 

Marriage, a History  How Love Conquered Marriage  Stephanie Coontz


Page 2: 

  ## Chapter 14 The Era of Ozzie and Harriet:  The Long Decade of  "Traditional" Marriage  The long
decade of the 1950s, unreching from 1947 to the early 1960s in the United States and from 1952 to
the late 1960s in Western Europe, was a unique moment in the history of marriage. Never before had
so many people shared the experience of courting their own mats, getting married it all, and setting
the prior own households. Next hard married couples been so independent of extended family ties and
community groups. And never before had to many people agreed that only one kind of family was
"normal."  The cultural consensus that everyone should marry and form a male breadwinner family was
like a steenmoker that crubed every alternative view. By the end of the 1950s even people who had
grown up in completely different family systems had come to believe that universal marriage at a
young age into a male br
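
Page 1 above is a title page with only a handful of words, which gives a summarizer very little to work with. One option is to set such pages aside before summarizing. This is a minimal sketch, assuming a word-count cutoff is a reasonable proxy for "too short"; `split_short_pages` and the threshold of 30 words are hypothetical and would need tuning on real transcriptions.

```python
from typing import List, Tuple


def split_short_pages(
    pages: List[str], min_words: int = 30
) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Return (kept, skipped): kept holds (page_number, text) pairs, skipped holds page numbers."""
    kept, skipped = [], []
    for page_number, text in enumerate(pages, start=1):
        if len(text.split()) >= min_words:
            kept.append((page_number, text))
        else:
            skipped.append(page_number)
    return kept, skipped


sample_pages = [
    "Marriage, a History  How Love Conquered Marriage  Stephanie Coontz",
    "word " * 200,  # stand-in for a full page of transcribed text
]
kept, skipped = split_short_pages(sample_pages)
print("pages to summarize:", [number for number, _ in kept])
print("pages skipped:", skipped)
```

Keeping the original page numbers alongside the text means later output can still be labeled by page even when some pages are skipped.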

Summarization

In [None]:
import torch
from transformers import pipeline

In [None]:
hf_name = 'pszemraj/led-large-book-summary'

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)


In [None]:
summary_text = []

for page_number, page in enumerate(full_text, start=1):
    header = f"Page {page_number} Summary: \n------------------------\n"
    print(header)
    summary_text.append(header)

    # Beam search with repetition controls to limit looping output
    result = summarizer(
        page,
        min_length=16,
        max_length=512,
        no_repeat_ngram_size=3,
        encoder_no_repeat_ngram_size=3,
        repetition_penalty=3.5,
        num_beams=4,
        early_stopping=True,
    )

    # Wrap the summary to the same 100-character width used for the transcription
    for line in wrapper.wrap(text=result[0]['summary_text']):
        print(line)
        summary_text.append(line)

    footer = "\n------------------------\n"
    print(footer)
    summary_text.append(footer)

Your max_length is set to 512, but your input_length is only 23. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=11)


Page 1 Summary: 
------------------------



Your max_length is set to 512, but your input_length is only 263. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=131)


A History of Marriage  Stephanie writes a poem about her early marriage. In it, she tells the story
of how love conquered marriage

------------------------

Page 2 Summary: 
------------------------

The early-to-mid-1950s period in the U.S. and Western Europe is often called the "Traditional" era,
and it refers to a time in American history when there was a complete cultural consensus about what
marriage should be like. People of all races and economic classes decided that marriage was the only
way to live and that men should be the breadwinner in a family. It was a time when people had never
been so dependent on another person for their financial well-being before.

------------------------

Page 3 Summary: 
------------------------

In the U.S. and many other countries throughout the world, marriage is still the "only culturally
acceptable form of adulthood," and men who choose not to marry are seen as deviant or borderline
crazy. According to studies from 1957 through 1961, more t

Your max_length is set to 512, but your input_length is only 176. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=88)


In this chapter, Jacobs begins to explore marriage and its history. Marriage is a history. Even in
the early part of the 20th century there was much unhappiness in the home, but it did not seem to
affect the public as much as the problems of the Cold War or the Great depression. The rise of the
man-in-chair marriage was perhaps due to the popularity of the men's head chair marriage, which was
seen as an inevitable step up in the social hierarchy. Most social scientists believed that this
type of marriage was associated with the need for industrialization and thus became the accepted way
of life throughout the developed world. However, their findings were spotty and inconsistent and
they focused on one region of Latin America where household keeping was patchy at best.

------------------------

Page 10 Summary: 
------------------------

A History Of Marriage A History The narrator explains how the institution of marriage was born. He
traces its history back to the early years of the R
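
The warnings in the output above appear because `max_length=512` exceeds the token count of several pages. In each warning the suggested value is half the input length (23 gives 11, 263 gives 131, 176 gives 88). The sketch below derives both limits per page; `summary_lengths` is a hypothetical helper, and the token count is assumed to come from something like `len(summarizer.tokenizer(page)["input_ids"])`.

```python
from typing import Tuple


def summary_lengths(n_tokens: int, min_length: int = 16, cap: int = 512) -> Tuple[int, int]:
    """Pick (min_length, max_length) for one page from its input token count."""
    # Half the input length, mirroring the value suggested in the warning,
    # but never above the notebook's original cap of 512.
    max_length = max(1, min(cap, n_tokens // 2))
    # min_length must not exceed max_length on very short pages.
    return min(min_length, max_length), max_length


for n_tokens in (23, 176, 263, 4000):
    print(n_tokens, summary_lengths(n_tokens))

# Assumed usage inside the summarization loop (not run here):
# n_tokens = len(summarizer.tokenizer(page)["input_ids"])
# min_len, max_len = summary_lengths(n_tokens)
# result = summarizer(page, min_length=min_len, max_length=max_len, num_beams=4)
```

Halving is only one possible ratio; a page that is already terse may warrant a smaller reduction.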

In [None]:
for s in summary_text:
    print(s)

Page 1 Summary: 
------------------------

A History of Marriage  Stephanie writes a poem about her early marriage. In it, she tells the story
of how love conquered marriage

------------------------

Page 2 Summary: 
------------------------

The early-to-mid-1950s period in the U.S. and Western Europe is often called the "Traditional" era,
and it refers to a time in American history when there was a complete cultural consensus about what
marriage should be like. People of all races and economic classes decided that marriage was the only
way to live and that men should be the breadwinner in a family. It was a time when people had never
been so dependent on another person for their financial well-being before.

------------------------

Page 3 Summary: 
------------------------

In the U.S. and many other countries throughout the world, marriage is still the "only culturally
acceptable form of adulthood," and men who choose not to marry are seen as deviant or borderline
crazy. Accordin
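
Printed summaries disappear when the Colab runtime is reset, so it may be worth writing `summary_text` to disk. This is a minimal sketch using only the standard library; `save_summaries` and the file name `summaries.txt` are hypothetical, and the sample list stands in for the notebook's `summary_text`.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_summaries(lines: Iterable[str], path: Path) -> Path:
    """Write one entry per line, matching what the print loop shows."""
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Stand-in for summary_text so the sketch runs on its own.
sample_summary_text = [
    "Page 1 Summary: \n------------------------\n",
    "First wrapped line of a summary",
    "\n------------------------\n",
]

out_path = save_summaries(sample_summary_text, Path(tempfile.gettempdir()) / "summaries.txt")
print(out_path.read_text(encoding="utf-8"))
```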