<a href="https://colab.research.google.com/github/rogerwzeng/e104/blob/main/T5Summary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Research Paper Summarization
## I wrote this simple script to help with the weekly assignment of writing research summaries for the E-11 course. It performs long-text summarization with a model based on T5. Thanks to Peter for the original code; I only had to add a wrapper.
## Note: this should only assist, not replace, reading the paper. I found the summarization quality to be inconsistent, so it is important to *actually read* the paper itself so that no important points are missed.
## Anyhow, here we go.

## Make sure the run-time environment libraries are all set up

In [None]:
!pip install transformers
!pip install torch
!pip install sentencepiece
!pip install pyPDF2
!pip install jedi==0.10
!pip install koila

## Enter research paper file name (must be PDF) below

In [None]:
in_file = ""  # Name and path (if any) of research paper PDF

## Read in PDF 

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from PyPDF2 import PdfReader

with open(in_file, "rb") as pdf_file:
    reader = PdfReader(pdf_file)
    pg = len(reader.pages)
    print(f"Total Pages: {pg}")
    long_text = ""
    for page in reader.pages:
        long_text += page.extract_text() + "\n"
# print(long_text)  # uncomment to inspect the extracted text
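
Text extracted from a PDF often carries layout debris such as words hyphenated across line breaks and hard wraps inside paragraphs. The sketch below shows one possible cleanup pass before summarization. The helper name `clean_extracted_text` is hypothetical (it is not part of PyPDF2 or the original code), and the rules are a heuristic: rejoining at a trailing hyphen will also merge genuine compounds that happen to break at a line end.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled from a PDF (hypothetical helper, heuristic rules)."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Replace single line breaks with spaces, but keep blank lines (paragraph breaks)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

If it helps, it could be applied as `long_text = clean_extracted_text(long_text)` right after the pages are read in.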

## Long text summarization with long-t5-tglobal-base-16384-book-summary model

Link to model card: [pszemraj/long-t5-tglobal-base-16384-book-summary](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary)

by [Peter](https://github.com/pszemraj)


In [None]:
from transformers import pipeline
import torch

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-base-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)


## Deal with long papers with lazy feeding

In [None]:
# For a large input document, you may run out of GPU memory.
# Chunk it up into batches by number of pages (16 pages is just a heuristic).
from koila import lazy

long_text = lazy(long_text, batch=pg//16)
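
If memory is still a problem, another option is to split the text by hand into smaller pieces. The sketch below is a plain-Python illustration of that idea and does not use koila. The function name `chunk_text` and the default `max_chars=40000` are assumptions made for illustration; the right size depends on your GPU and on the model's 16384-token input limit.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 40000) -> List[str]:
    """Split text into chunks of at most max_chars characters, breaking on whitespace.

    Hypothetical helper. A single word longer than max_chars is kept whole,
    so that one chunk may exceed the limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for word in text.split():
        # Count one extra character for the joining space if the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            # Current chunk is full: store it and start a new one with this word
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that splitting on whitespace discards the original line breaks, which is usually fine for summarization input.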

## Run model to get summary text.
### Adjust "max_length" and "min_length" to suit your needs

In [None]:
%%time
params = {
    "max_length": 512,
    "min_length": 192,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
    "repetition_penalty": 4.5,
    "length_penalty": 0.3,
    "encoder_no_repeat_ngram_size": 3,
    "num_beams": 4,
} # parameters for text generation out of model

result = summarizer(long_text, **params)

print(result[0]['summary_text'])
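
For papers that have been split into several pieces, one simple approach is to summarize each piece separately and join the results. The sketch below assumes the text is already available as a list of strings and that the summarizer returns a list of dicts with a `summary_text` key, as the pipeline call above does. The function name `summarize_chunks` is hypothetical.

```python
from typing import Callable, Dict, List


def summarize_chunks(
    chunks: List[str],
    summarize: Callable[..., List[Dict[str, str]]],
    **gen_params,
) -> str:
    """Summarize each chunk on its own and join the partial summaries.

    Hypothetical helper: `summarize` is any callable that takes a string plus
    generation keyword arguments and returns [{"summary_text": ...}].
    """
    parts = []
    for chunk in chunks:
        result = summarize(chunk, **gen_params)
        parts.append(result[0]["summary_text"].strip())
    # Separate the partial summaries with a blank line
    return "\n\n".join(parts)
```

With the objects defined in this notebook, a call might look like `summarize_chunks(list_of_chunks, summarizer, **params)`, where `list_of_chunks` is whatever list of text pieces you prepared. Per-chunk summaries are produced independently, so the joined result can repeat points or read less smoothly than a single-pass summary, and `min_length` may need lowering for short chunks.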

Ref: https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary