## Transformer

Transformers reads only the following formats:

1. txt
2. csv
3. tsv
4. jsonl
5. json
6. xml

[Source](https://huggingface.co/docs/datasets/dataset_script.html)

To read PDF files, you have to convert them to one of the formats above.

For this notebook, we will convert the PDFs to plain text.

In [None]:
!pip install transformers
!pip install datasets
!pip install pdfminer
!pip install rouge_score
!pip install rouge

In [2]:
from io import StringIO

from datasets import load_metric
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [3]:
# Extract the text of every page into an in-memory buffer.
output_string = StringIO()
with open("data/summary text.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester
County, New York.
A year later, she got married again in Westchester County, but to a different man and without
divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five
more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license,
she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the
first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her
attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos 

In [4]:
metric = load_metric('rouge')

In [5]:
text = output_string.getvalue()

In [6]:
text = text[:-4]

In [7]:
# Flatten newlines to spaces and drop backslashes and form feeds.
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '')
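
The same cleanup chain is repeated for every PDF in this notebook, so it could be wrapped in a small helper. The sketch below is one possible variant, not the code used for the outputs shown here: `clean_extracted_text` is a hypothetical name, and unlike the chain above it collapses runs of whitespace into a single space.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: flatten PDF-extracted text into one line."""
    text = text.replace("\x0c", "")   # form feeds emitted between pages
    text = text.replace("\\", "")     # stray backslashes
    text = re.sub(r"\s+", " ", text)  # newlines and runs of spaces -> one space
    return text.strip()


sample = "first line\nsecond line\n\nnew paragraph\x0c"
print(clean_extracted_text(sample))
```

Because this variant normalizes whitespace, ROUGE scores computed against its output may differ slightly from the ones printed in this notebook.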

In [8]:
n = 512  # chunk size in characters, not tokens
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]
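
Slicing at fixed character offsets can cut a word in half at a chunk boundary. Fragments such as "ration scam" and "rcent of the harvest" in the summaries printed in this notebook suggest that this happened. A minimal sketch of an alternative that splits on whitespace instead is shown below; `chunk_on_words` is a hypothetical helper and was not used to produce the outputs in this notebook.

```python
def chunk_on_words(text: str, max_chars: int = 512) -> list[str]:
    """Split text into chunks of roughly max_chars without cutting words.

    A single word longer than max_chars is kept whole, so a chunk can
    exceed the limit in that one case.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


demo = chunk_on_words("the marriages were part of an immigration scam", max_chars=20)
print(demo)
```

With this helper, `bank = chunk_on_words(dataset, 512)` would be a drop-in replacement for the list comprehension above.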

In [9]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


In [10]:
tokenizer = AutoTokenizer.from_pretrained("t5-base")


In [12]:
summary = []
for data in bank:
    # T5 expects a task prefix; inputs longer than 512 tokens are truncated.
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

# Join the per-chunk summaries and strip T5's special tokens.
result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

 a year after that marriage, she got hitched again in the Bronx. in 2010, she married again, this time in the Bronx. she says it was her "first adolescent marriage" barrientos pleads not guilty to two counts of "offering a false instrument for filing in the first degree" she is accused of false statements on the 2010 marriage license application. the marriages were part of an immigration scam, her attorney says. barrientos has been married 10 times, prosecutors say. she is accused of sneaking into the subway through an emergency exit. she is believed to still be married to four men. ration scam involved some of her husbands, who filed for permanent residence. any divorces happened only after such filings were approved. it was unclear whether any of the men will be prosecuted. barrientos faces up to four years in prison if convicted. her eighth husband, Rashid Rajput, was deported to his native Pakistan in 2006. if convicted, she faces up to four years in prison.


In [14]:

metric.compute(predictions = [result], references = [dataset])

{'rouge1': AggregateScore(low=Score(precision=0.9216867469879518, recall=0.4090909090909091, fmeasure=0.5666666666666668), mid=Score(precision=0.9216867469879518, recall=0.4090909090909091, fmeasure=0.5666666666666668), high=Score(precision=0.9216867469879518, recall=0.4090909090909091, fmeasure=0.5666666666666668)),
 'rouge2': AggregateScore(low=Score(precision=0.7151515151515152, recall=0.3163538873994638, fmeasure=0.43866171003717475), mid=Score(precision=0.7151515151515152, recall=0.3163538873994638, fmeasure=0.43866171003717475), high=Score(precision=0.7151515151515152, recall=0.3163538873994638, fmeasure=0.43866171003717475)),
 'rougeL': AggregateScore(low=Score(precision=0.7710843373493976, recall=0.3422459893048128, fmeasure=0.4740740740740741), mid=Score(precision=0.7710843373493976, recall=0.3422459893048128, fmeasure=0.4740740740740741), high=Score(precision=0.7710843373493976, recall=0.3422459893048128, fmeasure=0.4740740740740741)),
 'rougeLsum': AggregateScore(low=Score(p

In [15]:
from rouge import Rouge
r = Rouge()
r.get_scores(result, dataset)

[{'rouge-1': {'f': 0.5438596448231935,
   'p': 0.8691588785046729,
   'r': 0.39574468085106385},
  'rouge-2': {'f': 0.38133873815979497,
   'p': 0.6266666666666667,
   'r': 0.27405247813411077},
  'rouge-l': {'f': 0.532163738390445,
   'p': 0.8504672897196262,
   'r': 0.3872340425531915}}]
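
The scores above show high precision and much lower recall. That pattern is expected here because the reference passed to the metric is the full source text, not a human-written summary: most summary words appear in the source, but the summary covers only part of it. The sketch below computes a simplified ROUGE-1 by hand to make the definition concrete. It assumes plain lowercase whitespace tokenization, which is not what `rouge_score` or `rouge` do internally, so its numbers will not match theirs exactly.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap with whitespace tokenization."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it occurs in both.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

In this toy case every predicted word occurs in the reference, so precision is 1.0, while only three of the six reference words are covered, so recall is 0.5.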

### Trying out a new model

In [16]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


In [17]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")


In [18]:
summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    # Beam search with four beams.
    outputs = model.generate(
        inputs["input_ids"],
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

 a year later, she got married again in westchester county, new york. only 18 days after that marriage, she got hitched yet again. in 2010, she married once more, this time in the Bronx. barrientos, now 39, is facing two criminal counts of "offering a false instrument" referring to her false statements on the 2010 marriage license application. the marriages were part of an immigration scam, her attorney says. in total, barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. she is believed to still be married to four men, and at one time, she was married to eight men at once. seven of the men are from so-called "red-flagged" countries. seven of the men are from so-called "red-flagged" countries. the case was referred to the district attorney's office. if convicted, barrientos faces up to four years in prison. if convicted, barrientos faces up to four years in prison. next court appearance is scheduled for may 18.


In [19]:
metric.compute(predictions = [result], references = [dataset])

{'rouge1': AggregateScore(low=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), mid=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), high=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512)),
 'rouge2': AggregateScore(low=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), mid=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), high=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551)),
 'rougeL': AggregateScore(low=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), mid=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), high=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856)),
 'rougeLsum': AggregateScore(low=Score(precisi

In [20]:
# With beam search
from rouge import Rouge
r = Rouge()
r.get_scores(result, dataset)

[{'rouge-1': {'f': 0.5088757354075488,
   'p': 0.8349514563106796,
   'r': 0.3659574468085106},
  'rouge-2': {'f': 0.4123711298793921,
   'p': 0.704225352112676,
   'r': 0.2915451895043732},
  'rouge-l': {'f': 0.5029585756442352,
   'p': 0.8252427184466019,
   'r': 0.3617021276595745}}]

### Testing on another PDF

In [21]:
output_string = StringIO()
with open("data/summary_one.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

There were certain underlying conditions that enabled department stores to
grow up when they did. From the start, they all catered for middle-class
customers and set out to convey to them an air of luxury and solid comfort. Of
necessity, they all arose in central positions where large numbers of people
could reach them easily by means of public transport. Physically, they grew up
in an era of big technical developments in building so that they could a ord
multi-storey palaces and could have enormous plate-glass windows for
display, lighting and novelties like lifts.
Above all, the department stores rose with the rise of Victorian white-collar
workers, the small-scale businessmen and professionals whose womenfolk
had money to spare for a few luxuries and were gradually switching the
emphasis of their housekeeping expenditure from food to other items.

Most of these stores drew enough customers to ﬁll their huge shops by o ering
two new things. One was the new manufactures, particularly 

In [22]:
text = output_string.getvalue()
text = text[:-4]
# The "ff" ligature is extracted as a NUL character here ("a ord", "o ering"), so restore it.
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'There were certain underlying conditions that enabled department stores to grow up when they did. From the start, they all catered for middle-class customers and set out to convey to them an air of luxury and solid comfort. Of necessity, they all arose in central positions where large numbers of people could reach them easily by means of public transport. Physically, they grew up in an era of big technical developments in building so that they could afford multi-storey palaces and could have enormous plate-glass windows for display, lighting and novelties like lifts. Above all, the department stores rose with the rise of Victorian white-collar workers, the small-scale businessmen and professionals whose womenfolk had money to spare for a few luxuries and were gradually switching the emphasis of their housekeeping expenditure from food to other items.  Most of these stores drew enough customers to ﬁll their huge shops by offering two new things. One was the new manufactures, particular

In [23]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 bob greene: department stores were a good example of how they grew up. he says they catered for middle-class customers and set out to convey comfort. greene: they grew up in an era of big technical developments in building. department stores rose with the rise of Victorian white-collar workers. womenfolk had money to spare for a few luxuries and were gradually switching the emphasis of their housekeeping expenditure from food to other items. department stores always made it a point to be the first in the field if they could with novelty. department stores also offered a lavish display and wide choice of goods. bob greene: department stores, holland, scotland, england, wales, scotland, ireland, scotland. department stores made virtue of displaying wares as openly as they could. they were not ashamed to make price one of their sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrificial sacrif

{'rouge1': AggregateScore(low=Score(precision=0.728110599078341, recall=0.3665893271461717, fmeasure=0.4876543209876543), mid=Score(precision=0.728110599078341, recall=0.3665893271461717, fmeasure=0.4876543209876543), high=Score(precision=0.728110599078341, recall=0.3665893271461717, fmeasure=0.4876543209876543)),
 'rouge2': AggregateScore(low=Score(precision=0.5879629629629629, recall=0.29534883720930233, fmeasure=0.393188854489164), mid=Score(precision=0.5879629629629629, recall=0.29534883720930233, fmeasure=0.393188854489164), high=Score(precision=0.5879629629629629, recall=0.29534883720930233, fmeasure=0.393188854489164)),
 'rougeL': AggregateScore(low=Score(precision=0.7004608294930875, recall=0.35266821345707655, fmeasure=0.4691358024691358), mid=Score(precision=0.7004608294930875, recall=0.35266821345707655, fmeasure=0.4691358024691358), high=Score(precision=0.7004608294930875, recall=0.35266821345707655, fmeasure=0.4691358024691358)),
 'rougeLsum': AggregateScore(low=Score(prec
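
The summary above degenerates into a long run of one repeated word ("sacrificial ..."). One option is to constrain decoding itself; `generate` in Transformers accepts a `no_repeat_ngram_size` argument for that purpose. As a lighter post-processing alternative, the sketch below collapses immediately repeated words and short phrases after the fact. `collapse_repeats` is a hypothetical helper and was not applied to any output in this notebook.

```python
def collapse_repeats(text: str, max_n: int = 3) -> str:
    """Remove immediate repetitions of word n-grams up to max_n words long."""
    out = []
    for word in text.split():
        out.append(word)
        for n in range(1, max_n + 1):
            # If the last n words equal the n words before them, drop the copy.
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]
                break
    return " ".join(out)


print(collapse_repeats("one of their sacrificial sacrificial sacrificial prices"))
print(collapse_repeats("san francisco, san francisco, san francisco, end"))
```

Note that this also removes legitimate doubled words (for example "had had"), and that cleaning the text after generation does not recover the content the model failed to produce.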

In [24]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 department stores were built in the early 1900s to cater for middle-class customers. they all grew up in central positions where large numbers of people could reach them easily by means of public transport. department stores rose with the rise of white-collar workers. the small-scale businessmen and professionals whose womenfolk had money to spare. most of these stores drew enough customers to fill their huge shops. department stores were the first in the field if they could with novelty of any kind. the department stores, hodgson, hodgson, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san francisco, san wever introduced the vulgar practice of openly marking or ticketing goods with their prices. but the department stores made a virtue not only of displaying their wares as openly as they could but also of boldly pricing them for all to see. large-scale purchases enabled them to sell 

{'rouge1': AggregateScore(low=Score(precision=0.8198198198198198, recall=0.4222737819025522, fmeasure=0.557427258805513), mid=Score(precision=0.8198198198198198, recall=0.4222737819025522, fmeasure=0.557427258805513), high=Score(precision=0.8198198198198198, recall=0.4222737819025522, fmeasure=0.557427258805513)),
 'rouge2': AggregateScore(low=Score(precision=0.6968325791855203, recall=0.3581395348837209, fmeasure=0.4731182795698925), mid=Score(precision=0.6968325791855203, recall=0.3581395348837209, fmeasure=0.4731182795698925), high=Score(precision=0.6968325791855203, recall=0.3581395348837209, fmeasure=0.4731182795698925)),
 'rougeL': AggregateScore(low=Score(precision=0.7927927927927928, recall=0.40835266821345706, fmeasure=0.5390505359877488), mid=Score(precision=0.7927927927927928, recall=0.40835266821345706, fmeasure=0.5390505359877488), high=Score(precision=0.7927927927927928, recall=0.40835266821345706, fmeasure=0.5390505359877488)),
 'rougeLsum': AggregateScore(low=Score(prec

In [25]:
output_string = StringIO()
with open("data/summary_two.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Despite the fact that our planet is habitable only because most of it is
composed of water, it is the oceans that are the most immediately threatened
part of the earth.

It was in the oceans that life ﬁrst began to stir, shielded by the waters from the
sun’s irresistible radiation. It was from the oceans that planets and animals
emerged to colonize the land surface of the planet.  It is the oceans today that
provide the water vapour which, drawn up by the sun, falls upon the earth in
harvest-bringing, life-sustaining rain. The ocean is a major provider of the
oxygen released by its plankton for the beneﬁt of all the species of land, air and
sea – breathing with lungs and gills.

Without special qualities for holding heat, much of the earth would be
uninhabitable.

The oceans are the coolants of the tropics, the bringers of warm currents to
cold regions, the universal moderators of temperature throughout the globe.

The oceans are also indispensable to man because they ﬁrst created the


In [26]:
text = output_string.getvalue()
text = text[:-4]
# Flatten newlines, drop backslashes and form feeds, and map NUL characters to "ff".
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'Despite the fact that our planet is habitable only because most of it is composed of water, it is the oceans that are the most immediately threatened part of the earth.  It was in the oceans that life ﬁrst began to stir, shielded by the waters from the sun’s irresistible radiation. It was from the oceans that planets and animals emerged to colonize the land surface of the planet.  It is the oceans today that provide the water vapour which, drawn up by the sun, falls upon the earth in harvest-bringing, life-sustaining rain. The ocean is a major provider of the oxygen released by its plankton for the beneﬁt of all the species of land, air and sea – breathing with lungs and gills.  Without special qualities for holding heat, much of the earth would be uninhabitable.  The oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe.  The oceans are also indispensable to man because they ﬁrst created the

In [27]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 the oceans provide the water vapour which, drawn up by the sun, falls upon the earth in harvest-bringing, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, life-sustaining, the oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe. without special qualities for holding heat, much of the earth would be uninhabitable. fish could make up a large part of the protein diet required for the world's children, especially those in developing countries. fifty pence of fish is now being pumped into the oceans every year. rcent of the harvest from the oceans is converted to fish meal. today it feeds pigs

{'rouge1': AggregateScore(low=Score(precision=0.6867469879518072, recall=0.3665594855305466, fmeasure=0.47798742138364786), mid=Score(precision=0.6867469879518072, recall=0.3665594855305466, fmeasure=0.47798742138364786), high=Score(precision=0.6867469879518072, recall=0.3665594855305466, fmeasure=0.47798742138364786)),
 'rouge2': AggregateScore(low=Score(precision=0.5757575757575758, recall=0.3064516129032258, fmeasure=0.39999999999999997), mid=Score(precision=0.5757575757575758, recall=0.3064516129032258, fmeasure=0.39999999999999997), high=Score(precision=0.5757575757575758, recall=0.3064516129032258, fmeasure=0.39999999999999997)),
 'rougeL': AggregateScore(low=Score(precision=0.5963855421686747, recall=0.3183279742765273, fmeasure=0.4150943396226415), mid=Score(precision=0.5963855421686747, recall=0.3183279742765273, fmeasure=0.4150943396226415), high=Score(precision=0.5963855421686747, recall=0.3183279742765273, fmeasure=0.4150943396226415)),
 'rougeLsum': AggregateScore(low=Scor

In [28]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 oceans are the most immediately threatened part of the earth. it was in the oceans that life first began to stir. it was from the oceans that planets and animals emerged to colonize the land surface of the planet. the oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe. the oceans are also indispensable to man because they first created the worldwide currents of sebastian. in 1996, sixty-three million metric tons of fish came from the sea. fish could make up a large part of the protein diet required for the world’s children. rcent of the harvest from the oceans is converted to fish meal which ends up feeding pigs and chickens in developed countries. pigs and chickens in developed countries have the chance of a better diet than many 'developing' babies.


{'rouge1': AggregateScore(low=Score(precision=0.9459459459459459, recall=0.45016077170418006, fmeasure=0.6100217864923747), mid=Score(precision=0.9459459459459459, recall=0.45016077170418006, fmeasure=0.6100217864923747), high=Score(precision=0.9459459459459459, recall=0.45016077170418006, fmeasure=0.6100217864923747)),
 'rouge2': AggregateScore(low=Score(precision=0.8231292517006803, recall=0.3903225806451613, fmeasure=0.5295404814004376), mid=Score(precision=0.8231292517006803, recall=0.3903225806451613, fmeasure=0.5295404814004376), high=Score(precision=0.8231292517006803, recall=0.3903225806451613, fmeasure=0.5295404814004376)),
 'rougeL': AggregateScore(low=Score(precision=0.9256756756756757, recall=0.4405144694533762, fmeasure=0.596949891067538), mid=Score(precision=0.9256756756756757, recall=0.4405144694533762, fmeasure=0.596949891067538), high=Score(precision=0.9256756756756757, recall=0.4405144694533762, fmeasure=0.596949891067538)),
 'rougeLsum': AggregateScore(low=Score(prec

In [29]:
output_string = StringIO()
with open("data/summary_three.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

It has been the custom of historians to divide the factors for wars into
immediate and underlying causes. Among these underlying causes, the
economic factor is generally placed at the head of the list. Indeed, the most
important of these was the industrial and commercial rivalry between
Germany and Great Britain.

Germany, after its uniﬁcation in 1871, went through a period of economic
miracle. By 1914, she was producing more iron and steel than Britain and
France combined. In chemicals, in dye, and in the manufacture of scientiﬁc
equipment she led the world. The products of her industries were crowding
British manufactures in nearly every market for continental Europe, in the Far
East and in Britain itself.

There is evidence that certain interests in Great Britain were becoming
seriously alarmed over the menace of German competition.

There seemed to be a strong conviction that Germany was waging deliberate
and deadly economic warfare upon Britain to capture her market by unfair
meth

In [30]:
text = output_string.getvalue()
text = text[:-4]
# Flatten newlines, drop backslashes and form feeds, and map NUL characters to "ff".
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'It has been the custom of historians to divide the factors for wars into immediate and underlying causes. Among these underlying causes, the economic factor is generally placed at the head of the list. Indeed, the most important of these was the industrial and commercial rivalry between Germany and Great Britain.  Germany, after its uniﬁcation in 1871, went through a period of economic miracle. By 1914, she was producing more iron and steel than Britain and France combined. In chemicals, in dye, and in the manufacture of scientiﬁc equipment she led the world. The products of her industries were crowding British manufactures in nearly every market for continental Europe, in the Far East and in Britain itself.  There is evidence that certain interests in Great Britain were becoming seriously alarmed over the menace of German competition.  There seemed to be a strong conviction that Germany was waging deliberate and deadly economic warfare upon Britain to capture her market by unfair met

In [31]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 aaron miller: wars were a war of industrial and commercial rivalry between germany and Great Britain. miller: germany produced more iron and steel than Britain and France combined. he says the wars were a war of industrial and commercial rivalry. miller: wars were a war of economic and political importance. aaron carroll: german competition in the uk is a threat to the uk. he says the uk is alarmed by the threat of german competition. carroll: the uk is a great example of a 'deadly economic warfare' aaron miller: o allow germany to win in this struggle would mean the destruction of her prosperity. miller: there are indications that the french were alarmed by the German industrial expansion. he says the french were afraid that their enemy might eventually reach ousted. miller: o allow germany to win in this struggle would mean the destruction of her national existence. russia and other ally of germany were rivals for a monopoly of trade with the balkan kingdoms. russia and Austria a cl

{'rouge1': AggregateScore(low=Score(precision=0.5990566037735849, recall=0.3735294117647059, fmeasure=0.46014492753623193), mid=Score(precision=0.5990566037735849, recall=0.3735294117647059, fmeasure=0.46014492753623193), high=Score(precision=0.5990566037735849, recall=0.3735294117647059, fmeasure=0.46014492753623193)),
 'rouge2': AggregateScore(low=Score(precision=0.36492890995260663, recall=0.22713864306784662, fmeasure=0.28), mid=Score(precision=0.36492890995260663, recall=0.22713864306784662, fmeasure=0.28), high=Score(precision=0.36492890995260663, recall=0.22713864306784662, fmeasure=0.28)),
 'rougeL': AggregateScore(low=Score(precision=0.4669811320754717, recall=0.2911764705882353, fmeasure=0.3586956521739131), mid=Score(precision=0.4669811320754717, recall=0.2911764705882353, fmeasure=0.3586956521739131), high=Score(precision=0.4669811320754717, recall=0.2911764705882353, fmeasure=0.3586956521739131)),
 'rougeLsum': AggregateScore(low=Score(precision=0.4669811320754717, recall=

In [32]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 historians divide factors for wars into immediate and underlying causes. most important of these was the industrial and commercial rivalry between Germany and Great Britain. by 1914, she was producing more iron and steel than Britain and France combined. the products of her industries crowding British manufactures in nearly every market for continental Europe, in the far East and in Britain itself. there seems to be a strong conviction that Germany was waging deliberate and deadly economic warfare upon Britain to capture her market by unfair methods. the French were alarmed by the German industrial expansion. in 1870, the iron and coal deposit of Lorraine went to swell the industrial growth of Germany. the French were afraid that their enemy might eventually reach ousted. the Russian ambition to gain control of Constantinople conflicted with german plans for reserving the Turkish Empire as their happy hunting ground of commercial privilege. the monopoly of trade with the Balkan kingdo

{'rouge1': AggregateScore(low=Score(precision=0.9111111111111111, recall=0.4823529411764706, fmeasure=0.6307692307692309), mid=Score(precision=0.9111111111111111, recall=0.4823529411764706, fmeasure=0.6307692307692309), high=Score(precision=0.9111111111111111, recall=0.4823529411764706, fmeasure=0.6307692307692309)),
 'rouge2': AggregateScore(low=Score(precision=0.7988826815642458, recall=0.4218289085545723, fmeasure=0.5521235521235521), mid=Score(precision=0.7988826815642458, recall=0.4218289085545723, fmeasure=0.5521235521235521), high=Score(precision=0.7988826815642458, recall=0.4218289085545723, fmeasure=0.5521235521235521)),
 'rougeL': AggregateScore(low=Score(precision=0.8722222222222222, recall=0.46176470588235297, fmeasure=0.6038461538461538), mid=Score(precision=0.8722222222222222, recall=0.46176470588235297, fmeasure=0.6038461538461538), high=Score(precision=0.8722222222222222, recall=0.46176470588235297, fmeasure=0.6038461538461538)),
 'rougeLsum': AggregateScore(low=Score(p
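
The scores above pair a high precision (about 0.91 for rouge1) with a much lower recall (about 0.48). That pattern is expected when a short summary is scored against the full source text as its reference: most summary words appear in the source, but the summary covers only part of it. The sketch below is a simplified, pure-Python unigram overlap that illustrates the idea. It is not the `rouge_score` implementation, which uses its own tokenization, and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Every summary word occurs in the reference, so precision is 1.0,
# but only 3 of the 7 reference words are covered, so recall is 3/7.
scores = rouge1("mining is destructive",
                "mining is among the most destructive industries")
print(scores)
```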

In [33]:
output_string = StringIO()
with open("data/summary_four.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # TextConverter writes the extracted text of each page into output_string.
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Mining ranks among the world’s most destructive industries. Yet mineral
extraction and processing are absent in most discussions of global
environmental threats. Governmental and private analyses have focused only
on increasing mineral supplies.

Each year, mining strips some 28 billion tons of material from the earth. This
is more than what is removed by the natural erosion of all the earth’s rivers.
Worldwide, mining and smelting generate an estimated 2.7 billion tons of
processing waste each year, much of it hazardous dwarﬁng the more familiar
municipal waste. Smelter pollution has created biological wastelands as large
as 10, 000 hectares and pumped some eight percent of the total worldwide
emissions of sulphur dioxide, a major contributor to acid rain, into the
atmosphere.

Mining could also cause more damaging deforestation than bad farming
practices in certain parts of the world. For example, smelters at a single iron
mine in Brazil will require enough fuelwood to deforest 50,00

In [34]:
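text = output_string.getvalue()
text = text[:-4]  # drop the last four characters of the extracted text
# Turn newlines into spaces, then strip backslashes and form feeds ('\x0c').
# NUL characters ('\x00') are replaced with the literal string 'ff'.
# A separate replace('\n\n', '') would be a no-op here, since every newline is already a space.
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset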

'Mining ranks among the world’s most destructive industries. Yet mineral extraction and processing are absent in most discussions of global environmental threats. Governmental and private analyses have focused only on increasing mineral supplies.  Each year, mining strips some 28 billion tons of material from the earth. This is more than what is removed by the natural erosion of all the earth’s rivers. Worldwide, mining and smelting generate an estimated 2.7 billion tons of processing waste each year, much of it hazardous dwarﬁng the more familiar municipal waste. Smelter pollution has created biological wastelands as large as 10, 000 hectares and pumped some eight percent of the total worldwide emissions of sulphur dioxide, a major contributor to acid rain, into the atmosphere.  Mining could also cause more damaging deforestation than bad farming practices in certain parts of the world. For example, smelters at a single iron mine in Brazil will require enough fuelwood to deforest 50,0
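
The cleaned text above still contains the `ﬁ` ligature (in "dwarﬁng") and doubled spaces where the paragraph breaks were. A minimal sketch of a stricter cleanup follows, assuming Unicode NFKC normalization and whitespace collapsing are acceptable for this text. `normalize_text` is a hypothetical helper, not part of the original notebook, and changing the reference text this way would also change the ROUGE scores reported here.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Expand compatibility ligatures and collapse whitespace runs to single spaces."""
    # NFKC maps compatibility characters such as U+FB01 (the "fi" ligature) to plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Newlines, form feeds, and repeated spaces all match \s.
    return re.sub(r"\s+", " ", text).strip()


sample = ("much of it hazardous dwar\ufb01ng the more\n"
          "familiar municipal waste.\n\nSmelter pollution\x0c")
print(normalize_text(sample))
```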

In [35]:

n = 512  # chunk size in characters, not tokens
bank = [dataset[i:i + n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    # T5 expects a task prefix; each chunk is truncated to at most 512 tokens.
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40)
    summary.append(tokenizer.decode(outputs[0]))

# Join the per-chunk summaries and strip T5's special tokens.
result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

# Score the combined summary against the full source text.
metric.compute(predictions=[result], references=[dataset])

 mining and smelting generate an estimated 2.7 billion tons of processing waste each year. this is more than what is removed by the natural erosion of all the earth’s rivers. smelter pollution has created biological wastelands as large as 10, 000 hectares. mining could cause more damaging deforestation than bad farming practices. mining could cause more damaging deforestation than bad farming practices. orest mines are poorly regulated even in wealthy industrialized nations. prices of minerals do not include their full environmental cost. mining has been poorly regulated even in wealthy industrialized nations. governments should remove subsidies provided for mining virgin minerals. the devastating effects of the industry are particularly severe in the developing countries. this is because environmental controls tend to be weak or non-existent. many of the world's poorest nations are among the world's poorest. they are among the world's most heavily indebted countries. planes are a majo

{'rouge1': AggregateScore(low=Score(precision=0.8348214285714286, recall=0.4146341463414634, fmeasure=0.554074074074074), mid=Score(precision=0.8348214285714286, recall=0.4146341463414634, fmeasure=0.554074074074074), high=Score(precision=0.8348214285714286, recall=0.4146341463414634, fmeasure=0.554074074074074)),
 'rouge2': AggregateScore(low=Score(precision=0.6860986547085202, recall=0.34, fmeasure=0.4546805349182764), mid=Score(precision=0.6860986547085202, recall=0.34, fmeasure=0.45468053491827637), high=Score(precision=0.6860986547085202, recall=0.34, fmeasure=0.4546805349182764)),
 'rougeL': AggregateScore(low=Score(precision=0.7008928571428571, recall=0.34811529933481156, fmeasure=0.46518518518518515), mid=Score(precision=0.7008928571428571, recall=0.34811529933481156, fmeasure=0.46518518518518515), high=Score(precision=0.7008928571428571, recall=0.34811529933481156, fmeasure=0.46518518518518515)),
 'rougeLsum': AggregateScore(low=Score(precision=0.7008928571428571, recall=0.348
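
The summary above contains fragments such as "orest mines", which may stem from the fixed 512-character slices cutting words in half. Note also that `n = 512` counts characters, while the tokenizer's `max_length=512` counts tokens. A minimal sketch of chunking on sentence boundaries instead is shown below. `chunk_sentences` and its naive sentence split are assumptions for illustration, not part of the original notebook.

```python
import re


def chunk_sentences(text: str, max_chars: int = 512) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Either the sentence fits, or it alone exceeds max_chars and is
            # kept whole rather than being cut mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "Mining is destructive. It strips material from the earth. Smelters pollute."
for chunk in chunk_sentences(demo, max_chars=40):
    print(repr(chunk))
```

The resulting list could stand in for `bank` in the cell above, so that each chunk passed to the tokenizer starts and ends on a sentence boundary.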

References
1. [What is the ROUGE score?](https://www.youtube.com/watch?v=TMshhnrEXlg)
2. [Interpreting ROUGE scores](https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores)
3. [How others tested their text summarizer after using ROUGE](https://link.springer.com/article/10.1007/s10579-017-9389-4)