## Transformer

The Hugging Face tooling used here (Transformers together with the `datasets` library) reads only the following formats:

1. txt
2. csv
3. tsv
4. jsonl
5. json
6. xml

[Source](https://huggingface.co/docs/datasets/dataset_script.html)

To read PDF files, you first have to convert them to one of the formats above.

In this notebook we extract the text from each PDF and work with it as plain text.
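Once the text has been extracted, it can be stored in one of the listed formats. As a minimal sketch, the snippet below writes already-extracted strings to a `.jsonl` file, one record per document. The function name `save_as_jsonl`, the `text` field name, and the output path are hypothetical choices for illustration and are not part of the original notebook; only the standard library is used.

```python
import json
from pathlib import Path


def save_as_jsonl(texts, path):
    """Write one JSON object per line, each holding one document's text.

    `texts` is an iterable of strings; the "text" key is an assumed field name.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path
```

Calling `save_as_jsonl([text], "data/summary.jsonl")` with the extracted string would produce a one-line file; the path here is only an example.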

In [None]:
!pip install transformers
!pip install datasets
!pip install pdfminer
!pip install rouge_score
!pip install rouge

In [2]:
from io import StringIO

from datasets import load_metric
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [36]:
output_string = StringIO()
with open("data/summary text.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # TextConverter writes the extracted text of each page into output_string
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester
County, New York.
A year later, she got married again in Westchester County, but to a different man and without
divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five
more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license,
she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the
first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her
attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos 

In [4]:
metric = load_metric('rouge')

In [37]:
text = output_string.getvalue()

In [38]:
# Drop the final four characters of the extracted text (presumably trailing layout characters)
text = text[:-4]

In [39]:
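# Flatten line breaks to spaces and drop backslashes and form feeds.
# (No separate '\n\n' replacement is needed: every '\n' has already become a space.)
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '')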

In [40]:
n = 512  # chunk size in characters (not tokens)
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]
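Slicing every 512 characters can cut a word in half at a chunk boundary. A minimal sketch of an alternative that breaks only on whitespace is shown below; `chunk_on_words` is a hypothetical helper, not part of the notebook, and its limit is still measured in characters rather than tokens.

```python
def chunk_on_words(text, max_chars=512):
    """Split `text` into chunks of at most `max_chars` characters,
    breaking only between whitespace-separated words.

    A single word longer than `max_chars` is kept whole in its own chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The result could be used in place of `bank`, for example `bank = chunk_on_words(dataset, 512)`. Note that runs of whitespace are collapsed to single spaces as a side effect of `split()`.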

In [43]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

In [44]:
tokenizer = AutoTokenizer.from_pretrained("t5-base")

In [45]:
summary = []
for data in bank:
    # T5 is prompted with a task prefix; inputs longer than 512 tokens are truncated
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

# Join the per-chunk summaries and strip T5's special tokens
result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

 liana barrientos got married in the westchester county, new york, in 2010. she got married again in the westchester county, but to a different man. in 2010, she married once more, this time in the Bronx. barrientos is accused of "offering a false instrument for filing in the first degree" she pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" the marriages were part of an immigration scam, she says. in total, barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. she is believed to still be married to four men, and at one time, she was married to eight men at once. the case was referred to the Bronx District Attorney's Office. seven of the men are from so-called "red-flagged" countries. it is unclear whether any of the men will be prosecuted. barrientos faces up to four years in prison if convicted. her eighth husband, Rashid Rajput, was deported to his native Pakistan in 2006. if convicted, she f

In [46]:

metric.compute(predictions = [result], references = [dataset])

{'rouge1': AggregateScore(low=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327), mid=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327), high=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327)),
 'rouge2': AggregateScore(low=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921), mid=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921), high=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921)),
 'rougeL': AggregateScore(low=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595), mid=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595), high=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595)),
 'rougeLsum': AggregateScore(low=Score(prec

In [47]:
from rouge import Rouge
r = Rouge()
r.get_scores(result, dataset)

[{'rouge-1': {'f': 0.5266272146974896,
   'p': 0.8640776699029126,
   'r': 0.37872340425531914},
  'rouge-2': {'f': 0.40725806024982114,
   'p': 0.6601307189542484,
   'r': 0.2944606413994169},
  'rouge-l': {'f': 0.5147928951708624,
   'p': 0.8446601941747572,
   'r': 0.3702127659574468}}]
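To make the `p`, `r`, and `f` values easier to read, here is a minimal sketch of ROUGE-1 computed by hand: precision is the share of summary unigrams that also appear in the reference, and recall is the share of reference unigrams covered by the summary. The function `rouge1` and its lowercase whitespace tokenization are simplifying assumptions; the `rouge` and `rouge_score` packages may tokenize and normalize differently, so their numbers will not necessarily match this sketch.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Return (precision, recall, f1) for unigram overlap with clipped counts."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # per-word minimum of the two counts
    p = overlap / sum(pred.values()) if pred else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Every word of the short prediction occurs in the reference (precision 1.0),
# but it covers only half of the reference's six words (recall 0.5).
p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(p, r, f)
```

Because the notebook scores each summary against the full source text, a short summary built largely from copied phrases gives high precision and lower recall, which is the pattern in the scores above.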

### Trying out a new model

In [48]:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [49]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")

In [50]:
summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

 a year later, she got married again in westchester county, new york. only 18 days after that marriage, she got hitched yet again. in 2010, she married once more, this time in the Bronx. barrientos, now 39, is facing two criminal counts of "offering a false instrument" referring to her false statements on the 2010 marriage license application. the marriages were part of an immigration scam, her attorney says. in total, barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. she is believed to still be married to four men, and at one time, she was married to eight men at once. seven of the men are from so-called "red-flagged" countries. seven of the men are from so-called "red-flagged" countries. the case was referred to the district attorney's office. if convicted, barrientos faces up to four years in prison. if convicted, barrientos faces up to four years in prison. next court appearance is scheduled for may 18.


In [19]:
metric.compute(predictions = [result], references = [dataset])

{'rouge1': AggregateScore(low=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), mid=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), high=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512)),
 'rouge2': AggregateScore(low=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), mid=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), high=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551)),
 'rougeL': AggregateScore(low=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), mid=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), high=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856)),
 'rougeLsum': AggregateScore(low=Score(precisi

In [20]:
#with beam
from rouge import Rouge
r = Rouge()
r.get_scores(result, dataset)

[{'rouge-1': {'f': 0.5088757354075488,
   'p': 0.8349514563106796,
   'r': 0.3659574468085106},
  'rouge-2': {'f': 0.4123711298793921,
   'p': 0.704225352112676,
   'r': 0.2915451895043732},
  'rouge-l': {'f': 0.5029585756442352,
   'p': 0.8252427184466019,
   'r': 0.3617021276595745}}]

### Testing on another PDF

In [21]:
output_string = StringIO()
with open("data/summary_one.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

There were certain underlying conditions that enabled department stores to
grow up when they did. From the start, they all catered for middle-class
customers and set out to convey to them an air of luxury and solid comfort. Of
necessity, they all arose in central positions where large numbers of people
could reach them easily by means of public transport. Physically, they grew up
in an era of big technical developments in building so that they could a ord
multi-storey palaces and could have enormous plate-glass windows for
display, lighting and novelties like lifts.
Above all, the department stores rose with the rise of Victorian white-collar
workers, the small-scale businessmen and professionals whose womenfolk
had money to spare for a few luxuries and were gradually switching the
emphasis of their housekeeping expenditure from food to other items.

Most of these stores drew enough customers to ﬁll their huge shops by o ering
two new things. One was the new manufactures, particularly 

In [22]:
text = output_string.getvalue()
text = text[:-4]
# In this PDF the 'ff' ligature comes out as a NUL character ('\x00'), so map it back to 'ff'
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'There were certain underlying conditions that enabled department stores to grow up when they did. From the start, they all catered for middle-class customers and set out to convey to them an air of luxury and solid comfort. Of necessity, they all arose in central positions where large numbers of people could reach them easily by means of public transport. Physically, they grew up in an era of big technical developments in building so that they could afford multi-storey palaces and could have enormous plate-glass windows for display, lighting and novelties like lifts. Above all, the department stores rose with the rise of Victorian white-collar workers, the small-scale businessmen and professionals whose womenfolk had money to spare for a few luxuries and were gradually switching the emphasis of their housekeeping expenditure from food to other items.  Most of these stores drew enough customers to ﬁll their huge shops by offering two new things. One was the new manufactures, particular
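The cleaned text still contains typographic ligatures such as `ﬁ` (U+FB01). One option, sketched here as an assumption rather than a step from the notebook, is Unicode NFKC normalization from the standard library, which expands such compatibility characters into plain letters. It would not recover the `\x00` characters handled by the `replace` call, since those no longer carry any ligature information. `expand_ligatures` is a hypothetical helper name.

```python
import unicodedata


def expand_ligatures(text):
    """Apply NFKC normalization, which maps compatibility characters such as
    the 'ﬁ' ligature (U+FB01) to their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("to \ufb01ll their huge shops"))  # to fill their huge shops
```

NFKC also rewrites other compatibility characters (for example, a non-breaking space becomes an ordinary space), so it is worth checking that this is acceptable for the text at hand.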

In [51]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 liana barrientos got married in the westchester county, new york, in 2010. she got married again in the westchester county, but to a different man. in 2010, she married once more, this time in the Bronx. barrientos is accused of "offering a false instrument for filing in the first degree" she pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" the marriages were part of an immigration scam, she says. in total, barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. she is believed to still be married to four men, and at one time, she was married to eight men at once. the case was referred to the Bronx District Attorney's Office. seven of the men are from so-called "red-flagged" countries. it is unclear whether any of the men will be prosecuted. barrientos faces up to four years in prison if convicted. her eighth husband, Rashid Rajput, was deported to his native Pakistan in 2006. if convicted, she f

{'rouge1': AggregateScore(low=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327), mid=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327), high=Score(precision=0.9347826086956522, recall=0.45989304812834225, fmeasure=0.6164874551971327)),
 'rouge2': AggregateScore(low=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921), mid=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921), high=Score(precision=0.7213114754098361, recall=0.353887399463807, fmeasure=0.4748201438848921)),
 'rougeL': AggregateScore(low=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595), mid=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595), high=Score(precision=0.7554347826086957, recall=0.3716577540106952, fmeasure=0.4982078853046595)),
 'rougeLsum': AggregateScore(low=Score(prec

In [52]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 a year later, she got married again in westchester county, new york. only 18 days after that marriage, she got hitched yet again. in 2010, she married once more, this time in the Bronx. barrientos, now 39, is facing two criminal counts of "offering a false instrument" referring to her false statements on the 2010 marriage license application. the marriages were part of an immigration scam, her attorney says. in total, barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. she is believed to still be married to four men, and at one time, she was married to eight men at once. seven of the men are from so-called "red-flagged" countries. seven of the men are from so-called "red-flagged" countries. the case was referred to the district attorney's office. if convicted, barrientos faces up to four years in prison. if convicted, barrientos faces up to four years in prison. next court appearance is scheduled for may 18.


{'rouge1': AggregateScore(low=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), mid=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512), high=Score(precision=0.9112426035502958, recall=0.4117647058823529, fmeasure=0.567219152854512)),
 'rouge2': AggregateScore(low=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), mid=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551), high=Score(precision=0.8035714285714286, recall=0.36193029490616624, fmeasure=0.4990757855822551)),
 'rougeL': AggregateScore(low=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), mid=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856), high=Score(precision=0.834319526627219, recall=0.3770053475935829, fmeasure=0.5193370165745856)),
 'rougeLsum': AggregateScore(low=Score(precisi
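The chunk, summarize, and join steps are repeated for every PDF and model in this notebook. A minimal sketch of wrapping them in one function follows; `summarize_document` and its `summarize_chunk` callable are hypothetical names, and the callable stands in for the tokenizer and `model.generate` calls so that the sketch runs without downloading a model. Unlike the cells above, it joins chunk summaries with a single space.

```python
def summarize_document(text, summarize_chunk, n=512):
    """Split `text` into n-character chunks, summarize each, and join the results.

    `summarize_chunk` is any callable mapping a chunk string to a summary string.
    """
    bank = [text[i:i + n] for i in range(0, len(text), n)]
    return " ".join(summarize_chunk(chunk) for chunk in bank)


# Stand-in summarizer for illustration only: keep the first three words of each chunk.
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])


print(summarize_document("one two three four five six", first_three_words, n=14))
```

With a real model, `summarize_chunk` would wrap the tokenize, generate, and decode calls used in the cells above. Passing the text in as an argument also avoids a cell silently reusing a `dataset` variable left over from an earlier PDF.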

In [53]:
output_string = StringIO()
with open("data/summary_two.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Despite the fact that our planet is habitable only because most of it is
composed of water, it is the oceans that are the most immediately threatened
part of the earth.

It was in the oceans that life ﬁrst began to stir, shielded by the waters from the
sun’s irresistible radiation. It was from the oceans that planets and animals
emerged to colonize the land surface of the planet.  It is the oceans today that
provide the water vapour which, drawn up by the sun, falls upon the earth in
harvest-bringing, life-sustaining rain. The ocean is a major provider of the
oxygen released by its plankton for the beneﬁt of all the species of land, air and
sea – breathing with lungs and gills.

Without special qualities for holding heat, much of the earth would be
uninhabitable.

The oceans are the coolants of the tropics, the bringers of warm currents to
cold regions, the universal moderators of temperature throughout the globe.

The oceans are also indispensable to man because they ﬁrst created the


In [54]:
text = output_string.getvalue()
text = text[:-4]
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'Despite the fact that our planet is habitable only because most of it is composed of water, it is the oceans that are the most immediately threatened part of the earth.  It was in the oceans that life ﬁrst began to stir, shielded by the waters from the sun’s irresistible radiation. It was from the oceans that planets and animals emerged to colonize the land surface of the planet.  It is the oceans today that provide the water vapour which, drawn up by the sun, falls upon the earth in harvest-bringing, life-sustaining rain. The ocean is a major provider of the oxygen released by its plankton for the beneﬁt of all the species of land, air and sea – breathing with lungs and gills.  Without special qualities for holding heat, much of the earth would be uninhabitable.  The oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe.  The oceans are also indispensable to man because they ﬁrst created the

In [55]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 the oceans provide the water vapour which, drawn up by the sun, falls upon the earth in harvest-bringing, life-bringing, harvest-bringing, life-sustaining ways. it is the oceans today that provide the water vapour which, drawn up by the sun, falls upon the earth in harvest-bringing, life-sustaining ways. the oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe. without special qualities for holding heat, much of the earth would be uninhabitable. in 1996, sixty-three million metric tons of fish came from the sea. fish could make up a large part of the protein diet required for the world’s children, especially those in developing countries, at a very low cost. rcent of the harvest from the oceans is converted to fish meal which today ends up feeding pigs and chickens in developed countries. it is very sad that ‘developed’ animal pets have the chance of a better diet than very many ‘developing’

{'rouge1': AggregateScore(low=Score(precision=0.9011627906976745, recall=0.4983922829581994, fmeasure=0.6418219461697724), mid=Score(precision=0.9011627906976745, recall=0.4983922829581994, fmeasure=0.6418219461697724), high=Score(precision=0.9011627906976745, recall=0.4983922829581994, fmeasure=0.6418219461697724)),
 'rouge2': AggregateScore(low=Score(precision=0.8011695906432749, recall=0.44193548387096776, fmeasure=0.5696465696465697), mid=Score(precision=0.8011695906432749, recall=0.44193548387096776, fmeasure=0.5696465696465697), high=Score(precision=0.8011695906432749, recall=0.44193548387096776, fmeasure=0.5696465696465697)),
 'rougeL': AggregateScore(low=Score(precision=0.7965116279069767, recall=0.4405144694533762, fmeasure=0.567287784679089), mid=Score(precision=0.7965116279069767, recall=0.4405144694533762, fmeasure=0.567287784679089), high=Score(precision=0.7965116279069767, recall=0.4405144694533762, fmeasure=0.567287784679089)),
 'rougeLsum': AggregateScore(low=Score(prec

In [56]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 the oceans today provide the water vapour which falls upon the earth in harvest-bringing, life-bringing. it was in the oceans that life first began to stir, shielded by the waters from the sun’s irresistible radiation. the oceans are the coolants of the tropics, the bringers of warm currents to cold regions, the universal moderators of temperature throughout the globe. the oceans are also indispensable to man because they first created the worldwide currents of sebastian. in 1996, sixty-three million metric tons of fish came from the sea. fish could make up a large part of the protein diet required for the world’s children at a very low cost. rcent of the harvest from the oceans is converted to fish meal which ends up feeding pigs and chickens in developed countries. it is sad that ‘developed’ animal pets have the chance of a better diet than very many ‘developing’ babies.


{'rouge1': AggregateScore(low=Score(precision=0.9545454545454546, recall=0.47266881028938906, fmeasure=0.6322580645161291), mid=Score(precision=0.9545454545454546, recall=0.47266881028938906, fmeasure=0.6322580645161291), high=Score(precision=0.9545454545454546, recall=0.47266881028938906, fmeasure=0.6322580645161291)),
 'rouge2': AggregateScore(low=Score(precision=0.8562091503267973, recall=0.42258064516129035, fmeasure=0.5658747300215983), mid=Score(precision=0.8562091503267973, recall=0.42258064516129035, fmeasure=0.5658747300215983), high=Score(precision=0.8562091503267973, recall=0.42258064516129035, fmeasure=0.5658747300215983)),
 'rougeL': AggregateScore(low=Score(precision=0.8831168831168831, recall=0.43729903536977494, fmeasure=0.5849462365591398), mid=Score(precision=0.8831168831168831, recall=0.43729903536977494, fmeasure=0.5849462365591398), high=Score(precision=0.8831168831168831, recall=0.43729903536977494, fmeasure=0.5849462365591398)),
 'rougeLsum': AggregateScore(low=S

In [57]:
output_string = StringIO()
with open("data/summary_three.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

It has been the custom of historians to divide the factors for wars into
immediate and underlying causes. Among these underlying causes, the
economic factor is generally placed at the head of the list. Indeed, the most
important of these was the industrial and commercial rivalry between
Germany and Great Britain.

Germany, after its uniﬁcation in 1871, went through a period of economic
miracle. By 1914, she was producing more iron and steel than Britain and
France combined. In chemicals, in dye, and in the manufacture of scientiﬁc
equipment she led the world. The products of her industries were crowding
British manufactures in nearly every market for continental Europe, in the Far
East and in Britain itself.

There is evidence that certain interests in Great Britain were becoming
seriously alarmed over the menace of German competition.

There seemed to be a strong conviction that Germany was waging deliberate
and deadly economic warfare upon Britain to capture her market by unfair
meth

In [58]:
text = output_string.getvalue()
text = text[:-4]
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset

'It has been the custom of historians to divide the factors for wars into immediate and underlying causes. Among these underlying causes, the economic factor is generally placed at the head of the list. Indeed, the most important of these was the industrial and commercial rivalry between Germany and Great Britain.  Germany, after its uniﬁcation in 1871, went through a period of economic miracle. By 1914, she was producing more iron and steel than Britain and France combined. In chemicals, in dye, and in the manufacture of scientiﬁc equipment she led the world. The products of her industries were crowding British manufactures in nearly every market for continental Europe, in the Far East and in Britain itself.  There is evidence that certain interests in Great Britain were becoming seriously alarmed over the menace of German competition.  There seemed to be a strong conviction that Germany was waging deliberate and deadly economic warfare upon Britain to capture her market by unfair met

In [59]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 the industrial and commercial rivalry between Germany and Great Britain. by 1914, she was producing more iron and steel than Britain and France combined. by 1914, she was producing more iron and steel than Britain and France combined. david rothkopf: her industries were crowding British manufactures in nearly every market. he says there seemed to be a conviction that Germany was waging deadly economic warfare. rothkopf: for Britain to teeter on the edge of German competition, it would be a mistake. he says it would be a mistake for the u.s. to rely on british manufacturers. frida ghitis: o allow germany to be victorious in this struggle would mean the destruction of her prosperity. ghitis: there are indications that the french also were alarmed by the German industrial expansion. ghitis: o allow Germany to be victorious would mean the destruction of her national existence. ghitis: o allow Germany to be victorious in this struggle would mean the destruction of her prosperity. frida ghi

{'rouge1': AggregateScore(low=Score(precision=0.5954198473282443, recall=0.4588235294117647, fmeasure=0.5182724252491694), mid=Score(precision=0.5954198473282443, recall=0.4588235294117647, fmeasure=0.5182724252491694), high=Score(precision=0.5954198473282443, recall=0.4588235294117647, fmeasure=0.5182724252491694)),
 'rouge2': AggregateScore(low=Score(precision=0.44061302681992337, recall=0.3392330383480826, fmeasure=0.38333333333333336), mid=Score(precision=0.44061302681992337, recall=0.3392330383480826, fmeasure=0.38333333333333336), high=Score(precision=0.44061302681992337, recall=0.3392330383480826, fmeasure=0.38333333333333336)),
 'rougeL': AggregateScore(low=Score(precision=0.4961832061068702, recall=0.38235294117647056, fmeasure=0.4318936877076412), mid=Score(precision=0.4961832061068702, recall=0.38235294117647056, fmeasure=0.4318936877076412), high=Score(precision=0.4961832061068702, recall=0.38235294117647056, fmeasure=0.4318936877076412)),
 'rougeLsum': AggregateScore(low=S

In [60]:

n = 512
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    summary.append(tokenizer.decode(outputs[0]))

result = "".join(summary)
result = result.replace('<pad>', '').replace('</s>', '')

print(result)

metric.compute(predictions=[result], references=[dataset])

 it has been the custom of historians to divide the factors for wars into immediate and underlying causes. the most important of these was the industrial and commercial rivalry between Germany and Great Britain. by 1914, she was producing more iron and steel than Britain and France combined. the products of her industries crowding British manufactures in nearly every market for continental Europe, in the far East and in Britain itself. there seems to be a strong conviction that Germany was waging deliberate and deadly economic warfare upon Britain to capture her market by unfair methods. in 1870, France had lost possession of the expensive iron and coal deposit of Lorraine. in 1870, the French had plenty of iron left in the Briery Fields. but they were afraid that their enemy might eventually reach ousted. france was under necessity of importing coal and this galled her pride almost as much as the loss of the iron. the Russian ambition to gain control of Constantinople conflicted with 

{'rouge1': AggregateScore(low=Score(precision=0.9267015706806283, recall=0.5205882352941177, fmeasure=0.6666666666666666), mid=Score(precision=0.9267015706806283, recall=0.5205882352941177, fmeasure=0.6666666666666666), high=Score(precision=0.9267015706806283, recall=0.5205882352941177, fmeasure=0.6666666666666666)),
 'rouge2': AggregateScore(low=Score(precision=0.8736842105263158, recall=0.4896755162241888, fmeasure=0.6275992438563327), mid=Score(precision=0.8736842105263158, recall=0.4896755162241888, fmeasure=0.6275992438563327), high=Score(precision=0.8736842105263158, recall=0.4896755162241888, fmeasure=0.6275992438563327)),
 'rougeL': AggregateScore(low=Score(precision=0.9214659685863874, recall=0.5176470588235295, fmeasure=0.6629001883239172), mid=Score(precision=0.9214659685863874, recall=0.5176470588235295, fmeasure=0.6629001883239172), high=Score(precision=0.9214659685863874, recall=0.5176470588235295, fmeasure=0.6629001883239172)),
 'rougeLsum': AggregateScore(low=Score(prec
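
To make the precision, recall and fmeasure fields above easier to read, here is a minimal sketch of how a ROUGE-N style score can be computed by hand. It is not the `rouge_score` implementation: the `tokenize` and `rouge_n` helpers are hypothetical names, and the tokenizer is a simplification (lowercased alphanumeric runs, no stemming). It does show why a short summary built from the source's own sentences tends to get high precision and much lower recall when the full source text is used as the reference.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, fmeasure) from clipped n-gram overlap."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    pred_ngrams = Counter(
        tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1)
    )
    ref_ngrams = Counter(
        tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
    )
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(pred_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure


reference = "mining strips 28 billion tons of material from the earth each year"
prediction = "mining strips 28 billion tons"
precision, recall, fmeasure = rouge_n(prediction, reference, n=1)
print(precision, recall, fmeasure)
```

Every word of the prediction appears in the reference, so precision is 1.0, while recall is only 5/12 because the prediction covers five of the reference's twelve words.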

In [61]:
output_string = StringIO()
with open("data/summary_four.pdf", "rb") as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # TextConverter writes the extracted text of each page into output_string
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Mining ranks among the world’s most destructive industries. Yet mineral
extraction and processing are absent in most discussions of global
environmental threats. Governmental and private analyses have focused only
on increasing mineral supplies.

Each year, mining strips some 28 billion tons of material from the earth. This
is more than what is removed by the natural erosion of all the earth’s rivers.
Worldwide, mining and smelting generate an estimated 2.7 billion tons of
processing waste each year, much of it hazardous dwarﬁng the more familiar
municipal waste. Smelter pollution has created biological wastelands as large
as 10, 000 hectares and pumped some eight percent of the total worldwide
emissions of sulphur dioxide, a major contributor to acid rain, into the
atmosphere.

Mining could also cause more damaging deforestation than bad farming
practices in certain parts of the world. For example, smelters at a single iron
mine in Brazil will require enough fuelwood to deforest 50,00

In [62]:
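text = output_string.getvalue()
text = text[:-4]  # drop the last four characters of the extracted text
# Flatten newlines to spaces, then strip backslashes and form feeds (\x0c); NUL bytes (\x00) are mapped to 'ff'
dataset = text.replace('\n', ' ').replace('\\', '').replace('\x0c', '').replace('\x00', 'ff')
dataset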

'Mining ranks among the world’s most destructive industries. Yet mineral extraction and processing are absent in most discussions of global environmental threats. Governmental and private analyses have focused only on increasing mineral supplies.  Each year, mining strips some 28 billion tons of material from the earth. This is more than what is removed by the natural erosion of all the earth’s rivers. Worldwide, mining and smelting generate an estimated 2.7 billion tons of processing waste each year, much of it hazardous dwarﬁng the more familiar municipal waste. Smelter pollution has created biological wastelands as large as 10, 000 hectares and pumped some eight percent of the total worldwide emissions of sulphur dioxide, a major contributor to acid rain, into the atmosphere.  Mining could also cause more damaging deforestation than bad farming practices in certain parts of the world. For example, smelters at a single iron mine in Brazil will require enough fuelwood to deforest 50,0

In [63]:

n = 512  # chunk size in characters (not tokens)
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

summary = []
for data in bank:
    # T5 expects a task prefix; truncation caps each input at 512 tokens
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    # skip_special_tokens drops <pad> and </s>, so no manual clean-up is needed
    summary.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

result = " ".join(summary)

print(result)

# Score the joined summary against the full source text
metric.compute(predictions = [result], references = [dataset])

 each year, mining strips some 28 billion tons of material from the earth. this is more than what is removed by the natural erosion of all the earth’s rivers. mining and smelting generate an estimated 2.7 billion tons of processing waste each year. smelter pollution has created biological wastelands as large as 10, 000 hectares. mining could cause more damaging deforestation than bad farming practices. mining could cause more damaging deforestation than bad farming practices. mining has been poorly regulated even in wealthy industrialized nations. many governments subsidize mineral production, but few enact or enforce strict environmental regulations. prices of minerals do not include their full environmental cost. osts of eroded land, dammed or polluted rivers and displacement of people unlucky enough to live atop mineral deposits. the devastating effects of the industry are particularly severe in the developing countries. the people of most mineral–exporting countries gain little fro

{'rouge1': AggregateScore(low=Score(precision=0.9488372093023256, recall=0.4523281596452328, fmeasure=0.6126126126126126), mid=Score(precision=0.9488372093023256, recall=0.4523281596452328, fmeasure=0.6126126126126126), high=Score(precision=0.9488372093023256, recall=0.4523281596452328, fmeasure=0.6126126126126126)),
 'rouge2': AggregateScore(low=Score(precision=0.8644859813084113, recall=0.4111111111111111, fmeasure=0.5572289156626505), mid=Score(precision=0.8644859813084113, recall=0.4111111111111111, fmeasure=0.5572289156626505), high=Score(precision=0.8644859813084113, recall=0.4111111111111111, fmeasure=0.5572289156626505)),
 'rougeL': AggregateScore(low=Score(precision=0.9302325581395349, recall=0.4434589800443459, fmeasure=0.6006006006006006), mid=Score(precision=0.9302325581395349, recall=0.4434589800443459, fmeasure=0.6006006006006006), high=Score(precision=0.9302325581395349, recall=0.4434589800443459, fmeasure=0.6006006006006006)),
 'rougeLsum': AggregateScore(low=Score(prec
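
The summary above contains the fragment "osts of eroded land", which is presumably what remains of a word that was cut in two by the fixed 512-character slicing. A possible alternative, sketched below under that assumption, is to group whole sentences into chunks instead of slicing at arbitrary character offsets. The helper name `chunk_by_sentence` and the regex-based sentence splitter are illustrative choices, not part of the original notebook or of any library.

```python
import re


def chunk_by_sentence(text, max_chars=512):
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized
    chunk; the tokenizer's truncation would still cap it downstream.
    """
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Prices of minerals do not include their full environmental cost. "
    "Costs of eroded land are ignored. Mining is poorly regulated."
)
for chunk in chunk_by_sentence(sample, max_chars=70):
    print(repr(chunk))
```

In the notebook, `bank = chunk_by_sentence(dataset, n)` could stand in for the list comprehension; the chunk boundaries, and therefore the summaries and ROUGE scores, would then differ from the ones shown here.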

In [64]:

n = 512  # chunk size in characters (not tokens)
bank = [dataset[i:i+n] for i in range(0, len(dataset), n)]

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

summary = []
for data in bank:
    # T5 expects a task prefix; truncation caps each input at 512 tokens
    inputs = tokenizer("summarize: " + data, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    # skip_special_tokens drops <pad> and </s>, so no manual clean-up is needed
    summary.append(tokenizer.decode(outputs[0], skip_special_tokens=True))

result = " ".join(summary)

print(result)

# Score the joined summary against the full source text
metric.compute(predictions = [result], references = [dataset])

 each year, mining strips 28 billion tons of material from the earth. this is more than what is removed by the natural erosion of all the earth’s rivers. worldwide, mining and smelting generate an estimated 2.7 billion tons of processing waste each year. smelter pollution has created biological wastelands as large as 10, 000 hectares. mining could also cause more damaging deforestation than bad farming practices. mining could also cause more damaging deforestation than bad farming practices in certain parts of the world. mining has been poorly regulated even in wealthy industrialized nations. some governments enact or enforce strict environmental regulations for mining operations. prices of minerals do not include their full environmental cost. osts of eroded land, dammed or polluted rivers and displacement of people unlucky enough to live atop mineral deposits. governments should remove subsidies provided for mining virgin minerals. the people of most mineral–exporting countries gain 

{'rouge1': AggregateScore(low=Score(precision=0.9301310043668122, recall=0.4722838137472284, fmeasure=0.6264705882352941), mid=Score(precision=0.9301310043668122, recall=0.4722838137472284, fmeasure=0.6264705882352941), high=Score(precision=0.9301310043668122, recall=0.4722838137472284, fmeasure=0.6264705882352941)),
 'rouge2': AggregateScore(low=Score(precision=0.8245614035087719, recall=0.4177777777777778, fmeasure=0.5545722713864307), mid=Score(precision=0.8245614035087719, recall=0.4177777777777778, fmeasure=0.5545722713864307), high=Score(precision=0.8245614035087719, recall=0.4177777777777778, fmeasure=0.5545722713864307)),
 'rougeL': AggregateScore(low=Score(precision=0.8777292576419214, recall=0.44567627494456763, fmeasure=0.5911764705882353), mid=Score(precision=0.8777292576419214, recall=0.44567627494456763, fmeasure=0.5911764705882353), high=Score(precision=0.8777292576419214, recall=0.44567627494456763, fmeasure=0.5911764705882353)),
 'rougeLsum': AggregateScore(low=Score(p
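
The last two runs summarize the same mining text with t5-base (In [63]) and t5-small (In [64]), so their scores can be put side by side. The snippet below is only a convenience sketch: the F-measure values are copied by hand from the two outputs above, and the `scores` dictionary layout is an assumption made for illustration.

```python
# F-measures copied from the In [63] (t5-base) and In [64] (t5-small) outputs.
scores = {
    "t5-base": {
        "rouge1": 0.6126126126126126,
        "rouge2": 0.5572289156626505,
        "rougeL": 0.6006006006006006,
    },
    "t5-small": {
        "rouge1": 0.6264705882352941,
        "rouge2": 0.5545722713864307,
        "rougeL": 0.5911764705882353,
    },
}

best_by_type = {}
for rouge_type in ("rouge1", "rouge2", "rougeL"):
    # Pick the model with the highest F-measure for this ROUGE variant.
    best = max(scores, key=lambda name: scores[name][rouge_type])
    best_by_type[rouge_type] = best
    gap = abs(scores["t5-base"][rouge_type] - scores["t5-small"][rouge_type])
    print(f"{rouge_type}: best={best}, gap={gap:.4f}")
```

t5-small is slightly ahead on rouge1 and t5-base on rouge2 and rougeL, but every gap is under 0.02 and comes from a single document, so this is not enough to rank the two models. The low, mid and high fields are identical in each output above, which is consistent with scoring a single prediction and reference pair.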

References
1. What is the ROUGE score? [here](https://www.youtube.com/watch?v=TMshhnrEXlg)
2. Interpreting ROUGE scores [here](https://stats.stackexchange.com/questions/301626/interpreting-rouge-scores)
3. How others tested their text summarizers using ROUGE [here](https://link.springer.com/article/10.1007/s10579-017-9389-4)
4. Tuning the hyperparameters of model.generate [here](https://huggingface.co/blog/how-to-generate)
5. Understanding num_beams, an argument of model.generate [here](https://huggingface.co/blog/how-to-generate)