## Text Summarization using a fine-tuned facebook/bart-base model
---



***Install the necessary libraries***

In [2]:
!pip install datasets evaluate transformers rouge-score nltk



In [3]:
!pip install transformers[torch]



In [90]:
!pip install nltk



***Log in to the Hugging Face Hub***

In [4]:
from huggingface_hub import notebook_login

notebook_login()


***Installing Git LFS***

In [5]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


***Checking the Transformers version***

In [6]:
import transformers

print(transformers.__version__)

4.38.2


In [7]:
import pandas as pd
import numpy as np

In [91]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Read and preprocess the dataset for fine-tuning the base model

In [8]:
input_df = pd.read_csv('curated_data_subset.csv')
input_df.head()

Unnamed: 0.1,Unnamed: 0,title,summary,url,date,article_content
0,0,Tencent gains approval to sell mutual funds to...,Tencent has been granted a licence from the Ch...,http://www.scmp.com/business/companies/article...,49:58.9,Traditional finance houses now being seriously...
1,1,"India testing blockchains in education, health...",India is testing blockchain applications in ed...,https://www.vccircle.com/niti-aayog-explores-b...,33:08.3,"The Indian government's policy think tank, Nit..."
2,2,Higher living wage risks robot takeover of low...,The UK Institute for Fiscal Studies has warned...,https://news.sky.com/story/ifs-living-wage-inc...,32:10.2,Increases in minimum wage levels risk raising ...
3,3,Regus WeWork may seek stock-market flotation t...,Co-working start-up WeWork may go public this ...,https://www.fool.com/investing/2018/01/03/will...,31:59.6,"Depending on whom you ask, WeWork is either a ..."
4,4,AMD poised to gain market share as Intel pound...,Intel has seen more than $11bn wiped off its m...,https://www.cnbc.com/2018/01/05/amd-is-big-win...,17:13.3,AMD is big winner from chip flaw fiasco as mor...


In [9]:
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       100 non-null    int64 
 1   title            100 non-null    object
 2   summary          100 non-null    object
 3   url              100 non-null    object
 4   date             100 non-null    object
 5   article_content  99 non-null     object
dtypes: int64(1), object(5)
memory usage: 4.8+ KB


In [10]:
input_df.isnull().sum()

Unnamed: 0         0
title              0
summary            0
url                0
date               0
article_content    1
dtype: int64

In [11]:
input_df.dropna(inplace=True, ignore_index=True)

In [12]:
input_df.isnull().sum()

Unnamed: 0         0
title              0
summary            0
url                0
date               0
article_content    0
dtype: int64

In [13]:
input_df.loc[29, :]

Unnamed: 0                                                        30
title              Fluenta's new gauges measure gas at extreme te...
summary            Fluenta has launched two ultrasonic flow trans...
url                https://www.oilfieldtechnology.com/product-new...
date                                                         20:05.4
article_content    Fluenta has announced the launch of its new ra...
Name: 29, dtype: object

In [14]:
input_df.rename(columns = {'article_content':'Text', 'summary':'Summary'}, inplace=True)

In [15]:
input_df = input_df[['Text', 'Summary']]

In [16]:
input_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Text     99 non-null     object
 1   Summary  99 non-null     object
dtypes: object(2)
memory usage: 1.7+ KB


In [17]:
input_df.head(10)

Unnamed: 0,Text,Summary
0,Traditional finance houses now being seriously...,Tencent has been granted a licence from the Ch...
1,"The Indian government's policy think tank, Nit...",India is testing blockchain applications in ed...
2,Increases in minimum wage levels risk raising ...,The UK Institute for Fiscal Studies has warned...
3,"Depending on whom you ask, WeWork is either a ...",Co-working start-up WeWork may go public this ...
4,AMD is big winner from chip flaw fiasco as mor...,Intel has seen more than $11bn wiped off its m...
5,"MUMBAI: About 250 companies, including some of...",Demand for cyber insurance among Indian compan...
6,Public sector organisations have quadrupled th...,The UK's public sector has increased its spend...
7,A worker pulls carts full of customer orders a...,The average price for large industrial plots o...
8,"Office take up in quarter four, 2017 in the ce...",Office take-up in Birmingham during 2017 broke...
9,Elevate your enterprise data technology and st...,"Machine learning chatbots – such as Ada, which..."


### Split the data into train, validation and test datasets

In [18]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for validation
X_train, X_val = train_test_split(input_df, test_size=0.2)

# Hold out 10% of the remaining rows for testing
X_train, X_test = train_test_split(X_train, test_size=0.1)

In [19]:
print(f'Shape of X_train is: {X_train.shape}')
print(f'Shape of X_val is: {X_val.shape}')
print(f'Shape of X_test is: {X_test.shape}')

Shape of X_train is: (71, 2)
Shape of X_val is: (20, 2)
Shape of X_test is: (8, 2)
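
The printed shapes follow from the two successive splits: the second split takes 10% of the 79 rows left after the first split, not 10% of the original 99. The sketch below reproduces that arithmetic in plain Python. It assumes the held-out portion is rounded up, which matches the shapes printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn. Note also that the splits above are unseeded, so passing a fixed `random_state` to `train_test_split` would make the row assignment reproducible.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out), rounding the held-out portion up (assumption)."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# First split: 99 rows -> 79 kept, 20 held out for validation
n_rest, n_val = split_sizes(99, 0.2)

# Second split: 79 rows -> 71 kept for training, 8 held out for testing
n_train, n_test = split_sizes(n_rest, 0.1)

print(n_train, n_val, n_test)  # 71 20 8
```

With only 8 test rows, any metric computed on the test set will be noisy, so it is best read as a sanity check rather than a benchmark.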


***Saving the dataset for training, validation and test***

In [20]:
input_df.to_csv('curated_dataset_for_summarization.csv', index=False)
X_train.to_csv('training_data.csv', index=False)
X_val.to_csv('validation_data.csv', index=False)
X_test.to_csv('test_data.csv', index=False)

### Loading the dataset from the Hugging Face Hub

In [21]:
from datasets import load_dataset

data_files = {"train": "training_data.csv", "val": "validation_data.csv", "test": "test_data.csv"}
dataset = load_dataset("sudhanshusinghaiml/curated-dataset-for-summarization", data_files=data_files)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [22]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 71
    })
    val: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 20
    })
    test: Dataset({
        features: ['Text', 'Summary'],
        num_rows: 8
    })
})

***Displaying data from loaded dataset***

In [23]:
dataset['train'][29]

{'Text': '"But, on the other hand, we would be foolish to rule anything out. We know that Asia-Pacific will be a very important market and we know a lot of the global growth in the future will come from there."',
 'Summary': 'According to the UK\'s International Trade Secretary, Liam Fox, the UK could feasibly join the Trans-Pacific Partnership (TPP), saying "it would be foolish to rule anything out". The organisation is made up of Australia, Mexico, New Zealand, Canada, Chile, Japan, Singapore, Brunei, Peru, Vietnam and Malaysia - with Donald Trump pulling the US out last year - and\xa0is currently in renegotiation under the new name of the Comprehensive and Progressive Agreement for Trans-Pacific Partnership. Its aims are to\xa0lower both non-tariff and tariff barriers to trade and to provide a forum to settle international disputes.'}

In [24]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
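
The helper above draws distinct indices with a retry loop. As a possible simplification, the standard library's `random.sample` draws without replacement in one call. The sketch below shows only the index-picking step; `pick_distinct_indices` is a hypothetical name, and 71 is simply the size of the training split shown earlier.

```python
import random


def pick_distinct_indices(n_items: int, num_examples: int = 5) -> list[int]:
    """Draw num_examples distinct indices from range(n_items)."""
    if num_examples > n_items:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no retry loop is needed
    return random.sample(range(n_items), num_examples)


picks = pick_distinct_indices(71, 5)
print(picks)
```

The returned list could then be used in place of `picks` in `show_random_elements`.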

In [25]:
show_random_elements(dataset["train"])

Unnamed: 0,Text,Summary
0,"Online retail and e-commerce giant Amazon is reportedly on the verge of making its first investment in an insurtech start-up, with the company said close to finalising an investment in online-only insurance start-up Acko.\n\nAcko wants to disrupt India’s insurance industry through a digital-only platform, having raised $30 million and recently received in-principal approval from the financial market regulators in India.\n\nAmazon and Indian rival Flipkart had both been pursuing investing in Acko, it has been reported widely, but at this stage it is now thought that Amazon is close to signing a term-sheet for the investment and a partnership deal with Acko.\n\nIt’s said that the arrangement will see Amazon acting as an online distributor for Acko’s insurance products, selling a range of financial products.\n\nThe potential for Amazon to enter the insurance space has been much-discussed in recent months, including in our article from November, Incumbents could be relegated, if tech giants come for re/insurance.\n\nNow it appears Amazon is close to taking a sensible step of investing in and partnering with an insurtech start-up, in order to gain the ability to add insurance products to its retail offering, seeing the firm stepping into the sale of financial products for the first time.\n\nTargeting India first is also a smart move, as the burgeoning financial services market there has a strong focus on technology and take-up rates of insurance products are rising all the time.\n\nIf Amazon can crack selling insurance online to the Indian market, it will stand it in good stead to break into more established markets such as the United States and Europe.\n\nOf course, if Amazon does move into insurance meaningfully it will likely only be a matter of time before other tech giants such as Google follow suit with their own integrated e-commerce offerings.\n\nIt’s also been reported that Flipkart is readying its own entry into insurance sales online, with the 
establishment of a new entity to focus on financial services and venture investing.\n\nAlso read: Incumbents could be relegated, if tech giants come for re/insurance.","Amazon is reportedly finalising an investment in Indian insurtech firm Acko. The arrangement would see Amazon acting as a distribution platform for the online-only insurer. Amazon's biggest rival in India, Flipkart, was also reportedly considering an investment in Acko. The start-up has raised $30m to date and has provisional approval to operate from India's financial markets regulator. \n"
1,"Updated Story Several Democratic-led states are looking to implement state-level individual mandates for insurance coverage in an effort to reduce the prominence of bare counties and failing risk pools due to the end of the Affordable Care Act’s individual mandate in 2019 and other instabilities surround the law. California, New York, Maryland, Connecticut, and Washington state are all considering pursuing state individual mandates for insurance coverage when their state legislatures come into session in early 2018 ,...","Following on from the Republican tax bill, several Democratic-led states are looking to implement state-level individual mandates to overcome bare counties and the prospect of failing risk pools. California, Connecticut, New York, Maryland and Washington State, are all considering the move when their state legislatures come into session in early 2018. It is likely that the states will attempt to implement a model similar to that of RomneyCare, introduced in Massachusetts in 2006, and the ACA’s individual mandate. These states do not require federal approval for the move as the mandate penalty is a tax, and as a result, have the ability to implement their own version of the Obamacare mandate. It is unlikely that the move will extend beyond these states as there is far more likely to be partisan pushback as state legislatures often skew to the right. California, in particular, is looking at the possibility of a state individual mandate to overcome the uncertainty at federal level surrounding Obamacare. Maryland too would likely introduce an individual mandate, Massachusetts could fall back on its original proposition, with Washington State proving the most complicated. \n"
2,"Source: Xinhua| 2018-01-03 15:31:55|Editor: Xiang Bo\n\nVideo Player Close\n\nBEIJING, Jan. 3 (Xinhua) -- Land sales increased in Chinese cities last year as the government moved to cool the market with higher supply, according to the China Index Academy, a property research organization.\n\nLand sales in 300 Chinese cities totaled 950.36 million square meters in 2017, up 8 percent from 2016, while sales of land for residential projects reached 354.33 million square meters, an increase of 24 percent year on year.\n\nLand sales in major cities like Beijing, Shanghai and Guangzhou were particularly robust, as local governments increased land supply to cool down runaway house prices fueled by huge demand and limited supply.\n\nIn China's first-tier cities, land sales jumped 46 percent year on year to 29.79 million square meters last year, according to China Index Academy.\n\nBoosted by surging sales, revenue from land transactions rose 36 percent to 4.01 trillion yuan (about 620 billion U.S. dollars) in a total of 300 Chinese cities.\n\nChina's property market, once deemed a major risk for the broader economy, cooled in 2017 amid tough curbs such as purchase restrictions and increased downpayment requirements as the government sought to rein in speculation.\n\nDue to these efforts, both investment and sales in China's property sector slowed. 
Real estate investment rose 7.5 percent year on year during January-November, down from 7.8 percent in the first 10 months.\n\nProperty sales in terms of floor area climbed 7.9 percent in the first 11 months, retreating from 8.2 percent in January-October.\n\nWith the market holding steady, Chinese authorities are aiming for a ""long-term mechanism"" for real estate regulation, and a housing system that ensures supply through multiple sources and encourages both housing purchases and rentals.\n\nA report from the National Academy of Economic Strategy predicted that the country's property market would remain stable in 2018 if there were no major policy shocks.","Land sales in China's first-tier cities such as Beijing, Shanghai and Guangzhou surged by 46% year on year to 30 million sq metres in 2017, according to the China Index Academy. The academy also revealed that land sales in 300 Chinese cities rose by 8% to 950 million sq metres and sales of land for residential purposes rose 24% to 354 million sq metres. The government has been increasing land supply in order to put a brake on rising house prices. New purchase restrictions and requirements for higher deposits have also had an effect in slowing price rises."
3,"The return to work after Christmas is never easy. Unless you’re an estate agent: they love January. Following the pre-Christmas lull, families rush back into wanting to buy and sell their houses (helped in part by the traditional post-festivity spike in family breakdown). But for an increasing number of us, house hunting is becoming little more than an exercise in window shopping (or ‘property porn’ if you’d rather).\n\nThe share of the population owning a home has been falling since 2003, with particularly profound consequences for younger families. As the chart below shows, today’s 30 year olds (that is, the oldest members of the millennial generation born between 1981 and 2000) are only half as likely to own their house as their parents were at the same age.\n\nLike so much of the Christmas TV schedule, this is a story that’s been on repeat for some time. Britons get that their country is no longer a nation of home owners. As research carried out by Ipsos MORI back in the summer for the Intergenerational Commission showed, 71 per cent of people (across all generations) think millennials face worse prospects than their parents in this regard. Just 7 per cent think young adults are better off. Indeed, of all of the questions asked in the survey, it was the one on which respondents were most pessimistic.\n\nYet, despite being pessimistic about the overall picture for millennials, new data shows that a significant share of the generation think they personally will manage to beat the gloom. The next chart takes data from the Bank of England’s latest NMG survey to show that more than half (52 per cent) of non-owning households headed by someone aged under-35 (roughly speaking, the millennial generation) expect to buy at some point in the future. And that proportion holds up even among lower income millennials.\n\nIf such expectations were borne out, around 75 per cent of millennial households would eventually own a home. 
That would put the generation on a par with the home ownership rates recorded among baby boomers. It would also be roughly 10 percentage points higher than the ‘optimistic’ scenario we set out in September (our ‘pessimistic’ scenario put the figure under 50 per cent). Short of a significant turnaround in housing trends, the implication is that many members of the younger generation will find their aspirations go unmet.\n\nAnd, while the one-in-four (24 per cent) non-owning millennials who think they’ll never buy a home might have a more realistic outlook of the future, they’re just as likely to be unhappy with their lot. The next chart sets out the factors which this group identify as being among the three most important reasons for not owning. What stands out is that just one-in-ten of them cite positive-sounding reasons: 10 per cent say they like their current home and just 8 per cent prefer the flexibility of renting. The upshot is that as few as 1 per cent of millennials appear to be happy with the idea of never owning a home.\n\nIt’s this finding that goes a long way to explaining why politicians are so keen to be seen to be offering hope on home ownership.\n\nAnd, with ‘purchase costs’ (such as the deposit, stamp duty and estate agents fees) being cited by millennials as the main barrier to owning, it’s easy to understand the temptation to focus on subsidising buyers. Measures such as the removal of stamp duty for first time buyers of property worth up to £300,000 – which Philip Hammond announced in the Autumn Budget – give the impression of extending home ownership to a wider group. But they largely miss the mark. The OBR’s assessment of the stamp duty policy was that it would benefit just 3,500 first time buyers who would not otherwise have been able to buy a home, costing roughly £160,000 per additional owner.\n\nSupply-based approaches represent a preferable and more sustainable option, but they take time to take effect. 
That’s not to say government should give up, and the Autumn Budget plans for returning housing capital spending back to the levels of the 2000s (outside of the fiscal stimulus peak of 2008-10) is a very welcome one.\n\nBut it’s hard to escape the conclusion that, even if we get to grips with the longer-term problem, home ownership will remain off-limits for significant numbers of millennials. Some might expect to benefit from the bank of mum and dad in the near-term and from inheritances as they age. But such support may come too late to cover expensive family-rearing years for many households, and will never arrive for many – mainly lower income – others.\n\nThat reality raises a number of challenges for today’s young people. Over the longer-term, home ownership plays an important role in building wealth (via semi-enforced saving), providing leverage and hedging against costs and location in retirement. In its absence, alternatives are needed.\n\nMore immediately, the generally higher housing costs associated with renting leave young people with less disposable income and less opportunity to save than earlier cohorts faced. The chart below sets out the share of income allocated to rent among younger respondents to the NMG survey. It shows that 30 per cent of renters in this group spend more than one-third of their pre-tax income – a threshold that is often taken as a sign of housing unaffordability. And that figure jumps to a massive 71 per cent among the poorest fifth of millennials.\n\nWe’ll turn to the question of how the country might rise to these challenges in a forthcoming policy options paper for the Intergenerational Commission. But our politicians – unlike our estate agents – need to be more honest about the housing aspiration gap. 
It’s good to offer hope, but a healthy dose of realism would sharpen the focus on the broader living standards challenge posed by our housing crisis.","A majority of millennials anticipate they will own a home in the future despite being pessimistic about their overall prospects. A recent survey commissioned by the Bank of England found that 52% of under-35s expect to buy a home at some point in the future, 24% expect never to buy and 25% are unsure. Purchase costs, such as deposits, estate agency fees and stamp duty, were cited as the biggest barriers to home ownership. An earlier Ipsos MORI poll found 71% of respondents thought millennials faced worse prospects than their parents of owning a home. \n"
4,"NASA's Flight Opportunities program is already flying experiments on Blue Origin's New Shepard vehicle, but researchers and companies alike want NASA to also fund experiments with people on board.\n\nBROOMFIELD, Colo. — As commercial suborbital vehicles capable of carrying both payloads and people prepare to enter service, NASA officials say they're willing to consider allowing agency-funded researchers to fly on those vehicles.\n\nIn an interview after a speech at the Next-Generation Suborbital Researchers Conference here Dec. 19, Steve Jurczyk, NASA associate administrator for space technology, said the agency would be open to allowing researchers funded by NASA's Flight Opportunities program to fly on suborbital spacecraft to carry out their experiments.\n\n""As principal investigators propose, both internal to NASA and external, we'll do the same kind of process that we do with Zero G,"" he said, referring to the company that performs parabolic aircraft flights. Zero G flies investigations as part of the Flight Opportunities program, with researchers flying on the aircraft with their experiments. [Watch Blue Origin's New Shepard 2.0 Spacecraft Soar in 1st Test Flight]\n\nZero G's aircraft, a Boeing 727, is regulated by the Federal Aviation Administration. Jurczyk said that, in addition to the FAA oversight, NASA's Armstrong Flight Research Center performs an evaluation of the aircraft for investigations selected by the Flight Opportunities program for flights on it. ""It just ensures that our grantees and contractors are safe to fly, and then we allow them to go fly,"" he said in a speech at the conference.\n\nA similar procedure is not yet in place for suborbital vehicles, but Jurczyk said the agency would be open to finding some process analogous to that used for Zero G. 
""Moving forward, as these capabilities start coming online, we’ll figure it out,"" he said in the interview.\n\nHis comments come four and a half years after another agency official opened the door to flying people on commercial suborbital vehicles through the Flight Opportunities program. Speaking at the same conference in June 2013, Lori Garver, NASA deputy administrator at the time, said that past prohibitions about flying people would be lifted.\n\n""We absolutely do not want to rule out paying for research that could be done by an individual spaceflight participant — a researcher or payload specialist — on these vehicles in the future,"" Garver said then. ""That could open up a lot more opportunities.""\n\nThat announcement took the program by surprise, with the program's managers saying at the time they had yet to craft a policy for allowing people to fly with their experiments. Development of such a policy suffered years of delays, in part because of Garver's departure from NASA just a few months after her announcement as well as extended delays in the development of commercial suborbital vehicles capable of carrying people.\n\n""It mostly resulted in a bunch of ostriches sticking heads in the sand for a few years,"" said Erika Wagner, business development manager at Blue Origin, during a panel discussion at the conference Dec. 18.\n\nBlue Origin's New Shepard vehicle is already carrying research payloads, including for Flight Opportunities, but without people on board. However, the vehicle will be able to support missions carrying payloads and people in the future. Virgin Galactic’s SpaceShipTwo vehicle will also fly research payloads accompanied by a payload specialist.\n\nWagner said she has seen some progress as both companies' vehicles advance through flight testing. ""The heads are back out. 
They're looking around trying to understand what really are the barriers, what is the liability regime.""\n\nThose liability issues today, she said, prevent NASA civil servants from flying on the Zero G aircraft, even though outside researchers whose experiments are funded by NASA are able to do so. Jurczyk, in his speech at the conference, said that’s because they would have to sign a liability waiver to do so. ""Right now, that’s just NASA policy. We don't have a strong mission need to do that,"" he said. ""That's current policy. I’m not saying it's going to be policy forever and ever."" [In Photos: Blue Origin's New Shepard 2.0 Aces Maiden Test Flight]\n\nScientists who would like to fly experiments on suborbital vehicles argue that such missions are analogous to fieldwork — oftentimes hazardous — performed in other fields. ""Marine biologists and marine geologists get to put themselves in that very same operationally risky environment by going to the bottom of the ocean, to a deep sea vent,"" said Dan Durda, a planetary scientist at the Southwest Research Institute, during the Dec. 18 panel. ""These vehicles offer us, as space scientists, that opportunity to get into the field the way that biologists and geologists do.""\n\nAdvocates of commercial suborbital research, such as the Commercial Spaceflight Federation’s Suborbital Applications Research Group, have been pushing to allow NASA to fund human-tended experiments.\n\n""They're working quietly to get the word out that there are very definite needs for human-tended payloads,"" said Steven Collicott, a Purdue University professor, in a conference speech Dec. 19. ""We've heard some encouraging words and we’re working quietly to try and move that ahead.""\n\nOthers at the conference noted a decades-old precedent that suggests existing barriers to flying NASA-funded researchers on commercial suborbital vehicles can be overcome. 
In the 1980s, several payload specialists flew on the space shuttle, including Charles Walker, a McDonnell Douglas engineer who was part of three shuttle missions.\n\nWalker, in the Dec. 18 panel discussion, noted that on those shuttle missions he and his family signed liability waivers. He supported similar approaches to allow researchers to fly on commercial suborbital vehicles.\n\n""The environments opened up by suborbital flight and, at a greater scale, orbital flight, are laboratory environments,"" he said. ""You should be there to maximize the answers that are coming out of the conduct in that environment.""\n\nThis story was provided by SpaceNews, dedicated to covering all aspects of the space industry.","NASA is looking at allowing researchers from the agency on board commercial suborbital flights, according to Steve Jurczyk, NASA associate administrator for space technology. Blue Origin's New Shepard craft and Virgin Galactic’s SpaceShipTwo are among the vehicles that could carry researchers on suborbital flights. Liability issues on the relatively dangerous missions have so far deterred NASA from allowing staff aboard them.\n"


In [26]:
from evaluate import load

metric = load("rouge")

In [27]:
metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/
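
To make the metric less of a black box, here is a minimal sketch of what a ROUGE-1 F1 score measures: the unigram overlap between a prediction and a reference, combined from precision and recall. This is a simplification under stated assumptions (lowercasing and whitespace tokenization, no stemming) and is not the implementation used by the `rouge` metric loaded above, so its numbers will not match exactly. `rouge1_f1` and the two example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(count) times
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("amazon invests in acko", "amazon is investing in acko")
print(round(score, 4))  # 0.6667
```

Here three of the four predicted words appear in the five-word reference, giving a precision of 0.75, a recall of 0.6 and an F1 of about 0.67.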

### Tokenize the training, validation and test datasets with BartTokenizer

***Defining the model checkpoint and sequence length limits***

In [28]:
model_checkpoint = 'facebook/bart-base'
max_input_length = 1024
max_target_length = 128
prefix = "Summarize: "

In [29]:
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained(model_checkpoint)

In [81]:
# max_input_length, max_target_length and prefix are defined in the cells above.

def preprocess_function(examples):
    # Prepend the task prefix to every article in the batch
    inputs = [prefix + doc for doc in examples["Text"]]
    # Tokenize the articles, truncating anything beyond max_input_length tokens
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the reference summaries as targets
    labels = tokenizer(text_target=examples["Summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [82]:
preprocess_function(dataset['train'][:2])

{'input_ids': [[0, 38182, 3916, 2072, 35, 29352, 1787, 138, 22226, 7219, 36, 347, 42456, 43, 892, 7441, 14, 5, 194, 9, 5993, 16, 5, 21084, 194, 11, 5, 382, 7, 4318, 8518, 4, 50118, 50118, 30219, 2593, 4481, 3441, 10, 319, 9, 3459, 476, 8, 5, 476, 1162, 10356, 11, 358, 194, 4, 50118, 50118, 20930, 15, 18694, 6335, 17, 27, 29, 665, 892, 9, 5, 701, 9, 11196, 4481, 420, 5, 382, 6, 24, 16, 855, 21084, 7, 4318, 8518, 11, 5993, 480, 4382, 1042, 23, 361, 4, 5677, 3205, 228, 42738, 4650, 5, 674, 701, 9, 4481, 65, 8518, 23, 68, 246, 6, 28835, 4, 50118, 50118, 713, 16, 3625, 7246, 87, 5, 595, 425, 9, 8518, 6, 61, 16, 855, 1446, 23, 198, 68, 1092, 6, 151, 228, 12911, 6, 25, 9, 1228, 86, 4, 50118, 50118, 13841, 1493, 11, 5, 382, 16, 24, 6162, 7, 4318, 116, 50118, 50118, 1121, 49, 892, 6, 18694, 6335, 67, 2319, 5, 701, 9, 8518, 4481, 716, 15, 5, 42738, 1580, 4850, 9, 5, 130, 144, 1406, 4481, 19880, 6, 13953, 6, 5, 3702, 448, 5101, 208, 466, 6, 5, 3702, 448, 5101, 208, 406, 6, 8, 5, 35383, 231, 6, 25

In [83]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/71 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]
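To make the batched preprocessing step more concrete, here is a minimal sketch of the same flow with a toy whitespace tokenizer. The names toy_tokenize and toy_preprocess are hypothetical, and the toy tokenizer is only a stand-in: BartTokenizer uses byte-level BPE and adds special tokens, which this sketch ignores. The point is only to show that, with batched=True, the function receives a dict of column lists, prepends the prefix, and truncates each sequence to the configured maximum length.

```python
# Toy stand-in for the tokenizer: whitespace split, one id per distinct word.
# This is NOT how BartTokenizer works; it only illustrates the data flow.
def toy_tokenize(texts, max_length):
    vocab = {}
    batch_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]
        # Mimic truncation=True: keep only the first max_length ids
        batch_ids.append(ids[:max_length])
    return {"input_ids": batch_ids}


def toy_preprocess(examples, prefix="Summarize: ", max_input_length=6, max_target_length=3):
    # With batched=True, `examples` is a dict of columns, each a list of values
    inputs = [prefix + doc for doc in examples["Text"]]
    model_inputs = toy_tokenize(inputs, max_input_length)
    labels = toy_tokenize(examples["Summary"], max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Two made-up rows, using the same column names as the dataset above
batch = {
    "Text": ["Bitcoin mining needs a lot of cheap power", "Short text"],
    "Summary": ["Mining needs cheap power", "Short"],
}
toy_batch = toy_preprocess(batch)
print([len(ids) for ids in toy_batch["input_ids"]])  # input lengths after truncation
print([len(ids) for ids in toy_batch["labels"]])     # label lengths after truncation
```

Sequences keep different lengths at this stage; padding is deferred to the data collator so that each batch is padded only as far as its own longest sequence.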

### Fine-tune with the PyTorch Trainer on the new dataset

We can download the pretrained model and fine-tune it. Since this is a sequence-to-sequence task, we use the AutoModelForSeq2SeqLM class. As with the tokenizer, the from_pretrained method downloads and caches the model for us.

In [84]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [85]:
from accelerate import Accelerator

accelerator = Accelerator()

batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    "sudhanshusinghaiml/facebook-bart-base-finetuned",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    # push_to_hub=True,
)

We need a special kind of data collator, which pads both the inputs (articles) and the labels (summaries) to the length of the longest sequence in each batch.

In [86]:
dataCollator = DataCollatorForSeq2Seq(tokenizer, model=model)
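The padding behaviour described above can be sketched in plain Python. This is an illustration, not the library's implementation: pad_batch is a hypothetical helper, and the real DataCollatorForSeq2Seq also returns tensors and, when given the model, prepares decoder_input_ids. The sketch assumes the BART pad token id of 1 and the collator's default label padding value of -100, which the loss function ignores.

```python
PAD_TOKEN_ID = 1     # <pad> id in the BART vocabulary
LABEL_PAD_ID = -100  # default label padding; ignored when computing the loss


def pad_batch(features):
    """Pad every example to the longest input and longest label in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch


# Two made-up tokenized examples of different lengths
features = [
    {"input_ids": [0, 713, 16, 2], "labels": [0, 713, 2]},
    {"input_ids": [0, 713, 2], "labels": [0, 2]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

Padding per batch rather than to max_input_length keeps batches of short articles small, which matters here because inputs may be up to 1024 tokens long.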

The last thing to define for our Seq2SeqTrainer is how to compute the metrics from the predictions. We need to define a function for this, which will just use the metric we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [87]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
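For intuition about what the Rouge1 column measures, here is a simplified sketch of a unigram-overlap F1 score. The function rouge1_f1 is hypothetical and deliberately cruder than the metric used above: it lowercases and splits on whitespace, with no stemming and no aggregation across examples, so its numbers will not match the library's exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up prediction/reference pair; five of six words overlap
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score * 100, 4))  # scaled to a percentage, as in compute_metrics
```

Rouge2 applies the same idea to bigrams, and RougeL is based on the longest common subsequence instead of n-gram counts.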

Then we just need to pass all of this along with our datasets to the Seq2SeqTrainer:

In [88]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    data_collator=dataCollator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

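# Accelerate dataloader settings; defined here but not passed to the trainer above
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)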


In [92]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,2.758328,15.6558,4.4138,11.2435,13.705,20.0






TrainOutput(global_step=5, training_loss=3.1713235855102537, metrics={'train_runtime': 7.8909, 'train_samples_per_second': 8.998, 'train_steps_per_second': 0.634, 'total_flos': 84023351193600.0, 'train_loss': 3.1713235855102537, 'epoch': 1.0})
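The figures in TrainOutput can be cross-checked with a little arithmetic. The sketch below assumes that the 71-example split from the Map progress bars is the training split, that training ran on a single device, and that no gradient accumulation was used; under those assumptions the step count and throughput follow directly from the batch size and runtime.

```python
import math

num_train_examples = 71  # assumed to be the train split (first Map progress bar)
batch_size = 16          # per_device_train_batch_size, single device assumed
num_epochs = 1
train_runtime = 7.8909   # seconds, from TrainOutput

# The last batch is partial (71 = 4 * 16 + 7), so round up
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

With only 5 optimizer steps, the run never reaches the Trainer's default logging interval of 500 steps, which is why the Training Loss column shows "No log".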

In [94]:
# Save the model
trainer.save_model('facebook-bart-base-finetuned')

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


Test the prediction of the model

In [98]:
prediction_results = trainer.predict(tokenized_datasets['test'])



In [109]:
# Let us pull the model from huggingface


config.json:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/262 [00:00<?, ?B/s]

In [110]:
def summarize_from_finetuned_model(article):
    # Load BART model and tokenizer
    model_name = 'sudhanshusinghaiml/facebook-bart-base-fintuned'
    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Tokenize and encode the article
    inputs = tokenizer(article, return_tensors='pt', max_length=1024, truncation=True)

    # Generate summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=128, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary

# Example usage
article = """
Former U.S. President Donald Trump charged in classified documents probe. There was no immediate confirmation from the Justice Department regarding Mr. Trump’s assertion, although some U.S. media outlet cited sources saying that the former U.S. President has bee indicted.
"""

summary = summarize_from_finetuned_model(article)
print("Summary:")
print(summary)

tokenizer_config.json:   0%|          | 0.00/1.19k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/957 [00:00<?, ?B/s]

Summary:
Former U.S. President Donald Trump charged in classified documents probe. There was no immediate confirmation from the Justice Department regarding Mr. Trump


In [112]:
article = """Packing all the contextual information from a document into a short summary is much harder with long text. If our summary has to be say 5 sentences max it is much harder to
decide what information is valuable enough to be added with 500 words vs 50,000 words. Chunking algorithms are often required, but they do grow the data variance coverage a model must have
to be accurate. Chunking algorithms control how much of the larger document we pass into a summarizer based on the max tokens the model allows and parameters we’ve set. The new dynamic
nature of the input data means our data variance is much larger than what is seen with smaller text. Longer documents often have much more internal data variance and swings in the information.
Use causes such as blog posts, interviews, transcripts and more have multiple swings in the dialog that make it harder to understand what contextual information is valuable for the summary.
Models have to learn a much deeper relationship between specific keywords, topics, and phrases as the text grows. There are two main types of summarization that are used as the baseline for
any enhanced versions of summarization - Extractive and abstractive. They focus on how the key information found in the input text is reconstructed in the generated summary in their own ways.
Both of these methods have their own unique challenges that pop up when looking at using longer text."""

summary = summarize_from_finetuned_model(article)
print("Summary:")
print(summary)

Summary:
Packing all the contextual information from a document into a short summary is much harder with long text. If our summary has to be say
