**We will use the BART transformer as implemented in the Hugging Face `transformers` library.**

"BART is a sequence-to-sequence model trained with denoising as its pretraining objective. We show that this pretraining objective is more generic and show that we can match RoBERTa results on SQuAD and GLUE and gain state-of-the-art results on summarization (XSum, CNN dataset), long form generative question answering (ELI5) and dialog response generation (ConvAI2). See the associated paper for more details." [More on BART](https://github.com/pytorch/fairseq/tree/master/examples/bart)
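
To make "denoising as pretraining objective" concrete, here is a toy sketch of one kind of corruption: a contiguous span of tokens is replaced by a single mask token, and a denoising model would be trained to reconstruct the original sequence. This is an illustration only, not the fairseq or Hugging Face implementation. The function name `corrupt_with_span_mask`, the fixed span length, the whitespace tokenization, and the `<mask>` string are all assumptions made for the example.

```python
import random

def corrupt_with_span_mask(tokens, span_len=3, mask_token="<mask>", seed=0):
    """Replace one contiguous span of tokens with a single mask token (toy noising)."""
    if len(tokens) <= span_len:
        # Sequence too short to keep any context: mask everything
        return [mask_token]
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

original = "the quick brown fox jumps over the lazy dog".split()
corrupted = corrupt_with_span_mask(original)

# A denoising model would be trained to map `corrupted` back to `original`
print(" ".join(corrupted))
print(" ".join(original))
```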

In [11]:
!pip install transformers



In [12]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

from google.colab import drive
drive.mount('/content/gdrive')
root_path = 'gdrive/Shared drives/CS263/data/american_rhetoric/'
# os.chdir(root_path + 'speech_bank')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [13]:
retval = os.getcwd()
print("Current working directory %s" % retval)

Current working directory /content


In [14]:
import pandas as pd
import numpy as np

!gdown --id 1J7h0H8HsqqcLlbM0zUZGGCn2jh3ZnpgD
df = pd.read_csv('parsed.csv')
df.head()

Downloading...
From: https://drive.google.com/uc?id=1J7h0H8HsqqcLlbM0zUZGGCn2jh3ZnpgD
To: /content/parsed.csv
16.6MB [00:00, 101MB/s] 


Unnamed: 0,title,speaker,transcript,year
0,Congressional Gold Medal Acceptance Address,Aung San Suu Kyi,This is one of the most moving days of my life...,2012
1,Memorial Remarks for Ronald Reagan,Prime Minister Brian Mulroney,"In the spring of 1987, President Reagan and I ...",2004
2,Address to the American Society of Newspaper E...,Dwight D. Eisenhower,"President Bryan, distinguished guests of this ...",1953
3,2004 Democratic National Convention Address,Al Gore,"Thank you, very much. Thank you. Thank you, ve...",2004
4,Speech to the D.C. Federalist Society Lawyers ...,Edwin Meese III,A large part of American history has been the ...,1985


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1202 entries, 0 to 1201
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   title       1202 non-null   object
 1   speaker     1202 non-null   object
 2   transcript  1202 non-null   object
 3   year        1202 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 37.7+ KB


In [0]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

In [18]:
# Use bart.large fine-tuned on XSum
# (newer transformers releases expect the namespaced model ID 'facebook/bart-large-xsum')
BART_PATH = 'bart-large-xsum'

# Bart cnn
# BART_PATH = 'bart-large-cnn'


# Bart large
# BART_PATH = 'bart-large'

bart_tokenizer = BartTokenizer.from_pretrained(BART_PATH, output_past=True) # Initialize tokenizer
bart_model = BartForConditionalGeneration.from_pretrained(BART_PATH, output_past=True) # Download model and configuration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')





In [0]:
def bart_summarize(i, input_text):
    # Summarize one transcript; returns (summary text, summary length in words)
    input_text = str(input_text)
    # input_text = ' '.join(input_text.split())
    input_text = input_text.replace('\n', ' ')
    # i+501 is the row number in the full dataset (rows 0-500 were summarized in an earlier run)
    print('\n', i+501, 'Original:', input_text, '\n Original speech length:', len(input_text.split()))

    # Encode at most 1024 tokens (BART's input limit); longer speeches lose their tail
    input_tokenized = bart_tokenizer.encode(input_text, max_length=1024, return_tensors='pt').to(device)

    # Beam search, with the summary constrained to between 30 and 100 tokens
    summary_ids = bart_model.generate(input_tokenized,
                                      num_beams=4,
                                      length_penalty=2.0,
                                      min_length=30,
                                      max_length=100,
                                      no_repeat_ngram_size=3,
                                      early_stopping=True
                                      )
    summary = bart_tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True, clean_up_tokenization_spaces=False)
    summary_length = len(summary.split())
    print('\n', i+501, 'Bart Summary:', summary, '\n Summarized speech length:', summary_length)

    return summary, summary_length

In [28]:
torch.cuda.is_available()

True

In [0]:
bart_model.to(device)
bart_model.eval()

In [43]:
# Test
import time

start = time.time()
summarized  = bart_summarize(1, df['transcript'][1])
end = time.time()
print(end - start)



 1 Bart Summary: Former US President Ronald Reagan and his wife, Nancy, arrived in Canada for a state visit in 1987, and it was a visit that will be remembered for many years to come, writes Prime Minister Justin Trudeau. 
 Summarized speech length: 37
4.3178746700286865


In [62]:
# Get the file with first 500 summary entries saved
!gdown --id 1-Ubsv555Q-j_8YXksy4nH4E1ixMinzDq
df_new = pd.read_csv('parsed-with_summaries_bart_batches_500.csv')
df_new.head()

Downloading...
From: https://drive.google.com/uc?id=1-Ubsv555Q-j_8YXksy4nH4E1ixMinzDq
To: /content/parsed-with_summaries_bart_batches_500.csv
0.00B [00:00, ?B/s]16.7MB [00:00, 166MB/s]


Unnamed: 0,title,speaker,transcript,year,summary
0,Congressional Gold Medal Acceptance Address,Aung San Suu Kyi,This is one of the most moving days of my life...,2012,"Burma's President, Thein Sein, in his acceptan..."
1,Memorial Remarks for Ronald Reagan,Prime Minister Brian Mulroney,"In the spring of 1987, President Reagan and I ...",2004,Former US President Ronald Reagan and his wife...
2,Address to the American Society of Newspaper E...,Dwight D. Eisenhower,"President Bryan, distinguished guests of this ...",1953,"The President of the United States of America,..."
3,2004 Democratic National Convention Address,Al Gore,"Thank you, very much. Thank you. Thank you, ve...",2004,President George W. Bush: I want to thank the ...
4,Speech to the D.C. Federalist Society Lawyers ...,Edwin Meese III,A large part of American history has been the ...,1985,In our series of letters from senior White Hou...


In [64]:
df_new.loc[[501]]

Unnamed: 0,title,speaker,transcript,year,summary
501,Address to the Nation on the Invasion of Iraq,George H. W. Bush,"Just two hours ago, allied air forces began an...",1991,


In [72]:
df_processed = df_new[:501]
df_processed.tail()

Unnamed: 0,title,speaker,transcript,year,summary
496,"Going Dark: Are Technology, Privacy, and Publi...",James B. Comey,"Well, thank you, Ben, and good morning, everyb...",2014,"The FBI Director, James Comey, is speaking at ..."
497,Memories of War; Need for Peace: The Veteran’s...,Steven T. Banko III,Most Vietnam veterans I know still cling to th...,2008,In our series of letters from African-American...
498,Remarks on Connect America and Jobs Creation Fund,Julius Genachowski,"Today is, indeed, a momentous step in our effo...",2012,"In signing an executive order today, the chair..."
499,Address to Americans on Paris Climate Accord,Emmanuel Macron,"Now, let me say a few words to our American fr...",2017,Here is the full text of President Francois Ho...
500,Sermon on the Mount (KJV),Jesus of Nazareth,"Matthew 5\n1: And seeing the multitudes, he we...",2001,A selection of passages from the Gospel of Mat...


In [79]:
df_unprocessed = df_new[501:].reset_index(drop=True)
df_unprocessed.head()

Unnamed: 0,title,speaker,transcript,year,summary
0,Address to the Nation on the Invasion of Iraq,George H. W. Bush,"Just two hours ago, allied air forces began an...",1991,
1,Remarks Following UN Vote on Palestinian State...,Susan Rice,"Thank you, Mr. President.\nFor decades, the Un...",2012,
2,Statement to Parliament on Economic Response t...,Prime Minister Jacinda Ardern,There are moments in our history where it's no...,2020,
3,CDC Media Briefing on First Confirmed Diagnosi...,Tom Frieden et al.,Barbara Reynolds: Good afternoon. You're joini...,2014,
4,Response to United Nations Resolution 3379,Daniel Patrick Moynihan,There appears to have developed in the United ...,1975,
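
The split above hard-codes row 501 as the resume point. As a hedged alternative, the sketch below locates the first row whose `summary` is missing and splits the frame there. The helper name `split_at_first_missing` and the small `demo` frame are made up for illustration, and the approach assumes summaries were filled in order from the top, as in this notebook.

```python
import pandas as pd

def split_at_first_missing(frame, column="summary"):
    """Split a frame into the rows before and from the first missing value in `column`."""
    missing = frame[column].isna().to_numpy()
    # Position of the first missing value, or len(frame) if nothing is missing
    first_missing = int(missing.argmax()) if missing.any() else len(frame)
    done = frame.iloc[:first_missing]
    todo = frame.iloc[first_missing:].reset_index(drop=True)
    return first_missing, done, todo

# Hypothetical data standing in for df_new
demo = pd.DataFrame({
    "transcript": ["a", "b", "c", "d"],
    "summary": ["sum a", "sum b", None, None],
})
offset, done, todo = split_at_first_missing(demo)
print(offset, len(done), len(todo))
```

The returned `offset` could then replace the literal 501 used when numbering rows and naming batch files.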


In [80]:
df_unprocessed['transcript'][0]



In [83]:
summariesLength = [] 
for i, text in enumerate(df_unprocessed.transcript):
  # store BART summary in 'summary' column
  df_unprocessed.loc[df_unprocessed.index[i],'summary'], summary_length = bart_summarize(i, df_unprocessed['transcript'][i])
  summariesLength.append(summary_length)

  # removed unfinished sentence in the summary
  # if summary_length > 200:
  #  index1 = summarized.rfind(".")
  #  index2 = summarized.rfind("?")
  #  index3 = summarized.rfind('!')
  #  index = max(index1, index2, index3)

  #  df.loc[df.index[i],'summary'] = summarized[:index+1]

  #  print('\n Number of words in T5 summarized text:', len(summarized.split()), '\n in processed summary', len(df.loc[df.index[i],'summary'].split()))
  #  print('\n Processed Summary', df.loc[df.index[i],'summary'])

  # save summaries in batches
  
  if (i+501)%50 == 0:
    filename = 'parsed-with_summaries_bart_batches_{}.csv'.format(i+501)
    df_unprocessed.to_csv(filename, index = False)

print(summariesLength)


Output hidden; open in https://colab.research.google.com to view.
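
The commented-out step in the loop above trims a summary that ends mid-sentence. A minimal standalone sketch of that idea follows; the helper name `trim_unfinished_sentence` is hypothetical. Unlike the commented code, it leaves the summary untouched when no sentence terminator is found, since `rfind` returns -1 in that case and slicing would otherwise produce an empty string.

```python
def trim_unfinished_sentence(summary):
    """Cut a summary after its last '.', '?' or '!'; return it unchanged if none is found."""
    last = max(summary.rfind(mark) for mark in ".?!")
    return summary[:last + 1] if last != -1 else summary

example = "He led with certainty. Our discussions reflected the agenda of"
print(trim_unfinished_sentence(example))
```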

In [88]:
df_unprocessed

Unnamed: 0,title,speaker,transcript,year,summary
0,Address to the Nation on the Invasion of Iraq,George H. W. Bush,"Just two hours ago, allied air forces began an...",1991,"US President George W. Bush: ""We have no choic..."
1,Remarks Following UN Vote on Palestinian State...,Susan Rice,"Thank you, Mr. President.\nFor decades, the Un...",2012,Here is the full text of President Barack Obam...
2,Statement to Parliament on Economic Response t...,Prime Minister Jacinda Ardern,There are moments in our history where it's no...,2020,Here is the full text of Prime Minister John K...
3,CDC Media Briefing on First Confirmed Diagnosi...,Tom Frieden et al.,Barbara Reynolds: Good afternoon. You're joini...,2014,The US Centers for Disease Control and Prevent...
4,Response to United Nations Resolution 3379,Daniel Patrick Moynihan,There appears to have developed in the United ...,1975,"The Secretary-General of the United Nations, B..."
...,...,...,...,...,...
696,"""",Mingo Chief Logan,I appeal to any white man to say if he ever en...,1774,In his last will and testament to his friend a...
697,Heritage Foundation Address on Anchoring the W...,A. Wess Mitchell,"Good morning everyone, it’s really great to se...",2018,Here is the full text of President Donald Trum...
698,On Violating the Joint Comprehensive Plan of A...,Donald J. Trump,"My fellow Americans:\nToday, I want to update ...",2018,Here is the full text of President Donald Trum...
699,"Remarks on the Shooting Tragedy in Aurora, Col...",Mitt Romney,"Good morning,\nAnd thank you for joining with ...",2012,Here is the full text of Democratic presidenti...


In [0]:
dataframe = [df_processed, df_unprocessed]
summaries_frame = pd.concat(dataframe).reset_index(drop = True)

In [92]:
from google.colab import drive
drive.mount('/content/gdrive')
root_path = 'gdrive/Shared drives/CS263/data/american_rhetoric/'
# os.chdir(root_path + 'speech_bank')

retval = os.getcwd()
print("Current working directory %s" % retval)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Current working directory /content


In [0]:
summaries_frame.to_csv('/content/gdrive/Shared drives/CS263/data/american_rhetoric/speech_bank/summarized_bart_large_xsm/parsed_with_summaries_bart_xsm.csv', index = False)

In [102]:
bart_result = pd.read_csv('/content/gdrive/Shared drives/CS263/data/american_rhetoric/speech_bank/summarized_bart_large_xsm/parsed_with_summaries_bart_xsm.csv')
bart_result

Unnamed: 0,title,speaker,transcript,year,summary
0,Congressional Gold Medal Acceptance Address,Aung San Suu Kyi,This is one of the most moving days of my life...,2012,"Burma's President, Thein Sein, in his acceptan..."
1,Memorial Remarks for Ronald Reagan,Prime Minister Brian Mulroney,"In the spring of 1987, President Reagan and I ...",2004,Former US President Ronald Reagan and his wife...
2,Address to the American Society of Newspaper E...,Dwight D. Eisenhower,"President Bryan, distinguished guests of this ...",1953,"The President of the United States of America,..."
3,2004 Democratic National Convention Address,Al Gore,"Thank you, very much. Thank you. Thank you, ve...",2004,President George W. Bush: I want to thank the ...
4,Speech to the D.C. Federalist Society Lawyers ...,Edwin Meese III,A large part of American history has been the ...,1985,In our series of letters from senior White Hou...
...,...,...,...,...,...
1197,"""",Mingo Chief Logan,I appeal to any white man to say if he ever en...,1774,In his last will and testament to his friend a...
1198,Heritage Foundation Address on Anchoring the W...,A. Wess Mitchell,"Good morning everyone, it’s really great to se...",2018,Here is the full text of President Donald Trum...
1199,On Violating the Joint Comprehensive Plan of A...,Donald J. Trump,"My fellow Americans:\nToday, I want to update ...",2018,Here is the full text of President Donald Trum...
1200,"Remarks on the Shooting Tragedy in Aurora, Col...",Mitt Romney,"Good morning,\nAnd thank you for joining with ...",2012,Here is the full text of Democratic presidenti...


In [0]:
summariesLength = [] 
for i, text in enumerate(df.transcript):
  # store BART summary in 'summary' column
  summarized, summary_length = bart_summarize(i, text)
  df.loc[df.index[i],'summary'] = summarized
  summariesLength.append(summary_length)

  # remove an unfinished trailing sentence from long summaries
  if summary_length > 200:
    index1 = summarized.rfind(".")
    index2 = summarized.rfind("?")
    index3 = summarized.rfind('!')
    index = max(index1, index2, index3)

    # rfind returns -1 when no terminator exists; keep the summary whole in that case
    if index != -1:
      df.loc[df.index[i],'summary'] = summarized[:index+1]

    print('\n Number of words in BART summarized text:', len(summarized.split()), '\n in processed summary', len(df.loc[df.index[i],'summary'].split()))
    print('\n Processed Summary', df.loc[df.index[i],'summary'])

  # save summaries in batches
  if i%50 == 0:
    filename = 'parsed-with_summaries_bart_batches_{}.csv'.format(i)
    df.to_csv(filename, index = False)
print(summariesLength)
print(df.info())
df.to_csv('parsed-with_summaries_bart.csv', index = False)

Output hidden; open in https://colab.research.google.com to view.

1. Bart-large-cnn Summary: Former Canadian Prime Minister Brian Mulroney remembers President Reagan. He says Reagan was a leader who inspired his nation and transformed the world. Mulrroy: Reagan possessed a rare and prized gift called leadership. He will always be remembered with the deepest admiration and affection, he says. 
 Summarized speech length: 46

2. Bart-large-xsum Summary: Former US President Ronald Reagan and his wife, Nancy, arrived in Canada for a state visit in 1987, and it was a visit that will be remembered for many years to come, writes Prime Minister Brian Mulroney, who was then Prime Minister of Canada and now US President. 
 Summarized speech length: 48

3. Bart-large Summary: President Ronald Reagan was a President of the Reagan era and Reagan is a President. He possessed a rare and prized gift called leadership -- that ineffable and magical quality that sets some men and women apart so that millions will follow them as they conjure up grand visions and invite their countrymen to dream big and exciting dreams. Ronald Reagan does not enter history tentatively -- he does so with certainty and panache. At home and on the world stage, his were not the pallid etchings of a timorous politician. They were the bold strokes of a confident and accomplished leader. One day in Brussels President Mitterrand in referring to President Reagan said: "Il a vraiment la notion de l'Etat." Rough translation: "He really has a sense of the State about him." The translation does not fully capture the profundity of the observation. President Reagan's visit had been important, demanding, and successful. Our discussions reflected the international agenda of 
 Summarized speech length: 160

4. T5 Summary:  Summarized: "bob greene: Ronald Reagan was a president who inspired his nation and transformed the world. greene says he embodied the unusual alchemy of history, tradition, achievement. he says Reagan's vision of a united u.s. was based on a sense of the nation's majesty."
