<a href="https://colab.research.google.com/github/sljm12/machine_learning_notebooks/blob/master/Bert_Next_Sentence_for_breaking_up_paragraphs_and_summarizing_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Testing whether BERT next-sentence prediction can break up news summaries into separate news bites.

In [106]:
!pip -q install transformers

In [107]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertForNextSentencePrediction

In [108]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForNextSentencePrediction.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertForNextSentencePrediction: ['mlm___cls']
- This IS expected if you are initializing TFBertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForNextSentencePrediction were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForNextSentencePrediction for predictions without further training.


# Preprocess the text

In [109]:
!pip -q install newspaper3k

In [110]:
from newspaper import Article, ArticleException

In [111]:
url = "https://www.asiaone.com/entertainment/darren-lim-turns-his-yacht-business-opportunity-takes-people-out-rides" #@param ["https://www.straitstimes.com/singapore/the-news-in-5-minutes-6","https://www.pmo.gov.sg/Newsroom/National-Day-Rally-2019","https://www.channelnewsasia.com/news/asia/malaysia-anwar-ibrahim-strong-majority-form-new-government-13136908","https://www.bbc.com/news/world-asia-54174598","https://www.straitstimes.com/tech/ibm-to-train-800-mid-career-people-in-ai-cyber-security","https://www.asiaone.com/entertainment/darren-lim-turns-his-yacht-business-opportunity-takes-people-out-rides"]
a=Article(url)
a.download()
a.parse()
text = a.text

In [112]:
filename = url.replace(":","").replace("/","_")

In [113]:
text

'Thinking of a quick getaway but want to be a little unconventional?\n\nWhy not take a ride on Darren Lim\'s yacht Gracefully? The 47-year-old is now offering a short trip on his boat where you\'ll travel from the south of Singapore to the north. The trip will take six hours, and you\'ll get a chance to see a different side of our Lion City.\n\nThe local actor-host was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids before he moved to a house so their kids can experience school life.\n\nAccording to Lianhe Wanbao — which went on a two-day, one-night trip on his yacht — the entire journey starts at 10am at ONE15 Marina Sentosa Cove. You\'ll ride all the way up north for six hours to the waters close to Pulau Ubin and spend the night there before returning to Sentosa the next day at 6pm.\n\nThe yacht can usually take about 20 people but due to social distancing measures, there can only be a maximum of 10.\n\n[embed]https://www.f

We want to keep only the body sentences and drop the headers, so we split the text on \n\n and keep only the paragraphs that end with a full stop.

In [114]:
text = [t for t in text.split("\n\n") if t.endswith('.')]
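Note that the filter above also drops paragraphs ending in a question mark: the opening paragraph of the article ("Thinking of a quick getaway ...?") does not appear in the filtered list. A minimal sketch of a looser filter, assuming that paragraphs ending in `.`, `?` or `!` (optionally followed by a closing quote or bracket) should all be kept. `keep_paragraphs` and the sample string are hypothetical and not part of the original notebook.

```python
import re

# Sentence-final punctuation, optionally followed by closing quotes or brackets.
_ENDS_LIKE_SENTENCE = re.compile(r"""[.?!]["')\]]*$""")

def keep_paragraphs(raw_text):
    """Split on blank lines and keep only paragraphs that end like a sentence."""
    paragraphs = (p.strip() for p in raw_text.split("\n\n"))
    return [p for p in paragraphs if _ENDS_LIKE_SENTENCE.search(p)]

# Made-up sample: a header, a question, a statement and an embed marker.
sample = (
    "A headline without a full stop\n\n"
    "Is this kept?\n\n"
    "This one too.\n\n"
    "[embed]https://example.com[/embed]"
)
print(keep_paragraphs(sample))
```

The header and the embed marker are dropped because neither ends with sentence-final punctuation.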

In [115]:
text

["Why not take a ride on Darren Lim's yacht Gracefully? The 47-year-old is now offering a short trip on his boat where you'll travel from the south of Singapore to the north. The trip will take six hours, and you'll get a chance to see a different side of our Lion City.",
 'The local actor-host was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids before he moved to a house so their kids can experience school life.',
 "According to Lianhe Wanbao — which went on a two-day, one-night trip on his yacht — the entire journey starts at 10am at ONE15 Marina Sentosa Cove. You'll ride all the way up north for six hours to the waters close to Pulau Ubin and spend the night there before returning to Sentosa the next day at 6pm.",
 'The yacht can usually take about 20 people but due to social distancing measures, there can only be a maximum of 10.',
 "Speaking to the Chinese evening daily, Darren said the most dangerous experience he's had 

Break the paragraphs down into sentences.

In [116]:
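results = []
for paragraph in text:
  # Naive split on full stops; '?' and '!' do not end a sentence here.
  # Whitespace-only pieces are skipped so no stray "." entries are produced.
  sentences = [t.strip() + "." for t in paragraph.split(".") if t.strip()]
  results.extend(sentences)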
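Splitting on "." treats a question followed by a statement as a single sentence, and it would also cut a decimal such as 3.5 in two. A rough sketch of a regex-based splitter that keeps the original end punctuation, assuming a sentence boundary is whitespace preceded by `.`, `?` or `!` (with an optional closing quote). `split_sentences` is a hypothetical helper, and abbreviations such as "Dr." would still be split.

```python
import re

# Whitespace that follows ., ? or !, optionally with a closing quote in between.
# Two fixed-width lookbehinds are used because Python's re module does not
# allow a variable-width lookbehind.
_SENTENCE_BOUNDARY = re.compile(r"""(?:(?<=[.?!])|(?<=[.?!]["']))\s+""")

def split_sentences(paragraph):
    """Split a paragraph into sentences, keeping each sentence's punctuation."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(paragraph) if s.strip()]

# Made-up sample paragraph for illustration.
demo = "Why not take a ride? The trip costs 3.5 dollars. It takes six hours."
for sentence in split_sentences(demo):
    print(sentence)
```

Because the punctuation is kept rather than re-added, a question stays a question instead of gaining a trailing full stop.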

In [117]:
text = results

# Running BERT next-sentence prediction to detect paragraphs

In [118]:
def compute(text1, text2):
  encoding = tokenizer(text1, text2, return_tensors='tf')
  logits = model(encoding['input_ids'], token_type_ids=encoding['token_type_ids'])[0]
  return tf.nn.softmax(logits[0])
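`compute` returns two probabilities: index 0 is the probability that the second text continues the first, and index 1 is the probability that it does not. The sketch below reproduces the softmax step in plain Python to show how strict the 0.9995 threshold used in the grouping loop is. The logit values are made up for illustration and are not model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence pair: [is_next, not_next].
probs = softmax([6.0, -6.0])
is_next, not_next = probs

# With two classes, is_next > p is equivalent to
# logit_is_next - logit_not_next > ln(p / (1 - p)).
required_gap = math.log(0.9995 / (1 - 0.9995))

print(is_next, not_next)
print(round(required_gap, 2))
```

In other words, a pair only passes the threshold when the "is next" logit exceeds the other logit by roughly 7.6.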

Loop through consecutive sentence pairs. While the second sentence is predicted to follow the first, keep adding it to temp_group.

When it is not, append temp_group to groups and start a new temp_group with the second sentence.

In [119]:
groups = []
temp_group = []
for i in range(len(text)-1):
  print("Group "+str(i))
  print(text[i])
  print(text[i+1])

  # sm[0]: probability that text[i+1] follows text[i]; sm[1]: that it does not
  sm = compute(text[i], text[i+1])
  print(sm[0], sm[1])

  if len(temp_group) == 0:
    temp_group = [text[i]]

  if sm[0] > 0.9995:  # threshold on the "is next sentence" probability
    temp_group = temp_group + [text[i+1]]
  else:
    groups.append(temp_group)
    temp_group = [text[i+1]]
# Note: the group still held in temp_group when the loop ends is not appended to groups.

Group 0
Why not take a ride on Darren Lim's yacht Gracefully? The 47-year-old is now offering a short trip on his boat where you'll travel from the south of Singapore to the north.
The trip will take six hours, and you'll get a chance to see a different side of our Lion City.
tf.Tensor(0.99999547, shape=(), dtype=float32) tf.Tensor(4.5798643e-06, shape=(), dtype=float32)
Group 1
The trip will take six hours, and you'll get a chance to see a different side of our Lion City.
The local actor-host was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids before he moved to a house so their kids can experience school life.
tf.Tensor(0.9933123, shape=(), dtype=float32) tf.Tensor(0.006687628, shape=(), dtype=float32)
Group 2
The local actor-host was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids before he moved to a house so their kids can experience school life.
According to Lia
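The grouping logic can also be written as a standalone function that takes the scoring function as a parameter, which makes it testable without loading BERT. This is a sketch, not the notebook's own code: `group_sentences` and `toy_score` are hypothetical names, and unlike the loop above it appends the final group once the loop finishes.

```python
def group_sentences(sentences, score, threshold=0.9995):
    """Group consecutive sentences; start a new group when score(a, b) <= threshold."""
    if not sentences:
        return []
    groups = []
    current = [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if score(prev, nxt) > threshold:
            current.append(nxt)      # nxt continues the current group
        else:
            groups.append(current)   # close the group and start a new one
            current = [nxt]
    groups.append(current)           # flush the last group
    return groups

# Stand-in scorer for illustration only: "related" means same first character.
def toy_score(a, b):
    return 1.0 if a[0] == b[0] else 0.0

print(group_sentences(["a1", "a2", "b1", "c1", "c2"], toy_score))
```

With the notebook's model, the scorer would presumably be something like `lambda a, b: float(compute(a, b)[0])`.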

In [120]:
groups

[["Why not take a ride on Darren Lim's yacht Gracefully? The 47-year-old is now offering a short trip on his boat where you'll travel from the south of Singapore to the north.",
  "The trip will take six hours, and you'll get a chance to see a different side of our Lion City."],
 ['The local actor-host was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids before he moved to a house so their kids can experience school life.',
  'According to Lianhe Wanbao — which went on a two-day, one-night trip on his yacht — the entire journey starts at 10am at ONE15 Marina Sentosa Cove.',
  "You'll ride all the way up north for six hours to the waters close to Pulau Ubin and spend the night there before returning to Sentosa the next day at 6pm.",
  'The yacht can usually take about 20 people but due to social distancing measures, there can only be a maximum of 10.'],
 ["Speaking to the Chinese evening daily, Darren said the most dangerous exp

In [121]:
import json
from pathlib import Path

Path(filename+"_groups.json").write_text(json.dumps(groups))

1781

# Summarization

## T5 Summarizer

In [122]:
from transformers import pipeline
summarizer = pipeline("summarization")

In [123]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the model and tokenizer once instead of on every call.
# The t5_ prefix keeps them separate from the BERT model and tokenizer above.
t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
device = torch.device('cpu')

def t5_summarise(text):
  # Replace newlines with spaces so words on adjacent lines do not run together
  preprocess_text = text.strip().replace("\n", " ")
  # T5 expects a task prefix in front of the input
  t5_prepared_text = "summarize: " + preprocess_text

  tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

  # Summarize with beam search
  summary_ids = t5_model.generate(tokenized_text,
                                  num_beams=4,
                                  no_repeat_ngram_size=2,
                                  min_length=30,
                                  max_length=200,
                                  early_stopping=False)

  return t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [124]:
' '.join(groups[0])

"Why not take a ride on Darren Lim's yacht Gracefully? The 47-year-old is now offering a short trip on his boat where you'll travel from the south of Singapore to the north. The trip will take six hours, and you'll get a chance to see a different side of our Lion City."

In [125]:
summarised = []
for i in groups:
  joined = ' '.join(i)
  summarised.append(t5_summarise(joined))

In [126]:
len(summarised)

4

In [127]:
import pprint

In [128]:
pprint.pprint(summarised)

['the 47-year-old is offering a short trip on his yacht Gracefully. the trip '
 "will take six hours, and you'll get chance to see different side of our Lion "
 'City.',
 'the actor-host was previously living aboard the yacht for four years. he '
 'moved to a house so their kids can experience school life on his yacht - but '
 "the entire journey starts at 10am at ONE15 Marina Sentosa Cove. you'll ride "
 'all the way up north for six hours to the waters close to Pulau Ubin and '
 'spend the night there before returning to town the next day at 6pm. it can '
 'usually take about 20 people but due to social distancing measures, there '
 'can only be ',
 "Darren says he encountered a typhoon years ago near thailand's Koh Samui. "
 '"when your entire family is on board, you don\'t have time to be afraid," '
 'says the chinese evening daily.',
 'Darren battled eight thunderstorms and only got through it with the help of '
 "his assistant. he's adapted well to life on land after living on th

In [129]:
Path(filename+"_t5_sum.json").write_text(json.dumps(summarised))

982

In [130]:
# Each summary is followed by a blank line
summarised_text = "".join(s + "\n\n" for s in summarised)

In [131]:
print(summarised_text)

the 47-year-old is offering a short trip on his yacht Gracefully. the trip will take six hours, and you'll get chance to see different side of our Lion City.

the actor-host was previously living aboard the yacht for four years. he moved to a house so their kids can experience school life on his yacht - but the entire journey starts at 10am at ONE15 Marina Sentosa Cove. you'll ride all the way up north for six hours to the waters close to Pulau Ubin and spend the night there before returning to town the next day at 6pm. it can usually take about 20 people but due to social distancing measures, there can only be 

Darren says he encountered a typhoon years ago near thailand's Koh Samui. "when your entire family is on board, you don't have time to be afraid," says the chinese evening daily.

Darren battled eight thunderstorms and only got through it with the help of his assistant. he's adapted well to life on land after living on the yacht from 2011 to 2015.




In [132]:
Path(filename+"_t5_sum.txt").write_text(summarised_text)

972

## Default Hugging Face summarizer pipeline

In [133]:
sum1 = []

for i in groups:
  t = ' '.join(i)
  sum1.append(summarizer(t))

Your max_length is set to 142, but you input_length is only 66. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 142, but you input_length is only 75. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 142, but you input_length is only 111. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
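The warnings above appear because the default `max_length` of 142 is larger than the tokenized input. One possible way to avoid them is to scale the length limits to the input size. The helper below is only a sketch: `summary_lengths`, the 0.5 ratio and the floor of 10 tokens are assumptions, not values from the original notebook.

```python
def summary_lengths(n_input_tokens, ratio=0.5, floor=10):
    """Return (min_length, max_length) scaled to the input token count."""
    max_length = max(floor, int(n_input_tokens * ratio))
    min_length = max(1, max_length // 3)
    return min_length, max_length

# Input lengths taken from the warnings above.
for n in (66, 75, 111):
    print(n, summary_lengths(n))
```

The result could then be passed to the pipeline call as `summarizer(t, min_length=lo, max_length=hi)`, with the token count taken from the pipeline's tokenizer, for example `len(summarizer.tokenizer(t)["input_ids"])`.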


In [134]:
sum1

[[{'summary_text': " Darren Lim's yacht Gracefully is now offering a short trip from the south of Singapore to the north . The trip will take six hours, and you'll get a chance to see a different side of our Lion City . The 47-year-old is offering a trip on his boat where you'll travel from south to north ."}],
 [{'summary_text': ' Lianhe Wanbao was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids . The yacht can usually take about 20 people but due to social distancing measures, there can only be a maximum of 10 .'}],
 [{'summary_text': ' Darren said the most dangerous experience he\'s had at sea was when he encountered a typhoon years ago near Thailand\'s Koh Samui . "When your entire family is on board, you don\'t have time to be afraid. You can only drive the boat as best as you can and ensure everyone\'s safety," he recounted .'}],
 [{'summary_text': ' Darren battled eight thunderstorms in a span of 24 hours . Darren battl

The number of summaries, one per group.

In [135]:
len(sum1)

4

In [136]:
Path(filename+"_hugging_sum.json").write_text(json.dumps(sum1))

1280

In [137]:
summary = [i[0]["summary_text"].strip() for i in sum1]

Path(filename+"_hugging_sum.txt").write_text('\n\n'.join(summary))

1182

In [138]:
summary

["Darren Lim's yacht Gracefully is now offering a short trip from the south of Singapore to the north . The trip will take six hours, and you'll get a chance to see a different side of our Lion City . The 47-year-old is offering a trip on his boat where you'll travel from south to north .",
 'Lianhe Wanbao was previously living aboard the yacht for four years with wife, former actress Evelyn Tan, and their four kids . The yacht can usually take about 20 people but due to social distancing measures, there can only be a maximum of 10 .',
 'Darren said the most dangerous experience he\'s had at sea was when he encountered a typhoon years ago near Thailand\'s Koh Samui . "When your entire family is on board, you don\'t have time to be afraid. You can only drive the boat as best as you can and ensure everyone\'s safety," he recounted .',
 'Darren battled eight thunderstorms in a span of 24 hours . Darren battled the storms with the help of his assistant . His kids have adapted well to life 