
### **Extractive Summarization with Transformers**
This method uses the HuggingFace transformers library to perform extractive summarization.

It works by first embedding the sentences, then running a clustering algorithm on the embeddings, and finally selecting the sentences that are closest to each cluster's centroid.
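
The embed, cluster, and select steps can be sketched in plain Python. This is only an illustration under loud assumptions: the bag-of-words vectors stand in for transformer sentence embeddings, the small k-means loop stands in for whatever clustering the library actually runs, and every helper name (`embed`, `kmeans`, `extractive_summary`) is hypothetical rather than part of the library's API.

```python
import math
import random
from collections import Counter


def embed(sentence, vocab):
    """Toy bag-of-words vector (assumption: stand-in for a transformer embedding)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocab]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iters=20, seed=0):
    """Very small k-means: returns k centroids (requires k <= len(vectors))."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assign every vector to its nearest centroid.
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def extractive_summary(sentences, k):
    """Pick, for each centroid, the sentence whose vector lies closest to it."""
    vocab = sorted({word for s in sentences for word in s.lower().split()})
    vectors = [embed(s, vocab) for s in sentences]
    centroids = kmeans(vectors, k)
    chosen = set()
    for centroid in centroids:
        chosen.add(min(range(len(sentences)), key=lambda i: dist(vectors[i], centroid)))
    # Keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "The marshals arrive on the island.",
    "A storm hits the island.",
    "The doctor hides the records.",
    "The doctor refuses to share records.",
]
summary = extractive_summary(sentences, k=2)
print(summary)
```

Because two centroids can share the same nearest sentence, the sketch may return fewer than `k` sentences; the selected sentences are always taken verbatim from the input, which is what makes the approach extractive.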

The library also applies coreference resolution, using the neuralcoref library (https://github.com/huggingface/neuralcoref), to resolve words in summaries that need more context. The greediness of neuralcoref can be tuned in the CoreferenceHandler class.

Library Repo: https://github.com/dmmiller612/bert-extractive-summarizer

Paper: https://arxiv.org/abs/1906.04165

In [38]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

We will now summarize the plot of a movie (Shutter Island) and see whether the Transformer-based summarizer can capture the gist of the plot.

In [39]:
DOCUMENT = """
In 1954, U.S. Marshals Edward "Teddy" Daniels and his new partner Chuck Aule travel to the Ashecliffe Hospital for the criminally insane on Shutter Island in Boston Harbor. They are investigating the disappearance of patient Rachel Solando, incarcerated for drowning her three children. Their only clue is a cryptic note found hidden in Solando's room: "The law of 4; who is 67?". The two men arrive just before a massive storm, preventing their return to the mainland for a few days.

Teddy and Chuck find the staff confrontational. Lead psychiatrist John Cawley refuses to turn over records, and they learn that Solando's doctor Lester Sheehan left the island on vacation immediately after Solando disappeared. They are told that Ward C is off limits and the lighthouse has already been searched. While being interviewed, one patient writes the word "RUN" in Teddy's notepad. Teddy starts to have migraine headaches from the hospital's atmosphere and has waking visions of his experiences as a U.S. Army soldier during the liberation of Dachau including reprisals against the guards. He has disturbing dreams of his wife, Dolores Chanal, who was killed in a fire set by arsonist Andrew Laeddis. In one instance, she tells Teddy that Solando is still on the island—as is Laeddis, who everyone claims was never there. Teddy later explains to Chuck that locating Laeddis was his ulterior motive for taking the case.

Teddy and Chuck find Solando has resurfaced with no explanation, prompting the former to break into the restricted Ward C. Teddy encounters George Noyce, a patient in solitary confinement, who claims that the doctors are experimenting on patients, some of whom are taken to the lighthouse to be lobotomized. Noyce warns that everyone else on the island, including Chuck, is playing an elaborate game designed for Teddy.

Teddy regroups with Chuck and climbs the cliffs toward the lighthouse. They become separated, and Teddy later sees what he believes to be Chuck's body on the rocks below. By the time he climbs down, the body has disappeared, but he finds a cave where he discovers a woman in hiding, who claims to be the real Rachel Solando. She states that she is a former psychiatrist at the hospital who discovered the experiments with psychotropic medication and trans-orbital lobotomy in an attempt to develop mind control techniques. Before she could report her findings to the authorities, she was forcibly committed to Ashecliffe as a patient. Teddy returns to the hospital, but finds no evidence of Chuck ever being there.

Convinced Chuck was taken to the lighthouse, Teddy breaks in, only to discover Cawley waiting for him. Cawley explains that Daniels is actually Andrew Laeddis, their "most dangerous patient", incarcerated in Ward C for murdering his manic depressive wife, Dolores, after she drowned their children. Edward Daniels and Rachel Solando are anagrams of Andrew Laeddis and Dolores Chanal, and the little girl from Laeddis's recurring dreams is his daughter Rachel. According to Cawley, the events of the past several days have been designed to break Andrew's conspiracy-laden insanity by allowing him to play out the role of Teddy Daniels. The hospital staff were part of the test, including Lester Sheehan posing as Chuck Aule and a nurse posing as Rachel Solando. Andrew’s migraines were withdrawal symptoms from his medication, as were his hallucinations of the "real Rachel Solando". Overwhelmed, Andrew faints.

He awakens in the hospital under the watch of Cawley and Sheehan. When questioned, he tells the truth in a coherent manner, satisfying the doctors. Cawley notes that they had achieved this state nine months before, but Andrew quickly regressed. He warns this will be Andrew's last chance; otherwise, they will have to lobotomize him, as he previously attacked Noyce for calling him by his real name. Some time later, Andrew relaxes on the hospital grounds with Sheehan, but calls him "Chuck" again, saying they must leave the island because bad things are going on. Sheehan shakes his head to Cawley and Cawley gestures to the orderlies to take Andrew to be lobotomized. Before being led away, Andrew asks Sheehan if it would be better "to live as a monster, or to die as a good man?" A stunned Sheehan calls Andrew "Teddy" but the latter does not respond to the name.

"""

First, we preprocess the document: newlines are replaced with spaces and repeated spaces are collapsed.

In [40]:
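import re

# Replace newlines and carriage returns with spaces
DOCUMENT = re.sub(r'\n|\r', ' ', DOCUMENT)
# Collapse runs of spaces into a single space
DOCUMENT = re.sub(r' +', ' ', DOCUMENT)
# Trim leading and trailing whitespace
DOCUMENT = DOCUMENT.strip()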
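
To see what this cleanup does, the same three steps can be applied to a short made-up string. The helper name `normalize_whitespace` and the sample text are hypothetical, introduced only for this illustration.

```python
import re


def normalize_whitespace(text):
    """Apply the same three cleanup steps used on DOCUMENT (hypothetical helper)."""
    text = re.sub(r'\n|\r', ' ', text)  # newlines and carriage returns become spaces
    text = re.sub(r' +', ' ', text)     # runs of spaces collapse to one space
    return text.strip()                 # drop leading and trailing whitespace


sample = "\nFirst paragraph.\n\nSecond  paragraph.\r\n"
cleaned = normalize_whitespace(sample)
print(cleaned)  # First paragraph. Second paragraph.
```

Paragraph breaks are lost in the process, so the summarizer receives the plot as one continuous string of sentences.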

In [32]:
!pip install bert-extractive-summarizer

Collecting bert-extractive-summarizer
  Downloading https://files.pythonhosted.org/packages/23/1d/71f0a5c7f81b1a87d4428a6a935e9ddeb5e662e41512952e11bd10533cd9/bert-extractive-summarizer-0.4.2.tar.gz
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sentencepiece

In [41]:
from summarizer import Summarizer

In [42]:
sm = Summarizer(model='bert-large-uncased')

In [43]:
result = sm(body=DOCUMENT, ratio=0.2)

In [44]:
result = '\n'.join(nltk.sent_tokenize(result))
print(result)

In 1954, U.S.
Marshals Edward "Teddy" Daniels and his new partner Chuck Aule travel to the Ashecliffe Hospital for the criminally insane on Shutter Island in Boston Harbor.
Teddy and Chuck find the staff confrontational.
Teddy and Chuck find Solando has resurfaced with no explanation, prompting the former to break into the restricted Ward C. Teddy encounters George Noyce, a patient in solitary confinement, who claims that the doctors are experimenting on patients, some of whom are taken to the lighthouse to be lobotomized.
Noyce warns that everyone else on the island, including Chuck, is playing an elaborate game designed for Teddy.
Teddy regroups with Chuck and climbs the cliffs toward the lighthouse.
When questioned, he tells the truth in a coherent manner, satisfying the doctors.
Some time later, Andrew relaxes on the hospital grounds with Sheehan, but calls him "Chuck" again, saying they must leave the island because bad things are going on.


In [45]:
sm = Summarizer(model='distilbert-base-uncased')

In [46]:
result = sm(body=DOCUMENT, ratio=0.2)

In [47]:
result = '\n'.join(nltk.sent_tokenize(result))
print(result)

In 1954, U.S.
Marshals Edward "Teddy" Daniels and his new partner Chuck Aule travel to the Ashecliffe Hospital for the criminally insane on Shutter Island in Boston Harbor.
Teddy and Chuck find the staff confrontational.
Lead psychiatrist John Cawley refuses to turn over records, and they learn that Solando's doctor Lester Sheehan left the island on vacation immediately after Solando disappeared.
They are told that Ward C is off limits and the lighthouse has already been searched.
Teddy and Chuck find Solando has resurfaced with no explanation, prompting the former to break into the restricted Ward C. Teddy encounters George Noyce, a patient in solitary confinement, who claims that the doctors are experimenting on patients, some of whom are taken to the lighthouse to be lobotomized.
Before she could report her findings to the authorities, she was forcibly committed to Ashecliffe as a patient.
Some time later, Andrew relaxes on the hospital grounds with Sheehan, but calls him "Chuck" ag