# Abstractive Summarization

## Data

### 1. CNN and Dailymail dataset

1. cnn_stories & dailymail_stories - Main dataset<br>
Link - https://cs.nyu.edu/~kcho/DMQA/ <br>
cnn: 392 mb<br>
dailymail: 979 mb<br>
Store as ```file.story```

#### Literature to read:
Here: good article about CNN data - https://machinelearningmastery.com/prepare-news-articles-text-summarization/
<br>
Ideas:<br>
Some data cleaning ideas for this data include:
<br>
a. Normalize case to lowercase (e.g. “An Italian”).<br>
b. Remove punctuation (e.g. “on-time”).<br>
We could also further reduce the vocabulary to speed up testing models, such as:<br>
c. Remove numbers (e.g. “93.4%”).<br>
d. Remove low-frequency words like names (e.g. “Tom Watkins”).<br>

Notable examples are the papers:<br>

Article about extractive summarization - https://arxiv.org/pdf/1707.02268v3.pdf
1. Topic Modeling
2. Sentence Scoring
3. Selecting Sentence

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016.<br>
https://arxiv.org/abs/1602.06023

Get To The Point: Summarization with Pointer-Generator Networks, 2017.<br>
https://arxiv.org/abs/1704.04368

In [1]:
import spacy
import nltk
import rouge
import os
import pandas as pd

from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.translate.bleu_score import SmoothingFunction
from nltk.translate.bleu_score import sentence_bleu as bleu

pd.set_option('display.max_colwidth', 300)
pd.options.display.float_format = '{:,.3f}'.format

nlp = spacy.load("en") #, disable=['parser', 'ner'])

## Data (used for a baseline)

In [66]:
def get_files(directory_path):
    '''Read all files in the given directory to a dataframe'''
    data = []

    for filename in os.listdir(directory_path):
        with open(os.path.join(directory_path, filename)) as f:
            lines = f.read().split('\n')
            if len(lines[0]) < 10:
                title = ' '.join(lines[:2])
                article = ('\n'.join(lines[2:])).strip('\n')
            else:
                title = lines[0]
                article = ('\n'.join(lines[1:])).strip('\n')
            data.append((title, article))

    return pd.DataFrame(data, columns=['title', 'article'])

# 92'579 rows
df_cnn = get_files('CNN_stories/')
df_cnn.head(10)

Unnamed: 0,title,article
0,"(CNN) -- Federal authorities are using words uttered by the co-founder of a radical Islamic group to charge him with threats against the creators of ""South Park.""","A criminal complaint alleging the communication of threats was filed in Virginia late last week against Jesse Curtis Morton, also known as Younus Abdullah Mohammad.\n\nA senior law enforcement source Thursday told CNN, which interviewed Morton in 2009, that the suspect is believed to be in Moroc..."
1,"Even after Boomer Esiason apologized for what he called his ""insensitive"" comment about scheduling a C-section before the season started, his suggestion plus critical stances by other radio hosts demonstrate how much paternity leave is still not widely accepted in our society.","In conversations with men across the country, it's clear that while most join many women in expressing outrage at the view that a Major League Baseball game should come before the birth of a child, there were men who felt the player should have gotten back to his job as quickly as possible.\n\nB..."
2,"(CNN) -- Vacationers at Yellowstone and Grand Teton national parks this summer should make extra efforts to wash their hands, the National Park Service urged Wednesday, after noting a spike in sicknesses among visitors so far.","In a news release, the park service noted ""greater than normal reports of gastrointestinal illness"" among those visiting the park in northwestern Wyoming as well as areas in Montana outside the two parks.\n\nThat includes an incident June 7, when members of a tour group visiting Mammoth Hot Spri..."
3,"BERLIN, Germany (CNN) -- U.S. officials urged American citizens in Germany to keep a low profile and remain wary of their surroundings after the terrorist organization al Qaeda posted a video message threatening attacks in the country.","German special police patrol in Berlin last month during a visit by Israeli Prime Minister Benajmin Netanyahu.\n\nA State Department travel alert, issued Wednesday, remains in effect until November 11 -- two weeks after Germany holds its federal elections on Sunday.\n\nAl Qaeda posted its video ..."
4,(CNN) -- The lessons of the first round of the French presidential elections are multiple and somewhat contradictory.,"There is, on the one hand, the first-round victory of a self-described ""normal man"" who is still -- in spite of very tight results -- likely to become the next president of France: François Hollande. His lack of charisma has not been a handicap, so great was the rejection of incumbent President ..."
5,"(CNN) -- On the surface, water polo appears an elegant pursuit played by extremely polished performers.","But beneath the water line, a different storyline is playing out.\n\nLimbs bash against each other, punches and kicks are thrown, nails are used to claw at an opponent and every so often, a player inadvertently disrobes another.\n\nThe thing is, like most players, Australian goal-machine Rowena ..."
6,(CNN) -- I've been in Thomas Hurley III's shoes.,"Hurley is the 12-year-old Connecticut boy whose misspelling of ""Emancipation"" during a Kids Week episode on ""Jeopardy!"" took social media by storm. Fans are sharply divided over whether the show should have accepted his Final Jeopardy answer, even though he would have finished second regardless...."
7,"(CNN) -- He's the man who rolled into a bedroom in Abbottabad, Pakistan, raised his gun and shot Osama bin Laden three times in the forehead.","Nearly two years later, the SEAL Team Six member is a secret celebrity with nothing to show for the deed; no job, no pension, no recognition outside a small circle of colleagues.\n\nJournalist Phil Bronstein profiled the man in the March issue of Esquire, calling him only the Shooter -- a husban..."
8,"Santa Rosa, Peru (CNN) -- Murder suspect Joran van der Sloot arrived Friday in Peru to face charges that he killed a Peruvian woman as police in Lima said they had identified the weapon that killed 21-year-old Stephany Flores Ramirez.","Flores' body was found Wednesday in a Lima hotel room registered to van der Sloot, a Dutch citizen who was twice arrested and released in connection with the 2005 disappearance of an American teenager, Natalee Holloway, in Aruba.\n\nInvestigators also found a baseball bat in the room, two law en..."
9,"Washington (CNN) -- When presumptive Republican presidential nominee Mitt Romney appears before Latino small-business owners in Washington on Wednesday, he'll address a group whose explosive birth rates foreshadow a seismic political shift in GOP strongholds in the Deep South and Southwest.","""The Republicans' problem is their voters are white, aging and dying off,"" said David Bositis, a senior research associate at the Joint Center for Political and Economic Studies, who studies minority political engagement.\n\n""There will come a time when they suffer catastrophic losses with the r..."


## Baseline

In [33]:
class summarization():
    
    def __init__(self, X):
        self.summaries = self.get_summaries(X)

    
    def _noun_tokenizer(self, text):
        '''Tokenization step gives lemmas of all nouns'''

        tokens = nlp(text)
        return [token.lemma_ for token in tokens 
                if token.pos_ in set(('NOUN', 'PROPN')) 
                and token.is_alpha 
                and len(token.text) > 1]

    def _get_keywords(self, articles, top_num=20, n_grams=3):
        '''Get top (top_num) keywords up to specified n_grams
        for each arcticle separately based on tf-idf score'''
        
        stop_words = set(stopwords.words('english'))
        cv = CountVectorizer(stop_words=stop_words,
                             ngram_range=(1, n_grams),
                             tokenizer=self._noun_tokenizer)
        
        data = cv.fit_transform(articles)

        tfidf_transformer = TfidfTransformer()
        tfidf_matrix = tfidf_transformer.fit_transform(data)

        keywords_list = [zip(cv.get_feature_names(), doc) for doc in tfidf_matrix.toarray()]

        top_keywords_scores = [sorted(doc, key=lambda x: x[1], reverse=True)[:top_num]
                               for doc in keywords_list]  

        top_keywords = [[keyword for keyword, score in doc] for doc in top_keywords_scores]

        return top_keywords
    
    def _generate_title(self, article, keywords):
    
        whole_sents = nltk.sent_tokenize(article)
        doc = nlp(article)

        sent_scores = []
        for sent_tokens, sent in zip(doc.sents, whole_sents):
            score = 0
            for token in sent_tokens:
                if token.lemma_ in keywords:
                    score += 1
            sent_scores.append((sent, score))

        top_sentence = sorted(sent_scores, key=lambda x: x[1], reverse=True)[:1][0][0]

        return top_sentence
    
    def get_summaries(self, X):
        
        df = pd.DataFrame()
        df['article'] = X
    
        df['keywords'] = self._get_keywords(df['article'])
    
        df['generated_title'] = df.apply(lambda row: self._generate_title(row['article'], row['keywords']),
                                         axis=1)
        
        return df


In [58]:
def evalutaion(y_true, y_pred):
    
    # BLEU
    df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    
    chencherry = SmoothingFunction()

    df['bleu-1'] = df.apply(lambda row: bleu(row['y_true'],
                                                     row['y_pred'],
                                                     smoothing_function=chencherry.method1,
                                                     weights=(1,)), axis=1)

    df['bleu-w'] = df.apply(lambda row: bleu(row['y_true'],
                                                     row['y_pred'],
                                                     smoothing_function=chencherry.method1,
                                                     weights=(0.25, 0.25, 0.25, 0.25)), axis=1)

    df['bleu-modi'] = df.apply(lambda row: bleu(row['y_true'],
                                                     row['y_pred'],
                                                     smoothing_function=chencherry.method1,
                                                     weights=(0.8, 0.1, 0.05, 0.05)), axis=1)
    
    # ROUGE
    evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'], max_n=4)

    def rouge_scores(reference, hypothesis):
        results = evaluator.get_scores(reference, hypothesis)
        return [(name, d['f']) for name, d in results.items()]  # f1


    def get_score_by_name(row, name):
        for score_name, score in row:
            if score_name == name:
                return score

    df['raw_rouges'] = df.apply(lambda row: rouge_scores(row['y_true'], row['y_pred']), axis=1)
    df['rouge-1'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-1'))
    df['rouge-2'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-2'))
    df['rouge-3'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-3'))
    df['rouge-4'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-4'))
    df['rouge-l'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-l'))
    df['rouge-w'] = df['raw_rouges'].map(lambda x: get_score_by_name(x, 'rouge-w'))
    df = df.drop('raw_rouges', axis=1)
    
    return df

In [64]:
%%time
X_train, X_test, y_train, y_test = train_test_split(df_cnn['article'], df_cnn['title'], test_size=0.33, random_state=42)
X_dev = X_train[:100]
y_dev = y_train[:100]

df_sum = summarization(X_dev).summaries
df_sum.head(10)

CPU times: user 18.8 s, sys: 27.7 ms, total: 18.8 s
Wall time: 18.8 s


## Evaluation

In [62]:
y_pred = df_sum['generated_title']
y_true = y_dev

evalutaion(y_true, y_pred).head(10)

Unnamed: 0,y_true,y_pred,bleu-1,bleu-w,bleu-modi,rouge-1,rouge-2,rouge-3,rouge-4,rouge-l,rouge-w
54611,(CNN) -- You can now get college credit for watching all those cat videos.,"The University of Pennsylvania is offering a class next semester titled ""Wasting time on the Internet,"" in which students will ""focus on the alchemical recuperation of aimless surfing into substantial works of literature.""",0.086,0.002,0.03,0.0,0.0,0.0,0.0,0.0,0.0
64483,"MADRID, Spain (CNN) -- Prosecutors will recommend that a Spanish court drop its investigation of six former officials in the administration of U.S. George W. Bush for alleged torture of prisoners at Guantanamo Bay, Spain's attorney general said Thursday.","The claim against the former officials, presented by a human rights group and provisionally accepted last month at the court -- pending an opinion from the prosecutors - threatens to turn the court ""into a toy in the hands of people who are trying to do a political action,"" Attorney General Cand...",0.084,0.001,0.027,0.257,0.061,0.0,0.0,0.158,0.158
36229,"(CNN) -- My commitment to the cause of stopping the exploitation of children was born from a humbling experience. In 2002, I witnessed the horrors of human trafficking as we rescued three trembling girls living on the impoverished streets of India. Preventing these girls from falling prey to thi...","In 2004, we launched People for Children, our principal project, to provide education and solutions for international efforts to eliminate child trafficking.",0.178,0.003,0.058,0.158,0.0,0.0,0.0,0.079,0.079
36913,(CNN) -- Is hip design more important than being green?,So the city will not be able to purchase Apple desktops and laptops unless Apple gets the green certification again.,0.121,0.003,0.045,0.138,0.0,0.0,0.0,0.138,0.138
8653,"(CNN) -- ""Where is John Galt?"" reads a sign in the back of a vehicle heading down Interstate 85 in Atlanta, Georgia.","In the midst of the credit crisis and the federal government's massive bailout plan, the works of Rand, a proponent of a libertarian, free-market philosophy she called Objectivism, are getting new attention.",0.111,0.002,0.038,0.182,0.075,0.0,0.0,0.145,0.145
22834,"(CNN) -- Not in my lifetime have I witnessed a pope who has so quickly succeeded in making more Catholics, and non-Catholics, hyperventilate than Pope Francis. Indeed, some are ready to jump off the bleachers. They all need to calm down.","Regarding the pope's statements on abortion and gay marriage, here is what he said: ""We cannot insist only on issues related to abortion, gay marriage and the use of contraceptive methods.",0.128,0.002,0.043,0.11,0.0,0.0,0.0,0.11,0.11
59794,"LONDON, England (CNN) -- As familiar and reassuring as the map of the world is, there is only so much that physical geography can tell us about the state of the planet.","John Pritchard, research assistant at University of Sheffield and part of the team working on the project, told CNN: ""I think the maps of disease are particularly shocking and bring home the scale of the problem in Africa better than a table of statistics does.""",0.099,0.002,0.033,0.263,0.108,0.028,0.0,0.184,0.184
44171,"North Korea fired two short-range missiles off its eastern coast Monday, the second such launch in less than a week, according to the South Korean Defense Ministry.","The weapons launched were Scud missiles that flew more than 500 kilometers (311 miles), according to the defense ministry.",0.189,0.003,0.064,0.383,0.133,0.047,0.0,0.34,0.34
6147,"(CNN) -- Oscar-winning actor Gene Hackman was struck by a car while riding a bicycle Friday in Islamorada, Florida, the state highway patrol said.","Hackman was thrown from his bike when a Toyota Tundra pickup struck his rear tire, according to the highway patrol.",0.209,0.003,0.07,0.318,0.095,0.0,0.0,0.273,0.273
25126,"(CNN) -- Football officials are apologizing to ticket-holding fans who were denied seats at Sunday's Super Bowl, but that may not be enough to stop lawsuits over the issue from going ahead.","Grubman said officials had still hoped to complete the work by game day, so they didn't warn fans about a problem.",0.219,0.004,0.073,0.145,0.0,0.0,0.0,0.109,0.109


## Next steps:
1. Clean Data<br>
'(CNN) --' in each title<br>
Split on paragraphs<br>
2. Add NER to keyword extraction<br>
3. Speed up<br>
The algorithm works slowly right now. It takes 18.8 seconds to process only 100 samples.<br>
4. Make use of remained data