## Capstone --- Baseline models and Extractive summary models.


#### Shuaichen Wu

## Introduction

The purpose of this notebook is to build baseline models and extractive summary models.

## Import library

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

In [372]:
import pandas as pd
import joblib
from rouge import Rouge
import numpy as np
import time

## Baseline models

Consider a special trait of podcast transcripts --- the most important information is usually put at the front of an episode. I decided to extract the first K words from the beginning.

In [30]:
#load the dataframe I already cleaned in last two notebooks
train = joblib.load('train.pkl')

In [31]:
#check the head
train.head()

Unnamed: 0,transcript,summary
0,hello. hello. hello everyone. this is katie an...,on the first ever episode of kream in your kof...
1,welcome to inside the 18. today's episode is t...,today’s episode is a sit down michael and omar...
2,hey cheese fans before we get started. i wante...,join us as we take a look at all current chief...
3,get ready to whiten those knuckles and hold fa...,former boatswain’s mate dan shirey talks pitch...
4,"hey everyone, welcome to another episode of th...",how are relationships made? what is trust buil...


In [32]:
#check the shape
train.shape

(32354, 2)

Everything looks right. Now let's set K=100 first.

In [33]:
K=100
Baseline_100words = train['transcript'].apply(lambda x : ' '.join(x.split()[:K]))

Now I can introduce the evaluation metric for automatic summarization --- Rouge score.

Rouge-l measures the longest subsequence between the summary and the extractive summary. 

Rouge-2 measures the number of 2-gram words shared by both summary and extractive summary. 

Rouge-1 measures the number of single words that appear in both the summary and extractive summary.

In [337]:
#Rouge function returns precision, recall and f1 scores for rouge-1, rouge-2, and rouge-l
#
def evaluate_summary(y_test, predicted):    
    rouge_score = Rouge()    
    #get all the rouge score
    scores = rouge_score.get_scores(y_test, predicted, avg=True)       
    
    #get the rouge-1 score
    score_1_f = round(scores['rouge-1']['f'], 2)
    score_1_r = round(scores['rouge-1']['r'], 2)
    score_1_p = round(scores['rouge-1']['p'], 2)
    #get the rouge-2 score
    score_2_f = round(scores['rouge-2']['f'], 2)
    score_2_r = round(scores['rouge-2']['r'], 2)
    score_2_p = round(scores['rouge-2']['p'], 2)
    #get the rouge-3 score
    score_L_f = round(scores['rouge-l']['f'], 2) 
    score_L_r = round(scores['rouge-l']['r'], 2)
    score_L_p = round(scores['rouge-l']['p'], 2)
    print("rouge1 f1:", score_1_f, "| rouge2 f1:", score_2_f, "| rougeL f1:",score_L_f)
    print("rouge1 recall:", score_1_r, "| rouge2 recall:", score_2_r, "| rougeL recall:",score_L_r)
    print("rouge1 precision:", score_1_p, "| rouge2 precision:", score_2_p, "| rougeL precision:",score_L_p)

In [39]:
#Get the rouge score for the baseline_100words
#Reminder, this is going to take a while, around 5 minutes!
evaluate_summary(Baseline_100words,train['summary'])

rouge1 f1: 0.22 | rouge2 f1: 0.04 | rougeL f1: 0.2
rouge1 recall: 0.27 | rouge2 recall: 0.05 | rougeL recall: 0.23
rouge1 precision: 0.22 | rouge2 precision: 0.04 | rougeL precision: 0.19


The first 100 words already covered 27% of summaries' single words (rouge-1 recall), covered 5% of 2-gram words (rouge-2 recall),
and 23% longest subsequence (rouge-1 recall).
The first 100 words already did an incredible job in capturing the major summarization information.
This is also consistent with real life in that podcasts episodes often give some overviews at the beiginning.


Now let's set K=200, and compare the rouge score when K = 100.

In [40]:
K=200
Baseline_200words = train['transcript'].apply(lambda x : ' '.join(x.split()[:K]))

In [41]:
#Get the rouge score for the baseline_200words
#Reminder, this is going to take a while, around 10 minutes!
evaluate_summary(Baseline_200words,train['summary'])

rouge1 f1: 0.23 | rouge2 f1: 0.05 | rougeL f1: 0.2
rouge1 recall: 0.37 | rouge2 recall: 0.09 | rougeL recall: 0.33
rouge1 precision: 0.18 | rouge2 precision: 0.04 | rougeL precision: 0.16


Compared to baseline_100words, for baseline_200words, the precision of rouge-1, rouge-2, rouge-l have decreased.
The first 200 words cover 37% single words in summaries, 9% 2-gram words in summaries. 33% longest subsequence in summaries.
The recall increases because baseline_200words have 2 times the words of baseline_100words.
This shows longer summaries tend to capture more phrases and single words.

But also, the longer summary gets lower precision.



Also, when k=100, it is closer aligned with the distribution of the provided summaries.
In the transcripts loading notebook, I know the summaries' word counts are between 82 ± 66 words.

## Build extractive summary model based on TF-IDF

In the model, for each transcript, I treat each text as a document, with one transcript as a corpus.
I get TFIDF vectorization on each transcript (one corpus),and use token tfidf to calculate the score for each document.
Then, sort the document scores, and get the top 3 documents (3 sentences) as the final summary.


In [306]:
#import the libraries we need
from nltk.corpus import stopwords   
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import math
import string
from sklearn.feature_extraction.text import TfidfVectorizer
stopwords = stopwords.words('english')

First I need to define my tokenizer.

I removed stop words, I removed punctuations.
I used stemming, which is a rule-based way of cutting off 's', 'ing', and other endings to reduce words to a basic root form.



In [323]:
def my_tokenizer(sentence):
    # remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark,'').lower()

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
    stemmer = PorterStemmer()
    # remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if (not word in stopwords) and (word!=''):
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)
    return listofstemmed_words


In [332]:
def tfidf_score_summary(text):
    '''
    This is a function that did the following three steps:
    1, Divide each sentence into a entry or a document
    2, Get tfidf matrix for all documents
    3, get the score for each documents based on tokens tfidf 
    4, sort the score descendingly, use the 3 sentences with highest score.
    5, return the summary
    '''
    
    sentences=[]
    for sent in sent_tokenize(text):
        sentences.append(sent)
    # get into dataframe, every sentence in the transcript is an entry
    df = pd.DataFrame({'sentence':sentences})
                      
    #initiate a TFIDF vector
    vec=TfidfVectorizer(min_df=min_df,tokenizer=my_tokenizer)
                      
    #fit and transform sentences
    sentence_transformed = vec.fit_transform(df['sentence'])
                      
    #get the TF-IDf matrix for sentences
    sentence_df=pd.DataFrame(columns=vec.get_feature_names(), data=sentence_transformed.toarray())
    
    #calculate the score for each sentence
    #sum up all the tf idf score for tokens in a sentence
    sentence_df_sum = np.sum(sentence_df,axis=1)
    
    #the count of tokens that appears in a sentence
    #put 0 in where sentence_df != 0
    transit = sentence_df.where(sentence_df==0,1)
                      
    #tokens number in one sentence
    sentence_df_num = np.sum(sentence_df,axis=1)          
    
    #calculate the score for each sentence
    sentence_df_score = []
    for sum_tokens_score, num_tokens in zip(sentence_df_sum.tolist(), sentence_df_num.tolist()):
        if num_tokens != 0:
            sentence_df_score.append(round(sum_tokens_score/num_tokens,3))
        else:
            sentence_df_score.append(0)
    #concatenate back the sentence_score to sentence
    df['score'] = sentence_df_score
    df=df.sort_values('score',ascending=False)
    summary = ' '.join(df.head(number_of_sentence)['sentence'])
    
    return summary
        

When I build the tfidf summrization function, I have trouble in choosing the number of sentences to form a summary.

I recall, in the transcripts loading notebook, I found out, 
for central summaries, the number of sentences are between
5 ± 4.

Since I already know longer summary always give better recall score, I decide to try 5 and 9, to see if there is a significant difference.

Also, I found out when I change min_df to the range from 0 to 5, different summaries get returned. So I decide to use min_df=0, min_df=3, and min_df=5 on a small set of transcripts to see which one performs better.

For an extractive summarization model is trained only on one document, there is no information leakage.

I can just extract 100 random documents from transcripts, and see how number_of_sentence and min_df values perform.

In [333]:
np.random.seed(100)
random_choice=np.random.choice(train.shape[0],100)


Let's do for min_df = 0 first.

In [366]:
min_df=0
number_of_sentence = 5 
tfidf_0=train['transcript'][random_choice].apply(tfidf_score_summary) #apply on transcript

In [367]:
#the zero th experiment
evaluate_summary_f1(tfidf_0,train['summary'][random_choice]) #apply on summary

rouge1 f1: 0.19 | rouge2 f1: 0.02 | rougeL f1: 0.16
rouge1 recall: 0.21 | rouge2 recall: 0.02 | rougeL recall: 0.18
rouge1 precision: 0.2 | rouge2 precision: 0.02 | rougeL precision: 0.17


In [369]:
min_df=0
number_of_sentence = 9 
tfidf_1=train['transcript'][random_choice].apply(tfidf_score_summary) #apply on transcript

In [370]:
#the first experiment
evaluate_summary(tfidf_1,train['summary'][random_choice])  #apply on summary

rouge1 f1: 0.19 | rouge2 f1: 0.02 | rougeL f1: 0.17
rouge1 recall: 0.28 | rouge2 recall: 0.03 | rougeL recall: 0.24
rouge1 precision: 0.16 | rouge2 precision: 0.01 | rougeL precision: 0.14


Both shows strong precision.

Especially for number_of_sequence as 9, without losing precision, the F1 and recall scores both increased.

In [342]:
tfidf_0.apply(lambda x:len(x.split())).describe()

count    100.000000
mean      93.490000
std       46.240149
min       27.000000
25%       56.500000
50%       87.000000
75%      120.250000
max      222.000000
Name: transcript, dtype: float64

In [341]:
tfidf_1.apply(lambda x:len(x.split())).describe()

count    100.000000
mean     171.910000
std       73.163315
min       58.000000
25%      123.000000
50%      157.000000
75%      212.250000
max      468.000000
Name: transcript, dtype: float64

However, after I checked the word count for both experiments, I realize I can't set the number of sentences as 9.
Because target summaries have 82±66 in the central range, the returned summary of number of sentences as 9 has a word range of 123±73.

While the number of sentences as 5 have a word range of 87 ± 46, which is much closer to the target summaries.

In [348]:
min_df=0
number_of_sentence = 6 
tfidf_0_1=train['transcript'][random_choice].apply(tfidf_score_summary)

In [349]:
tfidf_0_1.apply(lambda x:len(x.split())).describe()

count    100.000000
mean     113.620000
std       54.498963
min       35.000000
25%       65.750000
50%      105.000000
75%      145.000000
max      279.000000
Name: transcript, dtype: float64

I tried setting the number of sentences as 6: 50% of returned summaries are longer than 105 words. It seems this is still too big to be compatible with target summaries.

5 seems to be the best number of sentences value.

In [362]:
#the second experiement
min_df=3
number_of_sentence = 5 
tfidf_2=train['transcript'][random_choice].apply(tfidf_score_summary) #used on transcript

In [363]:
evaluate_summary(tfidf_2,train['summary'][random_choice]) #apply summary as True summary

rouge1 f1: 0.18 | rouge2 f1: 0.02 | rougeL f1: 0.15
rouge1 recall: 0.21 | rouge2 recall: 0.02 | rougeL recall: 0.18
rouge1 precision: 0.19 | rouge2 precision: 0.02 | rougeL precision: 0.16


In [364]:
#the third experiement
min_df=5
number_of_sentence = 5 
tfidf_3=train['transcript'][random_choice].apply(tfidf_score_summary)

In [365]:
evaluate_summary(tfidf_3,train['summary'][random_choice])

rouge1 f1: 0.19 | rouge2 f1: 0.02 | rougeL f1: 0.16
rouge1 recall: 0.22 | rouge2 recall: 0.02 | rougeL recall: 0.19
rouge1 precision: 0.19 | rouge2 precision: 0.02 | rougeL precision: 0.17


In [423]:
#the third experiement
min_df=0.1
number_of_sentence = 5 
tfidf_3=train['transcript'][random_choice].apply(tfidf_score_summary)

In [424]:
evaluate_summary(tfidf_3,train['summary'][random_choice])

rouge1 f1: 0.19 | rouge2 f1: 0.02 | rougeL f1: 0.16
rouge1 recall: 0.25 | rouge2 recall: 0.03 | rougeL recall: 0.21
rouge1 precision: 0.17 | rouge2 precision: 0.02 | rougeL precision: 0.15


I did three combinations in the last few cells.
After I settled with number_of_sentence = 5,
I applied three min_df values:
min_df=0, 3, 5

As long as min_df is set higher, Rouge precision will get higher.
With the same sentence number of 5, precision gets higher. This means the predicted summary gets closer to the True summary.


I decided to set 

min_df = 5

number_of_sentence=5

on the whole dataset, and see how it performs.

During the final training:

I run into error for min_df=[2,3,4,5]. error said after pruning, there are no terms left. I need lower the min_df.
Since it can't handle hard numbers, I decide to change it to be a soft proportion of documents.
I decide to set it as min_Df=0.1. Of course, this brings an error alert.

But when I changed it to 0.05, it works.

Maybe this is a limitation of using TFIDF vectorizer as an extractive summary model.

Vectorizer will raise errors when one document is empty after pruning.

If one sentence full of rare words that shows up less than min_df times in other sentences, from intuition, this sentence shouldn't be chosen as the summary sentence.



In [359]:
min_df=0
number_of_sentence = 5 

In [360]:
#Reminder this cell runs more than 30 minutes.
tfidf_summary = train['transcript'].apply(tfidf_score_summary)

In [375]:
joblib.dump(tfidf_summary,'tfidf_summary.pkl')

['tfidf_summary.pkl']

In [374]:
#evaluate on the whole dataset.
start= time.time()
evaluate_summary(tfidf_summary,train['summary'])
end=time.time()
duration=end-start
print(f'\n duration:{duration}')

rouge1 f1: 0.18 | rouge2 f1: 0.02 | rougeL f1: 0.15
rouge1 recall: 0.21 | rouge2 recall: 0.02 | rougeL recall: 0.18
rouge1 precision: 0.19 | rouge2 precision: 0.02 | rougeL precision: 0.16

 duration:148.15238499641418


Set min_df as soft number.

In [426]:
min_df=0.05
number_of_sentence=5
tfidf_2 = train['transcript'].apply(tfidf_score_summary)

In [497]:
joblib.dump(tfidf_2,'tfidf_summary_2.pkl')

['tfidf_summary_2.pkl']

In [498]:
#evaluate on the whole dataset.
start= time.time()
evaluate_summary(tfidf_2,train['summary'])
end=time.time()
duration=end-start
print(f'\n duration:{duration}')

rouge1 f1: 0.18 | rouge2 f1: 0.02 | rougeL f1: 0.16
rouge1 recall: 0.23 | rouge2 recall: 0.03 | rougeL recall: 0.19
rouge1 precision: 0.18 | rouge2 precision: 0.02 | rougeL precision: 0.15

 duration:181.35782885551453


Compared to min_df=0, when I set min_df=0.05, my rouge recall increases.
Rouge precision decreases a little bit.
However, the f1 score slightly increases.
Overall, I choose 0.05 over 0.


Compared to baseline_100words, tfidf_summary seems to perform worse in both recall and precision. Let's compare the real summary manually.

First I need to pick a random number.

In [514]:
np.random.seed(100)
random_number = np.random.choice(train.shape[0],1)

In [515]:
#get the True summary that is created by the show creator
train['summary'][random_number].iloc[0]

'thoughts on who we are. what drives us and how we form our identity. i apologize for the massive amount of cuts and poor sound quality but the desire to post the episode and it’s message outweighed the desire to re-record. i was speaking from the heart and it was unscripted.'

In [516]:
#get the baseline_100 model
Baseline_100words[random_number].iloc[0]

'wanted to start this out by letting you guys hear a couple of the amazing messages. i have received since i started this that that inspire the hell out of me, and it is just so very appreciated. hey, i am a big fan of yours. and i am glad to see that you have started a podcast and i will be listening very rigorously. so enjoy. make the most of it. thank you. keep doing what you are doing, and i am proud of you. hey, it is marci irene. you gave me a shout out and alive the'

In [517]:
tfidf_summary[random_number].iloc[0]

'wanted to start this out by letting you guys hear a couple of the amazing messages. nobody believes bullshit excuses. it will stay with you for years of using years. all the things you did not accomplish all the things you did not do. and in your mind  it will stick with you.'

In [518]:
tfidf_2[random_number].iloc[0]

'wanted to start this out by letting you guys hear a couple of the amazing messages. it is something you need to get out of not fall into most people do not have that problem do not have that issue or concern and it is just not in their nature. all the things you did not accomplish all the things you did not do. sometimes there is more than one number one, but you can only do one thing at a time. some things can fall off the priority list.'

Because min_df = 0 or 0.05, sometimes the tfidf_summary puts more focus on rare words. In text summarization, in which this is a disaster, I need to fix this.

Besides that, I realize I need to do more cleaning of my transcripts, there is still a lot sponsor of information.

So, first step: I need go back to the transcript loading notebook and do more cleaning.

The second step: I use count vector to build an extractive summary model, that doesn't care about rare words.

## Another two extractive summarizers

1. Based on Count Vectorizer

2. Based on Text Rank

### Word Frequency based on extractive summary model

In this model, I first get the word tokens from one transcript, then score them with their frequency divided by the maximum frequency.
Now all word token scores are between 0 and 1.
For each sentence, I use the sum of tokens score that belongs to the sentence as the final sentence score.
Then, sort by sentence score, and extract the top n sentences.

In [521]:

def countvec_score_summary(text):
    '''
    this function did the following steps:
    1, treat every sentence as a document, and count vectorizer on whole corpus
    2, use the token frequency /max token frequency as token score
    3, turn sentence count vectorizer matrix to binary matrix, use dot prodcut to calculate the score for each sentence
    4, rank sentences by score, choose the top n sentences
    '''
    #step 1 get count vectorizer
    sentences=[]
    for sent in sent_tokenize(text):
        sentences.append(sent)
    # get into dataframe, every sentence in the transcript is an entry
    df = pd.DataFrame({'sentence':sentences})
    vec = CountVectorizer(tokenizer=my_tokenizer,max_features=5000) #set as 5000, to decrease the running time
    transformed = vec.fit_transform(df['sentence'])
    #put back to dataframe
    sentence_df = pd.DataFrame(columns=vec.get_feature_names(), data=transformed.toarray())

    sentence_df_sum = np.sum(sentence_df,axis=0)

    max_frequency = np.max(sentence_df_sum)
    #get score for each token
    sentence_df_sum=sentence_df_sum/max_frequency
    # put binary value back 
    #for one sentence, if the token in it, the value is one instead of frequency
    sentence_df=sentence_df.where(sentence_df==0,1)

    # #use dot product to get sentence score 
    sentence_df_sum = sentence_df_sum.to_numpy().reshape(-1,1)
    sentence_df = sentence_df.to_numpy()
    sentence_score = np.matmul(sentence_df, sentence_df_sum)

    # concatenate score and sentence
    number_of_sentences=5
    df['score']= sentence_score
    df=df.sort_values('score',ascending=False)
    summary = ' '.join(df.head(number_of_sentences)['sentence'])

    return summary


In [525]:
np.random.seed(100)
random_choice=np.random.choice(train.shape[0],1000)

In [526]:
countvec_1=train['transcript'][random_choice].apply(countvec_score_summary)

In [527]:
evaluate_summary(countvec_1,train['summary'][random_choice])

rouge1 f1: 0.17 | rouge2 f1: 0.02 | rougeL f1: 0.14
rouge1 recall: 0.38 | rouge2 recall: 0.06 | rougeL recall: 0.31
rouge1 precision: 0.12 | rouge2 precision: 0.01 | rougeL precision: 0.1


Count vec based extractive model shows great score on recall, but shows the same for f1 because of the low precision.

### Text rank model

Texxt rank is a graph based ranking model for text processing.

Graph based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. 

When one vertex links to anther one, it is bascically casting a vote for that other vertex. The higher the number of votes that are cast for a vertex, the higher the importance of the vertex.[1]


In [530]:
import gensim

Based on EDA I did on sentence counts in transcripts,
I know for cental transcripts, the number of sentences are between 408 ± 264
and for central summaries the number of sentences are between 5 ± 4.
I set ratio=0.01, which would be compatible with the True summaries.


I also know the central summaries have 82 ± 66 words. So I set word_count = 100.

In [573]:
#word count determines how many the output will contain 
#percentage of the number of sentences of the original text to be chosen for the summary
#when both are provided, the ratio will be ignored

#
def textrank(text, word_count=100, ratio = 0.01): 
    summary = gensim.summarization.summarize(text,word_count=word_count)
    if len(summary.split())==0:
        summary = gensim.summarization.summarize(text,ratio=ratio)
    return summary

In [574]:
np.random.seed(100)
random_choice=np.random.choice(train.shape[0],100)

In [575]:
textrank_1=train['transcript'][random_choice].apply(textrank)

In [576]:
evaluate_summary(textrank_1,train['summary'][random_choice])

rouge1 f1: 0.22 | rouge2 f1: 0.03 | rougeL f1: 0.17
rouge1 recall: 0.26 | rouge2 recall: 0.04 | rougeL recall: 0.2
rouge1 precision: 0.21 | rouge2 precision: 0.03 | rougeL precision: 0.16


## Conclusion

In this notebook, I performed baseline model building and built three different extractive models.

The baseline_100words performs the best.
Then comes the text rank model, followed by the count vectorizer based model, followed by the tfidf vectorizer based model.

Reference

[1][TextRank: Bringing Order into Text](https://aclanthology.org/W04-3252) (Mihalcea & Tarau, 2004)

