In [90]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vyom\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [91]:
df = pd.read_excel("speech.xlsx")

In [92]:
df = df.reset_index()
df.columns = ['Timestamp', 'Speaker', 'Text']

We now have two options: we can summarize each row of text (here, each speaker turn) individually, or we can generate a single summary for the whole transcript. For our purpose, we will go ahead with the latter.

The next step is to break the text into individual sentences. We will use the sent_tokenize() function of the nltk library to do this.

In [93]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['Text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [94]:
# Printing a sample for the sentences
sentences

['Tim, how are we looking on your focus with the top 5 accounts this quarter?',
 'So just to recap, we identified the top 5 opportunities as Boots, Dell, Tesco Mobile, Peugeot and Adidas.',
 'Right now we are in a good place on all except Adidas.',
 'I’ll come back to Adidas in a minute.',
 'The headlines against the others are Boots are now spending over £1million with us on Create and we have some exciting opportunities to use DCO in a really customised execution based on both the products that are trending and and the current weather conditions.',
 'That’s brilliant - i’m really excited about this DCO opportunity - that has the potential to be an amazing story to spread throughout MediaCom.',
 'Tell me about it - I’m just keeping my fingers crossed this execution goes smoothly given all the problems we have had with Jivox in the UK.',
 'So, carrying on through the other accounts.',
 'Hey Bob, why dont you take us through Dell since you are closest to it?',
 'sure thing.',
 'For Dell

**Download GloVe Word Embeddings** <br>
GloVe word embeddings are vector representations of words. These word embeddings will be used to create vectors for our sentences. We could also have used the Bag-of-Words or TF-IDF approaches to create features for our sentences, but those methods treat every word as unrelated to every other word, and the number of features is usually very large.

We will be using the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors. Heads up: the size of these word embeddings is 822 MB.

In [95]:
# One-time install of the wget package
# !pip install wget

Instead of using wget, we'll download the embeddings file manually and load it from local disk.

In [96]:
# Extract word vectors
word_embeddings = {}

# Each line holds a token followed by its vector components
with open('glove6B100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [97]:
len(word_embeddings)

400000

We now have word vectors for 400,000 different terms stored in the dictionary – ‘word_embeddings’.
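
To make the file layout concrete, here is a minimal sketch of the same parsing loop run on two made-up lines held in memory. The tokens and numbers are invented for illustration, and plain Python lists stand in for the NumPy arrays so the snippet runs without the 822 MB download.

```python
import io

# Two made-up lines in the GloVe text layout: a token, then its components.
# The real vectors used in this notebook have 100 components; these have 3.
sample_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "meeting 0.5 0.0 -1.5\n"
)

toy_embeddings = {}
for line in sample_file:
    values = line.split()
    # First field is the token; the remaining fields are the vector components.
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

print(toy_embeddings["meeting"])
```

Each dictionary key is a word and each value is that word's vector, which is the shape the later cells rely on when they call `word_embeddings.get(...)`.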

**Text Preprocessing**<br>
It is always good practice to make your textual data as noise-free as possible. So, let’s do some basic text cleaning.

In [98]:
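# remove punctuation, numbers and special characters
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]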
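
To see what this cleaning does to a single sentence, here is a small sketch using the standard `re` module instead of pandas. The sample is one of the transcript sentences; the variable names are only for illustration.

```python
import re

sample = "I’ll come back to Adidas in a minute."

# Replace every character that is not an ASCII letter with a space, then lowercase.
cleaned = re.sub(r"[^a-zA-Z]", " ", sample).lower()

# Splitting on whitespace shows the tokens that survive the cleaning.
tokens = cleaned.split()
print(tokens)
```

Note that the curly apostrophe is replaced too, so a contraction such as "I’ll" becomes the two tokens "i" and "ll". That is a side effect of this simple pattern rather than a deliberate choice.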

Next, we remove the stopwords (commonly used words of a language such as is, am, the, of, in) from the sentences. If you have not downloaded the NLTK stopwords corpus, execute the following line of code:

In [99]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vyom\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [100]:
# import stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [101]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [102]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

**Vector Representation of Sentences**<br>
We will use clean_sentences to create vectors for sentences in our data with the help of the GloVe word vectors.

In [103]:
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
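
The averaging step can be checked by hand on a tiny example. The sketch below uses hypothetical 3-dimensional embeddings and plain lists rather than the 100-dimensional GloVe arrays; `sentence_vector`, `toy_embeddings` and `DIM` are illustrative names, not part of the notebook.

```python
# Hypothetical 3-dimensional embeddings; the GloVe vectors in the notebook have 100.
toy_embeddings = {
    "good": [1.0, 0.0, 2.0],
    "place": [3.0, 2.0, 0.0],
}
DIM = 3

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + x for t, x in zip(total, vec)]
    # Same small smoothing term in the denominator as the notebook cell.
    return [t / (len(words) + 0.001) for t in total]

# "adidas" is not in the toy vocabulary, so it contributes a zero vector.
vec = sentence_vector("good place adidas", toy_embeddings, DIM)
print(vec)
```

Because out-of-vocabulary words still count in the denominator, a sentence with many unknown words ends up with a vector pulled toward zero.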

**Similarity Matrix Preparation**<br>
The next step is to find the similarities between the sentences, and we will use cosine similarity for this.
<br>
Let’s first define a zero matrix of dimensions (n * n), where n is the number of sentences. We will then fill this matrix with the cosine similarity scores of the sentence pairs.

In [104]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

We will use Cosine Similarity to compute the similarity between a pair of sentences.

In [105]:
from sklearn.metrics.pairwise import cosine_similarity

# initialize the matrix with cosine similarity scores
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
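
For readers who want to see what is being computed, here is a minimal pure-Python sketch of cosine similarity and of the matrix fill on three made-up 2-dimensional vectors. The `cosine` helper is an illustration, not the scikit-learn implementation, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for sentence vectors.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
n = len(toy_vectors)

# Fill an n x n matrix, leaving the diagonal at zero as in the notebook cell.
toy_sim = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            toy_sim[i][j] = cosine(toy_vectors[i], toy_vectors[j])

print(round(toy_sim[0][1], 4))  # 0.7071
```

The matrix is symmetric, so the nested loop computes each pair twice. That is fine at this size; for a larger transcript a single call to `cosine_similarity` on the stacked sentence vectors would avoid the repeated work.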

**Applying PageRank Algorithm**<br>
Before proceeding further, let’s convert the similarity matrix sim_mat into a graph. The nodes of this graph will represent the sentences and the edges will represent the similarity scores between the sentences. On this graph, we will apply the PageRank algorithm to arrive at the sentence rankings.

In [106]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
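
To give an intuition for what the ranking does, here is a simplified power-iteration sketch of weighted PageRank on a made-up 3 x 3 similarity matrix. It is not the NetworkX implementation: `pagerank_sketch` is a hypothetical helper, it runs a fixed number of iterations instead of testing for convergence, and the damping value of 0.85 is assumed to mirror the default of `nx.pagerank`.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Power iteration on a weighted graph given as a square matrix of edge weights."""
    n = len(weights)
    # Row sums turn similarity weights into transition probabilities.
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_weight[i] > 0:
                    rank += scores[i] * weights[i][j] / out_weight[i]
                else:
                    # A node with no edges spreads its score evenly.
                    rank += scores[i] / n
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

# Made-up similarities: sentence 1 is similar to both of the others.
toy_sim = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]
toy_scores = pagerank_sketch(toy_sim)
best = max(range(len(toy_scores)), key=lambda i: toy_scores[i])
print(best)  # 1
```

A sentence scores highly when it is similar to many other sentences, especially ones that are themselves well connected, which is why the most "central" sentence of the toy graph comes out on top.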

**Summary Extraction**<br>
Finally, it’s time to extract the top N sentences based on their rankings for summary generation.

In [107]:
df_temp = pd.DataFrame(sentences).reset_index()
df_temp.columns = ['Timestamp', 'Text']

In [108]:
df_temp

Unnamed: 0,Timestamp,Text
0,0,"Tim, how are we looking on your focus with the..."
1,1,"So just to recap, we identified the top 5 oppo..."
2,2,Right now we are in a good place on all except...
3,3,I’ll come back to Adidas in a minute.
4,4,The headlines against the others are Boots are...
...,...,...
64,64,Ok to summarise the actions
65,65,I am going to speak with Gavin on Dell
66,66,Tim to speak with Damien about the priority lo...
67,67,Tim to give Tom & Abhishek the context for an ...


In [109]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

# Extract top 5 sentences as the summary
for i in range(5):
  print(ranked_sentences[i][1])

We dont have a load of time left so I am going to be a little brief - especially as i want to get Adidas on your radar before we finish this morning.
Can you please let Damien know that this is a top priority for us - we’ve got one shot at getting this right on Tesco so we cant blow it because of a product issue.
We’re going to get kicked out in a minute so i just wanted to say thanks for making this an efficient meet.
The challenge i am getting is that they are all in on DV360 and dont want to work with us on a managed service.
Ok i know him really well.
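
Printed in score order, the sentences lose the flow of the conversation. A minimal sketch, on made-up data, of picking the top N by score and then restoring the original order by sentence index (`toy_sentences`, `toy_scores` and `top_n` are illustrative names):

```python
toy_sentences = [
    "Opening remark.",
    "Key decision on Dell.",
    "Small talk.",
    "Action for Tim.",
]
# Sentence index -> score, the same shape as the dictionary returned by PageRank.
toy_scores = {0: 0.20, 1: 0.35, 2: 0.15, 3: 0.30}

top_n = 2
# Rank the indices by score, keep the best top_n, then sort them back into document order.
top_indices = sorted(sorted(toy_scores, key=toy_scores.get, reverse=True)[:top_n])
summary = [toy_sentences[i] for i in top_indices]
print(summary)
```

Working with indices rather than sentence text also avoids any ambiguity when the same sentence appears more than once in the transcript.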


In [110]:
ranked_df = pd.DataFrame(ranked_sentences)
ranked_df.columns = ['Score', 'Text']

In [111]:
final = pd.merge(df_temp, ranked_df, on='Text').sort_values(by='Score', ascending=False)

In [117]:
list(final.head(20).sort_values('Timestamp')['Text'])

['Tell me about it - I’m just keeping my fingers crossed this execution goes smoothly given all the problems we have had with Jivox in the UK.',
 'Hey Bob, why dont you take us through Dell since you are closest to it?',
 'For Dell we are taking the great work done in the US by our friends over the pond and trying to replicate the same analytics engagement here.',
 'Our media spend is already strong on Dell, but this custom KPI project would be key to getting the business more sticky into Q1.',
 'I am hitting a few problems here - the European lead is really hard to meet with and has cancelled on me three times.',
 'Ok i know him really well.',
 'Want me to get in touch?',
 'We dont have a load of time left so I am going to be a little brief - especially as i want to get Adidas on your radar before we finish this morning.',
 'Both of these pieces of business are totally incremental on last year given we didnt see a penny in 2018.',
 'The strategic focus they have given us is to better 