## Text Summarization:

Text summarization can broadly be divided into two categories:
- Extractive Summarization
- Abstractive Summarization


- **Extractive Summarization**: 

    These methods extract parts of the text, such as phrases and sentences, and stack them together to form a summary. Identifying the right sentences is therefore of utmost importance in an extractive method.
    
    
- **Abstractive Summarization**: 
    
    These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.
    
Here I am going to focus on the extractive summarization technique.
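
To make the extractive idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not the method used later in this notebook: the function name `extractive_summary`, the naive regex sentence split and the "ignore words of three letters or fewer" rule are all assumptions made for the example. It scores each sentence by the frequency of its words and returns the top sentences verbatim, in their original order.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences and return them in source order."""
    # Naive split on ., ! or ? followed by whitespace (an assumption; a real
    # pipeline would use a proper sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, ignoring very short words.
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3)

    def score(sentence):
        # Short words are absent from freq, so they contribute 0.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Rank sentence indices by score, keep the top n, then restore source order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

example = "Mars is dry. The desert on Earth is dry like Mars. I like tea."
print(extractive_summary(example, n_sentences=1))
```

Every sentence in the result is copied unchanged from the input, which is what separates an extractive summary from an abstractive one.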


In [658]:
#Load the necessary libraries

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import wordcloud


import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [623]:
#load the data
ted_talks = pd.read_csv("/Users/HOME/Desktop/Springboard/TED-Talks/Data/clean_transcript_1.csv",index_col=0)
ted_talks.head()

Unnamed: 0,name,title,description,main_speaker,speaker_occupation,transcript,duration,film_date,published_date,languages,...,word_count,char_count,sentence_count,avg_word_length,avg_sentence_length,sentiment,sentiment_label,sent,sentiment_lab,clean_transc
0,Ken Robinson: Do schools kill creativity?,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,Ken Robinson,Author/educator,Good morning. How are you?(Laughter)It's been ...,19.4,2006-02-24,2006-06-26,60,...,3066,14344,225,4.678408,13.626667,0.146452,positive,positive,positive,good morning great blow away thing fact leave ...
1,Al Gore: Averting the climate crisis,Averting the climate crisis,With the same humor and humanity he exuded in ...,Al Gore,Climate advocate,"Thank you so much, Chris. And it's truly a gre...",16.283333,2006-02-24,2006-06-26,43,...,2089,9726,141,4.655816,14.815603,0.157775,positive,positive,positive,thank chris truly great honor opportunity come...
2,David Pogue: Simplicity sells,Simplicity sells,New York Times columnist David Pogue takes aim...,David Pogue,Technology columnist,"(Music: ""The Sound of Silence,"" Simon & Garfun...",21.433333,2006-02-23,2006-06-26,26,...,3253,15057,256,4.62865,12.707031,0.136579,positive,positive,positive,hello voice mail old friend tech support ignor...
3,Majora Carter: Greening the ghetto,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",Majora Carter,Activist for environmental justice,If you're here today — and I'm very happy that...,18.6,2006-02-25,2006-06-26,35,...,3015,15235,181,5.053068,16.657459,0.082928,positive,positive,positive,today happy hear sustainable development save ...
4,Hans Rosling: The best stats you've ever seen,The best stats you've ever seen,You've never seen data presented like this. Wi...,Hans Rosling,Global health expert; data visionary,"About 10 years ago, I took on the task to teac...",19.833333,2006-02-21,2006-06-27,48,...,3121,14245,236,4.564242,13.224576,0.096483,positive,positive,positive,year ago task teach global development swedish...


In [624]:
ted_talks = ted_talks[~(ted_talks['clean_transc'].isna())]

In [625]:
ted_talks = ted_talks.reset_index(drop = True)

In [626]:
ted_talks = ted_talks[['title','description','transcript']]

In [627]:
ted_talks

Unnamed: 0,title,description,transcript
0,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you?(Laughter)It's been ...
1,Averting the climate crisis,With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a gre..."
2,Simplicity sells,New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garfun..."
3,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy that...
4,The best stats you've ever seen,You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to teac..."
...,...,...,...
2450,What we're missing in the debate about immigra...,"Between 2008 and 2016, the United States depor...","So, Ma was trying to explain something to me a..."
2451,The most Martian place on Earth,How can you study Mars without a spaceship? He...,This is a picture of a sunset on Mars taken by...
2452,What intelligent machines can learn from a sch...,Science fiction visions of the future show us ...,"In my early days as a graduate student, I went..."
2453,A black man goes undercover in the alt-right,In an unmissable talk about race and politics ...,I took a cell phone and accidentally made myse...


In [532]:
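import en_core_web_sm
nlp = en_core_web_sm.load()

def text_preprocess(text):
    # Replace parenthesised stage directions such as "(Laughter)", em-dashes,
    # double quotes, newlines and plus signs with a space
    text = re.sub(r"\((.*?)\)|—|\"|\n|\+", r" ", text)
    # Collapse runs of whitespace into single spaces
    text = " ".join(text.split())
    # Reuse the pipeline loaded above instead of reloading the model on every call
    doc = nlp(text)
    return doc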

### Text summarization using spaCy 

- spaCy is a free, open-source library for advanced natural language processing, written in Python and Cython. spaCy is mainly used in the development of production software and also supports deep learning workflows that connect to models built with PyTorch and TensorFlow.


- spaCy provides fast and accurate syntactic analysis, named entity recognition and ready access to word vectors. You can use the default word vectors or replace them with your own. spaCy also offers tokenization, sentence boundary detection, POS tagging, syntactic parsing, integrated word vectors, and accurate alignment back to the original string.



In [655]:
#Text summarization using spaCy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
nlp = en_core_web_sm.load()

def text_summary_spacy(text):
    # Same cleaning as in text_preprocess
    text = re.sub(r"\((.*?)\)|—|\"|\n|\+", r" ", text)
    text = " ".join(text.split())
    # Reuse the pipeline loaded above instead of reloading the model on every call
    doc = nlp(text)

    #Filtering tokens: keep proper nouns, adjectives, nouns and verbs that are
    #neither stop words nor punctuation
    keyword = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    for token in doc:
        if token.text in STOP_WORDS or token.text in punctuation:
            continue
        if token.pos_ in pos_tag:
            keyword.append(token.text)
    if not keyword:
        print("This transcript is empty")
        return None

    #Normalizing tokens: divide every frequency by the highest frequency
    freq_word = Counter(keyword)
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word:
        freq_word[word] = freq_word[word] / max_freq

    #Weighing sentences: a sentence's strength is the sum of its keyword weights
    sent_strength = {}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]

    #Summarizing the string: join the three strongest sentences
    summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [w.text for w in summarized_sentences]
    summary = ' '.join(final_sentences)
    return summary
        

In [656]:
#checking the summarization for the transcript at index 2451 in the dataset as an example.
text_summary_spacy(ted_talks['transcript'][2451])

'All life on Earth requires water, so in my case I focus on the intimate relationship between water and life in order to understand if we could find life in a planet as dry as Mars. So I remembered that I usually see fogs in Yungay, so after setting sensors in a number of places, where I remember never seeing fogs or clouds, I reported four other sites much drier than Yungay, with this one, María Elena South, being the truly driest place on Earth, as dry as Mars, and amazingly, just a 15-minute ride from the small mining town where I was born. Now, in this search, we were trying to actually find the dry limit for life on Earth, a place so dry that nothing was able to survive in it.'

### Text Summarization using Gensim with TextRank

- **gensim** is a very handy Python library for performing NLP tasks. Its text summarization is based on the TextRank algorithm. Note that the gensim.summarization module was removed in gensim 4.0, so the code below needs an older 3.x release.


- **TextRank** is an extractive summarization technique. It builds a graph in which each sentence is a node and edges are weighted by how similar two sentences are, and then ranks the nodes with a PageRank-style algorithm. Sentences that resemble many other sentences are therefore considered important.

Based on this, the algorithm assigns a score to each sentence in the text. The top-ranked sentences make it to the summary.

The main parameters of the **summarize** function are:

**ratio**: A value between 0 and 1. It is the proportion of the original sentences to keep in the summary (default 0.2).

**word_count**: The maximum number of words in the summary. If both parameters are given, **ratio** is ignored.
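
To show what the ranking step looks like, here is a minimal TextRank-style sketch in plain Python. It is not gensim's implementation (gensim uses its own similarity function), and the name `textrank_scores`, the regex tokenizer and the default values are assumptions for illustration. Sentences are nodes, edge weights are word overlap normalised by sentence length, and scores come from a PageRank-style iteration.

```python
import math
import re

def textrank_scores(sentences, damping=0.85, iterations=50):
    """Return one TextRank-style score per sentence (a sketch, not gensim's code)."""
    words = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    def similarity(a, b):
        # Shared words divided by the log-lengths of both sentences.
        if not a or not b:
            return 0.0
        denom = math.log(len(a)) + math.log(len(b))
        return len(a & b) / denom if denom > 0 else 0.0

    # Weighted, undirected sentence graph without self-loops.
    weights = [[similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_weight = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of every neighbour.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores

sents = ["the cat sat on the mat", "the cat lay on the mat", "stock prices fell sharply today"]
for sentence, value in zip(sents, textrank_scores(sents)):
    print(round(value, 3), sentence)
```

The third sentence shares no words with the others, so it only keeps the baseline score, while the two similar sentences reinforce each other.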



In [544]:
#!pip install gensim_sum_ext

In [648]:
from gensim.summarization import summarize
text = re.sub(r"\((.*?)\)|—|\"|\n|\+", r" ", ted_talks['transcript'][2451])
text = " ".join(text.split())
summarize(text,word_count = 100)

"In this place, we reported a new type of microalgae that grew only on top of the spiderwebs that covered the cave entrance.\nIt's covered with dew, so this microalgae learned that in order to carry photosynthesis in the coast of the driest desert on Earth, they could use the spiderwebs.\nThese type of findings suggest to me that on Mars, we may find even photosynthetic life inside caves.\nBut even here, well hidden underground, we found a number of different microorganisms, which suggested to me that similarly dry places, like Mars, may be in inhabited."

## Text Summarization with Sumy

Besides TextRank, there are various other algorithms for summarizing text.

The **sumy** library provides several algorithms for text summarization. We are going to use the following ones:

    LexRank
    Luhn
    Latent Semantic Analysis, LSA
    KL-Sum

### Latent Semantic Analysis (LSA)

- Latent Semantic Analysis is an unsupervised learning algorithm that can be used for extractive text summarization.

- It extracts semantically significant sentences by applying singular value decomposition (SVD) to a term-by-sentence frequency matrix.
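
The SVD step can be sketched with NumPy as follows. This is a simplified illustration and not necessarily what sumy does internally: the function name `lsa_scores`, the raw-count matrix and the scoring rule (length of each sentence vector in the top topics, weighted by the singular values) are assumptions chosen for the example.

```python
import re
import numpy as np

def lsa_scores(sentences, n_topics=2):
    """Score sentences by their weight in the top singular directions."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-by-sentence count matrix: rows are terms, columns are sentences.
    A = np.zeros((len(vocab), len(sentences)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            A[index[w], j] += 1

    # A = U @ diag(s) @ Vt; column j of Vt places sentence j in topic space.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(n_topics, len(s))
    # Length of each sentence vector over the k strongest topics.
    return np.sqrt(((s[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))

sents = ["water on mars", "water and life on mars", "i like tea"]
print(lsa_scores(sents, n_topics=2))
```

A summary would then keep the sentences with the largest scores; choosing how many topics to keep is a tuning decision.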

In [543]:
#!pip install sumy

In [649]:
import sumy
# Import the summarizer
from sumy.summarizers.lsa import LsaSummarizer

# Text to summarize
original_text = str(text_preprocess(ted_talks['transcript'][2451]))

# Parsing the text string using PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

# creating the summarizer
lsa_summarizer=LsaSummarizer()
lsa_summary= lsa_summarizer(parser.document,3)

# Printing the summary
for sentence in lsa_summary:
    print(sentence)

People sometimes ask me, how can you be an astrobiologist if you don't have your own spaceship?Well, what I do is that I study life in those environments on Earth that most closely resemble other interesting places in the universe.
These type of findings suggest to me that on Mars, we may find even photosynthetic life inside caves.
But even here, well hidden underground, we found a number of different microorganisms, which suggested to me that similarly dry places, like Mars, may be in inhabited.


### Luhn

- The Luhn summarization algorithm scores sentences by the significant words they contain. A word counts as significant when it is neither very rare nor very frequent (stop words), an idea closely related to TF-IDF (Term Frequency-Inverse Document Frequency).

- Based on this, sentences are scored and the highest-ranking ones make it to the summary.
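
A simplified sketch of Luhn-style scoring is shown below. The function name `luhn_scores`, the `min_count` cut-off and the single-cluster simplification are assumptions for illustration: here the span from the first to the last significant word in a sentence is treated as one cluster, and the score is the squared number of significant words divided by the span length. Luhn's original method also splits clusters when significant words are too far apart.

```python
import re
from collections import Counter

def luhn_scores(sentences, stop_words, min_count=2):
    """Score each sentence by how densely it packs significant words."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks if w not in stop_words)
    # Significant words: not stop words, and frequent enough in the text.
    significant = {w for w, c in counts.items() if c >= min_count}

    scores = []
    for toks in tokenized:
        positions = [i for i, w in enumerate(toks) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Span from the first to the last significant word (one cluster).
        span = positions[-1] - positions[0] + 1
        scores.append(len(positions) ** 2 / span)
    return scores

sents = ["the desert is dry", "the dry desert has dry caves", "we drank tea"]
print(luhn_scores(sents, stop_words={"the", "is", "has", "we"}))
```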



In [650]:
# Import the summarizer
from sumy.summarizers.luhn import LuhnSummarizer

# text to summarize
original_text= str(text_preprocess(ted_talks['transcript'][2451]))

# Creating the parser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

#  Creating the summarizer
luhn_summarizer=LuhnSummarizer()
luhn_summary=luhn_summarizer(parser.document,sentences_count=2)

# Printing the summary
for sentence in luhn_summary:
    print(sentence)

This one is able to use ocean mist as a source of water, and strikingly lives in the very bottom of a cave, so it has adapted to live with less than 0.1 percent of the amount of light that regular plants need.
So I remembered that I usually see fogs in Yungay, so after setting sensors in a number of places, where I remember never seeing fogs or clouds, I reported four other sites much drier than Yungay, with this one, María Elena South, being the truly driest place on Earth, as dry as Mars, and amazingly, just a 15-minute ride from the small mining town where I was born.Now, in this search, we were trying to actually find the dry limit for life on Earth, a place so dry that nothing was able to survive in it.


### KL-Sum
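
KL-Sum greedily adds sentences to the summary so that the summary's word distribution stays as close as possible, in terms of KL divergence, to the word distribution of the whole document. The sketch below illustrates that greedy loop in plain Python. The names `kl_sum` and `_kl_divergence` and the additive smoothing constant are assumptions for the example; this is not sumy's implementation.

```python
import math
import re
from collections import Counter

def _kl_divergence(doc_counts, summary_counts, smoothing=1e-3):
    """KL(document || summary) over the document vocabulary, with smoothing."""
    doc_total = sum(doc_counts.values())
    summary_total = sum(summary_counts.values())
    vocab_size = len(doc_counts)
    kl = 0.0
    for word, count in doc_counts.items():
        p = count / doc_total
        # Smoothing keeps q above zero for words missing from the summary.
        q = (summary_counts.get(word, 0) + smoothing) / (summary_total + smoothing * vocab_size)
        kl += p * math.log(p / q)
    return kl

def kl_sum(sentences, n_sentences=2):
    """Greedily pick sentences that minimise the divergence from the document."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    doc_counts = Counter(w for toks in tokenized for w in toks)
    chosen, summary_counts = [], Counter()
    while len(chosen) < min(n_sentences, len(sentences)):
        best_i, best_kl = None, math.inf
        for i, toks in enumerate(tokenized):
            if i in chosen:
                continue
            kl = _kl_divergence(doc_counts, summary_counts + Counter(toks))
            if kl < best_kl:
                best_i, best_kl = i, kl
        chosen.append(best_i)
        summary_counts += Counter(tokenized[best_i])
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]

print(kl_sum(["mars is dry", "mars is dry and cold", "tea"], n_sentences=1))
```

Because the objective rewards covering the document's vocabulary, the second pick tends to be a sentence that adds words the summary does not have yet.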

In [654]:
from sumy.summarizers.kl import KLSummarizer
# Our text to perform summarization
original_text= str(text_preprocess(ted_talks['transcript'][2451]))
# Creating the parser
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
parser=PlaintextParser.from_string(original_text,Tokenizer('english'))

# Instantiating the  KLSummarizer
kl_summarizer=KLSummarizer()
kl_summary=kl_summarizer(parser.document,sentences_count=3)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

But since I do not have the 2.5 billion dollars to send my own robot to Mars, I study the most Martian place on Earth, the Atacama Desert.Located in northern Chile, it is the oldest and driest desert on Earth.
In the Atacama, there are places with no reported rains in the last 400 years.How do I know this?
Because I was born and raised in this desert.


### LexRank

- A sentence that is similar to many other sentences in the text has a high probability of being important. In LexRank, a sentence is in effect recommended by the sentences similar to it, and is ranked higher as a result.

- The higher the rank, the higher the priority of being included in the summary.
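
The "recommendation" idea can be sketched as follows. This is a simplified illustration rather than sumy's code: the function name `lexrank_scores`, the smoothed IDF formula, the threshold of 0.1 and the damping value are assumptions. Sentences become TF-IDF vectors, two sentences are linked when their cosine similarity exceeds the threshold, and a PageRank-style iteration spreads score along the links.

```python
import math
import re
from collections import Counter

def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one LexRank-style centrality score per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # IDF with each sentence treated as a document (the +1 is a smoothing choice).
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / c) + 1.0 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Link two different sentences when they are similar enough.
    adjacency = [[1.0 if i != j and cosine(vectors[i], vectors[j]) > threshold else 0.0
                  for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adjacency]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence passes an equal share of its score to its neighbours.
        scores = [
            (1 - damping) / n + damping * sum(
                adjacency[j][i] / degree[j] * scores[j]
                for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    return scores

sents = ["the desert is dry", "the desert is very dry", "tea"]
for sentence, value in zip(sents, lexrank_scores(sents)):
    print(round(value, 3), sentence)
```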

In [652]:
# Importing the parser and tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

original_text= str(text_preprocess(ted_talks['transcript'][2451]))

# Import the LexRank summarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Initializing the parser
my_parser = PlaintextParser.from_string(original_text,Tokenizer('english'))

# Creating a summary of 3 sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document,sentences_count=3)

# Printing the summary
for sentence in lexrank_summary:
    print(sentence)

All life on Earth requires water, so in my case I focus on the intimate relationship between water and life in order to understand if we could find life in a planet as dry as Mars.
But since I do not have the 2.5 billion dollars to send my own robot to Mars, I study the most Martian place on Earth, the Atacama Desert.Located in northern Chile, it is the oldest and driest desert on Earth.
So I remembered that I usually see fogs in Yungay, so after setting sensors in a number of places, where I remember never seeing fogs or clouds, I reported four other sites much drier than Yungay, with this one, María Elena South, being the truly driest place on Earth, as dry as Mars, and amazingly, just a 15-minute ride from the small mining town where I was born.Now, in this search, we were trying to actually find the dry limit for life on Earth, a place so dry that nothing was able to survive in it.


**Among all the above algorithms, text summarization with spaCy makes the most sense to me, so I chose spaCy for the summarization.**

In [630]:
ted_talks['summary'] = ted_talks['transcript'].apply(text_summary_spacy)

In [657]:
ted_talks.head()

Unnamed: 0,title,description,transcript,summary
0,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you?(Laughter)It's been ...,"I think you'd have to conclude, if you look at..."
1,Averting the climate crisis,With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a gre...",And so I'm going to be conducting a course thi...
2,Simplicity sells,New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garfun...","And the truth is, for years I was a little dep..."
3,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy that...,"When she came into my life, we were fighting a..."
4,The best stats you've ever seen,You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to teac...","And in the '90s, we have the terrible HIV epid..."


In [653]:
ted_talks['summary'][2451]

'All life on Earth requires water, so in my case I focus on the intimate relationship between water and life in order to understand if we could find life in a planet as dry as Mars. So I remembered that I usually see fogs in Yungay, so after setting sensors in a number of places, where I remember never seeing fogs or clouds, I reported four other sites much drier than Yungay, with this one, María Elena South, being the truly driest place on Earth, as dry as Mars, and amazingly, just a 15-minute ride from the small mining town where I was born. Now, in this search, we were trying to actually find the dry limit for life on Earth, a place so dry that nothing was able to survive in it.'

In [665]:
#pickle the DataFrame with the summaries
import pickle

with open("/Users/HOME/Desktop/Springboard/TED-Talks/Models/" + 'ted_summary.pkl', 'wb') as picklefile:
    pickle.dump(ted_talks, picklefile)


### Conclusion

Our work started with merging the two datasets, ted_main.csv and transcripts.csv, followed by data cleaning: we reordered the columns for convenience, converted the date columns from Unix timestamps into a human-readable format, checked for null values and duplicates, and dropped the duplicated rows.

In the Exploratory Data Analysis section, we analyzed the dataset using bar plots, box plots and histograms. This section also covered the most viewed talks of all time, the top 10 speakers and the speakers' occupations. We formed hypotheses about the relation between views and speakers' occupation; from the ANOVA test, we concluded that there is no statistically significant relationship between the two. We also showed the distributions of views and comments and tested their relationship using the Pearson correlation. We looked at the distribution of TED Talks over years, months and weekdays, and some of the patterns were a bit surprising. During the analysis we identified the outliers but did not remove them, as they are actual data for our exploratory analysis. We identified the collinear features through a heat map, and analysed several other pairs for a meaningful correlation, but they did not seem to be strongly correlated.

We showed the duration distribution and observed that shorter TED talks are more popular. It is likely that people prefer shorter talks because the content is easier to grasp or because they don't have time to watch longer ones. We also analyzed the ratings features and visualized the top 10 funniest, most beautiful, inspiring, jaw-dropping and confusing talks of all time. We investigated the TED word cloud to see which words are used most often by TED speakers, as well as in TED themes and occupations.

Next we moved on to preprocessing, where we did feature extraction and feature engineering on the dataset. We then preprocessed the transcript text: converting all letters to lower or upper case; converting numbers into words or removing them; removing punctuation, accent marks and other diacritics; removing white space; and removing stop words, sparse terms and particular words. We ran sentiment analysis on the transcripts and derived appropriate rating categories from the rating feature. We then visualized the rating categories with respect to sentiment.

Next we applied summarization algorithms using spaCy, Gensim and sumy (LexRank, LSA, Luhn and KL-Sum) to extract a summary from each transcript. In our judgment, summarization with spaCy gave better results than the others for this dataset.

**In conclusion, our work led to interesting results, analysis and statistics, and also provided useful tools for both audience and speakers, allowing a better understanding of the TED Talks dataset.**

This project gave me an opportunity to explore this freely available dataset using NLP and a proper data science pipeline of data wrangling, data analysis, data visualization, prediction, and data storytelling.


### Future Improvements

- Further analysis of the rating column could relate negative comments to the topics of TED talks and find the areas that have received the most negative feedback.

- We could also analyse the topics and areas of TED Talks further by combining other datasets, such as news articles and social media posts, to see whether topics that are widely discussed around the world show up in TED talk topics within the same time frame.
