### Steps to follow for recommendation function

- Text Preprocessing
- Tfidf Vectorizer
- Generate Cosine similarity matrix
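
The notebook loads transcripts that were already cleaned in an earlier step, so the preprocessing itself is not shown here. As a rough illustration of what that step involves, here is a minimal sketch using only the standard library. The function name `simple_clean` and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins, not the pipeline that produced `clean_transc` (which also appears to lemmatize and uses a far larger stop-word list).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself relies on spaCy's much larger STOP_WORDS set.
TOY_STOP_WORDS = {"a", "an", "and", "the", "is", "are", "you", "how", "it", "to", "of"}

def simple_clean(text):
    """Lowercase, then drop bracketed annotations, non-letters, stop words and 1-letter tokens."""
    text = text.lower()
    text = re.sub(r"\([^)]*\)", " ", text)   # remove "(Laughter)"-style annotations
    text = re.sub(r"[^a-z\s]", " ", text)    # keep only ASCII letters and whitespace
    tokens = [tok for tok in text.split()
              if tok not in TOY_STOP_WORDS and len(tok) > 1]
    return " ".join(tokens)

print(simple_clean("Good morning. How are you?(Laughter)It's been great."))
# prints: good morning been great
```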

In [54]:
#Import all the necessary libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import wordcloud

import pickle

## for text processing
import re

## for ner
import spacy

## for vectorizer
from sklearn import feature_extraction


In [55]:
#Load the preprocessed data
ted_talks = pd.read_csv("/Users/HOME/Desktop/Springboard/TED-Talks/Data/preprocessed_ted.csv",index_col = 0)
ted_talks.head()

Unnamed: 0,name,title,description,main_speaker,speaker_occupation,transcript,duration,film_date,published_date,languages,...,ingenious,courageous,longwinded,informative,fascinating,unconvincing,persuasive,ok,obnoxious,rating
0,Ken Robinson: Do schools kill creativity?,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,['Ken Robinson'],"['Author', 'educator']",Good morning. How are you?(Laughter)It's been ...,19.4,2006-02-24,2006-06-26,60,...,6073,3253,387,7346,10581,300,10704,1174,209,Inspiring
1,Al Gore: Averting the climate crisis,Averting the climate crisis,With the same humor and humanity he exuded in ...,['Al Gore'],['Climate advocate'],"Thank you so much, Chris. And it's truly a gre...",16.283333,2006-02-24,2006-06-26,43,...,56,139,113,443,132,258,268,203,131,Funny
2,David Pogue: Simplicity sells,Simplicity sells,New York Times columnist David Pogue takes aim...,['David Pogue'],['Technology columnist'],"(Music: ""The Sound of Silence,"" Simon & Garfun...",21.433333,2006-02-23,2006-06-26,26,...,183,45,78,395,166,104,230,146,142,Funny
3,Majora Carter: Greening the ghetto,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",['Majora Carter'],['Activist for environmental justice'],If you're here today — and I'm very happy that...,18.6,2006-02-25,2006-06-26,35,...,105,760,53,380,132,36,460,85,35,Inspiring
4,Hans Rosling: The best stats you've ever seen,The best stats you've ever seen,You've never seen data presented like this. Wi...,['Hans Rosling'],"['Global health expert', ' data visionary']","About 10 years ago, I took on the task to teac...",19.833333,2006-02-21,2006-06-27,48,...,3202,318,110,5433,4606,67,2542,248,61,Informative


In [56]:
ted_talks.columns

Index(['name', 'title', 'description', 'main_speaker', 'speaker_occupation',
       'transcript', 'duration', 'film_date', 'published_date', 'languages',
       'num_speaker', 'event', 'comments', 'ratings', 'views', 'tags',
       'related_talks', 'url', 'film_year', 'film_month', 'film_day',
       'views_per_comment', 'num_ratings', 'funny', 'beautiful', 'inspiring',
       'confusing', 'jaw_dropping', 'lang', 'word_count', 'char_count',
       'sentence_count', 'avg_word_length', 'avg_sentence_length', 'sentiment',
       'clean_transc', 'sentiment_category', 'ingenious', 'courageous',
       'longwinded', 'informative', 'fascinating', 'unconvincing',
       'persuasive', 'ok', 'obnoxious', 'rating'],
      dtype='object')

In [57]:
ted_talks.tags

0       ['children', 'creativity', 'culture', 'dance',...
1       ['alternative energy', 'cars', 'climate change...
2       ['computers', 'entertainment', 'interface desi...
3       ['MacArthur grant', 'activism', 'business', 'c...
4       ['Africa', 'Asia', 'Google', 'demo', 'economic...
                              ...                        
2548    ['TED Residency', 'United States', 'community'...
2549    ['Mars', 'South America', 'TED Fellows', 'astr...
2550    ['AI', 'ants', 'fish', 'future', 'innovation',...
2551    ['Internet', 'TEDx', 'United States', 'communi...
2552    ['cities', 'design', 'future', 'infrastructure...
Name: tags, Length: 2549, dtype: object

In [58]:
#removing the unnecessary characters from url
ted_talks['url'] = ted_talks['url'].apply(lambda x: x.strip("\n"))

In [59]:
#selecting the required features from dataset
ted_talks = ted_talks[['title','clean_transc','url','tags']]

In [60]:
ted_talks.shape

(2549, 4)

In [61]:
ted_talks['clean_transc'].isna().any()

True

In [62]:
#filter out the null transcripts
ted_talks = ted_talks[~ (ted_talks['clean_transc'].isna())]
ted_talks.reset_index(drop=True, inplace=True)

In [63]:
ted_summ = pd.read_pickle("/Users/HOME/Desktop/Springboard/TED-Talks/Models/ted_summary.pkl")
ted_summ.head()

Unnamed: 0,title,description,transcript,summary
0,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you?(Laughter)It's been ...,"I think you'd have to conclude, if you look at..."
1,Averting the climate crisis,With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a gre...",And so I'm going to be conducting a course thi...
2,Simplicity sells,New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garfun...","And the truth is, for years I was a little dep..."
3,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy that...,"When she came into my life, we were fighting a..."
4,The best stats you've ever seen,You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to teac...","And in the '90s, we have the terrible HIV epid..."


In [64]:
ted_talks = ted_talks.merge(ted_summ, on = ('title'),how = "left")
ted_talks.head()

Unnamed: 0,title,clean_transc,url,tags,description,transcript,summary
0,Do schools kill creativity?,good morning great blow away thing fact leave ...,https://www.ted.com/talks/ken_robinson_says_sc...,"['children', 'creativity', 'culture', 'dance',...",Sir Ken Robinson makes an entertaining and pro...,Good morning. How are you?(Laughter)It's been ...,"I think you'd have to conclude, if you look at..."
1,Averting the climate crisis,thank chris truly great honor opportunity come...,https://www.ted.com/talks/al_gore_on_averting_...,"['alternative energy', 'cars', 'climate change...",With the same humor and humanity he exuded in ...,"Thank you so much, Chris. And it's truly a gre...",And so I'm going to be conducting a course thi...
2,Simplicity sells,hello voice mail old friend tech support ignor...,https://www.ted.com/talks/david_pogue_says_sim...,"['computers', 'entertainment', 'interface desi...",New York Times columnist David Pogue takes aim...,"(Music: ""The Sound of Silence,"" Simon & Garfun...","And the truth is, for years I was a little dep..."
3,Greening the ghetto,today happy hear sustainable development save ...,https://www.ted.com/talks/majora_carter_s_tale...,"['MacArthur grant', 'activism', 'business', 'c...","In an emotionally charged talk, MacArthur-winn...",If you're here today — and I'm very happy that...,"When she came into my life, we were fighting a..."
4,The best stats you've ever seen,year ago task teach global development swedish...,https://www.ted.com/talks/hans_rosling_shows_t...,"['Africa', 'Asia', 'Google', 'demo', 'economic...",You've never seen data presented like this. Wi...,"About 10 years ago, I took on the task to teac...","And in the '90s, we have the terrible HIV epid..."


### Term Frequency-Inverse Document Frequency (Tf-Idf)

Term frequency (TF) measures how often a word appears in a given document, while inverse document frequency (IDF) measures how rare the word is across the corpus. The product of these two quantities, known as Tf-Idf, measures how important a word is to a document: it is high for words that are frequent in that document but rare in the rest of the corpus.
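
To make the definition concrete, here is a small from-scratch sketch on a toy corpus. It uses the textbook weighting `tf * ln(N / df)`; the function name `tf_idf` and the toy sentences are illustrative assumptions. scikit-learn's `TfidfVectorizer` with default settings uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the TED data.
toy_corpus = [
    "mars rover mars mission",
    "ocean cave mission",
    "school creativity mission",
]
toy_weights = tf_idf(toy_corpus)
for term, weight in sorted(toy_weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first toy document, "mission" gets a weight of zero because it occurs in every document, while "mars" scores highest because it is both frequent in that document and absent from the others.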

In [65]:
#Tf-idf Vectorizer on clean_transc feature
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm
from sklearn.feature_extraction.text import TfidfVectorizer
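# newer scikit-learn versions only accept a list (or 'english') for stop_words, so convert spaCy's set
tv = TfidfVectorizer(strip_accents = 'ascii',stop_words = list(STOP_WORDS))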
tfidf_matrix = tv.fit_transform(ted_talks['clean_transc'])
tfidf_matrix.shape          

(2455, 44699)

### Finding similar TED Talks

- To find talks that are similar to one another, we need a measure of similarity. With Tf-Idf vectors, the usual choice is cosine similarity.
- Cosine similarity tells us how similar the transcript of one TED Talk is to that of another.
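
As a minimal sketch of the measure itself (not of scikit-learn's implementation), cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper names `cosine` and `normalize` and the toy vectors are assumptions for illustration. The sketch also shows that for unit-length vectors the cosine reduces to a plain dot product, which is why `linear_kernel` and `cosine_similarity` agree on the L2-normalized rows that `TfidfVectorizer` produces by default.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def normalize(u):
    """Scale a vector to unit L2 length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u] if norm else list(u)

# Hypothetical toy vectors: vec_a and vec_b point the same way, vec_c is orthogonal to both.
vec_a = [3.0, 4.0, 0.0]
vec_b = [6.0, 8.0, 0.0]
vec_c = [0.0, 0.0, 5.0]

same_direction = cosine(vec_a, vec_b)   # parallel vectors, so close to 1
orthogonal = cosine(vec_a, vec_c)       # no shared non-zero dimensions, so 0

# On unit-length vectors the plain dot product already is the cosine.
dot_of_units = sum(x * y for x, y in zip(normalize(vec_a), normalize(vec_b)))
print(same_direction, orthogonal, dot_of_units)
```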

In [66]:
%%time
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix)

CPU times: user 1.42 s, sys: 82.9 ms, total: 1.51 s
Wall time: 1.53 s


In [67]:
%%time
from sklearn.metrics.pairwise import linear_kernel
cosine_sim_linear = linear_kernel(tfidf_matrix)

CPU times: user 1.21 s, sys: 32 ms, total: 1.24 s
Wall time: 1.26 s


### Recommender function

- For each transcript, we find the 10 most similar ones based on cosine similarity.

#### STEPS 

1. Take a talk title, the cosine similarity matrix, and the indices series as arguments.
2. Extract the pairwise cosine similarity scores for that talk.
3. Sort the scores in descending order.
4. Output the titles corresponding to the highest scores.
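
The steps above can be sketched on a toy similarity matrix without pandas. The names `top_similar`, `toy_titles` and `toy_sim` are hypothetical, and the scores are made up for illustration. Unlike slicing off the first sorted entry, this variant excludes the query talk by index, so it does not depend on the talk ranking first against itself.

```python
def top_similar(title, sim_matrix, titles, k=10):
    """Return up to k (title, score) pairs most similar to `title`, best first."""
    if title not in titles:
        raise KeyError(f"Unknown title: {title!r}")
    idx = titles.index(title)
    # Pair every other talk's index with its similarity to the query talk.
    scored = [(j, score) for j, score in enumerate(sim_matrix[idx]) if j != idx]
    # Sort by score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(titles[j], score) for j, score in scored[:k]]

# Made-up titles and scores, purely for illustration.
toy_titles = ["Mars A", "Mars B", "Caves", "Schools"]
toy_sim = [
    [1.0, 0.8, 0.3, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.3, 0.2, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]

print(top_similar("Mars A", toy_sim, toy_titles, k=2))
# prints: [('Mars B', 0.8), ('Caves', 0.3)]
```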


In [68]:
#generating mapping between titles and index
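indices = pd.Series(ted_talks.index, index = ted_talks['title'])
#keep the first row for any repeated title so that indices[title] returns a single position
indices = indices[~indices.index.duplicated(keep = 'first')]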

In [73]:
def get_recommendations(title,cosine_sim,indices):
    #get the index of the talk that matches the title
    idx = indices[title]
    #pair each talk index with its similarity score, then sort in descending order
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores,key = lambda x:x[1],reverse = True)
    #keep the 10 most similar talks, skipping position 0 (the talk itself)
    sim_scores = sim_scores[1:11]
    #get the talk indices
    talk_indices = [i[0] for i in sim_scores]
    #return title, url and summary for the recommended talks only
    return ted_talks['title'].iloc[talk_indices],ted_talks['url'].iloc[talk_indices],ted_talks['summary'].iloc[talk_indices]

In [74]:
ted_talks.title

0                             Do schools kill creativity?
1                             Averting the climate crisis
2                                        Simplicity sells
3                                     Greening the ghetto
4                         The best stats you've ever seen
                              ...                        
2450    What we're missing in the debate about immigra...
2451                      The most Martian place on Earth
2452    What intelligent machines can learn from a sch...
2453         A black man goes undercover in the alt-right
2454    How a video game might help us build better ci...
Name: title, Length: 2455, dtype: object

In [76]:
get_recommendations('The most Martian place on Earth',cosine_sim,indices)

(2090    Your kids might live on Mars. Here's how they'...
 1869    How Mars might hold the secret to the origin o...
 348                      There might just be life on Mars
 620                        Why we need to go back to Mars
 2283                             What time is it on Mars?
 2011                Let's not use Mars as a backup planet
 1980    Deep under the Earth's surface, discovering be...
 2027             The mysterious world of underwater caves
 328                      The story behind the Mars Rovers
 1552                          My glacier cave discoveries
 Name: title, dtype: object,
 2090    https://www.ted.com/talks/stephen_petranek_you...
 1869    https://www.ted.com/talks/nathalie_cabrol_how_...
 348             https://www.ted.com/talks/penelope_boston
 620                 https://www.ted.com/talks/joel_levine
 2283    https://www.ted.com/talks/nagin_cox_what_time_...
 2011    https://www.ted.com/talks/lucianne_walkowicz_l...
 1980    https://www.ted.co

### Conclusion

Our work started with merging the two datasets, ted_main.csv and transcripts.csv, followed by data cleaning: we reordered the columns for convenience and converted the date columns from Unix timestamps into a human-readable format. We then checked for null values and duplicates and dropped the duplicated rows.

In the Exploratory Data Analysis section, we analyzed the dataset using bar plots, box plots and histograms. This section also covered the most viewed talks of all time, the top 10 speakers and the speakers' occupations. We tested a hypothesis about the relation between views and speakers' occupation; from the ANOVA test, we concluded that there is no statistically significant relationship between the two. We also showed statistics about the distribution of views and comments and assessed their relationship using the Pearson correlation test. We looked at the distribution of TED Talks over years, months and weekdays, and some of the results were a bit surprising. During the analysis we identified outliers but did not remove them, as they are actual data. We identified collinear features through a heat map and analysed several other pairs for a meaningful correlation, but they did not seem to be strongly correlated. We showed the duration distribution and observed that shorter TED Talks are more popular; a likely explanation is that viewers grasp the content of a shorter talk more easily or do not have time to watch longer ones. We also analyzed the ratings features and visualized the top 10 funniest, most beautiful, most inspiring, most jaw-dropping and most confusing talks of all time.
We examined word clouds to see which words TED speakers use most often, as well as the most common TED themes and occupations.

Next we moved on to the preprocessing step, where we did feature extraction and feature engineering on the dataset. We then preprocessed the transcript text, which included converting all letters to a single case, converting numbers into words or removing them, removing punctuation, accent marks and other diacritics, removing extra white space, and removing stop words, sparse terms and particular words. We ran sentiment analysis on the transcripts and derived an appropriate rating category for each transcript from the rating feature. We then visualized the rating categories with respect to sentiment.

Next we applied summarization algorithms using spaCy, Gensim and sumy (LexRank, LSA, etc.) to extract a summary from each transcript. We found that summarization with spaCy gave better results than the others for this dataset. We then moved on to the recommendation system: we vectorized the transcripts with the Tfidf Vectorizer, calculated the cosine similarity to measure how similar the TED Talks are to each other, and built a recommender function based on cosine similarity that returns the top 10 most similar talks.


**In conclusion, our work led to interesting results, analysis and statistics, and also provided useful tools for both audience and speakers that allow a better understanding of the TED Talks dataset.**

This project gave me an opportunity to explore this freely available dataset using NLP and a proper data science pipeline of data wrangling, data analysis, data visualization, prediction, and data storytelling.


### Future Improvements

- The recommendation engine used by the official TED site is likely an order of magnitude more sophisticated than what we demonstrated here and probably also uses some form of historical user-item interaction data. It would be interesting to build the TED Talk recommendation system on such data if it becomes available.

- Further analysis can be done on the rating column in the dataset to relate negative feedback to the topics of TED Talks and find the subject areas that have received more of it.

- We can also extend the analysis of TED Talk topics by combining other datasets, such as news articles and social media posts, to look for patterns in how topics that are widely discussed around the world show up in TED Talk topics during the same time frame.

