# Background

Our Client (Documentary maker) wishes to produce TV Series related to Science and Technology and intends to learn more about customer preferences. We would like to support our Client on this venture and shed light on popular Ted Talks. In addition we shall also try to identify what makes a talk more successful than others

# Challenges
Current Ted Talks dataset does not have any topic name assigned to it. Since we have the transcript of the talks we shall run Topic Modeling excercise to extract the dominant themes

# Objectives

1. Gather a better understanding of data (Ted Talks)
2. Run Topic Modeling to figure out key topics
3. Workout Sentiment Analysis of top talks 

Lets import data sources

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
        
        
        
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ted-talks/transcripts.csv
/kaggle/input/ted-talks/ted_main.csv


In [2]:
d1=pd.read_csv('/kaggle/input/ted-talks/ted_main.csv')
d1.head(5)
# d1.columns

#keep relavant data only
cols=['comments','duration','event','languages', 'main_speaker','speaker_occupation','title','url', 'views']
d1=d1[[columns for columns in cols]]
d1.head(5)
# print(d1.shape)

Unnamed: 0,comments,duration,event,languages,main_speaker,speaker_occupation,title,url,views
0,4553,1164,TED2006,60,Ken Robinson,Author/educator,Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,977,TED2006,43,Al Gore,Climate advocate,Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,1286,TED2006,26,David Pogue,Technology columnist,Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,1116,TED2006,35,Majora Carter,Activist for environmental justice,Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,1190,TED2006,48,Hans Rosling,Global health expert; data visionary,The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


In [3]:
len(d1.speaker_occupation.unique())

1459

Our first dataset contains event name, total comments, speaker's name and occupation and time taken to present their talk. We also have title of the Ted Talk and url link to it. The speaker occupation is very unique i.e they are many distinct categories that we cannot learn much from it. 

## 1. What are the most popular Ted Talks and Ted Events?
<!-- ### 2. Are there any distinguishing features of successful Talks? 
### 3. List important topics often discussed in Ted Talks -->

In [4]:
d11=d1.groupby(['event']).agg({'main_speaker':'nunique',
                          'event':'count',
                              'views':'sum',
                              'comments':'sum'})
d11.columns=['ct_events','ct_speakers','net_views','net_comments']
# d11.reset_index(inplace=True)
d11.sort_values('ct_speakers',ascending=False,inplace=True)

# obtain top 10 events 
d12=d11.nlargest(10,'ct_speakers')
d12.reset_index(inplace=True)
# d12.head(5)

#obtain top 10 talks by View counts
d13=d11.nlargest(10,'net_views')

#obtain top 10 talks by Comment counts
d14=d11.nlargest(10,'net_comments')
d14.reset_index(inplace=True)
d13.reset_index(inplace=True)

# Lets Plot

fig1=px.bar(d12,x="event",y="ct_speakers",width=800,height=400)
fig1.update_layout(margin=dict(l=10,r=10,t=30,b=10),title="Top 10 Ted Events",
                   xaxis_title="Event",yaxis_title="Speakers Participated")
fig1.update_traces(marker_color='Turquoise')
fig1.show()

Going by number of participants (speakers) Ted Events in 2014 and 2009 featured well over 80 members. In other years it hovers between 70-80 speakers.  
In order to guage which event was popular we shall take a closer look at the total views and comments left by users. More comments would imply that the talks at the events spurred people to share their thoughts on social media and wanted their voice to be heard

In [5]:
fig2=make_subplots(rows=1,cols=2,subplot_titles=("Total Views","Total Comments"))

trace_1=go.Bar(x=d13.event,y=d13.net_views,name='views')
trace_2=go.Bar(x=d14.event,y=d14.net_comments,name='comments')

fig2.add_trace(trace_1, 1, 1)
fig2.add_trace(trace_2, 1, 2)

fig2.update_layout(showlegend=False, title_text="Ted Event Popularity",height=450)
fig2.show()

1. Contents presented in 2013 and 2014 have garnered more views than others
2. In terms of total comments, Ted Event in 2010 and 2013 lead the fray

It is worth exploring more about the Talks presented and the speakers who involved in that particular year/event to better understand the trend seen in above graphs.  We shall now uncover what the top talks are at an overall level

In [6]:
# Lets focus on top 25 talks by view count for now
d4=d1[['title','views','duration','comments']].nlargest(25,'views')
d11.ct_speakers.mean()
d4.sort_values('views',ascending=True,inplace=True)
d4['duration']=d4['duration'].apply(lambda x:(x/60))
d4['duration']=d4['duration'].round(1)
d4.head(5)

Unnamed: 0,title,views,duration,comments
381,Your elusive creative genius,13155478,19.2,1604
530,The danger of a single story,13298341,18.8,1155
262,Brain magic,13327101,19.8,669
176,Underwater astonishments,13926113,5.4,554
1572,How to make stress your friend,14566463,14.5,693


In [7]:
fig5=go.Figure(go.Bar(y=d4.title,x=d4.views,orientation='h',marker=dict(color='rgba(246, 78, 139, 0.6)'),name='Views'))
fig5.update_layout(height=600,width=900,autosize=False,title="Popular Ted Talks (View Count)")
fig5.show()

the top 3 talks most viewed are 'Do schools kill creativity? ', 'Your body language shapes who you are' and 'how great leaders inspire action'. We see that the top 25 topics are quite diverse and appear to be related to productivity, leadership, technology etc.

Lets also add the duration of the talk and the comments dimension as well to obtain better understanding of top talks

In [8]:
fig6=px.scatter(d4,x='duration',y='comments',size='views',color='title',width=1000,height=500)
fig6.update_layout(showlegend=False,xaxis_title="Duration (min)",yaxis_title="Total Comments",title="Ted Talks Engagement")
fig6.show()

We observe slight positive correlation between duration and comments which is understandable. The longer the talk, the more ideas can be touched upon thereby raising the involvement of a user. 'Do Schools kill creativity?' and 'My stroke of insight' have astonishingly high comments tagged against them. 

We also realize another unique feature of top talks by glancing at the duration of the event. Most of them lie between the range of 18-22 minutes. In order to produce a successful documentary, further deep dive needs to be carried out by understanding format of talks such as 'How did they introduce the audience to the topic? Did they begin on a pessimistic note?', 'How did they conclude the topic ?', 'What other sub topics did they include or exclude from their talks?' and so on.


# 2. What are the dominant themes presented at Ted Talks?
Let us now import transcript data and conduct an unsupervised machine learning called 'Topic Modeling' based on Latent Dirichlet Allocation (LDA)

In [9]:
d2=pd.read_csv('/kaggle/input/ted-talks/transcripts.csv')

# lets merge the transcript data with above one based on 'url' as common key
d3=pd.merge(d1,d2,on='url',how='inner')
d3.head(5)

# print(d1.shape)
# print(d2.shape)
# print(d3.shape)

Unnamed: 0,comments,duration,event,languages,main_speaker,speaker_occupation,title,url,views,transcript
0,4553,1164,TED2006,60,Ken Robinson,Author/educator,Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,Good morning. How are you?(Laughter)It's been ...
1,265,977,TED2006,43,Al Gore,Climate advocate,Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,"Thank you so much, Chris. And it's truly a gre..."
2,124,1286,TED2006,26,David Pogue,Technology columnist,Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,"(Music: ""The Sound of Silence,"" Simon & Garfun..."
3,200,1116,TED2006,35,Majora Carter,Activist for environmental justice,Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,If you're here today — and I'm very happy that...
4,593,1190,TED2006,48,Hans Rosling,Global health expert; data visionary,The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,"About 10 years ago, I took on the task to teac..."


In [10]:
# we will use nltk library for our purpose and visualize results using genesim since it is quite interactive 

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from gensim.models import CoherenceModel
import gensim

tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()

t1=d3['transcript']
texts = []

#lets clean up our raw text such as remove common words such as articles, prespositions and convert to lower case. 
#In addition we are stemming words as well. for example :'playing', 'played' to 'play' etc

for i in t1:
    
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    stopped_tokens = [i for i in tokens if not i in en_stop]
#     texts.append(stopped_tokens)
#     stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
    
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

In [11]:
%time
import pyLDAvis.gensim
# pyLDAvis.enable_notebook()
# topics = pyLDAvis.gensim.prepare(lda_model,corpus, dictionary)

# Build LDA model
def model(n):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,id2word=dictionary,
                                                 num_topics=n,
                                                 random_state=100,
                                                 update_every=1,
                                                 chunksize=500,
                                                 passes=10,alpha='auto',per_word_topics=True)
    lda_model.print_topics()
    
    pyLDAvis.enable_notebook()
    topics = pyLDAvis.gensim.prepare(lda_model,corpus, dictionary)
    # compute perplexity
#     perplexity=lda_model.log_perplexity(corpus)
    
    # compute coherence
#     coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
#     coherence_lda = coherence_model_lda.get_coherence()
#     print(perplexity)
#     print(coherence_lda)
    return(topics)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.63 µs


We can also access model fit by evaluating perplexity and coherence measures. These are more valuable if the number of topics are quite large such as ~50-100. In our case we think it should be <10 and therefore analyze the topics given out by LDA by human approach

In [12]:
# Model with 8 topics
model(8)

We observe lot of overlap in topics, lets see if reducing the number of topics coalesces them into one

In [13]:
#model with 6 topics
model(6)

A better picture is starting to emerge! The words are in individual topics are more coherent. Let us also check out one with 5 topics

In [14]:
#model with 5 topics
model(5)

Now all the topics are widely spaced apart which tells us unique topics have been clearly identified. The size of the bubble is an indicator of topic prevalence in the text and top words that make up the topic. therefore we shall consider that our data consists of 5 major topics only

In [15]:
#5 topics
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,id2word=dictionary,
                                                 num_topics=5,
                                                 random_state=100,
                                                 update_every=1,
                                                 chunksize=1000,
                                                 passes=10,alpha='auto',per_word_topics=True)
pyLDAvis.enable_notebook()
topics = pyLDAvis.gensim.prepare(lda_model,corpus, dictionary)
topics

In [16]:
# Lets check Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) 

# Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.603562122509648

Coherence Score:  0.2555308758931093


In [17]:
# Merge results back with original dataset to obtain high level summaries

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=d3['title']):
# Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row[0], key=lambda x: (x[1]), reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0: # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=d3['title'])
#df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'title']

# Show
df_dominant_topic.head(10)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,title
0,0,4.0,0.6264,"s, t, go, can, like, re, one, thing, laughter,...",Do schools kill creativity?
1,1,0.0,0.5089,"s, peopl, can, t, world, now, one, go, think, re",Averting the climate crisis
2,2,4.0,0.8962,"s, t, go, can, like, re, one, thing, laughter,...",Simplicity sells
3,3,0.0,0.5226,"s, peopl, can, t, world, now, one, go, think, re",Greening the ghetto
4,4,0.0,0.8844,"s, peopl, can, t, world, now, one, go, think, re",The best stats you've ever seen
5,5,4.0,0.5764,"s, t, go, can, like, re, one, thing, laughter,...",Why we do what we do
6,6,4.0,0.4588,"s, t, go, can, like, re, one, thing, laughter,...",Letting go of God
7,7,1.0,0.3316,"s, one, us, like, world, peopl, year, see, sto...",Behind the design of Seattle's library
8,8,4.0,0.475,"s, t, go, can, like, re, one, thing, laughter,...",Let's teach religion -- all religion -- in sch...
9,9,4.0,0.5873,"s, t, go, can, like, re, one, thing, laughter,...",A life of purpose


In [18]:
print(lda_model.print_topics())

[(0, '0.022*"s" + 0.012*"peopl" + 0.009*"can" + 0.007*"t" + 0.007*"world" + 0.007*"now" + 0.007*"one" + 0.006*"go" + 0.006*"think" + 0.006*"re"'), (1, '0.011*"s" + 0.007*"one" + 0.006*"us" + 0.005*"like" + 0.005*"world" + 0.005*"peopl" + 0.005*"year" + 0.005*"see" + 0.005*"stori" + 0.004*"t"'), (2, '0.025*"s" + 0.015*"can" + 0.008*"re" + 0.008*"one" + 0.008*"now" + 0.008*"like" + 0.007*"go" + 0.007*"see" + 0.006*"t" + 0.006*"use"'), (3, '0.016*"s" + 0.013*"t" + 0.010*"peopl" + 0.007*"one" + 0.007*"can" + 0.007*"like" + 0.006*"year" + 0.006*"us" + 0.006*"know" + 0.006*"time"'), (4, '0.034*"s" + 0.014*"t" + 0.012*"go" + 0.012*"can" + 0.012*"like" + 0.011*"re" + 0.010*"one" + 0.010*"thing" + 0.010*"laughter" + 0.009*"just"')]


Careful analysis of appearence of words in individual topics suggests a pattern. In many cases words are specific to particular topic and we can make sense of what the topics (talks) are all about

In [19]:
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic']==0.0,'Topic Name']='Economic Affairs'
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic']==1.0,'Topic Name']='Arts and Culture'
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic']==2.0,'Topic Name']='Planet and Diseases'
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic']==3.0,'Topic Name']='Relations and Conflicts'
df_dominant_topic.loc[df_dominant_topic['Dominant_Topic']==4.0,'Topic Name']='Productivity'


In [20]:
t2=d3[['views','comments','title','duration']]
t3=pd.concat([df_dominant_topic,t2],axis=1)
t3.head(5)
# df_topic_sents_keywords

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,title,Topic Name,views,comments,title.1,duration
0,0,4.0,0.6264,"s, t, go, can, like, re, one, thing, laughter,...",Do schools kill creativity?,Productivity,47227110,4553,Do schools kill creativity?,1164
1,1,0.0,0.5089,"s, peopl, can, t, world, now, one, go, think, re",Averting the climate crisis,Economic Affairs,3200520,265,Averting the climate crisis,977
2,2,4.0,0.8962,"s, t, go, can, like, re, one, thing, laughter,...",Simplicity sells,Productivity,1636292,124,Simplicity sells,1286
3,3,0.0,0.5226,"s, peopl, can, t, world, now, one, go, think, re",Greening the ghetto,Economic Affairs,1697550,200,Greening the ghetto,1116
4,4,0.0,0.8844,"s, peopl, can, t, world, now, one, go, think, re",The best stats you've ever seen,Economic Affairs,12005869,593,The best stats you've ever seen,1190


In [21]:
#Lets check the overall topic distribution in dataset

t5=df_dominant_topic['Topic Name'].value_counts().to_frame().reset_index()
t5.columns=['Topic','Count']
t5['Topic_Perc']=(t5['Count'])/(t5['Count'].sum())*100
t5

Unnamed: 0,Topic,Count,Topic_Perc
0,Productivity,819,33.198216
1,Planet and Diseases,561,22.74017
2,Relations and Conflicts,462,18.727199
3,Economic Affairs,450,18.240778
4,Arts and Culture,175,7.093636


Around 56% of the talks are related to 'Productivity' and 'Planet & Diseases', while 'Relations & Conflicts', 'Economic Affairs' make up 18% individually. Matters related to Art and Culture seem to be of low interest at only 7%. This distribution suggests that Ted Talk audiences have their preferences clearly marked out. Another way explaining the trend observed would be that the field of 'Arts and Culture' could be lot diverse than others and speakers are talking about wide variety of subjects while other topics may have core issues which are constanly at the center of focus by different speakers.   

Above analysis is synthesis of all talks, however if we look at only top talks the distribution may very well be different. Since we want our Client to produce content that is popular and relavant

In [22]:
#Decile Analysis: Find out topic distribution among top 20% of Ted Talks based on view count

t3['Decile']=pd.qcut(t3['views'],10,labels=np.arange(10,0,-1))

t4=t3[(t3['Decile']==1)|(t3['Decile']==2)]
t41=t4['Topic Name'].value_counts().sort_values(ascending=False).to_frame().reset_index()
t41.columns=['Topic','Number of Ted Talks']
t41['Topic_Perc']=(t41['Number of Ted Talks']/t41['Number of Ted Talks'].sum()*100).round(2)
t41

Unnamed: 0,Topic,Number of Ted Talks,Topic_Perc
0,Productivity,243,49.19
1,Relations and Conflicts,108,21.86
2,Planet and Diseases,72,14.57
3,Economic Affairs,51,10.32
4,Arts and Culture,20,4.05


In [23]:

fig10=make_subplots(rows=1, cols=2,subplot_titles=("Overall Topic Distribution (%)","Topic Distribution among top 20% Ted Talks (%)"))

trace_1=go.Bar(x=t41.Topic,y=t41.Topic_Perc)
trace_2=go.Bar(x=t5.Topic,y=t5.Topic_Perc)

fig10.add_trace(trace_2,1,1)
fig10.add_trace(trace_1,1,2)

fig10.update_layout(showlegend=False,title="Popular Ted Topics",height=500,width=1200,yaxis_title='Percent')
fig10.show()

Talks concerning Productivity make up nearly 50% of top 20% of Ted Talks, While 'Relations and Conflicts' is now in second spot. This is probably a testament to the fact that audiences are keen to explore new ways of making their complex lives more productive, fun filled and meaningful. They are looking for sources of inspiration away traditional channels like TV and Print.  

In [24]:
# Obtain top 10 Ted Talks related to each Topic

t51=pd.DataFrame([])
for topic in ['Productivity','Economic Affairs','Arts and Culture','Planet and Diseases','Relations and Conflicts']:
    x1=t3[t3['Topic Name']==topic]
    x2=x1.nlargest(10,'views')
    t51=t51.append(x2,ignore_index=True)   
t51=t51[['title','views','Topic Name']]
t51.columns=['title','title2','views','Topic']
t51.drop(columns=['title2'],inplace=True)

t51.sort_values(['Topic','views'],ascending=[True,True],inplace=True)

# fig5=go.Figure(go.Bar(y=d4.title,x=d4.views,orientation='h',marker=dict(color='rgba(246, 78, 139, 0.6)'),name='Views'))
# fig5.update_layout(height=600,width=900,autosize=False,title="Popular Ted Talks")
# fig5.show()


fig12=make_subplots(rows=5,cols=1,subplot_titles=('Productivity','Economic Affairs','Arts and Culture','Planet and Diseases','Relations and Conflicts'))

t52=t51[t51['Topic']=='Productivity']
trace_11=go.Bar(x=t52.views,y=t52.title,orientation='h')

t53=t51[t51['Topic']=='Economic Affairs']
trace_22=go.Bar(x=t53.views,y=t53.title,orientation='h')

t54=t51[t51['Topic']=='Arts and Culture']
trace_33=go.Bar(x=t54.views,y=t54.title,orientation='h')

t55=t51[t51['Topic']=='Planet and Diseases']
trace_44=go.Bar(x=t55.views,y=t55.title,orientation='h')

t56=t51[t51['Topic']=='Relations and Conflicts']
trace_55=go.Bar(x=t56.views,y=t56.title,orientation='h')

fig12.add_trace(trace_11,1,1)
fig12.add_trace(trace_22,2,1)
fig12.add_trace(trace_33,3,1)
fig12.add_trace(trace_44,4,1)
fig12.add_trace(trace_55,5,1)

fig12.update_layout(height=1500,width=1000,showlegend=False,title="Top Ted Talks by Topics (View count)")
fig12.show()

Analysis of individual talks from each of the topics portrays a much clear picture

# 3. What is the Sentiment behind most sucessful Ted Talks?
## Sentiment Analysis

We now come to Sentiment Analysis. Our objective here is obtain a better view on how should we go about constructing a successful content. Should we begin on a more pessimistic note and conclude with an optimistic one? or vice versa? Since a lot is at stake here, we would like to leave nothing to chance. Lets dig into the talks (transcripts) and divide them into three parts and spot any trends in sentiment

In [25]:
# obtain all relavant data in one dataset
cols1=['title','transcript']
d33=d3[['title','transcript','views']]
z1=pd.merge(t3,d33, how='inner',left_on=['views'],right_on=['views'])


# we wish to look at differences between top 50 and bottom 50 talks by views 

z1_top=z1.nlargest(50,"views")
z1_bottom=z1.nsmallest(50,"views")
z12=z1_top.append(z1_bottom)
z12.head(5)


Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,title_x,Topic Name,views,comments,title_x.1,duration,Decile,title_y,transcript
0,0,4.0,0.6264,"s, t, go, can, like, re, one, thing, laughter,...",Do schools kill creativity?,Productivity,47227110,4553,Do schools kill creativity?,1164,1,Do schools kill creativity?,Good morning. How are you?(Laughter)It's been ...
1274,1268,4.0,0.5746,"s, t, go, can, like, re, one, thing, laughter,...",Your body language may shape who you are,Productivity,43155405,2290,Your body language may shape who you are,1262,1,Your body language may shape who you are,So I want to start by offering you a free no-t...
649,649,4.0,0.5003,"s, t, go, can, like, re, one, thing, laughter,...",How great leaders inspire action,Productivity,34309432,1930,How great leaders inspire action,1084,1,How great leaders inspire action,How do you explain when things don't go as we ...
800,800,4.0,0.5151,"s, t, go, can, like, re, one, thing, laughter,...",The power of vulnerability,Productivity,31168150,1927,The power of vulnerability,1219,1,The power of vulnerability,"So, I'll start with this: a couple years ago, ..."
444,444,4.0,0.5246,"s, t, go, can, like, re, one, thing, laughter,...",10 things you didn't know about orgasm,Productivity,22270883,354,10 things you didn't know about orgasm,1003,1,10 things you didn't know about orgasm,Alright. I'm going to show you a couple of ima...


In [26]:
# first lets break the transcript into three equal parts

l11=[]
l22=[]
l33=[]
i=0
x=z12['transcript'].str.len().reset_index()
for line in z12['transcript']:
    i=i+i
    y=x.iloc[i,1]//3
    l1=line[:y]
    l2=line[y:(y+y)]
    l3=line[(y+y):]
    
    l11.append(l1)
    l22.append(l2)
    l33.append(l3)
    
dc5=pd.concat([pd.DataFrame(l11),pd.DataFrame(l22),pd.DataFrame(l33)],axis=1,ignore_index=True)
dc5.columns=['opening','middle','closing']
z13=pd.concat([z12,dc5.set_index(z12.index)],axis=1)

In [27]:
# we will use polarity scores from Vadar. 
# we obtain positive, negative and neutral in each topic part. In addition we also have 'Compound' metric which gives an overall aggregate. The compound ratings are between -1 and 1 with former indicating negative. Anything close to '0' would be neutral

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []
L1=z13['opening']
for line in L1:
    pol_score = sia.polarity_scores(line)
    pol_score['L1'] = line
    results.append(pol_score)
dc1 = pd.DataFrame.from_records(results)

results=[]
L2=z13['middle']
for line in L2:
    pol_score = sia.polarity_scores(line)
    pol_score['L2'] = line
    results.append(pol_score)
dc2 = pd.DataFrame.from_records(results)

results=[]
L3=z13['closing']
for line in L3:
    pol_score = sia.polarity_scores(line)
    pol_score['L3'] = line
    results.append(pol_score)
dc3 = pd.DataFrame.from_records(results)

results=[]
L=z13['transcript']
for line in L:
    pol_score = sia.polarity_scores(line)
    pol_score['L'] = line
    results.append(pol_score)
dc4 = pd.DataFrame.from_records(results)

%time


The twython library has not been installed. Some functionality from the twitter package will not be available.



CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs


In [28]:
# merge all individual datasets into one
dc6=pd.concat([dc1,dc2,dc3,dc4],axis=1)
dc6.columns=['negative_open','neutral_open','positive_open','compound_open','opening',
            'negative_middle','neutral_middle','positive_middle','compound_middle','middle',
            'negative_closing','neutral_closing','positive_closing','compound_closing','closing',
            'negative','neutral','positive','compound','transcript']
dc6.head(5)

Unnamed: 0,negative_open,neutral_open,positive_open,compound_open,opening,negative_middle,neutral_middle,positive_middle,compound_middle,middle,negative_closing,neutral_closing,positive_closing,compound_closing,closing,negative,neutral,positive,compound,transcript
0,0.046,0.837,0.117,0.9968,Good morning. How are you?(Laughter)It's been ...,0.035,0.864,0.101,0.9969,"eare being a child, do you? Shakespeare being ...",0.024,0.852,0.124,0.9989,nk kinesthetically. We think in abstract terms...,0.035,0.845,0.12,0.9998,Good morning. How are you?(Laughter)It's been ...
1,0.036,0.855,0.108,0.9971,So I want to start by offering you a free no-t...,0.049,0.829,0.121,0.9971,"eel chronically less powerful than men, so thi...",0.084,0.812,0.105,0.9318,"d we've all had the feeling, right? So it seem...",0.061,0.824,0.115,0.9996,So I want to start by offering you a free no-t...
2,0.026,0.851,0.123,0.9984,How do you explain when things don't go as we ...,0.041,0.844,0.115,0.9981,"we communicate from the outside in, yes, peopl...",0.027,0.905,0.068,0.9881,", you see, the early majority will not try som...",0.036,0.866,0.098,0.9995,How do you explain when things don't go as we ...
3,0.107,0.772,0.121,0.9329,"So, I'll start with this: a couple years ago, ...",0.079,0.755,0.166,0.9989,ing me journal pages and sending me their stor...,0.168,0.681,0.151,-0.3628,"hat, when they realize that vulnerability and ...",0.125,0.735,0.14,0.9978,"So, I'll start with this: a couple years ago, ..."
4,0.069,0.82,0.111,0.9924,Alright. I'm going to show you a couple of ima...,0.018,0.867,0.115,0.9987,"attached hiccupper. In the 1900s, early 1900s,...",0.03,0.902,0.068,0.793,"amera and a light source, attached to a motor ...",0.049,0.837,0.113,0.9994,Alright. I'm going to show you a couple of ima...


In [29]:
# lets add topic number to above, index are not same so adjusting them accordingly

z14=pd.concat([z13,dc6.set_index(z13.index)],axis=1)
z14.drop(columns=['title_y','Document_No'],inplace=True)
z14.head(5)

Unnamed: 0,Dominant_Topic,Topic_Perc_Contrib,Keywords,title_x,Topic Name,views,comments,title_x.1,duration,Decile,...,negative_closing,neutral_closing,positive_closing,compound_closing,closing,negative,neutral,positive,compound,transcript
0,4.0,0.6264,"s, t, go, can, like, re, one, thing, laughter,...",Do schools kill creativity?,Productivity,47227110,4553,Do schools kill creativity?,1164,1,...,0.024,0.852,0.124,0.9989,nk kinesthetically. We think in abstract terms...,0.035,0.845,0.12,0.9998,Good morning. How are you?(Laughter)It's been ...
1274,4.0,0.5746,"s, t, go, can, like, re, one, thing, laughter,...",Your body language may shape who you are,Productivity,43155405,2290,Your body language may shape who you are,1262,1,...,0.084,0.812,0.105,0.9318,"d we've all had the feeling, right? So it seem...",0.061,0.824,0.115,0.9996,So I want to start by offering you a free no-t...
649,4.0,0.5003,"s, t, go, can, like, re, one, thing, laughter,...",How great leaders inspire action,Productivity,34309432,1930,How great leaders inspire action,1084,1,...,0.027,0.905,0.068,0.9881,", you see, the early majority will not try som...",0.036,0.866,0.098,0.9995,How do you explain when things don't go as we ...
800,4.0,0.5151,"s, t, go, can, like, re, one, thing, laughter,...",The power of vulnerability,Productivity,31168150,1927,The power of vulnerability,1219,1,...,0.168,0.681,0.151,-0.3628,"hat, when they realize that vulnerability and ...",0.125,0.735,0.14,0.9978,"So, I'll start with this: a couple years ago, ..."
444,4.0,0.5246,"s, t, go, can, like, re, one, thing, laughter,...",10 things you didn't know about orgasm,Productivity,22270883,354,10 things you didn't know about orgasm,1003,1,...,0.03,0.902,0.068,0.793,"amera and a light source, attached to a motor ...",0.049,0.837,0.113,0.9994,Alright. I'm going to show you a couple of ima...


In [30]:
# let us now summarize the above dataset. Out dataset consists of top 50 and bottom 50 Ted Talks only (filtered at the begining of the analysis). D1 is top 50 while D10 is bottom 50 

z15=z14.groupby(['Decile']).agg({'negative_open':'mean',
                             'neutral_open':'mean',
                             'positive_open':'mean',
                             
                            'negative_middle':'mean',
                             'neutral_middle':'mean',
                             'positive_middle':'mean',
                    
                             'negative_closing':'mean',
                             'neutral_closing':'mean',
                             'positive_closing':'mean',
                             
                             'negative':'mean',
                             'neutral':'mean',
                             'positive':'mean',
                            }).dropna().reset_index()
z15

Unnamed: 0,Decile,negative_open,neutral_open,positive_open,negative_middle,neutral_middle,positive_middle,negative_closing,neutral_closing,positive_closing,negative,neutral,positive
0,10,0.05942,0.83472,0.10586,0.03514,0.67824,0.08656,0.02536,0.45504,0.0795,0.05664,0.82976,0.11362
1,1,0.06348,0.81726,0.11934,0.05458,0.71772,0.10786,0.03742,0.48916,0.07344,0.06574,0.80524,0.1289


In [31]:
z16=z15[['negative_open', 'neutral_open', 'positive_open',
       'negative_middle', 'neutral_middle', 'positive_middle',
       'negative_closing', 'neutral_closing', 'positive_closing', 'negative',
       'neutral', 'positive']].apply(lambda x: x*100).round(1)

z16['Decile']=np.where(z15.index==0,"D10","NA")
z16['Decile']=np.where(z15.index==1,"D1",z16['Decile'])
z16


Unnamed: 0,negative_open,neutral_open,positive_open,negative_middle,neutral_middle,positive_middle,negative_closing,neutral_closing,positive_closing,negative,neutral,positive,Decile
0,5.9,83.5,10.6,3.5,67.8,8.7,2.5,45.5,7.9,5.7,83.0,11.4,D10
1,6.3,81.7,11.9,5.5,71.8,10.8,3.7,48.9,7.3,6.6,80.5,12.9,D1


In [32]:
#lets plot to visualize the results

titles=["Negative (Overall)","Neutral (Overall)","Positive (Overall)",
        "Negative (Opening)","Neutral (Opening)","Positive (Opening)",
       "Negative (Middle)","Neutral (Middle)","Positive (Middle)",
       "Negative (Closing)","Neutral (Closing)","Positive (Closing)"]

fig21=make_subplots(rows=4,cols=3,subplot_titles=titles)

t1=go.Bar(x=z16.Decile,y=z16.negative,text=z16.negative,textposition='auto')
t2=go.Bar(x=z16.Decile,y=z16.neutral,text=z16.neutral,textposition='auto')
t3=go.Bar(x=z16.Decile,y=z16.positive,text=z16.positive,textposition='auto')

t4=go.Bar(x=z16.Decile,y=z16.negative_open,text=z16.negative_open,textposition='auto')
t5=go.Bar(x=z16.Decile,y=z16.neutral_open,text=z16.neutral_open,textposition='auto')
t6=go.Bar(x=z16.Decile,y=z16.positive_open,text=z16.positive_open,textposition='auto')

t7=go.Bar(x=z16.Decile,y=z16.negative_middle,text=z16.negative_middle,textposition='auto')
t8=go.Bar(x=z16.Decile,y=z16.neutral_middle,text=z16.neutral_middle,textposition='auto')
t9=go.Bar(x=z16.Decile,y=z16.positive_middle,text=z16.positive_middle,textposition='auto')

t10=go.Bar(x=z16.Decile,y=z16.negative_closing,text=z16.negative_closing,textposition='auto')
t11=go.Bar(x=z16.Decile,y=z16.neutral_closing,text=z16.neutral_closing,textposition='auto')
t12=go.Bar(x=z16.Decile,y=z16.positive_closing,text=z16.positive_closing,textposition='auto')

fig21.add_trace(t1,1,1)
fig21.add_trace(t2,1,2)
fig21.add_trace(t3,1,3)
fig21.add_trace(t4,2,1)
fig21.add_trace(t5,2,2)
fig21.add_trace(t6,2,3)
fig21.add_trace(t7,3,1)
fig21.add_trace(t8,3,2)
fig21.add_trace(t9,3,3)
fig21.add_trace(t10,4,1)
fig21.add_trace(t11,4,2)
fig21.add_trace(t12,4,3)

# fig21.update_layout(xaxis={'type':'category'})
# fig21.update_layout(yaxis=dict(tickformat="%"))

fig21.update_layout(height=1200,showlegend=False,yaxis=dict(range=[0,100]),title="Breakdown of Ted Talk by Sentiment(%) and format")
fig21.update_yaxes(range=[0, 100])

fig21.show()

We can draw the following conclusions:
1. At an overall level, most viewed topics (D1) begin on slightly less neutral note (81.7 vs 83) than least viewed ones (D10). They also have more sentiment (either positive or negative) suggesting that they intend to attract audience attention by painting a bright future or dark and gloomy scenario and then jump into the topic. However positive sentiment dominates more than negative

2. The closing is also filled with positive sentiment suggesting hope and wish to leave the viewers on reflective thoughts concerning the matter.

We get an impression that fear (negative sentiment) is used less often in successful talks to drive home the message. No matter the gravity of issue at hand, if you wish to your audiences to imbibe the learnings or change their behaviour, intimidation cannot be the motivation. The content of show should be framed in such a way that should leaves them with positive and constructive thoughts and inspire them to take action