# Exploring the Impact of Charismatic Leadership Tactics Used by TED Speakers

## Background

_write about: study of leadership by antonakis
types of leadership
role of charisma
charisma: ill-defined concept --> definition
measure of charisma through clts
big 5?_

## About the project

This project, led by Prof. Michalis Vlachos (HEC Faculty of the University of Lausanne), builds on previous work done by John Antonakis and his team. Using a large number of TED(x) talks, it aims to explore whether the use of charismatic leadership tactics (CLTs) impacts **follower reactions** (measured here by metrics such as views, comments, or "(dis)likes"). These talks make good research material because they feature speakers trying their best to be inspiring and to convey a message. As such, they are a good place to look for CLTs.

Partnering with Prof. Antonakis, Philip Garner from the Idiap Research Institute trained a **recurrent neural network** based on transcripts from 240 TED talks to estimate the usage probability of CLTs in textual data (for more details on the model, see Garner et al., 2018). The CLTs in each transcript had to be coded by hand using specific guidelines.

This multidimensional model, made available to us by Idiap, allows us to **predict the nine CLT values** for any given text. Testing this tool against more data is a good way to check whether its accuracy holds on a larger data pool. If charisma scores do indeed correlate positively with the aforementioned metrics, it would indicate good performance. It should be noted, however, that time also affects the value of these metrics (a talk that has been online longer has had more time to accumulate views and comments) and thus has to be taken into account.
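One simple way to take time into account, sketched here as an assumption rather than as the method used in this project, is to divide a metric by the number of days a talk has been online. The function name `views_per_day` and all the values below are hypothetical.

```python
from datetime import date


def views_per_day(views: int, date_published: date, reference_date: date) -> float:
    """Return the average number of views per day between publication and a reference date."""
    days_online = (reference_date - date_published).days
    if days_online <= 0:
        raise ValueError("reference_date must be later than date_published")
    return views / days_online


# Hypothetical example: two talks with the same view count but different ages
older = views_per_day(1_000_000, date(2010, 1, 1), date(2020, 1, 1))
newer = views_per_day(1_000_000, date(2019, 1, 1), date(2020, 1, 1))
print(older < newer)
```

With such a normalization, the more recent talk ranks higher even though both have the same raw view count. Other corrections (e.g. adding the talk's age as a regression feature) would be equally defensible.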

It may very well be that the deep learning tool requires more training data to provide consistent and accurate results. If anything, the data collected in the scope of this project will provide **more data points** to train the model (although the initial coding of CLTs would still have to be done manually).

Moving away from the CLT-based analysis, the data gathered during this project can also be used with another machine learning application, namely **IBM Watson Personality Insights**, which performs a psycholinguistic analysis of text to predict scores for each of the [Big Five personality traits](https://en.wikipedia.org/wiki/Big_Five_personality_traits) (OCEAN model), as well as some additional metrics such as consumer needs and values. It is meant to be more of a business application, but since personality traits like extroversion and openness to experience are thought to be qualities of a good leader (see works cited by Antonakis, 2006), it might constitute an interesting approach.

Lastly, we explore more naive approaches to predicting the metrics, using word embeddings and regression models.

### Further steps

All the work described above relates to text data, but a speaker's voice and facial expressions can also be used to evaluate leadership (Antonakis & Eubanks, 2017). Given enough time, it would be interesting to gather and then analyze the **audio and/or video** of those same TED talks. As in the project by Garner et al., this would require an initial coding phase to classify talks based on certain aspects of the speaker's speech (pitch, tone of voice, rhythm...) or appearance. Criteria other than CLTs would have to be identified in order to bring that work to fruition.

## Data collection

Multiple ways of acquiring the TED talks metadata were explored:

* **Using an unofficial TED API**: it looked promising but was rejected because of its pricing plan and the lack of control over which metadata to extract. Besides, querying the API requires a known parameter like the name of the speaker or the YouTube ID, information that was not in our possession at the start of this project.
* **Downloading an existing dataset on Kaggle**: user `rounakbanik` from Kaggle published a dataset of about 2,500 TED talks containing both metadata and transcripts. We decided not to keep it because of the relatively small number of talks and the fact that it only goes up to September 2017. However, we should note the presence of a feature which no longer exists, `ratings` (users could cast a vote using terms such as "inspiring", "fascinating", "jaw dropping", etc.). We could integrate these votes at a later stage by joining the Kaggle dataset to our own, and predict them for records where this feature is missing.
* **Parsing the TED RSS feed**: suggested on [StackOverflow](https://stackoverflow.com/questions/7239836/ted-talk-api-or-workaround-for-data-access), parsing the XML file of the TED talks [RSS feed](http://feeds.feedburner.com/TedtalksHD?fmt=xml) was also considered but quickly rejected because of its short range (one year only) and lack of relevant metadata. 
* **Custom web scraper**: given the poor options available to us, we decided to build a custom scraper for the TED website, based on preexisting work.

### TED

The companion code of a research project named "Awe the Audience: How Emotional Trajectories Affect Audience Perception in Public Speaking", available on [GitHub](https://github.com/ROC-HCI/TEDTalk_Analytics), was used as the basis for the structure of our own scraper.

Our **web scraper**, `ted_scraper.py`, has been updated to accommodate changes made to the TED website since 2017. It goes through all possible talk IDs (61,715 at the time of writing) and, using BeautifulSoup, parses each talk's JSON object (which is readily available in the HTML source) in order to extract the metadata. A second HTTP request has to be made to get the talk's transcript, which lives on another page. All the data is then saved to a CSV file. The IDs from successful and failed attempts (mostly 404 or 410 errors) are saved to log files to keep track of our progress.

The scraping took place from **March 8 to March 28, 2020**. This long period of time can be explained by two factors.

First, querying the TED website is a **slow process**. As previously stated, two HTTP requests must be made for each ID (one for the metadata, the other for the transcript). Moreover, to avoid getting timed out too often, the script waits for two seconds between each request.
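The request pattern described above can be sketched as follows. This is a minimal illustration, not the actual content of `ted_scraper.py`: the function name `crawl`, the injected `fetch` callable and the `example.org` URL patterns are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def crawl(
    talk_ids: Iterable[int],
    fetch: Callable[[str], Optional[str]],
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[int, Tuple[str, Optional[str]]], List[int]]:
    """Fetch the metadata and transcript pages of each ID, pausing after every request."""
    results: Dict[int, Tuple[str, Optional[str]]] = {}
    failed: List[int] = []
    for talk_id in talk_ids:
        # Hypothetical URL patterns, used only to illustrate the two requests
        metadata = fetch(f"https://example.org/talks/{talk_id}")
        sleep(delay)
        if metadata is None:  # e.g. a 404 or 410 response
            failed.append(talk_id)
            continue
        transcript = fetch(f"https://example.org/talks/{talk_id}/transcript")
        sleep(delay)
        results[talk_id] = (metadata, transcript)
    return results, failed
```

In this sketch, 61,715 IDs with two requests each and a two-second pause would add up to at most 61,715 × 2 × 2 s, i.e. roughly 69 hours of waiting alone, before counting network latency or restarts.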

Second, we encountered **several issues** which forced us to start over again. For example, some data would not be written correctly to the CSV file (fields being split across several lines, or some records not being written at all unless a `flush()` is done) or some errors were not handled correctly. But more importantly, we realized only later that we would need the YouTube ID in order to supplement our dataset with the YouTube data. We noticed at the same time that the native language was also available in the JSON object. This information is valuable because it allows us to discard non-English talks, in which we are not interested.
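The CSV issues mentioned above can be illustrated with the standard `csv` module. The sketch below is an assumption about one plausible cause (transcripts containing line breaks or commas) and one possible remedy (quoting every field and flushing after each row); the function name `write_records` is hypothetical and does not come from the project's scripts.

```python
import csv
import io


def write_records(handle, fieldnames, records):
    """Write records as CSV rows, flushing after each one so nothing stays buffered."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for record in records:
        writer.writerow(record)
        handle.flush()  # hand the row over to the OS right away in case the script stops


# In-memory demonstration; a real run would use open(path, "w", newline="", encoding="utf-8")
buffer = io.StringIO(newline="")
write_records(
    buffer,
    ["id", "transcript"],
    [{"id": 1, "transcript": "Line one\nLine two, with a comma"}],
)
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), repr(rows[0]["transcript"]))
```

Because the field is quoted, the embedded line break is read back as part of a single record instead of splitting it in two.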

After having crawled the entirety of the talks, we noticed that a couple of IDs redirected the user to other videos and thus gave **duplicated records**. The deduplication was done "manually" in Excel in order to have a cleaner dataset to share with others, but it could also have been done later in the cleaning section below.

Of all the data available on the TED website, only a small part is of actual interest to this project, but for the sake of completeness and future research, we decided to gather a bit more. The full list of **features** is as follows:

* `id`: a _numerus currens_ to uniquely identify the talk in the TED database, e.g. 30200.
* `url`
* `main_speaker`
* `title`
* `full_name`: a string containing both the main speaker's name and the title of their presentation. Note that this convention is not always respected and that sometimes the `title` also contains the speaker, or the event is included as well.
* `event`: the actual event where the talk took place, e.g. TEDxLakeComo.
* `event_type`: there are 8 different TED talk types, the most prominent of which are TED stage talks and TEDx talks.
* `description`
* `tags`: a semi-colon-separated list of keywords related to the talk.
* `date_recorded`
* `date_published`
* `duration`: the video's duration. Note that many talks are not hosted on the TED website but on YouTube so this has a value of 0 for the majority of records.
* `native_language`
* `nb_languages`: the number of languages for which a transcript exists (this can be used as an indicator of popularity)
* `views`: the number of views. For the same reason as the `duration` field, this often has a value of 0.
* `nb_comments`: when the comment functionality is not enabled, this gets a value of -1.
* `nb_speakers`
* `speakers`: a semi-colon-separated list of speakers.
* `speakers_desc`: a semi-colon-separated list of the speakers' descriptions.
* `ext_src`: the external service where videos are hosted, namely YouTube (close to 100% of talks are available on YouTube)
* `ext_id`: the external (= YouTube) ID, used at a later stage to query the YouTube API.
* `ext_duration`: the YouTube video's duration, which can be used when `duration` is not available.
* `transcript`: only English transcripts are considered.

### YouTube

Upon seeing that the TED website does not contain that many transcripts, and that most talks are actually hosted on YouTube, we decided to explore the YouTube data in the hope of acquiring **more transcripts**. Unfortunately, the information gain is quite small.

Our script, `youtube_query.py`, has the same structure as `ted_scraper.py`. However, since YouTube provides developers with an API, we extract that data by **querying the API** instead of scraping the website. One clear advantage of this is that we can query the data very fast without being timed out. However, Google imposes a daily quota of 10,000 units. Knowing that each query costs 7 units, this means only about 1,400 records can be queried per day.
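The quota arithmetic above can be checked with a few lines. The quota and the cost per record are the figures stated in the text; the number of video IDs passed to `days_needed` is a made-up value, and both function names are hypothetical.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # daily quota stated above
UNITS_PER_RECORD = 7        # cost of querying one record, as stated above


def records_per_day(quota: int = DAILY_QUOTA_UNITS, cost: int = UNITS_PER_RECORD) -> int:
    """Number of complete records that fit into one day's quota."""
    return quota // cost


def days_needed(nb_records: int) -> int:
    """Number of days required to query nb_records under the daily quota."""
    return math.ceil(nb_records / records_per_day())


print(records_per_day())
print(days_needed(50_000))  # 50,000 is a hypothetical number of video IDs
```

This explains why the harvesting of YouTube data had to be spread over several weeks even though each individual query is fast.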

The harvesting of YouTube data started on **March 20 and ended on April ?? 2020**.

Again for the sake of completeness, we retained more information than was actually necessary to us. Here is the **list of features**:

* `id`: an 11-character string that uniquely identifies the YouTube video, e.g. SEDvD1IICfE.
* `channel`: the channel where the video was uploaded.
* `title`
* `description`
* `tags`
* `date_published`
* `views`
* `likes`
* `dislikes`
* `nb_comments`
* `transcript`

Note that **likes and dislikes** are not available anywhere in the TED dataset and could potentially be interesting for our purposes.

#### Downloading the audio/video

Should it become necessary, it is also possible to download the YouTube videos. However, this cannot be achieved through the official YouTube API; third-party libraries would have to be used.

## Loading and merging the data

Since the TED and YouTube datasets are contained in separate CSV files, we first need to merge them. The former is considered to be the main one, the role of the latter simply being to augment it.
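To make the behaviour of such a left merge concrete, here is a toy sketch with made-up values. The two small dataframes are stand-ins for the real CSV files, and the `indicator=True` argument is added purely for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up
ted_toy = pd.DataFrame({"id": [1, 2, 3], "ext_id": ["aaa", "bbb", None], "views": [100, 0, 50]})
yt_toy = pd.DataFrame({"id": ["aaa", "bbb"], "views": [10, 20]})

# Every TED row is kept; YouTube columns are filled in only where ext_id matches
merged = ted_toy.merge(
    yt_toy, how="left", left_on="ext_id", right_on="id",
    suffixes=("_TED", "_YT"), indicator=True,
)
print(merged[["id_TED", "id_YT", "views_TED", "views_YT", "_merge"]])
```

Rows without a YouTube counterpart are flagged `left_only` and receive missing values in the `_YT` columns, which also turns the integer YouTube counts into floats.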

In [18]:
import numpy as np
import pandas as pd
from IPython.display import display

pd.set_option("display.max_columns", None)

In [19]:
ted = pd.read_csv("data/ted/ted_talks_TED_metadata_transcripts.csv")
yt = pd.read_csv("data/youtube/ted_talks_YT_metadata_transcript.csv")

df = ted.merge(yt, how="left", left_on="ext_id", right_on="id", suffixes=("_TED", "_YT"))
df.head(3)

Unnamed: 0,id_TED,url,main_speaker,title_TED,full_name,event,event_type,description_TED,tags_TED,date_recorded,date_published_TED,duration,native_language,nb_languages,views_TED,nb_comments_TED,nb_speakers,speakers,speakers_desc,ext_src,ext_id,ext_duration,transcript_TED,id_YT,channel,title_YT,description_YT,tags_YT,date_published_YT,views_YT,likes,dislikes,nb_comments_YT,transcript_YT
0,1,https://www.ted.com/talks/al_gore_averting_the...,Al Gore,Averting the climate crisis,Al Gore: Averting the climate crisis,TED2006,TED Stage Talk,With the same humor and humanity he exuded in ...,alternative energy;cars;climate change;culture...,2006-02-25,2006-06-27,977.0,en,43,3512145,272,1,Al Gore;,Climate advocate;,YouTube,rDiGYuQicpA,1018.0,"Thank you so much, Chris. And it's truly a gre...",rDiGYuQicpA,TED,Averting the climate crisis | Al Gore,http://www.ted.com With the same humor and hum...,Al;Gore;TED;TEDTalks;Talks;climate;crisis;envi...,2007-01-16,193084.0,815.0,262.0,308.0,"Thank you so much, Chris. And it's truly a gre..."
1,2,https://www.ted.com/talks/amy_smith_simple_des...,Amy Smith,Simple designs to save a life,Amy Smith: Simple designs to save a life,TED2006,TED Stage Talk,Fumes from indoor cooking fires kill more than...,MacArthur grant;alternative energy;design;engi...,2006-02-24,2006-08-15,906.0,en,27,1712970,101,1,Amy Smith;,"inventor, engineer;",YouTube,FwFkb1x7FJQ,946.0,"In terms of invention, I'd like to tell you th...",FwFkb1x7FJQ,TED,Amy Smith: Simple designs that could save mill...,http://www.ted.com Fumes from indoor cooking f...,TEDTalks;engineering;engineers;MIT;genius;gran...,2007-01-16,34350.0,161.0,7.0,19.0,"In terms of invention, I'd like to tell you th..."
2,3,https://www.ted.com/talks/ashraf_ghani_how_to_...,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani: How to rebuild a broken state,TEDGlobal 2005,TED Stage Talk,Ashraf Ghani's passionate and powerful 10-minu...,business;corruption;culture;economics;entrepre...,2005-07-12,2006-10-18,1125.0,en,25,973977,74,1,Ashraf Ghani;,President-elect of Afghanistan;,YouTube,A6GLw12jywo,1165.0,"A public, Dewey long ago observed, is constitu...",A6GLw12jywo,TED,Ashraf Ghani: How to fix broken states,http://www.ted.com Ashraf Ghani's passionate a...,Ashraf Ghani;TED;TEDTalks;Talks;Afghanistan;Un...,2007-01-12,122956.0,1235.0,110.0,245.0,"A public, Dewey long ago observed, is constitu..."


In [20]:
criteria = df["transcript_TED"].isna() & ~df["transcript_YT"].isna()
df[criteria].shape[0]

27

In [21]:
criteria1 = (df["native_language"] == "en")
print("English:", df[criteria1].shape[0])
print("Non-English:", df[~criteria1].shape[0])
print("TOTAL:", df.shape[0])

English: 39002
Non-English: 18357
TOTAL: 57359


In [22]:
criteria2 = (~df["transcript_TED"].isna() | ~df["transcript_YT"].isna())
print("English w/ transcript:", df[criteria1 & criteria2].shape[0])
print("English w/o transcript:", df[criteria1 & ~criteria2].shape[0])
print("Non-English w/ transcript:", df[~criteria1 & criteria2].shape[0])
print("Non-English w/o transcript:", df[~criteria1 & ~criteria2].shape[0])

English w/ transcript: 4759
English w/o transcript: 34243
Non-English w/ transcript: 574
Non-English w/o transcript: 17783


### Merging outcomes

Was it worth querying the YouTube API in order to enrich our dataset? As we can see in the figures above, merging our original TED dataset with the YouTube dataset has allowed us to get the English transcript for an additional **27 records** only.

Close to **68%** of the talks are spoken in **English** (39,002), about 12% of which have a transcript. Of the 18,357 _non-English_ talks, 574 have an English transcript.
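The shares quoted above follow directly from the four counts printed earlier. The short check below only re-derives them; the dictionary layout is an arbitrary choice made for this example.

```python
# Counts taken from the output above: (language group, has transcript) -> number of talks
counts = {
    ("en", True): 4759,
    ("en", False): 34243,
    ("other", True): 574,
    ("other", False): 17783,
}

total = sum(counts.values())
english = counts[("en", True)] + counts[("en", False)]
non_english = total - english

print(f"English share: {english / total:.1%}")
print(f"English talks with a transcript: {counts[('en', True)] / english:.1%}")
print(f"Non-English talks with a transcript: {counts[('other', True)] / non_english:.1%}")
```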

## Cleaning

Most of the cleaning operations are about **unifying similar fields** from both datasets. When data from the main TED dataset is missing (e.g. duration = 0, no comments, no transcript, etc.), we get it from the YouTube dataset if it is available.

Numeric values, i.e. views and comments, are **summed** to get an overall picture across both platforms. Note that tags could also be concatenated; if this feature ends up being relevant, we might try it.

We then get rid of **redundant columns** (either because they are of no interest within the scope of this project, or because they are duplicates).

Since **deduplication** of the TED dataset was done beforehand, the dataframe does not contain any duplicates. However, we keep the deduplication operation here in case the datasets change.

As far as **data types** are concerned, the numerical data from YouTube were read as floats. We convert them to integers, since they are counts. We also create proper date-time objects from the timestamps.
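The float columns are most likely a side effect of missing values: a NumPy integer column cannot hold `NaN`, so pandas falls back to floats. The sketch below, using made-up values, shows why the nullable `Int64` dtype (capital "I") is needed for the conversion.

```python
import numpy as np
import pandas as pd

likes = pd.Series([815.0, np.nan, 161.0])  # floats because of the missing value

# likes.astype("int64") would raise here, since NaN has no integer representation
likes_int = likes.astype("Int64")  # nullable integer dtype, keeps the gap as <NA>

print(likes_int.dtype, likes_int.isna().sum())
```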

Finally, since transcripts are the most relevant feature for our project, we split the data into two separate dataframes. `df` then becomes the one containing all the transcripts. We discard those that contain fewer than 100 characters (since they are mostly songs without lyrics or contain irrelevant information like "WEBVTT"). This brings us to a final count of **5,324 observations**.

In [23]:
# Combining columns/replacing null values
df["duration"] = np.where(df["duration"] == 0, df["ext_duration"], df["duration"])

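# Assigning the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
df["views_YT"] = df["views_YT"].fillna(0)
df["views"] = df["views_TED"] + df["views_YT"]

df["nb_comments_YT"] = df["nb_comments_YT"].fillna(0)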
df["nb_comments_TED"] = np.where(df["nb_comments_TED"] == -1, 0, df["nb_comments_TED"])
df["nb_comments"] = df["nb_comments_TED"] + df["nb_comments_YT"]

df["description"] = df["description_TED"].fillna(df["description_YT"])
df["tags"] = df["tags_TED"].fillna(df["tags_YT"])
df["transcript"] = df["transcript_TED"].fillna(df["transcript_YT"])

# Getting rid of unnecessary columns
df.drop(["url", "description_TED", "tags_TED", "nb_languages",
         "views_TED", "nb_comments_TED", "nb_speakers", "speakers", "speakers_desc",
         "ext_src", "ext_id", "ext_duration", "transcript_TED",
         "channel", "title_YT", "description_YT", "tags_YT", "date_published_YT",
         "views_YT", "nb_comments_YT", "transcript_YT"
        ], axis=1, inplace=True)

# Renaming some columns
df.rename(columns={"id_TED": "id", "title_TED": "title",
                   "date_published_TED": "date_published",
                   "native_language": "language", "id_YT": "yt_id"
                  }, inplace=True)

# Removing duplicates
df = df.drop_duplicates(subset="id", keep="first")

# Changing data types
# Nullable integer type ("Int64") so that missing values are preserved
int_columns = ["duration", "likes", "dislikes", "views", "nb_comments"]
df[int_columns] = df[int_columns].astype("Int64")
df["date_recorded"] = pd.to_datetime(df["date_recorded"])
df["date_published"] = pd.to_datetime(df["date_published"])

# Splitting into two dataframes: with or without transcript
criteria = df["transcript"].isna()
df_notranscript = df[criteria]
df = df[~criteria]

# Removing records with transcript shorter than 100 characters
df = df[(df["transcript"].str.len() >= 100)]

In [24]:
df.head(3)

Unnamed: 0,id,main_speaker,title,full_name,event,event_type,date_recorded,date_published,duration,language,yt_id,likes,dislikes,views,nb_comments,description,tags,transcript
0,1,Al Gore,Averting the climate crisis,Al Gore: Averting the climate crisis,TED2006,TED Stage Talk,2006-02-25,2006-06-27,977,en,rDiGYuQicpA,815,262,3705229,580,With the same humor and humanity he exuded in ...,alternative energy;cars;climate change;culture...,"Thank you so much, Chris. And it's truly a gre..."
1,2,Amy Smith,Simple designs to save a life,Amy Smith: Simple designs to save a life,TED2006,TED Stage Talk,2006-02-24,2006-08-15,906,en,FwFkb1x7FJQ,161,7,1747320,120,Fumes from indoor cooking fires kill more than...,MacArthur grant;alternative energy;design;engi...,"In terms of invention, I'd like to tell you th..."
2,3,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani: How to rebuild a broken state,TEDGlobal 2005,TED Stage Talk,2005-07-12,2006-10-18,1125,en,A6GLw12jywo,1235,110,1096933,319,Ashraf Ghani's passionate and powerful 10-minu...,business;corruption;culture;economics;entrepre...,"A public, Dewey long ago observed, is constitu..."


In [25]:
df.dtypes

id                         int64
main_speaker              object
title                     object
full_name                 object
event                     object
event_type                object
date_recorded     datetime64[ns]
date_published    datetime64[ns]
duration                   Int64
language                  object
yt_id                     object
likes                      Int64
dislikes                   Int64
views                      Int64
nb_comments                Int64
description               object
tags                      object
transcript                object
dtype: object

In [26]:
df.shape

(5324, 18)

## Tokenization

Tokenization will not really be useful for the deep learning charisma tool, but it will be for other models.

In [41]:
import spacy
import string

In [42]:
from spacy.lang.en import English
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
nlp = English()
punctuation = string.punctuation + "—"
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~—'

In [24]:
# Not removing stop words, because they matter in identifying CLTs
# stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [44]:
def tokenizer(transcript):
    tokens = nlp(transcript)
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]
    tokens = [word for word in tokens if word not in punctuation]
    
    return tokens
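One detail of the filter above is worth spelling out: because `punctuation` is a string, `word not in punctuation` is a substring test. Empty tokens and single punctuation marks are removed, but multi-character tokens such as `...` or `--` are not substrings of it and therefore pass through. The sketch below shows a possible alternative based on a set of characters; the names `PUNCTUATION`, `is_punctuation` and `drop_punctuation` are hypothetical and the token list is made up.

```python
import string

PUNCTUATION = set(string.punctuation + "—")


def is_punctuation(token: str) -> bool:
    """Return True for empty tokens and tokens made up only of punctuation characters."""
    return all(char in PUNCTUATION for char in token)


def drop_punctuation(tokens):
    """Keep only the tokens that contain at least one non-punctuation character."""
    return [token for token in tokens if not is_punctuation(token)]


print(drop_punctuation(["well", "...", "--", "", "do", "n't", "!"]))
```

Whether this stricter filter is desirable depends on the downstream model; it is shown only to make the behaviour of the substring test explicit.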

In [45]:
vectorizer = TfidfVectorizer(tokenizer=tokenizer, ngram_range=(1,1))

In [33]:
documents = [transcript for transcript in df["transcript"]]
# documents = [df.iloc[0,17]]

In [46]:
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names())

