## TED Talk Dataset EDA

### Merge main data frame with transcripts

In [None]:
import pandas as pd

In [None]:
df_main = pd.read_csv("tedtalks/ted_main.csv")
df_transcripts = pd.read_csv("tedtalks/transcripts.csv")

Notice there are 2550 rows in the main file, but only 2467 of them have transcripts.

In [None]:
len(df_main), len(df_transcripts)

Merge the data frames on `url` with an inner join, so that only TED talks with a transcript are kept (some talks in `ted_main.csv` have no entry in `transcripts.csv`).

In [None]:
df_merged = df_main.merge(df_transcripts, how="inner", on="url")
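Before relying on the inner merge, it can help to see which talks it drops. The sketch below is a hedged illustration on two tiny made-up frames (`main`, `transcripts` and their values are hypothetical stand-ins, not the real files). It uses a left merge with `indicator=True` to flag talks without a transcript.

```python
import pandas as pd

# Toy stand-ins for ted_main.csv and transcripts.csv (values are made up)
main = pd.DataFrame({"url": ["a", "b", "c"], "title": ["T1", "T2", "T3"]})
transcripts = pd.DataFrame({"url": ["a", "c"], "transcript": ["text a", "text c"]})

# A left merge keeps every talk; `_merge` records whether a transcript was found
checked = main.merge(transcripts, how="left", on="url", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "url"].tolist()

# The inner merge keeps only talks present in both frames
inner = main.merge(transcripts, how="inner", on="url")
```

Here `missing` lists the URLs that the inner merge would discard.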

### Extract most popular rating class for each TED talk

First, replace single quotation marks with double quotation marks so that `ratings` is valid JSON.

In [None]:
df_merged["ratings"] = df_merged["ratings"].apply(lambda ratings: ratings.replace("'", '"'))
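The quote replacement above only works if no rating value itself contains an apostrophe. As a hedged alternative, `ast.literal_eval` can parse the Python-style string directly. The `sample` string below is made up and only assumes the `name` and `count` keys used later in this notebook.

```python
import ast

# Made-up sample in the same shape as a `ratings` cell: a Python-style list of dicts
sample = "[{'name': 'Funny', 'count': 10}, {'name': 'Inspiring', 'count': 25}]"

# literal_eval only evaluates Python literals, so no quote replacement is needed
ratings = ast.literal_eval(sample)

# Pick the entry with the most votes and keep its name
top = max(ratings, key=lambda r: r["count"])["name"]
```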

Now create a new column `highest_rating` holding the category with the most votes for each TED talk, e.g. Inspiring, Funny or Informative.

In [None]:
import json

In [None]:
from io import StringIO

def extract_highest_rating(ratings_json):
    """ For a set of ratings for a TED talk in JSON format, 
        extract the category with the most votes,
        e.g. Inspiring, Funny, Informative etc."""
    
    # 1. Read JSON into pandas data frame (wrapped in StringIO because
    #    recent pandas versions deprecate passing a literal JSON string)
    ratings_df = pd.read_json(StringIO(ratings_json))
    
    # 2. Extract position of the row with most votes (first one on ties)
    max_index = ratings_df["count"].argmax()
    
    # 3. Return category with most votes
    return ratings_df.iloc[max_index]["name"]

In [None]:
df_merged['highest_rating'] = df_merged["ratings"].apply(extract_highest_rating)

View the distribution of `highest_rating`.

In [None]:
import matplotlib.pyplot as plt

In [None]:
(df_merged['highest_rating'].value_counts()/len(df_merged)).sort_values().plot.barh()
plt.title("Distribution of highest rating categories");

The baseline accuracy (always predicting the most popular class) would be 34.5%.

In [None]:
baseline_acc = df_merged['highest_rating'].value_counts().max()/len(df_merged)
round(baseline_acc * 100, 2)  # scale first, then round, to avoid floating-point noise in the displayed percentage
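The majority-class baseline can be illustrated on a small made-up label series (the values in `labels` are hypothetical). This sketch uses `value_counts(normalize=True)`, which returns class shares directly, so the largest share is the baseline accuracy.

```python
import pandas as pd

# Made-up labels: 3 of 5 talks are "Inspiring"
labels = pd.Series(["Inspiring", "Funny", "Inspiring", "Informative", "Inspiring"])

# normalize=True returns class shares instead of raw counts
shares = labels.value_counts(normalize=True)

# The largest share is the accuracy of always predicting the majority class
baseline = shares.max()
majority_class = shares.idxmax()
```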

### Counts of keywords

In [None]:
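import ast
from collections import defaultdict
tag_counts = defaultdict(int)
for tags in df_merged["tags"]:
    # literal_eval safely parses e.g. "['a', 'b']" into ['a', 'b'] without executing code
    for tag in ast.literal_eval(tags):
        tag_counts[tag] += 1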

In [None]:
tag_counts_df = pd.DataFrame(list(tag_counts.items()), columns=["Keyword", "Count"])
tag_counts_df = tag_counts_df.sort_values('Count', ascending = False)
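The same tally can also be written with `collections.Counter`, which has a built-in `most_common` method. This is only a sketch of an alternative, shown on made-up tag strings (`tag_strings` is hypothetical) in the same format as the `tags` column.

```python
import ast
from collections import Counter

# Made-up `tags` cells: each is a string holding a Python-style list
tag_strings = ["['science', 'technology']", "['technology', 'design']", "['technology']"]

# Flatten every parsed tag list into a single Counter
counts = Counter(tag for s in tag_strings for tag in ast.literal_eval(s))

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top_one = counts.most_common(1)
```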

In [None]:
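# Top 11 keywords (positions 0 to 10), reversed so the most frequent keyword is drawn at the top
top_tags = tag_counts_df.iloc[10::-1]
plt.barh(y=top_tags["Keyword"], width=top_tags["Count"])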

### Extract TED category for each TED talk

This involves finding which of Technology, Entertainment or Design a given TED talk corresponds to, based on its keywords. A talk can match more than one category, or none at all.

In [None]:
import ast

def extract_category(tags):
    """ Given a string containing a list of tags, 
        returns its TED categories as a tuple (possibly empty).
        
        Sample input: "['computers', 'entertainment', 'technology']"
        Sample output: ('technology', 'entertainment')
    """
    category_options = ['technology', 'entertainment', 'design']
    
    tags = ast.literal_eval(tags) # Safely parse e.g. "['a', 'b']" into ['a', 'b']
    tags = [tag.lower() for tag in tags] # Cast all tags to lower case
    
    category = set(category_options).intersection(tags)
    
    # Descending alphabetical order gives tech, then entertainment, then design (i.e. TED order).
    # A tuple is returned because it is hashable, which value_counts() requires.
    return tuple(sorted(category, reverse = True))

In [None]:
df_merged['category'] = df_merged["tags"].apply(extract_category)
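Because a talk can carry several categories, counting whole combinations and counting each category separately give different pictures. As a hedged sketch on made-up data (`categories` is hypothetical, with one collection of category names per talk), `Series.explode` gives a per-category tally.

```python
import pandas as pd

# Made-up category collections, one per talk, in the shape produced by extract_category
categories = pd.Series([["technology", "design"], ["technology"], [], ["entertainment"]])

# explode() gives each element its own row; empty collections become NaN,
# which value_counts() ignores by default
per_category = categories.explode().value_counts()

# Number of talks with no matching category at all
n_uncategorised = int((categories.map(len) == 0).sum())
```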

In [None]:
(df_merged['category'].value_counts()/len(df_merged)).sort_values().plot.barh()
plt.title("Distribution of categories");

The baseline accuracy (always predicting the most popular class) would be 55.25%.

In [None]:
baseline_acc = df_merged['category'].value_counts().max()/len(df_merged)
round(baseline_acc * 100, 2)  # scale first, then round, to avoid floating-point noise in the displayed percentage