### 1. Goal of my analysis
TED Talks are incredibly popular videos with a wide reach and an air of clout. Whose voices are TED Talks amplifying, and what topics feature most prominently?

Smaller questions that feed into the primary question: Who is the most prolific TED Talk speaker? What topics are the most commonly discussed? Which topics are the most popular in terms of view count and like count? How have TED Talks changed in frequency over time?

### 2. Data Handling
The data set used for this research was found on Kaggle (https://www.kaggle.com/ashishjangra27/ted-talks). It was collected by Ashish Jangra, a data scientist, by scraping the TED Talks website. No scraping date was given, which matters because many of the values can change over time (namely "views" and "likes"); the most recent talk date in the data is February 2022. Because the data was scraped by a third party and the scraping process wasn't shared, there is potential for errors, but spot-checking a few rows against their counterparts on the website didn't show any obvious problems.

Probably the largest concern for this data set is that it wasn't created with this type of analysis in mind, so there are some inconsistencies in conventions. It will be hard to get an accurate count of speakers, for instance. Where multiple individuals' names appear in the "author" column, they are sometimes separated by a comma (e.g. "Al Roker, Al Gore"), sometimes by the word "and" (e.g. "David Biello and Latif Nasser"), sometimes by an ampersand (e.g. "Peta Greenfield & Alex Gendler"), or by some combination of these when three or more speakers are involved.
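
One possible way to normalize these separators is a single regular expression. This is only a rough sketch: `split_authors` is a hypothetical helper, not part of the notebook, and it assumes that no single speaker or organization name itself contains a comma, an ampersand, or the standalone word "and" (an organization with "and" in its name would be split incorrectly).

```python
import re

# Separators seen in the "author" column: commas, ampersands, and the word "and".
# Assumption: no individual name contains any of these separators.
_SEPARATORS = re.compile(r"\s*(?:,|&|\band\b)\s*")

def split_authors(author_value):
    """Split one "author" cell into a list of individual names."""
    parts = _SEPARATORS.split(author_value)
    # Mixed separators such as ", and" leave empty pieces behind; drop them.
    return [part.strip() for part in parts if part.strip()]

print(split_authors("Al Roker, Al Gore"))
print(split_authors("David Biello and Latif Nasser"))
print(split_authors("Peta Greenfield & Alex Gendler"))
```

The `\b` word boundaries keep names such as "Alexander" or "Sandy" from being split on the letters "and".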

The data provided comes in six columns: 
1. "title": Title of the TED Talk (string)
2. "author": Name of Author(s)/Speaker(s)/Organization (string)
3. "date": Month and year when the talk took place (Month YYYY string)
4. "views": Number of non-unique video views (integer)
5. "likes": Number of likes (integer)
6. "link": URL of the published talk (string)

There are 5440 rows available, with 5440 unique titles and links. Because dates are given in a "Month YYYY" format there are only 200 unique date values (ranging from February 2004 to February 2022). There are 4443 unique non-null author values, meaning there will be useful data for identifying repeat authors.
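
Because the "date" values are strings, they sort alphabetically rather than chronologically until they are converted. A minimal sketch of that conversion follows, assuming every value follows the "Month YYYY" pattern with English month names. The `sample` frame and the `date_parsed` column are illustrative names, and the rows are toy values rather than the real file.

```python
import pandas as pd

# Toy rows in the data set's "Month YYYY" style (not read from the real CSV).
sample = pd.DataFrame({"date": ["February 2022", "December 2021", "April 2018"]})

# %B is the full month name and %Y the four-digit year; the day defaults to 1.
sample["date_parsed"] = pd.to_datetime(sample["date"], format="%B %Y")

# With real datetimes, the earliest and latest months can be found directly.
print(sample["date_parsed"].min(), sample["date_parsed"].max())
print(sample["date_parsed"].nunique())
```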

For my research purposes I will need to do minimal data cleaning. I'll remove the "link" column because it adds no value for my purposes (it follows a standard "base URL + author + title" convention, without providing any new data). There also appears to be a single null value in the "author" column that I'll need to investigate.

### 3. Methods

Most of my quantitative analysis will come from the "views", "likes", and "date" columns of this data set since their formats (integers and dates) allow for the most calculations. I'll be reporting summary statistics for views and likes (mean, median, ranges) both over time and for different authors. These will also make for useful visualizations--scatterplotting likes and views over time should give a broad sense of the change of reach over time. A histogram of how many talks are given every month over time compared to likes and views over time might demonstrate how the quantity of material impacts the overall reach of the talks (i.e. do 5 talks in one month generate 5 times as many views as 1 talk in one month?). I've already identified the most prolific author (Alex Gendler at 45 talks) and busiest month for talks (April 2018 with 127 talks), but there's more analysis that can be done here--like does an author's viewership over time follow a similar path to overall viewership over time, or does a specific author's viewership grow the more they speak? If repeat authors have outsized impacts on viewership 
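
The month-level comparison described above could be sketched roughly as follows. The `toy` frame uses invented numbers purely for illustration, and the output column names (`talk_count`, `total_views`, `median_views`, `views_per_talk`) are my own assumptions rather than anything in the data set.

```python
import pandas as pd

# Illustrative rows only; the view and like numbers are invented.
toy = pd.DataFrame({
    "date": ["April 2018", "April 2018", "May 2018"],
    "views": [1000, 3000, 5000],
    "likes": [30, 90, 150],
})
toy["month"] = pd.to_datetime(toy["date"], format="%B %Y")

# One row per month: number of talks plus summary statistics for views.
monthly = toy.groupby("month").agg(
    talk_count=("views", "count"),
    total_views=("views", "sum"),
    median_views=("views", "median"),
)

# Views per talk shows whether busier months spread attention more thinly.
monthly["views_per_talk"] = monthly["total_views"] / monthly["talk_count"]
print(monthly)
```

If views per talk drops in months with more talks, that would suggest extra talks do not bring in proportionally more views.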

The question of which topics are most discussed is more complicated. I've attempted to parse out some common topics by taking the most common words in talk titles (minus stop words and some manually removed numbers and symbols I considered irrelevant), and will reapply them to each talk by adding a new "tag" column. There are a lot of problems with this approach (a talk titled "The world is on fire" wouldn't get tagged with "climate" even if it was about global warming, for instance), so I don't expect to find anything definitive by doing this, but I'm hoping it provides at least some insight into broadly popular topics and how language is used in the titles. Beyond just the subject matter of TED Talks, I'm interested in who TED Talks might be directed towards or speaking about, so I've started a preliminary count of gendered words (e.g. "man" and "men" vs. "woman" and "women") in titles.
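
A rough sketch of the tagging step, including its known blind spot, is below. `TAG_WORDS`, `tokenize`, and `tag_title` are hypothetical names, and the tag vocabulary is a hand-picked placeholder rather than the final list. The tokenizer also strips punctuation from the edges of each word, which a plain `split()` would not do.

```python
import string

# Hypothetical tag vocabulary; in practice it would come from the
# most common title words after stop words are removed.
TAG_WORDS = {"climate", "brain", "art", "women"}

def tokenize(title):
    """Lowercase a title and strip punctuation from the edges of each word."""
    edge_chars = string.punctuation + "—…"
    words = (word.strip(edge_chars) for word in title.lower().split())
    return [word for word in words if word]

def tag_title(title, tag_words=TAG_WORDS):
    """Return the sorted tags that appear as whole words in the title."""
    return sorted(set(tokenize(title)) & tag_words)

print(tag_title("Climate action needs new frontline leadership"))
print(tag_title("The world is on fire"))  # no keyword match, so no tag
```

The second call returns an empty list, which is the limitation described above: a title that never uses a tag word goes untagged whatever the talk is about.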

One additional step I'm considering to help answer the question of whose voices are being emphasized by TED Talks is scraping their site for each author's job title. That information is already available and could be easily mapped to the existing data, and I think there might be some interesting patterns. I would have loved additional biographical information about TED Talk authors, like gender or nationality, but that doesn't appear to be available to me.

In [25]:
######## Setup ########

#General Python imports
from collections import Counter
import pprint

#Data Science specific imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

#Remove scientific notation from results (personal preference)
pd.set_option('display.float_format', str)

#Read CSV into data frame
talks = pd.read_csv('TedTalkData.csv')




######## Basic overview of data frame ########

#First 5 rows
pprint.pprint("First 5 rows:")
display(talks.head())

#Columns, data types, non-null counts
pprint.pprint("Data Frame info:")
display(talks.info())

#Get count, uniques, and top frequency values from data frame 
pprint.pprint("Data Frame description:")
display(talks.describe(include=object))

#Get five most common author values
pprint.pprint("Top 5 authors:")
display(talks['author'].value_counts().head())

#Get five most common date values
pprint.pprint("Top 5 dates:")
display(talks['date'].value_counts().head())



######## Cleaning data ########

#Remove link column (unnecessary for my research)
talks.drop('link', axis=1, inplace=True)

#Show NaN rows and count
display(talks[talks.isna().any(axis=1)])
display("NaNs: " + str(talks.isna().sum().sum()))

#Remove NaN rows and display updated count
talks = talks.dropna()
display("NaNs post-dropna: " + str(talks.isna().sum().sum()))



######## Some early investigations ########

#### Get most common words in titles

title_list = talks['title'].tolist()

counts = Counter()

for title in title_list:
    title = title.lower()
    for word in title.split():
        counts[word] += 1
        
#Remove stopwords from list (downloaded from The Natural Language Toolkit)
with open('english', 'r') as stopwords_file:
    stopwords_list = stopwords_file.read().splitlines()

for word in stopwords_list:
    del counts[word]

#Remove arbitrary numbers and symbols (just a few manually added after reviewing initial results)
numbers_and_characters_to_remove = ('0','1','2','3','4','5','6','7','8','9','10','—','...','/')

for word in numbers_and_characters_to_remove:
    del counts[word]

#Print the 50 most common words in titles (minus stop words and numbers/symbols)
pprint.pprint("Most common words: ")
pprint.pprint(counts.most_common(50))

#### Calculate the count of some gendered words and print them

masculine_words = ('man', 'men', 'boy', 'boys', 'guy', 'guys', 'father', 'fathers', 'dad', 'dads', 'son', 'sons', 'grandpa', 'grandpas', 'uncle', 'uncles')
feminine_words = ('woman', 'women', 'girl', 'girls', 'gal', 'gals', 'mother', 'mothers', 'mom', 'moms', 'daughter', 'daughters', 'grandma', 'grandmas', 'aunt', 'aunts')

masculine_word_count = 0
feminine_word_count = 0

for word in masculine_words:
    masculine_word_count += counts[word] 

for word in feminine_words:
    feminine_word_count += counts[word]

pprint.pprint("Masculine word count: " + str(masculine_word_count))
pprint.pprint("Feminine word count: " + str(feminine_word_count))




'First 5 rows:'


| | title | author | date | views | likes | link |
|---|---|---|---|---|---|---|
| 0 | Climate action needs new frontline leadership | Ozawa Bineshi Albert | December 2021 | 404000 | 12000 | https://ted.com/talks/ozawa_bineshi_albert_cli... |
| 1 | The dark history of the overthrow of Hawaii | Sydney Iaukea | February 2022 | 214000 | 6400 | https://ted.com/talks/sydney_iaukea_the_dark_h... |
| 2 | How play can spark new ideas for your business | Martin Reeves | September 2021 | 412000 | 12000 | https://ted.com/talks/martin_reeves_how_play_c... |
| 3 | Why is China appointing judges to combat clima... | James K. Thornton | October 2021 | 427000 | 12000 | https://ted.com/talks/james_k_thornton_why_is_... |
| 4 | Cement's carbon problem — and 2 ways to fix it | Mahendra Singhi | October 2021 | 2400 | 72 | https://ted.com/talks/mahendra_singhi_cement_s... |


'Data Frame info:'
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5440 entries, 0 to 5439
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   5440 non-null   object
 1   author  5439 non-null   object
 2   date    5440 non-null   object
 3   views   5440 non-null   int64 
 4   likes   5440 non-null   int64 
 5   link    5440 non-null   object
dtypes: int64(2), object(4)
memory usage: 255.1+ KB


None

'Data Frame description:'


| | title | author | date | link |
|---|---|---|---|---|
| count | 5440 | 5439 | 5440 | 5440 |
| unique | 5440 | 4443 | 200 | 5440 |
| top | Climate action needs new frontline leadership | Alex Gendler | April 2018 | https://ted.com/talks/ozawa_bineshi_albert_cli... |
| freq | 1 | 45 | 127 | 1 |


'Top 5 authors:'


Alex Gendler        45
Iseult Gillespie    33
Matt Walker         18
Alex Rosenthal      15
Elizabeth Cox       13
Name: author, dtype: int64

'Top 5 dates:'


April 2018       127
April 2019       124
April 2017       123
November 2018    115
November 2017    109
Name: date, dtype: int64

| | title | author | date | views | likes |
|---|---|---|---|---|---|
| 3039 | Year In Ideas 2015 | NaN | December 2015 | 532 | 15 |


'NaNs: 1'

'NaNs post-dropna: 0'

'Most common words: '
[('life', 152),
 ('new', 128),
 ('us', 126),
 ('world', 125),
 ('future', 125),
 ('could', 104),
 ('change', 102),
 ('help', 96),
 ('climate', 89),
 ('make', 89),
 ('art', 83),
 ('need', 81),
 ('power', 79),
 ('solve', 76),
 ('people', 73),
 ('brain', 72),
 ('better', 71),
 ('history', 70),
 ('like', 67),
 ('get', 66),
 ('science', 64),
 ('work', 61),
 ('one', 61),
 ('ways', 60),
 ('way', 60),
 ('human', 58),
 ('global', 58),
 ('time', 55),
 ('story', 55),
 ('health', 53),
 ('riddle?', 52),
 ('secret', 49),
 ('teach', 48),
 ('good', 48),
 ('love', 48),
 ("let's", 48),
 ('design', 47),
 ('data', 47),
 ('women', 46),
 ('know', 46),
 ("world's", 45),
 ('music', 44),
 ('kids', 44),
 ('makes', 43),
 ('fight', 43),
 ('build', 41),
 ('think', 41),
 ("what's", 41),
 ('(and', 40),
 ('next', 39)]
'Masculine word count: 33'
'Feminine word count: 87'
