# **TED TALK RECOMMENDATION SYSTEM: Machine Learning Project**




**WORK FLOW**
Data -> Data Preprocessing -> Feature Extraction (text into numeric vectors) -> User Input -> Cosine Similarity -> Output

Feature Extraction: Convert the combined text of each TED talk into a numeric feature vector (here with TF-IDF), so that talks can be compared numerically.

Cosine Similarity: Measure how similar two talk vectors are. Every talk is compared with every other talk, which gives a similarity score for each pair of talks.

Output: Take the user input, find the matching talk, and return the talks with the highest similarity scores against it.
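
As a minimal sketch of the cosine similarity step: the score is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The vectors `talk_a`, `talk_b` and `talk_c` below are made-up toy values, not taken from the dataset, and the helper `cosine` is only illustrative (the notebook itself uses scikit-learn's `cosine_similarity`).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption: three features per talk)
talk_a = [1.0, 2.0, 0.0]
talk_b = [2.0, 4.0, 0.0]  # same direction as talk_a, twice as long
talk_c = [0.0, 0.0, 3.0]  # shares no non-zero feature with talk_a

same_direction = cosine(talk_a, talk_b)  # close to 1
no_overlap = cosine(talk_a, talk_c)      # exactly 0
```

Because TF-IDF weights are never negative, scores for talk vectors fall between 0 (no shared terms) and 1 (same direction).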

In [90]:
%%capture
# Above %%capture is to suppress unwanted output.
import numpy as np
import pandas as pd
import difflib

import nltk
import string
import warnings
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('stopwords')
warnings.filterwarnings('ignore')

In [91]:
df = pd.read_csv('ted_main.csv')
df.head(2)
# df.shape

Unnamed: 0,description,duration,main_speaker,name,speaker_occupation,tags,title,views,index
0,Sir Ken Robinson makes an entertaining and pro...,1164,Ken Robinson,Ken Robinson: Do schools kill creativity?,Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,47227110,1
1,With the same humor and humanity he exuded in ...,977,Al Gore,Al Gore: Averting the climate crisis,Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,3200520,2


In [92]:
df.shape

(2550, 9)

In [93]:
# data = df.drop(['comments', 'event', 'film_date', 'languages','num_speaker','ratings', 'published_date', 'url'], axis=1)

In [94]:
# Drop the numeric columns that are not used for text similarity
data = df.drop(['duration', 'views'], axis=1)

# Select the text columns that describe each talk (feature selection)
selected_features = ['main_speaker', 'description', 'speaker_occupation', 'tags']
print(selected_features)

['main_speaker', 'description', 'speaker_occupation', 'tags']


In [95]:
data.isnull().sum()
# Replace missing values in the selected features with an empty string,
# so that concatenating the columns does not produce NaN
for feature in selected_features:
    data[feature] = data[feature].fillna('')

data.head(1)

Unnamed: 0,description,main_speaker,name,speaker_occupation,tags,title,index
0,Sir Ken Robinson makes an entertaining and pro...,Ken Robinson,Ken Robinson: Do schools kill creativity?,Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,1


In [96]:
# Combine the selected features into one text string per talk
combined_features = data['main_speaker']+' '+data['description']+' '+data['speaker_occupation']+' '+data['tags']
print(combined_features)

0       Ken Robinson Sir Ken Robinson makes an enterta...
1       Al Gore With the same humor and humanity he ex...
2       David Pogue New York Times columnist David Pog...
3       Majora Carter In an emotionally charged talk, ...
4       Hans Rosling You've never seen data presented ...
                              ...                        
2545    Duarte Geraldino Between 2008 and 2016, the Un...
2546    Armando Azua-Bustos How can you study Mars wit...
2547    Radhika Nagpal Science fiction visions of the ...
2548    Theo E.J. Wilson In an unmissable talk about r...
2549    Karoliina Korppoo With more than half of the w...
Length: 2550, dtype: object


In [97]:
# Convert the text data into numeric feature vectors

# Create a TF-IDF vectorizer instance
vectorizer = TfidfVectorizer()

# Learn the vocabulary and transform each talk into a TF-IDF vector
feature_vectors = vectorizer.fit_transform(combined_features)
print(feature_vectors)

  (0, 13893)	0.1463897967059414
  (0, 10110)	0.15021304703047728
  (0, 3423)	0.1604668980761989
  (0, 3330)	0.08007339290370094
  (0, 2459)	0.11713953419514435
  (0, 4371)	0.14588037058947498
  (0, 1103)	0.12208987331684158
  (0, 3232)	0.2182602972238796
  (0, 14698)	0.251213265875426
  (0, 14067)	0.11135520786490397
  (0, 11312)	0.166741732028129
  (0, 9656)	0.251213265875426
  (0, 14073)	0.05558940684495165
  (0, 13765)	0.1391043942406136
  (0, 4369)	0.22894890873345705
  (0, 3226)	0.15457865623865016
  (0, 5363)	0.061951420019824864
  (0, 2243)	0.1360198642486211
  (0, 9217)	0.14298891611285441
  (0, 10966)	0.2229735718780267
  (0, 678)	0.03524196974627706
  (0, 4629)	0.18712575436867
  (0, 652)	0.15943418646396665
  (0, 8427)	0.11835625450290148
  (0, 12768)	0.21260363942070637
  :	:
  (2549, 8175)	0.08768172220366721
  (2549, 14886)	0.07797611544461426
  (2549, 14883)	0.04736881780640794
  (2549, 7268)	0.04290537944766699
  (2549, 15252)	0.04346031219927643
  (2549, 11738)	0.11168
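
Each line of the sparse output above reads as `(row, column) weight`: the row is the talk, the column is a term in the learned vocabulary, and the weight is that term's TF-IDF value in that talk. The following pure-Python sketch shows how such weights can arise. It assumes the weighting scikit-learn documents as the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and a simplified tokenizer; the function `tfidf` and the three toy documents are hypothetical and are not part of the notebook.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm if norm else 0.0 for x in row])
    return vocab, rows

# Hypothetical toy corpus (assumption: not taken from ted_main.csv)
docs = ["climate change and energy", "energy policy", "creativity in schools"]
vocab, vectors = tfidf(docs)
```

In this toy corpus "energy" appears in two documents and "climate" in one, so within the first document "energy" gets the lower weight: terms shared by many talks say less about any single talk.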

In [98]:
# Compute the cosine similarity between every pair of talks
similarity = cosine_similarity(feature_vectors)

# The result is a square matrix: entry [i, j] is the similarity score between talk i and talk j.
# Row i therefore holds the scores of talk i against every talk in the dataset (the diagonal is 1, since each talk is identical to itself).

print(similarity)
similarity.shape

[[1.         0.02272407 0.00377821 ... 0.01530766 0.01216128 0.02994413]
 [0.02272407 1.         0.02378765 ... 0.01267135 0.0344802  0.01339022]
 [0.00377821 0.02378765 1.         ... 0.01108547 0.01713572 0.03289073]
 ...
 [0.01530766 0.01267135 0.01108547 ... 1.         0.04072367 0.03681623]
 [0.01216128 0.0344802  0.01713572 ... 0.04072367 1.         0.05772986]
 [0.02994413 0.01339022 0.03289073 ... 0.03681623 0.05772986 1.        ]]


(2550, 2550)

In [99]:
# Create a list of all talk titles in the dataset
list_of_all_titles = data['title'].tolist()
print(list_of_all_titles)

['Do schools kill creativity?', 'Averting the climate crisis', 'Simplicity sells', 'Greening the ghetto', "The best stats you've ever seen", 'Why we do what we do', 'Letting go of God', "Behind the design of Seattle's library", "Let's teach religion -- all religion -- in schools", 'A life of purpose', 'My wish: A call for open-source architecture', 'My wish: A global day of film', 'My wish: Help me stop pandemics', 'The radical promise of the multi-touch interface', 'One Laptop per Child', "An 11-year-old's magical violin", 'Improvising on piano, aged 14', 'Simple designs to save a life', 'Organic design, inspired by nature', 'The birth of Wikipedia', 'The birth of the open-source learning revolution', 'Nerdcore comedy', 'Meet the founder of the blog revolution', 'Why we love, why we cheat', 'Happiness in body and soul', 'Chemical scum that dream of distant quasars', 'Why the universe seems so strange', 'The freakonomics of crack dealing', 'Choice, happiness and spaghetti sauce', 'The 

In [100]:
# Ask the user for a TED talk title
tedtalk_name = input("Enter TED Talk Name: ")

Enter TED Talk Name: why we do what we do


In [101]:
# Finding close match for user input
find_close_match = difflib.get_close_matches(tedtalk_name , list_of_all_titles)
print(find_close_match)


['Why we do what we do', 'Why I do theater']
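
`difflib.get_close_matches` compares strings case-sensitively and returns at most `n` candidates (default 3) whose similarity ratio is at least `cutoff` (default 0.6), best match first. The sketch below is one possible way to make the lookup independent of letter case and to handle the situation where nothing matches. The helper `find_title` is hypothetical, and the `titles` list holds three titles copied from the outputs above.

```python
import difflib

# Three titles taken from the notebook output above
titles = ['Do schools kill creativity?', 'Why we do what we do', 'Why I do theater']

def find_title(query, titles, cutoff=0.6):
    """Return the best-matching original title, or None if nothing is close enough."""
    # Map lowercased titles back to their original spelling
    lowered = {t.lower(): t for t in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

upper_case_query = find_title('WHY WE DO WHAT WE DO', titles)
unknown_query = find_title('zzzz', titles)
```

Returning `None` lets the caller show a message instead of failing on `find_close_match[0]` when the list of matches is empty.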


In [102]:
# Take the first (closest) title from find_close_match; its similarity scores against all other talks are used next
close_match = find_close_match[0]
print(close_match)

Why we do what we do


In [103]:
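# Find the 'index' column value of the matched talk.
# Caution: this column is 1-based while the rows of `similarity` are 0-based, so it selects the
# row after the matched talk; `data[data.title == close_match].index[0]` gives the 0-based position.
index = data[data.title == close_match]['index'].values[0]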
print(index)

6
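
The `df.head(2)` output shows that the `index` column starts at 1, while the rows of `similarity` follow the 0-based row order of the DataFrame. The sketch below illustrates the resulting off-by-one with the first seven titles from the printed list. `index_column` is a stand-in for `data['index']`, assuming it simply counts rows from 1 as the head output suggests.

```python
# First seven titles, in dataset order, copied from the printed title list
titles = [
    'Do schools kill creativity?',
    'Averting the climate crisis',
    'Simplicity sells',
    'Greening the ghetto',
    "The best stats you've ever seen",
    'Why we do what we do',
    'Letting go of God',
]

# Assumption: the 'index' column counts rows starting from 1
index_column = list(range(1, len(titles) + 1))

close_match = 'Why we do what we do'
position = titles.index(close_match)   # 0-based row position of the match
column_value = index_column[position]  # value stored in the 'index' column

# Using the 1-based value as a 0-based row position lands on the next talk
talk_at_column_value = titles[column_value]
```

Here `position` is 5 and `column_value` is 6, and row 6 is "Letting go of God". This is consistent with the final suggestions, which start with that talk and centre on religion. Looking the talk up by its row position, for example with `list_of_all_titles.index(close_match)`, avoids the shift.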


In [104]:
# Get the similarity score of every talk against the row selected by `index`,
# as a list of (row position, score) pairs
similarity_score = list(enumerate(similarity[index]))
print(similarity_score)
# len(similarity_score) = 2550

[(0, 0.004760346323887444), (1, 0.019004020293594524), (2, 0.014066979082299879), (3, 0.010880605547067308), (4, 0.0038645681785684765), (5, 0.007154901214101993), (6, 1.0000000000000002), (7, 0.011213139356986621), (8, 0.11231241601633342), (9, 0.12909628225542955), (10, 0.005748353888329848), (11, 0.04354532290560355), (12, 0.003614661715297004), (13, 0.01394154572485055), (14, 0.008397295927506812), (15, 0.025260728643477497), (16, 0.049138336559121504), (17, 0.004502194934464281), (18, 0.01462041938923682), (19, 0.005421953157610963), (20, 0.009645153463820937), (21, 0.03631602060578575), (22, 0.022167141682907177), (23, 0.010943926255109851), (24, 0.0662912594516167), (25, 0.010852958347262923), (26, 0.0023293004900346963), (27, 0.011801737372730672), (28, 0.016510496073408015), (29, 0.027451058758881933), (30, 0.006451675311583718), (31, 0.023464395350268164), (32, 0.008263294427652805), (33, 0.008458685035197777), (34, 0.005846050641002676), (35, 0.010099495375096074), (36, 0.01

In [105]:
# Sort the (row position, score) pairs by score, highest first
sorted_similar_scores = sorted(similarity_score, key=lambda x:x[1], reverse=True)
print(sorted_similar_scores)

[(6, 1.0000000000000002), (685, 0.4427996820225012), (193, 0.3434023414632892), (1762, 0.12972513379680561), (9, 0.12909628225542955), (59, 0.12040547204057486), (2237, 0.11995372476328445), (8, 0.11231241601633342), (1004, 0.11153821502997349), (2084, 0.1085891676800563), (96, 0.10609634388765889), (2471, 0.10461222283897288), (2269, 0.10269438365283583), (2159, 0.10136461797223308), (95, 0.10132400796325948), (1114, 0.09325247472144742), (260, 0.0868124420747927), (2224, 0.08624823545942556), (556, 0.08621646733490758), (1398, 0.08617697725675742), (983, 0.08236733092205038), (2273, 0.07707305938917573), (431, 0.07655585990411418), (2123, 0.07363357442009433), (744, 0.07358608526119274), (437, 0.07306985687535571), (996, 0.07202168718740032), (906, 0.07072786294677089), (525, 0.06890590914070524), (1767, 0.0680781294814827), (1353, 0.06632574689510057), (24, 0.0662912594516167), (853, 0.06586253442657375), (1811, 0.06523498524509318), (112, 0.06487202976664844), (1651, 0.063957341005
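
The top of the sorted list is always the queried row itself, with a score of 1. When only a handful of suggestions is needed, a full sort of all 2550 pairs is not required. The sketch below uses `heapq.nlargest` to take the `k` best scores while skipping the queried row. The function `top_matches` and the short `scores` list are hypothetical illustrations, not part of the notebook.

```python
import heapq

def top_matches(scores, query_position, k=10):
    """Return the k highest (position, score) pairs, excluding the query row."""
    candidates = (
        (pos, score) for pos, score in enumerate(scores) if pos != query_position
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarity row for a query at position 1 (its own score is 1.0)
scores = [0.02, 1.0, 0.44, 0.05, 0.34]
best = top_matches(scores, query_position=1, k=2)
```

With the notebook's variables this would correspond to something like `top_matches(similarity[index], index)`, under the assumption that `index` is the 0-based row position of the matched talk.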

In [107]:
# Ask the user for a TED talk title (used only in the heading printed below)
tedtalk_name = input("Enter TED Talk Name: ")

Enter TED Talk Name: WHY WE DO WHAT WE DO


In [108]:
# Display the titles of the 10 highest-scoring talks.
# The first entry is the row used for the lookup itself (score 1.0).

print("Suggested Ted Talks for "+tedtalk_name+" are: \n")
for i, (index, score) in enumerate(sorted_similar_scores[:10], start=1):
    title_from_index = data[data.index == index]['title'].values[0]
    print(i, ' - ', title_from_index)

Suggested Ted Talks for WHY WE DO WHAT WE DO are: 

1  -  Letting go of God
2  -  It's time for "The Talk"
3  -  The Jill and Julia Show
4  -  I'm not your inspiration, thank you very much
5  -  A life of purpose
6  -  Four American characters
7  -  Why you think you're right -- even if you're wrong
8  -  Let's teach religion -- all religion -- in schools
9  -  Pay attention to nonviolence
10  -  How I'm working for change inside my church
