In [62]:
import pandas as pd
import numpy as np

In [29]:
transcripts=pd.read_csv("transcripts.csv", encoding = "ISO-8859-1", usecols=['transcript', 'url'])
transcripts = transcripts[transcripts['url'].notnull()]
transcripts = transcripts[transcripts['transcript'].notnull()]
transcripts.count()

transcript    2467
url           2467
dtype: int64

In [43]:
transcripts['title']=transcripts['url'].map(lambda x:x.split("/")[-1])
transcripts.count()

transcript    2467
url           2467
title         2467
dtype: int64

At this point We are ready to begin piecing together the components that will help us build a talk recommender. In order to achieve this I had to:

Create a vector representation of each transcript.
Create a similarity matrix for the vector representation created above.
For each talk, based on some similarity metric, select 4 most similar talks.

Since our final goal is to recommend talks based on the similarity of their content, the first thing we will have to do is to, create a representation of the transcripts that are amenable to comparison. One way of doing this is to create a tfidf vector for each transcript

To represent text, we will think of each transcript as one "Document" and the set of all documents as a "Corpus". Then we will create a vector representing the count of words that occur in each document, something like this:

In [31]:
from sklearn.feature_extraction import text
Text=transcripts['transcript'].tolist()
tfidf=text.TfidfVectorizer(input=Text,stop_words="english")
matrix=tfidf.fit_transform(Text)

To find out similar documents among different documents, we will need to compute a measure of similarity. Usually when dealing with Tf-Idf vectors, we use $cosine$ similarity. Think of $cosine$ similarity as measuring how close one TF-Idf vector is from the other. 

In [66]:
from sklearn.metrics.pairwise import cosine_similarity
sim_unigram=cosine_similarity(matrix)

All we have to do now is for, each Transcript, find out the 4 most similar ones, based on cosine similarity. Algorithmically, this would amount to finding out, for each row in the cosine matrix constructed above, the index of five columns, that are most similar to the document (transcript in our case) corresponding to the respective row number. This was accomplished using a few lines of code

In [111]:
def get_similar_articles(x):
    title_col = transcripts['title']
    col_labels = x.argsort()[-5:-1]
    related_articles = title_col.loc[col_labels]
    related_articles = related_articles.replace(np.nan, '', regex=True)
    return ",".join(related_articles)

In [112]:
transcripts['similar_articles_unigram']=[get_similar_articles(x) for x in sim_unigram]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  after removing the cwd from sys.path.


In [113]:
transcripts.head()

Unnamed: 0,transcript,url,title,similar_articles_unigram
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...,ken_robinson_says_schools_kill_creativity\r,"sunitha_krishnan_tedindia\r,scott_dinsmore_how..."
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...,al_gore_on_averting_climate_crisis\r,kate_stafford_how_human_noise_affects_ocean_ha...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...,david_pogue_says_simplicity_sells\r,"jennifer_8_lee_looks_for_general_tso\r,ze_fran..."
3,If you're here today ÛÓ and I'm very happy th...,https://www.ted.com/talks/majora_carter_s_tale...,majora_carter_s_tale_of_urban_renewal\r,rebecca_brachman_could_a_drug_prevent_depressi...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...,hans_rosling_shows_the_best_stats_you_ve_ever_...,"nathan_wolfe_what_s_left_to_explore\r,barbara_..."


In [116]:
#Lets take a talk
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[1]

'AL GORE ON AVERTING CLIMATE CRISIS'

In [117]:
#Most similar talks
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[1]

['KATE STAFFORD HOW HUMAN NOISE AFFECTS OCEAN HABITATS\r,JENNIFER 8 LEE LOOKS FOR GENERAL TSO\r,PAULA SCHER GETS SERIOUS\r,AL GORE S NEW THINKING ON THE CLIMATE CRISIS']

In [120]:
#Lets take a talk
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[4]

'HANS ROSLING SHOWS THE BEST STATS YOU VE EVER SEEN'

In [121]:
#Most similar talks
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[4]

['NATHAN WOLFE WHAT S LEFT TO EXPLORE\r,BARBARA BLOCK TAGGING TUNA IN THE DEEP OCEAN\r,HANS ROSLING REVEALS NEW INSIGHTS ON POVERTY\r,ERIC GILER DEMOS WIRELESS ELECTRICITY']

Credits for this analysis : https://github.com/Gunnvant/ted_talks/blob/master/BlogMarch18.ipynb