### Topic Modeling


Here we try two algorithms: LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization).

- LDA is based on probabilistic graphical modeling, while NMF relies on linear algebra. Both algorithms take a bag-of-words matrix as input (each document is a row, and each column holds the count of one vocabulary word in that document; for NMF, this notebook uses TF-IDF weights instead of raw counts). Each algorithm then produces two smaller matrices, a document-to-topic matrix and a word-to-topic matrix, that when multiplied together approximately reproduce the bag-of-words matrix with the lowest error.

- Neither NMF nor LDA can determine the number of topics automatically; it must be specified in advance.
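As a rough illustration of the two points above, the sketch below fits both models on a tiny made-up corpus (not the TED data) and prints the shapes of the two matrices each one produces. The corpus, the `toy_` names, and the choice of two topics are assumptions made only for this example.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only (not the TED transcripts).
toy_docs = [
    "stars and planets fill the galaxy",
    "the galaxy holds many stars",
    "doctors treat patients in the hospital",
    "the hospital employs doctors and nurses",
]

# Bag-of-words matrix: one row per document, one column per vocabulary term.
toy_bow = CountVectorizer(stop_words="english").fit_transform(toy_docs)

n_topics = 2  # has to be chosen up front; neither model infers it

toy_models = {
    "LDA": LatentDirichletAllocation(n_components=n_topics, random_state=0),
    "NMF": NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500),
}

toy_shapes = {}
for name, model in toy_models.items():
    doc_topic = model.fit_transform(toy_bow)   # (n_documents, n_topics)
    topic_word = model.components_             # (n_topics, n_terms)
    toy_shapes[name] = (doc_topic.shape, topic_word.shape)
    print(name, "document-topic:", doc_topic.shape, "topic-word:", topic_word.shape)
```

Both models return a document-topic matrix with one column per requested topic, so changing `n_topics` changes the width of that matrix and the number of rows in `components_`.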


In [37]:
#Import all the necessary libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np


import collections
import json
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import wordcloud

## for text processing
import re
import nltk

from sklearn.pipeline import Pipeline


## for ner
import spacy

## for vectorizer
from sklearn import feature_extraction, manifold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF




In [4]:
#Load the preprocessed data
ted_talks = pd.read_csv("/Users/HOME/Desktop/Springboard/TED-Talks/Data/preprocessed_ted.csv",index_col = 0)
ted_talks.head()

Unnamed: 0,name,title,description,main_speaker,speaker_occupation,transcript,duration,film_date,published_date,languages,...,ingenious,courageous,longwinded,informative,fascinating,unconvincing,persuasive,ok,obnoxious,rating
0,Ken Robinson: Do schools kill creativity?,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,['Ken Robinson'],"['Author', 'educator']",Good morning. How are you?(Laughter)It's been ...,19.4,2006-02-24,2006-06-26,60,...,6073,3253,387,7346,10581,300,10704,1174,209,Inspiring
1,Al Gore: Averting the climate crisis,Averting the climate crisis,With the same humor and humanity he exuded in ...,['Al Gore'],['Climate advocate'],"Thank you so much, Chris. And it's truly a gre...",16.283333,2006-02-24,2006-06-26,43,...,56,139,113,443,132,258,268,203,131,Funny
2,David Pogue: Simplicity sells,Simplicity sells,New York Times columnist David Pogue takes aim...,['David Pogue'],['Technology columnist'],"(Music: ""The Sound of Silence,"" Simon & Garfun...",21.433333,2006-02-23,2006-06-26,26,...,183,45,78,395,166,104,230,146,142,Funny
3,Majora Carter: Greening the ghetto,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",['Majora Carter'],['Activist for environmental justice'],If you're here today — and I'm very happy that...,18.6,2006-02-25,2006-06-26,35,...,105,760,53,380,132,36,460,85,35,Inspiring
4,Hans Rosling: The best stats you've ever seen,The best stats you've ever seen,You've never seen data presented like this. Wi...,['Hans Rosling'],"['Global health expert', ' data visionary']","About 10 years ago, I took on the task to teac...",19.833333,2006-02-21,2006-06-27,48,...,3202,318,110,5433,4606,67,2542,248,61,Informative


In [15]:
ted_talks = ted_talks[~(ted_talks['clean_transc'].isna())]

In [16]:
ted_talks = ted_talks.reset_index(drop = True)

In [17]:
ted_talks.index

RangeIndex(start=0, stop=2455, step=1)

### Latent Dirichlet Allocation (LDA)

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: it treats each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions. A document can then be assigned to its most probable topic.
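To make the two distributions concrete, here is a minimal sketch on a made-up four-document corpus (an assumption for illustration, not the TED data). It shows that scikit-learn's LDA returns one topic distribution per document, and that normalizing each row of `components_` gives one word distribution per topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only.
toy_docs = [
    "stars and planets fill the galaxy",
    "the galaxy holds many stars",
    "doctors treat patients in the hospital",
    "the hospital employs doctors and nurses",
]
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Topics-per-document: each row is a probability distribution over topics.
toy_doc_topic = toy_lda.fit_transform(toy_counts)
print(toy_doc_topic.sum(axis=1))

# Words-per-topic: components_ holds unnormalized weights, so divide each
# row by its sum to obtain a distribution over the vocabulary.
toy_topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(toy_topic_word.sum(axis=1))

# Most probable topic for each document.
print(toy_doc_topic.argmax(axis=1))
```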

In [21]:
#LDA with number of topics set to 20
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from spacy.lang.en.stop_words import STOP_WORDS

def build_lda(data, num_of_topic=20):
    # Unigrams and bigrams, limited to the 5000 most frequent terms.
    # STOP_WORDS is a set; recent scikit-learn releases expect a list.
    vec = CountVectorizer(strip_accents='ascii', stop_words=list(STOP_WORDS),
                          ngram_range=(1, 2), max_features=5000)
    transformed_data = vec.fit_transform(data)
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vec.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=num_of_topic, max_iter=100,
        learning_method='online', random_state=0)
    lda.fit(transformed_data)

    return lda, vec, feature_names

def display_word_distribution(model, feature_names, n_word):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)

lda_model, vec, feature_names = build_lda(ted_talks['clean_transc'])
display_word_distribution(
    model=lda_model, feature_names=feature_names, 
    n_word=10)

Topic 0:
['technology', 'like', 'want', 'come', 'know', 'thing', 'people', 'way', 'time', 'think']
Topic 1:
['da', 'humor', 'ha', 'da da', 'feminist', 'sensitive', 'thank thank', 'spread', 'approve', 'gray']
Topic 2:
['people', 'think', 'human', 'way', 'like', 'thing', 'right', 'idea', 'question', 'world']
Topic 3:
['black', 'community', 'white', 'race', 'american', 'african', 'america', 'man', 'drug', 'color']
Topic 4:
['life', 'know', 'people', 'come', 'story', 'time', 'like', 'feel', 'day', 'tell']
Topic 5:
['crispr', 'technology', 'drown', 'broken', 'excitement', 'fee', 'pause', 'venture', 'delight', 'spark']
Topic 6:
['jihad', 'mid', 'like use', 'optimism', 'deposit', 'relax', 'optimistic', 'bin', 'question question', 'lot different']
Topic 7:
['city', 'use', 'design', 'new', 'build', 'like', 'work', 'create', 'building', 'project']
Topic 8:
['like', 'robot', 'look', 'time', 'use', 'light', 'fly', 'come', 'air', 'think']
Topic 9:
['earth', 'planet', 'universe', 'like', 'life', 'lo

In [9]:
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
nlp = en_core_web_sm.load()

In [35]:
#LDA with number of topics set to 20, full unigram vocabulary and only 10 iterations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def build_lda(data, num_of_topic=20):
    # No max_features cap and no bigrams this time.
    # STOP_WORDS is a set; recent scikit-learn releases expect a list.
    vec = CountVectorizer(strip_accents='ascii', stop_words=list(STOP_WORDS))

    transformed_data = vec.fit_transform(data)
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vec.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=num_of_topic, max_iter=10,
        learning_method='online', random_state=0)
    lda.fit(transformed_data)

    return lda, vec, feature_names



lda_model, vec, feature_names = build_lda(ted_talks['clean_transc'])


In [36]:
def display_word_distribution(model, feature_names, n_word):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)


display_word_distribution(model=lda_model, feature_names=feature_names, n_word=10)

Topic 0:
['poem', 'sex', 'sexual', 'man', 'mosul', 'gold', 'orgasm', 'athlete', 'amanda', 'wine']
Topic 1:
['hacker', 'gang', 'lie', 'loan', 'glamour', 'glamorous', 'hack', 'deception', 'liar', 'criminal']
Topic 2:
['internet', 'medium', 'twitter', 'african', 'africa', 'online', 'money', 'tweet', 'facebook', 'chinese']
Topic 3:
['brain', 'cell', 'patient', 'cancer', 'disease', 'body', 'actually', 'human', 'health', 'drug']
Topic 4:
['fish', 'ocean', 'shark', 'animal', 'sea', 'tag', 'tuna', 'shrimp', 'jellyfish', 'regret']
Topic 5:
['camino', 'homeno', 'finisterre', 'mccormack', 'seenthe', 'compostela', 'thatand', 'seaand', 'thinkingthat', 'thinkingi']
Topic 6:
['fonio', 'kedougou', 'heh', 'everglade', 'senghor', 'dogon', 'thresh', 'okra', 'fouta', 'sanoussi']
Topic 7:
['like', 'think', 'people', 'know', 'thing', 'time', 'want', 'look', 'work', 'way']
Topic 8:
['hg', 'nr', 'naghma', 'triceratop', 'jirga', 'justness', 'naser', 'manshiyat', 'dracorex', 'torosaurus']
Topic 9:
['dna', 'quan

In [25]:
#LDA with number of topics set to 15
def build_lda(data, num_of_topic=15):
    # Keep only the 2000 most frequent unigrams.
    # STOP_WORDS is a set; recent scikit-learn releases expect a list.
    vec = CountVectorizer(strip_accents='ascii', stop_words=list(STOP_WORDS), max_features=2000)

    transformed_data = vec.fit_transform(data)
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vec.get_feature_names_out()

    lda = LatentDirichletAllocation(n_components=num_of_topic, max_iter=100,
                                    learning_method='online', random_state=0)
    lda.fit(transformed_data)

    return lda, vec, feature_names

def display_word_distribution(model, feature_names, n_word):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        words = []
        for i in topic.argsort()[:-n_word - 1:-1]:
            words.append(feature_names[i])
        print(words)


lda_model, vec, feature_names = build_lda(ted_talks['clean_transc'])
display_word_distribution(model=lda_model, feature_names=feature_names, n_word=10)



Topic 0:
['cell', 'actually', 'like', 'think', 'thing', 'look', 'human', 'gene', 'dna', 'use']
Topic 1:
['life', 'people', 'feel', 'think', 'world', 'human', 'story', 'experience', 'love', 'live']
Topic 2:
['universe', 'earth', 'planet', 'like', 'space', 'thing', 'science', 'know', 'look', 'time']
Topic 3:
['know', 'like', 'thing', 'think', 'want', 'people', 'come', 'time', 'year', 'start']
Topic 4:
['people', 'think', 'thing', 'like', 'right', 'use', 'actually', 'know', 'want', 'way']
Topic 5:
['use', 'computer', 'technology', 'like', 'robot', 'game', 'machine', 'human', 'thing', 'play']
Topic 6:
['people', 'work', 'think', 'thing', 'company', 'money', 'like', 'need', 'dollar', 'good']
Topic 7:
['food', 'plant', 'eat', 'water', 'grow', 'like', 'need', 'farmer', 'feed', 'plastic']
Topic 8:
['like', 'look', 'think', 'actually', 'way', 'thing', 'work', 'little', 'sound', 'use']
Topic 9:
['car', 'energy', 'use', 'year', 'power', 'fly', 'air', 'need', 'time', 'oil']
Topic 10:
['world', 'co

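### Non-Negative Matrix Factorization (NMF)

Non-negative Matrix Factorization is a linear-algebra method for reducing the dimension of the input corpus. It approximates a non-negative document-term matrix (here, TF-IDF weights) as the product of two smaller non-negative matrices: one of document-topic weights and one of topic-word weights. Because every entry is non-negative, each topic is an additive combination of words, and words that do not consistently co-occur with a topic's other words receive comparatively little weight in it.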
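The factorization can be sketched on a small made-up corpus (an assumption for illustration, not the TED data). The names `V`, `W`, and `H` follow the common NMF notation V ≈ WH and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
toy_docs = [
    "stars and planets fill the galaxy",
    "the galaxy holds many stars",
    "doctors treat patients in the hospital",
    "the hospital employs doctors and nurses",
]

# V: non-negative TF-IDF matrix, shape (n_documents, n_terms).
V = TfidfVectorizer(stop_words="english").fit_transform(toy_docs).toarray()

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-word weights, shape (2, n_terms)

# Both factors stay non-negative, so topics only ever add words together.
print("non-negative:", bool((W >= 0).all() and (H >= 0).all()))

# The fit minimizes the Frobenius norm of V - WH.
frob_error = np.linalg.norm(V - W @ H)
print("reconstruction error:", frob_error, toy_nmf.reconstruction_err_)
```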

In [47]:
#NMF with number of topics set to 20
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Ignore terms found in more than 25% or fewer than 2% of the transcripts.
# STOP_WORDS is a set; recent scikit-learn releases expect a list.
vectorizer = TfidfVectorizer(strip_accents='ascii', stop_words=list(STOP_WORDS), max_df=0.25, min_df=0.02)
vectors = vectorizer.fit_transform(ted_talks['clean_transc'])

nmf = NMF(n_components=20, random_state=0)

# Document-topic weights, one row per talk
topics = nmf.fit_transform(vectors)

# Collect the top words and their weights for each topic
top_word = 10
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
topic_word, word_weight = {}, {}
for tid, t in enumerate(nmf.components_):
    top_idx = t.argsort()[:-top_word - 1:-1]
    topic_word[tid] = [terms[i] for i in top_idx]
    word_weight[tid] = t[top_idx]

for key, val in topic_word.items():
    print(key, ": ", val)

0 :  ['war', 'government', 'political', 'democracy', 'violence', 'conflict', 'election', 'police', 'citizen', 'security']
1 :  ['patient', 'health', 'doctor', 'disease', 'medical', 'hospital', 'drug', 'medicine', 'surgery', 'treatment']
2 :  ['planet', 'earth', 'mars', 'solar', 'energy', 'atmosphere', 'star', 'sun', 'fly', 'ice']
3 :  ['business', 'economy', 'economic', 'market', 'growth', 'china', 'global', 'india', 'cost', 'government']
4 :  ['datum', 'computer', 'machine', 'internet', 'phone', 'web', 'digital', 'video', 'online', 'algorithm']
5 :  ['brain', 'neuron', 'memory', 'cortex', 'disorder', 'behavior', 'animal', 'consciousness', 'signal', 'pattern']
6 :  ['robot', 'robotic', 'machine', 'leg', 'intelligence', 'sensor', 'artificial', 'interact', 'video', 'autonomous']
7 :  ['universe', 'galaxy', 'particle', 'star', 'theory', 'telescope', 'physics', 'quantum', 'black', 'hole']
8 :  ['cell', 'dna', 'gene', 'genome', 'virus', 'stem', 'genetic', 'bacteria', 'molecule', 'tissue']
9

**The topics derived from NMF and LDA are displayed above. On this TED Talks dataset, LDA produces some topics that are noisy and hard to interpret. I’d say NMF was able to find more meaningful topics in this dataset.**

In [59]:
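# Chain the already fitted TF-IDF vectorizer and NMF model
pipeline = Pipeline([('tfidf', vectorizer),
                     ('nmf', nmf)])

# Note: both steps were fit on 'clean_transc'; the raw transcript is passed here
transcript = pipeline.transform([ted_talks['transcript'].iloc[0]])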

print("\n Topic Weight: ", transcript[0])
print("\n Relevant topic for the Transcript: ", np.argmax(transcript[0]))
print("\n Transcript: ", ted_talks['transcript'].iloc[0][:1000])


 Topic Weight:  [0.         0.         0.02279914 0.         0.         0.02554309
 0.         0.         0.         0.         0.12278385 0.04714222
 0.         0.00252673 0.         0.03857361 0.05385094 0.
 0.00048741 0.00376381]

 Relevant topic for the Transcript:  10

 Transcript:  Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving.(Laughter)There have been three themes running through the conference which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity in all of the presentations that we've had and in all of the people here. Just the variety of it and the range of it. The second is that it's put us in a place where we have no idea what's going to happen, in terms of the future. No idea how this may play out.I have an interest in education. Actually, what I find is everybody has an interest in education. Don't you? I find this very interesting. If you're at
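The cell above relies on the pipeline applying its two fitted steps in order and on `np.argmax` picking the highest-weighted topic. A minimal sketch of that same chain, written out by hand on a made-up corpus, might look like this; the corpus, the `toy_` names, and the `dominant_topic` helper are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
toy_docs = [
    "stars and planets fill the galaxy",
    "the galaxy holds many stars",
    "doctors treat patients in the hospital",
    "the hospital employs doctors and nurses",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
toy_nmf.fit(toy_tfidf.fit_transform(toy_docs))

def dominant_topic(text):
    """Return (index of the highest-weighted topic, full weight vector)."""
    # Step 1: text -> TF-IDF row using the fitted vocabulary.
    # Step 2: TF-IDF row -> topic weights using the fitted NMF model.
    weights = toy_nmf.transform(toy_tfidf.transform([text]))[0]
    return int(np.argmax(weights)), weights

idx, weights = dominant_topic("a new telescope maps stars in a distant galaxy")
print("topic weights:", weights)
print("dominant topic:", idx)
```

Words the vectorizer never saw during fitting (such as "telescope" here) are ignored, so the result depends only on the overlap with the fitted vocabulary.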

In [51]:
# Assign each talk its highest-weighted NMF topic (1-based) and that topic's top words
ted_talks['topic'] = pipeline.transform(ted_talks['clean_transc']).argmax(axis=1) + 1
ted_talks['topic_tag'] = ted_talks['topic'].apply(lambda x: topic_word[x-1])

In [55]:
ted_talks.head()

Unnamed: 0,name,title,description,main_speaker,speaker_occupation,transcript,duration,film_date,published_date,languages,...,longwinded,informative,fascinating,unconvincing,persuasive,ok,obnoxious,rating,topic,topic_tag
0,Ken Robinson: Do schools kill creativity?,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,['Ken Robinson'],"['Author', 'educator']",Good morning. How are you?(Laughter)It's been ...,19.4,2006-02-24,2006-06-26,60,...,387,7346,10581,300,10704,1174,209,Inspiring,11,"[student, teacher, education, classroom, game,..."
1,Al Gore: Averting the climate crisis,Averting the climate crisis,With the same humor and humanity he exuded in ...,['Al Gore'],['Climate advocate'],"Thank you so much, Chris. And it's truly a gre...",16.283333,2006-02-24,2006-06-26,43,...,113,443,132,258,268,203,131,Funny,4,"[business, economy, economic, market, growth, ..."
2,David Pogue: Simplicity sells,Simplicity sells,New York Times columnist David Pogue takes aim...,['David Pogue'],['Technology columnist'],"(Music: ""The Sound of Silence,"" Simon & Garfun...",21.433333,2006-02-23,2006-06-26,26,...,78,395,166,104,230,146,142,Funny,5,"[datum, computer, machine, internet, phone, we..."
3,Majora Carter: Greening the ghetto,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",['Majora Carter'],['Activist for environmental justice'],If you're here today — and I'm very happy that...,18.6,2006-02-25,2006-06-26,35,...,53,380,132,36,460,85,35,Inspiring,18,"[building, architecture, architect, material, ..."
4,Hans Rosling: The best stats you've ever seen,The best stats you've ever seen,You've never seen data presented like this. Wi...,['Hans Rosling'],"['Global health expert', ' data visionary']","About 10 years ago, I took on the task to teac...",19.833333,2006-02-21,2006-06-27,48,...,110,5433,4606,67,2542,248,61,Informative,14,"[africa, african, continent, hiv, aid, south, ..."


In [53]:
ted_talks.to_csv("/Users/HOME/Desktop/Springboard/TED-Talks/Models/ted_modeling.csv", index=False)

In [54]:
ted_topic = pd.DataFrame()
ted_topic['topic'] = [x+1 for x in topic_word.keys()]
ted_topic['topic_tag'] = ['-'.join(x) for x in topic_word.values()]
ted_topic.head()

Unnamed: 0,topic #,topic tag
0,1,war-government-political-democracy-violence-co...
1,2,patient-health-doctor-disease-medical-hospital...
2,3,planet-earth-mars-solar-energy-atmosphere-star...
3,4,business-economy-economic-market-growth-china-...
4,5,datum-computer-machine-internet-phone-web-digi...


In [56]:
ted_topic.to_csv("/Users/HOME/Desktop/Springboard/TED-Talks/Models/topics_transcript.csv", index=False)