## Topic Modeling
**Note: LDA can be unstable on GCP.** You may need to shut down all kernels after each LDA model run.

* Gensim relies on Python's multiprocessing library, so be aware of its limitations and monitor your memory usage to avoid out-of-memory (OOM) errors.
* Estimate memory usage with the rule `8 bytes * num_terms * num_topics`, and limit the number of topics or terms to keep memory under control.
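
The rule of thumb above can be turned into a quick check before training. The helper below is a minimal sketch: the function name `estimate_lda_memory_bytes` and the example vocabulary size are illustrative assumptions, and the result covers only the quantity named in the rule, not Gensim's total memory footprint.

```python
def estimate_lda_memory_bytes(num_terms: int, num_topics: int, bytes_per_value: int = 8) -> int:
    """Apply the rule of thumb: 8 bytes * num_terms * num_topics."""
    return bytes_per_value * num_terms * num_topics


# Hypothetical example: 100,000 unique terms and 10 topics.
estimate = estimate_lda_memory_bytes(num_terms=100_000, num_topics=10)
print(f"{estimate / 1024**2:.1f} MiB")
```

In a real run, `num_terms` would be `len(dictionary)` once the Gensim dictionary has been built.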

In [1]:
import sys
print(sys.version)

3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0]


In [2]:
import multiprocessing

num_processors = multiprocessing.cpu_count()
num_processors

8

In [3]:
# !pip install gensim --upgrade
# !pip install pyLDAvis --upgrade

In [4]:
import time
import math
import re
from textblob import TextBlob
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import string


import gensim
from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore

# import pyLDAvis.gensim
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()



In [5]:
# nltk.download('omw-1.4')

In [6]:
import warnings

# warnings.simplefilter('once')
warnings.simplefilter('ignore')
# warnings.filterwarnings("ignore", category=FutureWarning)
# warnings.filterwarnings("ignore", category=DeprecationWarning)
# warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

#### Read tweets

In [7]:
pd.set_option('display.max_colwidth', 500)

In [8]:
tweet_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/jeep_new.txt'

tweets = pd.read_csv(tweet_path, sep='\t',
                     names=['id', 'lang', 'created_at', 'screen_name',
                            'name', 'location', 'retweet_count', 'text'])

In [9]:
tweets.shape

(68971, 8)

In [10]:
tweets.head(10)

Unnamed: 0,id,lang,created_at,screen_name,name,location,retweet_count,text
0,9.222399e+17,en,Sun Oct 22 23:15:03 +0000 2017,alyssa_rose4,Princess Alyssa♛,"Chicago, IL",0.0,@Rachel_31297 where’s the Jeep Wrangler option
1,9.222399e+17,en,Sun Oct 22 23:15:09 +0000 2017,negocialoya_us,NegocialoYa USA,Estados Unidos,0.0,Check this out: 2016 JEEP PATRIOT LATITUDE 4X4 - $8500 (MIAMI-DADE) 8500.00 USD https://t.co/akPkANPpEn #ads… https://t.co/L2WaZBsUTf
2,9.2224e+17,tl,Sun Oct 22 23:15:14 +0000 2017,Jasmne_abr,Jas,blueberry,0.0,Kadugay sa jeep
3,9.2224e+17,en,Sun Oct 22 23:15:15 +0000 2017,JFCO38,JUAN FCO,,0.0,RT @Jeep: The Grand Cherokee Trackhawk is officially in production. Keep your eyes peeled for one in the wild. https://t.co/bbiPfPXXax
4,9.2224e+17,tl,Sun Oct 22 23:15:15 +0000 2017,MaestreRaymond,Jin Rae Min,Northern Mindanao,0.0,Kadugay mularga sa jeep nga kadali ko😬😭
5,9.2224e+17,tl,Sun Oct 22 23:15:21 +0000 2017,troyxaquino,Troy,,0.0,Bang luluwag ng mga jeep hahaha
6,9.2224e+17,en,Sun Oct 22 23:15:27 +0000 2017,LifeForTrucker,TruckerForLife™,United States,0.0,RT @Jeep: Power stance. https://t.co/kl0oC8Xvof
7,9.2224e+17,en,Sun Oct 22 23:15:30 +0000 2017,rpx53,Rod O|||||||O,Where the blacktop ends,0.0,"@THEJeepMafia @Jeep Thanks, 28° but Chaos and Bear were awesome feet warmers! 🐾🐾 https://t.co/nNIoaRVXlW"
8,9.2224e+17,tl,Sun Oct 22 23:15:34 +0000 2017,ronneldash,🏳️‍🌈ROX🏳️‍🌈,St. Paul,0.0,tangina mula 6:00 nasa sakayan ako tas 7:30 na wala pa rin sa baliuag tong putanginang jeep na to
9,9.222401e+17,tl,Sun Oct 22 23:15:49 +0000 2017,FayeUsi,aria,"Las Pinas City, National Capit",0.0,@Dncyngs ikaw yung nakakasabay ko lagi sa jeep hahaha ang ganda mo :)


In [11]:
# Keep only English-language tweets
tweets_eng = tweets[tweets['lang']=='en'].reset_index(drop=True)

In [12]:
# Remove special characters to avoid problems with analysis.
# Keeps letters, digits, spaces and @ . , : _ (hyphens and slashes are dropped)
tweets_eng['text_clean'] = tweets_eng['text'].map(lambda x: re.sub(r'[^a-zA-Z0-9 @.,:_]', '', str(x)))
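
It can help to check exactly which characters a negated character class keeps. The sketch below is an illustration only, with a sample string adapted from the second tweet shown earlier. Inside brackets, a hyphen placed between two characters defines a range, so a class written as `[^a-zA-Z0-9 @ . , : - _]` treats ` - ` as the range from space to space and does not preserve literal hyphens. Slashes, parentheses and dollar signs are removed as well, which is why URLs lose their `/` characters.

```python
import re

spaced = '[^a-zA-Z0-9 @ . , : - _]'   # ' - ' is parsed as the range space..space
compact = r'[^a-zA-Z0-9 @.,:_]'       # same character set, written without the range

sample = "JEEP PATRIOT LATITUDE 4X4 - $8500 (MIAMI-DADE) https://t.co/akPkANPpEn"

cleaned_spaced = re.sub(spaced, '', sample)
cleaned_compact = re.sub(compact, '', sample)

print(cleaned_spaced)
print(cleaned_spaced == cleaned_compact)
```

If hyphens should survive cleaning, one option is to place `-` at the very end of the class, where it is read as a literal character.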

In [13]:
pd.set_option('display.max_colwidth', 100)
tweets_eng[['text', 'text_clean']].head(5)

Unnamed: 0,text,text_clean
0,@Rachel_31297 where’s the Jeep Wrangler option,@Rachel_31297 wheres the Jeep Wrangler option
1,Check this out: 2016 JEEP PATRIOT LATITUDE 4X4 - $8500 (MIAMI-DADE) 8500.00 USD https://t.co/akP...,Check this out: 2016 JEEP PATRIOT LATITUDE 4X4 8500 MIAMIDADE 8500.00 USD https:t.coakPkANPpEn ...
2,RT @Jeep: The Grand Cherokee Trackhawk is officially in production. Keep your eyes peeled for on...,RT @Jeep: The Grand Cherokee Trackhawk is officially in production. Keep your eyes peeled for on...
3,RT @Jeep: Power stance. https://t.co/kl0oC8Xvof,RT @Jeep: Power stance. https:t.cokl0oC8Xvof
4,"@THEJeepMafia @Jeep Thanks, 28° but Chaos and Bear were awesome feet warmers! 🐾🐾 https://t.co/nN...","@THEJeepMafia @Jeep Thanks, 28 but Chaos and Bear were awesome feet warmers https:t.conNIoaRVXlW"


## Topic Modeling
#### Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”

### TF-IDF (term frequency–inverse document frequency)

#### Using TextBlob to build a TF-IDF function for our selected tweets

In [14]:
# http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)
# tf(word, blob) computes "term frequency" which is the number of times a word appears in a document blob, 
# normalized by dividing by the total number of words in blob. We use TextBlob for breaking up the text into words 
# and getting the word counts.


def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)
# n_containing(word, bloblist) returns the number of documents containing word. 
# A generator expression is passed to the sum() function.


def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))
# idf(word, bloblist) computes "inverse document frequency" which measures how common a word is 
# among all documents in bloblist. The more common a word is, the lower its idf. 
# We take the ratio of the total number of documents to the number of documents containing word, 
# then take the log of that. Add 1 to the divisor to prevent division by zero


def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
# tfidf(word, blob, bloblist) computes the TF-IDF score. It is simply the product of tf and idf.
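
To make the formulas concrete, here is a minimal sketch that applies the same definitions to a tiny hand-made corpus, using plain token lists instead of TextBlob objects. The four toy documents and the `*_tokens` function names are assumptions for illustration.

```python
import math

# Toy corpus (hypothetical): each document is already split into tokens.
docs = [
    ["jeep", "wrangler", "option"],
    ["jeep", "patriot", "latitude"],
    ["jeep", "grand", "cherokee"],
    ["power", "stance"],
]

def tf_tokens(word, doc):
    # Share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf_tokens(word, docs):
    # Same smoothing as idf() above: 1 is added to the document count.
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + n_containing))

def tfidf_tokens(word, doc, docs):
    return tf_tokens(word, doc) * idf_tokens(word, docs)

for word in docs[0]:
    print(word, round(tfidf_tokens(word, docs[0], docs), 5))
```

Because of the `1 +` in the denominator, a term found in three of the four documents ("jeep") scores exactly 0, and a term found in every document would score below 0.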

In [15]:
# Wrap each cleaned tweet in a TextBlob so the TF-IDF helpers can tokenize it
bloblist = [TextBlob(text) for text in tweets_eng['text_clean']]

len(bloblist)

40704

In [16]:
for i, blob in enumerate(bloblist):
    # Only show the first 5 tweets (top 5 words each)
    if i == 5:
        break
    print("Top words in tweet {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

Top words in tweet 1
	Word: Rachel_31297, TF-IDF: 1.65349
	Word: wheres, TF-IDF: 1.50077
	Word: option, TF-IDF: 1.226
	Word: Wrangler, TF-IDF: 0.38572
	Word: the, TF-IDF: 0.28657
Top words in tweet 2
	Word: t.coakPkANPpEn, TF-IDF: 0.58358
	Word: t.coL2WaZBsUTf, TF-IDF: 0.58358
	Word: MIAMIDADE, TF-IDF: 0.55973
	Word: 8500.00, TF-IDF: 0.55973
	Word: 8500, TF-IDF: 0.51896
Top words in tweet 3
	Word: t.cobbiPfPXXax, TF-IDF: 0.34509
	Word: peeled, TF-IDF: 0.34191
	Word: Keep, TF-IDF: 0.31915
	Word: wild, TF-IDF: 0.31915
	Word: officially, TF-IDF: 0.31637
Top words in tweet 4
	Word: t.cokl0oC8Xvof, TF-IDF: 0.89363
	Word: stance, TF-IDF: 0.89104
	Word: Power, TF-IDF: 0.85214
	Word: RT, TF-IDF: 0.18822
	Word: Jeep, TF-IDF: 0.06068
Top words in tweet 5
	Word: t.conNIoaRVXlW, TF-IDF: 0.70864
	Word: warmers, TF-IDF: 0.65913
	Word: Chaos, TF-IDF: 0.63017
	Word: Bear, TF-IDF: 0.60962
	Word: feet, TF-IDF: 0.56011


## LDA (Latent Dirichlet Allocation)
#### LDA is a generative probabilistic model, often described as a factorization of the document-term matrix, which assumes each document is produced from a mixture of topics. Each topic then generates words according to its own probability distribution.
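
The generative story behind LDA can be sketched in a few lines. This is a toy simulation of the assumed document-generation process, not of how Gensim fits the model. The two topics, their word probabilities, and the Dirichlet parameter are made-up values.

```python
import random

random.seed(0)

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sales":  {"bmw": 0.4, "sales": 0.3, "growth": 0.3},
    "racing": {"bmw": 0.4, "track": 0.3, "turbo": 0.3},
}

def sample_dirichlet(alpha):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha=(0.5, 0.5)):
    topic_names = list(topics)
    # 1. Draw the document's topic mixture.
    mixture = sample_dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = random.choices(topic_names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(random.choices(vocab, weights=weights)[0])
    return mixture, words

mixture, words = generate_document(num_words=8)
print([round(p, 2) for p in mixture], words)
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible topic mixtures and topic-word distributions.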

In [17]:
#https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

In [18]:
doc1 = "BMW upbeat sustained sales growth"
doc2 = "Ad wars When BMW Audi Mercedes Benz Jaguar prove prowess through advertisements"
doc3 = "BMW Protonic Frozen Yellow Edition Looks So Cool"
doc4 = "Judge Shuts Door On SoftClose Defect Suit Against BMW Law"
doc5 = "Just Listed BMW Alpina B Turbo Automobile Magazine"
doc6 = "How take part BMW Ultimate Driving Experience"
doc7 = "Long Beach BMW Motorcycles Becomes First BMW Dealer Offer Virtual Reality Experience Virtual Reality Reporter"
doc8 = "NYC Auto Show BMW M Performance Video Overview"
doc9 = "BMW F X Spy video shows SUV stress test"
doc10 = "Driver taken hospital BMW smashes tree Stourbridge Express Star"

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8, doc9, doc10]

In [19]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]     

In [20]:
type(doc_complete)

list

In [21]:
doc_complete[:2]

['BMW upbeat sustained sales growth',
 'Ad wars When BMW Audi Mercedes Benz Jaguar prove prowess through advertisements']

In [22]:
doc_clean[:3]

[['bmw', 'upbeat', 'sustained', 'sale', 'growth'],
 ['ad',
  'war',
  'bmw',
  'audi',
  'mercedes',
  'benz',
  'jaguar',
  'prove',
  'prowess',
  'advertisement'],
 ['bmw', 'protonic', 'frozen', 'yellow', 'edition', 'look', 'cool']]

In [23]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
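
For intuition about what the dictionary and the bag-of-words conversion produce, here is a plain-Python sketch of the same idea. It is not Gensim's implementation. The two toy documents and the names `token2id` and `to_bow` are assumptions, and Gensim's own id assignment order may differ.

```python
# Hypothetical pre-tokenized documents.
docs = [["bmw", "sale", "growth"], ["bmw", "virtual", "reality", "virtual"]]

# Assign an integer id to each unique term, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (term_id, count) pairs for one document."""
    counts = {}
    for token in doc:
        term_id = token2id[token]
        counts[term_id] = counts.get(term_id, 0) + 1
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse list of `(term_id, count)` pairs, so terms that do not occur in a document take up no space.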

https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21

In [24]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

### Three-topic Model

In [25]:
warnings.simplefilter('ignore')

# Running and training the LDA model on the document term matrix.
%time ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50) #3 topics
print(*ldamodel.print_topics(num_topics=3, num_words=3), sep='\n')

CPU times: user 246 ms, sys: 19.8 ms, total: 266 ms
Wall time: 252 ms
(0, '0.062*"bmw" + 0.035*"experience" + 0.035*"express"')
(1, '0.097*"bmw" + 0.036*"virtual" + 0.036*"reality"')
(2, '0.070*"bmw" + 0.028*"audi" + 0.028*"jaguar"')


#### For larger datasets, LdaMulticore should provide significant speed improvements; on a corpus this small, the multiprocessing overhead can outweigh the gain

In [26]:
%time ldamodel = LdaMulticore(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50, workers = num_processors-1) #3 topics
print(*ldamodel.print_topics(num_topics=3, num_words=3), sep='\n')

CPU times: user 356 ms, sys: 192 ms, total: 548 ms
Wall time: 533 ms
(0, '0.065*"bmw" + 0.037*"experience" + 0.037*"automobile"')
(1, '0.079*"bmw" + 0.024*"jaguar" + 0.024*"advertisement"')
(2, '0.088*"bmw" + 0.038*"virtual" + 0.038*"reality"')
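
The multicore run above derives its worker count as `num_processors-1`, which evaluates to 0 on a single-core machine. A small guard such as the one below keeps the value at 1 or more. This is a sketch, and `choose_workers` is a hypothetical helper name rather than part of Gensim.

```python
import multiprocessing

def choose_workers(reserve: int = 1) -> int:
    """Leave `reserve` logical cores free, but always use at least one worker."""
    return max(1, multiprocessing.cpu_count() - reserve)

workers = choose_workers()
print(workers)
```

Note that `multiprocessing.cpu_count()` reports logical cores, so on hyper-threaded CPUs a lower value may perform better.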


In [27]:
#topics = ldamodel.print_topics(num_words=3)
#for topic in topics:
#    print(topic)

In [28]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)



CPU times: user 111 ms, sys: 113 ms, total: 225 ms
Wall time: 1.52 s


### Five-topic Model

In [29]:
%time ldamodel = Lda(doc_term_matrix, num_topics=5, id2word = dictionary, passes=50) #5 topics
print(*ldamodel.print_topics(num_topics=5, num_words=5), sep='\n')

CPU times: user 224 ms, sys: 20 ms, total: 244 ms
Wall time: 241 ms
(0, '0.051*"ad" + 0.051*"advertisement" + 0.051*"audi" + 0.051*"prove" + 0.051*"war"')
(1, '0.015*"bmw" + 0.015*"upbeat" + 0.015*"sale" + 0.015*"growth" + 0.015*"sustained"')
(2, '0.083*"bmw" + 0.045*"video" + 0.045*"show" + 0.045*"experience" + 0.045*"auto"')
(3, '0.108*"bmw" + 0.038*"reality" + 0.038*"virtual" + 0.021*"first" + 0.021*"offer"')
(4, '0.074*"bmw" + 0.041*"smash" + 0.041*"stourbridge" + 0.041*"hospital" + 0.041*"driver"')


In [30]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

CPU times: user 85.2 ms, sys: 11.8 ms, total: 97 ms
Wall time: 144 ms


### Ten-topic Model

In [31]:
%time ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50)
print(*ldamodel.print_topics(num_topics=10, num_words=5), sep='\n')

CPU times: user 225 ms, sys: 15.7 ms, total: 241 ms
Wall time: 237 ms
(0, '0.074*"law" + 0.074*"judge" + 0.074*"shuts" + 0.074*"suit" + 0.074*"door"')
(1, '0.080*"turbo" + 0.080*"alpina" + 0.080*"listed" + 0.080*"b" + 0.080*"automobile"')
(2, '0.015*"virtual" + 0.015*"auto" + 0.015*"reporter" + 0.015*"offer" + 0.015*"nyc"')
(3, '0.015*"virtual" + 0.015*"auto" + 0.015*"reporter" + 0.015*"offer" + 0.015*"nyc"')
(4, '0.015*"virtual" + 0.015*"auto" + 0.015*"reporter" + 0.015*"offer" + 0.015*"nyc"')
(5, '0.065*"advertisement" + 0.065*"jaguar" + 0.065*"benz" + 0.065*"prove" + 0.065*"mercedes"')
(6, '0.096*"bmw" + 0.050*"experience" + 0.050*"f" + 0.050*"spy" + 0.050*"stress"')
(7, '0.080*"show" + 0.080*"video" + 0.080*"performance" + 0.080*"nyc" + 0.080*"auto"')
(8, '0.015*"virtual" + 0.015*"auto" + 0.015*"reporter" + 0.015*"offer" + 0.015*"nyc"')
(9, '0.119*"bmw" + 0.049*"reality" + 0.049*"virtual" + 0.026*"dealer" + 0.026*"motorcycle"')


In [32]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

CPU times: user 118 ms, sys: 4.53 ms, total: 123 ms
Wall time: 201 ms


### Applying LDA to tweets

In [33]:
tweets_list = tweets_eng['text_clean'].tolist()
tweets_list[:5]

['@Rachel_31297 wheres the Jeep Wrangler option',
 'Check this out: 2016 JEEP PATRIOT LATITUDE 4X4  8500 MIAMIDADE 8500.00 USD https:t.coakPkANPpEn ads https:t.coL2WaZBsUTf',
 'RT @Jeep: The Grand Cherokee Trackhawk is officially in production. Keep your eyes peeled for one in the wild. https:t.cobbiPfPXXax',
 'RT @Jeep: Power stance. https:t.cokl0oC8Xvof',
 '@THEJeepMafia @Jeep Thanks, 28 but Chaos and Bear were awesome feet warmers  https:t.conNIoaRVXlW']

In [34]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

tweet_clean = [clean(doc).split() for doc in tweets_list]

In [35]:
print(*tweet_clean[:3], sep='\n\n')

['rachel31297', 'wheres', 'jeep', 'wrangler', 'option']

['check', 'out', '2016', 'jeep', 'patriot', 'latitude', '4x4', '8500', 'miamidade', '850000', 'usd', 'httpstcoakpkanppen', 'ad', 'httpstcol2wazbsutf']

['rt', 'jeep', 'grand', 'cherokee', 'trackhawk', 'officially', 'production', 'keep', 'eye', 'peeled', 'one', 'wild', 'httpstcobbipfpxxax']


In [36]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index. 

dictionary = corpora.Dictionary(tweet_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.

%time doc_term_matrix = [dictionary.doc2bow(doc) for doc in tweet_clean]

CPU times: user 455 ms, sys: 15.8 ms, total: 471 ms
Wall time: 471 ms


In [37]:
#Using traditional LDA
%time ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50)

CPU times: user 7min 6s, sys: 211 ms, total: 7min 6s
Wall time: 7min 6s


In [38]:
%%time

# Using multicore LDA (passes=20 here vs. passes=50 above, so the two timings are not directly comparable)
num_topics = 10
iterations = 100
passes = 20
workers = num_processors-1
eval_every = None

ldamodel = LdaMulticore(corpus=doc_term_matrix,
                       id2word=dictionary,
                       eta='auto',
                       num_topics=num_topics,
                       iterations=iterations,
                       passes=passes,
                       eval_every=eval_every,
                       workers = workers)

CPU times: user 46.9 s, sys: 4.14 s, total: 51 s
Wall time: 50.4 s


In [39]:
print(*ldamodel.print_topics(num_topics=10, num_words=3), sep='\n')

(0, '0.091*"jeep" + 0.028*"rt" + 0.011*"renegade"')
(1, '0.070*"jeep" + 0.036*"rt" + 0.024*"compass"')
(2, '0.101*"giveaway" + 0.068*"jeep" + 0.055*"girl"')
(3, '0.098*"jeep" + 0.028*"rt" + 0.014*"im"')
(4, '0.117*"jeep" + 0.052*"wrangler" + 0.032*"cherokee"')
(5, '0.092*"jeep" + 0.040*"rt" + 0.033*"used"')
(6, '0.086*"jeep" + 0.075*"rt" + 0.014*"kylie"')
(7, '0.065*"jeep" + 0.030*"wrangler" + 0.027*"rt"')
(8, '0.080*"jeep" + 0.035*"rt" + 0.032*"dodge"')
(9, '0.108*"jeep" + 0.056*"rt" + 0.014*"jeeplife"')


In [41]:
print(*ldamodel.print_topics(num_topics=10, num_words=5), sep='\n\n')

(0, '0.091*"jeep" + 0.028*"rt" + 0.011*"renegade" + 0.011*"drive" + 0.009*"cherokee"')

(1, '0.070*"jeep" + 0.036*"rt" + 0.024*"compass" + 0.013*"around" + 0.011*"new"')

(2, '0.101*"giveaway" + 0.068*"jeep" + 0.055*"girl" + 0.053*"win" + 0.052*"small"')

(3, '0.098*"jeep" + 0.028*"rt" + 0.014*"im" + 0.013*"want" + 0.012*"one"')

(4, '0.117*"jeep" + 0.052*"wrangler" + 0.032*"cherokee" + 0.022*"ebay" + 0.021*"grand"')

(5, '0.092*"jeep" + 0.040*"rt" + 0.033*"used" + 0.029*"photo" + 0.029*"spotted"')

(6, '0.086*"jeep" + 0.075*"rt" + 0.014*"kylie" + 0.014*"luxbucketlist" + 0.014*"httpstcobvgln4ayie"')

(7, '0.065*"jeep" + 0.030*"wrangler" + 0.027*"rt" + 0.015*"cherokee" + 0.013*"buy"')

(8, '0.080*"jeep" + 0.035*"rt" + 0.032*"dodge" + 0.020*"blue" + 0.020*"chrysler"')

(9, '0.108*"jeep" + 0.056*"rt" + 0.014*"jeeplife" + 0.013*"get" + 0.009*"make"')


In [42]:
print(*ldamodel.print_topics(num_topics=10, num_words=7), sep='\n\n')

(0, '0.091*"jeep" + 0.028*"rt" + 0.011*"renegade" + 0.011*"drive" + 0.009*"cherokee" + 0.007*"amp" + 0.007*"ready"')

(1, '0.070*"jeep" + 0.036*"rt" + 0.024*"compass" + 0.013*"around" + 0.011*"new" + 0.009*"makeinindia" + 0.009*"car"')

(2, '0.101*"giveaway" + 0.068*"jeep" + 0.055*"girl" + 0.053*"win" + 0.052*"small" + 0.052*"body" + 0.051*"chance"')

(3, '0.098*"jeep" + 0.028*"rt" + 0.014*"im" + 0.013*"want" + 0.012*"one" + 0.008*"even" + 0.007*"like"')

(4, '0.117*"jeep" + 0.052*"wrangler" + 0.032*"cherokee" + 0.022*"ebay" + 0.021*"grand" + 0.017*"sport" + 0.013*"4x4"')

(5, '0.092*"jeep" + 0.040*"rt" + 0.033*"used" + 0.029*"photo" + 0.029*"spotted" + 0.029*"stick" + 0.028*"bamboo"')

(6, '0.086*"jeep" + 0.075*"rt" + 0.014*"kylie" + 0.014*"luxbucketlist" + 0.014*"httpstcobvgln4ayie" + 0.011*"jeepporn" + 0.009*"sexy"')

(7, '0.065*"jeep" + 0.030*"wrangler" + 0.027*"rt" + 0.015*"cherokee" + 0.013*"buy" + 0.013*"black" + 0.011*"2017"')

(8, '0.080*"jeep" + 0.035*"rt" + 0.032*"dodge" + 0

In [43]:
print(*ldamodel.print_topics(num_topics=10, num_words=10), sep='\n\n')

(0, '0.091*"jeep" + 0.028*"rt" + 0.011*"renegade" + 0.011*"drive" + 0.009*"cherokee" + 0.007*"amp" + 0.007*"ready" + 0.007*"sprint" + 0.007*"like" + 0.006*"music"')

(1, '0.070*"jeep" + 0.036*"rt" + 0.024*"compass" + 0.013*"around" + 0.011*"new" + 0.009*"makeinindia" + 0.009*"car" + 0.008*"making" + 0.007*"2018" + 0.007*"game"')

(2, '0.101*"giveaway" + 0.068*"jeep" + 0.055*"girl" + 0.053*"win" + 0.052*"small" + 0.052*"body" + 0.051*"chance" + 0.051*"cross" + 0.051*"sun" + 0.050*"entered"')

(3, '0.098*"jeep" + 0.028*"rt" + 0.014*"im" + 0.013*"want" + 0.012*"one" + 0.008*"even" + 0.007*"like" + 0.006*"credit" + 0.006*"think" + 0.005*"dont"')

(4, '0.117*"jeep" + 0.052*"wrangler" + 0.032*"cherokee" + 0.022*"ebay" + 0.021*"grand" + 0.017*"sport" + 0.013*"4x4" + 0.013*"unlimited" + 0.011*"rubicon" + 0.011*"check"')

(5, '0.092*"jeep" + 0.040*"rt" + 0.033*"used" + 0.029*"photo" + 0.029*"spotted" + 0.029*"stick" + 0.028*"bamboo" + 0.028*"nigeria" + 0.028*"convey" + 0.027*"hummer"')

(6, '0.

In [44]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)



CPU times: user 7.79 s, sys: 128 ms, total: 7.92 s
Wall time: 9.73 s


## TF-IDF on news articles

In [45]:
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/news_toyota.json'

news_df = pd.read_json(news_path, orient='records', lines=True)

In [46]:
news_df.head(5)

Unnamed: 0,crawled,language,text,title
0,2018-02-02T04:24:51.072+02:00,english,"QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No tra...",Dependable truck 03 Toyota Tacoma Double Cab $1500
1,2018-02-02T04:27:15.000+02:00,english,"0 \nNEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and...",US car sales mixed in January; trucks stay strong
2,2018-02-02T04:34:00.008+02:00,english,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH IN...,2005 TOYOTA CAMRY LE 167300 MILEAGE $2450 (TALLASSEE) $2450
3,2018-02-02T04:36:42.006+02:00,english,favorite this post Brand New Toyota Avalon Floor Mats - $115 (New Britain) hide this posting unh...,Brand New Toyota Avalon Floor Mats (New Britain) $115
4,2018-02-02T04:38:24.018+02:00,english,more ads by this user QR Code Link to This Post Black w/Piano Black w/Perforated NuLuxe Seat Tri...,2016 Lexus ES 350 (Coliseum Lexus of Oakland) $27772


In [47]:
news_df.shape

(100, 4)

In [48]:
# Keep only English-language articles
news_eng = news_df[news_df['language']=='english'].reset_index(drop=True)

In [49]:
# Remove special characters to avoid problems with analysis.
# Keeps letters, digits, spaces and @ . , : _ (hyphens and slashes are dropped)
news_eng['text_clean'] = news_eng['text'].map(lambda x: re.sub(r'[^a-zA-Z0-9 @.,:_]', '', str(x)))

In [50]:
pd.set_option('display.max_colwidth', 100)
news_eng[['text', 'text_clean']].head(5)

Unnamed: 0,text,text_clean
0,"QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No tra...","QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No tra..."
1,"0 \nNEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and...","0 NEW YORK: Automakers reported mixed US car sales in January, with strong demand for SUVs and p..."
2,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH IN...,transmission: automatic 2005 Toyota Camry LE 4 door 4 cyl AUTOMATIC VERY CLEAN INSIDE CLOTH IN...
3,favorite this post Brand New Toyota Avalon Floor Mats - $115 (New Britain) hide this posting unh...,favorite this post Brand New Toyota Avalon Floor Mats 115 New Britain hide this posting unhide ...
4,more ads by this user QR Code Link to This Post Black w/Piano Black w/Perforated NuLuxe Seat Tri...,more ads by this user QR Code Link to This Post Black wPiano Black wPerforated NuLuxe Seat Trim....


In [51]:
# Wrap each cleaned article in a TextBlob so the TF-IDF helpers can tokenize it
bloblist = [TextBlob(text) for text in news_eng['text_clean']]

len(bloblist)

100

In [52]:
for i, blob in enumerate(bloblist):
    # Only show the first 5 articles (top 10 words each)
    if i == 5:
        break
    print("Top words in news article {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:10]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

Top words in news article 1
	Word: receipts, TF-IDF: 0.21733
	Word: Cash, TF-IDF: 0.21733
	Word: 6477478013, TF-IDF: 0.21733
	Word: sale, TF-IDF: 0.19481
	Word: maintenance, TF-IDF: 0.17883
	Word: owner, TF-IDF: 0.17883
	Word: available, TF-IDF: 0.16643
	Word: trades, TF-IDF: 0.1563
	Word: truck, TF-IDF: 0.11779
	Word: QR, TF-IDF: 0.09844
Top words in news article 2
	Word: And, TF-IDF: 0.06643
	Word: In, TF-IDF: 0.05853
	Word: sales, TF-IDF: 0.04664
	Word: US, TF-IDF: 0.02365
	Word: The, TF-IDF: 0.02218
	Word: pickup, TF-IDF: 0.02214
	Word: saw, TF-IDF: 0.02117
	Word: strong, TF-IDF: 0.02106
	Word: percent, TF-IDF: 0.01984
	Word: overall, TF-IDF: 0.0195
Top words in news article 3
	Word: AUTOMATIC, TF-IDF: 0.18935
	Word: automatic, TF-IDF: 0.15643
	Word: LE, TF-IDF: 0.11506
	Word: cyl, TF-IDF: 0.11506
	Word: VERY, TF-IDF: 0.11506
	Word: INSIDE, TF-IDF: 0.11506
	Word: CLOTH, TF-IDF: 0.11506
	Word: INTERIOR, TF-IDF: 0.11506
	Word: NICE, TF-IDF: 0.11506
	Word: Just, TF-IDF: 0.11506
Top wo

### Applying LDA to news articles

In [53]:
news_list = news_eng['text_clean'].tolist()
news_list[:1]

['QR Code Link to This Post All maintenance receipts available, one owner truck. Cash sale. No trades.   6477478013']

In [54]:
news_clean = [clean(doc).split() for doc in news_list]

In [55]:
len(news_clean)

100

In [56]:
print(*news_clean[:1], sep='\n\n')

['qr', 'code', 'link', 'post', 'maintenance', 'receipt', 'available', 'one', 'owner', 'truck', 'cash', 'sale', 'trade', '6477478013']


In [57]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.

dictionary = corpora.Dictionary(news_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.

%time doc_term_matrix = [dictionary.doc2bow(doc) for doc in news_clean]

CPU times: user 44.7 ms, sys: 8.02 ms, total: 52.7 ms
Wall time: 52.3 ms


#### Three-topic model

In [58]:
%%time

num_topics = 3
iterations = 100
passes = 20
workers = num_processors-1
eval_every = None

ldamodel = LdaMulticore(corpus=doc_term_matrix,
                       id2word=dictionary,
                       eta='auto',
                       num_topics=num_topics,
                       iterations=iterations,
                       passes=passes,
                       eval_every=eval_every,
                       workers = workers)

CPU times: user 1.55 s, sys: 272 ms, total: 1.83 s
Wall time: 1.82 s


In [59]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=3), sep='\n')

(0, '0.011*"car" + 0.008*"toyota" + 0.006*"percent"')
(1, '0.018*"toyota" + 0.012*"sale" + 0.012*"vehicle"')
(2, '0.022*"percent" + 0.020*"u" + 0.012*"cent"')


In [60]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=5), sep='\n\n')

(0, '0.011*"car" + 0.008*"toyota" + 0.006*"percent" + 0.004*"said" + 0.004*"new"')

(1, '0.018*"toyota" + 0.012*"sale" + 0.012*"vehicle" + 0.009*"ford" + 0.008*"year"')

(2, '0.022*"percent" + 0.020*"u" + 0.012*"cent" + 0.012*"earnings" + 0.012*"per"')


In [61]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=7), sep='\n\n')

(0, '0.011*"car" + 0.008*"toyota" + 0.006*"percent" + 0.004*"said" + 0.004*"new" + 0.004*"job" + 0.003*"state"')

(1, '0.018*"toyota" + 0.012*"sale" + 0.012*"vehicle" + 0.009*"ford" + 0.008*"year" + 0.005*"motor" + 0.005*"2018"')

(2, '0.022*"percent" + 0.020*"u" + 0.012*"cent" + 0.012*"earnings" + 0.012*"per" + 0.011*"yield" + 0.011*"share"')


In [62]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=10), sep='\n\n')

(0, '0.011*"car" + 0.008*"toyota" + 0.006*"percent" + 0.004*"said" + 0.004*"new" + 0.004*"job" + 0.003*"state" + 0.003*"time" + 0.003*"d" + 0.003*"also"')

(1, '0.018*"toyota" + 0.012*"sale" + 0.012*"vehicle" + 0.009*"ford" + 0.008*"year" + 0.005*"motor" + 0.005*"2018" + 0.005*"january" + 0.005*"unit" + 0.005*"new"')

(2, '0.022*"percent" + 0.020*"u" + 0.012*"cent" + 0.012*"earnings" + 0.012*"per" + 0.011*"yield" + 0.011*"share" + 0.010*"index" + 0.009*"lower" + 0.008*"investor"')


In [63]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

CPU times: user 183 ms, sys: 16.1 ms, total: 199 ms
Wall time: 250 ms


#### Five-topic model

In [64]:
%%time

num_topics = 5
iterations = 100
passes = 20
workers = num_processors-1
eval_every = None

ldamodel = LdaMulticore(corpus=doc_term_matrix,
                       id2word=dictionary,
                       eta='auto',
                       num_topics=num_topics,
                       iterations=iterations,
                       passes=passes,
                       eval_every=eval_every,
                       workers = workers)

CPU times: user 1.45 s, sys: 253 ms, total: 1.7 s
Wall time: 1.69 s


In [65]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=3), sep='\n\n')

(0, '0.016*"toyota" + 0.013*"sale" + 0.012*"percent"')

(1, '0.011*"toyota" + 0.010*"car" + 0.007*"company"')

(2, '0.013*"toyota" + 0.009*"car" + 0.008*"japan"')

(3, '0.025*"percent" + 0.022*"u" + 0.014*"earnings"')

(4, '0.017*"ford" + 0.010*"cent" + 0.010*"per"')


In [66]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=5), sep='\n\n')

(0, '0.016*"toyota" + 0.013*"sale" + 0.012*"percent" + 0.012*"vehicle" + 0.011*"year"')

(1, '0.011*"toyota" + 0.010*"car" + 0.007*"company" + 0.007*"vehicle" + 0.006*"hydrogen"')

(2, '0.013*"toyota" + 0.009*"car" + 0.008*"japan" + 0.006*"d" + 0.005*"also"')

(3, '0.025*"percent" + 0.022*"u" + 0.014*"earnings" + 0.013*"yield" + 0.012*"index"')

(4, '0.017*"ford" + 0.010*"cent" + 0.010*"per" + 0.007*"toyota" + 0.007*"margin"')


In [67]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=7), sep='\n\n')

(0, '0.016*"toyota" + 0.013*"sale" + 0.012*"percent" + 0.012*"vehicle" + 0.011*"year" + 0.010*"january" + 0.008*"unit"')

(1, '0.011*"toyota" + 0.010*"car" + 0.007*"company" + 0.007*"vehicle" + 0.006*"hydrogen" + 0.006*"australia" + 0.006*"state"')

(2, '0.013*"toyota" + 0.009*"car" + 0.008*"japan" + 0.006*"d" + 0.005*"also" + 0.005*"new" + 0.005*"air"')

(3, '0.025*"percent" + 0.022*"u" + 0.014*"earnings" + 0.013*"yield" + 0.012*"index" + 0.010*"lower" + 0.010*"share"')

(4, '0.017*"ford" + 0.010*"cent" + 0.010*"per" + 0.007*"toyota" + 0.007*"margin" + 0.006*"post" + 0.006*"company"')


In [68]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=10), sep='\n\n')

(0, '0.016*"toyota" + 0.013*"sale" + 0.012*"percent" + 0.012*"vehicle" + 0.011*"year" + 0.010*"january" + 0.008*"unit" + 0.007*"market" + 0.007*"said" + 0.007*"2018"')

(1, '0.011*"toyota" + 0.010*"car" + 0.007*"company" + 0.007*"vehicle" + 0.006*"hydrogen" + 0.006*"australia" + 0.006*"state" + 0.005*"new" + 0.005*"job" + 0.004*"workforce"')

(2, '0.013*"toyota" + 0.009*"car" + 0.008*"japan" + 0.006*"d" + 0.005*"also" + 0.005*"new" + 0.005*"air" + 0.004*"lexus" + 0.004*"model" + 0.004*"museum"')

(3, '0.025*"percent" + 0.022*"u" + 0.014*"earnings" + 0.013*"yield" + 0.012*"index" + 0.010*"lower" + 0.010*"share" + 0.010*"investor" + 0.009*"benchmark" + 0.009*"cent"')

(4, '0.017*"ford" + 0.010*"cent" + 0.010*"per" + 0.007*"toyota" + 0.007*"margin" + 0.006*"post" + 0.006*"company" + 0.006*"sale" + 0.006*"price" + 0.005*"commodity"')


In [70]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

CPU times: user 186 ms, sys: 23.9 ms, total: 210 ms
Wall time: 280 ms


#### 10 topic model

In [71]:
%%time

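# Training settings for the 10-topic model
num_topics = 10
iterations = 100               # max inference iterations per document
passes = 20                    # full passes over the corpus
workers = num_processors - 1   # leave one core free for the main process
eval_every = None              # skip perplexity estimation to speed up training

ldamodel = LdaMulticore(corpus=doc_term_matrix,
                        id2word=dictionary,
                        eta='auto',  # learn the topic-word prior from the data
                        num_topics=num_topics,
                        iterations=iterations,
                        passes=passes,
                        eval_every=eval_every,
                        workers=workers)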

CPU times: user 1.4 s, sys: 297 ms, total: 1.69 s
Wall time: 1.66 s
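
Doubling the number of topics also doubles the estimate given by the memory rule at the top of this notebook (`8 bytes * num_terms * num_topics`). The sketch below applies that rule as plain arithmetic. The function name and the vocabulary size of 50,000 are hypothetical placeholders; in this notebook the real term count would come from `len(dictionary)`.

```python
def estimate_lda_memory_bytes(num_terms, num_topics, bytes_per_value=8):
    """Apply the rule of thumb: 8 bytes * num_terms * num_topics."""
    return bytes_per_value * num_terms * num_topics


# Hypothetical vocabulary size; replace with len(dictionary) in the notebook.
hypothetical_num_terms = 50_000

for k in (5, 10, 50):
    size_mb = estimate_lda_memory_bytes(hypothetical_num_terms, k) / 1024 ** 2
    print(f"{k:>3} topics: {size_mb:.1f} MB")
```

This is only a rough estimate of the model's main matrix and does not cover the per-worker overhead of multiprocessing.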


In [72]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=3), sep='\n\n')

(0, '0.025*"ford" + 0.012*"company" + 0.009*"margin"')

(1, '0.019*"toyota" + 0.015*"hydrogen" + 0.014*"australia"')

(2, '0.025*"percent" + 0.014*"sale" + 0.014*"vehicle"')

(3, '0.015*"unit" + 0.012*"market" + 0.011*"car"')

(4, '0.022*"toyota" + 0.020*"canada" + 0.018*"vehicle"')

(5, '0.021*"toyota" + 0.017*"sale" + 0.013*"car"')

(6, '0.029*"percent" + 0.009*"1" + 0.008*"law"')

(7, '0.016*"toyota" + 0.010*"unit" + 0.006*"truck"')

(8, '0.026*"u" + 0.021*"percent" + 0.015*"earnings"')

(9, '0.018*"japan" + 0.008*"museum" + 0.008*"toyota"')


In [73]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=5), sep='\n\n')

(0, '0.025*"ford" + 0.012*"company" + 0.009*"margin" + 0.009*"said" + 0.008*"workforce"')

(1, '0.019*"toyota" + 0.015*"hydrogen" + 0.014*"australia" + 0.007*"hma" + 0.006*"fuel"')

(2, '0.025*"percent" + 0.014*"sale" + 0.014*"vehicle" + 0.012*"toyota" + 0.011*"export"')

(3, '0.015*"unit" + 0.012*"market" + 0.011*"car" + 0.009*"new" + 0.009*"year"')

(4, '0.022*"toyota" + 0.020*"canada" + 0.018*"vehicle" + 0.011*"suspect" + 0.011*"canadian"')

(5, '0.021*"toyota" + 0.017*"sale" + 0.013*"car" + 0.011*"vehicle" + 0.008*"2018"')

(6, '0.029*"percent" + 0.009*"1" + 0.008*"law" + 0.007*"losing" + 0.007*"market"')

(7, '0.016*"toyota" + 0.010*"unit" + 0.006*"truck" + 0.005*"lexus" + 0.005*"1"')

(8, '0.026*"u" + 0.021*"percent" + 0.015*"earnings" + 0.015*"yield" + 0.012*"index"')

(9, '0.018*"japan" + 0.008*"museum" + 0.008*"toyota" + 0.007*"car" + 0.006*"factory"')
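
Several of the 10 topics share top words such as "toyota" and "vehicle", which can make them hard to tell apart. One rough way to quantify this, offered here as an illustrative sketch rather than part of the original analysis, is the Jaccard similarity between the top-word sets of two topics. The word lists below are copied from the 5-word output above for topics 2, 5 and 8; the `jaccard` helper is a hypothetical name.

```python
from itertools import combinations

# Top-5 words for three of the ten topics, copied from the output above.
top_words = {
    2: ["percent", "sale", "vehicle", "toyota", "export"],
    5: ["toyota", "sale", "car", "vehicle", "2018"],
    8: ["u", "percent", "earnings", "yield", "index"],
}


def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


overlaps = {
    (i, j): jaccard(top_words[i], top_words[j])
    for i, j in combinations(sorted(top_words), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A high score suggests two topics may be near-duplicates, which can be a hint that fewer topics would suffice; a score of zero only says the top words differ, not that the topics are well separated.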


In [74]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=7), sep='\n\n')

(0, '0.025*"ford" + 0.012*"company" + 0.009*"margin" + 0.009*"said" + 0.008*"workforce" + 0.008*"state" + 0.008*"alabama"')

(1, '0.019*"toyota" + 0.015*"hydrogen" + 0.014*"australia" + 0.007*"hma" + 0.006*"fuel" + 0.006*"soul" + 0.006*"vehicle"')

(2, '0.025*"percent" + 0.014*"sale" + 0.014*"vehicle" + 0.012*"toyota" + 0.011*"export" + 0.010*"year" + 0.008*"january"')

(3, '0.015*"unit" + 0.012*"market" + 0.011*"car" + 0.009*"new" + 0.009*"year" + 0.008*"january" + 0.008*"toyota"')

(4, '0.022*"toyota" + 0.020*"canada" + 0.018*"vehicle" + 0.011*"suspect" + 0.011*"canadian" + 0.006*"tmmc" + 0.006*"trujillo"')

(5, '0.021*"toyota" + 0.017*"sale" + 0.013*"car" + 0.011*"vehicle" + 0.008*"2018" + 0.007*"year" + 0.007*"cent"')

(6, '0.029*"percent" + 0.009*"1" + 0.008*"law" + 0.007*"losing" + 0.007*"market" + 0.007*"ball" + 0.007*"toyota"')

(7, '0.016*"toyota" + 0.010*"unit" + 0.006*"truck" + 0.005*"lexus" + 0.005*"1" + 0.005*"top" + 0.005*"month"')

(8, '0.026*"u" + 0.021*"percent" + 0.01

In [75]:
print(*ldamodel.print_topics(num_topics=num_topics, num_words=10), sep='\n\n')

(0, '0.025*"ford" + 0.012*"company" + 0.009*"margin" + 0.009*"said" + 0.008*"workforce" + 0.008*"state" + 0.008*"alabama" + 0.008*"job" + 0.007*"commodity" + 0.007*"automotive"')

(1, '0.019*"toyota" + 0.015*"hydrogen" + 0.014*"australia" + 0.007*"hma" + 0.006*"fuel" + 0.006*"soul" + 0.006*"vehicle" + 0.006*"mobility" + 0.006*"technology" + 0.005*"new"')

(2, '0.025*"percent" + 0.014*"sale" + 0.014*"vehicle" + 0.012*"toyota" + 0.011*"export" + 0.010*"year" + 0.008*"january" + 0.008*"increase" + 0.007*"electrified" + 0.006*"automobile"')

(3, '0.015*"unit" + 0.012*"market" + 0.011*"car" + 0.009*"new" + 0.009*"year" + 0.008*"january" + 0.008*"toyota" + 0.007*"vehicle" + 0.007*"last" + 0.006*"d"')

(4, '0.022*"toyota" + 0.020*"canada" + 0.018*"vehicle" + 0.011*"suspect" + 0.011*"canadian" + 0.006*"tmmc" + 0.006*"trujillo" + 0.005*"jose" + 0.005*"lexus" + 0.005*"weapon"')

(5, '0.021*"toyota" + 0.017*"sale" + 0.013*"car" + 0.011*"vehicle" + 0.008*"2018" + 0.007*"year" + 0.007*"cent" + 0.00

In [77]:
%%time

lda_display = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

CPU times: user 238 ms, sys: 8.1 ms, total: 246 ms
Wall time: 366 ms


In [78]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Thu, 03 November 2022 12:49:57'