# Idea:
Our solution: LDA over keywords derived from clusters of sentence-encoder embeddings of noun phrases and verbs:
- Each noun phrase and verb in the texts is transformed into an embedding vector using the Universal Sentence Encoder (a transformer-based sentence encoder; see the reference below).
- The embedding vectors are grouped into clusters with cosine similarity >= 70%.
- The words/phrases whose embedding vectors are closest to the centers of the resulting clusters become the key words/phrases.
- Each text in the training sample is converted into a collection of key phrases by replacing its noun phrases and verbs with the corresponding key words/phrases and deleting all other words.
- LDA is performed on the transformed texts.
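
The clustering and key-phrase steps above can be illustrated with a small plain-Python sketch. It is not the notebook's implementation (that lives in `topic_modeling_v4`, which is not shown here): the greedy clustering strategy, the helper names, and the 2-D toy vectors are all assumptions made for illustration, and real embeddings would come from the encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean, used as the cluster center."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_phrases(embeddings, threshold=0.7):
    """Greedy clustering (an assumption): join the first cluster whose
    center has cosine similarity >= threshold, else start a new cluster."""
    clusters = []  # each cluster is a list of phrases
    for phrase, vec in embeddings.items():
        for members in clusters:
            center = mean_vector([embeddings[p] for p in members])
            if cosine(vec, center) >= threshold:
                members.append(phrase)
                break
        else:
            clusters.append([phrase])
    return clusters

def key_phrase_map(embeddings, clusters):
    """Map every phrase to the cluster member closest to the cluster center."""
    mapping = {}
    for members in clusters:
        center = mean_vector([embeddings[p] for p in members])
        key = max(members, key=lambda p: cosine(embeddings[p], center))
        for p in members:
            mapping[p] = key
    return mapping

# Hypothetical 2-D "embeddings"; real vectors would have hundreds of dimensions.
toy = {
    "central bank": [1.0, 0.1],
    "federal reserve": [0.9, 0.2],
    "monetary authority": [0.95, 0.0],
    "vaccine": [0.1, 1.0],
    "immunization": [0.3, 0.9],
    "vaccination": [0.0, 1.0],
}
clusters = cluster_phrases(toy)
mapping = key_phrase_map(toy, clusters)

# Replace each phrase of a text with its cluster's key phrase.
text = ["federal reserve", "vaccination", "monetary authority"]
print([mapping[p] for p in text])
```

With these toy vectors the two finance phrases collapse to `central bank` and `vaccination` collapses to `vaccine`, which is the vocabulary reduction that the LDA step then works on.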


**Reference:**<br>
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. **Universal Sentence Encoder.** *arXiv:1803.11175, 2018.*

# Load Python libraries and data

In [1]:
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

# topic model visualization
import pyLDAvis.gensim 

# data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# topic modeling libraries
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel


# supporting libraries
import pandas as pd
import time
import pickle
import topic_modeling_v4 as tm



In [2]:
# load data
with open("./transition_files/df_train_for_LDA.pickle", 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    df_train = pickle.load(f)

print("df_train.shape:", df_train.shape)
print("df_train.columns:",df_train.columns)

df_train.shape: (33982, 18)
df_train.columns: Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words'],
      dtype='object')


***

In [3]:
#prepare data for LDA
start_time = time.time()
df_data_1 = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                    params={"TEXT_prepared_df": df_train,
                                     "save_LDA_dictionary_path": "./output/lda_keywords/dictionary1.pickle",
                                     "words_column": "all_key_words"
                                     },
                                    verbose=2)
end_time = time.time()
print("Processing time in minutes:", round((end_time - start_time)/60,2))

loaded data shape: (33982, 18)

Number of unique key-words for topic modeling dictionary: 179256
LDA dictionary file is saved to: ./output/lda_keywords/dictionary1.pickle

Number of texts processed:  33982
Number of extracted key-words:  179256

Each text is represented by list of  179256  tuples: 
		(key-words's index in bag-of-words dictionary, key-words's term frequency)
Processing time in minutes: 0.08
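
The log above describes each text as a list of (index, term frequency) tuples. The internals of `tm.prepare_for_modeling` are not shown, so the following is only a plain-Python illustration of that representation. `build_dictionary` and `to_bow` are hypothetical helper names, and a real run would more likely rely on gensim's `corpora.Dictionary` and its `doc2bow` method.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer index to every distinct key-word, in first-seen order."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (index, term frequency) tuples for one text."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy key-word lists standing in for the `all_key_words` column.
texts = [
    ["central bank", "raise", "inflation", "central bank"],
    ["vaccine", "prevent", "disease"],
]
token2id = build_dictionary(texts)
print(to_bow(texts[0], token2id))  # [(0, 2), (1, 1), (2, 1)]
```

In this sparse form a text only carries tuples for the key-words it actually contains, so the list for a single article is normally far shorter than the dictionary.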


In [4]:
#first level of topics
start_time = time.time()
df_first_level = tm.train_model(model_type="LDA-KeyWords",
                            params={"num_topics": 10,
                                    "LDA_prepared_df": df_data_1,
                                    "LDA_dictionary_path": "./output/lda_keywords/dictionary1.pickle",
                                    "save_LDA_model_path": "./output/lda_keywords/LDA_model1"
                                    },
                               verbose=2)
end_time = time.time()
print("Processing time in minutes:", round((end_time - start_time)/60,2))

Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
loaded data shape: (33982, 19)

Creating document-term matrix for LDA...

Training LDA model with  10  topics...
LDA model file is saved to: ./output/lda_keywords/LDA_model1
Top topic indexes are selected. NOTE "-1" corresponds to top topic with probability < 20%
Processing time in minutes: 2.0
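
The note about `-1` in the log can be read as a thresholding rule on each document's topic distribution. A minimal sketch of such a rule, assuming the distribution is a list of `(topic index, probability)` pairs; the function name `top_topic` and the `min_proba` parameter are hypothetical, not taken from `topic_modeling_v4`.

```python
def top_topic(topic_probs, min_proba=0.2):
    """Pick the most probable topic; report -1 if its probability is below min_proba."""
    if not topic_probs:
        return -1, 0.0
    topic, proba = max(topic_probs, key=lambda pair: pair[1])
    return (topic, proba) if proba >= min_proba else (-1, proba)

print(top_topic([(0, 0.05), (3, 0.72), (7, 0.23)]))  # (3, 0.72)
print(top_topic([(i, 0.1) for i in range(10)]))      # (-1, 0.1)
```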


In [5]:
#value count of TOP level topics
df_first_level['first_level_topic'] = df_first_level['top_topic']
df_first_level['first_level_topic_proba'] = df_first_level['top_topic_proba']
df_first_level['first_level_topic'].value_counts().sort_index()

0    1027
1    3960
2    2721
3    7666
4    1173
5    4288
6    4312
7    3756
8    1449
9    3630
Name: first_level_topic, dtype: int64

In [6]:
df_first_level.columns

Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'doc2bow', 'infered_topics', 'top_topic',
       'top_topic_proba', 'first_level_topic', 'first_level_topic_proba'],
      dtype='object')

In [7]:
#df_first_level[df_first_level['first_level_topic'] == 0]

In [8]:
#df_first_level[df_first_level['first_level_topic'] == 0]['first_10_sents'].iloc[0]

In [9]:
#df_first_level[df_first_level['first_level_topic'] == 0]['all_key_words'].iloc[0]

In [10]:
df_first_level = df_first_level.drop(columns=['doc2bow',
       'infered_topics', 'top_topic', 'top_topic_proba'])

***
# Get SECOND level topics (LDA)

In [11]:
first_level_topics = sorted(set(df_first_level['first_level_topic']))
first_level_topics

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [12]:
start = time.time()
list_dfs = []
for topic in first_level_topics:
    print("\nSelected topic index:", topic)
    df_topic = df_first_level[df_first_level['first_level_topic'] == topic].copy()
    save_dict_path = "./output/lda_keywords/dictionary1_"+str(topic+1)+".pickle"
    save_LDA_model_path = "./output/lda_keywords/LDA_model1_" + str(topic + 1)
    
    df_data_tmp = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                       params={"TEXT_prepared_df": df_topic,
                                               "save_LDA_dictionary_path": save_dict_path,
                                               "words_column": "all_key_words"
                                               },
                                       verbose=1)

    df_2nd_tmp = tm.train_model(model_type="LDA-KeyWords",
                                params={"num_topics": 5,
                                        "LDA_prepared_df": df_data_tmp,
                                        "LDA_dictionary_path": save_dict_path,
                                        "save_LDA_model_path": save_LDA_model_path
                                        },
                                verbose=1)

    #value counts of SECOND level topics
    print("\nValue counts of SECOND level topics:")
    df_2nd_tmp['second_level_topic'] = df_2nd_tmp['top_topic']
    df_2nd_tmp['second_level_topic_proba'] = df_2nd_tmp['top_topic_proba']
    print(df_2nd_tmp['second_level_topic'].value_counts().sort_index())

    print("#"*50)
    df_2nd_tmp = df_2nd_tmp.drop(columns=['doc2bow',
                                           'infered_topics', 'top_topic', 'top_topic_proba'])
    list_dfs.append(df_2nd_tmp)
finish = time.time()


Selected topic index: 0
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1

Value counts of SECOND level topics:
0    167
1    195
2    235
3    143
4    287
Name: second_level_topic, dtype: int64
##################################################

Selected topic index: 1
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_2

Value counts of SECOND level topics:
0     508
1     683
2     958
3     800
4    1011
Name: second_level_topic, dtype: int64
##################################################

Selected topic index: 2
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_3

Value counts of SECOND level topics:
0    575
1    808
2    380
3    424
4    534
Name: second_level_topic, dtype: int64
##############

In [13]:
print("Time of getting second-level topics in minutes:", round((finish-start)/60,2))
df_second_level = pd.concat(list_dfs)
df_second_level.columns

Time of getting second-level topics in minutes: 7.29


Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'first_level_topic', 'first_level_topic_proba',
       'second_level_topic', 'second_level_topic_proba'],
      dtype='object')
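
The second-level loop above, and the third-level loop that follows, repeat one pattern: split the documents by their current topic, then model each subset again with fewer topics. A sketch of that pattern as a single recursive function is given here for orientation only. `build_hierarchy` and `toy_assign` are hypothetical names, and `toy_assign` merely stands in for training an LDA model on a subset and taking each document's top topic.

```python
def build_hierarchy(docs, topics_per_level, assign_topics, path=()):
    """Recursively split docs; return {doc: tuple of topic indexes, one per level}."""
    if not topics_per_level or not docs:
        return {doc: path for doc in docs}
    num_topics, *deeper = topics_per_level
    buckets = {}
    for doc, topic in zip(docs, assign_topics(docs, num_topics)):
        buckets.setdefault(topic, []).append(doc)
    labels = {}
    for topic in sorted(buckets):
        labels.update(
            build_hierarchy(buckets[topic], deeper, assign_topics, path + (topic,))
        )
    return labels

def toy_assign(docs, num_topics):
    # Hypothetical stand-in: a real run would train a model on `docs` here.
    return [position % num_topics for position, _ in enumerate(docs)]

# 10 first-level, 5 second-level and 3 third-level topics, as in this notebook.
labels = build_hierarchy(list(range(300)), [10, 5, 3], toy_assign)
print(len(set(labels.values())))  # 150 distinct (first, second, third) paths
```

With 10, 5 and 3 topics per level there are at most 10 x 5 x 3 = 150 leaf topics.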

***
# Get THIRD level topics

In [14]:
df_second_level[['first_level_topic',
       'first_level_topic_proba', 'second_level_topic',
       'second_level_topic_proba']].describe()

,first_level_topic,first_level_topic_proba,second_level_topic,second_level_topic_proba
count,33982.0,33982.0,33982.0,33982.0
mean,4.560002,0.721855,2.053558,0.864971
std,2.597325,0.196475,1.411574,0.172679
min,0.0,0.211725,0.0,0.278245
25%,3.0,0.556901,1.0,0.733245
50%,5.0,0.711267,2.0,0.980011
75%,7.0,0.918785,3.0,0.987249
max,9.0,0.991813,4.0,0.992658


In [19]:
start = time.time()
list_dfs = []

for topic_1st in first_level_topics:
    print("\nSelected FIRST level topic index:",topic_1st)
    df_1st_tmp = df_second_level[df_second_level['first_level_topic'] == topic_1st].copy()
    second_level_topics = sorted(set(df_1st_tmp['second_level_topic']))
    print("second_level_topics", second_level_topics)
    
    for topic_2nd in second_level_topics:
        print("\nSelected topics' indexes:", (topic_1st, topic_2nd))
        
        save_dict_path = "./output/lda_keywords/dictionary1_"+str(topic_1st+1)+"_"+str(topic_2nd+1)+".pickle"
        save_LDA_model_path = "./output/lda_keywords/LDA_model1_"+str(topic_1st+1)+"_"+str(topic_2nd+1)
        
        df_2nd_tmp = df_1st_tmp[df_1st_tmp['second_level_topic'] == topic_2nd].copy()
        
        df_data_tmp = tm.prepare_for_modeling(data_path="", model_type="LDA-KeyWords",
                                           params={"TEXT_prepared_df": df_2nd_tmp,
                                                   "save_LDA_dictionary_path": save_dict_path,
                                                   "words_column": "all_key_words"
                                                   },
                                           verbose=1)

        df_3d_tmp = tm.train_model(model_type="LDA-KeyWords",
                                    params={"num_topics": 3,
                                            "LDA_prepared_df": df_data_tmp,
                                            "LDA_dictionary_path": save_dict_path,
                                            "save_LDA_model_path": save_LDA_model_path,
                                            },
                                    verbose=1)

        df_3d_tmp['third_level_topic'] = df_3d_tmp['top_topic']
        df_3d_tmp['third_level_topic_proba'] = df_3d_tmp['top_topic_proba']
        #print(df_3d_tmp['second_level_topic'].value_counts().sort_index())

        df_3d_tmp = df_3d_tmp.drop(columns=['doc2bow',
                                               'infered_topics', 'top_topic', 'top_topic_proba'])
        list_dfs.append(df_3d_tmp)
        #value counts of THIRD level topics
        print("Value counts of THIRD level topics:")
        print(df_3d_tmp['third_level_topic'].value_counts().sort_index())
    print("#"*50)
finish = time.time()


Selected FIRST level topic index: 0
second_level_topics [0, 1, 2, 3, 4]

Selected topics' indexes: (0, 0)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1_1
Value counts of THIRD level topics:
0    55
1    50
2    62
Name: third_level_topic, dtype: int64

Selected topics' indexes: (0, 1)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1_2
Value counts of THIRD level topics:
0    73
1    68
2    54
Name: third_level_topic, dtype: int64

Selected topics' indexes: (0, 2)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_1_3
Value counts of THIRD level topics:
0    65
1    91
2    79
Name: third_level_topic, dtype: int64

Selected topics' indexes: (0, 3)
Training LDA with semantically similar clusteres ow w

Value counts of THIRD level topics:
0    468
1    260
2    413
Name: third_level_topic, dtype: int64

Selected topics' indexes: (5, 2)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_6_3
Value counts of THIRD level topics:
0    146
1    111
2    144
Name: third_level_topic, dtype: int64

Selected topics' indexes: (5, 3)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_6_4
Value counts of THIRD level topics:
0    240
1    271
2    524
Name: third_level_topic, dtype: int64

Selected topics' indexes: (5, 4)
Training LDA with semantically similar clusteres ow words (NOUN_PHRASEs and VERBs)
LDA model file is saved to: ./output/lda_keywords/LDA_model1_6_5
Value counts of THIRD level topics:
0    279
1    193
2    416
Name: third_level_topic, dtype: int64
##################################################

In [20]:
print("Time of getting third-level topics in minutes:", round((finish-start)/60,2))
df_third_level = pd.concat(list_dfs)
df_third_level.columns

Time of getting third-level topics in minutes: 10.32


Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words', 'first_level_topic', 'first_level_topic_proba',
       'second_level_topic', 'second_level_topic_proba', 'third_level_topic',
       'third_level_topic_proba'],
      dtype='object')
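
The dictionary and model file names in the loops above follow one convention: each 0-based topic index is shifted by one and appended with an underscore. A small helper that captures this convention could look as follows; `output_paths` is a hypothetical name and is not part of `topic_modeling_v4`.

```python
def output_paths(topic_path, base_dir="./output/lda_keywords"):
    """Build (dictionary path, model path) for a tuple of 0-based topic indexes."""
    suffix = "".join("_" + str(index + 1) for index in topic_path)
    return (
        f"{base_dir}/dictionary1{suffix}.pickle",
        f"{base_dir}/LDA_model1{suffix}",
    )

print(output_paths(()))      # first-level model
print(output_paths((0,)))    # second-level model under first-level topic 0
print(output_paths((5, 2)))  # third-level model under topics (5, 2)
```

For `(5, 2)` this yields `./output/lda_keywords/LDA_model1_6_3`, matching the path printed in the log above.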

***

# Evaluate 

In [23]:
df_result = df_third_level.copy()

# Name Topics (by a top-ranked TF-IDF noun phrase in each topic)

Gensim TF-IDF tutorial: https://www.tutorialspoint.com/gensim/gensim_creating_tf_idf_matrix.htm

In [21]:
import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
import numpy as np

In [24]:
df_topics = df_result.groupby(['first_level_topic', 'second_level_topic', 'third_level_topic'])['section'].count()
df_topics = df_topics.reset_index()
del df_topics['section']

print(df_topics.shape)
df_topics.head()

(150, 3)


,first_level_topic,second_level_topic,third_level_topic
0,0,0,0
1,0,0,1
2,0,0,2
3,0,1,0
4,0,1,1


In [25]:
#get first level topic names
s = df_result.groupby('first_level_topic')['noun_phrases'].sum()
word_lists_by_topic = list(s)

dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(w_list, allow_update=True) for w_list in word_lists_by_topic]
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')

topic_names_list = []
for doc in tfidf[BoW_corpus]:
    scored = [(dictionary[term_id], np.around(score, 4)) for term_id, score in doc]
    # take the term at index 3 of the descending ranking, i.e. the 4th-highest TF-IDF score
    w = sorted(scored, key=lambda tup: tup[1], reverse=True)[3][0]
    #print(w)
    topic_names_list.append(w)
    
df_first_l_topics = pd.DataFrame({"first_level_topic": list(range(len(topic_names_list))),
                                  "first_level_topic_name": topic_names_list
                                 })
df_topics = df_topics.merge(df_first_l_topics, on='first_level_topic', how='inner')
df_topics.head()

,first_level_topic,second_level_topic,third_level_topic,first_level_topic_name
0,0,0,0,hard seltzer
1,0,0,1,hard seltzer
2,0,0,2,hard seltzer
3,0,1,0,hard seltzer
4,0,1,1,hard seltzer
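
The naming step treats all noun phrases of one topic as a single document and ranks them by TF-IDF, so phrases that occur in every topic score low. The sketch below shows that idea in plain Python. It is an approximation rather than gensim's exact `smartirs='ntc'` weighting: the cosine normalization is left out because it does not change the ranking within a topic. `rank_terms_by_tfidf` and `pick_name` are hypothetical helpers, and the phrase groups are toy data.

```python
import math
from collections import Counter

def rank_terms_by_tfidf(groups):
    """For each group of terms, return its distinct terms sorted by TF-IDF, best first."""
    n_groups = len(groups)
    doc_freq = Counter(term for terms in groups for term in set(terms))
    ranked = []
    for terms in groups:
        tf = Counter(terms)
        scores = {t: n * math.log2(n_groups / doc_freq[t]) for t, n in tf.items()}
        # Sort by score (descending), then alphabetically so ties are deterministic.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t)))
    return ranked

def pick_name(ranked_terms, rank=0):
    """Return the term at the requested rank, falling back to the last available one."""
    if not ranked_terms:
        return None
    return ranked_terms[min(rank, len(ranked_terms) - 1)]

# Toy noun-phrase lists, one per topic.
groups = [
    ["hard seltzer", "hard seltzer", "white claw", "people"],
    ["central bank", "inflation", "inflation", "people"],
]
ranked = rank_terms_by_tfidf(groups)
print([pick_name(r) for r in ranked])  # ['hard seltzer', 'inflation']
```

The bounds check in `pick_name` matters when a fixed rank is requested: indexing a sorted list at position 3 raises `IndexError` for any topic with fewer than four distinct phrases.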


In [26]:
#get second level topic names
df_result['f&s_topics'] = df_result['first_level_topic'].apply(str) + "|" + \
                          df_result['second_level_topic'].apply(str)
s = df_result.groupby('f&s_topics')['noun_phrases'].sum()
word_lists_by_topic = list(s)
s = s.reset_index()

dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(w_list, allow_update=True) for w_list in word_lists_by_topic]
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')

topic_names_list = []
for doc in tfidf[BoW_corpus]:
    scored = [(dictionary[term_id], np.around(score, 4)) for term_id, score in doc]
    # take the term at index 3 of the descending ranking, i.e. the 4th-highest TF-IDF score
    w = sorted(scored, key=lambda tup: tup[1], reverse=True)[3][0]
    #print(w)
    topic_names_list.append(w)
    
df_second_l_topics = pd.DataFrame({"f&s_topics": s['f&s_topics'],
                                  "second_level_topic_name": topic_names_list
                                 })
print(df_second_l_topics.shape)
df_second_l_topics.head()

(50, 2)


,f&s_topics,second_level_topic_name
0,0|0,dl1850
1,0|1,arup
2,0|2,fanfiction
3,0|3,white claw
4,0|4,wouldbe journalist


In [27]:
df_topics['f&s_topics'] = df_topics['first_level_topic'].apply(str) + "|" + \
                          df_topics['second_level_topic'].apply(str)
df_topics = df_topics.merge(df_second_l_topics, on='f&s_topics', how='inner')
df_topics.head()

,first_level_topic,second_level_topic,third_level_topic,first_level_topic_name,f&s_topics,second_level_topic_name
0,0,0,0,hard seltzer,0|0,dl1850
1,0,0,1,hard seltzer,0|0,dl1850
2,0,0,2,hard seltzer,0|0,dl1850
3,0,1,0,hard seltzer,0|1,arup
4,0,1,1,hard seltzer,0|1,arup


In [28]:
#get third level topic names
df_result['f&s&th_topics'] = df_result['first_level_topic'].apply(str) + "|" + \
                          df_result['second_level_topic'].apply(str) + "|" + \
                          df_result['third_level_topic'].apply(str)
s = df_result.groupby('f&s&th_topics')['noun_phrases'].sum()
word_lists_by_topic = list(s)
s = s.reset_index()

dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(w_list, allow_update=True) for w_list in word_lists_by_topic]
tfidf = models.TfidfModel(BoW_corpus, smartirs='ntc')

topic_names_list = []
for doc in tfidf[BoW_corpus]:
    scored = [(dictionary[term_id], np.around(score, 4)) for term_id, score in doc]
    # take the term at index 3 of the descending ranking, i.e. the 4th-highest TF-IDF score
    w = sorted(scored, key=lambda tup: tup[1], reverse=True)[3][0]
    #print(w)
    topic_names_list.append(w)
    
df_third_l_topics = pd.DataFrame({"f&s&th_topics": s['f&s&th_topics'],
                                  "third_level_topic_name": topic_names_list
                                 })
print(df_third_l_topics.shape)
df_third_l_topics.head()

(150, 2)


,f&s&th_topics,third_level_topic_name
0,0|0|0,kleinman
1,0|0|1,national sandwich day
2,0|0|2,momotombo
3,0|1|0,bub
4,0|1|1,xia


In [29]:
df_topics['f&s&th_topics'] = df_topics['first_level_topic'].apply(str) + "|" + \
                          df_topics['second_level_topic'].apply(str) + "|" + \
                          df_topics['third_level_topic'].apply(str)
df_topics = df_topics.merge(df_third_l_topics, on='f&s&th_topics', how='inner')
df_topics.head()

,first_level_topic,second_level_topic,third_level_topic,first_level_topic_name,f&s_topics,second_level_topic_name,f&s&th_topics,third_level_topic_name
0,0,0,0,hard seltzer,0|0,dl1850,0|0|0,kleinman
1,0,0,1,hard seltzer,0|0,dl1850,0|0|1,national sandwich day
2,0,0,2,hard seltzer,0|0,dl1850,0|0|2,momotombo
3,0,1,0,hard seltzer,0|1,arup,0|1|0,bub
4,0,1,1,hard seltzer,0|1,arup,0|1|1,xia


In [30]:
columns = [
    'first_level_topic','first_level_topic_name',
    'second_level_topic','second_level_topic_name',
    'third_level_topic', 'third_level_topic_name'
   ]
df_lda_topics = df_topics[columns].drop_duplicates()
print(df_lda_topics.shape, df_topics.shape)
df_lda_topics['second_level_topic'] = df_lda_topics['first_level_topic'].apply(str) + '.' + df_lda_topics['second_level_topic'].apply(str)
df_lda_topics['third_level_topic'] = df_lda_topics['second_level_topic'].apply(str) + '.' + df_lda_topics['third_level_topic'].apply(str)
df_lda_topics.head().T

(150, 6) (150, 8)


,0,1,2,3,4
first_level_topic,0,0,0,0,0
first_level_topic_name,hard seltzer,hard seltzer,hard seltzer,hard seltzer,hard seltzer
second_level_topic,0.0,0.0,0.0,0.1,0.1
second_level_topic_name,dl1850,dl1850,dl1850,arup,arup
third_level_topic,0.0.0,0.0.1,0.0.2,0.1.0,0.1.1
third_level_topic_name,kleinman,national sandwich day,momotombo,bub,xia


In [31]:
with open('./output/lda_keywords/topics.pickle', 'wb') as f:
    # Pickle the topics DataFrame using the highest protocol available.
    pickle.dump(df_lda_topics, f, pickle.HIGHEST_PROTOCOL)

***

In [32]:
df = df_result.copy()

df = df.merge(df_first_l_topics, on='first_level_topic', how='inner')
df = df.merge(df_second_l_topics, on='f&s_topics', how='inner')
df = df.merge(df_third_l_topics, on='f&s&th_topics', how='inner')

df[['publication', 
    'section',
    'first_level_topic_name',
    'second_level_topic_name',
    'third_level_topic_name'
   ]].iloc[::1000].head(10).T

,0,1000,2000,3000,4000,5000,6000,7000,8000,9000
publication,Economist,Wired,Economist,CNN,Economist,CNN,Wired,Wired,CNN,CNN
section,business,science,finance-and-economics,business,finance-and-economics,movies,culture,culture,health,health
first_level_topic_name,hard seltzer,hard seltzer,daily dispatch,daily dispatch,daily dispatch,huawei equipment,huawei equipment,huawei equipment,prevention,prevention
second_level_topic_name,dl1850,wouldbe journalist,inflation,financial crisis,tariff,star wars,huawei equipment,anias mcdonald,centers,centers
third_level_topic_name,momotombo,salvation army,alibaba,central bank,spector,episode ix,mariee,nicole,air pollution,cardiovascular disease


In [33]:
df[['publication', 
    'section',
    'first_level_topic_name',
    'second_level_topic_name',
    'third_level_topic_name'
   ]].iloc[::5000].head(10).T

,0,5000,10000,15000,20000,25000,30000
publication,Economist,CNN,CNN,CNN,CNN,Wired,Wired
section,business,movies,health,health,business,science,gear
first_level_topic_name,hard seltzer,huawei equipment,prevention,prevention,boeing ba,political ad,people
second_level_topic_name,dl1850,star wars,disease control,zika,rocket,hacker,mr pawlowski
third_level_topic_name,momotombo,episode ix,vaccine,drug,small satellite,account,myhrvold
