# Idea:
Our solution: LDA + keywords from clusters of Transformer-based embeddings of noun phrases and verbs:
- Each noun phrase and verb in the texts is transformed into an embedding vector using the Universal Sentence Encoder (a Transformer-based sentence encoder)
- The embedding vectors from the previous step are grouped into clusters with cosine similarity >= 70%
- The words/phrases whose embedding vectors are closest to the centers of the resulting clusters become the key words/phrases
- Each text in the training sample is converted to a collection of key phrases by replacing its noun phrases and verbs with the corresponding key words/phrases and deleting all other words
- LDA is performed on the transformed texts
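
The clustering and key-word selection steps above can be sketched on toy vectors. This is only an illustration under stated assumptions: the notebook does not show which clustering algorithm was used, so the greedy threshold grouping below, the helper names (`unit_rows`, `greedy_clusters`, `cluster_keywords`) and the 3-dimensional vectors are all hypothetical.

```python
import numpy as np

def unit_rows(vectors):
    """L2-normalise rows; all-zero rows stay zero instead of dividing by zero."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)

def greedy_clusters(vectors, threshold=0.7):
    """Assumed clustering rule: join the most similar existing cluster seed
    if cosine similarity >= threshold, otherwise start a new cluster."""
    unit = unit_rows(vectors)
    seeds, labels = [], []
    for v in unit:
        sims = [float(v @ s) for s in seeds]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))
        else:
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return np.array(labels)

def cluster_keywords(words, vectors, labels):
    """For each cluster, pick the word whose vector is closest to the centroid."""
    unit = unit_rows(vectors)
    keywords = {}
    for cl in np.unique(labels):
        idx = np.flatnonzero(labels == cl)
        centroid = unit_rows(unit[idx].mean(axis=0)[None, :])[0]
        best = idx[np.argmax(unit[idx] @ centroid)]
        keywords[int(cl)] = words[best]
    return keywords

# toy data (hypothetical 3-d vectors, not real embeddings)
words = ["rise", "climb", "ascent", "seminar", "lecture"]
vectors = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.3, 0.0],
    [0.8, 0.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])
labels = greedy_clusters(vectors, threshold=0.7)
keywords = cluster_keywords(words, vectors, labels)
print(labels, keywords)
```

A greedy pass depends on the order of the input; a hierarchical or density-based method would avoid that, at a higher cost on several hundred thousand phrases.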


**Reference:**<br>
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil. **Universal Sentence Encoder.** *arXiv:1803.11175, 2018.*

# Load data and Python libraries

In [1]:
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

# topic modeling libraries
from gensim import models, corpora
from gensim.models.coherencemodel import CoherenceModel
import re
import spacy
nlp = spacy.load("en_core_web_md")


# supporting libraries
import pandas as pd
import numpy as np
import time
import pickle

import topic_modeling as tm

In [2]:
# load data
with open("./transition_files/df_train_for_LDA.pickle", 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    df_train = pickle.load(f)

print("df_train.shape:", df_train.shape)
print("df_train.columns:",df_train.columns)

df_train.shape: (33982, 18)
df_train.columns: Index(['date', 'author', 'title', 'url', 'section', 'publication',
       'first_10_sents', 'list_of_first_10_sents', 'list_of_verb_lemmas',
       'noun_phrases', 'list_of_nouns', 'list_of_lemmas', 'ID',
       'group_level_1', 'group_level_2', 'group_level_3', 'all_words',
       'all_key_words'],
      dtype='object')


In [3]:
#inputs for testing
ind = 10

text = df_train['first_10_sents'].iloc[ind]
noun_phrases = df_train['noun_phrases'].iloc[ind]
list_of_verb_lemmas = df_train['list_of_verb_lemmas'].iloc[ind]
all_key_words = df_train['all_key_words'].iloc[ind]

text

'THE American Economic Associations annual conference, held each January, is ostensibly a gigantic teachin, with lots of seminars featuring famous economists. But the threeday event, held this year in San Francisco with 13,000 attending, is also a big jobs fair. More than 500 employersboth universities and companieswere tied up in hotel rooms holding marathon interview sessions with freshly minted PhDs. The ballroom of the Marriott was set aside for a hundred more. It is a gruelling three days for candidates: one exhausted PhD likened it to speeddating. It is also arduous for recruiters. Towards the end of the first day Alan Green and Christopher de Bodisco of Stetson University, a small private college in Florida, review the candidates they have seen so far. They are looking for someone with an interest in health and development. They plan to grill a dozen candidates each day before inviting the most promising ones to visit its campus and meet the rest of the faculty. Upgrade your inb

In [4]:
params = {
    "topics_df_path": './output/lda_keywords/topics.pickle',
    "word_embeddings": './output/lda_keywords/word_embeddings.pickle',
    "first_dictionary_path": "./output/lda_keywords/dictionary1.pickle",
    "first_LDA_model_path": "./output/lda_keywords/LDA_model1",
}

***

In [6]:
# get clustered words' embeddings and cluster names (key words) from the train corpus
with open(params["word_embeddings"], 'rb') as f:
    df_emb = pickle.load(f)

columns = ["emb_" + str(i) for i in range(300)]
df_emb = df_emb.drop(columns=columns)
df_emb.head()

Unnamed: 0,word,cl_number,cluster_label,cl_size,ID,emb_vector
0,rise,1486,rise,20,0,"[-0.46326, 0.49222, 0.15795, 0.10404, 0.23174,..."
1,ascent,1486,rise,20,1,"[0.0056534, -0.17881, -0.54648, 0.030204, 0.10..."
2,climb,1486,rise,20,2,"[0.0056534, -0.17881, -0.54648, 0.030204, 0.10..."
3,defenceunless,1486,rise,20,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,trudge,1486,rise,20,4,"[0.69606, 0.29966, -0.14856, -0.14035, -0.0646..."


In [7]:
# keep single-word entries only (rows with a missing word are excluded as well)
df_tmp = df_emb[~df_emb['word'].str.contains(' ', regex=False, na=True)]
df_tmp.shape, df_emb.shape

((76718, 6), (419327, 6))

In [8]:
def get_word_embeddings(df_data, column="word", N_batches=500):
    """Add spaCy vectors for df_data[column] as columns emb_0 ... emb_299.

    The data is processed in slices so that progress can be printed.
    """
    # rows per slice (floor division); at least 1 so the loop below always advances
    part = max(1, len(df_data) // N_batches)
    # note: the actual slice size is `part`; a remainder ends up in one extra, shorter batch
    print(N_batches, "batches with", part + 1, column + "s each")

    columns = ["emb_" + str(i) for i in range(300)]
    index = 0
    batch_num = 0
    list_dfs = []

    while index < len(df_data):
        df_tmp = df_data.iloc[index : index + part].copy()
        df_tmp = df_tmp.reset_index(drop=True)
        print("Batch number:", batch_num + 1, "out of ", N_batches, "index:", index)

        # en_core_web_md returns a 300-dimensional vector per word/phrase
        df_tmp['emb_vector'] = df_tmp[column].apply(lambda w: nlp(w).vector)

        # spread each vector over 300 numeric columns
        df_tmp[columns] = np.array(list(df_tmp['emb_vector']))

        list_dfs.append(df_tmp)
        batch_num = batch_num + 1
        index = index + part

    # concatenate the batches into a single dataset
    df_emb = pd.concat(list_dfs)

    return df_emb
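
The batching above uses a floor-sized slice, so a row count that is not a multiple of `N_batches` yields one extra, shorter batch (419,327 rows give 501 slices of 838 rows). A minimal sketch of an alternative that rounds the slice size up, so the number of slices never exceeds the requested count; `batch_bounds` is a hypothetical helper, not part of `topic_modeling`.

```python
import math

def batch_bounds(n_rows, n_batches):
    """Return (start, stop) pairs covering range(n_rows) in at most n_batches slices."""
    if n_rows <= 0:
        return []
    # round up so that the remainder is absorbed instead of forming an extra batch
    size = max(1, math.ceil(n_rows / n_batches))
    return [(start, min(start + size, n_rows)) for start in range(0, n_rows, size)]

bounds = batch_bounds(419327, 500)
print(len(bounds), bounds[0], bounds[-1])
```

Each pair can be passed straight to `df_data.iloc[start:stop]`.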


In [9]:
%%time
df_tmp = get_word_embeddings(df_emb, column = "word", N_batches=500)   
df_tmp.head()

500 batches with 839 words each
Batch number: 1 out of  500 index: 0
Batch number: 2 out of  500 index: 838
Batch number: 3 out of  500 index: 1676
Batch number: 4 out of  500 index: 2514
Batch number: 5 out of  500 index: 3352
...

Unnamed: 0,word,cl_number,cluster_label,cl_size,ID,emb_vector,emb_0,emb_1,emb_2,emb_3,...,emb_290,emb_291,emb_292,emb_293,emb_294,emb_295,emb_296,emb_297,emb_298,emb_299
0,rise,1486,rise,20,0,"[-0.46326, 0.49222, 0.15795, 0.10404, 0.23174,...",-0.46326,0.49222,0.15795,0.10404,...,-0.53974,0.53004,0.37692,-0.10214,-0.17083,0.34781,-0.33974,-0.13493,0.46442,-0.001151
1,ascent,1486,rise,20,1,"[0.0056534, -0.17881, -0.54648, 0.030204, 0.10...",0.005653,-0.17881,-0.54648,0.030204,...,-0.2788,0.2507,0.04953,-0.092657,-0.28176,0.061938,0.37623,0.63426,0.51031,-0.27895
2,climb,1486,rise,20,2,"[0.0056534, -0.17881, -0.54648, 0.030204, 0.10...",0.005653,-0.17881,-0.54648,0.030204,...,-0.2788,0.2507,0.04953,-0.092657,-0.28176,0.061938,0.37623,0.63426,0.51031,-0.27895
3,defenceunless,1486,rise,20,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,trudge,1486,rise,20,4,"[0.69606, 0.29966, -0.14856, -0.14035, -0.0646...",0.69606,0.29966,-0.14856,-0.14035,...,0.013825,-0.51598,-0.020915,-0.49117,-0.23036,0.027205,0.79485,0.343,-0.13924,-0.23676


In [10]:
df_tmp.shape

(419327, 306)

In [14]:
del df_tmp['emb_vector']

with open('./output/lda_keywords/word_embeddings.pickle', 'wb') as f:
    # Pickle the DataFrame using the highest protocol available.
    pickle.dump(df_tmp, f, pickle.HIGHEST_PROTOCOL)

***

In [14]:
# get clustered words' embeddings and cluster names (key words) from the train corpus
with open(params["word_embeddings"], 'rb') as f:
    df_emb = pickle.load(f)
    

print(df_emb.shape)
df_emb.head()

(419327, 305)


Unnamed: 0,word,cl_number,cluster_label,cl_size,ID,emb_0,emb_1,emb_2,emb_3,emb_4,...,emb_290,emb_291,emb_292,emb_293,emb_294,emb_295,emb_296,emb_297,emb_298,emb_299
0,rise,1486,rise,20,0,-0.46326,0.49222,0.15795,0.10404,0.23174,...,-0.53974,0.53004,0.37692,-0.10214,-0.17083,0.34781,-0.33974,-0.13493,0.46442,-0.001151
1,ascent,1486,rise,20,1,0.005653,-0.17881,-0.54648,0.030204,0.102,...,-0.2788,0.2507,0.04953,-0.092657,-0.28176,0.061938,0.37623,0.63426,0.51031,-0.27895
2,climb,1486,rise,20,2,0.005653,-0.17881,-0.54648,0.030204,0.102,...,-0.2788,0.2507,0.04953,-0.092657,-0.28176,0.061938,0.37623,0.63426,0.51031,-0.27895
3,defenceunless,1486,rise,20,3,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,trudge,1486,rise,20,4,0.69606,0.29966,-0.14856,-0.14035,-0.064606,...,0.013825,-0.51598,-0.020915,-0.49117,-0.23036,0.027205,0.79485,0.343,-0.13924,-0.23676


In [15]:
# extract noun phrases and verbs from the text
NPs_and_Vs = tm.get_NPs_Vs(text)
df_text_words = pd.DataFrame(NPs_and_Vs, columns=['text_words'])
df_text_words.head()

Unnamed: 0,text_words
0,american economic associations annual conference
1,gigantic teachin
2,lot
3,seminar
4,famous economist


In [16]:
df_text_emb = tm.get_word_embeddings(
    df_text_words, column="text_words")
df_text_emb.head()

Unnamed: 0,text_words,emb_0,emb_1,emb_2,emb_3,emb_4,emb_5,emb_6,emb_7,emb_8,...,emb_290,emb_291,emb_292,emb_293,emb_294,emb_295,emb_296,emb_297,emb_298,emb_299
0,american economic associations annual conference,-0.265382,0.235481,0.424552,0.02125,0.136934,-0.132931,0.235726,0.215936,0.092619,...,-0.155192,0.30978,0.03751,-0.177867,-0.002442,-0.136568,-0.004954,-0.063044,-0.122973,0.135374
1,gigantic teachin,-0.380575,-0.493945,-0.122119,-0.133775,0.207755,0.268635,0.110875,-0.19383,-0.010935,...,0.12579,-0.227065,-0.09847,-0.107917,-0.068435,0.47436,-0.011745,0.262315,0.07799,-0.025215
2,lot,-0.33355,0.49611,-0.27858,-0.20506,0.060868,0.36219,0.055708,-0.26858,0.19418,...,-0.60215,-0.135,0.12724,0.17424,0.23042,0.40913,-0.2178,-0.48043,0.063816,0.2062
3,seminar,0.050256,0.14734,0.31434,0.31144,0.10581,0.018469,0.32183,-0.3401,-0.098713,...,0.39175,-0.031723,0.62829,0.23654,0.054023,0.027702,0.41036,0.53709,-0.83991,0.30344
4,famous economist,0.043425,0.202752,-0.120916,-0.03733,0.192971,0.266181,0.25769,-0.133566,-0.178595,...,0.070634,0.291185,-0.121925,0.3142,0.206443,-0.18947,0.2735,0.078929,0.206745,0.107417


In [17]:
# find closest word in train corpus and get cluster name
from sklearn.metrics.pairwise import cosine_similarity

columns = ["emb_" + str(i) for i in range(300)]
sim_values = cosine_similarity(df_text_emb[columns], df_emb[columns])
max_sim_values = np.max(sim_values, axis=1)
df_text_words['take_cluster_name'] = max_sim_values >= 0.7
df_text_words['sim_max_index'] = np.argmax(sim_values, axis=1)
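
The matching step above can be illustrated with NumPy alone on toy vectors. This is a sketch, not the notebook's code: `cosine_matrix` and `match_to_corpus` are hypothetical names, and the vectors are made up. It also shows why all-zero vectors (such as the one for `defenceunless`) need a guard: their norm is zero, so their similarity is treated as 0 and they never pass the threshold.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # zero rows are left as zeros, which gives similarity 0 instead of NaN
    a_unit = a / np.where(a_norm == 0, 1.0, a_norm)
    b_unit = b / np.where(b_norm == 0, 1.0, b_norm)
    return a_unit @ b_unit.T

def match_to_corpus(query_vecs, corpus_vecs, threshold=0.7):
    """Index of the most similar corpus row, and whether it passes the threshold."""
    sims = cosine_matrix(query_vecs, corpus_vecs)
    return sims.argmax(axis=1), sims.max(axis=1) >= threshold

# toy data: the third corpus row and the third query are all-zero vectors
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
queries = np.array([[0.1, 0.9, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
best_index, take_cluster_name = match_to_corpus(queries, corpus)
print(best_index, take_cluster_name)
```

With roughly 419,000 corpus rows the full similarity matrix is large, so processing the query rows in chunks keeps memory bounded.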

In [18]:
df_text_words['keyword'] = df_text_words.apply(
        tm.get_keyword, axis=1, args=[df_emb])

words_for_LDA = list(df_text_words['keyword'])
words_for_LDA = [w.replace(" ", "_") for w in words_for_LDA if len(w) > 0]

text = " ".join(words_for_LDA)
text

'american_public_health_association gigantic_teachin lot seminar economist twoday_event san_francisco small_business keio_university room marathon_interview_session freshly_minted_phds ballroom marriott tumultuous_week candidate grad_student enlist end day alan_ferguson christopher_bonanos stetson_university small_school florida candidate interest health development dozen_candidate promising_one cdcs_roybal_campus rest mit inbox daily_dispatch editors held involving held going tied schenkerwinkler_holding_swh minting set comparison consider seen looking plan grill invited visit meet upgrade'
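
The replacement step can be sketched as follows. `tm.get_keyword` is not shown in this notebook, so the rule below is an assumption: use the cluster label of the closest corpus word when the similarity threshold is met, otherwise keep the original phrase. The function name `to_lda_text` and the sample inputs are hypothetical.

```python
def to_lda_text(phrases, matched_labels, take_label):
    """Build the space-separated token string that is passed to LDA.

    phrases        -- noun phrases / verbs extracted from the text
    matched_labels -- cluster label of the most similar corpus word for each phrase
    take_label     -- True where the similarity threshold was met
    """
    tokens = []
    for phrase, label, take in zip(phrases, matched_labels, take_label):
        keyword = (label if take else phrase).strip()
        if keyword:
            # multi-word keywords become a single LDA token
            tokens.append(keyword.replace(" ", "_"))
    return " ".join(tokens)

example = to_lda_text(
    ["famous economist", "gigantic teachin", ""],
    ["economist", "rise", ""],
    [True, False, False],
)
print(example)
```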

In [19]:
# load pre-trained topics
LDA_topics_df_path = params["topics_df_path"]
with open(LDA_topics_df_path, 'rb') as f:
    df_topics = pickle.load(f)
df_topics.head(1)

Unnamed: 0,first_level_topic,first_level_topic_name,second_level_topic,second_level_topic_name,third_level_topic,third_level_topic_name
0,0,hard seltzer,0.0,dl1850,0.0.0,kleinman


In [20]:
# first level
first_LDA_dict_path = params["first_dictionary_path"]
first_LDA_model_path = params["first_LDA_model_path"]
t1, t1_proba = tm.get_top_topic_index(text,
                                   params={"LDA_dictionary_path": first_LDA_dict_path,
                                           "LDA_model_path": first_LDA_model_path
                                           }
                                   )
t1, t1_proba

(5, 0.5373726)
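
`tm.get_top_topic_index` is not shown here. Assuming it works on a list of `(topic_id, probability)` pairs, which is the shape gensim's `LdaModel.get_document_topics` returns, the selection step might look like the sketch below. The `-1` fallback mirrors the `t1 == -1` check used later for "misc."; the `min_proba` parameter is hypothetical.

```python
def top_topic(topic_probs, min_proba=0.0):
    """Return (topic_id, probability) of the most probable topic.

    Returns topic_id -1 when there are no topics or the best probability
    is below min_proba (assumed convention for the "misc." bucket).
    """
    if not topic_probs:
        return -1, 0.0
    topic_id, proba = max(topic_probs, key=lambda pair: pair[1])
    if proba < min_proba:
        return -1, proba
    return topic_id, proba

best = top_topic([(1, 0.2), (5, 0.54), (7, 0.26)])
print(best)
```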

In [21]:
# second level
second_LDA_dict_path = first_LDA_dict_path[:-7] + "_" + str(t1 + 1) + ".pickle"
second_LDA_model_path = first_LDA_model_path + "_" + str(t1 + 1)
t2, t2_proba = tm.get_top_topic_index(text,
                                   params={"LDA_dictionary_path": second_LDA_dict_path,
                                           "LDA_model_path": second_LDA_model_path
                                           }
                                   )
t2, t2_proba

(4, 0.9047526)

In [22]:
# third level
third_LDA_dict_path = first_LDA_dict_path[:-7] + \
    "_" + str(t1 + 1) + "_" + str(t2 + 1) + ".pickle"
third_LDA_model_path = first_LDA_model_path + \
    "_" + str(t1 + 1) + "_" + str(t2 + 1)
t3, t3_proba = tm.get_top_topic_index(text,
                                   params={"LDA_dictionary_path": third_LDA_dict_path,
                                           "LDA_model_path": third_LDA_model_path
                                           }
                                   )
t3, t3_proba

(1, 0.9850626)
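
The second- and third-level paths follow one naming rule: append `_<topic index + 1>` for every parent topic. A small helper makes that rule explicit; `level_paths` is a hypothetical name and not a function of `topic_modeling`.

```python
def level_paths(first_dict_path, first_model_path, parent_topics):
    """Build dictionary and model paths for a nested LDA level.

    parent_topics -- 0-based topic indices chosen on the levels above,
                     e.g. [t1] for the second level, [t1, t2] for the third
    """
    extension = ".pickle"
    if not first_dict_path.endswith(extension):
        raise ValueError("dictionary path is expected to end with .pickle")
    # file names use 1-based topic numbers
    suffix = "".join("_" + str(t + 1) for t in parent_topics)
    stem = first_dict_path[: -len(extension)]
    return stem + suffix + extension, first_model_path + suffix

dict_path, model_path = level_paths(
    "./output/lda_keywords/dictionary1.pickle",
    "./output/lda_keywords/LDA_model1",
    [5, 4],
)
print(dict_path, model_path)
```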

In [23]:
# get topic names
if t1 == -1:
    t1_name = "misc."
else:
    t1_name = df_topics[df_topics['first_level_topic']
                        == t1]['first_level_topic_name'].iloc[0]
t1_name

'boeing ba'

In [24]:
if t2 == -1:
    t2_name = "misc."
else:
    t2_name = df_topics[df_topics['second_level_topic'] == str(t1) +
                        '.' + str(t2)]['second_level_topic_name'].iloc[0]
t2_name

'rocket'
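
The name lookups for all three levels can be folded into one function. The sketch below works on plain dictionaries such as those produced by `df_topics.to_dict("records")`. Two things are assumptions: the third-level key format `t1.t2.t3` (inferred from the `0.0.0` value in `df_topics`) and returning "misc." when no row matches. `topic_name` is a hypothetical helper.

```python
def topic_name(rows, level, indices):
    """Look up a topic name.

    rows    -- list of dicts with the df_topics columns
    level   -- "first", "second" or "third"
    indices -- topic indices down to this level, e.g. (t1, t2) for "second"
    """
    if indices[-1] == -1:
        return "misc."
    key_col = level + "_level_topic"
    name_col = level + "_level_topic_name"
    # the first level is keyed by an integer, deeper levels by a dotted string
    key = indices[0] if level == "first" else ".".join(str(i) for i in indices)
    for row in rows:
        if row[key_col] == key:
            return row[name_col]
    return "misc."  # assumed fallback when the topic is not in the table

# single row taken from the df_topics.head(1) output above
rows = [{
    "first_level_topic": 0, "first_level_topic_name": "hard seltzer",
    "second_level_topic": "0.0", "second_level_topic_name": "dl1850",
    "third_level_topic": "0.0.0", "third_level_topic_name": "kleinman",
}]
print(topic_name(rows, "third", (0, 0, 0)))
```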

***

In [5]:
tm.predict_topics(text,
                  params={"topics_df_path": './output/lda_keywords/topics.pickle',
                          "word_embeddings": './output/lda_keywords/word_embeddings.pickle',
                          "first_dictionary_path": "./output/lda_keywords/dictionary1.pickle" ,
                          "first_LDA_model_path": "./output/lda_keywords/LDA_model1"
                         }
              )  

{'first_level_topic': 1,
 'first_level_topic_name': 'daily dispatch',
 'first_level_topic_proba': 0.51461256,
 'second_level_topic': 1,
 'second_level_topic_name': 'inflation',
 'second_level_topic_proba': 0.36204377,
 'third_level_topic': 1,
 'third_level_topic_name': 'alibaba',
 'third_level_topic_proba': 0.6143617}