# Topic Modeling and Visualization
We use [Latent Dirichlet Allocation (LDA)](https://dl.acm.org/doi/pdf/10.1145/2133806.2133826) to reveal thematic groups in the library catalog, based on the "description" field of each record.
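
For orientation, the generative model that LDA assumes can be sketched in commonly used notation. The symbols below ($K$ topics, hyperparameters $\alpha$ and $\eta$) are introduced only for this illustration and do not appear in the code; here each document $d$ is one description and $K$ plays the role of `num_topics`.

$$
\begin{aligned}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & & && \text{(topic proportions of document } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & & && \text{(topic of the } n\text{-th word in } d\text{)} \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}}) & & && \text{(the observed word)}
\end{aligned}
$$

Fitting the model means inferring $\beta$ and $\theta$ from the observed words $w$ alone.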

We can then visualize the topics using [pyLDAvis](https://pypi.org/project/pyLDAvis/).

In [11]:
import nltk
import gensim
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')

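pyLDAvis.enable_notebook()

# Fetch the NLTK resources used below if they are not already installed.
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt'.
for resource in ('stopwords', 'punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)
stop_words = set(stopwords.words('english'))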

In [12]:
catalog_df = pd.read_excel('data/workshop_data.xlsx')
catalog_df.sample(3)

| Unnamed: 0 | title | authors | year | OCLC | publisher | description | topic | sub_topic | sub_sub_topic | std_call_number | shelf | times_lent_since_2022 | cover_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5285 | The behavior of sandwich structures of isotrop... | Jack R Vinson 1929- | 1999 | 41220387 | CRC Press | The Behavior of Sandwich Structures of Isotrop... | physics | applied mechanics | mechanics of civil structures general | EFF199 | 219 | | 219/cover_41220387.webp |
| 17168 | Modern construction handbook. | Andrew Watts | 2010 | 840444907 | Springer | The Modern Construction Handbook has become a ... | building technology | building technology general | building technology general | UBA210 | 730 | | 730/cover_840444907.webp |
| 7013 | Liquides aux interfaces = Liquids at interfaces | Ecole d'été de physique théorique (Les Houc... | 1990 | 21762161 | North Holland | This school was concerned with surface propert... | physics | physics of fluids | physics of liquids | FBB190 | 291 | | |


In [13]:
replacement_values = {"title": "",
                      "authors": "",
                      "OCLC": 0,
                      "publisher": "",
                      "description": "",
                      "topic": "",
                      "sub_topic": "",
                      "sub_sub_topic": "",
                      "std_call_number": "",
                      "shelf": 0,
                      "times_lent_since_2022": 0,
                      "cover_file": "",
                     }

catalog_df = catalog_df.fillna(value=replacement_values)

In [14]:
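def remove_puncts(description_text, alphanumeric_only=True):
    # Treat hyphens as word separators, e.g. "steel-framed" -> "steel framed".
    description_text = description_text.replace('-', ' ')
    if alphanumeric_only:
        # Keep only letters, digits, and spaces.
        description_text = ''.join(e for e in description_text if e.isalnum() or e == ' ')
    # Lowercase and collapse runs of whitespace into single spaces.
    clean_description_text = ' '.join(description_text.lower().split())
    return clean_description_text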
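
To make the cleaning step concrete, here is a minimal standalone sketch that applies the same three operations (hyphens to spaces, drop non-alphanumeric characters, lowercase and collapse whitespace) to a made-up string. The name `clean_text` and the sample string are illustrative only and are not part of the notebook.

```python
def clean_text(text):
    """Standalone copy of the cleaning steps, for illustration only."""
    # Hyphens become spaces so hyphenated words split into their parts.
    text = text.replace('-', ' ')
    # Drop every character that is neither alphanumeric nor a space.
    text = ''.join(ch for ch in text if ch.isalnum() or ch == ' ')
    # Lowercase and collapse repeated spaces.
    return ' '.join(text.lower().split())


# Made-up sample string, not taken from the catalog.
sample = "State-of-the-art sandwich structures: theory & practice (2nd ed.)"
cleaned = clean_text(sample)
print(cleaned)
```

Note that only the literal space character is kept, so other whitespace such as a newline is removed without leaving a gap between the neighbouring words.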


In [15]:
def get_words_tokenized_nopunct_nostop(descriptions, stop_w=stop_words):
    """Clean, tokenize, and stop-word-filter each description."""
    description_words_list = []
    for description in descriptions:
        # remove_puncts already lowercases, so no second .lower() is needed.
        clean_description = remove_puncts(description)
        words = word_tokenize(clean_description)
        words_nostop = [word for word in words if word not in stop_w]
        description_words_list.append(words_nostop)
    return description_words_list
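
The effect of stop-word filtering can be sketched without NLTK. In this simplified illustration, `str.split()` stands in for `word_tokenize` and `toy_stop_words` is a deliberately tiny, hypothetical list; the notebook itself uses NLTK's tokenizer and its English stop-word list, which may split or filter some words differently.

```python
# Hypothetical, very small stop-word list for illustration only.
toy_stop_words = {"an", "to", "the", "of"}


def tokenize_nostop(clean_text, stop_w):
    """Split already-cleaned text on whitespace and drop stop words."""
    return [word for word in clean_text.split() if word not in stop_w]


# Made-up, already-cleaned sample description.
sample = "an introduction to the mechanics of sandwich structures"
tokens = tokenize_nostop(sample, toy_stop_words)
print(tokens)
```

Only the content-bearing words remain, which is what the topic model should see.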

In [16]:
descriptions = catalog_df['description'].to_list()
tokenized_descriptions_list = get_words_tokenized_nopunct_nostop(descriptions)

In [17]:
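# Map each distinct token to an integer id.
dictionary = gensim.corpora.Dictionary(tokenized_descriptions_list)
# Represent each description as a bag of words: a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(description) for description in tokenized_descriptions_list]
topic_count = 25
# Train LDA; passes is the number of full sweeps over the corpus.
# Results vary between runs unless random_state is set.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=topic_count, id2word=dictionary, passes=50)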
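
The bag-of-words representation passed to the model can be illustrated in plain Python. This is a sketch of the data structure only, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, the toy documents are made up, and gensim may assign token ids in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Made-up tokenized documents.
docs = [
    ["sandwich", "structures", "mechanics"],
    ["fluid", "mechanics", "fluid", "interfaces"],
]
token2id = build_vocabulary(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

Word order is discarded and only counts per token survive, which is the only information LDA uses from each description.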

In [18]:
pyLDAvis.enable_notebook()
gensimvis.prepare(ldamodel, corpus, dictionary)