<a href="https://colab.research.google.com/github/tiffchu/402-test/blob/main/402_lda_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Topic Modelling using Latent Dirichlet Allocation (LDA) on ICU Patient Transcripts</center>

<center>Tiffany Chu</center>


<center>COGS402</center>

---



## Photovoice Transcript

explain content of dataset/ transcript
qualitative to quantitative

Topic modeling can identify themes in a set of documents by using unsupervised learning to automatically group words, without a predefined list of labels.

## What is LDA?!?!

Latent Dirichlet Allocation (LDA) is a statistical model used in natural language processing (NLP) and text mining for topic modeling, which aims to discover abstract topics within a collection of documents.

- LDA is an unsupervised learning technique. It doesn't require labeled data; instead, it infers patterns and structures from the data itself.

- LDA is based on a statistical model that probabilistically assigns words to topics and topics to documents. It involves estimating probability distributions, which is a fundamental aspect of many ML algorithms.

- LDA learns from the input corpus (text), identifying hidden topics by analyzing the co-occurrence patterns of words across documents.

- Training estimates parameters, such as the topic-word and document-topic distributions, by iteratively updating and optimizing them.

- Application in Decision Making: While LDA doesn't make explicit predictions, it enables understanding and organizing large volumes of text data, which can inform decision-making processes.
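
To make the "topics to documents, words to topics" idea concrete, here is a minimal sketch of the generative story LDA assumes, using only the Python standard library. The vocabulary, the two topics, and the `alpha` values are made up for illustration; they are not taken from the transcript or from the model trained later in this notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    # 1. Draw the document-topic distribution (how much of each topic the document contains).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        # 3. Pick a word according to that topic's topic-word distribution.
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["hospital", "home", "fear", "road", "journey", "photo"]
# Two hypothetical topics: each row is a probability distribution over vocab.
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print("document-topic distribution:", theta)
print("generated words:", words)
```

Fitting LDA runs this story in reverse: it only sees the words and tries to recover plausible topic-word and document-topic distributions.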

<center> :^) </center>

## Importing the Required Libraries

In [None]:
import nltk
nltk.download("stopwords")
import string
from nltk.corpus import stopwords
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#1introduction
import numpy as np
import json
import glob
#import pandas as pd #need older vers

#Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

#spacy
import spacy
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [None]:
!pip install pyLDAvis
#for vis
import pyLDAvis
import pyLDAvis.gensim_models



import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Preparing the Data
### Text pre-processing

In [None]:
# Helper functions to read JSON data from a file (load_data) and write JSON data
# to a file (write_data) in a structured and reusable way
def load_data(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

def write_data(file, data):
    with open(file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


In [None]:
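from nltk.corpus import stopwords
# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list
stopwords = stopwords.words("english")
# Keep only the English stopwords that are at least 5 characters long
stopwords = [word for word in stopwords if len(word) >= 5]
# Add transcript-specific filler words and speaker labels
stopwords.extend(['from', 'go', 'so', 'know', 'subject', 're', 'edu', 'use', 'participant', 'P1', 'P2', 'P3', 'P4', 'P5', 'Facilitator','um','uh'])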

In [None]:
print (stopwords)

['myself', 'ourselves', "you're", "you've", "you'll", "you'd", 'yours', 'yourself', 'yourselves', 'himself', "she's", 'herself', 'itself', 'their', 'theirs', 'themselves', 'which', "that'll", 'these', 'those', 'being', 'having', 'doing', 'because', 'until', 'while', 'about', 'against', 'between', 'through', 'during', 'before', 'after', 'above', 'below', 'under', 'again', 'further', 'there', 'where', 'other', "don't", 'should', "should've", "aren't", 'couldn', "couldn't", "didn't", 'doesn', "doesn't", "hadn't", "hasn't", 'haven', "haven't", "isn't", 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", "shan't", 'shouldn', "shouldn't", "wasn't", 'weren', "weren't", "won't", 'wouldn', "wouldn't", 'from', 'subject', 're', 'edu', 'use', 'participant', 'P1', 'P2', 'P3', 'P4', 'P5', 'Facilitator', 'um', 'uh']


In [None]:
data = load_data("session1.json")

# Access the element at index 3 (the data is a list of dictionaries)
fourth_element = data[3]

# Access the "speech" key from that element
speech_text = fourth_element["speech"]

# Print the first 200 characters of the speech
print(speech_text[:200])

 I was leaving the safety of the hospital and then to kind of an unknown. The picture, represents the unknown on the other side, and it's almost a bit of a hill. And, and at the same time, that log is


In [None]:
import os

current_directory = os.getcwd()
print("Current Working Directory:", current_directory)

#data = load_data("session1.json")["speech"]

#print (data[0][0:90])


Current Working Directory: /content


In [None]:
# Tokenize each document and drop tokens that appear in the `stopwords` list defined above
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords] for doc in texts]
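
One thing to keep in mind: gensim's `simple_preprocess` lowercases its tokens, so capitalized entries in the stopword list (such as 'Facilitator') would not match unless they are lowercased too. The sketch below illustrates this with a simplified regex tokenizer standing in for `simple_preprocess`; the function names and the example sentence are hypothetical.

```python
import re

def tokenize(text):
    """Simplified stand-in tokenizer: lowercase, alphabetic tokens of 2+ letters."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= 2]

def remove_stopwords_sketch(texts, stop_words):
    """Remove stopwords after lowercasing both the tokens and the stopword list."""
    stop_set = {w.lower() for w in stop_words}
    return [[t for t in tokenize(doc) if t not in stop_set] for doc in texts]

docs = ["Facilitator: Um, so the road felt like a journey."]
custom_stops = ["Facilitator", "um", "so", "the", "like"]
cleaned = remove_stopwords_sketch(docs, custom_stops)
print(cleaned)  # [['road', 'felt', 'journey']]
```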

In [None]:
# Function for lemmatization
def lemmatization(texts, allowed_postags=["NOUN", "ADJ", "VERB", "ADV"]):
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for text in texts:
        doc = nlp(text)
        new_text = []
        for token in doc:
            if token.pos_ in allowed_postags:
                new_text.append(token.lemma_)
        final = " ".join(new_text)
        texts_out.append(final)
    return texts_out

# Read data from the file
with open('session1.json', 'r') as file:
    data = json.load(file)

# Extract speech texts from the data
speech_texts = [item.get('speech', '') for item in data]

# Perform lemmatization
lemmatized_texts = lemmatization(speech_texts)

# Print the lemmatized texts
print(lemmatized_texts)

['', '', 'so go idea mind photo want take road actually just road cause feel journey road journey long road then come take park come outta park right see just almost slam brake unfortunately get side road just speak so just get very emotional whole time think want photo et cetera et cetera just speak fact have cross point', 'leave safety hospital then kind of unknown picture represent unknown other side almost bit hill same time log cover moss so transfer log dangerous scary feel leave safety net hospital give fact night event almost pass bathroom still send home next morning very apprehensive transition home fear landing post long haul symptom live fear get case again get sick again end back hospital experience all over again so kinda represent just speak just take photo pleasure give lot time think want go question', 'feel wanna go go wanna', '', 'thank lovely photograph thank explain process incredible just drive along see think speak little bit last week pop', 'picture road actuall

In [None]:
def gen_words(texts):
    final = []
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return (final)

data_words = gen_words(lemmatized_texts)

print (data_words[0][0:20])

[]


In [None]:
id2word = corpora.Dictionary(data_words)

corpus = []
for text in data_words:
    new = id2word.doc2bow(text)
    corpus.append(new)

print (corpus[0][0:20])

# Look up the word that was assigned ID 0 in the dictionary
word = id2word[0]
print(word)
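
As a rough illustration of what the dictionary and `doc2bow` steps produce, here is a pure-Python sketch of a bag-of-words encoding. It is a simplified stand-in, not gensim's implementation: the helper names are hypothetical, and gensim may assign word IDs in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Encode a tokenized document as a sorted list of (word_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [["road", "journey", "road"], [], ["hospital", "road"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [], [(0, 1), (2, 1)]]
```

Note that an empty document gives an empty bag of words, which is why printing the first entry here shows `[]`: the first transcript entry has no speech text.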

## Building the LDA topic model

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,              # The corpus(collection) of texts to train the model
    id2word=id2word,            # Mapping from word IDs to words
    num_topics=20,              # The number of topics to be generated
    random_state=100,           # Seed for random number generation (for reproducibility)
    update_every=1,              # How often the model parameters should be updated
    chunksize=100,               # Number of documents to be used in each training chunk
    passes=10,                   # Number of passes through the corpus during training
    alpha="auto"                 # Alpha parameter for LDA (auto sets it automatically)
)
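
Once the model is trained, each document is described as a mixture of topics. Below is a minimal sketch of picking the dominant topic for one document, assuming the list of `(topic_id, probability)` pairs that gensim's `get_document_topics` returns. The helper name and the numbers are made up for illustration; they are not output from this notebook's model.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability, or None."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])

# Hypothetical document-topic distribution for one transcript segment
example_doc_topics = [(2, 0.10), (7, 0.65), (11, 0.25)]
print(dominant_topic(example_doc_topics))  # (7, 0.65)
```

With the real model, something like `lda_model.get_document_topics(corpus[3])` should supply the pairs for the transcript segment at index 3.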


## Visualizing the Data

In [None]:
!pip install "pandas<2.0.0"

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis



The visualization shows the topics produced by the model and their associated keywords. Each bubble on the left-hand plot represents a topic. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

Red bars give the estimated number of times a given term was generated by a given topic. As you can see, there are about 20 uses of the word *surgery*, and this term is used about 15 times within topic 9. The word with the longest red bar is the word used most often by the documents (here, transcript segments) belonging to that topic.

## Interpreting the Visualization:

**Topic Circles:** Each circle represents a topic. The larger the circle, the more prevalent the topic is in the corpus.

### Intertopic Distance Map (distance between topics):

**Closer Topics:** When topics are closer to each other on the intertopic distance map, they share more similarities, such as common words and themes. These topics are likely to be more closely related in content or subject matter.
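
To give a feel for what "closer" means, the sketch below computes the Jensen-Shannon divergence between topic-word distributions, which is (as far as I understand) the kind of distance pyLDAvis computes between topics before projecting them onto the 2D map. The three toy topics are hypothetical, and this is not pyLDAvis's own code.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded measure of dissimilarity."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a 4-word vocabulary
topic_a = [0.50, 0.40, 0.10, 0.00]
topic_b = [0.45, 0.40, 0.15, 0.00]  # similar to topic_a
topic_c = [0.00, 0.10, 0.10, 0.80]  # very different from topic_a

print("a vs b:", js_divergence(topic_a, topic_b))
print("a vs c:", js_divergence(topic_a, topic_c))
```

Topics that put their probability on the same words (a and b) get a small divergence and would sit close together on the map; topics with little overlap (a and c) end up far apart.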

### Topic Details:

**Topic Sizes:** The size of each topic circle corresponds to the prevalence or weight of that topic in the entire corpus.

**Top Words:** Hovering over a topic circle displays the top words associated with that topic. These words are the most relevant terms defining that topic.

### Visualization Controls:

**Relevance Slider:** Adjusting the relevance slider (λ) changes how the displayed words are ranked. Values near 1 rank words by their probability within the selected topic, while values near 0 favor words that are distinctive to the topic relative to their overall frequency in the corpus. This helps fine-tune which words are shown for each topic.
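
The ranking behind the slider can be sketched with the relevance formula from the LDAvis method: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The probabilities below are made-up numbers for three words in one hypothetical topic, not values from this model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word to a topic for a given lambda between 0 and 1."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# word: (p(word | topic), p(word) in the whole corpus), hypothetical values
terms = {
    "feel": (0.10, 0.08),       # frequent in the topic, but frequent everywhere
    "hospital": (0.06, 0.01),   # fairly frequent and fairly distinctive
    "moss": (0.01, 0.001),      # rare, but almost exclusive to this topic
}

def rank(lam):
    """Return the words sorted from most to least relevant for this lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

print(rank(1.0))  # ['feel', 'hospital', 'moss']
print(rank(0.0))  # ['moss', 'hospital', 'feel']
```

At λ = 1 common words dominate the list; at λ = 0 rare but topic-specific words rise to the top. Values in between trade off the two.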
