In [5]:
#Import Glob to read all text files
import glob

path = '*.txt'
transcript_files = glob.glob(path)
print(transcript_files) #There are 19 .txt files (17 interviews + 2 fgd's)

['alind.txt', 'anirudh.txt', 'anuradha.txt', 'fgd1.txt', 'fgd2.txt', 'himanshu.txt', 'himmat.txt', 'jyoti.txt', 'mani.txt', 'nandkishore.txt', 'premchand.txt', 'puneet.txt', 'ranjana.txt', 'saideep.txt', 'sandhya.txt', 'shaurya.txt', 'shveta.txt', 'sushila.txt', 'tonu.txt']


In [4]:
#Install the necessary libraries
!pip install nltk
!pip install textblob



The presence or absence of negative words like "not" and other negations can affect the sentiment conveyed by the topics. If "not" or other negative words are included in the list of stop words, the topics may read as more positive than the source text, because negations often reverse the sentiment of a statement.

Stop Words and Sentiment Bias:

Stop words are commonly used words (such as "the", "is", "in", etc.) that are usually removed from texts before processing because they often don't contribute much meaningful information.
However, words like "not" play a crucial role in the sentiment of a phrase. For instance, "good" and "not good" have opposite meanings. If negations are removed, the resulting topics might be skewed towards a more positive sentiment.
Interpreting LDA Topics:

When interpreting LDA topics, it's important to remember that LDA doesn't inherently understand the sentiment of words. It groups words based on their co-occurrence patterns.
The interpretation of topics can be subjective, and without considering the context or negations, it may inadvertently lean towards a more positive or neutral portrayal.
**Checking for Negation Words**:
- To ensure a balanced interpretation, you might want to check if negation words like "not" are included in your stop words list. If they are, consider keeping them in your analysis.
- Reviewing the list of stop words and adjusting it based on the context of your analysis is crucial. For a nuanced analysis, especially in sentiment-rich texts, it’s important to carefully choose which words are treated as stop words.

Balanced Topic Interpretation:

It’s beneficial to approach LDA topic interpretation with an awareness of potential biases. This includes considering both positive and negative aspects that might be present in the topics.
Reinterpreting the topics with an eye for potential negations or contrasting sentiments could provide a more balanced view.
Revisiting the LDA Model:

If you suspect that the exclusion of negation words is affecting your results, you might want to rerun the LDA model without removing these words.
Re-analyzing the topics with negations included could reveal different nuances in how Vipassana is discussed in your corpus.
In summary, the treatment of negations and other sentiment-indicative words in LDA preprocessing can significantly impact the interpretation of the model's output. It's always a good practice to critically evaluate your stop words list and consider the broader linguistic context to ensure a balanced and comprehensive analysis of the topics derived from your data.
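The effect of dropping negations can be seen in a minimal sketch. The stop-word sets and the sample sentence below are hypothetical and kept tiny on purpose; NLTK's English list is much longer but also contains "not". The helper name `remove_stop_words` is an assumption made for this illustration.

```python
# Hypothetical mini stop lists for illustration only.
stop_with_negation = {"the", "is", "this", "not"}
stop_without_negation = stop_with_negation - {"not"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = "the course is not good".split()

# With "not" treated as a stop word, the negation disappears.
print(remove_stop_words(tokens, stop_with_negation))     # ['course', 'good']

# With "not" kept, the negation survives preprocessing.
print(remove_stop_words(tokens, stop_without_negation))  # ['course', 'not', 'good']
```

A bag-of-words model such as LDA still ignores word order, so keeping "not" only preserves the token; it does not attach the negation to "good".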

In [76]:
import re
from nltk import word_tokenize, sent_tokenize
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')  # needed by stopwords.words('english') on a fresh install

# Initialize the stemmer
stemmer = PorterStemmer()

#dictionary to store tokenized data
tokenized_data = {}

#Loop over all transcript_files
for txt_file in transcript_files:
    with open(txt_file, "r", encoding='utf-8') as file:
        transcript = file.read() #Read the .txt file inside the loop
        
    transcript = re.sub(r"\bInterviewer\b: ?", "", transcript, flags=re.IGNORECASE)
    transcript = re.sub(r"\b(Alind|Anirudh|Anuradha|Himanshu|Himmat|Jyoti|Mani|Nandkishore|Premchand|Puneet|Ranjana|Saideep|Sandhya|Shaurya|Shveta|Sushila|Tonu|Ritu|Raunak|Richa)\b: ?", "", transcript, flags=re.IGNORECASE)
    wtokenize = word_tokenize(transcript)
    stokenize = sent_tokenize(transcript)
    # Get the list of default English stopwords
    stop_words = stopwords.words('english')
    # Words to keep in the text (mostly negations), so they are taken out of the stop-word list.
    # Tokens are stripped of punctuation before the stop-word check below, so the
    # contraction entries ("don't", "won't", ...) can never match a cleaned token.
    list_to_remove = ['no', 'nor', 'wouldn\'t', 'against', 'won\'t', 'ourselves', 'don\'t', 'not', 'now', 'doesn\'t']
    stop_words = list(set(stop_words) - set(list_to_remove)) #Use set difference to remove elements from a list
    # Extra words to drop: fillers, names, and the study's subject term
    list_to_add = ['nt', 'anirudh', 'narula', 'like', 'yeah', 'ranjanaa', 'would', 'could', 'okay', 'vipassana']
    #print(stop_words)
    stop_words = stop_words + list_to_add #simply add two lists like this
    # Lowercase each token and strip punctuation once, then drop the stop words
    cleaned_words = [re.sub(r"[^a-zA-Z0-9 ]+", "", word).lower() for word in wtokenize]
    clean_wtokenize = [word for word in cleaned_words if word not in stop_words]
    # Same cleaning per sentence; only a sentence that consists of a single stop word is dropped
    cleaned_sentences = [re.sub(r"[^a-zA-Z0-9 ]+", "", sentence).lower() for sentence in stokenize]
    clean_stokenize = [sentence for sentence in cleaned_sentences if sentence not in stop_words]

    # Filter out any empty strings that may have resulted from removing punctuation
    clean_wtokenize = [word for word in clean_wtokenize if word]

    # Apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in clean_wtokenize]

    # Sentiment analysis with TextBlob
    blob = TextBlob(transcript)
    sentiment = blob.sentiment
    noun_phrases = blob.noun_phrases

    word_freq = nltk.FreqDist(clean_wtokenize)

    # Store tokenized data in the dictionary using filename as key
    file_key = txt_file.split('.')[0]  # Extract filename from the path, remove .txt
    tokenized_data[file_key] = {
        'word_tokens': clean_wtokenize,
        'sentence_tokens': clean_stokenize,
        'stem_tokens': stemmed_tokens,
        'sentiment': sentiment,
        'noun_phrases': noun_phrases,
        'word_freq': word_freq
    }
print(stop_words)



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\samar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['does', 'those', 'from', 'all', "that'll", 'what', 'won', 'these', 'who', 'wasn', 'is', 't', 'because', 'be', 'shouldn', 'aren', 'this', 'you', 'he', "wasn't", "shouldn't", 'down', 'again', 'few', 'them', 'so', 'until', 'ours', 'while', 'can', "isn't", 'doesn', "couldn't", 'a', 'too', 'doing', 'there', 'very', 'were', 'm', 'are', 'hasn', 're', 'but', "hasn't", 'don', "you'll", 'their', 'our', 'most', 'll', 'it', 'which', 'why', 'hadn', 'o', 'once', 'any', "didn't", 'under', 'through', 'just', 'about', 'when', 'she', 'if', 'didn', 'we', 'mightn', 'over', 'theirs', 'did', 'ain', 'do', 'for', 'into', 'further', 'how', 'having', 'to', 'being', "hadn't", 'on', 'more', 'where', 'myself', 'in', "you've", 'your', 'after', "shan't", 'mustn', 'haven', "needn't", 'himself', 'themselves', 'me', 'whom', 'weren', 'shan', 'his', 'they', 'should', 've', 'yourselves', 'couldn', 'has', 'during', 'will', "you're", 'itself', 'my', 'd', 'hers', 'ma', 'needn', 'of', 'that', "haven't", 'its', 'up', 'each', 

corpora:

In Natural Language Processing (NLP), "corpora" is the plural of "corpus", a large and structured set of texts.
In this code, corpora is the module imported from the gensim library (from gensim import corpora), which is used for unsupervised topic modeling and natural language processing.
dictionary:

Here, a dictionary is created using corpora.Dictionary(all_tokenized_texts). This is not a standard Python dictionary but a special object from the gensim library.
The dictionary object maps each unique word in your text data to a unique integer ID. This step is needed to convert text data into a numerical form that the model can work with.
corpus and Bag-of-Words (BoW):

A corpus in NLP is a collection of documents, which in this case is the set of tokenized transcripts.
doc2bow stands for "document to Bag-of-Words". It converts each document into the Bag-of-Words format, which is a list of tuples. Each tuple contains a word's ID (as per the dictionary) and its count in the document.
Essentially, doc2bow converts your text data into a numerical form, where each document is represented as a sparse vector of word counts and word order is discarded.
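The ID mapping and Bag-of-Words conversion described above can be sketched in plain Python. This is a hand-rolled illustration, not gensim's implementation: the two sample documents are hypothetical, the names `word_to_id` and `to_bow` are made up for this sketch, and gensim may assign IDs in a different order.

```python
from collections import Counter

# Two tiny tokenized "documents" (hypothetical tokens, not from the transcripts).
docs = [["practice", "life", "practice"], ["life", "change"]]

# Assign each unique word an integer ID in order of first appearance.
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())


bows = [to_bow(doc) for doc in docs]
print(word_to_id)  # {'practice': 0, 'life': 1, 'change': 2}
print(bows)        # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Each inner list plays the role of one document in the corpus: only words that occur are listed, which is why the representation is called sparse.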
Parameters in LdaModel:

corpus: The dataset used for training the model, in the BoW format.
num_topics: The number of topics the LDA model should identify in the data (4 here).
id2word: The mapping from IDs back to words. Here, it's the dictionary created earlier.
passes: The number of passes the algorithm makes over the entire corpus. More passes can help the model converge but also take longer to compute.
random_state: A seed for random number generation, ensuring reproducibility of the results. Different seeds can lead to different topic allocations.
alpha and eta: Priors on the document-topic and topic-word distributions. Setting them to 'auto' asks gensim to learn them from the corpus.
Determining the Dominant Topic:

For each text file (txt_file), the matching BoW representation (bow) is taken from the corpus.
lda_model.get_document_topics(bow) gives the topic distribution for that document.
The dominant topic (the one with the highest proportion) is selected for each document and stored along with that topic's eight most significant words and their weights.
dominant_topic_details:

This dictionary stores the results. For each file, it keeps track of the dominant topic and the top words (with their topic-level weights) that characterize this topic.
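The dominant-topic selection can be illustrated without training a model. In the sketch below, `topic_distribution` is a hypothetical list shaped like the (topic_id, proportion) pairs that get_document_topics returns, and `topic_words` holds the first two word weights of topics 0, 2 and 3 as printed in this notebook's topic output. The proportions themselves are invented for the example.

```python
# Hypothetical topic mixture for one document: (topic_id, proportion) pairs.
topic_distribution = [(0, 0.10), (2, 0.65), (3, 0.25)]

# Topic-level word weights (first two words per topic, from the printed topics).
topic_words = {
    0: [("life", 0.009), ("personal", 0.008)],
    2: [("know", 0.028), ("not", 0.021)],
    3: [("practice", 0.012), ("not", 0.009)],
}

# The dominant topic is the pair with the largest proportion.
dominant_topic, proportion = max(topic_distribution, key=lambda pair: pair[1])

print(dominant_topic, proportion)   # 2 0.65
print(topic_words[dominant_topic])  # [('know', 0.028), ('not', 0.021)]
```

The proportion is the document-specific quantity here; the word weights are looked up from the topic and would be the same for any other document whose dominant topic is 2.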


In [118]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel


# Create a list of tokenized texts
all_tokenized_texts = [data['word_tokens'] for data in tokenized_data.values()]

# Create a dictionary and corpus for the LDA model
dictionary = corpora.Dictionary(all_tokenized_texts)
corpus = [dictionary.doc2bow(text) for text in all_tokenized_texts]


# Train the LDA model
lda_model = LdaModel(corpus=corpus, num_topics=4, id2word=dictionary, passes=105, random_state=2, alpha='auto', eta='auto')

dominant_topic_details = {}

for txt_file, bow in zip(tokenized_data.keys(), corpus):
    # Get the topic distribution for the document
    topic_distribution = lda_model.get_document_topics(bow)
    # Find the dominant topic (the one with the highest proportion)
    dominant_topic = max(topic_distribution, key=lambda x: x[1])[0]
    # Get the word-weight pairs for the dominant topic. show_topic returns the
    # topic's corpus-level weights, so they do not depend on the document.
    topic_words = lda_model.show_topic(dominant_topic, topn=8)
    word_weight_pairs = [(word, round(weight, 3)) for word, weight in topic_words]

    # Store the results
    file_key = txt_file.split('.')[0]  # Extract filename
    dominant_topic_details[file_key] = {
        'dominant_topic': dominant_topic,
        'word_weights': word_weight_pairs
    }
    
coherence_model_lda = CoherenceModel(model=lda_model, texts=all_tokenized_texts, dictionary=dictionary, corpus=corpus, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score:', coherence_lda)

# Now, the dominant_topic_details dictionary contains the dominant topic and its top words for each file

lda_model.print_topics()

Coherence Score: 0.4276125088280711


[(0,
  '0.009*"life" + 0.008*"personal" + 0.007*"experience" + 0.005*"practice" + 0.005*"think" + 0.005*"understanding" + 0.005*"not" + 0.005*"challenges" + 0.005*"issues" + 0.004*"selfawareness"'),
 (1,
  '0.009*"practice" + 0.009*"meditation" + 0.007*"nature" + 0.006*"often" + 0.006*"understanding" + 0.005*"law" + 0.005*"life" + 0.005*"experience" + 0.005*"personal" + 0.005*"approach"'),
 (2,
  '0.028*"know" + 0.021*"not" + 0.017*"think" + 0.016*"people" + 0.010*"feel" + 0.009*"say" + 0.008*"one" + 0.007*"now" + 0.007*"right" + 0.007*"get"'),
 (3,
  '0.012*"practice" + 0.009*"not" + 0.007*"sense" + 0.007*"understanding" + 0.007*"life" + 0.006*"might" + 0.006*"change" + 0.006*"self" + 0.006*"think" + 0.005*"others"')]

Each topic offers a distinct perspective:

Topic 0 - Integrative Life Perspectives:

Keywords: "life," "personal," "experience," "practice," "issues," "societal," "think," "understanding," "not," "however."
Analysis: This topic seems to encapsulate a holistic view of Vipassana, intertwining personal experiences and societal perspectives. It covers a broad range of elements from personal practice to societal issues, reflecting on the complexities and diverse interpretations of Vipassana in life. The presence of "not" and "however" suggests a nuanced or possibly critical exploration of these aspects.
Suggested Name: "Vipassana: Personal and Societal Dynamics"
Topic 1 - Meditative Practices and Philosophies:

Keywords: "practice," "meditation," "nature," "often," "understanding," "law," "life," "experience," "personal," "approach."
Analysis: This topic is likely centered around the meditation practice itself, including its nature and various philosophical understandings. It might discuss the regularity ("often"), principles ("law"), and different approaches to meditation, emphasizing how these aspects are woven into personal life and experiences.
Suggested Name: "Philosophical Dimensions of Meditation"
Topic 2 - Critical Reflection and Discourse:

Keywords: "know," "not," "think," "people," "feel," "say," "one," "now," "right," "something."
Analysis: Dominated by words like "know," "not," and "think," this topic seems to represent a space of critical reflection and discourse. It suggests discussions where ideas, beliefs, and feelings about Vipassana are questioned, debated, and reflected upon, possibly indicating diverse and contrasting opinions.
Suggested Name: "Debating the Essence of Vipassana"
Topic 3 - Self-Awareness and Transformation:

Keywords: "practice," "not," "sense," "understanding," "life," "might," "think," "change," "self," "others."
Analysis: This topic appears to focus on the journey of self-awareness and potential transformation through Vipassana. It covers aspects of personal

In [89]:
def get_individual_details(name, tokenized_data, dominant_topic_details):
    # Check if the individual's data exists
    if name not in tokenized_data:
        print(f"No data found for {name}.")
        return
    # dict.get(key, default) returns the value stored under key, or default
    # (None if no default is given) when the key does not exist.
    # Extracting individual's data
    individual_data = tokenized_data[name]
    individual_tokens = individual_data.get('word_tokens', [])
    individual_stemmed = individual_data.get('stem_tokens', [])
    individual_sentiment = individual_data.get('sentiment', None)
    individual_nphrases = individual_data.get('noun_phrases', [])
    individual_word_freq = individual_data.get('word_freq', None)

    # Print basic information
    print(f"Details for {name.capitalize()}:")
    print(f"First 10 Word Tokens: {individual_tokens[:10]}")
    print(f"First 10 Stemmed Tokens: {individual_stemmed[:10]}")
    print(f"Sentiment: {individual_sentiment}")
    print(f"First 10 Noun Phrases: {individual_nphrases[:10]}")
    if individual_word_freq:
        print(f"Top 10 Word Frequencies: {individual_word_freq.most_common(10)}")
    else:
        print("Word frequency data not available.")

    # Get the dominant topic
    individual_details = dominant_topic_details.get(name, {})
    dominant_topic = individual_details.get('dominant_topic', 'Not available')
    word_weight_pairs = individual_details.get('word_weights', [])

    print(f"Dominant Topic: {dominant_topic}")
    print("Word-Weight Pairs:")
    for word, weight in word_weight_pairs:
        print(f"{word}, {weight}")
    print("\n")


In [119]:
!pip install spacy
!python -m spacy download en_core_web_sm



In [136]:
import spacy
from collections import Counter
import pandas as pd

nlp = spacy.load("en_core_web_sm")
ner_details = {}
all_counters = {}
#Loop over all transcript_files
for txt_file in transcript_files:
    with open(txt_file, "r", encoding='utf-8') as file:
        transcript = file.read() #Read the .txt file inside the loop
    
    transcript = re.sub(r"\bInterviewer\b: ?", "", transcript, flags=re.IGNORECASE)
    transcript = re.sub(r"\b(Alind|Anirudh|Anuradha|Himanshu|Himmat|Jyoti|Mani|Nandkishore|Premchand|Puneet|Ranjana|Saideep|Sandhya|Shaurya|Shveta|Sushila|Tonu|Ritu|Raunak|Richa)\b: ?", "", transcript, flags=re.IGNORECASE)
    doc = nlp(transcript)
    # Extract named entities
    named_entities = []
    for ent in doc.ents:
        context = ' '.join([token.text for token in ent.sent]) # Extracting sentence where the entity is mentioned
        named_entities.append({
            'text': ent.text,
            'type': ent.label_,
            'context': context
        })
        
    entity_freq = Counter([ent.label_ for ent in doc.ents])
    
    file_key = txt_file.split(".")[0]
    ner_details[file_key] = {
        'named_entities': named_entities,
        'entity_freq': entity_freq
    }
    # Add entity frequency to all_counters
    all_counters[file_key] = entity_freq

# Convert all_counters to DataFrame
df_counters = pd.DataFrame.from_dict(all_counters, orient='index').fillna(0)
df_counters.reset_index(inplace=True)
df_counters.rename(columns={'index': 'person_tag'}, inplace=True)
df_counters = df_counters[['person_tag', 'ORG', 'DATE', 'PERSON', 'CARDINAL', 'GPE', 'ORDINAL']]
df_counters.head(20)

  person_tag  ORG  DATE  PERSON  CARDINAL   GPE  ORDINAL
0      alind   14  10.0       9       4.0   2.0      1.0
1   anuradha   14  14.0       7      13.0   1.0      6.0
2      jyoti   14   2.0       7       1.0   1.0      2.0
3    saideep   28   3.0       7       2.0   2.0      1.0
4    sandhya   15   4.0       6       1.0   0.0      1.0
5     shveta   10  18.0       6       4.0   1.0      5.0
6    sushila   21   2.0       5       2.0   0.0      1.0
7       tonu   26  51.0      16      21.0  11.0      4.0
8    anirudh   28   4.0       9       0.0   0.0      0.0
9       fgd1   29   0.0      10       0.0   1.0      0.0


In [129]:
#id 1 Alind
get_individual_details("alind", tokenized_data, dominant_topic_details)
ner_details['alind'].get('named_entities')
ner_details['alind'].get('entity_freq')

Details for Alind:
First 10 Word Tokens: ['see', 'change', 'say', 'noticeable', 'change', 'no', 'surface', 'level', 'changes', 'change']
First 10 Stemmed Tokens: ['see', 'chang', 'say', 'notic', 'chang', 'no', 'surfac', 'level', 'chang', 'chang']
Sentiment: Sentiment(polarity=0.1487823984526112, subjectivity=0.44960976021614324)
First 10 Noun Phrases: ['wouldn ’ t', 'surface level changes', 'transient nature', 'vipassana', 'vipassana', 'was', 'okay', 'was', 'surface level', 'non vegetarian food']
Top 10 Word Frequencies: [('people', 31), ('not', 29), ('think', 29), ('know', 26), ('say', 24), ('feel', 16), ('change', 15), ('sense', 15), ('no', 14), ('course', 12)]
Dominant Topic: 2
Word-Weight Pairs:
know, 0.02800000086426735
not, 0.020999999716877937
think, 0.017000000923871994
people, 0.01600000075995922
feel, 0.009999999776482582
say, 0.008999999612569809
one, 0.00800000037997961
now, 0.007000000216066837




Counter({'ORG': 14,
         'DATE': 10,
         'PERSON': 9,
         'CARDINAL': 4,
         'GPE': 2,
         'ORDINAL': 1})

#### Overall Word Weights in Topics:

The word weights in the LDA model's output (for each of the four topics) are calculated from the entire corpus. They represent how important each word is for a given topic across all documents in the dataset.
Word Weights Shown for an Individual:

When you look up the dominant topic for an individual (like "alind" in this example), the word-weight pairs come from lda_model.show_topic(dominant_topic, topn=8), so they are the corpus-level weights of that topic, not weights computed from that individual's text.
This means that while "alind" is classified under "Topic 2: Debating the Essence of Vipassana," the weights shown (like "know: 0.028" and "not: 0.021") are the same for every individual whose dominant topic is Topic 2. What is specific to the individual is the topic distribution returned by lda_model.get_document_topics(bow), i.e. how much of each topic the model finds in their text.
Interpreting the Word Weights for an Individual:

The shared word list describes the theme an individual's text is most associated with; differences between individuals show up in their topic proportions and in their own word frequencies.
For example, "know" and "not" carry high weights in Topic 2 and are also among the most frequent words in "alind's" own text ('not': 29, 'know': 26), which suggests that his/her discussion involves a significant amount of reflection on knowledge and negation in relation to Vipassana.

#### The Counter object you're showing represents the frequency of different types of named entities in Alind's text. Here's what each of the entity labels typically represents in spaCy's NER model:

ORG: An organization (e.g., Google, United Nations)
DATE: A date (e.g., July 4th, 2021)
PERSON: A person's name (e.g., John Doe)
CARDINAL: Numerals that do not fall under another type (e.g., one, two, 1,000)
GPE: Geopolitical entity, i.e., countries, cities, states (e.g., India, Paris)
ORDINAL: "first", "second", etc.
Interpreting this data depends on the context of your dissertation. Here's how it might be relevant:

Organizational Context: If ORG entities are the most frequent in Alind's text, this might indicate a strong discussion about various organizations, which could be relevant if your dissertation discusses how individuals relate to or are influenced by organizations.

Temporal Aspects: The frequency of DATE entities suggests that specific points in time or periods are significant in the transcript. This could be relevant if the temporal context is important to your analysis.

Personal Interactions: The presence of PERSON entities might show that Alind's text involves many references to individuals, which could be important if your dissertation involves the social networks or interactions between people.

Quantitative Analysis: CARDINAL and ORDINAL entities indicate the presence of numerical and sequential data. If your dissertation deals with quantifying aspects of Vipassana or ranking/ordering of concepts or experiences, these could be pertinent.

Geographical Relevance: GPE entities can show the geographical diversity in Alind's text. If your dissertation explores the cultural or regional aspects of Vipassana practice, these references could be quite relevant.
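Raw counts are easier to compare across transcripts of different lengths when expressed as shares. The sketch below uses the label counts from the Counter shown for Alind; turning them into proportions is an illustrative step added here, not part of the original analysis, and the variable name `shares` is an assumption.

```python
from collections import Counter

# Entity-label counts copied from the Counter shown for Alind.
entity_freq = Counter(
    {"ORG": 14, "DATE": 10, "PERSON": 9, "CARDINAL": 4, "GPE": 2, "ORDINAL": 1}
)

# Total number of labelled entities in the transcript.
total = sum(entity_freq.values())

# Share of each label, ordered from most to least frequent.
shares = {label: count / total for label, count in entity_freq.most_common()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Of the 40 labelled entities, ORG accounts for 35%, DATE for 25%, and PERSON for 22.5%.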

In [94]:
#id 2 Saideep
get_individual_details("saideep", tokenized_data, dominant_topic_details)

Details for Saideep:
First 10 Word Tokens: ['perceive', 'concept', 'fraternity', 'relation', 'really', 'group', 'dynamics', 'fraternity', 'focused', 'introspection']
First 10 Stemmed Tokens: ['perceiv', 'concept', 'fratern', 'relat', 'realli', 'group', 'dynam', 'fratern', 'focus', 'introspect']
Sentiment: Sentiment(polarity=0.11742166563595129, subjectivity=0.482493300350443)
First 10 Noun Phrases: ['vipassana', 'vipassana', 'group dynamics', 'individual inner journeys', 'collective identity', 'sadhguru', 'concept relate', 'vipassana', 'sadhguru', 'vipassana']
Top 10 Word Frequencies: [('might', 7), ('people', 5), ('conflict', 5), ('concept', 4), ('experience', 4), ('see', 4), ('life', 4), ('challenging', 4), ('fraternity', 3), ('understanding', 3)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.00600000005215406

The word-weight pairs shown for each individual match the overall topic descriptions, and this is not by chance. It follows from how LDA (Latent Dirichlet Allocation) models work and from how the pairs are retrieved in the code. Here's a more detailed explanation:

LDA Topic Composition:

An LDA model identifies a fixed number of topics from the entire corpus. Each topic is characterized by a distribution over words. For instance, in this model, Topic 2 is characterized by words like "know," "not," "think," etc., based on their co-occurrence across all documents.
Document (or Individual) Assignment to Topics:

When the LDA model analyzes an individual document (here, an individual's text), it doesn't create new topics specific to that document. Instead, it represents the document as a mixture of the topics it has already learned from the entire corpus.
The model assigns topic proportions to a document based on how well the words in the document align with the words that are significant in each topic.
Word-Weight Pairs in Individual Analysis:

For each individual, the model calculates the proportion of each topic present in their text. The "dominant topic" is the one with the highest proportion.
The word-weight pairs reported for an individual are the top words of their dominant topic, taken from lda_model.show_topic. Both the words and the weights belong to the topic as learned from the whole corpus, so they are not recomputed for each individual.
Why Words Match the Overall Topic Words:

The reason you see the same words (like "life," "personal," "experience" for Anirudh's dominant topic 0) is that these words define Topic 0 in the overall corpus.
The model is essentially saying, "Based on the words in Anirudh's text, his discussion aligns most closely with Topic 0, which is characterized by these words in the entire dataset."
What Differs Between Individuals:

Individuals who share a dominant topic get identical word-weight pairs, as the outputs for Mani and Sandhya (both Topic 3) show.
What varies from person to person is the topic mixture returned by lda_model.get_document_topics and the word frequencies in their own transcript, so these are the quantities to compare when contrasting individuals.

In summary, the words and weights in each individual's word-weight pairs are the same as those in the overall topic descriptions because they are read directly from the topic. How strongly an individual is associated with that topic is captured by the topic proportions for their document, not by the word weights.
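A quick check on the recorded outputs shows how the word-weight pairs behave across individuals. The two lists below are copied from the Mani and Sandhya cells in this notebook (both have dominant topic 3), with weights rounded to three decimals for readability. The check assumes both cells were produced by the same trained model.

```python
# Word-weight pairs copied from the recorded outputs (rounded to 3 decimals).
mani = [
    ("practice", 0.012), ("not", 0.009), ("sense", 0.007), ("understanding", 0.007),
    ("life", 0.007), ("might", 0.006), ("change", 0.006), ("self", 0.006),
]
sandhya = [
    ("practice", 0.012), ("not", 0.009), ("sense", 0.007), ("understanding", 0.007),
    ("life", 0.007), ("might", 0.006), ("change", 0.006), ("self", 0.006),
]

# Compare the word lists and the weight lists separately.
same_words = [word for word, _ in mani] == [word for word, _ in sandhya]
same_weights = [weight for _, weight in mani] == [weight for _, weight in sandhya]

print(same_words, same_weights)  # True True
```

Both comparisons come out equal, which is what one would expect if the pairs are read from the topic itself rather than computed from each transcript.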


In [95]:
#id 3 Shaurya
get_individual_details("shaurya", tokenized_data, dominant_topic_details)

Details for Shaurya:
First 10 Word Tokens: ['good', 'morning', 'shaurya', 'law', 'student', 'aspiring', 'ips', 'officer', 'drew', 'single']
First 10 Stemmed Tokens: ['good', 'morn', 'shaurya', 'law', 'student', 'aspir', 'ip', 'offic', 'drew', 'singl']
Sentiment: Sentiment(polarity=0.10873482726423898, subjectivity=0.4491433239962653)
First 10 Noun Phrases: ['good morning', 'shaurya', 'law student', 'ips', 'vipassana', 'good morning', 'vipassana', 'academic interest', 'buddha', 'ambedkar']
Top 10 Word Frequencies: [('law', 13), ('public', 10), ('service', 10), ('nature', 8), ('ips', 7), ('officer', 7), ('experience', 7), ('career', 7), ('practice', 7), ('life', 7)]
Dominant Topic: 1
Word-Weight Pairs:
practice, 0.008999999612569809
meditation, 0.008999999612569809
nature, 0.007000000216066837
often, 0.006000000052154064
understanding, 0.006000000052154064
law, 0.004999999888241291
life, 0.004999999888241291
experience, 0.004999999888241291




In [96]:
#id 4 Mani
get_individual_details("mani", tokenized_data, dominant_topic_details)

Details for Mani:
First 10 Word Tokens: ['tell', 'experiences', 'many', 'times', 'attended', 'attended', 'threeday', 'course', 'last', 'one']
First 10 Stemmed Tokens: ['tell', 'experi', 'mani', 'time', 'attend', 'attend', 'threeday', 'cours', 'last', 'one']
Sentiment: Sentiment(polarity=0.15221417069243154, subjectivity=0.42913503565677474)
First 10 Noun Phrases: ['vipassana', 'vipassana', 'november', 'vipassana', 'mental peace', 'certain life', "life 's questions", 'vipassana', 'post-vipassana', 'stress levels']
Top 10 Word Frequencies: [('understanding', 5), ('meditation', 5), ('social', 5), ('life', 4), ('practice', 3), ('well', 3), ('not', 3), ('day', 3), ('teaches', 3), ('without', 3)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064




In [98]:
#id 5 Himanshu
get_individual_details("himanshu", tokenized_data, dominant_topic_details)

Details for Himanshu:
First 10 Word Tokens: ['think', 'general', 'meditation', 'foster', 'sense', 'fraternity', 'since', 'attended', 'perspective', 'limited']
First 10 Stemmed Tokens: ['think', 'gener', 'medit', 'foster', 'sens', 'fratern', 'sinc', 'attend', 'perspect', 'limit']
Sentiment: Sentiment(polarity=0.1336591538972491, subjectivity=0.49066229185276816)
First 10 Noun Phrases: ['vipassana', 'general meditation', 'vipassana', 'individual journey', 'vipassana', 'self-perception post-vipassana', 'vipassana', "n't facilitate making bonds", 'vibration aspects', 'vibrations part']
Top 10 Word Frequencies: [('practice', 7), ('sense', 6), ('think', 5), ('meditation', 5), ('course', 4), ('might', 4), ('experience', 3), ('seems', 3), ('individual', 3), ('changes', 3)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.0

In [99]:
#id 6 Sandhya
get_individual_details("sandhya", tokenized_data, dominant_topic_details)

Details for Sandhya:
First 10 Word Tokens: ['good', 'morning', 'sandhya', 'share', 'us', 'experience', 'many', 'times', 'attended', 'good']
First 10 Stemmed Tokens: ['good', 'morn', 'sandhya', 'share', 'us', 'experi', 'mani', 'time', 'attend', 'good']
Sentiment: Sentiment(polarity=0.2220714285714286, subjectivity=0.46523809523809534)
First 10 Noun Phrases: ['good morning', 'sandhya', 'vipassana', 'good morning', 'vipassana', 'vipassana', 'vipassana', 'inner peace', 'professional crossroads', 'vipassana']
Top 10 Word Frequencies: [('understanding', 5), ('not', 5), ('practice', 4), ('life', 4), ('awareness', 4), ('taught', 3), ('thoughts', 3), ('emotions', 3), ('good', 2), ('morning', 2)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064




In [101]:
#id 7 Anuradha
get_individual_details("anuradha", tokenized_data, dominant_topic_details)

Details for Anuradha:
First 10 Word Tokens: ['basically', 'dissertation', 'topic', 'fraternity', 'context', 'context', 'fraternity', 'know', 'question', 'mark']
First 10 Stemmed Tokens: ['basic', 'dissert', 'topic', 'fratern', 'context', 'context', 'fratern', 'know', 'question', 'mark']
Sentiment: Sentiment(polarity=0.1346595522356944, subjectivity=0.5281511776435632)
First 10 Noun Phrases: ['okay', 'dissertation topic', 'vipassana', 'vipassana', 'question mark', 'vipassana', 'generate fraternity', 'okay', 'vipassana', 'november']
Top 10 Word Frequencies: [('know', 37), ('think', 26), ('not', 24), ('one', 18), ('sense', 17), ('something', 14), ('get', 14), ('also', 12), ('religion', 12), ('no', 11)]
Dominant Topic: 2
Word-Weight Pairs:
know, 0.02800000086426735
not, 0.020999999716877937
think, 0.017000000923871994
people, 0.01600000075995922
feel, 0.009999999776482582
say, 0.00800000037997961
one, 0.00800000037997961
now, 0.007000000216066837




In [102]:
#id 8 Shveta
get_individual_details("shveta", tokenized_data, dominant_topic_details)

Details for Shveta:
First 10 Word Tokens: ['hi', 'good', 'tell', 'bit', 'research', 'trying', 'go', 'ahead', 'aunty', 'topic']
First 10 Stemmed Tokens: ['hi', 'good', 'tell', 'bit', 'research', 'tri', 'go', 'ahead', 'aunti', 'topic']
Sentiment: Sentiment(polarity=0.21374631268436578, subjectivity=0.46556992663187335)
First 10 Noun Phrases: ['hi', "'m good", 'yeah', 'yeah', 'question mark', 'generate fraternity', 'wrong person', 'question mark', 'okay', 'yeah']
Top 10 Word Frequencies: [('not', 28), ('know', 18), ('people', 17), ('no', 15), ('now', 14), ('think', 14), ('person', 13), ('feel', 11), ('able', 11), ('make', 10)]
Dominant Topic: 2
Word-Weight Pairs:
know, 0.02800000086426735
not, 0.020999999716877937
think, 0.017000000923871994
people, 0.01600000075995922
feel, 0.009999999776482582
say, 0.00800000037997961
one, 0.00800000037997961
now, 0.007000000216066837




In [104]:
#id 9 Ranjana
get_individual_details("ranjana", tokenized_data, dominant_topic_details)

Details for Ranjana:
First 10 Word Tokens: ['many', 'times', 'attended', 'goals', 'attended', 'threeday', 'course', 'last', 'session', 'november']
First 10 Stemmed Tokens: ['mani', 'time', 'attend', 'goal', 'attend', 'threeday', 'cours', 'last', 'session', 'novemb']
Sentiment: Sentiment(polarity=0.15973224306557637, subjectivity=0.4555228721895389)
First 10 Noun Phrases: ['vipassana', 'ranjanaa', 'three-day course', 'november', 'main goals', 'mental peace', "life 's", 'difficult questions', 'fortunately', 'own actions']
Top 10 Word Frequencies: [('practice', 8), ('understanding', 6), ('morality', 5), ('not', 5), ('rather', 5), ('community', 5), ('practices', 5), ('social', 5), ('life', 4), ('teaches', 4)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064




In [105]:
#id 10 Puneet
get_individual_details("puneet", tokenized_data, dominant_topic_details)

Details for Puneet:
First 10 Word Tokens: ['good', 'afternoon', 'puneet', 'teacher', 'someone', 'experienced', 'various', 'facets', 'share', 'unique']
First 10 Stemmed Tokens: ['good', 'afternoon', 'puneet', 'teacher', 'someon', 'experienc', 'variou', 'facet', 'share', 'uniqu']
Sentiment: Sentiment(polarity=0.11257816257816258, subjectivity=0.5751202501202499)
First 10 Noun Phrases: ['good afternoon', 'puneet', 'vipassana', 'various facets', 'unique perspective', 'good afternoon', 'vipassana', 'vipassana', 'vipassana', '’ s']
Top 10 Word Frequencies: [('practice', 9), ('sometimes', 7), ('teachings', 7), ('often', 6), ('approach', 6), ('view', 5), ('personal', 5), ('simplicity', 5), ('practices', 5), ('living', 5)]
Dominant Topic: 1
Word-Weight Pairs:
practice, 0.008999999612569809
meditation, 0.008999999612569809
nature, 0.007000000216066837
often, 0.006000000052154064
understanding, 0.006000000052154064
law, 0.004999999888241291
life, 0.004999999888241291
experience, 0.004999999888241

In [106]:
#id 11 Tonu
get_individual_details("tonu", tokenized_data, dominant_topic_details)

Details for Tonu:
First 10 Word Tokens: ['tell', 'bit', 'research', 'proceed', 'gone', 'last', 'year', 'master', 'right', 'now']
First 10 Stemmed Tokens: ['tell', 'bit', 'research', 'proceed', 'gone', 'last', 'year', 'master', 'right', 'now']
Sentiment: Sentiment(polarity=0.1847653549148113, subjectivity=0.4958143140072488)
First 10 Noun Phrases: ['vipassana', "master 's", 'tiss', 'development studies', 'credit dissertation', 'big chunk', 'vipassana', 'question mark', 'main idea', 'generate fraternity']
Top 10 Word Frequencies: [('know', 85), ('not', 54), ('people', 46), ('think', 40), ('day', 33), ('maybe', 29), ('feel', 26), ('practice', 26), ('time', 24), ('go', 24)]
Dominant Topic: 2
Word-Weight Pairs:
know, 0.02800000086426735
not, 0.020999999716877937
think, 0.017000000923871994
people, 0.01600000075995922
feel, 0.009999999776482582
say, 0.00800000037997961
one, 0.00800000037997961
now, 0.007000000216066837




In [107]:
#id 12 Anirudh
get_individual_details("anirudh", tokenized_data, dominant_topic_details)

Details for Anirudh:
First 10 Word Tokens: ['initially', 'motivated', 'try', 'practicing', 'mindfulness', 'center', 'beneficial', 'seemed', 'natural', 'progression']
First 10 Stemmed Tokens: ['initi', 'motiv', 'tri', 'practic', 'mind', 'center', 'benefici', 'seem', 'natur', 'progress']
Sentiment: Sentiment(polarity=0.15703578336557056, subjectivity=0.45006102053974384)
First 10 Noun Phrases: ['vipassana', 'anirudh narula', 'vipassana', 'natural progression', 'vipassana', 'anirudh narula', 'internal processes', 'vipassana', "'m okay", 'external circumstances']
Top 10 Word Frequencies: [('personal', 7), ('practice', 6), ('life', 6), ('understanding', 5), ('not', 5), ('volunteering', 5), ('challenges', 5), ('increased', 4), ('however', 4), ('maitri', 4)]
Dominant Topic: 0
Word-Weight Pairs:
life, 0.008999999612569809
personal, 0.00800000037997961
experience, 0.007000000216066837
practice, 0.006000000052154064
think, 0.004999999888241291
issues, 0.004999999888241291
societal, 0.00499999988

In [108]:
#id 13 Nandkishore
get_individual_details("nandkishore", tokenized_data, dominant_topic_details)

Details for Nandkishore:
First 10 Word Tokens: ['nandkishore', 'describe', 'experience', 'meditation', 'everything', 'seems', 'awakened', 'meditating', 'long', 'time']
First 10 Stemmed Tokens: ['nandkishor', 'describ', 'experi', 'medit', 'everyth', 'seem', 'awaken', 'medit', 'long', 'time']
Sentiment: Sentiment(polarity=0.08431961120640363, subjectivity=0.40599689618557533)
First 10 Noun Phrases: ['nandkishore', 'vipassana', 'long time', 'family members', 'meditating', 'ripple effect', 'modern lifestyles', 'consumption patterns', "n't align", 'natural processes']
Top 10 Word Frequencies: [('meditation', 16), ('spiritual', 11), ('societal', 11), ('development', 8), ('nature', 7), ('understanding', 7), ('practice', 6), ('often', 5), ('sense', 5), ('see', 5)]
Dominant Topic: 1
Word-Weight Pairs:
practice, 0.008999999612569809
meditation, 0.008999999612569809
nature, 0.007000000216066837
often, 0.006000000052154064
understanding, 0.006000000052154064
law, 0.004999999888241291
life, 0.00499

In [109]:
#id 14 Premchand
get_individual_details("premchand", tokenized_data, dominant_topic_details)

Details for Premchand:
First 10 Word Tokens: ['mental', 'state', 'influenced', 'practicing', 'since', 'started', 'practicing', 'noticeable', 'shift', 'mental']
First 10 Stemmed Tokens: ['mental', 'state', 'influenc', 'practic', 'sinc', 'start', 'practic', 'notic', 'shift', 'mental']
Sentiment: Sentiment(polarity=0.12624999999999997, subjectivity=0.45500000000000007)
First 10 Noun Phrases: ['mental state', 'vipassana', 'vipassana', 'definitely', 'healthy way', 'conscious choice', 'external factors disturb', 'inner peace', 'post-vipassana', 'interestingly']
Top 10 Word Frequencies: [('practice', 5), ('others', 5), ('maitri', 4), ('change', 4), ('practicing', 3), ('not', 3), ('life', 3), ('sense', 3), ('mental', 2), ('since', 2)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064

In [110]:
#id 15 Jyoti
get_individual_details("jyoti", tokenized_data, dominant_topic_details)

Details for Jyoti:
First 10 Word Tokens: ['describe', 'meditation', 'changed', 'perception', 'actions', 'absolutely', 'began', 'practicing', 'sort', 'blindness']
First 10 Stemmed Tokens: ['describ', 'medit', 'chang', 'percept', 'action', 'absolut', 'began', 'practic', 'sort', 'blind']
Sentiment: Sentiment(polarity=0.08313615608136156, subjectivity=0.427264326236929)
First 10 Noun Phrases: ['vipassana', 'absolutely', 'vipassana', 'own self', 'various situations', 'significant realization', 'inner calmness', 'have', 'specific changes', 'day-to-day life']
Top 10 Word Frequencies: [('not', 7), ('now', 7), ('sense', 6), ('people', 6), ('towards', 5), ('feel', 5), ('self', 4), ('empathy', 4), ('service', 4), ('others', 4)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064




In [111]:
#id 16 Sushila
get_individual_details("sushila", tokenized_data, dominant_topic_details)

Details for Sushila:
First 10 Word Tokens: ['describe', 'changes', 'experienced', 'since', 'practicing', 'transformative', 'experience', 'change', 'certain', 'aspects']
First 10 Stemmed Tokens: ['describ', 'chang', 'experienc', 'sinc', 'practic', 'transform', 'experi', 'chang', 'certain', 'aspect']
Sentiment: Sentiment(polarity=0.10619182900432897, subjectivity=0.4549312839937839)
First 10 Noun Phrases: ['vipassana', 'vipassana', 'transformative experience', 'certain aspects', 'new way', "'pure birth", 'vipassana', 'regular practice', 'vipassana', 'haven ’ t']
Top 10 Word Frequencies: [('practice', 11), ('maitri', 7), ('social', 6), ('life', 5), ('address', 5), ('issues', 5), ('change', 4), ('way', 4), ('people', 4), ('not', 4)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064

In [112]:
#id 17 Himmat
get_individual_details("himmat", tokenized_data, dominant_topic_details)

Details for Himmat:
First 10 Word Tokens: ['perspective', 'mindset', 'change', 'practicing', 'postvipassana', 'noticed', 'significant', 'shift', 'towards', 'positive']
First 10 Stemmed Tokens: ['perspect', 'mindset', 'chang', 'practic', 'postvipassana', 'notic', 'signific', 'shift', 'toward', 'posit']
Sentiment: Sentiment(polarity=0.14952161365204847, subjectivity=0.48889924085576264)
First 10 Noun Phrases: ['vipassana', 'post-vipassana', 'positive thoughts', 'new dimension', 'vipassana', 'don ’ t practice', 'vipassana', 'vipassana', 'personal relationships', 'various factors interplay']
Top 10 Word Frequencies: [('life', 9), ('experience', 8), ('think', 7), ('change', 6), ('personal', 5), ('societal', 5), ('peace', 5), ('daily', 4), ('relationships', 4), ('no', 4)]
Dominant Topic: 0
Word-Weight Pairs:
life, 0.008999999612569809
personal, 0.00800000037997961
experience, 0.007000000216066837
practice, 0.006000000052154064
think, 0.004999999888241291
issues, 0.004999999888241291
societal

In [113]:
#id 18 fgd1
get_individual_details("fgd1", tokenized_data, dominant_topic_details)

Details for Fgd1:
First 10 Word Tokens: ['let', 'begin', 'discussion', 'see', 'fraternity', 'context', 'caste', 'class', 'society', 'fraternity']
First 10 Stemmed Tokens: ['let', 'begin', 'discuss', 'see', 'fratern', 'context', 'cast', 'class', 'societi', 'fratern']
Sentiment: Sentiment(polarity=0.058017868656166545, subjectivity=0.4219812102790827)
First 10 Noun Phrases: ['fraternity', 'own caste', 'caste', 'significant factor', 'vipassana', 'empower individuals', 'vipassana', 'isn ’ t', 'natural law', 'have']
Top 10 Word Frequencies: [('self', 16), ('foucault', 9), ('way', 8), ('thoughts', 6), ('understanding', 5), ('practice', 5), ('sensations', 5), ('context', 4), ('others', 4), ('see', 3)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064
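Several noun phrases above contain split contractions with a typographic apostrophe ('isn ’ t', 'haven ’ t', 'don ’ t practice', '’ s'). One possible cause is that the transcripts use curly quotes that the tokenizer does not treat as part of a contraction. This has not been verified against the transcripts. The sketch below, using a hypothetical helper `normalize_apostrophes`, converts curly single quotes to straight ones before any further processing.

```python
def normalize_apostrophes(text):
    """Replace typographic single quotes with ASCII apostrophes (hypothetical helper)."""
    # U+2019 is the right single quotation mark, U+2018 the left one.
    return text.replace("\u2019", "'").replace("\u2018", "'")


sample = "Vipassana isn\u2019t a ritual, and I haven\u2019t stopped practising."
print(normalize_apostrophes(sample))
```

If this is applied, it would go before tokenization and noun-phrase extraction, so that all later steps see the same cleaned text.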




In [114]:
#id 19 fgd2
get_individual_details("fgd2", tokenized_data, dominant_topic_details)

Details for Fgd2:
First 10 Word Tokens: ['today', 'gathered', 'diverse', 'group', 'dive', 'discussion', 'let', 'start', 'personal', 'perspective']
First 10 Stemmed Tokens: ['today', 'gather', 'divers', 'group', 'dive', 'discuss', 'let', 'start', 'person', 'perspect']
Sentiment: Sentiment(polarity=0.118826705940108, subjectivity=0.4247259041073473)
First 10 Noun Phrases: ['diverse group', 'vipassana', '’ s', 'personal perspective', 'ritu', 'vipassana', 'mental health', 'vipassana', 'transformative experience', 'mental health']
Top 10 Word Frequencies: [('energy', 13), ('quantum', 11), ('practice', 10), ('mental', 7), ('physics', 6), ('scientific', 5), ('health', 4), ('meditation', 4), ('way', 4), ('stress', 4)]
Dominant Topic: 3
Word-Weight Pairs:
practice, 0.012000000104308128
not, 0.008999999612569809
sense, 0.007000000216066837
understanding, 0.007000000216066837
life, 0.007000000216066837
might, 0.006000000052154064
change, 0.006000000052154064
self, 0.006000000052154064
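The per-transcript outputs above are easier to compare when the dominant topics are tallied in one place. The sketch below hard-codes the dominant topic printed for ids 6 to 19 in this section. Earlier transcripts are left out, and the dictionary name `dominant_topics` is illustrative, not a variable from the notebook.

```python
from collections import Counter

# Dominant topic per transcript, copied from the outputs above (ids 6-19).
dominant_topics = {
    "sandhya": 3, "anuradha": 2, "shveta": 2, "ranjana": 3, "puneet": 1,
    "tonu": 2, "anirudh": 0, "nandkishore": 1, "premchand": 3, "jyoti": 3,
    "sushila": 3, "himmat": 0, "fgd1": 3, "fgd2": 3,
}

# Count how many transcripts fall under each topic.
topic_counts = Counter(dominant_topics.values())

for topic, count in sorted(topic_counts.items()):
    names = sorted(name for name, t in dominant_topics.items() if t == topic)
    print(f"Topic {topic}: {count} transcripts ({', '.join(names)})")
```

In a live notebook the same tally could presumably be built from `dominant_topic_details` directly, but its structure is not shown here.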


