## Step-by-Step Explanation of the Code

### Step 1: Load Data

The code begins by loading the CSV file (`WhatsappChat.csv`) into a pandas DataFrame with `pd.read_csv`. All later steps operate on this DataFrame.

### Step 2: Preprocess Text

The `preprocess_text` function cleans the text in the 'Content' column of the DataFrame. It tokenizes the text, keeps only purely alphabetic tokens (which drops punctuation and numbers), converts them to lowercase, and removes common English stop words. The remaining tokens are then joined back into a single space-separated string.
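
The steps above can be tried without downloading any NLTK data. The following is a minimal sketch, not the notebook's function: it assumes a regex tokenizer and a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) in place of `word_tokenize` and NLTK's English stop-word list, so its results can differ slightly, for example on contractions.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "on", "at", "it", "you"}

def preprocess_text_sketch(text):
    """Keep alphabetic tokens, lowercase them, drop stop words, rejoin."""
    # Non-string cells (for example NaN from an empty CSV field) become empty text.
    if not isinstance(text, str):
        return ""
    # \w+ is a rough stand-in for word_tokenize; isalpha() drops tokens like "10am".
    tokens = [word.lower() for word in re.findall(r"\w+", text) if word.isalpha()]
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return " ".join(tokens)

cleaned = preprocess_text_sketch("The workshop on Power BI is at 10am, see you!")
print(cleaned)
```

Returning an empty string for non-string input is a design choice of this sketch; it keeps one row per message so the DataFrame and the later matrix stay aligned.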

### Step 3: Create TF-IDF Matrix

A `TfidfVectorizer` is created to convert the preprocessed text data into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) transformation. This vectorizer is fitted to the text data, resulting in a TF-IDF matrix.

### Step 4: Build an LDA Model

A Latent Dirichlet Allocation (LDA) model is created using `LatentDirichletAllocation` from scikit-learn. It is fitted to the TF-IDF matrix to identify latent topics within the text data.
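
The shapes involved are easier to follow on a toy example. The sketch below assumes a made-up corpus and only two topics (the notebook uses `n_components=100`); it shows that `components_` has one row per topic and that each document gets a topic distribution that sums to 1.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-preprocessed documents (assumption for illustration).
docs = [
    "data analysis work",
    "power bi data",
    "good morning thanks",
    "thanks good work",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Two topics keep the output readable; random_state makes the run repeatable.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(tfidf)  # one row per document, one column per topic

print(lda.components_.shape)   # (number of topics, number of terms)
print(doc_topics.sum(axis=1))  # each row sums to 1
```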

### Step 5: Retrieve Top N Words and Weights for Each Topic

The code defines `N`, the number of top words to retrieve for each topic. It then reads the vocabulary from the TF-IDF vectorizer via `get_feature_names_out()` and the topic-word weight matrix from the LDA model's `components_` attribute, which has one row per topic and one column per term.
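
The values in `components_` are unnormalized weights, so they are hard to compare across topics. One option, sketched below on a made-up array that stands in for `lda_model.components_`, is to divide each row by its sum to get a per-topic word distribution. The notebook itself does not do this; it stores the raw maximum weight.

```python
import numpy as np

# Hypothetical stand-in for lda_model.components_: 2 topics x 4 terms.
components = np.array([
    [2.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 3.0, 1.0],
])

# Divide every row by its own sum so each topic's weights add up to 1.
word_dist = components / components.sum(axis=1, keepdims=True)
print(word_dist)
```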

### Step 6: Create DataFrame

The code loops over the rows of the topic-word weight matrix and collects one entry per topic in the `topic_data` list. After the loop, the list is converted into a DataFrame named `df_topics`. For each topic, the loop does the following:
- It identifies the indices of the top N words and stores them in `top_word_indices`.
- It looks up the corresponding words and stores them in `top_words`.
- It takes the largest word weight in the topic's row and stores it in `max_weight`.
- It appends the pair of top words and `max_weight` to `topic_data`.
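
The slice `argsort()[:-N-1:-1]` is compact but not obvious: `argsort()` orders indices from smallest to largest weight, and the slice walks backwards from the end to take the last `N` of them. A small sketch with made-up terms and weights:

```python
import numpy as np

# Made-up vocabulary and one topic's weights (assumption for illustration).
topic_terms = np.array(["data", "good", "morning", "power", "thanks"])
topic_weight = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
N = 3

# Ascending order of weights is [1, 4, 2, 3, 0]; the reversed slice keeps the last N.
top_word_indices = topic_weight.argsort()[:-N-1:-1]
top_words = [topic_terms[idx] for idx in top_word_indices]
max_weight = topic_weight.max()

print(top_word_indices)  # [0 3 2]
print(top_words)         # data, power, morning
print(max_weight)        # 0.9
```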

### Step 7: Save to CSV

Finally, the code saves the `df_topics` DataFrame to a CSV file named `topics_weights.csv`. The file contains two columns, 'topic' and 'weight': 'topic' holds the list of top words for each topic, and 'weight' holds the largest word weight in that topic.
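
One thing to keep in mind when reading the file back: CSV has no list type, so each list in the 'topic' column is written as its text form. The sketch below uses made-up rows and an in-memory buffer instead of a file on disk, and shows one way to recover the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Made-up rows in the same layout as df_topics (assumption for illustration).
df_topics = pd.DataFrame(
    [(["data", "power", "bi"], 12.5), (["good", "morning", "thanks"], 8.0)],
    columns=["topic", "weight"],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_topics.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_value = loaded.loc[0, "topic"]
print(type(raw_value), raw_value)  # a string that looks like a list

# Parse the text form back into real Python lists.
loaded["topic"] = loaded["topic"].apply(ast.literal_eval)
print(loaded.loc[0, "topic"])
```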

This code allows you to analyze the dominant topics in your text data and store them in a structured CSV format for further examination or visualization.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load your CSV data into a DataFrame
data = pd.read_csv('WhatsappChat.csv')

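# Preprocess the text data
stop_words = set(stopwords.words('english'))  # Build the stop-word set once, not per message

def preprocess_text(text):
    tokens = word_tokenize(text)  # Split the message into tokens
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Keep alphabetic tokens, lowercased
    tokens = [word for word in tokens if word not in stop_words]  # Drop stop words
    return " ".join(tokens)  # Join tokens into a single string

# Treat missing messages as empty text so word_tokenize never receives NaN
data['Content'] = data['Content'].fillna('').astype(str).apply(preprocess_text)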

# Create a TfidfVectorizer to convert text data into numerical features
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Content'])

# Build an LDA model
lda_model = LatentDirichletAllocation(n_components=100, random_state=42)
lda_model.fit(tfidf_matrix)

# Get the top N topic words and weights for each topic
N = 5  # Change N to the desired number of top words per topic
topic_terms = tfidf_vectorizer.get_feature_names_out()
topic_weights = lda_model.components_

# Collect the top words and the maximum weight for each topic
topic_data = []
for i, topic_weight in enumerate(topic_weights):
    top_word_indices = topic_weight.argsort()[:-N-1:-1]  # Get top word indices
    top_words = [topic_terms[idx] for idx in top_word_indices]  # Get top words
    max_weight = topic_weight.max()  # Get max weight
    topic_data.append((top_words, max_weight))

df_topics = pd.DataFrame(topic_data, columns=['topic', 'weight'])

# Save topics and weights to a CSV file
df_topics.to_csv('topics_weights.csv', index=False)
# print(topic_data)

The code below can be used to analyse the data or topics in more detail. The code above is an improved version of it, so use the code above for topic modeling.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import pandas as pd
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load your CSV data into a DataFrame
data = pd.read_csv('WhatsappChat.csv')

# Preprocess the text data
def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Remove punctuation and convert to lowercase
    tokens = [word.lower() for word in tokens if word.isalpha()]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

# Apply preprocessing to the 'Content' column
data['Content'] = data['Content'].apply(preprocess_text)

# Create a dictionary and a corpus for topic modeling
dictionary = corpora.Dictionary(data['Content'])
corpus = [dictionary.doc2bow(text) for text in data['Content']]

# Build an LDA model
lda_model = gensim.models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)

# Extract and print topics
topics = lda_model.print_topics()
for topic in topics:
    print(topic)


# Save topics to 'topics.txt'
with open('topics.txt', 'w') as file:
    for topic in topics:
        file.write(f'Topic {topic[0]}: {topic[1]}\n')


(0, '0.019*"good" + 0.013*"would" + 0.009*"morning" + 0.007*"thing" + 0.007*"well" + 0.007*"thanks" + 0.007*"love" + 0.007*"help" + 0.007*"send" + 0.007*"please"')
(1, '0.029*"https" + 0.017*"ok" + 0.014*"yes" + 0.014*"data" + 0.012*"work" + 0.010*"analysis" + 0.010*"open" + 0.007*"see" + 0.006*"model" + 0.005*"pls"')
(2, '0.063*"media" + 0.062*"omitted" + 0.017*"power" + 0.016*"bi" + 0.015*"data" + 0.007*"tableau" + 0.006*"year" + 0.006*"workshop" + 0.005*"guys" + 0.005*"know"')
(3, '0.015*"data" + 0.009*"see" + 0.009*"guys" + 0.009*"mda" + 0.008*"interesting" + 0.008*"power" + 0.008*"work" + 0.007*"also" + 0.007*"excel" + 0.007*"collaboration"')
(4, '0.033*"okay" + 0.020*"thanks" + 0.014*"mda" + 0.011*"boss" + 0.009*"use" + 0.009*"extract" + 0.008*"one" + 0.008*"work" + 0.007*"try" + 0.007*"please"')
(5, '0.028*"data" + 0.012*"analyst" + 0.010*"work" + 0.009*"job" + 0.009*"one" + 0.009*"want" + 0.008*"still" + 0.008*"welcome" + 0.008*"working" + 0.008*"expenditure"')
(6, '0.035*"than