
In [25]:
%pip install numpy pandas nltk scikit-learn pyLDAvis



In [26]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk.corpus import stopwords
import pyLDAvis


nltk.download('stopwords')




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Load the Dataset

In [27]:
data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

documents = data.data[:1000]

Preprocessing - Tokenization and Stopword Removal

In [29]:
stop_words = stopwords.words('english')

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=stop_words)


dtm = vectorizer.fit_transform(documents)

Apply LDA Model

In [30]:
lda = LatentDirichletAllocation(n_components=5, random_state=42)


lda.fit(dtm)
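A fitted model exposes two matrices that the later cells rely on. The sketch below fits LDA on a small random count matrix (toy_counts is made up for illustration and has no real topics) just to show their shapes: components_ is topics by terms and is not normalized, while transform returns one topic distribution per document.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 20 documents, 12 vocabulary terms
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(20, 12))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=42)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

# Topic-term weights: one row per topic, one column per term
print(toy_lda.components_.shape)

# Document-topic proportions: one row per document, one column per topic
print(toy_doc_topics.shape)

# Each document's topic proportions form a probability distribution
print(np.allclose(toy_doc_topics.sum(axis=1), 1.0))
```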

Display the Topics

In [31]:
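def display_topics(model, feature_names, no_top_words):
  """Print the highest-weighted words of each topic in the fitted model."""
  for idx, topic in enumerate(model.components_):
    print(f'Topic: {idx}')
    # argsort is ascending, so the last no_top_words indices are the top words;
    # each printed line therefore ends with the topic's strongest word.
    print(' | '.join([feature_names[i] for i in topic.argsort()[-no_top_words:]]))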


feature_names = vectorizer.get_feature_names_out()

display_topics(lda, feature_names, 10)



Topic: 0
also | data | 50 | 00 | one | shuttle | would | nasa | good | space
Topic: 1
like | also | god | think | see | know | jesus | one | would | people
Topic: 2
period | 3t | 0d | 00 | ql | 1t | 04 | 145 | max | ax
Topic: 3
health | know | time | problem | like | thanks | please | get | would | use
Topic: 4
us | know | also | think | get | argument | people | father | would | one
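
The word lists above run from weakest to strongest, because argsort sorts in ascending order. If you would rather read the strongest word first, a small variant could look like this (top_words_descending and the toy arrays are hypothetical, not part of the notebook):

```python
import numpy as np

def top_words_descending(topic_weights, feature_names, n_top_words):
    """Return the n_top_words highest-weighted words, strongest first."""
    # Reverse the ascending argsort, then keep the first n_top_words indices
    top_idx = np.argsort(topic_weights)[::-1][:n_top_words]
    return [str(feature_names[i]) for i in top_idx]

# Made-up weights for a single topic, for illustration only
toy_features = np.array(["space", "nasa", "shuttle", "also"])
toy_weights = np.array([9.0, 4.0, 6.0, 1.0])

toy_top = top_words_descending(toy_weights, toy_features, 3)
print(toy_top)
```

With the fitted model, the same helper would be called once per row of lda.components_.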


Visualize the Topics with pyLDAvis

In [32]:
import pyLDAvis
print('pyLDAvis version:', pyLDAvis.__version__)

pyLDAvis version: 3.4.0


In [33]:
from pyLDAvis import prepare

In [34]:
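# components_ holds unnormalized topic-word weights; normalize each row to a distribution
topic_term_dists = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Per-document topic proportions (each row sums to 1)
doc_topic_dists = lda.transform(dtm)

# Number of counted tokens in each document
doc_lengths = np.asarray(dtm.sum(axis=1)).ravel()

vocab = vectorizer.get_feature_names_out()

# Corpus-wide count of each vocabulary term
term_frequency = np.asarray(dtm.sum(axis=0)).ravel()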

vis = prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)

pyLDAvis.enable_notebook()
vis
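
The five inputs to prepare have to agree with each other: to my understanding, both distribution matrices must have rows that sum to 1, and the vocabulary, term frequencies and document lengths must line up with the matrix dimensions. A quick sanity check along these lines can catch a mismatch early (check_prepare_inputs is a hypothetical helper and the toy arrays are made up; this is a sketch, not pyLDAvis's own validation):

```python
import numpy as np

def check_prepare_inputs(topic_term_dists, doc_topic_dists,
                         doc_lengths, vocab, term_frequency):
    """Return a list of human-readable problems; an empty list means OK."""
    topic_term_dists = np.asarray(topic_term_dists)
    doc_topic_dists = np.asarray(doc_topic_dists)
    n_topics, n_terms = topic_term_dists.shape
    n_docs = doc_topic_dists.shape[0]

    problems = []
    if doc_topic_dists.shape[1] != n_topics:
        problems.append("doc_topic_dists has the wrong number of topics")
    if len(vocab) != n_terms:
        problems.append("vocab length differs from number of terms")
    if len(term_frequency) != n_terms:
        problems.append("term_frequency length differs from number of terms")
    if len(doc_lengths) != n_docs:
        problems.append("doc_lengths length differs from number of documents")
    if not np.allclose(topic_term_dists.sum(axis=1), 1.0):
        problems.append("topic_term_dists rows do not sum to 1")
    if not np.allclose(doc_topic_dists.sum(axis=1), 1.0):
        problems.append("doc_topic_dists rows do not sum to 1")
    return problems

# Made-up example: 2 topics, 3 terms, 3 documents
raw_components = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 0.0]])
normalized = raw_components / raw_components.sum(axis=1)[:, np.newaxis]
toy_doc_topics = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
toy_lengths = [4, 3, 5]
toy_vocab = ["a", "b", "c"]
toy_term_freq = [3, 4, 1]

ok = check_prepare_inputs(normalized, toy_doc_topics,
                          toy_lengths, toy_vocab, toy_term_freq)
bad = check_prepare_inputs(raw_components, toy_doc_topics,
                           toy_lengths, toy_vocab, toy_term_freq)
print(ok)
print(bad)
```

The second call passes the raw weights on purpose, to show why the normalization step is needed before calling prepare.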

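Success! From 1,000 newsgroup posts, LDA recovered several interpretable topics: Topic 0 is clearly about space and NASA, and Topic 1 is about religion. Not every topic is clean, though: Topic 2 is dominated by noise tokens such as "ax" and "max".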