## NLP Topic Modeling Exercise

In [1]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

* create a variable called `no_features` and set its value to 100.

In [3]:
no_features = 100

* create a variable called `no_topics` and set its value to 100.

In [4]:
no_topics = 100

## NMF

* instantiate a TfidfVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [5]:
# Instantiate TfidfVectorizer with specified parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use the `fit_transform` method of TfidfVectorizer to transform the documents

In [6]:
# Use the fit_transform method to transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

* get the feature names from TfidfVectorizer

In [7]:
# Get the feature names from TfidfVectorizer
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

* instantiate NMF and fit it to the transformed (TF-IDF) data

In [8]:
# Instantiate NMF and fit_transform the TF-IDF data,
# reusing the no_topics variable defined earlier
nmf_model = NMF(n_components=no_topics, random_state=1)
nmf_topics = nmf_model.fit_transform(tfidf_matrix)

In [9]:
# Print the shape of the NMF topic matrix
print(f"NMF topic matrix shape: {nmf_topics.shape}")

NMF topic matrix shape: (11314, 100)
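The shape above is (documents, topics). As a rough sketch of why: NMF approximates the input matrix as a product of two non-negative matrices, with `fit_transform` returning the document-topic weights and `components_` holding the topic-term weights. The small random matrix and the names `X_toy`, `W`, and `H` below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 6 "documents" by 5 "terms"
rng = np.random.RandomState(0)
X_toy = rng.rand(6, 5)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(X_toy)  # document-topic weights
H = toy_nmf.components_           # topic-term weights

print(W.shape, H.shape)
# W @ H has the same shape as X_toy and approximates it
print((W @ H).shape)
```

The same relationship holds in the exercise: `nmf_topics` plays the role of `W`, and `nmf_model.components_` has one row per topic and one column per TF-IDF feature.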


## LDA w/ Sklearn

* instantiate a CountVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [10]:
# Instantiate CountVectorizer with specified parameters
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use the `fit_transform` method of CountVectorizer to transform the documents

In [11]:
# Use the fit_transform method to transform the documents
count_matrix = count_vectorizer.fit_transform(documents)

* get the feature names from CountVectorizer

In [12]:
# Get the feature names from CountVectorizer
count_feature_names = count_vectorizer.get_feature_names_out()

* instantiate LatentDirichletAllocation and fit it to the transformed (count) data

In [13]:
# Instantiate LatentDirichletAllocation and fit_transform the count data,
# reusing the no_topics variable defined earlier
lda_model = LatentDirichletAllocation(n_components=no_topics, random_state=1)
lda_topics = lda_model.fit_transform(count_matrix)

In [14]:
# Print the shape of the LDA topic matrix
print(f"LDA topic matrix shape: {lda_topics.shape}")

LDA topic matrix shape: (11314, 100)
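As with NMF, each row of the LDA output corresponds to one document. One difference worth checking is that scikit-learn's LDA returns a normalized topic distribution per document, so each row sums to 1. The sketch below checks this on a made-up corpus; `toy_docs`, `toy_counts`, and `toy_lda` are illustrative names, not part of the exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus for illustration
toy_docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the rocket flew to the moon",
    "the moon orbits the earth",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)

print(toy_doc_topics.shape)        # (documents, topics)
print(toy_doc_topics.sum(axis=1))  # each row is a probability distribution
```

This is why LDA needs raw counts from `CountVectorizer` rather than TF-IDF weights: the model treats each document as a bag of word counts generated from a mixture of topics.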


* create a function `display_topics` that displays the top words of each topic for any fitted model

* display top 10 words from each topic from NMF model

In [17]:
# function to display topics for NMF model
def display_nmf_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [18]:
# display the top words in each topic for the NMF model
no_top_words = 10  # Number of top words to display for each topic
display_nmf_topics(nmf_model, tfidf_feature_names, no_top_words)

Topic 0:
people know mr 14 different 25 set read let ll
Topic 1:
does know 14 set different read available 25 question didn
Topic 2:
know does read question didn years god don drive edu
Topic 3:
edu 14 file mr set different read available 25 let
Topic 4:
just a86 things don years going doesn drive edu fact
Topic 5:
like mr read different 25 file available set ll question
Topic 6:
just years good doesn don drive edu fact far file
Topic 7:
use max set don read 10 question years god drive
Topic 8:
thanks max set read file does question good doesn don
Topic 9:
good mr different read let available ll key didn question
Topic 10:
think don set read question didn case drive edu fact
Topic 11:
god things don mr jesus believe ll new let question
Topic 12:
problem 14 file question didn years going drive edu fact
Topic 13:
windows read set different 25 key question g9v doesn don
Topic 14:
drive max different mr set read 25 file key question
Topic 15:
time max ll 10 let question didn know key drive

* display top 10 words from each topic from LDA model

In [19]:
# function to display topics
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
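The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` is compact but easy to misread. A small sketch with made-up weights shows what it selects: `argsort()` orders indices from smallest to largest weight, and the negative-step slice walks backwards from the end, taking the `no_top_words` largest. The arrays `weights` and `words` below are hypothetical.

```python
import numpy as np

# Hypothetical topic weights and matching vocabulary
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
words = np.array(["apple", "banana", "cherry", "date", "elder"])
n_top = 3

ascending = weights.argsort()          # indices from smallest to largest weight
top_idx = ascending[:-n_top - 1:-1]    # last n_top indices, in descending order
equivalent = ascending[::-1][:n_top]   # same result, written in two steps

top_words = [str(words[i]) for i in top_idx]
print(top_words)  # ['date', 'banana', 'elder']
```

The two-step form `argsort()[::-1][:n_top]` gives the same indices and may be easier to read.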

In [20]:
# display the top words in each topic for the LDA model
no_top_words = 10  # Number of top words to display for each topic
display_topics(lda_model, count_feature_names, no_top_words)

Topic 0:
got just come way like make good don right didn
Topic 1:
say just don way like come let fact course make
Topic 2:
said people didn know don like say just time did
Topic 3:
best way don come just make course good does know
Topic 4:
list way know don new use point ll need like
Topic 5:
run just like way don make come course time using
Topic 6:
ve just like way don know got time years think
Topic 7:
list long like just good come believe 15 number way
Topic 8:
problem using time just don try way make does did
Topic 9:
point just like course good make don come fact know
Topic 10:
00 20 15 new 10 25 world software list edu
Topic 11:
probably way don just like make come know little really
Topic 12:
things way like come don believe know just people doesn
Topic 13:
using use way used does make work information like available
Topic 14:
said just like people way don come time believe say
Topic 15:
sure just like don way make know believe come good
Topic 16:
use way don just like using kn

### Stretch: Use LDA w/ Gensim to do the same thing.