## NLP Topic Modeling Exercise

In [2]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [3]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

* create a variable called `no_features` and set its value to 100.
* create a variable called `no_topics` and set its value to 100.

In [6]:
# setting the number of features and topics for the LDA
no_features = 100
no_topics = 100 

Clarity:
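* The number of features is the vocabulary size. The vectorizer keeps at most this many terms (the most frequent ones across the corpus), and each kept term becomes a column holding a raw word count (`CountVectorizer`) or a TF-IDF weight (`TfidfVectorizer`). It is passed to the vectorizers below as `max_features`.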

* The number of topics is used later by the topic models (NMF and LDA). **The topic models don't interact with the vectorizer directly; they work on the document-term matrix generated by the vectorizer.**

## NMF

* instantiate a TfidfVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [5]:

# Instantiate the TfidfVectorizer with the specified parameters
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95,           # Ignore terms that appear in more than 95% of the documents - drops corpus-wide filler words
    min_df=2,              # Ignore terms that appear in fewer than 2 documents - reduces noise from rare terms
    max_features=no_features,  # Limit the number of features (words)
    stop_words='english'   # Use English stop words to ignore common words like 'the', 'and', etc.
)

# vectorizer now ready to transform documents

* use fit_transform method of TfidfVectorizer to transform the documents

In [7]:
# fit transform the documents and store it in the matrix variable for NMF
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

* get the feature names from TfidfVectorizer

In [8]:
feature_names = tfidf_vectorizer.get_feature_names_out()
print(feature_names)

['00' '10' '12' '14' '15' '16' '20' '25' 'a86' 'available' 'ax' 'b8f'
 'believe' 'best' 'better' 'bit' 'case' 'com' 'come' 'course' 'data' 'day'
 'did' 'didn' 'different' 'does' 'doesn' 'don' 'drive' 'edu' 'fact' 'far'
 'file' 'g9v' 'god' 'going' 'good' 'got' 'government' 'help' 'information'
 'jesus' 'just' 'key' 'know' 'law' 'let' 'like' 'line' 'list' 'little'
 'll' 'long' 'look' 'lot' 'mail' 'make' 'max' 'mr' 'need' 'new' 'number'
 'people' 'point' 'power' 'probably' 'problem' 'program' 'question' 'read'
 'really' 'right' 'run' 'said' 'say' 'second' 'set' 'software' 'space'
 'state' 'sure' 'tell' 'thanks' 'thing' 'things' 'think' 'time' 'true'
 'try' 'use' 'used' 'using' 've' 'want' 'way' 'windows' 'work' 'world'
 'year' 'years']


* instantiate NMF and fit transformed data

In [9]:
# as we defined no_topics above
no_topics = 100 

# Instantiate the NMF model
nmf_model = NMF(n_components=no_topics, random_state=42)

# Fit the NMF model to the tf-idf matrix and transform it
nmf_topic_matrix = nmf_model.fit_transform(tfidf_matrix)

# Print the resulting topic matrix (optional)
print(nmf_topic_matrix)

[[0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.34138105e-01 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 1.15384918e-01]
 [1.16885597e-08 0.00000000e+00 2.47453911e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  8.05256725e-21 4.92142279e-21]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 1.18602982e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


## LDA w/ Sklearn

* instantiate a CountVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [10]:
# instantiate CountVectorizer for LDA

count_vectorizer = CountVectorizer(
    max_df=0.95,           # Ignore terms that appear in more than 95% of the documents - drops corpus-wide filler words
    min_df=2,              # Ignore terms that appear in fewer than 2 documents - reduces noise from rare terms
    max_features=no_features,  # Limit the number of features (words)
    stop_words='english'   # Use English stop words to ignore common words like 'the', 'and', etc.
)

* use fit_transform method of CountVectorizer to transform documents

In [11]:
# make the count matrix

count_matrix = count_vectorizer.fit_transform(documents)

* get the feature names from CountVectorizer

In [12]:
feature_names = count_vectorizer.get_feature_names_out()

print(feature_names)

['00' '10' '12' '14' '15' '16' '20' '25' 'a86' 'available' 'ax' 'b8f'
 'believe' 'best' 'better' 'bit' 'case' 'com' 'come' 'course' 'data' 'day'
 'did' 'didn' 'different' 'does' 'doesn' 'don' 'drive' 'edu' 'fact' 'far'
 'file' 'g9v' 'god' 'going' 'good' 'got' 'government' 'help' 'information'
 'jesus' 'just' 'key' 'know' 'law' 'let' 'like' 'line' 'list' 'little'
 'll' 'long' 'look' 'lot' 'mail' 'make' 'max' 'mr' 'need' 'new' 'number'
 'people' 'point' 'power' 'probably' 'problem' 'program' 'question' 'read'
 'really' 'right' 'run' 'said' 'say' 'second' 'set' 'software' 'space'
 'state' 'sure' 'tell' 'thanks' 'thing' 'things' 'think' 'time' 'true'
 'try' 'use' 'used' 'using' 've' 'want' 'way' 'windows' 'work' 'world'
 'year' 'years']


* instantiate LatentDirichletAllocation and fit transformed data 

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

# Reuse the number of topics defined earlier (100)
n_topics = no_topics

# Instantiate the LDA model
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)

# Fit the LDA model to the word count matrix
lda_topic_matrix = lda_model.fit_transform(count_matrix)

# Get the feature names (words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Print the top words for each topic
for topic_idx, topic in enumerate(lda_model.components_):  # enumerate yields both the index (topic_idx) and the value (topic)
    print(f"Topic {topic_idx}:")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-11:-1]]))  # Top 10 words for each topic

Topic 0:
jesus god know way said really think world read does
Topic 1:
edu mail like just people max new don time way
Topic 2:
data use way just like make don better time want
Topic 3:
jesus people come does make didn course tell look like
Topic 4:
long just way time like years people know doesn look
Topic 5:
ax max g9v b8f 25 a86 mr 14 16 good
Topic 6:
line ll just look like better good way 14 time
Topic 7:
jesus god say did don 14 point read fact really
Topic 8:
mr going know don think time ll just day said
Topic 9:
law fact does people way say think use make come
Topic 10:
ax b8f a86 g9v max 14 mr 25 10 good
Topic 11:
state don say better way going time does think good
Topic 12:
10 20 15 14 25 16 12 00 ll list
Topic 13:
best better good way probably going just say think doesn
Topic 14:
key use like using don used way time doesn probably
Topic 15:
world better 20 information new like just mail used know
Topic 16:
right just way people like going say don doesn look
Topic 17:
different

## Create a function `display_topics` that is able to display the top words in a topic for different models

### Notes:
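Both LDA and NMF models have a `components_` attribute, which contains the topic-word weights (i.e., the importance of each word for each topic), with one row per topic and one column per word. For LDA these weights are unnormalized; dividing each row by its sum gives a probability distribution over words.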

The `CountVectorizer` (for LDA) and `TfidfVectorizer` (for NMF) provide the feature names (words) using the `get_feature_names_out()` method.

Vectorizer creates the vocabulary and transforms the text into a matrix (TF-IDF or raw counts).

LDA/NMF works on that matrix to find topics but doesn't store the word names (just word distributions).

You need to explicitly pass the feature_names (from the vectorizer) to the function so that the indices of words in the model components can be mapped back to the actual terms.

In [14]:
def display_topics(model, feature_names, no_top_words=10):
    # model is a trained LDA or NMF model; both expose a components_ attribute holding the topic-word weights
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")

        # Get the indices of the top words for each topic.
        # argsort returns indices in ascending order of weight, so the slice walks
        # backwards from the end (largest weight) and keeps no_top_words indices.
        top_word_indices = topic.argsort()[:-no_top_words - 1:-1]

        # Map each index back to the actual word using the vectorizer's feature names
        top_words = [feature_names[i] for i in top_word_indices]
        print(" ".join(top_words))


### Explaining the above function:

* `for topic_idx, topic in enumerate(...)` is used because we want both the topic index and the topic's word weights from the model. The model doesn't store the word names, only the weights.

* `argsort` returns the **indices that would sort the weights**, and it sorts in **ascending order by default**. The slice `topic.argsort()[:-no_top_words - 1:-1]` therefore starts at the very end of that array (the index of the **largest** weight), steps backwards (the final `:-1`), and stops after `no_top_words` indices, so the result is the top words in descending order of weight.

* `top_words` gives the **names of the words, which come from the vectorizer, not the model!**
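The slicing trick is easier to follow on a tiny array. The sketch below uses a hypothetical `weights` vector (not taken from either model) and shows that `argsort` sorts ascending, that the reversed slice picks out the largest entries, and that negating the weights is an equivalent way to get the same indices.

```python
import numpy as np

# Hypothetical word weights for a single topic.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])

# argsort sorts ascending: index of the smallest weight first.
order = weights.argsort()
print(order)        # [2 0 4 3 1]

# Walk backwards from the end and keep 3 indices: the 3 largest weights.
no_top_words = 3
top = weights.argsort()[:-no_top_words - 1:-1]
print(top)          # [1 3 4]

# Equivalent formulation: sort the negated weights and take the first 3.
alt = np.argsort(-weights)[:no_top_words]
print(alt)          # [1 3 4]
```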

* display top 10 words from each topic from NMF model

In [16]:
# For NMF model
nmf_model = NMF(n_components=10, random_state=42)
nmf_topic_matrix = nmf_model.fit_transform(tfidf_matrix)
feature_names_nmf = tfidf_vectorizer.get_feature_names_out()

display_topics(nmf_model, feature_names_nmf) # default 10 top words but you can use more or less, specify as 3rd argument

Topic 0:
don think know want say really try better lot need
Topic 1:
use windows using problem used file program software drive need
Topic 2:
does know let doesn help work need want say question
Topic 3:
edu mail com available information need best list 20 new
Topic 4:
just right ve way work got say doesn little ll
Topic 5:
like look things make doesn lot sure really thing ll
Topic 6:
god believe jesus say true question fact world things way
Topic 7:
people government right law state said world point fact say
Topic 8:
thanks mail help information need com tell drive list software
Topic 9:
good time new year years did ve got make way


* display top 10 words from each topic from LDA model

In [17]:
# example usage of the above function

# For LDA model
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_topic_matrix = lda_model.fit_transform(count_matrix)
feature_names_lda = count_vectorizer.get_feature_names_out()

display_topics(lda_model, feature_names_lda)

Topic 0:
god jesus believe people say does things know just think
Topic 1:
edu com 00 new mail list available information best drive
Topic 2:
use windows key data available using software used information bit
Topic 3:
space thanks drive problem does program know help work need
Topic 4:
people government year said years right world new did make
Topic 5:
ax max g9v b8f a86 14 mr 25 ll 12
Topic 6:
file power line information program second read case use number
Topic 7:
10 15 20 16 12 25 14 00 new year
Topic 8:
don just like think know good ve going time ll
Topic 9:
law question time true bit used fact day point does


Some topics show up in both models, e.g. NMF topic 6 and LDA topic 0 both centre on religion (god, believe, jesus).
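One rough way to quantify that similarity is the Jaccard overlap between the two top-10 word lists. The sketch below copies the two lists from the printed output above; the `jaccard` helper is a hypothetical name introduced here, and this is only one of several reasonable ways to compare topics.

```python
# Top-10 word lists copied from the printed NMF and LDA output.
nmf_topic_6 = "god believe jesus say true question fact world things way".split()
lda_topic_0 = "god jesus believe people say does things know just think".split()

def jaccard(words_a, words_b):
    """Fraction of distinct words that appear in both lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

shared = sorted(set(nmf_topic_6) & set(lda_topic_0))
print(shared)                                        # ['believe', 'god', 'jesus', 'say', 'things']
print(round(jaccard(nmf_topic_6, lda_topic_0), 2))   # 0.33
```

Five of the ten words are shared, so the two models agree on the core of this topic even though the remaining words differ.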

### Stretch: Use LDA w/ Gensim to do the same thing.

In [18]:
# to be completed 