# Topic Modeling 


### What is NMF?
Non-Negative Matrix Factorization (NMF) is a technique that helps us find hidden themes or topics in a collection of texts, like news articles, tweets, or research papers.

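It works by approximating a non-negative document-term matrix as the product of two smaller non-negative matrices. Because no value can be negative, each document is built up additively from topics, and each topic is built up additively from words. In simple terms, it finds patterns where certain words often appear together, forming a topic.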
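To make the idea of additive parts concrete, here is a minimal sketch on a made-up 4 x 4 matrix. The values in `X_toy`, the names `W_toy` and `H_toy`, and the choice of two components are assumptions for illustration only. The sketch shows that NMF returns two non-negative matrices whose product approximates the input.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document-term" matrix: 4 documents x 4 terms (made-up values).
X_toy = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W_toy = model.fit_transform(X_toy)   # shape (4, 2): documents x topics
H_toy = model.components_            # shape (2, 4): topics x terms

# Multiplying the two factors gives an additive approximation of X_toy.
X_approx = W_toy @ H_toy
print(W_toy.shape, H_toy.shape)
print(np.round(X_approx, 2))
```

The reconstruction is only approximate, since two topics cannot capture every detail of the original matrix.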



### Why do we use NMF?
We use NMF when we want to understand what topics are being discussed across many documents without reading them all.

Think of it like this: if you had a big pile of books or emails, NMF could help you automatically organize them into categories based on the words they contain, such as:

1. Health

2. Finance

3. Technology

NMF produces two useful outputs:

1. A list of topics, each represented by its top words (like “doctor, health, exercise” for health).

2. For each document, a set of weights showing how much it talks about each topic.


### When should you use NMF?

Use NMF when:

1. You have a lot of unstructured text data (e.g., customer reviews, articles, transcripts).

2. You want to automatically discover the main themes or categories.


# Sample Documents (10 Docs)

In [1]:
documents = [
    "The cat sat on the mat.",
    "Dogs are man's best friend.",
    "Cats and dogs are both popular pets.",
    "Artificial intelligence is transforming the world.",
    "Machine learning and deep learning are subsets of AI.",
    "The stock market crashed due to economic instability.",
    "Investors are cautious about tech stocks in 2025.",
    "Nutrition and exercise are important for good health.",
    "Doctors recommend eating more vegetables and fruits.",
    "Health experts suggest daily physical activity."
]

documents

['The cat sat on the mat.',
 "Dogs are man's best friend.",
 'Cats and dogs are both popular pets.',
 'Artificial intelligence is transforming the world.',
 'Machine learning and deep learning are subsets of AI.',
 'The stock market crashed due to economic instability.',
 'Investors are cautious about tech stocks in 2025.',
 'Nutrition and exercise are important for good health.',
 'Doctors recommend eating more vegetables and fruits.',
 'Health experts suggest daily physical activity.']

# Step 1: TF-IDF Vectorization

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
X.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.5182909 ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.44059462, 0.        , 0.        ,
        0.        , 0.        , 0.5182909 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.5182909 , 0.        , 0. 

# Step 2: Apply NMF

The cells below fit the model and display `H`, the topic-word matrix. It is a matrix of numbers that shows how important each word is to each topic:

Each row is one topic.

Each column is one word from your vocabulary.

The number at each position tells you how strongly that word contributes to that topic.

In [8]:
from sklearn.decomposition import NMF

n_topics = 3
nmf = NMF(n_components=n_topics, random_state=42)
W = nmf.fit_transform(X)  # Document-topic matrix

In [5]:
H = nmf.components_       # Topic-word matrix
H

array([[1.77394889e-05, 0.00000000e+00, 1.16192698e-05, 0.00000000e+00,
        3.86395162e-01, 5.65090957e-06, 3.86395162e-01, 1.77394889e-05,
        1.77394889e-05, 0.00000000e+00, 1.16192698e-05, 1.77394889e-05,
        6.56942377e-01, 1.77394889e-05, 1.77394889e-05, 0.00000000e+00,
        0.00000000e+00, 3.86395162e-01, 1.77394889e-05, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 1.77394889e-05, 0.00000000e+00,
        1.77394889e-05, 2.32385395e-05, 1.16192698e-05, 3.86395162e-01,
        1.77394889e-05, 5.65090957e-06, 0.00000000e+00, 3.86395162e-01,
        0.00000000e+00, 3.86395162e-01, 1.77394889e-05, 5.65090957e-06,
        1.77394889e-05, 1.77394889e-05, 1.16192698e-05, 0.00000000e+00,
        1.77394889e-05, 0.00000000e+00, 1.77394889e-05, 0.00000000e+00],
       [7.89801300e-05, 3.08941273e-01, 5.65895854e-05, 0.00000000e+00,
        0.00000000e+00, 6.00045218e-05, 0.00000000e+00, 7.89801300e-05,
        7.89801300e-05, 3.08941273e-01, 5.65895854e-05, 7.89801...
(output truncated)
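The notebook computes `W`, the document-topic matrix, but never displays it. The sketch below is one way to read it, assuming the same vectorizer and NMF settings as above. Picking the largest weight per row with `argmax` is a simplification (the name `dominant_topic` is an illustrative choice), because a document can carry weight on several topics at once.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "Dogs are man's best friend.",
    "Cats and dogs are both popular pets.",
    "Artificial intelligence is transforming the world.",
    "Machine learning and deep learning are subsets of AI.",
    "The stock market crashed due to economic instability.",
    "Investors are cautious about tech stocks in 2025.",
    "Nutrition and exercise are important for good health.",
    "Doctors recommend eating more vegetables and fruits.",
    "Health experts suggest daily physical activity.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)
nmf = NMF(n_components=3, random_state=42)
W = nmf.fit_transform(X)   # shape: (n_documents, n_topics)

# For each document, pick the topic with the largest weight.
dominant_topic = W.argmax(axis=1)
for doc, topic_idx, weights in zip(documents, dominant_topic, W):
    print(f"Topic #{topic_idx + 1}  {np.round(weights, 2)}  {doc}")
```

With only ten short documents and three topics, some documents may have near-zero weight on every topic, so the assignments are best treated as a rough guide.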

# Step 3: Top Words per Topic

In [6]:
feature_names = vectorizer.get_feature_names_out()
n_top_words = 5

for topic_idx, topic in enumerate(H):
    print(f"\nTopic #{topic_idx+1}:")
    top_features = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    print(", ".join(top_features))



Topic #1:
dogs, popular, pets, friend, man

Topic #2:
health, good, nutrition, important, exercise

Topic #3:
world, transforming, artificial, intelligence, learning
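The slice `topic.argsort()[:-n_top_words - 1:-1]` used in Step 3 is compact but easy to misread. The sketch below walks through it on a made-up four-word vocabulary. The names `words`, `scores`, and `n_top` are hypothetical and not part of the notebook.

```python
import numpy as np

words = np.array(["apple", "banana", "cherry", "date"])   # hypothetical vocabulary
scores = np.array([0.1, 0.9, 0.3, 0.7])                    # hypothetical topic weights
n_top = 2

ascending = scores.argsort()           # indices ordered from smallest to largest score
top_idx = ascending[:-n_top - 1:-1]    # walk backwards from the end, taking n_top indices
top_words = [str(w) for w in words[top_idx]]
print(top_words)                       # ['banana', 'date']

# Arguably more readable alternative: sort descending, then take the first n_top.
alt_idx = np.argsort(-scores)[:n_top]
```

Both forms select the same indices here. When scores are tied, the two may order the tied words differently.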
