### What is TfidfVectorizer?
The `TfidfVectorizer` is a tool provided by the `scikit-learn` library in Python for text feature extraction. It transforms text data into a matrix of TF-IDF features, which stands for Term Frequency-Inverse Document Frequency. This transformation helps in converting textual data into numerical data that machine learning models can process.

### How is TF-IDF Calculated?
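TF-IDF is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus). The definitions below are the common textbook form. With default settings, `TfidfVectorizer` instead uses the raw term count for TF, a smoothed IDF of \( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \) (where \( n \) is the number of documents and \( \text{df}(t) \) is the number of documents containing \( t \)), and then L2-normalizes each document vector, so its values differ from these formulas.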

1. **Term Frequency (TF)**:
   - This measures how frequently a term appears in a document. It is calculated as:
     $$
     \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
     $$

2. **Inverse Document Frequency (IDF)**:
   - This measures how rare, and therefore how informative, a term is across the entire corpus: terms that occur in few documents get a high IDF. It is calculated as:
     $$
     \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
     $$

3. **TF-IDF**:
   - This is the product of TF and IDF:
     $$
     \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
     $$
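As a worked illustration of the three formulas above, here is a minimal pure-Python sketch. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for this example. It uses the natural logarithm and does not handle terms that appear in no document.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term);
    # raises ZeroDivisionError if the term occurs nowhere
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tf_idf("the", corpus[0], corpus)
score_cat = tf_idf("cat", corpus[0], corpus)
print(f"the: {score_the:.3f}, cat: {score_cat:.3f}")
```

Although "the" occurs twice in the first document and "cat" only once, "cat" scores higher because it appears in just one of the three documents.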

### How Does the CountVectorizer Differ from the TfidfVectorizer?
- **CountVectorizer**:
  - Converts text data into a matrix of token counts. It simply counts the number of occurrences of each term in each document.
  - Example: If the term 'data' appears 3 times in a document, the CountVectorizer will assign it a value of 3.

- **TfidfVectorizer**:
  - Converts text data into a matrix of TF-IDF features, which take into account both the frequency of a term in a document and how common the term is across the entire corpus.
  - Example: The term 'data' might appear frequently in many documents, so its TF-IDF value will be lower than a term that appears frequently in only a few documents.

### How to Instantiate a TfidfVectorizer with Specific Parameters
To create a `TfidfVectorizer` with specific parameters, pass them as keyword arguments to the constructor:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

no_features = 1000  # Example value for the number of features

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

### What are the Methods of the TfidfVectorizer Object?
The `TfidfVectorizer` object exposes several important methods and fitted attributes:

- **fit**: Learns the vocabulary and IDF from the training set.
- **transform**: Transforms the documents to the TF-IDF representation using the learned vocabulary and IDF.
- **fit_transform**: Combines the `fit` and `transform` methods to learn the vocabulary and IDF and transform the documents in one step.
- **get_feature_names_out**: Returns the feature names (terms) in the learned vocabulary.
- **vocabulary_** (attribute, available after fitting): A dictionary mapping each term to its column index in the feature matrix.
- **idf_** (attribute, available after fitting): The array of inverse document frequency values, one per term.

### What is NMF?
NMF stands for Non-negative Matrix Factorization. It is a group of algorithms in multivariate analysis and linear algebra where a matrix \( V \) is factorized into (usually) two matrices \( W \) and \( H \) such that:

$$
V \approx WH
$$

NMF is used for dimensionality reduction, topic modeling, and source separation, with the constraint that all three matrices \( V \), \( W \), and \( H \) have no negative elements.
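To make the factorization concrete, here is a small pure-Python sketch using multiplicative update rules, one classic way to fit NMF. It is an illustration only and not necessarily the solver scikit-learn uses. The matrix `V`, the helper names, and the iteration count are assumptions made for this example.

```python
import random

random.seed(0)

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

def scale(M, num, den, eps=1e-9):
    # Element-wise M * num / den; eps guards against division by zero
    return [[m * a / (b + eps) for m, a, b in zip(rm, ra, rb)]
            for rm, ra, rb in zip(M, num, den)]

def nmf_multiplicative(V, rank, n_iter=200):
    rows, cols = len(V), len(V[0])
    W = [[random.random() for _ in range(rank)] for _ in range(rows)]
    H = [[random.random() for _ in range(cols)] for _ in range(rank)]
    initial_error = frobenius_error(V, W, H)
    for _ in range(n_iter):
        Wt = transpose(W)
        H = scale(H, matmul(Wt, V), matmul(matmul(Wt, W), H))  # H <- H * (W^T V) / (W^T W H)
        Ht = transpose(H)
        W = scale(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))  # W <- W * (V H^T) / (W H H^T)
    return W, H, initial_error

# Hypothetical non-negative matrix: 4 "documents" x 3 "terms"
V = [[5, 3, 0], [4, 2, 0], [0, 0, 4], [0, 1, 5]]
W, H, initial_error = nmf_multiplicative(V, rank=2)
final_error = frobenius_error(V, W, H)
print(f"error before: {initial_error:.3f}, after: {final_error:.3f}")
```

Because each update only multiplies by non-negative ratios, `W` and `H` stay non-negative throughout, which is the defining constraint of NMF.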

### What is LDA?
LDA stands for Latent Dirichlet Allocation. It is a generative probabilistic model used in natural language processing and machine learning to discover abstract topics within a collection of documents. LDA assumes that documents are mixtures of topics and that topics are mixtures of words. The process involves:

- Assigning topics to each word in a document.
- Estimating the distribution of topics in each document.
- Estimating the distribution of words in each topic.

LDA is widely used for topic modeling and discovering hidden thematic structures in large corpora.
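The generative story behind LDA can be sketched in a few lines of standard-library Python. This is a simplified illustration: the vocabulary and the two topic-word distributions are invented and held fixed, whereas full LDA also treats the topic-word distributions as draws from a Dirichlet prior. It shows only how a document would be generated, not how topics are inferred from data.

```python
import random

random.seed(0)

vocab = ["game", "team", "score", "disk", "file", "windows"]

# Hypothetical topic-word distributions (each row sums to 1)
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports" topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "computing" topic
]

def sample_dirichlet(alpha):
    # Normalized independent gamma draws follow a Dirichlet distribution
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    theta = sample_dirichlet(alpha)  # this document's mixture over topics
    words = []
    for _ in range(n_words):
        z = random.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        w = random.choices(vocab, weights=topics[z])[0]           # pick a word from it
        words.append(w)
    return theta, words

theta, doc = generate_document(8)
print("topic mixture:", [round(t, 2) for t in theta])
print("document:", " ".join(doc))
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the per-document mixtures and the per-topic word distributions.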

In [8]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [9]:
# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

In [10]:
# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [11]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [12]:
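no_features = 100  # maximum vocabulary size passed to the vectorizers
no_topics = 100    # intended number of topics; later cells set num_topics to the same value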

### NMF

In [33]:
# Instantiate TfidfVectorizer with specified parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

1. **TfidfVectorizer**:
   - `max_df=0.95`: Ignore terms that appear in more than 95% of the documents.
   - `min_df=2`: Ignore terms that appear in fewer than 2 documents.
   - `max_features=no_features`: Keep only the `no_features` terms with the highest total frequency across the corpus.
   - `stop_words='english'`: Remove common English stop words.
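The effect of the two document-frequency thresholds can be sketched without scikit-learn. The helper `filter_by_document_frequency` and the toy documents are assumptions made for this example. It mirrors the convention that a float `max_df` is a proportion of documents and an integer `min_df` is an absolute count.

```python
def filter_by_document_frequency(docs, max_df=0.95, min_df=2):
    """Return terms whose document frequency lies within both thresholds."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

# Hypothetical tokenized documents
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]

kept = filter_by_document_frequency(docs)
print(kept)
```

Here "the" is dropped because it appears in all four documents (more than 95%), and "dog" and "bird" are dropped because each appears in only one.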

In [34]:
# Use the fit_transform method to transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

2. **Transform the Documents**:
   - `fit_transform(documents)`: Learn the vocabulary and IDF, then return the document-term matrix (one row per document, one column per term).

In [35]:
# Get the feature names from TfidfVectorizer
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

3. **Get Feature Names**:
   - `get_feature_names_out()`: Retrieve the feature names (terms) learned by the vectorizer.

In [42]:
# Print the number of features and the first 10 feature names
print(f"Number of features: {len(tfidf_feature_names)}")
print(f"First 10 feature names: {tfidf_feature_names[:10]}")

Number of features: 100
First 10 feature names: ['00' '10' '12' '14' '15' '16' '20' '25' 'a86' 'available']


In [38]:
# Instantiate NMF and fit_transform the TF-IDF data
num_topics = 100 
nmf_model = NMF(n_components=num_topics, random_state=1)
nmf_topics = nmf_model.fit_transform(tfidf_matrix)

4. **NMF (Non-negative Matrix Factorization)**:
   - `NMF(n_components=num_topics, random_state=1)`: Instantiate the NMF model with a specified number of topics.
   - `fit_transform(tfidf_matrix)`: Fit the model to the TF-IDF matrix and return the document-topic matrix (one row per document, one column per topic).

In [39]:
# Print the shape of the NMF topic matrix
print(f"NMF topic matrix shape: {nmf_topics.shape}")

NMF topic matrix shape: (11314, 100)


### LDA with Sklearn

In [25]:
# Instantiate CountVectorizer with specified parameters
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

1. **CountVectorizer**:
   - `max_df=0.95`: Ignore terms that appear in more than 95% of the documents.
   - `min_df=2`: Ignore terms that appear in fewer than 2 documents.
   - `max_features=no_features`: Keep only the `no_features` terms with the highest total frequency across the corpus.
   - `stop_words='english'`: Remove common English stop words.

In [26]:
# Use the fit_transform method to transform the documents
count_matrix = count_vectorizer.fit_transform(documents)

2. **Transform the Documents**:
   - `fit_transform(documents)`: Learn the vocabulary and return the document-term matrix of raw counts (one row per document, one column per term).

In [27]:
# Get the feature names from CountVectorizer
count_feature_names = count_vectorizer.get_feature_names_out()

3. **Get Feature Names**:
   - `get_feature_names_out()`: Retrieve the feature names (terms) learned by the vectorizer.

In [28]:
# Print the number of features and the first 10 feature names
print(f"Number of features: {len(count_feature_names)}")
print(f"First 10 feature names: {count_feature_names[:10]}")

Number of features: 100
First 10 feature names: ['00' '10' '12' '14' '15' '16' '20' '25' 'a86' 'available']


In [29]:
# Instantiate LatentDirichletAllocation and fit_transform the count data
num_topics = 100
lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=1)
lda_topics = lda_model.fit_transform(count_matrix)

4. **LatentDirichletAllocation (LDA)**:
   - `LatentDirichletAllocation(n_components=num_topics, random_state=1)`: Instantiate the LDA model with a specified number of topics.
   - `fit_transform(count_matrix)`: Fit the model to the document-term count matrix and return the document-topic matrix (one row per document, one column per topic).

In [30]:
# Print the shape of the LDA topic matrix
print(f"LDA topic matrix shape: {lda_topics.shape}")

LDA topic matrix shape: (11314, 100)


### Display Top Words

In [40]:
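# Function to display topics for NMF model
def display_nmf_topics(model, feature_names, no_top_words):
    # Each row of model.components_ holds one topic's weight for every term
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort() is ascending, so the reversed slice yields the top-weighted term indices
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))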

In [41]:
# Display the top words in each topic for the NMF model
no_top_words = 10  # Number of top words to display for each topic
display_nmf_topics(nmf_model, tfidf_feature_names, no_top_words)

Topic 0:
people know mr 14 different 25 set read let ll
Topic 1:
does know 14 set different read available 25 question didn
Topic 2:
know does read question didn years god don drive edu
Topic 3:
edu 14 file mr set different read available 25 let
Topic 4:
just a86 things don years going doesn drive edu fact
Topic 5:
like mr read different 25 file available set ll question
Topic 6:
just years good doesn don drive edu fact far file
Topic 7:
use max set don read 10 question years god drive
Topic 8:
thanks max set read file does question good doesn don
Topic 9:
good mr different read let available ll key didn question
Topic 10:
think don set read question didn case drive edu fact
Topic 11:
god things don mr jesus believe ll new let question
Topic 12:
problem 14 file question didn years going drive edu fact
Topic 13:
windows read set different 25 key question g9v doesn don
Topic 14:
drive max different mr set read 25 file key question
Topic 15:
time max ll 10 let question didn know key drive

5. **Display Topics Function**:
   - The `display_nmf_topics` function takes the NMF model, the feature names, and the number of top words to display per topic. Each row of `model.components_` holds one topic's weight for every term; the function sorts those weights and prints the `no_top_words` terms with the largest values.
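The slice `[:-no_top_words - 1:-1]` is compact but easy to misread. The sketch below reproduces it in plain Python, with a `sorted` call standing in for `argsort()`. The weights and feature names are invented for illustration.

```python
# Hypothetical weights of one topic over five terms
weights = [0.1, 0.7, 0.05, 0.4, 0.9]
feature_names = ["apple", "banana", "cherry", "date", "elder"]
no_top_words = 3

# Indices that would sort the weights in ascending order (what argsort returns)
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the last index and stop after no_top_words items
top_indices = ascending[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(ascending)    # [2, 0, 3, 1, 4]
print(top_indices)  # [4, 1, 3]
print(top_words)    # ['elder', 'banana', 'date']
```

The negative step reverses the direction of travel, and the stop value `-no_top_words - 1` is exclusive, so exactly `no_top_words` indices are returned, largest weight first.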

In [31]:
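# Function to display topics (same logic as display_nmf_topics)
def display_topics(model, feature_names, no_top_words):
    # Each row of model.components_ holds one topic's weight for every term
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort() is ascending, so the reversed slice yields the top-weighted term indices
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))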

In [32]:
# Display the top words in each topic for the LDA model
no_top_words = 10  # Number of top words to display for each topic
display_topics(lda_model, count_feature_names, no_top_words)

Topic 0:
got just come way like make good don right didn
Topic 1:
say just don way like come let fact course make
Topic 2:
said people didn know don like say just time did
Topic 3:
best way don come just make course good does know
Topic 4:
list way know don new use point ll need like
Topic 5:
run just like way don make come course time using
Topic 6:
ve just like way don know got time years think
Topic 7:
list long like just good come believe 15 number way
Topic 8:
problem using time just don try way make does did
Topic 9:
point just like course good make don come fact know
Topic 10:
00 20 15 new 10 25 world software list edu
Topic 11:
probably way don just like make come know little really
Topic 12:
things way like come don believe know just people doesn
Topic 13:
using use way used does make work information like available
Topic 14:
said just like people way don come time believe say
Topic 15:
sure just like don way make know believe come good
Topic 16:
use way don just like using kn

5. **Display Topics Function**:
   - The `display_topics` function has the same body as `display_nmf_topics` and works for any fitted model that exposes `components_`, here the LDA model. It takes the model, the feature names, and the number of top words to display per topic, then prints the terms with the largest weights in each topic.