### What is TfidfVectorizer?
The `TfidfVectorizer` is a text feature extraction tool provided by the `scikit-learn` library in Python. It converts a collection of raw text documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features, giving machine learning models a numerical representation of the text.

### How is TF-IDF Calculated?
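TF-IDF is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus). The formulas below are the textbook definition. By default, scikit-learn's `TfidfVectorizer` uses a variant: raw term counts for TF, a smoothed IDF of \( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \) (where \( n \) is the number of documents and \( \text{df}(t) \) is the number of documents containing \( t \)), and L2 normalization of each document's vector. Its values therefore differ from the ones these formulas give.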

1. **Term Frequency (TF)**:
   - This measures how frequently a term appears in a document. It is calculated as:
     $$
     \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
     $$

2. **Inverse Document Frequency (IDF)**:
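   - This measures how rare, and therefore how informative, a term is across the entire corpus. A term found in few documents gets a high IDF, while a term found in every document gets an IDF of zero. It is calculated as: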
     $$
     \text{IDF}(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)
     $$

3. **TF-IDF**:
   - This is the product of TF and IDF:
     $$
     \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
     $$
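
The three formulas above can be traced by hand on a small example. The sketch below is a plain-Python illustration of the textbook definition only, not of scikit-learn's smoothed and normalized variant; the toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for this example.

```python
import math

# Hypothetical toy corpus: each document is already split into tokens.
corpus = [
    ["data", "science", "data"],
    ["data", "engineering"],
    ["music", "theory"],
]

def tf(term, doc):
    """Fraction of the tokens in `doc` that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

# "data" occurs twice in the first document but also appears in another document.
print(round(tf_idf("data", corpus[0], corpus), 4))     # 0.2703
# "science" occurs once but appears in no other document.
print(round(tf_idf("science", corpus[0], corpus), 4))  # 0.3662
```

Although "data" is the more frequent term in the first document, "science" scores higher because it is rarer across the corpus.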

### How Does the CountVectorizer Differ from the TfidfVectorizer?
- **CountVectorizer**:
  - Converts text data into a matrix of token counts. It simply counts the number of occurrences of each term in each document.
  - Example: If the term 'data' appears 3 times in a document, the CountVectorizer will assign it a value of 3.

- **TfidfVectorizer**:
  - Converts text data into a matrix of TF-IDF features, which take into account both the frequency of a term in a document and how common the term is across the entire corpus.
  - Example: The term 'data' might appear frequently in many documents, so its TF-IDF value will be lower than a term that appears frequently in only a few documents.

### How to Instantiate a TfidfVectorizer with Specific Parameters
To create a `TfidfVectorizer` with specific parameters, pass them to the constructor as keyword arguments:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

no_features = 1000  # Example value for the number of features

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

### What are the Methods of the TfidfVectorizer Object?
The `TfidfVectorizer` object exposes several important methods, plus attributes that become available after fitting:

- **fit**: Learns the vocabulary and IDF from the training set.
- **transform**: Transforms the documents to the TF-IDF representation using the learned vocabulary and IDF.
- **fit_transform**: Equivalent to calling `fit` and then `transform` on the same documents, done in one step.
- **get_feature_names_out**: Returns the feature names (terms) in the learned vocabulary, ordered by column index.
- **vocabulary_** (attribute, not a method): The learned vocabulary as a dictionary mapping each term to its column index.
- **idf_** (attribute, not a method): The array of learned inverse document frequency values, one per vocabulary term.

### What is NMF?
NMF stands for Non-negative Matrix Factorization. It is a group of algorithms in multivariate analysis and linear algebra where a matrix \( V \) is factorized into (usually) two matrices \( W \) and \( H \) such that:

$$
V \approx WH
$$

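NMF is used for dimensionality reduction, topic modeling, and source separation, under the constraint that all three matrices \( V \), \( W \), and \( H \) have no negative elements. In topic modeling, \( V \) is the document-term matrix, the rows of \( W \) give each document's topic weights, and the rows of \( H \) give each topic's term weights.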
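
One way to see what the factorization does is a minimal NumPy sketch of the classic multiplicative update rule for the squared-error objective. This is an illustration under stated assumptions, not scikit-learn's implementation (its `NMF` class defaults to a coordinate descent solver). The function name `nmf_multiplicative`, the toy matrix, and the iteration count are all made up for this example.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate V ~= W @ H with non-negative W and H (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    # Random positive starting point; every update below keeps entries non-negative.
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical non-negative matrix with two obvious "blocks".
V = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 6.0],
])

W, H = nmf_multiplicative(V, n_components=2)
print(W.shape, H.shape)                  # (4, 2) (2, 3)
print((W >= 0).all() and (H >= 0).all()) # True
print(np.linalg.norm(V - W @ H))         # reconstruction error after fitting
```

Because each update multiplies by a ratio of non-negative quantities, `W` and `H` can never become negative, which is the constraint described above.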

### What is LDA?
LDA stands for Latent Dirichlet Allocation. It is a generative probabilistic model used in natural language processing and machine learning to discover abstract topics within a collection of documents. LDA assumes that documents are mixtures of topics and that topics are mixtures of words. The process involves:

- Assigning topics to each word in a document.
- Estimating the distribution of topics in each document.
- Estimating the distribution of words in each topic.

LDA is widely used for topic modeling and discovering hidden thematic structures in large corpora.
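
As a rough sketch of how this looks with scikit-learn, the example below fits `LatentDirichletAllocation` on a made-up four-sentence corpus. The sentences, the choice of two topics, and the variable names are assumptions for illustration. LDA models word counts, so the input comes from `CountVectorizer` rather than `TfidfVectorizer`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two loose themes.
docs = [
    "the pitcher threw the ball to the catcher",
    "the batter hit the ball over the fence",
    "the senate passed the budget bill",
    "the president vetoed the tax bill",
]

# LDA expects raw term counts.
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)        # (4, 2): one topic distribution per document
print(doc_topics.sum(axis=1))  # each row sums to 1 (up to rounding)
print(lda.components_.shape)   # (2, vocabulary size): word weights per topic
```

Each row of `doc_topics` is the estimated topic mixture of one document, and each row of `lda.components_` holds the unnormalized word weights of one topic. On a corpus this small the learned topics are not meaningful; the point is only the shapes and the calls.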

In [8]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [9]:
# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

In [10]:
# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [11]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [12]:
no_features = 100
no_topics = 100

### NMF

In [19]:
# Instantiate TfidfVectorizer with specified parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

1. **TfidfVectorizer**:
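   - `max_df=0.95`: Ignore terms that appear in more than 95% of the documents (a float is read as a proportion of the corpus).
   - `min_df=2`: Ignore terms that appear in fewer than 2 documents (an integer is read as an absolute document count).
   - `max_features=no_features`: Keep only the `no_features` terms with the highest total frequency across the corpus.
   - `stop_words='english'`: Remove the words in scikit-learn's built-in English stop-word list.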

In [20]:
# Use the fit_transform method to transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

2. **Transform the Documents**:
   - `fit_transform(documents)`: Learn the vocabulary and IDF, then return the document-term matrix as a sparse matrix with one row per document and one column per term.

In [21]:
# Get the feature names from TfidfVectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

3. **Get Feature Names**:
   - `get_feature_names_out()`: Retrieve the feature names (terms) learned by the vectorizer.

In [22]:
# Print the number of features and the first 10 feature names
print(f"Number of features: {len(feature_names)}")
print(f"First 10 feature names: {feature_names[:10]}")

Number of features: 100
First 10 feature names: ['00' '10' '12' '14' '15' '16' '20' '25' 'a86' 'available']


In [23]:
# Instantiate NMF and fit_transform the TF-IDF data
num_topics = no_topics  # reuse the topic count defined earlier (100)
nmf_model = NMF(n_components=num_topics, random_state=1)
nmf_topics = nmf_model.fit_transform(tfidf_matrix)

4. **NMF (Non-negative Matrix Factorization)**:
   - `NMF(n_components=num_topics, random_state=1)`: Instantiate the NMF model with a specified number of topics.
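   - `fit_transform(tfidf_matrix)`: Fit the model to the TF-IDF matrix and return the document-topic matrix \( W \), with one row per document and one column per topic. The topic-term matrix \( H \) is stored in `nmf_model.components_`.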

In [24]:
# Print the shape of the NMF topic matrix
print(f"NMF topic matrix shape: {nmf_topics.shape}")

NMF topic matrix shape: (11314, 100)
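
A common next step is to list the highest-weighted terms of each topic. The helper below is a hypothetical sketch: `top_terms_per_topic` is not a scikit-learn function, and the small arrays stand in for `nmf_model.components_` and `feature_names` so that the example runs on its own.

```python
import numpy as np

def top_terms_per_topic(components, feature_names, n_top=3):
    """Return, for each topic (row of `components`), its n_top highest-weighted terms."""
    feature_names = np.asarray(feature_names)
    top_terms = []
    for weights in components:
        # Indices of the largest weights, in descending order.
        order = np.argsort(weights)[::-1][:n_top]
        top_terms.append(feature_names[order].tolist())
    return top_terms

# Hypothetical stand-ins for nmf_model.components_ (topics x terms) and feature_names.
toy_components = np.array([
    [0.9, 0.1, 0.0, 0.5],
    [0.0, 0.8, 0.7, 0.1],
])
toy_names = ["space", "game", "team", "nasa"]

print(top_terms_per_topic(toy_components, toy_names, n_top=2))
# [['space', 'nasa'], ['game', 'team']]
```

In the notebook, the same call would presumably be `top_terms_per_topic(nmf_model.components_, feature_names)`, since `components_` has one row per topic and one column per term.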
