TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to reflect the importance of a word in a document relative to a collection of documents (or corpus). It is often used in text mining and information retrieval to evaluate how relevant a word is in a particular document. Here's a breakdown of how TF-IDF works:

### Term Frequency (TF)

**Term Frequency** measures how frequently a term (word) appears in a document. It is calculated as follows:

\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]

The more frequently a term appears in a document, the higher its TF value.
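As a minimal sketch of the formula above, the following computes TF for a single term. The function name `term_frequency` and the lowercase, whitespace-based tokenization are assumptions made for illustration; real tokenizers usually also handle punctuation.

```python
from collections import Counter


def term_frequency(term: str, document: str) -> float:
    """Return (occurrences of term) / (total number of tokens in document)."""
    # Assumption: lowercase the text and split on whitespace.
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # an empty document has no terms to count
    return Counter(tokens)[term] / len(tokens)


sentence = "the cat sat on the mat"
print(term_frequency("cat", sentence))  # 1 occurrence out of 6 tokens
print(term_frequency("the", sentence))  # 2 occurrences out of 6 tokens
```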

### Inverse Document Frequency (IDF)

**Inverse Document Frequency** measures how rare, and therefore how informative, a term is across the corpus. It is calculated as follows:

\[ \text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents } N}{\text{Number of documents containing term } t} \right) \]

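The idea is that a term appearing in many documents (common words like "the" or "is") does little to distinguish one document from another, so it should receive a low weight. The IDF value decreases as the term appears in more documents and reaches 0 when the term appears in every document.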
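The IDF formula can be sketched in a few lines. The function name `inverse_document_frequency`, the whitespace tokenization, and the choice to return 0.0 for a term that appears in no document (where the plain formula would divide by zero) are all assumptions of this sketch.

```python
import math


def inverse_document_frequency(term: str, corpus: list[str]) -> float:
    """Return ln(N / df), where df is the number of documents containing term."""
    n_docs = len(corpus)
    # Document frequency: count documents that contain the term at least once.
    doc_freq = sum(1 for doc in corpus if term in doc.lower().split())
    if doc_freq == 0:
        # The plain formula is undefined here; 0.0 is a convention of this sketch.
        return 0.0
    return math.log(n_docs / doc_freq)


corpus = ["the cat sat", "the dog barked", "the bird sang"]
print(inverse_document_frequency("the", corpus))  # in all 3 documents: ln(3/3)
print(inverse_document_frequency("cat", corpus))  # in 1 document: ln(3/1)
```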

### TF-IDF Calculation

The **TF-IDF** score is the product of the TF and IDF scores for a term. It is calculated as follows:

\[ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) \]

Where:
- \( t \) is the term.
- \( d \) is an individual document.
- \( D \) is the entire document corpus.
- \( N \) is the total number of documents in the corpus.
- The logarithm in IDF can be the natural log (ln) or the base-10 log (log10); the choice only rescales all scores by a constant factor.

### Example

Let's go through a simple example to illustrate the calculation:

1. **Document Corpus**:
   - Document 1 (d1): "this is a sample"
   - Document 2 (d2): "this is another example example"

2. **Calculate Term Frequency (TF)**:
   - For term "this" in d1: 
     \[ \text{TF}("this", d1) = \frac{1}{4} = 0.25 \]
   - For term "example" in d2, which contains five terms:
     \[ \text{TF}("example", d2) = \frac{2}{5} = 0.4 \]

3. **Calculate Document Frequency (DF)**:
   - "this" appears in both documents, so DF("this") = 2.
   - "example" appears in one document, so DF("example") = 1.

4. **Calculate Inverse Document Frequency (IDF)** (using the natural logarithm):
   - For term "this":
     \[ \text{IDF}("this", D) = \log \left( \frac{2}{2} \right) = \log (1) = 0 \]
   - For term "example":
     \[ \text{IDF}("example", D) = \log \left( \frac{2}{1} \right) = \log (2) \approx 0.693 \]

5. **Calculate TF-IDF**:
   - For term "this" in d1:
     \[ \text{TF-IDF}("this", d1, D) = 0.25 \times 0 = 0 \]
   - For term "example" in d2:
     \[ \text{TF-IDF}("example", d2, D) = 0.4 \times 0.693 \approx 0.277 \]
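
The steps of this example can be checked with a short script. This is only a sketch: the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, tokens are obtained by splitting on whitespace, and the natural logarithm is used. Note that d2 contains five tokens, so "example" accounts for 2 of 5.

```python
import math
from collections import Counter

corpus = {
    "d1": "this is a sample",
    "d2": "this is another example example",
}
# Assumption: whitespace tokenization is enough for this toy corpus.
tokenized = {name: text.split() for name, text in corpus.items()}


def tf(term, doc_name):
    """Occurrences of term divided by the number of tokens in the document."""
    tokens = tokenized[doc_name]
    return Counter(tokens)[term] / len(tokens)


def idf(term):
    """ln(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / df)


def tf_idf(term, doc_name):
    return tf(term, doc_name) * idf(term)


for term, doc_name in [("this", "d1"), ("example", "d2")]:
    print(term, doc_name, tf(term, doc_name), idf(term), tf_idf(term, doc_name))
```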

### Summary

- **TF-IDF** provides a weight that balances the frequency of a term in a document against how common the term is across the entire corpus.
- Common words across many documents have low TF-IDF scores, while words unique to a few documents have higher scores.
- This method helps to highlight the most important terms within a document relative to a larger set of documents.

### Limitations of TF-IDF

- **Context Ignorance**: TF-IDF does not capture the semantic meaning of words. Words with similar meanings but different forms will be treated differently.
- **Static Weights**: The weights are static and do not account for the dynamic nature of language and context.
- **Scalability**: For very large corpora, computing TF-IDF can become computationally intensive.

Despite these limitations, TF-IDF remains a foundational technique in text processing and continues to be widely used due to its simplicity and effectiveness.

A Count Vectorizer is a technique used in Natural Language Processing (NLP) to convert a collection of text documents into a matrix of token counts. It is a fundamental tool for text preprocessing and is often used as a first step in transforming raw text into a format suitable for machine learning algorithms.

### How Count Vectorizer Works

1. **Tokenization**: The text documents are tokenized, meaning they are split into individual words or tokens.
2. **Vocabulary Building**: A vocabulary (set of unique tokens) is built from all the documents.
3. **Count Matrix**: A matrix is created where each row represents a document and each column represents a token from the vocabulary. The values in the matrix are the counts of the tokens in each document.

### Example

Consider a simple example with three documents:

1. Document 1: "I love programming"
2. Document 2: "Programming is fun"
3. Document 3: "I love fun activities"

#### Tokenization and Vocabulary Building

From these documents, treating "Programming" and "programming" as the same token, the vocabulary would be:

\[ \text{"I", "love", "programming", "is", "fun", "activities"} \]

#### Count Matrix

The count matrix for these documents would be:

\[
\begin{array}{l|c|c|c|c|c|c}
 & \text{"I"} & \text{"love"} & \text{"programming"} & \text{"is"} & \text{"fun"} & \text{"activities"} \\
\hline
\text{Document 1} & 1 & 1 & 1 & 0 & 0 & 0 \\
\text{Document 2} & 0 & 0 & 1 & 1 & 1 & 0 \\
\text{Document 3} & 1 & 1 & 0 & 0 & 1 & 1 \\
\end{array}
\]
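
The three steps (tokenization, vocabulary building, count matrix) can be sketched in plain Python and checked against the matrix above. The function name `count_vectorize` is hypothetical, and the sketch assumes lowercase, whitespace-based tokenization, so the token "I" appears as "i". The vocabulary is kept in order of first appearance to match the column order used in this example.

```python
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # 1. Tokenization: lowercase and split on whitespace (a simplification).
    tokenized = [doc.lower().split() for doc in documents]

    # 2. Vocabulary building: unique tokens in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # 3. Count matrix: one row per document, one column per vocabulary token.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix


documents = ["I love programming", "Programming is fun", "I love fun activities"]
vocab, matrix = count_vectorize(documents)
print(vocab)
for row in matrix:
    print(row)
```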

### Applications

- **Text Classification**: Used as input features for machine learning models to classify text into categories (e.g., spam detection, sentiment analysis).
- **Information Retrieval**: Helps in retrieving documents that are relevant to a user's query by counting term occurrences.
- **Text Similarity**: Used to measure the similarity between documents by comparing their count vectors.
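
For the text similarity use case, one common way to compare count vectors is cosine similarity. The text above does not prescribe a particular measure, so treat this choice as an assumption of the sketch. The vectors are the three rows of the count matrix from the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention of this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


# Rows of the example count matrix
# (columns: "I", "love", "programming", "is", "fun", "activities")
doc1 = [1, 1, 1, 0, 0, 0]
doc2 = [0, 0, 1, 1, 1, 0]
doc3 = [1, 1, 0, 0, 1, 1]

print("doc1 vs doc2:", cosine_similarity(doc1, doc2))
print("doc1 vs doc3:", cosine_similarity(doc1, doc3))
print("doc2 vs doc3:", cosine_similarity(doc2, doc3))
```

Documents 1 and 3 share two tokens ("I" and "love") and come out as the most similar pair, while Documents 2 and 3 share only "fun".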

### Advantages

- **Simplicity**: Easy to understand and implement.
- **Effectiveness**: Provides a straightforward way to represent text data numerically.

### Limitations

- **Context Ignorance**: Does not capture the meaning or context of the words. Words with similar meanings but different forms will be treated differently.
- **High Dimensionality**: The resulting count matrix can be very large, especially with a large vocabulary, leading to sparsity issues.
- **Frequency Bias**: Frequent words (e.g., "the", "is") may dominate the count matrix, potentially overshadowing more informative terms.

### Python Implementation using Scikit-Learn

Here's a simple implementation using the `CountVectorizer` from the `scikit-learn` library:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "I love programming",
    "Programming is fun",
    "I love fun activities"
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the result to an array
count_matrix = X.toarray()

# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

print("Vocabulary:", feature_names)
print("Count Matrix:\n", count_matrix)
```

### Output

```
Vocabulary: ['activities' 'fun' 'is' 'love' 'programming']
Count Matrix:
 [[0 0 0 1 1]
 [0 1 1 0 1]
 [1 1 0 1 0]]
```

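This example demonstrates how Count Vectorizer converts text documents into a matrix of token counts, which can then be used for various NLP and machine learning tasks. Note that with its default settings, scikit-learn's `CountVectorizer` lowercases the text, ignores single-character tokens (so "I" is absent from the vocabulary), and orders the columns alphabetically.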