In [None]:
# This notebook contains a series of exercises designed to help you
# understand and implement fundamental techniques in Natural Language Processing (NLP) feature engineering using scikit-learn.

# Exercises:
# 1.  **Bag of Words:** Learn how to transform text data into a Bag of Words representation.
# 2.  **TF-IDF:** Explore the Term Frequency-Inverse Document Frequency (TF-IDF) method for text representation.

### Exercise 1: Bag of Words

**Task:** Use the `CountVectorizer` from scikit-learn to transform the following text corpus into a Bag of Words representation:

```python
documents = [
    "Text processing is important for NLP.",
    "Bag of Words is a simple text representation method.",
    "Feature engineering is essential in machine learning."
]
```


The Bag of Words model represents text as an unordered collection of words: it disregards grammar and word order but keeps track of word frequencies. It builds a vocabulary of all unique words in the corpus and then, for each document, counts how many times each vocabulary word appears. With its default settings, `CountVectorizer` also lowercases the text, strips punctuation, and ignores single-character tokens, so a word such as "a" does not appear in the vocabulary.
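
To make the counting step concrete, here is a minimal hand-rolled sketch that uses only the standard library. The toy corpus and the names `toy_docs`, `tokenize`, `toy_vocab`, and `toy_bow` are made up for illustration, and the regular expression only approximates `CountVectorizer`'s default tokenization.

```python
import re
from collections import Counter

# Hypothetical toy corpus, separate from the exercise corpus
toy_docs = ["the cat sat", "the cat saw the dog"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # roughly mirroring CountVectorizer's default behaviour
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token in the corpus, sorted alphabetically
toy_vocab = sorted({tok for doc in toy_docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word
toy_bow = []
for doc in toy_docs:
    counts = Counter(tokenize(doc))
    toy_bow.append([counts[word] for word in toy_vocab])

print(toy_vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(toy_bow)    # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row has the same length as the vocabulary, and word order inside a document has no effect on its row.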

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text corpus
documents = [
    "Text processing is important for NLP.",
    "Bag of Words is a simple text representation method.",
    "Feature engineering is essential in machine learning."
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Learn the vocabulary from the corpus and build the document-term count matrix
X = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense NumPy array for printing
bow_array = X.toarray()

# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()

print("Vocabulary:")
print(vocab)
print("\nBag of Words Array:")
print(bow_array)

### Challenge 1: Bag of Words

**Task:** Using the `vectorizer` from Exercise 1, transform the following new document into its Bag of Words representation and print the result:

```python
new_document_bow = ["NLP is an important field."]
```

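**Hint:** Use the `transform` method (not `fit_transform`) on the `vectorizer` that was already fitted on the original corpus, so the new document is encoded with the existing vocabulary. Run this before Exercise 2, which reassigns the name `vectorizer`.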

### Exercise 2: TF-IDF

**Task:** Use the `TfidfVectorizer` from scikit-learn to transform the following text corpus into a TF-IDF representation:

```python
documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]
```


TF-IDF is a numerical statistic that reflects how important a word is to a document within a corpus. It is the product of two terms: Term Frequency (TF) and Inverse Document Frequency (IDF).

- **Term Frequency (TF):** This measures how frequently a term appears in a document. The more often a word appears, the higher its TF score, implying it's important to that document.
- **Inverse Document Frequency (IDF):** This measures how rare a term is across all documents in the corpus. Words that appear in many documents (like "the", "is") have a low IDF score, so they contribute little. Words that are rare across the corpus have a high IDF score, indicating that they are more distinctive of the documents that contain them.

The TF-IDF score increases with the number of times a word appears in the document but is offset by how many documents in the corpus contain that word. This down-weights common words that carry little meaning.
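
The two definitions above can be made concrete with a small hand computation. The sketch below assumes the weighting that `TfidfVectorizer` applies by default, as far as I understand it: raw counts for TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. The toy corpus and all variable names are hypothetical.

```python
import math

# Hypothetical pre-tokenized toy corpus
tiny_docs = [["cat", "sat"], ["cat", "ran"]]
n_docs = len(tiny_docs)
tiny_vocab = sorted({tok for doc in tiny_docs for tok in doc})  # ['cat', 'ran', 'sat']

# Document frequency: number of documents that contain each word
doc_freq = {w: sum(w in doc for doc in tiny_docs) for w in tiny_vocab}

# Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in tiny_vocab}

tiny_tfidf = []
for doc in tiny_docs:
    # TF is the raw count of the word in the document
    row = [doc.count(w) * idf[w] for w in tiny_vocab]
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(v * v for v in row))
    tiny_tfidf.append([v / norm for v in row])

for w in tiny_vocab:
    print(w, round(idf[w], 3))
print([[round(v, 3) for v in row] for row in tiny_tfidf])
```

Here "cat" appears in both documents, so it receives the smallest IDF, and within each document it ends up with a lower weight than the word unique to that document.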

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
X = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense NumPy array for printing
tfidf_array = X.toarray()

# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:")
print(vocab)
print("\nTF-IDF Array:")
print(tfidf_array)

### Challenge 2: TF-IDF

**Task:** Using the `vectorizer` from Exercise 2, transform the following new document into its TF-IDF representation and print the result:

```python
new_document_tfidf = ["NLP models are important."]
```

**Hint:** Similar to the BoW challenge, use the `transform` method on the `vectorizer` that was already fitted on the original corpus.