# Lesson 3: Implementing TF-IDF for Feature Engineering in Text Classification

Welcome! Today, we're going to take a deep dive into the concept of TF-IDF and its crucial role in text classification. TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic that reflects how important a word is to a document within a corpus of documents. The TF-IDF value increases proportionally with the number of times a word appears in the document, but is offset by how many documents in the corpus contain the word, which adjusts for the fact that some words appear more frequently in general.

TF-IDF is widely used in information retrieval and text mining to identify the keywords that contribute most to a document's relevance. Put simply, terms that are frequent in a specific document but rare in the other documents of the corpus are significant and receive high TF-IDF scores.
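
To make the intuition concrete, here is a minimal sketch that scores words with the textbook formulation, tf × log(N / df). The toy corpus and the helper name `tf_idf` are assumptions made for illustration. scikit-learn's default weighting, used later in this lesson, applies different smoothing and normalization, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration), already split into tokens.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

def tf_idf(term, doc, docs):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# Score every word of the first document.
scores = {term: tf_idf(term, corpus[0], corpus) for term in corpus[0]}
for term, score in scores.items():
    print(f"{term}: {score:.3f}")
```

Here 'the' appears in every document and scores zero, while 'sat', which occurs only in the first document, scores highest.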

Now, let's understand it in practice.

## Introduction to TfidfVectorizer

In the Python ecosystem, scikit-learn is a widely used library offering various machine learning methods, along with utilities for pre-processing data, cross-validation, and other related tasks. One of the utilities it provides for text processing is TfidfVectorizer.

Let's walk through each line of the code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
```

We first import the necessary libraries. Next, we set up a small list of text documents:

```python
sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
```

We then create an instance of the TfidfVectorizer class and fit the vectorizer to our set of documents:

```python
vectorizer = TfidfVectorizer()
vectorizer.fit(sentences)
```

The fitting process involves tokenization, learning the vocabulary, and computing IDF weights. Each document is split into tokens, the set of all distinct tokens becomes the vocabulary, and an IDF weight is computed for each vocabulary entry. Note that fitting does not yet convert any sentence into numbers; that happens when we call `transform`, which we do later in this lesson.

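The vocabulary-learning part of fitting can be approximated in a few lines of plain Python. This is a sketch rather than scikit-learn's actual implementation: it assumes the default settings (lowercasing, and a token pattern that keeps runs of two or more word characters), and the helper name `tokenize` is hypothetical.

```python
import re

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Default token pattern of TfidfVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# Collect the distinct tokens, sort them, and number them in that order.
tokens = sorted({tok for s in sentences for tok in tokenize(s)})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}
print(vocabulary)
```

With this pattern, punctuation is discarded and single-character tokens such as 'a' never enter the vocabulary.
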
## Understanding Vocabulary and IDF from TfidfVectorizer

We can now print out the vocabulary and the Inverse Document Frequency (IDF) for each word in the vocabulary:

```python
print(f'Vocabulary: {vectorizer.vocabulary_}\n')
print(f'IDF: {vectorizer.idf_}\n')
```

The output looks something like this:

```
Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

IDF: [1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
```

The 'Vocabulary' maps each distinct word to a column index in the feature matrix; here the indices follow alphabetical order, so 'and' is column 0 and 'this' is column 8. The 'IDF' values are the computed Inverse Document Frequencies, listed in that same column order. IDF is a corpus-level weight: it measures how rare a word is across all documents, not how important it is to any single one. From these outputs, we get an important inference: terms that appear in every document (such as 'is', 'the', and 'this') have the lowest IDF score, 1.0, showing less importance. On the other hand, terms that appear in fewer documents have higher IDF scores, indicating they may be more distinctive in our text data.
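
The IDF numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The regex-based tokenization is a simplification that is adequate for these four sentences.

```python
import math
import re

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# One set of distinct tokens per document (simplified tokenization).
docs = [set(re.findall(r"\w\w+", s.lower())) for s in sentences]
n = len(docs)
vocab = sorted(set().union(*docs))

idf = {}
for term in vocab:
    df = sum(term in doc for doc in docs)          # document frequency
    idf[term] = math.log((1 + n) / (1 + df)) + 1   # smoothed IDF

for term in vocab:
    print(f"{term:>8}: {idf[term]:.8f}")
```

A word found in all four documents gets ln(5/5) + 1 = 1, and a word found in only one gets ln(5/2) + 1 ≈ 1.916, matching the printed array.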

## Transforming Sentences to TF-IDF Vectors and Understanding the Output

Next, let's transform one of the text documents to a sparse vector of TF-IDF values:

```python
vector = vectorizer.transform([sentences[0]])
```

This step encodes the first sentence using TF-IDF scores. Simply put, each vocabulary word gets a numerical value for this sentence: zero if the word is absent, and otherwise a weight that reflects the word's relevance within the document. By default, `TfidfVectorizer` also scales each vector to unit length (L2 normalization), which is why the scores here are all below 1.

Finally, let's print out the resulting vector and its shape:

```python
print('Shape:', vector.shape)
print('Array:', vector.toarray())
```

The output reveals the shape of our encoded array and the TF-IDF score associated with each word in our sentence:

```
Shape: (1, 9)
Array: [[0.         0.46979139 0.58028582 0.38408524 0.         0.
 0.38408524 0.         0.38408524]]
```

In the array, the order of the TF-IDF scores matches the order of the words in the 'vocabulary'. So, for instance, the first word 'and' (in the 'vocabulary') has a score of 0.0 as it does not occur in the sentence, while the word 'this' has a score of 0.38408524, which gives us the relevance of the word 'this' in our sentence.
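
The vector above can be rebuilt step by step. This sketch assumes the vectorizer's default behavior: multiply each word's raw count in the sentence by its smoothed IDF, then divide the whole vector by its Euclidean length. The tokenization is again a simplified stand-in.

```python
import math
import re
from collections import Counter

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

tokenized = [re.findall(r"\w\w+", s.lower()) for s in sentences]
vocab = sorted({tok for doc in tokenized for tok in doc})
n = len(tokenized)

def idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n) / (1 + df)) + 1

# Raw count times IDF for every vocabulary word, for the first sentence.
counts = Counter(tokenized[0])
raw = [counts[term] * idf(term) for term in vocab]

# L2 normalization: scale the vector to unit length.
norm = math.sqrt(sum(w * w for w in raw))
vector = [w / norm for w in raw]

for term, weight in zip(vocab, vector):
    print(f"{term:>8}: {weight:.8f}")
```

Pairing each score with its vocabulary word, as the final loop does, is also a convenient way to read the array.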

This way, we have transformed human language into a numerical representation that our machine can understand and learn from!

## Working with the IMDB Movie Reviews Dataset

Expanding on the simple sentences, let's apply the same process to the IMDB movie reviews dataset, available in the NLTK library as the `movie_reviews` corpus. This gives us a real-world scenario where TfidfVectorizer is used for text classification tasks, in this case movie review classification.

```python
import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

print('Shape:', X.shape)
```

After applying the TfidfVectorizer to the movie reviews dataset, the output shows the shape of the resulting matrix: one row per review and one column per unique word across all reviews:

```
Shape: (2000, 39659)
```

## Introduction to Sparse Matrices
In cases of large text datasets like ours, the matrix will have many zero entries because many words won't appear in a given review. Storing all these zeros can be highly memory-intensive and inefficient. Instead, we use a sparse matrix where we only store the non-zero elements, optimizing our memory usage. This is the storage method used for X, which holds all the TF-IDF vectors.
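
A quick back-of-the-envelope calculation shows how much this saves. The shape and the non-zero count are the figures reported for X in this lesson; the storage sizes (8-byte floats for values, 4-byte integers for indices) are assumptions about typical defaults, so treat the result as an estimate.

```python
# Figures reported in this lesson for the movie-review matrix X.
n_rows, n_cols = 2000, 39659
nnz = 666842  # number of stored non-zero values

# Assumed storage sizes: 8-byte floats for values, 4-byte ints for indices.
FLOAT_BYTES, INT_BYTES = 8, 4

# Dense storage keeps every cell, zeros included.
dense_bytes = n_rows * n_cols * FLOAT_BYTES

# Sparse storage keeps the values, one column index per value,
# and one row pointer per row plus one.
sparse_bytes = nnz * FLOAT_BYTES + nnz * INT_BYTES + (n_rows + 1) * INT_BYTES

density = nnz / (n_rows * n_cols)

print(f"Dense:   {dense_bytes / 1e6:.1f} MB")
print(f"Sparse:  {sparse_bytes / 1e6:.1f} MB")
print(f"Density: {density:.2%}")
```

Under these assumptions, fewer than 1% of the cells are non-zero, and the sparse layout needs roughly 8 MB instead of more than 600 MB.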

Let's look into the structure of this sparse matrix a bit more:

```python
print("Total non-zero elements in the matrix X: ", len(X.data))
print("Length of the column indices array in X: ", len(X.indices))
print("Length of the row pointer array in X: ", len(X.indptr))
```
The outputs will look like this:

```
Total non-zero elements in the matrix X:  666842
Length of the column indices array in X:  666842
Length of the row pointer array in X:  2001
```

Here:

- **`X.data`**: This array holds all the non-zero elements in our matrix, so its length is the total number of non-zero elements.
- **`X.indices`**: This array holds the column (word) index of each non-zero element. It is as long as `X.data` and tells us which word each value corresponds to.
- **`X.indptr`**: This is the "row pointer" array. It has one more element than the matrix has rows. Each value marks where the corresponding row starts in the `X.data` and `X.indices` arrays, which lets us locate the values that belong to each review.
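
This three-array layout is commonly called compressed sparse row (CSR). The following pure-Python sketch builds the three arrays for a tiny made-up matrix and then reads one row back. The variable names mirror the attributes above, but the code is only an illustration of the idea, not how SciPy implements it.

```python
# A small dense matrix standing in for TF-IDF scores (values are made up).
dense = [
    [0.0, 0.6, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, value in enumerate(row):
        if value != 0.0:
            data.append(value)     # the non-zero value itself
            indices.append(col)    # the column it belongs to
    indptr.append(len(data))       # where the next row starts

print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)

def row_entries(i):
    """Return the (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

print(row_entries(0))
```

Note that the all-zero middle row costs nothing in `data` or `indices`; it shows up only as two equal neighbors in `indptr`.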

## Lesson Summary and Practice
Congratulations! You've just learned about the concept of TF-IDF, how to apply `TfidfVectorizer` to text data in Python using the scikit-learn library, and how to interpret its output. Additionally, you've been introduced to sparse matrices, a helpful concept when handling large text datasets, and seen how such matrices are represented.

In the coming practice exercises, you will independently apply these concepts, helping you solidify your understanding and deepen your comprehension of how TF-IDF fits into text classification tasks. Keep up the excellent work!

## Change TF-IDF Vector for Different Sentence

Excellent progress, Space Explorer! Now, let's take it a step further. Your task is to modify the provided code so that it transforms the second sentence in the list instead of the first sentence.

Give it a try and observe the difference in the output of the TF-IDF vector. This change will help you see how the TF-IDF vector is influenced by the frequency of words in the chosen document.

Get ready to code!

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this document is the third one.',
    'Is this the first document here?'
]

vectorizer = TfidfVectorizer()
vectorizer.fit(sentences)
vector = vectorizer.transform([sentences[0]])
print('Shape:', vector.shape)
print('Features:', vectorizer.get_feature_names_out())
print('Array:', vector.toarray())
```

To modify the provided code to transform the second sentence in the list instead of the first, you simply need to change the index from 0 to 1 in the `transform` method call. This will allow you to see how the TF-IDF scores vary for the second sentence, which has a different word composition and frequency compared to the first.

Here's the updated code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# List of sentences
sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this document is the third one.',
    'Is this the first document here?'
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the sentences
vectorizer.fit(sentences)

# Transform the second sentence
vector = vectorizer.transform([sentences[1]])

# Output the results
print('Shape:', vector.shape)
print('Features:', vectorizer.get_feature_names_out())
print('Array:', vector.toarray())
```

### Explanation
- **Shape:** This will show the dimensions of the TF-IDF vector (1 row for the single sentence and columns equal to the number of unique words across all documents).
- **Features:** This will list all the words that the vectorizer has learned from the entire set of documents.
- **Array:** This will display the TF-IDF scores for the second sentence, one per vocabulary word. Because the word "document" appears twice in the second sentence, its term frequency is 2, so its weight before normalization is double what a single occurrence would give.

This modification will help you observe how the TF-IDF values change when a word (like "document") appears more frequently within a single document relative to its frequency in the entire corpus. This is a key aspect of understanding the weighting mechanism of TF-IDF.
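
One way to see this effect directly is to compare the weight of "document" in the first two sentences. The sketch below uses `norm=None`, a `TfidfVectorizer` option that switches off length normalization, to expose the raw tf × idf values, and then repeats the comparison with the default settings. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    'This is the first document.',
    'This document is the second document.',
    'And this document is the third one.',
    'Is this the first document here?'
]

# norm=None switches off length normalization so the raw tf * idf is visible.
raw_vectorizer = TfidfVectorizer(norm=None).fit(sentences)
col = raw_vectorizer.vocabulary_['document']
raw = raw_vectorizer.transform(sentences[:2]).toarray()
print('Raw weight, sentence 1:', raw[0, col])
print('Raw weight, sentence 2:', raw[1, col])

# Default settings (with L2 normalization), as used in the exercise.
normalized = TfidfVectorizer().fit(sentences).transform(sentences[:2]).toarray()
print('Normalized weight, sentence 1:', normalized[0, col])
print('Normalized weight, sentence 2:', normalized[1, col])
```

The raw weight doubles with the second occurrence. After normalization the ratio is no longer exactly two, because each vector is scaled by its own length, but "document" still weighs more in the second sentence.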

## Implementing TF-IDF Vectorizer on Provided Text

Right on, Space Voyager! Let's continue our exploration of TF-IDF. Fill in the blanks to fit the vectorizer to the sentences and transform the third sentence into a TF-IDF vector. Finally, print the shape of the array and the array itself. Happy coding!

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# sentences collection
sentences = ['The quick brown fox.', 'Foxes are quick and agile.', 'The fox jumped.']

# create an instance of TfidfVectorizer
vectorizer = ____()

# fit the vectorizer to the sentences
vectorizer.____(____)

# transform the third sentence
vector = vectorizer.____([sentences[2]])

# print the shape and array
print('Shape:', vector.shape)
print('Array:', vector.toarray())
```

To complete the code snippet and correctly use the TF-IDF vectorizer to transform the third sentence from the list of sentences, you need to fill in the blanks with appropriate method calls and parameters. Here's the completed code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# sentences collection
sentences = ['The quick brown fox.', 'Foxes are quick and agile.', 'The fox jumped.']

# create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# fit the vectorizer to the sentences
vectorizer.fit(sentences)

# transform the third sentence
vector = vectorizer.transform([sentences[2]])

# print the shape and array
print('Shape:', vector.shape)
print('Array:', vector.toarray())
```

### Explanation of the Filled Blanks:
- **`TfidfVectorizer()`**: This initializes a new instance of the TfidfVectorizer class.
- **`fit(sentences)`**: This method learns the vocabulary and the IDF weights from all the sentences in the list.
- **`transform([sentences[2]])`**: This method transforms the third sentence (indexed at 2) into its TF-IDF vector representation based on the vocabulary learned during the fitting process.

This code will output the shape of the TF-IDF vector for the third sentence and the actual TF-IDF values for each term in the sentence. These values help in understanding the importance of each word within the context of the provided documents.

## Understanding Sparse Matrix Components

Impressive navigation so far, Space Explorer!

Let's test your understanding of the sparse matrix components. Your mission is to fill in the blanks (____) to complete the code. The objective here is to learn and understand the structure of a sparse matrix by printing its key components: "data", "indices", and "indptr".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['Expedition to Mars', 'NASA launched an expedition to Mars', 'Mars expedition was successful']

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()
  
# fit_transform to convert text to vector  
X = vectorizer.fit_transform(sentences)

# print the shape and array
print('Shape:', X.shape)
print('Array:', X.toarray())

# Print the data
print("Data of Sparse Matrix ", X.____)

# Get the indices of the non-zero elements 
print("Indices of non-zero elements of Sparse Matrix ", X.____)

# Get the array that points to where the start of each row is in the data and indices array
print("Pointer to start of each row in indices and data ", X.____)
```

To complete the code snippet and correctly print the structure of the sparse matrix created by the `TfidfVectorizer`, you need to access the internal components of the sparse matrix representation. Here's the completed code with the correct attributes filled in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['Expedition to Mars', 'NASA launched an expedition to Mars', 'Mars expedition was successful']

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()
  
# fit_transform to convert text to vector  
X = vectorizer.fit_transform(sentences)

# print the shape and array
print('Shape:', X.shape)
print('Array:', X.toarray())

# Print the data
print("Data of Sparse Matrix ", X.data)

# Get the indices of the non-zero elements 
print("Indices of non-zero elements of Sparse Matrix ", X.indices)

# Get the array that points to where the start of each row is in the data and indices array
print("Pointer to start of each row in indices and data ", X.indptr)
```

### Explanation of the Filled Blanks:
- **`X.data`**: This attribute holds all the non-zero elements of the sparse matrix. These are the TF-IDF values for each non-zero term in the documents.
- **`X.indices`**: This attribute contains the column indices of the non-zero elements in the sparse matrix. It tells which word (column in the TF-IDF matrix) each value in `X.data` corresponds to.
- **`X.indptr`**: This attribute is an array that helps in identifying the start and end of each row in the `X.data` and `X.indices` arrays. It essentially points to where each document's (row's) data starts in the `X.data` array.

This code effectively demonstrates how to access and understand the internal structure of a sparse matrix, which is crucial for handling and optimizing large datasets in machine learning tasks.

## Applying TF-IDF Vectorizer On Reviews Dataset

Nice navigation so far, Space Voyager! Now, let's spice things up a bit. Recall how we transformed our text into a matrix of TF-IDF features? Let's do that again. The provided code already imports TfidfVectorizer; your task is to create an instance of it and apply fit_transform to our reviews data. Now, go ahead and amplify your learning!

```python
import numpy as np
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('movie_reviews', quiet=True)

reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

# TODO: Create an instance of TfidfVectorizer

# TODO: Apply fit_transform on reviews

print('Feature Names:', vectorizer.get_feature_names_out()[-5:])
```

To complete the provided code snippet, you need to create an instance of the `TfidfVectorizer` and use it to apply `fit_transform` to the `reviews` dataset. This will convert the text reviews into a matrix of TF-IDF features. Here's how you can do it:

```python
import numpy as np
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('movie_reviews', quiet=True)

reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Apply fit_transform on reviews
X = vectorizer.fit_transform(reviews)

print('Feature Names:', vectorizer.get_feature_names_out()[-5:])
```

### Explanation:
- **`TfidfVectorizer()`**: This initializes an instance of `TfidfVectorizer`, which will be used to transform the text data into TF-IDF features.
- **`fit_transform(reviews)`**: This method fits the vectorizer to the text data and transforms the text into a TF-IDF encoded sparse matrix. `fit_transform` combines the fitting and transformation steps into one, which is more efficient than running them separately when you're processing the data for the first time.
- **`vectorizer.get_feature_names_out()`**: After transforming the data, this method retrieves the feature names (i.e., the vocabulary words), and `[-5:]` is used to print the last five feature names to give a glimpse of the vocabulary.

This code prints the last five features in the vocabulary, demonstrating how to work with text data for feature extraction in natural language processing tasks. To see the dimensions of the TF-IDF matrix as well, you can print `X.shape`.
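
As a quick sanity check, `fit_transform` should give the same matrix as calling `fit` and then `transform` on the same data. The sketch below verifies this on a tiny made-up list of documents, used here instead of the full reviews corpus so that it runs without any download.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the reviews (an assumption for illustration).
docs = ['a great movie', 'a dull movie', 'great acting, dull plot']

# One step: fit and transform together.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: fit first, then transform the same documents.
two_step_vectorizer = TfidfVectorizer().fit(docs)
two_step = two_step_vectorizer.transform(docs)

same = np.allclose(one_step.toarray(), two_step.toarray())
print('Same result:', same)
```

The two-step form is the one to use when the text you transform differs from the text you fitted on.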

## Implementing TF-IDF Vectorizer from Scratch

Your journey into the cosmos of TF-IDF is nearing its zenith, Stellar Navigator! Your mission is to write the complete code to apply TfidfVectorizer to a set of sentences. Fit the vectorizer to the sentences, transform the first sentence into a TF-IDF vector, and print the TF-IDF features, dimension, and array. Are you ready to complete this fascinating voyage in the space of Text Feature Engineering? Let's code!

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# List of example sentences 
sentences = ['Exploration is really the essence of the human spirit.', 
             'We are driven by a desire to explore and discover the unknown.',
             'The will to explore is a significant part of being human.']

# TODO: Create an instance of TfidfVectorizer

# TODO: Fit the vectorizer to the sentences

# TODO: Transform the first sentence into a TF-IDF vector

# TODO: Print the shape of the TF-IDF vector

# TODO: Print the TF-IDF features

# TODO: Print the TF-IDF vector


```

To fulfill your mission of applying the `TfidfVectorizer` to a set of sentences, transforming one of them into a TF-IDF vector, and then printing relevant details, follow this complete code guide:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# List of example sentences 
sentences = ['Exploration is really the essence of the human spirit.', 
             'We are driven by a desire to explore and discover the unknown.',
             'The will to explore is a significant part of being human.']

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the sentences
vectorizer.fit(sentences)

# Transform the first sentence into a TF-IDF vector
tfidf_vector = vectorizer.transform([sentences[0]])

# Print the shape of the TF-IDF vector
print('Shape:', tfidf_vector.shape)

# Print the TF-IDF features
print('TF-IDF Features:', vectorizer.get_feature_names_out())

# Print the TF-IDF vector
print('TF-IDF Vector:', tfidf_vector.toarray())
```

### Explanation:
- **`TfidfVectorizer()`**: This line initializes the `TfidfVectorizer`. This object will convert text documents into a matrix of TF-IDF features.
- **`fit(sentences)`**: This method learns the vocabulary and the IDF weights from the entire set of sentences. The vectorizer must be fitted before it can transform any text.
- **`transform([sentences[0]])`**: This method transforms the first sentence of the list into a sparse matrix of TF-IDF features. The brackets around `sentences[0]` are needed because `transform` expects a collection of documents rather than a single string.
- **`tfidf_vector.shape`**: This prints the shape of the TF-IDF vector, showing the number of documents (in this case, 1) and the number of features in the vocabulary.
- **`vectorizer.get_feature_names_out()`**: This retrieves all the feature names (i.e., words in the learned vocabulary) from the vectorizer.
- **`tfidf_vector.toarray()`**: Converts the sparse matrix to a dense array and prints it, showing the TF-IDF weights for each word in the first sentence.

This code snippet effectively demonstrates how to use `TfidfVectorizer` to process text data, making it ready for further analysis or machine learning modeling.
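
A common slip in this exercise is passing the sentence itself instead of a list containing it. The sketch below shows what happens in that case, assuming a reasonably recent scikit-learn release, where a bare string is rejected with a `ValueError`. The variable name `error_message` is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['Exploration is really the essence of the human spirit.',
             'We are driven by a desire to explore and discover the unknown.',
             'The will to explore is a significant part of being human.']

vectorizer = TfidfVectorizer().fit(sentences)

# Passing a bare string instead of a list of documents is rejected.
try:
    vectorizer.transform(sentences[0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print('Error:', error_message)

# Wrapping the sentence in a list gives a matrix with a single row.
vector = vectorizer.transform([sentences[0]])
print('Shape:', vector.shape)
```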