# Lesson 2: Implementing Bag-of-Words Representation

**Introducing the Bag-of-Words Representation**

In text analysis, raw text must be transformed into a numeric format that a computer can process while preserving the information needed for later steps. One of the simplest and most versatile ways to do this is the Bag-of-Words (BoW) representation.

The BoW method is essentially a way to extract features from text. Imagine having a large bag filled with words sourced from various texts like a book, a website, or, in our example, movie reviews from the IMDB dataset. For each document or sentence, the BoW representation will tally how many times each word appears. Crucially, in this "bag," the order of words is disregarded; only their frequency matters.
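To make the "order is disregarded" point concrete, here is a minimal sketch using only Python's standard library. The two sentences are hypothetical examples, not part of the dataset, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
first = "the dog chased the cat"
second = "the cat chased the dog"

# Counter tallies how many times each word appears
first_counts = Counter(first.split())
second_counts = Counter(second.split())

print(first_counts)
print(first_counts == second_counts)  # True: only the counts are kept
```

The two sentences mean different things, yet their bags are identical. This is the main trade-off of the representation: it is simple, but it discards word order.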

**Example with Three Sentences:**

Consider the following sentences:
1. The cat sat on the mat.
2. The cat sat near the mat.
3. The cat played with a ball.

Using a BoW representation, our table would look like this:

|   | the | cat | sat | on | mat | near | played | with | a | ball |
|---|-----|-----|-----|----|-----|------|--------|------|---|------|
| 1 | 2   | 1   | 1   | 1  | 1   | 0    | 0      | 0    | 0 | 0    |
| 2 | 2   | 1   | 1   | 0  | 1   | 1    | 0      | 0    | 0 | 0    |
| 3 | 1   | 1   | 0   | 0  | 0   | 0    | 1      | 1    | 1 | 1    |

Each row corresponds to a sentence (document), and each unique word forms a column. The cell values represent the word count in the corresponding sentence.
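The table above can be reproduced by hand with a few lines of plain Python. This is only a sketch: the `tokenize` helper is a hypothetical name, and its rule (lowercase, keep runs of letters) is an assumption chosen to match the table.

```python
from collections import Counter
import re

sentences = ["The cat sat on the mat.",
             "The cat sat near the mat.",
             "The cat played with a ball."]

def tokenize(text):
    # Lowercase the text and keep runs of letters (assumption for this sketch)
    return re.findall(r"[a-z]+", text.lower())

# Build the vocabulary in order of first appearance, like the table's columns
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One row of counts per sentence, one column per vocabulary word
rows = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```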

**Illustrating Bag-of-Words with a Simple Example**

We can start practicing the Bag-of-Words model by using Scikit-learn's CountVectorizer on the same three sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Simple example sentences
sentences = ['The cat sat on the mat.',
             'The cat sat near the mat.',
             'The cat played with a ball.']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```

The output will be:

```
Feature names:
['ball' 'cat' 'mat' 'near' 'on' 'played' 'sat' 'the' 'with']
Bag of Words Representation:
[[0 1 1 0 1 0 1 2 0]
 [0 1 1 1 0 0 1 2 0]
 [1 1 0 0 0 1 0 1 1]]
```

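From the output, you can see that Scikit-learn's `CountVectorizer` has largely reproduced our manual process: each row corresponds to a sentence and each column to a unique word. There are two differences from the hand-built table. The columns are sorted alphabetically rather than by first appearance, and the default tokenizer keeps only tokens of two or more characters, so "a" is dropped and the matrix has nine columns instead of ten.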
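One way to see where the nine feature names come from is to imitate the vectorizer's tokenization by hand. The sketch below assumes the `CountVectorizer` defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`); `default_tokens` is a hypothetical helper, not part of Scikit-learn.

```python
import re

# Assumed to mirror CountVectorizer's defaults: lowercase the text, then keep
# tokens made of two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = default_tokens("The cat played with a ball.")
print(tokens)               # ['the', 'cat', 'played', 'with', 'ball']
print(sorted(set(tokens)))  # alphabetical, like the feature names
```

Because the pattern requires at least two characters, the one-letter word "a" never becomes a feature. If single-character tokens matter for a task, passing a different `token_pattern` to `CountVectorizer` is one option.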

**Applying Bag-of-Words to Our Dataset**

Now that we understand what Bag-of-Words is and how it functions, let's apply it to our dataset:

```python
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('movie_reviews')  
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()]

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(reviews)

print(f"The shape of our Bag-of-Words is: {bag_of_words.shape}")
```

The output will show:

```
The shape of our Bag-of-Words is: (2000, 39659)
```

The shape tells us that the matrix has 2,000 rows, one per movie review, and 39,659 columns, one per unique word in the corpus. Each entry is the number of times that word appears in that review.
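A matrix this wide is mostly zeros, since any single review uses only a small part of the vocabulary. `fit_transform` therefore returns a SciPy sparse matrix that stores only the non-zero counts. The sketch below illustrates this on the three toy sentences rather than the full dataset, so it runs without any download; `density` is a name introduced here for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['The cat sat on the mat.',
             'The cat sat near the mat.',
             'The cat played with a ball.']

bag = CountVectorizer().fit_transform(sentences)

n_docs, n_words = bag.shape
# nnz is the number of stored (non-zero) entries in the sparse matrix
density = bag.nnz / (n_docs * n_words)

print(issparse(bag))     # True
print(bag.shape)         # (3, 9)
print(f"{density:.2f}")  # 0.56
```

On a real corpus the density is typically far lower, which is why calling `toarray()` on the whole matrix is best reserved for small examples.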

**Understanding the Bag-of-Words Matrix and Most Used Word**

Let's examine what's inside the `bag_of_words` matrix:

```python
feature_names = vectorizer.get_feature_names_out()
first_review_word_counts = bag_of_words[0].toarray()[0]

max_count_index = first_review_word_counts.argmax()
most_used_word = feature_names[max_count_index]

print(f"The most used word is '{most_used_word}' with a count of {first_review_word_counts[max_count_index]}")
```

Running the code will output:

```
The most used word is 'the' with a count of 38
```

This output shows the most used word in the first review and its count. The script uses `argmax()` to find the index of the highest count in the first review's row, then looks up that index in `feature_names` to recover the word. The same approach works for any other review in the matrix.

**Lesson Summary**

Congratulations! You've successfully navigated through this lesson on the Bag-of-Words method, a fundamental concept in text classification. You've not only grasped the theoretical aspect but also applied it on a real-world dataset using Python. You even used it to extract insights about word frequency, a crucial aspect of many text classification problems.

As we move forward in upcoming lessons, we'll build on what you've learned today and continue our journey to understand and apply more advanced text classification techniques. Remember, practice makes perfect, so try to apply what you've learned today on different text data on your own. Happy coding, and see you in the next lesson!

## Customizing Bag-of-Words Representation

Great job! Now, let's play around with the text. Change the provided sentences to any three sentences of your choice. Then, observe how this affects the Bag-of-Words representation. Remember, the output should be a matrix, with each row representing one of your sentences and each column representing a unique word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Custom sentences
sentences = ['The cat sat on the mat.',
             'The cat sat near the mat.',
             'The cat played with a ball.']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```

Let's change the sentences and observe how the Bag-of-Words representation changes accordingly. Here are three new sentences:

1. Dogs love playing in the park.
2. Children enjoy playing games.
3. The park is full of trees.

Now, let's plug these sentences into the provided Python script and see how the Bag-of-Words matrix looks:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Custom sentences
sentences = ['Dogs love playing in the park.',
             'Children enjoy playing games.',
             'The park is full of trees.']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```

### Expected Output Explanation:

- **Feature names:** This will output a list of all unique words found in the three sentences, sorted in alphabetical order.
- **Bag of Words Representation:** This will be a matrix where each row corresponds to one of the sentences, and each column corresponds to one of the unique words. The values in the matrix represent the count of each word in each sentence.

For example, if the word "playing" appears in the first and second sentences but not in the third, its column will have values like `[1, 1, 0]` (assuming "playing" appears once in each of the first two sentences). The exact output will depend on the unique words extracted by the `CountVectorizer` and their frequency in each sentence.

## Applying CountVectorizer on Sentences

Great work, Space Voyager!

Now, do you remember how to use the CountVectorizer on the sentences to obtain the Bag-of-Words Representation? In the placeholders, marked with ____, insert the missing parts of the code. Keep on exploring!

```python
from sklearn.feature_extraction.text import _____

# Simple example sentences
sentences = ['Machine learning is fascinating.',
             'Deep learning is a subset of machine learning.',
             'We use Python for machine learning.']

vectorizer = _____()
X = vectorizer._____(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```

To fill in the blanks using `CountVectorizer` from the `sklearn.feature_extraction.text` module, follow these steps:

1. Import `CountVectorizer` from the `sklearn.feature_extraction.text` module.
2. Create an instance of `CountVectorizer`.
3. Fit and transform the sentences using the `fit_transform` method of `CountVectorizer`.

Here's the completed script:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Simple example sentences
sentences = ['Machine learning is fascinating.',
             'Deep learning is a subset of machine learning.',
             'We use Python for machine learning.']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print('Feature names:')
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
print(X.toarray())
```

This script will output the feature names, which are the unique words from the sentences, and the Bag-of-Words representation, which is a matrix indicating the frequency of each word in each sentence.
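It can help to know that `fit_transform` combines two steps: `fit` learns the vocabulary and `transform` counts words against it. The sketch below separates them and then transforms a hypothetical unseen sentence to show what happens to words outside the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['Machine learning is fascinating.',
             'Deep learning is a subset of machine learning.',
             'We use Python for machine learning.']

vectorizer = CountVectorizer()
vectorizer.fit(sentences)            # learn the vocabulary only
X = vectorizer.transform(sentences)  # count words using that vocabulary

# A hypothetical unseen sentence: 'fun' was never seen during fit,
# so it has no column and is silently ignored
new_counts = vectorizer.transform(['Python is fun'])

print(X.shape)           # (3, 11)
print(new_counts.sum())  # 2
```

Only "python" and "is" are counted for the new sentence. This matters later when a vectorizer fitted on training data is applied to new text.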

## Bag-of-Words Transformation on IMDB Reviews Dataset

Amazing progress, Space Voyager!

Now, for the next adventure: Here's some code with a few blank spots (____) for you to fill in. This code analyzes the first 100 reviews of the IMDB movie dataset and creates a Bag-of-Words representation. Finally, it prints the last ten feature names. Fill in the missing parts to complete the analysis. Happy coding!

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import movie_reviews

# We need to download the dataset before we can use it
nltk.download('movie_reviews', quiet=True)

# Take only the first 100 reviews for simplicity
reviews = [movie_reviews.____(fileid) for fileid in movie_reviews.____()[:100]]

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(reviews)

# Print the last ten feature names
feature_names = vectorizer.____()
print("Last ten feature names: ", feature_names[-10:])
```

To complete the provided Python script for analyzing the first 100 reviews of the IMDB movie dataset using a Bag-of-Words representation, let's fill in the blanks:

1. **First blank**: Use the `raw()` method to get the text of each review.
2. **Second blank**: Use `fileids()` to retrieve all the file identifiers for the reviews.
3. **Third blank**: Use `get_feature_names_out()` to obtain the list of feature names from the vectorizer.

Here is the completed script:

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import movie_reviews

# We need to download the dataset before we can use it
nltk.download('movie_reviews', quiet=True)

# Take only the first 100 reviews for simplicity
reviews = [movie_reviews.raw(fileid) for fileid in movie_reviews.fileids()[:100]]

vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(reviews)

# Print the last ten feature names
feature_names = vectorizer.get_feature_names_out()
print("Last ten feature names: ", feature_names[-10:])
```

This script will successfully download the necessary movie reviews dataset from NLTK, extract the first 100 reviews, create a Bag-of-Words model using `CountVectorizer`, and finally, print the last ten feature names from the vectorizer's feature names list. These names represent some of the vocabulary used in the reviews, particularly those that appear at the end of an alphabetically sorted list.

## Creating Bag-of-Words Representation Yourself

Fantastic job, Space Voyager!

Now, let's see if you can create a CountVectorizer yourself. Fill in the corresponding TODO lines to successfully vectorize the sentences and observe the resulting Bag-of-Words representation and feature names. Exploration time!

```python
from sklearn.feature_extraction.text import CountVectorizer

# Simple sentences
sentences = ["I love to sing", 
             "Singing in the rain is my favorite", 
             "She sang the whole night at the concert"]

# TODO: Initialize a CountVectorizer

# TODO: Fit transform the sentences

print('Feature names:')
print(_____.get_feature_names_out())
print('Bag of Words Representation:')
print(_____.toarray())
```

To complete the script and successfully vectorize the sentences using `CountVectorizer`, follow these steps:

1. **Initialize a CountVectorizer**: Create an instance of `CountVectorizer`.
2. **Fit transform the sentences**: Use the `fit_transform` method to transform the sentences into a Bag-of-Words model.
3. **Print the feature names and Bag-of-Words representation**: Replace the first placeholder with the vectorizer, which provides `get_feature_names_out()`, and the second with the matrix returned by `fit_transform`, which provides `toarray()`.

Here's how you can fill in the TODO lines and the placeholders in the print statements:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Simple sentences
sentences = ["I love to sing", 
             "Singing in the rain is my favorite", 
             "She sang the whole night at the concert"]

# Initialize a CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer and transform the sentences in one step
X = vectorizer.fit_transform(sentences)

print('Feature names:')
# The vectorizer holds the learned vocabulary
print(vectorizer.get_feature_names_out())
print('Bag of Words Representation:')
# X is sparse; toarray() converts it to a dense array for printing
print(X.toarray())
```

This script initializes the `CountVectorizer`, fits it to the provided sentences, and then transforms these sentences into a Bag-of-Words representation. The output will include a list of all unique words (feature names) extracted from the sentences and a matrix showing the frequency of each word in each sentence.

## Turn Rich Text into Bag-of-Words Representation

Bravo, Space Voyager! Now, let's tackle a real-world challenge. Given some pre-processing code and three rich example sentences, your mission is to clean these sentences, then use the Bag-of-Words model to vectorize them, and finally, print out the feature names and their Bag-of-Words representation. Ready for a stellar run?

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

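# Download the NLTK resources used below (newer NLTK releases look for
# 'punkt_tab' instead of 'punkt', so both are requested here)
for resource in ('stopwords', 'punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()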

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]

    return " ".join(filtered_text)

original_sentences = ['It is a lovely day, isn\'t it?', 
                      'The sun is shining brightly!', 
                      'I love the taste of lemonade on a sunny day.']

# Preprocess the sentences
preprocessed_sentences = [clean_text(sentence) for sentence in original_sentences]

# TODO: Initialize a CountVectorizer

# TODO: Fit transform the preprocessed sentences

# TODO: Print the feature names 

# TODO: Print the Bag of Words Representation
```

To complete the task of cleaning the sentences, vectorizing them using the Bag-of-Words model, and printing out the feature names and their Bag-of-Words representation, you'll need to follow these steps:

1. **Initialize a CountVectorizer**: Create an instance of `CountVectorizer`.
2. **Fit transform the preprocessed sentences**: Use the `fit_transform` method to transform the cleaned sentences into a Bag-of-Words model.
3. **Print the feature names and Bag-of-Words representation**: Utilize the methods provided by `CountVectorizer` to display the desired outputs.

Here's how to complete the script:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

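# Download the NLTK resources used below (newer NLTK releases look for
# 'punkt_tab' instead of 'punkt', so both are requested here)
for resource in ('stopwords', 'punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()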

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]

    return " ".join(filtered_text)

original_sentences = ['It is a lovely day, isn\'t it?', 
                      'The sun is shining brightly!', 
                      'I love the taste of lemonade on a sunny day.']

# Preprocess the sentences
preprocessed_sentences = [clean_text(sentence) for sentence in original_sentences]

# Initialize a CountVectorizer
vectorizer = CountVectorizer()

# Fit transform the preprocessed sentences
X = vectorizer.fit_transform(preprocessed_sentences)

# Print the feature names 
print('Feature names:')
print(vectorizer.get_feature_names_out())

# Print the Bag of Words Representation
print('Bag of Words Representation:')
print(X.toarray())
```

This script passes the original sentences through the `clean_text` function, which lowercases the text, removes email addresses, URLs, punctuation, and digits, drops stop words, and stems the remaining words. It then uses `CountVectorizer` to create and display a Bag-of-Words representation of the cleaned sentences. The output shows the unique features (stemmed words) extracted from the text and their counts in each sentence.
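To see what each regular expression in `clean_text` contributes, the sketch below applies the same substitutions one at a time and prints the intermediate string. The input sentence is a hypothetical example chosen to contain an email address, a URL, punctuation, and digits. The stop-word and stemming steps are left out so that it runs without NLTK.

```python
import re

# Hypothetical input chosen to exercise every substitution in clean_text
text = "Email me at jane@example.com or visit http://example.com, it's 100% free!"

# The same patterns as in clean_text, in the same order
steps = [
    (r'\S*@\S*\s?', ''),  # remove email addresses
    (r'http\S+', ''),     # remove URLs
    (r'\W', ' '),         # replace punctuation and special characters
    (r'\d', ' '),         # replace digits
    (r'\s\s+', ' '),      # collapse runs of whitespace
]

text = text.lower()
for pattern, replacement in steps:
    text = re.sub(pattern, replacement, text)
    print(repr(text))

words = text.split()
print(words)  # ['email', 'me', 'at', 'or', 'visit', 'it', 's', 'free']
```

Note that replacing the apostrophe with a space splits "it's" into "it" and "s", which is why contractions in the exercise sentences end up as separate short tokens before stop-word filtering.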