# Lesson 4: Unleashing the Power of n-grams in Text Classification

## Topic Overview and Goal

Hello, and welcome to today's lesson on n-grams! If you've ever wondered how language models or text classifiers capture context and word order in text, n-grams are usually part of the answer. In this lesson, we'll look at what n-grams are and the role they play in processing textual data. Specifically, we'll learn how to create n-grams from text data using Python, covering unigrams and bigrams.

## What are n-grams?

In Natural Language Processing, analyzing text often benefits from considering not only individual words but also sequences of words. This approach helps to grasp the context better. Here is where n-grams come in handy.

- **Definition**: An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be characters, syllables, or words; in this lesson they are words, so 'n' is the number of words in the sequence.
  - For instance, in "I love dogs":
    - A 1-gram (or unigram) is just one word, like "love."
    - A 2-gram (or bigram) would be a sequence of 2 words, like "I love" or "love dogs."

N-grams help preserve the sequential information or context in text data, contributing significantly to many language models or text classifiers.
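To make the definition concrete, here is a minimal pure-Python sketch that slides a window of size n over a list of words. The helper name `make_ngrams` is hypothetical, and plain whitespace splitting stands in for real tokenization.

```python
def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love dogs".split()  # naive whitespace tokenization (an assumption)

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams)  # ['I', 'love', 'dogs']
print(bigrams)   # ['I love', 'love dogs']
```

A sentence with k words yields k - n + 1 n-grams, so asking for an n larger than the sentence length simply returns an empty list.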

## Preparing Data for n-Grams Creation

Before we can create n-grams, we need clean, structured text data. The text must be cleaned and preprocessed into a desirable format, after which it can be used for feature extraction or modeling.

### Cleaning Text

Here's a familiar code snippet that cleans our text. It lower-cases the text, removes email addresses, URLs, punctuation, and digits, drops uninformative words (stop words), and reduces the remaining words to their stemmed form.

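```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Load stop words from NLTK and initialize a stemmer
# (assumes the NLTK stop word and tokenizer data are already downloaded)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Function to clean text and perform stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Replace punctuation and special characters with spaces
    text = re.sub(r'\d', ' ', text)  # Replace digits with spaces
    text = re.sub(r'\s\s+', ' ', text)  # Collapse runs of whitespace

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)
```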
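To see what the regular expressions in `clean_text` do on their own, the sketch below applies only those steps to a made-up sentence. The function name `regex_clean` and the sample string are hypothetical, and `str.split()` stands in for NLTK's tokenizer so the example runs with the standard library alone; stop-word removal and stemming are left out.

```python
import re


def regex_clean(text):
    """Apply only the regex steps of clean_text (no stop-word removal, no stemming)."""
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop any token containing '@'
    text = re.sub(r'http\S+', '', text)     # drop URLs
    text = re.sub(r'\W', ' ', text)         # non-word characters become spaces
    text = re.sub(r'\d', ' ', text)         # digits become spaces
    text = re.sub(r'\s\s+', ' ', text)      # collapse runs of whitespace
    return text


sample = "Contact bob@example.com or visit http://example.com today!! Call 555-1234."
words = regex_clean(sample).split()
print(words)  # ['contact', 'or', 'visit', 'today', 'call']
```

The full `clean_text` function would go further: "or" is an English stop word and would be filtered out, and the surviving words would be stemmed.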

## Creating n-grams with Python: Setting up the Vectorizer

The scikit-learn library (imported as `sklearn`) provides an accessible way to generate n-grams. The `CountVectorizer` class in the `sklearn.feature_extraction.text` module converts a collection of texts into a matrix of token counts and lets us specify which n-gram sizes to include.

### Setting Up the Vectorizer

Let's set up our vectorizer as a preliminary step towards creating n-grams:

```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))  # Generate unigram and bigram
```

The `ngram_range=(1, 2)` parameter instructs our vectorizer to generate n-grams where n ranges from 1 to 2. So, the `CountVectorizer` will generate both unigrams and bigrams. If we wanted unigrams, bigrams, and trigrams, we could use `ngram_range=(1, 3)`.

## Creating n-grams with Python: Applying the Vectorizer

Now that we've set up our n-gram generating machine, let's use it on some real-world data.

### Fetching and Cleaning Data

```python
from sklearn.datasets import fetch_20newsgroups

# Fetching 20 newsgroups dataset and restricting to first 100 records for performance
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]

# Clean and preprocess the newsgroup data
cleaned_data = [clean_text(data) for data in newsgroups_data]
```

### Applying the Vectorizer

Applying the vectorizer to our cleaned text data will create the n-grams:

```python
# Apply the CountVectorizer on the cleaned data to create n-grams
X = vectorizer.fit_transform(cleaned_data)

# Display the shape of X
print("Shape of X with n-grams: ", X.shape)

# Print the total number of features
features = vectorizer.get_feature_names_out()
print("Total number of features: ", len(features))

# Print features from index 100 to 110
print("Features from index 100 to 110: ", features[100:111])
```

### Output Explanation

The output of the above code will be:

- **Shape of X with n-grams**: (100, 16246)
- **Total number of features**: 16246
- **Features from index 100 to 110**: `['accid figur', 'accid worri', 'accomod', 'accomod like', 'accord', 'accord document', 'accord lynn', 'accord mujanov', 'accord previou', 'account', 'account curiou']`

The shape of X is (100, 16246), indicating we have a high-dimensional feature space. The first number, 100, is the number of rows, one per document (we limited the dataset to the first 100 records). The second number, 16246, is the number of columns, one per unique n-gram found across all the documents.

By printing `features[100:111]`, we get a glimpse into our features, where each string represents an n-gram from our cleaned text data. The returned n-grams include both unigrams (single words) and bigrams (two-word phrases).
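Because `CountVectorizer` joins the words of an n-gram with a single space, counting the words in a feature string tells us which kind of n-gram it is. The short sketch below does this for the eleven features printed above; the variable names are only illustrative.

```python
# The eleven features shown in the lesson output (indices 100 to 110)
sample_features = [
    'accid figur', 'accid worri', 'accomod', 'accomod like', 'accord',
    'accord document', 'accord lynn', 'accord mujanov', 'accord previou',
    'account', 'account curiou',
]

# One word means unigram, two words means bigram
unigrams = [f for f in sample_features if len(f.split()) == 1]
bigrams = [f for f in sample_features if len(f.split()) == 2]

print(unigrams)      # ['accomod', 'accord', 'account']
print(len(bigrams))  # 8
```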

Generating n-grams adds a new level of complexity to our analysis, as we now have multiple types of features or tokens. You can experiment with the `ngram_range` parameter in `CountVectorizer` to include trigrams or higher-order n-grams, depending on your specific context and requirements. Remember that each choice has implications for both the complexity and the interpretability of your models: a wider range captures more context but adds many more features, so the right setting is always a balance between the two.

## Lesson Summary

Congratulations, you've finished today's lesson on n-grams! We've explored what n-grams are and their importance in text classification. We then moved on to preparing data for creating n-grams before diving into generating them using Python's `CountVectorizer` class in the `sklearn` library.

Now, it's time to get hands-on. Try generating trigrams or 4-grams from the same cleaned newsgroups data and notice the differences. Practicing these skills will not only reinforce the concepts learned in this lesson but also enable you to understand when and how much context is needed for certain tasks.

As always, happy learning!


## Generating Bigrams and Trigrams with NLP

Excellent work, Space Voyager! Now, let's deepen our understanding of n-grams in Python. Modify the `ngram_range` parameter of `CountVectorizer` in the starter code to generate only bigrams and trigrams, instead of unigrams, bigrams, and trigrams. Change `ngram_range` from `(1, 3)` to `(2, 3)`. Display the output and observe the differences.

```python 
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# Fetching 20 newsgroups dataset and restricting to first 100 records for performance
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]

# Clean and preprocess the newsgroup data
cleaned_data = [clean_text(data) for data in newsgroups_data]

# Setup the CountVectorizer to generate unigrams, bigrams, and trigrams
vectorizer = CountVectorizer(ngram_range=(1, 3)) 

# Apply the CountVectorizer on the cleaned data to create n-grams
X = vectorizer.fit_transform(cleaned_data)

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features from index 100 to 110: ", features[100:111])

```

To modify the `ngram_range` parameter in the `CountVectorizer` to generate only bigrams and trigrams, you need to change it from `(1, 3)` to `(2, 3)`. Below is the updated code with the necessary changes. After running this code, you will be able to observe the differences in the output.

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# Fetching 20 newsgroups dataset and restricting to first 100 records for performance
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]

# Clean and preprocess the newsgroup data
cleaned_data = [clean_text(data) for data in newsgroups_data]

# Setup the CountVectorizer to generate bigrams and trigrams
vectorizer = CountVectorizer(ngram_range=(2, 3)) 

# Apply the CountVectorizer on the cleaned data to create n-grams
X = vectorizer.fit_transform(cleaned_data)

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features from index 100 to 110: ", features[100:111])
```

### Expected Output

When you run this modified code, you should see an output similar to the following:

- **Shape of X with n-grams**: This will indicate the number of documents and the number of features (bigrams and trigrams).
- **Features from index 100 to 110**: This will display the specific bigrams and trigrams generated from the cleaned text data.

### Observations

- The shape of `X` will reflect the number of bigrams and trigrams generated, which will be fewer than when unigrams were included.
- The features printed will consist solely of bigrams and trigrams, showcasing the two-word and three-word combinations derived from the text data.

This exercise will help you understand how changing the `ngram_range` affects the feature extraction process in text analysis.

## Generating Bigrams and Trigrams from Text Data

Excellent work, Space Voyager!

For the next task, we will simplify our input to a single short string and generate bigrams and trigrams from it. This gives a more immediate view of how n-grams are generated. Can you fill in the blanks (____) to get the code up and running? Remember, we need to use `CountVectorizer` to generate bigrams and trigrams. After that, display the features created.

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Simple short text
text = "I love studying data science"

# Setup the CountVectorizer to generate bigrams and trigrams
vectorizer = ____(ngram_range=____) 

# Apply the CountVectorizer on the text to create n-grams
X = vectorizer.fit_transform([text])

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)

```

To complete the code for generating bigrams and trigrams using `CountVectorizer`, you need to fill in the blanks with the appropriate values. Here’s the completed code:

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Simple short text
text = "I love studying data science"

# Setup the CountVectorizer to generate bigrams and trigrams
vectorizer = CountVectorizer(ngram_range=(2, 3)) 

# Apply the CountVectorizer on the text to create n-grams
X = vectorizer.fit_transform([text])

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)
```

### Explanation of the Filled Blanks

- **`CountVectorizer(ngram_range=(2, 3))`**: This specifies that we want to generate bigrams (2-word sequences) and trigrams (3-word sequences).
- **`fit_transform([text])`**: The input text is wrapped in a list because `CountVectorizer` expects an iterable of documents.
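To illustrate why the list matters, the sketch below passes the bare string first. In the scikit-learn versions we are aware of this raises a `ValueError`; treat the exact error message as version-dependent.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "I love studying data science"
vectorizer = CountVectorizer(ngram_range=(2, 3))

try:
    vectorizer.fit_transform(text)  # a bare string is not a collection of documents
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

X = vectorizer.fit_transform([text])  # one-element list: a corpus with one document
print(X.shape[0])  # 1
```

Each element of the list becomes one row of the resulting matrix, which is why a single sentence produces a matrix with exactly one row.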

### Expected Output

When you run this code, you should see an output similar to the following:

- **Shape of X with n-grams**: This will indicate the number of documents (1 in this case) and the number of features (bigrams and trigrams).
- **Features**: This will display the specific bigrams and trigrams generated from the input text.

### Example Output

```
Shape of X with n-grams:  (1, 5)
Features:  ['data science' 'love studying' 'love studying data' 'studying data'
 'studying data science']
```

The vectorizer finds five features: three bigrams and two trigrams. No n-gram contains "I" because `CountVectorizer` lowercases the text and, with its default token pattern, keeps only tokens of two or more characters, so the one-letter word is dropped before the n-grams are built.
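The one-letter word "I" is missing from the features in this output. The sketch below reproduces the feature list with the standard library only, assuming the default settings described in the scikit-learn documentation: lowercasing and the token pattern `(?u)\b\w\w+\b`, which matches tokens of at least two word characters.

```python
import re

# Default CountVectorizer token pattern, as given in the scikit-learn docs
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "I love studying data science"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, text.lower())
print(tokens)  # ['love', 'studying', 'data', 'science']

# Build bigrams and trigrams from the surviving tokens
ngrams = []
for n in (2, 3):
    ngrams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

features = sorted(ngrams)
print(features)
# ['data science', 'love studying', 'love studying data', 'studying data', 'studying data science']
```

This is only a re-implementation for illustration, not how the library is structured internally, but it arrives at the same five features.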

User:
Bravo, Space Voyager! Now let's try generating n-grams again, but with slightly different text. However, this time, the fit_transform function and its invoker have been removed. You need to fill in the blanks (____) to get the code running successfully.

The fit_transform function takes input as a list of texts. It will transform the texts into a matrix of token counts and return a sparse representation of this matrix, which we are storing as X. The fit_transform function is convenient when we want to learn a vocabulary dictionary and return document-term matrix at the same time.

For this exercise, we've made it a bit easier and more meaningful. We're only generating unigrams (individual words) this time, meaning n is equal to 1.

The text we're using this time is not a proper English sentence but a list of stemmed words to better illustrate how n-grams work on a finer level.

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Define two simple short texts
text1 = "data love data science passion"
text2 = "science love algorithm data passion"

# Setup the CountVectorizer to generate unigrams only
vectorizer = CountVectorizer(ngram_range=(1, 1))

# Apply the CountVectorizer on the text to create n-grams
X = ____.____([text1, text2])

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)
print("Matrix X: ", X.toarray())

```

Abacus AI ChatLLM:
To complete the code for generating unigrams from the provided texts, you need to fill in the blanks with the appropriate method and its invoker. Here’s the completed code:

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Define two simple short texts
text1 = "data love data science passion"
text2 = "science love algorithm data passion"

# Setup the CountVectorizer to generate unigrams only
vectorizer = CountVectorizer(ngram_range=(1, 1))

# Apply the CountVectorizer on the text to create n-grams
X = vectorizer.fit_transform([text1, text2])

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)
print("Matrix X: ", X.toarray())
```

### Explanation of the Filled Blanks

- **`vectorizer.fit_transform([text1, text2])`**: This line calls the `fit_transform` method on the `vectorizer` object, passing in a list containing `text1` and `text2`. This method learns the vocabulary from the texts and transforms them into a document-term matrix.
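As a rough check of the "learn and transform at the same time" idea, the sketch below compares `fit_transform` with a separate `fit` followed by `transform`, then reuses the fitted vocabulary on a new, made-up sentence. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data love data science passion", "science love algorithm data passion"]

# One step: learn the vocabulary and build the matrix together
one_step = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# Two steps: learn the vocabulary first, then build the matrix
two_step_vectorizer = CountVectorizer(ngram_range=(1, 1))
two_step_vectorizer.fit(docs)
two_step = two_step_vectorizer.transform(docs)

same = bool((one_step.toarray() == two_step.toarray()).all())
print(same)  # True

# transform() reuses the learned vocabulary; unseen words are ignored
new_counts = two_step_vectorizer.transform(["data data quantum"]).toarray().tolist()
print(new_counts)  # [[0, 2, 0, 0, 0]]
```

The separate `transform` call is what you would use on test data, so that it is encoded with the vocabulary learned from the training texts.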

### Expected Output

When you run this code, you should see an output similar to the following:

- **Shape of X with n-grams**: This will indicate the number of documents (2 in this case) and the number of unique features (unigrams).
- **Features**: This will display the specific unigrams generated from the input texts.
- **Matrix X**: This will show the document-term matrix as a NumPy array.

### Example Output

```
Shape of X with n-grams:  (2, 5)
Features:  ['algorithm' 'data' 'love' 'passion' 'science']
Matrix X:  [[0 2 1 1 1]
 [1 1 1 1 1]]
```

In this output:
- The shape `(2, 5)` indicates there are 2 documents and 5 unique unigrams.
- The features list shows the unique words extracted from both texts, in alphabetical order.
- The matrix `X` shows the count of each unigram in each document. For example, the `2` in the first row is the count of "data", which appears twice in `text1`, and the `0` shows that "algorithm" does not occur there.
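The same document-term matrix can be rebuilt by hand, which may help in reading it. The sketch below uses only `collections.Counter` and whitespace splitting; it is an illustration of the counting idea, not the library's implementation.

```python
from collections import Counter

docs = ["data love data science passion", "science love algorithm data passion"]
tokenized = [doc.split() for doc in docs]

# Columns: the sorted set of all words seen in any document
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Rows: per-document counts for each vocabulary word (0 if absent)
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['algorithm', 'data', 'love', 'passion', 'science']
print(matrix)      # [[0, 2, 1, 1, 1], [1, 1, 1, 1, 1]]
```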

## Generating Bigrams and Trigrams from Two Texts

You are doing an excellent job, Galactic Pioneer!

Now, let's dive a bit deeper into creating bigrams. Set up the CountVectorizer to generate bigrams, then apply it to the cleaned data. Once you accomplish this, you'll need to display the shape of X, the feature names, and the matrix X. You can do it!

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# Simple short sentences
text1 = "I love studying data science"
text2 = "Science is a field I am passionate about"
text3 = "Algorithms fascinate me"
text4 = "It is a joy to work on data algorithms"

# Clean and preprocess the text
cleaned_text_1 = clean_text(text1)
cleaned_text_2 = clean_text(text2)
cleaned_text_3 = clean_text(text3)
cleaned_text_4 = clean_text(text4)

# TODO: Setup the CountVectorizer to generate bigrams only

# TODO: Apply the CountVectorizer on the cleaned data to create n-grams

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)
print("Matrix X: ", X.toarray())

```

Abacus AI ChatLLM:
Let's complete the code to set up the `CountVectorizer` for generating bigrams and then apply it to the cleaned data. Here’s the updated code with the necessary modifications:

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# Simple short sentences
text1 = "I love studying data science"
text2 = "Science is a field I am passionate about"
text3 = "Algorithms fascinate me"
text4 = "It is a joy to work on data algorithms"

# Clean and preprocess the text
cleaned_text_1 = clean_text(text1)
cleaned_text_2 = clean_text(text2)
cleaned_text_3 = clean_text(text3)
cleaned_text_4 = clean_text(text4)

# Setup the CountVectorizer to generate bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))  # Generate bigrams

# Apply the CountVectorizer on the cleaned data to create n-grams
cleaned_data = [cleaned_text_1, cleaned_text_2, cleaned_text_3, cleaned_text_4]
X = vectorizer.fit_transform(cleaned_data)

# Display the shape of X and some features
print("Shape of X with n-grams: ", X.shape)
features = vectorizer.get_feature_names_out()
print("Features: ", features)
print("Matrix X: ", X.toarray())
```

### Explanation of the Code

1. **CountVectorizer Setup**: 
   - `vectorizer = CountVectorizer(ngram_range=(2, 2))`: This line sets up the `CountVectorizer` to generate only bigrams.

2. **Applying the Vectorizer**: 
   - `X = vectorizer.fit_transform(cleaned_data)`: This line applies the `CountVectorizer` to the cleaned text data, transforming it into a document-term matrix.

3. **Displaying Results**: 
   - The shape of `X`, the feature names (bigrams), and the matrix representation of the bigrams are printed.

### Expected Output

When you run this code, the output has three parts:

- **Shape of X**: `(4, N)`, where 4 is the number of documents and `N` is the number of unique bigrams generated.
- **Features**: The bigrams extracted from the cleaned text data. Because `clean_text` lower-cases the text, removes stop words, and stems the remaining words before vectorization, every feature is a pair of stemmed words. Pairs built from stop words, such as "is a", cannot appear.
- **Matrix X**: The document-term matrix as a NumPy array, giving the count of each bigram in each document.
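One way to read such a matrix is to look up a single bigram's column through the fitted vectorizer's `vocabulary_` mapping. The sketch below does this on two toy strings that mimic cleaned, stemmed text; the strings are hypothetical and are not the output of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents that look like clean_text output (hypothetical)
docs = ["love studi data scienc", "joy work data scienc"]

vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each feature string to its column index in X
shared_col = vectorizer.vocabulary_["data scienc"]
shared_counts = X[:, shared_col].toarray().ravel().tolist()
print(shared_counts)  # [1, 1]

unique_col = vectorizer.vocabulary_["joy work"]
unique_counts = X[:, unique_col].toarray().ravel().tolist()
print(unique_counts)  # [0, 1]
```

A bigram shared by both documents has a nonzero count in both rows, while one that occurs only in the second document has a zero in the first row.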

## Creating Bigrams from Preprocessed Text Data

Great job, Stellar Navigator! For your next challenge, you will need to extract features from raw text.

Use the provided clean_text function and the 20 Newsgroups dataset (limited to the first 100 documents). Next, initialize the CountVectorizer to generate both unigrams and bigrams, then apply it to your preprocessed data.

Finally, print the number of documents, the total count of features, and the last 10 features sorted in alphabetical order.

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# TODO: Fetch the 20 Newsgroups dataset

# TODO: For performance consideration, limit to the first 100 documents

# TODO: Clean and preprocess the Newsgroup data

# TODO: Setup the CountVectorizer to generate both unigrams and bigrams

# TODO: Apply the CountVectorizer on the cleaned data to create n-grams

# TODO: Display the number of documents, the total number of features, and the last 10 features sorted alphabetically

```

Abacus AI ChatLLM:
Let's complete the code to extract features from the 20 Newsgroups dataset using the provided `clean_text` function. We will set up the `CountVectorizer` to generate both unigrams and bigrams, apply it to the cleaned data, and then display the required information. Here’s the completed code:

```python
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re

# Load stop words from NLTK and initialize a stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Define function for text cleaning and stemming
def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces

    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if not word in stop_words]

    return " ".join(filtered_text)

# Fetch the 20 Newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]  # Limit to the first 100 documents

# Clean and preprocess the Newsgroup data
cleaned_data = [clean_text(data) for data in newsgroups_data]

# Setup the CountVectorizer to generate both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))  # Generate unigrams and bigrams

# Apply the CountVectorizer on the cleaned data to create n-grams
X = vectorizer.fit_transform(cleaned_data)

# Display the number of documents, the total number of features, and the last 10 features sorted alphabetically
num_documents = X.shape[0]
num_features = X.shape[1]
features = vectorizer.get_feature_names_out()

# Sort features alphabetically and get the last 10
last_10_features = sorted(features)[-10:]

print("Number of documents: ", num_documents)
print("Total number of features: ", num_features)
print("Last 10 features (sorted alphabetically): ", last_10_features)
```

### Explanation of the Code

1. **Fetching the Dataset**: 
   - `newsgroups_data = fetch_20newsgroups(subset='all')['data'][:100]`: This line fetches the 20 Newsgroups dataset and limits it to the first 100 documents.

2. **Cleaning the Data**: 
   - `cleaned_data = [clean_text(data) for data in newsgroups_data]`: This line applies the `clean_text` function to each document in the dataset.

3. **Setting Up CountVectorizer**: 
   - `vectorizer = CountVectorizer(ngram_range=(1, 2))`: This line sets up the `CountVectorizer` to generate both unigrams and bigrams.

4. **Applying the Vectorizer**: 
   - `X = vectorizer.fit_transform(cleaned_data)`: This line applies the `CountVectorizer` to the cleaned data, transforming it into a document-term matrix.

5. **Displaying Results**: 
   - The number of documents, total number of features, and the last 10 features sorted alphabetically are printed.

### Expected Output

When you run this code, you should see an output similar to the following:

```
Number of documents:  100
Total number of features:  16246
Last 10 features (sorted alphabetically):  [...]
```

The first two lines match the shape `(100, 16246)` reported in the lesson, since the data and the `ngram_range` are the same. The third line lists the ten features that come last alphabetically. The exact entries depend on the dataset, but because `clean_text` removes stop words and stems every word, they are stemmed unigrams and bigrams and never contain stop words.

## Unigrams and Bigrams from Clean 20 Newsgroups Dataset