Implementing N-grams is a straightforward extension of the Bag of Words (BoW) model. We use `scikit-learn`'s **`CountVectorizer`** just like with BoW, but we adjust the `ngram_range` parameter to specify the size of the word sequences we want to capture.

### N-gram Implementation

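The `ngram_range` parameter in `CountVectorizer` controls which N-gram features are generated. It takes a tuple `(min_n, max_n)` and extracts every N-gram whose length `n` satisfies `min_n <= n <= max_n`. For example, `(1, 2)` yields both unigrams and bigrams, while `(2, 2)` yields bigrams only.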

Let's use a spam classification example to illustrate this. We've already performed the standard preprocessing steps (cleaning, lowercasing, stopword removal, and stemming/lemmatization) on our text messages, storing the result in a list called `corpus`.

#### 1. Unigrams Only (`ngram_range=(1, 1)`)

This is the default setting and generates features from single words (unigrams). This is identical to a standard Bag of Words model.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer for unigrams
cv_unigram = CountVectorizer(max_features=100, ngram_range=(1, 1))

# Fit and transform the corpus
X_unigram = cv_unigram.fit_transform(corpus).toarray()

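# max_features=100 keeps the 100 most frequent unigrams, but
# get_feature_names_out() returns them in alphabetical order,
# so this prints the first 10 alphabetically, not the 10 most frequent
unigram_features = cv_unigram.get_feature_names_out()
print(f"First 10 unigrams (alphabetical): {unigram_features[:10]}")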
```

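The output is a list of single words in alphabetical order, like `['ask', 'babe', 'call', 'claim', 'contact', 'credit', 'entri', 'find', 'free', 'get']`. Some entries, such as `'entri'`, are stemmed forms produced by the preprocessing step.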

-----

#### 2. Unigrams and Bigrams (`ngram_range=(1, 2)`)

This setting generates features for both unigrams (single words) and bigrams (two-word sequences). It is a common and effective way to capture some word order without inflating the feature space too much.

```python
# Initialize CountVectorizer for unigrams and bigrams
cv_bigram = CountVectorizer(max_features=200, ngram_range=(1, 2))
X_bigram = cv_bigram.fit_transform(corpus).toarray()

# The 200 retained features now mix unigrams and bigrams;
# the slice below shows the first 10 in alphabetical order
bigram_features = cv_bigram.get_feature_names_out()
print(f"First 10 features (unigrams and bigrams): {bigram_features[:10]}")
```

Because the vocabulary is listed alphabetically, the first ten features printed here are not phrases but numeric tokens, such as `['000', '1', '100', '10p', '150p', '16', '18', '2', '2005', '4']`. The bigrams appear further down the list. We can also increase `max_features` to keep more of them. For example, if we set it to 500, we'll likely see features like **'please call'** and **'customer service'**.

-----

#### 3. Bigrams Only (`ngram_range=(2, 2)`)

To focus solely on bigrams and exclude single words, we set the range to `(2, 2)`. This can be useful for tasks where word combinations are more informative than individual words.

```python
# Initialize CountVectorizer for bigrams only
cv_only_bigram = CountVectorizer(max_features=100, ngram_range=(2, 2))
X_only_bigram = cv_only_bigram.fit_transform(corpus).toarray()

# Every feature is now a bigram; the slice shows the first 10 alphabetically
only_bigram_features = cv_only_bigram.get_feature_names_out()
print(f"First 10 bigrams (alphabetical): {only_bigram_features[:10]}")
```

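The output will consist exclusively of two-word phrases, like `['call us', 'contact us', 'free entry', 'mobile number', 'please call', 'to claim', 'txt to', 'uk mobile', 'ur award']`. A single word such as `'win'` cannot appear here, because `(2, 2)` excludes unigrams. The exact strings depend on the preprocessing applied to `corpus`.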

-----

#### 4. Unigrams, Bigrams, and Trigrams (`ngram_range=(1, 3)`)

When more context is needed, we can also include three-word sequences (trigrams). This significantly enlarges the feature space, which is why `max_features` is raised to 2000 below, but it can capture longer phrases.

```python
# Initialize CountVectorizer for unigrams, bigrams, and trigrams
cv_trigram = CountVectorizer(max_features=2000, ngram_range=(1, 3))
X_trigram = cv_trigram.fit_transform(corpus).toarray()
trigram_features = cv_trigram.get_feature_names_out()

# The features now include single words, two-word, and three-word combinations;
# the slice shows the first 10 in alphabetical order
print(f"First 10 features (including trigrams): {trigram_features[:10]}")
```
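To see how quickly the vocabulary grows as the upper bound of `ngram_range` increases, the following sketch counts features without a `max_features` cap. The three messages in `toy_corpus` are invented for illustration; on a real corpus the growth is typically much steeper.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration
toy_corpus = [
    "free entry win prize now",
    "call now claim free prize",
    "please call customer service",
]

vocab_sizes = {}
for max_n in (1, 2, 3):
    # No max_features cap, so every distinct N-gram becomes a feature
    cv = CountVectorizer(ngram_range=(1, max_n))
    cv.fit(toy_corpus)
    vocab_sizes[max_n] = len(cv.vocabulary_)

for max_n, size in vocab_sizes.items():
    print(f"ngram_range=(1, {max_n}): {size} features")
```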

#### Practical Advice

Start with **unigrams and bigrams (`ngram_range=(1,2)`)** as a baseline. If your model's performance isn't satisfactory, you can experiment by adding trigrams (`ngram_range=(1,3)`). Remember that widening the N-gram range also increases the dimensionality of your data, which can lead to longer training times and potential overfitting. It is therefore important to tune `max_features` as well, since it caps the vocabulary at the most frequent N-grams.
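One way to run that experiment systematically is a small grid search over `ngram_range` and `max_features`. The sketch below is only an illustration under several assumptions: the messages and labels are invented, `MultinomialNB` is an arbitrary choice of classifier, and the grid values are placeholders to adapt to your own data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical labelled messages (1 = spam, 0 = ham), for illustration only
messages = [
    "free entry win prize now",
    "claim free prize call now",
    "win cash prize txt now",
    "urgent call claim award",
    "see you at lunch today",
    "can you call me later",
    "meeting moved to tomorrow",
    "thanks for the help today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline so both are refit per fold
pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultinomialNB()),  # assumed classifier, not prescribed by the text
])

# Placeholder grid: the step-name prefix "vec__" routes values to CountVectorizer
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vec__max_features": [20, 50, None],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(messages, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from each training fold, so the cross-validated score is not inflated by terms seen in the held-out fold.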