### N-grams in NLP

**N-grams** are contiguous sequences of `n` words taken from a text, where `n` sets the length of each sequence. They are widely used in Natural Language Processing to capture local context and word order that single-word representations (like Bag of Words) miss.

### Key Points:
- **Purpose**: Group adjacent words into single features so that a model can pick up on word order and common phrases, not just isolated words.
- **Types**:
  - **Unigrams**: Single words (e.g., `"I"`, `"love"`, `"learning"`)
  - **Bigrams**: Two consecutive words (e.g., `"I love"`, `"love learning"`)
  - **Trigrams**: Three consecutive words (e.g., `"I love learning"`)

Using N-grams gives a model local context. For example, the bigram `"New York"` can be treated as a single feature that names a location rather than as two unrelated words.
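
To make the definitions above concrete, here is a minimal sketch that builds n-grams from a list of words by sliding a window of size `n` over it. The helper name `ngrams` is made up for this illustration, and splitting on whitespace is a simplifying assumption (real tokenizers also handle punctuation and casing).

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined into one string."""
    # A window of size n fits at len(tokens) - n + 1 starting positions
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, for illustration only
tokens = "I love learning".split()

print(ngrams(tokens, 1))  # ['I', 'love', 'learning']
print(ngrams(tokens, 2))  # ['I love', 'love learning']
print(ngrams(tokens, 3))  # ['I love learning']
```

If `n` is larger than the number of tokens, the function returns an empty list.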


```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "I love learning data science",
    "Data science is amazing and fun",
    "Machine learning and data science go hand in hand"
]

# Initialize CountVectorizer with n-grams
# ngram_range=(1, 2) will include both unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the documents
ngram_matrix = vectorizer.fit_transform(documents)

# Convert to array and display results
ngram_array = ngram_matrix.toarray()
print("N-gram Vocabulary:", vectorizer.get_feature_names_out())
print("N-gram Matrix:\n", ngram_array)
```

```text
N-gram Vocabulary: ['amazing' 'amazing and' 'and' 'and data' 'and fun' 'data' 'data science'
 'fun' 'go' 'go hand' 'hand' 'hand in' 'in' 'in hand' 'is' 'is amazing'
 'learning' 'learning and' 'learning data' 'love' 'love learning'
 'machine' 'machine learning' 'science' 'science go' 'science is']
N-gram Matrix:
 [[0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 0]
 [1 1 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1]
 [0 0 1 1 0 1 1 0 1 1 2 1 1 1 0 0 1 1 0 0 0 1 1 1 1 0]]
```


### Explanation
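1. **ngram_range=(1, 2)**: Extracts both unigrams and bigrams, so the features cover individual words as well as adjacent word pairs.
2. **Vocabulary**: Contains 26 entries: 12 single words and 14 two-word combinations, lowercased and sorted alphabetically. The word `"I"` does not appear because `CountVectorizer`'s default token pattern only keeps tokens of two or more characters.
3. **N-gram Matrix**: Each row is a document and each column is a vocabulary entry; a cell holds the number of times that n-gram occurs in the document. For example, `hand` has a count of 2 in the third document.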
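
The counting described above can be sketched with the standard library alone. This is a simplified stand-in for illustration, not `CountVectorizer`'s actual implementation: the helper `extract_ngrams` is a made-up name, and the lowercasing and two-or-more-character token rule are assumptions chosen to mirror the vectorizer's defaults.

```python
import re
from collections import Counter

documents = [
    "I love learning data science",
    "Data science is amazing and fun",
    "Machine learning and data science go hand in hand",
]

def extract_ngrams(text, max_n=2):
    """Return all n-grams of length 1..max_n from lowercased text."""
    # Keep only tokens with at least two word characters (drops "I")
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# Count n-grams per document, then build a shared, sorted vocabulary
doc_counts = [Counter(extract_ngrams(doc)) for doc in documents]
vocabulary = sorted(set().union(*doc_counts))

# One row per document, one column per vocabulary entry
matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(len(vocabulary))  # 26
print(matrix[2])
```

On these three documents the sketch yields the same 26-entry vocabulary and the same counts as the output shown earlier; on other inputs it may differ from the library's behavior.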

### When to Use N-grams
- **Text Classification**: Capturing short phrases whose meaning differs from that of their individual words.
- **Language Modeling**: Predicting the next word based on previous ones.
- **Sentiment Analysis**: Capturing phrases (e.g., "not good") rather than just individual words.
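
As a rough illustration of the language-modeling use case, the sketch below counts bigrams in a tiny corpus and predicts the next word as the one that most often followed the current word. The corpus, the whitespace tokenization, and the function name `predict_next` are assumptions made for this example; practical language models add smoothing and use far more data.

```python
from collections import Counter, defaultdict

# Toy corpus, already lowercased; an assumption for illustration
corpus = [
    "i love learning data science",
    "data science is amazing and fun",
    "machine learning and data science go hand in hand",
]

# For each word, count which words follow it (bigram counts)
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        next_word_counts[current][following] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = next_word_counts.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

print(predict_next("data"))     # science
print(predict_next("machine"))  # learning
print(predict_next("fun"))      # None ("fun" never precedes another word)
```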

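N-grams add local word-order context to count-based text representations, which helps across a variety of NLP tasks. The trade-off is a larger feature space: in the example above, adding bigrams grows the vocabulary from 12 unigrams to 26 features.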