## Implementation of BoW and N-grams using scikit-learn

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "Natural language processing is a fascinating field of artificial intelligence that deals with human language.",
    "Machine learning and natural language processing often go hand in hand to create smarter applications.",
    "The advancements in artificial intelligence are transforming industries, enabling new insights and possibilities.",
    "Natural language processing helps machines understand and respond to human communication.",
    "AI, machine learning, and natural language processing are important for building intelligent applications."
]

In [4]:
# Initialize CountVectorizer for BoW
bow_vectorizer = CountVectorizer()

# Fit and transform the documents to BoW
bow_matrix = bow_vectorizer.fit_transform(documents)

# Convert to array and display results
bow_array = bow_matrix.toarray()
print("Vocabulary for BoW:", bow_vectorizer.get_feature_names_out())
print("BoW Matrix:\n", bow_array)

Vocabulary for BoW: ['advancements' 'ai' 'and' 'applications' 'are' 'artificial' 'building'
 'communication' 'create' 'deals' 'enabling' 'fascinating' 'field' 'for'
 'go' 'hand' 'helps' 'human' 'important' 'in' 'industries' 'insights'
 'intelligence' 'intelligent' 'is' 'language' 'learning' 'machine'
 'machines' 'natural' 'new' 'of' 'often' 'possibilities' 'processing'
 'respond' 'smarter' 'that' 'the' 'to' 'transforming' 'understand' 'with']
BoW Matrix:
 [[0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 2 0 0 0 1 0 1 0 0 1 0
  0 1 0 0 0 0 1]
 [0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 2 0 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0
  1 0 0 1 0 0 0]
 [1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0
  0 0 1 0 1 0 0]
 [0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1
  0 0 0 1 0 1 0]
 [0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 1 0
  0 0 0 0 0 0 0]]


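- Vocabulary: The sorted list of unique lowercased words across all documents. With the default settings, `CountVectorizer` strips punctuation and drops single-character tokens, which is why "a" does not appear.

- BoW Matrix: Each row represents a document and each column corresponds to one vocabulary word; an entry is the number of times that word occurs in that document. For example, "language" appears twice in the first document, so the first row has a 2 in the 'language' column.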

In [8]:
# Initialize CountVectorizer for N-grams with ngram_range
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the documents to get N-gram representation
ngram_matrix = ngram_vectorizer.fit_transform(documents)

# Convert to array and display results
ngram_array = ngram_matrix.toarray()
print("Vocabulary for N-grams:", ngram_vectorizer.get_feature_names_out())
print("N-gram Matrix:\n", ngram_array)

Vocabulary for N-grams: ['advancements' 'advancements in' 'ai' 'ai machine' 'and' 'and natural'
 'and possibilities' 'and respond' 'applications' 'are' 'are important'
 'are transforming' 'artificial' 'artificial intelligence' 'building'
 'building intelligent' 'communication' 'create' 'create smarter' 'deals'
 'deals with' 'enabling' 'enabling new' 'fascinating' 'fascinating field'
 'field' 'field of' 'for' 'for building' 'go' 'go hand' 'hand' 'hand in'
 'hand to' 'helps' 'helps machines' 'human' 'human communication'
 'human language' 'important' 'important for' 'in' 'in artificial'
 'in hand' 'industries' 'industries enabling' 'insights' 'insights and'
 'intelligence' 'intelligence are' 'intelligence that' 'intelligent'
 'intelligent applications' 'is' 'is fascinating' 'language'
 'language processing' 'learning' 'learning and' 'machine'
 'machine learning' 'machines' 'machines understand' 'natural'
 'natural language' 'new' 'new insights' 'of' 'of artificial' 'often'
 'often go' 'p

- Vocabulary for N-grams: With `ngram_range=(1, 2)`, the vocabulary contains both single words (unigrams) and pairs of adjacent words (bigrams), so it captures more local context than single-word BoW.

- N-gram Matrix: Each row now represents a document with counts for both individual words and word pairs.

Conclusion:

- BoW records how often each word occurs but discards word order.

- N-grams record both individual words and short sequences of adjacent words, which preserves some local word order and context at the cost of a larger vocabulary.
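The difference can be illustrated with two sentences that use the same words in a different order. This is a minimal sketch; the sentences are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning (illustrative sentences)
docs = ["dog bites man", "man bites dog"]

# Plain BoW: unigram counts only
bow = CountVectorizer().fit_transform(docs).toarray()

# Unigrams plus bigrams
bigram = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()

# Compare the two document vectors under each representation
same_bow = bool((bow[0] == bow[1]).all())
same_bigram = bool((bigram[0] == bigram[1]).all())
print("Identical under BoW:", same_bow)
print("Identical with bigrams:", same_bigram)
```

Under BoW both sentences map to the same vector, while the bigram features ("dog bites" versus "man bites") tell them apart.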