# **Text Representation**

Text representation is the process of converting **raw text** into a **numerical format** that machine learning models can understand.  
Since models cannot work directly with text, we need to transform it into **vectors or matrices**.

## **Why Text Representation is Important**

1. **Machine Learning Compatibility**: ML algorithms require numerical input.  
2. **Capturing Meaning**: Proper representation can capture the **semantics and context** of words.  
3. **Dimensionality Reduction**: Representations can reduce the complexity of text data while retaining important information.  

## **Common Text Representation Techniques**

| Technique | Description |
|-----------|-------------|
| **Bag of Words (BoW)** | Represents text as a **frequency count** of words. Ignores grammar and word order. |
| **TF-IDF (Term Frequency–Inverse Document Frequency)** | Weights words based on **importance** in a document relative to the corpus. |
| **Word Embeddings** | Dense vector representations capturing **semantic meaning** (e.g., Word2Vec, GloVe, FastText). |
| **Contextual Embeddings** | Modern embeddings capturing **context-specific meaning** (e.g., BERT, GPT). |

## **Text Representation Workflow**

1. **Preprocessing Text** → Clean, tokenize, remove stopwords, and normalize words.  
2. **Feature Extraction** → Convert tokens into numerical vectors.  
3. **Model Input** → Feed vectors into ML/DL models for tasks like classification, clustering, or generation.
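
The three steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the `STOPWORDS` set, the regex tokenizer, and the `preprocess` / `vectorize` helpers are hypothetical names made up for this example, not part of any library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "is", "are", "in"}

def preprocess(text):
    """Step 1: lowercase, tokenize on alphabetic runs, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def vectorize(tokens, vocab):
    """Step 2: turn tokens into a count vector aligned with vocab."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

docs = ["The cat is in the garden.", "A dog chased the cat."]
token_lists = [preprocess(d) for d in docs]
vocab = sorted({t for tokens in token_lists for t in tokens})
vectors = [vectorize(tokens, vocab) for tokens in token_lists]

print(vocab)    # feature names, one per vector position
print(vectors)  # Step 3: rows like these are what a model would consume
```

Each document ends up as a fixed-length list of numbers, which is the form most ML libraries expect as input.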

## **Summary**

Text representations bridge the gap between **human language** and **machine understanding**.  
Choosing the right representation technique depends on the **task complexity**, **data size**, and **need for semantic understanding**.

## **Categories of Text Representation**

Text representation methods in NLP can be broadly divided into two categories:


### **Traditional Techniques**

These methods are **statistical** and based on **word frequency**.  
They treat text as a **bag of words**, ignoring grammar and context.

#### **Key Characteristics**
- Simple to implement  
- Sparse and high-dimensional  
- Do not capture semantic meaning  
- Work well for smaller, structured datasets  

#### **Common Traditional Techniques**
- **One-Hot Encoding**  
- **Label Encoding**  
- **Bag of Words (BoW)**  
- **TF-IDF (Term Frequency–Inverse Document Frequency)**  
- **Bag of N-Grams**

#### **One-Hot Encoding**

**One-Hot Encoding** is a simple text representation technique where each unique word in the vocabulary is represented as a **binary vector** whose length equals the vocabulary size.  
Each position in the vector corresponds to one word from the vocabulary:
- The position of the word being encoded → `1`
- Every other position → `0`

For example, if the vocabulary is:
> ["machine", "learning", "is", "fun"]

Then the sentence **“machine learning is fun”** can be represented as:

| Word | machine | learning | is | fun |
|------|----------|-----------|----|-----|
| "machine" | 1 | 0 | 0 | 0 |
| "learning" | 0 | 1 | 0 | 0 |
| "is" | 0 | 0 | 1 | 0 |
| "fun" | 0 | 0 | 0 | 1 |
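
The table above can be reproduced with a few lines of plain Python. This is a minimal sketch; `word_to_index` and `one_hot` are illustrative names chosen for this example.

```python
vocab = ["machine", "learning", "is", "fun"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

sentence = "machine learning is fun".split()
encoded = [one_hot(w) for w in sentence]
for word, vec in zip(sentence, encoded):
    print(f"{word:>8}: {vec}")
```

A word that is not in `vocab` raises a `KeyError` here; a real pipeline would need a policy for out-of-vocabulary words.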

**Advantages**
- Easy to understand and implement  
- Useful for small vocabularies and simple models  

**Limitations**
- Produces **high-dimensional sparse vectors**  
- Does **not capture relationships** between words (e.g., “king” is exactly as far from “queen” as it is from “table”)  
- Not suitable for large-scale or deep NLP tasks  
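
The second limitation can be checked numerically. The sketch below uses a hypothetical three-word vocabulary and hand-written `euclidean` and `cosine` helpers. Any two distinct one-hot vectors come out equally far apart, so the encoding carries no notion of similarity.

```python
import math

vocab = ["king", "queen", "table"]  # hypothetical vocabulary for illustration

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king, queen, table = (one_hot(w) for w in vocab)
d_kq = euclidean(king, queen)
d_kt = euclidean(king, table)
print(d_kq, d_kt)                                  # identical distances
print(cosine(king, queen), cosine(king, table))    # both similarities are 0.0
```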



In [None]:
# importing libraries

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# sample corpus

corpus = [
    "The cats are running in the garden.",
    "A cat runs faster than many other animals.",
    "They were playing with the running dogs."
]

In [None]:
# Tokenize and lemmatize each sentence (pos='v' treats every token as a verb,
# so nouns such as "animals" can stay unchanged)

lemmatizer = WordNetLemmatizer()

tokenized_docs = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    lemmas = [lemmatizer.lemmatize(token, pos='v') for token in tokens if token.isalpha()]
    tokenized_docs.append(lemmas)

In [None]:
# data printing

print("Lemmatized Tokens:")
for i, doc in enumerate(tokenized_docs, 1):
    print(f"Document {i}: {doc}")

Lemmatized Tokens:
Document 1: ['the', 'cat', 'be', 'run', 'in', 'the', 'garden']
Document 2: ['a', 'cat', 'run', 'faster', 'than', 'many', 'other', 'animals']
Document 3: ['they', 'be', 'play', 'with', 'the', 'run', 'dog']


In [27]:
# vocabulary

vocab = sorted(set(word for doc in tokenized_docs for word in doc))
print("\nVocabulary:", vocab)
print("Vocabulary count:", len(vocab))


Vocabulary: ['a', 'animals', 'be', 'cat', 'dog', 'faster', 'garden', 'in', 'many', 'other', 'play', 'run', 'than', 'the', 'they', 'with']
Vocabulary count: 16


In [21]:
# dataframe with one row per token, in corpus order
df = pd.DataFrame({'word': [word for doc in tokenized_docs for word in doc]})

In [26]:
# applying OHE

encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(df[['word']])

In [29]:
# Convert to readable DataFrame

one_hot_df = pd.DataFrame(one_hot, columns=encoder.categories_[0])
one_hot_df = pd.concat([df, one_hot_df], axis=1)
print("\nOne-Hot Encoded Representation:")

one_hot_df


One-Hot Encoded Representation:


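| | word | a | animals | be | cat | dog | faster | garden | in | many | other | play | run | than | the | they | with |
|---|------|---|---------|----|-----|-----|--------|--------|----|------|-------|------|-----|------|-----|------|------|
| 0 | the | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | cat | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | be | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | run | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | in | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | the | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | garden | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | a | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | cat | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | run | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |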


#### **Label Encoding**

**Label Encoding** is a simple technique that converts each unique word (or label) into a **numerical ID**.  
Instead of representing words as binary vectors (like in One-Hot Encoding), Label Encoding assigns a **unique integer** to every word in the vocabulary.

For example:

| Sentence | Encoded Vector |
|-----------|----------------|
| "cat runs fast" | [0, 1, 2] |
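
The example can be sketched in plain Python. The hypothetical helper `build_label_map` assigns IDs in order of first appearance, which reproduces `[0, 1, 2]`. For comparison, the second half assigns IDs in alphabetical order, which is how scikit-learn's `LabelEncoder` orders its classes; the same sentence then encodes differently.

```python
def build_label_map(tokens):
    """Assign the next unused integer to each word the first time it is seen."""
    mapping = {}
    for token in tokens:
        if token not in mapping:
            mapping[token] = len(mapping)
    return mapping

tokens = "cat runs fast".split()
label_map = build_label_map(tokens)
encoded = [label_map[t] for t in tokens]
print(label_map)
print(encoded)

# For comparison: IDs assigned in alphabetical order of the vocabulary.
sorted_map = {word: i for i, word in enumerate(sorted(set(tokens)))}
sorted_encoded = [sorted_map[t] for t in tokens]
print(sorted_encoded)
```

Either scheme is a valid label encoding. The integers are arbitrary identifiers, so only consistency between fitting and transforming matters.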

**Advantages**
- Compact and memory-efficient  
- Converts text into numeric form suitable for ML models  
- Easy to implement  

**Limitations**
- Encoded numbers imply **ordinal relationships** (e.g., “cat” < “dog”)  
- Not suitable for most text-based ML models directly  

In [30]:
# importing libraries

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder

In [31]:
# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [32]:
# sample corpus

corpus = [
    "The cats are running in the garden.",
    "A cat runs faster than many other animals.",
    "They were playing with the running dogs."
]

In [33]:
lemmatizer = WordNetLemmatizer()

tokenized_docs = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]
    tokenized_docs.append(lemmas)

In [34]:
print("Lemmatized Tokens:")
for i, doc in enumerate(tokenized_docs, 1):
    print(f"Document {i}: {doc}")

Lemmatized Tokens:
Document 1: ['the', 'cat', 'are', 'running', 'in', 'the', 'garden']
Document 2: ['a', 'cat', 'run', 'faster', 'than', 'many', 'other', 'animal']
Document 3: ['they', 'were', 'playing', 'with', 'the', 'running', 'dog']


In [35]:
all_words = [word for doc in tokenized_docs for word in doc]

encoder = LabelEncoder()
encoder.fit(all_words)

In [36]:
encoded_sentences = []

for doc in tokenized_docs:
    encoded_vec = encoder.transform(doc)
    encoded_sentences.append(encoded_vec)

In [38]:
# label encoding

print("\nSentence-Level Label Encodings:")

for i, (doc, encoded_vec) in enumerate(zip(tokenized_docs, encoded_sentences), 1):
    print(f"Sentence {i}:")
    print("Words:", doc)
    print("Encoded Vector:", encoded_vec.tolist())
    print("-" * 50)


Sentence-Level Label Encodings:
Sentence 1:
Words: ['the', 'cat', 'are', 'running', 'in', 'the', 'garden']
Encoded Vector: [14, 3, 2, 12, 7, 14, 6]
--------------------------------------------------
Sentence 2:
Words: ['a', 'cat', 'run', 'faster', 'than', 'many', 'other', 'animal']
Encoded Vector: [0, 3, 11, 5, 13, 8, 9, 1]
--------------------------------------------------
Sentence 3:
Words: ['they', 'were', 'playing', 'with', 'the', 'running', 'dog']
Encoded Vector: [15, 16, 10, 17, 14, 12, 4]
--------------------------------------------------


In [39]:
# mapping the vocabulary to the label

print("\nVocabulary Mapping (Word → Number):")
for label, word in enumerate(encoder.classes_):
    print(f"{word}: {label}")


Vocabulary Mapping (Word → Number):
a: 0
animal: 1
are: 2
cat: 3
dog: 4
faster: 5
garden: 6
in: 7
many: 8
other: 9
playing: 10
run: 11
running: 12
than: 13
the: 14
they: 15
were: 16
with: 17
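
The ordinal-relationship limitation shows up directly in this mapping. The small sketch below copies three IDs from the output above and compares their numeric gaps. The gaps reflect alphabetical position only, yet a model that treats the IDs as magnitudes would read them as meaningful.

```python
# IDs copied from the vocabulary mapping printed above.
label = {"cat": 3, "dog": 4, "with": 17}

gap_cat_dog = abs(label["cat"] - label["dog"])
gap_cat_with = abs(label["cat"] - label["with"])

# "cat" looks close to "dog" and far from "with" purely because of sorting.
print(gap_cat_dog, gap_cat_with)
```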


#### **Bag of Words**

The **Bag of Words (BoW)** model is a traditional text representation method that converts text into a **numerical vector** by counting word occurrences.

In BoW:
- The **order of words** is ignored (treated as a "bag")
- Each unique word in the corpus becomes a **feature (column)**
- Each document is represented by the **frequency** of words it contains

**Types of BoW Representations**

1. Normal (Count-Based) BoW  
Represents how **many times** each word appears in a document.

| Document | cat | run | dog |
|------|-----|-----|-----|
| Sentence 1 | 1 | 2 | 0 |
| Sentence 2 | 0 | 1 | 1 |

2. Binary BoW  
Represents **presence or absence** of a word in a document (ignores frequency).

| Document | cat | run | dog |
|------|-----|-----|-----|
| Sentence 1 | 1 | 1 | 0 |
| Sentence 2 | 0 | 1 | 1 |
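
Both variants can be sketched with the standard library. The two tokenized sentences below are hypothetical, chosen so that the output matches the tables above; `count_bow` and `binary_bow` are illustrative helper names.

```python
from collections import Counter

# Hypothetical tokenized sentences chosen to reproduce the tables above.
docs = [
    ["cat", "run", "run"],
    ["run", "dog"],
]
vocab = ["cat", "run", "dog"]  # column order used in the tables

def count_bow(tokens):
    """Count-based BoW: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_bow(tokens):
    """Binary BoW: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

count_vectors = [count_bow(d) for d in docs]
binary_vectors = [binary_bow(d) for d in docs]
print(count_vectors)
print(binary_vectors)
```

The two representations differ only where a word repeats within a document, as with `run` in the first sentence.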

**Advantages**
- Simple, fast, and effective for small datasets  
- Works well with classical ML models (e.g., Naive Bayes, SVM)

**Limitations**
- Ignores word order and context  
- Leads to large, sparse vectors for big vocabularies  
- Does not capture semantic meaning  

In [40]:
# importing libraries

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [41]:
# Download NLTK resources

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [42]:
# sample corpus

corpus = [
    "The cats are running in the garden.",
    "A cat runs faster than many other animals.",
    "They were playing with the running dogs."
]

In [45]:
lemmatizer = WordNetLemmatizer()

lemmatized_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    lemmas = [lemmatizer.lemmatize(token, pos='v') for token in tokens if token.isalpha()]
    lemmatized_corpus.append(" ".join(lemmas))

In [46]:
print("Lemmatized Corpus:")
for i, doc in enumerate(lemmatized_corpus, 1):
    print(f"Document {i}: {doc}")

Lemmatized Corpus:
Document 1: the cat be run in the garden
Document 2: a cat run faster than many other animals
Document 3: they be play with the run dog


In [47]:
# Normal BOW

count_vectorizer = CountVectorizer(binary=False)
bow_counts = count_vectorizer.fit_transform(lemmatized_corpus)

In [51]:
bow_df = pd.DataFrame(bow_counts.toarray(), columns=count_vectorizer.get_feature_names_out())
print("\nNormal (Count-Based) Bag of Words:")

bow_df


Normal (Count-Based) Bag of Words:


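| | animals | be | cat | dog | faster | garden | in | many | other | play | run | than | the | they | with |
|---|---------|----|-----|-----|--------|--------|----|------|-------|------|-----|------|-----|------|------|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |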


* Each row is one sentence

In [52]:
# Binary BOW

binary_vectorizer = CountVectorizer(binary=True)
bow_binary = binary_vectorizer.fit_transform(lemmatized_corpus)

In [53]:
bow_binary_df = pd.DataFrame(bow_binary.toarray(), columns=binary_vectorizer.get_feature_names_out())
print("\n🔢 Binary Bag of Words:")

bow_binary_df


🔢 Binary Bag of Words:


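| | animals | be | cat | dog | faster | garden | in | many | other | play | run | than | the | they | with |
|---|---------|----|-----|-----|--------|--------|----|------|-------|------|-----|------|-----|------|------|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |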


 * Each row in the Dataframe is a sentence

#### **TF-IDF**

**TF–IDF** is an advanced version of Bag of Words that measures how **important** a word is in a document **relative to all documents** in the corpus.  
It reduces the influence of common words (like “the”, “is”, “and”) and highlights **unique, meaningful words**.


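A commonly used form of the score for a term $t$ in a document $d$ is the product of its term frequency and its inverse document frequency:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t), \qquad \text{IDF}(t) = \log\frac{N}{\text{DF}(t)}
$$

where $\text{TF}(t, d)$ is how often $t$ occurs in $d$, $N$ is the number of documents in the corpus, and $\text{DF}(t)$ is the number of documents that contain $t$. Libraries often implement smoothed or normalized variants of this formula.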

**Intuition**

- Words that occur **often in one document** but **rarely across others** → get **high scores**  
- Words that appear in **all documents** → get **low scores**

**Example**

| Word | TF | IDF | TF–IDF |
|------|----|-----|--------|
| data | 0.2 | 1.7 | 0.34 |
| science | 0.1 | 2.3 | 0.23 |
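
Each row of the table is just the product of its first two columns, e.g. 0.2 × 1.7 = 0.34 for "data". The sketch below implements one common textbook definition, TF × log(N / DF) with the natural logarithm, on a hypothetical toy corpus. The helper names `tf`, `idf`, and `tf_idf` are illustrative, and libraries may use different variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["data", "science", "is", "fun"],
    ["data", "is", "everywhere"],
    ["science", "needs", "data"],
]

def tf(term, doc):
    """Relative frequency of term within one document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_data = tf_idf("data", docs[0], docs)        # appears in every document
score_science = tf_idf("science", docs[0], docs)  # appears in two documents
score_fun = tf_idf("fun", docs[0], docs)          # appears only in this one
print(score_data, score_science, score_fun)
```

The ordering `score_data < score_science < score_fun` mirrors the intuition above: the rarer a word is across the corpus, the more weight it gets where it does occur.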

**Advantages**
- Highlights **important** and **unique** words  
- Improves document similarity and search ranking  
- Reduces the impact of common words  

**Limitations**
- Still ignores **word order and context**  
- Sparse representation for large vocabularies  

In [54]:
# importing libraries

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [56]:
# Download NLTK resources

nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [57]:
# sample corpus

corpus = [
    "The cats are running in the garden.",
    "A cat runs faster than many other animals.",
    "They were playing with the running dogs."
]

In [60]:
lemmatizer = WordNetLemmatizer()

lemmatized_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    lemmas = [lemmatizer.lemmatize(token, pos='v') for token in tokens if token.isalpha()]
    lemmatized_corpus.append(" ".join(lemmas))

In [61]:
print("Lemmatized Corpus:")
for i, doc in enumerate(lemmatized_corpus, 1):
    print(f"Document {i}: {doc}")

Lemmatized Corpus:
Document 1: the cat be run in the garden
Document 2: a cat run faster than many other animals
Document 3: they be play with the run dog


In [62]:
# TF–IDF

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(lemmatized_corpus)

In [65]:
# Convert to DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print("\n📊 TF–IDF Representation:")

tfidf_df.round(3)


📊 TF–IDF Representation:


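| | animals | be | cat | dog | faster | garden | in | many | other | play | run | than | the | they | with |
|---|---------|----|-----|-----|--------|--------|----|------|-------|------|-----|------|-----|------|------|
| 0 | 0.0 | 0.315 | 0.315 | 0.0 | 0.0 | 0.415 | 0.415 | 0.0 | 0.0 | 0.0 | 0.245 | 0.0 | 0.631 | 0.0 | 0.0 |
| 1 | 0.411 | 0.0 | 0.312 | 0.0 | 0.411 | 0.0 | 0.0 | 0.411 | 0.411 | 0.0 | 0.243 | 0.411 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.324 | 0.0 | 0.426 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.426 | 0.252 | 0.0 | 0.324 | 0.426 | 0.426 |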


* Each row in the dataframe is a vector
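
The numbers above can be reproduced by hand, which shows what `TfidfVectorizer` does with its default settings: raw term counts, a smoothed IDF of ln((1 + N) / (1 + DF)) + 1, and L2 normalization of each row. The sketch below is a plain-Python re-derivation under those assumptions, not scikit-learn's actual implementation; `tfidf_row` is an illustrative name.

```python
import math
from collections import Counter

# Lemmatized corpus from the notebook. The single-letter token "a" is left out
# because TfidfVectorizer's default token pattern keeps only tokens of 2+ characters.
docs = [
    "the cat be run in the garden".split(),
    "cat run faster than many other animals".split(),
    "they be play with the run dog".split(),
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(1 for d in docs if t in d) for t in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]   # raw count times idf
    norm = math.sqrt(sum(x * x for x in raw))   # L2 norm of the row
    return [x / norm for x in raw]

rows = [tfidf_row(d) for d in docs]
print(round(rows[0][vocab.index("the")], 3))      # compare with row 0, column "the"
print(round(rows[1][vocab.index("animals")], 3))  # compare with row 1, column "animals"
```

Because of the `+ 1` in the smoothed IDF, a word that occurs in every document (such as `run`) still gets a non-zero weight, unlike in the unsmoothed log(N / DF) form.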

In [69]:
# Show the three highest-scoring words for each document

print("Top 3 Words by TF–IDF per Document:")
for i, row in enumerate(tfidf_df.values):
    sorted_indices = row.argsort()[::-1]
    top_words = [(tfidf_df.columns[idx], float(round(row[idx], 3))) for idx in sorted_indices[:3]]
    print(f"Document {i+1}: {top_words}")

Top 3 Words by TF–IDF per Document:
Document 1: [('the', 0.631), ('garden', 0.415), ('in', 0.415)]
Document 2: [('than', 0.411), ('other', 0.411), ('many', 0.411)]
Document 3: [('with', 0.426), ('they', 0.426), ('play', 0.426)]
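
The TF-IDF rows can also be used to compare documents, which is the "document similarity" use case listed under Advantages. `TfidfVectorizer` L2-normalizes each row by default, so a plain dot product approximates cosine similarity. The sketch below copies the rounded values from the TF-IDF table, so the results are approximate; `dot` is an illustrative helper.

```python
# Rows copied from the rounded TF-IDF table above
# (column order: animals, be, cat, dog, faster, garden, in, many,
#  other, play, run, than, the, they, with).
doc1 = [0.0, 0.315, 0.315, 0.0, 0.0, 0.415, 0.415, 0.0, 0.0, 0.0, 0.245, 0.0, 0.631, 0.0, 0.0]
doc2 = [0.411, 0.0, 0.312, 0.0, 0.411, 0.0, 0.0, 0.411, 0.411, 0.0, 0.243, 0.411, 0.0, 0.0, 0.0]
doc3 = [0.0, 0.324, 0.0, 0.426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.426, 0.252, 0.0, 0.324, 0.426, 0.426]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

sim_12 = dot(doc1, doc2)
sim_13 = dot(doc1, doc3)
sim_23 = dot(doc2, doc3)
print(round(sim_12, 3), round(sim_13, 3), round(sim_23, 3))
```

Documents 1 and 3 come out as the most similar pair, largely because they share `the` and `be`. This is a reminder that without stopword removal, function words can dominate similarity scores on tiny corpora.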


#### **Bag of N-grams**

An **n-gram** is a **sequence of n consecutive words** from a text.  
Instead of representing individual words (like in Bag of Words), the **Bag of n-Grams** captures **word combinations** that preserve a bit of **context** and **word order**.

**Types of n-Grams**

| Type | n | Example (for sentence "I love natural language processing") |
|------|---|-------------------------------------------------------------|
| Unigram | 1 | ["I", "love", "natural", "language", "processing"] |
| Bigram | 2 | ["I love", "love natural", "natural language", "language processing"] |
| Trigram | 3 | ["I love natural", "love natural language", "natural language processing"] |

- **Unigrams** → capture individual word importance (like BoW)  
- **Bigrams / Trigrams** → capture short phrases and local word order  
  - e.g., the bigram "not good" carries a different meaning from the unigram "good"
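
Extracting n-grams is a sliding window over the token list. The sketch below uses a hypothetical helper `ngrams` on the example sentence from the table. A sentence of k tokens yields k - n + 1 n-grams.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Counting these strings per document, exactly as in Bag of Words, gives the Bag of n-Grams representation.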

**Advantages**
- Captures **contextual information** ignored by simple BoW  
- Helps in **sentiment analysis**, **text classification**, **feature extraction**

**Limitations**
- Increases **feature space size** rapidly with higher n  
- Still doesn’t capture **long-range dependencies**, which contextual embeddings are designed to handle  

In [70]:
# importing libraries

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [71]:
# Download resources

nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [72]:
# sample corpus

corpus = [
    "The cats are running in the garden.",
    "A cat runs faster than many other animals.",
    "They were playing with the running dogs."
]

In [75]:
lemmatizer = WordNetLemmatizer()

lemmatized_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    lemmas = [lemmatizer.lemmatize(token, pos='v') for token in tokens if token.isalpha()]
    lemmatized_corpus.append(" ".join(lemmas))

In [76]:
print("Lemmatized Corpus:")
for i, doc in enumerate(lemmatized_corpus, 1):
    print(f"Document {i}: {doc}")

Lemmatized Corpus:
Document 1: the cat be run in the garden
Document 2: a cat run faster than many other animals
Document 3: they be play with the run dog


In [77]:
# unigram - 1

vectorizer_uni = CountVectorizer(ngram_range=(1, 1))
unigram_matrix = vectorizer_uni.fit_transform(lemmatized_corpus)
unigram_df = pd.DataFrame(unigram_matrix.toarray(), columns=vectorizer_uni.get_feature_names_out())

print("Unigram (n=1) Representation:")
unigram_df


Unigram (n=1) Representation:


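| | animals | be | cat | dog | faster | garden | in | many | other | play | run | than | the | they | with |
|---|---------|----|-----|-----|--------|--------|----|------|-------|------|-----|------|-----|------|------|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |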


In [78]:
# bigram - 2

vectorizer_bi = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = vectorizer_bi.fit_transform(lemmatized_corpus)
bigram_df = pd.DataFrame(bigram_matrix.toarray(), columns=vectorizer_bi.get_feature_names_out())

print("Bigram (n=2) Representation:")
bigram_df

Bigram (n=2) Representation:


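| | be play | be run | cat be | cat run | faster than | in the | many other | other animals | play with | run dog | run faster | run in | than many | the cat | the garden | the run | they be | with the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |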


In [80]:
# trigram - 3

vectorizer_tri = CountVectorizer(ngram_range=(3, 3))
trigram_matrix = vectorizer_tri.fit_transform(lemmatized_corpus)
trigram_df = pd.DataFrame(trigram_matrix.toarray(), columns=vectorizer_tri.get_feature_names_out())

print("Trigram (n=3) Representation:")
trigram_df

Trigram (n=3) Representation:


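| | be play with | be run in | cat be run | cat run faster | faster than many | in the garden | many other animals | play with the | run faster than | run in the | than many other | the cat be | the run dog | they be play | with the run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |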


In [82]:
# unigram + Bigram

vectorizer_combined = CountVectorizer(ngram_range=(1, 2))
combined_matrix = vectorizer_combined.fit_transform(lemmatized_corpus)
combined_df = pd.DataFrame(combined_matrix.toarray(), columns=vectorizer_combined.get_feature_names_out())

print("Combined Unigram + Bigram Representation:")
combined_df

Combined Unigram + Bigram Representation:


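| | animals | be | be play | be run | cat | cat be | cat run | dog | faster | faster than | ... | than | than many | the | the cat | the garden | the run | they | they be | with | with the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |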
