<a href="https://colab.research.google.com/github/raviteja-padala/NLP/blob/main/NLP_Text_Representation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Representation: Unveiling the Power of Feature Extraction from Text

### 'Text representation' or 'Text Vectorization' or 'Feature extraction from text'

In the realm of Natural Language Processing (NLP), the challenge lies in converting raw textual data into a form that machine learning algorithms can comprehend. This pivotal process, often referred to as "Text Representation," "Feature Extraction from Text," or "Text Vectorization," forms the bedrock for a myriad of NLP applications.

Textual data, being inherently unstructured, requires a systematic approach to unlock its insights. The goal is to transform words and sentences into numerical formats that algorithms can manipulate effectively.

The techniques covered in this notebook turn raw text into structured numerical data, from which models can detect patterns and extract insights.

Basic terms in NLP:

1. **Corpus:** An NLP corpus is a **large collection of text documents** or speech recordings that are used for language analysis. It serves as a data source for various language-related tasks in natural language processing. Corpora (plural of corpus) provide diverse examples of how words, phrases, and language patterns are used in real-world contexts, helping NLP models learn and understand language better.

2. **Vocabulary:** Vocabulary refers to **the set of unique words** present in a given corpus, document, or text dataset. It's like a dictionary containing all the words used in the text. A rich vocabulary is essential for understanding and generating language accurately.

3. **Document:** In NLP, a document is a single piece of text that can vary in length. A document can be as short as a single sentence or as long as an entire book. Documents are the units of analysis in text processing tasks like text classification, sentiment analysis, and topic modeling.

4. **Word:** A word is a fundamental unit of language that carries meaning. It's a sequence of characters that represents a concept, object, action, or idea. Words are the building blocks of sentences and play a crucial role in conveying information and communication.

In summary,

* an NLP corpus is a collection of text or speech data used for analysis,
* vocabulary is the set of unique words,
* a document is a piece of text (short or long), and
* a word is the basic unit of language that carries meaning.

These concepts are essential for understanding and working with language in natural language processing tasks.

### Contents:

- 1.One Hot Encoding
- 2.Bag of Words (BoW)
- 3.N-grams or Bag of N-grams
- 4.Tf-Idf
- 5.Custom features

# 1.One Hot Encoding:

One-hot encoding is a technique used in natural language processing (NLP) to represent categorical variables, such as words, as **binary vectors**. Each word in a vocabulary is represented as a unique binary vector where only one element is 1 (hot) and the rest are 0 (cold). This encoding is useful for feeding categorical data into machine learning models that require numerical input.

Here's an example of one-hot encoding for a small vocabulary:

Suppose we have a simple vocabulary: ["apple", "banana", "orange", "grape"]

1. **Word-to-Index Mapping:**
   Each word is assigned a unique index in the vocabulary:
   - "apple" -> 0
   - "banana" -> 1
   - "orange" -> 2
   - "grape" -> 3

2. **One-Hot Encoding:**
   Each word is represented as a binary vector of the length of the vocabulary. The index corresponding to the word's position in the vocabulary is set to 1, and the rest are set to 0.

   - "apple": [1, 0, 0, 0]
   - "banana": [0, 1, 0, 0]
   - "orange": [0, 0, 1, 0]
   - "grape": [0, 0, 0, 1]

So, in this example, the word "apple" is represented as [1, 0, 0, 0], "banana" as [0, 1, 0, 0], and so on. This binary representation preserves the categorical nature of the words and allows them to be used as input features for machine learning algorithms.
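The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the helper name `one_hot` is an assumption introduced here.

```python
# Vocabulary from the example above; a word's index is its position in the list.
vocabulary = ["apple", "banana", "orange", "grape"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of 0s with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1  # raises KeyError for words outside the vocabulary
    return vector

for word in vocabulary:
    print(word, one_hot(word))
# apple [1, 0, 0, 0]
# banana [0, 1, 0, 0]
# orange [0, 0, 1, 0]
# grape [0, 0, 0, 1]
```

Note that the vector length always equals the vocabulary size, which is why the representation grows quickly for realistic vocabularies.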

However, one-hot encoding can lead to high-dimensional sparse vectors, especially for large vocabularies, which can be inefficient and computationally expensive. In practice, more advanced techniques like word embeddings (e.g., Word2Vec, GloVe) are often used to represent words in a more compact and meaningful vector space.

Pros and cons of one-hot encoding:

**Pros of One-Hot Encoding:**

1. **Simplicity:** It's a straightforward and easy-to-understand technique for representing categorical data.

2. **Interpretability:** The resulting binary vectors are interpretable, as each dimension corresponds to a specific category.

3. **No Assumptions:** One-hot encoding doesn't assume any relationship or order between categories.

**Cons of One-Hot Encoding:**

1. **High Dimensionality:** For large vocabularies or categorical features with many levels, one-hot encoding can result in high-dimensional and **sparse data**, leading to increased memory and computation requirements.

2. **Loss of Continuity:** One-hot encoding **doesn't capture any semantic relationships** between words or categories. All categories are treated as independent.

3. **Curse of Dimensionality:** High-dimensional data can suffer from the "curse of dimensionality," where distances between data points become less meaningful, and the risk of overfitting increases.

4. **Not Suitable for Text Sequences:** One-hot encoding doesn't capture the sequential nature of words in text. It treats each word as independent, ignoring the contextual information.


In summary, while one-hot encoding is simple and intuitive for representing categorical data, it can lead to high-dimensional data and lack of context in certain NLP applications. More advanced techniques like word embeddings and categorical encodings have been developed to address some of these limitations.

# 2.Bag of Words (BoW)

"Bag of Words" (BoW) is a basic and widely used technique in natural language processing (NLP) for representing text data numerically. It's a simple model that **focuses on the frequency of words in a document**, ignoring the order and structure of the words. The name "Bag of Words" reflects the idea that you're treating a document as a bag that contains all the words in it, disregarding their sequence.

Here's how the Bag of Words model works:

1. **Tokenization:** Break down a document into individual words or tokens.

2. **Vocabulary Creation:** Create a vocabulary containing all unique words from the entire corpus. Each word is assigned a unique index.

3. **Counting:** For each document, count how many times each word from the vocabulary appears in that document. This information is stored in a frequency vector.

4. **Vectorization:** Represent each document as a vector where the value at each index corresponds to the frequency of the word at that index in the document.

Here's an example:

Suppose you have two documents:
1. "The cat chased the mouse."
2. "The mouse ran away."

**Vocabulary** (after lowercasing and removing punctuation): ["the", "cat", "chased", "mouse", "ran", "away"]

**Bag of Words Representation:**
1. [2, 1, 1, 1, 0, 0]
2. [1, 0, 0, 1, 1, 1]

In the first document, "the" appears twice (counting "The" and "the" as the same word), "cat," "chased," and "mouse" appear once each, and the other words don't appear. So, the vector [2, 1, 1, 1, 0, 0] represents the first document's bag of words.

In the second document, "the," "mouse," "ran," and "away" appear once each, and the other words don't appear. So, the vector [1, 0, 0, 1, 1, 1] represents the second document's bag of words.
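The four steps can be reproduced by hand for these two documents. The sketch below uses only the standard library and assumes a simple tokenizer (lowercase, alphabetic runs only); the vocabulary is kept in first-seen order so that it lines up with the example, whereas a library may order it differently.

```python
import re
from collections import Counter

documents = ["The cat chased the mouse.", "The mouse ran away."]

def tokenize(text):
    # Lowercase and keep alphabetic runs, so "The" and "the" become one token.
    return re.findall(r"[a-z]+", text.lower())

# Steps 1-2: tokenize, then build the vocabulary in first-seen order.
tokenized = [tokenize(doc) for doc in documents]
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 3-4: count tokens per document and lay the counts out by vocabulary index.
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['the', 'cat', 'chased', 'mouse', 'ran', 'away']
print(bow_vectors)  # [[2, 1, 1, 1, 0, 0], [1, 0, 0, 1, 1, 1]]
```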

BoW is simple and effective for tasks like text classification and sentiment analysis. However, it doesn't consider word order, grammar, or semantic meaning. More advanced techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings address some of these limitations by capturing more nuanced information from the text data.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The cat chased the mouse.",
    "The mouse ran away.",
    "The cat and the mouse are friends."
]

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents into a bag of words representation
BOW = vectorizer.fit_transform(documents)

# Get the vocabulary (unique words in the documents)
vocabulary = vectorizer.get_feature_names_out()

# Convert the bag of words matrix to an array
X_array = BOW.toarray()

# Print the vocabulary and the bag of words representation
print("Alphabetical ordering of words", vectorizer.vocabulary_)
print("Vocabulary:", vocabulary)

print("Bag of Words Representation:\n", X_array)


Alphabetical ordering of words {'the': 8, 'cat': 3, 'chased': 4, 'mouse': 6, 'ran': 7, 'away': 2, 'and': 0, 'are': 1, 'friends': 5}
Vocabulary: ['and' 'are' 'away' 'cat' 'chased' 'friends' 'mouse' 'ran' 'the']
Bag of Words Representation:
 [[0 0 0 1 1 0 1 0 2]
 [0 0 1 0 0 0 1 1 1]
 [1 1 0 1 0 1 1 0 2]]


`CountVectorizer` is a class in scikit-learn that is used for converting a collection of text documents into a matrix of token counts. It has several hyperparameters that allow you to customize its behavior. Here are some important hyperparameters of `CountVectorizer`:

1. **`analyzer`**: Specifies whether to tokenize the input text at the word level or character level. It takes values like `'word'` (default) or `'char'`.

2. **`tokenizer`**: Allows you to specify a custom function for tokenizing the input text. If not specified, the `CountVectorizer` will use its default tokenizer.

3. **`stop_words`**: Specifies a list of stop words that will be ignored during tokenization. It can take values like `'english'` (for using the built-in English stop words) or a list of custom stop words.

4. **`ngram_range`**: Specifies the range of n-grams to be extracted. An n-gram is a sequence of n words. For example, setting `ngram_range=(1, 2)` would include both single words and pairs of consecutive words (bigrams).

5. **`max_df`**: Excludes words that appear in too many documents. It can be given as a proportion (float) or an absolute count (integer). For instance, if `max_df=0.8`, words appearing in more than 80% of the documents will be excluded.

6. **`min_df`**: Excludes words that appear in too few documents. It can also be given as a proportion or an absolute count. For example, `min_df=2` would exclude words appearing in only one document.

7. **`max_features`**: Limits the vocabulary size to the specified number of most frequent terms. This can help manage memory and computation requirements.

8. **`lowercase`**: Controls whether the text should be converted to lowercase before tokenization. Defaults to `True`.

9. **`preprocessor`**: Allows you to specify a custom function to preprocess the text before tokenization.

10. **`binary`**: Controls whether features are binary indicators or counts. If `binary=True`, every non-zero count is set to 1, so each feature only records whether the term is present in or absent from the document. If `binary=False` (**the default**), each feature holds the term's count in the document. Binary features are commonly used in sentiment analysis.
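To make the `min_df` and `max_df` thresholds concrete, here is a pure-Python sketch of document-frequency filtering. It assumes pre-tokenized, lowercased documents, and the function name `filter_by_document_frequency` is hypothetical; it is meant to mirror the documented behavior (floats read as proportions, integers as counts), not scikit-learn's actual code.

```python
documents = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "mouse", "ran", "away"],
    ["the", "cat", "and", "the", "mouse", "are", "friends"],
]

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain the term at least once.
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Floats are read as proportions of the corpus, integers as absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)

kept = filter_by_document_frequency(documents, min_df=1, max_df=0.8)
print(kept)
# ['and', 'are', 'away', 'cat', 'chased', 'friends', 'ran']
```

"the" and "mouse" occur in all three documents (100% > 80%), so they are dropped, while "cat" (two of three documents) is kept.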


In [None]:
# Hyperparameters
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat chased the mouse.",
    "The mouse ran away.",
    "The cat and the mouse are friends."
]

vectorizer = CountVectorizer(
    lowercase=True,       # convert text to lowercase
    stop_words='english', # remove English stop words
    ngram_range=(1, 2),   # extract both single words and bigrams
    max_df=0.8,           # exclude words appearing in more than 80% of documents
    min_df=1,             # include words appearing in at least one document
    max_features=None     # do not limit the vocabulary size
)

X = vectorizer.fit_transform(documents)

# Convert the bag of words matrix to an array
X_array = X.toarray()
print("Alphabetical ordering of words", vectorizer.vocabulary_)
print('\n')
print("Bag of Words Representation:\n", X_array)

Alphabetical ordering of words {'cat': 1, 'chased': 4, 'cat chased': 2, 'chased mouse': 5, 'ran': 9, 'away': 0, 'mouse ran': 8, 'ran away': 10, 'friends': 6, 'cat mouse': 3, 'mouse friends': 7}


Bag of Words Representation:
 [[0 1 1 0 1 1 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 1 1 1]
 [0 1 0 1 0 0 1 1 0 0 0]]


Pros and cons of the Bag of Words (BoW) model in natural language processing:

**Pros of Bag of Words (BoW):**

1. **Simplicity:** BoW is a straightforward and simple representation for text data.

2. **Ease of Implementation:** Implementing BoW is relatively easy, making it a good starting point for text-based projects.

3. **Feature Extraction:** BoW converts text data into a numerical format that can be used with various machine learning algorithms.

4. **Language Independence:** BoW needs only a way to split text into tokens, so the same approach can be applied to many languages without modeling their grammar.

5. **Useful for Shallow Models:** BoW can work well with simpler models like Naive Bayes and Logistic Regression.

6. **Interpretability:** The resulting features (word frequencies) are interpretable and can provide insights into text data.

**Cons of Bag of Words (BoW):**

1. **Lack of Context:** BoW completely **ignores word order** and context, losing valuable information in text sequences.

2. **Sparse Representation:** BoW results in high-dimensional **sparse** vectors, which can lead to memory and computation issues.

3. **Loss of Semantic Meaning:** BoW treats words with multiple meanings the same, lacking the ability to capture **semantic relationships**.

4. **No Sequence Information:** BoW fails to capture sequential information, which is important in many NLP tasks.

5. **Sensitive to Vocabulary Size:** The choice of vocabulary size affects the representation's quality and model performance.

6. **Out-of-Vocabulary Issue:** BoW struggles with handling words it hasn't seen during training (OOV words).

7. **Limited by Frequency:** BoW focuses solely on word frequency, disregarding other important aspects like word importance.

8. **Not Effective for Deep Learning:** Modern deep learning models benefit from understanding sequential relationships and nuances, which BoW doesn't capture.

In summary, while Bag of Words is a simple and effective technique for text representation, it has limitations related to context, sparsity, and lack of semantic understanding, making it less suitable for more advanced NLP tasks.
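The out-of-vocabulary issue listed above is easy to see in a small sketch. The code below is a hand-rolled illustration (whitespace tokenization, hypothetical `transform` helper), not a library API: the vocabulary is fixed from the training documents, and any word outside it has no column to land in.

```python
training_docs = ["the cat chased the mouse", "the mouse ran away"]

# Fix the vocabulary from the training documents only.
vocabulary = sorted({word for doc in training_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def transform(text):
    """Return (count vector, list of words that were not in the vocabulary)."""
    vector = [0] * len(vocabulary)
    unseen = []
    for word in text.split():
        if word in index:
            vector[index[word]] += 1
        else:
            unseen.append(word)  # no column exists for this word, so it is dropped
    return vector, unseen

vector, unseen = transform("the dog chased the cat")
print(vocabulary)  # ['away', 'cat', 'chased', 'mouse', 'ran', 'the']
print(vector)      # [0, 1, 1, 0, 0, 2]
print(unseen)      # ['dog']
```

The new document is about a dog, but nothing in its vector reflects that.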

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This movie is very good",
    "This movie is not very good"
]

vectorizer = CountVectorizer(
    lowercase=True,       # convert text to lowercase
    stop_words='english', # remove English stop words
    ngram_range=(1, 2),   # extract both single words and bigrams
)

X = vectorizer.fit_transform(documents)

# Convert the bag of words matrix to an array
X_array = X.toarray()
print("Alphabetical ordering of words", vectorizer.vocabulary_)
print('\n')
print("Bag of Words Representation:\n", X_array)


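# The two sentences have opposite meanings, but they get identical vectors.
# "not" (like "this", "is" and "very") is in the built-in English stop word list,
# so the negation is removed before counting. This is one flaw of this setup.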

Alphabetical ordering of words {'movie': 1, 'good': 0, 'movie good': 2}


Bag of Words Representation:
 [[1 1 1]
 [1 1 1]]


# 3.N-grams or Bag of N-grams

In natural language processing (NLP), **n-grams** are contiguous sequences of 'n' items from a given sample of text or speech. The items can be words, characters, or even phonemes, depending on the context. N-grams are used to capture local patterns and dependencies within a sequence of tokens. They're especially useful for tasks involving context, like language modeling, machine translation, and text generation.

Let's look at an example to understand n-grams better. Consider the sentence:

**"The quick brown fox jumps over the lazy dog."**

For different values of 'n', here are the n-grams:

- **Unigrams (1-grams):** ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]
- **Bigrams (2-grams):** ["The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog."]
- **Trigrams (3-grams):** ["The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", "the lazy dog."]
- **4-grams:** ["The quick brown fox", "quick brown fox jumps", "brown fox jumps over", "fox jumps over the", "jumps over the lazy", "over the lazy dog."]

N-grams are valuable in NLP tasks as they capture local patterns and dependencies, enabling models to better understand and generate text with context.
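The lists above can be generated with a sliding window over the tokens. This is a minimal sketch using whitespace splitting (so "dog." keeps its period, as in the lists); the helper name `word_ngrams` is an assumption for illustration.

```python
def word_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog."
tokens = sentence.split()  # whitespace split: 9 tokens

for n in (1, 2, 3, 4):
    grams = [" ".join(gram) for gram in word_ngrams(tokens, n)]
    print(f"{n}-grams ({len(grams)}):", grams)
```

A sequence of 9 tokens yields 9 - n + 1 n-grams, so 9 unigrams, 8 bigrams, 7 trigrams, and 6 4-grams.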

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:

# Python example of extracting n-grams using the `nltk` library
# In this example, the sentence is tokenized into words, and then n-grams of different sizes are generated using the `ngrams` function from `nltk.util`.
# The output shows the extracted n-grams for each value of 'n'.


import nltk
from nltk.util import ngrams

sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)

# Generate n-grams for different values of n
unigrams = list(ngrams(words, 1))
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))
fourgrams = list(ngrams(words, 4))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
print("4-grams:", fourgrams)

Unigrams: [('The',), ('quick',), ('brown',), ('fox',), ('jumps',), ('over',), ('the',), ('lazy',), ('dog',), ('.',)]
Bigrams: [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog'), ('dog', '.')]
Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog'), ('lazy', 'dog', '.')]
4-grams: [('The', 'quick', 'brown', 'fox'), ('quick', 'brown', 'fox', 'jumps'), ('brown', 'fox', 'jumps', 'over'), ('fox', 'jumps', 'over', 'the'), ('jumps', 'over', 'the', 'lazy'), ('over', 'the', 'lazy', 'dog'), ('the', 'lazy', 'dog', '.')]


In the context of generating n-grams using libraries like NLTK or scikit-learn, there aren't many hyperparameters to consider, as generating n-grams is a relatively straightforward process. However, there are some aspects to customize based on specific needs:

**Value of 'n'**: This is the most important parameter and determines the size of the n-grams. You need to decide whether you want unigrams (single tokens), bigrams (pairs of tokens), trigrams (triplets of tokens), and so on.

**Tokenization**: Before generating n-grams, you need to tokenize your input text into tokens (words or characters). Depending on your requirements, you might want to consider different tokenization strategies.

**Padding and Truncation**: If your text data includes padding or you want to truncate it, you might need to adjust the tokenization process or deal with partial n-grams.
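As an illustration of padding, the sketch below adds boundary symbols before sliding the window, so that n-grams also record which words start and end a sentence. The symbols `<s>` and `</s>` and the helper name `padded_ngrams` are assumptions chosen for this example.

```python
def padded_ngrams(tokens, n, start="<s>", end="</s>"):
    """Pad with n-1 boundary symbols on each side, then slide a window of size n."""
    padded = [start] * (n - 1) + list(tokens) + [end] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = ["This", "movie", "is", "good"]
print(padded_ngrams(tokens, 2))
# [('<s>', 'This'), ('This', 'movie'), ('movie', 'is'), ('is', 'good'), ('good', '</s>')]
```

Without padding the same sentence gives three bigrams; with padding it gives five, two of which mark the sentence boundaries.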

In [None]:
import nltk
from nltk.util import ngrams

sentence = "This movie is very good"

# Tokenize the sentence into words
words = nltk.word_tokenize(sentence)

# Generate n-grams for different values of n
unigrams = list(ngrams(words, 1))
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))
fourgrams = list(ngrams(words, 4))

print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
print("4-grams:", fourgrams)

Unigrams: [('This',), ('movie',), ('is',), ('very',), ('good',)]
Bigrams: [('This', 'movie'), ('movie', 'is'), ('is', 'very'), ('very', 'good')]
Trigrams: [('This', 'movie', 'is'), ('movie', 'is', 'very'), ('is', 'very', 'good')]
4-grams: [('This', 'movie', 'is', 'very'), ('movie', 'is', 'very', 'good')]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "This movie is very good",
    "This movie is not very good"
]

# Create a CountVectorizer instance for unigrams
vectorizer_unigrams = CountVectorizer(ngram_range=(1, 1))
X_unigrams = vectorizer_unigrams.fit_transform(documents)

# Create a CountVectorizer instance for bigrams
vectorizer_bigrams = CountVectorizer(ngram_range=(2, 2))
X_bigrams = vectorizer_bigrams.fit_transform(documents)

# Get the vocabulary for unigrams and bigrams
vocabulary_unigrams = vectorizer_unigrams.get_feature_names_out()
vocabulary_bigrams = vectorizer_bigrams.get_feature_names_out()

# Convert the count matrices to arrays
X_unigrams_array = X_unigrams.toarray()
X_bigrams_array = X_bigrams.toarray()

# Print vocabulary and count matrices for unigrams and bigrams
print("Vocabulary (Unigrams):", vocabulary_unigrams)
print("Count Matrix (Unigrams):\n", X_unigrams_array)

print("\nVocabulary (Bigrams):", vocabulary_bigrams)
print("Count Matrix (Bigrams):\n", X_bigrams_array)


Vocabulary (Unigrams): ['good' 'is' 'movie' 'not' 'this' 'very']
Count Matrix (Unigrams):
 [[1 1 1 0 1 1]
 [1 1 1 1 1 1]]

Vocabulary (Bigrams): ['is not' 'is very' 'movie is' 'not very' 'this movie' 'very good']
Count Matrix (Bigrams):
 [[0 1 1 0 1 1]
 [1 0 1 1 1 1]]


### Interpretation:

In the unigram representation, the two 6-dimensional vectors agree in 5 dimensions and differ only in the "not" column, so they lie close together even though the sentences have opposite meanings. The negation is recorded, but only as a single count that says nothing about which word it modifies.

In the bigram representation, the vectors differ in 3 of 6 dimensions ("is not", "is very", "not very"), so they are noticeably further apart. Because bigrams count pairs of consecutive words, "not very" becomes a feature in its own right, which lets a model pick up the negation that the unigram counts barely register.

In summary, while unigrams might not adequately capture certain nuances like negations, bigrams provide a more meaningful representation by considering word pairs and preserving context. This example showcases the importance of choosing an appropriate n-gram size based on the specific language patterns and context in the dataset.
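One way to quantify this is cosine similarity between the two document vectors. The sketch below copies the rows from the count matrices printed above and uses only the standard library; cosine similarity is a choice made here for illustration, not something the notebook computes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows copied from the unigram and bigram count matrices.
unigram_rows = ([1, 1, 1, 0, 1, 1], [1, 1, 1, 1, 1, 1])
bigram_rows = ([0, 1, 1, 0, 1, 1], [1, 0, 1, 1, 1, 1])

print("Unigram similarity:", round(cosine_similarity(*unigram_rows), 3))  # 0.913
print("Bigram similarity: ", round(cosine_similarity(*bigram_rows), 3))   # 0.671
```

The bigram vectors are clearly less similar, although still far from orthogonal, since the two sentences share most of their word pairs.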

A concise summary of the pros and cons of using n-grams in natural language processing:

**Pros of N-grams:**

1. **Contextual Information:** N-grams capture local patterns and dependencies within a sequence of tokens, allowing models to understand context better.

2. **Language Patterns:** N-grams help in capturing common phrases, collocations, and idiomatic expressions that might be missed by single words.

3. **Simplicity:** Generating n-grams is relatively simple and doesn't require complex linguistic analysis.

4. **Useful for Shallow Models:** N-grams can work effectively with simpler models like Naive Bayes and Logistic Regression.

5. **Improved Feature Space:** N-grams provide a way to include some linguistic structure without the complexity of full syntactic parsing.

6. **Reduced Dimensionality:** Compared to full syntactic or semantic analysis, n-grams offer a compromise between capturing information and keeping dimensionality manageable.

**Cons of N-grams:**

1. **Sparsity:** N-grams can lead to high-dimensional sparse data, especially when considering higher-order n-grams, which might pose challenges in computation and memory.

2. **Limited Context:** While n-grams capture local context, they might not capture long-range dependencies or global context effectively.

3. **Data Size Impact:** Longer n-grams can be impacted by data size limitations; you need sufficient data to estimate the frequency of all possible n-grams.

4. **Lack of Semantics:** N-grams might not capture semantic relationships between words, as they focus on co-occurrence patterns.

5. **Out-of-Vocabulary Words:** N-grams can struggle with out-of-vocabulary (OOV) words and unseen combinations, particularly in languages with complex word forms or new terms.

6. **Order Sensitivity:** N-grams are sensitive to the order of words, which can be a limitation when the same concepts are expressed in different word orders.

7. **Data Sparsity for Higher N-grams:** Higher-order n-grams (e.g., 4-grams) suffer from data sparsity, as each specific combination might occur less frequently.

8. **Language-Dependent:** The effectiveness of n-grams can vary across languages due to differences in word order and linguistic structures.

In summary, n-grams offer a balance between context capture and simplicity, making them valuable for various NLP tasks. However, they also come with challenges related to sparsity, context limitations, and their inability to capture deeper semantic relationships. The choice of n-gram size should be based on the specific language patterns and tasks you're dealing with.

# 4.Tf-Idf (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical representation used in natural language processing to evaluate the **importance of a word within a document relative to its occurrence in a collection of documents (corpus).** It takes into account both the frequency of the word in the document (TF) and how unique the word is across the entire corpus (IDF).

Here's how TF-IDF works:

**Term Frequency** (TF): This calculates how often a word appears in a document relative to the total number of words in that document.

**TF** = (Number of times the word appears in the document) / (Total number of words in the document)

**Inverse Document Frequency** (IDF): This measures the uniqueness of a word by considering how many documents in the corpus contain that word. It's calculated as the logarithm of the total number of documents divided by the number of documents containing the word.

**IDF** = log((Total number of documents) / (Number of documents containing the word))

TF-IDF Score: The TF-IDF score for a word in a document is the product of its TF and IDF values.

**TF-IDF** = TF * IDF

- Higher TF-IDF values indicate that a word is important in a particular document and relatively unique across the corpus.


- TF-IDF is commonly used in information retrieval, text classification, clustering, and other NLP tasks where word importance and uniqueness matter.
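The three formulas above translate directly into code. This is a minimal sketch of the textbook definitions, assuming pre-tokenized, lowercased documents and the natural logarithm (the formulas do not fix a base); library implementations may use different variants.

```python
import math

documents = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "mouse", "ran", "away"],
    ["the", "cat", "and", "the", "mouse", "are", "friends"],
]

def term_frequency(term, tokens):
    # Occurrences of the term divided by the total number of words in the document.
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    # Assumes the term occurs in at least one document (otherwise division by zero).
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

for term in ("the", "cat", "chased"):
    print(term, round(tf_idf(term, documents[0], documents), 4))
# the 0.0
# cat 0.0811
# chased 0.2197
```

"the" is the most frequent word in the first document, yet it scores 0 because it appears in every document; "chased" scores highest because it appears in only one.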

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat chased the mouse.",
    "The mouse ran away.",
    "The cat and the mouse are friends."
]

# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer() #TfidfVectorizer is used to convert the text documents into a TF-IDF matrix.

# Fit and transform the documents into a TF-IDF matrix
X = tfidf_vectorizer.fit_transform(documents)

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix to an array
X_array = X.toarray()


# Print the feature names and TF-IDF matrix
print("Feature Names (Words):", feature_names)
print('\n')
print("TF-IDF Matrix:\n", X_array)

Feature Names (Words): ['and' 'are' 'away' 'cat' 'chased' 'friends' 'mouse' 'ran' 'the']


TF-IDF Matrix:
 [[0.         0.         0.         0.4172334  0.54861178 0.
  0.32401895 0.         0.64803791]
 [0.         0.         0.6088451  0.         0.         0.
  0.35959372 0.6088451  0.35959372]
 [0.43345167 0.43345167 0.         0.32965117 0.         0.43345167
  0.25600354 0.         0.51200708]]


In [None]:
# Get the IDF values
idf_values = tfidf_vectorizer.idf_

# Print words and their corresponding IDF values
for word, idf_value in zip(feature_names, idf_values):
    print(f"{word}: {idf_value:.6f}")


and: 1.693147
are: 1.693147
away: 1.693147
cat: 1.287682
chased: 1.693147
friends: 1.693147
mouse: 1.000000
ran: 1.693147
the: 1.000000


The TF-IDF matrix values differ from the printed IDF values because each matrix entry combines the IDF with the term count and is then normalized. They also differ from the textbook formulas above, because `TfidfVectorizer` with its default settings uses a slightly different recipe:

1. **Term count:**
   The TF component is the raw count of the word in the document; it is not divided by the document length.

2. **Smoothed IDF:**
   With the default `smooth_idf=True`, IDF = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the word. With N = 3, a word found in one document gets ln(4/2) + 1 ≈ 1.693147, a word found in two gets ln(4/3) + 1 ≈ 1.287682, and a word found in all three gets ln(4/4) + 1 = 1, matching the printed values.

3. **L2 normalization:**
   With the default `norm='l2'`, each row of count × IDF products is divided by its Euclidean length, so every document vector has length 1.

For example, in the second document ("The mouse ran away.") every word occurs once, so the unnormalized scores equal the IDF values: 1 for "the" and "mouse", 1.693147 for "ran" and "away". The row's length is sqrt(1 + 1 + 1.693147^2 + 1.693147^2) ≈ 2.780916, which gives 1 / 2.780916 ≈ 0.3596 and 1.693147 / 2.780916 ≈ 0.6088, the values shown in the matrix.

The printed IDF values are therefore used directly in the computation. They look different from the matrix entries only because of the multiplication by term counts and the row normalization.
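As a check on how the matrix entries relate to the printed IDF values, the sketch below recomputes one row by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, raw term counts) and uses pre-tokenized documents; it is a reconstruction for illustration, not the library's code.

```python
import math
from collections import Counter

documents = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "mouse", "ran", "away"],
    ["the", "cat", "and", "the", "mouse", "are", "friends"],
]
vocabulary = sorted({term for doc in documents for term in doc})
n_docs = len(documents)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
doc_freq = {t: sum(1 for doc in documents if t in doc) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Raw count times IDF, then divide the row by its Euclidean (L2) length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row = tfidf_row(documents[1])  # "The mouse ran away."
print({t: round(v, 6) for t, v in zip(vocabulary, row) if v})
```

Up to rounding, the non-zero entries agree with the second row of the TF-IDF matrix printed above, and the `idf` dictionary agrees with the printed IDF values.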

#### Why is a logarithm used in IDF (and sometimes in TF)?

The formula for IDF is log(N / df_t) rather than just N / df_t, where N is the total number of documents in the collection and df_t is the document frequency of term t. The logarithm is used because it "dampens" the effect of IDF.

The use of the logarithm in calculating the Term Frequency-Inverse Document Frequency (TF-IDF) weight and Inverse Document Frequency (IDF) is a mathematical choice that helps in addressing certain issues related to scaling and the way information is distributed in natural language data.

**Logarithmic Scaling:**

1. **Term Frequency (TF):** The TF formula given earlier uses no logarithm, but a common variant (sublinear TF scaling, `sublinear_tf=True` in scikit-learn's `TfidfVectorizer`) replaces the raw count tf with 1 + log(tf). This dampens the effect of large term frequencies, so that a word repeated many times in one document does not disproportionately dominate that document's TF-IDF scores.

2. **Inverse Document Frequency (IDF):** The logarithmic scaling of IDF smoothens the impact of the IDF values. It reduces the influence of extremely rare words with very high IDF values, making the IDF values more stable and balanced across different words.

**Data Distribution:**

In natural language data, word frequencies often follow a power-law distribution, where a small number of words occur very frequently (e.g., "the", "and") while the majority of words are relatively rare. This distribution can lead to skewed TF and IDF values. Applying a logarithmic transformation helps in normalizing the distribution of scores.

**Balance and Interpretability:**

Using logarithms helps to balance the TF-IDF values and make them more interpretable. It prevents large values from dominating and provides a better representation of word importance without introducing biases due to outliers.

Overall, applying logarithms to TF and IDF is common practice because it keeps the resulting scores stable and comparable despite the skewed distribution of word frequencies.


**Pros of TF-IDF:**

1. **Word Importance:** TF-IDF captures the relative importance of words in a document compared to their occurrence in the entire corpus, aiding in content analysis.

2. **Flexibility:** TF-IDF is language-agnostic and applicable to various text analysis tasks, such as information retrieval, text classification, and clustering.

3. **Simple to Implement:** Implementing TF-IDF is relatively simple, making it accessible to beginners in NLP.

4. **Document Comparison:** TF-IDF enables efficient comparison of documents based on content similarity, supporting tasks like document clustering and recommendation systems.

5. **Handles Stop Words:** TF-IDF naturally downweights common words (stop words) that appear frequently in many documents but provide less discriminatory power.

6. **Customization:** You can fine-tune TF-IDF by adjusting parameters like term frequency normalization and IDF smoothing.

7. **Interpretability:** TF-IDF scores offer some interpretability, allowing you to assess the significance of words in documents.
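Several of the points above (document comparison, down-weighting of common words, and customization) can be illustrated together. This is a minimal sketch that assumes scikit-learn is the library in use. The three documents are made up, and the parameter values are chosen for illustration rather than as recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: two about a phone, one about a restaurant.
docs = [
    "the battery life of this phone is great",
    "this phone has great battery life",
    "the pasta at that restaurant was bland",
]

vectorizer = TfidfVectorizer(
    sublinear_tf=True,  # use 1 + log(tf) instead of the raw count
    smooth_idf=True,    # add one to document frequencies to prevent zero division
    norm="l2",          # unit-length rows, so dot products are cosine similarities
)
X = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between all documents.
similarity = cosine_similarity(X)
print(similarity.round(2))

# "the" occurs in two documents and "pasta" in one, so "the" gets the lower IDF.
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
idf_pasta = vectorizer.idf_[vectorizer.vocabulary_["pasta"]]
print(idf_the < idf_pasta)
```

The two phone reviews come out far more similar to each other than either is to the restaurant review. The second and third documents share no words at all, so their similarity is zero.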

**Cons of TF-IDF:**

1. **Lack of Semantic Understanding:** TF-IDF treats words as independent entities, disregarding semantic relationships and context.

2. **Sparse Representations:** TF-IDF matrices can be high-dimensional and sparse, which might lead to memory and computation challenges.

3. **Out-of-Vocabulary Words:** Words that were not seen when the vectorizer was fitted have no entry in the vocabulary and no IDF value, so they are ignored at prediction time and contribute nothing to the document vector.

4. **Normalization Issues:** Different documents can have varying lengths, affecting term frequency normalization and influencing TF-IDF scores.

5. **Domain Sensitivity:** IDF values can be influenced by domain-specific terms, potentially leading to inappropriate ranking in certain contexts.

6. **Not Sequence-Aware:** TF-IDF doesn't consider word order, which is essential for tasks like sentiment analysis or text generation.

7. **Rare Term Impact:** Extremely rare terms might have disproportionately high IDF values, affecting the final TF-IDF scores.

8. **Choice of Parameters:** The choice of parameters like term frequency normalization and IDF smoothing can affect the results and requires domain expertise.
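The sparsity and out-of-vocabulary limitations are easy to observe in practice. The sketch below assumes scikit-learn's `TfidfVectorizer` with default settings. The training and test sentences are invented for the example.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test documents.
train_docs = [
    "the movie was great",
    "the plot was dull",
    "great acting and a great plot",
]
test_docs = [
    "superb cinematography",   # contains only words unseen during fitting
    "the acting was superb",   # mixes known words with one unseen word
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# The result is a SciPy sparse matrix with one column per vocabulary term.
print(sparse.issparse(X_train), X_train.shape)

# "superb" never appeared in training, so it has no column.
print("superb" in vectorizer.vocabulary_)

# Count the non-zero entries per test document.
nonzero_per_doc = (X_test.toarray() != 0).sum(axis=1)
print(nonzero_per_doc)
```

The first test document is made up entirely of unseen words, so its vector is all zeros and a downstream model receives no information about it. In the second, only the three known words contribute.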

In summary, TF-IDF is a versatile and effective method for representing and analyzing text data, but it has limitations related to semantics, sparsity, and sequence awareness. It's important to consider your specific task, data, and goals when deciding whether to use TF-IDF or other methods in your NLP projects.

# 5. Custom features in natural language processing (NLP)

Custom features in natural language processing (NLP) refer to manually designed attributes or characteristics that are extracted from text data and used as input features for machine learning models. These features go beyond simple word frequencies and include engineered elements that are tailored to the specific task at hand. Custom features can enhance the performance of models by providing additional context, domain knowledge, or linguistic insights.

Let's consider an example of sentiment analysis, where the goal is to classify text as positive or negative sentiment. In addition to standard features like TF-IDF or word embeddings, we can create custom features based on linguistic cues.

---

*Example: Custom Features for Sentiment Analysis*

Suppose we want to perform sentiment analysis on product reviews. We can create the following custom features:

*Exclamation Mark Count*: Exclamation marks often signal emphasis or strong sentiment. We can count the number of exclamation marks in each review.

*Capitalization Ratio*: Capitalized words can signal emphasis or strong emotion. We can calculate the ratio of capitalized words to the total number of words in a review.

*Emoticon Presence*: Emoticons like :) and :( can provide valuable sentiment information. We can check if certain emoticons appear in the review.

*Positive/Negative Keywords*: Create lists of positive and negative words relevant to the product domain. Count the occurrences of these words in the review.

*Sentence Length*: Longer sentences might indicate more elaborate opinions. We can include the average sentence length in a review.



In [None]:
import pandas as pd
from textblob import TextBlob

# Sample reviews
reviews = [
    "This product is amazing! I love it.",
    "Not satisfied with the quality. Very disappointed."
]

df = pd.DataFrame({'review': reviews})

# Number of exclamation marks in each review
df['exclamation_count'] = df['review'].apply(lambda x: x.count('!'))

# Ratio of capitalized words (first letter uppercase) to all words.
# Note: this also counts sentence-initial words such as "This".
df['capitalization_ratio'] = df['review'].apply(
    lambda x: sum(1 for w in x.split() if w[0].isupper()) / len(x.split())
)

# Flags for the presence of a positive or negative emoticon
df['positive_emoticon'] = df['review'].apply(lambda x: 1 if ':)' in x else 0)
df['negative_emoticon'] = df['review'].apply(lambda x: 1 if ':(' in x else 0)

# Lists of positive/negative keywords (all lowercase)
positive_keywords = ['amazing', 'love']
negative_keywords = ['not satisfied', 'disappointed']

# Count keyword occurrences on lowercased text so that "Not satisfied" matches
df['positive_keyword_count'] = df['review'].apply(
    lambda x: sum(x.lower().count(kw) for kw in positive_keywords)
)
df['negative_keyword_count'] = df['review'].apply(
    lambda x: sum(x.lower().count(kw) for kw in negative_keywords)
)

# Average sentence length in words (TextBlob splits the review into sentences)
df['avg_sentence_length'] = df['review'].apply(
    lambda x: len(x.split()) / len(TextBlob(x).sentences)
)

# Display the custom features
df

| | review | exclamation_count | capitalization_ratio | positive_emoticon | negative_emoticon | positive_keyword_count | negative_keyword_count | avg_sentence_length |
|---|---|---|---|---|---|---|---|---|
| 0 | This product is amazing! I love it. | 1 | 0.285714 | 0 | 0 | 2 | 0 | 3.5 |
| 1 | Not satisfied with the quality. Very disappoin... | 0 | 0.285714 | 0 | 0 | 0 | 2 | 3.5 |
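Custom features are usually fed to a model alongside a standard representation such as TF-IDF rather than on their own. The following is a minimal sketch of one way to combine the two. It assumes scikit-learn and SciPy are available, and it uses two illustrative hand-crafted features (exclamation count and word count) rather than the full feature set.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "This product is amazing! I love it.",
    "Not satisfied with the quality. Very disappointed.",
]

# Sparse TF-IDF block: one column per vocabulary term.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(reviews)

# Dense custom block: exclamation count and word count for each review.
X_custom = np.array(
    [[text.count("!"), len(text.split())] for text in reviews],
    dtype=float,
)

# Stack the two blocks side by side, keeping the result sparse.
X_combined = hstack([X_text, csr_matrix(X_custom)]).tocsr()
print(X_text.shape, X_custom.shape, X_combined.shape)
```

The combined matrix has one row per review and can be passed to any estimator that accepts sparse input. Because the custom columns are raw counts while TF-IDF values lie between 0 and 1, scaling the custom block first may be worth considering.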


## Conclusion

In this exploration, we delved into key methods for transforming text data into numerical representations suitable for machine learning tasks.

By understanding the importance and role of these feature extraction techniques, we gain the ability to transform raw text into structured numerical data ready for machine learning algorithms. However, it's crucial to keep in mind that each method has its pros and cons:

- **One Hot Encoding:** Simple and intuitive, but inefficient for large vocabularies.
- **BoW:** Efficient and useful for basic analysis, but loses word order and nuances.
- **N-grams:** Captures partial word order and context, but increases dimensionality.
- **TF-IDF:** Considers term importance and document context, but might not fully capture semantics.
- **Custom Features:** Tailored to specific tasks, but requires domain expertise and additional effort.

With this understanding, we can make informed decisions about which technique to apply based on the specific goals of our NLP tasks. By leveraging these techniques and their respective strengths, we can empower our models to extract meaningful insights from textual data and improve the overall performance of NLP applications.

## Thank you for reading till the end

- Raviteja
https://www.linkedin.com/in/raviteja-padala/