# One-hot vector
A one-hot vector is a binary vector used to represent categorical data: each position corresponds to one category, and exactly one position is set to 1 (hot) while all others are 0 (cold). In Natural Language Processing (NLP), one-hot vectors are often used to represent words or tokens in a vocabulary. The representation is simple, but it has several limitations:

* High Dimensionality:

  One-hot encoding results in vectors that are very high-dimensional, especially when the vocabulary size is large. Each unique token or word in the vocabulary is represented by a vector of zeros with only one element set to 1. This can lead to very sparse representations and increased computational complexity.

* Lack of Semantic Information:

  One-hot vectors do not capture any semantic relationships or similarities between words. Each word is represented as a unique entity, with no information about its context or meaning. This makes it challenging for models to generalize and understand similarities between related words.

* Memory and Computational Efficiency:

  One-hot vectors are not memory-efficient, especially for large vocabularies. Storing and manipulating high-dimensional sparse matrices can be computationally expensive and inefficient, both in terms of memory usage and computational operations.

* No Sequential Information:

  A one-hot vector identifies a single token and says nothing about its position or its neighbors. For tasks where order matters, such as sequence prediction or language modeling, any notion of sequence has to come from the model that consumes the vectors one after another. If the vectors are summed or averaged into a single document vector, word order is lost entirely.

* Curse of Dimensionality:

  One-hot encoding exacerbates the curse of dimensionality. Each new unique token adds one dimension, so the vector space grows linearly with the vocabulary while the data available to cover it becomes increasingly sparse. This raises computational requirements and the risk of overfitting.

* No Embedding of Similarity:

  There is no inherent embedding of similarity between words. In contrast, dense embeddings like Word2Vec, GloVe, or FastText embed words into continuous vector spaces where distances (e.g., cosine similarity) can reflect semantic similarity between words.

* Model Sparsity:

  Models trained on one-hot vectors can suffer from sparsity issues. Sparse representations may not adequately capture relationships between words, making it harder for models to generalize effectively, especially on tasks requiring understanding of natural language semantics.
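
The lack of similarity information described above can be checked directly. The following is a minimal sketch, assuming the same four-word vocabulary used in the example below; the helper `cosine_similarity` is a hypothetical name introduced only for this illustration. Every pair of distinct one-hot vectors has cosine similarity 0 and the same Euclidean distance, so "apple" is no closer to "banana" than to any other word.

```python
import numpy as np

vocabulary = ['apple', 'banana', 'cherry', 'date']

# Row i of the identity matrix is the one-hot vector for vocabulary[i]
one_hot = np.eye(len(vocabulary), dtype=int)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

apple, banana, cherry = one_hot[0], one_hot[1], one_hot[2]

print(cosine_similarity(apple, banana))  # 0.0
print(cosine_similarity(apple, cherry))  # 0.0
print(cosine_similarity(apple, apple))   # 1.0

# All distinct pairs are also the same Euclidean distance apart (sqrt(2))
print(np.linalg.norm(apple - banana), np.linalg.norm(apple - cherry))
```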

In [1]:
import numpy as np

# Sample vocabulary
vocabulary = ['apple', 'banana', 'cherry', 'date']

# Example word to encode
word = 'banana'

# Create one-hot vector for the word
def one_hot_encode(word, vocabulary):
    vector = np.zeros(len(vocabulary), dtype=int)
    index = vocabulary.index(word)
    vector[index] = 1
    return vector

# Encode the example word
one_hot_vector = one_hot_encode(word, vocabulary)

# Print the result
print(f"One-hot vector for '{word}':")
print(one_hot_vector)


One-hot vector for 'banana':
[0 1 0 0]
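
To make the memory cost concrete, here is a rough sketch comparing a dense one-hot matrix with the integer token indices that carry the same information. The vocabulary size and sequence length are made-up values chosen for illustration, and the token ids are random rather than taken from real text.

```python
import numpy as np

vocab_size = 10_000  # hypothetical vocabulary size
num_tokens = 1_000   # hypothetical number of tokens to encode

rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab_size, size=num_tokens, dtype=np.int64)

# Dense one-hot matrix: one row per token, one column per vocabulary entry
dense = np.zeros((num_tokens, vocab_size), dtype=np.uint8)
dense[np.arange(num_tokens), token_ids] = 1

print(f"dense one-hot matrix: {dense.nbytes:,} bytes")
print(f"integer token ids:    {token_ids.nbytes:,} bytes")

# The indices lose nothing: the position of the 1 in each row is the token id
recovered = dense.argmax(axis=1)
print(np.array_equal(recovered, token_ids))
```

Even with one byte per entry, the dense matrix needs `num_tokens * vocab_size` bytes, while the indices need only 8 bytes per token. This is one reason libraries typically store token ids or sparse matrices instead of materializing one-hot vectors.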


# Co-occurrence Matrix
A Term-Document Matrix (TDM) records how often each term co-occurs with each document in a collection. Each row corresponds to a term (word) in the vocabulary, each column corresponds to a document, and each entry is the frequency of that term in that document. This is useful for NLP tasks such as document classification, clustering, and information retrieval. Note that `CountVectorizer` in the code below returns the transposed layout (a document-term matrix), with one row per document and one column per term.



In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Build the count matrix with CountVectorizer (rows = documents, columns = terms)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Convert the matrix to a pandas DataFrame for better visualization
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print("Term Document Matrix:")
print(df)

# Extract document vectors
document_vectors = X.toarray()
for i, doc_vector in enumerate(document_vectors):
    print(f"Document {i+1} vector: {doc_vector}")


Term Document Matrix:
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
Document 1 vector: [0 1 1 1 0 0 1 0 1]
Document 2 vector: [0 2 0 1 0 1 1 0 1]
Document 3 vector: [1 0 0 1 1 0 1 1 1]
Document 4 vector: [0 1 1 1 0 0 1 0 1]
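
A term-by-term co-occurrence matrix can be derived from the same counts. The sketch below is one possible construction, not the only definition of co-occurrence: it counts, for each pair of terms, the number of documents that contain both. It uses plain NumPy instead of `CountVectorizer`, and the regular expression is an assumption meant to mirror that class's default tokenization (lowercase, tokens of two or more word characters).

```python
import re
import numpy as np

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Assumed tokenization: lowercase, keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
index = {term: i for i, term in enumerate(vocabulary)}

# Binary document-term matrix: presence[d, t] = 1 if term t occurs in document d
presence = np.zeros((len(documents), len(vocabulary)), dtype=int)
for d, tokens in enumerate(tokenized):
    for tok in tokens:
        presence[d, index[tok]] = 1

# Term-document layout (rows = terms, columns = documents)
term_document = presence.T

# cooc[i, j] = number of documents containing both term i and term j;
# the diagonal holds each term's document frequency
cooc = term_document @ presence

print(vocabulary)
print(cooc)
print(cooc[index["document"], index["first"]])  # 2
```

Here "document" and "first" appear together in the first and fourth documents, so their entry is 2. Window-based co-occurrence (counting neighbors within a few words) is a common alternative that this sketch does not cover.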


# Term Frequency-Inverse Document Frequency
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect the importance of a word in a document relative to a collection of documents (or corpus). It is often used as a weighting factor in text mining and information retrieval.

Let's walk through a step-by-step example of computing TF-IDF scores using a small dataset.

### Example Dataset

Consider a small corpus with three documents:

1. Document 1: "the cat sat on the mat"
2. Document 2: "the cat sat"
3. Document 3: "the dog ate the bone"

### Step-by-Step Calculation

1. **Term Frequency (TF):**
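   TF measures how frequently a term appears in a document. The simplest variant is the raw count of the term; here the count is divided by the total number of terms in the document, so that longer documents do not receive larger values just because of their length.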

   $
   \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
   $

   Let's calculate TF for each term in each document:

   - Document 1: "the cat sat on the mat"
     - TF("the", Document 1) = 2/6
     - TF("cat", Document 1) = 1/6
     - TF("sat", Document 1) = 1/6
     - TF("on", Document 1) = 1/6
     - TF("mat", Document 1) = 1/6

   - Document 2: "the cat sat"
     - TF("the", Document 2) = 1/3
     - TF("cat", Document 2) = 1/3
     - TF("sat", Document 2) = 1/3

   - Document 3: "the dog ate the bone"
     - TF("the", Document 3) = 2/5
     - TF("dog", Document 3) = 1/5
     - TF("ate", Document 3) = 1/5
     - TF("bone", Document 3) = 1/5

2. **Inverse Document Frequency (IDF):**
   IDF measures how informative a term is across the corpus. TF alone treats all terms as equally important, yet certain terms, like "is", "of", and "that", may appear often while carrying little meaning. IDF therefore weighs down terms that occur in many documents and scales up the rare ones.

   $
   \text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t} \right)
   $

   Let's calculate IDF for each term in the corpus:

   - IDF("the") = \(\log(3/3) = \log(1) = 0\)
   - IDF("cat") = \(\log(3/2)\)
   - IDF("sat") = \(\log(3/2)\)
   - IDF("on") = \(\log(3/1)\)
   - IDF("mat") = \(\log(3/1)\)
   - IDF("dog") = \(\log(3/1)\)
   - IDF("ate") = \(\log(3/1)\)
   - IDF("bone") = \(\log(3/1)\)

   Using \(\log\) base 10 for simplicity:

   - IDF("cat") = \(\log_{10}(1.5) \approx 0.176\)
   - IDF("sat") = \(\log_{10}(1.5) \approx 0.176\)
   - IDF("on") = \(\log_{10}(3) \approx 0.477\)
   - IDF("mat") = \(\log_{10}(3) \approx 0.477\)
   - IDF("dog") = \(\log_{10}(3) \approx 0.477\)
   - IDF("ate") = \(\log_{10}(3) \approx 0.477\)
   - IDF("bone") = \(\log_{10}(3) \approx 0.477\)

3. **TF-IDF Calculation:**
   The TF-IDF score for a term is the product of its TF and IDF scores.

   $
   \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t, D)
   $

   Let's calculate TF-IDF for each term in each document:

   - Document 1: "the cat sat on the mat"
     - TF-IDF("the", Document 1) = 2/6 * 0 = 0
     - TF-IDF("cat", Document 1) = 1/6 * 0.176 ≈ 0.029
     - TF-IDF("sat", Document 1) = 1/6 * 0.176 ≈ 0.029
     - TF-IDF("on", Document 1) = 1/6 * 0.477 ≈ 0.080
     - TF-IDF("mat", Document 1) = 1/6 * 0.477 ≈ 0.080

   - Document 2: "the cat sat"
     - TF-IDF("the", Document 2) = 1/3 * 0 = 0
     - TF-IDF("cat", Document 2) = 1/3 * 0.176 ≈ 0.059
     - TF-IDF("sat", Document 2) = 1/3 * 0.176 ≈ 0.059

   - Document 3: "the dog ate the bone"
     - TF-IDF("the", Document 3) = 2/5 * 0 = 0
     - TF-IDF("dog", Document 3) = 1/5 * 0.477 ≈ 0.095
     - TF-IDF("ate", Document 3) = 1/5 * 0.477 ≈ 0.095
     - TF-IDF("bone", Document 3) = 1/5 * 0.477 ≈ 0.095
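
The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under the same assumptions as the walkthrough: whitespace tokenization, length-normalized TF, and base-10 IDF without smoothing. The function names `tf`, `idf`, and `tf_idf` are introduced only for this example.

```python
import math
from collections import Counter

corpus = {
    "Document 1": "the cat sat on the mat",
    "Document 2": "the cat sat",
    "Document 3": "the dog ate the bone",
}
tokenized = {name: text.split() for name, text in corpus.items()}
all_docs = list(tokenized.values())

def tf(term, tokens):
    """Count of the term divided by the document length."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, docs):
    """log10(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log10(len(docs) / df)

def tf_idf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)

for name, tokens in tokenized.items():
    # dict.fromkeys keeps each term once, in order of first appearance
    scores = {t: round(tf_idf(t, tokens, all_docs), 3) for t in dict.fromkeys(tokens)}
    print(name, scores)
```

Rounded to three decimals, the printed scores should agree with the hand calculation, including the zero weight for "the", which appears in every document.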

### Summary of Results

The TF-IDF scores for each term in each document are:

- **Document 1:**
  - "the": 0
  - "cat": 0.029
  - "sat": 0.029
  - "on": 0.080
  - "mat": 0.080

- **Document 2:**
  - "the": 0
  - "cat": 0.059
  - "sat": 0.059

- **Document 3:**
  - "the": 0
  - "dog": 0.095
  - "ate": 0.095
  - "bone": 0.095

These TF-IDF scores reflect the importance of each term within the document and the entire corpus, helping to identify which terms are most relevant to each document.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "the cat sat on the mat",
    "the cat sat",
    "the dog ate the bone"
]

# Create the TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Convert the TF-IDF matrix to a pandas DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display the DataFrame
print("TF-IDF Matrix:")
print(tfidf_df)


TF-IDF Matrix:
        ate      bone       cat       dog       mat        on       sat  \
0  0.000000  0.000000  0.356457  0.000000  0.468699  0.468699  0.356457   
1  0.000000  0.000000  0.619805  0.000000  0.000000  0.000000  0.619805   
2  0.476986  0.476986  0.000000  0.476986  0.000000  0.000000  0.000000   

        the  
0  0.553642  
1  0.481334  
2  0.563431  
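
These values differ from the hand calculation, and "the" no longer scores 0. With its default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed natural-log IDF of the form ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces the matrix with plain NumPy under those assumptions; whitespace tokenization is a simplification that happens to yield the same vocabulary for this corpus.

```python
import numpy as np

documents = [
    "the cat sat on the mat",
    "the cat sat",
    "the dog ate the bone",
]

tokenized = [doc.split() for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Raw term counts: rows = documents, columns = terms
counts = np.array(
    [[doc.count(term) for term in vocabulary] for doc in tokenized],
    dtype=float,
)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF; the "+ 1" keeps terms found in every document above zero
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit Euclidean length
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(vocabulary)
print(np.round(tfidf, 6))
```

The smoothing and normalization change the absolute numbers but not the overall pattern: within each document, rarer terms such as "mat" and "on" still outweigh "cat" and "sat".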


## Limitations of TF-IDF

* Lack of Context Understanding:

  TF-IDF treats each term independently and does not capture the context in which terms appear. It ignores the order of words and relationships between them, which can be crucial for understanding the meaning of the text.

* Sparse Representations:

  TF-IDF vectors are often very high-dimensional and sparse, especially when dealing with large vocabularies. This can lead to inefficiencies in both storage and computation.

* Limited to Bag-of-Words Model:

  TF-IDF relies on the bag-of-words model, which disregards the grammar, syntax, and semantics of the language. This model counts word frequencies without considering word sequences or structures.

* Sensitive to Frequent Terms:

  TF-IDF reduces the impact of very common words (like "the", "is", "in"), but moderately common terms that appear in many, though not all, documents still receive non-trivial weight. This can give less important terms more weight than they deserve.

* Static Weighting Scheme:

  The TF-IDF weighting scheme is static and does not adapt based on the specific task or domain. It uses a fixed formula to calculate term importance, which may not always align with the specific needs of a particular application.

* Difficulty Handling Synonyms:

  TF-IDF does not account for synonyms or for different forms of the same word (e.g., "run" vs. "running"). Unless the text is stemmed or lemmatized first, such words are treated as completely separate features, which can dilute their significance.

* Not Suitable for Small Corpora:

  In smaller corpora, TF-IDF may not perform well because the inverse document frequency component can be less reliable when the number of documents is limited. The IDF values may be skewed due to the small sample size.

* No Semantic Similarity:

  TF-IDF does not capture semantic similarity between words. For example, "car" and "automobile" occupy separate, unrelated dimensions, so two documents that differ only in which of the two words they use appear less similar than they really are.
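
Two of the limitations above, ignoring word order and missing synonyms, can be shown with a small bag-of-words sketch. It uses raw counts rather than TF-IDF weights as a simplification: IDF rescales individual dimensions but neither links related terms nor records order. The helper `bow_cosine` is a hypothetical name for this example.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Synonyms share no dimension, so the similarity is zero
print(bow_cosine("car", "automobile"))  # 0.0

# Opposite meanings, same words: the vectors are identical (similarity ~1.0)
print(bow_cosine("dog bites man", "man bites dog"))
```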