# Text Vectorization

In this notebook, we explore text vectorization techniques that convert text data into numerical representations for machine learning models. The techniques covered are TF-IDF, Bag of Words, and word embeddings (Word2Vec, GloVe).

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec
import gensim.downloader as api

# Sample text data
documents = [
    "Apple is looking at buying U.K. startup for $1 billion",
    "Apple bought a startup",
    "Startup in the U.K. received $1 billion from Apple"
]

## 1. TF-IDF Method

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents (corpus). The TF-IDF value increases with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.

- **Term Frequency (TF)**: How often a term occurs in a document, normalized by the document's length. The more frequently a term appears relative to the other terms in the document, the higher its TF value.
---


>$$TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$


---
- **Inverse Document Frequency (IDF)**: The logarithm of the inverse of the fraction of documents that contain the term. It measures how distinctive a term is across the corpus. The more documents a term appears in, the lower its IDF value; a term that appears in every document has an IDF of 0.
---


>$$IDF(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents with term } t} \right)$$


---

The final TF-IDF score for a term in a document is calculated as:

---


>$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$


---
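The three formulas above can be checked by hand on a tiny corpus. The following is a minimal sketch in plain Python, assuming a pre-tokenized toy corpus; the corpus and the helper names `tf`, `idf`, and `tf_idf` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized.
corpus = [
    ["apple", "bought", "a", "startup"],
    ["apple", "is", "looking", "at", "a", "startup"],
    ["the", "startup", "received", "funding"],
]

def tf(term, doc):
    """Occurrences of `term` in `doc` divided by the number of terms in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)

# "startup" occurs in every document, so its IDF is log(3/3) = 0.
score_startup = tf_idf("startup", corpus[0], corpus)

# "bought" occurs once among four terms and in one of three documents.
score_bought = tf_idf("bought", corpus[0], corpus)

print(score_startup, score_bought)
```

A term shared by every document scores zero regardless of how often it appears, which is the "offset" behavior the definition describes.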

In [2]:
# TF-IDF Method

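# Initializing TF-IDF Vectorizer
# Note: the defaults use a smoothed IDF and L2-normalize each row, so the
# values differ from the textbook formula; tokens are lowercased and
# single-character tokens such as "a" are dropped.
tfidf_vectorizer = TfidfVectorizer()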

# Transforming documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Converting to DataFrame for better readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix (first 10 columns):\n", tfidf_df.iloc[:, :10])  # Displaying only the first 10 columns


TF-IDF Matrix (first 10 columns):
       apple        at   billion    bought    buying       for      from  \
0  0.235756  0.399169  0.303578  0.000000  0.399169  0.399169  0.000000   
1  0.453295  0.000000  0.000000  0.767495  0.000000  0.000000  0.000000   
2  0.257129  0.000000  0.331100  0.000000  0.000000  0.000000  0.435357   

         in        is   looking  
0  0.000000  0.399169  0.399169  
1  0.000000  0.000000  0.000000  
2  0.435357  0.000000  0.000000  
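
The numbers in this matrix do not follow the textbook formula exactly. With default settings, `TfidfVectorizer` uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row; it also lowercases the text and keeps only tokens of two or more word characters, which is why `a`, `U.K.`, and `$1` produce no columns. The sketch below reproduces the values in plain Python under those assumptions. It is an illustration, not scikit-learn's actual code, and `tfidf_row` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

documents = [
    "Apple is looking at buying U.K. startup for $1 billion",
    "Apple bought a startup",
    "Startup in the U.K. received $1 billion from Apple",
]

# Approximate the default tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency and smoothed IDF for every vocabulary term.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
apple_doc0 = rows[0][vocab.index("apple")]
bought_doc1 = rows[1][vocab.index("bought")]
print(round(apple_doc0, 6), round(bought_doc1, 6))
```

Because each row is normalized to unit length, the same word (`apple`, which appears once in every document) gets a larger weight in the shortest document.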


## 2. Other Vectorization Methods

### Bag of Words

The Bag of Words (BoW) model represents text data as a collection of words, disregarding grammar and word order but keeping multiplicity. It creates a vocabulary of all the unique words in the text and represents each document by counting the number of times each word appears.

- **Process**:
  1. Create a list of unique words (vocabulary) from the entire corpus.
  2. Represent each document as a vector with the length of the vocabulary, where each position corresponds to the count of a word in the document.

- **Advantages**:
  - Simple to implement and understand.
  - Effective for basic text representation and simple models.

- **Disadvantages**:
  - Ignores the context and order of words.
  - Can lead to large and sparse matrices, especially for large vocabularies.
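
The two-step process described above can be sketched without any library. This is a minimal illustration assuming whitespace tokenization and a hypothetical two-sentence corpus; `to_vector` is a name introduced here, not part of any API.

```python
from collections import Counter

# Hypothetical toy corpus, split on whitespace.
corpus = ["the cat sat", "the cat saw the dog"]
tokenized = [doc.split() for doc in corpus]

# Step 1: vocabulary of unique words across the whole corpus (sorted for a stable column order).
vocabulary = sorted({tok for doc in tokenized for tok in doc})

def to_vector(tokens):
    """Step 2: one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in tokenized]
print(vocabulary)
print(vectors)
```

Word order is lost: "the cat saw the dog" and "the dog saw the cat" map to the same vector, which is the first disadvantage listed above.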

In [3]:
# Bag of Words

# Initializing Count Vectorizer
count_vectorizer = CountVectorizer()

# Transforming documents
count_matrix = count_vectorizer.fit_transform(documents)

# Converting to DataFrame for better readability
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())
print("\nBag of Words Matrix (first 10 columns):\n", count_df.iloc[:, :10])  # Displaying only the first 10 columns


Bag of Words Matrix (first 10 columns):
    apple  at  billion  bought  buying  for  from  in  is  looking
0      1   1        1       0       1    1     0   0   1        1
1      1   0        0       1       0    0     0   0   0        0
2      1   0        1       0       0    0     1   1   0        0


### Word Embeddings

Word embeddings are dense vector representations of words that capture the semantic meaning of words in a continuous vector space. They are trained on large corpora of text and can capture context, synonyms, and relationships between words.

- **Word2Vec**: A neural network-based method that learns vector representations of words based on their context in the text. It uses two architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
  - **CBOW**: Predicts the target word (center word) from the context words (surrounding words).
  - **Skip-Gram**: Predicts the context words (surrounding words) from the target word (center word).

- **GloVe (Global Vectors for Word Representation)**: A method that learns word vectors from global word co-occurrence statistics across the entire corpus. It builds a word-word co-occurrence matrix and fits vectors whose dot products approximate the logarithm of the co-occurrence counts, which amounts to a weighted factorization of that matrix.

- **Advantages**:
  - Captures semantic relationships and context.
  - Produces dense and low-dimensional vectors, reducing computational complexity.

- **Disadvantages**:
  - Requires large corpora and significant computational resources to train.
  - Pre-trained models may not capture domain-specific nuances.
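
To make the difference between CBOW and Skip-Gram concrete, the sketch below generates the training examples each architecture would see for one sentence. It only illustrates how inputs and targets are paired; it does not train a model, and the helper names `cbow_examples` and `skipgram_pairs` are hypothetical. In Gensim, the architecture is selected with the `sg` parameter of `Word2Vec` (`sg=0` for CBOW, the default; `sg=1` for Skip-Gram).

```python
def cbow_examples(tokens, window):
    """CBOW: (context words, center word) -> predict the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

def skipgram_pairs(tokens, window):
    """Skip-Gram: (center word, one context word) -> predict each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["apple", "bought", "a", "startup"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_pairs(tokens, window=1)

for context, center in cbow:
    print("CBOW      input:", context, "-> target:", center)
for center, context_word in skipgram:
    print("Skip-Gram input:", center, "-> target:", context_word)
```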

In [4]:
# Word Embeddings: Word2Vec

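# Using Gensim to train a Word2Vec model (CBOW by default; pass sg=1 for Skip-Gram)
# Note: three short sentences are far too few to learn meaningful vectors; this only demonstrates the API.
# doc.split() keeps case and punctuation, so the vocabulary key is 'Apple', not 'apple'.
word2vec_model = Word2Vec(sentences=[doc.split() for doc in documents], vector_size=100, window=5, min_count=1, workers=4)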

# Displaying vector for a word (limited to first 10 dimensions for readability)
print("\nWord2Vec Vector for 'Apple' (first 10 dimensions):\n", word2vec_model.wv['Apple'][:10])

# Word Embeddings: GloVe

# Loading pre-trained GloVe vectors (downloaded on first use)
glove_vectors = api.load("glove-wiki-gigaword-100")

# Displaying vector for a word (limited to first 10 dimensions for readability)
# The GloVe vocabulary is lowercase, so the lookup key is 'apple'
print("\nGloVe Vector for 'Apple' (first 10 dimensions):\n", glove_vectors['apple'][:10])



Word2Vec Vector for 'Apple' (first 10 dimensions):
 [-0.00053623  0.00023643  0.00510335  0.00900927 -0.00930295 -0.00711681
  0.00645887  0.00897299 -0.00501543 -0.00376337]

GloVe Vector for 'Apple' (first 10 dimensions):
 [-0.5985   -0.46321   0.13001  -0.019576  0.4603   -0.3018    0.8977
 -0.65634   0.66858  -0.49164 ]
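
Raw embedding values like these are hard to interpret on their own; embeddings are usually compared with cosine similarity. The sketch below shows the computation in plain Python. The three-dimensional vectors are made-up toy values, not taken from Word2Vec or GloVe, and `cosine_similarity` is a helper defined here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for this example.
toy_vectors = {
    "apple": [0.9, 0.1, 0.3],
    "pear": [0.8, 0.2, 0.4],
    "billion": [-0.2, 0.9, 0.1],
}

sim_fruit = cosine_similarity(toy_vectors["apple"], toy_vectors["pear"])
sim_mixed = cosine_similarity(toy_vectors["apple"], toy_vectors["billion"])
print(sim_fruit, sim_mixed)
```

With the loaded GloVe vectors, the same comparison should be available through Gensim's `KeyedVectors.similarity`, for example `glove_vectors.similarity('apple', 'startup')`.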
