
# Token-to-Vector Conversion: Transforming Tokens into Numerical Representations

Machine learning models operate on numbers, so text must be converted into vectors before a model can use it. This step, known as **Token-to-Vector Conversion**, is a core part of NLP pipelines. The sections below cover four common techniques, from sparse word counts to contextual embeddings.

---

### 1. **Bag of Words (BoW)**

BoW represents a text as a sparse vector of word counts or binary indicators for each word in the vocabulary.

#### Process:
1. Create a vocabulary of unique words from the dataset.
2. Count the occurrences of each word in a text.

#### Example:
Text: `"I love NLP. NLP is amazing!"`

- Vocabulary: `["I", "love", "NLP", "is", "amazing"]`
- BoW Vector for the text: `[1, 1, 2, 1, 1]`
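
The two steps above can be sketched in plain Python to reproduce this vector. This is a minimal illustration, assuming a simple regex tokenizer (`\w+`) that keeps case and single-letter words; library vectorizers apply their own tokenization rules and may give a different vocabulary.

```python
import re
from collections import Counter

text = "I love NLP. NLP is amazing!"

# Assumption: a token is any run of word characters; case is preserved.
tokens = re.findall(r"\w+", text)

# Step 1: vocabulary of unique words, in order of first appearance.
vocabulary = list(dict.fromkeys(tokens))

# Step 2: count how often each vocabulary word occurs in the text.
counts = Counter(tokens)
bow_vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'amazing']
print(bow_vector)  # [1, 1, 2, 1, 1]
```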

#### Code:

In [1]:
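from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love NLP.", "NLP is amazing!"]
# Defaults lowercase the text and keep only tokens of 2+ characters,
# so "I" does not appear in the vocabulary below.
vectorizer = CountVectorizer()
bow_vectors = vectorizer.fit_transform(texts)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # Vocabulary (sorted alphabetically)
print(bow_vectors.toarray())  # One row of counts per text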


['amazing' 'is' 'love' 'nlp']
[[0 0 1 1]
 [1 1 0 1]]


#### Pros:
- Simple and easy to implement.
- Works well for small datasets.

#### Cons:
- Ignores word order (context).
- High-dimensional and sparse for large vocabularies.

---

### 2. **TF-IDF (Term Frequency-Inverse Document Frequency)**

TF-IDF adjusts word frequency by considering its importance in the entire corpus. Words common across many documents are weighted lower.

#### Formula:
- **TF**: Term Frequency = (Number of times a word appears in a document) / (Total number of words in the document)
- **IDF**: Inverse Document Frequency = \( \log\left(\frac{N}{n}\right) \), where \( N \) is the total number of documents and \( n \) is the number of documents containing the word.
- **TF-IDF**: the weight of a word in a document is the product TF × IDF.

#### Example:
Text: `"I love NLP. NLP is amazing!"`  
Corpus: `["I love NLP.", "NLP is amazing!", "I enjoy learning NLP."]`

- "NLP" appears in every document of the corpus, so it has the lowest IDF and receives a lower weight than words such as "love" or "amazing" that occur in only one document.
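
As a worked example, the textbook formulas can be applied to this corpus in plain Python. This is a sketch under stated assumptions: tokens are lowercased runs of word characters, the logarithm is natural, and the helper names (`tokenize`, `tf`, `idf`) are made up for illustration.

```python
import math
import re

corpus = ["I love NLP.", "NLP is amazing!", "I enjoy learning NLP."]

def tokenize(text):
    # Assumption: lowercase, then split into runs of word characters.
    return re.findall(r"\w+", text.lower())

docs = [tokenize(doc) for doc in corpus]

def tf(word, doc):
    # Occurrences of the word divided by the document length.
    return doc.count(word) / len(doc)

def idf(word):
    # log(N / n): N documents in total, n of them contain the word.
    n = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n)

first_doc = docs[0]  # ['i', 'love', 'nlp']
weights = {word: tf(word, first_doc) * idf(word) for word in first_doc}

for word, weight in weights.items():
    print(f"{word}: {weight:.4f}")
```

Under these formulas "nlp" occurs in all three documents, so its IDF is log(3/3) = 0 and its weight vanishes, while "love", which is unique to the first document, gets the largest weight.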

#### Code:

In [2]:
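from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I love NLP.", "NLP is amazing!", "I enjoy learning NLP."]
# Defaults: lowercase, drop single-character tokens such as "I",
# smooth the IDF term, and L2-normalize each row.
vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(texts)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # Vocabulary (sorted alphabetically)
print(tfidf_vectors.toarray())  # One row of TF-IDF weights per text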

['amazing' 'enjoy' 'is' 'learning' 'love' 'nlp']
[[0.         0.         0.         0.         0.861037   0.50854232]
 [0.65249088 0.         0.65249088 0.         0.         0.38537163]
 [0.         0.65249088 0.         0.65249088 0.         0.38537163]]
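
The values printed above differ from the textbook formulas because, with its default settings, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and then scales each row to unit L2 length. The sketch below re-derives the first row by hand in plain Python; the regex tokenizer is only an approximation of the vectorizer's default behavior (lowercase, tokens of two or more characters), and the helper names are made up for illustration.

```python
import math
import re

corpus = ["I love NLP.", "NLP is amazing!", "I enjoy learning NLP."]

# Approximation of the default tokenizer: lowercase, 2+ word characters.
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({word for doc in docs for word in doc})

def smoothed_idf(word):
    # n = number of documents that contain the word.
    n = sum(1 for doc in docs if word in doc)
    return math.log((1 + len(docs)) / (1 + n)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then L2 normalization of the row.
    raw = [doc.count(word) * smoothed_idf(word) for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

first_row = tfidf_row(docs[0])
print(vocabulary)
print([round(value, 6) for value in first_row])
```

The last two entries of `first_row` correspond to "love" and "nlp" and should agree with the first row of the matrix above.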


#### Pros:
- Highlights words that are distinctive for a document relative to the rest of the corpus.
- Reduces the impact of common words.

#### Cons:
- Still sparse and high-dimensional.

---

### 3. **Word Embeddings**

Word embeddings are dense vector representations of words that capture semantic meaning. Words with similar meanings are closer in the vector space.

#### Techniques:
1. **Pre-trained Embeddings**:
   - **Word2Vec**: Learns vectors by predicting context words from a target word (Skip-gram) or a target word from its context (CBOW).
   - **GloVe (Global Vectors)**: Fits vectors to the global word co-occurrence statistics of a corpus.
   - **FastText**: Builds word vectors from character n-grams, so it can represent rare and unseen words; useful for morphologically rich languages.

2. **Custom Embeddings**:
   - Train embeddings on your dataset using tools like Gensim or TensorFlow.
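
To make the Skip-gram idea concrete, the sketch below lists the (center, context) pairs that a fixed window would produce for one tokenized sentence. It is a simplified illustration: the function name `skipgram_pairs` is made up here, and real implementations such as Word2Vec add details (for example negative sampling and subsampling of frequent words) that are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["I", "enjoy", "learning", "NLP"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Skip-gram trains the model to predict each context word from its center word; CBOW reverses the direction and predicts the center word from its surrounding context.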

#### Example:
- "king" and "queen" might have nearby embeddings such as the following (illustrative values, not taken from a trained model):
  - **king**: `[0.25, 0.35, 0.85, ...]`
  - **queen**: `[0.23, 0.34, 0.83, ...]`
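
Closeness in the vector space is usually measured with cosine similarity. The sketch below applies it to the three components shown above, treated as toy 3-dimensional vectors; the `apple` vector is a made-up contrast case, and none of these values come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: only the components shown in the example, not real embeddings.
king = [0.25, 0.35, 0.85]
queen = [0.23, 0.34, 0.83]
apple = [0.90, -0.40, 0.05]  # hypothetical unrelated word

sim_royal = cosine_similarity(king, queen)
sim_fruit = cosine_similarity(king, apple)
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

A value near 1 means the vectors point in almost the same direction, which is how "similar meaning" is read off an embedding space.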

#### Code (Word2Vec using Gensim):

In [3]:
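from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences; far too small to learn meaningful
# vectors, so this only demonstrates the API.
sentences = [["I", "love", "NLP"], ["NLP", "is", "amazing"], ["I", "enjoy", "learning", "NLP"]]
# vector_size: embedding dimension; window: context size on each side;
# min_count=1 keeps every word, even those seen only once.
model = Word2Vec(sentences, vector_size=10, window=3, min_count=1, workers=4)

print(model.wv["NLP"])  # 10-dimensional embedding for "NLP"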


[-0.00536227  0.00236431  0.0510335   0.09009273 -0.0930295  -0.07116809
  0.06458873  0.08972988 -0.05015428 -0.03763372]


#### Pros:
- Captures semantic relationships.
- Compact representation.

#### Cons:
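- Requires a large corpus for training.
- Assigns one fixed vector per word, so different senses of the same word share a single representation.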

---

### 4. **Contextual Word Embeddings**

These embeddings generate word vectors **in context**, meaning the same word can have different vectors depending on the sentence.

#### Popular Models:
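- **BERT (Bidirectional Encoder Representations from Transformers)**: An encoder that reads the whole sentence at once, so each token's vector depends on both its left and right context.
- **GPT**: A left-to-right decoder built for text generation; each token's vector depends on the tokens that precede it.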

#### Example:
- In "I’m going to the bank to fish" vs. "I deposited money in the bank," the word "bank" will have different embeddings.
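
The difference can be checked by extracting the vector for "bank" from each of the two sentences and comparing them. This is a sketch, assuming the `bert-base-uncased` checkpoint can be downloaded and that its tokenizer keeps "bank" as a single token; the helper name `bank_vector` is made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()  # disable dropout for repeatable outputs

def bank_vector(sentence):
    """Return the contextual embedding of the token "bank" in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    token_strings = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    position = token_strings.index("bank")  # assumes "bank" is one token
    with torch.no_grad():  # inference only, no gradients needed
        hidden = model(**inputs).last_hidden_state
    return hidden[0, position]

river = bank_vector("I'm going to the bank to fish")
money = bank_vector("I deposited money in the bank")

similarity = torch.nn.functional.cosine_similarity(river, money, dim=0).item()
print(f"cosine similarity between the two 'bank' vectors: {similarity:.3f}")
```

A static embedding such as Word2Vec would return the same vector for "bank" in both sentences; here the two vectors differ because each one is computed from the surrounding words.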

#### Code (Hugging Face Transformers):

In [4]:
from transformers import AutoTokenizer, AutoModel
import torch

# Downloads the pre-trained tokenizer and model on first use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "NLP is amazing!"
tokens = tokenizer(text, return_tensors="pt")  # PyTorch tensors of token IDs
output = model(**tokens)

print(output.last_hidden_state)  # One contextual vector per token



tensor([[[ 0.0762,  0.0177,  0.0297,  ..., -0.2109,  0.2140,  0.3130],
         [ 0.1237, -1.0247,  0.6718,  ..., -0.3978,  0.9643,  0.7842],
         [-0.2696, -0.3281,  0.3901,  ..., -0.4611, -0.3425,  0.2534],
         ...,
         [ 0.2051,  0.2455, -0.1299,  ..., -0.3859,  0.1110, -0.4565],
         [-0.0241, -0.7068, -0.4130,  ...,  0.9138,  0.3254, -0.5250],
         [ 0.7112,  0.0574, -0.2164,  ...,  0.2193, -0.7162, -0.1466]]],
       grad_fn=<NativeLayerNormBackward0>)


In [6]:
tokens  # 101 and 102 are the special [CLS] and [SEP] tokens added by the tokenizer

{'input_ids': tensor([[  101, 17953,  2361,  2003,  6429,   999,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [5]:
output.last_hidden_state.shape  # (batch size, number of tokens, hidden size)

torch.Size([1, 7, 768])


#### Pros:
- Captures word meaning dynamically based on context.
- State-of-the-art performance in NLP tasks.

#### Cons:
- Computationally expensive.

---

### Comparison of Techniques

| Method                    | Dimensionality | Context Awareness | Sparsity | Use Cases                                  |
|---------------------------|----------------|-------------------|----------|--------------------------------------------|
| **Bag of Words**          | High           | No                | Yes      | Simple text classification                 |
| **TF-IDF**                | High           | No                | Yes      | Document ranking, text similarity          |
| **Word Embeddings**       | Low            | Partial           | No       | Semantic analysis, downstream NLP tasks    |
| **Contextual Embeddings** | Low            | Yes               | No       | Advanced NLP tasks like NER, QA, sentiment |

---