# Introduction to Document Embeddings

## Introduction
Embeddings are numerical representations of words, phrases, or entire documents. They are used in machine learning models to understand and compare textual data by converting it into vectors in a continuous vector space. By placing similar items closer together, embeddings allow us to perform tasks like similarity comparisons and clustering.

## TF-IDF Embedder
Term Frequency-Inverse Document Frequency (TF-IDF) embeddings help transform textual documents into numerical vectors, reflecting how important a word is within a document relative to a collection of documents. This method weighs the frequency of terms against their commonality across documents to prioritize more significant terms.
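To make the weighting concrete, here is a minimal from-scratch sketch in plain Python. It assumes a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, L2 normalization of each vector, and a tokenizer that lowercases the text and keeps tokens of two or more characters. These are the conventions scikit-learn's `TfidfVectorizer` uses by default. The helper name `tfidf_vectors` is hypothetical and is not part of Swarmauri; this notebook does not show the embedder's internals, so treat this as an illustration rather than its actual implementation.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    # Assumed tokenizer: lowercase, keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (assumed formula).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts in this document
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        vectors.append([x / norm for x in raw])
    return vocab, vectors


documents = [
    "The quick brown fox jumps over the lazy dog",
    "Cats are great pets",
    "Bananas are yellow and tasty",
]

vocab, vectors = tfidf_vectors(documents)
print(vocab)
print(vectors[0][vocab.index("the")])  # weight of "the" in the first document
```

Under these assumptions, "the" gets the largest weight in the first document because it occurs twice there, while "are" is down-weighted in the second and third documents because both contain it. The resulting weights agree with the values printed in the notebook output further down.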

## Creating Document Models with TF-IDF Embedder
In this section, we'll create a few example documents and embed them using the `TfidfEmbedding` class.

In [1]:
# Import the TF-IDF embedder (the TfidfEmbedding class)
from swarmauri.embeddings.concrete.TfidfEmbedding import TfidfEmbedding

In [2]:
# Initialize the TF-IDF embedder
embedder = TfidfEmbedding()

# Example documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Cats are great pets",
    "Bananas are yellow and tasty"
]

# Fit the embedder on the documents and transform them into embeddings
embeddings = embedder.fit_transform(documents)

# Display embeddings for each document
for i, embedding in enumerate(embeddings):
    print(f"Embedding for Document {i+1}: {embedding}")

Embedding for Document 1: name=None id='03a2ff7c-f361-4b2f-b244-c03ce07a91f7' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.0, 0.0, 0.0, 0.3015113445777637, 0.0, 0.3015113445777637, 0.3015113445777637, 0.0, 0.3015113445777637, 0.3015113445777637, 0.3015113445777637, 0.0, 0.3015113445777637, 0.0, 0.6030226891555274, 0.0]
Embedding for Document 2: name=None id='2e45bab4-a64f-4764-bb48-bf8d011587f2' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.0, 0.4020402441612698, 0.0, 0.0, 0.5286346066596935, 0.0, 0.0, 0.5286346066596935, 0.0, 0.0, 0.0, 0.5286346066596935, 0.0, 0.0, 0.0, 0.0]
Embedding for Document 3: name=None id='052c5585-ce0d-4ce7-87bf-868209dcf2c5' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.4673509818107163, 0.35543246785041743, 0.4673509818107163, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4673509818107163, 0.0, 0.4673509818107163]


## Explanation

Each document has been transformed into a vector with one entry per term in the corpus vocabulary (16 entries here). Each entry is the TF-IDF weight of that term in the document, and terms that do not appear in the document get a weight of 0.0. TF-IDF embeddings work well for simple tasks like document classification and information retrieval, but because they are based on word counts they may not capture semantic meaning as effectively as more advanced embedding techniques such as `word2vec` or `BERT`.
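As a rough illustration of how such vectors support similarity comparisons, the sketch below computes cosine similarity between the three embeddings. The numbers are copied from the printed output above and rounded to four decimals to keep the listing short. The function `cosine_similarity` is a hypothetical helper written for this example, not a Swarmauri API. With the real `Vector` objects, the list of floats appears under the `value` field in the printed output, so it is presumably available as `embedding.value`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Embedding values from the output above, rounded to four decimals.
doc1 = [0, 0, 0, 0.3015, 0, 0.3015, 0.3015, 0,
        0.3015, 0.3015, 0.3015, 0, 0.3015, 0, 0.6030, 0]
doc2 = [0, 0.4020, 0, 0, 0.5286, 0, 0, 0.5286,
        0, 0, 0, 0.5286, 0, 0, 0, 0]
doc3 = [0.4674, 0.3554, 0.4674, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0.4674, 0, 0.4674]

print(f"doc1 vs doc2: {cosine_similarity(doc1, doc2):.3f}")
print(f"doc1 vs doc3: {cosine_similarity(doc1, doc3):.3f}")
print(f"doc2 vs doc3: {cosine_similarity(doc2, doc3):.3f}")
```

The first document shares no terms with the other two, so its similarity to each is zero. The second and third documents overlap only on the word "are", which gives them a small non-zero similarity of about 0.14. This shows the limitation noted above: "Cats are great pets" and "Bananas are yellow and tasty" score as more alike than either does with the fox sentence purely because of one shared common word.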

## Notebook Metadata

In [4]:
import os
import platform
import sys
from datetime import datetime

author_name = "Huzaifa Irshad"
github_username = "irshadhuzaifa"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

notebook_file = "Notebook_01_Introduction_to_Document_Embeddings.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except OSError as e:  # os.path.getmtime raises OSError if the file is missing
    print(f"Could not retrieve last modified datetime: {e}")

print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")

Author: Huzaifa Irshad 
GitHub Username: irshadhuzaifa
Last Modified: 2024-10-18 09:31:45.414431
Platform: Windows 11
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
Swarmauri Version: 0.5.0
