<a href="https://colab.research.google.com/github/sunosa/AI-Agent-Code-Generator/blob/main/embed_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From HuggingFace dataset to [Qdrant](https://qdrant.tech/) vector database in 12 minutes flat
This notebook demonstrates how to transform a HuggingFace dataset into a local Qdrant vector database, using the Sentence Transformers `all-MiniLM-L6-v2` model.

For more information on the dataset, which consists of tweets made by American senators, check out the article [Fine-tuning DistilBERT on senator tweets](https://medium.com/@mary.newhauser/fine-tuning-distilbert-on-senator-tweets-a6f2425ca50e).

💾 [Dataset](https://huggingface.co/datasets/m-newhauser/senator-tweets)

First, enable GPU usage by changing the notebook runtime type: click the downward-facing arrow in the upper-right corner of the notebook, next to the `RAM` and `Disk` indicators.

Click `Change runtime type`  >  select `T4 GPU`  >  click `Save`

In [None]:
# Install necessary packages
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "qdrant-client>=1.1.1"
!pip install -U sentence-transformers==2.2.2
!pip install -U datasets==2.16.1

In [None]:
import time
import math
import torch
from itertools import islice
from tqdm import tqdm
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from datasets import load_dataset, concatenate_datasets

In [None]:
# Record start time
start_time = time.time()

In [None]:
# Determine device based on GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


## Load model and dataset

In [None]:
# Load the dataset
dataset = load_dataset("m-newhauser/senator-tweets")

# If the embeddings column already exists, remove it (so we can practice generating it!)
for split in dataset:
    if 'embeddings' in dataset[split].column_names:
        dataset[split] = dataset[split].remove_columns('embeddings')

# Take a peek at the dataset
print(dataset)
dataset["train"].to_pandas().head()

In [None]:
# Load the desired model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)

## Generate embeddings for text data

In [None]:
# Create function to generate embeddings (in batches) for a given dataset split
def generate_embeddings(split, batch_size=32):
    embeddings = []
    split_name = [name for name, data_split in dataset.items() if data_split is split][0]

    with tqdm(total=len(split), desc=f"Generating embeddings for {split_name} split") as pbar:
        for i in range(0, len(split), batch_size):
            # Slice rows first, then select the column, so only this batch is loaded
            batch_sentences = split[i:i+batch_size]['text']
            batch_embeddings = model.encode(batch_sentences)
            embeddings.extend(batch_embeddings)
            pbar.update(len(batch_sentences))

    return embeddings

# Generate and append embeddings to the train split
train_embeddings = generate_embeddings(dataset['train'])
dataset["train"] = dataset["train"].add_column("embeddings", train_embeddings)

# Generate and append embeddings to the test split
test_embeddings = generate_embeddings(dataset['test'])
dataset["test"] = dataset["test"].add_column("embeddings", test_embeddings)

Generating embeddings for train split: 100%|██████████| 79754/79754 [06:17<00:00, 211.51it/s]
Generating embeddings for test split: 100%|██████████| 19939/19939 [00:37<00:00, 537.94it/s]
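
The loop above walks each split in strides of `batch_size`, so the number of `model.encode` calls is the row count divided by the batch size, rounded up. A small sketch of that arithmetic, using the row counts from the progress bars above (the `num_batches` helper is hypothetical and only for illustration):

```python
import math

def num_batches(n_rows, batch_size):
    """Number of strides needed to cover n_rows in steps of batch_size."""
    return math.ceil(n_rows / batch_size)

# Row counts taken from the progress bars above; batch_size=32 is the default
train_batches = num_batches(79754, 32)
test_batches = num_batches(19939, 32)

# The same counts fall out of the range() used in generate_embeddings
assert train_batches == len(range(0, 79754, 32))
assert test_batches == len(range(0, 19939, 32))

print(train_batches, test_batches)
```

The last batch in each split is shorter than 32 rows, which is why the progress bar is updated by `len(batch_sentences)` rather than by `batch_size`.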


## Optional: Save the embeddings dataset to the HuggingFace Hub

In [None]:
# # Log in to the Hub via the CLI
# !huggingface-cli login

In [None]:
# # Push the embeddings dataset (with preserved splits) to the HuggingFace Hub
# dataset.push_to_hub(
#     repo_id="m-newhauser/senator-tweets", # name of your dataset
#     commit_message="Add all-MiniLM-L6-v2 embeddings", # commit message
# )

## Create local Qdrant DB and upsert embeddings

In [None]:
# Combine train and test splits into a single dataset
combined_dataset = concatenate_datasets([dataset['train'], dataset['test']])

In [None]:
# Create an in-memory Qdrant instance
client = QdrantClient(":memory:")

# Create a Qdrant collection for the embeddings
client.create_collection(
    collection_name="senator-tweets",
    vectors_config=models.VectorParams(
        size=model.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE,
    ),
)
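
Because the collection is configured with `models.Distance.COSINE`, the `score` attached to each search hit is a cosine similarity: values near 1 mean the query and the tweet point in nearly the same direction in embedding space. As a rough illustration of the metric only (the `cosine_similarity` helper is hypothetical and is not how Qdrant computes scores internally), it can be sketched in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors; real embeddings from this model are much longer
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Vector length does not affect the result, only direction does, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as a perfect match.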

In [None]:
# Create a helper that splits any iterable into lists of at most n items
def batched(iterable, n):
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch
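
To see what the helper produces, here is a small standalone run on a toy range. The function is repeated so the snippet runs by itself; it relies on the `:=` operator, so it assumes Python 3.8 or later.

```python
from itertools import islice

def batched(iterable, n):
    """Yield successive lists of up to n items until the iterable is exhausted."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch

# Seven items in batches of three: the final batch holds the remainder
chunks = list(batched(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

On Python 3.12 and later, `itertools.batched` offers similar behavior in the standard library, though it yields tuples rather than lists.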

In [None]:
# Define batch size
batch_size = 100

# Upsert the embeddings in batches
for batch in batched(combined_dataset, batch_size):
    ids = [point.pop("id") for point in batch]
    vectors = [point.pop("embeddings") for point in batch]

    client.upsert(
        collection_name="senator-tweets",
        points=models.Batch(
            ids=ids,
            vectors=vectors,
            payloads=batch,
        ),
    )
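
The two `pop` calls in the loop do double duty: they collect the IDs and vectors, and they remove those keys from each row so that whatever is left becomes the payload. A toy sketch of that step, using made-up records rather than real rows from the dataset:

```python
# Hypothetical records shaped like rows of the combined dataset
records = [
    {"id": 1, "embeddings": [0.1, 0.2], "text": "first", "party": "Democrat"},
    {"id": 2, "embeddings": [0.3, 0.4], "text": "second", "party": "Republican"},
]

# pop() returns the value and deletes the key from the dict in place
ids = [record.pop("id") for record in records]
vectors = [record.pop("embeddings") for record in records]

# What remains in each dict is the metadata stored alongside the vector
payloads = records

print(ids)
print(vectors)
print(payloads)
```

Since the dictionaries are mutated in place, the same batch cannot be reused afterwards with its `id` and `embeddings` keys intact.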

## Search the database

In [None]:
# Let's see what senators are saying about immigration policy
hits = client.search(
    collection_name="senator-tweets",
    query_vector=model.encode("Immigration policy").tolist(),
    limit=5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'date': '2021-06-08 20:50:41', 'username': 'SenatorRomney', 'text': 'Some policies that can realistically stem the illegal immigration crisis: - Completion of the barrier at our southern border - Enact mandatory E-Verify - Require asylum seekers to apply in their home country or the nearest safe location', 'party': 'Republican', 'labels': 0} score: 0.6172380655717898
{'date': '2021-11-03 17:56:55', 'username': 'JohnCornyn', 'text': 'Making crisis worse: Biden administration rescinds Trump-era policy limiting migrants at legal ports of entry - CNNPolitics https://t.co/LpSYwdKGER', 'party': 'Republican', 'labels': 0} score: 0.601150868004457
{'date': '2021-03-04 17:16:42', 'username': 'SenTedCruz', 'text': 'President Bidens immigration policies are dangerous.', 'party': 'Republican', 'labels': 0} score: 0.5960106315252139
{'date': '2021-01-21 01:10:11', 'username': 'SenatorDurbin', 'text': 'With his Executive Orders, President Biden is turning the page on four years of immigration polic

In [None]:
# Most of those tweets are by Republicans... let's see what the Democrats are saying
hits = client.search(
    collection_name="senator-tweets",
    query_vector=model.encode("Immigration policy").tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="party",
                match=models.MatchValue(value="Democrat") # Filter by political party
            )
        ]
    ),
    limit=5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'date': '2021-01-21 01:10:11', 'username': 'SenatorDurbin', 'text': 'With his Executive Orders, President Biden is turning the page on four years of immigration policies that dragged our country backwards. Today, we move forward with a vision that reflects our proud heritage as a nation of immigrants. My full statement: https://t.co/Rwmu2esKnm', 'party': 'Democrat', 'labels': 1} score: 0.5954159442904414
{'date': '2021-05-09 23:00:00', 'username': 'SenatorSinema', 'text': 'Our Bipartisan Border Solutions Act will: - Create regional processing centers along the border - Provide more resources to improve the asylum process - Require @DHSgov to tell communities before releasing migrants https://t.co/e2yAL4TfH4', 'party': 'Democrat', 'labels': 1} score: 0.5691308095316189
{'date': '2021-07-10 16:00:02', 'username': 'SenatorSinema', 'text': 'Our Bipartisan Border Solutions Act with @JohnCornyn will help address the migrant crisis by creating regional processing centers to more quickly proc
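
Conceptually, a `must` filter keeps only the points whose payload satisfies every listed condition, and the similarity ranking then applies to what is left. The sketch below is a mental model only: the `matches_all` helper and the toy points are hypothetical, and it says nothing about how Qdrant evaluates filters internally.

```python
def matches_all(payload, must):
    """Return True when the payload equals every (key, value) pair in must."""
    return all(payload.get(key) == value for key, value in must.items())

# Made-up points with precomputed similarity scores
points = [
    {"score": 0.62, "payload": {"username": "a", "party": "Republican"}},
    {"score": 0.60, "payload": {"username": "b", "party": "Democrat"}},
    {"score": 0.57, "payload": {"username": "c", "party": "Democrat"}},
]

# Keep only the points that satisfy the condition, then rank by score
kept = [p for p in points if matches_all(p["payload"], {"party": "Democrat"})]
top = sorted(kept, key=lambda p: p["score"], reverse=True)[:1]

print([p["payload"]["username"] for p in kept])
print(top[0]["payload"]["username"])
```

This mirrors the results above: the top Democratic hit scores slightly lower than the top unfiltered hit, because the higher-scoring Republican tweets are excluded before ranking.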

In [None]:
# Record end time
end_time = time.time()

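# Calculate and print the elapsed time, rounded up to the next whole second
elapsed_time = end_time - start_time
# Round before splitting so the seconds part can never be reported as 60
minutes, seconds = divmod(math.ceil(elapsed_time), 60)

print(f"Notebook completed in {minutes} minutes and {seconds} seconds.")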

Notebook completed in 11 minutes and 35 seconds.
