[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/operations/apple/embedding_visualization_embedding_atlas.ipynb)

# Embedding visualization using Embedding Atlas and Weaviate

This notebook shows you how to visualize the data stored within a Weaviate vector database using the [open source Embedding Atlas by Apple](https://apple.github.io/embedding-atlas/overview.html).


## Installation

In [1]:
%%capture
%pip install -q -U weaviate-client

In [2]:
%pip show weaviate-client

Name: weaviate-client
Version: 4.16.5.dev3+gbdd43c4b8
Summary: A python native Weaviate client
Home-page: https://github.com/weaviate/weaviate-python-client
Author: Weaviate
Author-email: hello@weaviate.io,
License: BSD 3-clause
Location: /usr/local/lib/python3.11/dist-packages
Requires: authlib, deprecation, grpcio, grpcio-health-checking, httpx, pydantic, validators
Required-by: 


## Connect to Weaviate
You can connect to Weaviate in several ways:

- Embedded Weaviate: Experimental in-memory option that needs no API keys, for quick experimentation.
- [Weaviate Cloud (WCD)](https://console.weaviate.cloud/): Managed service for development and production environments.
- Local Weaviate instance, e.g. deployed with Docker.

You can get **14 days of free access** to a [Weaviate Cloud](https://console.weaviate.cloud/) Sandbox by creating a Weaviate Cloud (WCD) account, where you can also obtain your API key. No name or credit card required.

In [None]:
import os
huggingface_key = os.environ["HF_TOKEN"]  # Set the HF_TOKEN environment variable to your Hugging Face key

# Uncomment this if you're using Google Colab
#from google.colab import userdata
#huggingface_key = userdata.get("HF_TOKEN")
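
If `HF_TOKEN` is not set, the lookup `os.environ["HF_TOKEN"]` fails with a bare `KeyError`. A minimal sketch of a more explicit lookup is shown below. The helper name `require_env` is hypothetical and is not part of this notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of the environment variable `name`.

    Raises a RuntimeError with a readable message when the variable
    is missing or empty. (Hypothetical helper, for illustration only.)
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value
```

With this helper, the key could be read as `huggingface_key = require_env("HF_TOKEN")`.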

In [3]:
import weaviate

# Option 1: Connect to an embedded in-memory Weaviate instance
client = weaviate.connect_to_embedded(
  headers={
    "X-huggingface-Api-Key": huggingface_key
  }
)

# Option 2: Connect to your Weaviate Cloud Service cluster
#from weaviate.classes.init import Auth
#weaviate_url = os.environ["WEAVIATE_URL"]
#weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
#client = weaviate.connect_to_weaviate_cloud(
#    cluster_url=weaviate_url,
#    auth_credentials=Auth.api_key(weaviate_api_key),
#)

# Option 3: Connect to your local Weaviate instance deployed with Docker
# client = weaviate.connect_to_local()

client.is_ready()

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.30.5/weaviate-v1.30.5-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 1726


In [11]:
import weaviate.classes.config as wc

# Delete the collection if it already exists
if client.collections.exists("JeopardyQuestion"):
    client.collections.delete("JeopardyQuestion")

client.collections.create(
    name="JeopardyQuestion",

    vector_config=wc.Configure.Vectors.text2vec_huggingface( # specify the vectorizer and model type you're using
        model="sentence-transformers/all-MiniLM-L6-v2",
        #wait_for_model=True,
        #use_gpu=True,
        #use_cache=True,
    ),
    properties=[ # defining properties (data schema) is optional
        # Lowercase names match the keys used when importing the data
        wc.Property(name="description", data_type=wc.DataType.TEXT),
        wc.Property(name="answer", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="category", data_type=wc.DataType.TEXT, skip_vectorization=True),
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7e70fdee8b10>

## Import the data

In [10]:
import pandas as pd

url = 'https://raw.githubusercontent.com/weaviate/weaviate-examples/main/jeopardy_small_dataset/jeopardy_small.csv'

# Read the entire CSV file
data = pd.read_csv(url)

# Keep only the 'Question', 'Answer', and 'Category' columns
data = data[['Question', 'Answer', 'Category']]

display(data.head())

# Convert the DataFrame rows to a list of dictionaries
data = data.to_dict('records')

Unnamed: 0,Question,Answer,Category
0,Each year this British Columbia city hosts the...,Vancouver,CANADIAN TOURISM
1,The only Republican presidential nominee to lo...,Dewey,I GET NO KICK FROM CAMPAIGN
2,"The alternate title to the Anne Rice book ""Ram...",a mummy,BRING OUT YOUR DEAD
3,The priest's duty to keep your sins secret is ...,the confessional,CATHOLIC PRIESTS
4,This last colony in Africa shares part of its ...,Western Sahara,TOUGH GEOGRAPHY
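
pandas represents empty CSV cells as float `NaN`, and `to_dict('records')` keeps those values in the resulting dictionaries. Whether this dataset contains empty cells is not shown here, so the following is only a defensive sketch: it replaces `NaN` with `None` before import. The helper name `clean_record` and the sample rows are made up for illustration.

```python
import math


def clean_record(record: dict) -> dict:
    """Replace float NaN values (pandas' marker for missing cells) with None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Made-up sample rows; the second one has a missing 'Answer'.
rows = [
    {"Question": "Q1", "Answer": "A1", "Category": "C1"},
    {"Question": "Q2", "Answer": float("nan"), "Category": "C2"},
]
cleaned = [clean_record(row) for row in rows]
print(cleaned[1])  # {'Question': 'Q2', 'Answer': None, 'Category': 'C2'}
```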


In [12]:
collection = client.collections.get("JeopardyQuestion")

# Assuming 'data' is the list of dictionaries obtained from the CSV
# Each dictionary in 'data' has keys 'Question', 'Answer', 'Category'

with collection.batch.fixed_size(batch_size=200) as batch:
    # Iterate through the list of dictionaries
    for data_row in data:
        # Create a new dictionary with keys mapped to Weaviate properties
        weaviate_properties = {
            "description": data_row.get("Question"), # Map 'Question' to 'description'
            "answer": data_row.get("Answer"),       # Map 'Answer' to 'answer'
            "category": data_row.get("Category")     # Map 'Category' to 'category'
        }
        batch.add_object(
            properties=weaviate_properties,
        )
        if batch.number_errors > 10:
            print("Batch import stopped due to excessive errors.")
            break

failed_objects = collection.batch.failed_objects
if failed_objects:
    print(f"Number of failed imports: {len(failed_objects)}")
    print(f"First failed object: {failed_objects[0]}")
else:
    print("Batch import finished. Checking object count...")
    print(f"Imported {len(collection)} objects.")

Batch import finished. Checking object count...
Imported 1000 objects.
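
The import above uses `batch.fixed_size(batch_size=200)`, so the 1000 objects are sent in groups of 200. The sketch below only illustrates the idea of fixed-size chunking with plain Python. It is not the Weaviate client's implementation, and `chunked` is a hypothetical helper.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# 1000 stand-in objects split into batches of 200
batches = list(chunked(range(1000), 200))
print(len(batches), len(batches[0]))  # 5 200
```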


In [13]:
import pandas as pd

def get_all_collections_data(client):
    all_data = []
    collections = client.collections.list_all()

    for col_name in collections:
        try:
            collection = client.collections.get(col_name)
            collection_data = []

            for item in collection.iterator(include_vector=True):
                row = item.properties.copy()
                row['vector'] = item.vector['default']
                collection_data.append(row)

            if collection_data:
                df = pd.DataFrame(collection_data)
                #df['collection'] = col_name
                all_data.append(df)
                print(f"Successfully processed: {col_name}")

        except Exception as e:
            print(f"Error in {col_name}: {str(e)}")
            continue

    return pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

df = get_all_collections_data(client)

display(df.head())

Successfully processed: JeopardyQuestion


Unnamed: 0,category,answer,description,vector
0,SID & MARTY KROFFT TV,rock and roll,"On their Krofft-produced variety show, Marie O...","[0.07722147554159164, 0.032216183841228485, -0..."
1,IF THE TV SERIES HAD A DOWNER ENDING,The Dukes of Hazzard,Boss Hogg has the General Lee impounded on doz...,"[-0.057870276272296906, 0.07658110558986664, 0..."
2,RADIO WAVES,Simulcast,This word dating back to the 1940s refers to a...,"[0.028829582035541534, -0.030836530029773712, ..."
3,CARIBBEAN CUISINE,Curacao,"Goat stew is savored on this ""C"" of the ABC Is...","[0.06545388698577881, -0.058375656604766846, -..."
4,SPECIAL EFFECTS,squib,5-letter word for an explosive used to lend re...,"[-0.02972320280969143, -0.06054762005805969, -..."
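
Before visualizing, it can help to confirm that every row carries a vector of the expected length. The `sentence-transformers/all-MiniLM-L6-v2` model produces 384-dimensional embeddings. The following is a minimal sketch over plain dictionaries; `check_vector_dims` and the sample rows are hypothetical.

```python
def check_vector_dims(rows, expected_dim, key="vector"):
    """Return the indices of rows whose vector is missing or has the wrong length."""
    bad = []
    for index, row in enumerate(rows):
        vector = row.get(key)
        if vector is None or len(vector) != expected_dim:
            bad.append(index)
    return bad


# Made-up rows: one valid vector, one too short, one missing.
sample_rows = [
    {"description": "a", "vector": [0.0] * 384},
    {"description": "b", "vector": [0.0] * 383},
    {"description": "c"},
]
print(check_vector_dims(sample_rows, expected_dim=384))  # [1, 2]
```

On the DataFrame built above, this could be called as `check_vector_dims(df.to_dict("records"), 384)`; an empty list would mean all vectors have the expected length.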


## Usage


In [14]:
%%capture
%pip install -q -U embedding-atlas

In [15]:
from embedding_atlas.projection import compute_text_projection
from embedding_atlas.widget import EmbeddingAtlasWidget

# Create an Embedding Atlas widget without projection.
# This widget shows the table and charts only, not the embedding view.
EmbeddingAtlasWidget(df)

# Compute a text embedding of the "description" column and a 2D projection of it.
# Note: this embeds the text again; the Weaviate vectors in df["vector"] are not used.
compute_text_projection(df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors"
)

# Create an Embedding Atlas widget with the pre-computed projection
widget = EmbeddingAtlasWidget(df, text="description",
    x="projection_x", y="projection_y", neighbors="neighbors"
)

# Display the widget
widget

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

EmbeddingAtlasWidget()
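
The `neighbors` column name suggests that nearest-neighbor information is stored for each point. As a rough illustration of the idea only, the sketch below finds the nearest neighbors of a vector by cosine similarity in plain Python. It is not Embedding Atlas's algorithm or storage format; the function names and toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(vectors, query_index, k):
    """Return the indices of the k vectors most similar to vectors[query_index]."""
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, vector), index)
        for index, vector in enumerate(vectors)
        if index != query_index
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Toy 2D vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbors(toy_vectors, query_index=0, k=2))  # [1, 2]
```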

In [None]:
# Retrieve the points currently selected in the widget
df = widget.selection()
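
Once a selection has been retrieved, a quick summary can show which categories dominate it. The sketch below counts categories over a list of dictionaries, such as the result of `to_dict('records')`. The helper `top_categories` and the sample records are hypothetical.

```python
from collections import Counter


def top_categories(records, n=3):
    """Return the n most frequent 'category' values as (category, count) pairs."""
    counts = Counter(record["category"] for record in records)
    return counts.most_common(n)


# Made-up stand-in for a selection exported as records
selected = [
    {"category": "RADIO WAVES", "answer": "Simulcast"},
    {"category": "RADIO WAVES", "answer": "FM"},
    {"category": "SPECIAL EFFECTS", "answer": "squib"},
]
print(top_categories(selected))  # [('RADIO WAVES', 2), ('SPECIAL EFFECTS', 1)]
```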

**Want to scale this notebook?**

Try [Weaviate Cloud (WCD)](https://console.weaviate.cloud/) with your **14-day free trial**.
(No credit card required.)