1. Use Environment Variables Securely
The notebook already uses dotenv to load environment variables, which is a good start. It is also good practice to verify that each variable actually loaded, and to fail with a clear error message (or fall back to a default value) when one is missing.



In [None]:
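import os  # needed for os.getenv

# Assumes load_dotenv() has already run earlier in the notebook.
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")
if not EMBEDDING_API_KEY:  # catches both a missing and an empty value
    raise ValueError("EMBEDDING_API_KEY is missing. Please check your .env file.")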
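
When several variables are required, the same check can be wrapped in a small helper so each one is validated the same way. The following is a minimal sketch; `require_env` is a hypothetical helper name introduced here, not part of `os` or dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper (not part of any library): raises ValueError
    when the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise ValueError(f"{name} is missing. Please check your .env file.")
    return value
```

With this in place, the lookup above could be written as `EMBEDDING_API_KEY = require_env("EMBEDDING_API_KEY")`.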


2. Modularize Code
The code can be split into smaller, reusable functions, especially for tasks like connecting to Weaviate, inserting data, and running queries. This will improve code structure and reduce duplication.

In [None]:
import weaviate

def connect_to_weaviate(api_key):
    """Start an embedded Weaviate instance and return a connected client."""
    return weaviate.connect_to_embedded(
        version="1.24.21",
        environment_variables={
            "ENABLE_MODULES": "backup-filesystem,multi2vec-palm",
            "BACKUP_FILESYSTEM_PATH": "/home/jovyan/work/L2/backups",
        },
        headers={"X-PALM-Api-Key": api_key}
    )

client = connect_to_weaviate(EMBEDDING_API_KEY)
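
The same idea applies to the object-building logic that recurs in the insert loops. As a rough sketch, the property dictionary could be assembled in one place; `build_image_properties` and its `encode` parameter are hypothetical names, and the keys simply mirror the objects inserted elsewhere in this notebook.

```python
import os
from typing import Callable, Dict


def build_image_properties(
    name: str, image_dir: str, encode: Callable[[str], str]
) -> Dict[str, str]:
    """Build the property dict for one image object.

    Hypothetical helper: `encode` is any function that turns a file path
    into a base64 string (for example, the notebook's toBase64).
    """
    path = os.path.join(image_dir, name)
    return {
        "name": name,
        "path": path,
        "image": encode(path),
        "mediaType": "image",
    }
```

Passing the encoder in as an argument keeps the helper easy to test without touching real image files.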


3. Use Try-Except Blocks for Error Handling
Add error handling when connecting to Weaviate, uploading files, or running queries. Failures then surface with a clear message at the step that caused them, instead of as an unexplained traceback later in the notebook.

In [None]:
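try:
    # is_ready() returns a bool, so check the result explicitly
    if not client.is_ready():
        raise RuntimeError("Weaviate is not ready.")
except Exception as e:
    print(f"Error connecting to Weaviate: {e}")
    raise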
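
An embedded instance can take a moment to start, so a single readiness check may fail even though nothing is wrong. One possible approach is to retry a few times before giving up. The sketch below is generic and does not depend on Weaviate; `wait_until_ready` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool], attempts: int = 5, delay: float = 1.0
) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as `check()` returns True, and False if every
    attempt fails. Exceptions raised by `check` count as "not ready yet".
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except Exception as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
        if attempt < attempts:
            time.sleep(delay)
    return False
```

A call such as `wait_until_ready(client.is_ready)` would then replace the single check, with the caller raising an error if it returns False.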


4. Batch Insert Optimization
Batch inserts should log each failure with enough detail (which object, which error) to make problems easy to trace.

In [None]:
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in source:
        path = "./source/image/" + name
        try:
            batch.add_object({
                "name": name,
                "path": path,
                "image": toBase64(path),
                "mediaType": "image",
            })
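        except Exception as e:
            # Mostly catches local errors, such as an unreadable image file
            print(f"Error inserting {name}: {e}")

# Objects rejected during import are collected on the batch rather than raised
failed = animals.batch.failed_objects
if failed:
    print(f"{len(failed)} objects failed to import; first: {failed[0]}")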


5. Optimize File I/O
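Instead of repeatedly opening the same files with toBase64 or file_to_base64, cache the base64 representations when possible. Note that the cache keeps every encoded image in memory, so this suits datasets that fit comfortably in RAM.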

In [None]:
import base64

base64_cache = {}  # maps file path -> base64 string

def toBase64(path):
    if path not in base64_cache:
        with open(path, 'rb') as file:
            base64_cache[path] = base64.b64encode(file.read()).decode('utf-8')
    return base64_cache[path]
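
If unbounded memory growth is a concern, the standard library's `functools.lru_cache` gives the same effect with a size limit. This is a sketch of that alternative, not the notebook's own method; `to_base64_cached` is a hypothetical name and `maxsize=256` is an assumed value.

```python
import base64
from functools import lru_cache


@lru_cache(maxsize=256)  # assumed limit; tune it to the memory available
def to_base64_cached(path: str) -> str:
    """Read a file and return its contents as a base64 string.

    Results are cached per path; the least recently used entries are
    evicted once more than `maxsize` distinct paths have been seen.
    """
    with open(path, "rb") as file:
        return base64.b64encode(file.read()).decode("utf-8")
```

`to_base64_cached.cache_info()` reports hits and misses, which helps confirm whether the cache is doing any useful work.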


6. Improve Logging
Use a logging framework like Python’s logging module instead of print statements for better control and logging levels (e.g., INFO, ERROR, etc.).



In [None]:
import logging

logging.basicConfig(level=logging.INFO)

# Log successful connections
logging.info("Weaviate client is ready.")
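
Beyond a single info message, a named logger and `logger.exception` make it possible to record failures together with their tracebacks. The sketch below shows one way to wrap a step so that both outcomes are logged; the logger name `image_import` and the helper `run_logged` are hypothetical.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("image_import")  # arbitrary logger name


def run_logged(description: str, action: Callable[[], T]) -> Optional[T]:
    """Run `action`, logging success at INFO and failure at ERROR.

    Returns the action's result, or None if it raised. The traceback is
    included in the log record by logger.exception.
    """
    try:
        result = action()
    except Exception:
        logger.exception("%s failed", description)
        return None
    logger.info("%s succeeded", description)
    return result
```

For example, `run_logged("Readiness check", client.is_ready)` would log either outcome without stopping the notebook.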


7. Enable Parallel Processing
For larger datasets, parallel processing can speed up base64 encoding and insertion. Reading image files is I/O-bound, so a thread pool from Python's concurrent.futures is usually enough; multiprocessing is the alternative when the work is CPU-bound.

In [None]:
from concurrent.futures import ThreadPoolExecutor

def insert_object(file_path):
    # Insert your batch logic here
    pass

with ThreadPoolExecutor(max_workers=5) as executor:
    # Worker exceptions only surface when results are consumed, so materialize them
    results = list(executor.map(insert_object, source))
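
As a more concrete illustration, the sketch below parallelizes only the file reading and encoding step and leaves insertion to the batch context shown earlier. The function names `encode_file` and `encode_files_parallel` are hypothetical, and failures are collected per file rather than aborting the whole run.

```python
import base64
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, Iterable, Tuple


def encode_file(path: str) -> str:
    """Read one file and return its contents as a base64 string."""
    with open(path, "rb") as file:
        return base64.b64encode(file.read()).decode("utf-8")


def encode_files_parallel(
    paths: Iterable[str], max_workers: int = 5
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Encode many files using a thread pool.

    Returns (encoded, errors): `encoded` maps path -> base64 string and
    `errors` maps path -> the OSError raised while reading that file.
    """
    encoded: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(encode_file, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                encoded[path] = future.result()
            except OSError as exc:
                errors[path] = exc
    return encoded, errors
```

Using `submit` with `as_completed` instead of `map` keeps one unreadable file from hiding the results of all the others.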


8. Use Configurable Parameters
Make parameters like the number of objects to query or request limits configurable, so you can change them without altering the code. Use environment variables or configuration files to store such values.

In [None]:
# Environment values are strings, so the defaults are given as strings too
REQUESTS_PER_MINUTE = int(os.getenv("REQUESTS_PER_MINUTE", "100"))
QUERY_LIMIT = int(os.getenv("QUERY_LIMIT", "3"))
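
A bare `int(...)` call fails with an unhelpful message when the variable holds something like `abc`, and it accepts values such as `0` that make no sense for a rate limit. The sketch below adds basic validation; `env_int` and its `minimum` parameter are hypothetical names.

```python
import os


def env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an integer setting from the environment.

    Hypothetical helper: returns `default` when the variable is unset or
    blank, and raises ValueError when the value is not an integer or is
    smaller than `minimum`.
    """
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value
```

The two settings above would then read `env_int("REQUESTS_PER_MINUTE", 100)` and `env_int("QUERY_LIMIT", 3)`.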


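9. Add Unit Tests
Add simple tests for helper functions such as toBase64 so that regressions are caught early.

In [None]: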
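def test_base64_conversion():
    encoded = toBase64("./test/test-cat.jpg")
    # The result should be a non-empty string of valid base64
    assert encoded and base64.b64decode(encoded, validate=True)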


10. Handle Large Dataset Loading Efficiently
With large datasets, loading and encoding everything at once may exhaust memory. Processing the data in fixed-size chunks keeps memory use bounded.

In [None]:
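def load_data_in_chunks(client, batch_size=100):
    # insert_objects is a hypothetical helper, assumed to be defined elsewhere,
    # that inserts one chunk of file names using the given client
    source = os.listdir("./source/animal_image/")
    for i in range(0, len(source), batch_size):
        chunk = source[i:i+batch_size]
        insert_objects(client, chunk)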
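
The slicing approach needs the full list in memory first. A generator-based variant works with any iterable, including lazy ones, and is sketched below; `chunked` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`.

    Only one chunk is held in memory at a time, so `items` may be a
    lazy iterator rather than a fully built list.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Combined with `os.scandir`, which yields directory entries lazily, a loop such as `for chunk in chunked((entry.name for entry in os.scandir(image_dir)), 100)` avoids building the complete file list up front (`image_dir` stands for whichever source directory is in use).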


11. Add Visual Feedback
Show progress bars for long-running tasks such as inserting data or restoring backups, so it is clear how far a run has progressed and whether it has stalled.

In [None]:
from tqdm.notebook import tqdm

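# Open a batch context as in section 4; add_object is called on the batch it yields
with animals.batch.rate_limit(requests_per_minute=100) as batch:
    for name in tqdm(source, desc="Inserting images"):
        path = "./source/image/" + name
        batch.add_object({
            "name": name,
            "path": path,
            "image": toBase64(path),
            "mediaType": "image"
        })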


12. Optimize UMAP for Large Datasets
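UMAP can be slow on very large datasets. Of the common hyperparameters, n_neighbors and metric affect runtime as well as the resulting layout (a smaller n_neighbors generally means less work, at the cost of less global structure), while min_dist mainly controls how tightly points are packed in the embedding. Experiment with these to find an acceptable trade-off between speed and quality.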

In [None]:
import umap  # provided by the umap-learn package

mapper2 = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine').fit(emb_df)
