# Vectorizers

This notebook shows how to create embeddings with RedisVL4j's built-in text embedding vectorizers. RedisVL4j currently supports:
1. HuggingFace (Sentence Transformers via ONNX - runs locally)
2. LangChain4j Integration (OpenAI, Cohere, VoyageAI, Azure, etc.)
3. Custom vectorizers

Before running this notebook, be sure to:
1. Have Java 17+ installed
2. Have a running Redis Stack instance with RediSearch > 2.4 active

For example, you can run Redis Stack locally with Docker:

```bash
docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest
```

This will run Redis on port 6379 and RedisInsight at http://localhost:8001.

## Setup

First, add the RedisVL4j JAR and its dependencies to the classpath.
For local development, you can build the project with `./gradlew :core:build` and find the JAR in `core/build/libs/`.

In [None]:
// Add JARs to classpath - adjust paths as needed
%jars /path/to/redisvl4j/core/build/libs/*.jar

// Import necessary classes
import com.redis.vl.utils.vectorize.*;
import com.redis.vl.index.SearchIndex;
import com.redis.vl.schema.IndexSchema;
import com.redis.vl.schema.VectorField;
import com.redis.vl.query.VectorQuery;
import redis.clients.jedis.UnifiedJedis;
import redis.clients.jedis.search.schemafields.VectorField.VectorAlgorithm;
import java.util.List;
import java.util.Map;
import java.util.HashMap;
import java.util.ArrayList;
import java.util.Arrays;

## Creating Text Embeddings

This example shows how to create embeddings for three simple sentences with the different text vectorizers in RedisVL4j.

- "That is a happy dog"
- "That is a happy person"
- "Today is a sunny day"

In [None]:
// Define our test sentences
List<String> sentences = Arrays.asList(
    "That is a happy dog",
    "That is a happy person",
    "Today is a sunny day"
);

### HuggingFace Sentence Transformers (Local)

[Hugging Face](https://huggingface.co/models) is a popular NLP platform that hosts many pre-trained models. RedisVL4j supports using Hugging Face "Sentence Transformers" via ONNX models to create embeddings **locally** (no API key required).

Models are automatically downloaded and cached on first use.

In [None]:
// Create a vectorizer using HuggingFace Sentence Transformers
// This model runs locally - no API key needed!
var hf = new SentenceTransformersVectorizer("sentence-transformers/all-mpnet-base-v2");

// Embed a single sentence
float[] test = hf.embed("This is a test sentence.");
System.out.println("Vector dimensions: " + test.length);
System.out.println("First 10 dimensions: " + Arrays.toString(Arrays.copyOfRange(test, 0, 10)));

In [None]:
// Create many embeddings at once
List<float[]> embeddings = hf.embedBatch(sentences);
System.out.println("Created " + embeddings.size() + " embeddings");
System.out.println("First embedding (first 10): " + Arrays.toString(Arrays.copyOfRange(embeddings.get(0), 0, 10)));

### OpenAI via LangChain4j

The `LangChain4JVectorizer` wraps any LangChain4j `EmbeddingModel`, which gives you access to OpenAI's embedding models.

Set your OpenAI API key in the `OPENAI_API_KEY` environment variable. The cell below skips the example if the variable is not set.

In [None]:
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

// Get API key from environment
String apiKey = System.getenv("OPENAI_API_KEY");
if (apiKey == null) {
    System.out.println("Skipping OpenAI example - OPENAI_API_KEY not set");
} else {
    // Create OpenAI embedding model
    var openaiModel = OpenAiEmbeddingModel.builder()
        .apiKey(apiKey)
        .modelName("text-embedding-ada-002")
        .build();
    
    // Wrap in LangChain4JVectorizer
    var oai = new LangChain4JVectorizer("text-embedding-ada-002", openaiModel);
    
    // Embed a sentence
    float[] openaiTest = oai.embed("This is a test sentence.");
    System.out.println("OpenAI Vector dimensions: " + openaiTest.length);
    System.out.println("First 10 dimensions: " + Arrays.toString(Arrays.copyOfRange(openaiTest, 0, 10)));
    
    // Batch embeddings
    List<float[]> openaiEmbeddings = oai.embedBatch(sentences);
    System.out.println("Created " + openaiEmbeddings.size() + " embeddings");
}

### Cohere via LangChain4j

[Cohere](https://dashboard.cohere.ai/) provides language AI models, including embedding models that you can use through the `LangChain4JVectorizer`.

Set your Cohere API key in the `COHERE_API_KEY` environment variable. The cell below skips the example if the variable is not set.

In [None]:
import dev.langchain4j.model.cohere.CohereEmbeddingModel;

String cohereApiKey = System.getenv("COHERE_API_KEY");
if (cohereApiKey == null) {
    System.out.println("Skipping Cohere example - COHERE_API_KEY not set");
} else {
    var cohereModel = CohereEmbeddingModel.builder()
        .apiKey(cohereApiKey)
        .modelName("embed-english-v3.0")
        .build();
    
    var co = new LangChain4JVectorizer("embed-english-v3.0", cohereModel);
    
    float[] cohereTest = co.embed("This is a test sentence.");
    System.out.println("Cohere Vector dimensions: " + cohereTest.length);
    System.out.println("First 10 dimensions: " + Arrays.toString(Arrays.copyOfRange(cohereTest, 0, 10)));
}

### VoyageAI via LangChain4j

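[VoyageAI](https://dash.voyageai.com/) provides specialized embedding models. You can access them through the LangChain4j integration.

Set your VoyageAI API key in the `VOYAGE_API_KEY` environment variable. The cell below skips the example if the variable is not set.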

In [None]:
import dev.langchain4j.model.voyageai.VoyageAiEmbeddingModel;

String voyageApiKey = System.getenv("VOYAGE_API_KEY");
if (voyageApiKey == null) {
    System.out.println("Skipping VoyageAI example - VOYAGE_API_KEY not set");
} else {
    var voyageModel = VoyageAiEmbeddingModel.builder()
        .apiKey(voyageApiKey)
        .modelName("voyage-law-2")
        .build();
    
    var vo = new LangChain4JVectorizer("voyage-law-2", voyageModel);
    
    float[] voyageTest = vo.embed("This is a test sentence.");
    System.out.println("VoyageAI Vector dimensions: " + voyageTest.length);
    System.out.println("First 10 dimensions: " + Arrays.toString(Arrays.copyOfRange(voyageTest, 0, 10)));
}

### Custom Vectorizers

RedisVL4j supports custom vectorizers by extending `BaseVectorizer`, so any function that produces embeddings can be plugged in. The example below returns a constant 768-dimensional vector purely to show the required overrides.

In [None]:
// Create a simple custom vectorizer
class CustomVectorizer extends BaseVectorizer {
    public CustomVectorizer() {
        super("custom-model", 768, "float32");
    }
    
    @Override
    protected float[] generateEmbedding(String text) {
        // Simple example: fill with constant value
        float[] embedding = new float[768];
        Arrays.fill(embedding, 0.101f);
        return embedding;
    }
    
    @Override
    protected List<float[]> generateEmbeddingsBatch(List<String> texts, int batchSize) {
        return texts.stream()
            .map(this::generateEmbedding)
            .collect(java.util.stream.Collectors.toList());
    }
}

var customVectorizer = new CustomVectorizer();
float[] customEmbed = customVectorizer.embed("This is a test sentence.");
System.out.println("Custom vectorizer dimensions: " + customEmbed.length);
System.out.println("First 10 values: " + Arrays.toString(Arrays.copyOfRange(customEmbed, 0, 10)));
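
As a slightly less trivial illustration of an "embedding generation function", here is a minimal sketch of a hashing-based bag-of-words embedding in plain Java. It is not part of RedisVL4j: the class name `HashingEmbeddingSketch` and its `embed` method are hypothetical, and the technique (the hashing trick) only captures word overlap, not meaning. Under those assumptions, the body of `embed` could serve as the body of `generateEmbedding` in a custom vectorizer like the one above.

```java
import java.util.Locale;

// Hypothetical helper, not a RedisVL4j class: a toy hashing-trick embedding.
class HashingEmbeddingSketch {

    // Maps each word to a bucket via String.hashCode and counts occurrences,
    // then scales the vector to unit length (L2 normalization).
    static float[] embed(String text, int dims) {
        float[] vector = new float[dims];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            int bucket = Math.floorMod(token.hashCode(), dims);
            vector[bucket] += 1.0f;
        }

        double sumOfSquares = 0.0;
        for (float value : vector) {
            sumOfSquares += (double) value * value;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            for (int i = 0; i < vector.length; i++) {
                vector[i] = (float) (vector[i] / norm);
            }
        }
        return vector; // all zeros if the text contained no word characters
    }
}
```

Because `String.hashCode` is deterministic, the same text always produces the same vector, which makes a function like this convenient for tests that should not download a model or call an API.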

## Vector Search with Embeddings

Now let's use embeddings to search for similar sentences. We'll:
1. Create a Redis search index
2. Load our 3 sentences with their embeddings
3. Query for the sentences most similar to "That is a happy cat"

In [None]:
// Connect to Redis
var redis = new UnifiedJedis("redis://localhost:6379");

// Create the schema (matching the Python notebook YAML).
// The vector field uses 768 dimensions, the output size of all-mpnet-base-v2.
var schema = IndexSchema.builder()
    .name("vectorizers")
    .prefix("doc")
    .storageType(IndexSchema.StorageType.HASH)
    .addTextField("sentence", textField -> {})
    .addVectorField("embedding", 768, vectorField ->
        vectorField
            .algorithm(VectorAlgorithm.FLAT)
            .distanceMetric(VectorField.DistanceMetric.COSINE)
            .dataType(VectorField.VectorDataType.FLOAT32))
    .build();

// Create the index
var index = new SearchIndex(schema, redis);
index.create(true); // overwrite if exists
System.out.println("Index created: " + index.getName());

In [None]:
// Create embeddings for our sentences using HuggingFace
List<float[]> sentenceEmbeddings = hf.embedBatch(sentences);

// Prepare data for loading
List<Map<String, Object>> data = new ArrayList<>();
for (int i = 0; i < sentences.size(); i++) {
    Map<String, Object> doc = new HashMap<>();
    doc.put("sentence", sentences.get(i));
    doc.put("embedding", sentenceEmbeddings.get(i));
    data.add(doc);
}

// Load data into the index
index.load(data);
System.out.println("Loaded " + data.size() + " documents");
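
When a vector is stored in a Redis hash, the field value is a binary blob of the raw vector values, which for `FLOAT32` means 4 bytes per dimension. The sketch below illustrates that encoding in plain Java, assuming little-endian byte order (the native order on x86 and most ARM hosts). The class `VectorBytesSketch` and its methods are hypothetical and not part of RedisVL4j; `index.load` is assumed to take care of any such conversion, so this is for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not a RedisVL4j class: float32 vector <-> byte blob.
class VectorBytesSketch {

    // Encodes each float as 4 little-endian bytes, in order.
    static byte[] toBytes(float[] vector) {
        ByteBuffer buffer = ByteBuffer
            .allocate(vector.length * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : vector) {
            buffer.putFloat(value);
        }
        return buffer.array();
    }

    // Decodes a blob produced by toBytes back into a float array.
    static float[] fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```

A 768-dimensional `FLOAT32` embedding therefore occupies 3072 bytes per document, which is useful to keep in mind when estimating memory for larger datasets.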

In [None]:
// Create a query embedding for "That is a happy cat"
float[] queryEmbedding = hf.embed("That is a happy cat");

// Create and execute a vector query
var query = VectorQuery.builder()
    .vector(queryEmbedding)
    .field("embedding")
    .returnFields(List.of("sentence"))
    .numResults(3)
    .build();

List<Map<String, Object>> results = index.query(query);

System.out.println("\nSearch results for: 'That is a happy cat'");
for (var doc : results) {
    System.out.println(doc.get("sentence") + " - Distance: " + doc.get("vector_distance"));
}

Notice that "That is a happy dog" is the closest match to "That is a happy cat", which makes semantic sense. With the `COSINE` metric, a smaller distance means a closer match.
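
The `vector_distance` value comes from the index's distance metric. For `COSINE`, Redis reports the cosine distance, which is 1 minus the cosine similarity: 0 means the vectors point in the same direction, and the value grows up to 2 as they diverge. The following is a minimal sketch of that calculation in plain Java. The class `CosineDistanceSketch` is hypothetical and not part of RedisVL4j, and values computed client-side may differ slightly from what Redis returns because of floating-point precision.

```java
// Hypothetical helper, not a RedisVL4j class: cosine distance of two vectors.
class CosineDistanceSketch {

    // Returns 1 - cos(a, b). Throws if the lengths differ or a vector is all zeros.
    static double cosineDistance(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Calling `CosineDistanceSketch.cosineDistance(queryEmbedding, sentenceEmbeddings.get(0))` in this notebook should give a value close to the `vector_distance` reported for the first sentence.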

In [None]:
// Cleanup
index.delete(true);
System.out.println("Index deleted");

## Summary

RedisVL4j provides flexible vectorization options:

1. **Local Models** - HuggingFace Sentence Transformers (no API key required)
2. **Cloud APIs** - OpenAI, Cohere, VoyageAI via LangChain4j integration
3. **Custom** - Implement your own vectorizer

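Every vectorizer returns `float[]` embeddings that can be stored in a Redis index and used in vector queries to build semantic search applications, as long as the index's vector field is defined with the matching number of dimensions.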