-Weaviate output:
-
-```text
-Interstellar 2014 157336
-Distance to query: 0.354
-
-Gravity 2013 49047
-Distance to query: 0.384
-
-Arrival 2016 329865
-Distance to query: 0.386
-
-Armageddon 1998 95
-Distance to query: 0.400
-
-Godzilla 1998 929
-Distance to query: 0.441
-```
-
-Weaviate output:
-
-```text
-Deadpool 2 2018 383498
-Distance to query: 0.670
-
-Bloodshot 2020 338762
-Distance to query: 0.677
-
-Deadpool 2016 293660
-Distance to query: 0.678
-
-300 2007 1271
-Distance to query: 0.682
-
-The Hunt for Red October 1990 1669
-Distance to query: 0.683
-```
-
-```text
-Life Is Beautiful 1997 637
-Distance to query: 0.621
-
-Groundhog Day 1993 137
-Distance to query: 0.623
-
-Jingle All the Way 1996 9279
-Distance to query: 0.625
-
-Training Day 2001 2034
-Distance to query: 0.627
-
-Misery 1990 1700
-Distance to query: 0.632
-```
-
-Luckily for them, the `MovieNVDemo` collection has a `poster_title` named vector, which is primarily based on the poster design. Aesthetico's designers can therefore search against the `poster_title` named vector to find movies whose posters are similar to their own design. They can then perform RAG to summarize the movies that are found.
-
-###  Code
-
-This query finds movies similar to the input image, and then provides insights using RAG.
-
-Now, let’s see what happens when we try to retrieve the best matching objects using two different embedding models. We will use the following two models:
-
-- `FastText (fasttext-en-vectors)` (from 2015; [model card](https://huggingface.co/facebook/fasttext-en-vectors))
-- `snowflake-arctic-embed-l-v2.0` (from 2024; [model card](https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0))
-
-Here is a summary of results from a search, using the `FastText` model from 2015:
-
-The top result identified by the `FastText` model is quite relevant, as it discusses how to correct some potential issues with cookie making. However, it’s less relevant than the ideal result, which is a step-by-step recipe.
-
-The other two, however, are not relevant to the query. While they are recipes, they are not for baking cookies.
-
-It would be fair to say that there’s quite a bit of room for improvement.
-
-Here are the results from the `snowflake-arctic-embed-l-v2.0` model, from 2024:
-
-We see that the `arctic` embeddings correctly identified the ideal top-ranked result. In fact, the top two expected results are included in the top three results for the `arctic` embeddings. Even the other result is relevant to chocolate chip cookies - although perhaps slightly off topic.
-
-###  Evaluation criteria
-
-We could even compare these models using a standard metric, such as `nDCG@k`.
-
-For this scenario, the two models scored:
-
-| Model | nDCG@10 |
-| --- | --- |
-| `FastText` | 0.595 |
-| `snowflake-arctic-embed-l-v2.0` | 0.908 |
-
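To make the metric concrete, here is a minimal sketch of how `nDCG@k` can be computed. It assumes graded relevance labels with linear gain; the function names and the example grades are hypothetical and are not the data behind the scores in the table above.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k):
    """nDCG@k: DCG of the returned ranking divided by the best achievable DCG.

    ranked_relevances: grades of the returned results, in returned order.
    judged_relevances: grades of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical grades: 3 = ideal match, 0 = irrelevant.
judged = [3, 2, 1, 0]
perfect_order = ndcg_at_k([3, 2, 1], judged, k=3)
reversed_order = ndcg_at_k([1, 2, 3], judged, k=3)
print(f"perfect: {perfect_order:.3f}, reversed: {reversed_order:.3f}")
```

A ranking that places the most relevant documents first scores close to 1, while one that buries them scores lower, which is the kind of gap the table above reflects.
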
-These simple examples illustrate some of the impact of embedding model selection. The choice of embedding models can make a huge difference in the quality of your search, your resource requirements, and many more factors.
-
-There have been huge advancements in the landscape of embedding models over the last 10 to 15 years. In fact, innovations in embedding models continue to occur today. You might have heard of some of these names: word2vec, FastText, GloVe, BERT, CLIP, OpenAI ada, Cohere multi-lingual, Snowflake Arctic, ColBERT, and ColPali.
-
-Each model (or architecture) brings with it some improvements. It may be in model architecture, training data, training methodology, modality, or efficiency, for instance.
-
-So in the next few sections, let’s begin to explore a workflow for embedding model selection.
-
-## Questions and feedback
-
-import DocsFeedback from '/_includes/docs-feedback.mdx';
-
-The chart shows a clear positive relationship between model size and higher performance. This also means that generally, models with better performance will be larger. They will require more memory and compute, which means higher costs and slower speeds.
-
-In other words, a larger model such as `nv-embed-v2` may perform better at retrieval than a smaller model such as `snowflake-arctic-embed-m-v1.5`, but may cost more to run and/or use.
-
-But there are many other dimensions to consider. For example:
-
-- A proprietary model such as a modern `gemini` model may show promising performance, but may not meet a user’s preference for local inference.
-- While a model may perform well at a standard benchmark, it may not perform as well if given material from a specialized domain, such as legal, medical, or coding tasks.
-- A local model may be cheaper to run, but the organization may lack the expertise and resources for long-term infrastructure maintenance.
-
-In the face of this complexity, a systematic approach can help you to make an informed decision based on your specific requirements. This is one such approach:
-
-###  Data Characteristics
-
-| Factor | Key Questions | Why It Matters |
-| --- | --- | --- |
-| **Modality** | Are you dealing with text, images, audio, or multimodal data? | Models are built for specific modality/modalities.  |
-| **Language** | Which languages must be supported? | Models are trained & optimized for specific language(s), leading to trade-offs in performance. |
-| **Domain** | Is your data general or domain-specific (legal, medical, technical)? | Domain-specific models (e.g. [medical](https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir)) understand specialized vocabulary and concepts.  |
-| **Length** | What's the typical length of your documents and queries? | Input token context windows vary between models, from as small as `256` tokens to `8192` tokens, for example. However, longer context windows typically require significantly more compute and add latency. |
-| **Asymmetry** | Will your queries differ significantly from your documents? | Some models are built for asymmetric query to document comparisons. So queries like `laptop won't turn on` can easily identify documents like `Troubleshooting Power Issues: If your device fails to boot...`. |
-
-###  Performance Needs
-
-| Factor | Key Questions | Why It Matters |
-| --- | --- | --- |
-| **Accuracy** (recall) | How critical is it that all the top results are retrieved? | Higher accuracy requirements may justify more expensive or resource-intensive models.  |
-| **Latency** | How quickly must queries be processed? | Larger models with better performance often have slower inference times. For inference services, faster services will cost more. |
-| **Throughput** | What query volume do you anticipate? Will there be traffic spikes? | Larger models with better performance often have lower processing capacity. For inference services, increased throughput will increase costs. |
-| **Volume** | How many documents will you process? | Larger embedding dimensions increase memory requirements for your vector store. This will impact resource requirements and affect costs at scale. |
-| **Task type** | Is retrieval the only use case? Or will it also involve others (e.g. clustering or classification) ? | Models have strengths and weaknesses; a model excellent at retrieval might not excel at clustering. This will drive your evaluation & selection criteria. |
-
-###  Operational Factors
-
-| Factor | Key Questions | Why It Matters |
-| --- | --- | --- |
-| **Hardware limitations** | What computational resources are available for hosting & inference? | Hardware availability (costs, GPU/TPU availability) will significantly affect your range of choices. |
-| **API rate limits** | If using a hosted model, what are the provider's limits? | Rate limits can bottleneck applications, or limit potential growth. |
-| **Deployment & maintenance** | What technical expertise and resources are required? | Is self-hosting a model an option, or should you look at API-based hosted options? |
-
-###  Business Requirements
-
-| Factor | Key Questions | Why It Matters |
-| --- | --- | --- |
-| **Hosting options** | Do you need self-hosting capabilities, or is a cloud API acceptable? | Self-hosting ➡️ more control at higher operational complexity; APIs ➡️ lower friction at higher dependencies. |
-| **Licensing** | What are the licensing restrictions for commercial applications? | Some model licenses or restrictions may prohibit certain use cases. |
-| **Long-term support** | What guarantees exist for the model's continued availability? | If a model or business is abandoned, downstream applications may need significant reworking. |
-| **Budget** | What are your cost limits and expenditure preferences? | Embedding costs can add up over time, but self-hosting can incur high upfront costs. |
-| **Privacy & Compliance** | Are there data privacy requirements or industry regulations to consider? | Some industries require specific models. And privacy requirements may impose hosting requirements. |
-
-Documenting these requirements creates a clear profile of your ideal embedding model, which will guide your selection process and help you make informed trade-offs.
-
-##  Compile candidate models
-
-After identifying your needs, create a list of potential embedding models to evaluate. This process helps focus your detailed evaluation on the most promising candidates.
-
-There are hundreds of embedding models available today, with new ones being released regularly. For this many models, even a simple screening process would be too time-consuming.
-
-As a result, we suggest identifying an initial list of models with a simple set of heuristics, such as these:
-
-###  Account for model modality
-
-This is a critical, first-step filter. A model can only support the modality/modalities that it is designed and trained for.
-
-Some models (e.g. Cohere `embed-english-v3.0`) are multimodal, while others (e.g. Snowflake’s `snowflake-arctic-embed-l-v2.0`) are unimodal.
-
-No matter how good a model is, a text-only model such as `snowflake-arctic-embed-l-v2.0` will not be able to perform image retrieval. Similarly, a `ColQwen` model cannot be used for plain text retrieval.
-
-###  Favor models already available
-
-If your organization already uses embedding models for other applications, these are great starting points. They are likely to have been screened, evaluated and approved for use, with accounts and billing already configured. For local models, this would mean that the infrastructure is already available.
-
-This also extends to models available through your other service providers.
-
-You may already be using generative AI models through providers such as Cohere, Mistral or OpenAI. Or, perhaps your hyperscaler partners such as AWS, Microsoft Azure or Google Cloud provide embedding models.
-
-In many cases, these providers will also provide access to embedding models, which would be easier to adopt than those from a new organization.
-
-###  Try well-known models
-
-Generally, well-known or popular models are popular for a reason.
-
-Industry leaders in AI such as Alibaba, Cohere, Google, NVIDIA and OpenAI all produce embedding models for different modalities, languages and sizes. Here are a few samples of their available model families:
-
-| Provider | Model families |
-| --- | --- |
-| Alibaba | `gte`, `Qwen` |
-| Cohere | `embed-english`, `embed-multilingual` |
-| Google | `gemini-embedding`, `text-embedding` |
-| NVIDIA | `NV-embed` |
-| OpenAI | `text-embedding`, `ada` |
-
-There are also other families of models that you can consider.
-
-For example, the `ColPali` family of models for image embeddings and `CLIP` / `SigLIP` family of models for multimodal (image and text) are well-known in their respective domains. Then, `nomic`, `snowflake-arctic`, `MiniLM` and `bge` models are some examples of well-known language retrieval models.
-
-These popular models tend to be well-documented, discussed and widely supported.
-
-As a result, they tend to be easier to use, evaluate and troubleshoot than more obscure models.
-
-###  Benchmark leaders
-
-Models that perform well on standard benchmarks may be worth considering. Resources like [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) can help identify high-performing models.
-
-As an example, the screenshot below shows models on MTEB at a size of fewer than 1 billion parameters, sorted by their `retrieval` performance.
-
-It shows some models that we’ve already discussed, such as the `snowflake-arctic`, Alibaba’s `gte`, or BAAI’s `bge` models.
-
-Additionally, you can already see a number of high-performing models that we hadn’t discussed. Microsoft Research's `intfloat/multilingual-e5-large-instruct` and JinaAI’s `jinaai/jina-embeddings-v3` models are both easily discoverable here.
-
-Note that as of 2025, MTEB contains different benchmarks to assess different capabilities, such as those covering particular languages or modalities.
-
-When viewing benchmarks, make sure to view the right set of benchmarks, and the appropriate columns. In the example below, note that the page shows results for MIEB (image retrieval), with results sorted by *Any to Any Retrieval*.
-
-The MTEB is filterable and sortable by various metrics. So, you can arrange it to suit your preferences and add models to your list as you see fit.
-
-You should be able to compile a manageable list of models relatively quickly using these techniques. This list can then be manually reviewed for detailed screening.
-
-Some of the readily screenable factors, and how to screen models against them, are shown below:
-
-##  Screening factors
-
-###  Context length
-
-Input context length is a critical factor in ensuring that the meaning of each whole document chunk is taken into account. Maximum input context lengths vary widely between models, as shown in these examples:
-
-- `all-MiniLM-L6-v2`: 256 tokens
-- Cohere `embed-english-v3.0`: 512 tokens
-- `snowflake-arctic-embed-l-v2.0`: 8192 tokens
-
-Input text exceeding the context length is typically truncated, so the excess is ignored. On the other hand, longer context lengths typically require significantly more compute and add latency. As a result, this is an important tradeoff that interacts with your text chunking strategy.
-
-:::tip
-
-Consider what a “chunk” of information to retrieve looks like for your use case. Typically, a model with a context length of 512 tokens or more is sufficient for most use cases.
-
-:::
-
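As a rough illustration of how context length interacts with chunking, the sketch below splits text into overlapping chunks that stay within a token budget. The function name `chunk_by_token_budget` is hypothetical, and whitespace splitting is only a stand-in for a model's real tokenizer, which will usually produce more tokens than there are words. In practice, count tokens with the tokenizer of the model you are screening.

```python
def chunk_by_token_budget(text, max_tokens, overlap=0):
    """Split text into chunks of at most max_tokens whitespace-separated tokens.

    Consecutive chunks share `overlap` tokens so that context is not lost
    at chunk boundaries.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be between 0 and max_tokens - 1")

    tokens = text.split()  # approximation: one word is treated as one token
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


document = " ".join(f"word{i}" for i in range(10))
for chunk in chunk_by_token_budget(document, max_tokens=4, overlap=1):
    print(chunk)
```

Leaving some headroom below the model's maximum context length is a reasonable precaution, since the approximate count here can undercount real tokens.
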
-###  Model goals & training methodology
-
-Different embedding models are optimized for different use cases. This informs the model architecture, training data and training methodology.
-
-Reviewing the model provider’s descriptions and published training details can provide key insights into its suitability for your use case.
-
-- **Linguistic capabilities**: Some models (e.g. Snowflake’s `snowflake-arctic-embed-l-v2.0`) are multi-lingual, while others are primarily uni-lingual (e.g. Cohere’s `embed-english-v3.0`). These linguistic capabilities come largely from the training data and methodology selection.
-- **Domain exposure**: Models trained on specialized domains (e.g., legal, medical, financial) typically perform better for domain-specific applications.
-- **Primary tasks**: The provider may have been building a general-purpose embedding model, or one that is focussed on particular tasks. Google’s `gemini-embedding` model appears to be designed as a general-purpose model that aims for state-of-the-art results across tasks and domains ([release blog](https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/)). On the other hand, Snowflake’s `arctic-embed` 2.0 models appear to be focussed on retrieval tasks ([release blog](https://www.snowflake.com/en/engineering-blog/snowflake-arctic-embed-2-multilingual/)).
-- **Base model**: In many cases, an embedding model is trained from an existing model. Any advantages, or shortcomings, of the base model will often carry over to the final model, especially if it is an architectural one such as its context window size or pooling strategy.
-- **Training methods (advanced)**: If you have more experience with model training techniques, this is an area that you can use as heuristics as well. For example, models trained with contrastive learning often perform better for retrieval tasks. Additionally, hard negative mining is a technique that is valuable to enhance contrastive learning.
-
-:::tip
-
-Select a model whose capabilities align with your goals. For example, if your application requires retrieving paragraphs of text chunks in English, French, German, Mandarin Chinese and Japanese, check the model card and training information. Look for its retrieval performance, and whether these languages were included in the training corpus.
-
-:::
-
-###  Dimensionality and optimization options
-
-The dimensionality of embeddings affects both performance and resource requirements.
-
-As a rule of thumb, the memory requirement for a vector database (before any quantization) is roughly `4 bytes` * `n dimensions` * `m objects` * `1.5`, where `n` is the vector dimensionality, `m` is the number of objects in your database, and the factor of `1.5` accounts for overhead.
-
-This means that for, say, 10 million objects, the memory requirements for the full-size outputs of these models would be approximately:
-
-- NVIDIA `NV-embed-v2`: `246 GB`
-- OpenAI `text-embedding-3-large`: `184 GB`
-- `snowflake-arctic-embed-l-v2.0`: `61 GB`
-- `all-MiniLM-L6-v2`: `23 GB`
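
To make the arithmetic concrete, here is a minimal Python sketch of the rule of thumb above. The function name `estimate_memory_gb` is hypothetical, the result uses decimal gigabytes (10^9 bytes), and the dimensionalities for the models other than `snowflake-arctic-embed-l-v2.0` are assumptions chosen to be consistent with the figures listed above. Check each model card before relying on them.

```python
def estimate_memory_gb(dimensions, num_objects, bytes_per_value=4, overhead=1.5):
    """Rule-of-thumb memory estimate in decimal gigabytes (no quantization)."""
    return bytes_per_value * dimensions * num_objects * overhead / 1e9


# Assumed output dimensionalities (verify against each model card).
MODEL_DIMENSIONS = {
    "NV-embed-v2": 4096,
    "text-embedding-3-large": 3072,
    "snowflake-arctic-embed-l-v2.0": 1024,
    "all-MiniLM-L6-v2": 384,
}

for name, dims in MODEL_DIMENSIONS.items():
    print(f"{name}: {estimate_memory_gb(dims, 10_000_000):.0f} GB")
```

Changing `num_objects` to your expected dataset size gives a quick first estimate of what a candidate model might cost to host.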
-
-As you might imagine, this can add significant costs to your infrastructure needs for the vector database.
-
-On the database side, quantization strategies can reduce the footprint and therefore costs; we will cover these in another course.
-
-Certain models can also help in this regard. [Matryoshka Representation Learning (MRL)](https://weaviate.io/blog/openais-matryoshka-embeddings-in-weaviate) models like `jina-embeddings-v2` or `snowflake-arctic-embed-l-v2.0` allow for flexible dimensionality reduction by simply truncating the vector. For example, `snowflake-arctic-embed-l-v2.0` can be truncated from its original `1024` dimensions to `256` dimensions, reducing its size to a quarter without much loss in performance.
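
As a rough illustration of truncation, the sketch below keeps the first 256 values of a 1024-dimensional vector. The helper `truncate_embedding` is hypothetical, the random vector only stands in for real model output, and re-normalizing to unit length after truncation is an assumption that suits cosine or dot-product similarity. Check the model's documentation for its recommended procedure.

```python
import math
import random


def truncate_embedding(vector, target_dims):
    """Keep the first `target_dims` values, then rescale to unit length."""
    truncated = list(vector[:target_dims])
    norm = math.sqrt(sum(value * value for value in truncated))
    if norm == 0:
        return truncated  # Nothing to rescale for an all-zero vector
    return [value / norm for value in truncated]


# Stand-in for a 1024-dimensional embedding returned by an MRL model.
random.seed(0)
full_vector = [random.gauss(0, 1) for _ in range(1024)]

short_vector = truncate_embedding(full_vector, 256)
print(len(full_vector), "->", len(short_vector))
```

Truncation of this kind is only meaningful for models trained with MRL; for other models, the leading dimensions are not guaranteed to carry most of the information.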
-
-:::tip
-
-Consider how large your dataset is likely to grow, then select your model accordingly, keeping the resulting system requirements in mind. If the requirements turn out to be too high for your budget, you may be set back to square one when you need to scale up and go to production.
-
-:::
-
-###  Model size and inference speed
-
-Model size directly impacts inference speed, which is critical for applications with latency requirements. Larger models generally offer better embedding quality, but at the cost of increased computational demands.
-
-When screening models, consider these aspects:
-
-| Factor | Implications |
-| --- | --- |
-| Parameter count | More parameters typically mean better quality but slower inference and higher memory usage |
-| Architecture efficiency | Some models are optimized for faster inference despite their size |
-| Hardware requirements | Larger models may require specialized hardware (GPUs/TPUs) |
-
-:::tip
-
-Inference speed is a function of the model, the inference hardware, and network latency. Review these factors as a system when screening models for suitability.
-
-:::
-
-###  Pricing, availability, and licensing
-
-The practical aspects of model adoption extend beyond technical considerations.
-
-Providers offer various pricing structures:
-
-- **API-based pricing**: Pay-per-token (OpenAI, Cohere)
-- **Compute-based pricing**: Based on hardware utilization (cloud providers)
-- **Tiered licensing**: Different capabilities at different price points
-- **Open-source**: Free to use, but self-hosting costs apply
-
-The choice of model and inference type also affects availability:
-
-- **Geographic availability**: Some providers don't operate in all regions
-- **SLA guarantees**: Uptime commitments and support levels
-- **Rate limiting**: Constraints on throughput that may affect your application
-- **Version stability**: How frequently models are deprecated or updated
-
-Additionally, licensing terms vary significantly:
-
-- **Commercial use restrictions**: Some open models prohibit commercial applications
-- **Data usage policies**: How your data may be used by the provider
-- **Export restrictions**: Compliance with regional regulations
-- **Deployment flexibility**: Whether the model can be deployed on-premises or edge devices
-
-Always review the specific terms for each model. For example, while models like CLIP are openly available, they may have usage restrictions that affect your application.
-
-:::tip
-
-These practical considerations can sometimes outweigh performance benefits. A slightly less accurate model with favorable licensing terms and lower costs might be preferable for many production applications.
-
-:::
-
-###  Creating your candidate shortlist
-
-After considering these factors, you can create a prioritized shortlist of models to evaluate in more detail. A good approach is to include a mix of:
-
-1. **Benchmark leaders**: High-performing models on standard metrics
-2. **Resource-efficient options**: Models with smaller footprints or faster inference
-3. **Specialized models**: Models that might be particularly well-suited to your domain
-4. **Different architectures**: Including diverse approaches increases the chance of finding a good fit
-
-Aim for 3-5 models in your initial shortlist for detailed evaluation. Including too many models can make the evaluation process unwieldy and time-consuming.
-
-In the next section, we'll explore how to perform detailed evaluations of these candidate models to determine which one best meets your specific requirements.
-
-## Questions and feedback
-
-import DocsFeedback from '/_includes/docs-feedback.mdx';
-
-The next image includes tasks that use more specialized domain data. These benchmarks span a range of areas, including legal, medical, programming, and government data.
-
-The chart shows that certain embedding models at the top of the table, such as `gemini-embedding-exp-03-07`, perform quite well across the board compared to the others. But this doesn’t tell the whole story, as a given model often outperforms its average score in particular tasks.
-
-For example, the `snowflake-arctic-embed` models perform very well in the `LEMBPasskeyRetrieval` task, which is designed to test recall of specific text buried in a longer document. And Cohere’s `Cohere-embed-multilingual-v3.0` performs quite well in the MIRACL task, which is a highly multilingual task.
-
-In fact, it is interesting to note that even though we are looking at MTEB’s multilingual task set, it includes a number of tasks with an English-only (or English-majority) corpus.
-
-So, you may benefit from deriving your own metric that blends these task scores, based on how well each specific task corresponds to your needs.
-
-You might consider:
-
-1. **Task relevance**: Does the task match your use case?
-2. **Data distribution**: Does the data represent your domain?
-3. **Metric relevance**: Are the reported metrics aligned with your requirements?
-4. **Recency**: Are the results recent enough to reflect current model capabilities?
-5. **Fairness**: Were all models evaluated under comparable conditions?
-
-For example, if you know that your data will include a blend of languages, you may weight the multilingual datasets more heavily than the monolingual datasets. The same applies to domain-specific data, such as legal, medical, or programming data.
-
-The resulting score may be different from the official blended number, but may be more relevant to your particular use case.
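
One simple way to derive such a metric is a weighted average of per-task scores. The sketch below is only an illustration: the task names, weights, and scores are hypothetical placeholders rather than real benchmark results, and `blended_score` is not part of any benchmark tooling.

```python
def blended_score(task_scores, task_weights):
    """Weighted average of per-task scores, using only the weighted tasks."""
    total_weight = sum(task_weights.values())
    weighted_sum = sum(task_scores[task] * weight for task, weight in task_weights.items())
    return weighted_sum / total_weight


# Hypothetical weights: multilingual retrieval matters most for this use case.
weights = {"multilingual_retrieval": 3.0, "english_retrieval": 1.0, "legal_retrieval": 2.0}

# Hypothetical per-task scores for two candidate models (not real results).
candidate_scores = {
    "model_a": {"multilingual_retrieval": 0.60, "english_retrieval": 0.70, "legal_retrieval": 0.50},
    "model_b": {"multilingual_retrieval": 0.50, "english_retrieval": 0.80, "legal_retrieval": 0.55},
}

for model_name, scores in candidate_scores.items():
    print(model_name, round(blended_score(scores, weights), 3))
```

With these placeholder numbers, the model with the lower English score can still come out ahead, because the weights reflect what matters for the use case rather than an unweighted average.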
-
-###  Limitations of standard benchmarks
-
-These third-party benchmarks are very useful, but there are a few limitations that you should keep in mind. The two main limitations are data leakage and correlation to your needs.
-
-**Data leakage**
-
-Because these benchmarks are publicly available, there is a risk that some of the benchmark data ends up in the training data used to build models. This can happen for a number of reasons, especially because there is simply so much data being used in the training process.
-
-This means that the benchmark result may not be a fair representation of the model’s capabilities, as the model is “remembering” the training data.
-
-**Correlation to your needs**
-
-Another limitation is that standard benchmarks may not accurately reflect your needs. As you saw, you can aim to find a benchmark that is as close as possible to your actual use case. But it is unlikely that the task, data distribution and metrics will be fully aligned with your needs.
-
-**Mitigation**
-
-As a result, it is important to take these standard benchmarks with a grain of salt. To get further signals, a good complementary exercise is to run your own benchmarks, which we will look at in the next section.
-
-##  Model evaluation: custom benchmarks
-
-
-While standard benchmarks provide valuable reference points, creating your own custom evaluation can be a fantastic complementary tool to address their limitations.
-
-Running your own benchmark can sound quite intimidating, especially given how extensive benchmarks such as MTEB are. But it doesn’t need to be. You can do this by following these steps:
-
-###  Set benchmark objectives
-
-By now, you should have an idea of the gaps in your knowledge, as well as the tasks you need to support. Your objective might be to answer a question such as:
-
-- Which model best retrieves the appropriate related customer reviews about coffee, written primarily in English, French, and Korean?
-- Does any model work well across code retrieval in Python and Golang for back-end web code chunks, as well as related documentation snippets?
-
-Design the custom benchmark to answer specific questions like these.
-
-###  Determine metrics to use
-
-Once the overall goals are defined, the corresponding metrics can follow.
-
-For example, retrieval performance is commonly measured by one or more of precision, recall, MAP, MRR, and NDCG.
-
-Each of these measures a slightly different aspect of retrieval performance. However, NDCG is a good starting point.
-
-NDCG (normalized discounted cumulative gain) measures the system's ability to correctly sort items based on relevance. Given a query and a dataset with relevance labels for that query, NDCG rewards results that place more relevant items higher in the search results.
-
-It produces a score from 0 to 1, where 0 means that no relevant items were retrieved, and 1 means that the most relevant items were all retrieved and ordered correctly.
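
A minimal sketch of NDCG@k is shown below. It assumes graded relevance labels (higher means more relevant), uses the common linear-gain form with a `log2(rank + 1)` discount, and the function names are illustrative. Other formulations, such as exponential gain, also exist.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(retrieved_relevances, all_relevances, k):
    """NDCG@k: DCG of the top-k results divided by the best achievable DCG."""
    ideal_dcg = dcg(sorted(all_relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # No relevant items exist for this query
    return dcg(retrieved_relevances[:k]) / ideal_dcg


# Relevance labels of every labeled document for one query.
labels = [3, 2, 1, 0]

print(ndcg_at_k([3, 2, 1], labels, k=3))  # Ideal ordering
print(ndcg_at_k([1, 2, 3], labels, k=3))  # Same items, reversed order
print(ndcg_at_k([0, 0, 0], labels, k=3))  # Nothing relevant retrieved
```

The first call represents a perfect ranking and the last a complete miss; the reversed ordering lands in between because the most relevant item is pushed down the list.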
-
-###  Curate a benchmark dataset
-
-A suitable dataset is critical for a benchmark to be meaningful. While such a dataset may already exist, it is common to build or reshape a dataset to suit the benchmark goals and metrics.
-
-The dataset should aim to:
-
-- Reflect the retrieval task
-- Reflect the task difficulty
-- Capture the data distribution
-- Include sufficient volume
-
-This may be the most time-consuming part of the process. However, a pragmatic approach can help to make it manageable. A benchmark with as few as 20 objects and a handful of queries can produce meaningful results.
-
-###  Run benchmark
-
-At this point, run the benchmark using your candidate models.
-
-As with many other scientific projects, reproducibility and consistency are key here. Keep in mind that you may come back to the benchmark later to assess new models, or with updated knowledge about your needs.
-
-In programming terms, you might modularize aspects such as embedding creation, dataset loading, metric evaluation, and result presentation.
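
One possible shape for such a modular harness is sketched below. Everything in it is an assumption for illustration: `toy_embedder` is a keyword-count stand-in for a real embedding model, the documents and queries are made up, and recall@k is used as the metric only to keep the example short.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(embed, query, documents):
    """Embedding creation + retrieval: return document IDs, most similar first."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, embed(text)), doc_id)
              for doc_id, text in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


def recall_at_k(ranked_ids, relevant_ids, k):
    """Metric evaluation: fraction of relevant documents found in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def run_benchmark(models, documents, queries, k):
    """Run every candidate model over the same dataset and average the metric."""
    results = {}
    for name, embed in models.items():
        scores = [recall_at_k(rank_documents(embed, query, documents), relevant, k)
                  for query, relevant in queries.items()]
        results[name] = sum(scores) / len(scores)
    return results


# Hypothetical stand-in embedder: counts a few keywords instead of calling a model.
VOCAB = ["coffee", "espresso", "tea", "bitter"]


def toy_embedder(text):
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


documents = {"d1": "great espresso coffee", "d2": "green tea"}
queries = {"coffee espresso": ["d1"]}  # query -> IDs of relevant documents

results = run_benchmark({"toy": toy_embedder}, documents, queries, k=1)
print(results)
```

Keeping each concern in its own function means a new candidate model only needs a new entry in the `models` dictionary, and a different metric only needs a replacement for `recall_at_k`.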
-
-###  Evaluate the results
-
-Once the benchmarks are run, it is important to assess the results using quantitative (e.g. NDCG@k numbers) and qualitative (e.g. which objects were retrieved where) means.
-
-The quantitative results will produce a definitive ranking that you can use, for example to order the models. However, this ranking is subject to many factors, such as the dataset composition and the metric being used.
-
-The qualitative results may provide more important insights, such as patterns of failure. For example, you may see an embedding model:
-
-- Regularly fail to retrieve certain types of objects, such as shorter but very relevant text, favoring longer ones.
-- Perform better with positively phrased text than with text containing negation.
-- Struggle with your domain-specific jargon.
-- Work well with English and Mandarin Chinese, but not so well with Hungarian, which may be a key language for your data.
-
-To some extent, these insights may only be discoverable by those with domain familiarity, or those with context on the system being built. Accordingly, qualitative assessment is critically important.
-
-## Questions and feedback
-
-import DocsFeedback from '/_includes/docs-feedback.mdx';
-
-- Check out the Starter guide: retrieval augmented generation, and the Weaviate Academy unit on chunking. + Check out the Starter guide: retrieval augmented generation.
- {props.body} -
-{cardData[k].body}
-
+            ⚠️ Academy course not found: {courseId}
+          
+ Available IDs: {courses.map((c) => c.id).join(", ")} +
+{description}
+ + + {buttonText} + + +