# About the vision features

### Visual features

Once you've initialized an `ImageAnalysisClient`, you need to select one or more visual features to analyze. The options are specified by the enum class `VisualFeatures`. The following features are supported:

1. `VisualFeatures.CAPTION` ([Examples](#generate-an-image-caption-for-an-image-file) | [Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Generate a human-readable sentence that describes the content of an image.
1. `VisualFeatures.READ` ([Examples](#extract-text-from-an-image-file) | [Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Also known as Optical Character Recognition (OCR). Extract printed or handwritten text from images. **Note**: For extracting text from PDF, Office, and HTML documents and document images, use the Document Intelligence service with the [Read model](https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-read?view=doc-intel-4.0.0). This model is optimized for text-heavy digital and scanned documents with an asynchronous REST API that makes it easy to power your intelligent document processing scenarios. This service is separate from the Image Analysis service and has its own SDK.
1. `VisualFeatures.DENSE_CAPTIONS` ([Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Dense Captions provides more details by generating one-sentence captions for up to 10 different regions in the image, including one for the whole image. 
1. `VisualFeatures.TAGS` ([Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Extract content tags for thousands of recognizable objects, living beings, scenery, and actions that appear in images.
1. `VisualFeatures.OBJECTS` ([Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Object detection. This is similar to tagging, but focused on detecting physical objects in the image and returning their location.
1. `VisualFeatures.SMART_CROPS` ([Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Find a representative sub-region of the image for thumbnail generation, with priority given to including faces.
1. `VisualFeatures.PEOPLE` ([Samples](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis/samples)): Detect people in the image and return their location.

For more information about these features, see the [Image Analysis overview](https://learn.microsoft.com/azure/ai-services/computer-vision/overview-image-analysis?tabs=4-0) and the [Concepts](https://learn.microsoft.com/azure/ai-services/computer-vision/concept-tag-images-40) page.
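
Feature selections often come from configuration rather than being hard-coded. The following is a minimal sketch of turning a comma-separated string into a list of enum members. `parse_features` is a hypothetical helper, not part of the SDK, and `DemoFeatures` is a stand-in enum that mirrors a few of the names above so the block runs without the SDK installed. In this notebook you would presumably pass `VisualFeatures` as `enum_cls` instead.

```python
from enum import Enum


class DemoFeatures(Enum):
    """Stand-in for VisualFeatures (assumption: used only so this sketch runs offline)."""
    CAPTION = "caption"
    TAGS = "tags"
    DENSE_CAPTIONS = "denseCaptions"


def parse_features(names, enum_cls):
    """Map a comma-separated string of member names to enum members.

    Order is preserved, duplicates are dropped, and unknown names raise ValueError.
    """
    selected = []
    for raw in names.split(","):
        key = raw.strip().upper()
        if not key:
            continue  # ignore empty entries such as a trailing comma
        try:
            member = enum_cls[key]
        except KeyError:
            valid = ", ".join(m.name for m in enum_cls)
            raise ValueError(
                f"Unknown feature {raw.strip()!r}; expected one of: {valid}"
            ) from None
        if member not in selected:
            selected.append(member)
    return selected


features = parse_features("caption, tags, caption", DemoFeatures)
print([f.name for f in features])
```

Validating names up front means a typo in the configuration surfaces as a clear error before any request is sent.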

### Reference:
* https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/vision/azure-ai-vision-imageanalysis#visual-features

In [1]:
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential
import requests
import json

from dotenv import dotenv_values

config = {
    **dotenv_values("./envs/aivision.env")
}


In [2]:
vision_endpoint = config["AZURE_VISION_ENDPOINT"]
vision_key = config["AZURE_VISION_KEY"]

vision_client = ImageAnalysisClient(endpoint=vision_endpoint, credential=AzureKeyCredential(vision_key))

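def get_multimodal_embedding(image_url, conf_threshold=0.6):
    """Download an image and return its caption, tags, and dense captions.

    Tags and dense captions with confidence below conf_threshold are dropped.
    """
    # Fail fast on HTTP errors instead of sending an error page to the service
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    image_data = response.content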

    result = vision_client.analyze(
        image_data=image_data,
        visual_features=[VisualFeatures.DENSE_CAPTIONS, VisualFeatures.TAGS, VisualFeatures.CAPTION],
        # model_version="2023-10-01",
        # model_version="latest"
    )
    # debug output of the result structure
    # print(json.dumps(result.as_dict(), indent=4))

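    # The features requested above do not include an embedding, so this is
    # expected to stay None; image vectors come from the separate multimodal
    # embeddings API rather than from analyze().
    embedding = getattr(result, "embedding", None)
    embedding_vector = embedding.vector if embedding is not None else None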
    
    # Keep only tags and dense captions that meet the confidence threshold
    tags = [tag.name for tag in result.tags.list if tag.confidence >= conf_threshold] if result.tags else []
    dens_captions = [caption.text for caption in result.dense_captions.list if caption.confidence >= conf_threshold] if result.dense_captions else []

    # Return the caption text only, without its confidence score
    caption = result.caption.text if result.caption else ""

    return {
        "embedding": embedding_vector,
        "tags": tags,
        "dens_captions": dens_captions,
        "caption": caption  
    }
    
    # return result
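
To see what `conf_threshold` does in isolation, here is a minimal sketch of the same filtering step applied to stand-in objects. `keep_confident` is a hypothetical helper, and the sample names and confidence values are made up for illustration; they are not service output.

```python
from types import SimpleNamespace


def keep_confident(items, attr, threshold):
    """Return getattr(item, attr) for every item whose confidence is >= threshold."""
    return [getattr(item, attr) for item in items if item.confidence >= threshold]


# Made-up sample values standing in for entries of result.tags.list
sample_tags = [
    SimpleNamespace(name="bird", confidence=0.99),
    SimpleNamespace(name="branch", confidence=0.72),
    SimpleNamespace(name="outdoor", confidence=0.55),
]

print(keep_confident(sample_tags, "name", 0.6))  # ['bird', 'branch']
print(keep_confident(sample_tags, "name", 0.8))  # ['bird']
```

Raising the threshold trades recall for precision: fewer tags survive, but the ones that do are those the model is most certain about.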

In [3]:
image_url = "https://images.pexels.com/photos/1661179/pexels-photo-1661179.jpeg?cs=srgb&dl=pexels-roshan-kamath-1661179.jpg&fm=jpg"

result = get_multimodal_embedding(image_url, conf_threshold=0.8)
# print("Multimodal embedding:", result["embedding"])

result

{'embedding': None,
 'tags': ['bird',
  'animal',
  'parrot',
  'parakeet',
  'budgie',
  'perched',
  'feather',
  'green'],
 'dens_captions': ['a close up of a bird',
  "a close up of a parrot's beak",
  "a close-up of a bird's foot",
  "a close up of a bird's face",
  'a close up of a bird'],
 'caption': 'a green bird with red beak'}
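
The `embedding` field is `None` because the analysis call above does not produce image vectors. As a rough sketch, and assuming the Azure AI Vision multimodal embeddings REST operation (`retrieval:vectorizeImage`) is available on the same resource, a request could be assembled as below. `build_vectorize_request` is a hypothetical helper, and the `api-version` and `model-version` defaults are assumptions that should be checked against the current service documentation. The block only builds the request; the network call is left commented out.

```python
def build_vectorize_request(endpoint, key, image_url,
                            api_version="2024-02-01",
                            model_version="2023-04-15"):
    """Assemble (but do not send) keyword arguments for requests.post().

    The path, version strings, and header name are assumptions based on the
    multimodal embeddings REST API; verify them before use.
    """
    return {
        "url": endpoint.rstrip("/") + "/computervision/retrieval:vectorizeImage",
        "params": {"api-version": api_version, "model-version": model_version},
        "headers": {
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        "json": {"url": image_url},
    }


request_kwargs = build_vectorize_request(
    "https://example.cognitiveservices.azure.com/",  # placeholder endpoint
    "<your-key>",                                     # placeholder key
    "https://example.com/bird.jpg",                   # placeholder image URL
)
print(request_kwargs["url"])

# To send it (requires network access and a real endpoint and key):
# import requests
# response = requests.post(timeout=30, **request_kwargs)
# response.raise_for_status()
# vector = response.json()["vector"]
```

Keeping request construction separate from sending makes the call easy to inspect and test without credentials.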

In [4]:
# Print the result as JSON. It is already a plain dict, so no as_dict() call is needed.
# print(json.dumps(result, indent=4))