# Overview

Azure AI Search is a cloud-based solution for *indexing* and *querying* a wide range of data sources.

AI Search can extract data from structured, semi-structured, and unstructured documents and other sources.

Use cases:
- Implementing an enterprise search solution.
- Supporting retrieval-augmented generation (RAG), where AI Search acts as a vector store.
- Creating knowledge mining solutions for data analytics.

# How it works

- First, the data or documents are stored somewhere, for example in Azure Blob Storage, a database, or another data store.
- The **Indexer** connects to the data source, then extracts and indexes the data *fields* through an *enrichment pipeline*. The first stage of this pipeline, which opens each document and extracts its content, is called *document cracking*.
- After *document cracking*, the pipeline applies *skills* to enrich the extracted data. Which skills run, and how they are chained together, is defined in a *skillset*.
- The *fields* in the final result can be mapped to index fields in two ways:
    - Fields extracted directly from the source data are all mapped to index fields. The mapping can be *implicit* (automatically mapped to index fields with the same names) or *explicit* (defined to map to index fields with different names, or to apply a function to the value as it is mapped).
    - Output fields produced by skills are mapped explicitly from their hierarchical location in the enriched result to the target index field.
- The result populates an **Index**, which can then be queried.
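The two kinds of mapping can be sketched as an indexer definition. This is a minimal illustration written as a Python dictionary shaped like the JSON body of the REST API; the resource names (`hotel-docs-indexer`, `hotel-docs-blob`, and so on), the `id` and `key_phrases` index fields, and the helper `explicit_targets` are assumptions made for the example, not part of the original notes.

```python
import json

# Hypothetical indexer definition (all names are made up for illustration).
indexer_definition = {
    "name": "hotel-docs-indexer",
    "dataSourceName": "hotel-docs-blob",
    "targetIndexName": "hotel-docs-index",
    "skillsetName": "hotel-docs-skillset",
    # Explicit mapping of a source field to an index field with a different
    # name, applying a function to the value as it is mapped.
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "id",
            "mappingFunction": {"name": "base64Encode"},
        },
    ],
    # Mapping of a skill output, addressed by its hierarchical location in the
    # enriched document, to a target index field.
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/keyPhrases",
            "targetFieldName": "key_phrases",
        },
    ],
}


def explicit_targets(definition):
    """Return the index fields that receive an explicitly mapped value."""
    mappings = definition.get("fieldMappings", []) + definition.get(
        "outputFieldMappings", []
    )
    return [mapping["targetFieldName"] for mapping in mappings]


print(json.dumps(indexer_definition, indent=2))
print(explicit_targets(indexer_definition))
```

Source fields that are not listed under `fieldMappings` would rely on the implicit, same-name mapping described above.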


# Skills

The indexer uses skillsets to enrich the extracted data.

## Built-in skills

Built-in skills expose functionality from other Azure AI services. The Azure AI services resource must be in the same region as AI Search and attached to it as a resource.

**Azure AI resource skills**:
- `Microsoft.Skills.Text.LanguageDetectionSkill`: Detecting the language that text is written in.
- `Microsoft.Skills.Text.V3.EntityRecognitionSkill`: Detecting and extracting people, locations, organizations, and other entities in the text.
- `Microsoft.Skills.Text.KeyPhraseExtractionSkill`: Determining and extracting key phrases within a body of text.
- `Microsoft.Skills.Text.TranslationSkill`: Translating text.
- `Microsoft.Skills.Text.PIIDetectionSkill`: Identifying and extracting (or removing) personally identifiable information (PII) within the text.
- `Microsoft.Skills.Vision.OcrSkill`: Extracting text from images.
- `Microsoft.Skills.Vision.ImageAnalysisSkill`: Generating captions and tags to describe images.
- `Microsoft.Skills.Text.CustomEntityLookupSkill`: Looking up text from a custom, user-defined list of words and phrases.
- `Microsoft.Skills.Text.V3.EntityLinkingSkill`: Generating links from recognized entities to articles in Wikipedia.
- `Microsoft.Skills.Text.V3.SentimentSkill`: Assigning sentiment labels such as "negative", "neutral", and "positive".
- `Microsoft.Skills.Vision.VectorizeSkill`: Multimodal image and text vectorization.
- `Microsoft.Skills.Util.DocumentIntelligenceLayoutSkill`: Extracting content and layout from documents to speed up information extraction.

**Azure OpenAI skills**:
- `Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill`: Connects to a deployed embedding model on Azure OpenAI for integrated vectorization.

**Utility skills**:
- `Microsoft.Skills.Util.ConditionalSkill`: Allows filtering, assigning a default value, and merging data based on a condition.
- `Microsoft.Skills.Util.DocumentExtractionSkill`: Extracts content from a file within the enrichment pipeline.
- `Microsoft.Skills.Text.MergeSkill`: Consolidates text from a collection of fields into a single field.
- `Microsoft.Skills.Util.ShaperSkill`: Maps output to a complex type.
- `Microsoft.Skills.Text.SplitSkill`: Splits text into pages so that you can enrich or augment content incrementally.

**Custom skills**:
- `Microsoft.Skills.Custom.WebApiSkill`: Allows making an HTTP call into a custom Web API.
- `Microsoft.Skills.Custom.AmlSkill`: Allows extensibility with an Azure Machine Learning model.
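To make the relationship between skills and a skillset more concrete, here is a rough sketch of a skillset that chains two of the built-in skills listed above. It is a Python dictionary shaped like the REST API's JSON body; the skillset name, the `targetName` values, and the helper `output_paths` are assumptions for illustration only.

```python
# Hypothetical skillset definition (names are made up for illustration).
skillset_definition = {
    "name": "hotel-docs-skillset",
    "description": "Detect the language, then extract key phrases.",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "languageCode", "targetName": "language"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [
                {"name": "text", "source": "/document/content"},
                # Consumes the output of the language detection skill.
                {"name": "languageCode", "source": "/document/language"},
            ],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
    ],
}


def output_paths(skillset):
    """List the enriched-document paths that the skills write to."""
    paths = []
    for skill in skillset["skills"]:
        for output in skill["outputs"]:
            paths.append(f"{skill['context']}/{output['targetName']}")
    return paths


print(output_paths(skillset_definition))
```

The paths returned by `output_paths` are the hierarchical locations that an indexer's output field mappings would refer to.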

# Index searching

Each index field has these attributes that can be configured:
- `key`: Defines the unique key for index records.
- `searchable`: Can be queried using full-text search.
- `filterable`: Can be used in filter expressions.
- `sortable`: Can be used to order the results.
- `facetable`: Can be used to determine values for facets (user interface elements that filter the results based on a list of known field values).
- `retrievable`: Can be included in search results.
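As a rough illustration of how these attributes appear in an index schema, the sketch below defines a few fields as a Python dictionary shaped like the REST API's JSON body. The index name, the `id` field, and the helper `fields_with` are hypothetical; `author`, `last_modified`, and `locations` echo field names used elsewhere in these notes.

```python
# Hypothetical index schema (field choices are assumptions for illustration).
index_definition = {
    "name": "hotel-docs-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True,
         "searchable": False, "retrievable": True},
        {"name": "content", "type": "Edm.String",
         "searchable": True, "retrievable": True},
        {"name": "author", "type": "Edm.String", "searchable": True,
         "filterable": True, "facetable": True, "retrievable": True},
        {"name": "last_modified", "type": "Edm.DateTimeOffset",
         "filterable": True, "sortable": True, "retrievable": True},
        {"name": "locations", "type": "Collection(Edm.String)",
         "searchable": True, "filterable": True, "retrievable": True},
    ],
}


def fields_with(definition, attribute):
    """Return the names of fields that have the given attribute enabled."""
    return [
        field["name"]
        for field in definition["fields"]
        if field.get(attribute, False)
    ]


print("key:", fields_with(index_definition, "key"))
print("sortable:", fields_with(index_definition, "sortable"))
print("facetable:", fields_with(index_definition, "facetable"))
```

A helper like this can be handy for checking, before writing a query, that a field used in `$filter` or `$orderby` actually has the matching attribute.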

## Full-text search
Azure AI Search supports two variants of the Lucene query syntax:
- *Simple*: basic searches that match literal query terms.
- *Full*: supports complex filtering, regular expressions, and other more sophisticated queries.

When submitting a query, client apps can specify these parameters:
- `search`: A search expression.
- `queryType`: The Lucene syntax to be evaluated (`simple` or `full`).
- `searchFields`: The index fields to be searched.
- `select`: The fields to be included in the results.
- `searchMode`: Criteria for including results based on multiple search terms (`Any` or `All`). With `Any`, a document matches if it contains any of the terms. With `All`, a document matches only if it contains all of the terms.
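The parameters above can be assembled into a request body. The following is a minimal sketch, assuming the JSON body shape of a REST search request in which `searchFields` and `select` are comma-separated strings; the function `build_search_request` is a hypothetical helper, not part of any SDK.

```python
VALID_QUERY_TYPES = {"simple", "full"}
VALID_SEARCH_MODES = {"any", "all"}


def build_search_request(search, query_type="simple", search_fields=None,
                         select=None, search_mode="any"):
    """Assemble a search request body from the parameters described above."""
    if query_type not in VALID_QUERY_TYPES:
        raise ValueError(f"queryType must be one of {sorted(VALID_QUERY_TYPES)}")
    if search_mode not in VALID_SEARCH_MODES:
        raise ValueError(f"searchMode must be one of {sorted(VALID_SEARCH_MODES)}")

    body = {"search": search, "queryType": query_type, "searchMode": search_mode}
    if search_fields:
        # Restrict the search to specific index fields.
        body["searchFields"] = ",".join(search_fields)
    if select:
        # Restrict which fields are returned in the results.
        body["select"] = ",".join(select)
    return body


request_body = build_search_request(
    "Buckingham Hotel",
    search_fields=["locations", "organizations"],
    select=["metadata_storage_name", "locations"],
    search_mode="all",
)
print(request_body)
```

Validating `queryType` and `searchMode` up front is a design choice of this sketch; it surfaces a typo locally instead of as an error response from the service.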

Query processing consists of four stages:
- *Query parsing*: The search expression is evaluated and reconstructed as a tree of appropriate subqueries.
- *Lexical analysis*: The query terms are analyzed and refined based on linguistic rules (conversion to root form, splitting composite words into their constituent parts, removal of stopwords).
- *Document retrieval*: The query terms are matched against the indexed terms, and the set of matching documents is identified.
- *Scoring*: A relevance score based on term frequency/inverse document frequency (TF/IDF) is assigned to each document in the result.
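To give a feel for the scoring stage, here is a toy TF/IDF calculation in plain Python. It is a simplified illustration of the general idea, not the service's actual scoring implementation; the tokenization (lowercase, split on whitespace), the smoothing term in the IDF, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document by summing TF * IDF over the query terms."""
    tokenized = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Document frequency: how many documents contain the term.
            doc_freq = sum(1 for other in tokenized.values() if term in other)
            if doc_freq == 0 or not tokens:
                continue
            term_freq = counts[term] / len(tokens)
            # Rarer terms get a higher weight; the +1 keeps the weight
            # positive even for terms that appear in every document.
            inverse_doc_freq = math.log(n_docs / doc_freq) + 1
            score += term_freq * inverse_doc_freq
        scores[name] = score
    return scores


docs = {
    "a": "london hotel near london bridge",
    "b": "hotel in paris",
    "c": "a quiet cottage",
}
scores = tf_idf_scores(["london", "hotel"], docs)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

In this toy data, document `a` ranks first because it contains both terms and the rarer term `london` twice, while `c` contains neither term and scores zero.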

## Filtering and sorting

Filtering within a simple search expression:
```
search=London+author='Reviewer' # find documents containing the text London that have an author field value of Reviewer
queryType=Simple
```

Using an OData filter in a $filter parameter with a full Lucene search expression:
```
search=London
$filter=author eq 'Reviewer'
queryType=Full
```

Filtering with facet:
```
search=*
$filter=author eq 'selected-facet-value-here' # the `author` field should be facetable
```

Sorting:
```
search=*
$orderby=last_modified desc # or `asc`; the `last_modified` field should be sortable
```
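The filter, facet, and sort parameters above can also be built programmatically. The sketch below is a hedged example: `odata_eq` and `build_filtered_request` are hypothetical helpers, and the body keys (`filter`, `orderby`, `facets`) assume the JSON shape of a REST search request. One detail worth handling is that OData string literals escape a single quote by doubling it.

```python
def odata_eq(field, value):
    """Build an OData equality filter, escaping single quotes in the value."""
    escaped = value.replace("'", "''")
    return f"{field} eq '{escaped}'"


def build_filtered_request(search="*", filters=None, order_by=None, facets=None):
    """Assemble a request body with optional filter, sort, and facet settings."""
    body = {"search": search}
    if filters:
        # Combine several filter clauses so that all of them must hold.
        body["filter"] = " and ".join(filters)
    if order_by:
        body["orderby"] = ", ".join(order_by)
    if facets:
        body["facets"] = list(facets)
    return body


body = build_filtered_request(
    search="London",
    filters=[odata_eq("author", "Reviewer")],
    order_by=["last_modified desc"],
    facets=["author"],
)
print(body)
print(odata_eq("author", "O'Brien"))
```

As in the query examples above, the fields used here would need to be `filterable`, `sortable`, and `facetable` respectively in the index.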

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from dotenv import load_dotenv

# Load the service settings from a local .env file.
load_dotenv()

AI_SEARCH_ENDPOINT = os.getenv('AI_SEARCH_ENDPOINT')
AI_SEARCH_KEY = os.getenv('AI_SEARCH_KEY')
AI_SEARCH_INDEX_NAME = os.getenv('AI_SEARCH_INDEX_NAME')

# Client for querying a single index.
search_client = SearchClient(AI_SEARCH_ENDPOINT, AI_SEARCH_INDEX_NAME, AzureKeyCredential(AI_SEARCH_KEY))
```

```python
query_text = "Buckingham Hotel"

# Simple query syntax; search_mode="all" requires every term to match.
found_documents = search_client.search(
    search_text=query_text,
    select=["metadata_storage_name", "locations", "people", "language", "organizations"],
    include_total_count=True,
    query_type="simple",
    search_mode="all",
)

# Print the selected fields of each matching document.
# Single quotes inside the f-string keep it valid on Python versions before 3.12.
for document in found_documents:
    print(f"\nDocument: {document['metadata_storage_name']}")
    print(" - Locations:", document["locations"])
    print(" - People:", document["people"])
    print(" - Organizations:", document["organizations"])
```


```
Document: 201811.pdf
 - Locations: ['property', 'bedroom', 'suite', 'eat in', 'kitchen', 'Buckingham Palace', 'living room', 'London', 'apartment', 'hotel', 'Westminster', 'home']
 - People: ['Moises Eads']
 - Organizations: ['Buckingham Hotel']

Document: London Brochure.pdf
 - Locations: ['London', 'city', 'England', 'United', 'River Thames', 'Great', 'Britain', 'Romans', 'Middlesex', 'Essex', 'Surrey', 'Kent', 'Hertfordshire', 'Greater London', 'London Hotels', 'Buckingham Hotel', 'hotel', 'Buckingham Palace', 'Regent’s Park', 'Trafalgar', 'The City Hotel', 'Tower Bridge', 'Tower of London', 'Kensington Hotel', 'Earl’s Court']
 - People: ['Margie']
 - Organizations: ['London Assembly', 'Culture', 'Margie’s Travel', 'Buckingham Hotel']

Document: 201803.pdf
 - Locations: ['Hotel', 'rooms', 'courtyard', 'Indian', 'West coast', 'kitchen', 'lounge', 'bedroom', 'bathroom']
 - People: ['Akiko Kaneko']
 - Organizations: ['Buckingham Hotel']

Document: 201802.pdf
 - Locations: ['suite', 'b
```

More AI Search samples using the SDK can be found at: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/samples/sample_vector_search.py