# Using Couchbase with Azure OpenAI

Microsoft's **Azure OpenAI** integrates OpenAI's advanced artificial intelligence models into the Azure platform. Azure OpenAI provides a scalable and secure environment to run these powerful models, making it easier for organizations to integrate AI capabilities into their applications and services. Users can access pre-trained models or customize them to fit specific needs.

This notebook provides an example of using Azure OpenAI with Couchbase as a vector database. We will embed the file 'CouchbaseWhitepaper.pdf' with Azure OpenAI, load and index it into Couchbase, and perform a semantic search on its contents.

## Prerequisites

Install the required packages:

In [None]:
!pip install couchbase openai pypdf

## Reading Sample Data

In [None]:
from pypdf import PdfReader

def read_file(file_path):
    reader = PdfReader(file_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

In [None]:
text = read_file("CouchbaseWhitepaper.pdf")

## Chunking the Text into Paragraphs

We first need to break the text into searchable chunks. Here, the text is split into sentences at each '.' character, and every `n` consecutive sentences are grouped into one chunk.

In [None]:
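def chunk_text(text, n):
    """Split text into chunks of n sentences each."""
    # Split on '.' and restore the period that split() removes
    sentences = [s + "." for s in text.split(".") if s.strip()]
    chunks = []
    # Group every n consecutive sentences; the last chunk may be shorter
    for i in range(0, len(sentences), n):
        chunks.append("".join(sentences[i:i + n]))
    return chunks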

In [None]:
chunks = chunk_text(text,5)
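
To make the chunking behavior concrete, here is a small self-contained sketch that groups sentences `n` at a time on a toy string. The name `chunk_sentences` and the sample text are hypothetical and only serve as an illustration. This variant keeps the periods and the final partial chunk.

```python
def chunk_sentences(text, n):
    """Group the sentences of `text` into chunks of at most `n` sentences."""
    # Split on '.', drop empty fragments, and restore the period on each sentence.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Join every run of n consecutive sentences into one chunk.
    return [" ".join(sentences[i:i + n]) for i in range(0, len(sentences), n)]

sample = "One. Two. Three. Four. Five."
sample_chunks = chunk_sentences(sample, 2)
print(sample_chunks)
```

With five sentences and `n=2`, the last chunk holds the single leftover sentence instead of being discarded.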

## Generating Embeddings

Now, we will initialize Azure OpenAI and use it to generate embeddings for each chunk. These embeddings will store the semantic meaning of each chunk and enable us to perform a semantic similarity search.

**Note:** You can find your Azure key and endpoint on your Azure dashboard. You need to create a deployment with the 'text-embedding-ada-002' model before proceeding with this section of the cookbook.

You can find documentation on how to set this up here: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/deployment-types
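
As a rough illustration of why embeddings enable semantic search, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are made up for the example (real embeddings from this model have 1536 dimensions), and `cosine_similarity` is a hypothetical helper. In this notebook the actual comparison is done by Couchbase, using the similarity metric configured in the search index.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for a query and two stored chunks.
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

A chunk whose vector points in nearly the same direction as the query vector scores close to 1, while an unrelated one scores close to 0.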

In [None]:
def create_embeddings(client, documents):
    embeddings = []
    for doc in documents:
        response = client.embeddings.create(
            input=doc,
            model="<Replace with the name of your deployment>"
        )
        embedding = response.data[0].embedding
        embeddings.append(embedding)
    return embeddings

In [None]:
from openai import AzureOpenAI

apikey = "<Replace with your Azure OpenAI key>"
endpoint = "<Replace with your Azure OpenAI endpoint>"

# See the note below to find the value for api_version
client = AzureOpenAI(
    api_key=apikey,
    api_version="<Replace with the desired API version. Ex: '2024-06-01'>",
    azure_endpoint=endpoint,
)

#Embedding the entire document will take a few moments
print("Generating embeddings...")
embedding_array = create_embeddings(client, chunks)
print("Embeddings complete.")

**Note:** Visit https://learn.microsoft.com/en-us/azure/ai-services/openai/reference to find the latest versions of the API. It's recommended to use the latest version of *Data Plane - Inference*. An incorrect version could lead to an opaque 'Deployment Not Found' error message.
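
The embeddings endpoint also accepts a list of strings as `input`, so several chunks can be sent in one request instead of one request per chunk. The following is a minimal sketch of that idea, not part of the original notebook: `create_embeddings_batched`, `deployment`, and `batch_size` are hypothetical names, and the default of 16 is an arbitrary assumption that should be checked against the limits of your deployment.

```python
def create_embeddings_batched(client, documents, deployment, batch_size=16):
    """Embed `documents` in groups of `batch_size` instead of one request each."""
    embeddings = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=deployment)
        # Sort by `index` so each vector lines up with its input text.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

It could be called as `create_embeddings_batched(client, chunks, "<your deployment name>")` in place of the per-chunk loop.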

Now we can use the original text along with its associated embedding to create documents ready for ingestion into our Couchbase server.

In [None]:
def format_to_dicts(texts, embeddings):
    # Pair each chunk of text with its embedding vector
    documents_to_insert = [
        {
            'text': text,
            'embedding': vector,
        }
        for text, vector in zip(texts, embeddings)
    ]
    return documents_to_insert

In [None]:
docs = format_to_dicts(chunks, embedding_array)

## Initializing Couchbase Connection

**Note:** Before this step, make sure to create a bucket of type Couchbase called "CouchbaseWhitepaper". For information on creating buckets in Couchbase, visit https://docs.couchbase.com/server/current/manage/manage-buckets/create-bucket.html.

In [None]:
from datetime import timedelta

from couchbase.options import ClusterOptions, SearchOptions
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster

#See note below for information about Couchbase credentials
username = "<Replace with your Couchbase username>"
password = "<Replace with your Couchbase password>"
connection_string = "<Replace with your connection string>"

auth = PasswordAuthenticator(username, password)
options = ClusterOptions(auth)

cluster = Cluster(connection_string, options)

#Wait until cluster is ready
cluster.wait_until_ready(timedelta(seconds=5))

bucket = cluster.bucket("CouchbaseWhitepaper")
scope = bucket.scope("_default")
collection = scope.collection("_default")

**Note:** Visit https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html for information on your Couchbase credentials.

## Uploading Documents to Couchbase

In [None]:
import uuid

def batch_insert(docs, collection, batch_size=10):
    for i in range(0,len(docs),batch_size):
        batch = docs[i:i + batch_size]
        docs_with_ids = {}
        for doc in batch:
            docs_with_ids[str(uuid.uuid4())] = doc
        try:
            collection.upsert_multi(docs_with_ids)
        except Exception as e:
            # Report the failure instead of silently discarding the message
            print(f"Encountered exception: {e} while upserting documents.")

In [None]:
batch_insert(docs, collection, 5)
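
Because `uuid.uuid4()` generates a fresh random key on every call, re-running the upload cell stores a second copy of every chunk. One possible way around this, sketched below under the assumption that the chunk text itself identifies a document, is to derive the key from the text with `uuid.uuid5`. The helper `deterministic_id` and the sample documents are hypothetical.

```python
import uuid

def deterministic_id(text):
    """Return the same UUID string every time for the same chunk text."""
    # Any fixed namespace works; NAMESPACE_URL is an arbitrary choice here.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, text))

# Made-up documents standing in for the notebook's `docs` list.
sample_docs = [{"text": "first chunk"}, {"text": "second chunk"}]
docs_with_ids = {deterministic_id(d["text"]): d for d in sample_docs}
print(list(docs_with_ids))
```

With stable keys, a repeated upsert overwrites the existing document rather than adding a duplicate.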

## Creating Couchbase Search Index

Before performing a vector search in Couchbase, it is first required to create an index on the collection containing the desired documents. This can be done through the Python SDK, as in this tutorial, or in the Couchbase UI.

**Note:** Visit https://docs.couchbase.com/server/current/vector-search/create-vector-search-index-ui.html for more information on index creation.

In [None]:
import json

with open('index_parameters.json', 'r') as file:
    index_params = json.load(file)
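
The contents of `index_parameters.json` are not shown in this notebook. As a rough guide only, the sketch below builds a minimal definition of the kind the file might contain, written as a Python dictionary. Everything in it is an assumption to verify against the Couchbase index documentation linked above: the key names follow the JSON that the Couchbase UI typically generates for a vector index, `dims` is set to 1536 to match the output size of text-embedding-ada-002, `dot_product` is one of several similarity options, and the `text` field is marked as stored so that it can be returned in search results.

```python
import json

# Assumed, minimal vector index definition; verify against the Couchbase docs.
example_index_params = {
    "doc_config": {
        "mode": "scope.collection.type_field",
        "type_field": "type",
    },
    "mapping": {
        "default_mapping": {"dynamic": True, "enabled": False},
        "types": {
            # Index documents in the _default scope and _default collection.
            "_default._default": {
                "dynamic": False,
                "enabled": True,
                "properties": {
                    "embedding": {
                        "enabled": True,
                        "fields": [{
                            "name": "embedding",
                            "type": "vector",
                            "dims": 1536,  # output size of text-embedding-ada-002
                            "similarity": "dot_product",
                            "index": True,
                        }],
                    },
                    "text": {
                        "enabled": True,
                        "fields": [{
                            "name": "text",
                            "type": "text",
                            "index": True,
                            "store": True,  # stored so results can return the text
                        }],
                    },
                },
            }
        },
    },
}

print(json.dumps(example_index_params, indent=2))
```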

In [None]:
from couchbase.management.search import SearchIndex

index_manager = scope.search_indexes()
index_manager.upsert_index(
    SearchIndex(
        "VectorIndex",
        params=index_params,
        source_name="CouchbaseWhitepaper",
    ),
)

## Perform a Vector Search on Our Embedded Document

In [None]:
import couchbase.search as search
from couchbase.vector_search import VectorQuery, VectorSearch

def search_by_vector(
        scope,
        query_vector,
        top_k=5,
        score_threshold=0.0
):

    search_req = search.SearchRequest.create(
        VectorSearch.from_vector_query(
            VectorQuery(
                'embedding',
                query_vector,
                top_k,
            )
        )
    ) 
    try:
        search_iter = scope.search(
                "VectorIndex",
                search_req,
                SearchOptions(limit=top_k, collections=["_default"],fields=['*']),
            )

        docs = []
        for row in search_iter.rows():
            text = row.fields.pop('text')
            score = row.score
            doc = {"content":text, "score":score}
            if score >= score_threshold:
                docs.append(doc)
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise ValueError(f"Search failed with error: {e}") from e
    return docs

In [None]:
query_string = "How do I query documents in couchbase?"
query_embedding = create_embeddings(client, [query_string])[0]
print(search_by_vector(scope,query_embedding)[0]['content'])
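
The cell above prints only the top match. To inspect every returned chunk together with its score, a small formatting helper along these lines may be convenient. It is a sketch that assumes the list-of-dicts shape returned by `search_by_vector` (`content` and `score` keys); `format_results`, `max_chars`, and the sample data are hypothetical.

```python
def format_results(results, max_chars=80):
    """Render ranked search results as one line per hit."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        # Collapse whitespace so multi-line chunks fit on a single line.
        snippet = " ".join(doc["content"].split())[:max_chars]
        lines.append(f"{rank}. score={doc['score']:.3f}  {snippet}")
    return "\n".join(lines)

# Made-up results in the same shape that search_by_vector returns.
sample_results = [
    {"content": "first matching chunk", "score": 0.91},
    {"content": "second matching\nchunk", "score": 0.42},
]
print(format_results(sample_results))
```

With real data, this would be called as `print(format_results(search_by_vector(scope, query_embedding)))`.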