Based on:
[Industry Solutions Engineering LLM Data Ingestion](https://proud-glacier-0c2b6bb1e-1419.westus2.3.azurestaticapps.net/code-with-mlops/lab/llm-lab/module-2a/)

Adapted for Project VICO by: Natasha Kohli

# Licensing Document exploration

This notebook walks through an example exploratory data analysis workflow for exploring PDF documents. It is not a comprehensive overview of all relevant data science tasks that should happen prior to an AI engagement, but goes over several key phases in an LLM-focused project.

## Setup
Assuming this notebook is being run in an Azure ML notebook, please choose the "SDK v2" kernel, then run the first 2 cells below (with the pip installs). You should only need to run those once.

Second, connect the SMR storage container to your Azure Machine Learning Workspace as a Datastore by following the [official docs](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-connect-data-ui?view=azureml-api-1&tabs=credential#create-datastores).

## Overview of Stages

1. Use Document Intelligence (formerly Form Recognizer) to extract the content from the PDF
    - Other tools, such as third-party converters from PDF to text or markdown, could work just as well
2. Chunk the document into sizes that an embedding model can handle
    - This is a critical part of the process, which splits the text of each document into "chunks" that fit within the token limit of embedding models. This notebook takes a single approach via LangChain, but to optimize the Project VICO solution, taking licensing document sections into account when chunking will likely be necessary.
3. Use OpenAI Embedding model to create embedding vectors of the chunks
    - These embeddings can be used to represent the content, and then the vectors can be used for search
4. Use PCA for dimensionality reduction of the embedding vector, and K-means to cluster
    - Allows data scientists to visualize and explore trends within the data
5. Visualize the clusters using plotting tools
6. Create a Cognitive Search Index using embedding vectors
    - Enable Vector Search and Hybrid Search in Azure Cognitive Search

In [None]:
%pip install azure-search-documents==11.4.0b8

In [None]:
%pip install azure-ai-formrecognizer==3.3.0 azureml-fsspec==1.2.0 mltable==1.5.0 tenacity==8.2.3 openai==0.28.0 langchain==0.0.281 tiktoken==0.4.0 plotly==5.16.1 spacy==3.6.1 nbformat==5.9.2

In [None]:
import json
from typing import List, Union, IO

import numpy as np
import openai
import pandas as pd
import plotly.express as px
import spacy
from azure.ai.formrecognizer import AnalyzeResult, DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswVectorSearchAlgorithmConfiguration,
    PrioritizedFields,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticSettings,
    SimpleField,
    VectorSearch,
)
from azure.search.documents.models import Vector
from azureml.fsspec import AzureMachineLearningFileSystem
from langchain.text_splitter import TokenTextSplitter
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from tenacity import retry, stop_after_attempt, wait_random_exponential

In [None]:
# only run this once, then comment it out even if you restart your kernel (unless you restart the whole computer)
spacy.cli.download("en_core_web_lg")

# Load the data for chunking via Document Intelligence Service

In [None]:
# Set your subscription, resource group and workspace name:
subscription_id = "SUBSCRIPTION_ID"
resource_group = "RESOURCE_GROUP"
workspace = "WORKSPACE_NAME"

We can use the Azure Machine Learning File System tools to mount our datastore to the compute instance and load data directly. By default, we are using the full licensing document PDFs.

In [None]:
datastore_name = "licensingpdfs"
datastore_uri = f"azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}"
fs = AzureMachineLearningFileSystem(datastore_uri)
documents = fs.glob("pdf/*.pdf")

In [None]:
# Optional cell - If you ran into auth issues above due to a Mac OS bug:

# Manually get the names of the documents
# documents = ["Aurora Environmental Report_ML20075A004.pdf", "Environmental Impact Kairos Hermes_ML22259A126.pdf", "Hermes Non-Power Reactor_ML21306A133.pdf"]

In [None]:
len(documents)

In [None]:
# Main dataframe: holds the path of every document and, later, its extracted content
document_df = pd.DataFrame({"FileName": documents})

# Document Cracking
Document Intelligence (formerly Form Recognizer) has good PDF cracking capabilities. We won't be using all of the tools at our disposal here, but just pulling the text out of the PDFs. This notebook uses the [General Document Model](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-general-document?view=doc-intel-3.1.0) by default.

The `tenacity` library is used for intelligent retrying of our method, in the case of failures. The `@retry` decorator at the top of the method defines the retry logic.
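
To make the retry behaviour less of a black box, here is a minimal standard-library sketch of the same idea: retry a call a fixed number of times, waiting a random and exponentially growing interval between attempts. It is a simplified stand-in, not `tenacity`'s implementation; `retry_with_backoff`, its parameters, and `flaky_service` are hypothetical names used only for illustration.

```python
import random
import time
from functools import wraps


def retry_with_backoff(max_attempts=3, min_wait=1.0, max_wait=20.0, sleep=time.sleep):
    """Retry a function, waiting a random, exponentially growing time between tries."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # The upper bound doubles with each attempt, capped at max_wait.
                    upper = min(max_wait, min_wait * 2**attempt)
                    sleep(random.uniform(min_wait, upper))

        return wrapper

    return decorator


calls = {"count": 0}


# The no-op sleep skips real waiting in this demo.
@retry_with_backoff(max_attempts=3, sleep=lambda seconds: None)
def flaky_service():
    """Hypothetical call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = flaky_service()
print(result, calls["count"])
```

The random component spreads retries out so that many failing calls do not all hit the service again at the same moment.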

If you would like to, feel free to try a different toolset for retrieving the text in the PDF documents or tools to convert PDF to other file types. Perhaps you can find a more performant option!

In [None]:
endpoint = "https://<resourcename>.cognitiveservices.azure.com/"
credential = AzureKeyCredential("DOCUMENT INTELLIGENCE KEY")
document_analysis_client = DocumentAnalysisClient(endpoint, credential)

In [None]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def get_content(
    document: Union[bytes, IO[bytes]], model_id: str = "prebuilt-document"
) -> AnalyzeResult:
    """
    Analyze field text from a given document.

    Args:
        document (Union[bytes, IO[bytes]]): the document that is going to be analyzed
        model_id (str, optional): a unique model identifier. Defaults to "prebuilt-document":str.
            Please see here for more details.
            https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-model-overview?view=doc-intel-3.1.0#model-data-extraction

    Returns:
        AnalyzeResult: the analyzed results
    """
    poller = document_analysis_client.begin_analyze_document(
        model_id=model_id, document=document
    )
    result = poller.result()
    return result

# WAIT - Before running next cells
Running Document Intelligence across a large number of documents (in the hundreds) can take over an hour. Consider trying this out on a smaller subset, and use the pre-extracted embeddings/documents in the later modules when you need the full dataset.

By default, the code in the above cells references a folder with 3 documents which should take less than 5 minutes to run.

If you would like to run a full dataset in the future, you can change the folder referenced in the above cells. This will require more data cleansing, more data prep, and of course much more time and is not recommended when first running through the lab, or when operating under a fixed timetable (such as a hackathon).

In [None]:
# Recommended path - instead of running this notebook on a full dataset, use a subset to explore
for index, row in document_df.iterrows():
    full_path = f"{datastore_uri}/paths/{row['FileName']}"
    print(full_path)
    with fs.open(full_path, "rb") as f:
        print(f"Analyzing {row['FileName']}")
        result = get_content(document=f)
        filecontent = result.content
        # Optionally, save the full DI response to json in case you want to explore that data later. E.g. page numbers, paragraphs
        # json.dump(result.to_dict(), open(f"data/{row['FileName']}.json", "w"))
    document_df.loc[index, "Content"] = filecontent

In [None]:
document_df

## Troubleshooting Document Intelligence

We have seen the DI service time out when a large number of documents is run through it. The next cell's code looks for files that did not have results returned. If many documents didn't get parsed, use this list to re-run that subset through the service.

In [None]:
# documents = document_df.loc[document_df['Content'].isnull() == True]['FileName']
# documents = documents.to_list()
# documents

In [None]:
# drop the rows if the content is null - many will be since we only evaluated a subset
print(document_df.shape)
document_df = document_df.dropna(subset=["Content"])
print(document_df.shape)

## Post Processing

PDF files can sometimes be corrupted, blank, or read-protected. Resolving these issues is out of scope for this hack, so we'll drop those files from our dataset.

In [None]:
# Also drop any empty files
document_df = document_df.dropna()
# Drop files that had a processing error (may be corrupted)
document_df = document_df[
    ~document_df.Content.str.contains("This file cannot be downloaded")
]
# Drop protected files (out of scope for fixing)
document_df = document_df[
    ~document_df.Content.str.contains("This PDF file is protected")
]
print(document_df.shape)
document_df = document_df.reset_index(drop=True)
# Optional: Save this checkpoint to parquet
document_df.to_parquet("extracted_pdfs.gzip", compression="gzip")

In [None]:
document_df.head()

In [None]:
# Optional Cell - load the extracted text. Useful for if you had to restart your notebook

# checkpoint
document_df = pd.read_parquet("extracted_pdfs.gzip")
document_df.head()

## Chunking
Now that we have all of our data as text, we need to chunk it into sizes that an embedding model can handle (e.g. at most 8,191 tokens, the input limit for the `text-embedding-ada-002` model).

Below, we are using [LangChain's token based splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token), which uses the `tiktoken` library to break up the text into chunks of our specified token size. Feel free to try different chunk sizes and overlap amounts (how much of the previous chunk is repeated in the next one) to see if the final results of the notebook improve. Document chunking is an important part of an LLM solution: it is a balance between including enough context in a given chunk and keeping chunks small enough to be returned by relevant search queries.
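
To illustrate how chunk size and overlap interact, here is a minimal sliding-window sketch. It splits on whitespace as a stand-in for `tiktoken` tokens, so the boundaries will differ from what `TokenTextSplitter` produces; `chunk_tokens` is a hypothetical helper, not part of LangChain.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace "tokens" are an assumption for readability; real tokens are sub-word units.
words = "the applicant shall submit an environmental report describing the proposed action".split()
chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last two words of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.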

If you encounter permission problems when using `tiktoken` due to its cache location, try setting the `TIKTOKEN_CACHE_DIR` environment variable:

```
export TIKTOKEN_CACHE_DIR="./YOUR_TIKTOKEN_CACHE_DIR"
```


Before we dive all the way into chunking every document, let's explore what the results are for a single document.

In [None]:
test_content = document_df.loc[0, "Content"] # test on the first document's content
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=15)
texts = text_splitter.split_text(test_content)

len(texts)

We can now review the chunks produced from this single document before splitting the rest.

In [None]:
texts

Now that we've seen how chunking works on a single document, let's run it against the rest of our dataset.

In [None]:
# Collect one dataframe of chunks per document here, then concatenate them into a single dataframe
chunked_df = []
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

for index, row in document_df.iterrows():
    # print(f"Splitting {row['FileName']}")
    if isinstance(row["Content"], str):
        texts = text_splitter.split_text(row["Content"])
        # Clean up the newline content
        cleaned_texts = []
        for text in texts:
            text = text.strip().replace("\n", " ")
            cleaned_texts.append(text)
        chunked_df.append(
            pd.DataFrame({"FileName": row["FileName"], "Content": cleaned_texts})
        )
chunked_df = pd.concat(chunked_df, ignore_index=True)
# Optional: Save this checkpoint to parquet
chunked_df.to_parquet("chunked_data.gzip", compression="gzip")

In [None]:
chunked_df.head()

In [None]:
chunked_df.loc[0, "Content"]

In [None]:
# Optional Cell - load the chunked data. Useful for if you had to restart your notebook

# checkpoint
chunked_df = pd.read_parquet("chunked_data.gzip")
chunked_df.head()

# Generate Embeddings

Here we define how to use Azure OpenAI for [generating embeddings](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/embeddings?tabs=console) (at bare minimum, we use this for our queries at the end). Once again, the `tenacity` library is used for intelligent retries against the service to avoid failures due to timeout.

For this to work, you need to have an OpenAI service set up in your resource group, and have deployed a `text-embedding-ada-002` model. That said, if you want to try a different embedding model from somewhere like HuggingFace, feel free to explore.

Be mindful of the quota limits of embeddings and how large your set of chunked documents is.

In [None]:
# OpenAI deployment values
openai_api_key = "OPENAI API KEY"
openai_api_type = "azure"
openai_api_base = "https://OPENAI API ENDPOINT.openai.azure.com/"
openai_api_version = "2023-06-01-preview"
openai_deployment_id = "EMBEDDING DEPLOYMENT NAME"


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(text: str) -> List[float]:
    """Generate embedding for string

    Args:
        text (str): string for embedding

    Returns:
        List[float]: the embedding for the text
    """
    openai.api_key = openai_api_key
    openai.api_type = openai_api_type
    openai.api_base = openai_api_base
    openai.api_version = openai_api_version
    response = openai.Embedding.create(input=text, deployment_id=openai_deployment_id)
    embeddings = response["data"][0]["embedding"]
    return embeddings

If you'd like to see how it works, try running it on a single row of the dataframe, or on your smaller test dataset.
Running it on the full chunked dataset can take anywhere from 15 minutes to over an hour depending on the rate limits on your service.
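
As a rough way to reason about that run time, the sketch below estimates a lower bound from two quota figures: requests per minute and tokens per minute. It assumes one request per chunk, as in `generate_embeddings`; every number in the example is hypothetical, so substitute the limits shown for your own deployment.

```python
import math


def estimate_embedding_minutes(
    n_chunks: int,
    tokens_per_chunk: int,
    requests_per_minute: int,
    tokens_per_minute: int,
) -> int:
    """Lower-bound estimate of minutes needed, given request and token rate limits."""
    if requests_per_minute <= 0 or tokens_per_minute <= 0:
        raise ValueError("rate limits must be positive")
    request_bound = n_chunks / requests_per_minute
    token_bound = n_chunks * tokens_per_chunk / tokens_per_minute
    # Whichever limit is hit first determines the overall pace.
    return math.ceil(max(request_bound, token_bound))


# Hypothetical figures: 12,000 chunks of 512 tokens, 300 requests/min, 240,000 tokens/min.
minutes = estimate_embedding_minutes(12_000, 512, 300, 240_000)
print(minutes)
```

Retries and network latency add to this figure, so treat it as a floor rather than a prediction.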

In [None]:
embedding_df = chunked_df.copy()
embedding = generate_embeddings(embedding_df["Content"][0])
embedding

Here, we use `apply` to run it across our small test dataset.

In [None]:
embedding_df["Generated_embedding"] = embedding_df["Content"].apply(generate_embeddings)
embedding_df

In [None]:
# Checkpoint - save the data to parquet so that you don't lose it if your notebook needs to restart.
embedding_df.to_parquet("embedded_content.gzip", compression="gzip")

eda_df = embedding_df.copy()

### Checkpoint: Load the parquet
If your notebook is still running, you don't need this cell; just check that the variable name of your dataframe is correct moving forward.

In [None]:
# Optional Cell - load the embedded data. Useful for if you had to restart your notebook

# checkpoint
eda_df = pd.read_parquet("embedded_content.gzip")
# check to see if it loaded correctly
eda_df.head()

## Optional: Generate Keywords on content

If you wish, you can add keywords as metadata to your dataset by using the `spaCy` natural language tools. This takes a few minutes to run, but can help improve your search index by adding keywords to search across. Here we use the Named Entity Recognition (NER) built into the tool to identify these keywords. For more information on NER, spaCy has a great [video here](https://spacy.io/universe/project/video-spacys-ner-model-alt).

In [None]:
# here, we load the spacy model we're going to use. The code to download it is
# near the top of this notebook
nlp = spacy.load("en_core_web_lg")

In [None]:
def ner(text: str) -> List[str]:
    """Generate keywords from the text

    Args:
        text (str): the text you want to generate keywords

    Returns:
        List[str]: keywords
    """

    entities = []
    for entity in nlp(text).ents:
        entities.append(entity.text)
    return entities

In [None]:
eda_df["keywords"] = eda_df["Content"].apply(ner)
eda_df
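
The `ner` function returns every mention it finds, so an entity that appears several times in a chunk is repeated in the list. If that turns out to be noisy in the index, one option is to deduplicate and keep only the most frequent entities. The sketch below is a hypothetical post-processing helper (`top_keywords` is not part of spaCy), and the sample strings are made up.

```python
from collections import Counter
from typing import List


def top_keywords(entities: List[str], limit: int = 10) -> List[str]:
    """Deduplicate entity mentions and keep the most frequent ones."""
    # Collapse internal whitespace so "Kairos  Power" and "Kairos Power" match.
    normalized = [" ".join(entity.split()) for entity in entities if entity.strip()]
    counts = Counter(normalized)
    return [keyword for keyword, _ in counts.most_common(limit)]


# Made-up mentions, standing in for the output of ner() on one chunk.
sample = ["NRC", "Kairos Power", "NRC", "Oak Ridge", "Kairos  Power", " "]
keywords = top_keywords(sample)
print(keywords)
```

It could be applied after the fact with `eda_df["keywords"].apply(top_keywords)`.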

# EDA

[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) is used to reduce the dimensionality of the embeddings so that we can cluster them. In this notebook, we run 3 steps: PCA down to 20 components to reduce dimensionality, [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering on that reduced dataset, and then PCA one more time, down to 2 components, to use as X and Y values for plotting.

This series of steps has generally resulted in more "interesting" clusters to explore later, but this is an area for you to explore. Consider changes such as only doing the PCA reduction 1 time and clustering/plotting that - or changing the number of clusters. This is the *Exploratory* data analysis after all.

In [None]:
# Pandas and Numpy arrays don't directly play nicely together, so we cast back to list
eda_df["PCA_1"] = (
    PCA(n_components=20).fit_transform(eda_df["Generated_embedding"].tolist()).tolist()
)

Now that we have a more reasonably dimensioned set of embeddings, we'll create clusters (grouping the data together).
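
For intuition about what the clustering step does, here is a minimal pure-Python sketch of Lloyd's algorithm, the assign-then-update loop at the core of k-means. It is not scikit-learn's implementation: it takes fixed starting centroids instead of using `k-means++` initialization, and `lloyd_kmeans` and the toy points are purely illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(
    points: Sequence[Point], centroids: Sequence[Point], iterations: int = 10
) -> Tuple[List[int], List[Point]]:
    """Alternate between assigning points to centroids and recomputing centroids."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Two well-separated toy groups in 2-D (the notebook clusters 20-D vectors instead).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = lloyd_kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

The label assigned to each point plays the same role as the `Cluster` column created in the next cells.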

In [None]:
kmeans = KMeans(init="k-means++", n_clusters=40, n_init="auto")
kmeans.fit(eda_df["PCA_1"].tolist())

In [None]:
eda_df["Cluster"] = kmeans.predict(eda_df["PCA_1"].tolist())

In [None]:
eda_df

For plotting those clusters, it's easier if we have a 2-dimensional representation of the embeddings, so we'll run PCA again to reduce down to 2 dimensions.

In [None]:
eda_df["PCA"] = (
    PCA(n_components=2).fit_transform(eda_df["Generated_embedding"].tolist()).tolist()
)
eda_df

# Plotting the data

Using [Plotly](https://plotly.com/python/line-and-scatter/), we can create a scatterplot (or [seaborn](https://seaborn.pydata.org/), or [matplotlib](https://matplotlib.org/)). Plotly allows for more interactive plots, which is why it's used here.

In [None]:
# Use our 2-D PCA values as X and Y coordinates for plotting.
eda_df["X"] = eda_df["PCA"].str[0]
eda_df["Y"] = eda_df["PCA"].str[1]
eda_df

In [None]:
# Passing hover_data as a list works across plotly versions
px.scatter(eda_df, x="X", y="Y", color="Cluster", hover_data=["FileName"])

The tradeoff of using PCA 2 times is that the clusters aren't as visually distinct in the plot as they would be if they were created from a 2-D vector. However, the clusters seemed to be better differentiated in terms of content.

We can filter our dataframe to look at a given cluster to try and see what's being linked.

In [None]:
# Cluster labels run from 0 to n_clusters - 1 (0 to 39 with the settings above); pick any of them
filter_df = eda_df[eda_df["Cluster"] == 4].head(10)
# If you generated keywords:
# filter_df[["FileName", "Cluster", "Content", "keywords"]]

# Otherwise:
filter_df[["FileName", "Cluster", "Content"]]

Now, let's prepare our Search index. For this, we want to use the FULL dataset, so first, download the pre-prepared CSV. The code below shows the last step that was performed on the full dataset prior to being saved (creating an `id` column, and filtering down to the 4 other primary columns), and is for informational purposes only.

In [None]:
# Let's drop the columns we won't be using for searching, optionally keeping keywords.
# search_df = eda_df.loc[:, ["FileName", "Content", "Generated_embedding", "keywords"]].copy()
# .copy() gives an independent dataframe, so adding the id column below cannot affect eda_df
search_df = eda_df.loc[:, ["FileName", "Content", "Generated_embedding"]].copy()
search_df["id"] = range(len(search_df))
search_df["id"] = search_df["id"].map(str)

search_df.to_parquet("searchdf.gzip", compression="gzip")

# Create Search Index

Using our processed data, we can now create an Azure Cognitive Search Index to query against.

In [None]:
search_df = pd.read_parquet("searchdf.gzip")
search_df.head()

### Index Configuration

Fields:

- SimpleField: Used for faceting, filtering, or sorting results. Can be searchable, but not full text searchable. In this case, we are using only 1, our "id" field, and setting it as the key for the index.
- SearchableFields are full text searchable (they can also be set to be used for filtering, faceting, and sorting).
  - Here, we are passing in the Content, keywords, and FileNames as Strings.
  - Generated_embedding is added as a collection of datapoints. This is the field that we will be using for our vector search.

Configurations:

- Vector Search: Configuration for the Vector Search is primarily static - most of the values in this configuration are hard-coded at this time, aside from the name you wish to set for the configuration. This is still in early enough preview that the documentation is not fully available for python. The REST API details are available [here](https://learn.microsoft.com/en-us/rest/api/searchservice/preview-api/create-or-update-index#request-body).
- Semantic Search: Configuration for the Semantic Search defines which fields in the index will be used for semantic queries. For more info, review the [Quickstart Guide](https://learn.microsoft.com/en-us/azure/search/search-get-started-semantic?tabs=python#add-semantic-search).
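
The vector configuration in the next cell sets `"metric": "cosine"`, meaning that closeness between a query embedding and a chunk embedding is measured by the angle between the two vectors rather than by their lengths. A minimal sketch of that measure, using made-up 3-dimensional vectors in place of the 1536-dimensional embeddings:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, purely illustrative.
query = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]  # same direction as the query, different length
chunk_b = [0.0, 3.0, 0.0]  # orthogonal to the query

sim_a = cosine_similarity(query, chunk_a)
sim_b = cosine_similarity(query, chunk_b)
print(round(sim_a, 6), round(sim_b, 6))
```

Because the measure ignores vector length, `chunk_a` scores as a perfect match even though it is twice as long as the query.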



In [None]:
# Create a search index
credential = AzureKeyCredential("SEARCH ADMIN KEY")
service_endpoint = "https://SEARCH ENDPOINT.search.windows.net"
index_name = "CREATE INDEX NAME"
index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential)
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    SearchableField(name="FileName", type=SearchFieldDataType.String),
    SearchableField(name="Content", type=SearchFieldDataType.String),
    SearchField(
        name="Generated_embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_configuration="my-vector-config",
    ),
    # if you generated keywords
    # SearchField(
    #     name="keywords", type=SearchFieldDataType.Collection(SearchFieldDataType.String)
    # ),
]

# These values don't seem to be configurable
vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine",
            },
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="FileName"),
        prioritized_content_fields=[SemanticField(field_name="Content")],
        # if you generated keywords
        # prioritized_keywords_fields=[SemanticField(field_name="keywords")],
    ),
)

# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(
    name=index_name,
    fields=fields,
    vector_search=vector_search,
    semantic_settings=semantic_settings,
)
result = index_client.create_or_update_index(index)
print(f"{result.name} created")

Azure Search expects documents as JSON records when populating an index, so we convert our dataframe to JSON. This also lets us save our work.

In [None]:
# to_json writes the file and returns None, so there is nothing to assign here
search_df.to_json("temp.json", orient="records")
with open("temp.json", "r") as file:
    documents = json.load(file)
documents[0].keys()

In [None]:
# Batch our docs into manageable sizes for Azure Cognitive Search
batches = np.array_split(documents, 10)

In [None]:
search_client = SearchClient(
    endpoint=service_endpoint, index_name=index_name, credential=credential
)
for batch in batches:
    result = search_client.upload_documents(documents=batch.tolist())
print(f"Uploaded {len(documents)} documents")
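
Splitting into a fixed number of batches works for a small dataset, but the batch size then grows with the data. Azure Cognitive Search documents a limit of 1,000 documents per indexing request at the time of writing, so for the full dataset it may be safer to batch by size instead. A minimal sketch, where `iter_batches` is a hypothetical helper and the documents are placeholders:

```python
from typing import Dict, Iterator, List, Sequence


def iter_batches(items: Sequence[Dict], batch_size: int = 1000) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Placeholder documents standing in for the records loaded from temp.json.
docs = [{"id": str(i)} for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(docs, batch_size=1000)]
print(batch_sizes)
```

Each yielded batch is a plain list of dicts, so it could be passed directly as `search_client.upload_documents(documents=batch)`.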

Finally, run your search query.
This example query combines both hybrid search (combining vector and text search), and semantic search.
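
In a hybrid query, the text ranking and the vector ranking have to be merged into one result list. Azure's documentation describes Reciprocal Rank Fusion (RRF) for this step. The sketch below shows the general technique on made-up document IDs; it is not the service's implementation, and the constant `k=60` is a commonly used default that is assumed here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the keyword side and the vector side of one query.
text_ranking = ["doc-2", "doc-7", "doc-1"]
vector_ranking = ["doc-7", "doc-3", "doc-2"]

fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which is why hybrid search can outperform either method alone.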

In [None]:
# Semantic Search
query = "What information does the introduction section contain?"

search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector = Vector(value=generate_embeddings(query), k=3, fields="Generated_embedding")

results = search_client.search(
    search_text=query,
    vectors=[vector],
    select=["FileName", "Content"],
    query_type="semantic",
    query_language="en-us",
    semantic_configuration_name="my-semantic-config",
    query_caption="extractive",
    query_answer="extractive",
    top=5,
)

# get_answers() can return None when the service produces no answers
semantic_answers = results.get_answers() or []
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Document: {result['FileName']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['Content']}")

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")