# Reto 04-B - Generación Aumentada con Recuperación (RAG) para Datos No Estructurados


## Introducción

Las empresas tienen mucha información propietaria que debe tenerse en cuenta al responder las preguntas de los usuarios; estas no siempre pueden ser respondidas a través de los datos con los que se han entrenado los modelos GPT.

En el último notebook, trabajamos principalmente con datos estructurados. Muchas veces, los datos de tu empresa no se limitan solo a formatos estructurados como archivos CSV o tablas SQL. También pueden incluir datos no estructurados como documentos PDF o imágenes. De hecho, tus documentos individuales podrían tener tanto datos no estructurados como estructurados integrados. Extraer información de estos formatos diversos de una manera comprensible presenta un desafío. Herramientas como Azure Document Intelligence permiten la extracción de datos de fuentes no estructuradas como formularios o documentos. Una vez que los datos se extraen en un formato JSON estructurado, se puede utilizar AI Search para consolidar toda la información de diferentes tipos de datos en índices, facilitando la recuperación de documentos relevantes.

En este notebook, te guiaremos a través de un caso de uso de Generación Aumentada con Recuperación (RAG) que implica trabajar con datos no estructurados. El enfoque RAG combina varias tecnologías para mejorar la calidad y la relevancia de las salidas generadas. Aprovecharemos Azure Document Intelligence para procesar documentos complejos, utilizando la API de layout para extraer texto y tablas de manera efectiva. Utilizaremos Azure AI Search para crear un índice configurando capacidades de búsqueda semántica, lo que permite la recuperación de páginas de documentos relevantes. Además, se incorporarán embeddings para recuperar contenido que esté lo más alineado posible con la pregunta del usuario. Finalmente, el modelo ChatGPT de Azure OpenAI utilizará el contenido extraído para generar una respuesta más significativa. Es importante enfatizar que este proceso de grounding (fundamentación) sigue el patrón RAG mencionado en el cuaderno anterior y ayuda a eliminar inexactitudes en las respuestas generadas.

Tus objetivos para este desafío son leer este notebook, ejecutar cada bloque de código, observar los resultados y luego poder responder las preguntas planteadas en la guía del estudiante.


In [1]:
! pip install "tiktoken==0.5.2" 


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
# Import Azure Forms Recognizer, Azure Cognitive Search, OpenAI, and other python modules

import os, json, requests, sys, re
import requests
from pprint import pprint
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient 
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings
)

from azure.ai.formrecognizer import DocumentAnalysisClient
import openai
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity

from dotenv import load_dotenv
load_dotenv()

True

In [3]:
# This is secure and recommended way to load OpenAI resource credentials and deployment names

openai.api_key = os.environ['OPENAI_API_KEY']
openai.api_base = os.environ['OPENAI_API_BASE']
openai.api_type = os.environ['OPENAI_API_TYPE']
openai.api_version = os.environ['OPENAI_API_VERSION']

chat_model = os.environ['CHAT_MODEL_NAME']
embedding_model=os.environ['EMBEDDING_MODEL_NAME']

**NOTA:** La ruta en la celda de código a continuación se refiere a la carpeta `/data/unstructured/raw`.

In [4]:
# -- raw data
RAW_DATA_FOLDER= '../data/unstructured/raw'
# -- extracted json file 
EXTRACTED_DATA_FOLDER = '../data/unstructured/extracted'

In [5]:
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
key = os.environ["AZURE_FORM_RECOGNIZER_KEY"]

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

Queremos extraer los datos de nuestros datos no estructurados a un formato más legible para que el modelo los entienda. La herramienta Document Intelligence nos ayuda a hacerlo aprovechando los modelos de diseño preconstruidos. Aquí, trabajamos principalmente con archivos PDF, pero también podríamos tener formatos JPG y PNG que la herramienta Document Intelligence también admite.

Para cada documento, queremos especificar la forma en que se extrae la información. Por ejemplo, en este caso de uso, cada documento tiene muchas páginas. Para hacer un seguimiento de las páginas, las almacenamos en el campo page_number. También queremos extraer el contenido de cada página y colocarlo en un campo page_context.

In [6]:
def extract_local_single_file(file_name: str):
    not_completed = True
    while not_completed:
        with open(file_name, "rb") as f:
            poller = document_analysis_client.begin_analyze_document(
                "prebuilt-layout", document=f
            )
            not_completed=False
    result = poller.result()
    return get_page_content(file_name, result)

def extract_files( folder_name: str, destination_folder_name: str):
    os.makedirs(destination_folder_name, exist_ok=True)
    for file in os.listdir(folder_name):
        if file[-3:].upper() in ['PDF','JPG','PNG']:
            print('Processing file:', file, end='')
        
            page_content = extract_local_single_file(os.path.join(folder_name, file))
            output_file = os.path.join(destination_folder_name, file[:-3] +'json')
            print(f'  write output to {output_file}')
            with open(output_file, "w") as f:
                f.write(json.dumps(page_content))


def get_page_content(file_name:str, result):
    page_content = []
    for page in result.pages:
        all_lines_content = []
        for line_idx, line in enumerate(page.lines):
            all_lines_content.append(' '.join([word.content for word in line.get_words()]))
        page_content.append({'page_number':page.page_number, 
                                'page_content':' '.join(all_lines_content)})
    return {'filename':file_name, 'content':page_content}





In [7]:
extract_files(RAW_DATA_FOLDER, EXTRACTED_DATA_FOLDER)

Processing file: Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.pdf  write output to ../data/unstructured/extracted/Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.json
Processing file: Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.pdf  write output to ../data/unstructured/extracted/Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.json
Processing file: Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf  write output to ../data/unstructured/extracted/Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.json


## Más Sobre Nuestros Datos

En este tutorial, examinaremos varios artículos de investigación sobre temas de LLM en documentos PDF. Esto incluye temas como:

- Autoprompting (prompting automático)
- Chain of thought prompting (prompting de cadena de pensamiento)
- precise zero shot dense retrival (recuperación densa precisa de cero disparos)
- y más. 

Este conjunto de datos contiene varios formatos no estructurados, como texto, tablas, gráficos y fórmulas.

## Descripción de los Datos

El esquema relevante para nuestro trabajo de hoy consiste en:

- document_id
- document_name
- file_path
- page_number
- page_text


In [8]:
documents=[]
for file in os.listdir(EXTRACTED_DATA_FOLDER):
    with open(os.path.join(EXTRACTED_DATA_FOLDER, file)) as f:
        page_content= json.loads(f.read())
    documents.extend(
        [
            {
                'document_id':page_content['filename'].split('\\')[-1].split('.')[0] + '-' + str(page['page_number']),
                'document_name':page_content['filename'].split('\\')[-1],
                'file_path':page_content['filename'],              
                'page_number':page['page_number'],
                'page_text':page['page_content']
            }
            for page in page_content['content']
        ]
    )

In [9]:
#Example of a single page of research paper file that will be indexed in Azure Cognitive Search
documents[1]

{'document_id': '-2',
 'document_name': '../data/unstructured/raw/Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf',
 'file_path': '../data/unstructured/raw/Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf',
 'page_number': 2,
 'page_text': 'Model Tuning Pre-trained Model (11B params) Prompt Tuning a1 a2 Task A Batch b1 Task B Batch c1 Task C c2 Batch Mixed-task Batch Task A Model (11B params) A a1 A C c1 b1 a2 B A B Task B Model (11B params) -- C c2 C Task Prompts (20K params each) Task C Model (11B params) ---- Pre-trained Model (11B params) Figure 2: Model tuning requires making a task- specific copy of the entire pre-trained model for each downstream task and inference must be performed in separate batches. Prompt tuning only requires stor- ing a small task-specific prompt for each task, and enables mixed-task inference using the original pre- trained model. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned pr

Esta sección se centrará en AI Search y los siguientes temas:

1. Crear un índice de cliente
2. Definir los campos del índice con los atributos necesarios
3. Crear una configuración semántica
4. Cargar nuestro índice con las páginas de los documentos

In [10]:
# Create an SDK client
service_endpoint = os.getenv("AZURE_COGNITIVE_SEARCH_ENDPOINT")   
key = os.getenv("AZURE_COGNITIVE_SEARCH_KEY")
credential = AzureKeyCredential(key)

index_name = os.getenv("AZURE_COGNITIVE_SEARCH_DOC_INDEX_NAME")

index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
index_client

<azure.search.documents.indexes._search_index_client.SearchIndexClient at 0x740443815790>

In [11]:
fields = [
    SimpleField(name="document_id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="page_number", type=SearchFieldDataType.Int64),
    SimpleField(name="file_path", type=SearchFieldDataType.String),
    SearchableField(name="document_name", type=SearchFieldDataType.String,
                searchable=True, retrievable=True),
    SearchableField(name="page_text", type=SearchFieldDataType.String,
                filterable=True, searchable=True, retrievable=True),
]

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="document_id"),
        prioritized_keywords_fields=[SemanticField(field_name="document_name")],
        prioritized_content_fields=[SemanticField(field_name="page_text")]
    )
)


# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 research-paper-index-jpvc created


In [12]:
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded {len(documents)} documents") 

Uploaded 6 documents


In [13]:
len(result)

6

¡Aquí vemos a Azure AI Search en acción! Podemos recuperar los documentos más relevantes de todos con los que estamos trabajando.

In [14]:
query = "What is prompt tuning?"
count = 10
results = search_client.search(search_text=query, top=count, include_total_count=True)
page_chunks = []
citations = []
for result in results:
    page_chunks.append(result['page_text'])
    citations.append(result['document_name'])
    
    

In [15]:
embed_df = pd.DataFrame(page_chunks, columns = ["page_chunks"]) #datframe with document chunks
embed_df

Unnamed: 0,page_chunks
0,"scription of a data table, as shown in Figure ..."
1,Prefix-Tuning: Optimizing Continuous Prompts f...


Una vez que tengamos los documentos más relevantes, crearemos embeddings para todos los fragmentos de página. Esto nos ayudará a encontrar los documentos más similares a nuestra consulta de usuario dada.

In [16]:
# Handling Rate Limits

from openai.error import RateLimitError
from time import sleep


def get_embedding(text: str, engine: str = "text-embedding-ada-002"):
    count=0
    while True:
        try:
            embedding = openai.Embedding().create(input=[text], engine=engine)["data"][0]["embedding"]
            break;
        except RateLimitError:
            count+=1
            #print(f'RateLimitError Count: {count}')
            sleep(2)            
    return np.array(embedding).astype(np.float32)

def get_completion(prompt, model="gpt-35-turbo"): 
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        engine=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]


In [17]:
#Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, engine = embedding_model))

In [18]:
embed_df

Unnamed: 0,page_chunks,embedding
0,"scription of a data table, as shown in Figure ...","[-0.017270068, 0.006634906, 0.005179641, -0.00..."
1,Prefix-Tuning: Optimizing Continuous Prompts f...,"[-0.018153809, 0.0046602665, 0.0045732562, -0...."


In [19]:
query_embedding = get_embedding(query, engine=embedding_model)
embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

top_results = (
    embed_df.sort_values("similarities", ascending=False)
    .reset_index(drop=True)
    .head(3)
)
top_results

Unnamed: 0,page_chunks,embedding,similarities
0,Prefix-Tuning: Optimizing Continuous Prompts f...,"[-0.018153809, 0.0046602665, 0.0045732562, -0....",0.848751
1,"scription of a data table, as shown in Figure ...","[-0.017270068, 0.006634906, 0.005179641, -0.00...",0.840912


In [20]:
prompt = f"""
Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```{query}```
List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

Answer:
"""

print(prompt)


Provided below are user query and list of extracted pages from research papers separated by triple backticks.
Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

User Query: ```What is prompt tuning?```
List of Extracted Pages: ```['Prefix-Tuning: Optimizing Continuous Prompts for Generation arXiv:2101.00190v1 [cs.CL] 1 Jan 2021 Xiang Lisa Li Stanford University xlisali@stanford.edu Abstract Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for nat- ural language generation tasks, which keeps language model parameters frozen, but opti- mizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspira- tion from prompting, all

In [21]:
response = get_completion(prompt, chat_model)
print(response)

Prompt tuning is a lightweight alternative to fine-tuning for natural language generation tasks. It keeps language model parameters frozen but optimizes a small continuous task-specific vector called the prefix. Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were "virtual tokens". Prefix-tuning is modular and can support many tasks at once. It stores 1000x fewer parameters than fine-tuning and is space-efficient. It can be applied to other generation tasks and pretrained models. Compared to other lightweight fine-tuning approaches, prefix-tuning obtains a further 30x reduction in task-specific parameters, tuning only 0.1% while maintaining comparable performance.


In [22]:

def query_search(query, count=10):
    results = search_client.search(search_text=query, top=count, include_total_count=True)
    page_chunks = []
    for result in results:
        page_chunks.append(result['page_text'])
        
    #Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
    embed_df['embedding'] = embed_df["page_chunks"].apply(lambda page_text : get_embedding(page_text, engine = embedding_model))

    query_embedding = get_embedding(query, engine=embedding_model)
    embed_df["similarities"] = embed_df['embedding'].apply(lambda page_embedding: cosine_similarity(page_embedding, query_embedding))

    top_results = (
        embed_df.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(3)
    )
    
    prompt = f"""
    Provided below are user query and list of extracted pages from research papers separated by triple backticks.
    Your task is to extract key pieces of information from that list based on the user query and phrase that as a comprehensive answer. 

    User Query: ```{query}```
    List of Extracted Pages: ```{top_results['page_chunks'].to_list()}```

    Answer:
    """
    
    response = get_completion(prompt, chat_model)
    return response

In [23]:
answer = query_search("How does prompt tuning work?", 5)
print(answer)

Prompt tuning is a lightweight alternative to fine-tuning for natural language generation tasks. It keeps language model parameters frozen but optimizes a small continuous task-specific vector called the prefix. The prefix is a sequence of continuous task-specific vectors that are prepended to the input and can be attended to by subsequent tokens as if they were "virtual tokens". Prefix-tuning draws inspiration from prompting and enables a single language model to support many tasks at once. It stores 1000x fewer parameters than fine-tuning and is modular, making it space-efficient. Prefix-tuning has been applied to GPT-2 for table-to-text generation and to BART for summarization, and it obtains comparable performance in the full dataset setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.


In [24]:
answer = query_search("what is a fully zero-shot dense retrieval system?", 10)
print(answer)

A fully zero-shot dense retrieval system is a system that does not require any fine-tuning or task-specific training to perform well on a given task. One approach to achieving this is through prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks. Prefix-tuning optimizes a small continuous task-specific vector (called the prefix) while keeping the language model parameters frozen. This allows subsequent tokens to attend to the prefix as if it were "virtual tokens". Prefix-tuning has been applied to GPT-2 for table-to-text generation and to BART for summarization, and has been found to obtain comparable performance in the full data setting, outperform fine-tuning in low-data settings, and extrapolate better to examples with topics unseen during training. Compared to other lightweight fine-tuning approaches, prefix-tuning tunes only 0.1% of the LM parameters while maintaining comparable performance.
