Skip to content

This assistant tool (WIP) will help you search, browse and summarize the answers to your questions from your uploaded PDF using advanced text analytics, semantic search and Large Language Model (LLM)

Notifications You must be signed in to change notification settings

rachhek/pdf-search-assistant

Repository files navigation

.github/workflows/azure-dev.yml

About

The aim of this project is to build a tool that can:

  1. Search for answers from an uploaded PDF(s)
  2. Filter the search on specific page components (e.g. headers, sub header, tables, paragraphs, etc.)
  3. Filter the search on specific pages
  4. Show the search results as PDF annotations
  5. Summarize the answers for the questions in a concise manner

How?

The project will use Microsoft form recognizer to extract the text from the PDFs and then use a search engine to search for the answers. The search engine will be built using Azure Cognitive Search.The summarization will done using a Large Language Model from Azure OpenAI service.

Project Setup

The project is built using Azure Developer CLI (azd). The following is the README from the azd starter project.

Azure Developer CLI (azd) Bicep Starter

A starter blueprint for getting your application up on Azure using Azure Developer CLI (azd). Add your application code, write Infrastructure as Code assets in Bicep to get your application up and running quickly.

The following assets have been provided:

  • Infrastructure-as-code (IaC) Bicep files under the infra folder that demonstrate how to provision resources and setup resource tagging for azd.
  • A dev container configuration file under the .devcontainer directory that installs infrastructure tooling by default. This can be readily used to create cloud-hosted developer environments such as GitHub Codespaces.
  • Continuous deployment workflows for CI providers such as GitHub Actions under the .github directory, and Azure Pipelines under the .azdo directory that work for most use-cases.

Project Directory

  1. app : Streamlit application host.
  2. infra : Bicep files for provisioning Azure resources.
  3. scripts : Powershell and shell scripts storage.
  4. data: Store files for the application

Application data flow and architecture

  1. User opens the Streamlit application.
    1. User sees Streamlit application UI.
    2. User uploads PDF file(s) to the application.
  2. File processing is triggered.
    1. A directory with the same name as the file is created in Azure Blob Storage to store the files.
    2. The original PDF file is saved to Azure Blob Storage under FileName/original
    3. The PDF file is divided into pages (JPGs) and saved to Azure Blob Storage under FileName/pages
  3. Text extraction is triggered.
    1. Text extraction is done using Azure Form Recognizer using Layout API.
    2. The original PDF file is sent to Azure Form Recognizer to extract the text.
    3. Form recognizer returns 4 types of data
      1. Page per text
      2. Paragraphs in the document along with x,y coordinates bounding box
      3. Tables in the document along with x,y coordinates bounding box
    4. Each of the above data is saved to Azure Blob Storage under FileName/extraction.
  4. Azure Cognitive Search is triggered.
    1. 3 indexes are created per document uploaded in the Azure cognitive search.
      1. Page per text
      2. Paragraphs in the document along with x,y coordinates bounding box
      3. Tables in the document along with x,y coordinates bounding box
    2. The data is indexed in the above indexes.
    3. The schema is created for the above indexes.
  5. After step 2,3, and 4 are done User is shown the search UI.
    1. User can search for a keyword in the search bar.
    2. User can filter the search on the following:
      1. Page
      2. Page component (e.g. header, sub header, paragraph, table, etc.)
      3. Tables
  6. Search and OpenAI summarization is triggered.
    1. The search is done on the Azure Cognitive Search indexes.
    2. The search results are shown to the user as PDF annotations.
    3. The search results are summarized using Azure OpenAI service.
  7. Search results are shown to the user.

Next Steps

Step 1: Add application code

  1. Initialize the service source code projects anywhere under the current directory. Ensure that all source code projects can be built successfully.
    • Note: For function services, it is recommended to initialize the project using the provided quickstart tools.

  2. Once all service source code projects are building correctly, update azure.yaml to reference the source code projects.
  3. Run azd package to validate that all service source code projects can be built and packaged locally.

Step 2: Provision Azure resources

Update or add Bicep files to provision the relevant Azure resources. This can be done incrementally, as the list of Azure resources are explored and added.

  • A reference library that contains all of the Bicep modules used by the azd templates can be found here.
  • All Azure resources available in Bicep format can be found here.

Run azd provision whenever you want to ensure that changes made are applied correctly and work as expected.

Step 3: Tie in application and infrastructure

Certain changes to Bicep files or deployment manifests are required to tie in application and infrastructure together. For example:

  1. Set up application settings for the code running in Azure to connect to other Azure resources.
  2. If you are accessing sensitive resources in Azure, set up managed identities to allow the code running in Azure to securely access the resources.
  3. If you have secrets, it is recommended to store secrets in Azure Key Vault that then can be retrieved by your application, with the use of managed identities.
  4. Configure host configuration on your hosting platform to match your application's needs. This may include networking options, security options, or more advanced configuration that helps you take full advantage of Azure capabilities.

For more details, see additional details below.

When changes are made, use azd to validate and apply your changes in Azure, to ensure that they are working as expected:

  • Run azd up to validate both infrastructure and application code changes.
  • Run azd deploy to validate application code changes only.

Step 4: Up to Azure

Finally, run azd up to run the end-to-end infrastructure provisioning (azd provision) and deployment (azd deploy) flow. Visit the service endpoints listed to see your application up-and-running!

Additional Details

The following section examines different concepts that help tie in application and infrastructure.

Application settings

It is recommended to have application settings managed in Azure, separating configuration from code. Typically, the service host allows for application settings to be defined.

  • For appservice and function, application settings should be defined on the Bicep resource for the targeted host. Reference template example here.
  • For aks, application settings are applied using deployment manifests under the <service>/manifests folder. Reference template example here.

Managed identities

Managed identities allows you to secure communication between services. This is done without having the need for you to manage any credentials.

Azure Key Vault

Azure Key Vault allows you to store secrets securely. Your application can access these secrets securely through the use of managed identities.

Host configuration

For appservice, the following host configuration options are often modified:

  • Language runtime version
  • Exposed port from the running container (if running a web service)
  • Allowed origins for CORS (Cross-Origin Resource Sharing) protection (if running a web service backend with a frontend)
  • The run command that starts up your service

Prerequisites

To Run Locally

  • Azure Developer CLI
  • Python 3+
    • Important: Python and the pip package manager must be in the path in Windows for the setup scripts to work.
    • Important: Ensure you can run python --version from console. On Ubuntu, you might need to run sudo apt install python-is-python3 to link python to python3.
  • Node.js
  • Git
  • Powershell 7+ (pwsh) - For Windows users only.
    • Important: Ensure you can run pwsh.exe from a PowerShell command. If this fails, you likely need to upgrade PowerShell.
  • Powershell for Mac/Linux (pwsh)
  1. Install the Azure CLI
  2. Run azd init -t azure-search-openai-demo
  3. Run azd env refresh -e {environment name} - Note that they will need the azd environment name, subscription Id, and location to run this command - you can find those values in your ./azure/{env name}/.env file. This will populate their azd environment's .env file with all the settings needed to run the app locally.
  4. Run pwsh ./scripts/roles.ps1 - This will assign all of the necessary roles to the user so they can run the app locally. If they do not have the necessary permission to create roles in the subscription, then you may need to run this script for them. Just be sure to set the AZURE_PRINCIPAL_ID environment variable in the azd .env file or in the active shell to their Azure Id, which they can get with az account show.
  5. az ad signed-in user show

NOTE: Your Azure Account must have Microsoft.Authorization/roleAssignments/write permissions, such as User Access Administrator or Owner.

About

This assistant tool (WIP) will help you search, browse and summarize the answers to your questions from your uploaded PDF using advanced text analytics, semantic search and Large Language Model (LLM)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages