# Using Azure OpenAI GPT-4 Vision to extract structured JSON data from PDF documents

This notebook demonstrates [how to use GPT-4 Vision](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest) to extract structured JSON data from PDF documents, such as invoices, using the [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview).

## Pre-requisites

The notebook uses [PowerShell](https://learn.microsoft.com/powershell/scripting/install/installing-powershell) and [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) to deploy all necessary Azure resources. Both tools are available on Windows, macOS and Linux environments. It also uses [.NET 8](https://dotnet.microsoft.com/download/dotnet/8.0) to run the C# code that interacts with the Azure OpenAI Service.

Running this notebook will deploy the following resources in your Azure subscription:
- Azure Resource Group
- Azure OpenAI Service (West US)
- GPT-4 Vision model deployment (5K capacity)

> **Note:** The GPT-4 Vision model is currently in preview and is available in limited capacity (10K per region) in selected regions only. For more information, see the [Azure OpenAI Service documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-preview-model-availability).

## Deploy infrastructure with Az CLI & Bicep

The following will prompt you to log in to Azure. Once logged in, your current default subscription will be used for the deployment.

> **Note:** If you have multiple subscriptions, you can change the default subscription by running `az account set --subscription <subscription_id>`.

Then, all of the Azure resources listed previously will be deployed using [Azure Bicep](https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/).

The deployment occurs at the subscription level and creates a new resource group. The deployment location is set to **West US**. You can change this to another location that supports the GPT-4 Vision model, and adjust other parameters, in the [`./infra/main.bicepparam`](./infra/main.bicepparam) file.

Once deployed, the Azure OpenAI Service endpoint and key will be stored in the [`./config.env`](./config.env) file for use in the .NET code.

### Understanding the deployment

#### OpenAI Services

An [Azure OpenAI Service](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview) instance is deployed in the West US region. This is deployed with the `gpt-4-vision-preview` model to be used for inference.

In [None]:
# Login to Azure
Write-Host "Checking if logged in to Azure..."

$loggedIn = az account show --query "name" -o tsv

if ($null -ne $loggedIn) {
    Write-Host "Already logged in as $loggedIn"
} else {
    Write-Host "Logging in..."
    # Replace the placeholder with your tenant ID (quoted so that PowerShell does not parse '<' as an operator)
    az login --tenant "<YOUR_TENANT_ID>"
}

# Replace the placeholder with your subscription ID
az account set --subscription "<YOUR_SUBSCRIPTION_ID>"

# Retrieve the default subscription ID
$subscriptionId = (
    (
        az account list -o json `
            --query "[?isDefault]"
    ) | ConvertFrom-Json
).id

# Set the subscription
az account set --subscription $subscriptionId
Write-Host "Subscription set to $subscriptionId"

# Deploy the infra/main.bicep file
Write-Host "Deploying the Bicep template..."

$deploymentOutputs = (az deployment sub create --name 'gpt-document-extraction' --location westus --template-file ./infra/main.bicep --parameters ./infra/main.bicepparam --query "properties.outputs" -o json) | ConvertFrom-Json

# Get the Azure OpenAI Service API key
$resourceGroupName = $deploymentOutputs.resourceGroupInfo.value.name
$openAIName = $deploymentOutputs.openAIInfo.value.name
$openAIEndpoint = $deploymentOutputs.openAIInfo.value.endpoint
$openAIVisionModelDeploymentName = $deploymentOutputs.openAIInfo.value.visionModelDeploymentName
$openAIKey = (az cognitiveservices account keys list --name $openAIName --resource-group $resourceGroupName --query key1 -o tsv)

# Save the deployment outputs to a .env file
Write-Host "Saving the deployment outputs to a config.env file..."

function Set-ConfigurationFileVariable($configurationFile, $variableName, $variableValue) {
    if (Select-String -Path $configurationFile -Pattern $variableName) {
        (Get-Content $configurationFile) | Foreach-Object {
            $_ -replace "$variableName = .*", "$variableName = $variableValue"
        } | Set-Content $configurationFile
    } else {
        Add-Content -Path $configurationFile -value "$variableName = $variableValue"
    }
}

$configurationFile = "config.env"

if (-not (Test-Path $configurationFile)) {
    New-Item -Path $configurationFile -ItemType "file" -Value ""
}

Set-ConfigurationFileVariable $configurationFile "AZURE_RESOURCE_GROUP_NAME" $resourceGroupName
Set-ConfigurationFileVariable $configurationFile "AZURE_OPENAI_ENDPOINT" $openAIEndpoint
Set-ConfigurationFileVariable $configurationFile "AZURE_OPENAI_API_KEY" $openAIKey
Set-ConfigurationFileVariable $configurationFile "AZURE_OPENAI_VISION_MODEL_DEPLOYMENT_NAME" $openAIVisionModelDeploymentName

## Install .NET dependencies

This notebook uses .NET to interact with the Azure OpenAI Service. It takes advantage of the following NuGet packages:

### PDFtoImage

The [PDFtoImage](https://github.com/sungaila/PDFtoImage) library is used to convert PDF documents to JPEG images. The library provides a simple layer to convert PDF documents using the static `PDFtoImage.Conversion` class. Reading the bytes of the PDF, the library will create an image and store it with a given file name.

### DotNetEnv

The [DotNetEnv](https://github.com/tonerdo/dotnet-env) library is used to load environment variables from a `.env` file which can be accessed via the `Environment.GetEnvironmentVariable(string)` method. This library is used to load the Azure OpenAI Service endpoint, key and model deployment name from the [`./config.env`](./config.env) file.
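
A small guard can make a missing or empty setting easier to diagnose than a failed HTTP call later on. The following is a minimal sketch rather than part of the notebook: `GetRequiredSetting` is a hypothetical helper, and the endpoint value is a placeholder that stands in for what `Env.Load("config.env")` would normally provide.

```csharp
using System;

// Hypothetical helper: reads a required environment variable and fails fast
// with a descriptive message when it is missing or empty.
string GetRequiredSetting(string name)
{
    string? value = Environment.GetEnvironmentVariable(name);
    if (string.IsNullOrWhiteSpace(value))
    {
        throw new InvalidOperationException($"Environment variable '{name}' is not set. Check config.env.");
    }
    return value;
}

// Stand-in for Env.Load("config.env"): set a placeholder value for this process only.
Environment.SetEnvironmentVariable("AZURE_OPENAI_ENDPOINT", "https://example.openai.azure.com/");
Console.WriteLine(GetRequiredSetting("AZURE_OPENAI_ENDPOINT"));
```

The same helper could be applied to `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_VISION_MODEL_DEPLOYMENT_NAME` once the configuration file has been loaded.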

In [None]:
#r "nuget:System.Text.Json, 8.0.1"
#r "nuget:DotNetEnv, 3.0.0"
#r "nuget:PDFtoImage, 4.0.1"

In [None]:
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.Json.Nodes;
using System.Text.Json;
using System.IO;

using DotNetEnv;
using PDFtoImage;
using SkiaSharp;

In [None]:
Env.Load("config.env");

var endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
var apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY");
var modelDeployment = Environment.GetEnvironmentVariable("AZURE_OPENAI_VISION_MODEL_DEPLOYMENT_NAME");
var apiVersion = "2023-12-01-preview";

var pdfName = "TR4-ESHS6500381 2017_10_17 01_45_45.pdf";
var pdfImageName = "TR4-ESHS6500381 2017_10_17 01_45_45.pdf_stitched.jpg";

## Convert PDF to image

For the GPT-4 Vision model to extract structured JSON data from a PDF document, the document must first be converted to an image. The following code demonstrates how to convert a PDF document to a JPEG image using the `PDFtoImage` library.

### Important notes for image analysis with the GPT-4 Vision model

- The maximum size for images is restricted to 20MB.
- The `image_url` parameter in the message body has a `detail` property that can be set to `low` to enable a lower resolution image analysis for faster results with fewer tokens. However, this could impact the accuracy of the result.
- When providing images, there is a limit of 10 images per call.
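
As a rough sketch of how the first two notes can be applied, the following builds an `image_url` message part with the `detail` property set, and rejects images over the 20MB limit before they are sent. The helper name `BuildImagePart` is hypothetical, and measuring the raw image bytes (rather than the Base64-encoded length) is an assumption made for illustration.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper: wraps raw JPEG bytes in an image_url content part.
// Assumption: the 20MB limit is checked against the raw bytes.
JsonObject BuildImagePart(byte[] imageBytes, string detail = "low")
{
    const long maxImageBytes = 20L * 1024 * 1024;
    if (imageBytes.Length > maxImageBytes)
    {
        throw new ArgumentException($"Image is {imageBytes.Length} bytes, which exceeds the 20MB limit.");
    }

    string base64 = Convert.ToBase64String(imageBytes);
    return new JsonObject
    {
        ["type"] = "image_url",
        ["image_url"] = new JsonObject
        {
            ["url"] = $"data:image/jpeg;base64,{base64}",
            ["detail"] = detail
        }
    };
}

// Three placeholder bytes stand in for a real JPEG file.
JsonObject part = BuildImagePart(new byte[] { 0xFF, 0xD8, 0xFF }, detail: "low");
Console.WriteLine(part.ToJsonString());
```

Setting `detail` to `low` trades accuracy for speed and token usage, so it is worth comparing results against the default before relying on it for dense documents.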

Based on these notes, you may need to perform pre-processing of your PDF when converting it to images to ensure that the images are within the size limits and that the resolution is appropriate for the analysis. This may include:

- Reducing the resolution of the images.
- Splitting the PDF into one image per page, if it contains 10 pages or fewer.
- Stitching multiple pages together into fewer images, if the PDF contains more than 10 pages.
- Compressing the images to reduce the file size.

Experiment with different pre-processing techniques to find the best approach for your specific use case.
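
One way to reason about the split-or-stitch decision is to work out how many pages each image must hold so that the total stays within the 10-image limit. The sketch below is only a planning helper under that assumption: `PlanPageGroups` is a hypothetical name, and it does not render or stitch anything itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups zero-based page indexes so that no more than
// maxImages images are produced. Each group would be stitched into one image.
List<List<int>> PlanPageGroups(int pageCount, int maxImages = 10)
{
    if (pageCount < 0) throw new ArgumentOutOfRangeException(nameof(pageCount));
    if (maxImages < 1) throw new ArgumentOutOfRangeException(nameof(maxImages));

    // Pages per image, rounded up. This is 1 when the PDF already fits within the limit.
    int pagesPerImage = Math.Max(1, (pageCount + maxImages - 1) / maxImages);

    return Enumerable.Range(0, pageCount)
        .Chunk(pagesPerImage)
        .Select(chunk => chunk.ToList())
        .ToList();
}

foreach (List<int> group in PlanPageGroups(pageCount: 23))
{
    Console.WriteLine($"Image with pages: {string.Join(", ", group)}");
}
```

Keeping the groups as evenly sized as possible limits how tall any single stitched image becomes, which matters because very tall images are more likely to run into the size limit.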

The following code uses .NET to convert a multi-page PDF document into a single image by stitching the pages together vertically.

In [None]:
var pdf = await File.ReadAllBytesAsync(pdfName);

// ToImages returns an IEnumerable, so materialize it once to avoid rendering the PDF again on each enumeration
var pageImages = PDFtoImage.Conversion.ToImages(pdf).ToList();

int totalHeight = pageImages.Sum(image => image.Height);
int width = pageImages.Max(image => image.Width);

using (var stitchedImage = new SKBitmap(width, totalHeight))
using (var canvas = new SKCanvas(stitchedImage))
{
    // Fill with white so that areas not covered by a narrower page are not left blank in the JPEG
    canvas.Clear(SKColors.White);

    // Draw each page directly below the previous one
    int currentHeight = 0;
    foreach (var pageImage in pageImages)
    {
        canvas.DrawBitmap(pageImage, 0, currentHeight);
        currentHeight += pageImage.Height;
    }
    canvas.Flush();

    using (var stitchedFileStream = new FileStream(pdfImageName, FileMode.Create, FileAccess.Write))
    {
        stitchedImage.Encode(stitchedFileStream, SKEncodedImageFormat.Jpeg, 100);
    }
}

Console.WriteLine($"Stitched {pdfName} into {pdfImageName}");

## Use GPT-4-Vision-Preview to extract the data from the image

Now that the PDF document has been converted to an image, the GPT-4 Vision model can be used to extract structured JSON data from the image. The following code demonstrates how to use the deployed Azure OpenAI Service directly via the API to extract structured JSON data from the image.

In this example, the payload for the Chat Completions endpoint is a JSON object with the following details:

### System Prompt

The system prompt is the instruction that prescribes the model's behavior. It allows you to constrain the model to a specific task, making it better suited to use cases such as extracting structured JSON data from documents.

In this case, it is to extract structured JSON data from the image. Here is what we have provided:

**You are an AI assistant that extracts data from documents and returns them as structured JSON objects. If a value is not present, provide null. The TR4 document has multiple sections with location information, applicant information, plot diagram, and test report. Do not return as a code block. Use only information extracted from image. Do not make up any information.**

> **Note:** GPT-4 Vision doesn't currently allow the `response_format` parameter to be set to `json_object`. To avoid the response being returned as a code block, the system prompt includes an explicit instruction not to do so.

Learn more about [system prompts](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message).
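
Because the instruction not to return a code block is only a request to the model, a reply may occasionally still arrive wrapped in a Markdown code fence. The following is a defensive sketch for that case; `StripCodeFence` is a hypothetical helper and the fenced string is made-up input, not output from the notebook.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: removes a surrounding Markdown code fence, if present,
// so that the remaining text can be parsed as JSON.
string StripCodeFence(string content)
{
    string trimmed = content.Trim();
    if (!trimmed.StartsWith("```", StringComparison.Ordinal))
    {
        return trimmed;
    }

    // Drop the opening fence line, including any language tag on it.
    int firstNewLine = trimmed.IndexOf('\n');
    if (firstNewLine < 0)
    {
        return trimmed;
    }
    trimmed = trimmed[(firstNewLine + 1)..];

    // Drop the closing fence, if there is one.
    int closingFence = trimmed.LastIndexOf("```", StringComparison.Ordinal);
    if (closingFence >= 0)
    {
        trimmed = trimmed[..closingFence];
    }

    return trimmed.Trim();
}

string fenced = "```json\n{ \"JobNumber\": \"321186970\" }\n```";
using JsonDocument parsed = JsonDocument.Parse(StripCodeFence(fenced));
Console.WriteLine(parsed.RootElement.GetProperty("JobNumber").GetString());
```

Text that is not fenced passes through unchanged apart from trimming, so the helper can be applied to every response before parsing.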

### User Prompt

The user prompt is the input to the model that provides the context it uses to generate a response.

In this case, it is the image of the document plus some additional text context to help the model understand the task. Here is what we have provided:

**Extract the data from this Soil Investigation Technical Report. There may be multiple Boring pits and soil Test Pits in section 5 Test Report in which case an array can be created for each boring. For the Plot Diagram sections, provide a textual summary for the PlotDiagramSummaryDetails field according to the extracted measurements for an engineer to read. The plot diagram is for the address found in location information section of the document. The Plot diagram drawing will show boring points. Extract the distance for each boring if available in the image and place in the respective DistanceFromNorthBoundary, DistanceFromSouthBoundary, DistanceFromEastBoundary, or DistanceFromWestBoundary fields. If no value then value should be null. Use the following structure, {\"DocumentInformation\", {\"DocumentType\", {\"value\", \"\", \"\"}, \"JobNumber\", {\"value\", \"\", \"\"}, \"ScanCode\", {\"value\", \"\", \"\"}}, \"LocationDetails\", {\"HouseNumber\", {\"value\", \"\", \"\"}, \"StreetName\", {\"value\", \"\", \"\"}, \"Borough\", {\"value\", \"\", \"\"}, \"Block\", {\"value\", \"\", \"\"}, \"Lot\", {\"value\", \"\", \"\"}, \"BIN\", {\"value\", \"\", \"\"}, \"CommunityBoardNumber\", {\"value\", \"\", \"\"}, \"WorkOnFloors\", {\"value\", \"\", \"\"}, \"ApartmentNumber\", {\"value\", \"\", \"\"}}, \"ApplicantInformation\", {\"LastName\", {\"value\", \"\", \"\"}, \"FirstName\", {\"value\", \"\", \"\"}, \"MiddleInitial\", {\"value\", \"\", \"\"}, \"BusinessName\", {\"value\", \"\", \"\"}, \"BusinessAddress\", {\"value\", \"\", \"\"}, \"BusinessPhone\", {\"value\", \"\", \"\"}, \"MobilePhone\", {\"value\", \"\", \"\"}, \"Fax\", {\"value\", \"\", \"\"}, \"State\", {\"value\", \"\", \"\"}, \"City\", {\"value\", \"\", \"\"}, \"Zip\", {\"value\", \"\", \"\"}, \"Email\", {\"value\", \"\", \"\"}, \"LicenseNumber\", {\"value\", \"\", \"\"}, \"PEcheck\", {\"value\", \"\", \"\"}, \"RAcheck\", {\"value\", \"\", \"\"}}, \"BoringDetails\", [{\"BoringNumber\", {\"value\", \"\", \"\"}, \"PlotDiagramSummary\", \"PlotDiagramSummaryDetails\",{\"DistanceFromNorthBoundary\", {\"value\", \"\", \"\"}, \"DistanceFromEastBoundary\", {\"value\", \"\", \"\"}, \"DistanceFromSouthBoundary\", {\"value\", \"\", \"\"}, \"DistanceFromWestBoundary\", {\"value\", \"\", \"\"}}, \"BoringDate\", {\"value\", \"\", \"\"}, \"FeetBelowCurb\", {\"value\", \"\", \"\"}, \"SoilDescription\", [{\"Depth\", {\"value\", \"\", \"\"}, \"Description\", {\"value\", \"\", \"\"}, \"ClassNumber\", {\"value\", \"\", \"\"}, \"Remarks\", {\"value\", \"\", \"\"}}]}], \"AdditionalRemarks\", {\"Remarks\", {\"value\", \"\", \"\"}}}**

> **Note:** For the user prompt, it is ideal to provide a structure for the JSON response. Without one, the model will determine this for you and you may not get consistency across responses. 

This prompt ensures that the model understands the task, and the additional text context provides the model with the necessary information to extract the structured JSON data from the image. This approach would result in a response similar to the following:

```json
{  
  "DocumentInformation": {  
    "DocumentType": "TR4: Technical Report - Soil Investigation",  
    "JobNumber": "321186970",  
    "ScanCode": "ES397871022"  
  },  
  "LocationDetails": {  
    "HouseNumber": "13",  
    "StreetName": "Somers Street",  
    "Borough": "Brooklyn",  
    "Block": "1538",  
    "Lot": "64",  
    "BIN": null,  
    "CommunityBoardNumber": null,  
    "WorkOnFloors": null,  
    "ApartmentNumber": null  
  },  
  "ApplicantInformation": {  
    "LastName": "Lent",  
    "FirstName": "Sanford",  
    "MiddleInitial": "E",  
    "BusinessName": "All Phase Testing, Inc.",  
    "BusinessAddress": "3319 Merritt Avenue",  
    "BusinessPhone": "(718) 994-3200",  
    "MobilePhone": null,  
    "Fax": "(718) 994-5406",  
    "State": "NY",  
    "City": "Bronx",  
    "Zip": "10475",  
    "Email": "info@allphasetesting.com",  
    "LicenseNumber": "046732",  
    "PEcheck": true,  
    "RAcheck": false  
  },  
  "BoringDetails": [  
    {  
      "BoringNumber": "B1",  
      "PlotDiagramSummary": "Boring B1 is located approximately 131' 50\" from Hopkinson Avenue and 5' 0\" from Somers Street.",  
      "DistanceFromNorthBoundary": null,  
      "DistanceFromEastBoundary": null,  
      "DistanceFromSouthBoundary": "5' 0\"",  
      "DistanceFromWestBoundary": "131' 50\"",  
      "BoringDate": null,  
      "FeetBelowCurb": null,  
      "SoilDescription": [],  
      "Remarks": null  
    },  
    {  
      "BoringNumber": "B2",  
      "PlotDiagramSummary": "Boring B2 is located approximately 23' 0\" from the north boundary and 55' 0\" from Boring B1.",  
      "DistanceFromNorthBoundary": "23' 0\"",  
      "DistanceFromEastBoundary": null,  
      "DistanceFromSouthBoundary": null,  
      "DistanceFromWestBoundary": "55' 0\"",  
      "BoringDate": null,  
      "FeetBelowCurb": null,  
      "SoilDescription": [],  
      "Remarks": null  
    }  
  ],  
  "AdditionalRemarks": {  
    "Remarks": "See boring report B-001.00 dated 1/22/17."  
  }  
}
```
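
Once a response in this shape is returned, it can be read with `System.Text.Json`. The following sketch uses a shortened copy of the sample above as stand-in input and lists the distances extracted for each boring. Treating a JSON `null` as "not found" is an assumption based on the system prompt's instruction to provide null for missing values.

```csharp
using System;
using System.Text.Json.Nodes;

// Shortened copy of the sample response, used here as stand-in input.
string responseJson = """
{
  "DocumentInformation": { "JobNumber": "321186970" },
  "BoringDetails": [
    { "BoringNumber": "B1", "DistanceFromNorthBoundary": null, "DistanceFromSouthBoundary": "5' 0\"" },
    { "BoringNumber": "B2", "DistanceFromNorthBoundary": "23' 0\"", "DistanceFromSouthBoundary": null }
  ]
}
""";

JsonNode root = JsonNode.Parse(responseJson)!;
string? jobNumber = root["DocumentInformation"]?["JobNumber"]?.GetValue<string>();
Console.WriteLine($"Job number: {jobNumber}");

foreach (JsonNode? boring in root["BoringDetails"]!.AsArray())
{
    // A JSON null is surfaced as a C# null node, so ?. keeps each lookup safe.
    string? number = boring?["BoringNumber"]?.GetValue<string>();
    string north = boring?["DistanceFromNorthBoundary"]?.GetValue<string>() ?? "not found";
    string south = boring?["DistanceFromSouthBoundary"]?.GetValue<string>() ?? "not found";
    Console.WriteLine($"{number}: north = {north}, south = {south}");
}
```

Reading the response through `JsonNode` avoids committing to a fixed class model while the prompt and its structure are still being refined.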

In [None]:
var base64Image = Convert.ToBase64String(File.ReadAllBytes(pdfImageName));

JsonObject jsonPayload = new JsonObject
{
    {
        "messages", new JsonArray 
        {
            new JsonObject
            {
                { "role", "system" },
                { "content", "You are an AI assistant that extracts data from documents and returns them as structured JSON objects. If a value is not present, provide null. The TR4 document has multiple sections with location information, applicant information, plot diagram, and test report. Do not return as a code block. Use only information extracted from image. Do not make up any information." }
            },
            new JsonObject
            {
                { "role", "user" },
                { "content",
                    new JsonArray
                    {
                        new JsonObject
                        {
                            { "type", "text" },
                            { "text", "Extract the data from this Soil Investigation Technical Report. Use the following structure: {\"DocumentInformation\": {\"DocumentType\": \"\", \"JobNumber\": \"\", \"ScanCode\": \"\"}, \"LocationDetails\": {\"HouseNumber\": \"\", \"StreetName\": \"\", \"Borough\": \"\", \"Block\": \"\", \"Lot\": \"\", \"BIN\": \"\", \"CommunityBoardNumber\": \"\", \"WorkOnFloors\": \"\", \"ApartmentNumber\": \"\"}, \"ApplicantInformation\": {\"LastName\": \"\", \"FirstName\": \"\", \"MiddleInitial\": \"\", \"BusinessName\": \"\", \"BusinessAddress\": \"\", \"BusinessPhone\": \"\", \"MobilePhone\": \"\", \"Fax\": \"\", \"State\": \"\", \"City\": \"\", \"Zip\": \"\", \"Email\": \"\", \"LicenseNumber\": \"\", \"PEcheck\": \"\", \"RAcheck\": \"\"}, \"BoringDetails\": [{\"BoringNumber\": \"\", \"BoringDate\": \"\", \"FeetBelowCurb\": \"\", \"SoilDescription\": \"\", \"ClassNumber\": \"\"}], \"AdditionalRemarks\": {\"Remarks\": \"\"}}" }
                        },
                        new JsonObject
                        {
                            { "type", "image_url" },
                            { "image_url", new JsonObject { { "url", $"data:image/jpeg;base64,{base64Image}" } } }
                        }
                    }
                }
            }
        }
    },
    { "model", modelDeployment },
    { "max_tokens", 4096 },
    { "temperature", 0 },
    { "top_p", 0 },
};

string payload = JsonSerializer.Serialize(jsonPayload, new JsonSerializerOptions
{
    WriteIndented = true
});

In [None]:
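// Trim any trailing slash so that the URL is valid whether or not the configured endpoint ends with one
string visionEndpoint = $"{endpoint.TrimEnd('/')}/openai/deployments/{modelDeployment}/chat/completions?api-version={apiVersion}";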

using (HttpClient httpClient = new HttpClient())
{
    httpClient.BaseAddress = new Uri(visionEndpoint);
    httpClient.DefaultRequestHeaders.Add("api-key", apiKey);
    httpClient.DefaultRequestHeaders.Accept.Add(new System.Net.Http.Headers.MediaTypeWithQualityHeaderValue("application/json"));

    var stringContent = new StringContent(payload, Encoding.UTF8, "application/json");

    var response = await httpClient.PostAsync(visionEndpoint, stringContent);

    if (response.IsSuccessStatusCode)
    {
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        {
            // Parse the JSON response using JsonDocument
            using (var jsonDoc = await JsonDocument.ParseAsync(responseStream))
            {
                // Access the message content dynamically
                JsonElement jsonElement = jsonDoc.RootElement;
                string messageContent = jsonElement.GetProperty("choices")[0].GetProperty("message").GetProperty("content").GetString();

                // Output the message content
                Console.WriteLine(messageContent);
            }
        }
    }
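    else
    {
        // Output the status code and the error body returned by the service
        string errorBody = await response.Content.ReadAsStringAsync();
        Console.WriteLine($"Request failed with status {(int)response.StatusCode} ({response.StatusCode}): {errorBody}");
    }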
}