---
---

<h1>Notebook: Extract Info from Unstructured Documents using AISAY API</h1>

## Setting up Notebook

In [None]:
import os
import pathlib
import requests
import json
import time
import pandas as pd
import logging

In [None]:
import os

for filename in os.listdir():
  print(filename)

.config
product_brochure.pdf
sample_data


In [None]:
import requests

file_urls = [
    'https://abc-notes.data.tech.gov.sg/resources/data/sample-medicine-label.jpg',
    'https://abc-notes.data.tech.gov.sg/resources/data/product_brochure.pdf'
]

for file_url in file_urls:
    try:
        # Send a GET request to the URL
        response = requests.get(file_url)
        response.raise_for_status()  # Check if the request was successful

        # Extract the file name from the URL
        file_name = os.path.basename(file_url)

        # Save the file to the current directory
        with open(file_name, 'wb') as file:
            file.write(response.content)

        print(f"Downloaded {file_name}")

    # Handle exceptions for failed requests
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while downloading the file: {e}")

Downloaded sample-medicine-label.jpg.
Downloaded product_brochure.pdf.


---
---
<br>

# Example Use Case 1

- This walkthrough breaks the API calls into individual steps to make the process easier to follow.

---

## Authenticating with an API key

- This involves two steps:
    1. Send a POST request to the token endpoint with the client credentials (client ID and client secret) to get an access token
    2. Extract the access token from the response and include it in the Authorization header of subsequent requests

---

<br>

### Getting the Access Token



**Purpose of the Access Token**:
1. **Authentication**: The access token is used to authenticate the client when making API requests. It proves that the client has the necessary permissions to access the API.
2. **Authorization**: The token contains information about the client's permissions (scopes). It ensures that the client can only perform actions that it is authorized to do.
3. **Security**: Using tokens instead of directly passing credentials (like username and password) enhances security. Tokens are typically short-lived and can be revoked if compromised.
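Because tokens are short-lived, a long-running notebook session may need to request a fresh one. The sketch below caches a token and refetches it shortly before expiry. It assumes the token response includes the standard OAuth 2.0 `expires_in` field (seconds), which this notebook does not confirm; `TokenCache` and `fetch_token` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Optional


class TokenCache:
    """Caches an access token and refetches it shortly before it expires.

    `fetch_token` is a hypothetical callable that performs the token request
    and returns the parsed JSON response as a dict.
    """

    def __init__(self, fetch_token: Callable[[], Dict],
                 margin_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if missing or near expiry."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            response = self._fetch_token()
            self._token = response["access_token"]
            # Assumption: `expires_in` is present; fall back to one hour if it is not
            self._expires_at = self._clock() + float(response.get("expires_in", 3600))
        return self._token


# Demonstration with a fake fetch function and a fake clock (no network access)
fetch_calls = []
fake_now = [0.0]


def fake_fetch() -> Dict:
    fetch_calls.append(1)
    return {"access_token": f"token-{len(fetch_calls)}", "expires_in": 100}


cache = TokenCache(fake_fetch, margin_seconds=10, clock=lambda: fake_now[0])
first = cache.get()       # fetches a token
fake_now[0] = 50.0
second = cache.get()      # still valid, served from the cache
fake_now[0] = 95.0
third = cache.get()       # inside the 10-second margin, so a new token is fetched
```

In real use, `fetch_token` would wrap the token request made later in this notebook and return `response.json()`.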

---

**Why We Need the Token**:
1. **Secure API Access**: The API requires an access token to ensure that only authenticated and authorized clients can access its endpoints. This prevents unauthorized access and potential misuse.
2. **Client Credentials Flow**: This notebook uses the OAuth 2.0 Client Credentials Grant flow. This flow is used when the client (e.g., a backend service) needs to access resources on behalf of itself, not on behalf of a user.
3. **Scope Limitation**: The token includes scopes (e.g., `aisay-api/query`) that define what the client can do. This limits the client's actions to only what is necessary, reducing the risk of unintended operations.
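In the Client Credentials flow, the client ID and secret are commonly sent to the token endpoint using HTTP Basic authentication, which is what `HTTPBasicAuth` does in the token request later in this notebook. As a rough illustration of what that header contains, the sketch below builds it by hand with the standard library. The function name and the placeholder credentials are assumptions for illustration only.

```python
import base64


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header (RFC 7617)."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder credentials for illustration only; never print real secrets
header_value = basic_auth_header("my-client-id", "my-client-secret")
print(header_value)
```

Base64 is an encoding, not encryption, so this header is only safe over HTTPS.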

---

**How We Use the Token with the AISAY API**:
In this notebook, the access token is used for making subsequent API requests to the `aisay-api`. The process involves:
1. **Requesting the Access Token**: The client sends a request to the AWS Cognito token endpoint with its credentials and the desired scope.
2. **Using the Access Token**: The client includes the access token in the Authorization header of subsequent API requests to authenticate and authorize those requests.

This approach ensures that the API requests are secure and that the client has the necessary permissions to perform the requested actions.


---

In [None]:
# Prompt for the client credentials so they are not hard-coded in the notebook
CLIENT_ID = input("Enter your client ID: ")

Enter your client ID: 1t8qqsphkhvloqshclb4pa3oba


In [None]:
CLIENT_SECRET = input("Enter your client secret: ")

---

- This cell requests an access token from the AWS Cognito token endpoint, which AISAY uses for authentication. The token is then used to authenticate requests to the AISAY API.

In [None]:
import requests
from requests.auth import HTTPBasicAuth

# Define the URL for the AWS Cognito Token endpoint
url = "https://aisay-stg.auth.ap-southeast-1.amazoncognito.com/oauth2/token"

# Define the payload for the access token request
payload = {
    "grant_type": "client_credentials",
    "scope": "aisay-api/query"
}

# Reuse the credentials entered above rather than hard-coding secrets in the notebook
client_id = CLIENT_ID
client_secret = CLIENT_SECRET

# Make the request to the AWS Cognito Token endpoint
response = requests.post(
    url,
    data=payload,
    auth=HTTPBasicAuth(client_id, client_secret),
    headers={"Content-Type": "application/x-www-form-urlencoded"}
)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    token_response = response.json()
    access_token = token_response.get("access_token")
    print("Access Token obtained")
else:
    print("Failed to retrieve access token:", response.status_code, response.text)

Access Token obtained


---

### Inserting the token into the `headers`

- This code cell defines the headers that will be used for API calls to the AISAY API.

- The headers can be reused for multiple API calls.

In [None]:
# Set Authorization header for API calls
headers_w_token = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {access_token}'
}
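One gap worth guarding against: if the token response lacked an `access_token` key, `.get("access_token")` returns `None` and the header above would read `Bearer None`, which fails later with a less obvious error. A minimal sketch of a guard, using a hypothetical helper name `build_bearer_headers`:

```python
from typing import Dict, Optional


def build_bearer_headers(access_token: Optional[str]) -> Dict[str, str]:
    """Return JSON request headers carrying the bearer token, or raise if it is missing."""
    if not access_token:
        raise ValueError("No access token available; rerun the token request first.")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {access_token}",
    }


# Placeholder token for illustration only
example_headers = build_bearer_headers("example-token")
print(example_headers["Authorization"])
```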

---

## Uploading Files to AISAY

- Files must be uploaded to cloud storage before they can be processed.

- This subsection shows how to upload the file(s).

- Uploading a file to cloud storage takes two main steps:
    1. Get Pre-signed URL for the file upload
    2. Upload the file to the cloud storage using the Pre-signed URL

---

### Get Presigned URL

A pre-signed URL is a URL that grants temporary access to a specific resource, typically in cloud storage, without requiring the user to have direct access credentials. Here's why and how it is used:

**Purpose of Pre-Signed URL**:
1. **Secure Access**: It allows secure access to a resource without exposing sensitive credentials.
2. **Temporary Access**: The URL is valid only for a limited time, reducing the risk of unauthorized access.
3. **Controlled Permissions**: It can be configured with specific permissions, such as read or write access.

**Why We Need a Pre-Signed URL**:
1. **File Uploads**: When uploading files to cloud storage (e.g., AWS S3, Google Cloud Storage), a pre-signed URL allows clients to upload directly to the storage service without going through the server, reducing server load and bandwidth usage.
2. **File Downloads**: It enables clients to download files securely without exposing the storage service's credentials.
3. **Delegated Access**: It allows third parties to access specific resources without giving them full access to the storage service.

**In this Notebook**:
In the code, the pre-signed URL is used for uploading a file. The process typically involves:
1. **Requesting a Pre-Signed URL**: The client sends a request to the server (API endpoint) to get a pre-signed URL for the file upload.
2. **Using the Pre-Signed URL**: The client then uses this URL to upload the file directly to the cloud storage.

This approach ensures that the file upload is secure, efficient, and does not require the server to handle the file data directly.

In [None]:
# API Gateway settings
API_ENDPOINT = 'https://stg.ai.ff.gov.sg/preSignedUrl'

In [None]:

data = {
  "filename": "sample-medicine-label.jpg"
}

try:
    serialized_data = json.dumps(data)  # Serialize the data dictionary as JSON string
    print('Sending data \n')
    response = requests.post(API_ENDPOINT, data=serialized_data, headers=headers_w_token)
    response.raise_for_status()  # Raise an exception for unsuccessful requests
    presigned_response = response.json()
    print('POST request was successful.')
    print('Response:', presigned_response)
except requests.exceptions.RequestException as e:
    print('Error occurred:', e)


Sending data 

POST request was successful.
Response: {'url': 'https://aisay-api-files-storage-stg.s3.amazonaws.com/', 'fields': {'key': 'ba8da0a5-5b77-4824-ba65-04f0ee0da20e.jpg', 'AWSAccessKeyId': 'ASIA3QRED2HIAJSEO4BO', 'x-amz-security-token': 'IQoJb3JpZ2luX2VjEPj//////////wEaDmFwLXNvdXRoZWFzdC0xIkcwRQIhAMEKDtIzYBByq+njzQ8fynSN5ZbMPYruhs6JDqp6TXXMAiBQ6UeadFmwxRKv8mzrnfL6AFn07ab6YfQN0OfaosWIqCq0AwgREAAaDDc5MTQyMzQ3MjA4MCIMi/5vNxZID7w121D9KpEDIxR202gdByxpNMdpQRTdFaC2p6EXcM49kR9jfTZtor8xNMIUs8BfwqtH1U+XQSs1++6AJVRhw/jPGiNq5QXLPP/7GBjM+jboeKnfjjW9Jj+T2XNdm+2OHQ1TTFjaldX7XuJ4bnkj6tBM8oj6uY2cETrkSPAYyq9h/VqsumzeE+wstxZjGTmdypYPzdNwHO0UDbBMqnyUBgAE8WuLnOP4exenJ6Evcs1gFITxsI25KyQz+iP4dF7+bky6/6MavIWltepJbLHMHRST6t3CuAfDj/L6tkNOfBUYCcuHipj4ezk+4lpMkGUB39A+/YyJV7k/j6GOSHehIbRgAFpaiAXH7WzXFZc8vCUoUl4ntgQK+vmOPAsIfphm9cRjq7up+PwdE0no856qbtayfDO+zXt4uIXFvXs8EcbvuRAw3u7MV7H0bdXZoPo42nQQ+RrwHHum3lxm2jrtAFbnFIjAYU7yIOFPmCoaUHpiEXszJcpphxHeMum44a5GxcwGrOEIQL0tyugPVimCUfjK93xvjmaRdpAwqpK2tgY6ngGZc0mP

In [None]:
# Observe how the response is structured
presigned_response['fields']

---

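The `fields` dictionary in the response contains the form fields typically used for uploading a file to an Amazon S3 bucket using a presigned URL. Here is a breakdown of each field:

1. **`key`**: The object key (name) under which the file is stored in the S3 bucket. In the response above it is a server-generated name, not the original filename.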

2. **`AWSAccessKeyId`**: The AWS access key ID that is used to authenticate the request.

3. **`x-amz-security-token`**: A security token that is used in conjunction with the AWS access key ID to authenticate the request.

4. **`policy`**: A base64-encoded policy document that specifies the conditions for the upload, such as expiration time and allowed file size.

5. **`signature`**: The signature generated based on the policy document and the secret key, used to verify the authenticity of the request.

These fields are used together to securely upload a file to an S3 bucket using a presigned URL. The presigned URL allows you to grant temporary access to the S3 bucket without exposing your AWS credentials.
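To see what the `policy` field allows, it can be decoded locally, since it is base64-encoded JSON. The sketch below uses a synthetic policy so that it runs without network access; the bucket name, key, and expiration are made-up values, and `decode_policy` is a hypothetical helper name. With a real response, the input would be `presigned_response['fields']['policy']`.

```python
import base64
import json
from typing import Any, Dict


def decode_policy(policy_b64: str) -> Dict[str, Any]:
    """Decode a base64-encoded POST policy document into a dict."""
    return json.loads(base64.b64decode(policy_b64).decode("utf-8"))


# Synthetic policy for illustration only (all values are made up)
sample_policy = {
    "expiration": "2030-01-01T00:00:00Z",
    "conditions": [{"bucket": "example-bucket"}, {"key": "example.jpg"}],
}
encoded_policy = base64.b64encode(json.dumps(sample_policy).encode("utf-8")).decode("ascii")

decoded_policy = decode_policy(encoded_policy)
print("Expires:", decoded_policy["expiration"])
for condition in decoded_policy["conditions"]:
    print("Condition:", condition)
```

Checking the expiration this way can help explain an upload that is rejected after the presigned fields have gone stale.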

---
---

### Uploading the File


In [None]:
# Path to the file you want to upload
file_path = 'sample-medicine-label.jpg'

with open(file_path, 'rb') as f:
    # Send the file together with the presigned form fields
    files = {'file': (presigned_response['fields']['key'], f)}

    # Use the upload URL returned with the presigned fields instead of hard-coding it
    http_response = requests.post(presigned_response['url'],
                                  data=presigned_response['fields'],
                                  files=files)

# Check if upload was successful
if http_response.status_code == 204:  # HTTP status 204 means 'No Content' which is a success response for this operation
    print("File uploaded successfully!")
else:
    print(f"File upload failed with status code {http_response.status_code}: {http_response.text}")


File uploaded successfully!


## Query the APIs

- API endpoint: `/query/async`
    - This endpoint extracts user-defined fields from the uploaded document.
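The payload used in this notebook pairs an `s3_object` key with a `gpt_query` mapping of field names to a description and a type. When many fields share the same type, a small helper can reduce repetition. This is a sketch based only on the payload shape shown in this notebook; `build_query` is a hypothetical helper, not part of the AISAY API, and the key below is a placeholder.

```python
import json
from typing import Dict


def build_query(s3_key: str, field_descriptions: Dict[str, str],
                field_type: str = "string") -> Dict:
    """Assemble a query payload with one entry per field to extract."""
    return {
        "s3_object": {"key": s3_key},
        "gpt_query": {
            name: {"description": description, "type": field_type}
            for name, description in field_descriptions.items()
        },
    }


# Placeholder key for illustration; a real key comes from the presigned response
example_payload = build_query(
    "example-key.jpg",
    {
        "Quantity": "This is the total quantity of the medication. This is usually a numerical value",
        "Dosage": "This is the dosage per quantity. E.g. 10 mg",
    },
)
print(json.dumps(example_payload, indent=2))
```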

In [None]:
# Define the data to be sent to the API

data = {
    "s3_object": {"key": presigned_response["fields"]["key"]},
    "gpt_query": {
        "Quantity": {
            "description": "This is the total quantity of the medication. This is usually a numerical value",
            "type": "string",
        },
        "Dosage": {
            "description": "This is the dosage per quantity. E.g. 10 mg",
            "type": "string",
        },
        "Dosage_Measurement": {
            "description": "This is the scientific measurement of the dosage. E.g. mg",
            "type": "string",
        },
        "Prescription_Date": {
            "description": "This is the date which the medication is prescribed in dd/mm/yyyy",
            "type": "string",
        }
    },
}

In [None]:
# Send the API Call (Query) to the API
try:
    serialized_data = json.dumps(data)  # Serialize the data dictionary as JSON string
    print('Sending data \n')
    response = requests.post('https://stg.ai.ff.gov.sg/query/async', data=serialized_data, headers=headers_w_token)
    response.raise_for_status()  # Raise an exception for unsuccessful requests

    print('POST request was successful.')
    print('Response:', response.json())
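except requests.exceptions.RequestException as e:
    # e.response is None when no HTTP response was received (e.g. a connection error)
    if e.response is not None:
        print(e.response.text)
    print('Error occurred:', e)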

Sending data 

POST request was successful.
Response: {'result_url': 'https://aisay-api-result-storage-stg.s3.amazonaws.com/ba8da0a5-5b77-4824-ba65-04f0ee0da20e.jpg?AWSAccessKeyId=ASIA3QRED2HIOGGYBLMX&Signature=FYOKwakHf6oP1Vg%2Fn6cr7hMVeXA%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEPj%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDmFwLXNvdXRoZWFzdC0xIkcwRQIgGOwOcjKEhH4WY%2BGMZNn4JpvoR5z9xQ6OCul%2Fc90ouw8CIQDNZfHPFhg%2B5t17%2BuZOvVjNmM%2B9DY5DMUOPWzRLVoif8yrCAwgREAAaDDc5MTQyMzQ3MjA4MCIMC%2Fa9zgJLNMzhkWHiKp8DrFWM14t1VR9nRlFgXqczkBGMFMT%2F9F3BjjjTsVOYmo1NCFhabuERfoiOGe92%2FuSFcDuvfPbXZarbUOs2xrLdxitdva0lhRBKjh6EZPOr4d092Lai8bYunEJB%2B7EL8OAEI0E%2Frnybtv8lWVwjYDkD0LzydcdjgI2uQQnvMUl0EmRaBjuWVdr8iQEf0wpaKKAhMtKxv9RVbQxeZsXp8nudvfka7JIeq5WZXxBNN1hyxmB4CSi1f9I0UU9yq3xMRcZCZB%2BTEItInJDbNxcDlnLqlBbsTl21%2BhyWoupJ7NCY7a8yUdJ6AlYtKi95UT2vpqJS6NSxJOUI3zQ7lBQSnP8Z2Ho%2FS4KDldF47LGqTLpvkpI%2B98Tj7JOtt9QNL2OksUXQkn%2BzVsESxYP4I62JpC%2BcZ25JDL7tno8p07udJzPx%2B41Aym0ixNDQhMCsBzNB6QOgd9%2B%2FOvQdAgD3nStWJWs%2B1y67%2F

---

> Please wait 1 to 2 minutes before running the next cell.

In [None]:
retries = 0
while retries < 3:
    try:
        result_response = requests.get(response.json()['result_url'])
        if result_response.status_code == 200:
            print(result_response.json())
            break
        else:
            print(f"Attempt {retries + 1} failed with status code: {result_response.status_code}")

    except requests.RequestException as e:
        print(f"Request failed: {e}")

    time.sleep(5)  # Wait for a specified timeout period before retrying
    retries += 1

Attempt 1 failed with status code: 404
Attempt 2 failed with status code: 404
{'fields': {'Quantity': '90', 'Dosage': '50', 'Dosage_Measurement': 'MCG', 'Prescription_Date': '01/12/2018'}, 'quota': 'Pay per use: 12 used.'}
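The retry cell above makes three attempts with fixed 5-second waits, which may not cover a result that takes a minute or two to appear. One alternative is to poll against an overall deadline with increasing delays. The sketch below is one way to do that, not the AISAY-recommended approach; `poll_until_ready` and its `fetch` argument are hypothetical names, and the demonstration uses a simulated fetch instead of a real request.

```python
import time
from typing import Callable, Optional, Tuple


def poll_until_ready(fetch: Callable[[], Tuple[int, Optional[dict]]],
                     timeout_seconds: float = 120.0,
                     initial_delay: float = 2.0,
                     max_delay: float = 20.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> dict:
    """Call `fetch` until it returns status 200 or the timeout would be exceeded.

    `fetch` is a hypothetical callable returning (status_code, parsed_json_or_None).
    """
    deadline = clock() + timeout_seconds
    delay = initial_delay
    while True:
        status, body = fetch()
        if status == 200 and body is not None:
            return body
        if clock() + delay > deadline:
            raise TimeoutError(f"Result not ready within {timeout_seconds} seconds (last status {status})")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait each time, capped at max_delay


# Simulated fetch: not ready twice, then a result (no network access needed)
simulated = iter([(404, None), (404, None), (200, {"fields": {"Quantity": "90"}})])
recorded_waits = []
poll_result = poll_until_ready(lambda: next(simulated), sleep=recorded_waits.append)
print(poll_result, recorded_waits)

# With requests, `fetch` could look like this (assuming `result_url` holds the URL):
# def fetch_result():
#     r = requests.get(result_url)
#     return r.status_code, (r.json() if r.status_code == 200 else None)
```

Injecting `sleep` and `clock` keeps the helper easy to test without real waiting.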


---
---

<br>

# Example Use Case 2

- The code below applies the same steps to a second use case.

- This time the code is shown with only brief comments, so the overall flow is easier to see.

---

In [None]:
# Path to the file you want to upload
file_path = 'product_brochure.pdf'


data = {
  "filename": file_path
}

#region <--------------- Get PreSigned URL --------------->
try:
    serialized_data = json.dumps(data)  # Serialize the data dictionary as JSON string
    print('Sending data \n')
    response = requests.post(API_ENDPOINT, data=serialized_data, headers=headers_w_token)
    response.raise_for_status()  # Raise an exception for unsuccessful requests
    presigned_response = response.json()
    print('POST request was successful.')
    print('Response:', presigned_response)
except requests.exceptions.RequestException as e:
    print('Error occurred:', e)
#endregion <--------------- Get PreSigned URL --------------->

#region <--------------- Upload File --------------->
with open(file_path, 'rb') as f:
    # Merge file data with provided fields
    files = {'file': (presigned_response['fields']['key'], f)}

    # Use the upload URL returned with the presigned fields instead of hard-coding it
    http_response = requests.post(presigned_response['url'],
                                  data=presigned_response['fields'],
                                  files=files)

# Check if upload was successful
if http_response.status_code == 204:  # HTTP status 204 means 'No Content' which is a success response for this operation
    print("File uploaded successfully!")
else:
    print(f"File upload failed with status code {http_response.status_code}: {http_response.text}")

#endregion <--------------- Upload File --------------->

In [None]:
#region <--------------- Construct Query --------------->
data = {
    "s3_object": {"key": presigned_response["fields"]["key"]},
    "gpt_query": {
        "Summary": {
            "description": "Give me a 100 word summary of the product",
            "type": "string",
        },
        "Product_Specs_Match": {
            "description": "Is the product able to reduce the temperature of cooked food from 70-C to 3-C or below within 90 minutes or faster. The temperature range performance can be better but not worse. Return `True` or `False`",
            "type": "bool",
        },
    },
}
#endregion <--------------- Construct Query --------------->


#region <--------------- Send Query --------------->
try:
    serialized_data = json.dumps(data)  # Serialize the data dictionary as JSON string
    print('Sending data \n')
    response = requests.post('https://stg.ai.ff.gov.sg/query/async', data=serialized_data, headers=headers_w_token)
    response.raise_for_status()  # Raise an exception for unsuccessful requests

    print('POST request was successful.')
    print('Response:', response.json())
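except requests.exceptions.RequestException as e:
    # e.response is None when no HTTP response was received (e.g. a connection error)
    if e.response is not None:
        print(e.response.text)
    print('Error occurred:', e)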
#endregion <--------------- Send Query --------------->

Sending data 

POST request was successful.
Response: {'result_url': 'https://aisay-api-result-storage-stg.s3.amazonaws.com/7d1baa61-a6cf-4f73-834e-53ba777a86a9.pdf?AWSAccessKeyId=ASIA3QRED2HIGGIKLKLX&Signature=DRuj6Cvc66l9vashZ2g2CFH6aCQ%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEDYaDmFwLXNvdXRoZWFzdC0xIkcwRQIgBknnFncz4If%2Bx9w%2BrTvFoCVyWF641ktRs8XOwyDvFo8CIQC0kVssfJ90rEvPskaAQqILQHrHp0ULTueoG8MJoATmOCrCAwg%2FEAAaDDc5MTQyMzQ3MjA4MCIM%2ByOHtJu6qz5v0zE5Kp8DO81DC8scsswGvdqzVuy0XKvDkiwCJmd03ZMA0HoHs%2BkHjArOluf8RKODAQtz5Eqjx3uQH%2BrhBuihrlbqow%2ByKkFsQ3NsWwMmyrSbM5cJgc8xu%2BR8pxxyFHGMhuND6JE8jkEx79k8%2FaJlBW9Wjc%2BQsvUn9VBNdbzEUPClShr3o125WCf4OPZQYZhJRQlPicSZHGLnIiSfYahdbTPcK5flbfiGoREeTiVClelm5bDdlI3W6BoatwsalRB5xdG3U%2FlADUr%2FZXza%2BQ6LPpOLazTzd1kdsXw2TowF%2FM4DEzS9LlWwGgKUe0y0R4ex0AARkmQeTDjcnBO%2FfltCGqwEYMUZbxbf173FG3vKKR88zPttrBvAyzGphp%2BzW7Ql5M9Wyb4mOuhWYv6I%2BHRk9hYUpGnFGzMLSmhhOaNdAXJx76HvNwAKL1LC8xjqezf58Q3XGpXclKoZMl1ymZgmwA918jph4bwogcU%2BkjSsAVPoHWw54cr6TBvbw5TpUXCSWrlbtWtCH

> Please wait about 1 to 2 minutes before running the cell below:

In [None]:
retries = 0
while retries < 3:
    try:
        result_response = requests.get(response.json()['result_url'])
        if result_response.status_code == 200:
            print(result_response.json())
            break
        else:
            print(f"Attempt {retries + 1} failed with status code: {result_response.status_code}")

    except requests.RequestException as e:
        print(f"Request failed: {e}")

    time.sleep(5)  # Wait for a specified timeout period before retrying
    retries += 1

{'fields': {'Quantity': '90', 'Dosage': '50 MCG', 'Dosage_Measurement': 'MCG', 'Prescription_Date': '12-01-2018'}, 'quota': 'Pay per use: 11 used.'}

