# Textract with HF-FLAN-T5 model for text2text generation

With the increasing popularity of Foundation Models (FMs) and Large Language Models (LLMs), it is crucial to consider secure implementation practices. Here are some important points to remember:

- Avoid passing sensitive data, like Personally Identifiable Information (PII), to these models due to the risk of exposing such information to unauthorized parties.

- Utilizing AI models that handle sensitive data can raise ethical concerns, as the potential misuse of such data could have serious implications, including the creation and usage of biased models.

-  LLMs may unintentionally disclose sensitive information through the outputs they generate.


This notebook focuses on how you can use **[Amazon Textract](https://aws.amazon.com/textract/)** and **[Amazon Comprehend](https://aws.amazon.com/comprehend/)** to redact sensitive information from your input text before you send it to an LLM.

## Step 1. Detecting the text using Amazon Textract

### 1. Set Up

Just as we did in the first lab, the first step is to import some necessary libraries that will be used throughout this notebook.

In [None]:
import json
import boto3
from IPython.display import Image, display
from IPython.display import IFrame

# To refer to the endpoint_name variable created in lab 1
%store -r endpoint_name

### 2. Extract text
Let's extract some example input text from an image. You can use the image provided to you or upload your own.

The model will return the output of the accomplished task.

In [None]:
# Document
documentName = "document_7.png"
display(Image(filename=documentName))


The following cell calls Amazon Textract's **[detect_document_text](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html)** API to detect any text in the input document. It keeps only the `LINE` blocks from the response, joins them with newlines, and prints the result.

In [None]:
# Amazon Textract client
textract = boto3.client('textract')

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Collect the text of every LINE block, in the order Textract returns them
text = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        text.append(item["Text"])
textract_text = "\n".join(text)

print(textract_text)
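
To make the block filtering above easier to follow without calling AWS, here is a minimal sketch that runs on a hand-written dictionary shaped like a `detect_document_text` response. The sample data, the `lines_from_blocks` helper, and the `min_confidence` parameter are assumptions made for illustration; they are not part of the lab.

```python
# Hypothetical, hand-written sample shaped like a detect_document_text response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice date: 2023-01-15", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
        {"BlockType": "LINE", "Text": "smudged txt", "Confidence": 41.7},
    ]
}


def lines_from_blocks(response, min_confidence=0.0):
    """Return the text of LINE blocks whose confidence is at least min_confidence."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


# Every LINE block, as in the notebook cell above
all_lines = lines_from_blocks(sample_response)

# Only the lines Textract is fairly sure about (threshold chosen arbitrarily)
confident_lines = lines_from_blocks(sample_response, min_confidence=90.0)

print("\n".join(all_lines))
print(confident_lines)
```

Filtering on confidence is optional. Whether dropping low-confidence lines helps depends on the document, since a discarded line may be the one that contains the answer.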

Now, let's add a prompt alongside the Textract response so that we can use it to ask our model a question.

In [None]:
prompt_text = "Given the following text, what is the document date? %s"%(textract_text)

print(prompt_text)

### 3. Query the endpoint


We can call the endpoint that was created in the previous notebook. This model also supports many advanced parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **num_return_sequences:** Number of output sequences returned. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in the beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end-of-sentence token. If specified, it must be a boolean.
* **do_sample:** If True, sample the next word according to its likelihood. If specified, it must be a boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **seed:** Fix the randomized state for reproducibility. If specified, it must be an integer.
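
As a rough illustration of how these parameters combine, the sketch below builds two payloads: one for beam search and one for sampling. All of the values are illustrative assumptions rather than tuned recommendations, and the prompt is a made-up stand-in for **prompt_text**.

```python
import json

# Made-up prompt standing in for the notebook's prompt_text
prompt_text = "Given the following text, what is the document date? Invoice date: 2023-01-15"

# Beam search: no sampling, returns the two highest-scoring beams
beam_payload = {
    "text_inputs": prompt_text,
    "max_length": 50,
    "num_beams": 4,
    "num_return_sequences": 2,  # must not exceed num_beams
    "no_repeat_ngram_size": 2,
    "early_stopping": True,
}

# Sampling: stochastic output, with a fixed seed for reproducibility
sampling_payload = {
    "text_inputs": prompt_text,
    "max_length": 50,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "temperature": 0.7,
    "seed": 123,
}

# The endpoint expects the payload as UTF-8 encoded JSON
encoded = json.dumps(beam_payload).encode("utf-8")
print(encoded[:60])
```

Either dictionary could take the place of the **payload** variable used when the endpoint is invoked.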




The **query_endpoint_with_json_payload** and **parse_response_multiple_texts** functions work the same way as they did in lab 1. The only change is the new **payload** variable, where we can specify any subset of the parameters listed above when invoking the endpoint.

The code below is an example of how to invoke the endpoint with these arguments, using **prompt_text** as the input text.

In [None]:
#Input must be a json
payload = {"text_inputs":prompt_text, "max_length":300}

def query_endpoint_with_json_payload(encoded_json):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=encoded_json)
    return response

query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))

def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response['Body'].read())
    generated_text = model_predictions['generated_texts']
    return generated_text

generated_texts = parse_response_multiple_texts(query_response)
print(generated_texts)
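
To see what **parse_response_multiple_texts** does without an endpoint, the sketch below feeds it a stand-in response. The fake body and its two generated strings are invented for illustration. The only thing taken from the lab is the `generated_texts` key.

```python
import io
import json


def parse_response_multiple_texts(query_response):
    """Decode the JSON body and return the list of generated texts."""
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions["generated_texts"]


# Stand-in for an invoke_endpoint response: the body is a byte stream of JSON
fake_body = json.dumps({"generated_texts": ["15 January 2023", "2023-01-15"]})
fake_response = {"Body": io.BytesIO(fake_body.encode("utf-8"))}

texts = parse_response_multiple_texts(fake_response)
for index, generated in enumerate(texts, start=1):
    print(f"{index}: {generated}")

# The stand-in stream is consumed by the first read(), so a second read is empty
leftover = fake_response["Body"].read()
print(leftover)
```

Because the body is read as a stream, it is safest to parse each response once and keep the parsed result, rather than reading the body a second time.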



This step is now complete. Amazon Textract was used to extract all of the text from our example image, and this was passed to our model, alongside a prompt question. As demonstrated by the output, our model is able to identify the document date.

## Step 2. Detecting PII entities with Amazon Comprehend

Now that we have extracted the information using Amazon Textract, the next step is to identify any PII using Amazon Comprehend's **[Detect PII Entities API](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectPiiEntities.html)**. 


In [None]:
client = boto3.client('comprehend')

response = client.detect_pii_entities(
    Text= textract_text,
    LanguageCode='en'
)
print(response)

The **response.json** file contains the above response formatted in JSON. Feel free to take a look at this to help you understand what the data we are dealing with looks like. 

The response contains a list of the detected entities, each with its type (`Type`), its location in the text (`BeginOffset` and `EndOffset`), and a confidence score (`Score`).
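
The sketch below shows how the offsets map back to the original text. It uses a hand-built dictionary shaped like the `Entities` part of the response, so it runs without AWS. The sample sentence, the scores, and the `make_entity` helper are assumptions made for illustration.

```python
text = "Jane Doe called on 2023-01-15."


def make_entity(text, value, entity_type, score):
    """Build an entity dict whose offsets point at `value` inside `text`."""
    start = text.index(value)
    return {
        "Type": entity_type,
        "Score": score,
        "BeginOffset": start,
        "EndOffset": start + len(value),
    }


# Hypothetical sample shaped like a detect_pii_entities response
sample_response = {
    "Entities": [
        make_entity(text, "Jane Doe", "NAME", 0.99),
        make_entity(text, "2023-01-15", "DATE_TIME", 0.98),
    ]
}

# Slice the original text with each entity's offsets to recover the matched span
extracted = [
    (entity["Type"], text[entity["BeginOffset"]:entity["EndOffset"]])
    for entity in sample_response["Entities"]
]

for entity_type, value in extracted:
    print(f"{entity_type}: {value!r}")
```

`EndOffset` is exclusive here, which is why a plain Python slice recovers each span.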

## Step 3. Redacting all PII from the input text

With the help of Amazon Textract and Amazon Comprehend we were able to identify all instances of PII within our input data, and the next step is to redact it all before we call our model endpoint.

The code below uses the 'BeginOffset' and 'EndOffset' values to identify the location of all PII, and substitutes the original values with just their associated entity types.

In [None]:
comprehend_txt = textract_text

# Replace entities from the end of the text backwards, so that substituting
# one entity does not shift the offsets of the entities still to be processed
print("Detected Entities: ")
for entity in sorted(response['Entities'], key=lambda e: e['BeginOffset'], reverse=True):
    print(entity['Type'])
    comprehend_txt = comprehend_txt[:entity['BeginOffset']] + entity['Type'] + comprehend_txt[entity['EndOffset']:]
 


In [None]:
print(comprehend_txt)
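
The offset-based substitution can be sketched as a small reusable function and tried on sample data. The `redact_pii` and `make_entity` helpers, the `min_score` parameter, and the sample sentence are assumptions made for illustration. The low-scoring `AGE` entity is invented to show the effect of a threshold.

```python
def make_entity(text, value, entity_type, score):
    """Build an entity dict whose offsets point at `value` inside `text`."""
    start = text.index(value)
    return {
        "Type": entity_type,
        "Score": score,
        "BeginOffset": start,
        "EndOffset": start + len(value),
    }


def redact_pii(text, entities, min_score=0.0):
    """Replace each entity span with its type, skipping entities below min_score."""
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue
        redacted = (
            redacted[:entity["BeginOffset"]]
            + entity["Type"]
            + redacted[entity["EndOffset"]:]
        )
    return redacted


text = "Jane Doe called on 2023-01-15 about order 42."
entities = [
    make_entity(text, "Jane Doe", "NAME", 0.99),
    make_entity(text, "2023-01-15", "DATE_TIME", 0.98),
    make_entity(text, "42", "AGE", 0.35),  # invented low-confidence detection
]

redact_everything = redact_pii(text, entities)
redact_confident = redact_pii(text, entities, min_score=0.5)

print(redact_everything)
print(redact_confident)
```

A threshold trades fewer false positives for a higher chance of leaving real PII in place, so for redaction it is usually safer to keep it low.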

As we can see here, all PII has been redacted. Now, let's add a prompt alongside the Comprehend response so that we can ask our model a question securely.

In [None]:
prompt_text = "Given the following text, what is the document date? %s"%(comprehend_txt)

print(prompt_text)

In [None]:
#Input must be a json
payload = {"text_inputs":prompt_text, "max_length":300}

def query_endpoint_with_json_payload(encoded_json):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=encoded_json)
    return response

query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))

def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response['Body'].read())
    generated_text = model_predictions['generated_texts']
    return generated_text

generated_texts = parse_response_multiple_texts(query_response)
print(generated_texts)

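# Check the model output for any PII that slipped through
redacted_response = client.detect_pii_entities(
    Text=str(generated_texts),
    LanguageCode='en'
)

# Keep only the entities that Comprehend is confident about
detected = [entity for entity in redacted_response['Entities'] if entity['Score'] > 0.9]
if detected:
    for entity in reversed(detected):
        print(entity['Type'])
else:
    print('\n\033[1;32mCongratulations, no PII has been detected!\033[0m')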


The model's answer is no longer meaningful, because the document date was redacted before the model was queried.
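
If a particular entity type is not considered sensitive for your use case, one option is to leave that type in place so the model can still answer questions about it. The sketch below is only an illustration of that idea. The `redact_except` and `make_entity` helpers, the `keep_types` parameter, and the sample sentence are all assumptions, and whether a date counts as sensitive is a decision for your own data policy.

```python
def make_entity(text, value, entity_type, score):
    """Build an entity dict whose offsets point at `value` inside `text`."""
    start = text.index(value)
    return {
        "Type": entity_type,
        "Score": score,
        "BeginOffset": start,
        "EndOffset": start + len(value),
    }


def redact_except(text, entities, keep_types=frozenset()):
    """Replace entity spans with their type, leaving types in keep_types untouched."""
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Type"] in keep_types:
            continue
        redacted = (
            redacted[:entity["BeginOffset"]]
            + entity["Type"]
            + redacted[entity["EndOffset"]:]
        )
    return redacted


text = "Contact John Smith at 555-0100 before 01/02/2023."
entities = [
    make_entity(text, "John Smith", "NAME", 0.99),
    make_entity(text, "555-0100", "PHONE", 0.97),
    make_entity(text, "01/02/2023", "DATE_TIME", 0.98),
]

fully_redacted = redact_except(text, entities)
dates_kept = redact_except(text, entities, keep_types={"DATE_TIME"})

print(fully_redacted)
print(dates_kept)
```

With the date left in place, a question about the document date can be answered while names and phone numbers stay redacted.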

We have used a combination of **Amazon Textract** and **Amazon Comprehend** to identify and remove all instances of PII from our input data, ensuring that no sensitive data can be passed to our model.

If you want to learn more about detecting PII data, feel free to read **[here](https://aws.amazon.com/blogs/industries/common-techniques-to-detect-phi-and-pii-data-using-aws-services/)**.


### Clean up

Make sure to run the cell below to delete your SageMaker endpoint once you are finished with this lab.


In [None]:
sagemaker_client = boto3.client('sagemaker')
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
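
Deleting an endpoint does not remove the endpoint configuration or the model behind it. The sketch below looks both up before deleting them, assuming they are not shared with another endpoint. The `delete_endpoint_resources` helper is hypothetical, and the `FakeSageMakerClient` class is a stand-in so the example runs without AWS access. With a real client you would pass `boto3.client('sagemaker')` and `endpoint_name` instead.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint, then its endpoint configuration and its models."""
    # Look up the configuration and models before anything is deleted
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)
    return config_name, model_names


class FakeSageMakerClient:
    """Stand-in that records delete calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": "demo-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


fake_client = FakeSageMakerClient()
deleted = delete_endpoint_resources(fake_client, "demo-endpoint")
print(deleted)
print(fake_client.calls)
```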