In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# 2. Text Extraction with Generative Models on Vertex AI


## Overview

Text extraction is the process of pulling text, or specific pieces of information, out of a document. It can be done manually, by reading the document and copying the relevant text into a new document, or automatically, by using software to extract the text.

Text extraction can be used for a variety of purposes. One common purpose is to convert documents into a machine-readable format. This can be useful for storing documents in a database or for processing documents with software. Another common purpose is to extract information from documents. This can be useful for finding specific information in a document or for summarizing the content of a document.

Large language models (LLMs) are well suited to text extraction because they are trained on massive datasets of text and code, from which they learn the relationships between words and phrases. Because they take context into account and can generate text, they can also surface information that is implied rather than explicitly stated, or fill gaps where the text is incomplete. Values the model infers in this way are worth checking against the source document. The answers can be improved further with few-shot prompting, in which the prompt includes a few worked examples of the desired output.

Learn more about extraction prompts in the [official documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/text/extraction-prompts).

## 0. Getting Started

### Install Vertex AI SDK

In [None]:
!pip install google-cloud-aiplatform --upgrade --user

### Import libraries


In [1]:
from vertexai.language_models import TextGenerationModel

### Load the model

In [2]:
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

## Section 1. Text Extraction

### 1. Entity extraction from contract data

In this example, you extract key details from the text of a rental contract using the PaLM API.


![Image](https://img.freepik.com/free-vector/businessman-signing-contract-hands-person-holding-pen-paper-document-sign-agreement-flat-vector-illustration-signature-deal-concept-banner-website-design-landing-web-page_74855-24354.jpg?w=1800&t=st=1693341332~exp=1693341932~hmac=2044fc79c6d2264d0987ff8382aa23379a2f189c3946e96b828622fca47cfd45)

#### Can you extract the following information by using the PaLM API? 

In [None]:
1. Rental Address:
2. Start/End date of rental:
3. Tenant:
4. Landlord:
5. Rental Amount:
6. Payment Method:
7. Payment Due:
8. Penalty fee:
9. Additional tenants:

In [None]:
prompt = """
<INPUT: Your Prompt>

<Context>

<Optional: how would you specify your output?>


"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    ).text
)
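One possible way to fill in the template above is to assemble the prompt from three parts: an instruction, the contract text as context, and an explicit output format. The sketch below is only an illustration; the helper name `build_extraction_prompt`, the wording of the instruction, and the `'Not specified'` fallback are assumptions, not part of the original exercise.

```python
# Hypothetical helper: assemble an extraction prompt from a field list and a contract.
FIELDS = [
    "Rental Address",
    "Start/End date of rental",
    "Tenant",
    "Landlord",
    "Rental Amount",
    "Payment Method",
    "Payment Due",
    "Penalty fee",
    "Additional tenants",
]


def build_extraction_prompt(contract_text: str, fields: list[str]) -> str:
    """Return a prompt made of an instruction, the context, and an output spec."""
    numbered = "\n".join(f"{i}. {name}:" for i, name in enumerate(fields, start=1))
    return (
        "Extract the following fields from the rental contract below. "
        "If a field is not stated in the contract, write 'Not specified'.\n\n"
        f"Contract:\n{contract_text.strip()}\n\n"
        "Answer using exactly this numbered format:\n"
        f"{numbered}\n"
    )


# Placeholder text; replace it with the real contract before calling the model.
prompt = build_extraction_prompt("<paste the rental contract here>", FIELDS)
print(prompt)
```

The resulting `prompt` string can then be passed to `generation_model.predict(...)` with the same parameters as in the cell above.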

#### Any other prompt you want to try out? 

In [13]:
prompt = """
<INPUT: Your Prompt>

<Context>

<Optional: how would you specify your output?>


"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    ).text
)

INPUT: What is the difference between a 1000 watt and a 1500 watt heater?

CONTEXT: I'm trying to decide which heater to buy. I'm looking for something that will heat my living room quickly and efficiently.

OUTPUT: A 1000 watt heater will produce about 1000 BTUs of heat per hour, while a 1500 watt heater will produce about 1500 BTUs of heat per hour. This means that the 1500 watt heater will heat your living room more quickly and efficiently. However, it will also use more electricity.

If you're concerned about the cost of electricity, you may want to choose a 1000 watt heater. However, if you're looking for the fastest and most efficient way to heat your living room, you may want to choose a 1500 watt heater.


## Section 2. WiFi troubleshooting with constraints

In this example, you ask the generative model to answer a question about troubleshooting a Google WiFi router, based on a description of the router's status lights. The prompt instructs the model to respond only with text taken from the description you provide, which reduces the risk of potentially harmful or incorrect answers. Here is how you can do this using the PaLM API.

![Image](https://img.freepik.com/free-vector/internet-day-concept-illustration_114360-5303.jpg?w=1060&t=st=1693519662~exp=1693520262~hmac=8c3a8f9f55f097edabea0f4a79a542ff34a9050fadf48af450470b40d79c0185)

#### WiFi troubleshooting Q&A Content


#### Question: What should I do to fix my disconnected WiFi? The light on my Google WiFi router is yellow and blinking slowly.

In [None]:
prompt = """

<Question>

<ADD CONTENTS>

"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=256, top_k=1, top_p=0.8
    ).text
)
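A constrained prompt of this kind might be assembled as in the sketch below. The helper name `build_grounded_prompt`, the instruction wording, and the "I don't know." fallback are assumptions made for illustration; the reference text is a placeholder that you replace with the status-light descriptions.

```python
# Hypothetical helper: restrict the model to a supplied reference text.
def build_grounded_prompt(question: str, reference_text: str) -> str:
    """Return a prompt that asks the model to answer only from reference_text."""
    return (
        "Answer the question using only the text below. "
        "Respond with the relevant passage from the text. "
        "If the text does not contain the answer, reply \"I don't know.\"\n\n"
        f"Text:\n{reference_text.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


question = (
    "What should I do to fix my disconnected WiFi? "
    "The light on my Google WiFi router is yellow and blinking slowly."
)
# Placeholder; paste the status-light descriptions here before calling the model.
prompt = build_grounded_prompt(question, "<paste the status-light descriptions here>")
print(prompt)
```

Placing the instruction first and ending the prompt with `Answer:` is one common layout. Combined with `top_k=1`, as in the cell above, it tends to keep the response close to the supplied text, though it does not guarantee it.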

#### Any other question you want to ask? 

In [None]:
prompt = """

<Question>

<ADD CONTENTS>

"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=256, top_k=1, top_p=0.8
    ).text
)

## Section 3. Respond to inquiries in character

In this example, you instruct the generative model to pretend to be Klara, a person. You describe Klara's personality traits in the prompt, and then ask the model to answer a question as Klara would answer it.

![Image](https://img.freepik.com/free-vector/flat-design-profile-icons-collection_23-2149102741.jpg?w=1060&t=st=1693341588~exp=1693342188~hmac=1d9173e0ef918b03449e63a2ec64b72068bffacdd6999d0e4466b442eda1515d)

In [None]:
prompt = """
<Add Contents>


<User's Question>

Klara's answer:
"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=256, top_k=40, top_p=0.8
    ).text
)
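As a rough sketch, the persona prompt can be built from a name, a list of traits, and the user's question. The helper name `build_persona_prompt` and the instruction wording are assumptions, and the traits shown are placeholders rather than Klara's actual description.

```python
# Hypothetical helper: build an in-character prompt from a name and a trait list.
def build_persona_prompt(name: str, traits: list[str], question: str) -> str:
    """Return a prompt that describes a persona and ends with an answer cue."""
    trait_lines = "\n".join(f"- {trait}" for trait in traits)
    return (
        f"You are {name}. Stay in character and answer as {name} would.\n"
        f"{name}'s personality traits:\n{trait_lines}\n\n"
        f"Question: {question.strip()}\n\n"
        f"{name}'s answer:\n"
    )


# Placeholders; replace them with Klara's traits and the user's real question.
klara_traits = ["<trait 1>", "<trait 2>"]
prompt = build_persona_prompt("Klara", klara_traits, "<User's Question>")
print(prompt)
```

Ending the prompt with `Klara's answer:` mirrors the template above and cues the model to continue in Klara's voice.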

### Converting an ingredients list to JSON format

Suppose that you want to itemize ingredients in recipes to enter into a database, which requires a well-formatted output like JSON. This can be done using a generative model in the following way:

In [None]:
Ingredient List 

Ingredients:
* 1 tablespoon olive oil
* 1 onion, chopped
* 2 carrots, chopped
* 2 celery stalks, chopped
* 1 teaspoon ground cumin
* 1/2 teaspoon ground coriander
* 1/4 teaspoon turmeric powder
* 1/4 teaspoon cayenne pepper (optional)
* Salt and pepper to taste
* 1 (15 ounce) can black beans, rinsed and drained
* 1 (15 ounce) can kidney beans, rinsed and drained
* 1 (14.5 ounce) can diced tomatoes, undrained
* 1 (10 ounce) can diced tomatoes with green chilies, undrained
* 4 cups vegetable broth
* 1 cup chopped fresh cilantro

In [None]:
prompt = """

<Add INSTRUCTION>

<ADD CONTENTS>
"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=1024, top_k=40, top_p=0.8
    ).text
)
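Before loading the model's answer into a database, it can help to confirm that the text really is valid JSON. The sketch below assumes the response may arrive wrapped in a Markdown code fence and strips it before parsing. The helper name `parse_json_response`, the `ingredient`/`quantity`/`unit` keys, and the sample string are all hypothetical stand-ins, not actual model output.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this snippet readable


def parse_json_response(raw: str):
    """Parse model output as JSON, tolerating a surrounding Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return json.loads(text)  # raises json.JSONDecodeError on malformed output


# Hand-written stand-in for a model response; the schema is an assumption.
sample = (
    FENCE + "json\n"
    '[{"ingredient": "olive oil", "quantity": "1", "unit": "tablespoon"}]\n'
    + FENCE
)
items = parse_json_response(sample)
print(items[0]["ingredient"])
```

In practice you would pass the `.text` of the `predict` result to `parse_json_response` and handle `json.JSONDecodeError`, for example by retrying with a stricter instruction.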

### Organizing the results of a text extraction

In this section, you extract the information you want from a block of text and organize it in a structured way, such as separating it by commas. Here you use few-shot prompting to guide the model to format your outputs to be separated by commas.


In [None]:
Question 

- Extract companies funded by CapitalG

In [None]:
prompt = """

<ADD CONTENTS>
<ADD PROMPT>

"""

print(
    generation_model.predict(
        prompt, temperature=0.2, max_output_tokens=256, top_k=1, top_p=0.8
    ).text
)

Guided by the few-shot examples in the prompt, the output above should list the names of the companies funded by CapitalG, separated by commas.
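The few-shot pattern described in this section can be sketched as a prompt that repeats a `Text:` / `Answer:` pair for each example and ends with the new text and an open `Answer:` cue. The helper names, the `Text:`/`Answer:` labels, and the placeholder examples below are assumptions chosen for illustration.

```python
# Hypothetical helpers: build a few-shot prompt and split a comma-separated answer.
def build_few_shot_prompt(
    instruction: str, examples: list[tuple[str, str]], text: str
) -> str:
    """Return a prompt with worked examples followed by the text to process."""
    parts = [instruction.strip(), ""]
    for example_text, example_answer in examples:
        parts += [f"Text: {example_text}", f"Answer: {example_answer}", ""]
    parts += [f"Text: {text}", "Answer:"]
    return "\n".join(parts)


def split_csv_answer(answer: str) -> list[str]:
    """Split a comma-separated answer into a list of trimmed, non-empty items."""
    return [item.strip() for item in answer.split(",") if item.strip()]


# Placeholders; replace them with real example passages and their answers.
examples = [
    ("<example text 1>", "<Company A>, <Company B>"),
    ("<example text 2>", "<Company C>"),
]
prompt = build_few_shot_prompt(
    "Extract the companies funded by CapitalG as a comma-separated list.",
    examples,
    "<paste the text to extract from here>",
)
print(prompt)
print(split_csv_answer(" Alpha, Beta ,Gamma,"))
```

Because every example answer uses the same comma-separated shape, the model is nudged toward that format, and `split_csv_answer` turns the response into a Python list for further processing.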