<a href="https://colab.research.google.com/github/saracarl/colab_notebooks/blob/main/BethlehemSteelGeminiExperiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2025 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## 1. Set up Environment and create inference Client

The first task is to install the `google-genai` [Python SDK](https://googleapis.github.io/python-genai/) and obtain an API key. If you don't have one, you can get one from Google AI Studio: [Get a Gemini API key](https://aistudio.google.com/app/apikey). If you are new to Google Colab, check out the [quickstart](../quickstarts/Authentication.ipynb).


In [17]:
%pip install "google-genai>=1"



Once you have the SDK and an API key, you can create a client and define the model you are going to use. This notebook uses Gemini 2.0 Flash, which is available via the [free tier](https://ai.google.dev/pricing#2_0flash) with 1,500 requests per day (as of 2025-02-06).

In [18]:
from google import genai
from google.colab import userdata
api_key = userdata.get("GOOGLE_API_KEY") # If you are not using Colab you can set the API key directly

# Create a client
client = genai.Client(api_key=api_key)

# Define the model you are going to use
model_id = "gemini-2.0-flash"  # or "gemini-2.0-flash-lite", "gemini-2.5-flash-preview-05-20", "gemini-2.5-pro-preview-05-06"

*Note: If you want to use Vertex AI, see [here](https://googleapis.github.io/python-genai/#create-a-client) for how to create your client.*
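
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of one alternative, assuming the key has been exported as an environment variable named `GOOGLE_API_KEY` (the helper name `load_api_key` is illustrative and not part of the SDK):

```python
import os


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return api_key


# Usage (assumes the key was exported in the shell beforehand):
# client = genai.Client(api_key=load_api_key())
```

Reading the key from the environment keeps it out of the notebook itself, which matters if the notebook is shared.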

## 2. Work with cropped images

Gemini models can process [images and videos](https://ai.google.dev/gemini-api/docs/vision?lang=python#image-input), which can be passed as base64 strings or through the `files` API. After uploading a file, you can include the file URI in the call directly. The Python API includes an [upload](https://googleapis.github.io/python-genai/#upload) and a [delete](https://googleapis.github.io/python-genai/#delete) method.



In [19]:
!wget -q -O aperturecard1.jpg https://fromthepage.com/image-service/35006126//7222,5495,1715,894/1280,/0/default.jpg
!wget -q -O aperturecard2.jpg https://fromthepage.com/image-service/35006125//7222,5495,1715,894/1280,/0/default.jpg
!wget -q -O aperturecard3.jpg https://fromthepage.com/image-service/35006124//7222,5495,1715,894/1280,/0/default.jpg
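
The URLs above appear to follow the IIIF Image API pattern (`{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). Under that reading, the region `7222,5495,1715,894` is `x,y,width,height` in pixels of the full scan, and `1280,` scales the crop to 1280 pixels wide. A minimal sketch of building the same URLs under that assumption (the helper name `title_block_url` is illustrative):

```python
BASE_URL = "https://fromthepage.com/image-service"
# Crop rectangle taken from the URLs above: x, y, width, height.
TITLE_BLOCK_REGION = (7222, 5495, 1715, 894)


def title_block_url(image_id: int, region=TITLE_BLOCK_REGION, width: int = 1280) -> str:
    """Build the URL for the cropped title block of one scanned card."""
    x, y, w, h = region
    # The double slash after the identifier mirrors the URLs used in this notebook.
    return f"{BASE_URL}/{image_id}//{x},{y},{w},{h}/{width},/0/default.jpg"


# Map each local filename to the URL it was downloaded from.
card_urls = {
    f"aperturecard{i}.jpg": title_block_url(image_id)
    for i, image_id in enumerate((35006126, 35006125, 35006124), start=1)
}
```

Keeping the region in one constant makes it easy to try a different crop if the title block sits elsewhere on other cards.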

You can now upload the files using the client's `upload` method. The cell below uploads all three cards.


In [None]:
card1 = client.files.upload(file="aperturecard1.jpg", config={'display_name': 'card1'})
card2 = client.files.upload(file="aperturecard2.jpg", config={'display_name': 'card2'})
card3 = client.files.upload(file="aperturecard3.jpg", config={'display_name': 'card3'})

_Note: The File API lets you store up to 20 GB of files per project, with a per-file maximum size of 2 GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but they cannot be downloaded. File uploads are available at no cost._
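
Given the per-file limit mentioned in the note, a local size check before uploading can save a failed request. A minimal sketch, assuming "2 GB" means 2 × 1024³ bytes (the exact byte limit may differ, and `check_upload_size` is an illustrative helper, not an SDK function):

```python
import os

# Assumption: "2 GB" is read as 2 * 1024**3 bytes; the exact byte limit may differ.
MAX_FILE_BYTES = 2 * 1024**3


def check_upload_size(path: str, max_bytes: int = MAX_FILE_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds the per-file limit."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, above the {max_bytes}-byte limit")
    return size
```

The cropped title blocks used here are far below this limit, so the check mainly matters when uploading full-resolution scans.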

After a file is uploaded, you can check how many tokens it was converted to. This not only helps you understand the context you are working with, it also helps you keep track of cost.

In [None]:
# Count the tokens for each uploaded card
for card in (card1, card2, card3):
    file_size = client.models.count_tokens(model=model_id, contents=card)
    print(f'File: {card.display_name} equals to {file_size.total_tokens} tokens')


File: card1 equals to 259 tokens
File: card2 equals to 259 tokens
File: card3 equals to 259 tokens
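
To turn these counts into a rough budget, a minimal sketch that sums the image tokens and applies a per-million-token price. The price used below is a placeholder rather than a quoted rate, and the totals cover only the images, not the prompt text or the generated output:

```python
# Token counts reported above for the three cropped cards.
image_tokens = {"card1": 259, "card2": 259, "card3": 259}


def estimate_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token total into dollars for a given (hypothetical) input price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


total_image_tokens = sum(image_tokens.values())
print(f"Image tokens for the batch: {total_image_tokens}")
# The price is a placeholder; check the pricing page for the real rate.
print(f"Estimated image input cost: ${estimate_cost_usd(total_image_tokens, 0.10):.6f}")
```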


## 3. Structured outputs with Gemini 2.0 and Pydantic

Structured Outputs is a feature that ensures Gemini always generates responses that adhere to a predefined format, such as a JSON Schema. This gives you more control over the output and how to integrate it into your application, because the response is guaranteed to be a valid JSON object that follows the schema you define.

Gemini 2.0 currently supports three different ways to define a JSON schema:
- A single python type, as you would use in a [typing annotation](https://docs.python.org/3/library/typing.html).
- A Pydantic [BaseModel](https://docs.pydantic.dev/latest/concepts/models/)
- A dict equivalent of [genai.types.Schema](https://googleapis.github.io/python-genai/genai.html#genai.types.Schema) / [Pydantic BaseModel](https://docs.pydantic.dev/latest/concepts/models/)
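
As an illustration of the third option, a minimal sketch of a dict-style schema for two of the title-block fields extracted later in this notebook. The uppercase type names (`OBJECT`, `STRING`) follow the examples in the SDK documentation; the choice of fields here is only for illustration.

```python
# Dict form of a response schema, roughly equivalent to a Pydantic model
# with two required string fields.
title_block_schema = {
    "type": "OBJECT",
    "properties": {
        "job_no": {"type": "STRING", "description": "job number"},
        "scale": {"type": "STRING", "description": "scale of the design"},
    },
    "required": ["job_no", "scale"],
}

# The schema would be passed in the request config like this:
config = {
    "response_mime_type": "application/json",
    "response_schema": title_block_schema,
}
```

The Pydantic form used in the rest of this notebook carries the same information, with the added benefit that the response can be validated into a typed object.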


In [20]:
# prompt: Display the results as a table in formatted markdown using display(Markdown)

def display_results(result, image_name, image_url):
  from IPython.display import display, Markdown

  # Assuming 'result' is an instance of the Metadata model
  # Convert the Pydantic model to a dictionary
  result_dict = result.model_dump()

  # Create the markdown table header
  markdown_table = "| Field | Value |\n"
  markdown_table += "|---|---|\n"

  # Add rows to the markdown table
  for field, value in result_dict.items():
    # Escape potential markdown characters in the value
    safe_value = str(value).replace('|', '\\|').replace('\n', '<br/>')
    markdown_table += f"| {field.replace('_', ' ').title()} | {safe_value} |\n"

  #display(Markdown("### card1.jpg \n ![card1.jpg](https://fromthepage.com/image-service/35006126//7222,5495,1715,894/1280,/0/default.jpg)"))
  display(Markdown(f"### {image_name} \n ![{image_name}]({image_url})"))
  display(Markdown(markdown_table))
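
To see what the table-building loop produces without a notebook front end, a minimal sketch that factors it into a pure function operating on a plain dict (the name `to_markdown_table` and the sample values are illustrative):

```python
def to_markdown_table(fields: dict) -> str:
    """Render a field/value mapping as a two-column markdown table."""
    lines = ["| Field | Value |", "|---|---|"]
    for field, value in fields.items():
        # Escape pipes and line breaks so they cannot break the table layout.
        safe_value = str(value).replace("|", "\\|").replace("\n", "<br/>")
        lines.append(f"| {field.replace('_', ' ').title()} | {safe_value} |")
    return "\n".join(lines) + "\n"


sample = {"job_no": "47930", "part": "A|B\nC"}
print(to_markdown_table(sample))
```

Separating the string building from `display(...)` also makes the escaping logic easy to check in isolation.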



## 4. Extract Structured data from images using Gemini 2.0

Now, let's combine the File API and structured output to extract information from the images. The function below accepts the client, a local file path, and the image URL, defines a Pydantic model for the title block, and displays the structured data. It will:

1. Upload the file to the File API
2. Generate a structured response using the Gemini API
3. Convert the response to the Pydantic model and display it


In [21]:
def process_cards(client, filename, file_url):
  from pydantic import BaseModel, Field

  class Metadata(BaseModel):
    header: str = Field(description="The header from the title block, usually Bethlehem Steel Company")
    location: str = Field(description="The location from the title block, usually South Bethlehem, PA, U.S.A.")
    for_whom: str = Field(description="customer the design was for")
    place: str = Field(description="location of the customer")
    job: str = Field(description="job description")
    part: str = Field(description="part description")
    job_no: str = Field(description="job number")
    scale: str = Field(description="scale of the design")
    approved: str = Field(description="who approved the design")

  file = client.files.upload(file=filename, config={'display_name': filename.split('/')[-1].split('.')[0]})
  #file = filename
  # Generate a structured response using the Gemini API
  prompt = "You are a historian and need to extract the data in this image. It is from a title block from an engineering drawing. The text may be typed or handwritten and must be transcribed. Using your ability to transcribe both handwriting and printed text, extract all text as structured data. Try your best to capture the text faithfully. DO NOT fabricate text where there is none in the original image. Instead, indicate this with the string not filled. You MUST capture all text that you see on the card. DO NOT omit any text."
  response = client.models.generate_content(model=model_id, contents=[prompt, file], config={'response_mime_type': 'application/json', 'response_schema': Metadata})
  # Convert the response to the pydantic model and return it
  result = response.parsed

  #result = extract_structured_data(filename, Metadata)
  print(type(result))
  #print(f"Extracted Metadata: {result}")
  display_results(result, filename, file_url)
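
`response.parsed` returns an instance of the schema class. A minimal sketch of a conceptually similar step done by hand, assuming Pydantic v2. The JSON payload below is a hand-written stand-in for `response.text`, not real model output, and `MiniMetadata` is a trimmed, hypothetical version of `Metadata`:

```python
from pydantic import BaseModel, Field, ValidationError


class MiniMetadata(BaseModel):
    job_no: str = Field(description="job number")
    scale: str = Field(description="scale of the design")


# Stand-in for response.text.
raw_json = '{"job_no": "10030", "scale": "3/4\\" = 1 FOOT"}'

# Parse and validate the JSON string against the model in one step.
parsed = MiniMetadata.model_validate_json(raw_json)
print(parsed.job_no, "|", parsed.scale)

# A reply that omits a required field fails validation instead of passing silently.
try:
    MiniMetadata.model_validate_json('{"job_no": "10030"}')
except ValidationError as err:
    print(f"Rejected: {err.error_count()} validation error(s)")
```

Because every field is a required `str`, the prompt asks the model to write a sentinel string for empty fields rather than leaving them out.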

In [22]:
process_cards(client, "aperturecard1.jpg","https://fromthepage.com/image-service/35006126//7222,5495,1715,894/1280,/0/default.jpg")

<class '__main__.process_cards.<locals>.Metadata'>


### aperturecard1.jpg 
 ![aperturecard1.jpg](https://fromthepage.com/image-service/35006126//7222,5495,1715,894/1280,/0/default.jpg)

| Field | Value |
|---|---|
| Header | BETHLEHEM STEEL COMPANY |
| Location | SOUTH BETHLEHEM, PR., U.S.A. |
| For Whom | CHASE ROLLING MILL. |
| Place | not filled |
| Job | # 16 GAS ENGINE |
| Part | SPECIAL CYLINDER |
| Job No | 47930 |
| Scale | 3' + 6" = 1 FOOT |
| Approved | Alhus B. Haws |


In [23]:
process_cards(client, "aperturecard2.jpg","https://fromthepage.com/image-service/35006125//7222,5495,1715,894/1280,/0/default.jpg")

<class '__main__.process_cards.<locals>.Metadata'>


### aperturecard2.jpg 
 ![aperturecard2.jpg](https://fromthepage.com/image-service/35006125//7222,5495,1715,894/1280,/0/default.jpg)

| Field | Value |
|---|---|
| Header | BETHLEHEM STEEL COMPANY, |
| Location | SOUTH BETHLEHEM, PA, U. S. A. |
| For Whom | BETHLEHEM STEEL CO |
| Place | not filled |
| Job | 21'-0 x 30'-0" CAR BOTTOM HEATING FURNACE |
| Part | 27" REVERSING VALVE-DETAILS |
| Job No | 10030 |
| Scale | 3/4" = 1 FOOT |
| Approved | Chas.E.Kohn |


In [25]:
process_cards(client, "aperturecard3.jpg","https://fromthepage.com/image-service/35006124//7222,5495,1715,894/1280,/0/default.jpg")

<class '__main__.process_cards.<locals>.Metadata'>


### aperturecard3.jpg 
 ![aperturecard3.jpg](https://fromthepage.com/image-service/35006124//7222,5495,1715,894/1280,/0/default.jpg)

| Field | Value |
|---|---|
| Header | BETHLEHEM STEEL COMPANY, |
| Location | SOUTH BETHLEHEM, PA., U. S. A. |
| For Whom | BETHLEHEM STEEL CO. |
| Place | not filled |
| Job | DOUBLE 50½ FORGING PRESS |
| Part | MAIN PIPES FOR PRESS CYLINDER |
| Job No | 3515 |
| Scale | 1½ & 3"-1 FOOT |
| Approved | chas. terkel |
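
The prompt asks the model to write `not filled` for empty fields, so downstream code may want to turn that sentinel into a real missing value. A minimal sketch of such a post-processing step on the dict produced by `model_dump()` (the helper name `normalize_fields` is illustrative):

```python
NOT_FILLED = "not filled"  # sentinel requested in the prompt


def normalize_fields(fields: dict) -> dict:
    """Strip whitespace and map the 'not filled' sentinel to None."""
    cleaned = {}
    for name, value in fields.items():
        text = str(value).strip()
        cleaned[name] = None if text.lower() == NOT_FILLED else text
    return cleaned


# Values copied from the aperturecard1.jpg table above.
card1_fields = {"for_whom": "CHASE ROLLING MILL.", "place": "not filled", "job_no": "47930"}
print(normalize_fields(card1_fields))
```

This only handles the sentinel. Transcription errors visible in the tables above, such as `PR.` for `PA.`, would still need manual review.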
