Cell 1: Environment Setup
This cell loads environment variables (the API key and, optionally, a custom API endpoint) from a local .env file and configures the Google Generative AI client for access to the Gemini model.

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Load environment variables and API keys
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # Read local .env file

# Load API key for Google Generative AI
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Import Google Generative AI library and configure it
import google.generativeai as genai
from google.api_core.client_options import ClientOptions

# Set up the Google Generative AI client
genai.configure(
    api_key=GOOGLE_API_KEY,
    transport="rest",
    client_options=ClientOptions(
        api_endpoint=os.getenv("GOOGLE_API_BASE"),
    )
)
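Because os.getenv returns None when a variable is missing, a misconfigured .env file tends to surface only later as an authentication error. A minimal sketch of an early check is shown below; require_env is a hypothetical helper introduced here for illustration, not part of the notebook or of python-dotenv.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value

# Example usage (assumes GOOGLE_API_KEY is defined in the environment):
# GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")
```

Failing at setup time keeps the error message close to its cause instead of deferring it to the first model call.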


Cell 2: Helper Functions
Two helper functions are defined here: to_markdown formats text as a Markdown blockquote for notebook display, and call_LMM sends an image and a prompt to the Large Multimodal Model (LMM) and returns its response.



In [None]:
import textwrap
import PIL.Image
from IPython.display import Markdown, Image

# Convert text to markdown format for better notebook display
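def to_markdown(text):
    """
    Format text as a Markdown blockquote for notebook display.

    Bullet characters (•) are converted to Markdown list markers, and every
    line, including blank ones, is prefixed with "> ".

    Args:
        text (str): The text to format.

    Returns:
        Markdown: An IPython Markdown object wrapping the formatted text.
    """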
    text = text.replace("•", "  *")
    return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))

# Function to call the LMM to analyze images and return a response
from typing import Union

def call_LMM(image_path: str, prompt: str, plain_text: bool = False) -> Union[str, Markdown]:
    """
    Call the LMM to analyze an image and generate a response based on a prompt.

    Args:
        image_path (str): Path to the image to be analyzed.
        prompt (str): A text prompt for the model to process.
        plain_text (bool): If True, return the raw response text; otherwise
            return it wrapped as a Markdown object for notebook display.

    Returns:
        str or Markdown: The model response as plain text or as Markdown.
    """
    img = PIL.Image.open(image_path)  # Load the image
    model = genai.GenerativeModel("gemini-1.5-flash")  # Initialize the model
    response = model.generate_content([prompt, img], stream=True)  # Generate content
    response.resolve()  # Resolve the content

    if plain_text:
        return response.text
    else:
        return to_markdown(response.text)
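The formatting step in to_markdown can be tried outside a notebook. The sketch below isolates the string transformation in a hypothetical quote_text function (the name is an assumption made for this example) that returns a plain string instead of an IPython Markdown object.

```python
import textwrap

def quote_text(text: str) -> str:
    """Apply the same transformation as to_markdown, but return a plain string."""
    # Turn bullet characters into Markdown list markers.
    text = text.replace("•", "  *")
    # Prefix every line, including blank ones, with the blockquote marker.
    return textwrap.indent(text, "> ", predicate=lambda _: True)

print(quote_text("Items:\n\n• pedal arms"))
```

The always-true predicate matters because textwrap.indent skips whitespace-only lines by default, which would break the blockquote at every blank line.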


Cell 3: Extracting Structured Data from an Invoice Image
The Large Multimodal Model (LMM) is used to analyze an invoice image and extract structured data.

In [None]:
# Display the invoice image (display() is needed because a cell renders
# only its last expression automatically)
from IPython.display import Image, display
display(Image(url="invoice.png"))

# Extract structured data from the invoice (as JSON)
call_LMM("invoice.png",
    """Identify the items on the invoice. Make sure you output
    JSON with quantity, description, unit price, and amount."""
)
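To use the extracted data in code rather than only display it, the response text (obtained with plain_text=True) has to be parsed. Models often wrap JSON in a Markdown code fence, so a tolerant parser helps. The following is a sketch under that assumption; extract_json is a hypothetical helper, and the sample string is made up for illustration rather than taken from real model output.

```python
import json
import re

# Three backticks, built indirectly so this example contains no literal fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json(response_text: str):
    """Parse JSON from a model response, tolerating an optional Markdown code fence."""
    match = FENCED_BLOCK.search(response_text)
    # Fall back to the whole response when no fenced block is present.
    payload = match.group(1) if match else response_text
    return json.loads(payload.strip())

# Illustrative response text (invented for this sketch, not real model output).
sample = (
    FENCE + "json\n"
    '[{"quantity": 1, "description": "Example item", "unit price": 10.0, "amount": 10.0}]\n'
    + FENCE
)
items = extract_json(sample)
print(items[0]["description"])
```

json.loads raises json.JSONDecodeError on malformed output, which is a useful signal that the prompt or the response needs another look.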


Cell 4: Querying Additional Information
Ask the model a follow-up question that requires combining prices from the invoice into a cost estimate.

In [None]:
# Ask the model to calculate costs based on the invoice
call_LMM("invoice.png",
    """How much would four sets of pedal arms
    and six hours of labor cost?""",
    plain_text=True  # Return response in plain text format
)
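Language models can make arithmetic mistakes, so it may be worth recomputing an estimate like this from the parsed line items. The sketch below assumes the items are available as dictionaries with "description" and "unit price" keys (the keys requested in the earlier prompt; the model may name them differently). estimate_cost is a hypothetical helper, and the descriptions and prices are placeholders, not values from invoice.png.

```python
def estimate_cost(items, requests):
    """Sum unit price times requested quantity for each requested description."""
    prices = {item["description"]: item["unit price"] for item in items}
    return sum(prices[name] * quantity for name, quantity in requests.items())

# Placeholder line items; not values read from the actual invoice.
items = [
    {"description": "Part A", "unit price": 12.5},
    {"description": "Labor (per hour)", "unit price": 40.0},
]

total = estimate_cost(items, {"Part A": 4, "Labor (per hour)": 6})
print(total)
```

Comparing this figure with the model's answer gives a quick consistency check on the follow-up response.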


Cell 5: Extracting Tables from Images
Analyze an image of a table and extract the contents in Markdown format.

In [None]:
# Display the table image (display() is required for every output except
# the last expression in a cell)
from IPython.display import display
display(Image("prosus_table.png"))

# Convert table image contents into Markdown table format
display(call_LMM("prosus_table.png",
    "Print the contents of the image as a markdown table."
))

# Analyze the contents of the table to find the highest revenue growth
call_LMM("prosus_table.png",
    """Analyze the contents of the image as a markdown table.
    Which of the business units has the highest revenue growth?"""
)
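Once the table is available as Markdown text, the model's answer about the highest growth can be cross-checked in code. The sketch below assumes a simple pipe-delimited table with a header row and a separator row; parse_markdown_table is a hypothetical helper, and the unit names and figures are invented for illustration, not taken from prosus_table.png.

```python
def parse_markdown_table(table: str):
    """Convert a simple pipe-delimited Markdown table into a list of row dicts."""

    def split_row(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data rows start at index 2.
    return [dict(zip(header, split_row(line))) for line in lines[2:]]

# Placeholder table; names and figures are invented for this example.
table = """
| Unit | Growth |
|------|--------|
| Unit A | 12% |
| Unit B | 30% |
| Unit C | -5% |
"""

rows = parse_markdown_table(table)
best = max(rows, key=lambda row: float(row["Growth"].rstrip("%")))
print(best["Unit"])
```

This parser does not handle escaped pipes or multi-line cells; for real tables, a dedicated library would be more robust.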


Cell 6: Analyzing Flow Charts
Use the LMM to analyze a flowchart image: first to produce a summarized breakdown, then to generate Python code that implements the same logic.

In [None]:
# Display the swimlane diagram (display() is required for every output
# except the last expression in a cell)
from IPython.display import display
display(Image("swimlane-diagram-01.png"))

# Provide a summarized breakdown of the flowchart as a numbered list
display(call_LMM("swimlane-diagram-01.png",
    """Provide a summarized breakdown of the flow chart in the image
    in the format of a numbered list."""
))

# Analyze the flow chart and generate Python code based on the logical flow
call_LMM("swimlane-diagram-01.png",
    """Analyze the flow chart in the image,
    then output Python code that implements this logical flow in one function."""
)


Cell 7: Generated Python Code from Flow Chart
The code below is the Python function the LMM generated from the diagram. It can be tested, but client, online_shop, and courier_company are left undefined, so running it requires real or stub objects that provide the methods it calls.

In [None]:
# The generated Python code from the flowchart analysis
def order_fulfillment(client, online_shop, courier_company):
    """
    This function handles the order fulfillment process involving 
    a client, an online shop, and a courier company.
    
    Args:
        client (object): The client who placed the order.
        online_shop (object): The online shop processing the order.
        courier_company (object): The courier company delivering the order.
    """
    # The client places an order
    order = client.place_order()

    # The client makes a payment for the order
    payment = client.make_payment(order)

    # If payment is successful, ship the order
    if payment.status == "successful":
        online_shop.ship_order(order)
        courier_company.transport_order(order)
    
    # If payment fails, cancel the order and issue a refund
    else:
        online_shop.cancel_order(order)
        client.refund_order(order)

    # Finally, generate an invoice for the order
    online_shop.invoice_order(order)
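One way to test the generated function without real implementations is to pass in mock objects and check which methods were called. The sketch below uses unittest.mock.MagicMock from the standard library; run_with_payment_status is a hypothetical helper, and the function body is repeated only so the example runs on its own.

```python
from unittest.mock import MagicMock

def order_fulfillment(client, online_shop, courier_company):
    # Same logic as the generated function above, repeated for self-containment.
    order = client.place_order()
    payment = client.make_payment(order)
    if payment.status == "successful":
        online_shop.ship_order(order)
        courier_company.transport_order(order)
    else:
        online_shop.cancel_order(order)
        client.refund_order(order)
    online_shop.invoice_order(order)

def run_with_payment_status(status):
    """Run the flow with mock participants whose payment has the given status."""
    client, shop, courier = MagicMock(), MagicMock(), MagicMock()
    client.make_payment.return_value.status = status
    order_fulfillment(client, shop, courier)
    return client, shop, courier

# Successful payment: the order is shipped and handed to the courier.
client, shop, courier = run_with_payment_status("successful")
order = client.place_order.return_value
shop.ship_order.assert_called_once_with(order)
courier.transport_order.assert_called_once_with(order)
shop.cancel_order.assert_not_called()

# Failed payment: the order is cancelled and nothing is shipped.
client, shop, courier = run_with_payment_status("failed")
shop.cancel_order.assert_called_once_with(client.place_order.return_value)
courier.transport_order.assert_not_called()
# The invoice is still generated on this path.
shop.invoice_order.assert_called_once()
```

The last assertion shows that the generated code invoices cancelled orders as well, a behavior worth comparing against the diagram before relying on the function.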


The code above demonstrates how to use Google Gemini, a Large Multimodal Model (LMM), to analyze images and generate structured data, Markdown tables, and Python code. The workflow covers the following tasks:

Environment Setup: API keys are loaded, and the Google Generative AI client is configured to interact with the Gemini Vision model.

Helper Functions: Functions are defined to convert text to markdown and to call the multimodal model for image analysis.

Image Analysis: The model is used to extract structured data from an invoice, query cost estimates, and retrieve tables from images, which are output as markdown or analyzed further.

Flowchart Analysis: The model analyzes flowcharts, generating a summarized breakdown and Python code that reflects the logical flow of the diagram.

Generated Code: The Python code that the model produced from the flowchart is included as an example whose logic can be tested.

The notebook illustrates how industry applications can use multimodal AI to extract insights from visual data such as invoices, tables, and process diagrams.