
New notebook + 1 video + 1 image file #1700

Open
wants to merge 3 commits into main

Conversation

anurag-openai

Summary

Briefly describe the changes and the goal of this PR. Make sure the PR title summarizes the changes effectively.

This PR introduces a detailed notebook demonstrating how to use GPT-4o's vision capabilities to analyze video frames from a manufacturing warehouse and extract structured operational insights. It provides step-by-step instructions and best practices for bounding boxes, structured data extraction, confidence scoring, and cost considerations, so readers can effectively implement an AI-driven monitoring system.

Motivation

Why are these changes necessary? How do they improve the cookbook?

Warehouse managers often lack real-time visibility into their operations, relying instead on delayed or manual reporting, which leads to reactive rather than proactive decision-making. This contribution addresses these issues by using GPT-4o's vision capabilities to analyze video footage, enabling rapid identification of safety concerns, monitoring of space utilization, and detection of operational inefficiencies in near real time. This significantly speeds up decision-making, enhances safety compliance, and reduces operational inefficiencies.


For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

  • I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
  • [x] I have conducted a self-review of my content based on the contribution guidelines:
    • [x] Relevance: This content is related to building with OpenAI technologies and is useful to others.
    • Uniqueness: I have searched for related examples in the OpenAI Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.
    • [x] Spelling and Grammar: I have checked for spelling or grammatical mistakes.
    • [x] Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
    • Correctness: The information I include is correct and all of my code executes successfully.
    • [x] Completeness: I have explained everything fully, including all necessary references and citations.

We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.

Contributor
@danial-openai left a comment

nit: "Without live real-time tracking" - maybe drop real-time

@danial-openai left a comment

style nit: add period '.' to end of every line; some lines are missing

@danial-openai left a comment

nit: "Using computer vision to analyze warehouse videos and provide real-time operational insights" - Highlight the product e.g. "Using GPT-4o Vision capabilities..."

@danial-openai left a comment

nit: "Simple Workflow:" - I would frame this a bit differently e.g. "In this cookbook, we will leverage GPT-4o Vision capabilities to analyze warehouse videos and provide operational insights. Here is our proposed approach: ..."

@danial-openai left a comment

video = cv2.VideoCapture("/Users/anurag/github/openai-cookbook/openai-cookbook/examples/data/manufacturing/warehouse_operations.mp4")

Use the relative path to the video you uploaded to GitHub if you want readers to use it directly, e.g. "data/manufacturing/warehouse_operations.mp4", or generalize it with a placeholder, e.g. "<PATH_TO_YOUR_VIDEO>".

@danial-openai left a comment

prompting nit: "based on the MfgEvent model"

Would change this to something like "Your task is to analyze each frame and return a response in the specified format..." for more clarity

@danial-openai left a comment

In the pandas df it would be nice to format so that the explanation isn't cut off, e.g. add pd.set_option("display.max_colwidth", None) before you display the DataFrame.
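A minimal sketch of the suggestion; the column names and rows below are illustrative, not taken from the notebook:

```python
import pandas as pd

# Illustrative frame-analysis results with a long free-text column.
df = pd.DataFrame({
    "frame": [1, 2],
    "explanation": [
        "Five workers detected inside the marked zone with high confidence.",
        "Forklift partially occludes one worker near the loading dock.",
    ],
})

# By default pandas truncates long cell values with "..."; setting the
# option to None before displaying shows the full explanation text.
pd.set_option("display.max_colwidth", None)
print(df.to_string(index=False))
```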

@danial-openai left a comment

"Step 5: 💸 Cost Considerations & Best Practices" - add a bit more description here. What are Resolution and Detail Mode? How do you set these parameters?
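For context, image resolution/detail is set per image via the "detail" field of a Chat Completions image input; the payload below is a sketch (the prompt text and the <BASE64_FRAME> placeholder are illustrative):

```python
# Sketch of an image message for the Chat Completions API. The "detail"
# field controls resolution and token usage: "low" caps the image at a
# small fixed token budget, "high" tiles the image at higher resolution
# (more tokens), and "auto" lets the API decide. <BASE64_FRAME> is a
# placeholder for a base64-encoded video frame.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Analyze this warehouse frame."},
        {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,<BASE64_FRAME>",
                "detail": "low",  # "low" | "high" | "auto"
            },
        },
    ],
}
```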

@danial-openai left a comment

In your cost estimation, also provide more description upfront, including your assumptions, e.g. "Assuming that we take 1 image per minute, every hour of the day, for 365 days in a year..." etc.

It also looks like your printed output is duplicated:
Total annual cost: $1451.97
Token cost per image: 1105
Annual token cost (1 image per minute): 1451.97
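For reference, the printed total is reproducible once the assumptions are spelled out; a sketch follows (the per-million-token price is an assumption, not taken from the PR):

```python
# Assumptions (stated upfront, as the review suggests): one frame per
# minute, every hour of every day, for 365 days; 1105 tokens per image
# (the notebook's printed figure); and an assumed $2.50 per 1M input tokens.
TOKENS_PER_IMAGE = 1105
IMAGES_PER_YEAR = 60 * 24 * 365          # 525,600 frames per year
PRICE_PER_MILLION_TOKENS = 2.50          # assumed input price, USD

annual_tokens = TOKENS_PER_IMAGE * IMAGES_PER_YEAR
annual_cost = annual_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"Total annual cost: ${annual_cost:.2f}")
```

Under these assumptions the result matches the notebook's printed $1451.97, which suggests that stating them explicitly would make the estimate easy for readers to verify.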

@danial-openai left a comment

In your analysis, I would highlight what works well, e.g. we correctly identify the 5 workers in the first two frames with high confidence; in the last frame we miss a worker, but we also have lower confidence; GPT-4o is really good at respecting bounding boxes and never counts workers outside the bounding box, etc.

Would be good to see some commentary on the results

@danial-openai left a comment

"Implementing advanced function calling to streamline interactions between YOLO detections and GPT-4o analysis." - What are YOLO detections?

"Exploring real-time Vision APIs, currently under development, to achieve true real-time insights and faster decision-making." - Not sure what this is referring to? Note this is a public resource!
