### Image Text Extraction and Description Generation using OpenAI and Tesseract
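This Python script extracts text from an image with the Tesseract OCR engine, then sends that text to the OpenAI API to generate a structured description of the image content. The model receives only the extracted text, not the image itself, so the description is limited to what the OCR step recovers.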
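
The description stage can also be sketched as two small functions, which keeps the prompt construction testable without a network call. This is only a sketch under stated assumptions: `build_messages` and `describe_text` are hypothetical helper names, the prompt is abbreviated, and the call uses the client interface from `openai` 1.0 or later rather than the legacy `openai.ChatCompletion` call used in the notebook cell.

```python
def build_messages(extracted_text):
    """Build the chat messages for the description request (hypothetical helper)."""
    prompt = (
        "Based on the following extracted text from the image:\n\n"
        f"{extracted_text}\n\n"
        "Provide a well-structured and detailed description of the image content."
    )
    return [
        {
            "role": "system",
            "content": "You are an AI that describes images and diagrams based on extracted text.",
        },
        {"role": "user", "content": prompt},
    ]


def describe_text(extracted_text, client=None, model="gpt-4"):
    """Ask the model to describe OCR output and return the reply text (hypothetical helper)."""
    if client is None:
        # Assumption: openai >= 1.0 is installed.
        # OpenAI() reads OPENAI_API_KEY from the environment by default.
        from openai import OpenAI

        client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(extracted_text),
        max_tokens=500,
    )
    return response.choices[0].message.content.strip()
```

With this layout, `describe_text(extracted_text)` would take the place of the request and response handling in Steps 3 and 4, and passing a stub `client` allows the function to be exercised offline.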

In [1]:
import openai
from PIL import Image
import pytesseract
import os 
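
The next cell hard-codes a user-specific Windows path to the Tesseract executable. A more portable option, sketched here as an assumption rather than the author's method, is to look the executable up at run time. `find_tesseract` and the `TESSERACT_CMD` environment variable are hypothetical names chosen for illustration.

```python
import os
import shutil


def find_tesseract(env_var="TESSERACT_CMD"):
    """Return a path to the Tesseract executable, or None if it cannot be found."""
    # 1. An explicit override from the environment (hypothetical variable name).
    override = os.environ.get(env_var)
    if override:
        return override
    # 2. Otherwise search the directories on PATH.
    return shutil.which("tesseract")


# Possible usage in the next cell:
# cmd = find_tesseract()
# if cmd:
#     pytesseract.pytesseract.tesseract_cmd = cmd
```

If Tesseract is already on `PATH`, the assignment to `tesseract_cmd` can be left out entirely, because pytesseract invokes `tesseract` by default.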

In [13]:

# === Step 1: Load the image from the specified path ===
image_path = r"../../data/marketecture.png"

# Load the image
image = Image.open(image_path)

# === Step 2: Extract text from the image using Tesseract === 

# Point pytesseract at the Tesseract executable; change this path to match your installation.
pytesseract.pytesseract.tesseract_cmd = r"C:\Users\sami.arem\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"

extracted_text = pytesseract.image_to_string(image)

# === Step 3: Use the OpenAI API to describe the image ===
openai.api_key = os.environ['OPENAI_API_KEY']


# Prepare a prompt based on the extracted text
prompt = (
    "Based on the following extracted text from the image:\n\n"
    f"{extracted_text}\n\n"
    "Provide a well-structured and detailed description of the image content. Focus on relevant and meaningful text. "
    "Format the description clearly into sections if needed, such as 'Title', 'Subheadings', 'Content', or 'Key Information'. "
    "Ignore any text that seems incomplete, irrelevant, or repetitive. Use proper grammar, punctuation, and clear formatting."
)
# Use the OpenAI API to generate a description.
# Note: openai.ChatCompletion is the legacy interface and requires openai < 1.0.
response = openai.ChatCompletion.create(
    model="gpt-4",  # You can switch to "gpt-3.5-turbo" if needed
    messages=[
        {"role": "system", "content": "You are an AI that describes images and diagrams based on extracted text."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=500
)

# Get the generated description
description = response['choices'][0]['message']['content'].strip()

# === Step 4: Print the description ===
print("Image Description:")
print(description)


Image Description:
Title: Data Science/AI

Subheadings: 

1. Data Streaming & Warehouse 
2. Business Intelligence Orchestration 
3. Unified Governance
4. Unified Data

Content or Key Information:

The image seems to be a visual representation or an infographic related to Data Science or Artificial Intelligence (AI). The four main aspects discussed or represented in the image are:

1. Data Streaming & Warehouse: This indicates the process or system used to collect and store data continuously which is related to data science or AI. A Data Warehouse signifies a massive storage system handling and processing large amounts of data for analytical purposes.

2. Business Intelligence Orchestration: This part of the image is likely to describe the process of integrating, managing, and analyzing business data to create meaningful information that aids in making business decisions.

3. Unified Governance: This implies an amalgamated system or process overseeing and controlling the use, handling, 