## YOLOv9 Detection Script

This script automates YOLOv9 object detection over the keyframes extracted from a collection of videos. The key components and their functionalities are as follows:

- **Dependencies**: The script uses `os` for directory listing and path handling and `subprocess` for running the detection command.
- **`run_yolov9_detect` Function**: This function constructs and executes a command to run the `detect.py` script from the YOLOv9 repository. It specifies parameters such as the source video file, image size, weights, and options for saving detection results.
- **Base Directory and Models**: The base directory (`keyframes`) holds one subdirectory of keyframes per video, and the `models` list specifies the YOLOv9 variants to run (`c` and `e`).
- **Directory Listing**: The script lists all directories in the base directory, excluding system files like `.DS_Store`.
- **Detection Execution**: The script iterates through each video file and model, running the detection function for each combination, and prints the directories being processed.

This setup ensures that YOLOv9 detection is run systematically across all specified videos and models, with the results and any errors being output to the console.
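
The pairing of videos and models can be sketched on its own, without running any detection. The folder names below are hypothetical; the point is that every run is named `{video}_{model}`, which is the naming the later color-detection cells look for with the pattern `\d{5}_[ec]`.

```python
import itertools
import re

# Hypothetical keyframe folder names; the real ones come from os.listdir(base_dir)
videos = ["00120", "00176"]
models = ["c", "e"]

# One run per (video, model) pair, named the same way as the --name argument
run_names = [f"{video}_{model}" for video, model in itertools.product(videos, models)]
print(run_names)

# The downstream cells select run folders with this pattern
run_pattern = re.compile(r"\d{5}_[ec]")
matching = [name for name in run_names if run_pattern.match(name)]
print(len(matching), "of", len(run_names), "run names match")
```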


In [None]:
import os
import subprocess

def run_yolov9_detect(model, base_dir, video):
    """
    Run the yolov9 detect.py script with specified parameters.

    Args:
        model (str): The YOLOv9 model to use ('c' or 'e').
        base_dir (str): Base directory containing the videos.
        video (str): The video file or directory to run detection on.
    """
    command = [
        'python3', 'yolov9-main/detect.py', 
        '--source', os.path.join(base_dir, video),
        '--img', '640', 
        '--weights', f'yolov9-main/yolov9-{model}-converted.pt', 
        '--name', f'{video}_{model}', 
        '--save-txt', 
        '--save-conf', 
        '--save-crop',
        '--nosave'
    ]

    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        print(f"Script output:\n{result.stdout}")
    except subprocess.CalledProcessError as e:
        print(f"Script failed with error:\n{e.stderr}")

# Configuration
base_dir = 'keyframes'  # Directory containing the keyframes
models = ['c', 'e']  # List of YOLOv9 models to use

# List the per-video directories in the keyframes folder.
# The isdir check already skips plain files such as .DS_Store.
directories = [f for f in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, f))]
print(directories)

# Run detection for each video and model
for video in directories:
    for model in models:
        run_yolov9_detect(model, base_dir, video)
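
The `--save-txt` and `--save-conf` flags write one text file of detections per image. As a rough sketch, assuming the YOLOv5-style label layout (`class x_center y_center width height confidence`, with coordinates normalized to the image size), a line can be converted back to pixel coordinates as follows. The helper name `parse_label_line` and the sample line are illustrative, not part of the YOLOv9 repository.

```python
def parse_label_line(line, img_width, img_height):
    """Parse one label line, assumed to be: class x_center y_center width height [confidence]."""
    parts = line.split()
    class_id = int(parts[0])
    x_c, y_c, w, h = (float(value) for value in parts[1:5])
    # The confidence column is only present when --save-conf was used
    confidence = float(parts[5]) if len(parts) > 5 else None

    # Convert normalized center/size values to pixel corner coordinates
    left = (x_c - w / 2) * img_width
    top = (y_c - h / 2) * img_height
    right = (x_c + w / 2) * img_width
    bottom = (y_c + h / 2) * img_height
    return {"class_id": class_id, "box": (left, top, right, bottom), "confidence": confidence}

# Hypothetical label line for a 640x480 keyframe
example = parse_label_line("24 0.5 0.5 0.25 0.5 0.87", img_width=640, img_height=480)
print(example)
```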


# Image Analysis and Description using Detectron2 and Vision-Transformer Models

### 1. Object Detection and Segmentation
- **Detectron2**: An object detection library by Facebook AI Research (FAIR).
  - **Model**: Utilizes a pre-trained Mask R-CNN model for detecting and segmenting objects within an image.
  - **Configuration**: The model can be configured to run on either CPU or GPU, depending on the availability of a CUDA-enabled GPU.

### 2. Image Captioning
- **Transformers Library**: A library by Hugging Face for natural language processing and vision tasks.
  - **Model**: Uses the `VisionEncoderDecoderModel`, which combines a vision transformer (ViT) encoder with a GPT-2 decoder.
  - **Processor**: The `ViTImageProcessor` for preprocessing images and `AutoTokenizer` for handling the text generation.


1. **Image Loading**:
   - Load an image using the `PIL` library.

2. **Object Detection and Segmentation**:
   - Configure and use the Detectron2 model to detect objects and generate segmentation masks.
   - Visualize the segmentation results and extract instances (detected objects) from the image.

3. **Image and Object Description**:
   - Generate a detailed description of the entire image.
   - For each detected object, generate specific descriptions by focusing on their bounding boxes.

#### Results

- **Description**: Provides a textual description of the overall image.
- **Object Descriptions**: Generates individual descriptions for each detected object, detailing their appearance, actions, or other relevant attributes.
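
Describing a single object means cropping the image to its bounding box first. Detectron2 reports boxes as floating-point `(x0, y0, x1, y1)` values, while `PIL.Image.crop` takes a `(left, upper, right, lower)` tuple, so a small conversion step helps. The helper below is a hypothetical sketch of that step, rounding to whole pixels and clamping to the image bounds.

```python
def box_to_crop(box, image_size):
    """Turn a float (x0, y0, x1, y1) box into an integer crop tuple inside the image.

    Returns None when the clamped box has no area.
    """
    width, height = image_size
    x0, y0, x1, y1 = box
    left = max(0, int(round(x0)))
    upper = max(0, int(round(y0)))
    right = min(width, int(round(x1)))
    lower = min(height, int(round(y1)))
    if right <= left or lower <= upper:
        return None
    return (left, upper, right, lower)

# Hypothetical box that spills over the left and bottom edges of a 640x480 image
crop_box = box_to_crop((-3.2, 10.6, 120.4, 700.0), (640, 480))
print(crop_box)
```

A `None` result signals a degenerate box that is better skipped than captioned.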


In [None]:
import cv2
import torch
import numpy as np
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog

# Function to load and preprocess the image
def load_image(image_path):
    """
    Load an image from the specified file path.

    Args:
        image_path (str): Path to the image file.

    Returns:
        PIL.Image.Image: Loaded image.
    """
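    # Convert to RGB so grayscale or RGBA files are handled consistently by the captioning model
    image = Image.open(image_path).convert("RGB")
    return image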

# Function to detect objects in the image
def detect_objects(image_path, device):
    """
    Detect objects in an image using Detectron2 and visualize the results.

    Args:
        image_path (str): Path to the image file.
        device (str): Device to run the model on ('cpu' or 'cuda').

    Returns:
        detectron2.structures.Instances: Detected instances.
        np.ndarray: Image with visualized instance predictions.
    """
    # Configure Detectron2
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
    cfg.MODEL.DEVICE = device  # Use CPU or GPU
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)

    # Read the image
    image = cv2.imread(image_path)
    outputs = predictor(image)

    # Visualize the predictions (the Visualizer expects an RGB image and CPU tensors)
    v = Visualizer(image[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
    out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    segmented_image = out.get_image()[:, :, ::-1]  # back to BGR channel order

    return outputs["instances"], segmented_image

# Function to describe the background or specific object
def describe_image(image, focus_area=None, device='cpu'):
    """
    Generate a description for the entire image or a specific object using a vision-language model.

    Args:
        image (PIL.Image.Image): Image to describe.
        focus_area (tuple, optional): Bounding box (left, upper, right, lower) to crop the image. Defaults to None.
        device (str): Device to run the model on ('cpu' or 'cuda').

    Returns:
        str: Generated description of the image or object.
    """
    # Load the model and processor
    model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)
    processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
    tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

    # Preprocess the image
    if focus_area:
        image = image.crop(focus_area)

    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

    # Generate the caption
    output_ids = model.generate(pixel_values, max_length=100, num_beams=4, eos_token_id=tokenizer.eos_token_id)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return caption

# Example usage
device = "cuda" if torch.cuda.is_available() else "cpu"
image_path = 'keyframes/00176/00176_Scene-5.jpg'

# Load and detect objects in the image
image = load_image(image_path)
instances, segmented_image = detect_objects(image_path, device)

# Describe the entire image (background)
image_description = describe_image(image, device=device)
print("Image description:", image_description)

# Describe specific objects (uncomment to use)
# for i in range(len(instances)):
#     bbox = instances.pred_boxes[i].tensor.cpu().numpy()[0]
#     focus_area = (bbox[0], bbox[1], bbox[2], bbox[3])
#     object_description = describe_image(image, focus_area, device=device)
#     print(f"Object {i+1} description:", object_description)

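# Optionally, save the segmented image (uncomment to use)
# segmented_image is in BGR order, so flip the channels back to RGB before handing it to PIL
# segmented_image_pil = Image.fromarray(np.ascontiguousarray(segmented_image[:, :, ::-1]))
# segmented_image_pil.save("segmented_image.jpg")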


# Color Detection and Caption Generation with BLIP

This script uses the BLIP (Bootstrapping Language-Image Pre-training) model to generate captions for images extracted from video frames and detect colors mentioned in those captions. The process involves several key steps:

1. **Model Initialization**:
   - Loads the BLIP model (`Salesforce/blip-image-captioning-base`) and its corresponding processor to generate image captions based on the given prompt.

2. **Caption Generation**:
   - **Image Loading and Processing**: Reads and converts images to the appropriate format for BLIP.
   - **Prompt-Based Captioning**: Generates descriptive captions for each image using the BLIP model, with a focus on color description.

3. **Color Extraction**:
   - Analyzes the generated captions to identify and extract color names from a predefined list (e.g., red, green, blue, yellow, etc.).

4. **Directory Processing**:
   - **Folder Traversal**: Walks through the specified directory structure to locate image files.
   - **File Filtering**: Processes only relevant image files (e.g., `.jpg` and `.png`).

5. **Result Structuring**:
   - **Metadata Parsing**: Extracts video name, scene number, and object class from the file path to organize results.
   - **Data Organization**: Stores the generated captions and detected colors in a nested dictionary structure for easy retrieval.

6. **Output Generation**:
   - **JSON Serialization**: Saves the results to JSON files, one per video, ensuring the output directory exists.
   - **Progress Visualization**: Utilizes `tqdm` to display progress bars during the processing of files and directories.
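
The metadata parsing and nested storage steps can be sketched with the standard library alone. The path layout below follows the crop folders written by the detection runs; the helper names `parse_crop_path` and `nested_results` and the sample caption are hypothetical.

```python
import os
import re
from collections import defaultdict

# Crop files are named like "00120_Scene-36_.jpg": five-digit video id, then the scene number
CROP_PATTERN = re.compile(r"(\d{5})_Scene-(\d+)")

def parse_crop_path(image_path):
    """Return (video_name, scene_number, object_class), or None if the name does not match."""
    match = CROP_PATTERN.search(os.path.basename(image_path))
    if match is None:
        return None
    # The object class is the name of the folder that holds the crop
    object_class = os.path.basename(os.path.dirname(image_path))
    return match.group(1), match.group(2), object_class

def nested_results():
    """video -> scene -> object class -> list of entries, created on first access."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

results = nested_results()
path = "yolov9-main/runs/detect/00120_e/crops/backpack/00120_Scene-36_.jpg"
video, scene, object_class = parse_crop_path(path)
# Hypothetical caption and colors, for illustration only
results[video][scene][object_class].append(
    {"image": path, "caption": "a red backpack", "colors": ["red"]}
)
print(video, scene, object_class, len(results[video][scene][object_class]))
```

Using `defaultdict` removes the chain of `if key not in ...` checks; the structure still serializes with `json.dump` because `defaultdict` is a `dict` subclass.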


In [None]:
import os
import json
import re
from transformers import BlipProcessor, BlipForConditionalGeneration
import cv2
from tqdm import tqdm

# Load the BLIP model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def generate_caption(image_path, prompt="Describe the colors."):
    """
    Generate a caption for an image using the BLIP model.

    Args:
        image_path (str): Path to the image file.
        prompt (str): Prompt for the caption generation model. Default is "Describe the colors.".

    Returns:
        str: Generated caption.
    """
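    raw_image = cv2.imread(image_path)
    if raw_image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    raw_image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB)

    # Prepare image and prompt for BLIP; the prompt is passed as text so the caption is conditioned on it
    inputs = processor(images=raw_image, text=prompt, return_tensors="pt")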
    
    # Generate caption
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

def extract_colors(caption):
    """
    Extract color names from a generated caption.

    Args:
        caption (str): Generated caption from the BLIP model.

    Returns:
        list: List of found colors in the caption.
    """
    colors = ["red", "green", "blue", "yellow", "white", "black", "orange", "purple", "brown", "gray", "pink"]
    found_colors = [color for color in colors if re.search(r'\b' + color + r'\b', caption)]
    return found_colors

def process_video_folder(video_folder):
    """
    Process all image files in a video folder, generating captions and extracting colors.

    Args:
        video_folder (str): Path to the video folder containing image files.

    Returns:
        dict: Results organized by video name, scene number, and object class.
    """
    results = {}
    for root, dirs, files in os.walk(video_folder):
        # Use tqdm to display a progress bar
        for file in tqdm(files, desc=f"Processing files in {video_folder}", unit="file"):
            if file.endswith(".jpg") or file.endswith(".png"):
                image_path = os.path.join(root, file)
                caption = generate_caption(image_path)
                colors_in_caption = extract_colors(caption)

                # Parse video name, scene number, and object class from the file path
                match = re.match(r'.*/(\d{5})_Scene-(\d+)', image_path)
                if match:
                    video_name = match.group(1)
                    scene_number = match.group(2)
                    object_class = os.path.basename(os.path.dirname(image_path))

                    if video_name not in results:
                        results[video_name] = {}

                    if scene_number not in results[video_name]:
                        results[video_name][scene_number] = {}

                    if object_class not in results[video_name][scene_number]:
                        results[video_name][scene_number][object_class] = []

                    results[video_name][scene_number][object_class].append({
                        'image': image_path,
                        'caption': caption,
                        'colors': colors_in_caption
                    })
    return results

def main(input_folder, output_folder):
    """
    Main function to process multiple video folders, generate captions, and extract colors.

    Args:
        input_folder (str): Path to the input folder containing video folders.
        output_folder (str): Path to the output folder to save results.
    """
    all_results = {}
    for root, dirs, _ in os.walk(input_folder):
        for dir_name in tqdm(dirs, desc="Processing directories", unit="dir"):
            if re.match(r'\d{5}_[ec]', dir_name):
                video_folder = os.path.join(root, dir_name, 'crops')
                video_results = process_video_folder(video_folder)
                for video_name, scenes in video_results.items():
                    if video_name not in all_results:
                        all_results[video_name] = scenes
                    else:
                        for scene_number, objects in scenes.items():
                            print(f"Parsing {video_name} scene {scene_number} objects")
                            if scene_number not in all_results[video_name]:
                                all_results[video_name][scene_number] = objects
                            else:
                                for object_class, details in objects.items():
                                    if object_class not in all_results[video_name][scene_number]:
                                        all_results[video_name][scene_number][object_class] = details
                                    else:
                                        all_results[video_name][scene_number][object_class].extend(details)

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)
    
    # Save results to JSON files, one per video
    for video_name, data in all_results.items():
        output_file = os.path.join(output_folder, f'{video_name}_colors.json')
        with open(output_file, 'w') as f:
            json.dump({video_name: data}, f, indent=4)

# Example usage
input_folder = 'yolov9-main/runs/detect'
output_folder = 'color_detection_results_blip'

main(input_folder, output_folder)
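
Once the per-video JSON files exist, a quick tally of the detected colors is a convenient sanity check. The sketch below assumes the nested layout produced above (video, then scene, then object class, then a list of entries with a `colors` key); the function name `count_colors` and the sample data are hypothetical.

```python
from collections import Counter

def count_colors(video_results):
    """Count how often each color appears across all scenes and object classes."""
    counts = Counter()
    for scenes in video_results.values():
        for objects in scenes.values():
            for entries in objects.values():
                for entry in entries:
                    counts.update(entry.get("colors", []))
    return counts

# Hypothetical results in the same shape as the saved JSON
sample_results = {
    "00120": {
        "36": {
            "backpack": [{"image": "a.jpg", "caption": "a red backpack", "colors": ["red"]}],
            "person": [{"image": "b.jpg", "caption": "a man in a red and blue shirt", "colors": ["red", "blue"]}],
        }
    }
}
color_counts = count_colors(sample_results)
print(color_counts.most_common())
```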


# Dominant Color Detection with Enhanced Vibrance and Saturation

This script processes an image to identify the two most dominant colors, emphasizing vibrant colors over less vibrant ones like gray and black. The key steps involved in the process are:

1. **Image Preprocessing**:
   - **Saturation Adjustment**: Increases the saturation of all colors in the image to make colors more vivid.
   - **Vibrance Adjustment**: Pushes each pixel's saturation away from the image's mean saturation, scaling the change down for pixels that are already highly saturated so that vivid colors are not over-boosted.

2. **Color Filtering**:
   - Converts the image to the HSV color space to filter out low-saturation colors, reducing the likelihood of selecting colors like gray or black as dominant colors.

3. **K-Means Clustering**:
   - Uses K-means clustering to identify the most common colors in the preprocessed image. A follow-up pass over the cluster centers, ordered by cluster size, keeps only colors that are at least a minimum Euclidean distance apart so that the top colors are distinct.

4. **Color Preference Weighting**:
   - Applies a weighting mechanism to prioritize vibrant colors (e.g., red, green, blue, yellow, orange) over less vibrant ones during the color matching process.

5. **Output**:
   - Ensures that the two most dominant colors are unique and returns their names and RGB values. An optional snippet, commented out at the end of the script, displays the original and preprocessed images side by side to visualize the effect of the adjustments.
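
The effect of the preference weights is easiest to see on a single color. The sketch below reuses two entries from the palette defined in the script and divides the squared RGB distance by the weight, as the script does; the sample color and the helper names are hypothetical, and the hex parsing is done by hand here rather than with `webcolors`.

```python
def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    hex_code = hex_code.lstrip("#")
    return tuple(int(hex_code[i:i + 2], 16) for i in (0, 2, 4))

# Two entries from the script's palette: name -> (hex code, preference weight)
PALETTE = {"red": ("#FF0000", 10), "brown": ("#A52A2A", 1)}

def nearest_color(rgb, weighted=True):
    """Return the palette name with the smallest (optionally weight-scaled) squared distance."""
    best_name, best_score = None, float("inf")
    for name, (hex_code, weight) in PALETTE.items():
        reference = hex_to_rgb(hex_code)
        squared_distance = sum((a - b) ** 2 for a, b in zip(reference, rgb))
        score = squared_distance / weight if weighted else squared_distance
        if score < best_score:
            best_name, best_score = name, score
    return best_name

sample = (200, 40, 40)  # hypothetical cluster center, a dull red
weighted_name = nearest_color(sample)
unweighted_name = nearest_color(sample, weighted=False)
print(weighted_name, unweighted_name)
```

For this sample the plain distance favors brown (1233 against 6225), while dividing by the weights favors red (622.5 against 1233), which is the bias toward vibrant names that the weighting is meant to introduce.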

In [None]:
import os
import json
import re
import cv2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import webcolors
from scipy.spatial.distance import cdist

# Define a dictionary of common colors with an additional preference weight for each color
COMMON_COLORS = {
    'red': ('#FF0000', 10),
    'green': ('#008000', 10),
    'blue': ('#0000FF', 10),
    'yellow': ('#FFFF00', 10),
    'black': ('#000000', 1),
    'white': ('#FFFFFF', 10),
    'orange': ('#FFA500', 8),
    'purple': ('#800080', 1),
    'brown': ('#A52A2A', 1),
    'gray': ('#808080', 1),
    'pink': ('#FFC0CB', 1)
}

def closest_color(requested_color):
    """
    Find the closest color name from the COMMON_COLORS dictionary based on RGB distance.

    Args:
        requested_color (tuple): RGB tuple of the color to find the closest match for.

    Returns:
        str: Name of the closest color.
    """
    min_colors = {}
    for name, (hex_code, weight) in COMMON_COLORS.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(hex_code)
        rd = (r_c - requested_color[0]) ** 2
        gd = (g_c - requested_color[1]) ** 2
        bd = (b_c - requested_color[2]) ** 2
        min_colors[(rd + gd + bd) / weight] = name
    return min_colors[min(min_colors.keys())]

def adjust_saturation(image, saturation_scale=2.0):
    """
    Adjust the saturation of an image.

    Args:
        image (np.ndarray): Input image in RGB format.
        saturation_scale (float): Scale factor for saturation adjustment.

    Returns:
        np.ndarray: Image with adjusted saturation.
    """
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 1] *= saturation_scale
    hsv[..., 1] = np.clip(hsv[..., 1], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def adjust_vibrance(image, vibrance_scale=2.0):
    """
    Adjust the vibrance of an image.

    Args:
        image (np.ndarray): Input image in RGB format.
        vibrance_scale (float): Scale factor for vibrance adjustment.

    Returns:
        np.ndarray: Image with adjusted vibrance.
    """
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
    saturation = hsv[..., 1]
    mean_saturation = np.mean(saturation)
    increase = (1 - (saturation / 255.0)) * (saturation - mean_saturation) * vibrance_scale
    hsv[..., 1] = np.clip(saturation + increase, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)

def preprocess_image(image):
    """
    Preprocess the image by adjusting saturation and vibrance.

    Args:
        image (np.ndarray): Input image in RGB format.

    Returns:
        np.ndarray: Preprocessed image.
    """
    image = adjust_saturation(image, saturation_scale=3.0)
    image = adjust_vibrance(image, vibrance_scale=3.0)
    return image

def get_dominant_colors(image_path, k=10, top_n=2, min_distance=50, saturation_threshold=50):
    """
    Get the dominant colors from an image.

    Args:
        image_path (str): Path to the image file.
        k (int): Number of clusters for KMeans.
        top_n (int): Number of top dominant colors to return.
        min_distance (float): Minimum distance between distinct colors.
        saturation_threshold (float): Minimum saturation threshold for a pixel to be considered.

    Returns:
        list: Names of the top dominant colors.
        list: RGB values of the top dominant colors.
    """
    try:
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        preprocessed_image = preprocess_image(image)
        pixels = preprocessed_image.reshape((-1, 3))
        hsv_pixels = cv2.cvtColor(pixels.reshape(-1, 1, 3).astype(np.uint8), cv2.COLOR_RGB2HSV).reshape(-1, 3)
        pixels = pixels[hsv_pixels[:, 1] > saturation_threshold]

        if pixels.shape[0] == 0:
            raise ValueError(f"No pixels with saturation above {saturation_threshold} in image {image_path}")

        unique_pixels = np.unique(pixels, axis=0)
        if unique_pixels.shape[0] < k:
            k = unique_pixels.shape[0]

        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(pixels)

        counts = np.bincount(kmeans.labels_)
        dominant_indices = np.argsort(-counts)
        dominant_colors = kmeans.cluster_centers_[dominant_indices]

        # Keep cluster centers, largest cluster first, that are at least min_distance apart
        distinct_colors = []
        for color in dominant_colors:
            if all(cdist([color], [distinct_color], metric='euclidean')[0][0] > min_distance for distinct_color in distinct_colors):
                distinct_colors.append(color)
            if len(distinct_colors) >= top_n:
                break

        dominant_color_names = [closest_color(color) for color in distinct_colors]

        distinct_color_names = []
        distinct_colors_filtered = []
        for name, color in zip(dominant_color_names, distinct_colors):
            if name not in distinct_color_names:
                distinct_color_names.append(name)
                distinct_colors_filtered.append(color)

        # If name de-duplication left fewer than top_n colors, backfill from all clusters;
        # names that are already present are skipped
        if len(distinct_color_names) < top_n:
            for color in dominant_colors:
                name = closest_color(color)
                if name not in distinct_color_names:
                    distinct_color_names.append(name)
                    distinct_colors_filtered.append(color)
                if len(distinct_color_names) >= top_n:
                    break

        return distinct_color_names[:top_n], distinct_colors_filtered[:top_n]
    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return [], []

def process_video_folder(video_folder):
    """
    Process all image files in a video folder, detecting dominant colors.

    Args:
        video_folder (str): Path to the video folder containing image files.

    Returns:
        dict: Results organized by video name, scene number, and object class.
    """
    results = {}
    for root, dirs, files in os.walk(video_folder):
        for file in files:
            if file.endswith(".jpg") or file.endswith(".png"):
                image_path = os.path.join(root, file)
                dominant_color_names, dominant_colors = get_dominant_colors(image_path)

                if dominant_color_names:  # Only process if colors were successfully detected
                    match = re.match(r'.*/(\d{5})_Scene-(\d+)', image_path)
                    if match:
                        video_name = match.group(1)
                        scene_number = match.group(2)
                        object_class = os.path.basename(os.path.dirname(image_path))

                        if video_name not in results:
                            results[video_name] = {}

                        if scene_number not in results[video_name]:
                            results[video_name][scene_number] = {}

                        if object_class not in results[video_name][scene_number]:
                            results[video_name][scene_number][object_class] = []

                        results[video_name][scene_number][object_class].append({
                            'image': image_path,
                            'dominant_colors': dominant_color_names
                        })
                        print(f"Processed: {image_path}")
                        print("Dominant Colors:", dominant_color_names)
    return results

def main(input_folder, output_folder):
    """
    Main function to process multiple video folders, detect dominant colors, and save results to JSON files.

    Args:
        input_folder (str): Path to the input folder containing video folders.
        output_folder (str): Path to the output folder to save results.
    """
    all_results = {}
    for root, dirs, _ in os.walk(input_folder):
        for dir_name in dirs:
            if re.match(r'\d{5}_[ec]', dir_name):
                video_folder = os.path.join(root, dir_name, 'crops')
                video_results = process_video_folder(video_folder)
                for video_name, scenes in video_results.items():
                    if video_name not in all_results:
                        all_results[video_name] = scenes
                    else:
                        for scene_number, objects in scenes.items():
                            if scene_number not in all_results[video_name]:
                                all_results[video_name][scene_number] = objects
                            else:
                                for object_class, details in objects.items():
                                    if object_class not in all_results[video_name][scene_number]:
                                        all_results[video_name][scene_number][object_class] = details
                                    else:
                                        all_results[video_name][scene_number][object_class].extend(details)

    os.makedirs(output_folder, exist_ok=True)
    
    output_file_all = os.path.join(output_folder, 'all_detected_colors.json')
    with open(output_file_all, 'w') as f:
        json.dump(all_results, f, indent=4)
        
    for video_name, data in all_results.items():
        output_file = os.path.join(output_folder, f'{video_name}_colors.json')
        with open(output_file, 'w') as f:
            json.dump({video_name: data}, f, indent=4)

# Example usage
input_folder = 'yolov9-main/runs/detect'
output_folder = 'color_detection_results_dominant_color'

main(input_folder, output_folder)

# # Load your cropped image
# image_path = 'yolov9-main/runs/detect/00120_e/crops/backpack/00120_Scene-36_.jpg'
# 
# # Get the dominant colors in the image
# dominant_color_names, dominant_colors = get_dominant_colors(image_path)
# print("Dominant Color Names:", dominant_color_names)
# print("Dominant Colors RGB:", dominant_colors)
# 
# # Display the original image and the image with increased vibrance and saturation
# original_image = cv2.imread(image_path)
# original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
# preprocessed_image = preprocess_image(original_image)
# 
# plt.figure(figsize=(12, 6))
# 
# plt.subplot(1, 2, 1)
# plt.imshow(original_image)
# plt.title("Original Image")
# plt.axis('off')
# 
# plt.subplot(1, 2, 2)
# plt.imshow(preprocessed_image)
# plt.title(f"Preprocessed Image\n(Dominant Colors: {dominant_color_names})")
# plt.axis('off')
# 
# plt.show()
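
The vibrance formula used in `adjust_vibrance` can be checked by hand on a few saturation values. This is a plain-Python sketch of the same arithmetic on a list of numbers instead of an HSV image; the function name `vibrance_shift` and the sample values are hypothetical.

```python
def vibrance_shift(saturations, vibrance_scale=3.0):
    """Apply the vibrance arithmetic to a list of saturation values in the 0-255 range."""
    mean_saturation = sum(saturations) / len(saturations)
    adjusted = []
    for s in saturations:
        # The (1 - s / 255) factor shrinks the change for already saturated pixels
        increase = (1 - s / 255.0) * (s - mean_saturation) * vibrance_scale
        adjusted.append(min(255.0, max(0.0, s + increase)))
    return adjusted

adjusted = vibrance_shift([50.0, 100.0, 200.0])
print([round(value, 1) for value in adjusted])
```

With a mean of about 116.7, the two values below the mean are pulled down (the lowest one is clipped to 0) and the value above the mean rises to roughly 254, so the adjustment widens the gap between dull and vivid pixels rather than lifting all of them.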


# OCR using EasyOCR

This script utilizes OCR (Optical Character Recognition) to extract and process text from video frames. The primary functionalities and steps involved are:

1. **Video Capture and OCR Initialization**:
   - **Video Loading**: Uses `cv2.VideoCapture` to load the video.
   - **OCR Reader Setup**: Initializes the `easyocr.Reader` for English text recognition.

2. **Frame Extraction and Text Recognition**:
   - **Frame Iteration**: Processes the video in steps, defined by `frame_step`, to capture frames at regular intervals.
   - **Timestamp Calculation**: Computes the timestamp for each frame to log when the text appears.
   - **Color Conversion**: Converts frames to RGB format for OCR processing.
   - **Text Extraction**: Uses EasyOCR to extract text from the frames.

3. **Text Similarity and Block Creation**:
   - **Similarity Check**: Compares extracted text from consecutive frames using `difflib.SequenceMatcher` to determine similarity.
   - **Block Management**: Groups similar text into blocks with start and end times. If the text changes significantly, a new block is created.

4. **Results Compilation and Storage**:
   - **Result Structuring**: Organizes extracted text and timestamps into a list of dictionaries.
   - **JSON Output**: Saves the results to a JSON file for each video.

5. **Batch Processing**:
   - **Directory Traversal**: Walks through the input folder to find video files.
   - **Output Management**: Ensures the output directory exists and skips processing if results already exist.
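
The block-creation logic in step 3 can be tried out without a video by feeding it a list of `(timestamp, text)` samples. The sketch below is a simplified stand-in for the loop in `ocr_video`; the function name `group_text_blocks` and the sample texts are hypothetical.

```python
import difflib

def group_text_blocks(samples, step_seconds, similarity_threshold=0.5):
    """Merge consecutive, similar OCR texts into blocks with start and end times."""
    blocks = []
    current = None
    prev_text = ""
    for timestamp, text in samples:
        text = text.strip().lower()
        if not text:
            continue  # frames without text neither extend nor close a block
        similarity = difflib.SequenceMatcher(None, prev_text, text).ratio()
        if current is not None and similarity >= similarity_threshold:
            # Similar enough: extend the current block to cover this sample
            current["end_time"] = timestamp + step_seconds
        else:
            # Different enough: close the current block and start a new one
            if current is not None:
                blocks.append(current)
            current = {"text": text, "start_time": timestamp, "end_time": timestamp + step_seconds}
        prev_text = text
    if current is not None:
        blocks.append(current)
    return blocks

# Hypothetical OCR output sampled every 0.5 seconds
samples = [(0.0, "Get reliable diving gear"), (0.5, "Get reliable diving gear"), (1.0, ""), (1.5, "Shop now")]
blocks = group_text_blocks(samples, step_seconds=0.5)
for block in blocks:
    print(block)
```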


In [None]:
import os
import cv2
import easyocr
import difflib
import json

def ocr_video(video_path, output_file, similarity_threshold=0.5, frame_step=10):
    """
    Perform OCR on a video and save the results to a JSON file.

    Args:
        video_path (str): Path to the input video file.
        output_file (str): Path to the output JSON file.
        similarity_threshold (float): Threshold for text similarity to merge frames into a single block.
        frame_step (int): Number of frames to skip between each OCR operation.

    Returns:
        list: List of OCR results with text, start time, and end time.
    """
    # Initialize the video capture and OCR reader
    cap = cv2.VideoCapture(video_path)
    reader = easyocr.Reader(['en'])

    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    
    # Initialize variables to store the results
    results = []

    prev_text = ""
    current_block = None

    for i in range(0, frame_count, frame_step):
        # Seek to the frame of interest before reading, so the frame matches the timestamp below
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if not ret:
            break
        
        
        # Calculate the timestamp for the current frame
        timestamp = i / fps

        # Convert the frame to RGB (easyocr works on RGB images)
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # Perform OCR on the frame
        ocr_result_list = reader.readtext(frame_rgb, detail=0)
        ocr_result = ' '.join(ocr_result_list).strip().lower()

        # Skip frames with no text
        if not ocr_result:
            continue

        # Calculate similarity with the previous text
        similarity = difflib.SequenceMatcher(None, prev_text, ocr_result).ratio()

        if similarity >= similarity_threshold:
            # If the text is similar enough, update the end time of the current block
            if current_block:
                current_block['end_time'] = timestamp + (frame_step / fps)
            else:
                current_block = {
                    'text': ocr_result,
                    'start_time': timestamp,
                    'end_time': timestamp + (frame_step / fps)
                }
        else:
            # If the text is different enough, finalize the current block and start a new one
            if current_block:
                results.append(current_block)
            current_block = {
                'text': ocr_result,
                'start_time': timestamp,
                'end_time': timestamp + (frame_step / fps)
            }
        prev_text = ocr_result

    # Finalize the last block
    if current_block:
        results.append(current_block)

    cap.release()

    # Save results to a JSON file
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=4)

    return results

def process_videos(input_folder, output_folder, similarity_threshold=0.5, frame_step=10):
    """
    Process multiple videos in a folder, perform OCR, and save the results to JSON files.

    Args:
        input_folder (str): Path to the folder containing input videos.
        output_folder (str): Path to the folder to save the OCR results.
        similarity_threshold (float): Threshold for text similarity to merge frames into a single block.
        frame_step (int): Number of frames to skip between each OCR operation.
    """
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Traverse the input folder
    for root, dirs, files in os.walk(input_folder):
        for dir_name in dirs:
            video_folder = os.path.join(root, dir_name)
            video_file = os.path.join(video_folder, f"{dir_name}.mp4")
            if os.path.exists(video_file):
                output_file = os.path.join(output_folder, f"{dir_name}.json")
                # Check if the output file already exists
                if os.path.exists(output_file):
                    print(f"Skipping {video_file} as output already exists.")
                    continue
                print(f"Processing video: {video_file}")
                ocr_video(video_file, output_file, similarity_threshold, frame_step)

# Example usage
input_folder = 'preprocessed_videos'
output_folder = 'ocr_results'
similarity_threshold = 0.5
frame_step = 10

process_videos(input_folder, output_folder, similarity_threshold, frame_step)


# Search for a string in OCR results

This script compares an input string against OCR results stored in a JSON file to identify similar text segments. The key steps and functionalities are:

1. **Loading OCR Results**:
   - **File Reading**: Loads OCR results from a specified JSON file. Each OCR result contains text and its corresponding start and end timestamps.

2. **String Normalization**:
   - **Input String Processing**: Strips and converts the input string to lowercase for case-insensitive comparison.

3. **Similarity Comparison**:
   - **Text Comparison**: Uses `difflib.SequenceMatcher` to calculate the similarity ratio between the input string and each OCR result's text.
   - **Threshold Filtering**: Compares the similarity ratio against a specified threshold to determine if the texts are similar enough.

4. **Result Compilation**:
   - **Matching Results**: Collects OCR results that meet the similarity threshold, including their text and timestamps.
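
Because `SequenceMatcher.ratio()` compares whole strings, a short query scored against a longer OCR block is penalized for the unmatched characters. The ratio is `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The sketch below illustrates this with the example query; the partial query is a hypothetical addition.

```python
import difflib

def similarity(a, b):
    """Similarity ratio between two strings, in the range 0 to 1."""
    return difflib.SequenceMatcher(None, a, b).ratio()

block_text = "get reliable diving gear"
full_query = "Get reliable diving gear".strip().lower()
partial_query = "diving gear"  # hypothetical shorter query

full_score = similarity(block_text, full_query)        # identical after normalization
partial_score = similarity(block_text, partial_query)  # 2 * 11 / (24 + 11)
print(f"full: {full_score:.2f}, partial: {partial_score:.2f}")
```

The partial query scores about 0.63, so it passes a threshold of 0.6 but would be rejected by the default of 0.8; the threshold therefore controls how much of the block a query has to cover.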


In [None]:
import json
import difflib

def compare_string_with_ocr_results(ocr_file, input_string, similarity_threshold=0.8):
    """
    Compare an input string with OCR results from a JSON file and return matching results based on similarity.

    Args:
        ocr_file (str): Path to the JSON file containing OCR results.
        input_string (str): The input string to compare with the OCR results.
        similarity_threshold (float): The minimum similarity ratio to consider a match (default is 0.8).

    Returns:
        list: A list of matching results, each containing the matching text, start time, and end time.
    """
    # Load OCR results from the file
    with open(ocr_file, 'r') as f:
        ocr_results = json.load(f)

    # Normalize the input string
    input_string = input_string.strip().lower()

    matching_results = []

    for result in ocr_results:
        ocr_text = result['text']
        similarity = difflib.SequenceMatcher(None, ocr_text, input_string).ratio()

        if similarity >= similarity_threshold:
            matching_results.append({
                'text': ocr_text,
                'start_time': result['start_time'],
                'end_time': result['end_time']
            })

    return matching_results

# Example usage
ocr_file = 'ocr_results.json'
input_string = 'Get reliable diving gear'
similarity_threshold = 0.6

matching_results = compare_string_with_ocr_results(ocr_file, input_string, similarity_threshold)

for result in matching_results:
    print(f"Matching Text: {result['text']}, Start Time: {result['start_time']:.2f}, End Time: {result['end_time']:.2f}")
