# Extract Images from PowerPoint, Perform OCR, and Save to Excel

This notebook demonstrates how to:
1. **Export images** from a PowerPoint file (.pptx)
2. **Apply OCR** to each extracted image (using Tesseract via `pytesseract`)
3. **Store the results** (`ImageName : ExtractedText`) in an Excel file

We’ll use:
- [**python-pptx**](https://python-pptx.readthedocs.io/en/latest/) to parse and extract images from `.pptx`
- [**pytesseract**](https://pypi.org/project/pytesseract/) plus [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) to do the text extraction
- [**pandas**](https://pandas.pydata.org/) to store results in a DataFrame and export to Excel

## 1. Prerequisites & Installations
1. **Python 3.7+** environment
2. Install Python libraries:
```bash
pip install python-pptx pytesseract pandas openpyxl
```
3. **Tesseract OCR** installed on your system:
   - Windows: [Download Tesseract installer](https://github.com/UB-Mannheim/tesseract/wiki)
   - macOS: `brew install tesseract`
   - Linux (Debian/Ubuntu): `sudo apt-get install tesseract-ocr`

Make sure `tesseract --version` works in your command prompt/terminal.
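
The same check can be done from Python before any OCR runs. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper name, and the Windows path is the default install location mentioned in this notebook, which may differ on your machine.

```python
import os
import shutil

def find_tesseract(extra_candidates=()):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    # First look on the system PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Then fall back to explicitly listed locations (assumed, adjust as needed).
    for candidate in extra_candidates:
        if os.path.isfile(candidate):
            return candidate
    return None

tesseract_path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(tesseract_path or "tesseract not found; install it or set tesseract_cmd manually")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`.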

## 2. Library Imports & Helper Setup
We’ll import all required libraries in one cell. Also, ensure Tesseract’s executable is in your system PATH or specify its location (e.g., `pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"`).

In [7]:
import os
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE
import pytesseract
from PIL import Image
import pandas as pd

# If Tesseract is not on your PATH, uncomment the line below and adjust the path:
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

## 3. Extract Images from PowerPoint
Below is a code snippet that:
1. Opens a PowerPoint file with `Presentation()`.
2. Iterates over **slides** and **shapes**.
3. If a shape is a **picture**, extract its **raw bytes** (`shape.image.blob`) and write to an image file.
4. Records the image’s path and original name for later OCR.

In [8]:
def extract_images_from_pptx(pptx_file, output_folder="extracted_images"):
    os.makedirs(output_folder, exist_ok=True)
    prs = Presentation(pptx_file)

    image_info_list = []

    for slide_idx, slide in enumerate(prs.slides, start=1):
        for shape_idx, shape in enumerate(slide.shapes, start=1):
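            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                image = shape.image
                image_bytes = image.blob
                # The stored name is often generic (e.g. "image.png"), so prefix it
                # with the slide/shape position to keep every output file unique.
                original_name = os.path.basename(image.filename or f"image.{image.ext}")
                filename = f"slide{slide_idx}_shape{shape_idx}_{original_name}"

                image_path = os.path.join(output_folder, filename)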
                with open(image_path, "wb") as f:
                    f.write(image_bytes)

                image_info_list.append({
                    "image_path": image_path,
                    "image_name": filename,
                    "slide_index": slide_idx
                })

    return image_info_list
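
The loop above visits only top-level shapes, so pictures nested inside a group shape are skipped. One possible way to reach them is a small recursive generator. This is a sketch, not part of the original function: `iter_shapes` is a hypothetical helper, and it assumes that only group shapes expose a child `.shapes` collection.

```python
def iter_shapes(shapes):
    """Yield every shape in `shapes`, descending into group shapes."""
    for shape in shapes:
        yield shape
        # Assumption: only group shapes carry a child `.shapes` collection.
        children = getattr(shape, "shapes", None)
        if children is not None:
            yield from iter_shapes(children)

# Possible use inside extract_images_from_pptx, replacing the inner loop:
#     for shape_idx, shape in enumerate(iter_shapes(slide.shapes), start=1):
#         if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
#             ...
```

The group shapes themselves are also yielded; the `shape_type` check in the extraction loop filters them out.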

## 4. Performing OCR on Extracted Images
We use `pytesseract.image_to_string` on each extracted image. We’ll gather results in a list of dictionaries for easy use.

If your images contain text in languages other than English, or in several languages at once, see the `lang` parameter of `pytesseract.image_to_string()`.
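
Tesseract accepts several languages in one call when their codes are joined with `+` (for example `eng+deu`), provided the matching language data is installed. The sketch below builds that string; `build_lang` and `ocr_with_languages` are hypothetical helper names, and the language codes shown are only examples.

```python
def build_lang(codes):
    """Join Tesseract language codes into one `lang` value, e.g. "eng+deu"."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)

def ocr_with_languages(img, codes=("eng",)):
    """Run OCR on a PIL image using the given language codes."""
    import pytesseract  # imported here so build_lang works without it
    return pytesseract.image_to_string(img, lang=build_lang(codes))
```

For example, `ocr_with_languages(img, ["eng", "deu"])` would ask Tesseract to use the English and German data together.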

In [9]:
def perform_ocr_on_images(image_info_list):
    ocr_results = []
    for info in image_info_list:
        img_path = info["image_path"]
        img_name = info["image_name"]

        with Image.open(img_path) as img:
            text_extracted = pytesseract.image_to_string(img)

        # Clean up text
        text_extracted = text_extracted.strip()

        ocr_results.append({
            "ImageName": img_name,
            "SlideIndex": info["slide_index"],
            "ExtractedText": text_extracted
        })

    return ocr_results
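
If the OCR output is poor, a light preprocessing step before `image_to_string` sometimes helps. The sketch below uses Pillow to convert to grayscale, stretch the contrast, and upscale. `preprocess_for_ocr` and the default `scale=2` are assumptions for illustration; whether this improves results depends on your images.

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(img, scale=2):
    """Return a grayscale, contrast-stretched, upscaled copy of a PIL image."""
    gray = ImageOps.grayscale(img)       # drop color information
    gray = ImageOps.autocontrast(gray)   # stretch the intensity range
    new_size = (gray.width * scale, gray.height * scale)
    return gray.resize(new_size, Image.LANCZOS)  # enlarge small text
```

Inside `perform_ocr_on_images`, the call would then become `pytesseract.image_to_string(preprocess_for_ocr(img))`.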

## 5. Saving Results to Excel
We’ll create a Pandas DataFrame from the OCR results and write to an Excel file using `df.to_excel()`. The resulting file will have columns: `ImageName`, `SlideIndex`, `ExtractedText`.

In [10]:
def save_ocr_results_to_excel(ocr_results, excel_path="image_text_extraction.xlsx"):
    df = pd.DataFrame(ocr_results)
    df.to_excel(excel_path, index=False)
    print(f"Saved OCR results to {excel_path}")
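
OCR output can contain control characters (a form feed between pages, for instance), and openpyxl may refuse to write such values. Excel also limits a cell to 32,767 characters. The sketch below cleans the text with the standard library before export; `clean_for_excel` is a hypothetical helper, and the character ranges are an assumption about what the Excel writer rejects.

```python
import re

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
EXCEL_CELL_LIMIT = 32767  # maximum characters Excel stores in one cell

def clean_for_excel(text):
    """Remove control characters and truncate text to Excel's cell limit."""
    return _CONTROL_CHARS.sub("", text)[:EXCEL_CELL_LIMIT]
```

It could be applied just before writing, for example with `df["ExtractedText"] = df["ExtractedText"].map(clean_for_excel)`.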

## 6. Putting It All Together
Below is a **single function** that:
1. Extracts images from `.pptx`.
2. Runs OCR on each image.
3. Saves the result to Excel.

We’ll call this function with a sample `.pptx` file (replace with your actual path). Make sure Tesseract is installed and accessible!

In [11]:
def extract_images_and_ocr_to_excel(pptx_file,
                                  output_folder="extracted_images",
                                  output_excel="image_text_extraction.xlsx"):
    # 1) Extract images
    image_info_list = extract_images_from_pptx(pptx_file, output_folder)
    print(f"Extracted {len(image_info_list)} images from {pptx_file}")

    # 2) Perform OCR
    ocr_results = perform_ocr_on_images(image_info_list)
    print("OCR completed.")

    # 3) Save to Excel
    save_ocr_results_to_excel(ocr_results, excel_path=output_excel)
    print("All steps finished!")

### Usage Example
Run the cell below after placing your `.pptx` in the same folder or providing its full path.

In [None]:
pptx_file_path = "../files/example.pptx"  # Replace with your actual file
extract_images_and_ocr_to_excel(pptx_file_path,
                                 output_folder="extracted_images",
                                 output_excel="image_text_extraction.xlsx")

## 7. Tips & Troubleshooting
1. **Check Tesseract Installation**: If you get `TesseractNotFoundError`, specify the path:
```python
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```
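2. **Low Accuracy OCR**: Try preprocessing the image before OCR (grayscale conversion, contrast enhancement, upscaling small images), and check that the source text is legible.
3. **Grouping / Flattening**: The extraction loop only looks at top-level shapes, so pictures inside a group shape are not found. Either ungroup them in PowerPoint or recurse into group shapes (`MSO_SHAPE_TYPE.GROUP`) through their `.shapes` collection. Pictures inserted into picture placeholders report `MSO_SHAPE_TYPE.PLACEHOLDER` and are skipped as well.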
4. **Performance**: For large PowerPoint decks with many images, OCR can be slow. Consider parallelization or more advanced OCR solutions.
5. **Multiple Languages**: Use `pytesseract.image_to_string(img, lang='xxx')` if the text is in another language (install appropriate Tesseract language packs).
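
For the performance tip above, one option is a thread pool: each `pytesseract` call runs the Tesseract executable as a separate process, so several calls can overlap. The sketch below is a generic helper, not part of the notebook's pipeline. `ocr_in_parallel`, `ocr_func`, and `max_workers=4` are assumptions, and the actual speedup depends on your machine.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_in_parallel(image_paths, ocr_func, max_workers=4):
    """Apply ocr_func to every path concurrently; results keep input order."""
    paths = list(image_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(ocr_func, paths))
    return list(zip(paths, texts))

# Example ocr_func (assumes pytesseract and Pillow are installed):
#     def ocr_file(path):
#         with Image.open(path) as img:
#             return pytesseract.image_to_string(img).strip()
```

The helper returns `(path, text)` pairs in the same order as the input, so they can be merged back into the slide metadata afterwards.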

## 8. Conclusion
This notebook showcased how to:
1. **Extract images** from `.pptx` slides with `python-pptx`.
2. **Perform OCR** on each image via Tesseract (`pytesseract`).
3. **Store** the extracted text in an **Excel** file using Pandas.

This approach is useful when you need to **translate** or **localize** text embedded in images within PowerPoint slides. Once the text is extracted, you can hand it to translators or feed it into further automation.

**Happy coding & OCR-ing!**