# Detect Text in Newspaper Images with Feluda

This example shows how to use the `DetectTextInImage` operator to extract text
from newspaper images. It downloads several newspaper clippings, runs OCR on each
one, and prints the detected text along with character and word counts.

### Install Required Packages
Install dependencies conditionally based on whether the notebook is running in Colab or locally.

In [None]:
%%time
import sys

IN_COLAB = "google.colab" in sys.modules
print("Running Notebook in Google Colab" if IN_COLAB else "Running Notebook locally")

if IN_COLAB:
    # Since Google Colab has preinstalled libraries like tensorflow and numba, we create a folder called feluda_custom_venv and isolate the environment there.
    # This is done to avoid any conflicts with the preinstalled libraries.
    %pip install uv
    !mkdir -p /content/feluda_custom_venv
    !uv pip install --target=/content/feluda_custom_venv --prerelease allow feluda feluda-detect-text-in-image-tesseract > /dev/null 2>&1

    sys.path.insert(0, "/content/feluda_custom_venv")
else:
    !uv pip install feluda feluda-detect-text-in-image-tesseract opencv-python matplotlib > /dev/null 2>&1

Running Notebook locally
Using Python 3.10.12 environment at: /home/aatman/Aatman/Tattle/feluda/.venv
Audited 6 packages in 11ms
CPU times: user 6.38 ms, sys: 4.13 ms, total: 10.5 ms
Wall time: 138 ms


### Initialize the Operator

This example uses a single operator, `DetectTextInImage`, to detect text in newspaper clippings. Let's initialize it.

In [None]:
from feluda.factory import ImageFactory
from feluda.operators import DetectTextInImage

detector = DetectTextInImage(psm=6, oem=1)

OCR may not work correctly for these languages.
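
The `psm` and `oem` arguments share their names with Tesseract's page segmentation mode and OCR engine mode options. In Tesseract, page segmentation mode 6 treats the image as a single uniform block of text, and engine mode 1 selects the LSTM neural-network engine. This notebook does not show how the operator forwards these values, so the sketch below only illustrates the equivalent Tesseract command-line flags. `build_tesseract_config` is a hypothetical helper for illustration, not part of Feluda.

```python
def build_tesseract_config(psm: int = 6, oem: int = 1) -> str:
    """Return a Tesseract-style flag string (hypothetical helper, not Feluda API)."""
    # Tesseract documents page segmentation modes 0-13 and engine modes 0-3.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--psm {psm} --oem {oem}"


print(build_tesseract_config(psm=6, oem=1))  # prints: --psm 6 --oem 1
```

Validating the ranges up front gives a clearer error than letting an out-of-range mode reach the OCR engine.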


Define the links to the sample newspaper clippings.

In [2]:
newspaper_image_links = [
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news1.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news2.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news3.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news4.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news5.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news6.png",
    "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news7.png",
]
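
These links point to GitHub's HTML file viewer (`/blob/`) rather than to the image bytes, so they need to be rewritten before download. Here is a minimal sketch of that rewrite, assuming the usual `github.com/<owner>/<repo>/blob/<ref>/<path>` layout. `to_raw_url` is a hypothetical helper for illustration, not part of Feluda.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(blob_url: str) -> str:
    """Rewrite a github.com blob URL to its /raw/ equivalent (hypothetical helper)."""
    parts = urlsplit(blob_url)
    if parts.netloc != "github.com":
        return blob_url  # leave non-GitHub URLs untouched
    # Assumed path layout: /<owner>/<repo>/blob/<ref>/<path>
    segments = parts.path.split("/")
    if len(segments) > 3 and segments[3] == "blob":
        segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


example = "https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news1.png"
print(to_raw_url(example))
```

Because only the segment in the expected position is changed, a folder or file named `blob` deeper in the path is left alone.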

Download each image and extract text using the operator

In [4]:
results = []
for i, image_url in enumerate(newspaper_image_links, 1):
    print(f"Processing image {i}/{len(newspaper_image_links)}...")

    try:
        # Convert the GitHub blob (HTML viewer) URL to a raw URL for direct download
        raw_url = image_url.replace("/blob/", "/raw/")

        # Download image using ImageFactory
        image_obj = ImageFactory.make_from_url_to_path(raw_url)

        # Extract text using the detector
        detected_text = detector.run(image_obj, remove_after_processing=True)

        # Store results (strip once and reuse the cleaned text)
        text = detected_text.strip()
        result = {
            "image_number": i,
            "url": image_url,
            "text": text,
            "text_length": len(text),
            # str.split() on an empty string returns [], so this is 0 when no text is found
            "word_count": len(text.split()),
        }
        results.append(result)
        print(
            f"✓ Extracted {result['text_length']} chars, {result['word_count']} words"
        )

    except (FileNotFoundError, RuntimeError, ValueError) as e:
        print(f"✗ Error processing image {i}: {e}")
        continue

Processing image 1/7...
Downloading image from URL

Image downloaded
✓ Extracted 2345 chars, 421 words
Processing image 2/7...
Downloading image from URL

Image downloaded
✓ Extracted 4036 chars, 587 words
Processing image 3/7...
Downloading image from URL

Image downloaded
✓ Extracted 2983 chars, 493 words
Processing image 4/7...
Downloading image from URL

Image downloaded
✓ Extracted 1814 chars, 308 words
Processing image 5/7...
Downloading image from URL

Image downloaded
✓ Extracted 668 chars, 120 words
Processing image 6/7...
Downloading image from URL

Image downloaded
✓ Extracted 3152 chars, 523 words
Processing image 7/7...
Downloading image from URL

Image downloaded
✓ Extracted 1449 chars, 273 words
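
As a quick sanity check, the per-image counts in this log can be aggregated by hand. The sketch below hard-codes the figures printed above instead of calling the operator, and uses floor division for the averages. `summarize` is a hypothetical helper introduced only for this check.

```python
def summarize(counts: list[tuple[int, int]]) -> dict[str, int]:
    """Aggregate (chars, words) pairs into totals and floor-divided averages."""
    if not counts:
        return {"images": 0, "chars": 0, "words": 0, "avg_chars": 0, "avg_words": 0}
    total_chars = sum(chars for chars, _ in counts)
    total_words = sum(words for _, words in counts)
    n = len(counts)
    return {
        "images": n,
        "chars": total_chars,
        "words": total_words,
        "avg_chars": total_chars // n,
        "avg_words": total_words // n,
    }


# (chars, words) per image, copied from the processing log above
per_image = [
    (2345, 421), (4036, 587), (2983, 493), (1814, 308),
    (668, 120), (3152, 523), (1449, 273),
]
print(summarize(per_image))
# {'images': 7, 'chars': 16447, 'words': 2725, 'avg_chars': 2349, 'avg_words': 389}
```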


Define a helper that prints summary statistics followed by the detected text for each image.

In [None]:
def display_results(results: list[dict]) -> None:
    """
    Display the results of text detection in a formatted manner.
    """
    if not results:
        print("No images were successfully processed.")
        return

    # Summary statistics
    total_images = len(results)
    total_characters = sum(r["text_length"] for r in results)
    total_words = sum(r["word_count"] for r in results)

    print(
        f"\nSUMMARY: {total_images} images, {total_characters:,} chars, {total_words:,} words"
    )
    # results is non-empty here (checked above), so total_images > 0
    print(
        f"Average: {total_characters // total_images:,} chars/image, "
        f"{total_words // total_images:,} words/image"
    )

    # Detailed results for each image
    for result in results:
        print(f"\n--- IMAGE #{result['image_number']} ---")
        print(f"URL: {result['url']}")
        print(f"Stats: {result['text_length']} chars, {result['word_count']} words")
        print("DETECTED TEXT:")

        if result["text"]:
            # Print the text line by line, skipping blank lines
            lines = result["text"].split("\n")
            for line in lines:
                if line.strip():  # Only print non-empty lines
                    print(line.strip())
        else:
            print("(No text detected)")

In [6]:
display_results(results)


SUMMARY: 7 images, 16,447 chars, 2,725 words
Average: 2,349 chars/image, 389 words/image

--- IMAGE #1 ---
URL: https://github.com/tattle-made/feluda_datasets/blob/main/feluda-sample-media/newspaper-clipings/news1.png
Stats: 2345 chars, 421 words
DETECTED TEXT:
ter source after a long wait
Authorities dig a c a ou Sistas
borewell after cs) fe | ae Ws yt, ;
——. CAT
an RTI query So oat ar Reo 3
SIDHARTH YADAV P wa? | |
BHOPAL se Ait aN , t elo
Taunts, casteist slursandbo-  }} \e ‘ \s am. {3
rewell motor being turned ) 3) my
off at whim by upper-caste
households have haunted
members of a Dalit settle- Tough journey: The group took turns to fetch water from
ment ata village in Rewa dis- _ private borewells over a kilometre away. «SPECIAL ARRANGEMENT
trict for 20 years. Buta Right
to Information (RTI) query _ source, closer to them. When the department
has got them a borewell dug “We even approached the didn’t respond, he first ap-
— a water source they can fi- local MLA... he isa messiah 

Clean up the operator

In [7]:
detector.cleanup()