<a href="https://colab.research.google.com/github/sheikh495/Algorithms/blob/main/ipjTotext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# JPG to text using OCR in Google Colab


In [2]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Import libraries
from PIL import Image
import pytesseract

# Step 3: Upload the JPG file
from google.colab import files
uploaded = files.upload()

# Step 4: Load and OCR the uploaded image
import io

for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    text = pytesseract.image_to_string(image)
    print("Extracted Text:\n")
    print(text)


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13


Saving 1.jpg to 1.jpg
Extracted Text:

The software developers at Amazon are working on detecting
configuration anomalies in a server. They are provided with a set of
configurations represented by config, a string of concatenated
decimal digits (0-9). However, some digits in these configurations
have been inadvertently erased.

These configurations were initially generated using a specific
procedure involving two integer parameters, x and y.

The procedure begins with the two numbers, xand y, and initializes a
current value (cur)to 0. The following operation can be performed
any number of times.

* Ineach step, either x or yis added to cur.
* Compute the unit digit of cur (curr% 10) after each addition.

* Record this digit as part of the configuration sequence,

Unfortunately, some of these recorded digits are missing due to data
corruption, complicating the reconstruction of the original
sequence. Additionally, it is known that the first character of each
given configuration string c

# JPG ➜ Text ➜ Word (.docx)

In [4]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow python-docx

# Step 2: Import libraries
from PIL import Image
import pytesseract
from docx import Document
from google.colab import files
import io
import re

# Step 3: Upload image
uploaded = files.upload()

# Step 4: Extract text, clean it, and save to .docx
for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)

    # ✅ Clean invalid characters (remove NULL bytes and control characters)
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)

    # Create a Word document
    doc = Document()
    doc.add_heading('Extracted Text from Image', 0)
    doc.add_paragraph(cleaned_text)

    # Save the document
    doc_filename = filename.rsplit('.', 1)[0] + "_extracted.docx"
    doc.save(doc_filename)

    # Download the file
    files.download(doc_filename)


Saving 2.jpg to 2.jpg
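
The cleaning step in the cell above matters because Word documents are XML underneath, and XML cannot store most control characters. Tesseract output often ends with a form feed (`\x0c`), which is one of them. The sketch below isolates that regular expression so its effect can be checked without running OCR. The helper name `clean_ocr_text` and the sample string are hypothetical; only the character class comes from the cell.

```python
import re

# Same character class as the cell above: the C0 control characters except
# tab (\x09) and newline (\x0A), plus DEL (\x7F).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0B-\x1F\x7F]')


def clean_ocr_text(raw_text: str) -> str:
    """Remove control characters that an XML-based format cannot store."""
    return CONTROL_CHARS.sub('', raw_text)


# Hypothetical OCR output: a NUL byte, a carriage return, a tab,
# and a trailing form feed.
sample = "Line one\x00\r\n\tindented line\n\x0c"
cleaned = clean_ocr_text(sample)
print(repr(cleaned))
```

Tabs and newlines survive, so the paragraph layout of the extracted text is preserved. Carriage returns are removed along with the other control characters.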


# JPG ➜ Text ➜ HTML (.html)

In [5]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Import libraries
from PIL import Image
import pytesseract
from google.colab import files
import io
import re

# Step 3: Upload image
uploaded = files.upload()

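# Step 4: Extract text, clean it, and save to .html
import html  # standard library; escapes &, <, > so OCR text cannot break the markup

for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)

    # ✅ Clean text (remove null/control characters), then escape it for HTML
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)
    cleaned_text = html.escape(cleaned_text)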

    # Step 5: Create simple HTML content
    html_content = f"""
    <html>
    <head><title>Extracted Text</title></head>
    <body>
    <h1>Extracted Text from Image</h1>
    <pre>{cleaned_text}</pre>
    </body>
    </html>
    """

    # Save HTML file
    html_filename = filename.rsplit('.', 1)[0] + "_extracted.html"
    with open(html_filename, "w", encoding="utf-8") as f:
        f.write(html_content)

    # Download the file
    files.download(html_filename)


Saving 5.jpg to 5.jpg
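
When OCR text is dropped into an HTML template like the one above, any `<`, `>` or `&` in the recognized text is read by the browser as markup rather than as content. A minimal sketch of the escaping step, assuming the standard-library `html.escape` is acceptable here; the helper name `to_pre_block` and the sample string are hypothetical.

```python
import html


def to_pre_block(text: str) -> str:
    """Wrap text in <pre>, escaping characters that HTML treats as markup."""
    return f"<pre>{html.escape(text)}</pre>"


# Hypothetical OCR output containing characters that are special in HTML
sample = "cur % 10 < x & y > 0"
block = to_pre_block(sample)
print(block)
```

Escaping happens after the control-character cleanup and immediately before the text enters the template, so the saved file shows the recognized characters literally.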


# Enhanced HTML Output with Full Color Styling

In [6]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Import libraries
from PIL import Image
import pytesseract
from google.colab import files
import io
import re

# Step 3: Upload image
uploaded = files.upload()

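# Step 4: Extract text, clean it, and save to colorful HTML
import html  # standard library; escapes &, <, > so OCR text cannot break the markup

for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)

    # Clean bad characters, then escape the text for HTML
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)
    cleaned_text = html.escape(cleaned_text)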

    # Step 5: Styled HTML content
    html_content = f"""
    <html>
    <head>
        <title>Extracted Text</title>
        <style>
            body {{
                background: linear-gradient(to right, #1e3c72, #2a5298);
                color: #ffffff;
                font-family: 'Courier New', monospace;
                padding: 40px;
                line-height: 1.6;
            }}
            h1 {{
                color: #ffcc00;
                font-size: 28px;
                text-shadow: 1px 1px 2px #000;
            }}
            pre {{
                background-color: rgba(0, 0, 0, 0.3);
                padding: 20px;
                border-radius: 10px;
                white-space: pre-wrap;
                word-wrap: break-word;
                box-shadow: 0 0 10px #00000066;
            }}
        </style>
    </head>
    <body>
        <h1>Extracted Text from Image</h1>
        <pre>{cleaned_text}</pre>
    </body>
    </html>
    """

    # Save HTML file
    html_filename = filename.rsplit('.', 1)[0] + "_extracted.html"
    with open(html_filename, "w", encoding="utf-8") as f:
        f.write(html_content)

    # Download the file
    files.download(html_filename)


Saving 1.jpg to 1 (2).jpg
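
The output above shows a re-uploaded file being saved under a name with a counter, such as `1 (2).jpg`. The cells derive the download name by cutting at the last dot and appending a suffix. The sketch below isolates that step so it can be checked against such names; the function names are hypothetical, and the `pathlib` variant is an alternative shown for comparison, not something the notebook uses.

```python
from pathlib import Path


def output_name(uploaded_name: str, suffix: str = "_extracted.html") -> str:
    """Same expression as the cells: drop the text after the last dot."""
    return uploaded_name.rsplit('.', 1)[0] + suffix


def output_name_pathlib(uploaded_name: str, suffix: str = "_extracted.html") -> str:
    """Alternative using Path.stem, which also drops only the last extension."""
    return Path(uploaded_name).stem + suffix


# Names taken from the upload messages, plus a hypothetical name without an extension
for name in ["1.jpg", "1 (2).jpg", "scan"]:
    print(name, "->", output_name(name), "|", output_name_pathlib(name))
```

Both variants agree on these inputs. A name with no dot is left whole, so the suffix is simply appended.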


# Book-Style Formatted HTML Output

In [7]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Imports
from PIL import Image
import pytesseract
from google.colab import files
import io
import re

# Step 3: Upload image
uploaded = files.upload()

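# Step 4: Extract text, clean it, and save in book-style HTML
import html  # standard library; escapes &, <, > so OCR text cannot break the markup

for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)

    # Clean invalid characters, then escape the text for HTML
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)
    cleaned_text = html.escape(cleaned_text)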

    # Book-style HTML content
    html_content = f"""
    <html>
    <head>
        <title>Extracted Text - Book Format</title>
        <style>
            body {{
                max-width: 700px;
                margin: 50px auto;
                padding: 40px 60px;
                font-family: 'Georgia', serif;
                font-size: 18px;
                line-height: 1.8;
                color: #333333;
                background-color: #fafafa;
                box-shadow: 0 0 20px rgba(0,0,0,0.1);
                text-align: justify;
                border-radius: 10px;
            }}
            h1 {{
                text-align: center;
                font-weight: normal;
                font-size: 36px;
                margin-bottom: 40px;
                color: #5B4636;
                font-family: 'Palatino Linotype', 'Book Antiqua', Palatino, serif;
            }}
            pre {{
                white-space: pre-wrap;       /* Wrap long lines */
                word-wrap: break-word;
                font-family: 'Georgia', serif;
                font-size: 18px;
                line-height: 1.8;
                margin: 0;
                background: none;
                box-shadow: none;
                padding: 0;
            }}
        </style>
    </head>
    <body>
        <h1>Extracted Text from Image</h1>
        <pre>{cleaned_text}</pre>
    </body>
    </html>
    """

    # Save and download the file
    html_filename = filename.rsplit('.', 1)[0] + "_book_format.html"
    with open(html_filename, "w", encoding="utf-8") as f:
        f.write(html_content)

    files.download(html_filename)


Saving 1.jpg to 1 (3).jpg


# Research Paper Styled HTML Output

In [8]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Imports
from PIL import Image
import pytesseract
from google.colab import files
import io
import re

# Step 3: Upload image
uploaded = files.upload()

# Step 4: Extract text, clean it, and save in research paper style HTML
for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)

    # Clean invalid characters
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)

    # Simple split for title and body; customize or extract sections if needed.
    # str.split always returns at least one item, so fall back when the first line is empty.
    lines = cleaned_text.strip().split('\n')
    title = lines[0].strip() or "Research Paper Title"
    body_text = '\n'.join(lines[1:])

    # Research paper style HTML
    html_content = f"""
    <html>
    <head>
        <title>{title}</title>
        <style>
            @page {{
                margin: 1in;
            }}
            body {{
                font-family: 'Times New Roman', Times, serif;
                margin: 1in;
                font-size: 12pt;
                line-height: 2; /* double spacing */
                color: #000000;
            }}
            h1.title {{
                text-align: center;
                font-weight: bold;
                font-size: 16pt;
                margin-bottom: 0.5em;
                text-transform: uppercase;
            }}
            h2.section-header {{
                font-weight: bold;
                font-size: 14pt;
                margin-top: 1.5em;
                margin-bottom: 0.5em;
                border-bottom: 1px solid #000;
                padding-bottom: 4px;
            }}
            p {{
                text-align: justify;
                margin: 0 0 1em 0;
            }}
            footer {{
                position: fixed;
                bottom: 0;
                width: 100%;
                text-align: center;
                font-size: 10pt;
                color: #888;
                border-top: 1px solid #ccc;
                padding-top: 5px;
            }}
        </style>
    </head>
    <body>
        <h1 class="title">{title}</h1>

        <h2 class="section-header">Abstract</h2>
        <p>{body_text[:1000]}</p>

        <h2 class="section-header">Introduction</h2>
        <p>{body_text[1000:2000]}</p>

        <h2 class="section-header">Main Content</h2>
        <p>{body_text[2000:]}</p>

        <footer>
            Page 1
        </footer>
    </body>
    </html>
    """

    # Save and download the file
    html_filename = filename.rsplit('.', 1)[0] + "_research_paper.html"
    with open(html_filename, "w", encoding="utf-8") as f:
        f.write(html_content)

    files.download(html_filename)


Saving 5.jpg to 5 (1).jpg
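
The cell above fills the Abstract, Introduction and Main Content sections with fixed 1000-character slices, which can cut a sentence or word in half. The sketch below shows a different technique, not the one the cell uses: it packs whole blank-line-separated paragraphs into chunks of roughly a given size and escapes each chunk on its own. All function names and the `limit` parameter are hypothetical.

```python
import html


def split_title_body(text: str, default_title: str = "Research Paper Title"):
    """Use the first non-empty line as the title and the rest as the body."""
    lines = text.strip().split('\n')
    title = lines[0].strip() or default_title
    body = '\n'.join(lines[1:]).strip()
    return title, body


def chunk_paragraphs(body: str, limit: int = 1000):
    """Greedily pack whole paragraphs into chunks of about `limit` characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in body.split('\n\n')):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > limit:
            chunks.append(current)  # current chunk is full; start a new one
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def render_sections(chunks):
    """Escape each chunk separately so no HTML entity is split across sections."""
    return "\n".join(f"<p>{html.escape(chunk)}</p>" for chunk in chunks)


title, body = split_title_body("My Title\n\nFirst para.\n\nSecond para.")
chunks = chunk_paragraphs(body, limit=15)
print(title, chunks)
```

A single paragraph longer than `limit` is kept whole rather than cut, so chunk sizes are approximate by design.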


# Multiple JPGs → Extract Text → Combine → Save as One Research Paper HTML

In [9]:
# Step 1: Install required packages
!sudo apt install tesseract-ocr -y
!pip install pytesseract Pillow

# Step 2: Imports
from PIL import Image
import pytesseract
from google.colab import files
import io
import re

# Step 3: Upload multiple images
uploaded = files.upload()  # You can select multiple files in the file picker

# Step 4: Extract text from each, clean, escape for HTML, and accumulate
import html  # standard library; escapes &, <, > in the OCR text

all_text = []
for filename in uploaded:
    image = Image.open(io.BytesIO(uploaded[filename]))
    raw_text = pytesseract.image_to_string(image)
    cleaned_text = re.sub(r'[\x00-\x08\x0B-\x1F\x7F]', '', raw_text)
    all_text.append(html.escape(cleaned_text.strip()))

# Combine all texts with page breaks
combined_text = "\n\n--- PAGE BREAK ---\n\n".join(all_text)

# Optional: split for a simple research paper structure.
# str.split always returns at least one item, so fall back when the first line is empty.
lines = combined_text.split('\n')
title = lines[0].strip() or "Research Paper Title"
body_text = '\n'.join(lines[1:])

# Step 5: Create research paper style HTML with all combined text
html_content = f"""
<html>
<head>
    <title>{title}</title>
    <style>
        @page {{
            margin: 1in;
        }}
        body {{
            font-family: 'Times New Roman', Times, serif;
            margin: 1in;
            font-size: 12pt;
            line-height: 2; /* double spacing */
            color: #000000;
        }}
        h1.title {{
            text-align: center;
            font-weight: bold;
            font-size: 16pt;
            margin-bottom: 0.5em;
            text-transform: uppercase;
        }}
        p {{
            text-align: justify;
            margin: 0 0 1em 0;
            white-space: pre-wrap;
        }}
        .page-break {{
            page-break-after: always;
            border-top: 1px dashed #aaa;
            margin: 2em 0;
        }}
    </style>
</head>
<body>
    <h1 class="title">{title}</h1>

    <p>{body_text.replace('--- PAGE BREAK ---', '</p><div class="page-break"></div><p>')}</p>

</body>
</html>
"""

# Step 6: Save and download
html_filename = "combined_research_paper.html"
with open(html_filename, "w", encoding="utf-8") as f:
    f.write(html_content)

files.download(html_filename)


Saving 1.jpg to 1 (4).jpg
Saving 2.jpg to 2 (1).jpg
Saving 3.jpg to 3.jpg
Saving 4.jpg to 4.jpg
Saving 5.jpg to 5 (2).jpg
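
The combining step works in two passes: the per-image texts are joined with a plain-text marker, and the marker is later swapped for a closing `</p>`, a page-break `<div>`, and a new `<p>`. The sketch below reproduces that round trip on hypothetical page texts so the resulting markup can be inspected. The function name `combine_pages` is an assumption; the marker string and the replacement markup are the ones used in the cell.

```python
import html

PAGE_MARKER = "--- PAGE BREAK ---"
PAGE_BREAK_HTML = '</p><div class="page-break"></div><p>'


def combine_pages(pages):
    """Join page texts with the marker, then turn each marker into a page break."""
    escaped = [html.escape(page.strip()) for page in pages]
    joined = f"\n\n{PAGE_MARKER}\n\n".join(escaped)
    return "<p>" + joined.replace(PAGE_MARKER, PAGE_BREAK_HTML) + "</p>"


# Hypothetical OCR results for two images
pages = ["Page one text", "Page two: a < b"]
result = combine_pages(pages)
print(result)
```

One caveat of the marker approach: if an image happens to contain the literal marker text, it is converted into a page break as well.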


# Full Code: JPGs → Text → Research Paper-Style PDF

In [3]:
!pip install reportlab pytesseract Pillow

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from PIL import Image
import pytesseract
import io
from google.colab import files

# Upload images
uploaded = files.upload()

# Extract text
all_text = []
for filename in uploaded:
    img = Image.open(io.BytesIO(uploaded[filename]))
    text = pytesseract.image_to_string(img)
    all_text.append(text.strip())

# Create PDF
from reportlab.lib.pagesizes import letter
from xml.sax.saxutils import escape  # standard library

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
flowables = []

for section in all_text:
    for para in section.split("\n\n"):
        if para.strip():
            # Paragraph parses XML-like markup, so escape &, <, > in the OCR text
            flowables.append(Paragraph(escape(para.strip()), styles["Normal"]))
            flowables.append(Spacer(1, 12))

doc.build(flowables)
files.download("output.pdf")


Collecting reportlab
  Downloading reportlab-4.4.3-py3-none-any.whl.metadata (1.7 kB)
Downloading reportlab-4.4.3-py3-none-any.whl (2.0 MB)
Installing collected packages: reportlab
Successfully installed reportlab-4.4.3


Saving 1.jpg to 1 (8).jpg
Saving 2.jpg to 2 (5).jpg
Saving 3.jpg to 3 (4).jpg
Saving 4.jpg to 4 (4).jpg
Saving 5.jpg to 5 (6).jpg
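
The PDF cell splits each page's text on blank lines and turns every non-empty piece into a ReportLab `Paragraph`. Because `Paragraph` interprets XML-like inline markup, raw OCR text containing `<` or `&` can be misread or rejected. The sketch below prepares the paragraph strings using only the standard library, so it runs without ReportLab. The helper name `to_paragraph_markup` is hypothetical, and collapsing wrapped lines into single spaces is an assumption about the desired layout.

```python
from xml.sax.saxutils import escape


def to_paragraph_markup(page_text: str):
    """Split on blank lines, collapse line wraps, and escape &, <, >."""
    paragraphs = []
    for para in page_text.split("\n\n"):
        para = para.strip()
        if para:
            # Join OCR line wraps so the paragraph can reflow in the PDF
            paragraphs.append(escape(" ".join(para.split())))
    return paragraphs


# Hypothetical OCR output: one wrapped paragraph, extra blank lines, a second paragraph
sample = "a < b\nand c & d\n\n\n\nnext"
markup = to_paragraph_markup(sample)
print(markup)
```

Each returned string could then be passed to `Paragraph(text, styles["Normal"])` in the same way as in the cell above.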

