
# **OCR Bootcamp Notes**

*(Optical Character Recognition)*

---

## ✅ **1. Introduction to OCR**

### **What is OCR?**

* OCR = **Optical Character Recognition**
* Converts **printed or handwritten text in images** into **editable text**.
* Works by detecting **characters, words, and text layout** from an image or document.

---

### **Why OCR?**

* Automates data entry.
* Digitizes physical documents.
* Enables **searchable PDFs**.
* Used in banking, healthcare, transportation, retail.

---

### **Real-world Applications**

* Scanning books into eBooks.
* License Plate Recognition.
* Invoice & Receipt Automation.
* Passport/ID verification.
* Subtitle extraction from videos.

---

### **OCR vs ICR**

* **OCR**: Recognizes printed (machine-typed) text.
* **ICR** (Intelligent Character Recognition): Recognizes **handwritten text**, typically with machine-learning models.

---

## ✅ **2. How OCR Works**

1. **Image Acquisition** → Capture image/document.
2. **Preprocessing** → Clean image (noise removal, binarization).
3. **Text Detection** → Locate text regions.
4. **Character Recognition** → Extract text.
5. **Post-processing** → Correct errors.
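
Step 5 is easy to overlook. As a minimal sketch, assuming a small vocabulary of expected words is available, misread tokens can be snapped to their closest known word with Python's standard `difflib`. The function name `correct_tokens`, the sample vocabulary, and the 0.8 cutoff are illustrative choices, not part of any OCR library.

```python
import difflib

def correct_tokens(text, vocabulary, cutoff=0.8):
    """Replace each token with its closest vocabulary word, if one is similar enough."""
    corrected = []
    for token in text.split():
        if token in vocabulary:
            corrected.append(token)
            continue
        # get_close_matches returns at most n candidates whose similarity ratio >= cutoff
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

vocabulary = ["Education", "Technical", "Skills", "Experience"]
print(correct_tokens("Educat1on Techn1cal Ski11s 2021", vocabulary))
# Education Technical Ski11s 2021
```

Tokens with a single wrong character are repaired, while `Ski11s` (two wrong characters out of six) falls below the cutoff and is left alone. Lowering the cutoff fixes more tokens but also raises the risk of "correcting" words that were read properly.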

---

## ✅ **3. Image Preprocessing for OCR**

**Why preprocessing?**
Low-quality images (noise, skew, uneven lighting) → low OCR accuracy.
Preprocessing cleans the image before recognition so the engine sees clear, high-contrast text.

### **Techniques**

* **Grayscale conversion** → Reduce complexity.
* **Thresholding** → Convert to black & white.

  * Binary threshold.
  * Adaptive threshold.
* **Noise removal** → Gaussian blur, Median filter.
* **Morphological operations** → Remove small artifacts.
* **Deskewing** → Correct tilted text.

**Code Example (OpenCV Preprocessing):**


In [1]:
import cv2

# Use raw string for path
img = cv2.imread(r'C:\Users\hp\Documents\ds_materials\9.Deep_learning\cv\5.ocr_text_recognition\images\lifestyle-02.jpg')

# Check if image is loaded
if img is None:
    raise FileNotFoundError("Image not found. Check the file path.")

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Apply threshold
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

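# Save processed image (raw string so the backslash is not treated as an escape;
# the 'images' folder must already exist, otherwise imwrite returns False)
cv2.imwrite(r'images\processed.jpg', thresh)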


True


## ✅ **4. OCR Engines & Libraries**

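* **Tesseract OCR** (open source and widely used; originally developed at HP, later sponsored by Google)
* **EasyOCR** (deep learning-based, supports multiple languages)
* **PaddleOCR** (deep learning-based, built on Baidu's PaddlePaddle framework)
* **Google Cloud Vision API** (cloud-based)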

---

### **Installing Tesseract**

**Windows/Linux Setup:**

* Install Tesseract from the official site.
* Add its install folder to the `PATH` environment variable.
* Install the Python wrapper: `pip install pytesseract`


---

### **Basic OCR in Python:**

import pytesseract
from PIL import Image

img = Image.open('text_image.png')
text = pytesseract.image_to_string(img)
print(text)

In [2]:
!pip install pytesseract opencv-python


Collecting pytesseract
  Using cached pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Using cached pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13



## ✅ **5. Advanced OCR Features**

* **Multilingual OCR** → Support for 100+ languages.
* **Custom language models** → Train for new fonts.
* **Extracting structured data** → Tables, forms.
* **Confidence scores** → Per-word confidence values that flag unreliable results.
* **Handwriting recognition** → Using deep learning (ICR).
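
Confidence scores are the easiest of these features to try. With pytesseract, `image_to_data(img, output_type=pytesseract.Output.DICT)` returns a dictionary of parallel lists that includes `text` and `conf`, where layout-only rows carry a confidence of -1. The sketch below filters such a dictionary. The helper name `confident_words`, the threshold of 60, and the sample values are assumptions made for illustration.

```python
def confident_words(data, min_conf=60.0):
    """Keep only words whose Tesseract confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as int, float, or str depending on the pytesseract version
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words

# Hand-made sample in the same shape as image_to_data(..., output_type=Output.DICT)
sample = {
    "text": ["", "Invoice", "N0.", "1042"],
    "conf": [-1, 96.1, 41.7, 88.0],
}
print(confident_words(sample))  # ['Invoice', '1042']
```

Low-confidence words such as `N0.` can then be routed to manual review instead of being trusted. The same pytesseract calls accept a `lang` argument such as `lang="eng+hin"` for multilingual pages, provided the matching Tesseract language data is installed.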

---

## ✅ **6. Improving OCR Accuracy**

* Use **high-resolution images**.
* Apply **deskewing** & **denoising**.
* Convert to **grayscale or binary**.
* Train custom models for complex fonts.
* Use **deep-learning recognizers such as CRNN** (Convolutional Recurrent Neural Network) when classical engines struggle.
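
To judge whether any of these steps actually helps, it is useful to have a number to compare before and after. One common choice is the character error rate (CER): the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below is a plain-Python version. The reference string is an assumed ground truth for the misread `Greater N0lda` that appears in the PDF output in section 7.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

cer = character_error_rate("Greater Noida", "Greater N0lda")
print(f"{cer:.3f}")  # 0.154 (2 wrong characters out of 13)
```

A lower CER after adding a preprocessing step suggests the step is worth keeping for that kind of document.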

---

## ✅ **7. OCR in Applications**

* **Extract text from PDFs** using `pdf2image` + OCR.
* **Real-time OCR** using webcam feed.
* **Batch OCR** for multiple documents.
* **Integrating OCR with Flask/FastAPI for APIs**.
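
Batch OCR is mostly file handling around a single OCR call. The sketch below keeps the OCR step as a parameter (`ocr_fn`) so the loop can be tried without Tesseract installed. The function name `batch_ocr`, the suffix list, and the one-text-file-per-image layout are assumptions for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def batch_ocr(input_dir, output_dir, ocr_fn):
    """Run ocr_fn on every image in input_dir and write one .txt file per image."""
    in_path, out_path = Path(input_dir), Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for image_file in sorted(in_path.iterdir()):
        if image_file.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore non-image files
        try:
            text = ocr_fn(image_file)
        except Exception as exc:
            # One unreadable file should not stop the whole batch
            print(f"Skipping {image_file.name}: {exc}")
            continue
        target = out_path / (image_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With pytesseract and Pillow available, a call could look like `batch_ocr("scans", "texts", lambda p: pytesseract.image_to_string(Image.open(p)))`, where `scans` and `texts` are placeholder folder names.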

In [3]:
!pip install pdf2image

Collecting pdf2image
  Using cached pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Using cached pdf2image-1.17.0-py3-none-any.whl (11 kB)
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0





### **Example: Extract Text from PDF**

In [1]:

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path('document.pdf', 300)
for page in pages:
    text = pytesseract.image_to_string(page)
    print(text)

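PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

`pdf2image` relies on the Poppler command-line tools to render PDF pages, so this error means Poppler is missing or not on `PATH`. The cells below use PyMuPDF instead, which renders pages without Poppler.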

In [7]:
import pytesseract

# Point pytesseract at the Tesseract binary (default Windows install location)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


In [9]:
# !pip install pymupdf pillow


In [10]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io

# Open the PDF
doc = fitz.open("document.pdf")

for i, page in enumerate(doc):
    # Render page to image (in memory)
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))

    # OCR the image using Tesseract
    text = pytesseract.image_to_string(img)
    print(f"--- Page {i + 1} ---")
    print(text)


--- Page 1 ---
MD SHAHIL ANSARI

® mohammadshahi|4u@gmai|.com ﬂﬂ Linkedln ©GitHub ©+91 9708549289 ﬂ Greater N0lda UP

 

Education

Master’s in Computer Application Marwari College (8.88 CGPA) Ranchi, JH 2021 -2023
Bachelor’s in Computer Application DSPMU (8.51 CGPA) Ranchi, JH 2018 - 2021
Technical Skills

Languages and Tools : Python, SQL (MySQL, PostgreSQL), PowerBI, Excel, Git, GitHub, Mlﬂow, NoSQL, Keras.
Libraries & Frameworks : NumPy, Pandas, Matplotlib, Seaborn, PySpark, SK—Learn, Bs4, Flask, Django, Fast API.

Data Science & Machine Learning : Data Collection, Data Preprocessing, Data Visualization, Data Warehousing, Linear
And Logistic regression, KNN, Decision Tree, Random forest, SVM and K Means
Neural Networks, NLP, ETL, LLMS, RestAPI.

Mathematics for ML & DL : Algebra, Statistics, Probability, Matrices

Extra Tools : Dockers, AWS, Django, CI/ CD.

Experience

Data Science Trainer | Itvedant (Mar 2024 — Now)

0 Conduct ofﬂine classes, mock interviews, and practical sessio



## Extract data from image

In [None]:
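import streamlit as st
import cv2
# ocr_utils is a local helper module that is not shown in these notes
from ocr_utils import load_image, preprocess_image, extract_text, extract_data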

st.set_page_config(page_title="OCR Text Recognition")
st.title("📝 OCR Text Recognition with Pytesseract")

uploaded_file = st.file_uploader("Upload an image with text", type=["jpg", "jpeg", "png"])
if uploaded_file:
    image = load_image(uploaded_file.read())
    st.image(image, caption="Original Image", use_column_width=True)

    thresh = preprocess_image(image)
    st.image(thresh, caption="Preprocessed Image", use_column_width=True)

    st.subheader("📄 Extracted Text")
    text = extract_text(thresh)
    st.code(text)

    st.subheader("🔍 Word Detection with Bounding Boxes")
    data = extract_data(thresh)
    n_boxes = len(data['level'])
    for i in range(n_boxes):
        # Skip page/block/line entries, which carry no text of their own
        if not data['text'][i].strip():
            continue
        (x, y, w, h) = (data['left'][i], data['top'][i], data['width'][i], data['height'][i])
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    st.image(image, caption="Detected Words", use_column_width=True)



---

## ✅ **8. Projects & Case Studies**

### **Beginner**

✔ Extract text from an image.
✔ Convert scanned PDF to editable text.

### **Intermediate**

✔ License plate recognition.
✔ Business card reader using OCR + OpenCV.

### **Advanced**

✔ Automated invoice processing with table extraction.
✔ Real-time subtitle extraction from video feed.

---

## ✅ **Practice Session**

**Questions:**

1. Define OCR and its real-world uses.
2. Explain why image preprocessing is important for OCR.
3. Write Python code to extract text from an image using Tesseract.
4. How do you improve OCR accuracy?
5. What is the difference between OCR and ICR?

---

## ✅ **Assignments**

1. Extract text from a scanned handwritten note.
2. Create an OCR pipeline for multilingual documents.
3. Implement OCR for real-time video stream using OpenCV.
4. Process 100 scanned documents and export extracted text into Excel.

---

## ✅ **Tools & Libraries**

* **OpenCV** → Image preprocessing.
* **Pytesseract** → Python wrapper for the Tesseract OCR engine.
* **pdf2image** → Convert PDFs to images (requires Poppler).
* **EasyOCR** → Multilingual deep learning-based OCR.
* **Flask/FastAPI** → Build OCR APIs.

---

## License plate recognition

In [12]:
import cv2
import pytesseract

# Set path to tesseract.exe if needed
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def recognize_plate(image_path):
    # Load image; cv2.imread returns None instead of raising on a bad path
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not load image: {image_path}")

    # Resize (optional for large images)
    img = cv2.resize(img, (600, 400))

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Noise reduction that preserves edges
    blur = cv2.bilateralFilter(gray, 11, 17, 17)

    # Edge detection
    edged = cv2.Canny(blur, 30, 200)

    # Find contours
    contours, _ = cv2.findContours(edged.copy(), cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    # Keep the 10 largest contours by area
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:10]
    plate_contour = None

    for contour in contours:
        # Approximate the contour with a polygon (tolerance: 10 pixels)
        approx = cv2.approxPolyDP(contour, 10, True)
        if len(approx) == 4:  # Four corners: looks like a rectangle
            plate_contour = approx
            break

    if plate_contour is not None:
        # Crop the plate region from the grayscale image
        x, y, w, h = cv2.boundingRect(plate_contour)
        plate = gray[y:y+h, x:x+w]

        # OCR; --psm 8 treats the crop as a single word
        text = pytesseract.image_to_string(plate, config='--psm 8')
        print("Detected Plate Text:", text.strip())

        # Optional: Show detected plate
        cv2.imshow("Plate", plate)
        cv2.waitKey(0)
        cv2.destroyAllWindows()
    else:
        print("License plate contour not detected.")

# Example usage (raw string, so backslashes such as \5 are not read as escape sequences)
recognize_plate(r"S:\ds_materials\9.Deep_learning\cv\5.ocr_text_recognition\car.jpg")


The original run, which lacked the `None` check and passed the path as a normal string, failed with:

error: OpenCV(4.12.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\resize.cpp:4208: error: (-215:Assertion failed) !ssize.empty() in function 'cv::resize'

`cv2.imread` returns `None` when it cannot read a file, and passing `None` to `cv2.resize` triggers this assertion. In a normal (non-raw) string, `\5` is read as an octal escape, so the path no longer pointed at the file.


In [None]:
import cv2
import pytesseract
import os

# Path to image
image_path = r"car.jpg"

# Check if file exists
if not os.path.exists(image_path):
    print(f"❌ File not found at: {image_path}")
    # exit() does not reliably stop a notebook cell; raising SystemExit does
    raise SystemExit(1)

# Read the image
img = cv2.imread(image_path)

if img is None:
    print("❌ Failed to load image. Check format or path.")
    raise SystemExit(1)

# Resize safely
img = cv2.resize(img, (600, 400))
# Continue processing...


❌ Failed to load image. Check format or path.

## Business card reader

In [2]:
import cv2
import pytesseract
import re

# Optional: Set path to tesseract if not in PATH
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def extract_business_card_info(image_path):
    # Read image; raise instead of returning None so callers can rely on getting a dict
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"❌ Could not load image: {image_path}")

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # OCR to extract text
    text = pytesseract.image_to_string(gray)
    print("📃 Raw OCR Text:\n", text)

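    # Regex-based extraction
    # Email: local part, '@', then one or more dot-separated domain labels (handles sub.domain.com)
    email = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
    # Phone: optional '+', a digit, then at least 7 more digits or separators
    phone = re.findall(r'(\+?\d[\d\s\-().]{7,})', text)
    website = re.findall(r'(https?://\S+|www\.\S+)', text)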

    # Just mock logic for demo purposes
    lines = text.strip().split("\n")
    lines = [line.strip() for line in lines if line.strip()]
    name = lines[0] if lines else ""
    company = lines[-1] if len(lines) > 2 else ""

    result = {
        "Name": name,
        "Email": email,
        "Phone": phone,
        "Website": website,
        "Company": company,
    }

    return result

# Example usage
data = extract_business_card_info("business.jpg")
for k, v in data.items():
    print(f"{k}: {v}")


📃 Raw OCR Text:
  

Mariana Anderson

Marketing Manager

+123—456r789O
+123—456r7890

www real\ygreats\te.(om
hel\o@rea\|ygreats>te.com

123 Anywhere St., Any Qty, ST
12345

Business
Logo

 


Name: Mariana Anderson
Email: []
Phone: []
Website: []
Company: Logo
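
The empty `Email`, `Phone`, and `Website` lists come from OCR noise rather than from the card itself: the phone line was read as `+123—456r789O`, which no strict pattern matches. As a hedged sketch of a more tolerant approach, the heuristic below treats any line that is mostly digits as a phone candidate, maps letters that OCR often confuses with digits, and keeps only the digits. The function name `recover_phone`, the confusable map, and both thresholds are assumptions chosen for this one example.

```python
import re

# Characters that OCR commonly confuses with digits (illustrative, not exhaustive)
CONFUSABLES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "|": "1"})

def recover_phone(line: str, min_digits: int = 8, min_digit_ratio: float = 0.6):
    """Return a cleaned phone number if the line looks like one, else None."""
    stripped = line.strip()
    if not stripped:
        return None
    digit_ratio = sum(ch.isdigit() for ch in stripped) / len(stripped)
    if digit_ratio < min_digit_ratio:
        return None  # mostly non-digits: probably a name or an address
    fixed = stripped.translate(CONFUSABLES)
    digits = re.sub(r"\D", "", fixed)  # drop separators and stray characters
    if len(digits) < min_digits:
        return None  # too short, e.g. a postal code
    prefix = "+" if stripped.startswith("+") else ""
    return prefix + digits

for line in ["+123—456r789O", "123 Anywhere St., Any Qty, ST", "12345"]:
    print(repr(line), "->", recover_phone(line))
# '+123—456r789O' -> +1234567890
# '123 Anywhere St., Any Qty, ST' -> None
# '12345' -> None
```

This trades precision for recall, so results are best treated as candidates to confirm. Improving the image before OCR (section 3) remains the more reliable fix.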
