In [2]:
import os
import cv2
import torch
import easyocr
import numpy as np
import re
from imutils import rotate_bound
from tqdm import tqdm
import json
from pdf2image import convert_from_path
from torch.utils.data import Dataset, DataLoader

# OCR Model with Preprocessing, PDF-to-Image Conversion, and Training Loop

## Overview
This notebook demonstrates a complete pipeline for Optical Character Recognition (OCR) using a combination of EasyOCR and PyTorch. It includes:
1. **PDF-to-Image Conversion** - Converts scanned PDFs into high-resolution images for processing.  
2. **Image Preprocessing** - Enhances image quality by applying grayscale conversion, sharpening, thresholding, and de-skewing techniques.  
3. **Custom Dataset and DataLoader** - Defines a dataset class for loading and augmenting image data.  
4. **Training Loop** - Implements a demonstration training loop for a small CNN. It uses randomly generated placeholder labels and is independent of EasyOCR.  
5. **Text Extraction** - Uses EasyOCR to extract text from preprocessed images and saves the output in JSON format.  

## Dependencies
Make sure the following libraries are installed:
- OpenCV
- PyTorch
- EasyOCR
- NumPy
- tqdm
- pdf2image (requires the Poppler utilities to be installed on the system)
- imutils

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

reader = easyocr.Reader(['en'], gpu=(device == 'cuda'))

## Device Configuration and EasyOCR Initialization

### Device Setup
The script dynamically detects whether a **CUDA-enabled GPU** is available. If a GPU is found, computations will leverage it for faster processing. Otherwise, it defaults to the **CPU**.

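```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

### EasyOCR Initialization
An EasyOCR `Reader` is created for English (`['en']`), with GPU inference enabled only when `device` is `'cuda'`. The same `reader` object is reused for every image later in the notebook.
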
In [None]:
input_folder = 'aadhar.v1i.yolov5pytorch/train/images'
output_file = 'aadhar_extracted_texts.json'
MAX_IMAGES = 200

In [None]:
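class OCRDataset(Dataset):
    """Loads images as normalized single-channel tensors of shape (1, 640, 640)."""

    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        img = cv2.imread(img_path)
        if img is None:  # cv2.imread returns None instead of raising
            raise FileNotFoundError(f"Could not read image: {img_path}")
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img = cv2.resize(img, (640, 640))  # aspect ratio is not preserved
        # Add a channel dimension and scale pixel values to [0, 1].
        img = torch.tensor(img, dtype=torch.float32).unsqueeze(0) / 255.0
        return img, img_path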

# OCR Dataset Class Definition

This section defines the `OCRDataset` class, a custom dataset class for Optical Character Recognition (OCR) tasks. The class inherits from `torch.utils.data.Dataset` and provides methods for loading and processing images.

### Class: `OCRDataset`

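The class takes a list of image paths. For each item it reads the image, converts it to grayscale, resizes it to 640×640 pixels, scales pixel values to [0, 1], and returns the resulting `(1, 640, 640)` tensor together with the file path. It does not return labels.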

In [None]:
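def pdf_to_images(pdf_path, output_dir):
    # Create the output directory if needed; PIL's save() does not create it.
    os.makedirs(output_dir, exist_ok=True)
    # Render every page at 300 DPI; the result is a list of PIL images.
    images = convert_from_path(pdf_path, dpi=300)
    image_paths = []
    for i, image in enumerate(images):
        image_path = os.path.join(output_dir, f'page_{i+1}.jpg')
        image.save(image_path, 'JPEG')
        image_paths.append(image_path)
    return image_paths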

# PDF to Image Conversion

This section defines the `pdf_to_images` function, which converts each page of a PDF into an image and saves them in a specified output directory.

### Function: `pdf_to_images(pdf_path, output_dir)`

The function converts the pages of a PDF file into images, saves them as JPEG files, and returns the paths of the saved images.

#### Explanation:

- **Input:**
  - `pdf_path`: The path to the PDF file that needs to be converted.
  - `output_dir`: The directory where the converted images will be saved.

- **Process:**
  - The `convert_from_path` function from the `pdf2image` library is used to convert the PDF into a list of images, one for each page, at a resolution of 300 DPI.
  - Each image is saved as a JPEG file in the specified output directory.
  - The function constructs the file path for each image using the format `page_{i+1}.jpg`, where `i` is the page index.
  - The image paths are appended to a list, which is returned as the output.

- **Output:**
  - A list of file paths for the saved images.

This function can be used to preprocess PDF files by converting them into images for further processing, such as Optical Character Recognition (OCR) or analysis.
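
One practical detail of the `page_{i+1}.jpg` naming scheme: plain string sorting places `page_10.jpg` before `page_2.jpg`, which matters if the saved files are later ordered with `list.sort()`. The sketch below contrasts the original names with a zero-padded variant. The helper `padded_page_name` and its width of 4 are illustrative assumptions, not part of the notebook.

```python
def page_name(i):
    """Name used by pdf_to_images for the page at zero-based index i."""
    return f"page_{i + 1}.jpg"


def padded_page_name(i, width=4):
    """Hypothetical variant: zero-pad the page number so names sort in page order."""
    return f"page_{i + 1:0{width}d}.jpg"


pages = range(12)
original = [page_name(i) for i in pages]
padded = [padded_page_name(i) for i in pages]

# Lexicographic sorting breaks page order for the unpadded names...
print(sorted(original)[:4])  # ['page_1.jpg', 'page_10.jpg', 'page_11.jpg', 'page_12.jpg']
# ...but preserves it for the zero-padded ones.
print(sorted(padded) == padded)  # True
```

If page order matters downstream, either keep the list returned by `pdf_to_images` (it is already in page order) or use padded names.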


In [None]:
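def preprocess_image(img_path):
    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"Could not read image: {img_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 3x3 sharpening kernel; the weights sum to 1, so flat regions are unchanged.
    kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
    sharpened = cv2.filter2D(gray, -1, kernel)
    # Binarize with a Gaussian-weighted local threshold (block size 21, C = 10).
    thresh = cv2.adaptiveThreshold(
        sharpened, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 21, 10)
    # Estimate skew from the minimum-area rectangle around all nonzero pixels.
    coords = np.column_stack(np.where(thresh > 0))
    angle = cv2.minAreaRect(coords)[-1]
    # Convert the rectangle angle into a correction angle. This branch assumes
    # angles in [-90, 0); check the convention of the installed OpenCV version.
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    # Rotate about the image center, then upscale by 1.5x.
    (h, w) = thresh.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(thresh, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    resized = cv2.resize(rotated, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)
    return resized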

# Image Preprocessing

This section defines the `preprocess_image` function, which applies several image processing techniques to enhance and prepare the image for tasks like Optical Character Recognition (OCR).

### Function: `preprocess_image(img_path)`

The function processes an image by converting it to grayscale, sharpening it, applying thresholding, correcting rotation, and resizing the image.

#### Explanation:

- **Input:**
  - `img_path`: The file path of the image to be preprocessed.

- **Process:**
  - **Grayscale Conversion**: The image is converted to grayscale so that later steps operate on intensity only.
  - **Sharpening**: A 3×3 kernel (center weight 9, all neighbors -1) is convolved with the grayscale image to emphasize edges. The weights sum to 1, so flat regions keep their brightness.
  - **Adaptive Thresholding**: Gaussian adaptive thresholding (block size 21, constant 10) turns the sharpened image into a binary image, which separates text from background even under uneven lighting.
  - **Angle Detection**: The skew angle is taken from the minimum-area rectangle fitted around all nonzero pixels of the binary image. With `THRESH_BINARY` the nonzero pixels are the bright ones, so on dark-on-light scans the rectangle encloses the background rather than the text.
  - **Rotation Correction**: The image is rotated about its center by the corrected angle, using bicubic interpolation and replicated borders.
  - **Resizing**: The image is upscaled by a factor of 1.5 with bicubic interpolation so that small characters are easier to recognize.

- **Output:**
  - The preprocessed image as a NumPy array: binarized, rotation-corrected, and upscaled.

This function is typically used to prepare images for OCR by improving the visibility of text and correcting any distortions such as skewed or rotated text.
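
The rotation-correction branch is easier to follow in isolation. The helper below reproduces only the angle arithmetic from `preprocess_image`. The name `correction_angle` is an assumption made for this sketch, and it presumes that `cv2.minAreaRect` reports angles in the range [-90, 0), which depends on the installed OpenCV release.

```python
def correction_angle(rect_angle):
    """Map a minAreaRect angle to the rotation passed to getRotationMatrix2D.

    Mirrors the if/else in preprocess_image. Assumes rect_angle lies in
    [-90, 0), the convention this branch was written for.
    """
    if rect_angle < -45:
        return -(90 + rect_angle)
    return -rect_angle


for rect_angle in (-3.0, -45.0, -87.0):
    print(rect_angle, "->", correction_angle(rect_angle))
# -3.0 -> 3.0
# -45.0 -> 45.0
# -87.0 -> -3.0
```

Under that assumption the correction never exceeds 45° in magnitude, so this step cannot fix a page rotated by 90° or 180°. The extraction step handles those cases by trying four orientations.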


In [None]:
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9.,:/\n\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Text Cleaning

This section defines the `clean_text` function, which processes raw text by removing unwanted characters and extra spaces to improve its quality for further analysis.

### Function: `clean_text(text)`

The function keeps only ASCII letters, digits, whitespace, and the punctuation `.,:/`, then collapses runs of whitespace into single spaces.

#### Explanation:

- **Input:**
  - `text`: The raw text string that needs to be cleaned.

- **Process:**
  - **Removing Unwanted Characters**: A regular expression deletes every character that is not an ASCII letter, a digit, whitespace, or one of `.`, `,`, `:`, `/`. Hyphens, brackets, and non-Latin scripts are therefore dropped.
  - **Condensing Whitespace**: Every run of whitespace, including newlines and tabs, is replaced by a single space, and leading and trailing whitespace is stripped. The result is always a single line.
  
- **Output:**
  - The cleaned text, with unwanted characters removed and extra spaces condensed.

This function is typically used to preprocess raw text data, especially for tasks like text recognition (e.g., OCR) or text analysis, ensuring the text is in a usable and standardized format.
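
A short run on a made-up string shows what the two substitutions keep and drop. The sample text is invented for illustration and does not come from the dataset.

```python
import re


def clean_text(text):
    # Same two substitutions as the notebook's clean_text.
    text = re.sub(r'[^a-zA-Z0-9.,:/\n\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


raw = "Name:  R@vi Kumar\nDOB: 01/01/1990 |  #1234 5678 9012"
print(clean_text(raw))
# Name: Rvi Kumar DOB: 01/01/1990 1234 5678 9012

# Hyphens are not in the allowed set, so hyphenated dates lose their separators.
print(clean_text("12-03-1999"))  # 12031999
```

If hyphenated dates or identifiers matter, `-` would need to be added to the character class.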


In [None]:
def train_ocr_model(dataloader, epochs=5, lr=0.001):
    model = torch.nn.Sequential(
        torch.nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(kernel_size=2, stride=2),
        torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(kernel_size=2, stride=2),
        torch.nn.Flatten(),
        torch.nn.Linear(64 * 160 * 160, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10)
    )
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for imgs, _ in dataloader:
            imgs = imgs.to(device)
            outputs = model(imgs)
            loss = criterion(outputs, torch.randint(0, 10, (imgs.size(0),), device=device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss.item()}")


# Training the OCR Model

This section defines the `train_ocr_model` function, which sets up and trains a convolutional neural network (CNN) model for Optical Character Recognition (OCR) tasks.

### Function: `train_ocr_model(dataloader, epochs=5, lr=0.001)`

The function builds a small CNN and runs a training loop over the batches supplied by the given dataloader. The dataloader yields images and file paths, not labels, so the loop generates placeholder labels itself.

#### Explanation:

- **Input:**
  - `dataloader`: A PyTorch DataLoader that yields batches of `(images, paths)`, as `OCRDataset` does. The paths are ignored.
  - `epochs`: The number of times the model will iterate over the entire dataset. Default is 5.
  - `lr`: The learning rate for the optimizer. Default is 0.001.

- **Process:**
  - **Model Setup**: The CNN has two convolutional blocks (3×3 convolution, ReLU, 2×2 max-pooling) followed by two fully connected layers. It takes 1×640×640 grayscale input and outputs 10 class scores.
  - **Optimizer and Loss**: The Adam optimizer with learning rate `lr` updates the parameters, and cross-entropy is the loss function.
  - **Training Loop**: For the specified number of epochs:
    - The forward pass computes predictions for each batch.
    - The loss compares the predictions with labels drawn at random from 0 to 9 for every batch, because the dataset provides no real labels.
    - `loss.backward()` computes the gradients and `optimizer.step()` updates the parameters.
    - After each epoch, the loss of the final batch is printed.

- **Output:**
  - The function returns nothing. It prints the per-epoch loss, and the trained model is discarded when the function exits. Add `return model` to keep it.
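
The `64 * 160 * 160` input size of the first `Linear` layer follows from the 640×640 images that `OCRDataset` produces. The sketch below retraces that arithmetic with the standard output-size formula for convolution and pooling layers. `conv_out` is a helper name introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for Conv2d/MaxPool2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 640                                   # OCRDataset resizes to 640 x 640
side = conv_out(side, kernel=3, padding=1)   # conv1 keeps 640
side = conv_out(side, kernel=2, stride=2)    # pool1 -> 320
side = conv_out(side, kernel=3, padding=1)   # conv2 keeps 320
side = conv_out(side, kernel=2, stride=2)    # pool2 -> 160

flat_features = 64 * side * side
fc1_params = flat_features * 256 + 256       # weights + biases of the first Linear

print(side, flat_features, fc1_params)       # 160 1638400 419430656
print(f"{fc1_params * 4 / 1e9:.2f} GB in float32")  # 1.68 GB in float32
```

The first fully connected layer dominates the parameter count. Adam also stores two moment estimates per parameter, so optimizer state adds roughly twice that memory again. A different input size requires changing the `Linear` layer to match.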

Because the labels are random, this function only demonstrates the mechanics of a PyTorch training loop: the network cannot learn to recognize text, and it is not connected to EasyOCR. Training a usable model requires real labeled data.
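
Since the labels here are drawn at random, a handy reference point for the printed loss is the cross-entropy of a model that assigns equal probability to all 10 classes. The sketch below computes it in plain Python. `cross_entropy` is a hand-written helper for illustration, not the PyTorch function.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


num_classes = 10
uniform_logits = [0.0] * num_classes

# With equal logits every class gets probability 1/10, whatever the label is.
losses = [cross_entropy(uniform_logits, t) for t in range(num_classes)]
print(round(losses[0], 4), round(math.log(num_classes), 4))  # 2.3026 2.3026
```

A loss that settles near 2.30 is therefore what chance-level guessing looks like in this setup.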


In [None]:
def extract_text_from_images():
    image_files = [f for f in os.listdir(input_folder) if f.endswith(('.png', '.jpg', '.jpeg'))]
    image_files.sort()
    image_files = image_files[200:400]
    data = {"input": []}
    for filename in tqdm(image_files, desc="Processing Images"):
        img_path = os.path.join(input_folder, filename)
        processed_img = preprocess_image(img_path)
        best_result = ""
        for angle in [0, 90, 180, 270]:
            rotated_img = rotate_bound(processed_img, angle)
            results = reader.readtext(rotated_img, detail=0, paragraph=True)
            extracted_text = ' '.join(results)
            if len(extracted_text) > len(best_result):
                best_result = extracted_text
        cleaned_text = clean_text(best_result)
        data["input"].append({
            "image": filename,
            "text_extracted": cleaned_text
        })
    with open(output_file, 'w') as f:
        json.dump(data, f, indent=4)
    print(f"Text extraction complete! Results saved in {output_file}")
image_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.endswith(('.png', '.jpg', '.jpeg'))]
image_files.sort()
image_files = image_files[200:400]
dataset = OCRDataset(image_files)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
train_ocr_model(dataloader)

# Extracting Text from Images

This section defines the `extract_text_from_images` function, which extracts text from a batch of images in a specified directory using OCR and saves the results to a JSON file.

### Function: `extract_text_from_images()`

The function processes a set of images, extracts text from each using OCR, and stores the results in a structured format.

#### Explanation:

- **Input:**
  - The function takes no arguments. It lists `input_folder`, keeps files ending in `.png`, `.jpg`, or `.jpeg`, sorts them by name, and processes the slice `[200:400]` (at most 200 images, indices 200 through 399).

- **Process:**
  - **Preprocessing**: Each image is processed using the `preprocess_image` function to enhance text visibility (e.g., by rotating, thresholding, and resizing).
  - **Text Extraction**: For each image, the function attempts text extraction at four different rotation angles (0°, 90°, 180°, and 270°). The best result is chosen based on the longest extracted text.
  - **Text Cleaning**: The extracted text is cleaned using the `clean_text` function to remove unwanted characters and normalize spacing.
  - **Storing Results**: The filename and corresponding extracted text are stored in a dictionary. After processing all images, the results are saved to a JSON file (`output_file`).

- **Output:**
  - A JSON file containing the extracted text for each image.
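
The orientation search can be sketched without any OCR dependency. In the example below, `pick_longest_text` and its `read_text` argument are hypothetical names: `read_text` stands in for the combination of `rotate_bound`, `reader.readtext`, and `' '.join` in the notebook, and the stub outputs are made up.

```python
def pick_longest_text(read_text, image, angles=(0, 90, 180, 270)):
    """Return (text, angle) for the orientation whose OCR output is longest.

    read_text is a hypothetical callable (image, angle) -> str standing in
    for rotate_bound + reader.readtext + ' '.join in the notebook.
    """
    best_text, best_angle = "", None
    for angle in angles:
        text = read_text(image, angle)
        if len(text) > len(best_text):  # strict '>' keeps the earliest angle on ties
            best_text, best_angle = text, angle
    return best_text, best_angle


# Stub OCR: pretend the page is upside down, so only 180 degrees reads well.
fake_outputs = {0: "l1", 90: "", 180: "Government of India", 270: "lll"}
text, angle = pick_longest_text(lambda img, a: fake_outputs[a], image=None)
print(angle, text)  # 180 Government of India
```

Text length is only a rough proxy for quality, since a wrong orientation can still produce long gibberish. The search also runs OCR four times per image, which multiplies the processing time accordingly.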

### Dataset Creation and Model Training

The same cell also builds a dataset and trains the demonstration CNN. These statements sit at module level after the function definition, so they run as soon as the cell executes, before `extract_text_from_images()` is called.

- **Dataset Creation**: The `OCRDataset` class wraps the same subset of image files (`[200:400]`), and a `DataLoader` serves them in shuffled batches of 8.
  
- **Model Training**: The `train_ocr_model` function is called to run the demonstration training loop on that dataset.

#### Key Steps:
1. **Load Images**: Filter and select images from the `input_folder`.
2. **Process Each Image**: Preprocess each image, extract text at different rotations, and clean the results.
3. **Save Results**: Store the extracted text in a JSON file.
4. **Train Model**: Independently of steps 1 to 3, train the demonstration CNN on the same image subset, loaded through `OCRDataset` rather than `preprocess_image`.
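
The file selection used by both code paths in this cell can be isolated as a small helper. `select_images` is a name introduced for this sketch, and the file listing is synthetic.

```python
def select_images(filenames, start=200, stop=400, extensions=('.png', '.jpg', '.jpeg')):
    """Filter by extension, sort by name, and slice, as both code paths in the cell do."""
    images = sorted(f for f in filenames if f.endswith(extensions))
    return images[start:stop]


# Synthetic listing: 450 images plus one file that should be ignored.
listing = [f"img_{i:04d}.jpg" for i in range(450)] + ["labels.txt"]
subset = select_images(listing)
print(len(subset), subset[0], subset[-1])  # 200 img_0200.jpg img_0399.jpg

# A folder with 200 or fewer images yields an empty selection, not an error.
print(select_images(listing[:150]))  # []
```

Two consequences are worth keeping in mind: the extension check is case-sensitive, so a file named `scan.JPG` is skipped, and an empty selection makes the extraction step write a JSON file with no entries.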

Together, the function and the module-level statements in this cell cover image preprocessing, text extraction, and a demonstration training run.


In [None]:
extract_text_from_images()
torch.cuda.empty_cache()

# Text Extraction and GPU Memory Cleanup

This section runs the `extract_text_from_images` function to process images, extract text using OCR, and save the results in a JSON file.

### Steps:
1. **Text Extraction**: The function processes a subset of images from the `input_folder`, extracts text using OCR, and stores the results in `output_file`.
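2. **Release Cached GPU Memory**: After extraction, `torch.cuda.empty_cache()` releases cached memory blocks that PyTorch is holding but no longer using. Memory occupied by live tensors is unaffected.

Calling it after extraction returns that cached memory to the GPU so that other processes can use it. It has no effect when the notebook runs on the CPU.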
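
Once the run has finished, the JSON file can be loaded back for inspection. The sketch below writes a small file with the same structure to a temporary location and then lists the images for which no text was extracted. The two entries and the temporary path are made up for illustration; the notebook writes to `output_file` instead.

```python
import json
import os
import tempfile

# Same structure as the notebook writes: {"input": [{"image": ..., "text_extracted": ...}]}
data = {"input": [
    {"image": "example_001.jpg", "text_extracted": "Government of India"},
    {"image": "example_002.jpg", "text_extracted": ""},
]}

# Write to a temporary file here; the notebook writes to output_file instead.
path = os.path.join(tempfile.mkdtemp(), "extracted_texts.json")
with open(path, "w") as f:
    json.dump(data, f, indent=4)

with open(path) as f:
    loaded = json.load(f)

# List images for which OCR produced no text at all.
empty = [entry["image"] for entry in loaded["input"] if not entry["text_extracted"]]
print(empty)  # ['example_002.jpg']
```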
