# OCR for RECEIPTS - by Yeriko Vargas



The code aims to solve the problem of converting multiple High-Efficiency Image Format (HEIC) files to the more commonly used JPEG format in a batch process. HEIC is an image format used by Apple devices, but it's not universally supported. This code snippet takes care of this issue by automatically scanning a directory for HEIC files, assigning unique identifiers to them, and then converting them to JPEGs.

- Library Imports: os, pillow_heif, PIL, and cv2 are imported to handle file listing, HEIC decoding, and image writing.
- List HEIC Files: The code scans the current directory and builds a list of all files with a .HEIC extension.
- Generate IDs for Filenames: Each file gets a unique output name made of a running index, the label "_receipt -", and the original name without the .HEIC extension.
- Convert from HEIC to JPEG: Each HEIC file is read and turned into a PIL Image object, saved temporarily as a PNG, then read back with cv2 and written as a JPEG.
- Execution: A main function runs all of the steps in order.

Together, these steps give a quick, organized solution to a common problem for users of Apple's image format.

In [None]:


# +------------+        +----------------+       +----------------+         +----------------+
# | List Files |  --->  | Generate IDs   | --->  | Convert to JPG |  --->   | Save as JPEG   |
# +------------+        +----------------+       +----------------+         +----------------+



# Import required libraries
import os
import pillow_heif
from PIL import Image
import cv2

# Function to list all HEIC files in the current directory
def list_heic_files():
    file_list = []
    for file in os.listdir():
        if file.endswith(".HEIC"):
            file_list.append(file)
    return file_list

# Function to generate IDs for file names
def generate_ids(file_list):
    id_str = [str(x) for x in list(range(len(file_list)))]
    modified_names = ["_receipt -" + name.replace(".HEIC", "") for name in file_list]
    return [i + j for i, j in zip(id_str, modified_names)]

# Function to convert HEIC files to JPG
def convert_heic_to_jpg(heic_files, output_names):
    for i in range(len(heic_files)):
        heic_file = pillow_heif.read_heif(heic_files[i])
        image = Image.frombytes(heic_file.mode, heic_file.size, heic_file.data, "raw")
        image.save("temp.png", format="png")
        image = cv2.imread('temp.png')
        cv2.imwrite(f'{output_names[i]}.jpg', image, [int(cv2.IMWRITE_JPEG_QUALITY), 100])
        print("Successfully converted:", output_names[i])

# Main function to execute the code
def main():
    heic_files = list_heic_files()
    output_names = generate_ids(heic_files)
    convert_heic_to_jpg(heic_files, output_names)

if __name__ == "__main__":
    main()
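
The listing and naming steps above match only the upper-case `.HEIC` extension. As a minimal sketch, the same naming scheme can be made case-insensitive with `os.path.splitext`. The helper names `is_heic` and `make_output_names` are hypothetical and not part of the original code, and the file names are made up for illustration.

```python
import os

def is_heic(filename):
    """Return True for names ending in .heic, in any letter case."""
    return os.path.splitext(filename)[1].lower() == ".heic"

def make_output_names(file_list):
    """Build '<index>_receipt -<stem>' names, mirroring generate_ids."""
    return [
        f"{index}_receipt -{os.path.splitext(name)[0]}"
        for index, name in enumerate(file_list)
    ]

# Illustrative file names; only the two HEIC files are kept.
candidates = ["IMG_0001.HEIC", "img_0002.heic", "notes.txt"]
files = [name for name in candidates if is_heic(name)]
print(make_output_names(files))  # ['0_receipt -IMG_0001', '1_receipt -img_0002']
```

Using `splitext` removes only the final extension, so a name that happens to contain ".HEIC" in the middle is left untouched.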


## Optimizing Files for Best Results


The code aims to improve the quality of a scanned document or an image. It focuses on six main stages: reading the image, resizing it, applying a Gaussian blur, enhancing its contrast, sharpening it, and finally binarizing it before saving the enhanced image.

### Problem It Solves
Poor-quality scans can be difficult to read, and re-scanning is not always an option. This code enhances scanned documents or grayscale images to make them more legible.

### How Does It Accomplish This?
- Read and Resize: The code reads the input image, converts it to grayscale, and resizes it so that its longer side equals max_dimension while keeping the aspect ratio.
- Gaussian Blur: It applies a Gaussian blur to smooth the image and suppress noise.
- Contrast Enhancement: It raises the contrast by scaling pixel intensities by a configurable factor.
- Sharpening: It sharpens the image using unsharp masking.
- Binarization: It binarizes the image, converting every pixel to pure black or white based on a global threshold.
- Save Output: Finally, it saves the enhanced image.
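
The resize step comes down to a small piece of arithmetic, sketched here on its own without loading an image. The function name `resized_dimensions` is illustrative; the formula is the one used in the code cell for this section.

```python
def resized_dimensions(width, height, max_dimension=800):
    """Scale (width, height) so the longer side equals max_dimension."""
    if width > height:
        return max_dimension, int(max_dimension * height / width)
    return int(max_dimension * width / height), max_dimension

print(resized_dimensions(4000, 3000))  # landscape -> (800, 600)
print(resized_dimensions(3000, 4000))  # portrait  -> (600, 800)
```

Note that the same formula also enlarges images whose longer side is below max_dimension, so a 400 x 300 input becomes 800 x 600.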


                            Diagram
                            
                +---------------------+
                |     Read Image      |
                +---------------------+
                        |
                        V
                +---------------------+
                |     Resize Image    |
                +---------------------+
                        |
                        V
                +---------------------+
                |   Gaussian Blur     |
                +---------------------+
                        |
                        V
                +---------------------+
                |  Enhance Contrast   |
                +---------------------+
                        |
                        V
                +---------------------+
                |     Sharpening      |
                +---------------------+
                        |
                        V
                +---------------------+
                |     Binarization    |
                +---------------------+
                        |
                        V
                +---------------------+
                |     Save Output     |
                +---------------------+

In [None]:


#Import required libraries
import numpy as np
import os
from PIL import Image, ImageFilter

def enhance_document(input_file, output_file, max_dimension=800, blur_radius=10, contrast_factor=2.0, sharpen_factor=2.0, threshold_block_size=11, threshold_c=5):
    """
    Enhances the quality of a scanned document.
    
    Parameters:
    - input_file (str): The name of the input image file
    - output_file (str): The name of the output image file
    - max_dimension (int): The maximum dimension for resizing
    - blur_radius (float): Radius for Gaussian blur
    - contrast_factor (float): Factor to enhance contrast
    - sharpen_factor (float): Factor to enhance sharpness
    - threshold_block_size (int): Block size for adaptive thresholding (currently unused; the threshold is global)
    - threshold_c (float): Constant added to the mean intensity when computing the threshold
    """
    # Step 1: Read and Resize the Image
    input_path = os.path.join(os.getcwd(), input_file)
    img = Image.open(input_path).convert("L")
    width, height = img.size
    new_width, new_height = (max_dimension, int(max_dimension * height / width)) if width > height else (int(max_dimension * width / height), max_dimension)
    img_resized = img.resize((new_width, new_height), resample=Image.BICUBIC)
    
    # Step 2: Apply Gaussian Blur
    img_blur = img_resized.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    
    # Step 3: Enhance Contrast (scale intensities, clamped to the 8-bit range)
    img_contrast = img_blur.point(lambda x: min(255, int(x * contrast_factor)))
    
    # Step 4: Sharpen Image (UnsharpMask takes its strength as an integer percentage)
    img_sharp = img_contrast.filter(ImageFilter.UnsharpMask(radius=2, percent=int(sharpen_factor * 100)))
    
    # Step 5: Binarize Image with a global threshold of mean intensity + threshold_c,
    # then remove isolated specks with a median filter
    threshold = float(np.mean(np.array(img_sharp))) + threshold_c
    img_thresh = img_sharp.point(lambda x: 255 if x > threshold else 0)
    img_thresh = img_thresh.filter(ImageFilter.MedianFilter(size=3))
    
    # Step 6: Save the Output Image
    output_path = os.path.join(os.getcwd(), output_file)
    img_thresh.save(output_path)

# Example usage
enhance_document("image0.jpeg", "image_out.jpeg")
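
Binarization with a global threshold can be illustrated without any imaging library. The following is a minimal sketch on a hand-made grid of grayscale values, not the function above: `binarize` and `offset` are illustrative names, with `offset` playing the role of threshold_c, and the sample values are invented.

```python
def binarize(pixels, offset=5):
    """Map each grayscale value to 0 or 255 using a threshold of mean + offset."""
    flat = [value for row in pixels for value in row]
    threshold = sum(flat) / len(flat) + offset
    return [[255 if value > threshold else 0 for value in row] for row in pixels]

# Bright paper (values near 255) with a dark diagonal stroke (values near 0).
sample = [
    [250, 240, 30],
    [245, 20, 235],
    [10, 230, 255],
]
for row in binarize(sample):
    print(row)
```

The mean of the sample is about 168, so the threshold lands near 173: the paper pixels become 255 and the stroke pixels become 0. A single global threshold like this assumes fairly even lighting; unevenly lit scans generally need a locally computed threshold instead.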


## Read and Send Receipt Fields to a DataFrame

- Image Reading: The function ocr_to_table reads the image files named in pic_ids from a receipts folder under path_source.
- OCR Processing: pytesseract performs OCR on these images, converting the image text into machine-readable strings.
- Information Extraction: The script then uses regular expressions to extract the transaction date, the total amount, and the card information.
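
The extraction step can be tried on a plain string, without running OCR. This is a minimal sketch that reuses the three patterns from ocr_to_table; the helper name `parse_receipt_text` and the sample receipt text are made up for illustration.

```python
import re
from datetime import datetime

def parse_receipt_text(text):
    """Extract date, amount, and card fields from OCR text; missing fields are None."""
    date_match = re.search(r"\d{4}/\d{2}/\d{2}", text)
    amount_match = re.search(r"\$\d+\.\d{2}", text)
    card_match = re.search(r"Card\s*:\s*\d{4}", text)
    return {
        "date": datetime.strptime(date_match.group(), "%Y/%m/%d").date() if date_match else None,
        "amount": amount_match.group() if amount_match else None,
        "card": card_match.group() if card_match else None,
    }

# Invented text standing in for pytesseract output.
sample_text = "STORE 42\n2023/09/14 10:32\nTOTAL $18.75\nCard: 1234\n"
print(parse_receipt_text(sample_text))
```

Each pattern returns only its first match, so a subtotal printed before the total would be picked up as the amount. The amount pattern also does not match values with thousands separators such as $1,234.56.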

                           Diagram
                            
                +---------------------+
                |     Image Reading   |
                +---------------------+
                        |
                        V
                +---------------------+
                |    OCR Processing   |
                +---------------------+
                        |
                        V
                +---------------------+
                | Information         |
                | Extraction          |
                +---------------------+


In [None]:
import re
from datetime import datetime
import pytesseract
import pandas as pd

def ocr_to_table(path_source, pic_ids):
    """
    Function to OCR images from a directory and transform the fields of a receipt into a DataFrame.
    """
    data = []
    
    for i in pic_ids:
        image_path = f'{path_source}/PY/MAIN/KIMVA/Receipts/{i}'
        text = pytesseract.image_to_string(image_path, lang='eng')
        
        # Date Extraction
        date = re.search(r'\d{4}/\d{2}/\d{2}', text)
        date = datetime.strptime(date.group(), '%Y/%m/%d').date() if date else None

        # Amount Extraction
        amount = re.search(r"\$\d+\.\d{2}", text)
        amount = amount.group() if amount else None

        # Card Extraction
        card = re.search(r"Card\s*:\s*\d{4}", text)
        card = card.group() if card else None

        # Adding to the data list
        data.append([i, date, amount, card])
        
    # Creating DataFrame
    df = pd.DataFrame(data, columns=['pic_id_name', 'Transaction Date', 'Total Amount', 'Card Info'])
    
    return df


In [None]:
#end