<a href="https://colab.research.google.com/github/Shreya-singh01/HealthGuard/blob/main/OCR_REPORTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!apt-get install poppler-utils tesseract-ocr -y
!pip install pytesseract pdf2image pillow numpy

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  poppler-utils tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 4 newly installed, 0 to remove and 21 not upgraded.
Need to get 5,002 kB of archives.
After this operation, 16.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.6 [186 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 5,002 kB in 2s (2,531 kB/s)
Selecting previously unselected package popp
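
Before moving on, it can be worth confirming that the install step actually put the OCR binaries on the PATH. The sketch below is one way to check, using only the standard library; the helper name `missing_binaries` is hypothetical and not part of this notebook. It assumes the usual binary names: `tesseract` (from `tesseract-ocr`) and `pdftoppm` (from `poppler-utils`).

```python
import shutil

def missing_binaries(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# `tesseract` ships with tesseract-ocr; `pdftoppm` ships with poppler-utils.
missing = missing_binaries(['tesseract', 'pdftoppm'])
if missing:
    print('Not found on PATH:', ', '.join(missing))
else:
    print('All OCR binaries found.')
```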

In [2]:
import pytesseract
from pdf2image import convert_from_path
import re
from PIL import Image
import numpy as np
from collections import defaultdict
from google.colab import files
import io
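
The cell above imports `re`, and lipid reports of this kind typically print each row as test name, value, unit, and reference range. As a rough sketch only, a regular expression could capture all four fields from one row, so the reference range comes from the report itself. The pattern `ROW_PATTERN`, the helper `parse_row`, and the sample row are assumptions made for illustration; they are not part of the notebook, and real OCR output may be too noisy to match.

```python
import re

# Hypothetical pattern for rows shaped like "NAME VALUE UNIT LOW-HIGH".
ROW_PATTERN = re.compile(
    r'^(?P<name>[A-Za-z][A-Za-z /.-]*?)\s+'
    r'(?P<value>\d+(?:\.\d+)?)\s+'
    r'(?P<unit>\S+)\s+'
    r'(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)\s*$'
)

def parse_row(line):
    """Return a dict for one table row, or None if the line does not match."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        'name': match['name'].strip().upper(),
        'value': float(match['value']),
        'unit': match['unit'],
        'range': (float(match['low']), float(match['high'])),
    }

# Made-up sample row, not taken from any real report.
print(parse_row('HDL CHOLESTEROL 45 mg/dL 35-80'))
# A heading line has no value or range, so it does not match.
print(parse_row('LIPID PROFILE:'))
```

Lines that do not fit the four-field layout return `None`, so a caller can skip them instead of catching exceptions.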

In [3]:
class LipidProfileExtractor:
    def __init__(self):
        self.custom_config = r'--oem 3 --psm 6'
        self.reference_ranges = {
            'CHOLESTEROL': (125, 200),
            'TRIGLYCERIDES': (23, 200),
            'HDL': (35, 80),
            'LDL': (85, 130),
            'VLDL': (5, 40)
        }

    def preprocess_image(self, image):
        img_array = np.array(image)
        if len(img_array.shape) == 3:
            gray = np.dot(img_array[..., :3], [0.2989, 0.5870, 0.1140])
        else:
            gray = img_array
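        # Stretch contrast to the full 0-255 range. A uniform image has
        # max == min, so return it unchanged to avoid dividing by zero.
        span = gray.max() - gray.min()
        if span == 0:
            return Image.fromarray(gray.astype(np.uint8))
        normalized = ((gray - gray.min()) * (255.0 / span)).astype(np.uint8)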
        return Image.fromarray(normalized)

    def check_health_status(self, test_name, value):
        if test_name in self.reference_ranges:
            low, high = self.reference_ranges[test_name]
            if value < low:
                return 'LOW'
            elif value > high:
                return 'HIGH'
            else:
                return 'NORMAL'
        return 'UNKNOWN'

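    def extract_table_data(self, text):
        # Check the more specific markers first: 'LDL' is a substring of
        # 'VLDL', and 'CHOLESTEROL' also appears in 'HDL CHOLESTEROL'.
        markers = ['VLDL', 'LDL', 'HDL', 'TRIGLYCERIDES', 'CHOLESTEROL']
        results = {}

        for line in text.split('\n'):
            parts = line.strip().split()
            # The value is the first numeric token; the tokens before it
            # form the test name, which may span several words.
            for i, token in enumerate(parts[1:], start=1):
                try:
                    value = float(token)
                except ValueError:
                    continue
                label = ' '.join(parts[:i]).upper()
                # Skip ratio rows and NON-HDL rows: no reference range is defined for them.
                if '/' in label or 'NON-' in label:
                    break
                test_name = next((m for m in markers if m in label), None)
                if test_name and test_name not in results:
                    health_status = self.check_health_status(test_name, value)
                    results[test_name] = {'value': value, 'status': health_status}
                break

        return results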

    def process_report(self, file_content):
        try:
            image = Image.open(io.BytesIO(file_content))
            processed_image = self.preprocess_image(image)
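            # Pass the OCR settings from __init__; without this they were never applied.
            text = pytesseract.image_to_string(processed_image, config=self.custom_config)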

            print("Extracted text:")
            print(text)
            print("\n---\n")

            results = self.extract_table_data(text)

            print("Found values:")
            print(results)

            return results

        except Exception as e:
            print(f"Error processing image: {e}")
            return None


def main():
    print("Upload your lipid profile report")
    uploaded = files.upload()

    extractor = LipidProfileExtractor()

    for filename, file_content in uploaded.items():
        print(f"\nProcessing file: {filename}")
        results = extractor.process_report(file_content)

        if results:
            print("\nFinal Results:")
            for test, data in results.items():
                print(f"{test}: {data['value']} mg/dL - {data['status']}")
        else:
            print("No results found")


if __name__ == "__main__":
    main()



Upload your lipid profile report


Saving lipid prof.jpg to lipid prof.jpg

Processing file: lipid prof.jpg
Extracted text:
let Tut-la eosed a LE)
Sample Letterhead

 

 

 

 

Wa Sean BRSITAR HMUULIAIE A eae
ieee oa Regstereson 17102024 52 Pl ee
Retered by : Dr. Sachin Pat BBS) Coleen ta oeet er:
aed — Reported on: 17/1072024 05:33 PM Gee

BIOCHEMISTRY
LIPID PROFILE:
TEST VALUE NT REFERENCE

[Tora cHOLESTEROL 160 mgt 125-200
TRIGLYCERIDES 12 mpi 23-200
HOL CHOLESTEROL 5 mpi 35-80
LDL CHOLESTEROL 9060 mpi 85-130
VLOL CHOLESTEROL saao mpi 5-40

| uu HOL 1.65 15-35
TOTAL CHOLESTEROL / HOL sar 35-5

[te /HoL 313
NON-HOL. CHOLESTEROL 12500

 

 

‘Abnormalities f lipids are associated with increased risk of coronary artery disease (CAD) in patients with DM. This rsk can be
Feduced by intensive treatment of lipid abnormalities. The usual pattem of pid abnormalities in type 2 DM is elevated
triglycerides, decreased HDL cholesterol and higher proportion of smal, dense LDL particles. Cholesterol isa lip found inal
Cell membran