## Patient Details Parser

This notebook walks through parsing a patient details document and extracting the patient's name, phone number, Hepatitis B vaccination status and medical problems. The steps mirror those in the prescription_parser notebook.

### Step 1: Convert PDF to image using pdf2image

In [38]:
from pdf2image import convert_from_path

In [39]:
pages = convert_from_path(r'docs\patient_details\pd_1.pdf', poppler_path=r'C:\poppler-24.02.0\Library\bin')

poppler_path: This parameter specifies the directory that contains the Poppler binaries. Poppler is a PDF rendering library, and pdf2image depends on it to process PDFs. Here the path is set to r'C:\poppler-24.02.0\Library\bin'.
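The hard-coded Windows path above will not exist on other machines. The following is a minimal sketch of one way to handle that, assuming Poppler may instead be available on the system PATH (in which case pdf2image can be called with poppler_path left as None). `resolve_poppler_path` is a hypothetical helper written for this illustration, not part of pdf2image.

```python
import os
import shutil


def resolve_poppler_path(candidate=None):
    """Return a directory to pass as poppler_path, or None to rely on PATH.

    Hypothetical helper: `candidate` is an install directory you supply,
    such as the one used in this notebook.
    """
    # Use the explicit directory only if it actually exists.
    if candidate and os.path.isdir(candidate):
        return candidate
    # pdftoppm is one of the Poppler command-line tools; if it is on PATH,
    # no explicit directory is needed.
    if shutil.which("pdftoppm"):
        return None
    raise FileNotFoundError("Poppler not found: install it or pass its bin directory")


# Usage (requires pdf2image and a real PDF, so it is left commented out):
# from pdf2image import convert_from_path
# pages = convert_from_path(
#     r'docs\patient_details\pd_1.pdf',
#     poppler_path=resolve_poppler_path(r'C:\poppler-24.02.0\Library\bin'),
# )
```

Failing early with a clear message is a design choice here; it avoids a less obvious error surfacing later from inside the conversion call.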



In [40]:
# Display the first page of the PDF document as an image
pages[0].show()

In [41]:
# Extract text from the image using pytesseract
import pytesseract

In [42]:
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
text = pytesseract.image_to_string(pages[0], lang='eng')
print(text)

47/12/2020

Patient Medical Record

Patient Information Birth Date
Kathy Crawford May 6 1972
(737) 988-0851 Weight
9264 Ash Dr 95
New York City, 10005 .
United States Height:
190
In Case of Emergency
m _ a _
Simeone Crawford 9266 Ash Dr
New York City, New York, 10005
Home phone United States
(990) 375-4621
Work phone
Genera! Medical History
. : a ee

Chicken Pox (Varicella):

IMMUNE

Have you had the Hepatitis B vaccination?

No

List any Medical Problems (asthma, seizures, headaches):

Migraine



#### The OCR output is incomplete and partly garbled (for example, the date is read as 47/12/2020 and the Measles field is missing), so the image needs preprocessing first.

### Step 2: Preprocess the image using computer vision

In [43]:
import numpy as np
import cv2
from PIL import Image


### Preprocess Image Function:


In [44]:
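def preprocess_image(img):
    # pdf2image returns PIL images in RGB order, so convert from RGB (not BGR)
    gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
    # Upscale by 1.5x so small characters are easier for the OCR engine to read
    resized = cv2.resize(gray, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_LINEAR)
    # Binarize with a threshold computed per neighbourhood
    processed_image = cv2.adaptiveThreshold(
        resized,
        255,                             # value assigned to pixels above the threshold
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # Gaussian-weighted neighbourhood mean
        cv2.THRESH_BINARY,
        61,                              # neighbourhood (block) size, must be odd
        11                               # constant subtracted from the weighted mean
    )
    return processed_image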

In [45]:
img = preprocess_image(pages[0])
Image.fromarray(img).show()

### Step 3: Extract text from image using pytesseract

In [46]:
pytesseract.pytesseract.tesseract_cmd=r'C:\Program Files\Tesseract-OCR\tesseract.exe'
text = pytesseract.image_to_string(img, lang='eng')
print(text)

17/12/2020

Patient Medical Record

Patient Information Birth Date
Kathy Crawford May 6 1972
(737) 988-0851 Weight’
9264 Ash Dr 95
New York City, 10005 .
United States Height:
190
In Casc of Emergency
7 ee
Simeone Crawford 9266 Ash Dr
New York City, New York, 10005
Home phone United States
(990) 375-4621
Work phone

Genera! Medical History

a

a

a ea A CE i a

Chicken Pox (Varicella): Measies:

IMMUNE IMMUNE

Have you had the Hepatitis B vaccination?
No

List any Medical Problems (asthma, seizures, headaches}:

Migraine

be

CO
nat
aa
oo



### Step 4: Use regular expressions (regex) to match patterns and extract information from the text

#### Extract name

In [47]:
import re

pattern = r'Patient Information(.*?)\(\d{3}\)'

matches = re.findall(pattern, text, flags=re.DOTALL)
matches

[' Birth Date\nKathy Crawford May 6 1972\n']

In [48]:
matches[0].strip()

'Birth Date\nKathy Crawford May 6 1972'

In [49]:
match = matches[0].replace("Birth Date","").strip()
match

'Kathy Crawford May 6 1972'

In [51]:
pattern = r'((Jan|Feb|March|April|May|June|July|Aug|Sep|Oct|Nov|Dec)[ \d]+)'

date_matches = re.findall(pattern, match)
date = date_matches[0][0]
date

'May 6 1972'

In [52]:
match.replace(date, '').strip()

'Kathy Crawford'

These steps can be combined into a single function:

In [53]:
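def remove_noise_from_name(name):
    # Drop the 'Birth Date' label that sits next to the name
    name = name.replace('Birth Date', '').strip()
    # A month name followed by the day and year digits
    date_pattern = r'((Jan|Feb|March|April|May|June|July|Aug|Sep|Oct|Nov|Dec)[ \d]+)'
    date_matches = re.findall(date_pattern, name)

    if date_matches:
        # Each match is a tuple (full date, month); remove the full date
        date = date_matches[0][0]
        name = name.replace(date, '').strip()

    return name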

In [54]:
name = '\n\n \n\n \n\nBirth Date\nKathy Crawford May 6 1972\n'

name = remove_noise_from_name(name)
name

'Kathy Crawford'
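The month alternation used above mixes abbreviated and full month names, so a date such as "January 15, 1980" or "Mar 3 1990" would not be matched and would stay in the name. The sketch below shows one possible way to accept both forms; `strip_birth_date` is a hypothetical name, the second input is made up for illustration, and the date layout (month, day, four-digit year) is an assumption based on the single record shown here.

```python
import re

# Three-letter month prefix followed by any remaining letters,
# e.g. "Jan", "January", "Sep", "Sept", "September".
MONTH = r'\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
# Month, optional period, day, optional comma, four-digit year.
DATE_PATTERN = re.compile(MONTH + r'\.?\s+\d{1,2},?\s+\d{4}')


def strip_birth_date(name):
    """Remove the 'Birth Date' label and the date from an OCR name block."""
    name = name.replace('Birth Date', '')
    name = DATE_PATTERN.sub('', name)
    return name.strip()


# First input is taken from the OCR output; the second is a made-up example.
print(strip_birth_date('Birth Date\nKathy Crawford May 6 1972\n'))
print(strip_birth_date('Birth Date\nJohn Doe January 15, 1980\n'))
```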

#### Extract Phone

In [55]:
pattern = r'Patient Information(.*?)(\(\d{3}\) \d{3}-\d{4})'

matches = re.findall(pattern, text, flags=re.DOTALL)
matches

[(' Birth Date\nKathy Crawford May 6 1972\n', '(737) 988-0851')]

In [56]:
matches[0][1]

'(737) 988-0851'
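The page contains more than one phone number, and indexing `matches[0][1]` raises an error when nothing matches. As a minimal sketch, assuming the patient's number is always the first one after the "Patient Information" heading, the lookup could be wrapped in a function that returns None when nothing is found. `get_patient_phone` and `PHONE_PATTERN` are hypothetical names.

```python
import re

# Excerpt of the OCR output shown above.
sample_text = """Patient Information Birth Date
Kathy Crawford May 6 1972
(737) 988-0851 Weight
9264 Ash Dr 95
Home phone United States
(990) 375-4621
Work phone
"""

PHONE_PATTERN = re.compile(r'\(\d{3}\) \d{3}-\d{4}')


def get_patient_phone(text):
    """Return the first phone number after 'Patient Information', or None."""
    # Only search the text that follows the section heading.
    _, found, rest = text.partition('Patient Information')
    if not found:
        return None
    match = PHONE_PATTERN.search(rest)
    return match.group(0) if match else None


print(get_patient_phone(sample_text))      # (737) 988-0851
print(PHONE_PATTERN.findall(sample_text))  # every number in the excerpt, in order
```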

#### Extract Vaccine

In [57]:
pattern = r'Have you had the Hepatitis B vaccination\?.*?(Yes|No)'

matches = re.findall(pattern, text, flags=re.DOTALL)
matches

['No']
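With `re.DOTALL`, the choice between a greedy `.*` and a lazy `.*?` matters for this kind of pattern: the greedy form runs to the end of the text and backtracks to the last "Yes" or "No", while the lazy form stops at the first one after the question. The example below uses a made-up text, not the OCR output, in which a later line also contains an answer.

```python
import re

# Made-up text: the answer to the Hepatitis B question is "No",
# but a later, unrelated answer is "Yes".
sample_text = """Have you had the Hepatitis B vaccination?

No

Have you had the flu vaccination?

Yes
"""

greedy = r'Have you had the Hepatitis B vaccination\?.*(Yes|No)'
lazy = r'Have you had the Hepatitis B vaccination\?.*?(Yes|No)'

greedy_result = re.findall(greedy, sample_text, flags=re.DOTALL)
lazy_result = re.findall(lazy, sample_text, flags=re.DOTALL)

print(greedy_result)  # ['Yes'] - picks up the last answer in the text
print(lazy_result)    # ['No']  - picks up the answer right after the question
```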

#### Extract Medical Problems

In [58]:
pattern = 'List any Medical Problems .*?:(.*)'

matches = re.findall(pattern, text, flags=re.DOTALL)
matches[0].strip()


'Migraine\n\nbe\n\nCO\nnat\naa\noo'
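The captured group still contains the OCR noise that follows "Migraine". A minimal sketch of one way to trim it, assuming the answer always sits on the first non-empty line after the prompt (a multi-line answer would be cut short under this assumption). `get_medical_problems` is a hypothetical name.

```python
import re

# Tail of the OCR output shown above, including the trailing noise.
sample_text = """List any Medical Problems (asthma, seizures, headaches}:

Migraine

be

CO
nat
aa
oo
"""


def get_medical_problems(text):
    """Return the first non-empty line after the medical problems prompt, or None."""
    match = re.search(r'List any Medical Problems .*?:(.*)', text, flags=re.DOTALL)
    if not match:
        return None
    # Walk the captured lines and keep the first one that has content.
    for line in match.group(1).splitlines():
        if line.strip():
            return line.strip()
    return None


print(get_medical_problems(sample_text))  # Migraine
```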