Run in Jupyter Notebook or Colab. Change the document path and extension accordingly.
1. Using the PDF Miner library, I converted all the data in the PDF to text in the form of a single string. This string can then be searched for CPT codes.
2. Do some data cleaning. First, replace each "*" in the data with a space so that CPT codes marked with an asterisk can be matched easily. Then remove the footer that appears on each page: the footer contains the number 2020, and we don't want our code to treat it as a CPT code.
3. Use regular expressions to extract each kind of CPT code. After analyzing the document, I found that the CPT codes come in several variations, and a few of them are not purely numeric. The expressions cover:
- All numeric CPT codes that require prior authorization
- All non-numeric codes (combinations of letters and digits) that require prior authorization
- All codes that do not require prior authorization
- And finally, all CPT codes in the document
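
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, the sample text, and both regular expressions are assumptions. In particular, it assumes that a footer is any line containing "2020", that numeric codes are exactly five digits, and that non-numeric codes are four digits plus one capital letter (in either order). Splitting codes by whether they require prior authorization depends on the layout of the PDF, so it is left out here.

```python
import re


def load_pdf_text(path):
    # Needs the pdfminer.six package. The import is local so that the
    # cleaning and matching code below can run without it installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def clean_text(text):
    # Replace asterisks with spaces so codes such as "*27447" match cleanly.
    text = text.replace("*", " ")
    # Assumption: any line that contains "2020" is a page footer.
    # This would also drop a real line that happens to contain "2020".
    lines = [line for line in text.splitlines() if "2020" not in line]
    return "\n".join(lines)


# Assumed patterns; adjust them to the variations found in the real document.
NUMERIC_CODE = re.compile(r"\b\d{5}\b")
ALPHANUMERIC_CODE = re.compile(r"\b(?:[A-Z]\d{4}|\d{4}[A-Z])\b")


def find_codes(text):
    # Return (numeric codes, non-numeric codes) in order of appearance.
    return NUMERIC_CODE.findall(text), ALPHANUMERIC_CODE.findall(text)


# Hypothetical sample text standing in for the extracted PDF string.
sample = "Prior auth required: *27447, 0042T\nG0151 and 99213*\nPlan Guide 2020 Page 3\n"
numeric, alphanumeric = find_codes(clean_text(sample))
print(numeric)
print(alphanumeric)
```

On a real file, `load_pdf_text("your_document.pdf")` (a placeholder path) would supply the string in place of `sample`.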