GitHub - trystine/Text-Extraction-Using-Regular-Expression-in-Python

Text Extraction Using Regular Expression In Python

Requirements:

Run on Jupyter Notebook/Colab. Chnage the document path extension accordingly

Procedure:

1. Using PDF Miner library I converted all the data from the pdf to textual data in the form of string. We can use this string data to query the CPT codes from it

2. Do some data cleaning, first replacing the "*" in the data with space so that we can easily get CPT codes which have asterick in it. Also we need to remove the footers that are in each page. Since the footer consist of the number 2020. We don't want our code to consider it as a CPT code

3. Using regular expressions to generate a particular CPT code. After analyzing the document, I found that there are many CPT codes of different variation. And few of them are not even numeric.

I am generating three columns in my dataframe.

All the numeric CPT codes which requires prior authorization
All the non-numeric which is the combination of the codes which requires prior authorization
All the codes which do not require prior authorization
And finally all the CPT codes in the document

4. Save the file as csv.

5. Observation: Based on my observation few codes in the breast reconstruction does not require a prior authorization. Hence I decided to have a separate column for it. There might be multiple ways in which we can have more filtered results, like based on the procedure name e.g Arthoplasty we only generate the CPT codes for that procedure only, that can also be done, studying the location in it and also deciding a terminating index can work to get such customised CPT codes.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
ReadMe.md		ReadMe.md
Text_Extraction_Trishala_Ahalpara.ipynb		Text_Extraction_Trishala_Ahalpara.ipynb
UHC-Commercial-Advance-Notification-Prior-Authorization-Requirements-10-1-2021.pdf		UHC-Commercial-Advance-Notification-Prior-Authorization-Requirements-10-1-2021.pdf
final_dataframe_of_CPT_codes.csv		final_dataframe_of_CPT_codes.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Text Extraction Using Regular Expression In Python

Requirements:

Procedure:

1. Using PDF Miner library I converted all the data from the pdf to textual data in the form of string. We can use this string data to query the CPT codes from it

2. Do some data cleaning, first replacing the "*" in the data with space so that we can easily get CPT codes which have asterick in it. Also we need to remove the footers that are in each page. Since the footer consist of the number 2020. We don't want our code to consider it as a CPT code

3. Using regular expressions to generate a particular CPT code. After analyzing the document, I found that there are many CPT codes of different variation. And few of them are not even numeric.

I am generating three columns in my dataframe.

4. Save the file as csv.

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Text Extraction Using Regular Expression In Python

Requirements:

Procedure:

1. Using PDF Miner library I converted all the data from the pdf to textual data in the form of string. We can use this string data to query the CPT codes from it

2. Do some data cleaning, first replacing the "*" in the data with space so that we can easily get CPT codes which have asterick in it. Also we need to remove the footers that are in each page. Since the footer consist of the number 2020. We don't want our code to consider it as a CPT code

3. Using regular expressions to generate a particular CPT code. After analyzing the document, I found that there are many CPT codes of different variation. And few of them are not even numeric.

I am generating three columns in my dataframe.

4. Save the file as csv.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages