## Find key data points from multiple documents

Download <a href="https://drive.google.com/file/d/1V6hmJhCqMyR65e4tal1Q70Lc_jvtZm0F/view?usp=sharing">these documents</a>.

All of them share the same structure.

Using regex, capture the following data points from every document and export them as a CSV:

- The case number.
- Whether the decision was to accept or reject the appeal.
- The request date.
- The decision date.
- The source file name.




In [1]:
## import libraries
import re
import pandas as pd
import glob
# from google.colab import files ## for google colab only


In [None]:
### COLAB ONLY
## import colab's file uploader
# files.upload() 

In [2]:
## path to documents
## my documents are stored in a folder called docs
path = "docs/*.txt"
myfiles = sorted(glob.glob(path))
myfiles

['docs/decision01.txt', 'docs/decision02.txt', 'docs/decision03.txt']
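
The task asks for the source file name, while `glob` returns each path with its folder attached. A minimal sketch, assuming the same `docs/` paths listed above, of trimming a path down to its file name with the standard library's `pathlib` (the `file_names` variable is only for illustration):

```python
from pathlib import Path

# paths as returned by glob in the cell above
myfiles = ["docs/decision01.txt", "docs/decision02.txt", "docs/decision03.txt"]

# Path(...).name keeps only the final component of each path
file_names = [Path(f).name for f in myfiles]
print(file_names)
```

Whether to keep the folder or only the file name in the CSV is a matter of preference; the rest of this notebook keeps the full path.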

In [3]:
## quick read reminder
for file in myfiles:
  with open(file, "r") as my_text_doc:
    print(my_text_doc.read())

STATE OF NEW YORK REQUEST: February 5, 2015 
DEPARTMENT OF HEALTH AGENCY
CASE #: 6952578N
______________________________________________________
 In the Matter of the Appeal of
:
:
HEARING from a determination by the New York City :

2. On December 22, 2014, a nursing assessor completed a Uniform Assessment System evaluation of the Appellant’s personal care needs.
3. On December 22, 2014, a nursing assessor completed a client task sheet as to the Appellant’s personal care needs.
4. By notice dated January 23, 2015, the Managed Long Term Care Plan determined to reduce the Appellant’s Personal Care Services authorization from 16 hours daily, 7 days weekly to 8 hours daily, 7 days weekly.
5. On January 23, 2015, the Appellant requested an internal appeal.
6. On February 5, 2015, this fair hearing was requested.
7. By notice dated February 27, 2015, the Managed Long Term Care Plan determined to
uphold its determination to reduce the Appellant’s Personal Care Services authorization from 16 

In [4]:
## compile a pattern that captures the date following "request:"
date_pat = re.compile(r"request:\s(\w+\s\d{1,2},\s\d{4})")
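
To see what each piece of the pattern does, here is a small sketch that runs it on two lines copied from the first document. The `sample` string is only an excerpt, lowercased the same way the notebook lowercases each file.

```python
import re

date_pat = re.compile(r"request:\s(\w+\s\d{1,2},\s\d{4})")

# header line and one body line from the first document, lowercased
sample = (
    "STATE OF NEW YORK REQUEST: February 5, 2015 \n"
    "6. On February 5, 2015, this fair hearing was requested."
).lower()

# request:     literal label that anchors the match
# \s           one whitespace character
# (\w+         start of the captured group: the month name
# \s\d{1,2},   a space, a one- or two-digit day, a comma
# \s\d{4})     a space and a four-digit year, end of the group
found = date_pat.findall(sample)
print(found)
```

Only the header date is captured. The same date in the body line is skipped because it is not preceded by `request:`.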

In [5]:
## find the request date in each document and print it
for file in myfiles:
  with open(file, "r") as my_doc:
    all_text = my_doc.read()
    all_text = all_text.lower()
    date = date_pat.findall(all_text)
    print(date[0])

february 5, 2015
march 14, 2019
october 28, 2019


In [6]:
## store each request date in a list
request_dates_list = []
for file in myfiles:
  with open(file, "r") as my_doc:
    all_text = my_doc.read()
    all_text = all_text.lower()
    date = date_pat.findall(all_text)
    request_dates_list.append(date[0])

request_dates_list

['february 5, 2015', 'march 14, 2019', 'october 28, 2019']
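
Indexing with `date[0]` raises an `IndexError` if a document has no match. The three documents here all match, so the loop above runs cleanly. As a hedged sketch for messier inputs, a small helper (the name `first_match` is hypothetical, not part of the notebook) could return a default value instead:

```python
import re

def first_match(pattern, text, default=None):
    """Return the first captured group, or `default` when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else default

date_pat = re.compile(r"request:\s(\w+\s\d{1,2},\s\d{4})")

hit = first_match(date_pat, "request: february 5, 2015")
miss = first_match(date_pat, "no date in this text")
print(hit, miss)
```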

In [16]:
## compile all four patterns and store findings in lists
date_pat = re.compile(r"request:\s(\w+\s\d{1,2},\s\d{4})") ## request date regex pattern
case_pat = re.compile(r"case #:\s(\d+\w)") ## case number regex pattern
decision_pat = re.compile(r"decision:\n{1,2}.+is\s(\w+)") ## decision regex pattern
decision_date_pat = re.compile(r"decision:\n.*dated\s(\w+\s\d{1,2},\s\d{4})") ## decision date regex pattern

## initialize lists
request_dates_list = []
case_list = []
decision_list = []
dec_date_list = []

## iterate through docs to find, capture and store relevant data
for file in myfiles:
  with open(file, "r") as my_doc:
    ## lowercase the text so the lowercase patterns match
    all_text = my_doc.read().lower()
    ## findall returns a list; [0] keeps the first match in each document
    request_dates_list.append(date_pat.findall(all_text)[0])
    case_list.append(case_pat.findall(all_text)[0])
    decision_list.append(decision_pat.findall(all_text)[0])
    dec_date_list.append(decision_date_pat.findall(all_text)[0])



In [18]:
## call the decision dates list to confirm capture
dec_date_list

['february 27, 2015', 'march 14, 2019', 'march 14, 2019']
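
The captured dates are plain strings, so they cannot be compared or sorted chronologically as they are. A minimal sketch, assuming an English locale for the month names, of parsing two of the captured values with `datetime.strptime` and counting the days between them:

```python
from datetime import datetime

# request and decision dates captured from the first document
request_date = datetime.strptime("february 5, 2015", "%B %d, %Y")
decision_date = datetime.strptime("february 27, 2015", "%B %d, %Y")

# subtracting two datetimes gives a timedelta
days_to_decision = (decision_date - request_date).days
print(days_to_decision)
```

This step is optional; the CSV export below keeps the dates as text.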

In [13]:
## call the case numbers list
case_list

['6952578n', '7924923n', '4964154n']
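
Because the text is lowercased before matching, the case numbers come back with a lowercase `n`, while the documents print them as `6952578N`. One possible alternative, sketched here on the header lines of the first document, is to leave the text untouched and compile the pattern with `re.IGNORECASE` (the name `case_pat_ci` is only for illustration):

```python
import re

# same pattern as case_pat, but matching regardless of letter case
case_pat_ci = re.compile(r"case #:\s(\d+\w)", re.IGNORECASE)

# header lines from the first document, original casing preserved
header = "DEPARTMENT OF HEALTH AGENCY\nCASE #: 6952578N\n"

cases = case_pat_ci.findall(header)
print(cases)
```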

In [24]:
## zip all lists together
final_decision = []

for request_date, case_number, decision, decision_date, source in zip(
    request_dates_list, case_list, decision_list, dec_date_list, myfiles):
  decision_dict = {"request_date": request_date,
                   "case_number": case_number,
                   "decision": decision,
                   "decision_date": decision_date,
                   "source_file": source}
  final_decision.append(decision_dict)
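
One thing to keep in mind with `zip`: it stops at the shortest list without raising an error, so a list that is missing an item would silently drop a document. A small sketch of that behaviour, using a deliberately shortened list and a hypothetical `same_length` check:

```python
request_dates_list = ["february 5, 2015", "march 14, 2019", "october 28, 2019"]
case_list = ["6952578n", "7924923n"]  # deliberately one item short

# zip pairs items by position and stops at the shortest list
pairs = list(zip(request_dates_list, case_list))
print(len(pairs))

def same_length(*lists):
    """Return True when every list has the same number of items."""
    return len({len(lst) for lst in lists}) == 1

lengths_ok = same_length(request_dates_list, case_list)
print(lengths_ok)
```

Running a check like this before zipping is one way to catch a missed capture early.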

In [25]:
## call final decisions list
final_decision

[{'request_date': 'february 5, 2015',
  'case_number': '6952578n',
  'decision': 'rejected',
  'decision_date': 'february 27, 2015',
  'source_file': 'docs/decision01.txt'},
 {'request_date': 'march 14, 2019',
  'case_number': '7924923n',
  'decision': 'accepted',
  'decision_date': 'march 14, 2019',
  'source_file': 'docs/decision02.txt'},
 {'request_date': 'october 28, 2019',
  'case_number': '4964154n',
  'decision': 'rejected',
  'decision_date': 'march 14, 2019',
  'source_file': 'docs/decision03.txt'}]

In [26]:
## export to csv
df = pd.DataFrame(final_decision)
df.to_csv("decisions.csv", encoding = "UTF-8", index = False)
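
For comparison, the same kind of export can be sketched without pandas, using the standard library's `csv.DictWriter`. This example writes one of the rows above to an in-memory buffer rather than to `decisions.csv`, so it is an illustration and not a replacement for the cell above:

```python
import csv
import io

# one row from the final_decision list above
final_decision = [
    {"request_date": "february 5, 2015",
     "case_number": "6952578n",
     "decision": "rejected",
     "decision_date": "february 27, 2015",
     "source_file": "docs/decision01.txt"},
]

# the dictionary keys become the CSV header
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(final_decision[0].keys()))
writer.writeheader()
writer.writerows(final_decision)

csv_text = buffer.getvalue()
print(csv_text)
```

The dates contain a comma, so the writer wraps them in quotes to keep each one in a single column.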

In [27]:
df

| | request_date | case_number | decision | decision_date | source_file |
|---|---|---|---|---|---|
| 0 | february 5, 2015 | 6952578n | rejected | february 27, 2015 | docs/decision01.txt |
| 1 | march 14, 2019 | 7924923n | accepted | march 14, 2019 | docs/decision02.txt |
| 2 | october 28, 2019 | 4964154n | rejected | march 14, 2019 | docs/decision03.txt |
