# What this script does

We process 61 of the earliest enforcement letters, corresponding to January-February 2017.

These letters were not available in the Department of Social and Health Services' (DSHS) [Nursing Home Facilities Locator](https://fortress.wa.gov/dshs/adsaapps/lookup/NHPubLookup.aspx) when we first downloaded enforcement letters in bulk: the site publishes only the letters from the previous 3 years, and we ran the download in March 2020.

We requested those letters directly from the DSHS, which provided them. But instead of sending us 61 separate PDFs, the agency sent all the letters merged into a single massive PDF, which had to be processed separately from the ones we downloaded. Hence this separate script.

# Settings

In [1]:
import pdfplumber
from PyPDF2 import PdfFileWriter, PdfFileReader
import pandas as pd
import os
import re
from shutil import copyfile

pd.set_option('display.max_columns', None)

# Creating single enforcement letters

### Find out where each letter begins and where it ends

Identify which of the pages in the large PDF file are the cover pages of each individual enforcement letter.

In [2]:
path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_2020_Jan-Feb/'
mega_pdf = '202006 PRR 249_Jan-Feb 2017 Nursing Home enforcement letters.pdf'

In [3]:
pdf_plum = pdfplumber.open(path + mega_pdf)

first_pages = []

# Build a list with the page number (1-based) of the first page of each letter
for pg in pdf_plum.pages:
    # Guard against pages with no extractable text
    if 'aging and long-term support administration' in (pg.extract_text() or '').lower():
        first_pages.append(pg.page_number)
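
The cover-page check can be tried out without opening the PDF. The following is a minimal sketch, assuming the text of each page has already been extracted into a list of strings. The `find_cover_pages` helper and the sample texts are hypothetical and not part of the original notebook; a `None` entry stands in for a page with no extractable text.

```python
MARKER = 'aging and long-term support administration'

def find_cover_pages(page_texts, marker=MARKER):
    """Return the 1-based page numbers whose text contains the marker (case-insensitive)."""
    cover_pages = []
    for page_number, text in enumerate(page_texts, start=1):
        # Treat pages without extractable text as empty strings
        if marker in (text or '').lower():
            cover_pages.append(page_number)
    return cover_pages

# Hypothetical page texts, for illustration only
sample_texts = [
    'Aging and Long-Term Support Administration\nJanuary 5, 2017',
    'Statement of deficiencies, continued',
    None,
    'AGING AND LONG-TERM SUPPORT ADMINISTRATION\nFebruary 2, 2017',
]
print(find_cover_pages(sample_texts))
```

Keeping the matching logic separate from the PDF reading makes it easier to check how the marker behaves on edge cases such as blank or scanned pages.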

In [4]:
print('The large PDF contains', len(first_pages), 'enforcement letters.\n',
     'The following list shows the page number of the first page of each of those letters:\n')
print(first_pages)

The large PDF contains 61 enforcement letters.
 The following list shows the page number of the first page of each of those letters:

[1, 6, 11, 15, 16, 21, 26, 30, 35, 39, 43, 51, 55, 60, 64, 68, 73, 74, 78, 83, 87, 92, 97, 101, 106, 111, 116, 120, 124, 129, 133, 137, 142, 144, 149, 150, 154, 162, 167, 171, 175, 180, 182, 189, 193, 198, 202, 207, 213, 218, 222, 226, 230, 234, 239, 244, 248, 254, 258, 265, 271]


### Break down into individual letters

Split the large PDF into several smaller PDFs, each corresponding to a different enforcement letter.

In [5]:
pdf_reader = PdfFileReader(open(path + mega_pdf, 'rb'))

for j in range(len(first_pages)):
    output = PdfFileWriter()

    # The last letter has no following cover page to mark where it ends,
    # so its page range is hard-coded:
    if j == len(first_pages)-1: # 60
        page_range = range(270,272)
    else:
        page_range = range(first_pages[j]-1, first_pages[j+1]-1)

    # Create a new PDF for each enforcement letter and save it
    for i in page_range:
        output.addPage(pdf_reader.getPage(i))
    # Close each output file explicitly so its contents are fully written to disk
    with open(path + 'individual_letters/letter_%s.pdf' % j, 'wb') as letter:
        output.write(letter)
    
# Build a list of all the PDFs we created
letters = os.listdir(path + 'individual_letters/')
letters = [l for l in letters if re.match('.*pdf$', l)]

# Consistency test: Confirm we have the number of letters we expected
assert len(letters) == len(first_pages)
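
The page-range arithmetic used for the split can be written so the last letter needs no special case. This is a minimal sketch, not the notebook's method: the `letter_page_ranges` helper is hypothetical, and it assumes the total page count of the large PDF is known (272 pages is what the hard-coded `range(270,272)` suggests).

```python
def letter_page_ranges(first_pages, total_pages):
    """Map 1-based cover page numbers to 0-based, end-exclusive page ranges."""
    # Append a sentinel one past the last page so the final letter
    # is handled like every other letter.
    boundaries = list(first_pages) + [total_pages + 1]
    return [range(start - 1, end - 1)
            for start, end in zip(boundaries, boundaries[1:])]

# Toy example: three letters in a 12-page PDF
ranges = letter_page_ranges([1, 6, 11], total_pages=12)
print([list(r) for r in ranges])
```

With the last cover page at 271 and an assumed total of 272 pages, the final range comes out as `range(270, 272)`, matching the hard-coded value.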

### Rename the letters

Rename each enforcement letter, embedding the facility name and the letter date in the PDF file name.

In [6]:
# The last letter ('letter_60.pdf') consistently ended up corrupted.
# As a workaround, we delete it and replace it with 'last_letter.pdf',
# a copy we rebuilt and saved manually from the large PDF.
os.remove(path + 'individual_letters/letter_60.pdf')
copyfile(path + 'last_letter.pdf', 
         path + 'individual_letters/last_letter.pdf')

'/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_2020_Jan-Feb/individual_letters/last_letter.pdf'

In [7]:
# Consistency test: Confirm we still have the number of letters we expected
del(letters)
letters = os.listdir(path + 'individual_letters/')
letters = [l for l in letters if re.match('.*pdf$', l)]

print(len(letters))
assert len(letters) == len(first_pages)

61


In [8]:
# For each enforcement letter, obtain its date and the nursing home name.
# Use both items to rename the letter.
for pdf_name in letters: 

    pdf = pdfplumber.open(path + 'individual_letters/' + pdf_name)
    
    # Extract the text from the first page
    pg_txt = pdf.pages[0].extract_text()
    lines = pg_txt.split('\n')
    lines = [l.strip() for l in lines]

    # Obtain the letter date
    pattern = r'^[A-Za-z]+\s?\d{1,2},\s?\d{4}'
    date = [line for line in lines if re.match(pattern, line)]
    date = date[0].replace(',','').replace(' ','_')
    
    # Obtain the facility name
    facility = lines[lines.index('Administrator')+1]
    facility = facility.replace(' ','_')

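    # Print the facility/date part of the new name; the renamed file also
    # keeps the original file name as a suffix
    print(pdf_name, '|', facility + '_' + date + '.pdf')
    os.rename(path + 'individual_letters/' + pdf_name,
              path + 'individual_letters/' + facility + '_' + date + '_' + pdf_name)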

last_letter.pdf | Willapa_Harbor_Health_and_Rehab_February_14_2017.pdf
letter_0.pdf | Alaska_Gardens_Health_and_Rehabilitation_Center_February_6_2017.pdf
letter_1.pdf | Avamere_Rehabilitation_of_Cascade_Park_February_28_2017.pdf
letter_10.pdf | Emerald_Hills_Rehabilitation_March_10_2017.pdf
letter_11.pdf | Everett_Center_March_6_2017.pdf
letter_12.pdf | Fir_Lane_Health_and_Rehab_February_21_2017.pdf
letter_13.pdf | Grays_Harbor_Health_and_Rehabilitation_Center_March_6_2017.pdf
letter_14.pdf | Heartwood_Extended_Health_Care_February_27_2017.pdf
letter_15.pdf | Highland_Health_and_Rehabilitation_January_5_2017.pdf
letter_16.pdf | Josephine_Sunset_Home_March_3_2017.pdf
letter_17.pdf | Josephine_Sunset_Home_March_1_2017.pdf
letter_18.pdf | Josephine_Sunset_Home_February_2_2017.pdf
letter_19.pdf | Josephine_Sunset_Home_January_9_2017.pdf
letter_2.pdf | Bremerton_Convalescent_and_Rehabilitation_Center_January_20_2017.pdf
letter_20.pdf | Kindred_Transitional_Care_-_Lakewood_February_27_2017.p
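
The date and facility parsing can be checked on plain text before it touches any file. This is a minimal sketch: the `build_letter_name` helper is hypothetical, and the sample page is an assumed layout written for illustration (a date line, followed later by an `Administrator` line with the facility name on the next line), not text copied from a real letter.

```python
import re

DATE_PATTERN = r'^[A-Za-z]+\s?\d{1,2},\s?\d{4}'

def build_letter_name(page_text):
    """Build 'Facility_Month_D_YYYY.pdf' from the text of a letter's first page."""
    lines = [line.strip() for line in page_text.split('\n')]

    # The date is the first line that looks like 'Month D, YYYY'
    dates = [line for line in lines if re.match(DATE_PATTERN, line)]
    date = dates[0].replace(',', '').replace(' ', '_')

    # The facility name is assumed to sit on the line after 'Administrator'
    facility = lines[lines.index('Administrator') + 1].replace(' ', '_')
    return facility + '_' + date + '.pdf'

# Hypothetical first-page text, for illustration only
sample_page = (
    'Aging and Long-Term Support Administration\n'
    'January 5, 2017\n'
    'Administrator\n'
    'Highland Health and Rehabilitation\n'
)
print(build_letter_name(sample_page))
```

Like the notebook cell, this raises an error when no date line or no `Administrator` line is found, which surfaces letters with an unexpected layout instead of silently misnaming them.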

In [9]:
# Consistency test: One last time, confirm we still have the number of letters we expected
del(letters)
letters = os.listdir(path + 'individual_letters/')
letters = [l for l in letters if re.match('.*pdf$', l)]

print(len(letters))
assert len(letters) == len(first_pages)

61
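
The consistency test is repeated three times in this notebook, so the listing step could be factored into a small helper. This is a minimal sketch, assuming a hypothetical `list_pdfs` function. It matches on the `.pdf` extension, which is slightly stricter than the notebook's `'.*pdf$'` pattern, and the demo runs on a temporary directory with empty placeholder files.

```python
import os
import tempfile

def list_pdfs(directory):
    """Return the sorted names of the PDF files in a directory."""
    return sorted(name for name in os.listdir(directory)
                  if name.lower().endswith('.pdf'))

# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('letter_0.pdf', 'letter_1.pdf', 'notes.txt'):
        open(os.path.join(tmp_dir, name), 'wb').close()
    found = list_pdfs(tmp_dir)

print(found)
```

In the notebook, each check would then reduce to comparing `len(list_pdfs(path + 'individual_letters/'))` with `len(first_pages)`.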
