# What this script does

Now that we have obtained more than 1,000 enforcement letters, we dive into a scraping fest using [Jeremy Singer-Vine](https://github.com/jsvine)'s amazing Python package, [pdfplumber](https://github.com/jsvine/pdfplumber).

As we mentioned before, because of their vast number, we did not save the letters in this repository, but they are available at the DSHS's [Nursing Home Facilities Locator](https://fortress.wa.gov/dshs/adsaapps/lookup/NHPubLookup.aspx).


# SETTINGS

In [1]:
import pdfplumber
import pandas as pd
from os import listdir
import numpy as np
import re

pd.set_option('display.max_columns', None)

# LIST OF ENFORCEMENT LETTERS

We periodically ran bulk downloads of enforcement letters from the DSHS's [Nursing Home Facilities Locator](https://fortress.wa.gov/dshs/adsaapps/lookup/NHPubLookup.aspx), because the letters posted there change over time. The DSHS explained to us why that is the case:

- On the first of each month, an automatic script runs that purges anything more than three years old and posts new enforcement letters.
- In addition, when a facility closes down, all its documents are purged automatically.
- Sometimes older enforcement letters (occasionally from previous years) are missing from the locator website but are posted later. Those letters may belong to nursing homes that were going through a change of ownership. In that scenario, there is a period when the locator has no documents because the facility’s state license is officially “closed” (so all documents are purged automatically). Nursing homes (NH) are the one facility type that still “owns” the previous owner's enforcement record after a change of ownership, so the letters get reposted under the new licensee. The reposting may be delayed because it must be done manually.

As a result, each bulk download will contain duplicated letters from previous downloads. 

Here we create a list of unique letters across all downloads:

In [2]:
path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/'
folders = ['NH_enforcement_letters_2020-03/',
           'NH_enforcement_letters_2020-06-09/',
           'NH_enforcement_letters_2020-06-24/',
           'NH_enforcement_letters_2020-07-16/',
           'NH_enforcement_letters_2020-08-20/',
           'NH_enforcement_letters_2020-09-08/',
           'NH_2020_Jan-Feb/individual_letters/']

In [3]:
d = {}
for i in range(len(folders)):
    d['{0}'.format(i)] = listdir(path + folders[i])

d2 = {}
for key, pdf_list in d.items():
    d2['{0}'.format(key)] = pd.DataFrame(pdf_list, columns=['pdf_name'])
    d2['{0}'.format(key)]['folder'] = folders[int(key)]

df_letters = pd.DataFrame()
for key, df in d2.items():
    print(key, df.shape)
    df_letters = pd.concat([df_letters, df])

df_letters = df_letters.drop_duplicates(subset=['pdf_name'], keep='first')
df_letters = df_letters.reset_index(drop=True)

0 (859, 2)
1 (826, 2)
2 (834, 2)
3 (792, 2)
4 (785, 2)
5 (771, 2)
6 (61, 2)
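
The de-duplication step can be sketched on toy data. The folder and file names below are made up for illustration; only the `pd.concat` and `drop_duplicates(keep='first')` pattern mirrors the cell above. Because the folders are concatenated in download order, `keep='first'` retains the copy from the earliest download that contains a given file name.

```python
import pandas as pd

# Hypothetical file names standing in for two bulk downloads.
downloads = {
    'download_2020-03/': ['Facility A 4 6 18.pdf', 'Facility B 7 16.pdf'],
    'download_2020-06/': ['Facility B 7 16.pdf', 'Facility C 1 3 20.pdf'],
}

# One small dataframe per folder, then a single concat.
frames = [
    pd.DataFrame({'pdf_name': names, 'folder': folder})
    for folder, names in downloads.items()
]
letters = pd.concat(frames, ignore_index=True)

# Keep the first occurrence of each file name (i.e. the earliest folder).
unique_letters = (letters
                  .drop_duplicates(subset=['pdf_name'], keep='first')
                  .reset_index(drop=True))
print(unique_letters)
```

Note that this treats two files as the same letter only when their names match exactly; a renamed copy of the same letter would survive as a separate row.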


In [4]:
df_letters

Unnamed: 0,pdf_name,folder
0,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",NH_enforcement_letters_2020-03/
1,"Alaska Gardens (FP, Hx G, prior E, CMP, CF) 12...",NH_enforcement_letters_2020-03/
2,Alaska Gardens (Hx D prior E) 7 16.pdf,NH_enforcement_letters_2020-03/
3,Alaska Gardens (Hx F prior D) 8 9 18.pdf,NH_enforcement_letters_2020-03/
4,"Alaska Gardens (Hx G, prior E) 11 16 18.pdf",NH_enforcement_letters_2020-03/
...,...,...
1030,Warm_Beach_Care_Center_February_21_2017_letter...,NH_2020_Jan-Feb/individual_letters/
1031,Washington_Veterans_Home_-_Retsil_February_10_...,NH_2020_Jan-Feb/individual_letters/
1032,Willapa_Harbor_Health_and_Rehab_February_13_20...,NH_2020_Jan-Feb/individual_letters/
1033,Willapa_Harbor_Health_and_Rehab_February_14_20...,NH_2020_Jan-Feb/individual_letters/


So we have over a thousand PDF files to scrape, most of them with several pages. Shall we?

# SCRAPING FEST

### The loooooooooop

In [5]:
# Create the dataframe where all the data will be deposited.
df_all_reports = pd.DataFrame(columns = ['pdf_name', 'letter_date', 
                                         'survey_date', 'survey_type',
                                         'vendor_num', 'fed_num', 'aem_num', 'action', 
                                         'page', 'image_based',
                                         'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
                                         'find_code', 'find_desc', 
                                         'wac', 'cmp_item', 'cmp_total'])

print('idx|pgs|rows|pdf_name')

# For each enforcement letter:
for index, row in df_letters.iterrows(): 

    pdf = pdfplumber.open(path + row['folder'] + row['pdf_name'])
    print(index, '|', len(pdf.pages), '|', len(df_all_reports), '|', row['pdf_name'])
    
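    # As we scrape each page, we will save the data we extract from them in these dataframes:
    df_cmps = pd.DataFrame(columns = ['page', 'wac', 'cmp_item', 'cmp_total'])
    df_sections = pd.DataFrame(columns = ['page','fed_enforcement', 'nc_hist', 'epoc', 'state_rem', 'appeal_rights'])
    df_finds = pd.DataFrame(columns = ['page','find_code', 'find_desc'])

    # Reset the report-wide fields so that values from the previous letter cannot
    # leak into this one when page 1 has no extractable text.
    letter_date = survey_date = survey_type = np.nan
    vendor_num = fed_num = aem_num = action = np.nan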

    
    # For each page in the report:
    for pg in pdf.pages:
        
        # Extract all the text in the page into a single variable
        pg_txt = pg.extract_text()
        
        if not pg_txt:
            new_record = {'pdf_name':row['pdf_name'],
                          'letter_date':np.nan,
                          'vendor_num':np.nan, 
                          'fed_num':np.nan, 
                          'aem_num':np.nan, 
                          'action':np.nan, 
                          'survey_date':np.nan, 
                          'survey_type':np.nan,
                          'page':pg.page_number,
                          'image_based':True,
                          'fed_enforcement':np.nan,
                          'nc_hist':np.nan,
                          'epoc':np.nan,
                          'state_rem':np.nan,
                          'appeal_rights':np.nan,
                          'find_code':np.nan,
                          'find_desc':np.nan,
                          'wac':np.nan, 
                          'cmp_item':np.nan,
                          'cmp_total':np.nan}
            df_all_reports = pd.concat([df_all_reports, pd.DataFrame([new_record])], ignore_index=True)

        else:

            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # REPORT METADATA: 'letter_date', 'vendor_num', 'fed_num', 'aem_num', 'action'
            # (These 5 data pieces are always on page 1)
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            
            if pg.page_number == 1:

                # ~~~~~ Data found on top area: 'letter_date', 'vendor_num', 'fed_num', 'aem_num'

                top_area = pg.crop((pg.bbox[0], pg.bbox[1],
                                     pg.bbox[2], pg.bbox[3]/3))
                top_lines = top_area.extract_text().split('\n')

                # 'letter_date'
                try:
                    pattern = '^[A-Za-z]+\s?\d{1,2},\s?\d{4}'
                    letter_date = [line for line in top_lines if re.match(pattern, line.strip())]
                    letter_date = letter_date[0]
                    del(pattern)
                except:
                    letter_date = np.nan

                # 'vendor_num', 'fed_num', 'aem_num',
                id_lines = [line for line in top_lines if re.match('^Vendor|AEM', line.strip())]
                # Sometimes these data points are absent. For those cases:
                if not id_lines: 
                    vendor_num = np.nan
                    fed_num = np.nan
                    aem_num = np.nan
                # If we do find the data points:
                else: 
                    # Vendor number
                    try:
                        vendor_num = id_lines[0].split('/')[0].strip()
                        vendor_num = vendor_num.split(':')[-1].strip()
                    except:
                        vendor_num = np.nan
                    # Fed number
                    try:
                        fed_num = id_lines[0].split('/')[1].strip()
                        fed_num = fed_num.split(':')[-1].strip()
                    except:
                        fed_num = np.nan
                    # AEM number
                    try:
                        aem_num = id_lines[1].split('#')[-1].strip()
                    except:
                        aem_num = np.nan


                # ~~~~~ Data found on middle area: 'action'

                middle_area = pg.crop((pg.bbox[0], pg.bbox[3]/3,
                                       pg.bbox[2], pg.bbox[3]*2/3))
                middle_lines = middle_area.extract_text().split('\n')
                middle_lines = [line.strip() for line in middle_lines]
                middle_lines = [line for line in middle_lines if re.match('^[A-Z,\s]+$', line)]
                if middle_lines:
                    action = ' '.join(middle_lines)
                else:
                    action = np.nan

                    
                    
                # ~~~~~ 'survey_date' and 'survey_type'
                
                try:
                    survey_txt = re.search('(On.*conducted an? .* at your facility)', pg_txt.replace('\n','')).group(1)
                    # There may have been more than one instance of the key phrase 'at your facility'.
                    # Get just the first instance
                    survey_txt = survey_txt.split('at your facility')[0]

                    rgx = '((january|february|march|april|may|june|july|august|september|october|november|december)\s\d{1,2},?\s\d{4})'
                    survey_date = re.search(rgx, survey_txt.lower()).group(1)

                    rgx = 'conducted an? (.*)'
                    survey_type = re.search(rgx, survey_txt.lower()).group(1)

                except:
                    survey_txt = np.nan
                    survey_date = np.nan
                    survey_type = np.nan

                    
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # DETAILED DATA: Findings, federal enforcement, ePOC and fines
            # (These data points can show up in any page, not just in page 1)
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

            
            # ~~~~~ Findings: 'find_code', 'find_desc'
            
            # Split the page's text using an opening parenthesis as the mark
            finds_list = pg_txt.split('(')
            # Reduce the created list to only the elements that start with this pattern: CAPITAL);
            finds_list = [f for f in finds_list if re.match(r'^[A-Z]\);', f)]
            if not finds_list:
                find_code = np.nan
                find_desc = np.nan
            else:
                # Select only the first find (If there are >1, probably from non-compliance history)
                find = finds_list[0]
                # Get rid of all text after the first period.
                find = find.split('.')
                find = find[0]
                # Eliminate noise characters
                find = find.replace(')', '').replace('\n', '') 
                # Split using ';'
                find = find.split(';')
                # The first element of the new list is the finding code
                find_code = find[0].strip()
                # The second element of the new list is the finding description
                find_desc = find[1].strip()
                    
            find_record = {'page':pg.page_number,
                           'find_code':find_code,
                           'find_desc':find_desc}
            df_finds = pd.concat([df_finds, pd.DataFrame([find_record])], ignore_index=True)


            
            # ~~~~~ Sections
            # - Federal enforcement
            # - Electronic Plan of Correction (ePOC)
            # - Non-compliance history
            # - Appeal Rights

            # For sections + fines, we will split the page text into lines using '\n':
            pg_lines = pg_txt.split('\n')
            pg_lines = [line.strip() for line in pg_lines]
         
            # For each section, determine if there is a line that indicates its presence in the report
            fed_lines = [line for line in pg_lines if re.search('Federal Enforcement', line)]
            epoc_lines = [line for line in pg_lines if re.search('Electronic Plan of Correction \(ePOC\)', line)]
            history_lines = [line for line in pg_lines if re.search('History of Non-Compliance', line)]
            remedy_lines = [line for line in pg_lines if re.search('State Remedies', line)]
            appeal_lines = [line for line in pg_lines if re.search('Appeal Rights', line)]
            
            # Create a record for federal enforcement, nc history & epoc:
            sections_record = {'page':pg.page_number,
                               'fed_enforcement':len(fed_lines)>0,
                               'nc_hist':len(history_lines)>0,
                               'epoc':len(epoc_lines)>0,
                               'state_rem':len(remedy_lines)>0,
                               'appeal_rights':len(appeal_lines)>0
                              }
            df_sections = pd.concat([df_sections, pd.DataFrame([sections_record])], ignore_index=True)

            
            # ~~~~~ Fines: 'wac', 'cmp_item', 'cmp_total'
            
            # Look for any lines that contain a $ sign
            dollar_lines = [line for line in pg_lines if re.search('\$', line)]                        

            # We now want to divide the 'dollar_lines' list into two lists:
            # 1- A list with only one line: the line that contains the aggregated amount of all fines
            # 2- A list with all the lines that contain individual WAC fines
            
            # 1- Line that contains the aggregate fined amount
            cmp_total_line = [line for line in dollar_lines if re.search('check', line)]
            if not cmp_total_line:
                cmp_total = np.nan
            else:
                # This list should only have one element, since there should only be 
                # one line with the total amount fined. Let's test for that:
                assert len(cmp_total_line) == 1
                try:
                    pattern = '\$((\d+(,\s?|\.)?)+\d+)'
                    cmp_total = re.search(pattern, cmp_total_line[0]).group(1)
                    del(pattern)
                except:
                    cmp_total = np.nan

            # 2- Lines that contain individual WAC fines
            # Subset of those dollar_lines that START with a wac code
            cmp_item_lines = [line for line in dollar_lines if re.search('^(WAC\s?)?\d+-\d+-\s?\d+', line)]

            # If there are no lines with $ signs and WAC codes
            if not cmp_item_lines:
                wac = np.nan
                cmp_item = np.nan
                
                cmp_record = {'page':pg.page_number,
                                 'wac':wac,
                                 'cmp_item':cmp_item,
                                 'cmp_total':cmp_total}
                df_cmps = pd.concat([df_cmps, pd.DataFrame([cmp_record])], ignore_index=True)
            # And if there are
            else:
                for line in cmp_item_lines:
                    # WAC code (Notice that here we don't require it to be at the beginning)
                    pattern = r'((WAC\s?)?\d+-\d+-\s?\d+\s?(\(([0-9]|[a-z])\)\s?)*)'
                    wac = re.search(pattern, line).group(1)
                    del(pattern)
                    # CMP
                    pattern = '\$((\d+(,|\.)?)+\d+)'
                    cmp_item = re.search(pattern, line).group(1)
                    del(pattern)

                    cmp_record = {'page':pg.page_number,
                                     'wac':wac,
                                     'cmp_item':cmp_item,
                                     'cmp_total':cmp_total}
                    df_cmps = pd.concat([df_cmps, pd.DataFrame([cmp_record])], ignore_index=True)

                    
    # Consolidate df_finds, df_cmps & df_sections into a single data frame: df_one_report
    df_one_report = df_finds.join(df_cmps.set_index('page'), on='page', how='outer').reset_index(drop=True)
    df_one_report = df_one_report.join(df_sections.set_index('page'), on='page', how='outer').reset_index(drop=True)

    # Add the report-wide data to df_one_report
    df_one_report['pdf_name'] = row['pdf_name']
    df_one_report['letter_date'] = letter_date
    df_one_report['survey_date'] = survey_date
    df_one_report['survey_type'] = survey_type
    df_one_report['vendor_num'] = vendor_num
    df_one_report['fed_num'] = fed_num
    df_one_report['aem_num'] = aem_num
    df_one_report['action'] = action
    df_one_report['image_based'] = False
    
    # Rearrange columns according to the column order of 'df_all_reports' data frame
    df_one_report = df_one_report[df_all_reports.columns]

    # Attach 'df_one_report' to 'df_all_reports'
    df_all_reports = pd.concat([df_all_reports, df_one_report], ignore_index=True)
    del(df_one_report)

idx|pgs|rows|pdf_name
0 | 5 | 0 | Advance Post Acute (G, CMP, CF) 4 6 18.pdf
1 | 5 | 5 | Alaska Gardens (FP, Hx G, prior E, CMP, CF) 12 19 18.pdf
2 | 4 | 12 | Alaska Gardens (Hx D prior E) 7 16.pdf
3 | 4 | 16 | Alaska Gardens (Hx F prior D) 8 9 18.pdf
4 | 5 | 20 | Alaska Gardens (Hx G, prior E) 11 16 18.pdf
5 | 6 | 26 | Alaska Gardens (K, HX Prior Fg, Prior G, Prior D, SUB, SP, CMP, CF) 1 17 19.pdf
6 | 1 | 34 | Alaska Gardens (Lift SP, BIC) 2 14 19.pdf
7 | 5 | 35 | Alaska Gardens (OSFM IJ, R, G, CMP, CF) 4 4 19.pdf
8 | 5 | 41 | Alaska Gardens Health (G, CMP, CF) 2 20 18.pdf
9 | 4 | 46 | Aldercrest ( 2nd Hx GG, prior G, prior D) 5 2 19.pdf
10 | 4 | 50 | Aldercrest (Failed Post, prior GG, CMP, CF) 7 19 18.pdf
11 | 5 | 54 | Aldercrest (GG, CMP, CF) 8 17.pdf
12 | 5 | 61 | Aldercrest (Hx GG, CMP, CF, prior D) 3 20 19.pdf
13 | 5 | 67 | Aldercrest (Hx GG, CMP, CF, prior D) 4 25 18.pdf
14 | 5 | 72 | Aldercrest (Hx GG, CMP, CF, prior G, prior D) 6 15 18.pdf
15 | 4 | 77 | Aldercrest Health & Reh

122 | 2 | 576 | Brookfield Cascadia (Lift SP, BIC) 5 22 18.pdf
123 | 5 | 578 | Brookfield Cascadia Amended (Hx, GG, Amend DCH, Prior G, CMP, CF) 4 11 18.pdf
124 | 5 | 583 | Brookfield Centralia (GG, CMP, CF) 3 1 18.pdf
125 | 5 | 588 | Burien Nursing (IJ, R, CMP, CF) 6 5 19.pdf
126 | 5 | 593 | Burien Nursing and Rehab (G, CMP, CF) 3 17.pdf
127 | 4 | 601 | Burien Nursing and Rehab (G, CMP, CF) 3 2 18.pdf
128 | 4 | 605 | Burien Nursing and Rehab (GG, CMP, CF) 12 18 18.pdf
129 | 5 | 609 | Careage of Whidbey (Failed Post, CI, CF, CMP) 1 17 08.pdf
130 | 5 | 614 | Careage of Whidbey (FP G, Lift SP Hx Sub H, Sub J, G, SP) 4 17 19.pdf
131 | 5 | 619 | Careage of Whidbey (GG, CMP, CF) 1 18.pdf
132 | 4 | 625 | Careage of Whidbey (GG, CMP, CF) 10 18 17.pdf
133 | 4 | 629 | Careage of Whidbey (GG, CMP, CF) 5 17.pdf
134 | 6 | 633 | Careage of Whidbey (Sub H, Sub J, G, CMP, CF, SP) 2 14 19.pdf
135 | 5 | 644 | Cashmere Care Center (D, DCH, CF) 7 31 19.pdf
136 | 4 | 649 | Cashmere Care Center (G, CMP, CF

249 | 5 | 1184 | Fir Lane (GG, CMP, CF) 11 21 17.pdf
250 | 4 | 1190 | Fir Lane (Hx D prior F) 4 17 18.pdf
251 | 5 | 1194 | Fir Lane (Hx G prior D, CMP, CF) 2 28 18.pdf
252 | 5 | 1199 | Fir Lane (Hx G prior G, CMP, CF) 3 20 19.pdf
253 | 3 | 1204 | Fir Lane (IJ, NR) 3 17.pdf
254 | 4 | 1207 | Fir Lane Health and Rehab (GG, CMP, CF) 7 11 19.pdf
255 | 6 | 1211 | Fir Lane Health and Rehab (IJ, R, CMP, CF) 4 17.pdf
256 | 4 | 1220 | Fircrest School Pat N (Hx E, prior D) 6 28 18.pdf
257 | 4 | 1224 | Fircrest School Pat N (Hx, OSFM, G, CMP) 11 16.pdf
258 | 4 | 1228 | Fircrest School, Pat N (G, CMP) 12 10 19.pdf
259 | 4 | 1232 | Fircrest School, Pat N (G, CMP) 12 7 18.pdf
260 | 6 | 1236 | Forest Ridge (DCH) 11 4 19.pdf
261 | 4 | 1242 | Forest Ridge (DCH) 4 24 18.pdf
262 | 4 | 1246 | Forest Ridge (F, DCH) 1 3 20.pdf
263 | 5 | 1250 | Forest Ridge (G, SUB, CMP, CF) 2 7 19.pdf
264 | 5 | 1256 | Forest Ridge (GG, CMP, CF) 7 17.pdf
265 | 4 | 1261 | Forest Ridge Health and rehab (Hx D, prior G, IJ non) 4

370 | 2 | 1753 | Kindred Arden (BIC Lift SP) 6 17.pdf
371 | 4 | 1755 | Kindred Lakewood (GG, CMP, CF) 9 6 17.pdf
372 | 7 | 1759 | Kindred Nursing Arden (3OOC K, WAC Hx D, prior J, Cond, CF) 5 17.pdf
373 | 4 | 1769 | Kindred Nursing Arden (Hx D, prior J) 3 17.pdf
374 | 5 | 1773 | Kindred Nursing Arden (WAC Hx D, prior J, Cond, CF) 4 17.pdf
375 | 5 | 1778 | Kindred-Arden (IJ R, Sun, CMP, CF) 3 17.pdf
376 | 4 | 1783 | Lake Ridge Center (Hx D, prior D) 6 28 18.pdf
377 | 4 | 1787 | Lake Ridge Center (Hx E prior D) 5 23 19.pdf
378 | 5 | 1791 | Lake Ridge Center (Hx G, prior E, CMP, CF) 8 29 17.pdf
379 | 4 | 1796 | Lakeland Village (Hx IJ, R, CMP, prior G) 8 9 19.pdf
380 | 4 | 1800 | Landmark (GG, CMP, CF) 11 9 17.pdf
381 | 4 | 1804 | Landmark Care (Hx D prior D) 4 17 19.pdf
382 | 4 | 1808 | Landmark Care (Hx D, prior D, prior D) 5 16 19.pdf
383 | 4 | 1812 | Landmark Care (Hx E prior D) 2 12 20.pdf
384 | 5 | 1816 | Landmark Care (Hx GG, prior D, CMP, CF) 1 16 19.pdf
385 | 5 | 1821 | Landmark 

490 | 3 | 2342 | Olympia Transitional Care (Past Non G) 1 25 19.pdf
491 | 7 | 2345 | Olympia Transitional Care-incorrect (IJ, R, OSFM, SUB, CF, SP) 6 17.pdf
492 | 5 | 2356 | Orchard Park (G, CMP, CF) 12 15 17.pdf
493 | 4 | 2361 | Pacific Care & Rehab (Hx D, prior D) 3 5 19.pdf
494 | 5 | 2365 | Pacific Care (IJ R, SUB, CMP, CF) 8 17.pdf
495 | 5 | 2371 | Pacific Care and Rehab (3OOC, G prior D, prior D, CMP, CF) 4 11 19.pdf
496 | 5 | 2376 | Pacific Care and Rehab (GG, CMP, CF) 1 22 18.pdf
497 | 4 | 2382 | Pacific Care and Rehabitlation (Hx F prior D) 8 13 19.pdf
498 | 4 | 2386 | Pacific Specialty  (GG, CMP, CF) 3 1 18.pdf
499 | 5 | 2390 | Pacific Specialy and Rehab Care (IJ, R, SUB, CMP, CF) 8 31 17.pdf
500 | 7 | 2397 | Paramount (2nd Failed post E, prior E, prior IJ) 5 17.pdf
501 | 2 | 2411 | Paramount (BIC - Lift Cond) 5 30 18.pdf
502 | 2 | 2413 | Paramount (BIC - Lift Cond, SP) 5 17.pdf
503 | 4 | 2415 | Paramount (Failed post E, prior IJ) 3 17.pdf
504 | 6 | 2419 | Paramount (FP, Prior

604 | 4 | 2901 | Regency at the Park (GG, CMP, CF) 2 21 19.pdf
605 | 4 | 2905 | Regency at the Park (Hx F prior D) 12 6 18.pdf
606 | 5 | 2909 | Regency at the Park (Hx G, CF, CMP, prior E) 11 7 19.pdf
607 | 5 | 2915 | Regency at the Park (Hx G, prior F prior D, CF) 1 11 19.pdf
608 | 4 | 2920 | Regency Canyon Lakes Rehab & Nursing (Hx D prior D) 7 30 18.pdf
609 | 4 | 2924 | Regency Canyon Lakes Rehab and Nursing Center (Hx D prior G, CMP) 12 17 18.pdf
610 | 5 | 2928 | Regency Care at Monroe (Failed Post, prior G) 11 19 18.pdf
611 | 4 | 2934 | Regency Everett (GG, CMP, CF) 6 17.pdf
612 | 4 | 2938 | Regency Harmony House (Hx D prior D) 4 17.pdf
613 | 2 | 2942 | Regency North Bend (BIC, Lift SP) 12 15 17.pdf
614 | 5 | 2944 | Regency North Bend (Direct Care hours, CF) 1 8 18.pdf
615 | 5 | 2949 | Regency North Bend (G, CMP, CF) 10 17 19.pdf
616 | 5 | 2954 | Regency North Bend (GG, CMP, CF) 4 17.pdf
617 | 6 | 2960 | Regency North Bend (GG, SUB, CMP, CF, SP) 10 12 17.pdf
618 | 4 | 2969 | Regen

726 | 5 | 3476 | Sunshine Health and Rehab (G, CMP, CF) 10 2 19.pdf
727 | 1 | 3481 | Tacoma Lutheran (Withdraw citation) 1 30 19.pdf
728 | 4 | 3482 | Tacoma Lutheran Home (Hx E, prior E) 5 22 18.pdf
729 | 5 | 3486 | Tacoma Lutheran Home (IJ R, SUB, CMP, CF) 7 17.pdf
730 | 5 | 3491 | Tacoma Nursing & Rehab (G, CMP,CF) 4 30 19.pdf
731 | 4 | 3500 | Tacoma Nursing and Rehab (2nd FP, WAC only) 1 15 20.pdf
732 | 4 | 3504 | Tacoma Nursing and Rehab (FP, WAC only) 12 06 19.pdf
733 | 4 | 3508 | Talbot (GG, CMP, CF) 4 6 18.pdf
734 | 4 | 3512 | Talbot (Hx E, Prior GG. CMP) 2 23 18.pdf
735 | 4 | 3516 | Talbot Center (GG, CMP, CF) 4 29 19.pdf
736 | 5 | 3520 | Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF) 9 17.pdf
737 | 6 | 3525 | Talbot Center (IJ, R, SUB, CMP, CF) 8 17.pdf
738 | 4 | 3533 | Talbot Center for Rehab & Healthcare (F, Directed POC) 8 21 19.pdf
739 | 5 | 3537 | Talbot Center for Rehab (3OOC F prior D prior D) 1 22 20.pdf
740 | 4 | 3542 | Talbot Center for Rehab (GG, CMP, CF) 1 26 18.p

841 | 5 | 4027 | Wesley Homes Health Center (Hx G prior G, CMP, CF) 10 29 19.pdf
842 | 4 | 4032 | Whitman Health and Rehab (G, CMP, CF) 5 7 18.pdf
843 | 5 | 4036 | Whitman Health and Rehab (G, CMP, CF) 7 17.pdf
844 | 5 | 4041 | Willapa Harbor (Hx H prior D, SUB, CMP, CF) 11 27 17.pdf
845 | 4 | 4047 | Willapa Harbor (IJ, SUB, CMP, CF) 6 17.pdf
846 | 4 | 4051 | Willow Springs (D prior F) 5 2 19.pdf
847 | 5 | 4055 | Willow Springs (Hx D, prior D, CF) 4 17.pdf
848 | 4 | 4061 | Willow Springs (Hx F prior D) 7 23 19.pdf
849 | 5 | 4065 | Woodland Convelescent Center (G, CMP, CF) 8 9 19.pdf
850 | 4 | 4070 | Woodland DCH (CF) 9 6 19.pdf
851 | 4 | 4074 | Yakima Valley School (3OOC G, Hx G, prior G, prior OSFM F) 5 17.pdf
852 | 4 | 4078 | Yakima Valley School (G, CMP) 7 23 18.pdf
853 | 3 | 4082 | Yakima Valley School (GG, CMP) 11 13 18.pdf
854 | 4 | 4085 | Yakima Valley School (Hx E prior D) 7 11 19.pdf
855 | 4 | 4089 | Yakima Valley School (Hx E prior E) 6 4 18.pdf
856 | 4 | 4093 | Yakima Valley

960 | 5 | 4568 | View Ridge (6OOC D, prior G prior FP, CF, CMP) 8 5 20.pdf
961 | 5 | 4573 | Wesley Homes (Hx GG, prior D, CF, CMP, CF) 7 9 20.pdf
962 | 5 | 4578 | Advanced Post Acute (G, CMP, CF) 8 7 20.pdf
963 | 5 | 4583 | Avalon Care Center (Level G - Rec Fed CMP ) 7 31 20.pdf
964 | 5 | 4588 | Crescent Health Care (Covid F) 8 17 20.pdf
965 | 7 | 4593 | Franklin Hills (Covid L, CMP, Cont cond, cont SP, hx prior F, CF) 8 7 20.pdf
966 | 1 | 4600 | Franklin Hills Notice of DPOC.pdf
967 | 4 | 4601 | Lynnwod Post Acute (Hx D, prior IJ civil fine).pdf
968 | 6 | 4605 | Mira Vista (3 OOC Covid E prior G prior D, CMP) 8 5 20.pdf
969 | 4 | 4611 | Prestige Post Acute - Centralia (GG, CF, CMP) 8 18 20.pdf
970 | 5 | 4615 | Prestige Post Acute-Edmonds Amended (DPOC, F, 45 day DDPNA) 8 3 20.pdf
971 | 7 | 4620 | Providence Mount St Vincent (K, DPOC, CMP, CF) 8 4 20.pdf
972 | 6 | 4628 | WA Care Center (HX E, DPOC, prior D) 8 7 20.pdf
973 | 6 | 4634 | Wesley Homes (Hx Covid E prior G, prior D) 8 18 20.
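
The regular expressions inside the loop are easier to check in isolation. The sketch below lifts two of them into standalone functions, `parse_survey` and `parse_fine_line` (hypothetical names, not part of the notebook), and runs them on made-up sentences that only imitate the wording of the letters. One deliberate difference from the loop: line breaks are replaced with a space rather than removed, so words on consecutive lines are not glued together.

```python
import re

MONTHS = ('january|february|march|april|may|june|july|august|'
          'september|october|november|december')
WAC_PATTERN = r'((WAC\s?)?\d+-\d+-\s?\d+\s?(\(([0-9]|[a-z])\)\s?)*)'
AMOUNT_PATTERN = r'\$((\d+(,|\.)?)+\d+)'


def parse_survey(page_text):
    """Return (survey_date, survey_type), or (None, None) if not found."""
    flat = page_text.replace('\n', ' ')
    match = re.search(r'(On.*conducted an? .* at your facility)', flat)
    if match is None:
        return None, None
    # Keep only the text before the first 'at your facility'.
    sentence = match.group(1).split('at your facility')[0].lower()
    date = re.search(r'((' + MONTHS + r')\s\d{1,2},?\s\d{4})', sentence)
    kind = re.search(r'conducted an? (.*)', sentence)
    return (date.group(1) if date else None,
            kind.group(1).strip() if kind else None)


def parse_fine_line(line):
    """Return (wac_code, amount) for a line with a WAC code and a $ amount."""
    wac = re.search(WAC_PATTERN, line)
    amount = re.search(AMOUNT_PATTERN, line)
    if wac is None or amount is None:
        return None
    return wac.group(1).strip(), amount.group(1)


# Made-up text for illustration only.
sample_page = ('On April 6, 2018, the department conducted an unannounced\n'
               'complaint investigation at your facility. Staff remained '
               'at your facility for two days.')
survey_date, survey_type = parse_survey(sample_page)
fine = parse_fine_line('WAC 388-97- 1060(3)(g) $1,000.00')
print(survey_date, '|', survey_type, '|', fine)
```

Testing the patterns this way, on a handful of lines copied from real letters, is a cheap way to spot wording variants that the loop would silently turn into `NaN`.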

Let's take a look at our freshly baked dataset.

In [6]:
print(df_all_reports.shape)
print(df_all_reports.nunique())

df_all_reports

(4958, 20)
pdf_name           1035
letter_date         580
survey_date         581
survey_type         111
vendor_num          248
fed_num             217
aem_num             701
action               56
page                  8
image_based           2
fed_enforcement       2
nc_hist               2
epoc                  2
state_rem             2
appeal_rights         2
find_code             9
find_desc            50
wac                 178
cmp_item             41
cmp_total            66
dtype: int64


Unnamed: 0,pdf_name,letter_date,survey_date,survey_type,vendor_num,fed_num,aem_num,action,page,image_based,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,find_code,find_desc,wac,cmp_item,cmp_total
0,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf","April 12, 2018","april 6, 2018",unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,1,False,False,False,True,False,False,G,isolated deficiencies that constitute actual h...,,,
1,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf","April 12, 2018","april 6, 2018",unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,2,False,False,False,False,True,False,,,WAC 388-97- 1060(3)(g),1000.00,
2,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf","April 12, 2018","april 6, 2018",unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,3,False,False,False,False,False,True,,,,,
3,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf","April 12, 2018","april 6, 2018",unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,4,False,False,False,False,False,False,,,,,1000.00
4,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf","April 12, 2018","april 6, 2018",unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,5,False,False,False,False,False,False,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4953,Willapa_Harbor_Health_and_Rehab_February_3_201...,"February 3, 2017","january 31, 2017",complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,4,False,False,False,False,False,True,,,WAC 388-97-0640(2)(a)(b),3000.00,
4954,Willapa_Harbor_Health_and_Rehab_February_3_201...,"February 3, 2017","january 31, 2017",complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,4,False,False,False,False,False,True,,,WAC 388-97-1060(3)(b),1000.00,
4955,Willapa_Harbor_Health_and_Rehab_February_3_201...,"February 3, 2017","january 31, 2017",complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,4,False,False,False,False,False,True,,,WAC 388-97-1620(1),3000.00,
4956,Willapa_Harbor_Health_and_Rehab_February_3_201...,"February 3, 2017","january 31, 2017",complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,5,False,False,False,False,False,False,,,,,13000.00
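
The repeated page numbers in the table above (page 4 of the last letter appears three times) follow from how the per-page dataframes are joined: a page with several WAC fines produces one row per fine, and the page-level fields are copied onto each of those rows. A minimal sketch with made-up values, assuming the same `join(..., on='page', how='outer')` call used in the loop:

```python
import pandas as pd

# One finding row per page; page 2 has no finding.
finds = pd.DataFrame({'page': [1, 2], 'find_code': ['G', None]})

# Page 2 carries two individual fines, so it has two rows here.
cmps = pd.DataFrame({'page': [1, 2, 2],
                     'wac': [None, '388-97-0640', '388-97-1060'],
                     'cmp_item': [None, '3000.00', '1000.00']})

report = (finds
          .join(cmps.set_index('page'), on='page', how='outer')
          .reset_index(drop=True))
print(report)
```

This is why the final dataset has more rows than the letters have pages, and why page-level columns should not be summed or counted without first de-duplicating by `pdf_name` and `page`.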


# CLEANING THE SCRAPED DATA

In [7]:
df = df_all_reports.copy()

## WAC codes

The first clean version of the code will include:
- Title
- Chapter
- Section
- Subsection

In [8]:
df['wac_clean_long'] = df['wac'].str.replace(r'WAC|\s', '', regex=True) # Don't need spaces, nor to be reminded they are *WAC* codes.
df['wac_clean_long'] = df['wac_clean_long'].str.strip()
df['wac_clean_long'] = df['wac_clean_long'].str.replace('97-97-', '97-', regex=False)
df['wac_clean_long'] = df['wac_clean_long'].str.replace('\\', '', regex=False) # Only one case we could see with '\'

Create shorter version of the code, one that will not include the subsection.

In [9]:
df['wac_clean_short'] = df['wac_clean_long'].str.extract(r'(\d+-\d+-\d+)', expand=False)
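
To make the two versions of the code concrete, here is a small sketch on three sample values. The first value is taken from the scraped table above; the second is an invented example of the doubled `97-97-` chapter, and the third is a missing value. The `regex=` arguments are spelled out because the default of `Series.str.replace` changed to `regex=False` in pandas 2.0.

```python
import pandas as pd

wac = pd.Series(['WAC 388-97- 1060(3)(g)', '388-97-97-0640(2)(a)(b)', None])

# Long version: drop the 'WAC' prefix and all whitespace, fix the doubled chapter.
wac_long = (wac.str.replace(r'WAC|\s', '', regex=True)
               .str.replace('97-97-', '97-', regex=False))

# Short version: title-chapter-section only, without subsections.
wac_short = wac_long.str.extract(r'(\d+-\d+-\d+)', expand=False)

print(pd.DataFrame({'wac': wac, 'long': wac_long, 'short': wac_short}))
```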

## Civil monetary penalties

We transform *cmp_item* and *cmp_total* into numeric type variables

In [10]:
# Individual cmp
df['cmp_item_num'] = df['cmp_item'].str.strip().str.replace(r',|\s', '', regex=True)
df['cmp_item_num'] = df['cmp_item_num'].str.replace('1.500.00', '1500.00', regex=False) # Just one case
df['cmp_item_num'] = pd.to_numeric(df['cmp_item_num'])

# Aggregate cmp
df['cmp_total_num'] = df['cmp_total'].str.strip().str.replace(r',|\s', '', regex=True)
df['cmp_total_num'] = pd.to_numeric(df['cmp_total_num'])

print('Total fines (from adding individual WAC fines of each report) =', df['cmp_item_num'].sum(skipna=True))
print('Total fines (from adding only the aggregate fine of each report) =', df['cmp_total_num'].sum(skipna=True))
print('So we have a slight discrepancy of', 
      df['cmp_total_num'].sum(skipna=True) - df['cmp_item_num'].sum(skipna=True))

Total fines (from adding individual WAC fines of each report) = 2560968.52
Total fines (from adding only the aggregate fine of each report) = 2645518.52
So we have a slight discrepancy of 84550.0
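
One possible way to see where such a discrepancy comes from is to compare the two figures letter by letter. The sketch below uses made-up numbers and assumes that each letter states a single aggregate amount, so the per-letter total is taken with `max` rather than `sum` (the scraped total is copied onto every fine row of the page it appears on, so summing it could double count). The column names match the ones created above; everything else is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'a.pdf', 'b.pdf', 'b.pdf'],
    'cmp_item_num': [1000.0, 2000.0, None, 3000.0, None],
    'cmp_total_num': [None, None, 3000.0, None, 5000.0],
})

per_letter = toy.groupby('pdf_name').agg(
    items_sum=('cmp_item_num', 'sum'),   # sum of the individual WAC fines
    total=('cmp_total_num', 'max'),      # the aggregate amount stated in the letter
)
per_letter['gap'] = per_letter['total'] - per_letter['items_sum']

# Letters whose stated total does not match the sum of their itemized fines.
mismatched = per_letter[per_letter['total'].notna() & (per_letter['gap'] != 0)]
print(mismatched)
```

Letters that show up in `mismatched` are candidates for a manual look at the PDF, since either an itemized fine or the total line may have been missed by the regular expressions.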


## Letter dates

We transform *letter_date* into a date-type variable

In [11]:
# Consistency test: Does each enforcement letter have only one letter date?
temp = df[['pdf_name', 'letter_date']]
temp = temp.drop_duplicates().reset_index(drop=True)
assert len(temp) == df['pdf_name'].nunique()

del(temp)
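
If the assertion above ever fails, it does not say which letters are at fault. A variant that lists them, sketched here on made-up rows, counts the distinct dates per file with `groupby(...).nunique(dropna=False)` so that a mix of a real date and a missing one also counts as an inconsistency:

```python
import pandas as pd

toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf', 'b.pdf'],
    'letter_date': ['April 12, 2018', 'April 12, 2018',
                    'May 1, 2018', 'May 2, 2018'],
})

dates_per_letter = toy.groupby('pdf_name')['letter_date'].nunique(dropna=False)
offenders = dates_per_letter[dates_per_letter > 1].index.tolist()
print(offenders)
```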

In [12]:
df['letter_dt'] = pd.to_datetime(df['letter_date'])

**Consistency test**: What are the earliest and latest dates?

In [13]:
print(df['letter_dt'].min())
print(df['letter_dt'].max())

2016-12-30 00:00:00
2108-05-15 00:00:00


Turns out the latest date is almost a century in the future. Obviously something is wrong with some of the years. Let's look for them.

In [14]:
df['letter_dt'].dt.year.value_counts(dropna=False)

2017.0    1630
2019.0    1560
2018.0    1168
2020.0     555
NaN         33
2108.0       4
2107.0       4
2016.0       4
Name: letter_dt, dtype: int64

In [15]:
df[df['letter_date'].str.contains('2108|2107', na=False)]['pdf_name'].unique()

array(['Garden Village (Hx E prior F) 6 17.pdf',
       'Olympia Transitional Care (GG, CMP, CF) 5 1 18.pdf'], dtype=object)

A visual review of those two PDF reports confirms our theory: the years were mistyped as 2107 and 2108 instead of 2017 and 2018. We correct for that typo.

In [16]:
df = df.drop(['letter_dt'], axis=1)

df['letter_dt'] = df['letter_date'].str.replace('2107', '2017').str.replace('2108', '2018')
df['letter_dt'] = pd.to_datetime(df['letter_dt'])

In [17]:
df['letter_dt'].dt.year.value_counts(dropna=False)

2017.0    1634
2019.0    1560
2018.0    1172
2020.0     555
NaN         33
2016.0       4
Name: letter_dt, dtype: int64

In [18]:
print(df['letter_dt'].min())
print(df['letter_dt'].max())

2016-12-30 00:00:00
2020-08-28 00:00:00


## Survey dates

We transform *survey_date* into a date type variable

In [19]:
# Consistency test: Does each enforcement letter have only one survey date?
temp = df[['pdf_name', 'survey_date']]
temp = temp.drop_duplicates().reset_index(drop=True)
assert len(temp) == df['pdf_name'].nunique()

del(temp)

In [20]:
df['survey_dt'] = pd.to_datetime(df['survey_date'])

In [21]:
df['survey_date'].value_counts(dropna=False)

NaN                  165
february 24, 2017     50
may 25, 2017          36
february 21, 2019     28
february 21, 2017     28
                    ... 
july 17, 2016          3
october 18, 2019       2
may 30, 2018           2
may 10, 2017           2
september 2, 2016      2
Name: survey_date, Length: 582, dtype: int64

In [22]:
df['survey_dt'].value_counts(dropna=False)

NaT           165
2017-02-24     50
2017-05-25     36
2019-02-21     28
2017-02-21     28
             ... 
2018-11-13      3
2018-05-30      2
2019-10-18      2
2017-05-10      2
2016-09-02      2
Name: survey_dt, Length: 582, dtype: int64

**Consistency test**: What are the earliest and latest dates?

In [23]:
print(df['survey_dt'].min())
print(df['survey_dt'].max())

2016-01-20 00:00:00
2109-01-17 00:00:00


Again, the latest date is almost a century in the future. Some of the years must be wrong. Let's look for them.

In [24]:
df['survey_dt'].dt.year.value_counts(dropna=False)

2017.0    1572
2019.0    1507
2018.0    1130
2020.0     520
NaN        165
2016.0      35
2107.0      16
2109.0       8
2108.0       5
Name: survey_dt, dtype: int64

In [25]:
df[df['survey_date'].str.contains('2107|2108|2109', na=False)]['pdf_name'].unique()

array(['Alaska Gardens (K, HX Prior Fg, Prior G, Prior D, SUB, SP, CMP, CF) 1 17 19.pdf',
       'Franklin Hills Health and Rehab (GG, CMP, CF) 3 16.pdf',
       'Grays Harbor (GG, CMP, CF) 3 17.pdf',
       'Prestige Care - Sunnyside (G, CMP, CF) 1 3 18.pdf',
       'Richland Rehab (D prior D, CMP) 10 18 17.pdf',
       'Grays_Harbor_Health_and_Rehabilitation_Center_March_6_2017_letter_13.pdf'],
      dtype=object)

A visual inspection of those letters shows that these are indeed typos: the '1' and the '0' in the year are transposed. We correct for that now.

In [26]:
df = df.drop('survey_dt', axis=1)

df['survey_dt'] = df['survey_date'].str.replace('2107','2017').str.replace('2108','2018').str.replace('2109','2019')
df['survey_dt'] = pd.to_datetime(df['survey_dt'])

In [27]:
df['survey_dt'].dt.year.value_counts(dropna=False)

2017.0    1588
2019.0    1515
2018.0    1135
2020.0     520
NaN        165
2016.0      35
Name: survey_dt, dtype: int64

In [28]:
print(df['survey_dt'].min())
print(df['survey_dt'].max())

2016-01-20 00:00:00
2020-08-18 00:00:00
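
The chained replacements used for the letter and survey dates could also be expressed as one regular expression that swaps the transposed digits. This is only a sketch on hypothetical date strings, and it assumes no genuine date in the 2100s can occur in the data.

```python
import pandas as pd

# Hypothetical date strings in the same style as letter_date / survey_date
dates = pd.Series(['january 17, 2109', 'march 6, 2107', 'may 25, 2017', None])

# A year typed as 210x becomes 201x; \b keeps the match to whole 4-digit tokens
fixed = dates.str.replace(r'\b210(\d)\b', r'201\1', regex=True)
print(fixed.tolist())
```

The explicit replacements in the notebook are easier to audit; the pattern is mainly useful if new downloads introduce other years with the same typo.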


## Adding WAC official definitions

In [29]:
# Import the official WAC definitions
df_wac = pd.read_csv('../C_output_data/wac_codes_df_t338c97.csv')

# Join
df = df.join(df_wac.set_index('ttl_chp_sec'), on='wac_clean_short', how='left')

# Reorganize columns and eliminate obsolete ones
df = df[['pdf_name', 'page', 'image_based', 
         'letter_dt', 'survey_dt', 'survey_type', 'vendor_num', 'fed_num', 'aem_num', 'action',
         'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
         'find_code', 'find_desc', 
         'wac_clean_long', 'wac_clean_short', 'sub_chp_num', 'sub_chp_name', 'section', 'ttl_chp_sec_desc',
         'cmp_item_num', 'cmp_total_num']]

# Renaming columns
df.columns = ['pdf_name', 'page', 'image_based', 
              'letter_dt', 'survey_dt', 'survey_type', 'vendor_num', 'fed_num', 'aem_num', 'action',
              'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
              'finding_code', 'finding_desc', 
              'wac_long', 'wac_short', 'subchp_num', 'subchp_name', 'section', 'section_desc',
              'cmp_item', 'cmp_agg']
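
One way to check a left join like the one above is to confirm that the lookup table has one row per code and to list scraped codes that found no definition. The sketch below uses `DataFrame.merge` with `validate` and `indicator` on hypothetical stand-ins (`findings` and `wac`) for `df` and `df_wac`; the code `388-97-9999` is invented for the example.

```python
import pandas as pd

# Hypothetical stand-ins for df and df_wac
findings = pd.DataFrame({'wac_clean_short': ['388-97-1060', '388-97-9999', None]})
wac = pd.DataFrame({
    'ttl_chp_sec': ['388-97-1060', '388-97-1090'],
    'ttl_chp_sec_desc': ['Quality of care.', 'Direct care hours.'],
})

# validate raises if a code appears twice in the lookup table;
# indicator adds a _merge column telling where each row came from
merged = findings.merge(wac, left_on='wac_clean_short', right_on='ttl_chp_sec',
                        how='left', validate='many_to_one', indicator=True)

# Codes that were scraped but have no official definition
unmatched = merged[(merged['_merge'] == 'left_only') & merged['wac_clean_short'].notna()]
print(unmatched['wac_clean_short'].tolist())
```

An unmatched code usually points to a scraping error or to a WAC section missing from the definitions file.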

In [30]:
df.head()

Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
0,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",1,False,2018-04-12,2018-04-06,unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,False,False,True,False,False,G,isolated deficiencies that constitute actual h...,,,,,,,,
1,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",2,False,2018-04-12,2018-04-06,unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,False,False,False,True,False,,,388-97-1060(3)(g),388-97-1060,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Quality of Care,Quality of care.,1000.0,
2,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",3,False,2018-04-12,2018-04-06,unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,False,False,False,False,True,,,,,,,,,,
3,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",4,False,2018-04-12,2018-04-06,unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,False,False,False,False,False,,,,,,,,,,1000.0
4,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",5,False,2018-04-12,2018-04-06,unannounced complaint investigation,4115441,505355,WADC9Q,IMPOSITION OF CIVIL FINES,False,False,False,False,False,,,,,,,,,,


In [31]:
df.tail()

Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
4953,Willapa_Harbor_Health_and_Rehab_February_3_201...,4,False,2017-02-03,2017-01-31,complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,False,False,False,False,True,,,388-97-0640(2)(a)(b),388-97-0640,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Resident Rights,Prevention of abuse.,3000.0,
4954,Willapa_Harbor_Health_and_Rehab_February_3_201...,4,False,2017-02-03,2017-01-31,complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,False,False,False,False,True,,,388-97-1060(3)(b),388-97-1060,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Quality of Care,Quality of care.,1000.0,
4955,Willapa_Harbor_Health_and_Rehab_February_3_201...,4,False,2017-02-03,2017-01-31,complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,False,False,False,False,True,,,388-97-1620(1),388-97-1620,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Administration,General administration.,3000.0,
4956,Willapa_Harbor_Health_and_Rehab_February_3_201...,5,False,2017-02-03,2017-01-31,complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,False,False,False,False,False,,,,,,,,,,13000.0
4957,Willapa_Harbor_Health_and_Rehab_February_3_201...,6,False,2017-02-03,2017-01-31,complaint investigation and partial extended s...,4113577,505349,WA6PHJ,,False,False,False,False,False,,,,,,,,,,


# EXPORT RESULTS

In [32]:
df.to_csv('../C_output_data/scraped_data.csv', index=False)

# DATA INTEGRITY REVIEW

### Did all the reports make it through?

In [33]:
assert df['pdf_name'].nunique() == len(df_letters)
print('The consistency test above confirms that all the', len(df_letters), 'enforcement letters made it through.')

The consistency test above confirms that all the 1035 enforcement letters made it through.


### How many PDFs were image-based?

In [34]:
df_image = df[df['image_based']]

print('Out of the', len(df_letters), 'enforcement letter PDFs, apparently only', 
      df_image['pdf_name'].nunique(), 'of them contained image-based pages.',
      ' This does not necessarily mean that all pages in those PDFs are image-based, however. Let us confirm that:')

Out of the 1035 enforcement letter PDFs, apparently only 7 of them contained image-based pages.  This does not necessarily mean that all pages in those PDFs are image-based, however. Let us confirm that:


In [35]:
# List of all the pdf_names that have at least one image-based page
image_pdfs = df_image['pdf_name'].unique()
len(image_pdfs)

# Create a subset of df that contains only the pdf_names that have at least one image-based page
df_one_report = df[df['pdf_name'].isin(image_pdfs)]

# Confirm all the pages in those PDFs are image-based
assert df_one_report['image_based'].all()
del(image_pdfs, df_one_report)

print('Indeed, if a report is identified as image-based, then all of its pages are image-based.')

Indeed, if a report is identified as image-based, then all of its pages are image-based.
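
The same check can be expressed per PDF with a group-by, which also names any file that mixes text and image pages. A minimal sketch on hypothetical pages (the `pages` frame is made up):

```python
import pandas as pd

# Hypothetical pages: c.pdf mixes an image-based page with a text page
pages = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf', 'b.pdf', 'c.pdf', 'c.pdf'],
    'image_based': [True, True, False, False, True, False],
})

# For each PDF: does it have any image-based page, and are all pages image-based?
per_pdf = pages.groupby('pdf_name')['image_based'].agg(['any', 'all'])
mixed = per_pdf[per_pdf['any'] & ~per_pdf['all']]
print(per_pdf)
print('PDFs mixing text and image pages:', mixed.index.tolist())
```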


In [36]:
df_image

Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
927,"Cristwood Nursing and Rehab (GG, CMP, CF) 6 21...",1,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
928,"Cristwood Nursing and Rehab (GG, CMP, CF) 6 21...",2,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
929,"Cristwood Nursing and Rehab (GG, CMP, CF) 6 21...",3,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
930,"Cristwood Nursing and Rehab (GG, CMP, CF) 6 21...",4,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
931,"Cristwood Nursing and Rehab (GG, CMP, CF) 6 21...",5,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
962,"Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf",1,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
963,"Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf",2,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
964,"Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf",3,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
965,"Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf",4,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,
966,"Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf",5,True,NaT,NaT,,,,,,,,,,,,,,,,,,,,


We saved copies of the image-based files to inspect them visually.

### Close look at those stratospheric fines

There are a few unusual fines. Although they are rare (each shows up only once), they are stratospheric.

They are so big that they are worth a close look. One useful test is whether, for this particular group, the individual WAC fines add up to the total fines.

In [37]:
df['cmp_item'].value_counts(dropna=False)

NaN          3897
1000.00       525
3000.00       191
500.00        119
2000.00       103
1500.00        86
2500.00         3
98778.43        1
536.40          1
11152.70        1
4046.56         1
3350.18         1
33412.96        1
96909.62        1
4415.58         1
8849.46         1
52720.56        1
250.00          1
8858.00         1
9094.35         1
15800.40        1
8872.63         1
29830.14        1
190835.73       1
2173.25         1
35904.06        1
5591.52         1
7662.23         1
44745.68        1
37371.36        1
14080.91        1
17990.78        1
52163.05        1
69282.25        1
13823.65        1
2544.65         1
37851.09        1
115519.72       1
9166.88         1
8124.16         1
9259.58         1
Name: cmp_item, dtype: int64

In [38]:
# List all the cmp_item values and the number of times they appear through the reports
mega_cmp_list = df['cmp_item'].value_counts(dropna=False).reset_index()
mega_cmp_list.columns = ['cmp_item', 'freq']
# Filter for only the unusual cmp_item amounts
mega_cmp_list = mega_cmp_list[mega_cmp_list['freq'] == 1].reset_index(drop=True)
# Reduce it to a series that contains those mega cmp
mega_cmp_list = mega_cmp_list['cmp_item']
mega_cmp_list

0      98778.43
1        536.40
2      11152.70
3       4046.56
4       3350.18
5      33412.96
6      96909.62
7       4415.58
8       8849.46
9      52720.56
10       250.00
11      8858.00
12      9094.35
13     15800.40
14      8872.63
15     29830.14
16    190835.73
17      2173.25
18     35904.06
19      5591.52
20      7662.23
21     44745.68
22     37371.36
23     14080.91
24     17990.78
25     52163.05
26     69282.25
27     13823.65
28      2544.65
29     37851.09
30    115519.72
31      9166.88
32      8124.16
33      9259.58
Name: cmp_item, dtype: float64

In [39]:
# Create a subset of df with only the pdf_names that contain those unusual cmp
df_mega_cmp = df[df['cmp_item'].isin(mega_cmp_list)]
df_mega_cmp = df[df['pdf_name'].isin(df_mega_cmp['pdf_name'].unique())]
df_mega_cmp

Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
571,"Brookfield Cascadia (Hx, GG, Amend DCH, Prior ...",1,False,2018-04-23,2018-04-12,complaint investigation,4115521,505331,WAS3K4,IMPOSITION OF CIVIL FINES,False,True,False,False,False,G,isolated deficiencies that constitute actual h...,,,,,,,,
572,"Brookfield Cascadia (Hx, GG, Amend DCH, Prior ...",2,False,2018-04-23,2018-04-12,complaint investigation,4115521,505331,WAS3K4,IMPOSITION OF CIVIL FINES,False,False,False,False,False,G,isolated deficiencies that constitute actual h...,388-97-1090(1),388-97-1090,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Nursing Services,Direct care hours.,8872.63,
573,"Brookfield Cascadia (Hx, GG, Amend DCH, Prior ...",3,False,2018-04-23,2018-04-12,complaint investigation,4115521,505331,WAS3K4,IMPOSITION OF CIVIL FINES,False,False,True,True,True,,,388-97-1060(3)(g),388-97-1060,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Quality of Care,Quality of care.,1000.00,
574,"Brookfield Cascadia (Hx, GG, Amend DCH, Prior ...",4,False,2018-04-23,2018-04-12,complaint investigation,4115521,505331,WAS3K4,IMPOSITION OF CIVIL FINES,False,False,False,False,False,,,,,,,,,,
575,"Brookfield Cascadia (Hx, GG, Amend DCH, Prior ...",5,False,2018-04-23,2018-04-12,complaint investigation,4115521,505331,WAS3K4,IMPOSITION OF CIVIL FINES,False,False,False,False,False,,,,,,,,,,9872.63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4487,Crestwood Health and Rehab (DCH) 7 10 20.pdf,4,False,2020-07-16,2020-07-10,unannounced complaint investigation,4115621,505185,,,False,False,False,False,False,,,,,,,,,,
4507,Montesano Health and Rehab (DCH) 7 8 20.pdf,1,False,2020-07-16,2020-07-08,unannounced complaint investigation,4115301,505503,,,False,False,False,False,False,,,,,,,,,,
4508,Montesano Health and Rehab (DCH) 7 8 20.pdf,2,False,2020-07-16,2020-07-08,unannounced complaint investigation,4115301,505503,,,False,False,False,True,True,,,388-97-1090(1)(8)(a)(b)(d),388-97-1090,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Nursing Services,Direct care hours.,9259.58,
4509,Montesano Health and Rehab (DCH) 7 8 20.pdf,3,False,2020-07-16,2020-07-08,unannounced complaint investigation,4115301,505503,,,False,False,False,False,False,,,,,,,,,,9259.58


In [40]:
# Consistency test: are all the unusual cmp_item values in this new data frame?
assert mega_cmp_list.isin(df_mega_cmp['cmp_item']).all()

# Are the sums of individual fines and total fines consistent for this group of mega fines?
print('There is a slight difference of', df_mega_cmp['cmp_item'].sum() - df_mega_cmp['cmp_agg'].sum(),
      'out of a total', df_mega_cmp['cmp_agg'].sum(), 'or', 
     round((df_mega_cmp['cmp_item'].sum() - df_mega_cmp['cmp_agg'].sum()) / df_mega_cmp['cmp_agg'].sum()*100,1),
     '%')

There is a slight difference of 5500.0 out of a total 1064968.52 or 0.5 %
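
To locate where a difference like this comes from, the comparison can be made letter by letter instead of over the whole group. The sketch below uses hypothetical fines. Taking the `max` of `cmp_agg` per letter is an assumption meant to tolerate a total that is repeated on several rows.

```python
import pandas as pd

# Hypothetical letters: a.pdf reconciles, b.pdf does not
fines = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'a.pdf', 'b.pdf', 'b.pdf'],
    'cmp_item': [1000.0, 1500.0, None, 3000.0, None],
    'cmp_agg':  [None, None, 2500.0, None, 2000.0],
})

per_letter = fines.groupby('pdf_name').agg(
    item_sum=('cmp_item', 'sum'),   # sum of individual WAC fines
    agg_total=('cmp_agg', 'max'),   # stated total (assumed one per letter)
)
per_letter['diff'] = per_letter['item_sum'] - per_letter['agg_total']

# Letters where the itemized fines do not add up to the stated total
mismatched = per_letter[per_letter['diff'].abs() > 0.005]
print(mismatched)
```

Letters with no extracted total get a NaN difference and drop out of `mismatched`, so they need a separate check.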


Note: A separate visual review of these mega fines shows that most of them correspond to one particular code violation: [WAC 388-97-1090(1)](https://apps.leg.wa.gov/wac/default.aspx?cite=388-97-1090), a Quality of Care related code that regulates the minimum hours of direct care per resident day (HDR) that a nursing home must provide.

### From how many reports were we able to extract a total fine amount?

In [41]:
temp = df[~df['cmp_agg'].isna()]

print(len(temp))
print(temp['pdf_name'].nunique())

635
634


There seems to be one report that has 2 total amounts. Let's check.

In [42]:
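# Count rows with a total per report and pick the first report with more than one
# (indexing the counts directly avoids depending on reset_index column names)
temp = temp['pdf_name'].value_counts()
x = temp[temp > 1].index[0]
x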

'Cheney Care Center Amended by hearing 2 (G, CMP, CF) 9 8 17.pdf'

In [43]:
df[df['pdf_name'] == x]

Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
699,"Cheney Care Center Amended by hearing 2 (G, CM...",1,False,2018-08-22,2017-09-08,complaint investigation,4173209,505346,WANOSW,IMPOSITION OF CIVIL FINES,False,False,True,False,False,,,,,,,,,,
700,"Cheney Care Center Amended by hearing 2 (G, CM...",2,False,2018-08-22,2017-09-08,complaint investigation,4173209,505346,WANOSW,IMPOSITION OF CIVIL FINES,False,False,False,True,False,,,388-97-1620(2)(b),388-97-1620,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Administration,General administration.,1000.0,2500.0
701,"Cheney Care Center Amended by hearing 2 (G, CM...",2,False,2018-08-22,2017-09-08,complaint investigation,4173209,505346,WANOSW,IMPOSITION OF CIVIL FINES,False,False,False,True,False,,,388-97-1060(1),388-97-1060,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Quality of Care,Quality of care.,1500.0,2500.0
702,"Cheney Care Center Amended by hearing 2 (G, CM...",3,False,2018-08-22,2017-09-08,complaint investigation,4173209,505346,WANOSW,IMPOSITION OF CIVIL FINES,False,False,False,False,False,,,,,,,,,,
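
In the table above the same 2500.0 total is attached to both rows of page 2, so a plain sum of `cmp_agg` would count it twice. One possible guard, sketched on hypothetical rows, is to keep each distinct total once per letter before summing. This assumes a letter never states two separate totals of the same amount.

```python
import pandas as pd

# Hypothetical rows: x.pdf repeats its total on two rows
rows = pd.DataFrame({
    'pdf_name': ['x.pdf', 'x.pdf', 'x.pdf', 'y.pdf'],
    'cmp_item': [1000.0, 1500.0, None, None],
    'cmp_agg':  [2500.0, 2500.0, None, 1000.0],
})

naive_total = rows['cmp_agg'].sum()

# Keep one row per (letter, total) pair before adding up
once_per_letter = (rows.dropna(subset=['cmp_agg'])
                       .drop_duplicates(subset=['pdf_name', 'cmp_agg'])['cmp_agg']
                       .sum())
print(naive_total, once_per_letter)
```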


### Findings

In [44]:
print(df['finding_code'].nunique())
df['finding_code'].value_counts(dropna=False)

9


NaN    3903
G       513
D       241
E        90
J        69
F        53
K        47
L        23
H        17
I         2
Name: finding_code, dtype: int64

In [45]:
print(df['finding_desc'].nunique())
df['finding_desc'].value_counts(dropna=False)

50


NaN                                                                                 3903
isolated deficiencies that constitute actual harm that is not immediate jeopardy     477
isolated deficiencies that constitute no actual harm with potential for more than minimal harm that is not immediate jeopardy

There are far more finding descriptions than finding codes. That is, a single finding code typically has multiple descriptions.

Unless the official finding descriptions have changed over time, each code should have only one description. The problem probably arises from typos, or from the algorithm picking up noise text when scraping the descriptions. We should base our analysis on the finding codes instead of the descriptions, and confirm that their definitions have not changed over time.
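
A quick way to quantify this is to count distinct descriptions per code and to pick the most frequent wording as a candidate canonical description. The sketch below runs on hypothetical rows; the description with trailing noise ("page 2") is invented to mimic a scraping artifact.

```python
import pandas as pd

g_desc = 'isolated deficiencies that constitute actual harm that is not immediate jeopardy'
d_desc = ('isolated deficiencies that constitute no actual harm with potential for '
          'more than minimal harm that is not immediate jeopardy')

# Hypothetical findings: code G has one clean and one noisy description
findings = pd.DataFrame({
    'finding_code': ['G', 'G', 'G', 'D', None],
    'finding_desc': [g_desc, g_desc, g_desc + ' page 2', d_desc, None],
})

grouped = findings.groupby('finding_code')['finding_desc']

# How many distinct descriptions does each code have?
desc_per_code = grouped.nunique()

# Most frequent description per code, as a candidate canonical wording
canonical = grouped.agg(lambda s: s.value_counts().idxmax())
print(desc_per_code)
print(canonical)
```

Codes with more than one description are the ones to review by hand against the official definitions.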

### Date consistency

Logically, all survey dates should precede their corresponding enforcement letter dates. Let's see if we can find instances where that is not the case.

In [46]:
print(df[(df['letter_dt'] <= df['survey_dt'])]['pdf_name'].unique())
df[(df['letter_dt'] <= df['survey_dt'])]#[['pdf_name', 'letter_dt','survey_dt']].drop_duplicates()

['Alderwood Park (G, CMP, CF) 1 3 20.pdf'
 'Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF) 9 17.pdf'
 'University Place Care Center (Hx G prior K, Cond) 10 21 19.pdf']


Unnamed: 0,pdf_name,page,image_based,letter_dt,survey_dt,survey_type,vendor_num,fed_num,aem_num,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights,finding_code,finding_desc,wac_long,wac_short,subchp_num,subchp_name,section,section_desc,cmp_item,cmp_agg
126,"Alderwood Park (G, CMP, CF) 1 3 20.pdf",1,False,2020-01-14,2020-01-20,unannounced complaint investigation,4114602,505092,WAOYRU,IMPOSITION OF A CIVIL FINE,False,False,True,False,False,G,isolated deficiencies that constitute actual h...,,,,,,,,
127,"Alderwood Park (G, CMP, CF) 1 3 20.pdf",2,False,2020-01-14,2020-01-20,unannounced complaint investigation,4114602,505092,WAOYRU,IMPOSITION OF A CIVIL FINE,False,False,False,True,False,,,388-97-1060(3)(b),388-97-1060,SUBCHAPTER I,"RESIDENT RIGHTS, CARE AND RELATED SERVICES",Quality of Care,Quality of care.,1000.0,
128,"Alderwood Park (G, CMP, CF) 1 3 20.pdf",3,False,2020-01-14,2020-01-20,unannounced complaint investigation,4114602,505092,WAOYRU,IMPOSITION OF A CIVIL FINE,False,False,False,False,True,,,,,,,,,,
129,"Alderwood Park (G, CMP, CF) 1 3 20.pdf",4,False,2020-01-14,2020-01-20,unannounced complaint investigation,4114602,505092,WAOYRU,IMPOSITION OF A CIVIL FINE,False,False,False,False,False,,,,,,,,,,1000.0
130,"Alderwood Park (G, CMP, CF) 1 3 20.pdf",5,False,2020-01-14,2020-01-20,unannounced complaint investigation,4114602,505092,WAOYRU,IMPOSITION OF A CIVIL FINE,False,False,False,False,False,,,,,,,,,,
3520,"Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF...",1,False,2017-08-29,2017-08-30,unannounced abbreviated survey and partial ext...,4113114,505202,WAVEO1,,False,True,False,False,False,D,isolated deficiencies that constitute no actua...,,,,,,,,
3521,"Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF...",2,False,2017-08-29,2017-08-30,unannounced abbreviated survey and partial ext...,4113114,505202,WAVEO1,,False,False,True,False,False,,,,,,,,,,
3522,"Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF...",3,False,2017-08-29,2017-08-30,unannounced abbreviated survey and partial ext...,4113114,505202,WAVEO1,,False,False,False,True,True,,,,,,,,,,
3523,"Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF...",4,False,2017-08-29,2017-08-30,unannounced abbreviated survey and partial ext...,4113114,505202,WAVEO1,,False,False,False,False,False,,,,,,,,,,
3524,"Talbot Center (Hx, D prior IJ, R, SUB, CMP, CF...",5,False,2017-08-29,2017-08-30,unannounced abbreviated survey and partial ext...,4113114,505202,WAVEO1,,False,False,False,False,False,,,,,,,,,,
