# What this script does

Now that we have identified the set of unique enforcement letters that will be our target, we dive into a scraping festival using [Jeremy Singer-Vine](https://github.com/jsvine)'s amazing Python package, [pdfplumber](https://github.com/jsvine/pdfplumber).

# SETTINGS

In [1]:
import pdfplumber
import pandas as pd
from os import listdir
import numpy as np
import re

pd.set_option('display.max_columns', None)

# LIST OF ENFORCEMENT LETTERS

We have periodically run bulk downloads of enforcement letters from the DSHS's [Nursing Home Facilities Locator](https://fortress.wa.gov/dshs/adsaapps/lookup/NHPubLookup.aspx), because the letters posted there change over time. The DSHS explained to us why that is the case:

- On the first of each month, an automated script runs that purges anything more than three years old and posts new enforcement letters.
- In addition, when a facility closes down, all its documents are automatically purged.
- Sometimes older enforcement letters (occasionally from previous years) are missing from the locator website but are posted later. Those older letters may be from nursing homes that were going through a change of ownership. In that scenario, there is a period when the locator has no documents because the facility’s state license is officially “closed” (so all documents are automatically purged). Nursing homes (NH) are the one facility type where, even when they change ownership, they still “own” their previous owner's enforcement record, so the letters get reposted under the new licensee. There may be a delay before the reposting, since it must be done manually.

As a result, each bulk download will have letters in common with other bulk downloads, as well as letters that don't show up in any other download. Here we create a list of unique letters across all downloads:

In [2]:
path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/'
folders = ['NH_enforcement_letters_2020-03/',
           'NH_enforcement_letters_2020-06-09/',
           'NH_enforcement_letters_2020-06-24/',
           'NH_enforcement_letters_2020-07-16/',
           'NH_2020_Jan-Feb/individual_letters/']

list_0 = listdir(path + folders[0])
list_1 = listdir(path + folders[1])
list_2 = listdir(path + folders[2])
list_3 = listdir(path + folders[3])
list_4 = listdir(path + folders[4])

df_0 = pd.DataFrame(list_0, columns=['pdf_name'])
df_0['folder'] = folders[0]

df_1 = pd.DataFrame(list_1, columns=['pdf_name'])
df_1['folder'] = folders[1]

df_2 = pd.DataFrame(list_2, columns=['pdf_name'])
df_2['folder'] = folders[2]

df_3 = pd.DataFrame(list_3, columns=['pdf_name'])
df_3['folder'] = folders[3]

df_4 = pd.DataFrame(list_4, columns=['pdf_name'])
df_4['folder'] = folders[4]

In [3]:
df_letters = pd.concat([df_0, df_1, df_2, df_3, df_4])
df_letters = df_letters.drop_duplicates(subset=['pdf_name'], keep='first')
df_letters = df_letters.reset_index(drop=True)

df_letters

Unnamed: 0,pdf_name,folder
0,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",NH_enforcement_letters_2020-03/
1,"Alaska Gardens (FP, Hx G, prior E, CMP, CF) 12...",NH_enforcement_letters_2020-03/
2,Alaska Gardens (Hx D prior E) 7 16.pdf,NH_enforcement_letters_2020-03/
3,Alaska Gardens (Hx F prior D) 8 9 18.pdf,NH_enforcement_letters_2020-03/
4,"Alaska Gardens (Hx G, prior E) 11 16 18.pdf",NH_enforcement_letters_2020-03/
...,...,...
994,Warm_Beach_Care_Center_February_21_2017_letter...,NH_2020_Jan-Feb/individual_letters/
995,Washington_Veterans_Home_-_Retsil_February_10_...,NH_2020_Jan-Feb/individual_letters/
996,Willapa_Harbor_Health_and_Rehab_February_13_20...,NH_2020_Jan-Feb/individual_letters/
997,Willapa_Harbor_Health_and_Rehab_February_14_20...,NH_2020_Jan-Feb/individual_letters/
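
The effect of `keep='first'` can be seen on a toy example. The sketch below uses made-up folder and file names (not the real downloads) and shows that a letter present in two downloads is kept once, credited to the first folder in which it appears.

```python
import pandas as pd

# Hypothetical file names standing in for two bulk downloads.
downloads = {
    'download_A/': ['letter_1.pdf', 'letter_2.pdf'],
    'download_B/': ['letter_2.pdf', 'letter_3.pdf'],
}

# One data frame per folder, tagged with the folder it came from.
frames = [pd.DataFrame({'pdf_name': names, 'folder': folder})
          for folder, names in downloads.items()]

# Stack the folders and keep only the first occurrence of each file name.
toy_letters = (pd.concat(frames)
               .drop_duplicates(subset=['pdf_name'], keep='first')
               .reset_index(drop=True))
print(toy_letters)
```

Because the folders are concatenated in order, the order of the `folders` list decides which copy of a repeated letter is the one that gets scraped.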


# SCRAPING FEST

In [None]:
# # List of reports that posed troubles

# obstacles = [
#     'Regency North Bend Direct care hrs WAC with CF.pdf',
#     'Univeristy Place (FP, CMP, CF, Cond) 2 13 20.pdf',
#     'Prestige Post Acute - Edmonds (IJ R, SP, SUB, CMP, CF) 4 5 18.pdf',
#     'Forest Ridge (DCH) 4 24 18.pdf',
#     'The Oaks at Timberline (Hx G, prior D, CMP, CF) 12 5 19.pdf',
#     'Emerald Hills (GG, CMP, CF, Sub) 3 17.pdf', # Over $30K in cmp
#     'Brookfield Cascadia (3 OOC, Sub, Hx, GG, Amend DCH, Prior G, CMP, CF) 4 25 18.pdf', # No date
#     'Cristwood Nursing and Rehab (GG, CMP, CF) 6 21 18.pdf',
#     'The Gardens on University ( HX GG, CMP, CF) 9 17.pdf',
#     'Puyallup Nursing and Rehab (Hx D Prior F) 2 16 18.pdf', # Please remit a check for $XXXX.XX
#     'Crestwood Health and Rehab Amended (DCH, Wac, CF) 10 6 17.pdf', # Space inside the fine amount: $35, 904.06 (Fixed)
#     'Cheney Care Center Amended by hearing 2 (G, CMP, CF) 9 8 17.pdf', # Shows 2 cmp_total values. (Same value, repeated)
#     'Alderwood Park Health and Rehab (GG, CMP, CF) 6 14 18.pdf', # An example that has ePOC and federal enforcement

#     # Below this point are reports where we could scrape an individual fine, but not its corresponding WAC code.
#     'Brookfield Cascadia (Hx, GG, Amend DCH, Prior G, CMP, CF) 4 11 18.pdf', 
#     'Cashmere Care Center (D, DCH, CF) 7 31 19.pdf', 
#     'Crestwood (4OOC, DCH, prior G, prior D prior F, CMP, CF) 7 9 19.pdf', 
#     'Crestwood (5OOC DCH E prior E prior D prior D prior E) 11 14 19.pdf', 
#     'Crestwood (6OOC G, prior DCH E prior E prior D prior D prior E) 11 18 19.pdf', 
#     'Crestwood (Wac Only, CF, DCH) 8 17.pdf', 'Crestwood HRD quarterly fining 5 17.pdf', 
#     'Crestwood HRD quarterly fining Amended 5 17.pdf', 'Crestwood Health and Rehab  (DCH, Wac, CF) 10 6 17.pdf', 
#     'Crestwood Health and Rehab Amended (DCH, Wac, CF) 10 6 17.pdf', 
#     'Enumclaw Health & Rehab HRD Amended quarterly fining 5 17 .pdf', 
#     'Enumclaw Health & Rehab HRD quarterly fining 5 17 .pdf', 
#     'Forest Ridge (DCH) 11 4 19.pdf', 'Forest Ridge (DCH) 4 24 18.pdf', 
#     'Forest Ridge (F, DCH) 1 3 20.pdf', 
#     'Life Care of Port Townsend (DCH) 8 9 19.pdf', 
#     'McKay Healthcare and Rehab (DCH) 7 29 19.pdf', 
#     'Prestige Care - Burlington (G, DCH, CF, CMP) 10 1 19.pdf', 
#     'Regency North Bend (Direct Care hours, CF) 1 8 18.pdf', 
#     'Regency North Bend Direct care hrs WAC with CF.pdf', 
#     'Regency North Bend HRD quarterly fining 5 17.pdf', 
#     'Sunrise View (DCH, CF) 3 28 18.pdf', 
#     'Woodland DCH (CF) 9 6 19.pdf'
# ]
# obstacles = pd.Series(obstacles)

### The loooooooooop

In [None]:
# Create the dataframe where all the data will be deposited.
df_all_reports = pd.DataFrame(columns = ['pdf_name', 'letter_date', 
                                         'survey_date', 'survey_type',
                                         'vendor_num', 'fed_num', 'aem_num', 'action', 
                                         'page', 'image_based',
                                         'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
                                         'find_code', 'find_desc', 
                                         'wac', 'cmp_item', 'cmp_total'])

print('idx|pgs|rows|pdf_name')

# For each enforcement letter:
for index, row in df_letters.iterrows(): 

    pdf = pdfplumber.open(path + row['folder'] + row['pdf_name'])
    print(index, '|', len(pdf.pages), '|', len(df_all_reports), '|', row['pdf_name'])
    
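    # As we scrape each page, we will save the data we extract from them in these dataframes:
    df_cmps = pd.DataFrame(columns = ['page', 'wac', 'cmp_item', 'cmp_total'])
    df_sections = pd.DataFrame(columns = ['page','fed_enforcement', 'nc_hist', 'epoc', 'state_rem', 'appeal_rights'])
    df_finds = pd.DataFrame(columns = ['page','find_code', 'find_desc'])

    # Reset the report-wide fields so that values from a previous letter cannot leak into this one
    letter_date = survey_date = survey_type = np.nan
    vendor_num = fed_num = aem_num = action = np.nan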

    
    # For each page in the report:
    for pg in pdf.pages:
        
        # Extract all the text in the page into a single variable
        pg_txt = pg.extract_text()
        
        if not pg_txt:
            new_record = {'pdf_name':row['pdf_name'],   #
                          'letter_date':np.nan,
                          'vendor_num':np.nan, 
                          'fed_num':np.nan, 
                          'aem_num':np.nan, 
                          'action':np.nan, 
                          'survey_date':np.nan, 
                          'survey_type':np.nan,
                          'page':pg.page_number, # 
                          'image_based':True,    #
                          'fed_enforcement':np.nan,
                          'nc_hist':np.nan,
                          'epoc':np.nan,
                          'state_rem':np.nan,
                          'appeal_rights':np.nan,
                          'find_code':np.nan,
                          'find_desc':np.nan,
                          'wac':np.nan, 
                          'cmp_item':np.nan,
                          'cmp_total':np.nan}
            df_all_reports = pd.concat([df_all_reports, pd.DataFrame([new_record])], ignore_index=True)

        else:

            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # REPORT METADATA: 'letter_date, 'vendor_num', 'fed_num', 'aem_num', 'action'
            # (These 5 data pieces are always on page 1)
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            
            if pg.page_number == 1:

                # ~~~~~ Data found on top area: 'letter_date, 'vendor_num', 'fed_num', 'aem_num'

                top_area = pg.crop((pg.bbox[0], pg.bbox[1],
                                     pg.bbox[2], pg.bbox[3]/3))
                top_lines = top_area.extract_text().split('\n')

                # 'letter_date'
                try:
                    pattern = '^[A-Za-z]+\s?\d{1,2},\s?\d{4}'
                    letter_date = [line for line in top_lines if re.match(pattern, line.strip())]
                    letter_date = letter_date[0]
                    del(pattern)
                except:
                    letter_date = np.nan

                # 'vendor_num', 'fed_num', 'aem_num',
                id_lines = [line for line in top_lines if re.match('^Vendor|AEM', line.strip())]
                # Sometimes these data points are absent. For those cases:
                if not id_lines: 
                    vendor_num = np.nan
                    fed_num = np.nan
                    aem_num = np.nan
                # If we do find the data points:
                else: 
                    # Vendor number
                    try:
                        vendor_num = id_lines[0].split('/')[0].strip()
                        vendor_num = vendor_num.split(':')[-1].strip()
                    except:
                        vendor_num = np.nan
                    # Fed number
                    try:
                        fed_num = id_lines[0].split('/')[1].strip()
                        fed_num = fed_num.split(':')[-1].strip()
                    except:
                        fed_num = np.nan
                    # AEM number
                    try:
                        aem_num = id_lines[1].split('#')[-1].strip()
                    except:
                        aem_num = np.nan


                # ~~~~~ Data found on middle area: 'action'

                middle_area = pg.crop((pg.bbox[0], pg.bbox[3]/3,
                                       pg.bbox[2], pg.bbox[3]*2/3))
                middle_lines = middle_area.extract_text().split('\n')
                middle_lines = [line.strip() for line in middle_lines]
                middle_lines = [line for line in middle_lines if re.match('^[A-Z,\s]+$', line)]
                if middle_lines:
                    action = ' '.join(middle_lines)
                else:
                    action = np.nan

                    
                    
                # ~~~~~ 'survey_date' and 'survey_type'
                
                try:
                    survey_txt = re.search('(On.*conducted an? .* at your facility)', pg_txt.replace('\n','')).group(1)
                    # There may have been more than one instance of the key phrase 'at your facility'.
                    # Get just the first instance
                    survey_txt = survey_txt.split('at your facility')[0]

                    rgx = '((january|february|march|april|may|june|july|august|september|october|november|december)\s\d{1,2},?\s\d{4})'
                    survey_date = re.search(rgx, survey_txt.lower()).group(1)

                    rgx = 'conducted an? (.*)'
                    survey_type = re.search(rgx, survey_txt.lower()).group(1)

                except:
                    survey_txt = np.nan
                    survey_date = np.nan
                    survey_type = np.nan

                    
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            # DETAILED DATA: Findings, federal enforcement, ePOC and fines
            # (These data points can show up in any page, not just in page 1)
            # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

            
            # ~~~~~ Findings: 'find_code', 'find_desc'
            
            # Split the page's text using an opening parenthesis as the mark
            finds_list = pg_txt.split('(')
            # Reduce the created list to only the elements that start with this pattern: CAPITAL);
            finds_list = [f for f in finds_list  if re.match('^[A-Z]\);', f)]
            if not finds_list:
                find_code = np.nan
                find_desc = np.nan
            else:
                # Select only the first find (If there are >1, probably from non-compliance history)
                find = finds_list[0]
                # Get rid of all text after the first period.
                find = find.split('.')
                find = find[0]
                # Eliminate noise characters
                find = find.replace(')', '').replace('\n', '') 
                # Split using ';'
                find = find.split(';')
                # The first element of the new list is the finding code
                find_code = find[0].strip()
                # The second element of the new list is the finding description
                find_desc = find[1].strip()
                    
            find_record = {'page':pg.page_number,
                           'find_code':find_code,
                           'find_desc':find_desc}
            df_finds = pd.concat([df_finds, pd.DataFrame([find_record])], ignore_index=True)


            
            # ~~~~~ Sections
            # - Federal enforcement
            # - Electronic Plan of Correction (ePOC)
            # - Non-compliance history
            # - Appeal Rights

            # For sections + fines, we will split the page text into lines using '\n':
            pg_lines = pg_txt.split('\n')
            pg_lines = [line.strip() for line in pg_lines]
         
            # For each section, determine if there is a line that indicates its presence in the report
            fed_lines = [line for line in pg_lines if re.search('Federal Enforcement', line)]
            epoc_lines = [line for line in pg_lines if re.search('Electronic Plan of Correction \(ePOC\)', line)]
            history_lines = [line for line in pg_lines if re.search('History of Non-Compliance', line)]
            remedy_lines = [line for line in pg_lines if re.search('State Remedies', line)]
            appeal_lines = [line for line in pg_lines if re.search('Appeal Rights', line)]
            
            # Create a record for federal enforcement, nc history & epoc:
            sections_record = {'page':pg.page_number,
                               'fed_enforcement':len(fed_lines)>0,
                               'nc_hist':len(history_lines)>0,
                               'epoc':len(epoc_lines)>0,
                               'state_rem':len(remedy_lines)>0,
                               'appeal_rights':len(appeal_lines)>0
                              }
            df_sections = pd.concat([df_sections, pd.DataFrame([sections_record])], ignore_index=True)

            
            # ~~~~~ Fines: 'wac', 'cmp_item', 'cmp_total'
            
            # Look for any lines that contain a $ sign
            dollar_lines = [line for line in pg_lines if re.search('\$', line)]                        

            # We now want to divide 'dollar_lines' list into two lists:
            # 1- A list with only one line: the line that contains the aggregated amount of all fines
            # 2- A list with all the lines that contain individual WAC fines
            
            # 1- Line that contains the aggregate fined amount
            cmp_total_line = [line for line in dollar_lines if re.search('check', line)]
            if not cmp_total_line:
                cmp_total = np.nan
            else:
                # This list should only have one element, since there should only be
                # one line with the total amount fined. Let's test for that:
                assert len(cmp_total_line) == 1
                try:
                    pattern = '\$((\d+(,\s?|\.)?)+\d+)'
                    cmp_total = re.search(pattern, cmp_total_line[0]).group(1)
                    del(pattern)
                except:
                    cmp_total = np.nan

            # 2- Lines that contain individual WAC fines
            # Subset of those dollar_lines that START with a wac code
            cmp_item_lines = [line for line in dollar_lines if re.search('^(WAC\s?)?\d+-\d+-\s?\d+', line)]

            # If there are no lines with $ signs and WAC codes
            if not cmp_item_lines:
                wac = np.nan
                cmp_item = np.nan
                
                cmp_record = {'page':pg.page_number,
                                 'wac':wac,
                                 'cmp_item':cmp_item,
                                 'cmp_total':cmp_total}
                df_cmps = pd.concat([df_cmps, pd.DataFrame([cmp_record])], ignore_index=True)
            # And if there are
            else:
                for line in cmp_item_lines:
                    # WAC code (Notice that here we don't require it to be at the beginning)
                    pattern = '((WAC\s?)?\d+-\d+-\s?\d+\s?(\(([0-9]|[a-z])\)\s?)*)'
                    wac = re.search(pattern, line).group(1)
                    del(pattern)
                    # CMP
                    pattern = '\$((\d+(,|\.)?)+\d+)'
                    cmp_item = re.search(pattern, line).group(1)
                    del(pattern)

                    cmp_record = {'page':pg.page_number,
                                     'wac':wac,
                                     'cmp_item':cmp_item,
                                     'cmp_total':cmp_total}
                    df_cmps = pd.concat([df_cmps, pd.DataFrame([cmp_record])], ignore_index=True)

                    
    # Consolidate df_finds, df_cmps & df_sections into a single data frame: df_one_report
    df_one_report = df_finds.join(df_cmps.set_index('page'), on='page', how='outer').reset_index(drop=True)
    df_one_report = df_one_report.join(df_sections.set_index('page'), on='page', how='outer').reset_index(drop=True)

    # Add the report-wide data to df_one_report
    df_one_report['pdf_name'] = row['pdf_name']
    df_one_report['letter_date'] = letter_date
    df_one_report['survey_date'] = survey_date
    df_one_report['survey_type'] = survey_type
    df_one_report['vendor_num'] = vendor_num
    df_one_report['fed_num'] = fed_num
    df_one_report['aem_num'] = aem_num
    df_one_report['action'] = action
    df_one_report['image_based'] = False
    
    # Rearrange columns according to the column order of 'df_all_reports' data frame
    df_one_report = df_one_report[df_all_reports.columns]

    # Attach 'df_one_report' to 'df_all_reports'
    df_all_reports = pd.concat([df_all_reports, df_one_report], ignore_index=True)
    del(df_one_report)
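
To test the page-1 logic without opening a PDF, the header parsing can be pulled out into a small function. This is only a sketch: `parse_header` is a hypothetical helper that is not part of the notebook, and the sample lines are invented to match what the splitting logic in the loop expects (a `Vendor` line with two `/`-separated fields, followed by an `AEM` line with a `#`).

```python
import re

# Same date pattern as in the loop: month name, day, comma, four-digit year.
DATE_PATTERN = r'^[A-Za-z]+\s?\d{1,2},\s?\d{4}'

def parse_header(top_lines):
    """Return (letter_date, vendor_num, fed_num, aem_num); missing pieces are None."""
    lines = [line.strip() for line in top_lines]

    dates = [line for line in lines if re.match(DATE_PATTERN, line)]
    letter_date = dates[0] if dates else None

    vendor_num = fed_num = aem_num = None
    id_lines = [line for line in lines if re.match(r'^(Vendor|AEM)', line)]
    if id_lines:
        parts = id_lines[0].split('/')
        vendor_num = parts[0].split(':')[-1].strip()
        if len(parts) > 1:
            fed_num = parts[1].split(':')[-1].strip()
        if len(id_lines) > 1:
            aem_num = id_lines[1].split('#')[-1].strip()
    return letter_date, vendor_num, fed_num, aem_num

# Invented header lines, for illustration only.
sample = ['STATE OF WASHINGTON',
          'March 5, 2019',
          'Vendor: 1234567 / Fed: 505000',
          'AEM #NH-0001']
header = parse_header(sample)
print(header)  # ('March 5, 2019', '1234567', '505000', 'NH-0001')
```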

In [None]:
print(df_all_reports.shape)
print(df_all_reports.nunique())
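
The findings step of the loop (split on `(`, keep the chunks that start with a capital letter followed by `);`, cut at the first period) can be tried on a single string. The sketch below follows that logic with one deliberate difference, which is an assumption on our part: line breaks are replaced by a space instead of being removed, so that words on consecutive lines are not glued together. `parse_finding` and the sample text are hypothetical.

```python
import re

def parse_finding(page_text):
    """Return (find_code, find_desc) for the first '(X); description.' chunk, else (None, None)."""
    chunks = [c for c in page_text.split('(') if re.match(r'^[A-Z]\);', c)]
    if not chunks:
        return None, None
    first = chunks[0].split('.')[0]                  # drop everything after the first period
    first = ' '.join(first.replace(')', '').split()) # remove ')' and normalize whitespace
    code, _, desc = first.partition(';')
    return code.strip(), desc.strip()

# Invented page text, for illustration only.
sample_page = ('The deficiencies were cited at the level of (G); actual harm that is\n'
               'not immediate jeopardy. Please respond within ten days.')
finding = parse_finding(sample_page)
print(finding)  # ('G', 'actual harm that is not immediate jeopardy')
```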

In [None]:
df_all_reports
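
The two regular expressions that drive the fines step are easier to reason about in isolation. The sketch below reuses the WAC pattern and the more permissive amount pattern (the one that tolerates a space after the comma, as in `$35, 904.06`) from the loop. `parse_fine_line` is a hypothetical helper, and the two sample lines are invented.

```python
import re

WAC_PATTERN = r'((WAC\s?)?\d+-\d+-\s?\d+\s?(\(([0-9]|[a-z])\)\s?)*)'
AMOUNT_PATTERN = r'\$((\d+(,\s?|\.)?)+\d+)'

def parse_fine_line(line):
    """Return (wac_code, amount) for a line holding a WAC code and a dollar amount, else None."""
    wac = re.search(WAC_PATTERN, line)
    amount = re.search(AMOUNT_PATTERN, line)
    if not (wac and amount):
        return None
    wac_code = wac.group(1).replace('WAC', '').replace(' ', '')
    value = float(re.sub(r'[,\s]', '', amount.group(1)))
    return wac_code, value

# Invented lines, for illustration only.
samples = ['WAC 388-97-1090 (1) Example description $35, 904.06',
           '388-97-1060(3)(b) Example description $3,000.00']
parsed = [parse_fine_line(line) for line in samples]
print(parsed)  # [('388-97-1090(1)', 35904.06), ('388-97-1060(3)(b)', 3000.0)]
```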

# CLEANING THE SCRAPED DATA

In [None]:
df = df_all_reports.copy()

## WAC codes

The first clean version of the code will include:
- Title
- Chapter
- Section
- Subsection

In [None]:
df['wac_clean_long'] = df['wac'].str.replace(r'WAC|\s', '', regex=True) # Don't need spaces, nor to be reminded they are *WAC* codes.
df['wac_clean_long'] = df['wac_clean_long'].str.strip()
df['wac_clean_long'] = df['wac_clean_long'].str.replace('97-97-', '97-', regex=False)
df['wac_clean_long'] = df['wac_clean_long'].str.replace('\\', '', regex=False) # Only one case we could see with '\'

Create shorter version of the code, one that will not include the subsection.

In [None]:
df['wac_clean_short'] = df['wac_clean_long'].str.extract('(\d+-\d+-\d+)')
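
As a quick illustration of the long and short versions, the sketch below applies the same two steps to a few invented raw codes. The `regex=True` argument is spelled out because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Invented raw values, shaped like what the scraper captures.
raw = pd.Series(['WAC 388-97-1090 (1)', '388-97- 1060(3)(b)', None])

# Long version: title-chapter-section plus subsection, without 'WAC' or spaces.
long_code = raw.str.replace(r'WAC|\s', '', regex=True)

# Short version: title-chapter-section only.
short_code = long_code.str.extract(r'(\d+-\d+-\d+)', expand=False)

print(pd.DataFrame({'raw': raw, 'long': long_code, 'short': short_code}))
```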

## Civil monetary penalties

We transform *cmp_item* and *cmp_total* into numeric type variables.

In [None]:
# Individual cmp
df['cmp_item_num'] = df['cmp_item'].str.strip().str.replace(r',|\s', '', regex=True)
df['cmp_item_num'] = df['cmp_item_num'].str.replace('1.500.00', '1500.00', regex=False) # Just one case
df['cmp_item_num'] = pd.to_numeric(df['cmp_item_num'])

# Aggregate cmp
df['cmp_total_num'] = df['cmp_total'].str.strip().str.replace(r',|\s', '', regex=True)
df['cmp_total_num'] = pd.to_numeric(df['cmp_total_num'])

print('Total fines (from adding individual WAC fines of each report) =', df['cmp_item_num'].sum(skipna=True))
print('Total fines (from adding only the aggregate fine of each report) =', df['cmp_total_num'].sum(skipna=True))
print('So we have a slight discrepancy of', 
      df['cmp_total_num'].sum(skipna=True) - df['cmp_item_num'].sum(skipna=True))
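
One-off malformed amounts such as `1.500.00` can also be surfaced automatically rather than spotted by eye. The sketch below is one possible approach, not what the notebook does: it converts with `errors='coerce'` and lists the values that were present but failed to convert. The sample amounts are invented.

```python
import pandas as pd

# Invented amounts: two well-formed, one with a stray period, one missing.
amounts = pd.Series(['3,000.00', '35, 904.06', '1.500.00', None])

cleaned = amounts.str.replace(r'[,\s]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')

# Values that were present in the letter but could not be read as numbers.
needs_review = amounts[numeric.isna() & amounts.notna()]
print(needs_review.tolist())  # ['1.500.00']
```

Each flagged value still needs a look at the original PDF before deciding how to correct it.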

## Letter dates

We transform *letter_date* into a date-type variable.

In [None]:
# Consistency test: Does each enforcement letter have only one letter date?
temp = df[['pdf_name', 'letter_date']]
temp = temp.drop_duplicates().reset_index(drop=True)
assert len(temp) == df['pdf_name'].nunique()

del(temp)
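
If the assertion above ever fails, it does not say which letter is at fault. A possible variant, sketched here on invented data, counts the distinct dates per letter so that the offending file names can be listed directly.

```python
import pandas as pd

# Invented rows: 'b.pdf' carries two different dates.
toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf', 'b.pdf'],
    'letter_date': ['March 5, 2019', 'March 5, 2019', 'May 1, 2018', 'May 2, 2018'],
})

# Number of distinct dates per letter (missing values count as a value).
dates_per_letter = toy.groupby('pdf_name')['letter_date'].nunique(dropna=False)
conflicts = dates_per_letter[dates_per_letter > 1]
print(conflicts.index.tolist())  # ['b.pdf']
```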

In [None]:
df['letter_dt'] = pd.to_datetime(df['letter_date'])

**Consistency test**: What are the earliest and latest dates?

In [None]:
print(df['letter_dt'].min())
print(df['letter_dt'].max())

It turns out the latest date is almost a century in the future. Obviously, something must be wrong with some of the years. Let's look for them.

In [None]:
df['letter_dt'].dt.year.value_counts(dropna=False)

In [None]:
df[df['letter_date'].str.contains('2108|2107', na=False)]['pdf_name'].unique()

A visual review of those two PDF reports confirms our theory. We correct for that typo.

In [None]:
df = df.drop(['letter_dt'], axis=1)

df['letter_dt'] = df['letter_date'].str.replace('2107', '2017').str.replace('2108', '2018')
df['letter_dt'] = pd.to_datetime(df['letter_dt'])

In [None]:
df['letter_dt'].dt.year.value_counts(dropna=False)

In [None]:
print(df['letter_dt'].min())
print(df['letter_dt'].max())
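
A more general way to catch this kind of typo is to flag any date later than the downloads themselves. The sketch below assumes that no letter should be dated after 2020, the year in the download folder names; the cutoff and the sample rows are ours, for illustration only.

```python
import pandas as pd

# Invented rows: two of the three years are typos.
toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'b.pdf', 'c.pdf'],
    'letter_date': ['March 5, 2019', 'June 21, 2108', 'April 6, 2107'],
})

parsed = pd.to_datetime(toy['letter_date'], errors='coerce')

LAST_PLAUSIBLE_YEAR = 2020  # assumption: the bulk downloads were made in 2020
suspicious = toy[parsed.dt.year > LAST_PLAUSIBLE_YEAR]
print(suspicious['pdf_name'].tolist())  # ['b.pdf', 'c.pdf']
```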

## Survey dates

We transform *survey_date* into a date-type variable.

In [None]:
# Consistency test: Does each enforcement letter have only one survey date?
temp = df[['pdf_name', 'survey_date']]
temp = temp.drop_duplicates().reset_index(drop=True)
assert len(temp) == df['pdf_name'].nunique()

del(temp)

In [None]:
df['survey_dt'] = pd.to_datetime(df['survey_date'])

In [None]:
df['survey_date'].value_counts(dropna=False)

In [None]:
df['survey_dt'].value_counts(dropna=False)

**Consistency test**: What are the earliest and latest dates?

In [None]:
print(df['survey_dt'].min())
print(df['survey_dt'].max())

It turns out the latest date is almost a century in the future. Obviously, something must be wrong with some of the years. Let's look for them.

In [None]:
df['survey_dt'].dt.year.value_counts(dropna=False)

In [None]:
df[df['survey_date'].str.contains('2107|2108|2109', na=False)]['pdf_name'].unique()

A visual inspection of those letters shows that these are indeed typos: the positions of the '1' and the '0' in the year are swapped. We correct for that now.

In [None]:
df = df.drop('survey_dt', axis=1)

df['survey_dt'] = df['survey_date'].str.replace('2107','2017').str.replace('2108','2018').str.replace('2109','2019')
df['survey_dt'] = pd.to_datetime(df['survey_dt'])

In [None]:
df['survey_dt'].dt.year.value_counts(dropna=False)

In [None]:
print(df['survey_dt'].min())
print(df['survey_dt'].max())
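
Since every typo found here has the same shape (`210x` written instead of `201x`), the chained replacements could be collapsed into one regular expression. This is a sketch of that alternative, assuming no letter legitimately refers to a year between 2100 and 2109. The sample dates are invented.

```python
import pandas as pd

# Invented dates: the first and third have the '1' and the '0' swapped.
toy = pd.Series(['July 3, 2107', 'May 9, 2018', 'January 2, 2109'])

# Rewrite any standalone 210x as 201x, keeping the last digit.
fixed = toy.str.replace(r'\b210(\d)\b', r'201\1', regex=True)
print(fixed.tolist())  # ['July 3, 2017', 'May 9, 2018', 'January 2, 2019']
```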

## Adding WAC official definitions

In [None]:
# Import the official WAC definitions
df_wac = pd.read_csv('../C_output_data/wac_codes_df_t338c97.csv')

# Join
df = df.join(df_wac.set_index('ttl_chp_sec'), on='wac_clean_short', how='left')

# Reorganize columns and eliminate obsolete ones
df = df[['pdf_name', 'page', 'image_based', 
         'letter_dt', 'survey_dt', 'survey_type', 'vendor_num', 'fed_num', 'aem_num', 'action',
         'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
         'find_code', 'find_desc', 
         'wac_clean_long', 'wac_clean_short', 'sub_chp_num', 'sub_chp_name', 'section', 'ttl_chp_sec_desc',
         'cmp_item_num', 'cmp_total_num']]

# Renaming columns
df.columns = ['pdf_name', 'page', 'image_based', 
              'letter_dt', 'survey_dt', 'survey_type', 'vendor_num', 'fed_num', 'aem_num', 'action',
              'fed_enforcement', 'nc_hist','epoc', 'state_rem', 'appeal_rights',
              'finding_code', 'finding_desc', 
              'wac_long', 'wac_short', 'subchp_num', 'subchp_name', 'section', 'section_desc',
              'cmp_item', 'cmp_agg']

In [None]:
df.head()

In [None]:
df.tail()
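
A left join silently leaves the definition columns empty for any scraped code that has no match in the official list. One way to surface those cases, sketched here on invented data, is `merge` with `indicator=True`. The column names `wac_clean_short`, `ttl_chp_sec` and `ttl_chp_sec_desc` come from the cell above; the values, including the placeholder description, are made up.

```python
import pandas as pd

# Invented scraped codes: one known, one unknown, one missing.
toy_df = pd.DataFrame({'wac_clean_short': ['388-97-1090', '388-97-9999', None]})

# Invented stand-in for the official definitions file.
toy_wac = pd.DataFrame({'ttl_chp_sec': ['388-97-1090'],
                        'ttl_chp_sec_desc': ['(placeholder description)']})

merged = toy_df.merge(toy_wac, left_on='wac_clean_short', right_on='ttl_chp_sec',
                      how='left', indicator=True)

# Codes that were scraped but have no official definition.
unmatched = merged[(merged['_merge'] == 'left_only') & merged['wac_clean_short'].notna()]
print(unmatched['wac_clean_short'].tolist())  # ['388-97-9999']
```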

# EXPORT RESULTS

In [None]:
df.to_csv('../C_output_data/scraped_data.csv', index=False)

# DATA INTEGRITY REVIEW

### Did all the reports make it through?

In [None]:
assert df['pdf_name'].nunique() == len(df_letters)
print('The consistency test above confirms that all the', len(df_letters), 'enforcement letters made it through.')

### How many PDFs were image-based?

In [None]:
df_image = df[df['image_based']]

print('Out of the', len(df_letters), 'enforcement letter PDFs, apparently only', 
      df_image['pdf_name'].nunique(), 'of them contained image-based pages.',
      ' This does not necessarily mean that all pages in those PDFs are image-based, however. Let us confirm that:')

In [None]:
# List of all the pdf_names that have at least one image-based page
image_pdfs = df_image['pdf_name'].unique()
len(image_pdfs)

# Create a subset of df that contains only the pdf_names that have at least one image-based page
df_one_report = df[df['pdf_name'].isin(image_pdfs)]

# Confirm all the pages in those PDFs are image-based
assert df_one_report['image_based'].all()
del(image_pdfs, df_one_report)

print('Indeed, if a report is identified as image-based, then all of its pages are image-based.')
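
The same check can be expressed per letter, which also names any PDF that mixes image-based and text pages. The sketch below uses invented rows in which `b.pdf` is deliberately mixed, to show what a failure would look like. In the real data the assertion above found no such case.

```python
import pandas as pd

# Invented page-level rows: 'a.pdf' is fully image-based, 'b.pdf' is mixed, 'c.pdf' is text.
toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf', 'b.pdf', 'c.pdf'],
    'image_based': [True, True, True, False, False],
})

# For each letter: does it have any image-based page, and are all its pages image-based?
summary = toy.groupby('pdf_name')['image_based'].agg(['any', 'all'])
mixed = summary[summary['any'] & ~summary['all']]
print(mixed.index.tolist())  # ['b.pdf']
```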

In [None]:
df_image

We saved copies of the image-based files so we could inspect them visually.

In [None]:
from shutil import copyfile

for pdf in df_image['pdf_name'].unique():
    copyfile('/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_3333/' + pdf, 
             '../D_Documents/DSHS/image_based_enforcement_letters/' + pdf)

### Close look at those stratospheric fines

There are a few unusual fines. Although they are rare (each shows up only once), they are stratospheric.

They are so big that it is worth taking a close look at them. One test worth doing is whether the sum of the individual WAC fines adds up to the sum of the total fines for this particular group.

In [None]:
df['cmp_item'].value_counts(dropna=False)

In [None]:
# List all the cmp_item values and the number of times they appear through the reports
mega_cmp_list = df['cmp_item'].value_counts(dropna=False).reset_index()
mega_cmp_list.columns = ['cmp_item', 'freq']
# Filter for only the unusual cmp_item amounts
mega_cmp_list = mega_cmp_list[mega_cmp_list['freq'] == 1].reset_index(drop=True)
# Reduce it to a series that contains those mega cmp
mega_cmp_list = mega_cmp_list['cmp_item']
mega_cmp_list

In [None]:
# Create a subset of df with only the pdf_names that contain those unusual cmp
df_mega_cmp = df[df['cmp_item'].isin(mega_cmp_list)]
df_mega_cmp = df[df['pdf_name'].isin(df_mega_cmp['pdf_name'].unique())]
df_mega_cmp

In [None]:
# Consistency test: are all the unusual cmp_item values in this new data frame?
assert mega_cmp_list.isin(df_mega_cmp['cmp_item']).all()

# Are the sums of individual fines and total fines consistent for this group of mega fines?
assert round(df_mega_cmp['cmp_item'].sum()) == round(df_mega_cmp['cmp_agg'].sum())

Note: A separate visual analysis of these mega fines shows that they mostly correspond to one particular code violation: [WAC 388-97-1090(1)](https://apps.leg.wa.gov/wac/default.aspx?cite=388-97-1090), a Quality of Care code that regulates the minimum hours of direct care per resident day (HDR) that a nursing home must provide.
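
The group-level comparison above can hide offsetting errors between letters. A per-letter version is sketched below on invented numbers. It assumes that each letter states a single total, so the per-letter maximum of `cmp_agg` is taken as that total, because the scraper repeats the total on every fine row of the same page.

```python
import pandas as pd

# Invented rows: 'a.pdf' adds up, 'b.pdf' is off by 100.
toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf'],
    'cmp_item': [1000.0, 2000.0, 500.0],
    'cmp_agg': [3000.0, 3000.0, 600.0],
})

per_report = toy.groupby('pdf_name').agg(items_sum=('cmp_item', 'sum'),
                                         stated_total=('cmp_agg', 'max'))
per_report['gap'] = per_report['stated_total'] - per_report['items_sum']

# Letters whose itemized fines do not add up to the stated total (to the cent).
mismatched = per_report[per_report['gap'].abs() > 0.005]
print(mismatched)
```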

### From how many reports were we able to extract a total mulct amount?

In [None]:
temp = df[~df['cmp_agg'].isna()]

print(len(temp))
print(temp['pdf_name'].nunique())

There seems to be one report that has 2 total amounts. Let's check.

In [None]:
# Count the rows with a total per letter and pick the first letter that has more than one
counts = temp['pdf_name'].value_counts()
x = counts[counts > 1].index[0]
x

In [None]:
df[df['pdf_name'] == x]

### Findings

In [None]:
print(df['finding_code'].nunique())
df['finding_code'].value_counts(dropna=False)

In [None]:
print(df['finding_desc'].nunique())
df['finding_desc'].value_counts(dropna=False)

There are far more finding descriptions than finding codes. That is, a single finding code typically has multiple descriptions.

Unless the official finding descriptions have changed over time, each code should have only one description. The problem here probably arises from typos in the letters or from the algorithm picking up noise text when scraping the descriptions. We should base our analysis on the finding codes instead of the descriptions, and confirm that their definitions have not changed over time.
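
To see how many description variants each code has, the descriptions can be counted per code. The sketch below uses invented rows with the post-rename column names `finding_code` and `finding_desc`.

```python
import pandas as pd

# Invented rows: code 'G' appears with two different description strings.
toy = pd.DataFrame({
    'finding_code': ['G', 'G', 'G', 'D'],
    'finding_desc': ['actual harm', 'actual harm that', 'actual harm', 'no actual harm'],
})

# Number of distinct descriptions scraped for each finding code, most varied first.
variants = (toy.groupby('finding_code')['finding_desc']
            .nunique()
            .sort_values(ascending=False))
print(variants)
```

Listing the actual strings for the most varied codes (for example with `toy[toy['finding_code'] == 'G']['finding_desc'].unique()`) helps tell typos apart from scraping noise.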

### Date consistency

Logically, all survey dates should precede their corresponding enforcement letter dates. Let's see if we can find instances where that is not the case.

In [None]:
print(df[(df['letter_dt'] <= df['survey_dt'])]['pdf_name'].unique())
df[(df['letter_dt'] <= df['survey_dt'])]#[['pdf_name', 'letter_dt','survey_dt']].drop_duplicates()
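
Beyond a yes/no flag, the number of days between survey and letter gives a sense of how far off a suspicious pair is. The sketch below computes that gap on invented dates, using the same `<=` criterion as the cell above.

```python
import pandas as pd

# Invented dates: the second letter is dated before its survey.
toy = pd.DataFrame({
    'pdf_name': ['a.pdf', 'b.pdf'],
    'letter_dt': pd.to_datetime(['2019-03-05', '2018-05-01']),
    'survey_dt': pd.to_datetime(['2019-02-20', '2018-05-10']),
})

# Days elapsed from the survey to the letter; zero or negative values are suspicious.
toy['days_to_letter'] = (toy['letter_dt'] - toy['survey_dt']).dt.days
out_of_order = toy[toy['days_to_letter'] <= 0]
print(out_of_order[['pdf_name', 'days_to_letter']])
```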