# What this script does

Here we download all the available enforcement letters issued to nursing homes by the Washington State Department of Social and Health Services (DSHS).

The letters available for download are updated each month, so we have run this process several times since March 2020. In various places in this script, you will find commented-out lines of code that correspond to those earlier downloads.

# A. Settings

In [1]:
import pandas as pd
import requests
import bs4
import re
import time
from os import listdir

# B. Listing file

The DSHS's [Nursing Home Facilities Locator](https://fortress.wa.gov/dshs/adsaapps/lookup/NHPubLookup.aspx) provides a CSV file that contains:
- A list of all the nursing homes registered with the DSHS
- The links where the ALTSA PDF reports issued to those homes can be downloaded

As mentioned above, this CSV file is updated monthly. With each update, letters issued in the earliest month of the 3-year period are dropped and those issued in the latest month are added. That is why we ran this download several times; the commented-out lines in the code correspond to those earlier runs.
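Because the listing file, the download folder, and the metadata file all carry the same date label, one way to avoid toggling commented-out lines is to derive every path from a single label. The following is only a sketch: `snapshot_paths` is a hypothetical helper that is not part of this notebook, and it assumes the naming pattern used in the later runs (the first download folder, for instance, does not follow it).

```python
def snapshot_paths(label):
    """Build the three per-snapshot paths from one date label (hypothetical helper)."""
    reports_root = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/'
    return {
        # CSV listing downloaded from the Nursing Home Facilities Locator
        'listing': f'../A_source_data/DSHS/NFListing_{label}.csv',
        # Folder that receives the enforcement-letter PDFs
        'letters_dir': f'{reports_root}NH_enforcement_letters_{label}/',
        # CSV where the download metadata is saved
        'metadata': f'../C_output_data/download_metadata_{label}.csv',
    }


paths = snapshot_paths('2020-09-08')
print(paths['listing'])
```

With this approach, rerunning the notebook for a new snapshot would mean changing only the label.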

In [2]:
# Pull the original data
# df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-03.csv')
# df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-06-09.csv')
# df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-06-24.csv')
# df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-07-16.csv')
# df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-08-20.csv')
df_list_orig = pd.read_csv('../A_source_data/DSHS/NFListing_2020-09-08.csv')

# Create a workable copy of the df_list
df_list = df_list_orig.copy()
df_list.columns = df_list.columns.str.strip().str.lower().str.replace(r'\s', '_', regex=True)

## Getting to know the listing file

In [3]:
df_list.head()

Unnamed: 0,nf_location_num,nf_loc_region_cde,nf_loc_street_address,nf_loc_city,nf_loc_zip_cde,nf_loc_phone_num,nf_loc_fax_num,nf_mailing_address,nf_mailing_city,nf_mailing_state,...,rcsunit,total_beds_nf_bed_count,xviii_beds_nf_bed_count,xix_beds_nf_bed_count,t1819_beds_nf_bed_count,fain_id,nf_bed_type_desc,has_reports?,reports_location,unnamed:_32
0,19800,1,495 N 13th Ave,Othello,99344,5094890000.0,,495 N 13th Ave,Othello,WA,...,A,39,,,39.0,43484,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
1,4400,1,1242 11TH ST,CLARKSTON,99403,5097583000.0,5097519000.0,PO BOX 159,CLARKSTON,WA,...,A,90,,,90.0,43575,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
2,20400,1,1508 WEST 7TH AVENUE,KENNEWICK,99336,5095869000.0,5095864000.0,1508 W 7TH AVE,KENNEWICK,WA,...,D,136,,,136.0,43505,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
3,20500,1,44 GOETHALS DRIVE,RICHLAND,99352,5099431000.0,5099435000.0,44 GOETHALS DR,RICHLAND,WA,...,D,104,,,104.0,43507,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
4,40660,1,2702 S Ely St,Kennewick,99337,5095826000.0,5094972000.0,2702 S Ely St,Kennewick,WA,...,D,53,0.0,0.0,53.0,44102,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,


In [4]:
df_list.tail()

Unnamed: 0,nf_location_num,nf_loc_region_cde,nf_loc_street_address,nf_loc_city,nf_loc_zip_cde,nf_loc_phone_num,nf_loc_fax_num,nf_mailing_address,nf_mailing_city,nf_mailing_state,...,rcsunit,total_beds_nf_bed_count,xviii_beds_nf_bed_count,xix_beds_nf_bed_count,t1819_beds_nf_bed_count,fain_id,nf_bed_type_desc,has_reports?,reports_location,unnamed:_32
200,20000,1,721 OTIS AVE,SUNNYSIDE,989442328,5098372000.0,5098373000.0,721 OTIS AVE,SUNNYSIDE,WA,...,D,80,,,80.0,43574,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
201,6000,1,3801 SUMMITVIEW AVENUE,YAKIMA,98902,5099666000.0,5098533000.0,3801 SUMMITVIEW AVE,YAKIMA,WA,...,D,78,0.0,,78.0,42645,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
202,22200,1,802 W 3rd Ave,Toppenish,98948,5098654000.0,5098654000.0,802 W 3rd Ave,Toppenish,WA,...,D,75,,0.0,75.0,43479,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
203,10300,1,4007 TIETON DRIVE,YAKIMA,98908,5099664000.0,5099661000.0,4007 TIETON DR,YAKIMA,WA,...,D,75,,0.0,75.0,43516,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
204,700,1,609 SPEYERS ROAD B 39-15,SELAH,989421099,5096981000.0,5096972000.0,609 SPEYERS RD B 39-15,SELAH,WA,...,D,160,,160.0,,43608,Title 19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,


### What field do we use to ID facilities?

In [5]:
num_rows_orig = len(df_list) # We will use this value for consistency tests later.

print('Number of rows =', num_rows_orig)
print('Number of unique nf_location_num =', df_list['nf_location_num'].nunique())
print('Number of unique fain_id =', df_list['fain_id'].nunique())
print('Number of unique nf_name =', df_list['nf_name'].nunique())

Number of rows = 205
Number of unique nf_location_num = 205
Number of unique fain_id = 205
Number of unique nf_name = 205


Several fields uniquely identify each facility. We will use **fain_id** as the unique ID for each facility.
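The same check can be packaged so that it fails loudly if a future listing breaks the assumption. This is a minimal sketch on toy data; `unique_id_columns` is a hypothetical helper, not something the notebook defines, and the toy values are made up for illustration.

```python
import pandas as pd


def unique_id_columns(df, candidates):
    """Return the candidate columns that have no missing and no repeated values."""
    return [col for col in candidates
            if df[col].notna().all() and df[col].is_unique]


# Toy listing: 'fain_id' is unique, 'nf_name' has a repeated value
toy = pd.DataFrame({
    'fain_id': [43484, 43575, 43505],
    'nf_name': ['HOME A', 'HOME B', 'HOME B'],
})

usable = unique_id_columns(toy, ['fain_id', 'nf_name'])
print(usable)
```

On the real listing, an `assert 'fain_id' in unique_id_columns(df_list, ['fain_id'])` placed before the download loop would guard against duplicated or missing IDs.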

### Subsetting: Only facilities that have reports

How many facilities have reports and how many don't?

In [6]:
df_list['has_reports?'].value_counts(dropna=False)

Yes    205
Name: has_reports?, dtype: int64

In [7]:
print('Out of the', len(df_list), 'nursing homes,', (df_list['has_reports?'] == 'Yes').sum(), 
      'have at least one report issued to them—whether inspection reports, enforcement letters, etc.')

Out of the 205 nursing homes, 205 have at least one report issued to them—whether inspection reports, enforcement letters, etc.


In [8]:
# # Make a df_list for the facilities with no reports
# df_list_no_reports = df_list[df_list['has_reports?'] == 'No']
# df_list_no_reports = df_list_no_reports.reset_index(drop=True)
# df_list_no_reports

# # Filter out from df_list those facilities with no reports
# df_list = df_list[df_list['has_reports?'] == 'Yes']
# df_list = df_list.reset_index(drop=True)

# # Confirm that the facilities marked as having no reports show no query link
# assert df_list_no_reports['reports_location'].fillna('').str.strip().eq('').all()

# # Confirm no rows were lost during the dataframe split
# assert len(df_list) + len(df_list_no_reports) == num_rows_orig

In [9]:
# Any nursing homes with no reports?
print(df_list['has_reports?'].value_counts(dropna=False))
# Every facility should have a non-empty link to its reports page
assert df_list['reports_location'].fillna('').str.strip().ne('').all()

Yes    205
Name: has_reports?, dtype: int64


# C. The long looping road to downloading

In [10]:
# download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_3333/'
# download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_enforcement_letters_2020-06-09/'
# download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_enforcement_letters_2020-06-24/'
# download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_enforcement_letters_2020-07-16/'
# download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_enforcement_letters_2020-08-20/'
download_path = '/Volumes/files/COVID19/Manuel_RCF_Data/State_DSHS/ALTSA_reports/NH_enforcement_letters_2020-09-08/'

# As we download each report, we will save some of its metadata in this new DataFrame:
df_metadata = pd.DataFrame(columns = ['fain_id', 'nf_name', 'pdf_name', 
                                      'rep_type', 'rep_date', 'url','download_record'])


# Each iteration of this first loop corresponds to a single facility
for index, row in df_list.iterrows():
    
    print(index, '|', row['nf_name'])

    # Send a URL request to the site that contains all the links to each of the reports for a single facility.
    page = requests.get(row['reports_location'])
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    
    # The list of links to each of the reports is contained in a 'div' element with id='content_results'
    table = soup.find('div', {"id": "content_results"})
    
    # Each 'li' element in that container holds the link to one PDF report
    ls_li = table.find_all('li')

    # Each iteration of this second loop corresponds to a single pdf report (from a single facility)
    for li in ls_li:    
        # We are only looking for enforcement letters
        if 'enforcement' in li.find('a').contents[0].lower():

            # Obtain some metadata from the report
            pdf_url = li.find('a').get('href')
            pdf_name = pdf_url.split("/")[-1]
            rep_type = re.search(r'\d{2}/\d{4}\s-\s(.*)$', li.find('a').contents[0]).groups()[0]
            rep_date = re.search(r'(\d{2}/\d{4})', li.find('a').contents[0]).groups()[0]

            # Download the PDF and save it to disk
            pdf = requests.get('https://fortress.wa.gov' + pdf_url)
            with open(download_path + pdf_name, 'wb') as f:
                f.write(pdf.content)
            
            # Save the report's metadata in 'df_metadata'
            new_record = {'fain_id':row['fain_id'], 
                          'nf_name':row['nf_name'], 
                          'pdf_name':pdf_name, 
                          'rep_type':rep_type, 
                          'rep_date':rep_date,
                          'url':'https://fortress.wa.gov' + pdf_url,
                          'download_record':time.ctime()}
            # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
            df_metadata = pd.concat([df_metadata, pd.DataFrame([new_record])], ignore_index=True)
            del new_record
        
        time.sleep(0.5)

0 | AVALON CARE CENTER - OTHELLO, LLC
1 | PRESTIGE CARE & REHABILITATION - CLARKSTON
2 | LIFE CARE CENTER OF KENNEWICK
3 | LIFE CARE CENTER OF RICHLAND
4 | Regency Canyon Lakes Rehabilitation and Nursing Center
5 | RICHLAND REHABILITATION CENTER
6 | Cashmere Care Center
7 | Colonial Vista Post Acute & Rehabilitation Center
8 | Regency Wenatchee Rehabilitation and Nursing Center
9 | AVAMERE OLYMPIC REHABILITATION OF SEQUIM
10 | Crestwood Health & Rehabilitation Center
11 | FORKS COMMUNITY HOSPITAL LTC UNIT
12 | Sequim Health & Rehabilitation Center
13 | Avamere Rehabilitation of Cascade Park
14 | Brookfield Health and Rehabilitation of Cascadia
15 | DISCOVERY NURSING & REHAB OF VANCOUVER
16 | Fort Vancouver Post Acute
17 | Manor Care Health Services - Salmon Creek
18 | PRESTIGE CARE & REHABILITATION - CAMAS
19 | The Oaks at Timberline
20 | Vancouver Specialty and Rehabilitative Care
21 | WOODLAND CONVALESCENT CENTER
22 | BOOKER REST HOME ANNEX
23 | Americana Health and Rehabilitation Ce
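The two regular expressions in the loop imply that each link's text looks like `MM/YYYY - <report type>`. The sketch below extracts both fields in one pass and returns `None` when the text does not match, instead of raising an `AttributeError`. `parse_link_text` is a hypothetical helper, and the sample strings are assumptions based on the regexes and the metadata shown in this notebook, not text copied from the DSHS site.

```python
import re

# Assumed link-text layout, inferred from the regexes in the loop: "MM/YYYY - <report type>"
LINK_TEXT = re.compile(r'(\d{2}/\d{4})\s-\s(.*)$')


def parse_link_text(text):
    """Return (rep_date, rep_type), or None if the text does not follow the layout."""
    match = LINK_TEXT.search(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


parsed = parse_link_text('05/2019 - Enforcement Letter')  # hypothetical link text
unparsed = parse_link_text('Enforcement Letter')          # no date, so no match
print(parsed, unparsed)
```

Returning `None` lets the caller decide whether to skip the link or log it, which matters in a loop that runs for over an hour.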

In [12]:
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-03.csv')
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-06-09.csv')
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-06-24.csv')
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-07-16.csv')
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-08-20.csv')
# df_metadata = pd.read_csv('../C_output_data/download_metadata_2020-09-08.csv')

# df_metadata.to_csv('../C_output_data/download_metadata_2020-03.csv', index=False)
# df_metadata.to_csv('../C_output_data/download_metadata_2020-06-09.csv', index=False)
# df_metadata.to_csv('../C_output_data/download_metadata_2020-06-24.csv', index=False)
# df_metadata.to_csv('../C_output_data/download_metadata_2020-07-16.csv', index=False)
# df_metadata.to_csv('../C_output_data/download_metadata_2020-08-20.csv', index=False)
df_metadata.to_csv('../C_output_data/download_metadata_2020-09-08.csv', index=False)

# D. Consistency tests

In [13]:
df_metadata.head()

Unnamed: 0,fain_id,nf_name,pdf_name,rep_type,rep_date,url,download_record
0,43484,"AVALON CARE CENTER - OTHELLO, LLC","Avalon Care Center - Othello (G, CMP, CF) 5 3 ...",Enforcement Letter,05/2019,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 15:09:11 2020
1,43484,"AVALON CARE CENTER - OTHELLO, LLC","Avalon Care Center - Othello (Hx E, prior D) 8...",Enforcement Letter,08/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 15:09:12 2020
2,43575,PRESTIGE CARE & REHABILITATION - CLARKSTON,Prestige Care Clarkston (Hx E prior D) 6 21 19...,Enforcement Letter,07/2019,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 15:09:21 2020
3,43575,PRESTIGE CARE & REHABILITATION - CLARKSTON,"Prestige Care - Clarkston (Hx D prior E, CMP) ...",Enforcement Letter,06/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 15:09:22 2020
4,43575,PRESTIGE CARE & REHABILITATION - CLARKSTON,"Prestige Clarkston (Past Non J, CMP) 1 26 18.pdf",Enforcement Letter,02/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 15:09:23 2020


In [14]:
df_metadata.tail()

Unnamed: 0,fain_id,nf_name,pdf_name,rep_type,rep_date,url,download_record
766,43608,YAKIMA VALLEY SCHOOL,Yakima Valley School (Hx E prior D) 7 11 19.pdf,Enforcement Letter,07/2019,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 16:43:06 2020
767,43608,YAKIMA VALLEY SCHOOL,"Yakima Valley School (GG, CMP) 11 13 18.pdf",Enforcement Letter,11/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 16:43:08 2020
768,43608,YAKIMA VALLEY SCHOOL,"Yakima Valley School (G, CMP) 7 23 18.pdf",Enforcement Letter,08/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 16:43:09 2020
769,43608,YAKIMA VALLEY SCHOOL,Yakima Valley School (Hx E prior E) 6 4 18.pdf,Enforcement Letter,06/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 16:43:11 2020
770,43608,YAKIMA VALLEY SCHOOL,"Yakima Vally School (GG, CMP) 1 23 18.pdf",Enforcement Letter,02/2018,https://fortress.wa.gov/dshs/adsaapps/lookup/R...,Tue Sep 8 16:43:12 2020


Were all the reports in 'df_metadata' actually downloaded?

In [15]:
# Obtain a list of all the downloaded nursing home PDF enforcement letters
file_list = listdir(download_path)

# Weed out any files in the folder that are not PDFs
file_list = [file for file in file_list if re.search(r'\.pdf$', file)]
file_list = pd.Series(file_list)

# Is the set of all downloaded PDFs consistent with the list of PDFs in 'df_metadata'?
assert len(file_list) == len(df_metadata)
assert set(file_list) == set(df_metadata['pdf_name'])

Yes.
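When these assertions fail, they do not say which files are the problem. A minimal sketch of a more descriptive comparison follows; `compare_downloads` is a hypothetical helper and the file names are toy values. It also flags repeated names in the metadata, since two letters with the same `pdf_name` would overwrite each other on disk.

```python
from collections import Counter


def compare_downloads(files_on_disk, names_in_metadata):
    """Report the differences between the PDFs on disk and those listed in the metadata."""
    on_disk = set(files_on_disk)
    expected = set(names_in_metadata)
    counts = Counter(names_in_metadata)
    return {
        'missing_on_disk': sorted(expected - on_disk),
        'not_in_metadata': sorted(on_disk - expected),
        'repeated_in_metadata': sorted(name for name, n in counts.items() if n > 1),
    }


# Toy example: one file never saved, one stray file, one repeated name
report = compare_downloads(
    files_on_disk=['a.pdf', 'b.pdf', 'stray.pdf'],
    names_in_metadata=['a.pdf', 'b.pdf', 'b.pdf', 'c.pdf'],
)
print(report)
```

In this notebook the call would be `compare_downloads(file_list, df_metadata['pdf_name'])`, and all three lists should come back empty.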

Any FAIN IDs in the metadata that were not in the original listing file?

In [16]:
set(df_metadata['fain_id']).difference(set(df_list['fain_id']))

set()

Nope. Nothing was created out of nothing.

Any FAIN IDs in the original listing that did not show up in the metadata?

In [17]:
missing = set(df_list['fain_id']).difference(set(df_metadata['fain_id']))
missing

{42569,
 42682,
 42714,
 42767,
 42813,
 43242,
 43348,
 43441,
 43546,
 43560,
 43585,
 43594,
 43844,
 44101,
 44158,
 44409,
 44988,
 45037,
 45564,
 45976,
 46175,
 46200}

So some nursing homes were in the original listing and were marked as having reports, but no enforcement letters were downloaded for them. Let's take a look:

In [18]:
df_missing = df_list[df_list['fain_id'].isin(missing)]
df_missing

Unnamed: 0,nf_location_num,nf_loc_region_cde,nf_loc_street_address,nf_loc_city,nf_loc_zip_cde,nf_loc_phone_num,nf_loc_fax_num,nf_mailing_address,nf_mailing_city,nf_mailing_state,...,rcsunit,total_beds_nf_bed_count,xviii_beds_nf_bed_count,xix_beds_nf_bed_count,t1819_beds_nf_bed_count,fain_id,nf_bed_type_desc,has_reports?,reports_location,unnamed:_32
8,15700,1,1326 Red Apple Rd,Wenatchee,988013227,5096823000.0,5096824000.0,1326 Red Apple Rd,Wenatchee,WA,...,D,55,0.0,0.0,55.0,44158,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
24,34100,3,128 BEACON HILL DR,LONGVIEW,986325859,3604234000.0,3606361000.0,128 BEACON HILL DR,LONGVIEW,WA,...,C,67,,,67.0,43585,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
38,40260,2,2720 E Madison St,Seattle,98112,2063225000.0,2067202000.0,2720 E Madison St,Seattle,WA,...,H,35,0.0,30.0,5.0,43242,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
42,40940,2,100 Timber Ridge Way NW,Issaquah,98027,4254275000.0,4254275000.0,100 Timber Ridge Way NW,Issaquah,WA,...,F,45,45.0,0.0,0.0,46200,Title 18,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
45,23500,2,7500 SEWARD PARK AVE SO,SEATTLE,981180000,2067259000.0,2064570000.0,7500 SEWARD PARK AVE SO,SEATTLE,WA,...,F,205,0.0,,205.0,42813,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
55,10500,2,805 FRONT STREET SOUTH,ISSAQUAH,98027,,4255576000.0,805 FRONT STREET SOUTH,ISSAQUAH,WA,...,F,140,,,140.0,44988,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
57,40540,2,4416 SOUTH BRANDON STREET,SEATTLE,981180000,2067214000.0,2067214000.0,4416 SOUTH BRANDON ST,SEATTLE,WA,...,H,100,0.0,,100.0,43348,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
63,41118,2,17420 106th Pl SE,Renton,98055,2538534000.0,2538535000.0,17420 106th Pl SE,Renton,WA,...,F,60,0.0,0.0,60.0,45037,Title 18/19,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
81,40770,2,24423 100TH AVENUE SE,KENT,98030,2538132000.0,2538133000.0,24423 100TH AVENUE SE,KENT,WA,...,F,8,8.0,,,43441,Title 18,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,
83,39960,2,919 109TH AVENUE NE,BELLEVUE,980044404,4256470000.0,,919 109TH AVENUE NE,BELLEVUE,WA,...,H,54,54.0,,,43844,Title 18,Yes,https://fortress.wa.gov/dshs/adsaapps/lookup/N...,


Upon visual inspection of the links for these cases, we find the reason: those facilities did have reports issued to them (inspections, investigations, etc.), but none of those reports were enforcement letters. That explains why they appear in 'df_list' but don't show up in 'df_metadata'.

The next logical question is: Of the 205 nursing homes that had reports issued to them, how many had enforcement letters?

In [19]:
print('Out of the', df_list['fain_id'].nunique(), 'nursing homes that had reports issued to them,', 
      df_metadata['fain_id'].nunique(), 'had enforcement letters issued to them.')

Out of the 205 nursing homes that had reports issued to them, 183 had enforcement letters issued to them.
