# What this script does

First we obtained the **enforcement letter PDFs**, scraped their contents, and organized those contents into a dataframe.

Next, we downloaded the **CMS** data on deficiencies.

While the CMS data was a structured dataset from the beginning, the data scraped from the enforcement letters still needs some cleaning and standardization. For example, we could not obtain a facility's federal number from several enforcement letters, either because it was not present in the letter or because it contained typos.

In this script, we carry out the cleaning of the data scraped from the enforcement letters.

# I. Settings

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

# II. Importing data

#### Import

In [2]:
# Penalties (scraped from the enforcement letter PDFs)
df_letters_orig = pd.read_csv('../C_output_data/scraped_data.csv', 
                              parse_dates=['letter_dt', 'survey_dt'])
# Deficiencies (from CMS database)
df_sod_wa_orig = pd.read_csv('../C_output_data/sod_wa.csv', 
                             dtype='object', parse_dates=['inspection_dt'])

#### Working copies

In [3]:
# Working copies
df_sod_wa = df_sod_wa_orig.copy()
df_letters = df_letters_orig.copy()

In [4]:
print(df_letters.shape)
df_letters.nunique()

(4958, 25)


pdf_name           1035
page                  8
image_based           2
letter_dt           578
survey_dt           577
survey_type         111
vendor_num          248
fed_num             217
aem_num             701
action               56
fed_enforcement       2
nc_hist               2
epoc                  2
state_rem             2
appeal_rights         2
finding_code          9
finding_desc         50
wac_long            145
wac_short            46
subchp_num            3
subchp_name           3
section              15
section_desc         41
cmp_item             40
cmp_agg              66
dtype: int64

In [5]:
print(df_sod_wa.shape)
df_sod_wa.nunique()

(11915, 12)


facility_name      225
facility_id        224
eventid           2075
inspection_dt      742
tag                290
tag_group_num       20
tag_group_name      20
tag_old_new          2
severity_code       11
severity_desc       11
complaint            2
standard             2
dtype: int64

# III. Enforcement letter data: Fixing non-matching federal numbers

Each enforcement letter issued to a facility included that facility's federal number. Those numbers are how we link the enforcement letters with the CMS database. However, not all of them could be scraped, either because they were missing or because they had typos.

In this section, we fix the records scraped from the enforcement letters whose facility federal numbers are missing or corrupted.

To start, let's find out which federal numbers in *df_letters* are not found in the CMS dataset.

### Identify the faulty federal numbers in *df_letters*

In [6]:
# Find the fed numbers from the letters (and their corresponding PDF names) that don't show up in the CMS data
not_in_cms = set(df_letters['fed_num']).difference(set(df_sod_wa['facility_id']))
not_in_cms = df_letters[df_letters['fed_num'].isin(not_in_cms)]
not_in_cms = not_in_cms[['fed_num', 'pdf_name']].drop_duplicates()
not_in_cms = not_in_cms.sort_values(['pdf_name', 'fed_num']).reset_index(drop=True)

print('There are', not_in_cms['fed_num'].nunique(dropna=False), 'federal numbers contained in', 
      not_in_cms['pdf_name'].nunique(dropna=False), 'enforcement letters that are not found in CMS.')

There are 22 federal numbers contained in 89 enforcement letters that are not found in CMS.
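
The same check can also be written as an anti-join. The sketch below is a minimal illustration on toy data: the file names and the frames `letters` and `cms` are made up for the example and only stand in for *df_letters* and *df_sod_wa*. It uses `DataFrame.merge` with `indicator=True` and keeps the rows flagged `left_only`.

```python
import pandas as pd

# Toy stand-ins for df_letters and df_sod_wa (hypothetical file names)
letters = pd.DataFrame({
    'pdf_name': ['a.pdf', 'b.pdf', 'c.pdf'],
    'fed_num': ['505223', '505X255', None],
})
cms = pd.DataFrame({'facility_id': ['505223', '505255']})

# Left-merge and keep only the letters with no partner in the CMS data
merged = letters.merge(cms, how='left', left_on='fed_num',
                       right_on='facility_id', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', ['pdf_name', 'fed_num']]

print(unmatched)
```

One possible advantage of this form is that the unmatched rows and their PDF names come out in a single step, without building an intermediate set.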


In [7]:
not_in_cms

Unnamed: 0,fed_num,pdf_name
0,,"Alaska Gardens (Lift SP, BIC) 2 14 19.pdf"
1,,Alderwood Park (BIC - Lift SP) 5 9 18 .pdf
2,505X255,"Avalon Care Center - Othello (G, CMP, CF) 5 3 19.pdf"
3,,"Avamere Bellingham (Lift Sp, BIC) 3 23 20.pdf"
4,Fed# 505223,Avamere Bellingham SFF removal 6 17.pdf
...,...,...
84,505XXX,"Victory Health and Rehab (G, CMP, CF) 3 16.pdf"
85,,"View Ridge (BIC, Lift SP, Cond) 3 13 2018.pdf"
86,505XXX,"View Ridge (IJ, OSFM, SP, IJ, ) 1 30 18.pdf"
87,X505349,"Willapa Harbor (IJ, SUB, CMP, CF) 6 17.pdf"


Let's look at those non-matching federal numbers.

In [8]:
print(len(set(not_in_cms['fed_num'])))
set(not_in_cms['fed_num'])

22


{'50098',
 '50230',
 '505502',
 '505A174',
 '505A260',
 '505X114',
 '505X255',
 '505X276',
 '505X283',
 '505X311',
 '505X319',
 '505X331',
 '505X339',
 '505X344',
 '505X379',
 '505X395',
 '505X470',
 '505XX',
 '505XXX',
 'Fed# 505223',
 'X505349',
 nan}

- Some of them are clearly wrong: they either have the letter 'X' embedded in them, have too few digits, or carry extra characters (e.g., 'Fed# 505223').
- The few that have the full 6 digits and no 'X' were probably mistyped, since they were not found in the CMS data (i.e., they likely don't exist).
- The 'NaN' values indicate cases where the enforcement letter did not include a federal number.
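
To make these categories concrete, here is a rough sketch that labels each faulty value by the kind of problem it shows. The function `classify_fed_num` and its label names are hypothetical, and the rules are only a heuristic derived from the values listed above, not an official definition of a valid CMS number.

```python
import pandas as pd

def classify_fed_num(value):
    """Return a rough label describing why a scraped federal number looks faulty."""
    if pd.isna(value):
        return 'missing'
    text = str(value).strip()
    if 'X' in text.upper():
        return 'placeholder X'
    if not text.isalnum():
        return 'extra text'
    if len(text) != 6:
        return 'wrong length'
    # Six alphanumeric characters, but still absent from the CMS data
    return 'well-formed but unmatched'

# A few of the values reported above, plus a missing one
sample = pd.Series(['50098', '505502', '505X255', '505XXX',
                    'Fed# 505223', 'X505349', None])
labels = sample.map(classify_fed_num)
print(pd.DataFrame({'fed_num': sample, 'label': labels}))
```

A breakdown like this only describes the problem; it does not recover the correct number.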

So we will have to manually map the correct federal numbers to the corresponding PDFs, listed below.

In [9]:
print(len(set(not_in_cms['pdf_name'])))
set(not_in_cms['pdf_name'])

89


{'Alaska Gardens (Lift SP, BIC) 2 14 19.pdf',
 'Alderwood Park (BIC - Lift SP) 5 9 18 .pdf',
 'Avalon Care Center - Othello (G, CMP, CF) 5 3 19.pdf',
 'Avamere Bellingham (Lift Sp, BIC) 3 23 20.pdf',
 'Avamere Bellingham SFF removal 6 17.pdf',
 'Ballard Center (Lift SP, BIC) 12 20 17.pdf',
 'Ballard Center (Lift SP, Cond) 5 15 20.pdf',
 'Ballard Center (Lift SP, Still OOC) 12 28 18.pdf',
 'Bremerton_Convalescent_and_Rehabilitation_Center_January_20_2017_letter_3.pdf',
 'Brookfield Cascadia (Hx D prior G, prior G) 2 28 19.pdf',
 'Brookfield Cascadia (Lift SP, BIC) 5 22 18.pdf',
 'Canterbury House (D prior D) 9 18 18.pdf',
 'Cashmere Care Center (G, CMP, CF) 5 25 18.pdf',
 'Columbia Lutheran Home (G, CMP, CF) 5 17.pdf',
 'Cristwood Nursing and Rehab (GG, CMP, CF) 6 21 18.pdf',
 'Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf',
 'Emerald Hills (GG, CMP, CF, Sub) 3 17.pdf',
 'Emerald Hills BIC 6 17.pdf',
 'Emerald_Hills_Rehabilitation_March_10_2017_letter_10.pdf',
 'Fidalgo (BIC, Cond remai

### Build a mapping table for faulty or missing CMS numbers

The following mapping was done manually, by visually matching the facility names in the PDF titles against similar names in the CMS data.

In [10]:
lost_map = [['Alaska Gardens (Lift SP, BIC) 2 14 19.pdf', '505483'],
            ['Alderwood Park (BIC - Lift SP) 5 9 18 .pdf', '505092'],
            ['Avalon Care Center - Othello (G, CMP, CF) 5 3 19.pdf', '505255'],
            ['Avamere Bellingham (Lift Sp, BIC) 3 23 20.pdf', '505223'],
            ['Avamere Bellingham SFF removal 6 17.pdf', '505223'],
            ['Ballard Center (Lift SP, BIC) 12 20 17.pdf', '505042'],
            ['Ballard Center (Lift SP, Cond) 5 15 20.pdf', '505042'],
            ['Ballard Center (Lift SP, Still OOC) 12 28 18.pdf', '505042'],
            ['Bremerton_Convalescent_and_Rehabilitation_Center_January_20_2017_letter_3.pdf', '505123'],
            ['Brookfield Cascadia (Hx D prior G, prior G) 2 28 19.pdf', '505331'],
            ['Brookfield Cascadia (Lift SP, BIC) 5 22 18.pdf', '505331'],
            ['Canterbury House (D prior D) 9 18 18.pdf', '505344'],
            ['Cashmere Care Center (G, CMP, CF) 5 25 18.pdf', '505151'],
            ['Columbia Lutheran Home (G, CMP, CF) 5 17.pdf', '505470'],
            ['Delta Rehab (IJ, R, SUB, CMP, CF) 6 14 18.pdf', '505467'],
            ['Fir Lane  BIC - Lift Amended SP 3 17.pdf', '505230'],
            ['Fir Lane  BIC - Lift SP 3 17.pdf', '505230'],
            ['Fir Lane (Lift Sp, Cont Cond, still OOC) 5 8 20.pdf', '505230'],
            ['Fir Lane Health and Rehab (IJ, R, CMP, CF) 4 17.pdf', '505230'],
            ['Fircrest School Pat N (Hx E, prior D) 6 28 18.pdf', '50A260'],
            ['Forks Community Hospital (G, CMP) 1 28 19.pdf', '50A174'],
            ['Foss Home and Village (BIC, Lift SP) 10 26 18.pdf', '505416'],
            ['Foss Home and Village (Reduce CF-settelment) 8 12 19.pdf', '505416'],
            ['Franklin Hills (Lift Sp, BIC) 8 17.pdf', '505024'],
            ['Frontier Rehab (Hx D prior D) 1 10 19.pdf', '505276'],
            ['Frontier Rehab (Hx D prior E) 11 1 19.pdf', '505276'],
            ['Garden Village (Health BIC, LSC OOC) 8 27 19.pdf', '505010'],
            ['Gardens on University (GG, CMP, CF) 12 11 17.pdf', '505114'],
            ['Good Samaritian Society - Stafholt (3OOC, G, prior G, prior D) 6 7 19.pdf', '505395'],
            ['Hallmark Manor BIC lift SP 4 17.pdf', '505313'],
            ['Hearthstone (G, CMP, CF) 1 12 18.pdf', '505027'],
            ['Heartwood Extended Health Care (amended-rescind CF)1 3 19.pdf', '505326'],
            ['Heartwood Extended Health Care (rescind CF).pdf', '505326'],
            ['Life Care of Kennewick (BIC-Lift SP) 5 31 2018.pdf', '505080'],
            ['Linden Grove (Lift Sp, BIC) 6 24 20.pdf', '505485'],
            ['Lynnwood Post Acute (IJ DPOC notice) 7 17 20.pdf', '505434'],
            ['Lynnwood Post Acute (IJ, Pend, NR, SP, Cond) 7 1 20.pdf', '505434'],            
            ['Manor Care Lynnwood (Hx E prior D) 7 29 19.pdf', '505319'],
            ['Manor Care - Salmon Creek (GG, CMP, CF) 12 17 19.pdf', '505522'],
            ['Manor Care - Gig Harbor (Lift SP, BIC) 11 8 18.pdf', '505436'],
            ['Montesano_Health_and_Rehabilitation_March_7_2017_letter_30.pdf', '505503'],
            ['Montesano (G, CMP, CF) 3 17.pdf', '505503'],
            ['Montesano (GG, CMP, CF 3 17 17 SOD ) 3 17.pdf', '505503'],
            ['North_Central_Care_and_Rehabilitation_February_8_2017_letter_32.pdf', '505441'],
            ['Pacific Care (Lift SP, Still OOC) 10 18 18.pdf', '505081'],
            ['Paramount (BIC - Lift Cond, SP) 5 17.pdf', '505511'],
            ['Paramount (BIC - Lift Cond) 5 30 18.pdf', '505511'],
            ['Park Ridge Care Center (Lift SP, BIC) 6 11 20.pdf', '505009'],
            ['Park Shore (FP, Hx prior G, cmp, CF) 9 20 2018.pdf', '505493'],
            ['Park Shore (G, CMP, CF) 7 2 18.pdf', '505493'],
            ['Prestige Care - Clarkston (Hx D prior E, CMP) 3 17.pdf', '505283'],
            ['Prestige Care and Rehab - Burlington (G, CMP, CF) 6 17.pdf', '505378'],
            ['Prestige Post Acute Care - Edmonds (Lift SP, BIC) 5 24 18.pdf', '505527'],
            ['Providence St Joseph Care Center (BIC - Lift Cond) 5 17.pdf', '505414'],
            ['Queen Anne Healthcare (Lift SP, BIC) 10 24 18.pdf', '505204'],
            ['Regency North Bend (BIC, Lift SP) 12 15 17.pdf', '505339'],
            ['Regency North Bend (Direct Care hours, CF) 1 8 18.pdf', '505339'],
            ['Royal Park (GG, CMP, CF) 09 22 17.pdf', '505379'],
            ['Saint Anne Nursing and Rehab (Lift SP, BIC) 3 18 19.pdf', '505417'],            
            ['Seattle Medical Post Acute Care (Hx G, prior F, CF, CMP) 3 2 18.pdf', '505311'],
            ['Shoreline (Lift SP, Still OOC) 11 21 17.pdf', '505262'],
            ['Shuksan Healthcare (3OOC, G, CF, CMP, prior D, prior L) 10 31 19.pdf', '505098'],
            ['Shuksan Healthcare (4OOC, D prior G, CF, CMP, prior D, prior L) 11 14 19.pdf', '505098'],
            ['Snohomish Health and Rehab (GG, CMP, CF) 4 17.pdf', '505338'],
            ['Snohomish Health and Rehab (HX D prior G) 1 21 20.pdf', '505338'],
            ['St Francis of Bellingham (Lift SP, BIC) 3 5 18.pdf', '505296'],
            ['St Francis of Bellingham (FP, D prior G, Lift SP) 6 10 20.pdf', '505296'],
            ['Tacoma Lutheran (Withdraw citation) 1 30 19.pdf', '505435'],
            ['Tacoma Nursing and Rehab (2nd FP, WAC only) 1 15 20.pdf', '505154'],
            ['Tacoma Nursing and Rehab (FP, WAC only) 12 06 19.pdf', '505154'],
            ['The Gardens on Univeristy (Hx G, CMP, CF, prior D) 6 19 18.pdf', '505114'],
            ['The Oaks at Forest Bay (Lift SP Cond) 10 18 19.pdf', '505214'],
            ['The Oaks at Forest Bay (Lift SP, Still OOC) 3 8 17.pdf', '505214'],
            ['University Place (Lift SP, Cond, BIC) 3 10 20.pdf', '505473'],
            ['University Place Rehab Center (Rescinded CF) Jan 18.pdf', '505473'],
            ['View Ridge (BIC, Lift SP, Cond) 3 13 2018.pdf', '505362'],
            ['View Ridge (IJ, OSFM, SP, IJ, ) 1 30 18.pdf', '505362'],
            ['Willapa Harbor (IJ, SUB, CMP, CF) 6 17.pdf', '505349'],
            ['Willapa_Harbor_Health_and_Rehab_February_14_2017_last_letter.pdf', '505349'],
           ]

lost_map = pd.DataFrame(lost_map, columns=['pdf_name', 'manual_cms_number'])

# Consistency test: Are all these PDFs present in df_letters?
assert lost_map['pdf_name'].isin(df_letters['pdf_name']).all()
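
Two further checks may be worth running on a hand-built table like this one: that no PDF is listed twice (a duplicate would multiply rows in a later join), and that every hand-typed number exists in the CMS data. The following is a minimal sketch on toy data; `mapping`, `cms_ids` and the values in them are hypothetical stand-ins for `lost_map` and `df_sod_wa['facility_id']`.

```python
import pandas as pd

# Toy stand-ins (hypothetical values)
mapping = pd.DataFrame(
    [['a.pdf', '505223'], ['b.pdf', '505255'], ['c.pdf', '505999']],
    columns=['pdf_name', 'manual_cms_number'],
)
cms_ids = pd.Series(['505223', '505255'])

# PDFs that appear more than once in the mapping table
dup = mapping[mapping['pdf_name'].duplicated(keep=False)]

# Hand-typed numbers that do not exist in the CMS data
unknown = mapping[~mapping['manual_cms_number'].isin(cms_ids)]

print('Duplicated PDFs:', len(dup))
print('Numbers not found in CMS:', unknown['manual_cms_number'].tolist())
```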

Let's run a visual test, comparing the facility names embedded in each PDF's title with the facility names that the mapping above matches with the CMS data.

In [11]:
df_test = not_in_cms.join(lost_map.set_index('pdf_name'), on='pdf_name', how='inner').reset_index(drop=True)
df_test = df_test.join(df_sod_wa.set_index('facility_id'), on='manual_cms_number', how='left')
df_test = df_test[['pdf_name', 'facility_name']].drop_duplicates().reset_index(drop=True)

# df_test.to_csv('/Users/mvilla/Downloads/cms_mapping_TRASHABLE.csv', index=False)
del(df_test)

A visual survey confirmed that all the facility names that came out from the mapping were consistent with the PDF names.
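
The matching in this script was done by eye. As a possible aid for that kind of review (not the method used here), a fuzzy string comparison can shortlist candidate CMS names for each PDF title. The sketch below uses `difflib.get_close_matches` from the standard library; the helpers `facility_from_pdf` and `suggest_cms_names`, and the `cutoff` value, are assumptions made for illustration.

```python
import difflib
import re

def facility_from_pdf(pdf_name):
    """Extract a rough facility name from a PDF title."""
    stem = re.sub(r'\.pdf$', '', pdf_name, flags=re.IGNORECASE)
    # Drop the parenthesised codes and everything after them
    stem = stem.split('(')[0].replace('_', ' ')
    return stem.strip().upper()

def suggest_cms_names(pdf_name, cms_names, n=3, cutoff=0.4):
    """Return up to n CMS facility names that resemble the PDF title, best first."""
    return difflib.get_close_matches(facility_from_pdf(pdf_name), cms_names,
                                     n=n, cutoff=cutoff)

# Two facility names as they appear in the CMS data
cms_names = ['WILLAPA HARBOR HEALTH AND REHAB', 'ADVANCED POST ACUTE']

candidates = suggest_cms_names('Willapa Harbor (IJ, SUB, CMP, CF) 6 17.pdf', cms_names)
print(candidates)
```

Suggestions like these would still need a human check, since similarly named facilities can have different federal numbers.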

How many non-matching PDFs do we still have?

In [12]:
not_in_cms_still = set(not_in_cms['pdf_name']).difference(set(lost_map['pdf_name']))

print('There are still', len(not_in_cms_still), 'enforcement letters whose facility numbers/names we could not find')
not_in_cms_still

There are still 13 enforcement letters whose facility numbers/names we could not find


{'Cristwood Nursing and Rehab (GG, CMP, CF) 6 21 18.pdf',
 'Emerald Hills (GG, CMP, CF, Sub) 3 17.pdf',
 'Emerald Hills BIC 6 17.pdf',
 'Emerald_Hills_Rehabilitation_March_10_2017_letter_10.pdf',
 'Fidalgo (BIC, Cond remain) 10 11 19.pdf',
 'Fidalgo (GG, CMP, CF) 7 11 18.pdf',
 'Franke Tobey Jones (FP, CF) 3 6 20.pdf',
 'Franklin Hills Notice of DPOC.pdf',
 'Josephine Sunset (Lift SP cont OOC) 3 17.pdf',
 'Josephine_Sunset_Home_March_3_2017_letter_16.pdf',
 'Kindred Arden (BIC Lift SP) 6 17.pdf',
 'Pacific_Specialty_and_Rehabilitative_Care_January_23_2017_letter_34.pdf',
 'Victory Health and Rehab (G, CMP, CF) 3 16.pdf'}

### Mapping the missing federal numbers in *df_letters*

First, let's reduce *df_letters* by eliminating columns we don't need.

In [13]:
print(df_letters.shape)

drop_cols = ['page', 'image_based', 'vendor_num','finding_desc', 
             'subchp_num', 'subchp_name', 'section_desc']
df_letters = df_letters.drop(drop_cols, axis=1).drop_duplicates().reset_index(drop=True)

print(df_letters.shape)

(4958, 25)
(4666, 18)


Split *df_letters* into two DFs, according to whether a PDF name appears in the manual mapping table (*lost_map*) or not.

In [14]:
df_letters_lost = pd.DataFrame(df_letters[df_letters['pdf_name'].isin(lost_map['pdf_name'])])
df_letters_present = pd.DataFrame(df_letters[~df_letters['pdf_name'].isin(lost_map['pdf_name'])])

print(df_letters_lost.shape)
print(df_letters_present.shape)

# Consistency tests
assert len(set(df_letters_lost.index).intersection(set(df_letters_present.index))) == 0
assert set(df_letters_lost.index).union(set(df_letters_present.index)) == set(df_letters.index)

(194, 18)
(4472, 18)


In both dataframes, add a new column that will contain updated federal numbers.

In [15]:
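# The fed numbers in df_letters_present are kept as scraped, so the new column is just a copy of 'fed_num'
# (note: the 13 letters we could not map are in this group and keep their faulty or missing numbers)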
df_letters_present['manual_cms_number'] = df_letters_present['fed_num']

# For df_letters_lost, the new column will result from joining with lost_map
df_letters_lost = df_letters_lost.join(lost_map.set_index('pdf_name'), on='pdf_name', how='left')

Concatenate both DFs back together.

In [16]:
df_letters_old = df_letters

del(df_letters)
df_letters = pd.concat([df_letters_present, df_letters_lost])
df_letters = df_letters.sort_index()

# Consistency test
assert (df_letters.index == df_letters_old.index).all()

# We don't need 'fed_num' column any more. (And we never really needed 'aem_num' either)
df_letters = df_letters.drop(['fed_num', 'aem_num'], axis=1)

# Eliminate obsolete DFs
del(df_letters_old, df_letters_lost, df_letters_present)
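
The split, join and concatenate steps above could also be condensed into a single lookup. The following is a minimal sketch on toy data, assuming each PDF appears at most once in the mapping table; `letters` and `mapping` are hypothetical stand-ins for *df_letters* and *lost_map*.

```python
import pandas as pd

# Toy stand-ins (hypothetical file names)
letters = pd.DataFrame({
    'pdf_name': ['a.pdf', 'a.pdf', 'b.pdf', 'c.pdf'],
    'fed_num': ['505223', '505223', '505X255', None],
})
mapping = pd.DataFrame({'pdf_name': ['b.pdf'],
                        'manual_cms_number': ['505255']})

# Look up the manual number by PDF name; where there is none,
# fall back to the number that was scraped from the letter
lookup = mapping.set_index('pdf_name')['manual_cms_number']
letters['manual_cms_number'] = (
    letters['pdf_name'].map(lookup).fillna(letters['fed_num'])
)

print(letters)
```

Because nothing is split or re-sorted, the original index and row order are preserved without a separate consistency test.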

### Add facility names to *df_letters*

Now that we have most of the CMS numbers available in *df_letters*, let's map in the facility names.

In [17]:
temp = df_sod_wa[['facility_id', 'facility_name']].drop_duplicates()
df_letters = df_letters.join(temp.set_index('facility_id'), on='manual_cms_number', how='left')
del(temp)
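
The counts printed in cell [5] show 225 distinct facility names but only 224 distinct facility IDs, which suggests that at least one ID may carry two names. If so, a left join on the ID would duplicate the matching rows. The sketch below, on toy data with made-up facility names, shows how `merge` with `validate='many_to_one'` can detect this, and one possible way to resolve it. Keeping the last name per ID is an arbitrary choice made for the example.

```python
import pandas as pd

# Toy stand-ins (hypothetical names); ID 505255 carries two names
letters = pd.DataFrame({'pdf_name': ['a.pdf', 'b.pdf'],
                        'manual_cms_number': ['505223', '505255']})
names = pd.DataFrame({
    'facility_id': ['505223', '505255', '505255'],
    'facility_name': ['FACILITY A', 'FACILITY B', 'FACILITY B (RENAMED)'],
})

# validate='many_to_one' raises MergeError if the right-hand keys are not unique
try:
    letters.merge(names, how='left', left_on='manual_cms_number',
                  right_on='facility_id', validate='many_to_one')
    duplicated_keys = False
except pd.errors.MergeError:
    duplicated_keys = True

# One possible fix: keep a single name per facility ID before joining
names_unique = names.drop_duplicates(subset='facility_id', keep='last')
merged = letters.merge(names_unique, how='left', left_on='manual_cms_number',
                       right_on='facility_id', validate='many_to_one')

print(duplicated_keys, len(merged))
```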

Reorder columns.

In [18]:
df_letters = df_letters[['pdf_name', 'manual_cms_number', 'facility_name', 'letter_dt', 'survey_dt', 'survey_type',
                         'wac_short', 'wac_long','section', 'cmp_item', 'cmp_agg',
                         'finding_code', 'action', 'fed_enforcement', 'nc_hist', 'epoc', 'state_rem', 'appeal_rights']]

In [19]:
df_letters

Unnamed: 0,pdf_name,manual_cms_number,facility_name,letter_dt,survey_dt,survey_type,wac_short,wac_long,section,cmp_item,cmp_agg,finding_code,action,fed_enforcement,nc_hist,epoc,state_rem,appeal_rights
0,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",505355,ADVANCED POST ACUTE,2018-04-12,2018-04-06,unannounced complaint investigation,,,,,,G,IMPOSITION OF CIVIL FINES,False,False,True,False,False
1,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",505355,ADVANCED POST ACUTE,2018-04-12,2018-04-06,unannounced complaint investigation,388-97-1060,388-97-1060(3)(g),Quality of Care,1000.0,,,IMPOSITION OF CIVIL FINES,False,False,False,True,False
2,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",505355,ADVANCED POST ACUTE,2018-04-12,2018-04-06,unannounced complaint investigation,,,,,,,IMPOSITION OF CIVIL FINES,False,False,False,False,True
3,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",505355,ADVANCED POST ACUTE,2018-04-12,2018-04-06,unannounced complaint investigation,,,,,1000.0,,IMPOSITION OF CIVIL FINES,False,False,False,False,False
4,"Advance Post Acute (G, CMP, CF) 4 6 18.pdf",505355,ADVANCED POST ACUTE,2018-04-12,2018-04-06,unannounced complaint investigation,,,,,,,IMPOSITION OF CIVIL FINES,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4661,Willapa_Harbor_Health_and_Rehab_February_3_2017_letter_59.pdf,505349,WILLAPA HARBOR HEALTH AND REHAB,2017-02-03,2017-01-31,complaint investigation and partial extended survey,388-97-0640,388-97-0640(5)(6)(a),Resident Rights,3000.0,,,,False,False,False,False,True
4662,Willapa_Harbor_Health_and_Rehab_February_3_2017_letter_59.pdf,505349,WILLAPA HARBOR HEALTH AND REHAB,2017-02-03,2017-01-31,complaint investigation and partial extended survey,388-97-0640,388-97-0640(2)(a)(b),Resident Rights,3000.0,,,,False,False,False,False,True
4663,Willapa_Harbor_Health_and_Rehab_February_3_2017_letter_59.pdf,505349,WILLAPA HARBOR HEALTH AND REHAB,2017-02-03,2017-01-31,complaint investigation and partial extended survey,388-97-1060,388-97-1060(3)(b),Quality of Care,1000.0,,,,False,False,False,False,True
4664,Willapa_Harbor_Health_and_Rehab_February_3_2017_letter_59.pdf,505349,WILLAPA HARBOR HEALTH AND REHAB,2017-02-03,2017-01-31,complaint investigation and partial extended survey,388-97-1620,388-97-1620(1),Administration,3000.0,,,,False,False,False,False,True


# IV. Comparing periods covered in both dataframes

Do both dataframes cover the same period?

In [20]:
print('Period covered by the deficiencies data:')
print(df_sod_wa['inspection_dt'].min())
print(df_sod_wa['inspection_dt'].max())
print('\n')

print('Period covered by the enforcement letters:')
print(df_letters['survey_dt'].min())
print(df_letters['survey_dt'].max())

Period covered by the deficiencies data:
2017-01-03 00:00:00
2020-03-03 00:00:00


Period covered by the enforcement letters:
2016-01-20 00:00:00
2020-08-18 00:00:00


The enforcement letters start earlier (January 2016 vs. January 2017) and end later (August 2020 vs. March 2020) than the deficiencies data. This is something to keep in mind in the next few scripts, when we do the analysis.
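
One way to account for this later is to restrict comparisons to the window covered by both datasets. The sketch below is a minimal example, not a step this script performs. It uses the boundary dates printed above plus one survey date from the letters, and the names `inspections`, `surveys` and `in_overlap` are hypothetical.

```python
import pandas as pd

# Boundary dates reported above, plus one survey date inside the window
inspections = pd.Series(pd.to_datetime(['2017-01-03', '2020-03-03']))
surveys = pd.Series(pd.to_datetime(['2016-01-20', '2018-04-06', '2020-08-18']))

# The overlap runs from the later start to the earlier end
start = max(inspections.min(), surveys.min())
end = min(inspections.max(), surveys.max())

# Flag the surveys that fall inside the overlap (bounds included)
in_overlap = surveys.between(start, end)

print(start.date(), end.date())
print(in_overlap.tolist())
```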

# V. Export

In [21]:
# Updated scraped data
df_letters.to_csv('../C_output_data/scraped_data_v2.csv', index=False)