# What this script does

We create a dataset of all deficiencies committed by WA-based nursing homes from 2017 onward.

# I. SETTINGS

In [1]:
import pandas as pd
from os import listdir
import re

# II. IMPORT DATA

## Deficiencies (CMS)

In [2]:
df_sod_wa_orig = pd.read_csv('/Users/mvilla/Documents/repos/covid19_nursing_homes_big_data/cms_WA/cms_WA.csv')

In [3]:
assert (df_sod_wa_orig['state'] == 'WA').all()
print(df_sod_wa_orig.shape)

(25750, 15)


## F-Tags (CMS)

The SOD dataframe above contains the tag code for each deficiency recorded, but it doesn't contain the general group that each tag belongs to. That information is contained in [this list of the revised F-tags](https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/GuidanceforLawsAndRegulations/Downloads/List-of-Revised-FTags.pdf).

A version of that list is in this [F-Tag crosswalk Excel file](https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/GuidanceforLawsAndRegulations/Downloads/F-Tag-Crosswalk.xlsx). This is the data we import now and will add to the SOD dataset later in the script:

In [4]:
df_tags_orig = pd.read_excel('../A_source_data/CMS/LTC FTags_Phase 2_Crosswalk.xlsx',
                             sheet_name='Sortable by Tags', usecols='A:H')
df_tags_orig.columns = ['tag', 'sqc_tag?', 'tag_title', 'cfr', 'tag_group', 'phase3', 'tag_old', 'moved_text']

So now we have one dataframe that contains all the deficiencies found in all surveys carried out, and another dataframe that contains detailed information about the tags used to classify those deficiencies. We need to join the two.

## Severity code descriptions

The SOD dataframe also contains codes for the severity of each deficiency, but not a description of the severity level behind each code. Those descriptions can be found in the document [Design for Nursing Home Compare Five-Star Quality Rating System: Technical Users’ Guide](https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/downloads/usersguide.pdf). The following mapping is based on that document:

In [5]:
severity = [['A', 'No actual harm with potential for minimal harm - Isolated'],
            ['B', 'No actual harm with potential for minimal harm - Pattern'],
            ['C', 'No actual harm with potential for minimal harm - Widespread'],
            ['D', 'No actual harm with potential for more than minimal harm that is not immediate jeopardy - Isolated'],
            ['E', 'No actual harm with potential for more than minimal harm that is not immediate jeopardy - Pattern'],
            ['F', 'No actual harm with potential for more than minimal harm that is not immediate jeopardy - Widespread'],
            ['G', 'Actual harm that is not immediate jeopardy - Isolated'],
            ['H', 'Actual harm that is not immediate jeopardy - Pattern'],
            ['I', 'Actual harm that is not immediate jeopardy - Widespread'],
            ['J', 'Immediate jeopardy to resident health or safety - Isolated'],
            ['K', 'Immediate jeopardy to resident health or safety - Pattern'],
            ['L', 'Immediate jeopardy to resident health or safety - Widespread']]

severity = pd.DataFrame(severity, columns=['scope_severity', 'severity_desc'])
severity

# Consistency test
assert set(df_sod_wa_orig['scope_severity']).issubset(set(severity['scope_severity']))
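The twelve codes in the mapping above form a grid: each consecutive block of three letters shares a harm level, and within a block the letters run Isolated, Pattern, Widespread. A minimal sketch of that structure follows. The helper `split_code` and the shortened level labels are illustrative assumptions, not part of the original script or of the CMS guide.

```python
import pandas as pd

# Shortened labels (our own wording), ordered as in the mapping above.
LEVELS = ['Potential for minimal harm',
          'Potential for more than minimal harm',
          'Actual harm',
          'Immediate jeopardy']
SCOPES = ['Isolated', 'Pattern', 'Widespread']

def split_code(code):
    """Hypothetical helper: split a letter A-L into (harm level, scope)."""
    pos = ord(code) - ord('A')          # A -> 0, B -> 1, ..., L -> 11
    return LEVELS[pos // 3], SCOPES[pos % 3]

# One row per code, with the two dimensions in separate columns
grid = pd.DataFrame([(c, *split_code(c)) for c in 'ABCDEFGHIJKL'],
                    columns=['scope_severity', 'harm_level', 'scope'])
print(grid)
```

Keeping harm level and scope in separate columns can make later grouping easier than working with the combined description string.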

# III. Processing

In [6]:
# Create working copy
df_sod_wa = df_sod_wa_orig.copy()

# Add the severity descriptions
df_sod_wa = df_sod_wa.join(severity.set_index('scope_severity'), on='scope_severity', how='left')

# Create a proper date column
df_sod_wa['inspection_dt'] = pd.to_datetime(df_sod_wa['inspection_date'])

# Eliminate unnecessary fields and reset index
df_sod_wa = df_sod_wa.drop(['address', 'city', 'state', 'zip', 'cms_region',
                            'inspection_date', 'inspection_text'], axis=1)
df_sod_wa = df_sod_wa.drop_duplicates().reset_index(drop=True)

# Change some column names into something easier to use
df_sod_wa = df_sod_wa.rename(columns={'deficiency_tag':'tag', 
                                      'scope_severity':'severity_code'})

In [7]:
print(df_sod_wa.shape)
df_sod_wa.head()

(25750, 10)


Unnamed: 0,facility_name,facility_id,tag,severity_code,complaint,standard,eventid,filedate,severity_desc,inspection_dt
0,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,514,D,1.0,0.0,DY4R11,2020-06-01,No actual harm with potential for more than mi...,2017-04-26
1,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,253,E,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29
2,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,279,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29
3,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,328,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29
4,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,329,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29


### What does each row of this dataframe represent?

Question: Is each row a unique combination of facility/survey/deficiency?

In [8]:
temp = df_sod_wa[['facility_id', 'eventid', 'tag']]

print(len(temp.drop_duplicates()))
print(len(df_sod_wa))

25693
25750


There are a few duplicated records. Let's eliminate them.

In [9]:
df_sod_wa = df_sod_wa[~df_sod_wa[['facility_id', 'eventid', 'tag']].duplicated()]
df_sod_wa = df_sod_wa.reset_index(drop=True)

Now let's confirm that each row is a unique combination of facility/survey/deficiency:

In [10]:
temp = df_sod_wa[['facility_id', 'eventid', 'tag']]

print(len(df_sod_wa))
assert len(temp.drop_duplicates()) == len(df_sod_wa)
del(temp)

25693


That is indeed the case. So each row seems to represent:
- a single deficiency
- found at a particular facility
- during a particular event (i.e., inspection or investigation)
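The uniqueness check and clean-up above can be sketched on a toy dataframe. The rows below are invented for illustration; only the three key column names come from the script.

```python
import pandas as pd

# Invented rows: the first two share the same facility/event/tag key
toy = pd.DataFrame({
    'facility_id': [1, 1, 1, 2],
    'eventid': ['A11', 'A11', 'A11', 'B11'],
    'tag': [514, 514, 253, 514],
    'severity_code': ['D', 'E', 'D', 'D'],
})
key = ['facility_id', 'eventid', 'tag']

# True for the second and later occurrences of a key
is_repeat = toy.duplicated(subset=key)
deduped = toy[~is_repeat].reset_index(drop=True)

print(is_repeat.tolist())
print(len(toy), '->', len(deduped))
```

Note that `duplicated` keeps the first occurrence of each key, so when repeated rows differ in other columns (here `severity_code`), the surviving row depends on row order.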

### What is the time period covered by the dataset?

In [11]:
print(df_sod_wa['inspection_dt'].min())
print(df_sod_wa['inspection_dt'].max())

2010-04-09 00:00:00
2020-03-03 00:00:00


To be consistent with the enforcement letters dataset, we reduce this deficiencies dataset to records from 2017 onward.

In [12]:
df_sod_wa = df_sod_wa[df_sod_wa['inspection_dt']>='2017-01-01']
df_sod_wa = df_sod_wa.reset_index(drop=True)
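As a small illustration of the date cutoff, here is a sketch on invented dates. It uses an explicit `pd.Timestamp` for the boundary, which behaves the same as the string comparison in the cell above but makes the type of the comparison visible.

```python
import pandas as pd

# Invented inspection dates on both sides of the cutoff
toy = pd.DataFrame({'inspection_date': ['2016-12-30', '2017-01-03', '2020-03-03']})
toy['inspection_dt'] = pd.to_datetime(toy['inspection_date'])

cutoff = pd.Timestamp('2017-01-01')
recent = toy[toy['inspection_dt'] >= cutoff].reset_index(drop=True)

print(recent['inspection_dt'].min(), recent['inspection_dt'].max())
```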

In [13]:
print(df_sod_wa['inspection_dt'].min())
print(df_sod_wa['inspection_dt'].max())

2017-01-03 00:00:00
2020-03-03 00:00:00


In [14]:
print(df_sod_wa.shape)
df_sod_wa.nunique()

(11915, 10)


facility_name     225
facility_id       224
tag               290
severity_code      11
complaint           2
standard            2
eventid          2075
filedate           23
severity_desc      11
inspection_dt     742
dtype: int64

# IV. JOINING BOTH DATAFRAMES

In [15]:
df_tags = df_tags_orig.copy()

In [16]:
# df_sod_wa
print('There are', df_sod_wa['tag'].nunique(), 'different tags in df_sod_wa\n')

# df_tags
print('There are', df_tags['tag'].nunique(), 'different NEW tags in df_tags')
print('There are', df_tags['tag_old'].nunique(), 'different OLD tags in df_tags')

There are 290 different tags in df_sod_wa

There are 205 different NEW tags in df_tags
There are 176 different OLD tags in df_tags
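The counts alone do not say which tags overlap. One way to compare the two sides is with set differences, sketched here on invented tag values. It assumes, as the displayed data suggests, that the deficiencies store tags as integers while the crosswalk stores them as strings with an `F` prefix.

```python
import pandas as pd

# Invented values: integer tags on one side, 'F'-prefixed strings on the other
sod_tags = pd.Series([514, 253, 657, 9999])
map_tags = pd.Series(['F514', 'F253', 'F657', 'F550'])

# Bring both sides to plain Python integers before comparing
sod_tag_numbers = set(sod_tags.tolist())
map_tag_numbers = set(map_tags.str.replace('F', '', regex=False).astype(int).tolist())

unmapped = sorted(sod_tag_numbers - map_tag_numbers)  # used in data, absent from crosswalk
unused = sorted(map_tag_numbers - sod_tag_numbers)    # in crosswalk, never used in data

print('Unmapped:', unmapped)
print('Unused:', unused)
```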


## Building a mapping table via tag numbers

First of all, based on a comparison of the two documents mentioned earlier ([the list of the revised F-tags](https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/GuidanceforLawsAndRegulations/Downloads/List-of-Revised-FTags.pdf) and the [F-Tag crosswalk spreadsheet](https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/GuidanceforLawsAndRegulations/Downloads/F-Tag-Crosswalk.xlsx)) there are a few adjustments that need to be made.

In [17]:
df_tags['tag_group'] = df_tags['tag_group'].str.replace('483.20  Resident Assessments\n483.70  Administration', '483.70  Administration')
df_tags['tag_group'] = df_tags['tag_group'].str.replace('483.10 Resident Rights\n483.12  Freedom from Abuse, Neglect, and Exploitation', '483.12  Freedom from Abuse, Neglect, and Exploitation')
df_tags['tag_group'] = df_tags['tag_group'].str.replace('483.10 Resident Rights\n483.90  Physical Environment', '483.90  Physical Environment')

Now we split the tag group numbers and names.

In [18]:
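# Raw strings avoid invalid-escape warnings; regex=True is required in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default
df_tags['tag_group_num'] = df_tags['tag_group'].str.extract(r'(\d+\.\d+)')
df_tags['tag_group_name'] = df_tags['tag_group'].str.replace(r'\d+\.\d+', '', regex=True).str.strip()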
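The split can be checked on a few sample strings. This is a minimal sketch with invented input; `expand=False` is used here only so that `str.extract` returns a Series instead of a one-column dataframe.

```python
import pandas as pd

groups = pd.Series(['483.10 Resident Rights',
                    '483.12  Freedom from Abuse, Neglect, and Exploitation',
                    None])

# The section number stays a string, so '483.10' keeps its trailing zero
num = groups.str.extract(r'(\d+\.\d+)', expand=False)

# Remove the number, then trim the leftover whitespace
name = groups.str.replace(r'\d+\.\d+', '', regex=True).str.strip()

print(num.tolist())
print(name.tolist())
```

Missing group labels simply stay missing in both derived columns.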

In [19]:
df_tags

Unnamed: 0,tag,sqc_tag?,tag_title,cfr,tag_group,phase3,tag_old,moved_text,tag_group_num,tag_group_name
0,F540,,Definitions,483.5,,,F150,483.5,,
1,F550,X,Resident Rights/Exercise of Rights,483.10(a)(1)(2)(b)(1)(2),483.10 Resident Rights,,F151,483.10(b)(1)(2),483.10,Resident Rights
2,F551,,Rights Exercised by Representative,483.10(b)(3)-(7)(i)-(iii),483.10 Resident Rights,,F152,483.10(b)(3)-(7),483.10,Resident Rights
3,F573,,Right to Access/Purchase Copies of Records,483.10(g)(2)(i)(ii)(3),483.10 Resident Rights,,F153,483.10(g)(2)(3),483.10,Resident Rights
4,F552,,Right to be Informed/Make Treatment Decisions,483.10(c)(1)(4)(5),483.10 Resident Rights,,F154,483.10(c)(1)(4)(5),483.10,Resident Rights
...,...,...,...,...,...,...,...,...,...,...
244,F942,,Resident’s Rights Training,483.95(b),483.95 Training Requirements,Entire tag - Phase 3\nWill not be in ASPEN unt...,,No Associated Tag,483.95,Training Requirements
245,F944,,QAPI Training,483.95(d),483.95 Training Requirements,Entire tag - Phase 3\nWill not be in ASPEN unt...,,No Associated Tag,483.95,Training Requirements
246,F945,,Infection Control Training,483.95(e),483.95 Training Requirements,Entire tag - Phase 3\nWill not be in ASPEN unt...,,No Associated Tag,483.95,Training Requirements
247,F946,,Compliance and Ethics Training,483.95(f)(1)(2),483.95 Training Requirements,Entire tag - Phase 3\nWill not be in ASPEN unt...,,No Associated Tag,483.95,Training Requirements


In [20]:
# New tags
df_tags_new = pd.DataFrame(df_tags[['tag', 'tag_group_num', 'tag_group_name']])
df_tags_new['tag_old_new'] = 'New'

# Old tags
df_tags_old = pd.DataFrame(df_tags[['tag_old', 'tag_group_num', 'tag_group_name']])
df_tags_old = df_tags_old.rename(columns={'tag_old':'tag'})
df_tags_old['tag_old_new'] = 'Old'

# Old and new together
df_tag_map = pd.concat([df_tags_new, df_tags_old], axis=0)

# Reduce and tidy up
df_tag_map = df_tag_map.dropna(axis=0, how='any')
df_tag_map = df_tag_map.drop_duplicates()
df_tag_map = df_tag_map.sort_values(['tag', 'tag_group_num'], ascending=True)
df_tag_map = df_tag_map.reset_index(drop=True)

In [21]:
df_tag_map

Unnamed: 0,tag,tag_group_num,tag_group_name,tag_old_new
0,F151,483.10,Resident Rights,Old
1,F152,483.10,Resident Rights,Old
2,F153,483.10,Resident Rights,Old
3,F154,483.10,Resident Rights,Old
4,F155,483.10,Resident Rights,Old
...,...,...,...,...
385,F945,483.95,Training Requirements,New
386,F946,483.95,Training Requirements,New
387,F947,483.95,Training Requirements,New
388,F948,483.95,Training Requirements,New


In [22]:
df_tag_map['tag_group_name'].value_counts()

Resident Rights                                         77
Physical Environment                                    37
Administration                                          33
Quality of Care                                         29
Food and Nutrition Services                             28
Laboratory, Radiology, and Other Diagnostic Services    22
Resident Assessments                                    22
Nursing Services                                        17
Freedom from Abuse, Neglect, and Exploitation           16
Admission, Transfer, and Discharge                      15
Training Requirements                                   13
Comprehensive Resident Centered Care Plans              13
Pharmacy Services                                       13
Quality of Life                                         13
Physician Services                                      12
Behavioral Health Services                              10
Infection Control                                       

#### Check and adjust for double classifications

Back in 2018, some tags were renamed and/or reclassified. The mapping table addresses that reclassification, except for one situation: when a tag was not moved whole into another section, but was broken down into components that were then moved to different sections.

For example, the old tag F309 belonged to the *Quality of Care* regulatory grouping. When F309 was revised, it was broken down into components, and one of them was reclassified into the *Behavioral Health Services* group. So now when we do the mapping from old to new tags, each instance of F309 produces two records that are copies of each other, except for the grouping. In other words, we are double counting.

Here we adjust to correct that.
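To see why a repeated tag in the mapping table inflates the row count, consider this sketch on invented data (the group labels are placeholders). It also shows the `validate` argument of `merge`, which is not used in this script but can act as a guard against the problem.

```python
import pandas as pd

# Invented data: tag 309 appears twice in the mapping, under two groups
deficiencies = pd.DataFrame({'eventid': ['E1', 'E2'], 'tag': [309, 514]})
tag_map = pd.DataFrame({'tag': [309, 309, 514],
                        'tag_group_name': ['Group X', 'Group Y', 'Group Z']})

# A left join repeats the deficiency once per matching mapping row
joined = deficiencies.merge(tag_map, on='tag', how='left')
print(len(deficiencies), '->', len(joined))

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    deficiencies.merge(tag_map, on='tag', how='left', validate='many_to_one')
    guard_tripped = False
except pd.errors.MergeError:
    guard_tripped = True
print('Guard tripped:', guard_tripped)
```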

In [23]:
# If a tag shows up more than once in the mapping table,
# it was assigned to more than one tag group. Let's find those.
tag_counts = df_tag_map['tag'].value_counts()
repeated_tags = tag_counts[tag_counts > 1].index
double_count = df_tag_map[df_tag_map['tag'].isin(repeated_tags)]

double_count

Unnamed: 0,tag,tag_group_num,tag_group_name,tag_old_new
4,F155,483.1,Resident Rights,Old
5,F155,483.24,Quality of Life,Old
13,F164,483.1,Resident Rights,Old
14,F164,483.7,Administration,Old
41,F226,483.12,"Freedom from Abuse, Neglect, and Exploitation",Old
42,F226,483.95,Training Requirements,Old
68,F279,483.2,Resident Assessments,Old
69,F279,483.21,Comprehensive Resident Centered Care Plans,Old
70,F280,483.1,Resident Rights,Old
71,F280,483.21,Comprehensive Resident Centered Care Plans,Old


We found [an old CMS document](https://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/downloads/R5SOM.pdf) that sheds light on the previous grouping of the old codes. The following adjustments are made based on that document:

In [24]:
df_tag_map.loc[df_tag_map['tag']=='F155', ['tag_group_num','tag_group_name']] = ['483.10','Resident Rights']
df_tag_map.loc[df_tag_map['tag']=='F164', ['tag_group_num','tag_group_name']] = ['483.10','Resident Rights']
df_tag_map.loc[df_tag_map['tag']=='F280', ['tag_group_num','tag_group_name']] = ['483.10','Resident Rights']
df_tag_map.loc[df_tag_map['tag']=='F226', ['tag_group_num','tag_group_name']] = ['483.12','Freedom from Abuse, Neglect, and Exploitation']
df_tag_map.loc[df_tag_map['tag']=='F279', ['tag_group_num','tag_group_name']] = ['483.21','Comprehensive Resident Centered Care Plans']
df_tag_map.loc[df_tag_map['tag']=='F309', ['tag_group_num','tag_group_name']] = ['483.25','Quality of Care']
df_tag_map.loc[df_tag_map['tag']=='F461', ['tag_group_num','tag_group_name']] = ['483.90','Physical Environment']
df_tag_map.loc[df_tag_map['tag']=='F498', ['tag_group_num','tag_group_name']] = ['483.35','Nursing Services']

df_tag_map = df_tag_map.drop_duplicates().reset_index(drop=True)

In [25]:
df_tag_map

Unnamed: 0,tag,tag_group_num,tag_group_name,tag_old_new
0,F151,483.10,Resident Rights,Old
1,F152,483.10,Resident Rights,Old
2,F153,483.10,Resident Rights,Old
3,F154,483.10,Resident Rights,Old
4,F155,483.10,Resident Rights,Old
...,...,...,...,...
375,F945,483.95,Training Requirements,New
376,F946,483.95,Training Requirements,New
377,F947,483.95,Training Requirements,New
378,F948,483.95,Training Requirements,New


In [26]:
# After our reclassification, let's see if any tags are still assigned to more than one group.
tag_counts = df_tag_map['tag'].value_counts()
repeated_tags = tag_counts[tag_counts > 1].index
double_count = df_tag_map[df_tag_map['tag'].isin(repeated_tags)]
double_count = double_count.drop_duplicates()

double_count

Unnamed: 0,tag,tag_group_num,tag_group_name,tag_old_new
110,F373,483.6,Food and Nutrition Services,Old
111,F373,483.95,Training Requirements,Old


Just one. We checked, and it is not present in ***df_sod_wa***, so we leave it as is.
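A check of this kind could look like the following sketch. The data is invented, and it assumes both sides already hold tags as integers (in the script, the mapping table is only converted to integers a few cells later).

```python
import pandas as pd

# Invented data: tag 373 is ambiguous in the mapping but never cited
sod = pd.DataFrame({'tag': [514, 253, 279]})
tag_map = pd.DataFrame({'tag': [373, 373, 514],
                        'tag_group_name': ['Group X', 'Group Y', 'Group Z']})

# Tags that still map to more than one group
ambiguous = tag_map.loc[tag_map['tag'].duplicated(keep=False), 'tag'].unique().tolist()

# Deficiency rows that would be double counted by the join
affected = sod[sod['tag'].isin(ambiguous)]

print('Ambiguous tags:', ambiguous)
print('Affected rows:', len(affected))
```

If `affected` is empty, the remaining ambiguity cannot duplicate any row in the join.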

Now we standardize some of the tag group names to make them consistent with the way they are named in the WA state regulation.

In [27]:
df_tag_map['tag_group_name'] = df_tag_map['tag_group_name'].str.replace('Resident Rights.*', 'Resident Rights', regex=True)
df_tag_map['tag_group_name'] = df_tag_map['tag_group_name'].str.replace('Admission, Transfer, and Discharge', 'Admission, Transfer and Discharge')
df_tag_map['tag_group_name'] = df_tag_map['tag_group_name'].str.replace('Resident Assessments.*', 'Resident Assessment and Plan of Care', regex=True)
df_tag_map['tag_group_name'] = df_tag_map['tag_group_name'].str.replace('Specialized Rehabilitative Services', 'Specialized Habilitative and Rehabilitative Services')
df_tag_map['tag_group_name'] = df_tag_map['tag_group_name'].str.replace('Food and Nutrition Services', 'Food Services Areas')

# Reduce and tidy up
df_tag_map = df_tag_map.drop_duplicates().reset_index(drop=True)
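For the renames that are exact matches (no wildcard), an alternative is a single dictionary passed to `Series.replace`, which swaps whole values instead of substrings. This is a sketch of that alternative on sample values, not the approach the script takes.

```python
import pandas as pd

names = pd.Series(['Food and Nutrition Services',
                   'Admission, Transfer, and Discharge',
                   'Quality of Care'])

# Whole-value renames; names not listed are left unchanged
renames = {
    'Food and Nutrition Services': 'Food Services Areas',
    'Admission, Transfer, and Discharge': 'Admission, Transfer and Discharge',
}
standardized = names.replace(renames)

print(standardized.tolist())
```

Keeping the renames in one dictionary makes the full list of changes easy to review in one place.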

In [28]:
df_tag_map

Unnamed: 0,tag,tag_group_num,tag_group_name,tag_old_new
0,F151,483.10,Resident Rights,Old
1,F152,483.10,Resident Rights,Old
2,F153,483.10,Resident Rights,Old
3,F154,483.10,Resident Rights,Old
4,F155,483.10,Resident Rights,Old
...,...,...,...,...
375,F945,483.95,Training Requirements,New
376,F946,483.95,Training Requirements,New
377,F947,483.95,Training Requirements,New
378,F948,483.95,Training Requirements,New


In [29]:
df_tag_map['tag'] = df_tag_map['tag'].str.replace('F','').astype(int)

In [30]:
df_tag_map['tag_group_name'].value_counts(dropna=False)

Resident Rights                                         76
Physical Environment                                    37
Administration                                          32
Quality of Care                                         28
Food Services Areas                                     28
Laboratory, Radiology, and Other Diagnostic Services    22
Resident Assessment and Plan of Care                    21
Nursing Services                                        17
Freedom from Abuse, Neglect, and Exploitation           16
Admission, Transfer and Discharge                       15
Pharmacy Services                                       13
Comprehensive Resident Centered Care Plans              12
Physician Services                                      12
Quality of Life                                         11
Training Requirements                                   11
Behavioral Health Services                               9
Infection Control                                       

## Joining

In [31]:
df_sod_wa = df_sod_wa.join(df_tag_map.set_index('tag'), on='tag', how='left')
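After a left join like this one, any tag without a match in the mapping table ends up with a missing group. A minimal sketch of how to list such tags, using invented data (tag 999 is made up to force a miss):

```python
import pandas as pd

sod = pd.DataFrame({'tag': [514, 253, 999]})
tag_map = pd.DataFrame({'tag': [514, 253],
                        'tag_group_name': ['Administration', 'Resident Rights']})

joined = sod.join(tag_map.set_index('tag'), on='tag', how='left')

# Tags that found no group in the mapping table
unmatched = joined.loc[joined['tag_group_name'].isna(), 'tag'].unique().tolist()

print('Rows before/after:', len(sod), len(joined))
print('Unmatched tags:', unmatched)
```

Comparing the row count before and after the join is also a quick way to confirm that no row was duplicated.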

In [32]:
df_sod_wa

Unnamed: 0,facility_name,facility_id,tag,severity_code,complaint,standard,eventid,filedate,severity_desc,inspection_dt,tag_group_num,tag_group_name,tag_old_new
0,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,514,D,1.0,0.0,DY4R11,2020-06-01,No actual harm with potential for more than mi...,2017-04-26,483.70,Administration,Old
1,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,253,E,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29,483.10,Resident Rights,Old
2,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,279,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29,483.21,Comprehensive Resident Centered Care Plans,Old
3,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,328,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29,483.25,Quality of Care,Old
4,ISSAQUAH NURSING AND REHABILITATION CENTER,505004,329,D,0.0,1.0,PWON11,2020-06-01,No actual harm with potential for more than mi...,2017-08-29,483.45,Pharmacy Services,Old
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11910,BETHANY AT PACIFIC,505404,657,D,1.0,0.0,YIRL11,2020-06-01,No actual harm with potential for more than mi...,2020-02-12,483.21,Comprehensive Resident Centered Care Plans,New
11911,BETHANY AT PACIFIC,505404,697,D,1.0,0.0,YIRL11,2020-06-01,No actual harm with potential for more than mi...,2020-02-12,483.25,Quality of Care,New
11912,SHARON CARE CENTER,505429,686,G,1.0,0.0,S1FN11,2020-06-01,Actual harm that is not immediate jeopardy - I...,2020-01-24,483.25,Quality of Care,New
11913,PRESTIGE POST-ACUTE AND REHAB CENTER - EDMONDS,505527,623,E,1.0,0.0,2HNG11,2020-06-01,No actual harm with potential for more than mi...,2020-03-03,483.15,"Admission, Transfer and Discharge",New


In [33]:
df_sod_wa['tag_group_name'].value_counts(dropna=False)

Quality of Care                                         2472
Resident Rights                                         1707
Pharmacy Services                                       1316
Comprehensive Resident Centered Care Plans              1076
Freedom from Abuse, Neglect, and Exploitation           1033
Resident Assessment and Plan of Care                     609
Infection Control                                        609
Nursing Services                                         544
Food Services Areas                                      485
Quality of Life                                          474
Admission, Transfer and Discharge                        437
Administration                                           412
Behavioral Health Services                               244
Dental Services                                          122
Physical Environment                                     116
Quality Assurance and Performance Improvement             74
Training Requirements   

Reorder columns

In [34]:
df_sod_wa = df_sod_wa[['facility_name', 'facility_id',
                       'eventid', 'inspection_dt', 
                       'tag', 'tag_group_num', 'tag_group_name', 'tag_old_new', 
                       'severity_code', 'severity_desc',
                       'complaint', 'standard']]

# V. EXPORTING RESULTS

In [35]:
df_sod_wa.to_csv('../C_output_data/sod_wa.csv', index=False)