## Working with Nursing Home Complaints data.  
__author__: sandeep  
__date__: Mar 20, 2023

### about the data...
- The data includes Nursing Home (NH)-level statistics as well as investigation reports associated with f-tags.
- NHs can receive more than one f-tag in each investigation or response to complaints.
- NHs can receive f-tags across many different years.

### focus here...
Mainly interested in information retrieval from the investigation reports.
- the investigation reports include
    - parties involved - Residents (R); Licensed Practical Nurse (LPN); etc., referred to as LPN#1, etc.
    - dates of interactions, complaints, and general progress toward resolution   
    
*so the focus -->*
- who is the complaint against?
- who is the complainant?
- what is the event mentioned in the complaint?
- is more than one person complained against?
- extract symptoms from the text - apply a clinical NER model
- how to use the extracted text in the modeling problem


### potential approaches...
- regex methods to extract key actors
- build a word embedding with this data to extract similarities, etc
- NLTK chunking to extract syntax-based verb chunks to identify actions

In [6]:
# packages needed
import pandas as pd
import re
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
# data snapshot - saved locally
dat = pd.read_csv("./merged_data/text2567_20190501_cms_reg1.csv", index_col=0, dtype=str)

In [4]:
txt_rep = dat[['inspection_text', 'deficiency_tag', 'scope_severity', 'complaint']].copy()  # subset of columns
# lowercase the text and strip <br> / <br/> tags
txt_rep.inspection_text = txt_rep.inspection_text.str.lower().str.replace(r"<br/?>", "", regex=True)
# glue '#' to its neighbours so that "lpn #1" becomes "lpn#1"
txt_rep.inspection_text = txt_rep.inspection_text.str.replace(r"\s?#\s?", "#", regex=True)
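The same cleaning can be tried on a single string with plain `re`, which makes the effect of each substitution easier to see. This is a minimal sketch: the function name `clean_report`, the `br_replacement` parameter, and the sample sentence are made up for illustration. Replacing `<br>` with an empty string fuses the words on either side of the tag, which may be one source of tokens such as `redacted].resident#1` or `tracking.3` in the actor lists printed further down; substituting a space is one possible alternative.

```python
import re

def clean_report(text, br_replacement=""):
    """Lowercase, replace <br> / <br/> tags, and glue '#' to its neighbours."""
    text = text.lower()
    text = re.sub(r"<br/?>", br_replacement, text)
    text = re.sub(r"\s?#\s?", "#", text)
    return text

# Hypothetical sample sentence, not taken from the data.
sample = "LPN #3 spoke to NA # 7<br/>regarding care for R #227."

print(clean_report(sample))                      # words around <br/> get fused
print(clean_report(sample, br_replacement=" "))  # words stay separated
```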

In [None]:
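# identify key actors in a given report -  OPTION 1 (regex + spaCy NER)

def find_actors(txt):
    # list 1: regex matches of the form role#id, e.g. lpn#1, resident#312
    person_list0 = list(set(re.findall(r"\s([a-z]*#[0-9]*)", txt)))
    print(person_list0)
    # list 2: spaCy entities labelled PERSON
    # (the original test, label == 380, compared against the integer ID of this label)
    person_list = []
    doc = nlp(txt)
    for sent in doc.sents:
        for ent in sent.ents:
            if ent.label_ == "PERSON":
                person_list.append(ent.text)
    print(list(set(person_list)))
    # union of both lists: residents and staff involved
    return set(person_list + person_list0)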

txt_rep['actors']=txt_rep.inspection_text.map(find_actors)
#txt_rep.actors=txt_rep.actors.apply(lambda x: [re.sub("resident", "r", ax) if 'resident' in ax else ax for ax in x ])

['lpn#1', 'lpn#2', 'resident#312', 'person#1']
['lpn#2']
['person#4', 'na#7', 'r#227', 'sw#1', 'rn#1', 'lpn#3', 'na#6', 'resident#227']
['na#6', 'r#227']
['person#4', 'na#7', 'r#227', 'sw#1', 'rn#1', 'lpn#3', 'resident#227']
['r#227']
[]
[]
['lpn#1', 'lpn#2', 'resident#312', 'person#1']
['lpn#2']
['na#4', 'adns#2', 'resident#183', 'na#2']
[]
['person#2', 'resident#347', 'sw#1', 'adns#2']
[]
['na#11', 'resident#43', 'rn#7', 'housekeeper#1']
['resident#43']
['room#320', 'room#319', 'room#303', 'and#320']
[]
['resident#180', 'na#1', 'person#4', 'na#4', 'resident#349', 'rn#1', 'rn#2', 'na#2', 'receptionist#2']
["b. resident#180's", "a. resident#180's"]
['rn#8', 'lpn#5', 'lpn#6', 'resident#60']
['resident#60']
['resident#347']
[]
['resident#347']
[]
['aprn#1']
['4/9/18']
[]
['room.10', 'brown']
['sw#1', 'resident#644', 'na#3', 'na#2', 'adon#1']
[]
['resident#40', 'resident#92', 'adns#2']
["c. resident#92's", 'resident#92', "a. resident#40's"]
['resident#744', 'resident#175']
['cm']
['reside

[]
[]
['c. observation']
['resident#225', 'rp#1', 'lpn#2', 'party#1', 'and#295', 'rn#5', 'resident#53', 'resident#109', 'resident#295', 'r#53']
["a. resident#225's", 'lpn#2', "c. resident#53's"]
['resident#190']
[]
['and#5', 'resident#5', 'resident#4']
[]
['na#1', 'lpn#2', 'resident#1', 'na#3', 'lpn#3']
['lpn#2', 'na#3']
['na#1', 'resident#1', 'lpn#2', 'na#3']
['lpn#2', 'na#3']
['na#1', 'na#5', 'and#4', 'rn#1', 'resident#2', 'resident#4', 'lpn#3', 'resident#3', 'na#2']
['don', 'resident#3']
['resident#3', 'na#2']
['resident#3']
['resident#55', 'md#1']
['resident#55']
['na#1', 'resident#87', 'lpn#2', 'na#2']
['don', 'na#1', 'lpn#2', 'na#2']
['rn#5', 'resident#92', 'rn#7', 'resident#104']
['resident#92 and/or', 'redacted].physician', 'resident#92']
['resident#18', 'rn#5']
[]
['rn#6', 'resident#404', 'lpn#1']
['boggy heels']
['lpn#4', 'resident#37']
['lpn#4', 'physician']
['na#1', 'resident#87', 'na#2']
['don', 'na#1', 'na#2']
['resident#55', 'md#1']
['resident#55']
['rn#6', 'rn#3', 'rn#2

[]
['r#12', 'and#112', 'na#4', 'lpn#2', 'resident#112', 'and#191', 'lpn#1', 'r#112', 'rn#2', 'r#78', 'resident#12', 'residents#186']
['resident#112', 'na#4', 'lpn#2', 'station1']
['resident#122', 'r#122', 'lpn#1', 'rn#2']
[]
['resident#98', 'na#1', 'r#98', 'rn#1']
['kick na#1']
[]
[]
['resident#13', 'resident#61', 'resident#74', 'resident#23']
['resident#13', 'resident#61']
['lpn#2']
['lpn#2', 'resident#70']
['r#229', 'r#61', 'rn#3', 'r#235', 'lpn#1', 'rn#1', 'resident#229', 'resident#61', 'resident#235']
['tracking.3', 'resident#229', 'resident#61']
['resident#24', 'r#235', 'resident#199', 'r#84', 'resident#', 'rn#1', 'r#199', 'resident#235']
['redacted].2', 'r#199', 'resident##84', 'r#235']
[]
[]
['r#229', 'r#61', 'lpn#1', 'rn#1', 'resident#229', 'resident#61']
['resident#229']
['rn#3']
[]
['lpn#1', 'resident#517']
[]
['resident#17', 'r#517', 'resident#517', 'rn#1']
[]
['aprn#1', 'resident#135', 'rn#1']
[]
['resident#1', 'rn#2', 'rn#1']
['don']
['resident#302', 'resident#300', 'perso

['staff#2']
[]
[]
['na#1', 'na#4', 'na#7', 'resident#1', 'lpn#1', 'na#3', 'rn#1', 'na#2', 'na#8']
['na#3']
['na#5', 'resident#1', 'person#1']
['don']
['na#1', 'na#4', 'na#7', 'resident#1', 'lpn#1', 'na#3', 'rn#1', 'na#2', 'na#6', 'na#8']
['na#6']
['md#1']
[]
['na#1', 'na#7', 'resident#1', 'nurse#1', 'lpn#1', 'na#3', 'na#6', 'na#8']
['c.', 'nurse#1', 'na#3', 'na#6', 'redacted].resident#1']
[]
[]
[]
[]
['resident#34', 'rn#1']
['resident#34']
['resident#38', 'supervisor#1']
['resident#38']
['lpn#2', 'rn#3', 'st#1', 'na#2', 'resident#60']
['foley', 'lpn#2', 'resident#60']
['rn#1']
['cm']
['resident#39']
[]
['resident#127']
[]
['na#1', 'lpn#2', 'md#1', 'resident#157', 'lpn#1', 'lpn#3']
['resident#157', 'lpn#2']
['resident#69', 'consultant#1']
['resident#69', "a. resident#69's", "b. resident#69's"]
['resident#65', 'lpn#1']
['resident#65']
['resident#9', 'resident#113', 'md#1', 'rn#1']
['b.', 'a.']
['md#1', 'lpn#4', 'nurse#1', 'rn#6', 'resident#133', 'na#6']
['lpn#4', 'na#6']
['resident#27', 

['lpn#2', 'resident#92', '1/8/18', 'redacted].c', 'resident#29']
['aprn#2', 'lpn#3', 'resident#3', 'na#8']
['don', 'aprn#2', 'resident#3']
['resident#1', 'lpn#2', 'rn#1']
['lpn#2']
['rn#1', 'person#1']
[]
['na#1', 'rn#2', 'rn#1']
[]
['lpn#1', 'resident#181']
['mar']
['aprn#1', 'resident#4']
[]
['resident#1']
[]
['resident#2', 'person#1']
['don']
['housekeeper#1', 'room#343', 'rn#1']
[]
['cook#1', 'cook#2']
[]
['aprn#4', 'rn#8', 'aprn#1', 'person#6', 'person#7', 'resident#113', 'resident#149', 'resident#81', 'lpn#3', 'rn#11', 'resident#168', 'rn#14', 'resident#150']
['don', 'and/or hematoma', 'resident#81']
['resident#423', 'person#10', 'worker#1', 'sw#1']
['worker#1']
['rn#1', 'person#1', 'na#1', 'lpn#2', 'resident#122', 'resident#60', 'na#4', 'resident#95', 'rn#5', 'rn#6', 'and#142', 'resident#123', 'resident#142', 'na#3', 'resident#53', 'rn#2', 'na#2', 'rn#10', 'rn#4']
['na#1', 'lpn#2', 'resident#95', 'don', 'resident#60']
['and#122', 'na#2']
[]
['na#1', 'resident#123', 'na#4', 'and#

['lpn#2', 'resident#38', 'activity.b', 'lpn#5', "a. resident#38's"]
[]
[]
['resident#3']
['resident#3']
['person#2', 'resident#1', 'r#6', 'lpn#1', 'r#1', 'resident#6']
['don', 'admission).review', 'review.b']
['resident#1']
['don']
['resident#1', 'r#1', 'rn#1']
['don']
['na#5', 'r#4', 'rn#1']
['a. r#4 [', 'revised.b']
['r#4', 'lpn#2', 'na#3', 'person#1']
['lpn#2']
['r#2', 'resident#2', 'lpn#3', 'rn#1']
['foley', 'foley catheter']
['r#2', 'resident#2']
['don', 'mar', '4/9/18']
['r#5', 'resident#1', 'nurse#1', 'lpn#1', 'rn#1', 'physician#1', 'r#1']
['don', 'a.']
['lpn#2', 'resident#38', 'md#1', 'resident#1', 'lpn#1', 'rn#1', 'rn#2']
['lpn#2', 'b. resident#38', 'resident#38', 'abd pad']
['resident#38', 'rn#1']
['resident#38']
['na#1', 'lpn#4', 'na#3', 'rn#1', 'resident#7', 'lpn#3', 'resident#8', 'and#8']
['lpn#4', 'resident#7']
['rn#3', 'resident#49', 'na#2']
['matt']
['resident#23']
['redacted].resident#23']
['t#54', 'resident#54']
['a.', 'resident#54']
['lpn#5', 'lpn#6', 'lpn#3']
[]
['r

In [49]:
find_actors(txt_rep.loc[1,'inspection_text'])

['person#4', 'na#7', 'r#227', 'sw#1', 'rn#1', 'lpn#3', 'na#6', 'resident#227']
['na#6', 'r#227']


{'lpn#3', 'na#6', 'na#7', 'person#4', 'r#227', 'resident#227', 'rn#1', 'sw#1'}
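The printed lists contain false positives (`and#320`, `room#303`, `resident#`) and the same resident under two spellings (`r#227` and `resident#227`). One possible post-processing step is sketched below. The role allowlist and the function name `normalize_actors` are assumptions made for illustration, not part of the notebook; the allowlist would need to be extended after inspecting the data.

```python
import re

# Hypothetical allowlist of role prefixes seen in the output above.
KNOWN_ROLES = {"lpn", "rn", "na", "sw", "md", "aprn", "adns", "adon", "person"}
RESIDENT_ALIASES = {"r", "resident"}

ACTOR_RE = re.compile(r"^([a-z]+)#(\d+)$")

def normalize_actors(raw_actors):
    """Keep role#id mentions with a known role and merge resident aliases."""
    kept = set()
    for actor in raw_actors:
        match = ACTOR_RE.match(actor)
        if not match:
            continue  # drops 'resident#', "a. resident#180's", dates, etc.
        role, number = match.groups()
        if role in RESIDENT_ALIASES:
            kept.add(f"r#{number}")
        elif role in KNOWN_ROLES:
            kept.add(f"{role}#{number}")
    return sorted(kept)

raw = {'lpn#3', 'na#6', 'na#7', 'person#4', 'r#227', 'resident#227', 'rn#1', 'sw#1'}
print(normalize_actors(raw))
```

An allowlist is strict by design: unknown roles such as `housekeeper#1` are dropped until they are added, which trades recall for fewer spurious actors.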

In [6]:
# identify key actors in a given report -  OPTION 2 using NLTK (POS tagging + NE chunking)


In [7]:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [15]:
for snt in nltk.sent_tokenize(txt_rep.loc[1,'inspection_text']):
    snt2=preprocess(snt)
    namedEnt = nltk.ne_chunk(snt2, binary=False)
    print(namedEnt)

(S
  */JJ
  */NNP
  note-/NN
  terms/NNS
  in/IN
  brackets/NNS
  have/VBP
  been/VBN
  edited/VBN
  to/TO
  protect/VB
  confidentiality/NN
  */NNP
  */NNP
  >/NNP
  based/VBN
  on/IN
  clinical/JJ
  record/NN
  review/NN
  ,/,
  interviews/NNS
  ,/,
  review/NN
  of/IN
  facility/NN
  documentation/NN
  ,/,
  and/CC
  review/NN
  of/IN
  facility/NN
  policy/NN
  and/CC
  procedure/NN
  for/IN
  one/CD
  of/IN
  three/CD
  residents/NNS
  reviewed/VBN
  for/IN
  dignity/NN
  (/(
  resident/JJ
  #/#
  227/CD
  )/)
  ,/,
  the/DT
  facility/NN
  failed/VBD
  to/TO
  ensure/VB
  a/DT
  resident/NN
  's/POS
  grievance/concern/NN
  was/VBD
  acted/VBN
  upon/IN
  in/IN
  accordance/NN
  with/IN
  facility/NN
  policy/NN
  and/CC
  procedure/NN
  ./.)
(S
  the/DT
  findings/NNS
  include/VBP
  :/:
  resident/JJ
  #/#
  227/CD
  's/POS
  [/JJ
  diagnoses/NNS
  redacted/VBD
  ]/NN
  ./.)
(S
  the/DT
  quarterly/JJ
  mds/NN
  assessment/NN
  dated/VBD
  [/JJ
  date/NN
  ]/NNP
  identified/VB

In [21]:
doc_sent1=nltk.sent_tokenize(dat.loc[1,'inspection_text'])

In [25]:
sent5=doc_sent1[5]

In [101]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+\s?\#\d+')

In [102]:
sent = tokenizer.tokenize(sent5)
sent2 = nltk.pos_tag(sent)

In [103]:
sent

['LPN #3', 'NA #7', 'R #227']
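The tokenizer above keeps only the actor mentions. If the whole sentence is needed with each actor ID kept as one token (so that a tagger does not split `LPN #3` into `LPN`, `#`, `3`), an alternation pattern is one option. The sketch below uses plain `re.findall` rather than NLTK, and the name `tokenize_keep_actors` is made up for illustration.

```python
import re

# Try actor mentions first, then ordinary words, then single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z]+\s?#\s?\d+|\w+|[^\w\s]")

def tokenize_keep_actors(sentence):
    """Split a sentence into tokens, keeping 'ROLE #N' mentions as 'ROLE#N'."""
    tokens = TOKEN_RE.findall(sentence)
    # Remove the optional whitespace inside actor mentions.
    return [re.sub(r"\s+", "", tok) if "#" in tok else tok for tok in tokens]

print(tokenize_keep_actors("LPN #3 had spoken to NA #7 regarding care to R #227."))
```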

In [32]:
# ne_chunk needs POS-tagged word tokens, so tag the full sentence first
nltk.ne_chunk(preprocess(sent5), binary=True)

ModuleNotFoundError: No module named 'svgling'

Tree('S', [('The', 'DT'), ('documentation', 'NN'), ('of', 'IN'), ('the', 'DT'), ('resolution', 'NN'), ('of', 'IN'), ('the', 'DT'), ('grievance/concern', 'NN'), ('noted', 'VBD'), ('that', 'IN'), Tree('NE', [('LPN', 'NNP')]), ('#', '#'), ('3', 'CD'), ('had', 'VBD'), ('spoken', 'VBN'), ('to', 'TO'), ('NA', 'NNP'), ('#', '#'), ('7', 'CD'), ('regarding', 'VBG'), ('his/her', 'JJR'), ('tone', 'NN'), ('while', 'IN'), ('providing', 'VBG'), ('care', 'NN'), ('to', 'TO'), ('R', 'NNP'), ('#', '#'), ('227', 'CD'), ('.', '.')])

**building a network of actors for each complaint**   
the approach is to take the actors listed in the `actors` column and
1. check whether two of them appear in the same sentence  
2. if they do, link them with an edge  
3. if the sentence is also in the active voice, make the edge directed!  
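Steps 1 and 2 can be sketched with the standard library alone by counting, for every pair of actors, the number of sentences they share. The function name `cooccurrence_edges` is made up for illustration, and the input is assumed to be already sentence-split and cleaned (lowercase, `#` glued to the role). Step 3 is left out because detecting voice and direction would need a dependency parse.

```python
import re
from collections import Counter
from itertools import combinations

ACTOR_RE = re.compile(r"\b[a-z]+#\d+")

def cooccurrence_edges(sentences):
    """Count how often each unordered pair of actors shares a sentence."""
    edges = Counter()
    for sentence in sentences:
        actors = sorted(set(ACTOR_RE.findall(sentence.lower())))
        for pair in combinations(actors, 2):
            edges[pair] += 1
    return edges

# The first two sentences are adapted from the report used in this notebook;
# the third is an illustrative sentence with no actors.
sentences = [
    "r#227 voiced a concern to lpn#3 related to na#7's attitude when providing care.",
    "lpn#3 had spoken to na#7 regarding his/her tone while providing care to r#227.",
    "the facility failed to act on the grievance.",
]

for (a, b), weight in sorted(cooccurrence_edges(sentences).items()):
    print(a, "--", b, "weight:", weight)
```

The resulting `(actor, actor) -> count` mapping is a weighted edge list, which a graph library could load later if one is used.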

In [270]:
nltk.sent_tokenize(txt_rep.inspection_text[0])

["**note- terms in brackets have been edited to protect confidentiality** based on review of the clinical record and interview for one of three sampled residents with a change in condition (r#312), the facility failed to report to the physician when the resident had a change in condition and the resident's insulin was held in violation of the physician's orders [redacted].",
 "resident#312's [diagnoses redacted].",
 'the resident care plan dated 5/20/16 identified the resident had insulin dependent diabetes.',
 'interventions directed the staff to administer medications as ordered, monitor for signs and symptoms of hypo/[medical condition], and report abnormal findings to the physician.',
 "physician's orders [redacted].the admission mds dated [date] identified the resident was without cognitive impairment, was independent with bed mobility and transfers and eating, required supervision for dressing and personal hygiene and physical help in part of the bathing activity.physician orders

In [12]:
from transformers import AutoModel, AutoTokenizer, AutoModelForTokenClassification
model_blue = "distilbert-base-uncased"
tokenize = AutoTokenizer.from_pretrained(
    model_blue, use_fast=True, add_prefix_space=True
)

In [260]:
import nltk
from nltk.tokenize import wordpunct_tokenize, word_tokenize, RegexpTokenizer, MWETokenizer, sent_tokenize
from nltk.tag import pos_tag, pos_tag_sents

In [262]:
te = """R#227 voiced a concern to LPN#3 related to NA#7's attitude when providing care. Hello there!"""
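wordpunct_tokenize(te)
# note: pos_tag expects a list of word tokens; given whole sentences it tags each sentence as a single token,
# as the output shows. Tokenize each sentence first, e.g. pos_tag_sents([word_tokenize(s) for s in sent_tokenize(te)])
pos_tag(sent_tokenize(te))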

[("R#227 voiced a concern to LPN#3 related to NA#7's attitude when providing care.",
  'NNP'),
 ('Hello there!', 'NNP')]

In [None]:
grammar = r"""chunk1: {<NN><WRB><VBG><NN>}"""
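The grammar describes a chunk made of four consecutive tags: noun, wh-adverb, gerund, noun (for example "attitude when providing care"). To illustrate what such a rule matches without running a chunk parser, the sketch below slides a window over hand-tagged tokens in plain Python. The function name `find_tag_sequences` is made up and the tags are assigned by hand as assumptions; a real tagger may label the words differently. Unlike a chunk parser, this window approach can return overlapping matches.

```python
def find_tag_sequences(tagged_tokens, tag_pattern):
    """Return every run of tokens whose POS tags equal tag_pattern exactly."""
    width = len(tag_pattern)
    matches = []
    for start in range(len(tagged_tokens) - width + 1):
        window = tagged_tokens[start:start + width]
        if [tag for _, tag in window] == list(tag_pattern):
            matches.append(window)
    return matches

# Tags assigned by hand for illustration; a real tagger may differ.
tagged = [
    ("na#7", "NN"), ("'s", "POS"), ("attitude", "NN"),
    ("when", "WRB"), ("providing", "VBG"), ("care", "NN"), (".", "."),
]

chunks = find_tag_sequences(tagged, ["NN", "WRB", "VBG", "NN"])
print(chunks)
```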

In [50]:
dat['inspection_date']=pd.to_datetime(dat.inspection_date)

In [51]:
dat['inspection_year']=dat.inspection_date.dt.year

In [52]:
dat.inspection_year.value_counts()

2018    5599
2017    4521
2016    3684
2019     968
2015     651
Name: inspection_year, dtype: int64