<a href="https://colab.research.google.com/github/rhailper/milestoneII/blob/main/SIADS696_DataExplorationAndCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Questions to discuss as a team  
Should we all use the same inout dataset so that the models we use are comparable?


In [1]:
#!git clone https://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@github.com/rhailper/milestoneII.git

Cloning into 'milestoneII'...
remote: Enumerating objects: 34, done.[K
remote: Counting objects: 100% (34/34), done.[K
remote: Compressing objects: 100% (32/32), done.[K
remote: Total 34 (delta 7), reused 7 (delta 0), pack-reused 0[K
Unpacking objects: 100% (34/34), 8.11 MiB | 3.31 MiB/s, done.


In [3]:
import pandas as pd
import numpy as numpy
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

### Import and do basic cleaning to files

#### Client information

In [64]:
# import client info - this file has basic demographic information about the client
demo = pd.read_csv('/content/milestoneII/data/CLIENT_INFORMATION.csv') 

In [44]:
# drop row with na values
# If a client has na values in this value that means they died or stopped 
# receiving services from teh organization
#client_info = client_info.dropna()

Due to the deidentification process, this dataset was not allowed to disclose the exact age of any client over the age of 90. Any client over the age of 90 was coded as 90+ so this needs to be changed in order to make the feature numerical.

In [65]:
# replace '90+' with 90
demo['Age'] = demo['Age'].str.replace('90+','90',regex=False)#.astype(int)

In [66]:
# convert categorical variables into one hot encoded dummy variables
df = pd.get_dummies(demo, columns=['Gender','Federal Poverty','Race','Primary Funding Source','Multiple Funding Sources?'])

In [67]:
df

Unnamed: 0,ID,Age,Federal % of Poverty,ADL Count,Critical Need Count,IADL Count,Skilled Need Count,Nutrition Score,Gender_Female,Gender_Male,...,Race_Asian,Race_Black or African American,Race_Multiracial,Race_Native Hawaiian or Other Pacific Islander,Race_Unknown/Missing,Race_White,Primary Funding Source_Non-waiver,Primary Funding Source_Waiver,Multiple Funding Sources?_No,Multiple Funding Sources?_Yes
0,10,38,87.0,4.0,0.0,5.0,0.0,0.0,0,1,...,0,0,0,0,1,0,0,1,1,0
1,100035,52,73.0,6.0,1.0,8.0,0.0,3.0,1,0,...,0,1,0,0,0,0,0,1,1,0
2,100048,90,124.0,,,,,,0,1,...,0,0,0,0,0,1,0,1,1,0
3,100061,53,51.0,6.0,1.0,6.0,0.0,3.0,1,0,...,0,1,0,0,0,0,0,1,1,0
4,100073,69,65.0,0.0,2.0,1.0,0.0,8.0,0,1,...,0,1,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13598,99910,67,145.0,5.0,1.0,7.0,0.0,3.0,1,0,...,0,0,0,0,0,1,0,1,1,0
13599,99919,90,96.0,7.0,1.0,8.0,0.0,6.0,1,0,...,0,1,0,0,0,0,0,1,1,0
13600,99963,55,76.0,5.0,2.0,6.0,0.0,9.0,1,0,...,0,0,0,0,0,1,0,1,1,0
13601,99967,60,72.0,5.0,1.0,5.0,0.0,8.0,1,0,...,0,1,0,0,0,0,0,1,1,0


#### Client services


In [60]:
# import client services - this files contains service utilization
serv = pd.read_csv('/content/milestoneII/data/CLIENT_SERVICES.csv') 

In [61]:
client_info

Unnamed: 0,ID,Month,Year,Services,Cost of Serivces
0,10,April,2022,"['Structured Family Care - Level 2', 'Case Man...",4531.06
1,10,August,2021,"['Structured Family Care - Level 2', 'Case Man...",4673.14
2,10,December,2021,"['Structured Family Care - Level 2', 'Case Man...",4673.14
3,10,February,2022,"['Structured Family Care - Level 2', 'Case Man...",4246.90
4,10,January,2022,"['Structured Family Care - Level 2', 'Case Man...",4673.14
...,...,...,...,...,...
127089,99976,March,2022,"['Attendant Care', 'Case Management - flat rat...",2782.90
127090,99976,May,2022,"['Attendant Care', 'Case Management - flat rat...",4226.26
127091,99976,November,2021,"['Attendant Care', 'Case Management - flat rat...",2643.22
127092,99976,October,2021,"['Attendant Care', 'Case Management - flat rat...",2782.90


#### Diagnoses

In [6]:
# import diagnoses - this gile contains client diagnoses based on ICD-10 codes
diag = pd.read_csv('//content/milestoneII/data/DIAGNOSES.csv') 

#### Questionaire

In [9]:
# import questionaire - this file contains information about clients ability to complete daily activities 
quest = pd.read_csv('/content/milestoneII/data/QUESTIONAIRE.csv') 

In [12]:
quest['InterRAI Period'].unique()

array(['JAN_JUN2021', 'JAN_JUN2022', 'JUL_DEC2021', 'JUL_DEC2022',
       'JUL_DEC2020'], dtype=object)

#### Hospitalzations (will be outcome variable for supervised learning)

In [4]:
# import hospitalzations - this file contains information about client hospitalzations in the past 2 years
hosp = pd.read_csv('/content/milestoneII/data/HOSPITALIZATIONS.csv') 

In [15]:
hosp.sort_values(['Year','Month'])

Unnamed: 0,ID,Month,Year,Admittype,Number Hospitalzations
7,100035,2,2022,Outpatient,1
24,100061,2,2022,Outpatient,1
42,100074,2,2022,Outpatient,2
49,100102,2,2022,Outpatient,1
81,10019,2,2022,Outpatient,1
...,...,...,...,...,...
101640,99862,3,2023,Outpatient,2
101643,99870,3,2023,Outpatient,1
101675,99963,3,2023,Emergency,1
101676,99963,3,2023,Outpatient,1


In [25]:
%cd 'milestoneII/'

/content/milestoneII


In [26]:
!pwd

/content/milestoneII


In [22]:
!cd 'milestoneII/'
!git config --global user.email "rhailper@umich.edu"
!git config --global user.name "rhailper"
!git pull

fatal: not a git repository (or any of the parent directories): .git


In [None]:
!git add .
!git commit -m 'Updates to data cleaning'
!git push 'https://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@github.com/rhailper/milestoneII.git'

In [18]:
!git remote add origin https://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@github.com/rhailper/milestoneII.git

fatal: not a git repository (or any of the parent directories): .git
