### CSV File Data Dictionary : https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary

#### In conditions.csv :
Start : The date the condition was diagnosed.<br>
Stop : The date the condition resolved, if applicable.<br>

#### In allergies.csv :
Start : The date the allergy was diagnosed.<br>
Stop : The date the allergy ended, if applicable.<br>

#### These columns are used to categorize patients as adherent or non-adherent

### Determining non-adherence from conditions.csv

In [1]:
import pandas as pd
import numpy as np

In [2]:
conditions = pd.read_csv('../Case Files/sample_date_csv/conditions.csv')
print('Total shape : ',conditions.shape)
conditions.head()

Total shape :  (8376, 6)


Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,2001-05-01,,1d604da9-9a81-4ba9-80c2-de3375d59b40,8f104aa7-4ca9-4473-885a-bba2437df588,40055000,Chronic sinusitis (disorder)
1,2011-08-09,2011-08-16,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,9d35ec9f-352a-4629-92ef-38eae38437e7,444814009,Viral sinusitis (disorder)
2,2011-11-16,2011-11-26,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,ae7555a9-eaff-4c09-98a7-21bc6ed1b1fd,195662009,Acute viral pharyngitis (disorder)
3,2011-05-13,2011-05-27,10339b10-3cd1-4ac3-ac13-ec26728cb592,e1ab4933-07a1-49f0-b4bd-05500919061d,10509002,Acute bronchitis (disorder)
4,2011-02-06,2011-02-14,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,b8f76eba-7795-4dcd-a544-f27ac2ef3d46,195662009,Acute viral pharyngitis (disorder)


In [3]:
# Counting the records where STOP is null
conditions['STOP'].isnull().sum()
# 3811 records have a null STOP
# For these records the condition was not resolved, which this analysis treats as non-adherence

3811

In [4]:
total_patients_conditions = conditions['PATIENT'].nunique()
total_patients_non_adherant = conditions[conditions['STOP'].isnull()]['PATIENT'].nunique()
print('Total unique patients in conditions.csv : ',total_patients_conditions)
print('Total unique patients whose condition was not resolved (non - adherent patients) : ',total_patients_non_adherant)
print('Total unique patients whose condition was resolved (adherent patients) : ',total_patients_conditions-total_patients_non_adherant)

Total unique patients in conditions.csv :  1152
Total unique patients whose condition was not resolved (non - adherent patients) :  922
Total unique patients whose condition was resolved (adherent patients) :  230


In [5]:
# 230 patients are adherent
# 922 patients are non-adherent
# Creating a new column, non_adherence_conditions, which flags each record as non-adherent (1) or adherent (0)

In [6]:
conditions['non_adherence_conditions'] = conditions['STOP'].isnull()*1
conditions.head()

Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,non_adherence_conditions
0,2001-05-01,,1d604da9-9a81-4ba9-80c2-de3375d59b40,8f104aa7-4ca9-4473-885a-bba2437df588,40055000,Chronic sinusitis (disorder),1
1,2011-08-09,2011-08-16,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,9d35ec9f-352a-4629-92ef-38eae38437e7,444814009,Viral sinusitis (disorder),0
2,2011-11-16,2011-11-26,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,ae7555a9-eaff-4c09-98a7-21bc6ed1b1fd,195662009,Acute viral pharyngitis (disorder),0
3,2011-05-13,2011-05-27,10339b10-3cd1-4ac3-ac13-ec26728cb592,e1ab4933-07a1-49f0-b4bd-05500919061d,10509002,Acute bronchitis (disorder),0
4,2011-02-06,2011-02-14,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,b8f76eba-7795-4dcd-a544-f27ac2ef3d46,195662009,Acute viral pharyngitis (disorder),0


In [7]:
# Extracting the PATIENT and non_adherence_conditions columns separately
df_conditions = conditions[['PATIENT', 'non_adherence_conditions']]
df_conditions.head()

Unnamed: 0,PATIENT,non_adherence_conditions
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1
1,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,0
2,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,0
3,10339b10-3cd1-4ac3-ac13-ec26728cb592,0
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,0


In [8]:
#removing duplicates
df_conditions = df_conditions.drop_duplicates()
df_conditions.shape

(2041, 2)

In [9]:
print('But total unique patients in conditions.csv : ',total_patients_conditions)

But total unique patients in conditions.csv :  1152


In [10]:
# This means that 889 (2041 - 1152) patients appear with both flag values, i.e. they are sometimes adherent and sometimes not
# If a patient is non-adherent even once, they fall into the non-adherent category

In [11]:
group = df_conditions.groupby('PATIENT')
df_conditions = group.apply(lambda x : x['non_adherence_conditions'].unique())
df_conditions = df_conditions.reset_index() 
df_conditions.rename(columns = {0:'non_adherence_conditions'}, inplace = True)

In [12]:
df_conditions.head()

Unnamed: 0,PATIENT,non_adherence_conditions
0,00185faa-2760-4218-9bf5-db301acf8274,"[0, 1]"
1,0042862c-9889-4a2e-b782-fac1e540ecb4,[0]
2,0047123f-12e7-486c-82df-53b3a450e365,"[0, 1]"
3,010d4a3a-2316-45ed-ae15-16f01c611674,"[0, 1]"
4,0149d553-f571-4e99-867e-fcb9625d07c2,"[1, 0]"


In [13]:
# Rows whose non_adherence_conditions holds two values ([0, 1] or [1, 0]) are patients who are sometimes adherent and sometimes not
# Hence first counting the distinct values per patient and then appending that count as a column

In [14]:
df_conditions1 = group.apply(lambda x : x['non_adherence_conditions'].nunique())
df_conditions1 = df_conditions1.reset_index() 
df_conditions1.rename(columns = {0:'freq_isadherent'}, inplace = True)
freq_isadherent = df_conditions1['freq_isadherent']
df_conditions = pd.concat([df_conditions, freq_isadherent], axis=1)
df_conditions.head()

Unnamed: 0,PATIENT,non_adherence_conditions,freq_isadherent
0,00185faa-2760-4218-9bf5-db301acf8274,"[0, 1]",2
1,0042862c-9889-4a2e-b782-fac1e540ecb4,[0],1
2,0047123f-12e7-486c-82df-53b3a450e365,"[0, 1]",2
3,010d4a3a-2316-45ed-ae15-16f01c611674,"[0, 1]",2
4,0149d553-f571-4e99-867e-fcb9625d07c2,"[1, 0]",2


In [15]:
# Rows with freq_isadherent greater than 1 are non-adherent patients

In [16]:
df_conditions.loc[df_conditions['freq_isadherent'] > 1, 'non_adherence_conditions'] = 1
df_conditions.drop('freq_isadherent',axis = 1,inplace = True)
print(df_conditions['non_adherence_conditions'].value_counts())

1      889
[0]    230
[1]     33
Name: non_adherence_conditions, dtype: int64


In [17]:
# converting [0] -> 0 and [1] -> 1
df_conditions.loc[df_conditions['non_adherence_conditions'] == 0, 'non_adherence_conditions'] = 0
df_conditions.loc[df_conditions['non_adherence_conditions'] == 1, 'non_adherence_conditions'] = 1
df_conditions['non_adherence_conditions'].value_counts()

1    922
0    230
Name: non_adherence_conditions, dtype: int64
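
The per-patient collapse in the cells above (unique values, a frequency column, then list-to-scalar conversion) can also be expressed as a single group-wise maximum, because a patient counts as non-adherent as soon as any of their rows carries a 1. The following is a minimal sketch on made-up rows, not the real file; `toy`, `per_patient`, and the patient IDs are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in for conditions.csv (hypothetical patients and dates)
toy = pd.DataFrame({
    'PATIENT': ['p1', 'p1', 'p2', 'p3', 'p3'],
    'STOP': [None, '2011-08-16', '2011-05-27', None, None],
})

# 1 when STOP is missing (condition not resolved), else 0
toy['non_adherence_conditions'] = toy['STOP'].isna().astype(int)

# A patient is non-adherent if any of their rows is flagged,
# so the per-patient maximum gives the final label directly
per_patient = (
    toy.groupby('PATIENT')['non_adherence_conditions']
       .max()
       .reset_index()
       .rename(columns={'PATIENT': 'Id'})
)
print(per_patient)
```

Here `p1` has one resolved and one unresolved record and ends up labelled 1, matching the rule that a single non-adherent record is enough.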

In [18]:
df_conditions.rename(columns = {'PATIENT':'Id'}, inplace = True)
print('df_conditions shape : ' ,df_conditions.shape)
df_conditions.head()
#df_conditions.to_csv('../myCSV/isadherent.csv',index=False)

df_conditions shape :  (1152, 2)


Unnamed: 0,Id,non_adherence_conditions
0,00185faa-2760-4218-9bf5-db301acf8274,1
1,0042862c-9889-4a2e-b782-fac1e540ecb4,0
2,0047123f-12e7-486c-82df-53b3a450e365,1
3,010d4a3a-2316-45ed-ae15-16f01c611674,1
4,0149d553-f571-4e99-867e-fcb9625d07c2,1


### Determining non-adherence from allergies.csv

The same steps used for conditions.csv are repeated here.

In [19]:
allergies = pd.read_csv('../Case Files/sample_date_csv/allergies.csv')
print('Total shape : ',allergies.shape)
allergies.head()

Total shape :  (597, 6)


Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,25-10-1982,,76982e06-f8b8-4509-9ca3-65a99c8650fe,b896bf40-8b72-42b7-b205-142ee3a56b55,300916003,Latex allergy
1,25-10-1982,,76982e06-f8b8-4509-9ca3-65a99c8650fe,b896bf40-8b72-42b7-b205-142ee3a56b55,300913006,Shellfish allergy
2,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,419474003,Allergy to mould
3,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,232347008,Dander (animal) allergy
4,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,418689008,Allergy to grass pollen
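
The START values in this preview are written day-first (DD-MM-YYYY), while the conditions.csv preview uses YYYY-MM-DD. The analysis here only checks whether STOP is missing, so the difference does not affect the flags, but any date arithmetic would need explicit parsing. A small sketch, using a few values copied from the two previews; the variable names are illustrative only.

```python
import pandas as pd

# START values copied from the two previews
conditions_start = pd.Series(['2001-05-01', '2011-08-09'])  # YYYY-MM-DD
allergies_start = pd.Series(['25-10-1982', '25-01-2002'])   # DD-MM-YYYY

# Explicit formats avoid any day/month ambiguity
conditions_dt = pd.to_datetime(conditions_start, format='%Y-%m-%d')
allergies_dt = pd.to_datetime(allergies_start, format='%d-%m-%Y')

print(conditions_dt.dt.year.tolist())
print(allergies_dt.dt.year.tolist())
```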


In [20]:
# Counting the records where STOP is null
allergies['STOP'].isnull().sum()
# 533 records have a null STOP
# For these records the allergy did not end, which this analysis treats as non-adherence

533

In [21]:
total_patients_allergies = allergies['PATIENT'].nunique()
total_patients_non_adherant = allergies[allergies['STOP'].isnull()]['PATIENT'].nunique()
print('Total unique patients in allergies.csv : ',total_patients_allergies)
print('Total unique patients whose condition was not resolved (non - adherent patients) : ',total_patients_non_adherant)
print('Total unique patients whose condition was resolved (adherent patients) : ',total_patients_allergies-total_patients_non_adherant)

Total unique patients in allergies.csv :  141
Total unique patients whose condition was not resolved (non - adherent patients) :  137
Total unique patients whose condition was resolved (adherent patients) :  4


In [22]:
# 4 patients are adherent
# 137 patients are non-adherent
# Creating a new column, non_adherence_allergies, which flags each record as non-adherent (1) or adherent (0)

In [23]:
allergies['non_adherence_allergies'] = allergies['STOP'].isnull()*1
allergies.head()

Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,non_adherence_allergies
0,25-10-1982,,76982e06-f8b8-4509-9ca3-65a99c8650fe,b896bf40-8b72-42b7-b205-142ee3a56b55,300916003,Latex allergy,1
1,25-10-1982,,76982e06-f8b8-4509-9ca3-65a99c8650fe,b896bf40-8b72-42b7-b205-142ee3a56b55,300913006,Shellfish allergy,1
2,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,419474003,Allergy to mould,1
3,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,232347008,Dander (animal) allergy,1
4,25-01-2002,,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,7be1a590-4239-4826-9872-031327f3c368,418689008,Allergy to grass pollen,1


In [24]:
# Extracting the PATIENT and non_adherence_allergies columns separately
df_allergies = allergies[['PATIENT', 'non_adherence_allergies']]
df_allergies.head()

Unnamed: 0,PATIENT,non_adherence_allergies
0,76982e06-f8b8-4509-9ca3-65a99c8650fe,1
1,76982e06-f8b8-4509-9ca3-65a99c8650fe,1
2,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,1
3,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,1
4,71ba0469-f0cc-4177-ac70-ea07cb01c8b8,1


In [25]:
#removing duplicates
df_allergies = df_allergies.drop_duplicates()
df_allergies.shape

(160, 2)

In [26]:
print('But total unique patients in allergies.csv : ',total_patients_allergies)

But total unique patients in allergies.csv :  141


In [27]:
# This means that 19 (160 - 141) patients appear with both flag values, i.e. they are sometimes adherent and sometimes not
# If a patient is non-adherent even once, they fall into the non-adherent category

In [28]:
group = df_allergies.groupby('PATIENT')
df_allergies = group.apply(lambda x : x['non_adherence_allergies'].unique())
df_allergies = df_allergies.reset_index() 
df_allergies.rename(columns = {0:'non_adherence_allergies'}, inplace = True)
df_allergies.head()

Unnamed: 0,PATIENT,non_adherence_allergies
0,0288abb6-633c-40c3-ba0c-66c7d957727e,[1]
1,076688b0-f0d5-4c45-8bc6-b206684fa9ac,[1]
2,08ea9043-5f84-46ab-9815-81d90024169a,[1]
3,09616ead-22c8-4210-8cb9-2fdc28e043ca,[1]
4,097079b1-ff8f-4ee0-8ce3-0ea744ecfa21,"[0, 1]"


In [29]:
# Rows whose non_adherence_allergies holds two values ([0, 1] or [1, 0]) are patients who are sometimes adherent and sometimes not
# Hence first counting the distinct values per patient and then appending that count as a column

In [30]:
df_allergies1 = group.apply(lambda x : x['non_adherence_allergies'].nunique())
df_allergies1 = df_allergies1.reset_index() 
df_allergies1.rename(columns = {0:'freq_isadherent'}, inplace = True)
freq_isadherent = df_allergies1['freq_isadherent']
df_allergies = pd.concat([df_allergies, freq_isadherent], axis=1)
df_allergies.head()

Unnamed: 0,PATIENT,non_adherence_allergies,freq_isadherent
0,0288abb6-633c-40c3-ba0c-66c7d957727e,[1],1
1,076688b0-f0d5-4c45-8bc6-b206684fa9ac,[1],1
2,08ea9043-5f84-46ab-9815-81d90024169a,[1],1
3,09616ead-22c8-4210-8cb9-2fdc28e043ca,[1],1
4,097079b1-ff8f-4ee0-8ce3-0ea744ecfa21,"[0, 1]",2


In [31]:
# Rows with freq_isadherent greater than 1 are non-adherent patients
df_allergies.loc[df_allergies['freq_isadherent'] > 1, 'non_adherence_allergies'] = 1
df_allergies.drop('freq_isadherent',axis = 1,inplace = True)
print(df_allergies['non_adherence_allergies'].value_counts())

[1]    118
1       19
[0]      4
Name: non_adherence_allergies, dtype: int64


In [32]:
# converting [0] -> 0 and [1] -> 1
df_allergies.loc[df_allergies['non_adherence_allergies'] == 0, 'non_adherence_allergies'] = 0
df_allergies.loc[df_allergies['non_adherence_allergies'] == 1, 'non_adherence_allergies'] = 1
df_allergies['non_adherence_allergies'].value_counts()

1    137
0      4
Name: non_adherence_allergies, dtype: int64

In [33]:
df_allergies.rename(columns = {'PATIENT':'Id'}, inplace = True)
print('df_allergies shape : ' ,df_allergies.shape)
df_allergies.head()
#df_allergies.to_csv('../myCSV/isadherent.csv',index=False)

df_allergies shape :  (141, 2)


Unnamed: 0,Id,non_adherence_allergies
0,0288abb6-633c-40c3-ba0c-66c7d957727e,1
1,076688b0-f0d5-4c45-8bc6-b206684fa9ac,1
2,08ea9043-5f84-46ab-9815-81d90024169a,1
3,09616ead-22c8-4210-8cb9-2fdc28e043ca,1
4,097079b1-ff8f-4ee0-8ce3-0ea744ecfa21,1


## We now have the columns non_adherence_allergies and non_adherence_conditions.<br> Merging them into patients.csv for further analysis

In [34]:
patients = pd.read_csv('../Case Files/sample_date_csv/patients.csv')
print('Total shape : ',patients.shape)
patients.info()

Total shape :  (1171, 25)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171 entries, 0 to 1170
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Id                   1171 non-null   object 
 1   BIRTHDATE            1171 non-null   object 
 2   DEATHDATE            171 non-null    object 
 3   SSN                  1171 non-null   object 
 4   DRIVERS              958 non-null    object 
 5   PASSPORT             898 non-null    object 
 6   PREFIX               927 non-null    object 
 7   FIRST                1171 non-null   object 
 8   LAST                 1171 non-null   object 
 9   SUFFIX               12 non-null     object 
 10  MAIDEN               331 non-null    object 
 11  MARITAL              791 non-null    object 
 12  RACE                 1171 non-null   object 
 13  ETHNICITY            1171 non-null   object 
 14  GENDER               1171 non-null   object 
 15  BIRTHPLACE  

In [35]:
# Dropping unnecessary columns
patients.drop(['DEATHDATE','DRIVERS','PASSPORT','PREFIX','SUFFIX','MAIDEN','MARITAL','COUNTY','ZIP','LAT','LON'],axis = 1,inplace = True)
patients.head()

Unnamed: 0,Id,BIRTHDATE,SSN,FIRST,LAST,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,CITY,STATE,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1989-05-25,999-76-6866,José Eduardo181,Gómez206,white,hispanic,M,Marigot Saint Andrew Parish DM,427 Balistreri Way Unit 19,Chicopee,Massachusetts,271227.08,1334.88
1,034e9e3b-2def-4559-bb2a-7850888ae060,1983-11-14,999-73-5361,Milo271,Feil794,white,nonhispanic,M,Danvers Massachusetts US,422 Farrell Path Unit 69,Somerville,Massachusetts,793946.01,3204.49
2,10339b10-3cd1-4ac3-ac13-ec26728cb592,1992-06-02,999-27-3385,Jayson808,Fadel536,white,nonhispanic,M,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,574111.9,2606.4
3,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,1978-05-27,999-85-4926,Mariana775,Rutherford999,white,nonhispanic,F,Yarmouth Massachusetts US,999 Kuhn Forge,Lowell,Massachusetts,935630.3,8756.19
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,1996-10-18,999-60-7372,Gregorio366,Auer97,white,nonhispanic,M,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,598763.07,3772.2


In [36]:
# merging df_conditions with patients
df = (pd.merge(patients, df_conditions, how = 'outer',on = 'Id'))

In [37]:
# merging df_allergies with the result of the previous merge
df = (pd.merge(df, df_allergies, how = 'outer',on = 'Id'))
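
Because each flag table is expected to hold exactly one row per patient, it can be worth checking that assumption during the merge. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on toy frames; `patients_toy`, `flags_toy`, and the IDs are hypothetical and stand in for the real tables.

```python
import pandas as pd

# Toy stand-ins for patients and one per-patient flag table
patients_toy = pd.DataFrame({'Id': ['p1', 'p2', 'p3']})
flags_toy = pd.DataFrame({'Id': ['p1', 'p3'],
                          'non_adherence_conditions': [1, 0]})

merged = pd.merge(
    patients_toy, flags_toy,
    how='outer', on='Id',
    validate='one_to_one',  # raises MergeError if an Id repeats on either side
    indicator=True,         # adds a _merge column: left_only / right_only / both
)
print(merged['_merge'].value_counts())
```

Rows marked `left_only` are patients without any record in the flag table, so their flag is NaN after the merge; a `right_only` row would point to a flag whose Id is missing from the patients table.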

In [38]:
df.head()

Unnamed: 0,Id,BIRTHDATE,SSN,FIRST,LAST,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,CITY,STATE,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,non_adherence_conditions,non_adherence_allergies
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1989-05-25,999-76-6866,José Eduardo181,Gómez206,white,hispanic,M,Marigot Saint Andrew Parish DM,427 Balistreri Way Unit 19,Chicopee,Massachusetts,271227.08,1334.88,1,
1,034e9e3b-2def-4559-bb2a-7850888ae060,1983-11-14,999-73-5361,Milo271,Feil794,white,nonhispanic,M,Danvers Massachusetts US,422 Farrell Path Unit 69,Somerville,Massachusetts,793946.01,3204.49,0,
2,10339b10-3cd1-4ac3-ac13-ec26728cb592,1992-06-02,999-27-3385,Jayson808,Fadel536,white,nonhispanic,M,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,574111.9,2606.4,0,
3,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,1978-05-27,999-85-4926,Mariana775,Rutherford999,white,nonhispanic,F,Yarmouth Massachusetts US,999 Kuhn Forge,Lowell,Massachusetts,935630.3,8756.19,0,
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,1996-10-18,999-60-7372,Gregorio366,Auer97,white,nonhispanic,M,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,598763.07,3772.2,0,


#### Combining non_adherence_conditions and non_adherence_allergies into a new column, non-adherence, as follows:<br>
1 and NaN or NaN and 1 : 1<br>
0 and NaN or NaN and 0 : 0<br>
1 and 0 or 0 and 1 : 1 (if a patient is non-adherent even once, they should be classified as non-adherent)<br>
1 and 1 : 1<br>
0 and 0 : 0<br>
NaN and NaN : NaN

In [39]:
def label_adherence(row):
    # Combine the two per-file flags into a single non-adherence label
    cond = row['non_adherence_conditions']
    allergy = row['non_adherence_allergies']
    # No record in either file: leave the label missing
    if pd.isna(cond) and pd.isna(allergy):
        return None
    # Non-adherent in at least one file: non-adherent overall
    if cond == 1 or allergy == 1:
        return 1
    # Every available flag is 0: adherent
    return 0

In [40]:
df['non-adherence'] = df.apply(label_adherence, axis=1)
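
Under the principle that a patient flagged non-adherent in either file counts as non-adherent, the combination can also be sketched without a row-wise function: a row-wise maximum skips NaN by default, so it gives 1 if either flag is 1, 0 if every available flag is 0, and NaN when both are missing. This sketch assumes both flag columns are numeric and uses a toy frame (`toy` is a hypothetical name); note that under this reading a mixed pair (1 and 0) yields 1.

```python
import numpy as np
import pandas as pd

# One toy row per combination of the two flags
toy = pd.DataFrame({
    'non_adherence_conditions': [1, 0, np.nan, 1, 0, np.nan],
    'non_adherence_allergies':  [np.nan, np.nan, 1, 0, 0, np.nan],
})

# Row-wise max with the default skipna=True ignores missing flags
flag_columns = ['non_adherence_conditions', 'non_adherence_allergies']
toy['non-adherence'] = toy[flag_columns].max(axis=1)
print(toy)
```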

In [41]:
df.head()

Unnamed: 0,Id,BIRTHDATE,SSN,FIRST,LAST,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,CITY,STATE,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,non_adherence_conditions,non_adherence_allergies,non-adherence
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1989-05-25,999-76-6866,José Eduardo181,Gómez206,white,hispanic,M,Marigot Saint Andrew Parish DM,427 Balistreri Way Unit 19,Chicopee,Massachusetts,271227.08,1334.88,1,,1.0
1,034e9e3b-2def-4559-bb2a-7850888ae060,1983-11-14,999-73-5361,Milo271,Feil794,white,nonhispanic,M,Danvers Massachusetts US,422 Farrell Path Unit 69,Somerville,Massachusetts,793946.01,3204.49,0,,0.0
2,10339b10-3cd1-4ac3-ac13-ec26728cb592,1992-06-02,999-27-3385,Jayson808,Fadel536,white,nonhispanic,M,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,574111.9,2606.4,0,,0.0
3,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,1978-05-27,999-85-4926,Mariana775,Rutherford999,white,nonhispanic,F,Yarmouth Massachusetts US,999 Kuhn Forge,Lowell,Massachusetts,935630.3,8756.19,0,,0.0
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,1996-10-18,999-60-7372,Gregorio366,Auer97,white,nonhispanic,M,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,598763.07,3772.2,0,,0.0


In [42]:
df.shape

(1171, 17)

In [43]:
df.isnull().sum()

Id                             0
BIRTHDATE                      0
SSN                            0
FIRST                          0
LAST                           0
RACE                           0
ETHNICITY                      0
GENDER                         0
BIRTHPLACE                     0
ADDRESS                        0
CITY                           0
STATE                          0
HEALTHCARE_EXPENSES            0
HEALTHCARE_COVERAGE            0
non_adherence_conditions      19
non_adherence_allergies     1030
non-adherence                 19
dtype: int64

In [44]:
# Dropping the non_adherence_allergies and non_adherence_conditions columns
df.drop(['non_adherence_conditions','non_adherence_allergies'],axis = 1,inplace = True)
df.head()

Unnamed: 0,Id,BIRTHDATE,SSN,FIRST,LAST,RACE,ETHNICITY,GENDER,BIRTHPLACE,ADDRESS,CITY,STATE,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE,non-adherence
0,1d604da9-9a81-4ba9-80c2-de3375d59b40,1989-05-25,999-76-6866,José Eduardo181,Gómez206,white,hispanic,M,Marigot Saint Andrew Parish DM,427 Balistreri Way Unit 19,Chicopee,Massachusetts,271227.08,1334.88,1.0
1,034e9e3b-2def-4559-bb2a-7850888ae060,1983-11-14,999-73-5361,Milo271,Feil794,white,nonhispanic,M,Danvers Massachusetts US,422 Farrell Path Unit 69,Somerville,Massachusetts,793946.01,3204.49,0.0
2,10339b10-3cd1-4ac3-ac13-ec26728cb592,1992-06-02,999-27-3385,Jayson808,Fadel536,white,nonhispanic,M,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,574111.9,2606.4,0.0
3,8d4c4326-e9de-4f45-9a4c-f8c36bff89ae,1978-05-27,999-85-4926,Mariana775,Rutherford999,white,nonhispanic,F,Yarmouth Massachusetts US,999 Kuhn Forge,Lowell,Massachusetts,935630.3,8756.19,0.0
4,f5dcd418-09fe-4a2f-baa0-3da800bd8c3a,1996-10-18,999-60-7372,Gregorio366,Auer97,white,nonhispanic,M,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,598763.07,3772.2,0.0


In [45]:
#df.to_csv('../myCSV/modified_patients.csv',index = False)