Data cleaning for a dataset on the relationship between lifestyle behaviors (activity, sleep, diet) and glucose regulation in diabetic patients.

In [443]:
import glob
import os
import pandas as pd

In [444]:
# Folder holding the per-patient CSV files and the demographics file
data_dir = "/Users/venmeen/Downloads/HUPA-UC Diabetes Dataset"

# Get all CSV files from the folder
files = glob.glob(os.path.join(data_dir, "*.csv"))

# Read the patient demographics file
demographic_df = pd.read_csv(os.path.join(data_dir, "T1DM_patient_sleep_demographics_with_race.csv"))

<h4>Merging 25 patient data files into one file</h4>
<h5><font face="TimesNewRoman">Adding a patient_id column, derived from the filename, as a unique identifier</font></h5>
<h5><font face="TimesNewRoman">Reasoning: Merging the files enables group analysis, such as finding glucose trends and other general patterns across patients</font></h5>

In [446]:
def merge_rawfiles():
    # Merge the raw per-patient data files into one DataFrame
    df_files = []
    for file in files:

        # The filename without its extension serves as the patient identifier
        filename = os.path.splitext(os.path.basename(file))[0]
        if not filename.startswith("HUPA"):
            continue
        df = pd.read_csv(file, sep=";")

        # Uniform column headers in all files
        df.columns = df.columns.str.strip().str.lower()

        # Remove duplicate rows: within a single patient file, records sharing the same time are duplicates
        df.drop_duplicates(subset=["time"], inplace=True)

        # Add patient_id, since all files are merged together
        df["patient_id"] = filename
        df_files.append(df)

    # Merge the patient data files
    df = pd.concat(df_files, ignore_index=True)

    # Save the result as a single merged CSV file
    df.to_csv("mergedraw_file.csv", index=False)
    return df
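
The merge step can be illustrated on toy data. This is a minimal sketch, not the notebook's own code: the patient labels `P1` and `P2`, the timestamps, and the glucose values are hypothetical stand-ins for the real files. It shows that once the files are combined, a timestamp is only unique together with its patient_id.

```python
import pandas as pd

# Hypothetical per-patient frames standing in for the semicolon-separated CSV files
frames = {
    "P1": pd.DataFrame({
        "time": ["2020-01-01T00:00:00", "2020-01-01T00:00:00", "2020-01-01T00:05:00"],
        "glucose": [100.0, 100.0, 105.0],
    }),
    "P2": pd.DataFrame({
        "time": ["2020-01-01T00:00:00"],
        "glucose": [150.0],
    }),
}

parts = []
for patient_id, frame in frames.items():
    # De-duplicate on time within one patient, then tag the rows with the patient
    frame = frame.drop_duplicates(subset=["time"]).copy()
    frame["patient_id"] = patient_id
    parts.append(frame)

merged = pd.concat(parts, ignore_index=True)

# The same timestamp may appear for several patients in the merged frame
time_repeats_across_patients = merged["time"].duplicated().any()
# The (patient_id, time) pair is what identifies a record
pair_is_unique = not merged.duplicated(subset=["patient_id", "time"]).any()
print(merged)
```

A check like `pair_is_unique` can be run on the merged frame before any indexing or resampling.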

<h4>Verifying the data by checking column data types, null values, and NaN values</h4>

In [527]:
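def verify_data():
    # Reads the module-level df, so df must be assigned before this is called
    print("\033[1mDataFrame's Information:\033[0m\n")
    print(df.info())
    # isnull() and isna() are aliases in pandas, so the next two counts are identical
    print("\033[1m\nNull Value Count:\033[0m\n", df.isnull().sum())
    print("\033[1m\nNaN Values:\033[0m\n", df.isna().sum())
    print("\033[1m\nNumber of rows and cols:\033[0m\n", df.shape)
    print("\033[1m\nDescription of DataFrame:\033[0m\n")
    # describe must be called; printing the bare attribute shows only the bound method
    print(df.describe())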

<h4>Displaying the raw merged file information as is</h4>

In [510]:
df = merge_rawfiles()
print("\033[1mRaw Merged Data Info:\033[0m\n")
verify_data()

Raw Merged Data Info:

DataFrame's Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309392 entries, 0 to 309391
Data columns (total 9 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   time                    309392 non-null  object 
 1   glucose                 309392 non-null  float64
 2   calories                309392 non-null  float64
 3   heart_rate              309392 non-null  float64
 4   steps                   309392 non-null  float64
 5   basal_rate              309392 non-null  float64
 6   bolus_volume_delivered  309392 non-null  float64
 7   carb_input              309392 non-null  float64
 8   patient_id              309392 non-null  object 
dtypes: float64(7), object(2)
memory usage: 21.2+ MB
None

Null Value Count:
 time                      0
glucose                   0
calories                  0
heart_rate                0
steps                     0
basal

<h4>Setting the index to time and patient_id</h4>
<h5><font face="TimesNewRoman">Set time and patient_id as a MultiIndex so that each record can be identified uniquely.</font></h5>
<h5><font face="TimesNewRoman">Reasoning: This index helps with resampling, rolling averages, and plotting</font></h5>

In [450]:
def set_index(df):

    # set_index returns a new DataFrame and leaves the one passed in unchanged,
    # so the caller must assign the returned value to use the MultiIndex
    return df.set_index(['time', 'patient_id'])
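
As an illustration of why this index is useful for resampling, here is a minimal sketch on toy data. The patient labels, timestamps, and the 15-minute window are assumptions chosen for the example, not values from the dataset. It also shows that `set_index` leaves the original frame untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical 5-minute readings for two patients
sample = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:05",
        "2020-01-01 00:10", "2020-01-01 00:15",
        "2020-01-01 00:00",
    ]),
    "patient_id": ["P1", "P1", "P1", "P1", "P2"],
    "glucose": [100.0, 110.0, 120.0, 130.0, 150.0],
})

# set_index returns a new frame; sample itself keeps its default RangeIndex
indexed = sample.set_index(["time", "patient_id"])

# Selecting one patient drops the patient_id level and leaves a DatetimeIndex,
# which is what resample needs
one_patient = indexed.xs("P1", level="patient_id")
resampled = one_patient["glucose"].resample("15min").mean()
print(resampled)
```

Selecting one patient at a time keeps readings from different patients out of the same time bin.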

<h4>Ensuring column data types</h4>

In [531]:
def ensure_col_dtype():

    # Columns expected to hold numeric values
    cols_dtype = ['glucose', 'calories', 'heart_rate', 'steps', 'basal_rate', 'bolus_volume_delivered', 'carb_input']

    # Convert the listed columns to numeric; unparseable values become NaN
    df[cols_dtype] = df[cols_dtype].apply(pd.to_numeric, errors='coerce')

    # Store patient_id as a string
    df['patient_id'] = df['patient_id'].astype("string")

    # Parse the time column as datetime; unparseable values become NaT
    df["time"] = pd.to_datetime(df["time"], errors="coerce")
    return df
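
Because `errors="coerce"` silently turns unparseable entries into missing values, it can help to count missing values after the conversion. The sketch below uses made-up strings, and the timestamp format shown is an assumption rather than the dataset's confirmed format.

```python
import pandas as pd

# Hypothetical raw values, including one bad entry per column
raw = pd.DataFrame({
    "glucose": ["110.5", "n/a", "95"],
    "time": ["2020-01-01T00:00:00", "not a timestamp", "2020-01-01T00:10:00"],
})

# Unparseable numbers become NaN and unparseable timestamps become NaT
raw["glucose"] = pd.to_numeric(raw["glucose"], errors="coerce")
raw["time"] = pd.to_datetime(raw["time"], errors="coerce")

# Count how many values each column lost to coercion
coerced_counts = raw[["glucose", "time"]].isna().sum()
print(coerced_counts)
```

Comparing these counts with the null counts taken before conversion shows how many values were lost to coercion.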

<h4>Standardizing numeric column values</h4>
<font face="TimesNewRoman">Rounding float values to 3 decimal places for clarity and usability</font>

In [533]:
def standardize_numeric_cols():
    # Only these three columns are rounded; the other numeric columns are left as they are
    standardize_cols = ['glucose', 'calories', 'heart_rate']
    df[standardize_cols] = df[standardize_cols].round(3)
    return df

<h4>Checking for negative values in numeric columns</h4>
<font face="TimesNewRoman">Negative values in these columns are treated as errors</font>

In [456]:
def check_negative_values():
    columns = ['glucose', 'calories', 'heart_rate', 'steps', 'basal_rate', 'bolus_volume_delivered', 'carb_input']
    col_negative_values = (df[columns] < 0).any()
    print(col_negative_values)

<h4>Displaying the data after treatment</h4>

In [535]:
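# Coerce dtypes before rounding so that rounding runs on numeric columns
df = ensure_col_dtype()
df = standardize_numeric_cols()
# The result of set_index is not assigned here, so df keeps time and patient_id as regular columns
set_index(df)
print("\033[1mChecking for negative values in columns:\033[0m\n")
check_negative_values()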

Checking for negative values in columns:

glucose                   False
calories                  False
heart_rate                False
steps                     False
basal_rate                False
bolus_volume_delivered     True
carb_input                False
dtype: bool
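
The check above only reports whether a column contains any negative value. One possible follow-up, sketched here on made-up numbers, is to count the negatives per column and replace them with NaN. Replacing with NaN is an assumption about how to treat these errors and not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be the merged df
sample = pd.DataFrame({
    "bolus_volume_delivered": [0.0, 1.5, -0.2, 0.0],
    "carb_input": [0.0, 2.0, 0.0, 1.0],
})
cols = ["bolus_volume_delivered", "carb_input"]

# Number of negative readings per column
negative_counts = (sample[cols] < 0).sum()

# Work on a copy so the raw values stay available for inspection
cleaned = sample.copy()
# mask replaces values with NaN wherever the condition is True
cleaned[cols] = sample[cols].mask(sample[cols] < 0)
print(negative_counts)
```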


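<h4>Setting up glucose range</h4>
The glucose range provides a framework for evaluating a patient's blood sugar control: readings below 40 or above 500 are labeled Invalid, readings below 70 are Below Range, readings above 180 are Above Range, and all others are In Range.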

In [521]:
def classify_by_glucose_value(value):
    if pd.isna(value): 
        return "NA"
    if value < 40 or value > 500:
        return "Invalid"
    elif value < 70:
        return "Below Range"
    elif value > 180:
        return "Above Range"
    else:
        return "In Range"
        
def set_glucose_range(df):
    df["glucose_range_level"] = df["glucose"].apply(classify_by_glucose_value)
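
For a frame of this size, a vectorized version of the same classification may be faster than a row-by-row `apply`. The following is a sketch using the thresholds from the function above. The helper name `classify_glucose_vectorized` is hypothetical.

```python
import numpy as np
import pandas as pd

def classify_glucose_vectorized(glucose: pd.Series) -> pd.Series:
    # Conditions are checked in order; the first match wins, as in the if/elif chain
    conditions = [
        glucose.isna(),
        (glucose < 40) | (glucose > 500),
        glucose < 70,
        glucose > 180,
    ]
    labels = ["NA", "Invalid", "Below Range", "Above Range"]
    result = np.select(conditions, labels, default="In Range")
    return pd.Series(result, index=glucose.index)

# Hypothetical readings covering each branch and the boundary values
readings = pd.Series([np.nan, 30, 40, 65, 70, 120, 180, 250, 500, 600])
levels = classify_glucose_vectorized(readings)
print(levels.tolist())
```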

<h4>Grouping calories burned into categories for easier analysis</h4>
<h4><font face="TimesNewRoman">Classifying calories burned per interval as 'Resting' (below 5), 'Light Activity' (5 to below 20), 'Moderate Activity' (20 to below 35), 'Intense activity' (35 to below 50), and 'Very Extreme/SPIKE' (50 to below 60). Values of 60 or more are treated as errors, since this is 5-minute interval data</font></h4>

In [462]:
def set_caloriesburned_categories(value):
    # Missing values get their own label, matching the glucose classification
    if pd.isna(value):
        return "NA"
    if value < 5:
        return "Resting"
    elif 5 <= value < 20:
        return "Light Activity"
    elif 20 <= value < 35:
        return "Moderate Activity"
    elif 35 <= value < 50:
        return "Intense activity"
    elif 50 <= value < 60:
        return "Very Extreme/SPIKE"
    elif value >= 60:
        # 60 or more within one 5-minute interval is treated as an error
        return "ERROR"


def calories_categories(df):
    df['calories_categories'] = df['calories'].apply(set_caloriesburned_categories)
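
The same binning can be sketched with `pd.cut`, which states the category boundaries in one place. This sketch assumes half-open intervals that include the lower bound, and it assumes that a value of exactly 60 counts as an error. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Bin edges; with right=False each bin includes its lower edge and excludes its upper edge
bins = [-np.inf, 5, 20, 35, 50, 60, np.inf]
labels = [
    "Resting", "Light Activity", "Moderate Activity",
    "Intense activity", "Very Extreme/SPIKE", "ERROR",
]

# Hypothetical calories-per-interval values, including boundary cases
calories = pd.Series([0.0, 4.9, 5.0, 19.99, 20.0, 49.0, 50.0, 59.9, 60.0, 75.0])
categories = pd.cut(calories, bins=bins, labels=labels, right=False)
print(categories.astype(str).tolist())
```

`pd.cut` returns a categorical column, which keeps the categories in their defined order when counting or plotting.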

In [523]:
set_glucose_range(df)
calories_categories(df)
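
Once the range labels exist, one way to use them for evaluating blood sugar control is to compute the share of readings in each range per patient. The sketch below runs on hypothetical labels. The patient ids and the name `range_share` are illustrative only.

```python
import pandas as pd

# Hypothetical labeled readings; in the notebook these columns come from the merged df
labeled = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P1", "P2", "P2"],
    "glucose_range_level": [
        "In Range", "In Range", "Above Range", "Below Range",
        "In Range", "In Range",
    ],
})

# Fraction of each patient's readings that fall in each range level
range_share = (
    labeled.groupby("patient_id")["glucose_range_level"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
print(range_share)
```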

<h4>Displaying the data after adding the range validation columns</h4>

In [537]:
print("\033[1mAfter DataCleanup:\033[0m\n")
verify_data()

After DataCleanup:

DataFrame's Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309392 entries, 0 to 309391
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   time                    309392 non-null  datetime64[ns]
 1   glucose                 309392 non-null  float64       
 2   calories                309392 non-null  float64       
 3   heart_rate              309392 non-null  float64       
 4   steps                   309392 non-null  float64       
 5   basal_rate              309392 non-null  float64       
 6   bolus_volume_delivered  309392 non-null  float64       
 7   carb_input              309392 non-null  float64       
 8   patient_id              309392 non-null  string        
 9   glucose_range_level     309392 non-null  object        
 10  calories_categories     309392 non-null  object        
dtypes: datetime64[ns](1), float64