<h1 style="text-align: center;"><b>2. Descriptive Analysis</b></h1>

<font size="4">**Q What is the distribution of average sleep duration across the different age groups?**</font>

**Analyzing this can reveal differences in sleep patterns across age groups. The solution bins the Age column into distinct groups (here, ten-year bands from 20-29 to 60-69) and then calculates the mean of Average Sleep Duration (hrs) for each group.**

In [1]:
import pandas as pd

# Load the demographic dataset from a CSV file.
# Adjust the path to match where the file is stored on your machine.
# The raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences.
df = pd.read_csv(r'C:\Python_Hackathon_Aug2025\HUPA-UC Diabetes Dataset\T1DM_patient_sleep_demographics_with_race.csv')

print("\033[1mDataFrame's Information:\033[0m\n")
df.info()

# Define age bins and labels
bins = [20, 30, 40, 50, 60, 70]
labels = ['20-29', '30-39', '40-49', '50-59', '60-69']

# Create a new column 'Age Group' by binning the 'Age' column.
# right=False makes each bin include its left edge and exclude its right edge, e.g. [20, 30).
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Group by 'Age Group' and calculate the mean of 'Average Sleep Duration (hrs)'
sleep_duration_by_age = df.groupby('Age Group', observed=False)['Average Sleep Duration (hrs)'].mean()

print("\n\033[1mAverage Sleep Duration by Age Group:\033[0m\n")
print(sleep_duration_by_age)



DataFrame's Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Patient_ID                    25 non-null     object 
 1   Age                           25 non-null     int64  
 2   Gender                        25 non-null     object 
 3   Race                          25 non-null     object 
 4   Average Sleep Duration (hrs)  25 non-null     float64
 5   Sleep Quality (1-10)          25 non-null     float64
 6   % with Sleep Disturbances     25 non-null     int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 1.5+ KB

Average Sleep Duration by Age Group:

Age Group
20-29    6.133333
30-39    6.171429
40-49    5.860000
50-59    6.150000
60-69    5.857143
Name: Average Sleep Duration (hrs), dtype: float64
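
The binning step above depends on how `pd.cut` treats the bin edges. The following is a minimal sketch on a small hypothetical frame (the ages and sleep values are invented for illustration, not taken from the dataset). It shows that `right=False` assigns age 30 to the 30-39 group, that ages outside the 20-69 range receive no label, and that reporting the group size next to the mean makes small groups easy to spot.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
toy = pd.DataFrame({
    "Age": [19, 20, 29, 30, 69, 70],
    "Average Sleep Duration (hrs)": [7.0, 6.0, 6.5, 5.5, 6.2, 5.9],
})

bins = [20, 30, 40, 50, 60, 70]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]

# right=False makes each interval closed on the left: [20, 30), [30, 40), ...
toy["Age Group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Ages outside [20, 70) receive no label (NaN) and drop out of the groupby.
unbinned = toy[toy["Age Group"].isna()]

# Report the group size next to the mean so small groups are easy to spot.
summary = (
    toy.groupby("Age Group", observed=False)["Average Sleep Duration (hrs)"]
    .agg(["count", "mean"])
)
print(summary)
print(unbinned)
```

With only 25 patients spread over five bins, each group mean rests on a handful of values, so the `count` column is worth reading before comparing groups.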


<font size="5"> **Merging all patient files with the demographic file into a new file**</font>

In [2]:
import pandas as pd
import glob
import os

def merge_rawfiles():
    # Combine the per-patient raw CSV files into a single DataFrame
    files = glob.glob(r"C:\Python_Hackathon_Aug2025\HUPA-UC Diabetes Dataset\*.csv")
    df_files = []
    for file in files:
        filename = os.path.splitext(os.path.basename(file))[0]
        if not filename.startswith("HUPA"):
            continue
        df = pd.read_csv(file, delimiter=";")
        
        # Normalize column headers so they match across all files
        df.columns = df.columns.str.strip().str.lower()
        # Keep one row per timestamp within each patient file
        df.drop_duplicates(subset=["time"], inplace=True)
        df["patient_id"] = filename
        df_files.append(df)
        
    # Merge the patient data files
    df = pd.concat(df_files, ignore_index=True)
    
    return df

# Get the raw patient data DataFrame
merged_raw_df = merge_rawfiles()

# Reading the demographic patient file which contains the new columns
demographic_df = pd.read_csv(r"C:\Python_Hackathon_Aug2025\HUPA-UC Diabetes Dataset\T1DM_patient_sleep_demographics_with_race.csv")

# Convert all column names in the demographic DataFrame to lowercase
demographic_df.columns = demographic_df.columns.str.lower()

# Merge the raw data with the demographic data on 'patient_id'
final_df = pd.merge(merged_raw_df, demographic_df, on="patient_id", how="left")

# Save the final merged file
final_df.to_csv(r"C:\Python_Hackathon_Aug2025\final_merged_file.csv", index=False)

print("Final DataFrame with sleep quality columns created and saved as final_merged_file.csv")
print(final_df.head())

Final DataFrame with sleep quality columns created and saved as final_merged_file.csv
                  time  glucose  calories  heart_rate  steps  basal_rate  \
0  2018-06-13T18:40:00    332.0    6.3595   82.322835   34.0    0.091667   
1  2018-06-13T18:45:00    326.0    7.7280   83.740157    0.0    0.091667   
2  2018-06-13T18:50:00    330.0    4.7495   80.525180    0.0    0.091667   
3  2018-06-13T18:55:00    324.0    6.3595   89.129032   20.0    0.091667   
4  2018-06-13T19:00:00    306.0    5.1520   92.495652    0.0    0.075000   

   bolus_volume_delivered  carb_input patient_id  age gender   race  \
0                     0.0         0.0  HUPA0001P   34   Male  Other   
1                     0.0         0.0  HUPA0001P   34   Male  Other   
2                     0.0         0.0  HUPA0001P   34   Male  Other   
3                     0.0         0.0  HUPA0001P   34   Male  Other   
4                     0.0         0.0  HUPA0001P   34   Male  Other   

   average sleep duration (hrs
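
A left merge silently leaves demographic columns empty for any `patient_id` that has no match, and duplicates rows if an ID appears twice in the demographic file. The sketch below shows one way to check for both, using the `validate` and `indicator` options of `pd.merge`. The two small frames are hypothetical stand-ins for `merged_raw_df` and `demographic_df`, and the IDs `HUPA0002P` and `HUPA0099P` and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for merged_raw_df and demographic_df.
raw = pd.DataFrame({
    "patient_id": ["HUPA0001P", "HUPA0001P", "HUPA0002P", "HUPA0099P"],
    "glucose": [332.0, 326.0, 140.0, 155.0],
})
demo = pd.DataFrame({
    "patient_id": ["HUPA0001P", "HUPA0002P"],
    "age": [34, 51],
})

# validate="m:1" raises MergeError if patient_id is duplicated in demo.
# indicator=True adds a _merge column recording where each row matched.
checked = pd.merge(
    raw, demo, on="patient_id", how="left", validate="m:1", indicator=True
)

# Patient IDs present in the raw data but missing from the demographic file.
unmatched_ids = sorted(
    checked.loc[checked["_merge"] == "left_only", "patient_id"].unique()
)
print(unmatched_ids)
```

An empty `unmatched_ids` list would indicate that every raw file found its demographic row.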

<font size="3">**Q Is there a correlation between sleep quality, sleep disturbances, and physical activity metrics (steps and calories)?**</font>

**This question aims to investigate the relationships between different health and wellness metrics. By calculating the Pearson correlation coefficient, we can determine the strength and direction of the linear relationship between these variables. A positive correlation suggests that as one variable increases, so does the other, while a negative correlation suggests an inverse relationship.**
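
To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand as the covariance of two variables divided by the product of their standard deviations, and compares it with the pandas result. The `steps` and `calories` values are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the calculation.
steps = pd.Series([0.0, 20.0, 34.0, 50.0, 80.0])
calories = pd.Series([5.0, 6.4, 6.3, 8.1, 9.9])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each variable's mean.
dx = steps - steps.mean()
dy = calories - calories.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity computed by pandas.
r_pandas = steps.corr(calories, method="pearson")
print(round(r_manual, 4), round(r_pandas, 4))
```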

In [6]:
import pandas as pd

# Load the merged dataset created in the previous step.
# Adjust the path to match where the file is stored on your machine.
df = pd.read_csv(r'C:\Python_Hackathon_Aug2025\final_merged_file.csv')

print("\033[1mDataFrame's Information:\033[0m\n")
df.info()

# Create a new DataFrame with only the relevant columns for correlation
correlation_df = df[['sleep quality (1-10)', '% with sleep disturbances', 'steps', 'calories']]

# Calculate the correlation matrix
correlation_matrix = correlation_df.corr(method='pearson')

print("\n\033[1mCorrelation Matrix of Sleep, Disturbance, and Physical Activity Metrics:\033[0m\n")
print(correlation_matrix)


DataFrame's Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309392 entries, 0 to 309391
Data columns (total 15 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   time                          309392 non-null  object 
 1   glucose                       309392 non-null  float64
 2   calories                      309392 non-null  float64
 3   heart_rate                    309392 non-null  float64
 4   steps                         309392 non-null  float64
 5   basal_rate                    309392 non-null  float64
 6   bolus_volume_delivered        309392 non-null  float64
 7   carb_input                    309392 non-null  float64
 8   patient_id                    309392 non-null  object 
 9   age                           309392 non-null  int64  
 10  gender                        309392 non-null  object 
 11  race                          309392 non-null  object 
 12  average sl

**Key concept: a correlation coefficient is a number between -1 and +1 that indicates the strength and direction of a linear relationship between two variables.**

**Variable pair, correlation coefficient, and interpretation**
<ul><li><b>sleep quality (1-10) vs. % with sleep disturbances</b> (r = -0.280536): a weak negative correlation. As the percentage with sleep disturbances increases, sleep quality tends to decrease slightly, which is the expected direction.
</li><li><b>steps vs. calories</b> (r = 0.802922): a strong positive correlation. As the number of steps increases, the calories burned tend to increase as well. This is expected, since physical activity directly drives calorie expenditure.
</li><li><b>sleep quality (1-10) vs. steps</b> (r = -0.065487): a very weak negative correlation. The relationship is negligible, suggesting no meaningful linear connection between a patient's sleep quality and the number of steps they take.
</li><li><b>% with sleep disturbances vs. calories</b> (r = 0.138353): a very weak positive correlation. The relationship is negligible; a higher percentage of sleep disturbances is associated with little, if any, change in calories burned.</li></ul>
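
In the merged file, the sleep columns come from the 25-row demographic table, so each patient's sleep quality and disturbance percentage repeat on every one of that patient's time-stamped rows. Coefficients computed across all 309,392 rows therefore weight patients by how many readings they have. One possible alternative, sketched below on a hypothetical miniature of the merged file (patient IDs and values are invented), is to aggregate activity to one row per patient before correlating.

```python
import pandas as pd

# Hypothetical miniature of final_merged_file.csv: sleep columns repeat per patient.
toy = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "steps": [0, 20, 34, 10, 5, 60, 80, 40, 20],
    "calories": [5.0, 6.4, 6.3, 5.5, 5.2, 8.0, 9.9, 7.1, 6.0],
    "sleep quality (1-10)": [6.0, 6.0, 6.0, 4.5, 4.5, 8.0, 8.0, 8.0, 8.0],
    "% with sleep disturbances": [30, 30, 30, 55, 55, 10, 10, 10, 10],
})

# One row per patient: mean activity, plus the (constant) sleep values.
per_patient = toy.groupby("patient_id").agg(
    mean_steps=("steps", "mean"),
    mean_calories=("calories", "mean"),
    sleep_quality=("sleep quality (1-10)", "first"),
    sleep_disturbance_pct=("% with sleep disturbances", "first"),
)

# Pearson correlations computed over patients rather than over readings.
patient_level_corr = per_patient.corr(method="pearson")
print(per_patient.shape)
print(patient_level_corr.round(3))
```

On the real data this would yield coefficients based on 25 patients, which is a small sample, so the resulting values should be read as descriptive rather than conclusive.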