# Data Work

### 1. Importing and exploring the DataFrame

We start by importing the libraries needed to clean the Sleep Health and Lifestyle dataset.

In [34]:
import numpy as np
import pandas as pd
import yaml

In [35]:
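try:
    with open("../config.yaml", "r") as file:
        config = yaml.safe_load(file)
except FileNotFoundError:
    # Catch only the missing-file case, then re-raise: every later cell needs `config`.
    print("Configuration file not found!")
    raise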

In [36]:
config

{'input_data': {'file': '../data/raw/Sleep_health_and_lifestyle_dataset.csv'},
 'output_data': {'file': '../data/clean/cleaned_data_file.csv'}}
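
Later cells index into `config` with nested keys, so a typo in the YAML file would only surface as a bare `KeyError`. The sketch below shows one way to make that failure more descriptive. `get_path` is a hypothetical helper, not part of the notebook, and the dictionary literal simply mirrors the output shown above.

```python
# Mirrors the structure printed above; in the notebook this comes from config.yaml.
config = {
    "input_data": {"file": "../data/raw/Sleep_health_and_lifestyle_dataset.csv"},
    "output_data": {"file": "../data/clean/cleaned_data_file.csv"},
}


def get_path(cfg: dict, section: str) -> str:
    """Return cfg[section]['file'], or raise a KeyError naming the missing entry."""
    try:
        return cfg[section]["file"]
    except (KeyError, TypeError) as exc:
        # TypeError covers a section that exists but is empty (None) in the YAML.
        raise KeyError(f"config is missing '{section}.file'") from exc


input_path = get_path(config, "input_data")
print(input_path)
```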

In this step, we load the Sleep Health and Lifestyle dataset into a pandas DataFrame.

This dataset contains information about individuals' sleep habits, health indicators, lifestyle patterns, and the presence of sleep disorders.

In [37]:
sleep_df = pd.read_csv(config['input_data']['file'], encoding='ISO-8859-1')
sleep_df.head(5)

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea


Column information:

- Person ID: An identifier for each individual.
- Gender: The gender of the person (Male/Female).
- Age: The age of the person in years.
- Occupation: The occupation or profession of the person.
- Sleep Duration (hours): The number of hours the person sleeps per day.
- Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
- Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
- Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
- BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
- Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
- Heart Rate (bpm): The resting heart rate of the person in beats per minute.
- Daily Steps: The number of steps the person takes per day.
- Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

Next, we check the shape of the DataFrame (rows, columns):

In [5]:
sleep_df.shape

(374, 13)

### 2. Cleaning names of columns

In [7]:
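sleep_df.columns = (
    sleep_df.columns
      .str.lower()
      .str.normalize('NFKD')                 # split accented characters into base letter + mark
      .str.encode('ascii', errors='ignore')  # drop anything that is not ASCII
      .str.decode('utf-8')
      .str.replace(' ', '_')
      # regex=True is needed in pandas >= 2.0, where the pattern is literal by default
      .str.replace('[^0-9a-zA-Z_]', '', regex=True)
)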
sleep_df.head(5)

Unnamed: 0,person_id,gender,age,occupation,sleep_duration,quality_of_sleep,physical_activity_level,stress_level,bmi_category,blood_pressure,heart_rate,daily_steps,sleep_disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
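
The column names in this dataset contain only letters and spaces, so the accent and symbol steps have no visible effect here. The sketch below applies the same steps to a few hypothetical headers that do exercise them. `clean_column_name` and the example headers are illustrative and do not come from the dataset.

```python
import re
import unicodedata

import pandas as pd


def clean_column_name(name: str) -> str:
    """Lowercase, strip accents, and keep only letters, digits, and underscores."""
    # NFKD splits an accented character into base letter + combining mark,
    # so the ASCII encode step drops only the mark.
    ascii_name = (
        unicodedata.normalize("NFKD", name.lower())
        .encode("ascii", errors="ignore")
        .decode("ascii")
    )
    # Replace spaces with underscores, then remove any remaining symbols.
    return re.sub(r"[^0-9a-z_]", "", ascii_name.strip().replace(" ", "_"))


# Hypothetical header names, chosen to exercise each step.
demo = pd.DataFrame(columns=["Person ID", "Sleep Duration (hours)", "Café Visits"])
demo = demo.rename(columns=clean_column_name)
print(list(demo.columns))
```

Passing a function to `rename` keeps the cleaning rule in one place, which helps if the same rule has to be applied to several files.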


### 3. Cleaning Data

Before analysis, we check:

- Missing values
- Duplicates
- Incorrect data types
- Formatting inconsistencies (e.g., "140/90" for blood pressure)
- Inconsistent categories (BMI, occupation, sleep disorder)

In [8]:
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   person_id                374 non-null    int64  
 1   gender                   374 non-null    object 
 2   age                      374 non-null    int64  
 3   occupation               374 non-null    object 
 4   sleep_duration           374 non-null    float64
 5   quality_of_sleep         374 non-null    int64  
 6   physical_activity_level  374 non-null    int64  
 7   stress_level             374 non-null    int64  
 8   bmi_category             374 non-null    object 
 9   blood_pressure           374 non-null    object 
 10  heart_rate               374 non-null    int64  
 11  daily_steps              374 non-null    int64  
 12  sleep_disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


In [9]:
sleep_df.isnull().sum()

person_id                    0
gender                       0
age                          0
occupation                   0
sleep_duration               0
quality_of_sleep             0
physical_activity_level      0
stress_level                 0
bmi_category                 0
blood_pressure               0
heart_rate                   0
daily_steps                  0
sleep_disorder             219
dtype: int64
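
Only `sleep_disorder` has missing values: 219 of 374 rows, or roughly 58.6%. With more columns, a summary that shows counts and percentages side by side and hides the complete columns can be easier to read. The sketch below uses a small made-up frame, not the real data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for sleep_df.
demo = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "sleep_disorder": [np.nan, "Sleep Apnea", np.nan, "Insomnia"],
})

missing = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": (demo.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have missing values.
only_missing = missing[missing["missing"] > 0]
print(only_missing)
```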

Next, we check the unique values of each column to see whether any of them need cleaning.

In [11]:
sleep_df["gender"].unique()

array(['Male', 'Female'], dtype=object)

In [12]:
sleep_df["occupation"].unique()

array(['Software Engineer', 'Doctor', 'Sales Representative', 'Teacher',
       'Nurse', 'Engineer', 'Accountant', 'Scientist', 'Lawyer',
       'Salesperson', 'Manager'], dtype=object)

In [13]:
sleep_df["bmi_category"].unique()

array(['Overweight', 'Normal', 'Obese', 'Normal Weight'], dtype=object)

"Normal" and "Normal Weight" refer to the same category, so we merge them under "Normal".

In [14]:
sleep_df.loc[sleep_df["bmi_category"] == "Normal Weight", "bmi_category"] = "Normal"
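
An alternative to the `.loc` assignment is a mapping passed to `Series.replace`, which scales more easily if other synonyms turn up later. This is a sketch on a standalone Series. The `bmi_map` name is an assumption for illustration.

```python
import pandas as pd

bmi = pd.Series(
    ["Overweight", "Normal", "Obese", "Normal Weight"], name="bmi_category"
)

# Each key is a label to be folded into the value it maps to.
bmi_map = {"Normal Weight": "Normal"}
bmi_clean = bmi.replace(bmi_map)

print(bmi_clean.unique())
```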

In [17]:
sleep_df["blood_pressure"].unique()

array(['126/83', '125/80', '140/90', '120/80', '132/87', '130/86',
       '117/76', '118/76', '128/85', '131/86', '128/84', '115/75',
       '135/88', '129/84', '130/85', '115/78', '119/77', '121/79',
       '125/82', '135/90', '122/80', '142/92', '140/95', '139/91',
       '118/75'], dtype=object)

We can split blood pressure into two columns:

- Systolic (upper number): pressure when the heart contracts
- Diastolic (lower number): pressure when the heart relaxes

In [18]:
sleep_df[['systolic', 'diastolic']] = sleep_df['blood_pressure'].str.split('/', expand=True)
sleep_df['systolic'] = pd.to_numeric(sleep_df['systolic'])
sleep_df['diastolic'] = pd.to_numeric(sleep_df['diastolic'])
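
The split above assumes every value has the form `number/number`. A slightly stricter variant, sketched here on toy values, uses `str.extract` with named groups so that any value that does not match the pattern becomes `NaN` and can be counted before converting to integers.

```python
import pandas as pd

# Toy values in the same format as the blood_pressure column.
demo = pd.DataFrame({"blood_pressure": ["126/83", "140/90", "118/75"]})

# Named groups become the column names of the extracted frame.
parts = demo["blood_pressure"].str.extract(
    r"^(?P<systolic>\d+)/(?P<diastolic>\d+)$"
)

# Rows that did not match the pattern would show up here as NaN.
unparsed = int(parts.isnull().any(axis=1).sum())

# Safe only because unparsed == 0; NaN cannot be cast to int64.
demo[["systolic", "diastolic"]] = parts.astype("int64")
print(demo)
```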

In [19]:
sleep_df

Unnamed: 0,person_id,gender,age,occupation,sleep_duration,quality_of_sleep,physical_activity_level,stress_level,bmi_category,blood_pressure,heart_rate,daily_steps,sleep_disorder,systolic,diastolic
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,,126,83
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,,125,80
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,,125,80
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,140,90
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,140,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
370,371,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
371,372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
372,373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95


In [None]:
# sleep_df.drop(columns=["blood_pressure"], inplace=True)

In [21]:
sleep_df["sleep_disorder"].unique()

array([nan, 'Sleep Apnea', 'Insomnia'], dtype=object)

In [22]:
sleep_df["sleep_disorder"].value_counts()

sleep_disorder
Sleep Apnea    78
Insomnia       77
Name: count, dtype: int64

In [23]:
sleep_df.fillna({"sleep_disorder": "No Disorder"}, inplace=True)

In [24]:
sleep_df["sleep_disorder"].value_counts()

sleep_disorder
No Disorder    219
Sleep Apnea     78
Insomnia        77
Name: count, dtype: int64
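
The column description lists "None" as a valid category, so the missing entries are treated here as people without a diagnosed disorder rather than as unknown values. The counts are consistent with that reading: 219 + 78 + 77 = 374 rows. The same fill can be written without `inplace=True` by assigning the result back, as in this sketch on toy data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"sleep_disorder": [np.nan, "Sleep Apnea", "Insomnia", np.nan]}
)

# Assigning back makes the modified column explicit.
demo["sleep_disorder"] = demo["sleep_disorder"].fillna("No Disorder")

counts = demo["sleep_disorder"].value_counts()
print(counts)
```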

In [25]:
sleep_df

Unnamed: 0,person_id,gender,age,occupation,sleep_duration,quality_of_sleep,physical_activity_level,stress_level,bmi_category,blood_pressure,heart_rate,daily_steps,sleep_disorder,systolic,diastolic
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,No Disorder,126,83
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,125,80
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,125,80
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,140,90
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,140,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
370,371,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
371,372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
372,373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95


In [27]:
sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   person_id                374 non-null    int64  
 1   gender                   374 non-null    object 
 2   age                      374 non-null    int64  
 3   occupation               374 non-null    object 
 4   sleep_duration           374 non-null    float64
 5   quality_of_sleep         374 non-null    int64  
 6   physical_activity_level  374 non-null    int64  
 7   stress_level             374 non-null    int64  
 8   bmi_category             374 non-null    object 
 9   blood_pressure           374 non-null    object 
 10  heart_rate               374 non-null    int64  
 11  daily_steps              374 non-null    int64  
 12  sleep_disorder           374 non-null    object 
 13  systolic                 374 non-null    int64  
 14  diastolic                3

In [28]:
sleep_df.duplicated().sum()

np.int64(0)

### 4. Checking and deleting duplicated values

In [30]:
sleep_df.duplicated(subset= sleep_df.columns.difference(['person_id'])).sum()

np.int64(242)

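The earlier check found no duplicates because `person_id` is unique in every row. Once `person_id` is ignored, 242 rows repeat an earlier row, so we drop them and keep the first occurrence of each.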

In [32]:
sleep_df_clean = sleep_df.drop_duplicates(subset=sleep_df.columns.difference(['person_id']), keep='first')

sleep_df_clean

Unnamed: 0,person_id,gender,age,occupation,sleep_duration,quality_of_sleep,physical_activity_level,stress_level,bmi_category,blood_pressure,heart_rate,daily_steps,sleep_disorder,systolic,diastolic
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,No Disorder,126,83
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,125,80
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,140,90
5,6,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,Insomnia,140,90
6,7,Male,29,Teacher,6.3,6,40,7,Obese,140/90,82,3500,Insomnia,140,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
358,359,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,No Disorder,140,95
359,360,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,No Disorder,140,95
360,361,Female,59,Nurse,8.2,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
364,365,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,140,95
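
Dropping 242 of 374 rows leaves 132, which matches the `count` row in the summary below. Whether identical measurements under different IDs are true duplicates or simply similar people is a judgment call that the data alone cannot settle. The sketch below, on made-up rows, shows why the full-row check and the check without `person_id` disagree.

```python
import pandas as pd

# Toy rows: IDs 1 and 2 share every feature; IDs 3 and 4 differ in sleep_duration.
demo = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "occupation": ["Doctor", "Doctor", "Nurse", "Nurse"],
    "sleep_duration": [6.2, 6.2, 8.1, 8.0],
})

feature_cols = demo.columns.difference(["person_id"])

full_dupes = int(demo.duplicated().sum())                        # unique IDs hide repeats
feature_dupes = int(demo.duplicated(subset=feature_cols).sum())  # IDs ignored

deduped = demo.drop_duplicates(subset=feature_cols, keep="first")
print(full_dupes, feature_dupes, deduped["person_id"].tolist())
```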


### 5. Looking at the statistical summary

#### 5.1 Statistical summary of numerical columns

In [41]:
sleep_df_clean.describe()

Unnamed: 0,person_id,age,sleep_duration,quality_of_sleep,physical_activity_level,stress_level,heart_rate,daily_steps,systolic,diastolic
count,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0,132.0
mean,171.727273,41.128788,7.082576,7.151515,58.393939,5.537879,71.204545,6637.878788,128.363636,84.537879
std,110.418779,8.813942,0.775335,1.269037,20.46884,1.740428,4.867306,1766.288657,7.82565,6.049926
min,1.0,27.0,5.8,4.0,30.0,3.0,65.0,3000.0,115.0,75.0
25%,79.5,33.75,6.4,6.0,44.25,4.0,68.0,5000.0,120.75,80.0
50%,166.5,41.0,7.15,7.0,60.0,6.0,70.0,7000.0,130.0,85.0
75%,268.25,49.0,7.725,8.0,75.0,7.0,74.0,8000.0,135.0,88.5
max,367.0,59.0,8.5,9.0,90.0,8.0,86.0,10000.0,142.0,95.0


#### 5.2 Statistical summary of categorical columns

In [42]:
sleep_df_clean.select_dtypes(include='object').describe()

Unnamed: 0,gender,occupation,bmi_category,blood_pressure,sleep_disorder
count,132,132,132,132,132
unique,2,11,3,25,3
top,Male,Nurse,Normal,130/85,No Disorder
freq,67,29,73,28,73


### 6. Exporting the clean DataFrame

In [43]:
# Export the deduplicated frame (not sleep_df) to the output path defined in config.yaml
sleep_df_clean.to_csv(config['output_data']['file'], index=False, encoding='utf-8')
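
After an export, reading the file back is a cheap way to confirm that nothing changed on the way to disk. The sketch below writes a toy frame to a temporary directory. The `clean/cleaned_data_file.csv` layout only imitates the config shown earlier, and `demo_clean` stands in for the cleaned DataFrame.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo_clean = pd.DataFrame({
    "person_id": [1, 2],
    "sleep_disorder": ["No Disorder", "Insomnia"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean" / "cleaned_data_file.csv"
    # to_csv does not create missing parent folders, so make them first.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    demo_clean.to_csv(out_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(out_path)

# True when values, column order, and dtypes all survive the round trip.
round_trip_ok = reloaded.equals(demo_clean)
print(round_trip_ok)
```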