# Data Work

### 1. Importing and exploring the DataFrame

We start by importing the libraries needed to clean the Sleep Health and Lifestyle dataset.

In [None]:
import numpy as np
import pandas as pd
import yaml

In [None]:
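try:
    with open("../config.yaml", "r") as file:
        config = yaml.safe_load(file)
except FileNotFoundError:
    # Catch only the missing-file case and re-raise, so later cells
    # do not fail with a confusing NameError on `config`.
    print("Configuration file not found!")
    raise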

In [None]:
config
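The loading cell further down reads the CSV path from `config['input_data']['file']`, so the YAML file needs at least that nesting. As a minimal sketch of what such a file could contain (the path shown is a made-up placeholder, not the project's real location):

```python
import yaml

# Hypothetical config contents: only the input_data -> file nesting is
# implied by the notebook; the path itself is invented for illustration.
example_yaml = """
input_data:
  file: "../data/raw/sleep_health.csv"
"""

example_config = yaml.safe_load(example_yaml)
print(example_config["input_data"]["file"])
```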

In this step, we load the Sleep Health and Lifestyle dataset into a pandas DataFrame.

This dataset contains information about individuals' sleep habits, health indicators, lifestyle patterns, and the presence of sleep disorders.

In [None]:
sleep_df = pd.read_csv(config['input_data']['file'], encoding='ISO-8859-1')
sleep_df.head(5)

Column information:

- Person ID: An identifier for each individual.
- Gender: The gender of the person (Male/Female).
- Age: The age of the person in years.
- Occupation: The occupation or profession of the person.
- Sleep Duration (hours): The number of hours the person sleeps per day.
- Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.
- Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.
- Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
- BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
- Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
- Heart Rate (bpm): The resting heart rate of the person in beats per minute.
- Daily Steps: The number of steps the person takes per day.
- Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

Checking the shape of the DataFrame

In [None]:
sleep_df.shape

### 2. Cleaning names of columns

In [None]:
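sleep_df.columns = (
    sleep_df.columns
      .str.lower()
      .str.normalize('NFKD')                    # split accented characters into base + accent
      .str.encode('ascii', errors='ignore')     # drop anything that is not ASCII
      .str.decode('ascii')
      .str.replace(' ', '_', regex=False)
      # regex=True is required: recent pandas versions treat the pattern as a literal string by default
      .str.replace('[^0-9a-zA-Z_]', '', regex=True)
)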
sleep_df.head(5)
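To see what this kind of normalization does, here is a small sketch on made-up column names. The helper name `clean_column_names` and the sample labels are assumptions for illustration only; they are not part of the dataset.

```python
import pandas as pd


def clean_column_names(columns: pd.Index) -> pd.Index:
    """Return lowercase, ASCII-only, snake_case column names."""
    return (
        columns
        .str.lower()
        .str.normalize("NFKD")                  # decompose accented characters
        .str.encode("ascii", errors="ignore")   # drop the non-ASCII accents
        .str.decode("ascii")
        .str.replace(" ", "_", regex=False)
        .str.replace(r"[^0-9a-z_]", "", regex=True)
    )


# Illustrative names only, chosen to exercise spaces, brackets and accents.
raw = pd.Index(["Person ID", "Sleep Duration (hours)", "Café Visits"])
cleaned = clean_column_names(raw)
print(list(cleaned))
```

Wrapping the chain in a function makes it easy to reuse on other DataFrames and to check on a few sample names before applying it to real data.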

### 3. Cleaning Data

Before analysis, we check:

- Missing values
- Duplicates
- Incorrect data types
- Formatting inconsistencies (e.g., "140/90" for blood pressure)
- Inconsistent categories (BMI, occupation, sleep disorder)
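The first three checks in this list can be gathered in one small helper. This is only a sketch on toy data; `quick_audit` is a hypothetical name, and the notebook itself runs the checks cell by cell.

```python
import pandas as pd


def quick_audit(df: pd.DataFrame) -> dict:
    """Collect missing-value counts, fully duplicated rows and dtypes."""
    return {
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy frame: one missing value in "b", and row 1 repeats row 0.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", None]})
report = quick_audit(toy)
print(report)
```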

In [None]:
sleep_df.info()

In [None]:
sleep_df.isnull().sum()

Next, we check the unique values of each categorical column to see whether any of them need cleaning.

In [None]:
sleep_df["gender"].unique()

In [None]:
sleep_df["occupation"].unique()

In [None]:
sleep_df["bmi_category"].unique()

"Normal" and "Normal Weight" refer to the same category, so we merge them under the single label "Normal".

In [None]:
sleep_df.loc[sleep_df["bmi_category"] == "Normal Weight", "bmi_category"] = "Normal"
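An alternative way to merge labels is `Series.replace` with a mapping, which scales better if more synonyms turn up later. A minimal sketch on toy values (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data with the two spellings of the same category.
toy = pd.DataFrame(
    {"bmi_category": ["Normal", "Normal Weight", "Overweight", "Obese"]}
)

# Map every synonym to its canonical label; unlisted values stay unchanged.
label_map = {"Normal Weight": "Normal"}
toy["bmi_category"] = toy["bmi_category"].replace(label_map)

counts = toy["bmi_category"].value_counts()
print(counts)
```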

In [None]:
sleep_df["blood_pressure"].unique()

We can split blood pressure into two components:

- Systolic (upper number): pressure when the heart contracts
- Diastolic (lower number): pressure when the heart relaxes

In [None]:
sleep_df[['systolic', 'diastolic']] = sleep_df['blood_pressure'].str.split('/', expand=True)
sleep_df['systolic'] = pd.to_numeric(sleep_df['systolic'])
sleep_df['diastolic'] = pd.to_numeric(sleep_df['diastolic'])
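The same split can be tried on a few hand-written readings to confirm that both new columns come out numeric. This sketch uses made-up values and adds `errors="coerce"`, an assumption on our side, so that a malformed reading would become `NaN` instead of raising an error.

```python
import pandas as pd

# Illustrative readings in the "systolic/diastolic" format.
toy = pd.DataFrame({"blood_pressure": ["126/83", "140/90", "120/80"]})

# Split once on "/" into two string columns.
parts = toy["blood_pressure"].str.split("/", n=1, expand=True)

# Convert to numbers; unparseable values would turn into NaN.
toy["systolic"] = pd.to_numeric(parts[0], errors="coerce")
toy["diastolic"] = pd.to_numeric(parts[1], errors="coerce")

print(toy)
```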

In [None]:
sleep_df

In [None]:
# sleep_df.drop(columns=["blood_pressure"], inplace=True)

In [None]:
sleep_df["sleep_disorder"].unique()

In [None]:
sleep_df["sleep_disorder"].value_counts()

In [None]:
sleep_df.fillna({"sleep_disorder": "No Disorder"}, inplace=True)

In [None]:
sleep_df["sleep_disorder"].value_counts()
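The effect of the fill can be checked on a toy column. The sketch below assigns the result back to the column instead of using `inplace=True`; both approaches give the same values, and the sample rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy column where a missing value means "no disorder recorded".
toy = pd.DataFrame(
    {"sleep_disorder": ["Insomnia", np.nan, "Sleep Apnea", np.nan]}
)

missing_before = int(toy["sleep_disorder"].isna().sum())

# Replace the missing entries with an explicit label.
toy["sleep_disorder"] = toy["sleep_disorder"].fillna("No Disorder")

counts = toy["sleep_disorder"].value_counts()
print(missing_before, counts.to_dict())
```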

In [None]:
sleep_df

In [None]:
sleep_df.info()

In [None]:
sleep_df.duplicated().sum()

### 4. Checking and deleting duplicated values

In [None]:
sleep_df.duplicated(subset= sleep_df.columns.difference(['person_id'])).sum()

This shows 242 rows that repeat an earlier row once `person_id` is ignored, so we drop them and keep the first occurrence of each.

In [None]:
sleep_df_clean = sleep_df.drop_duplicates(subset=sleep_df.columns.difference(['person_id']), keep='first')

sleep_df_clean
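To illustrate why the ID column is excluded from the comparison, here is a small sketch on toy rows (all values are made up). Every row has a unique `person_id`, so a plain `duplicated()` finds nothing, while comparing the remaining columns reveals the repeats.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "person_id": [1, 2, 3, 4],
        "age": [30, 30, 41, 30],
        "occupation": ["Nurse", "Nurse", "Doctor", "Nurse"],
    }
)

# Compare every column except the identifier.
feature_cols = toy.columns.difference(["person_id"])

n_full_dupes = int(toy.duplicated().sum())
n_feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

# Keep the first occurrence of each distinct feature combination.
deduped = toy.drop_duplicates(subset=feature_cols, keep="first")
print(n_full_dupes, n_feature_dupes, deduped["person_id"].tolist())
```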

### 5. Looking at Statistical summary

#### 5.1 Statistical summary of numerical columns

In [None]:
sleep_df_clean.describe()

#### 5.2 Statistical summary of categorical columns

In [None]:
sleep_df_clean.select_dtypes(include='object').describe()

### 6. Exporting the clean DataFrame

In [None]:
# Export the de-duplicated DataFrame, not the original one
sleep_df_clean.to_csv("sleep_health_project_clean.csv", index=False, encoding='utf-8')