# Introduction

Sleep is central to overall health and well-being, and lifestyle choices can significantly influence both its quality and its duration. This analysis examines the relationship between lifestyle factors and sleep health, with the aim of giving people who want to improve their sleep something concrete to act on. For a data scientist, understanding these connections supports better-informed decisions about sleep health.

# Dataset Columns

The dataset includes the following columns:

* Person ID: An identifier for each individual in the dataset.

* Gender: The gender of the person (Male/Female).

* Age: The age of the person in years.

* Occupation: The occupation or profession of the person.

* Sleep Duration (hours): The number of hours the person sleeps per day.

* Quality of Sleep (scale: 1-10): A subjective rating of sleep quality (1 to 10).

* Physical Activity Level (minutes/day): The daily minutes of physical activity.

* Stress Level (scale: 1-10): A subjective rating of stress level (1 to 10).

* BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight, Obese).

* Blood Pressure (systolic/diastolic): The blood pressure measurement (systolic over diastolic).

* Heart Rate (bpm): Resting heart rate in beats per minute.

* Daily Steps: The number of steps taken per day.

* Sleep Disorder: The presence or absence of a sleep disorder (None, Insomnia, Sleep Apnea).



# Objective:
Identify and analyze the presence or absence of sleep disorders, including Insomnia and Sleep Apnea, using the lifestyle and health features listed above.

# Importing Libraries:

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Loading the Dataset:

In [2]:
data = pd.read_csv('/kaggle/input/sleep-health-and-lifestyle-dataset/Sleep_health_and_lifestyle_dataset.csv')

# Exploratory Data Analysis (EDA):

In [3]:
data.head()

| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NaN |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


In [5]:
data.duplicated().sum()


0
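
Because `Person ID` is unique for every row, `data.duplicated()` cannot flag two records that are otherwise identical (rows 1 and 2 in the preview above differ only in their ID). A minimal sketch of a check that ignores the identifier, using a small hypothetical frame `toy` built from a few preview columns rather than the full dataset:

```python
import pandas as pd

# Hypothetical stand-in built from the first five preview rows (subset of columns).
toy = pd.DataFrame({
    "Person ID": [1, 2, 3, 4, 5],
    "Age": [27, 28, 28, 28, 28],
    "Occupation": ["Software Engineer", "Doctor", "Doctor",
                   "Sales Representative", "Sales Representative"],
    "Sleep Duration": [6.1, 6.2, 6.2, 5.9, 5.9],
})

# With the unique ID included, no row can be an exact duplicate.
dupes_with_id = int(toy.duplicated().sum())

# Dropping the ID first reveals records that repeat in every other column.
dupes_without_id = int(toy.drop(columns="Person ID").duplicated().sum())

print(dupes_with_id, dupes_without_id)
```

Whether such repeats should be removed is a judgment call, since different people can share the same measurements.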

In [6]:
data.isna().sum()

Person ID                    0
Gender                       0
Age                          0
Occupation                   0
Sleep Duration               0
Quality of Sleep             0
Physical Activity Level      0
Stress Level                 0
BMI Category                 0
Blood Pressure               0
Heart Rate                   0
Daily Steps                  0
Sleep Disorder             219
dtype: int64

In [7]:
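# Only 'Sleep Disorder' has missing values, so this drops the 219 unlabeled rows
# and keeps the 155 rows with a recorded disorder.
data.dropna(inplace=True)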
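
Dropping the rows with a missing `Sleep Disorder` removes 219 of the 374 records and leaves only people with a recorded disorder. If the missing entries stand for the "None" category from the column description (an assumption, not something the output confirms), an alternative is to fill them and keep a three-class problem. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical stand-in; the notebook itself works on `data` loaded from the CSV.
toy = pd.DataFrame({
    "Age": [27, 28, 28, 59],
    "Sleep Disorder": [None, None, "Sleep Apnea", "Insomnia"],
})

# Approach used in the notebook: drop rows without a label.
dropped = toy.dropna(subset=["Sleep Disorder"])

# Alternative (assumption): treat a missing label as the "None" class.
filled = toy.assign(**{"Sleep Disorder": toy["Sleep Disorder"].fillna("None")})

classes_after_fill = sorted(filled["Sleep Disorder"].unique())
print(len(dropped), classes_after_fill)
```

The rest of this notebook follows the first approach, so the reported metrics describe a model that separates Insomnia from Sleep Apnea.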

# Data Preprocessing:

Label Encoding: Transform categorical variables into numerical representations for machine learning models.

In [8]:
label_encoder=LabelEncoder()

In [9]:
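cat_cols = ['Gender', 'Occupation', 'BMI Category', 'Sleep Disorder']
encoders = {}  # one fitted encoder per column, so codes can be mapped back to labels
for col in cat_cols:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])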
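
The integer codes produced by `LabelEncoder` follow the sorted order of the labels, so it helps to keep each fitted encoder around to translate codes back into names. A minimal sketch on a hypothetical frame `toy` (the `encoders` dictionary is a name introduced here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in with two of the categorical columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", "Sleep Apnea"],
})

encoders = {}
for col in ["Gender", "Sleep Disorder"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for later decoding

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoders["Sleep Disorder"].classes_)}

# Translate the encoded column back to the original strings.
decoded = list(encoders["Sleep Disorder"].inverse_transform(toy["Sleep Disorder"]))
print(mapping, decoded)
```

Note that label encoding imposes an arbitrary order on nominal features such as 'Occupation'; this is harmless for tree-based models but can mislead linear ones.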

In [10]:
data

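| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 1 | 28 | 5 | 5.9 | 4 | 30 | 8 | 2 | 140/90 | 85 | 3000 | 1 |
| 4 | 5 | 1 | 28 | 5 | 5.9 | 4 | 30 | 8 | 2 | 140/90 | 85 | 3000 | 1 |
| 5 | 6 | 1 | 28 | 8 | 5.9 | 4 | 30 | 8 | 2 | 140/90 | 85 | 3000 | 0 |
| 6 | 7 | 1 | 29 | 9 | 6.3 | 6 | 40 | 7 | 2 | 140/90 | 82 | 3500 | 0 |
| 16 | 17 | 0 | 29 | 4 | 6.5 | 5 | 40 | 7 | 1 | 132/87 | 80 | 4000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 369 | 370 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 140/95 | 68 | 7000 | 1 |
| 370 | 371 | 0 | 59 | 4 | 8.0 | 9 | 75 | 3 | 3 | 140/95 | 68 | 7000 | 1 |
| 371 | 372 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 140/95 | 68 | 7000 | 1 |
| 372 | 373 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 140/95 | 68 | 7000 | 1 |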


In [11]:

# Split the 'Blood Pressure' column into two columns
data[['Systolic BP', 'Diastolic BP']] = data['Blood Pressure'].str.split('/', expand=True)

# Convert the new columns to numeric type
data[['Systolic BP', 'Diastolic BP']] = data[['Systolic BP', 'Diastolic BP']].apply(pd.to_numeric)

# Drop the original 'Blood Pressure' column
data = data.drop('Blood Pressure', axis=1)


In [12]:
data

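| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Heart Rate | Daily Steps | Sleep Disorder | Systolic BP | Diastolic BP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 1 | 28 | 5 | 5.9 | 4 | 30 | 8 | 2 | 85 | 3000 | 1 | 140 | 90 |
| 4 | 5 | 1 | 28 | 5 | 5.9 | 4 | 30 | 8 | 2 | 85 | 3000 | 1 | 140 | 90 |
| 5 | 6 | 1 | 28 | 8 | 5.9 | 4 | 30 | 8 | 2 | 85 | 3000 | 0 | 140 | 90 |
| 6 | 7 | 1 | 29 | 9 | 6.3 | 6 | 40 | 7 | 2 | 82 | 3500 | 0 | 140 | 90 |
| 16 | 17 | 0 | 29 | 4 | 6.5 | 5 | 40 | 7 | 1 | 80 | 4000 | 1 | 132 | 87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 369 | 370 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 68 | 7000 | 1 | 140 | 95 |
| 370 | 371 | 0 | 59 | 4 | 8.0 | 9 | 75 | 3 | 3 | 68 | 7000 | 1 | 140 | 95 |
| 371 | 372 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 68 | 7000 | 1 | 140 | 95 |
| 372 | 373 | 0 | 59 | 4 | 8.1 | 9 | 75 | 3 | 3 | 68 | 7000 | 1 | 140 | 95 |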


In [13]:
data.drop('Person ID', axis=1, inplace=True)

# Splitting the Data:

In [14]:
# Splitting the data into features (X) and the target variable (y)
X = data.drop('Sleep Disorder', axis=1)  # Features
y = data['Sleep Disorder']  # Target variable

# Splitting the data into training and testing sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
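
With only a few dozen test rows, a purely random split can leave the two classes unevenly represented. One option, not used in this notebook, is to pass `stratify=y` so that each split keeps the class proportions of the full data. A minimal sketch on hypothetical toy arrays `X_toy` and `y_toy`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 samples, two classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in each split mirror the 50/50 balance of the full data.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```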


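Scaling Data: Standardizing numerical features such as 'Age', 'Sleep Duration', and 'Daily Steps' puts variables measured in different units on a common scale (zero mean, unit variance). This matters most for models that rely on distances or gradient updates (for example SGDClassifier, Perceptron, SVMs, and k-nearest neighbors), where a large-valued feature like 'Daily Steps' would otherwise dominate; tree-based models are largely unaffected. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into preprocessing.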


In [15]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data, then transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data using the same scaler
X_test_scaled = scaler.transform(X_test)
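
To make the scaling step concrete: `StandardScaler` computes z = (x - mean) / std per column, using the mean and (population) standard deviation learned from the training data. The sketch below checks this by hand on small hypothetical arrays `X_train_toy` and `X_test_toy`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: columns could be read as age and daily steps.
X_train_toy = np.array([[27.0, 4200.0], [28.0, 10000.0], [59.0, 7000.0]])
X_test_toy = np.array([[29.0, 3500.0]])

scaler_toy = StandardScaler().fit(X_train_toy)

# Manual standardization of the test row with *training* statistics (ddof=0).
manual = (X_test_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)

matches = np.allclose(scaler_toy.transform(X_test_toy), manual)
print(matches)
```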

# Modeling with Lazy Predict:

Lazy Predict: Lazy Predict fits a broad set of scikit-learn-compatible classifiers with their default settings and reports their test-set metrics in a single table. This gives a quick baseline comparison without manual tuning and helps identify which model families are worth investigating further.


In [16]:
pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12
Note: you may need to restart the kernel to use updated packages.


In [17]:
from lazypredict.Supervised import LazyClassifier
# Use LazyClassifier
clf = LazyClassifier()
models, predictions = clf.fit(X_train_scaled, X_test_scaled, y_train, y_test)

# Print the model performance
print(models)

100%|██████████| 29/29 [00:01<00:00, 26.87it/s]

[LightGBM] [Info] Number of positive: 61, number of negative: 63
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002180 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 100
[LightGBM] [Info] Number of data points in the train set: 124, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.491935 -> initscore=-0.032261
[LightGBM] [Info] Start training from score -0.032261
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
SGDClassifier                      0.90               0.91     0.91      0.90   
ExtraTreeClassifier                0.90               0.91     0.91      0.90   
Perceptron                         0.90               0.91     0.91      0.90   
QuadraticDiscriminantAnalysis      0.87               0.88     0.88      0.87   
AdaBoostClassifier           




# Let's try to improve the accuracy by handling the outliers

In [18]:
num_col = ['Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
           'Heart Rate', 'Daily Steps', 'Systolic BP', 'Diastolic BP']

Q1 = data[num_col].quantile(0.25)
Q3 = data[num_col].quantile(0.75)
IQR = Q3 - Q1

data = data[~((data[num_col] < (Q1 - 1.5 * IQR)) | (data[num_col] > (Q3 + 1.5 * IQR))).any(axis=1)]
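
The filter above keeps a row only if every numeric column lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. A small worked sketch of the same rule on a hypothetical frame `toy`, with the helper `iqr_keep_mask` introduced here purely for illustration:

```python
import pandas as pd

def iqr_keep_mask(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.Series:
    """Return True for rows whose values in `cols` all lie within the IQR fences."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    outside = (df[cols] < (q1 - k * iqr)) | (df[cols] > (q3 + k * iqr))
    return ~outside.any(axis=1)

# Hypothetical toy data: heart rate fences are [62, 78], so 120 is flagged.
toy = pd.DataFrame({
    "Heart Rate": [65, 68, 70, 72, 120],
    "Daily Steps": [5000, 6000, 7000, 8000, 9000],
})

mask = iqr_keep_mask(toy, ["Heart Rate", "Daily Steps"])
filtered = toy[mask]
print(filtered)
```

In a stricter setup the fences would be computed from the training split only and then applied, so that test rows do not influence the thresholds.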

In [19]:
# Splitting the data into features (X) and the target variable (y)
X = data.drop('Sleep Disorder', axis=1)  # Features
y = data['Sleep Disorder']  # Target variable

# Splitting the data into training and testing sets (e.g., 80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [20]:

# Re-fit the scaler on the filtered training data, then transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data using the same scaler
X_test_scaled = scaler.transform(X_test)

In [21]:
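# Re-create the classifier: reusing the already-fitted instance appears to be what
# produced the "'tuple' object has no attribute '__name__'" message in the recorded output
clf = LazyClassifier()
models, predictions = clf.fit(X_train_scaled, X_test_scaled, y_train, y_test)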

# Print the model performance
print(models)

'tuple' object has no attribute '__name__'
Invalid Classifier(s)


100%|██████████| 29/29 [00:00<00:00, 31.67it/s]

[LightGBM] [Info] Number of positive: 32, number of negative: 55
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000054 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 73
[LightGBM] [Info] Number of data points in the train set: 87, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.367816 -> initscore=-0.541597
[LightGBM] [Info] Start training from score -0.541597
                               Accuracy  Balanced Accuracy  ROC AUC  F1 Score  \
Model                                                                           
AdaBoostClassifier                 0.95               0.96     0.96      0.96   
RandomForestClassifier             0.95               0.96     0.96      0.96   
ExtraTreeClassifier                0.95               0.94     0.94      0.95   
Perceptron                         0.95               0.94     0.94      0.95   
PassiveAggressiveClassifier    




**After removing outliers, the top accuracy reported by Lazy Predict rose from 0.90 to 0.95. The filtering also shrank the training set from 124 to 87 rows and changed the test set, so the two runs are not evaluated on the same data and the gain should be read as indicative rather than conclusive.**
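
With test sets this small, a single split can swing by several points. One way to get a steadier estimate is k-fold cross-validation with the scaler inside a pipeline. The sketch below is an illustration under stated assumptions, not the method used in this notebook: it uses synthetic data from `make_classification` in place of the sleep dataset, and a `RandomForestClassifier` as a stand-in for whichever model Lazy Predict ranks highest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 150 rows, 11 features, two classes.
X_syn, y_syn = make_classification(
    n_samples=150, n_features=11, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_syn, y_syn, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds makes it easier to judge whether a change such as outlier removal helps beyond split-to-split noise.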
