<a href="https://colab.research.google.com/github/riteshraj1362/SmartSleepPredictor/blob/main/Sleep_Health_and_Lifestyle_Predication_with_94_Ac.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
uom190346a_sleep_health_and_lifestyle_dataset_path = kagglehub.dataset_download('uom190346a/sleep-health-and-lifestyle-dataset')

print('Data source import complete.')


# Problem Statement: Impact of Lifestyle on Sleep Health

# Introduction

Sleep plays a vital role in maintaining overall health and well-being. However, various lifestyle factors can significantly impact sleep quality and duration. Understanding the relationship between lifestyle choices and sleep health is essential for individuals seeking to improve their sleep patterns. Analyzing these factors can give a data scientist valuable insight into the causes and effects of sleep disturbances, and can help individuals make informed decisions to optimize their sleep health.

# Dataset Overview

The Sleep Health and Lifestyle Dataset comprises 400 rows and 13 columns, providing comprehensive information on sleep-related variables and daily habits. It covers a wide range of factors, including sleep duration, sleep quality, physical activity levels, stress levels, BMI category, blood pressure, heart rate, daily steps, and the presence or absence of sleep disorders. This dataset offers valuable insights into the relationship between lifestyle and sleep health.

# Key Features of the Dataset

Comprehensive Sleep Metrics: The dataset includes variables related to sleep duration, quality, and factors influencing sleep patterns. These metrics allow for a detailed analysis of sleep-related aspects.

Lifestyle Factors: The dataset provides information on various lifestyle factors, such as physical activity levels and stress levels. These variables allow for the exploration of how lifestyle choices impact sleep health.

Cardiovascular Health: Blood pressure and heart rate measurements are included in the dataset. These variables enable the examination of the relationship between cardiovascular health and sleep-related factors.

Sleep Disorder Analysis: The presence or absence of sleep disorders, such as Insomnia and Sleep Apnea, is indicated in the dataset. This information allows for the identification and analysis of sleep disorders within the context of other variables.

# Dataset Columns

The dataset consists of the following columns:

Person ID: An identifier for each individual in the dataset.

Gender: The gender of the person (Male/Female).

Age: The age of the person in years.

Occupation: The occupation or profession of the person.

Sleep Duration (hours): The number of hours the person sleeps per day.

Quality of Sleep (scale: 1-10): A subjective rating of the quality of sleep, ranging from 1 to 10.

Physical Activity Level (minutes/day): The number of minutes the person engages in physical activity daily.

Stress Level (scale: 1-10): A subjective rating of the stress level experienced by the person, ranging from 1 to 10.

BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).

Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.

Heart Rate (bpm): The resting heart rate of the person in beats per minute.

Daily Steps: The number of steps the person takes per day.

Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).

This dataset provides a rich source of information for exploring the impact of various lifestyle factors on sleep health. Analyzing this data can yield valuable insights and assist in developing strategies to improve sleep quality and overall well-being.
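
Before any analysis it can help to confirm that a loaded frame really carries the columns listed above. The following is a minimal sketch, assuming the CSV header uses the short names that the code cells in this notebook use (for example `Sleep Duration` without the unit). The helper `missing_columns` and the two-row `sample` frame are hypothetical and invented for illustration only.

```python
import pandas as pd

# Column names as used by the code cells in this notebook (assumed to match the CSV header).
EXPECTED_COLUMNS = [
    'Person ID', 'Gender', 'Age', 'Occupation', 'Sleep Duration',
    'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
    'BMI Category', 'Blood Pressure', 'Heart Rate', 'Daily Steps',
    'Sleep Disorder',
]


def missing_columns(frame):
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]


# Hypothetical two-row sample; the values are made up, not taken from the dataset.
sample = pd.DataFrame({
    'Person ID': [1, 2],
    'Gender': ['Male', 'Female'],
    'Age': [27, 35],
    'Occupation': ['Engineer', 'Nurse'],
    'Sleep Duration': [6.1, 7.2],
    'Quality of Sleep': [6, 8],
    'Physical Activity Level': [42, 60],
    'Stress Level': [6, 4],
    'BMI Category': ['Overweight', 'Normal'],
    'Blood Pressure': ['126/83', '120/80'],
    'Heart Rate': [77, 68],
    'Daily Steps': [4200, 8000],
    'Sleep Disorder': ['None', 'Insomnia'],
})

print(missing_columns(sample))
```

An empty list means every expected column is present; any name in the list points to a header that differs from what the later cells assume.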

# Importing

In [None]:
# Importing

import numpy as np
import pandas as pd
import os
import plotly.graph_objs as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

In [None]:
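# List the files in the dataset directory returned by kagglehub above
# (os is already imported in the previous cell)
for dirname, _, filenames in os.walk(uom190346a_sleep_health_and_lifestyle_dataset_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))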

# Load the dataset

In [None]:

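# Build the path from the kagglehub download location so the cell also works outside Kaggle
df = pd.read_csv(os.path.join(uom190346a_sleep_health_and_lifestyle_dataset_path,
                              'Sleep_health_and_lifestyle_dataset.csv'))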


# Check the head and tail of the dataset

In [None]:
df.head()

In [None]:
df.tail()

# Data Outline and Preprocessing


In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print('Unique Values of Occupation are', df['Occupation'].unique())

print('\nUnique Values of BMI Category are', df['BMI Category'].unique())

print('\nUnique Values of Sleep Disorder are', df['Sleep Disorder'].unique())


# Preprocessing - Split 'Blood Pressure' into systolic (upper) and diastolic (lower) values
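
The cells below split the `'systolic/diastolic'` string into two numeric columns in several steps. As a compact illustration of the same idea, here is a sketch on a hypothetical three-row frame (`toy` and its readings are made up); it is not a replacement for the cells that follow.

```python
import pandas as pd

# Hypothetical readings in the same 'systolic/diastolic' string format.
toy = pd.DataFrame({'Blood Pressure': ['126/83', '140/90', '120/80']})

# Split on '/' into two columns and convert both to floats in one step.
toy[['BloodPressure_Upper_Value', 'BloodPressure_Lower_Value']] = (
    toy['Blood Pressure'].str.split('/', expand=True).astype(float)
)

# The original string column is no longer needed once the parts are numeric.
toy = toy.drop(columns='Blood Pressure')
print(toy)
```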

In [None]:
df['Blood Pressure'].unique()

In [None]:
df1 = pd.concat([df, df['Blood Pressure'].str.split('/', expand=True)], axis=1).drop('Blood Pressure', axis=1)

In [None]:
df1

In [None]:
df1 = df1.rename(columns={0: 'BloodPressure_Upper_Value', 1: 'BloodPressure_Lower_Value'})

In [None]:
df1

In [None]:
df1['BloodPressure_Upper_Value'] = df1['BloodPressure_Upper_Value'].astype(float)
df1['BloodPressure_Lower_Value'] = df1['BloodPressure_Lower_Value'].astype(float)


In [None]:
df1.info()

# Handling Categorical Variables
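
`LabelEncoder` assigns integer codes in sorted label order, which is worth knowing when reading the encoded plots later on. A small sketch on hypothetical labels (the `labels` series is made up in the style of the `Sleep Disorder` column):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the style of the 'Sleep Disorder' column.
labels = pd.Series(['None', 'Insomnia', 'Sleep Apnea', 'None', 'Insomnia'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's code is its position there.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Note that the codes carry no real ordering: they are just positions in an alphabetical list, so distances between them are not meaningful.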

In [None]:
# Import the label encoder
from sklearn import preprocessing

# pandas >= 2.0 may read the literal string 'None' as missing (NaN); restore it as a label
df1['Sleep Disorder'] = df1['Sleep Disorder'].fillna('None')

# Fit one LabelEncoder per column and keep it, so the integer codes can be
# mapped back to the original labels later via encoders[col].classes_
encoders = {}
for col in ['Gender', 'Occupation', 'BMI Category', 'Sleep Disorder']:
    encoders[col] = preprocessing.LabelEncoder()
    df1[col] = encoders[col].fit_transform(df1[col])
df1.head()

In [None]:
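# Outlier removal using the 1.5 * IQR rule
num_col = ['Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level',
           'Heart Rate', 'Daily Steps', 'BloodPressure_Upper_Value', 'BloodPressure_Lower_Value']

Q1 = df1[num_col].quantile(0.25)
Q3 = df1[num_col].quantile(0.75)
IQR = Q3 - Q1

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] and drop any row with at least one flag
is_outlier = (df1[num_col] < (Q1 - 1.5 * IQR)) | (df1[num_col] > (Q3 + 1.5 * IQR))
# .copy() gives an independent frame so later column assignments do not act on a view
df1 = df1[~is_outlier.any(axis=1)].copy()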
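
To make the IQR rule used above concrete, here is a worked sketch on a single hypothetical series (`heart_rate` and its values are made up). It applies the same 1.5 * IQR fences to one column instead of nine.

```python
import pandas as pd

# Hypothetical heart-rate values with one obviously extreme reading.
heart_rate = pd.Series([60, 62, 65, 66, 68, 70, 72, 110])

q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Fences: anything below `lower` or above `upper` is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (heart_rate < lower) | (heart_rate > upper)
kept = heart_rate[~is_outlier]
print(lower, upper, kept.tolist())
```

Because the notebook drops a row when any of its numeric columns is flagged, the number of removed rows can be noticeably larger than the number of outliers in any single column.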


In [None]:
df1.head()

# Visualization

In [None]:
# Correlation Heatmap
fig = px.imshow(df1.drop('Person ID', axis=1).corr())
fig.show()

In [None]:
# Pairplot
fig = px.scatter_matrix(df1.drop(['Person ID'], axis=1), color='Sleep Disorder')
fig.show()

In [None]:
# Histogram by Sleep Disorder
fig = px.histogram(df1, x='Sleep Duration', color='Sleep Disorder', marginal='rug', nbins=30)
fig.update_layout(title='Histogram by Sleep Disorder',
                  xaxis=dict(title='Sleep Duration'),
                  yaxis=dict(title='Count'),
                  legend=dict(title='Sleep Disorder'),
                  showlegend=True)
fig.show()

In [None]:
# Histogram by BMI Category
fig = px.histogram(df1, x='Sleep Duration', color='BMI Category', marginal='rug', nbins=30)
fig.update_layout(title='Histogram by BMI Category',
                  xaxis=dict(title='Sleep Duration'),
                  yaxis=dict(title='Count'),
                  legend=dict(title='BMI Category'),
                  showlegend=True)
fig.show()

In [None]:
# Boxplot by Gender
fig = px.box(df1, x='Gender', y='Sleep Duration', color='Gender')
fig.update_layout(title='Boxplot by Gender',
                  xaxis=dict(title='Gender'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
# Boxplot by Occupation
fig = px.box(df1, x='Occupation', y='Sleep Duration', color='Occupation')
fig.update_layout(title='Boxplot by Occupation',
                  xaxis=dict(title='Occupation'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
# Boxplot by BMI Category
fig = px.box(df1, x='BMI Category', y='Sleep Duration', color='BMI Category')
fig.update_layout(title='Boxplot by BMI Category',
                  xaxis=dict(title='BMI Category'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
# Boxplot by Sleep Disorder
fig = px.box(df1, x='Sleep Disorder', y='Sleep Duration', color='Sleep Disorder')
fig.update_layout(title='Boxplot by Sleep Disorder',
                  xaxis=dict(title='Sleep Disorder'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
# Analysis - "Relationship between sleep duration and body mass index depends on age"

# Scatterplot with Age, Sleep Duration and BMI Category
fig = px.scatter(df1, x='Age', y='Sleep Duration', color='BMI Category', hover_data=['Age', 'Sleep Duration'])
fig.update_layout(title='Scatterplot: Age vs Sleep Duration (Color: BMI Category)',
                  xaxis=dict(title='Age'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
df1['Age'].unique()

# Create age group 20s, 30s, 40s, and 50s
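
Bin edges matter for round ages. By default `pd.cut` closes each interval on the right, so an age of exactly 30 falls into (20, 30]; passing `right=False` closes the intervals on the left instead. A small sketch with hypothetical ages shows the difference:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and next to the bin edges.
ages = pd.Series([29, 30, 39, 40, 59])
edges = [20, 30, 40, 50, 60]
names = ['20s', '30s', '40s', '50s']

# Default: intervals are (20, 30], (30, 40], ... so 30 is grouped with the 20s.
right_closed = pd.cut(ages, edges, labels=names)

# right=False: intervals are [20, 30), [30, 40), ... so 30 is grouped with the 30s.
left_closed = pd.cut(ages, edges, labels=names, right=False)

print(pd.DataFrame({'age': ages, 'right_closed': right_closed, 'left_closed': left_closed}))
```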

In [None]:
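# Create age groups 20s, 30s, 40s, and 50s
# right=False makes the bins [20, 30), [30, 40), ... so ages 30, 40, and 50 land in their own decade
df1['Age_bin'] = pd.cut(df1['Age'], [20, 30, 40, 50, 60], right=False, labels=['20s', '30s', '40s', '50s'])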

In [None]:
# Boxplot: BMI Category by Age_bin
fig = px.box(df1, x='Age_bin', y='BMI Category', color='Age_bin')
fig.update_layout(title='Boxplot: BMI Category by Age_bin',
                  xaxis=dict(title='Age_bin'),
                  yaxis=dict(title='BMI Category'))
fig.show()

In [None]:
# Boxplot: Sleep Duration by Age_bin
fig = px.box(df1, x='Age_bin', y='Sleep Duration', color='Age_bin')
fig.update_layout(title='Boxplot: Sleep Duration by Age_bin',
                  xaxis=dict(title='Age_bin'),
                  yaxis=dict(title='Sleep Duration'))
fig.show()

In [None]:
# Age, BMI Category, and Sleep Duration Boxplot by Occupation
# 'Age' is used instead of 'Age_bin' because melting a categorical column together with
# numeric ones yields a mixed-type 'Value' column that a box plot cannot summarize
df_long = pd.melt(df1, id_vars=['Occupation'], value_vars=['Age', 'BMI Category', 'Sleep Duration'],
                  var_name='Variable', value_name='Value')

fig = px.box(df_long, x='Occupation', y='Value', color='Variable')
fig.update_layout(title='Boxplot: Age, BMI Category, and Sleep Duration by Occupation',
                  xaxis=dict(title='Occupation'),
                  yaxis=dict(title='Value'))
fig.show()
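
For readers unfamiliar with `pd.melt`, the reshaping step above turns several value columns into one long `Variable`/`Value` pair per row. A minimal sketch on a hypothetical two-row frame (names and values are made up, and occupations are left as strings here for readability):

```python
import pandas as pd

# Hypothetical wide-format frame: one row per person, one column per measurement.
toy = pd.DataFrame({
    'Occupation': ['Nurse', 'Doctor'],
    'Age': [35, 42],
    'Sleep Duration': [6.1, 7.3],
})

# Long format: each (person, measurement) pair becomes its own row.
long_form = pd.melt(toy, id_vars=['Occupation'], value_vars=['Age', 'Sleep Duration'],
                    var_name='Variable', value_name='Value')
print(long_form)
```

Since all melted columns share one `Value` axis, variables on very different scales (years versus hours) will compress each other in a single box plot.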


In [None]:
df1.head()

In [None]:
df1.info()

# Machine Learning - Multi-Classification Prediction

In [None]:
# Machine Learning - Multi-Classification Prediction
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Prepare the data

In [None]:
# Prepare the data
X = df1.drop(['Person ID', 'Sleep Disorder'], axis=1)
y = df1['Sleep Disorder']

In [None]:
X.drop(['Age_bin'], axis=1, inplace=True)

# Split the data into train and test sets
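
The split below is purely random. With imbalanced classes, one possible refinement (an assumption on my part, not something this notebook does) is to pass `stratify=` so that each class keeps its share in both subsets. A sketch on hypothetical labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 20 / 20 samples across three classes.
y_toy = np.array([0] * 60 + [1] * 20 + [2] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 60/20/20 proportions in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)

test_counts = np.bincount(y_te)
print(test_counts)
```

Switching the notebook to a stratified split would change which rows end up in the test set, so the reported accuracy would likely shift as well.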

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Create a pipeline
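
Wrapping the scaler and the classifier in one `Pipeline` means the scaler's statistics are learned only from the data passed to `fit`, which keeps test rows from leaking into preprocessing. A sketch on hypothetical random data (`X_train_toy`, `y_train_toy`, and `toy_pipeline` are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 40 rows, 3 features, two alternating classes.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(40, 3))
y_train_toy = np.array([0, 1] * 20)

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(X_train_toy, y_train_toy)

# The fitted scaler stores the per-feature mean of the training data only.
learned_mean = toy_pipeline.named_steps['scaler'].mean_
print(np.allclose(learned_mean, X_train_toy.mean(axis=0)))
```

When the pipeline later calls `predict` on new rows, it reuses these stored training statistics rather than recomputing them.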

In [None]:
# Create a pipeline with data preprocessing and classification model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])

In [None]:
# Define parameter grids for hyperparameter tuning
param_grid = [
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [100, 200, 300, 400],
        'clf__max_depth': [None, 5, 10, 15],
    },
    {
        'clf': [SVC()],
        'clf__kernel': ['linear', 'rbf'],
        'clf__C': [0.01, 0.1, 1, 10],
    },
    {
        'clf': [LogisticRegression()],
        'clf__solver': ['liblinear', 'lbfgs'],
        'clf__C': [0.01, 0.1, 1, 10],
    },
    {
        'clf': [KNeighborsClassifier()],
        'clf__n_neighbors': [3, 5, 7, 9],
    },
    {
        'clf': [GradientBoostingClassifier()],
        'clf__n_estimators': [100, 200, 300, 400],
        'clf__learning_rate': [0.01, 0.1, 1],
    },
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_depth': [None, 5, 10, 15],
    }
]
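
It can be useful to know how much work this grid implies before running it. The sketch below counts the candidates with `ParameterGrid`, using the same value lists as above; the estimator objects are replaced by short name strings, which is an illustrative simplification so the snippet stays lightweight.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the grid above; estimator objects are swapped for name strings.
toy_grid = [
    {'clf': ['rf'], 'clf__n_estimators': [100, 200, 300, 400],
     'clf__max_depth': [None, 5, 10, 15]},
    {'clf': ['svc'], 'clf__kernel': ['linear', 'rbf'], 'clf__C': [0.01, 0.1, 1, 10]},
    {'clf': ['logreg'], 'clf__solver': ['liblinear', 'lbfgs'], 'clf__C': [0.01, 0.1, 1, 10]},
    {'clf': ['knn'], 'clf__n_neighbors': [3, 5, 7, 9]},
    {'clf': ['gb'], 'clf__n_estimators': [100, 200, 300, 400],
     'clf__learning_rate': [0.01, 0.1, 1]},
    {'clf': ['tree'], 'clf__max_depth': [None, 5, 10, 15]},
]

# Each dict expands to the product of its value lists; the dicts are then added up.
n_candidates = len(ParameterGrid(toy_grid))

# With 5-fold cross-validation every candidate is fitted five times.
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

The final refit of the best candidate on the full training set comes on top of these cross-validation fits.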

# Perform grid search for hyperparameter tuning

In [None]:
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Calculate test accuracy for each model with its default hyperparameters
# (these scores are independent of the grid search above)
models = [
    ('Random Forest', RandomForestClassifier()),
    ('SVM', SVC()),
    ('Logistic Regression', LogisticRegression()),
    ('KNN', KNeighborsClassifier()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('Decision Tree', DecisionTreeClassifier())
]

accuracy_scores = []
for name, model in models:
    # Use a separate variable so the pipeline passed to GridSearchCV is not overwritten
    model_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', model)
    ])
    model_pipeline.fit(X_train, y_train)
    y_pred = model_pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
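
The grid search above selects `best_model` by cross-validated score, but the cell never scores that tuned model on the held-out test set. The following sketch shows one way to do so, on hypothetical synthetic data from `make_classification` and with a deliberately tiny grid; the names `toy_search`, `cv_accuracy`, and `test_accuracy` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical three-class data standing in for the sleep dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)

toy_search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()),
              ('clf', DecisionTreeClassifier(random_state=0))]),
    {'clf__max_depth': [None, 5, 10, 15]},
    cv=5,
)
toy_search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy of the winning candidate.
cv_accuracy = toy_search.best_score_

# The refitted best estimator can then be scored once on the untouched test set.
test_accuracy = accuracy_score(y_te, toy_search.best_estimator_.predict(X_te))
print(toy_search.best_params_, round(cv_accuracy, 3), round(test_accuracy, 3))
```

Comparing the two numbers gives a rough sense of how well the cross-validated estimate carries over to unseen rows.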

# Comparison Chart

In [None]:
# Comparison Chart
fig = go.Figure(data=go.Bar(x=[name for name, _ in models], y=accuracy_scores))
fig.update_layout(title='Comparison of Models',
                  xaxis=dict(title='Models'),
                  yaxis=dict(title='Accuracy Score'))
fig.show()

# Feature Importance
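
The `feature_importances_` attribute used in the cell below exists only on tree-based estimators. If the grid search happens to pick SVC, logistic regression, or KNN, a model-agnostic alternative is permutation importance. This is a different technique from the one the notebook uses; the sketch runs on hypothetical synthetic data with a KNN pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 5 features, of which 3 are informative.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)

# KNN has no feature_importances_ attribute, so it serves as the example model.
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=5)),
]).fit(X_toy, y_toy)

# Shuffle one feature at a time and measure how much the score drops.
result = permutation_importance(knn_pipeline, X_toy, y_toy, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
print(ranking, result.importances_mean.round(3))
```

For a less optimistic picture, the importance is usually computed on held-out data rather than on the rows the model was fitted on, as done here for brevity.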

In [None]:
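# Feature Importance (only tree-based models expose feature_importances_)
best_clf = best_model.named_steps['clf']
feature_names = X.columns

if hasattr(best_clf, 'feature_importances_'):
    importance = best_clf.feature_importances_

    # Sort features from most to least important
    sorted_indices = np.argsort(importance)[::-1]
    sorted_importance = importance[sorted_indices]
    sorted_features = feature_names[sorted_indices]

    fig = go.Figure(data=go.Bar(x=sorted_features, y=sorted_importance))
    fig.update_layout(title='Feature Importance',
                      xaxis=dict(title='Features'),
                      yaxis=dict(title='Importance'))
    fig.show()
else:
    # SVC, LogisticRegression, and KNeighborsClassifier have no feature_importances_
    print(f'{type(best_clf).__name__} does not expose feature_importances_.')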

# The best model for this dataset is GradientBoostingClassifier(), which gave the highest accuracy at 94%.

# If you like this notebook, please upvote it and share your feedback.