## 1. Introduction
Many university students experience irregular sleep schedules due to study pressure, extended screen time, and unhealthy daily habits.  
Lack of proper sleep can affect focus, productivity, and overall mental and physical well-being. This project aims to study the factors that influence students’ sleep quality by analyzing real data and discovering patterns that connect lifestyle behaviors with sleep outcomes. Understanding these patterns can help raise awareness among students and encourage healthier academic and personal routines.



### 1.1 Problem Definition
Students often struggle to maintain a consistent and healthy sleep schedule because of academic workload, high stress levels, and late-night technology use. These behaviors lead to poor sleep quality, fatigue, and reduced academic performance. The problem addressed in this project is the lack of understanding of how lifestyle factors such as study hours, caffeine intake, and physical activity interact and influence sleep quality. By analyzing these relationships, the project seeks to highlight the main contributors to irregular sleep among university students.



### 1.2 Project Scope
* Data Understanding: Using a real-world dataset that contains students’ lifestyle and academic factors such as study hours, screen time, caffeine intake, and physical activity.  
* Data Preprocessing: Handling missing values, verifying data consistency, and preparing the dataset for analysis.  
* Exploratory Data Analysis: Exploring patterns and relationships between different variables to understand how lifestyle habits impact sleep quality.  
* Visualization and Insights: Creating charts and statistical summaries to present findings about students’ sleep behaviors.  
* Sleep Quality Prediction: Building a simple predictive framework that estimates or classifies sleep quality based on lifestyle and academic factors.  
* Future Work: Expanding the system to provide personalized recommendations for improving sleep habits.

---


## 2. Dataset Goal & Source

### 2.1 Dataset Goal
The Student Sleep Patterns dataset is intended for analyzing and predicting how academic and lifestyle factors affect students’ sleep quality and duration. It includes information such as study hours, screen time, caffeine intake, and physical activity, allowing the exploration of correlations between daily habits and sleep quality. The dataset supports modeling and prediction to identify key factors that influence healthy and balanced sleep routines among university students.

### 2.2 Dataset Source
The dataset was obtained from a public Kaggle source titled *“Student Sleep Patterns Dataset”* created by Arsalan Jamal.  
It provides real-world information about students’ demographics, study habits, screen time, caffeine intake, physical activity, and sleep quality.  
[https://www.kaggle.com/datasets/arsalanjamal002/student-sleep-patterns](https://www.kaggle.com/datasets/arsalanjamal002/student-sleep-patterns)

---


## 3. General Information
### 3.1 Number of Observations and Features

In [None]:
df.shape

(500, 14)

**Explanation:**

The dataset contains **500 observations** (students) and **14 features** that capture various aspects of students' lifestyle, study habits, and sleep patterns.  

Below is the full list of features:

1- **Student_ID** = Unique identifier for each student record.  
2- **Age** = The age of the student (in years).  
3- **Gender** = The student’s gender (Male, Female, or Other).  
4- **University_Year** = The academic year of the student (e.g., 1st, 2nd, 3rd, 4th).  
5- **Sleep_Duration** = The average number of sleep hours per night.  
6- **Study_Hours** = The average number of study hours per day.  
7- **Screen_Time** = The average daily screen exposure (in hours).  
8- **Caffeine_Intake** = The number of caffeine servings consumed per day (coffee, energy drinks, etc.).  
9- **Physical_Activity** = A numeric indicator of how active the student is (e.g., exercise frequency).  
10- **Sleep_Quality** = The target variable indicating sleep quality on a scale of 1 to 10 (higher = better).  
11- **Weekday_Sleep_Start** = Average time the student goes to sleep during weekdays.  
12- **Weekday_Sleep_End** = Average time the student wakes up during weekdays.  
13- **Weekend_Sleep_Start** = Average time the student goes to sleep during weekends.  
14- **Weekend_Sleep_End** = Average time the student wakes up during weekends.  

### 3.2 Data Types

In [4]:
df.dtypes

Student_ID               int64
Age                      int64
Gender                  object
University_Year         object
Sleep_Duration         float64
Study_Hours            float64
Screen_Time            float64
Caffeine_Intake          int64
Physical_Activity        int64
Sleep_Quality            int64
Weekday_Sleep_Start    float64
Weekend_Sleep_Start    float64
Weekday_Sleep_End      float64
Weekend_Sleep_End      float64
dtype: object

**Explanation:**

The dataset contains both **numerical** and **categorical** data types.

1- **Numerical Features (int64 / float64):**  
  `Age`, `Sleep_Duration`, `Study_Hours`, `Screen_Time`, `Caffeine_Intake`, `Physical_Activity`,  
  `Sleep_Quality`, `Weekday_Sleep_Start`, `Weekday_Sleep_End`, `Weekend_Sleep_Start`, `Weekend_Sleep_End`

2- **Categorical Features (object):**  
  `Gender`, `University_Year`

3- **Identifier:**  
  `Student_ID` — a unique identifier for each record that doesn’t affect prediction.
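
The grouping above can also be derived from the dtypes instead of being listed by hand. The snippet below is a minimal sketch on a small stand-in frame (the `sample` values are illustrative, not the full dataset); it assumes the identifier column is dropped first so it is not counted as a numerical feature.

```python
import pandas as pd

# Tiny stand-in frame with a few of the dataset's columns (illustrative values).
sample = pd.DataFrame({
    "Student_ID": [1, 2, 3],
    "Age": [24, 21, 22],
    "Gender": ["Other", "Male", "Male"],
    "University_Year": ["2nd Year", "1st Year", "4th Year"],
    "Sleep_Duration": [7.7, 6.3, 5.1],
})

# Set the identifier aside before grouping columns by dtype.
features = sample.drop(columns=["Student_ID"])

# Numeric columns (int64 / float64) versus everything else (text categories).
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Applied to the full `df`, the same two calls should reproduce the numerical and categorical lists given above.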


### 3.3 Target Variable / Classes Description

In [5]:
df['Sleep_Quality'].value_counts().sort_index()

Sleep_Quality
1     66
2     46
3     54
4     46
5     41
6     57
7     45
8     40
9     55
10    50
Name: count, dtype: int64

**Explanation:**

The result above shows the distribution of the **Sleep_Quality** column.  
On the **left side**, the numbers represent the different **sleep quality levels** from **1 to 10**,  
and on the **right side**, we can see how many students fall into each level.

For example, **66 students** have a sleep quality of **1**, **46 students** rated **2**, and so on.  
This means that each value shows how many students reported that specific level of sleep quality.

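Overall, the classes are roughly balanced: counts range from **40** (level 8) to **66** (level 1), so no single level dominates.  
Levels **5 to 9** account for 238 of the 500 students (about 48%), and level **1** is the single most common rating, so the data does not show a clear majority with **moderate to good sleep quality**.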
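
As a quick check on how balanced the levels are, the counts can be turned into percentage shares. This is a minimal sketch that re-enters the counts printed above by hand; with the real data, `df['Sleep_Quality'].value_counts().sort_index()` would supply the same Series.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {1: 66, 2: 46, 3: 54, 4: 46, 5: 41, 6: 57, 7: 45, 8: 40, 9: 55, 10: 50},
    name="count",
)

# Percentage of students at each sleep quality level.
share = (counts / counts.sum() * 100).round(1)

# Ratio of the largest class to the smallest: values near 1 mean balanced classes.
imbalance_ratio = round(counts.max() / counts.min(), 2)

# Combined share of levels 5 through 9 (label-based slice, both ends included).
share_5_to_9 = round(counts.loc[5:9].sum() / counts.sum() * 100, 1)

print(share)
print("Largest / smallest class:", imbalance_ratio)
print("Share of levels 5-9 (%):", share_5_to_9)
```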

### 3.4 Summary & Visualization
#### 3.4.1 Statistical Summary

In [10]:
df.drop(columns=['Student_ID']).describe()

| | Age | Sleep_Duration | Study_Hours | Screen_Time | Caffeine_Intake | Physical_Activity | Sleep_Quality | Weekday_Sleep_Start | Weekend_Sleep_Start | Weekday_Sleep_End | Weekend_Sleep_End |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 21.536 | 6.4724 | 5.9816 | 2.525 | 2.462 | 62.342 | 5.362 | 11.16686 | 12.37586 | 6.9299 | 8.9881 |
| std | 2.33315 | 1.485764 | 3.475725 | 0.859414 | 1.682325 | 35.191674 | 2.967249 | 5.972352 | 5.789611 | 1.183174 | 1.111253 |
| min | 18.0 | 4.0 | 0.1 | 1.0 | 0.0 | 0.0 | 1.0 | 1.08 | 2.05 | 5.0 | 7.02 |
| 25% | 20.0 | 5.1 | 2.9 | 1.8 | 1.0 | 32.75 | 3.0 | 6.0875 | 7.2975 | 5.9 | 8.0475 |
| 50% | 21.0 | 6.5 | 6.05 | 2.6 | 2.0 | 62.5 | 5.0 | 10.635 | 12.69 | 6.885 | 9.005 |
| 75% | 24.0 | 7.8 | 8.8 | 3.3 | 4.0 | 93.25 | 8.0 | 16.1525 | 17.3275 | 7.9725 | 9.925 |
| max | 25.0 | 9.0 | 12.0 | 4.0 | 5.0 | 120.0 | 10.0 | 21.93 | 22.0 | 8.98 | 10.99 |


**Explanation:**

This table displays the statistical summary for all numerical features in the dataset, excluding `Student_ID`, `Gender`, and `University_Year`.  
The categorical columns were not included because they are non-numeric.  

From this summary:
1- The average student age is around **21.5 years**, ranging between **18 and 25**.  
2- The average **Sleep Duration** is approximately **6.5 hours**, while **Study Hours** average around **6 hours per day**.  
3- The **Screen Time** mean is about **2.5 hours**, and **Caffeine Intake** averages about **2.5 servings per day** (median 2).  
4- **Physical Activity** varies widely, from **0 to 120** with a mean of about **62** and a standard deviation of about **35**, indicating large differences in lifestyle habits.  
5- **Sleep Quality** spans the full **1 to 10** scale, with a mean of about **5.4** and a standard deviation of about **3.0**.  
6- The **Weekday_Sleep_Start** values range from **1.08 to 21.93**, with a mean of about **11.2** and a median of about **10.6**. Because these are clock times that wrap around midnight, the arithmetic mean should be read with caution.  
7- The **Weekend_Sleep_Start** values range from **2.05 to 22.0**, with a mean of about **12.4** and a median of about **12.7**. Taken at face value, the mean is about **1.2 hours later** than on weekdays.  
8- The **Weekday_Sleep_End** values range from **5.0 to 8.98**, with the middle half between about **5.9 and 8.0 (≈ 6:00 to 8:00 AM)** and a mean of about **6.9**.  
9- The **Weekend_Sleep_End** values range from **7.02 to 10.99**, with the middle half between about **8.0 and 9.9 (≈ 8:00 to 10:00 AM)** and a mean of about **9.0**, meaning students **wake up around 2 hours later on weekends**.  
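
The sleep start and end columns are read here as decimal hours on a 24-hour clock, which is how the explanation above interprets them. Under that assumption, a small helper can turn a value such as a column mean into a clock time. The function name `decimal_hour_to_clock` is hypothetical and only illustrates the conversion.

```python
def decimal_hour_to_clock(hours: float) -> str:
    """Convert decimal hours on a 24-hour clock (e.g. 6.5) to 'HH:MM'."""
    # Round to whole minutes and wrap at 24 hours so 24.0 maps to 00:00.
    total_minutes = round(hours * 60) % (24 * 60)
    hour, minute = divmod(total_minutes, 60)
    return f"{hour:02d}:{minute:02d}"


# Means taken from the statistical summary table above.
weekday_wake = decimal_hour_to_clock(6.9299)
weekend_wake = decimal_hour_to_clock(8.9881)

print("Mean weekday wake-up:", weekday_wake)
print("Mean weekend wake-up:", weekend_wake)
```

For bedtimes, which can fall on either side of midnight, converting a plain arithmetic mean this way can be misleading, so the helper is better suited to the wake-up columns or to individual rows.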

#### 3.4.2 Variable Distributions Visualization

#### 3.4.3 Missing Value Analysis

#### 3.4.4 Class Imbalance


## 4. Preprocessing Techniques
Preprocessing is an essential step that prepares the dataset for accurate and meaningful analysis. It ensures that the data is clean, consistent, and properly formatted before performing any statistical or predictive modeling. The preprocessing phase typically involves checking for missing values, transforming variables, normalizing data ranges, and removing unnecessary or redundant information.
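
The checks named above can be sketched in a few lines of pandas. This is an illustrative example on a tiny made-up frame (`sample` and its values are assumptions), not the project's actual preprocessing code; min-max scaling is used here as one common way to normalize a range.

```python
import pandas as pd

# Made-up rows: one missing Study_Hours value and one repeated row.
sample = pd.DataFrame({
    "Study_Hours": [7.9, 6.0, None, 6.0],
    "Screen_Time": [3.4, 1.9, 3.9, 1.9],
})

# Missing values per column.
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Min-max normalization: rescale a column to the [0, 1] range.
col = sample["Screen_Time"]
screen_scaled = (col - col.min()) / (col.max() - col.min())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(screen_scaled.round(2).tolist())
```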
### 4.1 Variable Transformation
Transforming categorical variables into numerical form is an important step that enables proper statistical analysis and model training. In this dataset, variables such as **Gender** and **University_Year** contain categorical values that need to be represented numerically for easier processing and interpretation.  
To achieve this, they were encoded into numeric values as follows:  

- **Gender:** Female → 0, Male → 1, Other → 2  
- **University_Year:** 1st Year → 1, 2nd Year → 2, 3rd Year → 3, 4th Year → 4  

This encoding puts both variables in a numeric form that statistical summaries and most modeling libraries accept.  
The codes for **University_Year** preserve the natural order of the academic years, while the codes for **Gender** are arbitrary labels and do not imply any ranking. With these variables converted, the dataset is ready for the upcoming modeling phase.
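
The mapping listed above can be applied with `Series.map`. The snippet below is a minimal sketch on a stand-in frame; the dictionary names `gender_map` and `year_map` are illustrative, and the category labels are assumed to match the spellings shown in the dataset preview.

```python
import pandas as pd

# Stand-in rows using the category labels listed above.
sample = pd.DataFrame({
    "Gender": ["Other", "Male", "Female"],
    "University_Year": ["2nd Year", "1st Year", "4th Year"],
})

# Encoding tables taken from the list above.
gender_map = {"Female": 0, "Male": 1, "Other": 2}
year_map = {"1st Year": 1, "2nd Year": 2, "3rd Year": 3, "4th Year": 4}

encoded = sample.assign(
    Gender=sample["Gender"].map(gender_map),
    University_Year=sample["University_Year"].map(year_map),
)

# Labels missing from a mapping become NaN, so confirm nothing was lost.
unmapped = int(encoded.isna().sum().sum())

print(encoded)
print("Unmapped values:", unmapped)
```

Because the integer codes for `Gender` carry no real order, one-hot encoding (for example with `pd.get_dummies`) is a possible alternative for models that would otherwise treat the codes as a ranking.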



### 4.2 Discretization


### 4.3 Value or Variable Removal
We first examined whether the **Student_ID** variable provides any meaningful contribution to understanding students’ sleep patterns. After review, we confirmed that **Student_ID** serves only as a unique identifier and does not contain analytical value related to lifestyle or academic factors. Therefore, it was removed from the dataset to prevent unnecessary complexity and ensure the analysis focuses on relevant attributes such as sleep duration, study hours, screen time, and caffeine intake. Removing this column improves the dataset’s clarity and supports more accurate modeling and interpretation.


In [5]:
import pandas as pd

# Load dataset
df = pd.read_csv("Dataset/student_sleep_patterns.csv")

# Remove the Student_ID column since it does not contribute to the analysis
initial_columns = df.shape[1]

df = df.drop(columns=['Student_ID'])

print("'Student_ID' column has been removed successfully.")
print(f"Number of columns before removal: {initial_columns}")
print(f"Number of columns after removal: {df.shape[1]}")

# Display the first few rows to confirm the change
df.head()


'Student_ID' column has been removed successfully.
Number of columns before removal: 14
Number of columns after removal: 13


| | Age | Gender | University_Year | Sleep_Duration | Study_Hours | Screen_Time | Caffeine_Intake | Physical_Activity | Sleep_Quality | Weekday_Sleep_Start | Weekend_Sleep_Start | Weekday_Sleep_End | Weekend_Sleep_End |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Other | 2nd Year | 7.7 | 7.9 | 3.4 | 2 | 37 | 10 | 14.16 | 4.05 | 7.41 | 7.06 |
| 1 | 21 | Male | 1st Year | 6.3 | 6.0 | 1.9 | 5 | 74 | 2 | 8.73 | 7.1 | 8.21 | 10.21 |
| 2 | 22 | Male | 4th Year | 5.1 | 6.7 | 3.9 | 5 | 53 | 5 | 20.0 | 20.47 | 6.88 | 10.92 |
| 3 | 24 | Other | 4th Year | 6.3 | 8.6 | 2.8 | 4 | 55 | 9 | 19.82 | 4.08 | 6.69 | 9.42 |
| 4 | 20 | Male | 4th Year | 4.7 | 2.7 | 2.7 | 0 | 85 | 3 | 20.98 | 6.12 | 8.98 | 9.01 |


### 4.4 Normalization


### 4.5 Handling Missing Values

### 4.6 Handling Duplicates