# COGS 108 - Data Checkpoint

## Authors

- Jacob Lee: Conceptualization, Background research, Methodology, Analysis, Writing - original draft
- Travis Dao: Software, Visualization, Data curation, Analysis, Experimental investigation, Writing - review & editing
- Ranya Tashkandy: Project administration, Software, Visualization, Analysis, Writing - original draft
- Steven Bui: Project administration, Software, Data curation, Analysis, Writing - review & editing

## Research Question

To what extent does insomnia-related sleep quality predict students’ academic performance, specifically their focus, motivation, and assignment completion?

This project uses insomnia as the primary measure of sleep quality, drawn from self-reported questionnaires in which students rate the severity of their insomnia. Academic performance is likewise measured through self-reported data, including assignment completion, focus, and motivation.

## Background and Prior Work

Sleep is a periodically recurring state of rest that occurs in the human body every 24 hours and lasts for several hours. It is naturally induced by the brain and is characterized by reduced activity and decreased responsiveness to stimuli. <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Sleep serves numerous functions for the brain, including clearing out accumulated toxins and waste products, consolidating information acquired during the day, and repairing neurons damaged by free radicals. These processes support the cognitive systems required for effective learning, attention, and decision-making. Because academic performance depends heavily on these same cognitive functions, it is reasonable to expect that sleep quality could predict how well students are able to perform academically. This connection forms the basis of our project.

Numerous studies have examined the impact of sleep on cognitive functioning in general. For example, a study by García et al. found that all basic cognitive processes were detrimentally affected after 24 hours or more without sleep, including attention, working memory, and executive functions such as cognitive flexibility and inhibition of irrelevant sensory stimuli. However, the researchers also found that these processes were affected to different degrees. Attention-related processes were impacted the most, working memory was impacted moderately, and cognitive flexibility, one component of executive functioning, was not found to be impacted at all. <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) The researchers quantified the effects of sleep deprivation through cognitive tests given before and after the participants were deprived of sleep. This pattern is especially relevant for our research because academic tasks such as focusing in class, completing assignments, and maintaining motivation rely heavily on attention and working memory, the very processes most vulnerable to poor sleep quality.

Narrowing down further, a study by researchers from the University of Washington directly investigated the relationship between sleep and academic performance using multiple parameters, including chronotype (whether one is an early bird or a night owl), sleep variability, and sleep timing differences between weekdays and weekends. They found that irregular or low-quality sleep predicted worse academic performance, even when total hours slept were similar across students. This suggests that subjective experiences of sleep quality, such as insomnia symptoms, capture meaningful variation in students’ cognitive and academic functioning. <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Overall, these studies show that sleep is closely tied to the cognitive abilities that support academic performance, especially attention and working memory. When sleep quality declines, these systems become less reliable, making it more difficult for students to stay focused, complete assignments, and maintain motivation. This connection is the reason we use insomnia symptoms as our primary measure of sleep quality, since they capture the kinds of disturbances that directly affect these cognitive processes. Similarly, relying on students’ self-reported academic behaviors aligns with how previous research has approached this question and allows us to examine whether the same relationships appear within our own sample. 


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Kalat, J. (2018). *Biological Psychology*. Cengage Learning.  
2. <a name="cite_note-2"></a> [^](#cite_ref-2) García, A., et al. (2021). *Effects of Sleep Deprivation on Cognitive Performance*. *Frontiers in Psychology*. https://pmc.ncbi.nlm.nih.gov/articles/PMC8340886/  
3. <a name="cite_note-3"></a> [^](#cite_ref-3) University of Washington eScience Institute. (2014). *Students’ Sleep and Academic Performance*. https://escience.washington.edu/incubator-14-sleep/


## Hypothesis


We expect to find a positive relationship between sleep quality and academic performance. Specifically, students who report more severe insomnia symptoms are likely to report lower levels of focus, motivation, and assignment completion. 

This prediction follows from prior research showing that poor or insufficient sleep weakens attention, working memory, and other cognitive processes that are essential for academic success.
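One way this prediction could be checked is sketched below on made-up numbers, not the project data. Insomnia severity and an academic outcome are encoded as ordinal codes and compared with a rank correlation. The column names `insomnia_severity` and `focus` and all values are assumptions for illustration only. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which suits ordinal survey responses.

```python
import pandas as pd

# Toy ordinal responses (hypothetical values, not taken from either dataset).
toy = pd.DataFrame({
    "insomnia_severity": [0, 1, 1, 2, 3, 3, 4, 4],  # higher = more severe
    "focus":             [4, 4, 3, 3, 2, 1, 1, 0],  # higher = better focus
})

# Spearman's rho is the Pearson correlation of the (tie-averaged) ranks.
rho = toy["insomnia_severity"].rank().corr(toy["focus"].rank())

# A negative rho would be consistent with the hypothesis.
print(f"Spearman rho = {rho:.2f}")
```

A coefficient near zero or a positive one on the real data would count against the hypothesis as stated.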


## Data

### Data overview

### Dataset #1: Student Insomnia and Educational Outcomes

This dataset focuses on the relationship between students’ sleep habits, lifestyle factors, and academic performance. Each row represents an individual survey response, and each column corresponds to a self-reported variable related to sleep behavior, stress, and performance. The dataset is available in two versions; we use Version 2, which contains 996 responses, providing insight into how insufficient or poor-quality sleep affects students’ academic and cognitive functioning.

#### Data Collection

The data was collected through an online survey administered via Google Forms between October and November 2024. Respondents were asked to provide insights into their sleep behaviors and the effects on their academic and daily activities.

#### Key Features
1. Demographics: Year of study and gender.
2. Sleep Patterns: Frequency of difficulty falling asleep, hours of sleep, night awakenings, and overall sleep quality.
3. Cognitive and Academic Effects: Impact on concentration, fatigue, class attendance, assignment completion, and overall academic performance.
4. Lifestyle Factors: Electronic device usage before sleep, caffeine consumption, and physical activity frequency.
5. Stress Levels: Self-reported stress related to academic workload.

#### Concerns
Since all data are self-reported, responses may be influenced by participants' ability to recall information or by a desire to present themselves favorably. For example, students may overestimate how much sleep or exercise they get, or underreport screen time and caffeine consumption. Moreover, because the survey was optional, students who filled it out were likely more aware of or concerned about their sleep habits, which could introduce sampling bias. Measures such as “sleep quality”, “stress”, and “academic performance” are also subjective, which limits standardization and the comparability of results across individuals. Another limitation lies in the timing and context of data collection. The survey was conducted between October and November 2024, when midterm or final exams often occur. Academic stress was therefore likely higher and sleep quality worse during this period, meaning the responses may not accurately represent students’ typical sleep behaviors throughout the semester.

Despite these limitations, the dataset remains valuable for exploring general trends and relationships between sleep, lifestyle factors, and academic performance, offering meaningful insights into how sleep hygiene affects university students’ well-being and success.

### Dataset #2: Student Sleep Patterns

The Student Sleep Patterns Dataset contains information on the sleep habits and daily routines of 500 university students. It includes 14 columns describing each student's demographics, lifestyle factors, and sleep-related behavior. The dataset is synthetic, meaning it was generated artificially to simulate realistic patterns rather than collected from real participants.

#### Data Collection

The data was synthetically generated using realistic statistical distributions to reflect common patterns among university students. It was designed to model how different lifestyle factors might influence sleep duration and quality.

#### Key Features
1. Sleep_Duration (hours): Average nightly sleep time.
2. Sleep_Quality (1-10): Self-reported quality of sleep.
3. Study_Hours (hours/day): Average daily study time.
4. Screen_Time (hours/day): Daily non-academic screen exposure.
5. Caffeine_Intake (drinks/day): Number of caffeinated beverages consumed.
6. Physical_Activity (minutes/day): Time spent on physical exercise.
7. Sleep_Times (weekday/weekend): Bedtime and wake-up times in 24-hour format.
8. Demographics: Age, gender, and university year.

#### Concerns
Because the dataset is synthetic, it does not represent real students and may not capture the full complexity of real-world sleep behaviors. The sleep quality variable is subjective and may not align with clinical measures. Additionally, the dataset lacks important factors like stress, diet, or mental health, which could influence sleep patterns. Despite these limitations, it provides a clean, well-structured dataset for learning and practicing data analysis.

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

# import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 

# Travis - I just manually downloaded the dataset_1
# Steven - we just decided to manually upload the datasets to "data/00-raw/" directory!

# Download step kept for reference only; it is commented out because the
# datasets were added to data/00-raw/ manually.
#
# datafiles = [
#     { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
#     { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'dataset_2_student_sleep_patterns.csv'}
# ]
#
# get_data.get_raw(datafiles, destination_directory='data/00-raw/')

### Instructions

1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you

In [3]:
# Imports

import pandas as pd
import numpy as np

### Dataset #1: Student Insomnia and Educational Outcomes

The Student Insomnia and Educational Outcomes Dataset explores how sleep habits, lifestyle choices, and stress levels relate to students’ academic performance. Each row represents an individual’s survey response, capturing variables such as sleep duration, cognitive effects, device usage, and academic functioning. While only two sample entries are shown in the excerpt, the full Version 2 dataset contains 996 responses, offering a substantial basis for analyzing how poor sleep or irregular sleep patterns influence academic outcomes.

#### Data Collection

The dataset was gathered through an online Google Forms survey conducted between October and November 2024. Participants provided self-reported information about their nightly sleep behavior, daytime functioning, and academic habits. Because the survey was optional, respondents were individuals who chose to share their sleep experiences, potentially reflecting greater awareness or concern about their sleep health.

#### Key Features

1. **Demographics:**  
   - Year of study and gender.

2. **Sleep Patterns:**  
   - Difficulty falling asleep, sleep duration, nighttime awakenings, and overall sleep quality.

3. **Cognitive & Academic Effects:**  
   - Impact on concentration, fatigue, class attendance, assignment completion, and perceived academic performance.

4. **Lifestyle Factors:**  
   - Use of electronic devices before bed, caffeine intake, and frequency of physical activity.

5. **Stress Levels:**  
   - Self-reported stress related to academic workload.

#### Concerns

Because the data is entirely self-reported, responses may include recall inaccuracies or social desirability bias—for example, students may overstate their sleep hours or physical activity while underreporting screen time or caffeine consumption. The voluntary nature of the survey may also introduce sampling bias, as participants who completed it may be more concerned about sleep issues than the general student population. Additionally, subjective metrics like “sleep quality,” “stress,” and “academic performance” lack standardized measurements, making comparisons across individuals less precise.

The timing of data collection presents another limitation: the survey took place during October–November 2024, a period when midterms or finals are common. Students may experience heightened stress and sleep disruption during this time, meaning their responses could reflect atypical behavior rather than their usual sleep patterns.

Despite these limitations, the dataset is valuable for examining overall trends between sleep habits, stress, lifestyle factors, and academic performance. It provides meaningful insights into how sleep hygiene affects students' well-being and success.

In [4]:
df_1 = pd.read_csv('data/00-raw/dataset_1/Student Insomnia and Educational Outcomes Dataset_version-2.csv')

df_1.head()

Unnamed: 0,Timestamp,1. What is your year of study?,2. What is your gender?,3. How often do you have difficulty falling asleep at night?,"4. On average, how many hours of sleep do you get on a typical day?",5. How often do you wake up during the night and have trouble falling back asleep?,6. How would you rate the overall quality of your sleep?,7. How often do you experience difficulty concentrating during lectures or studying due to lack of sleep?,"8. How often do you feel fatigued during the day, affecting your ability to study or attend classes?","9. How often do you miss or skip classes due to sleep-related issues (e.g., insomnia, feeling tired)?",10. How would you describe the impact of insufficient sleep on your ability to complete assignments and meet deadlines?,"11. How often do you use electronic devices (e.g., phone, computer) before going to sleep?","12. How often do you consume caffeine (coffee, energy drinks) to stay awake or alert?",13. How often do you engage in physical activity or exercise?,14. How would you describe your stress levels related to academic workload?,15. How would you rate your overall academic performance (GPA or grades) in the past semester?
0,10/24/2024 16:51:15,Graduate student,Male,Often (5-6 times a week),7-8 hours,Often (5-6 times a week),Good,Sometimes,Often,Often (3-4 times a week),Moderate impact,Often (5-6 times a week),Rarely (1-2 times a week),Sometimes (3-4 times a week),High stress,Average
1,10/24/2024 16:51:51,Third year,Male,Often (5-6 times a week),7-8 hours,Often (5-6 times a week),Good,Often,Sometimes,Sometimes (1-2 times a week),Major impact,Sometimes (3-4 times a week),Sometimes (3-4 times a week),Sometimes (3-4 times a week),Low stress,Good
2,10/24/2024 16:52:21,First year,Female,Sometimes (3-4 times a week),7-8 hours,Sometimes (3-4 times a week),Good,Often,Often,Sometimes (1-2 times a week),Major impact,Often (5-6 times a week),Often (5-6 times a week),Often (5-6 times a week),High stress,Below Average
3,10/24/2024 16:53:00,Third year,Male,Often (5-6 times a week),More than 8 hours,Sometimes (3-4 times a week),Poor,Often,Often,Rarely (1-2 times a month),Minor impact,Sometimes (3-4 times a week),Sometimes (3-4 times a week),Every day,Extremely high stress,Excellent
4,10/24/2024 16:53:25,Graduate student,Male,Often (5-6 times a week),7-8 hours,Often (5-6 times a week),Very good,Always,Sometimes,Sometimes (1-2 times a week),Moderate impact,Sometimes (3-4 times a week),Sometimes (3-4 times a week),Often (5-6 times a week),Low stress,Average


In [5]:
df_1.shape

(996, 16)

In [6]:
df_1.dtypes

Timestamp                                                                                                                  object
1. What is your year of study?                                                                                             object
2. What is your gender?                                                                                                    object
3. How often do you have difficulty falling asleep at night?                                                               object
4. On average, how many hours of sleep do you get on a typical day?                                                        object
5. How often do you wake up during the night and have trouble falling back asleep?                                         object
6. How would you rate the overall quality of your sleep?                                                                   object
7. How often do you experience difficulty concentrating during lectures or studying due to

In [7]:
df_1.columns

Index(['Timestamp', '1. What is your year of study?',
       '2. What is your gender?',
       '3. How often do you have difficulty falling asleep at night? ',
       '4. On average, how many hours of sleep do you get on a typical day?',
       '5. How often do you wake up during the night and have trouble falling back asleep?',
       '6. How would you rate the overall quality of your sleep?',
       '7. How often do you experience difficulty concentrating during lectures or studying due to lack of sleep?',
       '8. How often do you feel fatigued during the day, affecting your ability to study or attend classes?',
       '9. How often do you miss or skip classes due to sleep-related issues (e.g., insomnia, feeling tired)?',
       '10. How would you describe the impact of insufficient sleep on your ability to complete assignments and meet deadlines?',
       '11. How often do you use electronic devices (e.g., phone, computer) before going to sleep?',
       '12. How often do you con

Renaming columns to make them more readable

In [8]:
df_1_column_map = {
  "1. What is your year of study?": "year_study",
  "2. What is your gender?": "gender",
  "3. How often do you have difficulty falling asleep at night? ": "diff_fall_asleep",
  "4. On average, how many hours of sleep do you get on a typical day?": "sleep_hours",
  "5. How often do you wake up during the night and have trouble falling back asleep?": "wake_during_night",
  "6. How would you rate the overall quality of your sleep?": "sleep_quality",
  "7. How often do you experience difficulty concentrating during lectures or studying due to lack of sleep?": "concentration_issues",
  "8. How often do you feel fatigued during the day, affecting your ability to study or attend classes?": "daytime_fatigue",
  "9. How often do you miss or skip classes due to sleep-related issues (e.g., insomnia, feeling tired)?": "miss_classes",
  "10. How would you describe the impact of insufficient sleep on your ability to complete assignments and meet deadlines?": "sleep_impact_assignments",
  "11. How often do you use electronic devices (e.g., phone, computer) before going to sleep?": "device_before_sleep",
  "12. How often do you consume caffeine (coffee, energy drinks) to stay awake or alert?": "caffeine_use",
  "13. How often do you engage in physical activity or exercise?": "exercise_freq",
  "14. How would you describe your stress levels related to academic workload?": "academic_stress",
  "15. How would you rate your overall academic performance (GPA or grades) in the past semester?": "academic_perf"
}

df_1.rename(columns=df_1_column_map, inplace=True)
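By default `DataFrame.rename` silently skips mapping keys that do not match any column, and one header in this survey ends with a trailing space, so a mismatch is easy to miss. A minimal sketch of a check follows, using a toy frame with the hypothetical names `toy` and `toy_map` instead of the project data:

```python
import pandas as pd

# Toy frame with one header carrying a trailing space, as in the survey export.
toy = pd.DataFrame({"Q1 ": ["a"], "Q2": ["b"]})
toy_map = {"Q1": "first", "Q2": "second"}  # "Q1" lacks the trailing space

# Keys with no matching column are ignored by rename(), so list them explicitly.
unmatched = [key for key in toy_map if key not in toy.columns]
print("Unmatched keys:", unmatched)

# Alternatively, errors="raise" makes rename() fail loudly on such keys.
try:
    toy.rename(columns=toy_map, errors="raise")
except KeyError as exc:
    print("rename refused:", exc)
```

The same list comprehension could be run against `df_1_column_map` and `df_1.columns` before the rename.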

#### Clean and Tidy

In [9]:
df_1.isna().sum()

Timestamp                   0
year_study                  0
gender                      0
diff_fall_asleep            0
sleep_hours                 0
wake_during_night           0
sleep_quality               0
concentration_issues        0
daytime_fatigue             0
miss_classes                0
sleep_impact_assignments    0
device_before_sleep         0
caffeine_use                0
exercise_freq               0
academic_stress             0
academic_perf               0
dtype: int64

#### Wrangling

Dropping the first column (timestamp of submission)

In [10]:
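# Drop by name rather than by position so that re-running this cell
# cannot silently remove a second column.
df_1 = df_1.drop(columns=["Timestamp"])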
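Since the timestamp is discarded here, the collection window mentioned in the dataset description could be confirmed beforehand. The sketch below assumes the `MM/DD/YYYY HH:MM:SS` format seen in the first rows. The second timestamp is invented for illustration, and `stamps` is a hypothetical stand-in for the real column.

```python
import pandas as pd

# Toy timestamps in the same format as the survey export (values are illustrative).
stamps = pd.Series(["10/24/2024 16:51:15", "11/02/2024 09:30:00"])

# Parse with an explicit format so the day/month order is unambiguous.
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %H:%M:%S")

# Check that every response falls inside October-November 2024.
start = pd.Timestamp("2024-10-01")
end = pd.Timestamp("2024-11-30 23:59:59")
in_window = bool(parsed.between(start, end).all())

print(parsed.min(), parsed.max())
print("All within Oct-Nov 2024:", in_window)
```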

In [11]:
df_1.describe()

Unnamed: 0,year_study,gender,diff_fall_asleep,sleep_hours,wake_during_night,sleep_quality,concentration_issues,daytime_fatigue,miss_classes,sleep_impact_assignments,device_before_sleep,caffeine_use,exercise_freq,academic_stress,academic_perf
count,996,996,996,996,996,996,996,996,996,996,996,996,996,996,996
unique,4,2,5,5,5,5,5,5,5,5,5,5,5,4,5
top,Graduate student,Male,Often (5-6 times a week),7-8 hours,Often (5-6 times a week),Very poor,Often,Often,Often (3-4 times a week),Major impact,Often (5-6 times a week),Often (5-6 times a week),Often (5-6 times a week),Extremely high stress,Poor
freq,481,691,446,508,491,290,501,470,518,475,508,500,453,490,491


Printing out the unique values for each column shows us that each participant's response is standardized

In [12]:
for col in df_1.columns:
  print(f"\nColumn: {col}")
  print(df_1[col].unique())


Column: year_study
['Graduate student' 'Third year' 'First year' 'Second year']

Column: gender
['Male' 'Female']

Column: diff_fall_asleep
['Often (5-6 times a week)' 'Sometimes (3-4 times a week)' 'Every night'
 'Rarely (1-2 times a week)' 'Never']

Column: sleep_hours
['7-8 hours' 'More than 8 hours' '6-7 hours' '4-5 hours'
 'Less than 4 hours']

Column: wake_during_night
['Often (5-6 times a week)' 'Sometimes (3-4 times a week)' 'Every night'
 'Rarely (1-2 times a week)' 'Never']

Column: sleep_quality
['Good' 'Poor' 'Very good' 'Average' 'Very poor']

Column: concentration_issues
['Sometimes' 'Often' 'Always' 'Rarely' 'Never']

Column: daytime_fatigue
['Often' 'Sometimes' 'Rarely' 'Always' 'Never']

Column: miss_classes
['Often (3-4 times a week)' 'Sometimes (1-2 times a week)'
 'Rarely (1-2 times a month)' 'Always' 'Never']

Column: sleep_impact_assignments
['Moderate impact' 'Major impact' 'Minor impact' 'No impact'
 'Severe impact']

Column: device_before_sleep
['Often (5-6 ti

Mapping string values to numeric codes

In [13]:
year_map = {
  "First year": 1,
  "Second year": 2,
  "Third year": 3,
  "Graduate student": 4
}
df_1["year_study"] = df_1["year_study"].map(year_map)

In [14]:
gender_map = {
  "Male": 'M',
  "Female": 'F'
}
df_1["gender"] = df_1["gender"].map(gender_map)

In [15]:
freq_map = {
  "Never": 0,
  "Rarely (1-2 times a week)": 1,
  "Sometimes (3-4 times a week)": 2,
  "Often (5-6 times a week)": 3,
  "Every night": 4,
  "Every day": 4  # where applicable
}
for col in ["diff_fall_asleep", "wake_during_night", "device_before_sleep", "caffeine_use", "exercise_freq"]:
  df_1[col] = df_1[col].map(freq_map)

In [16]:
sleep_map = {
  "Less than 4 hours": 3,
  "4-5 hours": 4.5,
  "6-7 hours": 6.5,
  "7-8 hours": 7.5,
  "More than 8 hours": 9
}
df_1["sleep_hours"] = df_1["sleep_hours"].map(sleep_map)

In [17]:
quality_map = {
  "Very poor": 1,
  "Poor": 2,
  "Average": 3,
  "Good": 4,
  "Very good": 5
}
df_1["sleep_quality"] = df_1["sleep_quality"].map(quality_map)

In [18]:
issue_map = {
  "Never": 0,
  "Rarely": 1,
  "Sometimes": 2,
  "Often": 3,
  "Always": 4
}
for col in ["concentration_issues", "daytime_fatigue"]:
    df_1[col] = df_1[col].map(issue_map)

In [19]:
miss_map = {
  "Never": 0,
  "Rarely (1-2 times a month)": 1,
  "Sometimes (1-2 times a week)": 2,
  "Often (3-4 times a week)": 3,
  "Always": 4
}
df_1["miss_classes"] = df_1["miss_classes"].map(miss_map)

In [20]:
impact_map = {
  "No impact": 0,
  "Minor impact": 1,
  "Moderate impact": 2,
  "Major impact": 3,
  "Severe impact": 4
}
df_1["sleep_impact_assignments"] = df_1["sleep_impact_assignments"].map(impact_map)

In [21]:
stress_map = {
  "No stress": 0,
  "Low stress": 1,
  "High stress": 2,
  "Extremely high stress": 3
}
df_1["academic_stress"] = df_1["academic_stress"].map(stress_map)

In [22]:
perf_map = {
  "Poor": 1,
  "Below Average": 2,
  "Average": 3,
  "Good": 4,
  "Excellent": 5
}
df_1["academic_perf"] = df_1["academic_perf"].map(perf_map)
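The mapping cells above rely on `Series.map`, which returns `NaN` for any label that is absent from the dictionary. That cannot be told apart from genuinely missing data afterwards. A small guard along the following lines could make such gaps fail early. The helper name `map_strict` and the toy series are hypothetical, not part of the notebook.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to codes, raising if any label is missing from the mapping."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)

# Toy example: "Averge" is a deliberate typo that plain .map() turns into NaN.
quality_codes = {"Very poor": 1, "Poor": 2, "Average": 3, "Good": 4, "Very good": 5}
toy = pd.Series(["Good", "Averge"], name="sleep_quality")

silent_nans = int(toy.map(quality_codes).isna().sum())
print("NaNs introduced by plain map:", silent_nans)

try:
    map_strict(toy, quality_codes)
except ValueError as exc:
    print(exc)
```

In this notebook the later `isna().sum()` check serves the same purpose after the fact, while the guard reports which label caused the problem.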

Verify data is properly wrangled

In [23]:
df_1.info()
df_1.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 996 entries, 0 to 995
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   year_study                996 non-null    int64  
 1   gender                    996 non-null    object 
 2   diff_fall_asleep          996 non-null    int64  
 3   sleep_hours               996 non-null    float64
 4   wake_during_night         996 non-null    int64  
 5   sleep_quality             996 non-null    int64  
 6   concentration_issues      996 non-null    int64  
 7   daytime_fatigue           996 non-null    int64  
 8   miss_classes              996 non-null    int64  
 9   sleep_impact_assignments  996 non-null    int64  
 10  device_before_sleep       996 non-null    int64  
 11  caffeine_use              996 non-null    int64  
 12  exercise_freq             996 non-null    int64  
 13  academic_stress           996 non-null    int64  
 14  academic_p

year_study                  0
gender                      0
diff_fall_asleep            0
sleep_hours                 0
wake_during_night           0
sleep_quality               0
concentration_issues        0
daytime_fatigue             0
miss_classes                0
sleep_impact_assignments    0
device_before_sleep         0
caffeine_use                0
exercise_freq               0
academic_stress             0
academic_perf               0
dtype: int64

In [24]:
for col in df_1.columns:
  print(f"\nColumn: {col}")
  print(df_1[col].unique())


Column: year_study
[4 3 1 2]

Column: gender
['M' 'F']

Column: diff_fall_asleep
[3 2 4 1 0]

Column: sleep_hours
[7.5 9.  6.5 4.5 3. ]

Column: wake_during_night
[3 2 4 1 0]

Column: sleep_quality
[4 2 5 3 1]

Column: concentration_issues
[2 3 4 1 0]

Column: daytime_fatigue
[3 2 1 4 0]

Column: miss_classes
[3 2 1 4 0]

Column: sleep_impact_assignments
[2 3 1 0 4]

Column: device_before_sleep
[3 2 1 0 4]

Column: caffeine_use
[1 2 3 4 0]

Column: exercise_freq
[2 3 4 1 0]

Column: academic_stress
[2 1 3 0]

Column: academic_perf
[3 4 2 5 1]


In [25]:
df_1.head()

Unnamed: 0,year_study,gender,diff_fall_asleep,sleep_hours,wake_during_night,sleep_quality,concentration_issues,daytime_fatigue,miss_classes,sleep_impact_assignments,device_before_sleep,caffeine_use,exercise_freq,academic_stress,academic_perf
0,4,M,3,7.5,3,4,2,3,3,2,3,1,2,2,3
1,3,M,3,7.5,3,4,3,2,2,3,2,2,2,1,4
2,1,F,2,7.5,2,4,3,3,2,3,3,3,3,2,2
3,3,M,3,9.0,2,2,3,3,1,1,2,2,4,3,5
4,4,M,3,7.5,3,5,4,2,2,2,2,2,3,1,3


#### Notes

1. We will need to create a reverse map (numeric code back to original label) for data visualization
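A reverse map could be built by inverting the dictionaries used during wrangling, as sketched below. Two of those maps are copied here so the example runs on its own, and the helper name `invert` is hypothetical. Note that `freq_map` assigns code 4 to both "Every night" and "Every day", so the inversion has to pick one label.

```python
# Copies of two maps from the wrangling cells, so this sketch runs on its own.
quality_map = {"Very poor": 1, "Poor": 2, "Average": 3, "Good": 4, "Very good": 5}
freq_map = {
    "Never": 0,
    "Rarely (1-2 times a week)": 1,
    "Sometimes (3-4 times a week)": 2,
    "Often (5-6 times a week)": 3,
    "Every night": 4,
    "Every day": 4,
}

def invert(mapping: dict) -> dict:
    """Return code -> label; when several labels share a code, the first is kept."""
    reverse = {}
    for label, code in mapping.items():
        reverse.setdefault(code, label)
    return reverse

quality_labels = invert(quality_map)
freq_labels = invert(freq_map)  # code 4 keeps "Every night"; "Every day" is dropped

print(quality_labels[4])
print(freq_labels[4])
```

For columns such as `exercise_freq`, where "Every day" is the label that actually occurs, a column-specific reverse map may be the safer choice.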

#### Export processed data

In [26]:
output_path = 'data/02-processed/dataset_1.csv'

df_1.to_csv(output_path, index=False)
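`to_csv` does not create missing folders, so the export fails on a fresh clone if `data/02-processed/` does not exist yet. A hedged sketch of a more defensive export follows. It writes a toy frame into a temporary directory, and the file name `toy.csv` is an assumption for illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy frame standing in for the processed data.
toy = pd.DataFrame({"sleep_quality": [4, 2], "academic_perf": [3, 5]})

# A temporary directory is used so the sketch leaves no files behind;
# in the notebook the base would be the repository root.
with tempfile.TemporaryDirectory() as base:
    out_path = Path(base) / "data" / "02-processed" / "toy.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    toy.to_csv(out_path, index=False)

    # Round-trip check: the file reads back identical to what was written.
    round_trip = pd.read_csv(out_path)
    same = bool(round_trip.equals(toy))

print("Round trip preserved the data:", same)
```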

### Dataset #2: Student Performance & Behavior

The Student Performance & Behavior Dataset contains real academic and behavioral records from 5,000 students enrolled at a private learning provider. It includes 24 columns covering demographics, academic outcomes, study habits, and lifestyle factors. Because the data comes from actual student records (with sensitive fields anonymized), it offers meaningful insights into how various factors relate to academic success.

#### Data Collection

This dataset was compiled directly from the institution’s internal academic tracking system. The data reflects real student performance measures, attendance logs, participation scores, and self-reported lifestyle indicators such as study hours, stress levels, and sleep duration. Some fields—like names or emails—may be anonymized to protect student privacy.

#### Key Features

1. **Academic Performance:**  
   - *Midterm_Score, Final_Score, Assignments_Avg, Quizzes_Avg, Participation_Score, Projects_Score*  
   - *Total_Score:* Weighted composite based on official grading policy:  
     - Midterm (15%)  
     - Final (25%)  
     - Assignments (15%)  
     - Quizzes (10%)  
     - Participation (5%)  
     - Projects (30%)

2. **Demographics:**  
   - *Student_ID, First_Name, Last_Name, Email, Gender, Age, Department*

3. **Behavioral & Lifestyle Factors:**  
   - *Attendance (%), Study_Hours_per_Week, Extracurricular_Activities, Internet_Access_at_Home, Stress_Level (1–10), Sleep_Hours_per_Night*

4. **Family Background:**  
   - *Parent_Education_Level, Family_Income_Level*

5. **Grade:**  
   - Letter grade from A–F based on Total_Score  
   - Attendance is not included in the total score or has minimal weight.
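
The weighting listed above can be checked against the reported totals. The sketch below recomputes `Total_Score` for two rows copied from the `df_2` previews in this notebook; running the same check on the full DataFrame is left as an assumption about how one might extend it.

```python
import pandas as pd

# Weights as listed in the dataset description above.
weights = {
    'Midterm_Score': 0.15,
    'Final_Score': 0.25,
    'Assignments_Avg': 0.15,
    'Quizzes_Avg': 0.10,
    'Participation_Score': 0.05,
    'Projects_Score': 0.30,
}

# First two rows of the dataset, copied from the previews in this notebook.
sample = pd.DataFrame({
    'Midterm_Score': [40.61, 57.27],
    'Final_Score': [59.61, 74.0],
    'Assignments_Avg': [73.69, 74.23],
    'Quizzes_Avg': [53.17, 98.23],
    'Participation_Score': [73.4, 88.0],
    'Projects_Score': [62.84, 98.23],
    'Total_Score': [59.8865, 81.917],
})

# Weighted sum of the component scores, row by row.
recomputed = sum(sample[col] * w for col, w in weights.items())

# Largest absolute gap between the recomputed and reported totals.
max_gap = (recomputed - sample['Total_Score']).abs().max()
print(recomputed.round(4).tolist(), max_gap)
```

A gap near zero suggests the reported total follows the stated weights and that attendance is not one of its components.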

#### Concerns

Because this dataset is based on real institutional data, it may contain natural irregularities such as missing values in fields like Attendance, Assignments_Avg, or Parent_Education_Level; in the file we load, only Parent_Education_Level has missing values (1,025 of 5,000 rows). Some biases may also be present; for example, students with higher attendance occasionally receive slightly higher grades. The distribution of students across departments is imbalanced, which may affect model performance or generalizability. Additionally, self-reported variables like stress level or sleep duration may not perfectly reflect actual behavior. Despite these limitations, the dataset covers enough academic and lifestyle variables to support exploratory data analysis and predictive modeling for our research question.
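
The imbalance and missing-value concerns can be quantified with two short summaries. This is a minimal sketch on five rows copied from the `df_2` preview; in practice the same two calls would be run on the full DataFrame.

```python
import pandas as pd

# Five rows copied from the df_2 preview; the full DataFrame would be used in practice.
sample = pd.DataFrame({
    'Department': ['Mathematics', 'Business', 'Engineering', 'Engineering', 'CS'],
    'Parent_Education_Level': ["Master's", 'High School', 'High School',
                               'High School', "Master's"],
})

# Share of students per department; a few large shares signal imbalance.
dept_share = sample['Department'].value_counts(normalize=True)

# Fraction of missing values per column.
missing_share = sample.isna().mean()

print(dept_share)
print(missing_share)
```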

In [27]:
# fetch the csv for dataset 2
df_2 = pd.read_csv('data/00-raw/Students Performance Dataset.csv')

# look at what dataset 2 looks like
df_2.head()

Unnamed: 0,Student_ID,First_Name,Last_Name,Email,Gender,Age,Department,Attendance (%),Midterm_Score,Final_Score,...,Projects_Score,Total_Score,Grade,Study_Hours_per_Week,Extracurricular_Activities,Internet_Access_at_Home,Parent_Education_Level,Family_Income_Level,Stress_Level (1-10),Sleep_Hours_per_Night
0,S1000,Omar,Williams,student0@university.com,Female,22,Mathematics,97.36,40.61,59.61,...,62.84,59.8865,F,10.3,Yes,No,Master's,Medium,1,5.9
1,S1001,Maria,Brown,student1@university.com,Male,18,Business,97.71,57.27,74.0,...,98.23,81.917,B,27.1,No,No,High School,Low,4,4.3
2,S1002,Ahmed,Jones,student2@university.com,Male,24,Engineering,99.52,41.84,63.85,...,91.22,67.717,D,12.4,Yes,No,High School,Low,9,6.1
3,S1003,Omar,Williams,student3@university.com,Female,24,Engineering,90.38,45.65,44.44,...,55.48,51.6535,F,25.5,No,Yes,High School,Low,8,4.9
4,S1004,John,Smith,student4@university.com,Female,23,CS,59.41,53.13,61.77,...,87.43,71.403,C,13.3,Yes,No,Master's,Medium,6,4.5


In [28]:
df_2.shape

(5000, 23)

In [29]:
df_2.dtypes

Student_ID                     object
First_Name                     object
Last_Name                      object
Email                          object
Gender                         object
Age                             int64
Department                     object
Attendance (%)                float64
Midterm_Score                 float64
Final_Score                   float64
Assignments_Avg               float64
Quizzes_Avg                   float64
Participation_Score           float64
Projects_Score                float64
Total_Score                   float64
Grade                          object
Study_Hours_per_Week          float64
Extracurricular_Activities     object
Internet_Access_at_Home        object
Parent_Education_Level         object
Family_Income_Level            object
Stress_Level (1-10)             int64
Sleep_Hours_per_Night         float64
dtype: object

In [30]:
df_2.columns

Index(['Student_ID', 'First_Name', 'Last_Name', 'Email', 'Gender', 'Age',
       'Department', 'Attendance (%)', 'Midterm_Score', 'Final_Score',
       'Assignments_Avg', 'Quizzes_Avg', 'Participation_Score',
       'Projects_Score', 'Total_Score', 'Grade', 'Study_Hours_per_Week',
       'Extracurricular_Activities', 'Internet_Access_at_Home',
       'Parent_Education_Level', 'Family_Income_Level', 'Stress_Level (1-10)',
       'Sleep_Hours_per_Night'],
      dtype='object')

#### Clean and tidy

##### df_2 before cleaning

In [31]:
df_2.head()

Unnamed: 0,Student_ID,First_Name,Last_Name,Email,Gender,Age,Department,Attendance (%),Midterm_Score,Final_Score,...,Projects_Score,Total_Score,Grade,Study_Hours_per_Week,Extracurricular_Activities,Internet_Access_at_Home,Parent_Education_Level,Family_Income_Level,Stress_Level (1-10),Sleep_Hours_per_Night
0,S1000,Omar,Williams,student0@university.com,Female,22,Mathematics,97.36,40.61,59.61,...,62.84,59.8865,F,10.3,Yes,No,Master's,Medium,1,5.9
1,S1001,Maria,Brown,student1@university.com,Male,18,Business,97.71,57.27,74.0,...,98.23,81.917,B,27.1,No,No,High School,Low,4,4.3
2,S1002,Ahmed,Jones,student2@university.com,Male,24,Engineering,99.52,41.84,63.85,...,91.22,67.717,D,12.4,Yes,No,High School,Low,9,6.1
3,S1003,Omar,Williams,student3@university.com,Female,24,Engineering,90.38,45.65,44.44,...,55.48,51.6535,F,25.5,No,Yes,High School,Low,8,4.9
4,S1004,John,Smith,student4@university.com,Female,23,CS,59.41,53.13,61.77,...,87.43,71.403,C,13.3,Yes,No,Master's,Medium,6,4.5


In [32]:
df_2.isna().sum()

Student_ID                       0
First_Name                       0
Last_Name                        0
Email                            0
Gender                           0
Age                              0
Department                       0
Attendance (%)                   0
Midterm_Score                    0
Final_Score                      0
Assignments_Avg                  0
Quizzes_Avg                      0
Participation_Score              0
Projects_Score                   0
Total_Score                      0
Grade                            0
Study_Hours_per_Week             0
Extracurricular_Activities       0
Internet_Access_at_Home          0
Parent_Education_Level        1025
Family_Income_Level              0
Stress_Level (1-10)              0
Sleep_Hours_per_Night            0
dtype: int64

##### Rename the columns

In [33]:
df_2 = df_2.rename(columns={
    'Student_ID': 'student_id',
    'First_Name': 'first_name',
    'Last_Name': 'last_name',
    'Email': 'email',
    'Gender': 'gender',
    'Age': 'age',
    'Department': 'department',
    'Attendance (%)': 'attendance',
    'Midterm_Score': 'midterm_score',
    'Final_Score': 'final_score',
    'Assignments_Avg': 'assignments_avg',
    'Quizzes_Avg': 'quizzes_avg',
    'Participation_Score': 'participation_score',
    'Projects_Score': 'projects_score',
    'Total_Score': 'total_score',
    'Grade': 'grade',
    'Study_Hours_per_Week': 'study_hours_per_week',
    'Extracurricular_Activities': 'extracurricular_activities',
    'Internet_Access_at_Home': 'internet_access_at_home',
    'Parent_Education_Level': 'parent_education_level',
    'Family_Income_Level': 'family_income_level',
    'Stress_Level (1-10)': 'stress_level',
    'Sleep_Hours_per_Night': 'sleep_hours_per_night'
})

##### Drop unnecessary columns

In [34]:
columns_to_drop = [
  'student_id',
  'first_name',
  'last_name',
  'email',
  'internet_access_at_home',
  'parent_education_level',
  'family_income_level'
]

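# Drop identifiers (PII) and background fields we do not analyze;
# parent_education_level is also the only column with missing values.
df_2 = df_2.drop(columns=columns_to_drop)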

In [35]:
df_2.isna().sum()

gender                        0
age                           0
department                    0
attendance                    0
midterm_score                 0
final_score                   0
assignments_avg               0
quizzes_avg                   0
participation_score           0
projects_score                0
total_score                   0
grade                         0
study_hours_per_week          0
extracurricular_activities    0
stress_level                  0
sleep_hours_per_night         0
dtype: int64

##### Recode gender as 'M', 'F', 'O' (Male, Female, Other)

In [36]:
gender_map = {
    'Male': 'M',
    'Female': 'F',
    'Other': 'O'
}

df_2['gender'] = df_2['gender'].map(gender_map)
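
`Series.map` with a dictionary turns any value that has no entry into `NaN` without warning, which also happens if a mapping cell is run twice. A stricter variant is sketched below; `map_strict` is a hypothetical helper, not part of pandas, and the sample values come from the `df_2` preview.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values with `mapping`, raising if any non-missing value has no entry."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(
            f'unmapped values in {series.name!r}: {sorted(unmapped, key=str)}'
        )
    return series.map(mapping)

gender_map = {'Male': 'M', 'Female': 'F', 'Other': 'O'}

# Gender values of the first five rows, copied from the df_2 preview.
gender = pd.Series(['Female', 'Male', 'Male', 'Female', 'Female'], name='gender')
mapped = map_strict(gender, gender_map)
print(mapped.tolist())
```

The same helper could be reused for the grade and extracurricular mappings, since they follow the same pattern.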

##### Convert letter grade to number

In [37]:
df_2['grade'].unique()

array(['F', 'B', 'D', 'C', 'A'], dtype=object)

In [38]:
grade_map = {
    'A': 4,
    'B': 3,
    'C': 2,
    'D': 1,
    'F': 0
}

df_2['grade'] = df_2['grade'].map(grade_map)
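
Since the letter grade is described as derived from the total score, the numeric encoding should be ordinal: a higher grade should never cover lower scores than the grade beneath it. The sketch below checks this on the five rows shown in the previews; the exact grade cutoffs are not stated in the dataset description, so none are assumed here.

```python
import pandas as pd

# total_score and numeric grade for the first five rows of the cleaned df_2.
sample = pd.DataFrame({
    'total_score': [59.8865, 81.917, 67.717, 51.6535, 71.403],
    'grade': [0, 3, 1, 0, 2],
})

# Range of total_score observed within each numeric grade (sorted by grade).
ranges = sample.groupby('grade')['total_score'].agg(['min', 'max'])

# The encoding is consistent if each grade starts above the previous grade's maximum.
lower_max = ranges['max'].iloc[:-1].to_numpy()
upper_min = ranges['min'].iloc[1:].to_numpy()
is_ordered = bool((upper_min > lower_max).all())

print(ranges)
print(is_ordered)
```

On the full dataset, the `min` and `max` columns would also show the observed score range for each grade.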

##### Convert extracurricular_activities to binary

In [39]:
extracurricular_activities_map = {
    'Yes': 1,
    'No': 0
}

df_2['extracurricular_activities'] = df_2['extracurricular_activities'].map(extracurricular_activities_map)

##### df_2 after cleaning

In [40]:
df_2.head()

Unnamed: 0,gender,age,department,attendance,midterm_score,final_score,assignments_avg,quizzes_avg,participation_score,projects_score,total_score,grade,study_hours_per_week,extracurricular_activities,stress_level,sleep_hours_per_night
0,F,22,Mathematics,97.36,40.61,59.61,73.69,53.17,73.4,62.84,59.8865,0,10.3,1,1,5.9
1,M,18,Business,97.71,57.27,74.0,74.23,98.23,88.0,98.23,81.917,3,27.1,0,4,4.3
2,M,24,Engineering,99.52,41.84,63.85,85.85,50.0,4.7,91.22,67.717,1,12.4,1,9,6.1
3,F,24,Engineering,90.38,45.65,44.44,68.1,66.27,4.2,55.48,51.6535,0,25.5,0,8,4.9
4,F,23,CS,59.41,53.13,61.77,67.66,83.98,64.3,87.43,71.403,2,13.3,1,6,4.5


The dataset is tidy: each row represents one student, each column represents one variable, and no cell contains nested data.

##### Consider missing data

In [41]:
total_missing = df_2.isnull().sum().sum()
print(f'total missing data = {total_missing}')

total missing data = 0


#### Export processed data

In [42]:
output_path = 'data/02-processed/dataset_2.csv'

df_2.to_csv(output_path, index=False)
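
`to_csv` does not create missing folders, so the export fails if `data/02-processed/` does not exist yet. The sketch below creates the folder first and then reloads the file to confirm nothing changed; it writes under a temporary directory as a stand-in for the project folder, and `sample` is a two-row excerpt rather than the full `df_2`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row excerpt of the cleaned df_2, used as a stand-in for the full table.
sample = pd.DataFrame({
    'gender': ['F', 'M'],
    'age': [22, 18],
    'total_score': [59.8865, 81.917],
    'grade': [0, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for 'data/02-processed/dataset_2.csv' under a temporary root.
    output_path = Path(tmp) / 'data' / '02-processed' / 'dataset_2.csv'
    # Create the parent folders before writing.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(output_path, index=False)
    reloaded = pd.read_csv(output_path)

# Raises AssertionError if columns or values changed in the round trip.
pd.testing.assert_frame_equal(reloaded, sample, check_dtype=False)
columns_match = list(reloaded.columns) == list(sample.columns)
print(columns_match)
```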

## Ethics

### A. Data Collection

#### A.1 Informed consent
X
Because our project involves surveying students, all participants will be informed about what data we are collecting, how it will be used, and that participation is voluntary. They will be able to opt in and may exit the survey at any time. No data will be collected without their explicit agreement.

#### A.2 Collection bias
X
Our sample may be biased toward students who choose to respond, which could skew results toward individuals who are more motivated or more concerned about their sleep habits. We will acknowledge this limitation and keep the survey questions simple and neutral to reduce self-selection bias as much as possible.

#### A.3 Limit PII exposure
X
We will not collect names, emails, student IDs, or any information that could identify a participant. All responses will be anonymous, and no demographic details will be required unless directly relevant to the analysis.

#### A.4 Downstream bias mitigation
X
Because our analysis is exploratory and anonymous, fairness risks are low. However, we will still check for uneven representation across groups (if available) and note how this may influence our conclusions.

### B. Data Storage

#### B.1 Data security
X
All survey data will be stored securely within our group's folders. Only team members will have access, and the data will not be uploaded to public platforms or shared outside the course context.

#### B.2 Right to be forgotten
X
Since responses will be anonymous, individual participants cannot be directly identified. However, we will give participants the ability to contact us before final analysis if they wish to have their response removed.

#### B.3 Data retention plan
X
We will delete all data after the project is done. The dataset will not be used any further.

### C. Analysis

#### C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
X
We recognize that our project focuses primarily on student-reported data, which may not fully capture external factors such as instructor expectations, campus culture, or socioeconomic influences on sleep and academic performance.

#### C.2 Dataset bias
X
There is a risk of omitted confounding variables, since the data can show correlation but cannot establish causation. The dataset partly mitigates this by including other factors that may affect academic performance, such as stress from workload and physical activity. We have also considered that less sleep could be linked to better academic performance for some individuals: exceptionally high-performing students may sleep less because of the amount of work they take on. We expect such cases to be rare, so they can be treated as outliers or will have little effect on aggregate results.

#### C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
Yes. Since we use Git for version control, every change to the analysis is recorded and accessible. If issues arise, we can check the Git history of our repository.

### D. Modeling

#### D.1 Proxy discrimination
X
Yes. The variables for the first dataset are “sleep patterns, quality, fatigue, stress levels, academic performance, and lifestyle habits”. These variables are universal to all human beings, so they don’t pose a risk of being discriminatory.

## Team Expectations 

Our team agreed on clear expectations to make sure the project stays organized and that everyone feels respected throughout the process. We will treat each other’s time, ideas, and responsibilities with respect, especially when giving feedback or redistributing tasks. At the beginning of the project, we outlined who is responsible for survey design, analysis, coding, writing, and project coordination, and we will revisit these roles as the project develops.

A core expectation for our group is communicating early and often. If someone is confused about their part, needs help, or expects a delay, they are responsible for reaching out to the team as soon as possible. Early communication prevents last-minute stress and helps us support each other when workloads get heavier.

For accountability, we will use shared documents and a shared workspace so that everyone can see updates to the code, written sections, and comments. Each team member will look over the full notebook before submission so the final product reflects shared understanding rather than isolated work. By emphasizing respect, consistent communication, and clear division of responsibilities, we aim to maintain a collaborative and balanced workflow throughout the project.

- Communicate early and often
   - Communicate to coordinate work at least four days before the due date, whenever a task is accomplished, and whenever there are doubts or questions
   - Availability: say so when you are (not) able to do something
- Collaborative environment: help each other with work; when something is confusing, we should be able to rely on one another
   - Use Google Docs and other collaborative tools to encourage working together rather than solo
- Respect: respectful discussion, respectful of each other's time
- Accountability
   - Do your part in the project
   - Give effort on a level equivalent to getting 90% or more of the grade

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 10/29 | Before 11:59 PM | Conduct background research on sleep and academic performance; gather initial ideas for research question | Discuss ideal dataset(s) and ethics; draft project proposal |
| 11/7 | Before 11:59 PM | Edit, finalize, and submit proposal; search for datasets | Discuss wrangling and possible analytical approaches; assign group members to lead each specific part |
| 11/14 | 5:30 PM | Import and clean data again in preparation for visualization; perform exploratory data analysis (EDA), including summary statistics and visualizations | Review and refine EDA findings; finalize analysis plan and select appropriate variables |
| 11/21 | 5:30 PM | Complete data wrangling and begin correlation and regression analysis between sleep quality and academic performance | Discuss early findings, identify outliers or confounders, and complete project check-in; look into possible visualizations |
| 11/29 | 5:30 PM | Finalize all statistical analysis and visualizations; draft results, discussion, and conclusions | Review full project draft; make edits to improve clarity, formatting, and coherence |
| 12/6 | Before 11:59 PM | Ensure all edits are incorporated and visualizations are finalized | Turn in Final Project & Group Project Surveys |