# **Netflix & Chill... or Fail? Unpacking the Link Between Netflix and Grades**

![Netflix Photo](https://cinemafaith.com/wp-content/uploads/2016/03/netflix-2-final-2.jpg)

# **1. Problem Statement**

In this project, we aim to decode the complex relationship between student entertainment habits and academic performance. Given the overwhelming prevalence of daily digital media consumption, particularly Netflix, among students, we seek to identify the critical thresholds where entertainment transitions from harmless leisure to a detrimental factor for academic success.

By exploring key variable categories including academic factors (exam scores, study hours, attendance), entertainment habits (Netflix and social media consumption), lifestyle factors (sleep patterns, mental health ratings, and diet quality), and demographics (age, gender, and parental education levels), we hope to uncover the nuanced 'optimal range' where students can enjoy their digital lives without significant academic penalty. This analysis will provide data-driven insights to help students, educators, and parents navigate the digital landscape, fostering both well-being and academic achievement.

# **2. Data Set Description**
Our analysis is based on a robust dataset sourced from Kaggle, titled ["Student Habits vs Academic Performance."](https://www.kaggle.com/datasets/jayaantanaath/student-habits-vs-academic-performance) This dataset is particularly valuable as it represents real student data, offering a grounded basis for our research.

The dataset suits our study well: it comprises 1,000 records across 16 variables. Fifteen of those variables are fully populated; the exception is `parental_education_level`, which has 91 missing values (909 non-null) that we handle in Section 4.1. This gives us a largely complete foundation for exploring how student habits and demographics correlate with academic outcomes.

## **2.1 Initial Data Inspection**
Before diving deeper into analysis, it's important to perform an initial inspection of the dataset. This helps us understand its structure, identify potential issues (like missing data or incorrect types), and get a sense of the distributions. Here's what we look at:

- `df.shape` – Displays the number of rows and columns, giving a sense of dataset size.

- `df.head()` – Shows the first few rows to preview the format and sample values.

- `df.info()` – Lists column names, data types, and non-null counts, useful for detecting missing data or type mismatches.

- `df.describe()` – Provides summary statistics (mean, std, min, max, etc.) for numerical features, helping to identify potential outliers or skewed distributions.

These steps offer a foundational understanding before proceeding to data cleaning and visualization.

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
df = pd.read_csv('/content/drive/MyDrive/3rd year 2024-25/Term 3/DT/student_habits_performance.csv')

In [16]:
df.shape

(1000, 16)

In [17]:
df.head(10)

Unnamed: 0,student_id,age,gender,study_hours_per_day,social_media_hours,netflix_hours,part_time_job,attendance_percentage,sleep_hours,diet_quality,exercise_frequency,parental_education_level,internet_quality,mental_health_rating,extracurricular_participation,exam_score
0,S1000,23,Female,0.0,1.2,1.1,No,85.0,8.0,Fair,6,Master,Average,8,Yes,56.2
1,S1001,20,Female,6.9,2.8,2.3,No,97.3,4.6,Good,6,High School,Average,8,No,100.0
2,S1002,21,Male,1.4,3.1,1.3,No,94.8,8.0,Poor,1,High School,Poor,1,No,34.3
3,S1003,23,Female,1.0,3.9,1.0,No,71.0,9.2,Poor,4,Master,Good,1,Yes,26.8
4,S1004,19,Female,5.0,4.4,0.5,No,90.9,4.9,Fair,3,Master,Good,1,No,66.4
5,S1005,24,Male,7.2,1.3,0.0,No,82.9,7.4,Fair,1,Master,Average,4,No,100.0
6,S1006,21,Female,5.6,1.5,1.4,Yes,85.8,6.5,Good,2,Master,Poor,4,No,89.8
7,S1007,21,Female,4.3,1.0,2.0,Yes,77.7,4.6,Fair,0,Bachelor,Average,8,No,72.6
8,S1008,23,Female,4.4,2.2,1.7,No,100.0,7.1,Good,3,Bachelor,Good,1,No,78.9
9,S1009,18,Female,4.8,3.1,1.3,No,95.4,7.5,Good,5,Bachelor,Good,10,Yes,100.0


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   student_id                     1000 non-null   object 
 1   age                            1000 non-null   int64  
 2   gender                         1000 non-null   object 
 3   study_hours_per_day            1000 non-null   float64
 4   social_media_hours             1000 non-null   float64
 5   netflix_hours                  1000 non-null   float64
 6   part_time_job                  1000 non-null   object 
 7   attendance_percentage          1000 non-null   float64
 8   sleep_hours                    1000 non-null   float64
 9   diet_quality                   1000 non-null   object 
 10  exercise_frequency             1000 non-null   int64  
 11  parental_education_level       909 non-null    object 
 12  internet_quality               1000 non-null   ob

In [20]:
df.describe()

Unnamed: 0,age,study_hours_per_day,social_media_hours,netflix_hours,attendance_percentage,sleep_hours,exercise_frequency,mental_health_rating,exam_score
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.498,3.5501,2.5055,1.8197,84.1317,6.4701,3.042,5.438,69.6015
std,2.3081,1.46889,1.172422,1.075118,9.399246,1.226377,2.025423,2.847501,16.888564
min,17.0,0.0,0.0,0.0,56.0,3.2,0.0,1.0,18.4
25%,18.75,2.6,1.7,1.0,78.0,5.6,1.0,3.0,58.475
50%,20.0,3.5,2.5,1.8,84.4,6.5,3.0,5.0,70.5
75%,23.0,4.5,3.3,2.525,91.025,7.3,5.0,8.0,81.325
max,24.0,8.3,7.2,5.4,100.0,10.0,6.0,10.0,100.0


# **3. Column Descriptions**
Here are the descriptions for each variable in the "Student Habits vs Academic Performance" dataset:

| Variable Name                 | Description                                                                 | Category             |
|------------------------------|-----------------------------------------------------------------------------|----------------------|
| `student_id`                 | Unique identifier for each student.                                         | N/A                  |
| `age`                        | Age of the student in years.                                                | Demographics         |
| `gender`                     | Gender of the student (e.g., Male, Female, Other).                          | Demographics         |
| `study_hours_per_day`        | Average number of hours a student spends studying per day.                  | Academic Factors     |
| `social_media_hours`         | Average number of hours a student spends on social media per day.           | Entertainment Habits |
| `netflix_hours`              | Average number of hours a student spends watching Netflix per day.          | Entertainment Habits |
| `part_time_job`              | Indicates whether the student has a part-time job (e.g., Yes, No).          | Lifestyle Factors    |
| `attendance_percentage`      | Percentage of classes attended by the student.                              | Academic Factors     |
| `sleep_hours`                | Average number of hours a student sleeps per night.                         | Lifestyle Factors    |
| `diet_quality`               | Self-reported rating of diet quality (e.g., Poor, Fair, Good).              | Lifestyle Factors    |
| `exercise_frequency`         | Integer measure of how often a student exercises (0–6 in this dataset).     | Lifestyle Factors    |
| `parental_education_level`   | Highest education level achieved by parents (e.g., High School, Bachelor, Master). Contains 91 missing values. | Demographics |
| `internet_quality`           | Self-reported rating of internet quality (e.g., Poor, Average, Good).       | Lifestyle Factors    |
| `mental_health_rating`       | Self-reported mental health rating on a numeric scale (1–10 in this dataset). | Lifestyle Factors  |
| `extracurricular_participation` | Indicates whether the student participates in extracurricular activities (e.g., Yes, No). | Academic Factors |
| `exam_score`                 | The student's final exam score, representing academic performance. This is the primary dependent variable. | Academic Factors |

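The example labels in the table are easiest to confirm directly from the data. The following is a minimal sketch, not part of the original notebook: `categorical_levels` is a hypothetical helper, and `sample` is a tiny stand-in frame built from values visible in `df.head(10)`. In the notebook you would pass the loaded `df` instead.

```python
import pandas as pd


def categorical_levels(frame: pd.DataFrame) -> dict:
    """Map each non-numeric column to the sorted list of its distinct labels."""
    text_cols = frame.select_dtypes(exclude="number").columns
    return {col: sorted(frame[col].dropna().unique()) for col in text_cols}


# Hypothetical mini-sample; replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "diet_quality": ["Fair", "Good", "Poor", "Poor"],
    "internet_quality": ["Average", "Average", "Poor", "Good"],
    "exercise_frequency": [6, 6, 1, 4],
})

levels = categorical_levels(sample)
for column, labels in levels.items():
    print(f"{column}: {labels}")
```

Numeric columns such as `exercise_frequency` are skipped on purpose, since `df.describe()` already summarizes them.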

# **4. Data Cleaning & Integrity Check**
Before diving into the analysis, it's crucial to ensure our data is clean and reliable. This involves checking for missing values, duplicate entries, appropriate data types, and understanding the basic statistical properties of our numerical variables.

## **4.1 Missing Values**

In [21]:
# Check for Missing Values
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


In [22]:
# Fill missing parental education level with Unknown
df['parental_education_level'] = df['parental_education_level'].fillna('Unknown')

# Check if there are still missing values
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


## **4.2 Duplicate Rows**

In [23]:
# Count fully duplicated rows (0 means there are no duplicates)
df.duplicated().sum()

np.int64(0)

## **4.3 Data Types**
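`df.info()` shows the rating-style columns stored as generic `object` text. One possible way to handle this, sketched here under assumptions, is to cast them to ordered categoricals so that sorting and comparisons respect the rating order. The orderings in `ORDERED_LEVELS` are inferred from the labels visible in `df.head(10)` and should be verified against the full data; `cast_ordered_categories` is a hypothetical helper, not code from the original notebook.

```python
import pandas as pd

# Assumed label orderings (worst to best), inferred from df.head(10).
ORDERED_LEVELS = {
    "diet_quality": ["Poor", "Fair", "Good"],
    "internet_quality": ["Poor", "Average", "Good"],
}


def cast_ordered_categories(frame: pd.DataFrame, levels: dict = ORDERED_LEVELS) -> pd.DataFrame:
    """Return a copy with the listed columns converted to ordered categoricals.

    Labels that are not in the given ordering become NaN, so check
    isnull().sum() again after the conversion.
    """
    out = frame.copy()
    for col, order in levels.items():
        if col in out.columns:
            out[col] = pd.Categorical(out[col], categories=order, ordered=True)
    return out


# Hypothetical mini-sample; replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "diet_quality": ["Fair", "Good", "Poor"],
    "internet_quality": ["Average", "Poor", "Good"],
})

typed = cast_ordered_categories(sample)
print(typed.dtypes)
```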

## **4.4 Outliers**
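The notebook does not specify an outlier method, so the sketch below uses the common 1.5 × IQR rule as an assumption. `iqr_outlier_counts` is a hypothetical helper, and the `12.0` in the sample is an invented value added only to show a flagged point.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical mini-sample; the 12.0 is invented to demonstrate a flagged value.
sample = pd.DataFrame({
    "netflix_hours": [1.1, 2.3, 1.3, 1.0, 0.5, 12.0],
    "sleep_hours": [8.0, 4.6, 8.0, 9.2, 4.9, 7.4],
})

counts = iqr_outlier_counts(sample)
print(counts)
```

A flagged value is only a candidate for review. Whether to keep, cap, or drop it depends on whether it is plausible for a student.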

## **4.5 Data Integrity Check**
**Range Validation:**
- `age` should be within a realistic student range (e.g., 15–30).
- `attendance_percentage` should be between 0 and 100.
- `exam_score` should be within the exam scale (e.g., 0–100).

**Consistency:**
- Ensure categorical variables use consistent labels (e.g., "Yes"/"No", not "Y"/"N").

**Uniqueness:**
- Check that `student_id` is unique.

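The checks listed above can be sketched as a single report of violation counts. This is an illustration under assumptions: `integrity_report` is a hypothetical helper, the range limits are the example limits from the list, and the last row of `sample` is deliberately invalid and invented for the demonstration.

```python
import pandas as pd


def integrity_report(frame: pd.DataFrame) -> dict:
    """Count rows that violate each rule; missing values count as violations."""
    return {
        "age_out_of_range": int((~frame["age"].between(15, 30)).sum()),
        "attendance_out_of_range": int((~frame["attendance_percentage"].between(0, 100)).sum()),
        "exam_score_out_of_range": int((~frame["exam_score"].between(0, 100)).sum()),
        "unexpected_yes_no_labels": int((~frame["part_time_job"].isin(["Yes", "No"])).sum()),
        "duplicate_student_ids": int(frame["student_id"].duplicated().sum()),
    }


# First three rows come from df.head(10); the fourth row is invented and invalid.
sample = pd.DataFrame({
    "student_id": ["S1000", "S1001", "S1002", "S1002"],
    "age": [23, 20, 21, 45],
    "part_time_job": ["No", "No", "No", "Y"],
    "attendance_percentage": [85.0, 97.3, 94.8, 104.0],
    "exam_score": [56.2, 100.0, 34.3, 50.0],
})

report = integrity_report(sample)
print(report)
```

On a clean dataset every count should be zero. The same `isin` check can be repeated for `extracurricular_participation`.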
# **5. Visualization**

### **5.1 Univariate Analysis: Student Profiling**

**Daily Netflix Hours (`netflix_hours`)**
- Purpose: To understand the distribution of the primary entertainment habit under investigation, revealing patterns of light, moderate, heavy, and extreme usage.
- Visualization: Histogram or Density Plot.

**Exam Score (`exam_score`)**
- Purpose: To visualize the distribution of academic performance, our key outcome variable.
- Visualization: Histogram or Density Plot.

**Daily Study Hours (`study_hours_per_day`)**
- Purpose: To understand the distribution of dedicated academic effort among students, a crucial input to performance.
- Visualization: Histogram or Density Plot.

**Daily Sleep Hours (`sleep_hours`)**
- Purpose: To profile students' sleep patterns, a significant lifestyle factor known to influence cognitive function and well-being, which in turn impacts academics.
- Visualization: Histogram or Density Plot.

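The four distributions described above could be drawn in one 2 × 2 grid. The sketch below is one way to do it with plain matplotlib histograms; `plot_profile_histograms` is a hypothetical helper, the bin count is an arbitrary choice, and `sample` holds only the ten rows shown in `df.head(10)`, so its shapes say nothing about the full dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

PROFILE_COLUMNS = ["netflix_hours", "exam_score", "study_hours_per_day", "sleep_hours"]


def plot_profile_histograms(frame: pd.DataFrame, columns=PROFILE_COLUMNS, bins: int = 20):
    """Draw one histogram per column on a 2x2 grid and return the figure."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 7))
    for ax, col in zip(axes.ravel(), columns):
        ax.hist(frame[col].dropna(), bins=bins, edgecolor="black")
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("Number of students")
    fig.tight_layout()
    return fig


# Ten rows copied from df.head(10); replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "netflix_hours": [1.1, 2.3, 1.3, 1.0, 0.5, 0.0, 1.4, 2.0, 1.7, 1.3],
    "exam_score": [56.2, 100.0, 34.3, 26.8, 66.4, 100.0, 89.8, 72.6, 78.9, 100.0],
    "study_hours_per_day": [0.0, 6.9, 1.4, 1.0, 5.0, 7.2, 5.6, 4.3, 4.4, 4.8],
    "sleep_hours": [8.0, 4.6, 8.0, 9.2, 4.9, 7.4, 6.5, 4.6, 7.1, 7.5],
})

fig = plot_profile_histograms(sample, bins=5)
```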
### **5.2 Bivariate Analysis: Correlation Heatmap**

Our correlation heatmap provides a powerful overview of the relationships between all numerical variables in our dataset. The color intensity and the value in each cell indicate the strength and direction of the correlation (from -1 to 1).

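A heatmap of this kind can be produced roughly as follows. This is a sketch, assuming Pearson correlation (the pandas default) and a diverging colour map fixed to the range -1 to 1. `plot_correlation_heatmap` is a hypothetical helper, and `sample` contains only the ten rows from `df.head(10)`, so the coefficients it yields are not findings.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(frame: pd.DataFrame):
    """Compute Pearson correlations of the numeric columns and draw them as a heatmap."""
    corr = frame.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    # vmin/vmax pin the colour scale so that 0 always maps to the neutral colour.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Pearson correlation between numerical variables")
    fig.tight_layout()
    return corr, fig


# Ten rows copied from df.head(10); replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Female",
               "Male", "Female", "Female", "Female", "Female"],
    "study_hours_per_day": [0.0, 6.9, 1.4, 1.0, 5.0, 7.2, 5.6, 4.3, 4.4, 4.8],
    "netflix_hours": [1.1, 2.3, 1.3, 1.0, 0.5, 0.0, 1.4, 2.0, 1.7, 1.3],
    "exam_score": [56.2, 100.0, 34.3, 26.8, 66.4, 100.0, 89.8, 72.6, 78.9, 100.0],
})

corr, fig = plot_correlation_heatmap(sample)
```

Text columns such as `gender` are excluded automatically, and correlation describes association only, not causation.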
### **5.3 Scatter Plots: Exam Score vs Netflix Hours and Exam Score vs Study Hours**
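The two scatter plots named in this heading could be drawn side by side with a shared y-axis, so that the spread of exam scores is directly comparable. The following is a minimal sketch: `plot_score_scatter` is a hypothetical helper, and `sample` holds only the ten rows from `df.head(10)`.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_scatter(frame: pd.DataFrame, predictors=("netflix_hours", "study_hours_per_day")):
    """Plot exam_score against each predictor in its own panel and return the figure."""
    fig, axes = plt.subplots(1, len(predictors), figsize=(11, 4), sharey=True, squeeze=False)
    for ax, col in zip(axes[0], predictors):
        ax.scatter(frame[col], frame["exam_score"], alpha=0.6)
        ax.set_xlabel(col)
        ax.set_title(f"exam_score vs {col}")
    axes[0][0].set_ylabel("exam_score")
    fig.tight_layout()
    return fig


# Ten rows copied from df.head(10); replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "netflix_hours": [1.1, 2.3, 1.3, 1.0, 0.5, 0.0, 1.4, 2.0, 1.7, 1.3],
    "study_hours_per_day": [0.0, 6.9, 1.4, 1.0, 5.0, 7.2, 5.6, 4.3, 4.4, 4.8],
    "exam_score": [56.2, 100.0, 34.3, 26.8, 66.4, 100.0, 89.8, 72.6, 78.9, 100.0],
})

fig = plot_score_scatter(sample)
```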

### **5.4 Understanding The Performance Ladder**

The Performance Ladder is a key analytical tool developed in this study to explore the relationship between Netflix usage and academic performance among students. It provides a structured way to examine how different levels of entertainment consumption impact exam scores.

<br>

**What It Is**

The concept groups students into categories based on their average daily Netflix usage — from those who don’t watch at all to those who watch heavily. Each category, or "rung" on the ladder, represents a different level of consumption. By calculating the average exam scores for each group, we create a clear, tiered view of performance across the spectrum of media usage.

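The grouping described above can be sketched with `pd.cut` followed by a group-wise mean. The rung boundaries and labels below are assumptions chosen for illustration, since the notebook does not state its cut-offs, and `performance_ladder` is a hypothetical helper. `sample` contains only the ten rows from `df.head(10)`, so the resulting averages are not findings.

```python
import pandas as pd

# Assumed rung boundaries in hours per day; adjust to the study's actual cut-offs.
LADDER_BINS = [-0.001, 0, 1, 2, 3, float("inf")]
LADDER_LABELS = ["None (0h)", "Light (0-1h)", "Moderate (1-2h)", "Heavy (2-3h)", "Extreme (3h+)"]


def performance_ladder(frame: pd.DataFrame) -> pd.DataFrame:
    """Assign each student to a Netflix-usage rung and summarize exam scores per rung."""
    # Intervals are right-closed, so exactly 1.0 hour falls in "Light (0-1h)".
    rung = pd.cut(frame["netflix_hours"], bins=LADDER_BINS, labels=LADDER_LABELS)
    return (
        frame.assign(netflix_rung=rung)
        .groupby("netflix_rung", observed=False)["exam_score"]
        .agg(students="count", mean_exam_score="mean")
    )


# Ten rows copied from df.head(10); replace with the full `df` in the notebook.
sample = pd.DataFrame({
    "netflix_hours": [1.1, 2.3, 1.3, 1.0, 0.5, 0.0, 1.4, 2.0, 1.7, 1.3],
    "exam_score": [56.2, 100.0, 34.3, 26.8, 66.4, 100.0, 89.8, 72.6, 78.9, 100.0],
})

ladder = performance_ladder(sample)
print(ladder)
```

Reporting the student count next to each mean matters, because a rung with only a handful of students yields an unreliable average. Rungs with no students keep a count of 0 and a missing mean.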
<br>

**What It Aims to Do**

The goal of The Performance Ladder is to move beyond a simple "screen time is bad" narrative. Instead, it aims to:
- Quantify how Netflix usage correlates with academic outcomes.
- Identify the point at which media consumption begins to negatively affect performance.
- Explore whether a moderate, controlled level of entertainment might be harmless or even beneficial.

<br>

**Why It Matters to the Problem Statement**

This analysis directly supports the study’s central question: how do lifestyle and entertainment habits affect academic performance? Rather than assuming all screen time is equally harmful, The Performance Ladder helps uncover whether there is an optimal balance, that is, a level of media engagement that doesn't significantly hurt academic results. It reveals where that balance tips, giving us data-driven insight into how modern students can manage leisure and learning more effectively.