# **NOTE:** Use File > Save a copy in Drive to make a copy before doing anything else

# Project 8: Student Habits vs Academic Performance Analysis

#### Overview

This project analyzes the relationship between student lifestyle habits and academic performance using a comprehensive dataset from Kaggle. The dataset contains information about 1,000 students and includes 16 variables covering various aspects of student life, including study habits, social media usage, sleep patterns, diet quality, exercise frequency, and academic outcomes.

The analysis focuses on understanding how different study patterns correlate with exam performance. The project demonstrates fundamental data analysis skills, including data cleaning, statistical calculations, and comparative analysis using Python and pandas.

In [None]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
# Update file_path to point to the specific file within the dataset
file_path = "student_habits_performance.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "jayaantanaath/student-habits-vs-academic-performance",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:", df.head())


First 5 records:   student_id  age  gender  study_hours_per_day  social_media_hours  \
0      S1000   23  Female                  0.0                 1.2   
1      S1001   20  Female                  6.9                 2.8   
2      S1002   21    Male                  1.4                 3.1   
3      S1003   23  Female                  1.0                 3.9   
4      S1004   19  Female                  5.0                 4.4   

   netflix_hours part_time_job  attendance_percentage  sleep_hours  \
0            1.1            No                   85.0          8.0   
1            2.3            No                   97.3          4.6   
2            1.3            No                   94.8          8.0   
3            1.0            No                   71.0          9.2   
4            0.5            No                   90.9          4.9   

  diet_quality  exercise_frequency parental_education_level internet_quality  \
0         Fair                   6                   Master  

In [None]:
one_col = df['age']

In [None]:
type(one_col)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   student_id                     1000 non-null   object 
 1   age                            1000 non-null   int64  
 2   gender                         1000 non-null   object 
 3   study_hours_per_day            1000 non-null   float64
 4   social_media_hours             1000 non-null   float64
 5   netflix_hours                  1000 non-null   float64
 6   part_time_job                  1000 non-null   object 
 7   attendance_percentage          1000 non-null   float64
 8   sleep_hours                    1000 non-null   float64
 9   diet_quality                   1000 non-null   object 
 10  exercise_frequency             1000 non-null   int64  
 11  parental_education_level       909 non-null    object 
 12  internet_quality               1000 non-null   ob

In [None]:
df.shape

(1000, 16)

In [None]:
df.columns

Index(['student_id', 'age', 'gender', 'study_hours_per_day',
       'social_media_hours', 'netflix_hours', 'part_time_job',
       'attendance_percentage', 'sleep_hours', 'diet_quality',
       'exercise_frequency', 'parental_education_level', 'internet_quality',
       'mental_health_rating', 'extracurricular_participation', 'exam_score'],
      dtype='object')

### Clean the data

In [None]:
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


`df.isnull().sum()` counts the missing (NaN) values in each column of the DataFrame. Based on the output:

All columns have 0 missing values except for `parental_education_level`, which has 91 missing values. This matches the `df.info()` output above, where that column has 909 non-null entries out of 1,000.

The dataset is mostly clean, but the missing data in the `parental_education_level` column must be handled before analysis.

#### Fill with a default or common value
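
One way to fill missing values is to use the most common category in the column. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` data and its values are assumptions, not the Kaggle file), so it does not alter `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real column
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "parental_education_level": ["Bachelor", np.nan, "High School", "Bachelor", np.nan],
})

# Count missing values before filling
missing_before = toy["parental_education_level"].isnull().sum()

# mode() returns a Series (ties are possible), so take its first entry
most_common = toy["parental_education_level"].mode()[0]

# Assign the filled column back rather than relying on inplace=True
toy["parental_education_level"] = toy["parental_education_level"].fillna(most_common)

missing_after = toy["parental_education_level"].isnull().sum()
print(f"Missing before: {missing_before}, after: {missing_after}, filled with: {most_common}")
```

Filling keeps every row, whereas dropping rows shrinks the dataset, so the two choices can give different totals and percentages later in the analysis.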

#### Drop rows with missing values

In [None]:
df.dropna(subset=['parental_education_level'], inplace=True)


In [None]:
# checking the data again
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


### Questions to Answer

Please answer the following questions.

1. Find the average study hours per day for all students. Please create a code cell below this to answer the question. (0.5 point)

In [None]:
# prompt: Find the average study hours per day for all students

average_study_hours = df['study_hours_per_day'].mean()
print(f"The average study hours per day for all students is: {average_study_hours:.2f}")

The average study hours per day for all students is: 3.54


2. Identify the student who studies the MOST hours per day. Please create a code cell below to answer the question. (0.5 point)

In [None]:
# prompt: Identify the student who studies MOST hours per day.

# Find the row with the maximum study hours per day
student_most_study_hours = df.loc[df['study_hours_per_day'].idxmax()]

print("Student with the most study hours per day:")
student_most_study_hours

Student with the most study hours per day:


Unnamed: 0,455
student_id,S1455
age,19
gender,Male
study_hours_per_day,8.3
social_media_hours,3.3
netflix_hours,2.6
part_time_job,Yes
attendance_percentage,86.6
sleep_hours,6.5
diet_quality,Fair
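
Note that `idxmax` returns only the first row label at which the maximum occurs, so a tie for the most study hours would go unreported. The following is a minimal sketch, on hypothetical toy data rather than the Kaggle file, of listing every student who reaches the maximum:

```python
import pandas as pd

# Hypothetical toy data with a deliberate tie at 8.3 hours
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4"],
    "study_hours_per_day": [2.5, 8.3, 4.0, 8.3],
})

# idxmax gives the label of the first maximum only
first_only = toy.loc[toy["study_hours_per_day"].idxmax()]

# Comparing against the maximum keeps every tied row
max_hours = toy["study_hours_per_day"].max()
top_students = toy[toy["study_hours_per_day"] == max_hours]

print(first_only["student_id"])
print(top_students)
```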


3. Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [None]:
# prompt: Count how many students study more than 6 hours per day

# Filter the DataFrame to include only students who study more than 6 hours per day
students_more_than_6_hours = df[df['study_hours_per_day'] > 6]

# Count the number of students in the filtered DataFrame
count_students_more_than_6_hours = students_more_than_6_hours.shape[0]

print(f"The number of students who study more than 6 hours per day is: {count_students_more_than_6_hours}")

The number of students who study more than 6 hours per day is: 40


4. What is the percentage of students who study more than 6 hours per day? Please create a code cell below this to answer the question. (0.5 point)

In [None]:
# prompt: What is the percentage of students who study more than 6 hours per day.

# Calculate the total number of students
total_students = df.shape[0]

# Calculate the percentage of students who study more than 6 hours per day
percentage_students_more_than_6_hours = (count_students_more_than_6_hours / total_students) * 100

print(f"The percentage of students who study more than 6 hours per day is: {percentage_students_more_than_6_hours:.2f}%")

The percentage of students who study more than 6 hours per day is: 4.40%


5. Calculate what percentage of students study less than 2 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [None]:
# prompt: Calculate what percentage of students study less than 2 hours per day.

# Filter the DataFrame to include only students who study less than 2 hours per day
students_less_than_2_hours = df[df['study_hours_per_day'] < 2]

# Count the number of students in the filtered DataFrame
count_students_less_than_2_hours = students_less_than_2_hours.shape[0]

# Calculate the total number of students
total_students = df.shape[0]

# Calculate the percentage of students who study less than 2 hours per day
percentage_students_less_than_2_hours = (count_students_less_than_2_hours / total_students) * 100

print(f"The percentage of students who study less than 2 hours per day is: {percentage_students_less_than_2_hours:.2f}%")

The percentage of students who study less than 2 hours per day is: 13.53%


6. Do students who study more than 5 hours per day have higher exam scores on average? Please create a code cell below to answer this question. (0.5 point)

In [23]:
# prompt: Do students who study more than 5 hours per day have higher exam scores on average?

# Separate students into two groups: those who study more than 5 hours and those who don't
students_more_than_5_hours = df[df['study_hours_per_day'] > 5]
students_5_hours_or_less = df[df['study_hours_per_day'] <= 5]

# Calculate the average exam score for each group
average_exam_score_more_than_5_hours = students_more_than_5_hours['exam_score'].mean()
average_exam_score_5_hours_or_less = students_5_hours_or_less['exam_score'].mean()

print(f"Average exam score for students studying more than 5 hours per day: {average_exam_score_more_than_5_hours:.2f}")
print(f"Average exam score for students studying 5 hours or less per day: {average_exam_score_5_hours_or_less:.2f}")

# Compare the average scores and print the conclusion
if average_exam_score_more_than_5_hours > average_exam_score_5_hours_or_less:
  print("Conclusion: Students who study more than 5 hours per day have higher exam scores on average.")
else:
  print("Conclusion: Students who study more than 5 hours per day do not have higher exam scores on average.")

Average exam score for students studying more than 5 hours per day: 91.12
Average exam score for students studying 5 hours or less per day: 65.67
Conclusion: Students who study more than 5 hours per day have higher exam scores on average.


7. Use "Explain code" for the code you produced for Question 6 and summarize in your own words to show that you understood the code Gemini produced. Please create a text cell below to answer this question. (0.5 point)

The code checks whether students who study more than 5 hours per day have higher exam scores, on average, than those who study 5 hours or less. It splits the student data into two groups using boolean filters on `study_hours_per_day`, computes the mean of `exam_score` for each group, and compares the two means. If the mean is higher for the students who study more than 5 hours, it prints a conclusion saying so; otherwise, it prints that they do not have higher scores on average. In this dataset the means are 91.12 and 65.67, so studying more than 5 hours a day is associated with higher exam scores, although a difference in averages alone does not show that the extra study time causes the higher scores.
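
The same two-group comparison can also be written with `groupby`, which reports both means (and the group sizes) in one table. This is a minimal sketch on hypothetical toy values, not the Kaggle data; the group labels `more_than_5` and `5_or_less` are names chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook would use df instead
toy = pd.DataFrame({
    "study_hours_per_day": [1.0, 5.0, 5.5, 7.0, 3.0, 6.0],
    "exam_score": [50.0, 70.0, 90.0, 100.0, 60.0, 95.0],
})

# Label each row by which side of the 5-hour threshold it falls on
group = np.where(toy["study_hours_per_day"] > 5, "more_than_5", "5_or_less")

# Mean exam score and number of students per group
summary = toy.groupby(group)["exam_score"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the mean makes it easier to judge how much weight each average deserves when one group is much smaller than the other.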

8. Does the code produced to answer the questions use "vectorization"? Please justify your answer with an example. Please create a text cell below to answer this question. (0.5 point)

Yes. The code produced to answer the questions uses vectorization: operations are applied to a whole column at once instead of looping over rows in Python. For example, `df['study_hours_per_day'] > 6` compares every value in the column with 6 in a single expression, and `.mean()` aggregates a whole column in one call.

To see the difference, suppose you have two Python lists of numbers, `list1` and `list2`, and want to add them element-wise to create a new list containing the sums.

In the loop-based approach, you iterate through the elements of the lists explicitly with a `for` loop. For each index `i`, you access the corresponding elements of `list1` and `list2`, add them, and append the sum to a `result` list. This is a scalar operation performed repeatedly.

In the vectorized approach with NumPy, a library designed for efficient numerical operations, you first convert the Python lists into the NumPy arrays `array1` and `array2`. The expression `array1 + array2` then runs no explicit Python loop. Instead, NumPy applies the addition over the entire arrays in highly optimized, compiled low-level code (often written in C or Fortran), treating each array as a single operand.
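
The two approaches described above can be sketched as follows. The list values are hypothetical, chosen only for illustration, and `vectorized_result` is an assumed name; the other names follow the description.

```python
import numpy as np

# Hypothetical example values
list1 = [1, 2, 3, 4]
list2 = [10, 20, 30, 40]

# Loop-based approach: one scalar addition per iteration
result = []
for i in range(len(list1)):
    result.append(list1[i] + list2[i])

# Vectorized approach: a single expression over whole arrays
array1 = np.array(list1)
array2 = np.array(list2)
vectorized_result = array1 + array2

print(result)
print(vectorized_result.tolist())
```

Both approaches produce the same sums; the vectorized version moves the loop out of Python and into NumPy's compiled code.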

Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [None]:
df['study_hours_per_day']

Unnamed: 0,study_hours_per_day
0,0.0
1,6.9
2,1.4
3,1.0
4,5.0
...,...
995,2.6
996,2.9
997,3.0
998,5.4


In [None]:
df['study_hours_per_day'] > 6

Unnamed: 0,study_hours_per_day
0,False
1,True
2,False
3,False
4,False
...,...
995,False
996,False
997,False
998,False


In [None]:
# prompt: Count how many students study more than 6 hours per day.

(df['study_hours_per_day'] > 6).sum()

np.int64(40)

In [None]:
# prompt: Calculate what percentage of students study less than 2 hours per day.

# Filter students who study less than 2 hours per day
students_less_than_2_hours = df[df['study_hours_per_day'] < 2]

# Count the number of students who study less than 2 hours per day
num_students_less_than_2_hours = len(students_less_than_2_hours)

# Get the total number of students
total_students = len(df)

# Calculate the percentage
percentage_less_than_2_hours = (num_students_less_than_2_hours / total_students) * 100

print(f"Percentage of students who study less than 2 hours per day: {percentage_less_than_2_hours:.2f}%")


Percentage of students who study less than 2 hours per day: 13.53%


In [None]:
df['study_hours_per_day'] < 2

Unnamed: 0,study_hours_per_day
0,True
1,False
2,True
3,True
4,False
...,...
995,False
996,False
997,False
998,False


In [None]:
(df['study_hours_per_day'] < 2).sum()

np.int64(123)
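
Because `True` counts as 1 and `False` as 0, the sum of a boolean mask is the number of matching rows and its mean is the fraction of matching rows, so the percentage can be computed without a separate count. This is a minimal sketch on hypothetical toy values (not the Kaggle data):

```python
import pandas as pd

# Hypothetical toy column; the real notebook would use df
toy = pd.DataFrame({
    "study_hours_per_day": [0.0, 6.9, 1.4, 1.0, 5.0, 7.2, 3.0, 2.0],
})

# Boolean mask: True where the student studies less than 2 hours
mask = toy["study_hours_per_day"] < 2

count_below_2 = mask.sum()       # number of True values
pct_below_2 = mask.mean() * 100  # fraction of True values, as a percentage

print(f"{count_below_2} students, {pct_below_2:.2f}%")
```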

In [None]:
# prompt: Do students who study more than 5 hours per day have higher exam scores on average?

# Separate students into two groups: those who study more than 5 hours and those who don't
students_more_than_5_hours = df[df['study_hours_per_day'] > 5]
students_5_hours_or_less = df[df['study_hours_per_day'] <= 5]

# Calculate the average exam score for each group
average_score_more_than_5_hours = students_more_than_5_hours['exam_score'].mean()
average_score_5_hours_or_less = students_5_hours_or_less['exam_score'].mean()

print(f"Average exam score for students studying more than 5 hours per day: {average_score_more_than_5_hours:.2f}")
print(f"Average exam score for students studying 5 hours or less per day: {average_score_5_hours_or_less:.2f}")

# Compare the averages and print the conclusion
if average_score_more_than_5_hours > average_score_5_hours_or_less:
    print("Students who study more than 5 hours per day have higher exam scores on average.")
else:
    print("Students who study more than 5 hours per day do not have higher exam scores on average.")


Average exam score for students studying more than 5 hours per day: 91.12
Average exam score for students studying 5 hours or less per day: 65.67
Students who study more than 5 hours per day have higher exam scores on average.
