# Exploratory Data Analysis (EDA)

This notebook explores trends in student behavior and their relationship with academic performance.

## Project Objective and Overarching Question
The central question driving this project is:
**To what extent can student exam scores be predicted from lifestyle habits, wellness factors, and socioeconomic background?**

We aim to identify which features contribute most to academic performance and explore predictive models that can help estimate student outcomes.

## Load and Preview Data

This code imports the necessary libraries and loads the dataset. We preview the first few rows to understand the structure.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('../../student_habits_performance.csv')
df.head()

## Summary Statistics and Data Types

We display the data types and summary statistics to understand the scale, range, and types of data we're working with.

In [None]:
df.info()
df.describe(include='all')

## 🧹 Missing Values

This line counts the missing values in each column, so we can decide whether any imputation or row removal is needed before modeling.

In [None]:
df.isnull().sum()
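
Raw counts are easier to judge next to the share of rows they represent. The helper below is a minimal sketch of that idea; `missing_summary` and the small example frame are hypothetical and only for illustration, and in the notebook the function would be called on `df`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up values for illustration only; replace `example` with the notebook's `df`
example = pd.DataFrame({
    "sleep_hours": [7.0, None, 6.5, 8.0],
    "exam_score": [80.0, 65.0, None, None],
    "age": [20, 21, 22, 23],
})
result = missing_summary(example)
print(result)
```

An empty result means every column is complete and no imputation step is needed.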

## Distribution of Exam Scores

This plot shows the distribution of exam scores. A roughly normal shape is convenient for many models, whereas noticeable skew can affect model performance.

In [None]:
sns.histplot(df['exam_score'], bins=30, kde=True)
plt.title('Distribution of Exam Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
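
The histogram gives a visual impression of skew; a few summary numbers can back it up. The sketch below uses pandas' built-in `Series.skew()`. The helper name `describe_shape` and the example scores are hypothetical, and in the notebook the function would be applied to `df['exam_score']`.

```python
import pandas as pd


def describe_shape(scores: pd.Series) -> dict:
    """Summarize the center and asymmetry of a numeric series."""
    return {
        "mean": float(scores.mean()),
        "median": float(scores.median()),
        # Sample skewness: near 0 is symmetric, > 0 means a longer right tail,
        # < 0 means a longer left tail
        "skew": float(scores.skew()),
    }


# Made-up scores for illustration only
example_scores = pd.Series([55, 60, 62, 65, 70, 72, 75, 98, 100, 100], name="exam_score")
shape = describe_shape(example_scores)
print(shape)
```

A mean well above or below the median points in the same direction as the sign of the skewness.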

## 🔗 Relationship Between Key Features

This pairplot helps us visually inspect relationships between key numeric features and exam score. We can identify patterns and correlations.

In [None]:
sns.pairplot(df[['study_hours_per_day', 'social_media_hours', 'sleep_hours', 'mental_health_rating', 'exam_score']])
plt.show()

## Correlation Matrix

A correlation matrix helps quantify relationships between numeric variables. Higher absolute values indicate stronger linear relationships.

In [None]:
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
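
When the heatmap has many cells, it can help to pull out only the column for the target and sort it by strength. The following is a minimal sketch; `rank_correlations` and the example frame are hypothetical, and in the notebook the function would be called on `df`.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str = "exam_score") -> pd.Series:
    """Return each numeric feature's Pearson correlation with `target`,
    ordered from strongest to weakest absolute value."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values for illustration only
example = pd.DataFrame({
    "study_hours_per_day": [1, 2, 3, 4, 5],
    "social_media_hours": [5, 3, 4, 1, 2],
    "sleep_hours": [3, 1, 4, 1, 5],
    "exam_score": [50, 60, 70, 80, 90],
})
ranked = rank_correlations(example)
print(ranked)
```

The sign is kept in the output, so strong negative relationships appear near the top alongside strong positive ones.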

## Boxplots of Exam Score by Category

These boxplots visualize how exam scores vary across different categories. We can gauge if certain groups tend to perform better or worse.
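
One way such boxplots could be drawn is sketched below. The column names `diet_quality`, `internet_quality`, and `extracurricular_participation` are assumptions inferred from the categories discussed in this notebook and may not match the dataset exactly; `plot_score_boxplots` is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_score_boxplots(frame: pd.DataFrame, category_cols, target: str = "exam_score"):
    """Draw one boxplot of `target` per categorical column, side by side."""
    fig, axes = plt.subplots(
        1, len(category_cols), figsize=(5 * len(category_cols), 4), squeeze=False
    )
    for ax, col in zip(axes[0], category_cols):
        sns.boxplot(data=frame, x=col, y=target, ax=ax)
        ax.set_title(f"{target} by {col}")
        ax.tick_params(axis="x", rotation=30)
    fig.tight_layout()
    return fig


# Assumed column names; adjust to the columns actually present in the dataset
category_cols = ["diet_quality", "internet_quality", "extracurricular_participation"]

# In the notebook, after `df` has been loaded:
# plot_score_boxplots(df, category_cols)
# plt.show()
```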

### Interpretation of the Correlation Matrix
The correlation matrix shows how strongly the numeric variables are linearly related to each other. `study_hours_per_day`, `attendance_percentage`, and `mental_health_rating` are positively correlated with `exam_score`, suggesting that students who study more, attend more classes, and report better mental health tend to perform better. The negative correlations with `social_media_hours` and `netflix_hours` suggest that more screen time goes along with lower scores, possibly because it distracts from academics. Correlation alone does not establish cause.

### Interpretation of Boxplots by Category
From the boxplots, we can see performance differences across categories. For example:
- **Diet Quality**: Students with 'Good' diets generally perform better.
- **Internet Quality**: Those with 'Good' internet access have slightly higher scores, possibly due to smoother study experiences.
- **Extracurricular Participation**: Students involved in activities tend to score higher, possibly due to better time management or motivation.
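
Differences described as "slightly higher" are easier to assess with numbers next to the plots. The sketch below computes per-group counts, medians, and means; `score_by_group`, the `diet_quality` column name, and the example values are assumptions for illustration only.

```python
import pandas as pd


def score_by_group(frame: pd.DataFrame, category: str, target: str = "exam_score") -> pd.DataFrame:
    """Return count, median, and mean of `target` per category level,
    ordered by median from highest to lowest."""
    return (
        frame.groupby(category)[target]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
    )


# Made-up values for illustration only; in the notebook, pass `df` instead
example = pd.DataFrame({
    "diet_quality": ["Good", "Good", "Poor", "Poor", "Fair"],
    "exam_score": [80.0, 90.0, 60.0, 70.0, 75.0],
})
group_table = score_by_group(example, "diet_quality")
print(group_table)
```

The `count` column matters here: a gap between medians means little when one group contains only a handful of students.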


## Final Interpretation and Key Takeaways

The Random Forest and Gradient Boosting models performed best in our evaluation. Their ability to capture non-linear patterns and work with a mix of numeric and categorical features made them well suited to this dataset.

Most influential features across models were:
- Study hours per day
- Class attendance
- Mental health rating
- Sleep hours
- Diet and exercise frequency
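
The modeling code is not part of this notebook, so the following is only a sketch of how a ranking like the one above could be produced. It assumes scikit-learn is installed; `rank_feature_importance` and the synthetic data are hypothetical and do not reproduce the project's actual models or results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_feature_importance(frame: pd.DataFrame, target: str = "exam_score",
                            random_state: int = 0) -> pd.Series:
    """Fit a random forest and return impurity-based importances, highest first."""
    # One-hot encode categorical columns so the model receives numeric input
    X = pd.get_dummies(frame.drop(columns=[target]), drop_first=True)
    y = frame[target]
    model = RandomForestRegressor(n_estimators=200, random_state=random_state)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)


# Synthetic data for illustration only: the score depends on study hours, not on sleep
rng = np.random.default_rng(0)
n = 300
synthetic = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 8, n),
    "sleep_hours": rng.uniform(4, 10, n),
})
synthetic["exam_score"] = 10 * synthetic["study_hours_per_day"] + rng.normal(0, 2, n)

importances = rank_feature_importance(synthetic)
print(importances)
```

Impurity-based importances sum to 1 and can favor features with many distinct values, so a ranking produced this way is best read as a rough guide rather than a precise measure of influence.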

This suggests that academic performance isn't just about studying longer: wellness and environmental factors also matter. These results can help educators support students holistically.
