# 📊 Student Performance: Comprehensive Data Insights & EDA Plan

## 📝 Problem Statement (The "Why")
The core objective of this analysis is to understand the factors that influence a student's academic performance. In the real world, a student's score isn't just about studying; it's often tied to their background. We want to find out:
- How much do **socio-economic factors** (like lunch and parental education) impact scores?
- Is there a significant difference in performance based on **gender or ethnicity**?
- Does completing a **test preparation course** actually lead to better results?

---

## 📑 Dataset Information
The dataset provides performance data in three core subjects: **Math, Reading, and Writing**.

| Feature Name | Type | Description |
| :--- | :--- | :--- |
| **gender** | Categorical | Male or Female |
| **race/ethnicity** | Categorical | Groups A, B, C, D, E |
| **parental level of education** | Categorical | Six levels, from some high school to master's degree |
| **lunch** | Categorical | Standard or Free/Reduced |
| **test preparation course** | Categorical | Completed or None |
| **math score** | Numerical | Marks in Mathematics |
| **reading score** | Numerical | Marks in Reading |
| **writing score** | Numerical | Marks in Writing |

---

## 🔍 Exploratory Data Analysis (EDA) Steps
We will perform the following operations to clean and understand the data:

### 1. Data Cleaning & Sanity Check
* Identify and handle **Missing Values**.
* Detect and remove **Duplicate** records.
* Check **Data Types** to ensure numerical values are correctly formatted.
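
A minimal sketch of these cleaning steps on a made-up five-row frame. The values, the name `raw`, and the median-fill strategy are illustrative assumptions, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicated row and one missing score.
raw = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "math_score": [72, 47, 47, np.nan, 76],
})

# 1. Count missing values per column.
missing_per_column = raw.isna().sum()

# 2. Count fully duplicated rows, then drop them (the first occurrence is kept).
n_duplicates = raw.duplicated().sum()
clean = raw.drop_duplicates().copy()

# 3. Fill the missing score with the column median (one possible strategy),
#    then restore an integer dtype, since NaN had forced the column to float.
clean["math_score"] = clean["math_score"].fillna(clean["math_score"].median())
clean["math_score"] = clean["math_score"].astype("int64")

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(clean.dtypes)
```

Whether to fill or drop rows with missing values depends on how many there are; the median fill here is only one reasonable default.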

### 2. Statistical Analysis
* Analyze Mean, Median, and Mode of the test scores.
* Check for the **Skewness** of data (are most students scoring high or low?).
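
As a rough sketch, these statistics can be collected into one table with pandas. The frame `scores` below holds only the first five rows shown in this notebook's output, so the numbers it prints do not describe the full dataset:

```python
import pandas as pd

# Toy sample: the scores of the first five students shown in the notebook output.
scores = pd.DataFrame({
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

summary = pd.DataFrame({
    "mean": scores.mean(),
    "median": scores.median(),
    # mode() can return several values per column; keep the smallest one.
    "mode": scores.mode().iloc[0],
    # Negative skew: a longer tail of low scores; positive: a longer tail of high scores.
    "skew": scores.skew(),
})
print(summary)
```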

### 3. Feature Engineering
* **Total Score**: Creating a combined score of all three subjects.
* **Average Score**: Calculating the mean score for a holistic view of the student.

### 4. Data Visualization & Relationship Mapping
* **Univariate Analysis**:
    - Using Histograms and KDE plots to see score distributions.
    - Using Pie Charts to see the distribution of **Gender** and **Ethnicity**.
* **Bivariate Analysis**: Using Bar Plots to see how the **Average Score** varies with **Parental Education**.
* **Multivariate Analysis**: Using Heatmaps to find the correlation between Math, Reading, and Writing scores.
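
The bar plot and the heatmap both draw on small summary tables. A minimal sketch of how those tables could be computed, again using only the first five rows shown in this notebook (the variable names are illustrative):

```python
import pandas as pd

# Toy sample: first five rows from the notebook output.
df = pd.DataFrame({
    "parental_level_of_education": [
        "bachelor's degree", "some college", "master's degree",
        "associate's degree", "some college",
    ],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})
score_cols = ["math_score", "reading_score", "writing_score"]
df["avg_score"] = df[score_cols].mean(axis=1)

# Bivariate: mean average score per parental education level (bar plot input).
avg_by_education = (
    df.groupby("parental_level_of_education")["avg_score"].mean().sort_values()
)

# Multivariate: Pearson correlation between the three scores (heatmap input).
corr = df[score_cols].corr()

print(avg_by_education)
print(corr)
```

`avg_by_education.plot(kind="bar")` would then draw the bar plot, and a heatmap function such as `seaborn.heatmap(corr, annot=True)` could render the correlation matrix, assuming seaborn is installed.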



---

## 🛠️ Data Preprocessing Roadmap
To prepare this data for any future analysis or calculation:
- **One-Hot Encoding**: Converting categorical text data into numbers.
- **Standard Scaling**: Ensuring all numerical scores are on a similar scale.
- **Outlier Detection**: Using Boxplots to find students with exceptionally high or low marks.
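
A sketch of the three steps using pandas only, on the first five rows shown in this notebook. The scaling is written out by hand rather than with a library scaler, and the outlier rule is the 1.5 × IQR convention that boxplot whiskers use by default; both are assumptions about how the roadmap might be carried out:

```python
import pandas as pd

# Toy sample: first five rows from the notebook output.
df = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(df, columns=["lunch", "test_preparation_course"])

# Standard scaling: subtract the mean and divide by the population
# standard deviation (ddof=0), giving each column mean 0 and std 1.
scaled = (df[score_cols] - df[score_cols].mean()) / df[score_cols].std(ddof=0)

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["math_score"].quantile(0.25)
q3 = df["math_score"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["math_score"] < lower) | (df["math_score"] > upper)]

print(encoded.columns.tolist())
print(scaled.round(3))
print(outliers[["math_score"]])
```

With only five rows the IQR is very narrow, so the rule flags values that would not stand out in the full 1000-row dataset.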

#### Load all the libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


#### Load the data

In [3]:
df=pd.read_csv("D:/machine learning/machine-learning-portfolio/ml-projects/student-performance-prediction/notebook/data/stud.csv")

In [4]:
df # print the dataset

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [5]:
# check the column names
df.columns

Index(['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch',
       'test_preparation_course', 'math_score', 'reading_score',
       'writing_score'],
      dtype='object')

In [6]:
for i in df:
    print(i)

gender
race_ethnicity
parental_level_of_education
lunch
test_preparation_course
math_score
reading_score
writing_score


In [7]:
# find the dtype of each column
# (iterating over a DataFrame yields the column names, so type(i) is always str;
#  df[i].dtype gives the dtype of the column itself)
for i in df:
    print(f"{i} ---------------->{df[i].dtype}")

gender ---------------->object
race_ethnicity ---------------->object
parental_level_of_education ---------------->object
lunch ---------------->object
test_preparation_course ---------------->object
math_score ---------------->int64
reading_score ---------------->int64
writing_score ---------------->int64


In [8]:
# summary of the dataset: column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [9]:
# summary statistics for the numerical columns
df.describe()

Unnamed: 0,math_score,reading_score,writing_score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


In [10]:
# find the shape of the dataset

In [11]:
df.shape

(1000, 8)

### Check whether missing values are present in this dataset

In [12]:
df.isnull().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

In [13]:
# there are no missing values in this dataset

### Find duplicate rows in this dataset

In [14]:
df.duplicated().sum()

0

### Re-check missing values with `isna()` (an alias of `isnull()`)

In [15]:
df.isna().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

In [16]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

#### There are no duplicate rows in this dataset

### Check the number of unique values in each column

In [19]:
df.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

### Exploring Data

In [23]:
print("categories in  'gender' variable: ",end=" ")
print(df['gender'].unique())

categories in  'gender' variable:  ['female' 'male']


In [24]:
print("categories in 'race/ethnicity' variable: ",end=" ")
print(df['race_ethnicity'].unique())

categories in 'race/ethnicity' variable:  ['group B' 'group C' 'group A' 'group D' 'group E']


In [26]:
print("categories in 'parental level of education' variable: ",end=" ")
print(df['parental_level_of_education'].unique())

categories in 'parental level of education' variable:  ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']


In [28]:
print("categories in 'lunch' variable: ",end=" ")
print(df['lunch'].unique())

categories in 'lunch' variable:  ['standard' 'free/reduced']


In [30]:
print("categories in 'test preparation course' variable: ",end=" ")
print(df["test_preparation_course"].unique())

categories in 'test preparation course' variable:  ['none' 'completed']
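
`unique()` lists the categories but not how many students fall into each. A small sketch with `value_counts()`, run on the first five rows shown in this notebook (so the counts are for that sample only):

```python
import pandas as pd

# Toy sample: categorical columns of the first five rows from the notebook output.
df = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "none"],
})

# Frequency of each category, per column.
counts = {col: df[col].value_counts() for col in df.columns}

# The same as proportions, which are easier to compare across columns.
shares = {col: df[col].value_counts(normalize=True) for col in df.columns}

for col in df.columns:
    print(f"--- {col} ---")
    print(counts[col])
```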


### Define the numerical and categorical columns

In [40]:
numeric_feature=[feature for feature in df.columns if df[feature].dtype !='O']
categorical_features=[feature for feature in df.columns if df[feature].dtype == 'O']


In [41]:
# print the columns
print("We have {} numerical features: {}".format(len(numeric_feature),numeric_feature))
print("We have {} categorical features: {}".format(len(categorical_features),categorical_features))

We have 3 numerical features: ['math_score', 'reading_score', 'writing_score']
We have 5 categorical features: ['gender', 'race_ethnicity', 'parental_level_of_education', 'lunch', 'test_preparation_course']
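
An alternative sketch of the same split using `select_dtypes`, which does not depend on the text columns being stored with the `object` dtype. The frame below is a cut-down toy version of the dataset:

```python
import pandas as pd

# Toy sample: a few columns of the first five rows from the notebook output.
df = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "math_score": [72, 69, 90, 47, 76],
    "reading_score": [72, 90, 95, 57, 78],
    "writing_score": [74, 88, 93, 44, 75],
})

# Numeric columns (ints and floats) versus everything else.
numeric_features = df.select_dtypes(include="number").columns.tolist()
categorical_features = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```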


In [42]:
df.head(2)

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88


### Adding columns for "Total Score" and "Average Score"

In [53]:
# add the total score column
df["total_score"]=df['math_score']+ df['reading_score']+ df['writing_score']

In [54]:
# add the average score column
df["avg_score"]=df['total_score']/3

In [55]:
# show the dataset with the new columns
df

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score,total_score,avg_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333
4,male,group C,some college,standard,none,76,78,75,229,76.333333
...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282,94.000000
996,male,group C,high school,free/reduced,none,62,55,55,172,57.333333
997,female,group C,high school,free/reduced,completed,59,71,65,195,65.000000
998,female,group D,some college,standard,completed,68,78,77,223,74.333333


In [56]:
df.head(4)

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score,total_score,avg_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333


In [63]:
# count the students who got full marks in reading, math, and writing
print("reading",df[df["reading_score"]==100]["avg_score"].count())
print("math",df[df["math_score"]==100]['avg_score'].count())
print("writing ",df[df["writing_score"]==100]['avg_score'].count())

reading 17
math 7
writing  14


In [64]:
# count the students who scored fewer than 20 marks in reading, math, and writing
print("reading",df[df["reading_score"]<20]["avg_score"].count())
print("math",df[df["math_score"]<20]['avg_score'].count())
print("writing ",df[df["writing_score"]<20]['avg_score'].count())

reading 1
math 4
writing  3
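
The same counts can be computed for all three subjects at once by summing a boolean mask. A sketch on made-up scores (the values below are hypothetical and chosen only so that both thresholds are hit):

```python
import pandas as pd

# Hypothetical scores for three students; not taken from the dataset.
df = pd.DataFrame({
    "math_score": [100, 15, 72],
    "reading_score": [100, 100, 19],
    "writing_score": [10, 55, 100],
})
score_cols = ["math_score", "reading_score", "writing_score"]

# Each comparison yields a True/False frame; summing counts the True values per column.
full_marks = (df[score_cols] == 100).sum()
below_20 = (df[score_cols] < 20).sum()

print(full_marks)
print(below_20)
```

This avoids repeating the filter once per subject and returns the counts as a Series indexed by column name.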
