Student Performance Analysis for an Education Board

Task 1: Data Ingestion & Initial Exploration
Objective: Understand dataset structure and content.

Load the dataset using Pandas

In [None]:
import pandas as pd

data = pd.read_csv(r"D:\Python\StudentsPerformance.csv")
data

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


Display:
o First 5 records
o Last 5 records

In [2]:
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [3]:
data.tail()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86


Check: Number of rows and columns

In [4]:
data.shape

(1000, 8)

Check: Column names

In [5]:
data.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='str')

Check: Data types of each column

In [6]:
data.dtypes

gender                           str
race/ethnicity                   str
parental level of education      str
lunch                            str
test preparation course          str
math score                     int64
reading score                  int64
writing score                  int64
dtype: object

Generate statistical summary for numerical columns

In [7]:
data.describe()

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


Task 2: Data Quality & Missing Value Analysis
Objective: Ensure clean data for reliable analysis

Check if the dataset contains any missing values

In [8]:
data.isnull()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False


Count missing values column-wise

In [9]:
data.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

If missing values exist: Handle them appropriately using Pandas methods.

** The given dataset contains no missing values, so no handling is required.
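For reference, if missing values had been present, one common approach is to fill numeric gaps with the column median and categorical gaps with the most frequent value. The sketch below uses a small hypothetical frame (toy), not the real dataset, and the choice of median and mode is an assumption rather than a requirement of the task.

```python
import pandas as pd

# Hypothetical toy frame with gaps; the real dataset has none.
toy = pd.DataFrame({
    "lunch": ["standard", None, "free/reduced", "standard"],
    "math score": [72.0, None, 47.0, 90.0],
})

cleaned = toy.copy()
# Numeric column: fill gaps with the column median.
cleaned["math score"] = cleaned["math score"].fillna(cleaned["math score"].median())
# Categorical column: fill gaps with the most frequent value.
cleaned["lunch"] = cleaned["lunch"].fillna(cleaned["lunch"].mode()[0])

# Total number of missing cells remaining after cleaning.
print(cleaned.isnull().sum().sum())
```

Dropping incomplete rows with dropna() is a simpler alternative when only a few rows are affected.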

Verify the dataset after cleaning.

In [10]:
data.isnull().sum().sum()

np.int64(0)

Task 3: Overall Student Performance Analysis
Objective: Measure overall academic performance

Calculate:
o Average math score
o Average reading score
o Average writing score

In [13]:
print("Average math score: ", data['math score'].mean())
print("Average reading score: ",data['reading score'].mean())
print("Average writing score: ",data['writing score'].mean())

Average math score:  66.089
Average reading score:  69.169
Average writing score:  68.054


Identify:
o Highest score in each subject
o Lowest score in each subject

In [14]:
print("Highest score in Maths: ",data['math score'].max())
print("Highest score in Reading: ",data['reading score'].max())
print("Highest score in writing: ",data['writing score'].max())

print("Lowest score in Maths: ",data['math score'].min())
print("Lowest score in Reading: ",data['reading score'].min())
print("Lowest score in writing: ",data['writing score'].min())

Highest score in Maths:  100
Highest score in Reading:  100
Highest score in writing:  100
Lowest score in Maths:  0
Lowest score in Reading:  17
Lowest score in writing:  10


Find the total score for each student (sum of all three subjects).
Add a new column total_score to the dataset (stored here as 'total score' to match the existing column naming)

In [16]:
data['total score'] = data['math score'] + data['reading score'] + data['writing score']
data

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score
0,female,group B,bachelor's degree,standard,none,72,72,74,218
1,female,group C,some college,standard,completed,69,90,88,247
2,female,group B,master's degree,standard,none,90,95,93,278
3,male,group A,associate's degree,free/reduced,none,47,57,44,148
4,male,group C,some college,standard,none,76,78,75,229
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282
996,male,group C,high school,free/reduced,none,62,55,55,172
997,female,group C,high school,free/reduced,completed,59,71,65,195
998,female,group D,some college,standard,completed,68,78,77,223


In [17]:
data.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score', 'total score'],
      dtype='str')

Task 4: Gender-Based Performance Study
Objective: Identify performance patterns across genders

Calculate average scores (math, reading, writing) for each gender

In [18]:
data.groupby('gender')[['math score', 'reading score','writing score']].mean()

gender,math score,reading score,writing score
female,63.633205,72.608108,72.467181
male,68.728216,65.473029,63.311203


Identify:
o Which gender performs better in math
o Which gender performs better in reading and writing

In [19]:
better_math_gender = data.groupby('gender')['math score'].mean().idxmax()
better_reading_gender = data.groupby('gender')['reading score'].mean().idxmax()
better_writing_gender = data.groupby('gender')['writing score'].mean().idxmax()

print("Gender better in math: ", better_math_gender)
print("Gender better in reading: ", better_reading_gender)
print("Gender better in writing: ", better_writing_gender)

Gender better in math:  male
Gender better in reading:  female
Gender better in writing:  female


Display the performance comparison

In [21]:
print("Performance comparison based on gender: ")
data.groupby('gender')[['math score', 'reading score','writing score','total score']].mean()


Performance comparison based on gender: 


gender,math score,reading score,writing score,total score
female,63.633205,72.608108,72.467181,208.708494
male,68.728216,65.473029,63.311203,197.512448


Task 5: Impact of Test Preparation 
Objective: Evaluate effectiveness of test preparation programs

Separate students who:
o Completed the test preparation course
o Did not complete the course

In [22]:
data['test preparation course'].value_counts()

test preparation course
none         642
completed    358
Name: count, dtype: int64

In [24]:
completed_prep_course = data[data['test preparation course'] == 'completed']
print("List of students who completed test preparation course: ")
completed_prep_course

List of students who completed test preparation course: 


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score
1,female,group C,some college,standard,completed,69,90,88,247
6,female,group B,some college,standard,completed,88,95,92,275
8,male,group D,high school,free/reduced,completed,64,64,67,195
13,male,group A,some college,standard,completed,78,72,70,220
18,male,group C,master's degree,free/reduced,completed,46,42,46,134
...,...,...,...,...,...,...,...,...,...
990,male,group E,high school,free/reduced,completed,86,81,75,242
991,female,group B,some high school,standard,completed,65,82,78,225
995,female,group E,master's degree,standard,completed,88,99,95,282
997,female,group C,high school,free/reduced,completed,59,71,65,195


In [25]:
not_completed_prep_course = data[data['test preparation course'] == 'none']
print("List of students who did not complete test preparation course: ")
not_completed_prep_course

List of students who did not complete test preparation course: 


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score
0,female,group B,bachelor's degree,standard,none,72,72,74,218
2,female,group B,master's degree,standard,none,90,95,93,278
3,male,group A,associate's degree,free/reduced,none,47,57,44,148
4,male,group C,some college,standard,none,76,78,75,229
5,female,group B,associate's degree,standard,none,71,83,78,232
...,...,...,...,...,...,...,...,...,...
992,female,group D,associate's degree,free/reduced,none,55,76,76,207
993,female,group D,bachelor's degree,free/reduced,none,62,72,74,208
994,male,group A,high school,standard,none,63,63,62,188
996,male,group C,high school,free/reduced,none,62,55,55,172


Calculate average scores for both groups

In [26]:
print("Average total score for students who completed test preparation course: ", completed_prep_course['total score'].mean())
print("Average total score for students who did not complete test preparation course: ", not_completed_prep_course['total score'].mean())

Average total score for students who completed test preparation course:  218.00837988826817
Average total score for students who did not complete test preparation course:  195.11682242990653


Compare performance across math, reading, and writing

In [27]:
print("Performance across math, reading, and writing for both groups: ")
data.groupby('test preparation course')[['math score', 'reading score','writing score']].mean()

Performance across math, reading, and writing for both groups: 


test preparation course,math score,reading score,writing score
completed,69.695531,73.893855,74.418994
none,64.077882,66.534268,64.504673


Conclude whether test preparation improves performance

From the two results above, students who completed the test preparation course scored higher on average in every subject and in total score (218.0 vs. 195.1).

The test preparation course appears to improve student performance across all subjects, with the largest gaps in writing (about 9.9 points) and reading (about 7.4 points), compared with about 5.6 points in math. Because the data is observational, this is an association and not proof that the course caused the improvement.
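The per-subject gap behind this conclusion can be computed directly from the group means. The sketch below re-enters the means from the table above by hand instead of recomputing them from the CSV; the variable names means and gap are illustrative.

```python
import pandas as pd

# Group means copied from the Task 5 comparison table.
means = pd.DataFrame(
    {
        "math score": [69.695531, 64.077882],
        "reading score": [73.893855, 66.534268],
        "writing score": [74.418994, 64.504673],
    },
    index=["completed", "none"],
)

# Gap in average score per subject: completed minus none.
gap = (means.loc["completed"] - means.loc["none"]).round(2)
print(gap.sort_values(ascending=False))
```

With the full DataFrame loaded, the same subtraction can be applied to the result of the groupby call in the cell above.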


Task 6: Parental Education & Student Performance
Objective: Understand socio-educational influence on learning

Group students by parental level of education

In [28]:
print("Number of students for each - parental level of education")
data['parental level of education'].value_counts()

Number of students for each - parental level of education


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

Calculate average scores for each education level

In [29]:
print("Average scores based on parental level of education: ")
data.groupby('parental level of education')[['math score', 'reading score','writing score','total score']].mean()

Average scores based on parental level of education: 


parental level of education,math score,reading score,writing score,total score
associate's degree,67.882883,70.927928,69.896396,208.707207
bachelor's degree,69.389831,73.0,73.381356,215.771186
high school,62.137755,64.704082,62.44898,189.290816
master's degree,69.745763,75.372881,75.677966,220.79661
some college,67.128319,69.460177,68.840708,205.429204
some high school,63.497207,66.938547,64.888268,195.324022


Identify:
o Highest performing parental education group
o Lowest performing group

In [30]:
highest_performing_parent_group = data.groupby('parental level of education')[['math score', 'reading score','writing score','total score']].mean().idxmax()

print("Highest performing parental education group")
highest_performing_parent_group

Highest performing parental education group


math score       master's degree
reading score    master's degree
writing score    master's degree
total score      master's degree
dtype: str

In [31]:
lowest_performing_parent_group = data.groupby('parental level of education')[['math score', 'reading score','writing score','total score']].mean().idxmin()

print("Lowest performing parental education group")
lowest_performing_parent_group

Lowest performing parental education group


math score       high school
reading score    high school
writing score    high school
total score      high school
dtype: str
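The groupby output above is sorted alphabetically, which hides any trend across education levels. A minimal sketch of reordering the averages follows; the ordering in education_order is an assumption (the dataset does not define one), and the values are the total score averages copied from the table above.

```python
import pandas as pd

# Average total score per group, copied from the Task 6 table.
avg_total = pd.Series({
    "associate's degree": 208.707207,
    "bachelor's degree": 215.771186,
    "high school": 189.290816,
    "master's degree": 220.79661,
    "some college": 205.429204,
    "some high school": 195.324022,
})

# Assumed ordering from least to most parental education.
education_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]
ordered = avg_total.reindex(education_order)
print(ordered)
```

Under this ordering the averages mostly rise with education level, except that "some high school" sits above "high school".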

Task 7: Lunch Program & Academic Achievement
Objective: Study the effect of nutrition and welfare programs

Analyze student performance based on lunch type

In [32]:
print("Student performance based on lunch type: ")
data.groupby('lunch')[['math score', 'reading score','writing score','total score']].mean()


Student performance based on lunch type: 


lunch,math score,reading score,writing score,total score
free/reduced,58.921127,64.653521,63.022535,186.597183
standard,70.034109,71.654264,70.823256,212.511628


Calculate average scores for each lunch category.

In [33]:
print("Average scores based on lunch type: ")
data.groupby('lunch')[['math score', 'reading score','writing score','total score']].mean()

Average scores based on lunch type: 


lunch,math score,reading score,writing score,total score
free/reduced,58.921127,64.653521,63.022535,186.597183
standard,70.034109,71.654264,70.823256,212.511628


Identify which lunch program group performs better overall.

In [34]:
print("The lunch program group which performs better overall is: ", data.groupby('lunch')['total score'].mean().idxmax())

The lunch program group which performs better overall is:  standard


Task 8: Performance Categorization (Data Manipulation)
Objective: Classify students based on academic achievement

Create a new column performance_level (stored here as 'performance level'):
o Total Score ≥ 250 → Excellent
o Total Score 200–249 → Good
o Total Score < 200 → Needs Improvement

In [35]:
def performance(total_score):
    if total_score >= 250:
        return "Excellent"
    elif total_score >= 200:
        return "Good"
    else:
        return "Needs Improvement"

data['performance level'] = data['total score'].apply(performance)
data

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score,performance level
0,female,group B,bachelor's degree,standard,none,72,72,74,218,Good
1,female,group C,some college,standard,completed,69,90,88,247,Good
2,female,group B,master's degree,standard,none,90,95,93,278,Excellent
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,Needs Improvement
4,male,group C,some college,standard,none,76,78,75,229,Good
...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282,Excellent
996,male,group C,high school,free/reduced,none,62,55,55,172,Needs Improvement
997,female,group C,high school,free/reduced,completed,59,71,65,195,Needs Improvement
998,female,group D,some college,standard,completed,68,78,77,223,Good


Count number of students in each category

In [36]:
print("Number of students based on performance level :")
data['performance level'].value_counts()

Number of students based on performance level :


performance level
Needs Improvement    444
Good                 417
Excellent            139
Name: count, dtype: int64
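As an alternative to apply() with a Python function, the same thresholds can be expressed with pd.cut, which bins the whole column at once. This is only a sketch: the helper name to_level is hypothetical, and the sample values are the total scores of the first five students shown earlier.

```python
import pandas as pd

def to_level(total_scores):
    """Map total scores to performance labels using the Task 8 thresholds."""
    return pd.cut(
        total_scores,
        bins=[0, 200, 250, float("inf")],
        labels=["Needs Improvement", "Good", "Excellent"],
        right=False,  # bins are [0, 200), [200, 250), [250, inf)
    )

# Total scores of the first five students shown above.
levels = to_level(pd.Series([218, 247, 278, 148, 229]))
print(levels.tolist())
```

The result is an ordered categorical, so value_counts(sort=False) would list the categories in threshold order.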

Task 9: Top & Bottom Performers Identification
Objective: Detect high achievers and at-risk students

Identify top 10 students based on total score. Display relevant student details

In [37]:
data.sort_values('total score', ascending=False).head(10)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score,performance level
916,male,group E,bachelor's degree,standard,completed,100,100,100,300,Excellent
962,female,group E,associate's degree,standard,none,100,100,100,300,Excellent
458,female,group E,bachelor's degree,standard,none,100,100,100,300,Excellent
114,female,group E,bachelor's degree,standard,completed,99,100,100,299,Excellent
712,female,group D,some college,standard,none,98,100,99,297,Excellent
179,female,group D,some high school,standard,completed,97,100,100,297,Excellent
165,female,group C,bachelor's degree,standard,completed,96,100,100,296,Excellent
625,male,group D,some college,standard,completed,100,97,99,296,Excellent
685,female,group E,master's degree,standard,completed,94,99,100,293,Excellent
903,female,group D,bachelor's degree,free/reduced,completed,93,100,100,293,Excellent


Identify bottom 10 students based on total score. Display relevant student details

In [38]:
data.sort_values('total score').head(10)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score,performance level
59,female,group C,some high school,free/reduced,none,0,17,10,27,Needs Improvement
980,female,group B,high school,free/reduced,none,8,24,23,55,Needs Improvement
596,male,group B,high school,free/reduced,none,30,24,15,69,Needs Improvement
327,male,group A,some college,free/reduced,none,28,23,19,70,Needs Improvement
17,female,group B,some high school,free/reduced,none,18,32,28,78,Needs Improvement
76,male,group E,some high school,standard,none,30,26,22,78,Needs Improvement
601,female,group C,high school,standard,none,29,29,30,88,Needs Improvement
338,female,group B,some high school,free/reduced,none,24,38,27,89,Needs Improvement
787,female,group B,some college,standard,none,19,38,32,89,Needs Improvement
211,male,group C,some college,free/reduced,none,35,28,27,90,Needs Improvement


In [39]:
data.sort_values('total score', ascending=False).tail(10)

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total score,performance level
211,male,group C,some college,free/reduced,none,35,28,27,90,Needs Improvement
787,female,group B,some college,standard,none,19,38,32,89,Needs Improvement
338,female,group B,some high school,free/reduced,none,24,38,27,89,Needs Improvement
601,female,group C,high school,standard,none,29,29,30,88,Needs Improvement
76,male,group E,some high school,standard,none,30,26,22,78,Needs Improvement
17,female,group B,some high school,free/reduced,none,18,32,28,78,Needs Improvement
327,male,group A,some college,free/reduced,none,28,23,19,70,Needs Improvement
596,male,group B,high school,free/reduced,none,30,24,15,69,Needs Improvement
980,female,group B,high school,free/reduced,none,8,24,23,55,Needs Improvement
59,female,group C,some high school,free/reduced,none,0,17,10,27,Needs Improvement
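The same selections can also be made with nlargest and nsmallest, which avoid sorting the whole frame. The sketch below uses a handful of total scores copied from the tables above, so it illustrates the calls without reproducing the full result.

```python
import pandas as pd

# A few total scores copied from the top and bottom tables (index = student row).
scores = pd.DataFrame(
    {"total score": [300, 300, 299, 27, 55, 69]},
    index=[916, 962, 114, 59, 980, 596],
)

# Pick the two highest and the two lowest totals.
top2 = scores.nlargest(2, "total score")
bottom2 = scores.nsmallest(2, "total score")
print(top2.index.tolist(), bottom2.index.tolist())
```

On the full dataset, data.nlargest(10, 'total score') and data.nsmallest(10, 'total score') would be the equivalent calls.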


Task 10: Insights & Conclusion (Analytical Thinking)
Objective: Translate data into real-world decisions

Which factor has the strongest impact on student performance?

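Measured by the gap in average total score between the best and worst group, the largest differences come from parental education (about 31.5 points, master's degree vs. high school) and lunch type (about 25.9 points, standard vs. free/reduced). Both are proxies for socioeconomic and educational background.

Test preparation courses also show a measurable benefit (about 22.9 points in average total score), particularly in reading and writing. The gender gap in total score is smaller (about 11.2 points).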
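One rough way to compare the factors is the spread between the highest and lowest group average of total score. The sketch below uses the averages reported in Tasks 4 to 7; using this spread as the measure of "impact" is an assumption, and a factor with more groups (such as parental education) will tend to show a wider spread.

```python
# Average total score per group, copied from the outputs in Tasks 4 to 7.
avg_total_by_factor = {
    "gender": {"female": 208.708494, "male": 197.512448},
    "test preparation course": {
        "completed": 218.00837988826817,
        "none": 195.11682242990653,
    },
    "lunch": {"free/reduced": 186.597183, "standard": 212.511628},
    "parental level of education": {
        "associate's degree": 208.707207,
        "bachelor's degree": 215.771186,
        "high school": 189.290816,
        "master's degree": 220.79661,
        "some college": 205.429204,
        "some high school": 195.324022,
    },
}

# Spread between the best and worst group for each factor.
gaps = {
    factor: max(groups.values()) - min(groups.values())
    for factor, groups in avg_total_by_factor.items()
}

# Rank factors from largest to smallest spread.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for factor, gap in ranked:
    print(f"{factor}: {gap:.1f}")
```

Race/ethnicity is left out because the notebook does not compute group averages for it.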

Does test preparation significantly improve scores?

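Test preparation is not a magic bullet, but students who completed the course averaged about 22.9 points higher in total score (218.0 vs. 195.1) and scored higher in every subject. This analysis did not run a statistical significance test, so the gap is best read as sizeable, not as statistically confirmed.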
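One way to check whether such a gap could be due to chance is a permutation test. The sketch below is a minimal standard-library version; the function name permutation_p_value is hypothetical, and the sample uses only the total scores of the rows visible in the Task 5 excerpts, so its p-value says nothing about the full 1000-row dataset.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_iter):
        # Reassign scores to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)

# Total scores of the rows visible in the Task 5 excerpts (not the full groups).
completed_sample = [247, 275, 195, 220, 134, 242, 225, 282, 195]
none_sample = [218, 278, 148, 229, 232, 207, 208, 188, 172]

p = permutation_p_value(completed_sample, none_sample)
print(f"p-value on the visible sample: {p:.3f}")
```

For the real check, the two inputs would be completed_prep_course['total score'] and not_completed_prep_course['total score'].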

Which subject shows the highest overall performance?

In [40]:
# Reshape the three score columns into long format: one row per (subject, score).
subject_avg = data.melt(
    value_vars=['math score', 'reading score', 'writing score'],
    var_name='subject', value_name='score')

# Average score per subject; idxmax returns e.g. 'reading score', so keep the first word.
avg_scores = subject_avg.groupby('subject')['score'].mean()
print("Subject that shows the highest overall performance is: ", avg_scores.idxmax().split()[0])

Subject that shows the highest overall performance is:  reading


Provide 2–3 data-driven recommendations for the education board

1. Expand Access to Test Preparation Programs: Provide free or subsidized prep courses for all students, particularly those from disadvantaged backgrounds, to reduce performance gaps.

2. Address Socioeconomic Disparities (Lunch Type): Improve school nutrition programs and pair them with academic support initiatives to ensure students from lower-income families have equal opportunities to succeed.

3. Strengthen Parental Engagement and Education Support: Higher parental education levels correlate strongly with better student scores. Launch parental involvement workshops and community learning programs to equip parents with strategies to support their children’s studies.