# **Bootcamp Project: Student Performance Analysis**

# **Step 1: Load and Inspect the Data**

In [1]:
import pandas as pd
df = pd.read_csv("StudentsPerformance.csv")
df.head()

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


# **Step 2: Clean the Data**

In [2]:
print(df.isnull().sum())
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

df.rename(columns={
    "parental_level_of_education": "parent_edu",
    "test_preparation_course": "prep_course"
}, inplace=True)

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64


# **Step 3: Answer EDA Questions**

**1. Which parental education level is linked with the highest average math score?**

In [3]:
df.groupby("parent_edu")["math_score"].mean().sort_values(ascending=False)


parent_edu,math_score
master's degree,69.745763
bachelor's degree,69.389831
associate's degree,67.882883
some college,67.128319
some high school,63.497207
high school,62.137755


**2. Is there a significant score difference between males and females across all subjects?**

In [4]:
df.groupby("gender")[["math_score", "reading_score", "writing_score"]].mean()


gender,math_score,reading_score,writing_score
female,63.633205,72.608108,72.467181
male,68.728216,65.473029,63.311203
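
Group means alone do not show whether a gap is large relative to the spread of scores within each group. One common way to put the gap on a comparable scale is a standardized mean difference (Cohen's d). The sketch below is an illustration only: `cohens_d` is a hypothetical helper, not part of the notebook, and the `toy` frame is made-up data rather than rows from `StudentsPerformance.csv`. It measures effect size, not statistical significance; a formal test would need a statistics library that this notebook does not import.

```python
import pandas as pd


def cohens_d(a: pd.Series, b: pd.Series) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    # Series.var() uses ddof=1 (sample variance) by default.
    pooled_var = ((n_a - 1) * a.var() + (n_b - 1) * b.var()) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / pooled_var ** 0.5


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male"],
    "math_score": [60, 65, 70, 66, 70, 74],
})

female = toy.loc[toy["gender"] == "female", "math_score"]
male = toy.loc[toy["gender"] == "male", "math_score"]

# Positive values mean the first group (male) has the higher mean.
d_math = cohens_d(male, female)
print(round(d_math, 2))
```

On the real data, the same helper could be called once per subject with `df.loc[df["gender"] == ..., "math_score"]` and so on.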


**3. How much does completing the test preparation course improve performance in each subject?**

In [5]:
df.groupby("prep_course")[["math_score", "reading_score", "writing_score"]].mean()


prep_course,math_score,reading_score,writing_score
completed,69.695531,73.893855,74.418994
none,64.077882,66.534268,64.504673
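
The question asks "how much", so it helps to subtract the two rows rather than compare them by eye. The sketch below rebuilds the table of means from the output above (so it runs without the CSV) and takes completed minus none for each subject. The variable names `means` and `gain` are assumptions for illustration. Note that this is a difference in group averages, not proof that the course caused the change.

```python
import pandas as pd

# Group means copied from the output of the groupby cell above.
means = pd.DataFrame(
    {
        "math_score": [69.695531, 64.077882],
        "reading_score": [73.893855, 66.534268],
        "writing_score": [74.418994, 64.504673],
    },
    index=pd.Index(["completed", "none"], name="prep_course"),
)

# Difference in mean score per subject: completed minus none.
gain = (means.loc["completed"] - means.loc["none"]).round(2)
print(gain)
```

In the notebook itself, `means` could simply be the result of the `groupby(...).mean()` call instead of a hand-built frame.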


**4. Which combination of gender, lunch type, and test preparation status produces the top 10% of scores?**

In [6]:
df["avg_score"] = df[["math_score", "reading_score", "writing_score"]].mean(axis=1)
top_10 = df[df["avg_score"] >= df["avg_score"].quantile(0.90)]
top_10.groupby(["gender", "lunch", "prep_course"]).size().sort_values(ascending=False)


gender,lunch,prep_course,count
female,standard,none,31
female,standard,completed,29
male,standard,completed,20
male,standard,none,9
female,free/reduced,completed,6
male,free/reduced,completed,3
female,free/reduced,none,2
male,free/reduced,none,2


**5. Does lunch type have a uniform impact across all race/ethnicity groups, or does its effect vary?**

In [7]:
df.groupby(["race/ethnicity", "lunch"])[["math_score", "reading_score", "writing_score"]].mean()


race/ethnicity,lunch,math_score,reading_score,writing_score
group A,free/reduced,55.222222,60.555556,57.194444
group A,standard,65.981132,67.471698,66.396226
group B,free/reduced,57.434783,63.971014,61.521739
group B,standard,66.884298,69.280992,67.92562
group C,free/reduced,56.412281,63.412281,61.412281
group C,standard,68.941463,72.268293,71.395122
group D,free/reduced,61.115789,66.431579,66.452632
group D,standard,70.916168,72.077844,72.245509
group E,free/reduced,66.560976,68.731707,67.195122
group E,standard,76.828283,74.808081,73.151515
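
To judge whether the lunch effect is uniform, it is easier to look at one gap per group than at ten separate means. The sketch below uses only the math means from the output above, pivots `lunch` into columns with `unstack`, and subtracts. The names `math_means`, `wide`, and `gap` are assumptions for illustration; the same steps would apply to the reading and writing columns.

```python
import pandas as pd

# Math means copied from the grouped output above,
# ordered as (group, free/reduced), (group, standard) for groups A to E.
math_means = pd.Series(
    [55.222222, 65.981132, 57.434783, 66.884298, 56.412281,
     68.941463, 61.115789, 70.916168, 66.560976, 76.828283],
    index=pd.MultiIndex.from_product(
        [["group A", "group B", "group C", "group D", "group E"],
         ["free/reduced", "standard"]],
        names=["race/ethnicity", "lunch"],
    ),
    name="math_score",
)

# Pivot lunch type into columns, then take standard minus free/reduced.
wide = math_means.unstack("lunch")
gap = (wide["standard"] - wide["free/reduced"]).round(2)
print(gap.sort_values(ascending=False))
```

A similar gap in every group would suggest a uniform effect; a wide spread between the largest and smallest gap would suggest it varies by group.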


**6. What is the correlation between reading and writing scores? Is it stronger than math and writing?**

In [8]:
df[["reading_score", "writing_score", "math_score"]].corr()


,reading_score,writing_score,math_score
reading_score,1.0,0.954598,0.81758
writing_score,0.954598,1.0,0.802642
math_score,0.81758,0.802642,1.0


**7. Identify the top 5% performing students and analyze their demographic profiles.**

In [9]:
top_5 = df[df["avg_score"] >= df["avg_score"].quantile(0.95)]
top_5.describe(include='object')


,gender,race/ethnicity,parent_edu,lunch,prep_course
count,50,50,50,50,50
unique,2,5,6,2,2
top,female,group E,associate's degree,standard,completed
freq,36,14,16,46,33


**8. Can we cluster students into performance categories using just Pandas?**

In [10]:
def categorize(score):
    if score >= 85:
        return "High"
    elif score >= 60:
        return "Medium"
    else:
        return "Low"

df["performance_cluster"] = df["avg_score"].apply(categorize)
df["performance_cluster"].value_counts()


performance_cluster,count
Medium,599
Low,285
High,116


In [11]:
df.to_csv("Cleaned_StudentPerformance.csv", index=False)


In [12]:
from google.colab import files
files.download("Cleaned_StudentPerformance.csv")

