Exploratory Data Analysis of Student Performance

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The data set has two types of variables:
1. Numeric variables: math score, reading score, writing score
2. Categorical variables: gender, race/ethnicity, parental level of education, lunch, test preparation course
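
The split into numeric and categorical variables can also be derived from the column dtypes instead of being listed by hand. The following is a minimal sketch; the helper name `split_columns_by_type` is hypothetical and not part of the original notebook, and it assumes the categorical columns are stored as strings.

```python
import pandas as pd

def split_columns_by_type(df):
    """Return (numeric_columns, categorical_columns) as two lists of column names."""
    # Columns with a numeric dtype (int or float)
    numeric = df.select_dtypes(include="number").columns.tolist()
    # Everything else is treated as categorical here
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical
```

Once the data frame `bd` has been loaded, `split_columns_by_type(bd)` should return the three score columns in the first list and the five categorical columns in the second.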

In [None]:
a="../input/students-performance-in-exams/StudentsPerformance.csv"
bd=pd.read_csv(a)
bds=bd.copy()

In [None]:
# Count missing values per column
bd.isnull().sum()

The data set has no null values.
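
When a data set does contain gaps, the share of missing values per column is often more informative than the raw count. A small sketch of such a summary follows; `missing_summary` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values for every column."""
    counts = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values
    percent = df.isna().mean() * 100
    return pd.DataFrame({"missing": counts, "percent": percent})
```

Calling `missing_summary(bd)` here should show zeros in both columns, in line with the check above.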

In [None]:
bd.describe()

In [None]:
bd.shape

The data set has 1000 rows and 8 columns.

In [None]:
bd["race/ethnicity"].value_counts()

In [None]:
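# Short names: ped = parental level of education, tpc = test preparation course, m/r/w = math/reading/writing score
bds.columns=['gender','race','ped','lunch','tpc','m','r','w']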

In [None]:
bds.sample(3)

In [None]:
bds.gender.value_counts().plot(kind='bar')

In [None]:
bds.gender.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

The proportions of male and female students are nearly equal.

In [None]:
bds.race.value_counts().plot(kind='bar')

In [None]:
bds.race.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

Most students belong to group C, followed by group D, and so on as shown in the graph.
The fewest students belong to group A.

In [None]:
bds["ped"].value_counts()

In [None]:
bds.ped.value_counts().plot(kind='bar')

In [None]:
bds.ped.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

The most common parental level of education is some college, followed by associate's degree. The least common is a master's degree.

In [None]:
bds["lunch"].value_counts()

In [None]:
bds.lunch.value_counts().plot(kind='bar')

In [None]:
bds.lunch.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

Nearly twice as many students take the standard lunch as the free/reduced lunch.

In [None]:
bds["tpc"].value_counts()

In [None]:
bds.tpc.value_counts().plot(kind='bar')

In [None]:
bds.tpc.value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

Most students did not complete the Test preparation course
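
Statements such as "nearly equal" or "nearly twice" are easier to check with counts and percentages side by side than from a bar or pie chart. This is a minimal sketch; `category_shares` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

def category_shares(series):
    """Count and percentage share of each category in a column."""
    counts = series.value_counts()
    # Convert counts to percentages of the total, rounded to one decimal
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})
```

For example, `category_shares(bds["lunch"])` or `category_shares(bds["tpc"])` would give the numbers behind the two pie charts above.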

Let's start analyzing the scores

In [None]:
plt.rcParams['figure.figsize']=(20,10)
sns.countplot(x='math score',data=bd)
plt.show()

Here we see the distribution of Math scores

In [None]:
sns.violinplot(y='math score',data=bd)
plt.show()

Most students scored in the range 60-80 in math.

In [None]:
plt.rcParams['figure.figsize']=(20,10)
sns.countplot(x='writing score',data=bd)
plt.show()

Here's the distribution of writing scores

In [None]:
sns.violinplot(y='writing score',data=bd)
plt.show()

Most students have scored in the range 60-80

In [None]:
plt.rcParams['figure.figsize']=(20,10)
sns.countplot(x='reading score',data=bd)
plt.show()

In [None]:
sns.violinplot(y='reading score',data=bd)
plt.show()

Most students have scored in the range 60-80
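
The 60-80 range is read off the violin plots by eye. The share of students inside any score band can be computed directly, as in the sketch below; `share_in_range` is a hypothetical helper name, and both bounds are assumed to be inclusive.

```python
import pandas as pd

def share_in_range(scores, low, high):
    """Fraction of values with low <= value <= high (both ends inclusive)."""
    # between() returns a boolean mask; its mean is the fraction of True values
    return scores.between(low, high).mean()
```

For instance, `share_in_range(bd['reading score'], 60, 80)` gives the fraction of students whose reading score lies in that band.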

In [None]:
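# numeric_only=True skips the categorical columns, which newer pandas versions refuse to average
bd.mean(numeric_only=True).plot.bar()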
plt.show()

Average scores are highest for Reading and lowest for Math

Now let's take a look at how various variables affect different scores

In [None]:
bd.groupby(["test preparation course"]).mean(numeric_only=True).plot.bar()
plt.show()

Average scores are higher across all subjects for students who completed the Test preparation course

In [None]:
bd.groupby(["parental level of education"]).mean(numeric_only=True).plot.bar()
plt.show()


Average scores are highest for students whose parents hold a master's degree and lowest for students whose parents went only to high school.
Students with more highly educated parents tend to perform best on the writing test, while the others perform better on the reading test.

In [None]:
bd.groupby(["gender"]).mean(numeric_only=True).plot.bar()
plt.show()

Female students perform better on the reading and writing tests, while male students perform better in math.

In [None]:
bd.groupby(["race/ethnicity"]).mean(numeric_only=True).plot.bar()
plt.show()

Average scores are highest for group E students and lowest for group A.

In [None]:
bd.groupby(["lunch"]).mean(numeric_only=True).plot.bar()
plt.show()

Students opting for a standard lunch have higher scores across all subjects
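
The bar charts in this section show which group scores higher, but not by how much. One way to put a number on it is the gap between the highest and lowest group mean per subject. The sketch below assumes that is a useful summary; `mean_gap` is a hypothetical helper name.

```python
import pandas as pd

def mean_gap(df, by, columns):
    """Difference between the highest and lowest group mean for each column."""
    # One row per group, one column per score
    means = df.groupby(by)[columns].mean()
    return means.max() - means.min()
```

A call such as `mean_gap(bd, "lunch", ["math score", "reading score", "writing score"])` returns one gap per subject, which makes the groupings above directly comparable.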

Now let's take a look at how categorical variables affect each other

In [None]:
sns.countplot(x='gender',hue='tpc',data=bds)
plt.show()

In [None]:
sns.countplot(x='race',hue='gender',data=bds)
plt.show()

In [None]:
sns.countplot(x='gender',hue='ped',data=bds)
plt.show()

In [None]:
sns.countplot(x='gender',hue='lunch',data=bds)
plt.show()

In [None]:
sns.countplot(x='race',hue='tpc',data=bds)
plt.show()

In [None]:
sns.countplot(x='race',hue='ped',data=bds)
plt.show()
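
Count plots compare raw counts, so larger groups dominate the picture. To compare proportions within each group instead, a row-normalised cross-tabulation can be used. This is a sketch only; `row_shares` is a hypothetical helper name.

```python
import pandas as pd

def row_shares(df, row, col):
    """Cross-tabulate two categorical columns so that each row sums to 1."""
    # normalize="index" divides every row by its own total
    return pd.crosstab(df[row], df[col], normalize="index")
```

For example, `row_shares(bds, "race", "tpc")` shows, for each group, the fraction of students who completed the test preparation course.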

In [None]:
bd.corr(numeric_only=True)

All scores are well correlated, with the highest correlation between reading and writing scores.
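
Finding the strongest pair by scanning a correlation matrix gets harder as the number of columns grows. The sketch below lists each pair once, ordered by the absolute value of the correlation; `ranked_pairs` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def ranked_pairs(corr):
    """List each distinct pair of variables once, strongest correlation first."""
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations also rank high
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)
```

Here it could be called as `ranked_pairs(bd.corr(numeric_only=True))`; the first entry should be the reading and writing pair.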

Correlation can only be calculated between numeric values, not between numeric and categorical data, so we first encode the categorical columns as numbers.

Male and female are encoded as 0 and 1 respectively.
The ethnic groups are numbered from 1 to 5, starting with group A and ending with group E.
The levels of parental education are numbered so that a higher level of education gets a higher value.
Lunch: 1 is assigned to "standard" and 2 to "free/reduced".
Test preparation course: 0 is assigned to "none" and 1 to "completed".

In [None]:
# Encode each categorical column with its own mapping (nested dict form of replace),
# so a label can only be replaced in the column it belongs to
bds = bds.replace({
    'gender': {'male': 0, 'female': 1},
    'race': {'group A': 1, 'group B': 2, 'group C': 3, 'group D': 4, 'group E': 5},
    'ped': {'some high school': 1, 'high school': 2, 'some college': 3,
            "associate's degree": 4, "bachelor's degree": 5, "master's degree": 6},
    'lunch': {'standard': 1, 'free/reduced': 2},
    'tpc': {'none': 0, 'completed': 1},
})

In [None]:
bds.corr()

In [None]:
sns.heatmap(bds.corr(),cmap="Greens")

Now let's assign grades

In [None]:
bd['total']=bd['math score']+bd['reading score']+bd['writing score']

In [None]:
bd['percentage']=bd['total']/300*100

In [None]:
def grd(score):
    """Map a percentage (0-100) to a letter grade from A to F."""
    # Each branch is only reached when the higher cut-offs failed,
    # so the upper bound of every band is implied
    if score>=90:
        return 'A'
    elif score>=80:
        return 'B'
    elif score>=70:
        return 'C'
    elif score>=60:
        return 'D'
    elif score>=50:
        return 'E'
    else:
        return 'F'
bd['grades']=bd['percentage'].apply(grd)

In [None]:
bd.sample(3)
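
The same grade bands can also be assigned without a per-row function by binning the whole column at once. The sketch below uses `pd.cut` with left-closed bins; `grade_series` is a hypothetical helper name, and it assumes the cut-offs 50, 60, 70, 80 and 90 used by `grd` above, with anything from 90 upward graded A.

```python
import numpy as np
import pandas as pd

def grade_series(percentages):
    """Vectorised letter grades: F below 50, then E, D, C, B in steps of 10, A from 90."""
    bins = [-np.inf, 50, 60, 70, 80, 90, np.inf]
    labels = ["F", "E", "D", "C", "B", "A"]
    # right=False makes every bin include its lower edge, e.g. [80, 90) is a B
    return pd.cut(percentages, bins=bins, labels=labels, right=False)
```

`grade_series(bd['percentage'])` returns an ordered categorical column, so plots and `value_counts(sort=False)` list the grades in F to A order.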

In [None]:
bd.grades.value_counts().plot(kind='bar')

Most students got grades C and D. The fewest students got grade A.

In [None]:
sns.countplot(hue='gender',x='grades',data=bd)

In [None]:
sns.countplot(hue='test preparation course',x='grades',data=bd)

In [None]:
sns.countplot(hue='lunch',x='grades',data=bd)

In [None]:
sns.countplot(hue='race/ethnicity',x='grades',data=bd)

In [None]:
sns.countplot(hue='parental level of education',x='grades',data=bd)

# Inferences 

1. The proportions of male and female students are almost the same.

2. Most students belong to groups C and D, and the fewest to group A.

3. Most students have parents who went to some college or have an associate's degree. The fewest have parents with a master's degree.

4. Almost twice as many students opted for a standard lunch as for a free/reduced lunch.

5. Only about a third of students completed the test preparation course.

6. Most students scored in the range 60-80 in all 3 subjects.

7. Most students secured grades C and D. The fewest scored grade A.

8. Average scores are highest for Reading and lowest for Math

9. Average scores are higher across all subjects for students who completed the Test preparation course

10. Average scores are highest for students whose parents hold a master's degree and lowest for students whose parents went only to high school.

11. Students with more highly educated parents tend to perform best on the writing test, while the others perform better on the reading test.

12. Female students perform better on the reading and writing tests, while male students perform better in math.

13. Average scores are highest for group E students and lowest for group A.

14. Students opting for a standard lunch have higher scores across all subjects

15. A higher proportion of group E students completed the test preparation course, which may contribute to their better marks.

16. Groups B and C have more female students; the remaining groups have more male students.

17. All scores are well correlated with highest correlation between reading and writing scores.

18. There is very little correlation between categorical variables.

19. Very few students with well-educated parents performed poorly.

20. There is a notable negative correlation between lunch type and scores, which follows from the encoding (standard = 1, free/reduced = 2) and agrees with inference 14.