This notebook was prepared by Shreyas Kumbhar. Source and license info is on GitHub.

1) Problem statement

Analyze the Students Performance dataset to explore the factors that affect the marks scored by students.

2) Data Collection.

The dataset is collected from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
The data consists of 8 columns and 1000 rows.

3) Attribute information

Most of the variables are categorical.
I am not sure why ‘lunch’ would affect the scores.
I think the parents' education level does affect the student's performance to some extent.
Columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, writing score.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"F:\Dataset\StudentsPerformance.csv")

df.head()

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93
3	male	group A	associate's degree	free/reduced	none	47	57	44
4	male	group C	some college	standard	none	76	78	75

df.tail()

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
995	female	group E	master's degree	standard	completed	88	99	95
996	male	group C	high school	free/reduced	none	62	55	55
997	female	group C	high school	free/reduced	completed	59	71	65
998	female	group D	some college	standard	completed	68	78	77
999	female	group D	some college	free/reduced	none	77	86	86

df.shape

(1000, 8)

df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

# numeric columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
numeric_features

['math score', 'reading score', 'writing score']

# categorical columns
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']
categorical_features

['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course']
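
The two list comprehensions above depend on text columns being stored with the `object` dtype. As a hedged alternative, `select_dtypes` asks pandas directly for numeric and non-numeric columns, which should also hold up if a pandas version stores text with a dedicated string dtype. The sketch below uses only the five rows shown in the `df.head()` output; `sample`, `numeric_cols` and `categorical_cols` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "race/ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "parental level of education": [
        "bachelor's degree", "some college", "master's degree",
        "associate's degree", "some college",
    ],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none", "none", "none"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Numeric columns are selected by kind; everything else is treated as categorical.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the full `df`, the same two calls should reproduce the `numeric_features` and `categorical_features` lists.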

df.memory_usage()

Index                           128
gender                         8000
race/ethnicity                 8000
parental level of education    8000
lunch                          8000
test preparation course        8000
math score                     8000
reading score                  8000
writing score                  8000
dtype: int64

df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

df.duplicated().sum()

df.describe().T

	count	mean	std	min	25%	50%	75%	max
math score	1000.0	66.089	15.163080	0.0	57.00	66.0	77.0	100.0
reading score	1000.0	69.169	14.600192	17.0	59.00	70.0	79.0	100.0
writing score	1000.0	68.054	15.195657	10.0	57.75	69.0	79.0	100.0

Summary:

Only three columns hold numerical data.
The mean score in each of the three subjects lies between 66 and 70.
Reading has the highest mean score (69.17) and math the lowest (66.09).
The minimum math score is 0, so at least one student scored zero in math.
In each subject, the highest mark obtained is 100.
Math appears to be the weakest subject for these students.

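# numeric_only=True skips the text columns, which recent pandas versions refuse to aggregate
df.skew(numeric_only=True)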

math score      -0.278935
reading score   -0.259105
writing score   -0.289444
dtype: float64

# If skewness is less than -1 or greater than +1, the distribution is highly skewed.
# If skewness is between -1 and -½ or between +½ and +1, the distribution is moderately skewed.
# If skewness is between -½ and +½, the distribution is approximately symmetric.
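
The rule of thumb above can be written as a small helper. This is a minimal sketch: `describe_skewness` is a hypothetical name, and treating the exact boundary values ±½ and ±1 as belonging to the milder category is an assumption, since the rule does not say where they fall.

```python
def describe_skewness(value: float) -> str:
    """Label a skewness value using the -1 / -0.5 / +0.5 / +1 rule of thumb."""
    if value < -1 or value > 1:
        return "highly skewed"
    if value < -0.5 or value > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Skewness values copied from the df.skew() output above.
reported_skewness = {
    "math score": -0.278935,
    "reading score": -0.259105,
    "writing score": -0.289444,
}

skew_labels = {name: describe_skewness(value) for name, value in reported_skewness.items()}
for name, label in skew_labels.items():
    print(f"{name}: {label}")
```

All three scores fall in the approximately symmetric band, with a slight lean to the left because the values are negative.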

df.kurtosis(numeric_only=True)
# pandas reports excess kurtosis (Fisher's definition), so a normal distribution gives 0 here, not 3.
# A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3
# (excess ≈0) is called mesokurtic.
# A distribution with kurtosis <3 (excess kurtosis <0) is called platykurtic. Compared to a normal distribution,
# its tails are shorter and thinner, and often its central peak is lower and broader.
# A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution,
# its tails are longer and fatter, and often its central peak is higher and sharper.

math score       0.274964
reading score   -0.068265
writing score   -0.033365
dtype: float64
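
When reading these numbers it helps to know that pandas returns excess kurtosis, so the values are compared against 0 rather than 3. The sketch below checks this on a simulated normal sample and then labels the three reported values. `classify_kurtosis` and its `tolerance` band of 0.1 are assumptions made for illustration; there is no standard cut-off for "approximately 0".

```python
import numpy as np
import pandas as pd

# Draw a large sample from a standard normal distribution (seed fixed for repeatability).
rng = np.random.default_rng(seed=0)
normal_sample = pd.Series(rng.normal(size=100_000))

# pandas returns excess kurtosis, so this is close to 0 rather than 3.
normal_excess = normal_sample.kurtosis()

def classify_kurtosis(excess: float, tolerance: float = 0.1) -> str:
    """Label an excess-kurtosis value; `tolerance` is an arbitrary band around 0."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"

# Excess-kurtosis values copied from the df.kurtosis() output above.
reported_kurtosis = {
    "math score": 0.274964,
    "reading score": -0.068265,
    "writing score": -0.033365,
}
kurtosis_labels = {name: classify_kurtosis(value) for name, value in reported_kurtosis.items()}

print(round(normal_excess, 2))
print(kurtosis_labels)
```

With this tolerance, math score comes out mildly leptokurtic while reading and writing are mesokurtic; a wider band would label all three mesokurtic.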

Univariate Analysis

Univariate analysis is the analysis of a single variable; the prefix “uni” means “one.” Its purpose is to understand the distribution of values for that variable.

# Univariate analysis: kernel density estimate of each numeric feature

fig = plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.)
for i, feature in enumerate(numeric_features):
    plt.subplot(3, 3, i + 1)
    # fill=True replaces the deprecated shade argument
    sns.kdeplot(x=df[feature], fill=True, color='b')
    plt.xlabel(feature)
# Adjust the spacing once, after all subplots are drawn
plt.tight_layout()

Report

The math, reading and writing scores are approximately normally distributed, with a slight left skew.
If a later step requires a closer-to-normal shape, transformations can be tried to reduce the skew.

# categorical columns
plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)
    sns.countplot(x=df[categorical_features[i]])
    plt.xlabel(categorical_features[i])
    plt.xticks(rotation=45)
    plt.tight_layout()

Bivariate Analysis of Categorical Features vs math score

# categorical columns
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)
    sns.boxplot(x=categorical_features[i], y=numeric_features[0], data=df)
    plt.xlabel(categorical_features[i])

Observation regarding the Math Scores vs Categorical Features plots:

In math, male students scored higher on average than female students. The one student who scored zero in math is female.
In math, group E performed the best and group A the worst. The student who scored zero belongs to group C.
In all three subjects, the highest-scoring students have parents with a master's degree.
On average, students who completed the test preparation course scored higher than those who did not.
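
The "on average" statements in these observations are read off boxplots, which show medians rather than means. A group-wise mean gives a direct numerical check. The sketch below is only an illustration: it uses the ten rows visible in the `df.head()` and `df.tail()` outputs, which are far too few to support any conclusion, and `sample` and `course_means` are hypothetical names.

```python
import pandas as pd

# Ten rows copied from the df.head() and df.tail() outputs; they only make
# the example runnable and are not representative of the 1000-row dataset.
sample = pd.DataFrame({
    "test preparation course": [
        "none", "completed", "none", "none", "none",
        "completed", "none", "completed", "completed", "none",
    ],
    "math score": [72, 69, 90, 47, 76, 88, 62, 59, 68, 77],
    "reading score": [72, 90, 95, 57, 78, 99, 55, 71, 78, 86],
    "writing score": [74, 88, 93, 44, 75, 95, 55, 65, 77, 86],
})

score_columns = ["math score", "reading score", "writing score"]

# Mean of each score for students who did and did not complete the course.
course_means = sample.groupby("test preparation course")[score_columns].mean()
print(course_means.round(2))
```

Running the same `groupby` on the full `df`, with any of the categorical columns as the key, gives the numbers behind each boxplot comparison.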

Bivariate Analysis of Categorical Features vs reading score

# categorical columns
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)
    sns.boxplot(x=categorical_features[i], y=numeric_features[1], data=df)
    plt.xlabel(categorical_features[i])

Observation regarding the Reading Scores vs Categorical Features plots:

In reading, female students did better than male students. The student with the lowest reading score is also female.
In reading, group E performed the best and group A the worst. The student with the lowest score belongs to group C.
Students whose parents have a bachelor’s degree also performed consistently well.
The lowest reading scores come from students who did not complete the test preparation course. The minimum reading score is 17, so no student scored zero in this subject.

Bivariate Analysis of Categorical Features vs writing score

# categorical columns
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i in range(0, len(categorical_features)):
    plt.subplot(3, 2, i+1)
    sns.boxplot(x=categorical_features[i], y=numeric_features[2], data=df)
    plt.xlabel(categorical_features[i])

Observation regarding the Writing Scores vs Categorical Features plots:

In writing, female students did better than male students. Again, the student with the lowest score in this subject is female.
In writing, groups C, D and E did well. Group A performed the worst, and the lowest score comes from group C.
Students whose parents have a bachelor’s degree also performed consistently well.
The lowest writing scores come from students who did not complete the test preparation course. The minimum writing score is 10, so no student scored zero in this subject.

plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features(Checking Outliers)', fontsize=20, fontweight='bold', alpha=0.8, y=1.)

for i in range(0, len(numeric_features)):
    plt.subplot(5, 3, i+1)
    sns.boxplot(x=df[numeric_features[i]], color='r')
    plt.xlabel(numeric_features[i])
    plt.tight_layout()

Report

Outliers are present in all numeric features.
The distributions are almost normal.
We will try to remove the outliers with the IQR method.
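
The notebook does not show the removal step itself. A minimal sketch of the usual IQR rule follows, assuming the conventional multiplier of 1.5: a value is an outlier if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. For math score, the quartiles in the `df.describe()` output (57 and 77) give an IQR of 20 and fences at 27 and 107. The helper names `iqr_bounds` and `remove_outliers_iqr` and the `demo` values are made up for illustration.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the lower and upper IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers_iqr(frame: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Keep only the rows that lie inside the IQR fences of every listed column."""
    mask = pd.Series(True, index=frame.index)
    for column in columns:
        lower, upper = iqr_bounds(frame[column], k)
        mask &= frame[column].between(lower, upper)
    return frame[mask]

# Made-up scores for demonstration; the single 0 plays the role of the outlier.
demo = pd.DataFrame({"math score": [0, 55, 60, 65, 70, 75, 80, 85, 90]})

lower, upper = iqr_bounds(demo["math score"])
cleaned = remove_outliers_iqr(demo, ["math score"])

print(lower, upper)
print(len(demo), "rows before,", len(cleaned), "rows after")
```

On the real data the call would be `remove_outliers_iqr(df, numeric_features)`. Whether dropping these rows is appropriate is a judgment call, since a very low score may be a genuine result rather than an error.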

sns.pairplot(df, corner=True)

<seaborn.axisgrid.PairGrid at 0x243e3eaa6a0>

Multivariate Analysis

Multivariate analysis is the analysis of more than one variable.

Check Multicollinearity in Numerical features

# Restrict the correlation to the numeric columns; text columns cannot be correlated
df[numeric_features].corr()

	math score	reading score	writing score
math score	1.000000	0.817580	0.802642
reading score	0.817580	1.000000	0.954598
writing score	0.802642	0.954598	1.000000
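
To turn the correlation matrix into a multicollinearity check, one simple approach is to list every pair of features whose absolute correlation exceeds a chosen threshold. The sketch below rebuilds the matrix from the values printed above. The threshold of 0.9 is an assumption for illustration, not a value taken from the notebook, and `high_pairs` is a hypothetical name.

```python
from itertools import combinations

import pandas as pd

columns = ["math score", "reading score", "writing score"]

# Correlation matrix copied from the df[numeric_features].corr() output above.
corr = pd.DataFrame(
    [
        [1.000000, 0.817580, 0.802642],
        [0.817580, 1.000000, 0.954598],
        [0.802642, 0.954598, 1.000000],
    ],
    index=columns,
    columns=columns,
)

threshold = 0.9  # assumed cut-off for "highly correlated"

# Visit each unordered pair once and keep those above the threshold.
high_pairs = [
    (first, second, corr.loc[first, second])
    for first, second in combinations(columns, 2)
    if abs(corr.loc[first, second]) > threshold
]

for first, second, value in high_pairs:
    print(f"{first} vs {second}: {value:.3f}")
```

With this threshold only reading score and writing score are flagged (0.955). Math score correlates with both at about 0.8, which is still strong.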

plt.figure(figsize = (15,10))
sns.heatmap(df[numeric_features].corr(), cmap="CMRmap", annot=True)
plt.show()

Final Report

The data types and column names are correct, and the dataset has 1000 rows and 8 columns.
There are outliers in the math score, reading score and writing score.
