This notebook was prepared by Shreyas Kumbhar. Source and license info is on GitHub.
An analysis of the Students Performance dataset to explore the factors that affect the marks scored by students.
The Dataset is collected from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
The data consists of 8 columns and 1000 rows.
- Most of the variables are categorical.
- I am not sure why ‘lunch’ would affect the scores.
- I think the parents' education level affects the student's performance to some extent.
- Columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, writing score.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"F:\Dataset\StudentsPerformance.csv")
df.head()
| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
df.tail()
| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |
| 999 | female | group D | some college | free/reduced | none | 77 | 86 | 86 |
df.shape
(1000, 8)
df.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
'test preparation course', 'math score', 'reading score',
'writing score'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
df.dtypes
gender object
race/ethnicity object
parental level of education object
lunch object
test preparation course object
math score int64
reading score int64
writing score int64
dtype: object
# numeric column
numeric_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
numeric_features
['math score', 'reading score', 'writing score']
# categorical column
categorical_features = [feature for feature in df.columns if df[feature].dtypes == 'O']
categorical_features
['gender',
'race/ethnicity',
'parental level of education',
'lunch',
'test preparation course']
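The same split can also be obtained with `DataFrame.select_dtypes`, which selects columns by dtype family instead of comparing against `'O'`. The snippet below is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Columns with a numeric dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Using `exclude="number"` for the categorical side keeps the two lists complementary, so no column can be left out of both.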
df.memory_usage()
Index 128
gender 8000
race/ethnicity 8000
parental level of education 8000
lunch 8000
test preparation course 8000
math score 8000
reading score 8000
writing score 8000
dtype: int64
df.isnull().sum()
gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
df.duplicated().sum()
0
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| math score | 1000.0 | 66.089 | 15.163080 | 0.0 | 57.00 | 66.0 | 77.0 | 100.0 |
| reading score | 1000.0 | 69.169 | 14.600192 | 17.0 | 59.00 | 70.0 | 79.0 | 100.0 |
| writing score | 1000.0 | 68.054 | 15.195657 | 10.0 | 57.75 | 69.0 | 79.0 | 100.0 |
- Only three columns hold numerical data.
- The average score in all three subjects is between 65 and 70.
- Reading has the highest mean score (69.2) of the three subjects.
- At least one student scored zero in mathematics.
- In each subject, the highest mark obtained is 100.
- Mathematics has the lowest mean score (66.1), which suggests it is the least favoured subject among these students.
# Restrict to the numeric columns; skewness is undefined for the text columns
df[numeric_features].skew()
math score -0.278935
reading score -0.259105
writing score -0.289444
dtype: float64
# If skewness is less than -1 or greater than +1, the distribution is highly skewed.
# If skewness is between -1 and -½ or between +½ and +1, the distribution is moderately skewed.
# If skewness is between -½ and +½, the distribution is approximately symmetric.
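The rule of thumb above can be written as a small helper. This is only a sketch: `describe_skewness` is a hypothetical name, and the ±0.5 and ±1 cut-offs are the informal thresholds quoted in the comments, not a formal test. The values passed in are the skewness figures printed above.

```python
def describe_skewness(skew):
    """Label a skewness value using the informal ±0.5 / ±1 thresholds."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Skewness values reported by pandas for the three score columns.
reported = {
    "math score": -0.278935,
    "reading score": -0.259105,
    "writing score": -0.289444,
}
labels = {name: describe_skewness(value) for name, value in reported.items()}
for name, label in labels.items():
    print(name, "->", label)
```

All three scores fall inside the ±0.5 band, so by this rule they count as approximately symmetric, with the negative sign indicating a slight left skew.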
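# pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0
df[numeric_features].kurtosis()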
# A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3
# (excess ≈0) is called mesokurtic.
# A distribution with kurtosis <3 (excess kurtosis <0 ) is called platykurtic. Compared to a normal distribution,
# its tails are shorter and thinner, and often its central peak is lower and broader.
# A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution,
# its tails are longer and fatter, and often its central peak is higher and sharper.
math score 0.274964
reading score -0.068265
writing score -0.033365
dtype: float64
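To make the relation between kurtosis and excess kurtosis concrete, here is a minimal sketch that computes both from the central moments of a sample. It uses the plain population-moment formula, which is an assumption made for simplicity: pandas applies a bias-corrected estimator, so its results differ slightly on small samples. The function name `excess_kurtosis` is hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (a normal distribution gives 0)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment (variance)
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


sample = [1, 2, 3, 4, 5]           # evenly spread values, flatter than a normal curve
excess = excess_kurtosis(sample)   # negative, i.e. platykurtic
kurtosis = excess + 3              # the "kurtosis" scale on which a normal scores 3
print(excess, kurtosis)
```

A negative excess value, as for the evenly spread sample here, corresponds to the platykurtic case described in the comments above.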
Univariate analysis is the analysis of a single variable; the prefix “uni” means “one”. Its purpose is to understand the distribution of values for that variable.
# Univariate analysis: kernel density estimate of each numeric feature
fig = plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.)
for i, feature in enumerate(numeric_features):
    plt.subplot(3, 3, i + 1)
    # fill=True shades the area under the curve (replaces the deprecated shade argument)
    sns.kdeplot(x=df[feature], fill=True, color='b')
    plt.xlabel(feature)
plt.tight_layout()
- Math score, reading score and writing score are approximately normally distributed, with a slight left skew.
- If a closer fit to normality is needed, we can try applying different transformations.
# Count plot of each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.countplot(x=df[feature])
    plt.xlabel(feature)
    plt.xticks(rotation=45)
plt.tight_layout()
# Math score against each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[0], data=df)
    plt.xlabel(feature)
- In math, male students scored higher on average than female students. The student who scored zero in math is female.
- In math, group E performed the best and group A the worst. The student who scored zero belongs to group C.
- In all three subjects, the highest scorers are students whose parents have a master's degree.
- On average, students who completed the test preparation course scored better than those who did not.
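Observations read off box plots can be checked numerically with `groupby`. The snippet below is a sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset); the same calls should work on `df` with the real column names.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [60, 70, 65, 75],
})

# Average math score per group.
mean_by_gender = toy.groupby("gender")["math score"].mean()
mean_by_course = toy.groupby("test preparation course")["math score"].mean()

# Row label of the lowest math score, then that student's gender.
lowest_gender = toy.loc[toy["math score"].idxmin(), "gender"]

print(mean_by_gender)
print(mean_by_course)
print(lowest_gender)
```

Comparing group means (or medians via `.median()`) gives a number to back each visual impression, and `idxmin` identifies which student sits at the bottom of a box plot.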
# Reading score against each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[1], data=df)
    plt.xlabel(feature)
- In reading, female students did better than male students. However, the student with the lowest marks is also female.
- In reading, group E performed the best and group A the worst. The student with the lowest marks belongs to group C.
- Students whose parents have a bachelor’s degree have also consistently performed well.
- The lowest marks were scored by students who did not complete the test preparation course.
# Writing score against each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[2], data=df)
    plt.xlabel(feature)
- In writing, female students performed better than male students, and again the student with the lowest marks is female.
- In writing, groups C, D and E did well. Group A performed the worst, and the lowest marks are from group C.
- Students whose parents have a bachelor’s degree have also consistently performed well.
- The lowest marks were scored by students who did not complete the test preparation course.
# Box plot of each numeric feature to check for outliers
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features(Checking Outliers)', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(numeric_features):
    plt.subplot(5, 3, i + 1)
    sns.boxplot(x=df[feature], color='r')
    plt.xlabel(feature)
plt.tight_layout()
- Outliers are present in all numeric features.
- The distributions are almost normal.
- We will try to remove the outliers using the IQR method.
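The IQR method mentioned above can be sketched as follows. The notebook does not show its own implementation, so this is an assumed version using the common 1.5 × IQR fences; `iqr_bounds`, `drop_iqr_outliers` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(frame, columns, k=1.5):
    """Keep only the rows that lie inside the fences for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        low, high = iqr_bounds(frame[col], k)
        mask &= frame[col].between(low, high)
    return frame[mask]


# Toy scores standing in for the real column (values are illustrative only).
toy = pd.DataFrame({"math score": [0, 57, 60, 66, 70, 77, 80]})
low, high = iqr_bounds(toy["math score"])
cleaned = drop_iqr_outliers(toy, ["math score"])
print(low, high, len(cleaned))
```

Applied to the quartiles reported earlier for math score (57 and 77), the same formula gives fences at 27 and 107, so only scores below 27 would be flagged in that column.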
sns.pairplot(df, corner=True)
Multivariate analysis is the analysis of more than one variable.
# Correlation is only defined for the numeric columns
df[numeric_features].corr()
| | math score | reading score | writing score |
|---|---|---|---|
| math score | 1.000000 | 0.817580 | 0.802642 |
| reading score | 0.817580 | 1.000000 | 0.954598 |
| writing score | 0.802642 | 0.954598 | 1.000000 |
plt.figure(figsize = (15,10))
sns.heatmap(df[numeric_features].corr(), cmap="CMRmap", annot=True)
plt.show()
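The coefficients produced by `corr()` are Pearson correlations by default. As a sketch of what is being computed, the helper below evaluates the Pearson formula directly; `pearson_r` is a hypothetical name, and the two short lists are the reading and writing scores from the five rows shown by `df.head()`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Reading and writing scores of the first five students.
reading = [72, 90, 95, 57, 78]
writing = [74, 88, 93, 44, 75]
r = pearson_r(reading, writing)
print(round(r, 3))
```

A value close to +1 means the two scores rise together almost linearly, which is the pattern the full table shows for reading and writing.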
The data types and column names were correct, and the dataset has 1000 rows and 8 columns.
There are outliers in the math, reading and writing scores.