EDA_students_performance_in_exams.md

This notebook was prepared by Shreyas Kumbhar. Source and license info is on GitHub.

1) Problem statement

Analyze the Students' Performance dataset to explore the factors that affect the marks scored by students.

2) Data collection

The dataset is collected from https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
The data consists of 8 columns and 1000 rows.

3) Attribute information

  1. Most of the variables are categorical.
  2. I am not sure why ‘lunch’ would affect the scores.
  3. I think the parents' education level affects the student's performance to some extent.
  4. Columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, writing score.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"F:\Dataset\StudentsPerformance.csv")
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |

df.tail()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |
| 999 | female | group D | some college | free/reduced | none | 77 | 86 | 86 |

Observation:

  1. Most of the variables are categorical.
  2. I am not sure why ‘lunch’ would affect the scores.
  3. I think the parents' education level affects the student's performance to some extent.
  4. Columns: gender, race/ethnicity, parental level of education, lunch, test preparation course, math score, reading score, writing score.
df.shape
(1000, 8)
df.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
df.dtypes
gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object
# numeric column 
numeric_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
numeric_features
['math score', 'reading score', 'writing score']
# categorical column
categorical_features = [feature for feature in df.columns if df[feature].dtypes == 'O']
categorical_features
['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course']
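
As an alternative to comparing each column's dtype against 'O', pandas can split the columns by dtype family with `select_dtypes`. The sketch below uses a small hypothetical frame (`sample`, not rows from the CSV) so that it runs on its own; with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical miniature of the dataset, only for illustration.
sample = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
    "reading score": [72, 57, 95],
})

# Numeric columns (ints and floats) versus everything else.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

This avoids relying on the 'O' dtype code, which may be convenient if the string columns are ever stored with a dedicated string or category dtype.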
df.memory_usage()
Index                           128
gender                         8000
race/ethnicity                 8000
parental level of education    8000
lunch                          8000
test preparation course        8000
math score                     8000
reading score                  8000
writing score                  8000
dtype: int64
df.isnull().sum()
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
df.duplicated().sum()
0
df.describe().T
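
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| math score | 1000.0 | 66.089 | 15.163080 | 0.0 | 57.00 | 66.0 | 77.0 | 100.0 |
| reading score | 1000.0 | 69.169 | 14.600192 | 17.0 | 59.00 | 70.0 | 79.0 | 100.0 |
| writing score | 1000.0 | 68.054 | 15.195657 | 10.0 | 57.75 | 69.0 | 79.0 | 100.0 |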

Summary:

  1. Only three columns hold numerical data.
  2. The mean score in each of the three subjects lies between 65 and 70.
  3. Reading has the highest mean score (69.17), so on average students did best in reading.
  4. At least one student scored zero in math (the minimum math score is 0).
  5. In each subject, the highest mark obtained is 100.
  6. Math has the lowest mean score (66.09), so it seems to be the weakest subject for these students.
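
The points about the mean scores can be checked directly rather than read off the table. This is a minimal sketch that copies the three means from the `df.describe().T` output into a Series; the variable names are illustrative only.

```python
import pandas as pd

# Means copied from the df.describe().T output above.
means = pd.Series({
    "math score": 66.089,
    "reading score": 69.169,
    "writing score": 68.054,
})

best_subject = means.idxmax()    # subject with the highest mean
worst_subject = means.idxmin()   # subject with the lowest mean
all_in_band = means.between(65, 70).all()  # is every mean within 65 to 70?

print(best_subject, worst_subject, all_in_band)
```

On the full data, the same checks could be run on `df[numeric_features].mean()` instead of the hand-copied values.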
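# numeric_only=True skips the categorical columns (pandas 2.0+ raises a TypeError without it)
df.skew(numeric_only=True)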
math score      -0.278935
reading score   -0.259105
writing score   -0.289444
dtype: float64
# If skewness is less than -1 or greater than +1, the distribution is highly skewed.
# If skewness is between -1 and -½ or between +½ and +1, the distribution is moderately skewed.
# If skewness is between -½ and +½, the distribution is approximately symmetric.
# A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis ≈3
# (excess ≈0) is called mesokurtic.
# A distribution with kurtosis <3 (excess kurtosis <0 ) is called platykurtic. Compared to a normal distribution,
# its tails are shorter and thinner, and often its central peak is lower and broader.
# A distribution with kurtosis >3 (excess kurtosis >0) is called leptokurtic. Compared to a normal distribution,
# its tails are longer and fatter, and often its central peak is higher and sharper.
# pandas reports excess kurtosis (Fisher's definition), so compare the values below with 0, not 3.
df.kurtosis(numeric_only=True)
math score       0.274964
reading score   -0.068265
writing score   -0.033365
dtype: float64
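
The skewness rule of thumb quoted in the comments can also be applied programmatically. The helper `skew_label` below is a hypothetical name introduced for this sketch, and the skewness values are copied from the `df.skew()` output rather than recomputed.

```python
import pandas as pd

def skew_label(value: float) -> str:
    """Map a skewness value to the rule-of-thumb categories."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Skewness values copied from the df.skew() output above.
skewness = pd.Series({
    "math score": -0.278935,
    "reading score": -0.259105,
    "writing score": -0.289444,
})

labels = skewness.apply(skew_label)
print(labels)
```

All three columns fall inside the ±½ band, which matches reading them as approximately symmetric with a slight left skew.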

Univariate Analysis

The term univariate analysis refers to the analysis of one variable; the prefix “uni” means “one.” The purpose of univariate analysis is to understand the distribution of values for a single variable.

# Univariate analysis: KDE plot of each numeric feature

fig = plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', y=1.)
for i, feature in enumerate(numeric_features):
    plt.subplot(3, 3, i + 1)
    # fill=True replaces the deprecated shade argument
    sns.kdeplot(x=df[feature], fill=True, color='b')
    plt.xlabel(feature)
plt.tight_layout()

[Figure: Univariate Analysis of Numerical Features (KDE plots)]

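Report

  1. math score, reading score, and writing score roughly follow a normal distribution with a slight left skew.
  2. With skewness between -½ and +½ in all three columns, the data can be treated as approximately normally distributed.
  3. Where needed, we can try different techniques to bring the data closer to normal.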
# Count plot of each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Univariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.countplot(x=df[feature])
    plt.xlabel(feature)
    plt.xticks(rotation=45)
plt.tight_layout()

[Figure: Univariate Analysis of Categorical Features (count plots)]

Bivariate Analysis of Categorical Features vs math score

# Box plot of math score for each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[0], data=df)
    plt.xlabel(feature)

[Figure: box plots of math score by each categorical feature]

Observations on the math score vs categorical features plots:

  1. In math, male students scored higher on average than female students. The one student who scored zero in math is female.
  2. In math, group E performed the best and group A the worst. The student who scored zero belongs to group C.
  3. In all three subjects, the highest-scoring students have parents with a master's degree.
  4. On average, students who completed the test preparation course scored higher than those who did not.
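
Group-level statements like the ones above can be backed by numbers with a `groupby`. The sketch below uses a few hypothetical toy rows (`toy`, not real records from the dataset) so that it runs on its own; on the real data the same loop would run over `categorical_features` on `df`.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as the dataset.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [60, 70, 66, 78],
})

# Mean math score per level of each categorical feature, highest first.
for col in ["gender", "test preparation course"]:
    means = toy.groupby(col)["math score"].mean().sort_values(ascending=False)
    print(means, end="\n\n")

by_gender = toy.groupby("gender")["math score"].mean()
by_course = toy.groupby("test preparation course")["math score"].mean()
```

Comparing the medians as well (`.median()` instead of `.mean()`) would line up more closely with what the box plots display.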

Bivariate Analysis of Categorical Features vs reading score

# Box plot of reading score for each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[1], data=df)
    plt.xlabel(feature)

[Figure: box plots of reading score by each categorical feature]

Observations on the reading score vs categorical features plots:

  1. In reading, female students did better than male students. However, the student with the lowest score is also female.
  2. In reading, group E performed the best and group A the worst. The student with the lowest score belongs to group C.
  3. Students whose parents have a bachelor’s degree also performed consistently well.
  4. The lowest reading scores belong to students who did not complete the test preparation course.

Bivariate Analysis of Categorical Features vs writing score

# Box plot of writing score for each categorical feature
plt.figure(figsize=(20, 15))
plt.suptitle('Bivariate Analysis of Categorical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
for i, feature in enumerate(categorical_features):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=feature, y=numeric_features[2], data=df)
    plt.xlabel(feature)

[Figure: box plots of writing score by each categorical feature]

Observations on the writing score vs categorical features plots:

  1. In writing, female students performed better than male students. Again, the student with the lowest score is female.
  2. In writing, groups C, D, and E did well. Group A performed the worst, and the lowest score comes from group C.
  3. Students whose parents have a bachelor’s degree also performed consistently well.
  4. The lowest writing scores belong to students who did not complete the test preparation course.
# Box plot of each numeric feature to check for outliers
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features(Checking Outliers)', fontsize=20, fontweight='bold', alpha=0.8, y=1.)

for i, feature in enumerate(numeric_features):
    plt.subplot(5, 3, i + 1)
    sns.boxplot(x=df[feature], color='r')
    plt.xlabel(feature)
plt.tight_layout()

[Figure: box plots of the numeric features (checking outliers)]

Report

  1. Outliers are present in all numeric features.
  2. The distributions are almost normal.
  3. We will try to remove outliers using the IQR method.
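
The notebook does not show the IQR step itself, so the following is only a sketch of the usual rule: values below Q1 - 1.5·IQR or above Q3 + 1.5·IQR are flagged as outliers. The helper `iqr_bounds`, the multiplier default `k=1.5`, and the `scores` values are assumptions made for illustration, not data from the CSV.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical scores, chosen so that one value falls outside the fences.
scores = pd.Series([0, 57, 60, 66, 70, 77, 80, 100], name="math score")

lower, upper = iqr_bounds(scores)
outliers = scores[(scores < lower) | (scores > upper)]
filtered = scores[scores.between(lower, upper)]

print(lower, upper)
print(outliers.tolist())
```

Applied to the quartiles reported by `df.describe()` for math score (57 and 77, so IQR = 20), the same rule gives fences at 27 and 107, so only scores below 27 would be flagged. Whether to drop such rows is a separate decision, since a very low score can be a genuine result rather than an error.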
sns.pairplot(df, corner=True)
<seaborn.axisgrid.PairGrid at 0x243e3eaa6a0>

[Figure: pair plot of the numeric features]

Multivariate Analysis

Multivariate analysis is the analysis of more than one variable.

Check Multicollinearity in Numerical features

# Correlation is only defined for the numeric columns
df[numeric_features].corr()

| | math score | reading score | writing score |
|---|---|---|---|
| math score | 1.000000 | 0.817580 | 0.802642 |
| reading score | 0.817580 | 1.000000 | 0.954598 |
| writing score | 0.802642 | 0.954598 | 1.000000 |

plt.figure(figsize = (15,10))
sns.heatmap(df.corr(numeric_only=True), cmap="CMRmap", annot=True)
plt.show()

[Figure: correlation heatmap of the numeric features]
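
To turn the multicollinearity check into an explicit list, the correlation matrix can be scanned for pairs above a cut-off. The sketch below rebuilds the matrix from the values printed above; the `threshold` of 0.9 is an assumed cut-off, not one stated in the notebook.

```python
from itertools import combinations

import pandas as pd

cols = ["math score", "reading score", "writing score"]

# Correlation values copied from the df.corr() output above.
corr = pd.DataFrame(
    [[1.000000, 0.817580, 0.802642],
     [0.817580, 1.000000, 0.954598],
     [0.802642, 0.954598, 1.000000]],
    index=cols,
    columns=cols,
)

threshold = 0.9  # assumed cut-off for "highly correlated"

# Visit each unordered pair once and keep those above the threshold.
high_pairs = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(cols, 2)
    if abs(corr.loc[a, b]) > threshold
]

print(high_pairs)
```

With this cut-off only reading score and writing score are flagged, although all three pairs are strongly correlated (above 0.8).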

Final Report

The data types and column names were correct, and there were 1000 rows and 8 columns.
There are outliers in math score, reading score, and writing score.