# ANALYSIS OF STUDENTS' PERFORMANCE IN EXAMS

![](https://lh3.googleusercontent.com/AL2RmhBDzstx1TSb6Bh-ahOHs308MTdQ6CDRVr9noAp5TYVyUHt9pWQbg0-v03pNp6qD6_aZzTvMOF1VxYORZFpnP5PnHgyh3SWUxw=w1440-v1)

In this notebook, we set out to analyze how different covariates affect a student's exam performance, and then look for insights into how these covariates cluster.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#Importing some necessary packages
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Patch
sns.set_style('darkgrid')

In [None]:
# reading the csv file
df = pd.read_csv('/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
df.head()

In [None]:
#computing the mean score of the test and assigning it to a new column
df['mean test score'] = df[['math score','reading score','writing score']].mean(axis = 1)

In [None]:
#Exploring descriptive statistics of each column
df.describe(include = 'all')

## Exploring distribution of scores for each gender

In [None]:
#splitting the dataset to compare males with females
female_df = df[df.gender == 'female']
male_df = df[df.gender == 'male']

In [None]:
plt.figure(figsize = (10,8))
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws the same histogram
sns.histplot(female_df['mean test score'], label = 'Females')
sns.histplot(male_df['mean test score'], label = 'Males')
plt.title("Comparison of males' vs females' mean test scores", size = 16, weight = 'bold')
plt.legend()

print('Females mean test scores (avg): {:.1f}'.format(female_df['mean test score'].mean()))
print('Males mean test scores (avg): {:.1f}'.format(male_df['mean test score'].mean()))

We can see that, on average, female students appear to score higher than male students on this particular exam, but does that hold for every subject?
We'll now dive a bit deeper and explore the math, reading and writing scores separately.

In [None]:
#comparing math scores between males and females
plt.figure(figsize = (10,8))
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws the same histogram
sns.histplot(female_df['math score'], label = 'Females')
sns.histplot(male_df['math score'], label = 'Males')
plt.title("Comparison of males' vs females' math scores", size = 16, weight = 'bold')
plt.legend()

print('Females mean math score: {:.1f}'.format(female_df['math score'].mean()))
print('Males mean math score: {:.1f}'.format(male_df['math score'].mean()))

In [None]:
#Comparing reading scores between males and females
plt.figure(figsize = (10,8))
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws the same histogram
sns.histplot(female_df['reading score'], label = 'Females')
sns.histplot(male_df['reading score'], label = 'Males')
plt.title("Comparison of males' vs females' reading scores", size = 16, weight = 'bold')
plt.legend()

print('Females mean reading score: {:.1f}'.format(female_df['reading score'].mean()))
print('Males mean reading score: {:.1f}'.format(male_df['reading score'].mean()))

In [None]:
#Comparing writing scores between males and females
plt.figure(figsize = (10,8))
# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) draws the same histogram
sns.histplot(female_df['writing score'], label = 'Females')
sns.histplot(male_df['writing score'], label = 'Males')
plt.title("Comparison of males' vs females' writing scores", size = 16, weight = 'bold')
plt.legend()

print('Females mean writing score: {:.1f}'.format(female_df['writing score'].mean()))
print('Males mean writing score: {:.1f}'.format(male_df['writing score'].mean()))

We see that males, on average, outperform females only in the math test; in reading and writing, females tend to perform better.
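
A difference in averages read off a histogram can be checked with a two-sample test. The sketch below is not part of the original analysis: it runs Welch's t-test with `scipy.stats.ttest_ind` on synthetic scores, and the group means, spread and sample sizes are made-up assumptions. The helper name `welch_test` is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two score columns (NOT the notebook's data).
# The means (72 and 65), spread (14) and sizes (500) are assumptions.
rng = np.random.default_rng(0)
female_scores = rng.normal(loc=72, scale=14, size=500)
male_scores = rng.normal(loc=65, scale=14, size=500)

def welch_test(a, b):
    """Return Welch's t statistic and two-sided p-value for two samples."""
    result = stats.ttest_ind(a, b, equal_var=False)  # equal_var=False -> Welch
    return float(result.statistic), float(result.pvalue)

t_stat, p_value = welch_test(female_scores, male_scores)
print("t = {:.2f}, p = {:.3g}".format(t_stat, p_value))
```

On the notebook's data, the same helper could be called as `welch_test(female_df['reading score'], male_df['reading score'])`, and likewise for the other score columns.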

## Exploring other variables and assessing their impact on mean test score.

In [None]:
#plotting test preparation course status: females vs males
plt.figure(figsize = (10,8))
sns.barplot(data = df.groupby(['gender','test preparation course'], as_index = False)['lunch'].\
count().rename(columns = {'lunch':'count'}),x = 'test preparation course', y= 'count', hue = 'gender')
plt.title("Test preparation course status:\ncomparison between genders", size = 16, weight = 'bold')

print("Percentage of females that took the test preparation course: {:.1f}%".\
      format(100*len(female_df[female_df['test preparation course']=='completed'])/len(female_df)))

print("Percentage of males that took the test preparation course: {:.1f}%".\
      format(100*len(male_df[male_df['test preparation course']=='completed'])/len(male_df)))

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = df, x = 'gender', y = 'mean test score', hue = 'test preparation course')
plt.title("Test preparation course status:\nimpact on mean test score", size = 16, weight = 'bold')

We see that the status of the preparation course appears to affect the mean test score, and that a similar share of males and females took the course.
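
To put numbers next to the box plot, the per-group count, mean and median can be tabulated with a `groupby`. This is a minimal sketch on a tiny made-up frame that only borrows the notebook's column names; the values are hypothetical.

```python
import pandas as pd

# Tiny hypothetical frame using the notebook's column names (values are made up)
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male', 'female', 'male'],
    'test preparation course': ['completed', 'none', 'completed',
                                'none', 'completed', 'none'],
    'mean test score': [80.0, 70.0, 75.0, 60.0, 90.0, 64.0],
})

# One row per (gender, course status) pair with its size and central tendency
summary = (toy.groupby(['gender', 'test preparation course'])['mean test score']
              .agg(['count', 'mean', 'median']))
print(summary)
```

Replacing `toy` with `df` would give the same table for the real dataset.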

In [None]:
plt.figure(figsize = (10,8))
sns.barplot(data = df.groupby(['gender','lunch'], as_index = False)['test preparation course'].\
count().rename(columns = {'test preparation course':'count'}),x = 'lunch', y= 'count', hue = 'gender')
plt.title("Lunch status:\ncomparison between genders", size = 16, weight = 'bold')

print("Percentage of females with standard lunch: {:.1f}%".\
      format(100*len(female_df[female_df['lunch']=='standard'])/len(female_df)))

print("Percentage of males with standard lunch: {:.1f}%".\
      format(100*len(male_df[male_df['lunch']=='standard'])/len(male_df)))

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = df, x = 'gender', y = 'mean test score', hue = 'lunch')
plt.title("Lunch status:\nimpact on mean test score", size = 16, weight = 'bold')

We see that the lunch regime also seems to affect student performance.

In [None]:
plt.figure(figsize = (10,8))
sns.barplot(data = df.groupby(['gender','parental level of education'], as_index = False)['lunch'].\
count().rename(columns = {'lunch':'count'}),x = 'parental level of education', y= 'count', hue = 'gender')
plt.title("Parental level of education:\ncomparison between genders", size = 16, weight = 'bold')

# Percentage of each gender whose parents reached each education level
for level in sorted(df['parental level of education'].unique()):
    for label, sub_df in (('females', female_df), ('males', male_df)):
        pct = 100 * (sub_df['parental level of education'] == level).mean()
        print("Percentage of {} with parents at level '{}': {:.1f}%".format(label, level, pct))
    print()

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = df, x = 'gender', y = 'mean test score', hue = 'parental level of education')
plt.title("Parental level of education:\nimpact on mean test score", size = 16, weight = 'bold')

We don't see any clear impact of parental level of education in these box plots, although no statistical test was run to confirm it.
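
Whether several groups share the same mean is usually checked with a one-way ANOVA rather than by eye. The sketch below uses `scipy.stats.f_oneway` on three synthetic groups drawn from the same distribution; the group sizes and parameters are assumptions and not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Three synthetic groups with the same true mean (hypothetical values)
rng = np.random.default_rng(1)
groups = [rng.normal(loc=68, scale=14, size=200) for _ in range(3)]

# One-way ANOVA: the null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(*groups)
print("F = {:.2f}, p = {:.3f}".format(f_stat, p_value))
```

On the real data, the groups could be built with `[g['mean test score'].values for _, g in df.groupby('parental level of education')]`. A small p-value would suggest that at least one education level differs from the others.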

In [None]:
plt.figure(figsize = (10,8))
sns.barplot(data = df.groupby(['gender','race/ethnicity'], as_index = False)['lunch'].\
count().rename(columns = {'lunch':'count'}),x = 'race/ethnicity', y= 'count', hue = 'gender')
plt.title("Race/ethnicity:\ncomparison between genders", size = 16, weight = 'bold')

In [None]:
plt.figure(figsize = (10,8))
sns.boxplot(data = df, x = 'gender', y = 'mean test score', hue = 'race/ethnicity')
plt.title("Race/ethnicity:\nimpact on mean test score", size = 16, weight = 'bold')

Ethnicity doesn't seem to have a strong impact on mean test scores: the group E median is slightly above the other groups and group A has the lowest median, but their ranges tend to overlap.

## Clustering.

Now we'll cluster the students to find groups of individuals who share certain features, and assess the impact of each feature while accounting for the others.
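
One caveat when picking the number of clusters: `KMeans.score` is the negative of the k-means objective, so a search that maximizes it tends to favor the largest `n_clusters` on offer. A common alternative, sketched here as an assumption rather than as this notebook's method, is to compare silhouette scores across candidate values of `k`. The toy frame below is random and only reuses three of the notebook's column names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Random categorical data standing in for the real features (hypothetical)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gender': rng.choice(['female', 'male'], size=300),
    'lunch': rng.choice(['standard', 'free/reduced'], size=300),
    'test preparation course': rng.choice(['none', 'completed'], size=300),
})
X_toy = pd.get_dummies(toy, dtype=float)  # one-hot encode, as in the notebook

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1234).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Unlike inertia, the silhouette score does not improve automatically as clusters are added, so it is a more balanced criterion for comparing different values of `k`.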

In [None]:
#Importing a model to detect clusters in the data
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

X = df[["gender","race/ethnicity","parental level of education","lunch","test preparation course"]]
X = pd.get_dummies(X)

dict_params = {'n_clusters':[8,9,10,11,12,13,14,15,16,17,18],
               'max_iter':[25,50,100,150],
               'algorithm':['lloyd','elkan'], # newer scikit-learn releases no longer accept 'full' and 'auto'
               'random_state':[1234]}
model = KMeans()

search = GridSearchCV(estimator = model,param_grid = dict_params,cv = 5,n_jobs = -1).fit(X)

In [None]:
model1 = search.best_estimator_
y=model1.predict(X)
print(search.best_params_, search.best_score_)

In [None]:
#Creating a new column to assign the labels of clusters
# y holds one cluster label per row, in the same order as df
df['clusters'] = y

In [None]:
#Comparing the different clusters mean test scores

plt.figure(figsize = (10,8))
sns.violinplot(data = df,x = 'clusters', y = 'mean test score')
plt.title("Comparison of clusters' mean test scores and\nstandard deviations", size = 16, weight = 'bold')

# Loop over the clusters actually present instead of a hard-coded count
for i in sorted(df['clusters'].unique()):
    cluster_scores = df[df.clusters == i]['mean test score']
    print('Cluster {} mean test score (avg): {:.2f}, and standard deviation: {:.2f}'.format(
        int(i), cluster_scores.mean(), cluster_scores.std()))

We now pick the worst and the best cluster: cluster 0 as the worst and cluster 1 as the best. We used the standard deviation as an additional criterion because we want clusters whose scores sit consistently around their mean. That is why we didn't pick cluster 17: even though it has the lowest mean score, its scores are spread so widely that its distribution overlaps clusters with better means.
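
The selection rule described above (rank by mean, but skip clusters that are too spread out) can be written down explicitly. This is a minimal sketch on made-up numbers; the threshold `max_std` is a hypothetical parameter, not a value used in the notebook.

```python
import pandas as pd

# Toy clusters: cluster 2 has the lowest mean but a very wide spread
toy = pd.DataFrame({
    'clusters': [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'mean test score': [50.0, 52.0, 54.0, 80.0, 82.0, 84.0, 30.0, 50.0, 70.0],
})

# Mean, standard deviation and size of each cluster
stats_by_cluster = (toy.groupby('clusters')['mean test score']
                       .agg(['mean', 'std', 'count']))

max_std = 10.0  # hypothetical cut-off for "consistent" clusters
consistent = stats_by_cluster[stats_by_cluster['std'] <= max_std]

worst_cluster = consistent['mean'].idxmin()
best_cluster = consistent['mean'].idxmax()
print(stats_by_cluster)
print(worst_cluster, best_cluster)
```

In this toy data, cluster 2 is filtered out by the spread criterion, so the worst and best picks come from the two tight clusters.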

In [None]:
#Counting, for clusters 0 and 1, how often each category value appears
#(the lists are later used to create small dataframes)
variables = ["gender","race/ethnicity","parental level of education","lunch","test preparation course"]
var0, quant0 = [], []
var1, quant1 = [], []

for k, (var, quant) in enumerate([(var0, quant0), (var1, quant1)]):
    for i in variables:
        for j in df[df.clusters==k][i]:
            if j not in var:
                var.append(j)
                quant.append(1)
            else:
                quant[var.index(j)]+=1

In [None]:
#Creating dataframes according to clusters to plot their features for comparison
s0 = pd.Series(var0, quant0)
s0 = s0.reset_index(drop = False).rename(columns = {'index':'count',0:'variables'})

s1 = pd.Series(var1, quant1)
s1 = s1.reset_index(drop = False).rename(columns = {'index':'count',0:'variables'})

plt.figure(figsize = (18,11))

ax0 = plt.subplot(1,2,1)
bar0=s0.plot(kind = 'bar', x = 'variables', y = 'count', legend = False, ax = ax0, color = ['DarkRed','DarkSlateBlue','DarkSlateBlue','GoldenRod','GoldenRod','GoldenRod','GoldenRod',
                                                                                            'GoldenRod','GoldenRod','DarkGreen','DarkMagenta'])
plt.legend(handles = [Patch(facecolor = 'DarkRed',label = 'gender'),
                      Patch(facecolor = 'DarkSlateBlue', label = 'ethnicity'),
                      Patch(facecolor = 'GoldenRod', label = 'parents educ. level'),
                      Patch(facecolor = 'DarkGreen',label = 'lunch'),
                      Patch(facecolor = 'DarkMagenta', label = 'test prep.course')])

# The label is passed positionally: the `s` keyword was renamed to `text` in Matplotlib 3.3
for k,i in enumerate(s0['count']):
    plt.annotate(str(round(100*(i/s0['count'].max())))+'%', xy = (k-0.3,i+0.1))
    
plt.title("Cluster 0, mean test score: {:.2f}".format(df[y==0]['mean test score'].mean()),size = 16, weight = 'bold')

ax1 = plt.subplot(1,2,2)
bar1=s1.plot(kind = 'bar', x = 'variables', y = 'count', legend = False, ax = ax1, color = ['DarkRed','DarkSlateBlue','DarkSlateBlue','DarkSlateBlue','DarkSlateBlue','GoldenRod','GoldenRod','GoldenRod',
                                                                                            'GoldenRod','GoldenRod','GoldenRod','DarkGreen','DarkMagenta'])

plt.legend(handles = [Patch(facecolor = 'DarkRed',label = 'gender'),
                      Patch(facecolor = 'DarkSlateBlue', label = 'ethnicity'),
                      Patch(facecolor = 'GoldenRod', label = 'parents educ. level'),
                      Patch(facecolor = 'DarkGreen',label = 'lunch'),
                      Patch(facecolor = 'DarkMagenta', label = 'test prep.course')])

# The label is passed positionally: the `s` keyword was renamed to `text` in Matplotlib 3.3
for k,i in enumerate(s1['count']):
    plt.annotate(str(round(100*(i/s1['count'].max())))+'%', xy = (k-0.3,i+0.1))

plt.title("Cluster 1, mean test score: {:.2f}".format(df[y==1]['mean test score'].mean()),size = 16, weight = 'bold')

It is interesting to see that, once the students are clustered on all five variables, the worst performing group is made up entirely of males, while the best performing group is made up entirely of females. Group C is predominant in the worst performing group and doesn't appear in the best performing group. Lunch regime also separates the two: 100% free/reduced in the worst performing group and 100% standard in the best performing one. Likewise, nobody in the worst performing group took the preparation course, while the opposite holds in the best performing group.
Perhaps easier access to a better lunch regime and to the preparation course could support better performance?
There may be an economic reason behind the performance gap (group C being prevalent in the worst group and absent from the best group is something to think about). Further research on this could yield valuable information.
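
The composition percentages discussed above can also be read from a normalized cross-tabulation instead of the bar annotations. The sketch below uses a small made-up frame with the notebook's column names; the helper `composition` is hypothetical.

```python
import pandas as pd

# Made-up rows for two clusters (values are illustrative only)
toy = pd.DataFrame({
    'clusters': [0, 0, 0, 0, 1, 1, 1, 1],
    'gender': ['male'] * 4 + ['female'] * 4,
    'lunch': ['free/reduced'] * 4 + ['standard'] * 4,
    'race/ethnicity': ['group C', 'group C', 'group C', 'group B',
                       'group D', 'group E', 'group D', 'group B'],
})

def composition(frame, column):
    """Percentage of each category of `column` within each cluster."""
    return 100 * pd.crosstab(frame['clusters'], frame[column], normalize='index')

ethnicity_pct = composition(toy, 'race/ethnicity')
gender_pct = composition(toy, 'gender')
print(ethnicity_pct)
print(gender_pct)
```

Each row sums to 100, so a value such as `ethnicity_pct.loc[0, 'group C']` is the share of cluster 0 that belongs to group C.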