# Students Performance in Exams EDA + Neural Network

In this notebook, I aim to analyze how different features are correlated with each other and how they are correlated with math, reading, and writing scores. If you like this notebook, please give it an <span style="color:green"> upvote! </span> I would greatly appreciate it.

<h2> Table of Contents </h2>
<ol>
    <li> Reading the Data </li>
    <li> Data Preprocessing </li>
    <li> Feature Correlation </li>
    <li> Score Correlation </li>
    <li> Features and Scores </li>
    <li> Random Forest Permutation Importance </li>
    <li> Basic Neural Network Implementation </li>
</ol>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from scipy import stats

In [None]:
%time df = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
df.head()

# What features are there that we can use?

In [None]:
for label, content in df.items():
    print("%s:" % label, content.unique())

# How many students fall into each category?

In [None]:
len(df)

In [None]:
df.groupby(['gender']).size()

In [None]:
df.groupby(['race/ethnicity']).size()

In [None]:
df.groupby(['parental level of education']).size()

In [None]:
df.groupby(['lunch']).size()

In [None]:
df.groupby(['test preparation course']).size()

# Generate numerical data for easier computation

0 - male

1 - female

In [None]:
new_df = pd.DataFrame({})
gender_list = []
gender_labels = ['male', 'female']
for i in range(len(df)):
    gender_list.append(gender_labels.index(df['gender'][i]))

new_df['gender'] = gender_list
new_df.head()

0 - Group A

1 - Group B

2 - Group C

3 - Group D

4 - Group E

In [None]:
race_list = []
race_labels = ['group A', 'group B', 'group C', 'group D', 'group E']
for i in range(len(df)):
    race_list.append(race_labels.index(df['race/ethnicity'][i]))
        
new_df['race'] = race_list
new_df.head()

0 - some high school

1 - high school

2 - some college

3 - associates

4 - bachelors

5 - masters

In [None]:
college_list = []
college_labels = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]
for i in range(len(df)):
    college_list.append(college_labels.index(df['parental level of education'][i]))
        
new_df['parental_college'] = college_list
new_df.head()

0 - free/reduced

1 - standard

In [None]:
lunch_list = []
lunch_labels = ['free/reduced', 'standard']
for i in range(len(df)):
    lunch_list.append(lunch_labels.index(df['lunch'][i]))
    
new_df['lunch'] = lunch_list
new_df.head()

0 - no test prep

1 - completed test prep

In [None]:
prep_list = []
prep_labels = ['none', 'completed']
for i in range(len(df)):
    prep_list.append(prep_labels.index(df['test preparation course'][i]))

new_df['prep'] = prep_list
new_df.head()
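
The index-lookup loops above can also be written as a single mapping step per column. The following is a minimal sketch, assuming the same label orders as the cells above; the `toy` frame and the `encode_columns` helper are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy rows standing in for the real CSV (made-up values, same column names).
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "parental level of education": ["some college", "master's degree", "high school"],
})

# Ordered label lists, matching the ones defined in the cells above.
label_orders = {
    "gender": ["male", "female"],
    "lunch": ["free/reduced", "standard"],
    "parental level of education": [
        "some high school", "high school", "some college",
        "associate's degree", "bachelor's degree", "master's degree",
    ],
}

def encode_columns(frame, orders):
    """Hypothetical helper: map each label to its position in its ordered list."""
    encoded = pd.DataFrame(index=frame.index)
    for column, labels in orders.items():
        lookup = {label: code for code, label in enumerate(labels)}
        encoded[column] = frame[column].map(lookup)
    return encoded

encoded_toy = encode_columns(toy, label_orders)
print(encoded_toy)
```

A label that is missing from the list becomes `NaN` here instead of raising an error, so it is worth checking `encoded_toy.isna().sum()` on real data.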

# Correlation Within Features

In [None]:
college_race_list = [[0 for j in range(len(race_labels))] for i in range(len(college_labels))]
for i in range(len(new_df)):
    college_race_list[new_df['parental_college'][i]][new_df['race'][i]] += 1
    
college_race_df = pd.DataFrame(college_race_list)

In [None]:
college_race_df.plot.bar(rot=0, figsize=(10, 8))
plt.legend(race_labels)
plt.xticks(np.arange(0, len(college_labels)), labels=college_labels, rotation=0)
plt.title("Parental Level of Education vs Race/Ethnicity", fontsize=20)
plt.show()

In [None]:
df.groupby(['race/ethnicity', 'parental level of education']).size()

Across the board, group C looks like the largest group, so perhaps Caucasian/white? Group D has an unusually high number of parents with master's degrees, so perhaps Asian? And group A has the lowest number of degree holders, so perhaps Hispanic/Latinx or Black? These are only guesses based on raw counts and on general patterns in U.S. education; the dataset itself does not say what the groups stand for.
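
Because the groups differ in size, raw counts favor the larger groups. A row-normalized cross-tabulation makes them comparable. This is a small sketch on made-up rows (the `toy` frame is an assumption, not the real data), using `pd.crosstab`:

```python
import pandas as pd

# Made-up rows with the same column names as the dataset.
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group C", "group C", "group C", "group C"],
    "parental level of education": [
        "high school", "master's degree",
        "high school", "high school", "some college", "master's degree",
    ],
})

# Raw counts per (group, education level) pair.
counts = pd.crosstab(toy["race/ethnicity"], toy["parental level of education"])

# Share of each education level within a group (each row sums to 1).
shares = pd.crosstab(
    toy["race/ethnicity"],
    toy["parental level of education"],
    normalize="index",
)

print(counts)
print(shares.round(2))
```

In this toy example both groups have one master's degree each, yet the share is 0.5 in group A and 0.25 in group C, which is the kind of difference raw counts hide.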

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="race", y="prep", data=new_df)
plt.xticks(np.arange(0, len(race_labels)), labels=race_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students Who Completed Prep', fontsize=14)
plt.title("Race/Ethnicity vs Prep Course Completion", fontsize=20)
plt.show()

In [None]:
df.groupby(['race/ethnicity', 'test preparation course']).size()

Group E completes the prep course most often, whereas group D completes it least often (completion is only a rough proxy for how hard a group works). Not sure if this graph supports or undermines the guess that group D is Asian, as there's an argument for both sides: perhaps the parents would push their kids to complete the test prep course, or perhaps they would see no need for it.

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="parental_college", y="prep", data=new_df)
plt.xticks(np.arange(0, len(college_labels)), labels=college_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students Who Completed Prep', fontsize=14)
plt.title("Parental Education vs Prep Course Completion", fontsize=20)
plt.show()

In [None]:
df.groupby(['parental level of education', 'test preparation course']).size()

It seems like children of parents with only "some high school" are the most hardworking, as on average they complete the course most often. This may just be an anomaly, though, since children of parents with a "high school" education complete it least often.

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="gender", y="prep", data=new_df)
plt.xticks(np.arange(0, len(gender_labels)), labels=gender_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students Who Completed Prep', fontsize=14)
plt.title("Gender vs Prep Course Completion", fontsize=20)
plt.show()

In [None]:
df.groupby(['gender', 'test preparation course']).size()

Almost negligible difference here, with males completing the course only *slightly* more often.

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="race", y="lunch", data=new_df)
plt.xticks(np.arange(0, len(race_labels)), labels=race_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students with Standard Lunch', fontsize=14)
plt.title("Race/Ethnicity vs Lunch", fontsize=20)
plt.show()

In [None]:
df.groupby(['race/ethnicity', 'lunch']).size()

Group A receives free/reduced lunch most often, which supports the earlier guess that group A may be a minority group.

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="parental_college", y="lunch", data=new_df)
plt.xticks(np.arange(0, len(college_labels)), labels=college_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students with Standard Lunch', fontsize=14)
plt.title("Parental Education vs Lunch", fontsize=20)
plt.show()

In [None]:
df.groupby(['parental level of education', 'lunch']).size()

Although children of parents with master's degrees seem to receive free/reduced lunch the most, the error bar for that group is far too wide to conclude that these children are the poorest. Note that seaborn's bar plot draws a bootstrapped confidence interval by default, not a standard deviation.
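
To see why a small group produces a wide error bar, the standard error of a proportion can be computed by hand. This is a rough sketch on made-up data (group sizes and values in `toy` are assumptions), and the normal-approximation standard error shown here is related to, but not the same as, the bootstrapped interval seaborn draws:

```python
import numpy as np
import pandas as pd

# Made-up encoded data: education level 5 has 4 students, level 1 has 16.
toy = pd.DataFrame({
    "parental_college": [5] * 4 + [1] * 16,
    "lunch": [1, 1, 0, 0] + [1] * 12 + [0] * 4,  # 1 = standard, 0 = free/reduced
})

# Share of standard lunch and group size per education level.
summary = toy.groupby("parental_college")["lunch"].agg(share="mean", n="size")

# Standard error of a proportion: sqrt(p * (1 - p) / n).
summary["std_error"] = np.sqrt(summary["share"] * (1 - summary["share"]) / summary["n"])

print(summary)
```

The smaller group ends up with a standard error more than twice as large, so its bar height is much less trustworthy.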

In [None]:
plt.figure(figsize=(10, 8))
sns.barplot(x="lunch", y="prep", data=new_df)
plt.xticks(np.arange(0, len(lunch_labels)), labels=lunch_labels, rotation=0)
plt.xlabel('', fontsize=16)
plt.ylabel('Proportion of Students Who Completed Prep', fontsize=14)
plt.title("Lunch vs Prep Course Completion", fontsize=20)
plt.show()

In [None]:
df.groupby(['lunch', 'test preparation course']).size()

Children who receive standard lunch plans seem to complete the test prep course a bit less often.

# Distribution of and Correlation within Scores

In [None]:
def print_info(data):
    mean, std = np.mean(data), np.std(data)
    print("Mean of Scores: %.3f" % mean)
    print("Median of Scores: %.3f" % np.median(data))
    print("Standard Deviation of Scores: %.3f" % std)
    # Mean +/- 2 SD covers roughly 95% of a normal distribution;
    # it is a spread of the data, not a confidence interval.
    print("Mean +/- 2 SD range: %.3f to %.3f" % (mean - 2*std, mean + 2*std))
    quant1 = np.quantile(data, 0.025)
    quant2 = np.quantile(data, 0.975)
    print("95%% of data is between %.3f and %.3f" % (quant1, quant2))
    print("Skew of the distribution is %.3f" % (stats.skew(data)))
    # stats.kurtosis defaults to fisher=True, which already returns excess kurtosis.
    print("Kurtosis of the distribution is %.3f" % (stats.kurtosis(data, fisher=False)))
    print("Excess kurtosis of the distribution is %.3f" % (stats.kurtosis(data)))

In [None]:
print_info(df['reading score'])

plt.figure(figsize=(10, 8))
sns.histplot(df['reading score'], kde=True, stat='density', line_kws={'linewidth': 3})  # distplot is deprecated
plt.show()

In [None]:
print_info(df['writing score'])

plt.figure(figsize=(10, 8))
sns.histplot(df['writing score'], kde=True, stat='density', line_kws={'linewidth': 3})  # distplot is deprecated
plt.show()

In [None]:
print_info(df['math score'])

plt.figure(figsize=(10, 8))
sns.histplot(df['math score'], kde=True, stat='density', line_kws={'linewidth': 3})  # distplot is deprecated
plt.show()

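A distribution with negative excess kurtosis is platykurtic (its tails are lighter than a normal distribution's, so we don't expect many extreme scores). Because `scipy.stats.kurtosis` returns excess kurtosis by default, its output should be compared against 0, not 3, before calling any of the three score distributions platykurtic.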
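
The two kurtosis conventions are easy to mix up, so here is a quick check on synthetic samples (not the exam scores). It assumes only `scipy.stats.kurtosis` and its `fisher` parameter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=10_000)
uniform_sample = rng.uniform(size=10_000)

# fisher=True (the default) returns excess kurtosis: about 0 for a normal sample.
excess = stats.kurtosis(normal_sample)

# fisher=False returns Pearson kurtosis: about 3 for a normal sample.
pearson = stats.kurtosis(normal_sample, fisher=False)

# A uniform distribution has lighter tails than a normal one (platykurtic),
# so its excess kurtosis is negative.
uniform_excess = stats.kurtosis(uniform_sample)

print("normal, excess:  %.3f" % excess)
print("normal, Pearson: %.3f" % pearson)
print("uniform, excess: %.3f" % uniform_excess)
```

The Pearson value is always exactly 3 larger than the excess value, so subtracting 3 from the default output would count the correction twice.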

In [None]:
plt.figure(figsize=(10, 8))
# kdeplot replaces the deprecated distplot(hist=False); fill replaces shade
sns.kdeplot(df['reading score'], linewidth=3, fill=True)
sns.kdeplot(df['writing score'], linewidth=3, fill=True)
sns.kdeplot(df['math score'], linewidth=3, fill=True)
plt.legend(['reading score', 'writing score', 'math score'])
plt.xlabel('Score', fontsize=16)
plt.title("All Distributions Stacked on Each Other", fontsize=20)
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
plt.title("Reading Score vs Math Score, correlation: %.3f" % (pearsonr(df['math score'], df['reading score'])[0]), fontsize=20)
plt.scatter(df['reading score'], df['math score'])
plt.xlabel("Reading Score", fontsize=16)
plt.ylabel("Math Score", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
plt.title("Writing Score vs Math Score, correlation: %.3f" % (pearsonr(df['math score'], df['writing score'])[0]), fontsize=20)
plt.scatter(df['writing score'], df['math score'])
plt.xlabel("Writing Score", fontsize=16)
plt.ylabel("Math Score", fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
plt.title("Reading Score vs Writing Score, correlation: %.3f" % (pearsonr(df['writing score'], df['reading score'])[0]), fontsize=20)
plt.scatter(df['reading score'], df['writing score'])
plt.xlabel("Reading Score", fontsize=16)
plt.ylabel("Writing Score", fontsize=16)
plt.show()

Very impressive correlation between reading score and writing score!
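
All three pairwise correlations can also be read off one table with `DataFrame.corr`. This is a sketch on synthetic scores (the numbers below are generated, not taken from the dataset), with one entry cross-checked against `pearsonr`:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic scores: a shared "ability" term plus subject-specific noise.
rng = np.random.default_rng(0)
ability = rng.normal(65, 15, size=200)
toy = pd.DataFrame({
    "math score": ability + rng.normal(0, 10, size=200),
    "reading score": ability + rng.normal(0, 5, size=200),
    "writing score": ability + rng.normal(0, 5, size=200),
})

# Pairwise Pearson correlations in a single 3x3 table.
corr = toy.corr(method="pearson")
print(corr.round(3))

# Cross-check one entry against scipy's pearsonr.
r, _ = pearsonr(toy["reading score"], toy["writing score"])
print("reading vs writing: %.3f" % r)
```

On the real data, `df[['math score', 'reading score', 'writing score']].corr()` should give the same three numbers shown in the plot titles above.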

# Correlation Between Features and Scores

In [None]:
labels_list = ['math score', 'reading score', 'writing score']

fig, ax = plt.subplots(1, 3, figsize=(30, 8))

r = 0
for label in labels_list:
    df1 = df.sort_values("gender")
    sns.barplot(x="gender", y=label, data=df1, ax=ax[r])
    ax[r].set_xlabel('', fontsize=16)
    ax[r].set_ylabel('%ss' % label, fontsize=14)
    ax[r].set_title("Gender vs %ss"% label, fontsize=20)
    r += 1

fig.tight_layout()
plt.show()

On average, males score noticeably higher in math than females, but lower in both reading and writing.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(30, 8))

r = 0
for label in labels_list:
    df1 = df.sort_values("race/ethnicity")
    sns.barplot(x="race/ethnicity", y=label, data=df1, ax=ax[r])
    ax[r].set_xlabel('', fontsize=16)
    ax[r].set_ylabel('%ss' % label, fontsize=14)
    ax[r].set_title("Race/Ethnicity vs %ss"% label, fontsize=20)
    r += 1

fig.tight_layout()
plt.show()

Group E has the best scores across the board, perhaps due to their preparation (recall that group E also had the highest prep course completion rate). Group D does quite well for having the lowest prep course completion rate of all groups.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(30, 8))

r = 0
for label in labels_list:
    df1 = df.sort_values("parental level of education")
    sns.barplot(x="parental level of education", y=label, data=df1, ax=ax[r])
    ax[r].set_xlabel('', fontsize=16)
    ax[r].set_ylabel('%ss' % label, fontsize=14)
    ax[r].set_title("Parental Education vs %ss"% label, fontsize=20)
    r += 1

fig.tight_layout()
plt.show()

It seems like children of parents with a "high school" education do the worst, and children of parents with master's degrees do the best. This is consistent with an unsurprising pattern: parents with a higher degree will, on average, earn more money and be able to support their child's education with key resources (and of course, their own expertise). The plot alone cannot prove that this is the cause, though.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(30, 8))

r = 0
for label in labels_list:
    df1 = df.sort_values("lunch")
    sns.barplot(x="lunch", y=label, data=df1, ax=ax[r])
    ax[r].set_xlabel('', fontsize=16)
    ax[r].set_ylabel('%ss' % label, fontsize=14)
    ax[r].set_title("Lunch vs %ss"% label, fontsize=20)
    r += 1

fig.tight_layout()
plt.show()

Children with standard lunch plans do far better than children with free/reduced lunch plans across the board.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(30, 8))

r = 0
for label in labels_list:
    df1 = df.sort_values("test preparation course")
    sns.barplot(x="test preparation course", y=label, data=df1, ax=ax[r])
    ax[r].set_xlabel('', fontsize=16)
    ax[r].set_ylabel('%ss' % label, fontsize=14)
    ax[r].set_title("Test Preparation vs %ss"% label, fontsize=20)
    r += 1

fig.tight_layout()
plt.show()

As expected, children who complete the test prep course score higher on average than those who do not.
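
A gap between two bar heights can be backed up with a significance test. This is a minimal sketch using Welch's t-test from `scipy.stats.ttest_ind`. The two samples are synthetic and deliberately far apart, so the result says nothing about the real dataset:

```python
import numpy as np
from scipy import stats

# Synthetic scores for two groups (made-up means, spreads, and sizes).
rng = np.random.default_rng(0)
completed = rng.normal(75, 10, size=150)
no_prep = rng.normal(60, 10, size=250)

# Welch's t-test: does not assume the two groups have equal variance.
t_stat, p_value = stats.ttest_ind(completed, no_prep, equal_var=False)

print("mean difference: %.2f" % (completed.mean() - no_prep.mean()))
print("t = %.2f, p = %.3g" % (t_stat, p_value))
```

On the real data, the two samples would be the score columns split by the `test preparation course` value.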

# Random Forest Permutation Importance

In [None]:
from sklearn.ensemble import RandomForestRegressor
import eli5
from eli5.sklearn import PermutationImportance

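# max_features='auto' was removed in recent scikit-learn; 1.0 (all features) is its equivalent for regressors
regressor = RandomForestRegressor(max_features=1.0)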

In [None]:
regressor.fit(new_df, df['reading score'])
perm_import = PermutationImportance(regressor, random_state=1).fit(new_df, df['reading score'])
eli5.show_weights(perm_import, top=15, feature_names = new_df.columns.tolist())

In [None]:
regressor.fit(new_df, df['writing score'])
perm_import = PermutationImportance(regressor, random_state=1).fit(new_df, df['writing score'])
eli5.show_weights(perm_import, top=15, feature_names = new_df.columns.tolist())

In [None]:
regressor.fit(new_df, df['math score'])
perm_import = PermutationImportance(regressor, random_state=1).fit(new_df, df['math score'])
eli5.show_weights(perm_import, top=15, feature_names = new_df.columns.tolist())

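Interestingly, although preparation seemed like the biggest factor in our visual analysis, random forest permutation importance ranks it as one of the less important features. Instead, the parental level of education seems like the most important feature. Keep in mind that these importances are computed on the same data the forest was trained on, so they describe what the model relies on rather than how well it generalizes.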
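
Permutation importance can also be measured on held-out rows with scikit-learn's own `permutation_importance`. This sketch swaps in that function for eli5 and uses synthetic features where only the first column drives the target; the feature names are made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: three binary features, only column 0 affects the target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 3)).astype(float)
y = 10.0 * X[:, 0] + rng.normal(0, 1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one column at a time on the held-out rows and record the score drop.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)

feature_names = ["informative", "noise_1", "noise_2"]  # hypothetical names
for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print("%-12s %.3f +/- %.3f" % (name, mean, std))
```

Scoring on a held-out split keeps features the forest merely memorized from looking important.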

# Simple Neural Network Implementation

In [None]:
df['average score'] = df.loc[:, df.columns.str.contains('score')].mean(axis=1)

In [None]:
df.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

ohe = OneHotEncoder()
scaler = MinMaxScaler()
x_arr = np.array(ohe.fit_transform(df.loc[:, ~df.columns.str.contains('score')]).todense())
y_arr = df['average score'].to_numpy().reshape(-1, 1)
y_transform = scaler.fit_transform(y_arr)

end = 800
x_train = x_arr[:end]
x_test = x_arr[end:]
y_train = y_transform[:end]
y_test = y_transform[end:]

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from keras.initializers import GlorotNormal
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(units=64, kernel_initializer=GlorotNormal(), activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer=Adam(learning_rate=1e-5))

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=1e-5)
history = model.fit(x_train, y_train, epochs=10000, verbose=0, validation_data=(x_test, y_test), callbacks=[es])

In [None]:
plt.figure(figsize=(10, 8))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.xlabel("Epoch", fontsize=16)
plt.ylabel("Mean Squared Error", fontsize=16)
plt.show()

In [None]:
predict = scaler.inverse_transform(model.predict(x_test))
predict

In [None]:
print("Mean Absolute Error: %.3f" % abs(y_arr[end:]-predict).mean())

And that's it for this notebook! There are definitely some optimizations to be made for the neural network if you would like to experiment with it: in particular, generating more data (1000 rows is very little) through random Gaussian noise, and optimizing the neural network architecture (I use a basic dense layer, but autoencoders or other models such as random forests can definitely be used).
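
One way the Gaussian-noise idea could look is sketched below. It assumes one-hot inputs and targets scaled to [0, 1], as in the cells above. The `augment_with_noise` helper, its parameters, and the toy arrays are all hypothetical, and whether this improves the model at all would need to be tested:

```python
import numpy as np

def augment_with_noise(x, y, copies=3, noise_std=0.02, seed=0):
    """Hypothetical helper: repeat the rows and jitter the scaled targets.

    The one-hot inputs are copied unchanged, since adding noise to them
    would produce invalid categories. Targets are clipped back to [0, 1].
    """
    rng = np.random.default_rng(seed)
    x_parts, y_parts = [x], [y]
    for _ in range(copies):
        noisy_y = y + rng.normal(0, noise_std, size=y.shape)
        x_parts.append(x)
        y_parts.append(np.clip(noisy_y, 0.0, 1.0))
    return np.concatenate(x_parts), np.concatenate(y_parts)

# Toy stand-ins for x_train and y_train (made-up values).
x_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
y_toy = np.array([[0.10], [0.50], [0.90], [1.00]])

x_aug, y_aug = augment_with_noise(x_toy, y_toy)
print(x_aug.shape, y_aug.shape)
```

Only the training split should be augmented this way; the test rows should stay untouched so the error estimate remains honest.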

Finally, if you found this notebook particularly educational or helpful, please give it an <span style="color:green"> upvote! </span> And if you would like, check out some of my other work; I recently made some chess visualizations that I think are kind of cool.