# Exam Performance Visualizations with Seaborn and Matplotlib

In this Jupyter Notebook, we will explore how various factors affect exam performance. We will use Seaborn and Matplotlib to create visualizations that show the relationship between several categorical variables and exam performance.

First, let's import some necessary libraries:

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

print('All libraries have been imported.')

Next, we will import the dataset and preview it to see what we're working with:

In [None]:
filepath = '../input/students-performance-in-exams/StudentsPerformance.csv'
student = pd.read_csv(filepath)

student.head()

Now let's look at the shape of the DataFrame:

In [None]:
student.shape

The DataFrame has 1,000 rows and 8 columns. This means that it contains data on 1,000 different test-takers, each described by 8 attributes.

Before we do some cleaning of the data, let's look at the dtypes:

In [None]:
student.dtypes

All of the data is in the format we need: categorical data (gender, race, parental level of education, lunch, and test preparation) are stored as strings, while numerical data (math, reading, and writing test scores) are stored as integers.
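The same split between categorical and numerical columns can also be checked programmatically. The sketch below is a minimal illustration using `select_dtypes`; the `toy` DataFrame and its values are made up for the example and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real data (values are invented).
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'lunch': ['standard', 'free/reduced', 'standard'],
    'math score': [72, 47, 90],
    'reading score': [72, 57, 95],
})

# Split the column names by dtype: numeric columns vs. everything else.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Numeric:', numeric_cols)
print('Categorical:', categorical_cols)
```

Running the same two `select_dtypes` calls on `student` should list the three score columns as numeric and the remaining five as categorical.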

Now let's clean up the DataFrame so it is easier to work with. First, we will rename the columns with more accessible names:

In [None]:
student.columns = ['gender', 'race', 'parent_education', 'lunch', 'test_prep', 'math_score', 'reading_score', 'writing_score']

student.head()

Next, let's see if there are any missing values we need to take care of:

In [None]:
student.isnull().sum()

Luckily, there are no missing values we need to take care of.
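To see what this check would report if values were missing, here is a small sketch on a made-up frame (the `toy` DataFrame is an assumption for illustration only). `isnull()` marks each missing cell as `True`, and `sum()` counts those marks per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberately missing cells.
toy = pd.DataFrame({
    'test_prep': ['none', 'completed', None],
    'math_score': [72.0, np.nan, 90.0],
    'reading_score': [72.0, 57.0, 95.0],
})

# Count missing cells in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Count rows that contain at least one missing cell.
rows_with_missing = toy.isnull().any(axis=1).sum()
print('Rows with missing values:', rows_with_missing)
```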

It seems that these are all the adjustments we need to make. Now let's go ahead and learn more about the data itself. We'll start by looking at some descriptive statistics about the numerical values:

In [None]:
student.describe()

Next, let's create visualizations to get a more intuitive understanding of the above statistics. Histograms are perfect for understanding distributions amongst a group -- in this case, we want to understand the distribution of test scores among the 1,000 test-takers for each test subject.

In [None]:
# Framework for subplots and subplot titles.
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (20, 5))
chart_titles = ['Math Score Distribution', 'Reading Score Distribution', 'Writing Score Distribution']

# Plot charts.
for col, ax, chart_title in zip(student.columns[-3:], axes.flatten(), chart_titles):
    # histplot replaces the deprecated distplot; stat = 'density' matches norm_hist = True.
    sns.histplot(student[col], stat = 'density', ax = ax).set_title(chart_title, fontsize = 14)
    ax.set(xlabel = 'Score')
    
# Add gridlines to all plots from this point forward.
plt.rcParams['axes.grid'] = True
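The histograms above are normalized to a density, so the bar heights are not counts. As a rough sketch of what that normalization means, the example below uses NumPy's `np.histogram` on a handful of invented scores (the `scores` array is an assumption, not dataset values) and checks that the bar areas add up to 1.

```python
import numpy as np

# Hypothetical scores, invented for illustration.
scores = np.array([50, 55, 60, 62, 70, 71, 75, 90])

# density=True rescales the bar heights so the total bar area equals 1.
density, edges = np.histogram(scores, bins=4, density=True)

# Area of each bar = height * bin width; summing gives the total area.
area = (density * np.diff(edges)).sum()
print('Total area under the histogram:', area)
```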

Some important statistics from the table above, which the histograms help to visualize, are the following:
- Math: Mean score of 66.09 with a standard deviation of 15.16
- Reading: Mean score of 69.17 with a standard deviation of 14.60
- Writing: Mean score of 68.05 with a standard deviation of 15.20
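If we only want the means and standard deviations rather than the full `describe()` table, `agg` can compute just those two statistics. The sketch below uses an invented three-row `toy` frame (an assumption for illustration); note that pandas computes the sample standard deviation, which is also what `describe()` reports.

```python
import pandas as pd

# Hypothetical scores, chosen so the results are easy to verify by hand.
toy = pd.DataFrame({
    'math_score': [60, 70, 80],
    'reading_score': [65, 70, 75],
    'writing_score': [50, 70, 90],
})

# One row per statistic, one column per test subject.
summary = toy.agg(['mean', 'std']).round(2)
print(summary)
```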

Now, let's look a bit more into the categorical data. The first thing we will do is look at the value counts for each column:

In [None]:
for col in student.columns[:5]:
    print('\n' + '-'*50)       # Serves as a divider between each column summary.
    print(col.upper() + ' COLUMN SUMMARY:\n') # Indicates which column is being summarized in each section.
    print(student[col].value_counts())

Let's go ahead and create some bar charts to gain a more intuitive understanding of each column's distribution:

In [None]:
# Framework and titles for subplots.
fig, axes = plt.subplots(nrows = 3, ncols = 2, figsize = (15, 15))
chart_titles = ['Gender (Total: 1,000)', 'Race (Total: 1,000)', 'Parental Education (Total: 1,000)', 'Lunch (Total: 1,000)', 'Test Preparation (Total: 1,000)']

# Plot charts.
for col, ax, chart_title in zip(student.columns[:5], axes.flatten(), chart_titles):
    sns.countplot(y = str(col), ax = ax, data = student).set_title(chart_title, fontsize = 14)
    ax.set(xlabel = 'Count', ylabel = col.replace('_', ' ').title())
fig.delaxes(axes[2,1]) # Delete extra plot, only needed 5.

# Adjust spacing.
plt.subplots_adjust(wspace = 0.3, hspace = 0.3)

These bar charts show at a glance how the 1,000 test-takers are distributed across the values of each categorical variable.
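Counts are easier to compare across columns when expressed as proportions. As a small sketch, `value_counts(normalize = True)` returns each value's share of the total; the `toy` Series below is made up for illustration and does not reflect the real distribution.

```python
import pandas as pd

# Hypothetical column with four invented entries.
toy = pd.Series(['none', 'completed', 'none', 'none'], name='test_prep')

# normalize=True returns each value's share of the total instead of raw counts.
shares = toy.value_counts(normalize=True)
print(shares)
```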

Now let's explore the relationship between our categorical variables and the test-takers' test scores. To keep this simple, we will use a composite score, the average of each test-taker's three test scores, for our analysis. The DataFrame does not already contain this data, so we will need to make a new column for composite scores:

In [None]:
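# New col for composite test score averages.
# Select the score columns explicitly: averaging the whole DataFrame would
# include the string columns, which raises a TypeError in pandas 2.0+.
score_cols = ['math_score', 'reading_score', 'writing_score']
student['composite'] = student[score_cols].mean(axis = 1).round(2)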

student.head()
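As a sanity check on the composite column, the row-wise mean can be verified by hand on a tiny example. The sketch below uses an invented two-row `toy` frame (an assumption, not dataset rows) and averages only the three score columns.

```python
import pandas as pd

score_cols = ['math_score', 'reading_score', 'writing_score']

# Hypothetical two-row frame, invented for illustration.
toy = pd.DataFrame({
    'gender': ['female', 'male'],
    'math_score': [72, 47],
    'reading_score': [72, 57],
    'writing_score': [74, 44],
})

# axis=1 averages across the columns of each row, giving one value per test-taker.
toy['composite'] = toy[score_cols].mean(axis=1).round(2)
print(toy)
```

By hand, the first row is (72 + 72 + 74) / 3 and the second is (47 + 57 + 44) / 3, which should match the new column after rounding to two decimals.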

Next, we will create a histogram of the composite scores, as we previously did for the individual test subject scores:

In [None]:
# histplot replaces the deprecated distplot; stat = 'density' matches norm_hist = True.
sns.histplot(student.composite, stat = 'density')
plt.title('Composite Score Distribution')
plt.xlabel('Score')

Now that we have the composite test score column, we no longer need the individual test score columns, so let's go ahead and drop them from the DataFrame:

In [None]:
student = student.drop(columns = ['math_score', 'reading_score', 'writing_score'])
student.head()

Before creating visualizations showing the relationship between categorical variables and composite test scores, let's look at summary statistics when grouped by each categorical variable:

In [None]:
# Grouped-by summary statistics
for col in student.columns[:5]:
    print('-'*50)
    print(col.upper() + ' COLUMN STATISTICAL SUMMARY:\n')
    print(student.groupby(col).describe())

Let's create some visualizations which show the relationships between various categorical variables and the test-takers' composite test scores. Box and whisker plots would likely be the best choice here, because we are showing numerical distributions by category.

In [None]:
# Framework and titles for subplots.
fig, axes = plt.subplots(nrows = 3, ncols = 2, figsize = (15, 20))
chart_titles = ['Composite Test Score Avg.\nby Gender', 'Composite Test Score Avg.\nby Race','Composite Test Score Avg.\nby Level of Parental Education',
                'Composite Test Score Avg.\nby Lunch Type', 'Composite Test Score Avg.\nby Status of Test Prep Course']

# Plot charts.
for col, ax, chart_title in zip(student.columns[:5], axes.flatten(), chart_titles):
    sns.boxplot(x = str(col), y = 'composite', ax = ax, data = student).set_title(chart_title, fontsize = 14)
    ax.tick_params(axis = 'x', labelrotation = 45)
    # x holds the category and y holds the composite score, so label them accordingly.
    ax.set(xlabel = col.replace('_', ' ').title(), ylabel = 'Composite Score')
fig.delaxes(axes[2,1]) # Delete extra plot, only need 5.

# Adjust spacing.
plt.subplots_adjust(wspace = 0.2, hspace = 0.55)
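To read the plots above, it helps to know what each box encodes. The sketch below computes the pieces by hand with NumPy, assuming the default whisker rule of 1.5 times the interquartile range; the `scores` array is invented for illustration.

```python
import numpy as np

# Hypothetical composite scores, including one unusually low value.
scores = np.array([20, 55, 60, 65, 70, 75, 80, 85, 100])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print('Q1:', q1, 'Median:', median, 'Q3:', q3)
print('Fences:', lower_fence, upper_fence)
print('Outliers:', outliers)
```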

These box and whisker plots are helpful because they let us compare composite test scores side by side across the values of a given category. They make it evident that certain categorical attributes are associated with higher or lower composite test score averages. A specific example can be observed in the last chart, which compares people who completed a test prep course with people who did not. People who took the test prep course scored higher on average than those who didn't: the former scored an average of 73% while the latter scored an average of 65%. The same methodology can be used to examine the other box and whisker plots, further revealing which categorical attributes are associated with higher average test scores.
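A comparison of group averages like the one above can be reproduced with a `groupby` on the category followed by a mean of the composite column. The sketch below uses an invented four-row `toy` frame (an assumption for illustration), so its numbers are not the dataset's.

```python
import pandas as pd

# Hypothetical composite scores, invented for illustration.
toy = pd.DataFrame({
    'test_prep': ['completed', 'none', 'completed', 'none'],
    'composite': [80.0, 60.0, 70.0, 66.0],
})

# Mean composite score for each test prep status.
group_means = toy.groupby('test_prep')['composite'].mean()
print(group_means)

# Difference between the two group averages.
gap = group_means['completed'] - group_means['none']
print('Gap in average composite score:', gap)
```

Swapping `'test_prep'` for any other categorical column gives the corresponding comparison for that attribute.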