## Step 1 - Read dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import probplot

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Filename of the input CSV file
filename = "/kaggle/input/students-performance-in-exams/StudentsPerformance.csv"

In [None]:
# Read the data into a dataframe
data = pd.read_csv(filename)

In [None]:
# Check first 5 rows
data.head()

## Step 2 - Univariate Analysis

### Step 2.1 - Data Outline

In [None]:
# Start off by having a look at the shape of the data
data.shape

The data has 8 columns and only 1,000 rows. Since this is a comparatively small dataset, we won't remove any samples.

In [None]:
# Check the datatypes
data.info()

Note that there are no null values present in any column. The data also doesn't require any datatype conversion at this step.

In [None]:
# Describe the numerical attributes
data.describe()

Note that the minimum of each score is 0 or above and the maximum is 100. These values make sense, since a subject score can range from 0 to 100.
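
The same sanity check can be written explicitly in code. The sketch below is only illustrative: it uses a small made-up frame (`toy`) instead of the real dataset, and the list `score_cols` is an assumption based on the column names used in this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "math score": [0, 47, 66, 100],
    "reading score": [17, 57, 70, 100],
    "writing score": [10, 54, 69, 100],
})
score_cols = ["math score", "reading score", "writing score"]

# True for every column whose values all lie in the valid range [0, 100]
in_range = toy[score_cols].apply(lambda col: col.between(0, 100).all())
print(in_range)

# Number of missing values per column
n_missing = toy.isna().sum()
print(n_missing)
```

On the real data, replacing `toy` with `data` should give `True` for all three score columns if the ranges reported by `describe()` hold.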

Let's also look at the boxplots to check for outliers.

### Step 2.2 - Numerical Variables

In [None]:
# Boxplot for math score
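# Pass the column as x explicitly; recent seaborn versions treat the first positional argument as `data`
sns.boxplot(x=data["math score"])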
plt.show()

Let's look at the values lying below `Q1 - 1.5*IQR`, i.e. the lower-end outliers.

In [None]:
# First quartile (25th percentile)
Q1 = np.quantile(data["math score"], 0.25)
# Third quartile (75th percentile)
Q3 = np.quantile(data["math score"], 0.75)
# Interquartile range
IQR = Q3 - Q1
# Outliers on the lower end
data[data["math score"] < Q1 - 1.5*IQR]

In [None]:
# Boxplot for reading score
sns.boxplot(x=data["reading score"])
plt.show()

As with the math score, we will look at the outlier samples here as well.

In [None]:
# First quartile (25th percentile)
Q1 = np.quantile(data["reading score"], 0.25)
# Third quartile (75th percentile)
Q3 = np.quantile(data["reading score"], 0.75)
# Interquartile range
IQR = Q3 - Q1
# Outliers on the lower end
data[data["reading score"] < Q1 - 1.5*IQR]

In [None]:
# Boxplot for writing score
sns.boxplot(x=data["writing score"])
plt.show()

In [None]:
# First quartile (25th percentile)
Q1 = np.quantile(data["writing score"], 0.25)
# Third quartile (75th percentile)
Q3 = np.quantile(data["writing score"], 0.75)
# Interquartile range
IQR = Q3 - Q1
# Outliers on the lower end
data[data["writing score"] < Q1 - 1.5*IQR]

Notice that some entries are lower-end outliers for the writing, reading, and math scores at the same time. While we could remove these common entries or replace their values, they may well be genuine entries from the low end of the score range, so we will keep them for now.
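
One way to find the entries that are lower-end outliers in all three scores at once is sketched below. The helper name `lower_outlier_mask` and the `toy` frame are hypothetical and only serve the illustration; the rule itself is the same `Q1 - 1.5*IQR` threshold used above.

```python
import pandas as pd

def lower_outlier_mask(series: pd.Series) -> pd.Series:
    """Return a boolean mask of values lying below Q1 - 1.5*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series < q1 - 1.5 * iqr

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "math score":    [5, 60, 62, 65, 68, 70, 72, 75, 78, 80],
    "reading score": [8, 61, 63, 66, 10, 71, 73, 76, 79, 81],
    "writing score": [6, 59, 64, 67, 69, 72, 74, 77, 79, 82],
})
score_cols = ["math score", "reading score", "writing score"]

# One boolean column per score: is this row a lower-end outlier for that score?
masks = toy[score_cols].apply(lower_outlier_mask)

# Rows that are lower-end outliers for every score simultaneously
common = toy[masks.all(axis=1)]
print(common)
```

Using `masks.any(axis=1)` instead would list rows that are outliers in at least one score.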

Next, let's have a look at the distribution of the three scores.

In [None]:
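# sns.distplot is deprecated in recent seaborn releases;
# histplot with a KDE overlay on a density scale gives a comparable plot
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
sns.histplot(x=data["math score"], kde=True, stat="density")
plt.subplot(1,3,2)
sns.histplot(x=data["reading score"], kde=True, stat="density")
plt.subplot(1,3,3)
sns.histplot(x=data["writing score"], kde=True, stat="density")
plt.show()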

Note that all three scores are slightly left-skewed, but for ease of analysis we can treat them as approximately normally distributed. We can check this with Q-Q plots, as shown below.

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
probplot(data["math score"],dist="norm",plot=plt);
plt.subplot(1,3,2)
probplot(data["reading score"],dist="norm",plot=plt);
plt.subplot(1,3,3)
probplot(data["writing score"],dist="norm",plot=plt);
plt.show()

As we can see, most of the points lie along the fitted reference line, meaning that the scores follow an approximately normal distribution.
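
The visual impression can be backed up with two numbers: the sample skewness and the correlation coefficient `r` of the Q-Q fit that `probplot` returns. The sketch below runs on synthetic scores (`toy_scores`, generated from a clipped normal distribution purely for illustration), so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import probplot

# Synthetic scores: normal draws clipped to the valid range [0, 100]
rng = np.random.default_rng(0)
toy_scores = pd.Series(np.clip(rng.normal(66, 15, size=1000), 0, 100))

# Negative skewness indicates a longer left tail, values near 0 indicate symmetry
skewness = toy_scores.skew()

# Without plot=..., probplot only returns the quantiles and the least-squares fit;
# r close to 1 means the points lie close to the reference line
(osm, osr), (slope, intercept, r) = probplot(toy_scores, dist="norm")

print(f"skewness = {skewness:.3f}, Q-Q fit r = {r:.4f}")
```

On the real data, `toy_scores` would be replaced by, for example, `data["math score"]`.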

### Step 2.3 - Categorical Variables

Next, let's have a look at the categorical variables, starting off with gender.

In [None]:
sns.countplot(x=data["gender"])
plt.show()

In [None]:
data["gender"].value_counts(normalize=True)

About 52% of the students in the dataset are female and about 48% are male, with no missing values. Since the two shares are nearly equal, we don't have to worry about gender imbalance in the dataset.

Next let's have a look at the race/ethnicity.

In [None]:
sns.countplot(x=data["race/ethnicity"])
plt.show()

In [None]:
data["race/ethnicity"].value_counts(normalize=True)

Group C is the largest group, whereas only 89 students (8.9%) are from group A.
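
Counts and percentages like the ones quoted above can be shown side by side in one table. This is a minimal sketch on a made-up series (`toy`); the name `summary` is an illustrative choice.

```python
import pandas as pd

# Toy stand-in for data["race/ethnicity"] (values are made up for illustration)
toy = pd.Series(
    ["group C", "group A", "group C", "group B", "group C"],
    name="race/ethnicity",
)

# Absolute counts and percentage share per category in a single frame
summary = pd.DataFrame({
    "count": toy.value_counts(),
    "percent": toy.value_counts(normalize=True).mul(100).round(1),
})
print(summary)
```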

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x=data["parental level of education"])
plt.show()

data["parental level of education"].value_counts(normalize=True)

The most common parental education levels are some college education and an associate's degree, whereas only 59 parents (5.9%) have a master's degree. This shows an imbalance in the dataset, but it can be ignored for now.

In [None]:
sns.countplot(x=data["lunch"])
plt.show()

data["lunch"].value_counts(normalize=True)

The above plot shows that most students pay the standard fee for lunch.

In [None]:
sns.countplot(x=data["test preparation course"])
plt.show()

data["test preparation course"].value_counts(normalize=True)

From the above output, we can see that around 64% of students have not taken or completed a test preparation course.

Before we proceed with the bivariate analysis, let's convert the multilevel categorical variables (`race/ethnicity`, `parental level of education`) using one-hot encoding.

In [None]:
# parental level of education one hot encoding
data = pd.concat([data,
                 pd.get_dummies(data["parental level of education"],
                               prefix="parent_education")],
                axis=1)
# drop parental level of education
data.drop("parental level of education",
         axis=1,
         inplace=True)

In [None]:
# race/ethnicity one hot encoding
data = pd.concat([data,
                 pd.get_dummies(data["race/ethnicity"],
                               prefix="race")],
                axis=1)
# drop race/ethnicity
data.drop("race/ethnicity",
         axis=1,
         inplace=True)
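
As an alternative sketch, `pd.get_dummies` can encode several columns in one call through its `columns` and `prefix` parameters, which also drops the original columns. The `toy` frame below is made up for illustration, and the resulting column order may differ from the one produced by the cells above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B", "group A"],
    "parental level of education": ["high school", "master's degree", "some college"],
    "math score": [55, 70, 62],
})

# Encode both multilevel columns at once; the originals are replaced by dummy columns
encoded = pd.get_dummies(
    toy,
    columns=["race/ethnicity", "parental level of education"],
    prefix=["race", "parent_education"],
)
print(encoded.columns.tolist())
```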

In [None]:
data.head()

In [None]:
# Binary encoding for 2 level categorical variables
data["gender"] = data["gender"].apply(lambda x:0 if x=="male" else 1)

In [None]:
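# lunch: "standard" -> 0, anything else -> 1
data["lunch"] = data["lunch"].apply(lambda x: 0 if x=="standard" else 1)
# test preparation course: "none" -> 0, anything else -> 1
data["test preparation course"] = data["test preparation course"].apply(lambda x: 0 if x=="none" else 1)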
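
A possible variation on the binary encoding is an explicit mapping with `Series.map`. Unlike the `else 1` branch above, a label that is missing from the mapping becomes `NaN` instead of silently turning into 1. The sketch runs on a made-up frame, and the labels `"free/reduced"` and `"completed"` are assumptions about the second level of each column.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none"],
})

# Explicit label -> code mappings (second-level labels are assumed)
encodings = {
    "gender": {"male": 0, "female": 1},
    "lunch": {"standard": 0, "free/reduced": 1},
    "test preparation course": {"none": 0, "completed": 1},
}

encoded = toy.copy()
for col, mapping in encodings.items():
    encoded[col] = toy[col].map(mapping)

# Labels absent from a mapping would appear as NaN, so this check catches them
all_mapped = bool(encoded.notna().all().all())
print(encoded, all_mapped)
```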

In [None]:
data.head()

## Step 3 - Bivariate & Multivariate Analysis

In [None]:
sns.pairplot(data[["math score","reading score","writing score"]])
plt.show()

There is a clear positive linear relationship between each pair of scores.
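
The strength of a linear relationship can be quantified with the Pearson correlation matrix. The sketch below uses synthetic scores that share a common hypothetical `ability` component, so the printed values are illustrative only and are not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic scores: a shared component plus independent noise per subject
rng = np.random.default_rng(0)
ability = rng.normal(65, 12, size=500)
toy = pd.DataFrame({
    "math score": ability + rng.normal(0, 6, size=500),
    "reading score": ability + rng.normal(0, 4, size=500),
    "writing score": ability + rng.normal(0, 4, size=500),
})

# Pairwise Pearson correlation (the default method of DataFrame.corr)
corr = toy.corr()
print(corr.round(2))
```

On the real data, the equivalent call would be `data[["math score", "reading score", "writing score"]].corr()`.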

In [None]:
plt.figure(figsize=(15,3))
plt.subplot(1,3,1)
sns.scatterplot(x="math score",
             y="writing score",
             data = data,
             hue = "gender")
plt.subplot(1,3,2)
sns.scatterplot(x="reading score",
             y="writing score",
             data = data,
             hue = "gender")
plt.subplot(1,3,3)
sns.scatterplot(x="math score",
             y="reading score",
             data = data,
             hue = "gender")
plt.show()

Note that the relationship between `reading score` and `writing score` is linear and looks the same irrespective of `gender`. On the other hand, for the same math score, the reading scores of female students (`gender` = 1) tend to be higher than those of male students (`gender` = 0), although the relationship is still linear.
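
A shift like this can be checked by fitting a separate straight line per group and comparing the slopes and intercepts. The sketch below uses `np.polyfit` on synthetic data in which an offset of 8 points for `gender` = 1 is built in by construction; that offset is an assumption for the illustration and is not an estimate from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: same slope for both groups, higher intercept for gender = 1
rng = np.random.default_rng(0)
n = 300
gender = rng.integers(0, 2, size=n)
math = rng.uniform(30, 90, size=n)
reading = 0.9 * math + 5 + 8 * gender + rng.normal(0, 3, size=n)
toy = pd.DataFrame({"gender": gender, "math score": math, "reading score": reading})

# Degree-1 least-squares fit per group: polyfit returns [slope, intercept]
fits = {
    int(g): np.polyfit(grp["math score"], grp["reading score"], 1)
    for g, grp in toy.groupby("gender")
}
for g, (slope, intercept) in fits.items():
    print(f"gender={g}: slope={slope:.2f}, intercept={intercept:.2f}")
```

Similar slopes with different intercepts correspond to parallel point clouds in the scatter plot, which is the pattern described above.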

In [None]:
plt.figure(figsize=(15,3))
plt.subplot(1,3,1)
sns.scatterplot(x="math score",
             y="writing score",
             data = data,
             hue = "lunch")
plt.subplot(1,3,2)
sns.scatterplot(x="reading score",
             y="writing score",
             data = data,
             hue = "lunch")
plt.subplot(1,3,3)
sns.scatterplot(x="math score",
             y="reading score",
             data = data,
             hue = "lunch")
plt.show()

From the above plots, students who pay the standard lunch fee (`lunch` = 0) tend to score higher.

In [None]:
plt.figure(figsize=(15,3))
plt.subplot(1,3,1)
sns.scatterplot(x="math score",
             y="writing score",
             data = data,
             hue = "test preparation course")
plt.subplot(1,3,2)
sns.scatterplot(x="reading score",
             y="writing score",
             data = data,
             hue = "test preparation course")
plt.subplot(1,3,3)
sns.scatterplot(x="math score",
             y="reading score",
             data = data,
             hue = "test preparation course")
plt.show()

From the above plots, students who have not taken or completed a test preparation course (`test preparation course` = 0) tend to score lower.
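
Group-level differences like those for `lunch` and `test preparation course` can also be read off a table of group means. This is a minimal sketch on a made-up frame whose values are chosen only to illustrate the call; it does not reproduce the real group means.

```python
import pandas as pd

# Toy stand-in for the encoded dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "lunch": [0, 0, 0, 1, 1, 1],
    "test preparation course": [1, 0, 1, 0, 0, 1],
    "math score": [70, 75, 68, 58, 62, 60],
    "reading score": [72, 74, 70, 60, 61, 65],
    "writing score": [71, 73, 72, 57, 60, 66],
})
score_cols = ["math score", "reading score", "writing score"]

# Mean score per level of each binary variable
means_by_lunch = toy.groupby("lunch")[score_cols].mean()
means_by_prep = toy.groupby("test preparation course")[score_cols].mean()
print(means_by_lunch)
print(means_by_prep)
```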

In [None]:
# Race groups: one-hot columns created from race/ethnicity
race_groups = [col for col in data.columns if col.startswith("race_group")]

In [None]:
plt.figure(figsize=(15,18))
for i in range(len(race_groups)):
    plt.subplot(len(race_groups),3,i*3+1)
    sns.scatterplot(x="math score",
                 y="writing score",
                 data = data,
                 hue = race_groups[i])
    plt.subplot(len(race_groups),3,i*3+2)
    sns.scatterplot(x="reading score",
                 y="writing score",
                 data = data,
                 hue = race_groups[i])
    plt.subplot(len(race_groups),3,i*3+3)
    sns.scatterplot(x="math score",
                 y="reading score",
                 data = data,
                 hue = race_groups[i])
plt.show()

In [None]:
# Parent education groups: one-hot columns created from parental level of education
parent_edu_groups = [col for col in data.columns if col.startswith("parent_education_")]

In [None]:
plt.figure(figsize=(15,24))
for i in range(len(parent_edu_groups)):
    plt.subplot(len(parent_edu_groups),3,i*3+1)
    sns.scatterplot(x="math score",
                 y="writing score",
                 data = data,
                 hue = parent_edu_groups[i])
    plt.subplot(len(parent_edu_groups),3,i*3+2)
    sns.scatterplot(x="reading score",
                 y="writing score",
                 data = data,
                 hue = parent_edu_groups[i])
    plt.subplot(len(parent_edu_groups),3,i*3+3)
    sns.scatterplot(x="math score",
                 y="reading score",
                 data = data,
                 hue = parent_edu_groups[i])
plt.show()
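
The grids above draw one row of plots per one-hot column, with each column acting as a one-vs-rest hue. If a single plot colored by group is preferred, the one-hot columns can be collapsed back into one categorical column. The sketch below shows one way to do this on a made-up frame; the new column name `race` is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the one-hot encoded dataframe (values are made up for illustration)
toy = pd.DataFrame({
    "race_group A": [True, False, False],
    "race_group B": [False, True, False],
    "race_group C": [False, False, True],
    "math score": [55, 70, 62],
})
race_cols = [col for col in toy.columns if col.startswith("race_group")]

# For each row, take the name of the column holding the 1 and strip the prefix
toy["race"] = (
    toy[race_cols]
    .astype(int)
    .idxmax(axis=1)
    .str.replace("race_", "", regex=False)
)
print(toy[["race", "math score"]])
```

The recovered column could then be passed as `hue="race"` to `sns.scatterplot`, giving one plot per score pair instead of one per group.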