# Complete Data Exploration

In this notebook I'd like to do some in-depth data exploration. The goal is to extract relevant information from the data and end up with a useful conclusion.

In [None]:
# Let's import the libraries we'll need
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
dataset = pd.read_csv("../input/StudentsPerformance.csv")

In [None]:
# Let's check the first 5 rows
dataset.head()

# Explorations

In [None]:
dataset.info()

The dataset doesn't contain any missing values.
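
The absence of missing values can also be checked directly instead of being read off `info()`. Here is a minimal sketch, assuming a small hypothetical frame `toy` in place of the real CSV (one value is deliberately missing so that the output is not trivial):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV;
# one math score is deliberately missing
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, None, 90],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole frame has no missing value
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

On the real data, the same two calls applied to `dataset` should report zero for every column if the observation above holds.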

In [None]:
# Let's describe the dataset
dataset.describe()

We can go further in this description by grouping the dataset by its categorical features and summarising each group.

In [None]:
dataset.groupby("gender").describe()

Overall, boys seem to do better in math (mean of 68 for boys vs 63 for girls), while girls do better in reading and writing.
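
The full `describe()` output is wide, so a compact table of means per gender may be easier to read. This is only a sketch on a hypothetical frame `toy` with made-up scores; the column names are the ones used in this notebook:

```python
import pandas as pd

# Hypothetical toy data with made-up scores
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [60, 70, 66, 74],
    "reading score": [75, 85, 62, 68],
    "writing score": [74, 86, 60, 66],
})

score_columns = ["math score", "reading score", "writing score"]

# One row per gender, one column per score, each cell holding the mean
mean_by_gender = toy.groupby("gender")[score_columns].mean()
print(mean_by_gender)
```

Replacing `toy` with `dataset` should give the per-gender means discussed above.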

In [None]:
# Let's group by race/ethnicity
dataset.groupby("race/ethnicity").describe()

Some ethnic groups are more represented than others. Group A has the fewest students (89).
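
Group sizes can be read more directly with `value_counts()`. A small sketch on a hypothetical frame `toy` (the group labels are made up to mirror the real ones):

```python
import pandas as pd

# Hypothetical toy data: group labels only
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group C", "group C",
                       "group B", "group C", "group B"],
})

# Number of students in each group, largest first
group_sizes = toy["race/ethnicity"].value_counts()
print(group_sizes)

# Label of the least represented group
smallest_group = group_sizes.idxmin()
print(smallest_group)
```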

In [None]:
dataset.groupby("parental level of education").describe()

Let's now look at how the parents' level of education is distributed within each ethnic group.

In [None]:
dataset.groupby(["race/ethnicity", "parental level of education"]).describe()["math score"]
# I only keep the "math score" columns to reduce the width of the output

In group A, most parents have a high school education.

In group B, most parents hold an associate's degree or a high school degree.

...
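
Reading these counts off a long `describe()` table is error-prone. A cross-tabulation gives the same counts in a compact form. The sketch below uses a hypothetical frame `toy` with made-up rows, so its result says nothing about the real dataset:

```python
import pandas as pd

# Hypothetical toy data with made-up rows
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group A",
                       "group B", "group B", "group B"],
    "parental level of education": [
        "high school", "high school", "some college",
        "associate's degree", "associate's degree", "high school",
    ],
})

# Rows: ethnic group, columns: education level, cells: number of students
counts = pd.crosstab(toy["race/ethnicity"], toy["parental level of education"])
print(counts)

# Most frequent education level within each group
most_common = counts.idxmax(axis=1)
print(most_common)
```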

All these differences are easier to see in visualizations.

# Let's do some visualizations

In [None]:
# Let's fix the figure size
plt.rcParams["figure.figsize"] = [10,6]

In [None]:
# Seaborn tends to emit unimportant warnings, so I hide them with the warnings module.
# Doing so is not recommended in general, though
import warnings
warnings.filterwarnings("ignore")
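
Since hiding every warning is not recommended, a narrower option is to silence a single category and only around the call that triggers it. This is a sketch, not what the notebook does: `noisy_plot_call` is a hypothetical stand-in for a plotting call, and `FutureWarning` is an assumed category.

```python
import warnings


def noisy_plot_call():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this argument will change", FutureWarning)
    return "plotted"


# The filter only applies inside the with-block;
# the previous warning settings are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_plot_call()

print(result)
```

Other warning categories, and any warning raised outside the block, stay visible.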

In [None]:
sns.pairplot(dataset, hue="gender", palette="viridis", plot_kws={"alpha": 0.6})
# plot_kws lets us pass additional arguments (here alpha=0.6) to the underlying plots

There seems to be a high correlation between reading score and writing score.
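
The visual impression can be backed by a number. The sketch below computes the Pearson correlation between the two columns on a hypothetical frame `toy` with made-up scores:

```python
import pandas as pd

# Hypothetical toy data with made-up scores
toy = pd.DataFrame({
    "reading score": [50, 60, 70, 80, 90],
    "writing score": [48, 61, 69, 82, 88],
})

# Pearson correlation (the default method of Series.corr)
reading_writing_corr = toy["reading score"].corr(toy["writing score"])
print(round(reading_writing_corr, 3))
```

A value close to 1 indicates a strong positive linear relationship.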

In [None]:
dataset.head(3)

In [None]:
# Recent seaborn versions require the column to be passed as a keyword argument
sns.countplot(x="parental level of education", data=dataset)

In [None]:
# distplot is deprecated in recent seaborn versions; histplot draws the same histogram
sns.histplot(dataset["math score"], bins=50)

In [None]:
sns.countplot(x="test preparation course", data=dataset)

In [None]:
sns.boxplot(x = "race/ethnicity", y = "math score", data = dataset, 
            hue = "gender", palette = "viridis")

Students in group E have the best math scores. The gap between genders does not seem to depend on ethnicity: boys tend to score higher than girls in math across the groups.
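
The boxplot comparison can be summarised as a table of medians per group and gender. A sketch on a hypothetical frame `toy` with made-up scores, so the numbers below are not results from the real data:

```python
import pandas as pd

# Hypothetical toy data with made-up scores
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group E", "group E"],
    "gender": ["female", "male", "female", "male"],
    "math score": [58, 63, 70, 76],
})

# Rows: ethnic group, columns: gender, cells: median math score
median_math = toy.pivot_table(
    index="race/ethnicity",
    columns="gender",
    values="math score",
    aggfunc="median",
)
print(median_math)

# Positive values mean boys have the higher median in that group
gender_gap = median_math["male"] - median_math["female"]
print(gender_gap)
```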

In [None]:
sns.boxplot(x="race/ethnicity", y="reading score", data=dataset,
            hue="gender", palette="viridis")

Girls score noticeably higher than boys in reading.

In [None]:
sns.boxplot(x = "test preparation course", y = "reading score", data = dataset)

Students who completed the test preparation course tend to have a slightly better reading score.
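
To put a number on "slightly better", the reading score can be aggregated per preparation status. This is a sketch on a hypothetical frame `toy`; the labels `"none"` and `"completed"` are assumed values for the column:

```python
import pandas as pd

# Hypothetical toy data; the labels "none" and "completed" are assumptions
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "reading score": [60, 70, 70, 80],
})

# Mean, median and group size for each preparation status
reading_by_prep = (
    toy.groupby("test preparation course")["reading score"]
    .agg(["mean", "median", "count"])
)
print(reading_by_prep)

# Difference in mean reading score between the two groups
gap = reading_by_prep.loc["completed", "mean"] - reading_by_prep.loc["none", "mean"]
print(gap)
```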

In [None]:
# Let's see the correlation between scores
# numeric_only=True skips the categorical columns (needed with pandas >= 2.0)
dataset.corr(numeric_only=True)

In [None]:
sns.heatmap(dataset.corr(numeric_only=True), cmap="viridis_r")