# EDA of Student Performance using pandas

This notebook shows how much Exploratory Data Analysis (EDA) can be done with pandas alone.

Although it is beneficial to understand other packages such as matplotlib and seaborn, this notebook covers the fundamentals of EDA using this single powerful package.

## ****Importing dependencies and data****

In [None]:
import pandas as pd 
from IPython.display import display # just to make things look nicer

df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

## ****Checking columns, datatypes and quantity of null values****

This is to get a general idea of the data that we're working with. Being able to see how many null values there are in each column is especially useful to prioritise how much time should be spent on feature engineering and filling columns with new data.

Finding out the data type of each column is useful as it shows whether the data is categorical or numeric, which needs to be taken into account when further exploring each column.

In [None]:
df.info()

Luckily, there are no null values in this data set. This can be confirmed from the output above: the non-null count of every column equals the total number of entries stated above the table, so subtracting one from the other gives zero.
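The same check can be made directly by counting nulls per column. The following is a minimal sketch on a small made-up frame (`sample` is hypothetical and contains one missing value on purpose), not the real data set:

```python
import pandas as pd

# Hypothetical stand-in for the real data, with one missing score on purpose
sample = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'math score': [72, None, 90],
})

# isna() marks each missing cell as True; sum() then counts them per column
null_counts = sample.isna().sum()
print(null_counts)
```

On the real data, `df.isna().sum()` should report zero for every column if the reading of `df.info()` above is right.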

## ****Describing the Numeric Data****

In [None]:
df.describe()

This gives a quick look at the distribution of each numeric column: its count, mean, standard deviation, minimum, quartiles and maximum.

## ****Visualising the Numeric Data****

### ****Histograms****

In [None]:
df.hist(['math score', 'reading score', 'writing score'], ec = 'black', grid = False)

This shows that the distributions of all three scores are negatively skewed, with a longer tail towards the low scores.
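Skew can also be put into a number rather than judged by eye. A minimal sketch, using made-up scores (the `scores` frame is hypothetical) with a long tail towards low values:

```python
import pandas as pd

# Hypothetical scores: most values are high, one is very low
scores = pd.DataFrame({'math score': [20, 55, 60, 65, 68, 70, 72, 75, 78, 80]})

# Series.skew() returns the sample skewness;
# a negative value indicates a longer tail on the left (low) side
skewness = scores['math score'].skew()
print(skewness)
```

On the real data, `df[['math score', 'reading score', 'writing score']].skew()` would give one value per exam.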

### ****Boxplot****

In [None]:
df.boxplot(['math score', 'reading score', 'writing score'])

This shows similar information to the histograms. However, it also shows that each exam has numerous low scores that could be classed as outliers relative to the general set of scores.

In particular, maths has two low values that lie far from the rest of the distribution and could be classed as outliers.

These could show key insights into the kind of students that may need substantially more assistance when preparing for math exams. 
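To look at the students behind those points, the outliers can be selected with the same rule the boxplot draws by default (points more than 1.5 times the interquartile range beyond the quartiles). This is a sketch on made-up scores, and the `scores` series is hypothetical:

```python
import pandas as pd

# Hypothetical math scores with one unusually low value
scores = pd.Series([8, 45, 52, 58, 61, 64, 66, 70, 73, 77, 81, 88], name='math score')

# Quartiles and interquartile range
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is drawn as an individual point on a default boxplot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers)
```

Applied to the real data, the same mask on `df['math score']` would return the full rows, including the categorical columns, for the flagged students.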

## ****Checking values of Categorical Data****

### ****Value Counts****

### ****Gender****

In [None]:
df.gender.value_counts()

### ****Race/Ethnicity****

In [None]:
df['race/ethnicity'].value_counts()

### ****Parent's level of Education****

In [None]:
df['parental level of education'].value_counts()

### ****Lunch Variants****

In [None]:
df.lunch.value_counts()

### ****If they took a preparation course****

In [None]:
df['test preparation course'].value_counts()

## ****Visualising Categorical Data****

### ****Gender****

In [None]:
df.gender.value_counts().plot(kind = 'bar')

### ****Race/Ethnicity****

In [None]:
df['race/ethnicity'].value_counts().plot(kind = 'bar')

### ****Parental level of Education****

In [None]:
df['parental level of education'].value_counts().plot(kind = 'bar')

### ****Lunch Variants****

In [None]:
df.lunch.value_counts().plot(kind = 'bar')

### ****If they took a preparation course****

In [None]:
df['test preparation course'].value_counts().plot(kind = 'bar')

## ****Looking for Correlations in data****

### **Pivot Tables**

### **Parental education and exam scores?**

In [None]:
exams = df[['math score', 'reading score', 'writing score']]

for i in exams:
    exam_pivot = pd.pivot_table(df,columns = 'parental level of education', values = i, aggfunc = 'mean')
    display(exam_pivot)

In these pivot tables, it can be noted that the higher the parent's level of education, the higher the student's average score tends to be.

This can be seen throughout all three subjects.

The score increase may be due to the parents being able to support the student better both academically and financially through a higher-paying job (although this is entirely speculation).
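The trend is easier to read when the education levels are sorted by rank rather than alphabetically. A minimal sketch using an ordered categorical follows; the level names and their ranking in `education_order` are assumptions that should be checked against the value counts above, and the rows in `sample` are made up:

```python
import pandas as pd

# Assumed ranking, lowest to highest; verify the names against value_counts()
education_order = ['some high school', 'high school', 'some college',
                   "associate's degree", "bachelor's degree", "master's degree"]

# Made-up rows for illustration only
sample = pd.DataFrame({
    'parental level of education': ["master's degree", 'high school',
                                    'some college', 'high school'],
    'math score': [80, 60, 70, 64],
})

# An ordered categorical makes groupby sort by rank instead of alphabetically
sample['parental level of education'] = pd.Categorical(
    sample['parental level of education'],
    categories=education_order, ordered=True)

# observed=True keeps only the levels that actually appear in the data
mean_by_level = sample.groupby('parental level of education',
                               observed=True)['math score'].mean()
print(mean_by_level)
```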

### Visualising Pivot table

In [None]:
parent_to_math = df.groupby(['parental level of education'])['math score'].mean()
parent_to_math.plot.bar()

In [None]:
parent_to_reading = df.groupby(['parental level of education'])['reading score'].mean()
parent_to_reading.plot.bar()

In [None]:
parent_to_writing = df.groupby(['parental level of education'])['writing score'].mean()
parent_to_writing.plot.bar()

### Does doing a test preparation course help?

In [None]:
for i in exams:
    exam_pivot = pd.pivot_table(df,columns = 'test preparation course', values = i, aggfunc = 'mean')
    display(exam_pivot)

In all exams, students who completed a test preparation course scored higher on average than those who did not.
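The size of that gap can be computed for several exams at once with a single `groupby`. This is a sketch on made-up rows; the `sample` frame and the category labels `'completed'` and `'none'` are assumptions to be checked against the value counts above:

```python
import pandas as pd

# Made-up rows; the labels 'completed' and 'none' are assumed
sample = pd.DataFrame({
    'test preparation course': ['completed', 'none', 'completed', 'none'],
    'math score': [70, 60, 80, 66],
    'reading score': [75, 65, 85, 67],
})

# Mean score per exam for each group
means = sample.groupby('test preparation course')[['math score', 'reading score']].mean()

# Difference in mean score between the two groups, per exam
gap = means.loc['completed'] - means.loc['none']
print(gap)
```

A difference in means alone does not show that the course caused the higher scores, only that the two groups differ on average.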

### Exam score Correlation

In [None]:
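# restrict to the numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns
df.select_dtypes('number').corr()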

This table shows the correlation between each pair of exams, and how a student who does well in one subject tends to do well in another.

From this table, it can be seen that reading and writing are very strongly correlated, more so than maths is with either of those two subjects.

This makes sense as reading and writing are usually taught together and have a natural correlation.
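Individual pairs can be read out of the correlation table by label, which makes the comparison explicit. A minimal sketch on made-up scores (the `sample` frame is hypothetical, so the numbers are not those of the real data set):

```python
import pandas as pd

# Made-up scores for five students
sample = pd.DataFrame({
    'math score': [50, 60, 70, 80, 65],
    'reading score': [55, 70, 72, 78, 80],
    'writing score': [54, 71, 70, 79, 81],
})

# Pearson correlation between every pair of columns
corr = sample.corr()

# Look up single pairs by row and column label
reading_writing = corr.loc['reading score', 'writing score']
math_reading = corr.loc['math score', 'reading score']
print(reading_writing, math_reading)
```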