# Import the libraries

First, we import all of the libraries we need today. I will be using Plotly Express and plotly.graph_objects a lot. Compared to matplotlib and seaborn, they are quick and user-friendly. I highly recommend checking out those libraries if you are a newbie like me.

I do still need seaborn and matplotlib here and there to support my visualizations, though.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import plotly.express as px  # Plotly Express ships inside plotly since version 4
import plotly.graph_objects as go
import seaborn as sns

# First look at the dataset

First, we take a look at the first few entries in our dataset to get an overview of what we are working with.

In [None]:
data = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
data.head(15)

# Checking the dataset

In [None]:
data.isnull().sum()

In [None]:
data.describe()

Great! We don't have any null values in this dataset. Also, judging from the min and max of each score column, there are no irregular scores outside the 0 to 100 range.
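The range check can also be made explicit instead of being read off `describe()`. This is a minimal sketch: the tiny `sample` frame is made up for illustration and stands in for `data`, and only the three score column names are taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'math score': [72, 0, 100],
    'reading score': [90, 17, 100],
    'writing score': [88, 10, 100],
})

score_cols = ['math score', 'reading score', 'writing score']

# One boolean per column: True if every value lies in the 0-100 range
in_range = sample[score_cols].apply(lambda col: col.between(0, 100).all())
print(in_range)

# Total number of missing values across the score columns
n_missing = int(sample[score_cols].isnull().sum().sum())
print(n_missing)
```

Swapping `sample` for `data` should run the same check on the full dataset.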

# EDA

After taking a look at the dataset, I decided to focus on how each of the other variables relates to the scores.

# The link between math, reading and writing score

First, out of the 3 subjects here, 2 are language-oriented. Thus, I assume that students who have a high score in reading will tend to do better in writing, and vice versa.

I'm going to check the validity of my assumption with a heatmap, which shows how the scores of the subjects correlate with each other.

In [None]:
plt.figure(figsize=(10,5))
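# numeric_only=True skips the text columns, which newer pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True),annot=True)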

Even the lowest correlation in this heatmap is about 0.8, which indicates a strong positive linear relationship. Overall, we can say that if a student has a good score in any one of these subjects, they are likely to have good scores in the other subjects too, and vice versa.

In particular, the link between reading and writing is the strongest of all the subject pairs, which is consistent with our initial assumption.
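Rather than reading the strongest pair off the heatmap, it can be picked out programmatically. This is a small sketch on made-up scores (the `sample` frame is hypothetical and only mimics the column names of the dataset):

```python
from itertools import combinations

import pandas as pd

# Made-up scores standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'math score': [50, 60, 70, 80, 90],
    'reading score': [55, 70, 65, 85, 95],
    'writing score': [54, 71, 66, 84, 96],
})

corr = sample.corr()

# Correlation for every unordered pair of different subjects
pair_corr = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# The pair with the highest correlation coefficient
strongest_pair = max(pair_corr, key=pair_corr.get)
print(strongest_pair, round(pair_corr[strongest_pair], 3))
```

On the real data, `data.corr(numeric_only=True)` would take the place of `sample.corr()`.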

# Gender

The gender ratio in this dataset:

In [None]:
# rename_axis/reset_index(name=...) give stable column names across pandas versions
grouped = data['gender'].value_counts().rename_axis('gender').reset_index(name='count')
fig = px.pie(data_frame=grouped, names='gender', values='count', color_discrete_sequence=px.colors.sequential.RdBu, title='Gender')
fig.update_traces(textposition='inside', textinfo='percent+label', hoverinfo='label+percent')
fig.show()

Almost 50/50 :D Let's move on to race/ethnicity.

# Race and gender

In [None]:
female = data[data['gender'] == 'female']
male = data[data['gender'] == 'male']
# count students per group; explicit column names work across pandas versions
female = female['race/ethnicity'].value_counts().rename_axis('race/ethnicity').reset_index(name='count')
male = male['race/ethnicity'].value_counts().rename_axis('race/ethnicity').reset_index(name='count')

# grouped bar chart
fig = go.Figure(data=[
    go.Bar(name='Female', x=female['race/ethnicity'], y=female['count']),
    go.Bar(name='Male', x=male['race/ethnicity'], y=male['count'])])
fig.update_layout(barmode='group')
fig.show()

How do race and gender relate to the scores, then?

In [None]:
plt.figure(figsize=(20,8))
# one bar plot per subject, split by gender
for i, subject in enumerate(['math', 'reading', 'writing'], start=1):
    plt.subplot(1, 3, i)
    plt.title(f'{subject.upper()} SCORES')
    sns.barplot(x='race/ethnicity', y=f'{subject} score', data=data, hue='gender', palette='gist_heat')
plt.show()

Generally, male students score higher than female students in math. However, in the language subjects (reading and writing), female students clearly score higher than male students.

Race does seem to play a role here as well. In all subjects, group E comes out on top, followed closely by group D.

# Test preparation course effectiveness

In [None]:
# create pivot table with the mean score per subject for each course status
gen2 = data.pivot_table(index=['test preparation course'], values=['math score','reading score','writing score'], aggfunc='mean')
gen2 = gen2.reset_index()

# draw chart
fig = go.Figure(data=[
    go.Bar(name='Math', y=gen2['math score'], x=gen2['test preparation course']),
    go.Bar(name='Reading', y=gen2['reading score'], x=gen2['test preparation course']),
    go.Bar(name='Writing', y=gen2['writing score'], x=gen2['test preparation course'])])
fig.update_layout(barmode='group')
fig.show()

It looks like the money spent on the preparation course wasn't wasted after all. The course makes the most difference in subjects that require more practice, like writing, where the average rises by about 10 points. Reading and math scores improve too, but not as much: the difference in math is approximately 5 points.
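The point gains can be computed directly instead of being estimated from the bars. A minimal sketch, using a made-up `sample` frame (hypothetical numbers, same column names and course labels as the dataset):

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'test preparation course': ['none', 'none', 'completed', 'completed'],
    'math score': [60, 70, 68, 72],
    'reading score': [62, 68, 70, 76],
    'writing score': [60, 66, 72, 74],
})

score_cols = ['math score', 'reading score', 'writing score']

# Mean score per subject for each course status
means = sample.groupby('test preparation course')[score_cols].mean()

# Points gained by students who completed the course, per subject
gain = means.loc['completed'] - means.loc['none']
print(gain.sort_values(ascending=False))
```

Replacing `sample` with `data` should give the per-subject gains for the full dataset.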


# Lunch, Preparation Course and Investment in Education

The above conclusion raises an even bigger question about investment in education: is it even worth it?

Looking at all of the factors involved, I think lunch type and the test preparation course say the most about the investment one is willing to make. Let's test this assumption.

In [None]:
# pivot table with the mean score per subject for each lunch/course combination
gen3 = data.pivot_table(index=['lunch','test preparation course'], values=['math score','reading score','writing score'], aggfunc='mean')
gen3 = gen3.reset_index()
# combined label for the x-axis, e.g. "standard lunch/completed course"
gen3['lunch and test preparation'] = gen3['lunch'] + ' lunch/' + gen3['test preparation course'] + ' course'
# chart
fig = go.Figure(data=[
    go.Bar(name='Math', y=gen3['math score'], x=gen3['lunch and test preparation']),
    go.Bar(name='Reading', y=gen3['reading score'], x=gen3['lunch and test preparation']),
    go.Bar(name='Writing', y=gen3['writing score'], x=gen3['lunch and test preparation'])])
fig.update_layout(barmode='group')
fig.show()


So far, this comparison produces the most pronounced result. The average score of students who have standard lunch and completed the preparation course is 26% to 27% higher than that of students who have free/reduced lunch and did not take the course. This is unsurprising if, as assumed above, lunch type and participation in the preparation course reflect the investment put into a student's education.

Comparing the two middle options, free/reduced lunch with a completed course versus standard lunch with no course, the difference in reading and writing is barely distinguishable. However, the free/reduced lunch group lags a little behind in math, probably because, as noted earlier, math is the subject that improves least from taking the course.
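The percentage gap between the two extreme groups can be computed from the group means. This is a rough sketch on a made-up `sample` frame. The numbers are hypothetical and only the column names and category labels follow the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'lunch': ['free/reduced', 'free/reduced', 'standard', 'standard'],
    'test preparation course': ['none', 'none', 'completed', 'completed'],
    'math score': [40, 60, 70, 80],
    'reading score': [55, 65, 70, 80],
    'writing score': [45, 55, 55, 65],
})

score_cols = ['math score', 'reading score', 'writing score']

# Mean score per subject for each lunch/course combination
means = sample.groupby(['lunch', 'test preparation course'])[score_cols].mean()

best = means.loc[('standard', 'completed')]
worst = means.loc[('free/reduced', 'none')]

# How much higher (in percent) the best group scores than the worst group
pct_higher = ((best / worst - 1) * 100).round(1)
print(pct_higher)
```

Note that this is a relative difference (a ratio of means), not a difference in points, which is what a statement like "X% higher" refers to.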

# Parental level of education

In [None]:
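# sum the three subject scores into one total per student
data['Total score'] = data['math score'] + data['reading score'] + data['writing score']
# chart drawing
fig, ax = plt.subplots()
sns.barplot(x='parental level of education', y='Total score', data=data, palette='Paired')
fig.autofmt_xdate()  # rotate the long x-axis labels so they don't overlap
plt.show()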

Overall, the higher the parents' level of education, the more likely their children are to score higher. Besides genetic factors, this is possibly due to the proportionally larger investment those parents are willing to make in their children's education, as well as various other external factors.
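The trend is easier to judge when the education levels are sorted from lowest to highest rather than in the order they happen to appear. Here is a minimal sketch; the ranking in `education_order` is my own assumption about how the levels should be ordered, and the `sample` frame is made up for illustration.

```python
import pandas as pd

# Assumed ranking from lowest to highest level (an assumption, not part of the dataset)
education_order = [
    'some high school', 'high school', 'some college',
    "associate's degree", "bachelor's degree", "master's degree",
]

# Made-up rows standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'parental level of education': [
        "master's degree", 'high school', "associate's degree",
        'some college', 'some high school', "bachelor's degree",
    ],
    'Total score': [230, 180, 205, 200, 170, 215],
})

# Mean total score per level, rearranged into the assumed ranking
mean_total = (
    sample.groupby('parental level of education')['Total score']
    .mean()
    .reindex(education_order)
)
print(mean_total)
```

The same list can be passed to seaborn as `order=education_order` in `sns.barplot` so that the bars appear in that order.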