# Student Exam Exploratory Data Analysis & Data Visualization

## Overview

##### In this notebook I do exploratory data analysis on the students' exam performance data and visualize it so that it is easy to understand. I also try to draw conclusions from each visualization.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Prepare Data

### This section checks the data type of each column, whether the data has null values, and so on

Import the dataset using pandas

In [None]:
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
df.head()

Get summary information for each column: data type, unique values, mean, etc.

In [None]:
df.info()

In [None]:
df.describe(include='all')

Check whether the dataset has null values

In [None]:
df.isnull().sum()
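
The null check can be extended with a couple of other sanity checks. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the real dataset) and assumes that valid scores lie between 0 and 100, which the dataset description does not state explicitly here.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "math score": [72, 69, 90, 72],
    "reading score": [72, 90, 95, 72],
    "writing score": [74, 88, None, 74],
})
score_cols = ["math score", "reading score", "writing score"]

# Missing values per column (same call as in the cell above)
null_counts = toy.isnull().sum()

# Fully duplicated rows, which isnull() would not reveal
duplicate_rows = int(toy.duplicated().sum())

# Scores outside the assumed 0 to 100 range (missing values are not counted)
outside = (toy[score_cols] < 0) | (toy[score_cols] > 100)
out_of_range = int(outside.sum().sum())

print(null_counts)
print(f"duplicate rows: {duplicate_rows}, out-of-range scores: {out_of_range}")
```

Replacing `toy` with `df` would run the same checks on the full dataset.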

### Because the dataset has three exam score columns, I created a new column called 'average' containing the mean of the three exam scores

In [None]:
average = df[['math score', 'reading score','writing score']]
df['average'] = average.mean(axis='columns')
df
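
To make the row-wise mean concrete, here is a minimal sketch on two made-up students (the `toy` frame and its values are hypothetical). With `axis="columns"`, pandas averages across the three score columns within each row rather than down each column.

```python
import pandas as pd

# Two hypothetical students with made-up scores
toy = pd.DataFrame({
    "math score": [60, 90],
    "reading score": [70, 80],
    "writing score": [80, 100],
})
score_cols = ["math score", "reading score", "writing score"]

# One mean per row: (60 + 70 + 80) / 3 and (90 + 80 + 100) / 3
toy["average"] = toy[score_cols].mean(axis="columns")
print(toy)
```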

# 2. Exploratory Data Analysis (EDA) + Data Visualization

### In this section I do EDA and data visualization with seaborn to make the data easier to understand

Visualize the distributions of the numeric columns

In [None]:
sns.pairplot(data=df[['math score', 'reading score','writing score','average']], height=6, aspect=10/6)

Distribution of the three exam scores

In [None]:
figure, axes = plt.subplots(1, 3, sharex=True, figsize=(18,6))
figure.suptitle('Score Distribution')
sns.histplot(df['math score'], kde=True, ax=axes[0])
sns.histplot(df['writing score'], kde=True, ax=axes[1])
sns.histplot(df['reading score'], kde=True, ax=axes[2])

Show the mean of each of the three exam scores

In [None]:
print(f"Mean math score: {df['math score'].mean()}")
print(f"Mean reading score: {df['reading score'].mean()}")
print(f"Mean writing score: {df['writing score'].mean()}")
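
As an alternative to three separate `print` calls, several statistics can be computed for all score columns in one step. This is a minimal sketch on made-up numbers (the `toy` frame is hypothetical); the choice of mean, median and standard deviation is mine, not something the analysis above depends on.

```python
import pandas as pd

# Hypothetical scores for three students
toy = pd.DataFrame({
    "math score": [60, 70, 80],
    "reading score": [50, 70, 90],
    "writing score": [65, 70, 75],
})
score_cols = ["math score", "reading score", "writing score"]

# One row per statistic, one column per exam
summary = toy[score_cols].agg(["mean", "median", "std"]).round(2)
print(summary)
```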

# 2.1 Explore Data by Gender of the Student

### This section looks at how the 'gender' column relates to the score of each exam, with visualizations of the data.

In [None]:
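# numeric_only=True averages only the numeric score columns;
# pandas 2.0+ raises a TypeError if asked to average the text columns
gender_data = df.groupby(['gender']).mean(numeric_only=True)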
gender_data.reset_index(inplace=True)
gender_data
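
The group-then-average step can be seen in isolation on a tiny example. The frame below is made up (`toy` and its values are hypothetical). It includes a text column to show that `numeric_only=True` leaves it out, and it uses `as_index=False` as one way to keep `gender` as a regular column without a separate `reset_index` call.

```python
import pandas as pd

# Hypothetical rows: one text column besides the grouping key, one score column
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [60, 70, 80, 90],
})

# Average the numeric columns within each gender; 'lunch' is skipped
gender_means = toy.groupby("gender", as_index=False).mean(numeric_only=True)
print(gender_means)
```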

In [None]:
figure, axes = plt.subplots(1, 4, sharex=True, figsize=(18,6))
figure.suptitle('Mean Score Visualize by Gender')
sns.barplot(x='gender', y='math score', data=gender_data,palette='pastel', ax=axes[0])
axes[0].set_title('Math Score')
sns.barplot(x='gender', y='reading score', data=gender_data,palette='pastel', ax=axes[1])
axes[1].set_title('Reading Score')
sns.barplot(x='gender', y='writing score', data=gender_data,palette='pastel', ax=axes[2])
axes[2].set_title('Writing Score')
sns.barplot(x='gender', y='average', data=gender_data,palette='pastel', ax=axes[3])
axes[3].set_title('Average Score')
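
The four near-identical `barplot` calls can also be written as a loop, which may be easier to maintain when the same grid is repeated for other columns. This is only a sketch: `toy_means` holds made-up group means standing in for `gender_data`, the `panels` mapping is a name I introduce here, and the `palette` argument is left out to keep the example short.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical group means standing in for gender_data
toy_means = pd.DataFrame({
    "gender": ["female", "male"],
    "math score": [60.0, 70.0],
    "reading score": [75.0, 65.0],
    "writing score": [74.0, 63.0],
    "average": [69.7, 66.0],
})

# Column to plot -> panel title
panels = {
    "math score": "Math Score",
    "reading score": "Reading Score",
    "writing score": "Writing Score",
    "average": "Average Score",
}

figure, axes = plt.subplots(1, len(panels), sharex=True, figsize=(18, 6))
figure.suptitle("Mean Score by Gender")

# One bar chart per score column, each on its own axis
for ax, (column, title) in zip(axes, panels.items()):
    sns.barplot(x="gender", y=column, data=toy_means, ax=ax)
    ax.set_title(title)
```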

In [None]:
figure, axes = plt.subplots(2, 2, sharex=True, figsize=(18,10))
figure.suptitle('Gender Visualize')
sns.countplot(x='gender',data=df,palette='pastel', ax=axes[0][0])
sns.boxplot(x="gender", y="average", data=df, palette='pastel', ax=axes[0][1])
sns.violinplot(x="gender", y="average", data=df, palette='pastel', ax=axes[1][0])
sns.stripplot(x="gender", y="average", data=df,jitter=True, palette='pastel', ax=axes[1][1])

## Conclusion
* There are more female students than male students
* Male students score higher than female students on the math exam, but lower on the reading and writing exams
* On the average of the three exams, female students score higher than male students
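
Statements like these can be backed by a number by subtracting one group's means from the other's. A minimal sketch on made-up scores (the `toy` frame is hypothetical and does not reproduce the real results):

```python
import pandas as pd

# Hypothetical scores for four students
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [60, 70, 70, 80],
    "reading score": [80, 90, 60, 70],
})

# Mean of each score column per gender
means = toy.groupby("gender").mean(numeric_only=True)

# Positive values: female mean is higher; negative values: male mean is higher
gap = means.loc["female"] - means.loc["male"]
print(gap)
```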

# 2.2 Explore Data by Race/Ethnicity of the Student

### This section looks at how the 'race/ethnicity' column relates to the score of each exam, with visualizations of the data.

In [None]:
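# numeric_only=True averages only the numeric score columns (required in pandas 2.0+)
race_data = df.groupby(['race/ethnicity']).mean(numeric_only=True)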
race_data.reset_index(inplace=True)
race_data

In [None]:
figure, axes = plt.subplots(1, 4, sharex=True, figsize=(18,6))
figure.suptitle('Mean Score Visualize by Race/Ethnicity')
sns.barplot(x='race/ethnicity', y='math score', data=race_data,palette='pastel', ax=axes[0])
axes[0].set_title('Math Score')
sns.barplot(x='race/ethnicity', y='reading score', data=race_data,palette='pastel', ax=axes[1])
axes[1].set_title('Reading Score')
sns.barplot(x='race/ethnicity', y='writing score', data=race_data,palette='pastel', ax=axes[2])
axes[2].set_title('Writing Score')
sns.barplot(x='race/ethnicity', y='average', data=race_data,palette='pastel', ax=axes[3])
axes[3].set_title('Average Score')

In [None]:
figure, axes = plt.subplots(2, 2, sharex=True, figsize=(18,10))
figure.suptitle('Race/Ethnicity Visualize')
sns.countplot(x='race/ethnicity',data=df,palette='pastel', ax=axes[0][0])
sns.boxplot(x="race/ethnicity", y="average",data=df, palette='pastel', ax=axes[0][1])
sns.violinplot(x="race/ethnicity", y="average", data=df, palette='pastel', ax=axes[1][0])
sns.stripplot(x="race/ethnicity", y="average", data=df,jitter=True, palette='pastel', ax=axes[1][1])

## Conclusion
* Ethnicity 'b' has the most students, followed by ethnicity 'd', 'a', 'e', and 'c'
* Students with ethnicity 'e' had the highest scores on all three exams
* Students with ethnicity 'e' also had the highest average of the three exam scores, followed by ethnicity 'd', 'c', 'b', and 'a'
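
Group sizes and rankings like these are easy to misread from a plot, so it can help to compute them directly. The sketch below uses a made-up frame (`toy`, its group labels and its values are all hypothetical) to show `value_counts` for the sizes and a sorted group mean for the ranking.

```python
import pandas as pd

# Hypothetical students; the labels only mimic the style of the real column
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B", "group B",
                       "group C", "group C", "group C"],
    "average": [60.0, 70.0, 80.0, 65.0, 70.0, 75.0],
})

# Number of students per group, largest first
group_sizes = toy["race/ethnicity"].value_counts()

# Mean of 'average' per group, highest first
group_means = (
    toy.groupby("race/ethnicity")["average"].mean().sort_values(ascending=False)
)

print(group_sizes)
print(group_means)
```

Running the same two statements on `df` gives the exact ordering to quote in the conclusions.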

# 2.3 Explore Data by Parents' Education

### This section looks at how the 'parental level of education' column relates to the score of each exam, with visualizations of the data.

In [None]:
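# numeric_only=True averages only the numeric score columns (required in pandas 2.0+)
education_data = df.groupby(['parental level of education']).mean(numeric_only=True)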
education_data.reset_index(inplace=True)
education_data

In [None]:
figure, axes = plt.subplots(4, 1, sharex=True, figsize=(16,14))
figure.suptitle('Mean Score Visualize by Parent Education')
sns.barplot(x='parental level of education', y='math score', data=education_data, palette='pastel', ax=axes[0])
axes[0].set_title('Math Score')
sns.barplot(x='parental level of education', y='reading score', data=education_data, palette='pastel', ax=axes[1])
axes[1].set_title('Reading Score')
sns.barplot(x='parental level of education', y='writing score', data=education_data, palette='pastel', ax=axes[2])
axes[2].set_title('Writing Score')
sns.barplot(x='parental level of education', y='average', data=education_data, palette='pastel', ax=axes[3])
axes[3].set_title('Average Score')

In [None]:
figure, axes = plt.subplots(2, 2, sharex=True, figsize=(18,10))
figure.suptitle('parental level of education Visualize')
sns.countplot(x='parental level of education',data=df,palette='pastel', ax=axes[0][0])
sns.boxplot(x="parental level of education", y="average", data=df, palette='pastel', ax=axes[0][1])
sns.violinplot(x="parental level of education", y="average", data=df, palette='pastel', ax=axes[1][0])
sns.stripplot(x="parental level of education", y="average", data=df,jitter=True, palette='pastel', ax=axes[1][1])

## Conclusion
* The largest group of students has parents with a 'bachelor's degree', and the smallest group has parents with a 'high school' level of education
* Students whose parents have a 'master's degree' have the highest exam scores
* Students whose parents have a 'master's degree' also have the highest average exam score, followed by students whose parents have a 'bachelor's degree'
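
Education levels have a natural order, and tables or plots are easier to read when they follow it. One option is an ordered categorical. In this sketch the `toy` frame is made up, and the full list in `education_order` is an assumption: only 'high school', 'bachelor's degree' and 'master's degree' are named in the text above, so the remaining labels should be checked against `df['parental level of education'].unique()`.

```python
import pandas as pd

# Assumed ordering from lowest to highest level; verify the labels against the data
education_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]

# Hypothetical students
toy = pd.DataFrame({
    "parental level of education": [
        "master's degree", "high school", "bachelor's degree", "high school",
    ],
    "average": [80.0, 60.0, 75.0, 70.0],
})

# Turn the text column into an ordered categorical
toy["parental level of education"] = pd.Categorical(
    toy["parental level of education"], categories=education_order, ordered=True
)

# Groups now come out in education order instead of alphabetical order;
# observed=True drops levels that do not occur in the data
education_means = toy.groupby(
    "parental level of education", observed=True
)["average"].mean()
print(education_means)
```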

# 2.4 Explore Data by Lunch of the Student

### This section looks at how the 'lunch' column relates to the score of each exam, with visualizations of the data.

In [None]:
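# numeric_only=True averages only the numeric score columns (required in pandas 2.0+)
lunch_data = df.groupby(['lunch']).mean(numeric_only=True)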
lunch_data.reset_index(inplace=True)
lunch_data

In [None]:
figure, axes = plt.subplots(1, 4, sharex=True, figsize=(18,6))
figure.suptitle('Mean Score Visualize by Lunch')
sns.barplot(x='lunch', y='math score', data=lunch_data, palette='pastel', ax=axes[0])
axes[0].set_title('Math Score')
sns.barplot(x='lunch', y='reading score', data=lunch_data, palette='pastel', ax=axes[1])
axes[1].set_title('Reading Score')
sns.barplot(x='lunch', y='writing score', data=lunch_data, palette='pastel', ax=axes[2])
axes[2].set_title('Writing Score')
sns.barplot(x='lunch', y='average', data=lunch_data, palette='pastel', ax=axes[3])
axes[3].set_title('Average Score')

In [None]:
figure, axes = plt.subplots(2, 2, sharex=True, figsize=(18,10))
figure.suptitle('lunch Visualize')
sns.countplot(x='lunch',data=df,palette='pastel', ax=axes[0][0])
sns.boxplot(x="lunch", y="average", data=df, palette='pastel', ax=axes[0][1])
sns.violinplot(x="lunch", y="average", data=df, palette='pastel', ax=axes[1][0])
sns.stripplot(x="lunch", y="average", data=df,jitter=True, palette='pastel', ax=axes[1][1])

## Conclusion
* More students have a standard lunch than a free/reduced lunch
* Students with a standard lunch have higher scores on all three exams
* Students with a standard lunch also have a higher average exam score than students with a free/reduced lunch
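
The group size and the group mean behind these three points can be put side by side in one table with named aggregation. A minimal sketch on made-up data (the `toy` frame and the output column names `students` and `mean_average` are hypothetical):

```python
import pandas as pd

# Hypothetical students
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced"],
    "average": [70.0, 80.0, 90.0, 60.0],
})

# One row per lunch type: how many students, and their mean average score
lunch_summary = toy.groupby("lunch").agg(
    students=("average", "size"),
    mean_average=("average", "mean"),
)
print(lunch_summary)
```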

# 2.5 Explore Data by Test Preparation of the Student

### This section looks at how the 'test preparation course' column relates to the score of each exam, with visualizations of the data.

In [None]:
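# numeric_only=True averages only the numeric score columns (required in pandas 2.0+)
test_data = df.groupby(['test preparation course']).mean(numeric_only=True)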
test_data.reset_index(inplace=True)
test_data
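
The same group-and-average step is used for all five categorical columns, so it could be wrapped in a small helper. The function name `summarize_by` and the `toy` frame below are hypothetical; this is one possible refactoring, not how the notebook is written.

```python
import pandas as pd

def summarize_by(frame, column):
    """Return the mean of every numeric column, one row per value of `column`."""
    return frame.groupby(column, as_index=False).mean(numeric_only=True)

# Hypothetical students
toy = pd.DataFrame({
    "test preparation course": ["completed", "none", "none"],
    "gender": ["female", "male", "female"],
    "average": [80.0, 60.0, 70.0],
})

prep_means = summarize_by(toy, "test preparation course")
print(prep_means)
```

With this helper, a call such as `summarize_by(df, 'lunch')` would produce the same kind of table as the cells above.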

In [None]:
figure, axes = plt.subplots(1, 4, sharex=True, figsize=(18,6))
figure.suptitle('Mean Score Visualize by Test Preparation Course')
sns.barplot(x='test preparation course', y='math score', data=test_data, palette='pastel', ax=axes[0])
axes[0].set_title('Math Score')
sns.barplot(x='test preparation course', y='reading score', data=test_data,palette='pastel', ax=axes[1])
axes[1].set_title('Reading Score')
sns.barplot(x='test preparation course', y='writing score', data=test_data,palette='pastel', ax=axes[2])
axes[2].set_title('Writing Score')
sns.barplot(x='test preparation course', y='average', data=test_data,palette='pastel', ax=axes[3])
axes[3].set_title('Average Score')

In [None]:
figure, axes = plt.subplots(2, 2, sharex=True, figsize=(18,10))
figure.suptitle('test preparation course Visualize')
sns.countplot(x='test preparation course',data=df,palette='pastel', ax=axes[0][0])
sns.boxplot(x="test preparation course", y="average", data=df, palette='pastel', ax=axes[0][1])
sns.violinplot(x="test preparation course", y="average", data=df, palette='pastel', ax=axes[1][0])
sns.stripplot(x="test preparation course", y="average", data=df,jitter=True, palette='pastel', ax=axes[1][1])

## Conclusion
* More students did not complete the test preparation course than completed it
* Students who completed the test preparation course have higher exam scores than students who did not take it
* Students who completed the test preparation course also have a higher average exam score

# 3. Conclusion

In this section I summarize the conclusions drawn from the data visualized above

## 1. Female students tend to have higher average exam scores than male students
## 2. Although it may seem strange, a student's ethnicity appears to be related to exam scores: students with ethnicity 'e' have the highest exam scores compared with students of other ethnicities
## 3. The level of education of the student's parents also appears to be related to the student's exam scores: students whose parents have a 'master's degree' have the highest average exam scores, while students whose parents have a 'high school' level of education have the lowest
## 4. Students with a standard lunch have a higher average exam score than students with a free/reduced lunch
## 5. Students who completed the test preparation course clearly have a higher average exam score than students who did not complete it
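
To compare how strongly each column separates the averages, one rough measure is the spread between its highest and lowest group mean. The sketch below uses made-up data (the `toy` frame is hypothetical). The spread is a simple descriptive number of my choosing; it does not show that any column causes the difference.

```python
import pandas as pd

# Hypothetical students
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "average": [80.0, 70.0, 60.0, 50.0],
})
categorical_cols = ["gender", "lunch"]

# For each column: highest group mean minus lowest group mean
spread = {}
for column in categorical_cols:
    group_means = toy.groupby(column)["average"].mean()
    spread[column] = group_means.max() - group_means.min()

# Largest spread first
spread = pd.Series(spread).sort_values(ascending=False)
print(spread)
```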