## Student Exam Performance

In [None]:
# import all packages and set plots to be embedded inline
import pandas as pd 
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
# load the data
df=pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

In [None]:
# display the first 5 rows
df.head()

In [None]:
# look at the shape of the data
df.shape

In [None]:
# Display dataset info
df.info()

In [None]:
# Check the number of null values in each column
df.isnull().sum()

In [None]:
# Check for duplicate records
df.duplicated().sum()

In [None]:
# Display dataset summary statistics
df.describe()

In [None]:
# shorten two column names, then replace spaces with underscores in all column names
df.rename(columns={'race/ethnicity':'ethnicity','parental level of education':'parent_education'},inplace=True)
df.rename(columns=lambda x:x.strip().replace(' ','_'),inplace=True)

In [None]:
# check column names
df.columns

In [None]:
# Display gender Value Counts
df['gender'].value_counts()

In [None]:
# Display ethnicity Value Counts
df['ethnicity'].value_counts()

In [None]:
# Display parent_education Value Counts
df['parent_education'].value_counts()

In [None]:
# Display lunch Value Counts
df['lunch'].value_counts()

In [None]:
# Display test_preparation_course Value Counts
df['test_preparation_course'].value_counts()

## Exploration

In [None]:
# Plotting the distribution of gender
labels=df['gender'].value_counts().index
values=df['gender'].value_counts().values

plt.figure(figsize=(6,6))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Gender')
plt.show()

- Female students make up 51.8% of the data, slightly more than male students at 48.2%.
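
The percentages shown in the pie chart can also be read directly from the data. The following is a minimal sketch, assuming a hypothetical helper `category_shares` (not part of the original notebook) and a small made-up sample standing in for `df['gender']`:

```python
import pandas as pd

def category_shares(series: pd.Series) -> pd.Series:
    """Return the share of each category in percent, rounded to one decimal."""
    # normalize=True turns raw counts into fractions that sum to 1
    return (series.value_counts(normalize=True) * 100).round(1)

# Toy data for illustration only; in the notebook, pass df['gender'] instead
sample = pd.Series(['female', 'female', 'male', 'female', 'male'])
shares = category_shares(sample)
print(shares)
```

The same helper would work for any categorical column, such as `df['lunch']`.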

In [None]:
# Plotting the distribution of ethnicity
base_color=sb.color_palette()[0]
freq=df['ethnicity'].value_counts()
gen_order=freq.index
sb.countplot(data=df,x='ethnicity',color=base_color,order=gen_order);

- The most common ethnicity is group C, followed by group D, group B, group E, and group A.

In [None]:
# Plotting the distribution of test_preparation_course
sb.countplot(data=df,x='test_preparation_course');

- Most students did not take the test preparation course.

In [None]:
# Plotting the distribution of parent_education
base_color=sb.color_palette()[0]
freq=df['parent_education'].value_counts()
gen_order=freq.index
sb.countplot(data=df,x='parent_education',color=base_color,order=gen_order)
plt.xticks(rotation=45);

- The most common parental education levels are some college and associate's degree; master's degree is the least common.

In [None]:
# Plotting the distribution of lunch
labels=df['lunch'].value_counts().index
values=df['lunch'].value_counts().values

plt.figure(figsize=(6,6))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Lunch')
plt.show()


- Standard lunch is more common (64.5%) than free/reduced lunch (35.5%).

In [None]:
# plotting the relation between gender and each score
fig, ax = plt.subplots(figsize=(12,5), ncols=3)
fig.subplots_adjust(wspace=1/2)
ax[0].set_title('Math Score')
sb.boxplot(ax=ax[0],x=df['gender'],y=df['math_score'])


ax[1].set_title('Reading Score')
sb.boxplot(ax=ax[1],x=df['gender'],y=df['reading_score'])


ax[2].set_title('Writing Score')
sb.boxplot(ax=ax[2],x=df['gender'],y=df['writing_score']);

- On average, female students outperformed male students on the reading and writing tests, but not on the math test.
- Female scores show more outliers than male scores.
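
The outlier comparison can be checked numerically rather than by eye. The sketch below assumes the 1.5 × IQR rule, which is the default whisker setting for seaborn box plots; the helper `count_iqr_outliers` and the toy values are hypothetical and only illustrate the idea:

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Toy data for illustration only; in the notebook, use df directly
toy = pd.DataFrame({
    'gender': ['female'] * 6 + ['male'] * 6,
    'math_score': [10, 60, 62, 64, 66, 68, 58, 60, 62, 64, 66, 68],
})
# One outlier count per gender
outliers_by_gender = toy.groupby('gender')['math_score'].apply(count_iqr_outliers)
print(outliers_by_gender)
```

Applied to the real data, the same `groupby` call could be repeated for `reading_score` and `writing_score`.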

In [None]:
# plotting the relation between parent_education and math_score
plt.figure(figsize=(10,4))
sb.boxplot(x=df['parent_education'],y=df['math_score'],color= sb.color_palette()[0])
plt.xticks(rotation=30);


- On average, students performed better on the math test when their parents held a master's degree.
- Outliers appear at every parental education level except master's degree and associate's degree.

In [None]:
# plotting the relation between parent_education and gender
plt.figure(figsize=(12,5))
sb.countplot(x=df['parent_education'],hue=df['gender']);


- At each parental education level there are more female students than male students, except at the high school level, where there are more males.
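
The counts behind this plot can also be read as a table. This is a minimal sketch using `pd.crosstab` on made-up rows (the toy values are illustrative and not taken from the dataset); in the notebook, `df['parent_education']` and `df['gender']` would be passed instead:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'parent_education': ['high school', 'high school', 'some college',
                         'some college', 'some college', "master's degree"],
    'gender': ['male', 'male', 'female', 'female', 'male', 'female'],
})

# Rows: parental education level, columns: gender, cells: number of students
counts = pd.crosstab(toy['parent_education'], toy['gender'])
print(counts)
```

Combinations that never occur in the data appear as 0 in the table.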

In [None]:
# plotting the relation between test_preparation_course and each score
fig, ax = plt.subplots(figsize=(12,5), ncols=3)
fig.subplots_adjust(wspace=1/2)
ax[0].set_title('Math Score')
sb.boxplot(ax=ax[0],x=df['test_preparation_course'],y=df['math_score'])


ax[1].set_title('Reading Score')
sb.boxplot(ax=ax[1],x=df['test_preparation_course'],y=df['reading_score'])


ax[2].set_title('Writing Score')
sb.boxplot(ax=ax[2],x=df['test_preparation_course'],y=df['writing_score']);

- On average, students who completed the test preparation course performed better on all three tests.
- The group that did not take the course (none) shows more outliers than the group that completed it.

In [None]:
# plotting the relation between lunch and gender
sb.countplot(x=df['lunch'],hue=df['gender']);

- For both females and males, standard lunch is more common than free/reduced lunch.

In [None]:
# plotting the relation between math_score,reading_score and writing_score
pd.plotting.scatter_matrix(df,figsize=(8,8));

- There is a positive relationship between the scores on the three tests.

In [None]:
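# Plotting heatmap for math_score, reading_score and writing_score
# restrict to numeric columns; df.corr() raises an error on text columns in recent pandas versions
sb.heatmap(df.select_dtypes(include='number').corr(),annot=True,fmt='.2f',cmap='vlag_r',center=0,vmin=0);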

- The correlation coefficients between the three test scores are all positive, which means a student who scores higher on one test is expected to score higher on the others.
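
The coefficients annotated in the heatmap are Pearson correlations, which is the default method of `DataFrame.corr`. The sketch below shows how they could be computed on their own; the helper `numeric_correlations` and the toy values are hypothetical, and the text column is dropped first so it cannot break the computation:

```python
import pandas as pd

def numeric_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix over the numeric columns only."""
    # select_dtypes drops text columns such as gender before calling corr()
    return frame.select_dtypes(include='number').corr()

# Toy data for illustration only; in the notebook, pass df instead
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'math_score': [50, 60, 70, 80],
    'reading_score': [55, 65, 75, 85],
    'writing_score': [52, 63, 71, 84],
})
corr = numeric_correlations(toy)
print(corr.round(2))
```

A value close to 1 indicates that two scores rise together almost linearly.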

In [None]:
# Plotting a pair grid of mean scores for each categorical variable
g=sb.PairGrid(data=df,x_vars=['math_score','reading_score','writing_score']
              ,y_vars=['gender','ethnicity','parent_education','lunch','test_preparation_course'])
g.map(sb.barplot,color= sb.color_palette()[0]);

- Based on these figures, parent education and test preparation course appear to have an effect on the test scores.
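
Each bar in the grid is a group mean, since seaborn's `barplot` aggregates with the mean by default. The numbers behind the bars could be tabulated as in the sketch below; the helper `mean_scores_by` and the toy values are hypothetical and serve only as an illustration:

```python
import pandas as pd

def mean_scores_by(frame: pd.DataFrame, category: str, score_cols: list) -> pd.DataFrame:
    """Mean of each score column within each level of a categorical column."""
    return frame.groupby(category)[score_cols].mean()

# Toy data for illustration only; in the notebook, pass df instead
toy = pd.DataFrame({
    'test_preparation_course': ['completed', 'none', 'completed', 'none'],
    'math_score': [70, 60, 80, 50],
    'reading_score': [75, 65, 85, 55],
})
means = mean_scores_by(toy, 'test_preparation_course', ['math_score', 'reading_score'])
print(means)
```

Calling the helper once per categorical column (for example `parent_education` or `lunch`) gives one table per row of the grid.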

### Conclusions

I found that:
- There is a relationship between parent education, test preparation course, and test scores, which suggests that parent education and test preparation affect test scores.
- There is a positive correlation between the scores on the three tests, which means a student who scores higher on one test is expected to score higher on the others.
- On average, female students outperformed male students on the reading and writing tests, but not on the math test.
- For both females and males, standard lunch is more common than free/reduced lunch.