# Exploratory data analysis on the Student Performance dataset
The dataset describes student performance in three skills: maths, reading and writing.
It contains 1000 rows and 8 columns: gender, race/ethnicity, parental level of education, lunch type, completion status of the test preparation course, and the maths, reading and writing scores.

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot  as plt
%matplotlib inline
import scipy.stats
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data verification
Load the data into a pandas DataFrame and store it in the variable "data".

In [None]:
data = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")
data.head(5)

In [None]:
data.describe()

## Check missing values
The code below counts the null values in each column.

According to the output, this dataset does not contain any missing or null values, so no data cleaning is necessary.

In [None]:
#Checking for the null value
data.isnull().sum()

In [None]:
data.info()
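
As a complement to the null check above, here is a minimal sketch of how the same calls behave when values really are missing. The `toy` frame and its contents are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries
toy = pd.DataFrame({
    "math score": [72, np.nan, 90, 47],
    "lunch": ["standard", "standard", None, "free/reduced"],
})

missing_count = toy.isnull().sum()        # number of missing values per column
missing_pct = toy.isnull().mean() * 100   # share of missing values per column, in percent

print(pd.DataFrame({"count": missing_count, "percent": missing_pct}))
```

Reporting the percentage next to the raw count makes it easier to judge whether a column with gaps is still usable.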

# Data Category Distribution

In [None]:
for i in ['gender','race/ethnicity', 'parental level of education', 'lunch' ,"test preparation course"]:
    print(i+' distribution')
    print(data[i].value_counts())
    print('####################################################')

In [None]:
# Data Distribution Visualization

plt.figure(figsize=(15,10))
plt.subplot(2,3,1)
sns.countplot(data = data, x='gender')


plt.subplot(2,3,2)
sns.countplot(data = data, x='race/ethnicity')

plt.subplot(2,3,3)
sns.countplot(data = data, x='parental level of education')
plt.xticks(rotation = - 90 )

plt.subplot(2,3,4)
sns.countplot(data = data, x='lunch')

plt.subplot(2,3,5)
sns.countplot(data = data, x='test preparation course')

# Score Distribution

In [None]:
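plt.figure(figsize=(8,8))
sns.kdeplot(data['math score'], color='black', label='math score')
sns.kdeplot(data['writing score'], color='purple', label='writing score')
sns.kdeplot(data['reading score'], label='reading score')
plt.legend()  # identify which curve belongs to which score
plt.show()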

# Correlation Analysis

In [None]:
# numeric_only=True restricts the correlation to the numeric score columns
data.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(5,5))
# Correlate only the numeric score columns; the text columns cannot be correlated
sns.heatmap(data=data.corr(numeric_only=True), annot=True,
            fmt='.2f', linewidths=.5, cmap='Blues')
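
Pearson correlation is only defined for numeric columns, and recent pandas versions (2.0 and later, as far as I know) raise an error when text columns are passed to `corr()` without `numeric_only=True`. The sketch below shows an equivalent, explicit way to select the numeric columns first. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data mixing one text column with two numeric score columns
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [60, 70, 80, 90],
    "reading score": [65, 68, 85, 88],
})

# Keep only the numeric columns, then compute the Pearson correlation matrix
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```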

In [None]:
sns.set()
sns.pairplot(data)
plt.show()

Let us compute the average score of each student to make the visualizations easier.

The code below computes each student's average across the three scores and adds it to the dataframe as the column 'avg_score'.

In [None]:
#Compute each student's average across the three scores and add it as a column
score_cols=["math score","reading score","writing score"]
data['avg_score']=data[score_cols].mean(axis=1)
data.head()
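
A row-wise mean is a compact way to get a per-student average. The following sketch uses a two-row `toy` frame with made-up scores to show the calculation in isolation.

```python
import pandas as pd

# Hypothetical scores for two students
toy = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

score_cols = ["math score", "reading score", "writing score"]

# axis=1 averages across the columns of each row, giving one value per student
toy["avg_score"] = toy[score_cols].mean(axis=1)
print(toy)
```

Assigning with `toy["avg_score"] = ...` creates a real column, whereas attribute-style assignment to a new name does not.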

##### Let's visualize the relation between the students' scores and the other attributes

In [None]:
#Relation between the student's score and their race/ethnicity
sns.barplot(x='race/ethnicity',y='avg_score',data=data)
data.groupby('race/ethnicity').mean(numeric_only=True).style.background_gradient(cmap = "OrRd")

In [None]:
#Relation between the student's score and their lunch type
sns.barplot(x='lunch',y='avg_score',data=data)
data.groupby('lunch').mean(numeric_only=True).style.background_gradient(cmap = "OrRd")

In [None]:
#Relation between the student's score and their parent's educational level
sns.barplot(x='parental level of education',y='avg_score',data=data)
data.groupby('parental level of education').mean(numeric_only=True).style.background_gradient(cmap = "OrRd")

In [None]:
#Relation between the student's score and their completion status of the test preparation course
sns.barplot(x='test preparation course',y='avg_score',data=data)
data.groupby('test preparation course').mean(numeric_only=True).style.background_gradient(cmap = "OrRd")

### Key findings

Before the visualization, we might have assumed that a student's performance does not depend on any of the attributes listed in the dataset.

The visualizations suggest otherwise: the average score appears to vary with attributes such as gender, lunch type, completion status of the test preparation course, race/ethnicity and parental level of education.



In [None]:
#Get the mean math, reading and writing scores for each gender
gender_math_score=data.groupby("gender")[["math score"]].mean()
gender_math_score=gender_math_score.reset_index()
gender_read_score=data.groupby("gender")[["reading score"]].mean()
gender_read_score=gender_read_score.reset_index()
gender_write_score=data.groupby("gender")[["writing score"]].mean()
gender_write_score=gender_write_score.reset_index()

# Hypothesis test

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(3,2,1)

sns.boxplot(x='gender',y="math score", data=data)
plt.title('math score(boxplot)')
plt.subplot(3,2,2)
sns.barplot(x='gender',y="math score",data=gender_math_score)
plt.title('math score(Barplot)')


plt.subplot(3,2,3)
sns.boxplot(x='gender',y="writing score", data=data)
plt.title('writing score(boxplot)')
plt.subplot(3,2,4)
sns.barplot(x='gender',y="writing score", data=gender_write_score)
plt.title('writing score(Barplot)')


plt.subplot(3,2,5)
sns.boxplot(x='gender',y="reading score", data=data)
plt.title('reading score(boxplot)')

plt.subplot(3,2,6)
sns.barplot(x='gender',y="reading score", data=gender_read_score)
plt.title('reading score(Barplot)')

In [None]:
data_m_math = data[data['gender']=='male']
data_f_math = data[data['gender']=='female']

### 1, Hypothesis test of math score by gender

Based on the graphs above, we can state the following hypotheses:

H0 (null hypothesis) = There is no difference in mean math scores between genders.

H1 (alternative hypothesis) = There is a difference in mean math scores between genders.

In [None]:
scipy.stats.ttest_ind(data_m_math['math score'], data_f_math['math score'], equal_var=False)

Conclusion -> The p value is 8.42083810e-08 (p<0.05), so we reject the null hypothesis H0 in favour of the alternative hypothesis H1: mean math scores differ between genders.
In this dataset, male students scored higher in math on average than female students. The test concerns group means only and says nothing about individual students.
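
A p value below 0.05 indicates that two group means differ, but not by how much. One common way to quantify the size of the difference is Cohen's d. The sketch below runs Welch's t-test (the `equal_var=False` variant) and computes Cohen's d on synthetic data. The group means, spreads and sizes are invented for illustration, and `cohens_d` is a hypothetical helper, not a SciPy function.

```python
import numpy as np
from scipy import stats

# Synthetic scores for two groups (parameters are made up, not from the dataset)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=69, scale=14, size=200)
group_b = rng.normal(loc=64, scale=15, size=200)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

d = cohens_d(group_a, group_b)
print("t =", t_stat, "p =", p_value, "Cohen's d =", d)
```

By a common rule of thumb, |d| around 0.2 is considered small, 0.5 medium and 0.8 large, so a very small p value can still go with a modest effect.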

### 2, Hypothesis test of reading score by gender

H0 = There is no difference in mean reading scores between genders.

H1 = There is a difference in mean reading scores between genders.

In [None]:
scipy.stats.ttest_ind(data_m_math['reading score'], data_f_math['reading score'], equal_var=False)

Conclusion -> The p value is 4.37629e-15 (p<0.05), so we reject H0: mean reading scores differ between genders. In this dataset, female students scored higher in reading on average than male students.

### 3, Hypothesis test of writing score by gender

H0 = There is no difference in mean writing scores between genders.

H1 = There is a difference in mean writing scores between genders.

In [None]:
scipy.stats.ttest_ind(data_m_math['writing score'], data_f_math['writing score'], equal_var=False)

Conclusion -> The p value is 1.711809371e-22 (p<0.05), so we reject H0: mean writing scores differ between genders. In this dataset, female students scored higher in writing on average than male students.

### 4, Hypothesis test of parental education level by Average Score

First, let us look at the relation between the parental level of education and the average score of a student.

In [None]:
#Get the average score with respect to the parental level of education
p_avg_score=data.groupby(["parental level of education"])["avg_score"].mean()
p_avg_score=p_avg_score.reset_index()
p_avg_score

###### Visualizing the relation between parental level of education and the average score of a student

In [None]:
#Draw a bar chart to visualize the relation between parental education level and su
p_avg_score.plot(x="parental level of education",y=["avg_score"],kind="bar",color=['green'],figsize=(10,5))
plt.xlabel("Parent level of education",size=15)
plt.ylabel("Average Marks",size=15)
plt.title("Avg marks with respect to parental education level",size=20)
plt.show()

##### Let's state the hypotheses for the student's average score and parental level of education

H0 (null hypothesis) - the student's average score does not depend on parental level of education

H1 (alternative hypothesis) - the student's average score depends on parental level of education

In [None]:
from scipy.stats import f_oneway
#One-way ANOVA on the per-student average scores, grouped by parental level of education
groups=[g["avg_score"].values for _, g in data.groupby("parental level of education")]
res=f_oneway(*groups).pvalue
print("P value is",res)

The test compares the students' average scores across the parental education groups. If the printed p value is < 0.05, we reject the null hypothesis (H0) in favour of the alternative hypothesis (H1).

In that case the student's average score is associated with parental level of education, although the test alone does not show that one causes the other.
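
Testing whether a score depends on a categorical attribute with more than two levels needs the individual observations in each group, not only the group means. The sketch below runs a one-way ANOVA on a small made-up frame and also computes eta squared, the share of the total variance explained by the grouping. The scores in `toy` are invented for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical per-student averages for three education levels
toy = pd.DataFrame({
    "parental level of education": ["high school"] * 4 + ["some college"] * 4 + ["master's degree"] * 4,
    "avg_score": [60, 62, 65, 61, 66, 68, 70, 67, 72, 75, 74, 78],
})

# One array of raw scores per group
groups = [g["avg_score"].to_numpy() for _, g in toy.groupby("parental level of education")]
f_stat, p_value = f_oneway(*groups)

# Eta squared = between-group sum of squares / total sum of squares
grand_mean = toy["avg_score"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((toy["avg_score"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print("F =", f_stat, "p =", p_value, "eta squared =", eta_squared)
```

A significant ANOVA only says that at least two groups differ. Which groups differ would need a follow-up comparison.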

### Suggestions for next steps in analysing the data
We can examine the remaining factors, such as lunch type and completion status of the test preparation course, in more detail, and measure the correlation between these factors and the student's average score.
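
One possible way to measure the correlation between a two-valued factor and the average score is the point-biserial correlation, which is the Pearson correlation with the factor coded as 0/1. The sketch below uses made-up rows and assumes the factor takes the values `none` and `completed`.

```python
import pandas as pd
from scipy.stats import pointbiserialr

# Hypothetical rows; the values "none" and "completed" are assumed category labels
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed", "none", "completed"],
    "avg_score": [60, 75, 58, 80, 65, 72],
})

# Code the binary factor as 0 (none) and 1 (completed)
completed = (toy["test preparation course"] == "completed").astype(int)

# Point-biserial correlation between the 0/1 factor and the numeric score
r, p_value = pointbiserialr(completed, toy["avg_score"])
print("r =", r, "p =", p_value)
```

The same approach applies to any other two-valued column. Factors with more than two levels need a different measure.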

Going further, we can identify the main factors that affect a student's score and use them to help students improve. We can also try to predict a student's future score from the data collected for each attribute.

### Suggestions to improve the data

The dataset has very few attributes, and these may have only a limited impact on a student's score. More attributes are needed to understand in depth how different factors relate to the score. The dataset could include attributes such as **hours of study, extra tuition classes (yes/no) etc.**, which would improve the EDA. With a better understanding of the factors that affect performance, we can help students improve.
