## EDA Student Performance Indicator

### 1) Problem statement
- This project aims to understand how student performance (test scores) is affected by other variables such as gender, ethnicity, parental level of education, lunch, and test preparation course.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?datasetId=74977
- The data consists of 8 columns and 1000 rows.

### 3) Dataset Information
- gender : sex of the student -> (male/female)
- race/ethnicity : ethnicity of the student -> (group A, B, C, D, E)
- parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school)
- lunch : type of lunch the student has before the test -> (standard or free/reduced)
- test preparation course : whether the course was completed before the test
- math score
- reading score
- writing score
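
The column names listed above contain spaces and a slash, whereas the code cells below refer to underscore names such as `race_ethnicity` and `math_score`. If the file being loaded still carried the names as listed here (an assumption; `stud.csv` may already use underscores), a minimal sketch for normalizing them could look like this. The `raw` frame is an empty, hypothetical stand-in used only to show the renaming.

```python
import pandas as pd

# Hypothetical empty frame whose columns match the names listed above.
raw = pd.DataFrame(columns=[
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
])

# Replace every space or slash in a column name with an underscore.
raw.columns = raw.columns.str.replace(r"[ /]", "_", regex=True)

normalized_columns = list(raw.columns)
print(normalized_columns)
```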

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Read the dataset
df=pd.read_csv('stud.csv')
df

In [None]:
df.shape

### 4) Data Checks to Perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column
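
The checks listed above can also be gathered into one helper so they are easy to rerun after any cleaning step. This is only a sketch: `run_data_checks` and the `toy` frame are hypothetical names with made-up values, not part of the original notebook.

```python
import pandas as pd

def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic data-quality checks in a single dictionary."""
    categorical = frame.select_dtypes(exclude="number").columns
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
        "statistics": frame.describe(),
        "categories": {
            col: sorted(frame[col].dropna().unique()) for col in categorical
        },
    }

# Made-up rows: one exact duplicate and one missing score.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math_score": [70, 55, 55, None],
})

report = run_data_checks(toy)
print(report["missing_per_column"])
print(report["duplicate_rows"])
```

On the real data the same call would be `run_data_checks(df)`.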

In [None]:
## check missing Values
df.isna().sum()

## Insights or Observation
There are no missing values

In [None]:
## Check duplicates
df.duplicated().sum()

There are no duplicate values in the dataset.

In [None]:
## check datatypes
df.info()

In [None]:
## Check the number of unique values in each column
df.nunique()         # Counts the distinct values along the given axis (here, per column)

In [None]:
## Check the statistics of the dataset
df.describe()

## Insights or Observation
- From the above description of the numerical data, all means are very close to each other, between about 66 and 69.
- The standard deviations are also close, between 14.6 and 15.19.
- The minimum math score is 0, whereas the minimum reading and writing scores are 17 and 10, respectively.
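
To compare only the three statistics discussed above, the summary can be restricted with `DataFrame.agg`. The sketch below uses made-up scores (the `scores` frame is hypothetical) with the column names used elsewhere in this notebook.

```python
import pandas as pd

# Made-up scores for illustration; not taken from stud.csv.
scores = pd.DataFrame({
    "math_score": [0, 60, 90],
    "reading_score": [17, 70, 95],
    "writing_score": [10, 65, 93],
})

# One row per statistic, one column per subject.
summary = scores.agg(["mean", "std", "min"])
print(summary.round(2))
```

On the real data, the equivalent call would be `df[['math_score', 'reading_score', 'writing_score']].agg(['mean', 'std', 'min'])`.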

In [None]:
## Explore more info about the data
df.head()

In [None]:
df.tail()

In [None]:
[feature for feature in df.columns if df[feature].dtype=='O']

In [None]:
# Segregate numerical and categorical features
numerical_features=[feature for feature in df.columns if df[feature].dtype!='O']
categorical_feature=[feature for feature in df.columns if df[feature].dtype=='O']

In [None]:
numerical_features

In [None]:
categorical_feature

In [None]:
df['gender'].value_counts()

In [None]:
df['race_ethnicity'].value_counts()

In [None]:
## Compute the total score and the average score for each student

df['total_score']=(df['math_score']+df['reading_score']+df['writing_score'])
df['average']=df['total_score']/3
df.head()

In [None]:
### Explore More Visualization
fig,axis=plt.subplots(1,2,figsize=(15,7))
plt.subplot(121)
sns.histplot(data=df,x='average',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average',bins=30,kde=True,hue='gender')

fig, axis: This line initializes a figure and its axes. fig is a reference to the entire figure, while axis is an array of Axes objects. In this case, since plt.subplots(1, 2, ...) is called, axis contains two elements, one for each subplot.
plt.subplots(1, 2, figsize=(15, 7)): This function call creates a figure and a set of subplots. (1, 2) indicates that there will be 1 row and 2 columns of subplots, and figsize=(15, 7) sets the size of the figure to 15 inches wide and 7 inches high.

plt.subplot(121): This line specifies that subsequent plotting commands will be drawn on the first subplot (1 row, 2 columns, first position). 121 is shorthand for 1, 2, 1.
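
As an alternative to selecting the active subplot with plt.subplot, the Axes objects returned by plt.subplots can be used directly. The sketch below uses plain matplotlib and made-up values; `demo_fig` and `demo_axes` are hypothetical names. With seaborn, the same idea is expressed by passing `ax=axis[0]` to `sns.histplot`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

demo_fig, demo_axes = plt.subplots(1, 2, figsize=(15, 7))

# demo_axes is a NumPy array holding one Axes object per subplot.
demo_axes[0].hist([1, 2, 2, 3, 3, 3], bins=3, color="g")
demo_axes[0].set_title("left subplot")

demo_axes[1].hist([4, 5, 5, 6], bins=3)
demo_axes[1].set_title("right subplot")

print(demo_axes.shape)
```

Addressing each Axes explicitly avoids relying on which subplot is currently active.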

In [None]:
fig

In [None]:
axis

## Insights
- Female students tend to perform better than male students.

In [None]:
plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')

## Insights
- Students with a standard lunch tend to perform better in exams than students with a free/reduced lunch.
- This pattern holds for both male and female students.
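
The histograms give a visual impression; a grouped mean is one way to back such observations with numbers. The sketch below uses made-up rows (the `toy_scores` frame is hypothetical), so the values it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; not taken from stud.csv.
toy_scores = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "average": [75.0, 65.0, 70.0, 60.0],
})

# Mean average score per gender, with one column per lunch type.
mean_by_group = (
    toy_scores.groupby(["gender", "lunch"])["average"].mean().unstack()
)
print(mean_by_group)
```

On the real data, replacing `toy_scores` with `df` would give the corresponding table.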

In [None]:
df.head()

In [None]:
plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='parental_level_of_education')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental_level_of_education')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental_level_of_education')

#####  Insights
- In general, the parents' level of education does not appear to help students perform well in exams.
- The 3rd plot shows that male students whose parents hold an associate's degree or a master's degree tend to perform well, as the red and blue lines in that plot are left-skewed.
- The 2nd plot shows no visible effect of the parents' education on female students.
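
Skewness read off a KDE curve can be cross-checked numerically with a grouped skew, where a negative value indicates a longer left tail. This is a sketch on invented numbers (the `toy_edu` frame is hypothetical), chosen so that one group is left-skewed and the other right-skewed.

```python
import pandas as pd

# Invented values: the first group has a long left tail, the second a long right tail.
toy_edu = pd.DataFrame({
    "parental_level_of_education": ["master's degree"] * 5 + ["high school"] * 5,
    "average": [40, 80, 85, 88, 90, 50, 52, 55, 58, 95],
})

# Sample skewness of the average score within each education level.
skew_by_group = toy_edu.groupby("parental_level_of_education")["average"].skew()
print(skew_by_group)
```

On the real data, the same call on `df[df.gender == 'male']` would show which education levels are actually left-skewed.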

In [None]:
plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
ax = sns.histplot(data=df,x='average',kde=True,hue='race_ethnicity')
plt.subplot(132)
ax = sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race_ethnicity')
plt.subplot(133)
ax = sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race_ethnicity')
plt.show()

#####  Insights
- Students of group A and group B tend to perform poorly in exams.
- This holds irrespective of whether they are male or female.

In [None]:
plt.subplots(1, 3, figsize=(25, 6))
plt.subplot(131)
ax = sns.histplot(data=df, x='average', kde=True, hue='race_ethnicity', element='step')
plt.subplot(132)
ax = sns.histplot(data=df[df.gender=='female'], x='average', kde=True, hue='race_ethnicity', element='step')
plt.subplot(133)
ax = sns.histplot(data=df[df.gender=='male'], x='average', kde=True, hue='race_ethnicity', element='step')
plt.show()


In [None]:
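# Restrict the correlation to numeric columns; recent pandas versions raise an error on the categorical ones
sns.heatmap(df.corr(numeric_only=True),annot=True)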