![title](https://thelibertyreview.com/wp-content/uploads/2018/04/students.jpg)

## <div style="text-align: center" > Analyzing Student Performance in Exams </div>
<div style="text-align: center"> Being a part of Kaggle gives me unlimited opportunities to learn, share and grow as a Data Scientist. In this notebook I share how I look into the data and analyse it for insights.</div>

<div style="text-align:center"> If there are any recommendations/changes you would like to see in this notebook, please <b>leave a comment</b> at the end of this kernel. Any feedback/constructive criticism would be genuinely appreciated.</div>

# Introduction
***
This kernel is for beginners who want to understand the basics of pandas and how to create informative graphs. We will build a detailed statistical analysis of the data through graphs.

## Table of contents
***
- Introduction
- Kernel Goals
- Part 1: Importing Necessary Modules
    - 1a. Libraries
    - 1b. Load datasets
    - 1c. A Glimpse of the dataset
    - 1d. About this dataset
    - Tableau Visualization
- Part 2: Overview of the data
    - 2a. Total number of male vs female students
    - 2b. Number of students belonging to different race/ethnicity groups
    - 2c. Count distribution of students with different parental level of education
    - 2d. Count of students who have completed test preparation course vs not completed ones
- Part 3: Distribution of subject-wise scores
    - 3a. Score spread and mean
    - 3b. Score range and number of students
- Part 4: Analysis of correlations
    - 4a. Average scores of students by gender
    - 4b. How different race/ethnicity groups have performed in each exam
    - 4c. Relation between parental level of education and distribution of score in each subject
    
- Credits

# Kernel Goals
<a id="aboutthiskernel"></a>
***
There are two primary goals of this kernel.
- To do a statistical analysis of the data by analyzing its different features.
- To do an exploratory analysis of the dataset with visualizations and to understand the resulting graphs.

# Part 1: Importing Necessary Libraries and datasets
***
<a id="import_libraries**"></a>
## 1a. Loading libraries
There are many libraries that we can use for analysis in Python. Let's go with a few of the most used ones.

In [None]:
# Import necessary modules. 
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## 1b. Loading Datasets
<a id="load_data"></a>
***
This dataset is available on Kaggle as <b>Students Performance in Exams.</b> 
The dataset is clean and free of unwanted data, so we don't have to go through the process of cleaning it.

In [None]:
## Importing the dataset
stdPer = pd.read_csv("../input/StudentsPerformance.csv")
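
The claim that the data is clean can be checked rather than taken on trust. Below is a minimal sketch of such a check. The small `sample` frame is a hypothetical stand-in I made up so the snippet runs on its own; in this kernel you would run the same calls on `stdPer`. The 0 to 100 score range is also my assumption.

```python
import pandas as pd

# Hypothetical miniature of the dataset; in the notebook use stdPer instead.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
    "writing score": [74, 88, 93, 44],
})

# Missing values per column; all zeros means there is nothing to impute.
missing_per_column = sample.isnull().sum()

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Assumption: every score should lie between 0 and 100.
score_columns = ["math score", "reading score", "writing score"]
scores = sample[score_columns]
scores_in_range = bool(((scores >= 0) & (scores <= 100)).all().all())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("All scores within 0-100:", scores_in_range)
```

If any of these checks fail on the real data, a cleaning step would be needed before the analysis below.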

## 1c. A Glimpse of the Dataset
<a id="glimpse"></a>
***

In [None]:
#Let's see the first five rows of the dataset
stdPer.head()

In [None]:
print("The shape of data is (row, column): " + str(stdPer.shape))
# info() prints its summary itself and returns None, so it is not wrapped in print()
stdPer.info()

## 1d. About This Dataset
***
The data has a mix of feature types.

**Categorical:**
- **Nominal** (variables that have two or more categories with no intrinsic order)
   > - **race/ethnicity**
        
- **Dichotomous** (nominal variables with only two categories)
   > - **gender**
            female
            male
   > - **lunch**
   > - **test preparation course**
   
- **Ordinal** (variables that have two or more categories, like nominal variables, except that the categories can be ordered or ranked)
   > - **parental level of education** 
   
**Numeric:**
- **Discrete**
  >  - **math score**
  >  - **reading score**
  >  - **writing score** 
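
To make the ordinal nature of parental level of education usable in code, the column can be stored as an ordered categorical. This is only a sketch: the ranking in `education_order` is my own assumption about how the levels should be ordered (the dataset does not state one), and `sample` is a made-up stand-in for `stdPer['parental level of education']`.

```python
import pandas as pd

# Assumed ranking from lowest to highest attainment (not given by the dataset).
education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# Hypothetical sample values standing in for the real column.
sample = pd.Series(
    ["bachelor's degree", "some college", "master's degree", "high school"],
    name="parental level of education",
)

# Build an ordered categorical so pandas knows the rank of each level.
education = pd.Series(
    pd.Categorical(sample, categories=education_order, ordered=True),
    name=sample.name,
)

# Sorting and comparisons now follow the rank instead of alphabetical order.
print(education.sort_values().tolist())
print((education >= "bachelor's degree").tolist())
```

With this encoding, plots and group-by tables can be shown in rank order without listing the levels by hand each time.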

# Part 2: Data Overview
***
## 2a. Total number of male vs female students

### Numeric count

In [None]:
gender_list = stdPer['gender'].unique()
for gender in gender_list:
    print('Number of',gender,'students is',stdPer['gender'][stdPer['gender'] == gender].count())
    
#visualizing the above counts.
sns.countplot(x = 'gender', data = stdPer)

### Percentage distribution

In [None]:
#Percentage distribution of gender count
# labels = ['female', 'male']

#using list comprehension here
sizes = [stdPer['gender'][stdPer['gender'] == gender].count() for gender in gender_list]
colors = ['lightcoral', 'lightskyblue']
 
# Plot
plt.pie(sizes, labels=gender_list, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140) 
plt.axis('equal')
plt.show()
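
The counts and percentages above can also be obtained without a loop. Here is a small sketch using `value_counts`; the `sample` frame is a hypothetical stand-in for `stdPer`, so the numbers it prints are not the real ones.

```python
import pandas as pd

# Hypothetical stand-in for stdPer with only the gender column.
sample = pd.DataFrame({"gender": ["female", "male", "female", "female", "male"]})

# Absolute number of students per gender.
counts = sample["gender"].value_counts()

# Share of each gender in percent, rounded to one decimal like the pie chart labels.
percentages = (sample["gender"].value_counts(normalize=True) * 100).round(1)

print(counts)
print(percentages)
```

The same two calls work for any categorical column, for example `race/ethnicity` or `test preparation course`.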

## 2b. Number of students belonging to different race/ethnicity groups

In [None]:
#Actual count
race_ethnicity_list = stdPer['race/ethnicity'].unique()
for race_ethnicity in race_ethnicity_list:
    print('Number of students in',race_ethnicity,'is',stdPer['race/ethnicity'][stdPer['race/ethnicity'] == race_ethnicity].count())

#Plot using seaborn    
sns.countplot(y = 'race/ethnicity', data = stdPer, order = ['group C', 'group D', 'group B', 'group E', 'group A'])

In [None]:
#Percentage distribution of race/ethnicity count
labels = stdPer['race/ethnicity'].unique()
# labels = ['group B', 'group C', 'group A', 'group D', 'group E']

#using list comprehension here
sizes = [stdPer['race/ethnicity'][stdPer['race/ethnicity'] == 
        race_ethnicity].count() for race_ethnicity in labels]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue','red']
explode = (0, 0.1, 0, 0, 0)  # explode the 2nd slice
 
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)
 
plt.axis('equal')
plt.show()

## 2c. Count distribution of students with different parental level of education

In [None]:
#Actual count
parent_edu_list = stdPer['parental level of education'].unique()
for parent_edu in parent_edu_list:
    print('Number of students whose parents have',parent_edu,'is',stdPer['parental level of education'][stdPer['parental level of education'] == parent_edu].count())

#Plot using seaborn, ordered from the most to the least common level so that no level is left out
sns.countplot(y = 'parental level of education', data = stdPer, order = stdPer['parental level of education'].value_counts().index)

The above graph seems reasonable, as master's degree holders are generally fewer than holders of the other qualifications.

In [None]:
#unique elements under parental level of education
labels = stdPer['parental level of education'].unique()

# Using list comprehension here
sizes = [stdPer['parental level of education'][stdPer['parental level of education'] == 
                                               qualification].count() for qualification in labels]

colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue','red',"blue"]

# Plot
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title('Percentage distribution of students count for different parental level of education')
plt.show()

## 2d. Count of students who have completed test preparation course vs not completed ones

In [None]:
preparation_list = stdPer['test preparation course'].unique()
for preparation in preparation_list:
    print('Number of students whose test preparation is',preparation, 'is',stdPer['test preparation course'][stdPer['test preparation course'] == preparation].count())
    
#visualizing the above counts.
sns.countplot(x = 'test preparation course', data = stdPer)

# Part 3: Distribution of subject-wise scores

## 3a. Score spread and mean

In [None]:
print('Average Maths Score : ',stdPer['math score'].mean())
print('Average Reading Score : ',stdPer['reading score'].mean())
print('Average Writing Score : ',stdPer['writing score'].mean())
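
The mean alone says little about the spread. As a rough sketch, `describe()` gives the mean together with the standard deviation, minimum, median and maximum in one table. The `sample` frame below is invented for illustration; in the kernel the same call would go on the three score columns of `stdPer`.

```python
import pandas as pd

# Hypothetical scores; replace with the score columns of stdPer in the notebook.
sample = pd.DataFrame({
    "math score": [40, 60, 80, 100],
    "reading score": [50, 60, 70, 80],
    "writing score": [45, 55, 75, 85],
})

# describe() returns count, mean, std, min, quartiles and max per column;
# keep only the rows that summarise centre and spread.
summary = sample.describe().loc[["mean", "std", "min", "50%", "max"]]

print(summary.round(2))
```

The `50%` row is the median, which is the line drawn inside each box of a box plot.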

In [None]:
fig, axes = plt.subplots(1,3,figsize=(15,5), sharey = True)
sns.despine()
sns.boxplot(y=stdPer['math score'], ax = axes[0], color = 'g')
sns.boxplot(y=stdPer['reading score'], ax = axes[1], color = 'b')
sns.boxplot(y=stdPer['writing score'], ax = axes[2], color = 'r')
plt.setp(axes, yticks=[i for i in range(0,110,10)])
plt.show()

In [None]:
#Above information can also be visualized by violin plots
f, axes = plt.subplots(1, 3, figsize=(12, 5), sharey=True)
# sns.despine(left=True)

# Violinplot for distribution of math score
sns.violinplot(y = 'math score', data = stdPer, ax=axes[0], color = 'g')

# Violinplot for distribution of reading score
sns.violinplot(y = 'reading score', data = stdPer, ax=axes[1], color = 'b')

# Violinplot for distribution of writing score
sns.violinplot(y="writing score", data = stdPer, ax = axes[2], color = 'r')

# plt.setp(axes, yticks=[0,10,20,30,40,50,60,70,80,90,100])
plt.setp(axes, yticks=[i for i in range(0,110,10)])
plt.tight_layout()

## 3b. Score range and number of students

In [None]:
# sns.set(style="white", palette="muted", color_codes=True)
fig, axes = plt.subplots(1,3, figsize = (10,5), sharey = True)
# sns.despine(left=True)

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(stdPer['math score'], color="m", ax=axes[0], bins=10,
             shrink=0.9, edgecolor='black', alpha=1.0)

sns.histplot(stdPer['reading score'], color="g", ax=axes[1], bins=10,
             shrink=0.9, edgecolor='black', alpha=1.0)

sns.histplot(stdPer['writing score'], color="b", ax=axes[2], bins=10,
             shrink=0.9, edgecolor='black', alpha=1.0)

plt.tight_layout()
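
The number of students per score range can also be read as a table. The sketch below uses `pd.cut` with fixed ranges of 10 points. Note that this is my own choice of edges: the histograms above use 10 equal-width bins between the lowest and highest score, so their ranges may differ. The `sample` series is made up for illustration.

```python
import pandas as pd

# Hypothetical math scores; in the notebook use stdPer['math score'].
sample = pd.Series([35, 48, 52, 67, 71, 79, 88, 100], name="math score")

# Assumed fixed bin edges: 0, 10, 20, ..., 100.
bin_edges = list(range(0, 110, 10))

# Assign each score to a range; include_lowest keeps a score of 0 in the first bin.
score_ranges = pd.cut(sample, bins=bin_edges, include_lowest=True)

# Count students per range, keeping the ranges in ascending order.
students_per_range = score_ranges.value_counts(sort=False)

print(students_per_range)
```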


### Using factorplot from seaborn
factorplot is a general-purpose seaborn function for drawing categorical plots. We just have to name the type of graph in the kind parameter. Since seaborn 0.9 the same function is called catplot.
sns.factorplot(x = 'None' , y = 'None' , data = data, kind = 'plot type' )

# Part 4: Let's analyze the correlations

## 4a. Average scores of students by gender

In [None]:
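# numeric_only=True averages only the score columns; recent pandas versions raise an error on text columns otherwise
stdPer.groupby('gender').mean(numeric_only=True).reset_index()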

In [None]:
f, axes = plt.subplots(1,3,figsize=(15, 5)) 
sns.boxplot(x = 'gender', y = 'math score', data = stdPer, ax=axes[0])
sns.boxplot(x = 'gender', y = 'reading score', data = stdPer, ax=axes[1])
sns.boxplot(x = 'gender', y = 'writing score', data = stdPer, ax=axes[2])
plt.setp(axes, yticks=[i for i in range(0,110,10)])
sns.despine()
plt.tight_layout()

## 4b. How different race/ethnicity groups have performed in each exam

In [None]:
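# numeric_only=True averages only the score columns; recent pandas versions raise an error on text columns otherwise
stdPer.groupby('race/ethnicity').mean(numeric_only=True).reset_index().sort_values('math score', ascending = False)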

In [None]:
f, axes = plt.subplots(3,1,figsize=(10, 15)) 
sns.boxplot(x = 'race/ethnicity', y = 'math score', data = stdPer, ax=axes[0])
sns.boxplot(x = 'race/ethnicity', y = 'reading score', data = stdPer, ax=axes[1])
sns.boxplot(x = 'race/ethnicity', y = 'writing score', data = stdPer, ax=axes[2])
plt.setp(axes, yticks=[i for i in range(0,110,10)])
sns.despine()
plt.tight_layout()

From the above graphs and the table of means, we can see a trend.
Irrespective of the exam subject, the average scores follow group E > group D > group C > group B > group A.
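
A trend like this can be checked in code instead of by eye. Here is a minimal sketch of one way to do it: compute the mean per group, sort the groups by name and test whether the means never decrease. The `sample` frame is invented and much smaller than the real data, so it only demonstrates the technique, not the result.

```python
import pandas as pd

# Hypothetical data with three groups; in the notebook use stdPer.
sample = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group B", "group B", "group C", "group C"],
    "math score": [50, 60, 62, 66, 70, 74],
    "reading score": [55, 61, 63, 69, 72, 78],
})

score_columns = ["math score", "reading score"]

# Mean score per group, with groups in alphabetical order (A, B, C, ...).
group_means = sample.groupby("race/ethnicity")[score_columns].mean().sort_index()

# True for a subject when the group means never decrease from group A onward.
rises_with_group = {
    column: bool(group_means[column].is_monotonic_increasing)
    for column in score_columns
}

print(group_means)
print(rises_with_group)
```

Note that `is_monotonic_increasing` also accepts equal neighbouring values, so a strict ">" ordering would need an extra check for ties.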

## 4c. Relation between parental level of education and distribution of score in each subject

In [None]:
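# numeric_only=True averages only the score columns; recent pandas versions raise an error on text columns otherwise
stdPer.groupby('parental level of education').mean(numeric_only=True).reset_index().sort_values('math score', ascending = False)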

Children of master's degree holders have the highest average score in every subject.
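
Which group tops each subject can be read directly from the table of means with `idxmax`. The snippet below is a sketch on a made-up `sample` frame with only three education levels; on the real data the same two lines would be applied to `stdPer`.

```python
import pandas as pd

# Hypothetical data; in the notebook use stdPer.
sample = pd.DataFrame({
    "parental level of education": [
        "master's degree", "master's degree",
        "high school", "high school",
        "some college", "some college",
    ],
    "math score": [80, 70, 60, 62, 65, 69],
    "reading score": [85, 75, 64, 66, 70, 68],
    "writing score": [84, 78, 60, 64, 69, 71],
})

score_columns = ["math score", "reading score", "writing score"]

# Mean score per education level.
means = sample.groupby("parental level of education")[score_columns].mean()

# For each subject, the education level with the highest mean score.
top_group_per_subject = means.idxmax()

print(top_group_per_subject)
```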

In [None]:
#We will plot three different boxplots
f, axes = plt.subplots(3,1,figsize=(10, 15)) 
sns.boxplot(x = 'parental level of education', y = 'math score', data = stdPer, ax=axes[0])
sns.boxplot(x = 'parental level of education', y = 'reading score', data = stdPer, ax=axes[1])
sns.boxplot(x = 'parental level of education', y = 'writing score', data = stdPer, ax=axes[2])
plt.setp(axes, yticks=[i for i in range(0,110,10)])
sns.despine()
plt.tight_layout()

In [None]:
f, axes = plt.subplots(1,3,figsize=(15, 5)) 
sns.violinplot(x = 'test preparation course', y = 'math score', data = stdPer, ax=axes[0])
sns.violinplot(x = 'test preparation course', y = 'reading score', data = stdPer, ax=axes[1])
sns.violinplot(x = 'test preparation course', y = 'writing score', data = stdPer, ax=axes[2])
plt.setp(axes, yticks=[i for i in range(0,110,10)])
sns.despine()
plt.show()
# plt.tight_layout()

# Credits
* I have referenced Masum Rumi's kernel for the template: [https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic](https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic). This kernel is great for beginners.
* Wikipedia