# Dataset - Students Performance in Exams #
* **Content**
    * This data set consists of the marks secured by the students in various subjects.
* **Acknowledgements**
    * http://roycekimmons.com/tools/generated_data/exams
* **Inspiration**
    * To understand the influence of parental background, test preparation, etc. on student performance.
* **Research Questions**
    * How does gender affect student scores?
    * How does race/ethnicity affect student scores?
    * How does parental level of education affect student scores?
    * How does lunch (economic status) affect student scores?
    * How does test preparation affect student scores in the above categories?

# Import Relevant Python Libraries #

In [None]:
# to display plots within the notebook
%matplotlib inline

# import libraries
import pandas as pd
import matplotlib as mpl
import numpy as np
import seaborn as sns
import os

from matplotlib import pyplot as plt

# display library versions
print('numpy:{0}'.format(np.__version__))
print('pandas:{0}'.format(pd.__version__))
print('matplotlib:{0}'.format(mpl.__version__))
print('seaborn:{0}'.format(sns.__version__))

# misc. configurations
pd.options.display.float_format = '{:.2f}'.format
#plt.style.use('ggplot')

# Load Data #

In [None]:
# check for data file
print('data file: {0}'.format(os.listdir("../input")))

In [None]:
# load data
data = pd.read_csv('../input/StudentsPerformance.csv')

# Descriptive Statistics (data exploration) #

In [None]:
# first 5 rows of data
data.head()

In [None]:
# last 5 rows of data
data.tail()

In [None]:
data.info()

* The data consists of:
    * 1000 rows and 8 columns.
    * 5 of the columns are categorical.
    * 3 of the columns are numeric.
    * There doesn't seem to be any null data.
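
The counts above can also be computed rather than read off `data.info()`. A minimal sketch, assuming a small stand-in frame (`sample`, with made-up values) in place of the real CSV; `summarize_columns` is a hypothetical helper name, not part of the original notebook:

```python
import pandas as pd

# Made-up stand-in rows; in the notebook, `data` would be passed instead.
sample = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'lunch': ['standard', 'free/reduced', 'standard'],
    'math score': [72, 47, 90],
    'reading score': [72, 57, 95],
})

def summarize_columns(df):
    """Count rows, non-numeric (categorical) columns, numeric columns and nulls."""
    numeric = df.select_dtypes(include='number').columns
    categorical = df.select_dtypes(exclude='number').columns
    return {
        'rows': len(df),
        'categorical': len(categorical),
        'numeric': len(numeric),
        'nulls': int(df.isnull().sum().sum()),
    }

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(data)` on the loaded data set would give the same four figures listed in the bullets above.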

In [None]:
# null data per column
data.isnull().any()

In [None]:
# overall null data
data.isnull().any().any()

* The checks above confirm that there are no null values in the data.

In [None]:
# Determine need for standardizing/normalizing of data
data.describe().T

* The numeric columns have similar means and standard deviations, so there is no need to standardize or normalize the data.
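
One way to make that comparison explicit is to put the mean, the standard deviation and their ratio (the coefficient of variation) side by side. This is only a sketch: using the coefficient of variation as the yardstick is an assumption for illustration, and the `scores` frame below holds made-up values rather than the real data.

```python
import pandas as pd

# Made-up scores; in the notebook the numeric columns of `data` would be used.
scores = pd.DataFrame({
    'math score': [72, 69, 90, 47, 76],
    'reading score': [72, 90, 95, 57, 78],
    'writing score': [74, 88, 93, 44, 75],
})

# One row per column: mean, sample standard deviation, and their ratio.
stats = scores.agg(['mean', 'std']).T
stats['cv'] = stats['std'] / stats['mean']
print(stats)
```

If the rows of `stats` are close to each other, the columns are already on a comparable scale.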

In [None]:
# categorical data - unique values
for column in data:
    if data[column].dtype == 'O':
        print('column: {0}\nunique values: {1}\n'.format(column, data[column].unique()))

In [None]:
# numeric data - unique values
for column in data:
    if data[column].dtype != 'O':
        print('column: {0}\nunique values: {1}\n'.format(column, data[column].unique()))

* Data seems to be quite clean. No irregular values.

In [None]:
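# correlation between the numeric columns only
# (recent pandas versions raise an error if text columns are included)
data.select_dtypes(include='number').corr(method='pearson')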

* [reading score] and [writing score] are highly correlated, so we combine them into a new column, [literacy score], computed as their average.
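
To look at just this pair instead of the whole matrix, the Pearson coefficient can be computed directly and the merge made conditional on it. A minimal sketch on made-up values; the `THRESHOLD` cutoff of 0.9 is an assumption chosen for illustration, not a value taken from the notebook:

```python
import pandas as pd

# Made-up scores standing in for the real columns.
scores = pd.DataFrame({
    'reading score': [72, 90, 95, 57, 78],
    'writing score': [74, 88, 93, 44, 75],
})

# Series.corr uses the Pearson coefficient by default.
r = scores['reading score'].corr(scores['writing score'])
print('pearson r: {0:.2f}'.format(r))

THRESHOLD = 0.9  # assumed cutoff for "highly correlated"
if r > THRESHOLD:
    # Average the two columns row by row into a single literacy score.
    scores['literacy score'] = scores[['reading score', 'writing score']].mean(axis=1)

print(scores)
```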

In [None]:
data['literacy score'] = (data['reading score']+data['writing score'])/2

In [None]:
data.head()

In [None]:
data2 = data.drop(['reading score', 'writing score'], axis=1)

In [None]:
data2.head()

# Utility Methods #

In [None]:
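def univariatePlot(column):
    # bar chart of the mean scores for each category of one column;
    # numeric_only=True skips the remaining text columns
    data2.groupby(column).mean(numeric_only=True).plot(kind='bar', rot=45, figsize=(12,8))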

In [None]:
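def bivariatePlot(columnList):
    # bar chart of the mean scores for each combination of categories;
    # numeric_only=True skips the remaining text columns
    data2.groupby(columnList).mean(numeric_only=True).plot(kind='bar', rot=45, figsize=(12,8))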

# Research Questions / Answers #

In [None]:
univariatePlot('gender')

* Male students seem to have higher math scores than female students.
* Female students seem to have higher literacy (reading/writing) scores than male students.
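
The bar chart shows the direction of each difference; the size of the gap can be read from the group means themselves. A sketch on a made-up frame (`toy`), whose values are illustrative only and say nothing about the real data:

```python
import pandas as pd

# Made-up rows; in the notebook `data2` would be used instead.
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male'],
    'math score': [60, 70, 68, 72],
    'literacy score': [75.0, 80.0, 62.0, 66.0],
})

# Mean score per gender, then the male-minus-female gap per subject.
means = toy.groupby('gender')[['math score', 'literacy score']].mean()
gap = means.loc['male'] - means.loc['female']

print(means)
print(gap)
```

A positive entry in `gap` means male students score higher on average in that subject, a negative entry means female students do.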

In [None]:
bivariatePlot(['gender','test preparation course'])

* Regardless of gender, students score higher in both math & literacy when they complete the test preparation course.

In [None]:
univariatePlot('race/ethnicity')

* Both math & literacy (read/write) scores increase gradually from group A to group E.
* All race/ethnicity groups show higher literacy (read/write) scores than math scores apart from group E.

In [None]:
bivariatePlot(['race/ethnicity','test preparation course'])

* Students who complete the test preparation course score higher than those who do not.

In [None]:
univariatePlot('parental level of education')

* The higher the parental level of education, the higher the student scores for both math & literacy (read/write).

In [None]:
bivariatePlot(['parental level of education','test preparation course'])

* Completing the test preparation course improves the test scores across all parental education levels.

In [None]:
univariatePlot('lunch')

* Students getting free/reduced lunch (lower economic level) have lower scores for both math & literacy (read/write) than students who get standard lunch.

In [None]:
bivariatePlot(['lunch','test preparation course'])

* Completing the test preparation course improves student scores across economic levels.

In [None]:
univariatePlot('test preparation course')

* In general, completion of test preparation course improves both math & literacy (read/write) scores across the board.
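
The "across the board" reading can be checked numerically by computing, for each category of a column, the difference in mean score between students who completed the course and those who did not. A minimal sketch, assuming the course column uses the labels `completed` and `none` (an assumption, since the unique values are not shown here); `prep_gain` is a hypothetical helper and the `toy` values are made up:

```python
import pandas as pd

# Made-up rows; in the notebook `data2` would be used instead.
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'free/reduced', 'free/reduced'],
    'test preparation course': ['completed', 'none', 'completed', 'none'],
    'math score': [75, 70, 65, 58],
})

def prep_gain(df, category, score):
    """Mean score of 'completed' minus mean score of 'none', per category value."""
    table = df.pivot_table(index=category,
                           columns='test preparation course',
                           values=score,
                           aggfunc='mean')
    # 'completed' and 'none' are the assumed labels of the course column.
    return table['completed'] - table['none']

gain = prep_gain(toy, 'lunch', 'math score')
print(gain)
```

If every entry of the returned series is positive, the course is associated with higher scores in every category of that column.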