# Students Performance

## Introduction

Education is undoubtedly one of society's most important concerns. For a long time, knowledge and information were resources unavailable to much of the population. Fortunately, access to education has advanced a great deal for most people, helping to improve social indicators.

Nevertheless, exam performance differs among students. Understanding the reasons for those differences may help policy makers formulate better education programs for society.

With this in mind, I will apply a linear regression to the students' grades using their social background. The idea is to look for evidence of whether, and how, social background affects student performance.

# Loading Libraries and Data

In [None]:
### Libraries

# Importing and Data Manipulation
import pandas as pd
import numpy as np
import math

# Graphs
import matplotlib.pyplot as plt
import seaborn as sns

# Stats
from scipy import stats
import statsmodels.api as sm

# Data
eduData = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

An overview of the data shows that the dataset contains eight variables and one thousand observations (students). Basic statistics show that the average Math grade is lower than the averages for Reading and Writing. It is also possible to notice that one student scored zero in Math.

In [None]:
# Type of the Variables
eduData.info()  # info() prints its report directly and returns None

# Checking for duplicates
print(eduData[eduData.duplicated()])

# Checking for unique values
for i in eduData.columns:
    print("Unique value for the column: " + i)
    print(eduData[i].unique())
    print("\n")

# Basic Statistics
eduData.describe()

# Cleaning the Data

One of the most important and difficult steps in data analysis is cleaning the data. Fixing or removing incorrect, corrupted, duplicated, or incomplete data demands not only formal understanding but also a sense of how to interpret the dataset.

We noticed that one student had a zero score in the math test. Usually I would drop this observation as a probable input error, but checking that student's other grades I realised that they are very low across the board. Therefore, I will keep the observation in the dataset.

In [None]:
eduData[eduData['math score'] == 0] 

None of the dataset observations are null.

In [None]:
# Checking for NULL values
print(eduData.isnull().sum())

# Data Transformation

Most of the variables are categorical. Nevertheless, when we import the dataset into the analysis environment, pandas reads them as strings. Because of that, I convert the first five columns of the dataset to the categorical type.

In [None]:
### Data Transformation
for i in eduData.columns[0:5]:
    print("Applying transformation in the data column: " + i)
    eduData[i] = pd.Categorical(eduData[i])
    print("\n")
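To make the effect of this conversion concrete, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative assumptions, not rows from the dataset). It shows that a categorical column stores the distinct labels once and represents each row by an integer code.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"lunch": ["standard", "free/reduced", "standard"]})

# Convert the string column to a categorical column
toy["lunch"] = pd.Categorical(toy["lunch"])

# Distinct labels, inferred and sorted by pandas
categories = toy["lunch"].cat.categories.tolist()
# Integer code of each row, pointing into the list of categories
codes = toy["lunch"].cat.codes.tolist()

print(toy["lunch"].dtype)
print(categories)
print(codes)
```

The codes are what later steps such as `pd.get_dummies` rely on to build one indicator column per label.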

# Exploratory Analysis

This is one of my favorite parts of the data analysis process. It is used to analyze and investigate the data in many ways, and it is important for summarizing the main characteristics of the data before modeling. The main idea is to look at the data to identify patterns, interesting relations among the variables, and anomalous characteristics.

The boxplot is one of the most important tools in descriptive statistics. It visually shows the distribution of numerical data through its quartiles. I plotted boxplots of the grades achieved by the students for each categorical variable.

* Gender vs Subject: female students score noticeably lower in Math than male students; however, male students score lower in Reading and Writing. The standard deviation is very similar for both genders.

* Race/Ethnicity vs Subject: across race/ethnicity groups, scores increase slightly from group A to group E.

* Parental Level of Education by Subject: students whose parents have a higher level of education achieve better grades.

* Lunch by Subject: it is very clear that students with standard lunch achieve better grades.

* Test Preparation Course by Subject: the same pattern is noticeable here; students who took a preparation course achieve better grades.

Other descriptive analyses show that the data is unbalanced across Race/Ethnicity groups, probably reflecting the inequality present among the students. Also, many of the parents lack formal education, which may be associated with the low performance of some students.
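As a small aside on how to read the boxplots, the sketch below computes by hand the quantities a boxplot draws: the three quartiles and the 1.5 × IQR fences beyond which points are shown as outliers (1.5 is the default whisker length in matplotlib and seaborn). The `scores` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical scores, including one very low value
scores = np.array([0, 55, 60, 65, 70, 75, 80, 85, 90])

# Quartiles: the box spans q1..q3, the line inside the box is the median
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)].tolist()

print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers)
```

In this toy example only the zero score falls outside the fences, which is the same kind of point that stands out in the Math boxplots.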

In [None]:
# Creating Boxplot Gender vs Subject

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
fig.suptitle('Boxplot Gender by Subject')

# Math Score
sns.boxplot(ax=axes[0], y='math score', x='gender', 
                 data=eduData, 
                 palette="colorblind")
axes[0].set_title("Math Score")


# Reading Score
sns.boxplot(ax=axes[1], y='reading score', x='gender', 
                 data=eduData, 
                 palette="colorblind")
axes[1].set_title("Reading Score")

# Writing Score
sns.boxplot(ax=axes[2], y='writing score', x='gender', 
                 data=eduData, 
                 palette="colorblind")
axes[2].set_title("Writing Score")

In [None]:
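# Select the score columns before aggregating: median and std are not defined for the categorical columns
print(eduData.groupby('gender')[['math score','reading score','writing score']].median())

print(eduData.groupby('gender')[['math score','reading score','writing score']].std())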

In [None]:
# Creating Boxplot Race/Ethnicity vs Subject

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
fig.suptitle('Boxplot race/ethnicity by Subject')

# Math Score
sns.boxplot(ax=axes[0], y='math score', x='race/ethnicity', 
                 data=eduData, 
                 palette="colorblind")
axes[0].set_title("Math Score")


# Reading Score
sns.boxplot(ax=axes[1], y='reading score', x='race/ethnicity', 
                 data=eduData, 
                 palette="colorblind")
axes[1].set_title("Reading Score")

# Writing Score
sns.boxplot(ax=axes[2], y='writing score', x='race/ethnicity', 
                 data=eduData, 
                 palette="colorblind")
axes[2].set_title("Writing Score")

In [None]:
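# Select the score columns before aggregating: median and std are not defined for the categorical columns
print(eduData.groupby('race/ethnicity')[['math score','reading score','writing score']].median())

print(eduData.groupby('race/ethnicity')[['math score','reading score','writing score']].std())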

In [None]:
eduData['race/ethnicity'].value_counts().plot.pie(autopct="%1.1f%%")
plt.show()

In [None]:
# Creating Boxplot Parental Level of Education vs Subject

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
fig.suptitle('Boxplot Parental Level of Education by Subject')

# Math Score
sns.boxplot(ax=axes[0], y='math score', x='parental level of education', 
                 data=eduData, 
                 palette="colorblind")
axes[0].set_title("Math Score")


# Reading Score
sns.boxplot(ax=axes[1], y='reading score', x='parental level of education', 
                 data=eduData, 
                 palette="colorblind")
axes[1].set_title("Reading Score")

# Writing Score
sns.boxplot(ax=axes[2], y='writing score', x='parental level of education', 
                 data=eduData, 
                 palette="colorblind")
axes[2].set_title("Writing Score")

In [None]:
# Select the score columns before aggregating: median and std are not defined for the categorical columns
print(eduData.groupby('parental level of education')[['math score','reading score','writing score']].median())

print(eduData.groupby('parental level of education')[['math score','reading score','writing score']].std())

In [None]:
eduData['parental level of education'].value_counts().plot(kind = 'barh')
plt.show()

In [None]:
# Creating Boxplot Lunch vs Subject

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
fig.suptitle('Boxplot Lunch by Subject')

# Math Score
sns.boxplot(ax=axes[0], y='math score', x='lunch', 
                 data=eduData, 
                 palette="colorblind")
axes[0].set_title("Math Score")


# Reading Score
sns.boxplot(ax=axes[1], y='reading score', x='lunch', 
                 data=eduData, 
                 palette="colorblind")
axes[1].set_title("Reading Score")

# Writing Score
sns.boxplot(ax=axes[2], y='writing score', x='lunch', 
                 data=eduData, 
                 palette="colorblind")
axes[2].set_title("Writing Score")

In [None]:
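# Select the score columns before aggregating: median and std are not defined for the categorical columns
print(eduData.groupby('lunch')[['math score','reading score','writing score']].median())

print(eduData.groupby('lunch')[['math score','reading score','writing score']].std())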

In [None]:
# Creating Boxplot Test Preparation Course vs Subject

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
fig.suptitle('Boxplot Test Preparation Course by Subject')

# Math Score
sns.boxplot(ax=axes[0], y='math score', x='test preparation course', 
                 data=eduData, 
                 palette="colorblind")
axes[0].set_title("Math Score")


# Reading Score
sns.boxplot(ax=axes[1], y='reading score', x='test preparation course', 
                 data=eduData, 
                 palette="colorblind")
axes[1].set_title("Reading Score")

# Writing Score
sns.boxplot(ax=axes[2], y='writing score', x='test preparation course', 
                 data=eduData, 
                 palette="colorblind")
axes[2].set_title("Writing Score")

In [None]:
# Select the score columns before aggregating: median and std are not defined for the categorical columns
print(eduData.groupby('test preparation course')[['math score','reading score','writing score']].median())

print(eduData.groupby('test preparation course')[['math score','reading score','writing score']].std())

Data may have positive, negative, or no correlation. It is interesting to describe how strongly the three subjects are associated. Reading and Writing are more strongly associated with each other than either is with Math.

In [None]:
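# Correlate only the numeric score columns; the categorical columns cannot be correlated
eduCorr = eduData.corr(numeric_only=True)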
sns.heatmap(eduCorr,xticklabels=eduCorr.columns, yticklabels=eduCorr.columns,annot=True)
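For readers who want to see what the heatmap values mean, here is a minimal sketch that computes the Pearson correlation by hand and compares it with `DataFrame.corr`, whose default method is Pearson. The five rows in `toy` are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for five students
toy = pd.DataFrame({
    "reading score": [50, 60, 70, 80, 90],
    "writing score": [52, 61, 69, 82, 88],
    "math score":    [70, 55, 80, 60, 75],
})

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_read_write = pearson(toy["reading score"], toy["writing score"])
r_read_math = pearson(toy["reading score"], toy["math score"])

# pandas computes the same quantity for every pair of columns
corr_matrix = toy.corr()

print(round(r_read_write, 3), round(r_read_math, 3))
print(corr_matrix)
```

A value close to 1 means the two scores rise together almost linearly, while a value near 0 means knowing one score says little about the other.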

The data distribution shows us the shape of the data we are analysing, which is very important for modeling. There is a clearly linear association between the subjects, which makes sense since students who perform well in one subject tend to perform well in the others.

Also, all three subjects seem to have a bell shape, which is encouraging for the Linear Regression (strictly speaking, the normality assumption concerns the model residuals rather than the scores themselves).

In [None]:
# Data Distribution
sns.pairplot(eduData)

I always try to be very careful when I make assumptions about the data. I would like to apply a Linear Regression to understand how social background variables may affect the students' exam performance.

A very useful tool for checking the normality of the data is the Q-Q plot. It is a graphical technique for determining whether the data come from a theoretical distribution; in our case, we want to check whether the data come from a Normal distribution. The data seem to fit the Normal distribution, although the values in the tails deviate from the theoretical line.

In [None]:
from statsmodels.graphics.gofplots import qqplot

# New figure with one panel per subject (axes reused from an earlier figure would not be displayed in this cell)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Checking Normality
fig.suptitle('Theoretical Normal Distribution')

qqplot(eduData['math score'], line = 's', ax = axes[0])
qqplot(eduData['reading score'], line = 's', ax = axes[1])
qqplot(eduData['writing score'], line = 's', ax = axes[2])
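As a numeric complement to the visual check, the sketch below uses `scipy.stats.probplot`, which computes the same theoretical-versus-sample quantile pairs a Q-Q plot draws and also returns the correlation `r` of the fitted line. The two samples are synthetic and their parameters are arbitrary assumptions; they only illustrate that `r` is close to 1 for normal data and lower for skewed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples: one normal, one clearly skewed (exponential)
normal_sample = rng.normal(loc=66, scale=15, size=1000)
skewed_sample = rng.exponential(scale=15, size=1000)

# probplot returns (theoretical quantiles, ordered sample) and (slope, intercept, r)
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for the normal sample:", round(r_normal, 4))
print("r for the skewed sample:", round(r_skewed, 4))
```

The same call could be applied to each score column; a value of `r` noticeably below 1 would point to the kind of tail deviation visible in the plots.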

# Linear Regression

Linear Regression can be a very powerful and useful model. There is a full theoretical background explaining why Ordinary Least Squares works so well for describing linear relationships in data; I will not cover it here, but I strongly advise everyone to study it.

The relationship between the grades achieved by the students and their social background variables may be described through Linear Regression. In our case, where the target variable is numerical and the explanatory variables are categorical, we will perform a Linear Regression using dummy variables.

A few preparations are needed to obtain a dataset compatible with the regression; they are described in the code.

In [None]:
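# Preparing Dataset for Linear Regression
# dtype=float keeps the dummies numeric; recent pandas versions return booleans,
# which statsmodels may reject once they are mixed with the float constant.
# Note: every category level is kept together with the constant, so the dummy
# columns are linearly dependent; read individual coefficients with care.
eduDataLR = pd.get_dummies(eduData, dtype=float)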

# Explanatory Variables
eduDataLR_X = eduDataLR.drop(columns=['math score','reading score', 'writing score'])
eduDataLR_X = sm.add_constant(eduDataLR_X)
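One point worth illustrating: the cell above keeps one dummy column per category level and also adds a constant, so the columns of each variable sum to the constant and the design matrix is not of full rank. A common alternative (not what this notebook does) is to drop one reference level per variable with `drop_first=True`. The sketch below compares both encodings on a made-up frame; `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two categorical variables
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard",
              "standard", "free/reduced", "free/reduced"],
})

# Encoding 1: one dummy per level, plus a constant column
full = pd.get_dummies(toy, dtype=float)
full.insert(0, "const", 1.0)

# Encoding 2: drop the first level of each variable, plus a constant column
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
reduced.insert(0, "const", 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print("full encoding:   ", full.shape[1], "columns, rank", rank_full)
print("reduced encoding:", reduced.shape[1], "columns, rank", rank_reduced)
```

With the reduced encoding every coefficient is read as a difference from the dropped reference level, which makes the summary table easier to interpret; the fitted values are the same under both encodings.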

Applying the Linear Regression with Math Score as the target, we can confirm some of the evidence already seen in the exploratory analysis. The Gender variable shows that female students score lower than male students in Mathematics (approximately five points less).

Also, students who belong to race/ethnicity group E tend to score much higher than those who belong to groups A, B, C, or D (the estimated effect is approximately ten times that of group A and two times that of group D).

A parental Master's degree also shows strong evidence of affecting student performance in the Math exam, as do Lunch and Test Preparation Course.

It is important to notice that the coefficients for **parental level of education_high school**, **parental level of education_some high school**, and **race/ethnicity_group A** were not statistically significant.

In [None]:
# Math Score Model
modelMathScore = sm.OLS(eduDataLR['math score'],eduDataLR_X).fit()
modelMathScore.summary()
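To clarify how a dummy coefficient in the summary table can be read, here is a minimal sketch with a single binary variable. It uses `numpy.linalg.lstsq` rather than statsmodels to stay dependency-light, and the six scores are invented for illustration. With one dummy and a constant, the intercept is the mean of the reference group and the dummy coefficient is the difference between the two group means.

```python
import numpy as np

# Hypothetical math scores: three female students, then three male students
scores = np.array([60.0, 70.0, 80.0, 65.0, 75.0, 85.0])
is_male = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Design matrix: constant plus one dummy (female is the reference level)
X = np.column_stack([np.ones_like(scores), is_male])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
intercept, gender_effect = coef

mean_female = scores[is_male == 0].mean()
mean_male = scores[is_male == 1].mean()

print("intercept:", round(intercept, 2), "| mean of female group:", mean_female)
print("dummy coefficient:", round(gender_effect, 2),
      "| difference in means:", mean_male - mean_female)
```

With several categorical variables in the model the same reading applies, except that each difference is estimated while holding the other variables fixed.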

Now we apply the same explanatory variables to a new target variable, Reading Score. The result for Gender is now the opposite of before: female students perform better in Reading than male students (approximately seven points more). The gaps between race/ethnicity groups are less abrupt than before, but still very large.

The other variables follow a pattern similar to the previous analysis.

In [None]:
# Reading Model
modelReadingScore = sm.OLS(eduDataLR['reading score'],eduDataLR_X).fit()
modelReadingScore.summary()

The Linear Regression with Writing Score as the target variable shows results very similar to those for Reading Score. Nevertheless, the R-squared was slightly better for this model.

In [None]:
# Writing Model (stored under its own name so the reading model is not overwritten)
modelWritingScore = sm.OLS(eduDataLR['writing score'],eduDataLR_X).fit()
modelWritingScore.summary()

It is often important to try to develop a parsimonious model, that is, to explain the data with the minimum number of parameters possible. Less complex models may describe the relationship between target and explanatory variables better.

I will try removing gender and lunch from the model in an attempt to improve it. I chose these two variables because I believe gender should not affect any student's exam performance, and lunch because I believe race/ethnicity implicitly holds that information in some way.

Removing **gender** and **lunch**.

In [None]:
eduDataReduce = eduData.drop(columns=['gender','lunch'])

# Preparing Dataset for Linear Regression (dtype=float keeps the dummies numeric for statsmodels)
eduDataLR = pd.get_dummies(eduDataReduce, dtype=float)

# Explanatory Variables
eduDataLR_X = eduDataLR.drop(columns=['math score','reading score', 'writing score'])
eduDataLR_X = sm.add_constant(eduDataLR_X)


In all cases the models deteriorated when the variables were removed. The R-squared, AIC, and BIC are worse than in the previous models. Therefore, my attempt at parsimony did not work.

In [None]:
# Math Score Model
modelMathScore = sm.OLS(eduDataLR['math score'],eduDataLR_X).fit()
modelMathScore.summary()

In [None]:
# Reading Model
modelReadingScore = sm.OLS(eduDataLR['reading score'],eduDataLR_X).fit()
modelReadingScore.summary()

In [None]:
# Writing Model (stored under its own name so the reading model is not overwritten)
modelWritingScore = sm.OLS(eduDataLR['writing score'],eduDataLR_X).fit()
modelWritingScore.summary()
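For reference, the comparison by AIC and BIC used above can be sketched by hand. The helper `gaussian_aic_bic` below is a hypothetical function written for this illustration; it uses the Gaussian log-likelihood of an OLS fit and counts only the regression coefficients as parameters, which to my understanding matches the convention behind the values statsmodels reports. The data are synthetic, with an invented effect of ten points for a binary variable.

```python
import numpy as np

def gaussian_aic_bic(y, X):
    """AIC and BIC of an OLS fit, based on the Gaussian log-likelihood."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = float(resid @ resid)
    # Log-likelihood evaluated at the maximum-likelihood variance ssr / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    aic = 2 * k - 2 * llf
    bic = k * np.log(n) - 2 * llf
    return aic, bic

# Synthetic data: a binary variable that truly shifts the score by ten points
rng = np.random.default_rng(0)
lunch = rng.integers(0, 2, size=200).astype(float)
score = 60 + 10 * lunch + rng.normal(0, 5, size=200)

X_full = np.column_stack([np.ones(200), lunch])   # constant + lunch
X_small = np.ones((200, 1))                       # constant only

aic_full, bic_full = gaussian_aic_bic(score, X_full)
aic_small, bic_small = gaussian_aic_bic(score, X_small)

print("with lunch:   ", round(aic_full, 1), round(bic_full, 1))
print("without lunch:", round(aic_small, 1), round(bic_small, 1))
```

Lower values are better for both criteria. Dropping a variable that carries real information raises AIC and BIC despite the smaller parameter count, which is the pattern observed when gender and lunch were removed.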