Overview
--

This data set contains scores from three exams, along with a variety of personal, social, and economic factors that may interact to affect them.

**Example Research Questions**

------------

- How effective is the test preparation course?
- Which major factors contribute to test outcomes?
- What would be the best way to improve student scores on each test?

In [None]:
# import libraries we need for EDA
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#import libraries we need for predictions
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn import metrics

In [None]:
# importing our file
students_df = pd.read_csv('../input/StudentsPerformance.csv')

In [None]:
# Look at the top 5 rows, uncomment the line below to run the code
#students_df.head()

In [None]:
# Look at the last 5 rows, uncomment the line below to run the code
#students_df.tail()

> This data set consists of the marks secured by the students in various subjects.

In [None]:
# Let's take a look at the columns, shape and descriptive information of our data set
# uncomment the line below to run the code
#students_df.columns

In [None]:
# Shape of our dataset
# uncomment the line below to run the code
#students_df.shape

In [None]:
students_df.info()

>As we can see, our data set is very clean: there are no *Null values* and every column has the type we expected. There are 5 columns with categorical values and 3 with numeric (integer) values.

In [None]:
# Summary statistics of our numeric columns of entire dataset
students_df.describe()

- All numeric columns have almost equal *standard deviations*, and their *mean values* are very similar.
- The minimum score of all subjects is 0. It may be an outlier, but as long as it is not a negative number we can treat it as valid (a student can score 0 for some reason).

In [None]:
categorical_features = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']

In [None]:
# Counts on categorical columns
for feature in categorical_features:
    print(feature,':')
    print(students_df[feature].value_counts())
    print('----------------------------')

>We are going to plot all the categorical columns to see the differences. The counts differ a lot between categories, except in the *gender* column.

In [None]:
fig, axes = plt.subplots(3,2, figsize=(12,12))

def get_x_labels(column):
    # helper function to get all xlabels for all axes
    col_dict = dict(students_df[column].value_counts())
    return col_dict.keys()

x_labels = [list(get_x_labels(feature)) for feature in categorical_features]

def get_y_ticks(column):
    # helper function to get all heights for all axes
    return students_df[column].value_counts()

y_ticks = [list(get_y_ticks(feature)) for feature in categorical_features]

for i in range(3):
    for j in range(2):
        idx = 2*i + j  # position of this axes' feature in categorical_features
        if idx >= len(categorical_features):
            # 6 axes but only 5 features: hide the unused axes
            axes[i,j].set_visible(False)
            continue
        axes[i,j].bar(x_labels[idx], y_ticks[idx])
        axes[i,j].set_frame_on(False)
        axes[i,j].set_xticklabels(x_labels[idx], rotation=45)
        axes[i,j].set_title('{} Counts'.format(categorical_features[idx].capitalize()))
plt.tight_layout()
plt.show()

I will try to explain the code above as simply as I can:

- *fig, axes = plt.subplots(3,2, figsize=(12,12))*:
    - creates a figure; you can think of it as a white sheet of paper onto which we are going to draw our plots (take a look at the code below to get a better understanding)
    - each axes is a plotting area with its own horizontal axis $x$ and vertical axis $y$
    - this line creates a white paper with a $3 \times 2$ grid of axes
    - so we can think of the axes as objects on which we can add, remove or draw anything we want
- *axes[i,j].bar(x, y, more params)*:
    - creates a bar plot (we usually use bar plots for categorical variables)
- *axes[i,j].set_frame_on(False)*:
    - removes the border lines around the axes but keeps the ticks (as if you had drawn the axes with a pencil and then erased the lines)
- *axes[i,j].set_xticklabels(label, rotation=45, more params)*:
    - sets the labels of the x-axis ticks, rotated by $45^\circ$
- *axes[i,j].set_title(title)*:
    - adds a title to our plot, here the feature name followed by "Counts"

In [None]:
# To make the difference between a figure and an axes clearer,
# the code below gives them different colors. (I first saw this
# trick on LinkedIn, in a post by Ted Petrou.)

fig_new, ax = plt.subplots()

fig_new.set_facecolor('tab:cyan') #our paper has cyan color, our axes in white color

In [None]:
ax.set_facecolor('tab:green') #our axes now has green color
fig_new

I would like to highlight the following notes:

- Two thirds of our students did NOT take the test preparation course
- Group A of the Race/Ethnicity column has the fewest representatives, less than a third as many as Group C
- In Parental Level of Education, 'Some College' has the highest count and 'Master's Degree' the lowest.

We could split our dataset into smaller ones to analyze each category separately.

In [None]:
numeric_features = ['math score', 'reading score', 'writing score']

In [None]:
# First of all, let us take a look at
# the distribution of each numeric column

for feature in numeric_features:
    students_df[feature].plot(kind='hist', bins=20)
    plt.title('{} Distribution'.format(feature))
    plt.show()

In [None]:
# We print all the minimum values for each numeric feature

print('The minimum score for Maths is: {}'.format(students_df['math score'].min()))
print('The minimum score for Reading is: {}'.format(students_df['reading score'].min()))
print('The minimum score for Writing is: {}'.format(students_df['writing score'].min()))

The next step is to visualize how each category of each categorical feature performs.

We will group our dataset by each category and draw boxplots, because a boxplot shows:

- the lower and upper caps of the whiskers (the lowest and highest values that are not drawn as outliers),
- the median (green line),
- the quartiles: 25% (bottom of the box), 50% (the same as the median), 75% (top of the box),
- the outliers (circles)
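
To make the parts of a boxplot concrete, here is a minimal sketch in plain Python of the numbers behind a single box. It assumes the default matplotlib convention that the whiskers reach the furthest points within 1.5 times the interquartile range, and that anything beyond is drawn as an outlier. The `scores` list and the `box_stats` helper are made up for illustration and are not part of the dataset.

```python
from statistics import quantiles

def box_stats(values, whis=1.5):
    """Return the numbers a boxplot draws for one group (hypothetical helper)."""
    data = sorted(values)
    # method='inclusive' interpolates linearly between the data points
    q1, median, q3 = quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        'lower_cap': inside[0],    # end of the lower whisker
        'upper_cap': inside[-1],   # end of the upper whisker
        'outliers': [v for v in data if v < low_fence or v > high_fence],
    }

# made-up scores, not taken from the dataset
scores = [0, 45, 50, 55, 60, 62, 65, 70, 75, 80, 100]
stats = box_stats(scores)
print(stats)
```

In this toy example the score of 0 falls below the lower fence (52.5 - 1.5 * 20 = 22.5), so it is reported as an outlier and the lower cap is 45 rather than the minimum.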

In [None]:
students_df.boxplot(column=numeric_features, by='gender', rot=45, figsize=(15,6), layout=(1,3));

In [None]:
students_df.boxplot(column=numeric_features, by='test preparation course', rot=45, figsize=(15,6), layout=(1,3));

In [None]:
students_df.boxplot(column=numeric_features, by='parental level of education', rot=90, figsize=(15,6), layout=(1,3));

In [None]:
students_df.boxplot(column=numeric_features, by='race/ethnicity', rot=45, figsize=(15,6), layout=(1,3));

In [None]:
students_df.boxplot(column=numeric_features, by='lunch', rot=45, figsize=(15,6), layout=(1,3));

Let's start our analysis. I am going to start with the *Test Preparation Course*. A first note:
- For students who completed the test preparation course (TPC), the outliers in math, reading and writing lie close to the lower whisker, while for those who did not complete it they lie much further below.

- Moreover, the scores of those who completed the TPC have smaller variance: the box sits higher in every subject and the whiskers are shorter.

In my opinion, a student who completed the TPC has a better chance of getting a higher score! So yes, the TPC does appear to affect the scores.

Another interesting plot is *Parental Level of Education*.
- Master's Degree does a bit better than Bachelor's Degree
- Master's and Bachelor's Degree have long boxes and short whiskers, almost no outliers (only one) and always a higher median than the others
- The Associate's Degree level is very steady: it has almost the same values on all three exams

I believe that the parental level of education affects a student's score. But our groups are not equally sized, so I think a better-balanced dataset could give us clearer results.

Last, I would like to talk about the *Race/Ethnicity* boxplot.
- By far the best group is Group E: low variance, few outliers, short whiskers and the highest medians
- Group C, which is the largest group, on the other hand doesn't do very well: a large number of outliers and long whiskers
- Group A is the worst group in terms of scores; it is always lower than the other groups.

I don't know how Race/Ethnicity could affect the scores of a student. Because the group sizes differ so much, I believe we can't be sure whether it really affects the scores.
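
One way to see why unequal group sizes matter is the standard error of a mean, $s/\sqrt{n}$: the fewer students a group has, the less precisely its mean is pinned down. The sketch below uses a made-up standard deviation and made-up group sizes, purely for illustration; `standard_error` is a hypothetical helper, not something used elsewhere in this notebook.

```python
from math import sqrt

def standard_error(std, n):
    """Standard error of a sample mean: std / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std / sqrt(n)

# hypothetical numbers, not taken from the dataset
small_group = standard_error(std=15.0, n=90)
large_group = standard_error(std=15.0, n=320)
print(round(small_group, 2), round(large_group, 2))
```

With the same spread, the mean of the smaller group is almost twice as uncertain, so differences that involve small groups should be read with more caution.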

Summary Statistics for Parental Level of Education and Test Preparation Course
--

In [None]:
# We are going to split our dataset to smaller,
# one for each category, and compare their statistics
# with the overall statistics.

df_compl = students_df[students_df['test preparation course'] == 'completed']
df_notcompl = students_df[students_df['test preparation course'] == 'none']

In [None]:
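# A good way to decide if and how the test preparation course helped
# is to compare the mean values of our two subsets to the entire dataset.
# numeric_only=True skips the text columns (required by pandas >= 2.0).
# A negative difference means the subset scored above the overall mean.
overall_mean = students_df.mean(numeric_only=True)
print(overall_mean - df_compl.mean(numeric_only=True))
print(overall_mean - df_notcompl.mean(numeric_only=True))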

This is very nice! The mean scores of those who took the preparation course are clearly above the overall mean!

In [None]:
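# numeric_only=True skips the text columns (required by pandas >= 2.0)
overall_std = students_df.std(numeric_only=True)
print(overall_std - df_compl.std(numeric_only=True))
print('--------------')
print(overall_std - df_notcompl.std(numeric_only=True))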

The standard deviation shows, again, that the scores of the subset that completed the TPC have less variance.

>As the plots above show, a student who completes the TPC has a much better chance of getting a high score in the exams.

In [None]:
df_BD = students_df[students_df['parental level of education'] == "bachelor's degree"]
df_MD = students_df[students_df['parental level of education'] == "master's degree"]
df_sc = students_df[students_df['parental level of education'] == 'some college']
df_AD = students_df[students_df['parental level of education'] == "associate's degree"]
df_hs = students_df[students_df['parental level of education'] == 'high school']
df_shs = students_df[students_df['parental level of education'] == 'some high school']

In [None]:
# numeric_only=True skips the text columns (required by pandas >= 2.0)
overall_mean = students_df.mean(numeric_only=True)
for subset in (df_BD, df_MD, df_sc, df_shs, df_hs, df_AD):
    print(overall_mean - subset.mean(numeric_only=True))
    print('--------------')

In [None]:
# numeric_only=True skips the text columns (required by pandas >= 2.0)
overall_std = students_df.std(numeric_only=True)
for subset in (df_BD, df_MD, df_sc, df_shs, df_hs, df_AD):
    print(overall_std - subset.std(numeric_only=True))
    print('--------------')

Great notes!!! As we expected, the Master's, Bachelor's and Associate's Degree groups did better than the entire dataset, and the Some College group did slightly better too.

>In conclusion, I think that Parental Level of Education also affects a student's score.

In [None]:
df_A = students_df[students_df['race/ethnicity'] == 'group A']
df_B = students_df[students_df['race/ethnicity'] == 'group B']
df_C = students_df[students_df['race/ethnicity'] == 'group C']
df_D = students_df[students_df['race/ethnicity'] == 'group D']
df_E = students_df[students_df['race/ethnicity'] == 'group E']

In [None]:
# numeric_only=True skips the text columns (required by pandas >= 2.0)
overall_mean = students_df.mean(numeric_only=True)
for subset in (df_A, df_B, df_C, df_D, df_E):
    print(overall_mean - subset.mean(numeric_only=True))
    print('--------------')

In [None]:
# numeric_only=True skips the text columns (required by pandas >= 2.0)
overall_std = students_df.std(numeric_only=True)
for subset in (df_A, df_B, df_C, df_D, df_E):
    print(overall_std - subset.std(numeric_only=True))
    print('--------------')

Well done!!! Group E has a much better mean than the entire dataset, and Group D also does better. However, Group E's standard deviation is similar to that of the entire dataset.

> Ultimately, it is very possible that Race/Ethnicity affects the scores.

Linear Regression - Score Prediction
--

In [None]:
students_dummies = pd.get_dummies(students_df, drop_first=True, columns=categorical_features)
students_dummies.head()

Creating Our First Model
--

> This model will keep two out of three scores and we are going to predict the third. *Reading & Writing Scores* are kept, *math score* is the target variable.

In [None]:
features = ['reading score', 'writing score', 'gender_male',
       'race/ethnicity_group B', 'race/ethnicity_group C',
       'race/ethnicity_group D', 'race/ethnicity_group E',
       'parental level of education_bachelor\'s degree',
       'parental level of education_high school',
       'parental level of education_master\'s degree',
       'parental level of education_some college',
       'parental level of education_some high school', 'lunch_standard',
       'test preparation course_none']

In [None]:
target = ['math score']

In [None]:
class ColumnLinearRegression(BaseEstimator, RegressorMixin):
    # columns are a "hyperparameter" of our estimator,
    # so we have to pass them in through the __init__ method
    # and save them on the instance to keep track of them
        
    def __init__(self, columns):
        if not isinstance(columns, list):
            raise ValueError("columns must be a list")
        self.columns= columns
        self.lr = LinearRegression()
        
    def _select(self, X):
        return X[self.columns]
        
    def fit(self, X, y):
        self.lr.fit(self._select(X), y)
        return self
    
    def predict(self, X):
        return self.lr.predict(self._select(X))

In [None]:
def r2_adj(y_t, X_t, feat, pred):
    #this function calculates the r^2 adjusted
    r2 = metrics.r2_score(y_t, pred)
    n = len(X_t)
    p = len(feat)
    return 1-((1-r2)*(n-1)/(n-p-1))

In [None]:
feat_list = []
new_dict = {}
i=0
r2_adj_max = 0
best_model = None
for feature in features:
    feat_list.append(feature)
    clr = ColumnLinearRegression(list(feat_list))  # pass a copy so later changes to feat_list cannot alter this model
    X_train, X_test, y_train, y_test = train_test_split(students_dummies[feat_list], students_dummies[target], 
                                                    test_size=0.3, random_state=42)
    clr.fit(X_train, y_train)
    predictions = clr.predict(X_test)
    variables = len(clr.columns)
    mse = metrics.mean_squared_error(y_test, predictions)
    r2 = metrics.r2_score(y_test, predictions)
    r2_adjusted = r2_adj(y_test, X_test, clr.columns, predictions)
    if r2_adjusted > r2_adj_max:
        best_model = clr
        r2_adj_max = r2_adjusted
        new_dict[i] = {'var': variables, 
                       'MSE': mse,
                       'R^2': r2, 
                       'R^2_adjusted': r2_adj_max}
    else:
        feat_list.pop()  # drop the feature that did not improve the adjusted R^2
    i+=1
print(best_model.columns)    

In [None]:
df = pd.DataFrame(new_dict)
df.head()

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average absolute error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are **loss functions**, because we want to minimize them.
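
As a quick check of the three formulas above, here is a small sketch in plain Python. The `y_true` and `y_pred` values are made up for illustration, and the helper names `mae`, `mse` and `rmse` are my own (in the notebook itself we rely on `sklearn.metrics`).

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, back in the units of y."""
    return sqrt(mse(y_true, y_pred))

y_true = [3, 5, 7, 9]    # made-up targets
y_pred = [2, 5, 8, 12]   # made-up predictions
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

The errors are 1, 0, -1 and -3. The single error of 3 contributes 9 of the 11 squared-error units, which is the "punishing" effect of MSE mentioned above.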

Another common metric for regression is $R^2$, also known as the **coefficient of determination**. The $R^2$ quantifies how our model's MSE compares to a naive model in which we always predict the mean $y$ value, $\bar{y}$.

$$ 1 - \frac{\sum_i \left[f(X_i) - y_i\right]^2}{\sum_i\left(\bar{y} - y_i\right)^2} $$

If our $R^2 < 0$ we know our model is very bad, because its MSE is larger than the MSE of the mean model.
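
To illustrate how $R^2$ behaves, the sketch below computes it directly from the formula above in plain Python. The data and the three sets of predictions are made up, and `r2_plain` is a hypothetical helper (the notebook itself uses `metrics.r2_score`).

```python
def r2_plain(y_true, y_pred):
    """1 - (residual sum of squares) / (sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((mean_y - t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]            # made-up targets, mean is 6
mean_model = [6, 6, 6, 6]        # always predicts the mean
decent_model = [2, 5, 8, 12]     # roughly follows the targets
reversed_model = [9, 7, 5, 3]    # gets the trend backwards

r2_mean = r2_plain(y_true, mean_model)
r2_decent = r2_plain(y_true, decent_model)
r2_reversed = r2_plain(y_true, reversed_model)
print(r2_mean, round(r2_decent, 2), r2_reversed)
```

The mean model scores exactly 0, the reasonable model lands between 0 and 1, and the model that gets the trend backwards ends up well below 0.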

One important consideration when choosing a metric is how it scales with the data. A caveat for $R^2$ is that, on the data used to fit the model, it never gets worse as we add more features, even when those features carry no real information. That usually does not mean the model is actually better. So besides $R^2$ we calculate the adjusted $R^2$, a modified version of $R^2$ that has been adjusted for the number of predictors in the model. The adjusted $R^2$ increases only if the new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. The adjusted $R^2$ can be negative, but it usually is not. It is never higher than $R^2$.
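
For reference, the `r2_adj` function defined earlier in this notebook computes the adjustment as

$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$$

where $n$ is the number of samples and $p$ the number of predictors. The sketch below applies the same formula to hypothetical values (an $R^2$ of 0.80 on 300 samples) to show how the penalty grows with $p$. The function name `adjusted_r2` and the numbers are assumptions made for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# hypothetical values: the same R^2 reached with 2 and with 14 predictors
few = adjusted_r2(0.80, n=300, p=2)
many = adjusted_r2(0.80, n=300, p=14)
print(round(few, 4), round(many, 4))
```

Both results are below 0.80, and the model that needed 14 predictors to reach the same $R^2$ is penalized more.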