# Dataset and Analysis outline

This dataset contains one row per student, identified by a student ID. For each student it records:
- school
- school type (public or non-public)
- school setting (urban, suburban, rural)
- classroom
- teaching method (standard or experimental)
- number of students in the class
- gender of the student
- whether they qualify for a free/reduced lunch or not
- pre-test scores
- post-test scores

Here I will explore the data **visually** as well as conduct a **multiple linear regression** in order to identify possible features that could predict the post-test performance of students.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

## Data Overview

In [None]:
data_raw = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')
data_raw.head()

In [None]:
data_raw.isna().sum()

**There is no missing data.**

In [None]:
data_raw.columns.values

In [None]:
data_raw.describe()

## Drop unnecessary features

In [None]:
df = data_raw.copy()
df = df.drop('student_id',axis=1) # student id can be dropped, as the index provides a unique identifier
df.head()

## Create dummy variables for categorical variables

In [None]:
df.school.unique()

In [None]:
schools = pd.get_dummies(df.school, drop_first = True)
schools.head()

In [None]:
room = pd.get_dummies(df.classroom, drop_first = True)
room.head()

**Because there are so many different schools and classrooms, I store their dummy variables in two separate dataframes (`schools` and `room`) so the information is kept in case we want to add it to the model later. I then drop both features from the main dataframe to keep the data concise and check the model's accuracy without the school and classroom information.**
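
As a small illustration of what `drop_first=True` does, here is a sketch on a made-up column (the school codes "A", "B", "C" are hypothetical, not taken from the dataset). The first category is dropped and becomes the baseline, so a row with zeros in every remaining column belongs to that baseline category.

```python
import pandas as pd

# Toy data: these school codes are invented for illustration only.
toy = pd.DataFrame({"school": ["A", "B", "C", "A", "B"]})

# drop_first=True removes the column for the first category ("A").
# "A" is then encoded implicitly as "all remaining dummies are zero".
toy_dummies = pd.get_dummies(toy.school, drop_first=True)

print(toy_dummies.columns.tolist())  # the categories that keep their own column
print(toy_dummies.shape)             # rows x (number of categories - 1)
```

Dropping one category avoids a set of dummy columns that always sums to 1, which would be perfectly collinear with the model's intercept.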

In [None]:
df = df.drop(['school','classroom'],axis=1)

In [None]:
df.school_setting.unique()

In [None]:
df.teaching_method.unique()

In [None]:
df.lunch.unique()

In [None]:
df.school_type.unique()

In [None]:
df.teaching_method = df.teaching_method.map({'Standard':1, 'Experimental':0})
df.gender = df.gender.map({'Female':1,'Male':0})
df.lunch = df.lunch.map({"Does not qualify":0, "Qualifies for reduced/free lunch":1})
df.school_type = df.school_type.map({'Public':1, 'Non-public': 0})
df.school_setting = df.school_setting.map({'Urban': 0,'Suburban': 1, 'Rural': 2})

In [None]:
df.head()

## Visual inspection of the features and their relationships

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2,sharey=True,sharex=True)
ax1.hist(df.pretest)
ax1.set_title('Pretest scores',size=15)
ax2.hist(df.posttest)
ax2.set_title('Posttest scores',size=15)
plt.show()

In [None]:
plt.hist(df.n_student)
plt.title('Number of students per class',size=20)
plt.show() # render the figure, consistent with the other plots

From the numerical data we can see that the number of students per class varies between 14 and 31, with an average of ~23 students per class (standard deviation = 4.23).

Pre-test scores range between 22 and 93, with an average of ~55% (std = 13.56), and appear to be approximately normally distributed.
Post-test scores range between 32 and 100, with an average of ~67% (std = 13.99), and also appear to be approximately normally distributed.

**Because the number of students and the pre-test scores are on different scales (the average pre-test score is roughly twice the average class size), the numerical data should be standardized before fitting the model, so that the resulting weights can be compared with each other.**
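
A minimal sketch of what standardizing means, using made-up values rather than the dataset: each value is expressed as its distance from the mean in units of standard deviation (a z-score). `StandardScaler` applies the same formula with the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Made-up values on two different scales (class size, test score).
n_student_demo = np.array([14.0, 20.0, 23.0, 27.0, 31.0])
pretest_demo = np.array([22.0, 45.0, 55.0, 70.0, 93.0])

def standardize(x):
    # z-score: subtract the mean, divide by the population std (ddof=0)
    return (x - x.mean()) / x.std()

z_n = standardize(n_student_demo)
z_pre = standardize(pretest_demo)

# After standardizing, both features have mean 0 and std 1.
print(z_n.round(2), z_pre.round(2))
```

Once both features are on this common scale, a weight describes the change in the target per one standard deviation of the feature, which makes weights easier to compare.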

In [None]:
sns.boxplot(x='gender',y='pretest',data=df)

In [None]:
sns.boxplot(x='gender',y='posttest',data=df)

**There doesn't seem to be a gender effect for either pre-test or post-test scores.**

## Checking assumptions

### Linearity

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2,sharex=True)
ax1.scatter(df.posttest,df.pretest)
ax1.set_title('Posttest scores vs. Pretest scores')
ax2.scatter(df.posttest,df.n_student)
ax2.set_title('Posttest scores vs. #students')
plt.show()

**There seems to be a linear relationship between the continuous input variables "Pre-test" and "number of students" and the target variable "post-test".**

### Homoscedasticity

**The scatter plots do not display a cone-shaped distribution, which would indicate heteroscedasticity. The assumption of equal variance therefore appears to hold.**
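
Equal variance is strictly a statement about the residuals, so a complementary check is to compare the residual spread at low and high fitted values. The sketch below uses synthetic data with constant noise by construction (all numbers are assumptions chosen for illustration); a ratio far from 1 would hint at a cone shape.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(20, 95, size=1000)               # synthetic predictor
y = 12 + 0.98 * x + rng.normal(0, 4, size=1000)  # same noise level everywhere

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
resid = y - fitted

# Compare the residual spread in the lower and upper half of the fitted values.
cut = np.median(fitted)
low = resid[fitted <= cut]
high = resid[fitted > cut]
spread_ratio = high.std() / low.std()
print(round(spread_ratio, 2))  # close to 1 when the variance is roughly constant
```

The same comparison could be run on `y_train - yhat` once a model has been fitted.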

### Multicollinearity (VIF)

In [None]:
variables = df[['pretest','n_student']]
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(variables.values,i) for i in range(variables.shape[1])]
vif['features'] = variables.columns
vif

**The variance inflation factor for both features exceeds 5, but remains below 10. These values can be considered borderline, but not unacceptably high.**
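
One caveat worth checking: to my knowledge, statsmodels' `variance_inflation_factor` regresses each column on the other columns exactly as passed and does not add an intercept itself. For features whose means are far from zero, this can inflate the values. The sketch below computes the VIF by hand with an intercept, using the definition VIF = 1 / (1 - R²). The helper name `vif_with_intercept` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other features
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic, independent stand-ins for a score-like and a class-size-like feature.
rng = np.random.default_rng(0)
a = rng.normal(55, 13, size=500)
b = rng.normal(23, 4, size=500)
vifs = vif_with_intercept(np.column_stack([a, b]))
print(vifs.round(3))  # near 1 for unrelated features
```

With statsmodels, a comparable result should be obtainable by passing `sm.add_constant(variables)` and ignoring the VIF reported for the constant column.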

## Scaling the data

In [None]:
target = df.posttest
inputs = df.drop('posttest',axis=1)
to_scale = pd.concat([df.n_student,df.school_setting,df.pretest],axis=1)

Only the continuous variables need to be scaled. 'School_setting' will also be scaled because it has three options (0, 1, 2).

In [None]:
scaler = StandardScaler()
scaler.fit(to_scale)
scaled = scaler.transform(to_scale) # this creates an array that we need to turn back into a dataframe
scaled_inputs = pd.DataFrame(data=scaled,columns=['n_student','school_setting','pretest'])

In [None]:
inputs = inputs.drop(['n_student','school_setting','pretest'],axis=1)
inputs = pd.concat([inputs,scaled_inputs],axis=1)

In [None]:
inputs.head()

## Train Test Split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2, random_state=42)
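
Above, the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data (the `toy_` names and values are assumptions, not part of this notebook's pipeline), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, only for illustrating the order of operations.
rng = np.random.default_rng(42)
toy_inputs = pd.DataFrame({
    "n_student": rng.integers(14, 32, size=200),
    "pretest": rng.integers(22, 94, size=200),
})
toy_target = pd.Series(rng.normal(67, 14, size=200), name="posttest")

# 1) Split first ...
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_inputs, toy_target, test_size=0.2, random_state=42
)

# 2) ... then learn mean/std from the training rows only
toy_scaler = StandardScaler().fit(toy_x_train)

# 3) ... and apply the same transformation to both parts.
x_train_scaled = toy_scaler.transform(toy_x_train)
x_test_scaled = toy_scaler.transform(toy_x_test)
```

For a dataset of this size the difference in results is likely small, but this order keeps the test set untouched until evaluation.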

## Create Regression Model

In [None]:
reg = LinearRegression()
reg.fit(x_train, y_train)
yhat = reg.predict(x_train) # predictions made by the model

In [None]:
plt.scatter(y_train, yhat, alpha = 0.3)
plt.title('Linear Regression Model', size=20)
plt.xlabel('Target (y_train)', size=15)
plt.ylabel('Predictions (yhat)',size=15)
plt.show()

In [None]:
sns.displot(y_train-yhat)
plt.title('Residuals Distribution',size=20)
plt.show()

## Model Summary

In [None]:
reg.score(x_train,y_train)

**The model explains about 94.7% of the variability in post-test scores on the training data.**
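
For reference, `reg.score` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal hand-written version on made-up numbers (the helper `r_squared` is illustrative, not part of scikit-learn):

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_demo = np.array([60.0, 70.0, 80.0, 90.0])

perfect = r_squared(y_demo, y_demo)                       # predictions equal targets
baseline = r_squared(y_demo, np.full(4, y_demo.mean()))   # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, and a perfect model scores 1, so a value near 0.95 means the predictions capture most of the spread in the targets.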

In [None]:
reg.intercept_

In [None]:
summary = pd.DataFrame(inputs.columns.values,columns=['Features'])
summary['Weights'] = reg.coef_
summary.sort_values(by=['Weights'], ascending=False)

**"School_type" and "school_setting" have weights very close to 0. This means these two features make a negligible contribution to the model and should therefore not be included in the model.**

## Re-run the model without unnecessary features

In [None]:
inputs = inputs.drop(['school_type','school_setting'],axis=1)
inputs.head()

In [None]:
target.head()

In [None]:
x_train, x_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2, random_state=42)

## Create Regression Model

In [None]:
reg = LinearRegression()
reg.fit(x_train, y_train)
yhat = reg.predict(x_train) # predictions made by the model

In [None]:
plt.scatter(y_train, yhat, alpha = 0.3)
plt.title('Linear Regression Model', size=20)
plt.xlabel('Target (y_train)', size=15)
plt.ylabel('Predictions (yhat)',size=15)
plt.show()

In [None]:
sns.displot(y_train-yhat)
plt.title('Residuals Distribution',size=20)
plt.show()

## Model Summary

In [None]:
reg.score(x_train,y_train)

**The updated model still explains about 94.7% of the variability. This shows again that the two dropped features did not contribute meaningfully to the model.**

In [None]:
reg.intercept_

In [None]:
summary = pd.DataFrame(inputs.columns.values,columns=['Features'])
summary['Weights'] = reg.coef_
summary.sort_values(by=['Weights'], ascending=False)

**The weights show that pre-test scores seem to be the best predictor of post-test scores, with a positive relationship. The higher the pre-test score, the higher the predicted post-test score.**

**The feature with the second largest effect is "teaching_method", with a negative relationship. This feature is a categorical variable with 1 = 'Standard teaching method' and 0 = 'Experimental teaching method'. This means that students taught with the "Experimental teaching method" are predicted to score higher on the post-test.**
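
To make the reading of a 0/1 weight concrete, here is a sketch on synthetic data in which the effect of the method is built in as -6 points (all numbers are assumptions for illustration). The fitted weight of the binary feature recovers the predicted difference between the two groups at the same pre-test score.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
pre = rng.normal(55, 13, size=n)
method = rng.integers(0, 2, size=n)  # 1 = standard, 0 = experimental
# Built-in effect: the standard method lowers the score by 6 points.
post = 12 + 1.0 * pre - 6.0 * method + rng.normal(0, 3, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), pre, method])
coef, *_ = np.linalg.lstsq(A, post, rcond=None)
intercept_hat, pre_hat, method_hat = coef

# method_hat estimates the predicted score of group 1 minus group 0,
# holding the pre-test score fixed.
print(round(method_hat, 2))
```

A negative weight therefore means the group coded as 1 is predicted to score lower than the group coded as 0.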

## Test the model

In [None]:
yhat_test = reg.predict(x_test)

In [None]:
plt.scatter(y_test,yhat_test,alpha = 0.4)
plt.title('Linear Regression Model', size=20)
plt.xlabel('Target (y_test)', size=15)
plt.ylabel('Predictions (yhat_test)', size=15)
plt.show()

In [None]:
sns.displot(y_test-yhat_test)
plt.title('Residuals Distribution',size=20)
plt.show()

The scatter plot and the distribution of residuals both suggest that the model predicts post-test scores about equally well across the full range.

In [None]:
y_test = y_test.reset_index(drop=True)
y_test.head()

In [None]:
df_pf = pd.DataFrame(yhat_test.round(1), columns=['Predictions'])

In [None]:
df_pf = pd.concat([df_pf.Predictions,y_test],axis=1)
df_pf.columns = ['Predictions','Targets']
df_pf['Residuals'] = df_pf.Targets - df_pf.Predictions
df_pf['Difference%'] = np.absolute(df_pf['Residuals']/df_pf['Targets']*100).round(2) # absolute value because it doesn't matter if off by +1% or -1%
df_pf.head()

In [None]:
df_pf.sort_values(by='Difference%', ascending=False)

In [None]:
summary.sort_values(by='Weights',ascending=False)

In [None]:
reg.score(x_test,y_test)

In [None]:
MAE = mean_absolute_error(y_test,yhat_test)
MAE

### Overall, this model performs very well, explaining ~95% of the variability in post-test scores. The best predictors of post-test scores are pre-test scores and teaching method. The Mean Absolute Error (MAE) of the model is ~2.6, which means that on average the predicted post-test scores are off by about 2.6 points, in either direction.
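
The MAE averages the absolute size of the errors, so it carries no information about their direction. A small sketch with made-up numbers shows the difference between the MAE and the mean signed error, which is what would reveal systematic over- or under-prediction.

```python
import numpy as np

# Made-up targets and predictions for illustration.
targets = np.array([60.0, 70.0, 80.0, 90.0])
preds = np.array([63.0, 67.0, 82.0, 88.0])

residuals = targets - preds           # [-3, 3, -2, 2]
mae = np.mean(np.abs(residuals))      # average size of the error
mean_error = np.mean(residuals)       # average signed error (bias)

print(mae, mean_error)
```

Here the predictions miss by 2.5 points on average, yet over- and under-predictions cancel out, so there is no overall bias.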

#### The 'Experimental' teaching method seems to help students get higher post-test scores. Exploring the effect of teaching method on pre-test scores could provide insight into whether the teaching method influences the overall test performance.
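
One possible starting point for that follow-up is to compare group means per teaching method. The sketch below uses a tiny made-up table (the values are invented; on the real data the same `groupby` could be applied to `df`). Similar pre-test means would suggest that both groups started from a comparable level.

```python
import pandas as pd

# Invented values; 1 = standard, 0 = experimental, as in the mapping above.
toy = pd.DataFrame({
    "teaching_method": [1, 1, 1, 0, 0, 0],
    "pretest": [50, 55, 60, 52, 54, 62],
    "posttest": [60, 66, 72, 68, 71, 80],
})

# Improvement from pre-test to post-test for each student.
toy["gain"] = toy.posttest - toy.pretest

# Mean pre-test, post-test and gain per teaching method.
by_method = toy.groupby("teaching_method")[["pretest", "posttest", "gain"]].mean()
print(by_method)
```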