# Python Project: Predicting Test Scores
### Exploratory Data Analysis, Linear and Multiple Regression Models

In this project, I begin with exploratory data analysis (EDA): what does the data contain, and what can we learn from it? I then check which variables are correlated or associated with **post-test** scores, and which variable(s) best predict them. I hope you find this notebook useful.

Thanks for reading

## Importing Relevant Python Libraries & Data First Glance

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')
scores.head()

What we see here is 5 categorical variables (school setting, school type, teaching method, gender, and lunch) and 2 quantitative variables (number of students in class and pre-test scores). We will deal with the categorical variables later in this notebook. First, let's make sure our data is complete, and then visualize some summaries of the data.

## Data Cleaning, Processing, and EDA

In [None]:
scores.info()

In [None]:
scores.describe()

From the above we can first see that we have no missing data in our dataset, which is excellent. We can also see that the number of students in class has a minimum value of 14 and a maximum value of 31 (it will be interesting to check how this affects post-test scores).

In [None]:
fig, axes = plt.subplots(1,4)
scores['school_setting'].value_counts().plot(kind='bar', ax=axes[0], figsize=(16,6))
scores['gender'].value_counts().plot(kind='bar', ax=axes[1])
scores['teaching_method'].value_counts().plot(kind='bar', ax=axes[2])
scores['school_type'].value_counts().plot(kind='bar', ax=axes[3])

fig.suptitle('Glances From the Dataset')
axes[0].set_title('Distribution of School Location')
axes[1].set_title('Sex Distribution')
axes[2].set_title('Standard VS. Experimental Studies')
axes[3].set_title('Distribution of School Type')
plt.tight_layout()

So, what do we learn about our sample? Gender is evenly distributed, but school type, teaching method, and school location are not. The uneven school locations may simply reflect the actual population: more people live in big cities, so there are more schools there. Later on, we will check whether these factors have an impact on post-test scores.

## Recoding Categorical Variables into Numeric Values

Since we have many categorical variables, I need to recode them as numbers (0 and 1 for the two-level variables, and 0, 1, 2 for school setting) in order to fit them in the model later and check their effect on predicting post-test scores. It is important to note at this stage that we have both **pre-test** scores and **post-test** scores. It is possible that pre-test scores are correlated not just with post-test scores but also with the categorical variables, so we will need to include them in the multiple linear regression model in order to evaluate whether they change the association or not.

In [None]:
scores['lunch'] = scores['lunch'].replace('Does not qualify', 0)
scores['lunch'] = scores['lunch'].replace('Qualifies for reduced/free lunch', 1)

In [None]:
scores['school_type'] = scores['school_type'].replace('Non-public', 0)
scores['school_type'] = scores['school_type'].replace('Public', 1)

In [None]:
scores['teaching_method'] = scores['teaching_method'].replace('Standard', 0)
scores['teaching_method'] = scores['teaching_method'].replace('Experimental', 1)

In [None]:
scores['school_setting'] = scores['school_setting'].replace('Rural', 0)
scores['school_setting'] = scores['school_setting'].replace('Suburban', 1)
scores['school_setting'] = scores['school_setting'].replace('Urban', 2)
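
Note that coding `school_setting` as 0, 1, 2 treats it as an ordered numeric scale, which a linear model reads as equal steps from Rural to Suburban to Urban. One alternative, sketched here on a small made-up frame (the rows below are hypothetical and not taken from the dataset), is one-hot encoding with `pd.get_dummies`, dropping one level to serve as the baseline:

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not the real dataset.
toy = pd.DataFrame({
    'school_setting': ['Urban', 'Rural', 'Suburban', 'Urban'],
    'pretest': [62, 48, 71, 55],
})

# One 0/1 column per category; drop_first=True drops the first level
# (alphabetically 'Rural'), which then acts as the baseline.
dummies = pd.get_dummies(toy['school_setting'], prefix='setting',
                         drop_first=True, dtype=int)
toy_encoded = pd.concat([toy.drop(columns='school_setting'), dummies], axis=1)
print(toy_encoded)
```

Each remaining column then gets its own coefficient, measured relative to the Rural baseline. Wrapping the column as `C(school_setting)` inside a statsmodels formula should have a similar effect without creating the columns by hand.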

Next, I want to plot the correlations between all the numeric variables, including post-test scores.

In [None]:
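# Keep only numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains string columns.
corr = scores.select_dtypes(include='number').corr()
sns.set(rc={'figure.figsize': (15, 8)})
sns.heatmap(corr, annot=True)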

OK, this is a start. We see a **strong** correlation between pre-test scores and post-test scores. BUT we also see that pre-test score (our X, the explanatory variable) is **negatively** correlated with lunch, number of students in class, school_type, and school_setting, and **positively** correlated with teaching method. So the other variables may account for part of the relationship between pre-test and post-test scores.

We will create a simple linear regression of post-test on pre-test, test the model, and then move to multiple linear regression and add all the other variables. **We will need to test and conclude whether PRE-TEST scores are highly associated with POST-TEST scores DESPITE the other variables or not**.

In [None]:
regplot = smf.ols('posttest ~ pretest', data=scores).fit()
print(regplot.summary())

OK, so this is a summary of the association between pre-test and post-test scores, which we already know is strong. The p-value is low, which tells us the relationship is statistically significant, and the R-squared is 0.90, meaning that pre-test scores explain 90% of the variability in post-test scores.
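
To make the R-squared figure concrete: it is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal sketch on made-up numbers (the arrays are hypothetical, not from the dataset), using `np.polyfit` for the straight-line fit:

```python
import numpy as np

# Hypothetical pre/post scores for illustration only.
pre = np.array([40.0, 50.0, 55.0, 60.0, 70.0, 80.0])
post = np.array([52.0, 60.0, 68.0, 71.0, 83.0, 90.0])

# Least-squares line: post ~ slope * pre + intercept
slope, intercept = np.polyfit(pre, post, deg=1)
fitted = slope * pre + intercept

ss_res = np.sum((post - fitted) ** 2)       # variation left unexplained
ss_tot = np.sum((post - post.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f'R-squared: {r_squared:.3f}')
```

With a single predictor, this value equals the squared Pearson correlation between the two variables, which is why the heatmap and the regression summary tell the same story.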

Let's see what happens when we add all the other variables in a multiple regression.

In [None]:
regplot = smf.ols('posttest ~ pretest + lunch + n_student + school_type + school_setting + teaching_method', data=scores).fit()
print(regplot.summary())

OK, so at first glance we can conclude that school type and school setting are not very good predictors of post-test scores: their p-values indicate a weak association once the other variables are in the model. We can also conclude that pre-test scores remain a good predictor DESPITE lunch and number of students in class. That leaves teaching method, which adds 4 percentage points to the R-squared (now 94%). Pre-test scores and teaching method are the two variables I will carry forward, and the model now explains 94% of the variability in post-test scores.
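
One way to check how much an added variable contributes is to fit the model with and without it and compare R-squared, along with adjusted R-squared, which penalizes extra predictors. Below is a rough sketch on simulated data; the coefficients and noise level are assumptions chosen for illustration, not estimates from this dataset, and `r2_and_adjusted` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated stand-ins for the notebook's columns (assumed values).
pretest = rng.normal(55, 13, n)
teaching_method = rng.integers(0, 2, n)  # 0 = Standard, 1 = Experimental
posttest = 10 + 1.0 * pretest + 6 * teaching_method + rng.normal(0, 3, n)

def r2_and_adjusted(predictors, y):
    """Fit OLS with an intercept; return (R-squared, adjusted R-squared)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    k = X.shape[1] - 1  # number of predictors, excluding the intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_small, adj_small = r2_and_adjusted([pretest], posttest)
r2_full, adj_full = r2_and_adjusted([pretest, teaching_method], posttest)

print(f'pretest only:              R2={r2_small:.3f}, adjusted={adj_small:.3f}')
print(f'pretest + teaching_method: R2={r2_full:.3f}, adjusted={adj_full:.3f}')
```

Plain R-squared can never decrease when a predictor is added, so a rise in adjusted R-squared is the more informative sign that the extra variable earns its place.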

## Evaluating the Model

First, let's create a residuals plot for the regression of post-test scores on pre-test scores.

In [None]:
# Recent seaborn versions require x and y to be passed as keywords
sns.residplot(x=scores['pretest'], y=scores['posttest'])

The residuals plot looks good, with most of the values between -1 and 1, which suggests that a linear model is a reasonable fit. So now we know that our model works for the data that we HAVE. But will this model work for additional data? Other data? We need to test and evaluate that as well.
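
Note that `sns.residplot`, as called above, plots residuals from a regression of post-test on pre-test only. To look at the residuals of a model with more than one predictor, one option is to compute them directly as observed minus fitted values. A minimal sketch on simulated data (all values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150

# Simulated stand-ins for the notebook's columns (assumed values).
pretest = rng.normal(55, 13, n)
teaching_method = rng.integers(0, 2, n)
posttest = 10 + pretest + 6 * teaching_method + rng.normal(0, 3, n)

# OLS fit with an explicit intercept column.
X = np.column_stack([np.ones(n), pretest, teaching_method])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
fitted = X @ beta
residuals = posttest - fitted

# With an intercept, OLS residuals average to zero by construction,
# so their spread and any pattern against the fitted values matter more.
print(f'mean residual: {residuals.mean():.2e}')
print(f'residual std:  {residuals.std(ddof=X.shape[1]):.2f}')
```

Plotting `residuals` against `fitted` (for example with `plt.scatter`) should show a shapeless band around zero if the linear form is adequate; a curve or a funnel shape would point to a problem.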

In [None]:
x = scores[['pretest','teaching_method']]
y = scores['posttest']

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)

print('Number of rows in the test set:', x_test.shape[0])
print('Number of rows in the training set:', x_train.shape[0])

In [None]:
lre = LinearRegression()
lre.fit(x_train, y_train)
y_hat =  lre.predict(x_test)

# Format the computed score instead of printing a hard-coded value (about 94% in the original run)
print(f'The R-squared for the test data is {lre.score(x_test, y_test):.2%}')

In [None]:
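from sklearn.metrics import mean_absolute_error

# Store the MAE so it is reported rather than silently discarded (2.73 in the original run)
mae = mean_absolute_error(y_test, y_hat)
print(f'On average, our prediction on new data is off by {mae:.2f} points per test')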
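
The mean absolute error is simply the average of the absolute differences between actual and predicted values, which is what `mean_absolute_error` computes. A tiny worked example with made-up numbers (not from the dataset):

```python
import numpy as np

# Hypothetical actual and predicted scores for illustration only.
y_true = np.array([70, 65, 80, 55])
y_pred = np.array([72, 63, 79, 59])

# Absolute errors are 2, 2, 1, 4, so the mean is 9 / 4 = 2.25
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 2.25
```

Because the errors are not squared, the MAE is in the same units as the test scores and is less sensitive to a few large misses than the root mean squared error.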

In [None]:
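# distplot is deprecated in recent seaborn; kdeplot draws the same density curves.
# Compare the predictions with the actual values of the same test rows.
ax1 = sns.kdeplot(y_test, color='r', label='Actual Values')
sns.kdeplot(y_hat, color='b', label='Fitted Values', ax=ax1)
ax1.legend()
plt.title('Actual Values (Red) VS. Fitted Values (Blue)')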

In [None]:
rcross = cross_val_score(lre, x, y, cv=4)
print('The average R-squared across the 4 cross-validation folds is', rcross.mean())
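
For a regressor, an integer `cv` in `cross_val_score` splits the rows into consecutive folds without shuffling, so if the file is sorted (for example by school), a fold may not be representative. Passing `cv=KFold(n_splits=4, shuffle=True, random_state=0)` from `sklearn.model_selection` is one way to shuffle first. The idea can be sketched by hand with NumPy on simulated data (all values below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Simulated stand-ins for the notebook's columns (assumed values).
pretest = rng.normal(55, 13, n)
teaching_method = rng.integers(0, 2, n)
y = 10 + pretest + 6 * teaching_method + rng.normal(0, 3, n)
X = np.column_stack([np.ones(n), pretest, teaching_method])

# Shuffle the row indices once, then cut them into 4 folds.
folds = np.array_split(rng.permutation(n), 4)

fold_r2 = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit on the three training folds only.
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    pred = X[test_idx] @ beta
    # Score on the held-out fold.
    ss_res = np.sum((y[test_idx] - pred) ** 2)
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
    fold_r2.append(1 - ss_res / ss_tot)

print('R-squared per fold:', np.round(fold_r2, 3))
print('mean R-squared:', round(float(np.mean(fold_r2)), 3))
```

Looking at the per-fold values, and not just their mean, helps show whether the model performs consistently or whether one fold is dragging the average up or down.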

All evaluation checks show strong results, which suggests that our model should work well on **other samples** from a similar population.

# Thank you for reading