# **Predicting End Semester Performance through Simple Linear Regression**

This notebook implements an example of predicting students' end-semester performance from their mid-semester performance and classroom attendance.

The purpose of this notebook is to explain the concepts underlying Simple Linear Regression (SLR) through an example that connects to students' daily academic life.

This programming exercise can also be undertaken as a lab activity for simple linear regression.

I explain the necessary statistical concepts and the working of SLR in my video course on Machine Learning. The unit on Simple Linear Regression covers the following points.

1.  [Simple Linear Regression (SLR): 7 Key points to remember.](https://youtu.be/--awGDpi9pk)
2.  [SLR: Statistical Background](https://youtu.be/o1QWm6yLHdw)
3.  [SLR: Estimating Model Parameters](https://youtu.be/h-WyE73E1yM)
4.  [SLR: Evaluating Model Parameters](https://youtu.be/AvZ-4yFGwG4)


The notebook uses a data set of students' performance and attendance records that I collected while teaching a course on Software Architecture in 2014.

Here, I build two different models. The first explores the relationship between mid-semester (MSE) marks and end-semester (ESE) marks. The second explores the relationship between classroom attendance and ESE marks.

**(1) Initialization**

This section imports the modules needed to process the data: Pandas, NumPy, Matplotlib and Seaborn. In addition, the statsmodels and math modules are imported.

The simple linear regression model is fitted with the Ordinary Least Squares (OLS) method from statsmodels, and the math module is used later to compute the residual standard error.

The dataset is available in CSV format. Its first rows and summary statistics confirm that all the feature variables are numeric.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import math

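# Load the dataset, then inspect its first rows and summary statistics
df = pd.read_csv('../input/predictingese/AttendanceMarksSA.csv')
print(df.head())  # printed explicitly: a cell only displays its last expression
df.describe()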

**(2) Correlation Analysis**

Before building an SLR model, a correlation analysis is performed by generating a correlation matrix and a scatter plot. It checks whether a linear relationship exists between the input variables (MSE, Attendance) and the output variable (ESE).

In [None]:
corr=df.corr()
corr.style.background_gradient(cmap='coolwarm')
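
By default, `df.corr()` reports the Pearson correlation coefficient for each pair of columns. The following is a minimal sketch of that calculation for a single pair, written with NumPy only. The function name `pearson_r` and the marks in `mse_demo` and `ese_demo` are hypothetical, chosen for illustration, and are not taken from the course dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical marks, for illustration only (not the course dataset).
mse_demo = [12, 15, 9, 20, 17, 11]
ese_demo = [30, 34, 25, 41, 39, 27]

r = pearson_r(mse_demo, ese_demo)
print(round(r, 4))
# Cross-check against NumPy's built-in correlation matrix.
print(np.isclose(r, np.corrcoef(mse_demo, ese_demo)[0, 1]))
```

A value close to +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none.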

In [None]:
X = df["MSE"]
y = df["ESE"]

# Pass x and y by keyword: recent seaborn releases no longer accept them positionally
sns.scatterplot(x=X, y=y)


**(3) Pre-Processing**

The input vector (exog) and the output vector (endog) are separated from the data frame. A constant column of ones is also added to the input features so that the model can estimate an intercept.

In [None]:
endog = df['ESE']
exog = sm.add_constant(df[['MSE']])
print(exog)
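
To see why the constant column matters, the sketch below builds a comparable two-column design matrix by hand with NumPy. The values in `mse_demo` and the coefficients `b0` and `b1` are arbitrary assumptions chosen for the demonstration, not estimates from the dataset.

```python
import numpy as np

# Hypothetical MSE marks, for illustration only.
mse_demo = np.array([12.0, 15.0, 9.0, 20.0])

# Design matrix: the first column is the constant 1 (intercept term),
# the second column is the feature itself.
design = np.column_stack([np.ones_like(mse_demo), mse_demo])
print(design)

# Each prediction is b0 * 1 + b1 * x, i.e. one matrix-vector product.
b0, b1 = 5.0, 1.5  # arbitrary coefficients chosen for the demo
pred = design @ np.array([b0, b1])
print(pred)
```

Without the column of ones, the fitted line would be forced through the origin.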

**(4) Model building:** The OLS class from statsmodels is used to build the simple linear regression model. It implements the Ordinary Least Squares algorithm. The summary of the results is also printed.

In [None]:
# Fit and summarize OLS model
mod = sm.OLS(endog, exog)
results = mod.fit()
print(results.summary())
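
For a single input variable, the least-squares estimates have a well-known closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the two means. The sketch below illustrates this with NumPy. The helper `fit_slr` and the demo data are hypothetical, and this is not a description of how statsmodels computes its estimates internally.

```python
import numpy as np

def fit_slr(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return float(intercept), float(slope)

# Hypothetical data lying exactly on the line y = 1 + 2x.
x_demo = [1, 2, 3, 4, 5]
y_demo = [3, 5, 7, 9, 11]

print(fit_slr(x_demo, y_demo))  # (1.0, 2.0)
```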



The following code segment implements a function named RSE to calculate the Residual Standard Error from the given predicted and actual values.
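
For reference, with $n$ observations, actual values $y_i$ and predicted values $\hat{y}_i$, the quantity computed by the function can be written as:

$$
\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}
$$

The divisor $n - 2$ accounts for the two parameters estimated in simple linear regression, the intercept and the slope.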

In [None]:
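def RSE(y_true, y_predicted):
    """Residual Standard Error for a model with two fitted parameters."""
    y_true = np.array(y_true)
    y_predicted = np.array(y_predicted)
    # Residual sum of squares
    RSS = np.sum(np.square(y_true - y_predicted))

    # Divide by n - 2: intercept and slope each use one degree of freedom
    rse = math.sqrt(RSS / (len(y_true) - 2))
    return rse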

In [None]:
rse = RSE(df['ESE'], results.predict())
print(rse)

**Interpreting the results of the first model:**

(1) The absolute values of the t-statistics for the y-intercept and the slope are high, and the p-values for both parameters are < 0.05. We can therefore reject the hypothesis that either parameter is zero. Hence, both estimates are statistically significant.

(2) The value of R² is **0.56**, meaning that MSE marks explain about 56% of the variance in ESE marks. Though this is not high enough to indicate a strong linear relationship, it shows that a linear relationship exists between MSE marks and ESE marks.

(3) The RSE calculated on the training data set is **4.3**. An error of about +/- 4 marks is acceptable when predicting ESE performance.
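
The three quantities discussed above can be reproduced from scratch. The sketch below uses the standard textbook formulas for simple linear regression: R² = 1 - RSS/TSS, RSE = sqrt(RSS/(n - 2)), and the slope's t-statistic as the slope divided by its standard error. The function `slr_diagnostics` and the demo marks are hypothetical and only illustrate the calculations; they do not reproduce the figures reported for the course dataset.

```python
import numpy as np

def slr_diagnostics(x, y):
    """R-squared, RSE and slope t-statistic for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    rss = np.sum(residuals ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    rse = np.sqrt(rss / (n - 2))          # residual standard error
    se_slope = rse / np.sqrt(sxx)         # standard error of the slope
    return {
        "r_squared": float(1 - rss / tss),
        "rse": float(rse),
        "t_slope": float(slope / se_slope),
    }

# Hypothetical marks, for illustration only (not the course dataset).
x_demo = [12, 15, 9, 20, 17, 11, 14, 18]
y_demo = [30, 34, 25, 41, 39, 27, 31, 40]

print(slr_diagnostics(x_demo, y_demo))
```

In simple linear regression, R² equals the square of the Pearson correlation between the input and the output, which links this step back to the correlation analysis.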

The second model aims to explore the relationship between classroom attendance and End Semester performance. First, it visualizes the correlation between classroom attendance and end semester performance. 

In [None]:
X1 = df["Attendance"]
y1 = df["ESE"]

# Pass x and y by keyword: recent seaborn releases no longer accept them positionally
sns.scatterplot(x=X1, y=y1)


The input vectors are separated, and a constant unit vector is added as an input.

In [None]:
endog1 = df['ESE']
exog1 = sm.add_constant(df[['Attendance']])
print(exog1)

In [None]:
# Fit and summarize OLS model
mod1 = sm.OLS(endog1, exog1)
results1 = mod1.fit()
print(results1.summary())

**Interpreting the results of the second model:**

The R² value is **0.012**, which indicates a very weak or non-existent linear relationship between classroom attendance and end-semester performance. One possible explanation is that students attend classes physically without any mindful academic engagement. Also, the absolute value of the t-statistic for the slope is below one, which is too small to indicate a linear relationship. Further, the p-value for the slope is > 0.05, showing no statistically significant relationship between attendance and ESE marks.

Hence, in the context of this dataset, attendance is a poor predictor of end-semester performance.