# Multiple Linear Regression
## Building a Multiple Linear Regression Model (on Student Performance Data)

Using [this dataset from Kaggle](https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression?resource=download), I'll attempt to build a simple machine learning model that predicts a student's performance from how many hours they studied, how many hours they slept, the number of sample question papers they practised, and so on.

Features (inputs):
- Hours studied
- Extracurricular activities
- Sleep hours
- Sample question papers practised
- Performance index

Label (output):
- Final score (the dataset's 'Previous scores' column, renamed)


The dataset was originally designed to predict the performance index, but I think it makes more sense to predict the student's final score from the features above.
So, for the purposes of this study, I will treat 'Previous scores' as 'Final score' and rename the column in the dataset.

We'll assume the performance index is an indicator that is updated as the student completes coursework, homework, etc.

The new problem, then, is predicting the student's final exam score from their performance during the rest of the year.
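
The rename described above is not shown in the notebook cells, so here is a minimal sketch of how it could be done with pandas. It assumes the raw CSV column is called `Previous Scores`; the small frame below is a stand-in built from the first rows shown in this notebook, not the full dataset.

```python
import pandas as pd

# Stand-in for the raw CSV. The column name "Previous Scores" is an assumption
# about the original file; the values are the first rows shown in this notebook.
raw = pd.DataFrame({
    "Hours Studied": [7, 4, 8],
    "Previous Scores": [99, 82, 51],
    "Performance Index": [91.0, 65.0, 45.0],
})

# Rename the column that will be used as the prediction target.
renamed = raw.rename(columns={"Previous Scores": "Final Score"})
print(renamed.columns.tolist())
```

`rename` returns a new frame, so the raw data stays untouched unless the result is assigned back.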

## Loading the data (using pandas)

In [1]:
import pandas as pd

data = pd.read_csv("datasets/student-performance.csv")
data.head()

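|   | Hours Studied | Final Score | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |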


## Preprocessing

In [2]:
# Convert the "Yes"/"No" strings to booleans so the model can use the column
data['Extracurricular Activities'] = data['Extracurricular Activities'] == "Yes"
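
One caveat with comparing against "Yes" is that any other value (a typo, different casing, a missing entry) silently becomes False. A possible alternative, sketched here on a small made-up Series rather than the real column, is an explicit mapping that leaves unexpected values as NaN so they can be counted.

```python
import pandas as pd

# Made-up sample; the lowercase "yes" stands in for an unexpected value.
activities = pd.Series(["Yes", "No", "Yes", "yes"], name="Extracurricular Activities")

# Explicit mapping: anything other than "Yes"/"No" becomes NaN instead of False.
encoded = activities.map({"Yes": True, "No": False})

# Count the values that matched neither label.
unexpected = encoded.isna().sum()
print(encoded.tolist(), unexpected)
```

If the count is zero, the two approaches give the same booleans.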

## Split dataset into predictors and target

In [3]:
# Inputs
predictors = data[[
    'Hours Studied',
    'Extracurricular Activities',
    'Sleep Hours',
    'Sample Question Papers Practiced',
    'Performance Index'
]]
predictors.head()

|   | Hours Studied | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|
| 0 | 7 | True | 9 | 1 | 91.0 |
| 1 | 4 | False | 4 | 2 | 65.0 |
| 2 | 8 | True | 7 | 2 | 45.0 |
| 3 | 5 | True | 5 | 2 | 36.0 |
| 4 | 7 | False | 8 | 5 | 66.0 |


In [4]:
# Output
target = data['Final Score']
target.head()

0    99
1    82
2    51
3    52
4    75
Name: Final Score, dtype: int64

## Split dataset into training data and test data

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, shuffle=True)
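
Because the split above shuffles without a fixed seed, the rows in the test set, and therefore the scores reported later, will differ slightly from run to run. A minimal sketch of a reproducible split follows; the data is synthetic and the seed value 42 is an arbitrary choice, not something used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the predictors and target.
X = pd.DataFrame({"Hours Studied": range(10), "Sleep Hours": range(10, 20)})
y = pd.Series(range(100, 110), name="Final Score")

# random_state=42 is an arbitrary seed; any fixed integer works.
split_a = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Both calls select the same test rows (index 1 of the result is X_test).
print(split_a[1].index.tolist() == split_b[1].index.tolist())
```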

## Multiple Linear Regression

In [6]:
from sklearn.linear_model import LinearRegression

lf = LinearRegression()
lf.fit(X_train, y_train)
lf.score(X_test, y_test)

0.9866863803029005

## Testing the model

### Displaying predictions alongside actual final scores

In [7]:
y_pred = lf.predict(X_test)

new_df = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test, 'Difference': y_test - y_pred})
new_df.head(1000)

|   | Predicted | Actual | Difference |
|---|---|---|---|
| 5310 | 90.106795 | 89 | -1.106795 |
| 2524 | 62.233609 | 59 | -3.233609 |
| 2332 | 93.612562 | 98 | 4.387438 |
| 2580 | 40.392235 | 41 | 0.607765 |
| 4881 | 76.267661 | 75 | -1.267661 |
| ... | ... | ... | ... |
| 4549 | 50.687019 | 50 | -0.687019 |
| 578 | 74.707027 | 71 | -3.707027 |
| 3989 | 92.833543 | 95 | 2.166457 |
| 4982 | 95.340279 | 95 | -0.340279 |


### Considering different methods of measuring accuracy


#### Manually computing mean absolute error and mean squared error

In [8]:
total_absolute_error = 0
total_squared_error = 0

# Accumulate the absolute and squared error for each test example
for i in range(len(y_test)):
    diff = y_test.iloc[i] - y_pred[i]
    total_absolute_error += abs(diff)
    total_squared_error += diff ** 2

mae = total_absolute_error / len(y_test)
mse = total_squared_error / len(y_test)

# MAE and RMSE are in the units of the score; MSE is in squared units
print(f"Mean Absolute Error (MAE): {round(mae, 2)} points")
print(f"Mean Squared Error (MSE): {round(mse, 2)} points squared")
print(f"Root Mean Squared Error (RMSE): {round(mse ** 0.5, 2)} points")

Mean Absolute Error (MAE): 1.57 points
Mean Squared Error (MSE): 3.96 points squared
Root Mean Squared Error (RMSE): 1.99 points


Now we'll do the same accuracy calculations using scikit-learn's built-in functions...

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred)

# MAE and RMSE are in the units of the score; MSE is in squared units
print(f"Mean Absolute Error (MAE): {round(mae, 2)} points")
print(f"Mean Squared Error (MSE): {round(mse, 2)} points squared")
print(f"Root Mean Squared Error (RMSE): {round(mse ** 0.5, 2)} points")

Mean Absolute Error (MAE): 1.57 points
Mean Squared Error (MSE): 3.96 points squared
Root Mean Squared Error (RMSE): 1.99 points


As you can see, this produces the same values.
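
The manual loop can also be written without explicit iteration. The sketch below uses NumPy on the first five rows of the predictions table shown earlier, so its results describe only those five rows and will not match the full test-set figures.

```python
import numpy as np

# First five actual/predicted pairs from the predictions table shown earlier.
y_true = np.array([89, 59, 98, 41, 75])
y_hat = np.array([90.106795, 62.233609, 93.612562, 40.392235, 76.267661])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # average absolute error
mse = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)            # back in the units of the score

print(mae, mse, rmse)
```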

### R-squared score

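We can also measure how well a linear regression model fits using the R-squared score.

R-squared represents the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 means the predictions match the actual values exactly, and a value of 0 means the model does no better than always predicting the mean.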

In [10]:
from sklearn.metrics import r2_score

r2 = r2_score(y_true=y_test, y_pred=y_pred)

print(f"R2 Score: {r2}")

R2 Score: 0.9866863803029005


This is the same value returned by the model's own score method, which for LinearRegression is the coefficient of determination (R-squared):

In [11]:
# Return the coefficient of determination of the prediction.
lf_score = lf.score(X_test, y_test)

print("LF Score: ", lf_score)

LF Score:  0.9866863803029005
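
As a cross-check on the definition given earlier, R-squared can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on the first five rows of the predictions table shown earlier and compares the result with scikit-learn's `r2_score`; because it uses only five rows, the value will not match the full test-set score.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five actual/predicted pairs from the predictions table shown earlier.
y_true = np.array([89, 59, 98, 41, 75])
y_hat = np.array([90.106795, 62.233609, 93.612562, 40.392235, 76.267661])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```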


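## Conclusion

With an R-squared score of about 0.987 on the test set, the model predicts the final score closely. However, there only really seems to be a strong relationship between the performance index and the final score (the dataset's original 'Previous scores' column); the other features appear to contribute little.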
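
One way to examine a claim like this is to look at each feature's correlation with the target and at the fitted coefficients. The sketch below is illustrative only: it uses synthetic data with hypothetical column names (`strong_feature`, `weak_feature`), not the student dataset, and it is not a result from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the target depends strongly on one feature and not on the other.
features = pd.DataFrame({
    "strong_feature": rng.normal(size=n),
    "weak_feature": rng.normal(size=n),
})
target = 5.0 * features["strong_feature"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of each feature with the target.
correlations = features.corrwith(target)

# Fitted coefficients, labelled by feature name.
model = LinearRegression().fit(features, target)
coefficients = pd.Series(model.coef_, index=features.columns)

print(correlations)
print(coefficients)
```

Coefficients depend on the scale of each feature, so they are only directly comparable when the features are on similar scales.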