# **Machine Learning: Linear Regression**

## Objectives

* Build a linear regression model
* Build a pipeline model which includes scaling
* Consider **hypothesis 4** by comparing the models
* Use the model to evaluate **hypothesis 2**

## Inputs

* Cleaned CSV file "academic_performance_cleaned.csv" 

## Outputs

* A pipeline that performs linear regression on the dataset, which I hope to use in a Streamlit app

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance'

# Load Data and Prepare for Linear Regression

The first step is to import packages. This time I also import what I need from sklearn to split the data for linear regression.

In [4]:
#import data manipulation and visualisation packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split 
#set seaborn style so plots look nice
sns.set_style("whitegrid")



Load dataset and display first 10 rows

In [5]:
# load cleaned dataset
df = pd.read_csv("data/academic_performance_cleaned.csv")
#display first 10 rows of the dataset
df.head(10)

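| | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) | Average Test Score | Study Group |
|---|---|---|---|---|---|---|---|---|
| 0 | 84 | 30 | 36 | 7 | 3 | 72 | 33.0 | low |
| 1 | 91 | 24 | 38 | 6 | 3 | 56 | 31.0 | low |
| 2 | 73 | 29 | 26 | 7 | 3 | 56 | 27.5 | low |
| 3 | 80 | 36 | 35 | 7 | 3 | 74 | 35.5 | low |
| 4 | 84 | 31 | 37 | 8 | 3 | 66 | 34.0 | low |
| 5 | 100 | 34 | 34 | 7 | 3 | 79 | 34.0 | low |
| 6 | 96 | 40 | 36 | 8 | 3 | 83 | 38.0 | low |
| 7 | 83 | 39 | 37 | 7 | 3 | 77 | 38.0 | low |
| 8 | 91 | 30 | 37 | 8 | 2 | 71 | 33.5 | low |
| 9 | 87 | 27 | 37 | 8 | 3 | 61 | 32.0 | low |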


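Saving the data to CSV and loading it again resets the integer columns to int64, so I convert them from int64 to int8 to save memory. These columns hold percentages, marks and hours of at most 100, which fits within the int8 range (-128 to 127).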

In [6]:
#change datatypes to save memory
df = df.astype({col: 'int8' for col in df.select_dtypes('int64').columns})
#display datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Attendance (%)                 2000 non-null   int8   
 1   Internal Test 1 (out of 40)    2000 non-null   int8   
 2   Internal Test 2 (out of 40)    2000 non-null   int8   
 3   Assignment Score (out of 10)   2000 non-null   int8   
 4   Daily Study Hours              2000 non-null   int8   
 5   Final Exam Marks (out of 100)  2000 non-null   int8   
 6   Average Test Score             2000 non-null   float64
 7   Study Group                    2000 non-null   object 
dtypes: float64(1), int8(6), object(1)
memory usage: 43.1+ KB
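
As a hedged alternative to casting every integer column straight to int8, the sketch below lets pandas pick the smallest integer type that actually holds each column's values. The miniature frame and the `Large Values` column are hypothetical and only stand in for the real CSV.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the loaded CSV.
sample = pd.DataFrame({
    "Attendance (%)": [84, 91, 73],
    "Daily Study Hours": [3, 3, 2],
    "Large Values": [84, 300, 73],  # hypothetical column that does not fit in int8
})

# Downcast each column to the smallest integer dtype that can hold its values.
downcast = sample.apply(pd.to_numeric, downcast="integer")
print(downcast.dtypes)
```

In this sketch the first two columns become int8, while the column containing 300 is given a wider integer type rather than being forced into int8.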


### Linear Regression

Linear regression is used to model the relationship between a dependent variable and one or more independent variables. 

Linear regression is appropriate for this dataset because the target variable (final exam marks) and the predictor variables are continuous and numerical.

The aim of the analysis is to quantify how changes in each academic factor relate to changes in final exam performance, which linear regression is well-suited to model.

Linear regression allows both prediction of final exam marks and interpretation of feature coefficients, supporting hypothesis-driven analysis.

Before performing linear regression the dataset needs to be tidied up. The Study Group column, which was generated for statistical analysis, is categorical and so cannot be used by the regression algorithm without one-hot encoding. The information in the Study Group column is already represented numerically by the Daily Study Hours column, so Study Group is redundant. Leaving it in would make the results misleading because the same metric would be counted twice.

The same can be said for the two internal test score columns, which also need to be removed. They have been summarised in one column: Average Test Score. Leaving them in would again make the results of the linear regression model misleading.
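
A small check of this redundancy, using the first five rows shown by `df.head(10)` above: if the average column is an exact combination of the two test columns, stacking all three gives a matrix with only two independent columns. This is a sketch on five rows only, not a test of the full dataset.

```python
import numpy as np

# First five rows of the two test columns and their average, copied from df.head(10).
test_1 = np.array([30, 24, 29, 36, 31], dtype=float)
test_2 = np.array([36, 38, 26, 35, 37], dtype=float)
average = np.array([33.0, 31.0, 27.5, 35.5, 34.0])

# The average column is an exact linear combination of the two test columns.
is_exact_average = np.allclose(average, (test_1 + test_2) / 2)

# Stacking all three columns therefore gives a matrix of rank 2, not 3.
stacked = np.column_stack([test_1, test_2, average])
rank = np.linalg.matrix_rank(stacked)
print(is_exact_average, rank)
```

With perfectly collinear columns like these, the individual regression coefficients are not uniquely determined, which is why keeping only the average is the safer choice for interpretation.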

### Small data cleaning step

In [7]:
# define a new dataframe df_reg where the redundant columns have been dropped
df_reg = df.drop(['Internal Test 1 (out of 40)',
       'Internal Test 2 (out of 40)', 'Study Group'], axis=1)
# display the head of the dataframe
df_reg.head(10)

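| | Attendance (%) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) | Average Test Score |
|---|---|---|---|---|---|
| 0 | 84 | 7 | 3 | 72 | 33.0 |
| 1 | 91 | 6 | 3 | 56 | 31.0 |
| 2 | 73 | 7 | 3 | 56 | 27.5 |
| 3 | 80 | 7 | 3 | 74 | 35.5 |
| 4 | 84 | 8 | 3 | 66 | 34.0 |
| 5 | 100 | 7 | 3 | 79 | 34.0 |
| 6 | 96 | 8 | 3 | 83 | 38.0 |
| 7 | 83 | 7 | 3 | 77 | 38.0 |
| 8 | 91 | 8 | 2 | 71 | 33.5 |
| 9 | 87 | 8 | 3 | 61 | 32.0 |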


# Hypothesis 4: A fully processed regression pipeline achieves better accuracy than a model without preprocessing.

The plan is to perform linear regression on the raw dataset without any preprocessing steps. I will then build an ML pipeline with a preprocessing step and compare the two models.

An ML pipeline often consists of multiple steps that process the dataset, transform it and fit the model. Preprocessing steps may include data cleaning, feature engineering, feature scaling and feature selection. I will not include data cleaning or feature engineering steps in this example because the data is already cleaned and the necessary features have been engineered. I will consider feature selection when testing H2.

The preprocessing step I will consider for H4 is feature scaling.

Feature scaling standardises numerical variables so they are on a comparable scale, preventing features with larger values from dominating the model.

In this model a feature scaling step would ensure that variables measured on different scales, such as attendance percentages (/100) and study hours (max 5 hours), contribute proportionately during model training.

# Split Dataset into Train and Test

This is a supervised learning task which means that the dataset can be thought of as having features and a target variable.

The target variable is a column which the model is trying to predict. In this case the target is the **final exam marks**.

The features are the rest of the data in the dataset. 

The model is trying to answer the question: given a set of unseen features, how accurately can the target variable be predicted?

The dataset is split into two sections, the train set and the test set. The model is trained on the train set, where it learns how strongly each feature contributes to the variation in the target. The model is then given unseen data (the test set) and predicts the target values for it. Comparing these predictions with the actual target values of the test set shows how well the model predicts the target variable.

In [8]:
# split dataset into train and test. X represents the features (target dropped) and y represents the target variable
X = df_reg.drop('Final Exam Marks (out of 100)', axis=1)
y = df_reg['Final Exam Marks (out of 100)']
# create 4 variables: X_train and X_test are the features, y_train and y_test are the targets
# test_size=0.2 splits the dataset into 80% train and 20% test, random_state provides reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


Inspecting X_train shows that it no longer contains the final marks column, so it holds just the features. Inspecting y_train shows that it holds just the final marks, i.e. the target variable.

In [9]:
#features only
X_train

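| | Attendance (%) | Assignment Score (out of 10) | Daily Study Hours | Average Test Score |
|---|---|---|---|---|
| 668 | 73 | 8 | 2 | 35.0 |
| 1345 | 81 | 6 | 3 | 32.5 |
| 373 | 92 | 7 | 2 | 31.0 |
| 1388 | 98 | 7 | 4 | 30.0 |
| 132 | 75 | 8 | 2 | 31.5 |
| ... | ... | ... | ... | ... |
| 1599 | 97 | 9 | 3 | 36.5 |
| 1862 | 91 | 7 | 3 | 30.5 |
| 1361 | 81 | 8 | 2 | 36.0 |
| 1547 | 94 | 9 | 3 | 36.0 |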


In [10]:
#targets only
y_train

668     65
1345    59
373     70
1388    68
132     58
        ..
1599    78
1862    71
1361    63
1547    81
863     79
Name: Final Exam Marks (out of 100), Length: 1600, dtype: int8

Printing the shapes of the dataframes shows that the train set now has 1600 rows (80% of 2000) and the test set has 400 rows (20%).

In [11]:
#print the shape of the train and test sets
print(
    "Train set:",
    X_train.shape,
    y_train.shape,
    "\nTest set:",
    X_test.shape,
    y_test.shape,
)

Train set: (1600, 4) (1600,) 
Test set: (400, 4) (400,)


# Baseline linear regression model (no pre-processing)

The model is fitted using the LinearRegression class from scikit-learn.

In [12]:
# import linear regression
from sklearn.linear_model import LinearRegression
# define a model and assign the linear regression algorithm to it
model = LinearRegression()
# fit the model using the train set
model.fit(X_train, y_train)

The target values for the unseen test set can be predicted using the model's predict method.

In [None]:
# get the model to predict the target variable for the unseen test set assign it to variable called prediction
prediction = model.predict(X_test)

The success of the model will be evaluated by using the following metrics:

-  R²: The proportion of variance in the target variable explained by the model.
-  Mean squared error (MSE): The average squared difference between predicted and actual values.
-  Mean absolute error (MAE): The average absolute difference between predicted and actual values.
-  Root mean squared error (RMSE): The square root of MSE, representing typical prediction error in the original units.
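
To make these definitions concrete, here is a small worked example that computes each metric by hand with NumPy. The four actual and predicted marks are illustrative values, not results from this notebook.

```python
import numpy as np

# Illustrative actual and predicted marks (not taken from the model in this notebook).
y_true = np.array([72, 56, 56, 74], dtype=float)
y_pred = np.array([70, 60, 55, 78], dtype=float)

residuals = y_true - y_pred                        # 2, -4, 1, -4

mae = np.mean(np.abs(residuals))                   # average absolute error
mse = np.mean(residuals ** 2)                      # average squared error
rmse = np.sqrt(mse)                                # back in the original units (marks)

ss_res = np.sum(residuals ** 2)                    # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation in the target
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

Because the errors are squared before averaging, the two 4-mark misses dominate the MSE and RMSE, which is why RMSE comes out larger than MAE here.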

In [14]:
# import the metrics to evaluate the model
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
#save regression scores as a dictionary
reg_scores = {"Model": "Baseline",
              "R2 Score": r2_score(y_test, prediction), 
              "MAE": mean_absolute_error(y_test, prediction),
              "MSE": mean_squared_error(y_test, prediction),
              "RMSE": np.sqrt(mean_squared_error(y_test, prediction))
              }

# save regression scores in a dataframe for easy comparison
df_reg_scores = pd.DataFrame([reg_scores])
df_reg_scores


| | Model | R2 Score | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| 0 | Baseline | 0.826012 | 3.581285 | 20.082011 | 4.481296 |


An R² score of 0.83 tells me that the model explains around 83% of the variation in the final exam marks in the test set, so its predictions track the differences between students well.
The mean absolute error of 3.6 means that, on average, the predictions were around 3-4 marks away from the actual final exam marks.
The root mean squared error of 4.5 is a similar measure that penalises large mistakes more heavily; it suggests that a typical prediction is within roughly ±4–5 marks.

# Linear Regression pipeline with pre-processing (feature scaling)

I will use the pipeline for pre-processing, i.e. it will scale the features so that attendance (a percentage out of 100) and study hours (maximum value of 5) are on a comparable scale.

Below I define the pipeline and give it two steps: feature scaling, and the model step which fits the same linear regression as in the baseline model.

In [15]:
# import pipeline and the feature scaling module
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


#build a pipeline with 2 steps: scaling and ML model
def linear_regression_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [16]:
# display the steps in the pipeline
linear_regression_pipeline()

In [17]:
# define a variable pipeline and assign it the pipeline function I defined above 
pipeline = linear_regression_pipeline()
# fit the pipeline with the train set
pipeline.fit(X_train, y_train)

In [18]:
# define a variable which will store the predictions from the pipeline when it is given the test set
pipeline_prediction = pipeline.predict(X_test)

Same step as before: get the R², MAE, MSE and RMSE values.

In [None]:
# obtain the scores to evaluate the regression
pipeline_reg_scores = {"Model": "Scaled Regression",
                       "R2 Score": r2_score(y_test, pipeline_prediction), 
                       "MAE": mean_absolute_error(y_test, pipeline_prediction),
                       "MSE": mean_squared_error(y_test, pipeline_prediction),
                       "RMSE": np.sqrt(mean_squared_error(y_test, pipeline_prediction))
              }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = pipeline_reg_scores
df_reg_scores



| | Model | R2 Score | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| 0 | Baseline | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
| 1 | Scaled Regression | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
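
The identical scores are what one would expect from ordinary least squares: standardising the features changes the coefficients but not the fitted predictions. The sketch below illustrates this on synthetic data; the two features and their ranges are hypothetical stand-ins for attendance and study hours, not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical stand-ins).
X = np.column_stack([
    rng.uniform(50, 100, 200),  # attendance-like feature
    rng.uniform(0, 5, 200),     # study-hours-like feature
])
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1, 200)

# Same model with and without a scaling step.
plain = LinearRegression().fit(X, y)
scaled = Pipeline(
    [
        ("feature_scaling", StandardScaler()),
        ("model", LinearRegression()),
    ]
).fit(X, y)

# Predictions agree, so every error metric agrees too.
same_predictions = np.allclose(plain.predict(X), scaled.predict(X))
print(same_predictions)

# The coefficients differ: after scaling they are per standard deviation, not per unit.
print(plain.coef_, scaled.named_steps["model"].coef_)
```

Scaling still has a use here: the scaled coefficients are on a common footing, which makes them easier to compare across features. It would also matter for models with a penalty term, such as Lasso.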


# H4 Conclusion



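# H2: Use feature selection to identify the most important feature

Feature selection keeps the features that the selector judges most important. `SelectFromModel` does this by comparing each feature's importance (for a linear model, the absolute value of its coefficient) with a threshold, so on standardised features it keeps those with the largest coefficients. This is related to, but not the same as, the variance each feature explains in the target.

I'm going to add a feature selection stage to my pipeline.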

In [22]:
from sklearn.feature_selection import SelectFromModel

#build a pipeline with 3 steps: scaling, selecting best features, ML model
def feature_selection_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('feature_selection', SelectFromModel(LinearRegression())),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [23]:
feature_pipeline=feature_selection_pipeline()
feature_pipeline.fit(X_train, y_train)

In [24]:
X_train.columns[feature_pipeline['feature_selection'].get_support()]

Index(['Average Test Score'], dtype='object')
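
A hedged explanation of why only one feature survived: as I understand scikit-learn's documented default, when no `threshold` is passed and the estimator has no L1 penalty, `SelectFromModel` keeps features whose absolute coefficient is at least the mean absolute coefficient. The sketch below applies that rule by hand. It assumes the selector's internal fit produced the same coefficients as the scaled four-feature regression reported later in this notebook.

```python
import numpy as np

feature_names = [
    "Attendance (%)",
    "Assignment Score (out of 10)",
    "Daily Study Hours",
    "Average Test Score",
]
# Coefficients of the scaled four-feature regression, as printed later in this notebook.
coefficients = np.array([2.87508568, 1.48163925, 1.8462766, 6.63811077])

importance = np.abs(coefficients)
threshold = importance.mean()  # assumed default rule: mean of the importances

# Keep only the features whose importance reaches the threshold.
kept = [name for name, value in zip(feature_names, importance) if value >= threshold]
print(round(threshold, 3), kept)
```

Under this rule a feature is dropped for being below average, not for being unimportant, so the selection says less about the other three features than it first appears.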

In [25]:
feature_prediction = feature_pipeline.predict(X_test)

In [26]:
feature_selection_reg_scores = {"Model": "Feature Selection Regression",
                                "R2 Score": r2_score(y_test, feature_prediction), 
                                "MAE": mean_absolute_error(y_test, feature_prediction),
                                "MSE": mean_squared_error(y_test, feature_prediction),
                                "RMSE": np.sqrt(mean_squared_error(y_test, feature_prediction))
              }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = feature_selection_reg_scores
df_reg_scores

| | Model | R2 Score | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| 0 | Baseline | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
| 1 | Scaled Regression | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
| 2 | Feature Selection Regression | 0.72504 | 4.499313 | 31.736509 | 5.633517 |


My goal was to show that average test score is the most important feature, as the correlation coefficients suggested earlier. The result points that way, since it was the only feature selected, but using it alone lowers R² from 0.83 to 0.73, so the other features still carry useful information. I was also not sure that I had used the right method.

I put my pipeline into ChatGPT and asked it to evaluate my reasoning. It told me that using linear regression inside SelectFromModel can be misleading. I have quoted its reasoning below.

"Lasso regression should be used for feature selection instead of a standard linear regression model because it includes an L1 regularisation penalty that shrinks weaker coefficients toward zero. This allows the model to effectively identify and remove features that contribute little unique predictive value, particularly in the presence of multicollinearity. In contrast, ordinary least squares regression does not apply coefficient penalisation, meaning its coefficients can remain unstable and unsuitable for feature selection. Therefore, Lasso provides a more robust and interpretable approach to identifying the most important predictors in the dataset."

I will run the pipeline again and replace linear regression with Lasso to see if it makes a difference. 
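
Before running it on the real data, here is a minimal sketch of the shrinking behaviour described in the quote, using synthetic data with one strong and one weak predictor. The data, the coefficients 5.0 and 0.05, and the two `alpha` values are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic data: one strong predictor and one very weak predictor.
strong = rng.normal(size=n)
weak = rng.normal(size=n)
X = StandardScaler().fit_transform(np.column_stack([strong, weak]))
y = 5.0 * strong + 0.05 * weak + rng.normal(size=n)

# A small penalty barely changes the coefficients; a larger one removes the weak feature.
mild = Lasso(alpha=0.001).fit(X, y)
heavy = Lasso(alpha=0.5).fit(X, y)
print(mild.coef_, heavy.coef_)
```

The size of `alpha` decides how much gets removed, so a Lasso that keeps every feature may simply reflect a penalty that is small relative to the coefficients.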


In [27]:
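# Lasso pipeline: Lasso selects the features, LinearRegression fits the final model
from sklearn.linear_model import Lasso
def build_lasso_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('feature_selection', SelectFromModel(Lasso(alpha=0.1))),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [28]:
# the builder function has its own name so the fitted pipeline does not overwrite it
lasso_pipeline = build_lasso_pipeline()
lasso_pipeline.fit(X_train, y_train)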

In [29]:
lasso_features = X_train.columns[lasso_pipeline['feature_selection'].get_support()]
lasso_features

Index(['Attendance (%)', 'Assignment Score (out of 10)', 'Daily Study Hours',
       'Average Test Score'],
      dtype='object')

In [30]:
lasso_prediction = lasso_pipeline.predict(X_test)

In [31]:
Lasso_reg_scores = {"Model": "Lasso Regression",
                    "R2 Score": r2_score(y_test, lasso_prediction), 
                    "MAE": mean_absolute_error(y_test, lasso_prediction),
                    "MSE": mean_squared_error(y_test, lasso_prediction),
                    "RMSE": np.sqrt(mean_squared_error(y_test, lasso_prediction))
              }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = Lasso_reg_scores
df_reg_scores

| | Model | R2 Score | MAE | MSE | RMSE |
|---|---|---|---|---|---|
| 0 | Baseline | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
| 1 | Scaled Regression | 0.826012 | 3.581285 | 20.082011 | 4.481296 |
| 2 | Feature Selection Regression | 0.72504 | 4.499313 | 31.736509 | 5.633517 |
| 3 | Lasso Regression | 0.826012 | 3.581285 | 20.082011 | 4.481296 |


In [32]:
# coefficients of the final LinearRegression step, fitted on the scaled features that Lasso kept
lasso_coef = lasso_pipeline.named_steps['model'].coef_
lasso_coef

array([2.87508568, 1.48163925, 1.8462766 , 6.63811077])

In [33]:
df_coef = pd.DataFrame({'Best Features': lasso_features, 'Coefficients':lasso_coef.round(3)})
df_coef

| | Best Features | Coefficients |
|---|---|---|
| 0 | Attendance (%) | 2.875 |
| 1 | Assignment Score (out of 10) | 1.482 |
| 2 | Daily Study Hours | 1.846 |
| 3 | Average Test Score | 6.638 |


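The scores are identical to the baseline and scaled models. Lasso kept all four features, and because the final step of the pipeline is still LinearRegression, the pipeline fits exactly the same model as the scaled regression.

With alpha=0.1 the L1 penalty was not strong enough to shrink any coefficient to zero, which suggests that none of the predictors was redundant enough to remove. This is plausible in real life: attendance, assignments, study hours and earlier test scores could all contribute to final exam marks.

Because the features are standardised, the coefficients can be compared directly. Average Test Score has the largest coefficient (6.638), more than twice that of Attendance (2.875), the next largest.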

"To test this hypothesis, I fitted two regression models.
Model 1 used only Average Test Score to predict Final Exam Marks.
Model 2 added engagement behaviours as additional predictors.
By comparing R² and RMSE across the two models, and examining the coefficients in Model 2, I can assess whether prior attainment explains more variance than engagement behaviours."
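
The two-model comparison described above could be sketched as follows. The helper `score_model` is a hypothetical name, and the data is synthetic with made-up effect sizes, so the printed numbers say nothing about the real dataset; the same helper could be called with the notebook's own train and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def score_model(X_train, X_test, y_train, y_test, columns):
    """Fit a linear regression on the chosen columns and score it on the test set."""
    model = LinearRegression().fit(X_train[columns], y_train)
    prediction = model.predict(X_test[columns])
    return {
        "R2 Score": r2_score(y_test, prediction),
        "RMSE": np.sqrt(mean_squared_error(y_test, prediction)),
    }


# Synthetic stand-in data (hypothetical effect sizes).
rng = np.random.default_rng(101)
n = 400
demo = pd.DataFrame({
    "Average Test Score": rng.uniform(20, 40, n),
    "Attendance (%)": rng.uniform(60, 100, n),
})
target = 1.5 * demo["Average Test Score"] + 0.3 * demo["Attendance (%)"] + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=101)

# Model 1: prior attainment only. Model 2: prior attainment plus an engagement feature.
model_1 = score_model(X_tr, X_te, y_tr, y_te, ["Average Test Score"])
model_2 = score_model(X_tr, X_te, y_tr, y_te, ["Average Test Score", "Attendance (%)"])
comparison = pd.DataFrame([model_1, model_2], index=["Model 1", "Model 2"])
print(comparison)
```

The gap in R² between the two rows indicates how much the engagement feature adds beyond prior attainment.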

