# **Machine Learning: Linear Regression**

## Objectives

* Build a linear regression model
* Build a pipeline model which includes scaling
* Consider **hypothesis 4** by comparing the models
* Use the model to evaluate **hypothesis 2**

## Inputs

* Cleaned CSV file "academic_performance_cleaned.csv" 

## Outputs

* A pipeline which performs linear regression on the dataset. I hope to use this in a Streamlit app

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data and Prepare for Linear Regression

The first step is to import packages. This time I also import what I need from sklearn so that I can perform linear regression.

In [None]:
#import data manipulation and visualisation packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#import the train/test split helper from sklearn
from sklearn.model_selection import train_test_split
#set seaborn style so plots look nice
sns.set_style("whitegrid")



Load dataset and display first 10 rows

In [None]:
df = pd.read_csv("data/academic_performance_cleaned.csv")
#display first 10 rows of the dataset
df.head(10)

When the CSV is saved and reloaded, the integer columns come back as int64, so I convert them from int64 to int8 to save memory. This is safe only while every value lies between -128 and 127, which is the range int8 can hold.

In [None]:

#change datatypes to save memory
df = df.astype({col: 'int8' for col in df.select_dtypes('int64').columns})
#display datatypes
df.info()
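
The int8 cast above is only safe when every value fits in the int8 range of -128 to 127; larger values wrap around silently. As a hedged sketch, using a small made-up frame rather than the project data, the check below guards the cast and compares memory use. The column names and the helper `downcast_safely` are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def downcast_safely(frame: pd.DataFrame, target: str = "int8") -> pd.DataFrame:
    """Cast int64 columns to `target` only when every value fits in its range."""
    limits = np.iinfo(np.dtype(target))
    out = frame.copy()
    for col in out.select_dtypes("int64").columns:
        if out[col].min() >= limits.min and out[col].max() <= limits.max:
            out[col] = out[col].astype(target)
    return out


# Made-up example data (not the project dataset)
toy = pd.DataFrame({"marks": [55, 72, 98], "student_id": [1001, 1002, 1003]})
small = downcast_safely(toy)

# "marks" fits in int8 and is converted; "student_id" does not and is left alone
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

If every integer column in the real dataset is a mark, an hour count, or a similar small value, the guarded version should behave the same as the direct cast.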

### Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables. It suits this dataset because the target, the final exam marks, is a continuous numerical value and the remaining features are numerical.

Linear regression requires numerical data only. It fits a weighted sum of the features to the target, and the fitted weights are then used to predict the target for new observations.

Before I perform linear regression I need to remove redundant columns. The Study Group column, which was generated for statistical analysis, is categorical, so the regression algorithm cannot use it without one-hot encoding. The information it holds is already represented numerically by the daily study hours column, so the Study Group column is redundant. Leaving it in would make the results misleading because the same information would be counted twice.

The same applies to the two internal test score columns, which are summarised in a single column: average test scores. Leaving them in would again make the results of the linear regression model misleading.
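
To illustrate why a derived column and its source columns should not both be kept, here is a minimal sketch on made-up numbers. The column names `test_1`, `test_2` and `average_test` are assumptions, not the dataset's real headers, and the sketch assumes the average is computed exactly from the two tests. When one column is an exact linear combination of others, the feature matrix loses rank and the individual coefficients are no longer uniquely determined.

```python
import numpy as np
import pandas as pd

# Made-up scores; the average is computed from the two test columns
toy = pd.DataFrame({"test_1": [30, 25, 38, 12, 20],
                    "test_2": [28, 31, 35, 15, 26]})
toy["average_test"] = (toy["test_1"] + toy["test_2"]) / 2

# Three columns, but the third adds no independent information
rank_with_all = np.linalg.matrix_rank(toy.to_numpy(dtype=float))

print("columns:", toy.shape[1], "rank:", rank_with_all)
```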

### Small data cleaning step

In [None]:
df_reg = df.drop(['Internal Test 1 (out of 40)',
       'Internal Test 2 (out of 40)', 'Study Group'], axis=1)
df_reg.head(10)

# Hypothesis 4: A fully processed regression pipeline achieves better accuracy than a model without preprocessing.

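The plan for H4 is to fit two models on the same train set and compare them on the same test set: a baseline linear regression with no pre-processing, and a pipeline that scales the features before fitting the same model. Both are scored with R2, MAE, MSE and RMSE.

Pre-processing means transforming the raw features before they reach the model. Scaling is one such step: StandardScaler rescales each feature to zero mean and unit variance, so features measured in different units (for example hours versus marks) end up on a comparable scale and their coefficients can be compared directly.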
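
As a sanity check on what to expect from this comparison, the sketch below uses synthetic data (an assumption, not the project dataset) to compare a plain `LinearRegression` with a `StandardScaler` pipeline. For ordinary least squares with an intercept, standardising the features changes the coefficients but should leave the predictions the same up to floating-point error, so any benefit of scaling is more likely to show up in how the coefficients can be interpreted than in the scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
X = np.column_stack([rng.uniform(0, 10, 200), rng.uniform(0, 1000, 200)])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 1, 200)

raw_model = LinearRegression().fit(X, y)
scaled_model = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

# Compare the predictions and the coefficients of the two fits
same_predictions = np.allclose(raw_model.predict(X), scaled_model.predict(X))
print("Predictions match:", same_predictions)
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.named_steps["model"].coef_)
```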

# Split Dataset into Train and Test

This is a supervised learning task, which means that the dataset can be thought of as having features and a target variable.

The target variable is the column that the model is trying to predict. In this case the target is the final exam marks.

The features are the remaining columns in the dataset.

The model is trying to answer the question: given a set of unseen features, how accurately can the target variable be predicted?

The dataset is split into two sections, the train set and the test set. The model is trained on the train set, where it learns a coefficient for each feature that reflects how strongly that feature contributes to the predicted target. The model is then given the features of unseen data (the test set) and predicts the target values. Comparing these predictions with the actual target values of the test set shows how well the model generalises.

In [None]:
df_reg.columns.unique()

In [None]:
#split dataset into features and target: X holds the features (target column dropped) and y holds the target
X = df_reg.drop('Final Exam Marks (out of 100)', axis=1)
y = df_reg['Final Exam Marks (out of 100)']
#create 4 variables: X_train and X_test are the features, y_train and y_test are the targets
#test_size=0.2 splits the dataset into 80% train and 20% test, random_state provides reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


Inspecting X_train shows that it no longer contains the final marks column, so it holds just the features. Inspecting y_train shows that it holds just the final marks, i.e. the target variable.

In [None]:
#features only
X_train

In [None]:
#targets only
y_train

Printing the shape of the dataframes shows us that the train set is now 1600 rows (80% of 2000) and the test set is 400 rows (20%).

In [None]:
#print the shape of the train and test sets
print(
    "Train set:",
    X_train.shape,
    y_train.shape,
    "\nTest set:",
    X_test.shape,
    y_test.shape,
)

# Baseline linear regression model (no pre-processing)

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
prediction = model.predict(X_test)

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
#save regression scores as a dictionary
reg_scores = {"Model": "Baseline",
              "R2 Score": r2_score(y_test, prediction), 
              "MAE": mean_absolute_error(y_test, prediction),
              "MSE": mean_squared_error(y_test, prediction),
              "RMSE": np.sqrt(mean_squared_error(y_test, prediction))
              }

# save regression scores in a dataframe for easy comparison
df_reg_scores = pd.DataFrame([reg_scores])
df_reg_scores
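
The same four metrics are computed for every model in this notebook, so they could be wrapped in a helper. The sketch below is one possible version; the function name `regression_scores` is hypothetical, and the three-value example is made up so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_scores(name, y_true, y_pred):
    """Return the R2, MAE, MSE and RMSE of a set of predictions as a dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "Model": name,
        "R2 Score": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }


# Hand-checkable example: the prediction errors are -1, 0 and +2
scores = regression_scores("Toy", [3, 5, 7], [2, 5, 9])
print(scores)
```

With a helper like this, each scores dictionary becomes a single call such as `regression_scores("Baseline", y_test, prediction)`.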


I am not interpreting the coefficients of this model because the features are not scaled, so their magnitudes cannot be compared directly.

# Linear Regression pipeline with pre-processing (feature scaling)

I will use a pipeline to handle the pre-processing: it scales the features before fitting the model.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


#build a pipeline with 2 steps: scaling, ML model
def linear_regression_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [None]:
linear_regression_pipeline()

In [None]:
#fit the pipeline on X_train and y_train
pipeline = linear_regression_pipeline()
pipeline.fit(X_train, y_train)

In [None]:
pipeline_prediction = pipeline.predict(X_test)

In [None]:
pipeline_reg_scores = {"Model": "Scaled Regression",
                       "R2 Score": r2_score(y_test, pipeline_prediction), 
                       "MAE": mean_absolute_error(y_test, pipeline_prediction),
                       "MSE": mean_squared_error(y_test, pipeline_prediction),
                       "RMSE": np.sqrt(mean_squared_error(y_test, pipeline_prediction))
              }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = pipeline_reg_scores
df_reg_scores
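
Once the features are standardised, the fitted coefficients are on a common scale and can be ranked by absolute size. The sketch below shows one way to read them out of a fitted pipeline. It uses synthetic data and made-up feature names (`hours`, `attendance`, `noise`), so the numbers say nothing about the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "hours": rng.uniform(0, 6, 300),
    "attendance": rng.uniform(50, 100, 300),
    "noise": rng.normal(0, 1, 300),
})
# Synthetic target driven mostly by hours and slightly by attendance
y = 8.0 * X["hours"] + 0.1 * X["attendance"] + rng.normal(0, 1, 300)

pipe = Pipeline([
    ("feature_scaling", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

# Pair each standardised coefficient with its feature name and rank by magnitude
coefs = pd.Series(pipe.named_steps["model"].coef_, index=X.columns)
ranked = coefs.abs().sort_values(ascending=False)
print(ranked)
```

In this notebook the equivalent would be to pair `pipeline['model'].coef_` with `X_train.columns`.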



# Hypothesis 2: Use feature selection to identify the most important feature

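Feature selection does not directly report which feature explains the most variance in the target. SelectFromModel fits an estimator and keeps the features whose importance, here the absolute value of the fitted coefficient, is at or above a threshold. Because the features are scaled first, the coefficient magnitudes are comparable across features.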
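
To make the selection rule concrete, here is a hedged sketch on synthetic data with made-up feature names. With a plain `LinearRegression` and no explicit `threshold`, `SelectFromModel` defaults to the mean of the absolute coefficients, so it can keep several features or only one depending on how the coefficients are spread.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=["a", "b", "c", "d"])
# Synthetic target: "a" dominates, "b" contributes a little, "c" and "d" do not
y = 5.0 * X["a"] + 1.0 * X["b"] + rng.normal(0, 0.5, 400)

X_scaled = StandardScaler().fit_transform(X)
selector = SelectFromModel(LinearRegression()).fit(X_scaled, y)

# Importance of each feature is the absolute value of its coefficient
importances = np.abs(selector.estimator_.coef_)
selected = list(X.columns[selector.get_support()])

print("Importances:", dict(zip(X.columns, importances.round(2))))
print("Threshold (mean importance):", round(float(selector.threshold_), 2))
print("Selected:", selected)
```

In this toy case only the dominant feature clears the mean threshold, which shows that the selected set depends on the threshold rule as well as on the data.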

I'm going to add a feature selection stage to my pipeline

In [None]:
from sklearn.feature_selection import SelectFromModel

#build a pipeline with 3 steps: scaling, selecting best features, ML model
def feature_selection_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('feature_selection', SelectFromModel(LinearRegression())),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [None]:
feature_pipeline=feature_selection_pipeline()
feature_pipeline.fit(X_train, y_train)

In [None]:
X_train.columns[feature_pipeline['feature_selection'].get_support()]

In [None]:
feature_prediction = feature_pipeline.predict(X_test)

In [None]:
feature_selection_reg_scores = {"Model": "Feature Selection Regression",
                                "R2 Score": r2_score(y_test, feature_prediction), 
                                "MAE": mean_absolute_error(y_test, feature_prediction),
                                "MSE": mean_squared_error(y_test, feature_prediction),
                                "RMSE": np.sqrt(mean_squared_error(y_test, feature_prediction))
              }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = feature_selection_reg_scores
df_reg_scores

My goal was to show that average test score is the most important feature, as the correlation coefficients suggested earlier. The result above indicates that this is the case, but I was not sure that my approach was correct.

I put my pipeline into ChatGPT and asked it to evaluate my reasoning. It informed me that using linear regression inside SelectFromModel can give misleading results. I have quoted its reasoning below.

"Lasso regression should be used for feature selection instead of a standard linear regression model because it includes an L1 regularisation penalty that shrinks weaker coefficients toward zero. This allows the model to effectively identify and remove features that contribute little unique predictive value, particularly in the presence of multicollinearity. In contrast, ordinary least squares regression does not apply coefficient penalisation, meaning its coefficients can remain unstable and unsuitable for feature selection. Therefore, Lasso provides a more robust and interpretable approach to identifying the most important predictors in the dataset."

I will run the pipeline again and replace linear regression with Lasso to see if it makes a difference. 
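
The effect described in the quote can be seen on a small synthetic example (made-up data, not the project dataset). An ordinary least squares fit gives every feature a non-zero coefficient, while Lasso with `alpha=0.1` sets the coefficients of the irrelevant features to exactly zero and slightly shrinks the relevant one.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 3)))
# Only the first synthetic feature drives the target
y = 3.0 * X[:, 0] + rng.normal(0, 0.3, 1000)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Indices of the features whose Lasso coefficient is not zero
kept = np.flatnonzero(lasso.coef_)

print("OLS coefficients:  ", ols.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
print("Features kept by Lasso:", kept)
```

The value of `alpha` controls how aggressively coefficients are pushed to zero, so the selected set in the real pipeline may change if `alpha` is tuned.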


In [None]:
# Lasso pipeline
from sklearn.linear_model import Lasso
def lasso_pipeline():
    pipeline = Pipeline(
        [
            ('feature_scaling', StandardScaler()),
            ('feature_selection', SelectFromModel(Lasso(alpha=0.1))),
            ('model', LinearRegression())
        ]
    )
    return pipeline

In [None]:
#use a different name for the fitted pipeline so the lasso_pipeline() function is not overwritten
lasso_pipe = lasso_pipeline()
lasso_pipe.fit(X_train, y_train)

In [None]:
X_train.columns[lasso_pipe['feature_selection'].get_support()]

In [None]:
lasso_prediction = lasso_pipe.predict(X_test)

In [None]:
#the final estimator is still LinearRegression; Lasso is only used to select the features
lasso_reg_scores = {"Model": "Lasso Feature Selection Regression",
                    "R2 Score": r2_score(y_test, lasso_prediction),
                    "MAE": mean_absolute_error(y_test, lasso_prediction),
                    "MSE": mean_squared_error(y_test, lasso_prediction),
                    "RMSE": np.sqrt(mean_squared_error(y_test, lasso_prediction))
                    }
# add new scores to old table
df_reg_scores.loc[len(df_reg_scores)] = lasso_reg_scores
df_reg_scores
