# Supervised Machine Learning: Linear Regression

## Linear Regression: Unscaled vs. Scaled Data
In this demo, we follow the ML process:
1. **Remember:** Load and inspect the data.
2. **Formulate:** Build a linear regression model first on raw (unscaled) data.
3. **Predict:** Evaluate the model's performance.

Then we apply feature scaling and rebuild the model to compare results.
We use the Student Performance dataset from Kaggle to predict the "Performance Index" of students.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Download data from Kaggle
#!kaggle datasets download -d nikhil7280/student-performance-multiple-linear-regression
#!unzip student-performance-multiple-linear-regression.zip

# Import dataframe
df = pd.read_csv("Student_Performance.csv")
df

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,Yes,4,2,23.0
9996,7,64,Yes,8,5,58.0
9997,6,83,Yes,8,5,74.0
9998,9,97,Yes,7,0,95.0


In [3]:
# Convert extracurricular activities to numeric
# Series.map replaces each value using the key-value pairs in the dictionary
df["Extracurricular Activities"] = df["Extracurricular Activities"].map({"Yes": 1,
                                                                          "No": 0})

# Define the features (X) and the target variable (y) based on the dataset
feature_vars = ["Hours Studied", "Previous Scores", "Sleep Hours", "Sample Question Papers Practiced", "Extracurricular Activities"]
X = df[feature_vars] # keep only the feature columns
y = df["Performance Index"]

# Display a preview of the features and target

print(X)
print(y)

      Hours Studied  Previous Scores  Sleep Hours  \
0                 7               99            9   
1                 4               82            4   
2                 8               51            7   
3                 5               52            5   
4                 7               75            8   
...             ...              ...          ...   
9995              1               49            4   
9996              7               64            8   
9997              6               83            8   
9998              9               97            7   
9999              7               74            8   

      Sample Question Papers Practiced  Extracurricular Activities  
0                                    1                           1  
1                                    2                           0  
2                                    2                           1  
3                                    2                           1  
4                 

## Part 1: Linear Regression on Unscaled Data
In this section, we build a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) model on the raw data.
This helps us see the effect of differing scales on the coefficients.
We start by [splitting our data into training and testing sets](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split).

In [None]:
# Import only the class and function we need rather than the whole sklearn package
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# train_test_split shuffles the rows before splitting (see the second link above)
# and returns four objects: X_train, X_test, y_train, y_test
# Manually taking the first 80% of rows can create massive bias if the data is ordered
# Convention: X (feature matrix) is capitalized and y (target vector) is lowercase

# Split the raw data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2, # the default is 0.25, so set it explicitly
                                                    random_state = 42) # fixes the shuffle seed so the split is reproducible; any integer works, 42 is just a common choice

# The dataset has 10,000 rows, so we expect 8,000 training and 2,000 test targets
print(y_train.shape)
print(y_test.shape)


(8000,)
(2000,)


In [None]:
# Initialize and train the linear regression model on unscaled data
lin_reg = LinearRegression() # create the estimator; its methods (fit, predict, score) are listed in the linked documentation
lin_reg.fit(X_train, y_train)
# No reassignment is needed because fit() updates lin_reg in place
# In a notebook, the cell output displays the fitted LinearRegression estimator


In [13]:
# Make predictions on the test set by passing the test features to predict()
y_pred = lin_reg.predict(X_test)
# Returns an array with one predicted Performance Index per test row,
# so len(y_pred) should be 2000
y_pred

array([54.71185392, 22.61551294, 47.90314471, ..., 16.79341955,
       63.34327368, 45.94262301])

In [None]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score # performance metrics (root_mean_squared_error requires scikit-learn 1.4 or newer)
# Evaluate model performance
# Compare y_pred to the actual test values to see how accurate our model is
mse_lin = mean_squared_error(y_test, y_pred)
rmse_lin = root_mean_squared_error(y_test, y_pred)
r2_lin = r2_score(y_test, y_pred)

print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse_lin:.2f}")
print(f"Root Mean Squared Error: {rmse_lin:.2f}")
print(f"R² Score: {r2_lin:.2f}")

# These metrics are computed on held-out test data, so they measure predictive accuracy, not just fit
# RMSE: expressed in the units of the Performance Index -- our predictions are typically about 2 points off
# R2: these features explain about 99% of the variance in the Performance Index


Unscaled Data Model:
Mean Squared Error: 4.08
Root Mean Squared Error: 2.02
R² Score: 0.99


### Notes on Unscaled Model:
- **Coefficients (Unscaled):**
    - Each coefficient represents the change in the Performance Index for a one-unit change in the respective feature, holding all other features constant.
    - For example, if "Hours Studied" has a coefficient of 2.85, it implies that for each additional hour studied, the Performance Index increases by 2.85 points (assuming other factors remain constant).
    - However, because features are in different units (e.g., hours vs. scores), comparing these coefficients directly may be misleading.

- **R² Score:**
    - This metric indicates the proportion of the variance in the target variable explained by the model.
    - An R² close to 1 suggests a very good fit, while an R² near 0 indicates the model fails to capture much variance.

- **MSE & RMSE:**
    - MSE measures the average squared difference between actual and predicted values.
    - RMSE, being the square root of MSE, gives an error metric in the same units as the target.
    - Lower RMSE values indicate better predictive performance.
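
To make the three definitions above concrete, here is a minimal sketch that computes MSE, RMSE, and R² by hand with NumPy. The four target values and predictions are made-up toy numbers chosen for easy arithmetic; they are not taken from the Student Performance dataset.

```python
import numpy as np

# Toy values for illustration only (not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

mse = np.mean(residuals ** 2)                    # average squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(residuals ** 2)                  # variation the model misses
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation in the target
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(f"MSE:  {mse:.2f}")   # 0.50
print(f"RMSE: {rmse:.2f}")  # 0.71
print(f"R²:   {r2:.2f}")    # 0.90
```

These are the same quantities that `mean_squared_error`, `root_mean_squared_error`, and `r2_score` return for the same inputs.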

In [26]:
# View our model's coefficients (the betas)
coef_series = pd.Series(lin_reg.coef_, index = X.columns)
# Each coefficient is the weight of a feature in predicting y, holding the other features constant
# Each additional hour studied raises the predicted Performance Index by about 2.85 points
# Each additional point in Previous Scores raises it by about 1.02 points
# and so on for the remaining features
coef_series

Hours Studied                       2.852484
Previous Scores                     1.016988
Sleep Hours                         0.476941
Sample Question Papers Practiced    0.191831
Extracurricular Activities          0.608617
dtype: float64

In [23]:
intercept = pd.Series(lin_reg.intercept_)
intercept

0   -33.921946
dtype: float64

In [None]:
# Which feature is the most important for student performance?
# You may say Hours Studied, but the features are on different scales, so the coefficients are not directly comparable
# To compare them, we need to standardize the features by scaling them

### Manually Computing a Prediction from Our Model
- In this section, we'll calculate a predicted value by hand (i.e., by multiplying the model's coefficients by the original feature values and adding the intercept).
- This mirrors exactly what the model does internally.

- **Why is this helpful?**
   - It reinforces how linear regression makes its predictions using the equation: `prediction = intercept + (coef_1 * x_1) + (coef_2 * x_2) + ...`
   - It helps us see the individual impact of each feature on the final prediction.
   - It confirms that the manual approach matches the `model.predict()` output.

#### 1. Extract the coefficients and intercept from our trained model
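
A minimal sketch of this step. So that it runs on its own, it fits a small model on hypothetical toy data (`toy`, `toy_target`, and `toy_model` are illustrative names, and the target is built from a known formula). In the notebook you would read `coef_` and `intercept_` from the already fitted `lin_reg` instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the target is exactly 2.0 * hours + 0.5 * scores + 1.0
toy = pd.DataFrame({
    "Hours Studied": [1, 2, 3, 4, 5, 6],
    "Previous Scores": [50, 80, 60, 90, 70, 40],
})
toy_target = 2.0 * toy["Hours Studied"] + 0.5 * toy["Previous Scores"] + 1.0

toy_model = LinearRegression().fit(toy, toy_target)

# coef_ holds one beta per feature, in the column order used for fitting
coefficients = pd.Series(toy_model.coef_, index=toy.columns)
# intercept_ holds beta_0 as a single float
intercept = toy_model.intercept_

print(coefficients)
print(f"Intercept: {intercept:.3f}")
```

Because the toy target is an exact linear function of the features, the fitted values recover the formula (2.0, 0.5, and 1.0) up to floating-point error.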

#### 2. Select a single row of our data (e.g., the second row)
- We select only the columns that were used as features in our model.
- The row's values represent the actual data for Hours Studied, Previous Scores, etc.
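
One way to select the row, sketched with pandas. The small frame below re-creates only the first three rows shown in the data preview (after the Yes/No mapping) so the block is self-contained; `rows` is an illustrative name, and in the notebook you would index `X` directly.

```python
import pandas as pd

# First three rows of the dataset as previewed earlier, with Yes -> 1 and No -> 0
rows = pd.DataFrame({
    "Hours Studied": [7, 4, 8],
    "Previous Scores": [99, 82, 51],
    "Sleep Hours": [9, 4, 7],
    "Sample Question Papers Practiced": [1, 2, 2],
    "Extracurricular Activities": [1, 0, 1],
})
feature_vars = ["Hours Studied", "Previous Scores", "Sleep Hours",
                "Sample Question Papers Practiced", "Extracurricular Activities"]

# iloc is position-based, so position 1 is the second row
second_row = rows[feature_vars].iloc[1]
print(second_row)
```

Selecting with `feature_vars` keeps the values in the same order as the model's coefficients, which matters for the multiplication in the next step.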

#### 3. Compute the manual prediction

**Explanation:**
- We multiply each feature value by its corresponding coefficient and sum them up.
- Then, we add the intercept.
- This is precisely the linear regression equation:
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

Where:
 - $\beta_0$ is the intercept
 - $\beta_i$ is the coefficient for feature $x_i$

 Thus, `manual_prediction` should match what the model would predict internally.
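
As a worked sketch, the block below plugs in the coefficients and intercept printed earlier in this notebook together with the second row of the dataset. The printed values are rounded to six decimals, so the result is only an approximation of what the fitted model would return.

```python
import numpy as np

# Coefficients and intercept as printed earlier (rounded to six decimals)
coefficients = np.array([2.852484, 1.016988, 0.476941, 0.191831, 0.608617])
intercept = -33.921946

# Second row of the dataset, in the same order as feature_vars:
# Hours Studied, Previous Scores, Sleep Hours,
# Sample Question Papers Practiced, Extracurricular Activities (No -> 0)
row_values = np.array([4, 82, 4, 2, 0])

# Per-feature contribution: beta_i * x_i
contributions = coefficients * row_values

# y_hat = beta_0 + sum of the contributions
manual_prediction = intercept + contributions.sum()

print(contributions)
print(f"Manual prediction: {manual_prediction:.2f}")  # 63.17
```

The recorded Performance Index for that row is 65.0, so this prediction is off by roughly two points, in line with the RMSE reported above.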

#### 4. Compare to `model.predict()` for confirmation
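
A minimal sketch of the comparison. It uses synthetic data generated with NumPy (an assumption made so the block is self-contained); in the notebook you would use `lin_reg` and a row from `X` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

row = X_toy[1]  # the second row

# Manual computation: intercept plus the dot product of coefficients and features
manual_prediction = model.intercept_ + np.dot(model.coef_, row)

# predict() expects a 2D array, so reshape the single row to shape (1, n_features)
model_prediction = model.predict(row.reshape(1, -1))[0]

print(f"Manual: {manual_prediction:.6f}")
print(f"Model:  {model_prediction:.6f}")
print("Match:", np.isclose(manual_prediction, model_prediction))
```

`np.isclose` is used rather than `==` because the two computations can differ in the last few floating-point digits.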

### **Observation:**
- The `manual_prediction` and `model_prediction` should be nearly identical (up to minor floating-point differences).
- If they match, we've confirmed our understanding of how the model uses coefficients and intercept to make a prediction.

### Why This Matters
- **Transparency:** It shows exactly how each feature influences the final predicted value.
- **Verification:** Confirms our "manual" math aligns with the model's internal computation.
- **Interpretability:** By inspecting the coefficients, we see which features have the biggest impact (positive or negative) on the Performance Index, and we can discuss whether the magnitudes make sense given the domain context.

## Part 2: Linear Regression on Scaled Data
Now we apply feature scaling using StandardScaler and rebuild the model.
Scaling brings all features to a similar scale, which aids in the interpretation of the coefficients.
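
A sketch of what this step might look like. The data here is synthetic (two features on deliberately different scales), and names such as `X_toy`, `scaler`, and `lin_reg_scaled` are illustrative; in the notebook you would apply the same calls to the `X_train` and `X_test` created in Part 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: hours range over 1-9, scores over 40-99
rng = np.random.default_rng(0)
hours = rng.uniform(1, 9, size=200)
scores = rng.uniform(40, 99, size=200)
X_toy = np.column_stack([hours, scores])
y_toy = 3.0 * hours + 1.0 * scores - 30.0 + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test data to avoid leaking test information
X_test_scaled = scaler.transform(X_test)

lin_reg_scaled = LinearRegression().fit(X_train_scaled, y_train)

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 for each feature
print(lin_reg_scaled.coef_)
```

Fitting the scaler on the training split and only transforming the test split keeps the test set unseen during preprocessing.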

### Notes on Scaled Model:
- **Coefficients (Scaled):**
    - After scaling, each coefficient indicates the change in the Performance Index for a one standard deviation change in that feature.
    - This standardization makes it easier to compare the relative importance of features.
    - For example, a higher coefficient means that feature has a larger effect on the target, per standard deviation change.

- **R² and RMSE Comparison:**
    - For ordinary least squares linear regression, standardizing the features does not change the predictions, so R² and RMSE stay essentially the same after scaling (up to floating-point error).
    - However, scaling is what makes the coefficients comparable when features are on different scales.
    - It is also a critical preprocessing step for many other algorithms.
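
The notes above can be checked with a short sketch on synthetic data (an assumption, so the block runs without the dataset). It fits one model on raw features and one on standardized features, then compares the test R² and the relationship between the two sets of coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: two features on different scales
rng = np.random.default_rng(1)
X_toy = np.column_stack([rng.uniform(1, 9, 300), rng.uniform(40, 99, 300)])
y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] - 30.0 + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

raw_model = LinearRegression().fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)

r2_raw = r2_score(y_test, raw_model.predict(X_test))
r2_scaled = r2_score(y_test, scaled_model.predict(scaler.transform(X_test)))

# Same predictions, so the same R² (up to floating-point error)
print("Same R²:", np.isclose(r2_raw, r2_scaled))

# Each scaled coefficient equals the raw coefficient times that feature's
# training standard deviation (stored in scaler.scale_)
print("Coefficient relation holds:",
      np.allclose(scaled_model.coef_, raw_model.coef_ * scaler.scale_))

print("Raw coefficients:   ", raw_model.coef_.round(3))
print("Scaled coefficients:", scaled_model.coef_.round(3))
```

In this toy setup the raw coefficient for the first feature is larger, but the second feature has the larger scaled coefficient because it varies over a much wider range. That is the kind of reversal that makes unscaled coefficients misleading as a measure of importance.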

# Conclusion
In this demo, we:
- Built and evaluated a linear regression model on unscaled data.
- Re-trained the model after applying feature scaling.
- Observed that while overall performance metrics (**MSE** and **R²**) stay essentially the same, scaling is crucial for comparing model coefficients across features and, in algorithms that are sensitive to feature magnitude, for ensuring that features contribute in a balanced way.
  
### Key Takeaways:
- **Coefficients:** On unscaled data, coefficients are tied to the original units, which can be hard to compare.
  After scaling, coefficients represent the effect of a one standard deviation change in the feature.
- **R² Score:** Reflects the proportion of variance in the target variable explained by the model.
- **MSE (and RMSE):** Lower values indicate better model performance; RMSE provides an error measure in the target's units.

This process reflects the "remember-formulate-predict" approach in machine learning.