# Simple Linear Regression for Salary Prediction

This notebook provides a step-by-step mathematical representation of the univariate linear regression model using SymPy, applied to predict salaries based on a dataset. We will derive the model equations, explore data preprocessing, model training, evaluation, and visualization to understand the relationship between the predictor variable (e.g., years of experience) and the target variable (salary). The goal is to build a predictive model that can estimate salaries accurately while illustrating the underlying mathematics.

## Tools Used
- **Python**: Programming language for data analysis and machine learning.
- **Pandas**: For data manipulation and analysis.
- **NumPy**: For numerical computations.
- **SymPy**: For symbolic mathematics and step-by-step mathematical derivations.

## Linear Regression Formula

Linear regression finds the best straight line that fits through our data. The line can be described with a simple equation:


$$y = mx + c$$

Where:
- **y** = The value we want to predict (salary)
- **x** = The input variable (years of experience)
- **m** = The slope (how much y changes when x increases by 1)
- **c** = The intercept (where the line crosses the y-axis)

The model works by finding the values of m and c that make the line fit the data as closely as possible. We do this by minimizing the total squared vertical distance between the actual data points and the predicted line.

## Cost Function (Mean Squared Error)

The cost function measures how far our predictions are from the actual values. It's the function we want to minimize to find the best line.

### Formula

$$J(m, c) = \frac{1}{2n} \sum_{i=1}^{n}(y_{pred,i} - y_{actual,i})^2$$

**Expanded form:**

$$J(m, c) = \frac{1}{2n} \sum_{i=1}^{n}(mx_i + c - y_i)^2$$

Here x_i and y_i are the input and the actual output of the i-th data point.

### Parameters

- **m** : Slope of the line (weight) - determines the steepness
- **c** : Y-intercept (bias) - where the line crosses the y-axis
- **x** : Input variable (years of experience)
- **y** : Actual output value (salary)
- **n** : Total number of data points
- **y_pred = mx + c** : Predicted value for a given x

### How It Works

1. **Calculate errors**: For each data point, find the difference between the predicted value (mx + c) and the actual value (y)
2. **Square the errors**: Square each difference to penalize larger errors more heavily and make all errors positive
3. **Sum squared errors**: Add up all the squared differences
4. **Average the errors**: Divide by 2n to normalize the cost (the factor of 2 is for mathematical convenience in derivatives)
5. **Minimize**: Find the values of m and c that minimize J(m, c) to get the best-fitting line

Dividing by n turns the sum into an average, so the cost does not grow with the size of the dataset and stays comparable across datasets. The extra factor of 1/2 only rescales J and does not change which m and c minimize it.
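The steps above can be sketched numerically with NumPy. This is a minimal illustration, not part of the notebook's pipeline: the function name `cost` and the toy data are assumptions made for the example.

```python
import numpy as np

def cost(m, c, x, y):
    """Half mean squared error: J(m, c) = 1/(2n) * sum((m*x_i + c - y_i)^2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = m * x + c - y            # step 1: prediction errors
    return float(np.sum(residuals ** 2) / (2 * len(x)))  # steps 2-4

# Hypothetical toy data (not from the salary dataset)
x_toy = [1.0, 2.0, 3.0]
y_toy = [2.0, 4.0, 6.0]

perfect_fit = cost(2.0, 0.0, x_toy, y_toy)  # line passes through every point
poor_fit = cost(1.0, 0.0, x_toy, y_toy)     # residuals are -1, -2, -3
print(perfect_fit, poor_fit)
```

The line y = 2x fits the toy points exactly, so its cost is zero, while y = x leaves residuals and therefore a larger cost.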


## Implementation of Linear Regression using Symbolic Mathematics

In this section, we will implement the linear regression model using SymPy's symbolic mathematics capabilities. This approach allows us to derive the exact mathematical solution by computing partial derivatives and solving a system of equations analytically. Unlike iterative methods like gradient descent, this method provides the closed-form solution directly.
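For reference, the closed-form solution that this symbolic procedure should reproduce can be written out by hand. The following is the standard least-squares derivation, where the means $\bar{x}$ and $\bar{y}$ are notation introduced here for convenience. Setting both partial derivatives of the cost to zero gives:

$$\frac{\partial J}{\partial m} = \frac{1}{n}\sum_{i=1}^{n}(mx_i + c - y_i)\,x_i = 0$$

$$\frac{\partial J}{\partial c} = \frac{1}{n}\sum_{i=1}^{n}(mx_i + c - y_i) = 0$$

Solving these two linear equations for m and c yields:

$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad c = \bar{y} - m\bar{x}$$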

### Step 1: Import Required Libraries

We start by importing the necessary libraries for our implementation.


In [1]:
import sympy as sym
import numpy as np
import pandas as pd

### Step 2: Load and Prepare the Dataset

We load the salary dataset and prepare it for training and testing. The dataset contains years of experience and corresponding salaries.


In [3]:
df = pd.read_csv("../data/Salary_Data.csv")

### Step 3: Split Data into Training Set

We extract the first 28 samples as training data. X contains the independent variable (years of experience) and y contains the dependent variable (salary).


In [4]:
X = df.iloc[0:28, 0].values
y = df.iloc[0:28, 1].values

### Step 4: Define Symbolic Variables

We define the symbolic variables that represent the parameters of our linear regression model. Here, c is the intercept and m is the slope.


In [15]:
c, m = sym.symbols("c m")

### Step 5: Calculate Sum of Squared Errors (SSE)

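We compute the Sum of Squared Errors (SSE) over all training data points, which is the total squared error between predicted and actual values. SSE equals 2n times J(m, c), so it differs from the cost function only by a constant factor and is minimized by the same m and c. We therefore minimize SSE directly.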


In [16]:
sse = 0
for x_i, y_i in zip(X, y):
    prediction = m * x_i + c        # symbolic prediction for this sample
    sse += (y_i - prediction) ** 2  # accumulate the squared residual
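To see what the accumulated expression looks like, here is a small sketch on three hypothetical data points (the names `X_toy`, `y_toy`, and `sse_toy` are illustrative and not part of the notebook). Expanding the sum shows that SSE is a quadratic polynomial in m and c.

```python
import sympy as sym

m, c = sym.symbols("m c")

# Hypothetical toy data (not from the salary dataset)
X_toy = [1, 2, 3]
y_toy = [2, 4, 7]

# Same accumulation as the loop above, written as a generator expression
sse_toy = sum((y_i - (m * x_i + c)) ** 2 for x_i, y_i in zip(X_toy, y_toy))

# Expand into a polynomial in m and c
sse_expanded = sym.expand(sse_toy)
print(sse_expanded)
```

Because the expression is quadratic with positive squared terms, it has a single minimum, which is what makes the closed-form approach work.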

### Step 6: Compute Partial Derivatives

We compute the partial derivatives of SSE with respect to m and c. Together they form the gradient of SSE, and at the minimum both must equal zero.


In [17]:
diff_m = sse.diff(m)
diff_c = sse.diff(c)
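As a sanity check, the symbolic derivatives can be compared against the hand-derived forms 2Σ(mx_i + c - y_i)x_i and 2Σ(mx_i + c - y_i). The sketch below does this on hypothetical toy data; all names ending in `_toy` and the `hand_*` variables are assumptions made for the example.

```python
import sympy as sym

m, c = sym.symbols("m c")

# Hypothetical toy data (not from the salary dataset)
X_toy = [1, 2, 3]
y_toy = [2, 4, 7]

sse_toy = sum((y_i - (m * x_i + c)) ** 2 for x_i, y_i in zip(X_toy, y_toy))

# Derivatives computed by SymPy
grad_m = sse_toy.diff(m)
grad_c = sse_toy.diff(c)

# Derivatives written out by hand using the chain rule
hand_m = sum(2 * (m * x_i + c - y_i) * x_i for x_i, y_i in zip(X_toy, y_toy))
hand_c = sum(2 * (m * x_i + c - y_i) for x_i, y_i in zip(X_toy, y_toy))

print(sym.expand(grad_m))
print(sym.expand(grad_c))
```

Both derivatives are linear in m and c, so setting them to zero gives a system of two linear equations.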

### Step 7: Solve for Optimal Parameters

We solve the system of equations (setting both partial derivatives to zero) to find the optimal values of m and c. This is the closed-form solution that minimizes the SSE.


In [18]:
solution = sym.solve([diff_m, diff_c], (m, c))
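One way to gain confidence in the symbolic solution is to compare it with an independent numerical fit. The sketch below solves the same kind of system on hypothetical toy data and cross-checks the result against `np.polyfit` with degree 1; the toy data and variable names are assumptions for illustration.

```python
import numpy as np
import sympy as sym

m, c = sym.symbols("m c")

# Hypothetical toy data (not from the salary dataset)
X_toy = [1, 2, 3]
y_toy = [2, 4, 7]

sse_toy = sum((y_i - (m * x_i + c)) ** 2 for x_i, y_i in zip(X_toy, y_toy))

# Set both partial derivatives to zero and solve the linear system
solution_toy = sym.solve([sse_toy.diff(m), sse_toy.diff(c)], (m, c))
slope_toy = solution_toy[m]
intercept_toy = solution_toy[c]

# Independent numerical check: degree-1 least-squares polynomial fit
slope_np, intercept_np = np.polyfit(X_toy, y_toy, deg=1)

print(slope_toy, intercept_toy)
print(slope_np, intercept_np)
```

With integer inputs SymPy returns exact rationals, while NumPy returns floating-point values that should agree to within rounding error.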

### Step 8: Extract Model Parameters

We extract the computed slope (m) and intercept (c) from the solution, which `sym.solve` returns as a dictionary keyed by symbol. These are the parameters of our trained linear regression model.


In [20]:
intercept = solution[c]
slope = solution[m]

### Step 9: Display the Trained Model Equation

We display the final linear regression equation with the computed slope and intercept values.


In [21]:
print(f"Model: y = {slope}x + {intercept}")

Model: y = 9570.07159301477x + 25336.2527574340


### Step 10: Create Prediction Function

We define a function that uses the trained model to make a prediction for a single new input value. It returns the predicted salary as a Python float.


In [28]:
def predict(x_test):
    return float(slope * x_test + intercept)
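The function above handles one value at a time, because `float()` cannot convert an array-valued expression. A possible vectorized variant is sketched below. It assumes the parameters have been converted to plain floats (here copied from the model equation printed in Step 9); `predict_many` and the sample inputs are hypothetical names and values, not part of the notebook.

```python
import numpy as np

# Parameter values copied from the model equation printed in Step 9
slope = 9570.07159301477
intercept = 25336.2527574340

def predict_many(x_values):
    """Apply y = slope * x + intercept to every element of x_values."""
    x_values = np.asarray(x_values, dtype=float)
    return slope * x_values + intercept

# Hypothetical inputs: 0, 1, and 5 years of experience
preds = predict_many([0.0, 1.0, 5.0])
print(preds)
```

At x = 0 the prediction equals the intercept, and each additional year adds the slope.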

### Step 11: Prepare Test Data

We extract the remaining samples (from index 28 onwards) as test data to evaluate the performance of our trained model.

In [29]:
X_test = df.iloc[28:, 0].values
y_test = df.iloc[28:, 1].values

### Step 12: Make Predictions on Test Data

We use the trained model to predict the salary for the second test sample (index 1).

In [None]:
y_pred = predict(X_test[1])

### Step 13: Compare Predicted vs Actual Values

We display the predicted salary next to the actual salary of the same test sample.

In [39]:
y_pred

125822.00448408909

In [37]:
y_test[1]

np.float64(121872.0)
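The gap between the two values above can be quantified with a simple absolute and percentage error. This is a minimal sketch for a single sample, with both numbers copied from the outputs above; the variable names `y_actual`, `abs_error`, and `pct_error` are illustrative.

```python
# Values copied from the cell outputs above
y_pred = 125822.00448408909   # predicted salary
y_actual = 121872.0           # actual salary

abs_error = abs(y_pred - y_actual)
pct_error = 100 * abs_error / y_actual

print(f"Absolute error: {abs_error:.2f}")      # Absolute error: 3950.00
print(f"Percentage error: {pct_error:.2f}%")   # Percentage error: 3.24%
```

A single sample says little about overall model quality, so averaging such errors over all test samples would give a more reliable picture.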