# Simple Linear Regression for Salary Prediction

This notebook provides a step-by-step mathematical representation of the univariate linear regression model using SymPy, applied to predict salaries based on a dataset. We will derive the model equations, explore data preprocessing, model training, evaluation, and visualization to understand the relationship between the predictor variable (e.g., years of experience) and the target variable (salary). The goal is to build a predictive model that can estimate salaries accurately while illustrating the underlying mathematics.

## Tools Used
- **Python**: Programming language for data analysis and machine learning.
- **Pandas**: For data manipulation and analysis.
- **NumPy**: For numerical computations.
- **SymPy**: For symbolic mathematics and step-by-step mathematical derivations.
- **Matplotlib**: For plotting and visualization.


## Linear Regression Formula

Linear regression finds the best straight line that fits through our data. The line can be described with a simple equation:


$$y = mx + c$$

Where:
- **y** = The value we want to predict (salary)
- **x** = The input variable (years of experience)
- **m** = The slope (how much y changes when x increases by 1)
- **c** = The intercept (where the line crosses the y-axis)

The model works by finding the values of m and c that make the line fit the data as closely as possible. We do this by minimizing the sum of the squared vertical distances between the actual data points and the line's predictions.


## Cost Function (Mean Squared Error)

The cost function measures how far our predictions are from the actual values. It's the function we want to minimize to find the best line.

### Formula

$$J(m, c) = \frac{1}{2n} \sum_{i=1}^{n}\left(y_{pred,i} - y_{actual,i}\right)^2$$

**Expanded form:**

$$J(m, c) = \frac{1}{2n} \sum_{i=1}^{n}(mx_i + c - y_i)^2$$

### Parameters

- **m** : Slope of the line (weight) - determines the steepness
- **c** : Y-intercept (bias) - where the line crosses the y-axis
- **x_i** : Input variable (years of experience) for the i-th data point
- **y_i** : Actual output value (salary) for the i-th data point
- **n** : Total number of data points
- **y_pred,i = mx_i + c** : Predicted value for the i-th data point

### How It Works

1. **Calculate errors**: For each data point, find the difference between the predicted value (mx + c) and the actual value (y)
2. **Square the errors**: Square each difference to penalize larger errors more heavily and make all errors positive
3. **Sum squared errors**: Add up all the squared differences
4. **Average the errors**: Divide by 2n to normalize the cost (the factor of 2 is for mathematical convenience in derivatives)
5. **Minimize**: Find the values of m and c that minimize J(m, c) to get the best-fitting line

The division by 2n ensures that the cost doesn't grow with the size of the dataset, making it comparable across different datasets.
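
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's training code: the function name `cost` and the toy arrays are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def cost(m, c, x, y):
    """Return J(m, c) = (1 / 2n) * sum((m*x + c - y)^2) for 1-D arrays x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    errors = m * x + c - y               # step 1: prediction errors
    squared = errors ** 2                # step 2: squared errors
    return squared.sum() / (2 * x.size)  # steps 3-4: sum, then divide by 2n

# Toy data lying exactly on y = 2x + 1 (illustrative values, not the salary dataset)
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.0, 5.0, 7.0])

perfect = cost(2.0, 1.0, x_toy, y_toy)   # the line passes through every point
flat = cost(0.0, 0.0, x_toy, y_toy)      # (9 + 25 + 49) / 6
print(perfect, flat)
```

A line that passes through every point has zero cost, and any other choice of m and c gives a positive value, which is what makes J(m, c) a sensible quantity to minimize.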


## Importing Necessary Libraries

In [None]:
import sympy as sp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading the Data

In [None]:
df = pd.read_csv('../data/Salary_Data.csv')


### Train Test Split

In [None]:

X_train = df.iloc[:28, :-1].values  # Features (Years of Experience)
y_train = df.iloc[:28, 1].values    # Target (Salary)

In [18]:
print(X_train)

[[1.1]
 [1.3]
 [1.5]
 [2. ]
 [2.2]
 [2.9]
 [3. ]
 [3.2]
 [3.2]
 [3.7]
 [3.9]
 [4. ]
 [4. ]
 [4.1]
 [4.5]
 [4.9]
 [5.1]
 [5.3]
 [5.9]
 [6. ]
 [6.8]
 [7.1]
 [7.9]
 [8.2]
 [8.7]]


In [19]:
print(y_train)

[ 39343.  46205.  37731.  43525.  39891.  56642.  60150.  54445.  64445.
  57189.  63218.  55794.  56957.  57081.  61111.  67938.  66029.  83088.
  81363.  93940.  91738.  98273. 101302. 113812. 109431.]


In [20]:
X_test = df.iloc[28:, :-1].values  # Features for testing
y_test = df.iloc[28:, 1].values     # Target for testing

In [22]:
print(X_test)
print(y_test)

[[10.3]
 [10.5]]
[122391. 121872.]
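
The slicing used for the split returns arrays of different dimensionality: `.iloc[:k, :-1].values` keeps a column axis, while `.iloc[:k, 1].values` is one-dimensional. The sketch below shows this on a small hypothetical DataFrame; the column names, values, and split index are assumptions made for illustration and are not read from `Salary_Data.csv`.

```python
import pandas as pd

# Hypothetical 5-row frame with a two-column layout (feature first, target last)
toy = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Salary": [40000.0, 45000.0, 52000.0, 58000.0, 63000.0],
})

split = 3  # first `split` rows are used for training, the rest for testing
X_tr = toy.iloc[:split, :-1].values   # 2-D array, shape (3, 1)
y_tr = toy.iloc[:split, 1].values     # 1-D array, shape (3,)
X_te = toy.iloc[split:, :-1].values   # 2-D array, shape (2, 1)
y_te = toy.iloc[split:, 1].values     # 1-D array, shape (2,)

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

In the printed data the rows are ordered by years of experience, so a positional split like this one tests the model only on the highest experience values.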


## Implementing the Univariate Linear Regression Model

### Defining the symbols 

In [None]:
m,c,x,y = sp.symbols('m c x y')

# m - Slope of the line (weight) - determines the steepness
# c - Intercept (bias) - determines where the line crosses the y-axis
# y - Dependent variable (target)
# x - Independent variable (feature)

### Building Univariate Linear Regression Model

In [None]:
def linear_equation(m_val, c_val, x_val):
    """
    Computes the predicted value using the linear regression formula.
    
    This function implements the basic linear equation y = m*x + c, which is the 
    foundation of univariate linear regression.
    
    Parameters:
    -----------
    m_val : float
        The slope of the line (weight) - determines the steepness of the line.
    
    c_val : float
        The intercept (bias) - determines where the line crosses the y-axis.
    
    x_val : float or array-like
        The independent variable (feature) - the input value(s) for which to predict the output.
    
    Returns:
    --------
    float or array-like
        The predicted value (y) calculated using the formula: y = m*x + c
    
    Example:
    --------
    >>> linear_equation(2, 5, 10)
    25
    # This means: when x=10, m=2, c=5, the predicted y = 2*10 + 5 = 25
    """
    return m_val * x_val + c_val

### Symbolic representation of Linear equation

In [None]:
y_learn = linear_equation(m, c, x)

In [37]:
sp.pprint(y_learn)

c + m⋅x


### Symbolic representation of squared error

In [38]:
squared_error = (y_learn - y)**2

In [39]:
sp.pprint(squared_error)

             2
(c + m⋅x - y) 
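
A symbolic expression can be evaluated by substituting numbers for its symbols. The following sketch rebuilds the two expressions so that it runs on its own; the numeric values are arbitrary and chosen only to make the result easy to verify.

```python
import sympy as sp

m, c, x, y = sp.symbols('m c x y')
y_learn = m * x + c
squared_error = (y_learn - y) ** 2

# Arbitrary illustrative values: slope 2, intercept 5, input 10, actual value 27
values = {m: 2, c: 5, x: 10, y: 27}
prediction = y_learn.subs(values)       # 2*10 + 5
error_sq = squared_error.subs(values)   # (25 - 27)**2

print(prediction, error_sq)  # 25 4
```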


### Symbolic Gradient Descent Algorithm

In [41]:
J_m = sp.diff(squared_error, m)
J_c = sp.diff(squared_error, c)

# The expressions are embedded in plain strings, so print() is sufficient here
print(f"Gradient w.r.t. m: {J_m}")
print(f"Gradient w.r.t. c: {J_c}")

Gradient w.r.t. m: 2*x*(c + m*x - y)
Gradient w.r.t. c: 2*c + 2*m*x - 2*y
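
For reference, both derivatives follow from the chain rule applied to the squared error of a single data point:

$$\frac{\partial}{\partial m}(mx + c - y)^2 = 2x(mx + c - y), \qquad \frac{\partial}{\partial c}(mx + c - y)^2 = 2(mx + c - y)$$

Averaging these over all $n$ data points gives the gradient of $\frac{1}{n}\sum_{i=1}^{n}(mx_i + c - y_i)^2$, which is exactly twice the gradient of $J(m, c)$ as defined in the cost function section. The extra factor of 2 only rescales the effective learning rate; it does not move the minimum.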


### Converting mathematical functions into Python functions

In [42]:
get_j_m = sp.lambdify([m, c, x, y], J_m, "numpy")
get_j_c = sp.lambdify([m, c, x, y], J_c, "numpy")
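
As a quick sanity check, the lambdified functions can be called on a single data point. The sketch below rebuilds them under the hypothetical names `grad_m` and `grad_c` so that it runs independently, and evaluates them at m = 0, c = 0 for the first training row shown earlier (x = 1.1, y = 39343).

```python
import sympy as sp

m, c, x, y = sp.symbols('m c x y')
squared_error = (m * x + c - y) ** 2

# Same construction as get_j_m / get_j_c, under illustrative names
grad_m = sp.lambdify([m, c, x, y], sp.diff(squared_error, m), "numpy")
grad_c = sp.lambdify([m, c, x, y], sp.diff(squared_error, c), "numpy")

g_m = grad_m(0.0, 0.0, 1.1, 39343.0)  # 2*x*(c + m*x - y)
g_c = grad_c(0.0, 0.0, 1.1, 39343.0)  # 2*(c + m*x - y)
print(g_m, g_c)
```

The value of `g_m` is approximately -86554.6, which agrees with the first entry of the `J_M` array printed during training.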

## Building the fit function

In [None]:
def fit(x_train, y_train, learning_rate, epochs):
    """
    Trains the linear regression model using Gradient Descent optimization.
    
    This function implements the gradient descent algorithm to find the optimal 
    values of slope (m) and intercept (c) that minimize the cost function 
    (Mean Squared Error). It iteratively updates m and c by moving in the 
    direction of negative gradients.
    
    Parameters:
    -----------
    x_train : array-like
        Training features (independent variable) - years of experience values.
    
    y_train : array-like
        Training targets (dependent variable) - salary values.
    
    learning_rate : float
        The learning rate (step size) for gradient descent. Controls how much 
        the parameters are adjusted in each iteration. Typical values: 0.001 to 0.01.
    
    epochs : int
        The number of iterations to run gradient descent. More epochs allow the 
        algorithm to converge closer to the optimal solution.
    
    Returns:
    --------
    tuple : (m_final, c_final)
        m_final : float
            The optimal slope of the regression line.
        c_final : float
            The optimal intercept of the regression line.
    
    Example:
    --------
    >>> m, c = fit(X_train, y_train, learning_rate=0.01, epochs=100)
    >>> print(f"Slope: {m}, Intercept: {c}")
    
    Notes:
    ------
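    - The function prints detailed information for each epoch, including gradients
      and updated parameter values.
    - The algorithm starts with m=0.0 and c=0.0.
    - Uses the gradient functions get_j_m and get_j_c to compute partial derivatives.
    - If x_train has shape (n, 1) and y_train has shape (n,), NumPy broadcasts the
      gradient arrays to shape (n, n), pairing every x with every y. Pass a 1-D
      x_train (for example x_train.ravel()) to get one gradient per data point.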
    """
    m_current = 0.0  # Initialize the slope to zero
    c_current = 0.0  # Initialize the intercept to zero

    for epoch in range(epochs):
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        print(f"EPOCH {epoch} {'=' * 100}")

        # Evaluate the gradient expressions element-wise (NumPy broadcasting applies)
        j_m = get_j_m(m_current, c_current, x_train, y_train)
        j_c = get_j_c(m_current, c_current, x_train, y_train)
        print(f"J_M = {j_m}")
        print(f"J_C = {j_c}")

        # Average the gradients over all entries
        mean_j_m = np.mean(j_m)
        mean_j_c = np.mean(j_c)

        print(f"Mean J_M = {mean_j_m}")
        print(f"Mean J_C = {mean_j_c}")

        # Gradient descent update: step against the gradient
        m_current = m_current - learning_rate * mean_j_m
        c_current = c_current - learning_rate * mean_j_c

        print(f"slope m_{epoch} = {m_current}")
        print(f"intercept c_{epoch} = {c_current}")

        print(f"END EPOCH {epoch} {'=' * 100}")
    
    return m_current, c_current

In [60]:
model = fit(X_train, y_train, learning_rate=0.01, epochs=X_train.shape[0])

J_M = [[  -86554.6  -101651.    -83008.2   -95755.    -87760.2  -124612.4
   -132330.   -119779.   -141779.   -125815.8  -139079.6  -122746.8
   -125305.4  -125578.2  -134444.2  -149463.6  -145263.8  -182793.6
   -178998.6  -206668.   -201823.6  -216200.6  -222864.4  -250386.4
   -240748.2]
 [ -102291.8  -120133.    -98100.6  -113165.   -103716.6  -147269.2
   -156390.   -141557.   -167557.   -148691.4  -164366.8  -145064.4
   -148088.2  -148410.6  -158888.6  -176638.8  -171675.4  -216028.8
   -211543.8  -244244.   -238518.8  -255509.8  -263385.2  -295911.2
   -284520.6]
 [ -118029.   -138615.   -113193.   -130575.   -119673.   -169926.
   -180450.   -163335.   -193335.   -171567.   -189654.   -167382.
   -170871.   -171243.   -183333.   -203814.   -198087.   -249264.
   -244089.   -281820.   -275214.   -294819.   -303906.   -341436.
   -328293. ]
 [ -157372.   -184820.   -150924.   -174100.   -159564.   -226568.
   -240600.   -217780.   -257780.   -228756.   -252872.   -223176.
   -22

In [61]:
print(f"Trained Model Parameters: Slope (m) = {model[0]}, Intercept (c) = {model[1]}")

Trained Model Parameters: Slope (m) = 11133.100448937797, Intercept (c) = 7954.336963732433
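
The `J_M` array printed during training is two-dimensional because `X_train` has shape (n, 1) while `y_train` has shape (n,), so NumPy broadcasting pairs every x with every y. The sketch below reproduces the effect on toy arrays (assumed values, not the salary data) and shows one possible way to obtain per-sample gradients by flattening the features first. The helper `grad_m` is a hypothetical stand-in for `get_j_m`.

```python
import numpy as np

# Toy stand-ins for X_train (column vector) and y_train (1-D); values are illustrative
x_col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
y_vec = np.array([3.0, 5.0, 7.0])         # shape (3,)

def grad_m(m, c, x, y):
    # Same expression as the symbolic gradient w.r.t. m: 2*x*(c + m*x - y)
    return 2 * x * (c + m * x - y)

broadcast = grad_m(0.0, 0.0, x_col, y_vec)           # pairs every x with every y
per_sample = grad_m(0.0, 0.0, x_col.ravel(), y_vec)  # pairs x_i with y_i only

print(broadcast.shape)    # (3, 3)
print(per_sample.shape)   # (3,)
print(broadcast.mean(), per_sample.mean())
```

The two means differ: the broadcast version averages over the product of mean(x) and mean(y), whereas the per-sample version averages x_i * y_i, which is the quantity the cost function's gradient calls for. Calling `fit(X_train.ravel(), y_train, ...)` would therefore be expected to give different parameters from those printed above.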


## Building Predict method

In [69]:
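def predict(x_test):
    """Predict a salary for each row of x_test using the trained global `model`."""
    m_fit, c_fit = model  # slope and intercept returned by fit()
    y_preds = []
    for row in x_test:
        # Each row is a 1-element array holding the years of experience
        y_pred = linear_equation(m_fit, c_fit, row[0])
        print(f"For input {row[0]}, Predicted output: {y_pred}")
        y_preds.append(y_pred)
    return y_preds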

## Make Predictions

In [80]:
y_pred = np.array(predict(X_test))

y_pred

For input 10.3, Predicted output: 122625.27158779175
For input 10.5, Predicted output: 124851.8916775793


array([122625.27158779, 124851.89167758])

In [71]:
y_test

array([122391., 121872.])
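
To put a number on how close the predictions are, the two arrays can be compared directly. The sketch below copies the predicted and actual values from the outputs above and computes a few common error measures; the choice of metrics is an assumption, and with only two test points the figures are indicative at best.

```python
import numpy as np

# Values copied from the printed outputs above
y_pred = np.array([122625.27158779, 124851.89167758])
y_test = np.array([122391.0, 121872.0])

errors = y_pred - y_test
mae = np.mean(np.abs(errors))           # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # root mean squared error
pct = 100 * np.abs(errors) / y_test     # absolute percentage error per point

print(f"errors: {errors}")
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
print(f"percentage errors: {np.round(pct, 2)}")
```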