# Linear Regression
* When fundraising: it's AI  
* When recruiting: it's machine learning  
* When implementing: it's linear regression  

## Understanding Linear Regression  

Linear regression is a mathematical technique used to predict an outcome based on a set of input variables. It works by assigning a specific weight to each input and summing their effects to generate a final predicted value. This process is known as a **linear combination**.  

### A Real-World Analogy: Grocery Shopping  

Imagine you're shopping for groceries and purchase `2.5 lbs` of potatoes, `1 lb` of carrots, and `2 bottles` of milk. If potatoes cost `$2 per lb`, carrots `$4 per lb`, and each bottle of milk is `$3`, the total expense is calculated as:  


$$
(2.5 \times 2) + (1.0 \times 4) + (2 \times 3) = 15
$$

This demonstrates the principle of linearity: each item's contribution to the total is its quantity multiplied by its price, so buying more of an item raises the cost by a fixed, predictable amount per unit.

> **Linearity** means that for every fixed change in an input variable, the predicted outcome changes by a constant amount.
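
The grocery bill above can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper name `bill` is made up for illustration, while the quantities and prices come from the example.

```python
def bill(quantities, prices):
    """Hypothetical helper: sum of quantity * price over all items."""
    return sum(q * p for q, p in zip(quantities, prices))

quantities = [2.5, 1.0, 2]   # lbs of potatoes, lbs of carrots, bottles of milk
prices = [2, 4, 3]           # dollars per lb, per lb, per bottle

total = bill(quantities, prices)
print(total)  # 15.0

# Linearity: one extra pound of potatoes always adds exactly its price.
more_potatoes = bill([3.5, 1.0, 2], prices)
print(more_potatoes - total)  # 2.0
```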


### The Role of Weights (Coefficients)  

In statistical modeling, the prices in the example above represent **coefficients** or **weights**—values that determine how much each input influences the final outcome.  

Linear regression is a method under the broader category of **regression analysis**, **which focuses on predicting numerical values based on input features**. For instance, a laptop pricing model may estimate the value of a laptop using attributes such as its processor speed, hard drive size, and weight.  

### Features and Outputs  

We refer to the input data used in predictions as **features** or **inputs**, while the predicted result is called the **output** or **response**. In our laptop example, a laptop's hard drive capacity is an input, and its estimated market price is the output. The goal of regression is to fine-tune the model so that its predictions align as closely as possible with actual values.  

### How Predictions Are Made  

Linear regression assumes that changes in an input variable correspond to proportional changes in the predicted output. For example, if a laptop is initially estimated at `$1,000`, and increasing its memory size by `1 GB` consistently adds `$20` to its value, then:  

- An `8 GB` increase raises the estimate by `$160`.  
- A `16 GB` increase raises it by `$320`.  

Similarly, other factors contribute positively or negatively. For example, a newer processor generation increases the predicted price by `$300` per generation, whereas each additional pound of weight decreases the price by a fixed amount.  

This structured relationship between inputs and outputs makes linear regression a valuable tool for making predictions in various fields, from finance to healthcare and beyond.  
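
The fixed per-unit adjustments described above can be sketched as a small function. The name `estimate` and its parameters are hypothetical, and weight is left out because the text does not state its per-pound amount.

```python
BASE_PRICE = 1000       # starting estimate in dollars (from the text)
PER_GB_MEMORY = 20      # dollars added per extra GB of memory (from the text)
PER_GENERATION = 300    # dollars added per processor generation (from the text)

def estimate(extra_memory_gb=0, extra_generations=0):
    """Hypothetical helper: base price plus a fixed amount per unit of change."""
    return (BASE_PRICE
            + PER_GB_MEMORY * extra_memory_gb
            + PER_GENERATION * extra_generations)

print(estimate(extra_memory_gb=8))                       # 1160
print(estimate(extra_memory_gb=16))                      # 1320
print(estimate(extra_memory_gb=8, extra_generations=1))  # 1460
```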

## Shopping Bill

$$
\text{Bill total: } \$2 \times 2.5 + \$4 \times 1.0 + \$3 \times 2 = \$15
$$

$$
total = c_1 \times x_1 + c_2 \times x_2 + c_3 \times x_3
$$
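
For any number of items, the same formula can be written compactly as a sum. The upper limit $n$ is notation introduced here for the number of items; with $n = 3$ it is exactly the shopping bill above.

$$
total = \sum_{i=1}^{n} c_i \times x_i
$$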





# 🎽Exercise 1
Write a program that estimates the value of several laptops and prints each estimate on a new line. The input values are stored in `X` and the coefficients in `c`.

In [None]:
# input values for three laptops: processor speed (GHz), hard drive size (TB), weight, memory size (GB), processor generation
X = [
        [2.5, 0.5,  1.8,  8, 2], # Mid-range laptop
        [3.2, 1,    2.2, 16, 3], # High-end laptop
        [1.8, 0.25, 3,  4, 1]  # Budget laptop
    ]
c = [200, 50, -200, 40, 300]   # coefficient values

def predict(X, c):
    price = 0
    print(price)

predict(X, c)


0


If your output was
<pre>
1085.0
1790.0
232.5
</pre>
you got it!

## NumPy Makes It Easier! 

When working with numerical data in Python, **NumPy** is an essential tool. It provides a powerful array structure and optimized operations for handling large datasets efficiently. Instead of manually implementing matrix calculations, NumPy allows us to simplify these computations significantly.  

### What is NumPy?  

NumPy (**Numerical Python**) is a core library for scientific computing. It provides:  
- **Multidimensional arrays** for efficient data storage and manipulation  
- **Mathematical and statistical functions** for complex calculations  
- **Linear algebra utilities**, including matrix operations  
- **Random number generation** for simulations  
- **File I/O functions** to handle datasets easily  

With NumPy, we can efficiently handle large-scale computations that would otherwise be slow using basic Python lists.  

### Matrix Multiplication in NumPy  

NumPy offers various ways to perform matrix multiplication, an essential operation in linear regression:  

1. **`np.matmul(A, B)`** – Performs matrix multiplication  
2. **`A @ B`** – A shorthand for `matmul`, often used for readability  
3. **`np.dot(A, B)`** – Computes the dot product; for 1-D and 2-D arrays it gives the same result as `matmul`  

While `matmul` and `@` are used for general matrix multiplication, `dot` is specifically designed for dot products, and the two are **not always interchangeable** for higher-dimensional arrays.  
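
A short sketch of where the three spellings agree and where they diverge. The arrays are arbitrary illustrative values; only the shapes matter.

```python
import numpy as np

# Small illustrative 2-D arrays.
A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)

# For 2-D arrays all three spellings give the same matrix product.
same_2d = (np.array_equal(A @ B, np.matmul(A, B))
           and np.array_equal(A @ B, np.dot(A, B)))
print(same_2d)  # True

# For stacks of matrices (3-D arrays) the results differ in shape.
S = np.ones((2, 3, 4))
T = np.ones((2, 4, 5))
matmul_shape = np.matmul(S, T).shape  # multiplies matching pairs of matrices
dot_shape = np.dot(S, T).shape        # combines every matrix in S with every matrix in T
print(matmul_shape)  # (2, 3, 5)
print(dot_shape)     # (2, 3, 2, 5)
```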

### Simplifying the Regression Calculation  

Instead of manually summing up each feature’s contribution, we can use NumPy’s `@` operator or `matmul` to compute a prediction.



In [43]:
import numpy as np

# Feature vector: the five feature values of a single laptop (the mid-range one)
x = np.array([2.5, 0.5,  1.8,  8, 2])

# Coefficient values for each feature
c = np.array([200, 50, -200, 40, 300])

# Compute the prediction as the dot product of the two vectors
# c1x1 + c2x2 + ... + c5x5
prediction = x @ c  # Equivalent to np.matmul(x, c)

print(prediction)

1085.0


## The Power of Linear Algebra in NumPy  

One of the greatest advantages of **linear algebra**—the mathematical foundation of many machine learning techniques—is its ability to perform computations efficiently using arrays (also known as **vectors** and **matrices** in mathematical terms). Instead of performing calculations one by one, **linear algebra allows us to process multiple cases simultaneously**.  

### Applying Matrix Multiplication to Multiple Cases  

Instead of computing predictions individually for each set of input values, we can store all of them in a **two-dimensional NumPy array**. When we apply matrix multiplication (`@`), NumPy will automatically compute the predictions for all cases at once.

#### Example: Predicting Multiple Values at Once  

In [44]:
import numpy as np

# Feature matrix: Each row represents a case, and each column represents a feature
X = np.array([
        [2.5, 0.5,  1.8,  8, 2], # Mid-range laptop
        [3.2, 1,    2.2, 16, 3], # High-end laptop
        [1.8, 0.25, 3,  4, 1]  # Budget laptop
    ])
c = np.array([200, 50, -200, 40, 300])   # coefficient values

# Perform matrix multiplication to get predictions for all cases at once
# calculates c1x1 + c2x2 + ... + c5x5 for each row of X.
predictions = X @ c  # Equivalent to np.matmul(X, c)

print(predictions)

[1085.  1790.   232.5]



### Why is This Approach So Powerful?  

- **Efficiency** – Instead of using loops to compute each prediction separately, we compute all predictions in one step.  
- **Scalability** – Works seamlessly for thousands or even millions of cases.  
- **Cleaner Code** – Makes machine learning models more readable and maintainable.  

By leveraging NumPy’s array operations, we move closer to how real-world machine learning models handle vast datasets—quickly and efficiently. 
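
To make the comparison concrete, the sketch below computes the same three laptop prices twice: once with explicit loops and once with a single matrix-vector product. The data is the example from above, and the variable names are chosen for illustration.

```python
import numpy as np

X = np.array([
    [2.5, 0.5,  1.8,  8, 2],  # Mid-range laptop
    [3.2, 1,    2.2, 16, 3],  # High-end laptop
    [1.8, 0.25, 3,    4, 1],  # Budget laptop
])
c = np.array([200, 50, -200, 40, 300])

# Loop version: one laptop at a time, one feature at a time.
loop_predictions = []
for row in X:
    price = 0.0
    for x_i, c_i in zip(row, c):
        price += x_i * c_i
    loop_predictions.append(price)

# Vectorized version: a single matrix-vector product.
vectorized_predictions = X @ c

print(np.allclose(loop_predictions, vectorized_predictions))  # True
```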

---

# Where Linear Regression Shines  

So far, we have manually estimated laptop prices using predefined coefficients. However, in real-world scenarios, **we often don’t know these coefficients**—we need to **discover them from data**.  

Imagine we have a dataset with specifications (processor speed, RAM, storage, etc.) and the actual selling prices of multiple laptops. Our goal is to determine how much each feature contributes to the price. **Linear regression allows us to learn these relationships automatically.**  

## Least Squares Method: Finding the Best Coefficients  

The **Least Squares Method**, developed by **Legendre and Gauss** in the early 1800s, is the foundation of linear regression.  

It helps us find the **best possible coefficients** ($w_1, w_2, \dots, w_n$) so that our predicted prices ($\hat{y}$) match the real prices ($y$) as closely as possible. The model takes the form:  

$$
\hat{y} = w_1x_1 + w_2x_2 + ... + w_nx_n + b
$$

where:  
- $x_i$ are the laptop’s features (e.g., processor speed, memory).  
- $w_i$ are the coefficients we are trying to find.  
- $b$ is the **intercept** (base price when all features are 0).  
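
As a rough sketch of the formula above, the model can be evaluated for several cases at once with `X @ w + b`. The weights, intercept and inputs below are made-up numbers. The second half shows a common trick, assumed here rather than taken from the text: appending a column of ones to `X` lets the intercept be treated as one more coefficient.

```python
import numpy as np

# Illustrative values only: two cases with two features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([10.0, 20.0])  # made-up weights
b = 5.0                     # made-up intercept

# Direct form of the model: y_hat = w1*x1 + w2*x2 + b
y_hat = X @ w + b
print(y_hat.tolist())  # [55.0, 115.0]

# Same model with the intercept folded into the coefficient vector.
X_aug = np.column_stack([X, np.ones(len(X))])
w_aug = np.append(w, b)
print(np.allclose(X_aug @ w_aug, y_hat))  # True
```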


### **Why "Least Squares"?**  
The name comes from the way we measure how good our predictions are. The method **minimizes the sum of squared errors** between predicted and actual prices:  

$$
\sum_{i} (y_i - \hat{y}_i)^2
$$

where
- $\hat{y}_i$ is the predicted price of laptop $i$
- $y_i$ is the real price of laptop $i$

1. **Why do we square the errors?**  
   - Prevents negative and positive differences from canceling out.  
   - Penalizes large errors more heavily than small ones, so the fit avoids predictions that are far off.  

2. **Why not just use simple averages?**  
   - The least squares method doesn’t just compute averages: it **finds the best-fit line** (a plane or hyperplane when there are several features) that **minimizes the total squared error**.  
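
A tiny numeric illustration of the two reasons for squaring, using made-up prices: one prediction is too high and one too low by the same amount.

```python
import numpy as np

y = np.array([100.0, 200.0])      # made-up actual values
y_hat = np.array([110.0, 190.0])  # made-up predictions: one too high, one too low

errors = y - y_hat
plain_sum = errors.sum()            # positive and negative misses cancel out
squared_sum = (errors ** 2).sum()   # squaring keeps both misses visible
print(plain_sum)    # 0.0
print(squared_sum)  # 200.0

# One miss of 20 costs more than two misses of 10.
print(20.0 ** 2, 2 * 10.0 ** 2)  # 400.0 200.0
```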


## Why This Matters  

- **No need for manual tuning** – The model automatically learns the best coefficients.  
- **More accurate predictions** – Reduces errors by fitting the best possible line.  
- **Scales to large datasets** – Works efficiently with thousands of laptops.  



## Implementing Least Squares in Python  

Luckily, Python libraries such as **NumPy** and **scikit-learn** handle this easily. Instead of solving for coefficients manually, we can use simple functions to do the work for us.  



# 🎽Exercise 2: Finding the Best Set of Coefficients  

In this exercise, we will calculate the **squared error** for multiple sets of coefficient values and determine which set gives the best fit. This is a **simplified version of the least squares method**, where instead of finding the perfect coefficients, we test a fixed set of possible coefficient combinations and choose the one with the smallest error.  


### **Task**  
Write a program that:  
- Calculates the **squared error** for different coefficient sets  
- Identifies the **best coefficient set** (i.e., the one with the smallest squared error)  


### **Dataset: Laptop Prices**  

We will use a dataset where each row represents a **laptop** and each column represents a **feature**:  

| Feature | Description |
|---------|-------------|
| **Processor Speed (GHz)** | How fast the processor runs |
| **Hard Drive Size (GB)** | Total storage capacity |
| **Weight (kg)** | Laptop weight (lighter is often better) |
| **Memory (GB)** | Amount of RAM |
| **Processor Generation** | Age and performance tier of the CPU |

Each laptop also has a **real market price** (target variable $y$), which we will compare against our model’s predicted prices.  


In [None]:
import numpy as np

# Laptop dataset: Each row is a laptop, each column is a feature

# Processor Speed (GHz)
# Hard Drive Size (GB)
# Weight (kg)
# Memory (GB)
# Processor Generation

X = np.array([
    [2.5, 512, 1.8, 8, 2],   # Laptop 1
    [3.2, 1024, 2.2, 16, 3], # Laptop 2
    [1.8, 256, 1.5, 4, 1]     # Laptop 3
])

# Actual laptop prices
y = np.array([1200, 2200, 800])  # Market prices in dollars

# Possible coefficient values to evaluate
candidates = np.array([
    [10, 0.1, -5, 20, 50],  # Coefficient set 1
    [50, 0.2, -10, 50, 100],  # Coefficient set 2
    [100, 0.3, -20, 100, 150],  # Coefficient set 3
    [250, 0.4, -80, 150, 250],   # Coefficient set 4
    [400, 0.8, -150, 200, 350]   # Coefficient set 5
])

def squared_error(X, y, c):
    """Computes the sum of squared errors (SSE) for a given coefficient set."""
    # Compute predictions here.
    return 0 # Placeholder for now

# Find the best coefficient set
best_index = None
best_error = float('inf')

for i, c in enumerate(candidates):
    error = squared_error(X, y, c)
    print(f"Set {i+1}: Squared Error = {error:>15,.2f}\n")

    if error < best_error:
        best_error = error
        best_index = i

print(f"\nBest coefficient set: Set {best_index+1} with SSE = {best_error:>15,.2f}")


Set 1: Squared Error =            0.00

Set 2: Squared Error =            0.00

Set 3: Squared Error =            0.00

Set 4: Squared Error =            0.00

Set 5: Squared Error =            0.00


Best coefficient set: Set 1 with SSE =            0.00


If your output looks like this, you're correct. The `Predictions:` lines appear only if you also print the predictions inside `squared_error`.
<pre>
Predictions: [327.2 593.4 166.1]
Set 1: Squared Error =    3,744,772.61

Predictions: [ 809.4 1442.8  426.2]
Set 2: Squared Error =      865,646.64

Predictions: [1467.6 2633.2  776.8]
Set 3: Squared Error =      259,810.24

Predictions: [2385.8 4183.6 1282.4]
Set 4: Squared Error =    5,573,500.36

Predictions: [3439.6 6019.2 1849.8]
Set 5: Squared Error =   20,704,176.84


Best coefficient set: Set 3 with SSE =      259,810.24
</pre>

---

# Now it's time to find the coefficients that minimize the sum of squared errors (SSE) using the least squares method.

To do this we need a new NumPy function, `np.linalg.lstsq`. It takes the feature matrix `X` and the target values `y` as input and returns a tuple. The first element is the coefficient vector `c` that minimizes the sum of squared errors (SSE); the remaining elements are the residuals, the rank of `X`, and the singular values of `X`.



## linalg.lstsq(a, b, rcond=None)

The official documentation on numpy.org includes a worked example that fits a line to data points:

https://numpy.org/doc/2.1/reference/generated/numpy.linalg.lstsq.html
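
Before applying it to the laptop data, here is a minimal sketch of the call on toy data that lies exactly on the line `y = 2*x + 1`. The data and variable names are invented for illustration; the column of ones lets the fit include an intercept.

```python
import numpy as np

# Toy data on the line y = 2*x + 1; second column is all ones (intercept term).
X = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# lstsq returns four values; the first one holds the fitted coefficients.
solution, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(solution)  # close to [2, 1]: slope and intercept
print(rank)      # 2
```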




## Modeling and Estimating Laptop Prices from Specs

|                 | Proc. speed (GHz) | HDD (GB) | Weight (kg) | Memory (GB) | Proc. generation | Price in $ (output) |
|-----------------|-----------------:|---------:|-------:|------------:|-----------------:|--------------------:|
| Budget Student  | 1.5              |      250 |   2.2  |        8    |           2      |               460   |
| Office Worker   | 2.2              |      500 |   1.8  |       16    |           3      |               880   |
| Developer       | 2.8              |     1000 |   1.5  |       32    |           4      |             1,370   |
| Content Creator | 3.2              |     2000 |   1.3  |       64    |           5      |             2,010   |
| Gamer           | 3.6              |     4000 |   2.0  |       64    |           6      |             2,340   |


In [None]:
import numpy as np


X = np.array([
    [1.5,  250, 2.2,  8,  2],  # Budget Student
    [2.2,  500, 1.8, 16,  3],  # Office Worker
    [2.8, 1000, 1.5, 32,  4],  # Developer
    [3.2, 2000, 1.3, 64,  5],  # Content Creator
    [3.6, 4000, 2.0, 64,  6]   # Gamer
])

y = np.array([ 460, 880, 1370, 2010, 2340])

coefficients, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
np.set_printoptions(suppress=True, precision=2)  # suppress scientific notation, 2 decimal places
print('Coefficient estimates: ', coefficients)
print('Set of predicted prices for the five laptops: ', X @ coefficients)


Coefficient estimates:  [ 146.31    0.08 -100.      9.35  182.96]
Set of predicted prices for the five laptops:  [ 460.  880. 1370. 2010. 2340.]


- Let's first look at the coefficient estimates. The value `146.31` of the first coefficient means that each additional GHz of processor speed adds `$146.31` to the price.
- The value `-100` of the third coefficient means for each kg heavier the laptop is, the price of the laptop goes down `$100`, or conversely for each kg the laptop is lighter, you have to pay `$100` more.

The explanation for the unexpectedly perfect predictions is that we have five laptops and five coefficients. When the number of cases is less than or equal to the number of coefficients, and the rows of `X` are linearly independent, the model can always match the output values in the data exactly.
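
The effect is easy to see on a toy problem with two coefficients. This is a sketch with invented numbers: two cases can be matched exactly, but adding a third case that does not lie on the same line makes a perfect fit impossible.

```python
import numpy as np

# Two cases, two coefficients: the fit reproduces y exactly.
X_square = np.array([[1.0, 1.0],
                     [2.0, 1.0]])
y_square = np.array([1.0, 3.0])
c_square = np.linalg.lstsq(X_square, y_square, rcond=None)[0]
exact_square = np.allclose(X_square @ c_square, y_square)
print(exact_square)  # True

# Three cases, still two coefficients: no exact fit exists for these values.
X_tall = np.array([[1.0, 1.0],
                   [2.0, 1.0],
                   [3.0, 1.0]])
y_tall = np.array([1.0, 3.0, 4.0])
c_tall = np.linalg.lstsq(X_tall, y_tall, rcond=None)[0]
exact_tall = np.allclose(X_tall @ c_tall, y_tall)
print(exact_tall)  # False
```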

# 🎽Last Exercise: Fitting Coefficient Values with Least Squares
The output of the program should be the estimated coefficients and the predicted or "fitted" prices for the same set of laptops used to estimate the parameters. So if you fit the model using data for seven laptops with known prices, as in the matrix below, the program will print out the prices that the model predicts for those seven laptops (even if the actual prices are already given in the data).

Hint:  
`X = matrix[:, :-1]` → Selects all rows and all but the last column.  
`y = matrix[:, -1]` → Selects all rows and only the last column.
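
The two slices from the hint can be tried out on a tiny made-up matrix first. The names `m`, `features` and `target` are illustrative only.

```python
import numpy as np

# Tiny made-up matrix: two feature columns followed by a target column.
m = np.array([[1.0, 2.0, 10.0],
              [3.0, 4.0, 20.0]])

features = m[:, :-1]  # all rows, every column except the last
target = m[:, -1]     # all rows, last column only

print(features.shape)   # (2, 2)
print(target.tolist())  # [10.0, 20.0]
```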

In [None]:

import numpy as np


matrix = np.array([
    [1.5,  250, 2.2,  8,  2, 460],
    [2.2,  500, 1.8, 16,  3, 880],
    [2.8, 1000, 1.5, 32,  4, 1350],
    [3.2, 2000, 1.3, 64,  5, 2000],
    [3.6, 4000, 2.0, 64,  6, 2350],
    [2.2,  500, 1.3, 16,  4, 1000],
    [4.0, 4000, 4.0, 64,  6, 2250]
])
np.set_printoptions(precision=1)    # this just changes the output settings for easier reading

def fit_model(matrix):
    # Please write your code inside this function

    # fit the matrix that's passed. the values below are placeholder values
    c = np.asarray([])  # coefficients of the linear regression
    x = np.asarray([])  # input data to the linear regression

    print('Coefficient estimates: ', c)
    print('Set of predicted prices for the laptops: ', x @ c)

# simulate reading a file
fit_model(matrix)

If your output has the following form, you got it right: five coefficient estimates followed by seven fitted prices, one per row of `matrix`.
<pre>
Coefficient estimates:  [c1 c2 c3 c4 c5]
Set of predicted prices for the laptops:  [p1 p2 p3 p4 p5 p6 p7]
</pre>