# **Must-Know Concepts:**

### **1. Ordinary Least Squares (OLS)**:

OLS is a method for finding the best-fitting linear relationship between the input features (**X**) and the output/target variable (**y**) by **minimizing the sum of squared errors** (the individual errors are also called residuals):

> $\text{Loss (Cost)} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Where:
* $y_i$ is the actual target value.
* $\hat{y}_i = X_i w$ is the predicted value.
* $w$ are the weights (coefficients) we want to find.
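
As a small illustration of the loss above, the sketch below evaluates the sum of squared errors for two candidate weight vectors. The toy data and the helper names `predict` and `sse_loss` are made up for this example.

```python
def predict(X, w):
    """Return the prediction X_i . w for every row X_i."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]

def sse_loss(X, y, w):
    """Sum of squared errors between the targets y and the predictions X w."""
    return sum((y_i - y_hat_i) ** 2 for y_i, y_hat_i in zip(y, predict(X, w)))

# Toy data: the first column of ones plays the role of the intercept.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

loss_exact = sse_loss(X, y, [0.0, 2.0])  # y = 2x reproduces the data exactly
loss_off = sse_loss(X, y, [0.0, 1.0])    # y = x underpredicts every point
print(loss_exact, loss_off)              # 0.0 14.0
```

OLS looks for the weights that make this number as small as possible.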

---

### **2. Closed-Form Solution / Normal Equation:**

The **closed-form solution** to the OLS problem gives a direct formula to calculate the optimal weights:

> $$\hat{w} = (X^T X)^{-1} X^T y$$

This is called the **normal equation**. It is derived by setting the gradient of the OLS cost function to zero and solving for $w$, and it requires $X^T X$ to be invertible (no feature is an exact linear combination of the others).
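
A minimal NumPy sketch of the normal equation, assuming synthetic data generated here purely for illustration (the names `true_w`, `w_normal`, and `w_lstsq` are not from the notes). It solves the linear system $(X^T X) w = X^T y$ rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])           # intercept column + one feature
true_w = np.array([1.5, 2.0])                  # weights used to simulate y
y = X @ true_w + rng.normal(0, 0.5, size=n)    # linear signal plus noise

# Normal equation: solve (X^T X) w = X^T y for w.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's general least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(np.allclose(w_normal, w_lstsq))
```

Both routes should agree up to floating-point error, and at the solution the gradient term $X^T (y - Xw)$ is numerically zero.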

---

### **3. Limitations of Closed-Form Solution:**

* **Computationally expensive** for large feature sets: inverting the $p \times p$ matrix $X^T X$ costs roughly $O(p^3)$ operations for $p$ features.

* Often replaced by **`iterative methods`** like **`gradient descent`** when $X$ is large or sparse.
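
For comparison, here is a rough sketch of batch gradient descent on the same kind of problem. The function name `gradient_descent_ols`, the learning rate, and the iteration count are illustrative assumptions, and the loss is divided by $n$ (mean instead of sum), which changes the step size but not the minimizer.

```python
import numpy as np

def gradient_descent_ols(X, y, lr=0.1, n_iters=2000):
    """Minimize the mean squared error with plain batch gradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        residuals = y - X @ w
        grad = -2.0 / n * (X.T @ residuals)   # gradient of mean((y - Xw)^2)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0, 1, size=n)                  # feature already on a unit scale
X = np.column_stack([np.ones(n), x])
y = X @ np.array([0.5, -1.0]) + rng.normal(0, 0.1, size=n)

w_gd = gradient_descent_ols(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form answer for reference
print(w_gd, w_exact)
```

No matrix is inverted here; each iteration only needs matrix-vector products. The feature is kept on a unit scale because badly scaled features typically force a much smaller learning rate.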


---

### **4. RSS (Residual Sum of Squares) in Regression:**

In **regression analysis**, especially **linear regression**, the **Residual Sum of Squares (RSS)** is a measure of **how well the regression line fits the data**. It tells us **how much error** there is between the actual data points and the predicted values from the model.

1. **You have data points** — real observations with inputs (like house size) and outputs (like house price).

2. **You create a model** — say, a straight line to predict price based on size.

3. **You predict** the outputs using your model — these are the **predicted values**.

4. **You compare** these predicted values with the actual values — the **difference** is called the **residual** (or error).

**What is a Residual?**

A **residual** is the difference between the **actual value** and the **predicted value**:

> $$\text{Residual} = \text{Actual value} - \text{Predicted value}$$

Sometimes residuals are positive (the model underpredicted), and sometimes negative (the model overpredicted).

**Why Square the Residuals?**

* If we just added up the residuals, positive and negative errors might cancel each other out.

* So, we **square** each residual to make all values positive and emphasize larger errors more.

* This gives us the **squared residuals**.
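
The cancellation problem is easy to see with a handful of made-up numbers (the values below are hypothetical, chosen so that every prediction is off by exactly 2):

```python
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [12.0, 10.0, 16.0, 14.0]   # each prediction misses by 2

residuals = [a - p for a, p in zip(actual, predicted)]

sum_residuals = sum(residuals)                 # positive and negative errors cancel
sum_squared = sum(r ** 2 for r in residuals)   # squared errors accumulate

print(residuals)       # [-2.0, 2.0, -2.0, 2.0]
print(sum_residuals)   # 0.0
print(sum_squared)     # 16.0
```

The plain sum suggests a perfect model, while the sum of squares reflects that every prediction was wrong.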

**What is RSS?**

The **Residual Sum of Squares (RSS)** is the **sum of all squared residuals**:

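> $$\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for observation $i$, so this is the same quantity as the OLS loss in Section 1.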

In words:

> "RSS measures the **total squared difference** between what the model predicts and what the actual data values are."

**Why is RSS Important?**

* **Lower RSS** means the model fits the data **better** (less error).
* **Higher RSS** means the model fits the data **worse** (more error).
* It’s the **key quantity** that **OLS (Ordinary Least Squares)** tries to **minimize** when finding the best-fitting line.

Imagine you're trying to fit a line through a scatterplot of points. From each point, draw a vertical line to the regression line — that’s the **residual**. The RSS is the **sum of the squares of all those vertical lines**.

---

### **5. What Is Likelihood?**

In **regression analysis**, especially when using **`probabilistic or statistical models`** like **`linear regression`** or **`logistic regression`**, the concept of **`log-likelihood`** plays a key role in estimating the model parameters.

Let’s say you build a model (like a line in linear regression), and you want to know:

**How likely is it that this model would produce the data you actually observed?**

This "likelihood" is a number that tells you how *`probable`* your data is, **given** your model’s parameters.

**What Is the Likelihood Function?**

The **likelihood function** is a mathematical function that:

* Takes the model’s parameters as input (like the weights in a regression line),
* And tells you how likely the **`observed data`** is under that model.

In regression, this is often based on the assumption that errors (residuals) follow a **`normal distribution`** — meaning most predictions are close to the true value, and big errors are rare.

**Why Use the `Log` of the Likelihood?**

Working with raw likelihood values directly can be tricky because:

* The likelihood often involves multiplying many small probabilities together, which can become **very tiny numbers** (leading to **underflow**).

* Taking the **logarithm** (log) of the likelihood turns the multiplication into **addition**, which is easier to work with mathematically.

This gives us the **log-likelihood function**, which is just the logarithm of the likelihood function.
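
A short sketch of the underflow issue, using only the standard library. It assumes a standard normal error model and 2,000 hypothetical residuals that all equal 1.0; the helper `normal_pdf` is written here for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# 2,000 hypothetical residuals under an assumed N(0, 1) error model.
residuals = [1.0] * 2000

# Raw likelihood: a product of many values below 1.
likelihood = 1.0
for r in residuals:
    likelihood *= normal_pdf(r, 0.0, 1.0)

# Log-likelihood: the same information as a sum of logs.
log_likelihood = sum(math.log(normal_pdf(r, 0.0, 1.0)) for r in residuals)

print(likelihood)       # 0.0 (the product underflows)
print(log_likelihood)   # about -2837.88
```

The product collapses to exactly `0.0` in double precision, so it can no longer distinguish between parameter settings, whereas the log-likelihood stays a perfectly ordinary number.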

**What Is the Log-Likelihood Function?**

The **log-likelihood function** measures how well a model explains the observed data, **`in logarithmic form`**.

It’s a function of the model’s parameters — so we can use it to **find the best parameters** by **maximizing** the log-likelihood.

This approach is called **Maximum Likelihood Estimation (MLE)**.

**Key Use:**

In regression (especially logistic regression), we often **`maximize the log-likelihood`** to find the **`best-fitting parameters`**, just like we minimize RSS in OLS linear regression.
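
One way to make the parallel with OLS concrete, assuming the errors are independent and normally distributed with mean zero and constant variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here for the derivation), is to write out the log-likelihood of a linear model:

> $$\ln L(w, \sigma^2) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i w)^2$$

The first term does not depend on $w$, and the second is $-\text{RSS}/(2\sigma^2)$. Under this assumption, the $w$ that maximizes the log-likelihood is therefore the same $w$ that minimizes RSS.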

---

### **6. AIC and BIC:**

In **regression analysis**, especially when comparing different models, it’s important not only to find one that fits the data well, but also to **`avoid overfitting`**. That’s where **AIC** and **BIC** come in.

* **`AIC`** stands for **`Akaike Information Criterion`**.

* **`BIC`** stands for **`Bayesian Information Criterion`**.

Both are tools to help you **compare models** and **choose the best one** by balancing two things:

1. **Goodness of fit** – how well the model explains the data.

2. **Model complexity** – how many parameters (e.g., features, coefficients) the model uses.

**`The key idea:`** A model that fits really well *but is very complex* might be overfitting. `AIC` and `BIC` try to prevent that.

**Why We Need Them:**

Imagine you’re comparing two regression models:

* Model A has 2 parameters.
* Model B has 10 parameters.

Model B might fit the data **better**, but it also might be **memorizing** the data rather than **generalizing**. AIC and BIC help decide **if the better fit is worth the extra complexity**.

**The Basic Formula:**

Both AIC and BIC are based on the **`log-likelihood`** (how likely the model is to produce the observed data), and both add a **`penalty`** for the number of parameters.

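> $$\text{AIC} = -2 \ln \hat{L} + 2k$$

> $$\text{BIC} = -2 \ln \hat{L} + k \ln(n)$$

Where:
* $\ln \hat{L}$ is the **`log-likelihood`** of the fitted model (evaluated at the estimated parameters); a higher value means a better fit, which lowers both scores.
* $k$ is the **`number of parameters`**; each extra parameter raises the score, penalizing complexity.
* $n$ is the number of observations (used by BIC only). Because $\ln(n) > 2$ once $n \geq 8$, BIC penalizes each parameter more heavily than AIC on all but the smallest datasets.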
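
The two formulas can be sketched in a few lines. This example assumes a linear model with normally distributed errors, for which the maximized log-likelihood can be written in terms of RSS; the numbers (`n = 100`, `rss = 250`, `k = 3`) are hypothetical, and whether the error variance is counted in the parameter count varies between textbooks and libraries.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a linear model with normal errors,
    using the maximum-likelihood variance estimate sigma^2 = RSS / n."""
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def aic(log_lik, k):
    """Akaike Information Criterion: -2 log-likelihood + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion: -2 log-likelihood + ln(n) k."""
    return -2.0 * log_lik + math.log(n) * k

# Hypothetical fit: 100 observations, RSS of 250, 3 estimated parameters.
n, rss, k = 100, 250.0, 3
log_lik = gaussian_log_likelihood(rss, n)
model_aic = aic(log_lik, k)
model_bic = bic(log_lik, k, n)
print(model_aic, model_bic)
```

For the same model, the two scores differ only by the penalty term, so BIC exceeds AIC whenever $\ln(n) > 2$.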

**What Do the Numbers Mean?**

* **Lower AIC or BIC = Better model**

* These numbers are **`relative`**, not absolute. You use them to **`compare models`**, not to say whether one model is “good” in isolation.

**AIC vs. BIC:**

* **AIC** is more forgiving — it allows slightly more complex models if they improve fit.

* **BIC** is stricter — it favors simpler models unless the complex one offers a *much* better fit.

**Example:**

Imagine you're fitting polynomial regression (curved lines) with degrees 1 to 5:

| Model (degree)           | AIC | BIC |
| ------------------------ | --- | --- |
| Degree 1 (straight line) | 150 | 155 |
| Degree 2                 | 140 | 148 |
| Degree 3                 | 135 | 145 |
| Degree 4                 | 134 | 147 |
| Degree 5                 | 133 | 150 |

* AIC selects degree 5 (lowest AIC, 133).
* BIC selects degree 3 (lowest BIC, 145): under its stronger penalty, the small improvement in fit from degrees 4 and 5 is not worth the extra parameters.

* Use **AIC** when your goal is **prediction accuracy** (e.g., in machine learning).

* Use **BIC** when you're looking for the **true underlying model** (e.g., in scientific modeling).

* Both are great tools when you're **comparing different models** with different numbers of parameters.
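
Model selection with these criteria amounts to picking the lowest score. The snippet below copies the illustrative AIC and BIC values from the table above into dictionaries (the variable names are made up for this sketch):

```python
# Scores from the example table, keyed by polynomial degree.
aic_by_degree = {1: 150, 2: 140, 3: 135, 4: 134, 5: 133}
bic_by_degree = {1: 155, 2: 148, 3: 145, 4: 147, 5: 150}

# The preferred model under each criterion is the one with the lowest score.
best_by_aic = min(aic_by_degree, key=aic_by_degree.get)
best_by_bic = min(bic_by_degree, key=bic_by_degree.get)

print(best_by_aic, best_by_bic)   # 5 3
```

The two criteria disagree here because BIC charges more for each additional parameter.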

---

### **7. Distribution of Residuals:**

A **residual** is the **error** or **difference** between:

* The **actual value** (what really happened), and
* The **predicted value** (what the regression model says should happen).

> $$\text{Residual} = \text{Actual Value} - \text{Predicted Value}$$

So if your model predicts that a house price is \$300,000 but the real price is \$320,000, the residual is:

> $$\text{Residual} = 320{,}000 - 300{,}000 = +20{,}000$$

**What Is the Residuals Distribution?**

When you make predictions for **many data points**, you'll get **many residuals** — one for each prediction.

The **residuals distribution** is the overall **pattern** or **spread** of these residuals.

In a **good regression model**, the residuals should behave in a certain way — they should look like **random noise** around zero.

**Why Is the Residuals Distribution Important?**

Studying the residuals helps us know if our model is:

* Making consistent errors
* Missing patterns in the data
* Violating basic regression assumptions

**Properties of a Good Residuals Distribution:**

When your regression model is **well-fitted and valid**, the residuals should have the following **ideal properties**:

1. **The Mean of Residuals Is Zero**
   * On average, the residuals should cancel out.
   * Some errors will be positive (underprediction), and some will be negative (overprediction).
   * The average of all residuals should be **close to 0**.

2. **Residuals Are Randomly Scattered**
   * When you plot residuals against predicted values or inputs, they should appear as **a random cloud**, not a pattern.
   * If you see patterns (like curves or funnels), it means the model **missed something**, such as a missing variable or a nonlinear relationship.

3. **Constant Variance (Homoscedasticity)**
   * The **spread of residuals** should stay about the same for all levels of the predicted values.
   * If the residuals get wider or narrower (a "funnel shape"), it’s called **heteroscedasticity**. The coefficient estimates can still be reasonable, but the usual standard errors, confidence intervals, and hypothesis tests become unreliable.

4. **Residuals Are Normally Distributed (Optional but Helpful)**
   * If you're using **linear regression** and want to do **inference** (like confidence intervals or hypothesis tests), it helps if residuals follow a **normal (bell-shaped) distribution**.
   * This doesn’t have to be perfect, but large deviations from normality might affect your conclusions.

5. **No Autocorrelation in Residuals**
   * Especially in **time series** or sequential data, residuals shouldn’t be correlated with one another.
   * If they are, it means there's some pattern the model didn't capture.

> Residuals are like your model’s “mistakes” — and **how those mistakes behave tells you a lot** about whether your model is smart or just guessing wrong in a patterned way.
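
Several of these properties can be checked numerically. The sketch below is a rough illustration on synthetic data that was deliberately generated to satisfy the assumptions (a linear signal plus constant-variance normal noise); the specific summary statistics are one possible choice, not a standard test suite, and the normality check (property 4) is left out.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x = np.sort(rng.uniform(0, 10, size=n))
X = np.column_stack([np.ones(n), x])             # intercept + one feature
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=n)   # linear signal, constant-variance noise

w, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ w
residuals = y - fitted

# Property 1: the mean of the residuals is close to zero.
mean_resid = residuals.mean()

# Property 2: no linear trend is left between fitted values and residuals.
corr_with_fitted = np.corrcoef(fitted, residuals)[0, 1]

# Property 3: the spread is similar for low and high x (ratio near 1).
half = n // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

# Property 5: neighbouring residuals are roughly uncorrelated (near 0).
lag1_autocorr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(mean_resid, corr_with_fitted, spread_ratio, lag1_autocorr)
```

With an intercept in the model, OLS makes the first two quantities zero on the training data up to rounding, so they cannot reveal a curved pattern on their own; a residual plot is still the more informative check for property 2.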

---