# Supervised Learning

## Key Concepts in Linear Regression

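* **Definition:** Given a training set of input-output pairs $(x^{(i)}, y^{(i)})$, learn a function $h: X \to Y$ such that $h(x)$ is a good predictor of the corresponding $y$.
* **Hypothesis:** The learned function $h(\cdot)$.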
* **Regression:** Predicting a continuous target variable (e.g., house price).
* **Classification:** Predicting a discrete target variable (e.g., house vs. apartment).

## Hypothesis Representation

* **Linear Function:**  $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_nx_n$
    * $\theta_i$: Parameters/weights
    * $x_i$: Input features
    * $n$: Number of input variables (excluding $x_0$)
* **Vectorized Form:** With the convention $x_0 = 1$ for the intercept term, $h_\theta(x) = \sum_{i=0}^{n} \theta_ix_i = \theta^Tx$, where $\theta$ and $x$ are vectors in $\mathbb{R}^{n+1}$.
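The vectorized hypothesis can be sketched in a few lines of NumPy. The helper names `add_intercept` and `predict` and the numeric values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones (x_0 = 1) to an (m, n) feature matrix."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(theta, X):
    """Evaluate h_theta(x) = theta^T x for every row of X.

    X is expected to already contain the intercept column.
    """
    return X @ theta

# Illustrative values only: two examples with n = 2 features each.
X = add_intercept([[2.0, 3.0], [1.0, 0.0]])
theta = np.array([1.0, 0.5, -1.0])  # theta_0, theta_1, theta_2
predictions = predict(theta, X)
# First example: 1 + 0.5*2 - 1*3 = -1.0; second: 1 + 0.5*1 - 1*0 = 1.5
```

Storing the examples as rows of a matrix lets a single matrix-vector product compute the hypothesis for the whole dataset at once.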

## Cost Function

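Measures the error between predicted and actual values. The least-squares cost below is half the sum of squared errors; the mean squared error differs from it only by a constant factor, so both have the same minimizer.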

* **Least Squares:** $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
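    * The closer the predictions are to the actual values, the smaller $J(\theta)$; the factor $\frac{1}{2}$ cancels the 2 that appears when the square is differentiated.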
    * $m$: Number of training examples
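As a rough illustration of the least-squares cost, the sketch below evaluates $J(\theta)$ with NumPy. The function name `least_squares_cost` and the tiny dataset are assumptions made for the example.

```python
import numpy as np

def least_squares_cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2.

    X is an (m, n+1) matrix whose first column is all ones.
    """
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# Illustrative data generated by y = 2 * x_1 (no noise), m = 3 examples.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_perfect = least_squares_cost(np.array([0.0, 2.0]), X, y)  # exact fit: 0.0
cost_zero = least_squares_cost(np.array([0.0, 0.0]), X, y)     # 0.5 * (4 + 16 + 36) = 28.0
```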

## Gradient Descent

Iterative algorithm to minimize the cost function and find optimal $\theta$ values.

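* **Goal:** Start from an initial guess for $\theta$ and repeatedly move it in the direction that decreases $J(\theta)$ fastest.
* **General Update Rule:** $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$, applied simultaneously for all $j = 0, \dots, n$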
    * $\alpha$: Learning rate
* **Partial Derivative:** $\frac{\partial}{\partial \theta_j}J(\theta) = \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$
* **LMS Update Rule (Single Training Example):** $\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$
    * Also known as Widrow-Hoff learning rule.
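    * The size of each update is proportional to the error term $y^{(i)} - h_\theta(x^{(i)})$: a small error changes $\theta$ little, a large error changes it more.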
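The partial derivative and the single-example LMS step above might be written as follows. This is a minimal sketch assuming NumPy; the names `gradient` and `lms_update` and the toy data are hypothetical.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta): component j is sum_i (h(x^(i)) - y^(i)) * x_j^(i)."""
    return X.T @ (X @ theta - y)

def lms_update(theta, x_i, y_i, alpha):
    """One LMS (Widrow-Hoff) step using a single training example."""
    error = y_i - x_i @ theta          # y^(i) - h_theta(x^(i))
    return theta + alpha * error * x_i

# Illustrative data: first column is the intercept feature x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)

grad = gradient(theta, X, y)                         # [-12.0, -28.0]
theta_next = lms_update(theta, X[0], y[0], alpha=0.1)  # [0.2, 0.2]
```

The sign change between the general rule (minus the gradient) and the LMS rule (plus the error term) comes from writing the error as $y^{(i)} - h_\theta(x^{(i)})$ instead of $h_\theta(x^{(i)}) - y^{(i)}$.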

## Gradient Descent Variants

* **Batch Gradient Descent:**
    * Updates $\theta$ after looking at all training examples.
    * $\theta_j := \theta_j + \alpha \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$ (for all $j$)
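    * For linear regression, $J(\theta)$ is a convex quadratic function with no local minima other than the global one, so batch gradient descent converges to the global minimum provided $\alpha$ is not too large.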
* **Stochastic Gradient Descent:**
    * Updates $\theta$ after each training example.
    * $\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$ (for all $j$, for each $i$)
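    * Makes progress after every single example, so on large datasets it often gets close to the minimum much sooner than batch gradient descent.
    * With a fixed $\alpha$ it may keep oscillating around the minimum instead of settling; slowly decreasing $\alpha$ toward zero lets it converge.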
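The two variants can be compared side by side in a short NumPy sketch. The function names, step sizes, iteration counts, decay schedule, and the three-point dataset are all assumptions chosen for illustration, not values from the original notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Each update uses the error summed over all m training examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta + alpha * X.T @ (y - X @ theta)
    return theta

def stochastic_gradient_descent(X, y, alpha0=0.1, n_epochs=500, seed=0):
    """Each update uses one example; alpha decays slowly (assumed schedule)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):      # visit examples in random order
            alpha = alpha0 / (1.0 + 0.001 * step)
            theta = theta + alpha * (y[i] - X[i] @ theta) * X[i]
            step += 1
    return theta

# Illustrative data: x_1 in {-1, 0, 1}, targets roughly 1 + 2 * x_1.
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.1, 1.2, 2.9])

theta_batch = batch_gradient_descent(X, y)    # least-squares fit, about [1.0, 2.0]
theta_sgd = stochastic_gradient_descent(X, y) # close to the same point
```

On this toy problem both runs end near the least-squares solution; the stochastic result is only approximate because each step follows a single example's error.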

## Linear Regression Assumptions

1. **Linearity**: The relationship between the independent variables (predictors) and the dependent variable is linear.

2. **Independence**: Observations are independent of each other, meaning the errors (residuals) are not correlated across observations.

3. **Homoscedasticity**: The variance of the residuals is constant across all levels of the independent variables.

4. **Normality of Residuals**: The residuals (differences between observed and predicted values) should be normally distributed.

5. **No Multicollinearity**: Independent variables are not highly correlated with each other.

6. **No Autocorrelation**: The residuals are not correlated with one another over time. This is the form the independence assumption takes for time series data.
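Some of these assumptions can be checked informally from the residuals of a fitted model. The sketch below is one rough way to do that, assuming NumPy; the function name `residual_diagnostics`, the synthetic data, and the choice of lag-1 correlation as an autocorrelation check are illustrative assumptions rather than a standard procedure from the notes.

```python
import numpy as np

def residual_diagnostics(X, y):
    """Fit least squares and return simple diagnostic quantities.

    X: (m, n) predictor matrix WITHOUT the intercept column.
    Returns (residuals, lag-1 residual correlation, predictor correlation matrix).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ theta
    # Correlation between consecutive residuals: near 0 suggests little autocorrelation.
    lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
    # Large off-diagonal entries hint at multicollinearity among predictors.
    predictor_corr = np.corrcoef(X, rowvar=False)
    return residuals, lag1, predictor_corr

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)

residuals, lag1, predictor_corr = residual_diagnostics(X, y)
```

Plotting `residuals` against the fitted values or against each predictor is a common next step for judging linearity and homoscedasticity by eye.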
