$\newcommand{\trinom}[3]{\begin{pmatrix} #1 \\ #2 \\ #3 \end{pmatrix}}$

# **Vector concepts**
### Vector
  * Norm of vector $||\vec{v}||$: Magnitude of the vector (typically its length).
    * L1 norm $||\vec{v}||_1$: The Manhattan distance from the origin, $|x_1| + |x_2| + ... + |x_n|$
    * L2 norm $||\vec{v}||_2$: The Euclidean distance from the origin, $\sqrt{x_1^2 + x_2^2 + ... + x_n^2}$
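
As a quick check of the two norms, here is a minimal sketch in plain Python. The helper names `l1_norm` and `l2_norm` and the sample vector are illustrative choices, not part of these notes; the L1 norm is taken as the sum of absolute component values.

```python
import math

def l1_norm(v):
    """Sum of the absolute values of the components (Manhattan length)."""
    return sum(abs(x) for x in v)

def l2_norm(v):
    """Square root of the sum of squared components (Euclidean length)."""
    return math.sqrt(sum(x * x for x in v))

v = [3.0, -4.0]        # hypothetical example vector
print(l1_norm(v))      # 7.0
print(l2_norm(v))      # 5.0
```
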
### Angle between vectors
  * Gives us an idea of the similarity between two vector embeddings
    * Smaller angle => The two vectors are more similar
  * Typically determined using the dot product. See [dot product formula](#dot_product_formula).
  * Two non-zero vectors $\vec{v_1}$ and $\vec{v_2}$ are said to be orthogonal/perpendicular to each other when their dot product is 0, i.e. $\vec{v_1} \cdot \vec{v_2} = 0$
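
The angle can be recovered from the dot product by rearranging the dot product formula to $\cos\theta = \vec{v_1} \cdot \vec{v_2} / (||\vec{v_1}|| \, ||\vec{v_2}||)$. A minimal sketch, assuming non-zero vectors; the helper names `dot` and `angle_between` and the sample vectors are illustrative:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

v1 = [1.0, 0.0]
v2 = [0.0, 2.0]
v3 = [1.0, 1.0]

print(dot(v1, v2))                          # 0.0, so v1 and v2 are orthogonal
print(math.degrees(angle_between(v1, v2)))  # about 90 degrees
print(math.degrees(angle_between(v1, v3)))  # about 45 degrees: v3 is more similar to v1
```
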
### Distance
  * Distance between point and decision hyperplane
    * Gives us a measure of confidence of classification of the point by the hyperplane.
      * Greater distance ~= More confidence
      * Lesser distance ~= Less confidence
    * Always use the [distance formula](#point_distance_formula) to measure the distance of a point from the decision hyperplane.
  * Distance between two parallel hyperplanes
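
To make the confidence reading of the point-to-hyperplane distance concrete, here is a minimal sketch of the signed distance $(\vec{w} \cdot x + w_0) / ||\vec{w}||$. The function name `signed_distance`, the hyperplane and the points are hypothetical examples:

```python
import math

def signed_distance(w, w0, x):
    """Signed distance of point x from the hyperplane w.x + w0 = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w

w, w0 = [3.0, 4.0], -5.0   # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0

for point in ([3.0, 4.0], [1.0, 1.0], [0.0, 0.0]):
    d = signed_distance(w, w0, point)
    side = "positive" if d > 0 else "negative" if d < 0 else "on the hyperplane"
    print(point, d, side)
```

Here `[3.0, 4.0]` lies 4 units into the positive half-space, `[1.0, 1.0]` only about 0.4 units, so the first point is classified with more confidence; the origin lies 1 unit into the negative half-space.
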
### Hyperplanes
  * The bounding planes that separate the given space into two half-spaces: the positive half-space and the negative half-space.
#### Weight vector
  * The vector that is orthogonal to the given decision hyperplane.
#### Bias
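  * The weight vector determines the orientation of the hyperplane, while the bias geometrically determines the distance of the hyperplane from the origin: for $w^Tx + b = 0$, the signed distance of the origin is $b / ||w||$.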
  * Three different interpretations of bias:
    * **Geometric:** ***Weights rotate*** the hyperplane; ***bias translates*** it. Without a bias the hyperplane is forced to pass through the origin, so removing it amounts to saying: “The origin itself has semantic meaning”, which is almost never true in real data.
    * **Mathematical/Algebraic:** Bias represents one more degree of freedom/dimension.  
    Rewrite the classifier like this:  
    $w^Tx + b = \binom{w}{b}^T \binom{x}{1} = [w^T \; b]\binom{x}{1}$.  

      Now the bias is just a weight on a constant feature that is always 1.
    * **Logical:** Prior or default decision.  
    “If there were no signal at all, what would I predict?”  
    In the no-signal case, $x = 0$, so $w^Tx + b = b$.  
    So:  
    b > 0 => Default towards +ve classification  
    b < 0 => Default towards -ve classification  
    Bias encodes the classifier’s baseline belief before seeing any evidence.
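
The algebraic and logical readings of the bias can be checked numerically. This is a minimal sketch with hypothetical weights, bias and input; the names `w_aug` and `x_aug` are illustrative labels for $\binom{w}{b}$ and $\binom{x}{1}$:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

w, b = [2.0, -1.0], 0.5      # hypothetical weights and bias
x = [1.0, 3.0]               # hypothetical input

score = dot(w, x) + b        # w^T x + b

w_aug = w + [b]              # (w, b): the bias appended as one more weight
x_aug = x + [1.0]            # (x, 1): a constant feature appended to the input
score_aug = dot(w_aug, x_aug)

print(score, score_aug)      # -0.5 -0.5

# Logical reading: with no signal (x = 0) the score is the bias alone.
no_signal = dot(w, [0.0, 0.0]) + b
print(no_signal)             # 0.5, i.e. the default leans towards the +ve class
```
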

# **Important formulae**

* <a id="dot_product_formula">Dot product:</a> $\vec{v_1} \cdot \vec{v_2} = ||\vec{v_1}|| \, ||\vec{v_2}|| \cos\theta$
  * Use this to determine many things:
    * Angle between two vectors
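    * Projection of one vector on the other: the scalar projection of $\vec{v_1}$ on $\vec{v_2}$ is $\vec{v_1} \cdot \vec{v_2} / ||\vec{v_2}||$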
* <a id="point_distance_formula">Distance of point</a> $x_1$ from hyperplane $\vec{w} \cdot x + w_0 = 0$ is: $(\vec{w} \cdot x_1 + w_0) / ||\vec{w}||$. This is a signed distance:
  * If distance is +ve, the point $x_1$ is in the positive half-space of the decision hyperplane
  * If distance is -ve, the point is in the negative half-space of the decision hyperplane
* Distance of origin from hyperplane $\vec{w} \cdot x + w_0 = 0$ is: $w_0 / ||\vec{w}||$ (the point formula with $x_1 = 0$; also signed)
* Distance between hyperplanes
  * Two hyperplanes $\vec{w} \cdot x + w_0 = 0$ and $\vec{w'} \cdot x + w'_0 = 0$ are parallel only if one weight vector is a scalar multiple of the other, i.e. after rescaling, both can be expressed as $ax_1 + bx_2 + ... + c_1 = 0$ and $ax_1 + bx_2 + ... + c_2 = 0$
  * For such hyperplanes, distance between them = $|c_2 - c_1| / \sqrt{a^2 + b^2 + ...}$
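
A minimal sketch of the parallel-hyperplane distance, assuming both hyperplanes have already been rescaled to share the same coefficients. The function name `parallel_distance` and the two planes are hypothetical examples:

```python
import math

def parallel_distance(coeffs, c1, c2):
    """Distance between coeffs.x + c1 = 0 and coeffs.x + c2 = 0."""
    return abs(c2 - c1) / math.sqrt(sum(a * a for a in coeffs))

# Hypothetical planes: 3*x1 + 4*x2 + 5 = 0 and 6*x1 + 8*x2 - 10 = 0.
# Dividing the second by 2 gives 3*x1 + 4*x2 - 5 = 0, so the coefficients match.
coeffs, c1, c2 = [3.0, 4.0], 5.0, -5.0
print(parallel_distance(coeffs, c1, c2))   # 2.0

# Cross-check: the signed origin distances are c1/||w|| = 1.0 and c2/||w|| = -1.0,
# so the origin sits between the planes and the gap is 1.0 - (-1.0) = 2.0.
norm = math.sqrt(sum(a * a for a in coeffs))
print(abs(c1 / norm - c2 / norm))          # 2.0
```
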

# **Classification concepts**

### Gain function
$G(X, \vec{w}, w_0) = \sum_{i=1}^{n} \frac{(\vec{w}^Tx_i + w_0) \, y_i}{||\vec{w}||}$

$\vec{w^*}, w_0^* = \arg\max_{\vec{w}, w_0} G(X, \vec{w}, w_0) = \arg\max_{\vec{w}, w_0} \sum_{i=1}^{n} \frac{(\vec{w}^Tx_i + w_0) \, y_i}{||\vec{w}||}$
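
Each term of the gain is the signed distance of $x_i$ from the hyperplane multiplied by its label, so correctly classified points add to the gain and misclassified points subtract from it. A minimal sketch, assuming labels $y_i \in \{+1, -1\}$ (an assumption, the notes do not state the label encoding) and a hypothetical toy dataset:

```python
import math

def gain(X, y, w, w0):
    """Sum over samples of y_i * (w.x_i + w0) / ||w||."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return sum(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w
        for x, yi in zip(X, y)
    )

# Hypothetical toy data; labels are assumed to be +1 or -1.
X = [[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

print(gain(X, y, [1.0, 0.0], 0.0))    # 8.0: every point is on its correct side
print(gain(X, y, [-1.0, 0.0], 0.0))   # -8.0: every point is on the wrong side
```
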

### Loss function
$L(X, \vec{w}, w_0) = -G(X, \vec{w}, w_0)$

$\vec{w^*}, w_0^* = \arg\min_{\vec{w}, w_0} L(X, \vec{w}, w_0) = \arg\min_{\vec{w}, w_0} -\left( \sum_{i=1}^{n} \frac{(\vec{w}^Tx_i + w_0) \, y_i}{||\vec{w}||} \right)$
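
Because the loss is the negated gain, minimizing the loss picks the same parameters as maximizing the gain. The sketch below illustrates this over a small hand-picked candidate set; it is not an optimizer, and the data, the $\pm 1$ label encoding and the candidate list are all hypothetical:

```python
import math

def gain(X, y, w, w0):
    """Sum over samples of y_i * (w.x_i + w0) / ||w||."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return sum(
        yi * (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w
        for x, yi in zip(X, y)
    )

def loss(X, y, w, w0):
    """Loss defined as the negated gain."""
    return -gain(X, y, w, w0)

# Hypothetical toy data; labels are assumed to be +1 or -1.
X = [[2.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Hypothetical candidate (w, w0) pairs to compare.
candidates = [
    ([1.0, 0.0], 0.0),
    ([0.0, 1.0], 0.0),
    ([1.0, 1.0], 0.0),
    ([-1.0, 0.0], 0.0),
]

best_by_loss = min(candidates, key=lambda c: loss(X, y, c[0], c[1]))
best_by_gain = max(candidates, key=lambda c: gain(X, y, c[0], c[1]))

print(best_by_loss)                   # ([1.0, 0.0], 0.0)
print(best_by_loss == best_by_gain)   # True
```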