# Chapter 8 - The Inner Product

## norm of vector "length"

The concept is defined by property:
- non negative
- 0 only for zero vector
- close in scalar multiplication
- |u + v| ≤ |u| + |v|

### inner product
Define an operation inner product for vector: <u, v>; this operation _could be different_ on different field.

In $\mathbb{R}$, the inner product is defined as dot-product, it inherits properties:

- linearity in the first argument: ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩ 
- symmetry: ⟨u, v⟩ = ⟨v, u⟩
- homogeneity: ⟨α u, v⟩ = α ⟨u, v⟩

The norm of vector could be defined as: |v| = sqrt(⟨v, v⟩)

## Orthogonality

### Definition
Define u and v to be orthogonal if ⟨u,v⟩ = 0

Indeed, every vector is orthogonal to v when v is the zero vector.

### Theorem 8.3.1 (Pythagorean Theorem for vectors over the reals): 
If vectors u and v over the reals are orthogonal then

    |u+v|^2 =|u|^2 +|v|^2

### Lemma 8.3.2 (Orthogonality Properties)
For any vectors u and v and any scalar α, 
Property O1: If u is orthogonal to v then α u is orthogonal to α v for every scalar α. 
Property O2: If u and v are both orthogonal to w then u + v is orthogonal to w.

### Lemma 8.3.3: 
If u is orthogonal to v then, for any scalars α,β, 

    | α u + β v | 2 = α^2 | u | 2 + β^2 | v | 2

### Decompose vector
For any vector b and any vector v, define vectors b||v and b⊥v to be, respectively, 
the projection of b along v and the projection of b orthogonal to v:
- b = b||v + b⊥v
- b||v = σ v
- b⊥v is orthogonal to v

### Lemma 8.3.8
Let b and v be vectors. The point in Span {v} closest to b is b||v, and the distance is |b⊥v|.

### Lemma 8.3.11, calculated
For any vector b and any vector v over the $\mathbb{R}$:

1. There is a scalar σ such that b−σv is orthogonal to v.
2. The point p on Span {v} that minimizes ∥b − p∥ is σ v.
3. The value of σ is ⟨b,v⟩ / ⟨v,v⟩ 

In [13]:
from book.vecutil import *
# Quiz 8.3.12
def project_along(b, v):
    sigma = (b*v) / (v*v) if v*v > 1e-20 else 0
    return sigma * v

def project_orthogonal_1(b, v):
    return b - project_along(b, v)

b = list2vec([2, 4])
v = list2vec([6, 2])
[
    project_orthogonal_1(b, v)
]

[Vec({0, 1},{0: -1.0, 1: 3.0})]

In [17]:
from book.matutil import *
# Problem 8.3.15:
def projection_matrix(v):
    v_norm = coldict2mat([v / (v*v)])
    return v_norm * v_norm.transpose()

[
    projection_matrix(v)
]

[Mat(({0, 1}, {0, 1}), {(0, 0): 0.022500000000000006, (0, 1): 0.0075000000000000015, (1, 0): 0.0075000000000000015, (1, 1): 0.0025000000000000005})]

In [18]:
# Problem 8.3.16
# rank = size of basis of column / row space of Matrix
# rank = 1, all the columns are linearly dependent

# Problem 8.3.17
# 1. n^2, use dot-product interpretation, n rows * n-vector
# 2. ?? 

In [78]:
from book.cancer_data import *
def shape(M):
    if isinstance(M, Vec):
        return (len(M.D),)
    else:
        return len(M.D[0]), len(M.D[1])
# 8.4 Lab: machine learning
A, b = read_training_data('book/train.data')
assert A.D[0] == b.D
print(shape(A), shape(b))

(300, 30) (300,)


In [42]:
# hypothesis vector, w, H(f) = w*f
def signum(u):
    return Vec(u.D, {i: 1 if u[i] >= 0 else -1 for i in u.D})

signum(Vec({'A','B'}, {'A':3, 'B':-2}))

Vec({'B', 'A'},{'B': -1, 'A': 1})

In [124]:
import random

def fraction_wrong(A, b, w):
    n = len(b.D)
    # dot product in range [-n, n], -n is all wrong, n is all right
    return 1/2 + ((signum(A*w) * b) / (2*n))

all_one_w = Vec(A.D[1], {i: 1 for i in A.D[1]})
all_zero_w = Vec(A.D[1], {i: -1 for i in A.D[1]})
random_w = Vec(A.D[1], {i: random.choice([-1, 1]) for i in A.D[1]})

[
    fraction_wrong(A, b, all_one_w),
    fraction_wrong(A, b, all_zero_w),
    fraction_wrong(A, b, random_w),
]

[0.4866666666666667, 0.5133333333333333, 0.4866666666666667]

In [153]:
# A very rough measurement of errors
def mse(A, b, w):
    l = A*w - b
    return l*l
[
    mse(A, b, all_one_w),
    mse(A, b, all_zero_w),
    mse(A, b, random_w),
]

[1461169191.191651, 1461971671.8656526, 1135597639.570312]

In [215]:
def mse_grad(A, b, w):
    return A.transpose()*(2*A*w - b)

[
    mse_grad(A, b, all_one_w),
    mse_grad(A, b, all_zero_w),
    mse_grad(A, b, random_w),
]

[Vec({'area(worst)', 'fractal dimension(worst)', 'texture(mean)', 'concave points(mean)', 'compactness(mean)', 'concavity(worst)', 'fractal dimension(mean)', 'area(mean)', 'compactness(worst)', 'perimeter(worst)', 'compactness(stderr)', 'radius(worst)', 'area(stderr)', 'smoothness(worst)', 'concave points(stderr)', 'smoothness(stderr)', 'smoothness(mean)', 'fractal dimension(stderr)', 'concave points(worst)', 'perimeter(stderr)', 'symmetry(worst)', 'texture(stderr)', 'symmetry(stderr)', 'radius(stderr)', 'concavity(stderr)', 'texture(worst)', 'concavity(mean)', 'symmetry(mean)', 'radius(mean)', 'perimeter(mean)'},{'area(worst)': 1450912284.3342988, 'fractal dimension(worst)': 101380.81951281568, 'texture(mean)': 23825371.457551107, 'concave points(mean)': 83008.75914763153, 'compactness(mean)': 147523.68534069866, 'concavity(worst)': 419219.53997671866, 'fractal dimension(mean)': 72765.22301727664, 'area(mean)': 1015221747.5450637, 'compactness(worst)': 366815.67796184006, 'perimeter(w

In [223]:
def gradient_descent_step(A, b, w, σ):
    return w - σ * mse_grad(A, b, w)

def gradient_descent(A, b, init_w, σ, T):
    check_interval = T/50
    w = init_w
    for i in range(T):
        w = gradient_descent_step(A, b, w, σ)
        if i % check_interval == 0:
            print(f'iteration {i}, fraction wrong {fraction_wrong(A, b, w):.2f}, mse {mse(A, b, w)}')
    return w

In [None]:
# NOTE: 

# The initial set up is indeed sensitive:
# - For all ones, the fraction of error continues rising while mse is decreasing.
# - For all zeros, the fraction of error is always over 0.5
# - Learning rate too high would blow up the error calculation.

# Problem:
# - The optimization is mse, and the error is fraction wrong, which is not the same thing.
#   The result indeed optimizes for the absolute value of mse, which leads to more border decisions.
# - Lack of bias?
trained_w = gradient_descent(A, b, random_w, 1e-4, T=10000)

In [233]:
A_v, b_v = read_training_data('book/validate.data')
[
    "on training data",
    fraction_wrong(A, b, trained_w2),
    "on validate data",
    fraction_wrong(A_v, b_v, trained_w2),
]

['on training data',
 0.5166666666666667,
 'on validate data',
 0.49230769230769234]

In [268]:
import numpy as np

def parse_data(f):
    data_string = None
    with open(f, 'r') as f:
        data_string = f.read()

    data_lines = data_string.strip().split('\n')
    parsed_data = []
    for line in data_lines:
        split_line = line.split(',')
        ID, label, features = int(split_line[0]), 1 if split_line[1] == 'M' else 0, list(map(float, split_line[2:]))
        parsed_data.append([ID, label] + features)

    data = np.array(parsed_data)
    X = data[:, 2:]
    Y = data[:, 1]
    return X, Y

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cross_entropy(y_true, y_pred):
    epsilon = 1e-8  # Needed to prevent log(0)
    return -np.mean(y_true * np.log(y_pred + epsilon) + (1 - y_true) * np.log(1 - y_pred + epsilon))

def gradient_cross_entropy(X, y_true, y_pred):
    return np.dot(X.T, (y_pred - y_true)) / len(y_true)

X_train, Y_train = parse_data("book/train.data")

# Initialize weights
weights = np.zeros(X_train.shape[1])


In [271]:
# Set learning rate and number of epochs
learning_rate = 1e-5
epochs = 500000

# Optimize the model using gradient descent
y = Y_train
for epoch in range(epochs):
    y_pred = sigmoid(np.dot(X, weights))
    loss = cross_entropy(y, y_pred)
    gradient = gradient_cross_entropy(X, y, y_pred)
    weights = weights - learning_rate * gradient

    if epoch % 10000 == 0:
        # print(f'Epoch {epoch}, Loss: {loss}')
        pass

In [275]:
def evaluate(X, y_true, weights):

    y_pred = sigmoid(np.dot(X, weights))
    y_pred = np.round(y_pred)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)

    return accuracy, precision, recall, f1_score

X_test, Y_test = parse_data("book/validate.data")

[
    # accuracy, precision, recall, f1_score
    evaluate(X, Y, weights),
    evaluate(X_test, Y_test, weights),
]

[(0.9566666666666667,
  0.965034965034965,
  0.9452054794520548,
  0.9550173010380623),
 (0.9076923076923077,
  0.7432432432432432,
  0.9166666666666666,
  0.8208955223880596)]