# refresher

dot product: a · b = a1b1 + a2b2 + ... + aNbN
geometrically: a · b = ||a|| ||b|| cos(theta)
measures how aligned two vectors are, scaled by their lengths - it works both as a
similarity measure and as a weighted sum

comparing cosine similarities is more meaningful than comparing raw dot products, whose magnitude also depends on how long the vectors are
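
a minimal sketch of the difference, using made-up 2-D vectors (a, b, c and the helper `cosine_similarity` are illustrative names, not from any library): b points the same way as a but is twice as long, so the raw dot product grows while the cosine stays at 1.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])    # same direction as a, twice as long
c = np.array([4.0, -3.0])   # perpendicular to a

def cosine_similarity(u, v):
    """Dot product divided by both lengths, so only the angle remains."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

dot_aa = float(a @ a)   # 3*3 + 4*4 = 25
dot_ab = float(a @ b)   # 3*6 + 4*8 = 50, larger only because b is longer
dot_ac = float(a @ c)   # 3*4 + 4*(-3) = 0

cos_aa = cosine_similarity(a, a)   # same direction
cos_ab = cosine_similarity(a, b)   # same direction, length ignored
cos_ac = cosine_similarity(a, c)   # perpendicular

print(dot_aa, dot_ab, dot_ac)
print(cos_aa, cos_ab, cos_ac)
```

the dot products say b is "more similar" to a than a is to itself; the cosines say both are perfectly aligned.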

subtracting scores.max() before softmax prevents exp() from overflowing: the largest
exponent becomes 0, so the largest term is exp(0) = 1
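
a small sketch of why the shift matters, with deliberately huge made-up scores. the naive version overflows to inf and produces nan; the shifted version gives the same probabilities softmax would give in exact arithmetic, because subtracting a constant from every score cancels out in the ratio.

```python
import numpy as np

scores = np.array([1000.0, 1001.0, 1002.0])  # illustrative values, far too large for exp()

# naive softmax: exp(1000) overflows to inf, and inf / inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum()

# stable softmax: the largest shifted score is 0, so the largest term is exp(0) = 1
shifted = scores - scores.max()          # [-2, -1, 0]
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)
print(stable, stable.sum())
```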

# real world example
spam vs not spam

each email is a vector of features
0 = "free", 1 = "buy now", 2 = percent of all-caps, 3 = number of links

stage 1: standardization (z-score)
each feature can have a different scale - links could be 0 - 50, caps could be 0 - 200, etc. without putting them on a common scale, the feature with the largest range would dominate

training mean and std are computed per feature
z = (x - mu) / sigma (how unusual is this email on each feature, compared to a typical training email)

a standardized vector = a vector where each feature has been turned into a z-score
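
a sketch of stage 1 on a tiny made-up training set (X_train and x_new are hypothetical; columns are count of "free", number of links, fraction of all-caps). mean and std are computed per column from the training data only, then reused for any new email.

```python
import numpy as np

# hypothetical training set: one row per email, one column per feature
X_train = np.array([
    [0.0, 1.0, 0.01],
    [1.0, 0.0, 0.02],
    [0.0, 3.0, 0.00],
    [3.0, 8.0, 0.25],
])

mu = X_train.mean(axis=0)     # one mean per feature (column)
sigma = X_train.std(axis=0)   # one std per feature (column)

# broadcasting applies mu and sigma to every row
Z_train = (X_train - mu) / sigma

# a new email is standardized with the *training* statistics
x_new = np.array([2.0, 1.0, 0.10])
z_new = (x_new - mu) / sigma

print(Z_train.mean(axis=0))   # approximately 0 for every feature
print(Z_train.std(axis=0))    # approximately 1 for every feature
print(z_new)
```

if a feature were constant in training, sigma would be 0 and the division would fail; real pipelines usually guard against that, which this sketch does not.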

stage 2: logits (raw scores)
each standardized vector produces scores per class

scores[0] = not spam
scores[1] = spam

dot products generate scores
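
a sketch of stage 2, assuming one weight row per class (the matrix W and vector b here are illustrative). the "not spam" row is pinned at zero, so only the "spam" score has to be learned; each score is one dot product plus a bias.

```python
import numpy as np

z = np.array([1.5, 1.6, 1.6])      # a standardized email (illustrative values)

W = np.array([
    [0.0, 0.0, 0.0],               # row 0: weights for "not spam" (pinned at zero)
    [1.2, 0.7, 1.5],               # row 1: weights for "spam"
])
b = np.array([0.0, -0.3])          # one bias per class

scores = W @ z + b                 # one dot product per class
print(scores)                      # scores[0] = not spam, scores[1] = spam
```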

stage 3: argmax (prediction)
argmax(scores) = index of biggest score
this index gives us the predicted class
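
a short sketch of stage 3 (the `labels` list and the score values are made up for illustration). for a batch, argmax runs along the class axis so every email gets its own prediction.

```python
import numpy as np

labels = ["not spam", "spam"]      # index -> class name

scores = np.array([0.0, 5.02])
pred = int(np.argmax(scores))      # index of the biggest score
print(pred, labels[pred])

# batch version: one row of scores per email, argmax over the class axis
batch = np.array([
    [0.0, 5.02],
    [0.0, -1.3],
])
batch_pred = batch.argmax(axis=1)
print(batch_pred)
```

softmax preserves the ordering of the scores, so the argmax of the probabilities is the same index as the argmax of the raw scores.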

stage 4: softmax - turns the scores into probabilities
computed once per example (one email gets one probability vector)
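
a sketch of "once per example" for a batch (`softmax_rows` is a hypothetical helper, not a library function). max and sum are taken along axis=1 with keepdims=True, so each row is normalized on its own and sums to 1.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax applied independently to each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# three emails, two class scores each (illustrative values)
batch = np.array([
    [0.0, 5.02],
    [0.0, -1.3],
    [0.0, 0.0],
])

probs = softmax_rows(batch)
print(probs)
print(probs.sum(axis=1))   # every row sums to 1
```

with the "not spam" score fixed at 0, the spam probability equals the logistic sigmoid of the spam score, 1 / (1 + exp(-s_spam)).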

stage 5: cross entropy loss (training signal)


In [None]:
# let feature vector x have 3 features
# mu, sigma for these 3 features
# compute z-score
# s_spam = w @ z + b
# 2-class score vector, [0, s_spam] where 0 = not spam, 1 = spam
# softmax to get probabilities
# y to compute loss

import numpy as np
x = np.array([2, 1, 0.10], dtype=np.float32) #free occurs twice, one link, 10% of email is caps
mu = np.array([0.5, 0.2, 0.02], dtype=np.float32)
sigma = np.array([1.0, 0.5, 0.05], dtype=np.float32)
w = np.array([1.2, 0.7, 1.5], dtype=np.float32)
b = -0.3
y = 1 # email is spam

z = (x - mu) / sigma
print(z) # [1.5       1.6       1.5999999]: every feature is about 1.5 std above the training mean, a strong hint this is spam

[1.5       1.6       1.5999999]


In [6]:
# logit - spam score for this vector
s_spam = w @ z + b
print(s_spam) # 5.0199995, has to be converted to probability

scores = np.array([0.0, s_spam], dtype=np.float32)
print(scores)

5.0199995
[0.        5.0199995]


In [None]:
# compute probability vectors
shifted = scores - scores.max() # hedge against overflow

exp = np.exp(shifted)
prob = exp / exp.sum()
print(prob) # probability of spam is about 99.3%, of not spam about 0.66%
print(prob.sum())

[0.0065612 0.9934388]
1.0


In [12]:
# cross-entropy
p_correct = prob[y]
loss = -np.log(p_correct)
print(p_correct)
print(loss)
# cross-entropy is the surprisal of the true label under the model:
# how much probability did the model assign to the truth?

# small probability on the truth = big loss

0.9934388
0.006582839


for two choices, a uniform guess gives loss = -log(0.5) = 0.693.
a loss close to 0.693 means the model is doing no better than random guessing
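
a sketch that puts the 0.693 baseline next to a confident-right and a confident-wrong prediction (`cross_entropy` is a hypothetical helper and the probability vectors are made up).

```python
import numpy as np

def cross_entropy(prob, y):
    """Negative log of the probability assigned to the true class y."""
    return float(-np.log(prob[y]))

y = 1  # the email really is spam

baseline = cross_entropy(np.array([0.5, 0.5]), y)           # uniform guess: log(2)
confident_right = cross_entropy(np.array([0.01, 0.99]), y)  # most mass on the truth
confident_wrong = cross_entropy(np.array([0.99, 0.01]), y)  # most mass on the wrong class

print(baseline, confident_right, confident_wrong)
```

being confidently wrong costs far more than guessing, which is why a loss well above 0.693 on two classes points to a model that is worse than random on those examples.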

loss tells us: how wrong am i?
later, gradient descent tells us how to change the parameters to reduce the loss:
it uses the derivatives of the loss with respect to w and b to update them
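
a sketch of a single gradient descent step for the two-class setup with scores [0, s_spam]. the learning rate `lr` and the helper `loss_fn` are assumptions for illustration. for this setup the derivative of the loss with respect to s_spam is p_spam - y, so the gradient for w is (p_spam - y) * z and the gradient for b is (p_spam - y).

```python
import numpy as np

z = np.array([1.5, 1.6, 1.6])   # standardized email (illustrative values)
w = np.array([1.2, 0.7, 1.5])
b = -0.3
y = 1                            # true label: spam
lr = 0.1                         # hypothetical learning rate

def loss_fn(w, b):
    """Cross-entropy for scores [0, s_spam]; also returns p_spam."""
    s_spam = w @ z + b
    p_spam = 1.0 / (1.0 + np.exp(-s_spam))   # softmax over [0, s_spam]
    p_correct = p_spam if y == 1 else 1.0 - p_spam
    return float(-np.log(p_correct)), float(p_spam)

loss_before, p_spam = loss_fn(w, b)

grad_s = p_spam - y              # d(loss) / d(s_spam)
grad_w = grad_s * z              # chain rule: d(s_spam) / d(w) = z
grad_b = grad_s                  # chain rule: d(s_spam) / d(b) = 1

w_new = w - lr * grad_w          # step against the gradient
b_new = b - lr * grad_b
loss_after, _ = loss_fn(w_new, b_new)

print(loss_before, loss_after)
```

the model is already almost right on this email, so the gradient is tiny and the loss barely moves; a misclassified email would produce a much larger update.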

in many ML models, w and b start as random values and are refined during training. these are the parameters of neural nets! w = weights, b = bias

training data shapes these weights via optimization!

# attribution is hard - saying which internal components represent which abstractions,
# and why a specific output happens, is the "black box". we DO know how training and inference work.
