# Making Complex Decisions

- Markov Decision Processes (MDPs)
    - states, s ∈ S
    - actions, a ∈ A
        - sometimes A(s)
    - rewards, R(s)
        - sometimes r(s, a)
    - transition model, P(s' | s, a)
        - P(s' | s, a) is the probability that the next state s' is reached when action a is taken in state s
            - environmental model for how agent's actions update the world
    - Element 1: Markov attributes
        - because it is a Markov process, the probability of reaching a state s' depends only on the current state s and the action a
            - P(s' | s, a) = P(s' | s, a, s1, a1, s2, a2, ..., st, at)
                - i.e. the future does not depend on the past, only the present
        - time that you reach a state is not important, only the state itself
            - i.e. the process is memoryless, the first time you land on Go is the same as the 70th time you land on Go
        - sort of like a deterministic process, but with probabilities
            - e.g. in gridworld there is a chance that when you step_forward() you will actually step_backward() instead or stay in the same place
            - the probabilities are the only thing that is not deterministic
    - Element 2: Maximized Reward
        - the goal is to maximize the sum of rewards over a sequence of actions
            - the reward is the only thing that matters
            - the agent is trying to maximize the sum of rewards
        - $max_{a1, a2, ..., an} \sum_{t=0}^{n} R_t$
        
            - deterministic, finite horizon
            - where n is the number of steps in the sequence
        - $max_{a1, a2, ..., an} \sum_{t=0}^{\infty} \gamma^tR_t$
        
            - stochastic, infinite horizon
            - where $\gamma$ is the discount factor
                - $\gamma \in [0, 1]$
                - $\gamma = 0$ means that the agent only cares about the immediate reward
                - $\gamma = 1$ means that the agent cares about all future rewards equally
                - $\gamma \geq 1$ can make the infinite sum diverge; $\gamma < 1$ keeps it finite when rewards are bounded
                - $\gamma = 0.9$ is a common value
    - Element 3: Policy
        - map from state to action
        - $\pi(s) = a$
            - when in state s, take action a
    - Element 4: Value/Utility function
        - usually called V(s), sometimes U(s)
        - maps from state to value if you follow the policy
    - Element 5: Optimal Value Function
        - the value function that maximizes the sum of rewards
        - denoted with a star, $V^*(s)$
        - $V^*(s) = max_{\pi} V^{\pi}(s)$
            - the best reward you can get from state s is by making all the best decisions
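
The two reward sums in Element 2 can be sketched in a few lines. This is only an illustration, assuming the rewards of one episode are available as a plain list; `discounted_return` is a made-up helper name, not something from the course.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over a finite list of rewards (t starts at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]  # made-up reward sequence

only_immediate = discounted_return(rewards, 0.0)  # gamma = 0: only R_0 counts
undiscounted = discounted_return(rewards, 1.0)    # gamma = 1: plain sum
discounted = discounted_return(rewards, 0.9)      # the commonly used value

print(only_immediate, undiscounted, discounted)
```

With `gamma = 1` the function reduces to the finite horizon sum, which is why the same helper covers both formulas.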

- Note: long term vs short term rewards
    - the agent may have to make a decision that sacrifices short term rewards for long term rewards
    - e.g. in chess, sacrificing a piece to gain a positional advantage
- R(s) can be used to tailor the agent's behavior
    - e.g. in gridworld with two terminal states: one with reward 1 and one with -1
        - a per-step reward below -1 encourages the agent to end the game as quickly as possible
            - i.e. every extra step costs more than the worst terminal state, so the agent heads for the nearest terminal state even if its reward is negative (but higher)
        - a per-step reward above 1 encourages the agent to never reach a terminal state
            - i.e. every step is worth more than the best terminal reward, so the agent avoids both terminal states at all costs

- acting with a value function
    - use MEU (maximize expected utility) to choose the best action
        - $MEU = max_{a} \sum_{s'} P(s' | s, a) V(s')$
        - the agent will choose the action that maximizes the expected utility **of the next state**
    - hill climbing behavior, i.e. follow the gradient of the value function
        - $a = argmax_{a} V(s')$
        - the agent will choose the action that maximizes the value of the next state
        - if it is a probabilistic environment, the agent will choose the action that maximizes the expected value of the next state
            - i.e. account for the possibility of the action failing
        - if the environment is deterministic, the agent will choose the action that maximizes the value of the next state
            - i.e. the chosen action will always succeed
- Optimality Principle
    - Bellman Equation
        - $V^*(s) = max_{a} (R(s) + \sum_{s'}  \gamma P(s' | s, a) [V^*(s')])$
        - if it is deterministic, the equation simplifies to
            - $V^*(s) = max_{a} (R(s) + \gamma V^*(s'))$
    - each state has an associated $V^*(s)$
        - N states, N equations to solve completely
    - algorithms
        - value iteration
            - evaluate every state's value and then update them all at once
            1. initialize V(s) with 0 or some guess, k = 0
            2. for each state s
                - $V_{k+1}(s) = max_{a} (R(s) + \gamma \sum_{s'} P(s' | s, a) V_k(s'))$
                - repeat until $V_{k+1}(s)$ and $V_k(s)$ are close enough
                    - close enough defined by $max_s |V_{k+1}(s) - V_k(s)| < \epsilon\frac{1 - \gamma}{\gamma}$
                        - where $\epsilon$ is the desired error
            - e.g. in gridworld, each state's new value is its reward plus the discounted expected value of its successor states under the best action
        - in place value iteration
            - same as value iteration, but update the value of each state as you go
            - i.e. don't wait until the end of the loop to update the values
            - uses half as much memory because a single value array is updated in place, instead of storing both $V_k$ and $V_{k+1}$
            - it will probably converge in fewer iterations
        - policy evaluation
            - given a policy, compute its value function
            - $V^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi(s)) V^{\pi}(s')$ for s in S
                - simpler than the Bellman equation because it doesn't have the max operator and the policy is fixed
                    - i.e. we know what $\pi(s)$ is
                - a linear system of equations that can be solved with linear algebra
                    - often develops a triangular matrix that can be solved in $O(n^2)$ time, where n is the number of states (a general linear solve is $O(n^3)$)
        - policy improvement
            - given a policy, find a better policy
            - $\pi'(s) = argmax_{a} (R(s) + \gamma \sum_{s'} P(s' | s, a) V^{\pi}(s'))$
                - i.e. choose the action that maximizes the expected value of the next state
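        - policy iteration
            1. initialize $\pi$ to any admissible policy, k = 0
            2. policy evaluation: compute the values of the current policy, e.g. by repeating for each state s
                - $V_{k+1}(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi(s)) V_k(s')$
            3. policy improvement: compute $\pi'$ from those values; stop if $\pi' = \pi$, otherwise set $\pi = \pi'$ and go back to step 2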
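
The value iteration update and the MEU action choice can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the three states, the two actions, the transition probabilities, and the rewards are made up, and the code assumes $0 < \gamma < 1$ so the stopping threshold is well defined.

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Synchronous value iteration; assumes 0 < gamma < 1."""
    V = {s: 0.0 for s in states}
    threshold = eps * (1 - gamma) / gamma  # stopping rule from the notes
    while True:
        V_new = {}
        for s in states:
            if not actions[s]:
                # terminal state: nothing follows, so its value is its reward
                V_new[s] = R[s]
            else:
                V_new[s] = max(
                    R[s] + gamma * sum(p * V[s2] for p, s2 in P[(s, a)])
                    for a in actions[s]
                )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new  # all states are updated at once
        if delta < threshold:
            return V


def greedy_policy(states, actions, P, V):
    """For each non-terminal state, pick the action with the best expected next value."""
    return {
        s: max(actions[s], key=lambda a: sum(p * V[s2] for p, s2 in P[(s, a)]))
        for s in states
        if actions[s]
    }


# Hypothetical 3-state chain: "go" usually moves one step toward the goal.
states = ["s0", "s1", "goal"]
actions = {"s0": ["go", "stay"], "s1": ["go", "stay"], "goal": []}
R = {"s0": -0.04, "s1": -0.04, "goal": 1.0}
P = {
    ("s0", "go"): [(0.8, "s1"), (0.2, "s0")],
    ("s0", "stay"): [(1.0, "s0")],
    ("s1", "go"): [(0.8, "goal"), (0.2, "s1")],
    ("s1", "stay"): [(1.0, "s1")],
}

V = value_iteration(states, actions, P, R)
policy = greedy_policy(states, actions, P, V)
print(V)
print(policy)
```

Policy iteration would reuse the same pieces: evaluate a fixed policy by dropping the `max`, then call something like `greedy_policy` for the improvement step.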

### Learning from Examples I
- 04Apr2024

- moving from utility based agents to learning agents
    - in some cases, you may have to learn the utility function
- learning from experience
    - gain experience (E) with respect to a task (T) and a performance measure (P)
        - e.g. Samuel's checkers player
            - T: playing checkers
            - P: % of games won
            - E: playing games against itself
    - learning system
        - a loop
            - experiment generator starts a new problem ->
            - performance system does the task ->
            - critic evaluates the performance ->
            - generalizer modifies the model or hypothesis -> (back to experiment generator)
                - finding $\hat V$, an approximation of the true value function
    - representation
        - how to represent the value function
            - e.g. Samuel's 6 features of the checkers board
                - the true value function is some $V^*(board)$
                - his estimated value function is $\hat V(board) = \sum_{i=1}^{6} w_i f_i(board)$
                    - $w_i$ is the weight of feature i
                    - $f_i(board)$ is the value of feature i

- forms of learning
    - supervised
        - learn a function from labeled examples
        - given instances and their correct outputs (labels)
        - e.g. recognizing handwritten digits after training on a dataset of labeled images
    - unsupervised
        - recognize and learn patterns in the data
        - given instances without labels
        - e.g. clustering
    - reinforcement
        - learn from rewards
        - given instances and rewards
        - generally the reward is given at the end of the sequence of actions, not immediately
            - e.g. it is postponed until the end of the game and may not be explicitly tied to a specific action

- supervised learning
    - given a vector of samples and their labels
        - $D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$
    - goal: learn a function that maps from x to y
        - $f: X \rightarrow Y$
        - assumption: there is some true function $f^*$ that maps from x to y
            - $y = f^*(x)$
        - the goal is to find a function $\hat f$ that approximates $f^*$
            - $\hat f$ is the hypothesis
    - output may be a category (classification), a real number (regression), or a sequence (sequence prediction)
    - issues:
        - consistency in data
        - generalization (fitting)
            - will the hypothesis $\hat f$ match $f^*$ on new data?
        - noisy data
            - error in the data may affect the hypothesis
        - model complexity (hypothesis space)
            - how complex should the model be?
            - too simple and it may not capture the true function
            - too complex and it may overfit the data
                - i.e. it may capture the noise in the data

- metrics
    - accuracy
        - $\frac{TP + TN}{TP + TN + FP + FN}$
        
            - where TP is true positive, TN is true negative, FP is false positive, and FN is false negative
        - the proportion of correct predictions
    - precision
        - $\frac{TP}{TP + FP}$
        - the proportion of positive identifications that were actually correct
    - recall
        - $\frac{TP}{TP + FN}$
        - the proportion of actual positives that were identified correctly
    - least squares fit:
        - Loss = $\sum_{i=1}^{n} (y_i - (ax_i + b))^2$
            - y_i is the true value
            - where a and b are the parameters of the model
            - the goal is to minimize the loss
    - exact polynomial fit:
        - Loss = $\sum_{i=1}^{n} (y_i - (a_0 + a_1x_i + a_2x_i^2 + ... + a_kx_i^k))^2$
            - where a_0, a_1, ..., a_k are the parameters of the model
            - the goal is to minimize the loss
            - probably a poor model because it will overfit the data
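
The metrics above can be computed directly from predictions. A minimal sketch, assuming binary labels coded as 1 (positive) and 0 (negative) and using the standard closed-form solution for the straight-line least squares fit; the function names and sample data are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def fit_line(xs, ys):
    """Least squares fit of y = a*x + b (closed form, needs at least 2 distinct x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    return a, b


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # made-up predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```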

- overfitting
    - the model fits the training data too well and does not generalize to new data
        - e.g. the exact polynomial fit will work well on the training data, but probably poorly on new data
    - Occam's Razor
        - "Entities should not be multiplied without necessity"
        - the simplest explanation/model that fits the data is probably the best

- decision trees (nothing new compared to ML with Johnson)
    - structural details
        - input: a set of input attributes
        - decision: predicted output
        - e.g. boolean classification
        - a decision is reached by following a path from the root to a leaf
        - each internal node is a branching decision based on an attribute's value
    - popular when
        - there are discrete target outputs
        - the data is noisy
        - it is valuable to understand the decision making process
            - e.g. need to be able to explain the decision to a human who is not a data scientist
    - inputs can be categorical or numerical (discrete or continuous)
    - outputs can be categorical or numerical
        - e.g. CART and regression trees
- decision stumps
    - a decision tree with only one split (only one attribute is considered)
    - not really anything wild but Branicky thinks they're neat/funny
    - e.g. a decision stump for classifying fruit with color as the only attribute
        - if x == yellow, then lemon, if x == green, then lime, else orange 

- building a decision tree system
    - start with the root node
        - choose the attribute that explains the largest section of the data
    - recursively build the tree down the branches
        - choose the attribute that explains the largest section of the data
        - repeat until the data is perfectly classified or the tree is sufficiently deep
- more on this Tuesday

### Learning from Examples II

- Learning a Decision Tree
    - start with the training data
    - begin with the root node
        - choose the attribute that best splits the data
        - split the data into subsets based on the attribute
    - repeat for each subset
    - stop when the data is perfectly classified or the tree is sufficiently deep
- ID3 algorithm
    - pseudocode
    ```
    function ID3(examples, attributes, parent-examples)
        if examples is empty
            return plurality-value(parent-examples)
        else if all examples have the same classification
            return the classification
        else if attributes is empty
            return plurality-value(examples)
        else
            A <- the attribute that best classifies examples
            tree <- a new decision tree with root A
            for each value v of A
                exs <- {e : e in examples and e.A = v}
                subtree <- ID3(exs, attributes - A, examples)
                add a branch to tree with label A=v and subtree subtree
            return tree
    ```
    - in practice, depth can also be a stopping condition
        - e.g. stop when the tree is 5 levels deep 
    - information gain is another potential stopping condition
    - the algorithm is greedy
        - it chooses the best attribute at each step
        - it may not find the best tree
- finding the "best" attribute
    - entropy is a measure of disorder
        - for a random variable V with values $v_1, ..., v_n$
            - $H(V) = -\sum_{k=1}^{n} P(v_k)log_2P(v_k)$
        - binary-valued
            - $B(q) = -(qlog_2q + (1 - q)log_2(1 - q))$
                        
                - where q is the proportion of positive examples
                - e.g. if q = 0.5, then B(q) = 1
                    - B(0) = B(1) = 0
        - the entropy of a set S of boolean instances
            - p = number of positive instances in S
            - n = number of negative instances in S
            - $B(\frac{p}{p + n})$ is the entropy of S
    - information gain
        - the expected reduction in entropy from splitting the data on an attribute
        - if the remainder is 0, the split accounts for all the information in the set
        - Gain(A) = $B(\frac{p}{p + n}) - remainder(A)$
            - where p and n are the number of positive and negative instances in the set 
            - where remainder(A) is the expected entropy of the subsets
                - $remainder(A) = \sum_k \frac{p_k + n_k}{p + n}B(\frac{p_k}{p_k + n_k})$ 
                
                    - where p_k and n_k are the number of positive and negative instances in subset k
        - e.g. tennis
            - humidity and play tennis
                - B(9/14) = 0.940
                    - this is the entropy of the entire set (listed in the course notes)
                        - 14 instances, 9 positive (play tennis), 5 negative (don't play tennis)
                - humidity has 2 values: high and normal
                - 14 instances: 7 with high humidity, 7 with normal humidity
                - high humidity
                    - 7 instances, 3 positive, 4 negative
                    - $B(\frac{3}{3 + 4}) = 0.985$
                - normal humidity
                    - 7 instances, 6 positive, 1 negative
                    - $B(\frac{6}{6 + 1}) = 0.592$
                - $Gain(humidity) = 0.940 - \frac{7}{14}(0.985) - \frac{7}{14}(0.592) = 0.151$
            - e.g. wind and play tennis
                - B(9/14) = 0.940
                - wind has 2 values: strong and weak
                - 14 instances: 8 with weak wind, 6 with strong wind
                - weak wind
                    - 8 instances, 6 positive, 2 negative
                    - $B(\frac{6}{6 + 2}) = 0.811$
                - strong wind
                    - 6 instances, 3 positive, 3 negative
                    - $B(\frac{3}{3 + 3}) = 1$
                - $Gain(wind) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1) = 0.048$
            - humidity is the better attribute to split on because it has the higher information gain
            - after splitting on humidity, start the process over with the subsets (high and normal) to find the next best attribute to split on in each subset
                - for high humidity you are calculating only the entropy of the high humidity instances
                - for normal humidity you are calculating only the entropy of the normal humidity instances
                - both subsets may have different best attributes to split on
            - **see the "Split and Recurse" slide from the 09Apr2024 lecture for a visual representation of this process and more examples**
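
The entropy and information gain formulas translate almost directly into code. A small sketch, assuming each split is described by the (positive, negative) counts of its subsets; the helper names are illustrative. Computed at full precision, the results can differ in the third decimal from values rounded by hand.

```python
from math import log2


def B(q):
    """Entropy of a boolean variable that is true with probability q."""
    if q == 0.0 or q == 1.0:
        return 0.0  # a pure set has no disorder
    return -(q * log2(q) + (1 - q) * log2(1 - q))


def gain(subsets):
    """Information gain of a split given as a list of (p_k, n_k) counts."""
    p = sum(pk for pk, _ in subsets)
    n = sum(nk for _, nk in subsets)
    remainder = sum(
        (pk + nk) / (p + n) * B(pk / (pk + nk))
        for pk, nk in subsets
        if pk + nk > 0  # skip empty subsets
    )
    return B(p / (p + n)) - remainder


humidity = [(3, 4), (6, 1)]  # (positive, negative) counts: high, normal
wind = [(6, 2), (3, 3)]      # (positive, negative) counts of the two wind subsets

print(round(B(9 / 14), 3))
print(round(gain(humidity), 3), round(gain(wind), 3))
```

ID3 would call something like `gain` once per remaining attribute at every node and split on the largest value.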

### ID3 hypothesis space
- ID3 searches through the space of decision trees
    - from simple to complex
    - guided by information gain
- no backtracking
    - greedy with respect to information gain
- can get caught in local optima
    - e.g. the attribute with the highest gain at the current node may lead to worse splits further down than a lower-gain attribute would have

### Issues with decision trees
- continuous attributes
    - discretize the attribute
        - i.e. choose thresholds and split the data into sets based on the thresholds
        - e.g. if the attribute is temperature, discretize it into hot, warm, and cold
- missing data
    - some piece of data is missing for an instance
    - common solutions:
        - apply the most common value for the attribute in similar instances
            - e.g. is it more common that it is sunny or rainy 
        - apply the most common value for the attribute in the entire dataset
            - e.g. is it more common that it is sunny or rainy in the entire dataset
        - split the value proportionally
            - e.g. if 5/13 sunny, 4/13 rain, 4/13 overcast, then split the missing data 5/13 sunny, 4/13 rain, 4/13 overcast
- weighting
    - in the tennis example, there are 72 possible combinations of attributes
    - if you have 1000 instances, many are repeats of the same combination, so training on every row separately is redundant
    - count the number of instances that match each combination of attributes
    - apply a weight to each of the 72 combinations based on the number of instances that match the combination
- testing performance
    - split into training and testing sets
    - test is a disjoint set chosen at random from the data

### Ensemble Learning
- combining multiple models to improve performance
- plurality of k experts
- can apply weights to the experts
    - add to a weight when the expert is correct
    - subtract when the expert is incorrect
- random forest
    - the experts are decision trees
    - classification is determined by a majority vote
    - trees must differ from each other
        - e.g. by using different subsets of the data or different attributes
- bagging
    - training multiple models on different subsets of the data
- boosting
    - successively train models
        - weight samples higher if they were misclassified by the previous model
        - i.e. the next model will be good at classifying the samples that the previous model was bad at classifying
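
The weighted plurality vote can be sketched as below. The notes only say to add to a weight when an expert is correct and subtract when it is wrong; the step size of 0.1 and the clamp at zero are assumptions made here, and the expert predictions are made up.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def update_weights(predictions, weights, truth, step=0.1):
    """Reward experts that were right, penalize the rest (never below 0)."""
    return [
        w + step if label == truth else max(0.0, w - step)
        for label, w in zip(predictions, weights)
    ]


predictions = ["yes", "no", "no"]  # three hypothetical experts
weights = [1.0, 0.3, 0.3]

first_vote = weighted_vote(predictions, weights)
for _ in range(2):
    # the true label turns out to be "no" both times
    weights = update_weights(predictions, weights, truth="no")
later_vote = weighted_vote(predictions, weights)
print(first_vote, later_vote, weights)
```

Boosting uses the same spirit in the other direction: it reweights the samples rather than the experts.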

- given
    - P(pass(c1, t1) | q+(c1)) = 0.8 (pass given good shape)
    - P(pass(c1, t1) | q−(c1)) = 0.35 (pass given bad shape)
    - P(q+) = 0.7, P(q−) = 0.3 (the priors used in the computation below)
- P(pass) = P(pass | q+)P(q+) + P(pass | q−)P(q−)
    - $= 0.8*0.7 + 0.35*0.3$
    - $= 0.665$
- P(fail) = 1 - P(pass) = 0.335
- P(q+ | pass) = $\frac{P(pass|q+)P(q+)}{P(pass)}$
    - $= \frac{0.8*0.7}{0.665}$
    - $= 0.842$
- P(q− | pass) = $\frac{P(pass|q-)P(q-)}{P(pass)}$
    - $= \frac{0.35*0.3}{0.665}$
    - $= 0.158$
- P(q+ | fail) = $\frac{P(fail|q+)P(q+)}{P(fail)}$
    - $= \frac{0.2*0.7}{0.335}$
    - $= 0.418$
- P(q− | fail) = $\frac{P(fail|q-)P(q-)}{P(fail)}$
    - $= \frac{0.65*0.3}{0.335}$
    - $= 0.582$
- U(q+, buy) = 500
- U(q−, buy) = -200
- EU(buy | pass) = P(q+ | pass)U(q+, buy) + P(q− | pass)U(q−, buy)
    - $= 0.842*500 + 0.158*(-200)$
    - $= 421 - 31.6$
    - $= 389.4$
- EU(buy | fail) = P(q+ | fail)U(q+, buy) + P(q− | fail)U(q−, buy)
    - $= 0.418*500 + 0.582*(-200)$
    - $= 209 - 116.4$
    - $= 92.6$
- EU(not buy | pass) = $0.842*0 + 0.158*0 = 0$
- EU(not buy | fail) = $0.418*0 + 0.582*0 = 0$
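
The same calculation can be checked in code at full precision. This is only a sketch: the variable names are made up, and the prior P(q+) = 0.7 is taken from the factors used in the P(pass) computation.

```python
p_good = 0.7              # prior P(q+); P(q-) is 1 - p_good
p_pass_given_good = 0.8   # P(pass | q+)
p_pass_given_bad = 0.35   # P(pass | q-)
u_buy_good, u_buy_bad = 500, -200


def posterior_good(p_obs_given_good, p_obs_given_bad):
    """Bayes rule: P(q+ | observation)."""
    joint_good = p_obs_given_good * p_good
    joint_bad = p_obs_given_bad * (1 - p_good)
    return joint_good / (joint_good + joint_bad)


def eu_buy(p_good_given_obs):
    """Expected utility of buying, given the posterior for q+."""
    return p_good_given_obs * u_buy_good + (1 - p_good_given_obs) * u_buy_bad


p_pass = p_pass_given_good * p_good + p_pass_given_bad * (1 - p_good)
post_pass = posterior_good(p_pass_given_good, p_pass_given_bad)
post_fail = posterior_good(1 - p_pass_given_good, 1 - p_pass_given_bad)
eu_pass = eu_buy(post_pass)
eu_fail = eu_buy(post_fail)

print(round(p_pass, 3), round(post_pass, 3), round(post_fail, 3))
print(round(eu_pass, 1), round(eu_fail, 1))
```

Under these numbers buying has a positive expected utility after either test result, while not buying is worth 0 in both cases.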


# video lecture

### Bayesian Learning
- 16Apr2024

- Bayesian Learning
    - statistical learning
    - based on Bayes Rule
        - $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
        - "posterior probability of A given B is equal to the likelihood of B given A times the prior probability of A divided by the prior probability that B is observed"
- defining the "best" hypothesis in the hypothesis space
    - MAP
        - best = most probable hypothesis given the data
        - Maximum A Posteriori hypothesis
        - $h_{MAP} = argmax_{h \in H} P(h|D)$ = $argmax_{h \in H} P(D|h)P(h)$
            - where h is the hypothesis and D is the data
            - e.g. $D$ = head pain, $h_1$ = migraine, $h_2$ = brain tumor, $H$ = $\{h_1, h_2\}$
                - if $P(D|h_1)P(h_1) > P(D|h_2)P(h_2)$, then $h_{MAP} = h_1$
    - ML
        - best = most likely to have produced the data
        - Maximum Likelihood hypothesis
            - "without any other specific knowledge, assume that each $h_i$ is equally likely"
                - $P(h_i) = P(h_j)$ for all $h_i, h_j \in H$ 
        - $h_{ML} = argmax_{h \in H} P(D|h)$ = $argmax_{h \in H} log(P(D|h))$
            - e.g. $D = \{heads, heads\}$, $h_1$: $P(heads) = 0.5$, $h_2$: $P(heads) = 0.6$
                - $P(D|h_1) = 0.5^2 = 0.25$
                - $P(D|h_2) = 0.6^2 = 0.36$
                - $P(D|h_1) < P(D|h_2)$, so $h_{ML} = h_2$
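
The difference between the ML and MAP choices shows up as soon as the priors are not uniform. A minimal sketch of the coin example, assuming independent flips; the priors passed to `h_map` are made up to show a case where the two criteria disagree.

```python
def likelihood(p_heads, data):
    """P(D | h) for independent coin flips."""
    result = 1.0
    for flip in data:
        result *= p_heads if flip == "heads" else 1 - p_heads
    return result


def h_ml(hypotheses, data):
    """Hypothesis that maximizes P(D | h)."""
    return max(hypotheses, key=lambda h: likelihood(hypotheses[h], data))


def h_map(hypotheses, priors, data):
    """Hypothesis that maximizes P(D | h) * P(h)."""
    return max(hypotheses, key=lambda h: likelihood(hypotheses[h], data) * priors[h])


hypotheses = {"h1": 0.5, "h2": 0.6}  # each hypothesis is a value of P(heads)
data = ["heads", "heads"]

ml_choice = h_ml(hypotheses, data)
map_uniform = h_map(hypotheses, {"h1": 0.5, "h2": 0.5}, data)
map_skewed = h_map(hypotheses, {"h1": 0.9, "h2": 0.1}, data)  # hypothetical priors
print(ml_choice, map_uniform, map_skewed)
```

With equal priors MAP and ML agree; a strong prior for the fair coin outweighs the slightly higher likelihood of `h2`.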

- further examples
    - MAP
        - hypothesis: $h_1$ = +cancer, $h_2$ = -cancer
        - priors: $P(+c) = 0.008$, $P(-c) = 0.992$
        - test gives correct positive answer 98% of the time
            - $P(+t|+c) = 0.98$
        - test gives correct negative answer 97% of the time
            - $P(-t|-c) = 0.97$
        - what is the diagnosis if the test is positive?
            - $h_{MAP} = argmax_{h \in H} \{P(D|h)P(h)\}$
                - $= argmax \{P(+t|+c)P(+c), P(+t|-c)P(-c)\}$
                - $= argmax \{0.98*0.008, 0.03*0.992\}$
                - $= argmax \{0.00784, 0.02976\}$
                - the second term is larger, so $h_{MAP}$ = -cancer
        - what is the chance of cancer?
            - need to do the normalization
                - the two posteriors must sum to 1
            - $P(+c|+t) = \frac{P(+t|+c)P(+c)}{P(+t)}$
                - $= \frac{0.00784}{0.00784 + 0.02976}$
                - $= \frac{0.00784}{0.0376}$
                - $= 0.2085$
                - 20.85% > 0.8%, so the positive test increased the probability of cancer, but no cancer is still the more likely diagnosis
                    - a priori probability of no cancer is 99.2%
                    - a posteriori probability of no cancer is 79.15%
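
The normalization step is easy to get wrong by hand, so here is a short check. It is a sketch with an illustrative helper name; the inputs are the prior and the two test accuracies given above.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """P(h | positive test) via Bayes rule with explicit normalization."""
    joint_h = p_pos_given_h * prior
    joint_not_h = p_pos_given_not_h * (1 - prior)
    return joint_h / (joint_h + joint_not_h)


p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 1 - 0.97  # false positive rate

p_cancer_given_pos = posterior(p_cancer, p_pos_given_cancer, p_pos_given_no_cancer)
p_no_cancer_given_pos = 1 - p_cancer_given_pos
print(round(p_cancer_given_pos, 4), round(p_no_cancer_given_pos, 4))
```

The posterior for cancer rises well above the prior, yet stays below 0.5 because the disease is rare.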

- Learning Bayes Networks
    - problem: given a network structure and data, calculate the CPTs (conditional probability tables)
    - solution: just estimate
        - $\hat{P}\{V_i = v_i | P_i = p_i\} = \frac{count(V_i = v_i \land P_i = p_i)}{count(P_i = p_i)}$
        
            - where $V_i$ is the variable, $v_i$ is the value, $P_i$ is the parent variable, and $p_i$ is the parent value
        - this is the ML estimate
        - children
            - e.g. table in book example in slides (robot with battery, move, arm, liftable)
                - $\hat{P}(+b) = 94/100 = 0.94$
                                
                    - count instances of +b and divide by the total number of instances
                - $\hat{P}(+g|-b) = 0/6 = 0$
                                
                    - count instances of +g and -b and divide by the total number of instances of -b
                - $\hat{P}(+m|+b, -l) = 1/30$
                                
                    - count instances of +m and +b and -l and divide by the total number of instances of +b and -l
            - **it's just counting from the table**
            - what if you don't have data for the parents?
                - some people initialize all numerators to 1 and all denominators to |V|
                    - $|V|$ is the number of labels (possible values) for the variable V
                    - e.g. boolean case, |V| = 2 $\implies \frac{1}{2}$ 
    - issues
        - unknown BN structure
            - usually supplied by a human expert
                - even then, there may be "hidden" intermediate variables
                    - e.g. meat spontaneously creates flies
                        - the hidden variable is the presence of flies laying eggs in the meat
                    - cholera spread near water sources
                        - germs in the water are the hidden variable
            - solution: Naive Bayes Classifier
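
Estimating a CPT entry by counting can be sketched as follows. The records are a tiny made-up dataset, not the robot table from the slides, and `estimate` is an illustrative name. The `smooth` option follows the "numerators start at 1, denominators start at |V|" idea from the notes.

```python
def estimate(records, var, value, parents=None, n_values=2, smooth=False):
    """Estimate P(var = value | parents) by counting rows in records."""
    parents = parents or {}
    matching = [r for r in records if all(r[k] == v for k, v in parents.items())]
    hits = sum(1 for r in matching if r[var] == value)
    if smooth:
        # counts start at 1 (numerator) and |V| (denominator)
        return (hits + 1) / (len(matching) + n_values)
    if not matching:
        return None  # no data for this parent configuration
    return hits / len(matching)


# Hypothetical boolean records with variables b and g.
records = [
    {"b": True, "g": True},
    {"b": True, "g": True},
    {"b": True, "g": False},
    {"b": False, "g": False},
]

p_b = estimate(records, "b", True)
p_g_given_b = estimate(records, "g", True, {"b": True})
p_g_given_not_b = estimate(records, "g", True, {"b": False})
p_g_given_not_b_smoothed = estimate(records, "g", True, {"b": False}, smooth=True)
print(p_b, p_g_given_b, p_g_given_not_b, p_g_given_not_b_smoothed)
```

Smoothing keeps an unseen combination from producing a hard zero, and gives 1/2 for a boolean variable when the parent configuration never appears.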

- Naive Bayes Classifiers
    - assume structure
        - all attributes are independent given the class
        - the class is the parent of all the attributes
        - estimate CPTs from training examples
            - if the class is binary, each attribute has 2 CPTs
                - one for each class
    - e.g. tennis & weather
        - <img src="images/ML_Tennis.png">
        - class = play tennis
            - P(play tennis) = $\frac{9}{14}$
            - P(don't play tennis) = $\frac{5}{14}$
            - P(strong wind | play tennis) = $\frac{\sum strong | yes}{\sum yes} = \frac{3}{9} = 0.333$
            - P(strong wind | don't play tennis) = $\frac{\sum strong | no}{\sum no} = \frac{3}{5} = 0.6$ 
            - P(weak wind | play tennis) = $1 - \frac{3}{9} = 0.667$
                - don't have to calculate this, it is just 1 - P(strong wind | play tennis)
        - MAP defined best
            - we can use the non normalized probabilities to determine the best hypothesis
            - P(play| sunny, cool, high, strong) ~ P(play)P(sunny|play)P(cool|play)P(high|play)P(strong|play)
                - $= \frac{9}{14} * \frac{2}{9} * \frac{3}{9} * \frac{3}{9} * \frac{3}{9}$
                - $= 0.0053$
            - P(don't play| sunny, cool, high, strong) ~ P(don't play)P(sunny|don't play)P(cool|don't play)P(high|don't play)P(strong|don't play)
                - $= \frac{5}{14} * \frac{3}{5} * \frac{1}{5} * \frac{4}{5} * \frac{3}{5}$
                - $= 0.0206$
            - argmax $\implies$ don't play
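
The Naive Bayes scores can be recomputed by counting. This sketch assumes the table in the image is the standard 14-row PlayTennis dataset from Mitchell's Machine Learning textbook, reproduced below as tuples; `nb_score` is an illustrative name.

```python
# Columns: outlook, temperature, humidity, wind, play
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def nb_score(query, label):
    """Unnormalized P(label) * product over attributes of P(attribute value | label)."""
    rows = [r for r in DATA if r[-1] == label]
    score = len(rows) / len(DATA)  # class prior
    for i, value in enumerate(query):
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score


query = ("Sunny", "Cool", "High", "Strong")
scores = {label: nb_score(query, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Dividing each score by the sum of both would give normalized posteriors, but the argmax does not need it.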

- issues continued
    - hidden variables
    - incomplete or missing data
        - missing attribute values
    - patterns that don't appear at all
    - solution ideas
        - $h_{ML} = argmax_{h \in H} P(D|h)$, where $D = [x, Z]$
        - Z is unobserved so take the expectation over all values of Z
        - $h_{ML} = argmax_{h \in H} E[log(P(x,Z|h))]$
            - $= argmax_{h \in H} \sum_{z} P(Z=z|x,h)log(P(x,z|h))$
    - implementation
        - EM algorithm
            - Expectation Maximization
            - iterative
            - $\theta$ = set of known and unknown parameters for h
                - e.g. the CPTs
            - start with a random guess for unknown parameters of $\theta$
            - repeat until convergence
                - E-step: estimate the hidden variables
                - M-step: estimate the parameters
            - $\theta^{i+1} = argmax_\theta \sum_z P(Z=z|x,h,\theta^i)log(P(x,Z=z|\theta))$
                        
                - where $\theta^i$ is the current estimate of the parameters
                - $\theta^{i+1}$ is the next estimate of the parameters
                - $P(Z=z|x,h,\theta^i)$ is the probability of the hidden variable given the data and the current estimate of the parameters
                - $P(x,Z=z|\theta)$ is the probability of the data and the hidden variable given the parameters
            - e.g. robot except we don't know the value of L for the 2nd to last row
                - $\theta_1$ = proportion of L=True
                - pick a random value for $\theta_1 \in [0,1]$
                - repeat until convergence
                    - E-step
                        - $c_1 = P(-g, -m, -b, +l | \theta_1^i)$
                            - = $P(-g|-b)P(-m|-b, +l, \theta_1^i)P(-b)P(+l|\theta_1^i)$
                            - = $(6/6)(1)(6/100)(64+\theta_1^i \cdot 4)/100$
                        - $c_2 = P(-g, -m, -b, -l | \theta_1^i)$
                            - = $P(-g|-b)P(-m|-b, -l, \theta_1^i)P(-b)P(-l|\theta_1^i)$
                            - = $(6/6)(1)(6/100)(32+(1-\theta_1^i) \cdot 4)/100$
                    - M-step
                        - $\theta_1^{i+1} = \frac{c_1}{c_1 + c_2}$
                - code version below:

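```python
from random import random

# start from a random guess for theta_1
th1 = random()
allth1 = [th1]

for i in range(6):
    # E-step: unnormalized probabilities for L=True (c1) and L=False (c2)
    c1 = 1 * 1 * (6/100) * (64 + th1*4) / 100
    c2 = 1 * 1 * (6/100) * (32 + (1-th1)*4) / 100
    # M-step: re-estimate theta_1
    th1 = c1 / (c1 + c2)
    allth1.append(th1)

# converges pretty dang well within just a few iterations
print([round(t1, 6) for t1 in allth1])
```

Output from one run (the first value is the random initial guess):

```
[0.225044, 0.649002, 0.66596, 0.666638, 0.666666, 0.666667, 0.666667]
```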


- EM algorithm can actually be used to infer the entire structure of a network
    - downside is that there may be many possible structures
        - can use hill climbing to estimate the best structure

### Reinforcement Learning
- 18Apr2024

- Reinforcement Learning
    - learning from rewards
    - the agent interacts with the environment
        - the agent takes an action
        - the environment responds with a new state and a reward
        - the agent uses the reward to update its policy
    - issues with RL
        - Thorndike's Law of Effect
            - "responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation"
            - conditioning
                - can consider an animal to be an agent which increases weights given good outcomes and decreases weights given bad outcomes
                - effect is often proportional to the magnitude of the reward/punishment
        - min/max a performance measure of a system over time
            - e.g. optimizing a system to maximize the number of cars that can pass through a toll booth in a given time
        - combines search and memory
            - search is selectional
                - try alternatives by comparing consequences
                - e.g. natural selection
            - memory is associative
                - remember what worked in the past and in what situations
                - strengthen the weight of the action in the state if it leads to reward
                - weaken the weight of the action in the state if it leads to punishment
        - trial and error search
            - not told, must discover
        - credit assignment problem
            - delayed reward means that the agent must figure out which actions led to the reward
        - exploration vs exploitation tradeoff
            - exploit what you know to get the reward
                - me at the UD creamery
            - explore to find a better reward
                - me when Taco Bell has a new online exclusive
                - risks lesser or no reward
        - policy space
            - the space of all possible policies
            - $|A|^{|S|}$ where A is the set of actions and S is the set of states
                        
                - e.g. 4 actions and 10 states $\implies 4^{10}$

- examples of RL
    - chess player
        - move informed by planning and judgments of desirability
    - preparing breakfast
        - behavior + interlocking goal/subgoal relationships
        - e.g. walk to cupboard, open, select, reach, grasp, retrieve, etc
            - can be represented as a hierarchy of actions
- applications
    - Samuel's checkers player
    - Tesauro's backgammon player (1992 - 1995)
        - trained by playing against itself
        - multilayered neural network
        - learned to play at a world class level
        - utilized features for the board space
    - acrobot
        - 2 jointed pendulum
            - shoulder is motorized, elbow is free to move
        - goal is to swing the pendulum "hand" above a certain height
        - given a -1 reward for each time step until the goal is reached

http://incompleteideas.net/book/the-book-2nd.html
- Sutton and Barto's book on RL
    - free

- learning to *do*
    - state $\implies$ action $\implies$ reward
    - goal is encoded as a reward passed from environment to agent
    - agent must learn to maximize the cumulative reward
    - e.g. pick and place robot
        - states: arm angles and velocities
        - actions: voltages to the motors
        - rewards: +1 for successful pick and place, $-\epsilon$ for jerkiness of path
    - e.g. learning to walk
        - use reward proportional to distance traveled per unit time
    - e.g. Unity Puppo
        - dog simulation
        - reward = $0.01 \times \vec{v} \cdot \vec{d} - 0.001 - 0.001 \times \theta + \begin{cases} 1 & \text{if at target} \\ 0 & \text{if not} \end{cases}$
        - $\vec{v}$ is the velocity of the dog
    - discounting
        - rewards received now are more valuable than rewards received in the future (at a discount)
    - value function
        - as previously discussed
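
Discounting can be made concrete with the sum $\sum_t \gamma^t R_t$ from earlier in these notes. The sketch below is illustrative only; `discounted_return` is a hypothetical helper name and the reward list is made up.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)  # all rewards count equally
myopic = discounted_return(rewards, gamma=0.0)        # only the immediate reward counts
halved = discounted_return(rewards, gamma=0.5)        # 1 + 0.5 + 0.25
```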


- Q-Learning
    - when you don't have a model, you can still learn a value for each state-action pair
    - Q(s,a) = expected cumulative reward of taking action a in state s and acting optimally afterwards
        - V(s) = max_a Q(s,a)
            - the value of a state is the maximum value of the actions that can be taken in that state
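
The notes do not spell out how the Q values are learned. A minimal sketch of the standard tabular Q-learning update is shown here, assuming a learning rate `alpha` and discount `gamma` (both illustrative values) and hypothetical helper names `q_update` and `state_value`.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: nudge Q(s, a) toward r + gamma * V(s_next)."""
    v_next = max(Q[(s_next, a2)] for a2 in actions)  # V(s') = max_a Q(s', a)
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])

def state_value(Q, s, actions):
    """V(s) = max_a Q(s, a)."""
    return max(Q[(s, a)] for a in actions)

actions = ["left", "right"]
Q = defaultdict(float)  # unseen (state, action) pairs start at 0
q_update(Q, s="A", a="right", r=1.0, s_next="B", actions=actions)
```

No transition model P(s' | s, a) appears anywhere in the update; only the observed reward and next state are used.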

### Game Theory

- models of conflict and cooperation between rational decision makers
    - created by von Neumann and Morgenstern in 1944
    - has led to 10 Nobel Prizes (mostly in economics)
    - setup
        - multi-player games with simultaneous actions
        - main aspects
            - agent design
                - what is the best strategy against a rational player?
            - mechanism design
                - define rules/protocols so collective good is achieved
                    - e.g. TCP/IP
        - elements
            - players
            - actions
                - refers to an entire strategy or policy
            - payoffs

- e.g. Rock-Paper-Scissors
    - a zero sum game
        - the sum of the payoffs is zero for all possible outcomes
        - for 2 player games, payoff for player 1 is the negative of the payoff for player 2
        - can be generalized to just player 1's payoff as with previous examples
            - e.g. player 1 gets 1, player 2 gets -1
- e.g. Cake-Slicing "Game"
    - to fairly split a dessert between 2 people
    - one person cuts, the other chooses
        - maximin strategy
            - maximize the minimum payoff
        - minimax strategy
            - minimize the maximum payoff
        - both strategies lead to a 50/50 split if both players are rational
    - not a zero sum game in the strict sense
        - there is one cake and there is no negative cake, so the two payoffs always sum to one cake (a constant sum game)
    - assumes that a bigger piece is preferred
        - "take bigger" is then a dominant strategy
    - there is a saddle point at 50/50
        - a saddle point is an outcome where maximin = minimax, so neither player can do better by changing strategy alone
- e.g. Penny Matching Game
    - do all 2 player games have a saddle point?
    - the game is played as follows
        - 2 players each place a penny on the table
        - if pennies match, player 1 wins (+1) and player 2 loses (-1)
        - if pennies don't match, player 2 wins (+1) and player 1 loses (-1)
    - minimax
        - min of the max is +1
    - maximin
        - max of the min is -1
    - no saddle point
        - minimax $\ne$ maximin
    - strategy
        - play randomly
            - both playing randomly does create a saddle point
            - the expected value of the game is 0
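
The maximin and minimax values for the penny matching game can be checked directly from player 1's payoff matrix. This is a small sketch; the function names and the row/column layout are assumptions made for illustration.

```python
# Payoffs to player 1; rows are player 1's choice, columns are player 2's (H, T).
PENNIES = [[+1, -1],
           [-1, +1]]

def maximin(payoff):
    """Best guaranteed payoff for the row player: max over rows of the row minimum."""
    return max(min(row) for row in payoff)

def minimax(payoff):
    """Smallest cap the column player can enforce: min over columns of the column maximum."""
    return min(max(col) for col in zip(*payoff))

def expected_value(payoff, p, q):
    """Expected payoff to player 1 when row 0 is played w.p. p and column 0 w.p. q."""
    row_probs, col_probs = [p, 1 - p], [q, 1 - q]
    return sum(row_probs[i] * col_probs[j] * payoff[i][j]
               for i in range(2) for j in range(2))

has_pure_saddle = maximin(PENNIES) == minimax(PENNIES)  # -1 vs +1
value_when_random = expected_value(PENNIES, 0.5, 0.5)
```

Here `has_pure_saddle` is `False` and `value_when_random` is `0.0`, matching the bullets above.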


- solving a zero sum game
    - using the penny matching game as an example
    - assume players 1 and 2 use mixed strategies $R_p$ and $R_q$
        - $R_p = [p: H;\ (1-p): T]$
        - $R_q = [q: H;\ (1-q): T]$
            - where H is heads, T is tails, p is the probability that player 1 plays heads, and q is the probability that player 2 plays heads
    - utility of player 1
        - $U_1 = (p)(q) - (p)(1-q) - (1-p)(q) + (1-p)(1-q)$
            - i.e. HH - HT - TH + TT, since player 1 gets +1 on a match and -1 on a mismatch
            - simplifies to $U_1 = 4pq - 2p - 2q + 1$
    - find the saddle point
        - find the partial derivatives of $U_1$ with respect to $p$ and $q$ and where they are equal to 0
            - $\frac{\partial U_1}{\partial p} = q - (1-q) + q + (q-1)$
                    
                - $= 4q - 2$
                
                - $= 0 \implies q = 0.5$
                            
            - $\frac{\partial U_1}{\partial q} = p + p - (1-p) - (1-p)$
                    
                - $= 4p - 2$
                            
                - $= 0 \implies p = 0.5$            
        - determine whether the critical point is a maximum, a minimum, or a saddle
            - $\frac{\partial^2 U_1}{\partial p^2} = 0$
            - $\frac{\partial^2 U_1}{\partial q^2} = 0$
            - $\frac{\partial^2 U_1}{\partial p \partial q} = \frac{\partial^2 U_1}{\partial q \partial p} = 4$
            - determinant of the Hessian = (0)(0) - (4)(4) = -16
            - since the determinant is negative, the critical point is neither a maximum nor a minimum
            - the Hessian matrix yields eigenvalues of -4 and 4
                - since the eigenvalues are of opposing signs, this is a saddle point
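
A quick numerical check of this derivation, as a sketch only. It assumes the payoffs from the penny matching rules (+1 to player 1 on a match, -1 on a mismatch); `u1` and `symmetric_2x2_eigenvalues` are hypothetical helper names.

```python
import math

def u1(p, q):
    """Expected payoff to player 1: +1 on HH or TT, -1 on HT or TH."""
    return p * q - p * (1 - q) - (1 - p) * q + (1 - p) * (1 - q)

def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius

# u1 expands to 4pq - 2p - 2q + 1, so both pure second partials are 0
# and the mixed partial is 4, giving the Hessian [[0, 4], [4, 0]].
eigenvalues = symmetric_2x2_eigenvalues(0.0, 4.0, 0.0)

# At p = q = 0.5 neither player can shift the value by deviating alone.
values_if_p1_deviates = [u1(p, 0.5) for p in (0.0, 0.25, 1.0)]
values_if_p2_deviates = [u1(0.5, q) for q in (0.0, 0.25, 1.0)]
```

Every entry in both lists is 0, the value of the game, which is what makes p = q = 0.5 a saddle point in the game-theoretic sense.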

- Minimax Theorem (von Neumann, 1928)
    - every finite, zero sum, two player game has a rational solution in mixed strategies such that minimax = maximin
        - i.e. the saddle point
- Nash Equilibrium
    - also applies to non-zero sum games
    - a point where no player can improve their payoff by changing their strategy
        - assumes that the other player maintains their strategy
    - a point where neither player has regret
        - i.e. they are happy with their strategy
        - no Monday morning quarterbacking
    - tldr: neither player wants to change their strategy
    - Existence Theorem (Nash, 1950)
        - every finite game has at least one Nash equilibrium, possibly in mixed strategies
        - not necessarily unique, rational, or efficient
        - equilibria may have "strange or undesirable properties" (Straffin)
            - as defined by the game/society, not necessarily by the players
    - positive example: the Driving Game
        - 2 players driving towards each other on a 2 lane road
        - each player can choose to drive on the left or right side of the road
        - if both players choose the same side, they both get 1
        - if they choose different sides, they both get 0
        - the Nash equilibria are the outcomes where both players choose the same side (both left or both right)
            - if one player changes their strategy, the other player will change their strategy to match
            - once they match, switching sides alone drops both payoffs from 1 to 0, so neither player wants to deviate
        - **$\implies$ cooperation is the optimal strategy**
    - negative example: the Prisoner's Dilemma
        - 2 prisoners are arrested for a crime
        - they are separated and offered a deal
            - if one confesses and the other doesn't, the confessor goes free and the other gets 10 years
            - if both confess, they both get 5 years
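            - if neither confesses, they both get 1 year
        - the Nash equilibrium is for both prisoners to betray (confess)
            - confessing is the better reply whatever the other prisoner does (0 vs 1 year, or 5 vs 10 years)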
        - **$\implies$ betrayal is the optimal strategy**
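
Both examples can be checked by brute force: an outcome is a pure-strategy Nash equilibrium if neither player gains by deviating alone. The sketch below uses the payoffs given in the notes; the function name and the encoding of prison years as negative payoffs are assumptions made for illustration.

```python
from itertools import product

def pure_nash_equilibria(payoffs):
    """Return the action pairs where neither player gains by deviating alone.

    payoffs maps (action_1, action_2) -> (payoff_1, payoff_2).
    """
    actions_1 = sorted({a for a, _ in payoffs})
    actions_2 = sorted({b for _, b in payoffs})
    equilibria = []
    for a, b in product(actions_1, actions_2):
        u1, u2 = payoffs[(a, b)]
        best_for_1 = all(payoffs[(alt, b)][0] <= u1 for alt in actions_1)
        best_for_2 = all(payoffs[(a, alt)][1] <= u2 for alt in actions_2)
        if best_for_1 and best_for_2:
            equilibria.append((a, b))
    return equilibria

# Driving Game: matching sides pays 1 each, mismatching pays 0 each.
driving = {("left", "left"): (1, 1), ("left", "right"): (0, 0),
           ("right", "left"): (0, 0), ("right", "right"): (1, 1)}

# Prisoner's Dilemma: payoffs are negated years in prison.
prisoners = {("silent", "silent"): (-1, -1), ("silent", "confess"): (-10, 0),
             ("confess", "silent"): (0, -10), ("confess", "confess"): (-5, -5)}
```

`pure_nash_equilibria(driving)` returns both matching outcomes, while `pure_nash_equilibria(prisoners)` returns only mutual confession, even though mutual silence would leave both prisoners better off.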

- iterated games
    - games that are played multiple times
    - e.g. the Prisoner's Dilemma
        - if the game is played once, the Nash equilibrium is to betray
        - if the game is played multiple times, the Nash equilibrium may be to cooperate (stay silent)
            - i.e. other factors and strategies may come into play
        - meta strategies in iterated PD
            - perpetual punishment
                - cooperate until the other player betrays, then betray forever
            - tit-for-tat
                - cooperate until the other player betrays, then betray on the next move
                - performs very well in competitions with large pools
            - tit-for-two-tats
                - cooperate until the other player betrays twice in a row, then betray on the next move
    - e.g. Penny Game
        - if the game is played once, the Nash equilibrium is to play randomly
        - if you play over time you may find the other player is not perfectly random
            - you can then exploit the other player's bias
- Real world examples
    - can be used to explain cooperation
        - BitTorrent unchoking
        - altruism in animal behavior
        - WWI trench truces
    - simulated societies
        - "Sugarscape"
    - issues
        - limited (human) memory
            - you can't remember everyone who ever cut you off
        - adaptation over time
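
The meta strategies for the iterated Prisoner's Dilemma can be sketched as small functions of the move history. This is an illustration under assumptions: the prison terms come from the notes, but the `"C"`/`"D"` labels, the function names, and the fixed number of rounds are hypothetical choices.

```python
# Years in prison (lower is better), taken from the Prisoner's Dilemma above.
# "C" = cooperate with the other prisoner (stay silent), "D" = defect (confess).
YEARS = {("C", "C"): (1, 1), ("C", "D"): (10, 0),
         ("D", "C"): (0, 10), ("D", "D"): (5, 5)}

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the other player's previous move."""
    return their_history[-1] if their_history else "C"

def perpetual_punishment(my_history, their_history):
    """Cooperate until the other player defects once, then defect forever."""
    return "D" if "D" in their_history else "C"

def always_defect(my_history, their_history):
    """The one-shot Nash strategy, repeated every round."""
    return "D"

def play(strategy_1, strategy_2, rounds):
    """Return total years in prison for each player over the given rounds."""
    h1, h2, total_1, total_2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = strategy_1(h1, h2), strategy_2(h2, h1)
        y1, y2 = YEARS[(m1, m2)]
        total_1, total_2 = total_1 + y1, total_2 + y2
        h1.append(m1)
        h2.append(m2)
    return total_1, total_2
```

Over 10 rounds, two tit-for-tat players serve 10 years each, while two perpetual defectors serve 50 years each, which is one way to see why cooperation can pay off once the game repeats.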

### 30Apr24

- AI/ML/Data Science Pipeline
    - data acquisition
    - data processing
    - data integration
    - analytical modeling
    - validation
    - presentation
- misc
    - decision list
        - decision tree with yes/no options at each layer
            - if yes, classification
            - if no, move to the next layer
        - may require more data than a decision tree for training
        - much more explainable than a decision tree
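
A decision list can be sketched as an ordered chain of yes/no checks where the first "yes" decides the class. The rules below are hypothetical and only illustrate the shape; they are not from the lecture.

```python
def classify(rules, default, example):
    """Walk the rules in order; the first condition that holds decides the label."""
    for condition, label in rules:
        if condition(example):  # "yes": classify and stop
            return label
        # "no": fall through to the next layer
    return default

# Hypothetical rules for illustration only.
rules = [
    (lambda x: x["temperature_c"] >= 39.0, "see a doctor"),
    (lambda x: x["days_sick"] > 7, "see a doctor"),
    (lambda x: x["temperature_c"] >= 37.5, "rest at home"),
]

label = classify(rules, "no action", {"temperature_c": 38.0, "days_sick": 2})
```

Because each prediction is explained by the single rule that fired, the model can be read top to bottom like a checklist.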

### Simpler Machine Learning Models
- Cynthia Rudin
- CISE Distinguished Lecture Series video
- possibly available on the NSF website   

- goal is to increase the interpretability/explainability of machine learning models
- many problems do not require or benefit from complex models
    - e.g. 2HELPS2B
        - an algorithm that can be calculated by hand or even in your head
        - corresponds to 6 attributes in a healthcare scenario
            - add up the values of the attributes
            - score corresponds to a risk level
    - for tabular data, neural networks generally are not better than simpler models
        - e.g. decision trees, logistic regression, etc
- why?
    - one explanation is the Rashomon Set Theory
        - there's no single best explanation but lots of good explanations
        - when there are lots of "almost optimal" models, there will be simpler models in that set
        - Rashomon Set: a set of models that are all about equally good for a given data set
        - large Rashomon sets correlate with the existence of simpler models
            - also correlate with many different methods being similarly effective
        - extra features can end up being noise
        - implication is that complex models are not always necessary, despite the urge to use them
        - **simpler models are explainable and may be just as effective**
        - e.g. FICO dataset of 10k loan applications
            - loads of features
            - FICO requested a black box model 
            - best black box accuracy was 73%
                - boosted decision trees 
            - best black box AUC was 80%
                - (2-layer neural network)
            - Fast Sparse (not black box)
                - uses 21 step features
                - runs in ~ 3.85 seconds
                - separates some variables into subgroups
                    - e.g. age can be separated into groups of "age > 25" and "age > 35", etc
                - model is explainable
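
The two ideas above, splitting a variable into threshold subgroups and adding up points by hand, can be sketched together. The thresholds and point values below are hypothetical; they are not the 2HELPS2B attributes or the features learned on the FICO data.

```python
def step_features(value, thresholds):
    """Turn one number into 0/1 indicators, one per threshold (value > threshold)."""
    return [1 if value > t else 0 for t in thresholds]

def risk_score(indicators, points):
    """Add up the points of every indicator that is switched on."""
    return sum(p for on, p in zip(indicators, points) if on)

# Hypothetical thresholds and point values, for illustration only.
AGE_THRESHOLDS = [25, 35, 50]
AGE_POINTS = [1, 1, 2]

indicators = step_features(40, AGE_THRESHOLDS)  # age > 25 and age > 35 hold
score = risk_score(indicators, AGE_POINTS)
```

A lookup table from total score to risk level would complete the model, and the whole thing can still be evaluated on paper.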

- Optimal Sparse Decision Trees
    - NP hard problem to find the optimal sparse decision tree
        - factorial of the number of features
    - GOSDT
        - Generalized and Scalable Optimal Sparse Decision Trees
        - the goal is to find the best decision tree with the fewest number of nodes
        - the best feature to split on is based on the next best feature to split on which is based on the next best feature to split on, etc
            - each subproblem is represented as a binary vector
            - creates a giant dependency graph
            - use theorems to prune the graph
            - propagate bounds up the tree
                - sort of like alpha-beta pruning
        - when used on FICO dataset
            - 10k data points
            - > 1900 binary features
            - 10 leaves
            - accuracy of 71.7%
            - ~8.1 seconds to run
            - roughly on par with the black box models
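
The notes do not state the exact objective being optimized. One common way to formalize "best tree with the fewest nodes" is misclassification rate plus a fixed penalty per leaf; the sketch below assumes that form, and the candidate trees and `leaf_penalty` value are made up for illustration.

```python
def sparse_tree_objective(n_errors, n_samples, n_leaves, leaf_penalty):
    """Misclassification rate plus a fixed cost per leaf (lower is better)."""
    return n_errors / n_samples + leaf_penalty * n_leaves

# Hypothetical candidate trees: (name, training errors, leaves) on 1000 samples.
candidates = [("stump", 320, 2), ("small", 285, 10), ("bushy", 270, 60)]

best = min(candidates,
           key=lambda c: sparse_tree_objective(c[1], 1000, c[2], leaf_penalty=0.001))
```

Under this penalty the 10-leaf tree wins: the bushier tree makes fewer errors, but not enough to pay for its extra leaves.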

- why everything in the current ML paradigm is wrong
    - recommendation: hand decision as to which model to use to the user
        - i.e. give them the whole Rashomon Set of good enough models
    - TreeFARMS
        - returns all almost optimal trees
        - TimberTrek
            - a GUI that allows the user to explore the trees in the Rashomon Set returned by TreeFARMS
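
The idea of handing the user every "good enough" model can be sketched as a simple filter. This is not the TreeFARMS algorithm or its API; the `epsilon` tolerance, the function name, and the model list are hypothetical.

```python
def rashomon_set(models, epsilon):
    """Keep every model whose loss is within epsilon of the best loss."""
    best_loss = min(loss for _, loss, _ in models)
    return [m for m in models if m[1] <= best_loss + epsilon]

# Hypothetical (name, loss, number of leaves) triples, for illustration only.
models = [("tree_a", 0.280, 12), ("tree_b", 0.285, 6),
          ("tree_c", 0.290, 4), ("tree_d", 0.350, 2)]

good_enough = rashomon_set(models, epsilon=0.015)
simplest = min(good_enough, key=lambda m: m[2])  # fewest leaves among the good ones
```

The user can then pick from `good_enough` on grounds other than loss, such as size or which features a tree relies on.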