# Knowledge Representation & Propositional Logic

### Logical Agents
- things that happen in the world aka semantics
    - data/sensing
    - actions/new conditions
    - it can be said that actions/conditions follow from data/sensing
- representations aka syntax
    - formally represented facts
    - via reasoning, new facts can be derived or inferred
- data sensing -> represented facts -> inferred facts -> actions/conditions

### Knowledge Bases
- some examples
    - general rules
        - small AND flying -> edible (for a frog)
    - common sense
        - looks like a duck AND quacks like a duck -> it is a duck
    - negative information
        - NOT keys are on the counter (if I'm looking for my keys)
    - incomplete information
        - keys are on the counter OR keys are in the car
### Propositional Logic
- you might see this in a philosophy class
- a formal language for representing knowledge
- semantics associate elements/propositions with truth values in the world
- inference rules
- construction
    - consists of well-formed formulas (wffs)
        - atoms
            - True, False
            - P, Q, R, ...
                - "propositional symbols" which represent statements
                - e.g. P = "it is raining"
        - grouping
            - (P AND Q)
        - negation
            - NOT P
        - AND or OR
            - (P AND Q) OR (R AND S)
        - implication
            - P => Q
                - P is the antecedent and Q is the consequent (or conclusion)
        - biconditional, iff
            - P <=> Q
        - literal
            - an atom or its negation
            - e.g. P, NOT P
- e.g. of wffs
    - P
    - NOT NOT P
    - P AND Q
    - (P AND Q) => R
- e.g. of non-wffs
    - P AND
    - P NOT Q
- models
    - fixes the truth value of each proposition
        - e.g. P = True, Q = False
        - usually represented as a truth table
        - each row of the table or defined arrangement is a model
- partial models
    - only some of the propositions are assigned truth values
    - e.g. P = True, Q = ?
    - can be used to represent incomplete information
    - may be consistent with multiple defined models
- semantics
    - you can define the truth value of a wff in terms of the truth values of its subformulas
- equivalence
    - two wffs are equivalent if they have the same truth value in every model
    - e.g. P AND Q is equivalent to Q AND P
    - the symbol $\equiv$ is used to denote equivalence
- tautology
    - a wff that is true in every model
    - e.g. P OR NOT P
- unsatisfiable
    - a wff that is false in every model
    - e.g. P AND NOT P
- satisfiable
    - a wff that is true in at least one model
    - e.g. P OR Q
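The definitions of model, equivalence, tautology, satisfiable and unsatisfiable above can be checked mechanically by enumerating every model. The following is a minimal sketch, assuming a wff is written as a Python function from a model (a dict of truth values) to a bool; the helper names `models`, `classify` and `equivalent` are illustrative, not from the lecture.

```python
from itertools import product

def models(symbols):
    """Yield every model (assignment of True/False) over the given symbols."""
    for values in product([True, False], repeat=len(symbols)):
        yield dict(zip(symbols, values))

def classify(wff, symbols):
    """Return 'tautology', 'unsatisfiable', or 'satisfiable' for a wff.

    A tautology is also satisfiable; the most specific label is returned.
    """
    results = [wff(m) for m in models(symbols)]
    if all(results):
        return "tautology"
    if not any(results):
        return "unsatisfiable"
    return "satisfiable"

def equivalent(wff1, wff2, symbols):
    """Two wffs are equivalent if they have the same truth value in every model."""
    return all(wff1(m) == wff2(m) for m in models(symbols))

print(classify(lambda m: m["P"] or not m["P"], ["P"]))      # tautology
print(classify(lambda m: m["P"] and not m["P"], ["P"]))     # unsatisfiable
print(classify(lambda m: m["P"] or m["Q"], ["P", "Q"]))     # satisfiable
print(equivalent(lambda m: m["P"] and m["Q"],
                 lambda m: m["Q"] and m["P"], ["P", "Q"]))  # True
```

Each row yielded by `models` corresponds to one row of the truth table.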

- logical combinatorics
    - given a sentence and a universe, in how many models is the sentence true?
        - i.e. how many rows in the truth table have the sentence true?
        - e.g. in U = {P, Q}
            - P is true in 2 models
            - P AND Q is true in 1 model
        - e.g. in U = {P, Q, R, S}
            - P AND Q is true in 4 models
                - one combination of P and Q is true and this occurs in 2 models of R and 2 models of S
                - 1 * 2 * 2 = 4
            - P OR Q is true in 12 models
                - 3 * 2 * 2 = 12
                - or you can think of it as 16 - 4 = 12
                    - 16 total models - 4 models where P OR Q is false
        - e.g. |U| = n
            - $2^n$ models
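The counts in the logical combinatorics examples can be reproduced by brute force. This is a small sketch under the same assumption as before (a sentence is a Python function of a model); `count_models` is an illustrative name.

```python
from itertools import product

def count_models(sentence, universe):
    """Count the models over `universe` in which `sentence` is true."""
    count = 0
    for values in product([True, False], repeat=len(universe)):
        model = dict(zip(universe, values))
        if sentence(model):
            count += 1
    return count

U2 = ["P", "Q"]
U4 = ["P", "Q", "R", "S"]
p = lambda m: m["P"]
p_and_q = lambda m: m["P"] and m["Q"]
p_or_q = lambda m: m["P"] or m["Q"]

print(count_models(p, U2))        # 2
print(count_models(p_and_q, U2))  # 1
print(count_models(p_and_q, U4))  # 4  = 1 * 2 * 2
print(count_models(p_or_q, U4))   # 12 = 16 - 4
```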

- inference
    - $W_1 \Rightarrow W_1 \lor W_2$  
    - $W_1 \land W_2 \Rightarrow W_1$
    - $W_1, W_2 \Rightarrow W_1 \land W_2$
        - the comma is used to denote a set of wffs that are known to be true
    - $\lnot \lnot W_1 \Rightarrow W_1$
    - modus ponens
        - $W_1, W_1 \Rightarrow W_2 \therefore W_2$
- proofs
    - using inference rules to derive new wffs from old ones
    - knowledge base is represented as $\Delta$
        - $\Delta = \{P, R, P \Rightarrow Q\}$
        
            - the wffs in $\Delta$ are known to be true
    - $\Delta = \{BATTERY\_OK \land LIFTABLE \Rightarrow ARM\_MOVES\}$
    
    - $\Delta = \{B, \lnot M, B \land L \Rightarrow M\}$
        
        1. $B$, given
        2. $\lnot M$, given
        3. $B \land L \Rightarrow M$, given
        4. $B \land L$, given
        5. ...
- properties of sets of inference rules
    - soundness: proofs are valid
        - everything you can prove is actually true
        - constructive definition:
            - W is True under all models in which $\Delta$ is true
                - we will revisit this later
    - completeness: everything that is true can be proven
        - everything that is true is derivable 
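The soundness condition above (W is true in every model in which $\Delta$ is true) can be tested directly by enumeration. A minimal sketch, assuming wffs are Python functions of a model and writing $P \Rightarrow Q$ as `(not P) or Q`; `entails` is an illustrative helper, not a library call.

```python
from itertools import product

def entails(kb, wff, symbols):
    """Return True if `wff` is true in every model where all of `kb` is true."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(sentence(model) for sentence in kb) and not wff(model):
            return False  # found a model of the KB in which wff fails
    return True

# Delta = {P, R, P => Q}
delta = [
    lambda m: m["P"],
    lambda m: m["R"],
    lambda m: (not m["P"]) or m["Q"],
]
symbols = ["P", "Q", "R"]

print(entails(delta, lambda m: m["Q"], symbols))      # True: what modus ponens derives
print(entails(delta, lambda m: not m["R"], symbols))  # False
```

An inference rule is sound exactly when everything it derives passes this kind of check.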

- Resolution
    - $P \lor Q, \lnot P \lor R \therefore Q \lor R$
        
        - resolution on P
    - $R, \lnot R \lor P \therefore P$
        
        - unit resolution on R
            - unit resolution is a special case of resolution where one of the clauses is a unit clause
                - a unit clause is a clause with only one literal
    - $P \lor Q \lor R \lor S, \lnot P \lor Q \lor W \therefore Q \lor R \lor S \lor W$
        
        - i.e. you can combine the two clauses and remove the P literal
    - $P \lor Q \lor \lnot R, P \lor W \lor \lnot Q \lor R \therefore P \lor \lnot R \lor W \lor R$
        
        - resolution on Q
        - this statement is always true
        - could also resolve on R
            - $\therefore P \lor Q \lor W \lor \lnot Q$
            
            - this statement is also always true
        - can you perform resolution on both simultaneously?
            - no, you can only resolve on one literal at a time
            - resolving on both at once would give $P \lor W$, which is not always true and does not follow from the two clauses
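A single resolution step is easy to express with clauses as sets of literals. The sketch below assumes a literal is a string such as `"P"` or `"~P"` (the `~` prefix is just a convention chosen here) and resolves on exactly one symbol at a time, as the notes require.

```python
def negate(literal):
    """Return the complementary literal: P <-> ~P."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(clause_a, clause_b, symbol):
    """Resolve two clauses on `symbol`; return None if they do not clash on it."""
    pos, neg = symbol, negate(symbol)
    if pos in clause_a and neg in clause_b:
        return frozenset((clause_a - {pos}) | (clause_b - {neg}))
    if neg in clause_a and pos in clause_b:
        return frozenset((clause_a - {neg}) | (clause_b - {pos}))
    return None

def is_tautology(clause):
    """A clause containing both X and ~X is true in every model."""
    return any(negate(lit) in clause for lit in clause)

c1 = frozenset({"P", "Q", "~R"})
c2 = frozenset({"P", "W", "~Q", "R"})

on_q = resolve(c1, c2, "Q")   # P, ~R, W, R
on_r = resolve(c1, c2, "R")   # P, Q, W, ~Q
print(sorted(on_q), is_tautology(on_q))
print(sorted(on_r), is_tautology(on_r))
print(sorted(resolve(frozenset({"P", "Q"}), frozenset({"~P", "R"}), "P")))
```

Both resolvents of `c1` and `c2` contain a complementary pair, which is why they carry no information.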

### 27FEB2024
- resolution continued
    -  $P \lor Q, \lnot P \lor R \therefore Q \lor R$
        
        - resolution on P
        - if P is true, then $\lnot$ P is false
            - thus R must be true for $\lnot P \lor R$ to hold
        - if P is false, then Q must be true for $P \lor Q$ to hold
        - either way, $Q \lor R$ is true
        - this is a sound inference rule
        - can be shown with a truth table
- resolution in knowledge bases in conjunctive normal form (CNF)
    - can be abbreviated
        - $(R \lor P) \land (Q \lor R)$ can be written as $R \lor P, Q \lor R$ 
- any wff can be converted to CNF
    - e.g. $\lnot(P \Rightarrow Q) \lor (R \Rightarrow P)$
    1. eliminate biconditionals
        - $\alpha \Leftrightarrow \beta$ is equivalent to $(\alpha \Rightarrow \beta) \land (\beta \Rightarrow \alpha)$
        - e.g. $\lnot(P \Rightarrow Q) \lor (R \Rightarrow P)$ has no biconditionals, so it is unchanged
    2. eliminate implications
        - $\alpha \Rightarrow \beta$ is equivalent to $\lnot \alpha \lor \beta$
        - e.g. $\lnot(P \Rightarrow Q) \lor (R \Rightarrow P)$ becomes
            - $\lnot(\lnot P \lor Q) \lor (\lnot R \lor P)$
    3. move negations inwards with De Morgan's laws and remove double negations
        - $\lnot(\alpha \land \beta)$ is equivalent to $\lnot \alpha \lor \lnot \beta$
        - $\lnot(\alpha \lor \beta)$ is equivalent to $\lnot \alpha \land \lnot \beta$
        - $\lnot \lnot \alpha$ is equivalent to $\alpha$
        - e.g. $\lnot(\lnot P \lor Q) \lor (\lnot R \lor P)$ becomes
            - $(\lnot \lnot P \land \lnot Q) \lor (\lnot R \lor P)$
            - $(P \land \lnot Q) \lor (\lnot R \lor P)$
    4. distribute OR over AND
        - $\alpha \lor (\beta \land \gamma)$ is equivalent to $(\alpha \lor \beta) \land (\alpha \lor \gamma)$
        - e.g. $(P \land \lnot Q) \lor (\lnot R \lor P)$ becomes
            - $((\lnot R \lor P) \lor P) \land ((\lnot R \lor P) \lor \lnot Q)$
            - $(\lnot R \lor P \lor P) \land (\lnot R \lor P \lor \lnot Q)$
            - $(\lnot R \lor P) \land (\lnot R \lor P \lor \lnot Q)$
            - in CNF: $\lnot R \lor P, \lnot R \lor P \lor \lnot Q$
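A quick way to sanity-check a CNF conversion is to confirm that the starting wff and the final clauses agree in every model. The sketch below does this for the worked example, reading the example's arrows as material implication; the function names are illustrative.

```python
from itertools import product

def implies(a, b):
    """Material implication: a => b is (not a) or b."""
    return (not a) or b

def original(p, q, r):
    # not (P => Q)  or  (R => P)
    return (not implies(p, q)) or implies(r, p)

def cnf(p, q, r):
    # (not R or P)  and  (not R or P or not Q)
    return ((not r) or p) and ((not r) or p or (not q))

same = all(original(p, q, r) == cnf(p, q, r)
           for p, q, r in product([True, False], repeat=3))
print(same)  # True: the two forms agree in all 8 models
```

The check does not show how to derive the CNF, only that the result is equivalent to the input.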
- resolution algorithm
    - to prove $KB \therefore \alpha$, show that $KB \land \lnot \alpha$ is unsatisfiable
    1. convert $KB \land \lnot \alpha$ to CNF
    2. successively resolve until
        - no new clauses can be inferred
            - $KB \not\rightarrow \alpha$ (the KB does not entail $\alpha$)
        - the empty clause is inferred
            - the empty clause is unsatisfiable
            - $KB \rightarrow \alpha$ (the KB entails $\alpha$)
    - e.g. rover from past lecture
        - KB: $B \land L \Rightarrow M, B, \lnot M$
        1. convert: 
            - $\lnot (B \land L) \lor M, B, \lnot M$
            - $\lnot B \lor \lnot L \lor M, B, \lnot M$
            - set $\alpha = \lnot L$ and $\lnot \alpha = L$
        2. resolve:
            - $\lnot M$ and $\lnot B \lor \lnot L \lor M$ can become $\lnot B \lor \lnot L$
            - $B$ and $\lnot B \lor \lnot L$ can become $\lnot L$
            - $\lnot L$ and $L$ can become the empty clause
                - thus, $KB \rightarrow \alpha$
                - the rock is not liftable
        - the derivation can be drawn almost as a tree: start with the initial clauses and draw lines to each new clause that is inferred
            - each successive line is a resolution step and the width reduces as you go down the tree
            - eventually, you will reach the empty clause or no new clauses can be inferred
                - i.e. either the empty clause is inferred and the statement ($KB \land \lnot \alpha$) is unsatisfiable, or no new clauses can be inferred and the statement is satisfiable
- since the derivation is tree-structured, you can employ search algorithms
    - resolving with unit clauses first is a good strategy
        - dramatically reduces the search space and complexity
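The resolution algorithm above can be sketched as a loop that keeps resolving pairs of clauses until the empty clause appears or nothing new is produced. This is a naive version (no unit-clause preference or other search strategy), assuming literals are strings like `"B"` and `"~B"`; it is run on the rover knowledge base from the notes.

```python
from itertools import combinations

def negate(literal):
    """Return the complementary literal: B <-> ~B."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolvents(a, b):
    """All clauses obtainable by resolving a and b on a single literal."""
    out = set()
    for lit in a:
        if negate(lit) in b:
            out.add(frozenset((a - {lit}) | (b - {negate(lit)})))
    return out

def resolution_entails(kb_clauses, negated_query_clauses):
    """Return True if KB together with not-alpha derives the empty clause."""
    clauses = set(kb_clauses) | set(negated_query_clauses)
    while True:
        new = set()
        for a, b in combinations(clauses, 2):
            for r in resolvents(a, b):
                if not r:        # empty clause: KB and not-alpha is unsatisfiable
                    return True
                new.add(r)
        if new <= clauses:       # no new clauses can be inferred
            return False
        clauses |= new

# KB in CNF: ~B v ~L v M,  B,  ~M
kb = [frozenset({"~B", "~L", "M"}), frozenset({"B"}), frozenset({"~M"})]

print(resolution_entails(kb, [frozenset({"L"})]))   # alpha = ~L, not-alpha = L
print(resolution_entails(kb, [frozenset({"~L"})]))  # alpha = L,  not-alpha = ~L
```

The first call reports that the KB entails $\lnot L$ (the rock is not liftable); the second reports that it does not entail $L$.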

- DPLL (Davis-Putnam-Logemann-Loveland) algorithm
    - uses techniques from CSPs
        - degree heuristic
        - backtracking
        - random restarts
        - clever indexing
            - cache set of clauses in which X$_i$ appears
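A stripped-down DPLL can be sketched in a few lines. The version below only shows backtracking plus the unit-clause rule; it leaves out the degree heuristic, random restarts and clause indexing listed above. Clauses are assumed to be frozensets of string literals (`"P"`, `"~P"`), which is a representation chosen here for illustration.

```python
def negate(literal):
    """Return the complementary literal: P <-> ~P."""
    return literal[1:] if literal.startswith("~") else "~" + literal

def simplify(clauses, literal):
    """Assume `literal` is true: drop satisfied clauses, shrink the rest."""
    return [c - {negate(literal)} for c in clauses if literal not in c]

def dpll(clauses, assignment=None):
    """Return a list of literals that satisfies every clause, or None."""
    assignment = assignment or []
    if not clauses:
        return assignment                 # every clause is satisfied
    if any(len(c) == 0 for c in clauses):
        return None                       # empty clause: dead end, backtrack
    # unit-clause rule: a one-literal clause forces that literal
    unit = next((c for c in clauses if len(c) == 1), None)
    if unit is not None:
        (lit,) = unit
        return dpll(simplify(clauses, lit), assignment + [lit])
    # otherwise branch: try some literal true, then false
    lit = next(iter(clauses[0]))
    for choice in (lit, negate(lit)):
        result = dpll(simplify(clauses, choice), assignment + [choice])
        if result is not None:
            return result
    return None

kb = [frozenset({"~B", "~L", "M"}), frozenset({"B"}), frozenset({"~M"})]
print(dpll(kb + [frozenset({"L"})]))    # None: KB and L is unsatisfiable
print(dpll(kb + [frozenset({"~L"})]))   # a satisfying set of literals
```

Unlike WALKSAT, a `None` result here is a proof of unsatisfiability, because the search is exhaustive.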
- satisfiability problems
    - e.g. WALKSAT
        - pick an unsatisfied clause
        - flip one variable in the clause
            - chosen either at random or by min-conflicts (the flip that minimizes the number of unsatisfied clauses afterwards)
        - runs faster than DPLL on some problems
        - may not always find a solution even though one may exist
            - generally, run WALKSAT for a while and then run DPLL if WALKSAT fails in the time limit
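The WALKSAT loop described above can be sketched as follows. The parameter names `p` (probability of a purely random flip), `max_flips` and `seed` are choices made for this sketch, and clauses are again assumed to be sets of string literals.

```python
import random

def satisfied(clause, model):
    """A clause is satisfied if any of its literals is true in the model."""
    return any(model[lit.lstrip("~")] != lit.startswith("~") for lit in clause)

def walksat(clauses, p=0.5, max_flips=10_000, seed=0):
    """Return a satisfying model, or None if none is found within max_flips."""
    rng = random.Random(seed)
    symbols = sorted({lit.lstrip("~") for c in clauses for lit in c})
    model = {s: rng.choice([True, False]) for s in symbols}

    def cost_after_flip(symbol):
        """Number of unsatisfied clauses if `symbol` were flipped."""
        model[symbol] = not model[symbol]
        n = sum(not satisfied(c, model) for c in clauses)
        model[symbol] = not model[symbol]
        return n

    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not satisfied(c, model)]
        if not unsatisfied:
            return model
        clause = rng.choice(unsatisfied)            # pick an unsatisfied clause
        candidates = sorted(lit.lstrip("~") for lit in clause)
        if rng.random() < p:
            symbol = rng.choice(candidates)         # random-walk step
        else:
            symbol = min(candidates, key=cost_after_flip)  # min-conflicts step
        model[symbol] = not model[symbol]
    return None  # gave up; this does NOT prove unsatisfiability

kb = [frozenset({"~B", "~L", "M"}), frozenset({"B"}), frozenset({"~M"})]
print(walksat(kb + [frozenset({"~L"})]))               # a model with L false
print(walksat(kb + [frozenset({"L"})], max_flips=200)) # None (no model exists)
```

The second call returning `None` only means the flip budget ran out, which is why a complete method such as DPLL is the usual fallback.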
- restrict the type of problem
    - Horn clauses
        - at most one positive literal
        - e.g. $P \land Q \Rightarrow R$
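        - if the knowledge base is made up of Horn clauses, then entailment can be decided in time linear in the size of the knowledge base
            - e.g. $KB \land \lnot \alpha$
                - if $\lnot \alpha$ is also a Horn clause (e.g. $\alpha$ is a single positive literal), the whole problem stays Horn, and unit propagation (the unit-clause step of DPLL) decides it in linear time
                - WALKSAT is incomplete local search, so it has no comparable guarantee even on Horn clauses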
- forward and backward chaining
    - efficient for Horn clauses
    - forward chaining
        - start with known facts
        - repeatedly apply modus ponens
        - stop when no new facts can be inferred
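    - backward chaining
        - start with the goal
        - find rules whose conclusion matches the goal and make their premises the new subgoals
        - stop when every subgoal reduces to a known fact (goal proven) or no rule can establish a remaining subgoal (goal not proven)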
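Forward chaining over Horn clauses can be sketched as a loop that fires modus ponens until nothing changes. This is the simple repeated-scan version, not the linear-time agenda-based one; the rule format `(premises, conclusion)` and the second rule `R => S` are assumptions made for the example.

```python
def forward_chain(rules, facts, goal):
    """Return True if `goal` can be derived from `facts` using Horn `rules`.

    Each rule is (premises, conclusion), read as
    premise_1 AND ... AND premise_k => conclusion.
    """
    known = set(facts)
    changed = True
    while changed:                      # stop when no new facts are inferred
        changed = False
        for premises, conclusion in rules:
            # modus ponens: all premises are known, so the conclusion follows
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return goal in known

rules = [(["P", "Q"], "R"),   # P AND Q => R (from the notes)
         (["R"], "S")]        # R => S (hypothetical extra rule)

print(forward_chain(rules, ["P", "Q"], "S"))  # True
print(forward_chain(rules, ["P"], "S"))       # False
```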

- caching
    - store the results of previous inferences (e.g. in a list or dictionary) so they are looked up instead of recomputed

Manual

```python
fibnum = 35

def fibonacci(n):
    # plain recursion: the same subproblems are recomputed many times
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

%time print(fibonacci(fibnum))
```

```
9227465
CPU times: total: 1.7 s
Wall time: 2.12 s
```


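```python
fcache = [0, 1]  # fcache[n] holds fibonacci(n) once it has been computed

def man_cache_fibonacci(n):
    # return the stored value if this n was already computed
    if n < len(fcache):
        return fcache[n]
    # otherwise build it from the (cached) smaller values and store it
    value = man_cache_fibonacci(n-1) + man_cache_fibonacci(n-2)
    fcache.append(value)
    return value

%time print(man_cache_fibonacci(fibnum))
```

```
9227465
```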


via function decorator


```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember every result, with no size limit
def fib_cache(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib_cache(n-1) + fib_cache(n-2)

%time print(fib_cache(fibnum))
```

```
9227465
CPU times: total: 0 ns
Wall time: 999 µs
```


### Covid

### Probabilistic Reasoning I Review
- Bayes Networks
    - follow Bayes' theorem
        - $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
    - a directed acyclic graph (DAG)
        - nodes represent random variables
        - edges represent dependencies
        - each node has a conditional probability table (CPT)
            - the probability of the node given its parents
    - given k Boolean parents, the CPT has $2^k$ rows (one per combination of parent values)
        - normalization is done by dividing by the sum of the entries
            - this makes the sum of the entries equal to 1
    - simplified chain rule
        - $P(X, Y, Z) = P(X|Y, Z)P(Y|Z)P(Z)$
        - allows you to expand and simplify sums of probabilities in a Bayes network
        - it can be useful to draw the network to see when some things can be simplified
            - e.g. if X is independent of Z given Y, then $P(X|Y, Z) = P(X|Y)$
            - e.g. relationships that are evident in the network involving parents and children

### Probabilistic Reasoning II
- inference
    - given evidence, what is the probability of a query?
- exact inference
    - enumeration
        - enumerate all possible values of all variables
        - sum over all values that are consistent with the evidence
        - divide by the sum of all values
        - complexity is $O(2^n)$ for n binary variables
            - in general $O(d^n)$, where d is the number of values a variable can take on and n is the number of variables
            - computed in a Bayesian network with n variables by summing over the CPTs
            - i.e. the number of parent entries you need to know or compute to find the probability of a variable
    - variable elimination
        - eliminate variables that are not in the query or evidence
        - sum out variables that are not in the query
        - multiply and normalize
        - complexity is $O(n^2)$ where n is the number of variables
            - regroup terms and share partial results
    - junction tree
        - convert the Bayes network to a junction tree
        - sum out variables that are not in the query
        - multiply and normalize
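Inference by enumeration can be sketched on the small rover network (B, L, G, M) whose probabilities are listed in the sampling example later in these notes. The code sums the joint probability over every atomic event consistent with the evidence and then normalizes; the function names are illustrative.

```python
from itertools import product

P_B = 0.95                                  # P(B = true)
P_L = 0.7                                   # P(L = true)
P_G = {True: 0.95, False: 0.1}              # P(G = true | B)
P_M = {(True, True): 0.9, (True, False): 0.05,
       (False, True): 0.0, (False, False): 0.0}   # P(M = true | B, L)

def prob(p_true, value):
    """Probability that a Boolean variable takes `value`, given P(true)."""
    return p_true if value else 1.0 - p_true

def joint(b, l, g, m):
    """P(B=b, L=l, G=g, M=m) as the product of the CPT entries."""
    return (prob(P_B, b) * prob(P_L, l)
            * prob(P_G[b], g) * prob(P_M[(b, l)], m))

def query(var, evidence):
    """P(var = true | evidence) by summing the joint over all atomic events."""
    names = ["B", "L", "G", "M"]
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(names)):
        event = dict(zip(names, values))
        if all(event[k] == v for k, v in evidence.items()):
            totals[event[var]] += joint(*values)
    return totals[True] / (totals[True] + totals[False])   # normalize

print(query("L", {"M": True}))   # 0.63 / 0.645, about 0.977
print(query("B", {"G": True}))   # 0.9025 / 0.9075, about 0.994
```

The loop visits all $2^4 = 16$ atomic events, which is the exponential cost noted above.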

- e.g. Polytrees
    - singly connected Bayes networks
        - sort of like a tree or several connected trees
        - only one path from any node to any other node    
        - it is easier because as you go down the tree, it gets simpler
    - NP-hard to find the optimal junction tree
    - time and space complexity is $O(n)$
    - the process is the same as above, but each 

- approximate inference
    - sampling
        - sample values of the variables
        - keep track of the values that are consistent with the evidence
        - divide by the sum of all values
    - MCMC
        - sample values of the variables
        - keep track of the values that are consistent with the evidence
        - divide by the sum of all values
        - use the Markov chain to sample values
- e.g.
    - B = battery charged
    - G = gauge shows full charge
    - L = liftable rock
    - M = arm moves
        - B and L are independent
        - G is independent of L but dependent on B
        - M is dependent on B and L
    - probabilities
        - p(B) = 0.95
        - p(L) = 0.7
        - p(G|B) = 0.95, gauge shows full charge if battery is charged
        - p(G|$\lnot$B) = 0.1, gauge shows full charge if battery is not charged
        - p(M|B, L) = 0.9, arm moves if battery is charged and rock is liftable
        - p(M|B, $\lnot$L) = 0.05, arm moves if battery is charged and rock is not liftable
        - p(M|$\lnot$B, L) = 0.0, arm does not move if battery is not charged and rock is liftable
        - p(M|$\lnot$B, $\lnot$L) = 0.0, arm does not move if battery is not charged and rock is not liftable
    - direct sampling
        - sample in topological order (parents before children)
        - repeat N times
            - sample B from p(B)
            - sample L from p(L)
            - sample G from p(G|B), using the sampled value of B
            - sample M from p(M|B, L), using the sampled values of B and L
            - record the values and add to a count of the possible sets
                - e.g. if B = True, L = True, G = False, M = True, then add to the count for (T,T,F,T)
        - divide by N and you have the probability of each atomic event
            - an atomic event is a set of values for all the variables at a given time
    - rejection sampling
        - evidence refers to something "given"
            - e.g. p(B|M) uses M as evidence and B as the query
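        - if you have evidence, generate full samples as before and keep only those that agree with the evidence
            - sample B from p(B)
            - sample L from p(L)
            - sample G from p(G|B), using the sampled value of B
            - sample M from p(M|B, L), using the sampled values of B and L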
            - if it does not meet the evidence, discard the sample
            - else, increase the count for the atomic event
                - if query, then increase the count for the query
                    - i.e. P(Q|E) = P(Q and E) / P(E), estimated as count(Q and E) / count(E), where Q is the query and E is the evidence
        - as N approaches infinity, the probability of the query approaches the true probability
    - both rely on the law of large numbers
        - as N approaches infinity, the sample mean approaches the true mean
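Direct sampling and rejection sampling on the rover network can be sketched as below, using the probabilities listed above. The seed, sample size and function names are choices made for this sketch; the estimates are random and only approach the exact values as N grows.

```python
import random
from collections import Counter

P_B, P_L = 0.95, 0.7
P_G = {True: 0.95, False: 0.1}              # P(G = true | B)
P_M = {(True, True): 0.9, (True, False): 0.05,
       (False, True): 0.0, (False, False): 0.0}   # P(M = true | B, L)

def sample_event(rng):
    """Draw one atomic event in topological order (parents before children)."""
    b = rng.random() < P_B
    l = rng.random() < P_L
    g = rng.random() < P_G[b]
    m = rng.random() < P_M[(b, l)]
    return {"B": b, "L": l, "G": g, "M": m}

def direct_sample(n, seed=0):
    """Estimate the probability of each atomic event (B, L, G, M) from n samples."""
    rng = random.Random(seed)
    counts = Counter(tuple(sample_event(rng).values()) for _ in range(n))
    return {event: c / n for event, c in counts.items()}

def rejection_sample(query, evidence, n, seed=0):
    """Estimate P(query = true | evidence), discarding inconsistent samples."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        event = sample_event(rng)
        if any(event[k] != v for k, v in evidence.items()):
            continue               # does not meet the evidence: discard
        kept += 1
        hits += event[query]
    return hits / kept if kept else float("nan")

atomic = direct_sample(100_000)
print(atomic[(True, True, True, True)])   # near 0.95 * 0.7 * 0.95 * 0.9 = 0.568575

estimate = rejection_sample("L", {"M": True}, n=100_000)
print(estimate)                           # near 0.63 / 0.645 (about 0.977)
```

Rejection sampling wastes every sample that contradicts the evidence, so it becomes slow when the evidence is unlikely.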

- fuzzy logic
    - alternative framework for dealing with uncertainty
    - deals with the "truthiness" of linguistic concepts or variables
        - e.g. 
            - 6' is "tall" is true to degree 0.5
            - 6'6" is "tall" to degree 1.0
            - degree likely comes from measurements/polling/observed data
    - the MATLAB documentation has a good guide to this, "Foundations of Fuzzy Logic"
    - fuzzy sets
        - a **crisp** set is a collection of elements
            - an object is a member or it is not a member
                - there is certainty
            - e.g. if the set of "tall" starts at 6', it is [6', +$\infty$)
                - 6' is a member, 5'11" is not a member
        - a **fuzzy** set is a collection of elements with degrees of membership
            - an object is a member to a certain degree
                - there is uncertainty
            - membership degree may be defined by some characteristic/membership function
                - e.g. a person is "tall" to a degree of 0.5 as defined by a membership function
                - it is still possible to have a crisp value in a fuzzy set
                    - e.g. a person is "tall" to a degree of 1.0 if they are 6'6"
        - membership functions
            - a function that maps an element to a degree of membership
            - e.g. "close to zero" 
                - it could be a gaussian function centered at zero with height 1
                - it could be two linear pieces with slopes of 1 and -1 meeting at zero (a triangle), else 0
        - logic rule conversion
            - truth value
                - crisp: A = 0 or 1
                - fuzzy: 0 $\leq$ $\mu$(A) $\leq$ 1
            - negation
                - crisp: $\lnot$ A
                - fuzzy: 1 - $\mu$(A)
            - disjunction
                - crisp: A OR B
                - fuzzy: max($\mu$(A), $\mu$(B))
            - conjunction
                - crisp: A AND B
                - fuzzy: min($\mu$(A), $\mu$(B))
            - implication
                - crisp: A $\Rightarrow$ B
                - fuzzy: max(min($\mu$(A), $\mu$(B)))
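The membership functions and the fuzzy versions of NOT, OR and AND above translate directly into code. A minimal sketch: the `width` and `half_width` parameters and the second membership value `heavy` are assumptions added for illustration.

```python
import math

def gaussian(x, center=0.0, width=1.0):
    """Gaussian membership function with height 1 at `center`."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))

def triangle(x, center=0.0, half_width=1.0):
    """Two linear pieces rising to 1 at `center`, and 0 outside the base."""
    return max(0.0, 1.0 - abs(x - center) / half_width)

def fuzzy_not(a):
    return 1.0 - a

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_and(a, b):
    return min(a, b)

tall = 0.5    # membership of 6' in "tall", from the notes
heavy = 0.8   # hypothetical membership in a second set, for illustration only

print(fuzzy_not(tall))         # 0.5
print(fuzzy_or(tall, heavy))   # 0.8
print(fuzzy_and(tall, heavy))  # 0.5
print(triangle(0.25))          # 0.75: fairly "close to zero"
print(gaussian(0.0))           # 1.0: exactly at the center
```

With memberships restricted to 0 and 1 these operators reduce to the crisp NOT, OR and AND.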

- Fuzzy If-Then rules
    - can be used to define a fuzzy expert system or fuzzy controller
    - e.g. car controller
        - S is speed
        - A is acceleration
        - P is pedal position
            - if S is "fast" and A is "accelerating" then P is "pressed"
            - if A is not "accelerating" and P is "pressed" then S is "fast"
    - combine results and create a crisp output by some metric
        - e.g. defuzzification
            - centroid
                - find the center of mass of the fuzzy set
                - the center of mass is the average of the values weighted by the membership function
            - max membership
                - find the value with the highest membership
            - mean of maxima
                - find all the values that attain the highest membership and take their average
            - weighted average
                - find the average of the values weighted by the membership function
            - etc.
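Two of the defuzzification methods above, centroid and mean of maxima, can be sketched for a fuzzy set sampled at discrete points. The pedal-position values and memberships below are made up for illustration.

```python
def centroid(values, memberships):
    """Membership-weighted average of the values (discrete center of mass)."""
    total = sum(memberships)
    if total == 0:
        raise ValueError("all memberships are zero; the centroid is undefined")
    return sum(v * m for v, m in zip(values, memberships)) / total

def mean_of_maxima(values, memberships):
    """Average of the values that attain the highest membership."""
    peak = max(memberships)
    best = [v for v, m in zip(values, memberships) if m == peak]
    return sum(best) / len(best)

# hypothetical fuzzy output for pedal position (0 = released, 1 = fully pressed)
pedal = [0.0, 0.25, 0.5, 0.75, 1.0]
mu = [0.0, 0.1, 0.4, 0.8, 0.8]

print(centroid(pedal, mu))        # about 0.774
print(mean_of_maxima(pedal, mu))  # 0.875
```

The two methods give different crisp outputs for the same fuzzy set, so the choice of metric matters.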

### Making Simple Decisions
- if you figure this out bottle it and sell it (I'll buy)
- preferences
    - A > B means A is preferred to B
    - A $\gtrsim$ B means A is at least as good as B
    - A $\sim$ B means the agent is indifferent between A and B
    - everything $>$ [$U_\bot$], the worst possible outcome
    - everything $<$ [$U_\top$], the best possible outcome
- lottery notation
    - [p1, S1; p2, S2; ...; pn, Sn]
        - p is the probability of the outcome
        - S is the outcome
        - e.g. [0.5, 100; 0.5, 0]
            - 50% chance of winning 100, 50% chance of winning 0
    - $\sum_{i=1}^{n} p_i = 1$

Utility Theory
- Orderability
    - exactly one of A > B, A < B, or A $\sim$ B holds
- Transitivity
    - if A > B and B > C, then A > C
- Continuity
    - if A > B > C, then there exists a p such that [p, A; 1-p, C] $\sim$ B
        - i.e. there is some probability p where the lottery is indifferent to B
        - e.g. A = \$100, B = \$10, C = \$1
            - there is some probability p at which you are indifferent between a sure \$10 and a lottery that pays \$100 with probability p and \$1 otherwise
- Substitution
    - if A $\sim$ B, then [p, A; 1-p, C] $\sim$ [p, B; 1-p, C]
        - i.e. if A is indifferent to B, then a lottery with A is indifferent to a lottery with B
- Monotonicity
    - if A > B, then [p, A; 1-p, B] > [q, A; 1-q, B] iff p > q
        - i.e. between two lotteries over the same two outcomes, prefer the one with the higher chance of the better outcome
- Decomposability
    - a compound lottery can be reduced to a simpler one using the laws of probability
        - e.g. [p, A; 1-p, [q, B; 1-q, C]] $\sim$ [p, A; (1-p)q, B; (1-p)(1-q), C]
- Dominance
    - strict dominance
        - plotted on x and y, if B > A in x and y, then A is strictly dominated by B
            - i.e. anything that is $\gtrsim$ A in x AND $\gtrsim$ A in y dominates A
        - if one lottery's outcomes are always better than another, then the first lottery is strictly dominant
    - stochastic dominance
        - complementary cumulative distribution strictly dominates
            - P[A $\geq$ x] $\geq$ P[B $\geq$ x] **and** for some x, P[A$\geq$x] > P[B$\geq$x]
        - e.g. 
            - B is uniform chances between 3 and 8
                - height of the curve is 1/5
            - A is uniform chances between 4 and 10
                - height of the curve is 1/6
            - A stochastically dominates B
            - survival function
                - P[X $\geq$ x]
                - P[B $\geq$ 3] = 1, then decreases to 0 at 8
                - P[A $\geq$ 4] = 1, then decreases to 0 at 10
                - A's curve is **always** $\geq$ B's curve
                - if A was from 4 to 5, neither would dominate because A is higher in some places and B is higher in others
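The uniform-distribution example above can be checked numerically by comparing the two survival functions on a grid of x values. This is a sketch: a grid check is only an approximation of "for all x", and the function names are illustrative.

```python
def survival_uniform(x, low, high):
    """P[X >= x] for X uniform on [low, high]."""
    if x <= low:
        return 1.0
    if x >= high:
        return 0.0
    return (high - x) / (high - low)

def stochastically_dominates(a, b, grid):
    """True if a's survival curve is >= b's everywhere on grid and > somewhere."""
    sa = [survival_uniform(x, *a) for x in grid]
    sb = [survival_uniform(x, *b) for x in grid]
    return (all(p >= q for p, q in zip(sa, sb))
            and any(p > q for p, q in zip(sa, sb)))

grid = [i / 10 for i in range(0, 121)]   # x = 0.0, 0.1, ..., 12.0
A, B = (4, 10), (3, 8)

print(stochastically_dominates(A, B, grid))        # True
print(stochastically_dominates((4, 5), B, grid))   # False
print(stochastically_dominates(B, (4, 5), grid))   # False: neither dominates
```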

- Utility Functions
    - give you a number for how much you like something
    - more useful than preferences
        - you can do exact computations with utility functions
    1. U(A) > U(B) iff A > B AND U(A) = U(B) iff A $\sim$ B
    2. U([p1, S1; p2, S2; ...; pn, Sn]) 
        - = $\sum_{i=1}^{n} p_i U(S_i)$
            - $\neq$ $\sum_{i=1}^{n} p_i S_i$
        - i.e. the utility of a lottery is the sum of the utilities of the outcomes weighted by the probabilities
    - if you have U(s), you have everything
    - relation to rational agents
        - U(s) represents the agent's preferences
        - the expected utility of an action given evidence EU(a|e) is the average utility value of the outcomes of the action weighted by the probabilities of the outcomes
            - EU(a|e) = $\sum_{s'} P(Result(a) = s'|a,e) U(s')$
        - the rational agent chooses the action with the highest expected utility
            - argmax$_a$ EU(a|e)
            - Maximum Expected Utility (MEU) principle says that the rational agent chooses the action with the highest expected utility
- the hard part: operationalizing the utility function
    - you don't know the probabilities of all moves or contingencies an adversary might make
    - i.e. you can't really fill out the decision tree
        - e.g. one-armed bandit (slot machine)
            - you have to pull the lever to see the outcomes
        - e.g. where should NIH spend its money?
            - you can't know the probabilities of all the outcomes
    - you learn the probabilities as you go
        - e.g. you can learn the probabilities of the slot machine as you play
- there may be special situations that throw everything off
    - e.g. wind disturbances on a plane, nonstandard power grid utilization, etc. 
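The utility of a lottery and the MEU rule from the utility-function notes above can be sketched as follows. The square-root utility and the "sure_thing" action are hypothetical, chosen to show that expected utility is not the same as expected monetary value.

```python
import math

def expected_utility(lottery, utility):
    """Utility of a lottery [(p1, S1), (p2, S2), ...]: the sum of p_i * U(S_i)."""
    total_p = sum(p for p, _ in lottery)
    if not math.isclose(total_p, 1.0):
        raise ValueError("lottery probabilities must sum to 1")
    return sum(p * utility(s) for p, s in lottery)

def best_action(actions, utility):
    """MEU: the action whose outcome lottery has the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(actions[a], utility))

def utility(dollars):
    """Hypothetical risk-averse utility of money, for illustration only."""
    return math.sqrt(dollars)

actions = {
    "gamble": [(0.5, 100), (0.5, 0)],   # the lottery [0.5, 100; 0.5, 0] from the notes
    "sure_thing": [(1.0, 36)],          # hypothetical certain payoff
}

print(expected_utility(actions["gamble"], utility))      # 5.0
print(expected_utility(actions["sure_thing"], utility))  # 6.0
print(best_action(actions, utility))                     # sure_thing
```

The gamble has the higher expected payout (50 versus 36), yet under this utility function the rational choice is the sure thing.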

- Multiple Attributes
    - ordered/priorities
        - tuple ordering or hierarchy
            - <safety, mission, stealth>
            - it may become a 2D or larger nested system
                - e.g. for a plane, there may be a polygon of safety, and within that there may be a fuel polygon
                    - when you meet safety, then you can care about fuel
    - additive
        - V(x1, x2, ..., xn) = $\sum_{i=1}^{n} c_i V_i(x_i)$
            - V(noise, cost, deaths) for an airport
- decision network
    - decision nodes
        - actions
    - chance nodes
        - uncertain events
    - utility nodes
        - utility of the outcome
    - e.g.
        - decision node: car wash?
        - dependent chance nodes: cost, time 
        - independent chance nodes: forget
        - utility node: U(1-$\gamma^d$)f(-cost - 10, time)
            - $\gamma$ is the discount factor
            - d is the depth of the decision tree
            - f is the utility function
                - e.g. f(x, y) = x + y
            - 0 < $\gamma$ < 1