# Intro to Artificial Intelligence with Python

## Part III - Uncertainty

Harvard CS50 Introduction to Artificial Intelligence with Python is an online course that I took in the Spring of 2020. It consisted of 6 lectures of which I have a notebook for each. Each lecture had accompanying projects which are located in the projects folder in the same directory as this notebook.

[Course Link](https://cs50.harvard.edu/ai/)

[Lecture Link](https://www.youtube.com/watch?v=uQmYZTTqDC0&list=PLhQjrBD2T382Nz7z1AEXmioc27axa19Kv&index=4)

---
## Probability


**Probability** - is a numerical description of how likely an event is to occur or how likely it is that a proposition is true. Probability is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur.


**Probability Theory** - the branch of mathematics concerned with probability.


* ($\omega$) - indicates a possible world, i.e. one potential outcome (for example, when you roll a die there are six possible worlds that could result: 1, 2, 3, 4, 5, or 6). Each possible world has some probability of being true.


* P($\omega$) - indicates the probability of world $\omega$ occurring or being true


**Probability Range** - every probability satisfies 0 <= P($\omega$) <= 1, where 0 means an impossible event (rolling a 7 with a single die, for example) and 1 means a certain event. A die has 6 possible worlds, one per side. The probabilities of all possible worlds must sum to 1, so for a fair die each side has probability 1/6 on a single roll.
* Example: P(2) = 1/6 (the probability of rolling a 2 with a 6-sided die is 1/6)
   
  
If two dice are rolled and we want the probabilities of the combined result, we need to consider all possible worlds created by both dice together ((1,1), (2,1), (3,1), ..., (6,6)). With fair dice, each pair is equally likely, just as each face of a single die is.

However, if the two values are summed rather than considered individually, things change. There is only one way to get a sum of 12 (roll two 6's) and only one way to get a sum of 2 (roll two 1's), whereas 7 has the most possible outcomes at 6. So it stands to reason that rolling a 7 is more probable than rolling a 12 or a 2.

NOTE: the total number of possible rolls is (number of sides)^(number of dice), or 6^2 = 36

* Example: P(sum to 12) = 1/36 and  P(sum to 7) = 6/36 or 1/6
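
The counts above can be checked by brute force. The following is a minimal sketch that enumerates all 36 possible worlds for two fair dice; the names `worlds` and `sum_probs` are illustrative and not part of the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every (die1, die2) pair is one equally likely possible world: 6^2 = 36.
worlds = list(product(range(1, 7), repeat=2))

# Count how many worlds produce each sum, then divide by the total.
sum_counts = Counter(d1 + d2 for d1, d2 in worlds)
sum_probs = {total: Fraction(count, len(worlds))
             for total, count in sum_counts.items()}

print(sum_probs[12])  # 1/36
print(sum_probs[7])   # 1/6
```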


**Unconditional Probability** -  Also known as marginal probability, unconditional probability reflects the chances that some event will occur without accounting for any other possible influences or prior outcomes. The calculation is the same as relative frequency. 

* For instance, the chance of a fair coin flip being heads has an unconditional probability of 50% regardless of how many coin flips preceded it, nor if some other event had occurred.

* P(A) = total number of A / all possible outcomes


**Conditional Probability** - the likelihood of an event or outcome occurring, given that another event or outcome has occurred. Formally:
* P(a | b), the probability of a given that b has happened
    * Example 1: P(rain today | rain yesterday)
    * Example 2: P(route change | traffic conditions)
    * Example 3: P(disease | test results)
    

**When using probability with AI, conditional probability is usually the most relevant**

---
## Calculating Conditional Probability
* P(a | b) = P(a and b) / P(b): the probability of a given b is the probability of a and b both being true, divided by the probability of b by itself

Previously, with unconditional probability, we saw that there is a 1/36 chance of rolling a sum of 12 with two dice. Below, the dice example is used with conditional probability.

* P(sum 12 | roll 6): what is the probability that two dice sum to 12, given that we know the first die shows a 6?
    * first we need the probability of the known event, the first die showing a 6, which is 1/6
    * then we need the probability of both events together. A sum of 12 only happens when both dice show 6, so P(sum 12 and roll 6) = 1/36
    * then divide the probability of both by the probability of the known event: (1/36) / (1/6) = 1/6, so P(sum 12 | roll 6) = 1/6
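
The same calculation can be sketched by enumeration. This assumes the condition means "the first die shows a 6"; the helper `prob` and the two event functions are hypothetical names used only for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) worlds.
worlds = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over (die1, die2)."""
    favorable = sum(1 for world in worlds if event(world))
    return Fraction(favorable, len(worlds))

def first_is_six(world):
    return world[0] == 6

def sum_is_twelve_and_first_is_six(world):
    return sum(world) == 12 and world[0] == 6

# P(a | b) = P(a and b) / P(b)
p_conditional = prob(sum_is_twelve_and_first_is_six) / prob(first_is_six)
print(p_conditional)  # 1/6
```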


**Random Variable** - a variable in probability theory with a domain of possible values it can take on:
* Example 1: roll = {1,2,3,4,5,6}, where roll is the random variable and the values in the set are all possible values the random variable can take on
* Example 2: weather = {sun, cloud, rain, wind, snow}, note that random variable values do not have to be numerical. 

**Probability Distribution** - the probability of each value that a random variable can take on
 
**Probability Distribution Example:**
* P(Flight = on time)   = 0.6
* P(Flight = delayed)   = 0.3
* P(Flight = cancelled) = 0.1
* All the probabilities for the possible values (or worlds) of the variable Flight sum to 1


**Vectors** are often used to display probability distributions like:
* **P**(Flight) = <0.6, 0.3, 0.1>, bold **P** denotes probability distribution


**Independence** - the knowledge that one event occurs does not affect the probability of the other event, P(a and b) = P(a) * P(b)
* Example 1: P(roll 6 on die1 AND roll 6 on die2) = P(6) * P(6) = 1/6 * 1/6 = 1/36

**Important: even though each individual outcome has probability 1/6, not every pair of events is independent**

* Example 2: P(roll 6 on die1 AND roll 4 on die1, in the same roll) != P(6) * P(4). A single die cannot show both values at once, so the joint probability is 0, not 1/36. These events are NOT independent, and the general formula applies: P(a and b) = P(b) * P(a | b) 
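
A quick enumeration check of the independence definition, as a sketch: the first case uses events on two different dice, the second uses two events on the same die in the same roll. The helper `prob` is a hypothetical name for illustration.

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over (die1, die2)."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

# Events on different dice: knowing one tells us nothing about the other.
p_both_six = prob(lambda w: w[0] == 6 and w[1] == 6)
independent = p_both_six == prob(lambda w: w[0] == 6) * prob(lambda w: w[1] == 6)

# Events on the same die in the same roll: they can never happen together.
p_six_and_four = prob(lambda w: w[0] == 6 and w[0] == 4)
dependent = p_six_and_four != prob(lambda w: w[0] == 6) * prob(lambda w: w[0] == 4)

print(p_both_six, independent)    # 1/36 True
print(p_six_and_four, dependent)  # 0 True
```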

---
## Probability Rules

**General Multiplication Rule** - consists of the two formulas used in the previous examples:
* independent events = P(a and b) = P(a) * P(b)
* dependent events = P(a and b) = P(b) * P (a | b)
* note that P(a and b) = P(a) * P (b | a) is equivalent to above

**Bayes' Rule (or theorem)** - In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if the risk of developing health problems is known to increase with age, Bayes’ theorem allows the risk to an individual of a known age to be assessed more accurately than simply assuming that the individual is typical of the population as a whole
* $ P(b | a) = P(a | b) * P(b) / P(a) $


**Bayes' Rule Example**
* Given clouds in the morning, what's the probability of rain in the afternoon?
* Assume we know this info: 80% of rainy afternoons start with cloudy mornings
* Assume we know this info: 40% of days have cloudy mornings
* Assume we know this info: 10% of days have rainy afternoons
* Formula: P(rain|clouds) = P(clouds|rain) * P(rain) / P(clouds)
* Solution: (.80)(.1) / .4 = 0.2, or a 20% chance of rain in the afternoon
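
The arithmetic above can be wrapped in a small helper. This is a minimal sketch; the function name `bayes` and its parameter names are illustrative choices, not part of the lecture code.

```python
def bayes(p_evidence_given_cause, p_cause, p_evidence):
    """P(cause | evidence) = P(evidence | cause) * P(cause) / P(evidence)."""
    return p_evidence_given_cause * p_cause / p_evidence

# P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
p_rain_given_clouds = bayes(p_evidence_given_cause=0.8, p_cause=0.1, p_evidence=0.4)
print(round(p_rain_given_clouds, 2))  # 0.2
```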

**Ultimately this means that knowing:**
* P(cloudy morning | rainy afternoon)

**Allows us to calculate:**
* P(rainy afternoon | cloudy morning)

### or more generally:

**Knowing**
* P(visible effect | unknown cause)

**We can calculate:**
* P(unknown cause | visible effect)

---
## Joint Probability
Joint probability is the likelihood of two (or more) events all occurring together. The events do not need to be independent.

**Joint Probability Example**
* Random Variable 1 Individual Probabilities: AM Weather Conditions:
* C = Clouds, 40% chance
* C = No Clouds, 60% chance


* Random Variable 2 Individual Probabilities: PM Weather Conditions:
* R = Rain, 10% chance
* R = No Rain, 90% chance


* Variable 1 & 2 Joint Probabilities for both AM and PM combined:
* See below for the joint probability table


### Joint Probability Table: the probabilities on any given day of each combination of clouds and rain

In [1]:
import pandas as pd

# Joint probability table with the probabilities on any given day
# of each combination of clouds and rain
rain    = pd.Series({'C = Clouds' : 0.08, 'C = No Clouds' : 0.02})
no_rain = pd.Series({'C = Clouds' : 0.32, 'C = No Clouds' : 0.58})

j_distro = pd.DataFrame({'R = Rain': rain, 'R = No Rain' : no_rain})
j_distro

| | R = Rain | R = No Rain |
|---|---|---|
| C = Clouds | 0.08 | 0.32 |
| C = No Clouds | 0.02 | 0.58 |


**Using the joint probability table above, other probability conclusions can be drawn (conditional probability, etc.), for example:**

The probability of clouds given rain is the joint probability of clouds and rain divided by the individual probability of rain


* P(C | rain) = P(C ∧ rain) / P(rain) = $ \alpha $P(C,rain)

* P(C | rain) = $ \alpha $<0.08, 0.02>, note: the comma means the same as "and" (∧)


The $ \alpha $ above is the factor that normalizes the probabilities so that their values sum to 1. In this case that value is 10, since 1 / P(rain) = 1 / 0.1.

* P(C | rain) = 10 * <0.08, 0.02> = <0.8, 0.2>

**The key takeaway from the above exercise is that, given the joint probabilities, we can figure out a conditional probability just by multiplying by a normalization factor**
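
The normalization step can be sketched in plain Python. The dictionary layout below (keys are `(clouds, rain)` pairs) is an assumption made for illustration; the values come from the joint table above.

```python
# Joint distribution P(C, R) from the table above, keyed by (clouds, rain).
joint = {
    ("clouds", "rain"): 0.08,
    ("clouds", "no rain"): 0.32,
    ("no clouds", "rain"): 0.02,
    ("no clouds", "no rain"): 0.58,
}

# Keep only the entries consistent with the evidence R = rain.
unnormalized = {c: p for (c, r), p in joint.items() if r == "rain"}

# alpha = 1 / P(rain) rescales the entries so they sum to 1.
alpha = 1 / sum(unnormalized.values())
conditional = {c: alpha * p for c, p in unnormalized.items()}

print({c: round(p, 2) for c, p in conditional.items()})
# {'clouds': 0.8, 'no clouds': 0.2}
```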


---
## Probability Rules Continued

**Negation Rule** - the probability that an event a does not occur
* $ P(¬a) = 1 - P(a) $


**Inclusion-Exclusion Rule** - the probability that an event a OR an event b occurs
* $ P(a ∨ b) = P(a) + P(b) - P(a ∧ b) $

**Marginalization Rule** - finding the probability of event a using information about some other variable, like event b. There are only two possibilities: either a and b both occur, or a occurs and b does not
* $ P(a) = P(a, b) + P(a, ¬b) $

This rule is not limited to just two values:

The probability that X takes the value $x_i$ is the sum, over every value $y_j$ that Y can take on, of the joint probability of $X = x_i$ and $Y = y_j$:

* $ P(X = x_i) = \sum_{j} P(X = x_i, Y = y_j) $
 
**Marginalization Example**

This example uses the same (rain/clouds) joint probability table from above.

* $ P(C = cloud) = P(C = cloud, R = rain) + P(C = cloud, R = ¬rain) $

* $ P(C = cloud) = 0.08 + 0.32 = 0.4 $
* the 0.4 is the unconditional probability that it is cloudy, obtained using marginalization

**The key takeaway from marginalization is that it allows us to convert joint probabilities into unconditional (marginal) probabilities of a single variable**
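
Marginalization over the cloud/rain joint table can be sketched as follows. The helper `marginal` and the tuple-keyed dictionary are hypothetical constructs for illustration only.

```python
# Joint distribution P(C, R), keyed by (clouds, rain).
joint = {
    ("clouds", "rain"): 0.08,
    ("clouds", "no rain"): 0.32,
    ("no clouds", "rain"): 0.02,
    ("no clouds", "no rain"): 0.58,
}

def marginal(joint, index):
    """Sum the joint probabilities over every variable except the one at `index`."""
    totals = {}
    for assignment, p in joint.items():
        value = assignment[index]
        totals[value] = totals.get(value, 0.0) + p
    return totals

p_clouds = marginal(joint, 0)  # sum out rain
p_rain = marginal(joint, 1)    # sum out clouds

print({k: round(v, 2) for k, v in p_clouds.items()})  # {'clouds': 0.4, 'no clouds': 0.6}
print({k: round(v, 2) for k, v in p_rain.items()})    # {'rain': 0.1, 'no rain': 0.9}
```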


**Conditioning Rule** - similar to the marginalization rule, but uses conditional probabilities instead of joint probabilities. Below, b is the conditioning event: P(a | b) and P(a | ¬b) are each weighted by how likely b and ¬b are.

* $ P(a) = P(a | b) * P(b) + P(a | ¬b) * P(¬b) $

**OR** Formally:

* $ P(X = x_i) = \sum_{j} P(X = x_i | Y = y_j) * P(Y = y_j) $
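
As a sanity check, conditioning should give the same P(clouds) = 0.4 that marginalization gives. A minimal sketch, with the conditional values derived from the cloud/rain joint table (0.08 / 0.1 and 0.32 / 0.9); the variable names are illustrative.

```python
# Values derived from the cloud/rain joint table.
p_rain = 0.1
p_clouds_given_rain = 0.08 / 0.1     # 0.8
p_clouds_given_no_rain = 0.32 / 0.9  # about 0.356

# P(a) = P(a | b) * P(b) + P(a | not b) * P(not b)
p_clouds = (p_clouds_given_rain * p_rain
            + p_clouds_given_no_rain * (1 - p_rain))
print(round(p_clouds, 2))  # 0.4
```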


---
## Probabilistic AI Models

**Bayesian Network** - a data structure that represents the dependencies among random variables

Structure of a Bayesian Network:
* directed graph - graph that connects nodes together by arrows
* each node represents a random variable (weather, etc.)
* arrow from X to Y means X is a parent of Y
* each node X has a probability distribution of P(X | Parents(X))

**Example Bayesian Network**
This example considers 4 random variables that might influence one another. Each variable is listed below with its domain of possible values:

* node 1 rand_var: Rain = {none, light, heavy}
* node 2 rand_var: Train = {on time, delayed}
* node 3 rand_var: Maintenance = {yes, no}
* node 4 rand_var: Appointment = {attend, miss}

A directed graph showing the dependencies between the four random variables is depicted below. 

Note that rain is not dependent on anything, maintenance is dependent on rain, train delay is dependent on BOTH maintenance and rain, and appointment is directly dependent only on train (rain and maintenance affect it only indirectly, through train).

**The important thing to understand is that we can come up with a probability distribution for any random variable node based only upon that node's parent(s)**

In [None]:
---------------------
        Rain
---------------------
    |            |
    |            |
    V            |
------------     |
Maintenance      |
------------     |
    |            |
    |            |
    V            V
---------------------
        Train
---------------------
          |
          |
          V
---------------------
      Appointment
---------------------

### Breakdown of each node

**Rain Node**

The rain node has no parents and therefore has an unconditional probability distribution, as it is not dependent upon anything. See below:

In [9]:
rain = {'none' : [0.7], 'light' : [0.2], 'heavy' : [0.1]}
r_df = pd.DataFrame(data=rain)
print('Probability of Rain')
r_df

Probability of Rain


| | none | light | heavy |
|---|---|---|---|
| 0 | 0.7 | 0.2 | 0.1 |


---
**Maintenance Node**

Maintenance has one parent (rain), and the idea for this problem is that the heavier the rain, the less likely it is that there will be maintenance. Because maintenance is dependent upon rain, the maintenance distribution is conditional. See below:

In [10]:
# Each dict maps a rain value to P(maintenance = yes | rain) or P(maintenance = no | rain)
maint_yes = {'none' : 0.4, 'light' : 0.2, 'heavy' : 0.1}
maint_no = {'none' : 0.6, 'light' : 0.8, 'heavy' : 0.9}

m_df = pd.DataFrame({'yes': maint_yes, 'no' : maint_no})

print('Track Maintenance Probability Based on Rain')
m_df

Track Maintenance Probability Based on Rain


| Rain | yes | no |
|---|---|---|
| none | 0.4 | 0.6 |
| light | 0.2 | 0.8 |
| heavy | 0.1 | 0.9 |


---
**Train Node**

Train has two parents (rain and maintenance). The idea for the problem is that the train can be either on time or delayed based on both track maintenance AND rain conditions. This is a larger conditional distribution. See below:

In [11]:
on_time = {'R: none, M: yes': 0.8, 'R: none, M: no': 0.9,
           'R: light, M: yes': 0.6, 'R: light, M: no': 0.7,
           'R: heavy, M: yes': 0.4, 'R: heavy, M: no': 0.5}

delayed = {'R: none, M: yes': 0.2, 'R: none, M: no': 0.1,
           'R: light, M: yes': 0.4, 'R: light, M: no': 0.3,
           'R: heavy, M: yes': 0.6, 'R: heavy, M: no': 0.5}

t_df = pd.DataFrame({'on time': on_time, 'delayed' : delayed})

print('Train Arrival Probability Based on Rain and Track Maintenance')
t_df

Train Arrival Probability Based on Rain and Track Maintenance


| Rain, Maintenance | on time | delayed |
|---|---|---|
| R: none, M: yes | 0.8 | 0.2 |
| R: none, M: no | 0.9 | 0.1 |
| R: light, M: yes | 0.6 | 0.4 |
| R: light, M: no | 0.7 | 0.3 |
| R: heavy, M: yes | 0.4 | 0.6 |
| R: heavy, M: no | 0.5 | 0.5 |


---
**Appointment Node**

Appointment has one parent (train). The idea for the problem is that one can either attend or miss an appointment based on train arrival time.

When creating Bayesian Networks, we only want to encode direct relationships. Appointment may be affected by rain and track maintenance, but because appointment is DIRECTLY related only to train, those upstream dependencies don't need their own arrows: their effect is already captured in the train values.

This is another conditional probability. See below:

In [13]:
attend = {'on_time' : 0.9, 'delayed' : 0.6}
miss = {'on_time' : 0.1, 'delayed' : 0.4}

a_df = pd.DataFrame({'attend': attend, 'miss' : miss})

print('Appointment Attendance Probability Based on Train Time')
a_df

Appointment Attendance Probability Based on Train Time


| Train | attend | miss |
|---|---|---|
| on_time | 0.9 | 0.1 |
| delayed | 0.6 | 0.4 |


---
### Computing Joint Probabilities from a Bayesian Network
All of the values used in the following examples come from the 4 nodes created above in the Bayesian Network section.

**Example 1 node**
* Probability of light rain
* P(r_light) = 0.2

**Example 2 nodes**
* Probability of light rain AND no track maintenance
* (the unconditional probability of light rain) * (the conditional probability of no track maintenance given light rain), or more formally:
* P(r_light, m_no) = P(r_light) * P(m_no | r_light) 

**Example 3 nodes**<br>
Note that at each level of the network, the given values are always the node's parents (in the case of the conditional probabilities) 
* Probability of light rain AND no track maintenance AND a delayed train
* P(r_light) * P(m_no | r_light) * P(t_delayed | r_light AND m_no)

**Example 4 nodes**<br>
* Probability of light rain AND no track maintenance AND a delayed train AND missing the appointment
* P(r_light) * P(m_no | r_light) * P(t_delayed | r_light AND m_no) * P(miss | delayed)
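
The chain of products above can be sketched with plain dictionaries. The table names (`P_RAIN`, `P_MAINT`, `P_TRAIN`, `P_APPT`) and the function `joint_probability` are illustrative; the numbers are copied from the four node tables.

```python
# Conditional probability tables copied from the four nodes above.
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINT = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
P_TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},
}
P_APPT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def joint_probability(rain, maint, train, appt):
    """Multiply each node's probability given the values of its parents."""
    return (P_RAIN[rain]
            * P_MAINT[rain][maint]
            * P_TRAIN[(rain, maint)][train]
            * P_APPT[train][appt])

print(round(joint_probability("light", "no", "delayed", "miss"), 4))   # 0.0192
print(round(joint_probability("none", "no", "on time", "attend"), 4))  # 0.3402
```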


### Probability Inference by Enumeration
In the previous lecture on knowledge, we used inference to derive new knowledge about a world from known information. Here, we perform the same type of analysis, but with probability instead of certain knowledge. Together, the three kinds of variables below cover every variable in the Bayesian Network.

* Query X: variable for which to compute the distribution
* Evidence variables E: observed variables for event e
* Hidden variables Y: non-evidence, non-query variables
* Overall goal: calculate P(X | e)

**Example 1:**
Compute the probability distribution of the appointment variable given light rain and no track maintenance:
* Problem: P(appointment | r_light, m_no)
* X: appointment
* E: r_light and m_no
* Y: train 

Probability Distribution for the problem:
* $\alpha * P(Appointment, light, no)$

BUT... appointment directly depends on train, and train is not observed. Marginalization can be used here: sum over the two possible values of the hidden variable, train on time or train delayed.

* $\alpha * [ P(Appointment, light, no, on time) + P(Appointment, light, no, delayed) ] $

The above process is called **Inference by Enumeration** and the formula is:

* $ P(X | e) = \alpha * P(X, e) = \alpha * \sum_{y} P(X, e, y) $, where y ranges over all values of the hidden variables
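
Inference by enumeration for this query can be sketched in plain Python: sum the joint probability over the hidden variable (train), then normalize. The table and function names are hypothetical; the numbers are copied from the node tables in this notebook.

```python
P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINT = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
P_TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},
}
P_APPT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def joint(rain, maint, train, appt):
    """Joint probability of one full assignment of the four variables."""
    return (P_RAIN[rain] * P_MAINT[rain][maint]
            * P_TRAIN[(rain, maint)][train] * P_APPT[train][appt])

def query_appointment(rain, maint):
    """P(Appointment | rain, maint), summing out the hidden variable train."""
    unnormalized = {
        appt: sum(joint(rain, maint, train, appt)
                  for train in ("on time", "delayed"))
        for appt in ("attend", "miss")
    }
    alpha = 1 / sum(unnormalized.values())  # normalization factor
    return {appt: alpha * p for appt, p in unnormalized.items()}

result = query_appointment("light", "no")
print({k: round(v, 2) for k, v in result.items()})
# {'attend': 0.81, 'miss': 0.19}
```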

---
### Using External Libraries to Create and Analyze a Bayesian Network 
Performing the above calculations by hand would be time-consuming and tedious. There are numerous libraries for this task; the examples below use one called 'pomegranate'.

In [42]:
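# NOTE: these cells use the pre-1.0 pomegranate API (Node,
# DiscreteDistribution, ...); later releases changed this interface
import pomegranate
from pomegranate import *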

# Rain node has no parents (unconditional distro)
rain = Node(DiscreteDistribution({
    "none": 0.7,
    "light": 0.2,
    "heavy": 0.1
}), name="rain")

# Track maintenance node conditional on rain (conditional distro)
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4],
    ["none", "no", 0.6],
    ["light", "yes", 0.2],
    ["light", "no", 0.8],
    ["heavy", "yes", 0.1],
    ["heavy", "no", 0.9]
], [rain.distribution]), name="maintenance")

# Train node conditional on rain and maintenance (conditional distro)
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8],
    ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9],
    ["none", "no", "delayed", 0.1],
    ["light", "yes", "on time", 0.6],
    ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7],
    ["light", "no", "delayed", 0.3],
    ["heavy", "yes", "on time", 0.4],
    ["heavy", "yes", "delayed", 0.6],
    ["heavy", "no", "on time", 0.5],
    ["heavy", "no", "delayed", 0.5],
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment node is conditional on train (conditional distro)
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9],
    ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6],
    ["delayed", "miss", 0.4]
], [train.distribution]), name="appointment")

# Create a Bayesian Network and add states
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)

# Add edges connecting nodes (the dependency connections from diagram)
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)

# Finalize model
model.bake()

### Performing Basic Probability Calculations
Now that the Bayesian Network has been created, we can perform calculations on it. 

In [43]:
# Example 1: calculate probability of the optimal situation:
# no rain, no track maintenance, train on time, and attend appointment

probability = model.probability([["none", "no", "on time", "attend"]])
print(f'{probability * 100 : .2f} %')

 34.02 %


In [44]:
# Example 2: calculate probability with only one sub-optimal outcome:
# no rain, no track maintenance, train on time, but miss appointment

probability = model.probability([["none", "no", "on time", "miss"]])
print(f'{probability * 100 : .2f} %')

 3.78 %


### Performing Inference Calculations

In [45]:
# First input the observed (known) evidence, train delayed
predictions = model.predict_proba({"train": "delayed"})

# Print predictions for each node
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        print(f"\n{node.name}: {prediction}")
    else:
        print(f"\n{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability * 100 : .2f} %")


rain
    none:  45.83 %
    light:  30.69 %
    heavy:  23.48 %

maintenance
    yes:  35.68 %
    no:  64.32 %

train: delayed

appointment
    miss:  40.00 %
    attend:  60.00 %


**From the study of node dependencies, we know that once a node's direct parent is observed, its non-parent ancestors should not affect the query variable. Below we can see this by adding heavy rain (not a direct parent of appointment) as known evidence along with the train delay.**

**Note that the appointment results do not change, BUT the maintenance values do, as maintenance is directly related to the rain variable**

In [46]:
predictions = model.predict_proba({"train": "delayed", "rain":"heavy"})

# Print predictions for each node
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        print(f"\n{node.name}: {prediction}")
    else:
        print(f"\n{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability * 100 : .2f} %")


rain: heavy

maintenance
    yes:  11.76 %
    no:  88.24 %

train: delayed

appointment
    miss:  40.00 %
    attend:  60.00 %


---
## Approximate Inference
Calculating exact probabilities using Inference by Enumeration (above examples) can be inefficient for large networks with many variables. Often, rather than computing the exact probabilities, approximations are used instead. 

**Sampling** - take a random sample of every variable in the Bayesian Network, starting at the top and working down. If heavy rain is randomly selected, then only the maintenance probabilities associated with heavy rain can be used, and so on down the line. In other words, build a complete assignment like we did above, but by randomly selecting each value according to its distribution.

Using random sampling, the amount of computation is bounded by the number of samples. The larger the number of samples, the more accurate the estimated probability will usually be. 

**Unconditional Probability Question:** 
* P(train = on time)? 

**Conditional Probability Question:** 
* P(rain = light | train = on time)? 

### Rejection Sampling
Below, the conditional probability of attending or missing an appointment given that the train is delayed is estimated using rejection sampling, which keeps only the random samples that match the evidence and rejects the rest. 

In [52]:
from collections import Counter

def generate_sample():   
    sample = {}  # Mapping of random variable name to sample
    parents = {} # Mapping of distribution to sample generated

    # Loop over all states, assuming topological order
    for state in model.states:
             
        # If we have a non-root node, sample conditional on parents
        # i.e. if conditional probability
        if isinstance(state.distribution, pomegranate.ConditionalProbabilityTable):
            sample[state.name] = state.distribution.sample(parent_values=parents)

        # Otherwise if no parent, sample from the distribution alone
        # i.e. else unconditional probability
        else:
            sample[state.name] = state.distribution.sample()

        # Keep track of the sampled value in the parents mapping
        parents[state.distribution] = sample[state.name]

    # Return generated sample
    return sample

# Rejection sampling (only consider samples where train is delayed)
# Compute distribution of Appointment given that train is delayed
N = 10000
data = []
for i in range(N):
    sample = generate_sample()
    if sample["train"] == "delayed": # only use samples where t_delayed
        data.append(sample["appointment"])
print(Counter(data))

Counter({'attend': 1281, 'miss': 832})


The rejection sampling example above is not very efficient, due to the fact that it generates unnecessary samples (those rejected) along the way. There are other sampling methods that avoid this. 

### Likelihood Weighting
This sampling method avoids rejection sampling's problem of wasted sample calculations. The steps performed are:
* Start by fixing the values for evidence variables
* Sample the non-evidence variables using conditional probabilities in the Bayesian Network
* Weight each sample by its likelihood: the probability of all of the evidence

Example using the question:
* P(rain = light | train = on time)? 
* first fix the train variable to on time (it is never sampled)
* randomly sample the rain, maintenance, and appointment values
* assign the sample a weight equal to how probable it is that the train is on time given the sampled values of its parents (rain and maintenance)
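
The steps above can be sketched in plain Python, without pomegranate. This is an illustrative implementation under the assumption that the network tables match the ones defined earlier; `draw`, `weighted_sample`, and `estimate_light_rain` are hypothetical helper names.

```python
import random

P_RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
P_MAINT = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
P_TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},
}
P_APPT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def draw(distribution, rng):
    """Pick one value from a {value: probability} mapping."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def weighted_sample(rng, train="on time"):
    """Sample non-evidence variables; weight by the likelihood of the evidence."""
    rain = draw(P_RAIN, rng)
    maint = draw(P_MAINT[rain], rng)
    weight = P_TRAIN[(rain, maint)][train]  # train is fixed, never sampled
    appt = draw(P_APPT[train], rng)
    return {"rain": rain, "maintenance": maint,
            "train": train, "appointment": appt}, weight

def estimate_light_rain(n, seed=0):
    """Estimate P(rain = light | train = on time) from n weighted samples."""
    rng = random.Random(seed)
    light_weight = total_weight = 0.0
    for _ in range(n):
        sample, weight = weighted_sample(rng)
        total_weight += weight
        if sample["rain"] == "light":
            light_weight += weight
    return light_weight / total_weight

# Should land close to the exact value 0.136 / 0.787 (about 0.173).
print(estimate_light_rain(20_000))
```

Unlike rejection sampling, every generated sample contributes to the estimate; samples that make the evidence unlikely simply count for less.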

The table below shows the on-time weights for each combination of sampled values. For example, if r_none and m_yes are sampled, the likelihood that the train is on time is 0.8, and that is the weight given to the sample


Train Arrival Probability Based on Rain and Track Maintenance


| Rain, Maintenance | on time | delayed |
|---|---|---|
| R: none, M: yes | 0.8 | 0.2 |
| R: none, M: no | 0.9 | 0.1 |
| R: light, M: yes | 0.6 | 0.4 |
| R: light, M: no | 0.7 | 0.3 |
| R: heavy, M: yes | 0.4 | 0.6 |
| R: heavy, M: no | 0.5 | 0.5 |


---
## Uncertainty over Time
All of the previous examples do not reflect changes over time. Below are models that take time into account. 

**Markov Assumption** - the assumption that the current state depends on only a finite, fixed number of previous states (e.g. the current weather depends only on yesterday's weather and not on all weather prior)

### Markov Models

**Markov Chain** - a sequence of random variables where the distribution of each variable follows the Markov assumption, Formally: A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

**Transition Model** - specifies the probability distribution of the next state given the current state; it is the model used to build a Markov chain

**Markov Chain Example using Transition Model and Weather**

In [55]:
# Define starting probabilities
start = DiscreteDistribution({
    "sun": 0.5,   # sunny 50% of time
    "rain": 0.5   # rain 50% of time
})

# Define transition model
transitions = ConditionalProbabilityTable([
    ["sun", "sun", 0.8],    # if sunny today, 80% sunny tomorrow prob
    ["sun", "rain", 0.2],   # if sunny today, 20% rainy tomorrow prob
    ["rain", "sun", 0.3],   # if rainy today, 30% sunny tomorrow prob
    ["rain", "rain", 0.7]   # if rainy today, 70% rainy tomorrow prob
], [start])

# Create Markov chain
model = MarkovChain([start, transitions])

# Sample 50 states from chain
print(model.sample(50))

['sun', 'sun', 'sun', 'rain', 'rain', 'rain', 'sun', 'rain', 'rain', 'rain', 'rain', 'rain', 'rain', 'rain', 'sun', 'rain', 'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'rain', 'rain', 'rain', 'sun', 'sun', 'sun', 'rain', 'rain', 'rain', 'rain', 'rain', 'sun', 'sun', 'rain', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 'sun', 'rain', 'rain', 'sun', 'rain', 'rain', 'sun', 'sun']
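
For readers without the library installed, the same chain can be sketched with the standard library only. The names `START`, `TRANSITIONS`, `draw`, and `sample_chain` are illustrative and assume the same start and transition probabilities as the cell above.

```python
import random

START = {"sun": 0.5, "rain": 0.5}
TRANSITIONS = {  # P(tomorrow | today)
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def draw(distribution, rng):
    """Pick one value from a {value: probability} mapping."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def sample_chain(length, seed=None):
    """Sample `length` (>= 1) states; each depends only on the previous one."""
    rng = random.Random(seed)
    state = draw(START, rng)
    chain = [state]
    for _ in range(length - 1):
        state = draw(TRANSITIONS[state], rng)
        chain.append(state)
    return chain

print(sample_chain(10, seed=0))
```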


---
### Hidden Markov Models (Sensor Models)
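A hidden Markov model is a Markov model for a system whose states are hidden and can only be inferred from observations. The sensor model (also called the emission model) describes how likely each observation is given the hidden state. In the examples above the state of the world was known directly; here the AI only sees evidence produced by the state.

**Sensor Markov Assumption** - the assumption that the evidence variable depends only on the corresponding state 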

**Example 1:** voice recognition software:
* Hidden States (from AI perspective):  words spoken by a human
* Observation: audio waveforms

A computer can use the observed audio waveforms to infer the words spoken (the hidden state)

**Example 2:** website analytics:
* Hidden States (from ai perspective):  user engagement
* Observation: web or app analytics

A computer can use the observed web analytics to infer user engagement

Note that observed data is not always 100% accurate.

---
## Different Tasks That Use Hidden Markov Models
**Filtering** - given observations from start until now, calculate the distribution for the current state

**Prediction** - given observations from start until now, calculate the distribution for a future state

**Smoothing** - given observations from start until now, calculate the distribution for a past state

**Most Likely Explanation** - given observations from start until now, calculate the most likely sequence of states

---
### Most Likely Explanation Example Using Weather
* Hidden State: weather
* Observation: umbrella (whether people have an umbrella)

In [56]:
import numpy

# Observation model (sensor model) for each state
sun = DiscreteDistribution({
    "umbrella": 0.2,      # some people  may have umbrella with sun
    "no umbrella": 0.8    # most people won't have one with sun
})

rain = DiscreteDistribution({
    "umbrella": 0.9,      # more people will have one if rain
    "no umbrella": 0.1    # less people won't have one if rain
})

states = [sun, rain]

# Transition model
transitions = numpy.array(
    [[0.8, 0.2], # Tomorrow's predictions if today = sun
     [0.3, 0.7]] # Tomorrow's predictions if today = rain
)

# Starting probabilities
starts = numpy.array([0.5, 0.5]) # 50/50 chance of each

# Create the model
model = HiddenMarkovModel.from_matrix(
    transitions, states, starts,
    state_names=["sun", "rain"]
)
model.bake()

In [59]:
# Observed data
observations = [
    "umbrella",
    "umbrella",
    "no umbrella",
    "umbrella",
    "umbrella",
    "umbrella",
    "umbrella",
    "no umbrella",
    "no umbrella"
]

# Predict underlying states (the hidden weather states) based on
# number of observed people with umbrellas
predictions = model.predict(observations)
for prediction in predictions:
    print(model.states[prediction].name)

rain
rain
sun
rain
rain
rain
rain
sun
sun


**So the above results give the most likely daily weather based upon the umbrella observations**
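
For comparison, the most likely *sequence* of hidden states is classically found with the Viterbi algorithm. Below is a minimal pure-Python sketch using the same umbrella model; the names are illustrative and this is not the library's implementation. Its result need not match a library's per-day output: some decoders, possibly including the `predict` call above, pick the most likely state for each day separately instead of the best whole sequence.

```python
STATES = ("sun", "rain")
START = {"sun": 0.5, "rain": 0.5}
TRANSITIONS = {  # P(tomorrow | today)
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}
EMISSIONS = {  # P(observation | state)
    "sun": {"umbrella": 0.2, "no umbrella": 0.8},
    "rain": {"umbrella": 0.9, "no umbrella": 0.1},
}

def viterbi(observations):
    """Return the single most likely sequence of hidden states."""
    # best[s] = probability of the best path so far that ends in state s
    best = {s: START[s] * EMISSIONS[s][observations[0]] for s in STATES}
    back_pointers = []
    for obs in observations[1:]:
        pointers, updated = {}, {}
        for s in STATES:
            # Best previous state to arrive in s from
            prev = max(STATES, key=lambda p: best[p] * TRANSITIONS[p][s])
            pointers[s] = prev
            updated[s] = best[prev] * TRANSITIONS[prev][s] * EMISSIONS[s][obs]
        back_pointers.append(pointers)
        best = updated
    # Walk the back pointers from the best final state to the start.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

observations = ["umbrella", "umbrella", "no umbrella", "umbrella", "umbrella",
                "umbrella", "umbrella", "no umbrella", "no umbrella"]
print(viterbi(observations))
```

With these numbers the best single sequence keeps day 3 as rain: one umbrella-free day is not enough evidence to justify switching to sun and back again.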