# 1. What makes something a probability?

Answer:

* It needs to be a number 
* The number needs to be between 0 and 1 (including 0 and 1 themselves)
* where 0 indicates impossibility and 1 indicates certainty

#### (Note: this is supposed to be a dialogue between two people, the master and the learner. The learner's questions are *italicized*)

How about this number?

In [81]:
0.08

0.08

Yes, that looks like a probability. Being close to zero, it means that it is nearly impossible.

Question:

* *Just what is nearly impossible here?*

Answer:

* Good question. You're right, this isn't very useful at all. To make the number useful, it needs to be attached to something. 

Let's attach that probability to a value.

In [2]:
# (value, probability)
('something', 0.02)

('something', 0.02)

*Okay, the value 'something' is nearly impossible because it has a probability of 0.02. I'm still not seeing where this is useful.*

Let's take this a step further and attach **values** to **variables**. As a programmer, you know all about variables and how they can take on values. Here are some examples using `X` as the variable:

In [82]:
X = 'something'
# or
X = 'something else'
# or 
X = 5
# or ...

The way we attach probabilities to variable values is by writing and using functions. Let's write one called `P`:

In [83]:
def P(X):
    pass

The function doesn't do anything, but it's a start. 



This probability function `P` is a generic placeholder for a function that YOU need to write. The contents of that function depend on a lot of different factors. 

#### The function `P`, the variables that it takes, and the values that those variables can take are necessary for working in probability theory. 

For example, you might be interested in tomorrow's weather. You decide that one variable that you call `X`, should do the trick here. X can take on the values of either `sunny` or `rainy`. You decide that it's more likely to be sunny than rainy, so you write your `P` function as follows:

In [84]:
def P(X):
    if X == 'sunny': return 0.9
    if X == 'rainy': return 0.5

#### There is a problem: 

* The probabilities of the individual values of variable `X` are both between 0 and 1--making them probabilities by definition--but there still isn't an established relationship between the values. What does it mean that sunny is 0.9 and rainy is 0.5? How do those values compare to each other?

Remedy:

* Add a new constraint: **the probabilities of all the values of the variable must sum to one**. In fact, a more generally useful constraint would be that the probability function `P` should **distribute** probability mass among all combinations of values of all variables such that the sum of all of those probabilities is one.

Let's rewrite our function to meet the demands of this constraint:

In [85]:
def P(X):
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2

In [86]:
P('sunny') + P('rainy')

1.0

That's better. The probabilities of all possible values sum to one.

(Though, let's be honest, we have no clue what those values should actually be. More on that later.)

Note from Dr. K:

* Making everything sum to one is a way of scaling the probabilities to fit a defined space (i.e., between 0 and 1) and doing so forces the variables to be related to each other in that all variable values work together to fill all the space between 0 and 1. 
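The two requirements so far (every probability lies between 0 and 1, and everything sums to one) can be checked mechanically. The following is only a minimal sketch: the helper name `is_distribution` and the dictionary representation are assumptions made for illustration and are not used elsewhere in this notebook.

```python
import math

def is_distribution(probs):
    """Return True if `probs` (a dict mapping value -> probability) is a valid distribution."""
    in_range = all(0.0 <= p <= 1.0 for p in probs.values())
    # compare with a tolerance because floating-point sums are rarely exactly 1.0
    sums_to_one = math.isclose(sum(probs.values()), 1.0)
    return in_range and sums_to_one

first_try = {'sunny': 0.9, 'rainy': 0.5}   # each number is a probability, but the sum is 1.4
second_try = {'sunny': 0.8, 'rainy': 0.2}  # sums to one

print(is_distribution(first_try))
print(is_distribution(second_try))
```

The first attempt fails the check only because of the sum; each number on its own is a perfectly good probability.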

#### There is another problem:

* What if we're interested in if it's going to be windy tomorrow?

In [88]:
P('windy')

The above call doesn't return a value: no `if` branch matches, so the function falls off the end and Python returns `None` (whose type is `NoneType`).

**The remedy:**

* All of the variable (in this case `X`) values that you are interested in (in this case, `sunny, rainy, windy`) need to be accounted for in your probability function. If you want a function that can handle all three, then all three need to be considered and some amount of probability mass needs to be distributed to all three values (and, as usual, everything must sum to one).

Question:

* *Can't we just `return 0` after everything is said and done? That way if someone attempts to use my probability function and they try to get a probability back for something I don't have defined, at least they can still use the result. A 0 is better than a NoneType, isn't it? Like this:*


In [89]:
def P(X):
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2
    return 0

In [90]:
P('windy')

0

*See? At least that returned a value. Before it didn't do anything.*

Dr. K's reply:

* One could easily argue for that, and it makes mathematical and even programmatic sense. However, you have to ask yourself what that 0 means to anyone using your function. You are defining your function to know something about `sunny` and `rainy` and nothing else. If someone comes along and asks what your function knows about `windy` and your function returns a 0, then what you mean is that your function has no idea what will happen with `windy`. That seems harmless so far. But a probability of 0 indicates impossibility, so someone who doesn't know that your function has no provisions for knowing anything about `windy` would interpret 0 as "oh good, there's no wind" (in other words, it's as if your function is completely certain that it is not windy). That's of course a bad interpretation of the function (but a very common way that people interpret probabilities from probability functions like this). How you handle this is up to you. (We'll talk about more principled ways to handle this later in the semester. For now, the keyword you can hold onto is "smoothing".) For the rest of this tutorial, we'll refrain from returning 0 for those unseen events.
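One possible way to avoid both the silent `None` and the misleading 0 is to have the function refuse to answer for values it knows nothing about. This is only a sketch of one option (it is not the smoothing approach mentioned above), and raising `ValueError` is a design choice assumed here for illustration.

```python
def P(X):
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2
    # refuse to answer rather than imply "impossible" with a 0
    raise ValueError(f"P has no probability defined for {X!r}")

try:
    P('windy')
except ValueError as err:
    print(err)
```

The caller now finds out immediately that `windy` is outside what the function models, instead of receiving a number that looks like certainty.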

### Checkpoint: 

* Probabilities are numbers between 0 and 1
* Probability functions (also known as probability **distributions**) distribute a probability mass of 1 across the variable values (also known as the **sample space**).

### Example: coin tosses

What would the sample space be for this scenario? (I.e., what would the variable values be, and what would their probabilities be)? And, with that, what would the probability function/distribution look like?

In [91]:
def P(X):
    if X == 'heads': return 0.5
    if X == 'tails': return 0.5

In [92]:
P('tails')

0.5

In [93]:
P('heads')

0.5

Nice. 

**Question**:

* *Can we handle more than one variable in a probability function/distribution?*

**Answer**:

* Yes. But remember that after you consider all the values that all variables can take on (and all of the combinations!!) everything still needs to sum to one. 

**Example**: 

* What is the probability that someone who walks past you on the street is wearing a hat AND is wearing something that is colored orange? 

In [94]:
def P(X,Y):
    if X == 'wearing-hat'     and Y == 'wearing-orange':     return 0.2
    if X == 'not-wearing-hat' and Y == 'wearing-orange':     return 0.3
    if X == 'wearing-hat'     and Y == 'not-wearing-orange': return 0.2
    if X == 'not-wearing-hat' and Y == 'not-wearing-orange': return 0.3
    

In [95]:
P('wearing-hat', 'wearing-orange')

0.2

### This is an example of a *joint distribution/function*. It's joint because it jointly considers all possible variable combinations of more than one variable.

Just to be more conventional, let's write functions that look more similar to those that are actually used in probability theory. That is, instead of just writing a function `P(X)`, it is sometimes useful to write the actual value in the function call, e.g., `P(X='sunny')`. Python has a way to help us with that. We just have to change the function declaration for `P`:

In [16]:
def P(X='',Y=''):
    if X == 'wearing-hat'     and Y == 'wearing-orange':     return 0.2
    if X == 'not-wearing-hat' and Y == 'wearing-orange':     return 0.3
    if X == 'wearing-hat'     and Y == 'not-wearing-orange': return 0.2
    if X == 'not-wearing-hat' and Y == 'not-wearing-orange': return 0.3

Then we can call the function and actually write the variables that we are using. This is doable in Python:

In [17]:
P(X='wearing-hat', Y='wearing-orange')

0.2

Question:

* *Why bother with this, though? Why is that notation more useful? Why not just call the function with the values like we've been doing so far?*


Answer:

* We know the variable names `X` and `Y` in the second call, whereas in the first call we only know the values themselves. But the values themselves don't tell us anything about which probability function we should be using. That's because lots of variables could take on those values. So the way we determine a function's signature (i.e., which function we are applying) is by the names of the variables `X` and `Y` AND the values.
* Another way of looking at it is *overloading*. The function `P` is a super-duper overloaded function. So how do we know which version of `P` to apply? We can't really tell by the name of the function, since ALL of the probability functions seem to be called the same thing. We can tell which function to apply by looking at the variables, so we often call the functions with the variables written out. I.e., the probability function call `P(X,Y)` is different from `P(X,Z)`. It would call a completely different function.
  
  
So it's at this point where we must depart slightly from how programming functions and probability functions are used. We can still write probabilistic functions using Python, but we're going to have to treat them differently from each other. 

Question:
    
 * *What if I have a joint distribution for P(X,Y), but I'm just interested in P(X)? In other words, what can I do in order to "get rid of" `Y`?*
 
Answer: 

 *  You can step through all the values for `Y` and sum up the probabilities. That way you've accounted for all the possibilities for what can happen in `Y`, leaving you with a distribution over `X`. 
 
 
Check it out:   

In [18]:
P(X='wearing-hat', Y='wearing-orange') + P(X='wearing-hat', Y='not-wearing-orange')

0.4

So P(X='wearing-hat') is 0.4.

In [19]:
P(X='not-wearing-hat', Y='wearing-orange') + P(X='not-wearing-hat', Y='not-wearing-orange')

0.6

And P(X='not-wearing-hat') is 0.6. And everything still sums to one (i.e., 0.4 + 0.6 = 1.0)!

This magical operation is called **marginalization** and `P(X)` here is a **marginal distribution** (a special kind of probability function/distribution), which is derived from `P(X,Y)` by summing out `Y`.

The general formula to marginalize (for joint distributions) is:

$$P(X) = \sum_{y \in Y} P(X,Y=y)$$

Where we step through each value in $Y$.
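The summation can be written as a short loop. In this sketch the joint function is the hat/orange one from above; the list `Y_values` and the helper name `marginal_X` are assumptions introduced for illustration.

```python
import math

def P(X='', Y=''):
    if X == 'wearing-hat'     and Y == 'wearing-orange':     return 0.2
    if X == 'not-wearing-hat' and Y == 'wearing-orange':     return 0.3
    if X == 'wearing-hat'     and Y == 'not-wearing-orange': return 0.2
    if X == 'not-wearing-hat' and Y == 'not-wearing-orange': return 0.3

# every value that Y can take on
Y_values = ['wearing-orange', 'not-wearing-orange']

def marginal_X(x):
    """P(X=x), obtained by summing the joint over every value of Y."""
    return sum(P(X=x, Y=y) for y in Y_values)

print(marginal_X('wearing-hat'))
print(marginal_X('not-wearing-hat'))
```

The helper has to know every value of `Y` up front; if one is left out of `Y_values`, the marginal silently comes out too small.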

*So far so good.*

### Notes on multiple variables and independence:

* When multiple variables are used in a probability function/distribution, then it is a special kind of function/distribution called a **joint** distribution, as noted above. The notation is what we've seen so far: `P(X,Y)` is read as "the joint distribution over X **and** Y." (The **and** here is very important: it's asking "what is the probability that these two things happen together?") One can define as many variables as needed (e.g., P(X,Y,Z)), but in a joint distribution like this, every possible combination (i.e., the cross product) of all of the values must be defined. 


Joint distributions don't sound very fun, especially when there are many variables and each can take on many possible values. 

Question:

* *Do we really have to come up with ALL possible combinations and still have everything sum to one? I mean, if we have five variables, each only having two possible values (i.e., binary), then that means we'd have to come up with 32 different possible outcomes and everything still has to sum to one. That's just outrageous.*

I feel your pain. If it helps you feel better, there's a way to simplify by handling cases where variables don't really have any direct relation to each other. Consider the hat/orange example. Does the fact that someone is wearing a hat really have anything to do with their choice of wearing orange? Not really. (I mean, in Idaho it might, but let's just say that it doesn't actually matter.)

In such a case, the two variables are **independent** of each other. If that happens, then the following is true:

## $$P(X,Y) = P(X)P(Y)$$

There's a lot to gain from that, it turns out. It doesn't seem to be the case for our little example with two binary variables, but if we had 3 variables, each taking on 10 possible values, then we would only need to consider 30 different probabilities instead of 1,000. We can refactor the function from `P(X,Y)` into two functions, one for `P(X)` and one for `P(Y)`:

In [3]:
def Px(X=''):
    if X == 'wearing-hat': return 0.6
    if X == 'not-wearing-hat': return 0.4

def Py(Y=''):
    if Y == 'wearing-orange': return 0.4
    if Y == 'not-wearing-orange': return 0.6

In [4]:
Px(X='wearing-hat') * Py(Y='wearing-orange')

0.24

*Hmm, that's not the same as before.*

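That's because the numbers we just typed into `Px` and `Py` are not the marginals of the joint function we wrote earlier. If you marginalize that joint, you get 0.4/0.6 for hat/no hat and 0.5/0.5 for orange/no orange, and 0.4 * 0.5 = 0.2 matches the joint exactly. So our original joint actually does treat the two variables as independent; it's the new `Px` and `Py` that describe a different distribution. But if we enumerate all outcomes of `Px` times `Py`, does it still really sum to one?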

See for yourself:

In [5]:
Px(X='wearing-hat') * Py(Y='wearing-orange') + Px(X='not-wearing-hat') * Py(Y='wearing-orange') + Px(X='wearing-hat') * Py(Y='not-wearing-orange') + Px(X='not-wearing-hat') * Py(Y='not-wearing-orange')

1.0

*Well then.*
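As a sketch of how one could test independence numerically, the block below computes both marginals of the earlier hat/orange joint by marginalization and compares their product with the joint, combination by combination. The dictionary representation and the names `joint`, `marginal_x`, and `marginal_y` are assumptions made for illustration.

```python
from itertools import product
import math

# the joint distribution from the hat/orange example, written as a dictionary
joint = {
    ('wearing-hat', 'wearing-orange'): 0.2,
    ('not-wearing-hat', 'wearing-orange'): 0.3,
    ('wearing-hat', 'not-wearing-orange'): 0.2,
    ('not-wearing-hat', 'not-wearing-orange'): 0.3,
}
X_values = ['wearing-hat', 'not-wearing-hat']
Y_values = ['wearing-orange', 'not-wearing-orange']

# marginalize: sum out the other variable
marginal_x = {x: sum(joint[(x, y)] for y in Y_values) for x in X_values}
marginal_y = {y: sum(joint[(x, y)] for x in X_values) for y in Y_values}

# independence holds if P(X,Y) = P(X)P(Y) for every combination
independent = all(
    math.isclose(joint[(x, y)], marginal_x[x] * marginal_y[y])
    for x, y in product(X_values, Y_values)
)
print(marginal_x)
print(marginal_y)
print(independent)
```

With these particular numbers the check succeeds, which suggests that the 0.24 above comes from the hand-picked values in `Px` and `Py` rather than from the joint itself.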

The same thing works if you want to link different **events** together, not just different functions. 

For example, what if we want to find the probability of tossing the coin three times with the sequence: `heads, heads, tails`?

Answer:

* The events themselves seem to be independent of each other. What can we do?
* We can apply the same function `P` for all three events.
* Since they are independent events, we can multiply the probabilities of all three events together!
* Note: each 'event' is also known as a **trial**

In [4]:
def P(X=''):
    if X == 'heads': return 0.5
    if X == 'tails': return 0.5

In [5]:
P(X='heads') * P(X='heads') * P(X='tails')

0.125
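The same multiplication works for a sequence of any length. Here is a small sketch, assuming Python 3.8 or later for `math.prod`; the helper name `P_sequence` is made up for illustration.

```python
import math

def P(X=''):
    if X == 'heads': return 0.5
    if X == 'tails': return 0.5

def P_sequence(outcomes):
    """Probability of a sequence of independent trials: the product of the per-trial probabilities."""
    return math.prod(P(X=outcome) for outcome in outcomes)

print(P_sequence(['heads', 'heads', 'tails']))
print(P_sequence(['heads'] * 10))
```

Note how quickly the probability of any one specific sequence shrinks as the number of trials grows.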

### Checkpoint

* A **joint** probability distribution is a way of expressing how multiple variables jointly affect a distribution. Just like in a single-variable distribution, the probabilities of all combinations of all values of all variables must sum to one.
* We can factor the variables of joint functions/distributions into multiple easier-to-manage functions/distributions if we can determine that those variables are **independent** of each other. 

Question:

* *OK. We know how to handle one or more variables in a probability distribution. But these all seem to be cases when the variables aren't actually known. What if I **know** the value of a variable already? Would that not change how a probability function works?*

Yes. Here's an example of that. You can write a joint probability function `P(X,Y)` where `X` can be `sunny/rainy` and `Y` can be `carry-umbrella/leave-umbrella`. Those obviously aren't independent of each other. But in this case, writing a joint probability distribution is somewhat of a disservice because you can look out the window and *know* with certainty what the value of `X` is: it's either sunny or it's rainy (at that particular time, at least). You can treat it as a different kind of variable where you know the value.

So the question you want to ask is: what is the probability that someone carries an umbrella when it's raining? Put in other words, what is the probability that someone carries an umbrella **given that** it is raining? Put yet another way, what is the probability that we carry an umbrella **conditioned on** the *fact* that it's raining?

Your function would look something like this:

In [11]:
def P(X='',Y=''): 
    if X == 'sunny':
        if Y == 'carry-umbrella':return 0.08
        if Y == 'leave-umbrella':return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella':return 0.9
        if Y == 'leave-umbrella':return 0.1

The structure of that function looks different from the single-variable and joint probability functions/distributions that we have seen so far. That's because you are taking information that you know about (i.e., whether `X` is sunny or rainy), and *conditioning* the results on that knowledge (hence, the outer-most `if` statement). 

As you might have guessed, this kind of probability distribution is a special case called a **conditional**
distribution. 

Note: the way we wrote conditional probability functions in the form `P(X,Y)` is actually wrong. Conditional probabilities use the notation `P(X|Y)`, but we can't really do that with Python. So, from now on, we'll notate conditional probability functions with `Pcond`. It's also conventional for the items on the left to be conditioned on the variables on the right, so we actually wrote it backwards when we were calling it. 

Let's fix that:

In [12]:
def Pcond(Y='', X=''): 
    if X == 'sunny':
        if Y == 'carry-umbrella':return 0.08
        if Y == 'leave-umbrella':return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella':return 0.9
        if Y == 'leave-umbrella':return 0.1

In [13]:
# the probability that I carry an umbrella, given that it is sunny
Pcond(Y='carry-umbrella', X='sunny')

0.08

In [14]:
# the probability that I carry an umbrella, given that it is rainy
Pcond(Y='carry-umbrella', X='rainy')

0.9

*Hold on, you said that probability distributions should sum to one, but `Pcond` doesn't sum to one!*

Right. The way conditional probability functions/distributions work is this: since you know that something is the case (e.g. that it is sunny outside), then you can partition your function such that all of the sub partitions act as functions of their own--hence, each sum to one. Maybe rewriting the above function like this will help:

In [7]:
def Psunny(Y=''):
    if Y == 'carry-umbrella':return 0.08
    if Y == 'leave-umbrella':return 0.92
    
def Prainy(Y=''):
    if Y == 'carry-umbrella':return 0.9
    if Y == 'leave-umbrella':return 0.1    
    
def Pcond(Y='', X=''): 
    if X == 'sunny': return Psunny(Y=Y)
    if X == 'rainy': return Prainy(Y=Y)

Where `Pcond` calls functions that each properly sum to one on their own. 

In [8]:
Pcond(Y='carry-umbrella', X='sunny')

0.08
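To see that each partition sums to one on its own, one can fix `X` and sum over all values of `Y`. This is a minimal sketch; the lists `X_values` and `Y_values` and the dictionary `slice_totals` are names assumed for illustration.

```python
import math

def Pcond(Y='', X=''):
    if X == 'sunny':
        if Y == 'carry-umbrella': return 0.08
        if Y == 'leave-umbrella': return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella': return 0.9
        if Y == 'leave-umbrella': return 0.1

X_values = ['sunny', 'rainy']
Y_values = ['carry-umbrella', 'leave-umbrella']

# for each known value of X, add up the probabilities of every value of Y
slice_totals = {x: sum(Pcond(Y=y, X=x) for y in Y_values) for x in X_values}

for x, total in slice_totals.items():
    print(x, math.isclose(total, 1.0))
```

Summing over every `X` and `Y` together would give 2.0 here (one for each value of `X`), which is exactly what the learner noticed.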

### More about Conditional Probabilities:

* You may have noticed that for joint probabilities, it doesn't really matter how you set up your function. In other words, `P(X,Y)` and `P(Y,X)` are the same thing. 
* For conditional probabilities, this isn't necessarily the case. In other words, `P(X|Y)` isn't always the same as `P(Y|X)`. It *can* be the case, but it often is not the case. Our example above for `Pcond` doesn't illustrate this very well (i.e., if you changed the conditions to have `Y` be considered in the first `if` statements, it wouldn't really change the outcome because of the way the function is defined). But seriously, do you think that the rain depends on whether or not you carry your umbrella? Obviously, it doesn't. However, whether or not you carry your umbrella likely depends on the weather. 

Question:

* *Can we marginalize over `Y` for `P(X|Y)` to get `P(X)` like we did for the joint distribution?*

Answer:

* Yes! But you have to know `P(Y)` somehow, because you have to do marginalizations for conditional probabilities a little differently than with joint distributions.

Here's the formula for marginalizing over conditional distributions:

$$P(X) = \sum_{y \in Y} P(X|Y=y) P(Y=y)$$
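Here is a sketch of that formula applied to the umbrella example. In our code the conditioned variable is `Y` (the umbrella) and the known variable is `X` (the weather), so the roles are swapped relative to the formula: we compute `P(Y)` by summing `P(Y|X=x) P(X=x)` over the weather values. The weather distribution `Pweather` is an assumption that reuses the made-up 0.8/0.2 numbers from earlier, and `P_umbrella` is a helper name invented for illustration.

```python
import math

def Pweather(X=''):
    # assumed distribution over the weather (the made-up 0.8/0.2 numbers from earlier)
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2

def Pcond(Y='', X=''):
    if X == 'sunny':
        if Y == 'carry-umbrella': return 0.08
        if Y == 'leave-umbrella': return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella': return 0.9
        if Y == 'leave-umbrella': return 0.1

X_values = ['sunny', 'rainy']

def P_umbrella(y):
    """P(Y=y): sum over every weather value x of P(Y=y | X=x) * P(X=x)."""
    return sum(Pcond(Y=y, X=x) * Pweather(X=x) for x in X_values)

print(P_umbrella('carry-umbrella'))
print(P_umbrella('leave-umbrella'))
```

Each conditional probability is weighted by how likely its condition is, which is why `P(X)` has to be known before the marginal can be computed.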

### Notes on Probability functions/distributions so far:

* The functions that we wrote, `P`, always constrain their distributions to sum to one, no matter how many variables are involved, no matter how many values those variables can have, whether it's single-variable or joint (for conditional distributions, each sub-function sums to one). 
* That some variables are conditional upon each other and that others are not is a very important concept in probability theory. Variables that aren't related or conditional upon each other are also known as **independent** variables. The coin toss events were independent of each other, but the decision to take an umbrella or not was dependent on the weather. 
    * *Why is independence so important?* The answer is: math. When variables are independent of each other, it makes the math to compute probabilities much easier: we can define extra `P` functions and just multiply things together! Easy! But if they aren't independent, we have to come up with the probability function/distribution that defines the conditions that the variables have on each other. That takes more effort and is more computationally expensive.

## Checkpoint:

* Functions are important
* Probability functions/distributions constrain all combinations of variable probabilities to sum to one (except for conditional probabilities where each sub-function sums to one).
* Conditional probability functions/distributions take the form `P(Y|X)` (read: "the probability of Y given X").
* Joint probability functions/distributions take the form `P(X,Y)`.
* Variables can be jointly related, conditionally related, or independent of each other.


Question:

* *Why do this? Why write probability functions that are constrained to have everything distributed such that they sum to one? Is there really no other way?*

Dr. K's answer:

* First, when we map everything to the probability scale, i.e., between 0 and 1, we get some nice tools that we can use. We get some nice equations that we can apply (we will see even more of them below!). We can make use of the entire field of statistics to do our bidding. Second, when you force the probabilities of all possible value combinations to sum to one, you are comparing and relating those variables. There is only a fixed amount of probability mass to go around, so when one variable value gets more of it than the others, that means something. Interpretability is really important.

# 2. Writing the Probability Functions / Estimating the Distributions

So far, we've seen the difference between single-variable, joint, and conditional probability distributions. However, the examples that we wrote were completely made up (except maybe the coin toss with a 50/50 head/tails distribution). Obviously, this isn't how it's done. 

To write our own probability functions, we need to **estimate** the values. For that, we need data.

We can either go find the data, or, for illustration purposes, we can just invent some data. 

In [20]:
#
# some data: a randomly generated list of 1s and 0s
# 

from random import randint  # only randint is needed below

n = 1597
data = [randint(0,1) for b in range(1,n+1)]

In [21]:
data[:10] # show the first 10 elements of the list

[0, 0, 1, 1, 1, 0, 0, 0, 0, 1]

Now we want to write a probability function for this data, but we don't yet know which probabilities to fill in for `0` and `1`.

Question:

*How can we do that?*

Answer:

### By COUNTING THINGS*

There are lots of ways to count things. Since we are using Python, I'm going to use a `Counter`, a very convenient class from the standard library's `collections` module.

#### *In NLP we often count things because we have things that are countable. There are many cases where we can't actually count the data to obtain a distribution. We'll talk about that later. 

In [22]:
from collections import Counter # the Counter class from the standard library

counts = Counter(data) # just pass in the list and see what happens

counts

Counter({0: 796, 1: 801})

In [23]:
counts[0], counts[1]

(796, 801)

Well that was easy. 

But those are huge numbers. How do we get probabilities out of that? 

Answer:

* Use the relative frequencies as the probabilities

In [30]:
def P(X=''):
    # counts and data are module-level (global) variables. Reading them needs no `global` declaration.
    # It's generally frowned upon to rely on globals in Python, but I'm doing this to keep the function
    # calls looking like they should (i.e., like probability functions)
    return counts[X] / len(data) # true division in Python 3 (Python 2 needed float() here)

In [31]:
P(X=1)

0.5015654351909831

In [32]:
P(X=0)

0.49843456480901693

In [34]:
P(X=1) + P(X=0) # sums to one?

1.0

*That seems like a nice Probability function! We didn't have to explicitly write the actual probabilities of the two cases. That's super useful. What if we are dealing with more than just binary values?*

Let's make some more data with 26 possible values (0 through 25):

In [35]:
data = [randint(0,25) for b in range(1,n+1)]
counts = Counter(data)

counts

Counter({0: 61,
         1: 70,
         2: 56,
         3: 68,
         4: 56,
         5: 54,
         6: 85,
         7: 64,
         8: 62,
         9: 63,
         10: 71,
         11: 76,
         12: 47,
         13: 52,
         14: 59,
         15: 65,
         16: 67,
         17: 58,
         18: 51,
         19: 55,
         20: 68,
         21: 55,
         22: 63,
         23: 51,
         24: 59,
         25: 61})

In [36]:
data[:10]

[0, 3, 11, 11, 5, 6, 7, 9, 17, 16]

Question:

* *Okay, so we have a huge list of numbers between 0 and 25. Can we use our same probability function to calculate the probabilities?*

Answer:

* Let's see:

In [37]:
P(X=12)

0.029430181590482152

*Does everything sum to one?*

In [38]:
import numpy as np

np.sum([P(X=x) for x in set(data)]) # this function calls P for all possible values in data (hence the call to set)

1.0000000000000002

There might be a slight rounding error, but it's essentially one. Not bad, that. Still, that's a simple estimation from data for a single variable. Things get much more complicated when estimating joint distributions with several variables, each of which can range over many possible values.
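The same counting idea extends to a joint distribution if each observation is a pair of values. The sketch below invents paired data with `randint` (the fixed seed, the value ranges, and the names `pairs` and `pair_counts` are all assumptions made for illustration) and uses relative frequencies of the pairs as the joint probabilities.

```python
from collections import Counter
from random import randint, seed

seed(0)  # fixed seed so that the invented data is repeatable
n = 1597

# invented paired data: X is binary, Y takes one of three values
pairs = [(randint(0, 1), randint(0, 2)) for _ in range(n)]
pair_counts = Counter(pairs)  # counts each (x, y) combination

def P(X='', Y=''):
    # relative frequency of the combination (X, Y)
    return pair_counts[(X, Y)] / len(pairs)

# sum over every combination that was observed
total = sum(P(X=x, Y=y) for (x, y) in pair_counts)
print(len(pair_counts), total)
```

The number of combinations to count grows multiplicatively with each added variable, so the same amount of data gets spread ever thinner.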

#### A lot of the work in machine learning (and therefore in NLP) has to do with estimating values like this. This is what is often called "training" but we'll get to that in more detail later in the semester.

# 3. Joint and Conditional Probabilities: Mapping Between Them

Question:

* *What if I have the means to estimate a joint probability, but what I really want is a conditional probability? Or vice versa?*

Answer:

* Vee haf vays. (We have ways.)

It is the case that:

### $$P(A,B) = P(B) P(A|B) = P(A) P(B|A)$$

This is called the **multiplication rule** and it's very, very useful.  (This explains why marginalizing over a conditional probability was kind of weird.) 

In fact, we can generalize it to probability functions/distributions that have an arbitrary number of variables:

### $$P(A_1, ..., A_n) = P(A_1) P(A_2|A_1) P(A_3|A_2,A_1)\ ...\ P(A_n|A_1, ..., A_{n-1})$$

This is called the **chain rule**. 

So we can use these equalities to go from joint to conditional probabilities and the other way around, if we need to (and we will need to).
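As a sketch of going in both directions, the block below builds a joint distribution from the umbrella conditional using the multiplication rule and then recovers a conditional probability from that joint by dividing. The weather distribution `Pweather` (with the made-up 0.8/0.2 numbers) and the dictionary `joint` are assumptions made for illustration.

```python
import math

def Pweather(X=''):
    # assumed distribution over the weather (made-up numbers)
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2

def Pcond(Y='', X=''):
    if X == 'sunny':
        if Y == 'carry-umbrella': return 0.08
        if Y == 'leave-umbrella': return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella': return 0.9
        if Y == 'leave-umbrella': return 0.1

X_values = ['sunny', 'rainy']
Y_values = ['carry-umbrella', 'leave-umbrella']

# conditional -> joint, using the multiplication rule P(X,Y) = P(X) P(Y|X)
joint = {(x, y): Pweather(X=x) * Pcond(Y=y, X=x) for x in X_values for y in Y_values}
print(math.isclose(sum(joint.values()), 1.0))  # the joint sums to one

# joint -> conditional, by rearranging the rule: P(Y|X) = P(X,Y) / P(X)
recovered = joint[('rainy', 'carry-umbrella')] / Pweather(X='rainy')
print(math.isclose(recovered, Pcond(Y='carry-umbrella', X='rainy')))
```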

# 4. Bayes' Theorem

The multiplication rule and chain rule are useful for going between joint and conditional probabilities, but there's another thing that would be useful. Forget joint probabilities for the moment and let's focus on conditional probabilities. Recall that we can't just swap the variables; i.e., in general `P(A|B) != P(B|A)`.

Question:

* *Is there a way to swap the dependencies? I.e., is there a way to go from `P(A|B)` to `P(B|A)` without injuring myself?*

Answer:

* Yes. It's called Bayes' Theorem.

### Bayes' Theorem:

### $$P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{\sum_i P(B|A_i)P(A_i)}  $$
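To make the theorem concrete, here is a sketch that swaps the umbrella conditional around: given that someone carries an umbrella, how probable is rain? The weather distribution `Pweather` (made-up 0.8/0.2 numbers) and the helper name `bayes` are assumptions made for illustration.

```python
import math

def Pweather(X=''):
    # assumed distribution over the weather (made-up numbers)
    if X == 'sunny': return 0.8
    if X == 'rainy': return 0.2

def Pcond(Y='', X=''):
    if X == 'sunny':
        if Y == 'carry-umbrella': return 0.08
        if Y == 'leave-umbrella': return 0.92
    if X == 'rainy':
        if Y == 'carry-umbrella': return 0.9
        if Y == 'leave-umbrella': return 0.1

X_values = ['sunny', 'rainy']

def bayes(x, y):
    """P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / sum over xi of P(Y=y | X=xi) P(X=xi)."""
    numerator = Pcond(Y=y, X=x) * Pweather(X=x)
    evidence = sum(Pcond(Y=y, X=xi) * Pweather(X=xi) for xi in X_values)
    return numerator / evidence

posterior_rainy = bayes('rainy', 'carry-umbrella')
posterior_sunny = bayes('sunny', 'carry-umbrella')
print(posterior_rainy, posterior_sunny)
```

With these made-up numbers, seeing an umbrella raises the probability of rain from 0.2 to roughly 0.74, even though `P(carry-umbrella | rainy)` itself is 0.9.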

# 5. NLP Example: Statistical Language Modeling

Suppose we are writing a program that does automatic speech recognition (ASR). Our program works pretty well; we are able to take in audio and have our program produce transcriptions. The problem is, our system isn't sure which transcriptions are correct.

If we utter "it is hard to recognize speech", our system thinks we could have said one of the following things:

* it is hurdle reckon ice peach
* it is hard to wreck a nice beach
* it is hard to recognize speech
* is car two reckon nice beach

This always happens with your system--it always produces lots of different hypothesized transcriptions, and the correct one is usually in there, but you want to find a way to automatically rank them so the transcriptions that sound most like grammatical English sentences that people would actually say get bumped up towards the top. 

How do we go about this? Could we do this with our knowledge of probability theory? (The answer is obviously *yes*). 

What we are looking for is the probability of a transcription. Let's call that `T`. A transcription is made up of words. Let's call those $W_1, W_2, ..., W_n$. 

The most obvious thing to do here is treat this as a joint probability with all the words:

$$P(W_1, W_2, ... W_n)$$

Easy, right? All we have to do now is grab a bunch of text (we can just download the Internet or something) and we can start counting things up. Then we can get probabilities for each of the transcriptions, beginning with the first one:

$$P(W_1=it, W_2=is, W_3=hurdle, W_4=reckon, W_5=ice, W_6=peach)$$

But we already see a huge (HUGE) problem when we get to the next transcription: it has 8 words instead of 6. From what we know about "overloading" joint probability functions, we know that $P(W_1, ... W_6)$ will require a completely different function than $P(W_1, ..., W_8)$. In fact, we'll have to write a different probability function for EACH LENGTH of possible transcriptions. That's already a horrible idea. 

There's another problem. Even if you get all of the text that ever existed and count up those numbers and make your function for the 6-word case, you are going to have to account for ALL possible words for EACH of the `W`s. Let's say you write a clever program that can find out the vocabulary (i.e., the set of distinct words that are used) and its size is 5,000. That means you will need to write a function, somehow, that can account for $5000^6$ (i.e., about $1.56 \times 10^{22}$) possible combinations of the 6 variables. You don't want that. Several generations will pass before this function will be trained.

(There's another problem here that we won't address right now: you could count up all the data you can, but you'll find out that some words just don't occur in the second spot in some sentences, so the count value for that particular word will be 0, which will mess up your system in a big way... we'll deal with that in a later lecture)



What can be done? We can **make independence assumptions** and **factor** the joint distribution into smaller, more manageable ones. Remember independence? That's when `P(X,Y)` = `P(X)P(Y)`. So, you decide that words are all independent of each other in a sentence and you determine that: 

$$P(W_1, ... W_n) = P(W_1) ... P(W_n)$$

Now you don't have to find all combinations anymore! In fact, you can use the same `P` function for all the words in the sentence. Problem solved, right?

Not really. What you've invented here is a probability function that makes way too many assumptions (in fact, this is akin to a Naive Bayes Classifier which we will discuss later in the semester). This will produce terrible probabilities. Is there a way we can make it so the assumptions aren't so, well, naive?

Yes. In fact, let's invoke the chain rule here:

$$P(W_1, ..., W_n) = P(W_1) P(W_2|W_1) P(W_3|W_1, W_2) ... P(W_n|W_1, ..., W_{n-1})$$

This is already a lot better than both the super complicated, full joint distribution extreme, and the super naive extreme (i.e., no jointness at all). But this is still too complicated. The last segment $ P(W_n|W_1, ..., W_{n-1})$ doesn't look much better than the original, since it still is conditioned on a huge joint distribution. 

OK, then now we can make some not-so-harsh independence assumptions on some of the parts. Let's just say that a word is only conditioned on the previous word:

$$P(W_1, ..., W_n) = P(W_1) P(W_2|W_1) P(W_3| W_2) ... P(W_n|W_{n-1})$$

This is called a **bigram** "model". A model where each word is conditional upon the previous *two* words is called a **trigram** model. The general notion of conditioning the current word upon previous words is **n-grams**. 

So now, all you need to do is count up relative frequencies of words for the `P(W)` probability function and the $P(W_i|W_{i-1})$ function. That's a nice middle ground between the full joint distribution extreme and the completely naive extreme. You could fairly easily write a program to do just this, which you could use to calculate the probability of a transcript no matter how many words it has. 
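A minimal sketch of such a program follows. The three-sentence corpus is invented purely for illustration, the names (`P`, `Pcond`, `P_sentence`, `context_counts`) are assumptions, and the estimates are plain unsmoothed relative frequencies, so any word pair that never occurs in the corpus gets a probability of 0 (the problem noted earlier that is left for a later lecture).

```python
from collections import Counter

# a tiny invented corpus; a real model would be estimated from far more text
corpus = [
    "it is hard to recognize speech",
    "it is hard to say",
    "speech is hard to recognize",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(word for sent in sentences for word in sent)
bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
# how often each word is followed by another word (i.e., appears as a bigram's first word)
context_counts = Counter(prev for sent in sentences for prev in sent[:-1])
total_words = sum(unigram_counts.values())

def P(W=''):
    """P(W): relative frequency of the word in the corpus."""
    return unigram_counts[W] / total_words

def Pcond(W='', prev=''):
    """P(W_i = W | W_{i-1} = prev): count(prev, W) / count(prev followed by anything)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context; unsmoothed, so the whole sentence will score 0
    return bigram_counts[(prev, W)] / context_counts[prev]

def P_sentence(words):
    """Bigram model: P(W_1) * P(W_2|W_1) * ... * P(W_n|W_{n-1})."""
    prob = P(W=words[0])
    for prev, word in zip(words, words[1:]):
        prob *= Pcond(W=word, prev=prev)
    return prob

print(P_sentence("it is hard to recognize speech".split()))
print(P_sentence("it is hard to wreck a nice beach".split()))
```

The same two functions handle the 6-word and the 8-word transcription, which is exactly what the full joint distribution could not do.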



## Final notes

* This notebook only covers **discrete** probability functions. Stay tuned for the other kind: **continuous**. 
* Our examples here were small, but the simple concepts here apply to arbitrary numbers of variables and probability functions. 
* We haven't said much about statistics, but we've said a lot about probabilities. What's the difference? Probabilities are numbers. Statistics, which uses probabilities in a big way, is a field of study. 
* We've written some functions to illustrate the concepts, but in "real life" linear algebra (i.e., matrices) is often used for estimating/training the functions.