In [1]:
#: the usual imports
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 25: Decisions

### Updating Predictions

* New information changes our knowledge of the world.

Or, in data science terms:

* New data leads us to update our predictions.

The updating method we will learn today can be generalized to work in complex settings and is one of the most powerful tools used in data science.

## A “More Likely Than Not” Binary Classifier

Start with a toy example.

Suppose there is a university class with the following composition:

* 60% of the students are Second Years and the remaining 40% are Third Years
* 50% of the Second Years have declared their major
* 80% of the Third Years have declared their major

Pick a student at random from the class. 

### Discussion Question

Can you classify the student as Second Year or Third Year, using a “more likely than not” criterion?

* A) Yes
* B) No
* C) Not sure

Yes: classify the student as a Second Year, since 60% of the class are Second Years. The information about the majors is irrelevant here, as we already know the proportions of Second and Third Years in the class.

We have a pretty simple classifier!

But now suppose I give you some additional information about the student who was picked:

**The student has declared a major.**

Would this knowledge change your classification?

### Updating the Prediction Based on New Information

We need to look at the relation between year and major declaration.

* More students are Second Years than Third Years.
* But among the Third Years, a much higher percentage have declared their major than among the Second Years.

Our classifier has to take both of these observations into account.

To visualize this, we will use a table `students` that has one row for each of 100 students, with years and majors in the same proportions as given above.

In [2]:
# np.array(list) converts list to an array
# provided all the elements of list are of the same type

n = 100
second = round(n * 0.6)
third = round(n * 0.4)

year = np.array(['Second'] * second + ['Third'] * third)
major = np.array(['Declared'] * (round(second * 0.5)) + ['Undeclared'] * (round(second * 0.5)) + \
                 ['Declared'] * (round(third * 0.8))  + ['Undeclared'] * (round(third * 0.2)))
                 
students = Table().with_columns(
    'Year', year,
    'Major', major
)
students.show(3)

Year,Major
Second,Declared
Second,Declared
Second,Declared


In [4]:
students.pivot('Major', 'Year')

Year,Declared,Undeclared
Second,30,30
Third,32,8


We have to pick which row the student is most likely to be in. 

When we knew nothing more about the student:

* they could be in any of the four cells
* they were therefore more likely to be in the top row (Second Year), because that row contains more students.

But now we know that the student has declared a major:

* the space of possible outcomes has shrunk
* now the student can only be in one of the two Declared cells.

So we have to update our prediction and classify the student as a Third Year, because the Third Year Declared cell (32 students) is larger than the Second Year Declared cell (30 students).

What is the chance that our classification is correct?

In [5]:
32/(30+32)

0.5161290322580645
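
The same number can be read off the pivot table programmatically. The sketch below hard-codes the four counts in a plain dictionary (the name `counts` is ours for illustration, not part of the `datascience` library) and restricts attention to the Declared column.

```python
# Counts copied from the pivot table above: counts[year][major]
counts = {
    'Second': {'Declared': 30, 'Undeclared': 30},
    'Third':  {'Declared': 32, 'Undeclared': 8},
}

# Knowing the student has declared restricts us to the Declared column.
declared_total = sum(row['Declared'] for row in counts.values())

# Proportion of each year among the declared students.
chance_given_declared = {
    year: row['Declared'] / declared_total for year, row in counts.items()
}

# "More likely than not": pick the year with the larger proportion.
prediction = max(chance_given_declared, key=chance_given_declared.get)
print(prediction, chance_given_declared[prediction])
```

The classifier picks Third Year, but only barely: the two Declared cells differ by just two students.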

## Tree Diagram

The calculation depends only on the proportions in the different categories, not on the counts.

In [6]:
students.pivot('Major', 'Year')

Year,Declared,Undeclared
Second,30,30
Third,32,8


We can visualize this with a tree diagram.

![tree_students.png](tree_students.png)

This diagram partitions the students into four distinct groups, known as “branches”:

* “Third Year, Declared” branch contains the proportion 0.4 x 0.8 = 0.32 of the students
* “Second Year, Declared” branch contains 0.6 x 0.5 = 0.3 of the students

So the proportion of Third Years among students who are Declared is:

In [7]:
(0.4 * 0.8)/(0.6 * 0.5  +  0.4 * 0.8)

0.5161290322580645
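
As a rough sketch of the tree itself, all four branch proportions can be computed by multiplying along each path. The dictionary names below are our own choices for illustration.

```python
# Proportions from the tree diagram.
prior = {'Second': 0.6, 'Third': 0.4}
declared_rate = {'Second': 0.5, 'Third': 0.8}

# Each branch is (chance of year) x (chance of major status given year).
branches = {}
for year in prior:
    branches[(year, 'Declared')] = prior[year] * declared_rate[year]
    branches[(year, 'Undeclared')] = prior[year] * (1 - declared_rate[year])

for branch, proportion in branches.items():
    print(branch, round(proportion, 2))

# The four branches partition the class, so they add up to 1.
total = sum(branches.values())
```

Conditioning on Declared keeps only the two Declared branches and rescales them so that they add up to 1.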

# Bayes' Rule

The method that we have just used is due to the Reverend Thomas Bayes (1701-1761).

We will state the rule in the context of our population of students. First, some terminology:

#### Prior probabilities. 

Before we knew the chosen student’s major declaration status, the chance that the student was a Second Year was 60% and the chance that the student was a Third Year was 40%. These are the prior probabilities of the two categories.

#### Likelihoods. 

These are the chances of the Major status, given the category of student; thus they can be read off the tree diagram. For example, the likelihood of Declared status given that the student is a Second Year is 0.5.

#### Posterior probabilities. 

These are the chances of the two Year categories, after we have taken into account information about the Major declaration status. We computed one of these:

* The posterior probability that the student is a Third Year, given that the student has Declared, is denoted $P(\text{Third Year} \mid \text{Declared})$ and is calculated as follows.

$$P(\text{Third Year} \mid \text{Declared}) = \frac{0.4 \times 0.8}{0.6 \times 0.5 + 0.4 \times 0.8}$$

$$= \frac{(\text{prior probability of Third Year}) \times (\text{likelihood of Declared given Third Year})}{\text{total probability of Declared}}$$

* The other posterior probability is

$$P(\text{Second Year} \mid \text{Declared}) = \frac{0.6 \times 0.5}{0.6 \times 0.5 + 0.4 \times 0.8}$$

$$= \frac{(\text{prior probability of Second Year}) \times (\text{likelihood of Declared given Second Year})}{\text{total probability of Declared}}$$

Notice that both the posterior probabilities have the same denominator: the chance of the new information, which is that the student has Declared.

Because of this, Bayes’ method is sometimes summarized as a statement about proportionality:

$$\text{posterior} \propto \text{prior} \times \text{likelihood}$$
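
The proportionality statement suggests a simple recipe: multiply each prior by its likelihood, then divide by the sum so the results add up to 1. Here is a minimal sketch; the function name `posterior` and its dictionary-based interface are our own illustration, not a library API.

```python
def posterior(prior, likelihood):
    """Return the posterior probability of each category.

    prior:      dict mapping category -> prior probability
    likelihood: dict mapping category -> chance of the observed
                information given that category
    """
    # Unnormalized posterior: prior x likelihood for each category.
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    # The shared denominator: total probability of the observation.
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}


given_declared = posterior(
    prior={'Second': 0.6, 'Third': 0.4},
    likelihood={'Second': 0.5, 'Third': 0.8},
)
print(given_declared)
```

Because the denominator is the same for every category, the category with the largest prior x likelihood product is always the one with the largest posterior.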

# Making Decisions

**Bayes' rule lets us make decisions based on incomplete information, incorporating new information as it comes in.**

Many medical tests for diseases return Positive or Negative results. 

* Positive result means that according to the test, the patient has the disease.
* Negative result means the test concludes that the patient doesn’t have the disease.

Medical tests are carefully designed to be very accurate. But few tests are accurate 100% of the time. Almost all tests make errors of two kinds:

* A false positive is an error in which the test concludes Positive but the patient doesn’t have the disease.
* A false negative is an error in which the test concludes Negative but the patient does have the disease.

These errors can affect people’s decisions. 

* False positives can cause anxiety and unnecessary treatment (which in some cases is expensive or dangerous). 
* False negatives can have even more serious consequences if the patient doesn’t receive treatment because of their Negative test result.

### A Test for a Rare Disease

Suppose there is a large population and a disease that strikes a tiny proportion of the population. The tree diagram below summarizes information about such a disease and about a medical test for it.

![tree_disease_rare.png](tree_disease_rare.png)

* Overall, only 4 in 1000 people in the population have the disease.
* The test is quite accurate: it has a very small false positive rate of 5 in 1000, and a somewhat larger (though still small) false negative rate of 1 in 100.
* Individuals might or might not know whether they have the disease; typically, people get tested to find out whether they have it.
* So suppose a person is picked at random from the population and tested. If the test result is Positive, how would you classify them: Disease, or No disease?

We can answer this by applying Bayes’ Rule and using our “more likely than not” classifier. Given that the person has tested Positive, the chance that he or she has the disease is the proportion in the top branch, relative to the total proportion in the Test Positive branches.

In [8]:
(0.004 * 0.99)/(0.004 * 0.99  +  0.996*0.005 )

0.44295302013422816

Given that the person has tested Positive, the chance that he or she has the disease is about 44%. So we will classify them as: No disease.

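This is a strange conclusion. We have a pretty accurate test, and a person who has tested Positive, and our classification is … that they don’t have the disease? That doesn’t seem to make any sense, until we notice that the disease is rare: people without the disease so outnumber people with it that even a tiny false positive rate yields more false positives than true positives.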
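
One way to check this result is to count people rather than multiply proportions. The sketch below imagines a hypothetical population of 100,000 (our choice, picked so that every branch of the tree comes out to a whole number) and applies the rates from the tree diagram.

```python
# Hypothetical population size, chosen so every branch is a whole number.
population = 100_000

# Rates from the tree diagram, applied with integer arithmetic.
diseased = population * 4 // 1000          # 4 in 1000 have the disease
healthy = population - diseased

false_negatives = diseased // 100          # 1 in 100 diseased test Negative
true_positives = diseased - false_negatives
false_positives = healthy * 5 // 1000      # 5 in 1000 healthy test Positive

# Among everyone who tests Positive, what fraction has the disease?
all_positives = true_positives + false_positives
chance_disease_given_positive = true_positives / all_positives

print(true_positives, false_positives)
print(chance_disease_given_positive)
```

In this imagined population, 396 people with the disease test Positive, against 498 people without it, which gives the same 44% as the calculation above.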

In [9]:
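# Build a table of 10,000 people to illustrate the effect.
# Note: this cell uses simpler rates than the tree diagram above:
# 1 in 1000 have the disease, 5% of people without it test positive,
# and everyone with the disease tests positive (no false negatives).
n = 10000
disease = round(n * 0.001)
no_disease = round(n * 0.999)

status = np.array(['Disease'] * disease + ['No disease'] * no_disease)
result = np.array(['Test +'] * (disease) + ['Test +'] * (round(no_disease * 0.05))  + \
                 ['Test -'] * (round(no_disease * 0.95)))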
                 
persons = Table().with_columns(
    'Status', status,
    'Test Result', result
)
persons.show(3)

Status,Test Result
Disease,Test +
Disease,Test +
Disease,Test +


In [10]:
persons.pivot('Test Result', 'Status')

Status,Test +,Test -
Disease,10,0
No disease,500,9490
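
The table can be read the same way as the `students` pivot table: knowing the result is Positive restricts us to the `Test +` column. A small sketch, with the two counts copied by hand from the output above (recall that this table was built with a prevalence of 1 in 1000 and a 5% false positive rate, not the rates in the tree diagram):

```python
# Counts copied from the 'Test +' column of the pivot table above.
disease_positive = 10
no_disease_positive = 500

# Proportion of positive testers who actually have the disease.
chance_disease_given_positive = disease_positive / (
    disease_positive + no_disease_positive
)
print(chance_disease_given_positive)
```

With these rates, only about 2% of the people who test Positive have the disease.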


# More about Bayes

![frequentists_vs_bayesians.png](frequentists_vs_bayesians.png)

* Original proof (pool tables)
* Laplace's Rule
* Frequentists (Randomness and Uncertainty)
* Relationship to propositional Logic
