In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

from toolz import memoize

# Diagnostics
In this lecture, we are going to talk about **making decisions**. Making decisions is tricky.

## Decision Under Uncertainty
For example, the study *Interpretation by Physicians of Clinical Laboratory Results* (1978) reports:

"We asked 20 house officers, 20 fourth-year medical students and 20 attending physicians, selected in 67 consecutive hallway encounters at four Harvard Medical School teaching hospitals, the following question:

If a test to detect a disease whose prevalence is 1/1000 (1 in 1000 people has the disease) has a false positive rate of 5% (if you don't have the disease, the test will wrongly say that you have the disease with a probability of 5%), what is the chance that a person found to have a positive result actually has the disease, assuming that you know nothing about the person's symptoms or signs?"

11 of 60 participants, or 18%, gave the correct answer. These participants included 4 of 20 fourth-year students, 3 of 20 residents in internal medicine and 4 of 20 attending physicians. The most common answer, given by 27, was that **the chance that a person found to have a positive result actually has the disease was 95%**. 

The most common answer above is incorrect! The true chance is far lower than 95%.

# Conditional Probability
Conditional probability refers to the chance of an event when we know that something else has happened.

In the **Interpretation by Physicians** example, when the test is positive, we want to find the chance that the person actually has the disease. In probability terms, this is "the chance that a person has the disease, given that the test was positive for that particular person".

## Round 1
Consider the following scenario:
1. A class consisting of second years (60%) and third years (40%)
2. 50% of the second years have declared their major
3. 80% of the third years have declared their major

We pick a student at random. Which is the student more likely to be: a second year or a third year?

#### Ans: second year, since they're 60% of the class

## Round 2
Same scenario as above, but now we pick a student and learn that **the student has declared a major**.

Assume we have 100 students in the class. The chance of picking a 2nd year student who has declared a major is:

In [2]:
(60 / 100) * (1/2)

0.3

While the chance of picking a 3rd year student who has declared a major is:

In [3]:
(40/100) * (80/100)

0.32000000000000006
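
The trailing digits in `0.32000000000000006` are floating-point rounding, not part of the answer. As a minimal sketch, the same two products can be computed exactly with Python's standard `fractions` module (the variable names are made up for illustration):

```python
from fractions import Fraction

# Joint chances from the scenario: (share of class) * (share declared)
second_and_declared = Fraction(60, 100) * Fraction(1, 2)
third_and_declared = Fraction(40, 100) * Fraction(80, 100)

print(second_and_declared)  # 3/10
print(third_and_declared)   # 8/25, which is exactly 0.32
print(third_and_declared > second_and_declared)  # True
```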

Since 0.32 is greater than 0.3, **the answer is 3rd year**! How does this work?

Assume we have `100` students in the class,

<img src = '100.jpg' width = 150/>

We split the students into 2nd years and 3rd years,

<img src = '2nd_3rd.jpg' width = 200/>

And we split the students again between those who declared and those who have not declared,

<img src = 'declared.jpg' width = 200/>

As we can see, there are more declared 3rd years (32) than declared 2nd years (30)!
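
The counts in the four boxes can be reproduced with a few lines of Python. This is only a sketch: the dictionary names are hypothetical, and the percentages are the ones given in the scenario.

```python
class_size = 100

# Percentages from the scenario (hypothetical variable names)
share_of_class = {"2nd year": 0.60, "3rd year": 0.40}
declared_rate = {"2nd year": 0.50, "3rd year": 0.80}

boxes = {}
for year, share in share_of_class.items():
    # First split: how many students are in this year
    students_in_year = round(class_size * share)
    # Second split: how many of them have declared a major
    declared = round(students_in_year * declared_rate[year])
    boxes[(year, "declared")] = declared
    boxes[(year, "undeclared")] = students_in_year - declared

for box, count in boxes.items():
    print(box, count)
```

The four counts add up to the full class of 100, which is a quick sanity check on the split.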

# Bayes' Rule
When we learn new information about an event, what we believe about it changes. Since we know that the student we picked has declared a major, the chance that the student is a 2nd year or a 3rd year has changed. This phenomenon is described by **Bayes' Rule**.

Suppose we want to find the chance that **the student is a 2nd year, given that the student has declared a major**.

<img src = 'declared.jpg' width = 200/>

Since we know that the student has declared a major, the student must have come from one of the 2 boxes on the upper part. Thus, the probability is:

$$ P(\text{Second Year | Declared}) = 
\frac
{P(\text{Second Year and Declared})} 
{P(\text{Declared})} = \frac{30}{30+32} \approx 0.48
$$
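
A simulation gives the same picture. The sketch below, which assumes the 100-student class with the four counts 30, 30, 32 and 8, picks students at random many times, throws away every pick where the student has not declared, and reports how often the remaining picks are 2nd years. The seed and the number of trials are arbitrary choices.

```python
import random

random.seed(8)  # fixed seed so the run is repeatable

# Build the class of 100 students as (year, has_declared) pairs
students = (
    [("2nd year", True)] * 30 + [("2nd year", False)] * 30
    + [("3rd year", True)] * 32 + [("3rd year", False)] * 8
)

trials = 100_000
picks = random.choices(students, k=trials)

# Condition on "declared": keep only picks where the student has declared
declared_picks = [year for year, has_declared in picks if has_declared]
estimate = declared_picks.count("2nd year") / len(declared_picks)

print(round(estimate, 2))  # should land close to 30 / 62
```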

## Terminology
Here we draw the same box in a different way, called a **tree diagram**. In this case, instead of assuming that we have `100` students, we use proportions (probabilities).

<img src = 'terminology.jpg' width = 400/>

We have names for these probabilities,
<img src = 'terms.jpg' width = 500/>

**Prior probability** is what we would predict about the student we pick if we had no information at all. **Prior** to any new information, we believe there is a 60% chance that the student is a 2nd year.

## Bayes' Rule
Bayes' Rule describes how to calculate the **posterior probability**. **Posterior** refers to what we believe after we have more information. In this case, we are going backwards: from the result we observed (declared) to the earlier stage (the student's year).

<img src = 'bayes.jpg' width = 300/>

Using Bayes' rule, the probability that the student is a **3rd year, given that the student has declared a major**, is:
$$ P(\text{Third Year | Declared}) = \frac
{P(\text{Third Year and Declared})}
{P(\text{Declared})}
= 
\frac{0.4 \times 0.8}
{(0.6 \times 0.5) + (0.4 \times 0.8)}
\approx 0.52
$$

Notice that this calculation matches the one we did with the box diagram. The only difference is that it is expressed in probabilities from the tree diagram rather than in counts.
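
The tree-diagram calculation can be written as a small function. This is a minimal sketch, not code from the lecture: `posterior` is a hypothetical helper that takes the prior for each branch and the chance of the observation along that branch.

```python
def posterior(prior, likelihood):
    """Return P(group | observation) for every group.

    prior:      dict mapping group -> P(group)
    likelihood: dict mapping group -> P(observation | group)
    """
    # Multiply along each branch of the tree: P(group and observation)
    joint = {group: prior[group] * likelihood[group] for group in prior}
    # Add up all branches that match the observation: P(observation)
    total = sum(joint.values())
    return {group: joint[group] / total for group in joint}


prior = {"2nd year": 0.6, "3rd year": 0.4}
declared_given_year = {"2nd year": 0.5, "3rd year": 0.8}

result = posterior(prior, declared_given_year)
print(result)
```

The two posteriors add up to 1, because a declared student must be either a 2nd year or a 3rd year.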

## Purpose of Bayes' Rule
1. Update prediction based on new information
    * When we obtain new information, what we believe also changes. If we know the student has declared, the chance that the student is a 2nd year is different from before.
    
2. In a multi-stage experiment, find the chance of an event at an earlier stage, given the result of a later stage
    * We can use Bayes' rule to work backwards to the probability that something happened earlier
    * This won't be covered in this lecture.

## You Try
Now back to the **Interpretation by Physicians** problem: what is the chance that the person actually has the disease?

Here, if we have no information about the test result (let's say the person did not take the test at all), then the chance that the person has the disease is 1/1000. Thus, we say the **prior probability** is 1/1000.

The false positive rate is 5%. In addition, **assume that if a person has the disease, the test will always be positive**. The false positive rate applies only when the person doesn't have the disease.

Suggestion: draw the box or the tree diagram. The visualization helps.

<img src = 'disease.jpg' width = 400/>

Thus, the probability that a person has the disease, given a positive test, is:

$$ P(\text{Disease | +})
= \frac
{P(\text{Disease and +})}
{P(+)}
=
\frac
{0.001 \times 1}
{(0.001 \times 1) + (0.999 \times 0.05)}
\approx 2\%
$$

This means we think that the person has only about a 2% chance of having the disease!

The test seems accurate, so why is the chance so low? Because the disease is so rare (only 1 in 1000 people), the false positives among the many healthy people far outnumber the true positives among the few who are sick.

Still, the positive result changes our belief. Before the test, we thought that a randomly picked person from the population had a 1/1000 chance of having the disease. Now the chance is about 2%, roughly 20 times higher.
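
One way to check the 2% figure is to count people instead of multiplying probabilities, just like the box diagram for the students. The sketch below assumes a hypothetical population of 100,000, chosen only because it gives whole numbers.

```python
population = 100_000  # hypothetical size, chosen for round numbers

sick = population // 1000      # prevalence of 1 in 1000
healthy = population - sick

# Assumption from the text: a person with the disease always tests positive
true_positives = sick
# 5% of the healthy people test positive by mistake
false_positives = healthy * 5 // 100

all_positives = true_positives + false_positives
chance_sick_given_positive = true_positives / all_positives

print(true_positives, false_positives)       # 100 4995
print(round(chance_sick_given_positive, 4))  # 0.0196
```

Out of 5,095 positive tests, only 100 belong to people who actually have the disease.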

# Decision
It turns out that we have been using Bayes' rule in the back of our minds when making decisions. When we make decisions, we deal with **Subjective Probabilities**.

## Subjective Probabilities
There are 2 ways of viewing probabilities.

A probability of an outcome is:
1. The frequency with which the outcome will occur in repeated trials 
    * If we roll a die, we expect that out of 1 million rolls, about 1/6 of them will be 6s
    

2. The subjective degree of belief that it will occur (or has occurred).
    * In the earlier example, our belief that a randomly selected person has the disease is 1/1000.
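
The frequency view in item 1 can be checked with a quick simulation. This is a sketch using Python's standard `random` module; the seed is an arbitrary choice that makes the run repeatable.

```python
import random

random.seed(6)  # fixed seed so the run is repeatable

rolls = 1_000_000
outcomes = random.choices(range(1, 7), k=rolls)  # fair six-sided die
share_of_sixes = outcomes.count(6) / rolls

print(share_of_sixes)  # should land near 1/6
print(1 / 6)
```

There is no comparable simulation for the subjective view in item 2: a degree of belief is something we state, not something we observe in repeated trials.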

Why use subjective priors?

1. When the subject of your prediction was not selected randomly from the population
    * We're not sure about the 1/1000 probability. Suppose a person goes to the doctor, and the doctor says that the person should take the test. The chance that this person has the disease is different from the chance for a person selected at random from the entire population.
    * If somebody goes to the doctor's office and gets the test, the chance of that person having the disease is most likely greater than 1/1000, because the person already suspects that something is wrong.
    * The 1/1000 is our belief. It is what we think about the world. However, it is a **subjective probability**, and thus it can change. 
    
    
2. In order to quantify a belief that is relevant to a decision
    * When we use a subjective belief in a calculation, the prior probability changes whenever our belief changes.

## You Try  #2
Below is the same tree diagram as before, but notice that the prior probability is different (`0.05` instead of `0.001`).

<img src = 'disease_2.jpg' width = 400/>

In this new case, if a person is actually getting the test, we think the person has a 1 in 20 chance of having the disease.

If we repeat the calculation that we did before,
$$ P(\text{Disease | +})
= \frac
{P(\text{Disease and +})}
{P(+)}
=
\frac
{0.05 \times 1}
{(0.05 \times 1) + (0.95 \times 0.05)}
\approx 51\%
$$

Thus, just by changing our prior belief about the chance that the person has the disease, we reach a very different conclusion.

This 51% says that if a person gets the test and the result is positive, we think that, more likely than not, this person has the disease. In the previous case, where we assumed that 1 in 1000 people has the disease, we would say, "the test result came out positive but the person probably doesn't have the disease".
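
To see how strongly the conclusion depends on the prior, the calculation can be repeated for several priors. In this sketch, `chance_of_disease_given_positive` is a hypothetical helper. The priors `0.001` and `0.05` come from the two examples, while `0.01` and `0.2` are made-up values added only to show the trend.

```python
def chance_of_disease_given_positive(prior, false_positive_rate=0.05):
    """Posterior chance of disease after a positive test.

    Assumes, as in the text, that a person with the disease
    always tests positive.
    """
    true_positive = prior * 1
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


# 0.001 and 0.05 are the priors from the examples; 0.01 and 0.2 are
# made-up priors included only for illustration.
for prior in [0.001, 0.01, 0.05, 0.2]:
    posterior = chance_of_disease_given_positive(prior)
    print(f"prior {prior:<6} -> posterior {posterior:.3f}")
```

The test itself never changes in this loop. Only the prior does, and the posterior moves with it.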

## Recap
1. Bayes' Rule lets us calculate conditional probabilities
2. Draw the box / tree of events. It helps with the calculation
3. When making a decision, or looking at other people's decisions, take priors into account
    * Our prior might be A while the other person's prior is B

# Data Science

## Why Data Science
1. Unprecedented access to data means that we can make new discoveries and more informed decisions
2. Computation is a powerful ally in data processing, visualization, prediction, and statistical inference
    * We can do it by hand, but using computers is far more efficient
3. Once we have data and use good statistics to draw conclusions, people can agree on evidence and measurement

## How to Analyze Data
If you can't remember everything in the class, then remember these:

1. Begin with **a question from some domain**, **reasonable assumptions about the data**, and **a choice of methods**

2. Visualize, then quantify

3. The most important: Interpretation of the results in the language of the domain, without statistical jargon
    * Make sure to be able to communicate the result to other people 
    
### How to NOT Analyze Data: Only relying on `Quantify`

## The Design of Data 8
1. Table manipulation with Python
2. Working with whole distributions, not just statistics
3. Sampling and resampling for statistical inference
4. Parametric & nonparametric prediction
    * Confidence intervals
    * Normal distribution
5. Machine learning: generalization, feature selection, etc.

## The Same Data With Different Priors Result in Different Posteriors
Depending on their prior beliefs, 2 people can reach very different conclusions from the same data.

## The Classifier is more important than the prediction
We don't have complete control over our results. Focusing on only one particular prediction won't help us build a good classifier.