In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Review: Cause and Effect

## Comparison
When we're trying to compare groups or to establish cause and effect, we have individuals that are subject to something.
* We group by some **treatment** and measure the **outcome** of that treatment
* Yesterday we covered the simplest setting: a `treatment group` (the group that is subject to the treatment) and a `control group` (the group that is NOT subject to the treatment)
* If the **outcome** differs between the 2 groups, then there's an `association` (or `relation`). 
    * E.g. the top-tier chocolate eaters died of heart disease at a lower rate (12%) than chocolate abstainers (17%)
    * Because the percentage of people who died of heart disease differs between the 2 groups, there is an `association`
    * However, `association` is not the same as something that causes the outcome to happen
        * `association` is not `causality`!
* To establish `causality`, we need an additional constraint on the 2 groups
    * The 2 groups need to be similar in all ways except the `treatment`
        * When this is fulfilled, we can say that the `treatment`, not anything else, causes the outcome

This **something else** that might cause an outcome is called the `confounding variable`.

## Confounding
If the 2 groups have differences and we have not minimized them, we can't say for certain whether it's the treatment or one of those differences that causes the outcome.
* If the treatment and control groups have **systematic differences other than the treatment itself**, it is difficult to identify a causal link.

These differences are called `confounding factors`. Such differences are often present in observational studies.

Yesterday we talked about `observational study` and `RCE` (randomized controlled experiment). To be clear:

In both cases we have a treatment group and a control group. The difference is whether we get to split the individuals into the 2 groups and whether we get to administer the treatment.
1. In an observational study, we don't have control over the individuals
    * The individuals sort themselves into the 2 groups and something happens to them, but we don't do anything to them
    * Example: Does wearing a grey shirt make you look better? In the classroom, we can look at the people wearing grey shirts and see how good they look. This is an observational study since we did not choose who wears a grey shirt.
2. In an RCE, we design a procedure for selecting the treatment and control groups
    * In the grey shirt example, we randomly split the classroom in half and ask one half of the class to wear grey shirts

## Randomize!
In an observational study, the 2 groups have differences in the first place because we don't get to randomize the individuals into the groups. Randomization allows us to minimize the differences between the 2 groups.
* If we randomly assign individuals to groups, chances are the 2 groups will look similar
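
As a rough illustration of random assignment, the sketch below shuffles a list of hypothetical individual IDs with NumPy and splits it in half. The class size of 20, the seed, and the names `individuals`, `treatment_group` and `control_group` are assumptions made for this example and do not come from the lecture.

```python
import numpy as np

# Hypothetical class of 20 individuals, identified by the numbers 0 to 19
individuals = np.arange(20)

# A seeded random generator, so the example gives the same split every run
rng = np.random.default_rng(42)
shuffled = rng.permutation(individuals)

# First half of the shuffled order gets the treatment, second half is control
treatment_group = shuffled[:10]
control_group = shuffled[10:]

print("Treatment:", np.sort(treatment_group))
print("Control:  ", np.sort(control_group))
```

Because the split depends only on the shuffle, no characteristic of the individuals can influence which group they land in.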

If we do randomization properly, we can account mathematically for the differences between the 2 groups.
* This allows us to decide if our conclusion is valid

Because of this, nowadays RCE is the most reliable way to establish causal relations.
* RCE is the standard practice in medicine and other fields
* We can say with more certainty that the treatment causes the outcome

# Expressions
In this section, we're going to talk about programming in Python. Expressions are the building blocks of Python.

## Programming Languages
We use Python in this class because it is popular both in data science and in the general software development industry.

1. Mastering the language fundamentals is critical
2. It's definitely better to learn through practice, not by reading or listening

## Demo

In [2]:
# Addition
2 + 3

5

In [3]:
# Multiplication
2 * 3

6

Python is very picky about how we write expressions. If we write the following,

In [4]:
2 * * 3

SyntaxError: invalid syntax (<ipython-input-4-0f343c0fc354>, line 1)

It gave us a `SyntaxError`, which means we wrote our code in an invalid way. We can't have multiple multiplication signs with spaces in between. However, exponentiation is a different story: it is written as two asterisks with no space between them.

In [None]:
# Exponentiation
2 ** 3

In [None]:
# Division
2 / 3

Notice that with division, the outcome is a number with decimal places. 

If we divide a number by 0, we'll get a `ZeroDivisionError`! This is different from the `SyntaxError`: the expression is written correctly, but it can't be evaluated.

In [None]:
2 / 0
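
The difference between the two kinds of error can be sketched with Python's built-in `compile` function and a `try`/`except` block, neither of which is covered in this lecture. A `SyntaxError` is reported while Python reads the code, whereas a `ZeroDivisionError` only appears when the division is actually attempted. The names `syntax_problem` and `division_problem` are made up for this example.

```python
# A SyntaxError is found while Python reads the code, before running it.
# compile() only reads the text "2 * * 3"; it never evaluates it.
try:
    compile("2 * * 3", "<example>", "eval")
    syntax_problem = None
except SyntaxError as error:
    syntax_problem = type(error).__name__

# A ZeroDivisionError happens while running code that is written correctly.
try:
    2 / 0
    division_problem = None
except ZeroDivisionError as error:
    division_problem = type(error).__name__

print(syntax_problem)    # SyntaxError
print(division_problem)  # ZeroDivisionError
```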

Similar to a graphing calculator, Python follows the **order of operations** rule. If we do the following,

In [None]:
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

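1. The exponentiation `6 ** 3` was computed first
2. Then the multiplications and the division were computed from left to right (`*` and `/` have the same priority): first `2 * 3 * 4 * 5`, then the division by the result of `6 ** 3`
3. Finally, the additions and subtractions were computed from left to right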
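
The steps above can be checked by computing each piece separately and comparing against the full expression. This is only a sketch; the names `power`, `product`, `quotient`, `full` and `by_steps` are chosen for this example.

```python
# The original expression, evaluated by Python in one go
full = 1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

# The same computation, one step at a time
power = 6 ** 3               # step 1: exponentiation, 216
product = 2 * 3 * 4 * 5      # step 2: multiplications, left to right, 120
quotient = product / power   # step 2, continued: the division
by_steps = 1 + quotient + 7 + 8 - 9 + 10   # step 3: additions and subtractions

print(full)
print(by_steps)
print(full == by_steps)      # True: both follow the same order
```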

If we want to force Python to compute a specific part first, we can wrap it in parentheses `()`,

In [None]:
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10

Note that the parentheses need to be complete! If a closing parenthesis is missing, Python will give out a `SyntaxError`!

In [None]:
1 + 2 * (3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

## Arithmetic Operators
These are the basic arithmetic operators that Python has:

| Operation | Operator | Example | Value |
| --- | --- | --- | --- |
| Addition | + | 2 + 3 | 5 |
| Subtraction | - | 2 - 3 | -1 |
| Multiplication | * | 2 * 3 | 6 |
| Division | / | 7 / 3 | 2.33333 |
| Remainder | % | 7 % 3 | 1 |
| Exponentiation | ** | 2 ** 0.5 | 1.41421 |

## Example: Slopes
Below is a visualization of doctors' income vs. other professionals' income. Whoever made this visualization might have deliberately wanted to show that doctors made more than any other occupation. This is an example of what NOT to do when making a visualization.
<img src = 'income.jpg' width = 500\>

1. The doctor figures and human figures are unnecessary.
2. The years on the x-axis are not evenly spaced.

A good graph would look like the following.
<img src = 'graph.jpg' width = 600\>
Even though the graph might look boring, we can clearly see the trend. 

If we look at the doctors' graph, it appears as if there are 2 different regions: the region before 1965 and the region after 1965. Around 1965, it seems something occurred that made doctors' income skyrocket.

Here we will try to calculate the slope of doctors' income before and after that point, using the 1963 data point as the dividing year. We are going to use the simplest method: $\frac{\Delta y}{\Delta x}$, the change in income divided by the change in years.

In [None]:
# Income at 1963 - Income at 1939, divided by the time difference
(25050 - 3262) / (1963 - 1939)

The calculation above implies that before 1963, doctors' income increased at a rate of roughly 908 dollars per year.

In [None]:
# Income at 1976 - Income at 1963, divided by time difference
(62799 - 25050) / (1976 - 1963)

After 1963, the rate of increase skyrocketed to roughly 2,904 dollars per year!
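
The two slope calculations can also be written with names, so that every number carries a label (names are introduced later in these notes). This is only a sketch using the figures quoted above; names such as `income_1939` and `slope_before` are chosen for this example.

```python
# Doctors' income (in dollars) for the three years used above
income_1939 = 3262
income_1963 = 25050
income_1976 = 62799

# Slope = change in income / change in years
slope_before = (income_1963 - income_1939) / (1963 - 1939)
slope_after = (income_1976 - income_1963) / (1976 - 1963)

print(round(slope_before))          # 908 dollars per year
print(round(slope_after))           # 2904 dollars per year
print(slope_after / slope_before)   # how many times steeper the later slope is
```

The ratio in the last line comes out a little above 3, so the later slope is roughly three times as steep as the earlier one.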

# Numbers
We might have noticed from the previous demo that some numbers contain decimal places and some don't. For example:
1. 2 + 3 = 5
2. 2 * 3 = 6
3. 2 / 3 = 0.6666666...
4. 1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10 = 2017.0

There are 2 types of numbers in Python:
1. `int` or `integer`
2. `float`, a decimal point number

If we just type `2` here,

In [None]:
2

We just get the integer 2. But if we type `2.0`,

In [None]:
2.0

We'll get the float `2.0`! We can also get the float `2.0` by just writing `2.`

In [None]:
2. 

If we use the division operator, the outcome will always be a float regardless of the numbers we use.

In [None]:
2/1

In [None]:
5 / 7000

When the outcome is small enough, Python displays it in scientific notation.

In [None]:
5/ 70000

If we try to type the number manually, Python still displays it in scientific notation.

In [None]:
0.00007142857142857143

We can also write numbers in scientific notation!

In [None]:
7.14e-6

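If a number gets too close to zero, Python can no longer represent it and displays it as `0.0`. The first number below is still representable, but raising it to the 30th power is not.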

In [None]:
0.000000000000000000000000000000000000000000000123456

In [None]:
0.000000000000000000000000000000000000000000000123456 ** 30

What if the numbers get too big?

For integers, nothing special would happen,

In [None]:
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

But for a float, at first it will be displayed in scientific notation. At some point, it will give out an error (an `OverflowError`)!

In [None]:
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0

In [None]:
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0 ** 2.3

Similar to a calculator, Python has a limited capability in representing decimal numbers.

In [None]:
0.6666666666666666666666666612435125

This is a fundamental limitation of how computers represent numbers. There is a limit to how precise a computation can be.

In [None]:
0.6666666666666666666666666612435125 - 0.66666666666666666666666666

Again, Python has a limited numerical precision, as shown with the computation above.

Now let's try taking the square root of 2.

In [None]:
2 ** 0.5

Now if we do the following,

In [None]:
(2 ** 0.5) * (2 ** 0.5)

The result is not exactly 2! And if we subtract 2 from this result,

In [None]:
(2 ** 0.5) * (2 ** 0.5) - 2

The outcome is a very small number that we can treat as 0. Just remember that this is due to Python's limited precision.
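
Because of this rounding, checking two floats for exact equality can be misleading. One common workaround, sketched here with `math.isclose` from Python's standard library (not covered in this lecture), is to ask whether the two values are approximately equal. The name `product` is chosen for this example.

```python
import math

product = (2 ** 0.5) * (2 ** 0.5)

print(product == 2)              # False: the last decimal place is off
print(math.isclose(product, 2))  # True: equal up to a tiny relative error
```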

Also, if we do any operation that involves a float, the outcome will be a float.

In [None]:
2 + 3.0

In [None]:
2 * 3.0
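
One way to check which kind of number an expression produces is Python's built-in `type` function, which is not introduced in this lecture. A short sketch:

```python
print(type(2))        # <class 'int'>
print(type(2.0))      # <class 'float'>
print(type(2 / 1))    # <class 'float'>: division always gives a float
print(type(2 + 3.0))  # <class 'float'>: an int combined with a float
print(type(2 * 3))    # <class 'int'>: only ints involved, no division
```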

# Recap: Ints and Floats
Python has 2 real number types:
1. `int`: an integer of any size
2. `float`: a number with an optional fractional part

* An `int` never has a decimal point. A `float` always does.
* A `float` might be printed using scientific notation

3 limitations of float values:
1. They have limited size (but the limit is huge)
2. They have limited precision of 15-16 significant digits
3. After arithmetic, the final few decimal places can be wrong
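
These limits can be inspected directly. The sketch below uses `sys.float_info` from Python's standard library, which is not part of this lecture; the exact values shown in the comments assume the usual 64-bit floats found on nearly all machines. The name `total` is chosen for this example.

```python
import sys

# Limitation 1: the largest float Python can represent (roughly 1.8e308)
print(sys.float_info.max)

# Limitation 2: the number of decimal digits a float can always hold faithfully
print(sys.float_info.dig)   # 15

# Limitation 3: after arithmetic, the final decimal places can be wrong
total = 0.1 + 0.2
print(total)                # 0.30000000000000004
```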

## Discussion Question
Rank the following expressions in order from least to greatest:

1. 3 * 10  ** 10
2. 10 * 3 ** 10
3. (10 * 3) ** 10
4. 10 / 3 / 10
5. 10 / (3 / 10)



In [None]:
# Rank 4
3 * 10 ** 10

In [None]:
# Rank 3
10 * 3 ** 10

In [None]:
# Rank 5 greatest
(10 * 3) ** 10

In [None]:
# Rank 1 smallest
10 / 3 / 10

In [None]:
# Rank 2
10 / (3 / 10)
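
The ranking can also be left to Python. The sketch below stores each expression next to a text label in a dictionary and sorts the labels by value; dictionaries and `sorted` are not covered in this lecture, and the names `values` and `ranking` are chosen for this example.

```python
# Each expression from the question, stored next to its label
values = {
    "3 * 10 ** 10": 3 * 10 ** 10,
    "10 * 3 ** 10": 10 * 3 ** 10,
    "(10 * 3) ** 10": (10 * 3) ** 10,
    "10 / 3 / 10": 10 / 3 / 10,
    "10 / (3 / 10)": 10 / (3 / 10),
}

# Sort the labels by their computed value, smallest first
ranking = sorted(values, key=values.get)

for rank, label in enumerate(ranking, start=1):
    print(rank, label, values[label])
```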

# Names (also known as variables)
Sometimes we want to be able to save some computation results so that we don't need to type them all over again.

## Assignment Statements
Consider the following,

In [6]:
more_than_1 = 2 + 3

`more_than_1` is the name of the variable, or the label that we are giving to the result. The `2 + 3` is the expression that computes the result that we want to save.

Notice when we run the cell above, nothing comes out. This is because **statements don't have a value**. They only perform an action.

An assignment statement changes the meaning of the name to the left of the `=` symbol. As a result, the name is now bound to a value. This is NOT an equation that checks the equality between `more_than_1` and `2 + 3`.

In [7]:
more_than_1

5

We can see that `more_than_1` is now bound to the value 5. We can do computations using the name!

In [8]:
more_than_1 * 2

10

You can actually put more than one line in a cell. Even though all the lines are run, only the outcome of the last line will be printed.

In [9]:
1 + 1
2 + 2
3 + 3

6

This means we can do an assignment and display the result in the same cell!

In [10]:
more_than_1 = 2 + 4
more_than_1

6

Be careful with names!

In [11]:
x = 4
y = x + 1
y

5

In [12]:
x = 3
y

5

Above, even if we have changed `x`, `y` does not change. You will have to reassign `y` to change it.

In [13]:
y = x + 1
y

4

# Exponential Growth
When we discuss exponential growth, we also use the term `growth rate`.

## Growth Rate
Growth rate is defined as **the rate of increase per unit time**.
* e.g. water consumption increases by 100% per day
    
After one time unit, a quantity `x` growing at rate `g` will be: 
\begin{align}
x \times ( 1 + g)
\end{align}
* For example, we consumed 1 dollar's worth of water yesterday
* Today we consumed 1.5 dollars' worth of water
    * `x` is the original quantity, the 1 dollar's worth of water
    * `g` in this case is 0.5, since the water consumption increased by 50%
    * The calculation would be $ 1 \times ( 1 + 0.5)$
    
After `t` time units, a quantity `x` growing at a rate `g` will be: 
\begin{align}
x \times ( 1 + g) ^ {t}
\end{align}

* Let's say you have 100 dollars in the bank. `x` is 100
* There's 5% interest every year
* In 7 years, `x` will become:
\begin{align}
100 \times ( 1 + 0.05) ^ {7}
\end{align}

If `after` and `before` are measurements of the same quantity taken `t` time units apart, then the growth rate `g` is,
\begin{align}
g = (\frac {after}{before})^ {\frac {1}{t}} - 1
\end{align}
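
As a quick check that the two formulas agree with each other, the sketch below computes the bank example and then recovers the growth rate from the result. The names `x`, `g` and `t` follow the formulas above; `after` and `recovered_g` are chosen for this example.

```python
# Bank example from above: 100 dollars growing at 5% per year for 7 years
x = 100
g = 0.05
t = 7
after = x * (1 + g) ** t
print(after)         # roughly 140.71

# Recover the growth rate from the before and after measurements
recovered_g = (after / x) ** (1 / t) - 1
print(recovered_g)   # very close to 0.05, up to float rounding
```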

## Ebola Epidemic, Sept. 2014
In Sept 2014, people were frightened of the Ebola outbreak.
* President Obama said: "It's spreading and growing `exponentially`"
* Dr. David Nabarro, Special Adviser for the U.N.'s Secretary-General said: "This is a disease outbreak that is advancing in an `exponential` fashion"
<img src = 'ebola.jpg' width = 500\>

The solid brown line was the rate of Ebola spread at that time. It was projected that infections would rise exponentially, following the upward curve. In reality, infections did not rise at an exponential rate; instead, they rose at a roughly linear rate (the blurry dotted line).
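
To see why the distinction matters, the sketch below compares exponential and linear growth from the same starting point. All numbers here (a starting count of 100, 10% growth per time unit, and a fixed increase of 10 per time unit) are invented for illustration and are not Ebola data.

```python
start = 100   # hypothetical starting count
g = 0.10      # hypothetical exponential growth rate per time unit
step = 10     # hypothetical fixed increase per time unit (linear growth)

for t in [0, 10, 20, 30]:
    exponential = start * (1 + g) ** t
    linear = start + step * t
    print(t, round(exponential), linear)
```

Both curves start at the same value, but by the last time point the exponential quantity is several times larger than the linear one.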

## Demo: Federal Budget
In 2002, the federal budget was 2,370,000,000,000 dollars. In 2012, it grew to 3,380,000,000,000 dollars. 

In [14]:
fed_budget_2002 = 2370000000000
fed_budget_2012 = 3380000000000
fed_budget_2012 / fed_budget_2002

1.4261603375527425

From above, we can see that over the course of 10 years, the federal budget grew by about 43%. We can try calculating the growth rate per year.

In [15]:
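# The same budgets in trillions of dollars (the ratio is unchanged)
fed_budget_2002 = 2.37
fed_budget_2012 = 3.38
# Annual growth rate over the t = 10 years between the two measurements
g = (fed_budget_2012 / fed_budget_2002) ** (1 / 10) - 1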
g

0.03613617208346853

The federal budget increased 3.6% per year!
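
As a sanity check, growing the 2002 budget at rate `g` for 10 years should give back the 2012 budget. The sketch below repeats the calculation from the cell above so that it runs on its own; the name `check` is chosen for this example.

```python
fed_budget_2002 = 2.37   # trillions of dollars
fed_budget_2012 = 3.38   # trillions of dollars
g = (fed_budget_2012 / fed_budget_2002) ** (1 / 10) - 1

# Apply the growth formula x * (1 + g) ** t with t = 10
check = fed_budget_2002 * (1 + g) ** 10
print(check)   # approximately 3.38, up to float rounding
```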