# Review: Causality and Effect

### Comparison
* Group by some **treatment** and measure some **outcome**
* Simplest setting: a **treatment** group and a **control** group
* If the **outcome** differs between the 2 groups, that's evidence of an **association** (or **relation**)
    * E.g. the top-tier chocolate eaters died of heart disease at a lower rate (12%) than chocolate abstainers (17%)
* If the 2 groups are similar in all ways except the **treatment**, a difference in the **outcome** is also evidence of **causality**

### Confounding
* If the treatment and control groups have **systematic differences other than the treatment** itself, it might be difficult to identify a causal link
* When these systematic differences lead researchers astray, they are called **confounding factors**
* Such differences are often present in observational studies
    * Observational study: the researcher **does not choose** which subjects receive the treatment
    * Controlled experiment: the researcher **designs a procedure for selecting** the treatment and control groups
    
### Randomize
* When subjects are split up **randomly**, it's unlikely that there will be systematic differences between the groups
    * And it's possible to account for the chance of a difference
* Therefore, **randomized controlled experiments** are the most reliable way to establish causal relations

# Expressions

### Programming Languages
* Python is popular both for data science & general software development
* Mastering the language fundamentals is critical
* Learn through practice, not by reading or listening

Programming can dramatically improve our ability to collect and analyze information about the world, which can lead to new discoveries. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own, and so humans must describe precisely what steps the computer should take in order to collect and analyze data. Those steps are expressed through programs.

Programming languages are much simpler than human languages. Nonetheless, there are some rules of grammar to learn in any language, and that is where we will begin. In this text, we will use the Python programming language. Learning the grammar rules is essential, and the same rules used in the most basic programs are also central to more sophisticated programs.

Programs are made up of **expressions**, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a **\*** symbol between **two numerical expressions**. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.


In [1]:
3 * 4

12

* The grammar rules of a programming language are rigid. 
    * In Python, the * symbol cannot appear twice in a row (except for exponentiation, which we'll cover later). 
* The computer will not try to interpret an expression that differs from its prescribed expression structures. 
    * Instead, it will show a **SyntaxError**. 
* The syntax of a language is its set of grammar rules, and a SyntaxError indicates that an expression structure doesn’t match any of the rules of the language.

In [2]:
3 * * 4

SyntaxError: invalid syntax (<ipython-input-2-012ea60b41dd>, line 1)
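
One way to explore which expressions are well formed, without stopping on an error each time, is to ask Python to parse the text of an expression. The sketch below uses the standard library function `ast.parse`; the helper name `is_well_formed` is an assumption made for illustration and is not part of the original notes.

```python
import ast

def is_well_formed(source):
    """Return True if Python can parse `source` as a single expression.

    Hypothetical helper: it only checks the grammar, it never evaluates anything.
    """
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False

print(is_well_formed("3 * 4"))    # True
print(is_well_formed("3 * * 4"))  # False: two * symbols separated by a space
print(is_well_formed("3 ** 4"))   # True: ** is the exponentiation operator
```

Parsing only checks the grammar; it says nothing about whether evaluating the expression will succeed.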

Small changes to an expression can change its meaning entirely.
* Below, the space between the *’s has been removed. 
    * Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression
    * The first number raised to the power of the second: 3 times 3 times 3 times 3. 
* The symbols * and ** are called operators, and the values they combine are called operands.

In [3]:
3**4

81

### Common Operators
Data science often involves combining numerical values, and the set of operators in a programming language is designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.

|  Expression Type  |  Operator  | Example | Value |
|  ---  |  ---  | ----- | ---- |
|  Addition  | + | 2 + 3 | 5 | 
|  Subtraction  | - | 2 - 3 | -1 | 
|  Multiplication  | * | 2 * 3 | 6 | 
|  Division  | / | 7 / 3 | 2.33333 | 
|  Remainder  | % | 7 % 3 | 1 | 
|  Exponentiation  | ** | 4 ** 0.5 | 2.0 | 

Python expressions obey the same familiar rules of precedence as in algebra: exponentiation occurs first, then multiplication and division, then addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.

In [4]:
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

17.555555555555557

In [5]:
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10

2017.0
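
To see how precedence produces the first result above, it can help to evaluate the expression one piece at a time. This is a minimal sketch; the names `power`, `product`, `quotient`, `stepwise`, and `direct` are chosen here for illustration.

```python
# Exponentiation is evaluated first.
power = 6 ** 3                  # 216

# Multiplication and division come next, from left to right.
product = 2 * 3 * 4 * 5         # 120
quotient = product / power      # 120 / 216

# Addition and subtraction come last, from left to right.
stepwise = 1 + quotient + 7 + 8 - 9 + 10

# The same expression written on one line.
direct = 1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10

print(stepwise == direct)  # True
```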

This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don’t always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.
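
As a sketch of this kind of experiment, the cell below tries both questions. The names `left_to_right`, `grouped`, and `caught` are illustrative, and the `try`/`except` statement is used only so that the error does not stop the rest of the cell.

```python
# Dividing twice in a row is evaluated from left to right.
left_to_right = 8 / 4 / 2    # (8 / 4) / 2
grouped = 8 / (4 / 2)        # parentheses change the grouping
print(left_to_right)  # 1.0
print(grouped)        # 4.0

# Dividing by zero is grammatically fine, but it fails when evaluated.
caught = False
try:
    1 / 0
except ZeroDivisionError as error:
    caught = True
    print("ZeroDivisionError:", error)
```

Unlike a SyntaxError, a ZeroDivisionError is reported only when the expression is actually evaluated.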

### Numbers

#### Ints and Floats
Python has 2 real number types
* int: an integer of any size
    * int never has a decimal point
* float: a number with an optional fractional part
    * float always has a decimal point
    * float might be printed with scientific notation
    
Limitations of float values:
* Limited size (but the limit is huge)
* Limited precision of about 15-16 significant decimal digits
* After arithmetic operations, the final few digits can be wrong
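
These limits can be inspected directly. The sketch below reads them from the standard `sys.float_info` object; the printed values assume the usual 64-bit float representation that most machines use, and the names `largest`, `digits`, and `total` are illustrative.

```python
import sys

largest = sys.float_info.max   # largest finite float, roughly 1.8e308
digits = sys.float_info.dig    # decimal digits that are always preserved
print(largest)
print(digits)

# The last digit of an arithmetic result can be wrong.
total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False
```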


Below are examples of int and float, respectively.

In [1]:
2

2

In [2]:
2.0

2.0

If you add a dot '.' after a number, you're also creating a float.

In [3]:
2.

2.0

Division always results in a float.

In [5]:
2 / 1

2.0

When a float has enough leading zeros after the decimal point, Python automatically displays it in scientific notation. Below is an example.

In [6]:
3 / 700000

4.2857142857142855e-06

In [7]:
0.000000000546

5.46e-10

You can also write scientific notation manually.

In [9]:
7.43e-7

7.43e-07

If a float is too small in magnitude to be represented, it is rounded to zero.

In [11]:
0.000000000000006584 ** 30

0.0

An integer can be of any size, so even a very large integer is simply displayed in full.

In [12]:
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

However, if it's a float that's too big, it'll show as **inf**, which stands for infinity.

In [13]:
1000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.

inf
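
The contrast between the two number types can be checked with much shorter inputs. This is a minimal sketch assuming the usual 64-bit floats, where the largest float is about 1.8e308; the names `too_big` and `huge_int` are illustrative.

```python
import math

# A float result beyond the largest representable float becomes inf.
too_big = 1e308 * 10
print(too_big)                 # inf
print(math.isinf(too_big))     # True
print(too_big + 1 == too_big)  # True: arithmetic on inf stays at inf

# An int of the same magnitude is stored exactly.
huge_int = 10 ** 400
print(len(str(huge_int)))      # 401 digits
```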

A computer has a limited ability to represent numbers (limited precision). A float keeps only about 16 significant digits; any digits beyond that are rounded away.

In [14]:
0.666666666666666666666666666666661243215

0.6666666666666666

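Be careful with computations that depend on digits beyond this precision. The two numbers below are different, yet their computed difference is exactly zero.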

In [15]:
0.666666666666666666666666666666661243215 - 0.6666666666666666666666666

0.0

In [16]:
(2 ** 0.5) * (2 ** 0.5)

2.0000000000000004
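
Because of such rounding, comparing floats for exact equality can give surprising answers. A common alternative is to test whether two values are close, for example with the standard `math.isclose` function. The sketch below uses its default tolerance; the name `product` is illustrative.

```python
import math

product = (2 ** 0.5) * (2 ** 0.5)

print(product == 2)              # False: the last digit is off
print(math.isclose(product, 2))  # True: equal within a small relative tolerance
```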

### Discussion
Rank the results of the following expressions in order from least to greatest
1. 3 * 10 ** 10
2. 10 * 3 ** 10
3. (10 * 3) ** 10
4. 10 / 3 / 10
5. 10 / (3 / 10)

**Ans**: If you try to compute each of them, the order would be 4, 5, 2, 1, 3
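
The ranking can also be checked by letting Python evaluate and sort the expressions. This sketch stores each value under its question number in a dictionary; the names `expressions` and `ranking` are illustrative.

```python
# Each expression from the list above, keyed by its number.
expressions = {
    1: 3 * 10 ** 10,
    2: 10 * 3 ** 10,
    3: (10 * 3) ** 10,
    4: 10 / 3 / 10,
    5: 10 / (3 / 10),
}

# Sort the question numbers by the value of their expression.
ranking = sorted(expressions, key=expressions.get)
print(ranking)  # [4, 5, 2, 1, 3]
```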

### Names
<img src='name.jpg'/>

Names are given to values in Python using an **assignment** statement. In an assignment, a name is followed by =, followed by any expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.
* Statements don't have a value
    * They perform an action
* An assignment statement changes the meaning of the name to the left of the '=' symbol
* The name is bound to a value (not an equation)

In [18]:
more_than_1 = 2 + 3

In [19]:
more_than_1

5

A previously assigned name can be used in the expression to the right of =.

In [20]:
quarter = 1/4
half = 2 * quarter
half

0.5

BE CAREFUL! Only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.

In [21]:
quarter = 4
half

0.5
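
To make `half` reflect the new value of `quarter`, the assignment has to be run again. The sketch below repeats the steps above in one cell and keeps two illustrative names, `before` and `after`, to compare the results.

```python
quarter = 1 / 4
half = 2 * quarter   # half is bound to the value 0.5, not to the formula

quarter = 4          # rebinding quarter does not touch half
before = half
print(before)        # 0.5

half = 2 * quarter   # re-running the assignment uses the current value
after = half
print(after)         # 8
```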

#### Naming Rules
* Names must start with a letter (or an underscore)
    * Can contain both letters and numbers. 
* A name cannot contain a space
    * Instead, it is common to use an underscore character **_** to replace each space. 

Names are only as useful as you make them; it’s up to the programmer to choose names that are easy to interpret. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.

In [22]:
purchase_price = 5
state_tax_rate = 0.075
county_tax_rate = 0.02
city_tax_rate = 0
sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate
sales_tax = purchase_price * sales_tax_rate
sales_tax

0.475
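
The naming rules in this section can be checked without triggering an error by using the standard string method `str.isidentifier`. The candidate names below are made up for illustration. Note that this method also accepts reserved words such as `class`, which cannot be used as names; `keyword.iskeyword` detects those.

```python
import keyword

candidates = ["sales_tax", "sales tax", "2nd_rate", "rate2", "class"]

results = {}
for name in candidates:
    # A usable name follows the grammar for names and is not a reserved word.
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(repr(name), results[name])
```

Only `sales_tax` and `rate2` pass: one candidate contains a space, one starts with a digit, and one is a reserved word.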

### Example: Growth Rate
The relationship between two measurements of the same quantity taken at different times is often expressed as a **growth rate**. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we **divide the difference between the changed and initial amount by the initial amount**.

In [25]:
initial = 2766000
changed = 2814000
federal_employment_growth = (changed - initial) / initial
federal_employment_growth

0.01735357917570499

It is also typical to subtract one from the ratio of the two measurements, which yields the same value.

In [None]:
(changed/initial) - 1

This value is the growth rate over 10 years. A useful property of growth rates is that **they don’t change even if the values are expressed in different units**. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.

In [None]:
initial = 2766
changed = 2814
(changed/initial) - 1

In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion in 2002 to $3.38 trillion in 2012.

In [26]:
initial = 2.37
changed = 3.38
federal_expenditure_growth = (changed/initial) - 1
federal_expenditure_growth

0.4261603375527425

A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.

In [27]:
initial = 287.6
changed = 314.1
us_population_growth = (changed/initial) - 1
us_population_growth

0.09214186369958277

A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.

In [28]:
initial = 15.3
changed = 11.9
manufacturing_jobs_growth = (changed/initial) - 1
manufacturing_jobs_growth

-0.2222222222222222
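
Since the same computation is repeated for each pair of measurements, it could be wrapped in a small function. Functions are not introduced in these notes, so this is only a sketch; the name `growth_rate` is an assumption made for illustration.

```python
import math

def growth_rate(initial, changed):
    """Growth rate from `initial` to `changed` (hypothetical helper)."""
    return (changed - initial) / initial

employment = growth_rate(2766000, 2814000)
manufacturing = growth_rate(15.3, 11.9)
print(round(employment, 4))     # 0.0174
print(round(manufacturing, 3))  # -0.222

# The growth rate does not depend on the units of the measurements.
print(math.isclose(growth_rate(2766, 2814), employment))  # True
```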

### Annual Growth Rate
An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).

In [30]:
1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1

0.410598760621121

This same computation can be expressed using names and exponents.

In [29]:
annual_growth_rate = 0.035
ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1
ten_year_growth_rate

0.410598760621121

Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. 
Below, **t** is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures from 2002 to 2012.

In [None]:
initial = 2.37
changed = 3.38
t = 10
(changed/initial) ** (1/t) - 1

The total growth over 10 years is equivalent to a 3.6% increase each year.

In summary, a growth rate g is used to describe the relative size of an initial amount and a changed amount after some amount of time t. 
To compute the changed amount, apply the growth rate g repeatedly, t times using exponentiation.

changed = initial * (1 + g) ** t

To compute g, raise the total growth to the power of 1/t and subtract one.

g = (changed/initial) ** (1/t) - 1
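
The two formulas undo each other, which can be checked with the federal expenditure figures used earlier. The function names `changed_amount` and `annual_growth_rate` below are assumptions made for illustration; this is a sketch, not part of the original notes.

```python
import math

def changed_amount(initial, g, t):
    """Apply growth rate g to `initial` for t periods."""
    return initial * (1 + g) ** t

def annual_growth_rate(initial, changed, t):
    """Per-period growth rate that turns `initial` into `changed` after t periods."""
    return (changed / initial) ** (1 / t) - 1

# Federal expenditures in trillions of dollars, 2002 and 2012.
g = annual_growth_rate(2.37, 3.38, 10)
print(round(g, 3))  # 0.036

# Applying g for 10 years recovers the changed amount, up to rounding error.
recovered = changed_amount(2.37, g, 10)
print(math.isclose(recovered, 3.38))  # True
```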