In [1]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

**Reader note: Because the video lecture was flawed, these notes are based on the PowerPoint slides and the associated readings.**

# Randomness
Data scientists need to understand randomness. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.

In this chapter, we begin our analysis of randomness. To start off, we will use Python to make choices at random. In `numpy` there is a sub-module called `random` that contains many functions that involve random selection. One of these functions is called `choice`. It picks one item at random from an array, and it is equally likely to pick any of the items. The function call is `np.random.choice(array_name)`, where `array_name` is the name of the array from which to make the choice.

Thus the following code evaluates to `'treatment'` with chance 50% and to `'control'` with chance 50%. If you run the code repeatedly, you will eventually see both outputs.

In [5]:
two_groups = make_array('treatment', 'control')
np.random.choice(two_groups)

'treatment'

The big difference between the code above and all the other code we have run thus far is that the code above doesn’t always return the same value. It can return either **treatment** or **control**, and we don’t know ahead of time which one it will pick. We can repeat the process by providing a second argument, the number of times to repeat the process.

In [6]:
np.random.choice(two_groups, 10)

array(['treatment', 'treatment', 'control', 'control', 'treatment',
       'control', 'treatment', 'control', 'treatment', 'control'],
      dtype='<U9')

## Random Selection 
`np.random.choice`
* Selects at random
* With replacement
* From an array
* A specified number of times

`np.random.choice(some_array, sample_size)`
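As a minimal sketch of the four properties listed above, the cell below draws ten values from a three-element array. The array `colors` and the seed value are illustrative assumptions, not part of the lecture, and `np.array` stands in for `make_array` so that the cell depends only on NumPy.

```python
import numpy as np

np.random.seed(0)  # assumption: fix the seed only so the sketch is repeatable

colors = np.array(['red', 'green', 'blue'])  # hypothetical example array

# Ten draws from three items: repeats are unavoidable because
# np.random.choice samples with replacement by default.
sample = np.random.choice(colors, 10)

print(sample)
print(len(sample))             # one entry per requested draw
print(len(np.unique(sample)))  # at most 3 distinct values
```

Asking for more draws than there are items only works because the selection is made with replacement.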

A fundamental question about random events is whether or not they occur. For example:

* Did an individual get assigned to the treatment group?
* Is a gambler going to win money?
* Has a poll made an accurate prediction?

Once the event has occurred, you can answer “yes” or “no” to all these questions. In programming, it is conventional to do this by labeling statements as `True` or `False`. For example:
* If an individual did get assigned to the treatment group, then the statement, “The individual was assigned to the treatment group” would be `True`. 
* If not, it would be `False`.

## Discussion Question

In [3]:
d = np.arange(6) + 1

What are the results from evaluating the following 2 expressions? Are they the same? Do they describe the same process?

In [4]:
#1st expression
np.random.choice(d, 1000) + np.random.choice(d, 1000)

array([ 3,  2,  6,  8,  6,  8,  3,  5,  8,  4,  8,  4,  6,  7,  5,  9,  8,
        7,  8,  3,  8,  6, 10,  3,  9,  7,  8,  7,  6,  7, 10, 11,  3, 10,
        6,  7,  6,  7,  8, 10,  5,  3,  9,  8,  8,  8,  8,  5,  3, 12, 11,
        8,  5,  3,  7,  4,  8,  9, 10, 11,  7,  8, 10,  9,  4, 12,  6,  9,
       11,  8,  7,  9,  6,  9, 11,  5,  7, 11, 11,  7,  7,  9,  6,  5,  2,
        7,  7,  4,  7,  7,  2, 11,  5,  4,  6, 10,  6,  7,  5,  6, 10,  6,
        7, 12,  4,  4,  7,  2,  9,  7,  9,  8,  6,  6,  6,  4,  7,  7,  9,
        6,  7,  6,  6,  4,  8,  2,  6,  6,  7,  8,  3, 12,  9,  3,  5,  4,
       10,  7,  7,  4,  7,  5,  8,  7,  7,  5, 11,  6,  8,  6,  3,  6,  6,
        4,  5,  7,  9,  9,  3, 10,  6,  2,  5,  3,  4,  5,  7,  9,  7,  6,
        7,  7,  6, 10,  7,  8,  8,  5,  5,  5,  2,  9,  8,  8, 11,  7,  9,
        5,  6,  8,  9,  8,  4, 11,  8, 11,  5,  7,  8, 10, 10,  9,  7,  3,
        4, 10,  7, 10,  8,  8,  5,  7, 12,  7,  5, 11,  5, 10, 11,  4,  4,
        4,  7,  4, 12,  7

In [5]:
#2nd expression
2*np.random.choice(d, 1000)

array([ 6,  6,  2,  2, 10,  2,  6,  4,  2,  8, 12, 12, 12,  2,  8,  8, 10,
        4,  6,  4,  4,  6,  6, 10,  8,  4,  2,  6,  6,  4,  6,  2, 10, 10,
       12, 10, 10, 12,  8,  6, 10,  8,  8, 12,  6,  4, 12, 10,  8, 10,  2,
       10, 10, 10,  6,  6,  6,  6, 12, 12, 10,  4,  8,  2,  8, 12, 12,  2,
        4, 10,  6,  8, 12, 12,  4,  8, 12,  6, 12, 10,  6,  8,  4,  8,  4,
        6, 10,  2,  4, 12, 12,  6,  6,  6,  8, 12,  2,  4,  8,  8,  2,  2,
        2, 12, 12,  2,  4,  8,  2,  2,  8,  8, 12, 10,  2,  6,  2, 12,  2,
       12, 12,  2,  6,  4,  4,  4,  8,  6,  8, 12,  2, 10,  4,  8,  4, 10,
        8,  2, 10,  8,  4, 12, 10,  4,  4, 12,  2,  6,  2,  4, 12,  4,  8,
        8, 10,  6,  2,  4,  2,  6, 10,  8,  8,  6,  4,  2,  8,  6, 12, 10,
        8,  2, 10, 10, 10,  6, 10, 10,  4,  6,  4, 10, 10,  6, 12,  8,  2,
       10,  8, 12,  2,  2,  2, 12,  6,  4, 12,  4,  4,  6,  4, 12, 10, 12,
       12, 12,  6,  2,  6, 10, 12,  4,  8, 10,  4, 12,  8, 12,  4,  4,  2,
        2,  6, 12,  8,  4
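
One way to explore the question, offered as a sketch rather than a definitive answer: the first expression adds two separate rolls, while the second doubles a single roll. The names `two_rolls` and `doubled_roll` and the seed are assumptions introduced for illustration.

```python
import numpy as np

np.random.seed(1)  # assumption: seeded only for repeatability

d = np.arange(6) + 1  # faces of a die: 1 through 6

two_rolls = np.random.choice(d, 1000) + np.random.choice(d, 1000)
doubled_roll = 2 * np.random.choice(d, 1000)

# Doubling a single roll can only produce even totals.
print(np.count_nonzero(doubled_roll % 2 == 1))  # 0

# Adding two separate rolls produces odd totals as well.
print(np.count_nonzero(two_rolls % 2 == 1))
```

Both expressions evaluate to arrays of 1000 values between 2 and 12, but the counts of odd values suggest that they do not describe the same process.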

## Boolean and Comparison
**The result of a comparison expression is a bool value**
<img src = 'comparison.jpg' width = 500/>

In Python, Boolean values, named for the logician George Boole, represent truth and take only two possible values: `True` and `False`. Whether problems involve randomness or not, Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, `3` is larger than `1 + 1`.

In [None]:
3 > 1 + 1

The value `True` indicates that the comparison is correct. The full set of common comparison operators is listed below.

| Comparison | Operator | True Example | False Example |
| --- | --- | --- | --- |
| Less than | < | 2 < 3 | 2 < 2 |
| Greater than | > | 3 > 2 | 3 > 3 |
| Less than or equal | <= | 2 <= 2 | 3 <= 2 |
| Greater than or equal | >= | 3 >= 3 | 2 >= 3 |
| Equal | == | 3 == 3 | 2 == 3 |
| Not equal | != | 2 != 3 | 3 != 3 |

Notice the two equal signs `==` in the comparison to determine **equality**. Python already uses `=` to mean assignment to a name, so it can’t use the same symbol for a different purpose. Thus, if you want to check whether `5` is equal to `10/2`, you have to be careful: `5 = 10/2` raises an error because Python assumes you are trying to assign the value of the expression `10/2` to a name that is the numeral `5`. Instead, you must use `5 == 10/2`, which evaluates to `True`.

In [7]:
5 = 10/2

SyntaxError: can't assign to literal (<ipython-input-7-e8c755f5e450>, line 1)

In [8]:
5 == 10/2

True

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1+1 is between 1 and 3 using the following expression.

In [9]:
1 < 1 + 1 < 3

True

The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers `x` and `y` below. You can try different values of `x` and `y` to confirm this relationship.

In [10]:
x = 12
y = 5
min(x, y) <= (x+y)/2 <= max(x, y)

True
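
A chained comparison behaves like its individual comparisons joined by `and`, an operator that belongs to the Boolean operators discussed under Combining Comparisons. The sketch below checks this for two made-up values of `x`; all variable names are assumptions for illustration.

```python
x = 2
chained = 1 < x < 3
spelled_out = (1 < x) and (x < 3)
print(chained, spelled_out)   # True True

x = 3
chained_at_edge = 1 < x < 3
spelled_out_at_edge = (1 < x) and (x < 3)
print(chained_at_edge, spelled_out_at_edge)   # False False
```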

### String Comparison
Strings can also be compared. Their order is alphabetical, with the caveat that every uppercase letter comes before every lowercase letter.

In [11]:
'Dog' > 'Catastrophe'

True

A shorter string is less than a longer string that begins with the shorter string.

In [12]:
'Catastrophe' > 'Cat'

True
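
The following sketch collects a few string comparisons in one place, including a case that mixes uppercase and lowercase letters. Python compares strings character by character, and every uppercase letter sorts before every lowercase letter; the example words are made up for illustration.

```python
print('Dog' > 'Catastrophe')   # True: 'D' comes after 'C'
print('Cat' < 'Catastrophe')   # True: a prefix is smaller than the longer string
print('Zebra' < 'apple')       # True: uppercase letters sort before lowercase ones

words = sorted(['dog', 'Cat', 'cat', 'Dog'])
print(words)                   # ['Cat', 'Dog', 'cat', 'dog']
```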

Let’s return to random selection. Recall the array `two_groups` which consists of just two elements, `treatment` and `control`. To see whether a randomly assigned individual went to the treatment group, you can use a comparison:

In [13]:
np.random.choice(two_groups) == 'treatment'

False

The random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result.

### Comparing an Array and a Value
Recall that we can perform arithmetic operations on many numbers in an array at once. For example, `make_array(0, 5, 2)*2` is equivalent to `make_array(0, 10, 4)`. In similar fashion, if we compare an array and one value, each element of the array is compared to that value, and the comparison evaluates to **an array of Booleans**.

In [14]:
tosses = make_array('Tails', 'Heads', 'Tails', 'Heads', 'Heads')
tosses == 'Heads'

array([False,  True, False,  True,  True])

The NumPy function `count_nonzero` evaluates to the number of non-zero (that is, `True`) elements of the array.

In [15]:
np.count_nonzero(tosses == 'Heads')

3
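
The same pattern works for numeric arrays and for the other comparison operators. As a sketch with made-up die rolls (the array `rolls` is an assumption for illustration, and `np.array` stands in for `make_array`):

```python
import numpy as np

rolls = np.array([1, 6, 3, 6, 2, 5])  # hypothetical die rolls

at_least_five = rolls >= 5            # array of Booleans, one per roll
num_high = np.count_nonzero(at_least_five)

print(at_least_five)
print(num_high)                # 3
print(num_high / len(rolls))   # 0.5, the proportion of rolls that are 5 or 6
```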

## Combining Comparisons
Boolean operators can be applied to **bool** values
<img src = 'combining.jpg' width = 450/>
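
As a brief sketch of the Boolean operators `and`, `or`, and `not` applied to the results of comparisons (the values of `x` and `y` are made up for illustration):

```python
x = 12
y = 5

both = x > 10 and y > 10    # True only if both comparisons are True
either = x > 10 or y > 10   # True if at least one comparison is True
negated = not (x > 10)      # flips True to False and False to True

print(both, either, negated)  # False True False
```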

### Predicates Make Comparisons 
The result of calling a predicate function (e.g. `are.above(3)`) is a function that performs a comparison.
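
To make the idea concrete, here is a minimal sketch of how such a predicate could be built in plain Python. The function `above` is a hypothetical stand-in written for illustration; it is not the implementation behind `are.above` in the `datascience` library.

```python
def above(threshold):
    """Return a function that reports whether its argument exceeds threshold."""
    def is_above(value):
        return value > threshold
    return is_above

above_three = above(3)   # a function, not a bool
print(above_three(5))    # True
print(above_three(2))    # False
```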

## Control Statements
These statements **control** the sequence of computations that are performed in a program.
* The keywords `if` and `for` begin control statements
* The purpose of `if` is to define functions that choose different behavior based on their arguments
* The purpose of `for` is to perform a computation for every element in a list or array
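
As a small sketch of both keywords, the hypothetical function `bet_on_one_roll` (its name and payoff rules are assumptions made for illustration) uses `if` to return a different net gain depending on its argument, and a `for` statement applies it to each element of a list.

```python
def bet_on_one_roll(roll):
    """Return the net gain of a one-dollar bet for a single die roll (assumed rules)."""
    if roll <= 2:
        return -1   # lose the dollar
    elif roll <= 4:
        return 0    # break even
    else:
        return 1    # win a dollar

for roll in [1, 4, 6]:
    print(roll, bet_on_one_roll(roll))
```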

### Iteration
In programming – especially when dealing with randomness – we often want to repeat a process multiple times. For example, we might want to assign each person in a study to the treatment group or to control by a coin toss. We can do this without actually tossing a coin for each person; we can use `np.random.choice` instead.

Here is a reminder of how `np.random.choice` works. Run the cell a few times to see how the output changes.

In [7]:
np.random.choice(make_array('Heads', 'Tails'))

'Tails'

A more automated solution is to use a `for` statement to loop over the contents of a sequence. This is called **iteration**. A `for` statement begins with the word `for`, followed by a name we want to give each item in the sequence, followed by the word `in`, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.

In [8]:
for i in np.arange(3):
    print(i)

0
1
2


It is instructive to imagine code that exactly replicates a `for` statement without the `for` statement. This is called **unrolling the loop**.

A `for` statement replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose. For example, here is an unrolled version of the loop above:

In [9]:
i = np.arange(3).item(0)
print(i)
i = np.arange(3).item(1)
print(i)
i = np.arange(3).item(2)
print(i)

0
1
2


The name `i` is arbitrary, just like any name we assign with `=`.

Here we use a `for` statement in a more realistic way: we print 5 random choices from `coin`, thus simulating the results of five tosses of a coin.

In [10]:
coin = make_array('Heads', 'Tails')

for i in np.arange(5):
    print(np.random.choice(coin))

Heads
Heads
Heads
Heads
Heads


## Augmenting Arrays
While the `for` statement above does simulate the results of 5 tosses of a coin, the results are simply printed. A typical use of a `for` statement is to create an array of results, by augmenting it each time.

The `append` function in NumPy helps us do this. The call `np.append(array_name, value)` evaluates to a new array that is `array_name` augmented by `value`. When you use `append`, keep in mind that all the entries of an array must have the same type.

In [13]:
pets = make_array('Cat', 'Dog')
np.append(pets, 'Another Pet')

array(['Cat', 'Dog', 'Another Pet'], dtype='<U11')

Notice that the array `pets` is unchanged.

In [14]:
pets

array(['Cat', 'Dog'], dtype='<U3')

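But when using `for` loops, it is often convenient to update the array that a name refers to as it is augmented. This is done by assigning the augmented array to the same name as the original. Strictly speaking, the original array is not altered; the name is simply reassigned to the new, longer array.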

In [15]:
pets = np.append(pets, 'Another Pet')
pets

array(['Cat', 'Dog', 'Another Pet'], dtype='<U11')
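
The difference between evaluating `np.append` and reassigning its result can be checked directly. In this sketch, `np.array` stands in for `make_array` (an assumption made so the cell needs only NumPy), and the name `original` is introduced purely for illustration.

```python
import numpy as np

pets = np.array(['Cat', 'Dog'])
original = pets                        # a second name for the same array

pets = np.append(pets, 'Another Pet')  # pets now names a new, longer array

print(len(original))     # 2: the first array still has two entries
print(len(pets))         # 3: the new array has three
print(pets is original)  # False: they are different arrays
```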

### Example: Counting the Number of Heads
We can now simulate 5 tosses of a coin and place the results into an array. We will start by creating an empty array and then appending the outcome of each toss. Notice that the body of the for loop contains 2 statements. Both statements are executed for each value in the given sequence `np.arange(5)`.

In [2]:
coin = make_array('Heads', 'Tails')

outcomes = make_array()

for i in np.arange(5):
    outcome_of_toss = np.random.choice(coin)
    outcomes = np.append(outcomes, outcome_of_toss)
    
outcomes

array(['Tails', 'Heads', 'Tails', 'Heads', 'Tails'], dtype='<U32')

The **unrolled** version of the code above would be:

In [3]:
coin = make_array('Heads', 'Tails')

outcomes = make_array()

i = np.arange(5).item(0)
outcome_of_toss = np.random.choice(coin)
outcomes = np.append(outcomes, outcome_of_toss)

i = np.arange(5).item(1)
outcome_of_toss = np.random.choice(coin)
outcomes = np.append(outcomes, outcome_of_toss)

i = np.arange(5).item(2)
outcome_of_toss = np.random.choice(coin)
outcomes = np.append(outcomes, outcome_of_toss)

i = np.arange(5).item(3)
outcome_of_toss = np.random.choice(coin)
outcomes = np.append(outcomes, outcome_of_toss)

i = np.arange(5).item(4)
outcome_of_toss = np.random.choice(coin)
outcomes = np.append(outcomes, outcome_of_toss)

outcomes

array(['Heads', 'Tails', 'Tails', 'Heads', 'Heads'], dtype='<U32')

By capturing the results in an array we have given ourselves the ability to use array methods to do computations. For example, we can use `np.count_nonzero` to count the number of heads in the five tosses.

In [4]:
np.count_nonzero(outcomes == 'Heads')

3

We have used the `for` loop to simulate a random experiment, and therefore if you run the cell again, the array `outcomes` is likely to be different. In upcoming sections of the course we will study how different the outcomes could be.

Iteration is a powerful technique. For example, by running exactly the same code for 1000 tosses instead of 5, we can count the number of heads in 1000 tosses.

In [None]:
outcomes = make_array()

for i in np.arange(1000):
    outcome_of_toss = np.random.choice(coin)
    outcomes = np.append(outcomes, outcome_of_toss)

np.count_nonzero(outcomes == 'Heads')
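
Because `np.random.choice` accepts a sample size, the same count can also be obtained without a loop. The sketch below is an alternative that assumes only the final count is needed; `np.array` stands in for `make_array`, the name `num_heads` is introduced for illustration, and the seed is fixed purely for repeatability.

```python
import numpy as np

np.random.seed(42)  # assumption: seeded only so reruns match

coin = np.array(['Heads', 'Tails'])

# Draw all 1000 tosses in one call instead of appending one at a time.
outcomes = np.random.choice(coin, 1000)
num_heads = np.count_nonzero(outcomes == 'Heads')

print(num_heads)
```

The loop version remains useful when each repetition involves more than a single draw.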