# Double voting and the birthday problem

<img src="voting.jpg" alt="voting" width="400" align="left"/>

## Introduction

Claims of voter fraud are widespread. Here's one example:

> **“Probably over a million people voted twice 
in [the 2012 presidential] election.”**
>
> Dick Morris, in 2014 on Fox News

Voter fraud can take place in a number of ways, including tampering with voting machines, destroying ballots, and impersonating voters. Today, though, we're going to explore 
**double voting**, which occurs when a single person illegally casts more than one vote in an election.

To start, consider this fact:

> In the 1970 election, there were 141 individuals named "John Smith", and 27 pairs had exactly the same birthday.

Were there 27 fraudulent ballots in the 1970 election? Let's find out.

## The birthday problem

To begin answering this question, let's solve a similar problem: 

> In a room of 30 people, how likely is it that two people share exactly the same birthday? 
>
> You can assume that every person in the room was born in the same year.

To answer this question, we could use the following **algorithm**, or list of steps needed to solve a problem:
 > 1. We need a room of 30 people, and we need to know their birthdays.
 > 2. We need to check whether two people in that room share a birthday.
 > 3. We need to repeat steps 1 and 2 over and over.
 > 4. We need to figure out how often two or more people shared a birthday.

Unfortunately, we don't have the time (or money!) to recruit a couple thousand experimental subjects to come to Stanford, sit in a big room with 29 other people, and tell us their birthdays.

Instead, we must create a **simulation**, or a computer model that helps us recreate and study real-life phenomena. 

## The `sample` command

Here's the first step of our algorithm:

> We need a room of 30 people, and we need to know their birthdays.

How can we simulate 30 people using `R`, a statistical programming language?

### Exercise 1

**A. Think about the output of the cells below, then discuss with your partner what you think `sample` does.**

(Hint: Try re-running the code in each cell a couple times by pressing SHIFT + ENTER, and see what happens!)

In [107]:
sample(1:10, 5, replace = FALSE)

In [108]:
sample(1:3, 5, replace = TRUE)

In [109]:
sample(1:20, 3, replace = TRUE)

In [13]:
# Feel free to test more sample commands here.



---

**Tip: To edit text, double click on it!**

Ideas about what `sample` does:
- Idea 1
- Idea 2
- ...

---

**B. Using `sample`, simulate a list of 30 birthdays.**

(Hint: Think about assigning a number to each day in the year.)

In [36]:
# Your code goes here!



## Interlude: Math, variables, and vectors

### Using R as a calculator

One simple (and useful!) way to use R is as a calculator. For example:

In [26]:
5 + 10

In [27]:
30 * 3

In [29]:
25 / 5

In [30]:
3 ^ 2

In [1]:
(2 + 3) * 5

#### Exercise 1

Use R to find the average of 42, 100, and 280.

In [31]:
# Your code here!



### Variables

**Variables** are like boxes: they store things for us, and we can label them so we know what's inside.

If you've taken algebra before, you already understand variables! Consider the following example:

> If x = 2, what is x + 5?

You guessed it: the answer is indeed 7. Let's express the same problem using `R`:

In [56]:
# If x = 2, what is x + 5?

x <- 2
x + 5

In the first line of our code, we assigned the value <b>2</b> to the variable named <b>x</b> using <b><-</b> .

The assignment operator (`<-`) tells R "Hey, take the variable on the *_left_* of the arrow, and give it the value of the thing on the *_right_*."

> Note: The equal sign ( `=` ) also works for assignment, but it is an `R` convention to use `<-`.

---

If you ever run a cell containing just a variable, R will print the value of that variable:

In [45]:
x

---

In the second line of our code, we added <b>5</b> to our variable <b>x</b>, which has the value 2.

R will also print the value of simple expressions:

In [57]:
x + 5

**Important note:** The value of x is still 2, not 7. 

Unless we use `<-` to assign a new value to x, it will always be 2:

In [59]:
x

---

We can also use the same variable on both sides of `<-` to update the value of a variable. For example, we can increase the value of x by 10:

In [57]:
x <- x + 10
x

### Vectors

A **vector** is a sequence of things.

For example, we can have a vector of numbers:

In [64]:
5:10

The colon (`:`) symbol allows us to create vectors of integers from *start:end*.

---

We can also use `c()` to create vectors.

In [66]:
c(10, 100, 1000)

The "c" in `c()` stands for **concatenate**, which is the act of connecting things from end to end.
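
As a small illustration of concatenation, `c()` can also join existing vectors end to end, not just individual numbers. The names `first_part`, `second_part`, and `combined` are made up for this sketch:

```r
# Two small example vectors (values chosen only for illustration)
first_part <- 1:3
second_part <- c(10, 20)

# c() connects them end to end into a single vector
combined <- c(first_part, second_part)
print(combined)  # 1 2 3 10 20
```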

---

We can assign vectors to variables too!

In [14]:
my_vector <- c(10, 100, 1000)
my_vector

---

Lastly, we can extract **elements** from vectors using their **index**, or their place in line.

In [4]:
my_vector[2]
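
Here is a short sketch of how indexing behaves, using a made-up vector called `heights`. One thing worth knowing: R starts counting at 1, so the first element lives at index 1.

```r
# A hypothetical vector of three example values
heights <- c(150, 160, 170)

heights[1]    # first element: 150 (R starts counting at 1, not 0)
heights[3]    # third element: 170
heights[2:3]  # elements 2 through 3: 160 170
```

Notice that a vector of indices, like `2:3`, pulls out several elements at once.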

#### Exercise 2

**A. Create a vector of numbers from 15 to 140, and assign the vector to a variable called `my_vector`.**

In [72]:
# Your code here!



**B. Find the difference between the 30th and 100th values of `my_vector`.**

In [73]:
# Your code here!



## Back to the birthday problem: Finding duplicates

Here's the second step of our algorithm:

> We need to check whether two people in the room share a birthday.

In this exercise, we will learn about two useful tools for finding duplicates.

#### Exercise 3

**A. Think about the output of the cells below, then discuss with your partner what you think `has_duplicate` does.**

In [2]:
# Press SHIFT + ENTER to run this cell
# Don't worry about what it's doing for now!

source("duplicate.R")

---
<br>

In [4]:
vector_a <- sample(1:3, 5, replace = TRUE)

In [5]:
print(vector_a)

[1] 3 1 3 1 1


In [6]:
has_duplicate(vector_a)

---
<br>

In [7]:
vector_b <- sample(1:10, 5, replace = FALSE)

In [8]:
print(vector_b)

[1]  5  7  8  1 10


In [9]:
has_duplicate(vector_b)

---
<br>

In [7]:
vector_c <- sample(1:5, 10, replace = TRUE)

In [11]:
print(vector_c)

 [1] 5 3 3 4 5 5 2 3 4 2


In [8]:
has_duplicate(vector_c)

---

Ideas about `has_duplicate`:
- Idea 1
- Idea 2
- ...

---

**B. Generate a vector of 30 random birthdays, and determine whether the vector has any duplicates. Re-run the code several times and think about how often the vector has duplicates.**

In [71]:
# Your code here!



Observations:
- Observation 1
- Observation 2
- ...

---

**C. Think about the output of the cells below, then discuss with your partner what you think `num_duplicates` does.**

In [102]:
print(vector_a)

[1] 3 2 2 2 1


In [103]:
num_duplicates(vector_a)

---
<br>

In [98]:
print(vector_b)

[1]  4  5  7 10  3


In [73]:
num_duplicates(vector_b)

---
<br>

In [113]:
print(vector_c)

 [1] 1 2 5 5 4 5 5 1 3 5


In [114]:
num_duplicates(vector_c)

---

Ideas about `num_duplicates`:
- Idea 1
- Idea 2
- ...

---

**D. Generate a vector of 141 random birthdays, and determine how many duplicates are in the list. Re-run the code several times and note any observations.**

In [92]:
# Your code here!



Observations:
- Observation 1
- Observation 2
- ...

### Repetition with `for` loops

Here's the third step in our algorithm:

> We need to repeat this process over and over.

The `for` loop lets us do exactly this!

#### Exercise 4

**A. Think about the output of the cells below, then discuss with your partner how you think a `for` loop works.**

In [25]:
for (i in 1:10) {
    print("Hello, world!")
}

[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"


In [26]:
for (i in 1:5) {
    print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5


In [111]:
for (i in 1:3) {
    print(sample(1:20, 5, replace = TRUE))
}

[1] 18  9 18 14 15
[1]  3 13  2  9  7
[1]  8  5  1 15  5


<br>

Ideas about `for` loops:
- Idea 1
- Idea 2
- ...

---

**B. Using a `for` loop, print 10 lists of 30 random birthdays.**

In [60]:
# Your code here!



---

**C. Think about the output of the cell below, and discuss with your partner what you think is going on. You are encouraged to change the numbers and re-run the code a couple times!**

In [95]:
counter <- 0

for (i in 1:5) {
    counter <- counter + 1
}

print(counter)

[1] 5


Ideas:
- Idea 1
- Idea 2
- ...

## Interlude: Functions, booleans, and control flow

### Functions

If you've taken algebra, you've already seen functions! For example, this function f takes the square root of its input:

> f(x) = √x
>
> f(25) = √25 = 5

<img src="function-machine.svg" width=200 align="left"/>

R also has a square root function called `sqrt`. To use a function in R, we write the name of the function, and then put its input in parentheses. The output of a function is called its **return value**. 

In [33]:
sqrt(25)

You may have noticed that we used this function notation quite a lot already. Here are some of the functions you have already used:
- `has_duplicate`: determines if a vector has any duplicate values
- `num_duplicates`: determines how many elements in a vector are duplicates
- `print`: prints its input 

There are *many* other functions in R, and you can even write your own functions! Here are some examples of functions that you can use:
- `sum`: Adds up all of the numbers in a vector
- `mean`: Finds the average of the numbers in a vector
- `length`: Finds the total number of elements in a vector.
- `max`: Finds the maximum value in a vector
- `min`: Finds the minimum value in a vector
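
To get a feel for these functions, here is a quick sketch that applies each one to a tiny vector. The vector `scores` and its values are made up for this example:

```r
# A small example vector (values chosen only for illustration)
scores <- c(2, 4, 9)

sum(scores)     # 2 + 4 + 9 = 15
mean(scores)    # 15 / 3 = 5
length(scores)  # 3 elements
max(scores)     # largest value: 9
min(scores)     # smallest value: 2
```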

#### Exercise 5

**A. Find the sum of all the numbers from 1 to 100 using `sum`.**

> *Historical note*: According to a famous story, the mathematician Carl Friedrich Gauss (1777-1855) rapidly solved this problem by hand when he was still a schoolchild! https://www.nctm.org/Publications/Teaching-Children-Mathematics/Blog/The-Story-of-Gauss/  

In [36]:
# Your code here!



**B. Find the average of all the numbers from 1 to 100 using `mean`.**

In [37]:
# Your code here!



### Multi-argument functions and named arguments

Functions like `print` only need one input, or **argument**. However, functions can have more than one argument. For example, this function `f` adds its two arguments, x and y:

> f(x, y) = x + y
>
> f(2, 3) = 2 + 3 = 5

You've also already used a multi-argument function in R: `sample`! 

In the examples so far, we have given `sample` three arguments:

1. A vector of numbers to sample from
2. How many numbers to sample
3. Whether or not we can reuse numbers after sampling them.

In [46]:
sample(1:10, 5, replace = TRUE)

Notice that the last argument, whether or not we can reuse numbers, has its own name: `replace`.

To make functions more understandable and usable, programmers often name arguments. We will see more examples of functions with multiple arguments and named arguments in the next tutorial.
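
As a rough sketch of how named arguments behave, the three calls below should all do the same thing. The argument names `x` and `size` come from R's documentation for `sample`. The example also uses `set.seed`, a function not covered in this tutorial, to make the random draws repeatable so the results can be compared:

```r
# set.seed() resets R's random number generator, so each call
# below draws the same "random" numbers
set.seed(42)
unnamed <- sample(1:10, 5, TRUE)

set.seed(42)
named <- sample(x = 1:10, size = 5, replace = TRUE)

# Named arguments can even be given in a different order
set.seed(42)
reordered <- sample(replace = TRUE, size = 5, x = 1:10)

identical(unnamed, named)      # TRUE
identical(unnamed, reordered)  # TRUE
```

Naming every argument takes more typing, but it makes it much easier to tell what each input means.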

### Control flow with booleans, `if`, and `else`

Booleans are a special type of value that can be only one of two things: `TRUE` or `FALSE`.

> *Historical note*: Booleans are named after George Boole, a 19th century mathematician. https://en.wikipedia.org/wiki/George_Boole

Booleans come in handy when you're comparing values.

In [76]:
10 == 10

In [77]:
9 == 10

---

The double equal sign ( `==` ) is different from the single equal sign ( `=` ).

- While a single equal sign is used to <i>assign</i> values to arguments inside functions, a double equal sign is used to <i>compare</i> values.
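
Here is a small sketch that puts the two side by side. The variable `dice` is made up for this example and simulates rolling two dice:

```r
# `=` inside the parentheses names an argument: replace is set to TRUE
dice <- sample(1:6, 2, replace = TRUE)

# `==` compares two values and gives back TRUE or FALSE:
# did both dice land on the same number?
dice[1] == dice[2]

7 == 7  # TRUE
7 == 8  # FALSE
```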

---

We can also use greater than ( `>` ) and less than ( `<` ) to compare values:

In [104]:
9 < 10

In [107]:
10 > 10

In [108]:
# <= means "less than or equal to", and >= means "greater than or equal to"

10 >= 10

---

We can use `if` in conjunction with booleans to control our code:

> if (this statement is true) {do this thing}

In [52]:
counter <- 0

for (i in 1:5) {
    counter <- counter + 1
    
    print(counter)
    
    if (counter >= 3) {
        print("Counter is now bigger than or equal to 3!")
    }
}

[1] 1
[1] 2
[1] 3
[1] "Counter is now bigger than or equal to 3!"
[1] 4
[1] "Counter is now bigger than or equal to 3!"
[1] 5
[1] "Counter is now bigger than or equal to 3!"


---

In computer science, `else` means "otherwise". We can use `if` and `else` with each other to write code that follows this pattern:

> if (this statement is true) {do this thing}
>
> else {do this other thing}

In [53]:
counter <- 0

for (i in 1:5) {
    counter <- counter + 1
    
    print(counter)
    
    if (counter >= 3) {
        print("Counter is now bigger than or equal to 3!")
    } else {
        print("Counter is less than 3!")
    }
}

[1] 1
[1] "Counter is less than 3!"
[1] 2
[1] "Counter is less than 3!"
[1] 3
[1] "Counter is now bigger than or equal to 3!"
[1] 4
[1] "Counter is now bigger than or equal to 3!"
[1] 5
[1] "Counter is now bigger than or equal to 3!"


#### Exercise 6

**Write a `for` loop to count off all the numbers from 1 to 10. Print "Bigger than 5!" after each number that is bigger than 5.**

In [51]:
# Your code here!



#### Exercise 7

**Generate a list of 10 birthdays, and use a `for` loop with `if` to print the birthdays that fall in the first half of the year.**

In [43]:
# Your code here!



## Back to the birthday problem: Translating our algorithm into code

We're ready to come back to our algorithm:

> 1. We need a room of 30 people, and we need to know their birthdays.
> 2. We need to check whether two people in that room share a birthday.
> 3. We need to repeat this process over and over.
> 4. We need to figure out how frequently two or more people shared a birthday.

Here's our algorithm for finding duplicates, translated into code! 

In [59]:
# This is the total number of birthday vectors we will generate
n <- 1000

# This is a counter to keep track of how many vectors had at least one duplicate birthday
n_with_duplicates <- 0

# Each time we see a vector with at least one duplicate, we should increment our counter
for (i in 1:n) {
    
    if (has_duplicate(sample(1:366, 30, replace = TRUE))) {
        
        n_with_duplicates <- n_with_duplicates + 1
    }
    
}

# Fraction of vectors with at least one duplicate
n_with_duplicates / n
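
As a cross-check on the simulation, the same probability can also be worked out exactly. The sketch below assumes, like the code above, 366 equally likely birthdays. Since the definition of `has_duplicate` lives in `duplicate.R` and is not shown here, the sketch uses the base R function `anyDuplicated` as a stand-in. The names `n_days`, `n_people`, `n_sims`, `hits`, `p_exact`, and `p_simulated` are made up for this example.

```r
n_days <- 366
n_people <- 30

# Probability that all 30 birthdays are different:
# (366/366) * (365/366) * ... * (337/366)
p_all_different <- prod((n_days - 0:(n_people - 1)) / n_days)

# Probability that at least two people share a birthday
p_exact <- 1 - p_all_different
p_exact

# Simulated estimate of the same probability
set.seed(2024)   # makes the random draws repeatable
n_sims <- 10000
hits <- 0

for (i in 1:n_sims) {
    birthdays <- sample(1:n_days, n_people, replace = TRUE)

    # anyDuplicated() returns 0 when no value repeats
    if (anyDuplicated(birthdays) > 0) {
        hits <- hits + 1
    }
}

p_simulated <- hits / n_sims
p_simulated
```

The exact answer is roughly 0.7, and the simulated fraction should land close to it. The more vectors we simulate, the closer the two tend to be.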


### Exercise 8

**A. Increase the number of birthdays we generate in each vector, and re-run the code several times. What happens to the fraction of vectors with duplicates?**

Findings:
- Finding 1
- Finding 2
- ...

**B. Change the number of vectors to 100, and re-run the code several times. What happens to the results?**

Findings:
- Finding 1
- Finding 2 
- ...

**C. Change the number of vectors to 100,000, and re-run the code several times. What happens to the results?**

Findings:
- Finding 1
- Finding 2 
- ...

**D. How many birthdays should be in each vector for an approximately 50% chance of a match?**

Findings:
- Finding 1
- Finding 2
- ...

## Circling back to double voting

Remember our original problem:

> In the 1970 election, there were 141 individuals named "John Smith", and 27 pairs had exactly the same birthday.

### Exercise 9

**Modify the simulation code to calculate the average number of birthday matches in 1,000 vectors of 141 individuals.**

(Hint: Use `num_duplicates` from earlier and have the counter add up the total number of duplicates in all the generated vectors.)

In [None]:
# Your code here!

