# Double voting and the birthday problem

<img src="voting.jpg" alt="voting" width="400"/>

## Introduction

Claims of voter fraud are widespread. Here's one example:

> **“Probably over a million people voted twice 
in [the 2012 presidential] election.”**
>
> Dick Morris, in 2014 on Fox News

Voter fraud can take place in a number of ways, including tampering with voting machines, destroying ballots, and impersonating voters. Today, though, we're going to explore 
**double voting**, which occurs when a single person illegally casts more than one vote in an election.

To start, consider this fact:

> In the 1970 election, there were 141 individuals named "John Smith", and 27 pairs had exactly the same birthday.

Were there 27 fraudulent ballots in the 1970 election? Let's find out.

## The birthday problem

To begin answering this question, let's solve a similar problem: 

> In a room of 30 people, how likely is it that two people share exactly the same birthday? 
>
> You can assume that every person in the room was born in the same year.

To answer this question, we could use the following **algorithm**, or list of steps needed to solve a problem:
 > 1. We need a room of 30 people, and we need to know their birthdays.
 > 2. We need to check whether two people in that room share a birthday.
 > 3. We need to repeat steps 1 and 2 over and over.
 > 4. We need to figure out how often two or more people shared a birthday.

Unfortunately, we don't have the time (or money!) to recruit a couple thousand experimental subjects to come to Stanford, sit in a big room with 29 other people, and tell us their birthdays.

Instead, we must create a **simulation**, or a computer model that helps us recreate and study real-life phenomena. 

## The `sample` command

Here's the first step of our algorithm:

> We need a room of 30 people, and we need to know their birthdays.

How can we simulate 30 people using `R`, a statistical programming language?

### Exercise 1

**A. Think about the output of the cells below, then discuss with your partner what you think `sample` does.**

(Hint: Try re-running the code in each cell a couple times by pressing SHIFT + ENTER, and see what happens!)

In [107]:
sample(1:10, 5, replace = FALSE)

In [108]:
sample(1:3, 5, replace = TRUE)

In [109]:
sample(1:20, 3, replace = TRUE)

In [13]:
# Feel free to test more sample commands here.



> Tip: To edit text, double-click on it!

Ideas:
- Idea 1
- Idea 2
- ...

**B. Using `sample`, simulate a list of 30 birthdays.**

(Hint: Think about assigning a number to each day in the year.)

In [36]:
# Your code goes here!



## `R` Basics: Assignments
 - Convention for assigning values to variables is an arrow (`<-`)[^1]
 - Direction of arrow indicates direction of assignment

In [5]:
A <- 12
A  # 12
A + 3 -> B
B  # 15
24 -> A
A  # 24

- The equal sign (`=`) also works, but only for assignment to the left, e.g.

In [6]:
A = 12  # good
12 = A  # BAD

ERROR: Error in 12 = A: invalid (do_set) left-hand side to assignment


## `R` Basics: Re-Assignments
 - A variable can be re-assigned to anything

In [None]:
x <- 860306  # first x is assigned a number
x
x <- 555
x  # Now x is 555
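
"Anything" includes values of a different type, not just another number. As a small illustrative sketch (the values here are made up for the example), the same variable can hold a number, then text, then a whole vector:

```r
x <- 555            # x holds a number
class(x)            # "numeric"

x <- "hello"        # now x holds text
class(x)            # "character"

x <- c(12, 24, 36)  # now x holds a vector of three numbers
length(x)           # 3
```

Each assignment simply replaces whatever `x` held before; `R` does not require the new value to match the old one.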

## Finding duplicates

Here's the second step of our algorithm:

> We need to check whether two people in the room share a birthday.

In this exercise, we will learn about two useful tools for finding duplicates.

### Exercise 2

**A. Think about the output of the cells below, then discuss with your partner what you think `has_duplicate` does.**

In [4]:
# Press SHIFT + ENTER to run this cell
# Don't worry about what it's doing for now!

source("duplicate.R")

In [115]:
list_a <- sample(1:3, 5, replace = TRUE)

print(list_a)

has_duplicate(list_a)

[1] 2 2 2 3 1


In [110]:
list_b <- sample(1:10, 5, replace = FALSE)

print(list_b)

has_duplicate(list_b)

[1]  7  4  2 10  1


In [112]:
list_c <- sample(1:5, 10, replace = TRUE)

print(list_c)

has_duplicate(list_c)

 [1] 1 2 5 5 4 5 5 1 3 5


Ideas:
- Idea 1
- Idea 2
- ...

**B. Generate a list of 30 random birthdays, and determine whether the list has any duplicates. Re-run the code several times and think about how often the list has duplicates.**

In [71]:
# Your code here!



Observations:
- Observation 1
- Observation 2
- ...

**C. Think about the output of the cells below, then discuss with your partner what you think `num_duplicates` does.**

In [102]:
print(list_a)

[1] 3 2 2 2 1


In [103]:
num_duplicates(list_a)

In [98]:
print(list_b)

[1]  4  5  7 10  3


In [73]:
num_duplicates(list_b)

In [113]:
print(list_c)

 [1] 1 2 5 5 4 5 5 1 3 5


In [114]:
num_duplicates(list_c)

Ideas:
- Idea 1
- Idea 2
- ...

**D. Generate a list of 141 birthdays, and determine how many duplicates are in the list. Re-run the code several times and note any observations.**

In [92]:
# Your code here!



Observations:
- Observation 1
- Observation 2
- ...

## Repetition with `for` loops

Here's the third step in our algorithm:

> We need to repeat this process over and over.

The `for` loop lets us do exactly this!

### Exercise 3

**A. Think about the output of the cells below, then discuss with your partner how you think a `for` loop works.**

In [25]:
for (i in 1:10) {
    print("Hello, world!")
}

[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"
[1] "Hello, world!"


In [26]:
for (i in 1:5) {
    print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5


In [111]:
for (i in 1:3) {
    print(sample(1:20, 5, replace = TRUE))
}

[1] 18  9 18 14 15
[1]  3 13  2  9  7
[1]  8  5  1 15  5


Ideas:
- Idea 1
- Idea 2
- ...

**B. Using a `for` loop, print 10 lists of 30 random birthdays.**

In [60]:
# Your code here!



**C. Think about the output of the cell below, and discuss with your partner what you think is going on. You are encouraged to change the numbers and re-run the code a couple times!**

In [95]:
counter <- 0

for (i in 1:5) {
    counter <- counter + 1
}

print(counter)

[1] 5


Ideas:
- Idea 1
- Idea 2
- ...

## Loops: Note!
- Loops in `R` are inefficient[^loops]
- For many cases, there will be a much faster, *vectorized* alternative
  to looping
- We cover loops because there are some *rare* cases in which a loop might
  make more sense, but in general, loops should be avoided when writing `R`
- There are two main types of loops: `for` and `while`

[^loops]: Not entirely true, but loops are still best avoided for other reasons too.

In [7]:
for (ind in 1:3) {
  # iterate over a sequence (here 1, 2, 3) or the elements of a set
  # do stuff
}


In [8]:
condition <- TRUE

while (condition) {
  # stuff to do while the condition is TRUE
  # the condition must become FALSE at some point!
  condition <- FALSE
}


In [9]:
for (i in 1:3) {
  print(paste('iteration', i))
}

while (i >= 0) {
  print(paste('de-iteration', i))
  i <- i - 1  # beware of infinite loops!
}

[1] "iteration 1"
[1] "iteration 2"
[1] "iteration 3"
[1] "de-iteration 3"
[1] "de-iteration 2"
[1] "de-iteration 1"
[1] "de-iteration 0"
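
The note above says that many loops have a faster, *vectorized* alternative. As a minimal sketch of what that means (the task and the names `total_loop` and `total_vec` are made up for illustration), here is the same calculation written both ways:

```r
# Loop version: add up the squares of 1 to 10, one number at a time.
total_loop <- 0
for (i in 1:10) {
  total_loop <- total_loop + i^2
}

# Vectorized version: square the whole vector at once, then sum it.
total_vec <- sum((1:10)^2)

print(total_loop)  # 385
print(total_vec)   # 385
```

Both versions give the same answer. The vectorized one is shorter and hands the repetition to `R`'s built-in functions instead of spelling it out step by step.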


## Translating our algorithm into code

We're ready to come back to our algorithm:

> 1. We need a room of 30 people, and we need to know their birthdays.
> 2. We need to check whether two people in that room share a birthday.
> 3. We need to repeat this process over and over.
> 4. We need to figure out how frequently two or more people shared a birthday.

Here's our algorithm for finding duplicates, translated into code! 

In [59]:
# This is the total number of birthday lists we will generate (aka our number of simulations)
n_lists <- 1000

# This is a counter to keep track of how many lists had at least one duplicate birthday
n_lists_with_duplicates <- 0

# Each time we see a list with at least one duplicate, we should increment our counter
for (i in 1:n_lists) {
    
    if (has_duplicate(sample(1:366, 30, replace = TRUE))) {
        
        n_lists_with_duplicates <- n_lists_with_duplicates + 1
    }
    
}

# Fraction of simulations with at least one duplicate
n_lists_with_duplicates / n_lists
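
As a cross-check on the simulation, the same probability can also be computed exactly. This is a minimal sketch rather than part of the original notebook. It makes the same assumptions as the simulation: each of the 366 numbered days is equally likely, and birthdays are independent. The names `n_people`, `n_days`, `p_all_different`, and `p_match` are illustrative.

```r
n_people <- 30
n_days <- 366  # matches the 1:366 range used in the simulation above

# Probability that all 30 birthdays are different:
# (366/366) * (365/366) * ... * (337/366)
p_all_different <- prod((n_days - 0:(n_people - 1)) / n_days)

# Probability that at least two people share a birthday
p_match <- 1 - p_all_different
print(p_match)
```

The result should be roughly 0.7, so the fraction printed by the simulation should land near that value when `n_lists` is large.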


### Exercise 4

**A. Increase the number of birthdays we generate in each list, and re-run the code several times. What happens to the fraction of simulations with a match?**

Findings:
- Finding 1
- Finding 2
- ...

**B. Change the number of lists to 100, and re-run the code several times. What happens to the results?**

Findings:
- Finding 1
- Finding 2 
- ...

**C. Change the number of lists to 100,000, and re-run the code several times. What happens to the results?**

Findings:
- Finding 1
- Finding 2 
- ...

**D. How many birthdays should be in each list for an approximately 50% chance of a match?**

Findings:
- Finding 1
- Finding 2
- ...

## Circling back to double voting

Remember our original problem:

> In the 1970 election, there were 141 individuals named "John Smith", and 27 pairs had exactly the same birthday.

### Challenge exercise

**Modify the simulation code to calculate the average number of birthday matches in 1,000 lists of 141 individuals.**

(Hint: Use `num_duplicates` from earlier and have the counter add up the total number of duplicates in all the generated lists.)

In [None]:
# Your code here!

