<table align="left" style="border-style: hidden" class="table"> <tr> <td class="col-md-2"><img style="float" src="http://prob140.org/assets/icon256.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Spring 2018</h4><p>Ani Adhikari</div></td></tr></table><!-- not in pdf -->

In [None]:
# SETUP

import numpy as np
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines make warnings look nicer
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', DeprecationWarning)

# Useful for probability calculations
from scipy import stats
from scipy import misc

In [None]:
# Standard deck of cards

ranks = np.append(np.arange(2, 11), np.array(['Jack', 'Queen', 'King', 'Ace']))

suits = ['\u2660', '\u2663', '\u2661', '\u2662']

deck = Table().values('Suit', suits, 'Rank', ranks).move_to_start('Rank')

# Lab 2: Five-Card Poker #
Simple random sampling (sampling uniformly at random without replacement) is a natural scheme for sampling from a finite population. Data science applications aside, simple random sampling happens every time a hand of cards is dealt from a well-shuffled deck. In this lab we will study aspects of simple random sampling in the context of poker, a card game.

Several of the discoveries you will make in this lab, especially those in Parts 3 and 4, will have obvious generalizations when you change the parameters. So the lab isn't just about poker, but poker is the primary setting.

What you'll do in this lab:
- Use SciPy to find hypergeometric probabilities and numbers of combinations
- Deal yourself some poker hands and learn which ones are valued more highly than others
- Work out why some hands are more valuable
- Study the distribution of the count of a specified rank in a hand
- Study the joint distribution of the counts of two specified ranks in a hand

## Part 1: Using SciPy ##
We will start with some functions that make numerical probability calculations easy. `SciPy` is a system for scientific computing, based on Python. Its modules `misc` and `stats` are useful for math and probability calculations.

#### Number of Combinations ####
You know that if you have a population of size $N$ and you take a simple random sample of size $n$, then there are ${N \choose n}$ possible samples. 

For integers $0 \le n \le N$, `misc.comb(N, n)` evaluates to ${N \choose n}$. 

Annoyingly, by default you get a float instead of an integer, due to the method of computation.

If you are bothered by the decimals you can use `misc.comb(N, n, exact=True)` for the integer value.

Combinatorial terms can get large very quickly, but in this lab you won't have to worry about that.
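As a small illustration of the float-versus-integer behavior, here is a sketch on a toy case, ${10 \choose 3} = 120$. It assumes a SciPy installation in which the function is imported from `scipy.special` (recent releases no longer provide it in `misc`); the arguments are the same as those described above. The variable names are ours.

```python
# Assumption: comb is imported from scipy.special, which takes the same
# arguments as the misc.comb call described in the lab.
from scipy.special import comb

approx_value = comb(10, 3)              # default exact=False: floating-point result
exact_value = comb(10, 3, exact=True)   # exact=True: Python integer

print(type(approx_value).__name__, approx_value)
print(type(exact_value).__name__, exact_value)
```

The two results represent the same number; only the type differs.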

#### Hypergeometric Probabilities ####
Because of the simple random sampling scheme, all ${N \choose n}$ possible samples are equally likely. Now suppose that among the $N$ elements of the population, $G$ are good according to some precise definition of "good". Then you know that

$$
P(k \text{ good elements in the sample}) ~ = ~ 
\frac{ {G \choose k}{{N-G} \choose {n-k}} }{ {N \choose n} }
$$

following the standard interpretation that "$k$ good elements" means "exactly $k$ good elements."
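The formula can be checked by hand on a toy population. The sketch below uses only the Python standard library (`math.comb` needs Python 3.8 or later) and exact fractions; the function name `hypergeom_prob` and the toy numbers $N = 10$, $G = 4$, $n = 3$ are ours, chosen for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_prob(k, N, G, n):
    """P(exactly k good elements) in a simple random sample of size n
    from a population of N elements, G of which are good."""
    # Impossible counts have probability 0.
    if k < 0 or k > n or k > G or n - k > N - G:
        return Fraction(0)
    return Fraction(comb(G, k) * comb(N - G, n - k), comb(N, n))

# Toy population (hypothetical numbers): N = 10, G = 4, n = 3
probs = [hypergeom_prob(k, 10, 4, 3) for k in range(4)]
print(probs)        # exact probabilities for k = 0, 1, 2, 3
print(sum(probs))   # the probabilities add up to 1
```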

These are impressively called *hypergeometric probabilities* because the terms are related to the hypergeometric series of mathematics. Scary terminology notwithstanding, what is being calculated is straightforward: the chance of getting a specified number of good elements in a simple random sample.

In the calculation above, $k$ is the desired number of good elements specified in the event. The population size $N$, the population count of good elements $G$, and the sample size $n$ are constants of the sampling scheme and are called *parameters*. 

`stats.hypergeom.pmf(k, N, G, n)` evaluates to the probability above. The `pmf` part stands for "probability mass function".

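**Warning**: Be careful when you read `SciPy` documentation for `hypergeom`. Their notation uses some of the same letters as we are using, to mean different things: SciPy writes `M` for the population size, `n` for the number of good elements in the population, and `N` for the sample size. It can therefore be horribly confusing.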

Just remember the call as:

`stats.hypergeom.pmf(k, population_size, num_good_in_population, sample_size)`

#### Hypergeometric Distributions ####
If $X$ is the number of good elements in the sample, then $P(X = k)$ is the hypergeometric probability above, and $X$ is said to have the *hypergeometric distribution with parameters $N$, $G$, and $n$.* When you see that, you should say in your head, "$X$ is the number of good elements in a simple random sample with the listed parameters."

By using a list or array for `k`, instead of a single integer, you can get all the corresponding hypergeometric probabilities by using the same call as above. The sampling scheme and hence the parameters stay fixed.
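Here is a minimal sketch of both kinds of call, a single integer `k` and an array of values, on a toy sampling scheme small enough to check by hand. The numbers (population of 10, 4 good elements, sample of 3) are hypothetical and not part of the lab.

```python
import numpy as np
from scipy import stats

# Toy parameters (hypothetical): population 10, 4 good elements, sample size 3
population_size, num_good, sample_size = 10, 4, 3

# Single value of k: chance of exactly 1 good element
single = stats.hypergeom.pmf(1, population_size, num_good, sample_size)

# Array of k: chances of 0, 1, 2, 3 good elements, same parameters
k = np.arange(sample_size + 1)
all_probs = stats.hypergeom.pmf(k, population_size, num_good, sample_size)

print(single)
print(all_probs)
print(all_probs.sum())   # should be 1, up to rounding
```

By hand, the chance of exactly 1 good element is ${4 \choose 1}{6 \choose 2} / {10 \choose 3} = 60/120$, which is what `single` should agree with.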

#### Example ####
Suppose a class has 100 students of whom 40 are seniors, and suppose you take a simple random sample of 25 students from the class. Then the chance that you get 10 seniors in the sample is

$$
\frac{ {{40} \choose {10}}{{60} \choose {15}} }{ {{100} \choose {25}} }
$$

which can be calculated in the two ways below.

In [None]:
misc.comb(40, 10) * misc.comb(60, 15) / misc.comb(100, 25)

In [None]:
stats.hypergeom.pmf(10, 100, 40, 25)

If $X$ is the number of seniors in the sample, then $X$ has the hypergeometric distribution with parameters 100, 40, and 25. Here are all the probabilities in the distribution followed by a confirmation that it is indeed a distribution.

In [None]:
k = np.arange(26)
stats.hypergeom.pmf(k, 100, 40, 25)

In [None]:
sum(stats.hypergeom.pmf(k, 100, 40, 25))

Now it's your turn to use these functions. The exercises in this part are in the context of cookies, not poker, with apologies for making you hungry.

**In a box of 36 cookies, 12 are chocolate chip, 18 are oatmeal raisin, and the rest are snickerdoodles. Select a simple random sample of 20 cookies.**

### 1a) ###
In each cell, use `misc.comb` and arithmetic operations to find the quantity described.

(i) the total number of samples

In [None]:
...

(ii) the number of samples that have no snickerdoodles

In [None]:
...

(iii) the number of samples that have equal numbers of chocolate chip and oatmeal raisin cookies

In [None]:
...

(iv) the chance of getting equal numbers of chocolate chip and oatmeal raisin cookies

In [None]:
...

### 1b) ###
This is a brief workout with `stats.hypergeom.pmf`. It's a good idea to go back to the start of Part 1 where the details were provided, to remind yourself of the types of input and the order of the arguments.

Find:

(i) the chance that there is 1 snickerdoodle in the sample

In [None]:
stats.hypergeom.pmf(...)

(ii) the chance that the sample has more than 4 chocolate chip cookies

In [None]:
...

Note that if you had been asked to calculate the above probability by hand, using the complement rule would have been quicker than the direct method. With a computational system such as the one we are using, it doesn't make much difference.
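To see the two routes side by side, here is a sketch on the earlier class example (100 students, 40 seniors, sample of 25) rather than the cookies. It computes the chance of more than 4 seniors directly and by the complement rule, using only the standard library; the helper name `hypergeom_prob` is ours.

```python
from math import comb

def hypergeom_prob(k, N, G, n):
    """Chance of exactly k good elements, computed from the counting formula."""
    return comb(G, k) * comb(N - G, n - k) / comb(N, n)

N, G, n = 100, 40, 25   # class of 100, 40 seniors, sample of 25

# Direct method: add the chances of 5, 6, ..., 25 seniors (21 terms)
direct = sum(hypergeom_prob(k, N, G, n) for k in range(5, n + 1))

# Complement rule: 1 minus the chances of 0, 1, 2, 3, 4 seniors (5 terms)
via_complement = 1 - sum(hypergeom_prob(k, N, G, n) for k in range(0, 5))

print(direct, via_complement)
```

The two values should agree up to floating-point rounding; the complement route simply needs fewer terms.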

#newpage

## Part 2: Classification of Poker Hands ##
That's enough of hypothetical cookies. It's time for you to deal some cards. 

In a standard deck of 52 cards, each card has three attributes – suit, color, and rank.
- There are 26 black cards, and 26 red cards which we show here as colorless.
- There are 13 cards in each of four suits: hearts ($\heartsuit$) and diamonds ($\diamondsuit$) are red, and spades ($\spadesuit$) and clubs ($\clubsuit$) are black.
- Within each suit, the 13 cards are ranked 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King, Ace in ascending order of value.

You can think of the table `deck` as a standard deck, one card per row. The spades are displayed below.

In [None]:
deck.show(13)

A *five-card poker hand* consists of 5 cards dealt at random without replacement from the deck. Each hand can be classified according to the attributes of the cards in it. 

Some hands are more valuable than others. Here is the ranking from highest to lowest.
- **Royal flush**: 10, Jack, Queen, King, Ace of the same suit
- **Straight flush**: five consecutive ranks, all of the same suit, not a royal flush
- **Four of a kind**: *aaaab* for any two distinct ranks *a* and *b*
- **Full house**: *aaabb* for any two distinct ranks *a* and *b*
- **Flush**: five cards of the same suit but not a royal flush or a straight flush
- **Straight**: five consecutive ranks, not all of the same suit
- **Three of a kind**: *aaabc* for any three distinct ranks *a*, *b*, and *c*
- **Two pair**: *aabbc* for any three distinct ranks *a*, *b*, and *c*
- **One pair**: *aabcd* for any four distinct ranks *a*, *b*, *c*, and *d*
- **High card**: None of the above; if all players have these, they have to compare the highest ranked card in each hand
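The classification above can be written as a short function. The sketch below is one possible way to do it, not part of the lab's libraries: it assumes a hand is a list of five `(rank, suit)` pairs with ranks given as strings, treats the ace as high only, and uses names (`classify`, `RANK_VALUE`) that we made up.

```python
from collections import Counter

RANKS = ['2', '3', '4', '5', '6', '7', '8', '9', '10',
         'Jack', 'Queen', 'King', 'Ace']
RANK_VALUE = {rank: value for value, rank in enumerate(RANKS, start=2)}  # 2..14

def classify(hand):
    """Classify a hand given as a list of five (rank, suit) pairs."""
    values = sorted(RANK_VALUE[rank] for rank, _ in hand)
    suits = {suit for _, suit in hand}
    # Pattern of rank counts, largest first: [4, 1] is aaaab, [3, 2] is aaabb, ...
    counts = sorted(Counter(values).values(), reverse=True)
    is_flush = len(suits) == 1
    # Five different ranks spanning exactly five values are consecutive.
    is_straight = counts == [1, 1, 1, 1, 1] and values[-1] - values[0] == 4

    if is_straight and is_flush:
        return 'Royal flush' if values[-1] == 14 else 'Straight flush'
    if counts == [4, 1]:
        return 'Four of a kind'
    if counts == [3, 2]:
        return 'Full house'
    if is_flush:
        return 'Flush'
    if is_straight:
        return 'Straight'
    if counts == [3, 1, 1]:
        return 'Three of a kind'
    if counts == [2, 2, 1]:
        return 'Two pair'
    if counts == [2, 1, 1, 1]:
        return 'One pair'
    return 'High card'

example = [('3', 'S'), ('3', 'H'), ('3', 'D'), ('9', 'C'), ('9', 'S')]
print(classify(example))
```

The checks run from the most valuable category down, so a hand is always assigned the highest category it qualifies for.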

### 2a) ###
The Table method `sample` draws uniformly from the rows of a table. If `tbl` is a table and `n` is a positive integer, `tbl.sample(n, with_replacement=False)` creates a table whose rows are a simple random sample of `n` rows of `tbl`.
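For readers who want to see the same sampling scheme without the `Table` machinery, here is a standard-library analogue. It is only a sketch: the list `cards` is our stand-in for the `deck` table, and `random.sample` draws without replacement, which is what makes the result a simple random sample.

```python
import itertools
import random

ranks = [str(r) for r in range(2, 11)] + ['Jack', 'Queen', 'King', 'Ace']
suits = ['\u2660', '\u2663', '\u2661', '\u2662']

# All 52 (rank, suit) pairs; a plain-list stand-in for the deck table
cards = list(itertools.product(ranks, suits))

# random.sample draws without replacement, so no card can appear twice
hand = random.sample(cards, 5)
print(hand)
```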

In each of the four cells below, deal a poker hand. In the comment line below the code, say what kind of hand it is, based on the classification above. If you get a "high card" hand, identify the highest card in the hand.

In [None]:
# Hand 1

deck.sample(...)

#

In [None]:
# Hand 2

In [None]:
# Hand 3

In [None]:
# Hand 4

What was the best hand you got?


** Your answer here. **

### 2b) ###
Find the total number of five-card poker hands.

In [None]:
# total number of poker hands

...

That's a lot of poker hands. Each of the hands falls into one of the categories described above. The rarer the category, the higher its value. For example, there are only four possible royal flushes – one for each suit – so royal flush is the most valuable. [Wikipedia's list of poker hands](https://en.wikipedia.org/wiki/List_of_poker_hands) includes the number of hands of each kind. Skim the page but ignore "five of a kind" because our deck doesn't have a joker card.

In the rest of this part of the lab, you will find the numbers of hands in some of the categories above. Counts for some of the other categories were covered in discussion section. 
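As a warm-up for this style of counting, here is a sketch for a category that is not among the exercises below: four of a kind. The breakdown into steps is one way to organize the count, and the variable names are ours.

```python
from math import comb

choices_for_a = 13               # the rank that appears four times
ways_to_pick_four = comb(4, 4)   # all four cards of that rank must be in the hand
choices_for_fifth_card = 48      # any remaining card; it necessarily has a different rank

four_of_a_kind = choices_for_a * ways_to_pick_four * choices_for_fifth_card
print(four_of_a_kind)   # 624
```

Each hand of the form *aaaab* is produced exactly once by these steps, so nothing is double counted.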

### 2c) Straight Flush ###
A straight flush has been defined above as both a straight (five consecutive ranks) and a flush (all of the same suit), but not a royal flush. By this definition, how many straight flush hands are there? It will help to notice that the ranks in any straight are fixed once you know the highest ranked card in the straight.


** Your answer here. **

**Note:** Some definitions of the straight flush include the royal flush. Others allow aces to be both high and low, that is, "Ace 2 3 4 5" as well as "10 J Q K A" are straights. We aren't including either of those, but Wikipedia does include them and thus has a higher count than ours.

### 2d) Full House ###
Scroll down the Wikipedia page till you find the Full House section and the place where it says that there are 3,744 possible full house hands. In the cell below, show the calculation that leads to this result. Run the cell and confirm that your calculation is correct.

In [None]:
# Number of possible full house hands

...

### 2e) Three of a Kind ###
"Three of a kind" is less valuable than full house. How many possible "three of a kind" hands are there? Show the calculation in the cell below and run the cell.

In [None]:
# three of a kind

...

Does your answer agree with the count according to Wikipedia? Explain how you came up with the calculation in the code cell above.


** Your answer here **

In the same way, with some careful counting, you can verify the counts (and hence the relative value) of all the different kinds of poker hands.

#newpage

## Part 3: Aces ##
In Part 2, poker hands were classified according to all their attributes. Now you will take a more careful look at just the count of cards of one specified rank in a hand. We will use the ace as the specified rank, but the results apply to any other rank as well.

The main goal of this part of the lab is to visualize the distribution of the number of aces in a 5-card poker hand.

The `prob140` library has methods that allow you to quickly draw a histogram of the probability distribution of an integer-valued random variable. For a random variable $X$ whose possible values are in `values_array` with the corresponding probabilities in the array `probs_array`, the assignment 

`dist_object = Table().values(values_array).probability(probs_array)`

creates a "distribution object" we have named `dist_object`, containing the distribution of $X$. Then `Plot(dist_object)` draws the histogram.

### 3a) ###
Find the chance of getting 1 ace in a poker hand. Do the calculation in two ways, one in each of the cells below.

In [None]:
# Use only misc.comb and arithmetic operations
...

In [None]:
# Use only stats.hypergeom.pmf
...

### 3b) ###
Let $X$ be the number of aces in a poker hand. Fill in and run the cell below so the final result is the probability histogram of $X$. Make sure that the bar at $X = 1$ is consistent with your answer to **3a**.

In [None]:
# Array of possible values of X
k = ...

# Array/list of the corresponding probabilities
aces_probs = ...

# Distribution object consisting of the distribution of X
aces_dist = ...

# Probability histogram of X

...

### 3c) ###
For comparison, draw the probability histogram of the number of red cards in a poker hand.

In [None]:
...

Explain the difference between the shape of this histogram and the one in part **b**. 


** Your answer here **

### 3d) ###
While it's often easy to compare shapes by just a glance at a pair of histograms, you have to look more closely to make numerical comparisons. Drawing both histograms on the same horizontal axis is often revealing.

`Plots(variable_name_1, dist_object_1, variable_name_2, dist_object_2)`

draws overlaid histograms of the two probability distributions in the arguments. Choose the strings `variable_name_1` and `variable_name_2` to be short but descriptive of the corresponding variables.

Overlay the histograms in parts **b** and **c**. Call one of the variables `Number of Aces` and the other variable `Number of Reds`.

In [None]:
...

**Prediction intervals:** Based on the graph above (and no other calculation) fill in the blanks below. Fill in the first blank with either "aces" or "red cards", and each of the other blanks with an integer. If there is more than one correct set of choices, provide all of them.

There is about a 95% chance that the number of $\underline{~~~~~~~~~~~~~~~~}$ in the professor's next 5-card poker hand will be either $\underline{~~~~~~~~~~~}$ or $\underline{~~~~~~~~~~~}$.


** Your answer here **

#newpage

## Part 4: Aces and Kings ##
This part of the lab examines the joint distribution of the counts of two ranks in the hand. We have used aces and kings but the results apply to any pair of ranks.

### 4a) ###
Let $X$ be the number of aces and $Y$ the number of kings in a five-card poker hand. Get some scratch paper and figure out:
- How to describe all the possible values of $(X, Y)$
- $P(X = x, Y = y)$ for each possible value $(x, y)$

Now define a function `joint_prob` that takes arguments $x$ and $y$ and returns $P(X = x, Y = y)$.

In [None]:

def joint_prob(x, y):
    return ...

What should `joint_prob(4, 2)` be? Explain in the Markdown cell and then run the code cell to confirm.


** Your answer here **

In [None]:

joint_prob(4, 2)

You know what the total of all the probabilities in the joint distribution of $X$ and $Y$ should be. Use the cell below to confirm that the sum of `joint_prob(x, y)` over all `x` and `y` is indeed what it should be. There are many ways of writing the code; whichever you use, make sure that the last line evaluates to the sum.

In [None]:

...

### 4b) ###
The cell below uses `joint_prob` to create and display a joint distribution table for $X$ (`Aces`) and $Y$ (`Kings`). All you have to do is fill in the blank to set the possible values of each of $X$ and $Y$.

In [None]:

k = np.arange(...)
two_ranks_tbl = Table().values("Aces", k, "Kings", k).probability_function(joint_prob)
two_ranks = two_ranks_tbl.to_joint()

two_ranks

Use the table to find numerical values of the following probabilities.

(i) $P(\text{1 ace and 1 king})$

(ii) $P(\text{2 cards of one of the two ranks and none of the other})$


** Your answer here **

You can use joint distribution tables created by the `prob140` library to find other distributions as well. For example, if `joint_dist` is a joint distribution table of variables labeled `V` and `W`, then:

`joint_dist.both_marginals()` displays the joint distribution table along with the two marginals

`joint_dist.conditional_dist('V', 'W')` displays a table of the conditional distribution of $V$ given each value of $W$, and the marginal distribution of $W$. The two arguments are the variable labels, passed as strings.
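To make the relation between the three kinds of distribution concrete, here is a plain-Python sketch that does not use `prob140`. The joint distribution in it is a made-up toy example, not the aces-and-kings table: marginals come from adding joint probabilities, and conditionals from dividing a joint probability by a marginal.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy joint distribution of (V, W); the numbers are for illustration only
joint = {
    (0, 1): Fraction(1, 6),
    (1, 0): Fraction(1, 3),
    (1, 1): Fraction(1, 3),
    (2, 0): Fraction(1, 6),
}

# Addition rule: sum the joint probabilities over the other variable
marginal_v = defaultdict(Fraction)
marginal_w = defaultdict(Fraction)
for (v, w), p in joint.items():
    marginal_v[v] += p
    marginal_w[w] += p

def conditional_v_given_w(w):
    """Division rule: P(V = v | W = w) = P(V = v, W = w) / P(W = w)."""
    return {v: p / marginal_w[w] for (v, w2), p in joint.items() if w2 == w}

print(dict(marginal_v))
print(dict(marginal_w))
print(conditional_v_given_w(1))
```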

### 4c) ###
For each of the statements (i), (ii), and (iii), say whether it is true or false and explain your choice. As before, $X$ is the number of aces and $Y$ the number of kings in a five-card poker hand.

(i) $X = Y$

(ii) $X = Y$ in distribution

(iii) $X$ and $Y$ are not equal in distribution


** Your answer here **

Complete the line of code below to display a table that supports your choices above.

In [None]:

two_ranks...

### 4d) ###
Display the conditional distribution of the number of aces in the hand for each given value of the number of kings in the hand.

In [None]:
...

The table should show that $P(X = 1 \mid Y = 3) \approx 15.6\%$. Use the division rule and the appropriate elements in the tables in parts **b** and **c** to confirm that this conditional probability has been computed correctly.

In [None]:
... / ...

### 4e) ###
Focus on the row that displays **the conditional distribution of the number of aces given that there are two kings in the hand**. Like all the others in the table, this conditional distribution has been calculated using the joint distribution of the number of aces and the number of kings in the hand.

But there is a way of coming up with the conditional distribution directly, without using the joint distribution. Think about the symmetries in simple random samples and about how the condition "two kings in the hand" restricts the outcome space. Figure out how to use this to find $P(\text{hand has 1 ace} \mid \text{2 kings in the hand})$. Then use the same idea to write one line of code that generates all the probabilities in the conditional distribution of the number of aces given two kings.

In [None]:
...



If the probabilities in your answer agree with the corresponding line of the table in part **d**, congratulations. Not only have you discovered a good approach to conditioning in the context of simple random sampling, you're done with this lab!

### Conclusion ###
What you have learned in this lab, apart from some details about poker:
- The `stats` and `misc` modules of SciPy make it easy to compute probabilities numerically.
- Counting isn't always as easy as 1-2-3; you have to be careful to avoid double counting or leaving out elements that should be counted.
- Shapes of hypergeometric distributions can vary quite markedly depending on the parameters.
- Joint, marginal, and conditional distributions are related in straightforward ways based on the multiplication, addition, and division rules.
- Symmetries in simple random sampling help to simplify calculations.

## Submission Instructions

1. **Save your notebook using File > Save and Checkpoint.**
2. Run the cell below to generate a pdf file.
3. Download the pdf file and confirm that none of your work is missing or cut off.
4. Submit the assignment to Lab_02b on Gradescope. Use the entry code "9GEKKD" if you haven't already joined the class.

In [None]:
import gsExport
gsExport.generateSubmission("Lab_02.ipynb")