# Powerball

## Overview
Powerball is an American lottery game offered by 44 states, the District of Columbia, Puerto Rico and the US Virgin Islands. It is coordinated by the Multi-State Lottery Association (MUSL), a nonprofit organization formed by an agreement with US lotteries. Powerball's minimum advertised jackpot is $40 million (annuity). The annuity is paid in 30 graduated installments; alternatively, winners may choose a lump sum payment. The lump sum is less than the total of the 30 annual payments because of the time value of money.

Drawings for Powerball are held every Wednesday and Saturday evening at 10:59 p.m. Eastern Time. Since October 7, 2015, the game has used a 5/69 (white balls) + 1/26 (Powerballs) matrix from which winning numbers are chosen, resulting in odds of 1 in 292,201,338 of winning a jackpot per play. The drawing machines are Halogen models, manufactured by Smartplay International of Edgewater Park, New Jersey. Each play costs $2, or $3 with the Power Play option.

The minimum Powerball bet is $2. In each game, players select five numbers from a set of 69 white balls and one number from 26 red Powerballs; the red ball number can be the same as one of the white balls. The drawing order of the five white balls is irrelevant; all tickets show the white ball numbers in ascending order. Players cannot use the drawn Powerball to match one of their white numbers, or vice versa. Players can select their own numbers, or have the terminal pseudorandomly select the numbers (called "quick pick", "easy pick", etc.).
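
The jackpot odds quoted above follow directly from the 5/69 + 1/26 matrix: five white balls are chosen without regard to order, and each such combination can pair with any of the 26 red Powerballs. A minimal check using only Python's standard library (the variable names are illustrative and not part of the original notebook):

```python
from math import comb

# Unordered selections of 5 white balls out of 69
white_combinations = comb(69, 5)

# Each white-ball combination can pair with any of the 26 red Powerballs
jackpot_outcomes = white_combinations * 26

print(white_combinations)  # 11238513
print(jackpot_outcomes)    # 292201338
```
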
## The Logistics of Powerball
I have heard people ask whether it would be possible to create a random number generator to "predict" the Powerball outcomes. The question itself contains several inconsistencies. First, if the draws were purely random, it would be impossible to predict the outcomes. Second, a truly random number generator would not "predictably match" another such process. Moreover, as most scientists know, there is no such thing as a truly random number generator in software. The pseudorandom generators in use produce sequences of numbers that are hard to distinguish from a random sequence, but that are repeatable and predictable given the starting point (or seed).
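
The repeatability of pseudorandom generators is easy to see in practice. The sketch below is an illustration only: `quick_pick` is a hypothetical helper, the seed `12345` is arbitrary, and Python's `random` module stands in for whatever generator a lottery terminal might actually use.

```python
import random

def quick_pick(rng):
    """Return (sorted white balls, red Powerball) drawn from the given generator."""
    whites = sorted(rng.sample(range(1, 70), 5))  # 5 distinct numbers from 1..69
    red = rng.randint(1, 26)                      # 1 number from 1..26
    return whites, red

# Two independent generators started from the same seed
rng_a = random.Random(12345)
rng_b = random.Random(12345)

picks_a = [quick_pick(rng_a) for _ in range(3)]
picks_b = [quick_pick(rng_b) for _ in range(3)]

print(picks_a == picks_b)  # True: same seed, same "random" sequence
```

Knowing the seed (and the algorithm) is enough to reproduce every pick, which is exactly why such a generator cannot tell us anything about a physical ball machine.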

Back to Powerball: there are some reasonable questions one can ask about the randomness of the process. For example, given how the process works (the logistics), are there elements that make some outcomes more probable than others?

In each drawing, winning numbers are selected using two identical ball machines, randomly selected from a set of four: one containing the white balls and the other containing the red Powerballs. Five white balls are drawn from the first machine and the red ball from the second machine. Games matching at least three white balls or the red Powerball win. The balls are mixed by a turntable at the bottom of the machine that propels the balls around the chamber. When the machine selects a ball, the turntable slows to catch it, sends it up the shaft, and then down the rail to the display.

So some reasonable questions to ask are the following:
* Is there something in the process that leads to "less than random" outcomes?
* Are some of the balls heavier than the others?
* Do the dimensions of the balls differ slightly, giving one or more an advantage in being selected?

We can explore some of these questions using statistics.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Collection of functions for scientific and publication-ready visualization
import plotly.offline as py     # Open source library for composing, editing, and sharing interactive data visualization 
from matplotlib import pyplot
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from collections import Counter
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Load data file
df = pd.read_csv('../input/powerball_draw_order.csv')

In [None]:
# initial view of data
df.head()

In [None]:
df.tail()

In [None]:
# limit the data to the relevant columns: PB1, ..., PB5 and the Powerball
select_columns = ["PB1", "PB2", "PB3", "PB4", "PB5", "Powerball"]
pbData = df[select_columns]

In [None]:
pbData.head()

In [None]:
pbData.describe()

##  We view the data in histogram form to see if there are any clear anomalies
* We see that there is one candidate for a statistical outlier in the first ball drawn (PB1), and
* potentially two statistical outliers in the second ball drawn (PB2).
* We note that these are only possibilities and not necessarily evidence of non-randomness.

In [None]:
pbData.hist(bins=50, figsize=(20,15))
plt.show()

##  We isolate the number '32' as the potentially non-random candidate by determining the number of times it has been drawn first in the lottery (12)

In [None]:
# use the 'value_counts' method and take the five most frequent values
pbData["PB1"].value_counts()[0:5]

##  Chi-Square Analysis
We perform two statistical analyses. First, we check whether the entire sequence of first-drawn numbers is randomly selected from the 69 possibilities. Second, we check whether the number 32 is a statistical outlier relative to the remainder of the first-drawn numbers. We perform both tests with a chi-square analysis, but there is a subtle difference in what we are looking for. In the first case, we ask whether, in aggregate, there is evidence that the first-drawn number is not randomly sampled from the 69 possibilities: we look at all 69 outcomes together and see whether the results look reasonable. In the second analysis, we look at the number 32 against the remainder of the first draws: in effect, 32 versus all other possibilities.
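
One way to build intuition for the difference between the two questions is to simulate a fair machine and look at the most frequent first-drawn number in each simulated sample. The sketch below is illustrative only and not part of the original analysis: it assumes each of the 69 numbers is equally likely to be drawn first, uses 286 draws and the observed top count of 12, and the names, seed, and trial count are arbitrary choices.

```python
import random
from collections import Counter

N_DRAWS = 286      # number of drawings in the sample
N_BALLS = 69       # white balls in the machine
THRESHOLD = 12     # count observed for the most frequent first-drawn number
N_TRIALS = 2000    # simulated fair samples (arbitrary; larger is more precise)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def max_first_draw_count(rng):
    """Simulate N_DRAWS fair first draws and return the largest count of any number."""
    counts = Counter(rng.randint(1, N_BALLS) for _ in range(N_DRAWS))
    return max(counts.values())

max_counts = [max_first_draw_count(rng) for _ in range(N_TRIALS)]
share = sum(m >= THRESHOLD for m in max_counts) / N_TRIALS

print(f"Share of fair simulations whose most frequent number appears "
      f">= {THRESHOLD} times: {share:.3f}")
```

The printed share describes how often some number, not a number fixed in advance, reaches the threshold purely by chance, which is a different question from whether one pre-specified number does.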

We start by considering a binomial random variable $Y$ with mean (expected value) $np$, where $n$ is the number of draws in the sample (286) and $p$ is the theoretical probability of the number being drawn (1/69). The variance of $Y$ is $\sigma^2 = np(1-p)$. We say that $Y$ has a binomial $Bin(n,p)$ distribution.
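
For concreteness, the mean and variance for this sample can be worked out directly. This is a small arithmetic sketch using the values of $n$ and $p$ given above; the variable names are illustrative.

```python
from math import sqrt

n = 286        # number of draws in the sample
p = 1 / 69     # theoretical probability of a given number being drawn first

mean = n * p                 # expected count, np
variance = n * p * (1 - p)   # sigma^2 = np(1 - p)
std_dev = sqrt(variance)

print(f"mean = {mean:.3f}, variance = {variance:.3f}, std dev = {std_dev:.3f}")
# mean = 4.145, variance = 4.085, std dev = 2.021
```

So under the hypothesis of a fair draw, each number is expected to come up first about four times in 286 drawings, give or take about two.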

Suppose we have a random variable $Y_1$ with a $Bin(n,p_1)$  distribution, and let $Y_2 = n - Y_1$ and $p_2 = 1 - p_1$.
Then 
$$
\begin{eqnarray}
    Z^2 &=& \frac{(Y_1-np_1)^2}{ np_1(1-p_1)} \\
        &=& \frac{ (Y_1 - np_1)^2(1 - p_1) + (Y_1 - np_1)^2(p_1)}{np_1(1-p_1)}\\
        &=&\frac{ (Y_1 - np_1)^2}{np_1}  + \frac{(Y_1 - np_1)^2}{n(1-p_1)}\\
        &=&\frac{ (Y_1 - np_1)^2}{np_1}  + \frac{(Y_2 - np_2)^2}{np_2}
\end{eqnarray}
$$

The last step holds because $(Y_1 - np_1)^2 = (n - Y_2 - n + np_2)^2 = (Y_2 - np_2)^2$.

For large $n$, $Z^2 = \frac{(Y_1 - np_1)^2}{np_1} + \frac{(Y_2 - np_2)^2}{np_2}$ has approximately a chi-square distribution with 1 degree of freedom. If the observed values $Y_1$ and $Y_2$ are close to their expected values $np_1$ and $np_2$, then the calculated value $Z^2$ will be close to zero. If not, $Z^2$ will be large. In general, for $k$ random variables $Y_i, i = 1, 2,\ldots, k$, with corresponding expected values $np_i$, a statistic measuring the "closeness" of the observations to their expectations is the sum:
\begin{equation}
    \frac{ (Y_1 - np_1)^2}{np_1}  + \frac{(Y_2 - np_2)^2}{np_2}+\ldots   + \frac{(Y_k - np_k)^2}{np_k}
\end{equation}
which has approximately a chi-square distribution with $k-1$ degrees of freedom.
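
The algebraic identity between the one-term form of $Z^2$ and the two-term sum can be checked numerically. In this sketch, `pearson_chi_square` is a hypothetical helper implementing the general $k$-category sum, and the count of 12 is used only as an illustrative value for $Y_1$.

```python
def pearson_chi_square(observed, probabilities):
    """Sum of (Y_i - n*p_i)^2 / (n*p_i) over all k categories."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probabilities))

n = 286
p1 = 1 / 69
y1 = 12            # illustrative count for the category of interest
y2 = n - y1        # all other outcomes
p2 = 1 - p1

z_squared = (y1 - n * p1) ** 2 / (n * p1 * (1 - p1))  # single-term form
two_term = pearson_chi_square([y1, y2], [p1, p2])     # two-term form

print(abs(z_squared - two_term) < 1e-9)  # True: the two forms agree
```

The same helper accepts any number of categories, so passing 69 counts with probabilities of 1/69 each gives the statistic with $k-1 = 68$ degrees of freedom.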
 
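To perform the first analysis, we set $k=69$ in the last equation for the 69 different potential outcomes, $n= 286$ and $p_i = \frac{1}{69}$ for all $i$. For the second analysis, we set $k=2$, $n=286$, $p_1 = \frac{1}{69}$ and $p_2 = \frac{68}{69}$, so that $Y_1$ counts the draws in which 32 came first and $Y_2$ counts all other draws.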

##  First Chi-Square Analysis

In [None]:
## Chi-square test of whether, in aggregate, the first draw appears random.
## This does not test whether any single number is statistically significant.
import scipy.stats as stats

n_draws = 286                  # sample length (number of draws)
expected_ct = n_draws * 1/69   # expected count per number = sample length * probability of each number

# Count how often each number 1..69 was drawn first. value_counts() omits
# numbers that never appear, so reindex to include them with a count of 0.
observed_ct = pbData["PB1"].value_counts().reindex(range(1, 70), fill_value=0)

chi_squared_stat = sum((observed_ct - expected_ct)**2 / expected_ct)

In [None]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 68)   # Df = number of variable categories - 1 = k-1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=68)
print("P value")
print(p_value)

print("Chi_squared_stat")
print(chi_squared_stat)

In aggregate, there is no evidence that the process is non-random at the 95% confidence level. However, when we consider only how often the number 32 was drawn first, we can reject the hypothesis that it is drawn with probability 1/69 at the 95% level. As the second test below shows, we can reject this hypothesis even at the 99% level.

##  Second Chi-Square Analysis

In [None]:
# First at the 95% level
n = 286     # number of draws
p1 = 1/69   # probability of a particular number being drawn

# value_counts() sorts in descending order, so .iloc[0] is the count
# of the most frequently drawn first number
top_count = pbData["PB1"].value_counts().iloc[0]
chi_squared_stat = (top_count - n*p1)**2 / (n*p1*(1-p1))
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 1)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=1)
print("P value")
print(p_value)

print("Chi_squared_stat")
print(chi_squared_stat)

In [None]:
# Second at the 99% level
n = 286     # number of draws
p1 = 1/69   # probability of a particular number being drawn

# value_counts() sorts in descending order, so .iloc[0] is the count
# of the most frequently drawn first number
top_count = pbData["PB1"].value_counts().iloc[0]
chi_squared_stat = (top_count - n*p1)**2 / (n*p1*(1-p1))
crit = stats.chi2.ppf(q = 0.99, # Find the critical value for 99% confidence
                      df = 1)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=1)
print("P value")
print(p_value)

print("Chi_squared_stat")
print(chi_squared_stat)