We want to build an app to help lottery addicts better estimate their chances of winning. A team of engineers will build the app; however, they need help creating the logical core of the app and calculating probabilities.

For the first version of the app, we focus on the 6/49 lottery and build functions that enable users to answer questions like:

- What is the probability of winning the big prize with a single ticket?
- What is the probability of winning the big prize if we play 40 different tickets (or any other number)?
- What is the probability of having at least five (or four, or three, or two) winning numbers on a single ticket?

We consider historical data from the national 6/49 lottery game in Canada. The [data set](https://www.kaggle.com/datascienceai/lottery-dataset) has data for 3,665 drawings dating from 1982 to 2018.

In this project we will need to calculate probabilities and combinations repeatedly. Therefore, we will write functions that can be reused:

- A function that calculates factorials
- A function that calculates combinations

NOTE: In the 6/49 lottery, six numbers are drawn from a set of 49 numbers that range from 1 to 49. The drawing is done **without replacement**.
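
To make the **without replacement** condition concrete, here is a minimal sketch of a simulated drawing. The helper name `simulate_draw` is hypothetical and not part of the app; it relies on `random.sample`, which never returns the same element twice.

```python
import random


def simulate_draw(rng=None):
    """Return six distinct numbers from 1 to 49, sorted (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    # random.sample draws without replacement, so no number can repeat
    return sorted(rng.sample(range(1, 50), 6))


draw = simulate_draw(random.Random(0))  # seeded only to make the run repeatable
print(draw)
```

Because no number can repeat, the order in which the six numbers come out does not matter, which is why combinations (rather than permutations) are the right tool below.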

Recall:

- n! = n(n-1)(n-2)...(3)(2)(1)
- nCk = n!/(k!(n-k)!)

In [1]:
def factorial(n):
    # Multiply 1 * 2 * ... * n; an empty range leaves fact at 1, so 0! = 1
    fact = 1
    for i in range(1,n+1):
        fact *= i
    return fact

In [2]:
factorial(5)

120

In [3]:
factorial(10)

3628800

In [4]:
def combinations(n,k):
    return factorial(n) // (factorial(k)*factorial(n-k))

In [5]:
combinations(10,5)

252

In [6]:
combinations(15,7)

6435
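
As a sanity check, the two helpers can be compared against the standard library. This is only a sketch: it repeats the definitions above so that it runs on its own, and it assumes Python 3.8 or later, where `math.comb` is available.

```python
import math


def factorial(n):
    fact = 1
    for i in range(1, n + 1):
        fact *= i
    return fact


def combinations(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))


# Compare the hand-written helpers with the standard library versions
for n in [0, 1, 5, 10]:
    assert factorial(n) == math.factorial(n)

for n, k in [(10, 5), (15, 7), (49, 6)]:
    assert combinations(n, k) == math.comb(n, k)

print(combinations(49, 6))  # 13983816
```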

In the 6/49 lottery, a player wins the big prize if the six numbers on their ticket match all six numbers drawn (if even one number differs, the player does not win).

In the first version of the app, we want players to be able to calculate the probability of winning the big prize with the various numbers they play on a single ticket (for each ticket a player chooses six numbers out of 49).

We need to be aware of the following details when writing this function:

- Inside the app, the user inputs six different numbers from 1 to 49
- "Under the hood", the six numbers come as a Python list which will serve as the single input to the function
- The engineering team wants the function to print the probability value in a friendly way (so that people uneducated in probability can understand)
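
The input assumptions in the list above could be checked before any probability is computed. The helper below is only a sketch; `validate_ticket` is a hypothetical name, not something the engineering team specified.

```python
def validate_ticket(numbers):
    """Return the ticket as a sorted list, or raise ValueError (hypothetical helper)."""
    if len(numbers) != 6:
        raise ValueError("a ticket needs exactly six numbers")
    if len(set(numbers)) != 6:
        raise ValueError("the six numbers must all be different")
    if not all(isinstance(x, int) and 1 <= x <= 49 for x in numbers):
        raise ValueError("each number must be an integer from 1 to 49")
    return sorted(numbers)


print(validate_ticket([41, 4, 22, 34, 35, 40]))  # [4, 22, 34, 35, 40, 41]
```

Note that the probability of winning the big prize does not depend on which six numbers are chosen, so a check like this only guards against malformed input.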

In the code below, we define our function `oneTicProb(listOfSix)` with an input `listOfSix` because the user will enter six values (in list form). Next we compute the total number of ways to select 6 numbers from 49. To obtain the probability, we take the reciprocal of `combinations(49,6)`. NOTE: 1 is in the numerator because the values the user enters are just ONE selection of 6 numbers from 49. We want the probability in percentage form, so we multiply by 100. Lastly, we print a message that not only reports the probability, but does so in a way that is easy to understand in lay terms.

In [7]:
def oneTicProb(listOfSix):
    total = combinations(49,6)
    prob = (1 / total) * 100
    print("There is a {:.8f}% chance of winning the big prize "
          "based on the values {} you entered. In other words, you have a 1 in "
          "{:,} chance of winning.".format(prob, listOfSix, total))

In [8]:
oneTicProb([4, 22, 34, 35, 40, 41])

There is a 0.00000715% chance of winning the big prize based on the values [4, 22, 34, 35, 40, 41] you entered. In other words, you have a 1 in 13,983,816 chance of winning.


We now want to read in and examine (determine the number of rows and columns, and check out the first and last several rows) the `lottery.csv` data set:

In [9]:
import numpy as np
import pandas as pd

In [10]:
lottery = pd.read_csv('lottery.csv')

In [11]:
lottery.shape

(3665, 11)

There are 11 columns and 3,665 rows:

In [12]:
lottery.head(5)

Unnamed: 0,PRODUCT,DRAW NUMBER,SEQUENCE NUMBER,DRAW DATE,NUMBER DRAWN 1,NUMBER DRAWN 2,NUMBER DRAWN 3,NUMBER DRAWN 4,NUMBER DRAWN 5,NUMBER DRAWN 6,BONUS NUMBER
0,649,1,0,6/12/1982,3,11,12,14,41,43,13
1,649,2,0,6/19/1982,8,33,36,37,39,41,9
2,649,3,0,6/26/1982,1,6,23,24,27,39,34
3,649,4,0,7/3/1982,3,9,10,13,20,43,34
4,649,5,0,7/10/1982,5,14,21,31,34,47,45


In [13]:
lottery.tail(5)

Unnamed: 0,PRODUCT,DRAW NUMBER,SEQUENCE NUMBER,DRAW DATE,NUMBER DRAWN 1,NUMBER DRAWN 2,NUMBER DRAWN 3,NUMBER DRAWN 4,NUMBER DRAWN 5,NUMBER DRAWN 6,BONUS NUMBER
3660,649,3587,0,6/6/2018,10,15,23,38,40,41,35
3661,649,3588,0,6/9/2018,19,25,31,36,46,47,26
3662,649,3589,0,6/13/2018,6,22,24,31,32,34,16
3663,649,3590,0,6/16/2018,2,15,21,31,38,49,8
3664,649,3591,0,6/20/2018,14,24,31,35,37,48,17


NOTE: It seems that the `PRODUCT` and `SEQUENCE NUMBER` columns are not helpful: the entry in every row is 649 and 0, respectively. We also note that while there are 3,665 rows, the draw numbers only go up to 3,591. At the same time, the `DRAW DATE` column shows that some dates are not recorded in this data set.
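
Observations like these can be checked in code. The sketch below rebuilds only the first five rows shown above (not the full file), so it illustrates the approach rather than confirming anything about all 3,665 rows; `sample`, `constant_cols`, and `gaps` are names made up for this example.

```python
import pandas as pd

# First five rows as displayed by lottery.head(5); the CSV itself is not read here
sample = pd.DataFrame({
    "PRODUCT": [649] * 5,
    "DRAW NUMBER": [1, 2, 3, 4, 5],
    "SEQUENCE NUMBER": [0] * 5,
    "DRAW DATE": ["6/12/1982", "6/19/1982", "6/26/1982", "7/3/1982", "7/10/1982"],
})

# A column with a single distinct value carries no information
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)

# Days between consecutive drawings; unusually large gaps would hint at missing dates
dates = pd.to_datetime(sample["DRAW DATE"], format="%m/%d/%Y")
gaps = dates.diff().dt.days.dropna().astype(int).tolist()
print(gaps)
```

Running the same two checks on the full `lottery` DataFrame would show whether the columns are constant everywhere and where the date gaps fall.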

At this point, we want to use this historical data. We want to write a function enabling users to compare their ticket against the historical lottery data and determine whether they would have ever won by now. We must be aware of the following details:

- Inside the app, the user inputs six different numbers from 1 to 49
- The six numbers will come as a Python list and serve as an input to the function
- The function prints:
    - the number of times the combination selected occurred in the data set
    - the probability of winning the big prize in the next drawing with that combination

In [14]:
lottery.iloc[:, 4:10].head()

Unnamed: 0,NUMBER DRAWN 1,NUMBER DRAWN 2,NUMBER DRAWN 3,NUMBER DRAWN 4,NUMBER DRAWN 5,NUMBER DRAWN 6
0,3,11,12,14,41,43
1,8,33,36,37,39,41
2,1,6,23,24,27,39
3,3,9,10,13,20,43
4,5,14,21,31,34,47


In [15]:
lottery.iloc[0, 4]

3

In [16]:
type(lottery.iloc[0,4])

numpy.int64
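
The values come back as `numpy.int64` rather than Python `int`. As a quick check (a sketch, not part of the original analysis), NumPy integers compare and hash like the equal Python integers, so a set built from the DataFrame can be compared directly with a set built from a user's list.

```python
import numpy as np

drawn = {np.int64(3), np.int64(11), np.int64(12)}  # as extracted from a DataFrame
picked = {3, 11, 12}                               # as typed in by a user

print(drawn == picked)  # True
```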

In [17]:
# I have a hard time writing functions that will be applied to DataFrames
# This is a very simple function, yet I struggled quite a bit - probably
# about 2 hours. Get better at this

def extractNumbers(row):
    row = row[4:10]
    row = set(row.values)
    return row

In [18]:
winningNumbers = lottery.apply(extractNumbers, axis=1)
winningNumbers.head()

0    {3, 41, 11, 12, 43, 14}
1    {33, 36, 37, 39, 8, 41}
2     {1, 6, 39, 23, 24, 27}
3     {3, 9, 10, 43, 13, 20}
4    {34, 5, 14, 47, 21, 31}
dtype: object
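
As an alternative sketch, the six number columns can be selected by name instead of by position, which would keep working if the column order changed. The two-row `sample` frame and the names `number_cols` and `winning_sets` are assumptions for illustration; the values are copied from the first two rows shown earlier.

```python
import pandas as pd

sample = pd.DataFrame({
    "DRAW DATE": ["6/12/1982", "6/19/1982"],
    "NUMBER DRAWN 1": [3, 8],
    "NUMBER DRAWN 2": [11, 33],
    "NUMBER DRAWN 3": [12, 36],
    "NUMBER DRAWN 4": [14, 37],
    "NUMBER DRAWN 5": [41, 39],
    "NUMBER DRAWN 6": [43, 41],
    "BONUS NUMBER": [13, 9],
})

# Select the six columns by name, then build one set per row
number_cols = ["NUMBER DRAWN {}".format(i) for i in range(1, 7)]
winning_sets = pd.Series(
    [set(row.tolist()) for row in sample[number_cols].to_numpy()],
    index=sample.index,
)
print(winning_sets)
```

Building the sets in a list comprehension over the underlying array also avoids a row-wise `apply`, which tends to be slow on larger frames.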

We have now extracted the winning numbers from each recorded draw. Next, we want to write a function that takes two inputs: a Python list containing the user's numbers, and a pandas Series of sets containing the winning numbers (the Series above). We then want to:

- Convert the list of user numbers to a set using the `set()` function
- Compare the set against the pandas Series that contains the sets with the winning numbers to find the number of matches - a Series of Boolean values will be returned as a result of the comparison
- Print information about the number of times the combination inputted by the user occurred in the past
- Print information (in an easy-to-understand way) about the probability of winning the big prize in the next drawing with that combination

In [19]:
# I struggled with this part of the problem as well. Specifically, I was
# using a `for` loop to check whether the set matched sets within the
# Series, which is of course incorrect because I should be using 
# vectorized operations

def histOccurrence(userList, histSeries):
    userSet = set(userList)
    print("User Set: ", userSet)
    win = userSet == histSeries
    winCount = win.sum()
    print("Win Count: ", winCount)
    total = combinations(49,6)
    winProb = (1 / total) * 100
    
    if winCount == 0:
        print("The combination {} has never occurred in the past. This "
             "does not mean the combination {} is more or less likely "
             "to occur now. Therefore, you still have a {:.8f}% chance "
             "of winning the big prize. In other words, you have a 1 in "
             "{:,} chance of winning the big prize".format(userList, userList, winProb, total))
        
    else:
        print("The combination {} has occurred {} time(s) in the past. This "
             "does not mean the combination {} is more or less likely "
             "to occur now. Therefore, you still have a {:.8f}% chance "
             "of winning the big prize. In other words, you have a 1 in "
             "{:,} chance of winning the big prize".format(userList, winCount, userList, winProb, total))

In [20]:
myNums = [3, 41, 11, 12, 43, 14]

In [21]:
histOccurrence(myNums, winningNumbers)

User Set:  {3, 41, 11, 12, 43, 14}
Win Count:  1
The combination [3, 41, 11, 12, 43, 14] has occurred 1 time(s) in the past. This does not mean the combination [3, 41, 11, 12, 43, 14] is more or less likely to occur now. Therefore, you still have a 0.00000715% chance of winning the big prize. In other words, you have a 1 in 13,983,816 chance of winning the big prize
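
The comparison `userSet == histSeries` relies on pandas treating the set as a single value and comparing it with each element. A more explicit variant, sketched below with the hypothetical helper `count_past_wins`, applies the equality test to each element with `Series.map` and returns the count instead of printing it. The two-element `history` Series is sample data taken from the first two draws shown earlier.

```python
import pandas as pd


def count_past_wins(user_numbers, hist_sets):
    """Count how many historical draws equal the user's six numbers."""
    user_set = set(user_numbers)
    # Apply the set equality test to each historical draw explicitly
    matches = hist_sets.map(lambda drawn: drawn == user_set)
    return int(matches.sum())


history = pd.Series([{3, 11, 12, 14, 41, 43}, {8, 33, 36, 37, 39, 41}])

print(count_past_wins([3, 41, 11, 12, 43, 14], history))  # 1
print(count_past_wins([1, 2, 3, 4, 5, 6], history))       # 0
```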


Often, lottery (or gambling) addicts play more than one ticket on a single drawing, thinking that this might increase their chances of winning. We will write a function that allows users to calculate the chances of winning with any number of different tickets. We need to be aware of the following details:

- The user will input the number of *different* tickets they want to play (without inputting the specific combinations they intend to play)
- The function will receive an integer between 1 and 13,983,816 (the maximum number of different tickets)
- The function should print information about the probability of winning the big prize depending on the number of different tickets played

In [26]:
def multiTixProb(n):
    totalCombos = combinations(49,6)
    prob = n / totalCombos
    probPercent = prob * 100
    reducedChance = round(totalCombos / n)
    print("You have a {:8f}% chance of winning the big prize by playing "
         "{} ticket(s). In other words, you have a 1 in {:,} "
         "chance of winning".format(probPercent, n, reducedChance))

In [27]:
multiTixProb(1)

You have a 0.000007% chance of winning the big prize by playing 1 ticket(s). In other words, you have a 1 in 13,983,816 chance of winning


In [28]:
multiTixProb(10)

You have a 0.000072% chance of winning the big prize by playing 10 ticket(s). In other words, you have a 1 in 1,398,382 chance of winning


In [29]:
multiTixProb(100)

You have a 0.000715% chance of winning the big prize by playing 100 ticket(s). In other words, you have a 1 in 139,838 chance of winning


In [30]:
multiTixProb(1000)

You have a 0.007151% chance of winning the big prize by playing 1000 ticket(s). In other words, you have a 1 in 13,984 chance of winning


In [31]:
multiTixProb(100000)

You have a 0.715112% chance of winning the big prize by playing 100000 ticket(s). In other words, you have a 1 in 140 chance of winning
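
Since the number of tickets is supposed to lie between 1 and 13,983,816, a version that checks the range and returns the probability (rather than printing it) may be easier for the engineering team to reuse. This is a sketch under that assumption; `multi_ticket_probability` and `TOTAL_TICKETS` are hypothetical names, and `math.comb` stands in for the `combinations` function above.

```python
import math

TOTAL_TICKETS = math.comb(49, 6)  # 13,983,816 different tickets


def multi_ticket_probability(n_tickets):
    """Probability (0 to 1) of the big prize with n_tickets different tickets."""
    if not 1 <= n_tickets <= TOTAL_TICKETS:
        raise ValueError("number of tickets must be between 1 and 13,983,816")
    # Each different ticket covers exactly one of the equally likely outcomes
    return n_tickets / TOTAL_TICKETS


print(multi_ticket_probability(TOTAL_TICKETS // 2))  # 0.5
print(multi_ticket_probability(TOTAL_TICKETS))       # 1.0
```

The probability grows linearly with the number of different tickets, which is why buying every possible ticket gives a probability of exactly 1.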


We now want to write a function that allows users to calculate probabilities for two, three, four, or five winning numbers. In most 6/49 lotteries there are smaller prizes if a player's ticket matches two, three, four, or five of the six numbers drawn. Therefore, users might be interested in knowing the probability of having two, three, four, or five winning numbers. Here are some details to be aware of:

- Inside the app, the user inputs:
    - six different numbers from 1 to 49
    - an integer between 2 and 5 that represents the number of winning numbers expected
- The function prints information about the probability of having the inputted number of winning numbers

Let's consider one case: calculate the probability for having five winning numbers. For example, suppose a player chose the numbers {1,2,3,4,5,6}. Out of these six numbers, we can form six 5-number combinations:

- {1,2,3,4,5}
- {1,2,3,4,6}
- {1,2,3,5,6}
- {1,2,4,5,6}
- {1,3,4,5,6}
- {2,3,4,5,6}

Also note that "6 choose 5" equals:

`6! / (5!(6-5)!) = 6!/5! = 6`

For each of the above 5-number combinations, there are 44 possible successful outcomes in a lottery drawing. For example, the 5-number combination {1,2,3,4,5} would have the following 44 possible winning combinations:

- {1,2,3,4,5,**6**}
- {1,2,3,4,5,**7**}
- ...
- {1,2,3,4,5,**47**}
- {1,2,3,4,5,**48**}
- {1,2,3,4,5,**49**}

We would have 44 similar winning combinations for each of the other five 5-number combos. Thus, in total, we would have `6 x 44 = 264` possible successful outcomes out of a total of 13,983,816 outcomes. Thus, the probability of having five winning numbers on a single lottery ticket is:

`264 / "49 choose 6" = 0.0000189`

Now consider the case of 4-winning numbers. Then:

In [33]:
combinations(6,4)

15

There are 15 possible 4-number combos that could win a small prize. Consider the 4-number combo {1,2,3,4}. The fifth number can be any number between 5 and 49 (45 possible values). The sixth number can be any value except 1-4 and the fifth number selected, meaning we have 44 possible values. Thus the number of winning outcomes for this combo is `(45)(44) = 1,980`. And since we have 15 different 4-number combinations, we have `15 x 1980 = 29,700` total possible successful outcomes. The probability of winning becomes:

`29,700 / "49 choose 6" = 0.002124`

Now consider the case of 3-winning numbers. Then:

In [36]:
combinations(6,3)

20

There are 20 possible 3-number combos that could win a small prize. For a given 3-number combo such as {1,2,3}, there are 46 possible choices for the 4th number, 45 for the 5th number, and 44 for the 6th number. Thus, there are `(46)(45)(44) = 91,080` possible 6-number combos that would win a small prize with that combo. With 20 3-number combos in total, we have `20 x 91,080 = 1,821,600` ways of winning a 3-number small prize. The probability of winning such a prize is:

`1,821,600 / "49 choose 6" = 0.130265`

Lastly, for the case of 2-winning numbers, there are:

In [37]:
combinations(6,2)

15

possible winning combos. There are `(47)(46)(45)(44) = 4,280,760` possible values for the remaining 4 numbers in the set. Thus, there are a total of `4,280,760 x 15 = 64,211,400` ways of winning a 2-number prize.

However:

In [38]:
combinations(49,6)

13983816

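We see that there are only 13,983,816 ways to choose 6 numbers from 49 numbers. Taken at face value, this would mean we are *guaranteed* to win a 2-number prize, because there are more ways to win a two-number prize than there are possible combinations! That cannot be right, which suggests that some outcomes are being counted more than once.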

Now we need to generalize these observations and write the required function.

In [49]:
# To do this without if statements is tricky...

def probLess6(n):
    ticketCombos = combinations(6,n)
    remainingCombos = combinations(49-n, 6-n)
    
    successOutcomes = ticketCombos * remainingCombos
    totalOutcomes = combinations(49,6)
    
    prob = successOutcomes / totalOutcomes
    probPercent = prob * 100
    
    reducedChances = totalOutcomes // successOutcomes
    
    print("The chance of winning a {}-number prize is {:5f}%. In other words "
          "you have a 1 in {} chance of winning a {}-number prize"
          .format(n, probPercent, reducedChances, n))

In [50]:
for num in [2,3,4,5]:
    probLess6(num)
    print("-------------------------")

The chance of winning a 2-number prize is 19.132653%. In other words you have a 1 in 5 chance of winning a 2-number prize
-------------------------
The chance of winning a 3-number prize is 2.171081%. In other words you have a 1 in 46 chance of winning a 3-number prize
-------------------------
The chance of winning a 4-number prize is 0.106194%. In other words you have a 1 in 941 chance of winning a 4-number prize
-------------------------
The chance of winning a 5-number prize is 0.001888%. In other words you have a 1 in 52969 chance of winning a 5-number prize
-------------------------


In [54]:
x = combinations(6,2)
y = combinations(49-2, 6-2)
print(x)
print(y)
print(x*y)

15
178365
2675475


I must have made a mistake somewhere because the values above (coming from the Dataquest solution) do NOT match my intuitive understanding of the problem!

In [57]:
totalPossible = combinations(49,6)
print("{:,}".format(totalPossible))

13,983,816


In [58]:
x = combinations(6,4)
y = combinations(49-4, 6-4)
print(x)
print(y)
print(x*y)

15
990
14850


The reason I got this wrong is that I forgot to divide out the redundancies! For example, take the case of 4-number winning combinations. We still have 15 possible 4-number winning combos:
`6 choose 4 = 6! / (4!(6-4)!) = 6! / ((4!)(2!)) = (6)(5) / 2 = 15`

For each 4-number combo, say {1,2,3,4}, I counted `49-4 = 45` possible numbers for the 5th number, and then `49-5 = 44` possible numbers for the 6th number. HOWEVER, that count is ordered, so there are redundancies! For example: 
`{1,2,3,4,10,6} = {1,2,3,4,6,10}`

Thus, we must divide by the number of redundancies. In this case it is only 2! = 2 because there are two positions to be filled, so `(45)(44) / 2! = 990`, which is the `combinations(49-4, 6-4)` value printed above.

In the case of 3-number winning combinations, the number of redundancies will be 3! = 6; and in the case of 2-number winning combinations there will be 4! = 24 redundancies.

There is a second problem, and `probLess6` above shares it: the remaining numbers must not be any of the other numbers on the ticket. If the ticket is {1,2,3,4,5,6} and the draw contains {1,2,3,4} plus 5 or 6, the ticket matches five or six numbers rather than four, and that same draw is counted again under other 4-number combos. The remaining `6-n` numbers therefore have to come from the `49-6 = 43` numbers that are not on the ticket, which gives `combinations(43, 6-n)` rather than `combinations(49-n, 6-n)`. For four winning numbers that is `15 x 903 = 13,545` successful outcomes, and for five winning numbers it is `6 x 43 = 258` (not `6 x 44 = 264`, because a draw of {1,2,3,4,5,6} is the big prize).

In [63]:
def prob_less_6(n):
    # Ways to choose which n of the ticket's six numbers are drawn
    numNCombos = combinations(6,n)
    # The other 6-n drawn numbers come from the 43 numbers not on the ticket
    numBranches = combinations(43, 6-n)
    possibleSuccesses = numNCombos*numBranches
    totalOutcomes = combinations(49, 6)
    prob = (possibleSuccesses / totalOutcomes) * 100
    reducedChances = totalOutcomes // possibleSuccesses
    print("You have a {:8f}% chance of winning with a {}-number "
          "combination. In other words, you have a 1 in {} chance of "
          "winning with a {}-number combination."
          .format(prob, n, reducedChances, n))

In [64]:
for num in [2,3,4,5]:
    prob_less_6(num)
    print('------------------')

You have a 13.237803% chance of winning with a 2-number combination. In other words, you have a 1 in 7 chance of winning with a 2-number combination.
------------------
You have a 1.765040% chance of winning with a 3-number combination. In other words, you have a 1 in 56 chance of winning with a 3-number combination.
------------------
You have a 0.096862% chance of winning with a 4-number combination. In other words, you have a 1 in 1032 chance of winning with a 4-number combination.
------------------
You have a 0.001845% chance of winning with a 5-number combination. In other words, you have a 1 in 54200 chance of winning with a 5-number combination.
------------------
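
A counting argument like this can be sanity-checked by brute force on a lottery small enough to enumerate. The sketch below uses a hypothetical 3/10 game (three numbers drawn from 1 to 10), lists every possible draw, and counts how many share exactly `n` numbers with a fixed ticket. It then compares that count with `C(picks, n) * C(pool - picks, picks - n)`. The function name and the 3/10 game are assumptions made for this example only.

```python
from itertools import combinations as all_draws
from math import comb


def exact_match_count(pool, picks, n):
    """Count draws that share exactly n numbers with a fixed ticket."""
    ticket = set(range(1, picks + 1))  # by symmetry, any ticket gives the same count
    return sum(
        1
        for draw in all_draws(range(1, pool + 1), picks)
        if len(ticket & set(draw)) == n
    )


POOL, PICKS = 10, 3  # hypothetical 3/10 game, small enough to enumerate
results = {}
for n in range(PICKS + 1):
    brute = exact_match_count(POOL, PICKS, n)
    formula = comb(PICKS, n) * comb(POOL - PICKS, PICKS - n)
    results[n] = (brute, formula)
    print(n, brute, formula)
```

The counts for all values of `n` add up to the total number of draws, since every draw matches the ticket in exactly one number of positions. A formula that overcounts would fail that check.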


We may wish to consider other features for a second version of this app:

- Add in some helpful or funny analogies like "You are 100 times more likely to be the victim of a shark attack than to win this lottery game"
- Combine the `oneTicProb()` and `histOccurrence()` functions to output information on probability and historical occurrence at the same time.
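
As a rough idea of what the combined function might look like, here is a minimal sketch. The name `ticket_report`, the wording of the message, and the two-draw `history` sample are all assumptions; it returns the text instead of printing it so that the app can decide how to display it.

```python
import math

import pandas as pd


def ticket_report(user_numbers, hist_sets):
    """Return one message with the historical count and the win probability."""
    user_set = set(user_numbers)
    # Number of past draws whose six numbers equal the user's six numbers
    past_wins = int(hist_sets.map(lambda drawn: drawn == user_set).sum())
    total = math.comb(49, 6)
    return (
        "The combination {} has occurred {} time(s) in the past. "
        "Your chance of winning the big prize in the next drawing is "
        "{:.8f}%, or 1 in {:,}."
    ).format(sorted(user_set), past_wins, 100 / total, total)


history = pd.Series([{3, 11, 12, 14, 41, 43}, {8, 33, 36, 37, 39, 41}])
report = ticket_report([3, 41, 11, 12, 43, 14], history)
print(report)
```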