# Motivation
To practice data skills and, possibly, identify a cash-cow opportunity.

# Business understanding
Singapore Pools is the only licensed lottery operator in Singapore. As a government organisation, it has full control over the lottery scene, and wagers placed on any other platform or organisation are illegal. Singapore Pools offers several number-based lottery games, namely 4D (4 digits), TOTO (6 matching numbers from 1-49) and Singapore Sweep. Draws take place every week, and the prize money differs based on the probability of winning. Since the odds of winning TOTO are drastically lower than those of 4D, it typically offers a 1 million SGD payout.

Probability vs winnings for 4D and TOTO:

| Type | Probability | Winnings (SGD) |
|------|-------------|----------------|
| 4D | 1/10,000 | 2,000 - 3,000 |
| TOTO | 1/C(49,6) ≈ 7.15e-8, i.e. 1 in 13,983,816 (Group 1) | 1 million |

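The odds in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming a single exact-match 4D number and the TOTO Group 1 prize (all 6 numbers matched); the variable names are illustrative only.

```python
import math

# 4D: one exact 4-digit number out of 10**4 equally likely outcomes.
p_4d = 1 / 10**4

# TOTO Group 1: all 6 winning numbers out of 49, order irrelevant.
toto_combinations = math.comb(49, 6)
p_toto = 1 / toto_combinations

print(f"4D:   1 in {10**4:,} ({p_4d:.2e})")                 # 1 in 10,000 (1.00e-04)
print(f"TOTO: 1 in {toto_combinations:,} ({p_toto:.2e})")   # 1 in 13,983,816 (7.15e-08)
```
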
Objective:
Determine whether there are any patterns in 4D numbers. Do some numbers appear more frequently than others? The assumption made here is that not all balls used by Singapore Pools are made equal, so there is a possibility that some numbers appear more frequently than others.

Step 1: Determine the layout of the Singapore Pools website. Read the page source and identify the HTML tags to pull.
Step 2: 

# Extract data for analysis
Since there is no existing consolidated dataset of the winning numbers, the information is pulled from the official Singapore Pools website.

In [18]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd
import numpy as np


In [32]:
def fetch_4d_date_list(date_url):
    response = requests.get(date_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser') 

    list_sppl = soup.find('select', class_='form-control selectDrawList').find_all('option')
    #initiate list to store date,sppl
    date_list = []
    for option in list_sppl:
        sppl = option.get('querystring')
        #sppl=RHJhd051bWJlcj01MTU3
        sppl_date_string = option.get_text()
        sppl_date_formatted = datetime.strptime(sppl_date_string, '%a, %d %b %Y').strftime('%Y%m%d')
        date_list.append((sppl_date_formatted,sppl))

    return date_list
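
The two values taken from each `<option>` can be inspected in isolation. The sketch below uses sample values in the same shape as the output further down (the display text `'Sun, 14 Apr 2024'` is an assumption inferred from the `strptime` format). It also shows that the `sppl` token appears to be plain base64 for a draw number, which is an observation from this one token rather than documented behaviour of the site.

```python
import base64
from datetime import datetime

# Sample values shaped like one <option> of the draw list (illustrative).
option_text = 'Sun, 14 Apr 2024'
option_querystring = 'sppl=RHJhd051bWJlcj01MTU3'

# Same conversion as in fetch_4d_date_list: display text -> sortable YYYYMMDD.
draw_date = datetime.strptime(option_text, '%a, %d %b %Y').strftime('%Y%m%d')

# The part after 'sppl=' decodes as base64 to a readable draw identifier.
token = option_querystring.split('=', 1)[1]
decoded = base64.b64decode(token).decode('ascii')

print(draw_date)  # 20240414
print(decoded)    # DrawNumber=5157
```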
    

In [53]:
def fetch_4d_results(draw_date,url):
    # Send a GET request to the webpage
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for HTTP errors

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    #data structure
    data = {
        'DrawDate': draw_date,
        'PrizeType': [],
        'PrizeNumber': []
    }

    prizes = {
        'First_Prize': soup.find('td', class_='tdFirstPrize').get_text().strip(),
        'Second_Prize': soup.find('td', class_='tdSecondPrize').get_text().strip(),
        'Third_Prize': soup.find('td', class_='tdThirdPrize').get_text().strip()
    }

    for prize_type, prize_number in prizes.items():
        data['PrizeType'].append(prize_type)
        data['PrizeNumber'].append(prize_number)
    
    # Extract Starter Prizes
    starter_prizes = soup.find('tbody', class_='tbodyStarterPrizes').find_all('td')
    for prize in starter_prizes:
        data['PrizeType'].append('Starter_Prize')
        data['PrizeNumber'].append(prize.get_text().strip())

    # Extract Consolation Prizes
    consolation_prizes = soup.find('tbody', class_='tbodyConsolationPrizes').find_all('td')
    for prize in consolation_prizes:
        data['PrizeType'].append('Consolation_Prize')
        data['PrizeNumber'].append(prize.get_text().strip())
    
    #convert to dataframe
    df = pd.DataFrame(data)

    return df

In [56]:
#get date list and the corresponding URL tag to append
date_url = 'https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/fourd_result_draw_list_en.html?v=2024y4m16d12h30m'

# URL of the page to scrape
#'https://www.singaporepools.com.sg/en/product/Pages/4d_results.aspx?sppl=RHJhd051bWJlcj01MTU3'
url_pre = 'https://www.singaporepools.com.sg/en/product/Pages/4d_results.aspx?'

date_list_full = fetch_4d_date_list(date_url)
print(date_list_full)

#collect one dataframe per draw, then concatenate once at the end
draw_frames = []

for i, (draw_date, url_sppl_append) in enumerate(date_list_full):
    url_full = url_pre + url_sppl_append
    print(f'Draw date is {draw_date}, step number {i}\n{url_full}')
    draw_frames.append(fetch_4d_results(draw_date, url_full))

full_data = pd.concat(draw_frames, ignore_index=True)

print('final dataframe is...')
print(full_data)



[('20240414', 'sppl=RHJhd051bWJlcj01MTU3'), ('20240413', 'sppl=RHJhd051bWJlcj01MTU2'), ('20240410', 'sppl=RHJhd051bWJlcj01MTU1'), ('20240407', 'sppl=RHJhd051bWJlcj01MTU0'), ('20240406', 'sppl=RHJhd051bWJlcj01MTUz'), ('20240403', 'sppl=RHJhd051bWJlcj01MTUy'), ('20240331', 'sppl=RHJhd051bWJlcj01MTUx'), ('20240330', 'sppl=RHJhd051bWJlcj01MTUw'), ('20240327', 'sppl=RHJhd051bWJlcj01MTQ5'), ('20240324', 'sppl=RHJhd051bWJlcj01MTQ4'), ('20240323', 'sppl=RHJhd051bWJlcj01MTQ3'), ('20240320', 'sppl=RHJhd051bWJlcj01MTQ2'), ('20240317', 'sppl=RHJhd051bWJlcj01MTQ1'), ('20240316', 'sppl=RHJhd051bWJlcj01MTQ0'), ('20240313', 'sppl=RHJhd051bWJlcj01MTQz'), ('20240310', 'sppl=RHJhd051bWJlcj01MTQy'), ('20240309', 'sppl=RHJhd051bWJlcj01MTQx'), ('20240306', 'sppl=RHJhd051bWJlcj01MTQw'), ('20240303', 'sppl=RHJhd051bWJlcj01MTM5'), ('20240302', 'sppl=RHJhd051bWJlcj01MTM4'), ('20240228', 'sppl=RHJhd051bWJlcj01MTM3'), ('20240225', 'sppl=RHJhd051bWJlcj01MTM2'), ('20240224', 'sppl=RHJhd051bWJlcj01MTM1'), ('20240221

In [149]:
#Determine iBet combinations, i.e. the position of the digits does not matter: if 1234 is drawn, 2134, 3124 etc. win as well.
#Instead of searching for each permutation individually, sort the digits of PrizeNumber and group by the sorted key.
def sort_digits(number_str):
    #zero-pad to 4 characters in case leading zeros were lost (e.g. after a CSV round trip)
    return ''.join(sorted(str(number_str).zfill(4)))

full_data['ibet'] = full_data['PrizeNumber'].apply(sort_digits)
full_data.head(5)

| | DrawDate | PrizeType | PrizeNumber | ibet | digits |
|---|----------|-----------|-------------|------|--------|
| 0 | 20240414 | First_Prize | 9406 | 0469 | [9, 4, 0, 6] |
| 1 | 20240414 | Second_Prize | 4726 | 2467 | [4, 7, 2, 6] |
| 2 | 20240414 | Third_Prize | 1892 | 1289 | [1, 8, 9, 2] |
| 3 | 20240414 | Starter_Prize | 0175 | 0157 | [0, 1, 7, 5] |
| 4 | 20240414 | Starter_Prize | 2321 | 1223 | [2, 3, 2, 1] |

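To make the sorted-key idea concrete, the sketch below lists how many distinct orderings share one key. `ibet_permutations` is a hypothetical helper added for illustration and is not part of the notebook.

```python
from itertools import permutations

def sort_digits(number_str):
    # Order the digits so that all permutations of a number share one key.
    return ''.join(sorted(number_str))

def ibet_permutations(number_str):
    # Distinct 4-digit strings built from the same digits (illustrative helper).
    return sorted({''.join(p) for p in permutations(number_str)})

print(sort_digits('9406'))             # 0469
print(len(ibet_permutations('1234')))  # 24
print(len(ibet_permutations('2321')))  # 12
```

Numbers with repeated digits cover fewer orderings, so group sizes under the `ibet` key are not directly comparable across keys.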

In [103]:
#write to csv file
full_data.to_csv('full_data.csv',index=False)
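
One caveat when reading this file back: by default pandas infers integers for columns such as `PrizeNumber`, so leading zeros are dropped (`0175` becomes `175`). A minimal sketch of the effect and one way around it, using a small in-memory sample instead of the real file:

```python
import io
import pandas as pd

sample = pd.DataFrame({'PrizeNumber': ['0175', '9406'], 'ibet': ['0157', '0469']})
csv_text = sample.to_csv(index=False)

# Default parsing infers integers, so '0175' comes back as 175.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing string dtype keeps the 4-character numbers intact.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'PrizeNumber': str, 'ibet': str})

print(as_int['PrizeNumber'].tolist())  # [175, 9406]
print(as_str['PrizeNumber'].tolist())  # ['0175', '9406']
```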

In [150]:
#explode based on digits of each draw
#list of lists
draw_digits_all = full_data
draw_digits_all['digits'] = draw_digits_all['PrizeNumber'].apply(list) #break up string into individual characters in a list
draw_digits_all_explode = draw_digits_all.explode('digits')
draw_digits_all_explode.head(5)

#group by drawdate, digits
#digit_count = draw_digits_all_explode.groupby(['DrawDate','digits'],as_index=False)['PrizeNumber'].count().rename(columns={'PrizeNumber':'count'})
digit_count = draw_digits_all_explode.groupby(['DrawDate','digits']).size().reset_index(name='count')
digit_count.head(11)

#write to csv file
digit_count.to_csv('digit_count.csv',index=False)
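
One possible way to check whether individual digits are drawn unevenly is a chi-square goodness-of-fit statistic against equal expected counts. This is only a sketch and not something the notebook runs: the sample is hypothetical and far too small for the test to be meaningful, and `chi_square_uniform` is an illustrative name.

```python
from collections import Counter

def chi_square_uniform(digits):
    # Pearson statistic against equal expected counts for the digits 0-9.
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected for d in range(10))

# Hypothetical sample; in the notebook this would be draw_digits_all_explode['digits'].
sample_numbers = ['9406', '4726', '1892', '0175', '2321']
sample_digits = [d for number in sample_numbers for d in number]

stat = chi_square_uniform(sample_digits)
# With 9 degrees of freedom, values above about 16.92 would reject
# uniformity at the 5% level (given a large enough sample).
print(round(stat, 2))  # 4.0
```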

In [181]:
#full number combinations
all_iterations = [f"{i:04d}" for i in range(10000)]
all_iterations_df = pd.DataFrame(all_iterations, columns = ['number'])

#drawn number counts
lucky_numbers_group = full_data.groupby(['PrizeNumber']).size().reset_index(name='count')
lucky_numbers_group.head(5)
#numbers which have not appeared at all - unlucky numbers
unlucky_numbers = all_iterations_df.merge(lucky_numbers_group,how='left',left_on='number',right_on='PrizeNumber')
unlucky_numbers = unlucky_numbers[unlucky_numbers['PrizeNumber'].isna()]['number']
unlucky_numbers.size

3385
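
The same "never drawn" set can also be obtained without a merge, using a plain set difference. A small sketch with hypothetical drawn numbers standing in for `full_data['PrizeNumber']`:

```python
# All 10,000 possible 4D numbers as zero-padded strings.
all_numbers = {f"{i:04d}" for i in range(10000)}

# Hypothetical drawn numbers; in the notebook this would be full_data['PrizeNumber'].
drawn = ['9406', '4726', '1892', '0175', '2321', '9406']

# Duplicates collapse in the set, leaving 5 distinct drawn numbers.
never_drawn = all_numbers - set(drawn)
print(len(never_drawn))  # 9995
```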

## Analysis
### Lucky and unlucky numbers

Interestingly, over the 3-year period, some numbers have appeared more frequently than others.

| Numbers | Drawn occurrences |
|---------|-------------------|
| 0170, 0627, 3192, 5943, 6977 (5 numbers) | 6 |
| 41 numbers | 5 |

If we assume that the game is fair and every number has an equal chance, the only explanation is that these are lucky numbers. Conversely, there are also unlucky numbers: number combinations that have never been drawn in the past 3 years. There are a total of 3,385 unlucky numbers over the 3-year period, which represents **33.85%** of all possible number combinations.
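
As a sanity check on these figures, it may help to estimate what a fair game alone would produce. The sketch below assumes about 3 draws a week for 3 years (an assumption, since the exact number of draws is not stated here), 23 winning numbers per draw, and independent, uniform picks. The Poisson approximation and all names are illustrative.

```python
import math

N = 10_000              # possible 4D numbers
numbers_per_draw = 23   # 3 top prizes + 10 starter + 10 consolation
draws = 3 * 52 * 3      # ASSUMPTION: about 3 draws a week for 3 years

n = draws * numbers_per_draw   # total winning numbers drawn
lam = n / N                    # expected appearances per number

# Probability that a given number is never drawn in n independent uniform picks.
p_never = (1 - 1 / N) ** n

def expected_count(k):
    # Poisson approximation: how many of the N numbers appear exactly k times.
    return N * math.exp(-lam) * lam**k / math.factorial(k)

print(f"never drawn: {p_never:.1%}")
print(f"expected numbers seen 5 times: {expected_count(5):.0f}")
print(f"expected numbers seen 6 times: {expected_count(6):.0f}")
```

Under these assumptions roughly a third of all numbers would remain undrawn, and a handful would appear 5 or 6 times, purely by chance. The observed counts are therefore worth comparing against this baseline before any number is read as lucky or unlucky.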

In [68]:
#total number of possible 4D numbers is 10*10*10*10, since each of the 4 digits is 0-9
all_possible = 10*10*10*10
print(all_possible)

10000


In [82]:
percentage_drawn = round(len(full_data['PrizeNumber'].unique()) / all_possible * 100, 2)
percentage_drawn

66.15

# Machine learning
To identify patterns