# Building a randomised dataset of online digital game purchases

## In this .... we will do....

# Before we begin, let's think about how we want the dataset to look...

#### We want the dataset to have 12 columns *(description of field | source)*: 

- **date** *(date of purchase | python random date between 01/01/2022 and 31/12/2022)*
- **time** *(time of purchase | python random time across 24 hours)*
- **account_id** *(account_id of account holder | python random number between 1000 and 999999)*
- **purchase** *(title of purchased game | game_list.csv)*
- **price** *(price of purchased game | game_list.csv)*
- **version** *(distribution of purchased game | default setting == digital)*
- **reg_postcode** *(registered postcode of account holder | OPENWithPANDAS_UKpostcodeslist.csv)*
- **longitude** *(longitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **latitude** *(latitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **bank_no** *(sort code of account holder's bank | UKBankingSortCodes.csv)*
- **bank_name** *(name of account holder's bank | UKBankingSortCodes.csv)*
- **played_in_24_hours** *(True if game played within 24 hours of purchase | python random boolean)*
- **played_in_48_hours** *(True if game not played in 24 hours but in 48 hours of purchase | python if statement)*

# Set up

In [49]:
import random
import pandas as pd
import numpy as np

# Load in relevant raw data

In [7]:
# Game list | game_list.csv
url='https://drive.google.com/file/d/1V58XAYqIdAn0MdIE1cJxuN2ruhp0dwM_/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
gamelist = pd.read_csv(path)

In [8]:
# UK postcodes list | OPENwithPANDAS_UKpostcodeslist.csv
url='https://drive.google.com/file/d/1mYbtw2uqxvjEXchXmnFZnucOqanseoV5/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
UKpostcodes = pd.read_csv(path)

In [10]:
# Bank numbers list | UKBankingSortCodes.csv
url='https://drive.google.com/file/d/1klZZa1pVTwwnv9PwScgn0TaU3kg2W5fQ/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
banknumbers = pd.read_csv(path)

##### Note: Ensure to check you have the correct raw data

In [21]:
#UKpostcodes.info()
#gamelist.info()
#banknumbers.info()

#UKpostcodes.head()
#gamelist.head()
banknumbers.head()

Unnamed: 0,GENERALSortingCode,GENERALBIC1,GENERALBIC2,GENERALSubBranchSuffix,GENERALShortBranchTitle,GENERALShortNameOwningBank,GENERALFullNameOwningBankLine1,GENERALFullNameOwningBankLine2,GENERALBankCodeOwningBank,GENERALNationalCentralBankCountryCode,...,PRINTAddressLine3,PRINTAddressLine4,PRINTTown,PRINTCounty,PRINTPostcodeField1,PRINTPostcodeField2,PRINTTelephoneArea,PRINTTelephoneNumber,PRINTTelephone2Area,PRINTTelephone2Number
0,90025,ANILJESH,TSY,0,ABBEY NAT (OVERSEAS) W/SALE,ABBEY NAT TY INT LTD,ABBEY NATIONAL TREASURY SERVICES PL,C,641,,...,"Santander Hs, 19-21 Commercial St",St Helier,Jersey,C.I.,JE4,8XG,1534,885000.0,,
1,239285,AIBKGB2L,XXX,0,Customer Treasury Services,AIB GB,ALLIED IRISH BANK (GB),,17,,...,,,London,,EC3A,8AB,20,73093000.0,,
2,300083,ARAYGB22,XXX,0,OP HEADQUARTERS,AL RAYAN BANK PLC,AL RAYAN BANK PLC,,338,,...,,,Birmingham,,B15,1RP,121,4527300.0,,
3,405179,ARNBGB2L,XXX,0,47 SEYMOUR ST LONDON W1A,ARAB NATIONAL BANK,ARAB NATIONAL BANK,,75,,...,,,London,,W1J,7TT,20,72974600.0,,
4,300066,ARBUGB2L,AD1,0,ARBUTHNOT LATHAM & CO LTD,ARBUTHNOT LATHAM&CO,ARBUTHNOT LATHAM AND CO LTD,,103,,...,,,LONDON,,EC2M,2SN,20,70122500.0,,


##### Example: Call a column

In [16]:
gamelist["title"]

0                                       Forspoken
1                              Saints Row PS4&PS5
2                                Ghostwire: Toyko
3                                WWE 2K22 for PS5
4                                 Ghostrunner PS5
5                            ELDEN RING PS4 & PS5
6                                  Cyberpunk 2077
7                                            Sifu
8                Dying Light 2 Stay Human PS4&PS5
9                 Fifa Standard Edition PS4 & PS5
10    Marvel's Spider-Man: Mile Morales PS4 & PS5
11                        Borderlands 3 PS4 & PS5
12       Back 4 Blood: Standard Edition PS4 & PS5
13                     Battlefield 2042 PS4 & PS5
14                   Watch Dogs: Legion PS4 & PS5
15               Crash Bandicoot: It's About Time
16                       NBA 2K21 Next Generation
17         Life is Strange: True Colors PS4 & PS5
18                Resident Evil Village PS4 & PS5
19                      Madden NFL 2K21 PS4 & PS5


## Turn selected columns into lists ()

In [24]:
# from gamelist | game_list.csv
gamelist_purchase = gamelist["title"].to_list()
gamelist_price = gamelist["price"].to_list()


# from UKpostcodes | OPENwithPANDAS_UKpostcodeslist.csv

UKpostcodes_reg_postcode = UKpostcodes["Postcode"].to_list()
UKpostcodes_longitude = UKpostcodes["Latitude"].to_list()
UKpostcodes_latitude = UKpostcodes["Longitude"].to_list()


# from banknumbers | UKBankingSortCodes.csv
banknumbers_bank_no = banknumbers["GENERALSortingCode"].to_list()
banknumbers_bank_name = banknumbers["GENERALFullNameOwningBankLine1"].to_list()


##### Note: You can check the data type to make sure

In [31]:
type(banknumbers_bank_name)

list

### There should not be any NaNs (null values, blanks etc.) in our lists, but we should go to remove them just in case there are

In [32]:
# for gamelist lists
gamelist_purchase = [x for x in gamelist_purchase if pd.isnull(x) == False]
gamelist_price = [x for x in gamelist_price if pd.isnull(x) == False]


# for UKpostcodes lists

UKpostcodes_reg_postcode = [x for x in UKpostcodes_reg_postcode if pd.isnull(x) == False]
UKpostcodes_longitude = [x for x in UKpostcodes_longitude if pd.isnull(x) == False]
UKpostcodes_latitude = [x for x in UKpostcodes_latitude if pd.isnull(x) == False]


# fom banknumbers lists
banknumbers_bank_no = [x for x in banknumbers_bank_no if pd.isnull(x) == False]
banknumbers_bank_name = [x for x in banknumbers_bank_name if pd.isnull(x) == False]

### Since we have prepped the raw data, we can now use the function to produce the randomised game sales dataset

In [134]:
def random_online_digital_game_purchases(num):
    
    '''
    function to pull a random dataset of online digital game purchases
    
    '''
    # 
    def randomTime():
        # generate random number scaled to number of seconds in a day
        # (24*60*60) = 86,400

        rtime = int(random.random()*86400)

        hours   = int(rtime/3600)
        minutes = int((rtime - hours*3600)/60)
        seconds = rtime - hours*3600 - minutes*60

        time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)

        return time_string

    def rand_time_list(num):
        timelist = []
        for num in range(0,num):
            timelist.append(randomTime())
        return timelist   
    
    rand_time_randomtime = rand_time_list(num)
    
    rand_account_id = list(np.random.randint(1000, 99999, size=num))
    
    rand_gamelist_purchase = np.random.choice(gamelist_purchase, size=num)
    rand_gamelist_price = np.random.choice(gamelist_price, size=num)
    
    rand_UKpostcodes_reg_postcode = np.random.choice(UKpostcodes_reg_postcode, size=num)
    rand_UKpostcodes_longitude = np.random.choice(UKpostcodes_longitude, size=num)
    rand_UKpostcodes_latitude = np.random.choice(UKpostcodes_latitude, size=num)
    
    rand_banknumbers_bank_no = np.random.choice(banknumbers_bank_no, size=num)
    rand_banknumbers_bank_name = np.random.choice(banknumbers_bank_name, size=num)
   
    rand_played_in_24_hours = bool(random.randint(0,1))

    # Turning NoneTypes from variables above into columns of a single dataframe
    # this approach is using the dictionary method
    # {"name of column": NoneType variable, *repeat this for desired amount of columns*}
    d = {
        "date": "test",
        "time": rand_time_randomtime,
        "account_id": rand_account_id,
        "purchase": rand_gamelist_purchase,
        "price": rand_gamelist_price,
        "version": "Digital",
        "reg_postcode": rand_UKpostcodes_reg_postcode,
        "longitude": rand_UKpostcodes_longitude,
        "latitude": rand_UKpostcodes_latitude,
        "bank_no": rand_banknumbers_bank_no,
        "bank_name": rand_banknumbers_bank_name,
        "played_in_24_hours": rand_played_in_24_hours,
        "played_in_48_hours": "test",
    }

    new_dataframe = pd.DataFrame(d)
    
    
    #### thoughts
    #### certain columns needs to be consistent with the column to the left of it
    #### i.e. if X game is 29.99, the price column needs to show 29.99, not some random number
    #### a way to approach this is to cast the column which is conditional to "null" when creating
    #### the dataframe and then after the dataframe is created, run a .apply(lambda) function
    #### on the column, with an if statement that this column == the column to the left's price etc.
    #### then return the dataframe - this may work - I should try it out
    
    
    
    return new_dataframe

In [132]:
# Test place

def randomTime():
    # generate random number scaled to number of seconds in a day
    # (24*60*60) = 86,400
    
    rtime = int(random.random()*86400)

    hours   = int(rtime/3600)
    minutes = int((rtime - hours*3600)/60)
    seconds = rtime - hours*3600 - minutes*60

    time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)
        
    return time_string


def rand_time_list(num):
    timelist = []
    for num in range(0,num):
        timelist.append(randomTime())
    return timelist

rand_time_list(10)
    

['15:03:16',
 '12:35:10',
 '11:17:22',
 '14:48:27',
 '05:52:21',
 '21:33:29',
 '12:01:04',
 '06:21:41',
 '07:00:47',
 '02:28:07']

# To Do: Focus on turning the other columns into lists like the time column, then you can start to add logic to the dataset

In [118]:
# Test place

import datetime

start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 2, 1)

time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
random_number_of_days = random.randrange(days_between_dates)
random_date = start_date + datetime.timedelta(days=random_number_of_days)

print(random_date)

2020-01-30


- **date** *(date of purchase | python random date between 01/01/2022 and 31/12/2022)*
- **time** *(time of purchase | python random time across 24 hours)*
- **account_id** *(account_id of account holder | python random number between 1000 and 999999)*
- **purchase** *(title of purchased game | game_list.csv)*
- **price** *(price of purchased game | game_list.csv)*
- **version** *(distribution of purchased game | default setting == digital)*
- **reg_postcode** *(registered postcode of account holder | OPENWithPANDAS_UKpostcodeslist.csv)*
- **longitude** *(longitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **latitude** *(latitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **bank_no** *(sort code of account holder's bank | UKBankingSortCodes.csv)*
- **bank_name** *(name of account holder's bank | UKBankingSortCodes.csv)*
- **played_in_24_hours** *(True if game played within 24 hours of purchase | python random boolean)*
- **played_in_48_hours** *(True if game not played in 24 hours but in 48 hours of purchase | python if statement)*

In [135]:
random_online_digital_game_purchases(40)

Unnamed: 0,date,time,account_id,purchase,price,version,reg_postcode,longitude,latitude,bank_no,bank_name,played_in_24_hours,played_in_48_hours
0,test,16:58:40,34096,ELDEN RING PS4 & PS5,39.99,Digital,SA15 3NP,53.755435,0.085266,609588,CLEAR JUNCTION LIMITED,False,test
1,test,02:21:10,63313,NBA 2K21 Next Generation,64.99,Digital,KT5 8YF,51.557545,-2.014062,200052,Heritable Bank Limited,False,test
2,test,19:17:47,94126,DEATHLOOP,69.99,Digital,DG9 7PQ,57.507578,-2.642853,301561,SILICON VALLEY BANK,False,test
3,test,12:11:40,12337,Sifu,64.99,Digital,MK9 1AW,54.800203,-1.150159,405056,PROJECT IMAGINE LTD,False,test
4,test,04:30:40,74175,Battlefield 2042 PS4 & PS5,69.99,Digital,B47 5DE,54.928015,-0.139744,165562,HBL BANK UK LIMITED T/A HBL BANK UK,False,test
5,test,11:35:36,83642,Ghostrunner PS5,59.99,Digital,NE26 3SJ,53.829616,-0.15865,41300,WEATHERBYS BANK LTD,False,test
6,test,15:49:54,44584,Cyberpunk 2077,59.99,Digital,WR14 3NT,53.55534,-0.467889,405130,SAINSBURY'S BANK PLC,False,test
7,test,13:29:46,76971,Sifu,59.99,Digital,NR34 9NP,53.389373,-1.772412,950001,BANK OF CHINA (UK) LTD,False,test
8,test,21:38:34,98668,Horizon Forbidden West,64.99,Digital,B31 5SL,51.817366,-0.097945,300059,AIB GROUP (UK) PLC (TRADING NAME FI,False,test
9,test,00:29:10,4747,Forspoken,69.99,Digital,LD3 8SD,51.139991,-1.236641,231618,J P Morgan Europe Ltd,False,test


# Saving the dataset as CSV

In [135]:
random_game_sales_dataset(50).to_csv("example_random_game_sales_dataset.csv", index=False)

In [136]:
df_saved_file = pd.read_csv("example_random_game_sales_dataset.csv")
df_saved_file.head()

Unnamed: 0,purchase_location,game_version,purchase_date
0,AB9,digital,12/02/2022
1,EF6,digital,17/02/2022
2,QWF6,physical,03/02/2022
3,EF6,digital,17/02/2022
4,FG8,physical,17/02/2022
