# Building a randomised dataset of online digital game purchases

## In this .... we will do....

# Before we begin, let's think about how we want the dataset to look...

#### We want the dataset to have 12 columns *(description of field | source)*: 

- **date** *(date of purchase | python random date between 01/01/2022 and 31/12/2022)*
- **time** *(time of purchase | python random time across 24 hours)*
- **account_id** *(account_id of account holder | python random number between 1000 and 999999)*
- **purchase** *(title of purchased game | game_list.csv)*
- **price** *(price of purchased game | game_list.csv)*
- **version** *(distribution of purchased game | default setting == digital)*
- **reg_postcode** *(registered postcode of account holder | OPENWithPANDAS_UKpostcodeslist.csv)*
- **longitude** *(longitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **latitude** *(latitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **bank_no** *(sort code of account holder's bank | UKBankingSortCodes.csv)*
- **bank_name** *(name of account holder's bank | UKBankingSortCodes.csv)*
- **played_in_24_hours** *(True if game played within 24 hours of purchase | python random boolean)*
- **played_in_48_hours** *(True if game not played in 24 hours but in 48 hours of purchase | python if statement)*

# Set up

In [58]:
import random
import datetime
import pandas as pd
import numpy as np

# Load in relevant raw data

In [59]:
# Game list | game_list.csv
url='https://drive.google.com/file/d/1V58XAYqIdAn0MdIE1cJxuN2ruhp0dwM_/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
gamelist = pd.read_csv(path)

In [60]:
# UK postcodes list | OPENwithPANDAS_UKpostcodeslist.csv
url='https://drive.google.com/file/d/1mYbtw2uqxvjEXchXmnFZnucOqanseoV5/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
UKpostcodes = pd.read_csv(path)

In [61]:
# Bank numbers list | UKBankingSortCodes.csv
url='https://drive.google.com/file/d/1klZZa1pVTwwnv9PwScgn0TaU3kg2W5fQ/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
banknumbers = pd.read_csv(path)

##### Note: Ensure to check you have the correct raw data

In [62]:
#UKpostcodes.info()
#gamelist.info()
#banknumbers.info()

#UKpostcodes.head()
#gamelist.head()
banknumbers.head()

Unnamed: 0,GENERALSortingCode,GENERALBIC1,GENERALBIC2,GENERALSubBranchSuffix,GENERALShortBranchTitle,GENERALShortNameOwningBank,GENERALFullNameOwningBankLine1,GENERALFullNameOwningBankLine2,GENERALBankCodeOwningBank,GENERALNationalCentralBankCountryCode,...,PRINTAddressLine3,PRINTAddressLine4,PRINTTown,PRINTCounty,PRINTPostcodeField1,PRINTPostcodeField2,PRINTTelephoneArea,PRINTTelephoneNumber,PRINTTelephone2Area,PRINTTelephone2Number
0,90025,ANILJESH,TSY,0,ABBEY NAT (OVERSEAS) W/SALE,ABBEY NAT TY INT LTD,ABBEY NATIONAL TREASURY SERVICES PL,C,641,,...,"Santander Hs, 19-21 Commercial St",St Helier,Jersey,C.I.,JE4,8XG,1534,885000.0,,
1,239285,AIBKGB2L,XXX,0,Customer Treasury Services,AIB GB,ALLIED IRISH BANK (GB),,17,,...,,,London,,EC3A,8AB,20,73093000.0,,
2,300083,ARAYGB22,XXX,0,OP HEADQUARTERS,AL RAYAN BANK PLC,AL RAYAN BANK PLC,,338,,...,,,Birmingham,,B15,1RP,121,4527300.0,,
3,405179,ARNBGB2L,XXX,0,47 SEYMOUR ST LONDON W1A,ARAB NATIONAL BANK,ARAB NATIONAL BANK,,75,,...,,,London,,W1J,7TT,20,72974600.0,,
4,300066,ARBUGB2L,AD1,0,ARBUTHNOT LATHAM & CO LTD,ARBUTHNOT LATHAM&CO,ARBUTHNOT LATHAM AND CO LTD,,103,,...,,,LONDON,,EC2M,2SN,20,70122500.0,,


##### Example: Call a column

In [63]:
gamelist["title"]

0                                       Forspoken
1                              Saints Row PS4&PS5
2                                Ghostwire: Toyko
3                                WWE 2K22 for PS5
4                                 Ghostrunner PS5
5                            ELDEN RING PS4 & PS5
6                                  Cyberpunk 2077
7                                            Sifu
8                Dying Light 2 Stay Human PS4&PS5
9                 Fifa Standard Edition PS4 & PS5
10    Marvel's Spider-Man: Mile Morales PS4 & PS5
11                        Borderlands 3 PS4 & PS5
12       Back 4 Blood: Standard Edition PS4 & PS5
13                     Battlefield 2042 PS4 & PS5
14                   Watch Dogs: Legion PS4 & PS5
15               Crash Bandicoot: It's About Time
16                       NBA 2K21 Next Generation
17         Life is Strange: True Colors PS4 & PS5
18                Resident Evil Village PS4 & PS5
19                      Madden NFL 2K21 PS4 & PS5


## Turn selected columns into lists ()

In [64]:
# from gamelist | game_list.csv
gamelist_purchase = gamelist["title"].to_list()
gamelist_price = gamelist["price"].to_list()


# from UKpostcodes | OPENwithPANDAS_UKpostcodeslist.csv

UKpostcodes_reg_postcode = UKpostcodes["Postcode"].to_list()
UKpostcodes_longitude = UKpostcodes["Latitude"].to_list()
UKpostcodes_latitude = UKpostcodes["Longitude"].to_list()


# from banknumbers | UKBankingSortCodes.csv
banknumbers_bank_no = banknumbers["GENERALSortingCode"].to_list()
banknumbers_bank_name = banknumbers["GENERALFullNameOwningBankLine1"].to_list()


##### Note: You can check the data type to make sure

In [65]:
type(banknumbers_bank_name)

list

### There should not be any NaNs (null values, blanks etc.) in our lists, but we should go to remove them just in case there are

In [66]:
# for gamelist lists
gamelist_purchase = [x for x in gamelist_purchase if pd.isnull(x) == False]
gamelist_price = [x for x in gamelist_price if pd.isnull(x) == False]


# for UKpostcodes lists

UKpostcodes_reg_postcode = [x for x in UKpostcodes_reg_postcode if pd.isnull(x) == False]
UKpostcodes_longitude = [x for x in UKpostcodes_longitude if pd.isnull(x) == False]
UKpostcodes_latitude = [x for x in UKpostcodes_latitude if pd.isnull(x) == False]


# fom banknumbers lists
banknumbers_bank_no = [x for x in banknumbers_bank_no if pd.isnull(x) == False]
banknumbers_bank_name = [x for x in banknumbers_bank_name if pd.isnull(x) == False]

### Since we have prepped the raw data, we can now use the function to produce the randomised game sales dataset

In [100]:
def random_online_digital_game_purchases(num):
    
    '''
    function to pull a random dataset of online digital game purchases
    
    '''
    # ---------------------------------------------------------------------------------------------------
    
    # for 'date' column
    def randomDate():
        start_date = datetime.date(2022, 2, 1) # set the start date of purchases for the report
        end_date = datetime.date(2022, 3, 1) # set the end date of purchases for the report

        time_between_dates = end_date - start_date
        days_between_dates = time_between_dates.days
        random_number_of_days = random.randrange(days_between_dates)
        random_date = start_date + datetime.timedelta(days=random_number_of_days)

        return random_date.isoformat()

    def rand_date_list(num):
        datelist = []
        for num in range(0,num):
            datelist.append(randomDate())
        return datelist

    rand_date_randomdate = rand_date_list(num)
    
    # ---------------------------------------------------------------------------------------------------
      
    # for 'time' column
    def randomTime():
        # generate random number scaled to number of seconds in a day
        # (24*60*60) = 86,400

        rtime = int(random.random()*86400)

        hours   = int(rtime/3600)
        minutes = int((rtime - hours*3600)/60)
        seconds = rtime - hours*3600 - minutes*60

        time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)

        return time_string

    def rand_time_list(num):
        timelist = []
        for num in range(0,num):
            timelist.append(randomTime())
        return timelist   
    
    rand_time_randomtime = rand_time_list(num)
    
    # ---------------------------------------------------------------------------------------------------
    
    # for 'account_id' column
    rand_account_id = list(np.random.randint(1000, 99999, size=num))
    
    # ---------------------------------------------------------------------------------------------------
    
    # for 'purchase' and 'price' columns
    rand_gamelist_purchase = np.random.choice(gamelist_purchase, size=num)
    rand_gamelist_price = np.random.choice(gamelist_price, size=num)
    
    # ---------------------------------------------------------------------------------------------------
    
    # for 'reg_postcode', 'longitude' and 'latitude' columns
    rand_UKpostcodes_reg_postcode = np.random.choice(UKpostcodes_reg_postcode, size=num)
    rand_UKpostcodes_longitude = np.random.choice(UKpostcodes_longitude, size=num)
    rand_UKpostcodes_latitude = np.random.choice(UKpostcodes_latitude, size=num)
    
    # ---------------------------------------------------------------------------------------------------
    
    # for 'bank_no' and 'bank_name' columns
    rand_banknumbers_bank_no = np.random.choice(banknumbers_bank_no, size=num)
    rand_banknumbers_bank_name = np.random.choice(banknumbers_bank_name, size=num)
    
    # ---------------------------------------------------------------------------------------------------
    
    # for 'played_in_24_hours' column
    def rand_bool(num):
        bool_list = []
        for num in range(0, num):
            bool_list.append(random.randint(0,1))
        return bool_list
    
    rand_played_in_24_hours = rand_bool(num)
    
    # ---------------------------------------------------------------------------------------------------

    # Turning NoneTypes from variables above into columns of a single dataframe
    # this approach is using the dictionary method
    # {"name of column": NoneType variable, *repeat this for desired amount of columns*}
    d = {
        "date": rand_date_randomdate,
        "time": rand_time_randomtime,
        "account_id": rand_account_id,
        "purchase": rand_gamelist_purchase,
        "price": rand_gamelist_price,
        "version": "digital",
        "reg_postcode": rand_UKpostcodes_reg_postcode,
        "longitude": rand_UKpostcodes_longitude,
        "latitude": rand_UKpostcodes_latitude,
        "bank_no": rand_banknumbers_bank_no,
        "bank_name": rand_banknumbers_bank_name,
        "played_in_24_hours": rand_played_in_24_hours,
    }

    new_dataframe = pd.DataFrame(d)
    
    new_dataframe['played_in_48_hours'] = new_dataframe['played_in_24_hours'].apply(lambda x: random.randint(0,1) if x == 0 else 1)
    
    
    #### thoughts
    #### certain columns needs to be consistent with the column to the left of it
    #### i.e. if X game is 29.99, the price column needs to show 29.99, not some random number
    #### a way to approach this is to cast the column which is conditional to "null" when creating
    #### the dataframe and then after the dataframe is created, run a .apply(lambda) function
    #### on the column, with an if statement that this column == the column to the left's price etc.
    #### then return the dataframe - this may work - I should try it out
    
    
    
    return new_dataframe

In [101]:
random_online_digital_game_purchases(4)

Unnamed: 0,date,time,account_id,purchase,price,version,reg_postcode,longitude,latitude,bank_no,bank_name,played_in_24_hours,played_in_48_hours
0,2022-02-04,15:38:55,71626,Ghostrunner PS5,69.99,digital,PE22 9LW,51.262046,-3.729349,608376,BILDERLINGS PAY LTD,0,1
1,2022-02-05,21:54:46,48325,Ghostrunner PS5,64.99,digital,ME17 1UE,51.745707,-1.540074,301275,CYNERGY BANK LIMITED,1,1
2,2022-02-02,07:40:22,4299,Saints Row PS4&PS5,39.99,digital,NW2 7JX,51.616438,-2.01352,980000,TSB BANK PLC,0,1
3,2022-02-06,01:13:22,1125,ELDEN RING PS4 & PS5,69.99,digital,TS26 0AP,52.436135,-1.464295,609280,MIDPOINT & TRANSFER LTD,0,1


In [102]:
# Test place

    

- **date** *(date of purchase | python random date between 01/01/2022 and 31/12/2022)*
- **time** *(time of purchase | python random time across 24 hours)*
- **account_id** *(account_id of account holder | python random number between 1000 and 999999)*
- **purchase** *(title of purchased game | game_list.csv)*
- **price** *(price of purchased game | game_list.csv)*
- **version** *(distribution of purchased game | default setting == digital)*
- **reg_postcode** *(registered postcode of account holder | OPENWithPANDAS_UKpostcodeslist.csv)*
- **longitude** *(longitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **latitude** *(latitude of registered postcode of account holder | OPENwithPANDAS_UKpostcodeslist.csv)*
- **bank_no** *(sort code of account holder's bank | UKBankingSortCodes.csv)*
- **bank_name** *(name of account holder's bank | UKBankingSortCodes.csv)*
- **played_in_24_hours** *(True if game played within 24 hours of purchase | python random boolean)*
- **played_in_48_hours** *(True if game not played in 24 hours but in 48 hours of purchase | python if statement)*

# Saving the dataset as CSV

In [71]:
random_game_sales_dataset(50).to_csv("example_random_game_sales_dataset.csv", index=False)

NameError: name 'random_game_sales_dataset' is not defined

In [None]:
df_saved_file = pd.read_csv("example_random_game_sales_dataset.csv")
df_saved_file.head()