In [1]:
# Load libraries
import os
import sys

# Make the project root importable so that `scripts` can be found
sys.path.append("../")
from scripts.etl_pipeline import *

In [2]:
# Create a Spark Session
spark = (
    SparkSession.builder.appName("ETL Pipeline")
    .config("spark.sql.repl.eagerEval.enabled", True)
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

24/10/16 14:48:12 WARN Utils: Your hostname, DESKTOP-H6V94HM resolves to a loopback address: 127.0.1.1; using 192.168.0.220 instead (on interface eth0)
24/10/16 14:48:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/16 14:48:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Extract

## BNPL Dataset

As we can't use `urlretrieve` to get the data from Canvas, please download it to your local machine and move it to `data/tables`. Then run the code below to unzip the files.

In [3]:
# Assign data path
raw_path = "../data"

# Unzip files (Only run once)
"""
for file in os.listdir(f"{raw_path}/tables"):
    if file == ".gitkeep":
        continue
    with zipfile.ZipFile(f"{raw_path}/tables/{file}", "r") as zip_ref:
        zip_ref.extractall(f"{raw_path}/")
    os.remove(f"{raw_path}/tables/{file}")
"""


## External Dataset

Our external dataset includes socio-economic and demographic data at different levels of granularity, such as state and Local Government Area (LGA), collected from the Australian Bureau of Statistics (ABS). Since an LGA is a region that covers multiple postcodes and suburbs, we use a dataset that maps each postcode to an LGA code. The website doesn't allow `urlretrieve`, so please download the file manually by clicking [here](https://www.matthewproctor.com/Content/postcodes/australian_postcodes.csv) and save it as `data/raw/australian_postcodes.csv`.

We will also use the ABS's **Personal Income in Australia** data for each LGA.

In [4]:
INCOME_DATA_URL = "https://www.abs.gov.au/statistics/labour/earnings-and-working-conditions/personal-income-australia/2020-21-financial-year/Table%202%20-%20Total%20income%20distribution%20by%20geography%2C%202020-21.xlsx"

urlretrieve(INCOME_DATA_URL, f"{raw_path}/raw/income_data.xlsx")

('../data/raw/income_data.xlsx', <http.client.HTTPMessage at 0x7fad1eff3d60>)

Next, we get the ABS's **Personal Fraud** statistics, which include card fraud, identity theft, and scams.

In [5]:
PERSONAL_FRAUD_DATA_URL = "https://www.abs.gov.au/statistics/people/crime-and-justice/personal-fraud/2022-23/Personal%20Fraud%20%28Tables%201a%20to%2014b%29.xlsx"

urlretrieve(PERSONAL_FRAUD_DATA_URL, f"{raw_path}/raw/personal_fraud.xlsx")

('../data/raw/personal_fraud.xlsx',
 <http.client.HTTPMessage at 0x7fad1f1b75b0>)

# Transform

## External Dataset

For the postcode-to-LGA mapping data, we only select the columns that are necessary for the mapping. Some entries have a missing LGA code. To resolve this, we find the closest postcode, by latitude and longitude, that has a valid LGA code and impute the missing value with that code. In other words, we use K-Nearest Neighbour with `k=1`. The reason is that the data is already sorted (clustered) by state, so we only need to select the closest postcode with a valid LGA code to the target.
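The nearest-neighbour idea can be sketched without any libraries. This is a minimal illustration, not the actual `clean_postcode_lga_mapping` helper (whose implementation is not shown here): the function names, the use of haversine distance, and the third toy row (postcode `9999`) are all assumptions made for the example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, long) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impute_lga_nearest(rows):
    """Fill a missing `lgacode` with the code of the closest row that has one (k=1)."""
    known = [r for r in rows if r["lgacode"] is not None]
    filled = []
    for r in rows:
        if r["lgacode"] is None:
            nearest = min(
                known,
                key=lambda k: haversine_km(r["lat"], r["long"], k["lat"], k["long"]),
            )
            r = {**r, "lgacode": nearest["lgacode"]}
        filled.append(r)
    return filled

# Toy rows: the third postcode and its coordinates are hypothetical
rows = [
    {"postcode": 800, "lat": -12.458684, "long": 130.836680, "lgacode": 71150},
    {"postcode": 9726, "lat": -28.008783, "long": 153.412197, "lgacode": 33430},
    {"postcode": 9999, "lat": -27.9, "long": 153.3, "lgacode": None},
]
filled = impute_lga_nearest(rows)
print(filled[2]["lgacode"])
```

The hypothetical postcode sits far closer to the Queensland row than to the Northern Territory one, so it inherits the Queensland LGA code. Rows that already have a code are left unchanged.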


In [6]:
postcode_lga_map = pd.read_csv("../data/raw/australian_postcodes.csv")

# Clean the data
postcode_lga_map = clean_postcode_lga_mapping(postcode_lga_map)

# Display the data
postcode_lga_map

Unnamed: 0,postcode,state,long,lat,lgacode
0,200,ACT,149.119000,-35.277700,89399
2,800,NT,130.836680,-12.458684,71150
4,801,NT,130.836680,-12.458684,71000
5,803,NT,130.745908,-12.433991,71000
6,804,NT,130.873315,-12.428017,71000
...,...,...,...,...,...
18540,9013,QLD,152.823141,-27.603479,31000
18541,9015,QLD,152.823141,-27.603479,31000
18542,9464,QLD,153.074982,-27.397055,31000
18543,9726,QLD,153.412197,-28.008783,33430


For the data on median/mean income and age of earners, the source is an Excel workbook, so we did extensive work renaming the columns and selecting the rows that contain the figures we want.

Some LGA codes in Western Australia have no values for any of the features we plan to use. Thus, we impute the median age, median income and mean income using the state's values.

For the number of earners, we notice a discrepancy in the totals. The number of earners reported for Western Australia as a whole is 1,585,093, whereas summing the number of earners across all LGA codes in Western Australia gives 1,585,061. This is a difference of 32 earners. We're not entirely sure why this discrepancy occurs, but we split this number in half between the 2 missing LGA codes. Although this is not a sound solution, we believe that imputing such a number for only 2 entries doesn't significantly affect the fraud probability prediction, and we're not even sure that any consumers live in these LGA codes.
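The arithmetic behind this imputation is small enough to check directly. The sketch below takes the LGA-level sum to be the state total minus the stated gap of 32 earners; the variable names are illustrative and are not part of `preprocess_income_df`.

```python
state_total = 1_585_093  # earners reported for Western Australia as a whole
lga_sum = 1_585_061      # earners summed over the WA LGA codes that have values
n_missing = 2            # WA LGA codes with no reported values

residual = state_total - lga_sum   # earners not attributed to any LGA
per_lga = residual // n_missing    # equal split across the missing LGA codes
print(residual, per_lga)
```

Each missing LGA code therefore receives 16 earners, which is negligible next to the state total.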

In [7]:
income_df = pd.read_excel("../data/raw/income_data.xlsx", sheet_name = "Table 2.5", skiprows = 5, skipfooter = 5)

income_df = preprocess_income_df(income_df)
income_df.head(10)

Unnamed: 0,lga,median_age,median_income,mean_income
0,10050,43,53392,62395
1,10180,43,48837,58514
2,10250,48,47527,60261
3,10300,43,49408,53342
4,10470,43,55986,64364
5,10500,38,55000,67148
6,10550,51,43532,51652
7,10600,50,41711,51346
8,10650,49,44099,50045
9,10750,39,58934,65892


Merging the income data with the postcode data.

In [8]:
print(f"Shape of postcode data before the merge: {postcode_lga_map.shape}")
postcode_info = postcode_lga_map.merge(income_df, left_on='lgacode', right_on='lga',
                                       how = "left" )
print(f"Shape of postcode data after the merge: {postcode_info.shape}")

Shape of postcode data before the merge: (3175, 5)
Shape of postcode data after the merge: (3175, 9)


We also need to check whether any postcode has an LGA code that is not in the income data.
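The logic of this check is a left join followed by a search for rows whose income fields came back empty. Here is a dependency-free sketch on toy data; postcode `1111` and its pairing with LGA `10050` are hypothetical, and the notebook itself performs the check with pandas.

```python
# Toy data: postcode 1111 and its pairing with LGA 10050 are hypothetical
postcode_rows = [
    {"postcode": 2540, "lgacode": 99399},
    {"postcode": 2899, "lgacode": 99399},
    {"postcode": 1111, "lgacode": 10050},
]
income_by_lga = {
    10050: {"median_age": 43, "median_income": 53392, "mean_income": 62395},
}

# Left join: every postcode row is kept; income fields are None when the LGA is absent
empty = {"median_age": None, "median_income": None, "mean_income": None}
joined = [{**row, **income_by_lga.get(row["lgacode"], empty)} for row in postcode_rows]

# Postcodes whose LGA code found no match in the income data
unmatched = [row["postcode"] for row in joined if row["median_income"] is None]
print(unmatched)
```

A left join never drops rows from the left table, which is why the row count is unchanged by the merge and unmatched keys have to be found by looking for nulls instead.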

In [9]:
postcode_info[postcode_info.lga.isnull()]

Unnamed: 0,postcode,state,long,lat,lgacode,lga,median_age,median_income,mean_income
734,2540,NSW,150.5789,-34.8282,99399,,,,
979,2899,NSW,146.928783,-36.084231,99399,,,,


We can see that 2 postcodes have an LGA code that does not appear in the income data. We follow the same K-Nearest Neighbour approach as before to impute these values.

In [10]:
postcode_info = impute_income_metrics(postcode_info)
postcode_info = postcode_info.drop(columns='lga')

The ABS's **Personal Fraud** data has multiple tables, of which we use Table 4a and Table 4b, as they contain the percentage of individuals who experienced personal fraud and the corresponding Relative Standard Error (RSE) for each state, respectively. We use the rate from 2021 to 2022 to match the date range of our transaction data.

In [11]:
pf_df = process_fp_data(path="../data/raw/personal_fraud.xlsx")
pf_df

Unnamed: 0,state,victimisation_rate,rse_percent
0,NSW,10.7,4.3
1,VIC,11.2,5.5
2,QLD,10.4,5.3
3,SA,10.1,6.0
4,WA,11.3,7.1
5,TAS,9.8,6.7
6,NT,10.0,13.7
7,ACT,11.7,9.0

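To make the RSE column easier to read: the relative standard error is the standard error expressed as a percentage of the estimate, so it can be converted back to percentage points. The sketch below does this for the NSW row; the 1.96 multiplier for an approximate 95% interval is a standard normal approximation and is not something this notebook computes.

```python
rate = 10.7        # NSW victimisation rate (%)
rse_percent = 4.3  # NSW relative standard error (%)

# RSE = standard error / estimate * 100, so invert it to get the standard error
standard_error = rate * rse_percent / 100

# Approximate 95% interval under a normal approximation (an assumption of this sketch)
margin = 1.96 * standard_error
low, high = rate - margin, rate + margin
print(f"SE = {standard_error:.2f} pp, approx. 95% interval = ({low:.1f}, {high:.1f})")
```

States with a high RSE, such as NT at 13.7%, have correspondingly wider intervals, which is worth keeping in mind if the victimisation rate is used as a feature.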

## Main Dataset

The system uses `user_id` as the key for identifying customers in the transaction records and fraud probability tables. However, there is also a key-value map from `user_id` to `consumer_id`. We will use `consumer_id` as the only customer ID. Thus, we map `user_id` in each table to `consumer_id` and drop the former.
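Conceptually, the ID replacement is an inner join on `user_id` followed by dropping that column. The plain-Python sketch below illustrates the semantics only; `replace_id_sketch` is a hypothetical name, and the real `replace_id` helper works on Spark DataFrames and is not shown here.

```python
def replace_id_sketch(id_map, rows):
    """Inner-join rows to a user_id -> consumer_id map, then drop user_id."""
    out = []
    for row in rows:
        if row["user_id"] in id_map:  # inner join: unmapped users are dropped
            new_row = {k: v for k, v in row.items() if k != "user_id"}
            new_row["consumer_id"] = id_map[row["user_id"]]
            out.append(new_row)
    return out

# Toy data in the shape of the tables used in this notebook
id_map = {1: 1195503, 2: 179208}
fraud_rows = [
    {"user_id": 1, "order_datetime": "2022-02-20", "fraud_probability": 9.81},
    {"user_id": 2, "order_datetime": "2021-08-30", "fraud_probability": 9.60},
]
mapped = replace_id_sketch(id_map, fraud_rows)
print(len(fraud_rows), len(mapped))
```

Because an inner join silently drops rows with no match in the map, comparing the row count before and after the join is a cheap way to confirm that every `user_id` has a `consumer_id`.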

In [12]:
# Load consumer user details -> a key:value map for user_id to consumer_id
consumer_user_map = spark.read.parquet(f"{raw_path}/tables/consumer_user_details.parquet")
consumer_user_map.limit(5) # Preview

                                                                                

user_id,consumer_id
1,1195503
2,179208
3,1194530
4,154128
5,712975


In [13]:
# Load consumer fraud rate dataset
consumer_fraud_rate = spark.read.csv(f"{raw_path}/tables/consumer_fraud_probability.csv", header=True, inferSchema=True)
consumer_fraud_rate.show(5)
get_dataset_count(consumer_fraud_rate)

# Replace all user_id with unique consumer_id
consumer_fraud_rate = replace_id(consumer_user_map, consumer_fraud_rate)
consumer_fraud_rate.show(5)

# Check to make sure no rows were lost on the inner join
get_dataset_count(consumer_fraud_rate)

+-------+--------------+-----------------+
|user_id|order_datetime|fraud_probability|
+-------+--------------+-----------------+
|   6228|    2021-12-19| 97.6298077657765|
|  21419|    2021-12-10|99.24738020302328|
|   5606|    2021-10-17|84.05825045251777|
|   3101|    2021-04-17|91.42192091901347|
|  22239|    2021-10-19|94.70342477508035|
+-------+--------------+-----------------+
only showing top 5 rows

The dataset count is  34864
+--------------+------------------+-----------+
|order_datetime| fraud_probability|consumer_id|
+--------------+------------------+-----------+
|    2022-02-20| 9.805431136520959|    1195503|
|    2021-08-30| 9.599513915425788|     179208|
|    2021-09-25|10.069850934775245|     179208|
|    2021-11-03| 8.300636455314633|    1194530|
|    2021-10-09| 9.633302411090419|     154128|
+--------------+------------------+-----------+
only showing top 5 rows

The dataset count is  34864


We see that there is no change in the number of entries upon an inner join.

In [15]:
consumer_fraud_rate.groupBy('consumer_id').agg(F.avg("fraud_probability").alias('avg_fp')).count()

20128

In [14]:
# Load all the transaction data 
transaction_p1 = spark.read.parquet(f"{raw_path}/tables/transactions_20210228_20210827_snapshot")
transaction_p2 = spark.read.parquet(f"{raw_path}/tables/transactions_20210828_20220227_snapshot")
transaction_p3 = spark.read.parquet(f"{raw_path}/tables/transactions_20220228_20220828_snapshot")

# Combine the datasets
transaction_records = reduce(DataFrame.unionAll, [transaction_p1, transaction_p2, transaction_p3])
get_dataset_count(transaction_records)

# Replace user_id with consumer_id after combining
transaction_records = replace_id(consumer_user_map, transaction_records)

# Check to make sure no rows were lost on the inner join
get_dataset_count(transaction_records)

                                                                                

The dataset count is  14195505




The dataset count is  14195505


                                                                                

## Cleaning

Now that `user_id` has been replaced with `consumer_id`, we will load the remaining data and clean everything together. We start by dropping duplicates from the consumer fraud probability data.

In [15]:
print(f"Number of entries before dropping duplicates: {consumer_fraud_rate.count()}")
consumer_fraud_rate = consumer_fraud_rate.dropDuplicates()
print(f"Number of entries after dropping duplicates: {consumer_fraud_rate.count()}")

Number of entries before dropping duplicates: 34864
Number of entries after dropping duplicates: 34765


We can see that around 0.28% of the entries are duplicates.
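The percentage follows directly from the two counts printed above:

```python
before, after = 34_864, 34_765  # row counts before and after dropDuplicates()

n_duplicates = before - after
dup_pct = n_duplicates / before * 100
print(f"{n_duplicates} duplicates ({dup_pct:.2f}%)")
```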

Doing the same thing for the transactions data.

In [16]:
print(f"Number of entries before dropping duplicates: {transaction_records.count()}")
transaction_records = transaction_records.dropDuplicates()
print(f"Number of entries after dropping duplicates: {transaction_records.count()}")

Number of entries before dropping duplicates: 14195505


24/09/29 15:26:51 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

Number of entries after dropping duplicates: 14195505


                                                                                

There are no duplicates in the transactions data. Next, we will load the merchant fraud rate and repeat the same process.

In [17]:
# Load merchant fraud rate dataset
merchant_fraud_rate = spark.read.csv(f"{raw_path}/tables/merchant_fraud_probability.csv", header=True, inferSchema=True)
merchant_fraud_rate.show(5)
get_dataset_count(merchant_fraud_rate)

+------------+--------------+------------------+
|merchant_abn|order_datetime| fraud_probability|
+------------+--------------+------------------+
| 19492220327|    2021-11-28|44.403658647495355|
| 31334588839|    2021-10-02| 42.75530083865367|
| 19492220327|    2021-12-22|38.867790051131095|
| 82999039227|    2021-12-19|  94.1347004808891|
| 90918180829|    2021-09-02| 43.32551731714902|
+------------+--------------+------------------+
only showing top 5 rows

The dataset count is  114


Removing any duplicates in the data.

In [18]:
print(f"Number of entries before dropping duplicates: {merchant_fraud_rate.count()}")
merchant_fraud_rate = merchant_fraud_rate.dropDuplicates()
print(f"Number of entries after dropping duplicates: {merchant_fraud_rate.count()}")

Number of entries before dropping duplicates: 114
Number of entries after dropping duplicates: 114


Cleaning `tbl_merchants.parquet`. The feature `tags` is a string that represents either a tuple or a list, containing 3 elements:
* Items that are being sold
* Revenue levels
* Commission rate

Each element is written as a list, a tuple, or a mix of both (e.g. it starts with `[` and ends with `)`, or vice versa). These inconsistencies are most likely due to human error. Thus, we need to account for them when splitting the values of the feature `tags` into separate columns.
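One way to handle the mixed brackets is to treat `(`/`[` and `)`/`]` as interchangeable and split on the gap between groups. The sketch below is an illustration only and is not the actual `clean_merchant_details` helper; the function name `parse_tags`, the lower-casing of the category, and the third (mixed-bracket) input are assumptions.

```python
import re

def parse_tags(tags):
    """Split a raw `tags` string into (category, revenue_level, take_rate)."""
    inner = tags.strip()[2:-2]  # drop the two opening and two closing brackets
    # A group boundary is any closing bracket, a comma, then any opening bracket
    parts = re.split(r"[\)\]]\s*,\s*[\(\[]", inner)
    if len(parts) != 3:
        raise ValueError(f"unexpected tags format: {tags!r}")
    category = " ".join(parts[0].lower().split())  # normalise case and whitespace
    revenue_level = parts[1].strip().lower()
    match = re.search(r"take rate:\s*([\d.]+)", parts[2], flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no take rate found in: {tags!r}")
    return category, revenue_level, float(match.group(1))

examples = [
    "((furniture, home furnishings and equipment shops, and manufacturers, except appliances), (e), (take rate: 0.18))",
    "([cable, satellite, and otHer pay television and radio services], [b], [take rate: 4.22])",
    "[(jewelry, watch, clock, and silverware shops], [b), (take rate: 4.40])",  # hypothetical mixed brackets
]
parsed = [parse_tags(t) for t in examples]
for row in parsed:
    print(row)
```

Raising on an unexpected number of groups keeps malformed rows visible instead of silently shifting values into the wrong column.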

In [19]:
# Load merchant's info
merchant_info = spark.read.parquet(f"{raw_path}/tables/tbl_merchants.parquet")
merchant_info.show(5, truncate=False)

# Clean the data
merchant_info = clean_merchant_details(merchant_info)
merchant_info.show(5, truncate=False)

+------------------------------------+-----------------------------------------------------------------------------------------------------------------+------------+
|name                                |tags                                                                                                             |merchant_abn|
+------------------------------------+-----------------------------------------------------------------------------------------------------------------+------------+
|Felis Limited                       |((furniture, home furnishings and equipment shops, and manufacturers, except appliances), (e), (take rate: 0.18))|10023283211 |
|Arcu Ac Orci Corporation            |([cable, satellite, and otHer pay television and radio services], [b], [take rate: 4.22])                        |10142254217 |
|Nunc Sed Company                    |([jewelry, watch, clock, and silverware shops], [b], [take rate: 4.40])                                          |10165489824 |
|Ult

Removing any duplicated merchants.

In [20]:
print(f"Number of entries before dropping duplicates: {merchant_info.count()}")
merchant_info = merchant_info.dropDuplicates()
print(f"Number of entries after dropping duplicates: {merchant_info.count()}")

Number of entries before dropping duplicates: 4026
Number of entries after dropping duplicates: 4026


The consumer information data is a single column containing the consumer's name, address, state, postcode, gender, and unique consumer ID, each separated by "`|`". Thus, we need to split it into individual columns. Based on the `README.md` for the data, we only keep the consumer's name, state, postcode, gender, and consumer ID, as the addresses are fake.
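The split itself is simple; a plain-Python sketch is shown below. It is not the actual `clean_consumer_details` helper, and both the function name and the integer casts for `postcode` and `consumer_id` are assumptions made for the example.

```python
FIELDS = ["name", "address", "state", "postcode", "gender", "consumer_id"]
KEEP = ["name", "consumer_id", "gender", "state", "postcode"]  # address is dropped

def parse_consumer_line(line):
    """Split one pipe-delimited record and keep only the columns of interest."""
    values = line.split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
    record = dict(zip(FIELDS, values))
    # Assumed types: numeric postcode and consumer ID
    record["postcode"] = int(record["postcode"])
    record["consumer_id"] = int(record["consumer_id"])
    return {key: record[key] for key in KEEP}

consumer = parse_consumer_line(
    "Yolanda Williams|413 Haney Gardens Apt. 742|WA|6935|Female|1195503"
)
print(consumer)
```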

In [16]:
# Load consumer info
consumer_info = spark.read.csv(f"{raw_path}/tables/tbl_consumer.csv", header=True, inferSchema=True)
consumer_info.show(5, truncate=False)

# Clean the data
consumer_info = clean_consumer_details(consumer_info)
consumer_info.show(5, truncate=False)

+---------------------------------------------------------------------+
|name|address|state|postcode|gender|consumer_id                       |
+---------------------------------------------------------------------+
|Yolanda Williams|413 Haney Gardens Apt. 742|WA|6935|Female|1195503   |
|Mary Smith|3764 Amber Oval|NSW|2782|Female|179208                    |
|Jill Jones MD|40693 Henry Greens|NT|862|Female|1194530               |
|Lindsay Jimenez|00653 Davenport Crossroad|NSW|2780|Female|154128     |
|Rebecca Blanchard|9271 Michael Manors Suite 651|WA|6355|Female|712975|
+---------------------------------------------------------------------+
only showing top 5 rows

Before: 
The dataset count is  499999
After: 
The dataset count is  499999
+-----------------+-----------+------+-----+--------+
|name             |consumer_id|gender|state|postcode|
+-----------------+-----------+------+-----+--------+
|Yolanda Williams |1195503    |Female|WA   |6935    |
|Mary Smith       |179208     |Femal

Removing any duplicated consumers.

In [22]:
print(f"Number of entries before dropping duplicates: {consumer_info.count()}")
consumer_info = consumer_info.dropDuplicates()
print(f"Number of entries after dropping duplicates: {consumer_info.count()}")

Number of entries before dropping duplicates: 499999




Number of entries after dropping duplicates: 499999


                                                                                

For all datasets with a datetime column (transactions, merchant fraud probability and consumer fraud probability), we need to ensure that the datetime falls within the specified range (given in the names of the initially downloaded files).
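The range check amounts to a filter on the date column plus a report of how many rows were removed. The sketch below shows the idea on toy rows; the function name and the inclusive bounds are assumptions, and the real `ensure_datetime_range` helper operates on Spark DataFrames.

```python
from datetime import date

def ensure_range_sketch(rows, start, end):
    """Keep rows whose order_datetime lies in [start, end] and report the % removed."""
    start_d, end_d = date.fromisoformat(start), date.fromisoformat(end)
    kept = [r for r in rows if start_d <= r["order_datetime"] <= end_d]
    removed_pct = round((len(rows) - len(kept)) / len(rows) * 100, 2) if rows else 0.0
    return kept, removed_pct

# Toy rows: the first and last dates fall outside the range
rows = [
    {"order_datetime": date(2021, 2, 28)},
    {"order_datetime": date(2021, 3, 1)},
    {"order_datetime": date(2022, 8, 28)},
    {"order_datetime": date(2022, 8, 29)},
]
kept, removed_pct = ensure_range_sketch(rows, "2021-03-01", "2022-08-28")
print(len(kept), removed_pct)

# The same formula applied to the transaction counts reported in this notebook
transactions_pct = round((14_195_505 - 12_544_270) / 14_195_505 * 100, 2)
print(transactions_pct)
```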

In [23]:
transaction_records = ensure_datetime_range(transaction_records, "2021-03-01", "2022-08-28")

24/09/29 15:27:14 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

Starting entries: 14195505 
Final entries: 12544270
Net change (%): 11.63 


                                                                                

In [24]:
merchant_fraud_rate = ensure_datetime_range(merchant_fraud_rate, "2021-03-01", "2022-08-28")

Starting entries: 114 
Final entries: 114
Net change (%): 0.0 


In [25]:
consumer_fraud_rate = ensure_datetime_range(consumer_fraud_rate, "2021-03-01", "2022-08-28")

Starting entries: 34765 
Final entries: 34747
Net change (%): 0.05 


Next, we check for null values across all datasets.
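A per-column null count can be sketched in a few lines. This is an illustration of the idea rather than the actual `calculate_missing_values` helper; the function name, the `<column>_missing_count` key format (mirroring the output of this notebook) and the toy rows are assumptions.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_counts(rows, columns):
    """Return a {<column>_missing_count: n} mapping over a list of row dicts."""
    return {
        f"{col}_missing_count": sum(1 for r in rows if is_missing(r.get(col)))
        for col in columns
    }

# Toy rows with deliberately missing values
rows = [
    {"merchant_abn": 19492220327, "fraud_probability": 44.4},
    {"merchant_abn": None, "fraud_probability": float("nan")},
    {"merchant_abn": 31334588839, "fraud_probability": 42.8},
]
counts = missing_counts(rows, ["merchant_abn", "fraud_probability"])
print(counts)
```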

In [26]:
calculate_missing_values(consumer_fraud_rate)
calculate_missing_values(merchant_fraud_rate)
calculate_missing_values(consumer_info)
calculate_missing_values(merchant_info)
calculate_missing_values(transaction_records)

+----------------------------+-------------------------------+-------------------------+
|order_datetime_missing_count|fraud_probability_missing_count|consumer_id_missing_count|
+----------------------------+-------------------------------+-------------------------+
|                           0|                              0|                        0|
+----------------------------+-------------------------------+-------------------------+

+--------------------------+----------------------------+-------------------------------+
|merchant_abn_missing_count|order_datetime_missing_count|fraud_probability_missing_count|
+--------------------------+----------------------------+-------------------------------+
|                         0|                           0|                              0|
+--------------------------+----------------------------+-------------------------------+



                                                                                

+------------------+-------------------------+--------------------+-------------------+----------------------+
|name_missing_count|consumer_id_missing_count|gender_missing_count|state_missing_count|postcode_missing_count|
+------------------+-------------------------+--------------------+-------------------+----------------------+
|                 0|                        0|                   0|                  0|                     0|
+------------------+-------------------------+--------------------+-------------------+----------------------+

+------------------+--------------------------+----------------------+---------------------------+-----------------------+
|name_missing_count|merchant_abn_missing_count|category_missing_count|revenue_level_missing_count|take_rate_missing_count|
+------------------+--------------------------+----------------------+---------------------------+-----------------------+
|                 0|                         0|                     0|     

24/09/29 15:28:03 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

+--------------------------+--------------------------+----------------------+----------------------------+-------------------------+
|merchant_abn_missing_count|dollar_value_missing_count|order_id_missing_count|order_datetime_missing_count|consumer_id_missing_count|
+--------------------------+--------------------------+----------------------+----------------------------+-------------------------+
|                         0|                         0|                     0|                           0|                        0|
+--------------------------+--------------------------+----------------------+----------------------------+-------------------------+



                                                                                

Currently, there are no missing values after cleaning. We will revisit this after merging the data together.

# Load

In [27]:
landing_directory = "../data/curated"

In [28]:
# Main dataset
consumer_fraud_rate.write.parquet(f"{landing_directory}/consumer_fp.parquet", mode = 'overwrite')
merchant_fraud_rate.write.parquet(f"{landing_directory}/merchant_fp.parquet", mode = 'overwrite')
transaction_records.write.parquet(f"{landing_directory}/transactions.parquet", mode = 'overwrite')
merchant_info.write.parquet(f"{landing_directory}/merchant_info.parquet", mode = 'overwrite')
consumer_info.write.parquet(f"{landing_directory}/consumer_info.parquet", mode = 'overwrite')

24/09/29 15:28:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

In [29]:
# External dataset
postcode_info.to_csv(f"{landing_directory}/postcode_info.csv")
pf_df.to_csv(f"{landing_directory}/personal_fraud.csv")