### Second script: transform the data from the curated zone to the transformed zone
<p>
This script further enriches the data before serving, i.e. it adds features and encrypts the sensitive fields.
<br>
Data analysts and data scientists should read only from the transformed zone to avoid accidental leaks of PII.
<br>
The unencrypted data remains in the curated zone so that DataOps can resolve data disputes arising from downstream activity.
</p>

In [1]:
import pyspark
from pyspark.sql import SparkSession

from pyspark.sql.functions import col, lit  # common functions
from pyspark.sql.functions import year, current_date  # date functions
from pyspark.sql.functions import aes_encrypt, aes_decrypt, base64, unbase64  # encryption

import os


In [2]:
conf = (
    pyspark.SparkConf().setAppName('cc_credit_card_pii_encryption')
    .set("spark.executor.memory", "16g")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

In [None]:
df = spark.read.parquet("curated_cc_credit_card/cc_credit_card_curated.parquet")

### Data encryption process

In [4]:
df.show(5)

+----------+---------------------+----------------+--------------------+-----+--------+--------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+------------------+-----------+--------+-------------+----------------------+--------------------+---------+-------------+------+
|Unnamed: 0|trans_date_trans_time|          cc_num|            merchant|  amt|   first|    last|              street|     city|state|  zip|    lat|     long|city_pop|                 job|       dob|           trans_num|         merch_lat| merch_long|is_fraud|merch_zipcode|merch_last_update_time|      merch_eff_time|   cc_bic|     category|gender|
+----------+---------------------+----------------+--------------------+-----+--------+--------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+------------------+-----------+--------+-------------+----------------------+------

In [5]:
encrypted_data = df.select("*")

In [6]:
# definition of pii data: https://www.investopedia.com/terms/p/personally-identifiable-information-pii.asp#:~:text=What%20Is%20Personally%20Identifiable%20Information,to%20successfully%20recognize%20an%20individual.
# PII is divided into sensitive and non-sensitive data.
# Non-sensitive fields (gender, state, city) are left unencrypted so they can be used for further analysis.

# Before encrypting, create an age column by subtracting the birth year from the current year,
# so the age distribution can still be analyzed after the date of birth is encrypted.

In [7]:
# Load the 32-byte AES key from disk, or create and save one on the first run
key_path = 'decryption.key'

if os.path.exists(key_path):
    with open(key_path, "rb") as f:
        key = f.read()
else:
    key = os.urandom(32)
    with open(key_path, "wb") as f:
        f.write(key)
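
The cell above writes the key file with default permissions and does not check the length of a key it reads back. A minimal sketch of a stricter variant follows. The helper name `load_or_create_key` and its parameters are hypothetical and not part of the original notebook. It creates the file with owner-only permissions where the platform supports them and rejects a stored key of the wrong length.

```python
import os


def load_or_create_key(key_path: str, key_size: int = 32) -> bytes:
    """Return the AES key stored at key_path, creating it on first use.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if os.path.exists(key_path):
        with open(key_path, "rb") as f:
            key = f.read()
        if len(key) != key_size:
            raise ValueError(
                f"expected a {key_size}-byte key in {key_path}, found {len(key)} bytes"
            )
        return key

    key = os.urandom(key_size)
    # O_EXCL fails if the file appeared in the meantime; 0o600 limits access
    # to the owner on POSIX systems (Windows largely ignores the mode bits).
    # O_BINARY only exists on Windows, hence the getattr fallback.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0)
    fd = os.open(key_path, flags, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key)
    return key
```

In the notebook, `key = load_or_create_key(key_path)` could stand in for the `if`/`else` above. Anyone who can read this file can decrypt the transformed zone, so it is worth keeping it outside the data folders.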

In [8]:
encrypted_data = encrypted_data.withColumn("age", year(current_date()) - year(col("dob")))
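
Subtracting calendar years counts a person as one year older than they are until their birthday has passed in the current year. The sketch below shows the birthday-aware arithmetic in plain Python. `age_on` is a hypothetical helper, not something from the notebook. In Spark, the same idea could probably be expressed with `months_between` and `floor` from `pyspark.sql.functions`, but that is an assumption and is not tested here.

```python
from datetime import date


def age_on(dob: date, today: date) -> int:
    """Age in completed years on `today` (hypothetical helper)."""
    years = today.year - dob.year
    # Subtract one if this year's birthday has not happened yet
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


# The plain year difference would give 34 here
print(age_on(date(1990, 12, 31), date(2024, 1, 1)))  # 33
```

Whether the one-year difference matters depends on how the age distribution is used downstream. For coarse age bands the year difference in the cell above may be enough.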

In [9]:
list_of_columns_to_encrypt = [
    "cc_num",
    "first",
    "last",
    "street",
    "dob"
]

In [10]:
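for column_name in list_of_columns_to_encrypt:
    # Encrypt with AES, then base64-encode the ciphertext so it is stored as a string
    encrypted_value = base64(aes_encrypt(col(column_name).cast("binary"), lit(key)))
    encrypted_data = encrypted_data.withColumn(column_name, encrypted_value)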

In [11]:
encrypted_data.show(5, truncate=False)

+----------+-----------------------+------------------------------------------------------------+------------------------------------+-----+------------------------------------------------+------------------------------------------------+------------------------------------------------------------------------+---------+-----+-----+-------+---------+--------+-----------------------+----------------------------------------------------+--------------------------------+------------------+-----------+--------+-------------+-----------------------+-----------------------+---------+-------------+------+---+
|Unnamed: 0|trans_date_trans_time  |cc_num                                                      |merchant                            |amt  |first                                           |last                                            |street                                                                  |city     |state|zip  |lat    |long     |city_pop|job                    |dob     

In [12]:
# Test decryption: decrypt cc_num and confirm the original values come back
encrypted_data.withColumn(
    "cc_num",
    aes_decrypt(unbase64(col("cc_num")), lit(key)).cast("string")
).show(5)

+----------+---------------------+----------------+--------------------+-----+--------------------+--------------------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+--------------------+--------------------+------------------+-----------+--------+-------------+----------------------+--------------------+---------+-------------+------+---+
|Unnamed: 0|trans_date_trans_time|          cc_num|            merchant|  amt|               first|                last|              street|     city|state|  zip|    lat|     long|city_pop|                 job|                 dob|           trans_num|         merch_lat| merch_long|is_fraud|merch_zipcode|merch_last_update_time|      merch_eff_time|   cc_bic|     category|gender|age|
+----------+---------------------+----------------+--------------------+-----+--------------------+--------------------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+---------------

### Load the encrypted data into the transformed folder

In [13]:
encrypted_data.show(5)

+----------+---------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+--------------------+--------------------+------------------+-----------+--------+-------------+----------------------+--------------------+---------+-------------+------+---+
|Unnamed: 0|trans_date_trans_time|              cc_num|            merchant|  amt|               first|                last|              street|     city|state|  zip|    lat|     long|city_pop|                 job|                 dob|           trans_num|         merch_lat| merch_long|is_fraud|merch_zipcode|merch_last_update_time|      merch_eff_time|   cc_bic|     category|gender|age|
+----------+---------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+---------+-----+-----+-------+---------+--------+--------------------+---

In [14]:
encrypted_data.write.partitionBy("cc_bic", "category", "gender").mode('overwrite').parquet('transformed_cc_credit_card/cc_credit_card_transformed.parquet')
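
`partitionBy` writes one directory level per partition column, in the order given, using Hive-style `column=value` names. The sketch below only illustrates that layout. `partition_dir` and the sample values are hypothetical, and it assumes values without characters that Spark would escape in paths.

```python
from pathlib import PurePosixPath


def partition_dir(base: str, partition_values: dict) -> str:
    """Build the Hive-style directory for one combination of partition values.

    Hypothetical helper for illustration; Spark creates these paths itself.
    """
    path = PurePosixPath(base)
    for column, value in partition_values.items():
        path = path / f"{column}={value}"
    return str(path)


# Sample values are made up, not taken from the dataset
example = partition_dir(
    "transformed_cc_credit_card/cc_credit_card_transformed.parquet",
    {"cc_bic": "EXAMPLEBIC", "category": "example_category", "gender": "F"},
)
print(example)
```

Every distinct combination of the three columns gets its own directory, so if `cc_bic` has many distinct values the write can produce a large number of small files. That trade-off is worth checking against the actual data.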

In [15]:
spark.stop()