<h1>1- Data Preprocessing</h1>

<p align="justify" style="border: 1px solid black; padding: 10px; border-radius: 5px;">This notebook is the first stage of our data analysis pipeline and focuses on the "member", "transactions", and "user logs" tables of our dataset. It covers the key preliminary steps of data cleaning, type casting, and null-value handling, so that the data is ready for analysis. Special attention is given to identifying data noise and duplicate instances based on specific attributes.</p>

In [1]:
# Dependencies and Setup
import package as pk

Package: Resources loaded. ☑


<h2>1-1- Data Collection</h2>

In [2]:
# Instantiate the 'DataToSpark' class to handle KKBOX data
kkbox = pk.DataToSpark("customer-churn-391917", "kkbox")

# Load specific tables into DataFrames
tables_to_load = ["member", "user", "transactions", "train"]
kkbox.load_tables(tables_to_load)

# Retrieve the stored DataFrames
tables = kkbox.get_tables()

# Display the first row of each table DataFrame
for table_name, table_df in tables.items():
    print(table_name)
    table_df.show(1)

member


                                                                                

+--------------------+----+---+------+--------------+----------------------+
|                msno|city| bd|gender|registered_via|registration_init_time|
+--------------------+----+---+------+--------------+----------------------+
|8weIFLAcRU/dYHiOc...|   1|  0|  null|             9|              20120320|
+--------------------+----+---+------+--------------+----------------------+
only showing top 1 row

user
+--------------------+--------+------+------+------+-------+-------+-------+----------+
|                msno|    date|num_25|num_50|num_75|num_985|num_100|num_unq|total_secs|
+--------------------+--------+------+------+------+-------+-------+-------+----------+
|FnqNUBvN8mysLeKba...|20160119|    34|     1|     1|      1|     88|    108| 24172.471|
+--------------------+--------+------+------+------+-------+-------+-------+----------+
only showing top 1 row

transactions
+--------------------+-----------------+-----------------+---------------+------------------+-------------+--

<h2>2-1- Exploratory Data Analysis (EDA)</h2>

<h3>1-2-1- Members Dataset</h3>

In [3]:
# Select the "msno" column from the "train" DataFrame
train_keys = tables["train"].select("msno")

# Keep only members that also appear in the train DataFrame (inner join on "msno")
member_df_filter = tables["member"].join(train_keys, on=["msno"], how="inner")

# Create an instance of DataFrameInfoDisplay to show information about the filtered member DataFrame
member_info = pk.DataFrameInfoDisplay(member_df_filter)

# Display the shape, schema, and first rows of the filtered member DataFrame
member_info.display_info(show_schema=True, show_data=True)

                                                                                

shape
 |-- rows: 877161
 |-- columns: 6
root
 |-- msno: string (nullable = true)
 |-- city: long (nullable = true)
 |-- bd: long (nullable = true)
 |-- gender: string (nullable = true)
 |-- registered_via: long (nullable = true)
 |-- registration_init_time: long (nullable = true)




+--------------------------------------------+----+---+------+--------------+----------------------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|
+--------------------------------------------+----+---+------+--------------+----------------------+
|++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=|1   |0  |null  |7             |20140714              |
|+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=|15  |31 |male  |9             |20060603              |
|+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=|9   |31 |male  |9             |20040330              |
|+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=|15  |29 |male  |9             |20080322              |
|+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=|13  |29 |female|3             |20120612              |
+--------------------------------------------+----+---+------+--------------+----------------------+
only showing top 5 rows
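
<p align="justify">The inner join above acts as a filter: only members whose "msno" also appears in the train table are kept. As a minimal plain-Python sketch of the same idea (toy rows and keys, not KKBOX data; it matches an inner join only under the assumption that the train keys are unique):</p>

```python
# Plain-Python sketch of using an inner join as a filter (toy data, not KKBOX rows).
members = [
    {"msno": "a", "city": 1},
    {"msno": "b", "city": 13},
    {"msno": "c", "city": 5},
]
train_keys = {"a", "c"}  # hypothetical "msno" values present in the train table

# Keep only the members whose key also appears among the train keys.
members_filtered = [row for row in members if row["msno"] in train_keys]

print([row["msno"] for row in members_filtered])
```

<p align="justify">If a key occurred several times in the train table, a real inner join would repeat the matching member row once per occurrence, which a set lookup does not reproduce.</p>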



                                                                                

In [4]:
# Update the "registration_init_time" column:
# parse non-null values as dates (yyyyMMdd) and replace NULLs with "NAN".
# Because one branch is a string literal, the resulting column is string-typed.
member_df = member_df_filter.withColumn(
    "registration_init_time",
    pk.when(pk.col("registration_init_time").isNotNull(),
            pk.to_date(pk.col("registration_init_time").cast("string"), "yyyyMMdd")
           ).otherwise("NAN")
)

# Fill missing values in the "gender" column with "NAN"
member_df = member_df.fillna("NAN", subset=["gender"])

# Convert gender values to numeric codes
# NAN becomes 0, male becomes 1, and female becomes 2
member_df = member_df.withColumn(
    "gender",
    pk.when(pk.col("gender") == "NAN", 0)
      .when(pk.col("gender") == "male", 1)
      .when(pk.col("gender") == "female", 2)
      .otherwise(pk.col("gender"))
)

# For each listed column, replace NULLs with "NAN" and cast the other values to int
# (the "NAN" string literal makes the resulting columns string-typed, as the schema below shows)
columns_to_replace = ["city", "bd", "registered_via"]
for column in columns_to_replace:
    member_df = member_df.withColumn(
        column,
        pk.when(pk.col(column).isNull(), "NAN").otherwise(pk.col(column).cast("int"))
    )

# Display the schema and data of the updated member DataFrame
member_df_info = pk.DataFrameInfoDisplay(member_df)
member_df_info.display_info(show_schema=True, show_data=True)

                                                                                

shape
 |-- rows: 877161
 |-- columns: 6
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)




+--------------------------------------------+----+---+------+--------------+----------------------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|
+--------------------------------------------+----+---+------+--------------+----------------------+
|++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=|1   |0  |0     |7             |2014-07-14            |
|+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=|15  |31 |1     |9             |2006-06-03            |
|+0/X9tkmyHyet9X80G6GTrDFHnJqvai8d1ZPhayT0os=|9   |31 |1     |9             |2004-03-30            |
|+09YGn842g6h2EZUXe0VWeC4bBoCbDGfUboitc0vIHw=|15  |29 |1     |9             |2008-03-22            |
|+0RJtbyhoPAHPa+34MkYcE2Ox0cjMgJOTXMXVBYgkJE=|13  |29 |2     |3             |2012-06-12            |
+--------------------------------------------+----+---+------+--------------+----------------------+
only showing top 5 rows
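
<p align="justify">The schema above lists city, bd, gender, registered_via, and registration_init_time as string. This is most likely because each when/otherwise expression mixes typed values with the "NAN" string literal, so Spark falls back to a common string type. The following is a minimal plain-Python sketch of an alternative that keeps the types and uses None as the missing-value marker; the helper names <code>parse_yyyymmdd</code> and <code>clean_member</code> are hypothetical and not part of the package used here.</p>

```python
from datetime import date, datetime
from typing import Optional

# Same coding as in the cell above: missing -> 0, male -> 1, female -> 2.
GENDER_CODES = {None: 0, "male": 1, "female": 2}


def parse_yyyymmdd(value: Optional[int]) -> Optional[date]:
    """Turn an integer such as 20140714 into a date; keep missing values as None."""
    if value is None:
        return None
    return datetime.strptime(str(value), "%Y%m%d").date()


def clean_member(row: dict) -> dict:
    """Return a copy of the row with a numeric gender code and a real date."""
    return {
        "msno": row["msno"],
        "city": row["city"],
        "bd": row["bd"],
        "gender": GENDER_CODES[row["gender"]],
        "registered_via": row["registered_via"],
        "registration_init_time": parse_yyyymmdd(row["registration_init_time"]),
    }


# Toy row shaped like the member table (the key is a placeholder).
raw = {"msno": "example-key", "city": 1, "bd": 0, "gender": None,
       "registered_via": 7, "registration_init_time": 20140714}
cleaned = clean_member(raw)
print(cleaned["gender"], cleaned["registration_init_time"])
```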



                                                                                

<h4>1-1-2-1- Data and Model Noise</h4>

<p align="justify"><b>b)</b> Duplicated instances are identified by grouping on the attributes city, bd, gender, and registered_via, with particular attention to rows where bd equals 0 and gender is NAN (coded as 0).</p>

<b><ul><li>Noise evaluation</li></ul></b>

In [5]:
# Group the data and count occurrences based on specified columns
unit_count = member_df.groupBy("city", "bd", "gender", "registered_via").count()

# Order the grouped data by count in descending order and show the top 10
# (show() only prints and returns None, so its result is not assigned)
unit_count.orderBy(pk.col("count").desc()).show(10)

                                                                                

+----+---+------+--------------+------+
|city| bd|gender|registered_via| count|
+----+---+------+--------------+------+
|   1|  0|     0|             7|399071|
|   1|  0|     0|             4| 16453|
|   1|  0|     0|             9| 15651|
|   1|  0|     0|             3|  6786|
|  13|  0|     0|             9|  4072|
|   1|  0|     0|            13|  2804|
|   5|  0|     0|             9|  2660|
|  13|  0|     0|             3|  2399|
|  15|  0|     0|             9|  1635|
|  13| 27|     1|             9|  1586|
+----+---+------+--------------+------+
only showing top 10 rows
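
<p align="justify">The grouping above counts how many members share the same combination of city, bd, gender, and registered_via. A minimal plain-Python sketch of the same counting step, on toy rows rather than the real table:</p>

```python
from collections import Counter

# Toy rows with the four grouping attributes (not real KKBOX records).
rows = [
    {"city": 1, "bd": 0, "gender": 0, "registered_via": 7},
    {"city": 1, "bd": 0, "gender": 0, "registered_via": 7},
    {"city": 13, "bd": 27, "gender": 1, "registered_via": 9},
    {"city": 1, "bd": 0, "gender": 0, "registered_via": 4},
]
key_columns = ("city", "bd", "gender", "registered_via")

# Count how often each attribute combination occurs.
counts = Counter(tuple(row[c] for c in key_columns) for row in rows)

# Most frequent combination first, like orderBy(count desc).
top = counts.most_common(1)
print(top)
```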



In [6]:
# city=1 assessment
city_1_df = member_df.filter(pk.col("city") == 1)

# Calculate and print the filtered result count and percentage
city_1_count = city_1_df.count()
print(f"result\n |-- count: {city_1_count}\n |-- out of total data: {round(city_1_count / member_df.count() * 100, 2)} %")



result
 |-- count: 455389
 |-- out of total data: 51.92 %


                                                                                

<h4>2-1-2-1- Observation</h4>

<p align="justify">In the member dataset, 51.92% of the records originate from a single city (city 1), and most of these records lack essential demographic details such as age (bd = 0) and gender. The result is an imbalanced dataset in which one city is heavily over-represented. Such an imbalance can bias a machine learning model toward the dominant class at the expense of the less prevalent ones. To mitigate this, the city 1 data is separated from the rest so that a dedicated model can be developed for it.</p>
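
<p align="justify">As a quick arithmetic check using only the counts reported in this notebook, the share of city 1 and the number of members left after removing city 1 can be recomputed as follows. The remainder agrees with the row count reported for member_model_df.</p>

```python
total_members = 877161   # rows in member_df
city_1_members = 455389  # rows with city == 1

# Share of city 1 in percent, rounded to two decimals.
share_pct = round(city_1_members / total_members * 100, 2)

# Members left once city 1 is split off.
remaining = total_members - city_1_members

print(share_pct, remaining)
```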

In [7]:
# Create a new DataFrame 'member_model_df' by subtracting the contents of 'city_1_df' from 'member_df'.
member_model_df = member_df.subtract(city_1_df)

# Display the shape, schema, and first rows of the resulting DataFrame
member_df_info = pk.DataFrameInfoDisplay(member_model_df)

member_df_info.display_info(show_schema=True, show_data=True)

                                                                                

shape
 |-- rows: 421772
 |-- columns: 6
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)





+--------------------------------------------+----+---+------+--------------+----------------------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|
+--------------------------------------------+----+---+------+--------------+----------------------+
|++ViQ2i7L4hzLdBQ233Z8p4AxK9Vr38lSqZS09d2M84=|13  |22 |2     |3             |2015-02-25            |
|+0lSGTZKwfVjc0s36k7YvJmF1DGqJft289lJCrcG/qI=|5   |22 |1     |9             |2015-10-03            |
|+14H1v78CHVFv9RX3XusVZEj4f+YwWe3ozVjoUZSaVM=|6   |20 |0     |3             |2013-10-22            |
|+1LjlnGGmNYWIQWnb6GxLzMT39zYK2/TFJHklUK139M=|13  |44 |1     |9             |2008-06-22            |
|+1luNX1IrJn8MEXXdA9fHWWuCx+j5E5T+O6b7AGGJgk=|13  |18 |2     |3             |2013-12-13            |
+--------------------------------------------+----+---+------+--------------+----------------------+
only showing top 5 rows



                                                                                

<h3>2-2-1- Transaction Data</h3>

<h4>1-2-2-1- Merge Function</h4>

<b><ul><li>Transaction Data For "member_model_df"</li></ul></b>

In [8]:
# Initialize the TransactionMerger class
member_merger = pk.TransactionMerger(member_model_df, tables["transactions"])

# Merge transactions and get the new dataframe
member_trans_df = member_merger.merge_transactions()

# Display shape, schema, and data information
member_merger.display_merged_info(member_trans_df)

                                                                                

shape
 |-- rows: 431657
 |-- columns: 15
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)




+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last  |
+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|+/namlXq+u3izRjHCFJV4MgqcXcLidZYszVsROOq/y4=|15  |31 |1     |9             |2006-06-03            |34               |30               |149            |149               |1            |2017-02-28      |2017-03-31            |0        |2017-02-28|
|+0/X9tkmyHy

                                                                                

<b><ul><li>Transaction Data For "city_1_df"</li></ul></b>

In [9]:
# Initialize the TransactionMerger class
city_merger = pk.TransactionMerger(city_1_df, tables["transactions"])

# Merge transactions and get the new dataframe
city_trans_df = city_merger.merge_transactions()

# Display shape, schema, and data information
city_merger.display_merged_info(city_trans_df)

                                                                                

shape
 |-- rows: 468169
 |-- columns: 15
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)





+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last  |
+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|++4RuqBw0Ss6bQU4oMxaRlbBPoWzoEiIZaxPM04Y4+U=|1   |0  |0     |7             |2014-07-14            |41               |30               |149            |149               |1            |2017-02-13      |2017-03-13            |0        |2017-02-13|
|+2eLsQv6T46

                                                                                

<h4>2-2-2-1- Data and Model Noise</h4>

<b>a) Columns Error</b>
<p align="justify"><b>a-1) </b>The same 'msno' has several rows with matching 'transaction_date' and 'membership_expire_date' values that differ in other columns; in the example below, the 'is_cancel' values are distinct. In this case, the 'exp_last' value is null (NAN).</p>

In [10]:
# Show the rows of member_trans_df for the given member key (msno) as an example of this noise.
noise_a_1_m = pk.noise_member(member_trans_df, "1M+OGzETqoIR33bo2mzrTP3p8jOFyVdUX5vJ3KLe2ms=")

                                                                                

+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+--------+
|                msno|city| bd|gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last|
+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+--------+
|1M+OGzETqoIR33bo2...|   5| 38|     1|             9|            2006-01-29|               40|               30|            149|               149|            1|      2017-02-21|            2017-03-20|        1|     NAN|
|1M+OGzETqoIR33bo2...|   5| 38|     1|             9|            2006-01-29|               40|               30|    

<p align="justify"><b>a-2) </b>The same "msno" has rows with identical "transaction_date" and "membership_expire_date" values, but with errors in the "exp_last" column.</p>

<b><ul><li>In "member_trans_df"</li></ul></b>

In [11]:
# Show the rows of member_trans_df for the given member key (msno) as an example of this noise.
noise_a_2_m = pk.noise_member(member_trans_df, "ya/pbMnE1Bc9NXQIQ3r9avpXJet0hiNQEgy8QMG98ZI=")

                                                                                

+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|                msno|city| bd|gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|  exp_last|
+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|ya/pbMnE1Bc9NXQIQ...|  13| 18|     1|             4|            2016-02-22|               38|               30|            149|               149|            0|      2017-01-06|            2017-02-06|        0|2016-03-13|
|ya/pbMnE1Bc9NXQIQ...|  13| 18|     1|             4|            2016-02-22|               38|              

<b><ul><li>In "city_trans_df"</li></ul></b>

In [12]:
# Show the rows of city_trans_df for the given member key (msno) as an example of this noise.
noise_a_2_c = pk.noise_member(city_trans_df, "rjrMVdeV2c289rwjkIOsL/4KNSE9wl60GdJkpYfGQPg=")

                                                                                

+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|                msno|city| bd|gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|  exp_last|
+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|rjrMVdeV2c289rwjk...|   1|  0|     0|             7|            2014-10-02|               41|               30|             99|                99|            1|      2017-02-10|            2017-03-10|        0|2017-01-11|
|rjrMVdeV2c289rwjk...|   1|  0|     0|             7|            2014-10-02|               41|              

<p align="justify"><b>b) </b>The same "msno" has a single "transaction_date" with multiple distinct <b>"membership_expire_date"</b> values, with or without nulls.</p>

In [13]:
# Show the rows of member_trans_df for the given member key (msno) as an example of this noise.
noise_b_m = pk.noise_member(member_trans_df, "pI0cMv4wwhvLTBJpJSoHQrG6pdazfXo77JmRfKoyA6U=")

                                                                                

+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+--------+
|                msno|city| bd|gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last|
+--------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+--------+
|pI0cMv4wwhvLTBJpJ...|   9| 26|     1|             9|            2015-11-30|               38|               10|              0|                 0|            0|      2015-11-30|            2017-03-24|        0|     NAN|
|pI0cMv4wwhvLTBJpJ...|   9| 26|     1|             9|            2015-11-30|               38|               10|    

<h4>3-2-2-1- Noise Evaluation</h4>

<b><ul><li>Noise a)</li></ul></b>

In [14]:
# Use the noise_finder function to find records in member_trans_df that share the same values in the given key columns
# The function returns three variables: the DataFrame of noisy records, the data retention information, and the keys of the noisy cases
noise_a_member_df, noise_a_dr_member, noise_a_keys_member = pk.noise_finder(member_trans_df, ["msno", "transaction_date", "membership_expire_date"])



DataFrame Shape:
 |-- Number of Records: 8013
 |-- Number of Unique Cases: 3770


                                                                                

In [15]:
# Use the noise_finder function to find records in city_trans_df that share the same values in the given key columns
# The function returns three variables: the DataFrame of noisy records, the data retention information, and the keys of the noisy cases
noise_a_city_df, noise_a_dr_city, noise_a_keys_city = pk.noise_finder(city_trans_df, ["msno", "transaction_date", "membership_expire_date"])



DataFrame Shape:
 |-- Number of Records: 11096
 |-- Number of Unique Cases: 5225


                                                                                

<b><ul><li>Noise b)</li></ul></b>

In [16]:
# Use the noise_finder function to find records in member_trans_df that share the same values in the given key columns
# The function returns three variables: the DataFrame of noisy records, the data retention information, and the keys of the noisy cases
noise_b_member_df, noise_b_dr_member, noise_b_keys_member = pk.noise_finder(member_trans_df, ["msno", "transaction_date"])



DataFrame Shape:
 |-- Number of Records: 17605
 |-- Number of Unique Cases: 7720


                                                                                

In [17]:
# Use the noise_finder function to find records in city_trans_df that share the same values in the given key columns
# The function returns three variables: the DataFrame of noisy records, the data retention information, and the keys of the noisy cases
noise_b_city_df, noise_b_dr_city, noise_b_keys_city = pk.noise_finder(city_trans_df, ["msno", "transaction_date"])



DataFrame Shape:
 |-- Number of Records: 24459
 |-- Number of Unique Cases: 11679
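
<p align="justify">The implementation of noise_finder is not shown in this notebook. As a minimal sketch under the assumption that it flags every record whose key-column values are shared with at least one other record, the two reported numbers (records and unique cases) could be obtained as below. The function name <code>find_shared_keys</code> and the toy rows are hypothetical.</p>

```python
from collections import defaultdict


def find_shared_keys(rows, key_columns):
    """Return the rows whose key tuple occurs more than once, plus those key tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_columns)].append(row)
    # A "case" is a key tuple that is shared by two or more rows.
    noisy_groups = {key: group for key, group in groups.items() if len(group) > 1}
    noisy_rows = [row for group in noisy_groups.values() for row in group]
    return noisy_rows, list(noisy_groups)


# Toy transactions: u1 has two rows with the same dates but different is_cancel.
rows = [
    {"msno": "u1", "transaction_date": "2017-02-21",
     "membership_expire_date": "2017-03-20", "is_cancel": 1},
    {"msno": "u1", "transaction_date": "2017-02-21",
     "membership_expire_date": "2017-03-20", "is_cancel": 0},
    {"msno": "u2", "transaction_date": "2017-01-06",
     "membership_expire_date": "2017-02-06", "is_cancel": 0},
]
noisy_rows, noisy_keys = find_shared_keys(
    rows, ["msno", "transaction_date", "membership_expire_date"]
)
print(len(noisy_rows), len(noisy_keys))
```

<p align="justify">Here the first number plays the role of "Number of Records" and the second the role of "Number of Unique Cases".</p>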


                                                                                

<h4>4-2-2-1- Observation</h4>

In [18]:
# Create a new DataFrame 'trans_model_member_df' by subtracting the contents of 'noise_a_member_df' and 'noise_b_member_df' from 'member_trans_df'.
trans_model_member_df = member_trans_df.subtract(noise_a_member_df).subtract(noise_b_member_df)

# Create an instance of DataFrameInfoDisplay to show information about the cleaned member transaction DataFrame
trans_member_info = pk.DataFrameInfoDisplay(trans_model_member_df)

# Display the shape, schema, and first rows of the cleaned member transaction DataFrame
trans_member_info.display_info(show_schema=True, show_data=True)

                                                                                

shape
 |-- rows: 414052
 |-- columns: 15
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)





+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last  |
+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|++gZjzjkC8lAqAtcOxAz0677Ygp0CE1hNqd3v9lRKG0=|5   |39 |1     |9             |2009-11-28            |33               |30               |149            |149               |1            |2017-01-31      |2017-03-21            |0        |2017-02-21|
|+8XdwFIkon/

                                                                                

In [19]:
# Create a new DataFrame 'trans_model_city_df' by subtracting the contents of 'noise_a_city_df' and 'noise_b_city_df' from 'city_trans_df'.
trans_model_city_df = city_trans_df.subtract(noise_a_city_df).subtract(noise_b_city_df)

# Create an instance of DataFrameInfoDisplay to show information about the cleaned city transaction DataFrame
trans_city_info = pk.DataFrameInfoDisplay(trans_model_city_df)

# Display the shape, schema, and first rows of the cleaned city transaction DataFrame
trans_city_info.display_info(show_schema=True, show_data=True)

                                                                                

shape
 |-- rows: 443710
 |-- columns: 15
root
 |-- msno: string (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)





+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|msno                                        |city|bd |gender|registered_via|registration_init_time|payment_method_id|payment_plan_days|plan_list_price|actual_amount_paid|is_auto_renew|transaction_date|membership_expire_date|is_cancel|exp_last  |
+--------------------------------------------+----+---+------+--------------+----------------------+-----------------+-----------------+---------------+------------------+-------------+----------------+----------------------+---------+----------+
|++XryZQSzQL1Pn1aAPnoIJeFe6Z2DST999y5pOOdG4E=|1   |0  |0     |7             |2013-03-10            |41               |30               |149            |149               |1            |2017-02-27      |2017-03-27            |0        |2017-02-27|
|+30BtsMGjSz
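
<p align="justify">The row counts reported above are consistent with the records of noise a) being contained in those of noise b): removing both leaves exactly the total minus the noise b) records. The sketch below checks this arithmetic with the reported numbers and illustrates the set-difference behaviour on toy rows (the tuples are made up for illustration). Note that DataFrame.subtract works on distinct rows, so it also drops exact duplicates.</p>

```python
# Row counts reported in this notebook.
member_trans_rows, noise_b_member_records = 431657, 17605
city_trans_rows, noise_b_city_records = 468169, 24459

member_remaining = member_trans_rows - noise_b_member_records
city_remaining = city_trans_rows - noise_b_city_records
print(member_remaining, city_remaining)

# Toy rows: (msno, transaction_date, membership_expire_date, is_cancel).
all_rows = {
    ("u1", "2017-02-21", "2017-03-20", 0),
    ("u1", "2017-02-21", "2017-03-20", 1),
    ("u1", "2017-02-21", "2017-04-20", 0),
    ("u2", "2017-01-06", "2017-02-06", 0),
}
# Noise a): same msno, transaction_date, and membership_expire_date.
noise_a = {row for row in all_rows if row[:3] == ("u1", "2017-02-21", "2017-03-20")}
# Noise b): same msno and transaction_date, so it contains noise a).
noise_b = {row for row in all_rows if row[:2] == ("u1", "2017-02-21")}

cleaned = all_rows - noise_a - noise_b
print(cleaned)
```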

<h3>3-2-1- User Data</h3>

In [20]:
# Keep only the most recent activity data (January 2017)
user_filter_df = tables["user"].filter((pk.col("date") >= "20170101") & (pk.col("date") <= "20170131"))
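
<p align="justify">Dates written as yyyyMMdd sort in the same order as calendar dates, so a plain range check is enough to select one month. A minimal plain-Python sketch with both bounds inclusive, assuming the date is stored as an integer as the table preview suggests (toy rows, not real logs):</p>

```python
# Toy user-log rows with yyyyMMdd integer dates.
logs = [
    {"msno": "u1", "date": 20161231, "total_secs": 100.0},
    {"msno": "u1", "date": 20170101, "total_secs": 250.5},
    {"msno": "u2", "date": 20170131, "total_secs": 80.0},
    {"msno": "u2", "date": 20170201, "total_secs": 40.0},
]

START, END = 20170101, 20170131  # January 2017, both ends inclusive

# Keep only the rows whose date falls inside the window.
january_logs = [row for row in logs if START <= row["date"] <= END]

print([row["date"] for row in january_logs])
```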

                                                                                

In [21]:
# Use the aggregate_user_activity function to build the member model data from the cleaned transactions, the filtered user activity, and the train labels
member_model = pk.aggregate_user_activity(trans_model_member_df, user_filter_df, tables["train"])



shape
 |-- rows: 390673
 |-- columns: 24
root
 |-- msno: string (nullable = true)
 |-- is_churn: long (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)
 |-- activity_count: string (nullable = true)
 |-- sum_num_25: long (nullable = true)
 |-- sum_num_50: long (nullable = true)
 |-- sum_num_75: long (nullable = true)
 |-- sum_num_985: long (nullable = true)
 |-- sum_num_100: long (nullable = true)
 |-- sum_num

                                                                                

In [23]:
# Use the aggregate_user_activity function to build the city model data from the cleaned transactions, the filtered user activity, and the train labels
city_model = pk.aggregate_user_activity(trans_model_city_df, user_filter_df, tables["train"])



shape
 |-- rows: 368486
 |-- columns: 24
root
 |-- msno: string (nullable = true)
 |-- is_churn: long (nullable = true)
 |-- city: string (nullable = true)
 |-- bd: string (nullable = true)
 |-- gender: string (nullable = false)
 |-- registered_via: string (nullable = true)
 |-- registration_init_time: string (nullable = true)
 |-- payment_method_id: long (nullable = true)
 |-- payment_plan_days: long (nullable = true)
 |-- plan_list_price: long (nullable = true)
 |-- actual_amount_paid: long (nullable = true)
 |-- is_auto_renew: long (nullable = true)
 |-- transaction_date: string (nullable = true)
 |-- membership_expire_date: string (nullable = true)
 |-- is_cancel: long (nullable = true)
 |-- exp_last: string (nullable = true)
 |-- activity_count: string (nullable = true)
 |-- sum_num_25: long (nullable = true)
 |-- sum_num_50: long (nullable = true)
 |-- sum_num_75: long (nullable = true)
 |-- sum_num_985: long (nullable = true)
 |-- sum_num_100: long (nullable = true)
 |-- sum_num
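
<p align="justify">The implementation of aggregate_user_activity is not shown here. Judging only from the output columns (activity_count, sum_num_25, sum_num_100, and so on), it appears to summarise the daily logs per member. The sketch below illustrates that kind of per-key aggregation in plain Python; treating activity_count as the number of log rows per member is an assumption, and the rows are toy data.</p>

```python
from collections import defaultdict

# Toy daily logs; column names follow the user table shown earlier.
logs = [
    {"msno": "u1", "num_25": 3, "num_100": 10},
    {"msno": "u1", "num_25": 1, "num_100": 5},
    {"msno": "u2", "num_25": 0, "num_100": 7},
]

# One running summary per member.
summary = defaultdict(lambda: {"activity_count": 0, "sum_num_25": 0, "sum_num_100": 0})
for row in logs:
    stats = summary[row["msno"]]
    stats["activity_count"] += 1              # assumed meaning: number of log rows
    stats["sum_num_25"] += row["num_25"]
    stats["sum_num_100"] += row["num_100"]

print(dict(summary))
```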

                                                                                

<h3>4-2-1- Save Data in Google Storage</h3>

In [24]:
# save member model data
pk.save_model_data(member_model, "gs://kkbox_data_churn", "model_member_main")

                                                                                

In [25]:
# save city model data
pk.save_model_data(city_model, "gs://kkbox_data_churn", "model_city_main")

                                                                                