# Customer Campaign Response Analytics Using PySpark

The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/nimishsawant/bankfull). The data relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

## Data Ingestion from CSV to Spark DataFrame

In [1]:
pip install pyspark



In [2]:
pip install -q findspark # -q, --quiet Give less output

In [3]:
import findspark
findspark.init()

In [4]:
# Create a Spark Session
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Customer Campaign Response Analytics').getOrCreate()

In [5]:
# Load the dataset
file_path = '/content/bank-full.csv'

df = spark.read.csv(file_path, header=True, inferSchema=True)
# Which variables do we have?
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)
 |-- marital: string (nullable = true)
 |-- education: string (nullable = true)
 |-- default: string (nullable = true)
 |-- balance: integer (nullable = true)
 |-- housing: string (nullable = true)
 |-- loan: string (nullable = true)
 |-- contact: string (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- duration: integer (nullable = true)
 |-- campaign: integer (nullable = true)
 |-- pdays: integer (nullable = true)
 |-- previous: integer (nullable = true)
 |-- poutcome: string (nullable = true)
 |-- Target: string (nullable = true)



In [6]:
# What does the data look like?
df.show(5)

+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+------+
|age|         job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|Target|
+---+------------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+------+
| 58|  management|married| tertiary|     no|   2143|    yes|  no|unknown|  5|  may|     261|       1|   -1|       0| unknown|    no|
| 44|  technician| single|secondary|     no|     29|    yes|  no|unknown|  5|  may|     151|       1|   -1|       0| unknown|    no|
| 33|entrepreneur|married|secondary|     no|      2|    yes| yes|unknown|  5|  may|      76|       1|   -1|       0| unknown|    no|
| 47| blue-collar|married|  unknown|     no|   1506|    yes|  no|unknown|  5|  may|      92|       1|   -1|       0| unknown|    no|
| 33|     unknown| single|  unknown|     no|      1|     no|  no|unkn

Each data point contains information about a particular client who was contacted during the marketing campaign mentioned at the beginning of this notebook. Most of the columns are self-explanatory, but a few benefit from extra explanation. For completeness, here is the description of each column:

- **age**: Age
- **job**: Occupation
- **marital**: Marital Status
- **education**: Education Level
- **default**: Has credit in default?
- **balance**: Average yearly balance
- **housing**: Has housing loan?
- **loan**: Has personal loan?
- **contact**: Contact communication type
- **day**: Last contact day of the month (the data description says day of the week, but we will see below that this is not the case)
- **month**: Last contact month of year
- **duration**: Last contact duration, in seconds
- **campaign**: Number of contacts performed during this campaign and for this client
- **pdays**: Number of days that passed by after the client was last contacted from a previous campaign
- **previous**: Number of contacts performed before this campaign and for this client
- **poutcome**: Outcome of the previous marketing campaign
- **Target**: Has the client subscribed to a term deposit?

As its name suggests, **Target** is the target variable, which we would like to predict.

## Data Cleaning and Preprocessing

### Column renaming

In [7]:
# Do all the cleaning in a copy of the original dataframe
df_clean = df.withColumnRenamed('Target', 'y')

### Missing Values

In [8]:
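# Import Spark's sum under an alias so it does not shadow Python's built-in sum
from pyspark.sql.functions import col, sum as spark_sum

# Count the null values in each column
null_summary = df_clean.select(
    [spark_sum(col(c).isNull().cast('int')).alias(c) for c in df_clean.columns]
)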

null_summary.show()

+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|  y|
+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|  0|  0|      0|        0|      0|      0|      0|   0|      0|  0|    0|       0|       0|    0|       0|       0|  0|
+---+---+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+



### Time Variables

In [9]:
df_clean.select('month').distinct().show()

+-----+
|month|
+-----+
|  jun|
|  aug|
|  may|
|  feb|
|  sep|
|  mar|
|  oct|
|  jul|
|  nov|
|  apr|
|  dec|
|  jan|
+-----+



All 12 months of the year are present. The dataset has no 'year' column, and I could not find any reference to the year(s) in which the marketing campaign took place. We could assume that the campaign ran within a single year, but we cannot be sure. Since we could be dealing with more than one year, it is safer to encode month cyclically, so that December and January are treated as adjacent.
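To illustrate why cyclical encoding suits this case, here is a minimal plain-Python sketch (it is not part of the original notebook, and the helper name `encode_month` is hypothetical). It maps each month to a point on the unit circle, so the distance between December and January is the same as between any other pair of adjacent months.

```python
from math import sin, cos, pi, dist

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * pi * (month - 1) / 12
    return (sin(angle), cos(angle))

# Adjacent months are equally far apart everywhere on the circle,
# including across the December -> January boundary.
jan_feb = dist(encode_month(1), encode_month(2))
dec_jan = dist(encode_month(12), encode_month(1))
print(round(jan_feb, 4), round(dec_jan, 4))

# With the raw month numbers, the same boundary looks like a jump of 11.
print(abs(12 - 1))
```

The Spark cells that follow apply the same formula column-wise.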

In [10]:
# First convert month to a numerical variable
from pyspark.sql.functions import from_unixtime, unix_timestamp

df_clean = df_clean.withColumn('month', from_unixtime(unix_timestamp(col('month'), 'MMM'), 'MM'))
df_clean = df_clean.withColumn('month', df_clean['month'].cast('int'))
df_clean.select('month').distinct().show()

+-----+
|month|
+-----+
|   12|
|    1|
|    6|
|    3|
|    5|
|    9|
|    4|
|    8|
|    7|
|   10|
|   11|
|    2|
+-----+



In [11]:
# Cyclical encoding of month
from pyspark.sql.functions import sin, cos
from math import pi
df_clean = df_clean.withColumn('month_sin', sin(2*pi*(df_clean['month'] - 1)/12))
df_clean = df_clean.withColumn('month_cos', cos(2*pi*(df_clean['month'] - 1)/12))

Let's now take a look at the **day** column:

In [12]:
df_clean.select('day').distinct().show()

+---+
|day|
+---+
| 31|
| 28|
| 26|
| 27|
| 12|
| 22|
|  1|
| 13|
|  6|
| 16|
|  3|
| 20|
|  5|
| 19|
| 15|
|  9|
| 17|
|  4|
|  8|
| 23|
+---+
only showing top 20 rows



As noted in the column descriptions, we are dealing with day of the month rather than day of the week. Let's perform cyclical encoding accordingly.
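For the day of the month, the period of the cycle depends on the month. The plain-Python sketch below is illustrative only: the helper names `days_in_month` and `encode_day` are hypothetical, and February is assumed to have 28 days because the dataset has no year column. It also shows why both the sine and the cosine components are needed: on their own, the sines of two different days can coincide.

```python
from math import sin, cos, pi

def days_in_month(month):
    """Days per month, assuming a non-leap year (assumption: no year is available)."""
    if month in (1, 3, 5, 7, 8, 10, 12):
        return 31
    if month in (4, 6, 9, 11):
        return 30
    return 28  # February

def encode_day(day, month):
    """Map a day of the month to a (sin, cos) point, rounded for readability."""
    angle = 2 * pi * (day - 1) / days_in_month(month)
    return (round(sin(angle), 6), round(cos(angle), 6))

# In a 30-day month, day 1 and day 16 sit on opposite sides of the circle:
# their sines coincide, and only the cosine tells them apart.
first = encode_day(1, 4)
mid = encode_day(16, 4)
print(first, mid)
```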

In [18]:
from pyspark.sql import functions as F

# Create an auxiliary column holding the number of days in the given month
df_clean = df_clean.withColumn(
    'month_days',
    F.when(F.col('month').isin([1, 3, 5, 7, 8, 10, 12]), 31)  # months with 31 days
    .when(F.col('month').isin([4, 6, 9, 11]), 30)  # months with 30 days
    .otherwise(28)  # February (leap years ignored, since there is no year column)
)

In [20]:
# Cyclical encoding of day
df_clean = df_clean.withColumn('day_sin', sin(2*pi*(df_clean['day'] - 1)/df_clean['month_days']))
df_clean = df_clean.withColumn('day_cos', cos(2*pi*(df_clean['day'] - 1)/df_clean['month_days']))
# Drop the auxiliary column
df_clean = df_clean.drop('month_days')

### Exploration of categorical columns

In [41]:
categorical = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']

# Use a loop variable that does not shadow the imported col() function
for column in categorical:
  df_clean.groupBy(column).count().show()

+-------------+-----+
|          job|count|
+-------------+-----+
|   management| 9458|
|      retired| 2264|
|      unknown|  288|
|self-employed| 1579|
|      student|  938|
|  blue-collar| 9732|
| entrepreneur| 1487|
|       admin.| 5171|
|   technician| 7597|
|     services| 4154|
|    housemaid| 1240|
|   unemployed| 1303|
+-------------+-----+

+--------+-----+
| marital|count|
+--------+-----+
|divorced| 5207|
| married|27214|
|  single|12790|
+--------+-----+

+---------+-----+
|education|count|
+---------+-----+
|  unknown| 1857|
| tertiary|13301|
|secondary|23202|
|  primary| 6851|
+---------+-----+

+-------+-----+
|default|count|
+-------+-----+
|     no|44396|
|    yes|  815|
+-------+-----+

+-------+-----+
|housing|count|
+-------+-----+
|     no|20081|
|    yes|25130|
+-------+-----+

+----+-----+
|loan|count|
+----+-----+
|  no|37967|
| yes| 7244|
+----+-----+

+---------+-----+
|  contact|count|
+---------+-----+
|  unknown|13020|
| cellular|29285|
|telephone| 2906|
+

### Clean Data Saving

In [15]:
#df_clean.toPandas().to_csv('/content/clean_data.csv')