In this project, the data comes from an Audiobook app. The aim is to predict customer churn using machine learning models.
Each customer in the database has made a purchase at least once. The main idea is to spend money only on targeting customers who are likely to convert again, thereby increasing sales and profitability.

The data has several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The target is a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. Six months is a reasonable window: if customers don't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

In [51]:
import pandas as pd
import numpy as np
import tensorflow as tf

# Preprocessing data

In [52]:
from sklearn import preprocessing

In [53]:
#load data
raw_df = pd.read_csv("data/Audiobooks_data.csv")

In [54]:
# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)

unscaled_inputs_df = raw_df.iloc[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_df = raw_df.iloc[:,-1]

In [55]:
unscaled_inputs = np.array(unscaled_inputs_df)
targets = np.array(targets_df)

# balancing the dataset

In [56]:
# Count how many targets are 1 (meaning that the customer did convert)
targets_one_count = int(np.sum(targets))
targets_one_count

2237

In [57]:
# we have 2237 rows with target 1 and the remaining rows with target 0
# to create a balanced dataset, we keep the first 2237 rows with target 0
# and remove the excess input/target pairs with target 0

targets_zero_count = 0

indices_to_remove = []
for i in range(targets.shape[0]):
    if targets[i] == 0:
        targets_zero_count += 1
        if targets_zero_count > targets_one_count:
            indices_to_remove.append(i)

In [58]:
len(indices_to_remove)

9609

In [59]:
#delete all indices marked to remove
unscaled_inputs_bal = np.delete(unscaled_inputs,indices_to_remove,axis = 0)
targets_bal = np.delete(targets,indices_to_remove,axis = 0)
targets_bal.shape[0]

4474
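
The loop above keeps every row with target 1 plus the first `targets_one_count` rows with target 0. A minimal vectorized sketch of the same rule follows, run on a small made-up `toy_targets` array. The helper name `balance_indices` is hypothetical and not part of the notebook.

```python
import numpy as np

def balance_indices(targets):
    """Return indices of excess zero-target rows (hypothetical helper).

    Keeps as many 0s as there are 1s, in their original order.
    """
    targets = np.asarray(targets)
    ones_count = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    # Zeros beyond the first `ones_count` are the excess rows to drop.
    return zero_positions[ones_count:]

# Toy data, made up for illustration: 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

to_remove = balance_indices(toy_targets)
toy_inputs_bal = np.delete(toy_inputs, to_remove, axis=0)
toy_targets_bal = np.delete(toy_targets, to_remove, axis=0)

print(to_remove)        # [3 5 6]
print(toy_targets_bal)  # [0 1 0 1]
```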

# standardize inputs

In [60]:
scaled_inputs  = preprocessing.scale(unscaled_inputs_bal)
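
For reference, standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand in NumPy on made-up numbers. It is assumed to match what `preprocessing.scale` computes with its default settings (per column, population standard deviation) and is meant as an illustration, not a replacement.

```python
import numpy as np

# Made-up data: 4 samples, 2 features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)  # population standard deviation (ddof=0)
toy_scaled = (toy - col_mean) / col_std

# After scaling, every column has mean ~0 and standard deviation ~1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

Note that the statistics here are computed over all balanced rows before the split, so validation and test rows influence the scaling. Computing them on the training rows only is a common alternative.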

# Shuffle data

In [61]:
indices_shuffle = np.arange(targets_bal.shape[0])
np.random.shuffle(indices_shuffle)

In [62]:
inputs_shuffle = scaled_inputs[indices_shuffle]
targets_shuffle  = targets_bal[indices_shuffle]

In [63]:
inputs_shuffle.shape[0]

4474
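
The shuffle above produces a different order on every run, so the per-split class counts printed later will vary between runs. Below is a sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 42. Both are choices made for this example, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, assumed for illustration

# Made-up stand-ins for scaled_inputs and targets_bal
toy_inputs = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 1, 0, 1])

# One permutation applied to both arrays keeps input/target pairs aligned
perm = rng.permutation(toy_targets.shape[0])
toy_inputs_shuffle = toy_inputs[perm]
toy_targets_shuffle = toy_targets[perm]
```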

# split data into train, validation and test sets

In [64]:
# Count the total number of samples
samples_count = inputs_shuffle.shape[0]

In [65]:
# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_count = int(0.8 * samples_count)
validation_count = int(0.1 * samples_count)
# The 'test' dataset contains all remaining data.
test_count = samples_count - train_count - validation_count

In [66]:
train_count

3579

In [67]:
train_input = inputs_shuffle[:train_count]
train_target = targets_shuffle[:train_count]

val_input = inputs_shuffle[train_count:train_count+validation_count]
val_target = targets_shuffle[train_count:train_count+validation_count]

test_input = inputs_shuffle[train_count+validation_count:]
test_target = targets_shuffle[train_count+validation_count:]

In [68]:
# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_target), train_count, np.sum(train_target) / train_count)
print(np.sum(val_target), validation_count, np.sum(val_target) / validation_count)
print(np.sum(test_target), test_count, np.sum(test_target) / test_count)

1768 3579 0.49399273540095
221 447 0.49440715883668906
248 448 0.5535714285714286
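
The counts follow directly from the 80-10-10 rule applied to the 4474 balanced samples. Here is a small sketch of the arithmetic; the function name `split_counts` is hypothetical.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; test takes whatever remains."""
    train_count = int(train_frac * samples_count)
    validation_count = int(validation_frac * samples_count)
    test_count = samples_count - train_count - validation_count
    return train_count, validation_count, test_count

print(split_counts(4474))  # (3579, 447, 448)
```

Because `int` truncates, the rounding remainder ends up in the test set, which is why it has 448 rows rather than 447.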


# Save the three datasets in *.npz.


In [69]:
np.savez('Audiobooks_traindata', inputs=train_input, targets=train_target)
np.savez('Audiobooks_valdata', inputs=val_input, targets=val_target)
np.savez('Audiobooks_testdata', inputs=test_input, targets=test_target)

# load data

In [70]:
# let's create a temporary variable npz, where we will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_traindata.npz')

# we extract the inputs using the keyword under which we saved them
# to ensure that they are all floats, let's also take care of that
train_inputs = npz['inputs'].astype(float)
# targets must be int because of sparse_categorical_crossentropy (we want to be able to smoothly one-hot encode them)
train_targets = npz['targets'].astype(int)

# we load the validation data in the temporary variable
npz = np.load('Audiobooks_valdata.npz')
# we can load the inputs and the targets in the same line
val_inputs, val_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

# we load the test data in the temporary variable
npz = np.load('Audiobooks_testdata.npz')
# we create 2 variables that will contain the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)
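
`np.savez` appends the `.npz` extension when the file name lacks one, which is why the files are saved as `Audiobooks_traindata` but loaded as `Audiobooks_traindata.npz`. A self-contained round-trip sketch, using a temporary directory and made-up arrays rather than the real data:

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for one of the three subsets
toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')  # no extension given
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The context manager closes the file before the directory is removed
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)

print(loaded_inputs.dtype, loaded_targets.dtype)
```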

# Model

In [71]:
# Set the input and output sizes
input_size = 10
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50
    
# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
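
To make the comment `output = activation(dot(input, weight) + bias)` concrete, here is a NumPy sketch of one forward pass through a stack of the same shape: 10 inputs, two hidden layers of 50 ReLU units, and 2 softmax outputs. The weights are random placeholders, so this only illustrates the arithmetic. It is not how Keras stores or initializes its parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def dense(inputs, weights, bias, activation):
    # output = activation(dot(input, weight) + bias)
    return activation(inputs @ weights + bias)

input_size, hidden_layer_size, output_size = 10, 50, 2

# Random placeholder parameters (not trained values)
w1 = rng.normal(scale=0.1, size=(input_size, hidden_layer_size))
b1 = np.zeros(hidden_layer_size)
w2 = rng.normal(scale=0.1, size=(hidden_layer_size, hidden_layer_size))
b2 = np.zeros(hidden_layer_size)
w3 = rng.normal(scale=0.1, size=(hidden_layer_size, output_size))
b3 = np.zeros(output_size)

batch = rng.normal(size=(4, input_size))  # 4 made-up samples
h1 = dense(batch, w1, b1, relu)           # 1st hidden layer
h2 = dense(h1, w2, b2, relu)              # 2nd hidden layer
probs = dense(h2, w3, b3, softmax)        # output layer

print(probs.shape)        # (4, 2)
print(probs.sum(axis=1))  # each row sums to 1
```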

In [72]:
# optimizer - adam
# loss function - sparse_categorical_crossentropy
# output function - softmax
# early stopping mechanism
# batch size = 50
# maximum epochs = 200
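
With integer targets, `sparse_categorical_crossentropy` amounts to taking the negative log of the probability the model assigns to the true class, averaged over the batch. Below is a NumPy sketch on made-up probabilities. The function `sparse_cce` is illustrative only and omits details of the Keras implementation, such as clipping probabilities away from zero.

```python
import numpy as np

def sparse_cce(targets, probs):
    """Mean negative log-probability of the true class (illustrative sketch)."""
    targets = np.asarray(targets)
    probs = np.asarray(probs)
    # Pick, for each row, the probability assigned to its integer target
    true_class_probs = probs[np.arange(targets.shape[0]), targets]
    return float(np.mean(-np.log(true_class_probs)))

# Made-up softmax outputs for 3 samples and their integer targets
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5]])
toy_targets = np.array([0, 1, 0])

loss = sparse_cce(toy_targets, toy_probs)
print(round(loss, 4))
```

This is also why the targets are cast to `int`: they are used as class indices rather than as one-hot vectors.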

In [73]:
### Choose the optimizer and the loss function

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# set the batch size
batch_size = 50

# set a maximum number of training epochs
max_epochs = 200

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
# note that the train and validation data are passed as plain NumPy arrays, not as iterable datasets
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called at certain points during training (here, at the end of each epoch)
          # this one checks whether val_loss has stopped improving
          callbacks=[early_stopping], # early stopping
          validation_data=(val_inputs, val_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  

Epoch 1/200
72/72 - 1s - loss: 0.5070 - accuracy: 0.7329 - val_loss: 0.4586 - val_accuracy: 0.7427 - 1s/epoch - 16ms/step
Epoch 2/200
72/72 - 0s - loss: 0.4002 - accuracy: 0.7879 - val_loss: 0.4019 - val_accuracy: 0.7897 - 160ms/epoch - 2ms/step
Epoch 3/200
72/72 - 0s - loss: 0.3728 - accuracy: 0.7960 - val_loss: 0.3865 - val_accuracy: 0.7830 - 134ms/epoch - 2ms/step
Epoch 4/200
72/72 - 0s - loss: 0.3613 - accuracy: 0.8019 - val_loss: 0.3839 - val_accuracy: 0.8009 - 139ms/epoch - 2ms/step
Epoch 5/200
72/72 - 0s - loss: 0.3520 - accuracy: 0.8053 - val_loss: 0.3641 - val_accuracy: 0.8009 - 145ms/epoch - 2ms/step
Epoch 6/200
72/72 - 0s - loss: 0.3459 - accuracy: 0.8103 - val_loss: 0.3632 - val_accuracy: 0.8121 - 150ms/epoch - 2ms/step
Epoch 7/200
72/72 - 0s - loss: 0.3405 - accuracy: 0.8100 - val_loss: 0.3719 - val_accuracy: 0.7875 - 137ms/epoch - 2ms/step
Epoch 8/200
72/72 - 0s - loss: 0.3377 - accuracy: 0.8198 - val_loss: 0.3703 - val_accuracy: 0.7942 - 133ms/epoch - 2ms/step


<keras.src.callbacks.History at 0x2045d7e20a0>
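
Training stopped after epoch 8 because the validation loss had not improved for two consecutive epochs. The patience rule can be sketched in plain Python and replayed on the `val_loss` values from the log above. This is a simplified illustration of the idea, not the Keras implementation, and `stopping_epoch` is a hypothetical name.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            best = val_loss   # new best: reset the counter
            wait = 0
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from the training log above
logged_val_losses = [0.4586, 0.4019, 0.3865, 0.3839, 0.3641, 0.3632, 0.3719, 0.3703]
print(stopping_epoch(logged_val_losses))  # 8
```

The lowest validation loss is at epoch 6, yet the model ends up with the epoch 8 weights, since `EarlyStopping` does not restore the best weights unless `restore_best_weights=True` is passed.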

# test data

In [74]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [75]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.34. Test accuracy: 84.15%
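
The accuracy reported by `model.evaluate` is the share of samples whose highest-probability class matches the target. A NumPy sketch on made-up softmax outputs follows; in the notebook these would come from `model.predict(test_inputs)`.

```python
import numpy as np

# Made-up softmax outputs for 4 samples and their true classes
toy_probs = np.array([[0.8, 0.2],
                      [0.3, 0.7],
                      [0.6, 0.4],
                      [0.1, 0.9]])
toy_targets = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability
predicted_classes = np.argmax(toy_probs, axis=1)
accuracy = np.mean(predicted_classes == toy_targets)

print(predicted_classes, accuracy)  # [0 1 0 1] 0.75
```

Since the test set is roughly balanced (about 55% ones in the run above), always guessing one class would score about 55% at best, so 84.15% is a substantial improvement over that baseline.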
