# Audiobook App Business Case

The dataset contains information about customer purchases from an audiobook app. Each customer in the database has made at least one purchase from the app. The following is an implementation of a deep learning model, built with TensorFlow 2, that estimates the <b>likelihood of these customers making a purchase from the app again</b>. The motivation is to identify which customers are worth advertising to. Given information about customers' purchasing habits and experience, a well-trained model can flag those most likely to buy again, which lets the app company focus its ad campaign on them rather than on customers who are unlikely to return.

The data provides several important variables. There are standard attributes of the purchase, such as the price and minutes listened, which indicate how big a priority audiobooks are in the customer's life. There are also customer engagement metrics: review status and score, number of support requests, and the difference between the last app visit and the purchase date.

The data was gathered from an audiobook app and covers 2 years of customer engagement. It is contained in the loaded <code>.csv</code> file. A further 6 months of customer data was then examined to determine whether each customer made another purchase in that period. The results are the boolean targets for the model. If a customer made no purchase in the 6-month period, we assume they have moved to a different provider or stopped buying audiobooks.

In [1]:
import numpy as np
from sklearn import preprocessing
import tensorflow as tf

In [2]:
# Load the data
raw_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# Inputs: every column except the first and the last; targets: the last column
unscaled_inputs = raw_data[:,1:-1]
targets = raw_data[:,-1]

#### Data Preprocessing

In [3]:
# Shuffling
shuffled_indices = np.arange(unscaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Using shuffle indices to shuffle inputs and targets
unscaled_inputs = unscaled_inputs[shuffled_indices]
targets = targets[shuffled_indices]
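
The shuffle above relies on indexing both arrays with the same index array, so each row stays paired with its target. A minimal sketch of that idea on toy data follows. The arrays, the seed value, and the use of `np.random.default_rng` (to make the shuffle repeatable) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy stand-ins for the real arrays (hypothetical values, for illustration only)
toy_inputs = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
toy_targets = np.arange(10) % 2             # one target per sample

# A seeded generator makes the shuffle repeatable between runs
rng = np.random.default_rng(seed=42)        # the seed value is arbitrary
permutation = rng.permutation(toy_inputs.shape[0])

# Indexing both arrays with the same permutation keeps each row paired with its target
shuffled_toy_inputs = toy_inputs[permutation]
shuffled_toy_targets = toy_targets[permutation]
```

Without a seed, as in the cell above, every run produces a different order, so the split sizes stay the same but the class counts per split vary slightly from run to run.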

In [4]:
# Balance the dataset
num_true_targets = int(np.sum(targets)) # Count the targets that are 1's (customer bought again in the 6-month window)
zero_targets_counter = 0 # Counter for the number of targets that are 0's (customer didn't buy again)
indices_to_remove = [] # Indices of the input-target pairs that need to be removed to create a balanced dataset

for i in range(targets.shape[0]):
    if targets[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_true_targets:
            indices_to_remove.append(i)

# Balanced inputs and targets, with the surplus 0-target rows removed
unscaled_inputs_equal_priors = np.delete(unscaled_inputs, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets, indices_to_remove, axis=0)
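
The balancing loop keeps the first `num_true_targets` zeros it meets and marks every later zero for removal. As a sketch, the same selection can be written without an explicit loop. The helper name `balance_indices_to_remove` and the toy targets are hypothetical and only illustrate the idea.

```python
import numpy as np

def balance_indices_to_remove(targets):
    """Return indices of surplus 0-targets, keeping as many 0s as there are 1s."""
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets == 1))
    zero_positions = np.flatnonzero(targets == 0)
    # Keep the first `num_ones` zeros in order; everything after that is surplus
    return zero_positions[num_ones:]

# Hypothetical toy targets: two 1s and five 0s
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
surplus = balance_indices_to_remove(toy_targets)
balanced_targets = np.delete(toy_targets, surplus, axis=0)
```

Like the loop, this only removes zeros. If the ones outnumbered the zeros, nothing would be removed and the dataset would stay unbalanced.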

In [5]:
# Standardization using sklearn's preprocessing capabilities
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
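
With its default arguments, `preprocessing.scale` standardizes each column to zero mean and unit variance. The sketch below reproduces that computation by hand with NumPy on a small made-up matrix, to show what the call does to features on very different scales. The toy values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales
toy_features = np.array([[1.0, 100.0],
                         [2.0, 200.0],
                         [3.0, 300.0],
                         [4.0, 400.0]])

# Column-wise mean and population standard deviation (ddof=0)
column_means = toy_features.mean(axis=0)
column_stds = toy_features.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column
standardized = (toy_features - column_means) / column_stds
```

After standardization both toy columns hold identical values, which is the point: a feature such as price no longer dominates one such as review score merely because its numbers are larger.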

In [6]:
# Shuffling after scaling
# The data was collected in date order, so shuffle again to make sure it is not fed to the model in that order.
# Since we are batching, each batch should be as random a sample of the data as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle inputs and targets
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

In [7]:
# Split the datasets into training, validation and test sets

samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1776.0 3579 0.49622799664710815
246.0 447 0.5503355704697986
215.0 448 0.4799107142857143
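
The three sample counts printed above (3579, 447 and 448) sum to 4474. The sketch below wraps the same arithmetic in a small function to show how the 80/10/10 split handles the rounding remainder. The function name `split_counts` is hypothetical.

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set absorbs any rounding remainder."""
    train_count = int(train_fraction * samples_count)
    validation_count = int(validation_fraction * samples_count)
    test_count = samples_count - train_count - validation_count
    return train_count, validation_count, test_count

counts = split_counts(4474)
print(counts)  # (3579, 447, 448)
```

Because `int()` truncates, the train and validation sizes are rounded down and the test set receives whatever is left, which is why it has one more sample than the validation set here.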


In [8]:
# Saving the three datasets as .npz files (np.savez appends the .npz extension)

np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
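
The file names passed to `np.savez` above have no extension, yet the next cell loads files ending in `.npz`. The sketch below shows that round trip on toy arrays in a temporary directory. The array values and the `toy_data` file name are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for one of the datasets
toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = os.path.join(tmp_dir, 'toy_data')   # no extension given
    np.savez(base_path, inputs=toy_inputs, targets=toy_targets)

    saved_path = base_path + '.npz'                 # np.savez appended the extension
    file_was_created = os.path.exists(saved_path)

    # The keyword names used when saving become the keys used when loading
    with np.load(saved_path) as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']
```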

#### Data

In [9]:
npz = np.load('Audiobooks_data_train.npz')
# Inputs are cast to floats and targets to integers, using explicit NumPy dtypes
train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

#### Model

The model outline, loss function, early stopping and training
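
The cell below compiles the model with `sparse_categorical_crossentropy`. As a rough illustration of what that loss measures, the sketch here computes the mean negative log-probability of the true class from integer targets in plain NumPy. The function and the toy probabilities are illustrative assumptions, and the sketch omits the numerical safeguards a library implementation would typically apply.

```python
import numpy as np

def sparse_categorical_crossentropy(class_probabilities, integer_targets):
    """Mean negative log-probability assigned to the true class of each sample."""
    class_probabilities = np.asarray(class_probabilities, dtype=float)
    integer_targets = np.asarray(integer_targets, dtype=int)
    row_indices = np.arange(integer_targets.shape[0])
    # Pick, for every row, the probability the model gave to the correct class
    true_class_probabilities = class_probabilities[row_indices, integer_targets]
    return float(np.mean(-np.log(true_class_probabilities)))

# Hypothetical softmax outputs for three customers: [P(no purchase), P(purchase)]
toy_probabilities = np.array([[0.9, 0.1],
                              [0.2, 0.8],
                              [0.5, 0.5]])
toy_targets = np.array([0, 1, 1])   # integer class labels, not one-hot vectors

toy_loss = sparse_categorical_crossentropy(toy_probabilities, toy_targets)
```

The "sparse" variant takes the targets as class indices (0 or 1 here), which is why the targets loaded earlier are cast to integers and never one-hot encoded.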

In [10]:
input_size = 10
output_size = 2
hidden_layer_size = 50

model = tf.keras.Sequential([
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                            tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                            tf.keras.layers.Dense(output_size, activation='softmax')
                            ])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) # Adam is used as the optimizer
# sparse_categorical_crossentropy is chosen as the loss function because this is a classification problem and the
# function accepts integer class targets directly, so we do not have to one-hot encode the targets ourselves.

# Hyperparameters

batch_size = 100
max_epochs = 100

# Early stopping mechanism. A patience of 2 gives slight tolerance against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# Training
model.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs = max_epochs, 
          callbacks=[early_stopping], # Keras invokes callbacks at set points during training, e.g. after each epoch
          validation_data=(validation_inputs, validation_targets), 
          verbose=2)

Epoch 1/100
36/36 - 0s - loss: 0.5931 - accuracy: 0.6683 - val_loss: 0.5052 - val_accuracy: 0.7539
Epoch 2/100
36/36 - 0s - loss: 0.4781 - accuracy: 0.7575 - val_loss: 0.4466 - val_accuracy: 0.7808
Epoch 3/100
36/36 - 0s - loss: 0.4341 - accuracy: 0.7712 - val_loss: 0.4123 - val_accuracy: 0.7987
Epoch 4/100
36/36 - 0s - loss: 0.4085 - accuracy: 0.7849 - val_loss: 0.3913 - val_accuracy: 0.8166
Epoch 5/100
36/36 - 0s - loss: 0.3931 - accuracy: 0.7952 - val_loss: 0.3786 - val_accuracy: 0.8255
Epoch 6/100
36/36 - 0s - loss: 0.3827 - accuracy: 0.7969 - val_loss: 0.3729 - val_accuracy: 0.8210
Epoch 7/100
36/36 - 0s - loss: 0.3761 - accuracy: 0.7999 - val_loss: 0.3658 - val_accuracy: 0.8166
Epoch 8/100
36/36 - 0s - loss: 0.3714 - accuracy: 0.8036 - val_loss: 0.3665 - val_accuracy: 0.8166
Epoch 9/100
36/36 - 0s - loss: 0.3691 - accuracy: 0.8019 - val_loss: 0.3605 - val_accuracy: 0.8143
Epoch 10/100
36/36 - 0s - loss: 0.3649 - accuracy: 0.8011 - val_loss: 0.3556 - val_accuracy: 0.8277
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x24bc4b78ac8>
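
To see why the single increase in validation loss at epoch 8 did not end training, the patience rule can be simulated in a few lines. This is a simplified model of the behavior, assuming the monitored quantity is the validation loss and that any decrease counts as an improvement. The function name `first_stopping_epoch` is hypothetical and is not part of Keras.

```python
def first_stopping_epoch(validation_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation losses for epochs 1-10, copied from the training log above
logged_val_losses = [0.5052, 0.4466, 0.4123, 0.3913, 0.3786,
                     0.3729, 0.3658, 0.3665, 0.3605, 0.3556]
stop_in_log = first_stopping_epoch(logged_val_losses)

# Hypothetical sequence with two consecutive non-improving epochs
stop_in_toy = first_stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30])
```

For the logged values the function returns `None`: epoch 8 is worse than epoch 7, but epoch 9 improves again and resets the counter. The toy sequence stops at epoch 4, before it ever reaches the better value in epoch 5, which is the trade-off a small patience makes.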

#### Testing the Model

Test the predictive power of the model by introducing it to test data it has never encountered before. 

In [11]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [12]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.35. Test accuracy: 82.14%
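
The accuracy reported above is the fraction of test customers whose predicted class matches the target. The sketch below shows that calculation on made-up softmax outputs. The toy probabilities and targets are assumptions for illustration, not values from the trained model.

```python
import numpy as np

# Hypothetical softmax outputs for four customers: [P(no purchase), P(purchase)]
toy_probabilities = np.array([[0.7, 0.3],
                              [0.4, 0.6],
                              [0.1, 0.9],
                              [0.6, 0.4]])
toy_targets = np.array([0, 1, 0, 0])

# The predicted class is the one with the highest probability
predicted_classes = np.argmax(toy_probabilities, axis=1)

# Accuracy is the share of predictions that match the targets
toy_accuracy = float(np.mean(predicted_classes == toy_targets))
print(toy_accuracy)  # 0.75
```

With the trained model, the probabilities would come from `model.predict(test_inputs)`. Sorting customers by the second column would then give a ranking by predicted likelihood of buying again, which is the quantity the ad campaign is meant to target.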
