### Audiobooks project - Machine Learning part

#### Problem

You are given data from an Audiobook App. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made a purchase at least once, that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, ), Book length overall (sum of the minute length of all purchases), Book length avg (average length in minutes of all purchases), Price paid_overall (sum of all purchases) ,Price Paid avg (average of all purchases), Review (a Boolean variable whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10, Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

These are the inputs (excluding customer ID, as it is completely arbitrary. It's more like a name, than a number).

The targets are a Boolean variable (0 or 1). We are taking a period of 2 years in our inputs, and the next 6 months as targets. So, in fact, we are predicting if: based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information. 

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again. 

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

In [1]:
# Importing libraries
import numpy as np
import tensorflow as tf

In [4]:
# Loading the data
npz = np.load('data/audiobooks_data_train_v1.npz')
train_inputs = npz['inputs'].astype(float)
train_targets = npz['targets'].astype(int)

npz = np.load('data/audiobooks_data_validation_v1.npz')
#We have to make sure that all inputs are floats and targets are int so we add .astype() to every each of them
validation_inputs = npz['inputs'].astype(float)
validation_targets = npz['targets'].astype(int)

npz = np.load('data/audiobooks_data_test_v1.npz')
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

### NN Model

In [5]:
# The data is preprocessed in the proper way already so we don't have to use Flatten as the first layer as for example
# in the MNIST project.

input_size = 10 # 10 predictors
output_size = 2 # output is 0 or 1
hidden_layer_size = 50

model = tf.keras.Sequential([
                             tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                             tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                             tf.keras.layers.Dense(hidden_layer_size, activation='softmax')
                            ])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# sparse_categorical_crossentropy applies one-hot encoding to the targets

batch_size = 100
max_epochs = 100

model.fit(train_inputs,
          train_targets,
          batch_size = batch_size,
          epochs = max_epochs,
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Epoch 1/100
36/36 - 1s - loss: 3.2485 - accuracy: 0.4269 - val_loss: 2.2212 - val_accuracy: 0.6823
Epoch 2/100
36/36 - 0s - loss: 1.2524 - accuracy: 0.7304 - val_loss: 0.6143 - val_accuracy: 0.7606
Epoch 3/100
36/36 - 0s - loss: 0.5412 - accuracy: 0.7603 - val_loss: 0.4777 - val_accuracy: 0.7763
Epoch 4/100
36/36 - 0s - loss: 0.4668 - accuracy: 0.7673 - val_loss: 0.4285 - val_accuracy: 0.7808
Epoch 5/100
36/36 - 0s - loss: 0.4354 - accuracy: 0.7759 - val_loss: 0.4147 - val_accuracy: 0.7763
Epoch 6/100
36/36 - 0s - loss: 0.4184 - accuracy: 0.7812 - val_loss: 0.3930 - val_accuracy: 0.7964
Epoch 7/100
36/36 - 0s - loss: 0.4086 - accuracy: 0.7863 - val_loss: 0.3880 - val_accuracy: 0.7897
Epoch 8/100
36/36 - 0s - loss: 0.3984 - accuracy: 0.7980 - val_loss: 0.3770 - val_accuracy: 0.7964
Epoch 9/100
36/36 - 0s - loss: 0.3915 - accuracy: 0.7927 - val_loss: 0.3684 - val_accuracy: 0.7942
Epoch 10/100
36/36 - 0s - loss: 0.3853 - accuracy: 0.7994 - val_loss: 0.3676 - val_accuracy: 0.7919
Epoch 11/

36/36 - 0s - loss: 0.3371 - accuracy: 0.8175 - val_loss: 0.3531 - val_accuracy: 0.8166
Epoch 84/100
36/36 - 0s - loss: 0.3380 - accuracy: 0.8198 - val_loss: 0.3432 - val_accuracy: 0.8121
Epoch 85/100
36/36 - 0s - loss: 0.3376 - accuracy: 0.8159 - val_loss: 0.3424 - val_accuracy: 0.8188
Epoch 86/100
36/36 - 0s - loss: 0.3368 - accuracy: 0.8203 - val_loss: 0.3472 - val_accuracy: 0.8076
Epoch 87/100
36/36 - 0s - loss: 0.3369 - accuracy: 0.8175 - val_loss: 0.3425 - val_accuracy: 0.8166
Epoch 88/100
36/36 - 0s - loss: 0.3360 - accuracy: 0.8198 - val_loss: 0.3392 - val_accuracy: 0.8166
Epoch 89/100
36/36 - 0s - loss: 0.3389 - accuracy: 0.8136 - val_loss: 0.3392 - val_accuracy: 0.8233
Epoch 90/100
36/36 - 0s - loss: 0.3395 - accuracy: 0.8231 - val_loss: 0.3425 - val_accuracy: 0.8188
Epoch 91/100
36/36 - 0s - loss: 0.3371 - accuracy: 0.8178 - val_loss: 0.3408 - val_accuracy: 0.8166
Epoch 92/100
36/36 - 0s - loss: 0.3390 - accuracy: 0.8161 - val_loss: 0.3446 - val_accuracy: 0.8143
Epoch 93/100


<keras.callbacks.History at 0x259bdef8460>

Even though the loss was mainly decreasing and we were getting higher accuracy which is good, the validation loss was decreasing at one time and increasing at some points. This shows us that our model was `overfitting`. At some point we should stop the model not to get overfitted.

In [6]:
# We can retrain the model using early_stopping callback
# I could just put it in the code above but this way there will be sth to compare it to...

input_size = 10 # 10 predictors
output_size = 2 # output is 0 or 1
hidden_layer_size = 50

model = tf.keras.Sequential([
                             tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                             tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
                             tf.keras.layers.Dense(hidden_layer_size, activation='softmax')
                            ])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# sparse_categorical_crossentropy applies one-hot encoding to the targets

batch_size = 100
max_epochs = 100

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs,
          train_targets,
          batch_size = batch_size,
          epochs = max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Epoch 1/100
36/36 - 1s - loss: 2.9223 - accuracy: 0.4739 - val_loss: 1.8570 - val_accuracy: 0.6689
Epoch 2/100
36/36 - 0s - loss: 1.0958 - accuracy: 0.6764 - val_loss: 0.5869 - val_accuracy: 0.7651
Epoch 3/100
36/36 - 0s - loss: 0.5246 - accuracy: 0.7578 - val_loss: 0.4618 - val_accuracy: 0.7606
Epoch 4/100
36/36 - 0s - loss: 0.4563 - accuracy: 0.7709 - val_loss: 0.4224 - val_accuracy: 0.7852
Epoch 5/100
36/36 - 0s - loss: 0.4291 - accuracy: 0.7793 - val_loss: 0.4035 - val_accuracy: 0.7897
Epoch 6/100
36/36 - 0s - loss: 0.4132 - accuracy: 0.7835 - val_loss: 0.3871 - val_accuracy: 0.7875
Epoch 7/100
36/36 - 0s - loss: 0.4030 - accuracy: 0.7893 - val_loss: 0.3720 - val_accuracy: 0.8031
Epoch 8/100
36/36 - 0s - loss: 0.3941 - accuracy: 0.7918 - val_loss: 0.3690 - val_accuracy: 0.8098
Epoch 9/100
36/36 - 0s - loss: 0.3877 - accuracy: 0.7949 - val_loss: 0.3634 - val_accuracy: 0.7987
Epoch 10/100
36/36 - 0s - loss: 0.3830 - accuracy: 0.7974 - val_loss: 0.3562 - val_accuracy: 0.8054
Epoch 11/

<keras.callbacks.History at 0x259c05bc790>

### Testing the model

In [7]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [8]:
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100))

Test loss: 0.36. Test accuracy: 81.03%
