# Audiobook Purchase Classification

## Problem

We are given a .csv file with data from an audiobook app. All of the data relate to the audio versions of books only. Each row in the dataset represents a customer who has made at least one purchase.

There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

We want to know if a customer will buy from the company again. The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them.

We want to create a machine learning algorithm, based on our available data, that can predict if a customer will buy again from the Audiobook company. If we focus our efforts only on customers that are likely to convert again, we can increase savings and improve profitability. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

The inputs to the model will be all of these variables except the customer ID, which is completely arbitrary. It works more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months is a reasonable horizon: if a customer has not converted by then, chances are they have gone to a competitor or did not like the audiobook format. This is a classification problem with two classes, won't buy and will buy, represented by 0s and 1s.


## Import the relevant libraries

In [2]:
import numpy as np
# Using the sklearn preprocessing library will make it easier to standardize the data.
from sklearn import preprocessing
import tensorflow as tf

## Preprocess the data

### Extract data from csv

In [3]:
# Load the data
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which contains the targets)
unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column
targets_all = raw_csv_data[:,-1]

### Balance the dataset

In [4]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

# Count the number of targets that are 0. 
# Once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
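
The loop above can also be written without an explicit counter. The following is a minimal sketch on toy arrays (the values are hypothetical, not taken from the real dataset); it assumes the same rule as the loop: keep the first `num_ones` zero-targets and drop the rest.

```python
import numpy as np

# Toy targets standing in for targets_all (hypothetical values).
targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
# Toy inputs: one row per target, two features each.
inputs = np.arange(targets.shape[0] * 2).reshape(-1, 2)

# Number of targets that are 1.
num_ones = int(np.sum(targets))

# Positions of every 0-target, in their original order.
zero_positions = np.flatnonzero(targets == 0)

# Keep the first num_ones zeros; every zero after that is surplus.
indices_to_remove = zero_positions[num_ones:]

# Drop the surplus rows from both arrays so inputs and targets stay paired.
balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)

print(balanced_targets)
```

Like the loop, this keeps the earliest zero-targets and discards the later ones, so the selection still depends on the original row order.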

### Standardize the inputs

In [5]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
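
For intuition, `preprocessing.scale` with its default arguments subtracts each column's mean and divides by that column's standard deviation. A rough NumPy equivalent on a hypothetical toy matrix might look like this (it ignores the special handling sklearn applies to constant columns):

```python
import numpy as np

# Hypothetical toy feature matrix: 4 customers, 2 features.
x = np.array([[10.0, 1.0],
              [20.0, 3.0],
              [30.0, 5.0],
              [40.0, 7.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = x.mean(axis=0)
std = x.std(axis=0)

# Standardize: each column ends up with mean 0 and standard deviation 1.
x_scaled = (x - mean) / std

print(x_scaled.mean(axis=0))
print(x_scaled.std(axis=0))
```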

### Shuffle the data
The data were collected in date order, so the rows arrive sorted by date. We shuffle the indices so that the data are in no particular order when we feed them to the model. Since we will be batching, we want the data to be as randomly spread out as possible.

In [6]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
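
If repeatable results are wanted, one option is to shuffle with a seeded generator. This is only a sketch: the arrays are hypothetical stand-ins for `scaled_inputs` and `targets_equal_priors`, and the seed value is arbitrary.

```python
import numpy as np

# Hypothetical stand-ins for scaled_inputs and targets_equal_priors.
# Row i of inputs is [2*i, 2*i + 1]; its target is i % 2.
inputs = np.arange(12, dtype=float).reshape(6, 2)
targets = np.array([0, 1, 0, 1, 0, 1])

# A seeded generator makes the shuffle repeatable across runs.
rng = np.random.default_rng(42)
shuffled_indices = rng.permutation(inputs.shape[0])

# Index both arrays with the same permutation so rows stay paired.
shuffled_inputs = inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]
```

Because both arrays are indexed with the same permutation, every input row keeps its original target.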

### Split the dataset into train, validation, and test

In [7]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time we rerun this code, 
# we will get different values, as each time they are shuffled randomly.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1791.0 3579 0.5004191114836547
229.0 447 0.5123042505592841
217.0 448 0.484375
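
The subset sizes printed above follow directly from the integer arithmetic in the cell. As a small sketch, the helper below (a hypothetical function, not part of the notebook) reproduces them, assuming the balanced dataset has 4474 rows, which is the sum of the three printed counts.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test


# 4474 is inferred from the printed counts: 3579 + 447 + 448.
print(split_counts(4474))
```

Giving the remainder to the test set guarantees that the three subsets always add up to the full sample count, even when the fractions do not divide it evenly.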


## Create the machine learning algorithm



### Model
Outline, optimizers, loss, early stopping and training

In [71]:
# Set the input and output sizes
input_size = 10
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50
    
# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])


### Choose the optimizer and the loss function

# Define the optimizer we'd like to use, 
# the loss function, 
# and the metrics we are interested in obtaining at each iteration
#custom_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#model.compile(optimizer=custom_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

### Training
# That's where we train the model we have built.

# set the batch size
batch_size = 100

# set a maximum number of training epochs
max_epochs = 100

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
# note that this time the train, validation and test data are not iterable
model.fit(train_inputs, # train inputs
          train_targets, # train targets
          batch_size=batch_size, # batch size
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions that Keras calls at set points during training (here, at the end of each epoch)
          # this one checks whether val_loss has stopped improving
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), # validation data
          verbose = 2 # making sure we get enough information about the training process
          )  

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 0s - loss: 0.5511 - accuracy: 0.7902 - val_loss: 0.4388 - val_accuracy: 0.8635
Epoch 2/100
3579/3579 - 0s - loss: 0.3615 - accuracy: 0.8840 - val_loss: 0.3598 - val_accuracy: 0.8680
Epoch 3/100
3579/3579 - 0s - loss: 0.3103 - accuracy: 0.8894 - val_loss: 0.3412 - val_accuracy: 0.8792
Epoch 4/100
3579/3579 - 0s - loss: 0.2915 - accuracy: 0.8955 - val_loss: 0.3232 - val_accuracy: 0.8926
Epoch 5/100
3579/3579 - 0s - loss: 0.2791 - accuracy: 0.8989 - val_loss: 0.3089 - val_accuracy: 0.8926
Epoch 6/100
3579/3579 - 0s - loss: 0.2691 - accuracy: 0.8989 - val_loss: 0.3090 - val_accuracy: 0.8949
Epoch 7/100
3579/3579 - 0s - loss: 0.2621 - accuracy: 0.9011 - val_loss: 0.3026 - val_accuracy: 0.8971
Epoch 8/100
3579/3579 - 0s - loss: 0.2570 - accuracy: 0.9042 - val_loss: 0.3129 - val_accuracy: 0.8926
Epoch 9/100
3579/3579 - 0s - loss: 0.2536 - accuracy: 0.9053 - val_loss: 0.3009 - val_accuracy: 0.8926
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x1aa8fec190>
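
The early stopping rule can be sketched in plain Python. This is a simplified illustration of the patience logic, not the Keras implementation: training stops once the validation loss has failed to improve on its best value for `patience` consecutive epochs. The function name `stopping_epoch` is hypothetical.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best:
            # New best validation loss: reset the patience counter.
            best = val_loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses from the first nine epochs of the log above.
logged = [0.4388, 0.3598, 0.3412, 0.3232, 0.3089, 0.3090, 0.3026, 0.3129, 0.3009]
print(stopping_epoch(logged))

# A hypothetical sequence in which the loss fails to improve twice in a row.
print(stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.30]))
```

In the logged run, every epoch without improvement is followed by a new best value, so the counter never reaches 2 within the nine epochs shown.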

## Test the model

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

In [72]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [73]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.27. Test accuracy: 89.06%


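Using the model and hyperparameters given in this notebook, the final test accuracy should be roughly 91%. The run shown above reached 89.06%. Results vary from run to run because the data are shuffled randomly each time.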
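
For reference, the accuracy reported by `model.evaluate` is the fraction of samples whose most probable class matches the target. A minimal sketch with hypothetical softmax outputs:

```python
import numpy as np

# Hypothetical softmax outputs for 4 customers (columns: class 0, class 1).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
# Hypothetical true targets for the same customers.
targets = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability.
predicted = probs.argmax(axis=1)

# Accuracy is the share of predictions that match the targets.
accuracy = (predicted == targets).mean()
print(accuracy)
```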