# Customer Return Analysis on Audiobooks application using Neural Networks

## Problem Statement
The data comes from an audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model on this data that predicts whether a customer will buy again from the audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money advertising to them. If we focus our efforts only on customers who are likely to convert again, we can save a great deal. Moreover, the model can help identify the metrics that matter most for a customer coming back. Identifying new customers creates value and growth opportunities.

There are several features: `Customer ID`, `Book length in mins_avg` (average of all purchases), `Book length in minutes_sum` (sum of all purchases), `Price Paid_avg` (average of all purchases), `Price paid_sum` (sum of all purchases), `Review` (a Boolean variable indicating whether a review was left), `Review` (the review score, out of 10), `Total minutes listened`, `Completion` (from 0 to 1), `Support requests` (number), and `Last visited minus purchase date` (in days).

The `targets` are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months based on the previous 2 years of activity and engagement. Six months is a reasonable horizon: if customers don't convert within it, chances are they have gone to a competitor or did not like the audiobook way of digesting information.

### Import required Modules

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing

### Loading Dataset

In [5]:
raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')
raw_csv_data

array([[8.7300e+02, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00],
       [6.1100e+02, 1.4040e+03, 2.8080e+03, ..., 0.0000e+00, 1.8200e+02,
        1.0000e+00],
       [7.0500e+02, 3.2400e+02, 3.2400e+02, ..., 1.0000e+00, 3.3400e+02,
        1.0000e+00],
       ...,
       [2.8671e+04, 1.0800e+03, 1.0800e+03, ..., 0.0000e+00, 2.9000e+01,
        0.0000e+00],
       [3.1134e+04, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [3.2832e+04, 1.6200e+03, 1.6200e+03, ..., 0.0000e+00, 9.0000e+01,
        0.0000e+00]])

### Inputs and Targets

In [6]:
unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
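
The slicing above drops the first column (the customer ID) and treats the last column as the target, which leaves 10 input columns. The sketch below makes that layout explicit. The snake_case column names are hypothetical labels, the column order is assumed to follow the feature list in the problem statement, and `demo` is a placeholder array rather than real data.

```python
import numpy as np

# Hypothetical labels for the 12 CSV columns, assumed to be in the order
# the problem statement lists the features, with the target last.
COLUMN_NAMES = [
    "customer_id",
    "book_length_avg", "book_length_sum",
    "price_avg", "price_sum",
    "review_left", "review_score",
    "minutes_listened", "completion",
    "support_requests", "last_visit_minus_purchase",
    "target",
]

def split_columns(raw):
    """Return (inputs, targets), dropping the ID column as the cell above does."""
    inputs = raw[:, 1:-1]   # everything except the ID and the target
    targets = raw[:, -1]    # last column
    return inputs, targets

# Placeholder array standing in for the loaded CSV (3 rows, 12 columns).
demo = np.arange(36, dtype=float).reshape(3, 12)
demo_inputs, demo_targets = split_columns(demo)
input_names = COLUMN_NAMES[1:-1]
print(demo_inputs.shape, len(input_names))
```

The customer ID is excluded because it is an arbitrary identifier and carries no information about behavior.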

### Shuffling the data

When the data was collected, it was arranged by date. We shuffle the indices so that the data is not ordered in any way when we feed it to the model. Since we will be batching, we want the data to be as randomly spread out as possible.

In [7]:
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)

unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]
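
The shuffle above gives a different order on every run. If reproducibility matters, one option is a seeded generator. This is a minimal sketch, assuming NumPy's `default_rng` is acceptable in place of `np.random.shuffle`; the helper name `shuffle_together` and the demo arrays are illustrative only.

```python
import numpy as np

def shuffle_together(inputs, targets, seed=None):
    """Shuffle inputs and targets with the same permutation of row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(inputs.shape[0])
    return inputs[order], targets[order]

# Demo data: row i of the inputs is [2i, 2i + 1] and its target is i,
# so the pairing is easy to verify after shuffling.
demo_inputs = np.arange(12).reshape(6, 2)
demo_targets = np.arange(6)

shuffled_a, targets_a = shuffle_together(demo_inputs, demo_targets, seed=42)
shuffled_b, targets_b = shuffle_together(demo_inputs, demo_targets, seed=42)

# Rows still match their targets, and the same seed gives the same order.
print((shuffled_a[:, 0] == 2 * targets_a).all(), (targets_a == targets_b).all())
```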

### Balancing the dataset

In [8]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0

indices_to_remove = []
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
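
The loop above keeps every row with target 1 and only as many target-0 rows as there are 1s, discarding the surplus 0s. The same rule can be sketched without an explicit loop. The function name and the demo arrays below are illustrative, not part of the original notebook.

```python
import numpy as np

def balance_by_dropping_zeros(inputs, targets):
    """Keep all 1-target rows and only the first len(ones) 0-target rows."""
    num_ones = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    # Every 0-target row beyond the first num_ones is surplus.
    surplus = zero_positions[num_ones:]
    balanced_inputs = np.delete(inputs, surplus, axis=0)
    balanced_targets = np.delete(targets, surplus, axis=0)
    return balanced_inputs, balanced_targets

demo_targets = np.array([0, 1, 0, 0, 1, 0, 0])
demo_inputs = np.arange(7).reshape(7, 1)

balanced_inputs, balanced_targets = balance_by_dropping_zeros(demo_inputs, demo_targets)
print(balanced_targets)  # [0 1 0 1]
```

Because the data was shuffled first, the surplus rows that get dropped are an arbitrary subset of the 0-target customers rather than the most recent ones.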

### Standardize Inputs

In [9]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)

In [10]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
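
`preprocessing.scale` standardizes each column to zero mean and unit variance using every row it is given. Since it is called here before the split, validation and test rows contribute to those statistics. A common alternative, sketched below in plain NumPy, computes the mean and standard deviation on the training rows only and reuses them for the other subsets. The helper names and demo values are assumptions for illustration; this is not what the notebook does.

```python
import numpy as np

def fit_standardizer(train_inputs):
    """Compute per-column mean and standard deviation from training rows."""
    mean = train_inputs.mean(axis=0)
    std = train_inputs.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return mean, std

def apply_standardizer(inputs, mean, std):
    """Standardize any subset with statistics fitted on the training rows."""
    return (inputs - mean) / std

demo_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
demo_test = np.array([[3.0, 10.0]])

mean, std = fit_standardizer(demo_train)
scaled_train = apply_standardizer(demo_train, mean, std)
scaled_test = apply_standardizer(demo_test, mean, std)
print(scaled_train.mean(axis=0), scaled_test)
```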

### Train, Test and Validation split

In [11]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code, 
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole sheet, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1796.0 3579 0.5018161497625034
239.0 447 0.5346756152125279
202.0 448 0.45089285714285715
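
The proportions printed above (about 0.50, 0.53 and 0.45) drift from 50-50 because the subsets are cut from a single random shuffle. If tighter balance is wanted, one option is a stratified split that divides each class separately. The sketch below is an assumption-laden illustration: `stratified_split` is a hypothetical helper, and the demo targets are synthetic.

```python
import numpy as np

def stratified_split(targets, fractions=(0.8, 0.1), seed=None):
    """Return (train_idx, val_idx, test_idx), splitting each class separately."""
    rng = np.random.default_rng(seed)
    train_parts, val_parts, test_parts = [], [], []
    for label in np.unique(targets):
        idx = rng.permutation(np.flatnonzero(targets == label))
        n_train = int(fractions[0] * idx.size)
        n_val = int(fractions[1] * idx.size)
        train_parts.append(idx[:n_train])
        val_parts.append(idx[n_train:n_train + n_val])
        test_parts.append(idx[n_train + n_val:])  # the remainder goes to test
    # Shuffle within each subset so the classes are interleaved again.
    return tuple(
        rng.permutation(np.concatenate(parts))
        for parts in (train_parts, val_parts, test_parts)
    )

demo_targets = np.array([0] * 50 + [1] * 50)
train_idx, val_idx, test_idx = stratified_split(demo_targets, seed=0)

for idx in (train_idx, val_idx, test_idx):
    print(idx.size, demo_targets[idx].mean())
```

The returned index arrays can then select rows, for example `shuffled_inputs[train_idx]` and `shuffled_targets[train_idx]`.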


### Save the three datasets in `.npz` format

In [12]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
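
`np.savez` appends `.npz` to a filename that has no extension, which is why the files are later opened as `Audiobooks_data_train.npz` and so on. A small round trip, using a temporary directory and made-up arrays so that nothing is overwritten, shows the keyword names becoming the keys of the loaded archive:

```python
import os
import tempfile

import numpy as np

# Made-up arrays standing in for one of the three subsets.
demo_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
demo_targets = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path_without_ext = os.path.join(tmp_dir, "demo_split")
    np.savez(path_without_ext, inputs=demo_inputs, targets=demo_targets)

    saved_path = path_without_ext + ".npz"  # extension added by np.savez
    file_exists = os.path.exists(saved_path)

    with np.load(saved_path) as npz:
        stored_keys = sorted(npz.files)
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)

print(file_exists, stored_keys, loaded_targets)
```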

### Loading the saved data

In [13]:
npz = np.load('Audiobooks_data_train.npz')

# we extract the inputs using the keyword under which we saved them
# to ensure that they are all floats, let's also take care of that
# (the np.float and np.int aliases were deprecated in NumPy 1.20 and later removed, so we use the builtins)
train_inputs = npz['inputs'].astype(float)
# targets must be integer class labels, which is what sparse_categorical_crossentropy expects
train_targets = npz['targets'].astype(int)

# we load the validation data in the temporary variable
npz = np.load('Audiobooks_data_validation.npz')
# we can load the inputs and the targets in the same line
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

# we load the test data in the temporary variable
npz = np.load('Audiobooks_data_test.npz')
# we create 2 variables that will contain the test inputs and the test targets
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

### Model training

In [14]:
input_size = 10
output_size = 2
hidden_layer_size = 50
    
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

batch_size = 100
max_epochs = 100
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(
    train_inputs,
    train_targets,
    batch_size=batch_size,
    epochs=max_epochs,
    callbacks=[early_stopping],
    validation_data=(validation_inputs, validation_targets),
    verbose=2,
)

Epoch 1/100
36/36 - 4s - loss: 0.6449 - accuracy: 0.6178 - val_loss: 0.5471 - val_accuracy: 0.7293 - 4s/epoch - 108ms/step
Epoch 2/100
36/36 - 0s - loss: 0.5107 - accuracy: 0.7402 - val_loss: 0.4705 - val_accuracy: 0.7562 - 119ms/epoch - 3ms/step
Epoch 3/100
36/36 - 0s - loss: 0.4595 - accuracy: 0.7642 - val_loss: 0.4372 - val_accuracy: 0.7718 - 126ms/epoch - 3ms/step
Epoch 4/100
36/36 - 0s - loss: 0.4347 - accuracy: 0.7756 - val_loss: 0.4164 - val_accuracy: 0.7785 - 117ms/epoch - 3ms/step
Epoch 5/100
36/36 - 0s - loss: 0.4149 - accuracy: 0.7885 - val_loss: 0.4286 - val_accuracy: 0.7629 - 115ms/epoch - 3ms/step
Epoch 6/100
36/36 - 0s - loss: 0.4019 - accuracy: 0.7955 - val_loss: 0.4026 - val_accuracy: 0.7897 - 131ms/epoch - 4ms/step
Epoch 7/100
36/36 - 0s - loss: 0.3912 - accuracy: 0.8005 - val_loss: 0.4015 - val_accuracy: 0.7875 - 130ms/epoch - 4ms/step
Epoch 8/100
36/36 - 0s - loss: 0.3849 - accuracy: 0.7985 - val_loss: 0.4008 - val_accuracy: 0.7785 - 115ms/epoch - 3ms/step
Epoch 9/1

<keras.callbacks.History at 0x2bcbd438d60>
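
With `patience=2`, training stops once the validation loss has failed to improve for two consecutive epochs. The sketch below is a simplified model of that rule in plain Python, not Keras's implementation. The first list holds the `val_loss` values from the log above; the two values appended to it are made up to show a stop being triggered.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# val_loss for epochs 1-8, copied from the training log.
logged = [0.5471, 0.4705, 0.4372, 0.4164, 0.4286, 0.4026, 0.4015, 0.4008]
stop_logged = stopping_epoch(logged)

# Two made-up values appended for illustration only.
hypothetical = logged + [0.4010, 0.4012]
stop_hypothetical = stopping_epoch(hypothetical)

print(stop_logged, stop_hypothetical)  # None 10
```

Epoch 5 in the log is a single step without improvement, so the counter resets at epoch 6 and training continues. Keras's `EarlyStopping` also accepts `restore_best_weights=True`, which may be worth considering so that the final model keeps the weights from the best validation epoch.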

### Testing

In [15]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [16]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.35. Test accuracy: 82.59%
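
Accuracy alone does not show whether the errors are missed returners or wasted advertising. Since the output layer is a two-unit softmax, the predicted class is the index of the larger probability, and the four confusion counts follow from comparing it with the targets. This is a minimal sketch on made-up probabilities; in the notebook they would presumably come from `model.predict(test_inputs)`.

```python
import numpy as np

def confusion_counts(probabilities, targets):
    """Return (tp, tn, fp, fn) for a two-class softmax output."""
    predicted = np.argmax(probabilities, axis=1)
    tp = int(np.sum((predicted == 1) & (targets == 1)))
    tn = int(np.sum((predicted == 0) & (targets == 0)))
    fp = int(np.sum((predicted == 1) & (targets == 0)))
    fn = int(np.sum((predicted == 0) & (targets == 1)))
    return tp, tn, fp, fn

# Made-up softmax outputs (columns: P(class 0), P(class 1)) and targets.
demo_probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.7, 0.3],
    [0.3, 0.7],
])
demo_targets = np.array([0, 1, 0, 1, 1])

tp, tn, fp, fn = confusion_counts(demo_probabilities, demo_targets)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of targeted customers who actually return
recall = tp / (tp + fn)     # share of returning customers the model finds
print(tp, tn, fp, fn, accuracy)  # 2 1 1 1 0.6
```

For the advertising use case, precision relates to money spent on customers who will not return, while recall relates to returning customers who would be left out of the campaign.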
