# Audiobooks business case

### Problem

You are given data from an Audiobook App. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made a purchase at least once, that's why he/she is in the database. 

Create a machine learning algorithm based on the available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her.

If the company can focus its efforts SOLELY on customers that are likely to convert again, it can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, ), Book length overall (sum of the minute length of all purchases), Book length avg (average length in minutes of all purchases), Price paid_overall (sum of all purchases) ,Price Paid avg (average of all purchases), Review (a Boolean variable whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10, Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

These are the inputs (excluding customer ID, as it is completely arbitrary. It's more like a name, than a number).

The targets are a Boolean variable (0 or 1). We are taking a period of 2 years in our inputs, and the next 6 months as targets. So, in fact, we are predicting if: based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

Good luck!

## Importing Relevant libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing 
import tensorflow as tf

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


## Loading data

### Inspecting the Dataset with pandas

In [2]:
data_inspection = pd.read_csv('Audiobooks_data.csv')
data_inspection

Unnamed: 0,00994,1620,1620.1,19.73,19.73.1,1,10.00,0.99,1603.80,5,92,0
0,1143,2160.0,2160,5.33,5.33,0,8.91,0.00,0.0,0,0,0
1,2059,2160.0,2160,5.33,5.33,0,8.91,0.00,0.0,0,388,0
2,2882,1620.0,1620,5.96,5.96,0,8.91,0.42,680.4,1,129,0
3,3342,2160.0,2160,5.33,5.33,0,8.91,0.22,475.2,0,361,0
4,3416,2160.0,2160,4.61,4.61,0,8.91,0.00,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
14078,28220,1620.0,1620,5.33,5.33,1,9.00,0.61,988.2,0,4,0
14079,28671,1080.0,1080,6.55,6.55,1,6.00,0.29,313.2,0,29,0
14080,31134,2160.0,2160,6.14,6.14,0,8.91,0.00,0.0,0,0,0
14081,32832,1620.0,1620,5.33,5.33,1,8.00,0.38,615.6,0,90,0


The first column contains customer IDs which are not usefull and the last column contains the targerts 0 or 1 specifying if the customer will convert(come back or not.

So these two columns should be removed


### Extracting the inputs and targets

In [3]:
raw_data = np.loadtxt('Audiobooks_data.csv', delimiter=',') # Load the data from the CSV file
inputs = raw_data[:, 1:-1] # Get all the rows from the second column to the column just before the last
targets = raw_data[:,-1] # Get all rows from last column

## Balancing the Dataset

From the observation of the dataset, there are far more 0 targets than 1s. This may biased the results in a certain way. 

So the best thing thing to do is to balance the dataset by having the same number of 0s and 1s.

In [4]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets==1))
num_one_targets

2237

In [5]:
# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# Creating a "balanced" dataset, so to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

In [6]:
for i in range(targets.shape[0]): # Loop through the whole sets of targets
    if targets[i] == 0: # If target is 0
        zero_targets_counter += 1 # increase the zero counter by 1
        if zero_targets_counter > num_one_targets: # If number of zeros is greater than number of 1s
            indices_to_remove.append(i)# get the indices of the remaining targerts of 0

In [7]:
# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# Delete all indices that were marked "to remove" in the loop above.

inputs_equal = np.delete(inputs, indices_to_remove, axis=0) #delete all input rows at indices_to_remove
targets_equal = np.delete(targets, indices_to_remove, axis=0) #delete all target rows at indices_to_remove

All the  inputs and targets have been balanced.

The inputs are of different magnitudes so scaling(standardization) will bring them to same magnitude

## Standardizing the inputs

In [8]:
scaled_inputs = preprocessing .scale(inputs_equal)

## Shuffling the dataset

**Shuffling will optimize batching.**

The collected was arranged by date and this makes it quite homogeneous

Shuffling the data so that it is not arranged in any way.

When batching, data should be randomly spread out as possible

### Get the indices of the scaled inputs and shuffle them

In [9]:
shuffled_indices = np.arange(scaled_inputs.shape[0]) # Get the indices of the input data
np.random.shuffle(shuffled_indices) # Shuffle these indices

### Shuffle the scaled inputs data by using their indices

In [10]:
shuffled_scaled_inputs = scaled_inputs[shuffled_indices] # Shuffle the scaled inputs
shuffled_targets = targets[shuffled_indices] # Shuffle the targets

## Split the dataset into train, validation, and test

Using the 80, 10, 10 split (Train, Validation, and Test)

In [11]:
total_number_of_samples = shuffled_scaled_inputs.shape[0]

train_samples_count = int(0.8 * total_number_of_samples)
validation_samples_count = int(0.1 * total_number_of_samples)
test_samples_count = total_number_of_samples - (train_samples_count - validation_samples_count)

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_scaled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, folllowing the "train_samples_count" we already assigned
validation_inputs = shuffled_scaled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_scaled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)



700.0 3579 0.19558535903883767
105.0 447 0.2348993288590604
82.0 1342 0.06110283159463487


In [12]:
# from sklearn.model_selection import train_test_split

# train_inputs_temp, test_imputs, train_targets_temp, test_targets = train_test_split(shuffled_scaled_inputs, shuffled_targets, test_size=0.1, random_state=365) # Get the test_imputs and test_targets


# train_inputs, validation_imputs, train_targets, validation_targets = train_test_split(train_inputs_temp, train_targets_temp, test_size=0.1, random_state=365) # Get the train_inputs & validation_imputs and train_targets & validation_targets


# # Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
# print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
# print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
# print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

### Save the three datasets in *.npz file

Saving the three datasets in separate .npz files with very semantic filemanes

In [13]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

It is advisable to use a different notebook to work with the Deep Neural network part of this exercise but will continue with this same notebook to get a chronological work-flow of the whole exercise

# Creating the Machine Learning Algorithm

## Importing and splitting the various datasets

### Training dataset

In [14]:
npz = np.load('Audiobooks_data_train.npz')

train_inputs = npz['inputs'].astype(np.float)
train_targets = npz['targets'].astype(np.int)

### Validation Dataset

In [15]:
npz = np.load('Audiobooks_data_validation.npz')

validation_inputs = npz['inputs'].astype(np.float)
validation_targets = npz['targets'].astype(np.int)

### Testing Dataset

In [16]:
npz = np.load('Audiobooks_data_test.npz')

test_inputs = npz['inputs'].astype(np.float)
test_targets = npz['targets'].astype(np.int)

# Creating the Model

In [17]:
input_size = 10 # There are 10 predictors in the datesets
output_size = 2 # The expected output is 0  or 1
hidden_layer_size = 50


model = tf.keras.Sequential([        
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    
    # The model is a classifier hence it can be activated with a softmax activation function
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

# Define the optimizer to be used based on the problem, 
# the loss function, based on the type of encoding  
# and the metrics to be obtained at each iteration
#  sparse_categorical_crossentropy is used since the targets are one-hot encoded

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

## Training the model

In [24]:
max_epoch = 100 # Number of epochs for which the Model should be trained
batch_size = 100

# set an early stopping mechanism
# let's set patience=2, to be a bit tolerant against random validation loss increases

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# Fit the model, specifying the
# training data
# the total number of epochs
# and the validation data created in the format: (inputs,targets)

model.fit(train_inputs, 
          train_targets,
          batch_size=batch_size,
          epochs=max_epoch,  # epochs to be trained for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when a task is completed
          # task here is to check if val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data = (validation_inputs, validation_targets),
          verbose=2)

Epoch 1/100
36/36 - 0s - loss: 0.4109 - accuracy: 0.8134 - val_loss: 0.4844 - val_accuracy: 0.7651
Epoch 2/100
36/36 - 0s - loss: 0.4085 - accuracy: 0.8153 - val_loss: 0.4817 - val_accuracy: 0.7696
Epoch 3/100
36/36 - 0s - loss: 0.4060 - accuracy: 0.8156 - val_loss: 0.4864 - val_accuracy: 0.7629
Epoch 4/100
36/36 - 0s - loss: 0.4047 - accuracy: 0.8170 - val_loss: 0.4821 - val_accuracy: 0.7629


<tensorflow.python.keras.callbacks.History at 0x1f1288d6670>

# Testing the Model

**Training** 

**Validation**

**Testing**

During the training, overfitting was prevented by validating the model on the validation_data.
After the first first training, each modification of the Hyperparameters, actually overfitted the validation dataset

After training on the training data and validating on the validation data, the final prediction power of the model is gotten by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the absolute final instance, hence testing should not be done before the model has completely been adjusted.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [26]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [27]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.41. Test accuracy: 81.70%


Pure Luck!

Test Accuracy is greater than validation accuracy