## Problem

You are given data from an Audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to create a machine learning algorithm based on our available data that can predict whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts ONLY on customers who are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

Good luck!

# Preprocess the data: balance the dataset, create three datasets (training, validation and test), and save them in a tensor-friendly format (*.npz)


## Extract Data from CSV


In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
  
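# Load the whole CSV into a 2-D float array
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# Inputs: every column except the first (customer ID) and the last (target)
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# Targets: the last column
targets_all = raw_csv_data[:, -1]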


print(raw_csv_data.shape)
print(targets_all.shape)
print(unscaled_inputs_all.shape) 


(14084, 12)
(14084,)
(14084, 10)


## Balance dataset

In [2]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []


for i in range(targets_all.shape[0]):
    if targets_all[i] ==0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
        
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0 )
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)
  


print(unscaled_inputs_equal_priors.shape)
print(targets_equal_priors.shape) 



(4474, 10)
(4474,)
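
The loop above keeps every row with target 1 and only the first equally many rows with target 0. The same idea can be sketched without an explicit loop. This is only an illustrative alternative on toy data; the helper name `balance_by_undersampling` and the toy arrays are assumptions, not part of the original notebook.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep all rows with target 1 and the first equally many rows with target 0."""
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    # Sort so the surviving rows keep their original order, as np.delete would
    keep = np.sort(np.concatenate([one_idx, zero_idx[:one_idx.size]]))
    return inputs[keep], targets[keep]

# Toy data (hypothetical): 2 positives and 5 negatives
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)

balanced_inputs, balanced_targets = balance_by_undersampling(toy_inputs, toy_targets)
print(balanced_inputs.shape, balanced_targets)
```

On the toy data, two of the five negative rows survive, so both classes end up with two rows each.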


## Standardize inputs

In [3]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
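
To make the standardization step concrete, here is a minimal NumPy sketch of a column-wise z-score, which is roughly what `preprocessing.scale` computes with its default arguments. The function name `standardize` and the toy array are assumptions made for illustration.

```python
import numpy as np

def standardize(x):
    """Column-wise z-score: subtract the mean, divide by the population standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by 0
    return (x - mean) / std

# Toy data (hypothetical): two features on very different scales
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = standardize(toy)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so no single input dominates just because of its units.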



## Shuffle data

In [4]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
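
The key point of the shuffle is that inputs and targets are reordered with the same index array, so each row stays paired with its label. A small sketch on toy data follows; it uses a seeded generator so the order is repeatable, which is an assumption added here and not something the original cell does.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# Toy data (hypothetical): row i holds the values 2*i and 2*i + 1
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation, applied to both arrays
perm = rng.permutation(toy_inputs.shape[0])
shuffled_inputs = toy_inputs[perm]
shuffled_targets = toy_targets[perm]
print(perm)
```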

## Create three datasets

In [5]:
samples_count = shuffled_inputs.shape [0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count  - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]


test_inputs   = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]


In [6]:
print(np.sum(train_targets), train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets)/validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets)/test_samples_count)

1765.0 3579 0.49315451243364067
234.0 447 0.5234899328859061
238.0 448 0.53125
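
The counts printed above can be checked by hand. The sketch below repeats the split arithmetic for the 4474 balanced rows and confirms that the positives in the three splits add up to half of the dataset.

```python
samples_count = 4474  # rows left after balancing, per the shapes printed earlier

train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
# The test set takes whatever is left, so no row is lost to rounding
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Positives per split, copied from the printed output
positives = 1765 + 234 + 238

print(train_samples_count, validation_samples_count, test_samples_count)
print(positives, samples_count // 2)
```

Each split is close to, but not exactly, 50% positive because the shuffle is random.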


## Save the three datasets in *.npz

In [7]:
np.savez('Audiobooks_data_train', inputs = train_inputs, targets= train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets= validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets= test_targets)
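
A short round-trip sketch shows how the `.npz` format behaves: `np.savez` appends the `.npz` extension when the file name lacks one, and the keyword names become the keys used when loading. The toy arrays and the file name `toy_data` are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Toy data (hypothetical)
inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')        # no extension given
    np.savez(path, inputs=inputs, targets=targets)  # written as toy_data.npz
    saved_files = os.listdir(tmp_dir)

    # Keys match the keyword names passed to np.savez
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(saved_files)
```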

## Create the machine learning algorithm


### Import the relevant libraries

In [8]:
import tensorflow as tf

### Data


In [9]:
# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead
npz = np.load('Audiobooks_data_train.npz')
train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

### Model
Outline, optimizers, loss, early stopping and training

In [10]:
input_size = 10
output_size = 2
hidden_layer_size = 50
    
model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])


### Choose the optimizer and the loss function

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
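
To see what the softmax output layer and the `sparse_categorical_crossentropy` loss compute together, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions (no clipping, mean over the batch), not Keras's actual implementation; the logits and labels are made up.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class (integer labels)."""
    picked = probs[np.arange(labels.size), labels]
    return -np.mean(np.log(picked))

# Hypothetical raw outputs for two customers and their true classes
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([0, 1])

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)
print(probs, loss)
```

"Sparse" here means the targets are class indices (0 or 1) and not one-hot vectors, which is why the targets were cast to integers when loading.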

### Training

batch_size = 100

max_epochs = 100

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
model.fit(train_inputs,  
          train_targets,  
          batch_size=batch_size,  
          epochs=max_epochs,  
          
          callbacks=[early_stopping],  
          validation_data=(validation_inputs, validation_targets),  
          verbose = 2  
          )  

Epoch 1/100
36/36 - 0s - loss: 0.6257 - accuracy: 0.6370 - val_loss: 0.5184 - val_accuracy: 0.7338
Epoch 2/100
36/36 - 0s - loss: 0.4924 - accuracy: 0.7499 - val_loss: 0.4437 - val_accuracy: 0.7405
Epoch 3/100
36/36 - 0s - loss: 0.4298 - accuracy: 0.7748 - val_loss: 0.4051 - val_accuracy: 0.7539
Epoch 4/100
36/36 - 0s - loss: 0.3942 - accuracy: 0.7890 - val_loss: 0.3843 - val_accuracy: 0.7808
Epoch 5/100
36/36 - 0s - loss: 0.3747 - accuracy: 0.7969 - val_loss: 0.3662 - val_accuracy: 0.8188
Epoch 6/100
36/36 - 0s - loss: 0.3603 - accuracy: 0.8094 - val_loss: 0.3608 - val_accuracy: 0.8031
Epoch 7/100
36/36 - 0s - loss: 0.3510 - accuracy: 0.8134 - val_loss: 0.3534 - val_accuracy: 0.8188
Epoch 8/100
36/36 - 0s - loss: 0.3453 - accuracy: 0.8181 - val_loss: 0.3600 - val_accuracy: 0.7875
Epoch 9/100
36/36 - 0s - loss: 0.3410 - accuracy: 0.8148 - val_loss: 0.3572 - val_accuracy: 0.7987


<tensorflow.python.keras.callbacks.History at 0x19f49dfcb80>
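
The log above stops after epoch 9 even though `max_epochs` is 100. The sketch below replays the printed `val_loss` values through a simplified patience rule to show why. The function `stopping_epoch` is a hypothetical stand-in written for illustration; it assumes the callback monitors `val_loss` with no minimum improvement threshold, which are the defaults.

```python
import math

# val_loss per epoch, copied from the training log
val_losses = [0.5184, 0.4437, 0.4051, 0.3843, 0.3662, 0.3608, 0.3534, 0.3600, 0.3572]

def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = math.inf
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0          # improvement: reset the counter
        else:
            wait += 1         # no improvement over the best value so far
            if wait >= patience:
                return epoch
    return None

stopped_at = stopping_epoch(val_losses, patience=2)

# 3579 training rows in batches of 100 explain the "36/36" in the log
steps_per_epoch = math.ceil(3579 / 100)
print(stopped_at, steps_per_epoch)
```

The best validation loss occurs at epoch 7, and epochs 8 and 9 both fail to beat it. With default settings the callback does not restore the best weights, so the model evaluated later is the one from the last epoch.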

## Test the model

As we discussed in the lectures, after training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset. 

The test is the absolute final instance. You should not test before you are completely done with adjusting your model.

If you adjust your model after testing, you will start overfitting the test dataset, which will defeat its purpose.

In [11]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [12]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.36. Test accuracy: 80.80%
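
Accuracy alone does not show which kind of mistake the model makes, and for the advertising use case the two kinds have different costs. The following sketch turns softmax outputs into class predictions and counts them in a confusion matrix. The probabilities and labels are made up for illustration; in the notebook they would come from the model's predictions on the test inputs.

```python
import numpy as np

# Hypothetical softmax outputs for 5 customers: columns are P(class 0), P(class 1)
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7], [0.55, 0.45]])
true_labels = np.array([0, 1, 1, 1, 0])

# The predicted class is the one with the highest probability
predicted = probs.argmax(axis=1)
accuracy = np.mean(predicted == true_labels)

# Rows: true class, columns: predicted class
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (true_labels, predicted), 1)
print(accuracy)
print(confusion)
```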


Using the initial model and hyperparameters given in this notebook, the run shown above reaches a final test accuracy of about 81%.

Note that each time the code is rerun, we get a different accuracy, because the shuffling of the data and the initialization of the weights are random.