# Audiobooks using TensorFlow

## Problem

Build a machine learning model that uses the available data to predict whether a customer will buy again from the Audiobook company.

 
The data contains the following variables: customer ID, book length in minutes (average over all purchases), book length in minutes (sum over all purchases), price paid (average over all purchases), price paid (sum over all purchases), review (a Boolean indicating whether a review was left), review score (out of 10), total minutes listened, completion (from 0 to 1), number of support requests, and last visit date minus purchase date (in days).

The target is a Boolean variable (0 or 1).

This is a classification problem with two classes, "won't buy" and "will buy", represented by 0 and 1 respectively.



## Relevant libraries

In [50]:
import numpy as np
import tensorflow as tf
import pandas as pd

## Raw Data

In [51]:
df = pd.read_csv("Audiobooks_data.csv")
df.describe(include="all")

| | ID | Book length (mins)_overall | Book length (mins)_avg | Price_overall | Price_avg | Review | Review 10/10 | Completion | Minutes listened | Support Requests | Last visited minus Purchase date | Targets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 |
| mean | 16772.491551 | 1591.281685 | 1678.608634 | 7.103791 | 7.543805 | 0.16075 | 8.909795 | 0.125659 | 189.888983 | 0.070222 | 61.935033 | 0.158833 |
| std | 9691.807248 | 504.340663 | 654.838599 | 4.931673 | 5.560129 | 0.367313 | 0.643406 | 0.241206 | 371.08401 | 0.472157 | 88.207634 | 0.365533 |
| min | 2.0 | 216.0 | 216.0 | 3.86 | 3.86 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8368.0 | 1188.0 | 1188.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 16711.5 | 1620.0 | 1620.0 | 5.95 | 6.07 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 11.0 | 0.0 |
| 75% | 25187.25 | 2160.0 | 2160.0 | 8.0 | 8.0 | 0.0 | 8.91 | 0.13 | 194.4 | 0.0 | 105.0 | 0.0 |
| max | 33683.0 | 2160.0 | 7020.0 | 130.94 | 130.94 | 1.0 | 10.0 | 1.0 | 2160.0 | 30.0 | 464.0 | 1.0 |


In [52]:
X = df.iloc[:, 1:11].values   # inputs: columns 1-10 (drops the ID column)
y = df.iloc[:, 11:12].values  # targets: last column, kept 2-D with shape (n, 1)

## Preprocessing

### Balance the data

In [53]:
# Balance the classes: keep as many rows with y = 0 as there are rows with y = 1
num_one_y = int(np.sum(y))

# Count the zeros seen so far; every zero beyond num_one_y is marked for removal
zero_y_counter = 0
indices_to_remove = []

for i in range(y.shape[0]):
    if y[i] == 0:
        zero_y_counter += 1
        if zero_y_counter > num_one_y:
            indices_to_remove.append(i)

X = np.delete(X,indices_to_remove, axis = 0)
y = np.delete(y, indices_to_remove, axis = 0)
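The loop above can also be written without explicit iteration. The following is a minimal sketch under the same assumptions (targets are 0/1 and stored with shape `(n, 1)`); the helper name `balance_by_truncation` and the toy arrays are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def balance_by_truncation(X, y):
    """Keep every 1-target and only the first n_ones 0-targets, preserving row order."""
    y_flat = y.reshape(-1)
    n_ones = int(np.sum(y_flat == 1))
    zero_idx = np.flatnonzero(y_flat == 0)  # positions of all 0-targets
    drop = zero_idx[n_ones:]                # zeros beyond the first n_ones
    return np.delete(X, drop, axis=0), np.delete(y, drop, axis=0)

# Toy data (illustrative only): 6 zeros and 2 ones
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([[0], [0], [1], [0], [0], [1], [0], [0]])

X_bal, y_bal = balance_by_truncation(X_toy, y_toy)
print(y_bal.ravel())  # [0 0 1 1]
```

Like the loop, this keeps the earliest zeros in file order rather than a random sample of them.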

### Standardize X (inputs)

In [54]:
from sklearn import preprocessing

In [55]:
X_scaled = preprocessing.scale(X)
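Conceptually, `preprocessing.scale` centers each column to zero mean and rescales it to unit variance. A rough NumPy equivalent is sketched below, assuming no column is constant (a zero standard deviation would cause a division by zero here); the function name `standardize_columns` and the toy matrix are illustrative only.

```python
import numpy as np

def standardize_columns(X):
    """Column-wise standardization: subtract the mean, divide by the population std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # ddof=0, i.e. the population standard deviation
    return (X - mean) / std

# Toy matrix (illustrative only) with two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

X_std = standardize_columns(X_toy)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```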

### Split the shuffled data into train, validation and test

In [56]:
from sklearn.model_selection import train_test_split

In [57]:
# Split the data into train+validation (90%) and test (10%)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.1, random_state=1, shuffle=True)

# Split the remaining 90% into train and validation: 0.125 x 0.9 = 0.1125 of all data goes to validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=1, shuffle=True)
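Because the second split is applied to the 90% that remains after the test split, the resulting proportions are worth checking by hand. The short calculation below is only a sketch of that arithmetic; actual row counts can differ by one or so because of rounding.

```python
test_fraction = 0.1                # first split: share of all data held out for testing
val_fraction_of_remainder = 0.125  # second split: share of the remaining data

remainder = 1.0 - test_fraction
val_fraction = remainder * val_fraction_of_remainder
train_fraction = remainder - val_fraction

print(f"train={train_fraction:.4f}, validation={val_fraction:.4f}, test={test_fraction:.4f}")
# train=0.7875, validation=0.1125, test=0.1000
```

If an exact 80/10/10 split were wanted instead, the second `test_size` would need to be 0.1 / 0.9, roughly 0.111.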

### Save the three datasets in *.npz

In [58]:
np.savez('Audiobooks_data_train', inputs=X_train, targets=y_train)
np.savez('Audiobooks_data_validation', inputs=X_val, targets=y_val)
np.savez('Audiobooks_data_test', inputs=X_test, targets=y_test)

## Processed Data

In [59]:
# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead
npz = np.load('Audiobooks_data_train.npz')
train_inputs, train_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)
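The save/load round trip can be tried in isolation. The sketch below writes a tiny archive to a temporary directory and reads it back; the file name `toy_split` and the arrays are made up for illustration. Note that `np.savez` appends the `.npz` extension when the given name does not already have it, which is why the notebook saves without the extension and loads with it.

```python
import os
import tempfile

import numpy as np

# Toy arrays (illustrative only)
inputs = np.array([[0.5, -1.2], [1.0, 0.3]])
targets = np.array([[0], [1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = os.path.join(tmp_dir, "toy_split")        # hypothetical file name
    np.savez(base_path, inputs=inputs, targets=targets)   # writes toy_split.npz

    # The keyword names used in savez become the keys of the loaded archive
    with np.load(base_path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(loaded_inputs.shape, loaded_targets.ravel())  # (2, 2) [0 1]
```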

### Model
Outline, optimizers, loss, early stopping and training

In [65]:

input_size = 10
output_size = 2

hidden_layer_size = 50
    

model = tf.keras.Sequential([
            tf.keras.layers.Dense(hidden_layer_size, activation='relu'), 
            tf.keras.layers.Dense(hidden_layer_size, activation='relu'),
            tf.keras.layers.Dense(output_size, activation='softmax') 
])


model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
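To make the pairing of a 2-unit softmax output with `sparse_categorical_crossentropy` more concrete, here is a small NumPy sketch of what that loss computes for integer labels: the mean negative log of the probability assigned to the true class. This is an illustration of the formula, not the Keras implementation; the function names and the toy logits are assumptions.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class; labels are integer class ids."""
    labels = np.asarray(labels).reshape(-1)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class per row
    return float(np.mean(-np.log(picked)))

# Toy values (illustrative only): 3 samples, 2 classes
logits = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([[0], [1], [0]])  # same (n, 1) integer layout as the targets above

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)
accuracy = float(np.mean(probs.argmax(axis=1) == labels.reshape(-1)))
print(loss, accuracy)
```

The "sparse" variant is what allows the targets to stay as 0/1 integers instead of being one-hot encoded.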

### Training


batch_size = 50


max_epochs = 100

# set an early stopping mechanism
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

# fit the model
model.fit(train_inputs,
          train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Epoch 1/100
71/71 - 0s - loss: 0.5428 - accuracy: 0.7198 - val_loss: 0.4444 - val_accuracy: 0.7837
Epoch 2/100
71/71 - 0s - loss: 0.4116 - accuracy: 0.7859 - val_loss: 0.3984 - val_accuracy: 0.7817
Epoch 3/100
71/71 - 0s - loss: 0.3784 - accuracy: 0.7984 - val_loss: 0.3837 - val_accuracy: 0.7937
Epoch 4/100
71/71 - 0s - loss: 0.3631 - accuracy: 0.8072 - val_loss: 0.3741 - val_accuracy: 0.7897
Epoch 5/100
71/71 - 0s - loss: 0.3565 - accuracy: 0.8061 - val_loss: 0.3687 - val_accuracy: 0.7877
Epoch 6/100
71/71 - 0s - loss: 0.3519 - accuracy: 0.8066 - val_loss: 0.3638 - val_accuracy: 0.7976
Epoch 7/100
71/71 - 0s - loss: 0.3448 - accuracy: 0.8146 - val_loss: 0.3648 - val_accuracy: 0.7956
Epoch 8/100
71/71 - 0s - loss: 0.3397 - accuracy: 0.8143 - val_loss: 0.3671 - val_accuracy: 0.8036


<tensorflow.python.keras.callbacks.History at 0x7f47740f2070>
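Training stopped after 8 of the 100 allowed epochs because of the early stopping callback, which by default monitors `val_loss`. The sketch below replays the validation losses from the log above through a simplified patience counter. It is an illustration of the idea rather than the Keras source, and `stopping_epoch` is a hypothetical helper.

```python
# Validation losses copied from the training log, epochs 1 to 8
val_losses = [0.4444, 0.3984, 0.3837, 0.3741, 0.3687, 0.3638, 0.3648, 0.3671]

def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0  # consecutive epochs without a new best loss
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(stopping_epoch(val_losses, patience=2))  # 8
```

The best validation loss (0.3638) occurs at epoch 6; epochs 7 and 8 fail to improve on it, so with `patience=2` training ends at epoch 8.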

## Test the model



In [66]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [67]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.32. Test accuracy: 82.81%
