## Problem

You are given data from an audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model on the available data that predicts whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts ONLY on customers who are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv file summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable indicating whether a review was left), Review (the score, out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding the customer ID, which is completely arbitrary: it is more like a name than a number.
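As a rough sketch of the input/target layout described above, the snippet below slices one row into 10 inputs and 1 target. The names in `INPUT_NAMES` are hypothetical labels paraphrased from the variable list, and their order simply follows that list; the actual column order in the file is an assumption and should be checked against the data.

```python
import numpy as np

# Hypothetical labels, paraphrased from the variable list in the text.
# The order follows the prose and is an assumption, not read from the file.
INPUT_NAMES = [
    "book_length_avg",
    "book_length_sum",
    "price_avg",
    "price_sum",
    "review_left",
    "review_score",
    "minutes_listened",
    "completion",
    "support_requests",
    "last_visit_minus_purchase",
]

# First row of the DataFrame preview shown later in this notebook:
# customer ID, 10 inputs, target.
row = np.array([873, 2160.0, 2160, 10.13, 10.13, 0, 8.91, 0.00, 0.0, 0, 0, 1])

customer_id = int(row[0])   # arbitrary identifier, not used as an input
inputs = row[1:-1]          # the 10 input variables
target = int(row[-1])       # 0 = won't buy again, 1 = will buy again

named_inputs = dict(zip(INPUT_NAMES, inputs))
print(len(named_inputs), target)
```

Dropping the first and last columns this way mirrors what the notebook does with `data.drop([0,11],axis=1)`.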

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. So, in effect, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. Six months is a reasonable window: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

Good luck!

In [65]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [66]:
data = pd.read_csv("data/Audiobooks_data.csv",header=None)
data

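| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 873 | 2160.0 | 2160 | 10.13 | 10.13 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 1 |
| 1 | 611 | 1404.0 | 2808 | 6.66 | 13.33 | 1 | 6.50 | 0.00 | 0.0 | 0 | 182 | 1 |
| 2 | 705 | 324.0 | 324 | 10.13 | 10.13 | 1 | 9.00 | 0.00 | 0.0 | 1 | 334 | 1 |
| 3 | 391 | 1620.0 | 1620 | 15.31 | 15.31 | 0 | 9.00 | 0.00 | 0.0 | 0 | 183 | 1 |
| 4 | 819 | 432.0 | 1296 | 7.11 | 21.33 | 1 | 9.00 | 0.00 | 0.0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14079 | 27398 | 2160.0 | 2160 | 7.99 | 7.99 | 0 | 8.91 | 0.00 | 0.0 | 0 | 54 | 0 |
| 14080 | 28220 | 1620.0 | 1620 | 5.33 | 5.33 | 1 | 9.00 | 0.61 | 0.0 | 0 | 4 | 0 |
| 14081 | 28671 | 1080.0 | 1080 | 6.55 | 6.55 | 1 | 6.00 | 0.29 | 0.0 | 0 | 29 | 0 |
| 14082 | 31134 | 2160.0 | 2160 | 6.14 | 6.14 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |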


In [67]:
data[11].value_counts()

0    11847
1     2237
Name: 11, dtype: int64

In [68]:
inputs_all = np.array(data.drop([0,11],axis=1))
targets_all = np.array(data[11])

In [69]:
# Balance the dataset: keep every 1-target and only as many 0-targets.
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]): # 14084
    if targets_all[i]==0:
        zero_targets_counter+=1
        # Once there are as many 0s as 1s, mark every further 0 for removal.
        if zero_targets_counter>num_one_targets:
            indices_to_remove.append(i)

# Drop the surplus 0-target rows from both arrays.
inputs_equal_priors = np.delete(inputs_all,indices_to_remove,axis=0)
targets_equal_priors = np.delete(targets_all,indices_to_remove,axis=0)
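The balancing loop above can also be written without an explicit loop. The following is a minimal sketch on toy data; `balance_classes` is a hypothetical helper, not part of the notebook, and it assumes the same policy as the loop (keep all 1-targets and the first equally many 0-targets, in original order).

```python
import numpy as np

def balance_classes(inputs, targets):
    """Keep every 1-target row and the first equally many 0-target rows."""
    one_idx = np.flatnonzero(targets == 1)
    # Take only as many zeros as there are ones, starting from the top.
    zero_idx = np.flatnonzero(targets == 0)[: len(one_idx)]
    # Sort so the surviving rows keep their original relative order.
    keep = np.sort(np.concatenate([one_idx, zero_idx]))
    return inputs[keep], targets[keep]

# Toy data: 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

bal_inputs, bal_targets = balance_classes(toy_inputs, toy_targets)
print(bal_targets)  # [0 1 0 1]
```

Because the surplus zeros are always taken from the end of the file, shuffling afterwards (as the next cell does) matters for getting a representative split.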

In [70]:
samples_counts = inputs_equal_priors.shape[0]

shuffled_indices = np.arange(samples_counts)
np.random.shuffle(shuffled_indices)

shuffled_inputs = inputs_equal_priors[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

In [71]:
train_samples = int(0.8*samples_counts)
val_samples = int(0.1*samples_counts)
test_samples = samples_counts - train_samples - val_samples

In [72]:
inputs_train,targets_train = shuffled_inputs[:train_samples],shuffled_targets[:train_samples]
inputs_val,targets_val = shuffled_inputs[train_samples:train_samples+val_samples],shuffled_targets[train_samples:train_samples+val_samples]
inputs_test,targets_test = shuffled_inputs[train_samples+val_samples:],shuffled_targets[train_samples+val_samples:]

In [73]:
len(inputs_train),len(inputs_val),len(inputs_test)

(3579, 447, 448)
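The split sizes above follow directly from the class counts printed earlier: 2237 one-targets, matched by 2237 zero-targets after balancing. A short arithmetic check, using only numbers that appear in this notebook:

```python
# 2237 one-targets (from value_counts) plus an equal number of zero-targets.
num_one_targets = 2237
samples_count = 2 * num_one_targets               # 4474 balanced samples

train_samples = int(0.8 * samples_count)          # 80% for training
val_samples = int(0.1 * samples_count)            # 10% for validation
test_samples = samples_count - train_samples - val_samples  # the remainder

print(train_samples, val_samples, test_samples)   # 3579 447 448
```

Computing the test size as the remainder, rather than as another `int(0.1 * ...)`, ensures no sample is lost to rounding.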

In [74]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(inputs_train)
inputs_train_scaled = scaler.transform(inputs_train)
inputs_val_scaled = scaler.transform(inputs_val)
inputs_test_scaled = scaler.transform(inputs_test)
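The scaler is fitted on the training inputs only and then reused for validation and test, so no statistics leak from the held-out data. The sketch below reproduces that idea by hand in NumPy on synthetic data; it is an illustration of standardization, not scikit-learn's own code, and the random arrays are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the training and validation inputs.
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
val = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# "Fit": per-feature statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population std (ddof=0)

# "Transform": the same statistics are applied to every split.
train_scaled = (train - mean) / std
val_scaled = (val - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(val_scaled.mean(axis=0).round(2))   # close to 0, but not exactly 0
```

The training split ends up with zero mean and unit variance per feature, while the validation split is only approximately standardized, which is expected.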

In [76]:
np.savez("data/Audiobooks_data_train",inputs=inputs_train_scaled,targets=targets_train)
np.savez("data/Audiobooks_data_val",inputs=inputs_val_scaled,targets=targets_val)
np.savez("data/Audiobooks_data_test",inputs=inputs_test_scaled,targets=targets_test)

In [98]:
npz = np.load('data/Audiobooks_data_train.npz')

# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead.
train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)

In [99]:
npz = np.load('data/Audiobooks_data_val.npz')

# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead.
val_inputs = npz['inputs'].astype(np.float64)
val_targets = npz['targets'].astype(np.int64)

In [100]:
npz = np.load('data/Audiobooks_data_test.npz')

# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead.
test_inputs = npz['inputs'].astype(np.float64)
test_targets = npz['targets'].astype(np.int64)

In [167]:
input_size = 10
output_size = 2
hidden_layer_size = 10

model = tf.keras.models.Sequential() # Construct model
model.add(tf.keras.layers.Dense(units=hidden_layer_size,activation="relu")) # First layer
model.add(tf.keras.layers.Dense(units=hidden_layer_size,activation="relu")) # Second layer
model.add(tf.keras.layers.Dense(units=output_size,activation="softmax"))

model.compile(optimizer="adam",loss="sparse_categorical_crossentropy",metrics=["accuracy"])
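To get a feel for how small this network is, the trainable parameters can be counted by hand. The sketch below assumes 10 input features (the value of `input_size`) and standard dense layers with one bias per unit; `dense_params` is a hypothetical helper written for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

input_size = 10
hidden_layer_size = 10
output_size = 2

layer_sizes = [input_size, hidden_layer_size, hidden_layer_size, output_size]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [110, 110, 22] 242
```

With only a couple of hundred parameters, each epoch trains in well under a second on this dataset.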

**Important Note**

Use "sparse_categorical_crossentropy" when the labels are integer class indices for two or more classes; **output_size should equal the number of classes (>1)**

Use "categorical_crossentropy" when the labels are one-hot encoded; **output_size should equal the number of classes (>1)**

Use "binary_crossentropy" when the labels are integers (assumed to be 0 and 1) for exactly two classes; **output_size should be 1**, typically with a sigmoid activation
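For a two-class problem the three losses compute the same quantity once the labels and outputs are arranged in the shape each one expects. The sketch below shows this with hand-written NumPy versions; these functions are simplified re-implementations for illustration (no clipping, no logits handling), not the Keras code, and the probabilities are made-up values.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    # labels: integer class indices (n,); probs: softmax outputs (n, k)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def categorical_crossentropy(one_hot, probs):
    # one_hot: one-hot labels (n, k); probs: softmax outputs (n, k)
    return -(one_hot * np.log(probs)).sum(axis=1).mean()

def binary_crossentropy(labels, p1):
    # labels: 0/1 integers (n,); p1: single output = P(class 1), shape (n,)
    return -(labels * np.log(p1) + (1 - labels) * np.log(1 - p1)).mean()

labels = np.array([0, 1, 1])
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.4, 0.6]])          # two outputs per sample
one_hot = np.eye(2)[labels]             # same labels, one-hot encoded

loss_sparse = sparse_categorical_crossentropy(labels, probs)
loss_categorical = categorical_crossentropy(one_hot, probs)
loss_binary = binary_crossentropy(labels, probs[:, 1])  # one output per sample

print(loss_sparse, loss_categorical, loss_binary)
```

The notebook uses integer targets with a two-unit softmax output, which is why it compiles the model with "sparse_categorical_crossentropy".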

In [168]:
val_inputs.shape,val_targets.reshape(-1,1).shape

((447, 10), (447, 1))

In [169]:
BATCH_SIZE = 100
NUM_EPOCHS = 100

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
model.fit(x=train_inputs,
          y=train_targets,
          validation_data=(val_inputs,val_targets),
          batch_size=BATCH_SIZE,
          epochs=NUM_EPOCHS,
          callbacks=[early_stopping],
          verbose=2)

Epoch 1/100
36/36 - 0s - loss: 0.7360 - accuracy: 0.4722 - val_loss: 0.6740 - val_accuracy: 0.5503
Epoch 2/100
36/36 - 0s - loss: 0.6392 - accuracy: 0.6030 - val_loss: 0.6064 - val_accuracy: 0.7204
Epoch 3/100
36/36 - 0s - loss: 0.5690 - accuracy: 0.7773 - val_loss: 0.5452 - val_accuracy: 0.8054
Epoch 4/100
36/36 - 0s - loss: 0.5052 - accuracy: 0.8351 - val_loss: 0.4860 - val_accuracy: 0.8300
Epoch 5/100
36/36 - 0s - loss: 0.4484 - accuracy: 0.8533 - val_loss: 0.4350 - val_accuracy: 0.8367
Epoch 6/100
36/36 - 0s - loss: 0.4011 - accuracy: 0.8620 - val_loss: 0.3984 - val_accuracy: 0.8479
Epoch 7/100
36/36 - 0s - loss: 0.3691 - accuracy: 0.8709 - val_loss: 0.3733 - val_accuracy: 0.8613
Epoch 8/100
36/36 - 0s - loss: 0.3468 - accuracy: 0.8776 - val_loss: 0.3576 - val_accuracy: 0.8635
Epoch 9/100
36/36 - 0s - loss: 0.3311 - accuracy: 0.8807 - val_loss: 0.3440 - val_accuracy: 0.8702
Epoch 10/100
36/36 - 0s - loss: 0.3194 - accuracy: 0.8849 - val_loss: 0.3330 - val_accuracy: 0.8725
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x7fc25ebb5250>
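The `EarlyStopping(patience=2)` callback used above monitors the validation loss by default and stops training once it has failed to improve for 2 consecutive epochs. The following is a simplified pure-Python sketch of that rule; `early_stopping_epoch` is a hypothetical helper, and the loss values are invented for illustration rather than taken from the training log.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement stalls after epoch 3.
stalling = [0.67, 0.60, 0.54, 0.55, 0.56, 0.50]
improving = [0.67, 0.60, 0.54, 0.48]

print(early_stopping_epoch(stalling))   # 5
print(early_stopping_epoch(improving))  # None
```

A small patience keeps training short but can stop on a brief plateau; a larger value trades extra epochs for robustness to noise in the validation loss.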

In [170]:
test_loss,test_accuracy = model.evaluate(x=test_inputs,y=test_targets)



In [171]:
print(f"Test loss:{test_loss:.2f}, Test accuracy: {test_accuracy*100:.2f}%")

Test loss:0.28, Test accuracy: 88.84%
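Accuracy alone hides which class the errors fall on. As a sketch of what the reported accuracy measures, the snippet below turns softmax outputs into class predictions with `argmax` and tallies a confusion matrix; the probabilities and targets are toy values, not outputs of the trained model.

```python
import numpy as np

# Toy softmax outputs (one row per customer) and true targets.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
targets = np.array([0, 1, 1, 1])

predictions = probs.argmax(axis=1)           # most probable class per row
accuracy = (predictions == targets).mean()   # fraction of correct predictions

# Rows = true class, columns = predicted class.
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (targets, predictions), 1)

print(accuracy)    # 0.75
print(confusion)
```

With the real model, `probs` would come from `model.predict(test_inputs)` and `targets` from `test_targets`.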


Using the initial model and hyperparameters given in this notebook, the final test accuracy should be roughly 91%; the run shown above reached 88.84%.

Note that each time the code is rerun, we get a different accuracy, because the shuffling of the data and the initialization of the weights are random.
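One source of run-to-run variation is the unseeded shuffle earlier in the notebook. A minimal sketch of making that step repeatable with a seeded NumPy generator follows; `shuffled_indices` and the seed value are hypothetical, and the weight initialization in TensorFlow would need its own seed (for example via `tf.random.set_seed`), which is not covered here.

```python
import numpy as np

def shuffled_indices(n, seed):
    """Return a permutation of range(n) that is reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n)

first_run = shuffled_indices(10, seed=42)
second_run = shuffled_indices(10, seed=42)

# Same seed, same permutation: the train/val/test split would be identical.
print(np.array_equal(first_run, second_run))  # True
```

Fixing the seed makes experiments comparable while tuning hyperparameters, since changes in accuracy then reflect the model rather than the split.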

We have intentionally reached a suboptimal solution, so you can have space to build on it!