# Audiobooks business case

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Because this is real-life data, it needs some preprocessing before a model can use it. The code below is not complicated, but it is crucial to building a good model.

Go through the code to see how each step works. The same approach should work for most supervised learning datasets laid out this way: many input columns followed by one column containing the targets. Keep in mind that a specific problem may require additional preprocessing.

Note that the header row, which contains the column names, has been removed from the CSV file. We only want the data.

## Preprocess the data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import tensorflow as tf

### Balance the dataset

In [2]:
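raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# inputs: every column except the first and the last; targets: the last column
unscaled_data = raw_csv_data[:,1:-1]
targets = raw_csv_data[:,-1]

# count the targets that are 1; we keep only as many 0 targets as there are 1s
num_one_targets_quantity = int(np.sum(targets))
num_zero_targets_quantity = 0
indice_to_remove = []

for i in range(targets.shape[0]):
    if targets[i] == 0:
        num_zero_targets_quantity += 1
        # once the 0s outnumber the 1s, mark every further 0 row for removal
        if num_zero_targets_quantity > num_one_targets_quantity:
            indice_to_remove.append(i)

# drop the marked rows from both arrays so inputs and targets stay aligned
balanced_targets = np.delete(targets, indice_to_remove, axis=0)
balanced_unscaled_data = np.delete(unscaled_data, indice_to_remove, axis=0)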
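
The loop above can also be written without explicit iteration. The following is a minimal sketch on toy data, assuming binary 0/1 targets; the helper name `balance_by_undersampling` and the toy arrays are hypothetical and not part of the original notebook. Like the loop, it keeps every 1 row and only the first N rows whose target is 0, where N is the number of 1s.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep all rows with target 1 and the first N rows with target 0 (N = number of 1s)."""
    targets = np.asarray(targets)
    one_idx = np.flatnonzero(targets == 1)
    # keep only as many zeros as there are ones, in their original order
    zero_idx = np.flatnonzero(targets == 0)[:one_idx.size]
    # sort so the surviving rows keep their original relative order
    keep = np.sort(np.concatenate([one_idx, zero_idx]))
    return np.asarray(inputs)[keep], targets[keep]

# toy data (hypothetical): 8 rows, 2 of them with target 1
toy_inputs = np.arange(16).reshape(8, 2)
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])

bal_inputs, bal_targets = balance_by_undersampling(toy_inputs, toy_targets)
print(bal_targets)  # [0 1 0 1]
```

Note that both versions discard the later 0 rows rather than a random sample of them, so any ordering in the file affects which rows survive.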

### Standardize the unscaled data

In [3]:
scaled_data = preprocessing.scale(balanced_unscaled_data)
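
To make the standardization step concrete, here is a rough NumPy sketch of what we understand `preprocessing.scale` to do with its default settings: subtract each column's mean and divide by its population standard deviation. The function name `standardize` and the toy array are hypothetical, used only for illustration.

```python
import numpy as np

def standardize(x):
    """Column-wise standardization: zero mean and unit variance per feature."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=0)
    std = x.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # avoid dividing by zero for constant columns
    return (x - mean) / std

# toy data (hypothetical): two features on very different scales
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_scaled = standardize(toy)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```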

### Shuffle the data

In [4]:
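indices = np.arange(scaled_data.shape[0])

# shuffle the row indices with an operation-level seed and convert the result to a NumPy array for indexing
SHUFFLE_SEED = 1000
shuffled_indices = tf.random.shuffle(indices, seed=SHUFFLE_SEED).numpy()

shuffled_data = scaled_data[shuffled_indices]
shuffled_targets = balanced_targets[shuffled_indices]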
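
The important property of this step is that the inputs and the targets are indexed with the same shuffled indices, so each row keeps its own target. The sketch below illustrates that with NumPy only; the toy arrays and the use of `np.random.default_rng` are assumptions made for the example, not what the notebook uses.

```python
import numpy as np

rng = np.random.default_rng(42)

# toy data (hypothetical): row i is [2*i, 2*i + 1] and its target is i
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)

# one permutation of the row indices, applied to both arrays
permutation = rng.permutation(toy_inputs.shape[0])
shuffled_inputs = toy_inputs[permutation]
shuffled_targets = toy_targets[permutation]

# every shuffled row still carries the target it started with
rows_still_aligned = bool(np.all(shuffled_inputs[:, 0] == 2 * shuffled_targets))
print(rows_still_aligned)  # True
```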

### Split the dataset into train, validation, and test

In [5]:
# 80% training, 10% validation, 10% test
data_size = shuffled_data.shape[0]

train_data_size = int(data_size * 0.8)
validation_size = int(data_size * 0.1)
test_size = data_size - train_data_size - validation_size

train_data = shuffled_data[:train_data_size]
train_data_targets = shuffled_targets[:train_data_size]

validation_start_indice = train_data_size
test_start_indice = validation_start_indice + validation_size

validation = shuffled_data[validation_start_indice:test_start_indice]
validation_targets = shuffled_targets[validation_start_indice:test_start_indice]

test = shuffled_data[test_start_indice:]
test_targets = shuffled_targets[test_start_indice:]
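
The size arithmetic is worth checking by hand, because `int()` truncates and the test set absorbs whatever is left over. Below is a small sketch with a hypothetical helper `split_sizes`. The value 4474 is not stated anywhere in the notebook; it is inferred as the only total consistent with the 3579 training and 447 validation samples reported in the training log.

```python
def split_sizes(data_size, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(data_size * train_frac)
    validation = int(data_size * validation_frac)
    test = data_size - train - validation
    return train, validation, test

# 4474 is an inferred total, see the note above
sizes = split_sizes(4474)
print(sizes)  # (3579, 447, 448)
```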

### Save the three datasets in *.npz

In [6]:
np.savez('Audiobooks_data_train', inputs=train_data, targets=train_data_targets)
np.savez('Audiobooks_data_validation', inputs=validation, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test, targets=test_targets)
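
`np.savez` appends the `.npz` extension when the file name does not already have it, which is why the files are loaded later with the extension included. A minimal round-trip sketch on toy arrays, written to a temporary directory; the file name `toy_split` and the arrays are hypothetical:

```python
import os
import tempfile
import numpy as np

# toy data (hypothetical)
toy_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
toy_targets = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = os.path.join(tmp_dir, 'toy_split')
    # the keyword names become the keys of the archive
    np.savez(base_path, inputs=toy_inputs, targets=toy_targets)
    saved_files = os.listdir(tmp_dir)

    # load with the extension that np.savez added
    with np.load(base_path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(saved_files)  # ['toy_split.npz']
```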

## Create the machine learning algorithm

### Data

In [7]:
# load each split and cast to explicit dtypes: float inputs, integer class labels
npz = np.load('Audiobooks_data_train.npz')
train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')
validation_inputs = npz['inputs'].astype(np.float64)
validation_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs = npz['inputs'].astype(np.float64)
test_targets = npz['targets'].astype(np.int64)

### Batching

In [8]:
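# manual batching with tf.data is left disabled; model.fit handles batching through its batch_size argument
#BATCH_SIZE = 128

#train_inputs = tf.data.Dataset.from_tensor_slices(train_inputs)
#train_inputs = train_inputs.batch(BATCH_SIZE)

#train_targets = tf.data.Dataset.from_tensor_slices(train_targets)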

### Outline the model

In [16]:
input_size = 10
output_size = 2
hidden_layer_size = 100

model = tf.keras.Sequential([
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
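
The comment in the model definition says a dense layer computes `activation(dot(input, weight) + bias)`. As an illustration only, the sketch below reproduces that forward pass in NumPy for the same layer sizes (10 inputs, two hidden layers of 100 units, 2 outputs). The weights are random placeholders, not the trained Keras weights, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, weights, bias, activation):
    """One dense layer: activation(dot(input, weight) + bias)."""
    return activation(x @ weights + bias)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # subtract the row maximum for numerical stability
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# placeholder inputs and weights (hypothetical): 4 samples with 10 features each
x = rng.normal(size=(4, 10))
w1, b1 = rng.normal(size=(10, 100)) * 0.1, np.zeros(100)
w2, b2 = rng.normal(size=(100, 100)) * 0.1, np.zeros(100)
w3, b3 = rng.normal(size=(100, 2)) * 0.1, np.zeros(2)

hidden1 = dense(x, w1, b1, relu)         # 1st hidden layer
hidden2 = dense(hidden1, w2, b2, relu)   # 2nd hidden layer
probs = dense(hidden2, w3, b3, softmax)  # output layer

print(probs.shape)        # (4, 2)
print(probs.sum(axis=1))  # each row sums to 1
```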

### Choose the optimizer and the loss function

In [17]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
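
The `sparse_categorical_crossentropy` loss fits this setup because the targets are integer class labels (0 or 1) rather than one-hot vectors. As a rough sketch of the idea, the loss is the mean negative log of the probability the model assigns to the true class. The function below is a simplified illustration on hypothetical toy values; it omits details of the Keras implementation, such as guarding against `log(0)`.

```python
import math
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class)."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # pick, for every row, the probability of its true class
    picked = probs[np.arange(labels.size), labels]
    return float(np.mean(-np.log(picked)))

# toy softmax outputs and integer targets (hypothetical)
toy_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
toy_labels = np.array([0, 1, 0])

loss = sparse_categorical_crossentropy(toy_probs, toy_labels)
# same value computed by hand from the three picked probabilities
expected = -(math.log(0.9) + math.log(0.8) + math.log(0.5)) / 3
print(abs(loss - expected) < 1e-12)  # True
```

A confident correct prediction (0.9) contributes little to the loss, while the undecided one (0.5) contributes the most.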

### Training

In [18]:
print(validation_inputs[0], train_inputs[0])

[-0.76445401 -0.75268653 -0.24508509 -0.39411984 -0.44877204 -0.01125564
 -0.37475172 -0.8635056  -0.20536617 -0.18692921] [-0.33022754  1.10843845 -0.38189654  0.35122388 -0.44877204 -0.01125564
 -0.37475172 -0.8635056  -0.20536617  0.34532556]


In [19]:
NUM_EPOCHS = 9
BATCH_SIZE = 100

# stop training once the validation loss has failed to improve for 2 consecutive epochs
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(
    train_inputs,
    train_targets,
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
    validation_data=(validation_inputs, validation_targets),
    verbose=2,  # print one line per epoch
    callbacks=[early_stopping]
)

Train on 3579 samples, validate on 447 samples
Epoch 1/9
3579/3579 - 1s - loss: 0.4448 - accuracy: 0.8301 - val_loss: 0.3236 - val_accuracy: 0.8881
Epoch 2/9
3579/3579 - 0s - loss: 0.3104 - accuracy: 0.8846 - val_loss: 0.2924 - val_accuracy: 0.9038
Epoch 3/9
3579/3579 - 0s - loss: 0.2878 - accuracy: 0.8916 - val_loss: 0.2646 - val_accuracy: 0.9083
Epoch 4/9
3579/3579 - 0s - loss: 0.2740 - accuracy: 0.8991 - val_loss: 0.2432 - val_accuracy: 0.9128
Epoch 5/9
3579/3579 - 0s - loss: 0.2653 - accuracy: 0.8983 - val_loss: 0.2381 - val_accuracy: 0.9128
Epoch 6/9
3579/3579 - 0s - loss: 0.2587 - accuracy: 0.9008 - val_loss: 0.2281 - val_accuracy: 0.9150
Epoch 7/9
3579/3579 - 0s - loss: 0.2543 - accuracy: 0.9044 - val_loss: 0.2228 - val_accuracy: 0.9195
Epoch 8/9
3579/3579 - 0s - loss: 0.2554 - accuracy: 0.9036 - val_loss: 0.2205 - val_accuracy: 0.9172
Epoch 9/9
3579/3579 - 0s - loss: 0.2478 - accuracy: 0.9036 - val_loss: 0.2212 - val_accuracy: 0.9217


<tensorflow.python.keras.callbacks.History at 0x146e3c518>
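
Training ran for all 9 epochs, so the early stopping callback never fired. The sketch below shows why, using a simplified version of the patience rule applied to the `val_loss` values from the log above. The function `first_stop_epoch` is a hypothetical illustration, not the Keras implementation.

```python
def first_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from the training log above
logged_val_losses = [0.3236, 0.2924, 0.2646, 0.2432, 0.2381,
                     0.2281, 0.2228, 0.2205, 0.2212]
stop_epoch = first_stop_epoch(logged_val_losses, patience=2)
print(stop_epoch)  # None: only the last epoch failed to improve

# a made-up sequence where the rule does trigger
toy_stop_epoch = first_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.8], patience=2)
print(toy_stop_epoch)  # 4
```

The validation loss improved in every epoch except the last, so the counter reached 1 at most and stayed below the patience of 2.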

## Testing

In [25]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))



Test loss: 0.25. Test accuracy: 91.07%
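
For intuition about the accuracy figure, here is a rough sketch of how accuracy can be computed from softmax outputs, assuming the predicted class is the one with the highest probability. The helper name and the toy values are hypothetical and are not taken from the trained model.

```python
import numpy as np

def accuracy_from_probabilities(probs, labels):
    """Share of samples whose highest-probability class equals the target."""
    predicted = np.argmax(probs, axis=1)
    return float(np.mean(predicted == np.asarray(labels)))

# toy softmax outputs and integer targets (hypothetical)
toy_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
toy_labels = np.array([0, 1, 1, 1])

toy_accuracy = accuracy_from_probabilities(toy_probs, toy_labels)
print('Toy accuracy: {0:.2f}%'.format(toy_accuracy * 100.))  # Toy accuracy: 75.00%
```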
