# Audiobooks business case

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Since we are dealing with real life data, we will need to preprocess it a bit. This is the relevant code, which is not that hard, but is crucial to creating a good model.

If you want to know how to do that, go through the code. In any case, this should do the trick for most datasets organized in the way: many inputs, and then 1 cell containing the targets (supervised learning datasets). Keep in mind that a specific problem may require additional preprocessing.

Note that we have removed the header row, which contains the names of the categories. We simply want the data.

This code does not include comments - it is the same as the one in the lesson. Please refer to the other file if you want the code with comments.

### Extract the data from the csv

In [6]:
import numpy as np
import tensorflow as tf
from sklearn import preprocessing

!pip install -U scikit-learn


raw_csv_data = np.loadtxt(r'C:\Users\91776\Downloads\Audiobooks_data.csv', delimiter = ',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]



### Balance the dataset

In [7]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] ==0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
            
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0)
targets_equal_priors = np.delete (targets_all, indices_to_remove, axis=0)

### Standardize the inputs

In [8]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)

### Shuffle the data

In [9]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

### Split the dataset into train, validation, and test

In [10]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1788.0 3579 0.49958088851634536
225.0 447 0.5033557046979866
224.0 448 0.5


### Save the three datasets in *.npz

In [11]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

In [12]:
#data now loading into the vaiables

In [13]:
npz=np.load(r'C:\Users\91776\Audiobooks_data_test.npz')

train_inputs = npz["inputs"].astype(float)
train_targets = npz["targets"].astype(int)

npz=np.load(r'C:\Users\91776\Audiobooks_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(float), npz['targets'].astype(int)

npz=np.load(r'C:\Users\91776\Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(float), npz['targets'].astype(int)




In [14]:
# building the algorithm

In [20]:
input_size = 10
output_size=2
hidden_layer_size = 50
# tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)

model = tf.keras.Sequential([
                             tf.keras.layers.Dense(hidden_layer_size,activation='relu'),## adding layers to the model
                             tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
                             tf.keras.layers.Dense(output_size,activation = 'softmax')                                                            
])

model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
max_epochs = 100
batch_size = 100

model.fit(train_inputs, train_targets,
          batch_size = batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs,validation_targets),
          verbose=2
         )

Epoch 1/100
5/5 - 1s - loss: 0.7132 - accuracy: 0.4710 - val_loss: 0.6802 - val_accuracy: 0.5570 - 566ms/epoch - 113ms/step
Epoch 2/100
5/5 - 0s - loss: 0.6561 - accuracy: 0.6027 - val_loss: 0.6435 - val_accuracy: 0.6353 - 36ms/epoch - 7ms/step
Epoch 3/100
5/5 - 0s - loss: 0.6141 - accuracy: 0.6942 - val_loss: 0.6151 - val_accuracy: 0.6734 - 36ms/epoch - 7ms/step
Epoch 4/100
5/5 - 0s - loss: 0.5822 - accuracy: 0.7277 - val_loss: 0.5907 - val_accuracy: 0.6913 - 39ms/epoch - 8ms/step
Epoch 5/100
5/5 - 0s - loss: 0.5537 - accuracy: 0.7388 - val_loss: 0.5699 - val_accuracy: 0.6980 - 36ms/epoch - 7ms/step
Epoch 6/100
5/5 - 0s - loss: 0.5303 - accuracy: 0.7433 - val_loss: 0.5507 - val_accuracy: 0.7159 - 44ms/epoch - 9ms/step
Epoch 7/100
5/5 - 0s - loss: 0.5103 - accuracy: 0.7478 - val_loss: 0.5332 - val_accuracy: 0.7226 - 31ms/epoch - 6ms/step
Epoch 8/100
5/5 - 0s - loss: 0.4927 - accuracy: 0.7522 - val_loss: 0.5175 - val_accuracy: 0.7360 - 37ms/epoch - 7ms/step
Epoch 9/100
5/5 - 0s - loss: 

<keras.src.callbacks.History at 0x2547ebed4c0>

In [21]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

