
# Audiobooks business case

## Preprocessing exercise

It makes sense to shuffle the indices prior to balancing the dataset. 

Using the code from the lesson (below), shuffle the indices and then balance the dataset.

At the end of the course, you will have an exercise to create the same machine learning algorithm, with preprocessing done in this way.

Note: This is more of a programming exercise than a machine learning one. Completing it successfully will confirm that you understand the preprocessing.

Good luck!

**Solution:**

Scroll down to the 'Exercise Solution' section

### Extract the data from the csv

In [86]:
# Only the imports this notebook actually uses
import numpy as np
import urllib.request, shutil
from sklearn import preprocessing

In [87]:
url = 'https://raw.githubusercontent.com/vjacobsen/neural-network-audiobook/main/Audiobooks_data.csv'
file_name = 'audiobooks.csv'

with urllib.request.urlopen(url) as response, open(file_name,'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# Load the file we just downloaded
raw_csv_data = np.loadtxt(file_name, delimiter=',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
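
The two slices above can be checked on a toy array. This is a minimal sketch with made-up values; it assumes, as the slice `[:,1:-1]` suggests, that the first column is an identifier to be dropped and the last column is the target.

```python
import numpy as np

# Hypothetical rows: [id, feature_1, feature_2, target]
toy = np.array([[101, 5.0, 3.2, 1],
                [102, 7.5, 0.0, 0]])

toy_inputs = toy[:, 1:-1]   # every column except the first and the last
toy_targets = toy[:, -1]    # the last column only

print(toy_inputs.shape, toy_targets)   # (2, 2) [1. 0.]
```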

In [88]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]
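
Indexing both arrays with the same permutation is what keeps each input row paired with its target. The sketch below illustrates this on toy data; the seeded `np.random.default_rng` generator is an optional substitution for `np.random.shuffle` (not what the cell above uses) that makes the shuffle reproducible, and the seed value is arbitrary.

```python
import numpy as np

# Toy data (hypothetical): row i of inputs belongs with targets[i]
inputs = np.arange(10).reshape(5, 2)   # rows [0, 1], [2, 3], ...
targets = inputs[:, 0] // 2            # target i equals the row number i

rng = np.random.default_rng(seed=42)   # arbitrary seed, for reproducibility only
shuffled_indices = rng.permutation(inputs.shape[0])

shuffled_inputs = inputs[shuffled_indices]
shuffled_targets = targets[shuffled_indices]

# Every shuffled row still carries its original target
pairs_preserved = np.array_equal(shuffled_inputs[:, 0] // 2, shuffled_targets)
print(pairs_preserved)   # True
```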

### Balance the dataset

In [89]:
%pip install -U imbalanced-learn

Requirement already up-to-date: imbalanced-learn in /usr/local/lib/python3.7/dist-packages (0.8.0)


In [90]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, CondensedNearestNeighbour
from imblearn.combine import SMOTETomek

#balancer_model = SMOTE()
#balancer_model = RandomUnderSampler()
#balancer_model = TomekLinks()
balancer_model = SMOTETomek()

unscaled_inputs_equal_priors, targets_equal_priors = balancer_model.fit_resample(X=unscaled_inputs_all, y=targets_all)

In [91]:
len(targets_equal_priors)

23176

In [92]:
# Verify it worked
sum(targets_equal_priors) / len(targets_equal_priors)

0.5
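
For comparison with the `imblearn` samplers, balancing can also be done by hand. The following is a minimal NumPy-only sketch of random undersampling: it drops rows of the larger class until both classes have the same count. The helper name `undersample_majority` and the toy arrays are hypothetical, and this is not the method used in the cell above, which resamples with `SMOTETomek`.

```python
import numpy as np

def undersample_majority(inputs, targets, rng):
    """Randomly keep an equal number of rows from class 0 and class 1."""
    ones = np.flatnonzero(targets == 1)
    zeros = np.flatnonzero(targets == 0)
    n = min(len(ones), len(zeros))
    keep = np.concatenate([rng.choice(ones, size=n, replace=False),
                           rng.choice(zeros, size=n, replace=False)])
    keep.sort()  # preserve the incoming row order
    return inputs[keep], targets[keep]

rng = np.random.default_rng(0)  # arbitrary seed
toy_targets = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1], dtype=float)
toy_inputs = np.arange(20, dtype=float).reshape(10, 2)

balanced_inputs, balanced_targets = undersample_majority(toy_inputs, toy_targets, rng)
print(len(balanced_targets), balanced_targets.mean())   # 6 0.5
```

Undersampling discards data, whereas SMOTE-based samplers synthesize new minority-class rows, which is why the balanced dataset above is larger than it would be with this approach.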

### Standardize the inputs

In [93]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
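
With its default arguments, `preprocessing.scale` standardizes each column to zero mean and unit variance. A NumPy-only sketch of that computation on made-up values, assuming the population standard deviation (NumPy's default, `ddof=0`):

```python
import numpy as np

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Subtract each column's mean, then divide by each column's standard deviation
manual_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# After scaling, every column has mean ~0 and standard deviation ~1
print(np.allclose(manual_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(manual_scaled.std(axis=0), 1.0))    # True
```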

### Shuffle the data

In [94]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

### Split the dataset into train, validation, and test

In [95]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, using an 80-10-10 split of training, validation, and test.
# int() truncates, so each count is a whole number.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, following the "train_samples_count" we already assigned
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code, 
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole sheet, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

9240.0 18540 0.49838187702265374
1176.0 2317 0.5075528700906344
1172.0 2319 0.5053902544200086
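
The subset sizes printed above follow directly from the truncating arithmetic. The sketch below reproduces them for the 23176 balanced samples; the helper name `split_counts` is hypothetical, and `np.split` is shown only as an alternative to the three manual slices.

```python
import numpy as np

def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; test takes whatever remains."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

counts = split_counts(23176)
print(counts)   # (18540, 2317, 2319)

# np.split cuts an array at the given boundaries in one call
data = np.arange(23176)
train, validation, test = np.split(data, [counts[0], counts[0] + counts[1]])
print(len(train), len(validation), len(test))   # 18540 2317 2319
```

Because the test set takes the remainder, no sample is lost to rounding, which is why it ends up two rows larger than the validation set here.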


### Save the three datasets in *.npz

In [96]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [97]:
# Save the three datasets in *.npz.
# In the next lesson, you will see that it is extremely valuable to name them in such a coherent way!

folder_path = '/content/drive/MyDrive/Python and Data Science/projects/neural-network-audiobook/'

np.savez(folder_path+'Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez(folder_path+'Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez(folder_path+'Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
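
To read the files back, note that `np.savez` appends the `.npz` extension to the name it is given. The sketch below round-trips a tiny made-up dataset through a temporary folder instead of Google Drive; the file name `Audiobooks_data_demo` and the dtype casts are illustrative assumptions.

```python
import os
import tempfile
import numpy as np

inputs = np.array([[0.1, -1.2], [0.7, 0.3]])
targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Audiobooks_data_demo')
    np.savez(path, inputs=inputs, targets=targets)   # writes Audiobooks_data_demo.npz

    # Arrays are retrieved by the keyword names used when saving
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape, loaded_targets)   # (2, 2) [0 1]
```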