### Audiobooks project

### Problem

You are given data from an audiobook app. It relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model on the available data that predicts whether a customer will buy again from the audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money advertising to them. If we focus our efforts solely on customers who are likely to convert again, we can make significant savings. Moreover, the model can identify the metrics that matter most for a customer to come back. Identifying new customers creates value and growth opportunities.

You have a .csv file summarizing the data. There are several variables: Customer ID, Book length overall (sum of the lengths in minutes of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, the score out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from a forgotten password to assistance with using the app), and Last visited minus purchase date (in days).

These are the inputs, excluding the customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months based on the last 2 years of activity and engagement. Six months is a reasonable window: if customers don't convert within it, chances are they have gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

In [3]:
# Importing libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

In [2]:
# Loading the data
raw_csv_data = np.loadtxt('data/Audiobooks_data.csv', delimiter = ",")
raw_csv_data

array([[9.9400e+02, 1.6200e+03, 1.6200e+03, ..., 5.0000e+00, 9.2000e+01,
        0.0000e+00],
       [1.1430e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [2.0590e+03, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 3.8800e+02,
        0.0000e+00],
       ...,
       [3.1134e+04, 2.1600e+03, 2.1600e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [3.2832e+04, 1.6200e+03, 1.6200e+03, ..., 0.0000e+00, 9.0000e+01,
        0.0000e+00],
       [2.5100e+02, 1.6740e+03, 3.3480e+03, ..., 0.0000e+00, 0.0000e+00,
        1.0000e+00]])

In [12]:
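# The file has no header row, so read_csv takes the first data row as column labels
# (visible in the output below); passing header=None would avoid this.
raw_d = pd.read_csv('data/Audiobooks_data.csv')
raw_d.describe()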

| | 00994 | 1620 | 1620.1 | 19.73 | 19.73.1 | 1 | 10.00 | 0.99 | 1603.80 | 5 | 92 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 | 14083.0 |
| mean | 16773.611943 | 1591.279646 | 1678.612796 | 7.102894 | 7.54294 | 0.16069 | 8.909717 | 0.125598 | 189.788585 | 0.069871 | 61.932898 | 0.158844 |
| std | 9691.23921 | 504.358512 | 654.861664 | 4.9307 | 5.559378 | 0.367258 | 0.643363 | 0.241104 | 370.905846 | 0.470342 | 88.210402 | 0.365544 |
| min | 2.0 | 216.0 | 216.0 | 3.86 | 3.86 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8371.5 | 1188.0 | 1188.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 16715.0 | 1620.0 | 1620.0 | 5.95 | 6.07 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 11.0 | 0.0 |
| 75% | 25187.5 | 2160.0 | 2160.0 | 8.0 | 8.0 | 0.0 | 8.91 | 0.13 | 194.4 | 0.0 | 105.0 | 0.0 |
| max | 33683.0 | 2160.0 | 7020.0 | 130.94 | 130.94 | 1.0 | 10.0 | 1.0 | 2160.0 | 30.0 | 464.0 | 1.0 |
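
Because the file has no header row, it can help to attach readable column names when loading it with pandas. The following is a minimal sketch, not part of the original notebook. The names in `COLUMN_NAMES` and the helper `load_audiobooks` are hypothetical, derived from the variable descriptions in the problem statement. The relative order of the completion and minutes-listened columns is an assumption inferred from the value ranges in the summary above (one column peaks at 1.0, the next at 2160.0), which differs from the order in the description. A small in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical column names (assumption), based on the variable descriptions.
COLUMN_NAMES = [
    "customer_id", "book_length_overall", "book_length_avg",
    "price_paid_overall", "price_paid_avg", "review", "review_out_of_10",
    "completion", "minutes_listened", "support_requests",
    "last_visited_minus_purchase_date", "target",
]

def load_audiobooks(source):
    """Read the header-less CSV and attach readable column names."""
    return pd.read_csv(source, header=None, names=COLUMN_NAMES)

# Stand-in for the real file: the first row is the one that appears as column
# labels in the summary above; the second row is made up for illustration.
sample_csv = io.StringIO(
    "00994,1620,1620,19.73,19.73,1,10.00,0.99,1603.80,5,92,0\n"
    "00001,2160,2160,5.33,5.33,0,8.91,0.00,0.00,0,0,1\n"
)
df = load_audiobooks(sample_csv)
print(df.shape)
print(df.dtypes)
```

With the real data, the same helper would be called as `load_audiobooks('data/Audiobooks_data.csv')`, after which `describe()` counts every row.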


In [6]:
# The first column is the customer ID (not used) and the last column holds the targets
# We have to split this data:
unscaled_inputs_all = raw_csv_data[:,1:-1] # all rows, columns 1 up to (but excluding) the last
targets_all = raw_csv_data[:,-1] # all rows, last column only

### Balancing the dataset


In [10]:
# Counting 1's in the targets
num_one_targets  = int(np.sum(targets_all))
num_one_targets

2237

#### In this course, the dataset is balanced by keeping only as many samples with target 0 as there are samples with target 1 (num_one_targets). The remaining target-0 samples are deleted.

In [16]:
# Keep the first num_one_targets rows with target 0; mark every later target-0 row for removal
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

In [22]:
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
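
The same balancing rule can also be written without an explicit loop. This is a minimal sketch, assuming the targets contain only 0s and 1s. The function name `balance_by_truncation` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep every target-1 row and only the first equally many target-0 rows."""
    num_ones = int(np.sum(targets == 1))
    zero_indices = np.flatnonzero(targets == 0)   # positions of all target-0 rows
    indices_to_remove = zero_indices[num_ones:]   # surplus zeros, in file order
    kept_inputs = np.delete(inputs, indices_to_remove, axis=0)
    kept_targets = np.delete(targets, indices_to_remove, axis=0)
    return kept_inputs, kept_targets

# Toy data (made up): 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)

balanced_inputs, balanced_targets = balance_by_truncation(toy_inputs, toy_targets)
print(balanced_targets)  # [0. 1. 0. 1.]
```

Like the loop, this keeps the earliest target-0 rows rather than a random sample of them, so the result depends on the row order in the file.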

### Standardizing

In [23]:
# Standardizing each input column (zero mean, unit variance) with sklearn
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
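
To make the standardization step concrete, here is a minimal NumPy sketch of what it amounts to with default settings: each column has its mean subtracted and is divided by its (population) standard deviation. The function `standardize_columns` and the toy array are hypothetical, and this is an illustration rather than the library's implementation.

```python
import numpy as np

def standardize_columns(x, mean=None, std=None):
    """Return (x - mean) / std per column, plus the statistics that were used."""
    x = np.asarray(x, dtype=float)
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)                   # population standard deviation (ddof=0)
    safe_std = np.where(std == 0, 1.0, std)   # constant columns are only centered
    return (x - mean) / safe_std, mean, std

# Toy data (made up): two columns on very different scales
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled_toy, toy_mean, toy_std = standardize_columns(toy)

print(scaled_toy.mean(axis=0))  # approximately zero in every column
print(scaled_toy.std(axis=0))   # approximately one in every column
```

The optional `mean` and `std` arguments allow statistics computed on one array to be reused on another, which would be useful if the scaling were moved after the train/validation/test split. That is a variation, not what this notebook does.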

### Shuffling the data

In [34]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

#### The data is shuffled randomly, so the order differs every time this notebook is rerun. No random seed is set here because this is a very basic and simple approach for EDA.
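
If reproducible results are wanted, the shuffle can be driven by a seeded random generator. This is a minimal sketch, not the notebook's approach. The function `shuffle_together` and the seed value 42 are hypothetical choices for illustration.

```python
import numpy as np

def shuffle_together(inputs, targets, seed=None):
    """Shuffle inputs and targets with the same permutation; a fixed seed repeats it."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(inputs.shape[0])
    return inputs[order], targets[order]

# Toy data (made up): each target equals 10 times the first input column,
# so row alignment can be checked after shuffling.
toy_inputs = np.arange(10, dtype=float).reshape(5, 2)
toy_targets = toy_inputs[:, 0] * 10

first_inputs, first_targets = shuffle_together(toy_inputs, toy_targets, seed=42)
second_inputs, second_targets = shuffle_together(toy_inputs, toy_targets, seed=42)

print(np.array_equal(first_inputs, second_inputs))         # same seed, same order
print(np.array_equal(first_targets, first_inputs[:, 0] * 10))  # rows stay aligned
```

Leaving `seed=None` restores the non-reproducible behavior described above.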

### Splitting the data

In [35]:
# Getting the sizes of the train, validation and test sets
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count-train_samples_count-validation_samples_count

# Splitting the data (manually)
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
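
The manual 80/10/10 split above can be wrapped in a small helper. This is a minimal sketch under the same rounding rule (`int()` truncation, with the test set taking the remainder). The function name `split_train_validation_test` and its parameters are hypothetical. The demo arrays are placeholders of zeros with 4474 rows, i.e. twice the 2237 positive targets counted earlier.

```python
import numpy as np

def split_train_validation_test(inputs, targets, train_frac=0.8, validation_frac=0.1):
    """Slice already-shuffled arrays into train, validation and test parts."""
    samples_count = inputs.shape[0]
    train_end = int(train_frac * samples_count)
    validation_end = train_end + int(validation_frac * samples_count)
    return (
        (inputs[:train_end], targets[:train_end]),
        (inputs[train_end:validation_end], targets[train_end:validation_end]),
        (inputs[validation_end:], targets[validation_end:]),  # remainder
    )

# Placeholder arrays (made up) with the same number of rows as the balanced dataset
demo_inputs = np.zeros((4474, 10))
demo_targets = np.zeros(4474)

train, validation, test = split_train_validation_test(demo_inputs, demo_targets)
print(len(train[0]), len(validation[0]), len(test[0]))  # 3579 447 448
```

The helper assumes the arrays were shuffled beforehand; it only slices them.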

In [37]:
# We can check whether the split data is balanced as intended
# (printed per set: number of 1s, number of samples, share of 1s):
print(np.sum(train_targets), train_samples_count, np.sum(train_targets/train_samples_count))
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets/validation_samples_count))
print(np.sum(test_targets), test_samples_count, np.sum(test_targets/test_samples_count))

1797.0 3579 0.5020955574182733
217.0 447 0.4854586129753914
223.0 448 0.49776785714285704


#### The second column is the number of samples, which for the first row (training set) should be much larger than for the last two (validation and test). The last column is the share of targets equal to 1; all values should be around 0.5 (50%), which confirms that each subset is roughly balanced. In this case everything is as it should be.

### Saving the data

In [39]:
# We can save the data in the zipped numpy file format (.npz) for future use
np.savez('data/audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('data/audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('data/audiobooks_data_test', inputs=test_inputs, targets=test_targets)
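
For later use, the saved arrays can be read back with `np.load`, using the keyword names given to `np.savez` (`inputs` and `targets`) as keys. This is a minimal sketch. The helper `load_split` is hypothetical, the casts to float and int are an assumption about what a downstream model expects, and a temporary file with made-up values stands in for the real `.npz` files.

```python
import os
import tempfile
import numpy as np

def load_split(path):
    """Load one saved split and return (inputs, targets)."""
    with np.load(path) as npz:
        inputs = npz["inputs"].astype(float)
        targets = npz["targets"].astype(int)   # assumption: integer class labels
    return inputs, targets

# Round trip on made-up data; np.savez appends '.npz' when the name lacks it.
with tempfile.TemporaryDirectory() as tmp_dir:
    base_name = os.path.join(tmp_dir, "demo_split")
    np.savez(base_name,
             inputs=np.array([[0.5, -1.0], [1.5, 2.0]]),
             targets=np.array([0.0, 1.0]))
    demo_inputs, demo_targets = load_split(base_name + ".npz")

print(demo_inputs.shape, demo_targets)
```

With the files written above, the training set would be loaded as `load_split('data/audiobooks_data_train.npz')`.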