# Predicting Retention Rate of Customers of an Audiobook Company



## Introduction to the Business Problem

Companies, organizations and businesses are always keen not only to expand their customer base but also to retain valuable customers. Customers are the most prized asset of any company. The availability of data, and the use of technology to identify prospective customers, creates many opportunities for value and growth. This is one of the finer applications of Data Science.

**Customer Retention** is the capability of a company, business or product, to hold on to its customers over a specified time period. **High customer retention** means customers tend to return to and continue to buy from the company, and don't defect to another company.

In this analysis, I'll be predicting the retention of customers of an Audiobook company by employing Deep Neural Networks. The company wants to make efficient use of its advertising budget and doesn't want to target individuals who are unlikely to come back. Concentrating efforts on customers who are likely to convert again will improve the sales and profitability figures.


**OBJECTIVE**: Using Deep Learning to predict if a customer will buy again from the AudioBook Company.


## Data

The data is taken from an Audiobook App. Every customer in it has purchased an Audiobook at least once.


Let's import the relevant libraries and have a look at the data.


In [1]:
import numpy as np
import os
from sklearn import preprocessing
import tensorflow as tf

In [4]:
import pandas as pd
data=pd.read_csv('Downloads/Audiobooks_data.csv',header=None)
data.head()

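|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 994 | 1620.0 | 1620 | 19.73 | 19.73 | 1 | 10.0 | 0.99 | 1603.8 | 5 | 92 | 0 |
| 1 | 1143 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 0 | 0 |
| 2 | 2059 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 388 | 0 |
| 3 | 2882 | 1620.0 | 1620 | 5.96 | 5.96 | 0 | 8.91 | 0.42 | 680.4 | 1 | 129 | 0 |
| 4 | 3342 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.22 | 475.2 | 0 | 361 | 0 |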


In the dataset above, each row represents a customer.

### Feature Description

The Features are:

* 0 - **I.D. of Customer**

* 1 - **Book Length in minutes**: The overall book length is the sum of the length in minutes of all Audiobook purchases.

* 2 - **Average Book Length in minutes**: It is the overall book length divided by the number of purchases. So if somebody has bought a single audio book, the average length and the overall length for this person will be equal.

* 3 - **Price Overall**: The total price paid, in Dollars. Price is almost always a good predictor of behavior.

* 4 - **Average Price**

* 5 - **Reviews**: It shows if the customer left a review. This is a metric that shows engagement with the platform. Our assumption is that people who leave reviews are more likely to convert again.

* 6 - **Review out of 10**: This is a different variable. It measures the review of a customer on a scale from 1 to 10.

* 7 - **Completion**: Completion is the total minutes listened to divided by the total length of books a person has purchased, assuming people don't re-listen to books.

* 8 - **Minutes Listened - Total**

* 9 - **Support Requests**: The total number of support requests the person has opened. Support is anything from a forgotten password to assistance on using the platform once more. This is a measure of engagement.

* 10 - **Last visited minus Purchase Date**: The difference between the last time a person interacted with the platform and their first purchase date. This is yet another measure of engagement, and the bigger the difference, the better: a person who engages regularly with the platform will have a larger value and is therefore more likely to convert again. If the value is zero, the customer has either never accessed what they bought or did so only on the day of purchase, so they are unlikely to convert again.

* 11 - **Target**: The targets are **one** if a person converted and **zero** if he or she didn't.
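The definition of **Completion** (column 7) can be checked against the sample rows printed by `data.head()`. The snippet below is only a small sanity check: it hard-codes three of those rows, and the variable names are my own labels, not column names from the file.

```python
import numpy as np

# Three rows copied from the data.head() output:
# (overall book length in minutes, completion, total minutes listened)
rows = np.array([
    [1620.0, 0.99, 1603.8],
    [1620.0, 0.42, 680.4],
    [2160.0, 0.22, 475.2],
])

book_length = rows[:, 0]       # column 1 of the dataset
completion = rows[:, 1]        # column 7 of the dataset
minutes_listened = rows[:, 2]  # column 8 of the dataset

# Completion should equal total minutes listened divided by overall book length.
recomputed = minutes_listened / book_length
print(np.round(recomputed, 2))
```

For these rows the recomputed values agree with the stored Completion column to two decimal places.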


#### Time Period

The data represents two years' worth of engagement. In order to create the targets, data for an extra six months after the two-year period was used to check whether a user converted. So the inputs pertain to the two-year period, and the targets to the following six-month period.
In other words, if a customer bought another book during the six-month period, we count them as a conversion and the target is 1. Otherwise it is 0. This is a classification problem with two classes, **won't buy** and **will buy**, represented by **zeros and ones**.


       
       


### Loading the Dataset and creating Inputs and Targets

In [7]:
file_path=os.path.join('Downloads','Audiobooks_data.csv')
raw_data_csv=np.loadtxt(fname=file_path,delimiter=',')

unscaled_inputs_all=raw_data_csv[:,1:-1]
targets_all=raw_data_csv[:,-1]

## Data Preprocessing

### Balancing the Dataset

The dataset should be balanced so that neither class dominates. Otherwise the model could reach a high accuracy simply by always predicting the majority class. So, let's count the targets that are 1 and keep only as many 0's as there are 1's.



In [8]:
num_one_targets=int(np.sum(targets_all))
zero_target_counter=0
indices_to_remove=[]

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_target_counter += 1
        if zero_target_counter > num_one_targets:
            indices_to_remove.append(i)
            
unscaled_inputs_equal_priors=np.delete(unscaled_inputs_all,indices_to_remove,axis=0)
targets_equal_priors=np.delete(targets_all,indices_to_remove,axis=0)
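The same balancing rule (keep every 1 and only the first equally many 0's) can also be written without an explicit loop. This is a sketch of an alternative, not the code that produced the results in this notebook; `balance_indices` and the toy targets are names and values I made up for illustration, and it assumes the targets contain only 0 and 1.

```python
import numpy as np

def balance_indices(targets):
    """Return the row indices to keep: every 1, plus the first len(ones) 0s."""
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    kept_zero_idx = zero_idx[:one_idx.size]
    # Sort so the kept rows stay in their original order, as np.delete leaves them.
    return np.sort(np.concatenate([one_idx, kept_zero_idx]))

# Toy targets (made up for illustration): 3 ones and 6 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0])
keep = balance_indices(toy_targets)
balanced = toy_targets[keep]
print(keep, balanced)
```

Indexing the inputs and the targets with the same `keep` array keeps the rows aligned, which is what the two `np.delete` calls achieve in the loop version.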

### Standardizing the Inputs

Let's scale the inputs and then shuffle the inputs and targets.

In [36]:
#Scaling
scaled_inputs=preprocessing.scale(unscaled_inputs_equal_priors)

#Shuffling
shuffled_indices=np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)


shuffled_inputs=scaled_inputs[shuffled_indices]
shuffled_targets=targets_equal_priors[shuffled_indices]
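To make the scaling step concrete: `preprocessing.scale` standardizes each column to zero mean and unit variance. A minimal NumPy sketch of that idea is shown below; `standardize` and the toy inputs are hypothetical names and values used only for illustration.

```python
import numpy as np

def standardize(x):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std

# Toy inputs (made up): two features on very different scales.
toy_inputs = np.array([[1620.0, 0.99],
                       [2160.0, 0.00],
                       [2160.0, 0.22]])
toy_scaled = standardize(toy_inputs)
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After standardizing, features such as book length in minutes and completion ratio live on comparable scales, so no single feature dominates the weight updates just because of its units.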

### Splitting Data into Train, Validation and Test Sets

In [37]:
samples_count=shuffled_inputs.shape[0]

train_samples_count=int(0.8*samples_count)
validation_samples_count=int(0.1*samples_count)
test_samples_count=samples_count-train_samples_count-validation_samples_count

train_inputs=shuffled_inputs[:train_samples_count]
train_targets=shuffled_targets[:train_samples_count]

validation_inputs=shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets=shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs=shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets=shuffled_targets[train_samples_count+validation_samples_count:]

Let's check whether each split is balanced: we balanced the dataset as a whole, but not the Train, Validation and Test sets individually.

In [38]:
print(np.sum(train_targets),train_samples_count,np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets),validation_samples_count,np.sum(validation_targets)/validation_samples_count)
print(np.sum(test_targets),test_samples_count,np.sum(test_targets)/test_samples_count)

1786.0 3579 0.4990220732048058
225.0 447 0.5033557046979866
226.0 448 0.5044642857142857


### Saving the data as '.npz'

In [39]:
np.savez('Audiobooks_Data_train',inputs=train_inputs,targets=train_targets)
np.savez('Audiobooks_Data_validation',inputs=validation_inputs,targets=validation_targets)
np.savez('Audiobooks_Data_test',inputs=test_inputs,targets=test_targets)

## Data Modeling

Let's use Deep Neural Networks to classify the customers. 

### Loading the npz files

In [40]:
# np.float and np.int were removed in NumPy 1.24; the built-in float and int behave the same here
npz1=np.load('Audiobooks_Data_train.npz')
train_inputs=npz1['inputs'].astype(float)
train_targets=npz1['targets'].astype(int)

npz2=np.load('Audiobooks_Data_validation.npz')
validation_inputs=npz2['inputs'].astype(float)
validation_targets=npz2['targets'].astype(int)

npz3=np.load('Audiobooks_Data_test.npz')
test_inputs=npz3['inputs'].astype(float)
test_targets=npz3['targets'].astype(int)

### Deep Neural Network

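The Neural Network has:
* 10 Input nodes for our 10 Features
* 3 Hidden Layers with 100 nodes each (`hidden_layer_size=100` in the code)
* The activation function for the first Hidden layer is the Hyperbolic Tangent 'tanh', and for the next two layers it is the Rectified Linear function 'relu'.
* An Output layer with 2 nodes, one per class, using the 'softmax' activation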

In [101]:
input_size=10
output_size=2
hidden_layer_size=100



model=tf.keras.Sequential([
   
   
    tf.keras.layers.Dense(hidden_layer_size,activation='tanh'),
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
    tf.keras.layers.Dense(hidden_layer_size,activation='relu'),
   
    tf.keras.layers.Dense(output_size,activation='softmax')
])


Let's use the optimizer 'Adam' and loss function 'sparse categorical crossentropy'.

In [102]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
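To see what the softmax output and the 'sparse categorical crossentropy' loss compute together, here is a small NumPy sketch. It illustrates the math on made-up numbers and is not how Keras implements it internally; the function names and toy values are my own.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class.

    'Sparse' means the labels are integer class indices (0 or 1 here),
    not one-hot vectors.
    """
    true_class_probs = probs[np.arange(labels.size), labels]
    return -np.log(true_class_probs).mean()

# Toy raw scores for three customers and their true classes (made up).
toy_logits = np.array([[2.0, 0.0],
                       [0.0, 1.0],
                       [0.5, 0.5]])
toy_labels = np.array([0, 1, 1])

toy_probs = softmax(toy_logits)
toy_loss = sparse_categorical_crossentropy(toy_probs, toy_labels)
print(toy_probs, toy_loss)
```

Because our targets are stored as the integers 0 and 1, the sparse variant of the loss can be used directly, with no need to one-hot encode them first.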

Let's employ **Early stopping** in order to avoid overfitting.

In [109]:
batch_size=100
max_epochs=100

early_stopping=tf.keras.callbacks.EarlyStopping(patience=2)
model.fit(train_inputs,
          train_targets,
          batch_size=batch_size,
          epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs,
          validation_targets),
          verbose=2)

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 0.3162 - accuracy: 0.8187 - val_loss: 0.3648 - val_accuracy: 0.8076
Epoch 2/100
3579/3579 - 1s - loss: 0.3132 - accuracy: 0.8287 - val_loss: 0.3716 - val_accuracy: 0.7852
Epoch 3/100
3579/3579 - 0s - loss: 0.3131 - accuracy: 0.8279 - val_loss: 0.3791 - val_accuracy: 0.8166


<tensorflow.python.keras.callbacks.History at 0x1a43650cd0>
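The log above stops after three epochs, which is what `patience=2` would lead us to expect: the validation loss is lowest at epoch 1 and then fails to improve for two epochs in a row. The sketch below is a simplified model of that rule, assuming the callback's default behavior of monitoring the validation loss with no minimum improvement; `early_stopping_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Validation losses taken from the training log above.
logged_val_losses = [0.3648, 0.3716, 0.3791]
stop_epoch = early_stopping_epoch(logged_val_losses, patience=2)
print(stop_epoch)  # 3
```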

## Testing the Model

In [110]:
test_loss,test_accuracy=model.evaluate(test_inputs,test_targets)
print(test_loss,test_accuracy)

0.32399983810526983 0.8214286
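Accuracy alone doesn't show which kind of mistake the model makes. As a sketch of how one might look further, the helper below turns softmax outputs into class predictions and counts them in a 2x2 confusion matrix. The function name and the toy probabilities are made up for illustration; on the real data, the probabilities would come from the trained model's predictions on the test inputs.

```python
import numpy as np

def accuracy_and_confusion(probs, targets):
    """Return (accuracy, confusion) where confusion[true_class, predicted_class] is a count."""
    predictions = probs.argmax(axis=1)  # pick the more probable class
    accuracy = (predictions == targets).mean()
    confusion = np.zeros((2, 2), dtype=int)
    np.add.at(confusion, (targets, predictions), 1)
    return accuracy, confusion

# Toy softmax outputs for four customers and their true classes (made up).
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4],
                      [0.3, 0.7]])
toy_targets = np.array([0, 1, 1, 0])

toy_accuracy, toy_confusion = accuracy_and_confusion(toy_probs, toy_targets)
print(toy_accuracy)
print(toy_confusion)
```

For an advertising budget, the two off-diagonal cells matter differently: one counts customers targeted who won't buy again, the other counts returning customers who were missed.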


#### The model's test accuracy is 82.14%, very close to the validation accuracy of 81.66%.

## Results

On the held-out test set, the Neural Network correctly predicts whether a customer will buy again about 82% of the time.