### Starbucks Capstone Challenge
# Feature Engineering

This notebook prepares the data for the stated problem and generates the features that will feed the neural networks.  
The transcript records must be structured, and each offer sending labeled as an appropriate offer or not.

Feature Engineering is the process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into said features.

In [1]:
## Import all the necessary libraries
import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import quantile_transform, scale, robust_scale

In [2]:
## Global definitions

data_dir = 'data'

pd.set_option('display.precision', 2)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

portfolio_data_path = os.path.join(data_dir, 'portfolio.json')
profile_data_path = os.path.join(data_dir, 'profile.json')
transcript_data_path = os.path.join(data_dir, 'transcript.json')

In [3]:
## global functions

def load_dataframe(data_path):
    """Create a dataframe from a json file"""
    return pd.read_json(data_path, orient='records', lines=True)

# Portfolio dataset
Dataset containing information about the offers which can be sent to customers.

## Overview

In [4]:
## Load dataset
portfolio_df = load_dataframe(portfolio_data_path)
display(portfolio_df.head())

,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7


## Transformation
As we can see, **offer type** is a categorical feature that can be one-hot encoded.  

**Channels** is a categorical feature as well, but each offer can take more than one category, so its values are converted into individual binary features.  
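
As a minimal sketch of the multi-valued encoding described here, the toy frame below (hypothetical offer names, not the real portfolio) turns a list-valued column into one binary column per channel. It uses `explode` followed by `get_dummies` and a per-offer `max`, which is one possible route and not necessarily the one used in this notebook.

```python
import pandas as pd

# Toy data: two hypothetical offers, each sent through several channels
toy = pd.DataFrame(
    {"channels": [["email", "web"], ["email", "mobile", "social"]]},
    index=["offer_a", "offer_b"],
)

# One row per (offer, channel) pair, keeping the offer as the index
exploded = toy["channels"].explode()

# One indicator column per channel, then collapse back to one row per offer
multi_hot = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(multi_hot)
```

Each offer ends up with a 1 in every channel it uses and a 0 elsewhere.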

In [5]:
## Set id as index
portfolio_df.set_index(keys='id', verify_integrity=True, inplace=True)

## Make offer_type one hot encoded
portfolio_df = portfolio_df.join(
    pd.get_dummies(portfolio_df.pop('offer_type')))

## Transform channels in distinct features
channels_df = pd.DataFrame(portfolio_df.pop('channels'))
channels_df = channels_df.explode('channels')
channels_df = channels_df.assign(value=lambda x: 1)
channels_df = channels_df.pivot(columns='channels', values='value')
channels_df.fillna(value=0, inplace=True)
portfolio_df = portfolio_df.join(channels_df)
channels_df = None

## print the result
display(portfolio_df)

id,reward,difficulty,duration,bogo,discount,informational,email,mobile,social,web
ae264e3637204a6fb9bb56bc8210ddfd,10,10,7,1,0,0,1.0,1.0,1.0,0.0
4d5c57ea9a6940dd891ad53e9dbe8da0,10,10,5,1,0,0,1.0,1.0,1.0,1.0
3f207df678b143eea3cee63160fa8bed,0,0,4,0,0,1,1.0,1.0,0.0,1.0
9b98b8c7a33c4b65b9aebfe6a799e6d9,5,5,7,1,0,0,1.0,1.0,0.0,1.0
0b1e1539f2cc45b7b9fa7c272da2e1d7,5,20,10,0,1,0,1.0,0.0,0.0,1.0
2298d6c36e964ae4a3e7e9706d1fb8c2,3,7,7,0,1,0,1.0,1.0,1.0,1.0
fafdcd668e3743c1bb461111dcafc2a4,2,10,10,0,1,0,1.0,1.0,1.0,1.0
5a8bc65990b245e5a138643cd4eb9837,0,0,3,0,0,1,1.0,1.0,1.0,0.0
f19421c1d4aa40978ebb69ca19b0e20d,5,5,5,1,0,0,1.0,1.0,1.0,1.0
2906b810c7d4411798c6938adc9daaa5,2,10,7,0,1,0,1.0,1.0,0.0,1.0


## Analysis

In [6]:
## Filter out the email column: it is not an informative feature,
# since every offer uses this channel.
portfolio_df.drop(columns='email', inplace=True)

display(pd.DataFrame(portfolio_df.describe())
        .style.set_caption('Dataset description'))

display(portfolio_df.corr().abs()
        .style.set_caption('Pairwise correlation'))

,reward,difficulty,duration,bogo,discount,informational,mobile,social,web
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,4.2,7.7,6.5,0.4,0.4,0.2,0.9,0.6,0.8
std,3.58,5.83,2.32,0.52,0.52,0.42,0.32,0.52,0.42
min,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,5.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0
50%,4.0,8.5,7.0,0.0,0.0,0.0,1.0,1.0,1.0
75%,5.0,10.0,7.0,1.0,1.0,0.0,1.0,1.0,1.0
max,10.0,20.0,10.0,1.0,1.0,1.0,1.0,1.0,1.0


,reward,difficulty,duration,bogo,discount,informational,mobile,social,web
reward,1.0,0.47,0.16,0.79,0.29,0.62,0.08,0.29,0.12
difficulty,0.47,1.0,0.81,0.03,0.6,0.7,0.74,0.15,0.24
duration,0.16,0.81,1.0,0.19,0.74,0.68,0.53,0.19,0.34
bogo,0.79,0.03,0.19,1.0,0.67,0.41,0.27,0.25,0.1
discount,0.29,0.6,0.74,0.67,1.0,0.41,0.41,0.17,0.41
informational,0.62,0.7,0.68,0.41,0.41,1.0,0.17,0.1,0.37
mobile,0.08,0.74,0.53,0.27,0.41,0.17,1.0,0.41,0.17
social,0.29,0.15,0.19,0.25,0.17,0.1,0.41,1.0,0.41
web,0.12,0.24,0.34,0.1,0.41,0.37,0.17,0.41,1.0


## Resulting dataset

In [7]:
display(portfolio_df)

id,reward,difficulty,duration,bogo,discount,informational,mobile,social,web
ae264e3637204a6fb9bb56bc8210ddfd,10,10,7,1,0,0,1.0,1.0,0.0
4d5c57ea9a6940dd891ad53e9dbe8da0,10,10,5,1,0,0,1.0,1.0,1.0
3f207df678b143eea3cee63160fa8bed,0,0,4,0,0,1,1.0,0.0,1.0
9b98b8c7a33c4b65b9aebfe6a799e6d9,5,5,7,1,0,0,1.0,0.0,1.0
0b1e1539f2cc45b7b9fa7c272da2e1d7,5,20,10,0,1,0,0.0,0.0,1.0
2298d6c36e964ae4a3e7e9706d1fb8c2,3,7,7,0,1,0,1.0,1.0,1.0
fafdcd668e3743c1bb461111dcafc2a4,2,10,10,0,1,0,1.0,1.0,1.0
5a8bc65990b245e5a138643cd4eb9837,0,0,3,0,0,1,1.0,1.0,0.0
f19421c1d4aa40978ebb69ca19b0e20d,5,5,5,1,0,0,1.0,1.0,1.0
2906b810c7d4411798c6938adc9daaa5,2,10,7,0,1,0,1.0,0.0,1.0


# Profile data set
Dataset containing demographic data for each one of the reward program users.

## Overview

In [8]:
profile_df = load_dataframe(profile_data_path)
display(profile_df.head())

,gender,age,id,became_member_on,income
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0
4,,118,a03223e636434f42ac4c3df47e8bac43,20170804,


## Transformation

Although the users with missing data could be filtered out of the dataset, they might show some peculiar behaviour that is worth analyzing.  
Thus, my approach with those records is:  
* keep those records in the dataset
* mark those users as belonging to a particular class by creating a new indicator feature
* since gender is a discrete feature, create an additional gender category
* as age and income are continuous features, fill them with their respective mean values
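
The steps listed above can be sketched on a toy frame. The three rows are made up for illustration (only the column names follow the profile dataset), and the snippet is a simplified outline under those assumptions, not the exact cell sequence used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy profile data: the second user has no gender and no income
toy = pd.DataFrame({
    "gender": ["F", None, "M"],
    "age": [55.0, 118.0, 65.0],
    "income": [112000.0, np.nan, 53000.0],
})

# 1. Flag the users with missing data instead of dropping them
missing = toy["gender"].isna()
toy["missing_data"] = missing.astype(int)

# 2. Give them their own gender category
toy["gender"] = toy["gender"].fillna("U")

# 3. Fill the continuous features with the mean of the complete records
known = toy.loc[~missing, ["age", "income"]]
toy.loc[missing, "age"] = known["age"].mean()
toy.loc[missing, "income"] = known["income"].mean()
print(toy)
```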

The feature *became_member_on* is a date written in the form YYYYMMDD, so its values are difficult to compare because they are not linear in time.  
Consider the dates 20171230, 20171231, and 20180101. Although consecutive dates are the same distance apart (1 day), when they are treated as numbers the distances become 1 and 8870, respectively.  
Thus, this feature should be converted to a long integer with a continuous representation. 
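
The arithmetic in this example can be checked directly. The short sketch below compares the gaps between the three dates when they are treated as plain numbers and when they are parsed as dates; the variable names are illustrative only.

```python
import pandas as pd

raw = pd.Series([20171230, 20171231, 20180101])

# Gaps when the YYYYMMDD values are treated as plain integers
naive_gaps = raw.diff().dropna().astype(int).tolist()

# Gaps in days once the values are parsed as real dates
dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")
day_gaps = dates.diff().dropna().dt.days.tolist()

print(naive_gaps)  # [1, 8870]
print(day_gaps)    # [1, 1]
```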

In [9]:
# create new feature to indicate missing values
missing_data = profile_df.gender.isna()
profile_df = profile_df.assign(missing_data=missing_data.astype(int))

# fill gender with a new 'U' (unknown) category
profile_df['gender'] = profile_df['gender'].mask(missing_data, 'U')

# Make gender one-hot encoded
profile_df = profile_df.join(
    pd.get_dummies(profile_df.pop('gender')))

# Set id as index
profile_df.set_index(keys='id', verify_integrity=True, inplace=True)

## Convert the date to a continuous integer (nanoseconds since the Unix epoch)
profile_df.became_member_on = pd.to_datetime(
    profile_df.became_member_on, format='%Y%m%d').astype(np.int64)

In [10]:
display(pd.DataFrame(profile_df.describe()).round(2).style \
        .set_caption('Dataset description'))

display(profile_df.corr().abs().style \
        .set_caption('Pairwise correlation'))

,age,became_member_on,income,missing_data,F,M,O,U
count,17000.0,17000.0,14825.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,62.53,1.487855530164706e+18,65404.99,0.13,0.36,0.5,0.01,0.13
std,26.74,3.5529745278667744e+16,21598.3,0.33,0.48,0.5,0.11,0.33
min,18.0,1.375056e+18,30000.0,0.0,0.0,0.0,0.0,0.0
25%,45.0,1.4642208e+18,49000.0,0.0,0.0,0.0,0.0,0.0
50%,58.0,1.501632e+18,64000.0,0.0,0.0,0.0,0.0,0.0
75%,73.0,1.514592e+18,80000.0,0.0,1.0,1.0,0.0,0.0
max,118.0,1.5325632e+18,120000.0,1.0,1.0,1.0,1.0,1.0


,age,became_member_on,income,missing_data,F,M,O,U
age,1.0,0.02,0.31,0.79,0.14,0.39,0.03,0.79
became_member_on,0.02,1.0,0.03,0.03,0.01,0.03,0.01,0.03
income,0.31,0.03,1.0,,0.23,0.23,0.01,
missing_data,0.79,0.03,,1.0,0.29,0.38,0.04,1.0
F,0.14,0.01,0.23,0.29,1.0,0.75,0.08,0.29
M,0.39,0.03,0.23,0.38,0.75,1.0,0.11,0.38
O,0.03,0.01,0.01,0.04,0.08,0.11,1.0,0.04
U,0.79,0.03,,1.0,0.29,0.38,0.04,1.0


## Resulting dataset

In [11]:
display(profile_df.head(10))

id,age,became_member_on,income,missing_data,F,M,O,U
68be06ca386d4c31939f3a4f0e3dd783,118,1486857600000000000,,1,0,0,0,1
0610b486422d4921ae7d2bf64640c50b,55,1500076800000000000,112000.0,0,1,0,0,0
38fe809add3b4fcf9315a9694bb96ff5,118,1531353600000000000,,1,0,0,0,1
78afa995795e4d85b5d9ceeca43f5fef,75,1494288000000000000,100000.0,0,1,0,0,0
a03223e636434f42ac4c3df47e8bac43,118,1501804800000000000,,1,0,0,0,1
e2127556f4f64592b11af22de27a7932,68,1524700800000000000,70000.0,0,0,1,0,0
8ec6ce2a7e7949b1bf142def7d0e0586,118,1506297600000000000,,1,0,0,0,1
68617ca6246f4fbc85e91a2a49552598,118,1506902400000000000,,1,0,0,0,1
389bc3fa690240e798340f5a15918d5c,65,1518134400000000000,53000.0,0,0,1,0,0
8974fc5686fe429db53ddde067b88302,118,1479772800000000000,,1,0,0,0,1


# Transcript data set
Event log containing records for transactions, offers received, offers viewed, and offers completed.

## Overview

In [12]:
transcript_df = load_dataframe(transcript_data_path)
display(transcript_df.head())

,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0


## Transformation

In [13]:
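## Expand the dictionaries in the value column into separate columns
transcript_df = transcript_df.join(
    pd.DataFrame.from_records(transcript_df.pop('value')))
## Merge the two key spellings ('offer id' and 'offer_id') into a single column
transcript_df['offer_id'] = transcript_df.pop('offer id').combine_first(
    transcript_df['offer_id'])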

display(transcript_df)

,person,event,time,amount,offer_id,reward
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,e2127556f4f64592b11af22de27a7932,offer received,0,,2906b810c7d4411798c6938adc9daaa5,
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,,fafdcd668e3743c1bb461111dcafc2a4,
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,,4d5c57ea9a6940dd891ad53e9dbe8da0,
...,...,...,...,...,...,...
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,1.59,,
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,9.53,,
306531,a00058cf10334a308c68e7631c529907,transaction,714,3.61,,
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,3.53,,


# Data cleaning

In [14]:
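## Collect users with at least one transaction of 50 or more,
# plus users with no recorded transaction amount at all
users_to_discard = np.append(
    transcript_df.query('amount >= 50').person.unique(),
    transcript_df.groupby('person').count().query('amount == 0').index.values)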

profile_df = profile_df.query('id not in @users_to_discard')
transcript_df = transcript_df.query('person not in @users_to_discard')

# Transformation after data cleaning
After removing some records from the profile dataset, it is time to scale the features and fill in the missing values.
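
To illustrate what robust scaling does, here is a hand-rolled NumPy sketch that centers on the median and divides by the interquartile range, which is the default behaviour of scikit-learn's `robust_scale`. The helper `robust_scale_1d` and the income values are hypothetical and only serve the illustration.

```python
import numpy as np

def robust_scale_1d(values):
    """Center on the median and scale by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

# One extreme value barely affects the median and the IQR
incomes = [30000, 50000, 60000, 70000, 500000]
scaled = robust_scale_1d(incomes)
print(scaled)  # [-1.5 -0.5  0.   0.5 22. ]
```

Because the median and the IQR ignore the tails, the bulk of the values stays in a narrow range even when an outlier is present.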

In [15]:
data = profile_df.filter(['became_member_on'])
data_transf = quantile_transform(data, output_distribution='normal', copy=True)
data = pd.DataFrame(data_transf, columns=data.columns, index=data.index)
profile_df.update(data)

data = profile_df.query('missing_data == 0').filter(['age','income'])
# data_transf = quantile_transform(data, output_distribution='normal', copy=False)
data_transf = robust_scale(data, copy=False)
data = pd.DataFrame(data_transf, columns=data.columns, index=data.index)
profile_df.update(data)

missing_data = profile_df.income.isna()
income_mean = profile_df[~(missing_data.values)].income.mean()
age_mean = profile_df[~(missing_data.values)].age.mean()

# fill age and income with their respective mean values
profile_df.loc[missing_data.values, 'age'] = age_mean
profile_df.loc[missing_data.values, 'income'] = income_mean

# Classify offer sendings

In [16]:
## Split records according to event
offer_received_df = transcript_df.query('event == "offer received"')
offer_viewed_df = transcript_df.query('event == "offer viewed"')
offer_completed_df = transcript_df.query('event == "offer completed"')
transaction_df = transcript_df.query('event == "transaction"')

In [17]:
## Remove unnecessary columns
offer_received_df = offer_received_df.drop(columns=['amount', 'reward'])

## Create records for the sending times at which a customer did not receive an offer
offer_sending_time = offer_received_df.time.unique()
new_index = pd.MultiIndex.from_product(
    (profile_df.index, offer_sending_time),
    names=['person', 'time'])

offer_received_df = offer_received_df.set_index(['person', 'time'])
offer_received_df = offer_received_df.sort_index()
offer_received_df = offer_received_df.reindex(new_index)
offer_received_df = offer_received_df.reset_index()

## Create column to indicate when the offer ends
offer_received_df = offer_received_df.join(portfolio_df.duration*24, on='offer_id')
offer_received_df = offer_received_df.assign(
    offer_ends_on=offer_received_df.time + offer_received_df.duration)

## Create column to indicate when the offer is informational
offer_received_df = offer_received_df.join(
    portfolio_df.informational, on='offer_id')

## Create column to indicate when the offer is viewed or completed
offer_received_df = offer_received_df.assign(viewed_on=np.nan, completed_on=np.nan)

## Create a column to hold the label for that offer sending
offer_received_df = offer_received_df.assign(label=np.nan)

In [18]:
%%time

next_sending_time = {0: 168, 168: 336, 336: 408,
                     408: 504, 504: 576, 576: 720}

def classify_offers(row):
    if pd.isna(row.offer_id):
        row.offer_ends_on = next_sending_time[row.time]
        try:
            # In this case, there is no offer to be viewed or completed.
            # So we look for a simple transaction.
            row.completed_on = transaction_df.query(
                'person == @row.person ' \
                'and time >= @row.time ' \
                'and time <= @row.offer_ends_on').time.values[0]
        except:
            # If there is no transaction in this period
            row.label = 0
        else:
            row.label = 1
        finally:
            return row

    try:
        row.viewed_on = offer_viewed_df.query(
            'person == @row.person ' \
            'and offer_id == @row.offer_id ' \
            'and time >= @row.time ' \
            'and time <= @row.offer_ends_on').time.values[0]
    except:
        # Offer was not viewed
        row.label = 0
        return row

    
    if row.informational == 1:
        try:
            # In this case, there is no offer to be completed.
            # So we look for a simple transaction after offer viewed
            row.completed_on = transaction_df.query(
                'person == @row.person ' \
                'and time >= @row.viewed_on ' \
                'and time <= @row.offer_ends_on').time.values[0]
        except:
            # If there is no transaction in this period
            row.label = 0
        else:
            row.label = 1
        finally:
            return row


    try:
        # In the other cases, we need an offer completion
        row.completed_on = offer_completed_df.query(
            'person == @row.person ' \
            'and offer_id == @row.offer_id ' \
            'and time >= @row.viewed_on ' \
            'and time <= @row.offer_ends_on').time.values[0]
    except:
        # If the offer was not completed
        row.label = 0
    else:
        row.label = 1
    finally:
        return row


offer_received_df = offer_received_df.apply(classify_offers, axis=1)

## Remove auxiliary columns
offer_received_df.drop(
    inplace=True,
    columns=['duration', 'offer_ends_on', 'informational',
             'viewed_on', 'completed_on'])

CPU times: user 21min 48s, sys: 294 ms, total: 21min 48s
Wall time: 21min 49s
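
The row-wise `apply` is slow because every row triggers several `query` calls. As a hedged sketch of a possible alternative (not the method used in this notebook), the "first view inside the offer window" step could be expressed with a merge. The toy frames and their values below are made up for illustration.

```python
import pandas as pd

# Toy offer sendings with their validity window (hours)
received = pd.DataFrame({
    "person": ["a", "a", "b"],
    "offer_id": ["x", "y", "x"],
    "time": [0, 168, 0],
    "offer_ends_on": [120, 336, 120],
})
# Toy view events: user b views offer x only after it expired
viewed = pd.DataFrame({
    "person": ["a", "b"],
    "offer_id": ["x", "x"],
    "viewed_on": [6, 200],
})

# Pair every sending with every view of the same offer by the same person
merged = received.merge(viewed, on=["person", "offer_id"], how="left")

# Keep only views that fall inside the sending's window
in_window = merged["viewed_on"].between(merged["time"], merged["offer_ends_on"])
merged["viewed_on"] = merged["viewed_on"].where(in_window)

# Earliest valid view per sending (NaN when the offer was never viewed in time)
first_view = merged.groupby(
    ["person", "offer_id", "time"], as_index=False)["viewed_on"].min()
print(first_view)
```

The same pattern could be repeated for completions and transactions, after which the label would follow from which of those columns are non-null.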


# Transformation after classifying offers
With the offers labeled, the numeric portfolio features can now be standardized. This step is deferred until here because the raw duration values were needed above to compute when each offer ends.
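
For reference, standardization subtracts the column mean and divides by the population standard deviation. The NumPy sketch below applies this by hand to the ten raw `duration` values from the portfolio table shown earlier; it is an illustration of the formula, not a replacement for the scikit-learn call.

```python
import numpy as np

# Raw duration values (days) of the ten offers, in portfolio order
duration = np.array([7, 5, 4, 7, 10, 7, 10, 3, 5, 7], dtype=float)

# z-score: zero mean, unit variance (population standard deviation, ddof=0)
z = (duration - duration.mean()) / duration.std()
print(np.round(z, 2))
```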

In [19]:
## Transform PORTFOLIO
data = portfolio_df[['reward','difficulty','duration']]
data = scale(data)
portfolio_df[['reward','difficulty','duration']] = data
portfolio_df

id,reward,difficulty,duration,bogo,discount,informational,mobile,social,web
ae264e3637204a6fb9bb56bc8210ddfd,1.71,0.42,0.23,1,0,0,1.0,1.0,0.0
4d5c57ea9a6940dd891ad53e9dbe8da0,1.71,0.42,-0.68,1,0,0,1.0,1.0,1.0
3f207df678b143eea3cee63160fa8bed,-1.24,-1.39,-1.14,0,0,1,1.0,0.0,1.0
9b98b8c7a33c4b65b9aebfe6a799e6d9,0.24,-0.49,0.23,1,0,0,1.0,0.0,1.0
0b1e1539f2cc45b7b9fa7c272da2e1d7,0.24,2.22,1.59,0,1,0,0.0,0.0,1.0
2298d6c36e964ae4a3e7e9706d1fb8c2,-0.35,-0.13,0.23,0,1,0,1.0,1.0,1.0
fafdcd668e3743c1bb461111dcafc2a4,-0.65,0.42,1.59,0,1,0,1.0,1.0,1.0
5a8bc65990b245e5a138643cd4eb9837,-1.24,-1.39,-1.59,0,0,1,1.0,1.0,0.0
f19421c1d4aa40978ebb69ca19b0e20d,0.24,-0.49,-0.68,1,0,0,1.0,1.0,1.0
2906b810c7d4411798c6938adc9daaa5,-0.65,0.42,0.23,0,1,0,1.0,0.0,1.0


# Create Features

### Join datasets

In [20]:
## Set index as person and time
offer_received_df = offer_received_df.set_index(['person', 'time'])
offer_received_df = offer_received_df.sort_index()

## Join offer data with portfolio and profile data
offer_received_df = offer_received_df.join(portfolio_df, on='offer_id')
offer_received_df = offer_received_df.join(profile_df, on='person')

## Fill NA
offer_received_df.fillna(value=0, inplace=True)

display(offer_received_df)

person,time,event,offer_id,label,reward,difficulty,duration,bogo,discount,informational,mobile,social,web,age,became_member_on,income,missing_data,F,M,O,U
0009655768c64bdeb2e877511632db8f,0,0,0,0,0.00,0.00,0.00,0.0,0.0,0.0,0.0,0.0,0.0,-0.92,-0.15,0.30,0,0,1,0,0
0009655768c64bdeb2e877511632db8f,168,offer received,5a8bc65990b245e5a138643cd4eb9837,1,-1.24,-1.39,-1.59,0.0,0.0,1.0,1.0,1.0,0.0,-0.92,-0.15,0.30,0,0,1,0,0
0009655768c64bdeb2e877511632db8f,336,offer received,3f207df678b143eea3cee63160fa8bed,1,-1.24,-1.39,-1.14,0.0,0.0,1.0,1.0,0.0,1.0,-0.92,-0.15,0.30,0,0,1,0,0
0009655768c64bdeb2e877511632db8f,408,offer received,f19421c1d4aa40978ebb69ca19b0e20d,0,0.24,-0.49,-0.68,1.0,0.0,0.0,1.0,1.0,1.0,-0.92,-0.15,0.30,0,0,1,0,0
0009655768c64bdeb2e877511632db8f,504,offer received,fafdcd668e3743c1bb461111dcafc2a4,0,-0.65,0.42,1.59,0.0,1.0,0.0,1.0,1.0,1.0,-0.92,-0.15,0.30,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ffff82501cea40309d5fdd7edcca4a07,168,offer received,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,0.24,2.22,1.59,0.0,1.0,0.0,0.0,0.0,1.0,-0.42,-0.37,-0.03,0,1,0,0,0
ffff82501cea40309d5fdd7edcca4a07,336,offer received,2906b810c7d4411798c6938adc9daaa5,1,-0.65,0.42,0.23,0.0,1.0,0.0,1.0,0.0,1.0,-0.42,-0.37,-0.03,0,1,0,0,0
ffff82501cea40309d5fdd7edcca4a07,408,offer received,2906b810c7d4411798c6938adc9daaa5,1,-0.65,0.42,0.23,0.0,1.0,0.0,1.0,0.0,1.0,-0.42,-0.37,-0.03,0,1,0,0,0
ffff82501cea40309d5fdd7edcca4a07,504,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,0,0.24,-0.49,0.23,1.0,0.0,0.0,1.0,0.0,1.0,-0.42,-0.37,-0.03,0,1,0,0,0


### Plot the correlation matrix

In [21]:
## Restrict to numeric columns (event and offer_id hold strings)
offer_received_df.select_dtypes('number').corr().abs().round(2)

,label,reward,difficulty,duration,bogo,discount,informational,mobile,social,web,age,became_member_on,income,missing_data,F,M,O,U
label,1.0,0.06,0.08,0.01,0.1,0.03,0.04,0.08,0.06,0.14,0.03,0.15,0.05,0.13,0.08,0.0,0.02,0.13
reward,0.06,1.0,0.47,0.16,0.73,0.26,0.6,0.04,0.24,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
difficulty,0.08,0.47,1.0,0.81,0.03,0.55,0.67,0.41,0.13,0.17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
duration,0.01,0.16,0.81,1.0,0.17,0.69,0.66,0.29,0.16,0.24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bogo,0.1,0.73,0.03,0.17,1.0,0.43,0.27,0.46,0.39,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
discount,0.03,0.26,0.55,0.69,0.43,1.0,0.27,0.11,0.07,0.54,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
informational,0.04,0.6,0.67,0.66,0.27,0.27,1.0,0.29,0.04,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mobile,0.08,0.04,0.41,0.29,0.46,0.11,0.29,1.0,0.63,0.52,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0
social,0.06,0.24,0.13,0.16,0.39,0.07,0.04,0.63,1.0,0.12,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0
web,0.14,0.08,0.17,0.24,0.2,0.54,0.08,0.52,0.12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0


Here, the offer features are essentially uncorrelated with the customer features. The exception within the profile dataset is the unknown gender column (*U*), which is perfectly correlated with *missing_data*. Thus, *missing_data* will be used as a feature, whereas *U* will not.
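
A check like this can also be automated. The sketch below, on a made-up three-column frame, lists the column pairs whose absolute correlation is effectively 1; the 0.999 threshold is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame in which U duplicates missing_data
toy = pd.DataFrame({
    "missing_data": [1, 0, 1, 0, 0],
    "U":            [1, 0, 1, 0, 0],
    "F":            [0, 1, 0, 0, 1],
})

corr = toy.corr().abs()

# Look only above the diagonal so that each pair is reported once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
redundant_pairs = [
    (corr.index[i], corr.columns[j])
    for i, j in zip(*np.where(upper & (corr.to_numpy() > 0.999)))
]
print(redundant_pairs)  # [('missing_data', 'U')]
```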

### Organize customer data

In [22]:
%%time

## Make sure the dataframe is ordered
offer_received_df = offer_received_df.sort_index()

## Choose features to feed the networks
features_cols = [
    'reward', 'difficulty', 'duration',                 # offer characteristics
    'bogo', 'discount', 'informational',                # offer type
    'mobile', 'social', 'web',                          # channels
    'age','became_member_on', 'income', 'missing_data', # customer data
    'F', 'M', 'O'                                       # customer gender
]

## Create feature and target lists where each position
# holds the data related to one customer
features = []
targets = []
for index, user_data in offer_received_df.groupby('person'):
    targets.append(user_data['label'].values)
    features.append(user_data[features_cols].values)

CPU times: user 16.3 s, sys: 20 ms, total: 16.4 s
Wall time: 16 s


### Create the data loaders

In [27]:
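## Convert features and targets to tensors
# (stack the per-customer arrays first, so each tensor is built from a single array)
features = torch.as_tensor(np.stack(features), dtype=torch.float)
targets = torch.as_tensor(np.stack(targets), dtype=torch.long)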


## Split data into three random datasets 

# Generate random indices
len_dataset = len(features)
random_idx = np.random.choice(len_dataset, len_dataset, replace=False)

# Use the proportions: train: 80%, valid: 10%, test: 10%
train_idx = random_idx[:int(len_dataset*0.8)]
valid_idx = random_idx[int(len_dataset*0.8):-int(len_dataset*0.1)]
test_idx = random_idx[-int(len_dataset*0.1):]

# Create datasets
train_dataset = TensorDataset(features[train_idx], targets[train_idx])
valid_dataset = TensorDataset(features[valid_idx], targets[valid_idx])
test_dataset = TensorDataset(features[test_idx], targets[test_idx])

# Create dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
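
Note that this split is unseeded, so the three subsets change on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable, is given below; the helper `split_indices` and its parameters are hypothetical names, not part of the notebook.

```python
import numpy as np

def split_indices(n_items, seed=0, train_frac=0.8, valid_frac=0.1):
    """Return disjoint train/valid/test index arrays from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_items)
    n_train = int(n_items * train_frac)
    n_valid = int(n_items * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

train_idx, valid_idx, test_idx = split_indices(100, seed=42)
print(len(train_idx), len(valid_idx), len(test_idx))  # 80 10 10
```

Calling the helper again with the same seed returns the same indices, which makes the downstream results comparable across runs.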

### Count targets within each dataloader

In [28]:
# TARGETS FOR TRAINING
targ_neg = (targets[train_idx] == 0).sum()
targ_pos = (targets[train_idx] == 1).sum()
targ_total = targ_neg + targ_pos
print('Train dataset:\tCN: {:5d} / {:5.2f}%\tCP: {:5d} / {:5.2f}%' \
      .format(targ_neg, targ_neg*100./targ_total,
              targ_pos, targ_pos*100./targ_total))

# TARGETS FOR VALIDATION
targ_neg = (targets[valid_idx] == 0).sum()
targ_pos = (targets[valid_idx] == 1).sum()
targ_total = targ_neg + targ_pos
print('Valid dataset:\tCN: {:5d} / {:5.2f}%\tCP: {:5d} / {:5.2f}%' \
      .format(targ_neg, targ_neg*100./targ_total,
              targ_pos, targ_pos*100./targ_total))

# TARGETS FOR TESTING
targ_neg = (targets[test_idx] == 0).sum()
targ_pos = (targets[test_idx] == 1).sum()
targ_total = targ_neg + targ_pos
print('Test dataset:\tCN: {:5d} / {:5.2f}%\tCP: {:5d} / {:5.2f}%' \
      .format(targ_neg, targ_neg*100./targ_total,
              targ_pos, targ_pos*100./targ_total))

Train dataset:	CN: 42200 / 55.33%	CP: 34072 / 44.67%
Valid dataset:	CN:  5255 / 55.08%	CP:  4285 / 44.92%
Test dataset:	CN:  5247 / 55.03%	CP:  4287 / 44.97%


### Save the dataloaders into a zip file

In [29]:
torch.save((train_dataloader, valid_dataloader, test_dataloader), 'dataloaders.pt')
!zip dataloaders.zip dataloaders.pt
!rm dataloaders.pt

  adding: dataloaders.pt (deflated 93%)
