### Starbucks Capstone Challenge
# Feature Engineering

This notebook explores the available data set and generates features for use in machine learning models. It draws on two main concepts:

* #### Exploratory Data Analysis
Data exploration is one of the most important parts of the machine learning workflow: it reveals initial patterns in the data distribution and in the features, which in turn inform how to proceed with modeling and clustering the data.  
It relies on visual inspection to understand what a dataset contains and what its characteristics are, such as the amount of data, its completeness and correctness, and possible relationships among data elements.  


* #### Feature Engineering
Feature engineering is the process of determining which features might be useful for training a model, and then converting raw data from log files and other sources into those features.

In [1]:
## Import all the libraries necessary

import os

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import minmax_scale

In [2]:
## Global definitions

data_dir = 'data'

# None removes the limit; passing -1 for max_colwidth is deprecated in pandas >= 1.0
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

portfolio_data_path = os.path.join(data_dir, 'portfolio.json')
profile_data_path = os.path.join(data_dir, 'profile.json')
transcript_data_path = os.path.join(data_dir, 'transcript.json')

In [3]:
## global functions

def load_dataframe(data_path):
    """Create a dataframe from a json file"""
    return pd.read_json(data_path, orient='records', lines=True)
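
`load_dataframe` passes `lines=True`, which tells pandas to expect one JSON object per line rather than a single JSON array. A minimal sketch of that format follows. The two records are made up for illustration (they are not taken from the data files), and an in-memory buffer stands in for a file on disk:

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"id": "a1", "reward": 10, "channels": ["email", "mobile"]}\n'
    '{"id": "b2", "reward": 0, "channels": ["web", "email"]}\n'
)

# Same call as in load_dataframe, applied to the buffer instead of a path.
sample_df = pd.read_json(sample, orient='records', lines=True)
print(sample_df)
```

Each line becomes one row, and list-valued fields such as `channels` are kept as Python lists inside a single column.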

# Portfolio data set
This data set describes the offers that can be sent to users.

### Overview

In [4]:
portfolio_df = load_dataframe(portfolio_data_path)
portfolio_df

| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 |


### Transformation
As the table shows, **offer type** is a categorical feature that can be one-hot encoded.  

**Channels** is a categorical feature as well, but each offer can belong to more than one category. Its values can be converted into one binary feature per channel.  

**Reward**, **difficulty**, and **duration** are numerical features that should be scaled. Because reward and difficulty are both expressed in monetary units, they should share the same scale.
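
To illustrate why a shared scale matters, here is a small sketch comparing it with independent min-max scaling. It uses three (reward, difficulty) pairs copied from the portfolio table; the variable names are chosen for this example only:

```python
import pandas as pd

# Three (reward, difficulty) pairs copied from the portfolio table.
offers = pd.DataFrame({'reward': [10, 5, 2], 'difficulty': [10, 20, 10]})

# Shared scale: divide both columns by the single largest value (20 here).
shared = offers / offers.to_numpy().max()

# Independent min-max scaling: each column is stretched to [0, 1] on its own.
independent = (offers - offers.min()) / (offers.max() - offers.min())

print(shared)
print(independent)
```

In the first row the reward equals the difficulty (10 and 10). The shared scale keeps them equal at 0.5, whereas independent scaling maps them to 1.0 and 0.0, which hides the fact that both amounts are the same.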

In [5]:
## Set id as index
portfolio_df.set_index(keys='id', verify_integrity=True, inplace=True)

## Make offer_type one hot encoded
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to booleans)
portfolio_df = portfolio_df.join(
    pd.get_dummies(portfolio_df.pop('offer_type'), dtype=int))

## Transform channels into distinct features
channels_df = pd.DataFrame(portfolio_df.pop('channels'))
# One row per (offer, channel) pair
channels_df = channels_df.explode('channels')
channels_df = channels_df.assign(value=1)
# One column per channel; offers that lack a channel get NaN, replaced by 0
channels_df = channels_df.pivot(columns='channels', values='value')
channels_df.fillna(value=0, inplace=True)
portfolio_df = portfolio_df.join(channels_df)
channels_df = None

## Scale reward, difficulty, and duration
# reward and difficulty share the same scale
portfolio_df[['reward', 'difficulty']] /= \
    portfolio_df[['reward', 'difficulty']].to_numpy().max()
portfolio_df.duration = minmax_scale(portfolio_df.duration)

## print the result
portfolio_df

| id | reward | difficulty | duration | bogo | discount | informational | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|
| ae264e3637204a6fb9bb56bc8210ddfd | 0.5 | 0.5 | 0.571429 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4d5c57ea9a6940dd891ad53e9dbe8da0 | 0.5 | 0.5 | 0.285714 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3f207df678b143eea3cee63160fa8bed | 0.0 | 0.0 | 0.142857 | 0 | 0 | 1 | 1.0 | 1.0 | 0.0 | 1.0 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0.25 | 0.25 | 0.571429 | 1 | 0 | 0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 0.25 | 1.0 | 1.0 | 0 | 1 | 0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 0.15 | 0.35 | 0.571429 | 0 | 1 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| fafdcd668e3743c1bb461111dcafc2a4 | 0.1 | 0.5 | 1.0 | 0 | 1 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1.0 | 1.0 | 1.0 | 0.0 |
| f19421c1d4aa40978ebb69ca19b0e20d | 0.25 | 0.25 | 0.285714 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2906b810c7d4411798c6938adc9daaa5 | 0.1 | 0.5 | 0.571429 | 0 | 1 | 0 | 1.0 | 1.0 | 0.0 | 1.0 |
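
The channel columns can also be produced in a more compact way. The sketch below is an alternative to the explode/pivot steps used in this notebook, not the notebook's own method. It relies on `Series.str.join` and `Series.str.get_dummies`; the two channel lists come from the portfolio table, while the index labels `offer_a` and `offer_b` are hypothetical stand-ins for the offer ids:

```python
import pandas as pd

# Two channel lists copied from the portfolio table; the index labels are
# hypothetical placeholders for the real offer ids.
channels = pd.Series(
    [['email', 'mobile', 'social'], ['web', 'email']],
    index=['offer_a', 'offer_b'], name='channels')

# Join each list into one delimited string, then split that string into
# one indicator column per distinct channel.
channel_flags = channels.str.join('|').str.get_dummies(sep='|')
print(channel_flags)
```

This variant yields integer indicators directly, so no `fillna` step is needed. It assumes that no channel name contains the `|` separator.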


### Analysis

In [6]:
print('Missing data: {}\n'.format(portfolio_df.isna().any().any()))

print('Dataset description:')
display(portfolio_df.describe())  # describe() already returns a DataFrame

print('Pairwise correlation')
display(portfolio_df.corr().abs().round(2))

Missing data: False

Dataset description:

| | reward | difficulty | duration | bogo | discount | informational | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 0.21 | 0.385 | 0.5 | 0.4 | 0.4 | 0.2 | 1.0 | 0.9 | 0.6 | 0.8 |
| std | 0.179196 | 0.291595 | 0.331628 | 0.516398 | 0.516398 | 0.421637 | 0.0 | 0.316228 | 0.516398 | 0.421637 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.1 | 0.25 | 0.285714 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 50% | 0.2 | 0.425 | 0.571429 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 0.25 | 0.5 | 0.571429 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| max | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


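Pairwise correlation

| | reward | difficulty | duration | bogo | discount | informational | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|
| reward | 1.0 | 0.47 | 0.16 | 0.79 | 0.29 | 0.62 | NaN | 0.08 | 0.29 | 0.12 |
| difficulty | 0.47 | 1.0 | 0.81 | 0.03 | 0.6 | 0.7 | NaN | 0.74 | 0.15 | 0.24 |
| duration | 0.16 | 0.81 | 1.0 | 0.19 | 0.74 | 0.68 | NaN | 0.53 | 0.19 | 0.34 |
| bogo | 0.79 | 0.03 | 0.19 | 1.0 | 0.67 | 0.41 | NaN | 0.27 | 0.25 | 0.1 |
| discount | 0.29 | 0.6 | 0.74 | 0.67 | 1.0 | 0.41 | NaN | 0.41 | 0.17 | 0.41 |
| informational | 0.62 | 0.7 | 0.68 | 0.41 | 0.41 | 1.0 | NaN | 0.17 | 0.1 | 0.37 |
| email | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mobile | 0.08 | 0.74 | 0.53 | 0.27 | 0.41 | 0.17 | NaN | 1.0 | 0.41 | 0.17 |
| social | 0.29 | 0.15 | 0.19 | 0.25 | 0.17 | 0.1 | NaN | 0.41 | 1.0 | 0.41 |
| web | 0.12 | 0.24 | 0.34 | 0.1 | 0.41 | 0.37 | NaN | 0.17 | 0.41 | 1.0 |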
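
The `email` row and column are empty because every offer uses the email channel: the column is constant (standard deviation 0.0 in the description above), so its correlation with any other column is undefined. One possible way to handle this, sketched here on a small hypothetical frame rather than on `portfolio_df`, is to drop constant columns before computing the correlation:

```python
import pandas as pd

# Small hypothetical frame: 'email' is constant, as in the portfolio data.
frame = pd.DataFrame({
    'email': [1.0, 1.0, 1.0, 1.0],
    'mobile': [1.0, 1.0, 0.0, 1.0],
    'web': [0.0, 1.0, 1.0, 1.0],
})

# A constant column has zero variance, so its correlations come out as NaN.
email_is_undefined = frame.corr()['email'].isna().all()

# Keep only the columns that take more than one distinct value.
varying = frame.loc[:, frame.nunique() > 1]
corr = varying.corr().abs().round(2)
print(corr)
```

Whether to drop such a column from the features as well is a modeling decision. A constant feature carries no information for distinguishing offers, but this notebook keeps `email` in the resulting dataset.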


### Resulting dataset

In [7]:
portfolio_df.sort_values(by=['discount', 'bogo', 'reward', 'difficulty'])

| id | reward | difficulty | duration | bogo | discount | informational | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|
| 3f207df678b143eea3cee63160fa8bed | 0.0 | 0.0 | 0.142857 | 0 | 0 | 1 | 1.0 | 1.0 | 0.0 | 1.0 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1.0 | 1.0 | 1.0 | 0.0 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0.25 | 0.25 | 0.571429 | 1 | 0 | 0 | 1.0 | 1.0 | 0.0 | 1.0 |
| f19421c1d4aa40978ebb69ca19b0e20d | 0.25 | 0.25 | 0.285714 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| ae264e3637204a6fb9bb56bc8210ddfd | 0.5 | 0.5 | 0.571429 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 4d5c57ea9a6940dd891ad53e9dbe8da0 | 0.5 | 0.5 | 0.285714 | 1 | 0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| fafdcd668e3743c1bb461111dcafc2a4 | 0.1 | 0.5 | 1.0 | 0 | 1 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2906b810c7d4411798c6938adc9daaa5 | 0.1 | 0.5 | 0.571429 | 0 | 1 | 0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 0.15 | 0.35 | 0.571429 | 0 | 1 | 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 0.25 | 1.0 | 1.0 | 0 | 1 | 0 | 1.0 | 0.0 | 0.0 | 1.0 |
