# Clean portfolio data

This section encodes the offers in the portfolio dataset numerically so that they can be used by the machine learning model.

In [1]:
import pandas as pd

In [2]:
portfolio = pd.read_csv('./data/processed/portfolio.csv')
portfolio

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"['email', 'mobile', 'social']",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"['web', 'email', 'mobile', 'social']",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"['web', 'email', 'mobile']",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"['web', 'email', 'mobile']",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"['web', 'email']",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7
5,3,"['web', 'email', 'mobile', 'social']",7,7,discount,2298d6c36e964ae4a3e7e9706d1fb8c2
6,2,"['web', 'email', 'mobile', 'social']",10,10,discount,fafdcd668e3743c1bb461111dcafc2a4
7,0,"['email', 'mobile', 'social']",0,3,informational,5a8bc65990b245e5a138643cd4eb9837
8,5,"['web', 'email', 'mobile', 'social']",5,5,bogo,f19421c1d4aa40978ebb69ca19b0e20d
9,2,"['web', 'email', 'mobile']",10,7,discount,2906b810c7d4411798c6938adc9daaa5


The channels through which each offer is sent are stored as string representations of Python lists. To turn this information into a feature a model can learn from, I converted it to a binary encoding with one column per channel (`email`, `mobile`, `social`, `web`).

In [3]:
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Parse each string into a list; literal_eval only accepts Python literals
channels = portfolio['channels'].apply(ast.literal_eval)

# Binarise labels
binariser = MultiLabelBinarizer()
encoded_channels = binariser.fit_transform(channels)

# Create pandas dataframe and show output
encoded_channels = pd.DataFrame(encoded_channels, columns=binariser.classes_)
encoded_channels

Unnamed: 0,email,mobile,social,web
0,1,1,1,0
1,1,1,1,1
2,1,1,0,1
3,1,1,0,1
4,1,0,0,1
5,1,1,1,1
6,1,1,1,1
7,1,1,1,0
8,1,1,1,1
9,1,1,0,1
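As a cross-check on the encoding above, the same kind of table can be produced with pandas alone. The following is a minimal sketch on two toy rows (the `raw` series is made up for illustration and is not the project data); it parses the strings with `ast.literal_eval` and uses `Series.str.get_dummies` rather than `MultiLabelBinarizer`.

```python
import ast

import pandas as pd

# Toy stand-in for the `channels` column: lists stored as strings
raw = pd.Series(["['email', 'mobile', 'social']", "['web', 'email']"])

# Parse each string into a real Python list
parsed = raw.apply(ast.literal_eval)

# Join each list into one delimited string, then split it into indicator columns
encoded = parsed.str.join('|').str.get_dummies(sep='|')

print(encoded)
```

The columns come out in alphabetical order (`email`, `mobile`, `social`, `web`), matching the column order of the table above.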


Linear regression was initially selected as the baseline model, so the `offer_type` column also needs to be binary-encoded. Full one-hot encoding would create three binary features: `bogo`, `discount`, and `informational`. To avoid the "dummy variable trap", one of them is dropped. With `drop_first=True`, pandas drops the first category in sorted order, which is `bogo` in this case, so `bogo` becomes the reference category (both remaining columns are 0).

In [4]:
offer_type_encoded = pd.get_dummies(portfolio['offer_type'], drop_first=True)
offer_type_encoded

Unnamed: 0,discount,informational
0,0,0
1,0,0
2,0,1
3,0,0
4,1,0
5,1,0
6,1,0
7,0,1
8,0,0
9,1,0
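To illustrate why one category is dropped, the sketch below compares the rank of a design matrix with and without the full set of dummy columns. It assumes a model with an intercept column of ones and hard-codes the ten `offer_type` values from the table above; `dtype=int` is passed only to get 0/1 columns regardless of pandas version.

```python
import numpy as np
import pandas as pd

# offer_type values of the ten offers, in row order
offer_type = pd.Series([
    'bogo', 'bogo', 'informational', 'bogo', 'discount',
    'discount', 'discount', 'informational', 'bogo', 'discount',
])

full = pd.get_dummies(offer_type, dtype=int)                      # 3 columns
reduced = pd.get_dummies(offer_type, drop_first=True, dtype=int)  # 2 columns

# Prepend an intercept column, as a linear regression with intercept would
intercept = np.ones(len(offer_type))
X_full = np.column_stack([intercept, full.to_numpy()])
X_reduced = np.column_stack([intercept, reduced.to_numpy()])

# The three dummies sum to the intercept column, so X_full is rank-deficient
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

With all three dummies the matrix has four columns but only rank 3, so the least-squares coefficients are not uniquely determined; dropping one category restores full column rank.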


The encoded features are then concatenated column-wise with the original `portfolio` dataframe.

In [5]:
# Merge back to the portfolio dataframe
portfolio_encoded = pd.concat(
    [portfolio, encoded_channels, offer_type_encoded],
    axis=1
)

# Clean up and rearrange columns
portfolio_encoded.drop(['channels', 'offer_type'], axis=1, inplace=True)
column_order = [
    'id', 'reward', 'difficulty', 'duration',
    'email', 'mobile', 'social', 'web',
    'discount', 'informational',
]
portfolio_encoded = portfolio_encoded[column_order]
portfolio_encoded

Unnamed: 0,id,reward,difficulty,duration,email,mobile,social,web,discount,informational
0,ae264e3637204a6fb9bb56bc8210ddfd,10,10,7,1,1,1,0,0,0
1,4d5c57ea9a6940dd891ad53e9dbe8da0,10,10,5,1,1,1,1,0,0
2,3f207df678b143eea3cee63160fa8bed,0,0,4,1,1,0,1,0,1
3,9b98b8c7a33c4b65b9aebfe6a799e6d9,5,5,7,1,1,0,1,0,0
4,0b1e1539f2cc45b7b9fa7c272da2e1d7,5,20,10,1,0,0,1,1,0
5,2298d6c36e964ae4a3e7e9706d1fb8c2,3,7,7,1,1,1,1,1,0
6,fafdcd668e3743c1bb461111dcafc2a4,2,10,10,1,1,1,1,1,0
7,5a8bc65990b245e5a138643cd4eb9837,0,0,3,1,1,1,0,0,1
8,f19421c1d4aa40978ebb69ca19b0e20d,5,5,5,1,1,1,1,0,0
9,2906b810c7d4411798c6938adc9daaa5,2,10,7,1,1,0,1,1,0


In this dataset, the offer validity period is specified in days, whereas the `transcript` dataset uses hours. For consistency, the duration is converted to hours here.

In [6]:
portfolio_encoded['duration'] *= 24
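One caveat with an in-place `*= 24` in a notebook is that re-running the cell multiplies the column again. A minimal sketch of a re-run-safe variant follows, using toy values and a hypothetical `duration_hours` column name that is not used elsewhere in this project.

```python
import pandas as pd

HOURS_PER_DAY = 24

# Toy durations in days (not the full portfolio)
offers = pd.DataFrame({'duration': [7, 5, 4]})

# Deriving a new column from the untouched original gives the same
# result no matter how many times the cell is executed.
for _ in range(2):  # simulate running the cell twice
    offers['duration_hours'] = offers['duration'] * HOURS_PER_DAY

print(offers)
```

The original `duration` column stays in days, so the unit of each column remains unambiguous.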

We should also rename the `id` column to `offer` to avoid confusion with the customer ID.

In [7]:
portfolio_encoded.rename(columns={'id': 'offer'}, inplace=True)

At this point, the characteristics of the available offers have been encoded numerically. Finally, the result is exported to CSV for future use.

In [8]:
portfolio_encoded.to_csv('./data/final/portfolio.csv', index=False)
print('saved')

saved
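The `index=False` argument keeps the row index out of the file, so a later `read_csv` does not pick up an extra `Unnamed: 0` column. The sketch below shows the difference on a toy dataframe, writing to in-memory buffers instead of the project's `./data/final/` path.

```python
import io

import pandas as pd

# Toy dataframe standing in for the encoded portfolio
demo = pd.DataFrame({'offer': ['a', 'b'], 'duration': [168, 120]})

# Default behaviour: the index is written as a first, unnamed column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'offer', 'duration']
print(cols_without_index)  # ['offer', 'duration']
```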
