# Assess & Clean Data

Load the raw data, assess it, clean it, and save the results as separate data sets.


### Data Sources

- portfolio.json - containing offer ids and metadata about each offer (duration, type, etc.)
- profile.json - demographic data for each customer
- transcript.json - records for transactions, offers received, offers viewed, and offers completed

### Changes

- 2018-12-19: Started project



In [1]:
# load libraries

import numpy as np
import pandas as pd
from tqdm import tqdm

# my own custom functions
import EDA_functions as EDA
import cleaning_functions as cleaning

# visualization
import matplotlib.pyplot as plt
import seaborn as sns #, sns.set_style('whitegrid')
color = 'rebeccapurple'
%matplotlib inline

# display settings
from IPython.display import display
pd.options.display.max_columns = None

from pathlib import Path  # to make file path references relative to notebook directory

In [2]:
# import data

portfolio_file = Path.cwd() / "data" / "raw" / "portfolio.json"
profile_file = Path.cwd() / "data" / "raw" / "profile.json"
transcript_file = Path.cwd() / "data" / "raw" / "transcript.json"

portfolio = pd.read_json(portfolio_file, orient='records', lines=True)
profile = pd.read_json(profile_file, orient='records', lines=True)
transcript = pd.read_json(transcript_file, orient='records', lines=True)

## Assess Data
### Check portfolio data

In [3]:
portfolio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
channels      10 non-null object
difficulty    10 non-null int64
duration      10 non-null int64
id            10 non-null object
offer_type    10 non-null object
reward        10 non-null int64
dtypes: int64(3), object(3)
memory usage: 560.0+ bytes


In [4]:
portfolio

Unnamed: 0,channels,difficulty,duration,id,offer_type,reward
0,"[email, mobile, social]",10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10
1,"[web, email, mobile, social]",10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10
2,"[web, email, mobile]",0,4,3f207df678b143eea3cee63160fa8bed,informational,0
3,"[web, email, mobile]",5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5
4,"[web, email]",20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5
5,"[web, email, mobile, social]",7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,discount,3
6,"[web, email, mobile, social]",10,10,fafdcd668e3743c1bb461111dcafc2a4,discount,2
7,"[email, mobile, social]",0,3,5a8bc65990b245e5a138643cd4eb9837,informational,0
8,"[web, email, mobile, social]",5,5,f19421c1d4aa40978ebb69ca19b0e20d,bogo,5
9,"[web, email, mobile]",10,7,2906b810c7d4411798c6938adc9daaa5,discount,2


Explanations: 
- id (string) - offer id
- offer_type (string) - type of offer, i.e. BOGO (buy one, get one free), discount, or informational
- difficulty (int) - minimum required spend to complete an offer
- reward (int) - reward given for completing an offer
- duration (int) - duration of the offer in days
- channels (list of strings) - channels used for the offer (web, email, mobile, social)

### Check profile data

In [5]:
profile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
age                 17000 non-null int64
became_member_on    17000 non-null int64
gender              14825 non-null object
id                  17000 non-null object
income              14825 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.1+ KB


In [6]:
profile.head()

Unnamed: 0,age,became_member_on,gender,id,income
0,118,20170212,,68be06ca386d4c31939f3a4f0e3dd783,
1,55,20170715,F,0610b486422d4921ae7d2bf64640c50b,112000.0
2,118,20180712,,38fe809add3b4fcf9315a9694bb96ff5,
3,75,20170509,F,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,118,20170804,,a03223e636434f42ac4c3df47e8bac43,


Explanations:

* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account, stored as an integer in YYYYMMDD format (e.g. 20170212)
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income
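
In the sample rows above, an age of 118 always appears together with a missing gender and income, which suggests that 118 may be a placeholder rather than a real age. The following is a minimal sketch of how that could be checked, using a toy frame copied from the five rows shown. The helper name `placeholder_age_mask` is hypothetical, and whether the pattern holds for all 17,000 rows is not established here.

```python
import numpy as np
import pandas as pd

# Toy frame copied from the first five rows of profile.head() above
toy_profile = pd.DataFrame({
    'age': [118, 55, 118, 75, 118],
    'gender': [None, 'F', None, 'F', None],
    'income': [np.nan, 112000.0, np.nan, 100000.0, np.nan],
})

def placeholder_age_mask(df, placeholder=118):
    """Return a boolean mask of rows whose age equals the suspected placeholder."""
    return df['age'] == placeholder

mask = placeholder_age_mask(toy_profile)

# Rows where both demographic fields are missing
missing_both = toy_profile['gender'].isna() & toy_profile['income'].isna()

# True if the placeholder age and the missing fields coincide row by row
ages_match_missing = bool((mask == missing_both).all())
print(ages_match_missing)
```

Running the same comparison on the full `profile` frame would show whether these rows can be treated as one group of incomplete records.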

### Check transcript data

In [7]:
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
event     306534 non-null object
person    306534 non-null object
time      306534 non-null int64
value     306534 non-null object
dtypes: int64(1), object(3)
memory usage: 9.4+ MB


In [8]:
transcript.sample(10)

Unnamed: 0,event,person,time,value
60201,offer received,b7bbde316be54f6f9100f8ec7d7e3dbc,168,{'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'}
30617,transaction,d08bb3a6b09c4059970788a1edbc0312,48,{'amount': 6.82}
200711,offer viewed,0ac4004c9f854997b8c9697dc9bbdd8f,498,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
39770,transaction,d885cede482e4a03b42991c72c254265,90,{'amount': 10.91}
267950,transaction,b7aa6f74b09b40d89cfbdbba1350b282,588,{'amount': 15.67}
231987,transaction,903fa17ff106494b85e56afe5dd48c6f,534,{'amount': 0.14}
212794,offer received,79a8c6c9f9e34117b8793c583ec16521,504,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
152345,offer received,4e6458c5eade4beb9b1f7d16dd8decae,408,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
164376,offer viewed,d2fdc2be8ab64e4ba04830d441e53fd5,408,{'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'}
108149,transaction,4d452ab867ed4895b280552cf89af297,318,{'amount': 21.79}


Explanations:
* event (str) - record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since the start of the 30-day test period. The data begins at time t=0
* value (dict) - either an offer id (key `offer id`) or a transaction amount (key `amount`), depending on the record
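
Because `value` holds a dict whose key depends on the event, it is usually easier to work with once it is split into separate columns. Below is a minimal sketch on a toy frame that mirrors three of the sampled rows above. It only handles the two keys visible in that sample (`offer id` and `amount`); the full data may contain other keys.

```python
import pandas as pd

# Toy records mirroring three rows of transcript.sample(10) above
toy_transcript = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer viewed'],
    'time': [168, 48, 498],
    'value': [
        {'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'},
        {'amount': 6.82},
        {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},
    ],
})

# dict.get returns None when the key is absent, so each row fills only one column
toy_transcript['offer_id'] = toy_transcript['value'].apply(lambda v: v.get('offer id'))
toy_transcript['amount'] = pd.to_numeric(
    toy_transcript['value'].apply(lambda v: v.get('amount'))
)

unpacked = toy_transcript.drop(columns='value')
print(unpacked)
```

Offer events end up with an `offer_id` and a missing `amount`, and transactions the other way round.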

## Clean Data

Cleaning tasks:
- portfolio: rename `id` col to `offer_id`
- portfolio: one-hot-encode `channels`
- portfolio: add `prop_rewards` (`reward` / `difficulty`)
- portfolio: add `rel_difficulty` (`difficulty` / `duration` in days)
- portfolio: add `duration_hours` col
- profile: rename `id` col to `person_id`
- profile: change dtype of `became_member_on` to datetime
- profile: transform `became_member_on` into a duration integer (days, starting from maxdate)
- transcript: rename `person` to `person_id`
- all files: simplify ids

Further preparation tasks:
- transcript: flag every transaction if within valid period of a viewed promotion
- profile: add information about consumption (number of purchases, amount per day, etc.)
    - (could later be added for every promo)
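
The first preparation task could be approached as sketched below. The sketch assumes that an offer's valid period runs from the moment it is viewed until its receive time plus `duration_hours`; that definition, the toy ids, and the column names `received`, `viewed`, `valid_from`, `valid_until` and `promo_active` are all hypothetical.

```python
import pandas as pd

# Toy data with hypothetical ids and times (hours since start of the test)
offers = pd.DataFrame({
    'person_id': ['p1', 'p1'],
    'offer_id': ['o1', 'o2'],
    'received': [0, 168],
    'viewed': [6, 200],
    'duration_hours': [120, 96],
})
transactions = pd.DataFrame({
    'person_id': ['p1', 'p1', 'p1', 'p2'],
    'time': [3, 48, 150, 48],
    'amount': [5.0, 6.82, 10.91, 15.67],
})

# Assumed window: from viewing the offer until the offer expires
offers['valid_from'] = offers['viewed']
offers['valid_until'] = offers['received'] + offers['duration_hours']

# Pair every transaction with every offer of the same person, then test the window
pairs = transactions.reset_index().merge(offers, on='person_id', how='left')
pairs['in_window'] = pairs['time'].between(pairs['valid_from'], pairs['valid_until'])

# A transaction is flagged if it falls into at least one window
transactions['promo_active'] = pairs.groupby('index')['in_window'].any()
print(transactions)
```

The person-level merge grows with the number of offers per customer, so on the full transcript it may be worth processing customers in chunks.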

In [9]:
"""rename columns"""

portfolio.rename(columns={'id': 'offer_id'}, inplace=True)
profile.rename(columns={'id' : 'person_id'}, inplace=True)
transcript.rename(columns={'person' : 'person_id'}, inplace=True)

In [10]:
# check results
for df in [portfolio, profile, transcript]:
    print(df.columns)

Index(['channels', 'difficulty', 'duration', 'offer_id', 'offer_type',
       'reward'],
      dtype='object')
Index(['age', 'became_member_on', 'gender', 'person_id', 'income'], dtype='object')
Index(['event', 'person_id', 'time', 'value'], dtype='object')


In [11]:
"""one-hot encode channels"""

channel_names = ['web', 'email', 'mobile', 'social']

for index, row in portfolio.iterrows():
    for channel in channel_names:
        # test membership in the list itself rather than in a joined string
        portfolio.loc[index, channel] = 1 if channel in row['channels'] else 0

# astype returns a new Series and has no `inplace` argument
for col in channel_names + ['offer_type']:
    portfolio[col] = portfolio[col].astype('category')
portfolio.drop('channels', axis=1, inplace=True)
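
As an alternative to the row-by-row loop above, the same indicator columns can be built with pandas string methods. This is only a sketch on a toy frame with three of the channel lists from the portfolio table. Unlike the loop, it produces integer 0/1 columns.

```python
import pandas as pd

# Toy frame with three of the channel lists shown in the portfolio table
toy_portfolio = pd.DataFrame({
    'channels': [
        ['email', 'mobile', 'social'],
        ['web', 'email', 'mobile', 'social'],
        ['web', 'email'],
    ],
})

# Join each list into a delimited string, then let pandas build indicator columns
dummies = toy_portfolio['channels'].str.join('|').str.get_dummies(sep='|')

# get_dummies sorts columns alphabetically; restore the order used in the loop
dummies = dummies[['web', 'email', 'mobile', 'social']]

encoded = pd.concat([toy_portfolio.drop(columns='channels'), dummies], axis=1)
print(encoded)
```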

In [20]:
"""add prop_rewards, rel_difficulty and duration_hours"""

# informational offers have reward 0 and difficulty 0, so 0 / 0 yields NaN here
portfolio['prop_rewards'] = portfolio['reward'] / portfolio['difficulty']
# duration is given in days
portfolio['rel_difficulty'] = portfolio['difficulty'] / portfolio['duration']
portfolio['duration_hours'] = portfolio['duration'] * 24

In [22]:
# check results
display(portfolio.head())
display(portfolio.info())  # info() prints its summary and returns None, hence the trailing None

Unnamed: 0,difficulty,duration,offer_id,offer_type,reward,web,email,mobile,social,prop_rewards,rel_difficulty,duration_hours
0,10,7,ae264e3637204a6fb9bb56bc8210ddfd,bogo,10,0.0,1.0,1.0,1.0,1.0,1.428571,168
1,10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,1.0,1.0,1.0,1.0,1.0,2.0,120
2,0,4,3f207df678b143eea3cee63160fa8bed,informational,0,1.0,1.0,1.0,0.0,,0.0,96
3,5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,1.0,1.0,1.0,0.0,1.0,0.714286,168
4,20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,5,1.0,1.0,0.0,0.0,0.25,2.0,240


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
difficulty        10 non-null int64
duration          10 non-null int64
offer_id          10 non-null object
offer_type        10 non-null category
reward            10 non-null int64
web               10 non-null category
email             10 non-null category
mobile            10 non-null category
social            10 non-null category
prop_rewards      8 non-null float64
rel_difficulty    10 non-null float64
duration_hours    10 non-null int64
dtypes: category(5), float64(2), int64(4), object(1)
memory usage: 1.1+ KB


None

In [None]:
"""- profile: change dtype of `became_member_on` to datetime
- profile: transform `became_member_on` into a duration integer (days, starting from maxdate)"""
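
One possible way to carry out these two open tasks is sketched below, using a toy frame with the `became_member_on` values from `profile.head()`. It assumes that "maxdate" means the most recent sign-up date in the data; the column name `member_days` is hypothetical.

```python
import pandas as pd

# Toy frame with the became_member_on values from profile.head() above
toy_profile = pd.DataFrame({
    'became_member_on': [20170212, 20170715, 20180712, 20170509, 20170804],
})

# Parse the YYYYMMDD integers as dates
toy_profile['became_member_on'] = pd.to_datetime(
    toy_profile['became_member_on'].astype(str), format='%Y%m%d'
)

# Assumption: "maxdate" is the most recent sign-up date in the data
max_date = toy_profile['became_member_on'].max()

# Membership length in whole days, counted back from max_date
toy_profile['member_days'] = (max_date - toy_profile['became_member_on']).dt.days
print(toy_profile)
```

With this definition the newest member gets 0 days and every other customer a positive integer.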

In [None]:
"""- all files: simplify ids"""
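
The id simplification could be done with a single lookup table that is built once and applied to every file, so that the same hash always maps to the same short id. The sketch below uses toy frames with offer ids from the tables above; the short id scheme (`o0`, `o1`, ...) and the frame `toy_offers_seen` are hypothetical.

```python
import pandas as pd

# Toy frames with offer ids taken from the portfolio table above
toy_portfolio = pd.DataFrame({
    'offer_id': ['ae264e3637204a6fb9bb56bc8210ddfd',
                 '4d5c57ea9a6940dd891ad53e9dbe8da0',
                 '3f207df678b143eea3cee63160fa8bed'],
})
toy_offers_seen = pd.DataFrame({
    'offer_id': ['3f207df678b143eea3cee63160fa8bed',
                 'ae264e3637204a6fb9bb56bc8210ddfd'],
})

# Build the mapping once from the reference table
offer_map = {long_id: f'o{i}' for i, long_id in enumerate(toy_portfolio['offer_id'])}

# Apply the same mapping to every frame that carries offer ids
toy_portfolio['offer_id'] = toy_portfolio['offer_id'].map(offer_map)
toy_offers_seen['offer_id'] = toy_offers_seen['offer_id'].map(offer_map)
print(toy_offers_seen)
```

The same pattern would apply to `person_id`, with the mapping built from `profile` and reused on `transcript`. Keeping the mapping dict around makes it possible to translate back to the original ids later.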