# Data Loading and Feature Engineering

## Overview

This notebook downloads, prepares, and transforms the iPinYou dataset. It uses dask for parallel data loading, although other loading and transformation techniques would work just as well. We'll also add features to the dataset for use in downstream machine learning.

#### iPinYou

The iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition was organized by iPinYou (http://www.ipinyou.com) and ran from April 1st, 2013 to December 31st, 2013.

The competition was divided into three seasons. For each season, a training dataset was released to the participants, while the testing dataset was reserved by iPinYou. The complete testing dataset was randomly divided into two parts: one part is the leaderboard testing dataset, used to score and rank the participating teams on the leaderboard, and the other part is reserved for the final offline evaluation.

We will be using the second season of iPinYou. The training dataset includes a set of processed iPinYou DSP bidding, impression, click, and conversion logs. We will be using the impression and click datasets. The impression data assumes the bidder won the auction for the ad, and the click dataset records which ads were clicked. Our goal will be to predict when a user will click the ad.

Let's get started! First, let's update our Python libraries.

In [61]:
%pip install numpy --upgrade
%pip install scikit-learn --upgrade
%pip install dask
%pip install imbalanced-learn
%pip install Faker

Note: you may need to restart the kernel to use updated packages.

In [62]:
# do auto kernel restart
import IPython
IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel

{'status': 'ok', 'restart': True}

In [24]:
# python library imports

# standard library
import gc
import glob
import os
import sys
import uuid

# third-party
import boto3
import dask.dataframe as dd
import numpy as np
import pandas as pd
import pyarrow
import sagemaker
from faker import Faker
from imblearn.over_sampling import SMOTENC
from sklearn.model_selection import train_test_split

In [19]:
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker.Session().default_bucket()

prefix = 'ipinyou'
os.environ["AWS_REGION"] = region

print(f'Region : {region}')
print(f'IAM Role : {role}')
print(f'S3 Bucket : {bucket_name}')

# Print the installed SageMaker Python SDK and Python versions
print(f'SageMaker Python SDK version : {sagemaker.__version__}')
print(f'Python version : {sys.version}')

Region : us-east-1
IAM Role : arn:aws:iam::725069941408:role/service-role/AmazonSageMaker-ExecutionRole-20230327T095833
S3 Bucket : sagemaker-us-east-1-725069941408
SageMaker Python SDK version : 2.145.0
Python version : 3.7.10 (default, Jun  4 2021, 14:48:32) 
[GCC 7.5.0]


## Data Download - only once

In [20]:
# check data in s3
!aws s3 ls s3://ipenyou/

                           PRE algo.submission.demo/
                           PRE testing1st/
                           PRE testing2nd/
                           PRE testing3rd/
                           PRE training1st/
                           PRE training2nd/
                           PRE training3rd/
2023-03-27 15:25:45      10663 README
2023-03-27 15:25:45      10534 README.old
2023-03-27 15:25:45       6144 city.cn.txt
2023-03-27 15:25:45       4683 city.en.txt
2023-03-27 15:25:45       7305 files.md5
2023-03-27 15:25:45        326 known.data.bugs.txt
2023-03-27 15:25:45        410 region.cn.txt
2023-03-27 15:25:45        435 region.en.txt
2023-03-27 15:26:46       1335 user.profile.tags.cn.txt
2023-03-27 15:26:46       1420 user.profile.tags.en.txt


In [7]:
!mkdir -p data

Copy the data from S3 to local EFS storage in SageMaker Studio (about 6 GB of data).

In [10]:
!aws s3 cp s3://ipenyou/ ./data/ --recursive

The files in the dataset are plain text, `*.txt.bz2`, or `*.tar.bz2` files. On Linux or macOS, the `bunzip2` and `tar` commands uncompress them into plain text:
           bunzip2 *.txt.bz2
           tar xvjf *.tar.bz2
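
If you would rather stay in Python than shell out, the standard-library `bz2` module can do the same decompression. The sketch below is only an illustration: the helper name `decompress_bz2` and the throwaway file it round-trips are made up for the example and are not part of the dataset.

```python
import bz2
import shutil
import tempfile
from pathlib import Path


def decompress_bz2(src: Path) -> Path:
    """Decompress `src` (e.g. sample.txt.bz2) next to itself and return the new path."""
    dst = src.with_suffix("")  # drops the trailing ".bz2"
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks instead of loading the whole file
    return dst


# Round-trip a tiny, made-up log line to show the helper in action.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "sample.txt.bz2"
sample.write_bytes(bz2.compress(b"bid-1\t20130606000104008\n"))
restored = decompress_bz2(sample)
print(restored.name, restored.read_bytes())
```

Unlike `bzip2 -d`, this helper leaves the compressed source file in place.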

In [12]:
# convert all data to txt files from bz2 files - takes a long time :( 
!bzip2 -vd ./data/training2nd/*.bz2
!bzip2 -vd ./data/testing2nd/*.bz2

In [None]:
# TODO push back to s3 as txt files

## Data Loading

Now that we've downloaded the data, let's use dask and pandas to process it.

In [21]:
# define columns
ad_columns = ['BidID','Timestamp','Log Type','iPinYou ID','User-Agent','IP','Region','City','Ad Exchange','Domain','URL','Anonymous URL ID','Ad slot ID','Ad slot width','Ad slot height','Ad slot visibility','Ad slot format','Ad slot floor price','Creative ID','Bidding price','Paying price','Key page URL','Advertiser ID','User Tags']
ad_columns

['BidID',
 'Timestamp',
 'Log Type',
 'iPinYou ID',
 'User-Agent',
 'IP',
 'Region',
 'City',
 'Ad Exchange',
 'Domain',
 'URL',
 'Anonymous URL ID',
 'Ad slot ID',
 'Ad slot width',
 'Ad slot height',
 'Ad slot visibility',
 'Ad slot format',
 'Ad slot floor price',
 'Creative ID',
 'Bidding price',
 'Paying price',
 'Key page URL',
 'Advertiser ID',
 'User Tags']

Let's take the impression data and join it with the click data. We'll use the BidID to match clicks and impressions, and add a new target column called 'clicks'.
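
As a rough illustration of that labelling step, here is a minimal pandas sketch. The two toy frames and their values are invented for the example; only the `BidID` and `clicks` column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the impression and click logs (values are made up).
imp = pd.DataFrame({"BidID": ["a", "b", "c"], "Region": [27, 15, 276]})
clk = pd.DataFrame({"BidID": ["a", "c"]})

# An impression is a positive example if its BidID also appears in the click log.
imp["clicks"] = imp["BidID"].isin(clk["BidID"]).astype(int)
print(imp)
```

A membership test is used here instead of a merge so that an impression with several click records would still produce a single labelled row.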

In [23]:
# read impression data
df_imp = dd.read_csv('./data/training2nd/i*.txt',sep='\t',header=0,names=ad_columns)
# read click data
#ddf_clicks = dd.read_csv('./data/training2nd/clk*.txt',sep='\t',header=0,names=ad_columns)

In [5]:
print(f'A total of {len(df_imp)} impressions in the impression training dataset')

A total of 12237142 impressions in the impression training dataset


Since there are multiple advertisers in this dataset, let's take just one advertiser ID for our analysis. Alternatively, you could skip this downsampling and keep 'Advertiser ID' as a feature. The ID used below, 3358, corresponds to a software advertiser according to https://arxiv.org/pdf/1407.7073.pdf

## Resample to create new data

In [15]:
n = 1000000

In [16]:
fake = Faker()

In [17]:
Faker.seed(54321)

In [26]:
%%time
ips = []
uid = []
for _ in range(n):
    ips.append(fake.ipv4_public().split('.'))
    uid.append(uuid.uuid1())

CPU times: user 48 s, sys: 3.82 s, total: 51.8 s
Wall time: 47.5 s


In [29]:
drop_columns = ['BidID','Timestamp','Log Type','iPinYou ID','User-Agent','Ad Exchange','Domain','URL','Anonymous URL ID','Ad slot ID','Ad slot width','Ad slot height','Ad slot visibility','Ad slot format','Ad slot floor price','Creative ID','Bidding price','Paying price','Key page URL','Advertiser ID']

In [30]:
df_imp = df_imp.drop(columns = drop_columns)

In [31]:
df_imp

,IP,Region,City,User Tags
npartitions=69,,,,
,object,int64,int64,object
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


In [32]:
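def sample_categorical_data(df, num_samples):
    """Bootstrap `num_samples` rows by sampling each column independently.

    Each column is drawn with replacement on its own, so per-column value
    frequencies are roughly kept but relationships between columns are not.
    """
    newdata_df = pd.DataFrame()
    for column in df.columns:
        sample_column = df[column].sample(n=num_samples, replace=True).reset_index(drop=True)
        newdata_df[column] = sample_column

    return newdata_df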

In [33]:
df_imp = df_imp.compute()

In [34]:
%%time
new_sample = sample_categorical_data(df_imp,n)

CPU times: user 993 ms, sys: 23.9 ms, total: 1.02 s
Wall time: 1.01 s


In [35]:
new_sample

Unnamed: 0,IP,Region,City,User Tags
0,61.178.155.*,27,1,10063100931007510006
1,27.186.224.*,15,165,10006
2,175.50.254.*,276,1,10057100241340310006100631007710111
3,111.79.219.*,393,68,1006310083100061005910110
4,122.79.162.*,106,278,1000610110
...,...,...,...,...
999995,118.207.238.*,80,290,10083134031000610110
999996,218.31.5.*,201,149,1000610075130421003110063100521007610111
999997,183.131.26.*,1,85,
999998,223.86.114.*,94,212,134031377610083100631386610111


In [None]:
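%%time
# Now, let's only get the impression data for a single Advertiser ID.
# Re-read the full impression logs first: df_imp above was reduced to four columns.
ddf_imp = dd.read_csv('./data/training2nd/i*.txt',sep='\t',header=0,names=ad_columns)
df = ddf_imp[ddf_imp['Advertiser ID']==3358].compute()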

#### Save to intermediate parquet file

In [None]:
df.to_parquet('dftest.gzip',compression='gzip')

In [None]:
df = pd.read_parquet('dftest.gzip', engine='pyarrow')

## Feature Engineering

In [None]:
# convert timestamp
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d%H%M%S%f')

In [None]:
df

Let's take the timestamp information and generate features from it.  
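
To make the idea concrete before the pandas helper, here is a standard-library sketch that parses one timestamp in the same `%Y%m%d%H%M%S%f` layout used above and derives a few calendar features. The sample value and the `features` dictionary are made up for illustration.

```python
from datetime import datetime

raw = "20130606000104008"  # made-up value: year, month, day, hour, minute, second, milliseconds
ts = datetime.strptime(raw, "%Y%m%d%H%M%S%f")

features = {
    "hour": ts.hour,
    "dayofweek": ts.weekday(),            # Monday == 0, the same convention pandas uses
    "week": ts.isocalendar()[1],          # ISO week number
    "dayofyear": ts.timetuple().tm_yday,
    "is_month_start": ts.day == 1,
}
print(features)
```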

In [None]:
import re

def make_date(df, date_field):
    "Make sure `df[date_field]` is of the right date type."
    field_dtype = df[date_field].dtype
    if isinstance(field_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        field_dtype = np.datetime64
    if not np.issubdtype(field_dtype, np.datetime64):
        df[date_field] = pd.to_datetime(df[date_field], infer_datetime_format=True)

def add_datepart(df, field_name, drop=True, time=False):
    "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = re.sub('[Dd]ate$', '', field_name)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    # `dt.week` was deprecated in pandas 1.1.0 and removed in 2.0; prefer `dt.isocalendar().week`
    week = field.dt.isocalendar().week.astype(field.dt.day.dtype) if hasattr(field.dt, 'isocalendar') else field.dt.week
    for n in attr: df[prefix + n] = getattr(field.dt, n.lower()) if n != 'Week' else week
    mask = ~field.isna()
    df[prefix + 'Elapsed'] = np.where(mask,field.values.astype(np.int64) // 10 ** 9,np.nan)
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df

In [None]:
df = add_datepart(df, 'Timestamp', time=True)

In [None]:
# create one hot encoded features from the user tags
_ = df['User Tags'].str.get_dummies(sep=',')

In [None]:
# add them back to the dataframe
df = pd.concat([df, _], axis=1)
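
For readers unfamiliar with `str.get_dummies`, this small sketch shows what it produces on comma-separated tags. The three rows are invented; the tag IDs are ones that appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows: two tagged users and one with no tags at all.
tags = pd.Series(["10006,10110", "10063", None], name="User Tags")

# One indicator column per distinct tag; a missing value becomes a row of zeros.
dummies = tags.str.get_dummies(sep=",")
print(dummies)
```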

In [None]:
df.head()

In [None]:
# create 3 features from the IP address
_ = df['IP'].str.split(pat='.',expand=True)

In [None]:
df[['ip1','ip2','ip3']]= _.rename(columns={0:'ip1',1:'ip2',2:'ip3'}).drop(columns=[3])

In [None]:
df.drop(columns=['IP','User Tags'],inplace=True)

In [None]:
df['ip1'] = df['ip1'].astype('int')
df['ip2'] = df['ip2'].astype('int')
df['ip3'] = df['ip3'].astype('int')
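
The IP addresses in the sample output earlier have their last octet masked with `*`, which is why only three numeric features are kept. A plain-Python sketch of the same split, using a hypothetical helper named `ip_octets`:

```python
def ip_octets(masked_ip: str) -> tuple:
    """Return the first three octets of an address such as '61.178.155.*' as ints."""
    first, second, third, _masked = masked_ip.split(".")
    return int(first), int(second), int(third)


print(ip_octets("61.178.155.*"))
```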

In [None]:
df

## Select an audience 

In [None]:
fashionfemale = df[(df['10059']==1) & (df['10111']==1)]  # fashion and female

In [None]:
len(fashionfemale)

In [None]:
# get random sample of ids that are not in fashionfemale and are not repeated.  

In [None]:
not_it = df[~df['iPinYou ID'].isin(fashionfemale['iPinYou ID'])].sample(len(fashionfemale),replace=False,random_state=54321)
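
When building a comparison group like this, the membership test should compare IDs against the audience's ID column rather than against the audience frame as a whole. A toy sketch of the pattern follows; the frame, the `in_audience` flag, and the IDs are invented for the example.

```python
import pandas as pd

# Made-up frame: "u1" and "u3" are in the audience, the rest are candidates for the control group.
df_toy = pd.DataFrame({
    "iPinYou ID": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "in_audience": [1, 0, 1, 0, 0, 0],
})
audience = df_toy[df_toy["in_audience"] == 1]

# Compare IDs against the audience's ID column, then draw an equally sized sample without replacement.
outside = df_toy[~df_toy["iPinYou ID"].isin(audience["iPinYou ID"])]
control = outside.sample(len(audience), replace=False, random_state=54321)
print(control)
```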

Keep Region, City, ip1, ip2, ip3, and all of the customer attribute (user tag) columns; drop everything else.

In [None]:
fashionfemale = fashionfemale.drop(columns=['BidID', 'Log Type', 'iPinYou ID', 'User-Agent',
       'Ad Exchange', 'Domain', 'URL', 'Anonymous URL ID', 'Ad slot ID',
       'Ad slot width', 'Ad slot height', 'Ad slot visibility',
       'Ad slot format', 'Ad slot floor price', 'Creative ID', 'Bidding price',
       'Paying price', 'Key page URL', 'Advertiser ID', 'TimestampYear',
       'TimestampMonth', 'TimestampWeek', 'TimestampDay', 'TimestampDayofweek',
       'TimestampDayofyear', 'TimestampIs_month_end',
       'TimestampIs_month_start', 'TimestampIs_quarter_end',
       'TimestampIs_quarter_start', 'TimestampIs_year_end',
       'TimestampIs_year_start', 'TimestampHour', 'TimestampMinute',
       'TimestampSecond', 'TimestampElapsed'])

In [None]:
fashionfemale.drop_duplicates(inplace=True)

In [None]:
fashionfemale['group'] = 1

In [None]:
fashionfemale

In [None]:
not_it = not_it.drop(columns=['BidID', 'Log Type', 'iPinYou ID', 'User-Agent',
       'Ad Exchange', 'Domain', 'URL', 'Anonymous URL ID', 'Ad slot ID',
       'Ad slot width', 'Ad slot height', 'Ad slot visibility',
       'Ad slot format', 'Ad slot floor price', 'Creative ID', 'Bidding price',
       'Paying price', 'Key page URL', 'Advertiser ID', 'TimestampYear',
       'TimestampMonth', 'TimestampWeek', 'TimestampDay', 'TimestampDayofweek',
       'TimestampDayofyear', 'TimestampIs_month_end',
       'TimestampIs_month_start', 'TimestampIs_quarter_end',
       'TimestampIs_quarter_start', 'TimestampIs_year_end',
       'TimestampIs_year_start', 'TimestampHour', 'TimestampMinute',
       'TimestampSecond', 'TimestampElapsed'])

In [None]:
not_it.drop_duplicates(inplace=True)

In [None]:
not_it['group'] = 0

In [None]:
fulldf = pd.concat([fashionfemale, not_it])

In [None]:
fulldf  

In [None]:
# move the target variable to the first column
fulldf = fulldf[ ['group'] + [col for col in fulldf.columns if col != 'group'] ] 

In [None]:
fulldf

In [None]:
fulldf.to_csv('usersegfull.csv',header=False, index=False)

In [49]:
fulldf = pd.read_csv('usersegfull.csv',header=None)

In [50]:
fulldf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216887 entries, 0 to 216886
Data columns (total 49 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   0       216887 non-null  int64
 1   1       216887 non-null  int64
 2   2       216887 non-null  int64
 3   3       216887 non-null  int64
 4   4       216887 non-null  int64
 5   5       216887 non-null  int64
 6   6       216887 non-null  int64
 7   7       216887 non-null  int64
 8   8       216887 non-null  int64
 9   9       216887 non-null  int64
 10  10      216887 non-null  int64
 11  11      216887 non-null  int64
 12  12      216887 non-null  int64
 13  13      216887 non-null  int64
 14  14      216887 non-null  int64
 15  15      216887 non-null  int64
 16  16      216887 non-null  int64
 17  17      216887 non-null  int64
 18  18      216887 non-null  int64
 19  19      216887 non-null  int64
 20  20      216887 non-null  int64
 21  21      216887 non-null  int64
 22  22      216887 non-n

In [15]:
fulldf.iloc[100000:120000,0:46]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
100000,1,164,165,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
100001,1,55,56,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
100002,1,308,309,1,1,0,0,0,1,1,...,0,0,1,0,0,0,0,0,0,0
100003,1,333,334,1,1,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
100004,1,2,2,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119995,0,124,129,0,0,0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,0
119996,0,15,20,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
119997,0,164,170,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
119998,0,124,131,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [51]:
from imblearn.over_sampling import SMOTEN

In [52]:
temp_ = fulldf.sample(10000,replace=False,random_state=54321)

In [53]:
y = np.random.choice([0,1], size=len(temp_), p=[0.5, 0.5])

In [54]:
sampler = SMOTEN(sampling_strategy='all',random_state=54321,n_jobs=-1)

In [55]:
%%time
X_res, y_res = sampler.fit_resample(temp_.iloc[:,1:46], y)



CPU times: user 44.1 s, sys: 1.68 s, total: 45.8 s
Wall time: 45.8 s


In [56]:
len(fulldf.iloc[100000:120000,1:46])

20000

In [57]:
y_res.sum()

5108

In [58]:
len(y_res)

10216

In [25]:
X_res

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,36,37,38,39,40,41,42,43,44,45
0,164,165,1,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,55,56,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,308,309,1,1,0,0,0,1,1,1,...,0,0,1,0,0,0,0,0,0,0
3,333,334,1,1,0,0,0,0,1,1,...,0,0,0,0,0,1,0,0,0,0
4,2,2,1,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34239,2,2,1,1,0,0,0,1,1,1,...,0,0,0,1,0,0,0,0,0,0
34240,80,82,1,1,0,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
34241,94,97,0,0,0,0,0,0,1,1,...,0,0,0,1,0,0,0,0,0,0
34242,183,79,1,1,0,0,0,1,1,1,...,0,0,1,1,0,0,0,0,0,0


In [None]:
# for ip addresses

## Backup

In [None]:
### STOP HERE!!!  ###  

In [364]:
# TODO
# consider one hot encoding Region and City


In [365]:
df['clicks'] = df['clicks']*1

In [366]:
df.drop(columns=['Advertiser ID', 'TimestampElapsed'],inplace=True)

In [367]:
# move clicks to the first column in the dataframe
clicks = df.pop('clicks')
df.insert(0,'clicks',clicks)

In [368]:
display(df)

#### Save to intermediate parquet file

In [319]:
df.to_parquet('dfclean.gzip',compression='gzip')

In [320]:
df = pd.read_parquet('dfclean.gzip', engine='pyarrow')

In [381]:
df = pd.read_parquet('dfclean.gzip')

In [385]:
df = df.dropna()

#### Save off test / train datasets, stratify on clicks
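
Stratifying matters because clicks are rare: a purely random 10% split could end up with too few positives in the test set. A small sketch with made-up labels (10 clicks in 100 impressions) shows that `stratify` keeps the click rate the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 10 clicks in 100 impressions.
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=4321, stratify=y
)
print(y_tr.mean(), y_te.mean())  # click rate in each split
```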

In [387]:
train, test = train_test_split(df,test_size=0.1, random_state = 4321, stratify=df['clicks'])

In [374]:
train = train*1

In [375]:
test = test*1

In [388]:
train.to_csv('train.csv',index=False,header=False, encoding='utf-8')

In [389]:
test.to_csv('test.csv',index=False,header=False, encoding='utf-8')

In [390]:
test.to_csv()

In [391]:
p = pd.read_csv('test.csv', header=None)  # the file was written without a header row

In [324]:
# collect python garbage
gc.collect()

## Synthetic Data Generation

When dealing with a mix of continuous and categorical features, SMOTENC is the SMOTE variant in imbalanced-learn designed for that case. SMOTEN, used earlier, targets purely categorical data.
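
To give a feel for how mixed features are handled, here is a deliberately simplified, pure-Python sketch of the idea: continuous features are interpolated between a sample and one of its neighbours, and categorical features take the most frequent category among the neighbours. This is not imbalanced-learn's implementation. In particular, neighbours here are ranked by distance over the continuous features only, and the function name `smote_nc_sample` and the toy rows are made up.

```python
import random
from collections import Counter


def smote_nc_sample(minority, cat_idx, k=3, rng=None):
    """Create one synthetic row from `minority`, a list of equal-length rows.

    Simplification: neighbours are ranked by squared distance over the
    continuous features only.
    """
    rng = rng or random.Random()
    cont_idx = [i for i in range(len(minority[0])) if i not in cat_idx]
    base = rng.choice(minority)

    def dist(row):
        return sum((row[i] - base[i]) ** 2 for i in cont_idx)

    neighbours = sorted((r for r in minority if r is not base), key=dist)[:k]
    mate = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to mate

    new_row = list(base)
    for i in cont_idx:
        new_row[i] = base[i] + gap * (mate[i] - base[i])  # interpolate continuous values
    for i in cat_idx:
        votes = Counter(r[i] for r in neighbours)
        new_row[i] = votes.most_common(1)[0][0]  # most frequent category among neighbours
    return new_row


# Made-up minority rows: [continuous, categorical, continuous]
rows = [[1.0, "a", 10.0], [2.0, "a", 12.0], [3.0, "b", 11.0], [4.0, "a", 13.0]]
synthetic = smote_nc_sample(rows, cat_idx=[1], k=3, rng=random.Random(54321))
print(synthetic)
```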

In [None]:
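# Oversample the minority class with SMOTENC
# Assumption: X_train / y_train are the features and the 'clicks' label of the training split
X_train, y_train = train.drop(columns=['clicks']), train['clicks']

# Positional indices (within X_train) of the categorical columns
categorical_features = [1,3,4,5]

over = SMOTENC(categorical_features=categorical_features, random_state=54321, sampling_strategy='all', n_jobs=-1)

# Fit and resample the training data in a single step
X_smote, y_smote = over.fit_resample(X_train, y_train)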


## Convert to tfrecords

References:
* https://www.srijan.net/resources/blog/building-a-high-performance-data-pipeline-with-tensorflow
* https://keras.io/examples/keras_recipes/creating_tfrecords/


In [325]:
import tensorflow as tf

In [326]:
df = pd.read_parquet('dfclean.gzip', engine='pyarrow')

In [327]:
train, test = train_test_split(df,test_size=0.1, random_state = 4321, stratify=df['clicks'])

In [328]:
test

In [329]:
def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    # iterrows() can hand back numpy scalars (or floats after upcasting), so cast to a plain int
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

In [330]:
def create_example(example):
    
    feature = {
        'clicks': _int64_feature(example['clicks']),
        'region': _int64_feature(example['Region']),
        'city': _int64_feature(example['City']),
        'adslotwidth': _int64_feature(example['Ad slot width']),
        'adslotheight': _int64_feature(example['Ad slot height']),
        'timestampyear': _int64_feature(example['TimestampYear']),
        'timestampmonth': _int64_feature(example['TimestampMonth']),
        'timestampweek': _int64_feature(example['TimestampWeek']),
        'timestampday': _int64_feature(example['TimestampDay']),
        'timestampdayofweek': _int64_feature(example['TimestampDayofweek']),
        'timestampdayofyear': _int64_feature(example['TimestampDayofyear']),
        'timestampis_month_end': _int64_feature(example['TimestampIs_month_end']),
        'timestampis_month_start': _int64_feature(example['TimestampIs_month_start']),
        'timestampis_quarter_end': _int64_feature(example['TimestampIs_quarter_end']),
        'timestampis_quarter_start': _int64_feature(example['TimestampIs_quarter_start']),
        'timestampis_year_end': _int64_feature(example['TimestampIs_year_end']),
        'timestampis_year_start': _int64_feature(example['TimestampIs_year_start']),
        'timestamphour': _int64_feature(example['TimestampHour']),
        'timestampminute': _int64_feature(example['TimestampMinute']),
        'timestampsecond': _int64_feature(example['TimestampSecond']),
        '10006': _int64_feature(example['10006']),
        '10024': _int64_feature(example['10024']),
        '10031': _int64_feature(example['10031']),
        '10048': _int64_feature(example['10048']),
        '10052': _int64_feature(example['10052']),
        '10057': _int64_feature(example['10057']),
        '10059': _int64_feature(example['10059']),
        '10063': _int64_feature(example['10063']),
        '10067': _int64_feature(example['10067']),
        '10074': _int64_feature(example['10074']),
        '10075': _int64_feature(example['10075']),
        '10076': _int64_feature(example['10076']),
        '10077': _int64_feature(example['10077']),
        '10079': _int64_feature(example['10079']),
        '10083': _int64_feature(example['10083']),
        '10093': _int64_feature(example['10093']),
        '10102': _int64_feature(example['10102']),
        '10110': _int64_feature(example['10110']),
        '10111': _int64_feature(example['10111']),
        '10684': _int64_feature(example['10684']),
        '11092': _int64_feature(example['11092']),
        '11278': _int64_feature(example['11278']),
        '11379': _int64_feature(example['11379']),
        '11423': _int64_feature(example['11423']),
        '11512': _int64_feature(example['11512']),
        '11576': _int64_feature(example['11576']),
        '11632': _int64_feature(example['11632']),
        '11680': _int64_feature(example['11680']),
        '11724': _int64_feature(example['11724']),
        '11944': _int64_feature(example['11944']),
        '13042': _int64_feature(example['13042']),
        '13403': _int64_feature(example['13403']),
        '13496': _int64_feature(example['13496']),
        '13678': _int64_feature(example['13678']),
        '13776': _int64_feature(example['13776']),
        '13800': _int64_feature(example['13800']),
        '13866': _int64_feature(example['13866']),
        '13874': _int64_feature(example['13874']),
        '14273': _int64_feature(example['14273']),
        '16593': _int64_feature(example['16593']),
        '16617': _int64_feature(example['16617']),
        '16661': _int64_feature(example['16661']),
        '16706': _int64_feature(example['16706']),
        'ip1': _int64_feature(example['ip1']),
        'ip2': _int64_feature(example['ip2']),
        'ip3': _int64_feature(example['ip3']),
        'adex_1': _int64_feature(example['adex_1']),
        'adex_2': _int64_feature(example['adex_2']),
        'adex_3': _int64_feature(example['adex_3']),
        'advis_0': _int64_feature(example['advis_0']),
        'advis_1': _int64_feature(example['advis_1']),
        'advis_2': _int64_feature(example['advis_2']),
        'advis_255': _int64_feature(example['advis_255']),
        'adfmt_0': _int64_feature(example['adfmt_0']),
        'adfmt_1': _int64_feature(example['adfmt_1']),
        'adfmt_5': _int64_feature(example['adfmt_5']),
    }
    
    return tf.train.Example(features=tf.train.Features(feature=feature))


#### Convert Training Data to TFRecords

In [335]:
%%time

num_tfrecords = 30
os.makedirs('./data/tfrecords/train', exist_ok=True)  # make sure the output directory exists
split_df = np.array_split(train,num_tfrecords)
options=tf.io.TFRecordOptions(compression_type='ZLIB')

for i,temp_df in enumerate(split_df):
    print(f'Writing TF Record {i}')
    with tf.io.TFRecordWriter(f'./data/tfrecords/train/train_{i}-{num_tfrecords}.tfrec',options=options) as writer:
        for q,r in temp_df.iterrows():
            example = create_example(r)
            writer.write(example.SerializeToString())
            
print(f'A total of {num_tfrecords} TFRecord files were created.')
print(f'Each file contains {len(split_df[0])} records')

In [337]:
!ls ./data/tfrecords/train -lh

In [334]:
%%time

num_tfrecords = 3
os.makedirs('./data/tfrecords/test', exist_ok=True)  # make sure the output directory exists
split_df = np.array_split(test,num_tfrecords)
options=tf.io.TFRecordOptions(compression_type='ZLIB')

for i,temp_df in enumerate(split_df):
    print(f'Writing TF Record {i}')
    with tf.io.TFRecordWriter(f'./data/tfrecords/test/test_{i}-{num_tfrecords}.tfrec',options=options) as writer:
        for q,r in temp_df.iterrows():
            example = create_example(r)
            writer.write(example.SerializeToString())
            
print(f'A total of {num_tfrecords} TFRecord files were created.')
print(f'Each file contains {len(split_df[0])} records')

In [336]:
!ls ./data/tfrecords/test -lh

In [227]:
def decoder(example):
    feature_description = {
        'clicks': tf.io.FixedLenFeature([], tf.int64),
        'region': tf.io.FixedLenFeature([], tf.int64),
        'city': tf.io.FixedLenFeature([], tf.int64),
        'adslotwidth': tf.io.FixedLenFeature([], tf.int64),
        'adslotheight': tf.io.FixedLenFeature([], tf.int64),
        'timestampyear': tf.io.FixedLenFeature([], tf.int64),
        'timestampmonth': tf.io.FixedLenFeature([], tf.int64),
        'timestampweek': tf.io.FixedLenFeature([], tf.int64),
        'timestampday': tf.io.FixedLenFeature([], tf.int64),
        'timestampdayofweek': tf.io.FixedLenFeature([], tf.int64),
        'timestampdayofyear': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_month_end': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_month_start': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_quarter_end': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_quarter_start': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_year_end': tf.io.FixedLenFeature([], tf.int64),
        'timestampis_year_start': tf.io.FixedLenFeature([], tf.int64),
        'timestamphour': tf.io.FixedLenFeature([], tf.int64),
        'timestampminute': tf.io.FixedLenFeature([], tf.int64),
        'timestampsecond': tf.io.FixedLenFeature([], tf.int64),
        '10006': tf.io.FixedLenFeature([], tf.int64),
        '10024': tf.io.FixedLenFeature([], tf.int64),
        '10031': tf.io.FixedLenFeature([], tf.int64),
        '10048': tf.io.FixedLenFeature([], tf.int64),
        '10052': tf.io.FixedLenFeature([], tf.int64),
        '10057': tf.io.FixedLenFeature([], tf.int64),
        '10059': tf.io.FixedLenFeature([], tf.int64),
        '10063': tf.io.FixedLenFeature([], tf.int64),
        '10067': tf.io.FixedLenFeature([], tf.int64),
        '10074': tf.io.FixedLenFeature([], tf.int64),
        '10075': tf.io.FixedLenFeature([], tf.int64),
        '10076': tf.io.FixedLenFeature([], tf.int64),
        '10077': tf.io.FixedLenFeature([], tf.int64),
        '10079': tf.io.FixedLenFeature([], tf.int64),
        '10083': tf.io.FixedLenFeature([], tf.int64),
        '10093': tf.io.FixedLenFeature([], tf.int64),
        '10102': tf.io.FixedLenFeature([], tf.int64),
        '10110': tf.io.FixedLenFeature([], tf.int64),
        '10111': tf.io.FixedLenFeature([], tf.int64),
        '10684': tf.io.FixedLenFeature([], tf.int64),
        '11092': tf.io.FixedLenFeature([], tf.int64),
        '11278': tf.io.FixedLenFeature([], tf.int64),
        '11379': tf.io.FixedLenFeature([], tf.int64),
        '11423': tf.io.FixedLenFeature([], tf.int64),
        '11512': tf.io.FixedLenFeature([], tf.int64),
        '11576': tf.io.FixedLenFeature([], tf.int64),
        '11632': tf.io.FixedLenFeature([], tf.int64),
        '11680': tf.io.FixedLenFeature([], tf.int64),
        '11724': tf.io.FixedLenFeature([], tf.int64),
        '11944': tf.io.FixedLenFeature([], tf.int64),
        '13042': tf.io.FixedLenFeature([], tf.int64),
        '13403': tf.io.FixedLenFeature([], tf.int64),
        '13496': tf.io.FixedLenFeature([], tf.int64),
        '13678': tf.io.FixedLenFeature([], tf.int64),
        '13776': tf.io.FixedLenFeature([], tf.int64),
        '13800': tf.io.FixedLenFeature([], tf.int64),
        '13866': tf.io.FixedLenFeature([], tf.int64),
        '13874': tf.io.FixedLenFeature([], tf.int64),
        '14273': tf.io.FixedLenFeature([], tf.int64),
        '16593': tf.io.FixedLenFeature([], tf.int64),
        '16617': tf.io.FixedLenFeature([], tf.int64),
        '16661': tf.io.FixedLenFeature([], tf.int64),
        '16706': tf.io.FixedLenFeature([], tf.int64),
        'ip1': tf.io.FixedLenFeature([], tf.int64),
        'ip2': tf.io.FixedLenFeature([], tf.int64),
        'ip3': tf.io.FixedLenFeature([], tf.int64),
        'adex_1': tf.io.FixedLenFeature([], tf.int64),
        'adex_2': tf.io.FixedLenFeature([], tf.int64),
        'adex_3': tf.io.FixedLenFeature([], tf.int64),
        'advis_0': tf.io.FixedLenFeature([], tf.int64),
        'advis_1': tf.io.FixedLenFeature([], tf.int64),
        'advis_2': tf.io.FixedLenFeature([], tf.int64),
        'advis_255': tf.io.FixedLenFeature([], tf.int64),
        'adfmt_0': tf.io.FixedLenFeature([], tf.int64),
        'adfmt_1': tf.io.FixedLenFeature([], tf.int64),
        'adfmt_5': tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    return example

In [238]:
def prep(features):
    label = features.pop('clicks')
    return tf.stack([features[i] for i in features]), label

In [106]:
def load_data(data_dir, batch_size, seed=None):
    """Build a shuffled, batched tf.data pipeline from the ZLIB-compressed TFRecords."""
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    filenames = tf.io.gfile.glob(f'{data_dir}/*.tfrec')
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE, compression_type='ZLIB')
        .map(decoder, num_parallel_calls=AUTOTUNE)
        .map(prep, num_parallel_calls=AUTOTUNE)
        .shuffle(batch_size * 10, seed=seed)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )

    print('Completed loading and preprocessing data.')
    return dataset

In [60]:
def get_dataset(filenames, batch_size):
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    ignore_order = tf.data.Options()
    ignore_order.experimental_deterministic = False  # disable order, increase speed
    dataset = (
        # the TFRecords above were written with ZLIB compression, so read them the same way
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE, compression_type='ZLIB')
        .with_options(ignore_order)
        .map(decoder, num_parallel_calls=AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )

    return dataset

## Convert to Parquet

In [400]:
import boto3

#### TRAIN DATA

In [408]:
%%time
ath = boto3.client('athena')
# create the table with raw data
with open('train_data.ddl') as ddl:
    ath.start_query_execution(
        QueryString=ddl.read(),
        ResultConfiguration={'OutputLocation': 's3://sagemaker-us-east-1-431615879134/ipenyou-xgboost/data/queries/train/'})

In [409]:
%%time
# convert to parquet
with open('parquet_train.ddl') as ddl:
    ath.start_query_execution(
        QueryString=ddl.read(),
        ResultConfiguration={'OutputLocation': 's3://sagemaker-us-east-1-431615879134/ipenyou-xgboost/data/queries/train/'})

#### TEST DATA

In [406]:
%%time
ath = boto3.client('athena')
# create the table with raw data
with open('test_data.ddl') as ddl:
    ath.start_query_execution(
        QueryString=ddl.read(),
        ResultConfiguration={'OutputLocation': 's3://sagemaker-us-east-1-431615879134/ipinyou-tf/queries/'})

In [407]:
%%time
# convert to parquet
with open('parquet_test.ddl') as ddl:
    ath.start_query_execution(
        QueryString=ddl.read(),
        ResultConfiguration={'OutputLocation': 's3://sagemaker-us-east-1-431615879134/ipinyou-tf/queries/'})