# kkbox features + EDA

# features

The purpose of this notebook is to summarise each data file into useful features and merge them into the train and test sets. The outputs are train_out.csv and test_out.csv. The features will be visualized at the end to look for possible trends.

In [1]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import time

import os
import os.path as path

In [4]:
transactions = pd.read_csv('data/transactions.csv')
members = pd.read_csv('data/members.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/sample_submission_zero.csv')
user_logs = pd.read_csv('data/user_logs_output.csv')

This is what our datafiles look like. We will investigate each file individually, and then merge them all later.

In [9]:
print('Transactions Shape:               %s' % str(transactions.shape))
print('Members Shape:                    %s' % str(members.shape))
print('Train Shape:                      %s' % str(train.shape))
print('Test Shape:                       %s' % str(test.shape))
print('User Logs Output:                 %s' % str(user_logs.shape))

Transactions Shape:               (21547746, 9)
Members Shape:                    (5116194, 7)
Train Shape:                      (992931, 2)
Test Shape:                       (970960, 2)
User Logs Output:                 (5234111, 3)


We have 992,931 users in our training dataset and 970,960 in our test dataset. Not too bad.

### Train and Test

In [10]:
train.head()

Unnamed: 0,msno,is_churn
0,waLDQMmcOu2jLDaV1ddDkgCrB/jl6sD66Xzs0Vqax1Y=,1
1,QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=,1
2,fGwBva6hikQmTJzrbz/2Ezjm5Cth5jZUNvXigKK2AFA=,1
3,mT5V8rEpa+8wuqi6x0DoVd3H5icMKkE9Prt49UlmK+4=,1
4,XaPhtGLk/5UvvOYHcONTwsnH97P4eGECeq+BARGItRw=,1


In [11]:
test.head()

Unnamed: 0,msno,is_churn
0,ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg=,0
1,f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE=,0
2,zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg=,0
3,8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ=,0
4,K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA=,0


Both the train and test datafiles have msno (user id) and is_churn. `test['is_churn']` is what we want to predict. Let's see what fraction of users churn in the training dataset.

In [15]:
print('Percentage of ppl who churn:            %f' % np.mean(train.is_churn))

Percentage of ppl who churn:            0.063923


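Even though the competition is scored with logloss, keep in mind that about 6.4% of training users churn, so a model that always predicts "no churn" is already about 93.6% accurate. This majority-class guess is our baseline, and we want our models to perform significantly better than it.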
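As a rough check on that baseline, the sketch below recomputes it from the churn rate printed above (0.063923). The helper name `constant_logloss` is made up for this illustration; it gives the logloss of predicting one fixed probability for every user, assuming the true churn rate matches the training rate.

```python
import numpy as np

# Churn rate printed above for the training set
churn_rate = 0.063923

# Accuracy of always predicting "no churn"
baseline_accuracy = 1.0 - churn_rate

def constant_logloss(p, rate):
    """Logloss of predicting probability p for everyone when a fraction `rate` churns."""
    return -(rate * np.log(p) + (1.0 - rate) * np.log(1.0 - p))

# Predicting the churn rate itself is the best constant prediction
baseline_logloss = constant_logloss(churn_rate, churn_rate)

print('Baseline accuracy: %.4f' % baseline_accuracy)
print('Baseline logloss:  %.4f' % baseline_logloss)
```

Any model worth keeping should score a lower logloss than this constant prediction.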

### Transaction

One basic set of features that we can add is per-user summary statistics of the transaction data (which we will save as transaction_sum.csv). Another is a month-by-month record of whether each user dropped, maintained or added a subscription to kkbox (which we will save as recency.csv).

In [20]:
index = set(transactions['msno'])
print('# Unique Users:        %d' % len(index))

# Unique Users:        2363626


In [17]:
transactions.head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
0,YyO+tlZtAXYXoZhNr3Vg3+dfVQvrBVGO8j1mfqe4ZHc=,41,30,129,129,1,20150930,20151101,0
1,AZtu6Wl0gPojrEQYB8Q3vBSmE2wnZ3hi1FbK1rQQ0A4=,41,30,149,149,1,20150930,20151031,0
2,UkDFI97Qb6+s2LWcijVVv4rMAsORbVDT2wNXF0aVbns=,41,30,129,129,1,20150930,20160427,0
3,M1C56ijxozNaGD0t2h68PnH2xtx5iO5iR2MVYQB6nBI=,39,30,149,149,1,20150930,20151128,0
4,yvj6zyBUaqdbUQSrKsrZ+xNDVM62knauSZJzakS9OW4=,39,30,149,149,1,20150930,20151121,0


Let's take a closer look at one individual user's transaction history.

In [22]:
transactions[transactions['msno'] == transactions.msno[100]].sort_values('transaction_date').head()

Unnamed: 0,msno,payment_method_id,payment_plan_days,plan_list_price,actual_amount_paid,is_auto_renew,transaction_date,membership_expire_date,is_cancel
5385571,lF0lx58JJRNAU7yIb8p7gQGkpHVxuZgGoBpeucDSpxo=,39,31,149,149,1,20150131,20150319,0
100,lF0lx58JJRNAU7yIb8p7gQGkpHVxuZgGoBpeucDSpxo=,39,31,149,149,1,20150228,20150419,0
6502405,lF0lx58JJRNAU7yIb8p7gQGkpHVxuZgGoBpeucDSpxo=,39,31,149,149,1,20150331,20150519,0
9947404,lF0lx58JJRNAU7yIb8p7gQGkpHVxuZgGoBpeucDSpxo=,39,0,0,149,1,20150430,20150619,0
15237221,lF0lx58JJRNAU7yIb8p7gQGkpHVxuZgGoBpeucDSpxo=,39,30,149,149,1,20150531,20150719,0


Basic features that we can add are
* `total_sub_days`
* `start_sub`
* `end_sub`
* `num_canceled`
* `num_auto_renew`
* `num_transactions`

We will save this information to transaction_sum.csv, stored in `data`.

In [23]:
total_sub_days = transactions.groupby(['msno']).payment_plan_days.sum()
start_sub = transactions.groupby(['msno']).transaction_date.min()
end_sub = transactions.groupby(['msno']).membership_expire_date.max()
num_canceled = transactions.groupby(['msno']).is_cancel.sum()
num_auto_renew = transactions.groupby(['msno']).is_auto_renew.sum()
num_transactions = transactions.groupby(['msno']).size()

transaction_sum = pd.concat([total_sub_days, start_sub, 
                             end_sub, num_canceled, 
                             num_auto_renew, num_transactions], axis = 1)

transaction_sum.to_csv('data/transaction_sum.csv')
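The six aggregates above could also be computed in a single groupby pass with pandas named aggregation. This is only a sketch on a made-up toy frame, and the output column names are the feature names from the list above rather than the names the cell above produces.

```python
import pandas as pd

# Toy stand-in for transactions.csv (values are made up for illustration)
toy = pd.DataFrame({
    'msno': ['a', 'a', 'b'],
    'payment_plan_days': [30, 30, 7],
    'transaction_date': [20150131, 20150228, 20160105],
    'membership_expire_date': [20150319, 20150419, 20160112],
    'is_cancel': [0, 1, 0],
    'is_auto_renew': [1, 1, 0],
})

# One pass over the groups; each keyword is output_name=(input_column, function)
toy_sum = toy.groupby('msno').agg(
    total_sub_days=('payment_plan_days', 'sum'),
    start_sub=('transaction_date', 'min'),
    end_sub=('membership_expire_date', 'max'),
    num_canceled=('is_cancel', 'sum'),
    num_auto_renew=('is_auto_renew', 'sum'),
    num_transactions=('payment_plan_days', 'size'),
)
print(toy_sum)
```

The later cells refer to the original column names (`payment_plan_days`, `transaction_date` and so on), so this is an illustration rather than a drop-in replacement.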

Next, we can extract information about months that users dropped, maintained or added a subscription. Basically, the months from 01/2015 - 02/2017 will be our features. 

This is our code for each month:

* dropped/not subscribed:      -1
* maintained:                   0
* added:                        1

*Creating this file can take a while. Step back and drink some tea in the meantime*

We will save this information to recency.csv stored in `data`.

In [None]:
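# Assumption: 'start' and 'end' are the YYYYMM months of the transaction
# date and of the membership expiry date (they are not columns of the raw file).
transactions['start'] = transactions.transaction_date // 100
transactions['end'] = transactions.membership_expire_date // 100

unique_months = sorted(set(transactions['start']))
recency = pd.DataFrame(-1, index = list(index), columns = unique_months)

past = time.time()

# i is the row number (transactions has a default integer index).
# Note: this loop only writes 0 and -1; the "added" code (1) is not assigned here.
for i, row in transactions.iterrows():
    user = row['msno']
    start = row['start']
    end = row['end']
    if row['is_cancel'] == 0:
        # subscribed from the transaction month through the expiry month
        recency.loc[user, start:end] = 0
    elif row['is_cancel'] == 1:
        # cancelled: mark this month and everything after it as dropped
        recency.loc[user, start:] = -1
    if i % 1000000 == 0:
        print('Iteration %d , Elapsed     %f' % (i, time.time() - past))

recency.to_csv('data/recency.csv')
print('Completed and saved!!! ヽ(^◇^*)/')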
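The assignment `recency.loc[user, start:end] = 0` relies on label-based slicing over the month columns. Here is a minimal sketch of how that behaves on a toy table; the user ids and months are made up.

```python
import pandas as pd

months = [201501, 201502, 201503, 201504]
toy_recency = pd.DataFrame(-1, index=['u1', 'u2'], columns=months)

# .loc slices by label and includes both endpoints
toy_recency.loc['u1', 201502:201503] = 0

# An end label past the last column is fine because the columns are sorted
toy_recency.loc['u2', 201503:201612] = 0

print(toy_recency)
```

This is why `unique_months` has to be sorted before the table is built: slicing with a label that is not one of the columns only works on a monotonic index.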

We are done with that (finally)! Now let's move on to the other datafiles.

### Members

In [25]:
members.head()

Unnamed: 0,msno,city,bd,gender,registered_via,registration_init_time,expiration_date
0,URiXrfYPzHAlk+7+n7BOMl9G+T7g8JmrSnT/BU8GmEo=,1,0,,9,20150525,20150526
1,U1q0qCqK/lDMTD2kN8G9OXMtfuvLCey20OAIPOvXXGQ=,1,0,,4,20161221,20161224
2,W6M2H2kAoN9ahfDYKo3J6tmsJRAeuFc9wl1cau5VL1Q=,1,0,,4,20160306,20160309
3,1qE5+cN7CUyC+KFH6gBZzMWmM1QpIVW6A43BEm98I/w=,5,17,female,4,20161031,20161107
4,SeAnaZPI+tFdAt+r3lZt/B8PgTp7bcG/1os39u4pLxs=,1,0,,4,20170202,20170205


These seem like good features already. I won't touch them, so they are ready to join with the train and test sets.

### Merging the datafiles

We now have

* transaction_sum.csv
* recency.csv
* members.csv
* user_logs_output.csv

... to join to the train and test datasets. Let's do it!

In [None]:
# Reload the two intermediate files written above, so that the user id
# comes back as a regular column ('Unnamed: 0' for recency, 'msno' for transaction_sum)
recency = pd.read_csv('data/recency.csv')
transaction_sum = pd.read_csv('data/transaction_sum.csv')

# Index every table by user id
train.index = train.msno
test.index = test.msno
members.index = members.msno
members = members.drop(['msno'], axis = 1)
user_logs.index = user_logs.msno
user_logs = user_logs.drop(['msno'], axis = 1)
recency.index = recency['Unnamed: 0']
recency = recency.drop(['Unnamed: 0'], axis = 1)
transaction_sum.index = transaction_sum.msno
transaction_sum = transaction_sum.drop(['msno'], axis = 1)

# Left-style join on msno: align every table to the train/test index.
# (The join_axes argument of pd.concat was removed in pandas 1.0.)
others = [recency, members, transaction_sum, user_logs]
train = pd.concat([train] + [df.reindex(train.index) for df in others], axis = 1)
test = pd.concat([test] + [df.reindex(test.index) for df in others], axis = 1)

#Fix some column names
train.rename(columns = {list(train)[-3]: 'num_transactions'}, inplace = True)
test.rename(columns = {list(test)[-3]: 'num_transactions'}, inplace = True)
print('Completed!!! ヽ(^◇^*)/')

### Feature Engineering

There are a couple ideas that we can implement to make our features better. 

For one, the recency data has useful information about user subscription cycles. For each user we can compute the longest run of consecutive 1's and the longest run of consecutive 0's, which gives an idea of how long a subscription was added or maintained without interruption.

Another thing is that since the dates are stored as integers (YYYYMMDD), we can pick an arbitrary pivot date and replace each date with its distance from the pivot, which puts the dates on a continuous scale. Otherwise, the numeric gap between 12/31/2015 and 01/01/2016 is far larger than the gap between 01/01/2016 and 01/02/2016, even though both pairs are one day apart.
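To make the problem concrete, here is a small sketch comparing raw integer differences with real date differences. The helper name `to_date` is hypothetical; the pivot is the one used later in this notebook.

```python
import datetime as dt

def to_date(i):
    """Convert a YYYYMMDD integer into a datetime.date."""
    return dt.date(i // 10000, (i % 10000) // 100, i % 100)

# Raw integer differences are not proportional to elapsed time
raw_gap_new_year = 20160101 - 20151231   # 8870
raw_gap_next_day = 20160102 - 20160101   # 1

# Differences between real dates are
day_gap_new_year = (to_date(20160101) - to_date(20151231)).days   # 1
day_gap_next_day = (to_date(20160102) - to_date(20160101)).days   # 1

# Distance from the pivot date, in 30-day units
pivot = dt.date(2017, 2, 28)
months_before_pivot = (pivot - to_date(20160101)).days / 30

print(raw_gap_new_year, raw_gap_next_day)
print(day_gap_new_year, day_gap_next_day)
print(months_before_pivot)
```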

First the consecutives. I've defined some functions to help. This chunk might take a while as well (go big data!), so sit back and relax.

In [26]:
def len_consec(a, value):
    """Length of the longest run of `value` in the sequence `a`."""
    # Work on the numeric codes directly. Joining them into a string and
    # splitting into characters would also match the '1' inside '-1'.
    a = np.asarray(a)
    rr = np.argwhere(a == value).ravel()    # positions holding `value`
    if not rr.size:
        return 0
    full = np.arange(rr[0], rr[-1]+1)       # span from the first to the last hit
    diff = np.setdiff1d(full, rr)           # positions inside the span that break a run
    if not diff.size:
        return len(full)
    pos, difs = full[0], []
    for el in diff:
        difs.append(el - pos)
        pos = el + 1
    difs.append(full[-1]+1 - pos)
    # When the span contains breaks, a longest run of a single month is reported as 0
    res = max(difs) if max(difs) != 1 else 0
    return res

def len_consec_zeros(a):
    return len_consec(a, 0)

def len_consec_ones(a):
    return len_consec(a, 1)

# Columns 3..26 are assumed to hold the monthly recency codes
# (.ix was removed in pandas 1.0; .iloc is the positional equivalent)
train['consecutive_zeros'] = train.iloc[:, 3:27].apply(len_consec_zeros, axis=1)
train['consecutive_ones'] = train.iloc[:, 3:27].apply(len_consec_ones, axis=1)

test['consecutive_zeros'] = test.iloc[:, 3:27].apply(len_consec_zeros, axis=1)
test['consecutive_ones'] = test.iloc[:, 3:27].apply(len_consec_ones, axis=1)
print('Completed!!! ヽ(^◇^*)/')
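For comparison, the longest run of a given code can also be found with `itertools.groupby`. This is a sketch of the general technique, not the helper used above: the name `longest_run` is made up, and unlike the functions above it reports an isolated single month as a run of length 1.

```python
from itertools import groupby

def longest_run(values, target):
    """Length of the longest unbroken run of `target` in `values`."""
    # groupby yields one (key, group) pair per stretch of equal neighbours
    return max((sum(1 for _ in run) for key, run in groupby(values) if key == target),
               default=0)

codes = [-1, 0, 0, -1, 0, 0, 0, 1]
print(longest_run(codes, 0))
print(longest_run(codes, 1))
print(longest_run(codes, 5))
```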

Now the dates. This part is relatively straightforward.

In [None]:
import datetime as dt

def get_date(i):
    # i is a YYYYMMDD number (possibly stored as a float after the merge)
    i = int(i)
    year = i // 10000
    month = (i % 10000) // 100
    day = i % 100
    return dt.date(year, month, day)

pivot = dt.date(2017, 2, 28)

date_cols = ['registration_init_time', 'expiration_date',
             'transaction_date', 'membership_expire_date']

# Replace each non-missing date with its distance from the pivot, in 30-day units
for df in (train, test):
    for col in date_cols:
        mask = df[col].notnull()
        df.loc[mask, col] = [(pivot - get_date(i)).days / 30 for i in df.loc[mask, col]]
print('Completed!!! ヽ(^◇^*)/')

### Missing Data

Ewwww... Missing data. It's important to realize that not every user in train or test appears in transactions or members. So unfortunately, we have missing data.

In [None]:
def num_missing(x):
    return sum(x.isnull())

#Applying per column
print ("Train missing values:")
print ("---------------------------------")
print (train.apply(num_missing, axis=0))
print('')
print ("Test missing values:")
print ("---------------------------------")
print (test.apply(num_missing, axis=0))
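The same per-column counts can be obtained without a helper function. Below is a minimal sketch on a made-up toy frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'city': [1.0, np.nan, 5.0],
                    'gender': ['female', None, None]})

# isnull() marks every missing cell; sum() counts the marks per column
missing = toy.isnull().sum()
print(missing)
```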

For the categorical variables, we will fill in the missing values with 'NA' (for gender) and 999 (for city and registered_via). Don't forget to one-hot encode the categories too.

In [None]:
train.loc[train.city.isnull(), 'city'] = 999
test.loc[test.city.isnull(), 'city'] = 999

train.loc[train.gender.isnull(), 'gender'] = 'NA'
test.loc[test.gender.isnull(), 'gender'] = 'NA'

train.loc[train.registered_via.isnull(), 'registered_via'] = 999
test.loc[test.registered_via.isnull(), 'registered_via'] = 999

train = pd.get_dummies(train, columns = ['city', 'gender', 'registered_via'])
test = pd.get_dummies(test, columns = ['city', 'gender', 'registered_via'])
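One thing to watch when train and test are encoded separately: if a category appears in only one of them, the two frames end up with different dummy columns. The sketch below shows one possible way to align them, using made-up toy frames; it is not part of the pipeline above.

```python
import pandas as pd

toy_train = pd.DataFrame({'city': [1, 5, 999]})
toy_test = pd.DataFrame({'city': [1, 13]})

toy_train = pd.get_dummies(toy_train, columns=['city'])
toy_test = pd.get_dummies(toy_test, columns=['city'])

# Give the test frame exactly the training columns: categories missing from
# test are filled with 0, and categories unseen in train are dropped
toy_test = toy_test.reindex(columns=toy_train.columns, fill_value=0)
print(toy_test)
```

On the real data the train frame also holds columns such as `is_churn`, so the list of columns to align on would need to be chosen with care.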

For the continuous variables, we will fill in the missing values with the average of the combined dataset, computed after dropping outliers. For now only the test set is filled. We will store the train_na and test_na masks in case we decide to use them in the training/predicting process.

In [None]:
import pickle

combined = pd.concat([train, test])

def reject_outliers(data, m=2):
    # Keep values within m standard deviations of the column mean;
    # everything else becomes NaN and is skipped by the mean below
    return data[(data - data.mean()).abs() < m * data.std(ddof=0)]

var = ['registration_init_time',
      'expiration_date',
      'payment_plan_days',
      'transaction_date',
      'membership_expire_date',
      'is_cancel',
      'is_auto_renew',
      'num_transactions',
      'num_unq',
      'total_secs']

avg = reject_outliers(combined[var]).mean()
train_na = train[var].isnull().any(axis = 1)
test_na = test[var].isnull().any(axis = 1)

for col in var:
    #train.loc[train_na, col] = avg[col] I'm trying not to do it with the train
    # Rows with any missing value get the average in every listed column
    test.loc[test_na, col] = avg[col]

# .ix was removed in pandas 1.0; a boolean array with .iloc is the positional equivalent
test.iloc[test['201501'].isnull().values, 3:27] = 0

with open('train/train_na.pkl', 'wb') as output:
    pickle.dump(train_na, output, protocol=pickle.HIGHEST_PROTOCOL)

with open('train/test_na.pkl', 'wb') as output:
    pickle.dump(test_na, output, protocol=pickle.HIGHEST_PROTOCOL)
print('Completed!!! ヽ(^◇^*)/')
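To see what the outlier filter does to the average, here is a small sketch on a made-up series with one extreme value. `reject_outliers` is redefined here so the example runs on its own.

```python
import pandas as pd

def reject_outliers(data, m=2):
    # Keep values within m standard deviations of the mean
    return data[(data - data.mean()).abs() < m * data.std(ddof=0)]

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

plain_mean = toy.mean()                      # pulled up by the single 1000
trimmed_mean = reject_outliers(toy).mean()   # the 1000 is dropped first

print(plain_mean, trimmed_mean)
```

With `m=2` the value 1000 lies more than two standard deviations from the mean, so it is excluded and the fill value reflects the typical user rather than the extreme one.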

### Saving Preprocessed Data

In [None]:
# remember to save all your hard work!

# msno is already a regular column, so the index is not written out
train.to_csv('data/train_out.csv', index = False)
test.to_csv('data/test_out.csv', index = False)
print('Completed and Saved!!! ヽ(^◇^*)/')

# Visualizing Data

work in progress...