This kernel explores key differences between the training and test data that may cause the final leaderboard to diverge from the public leaderboard. I want to point out some data issues and perhaps spur a discussion on how to cope with them.

From my point of view, two major challenges of this dataset are:
1. Categorical features with far more than 100 unique values each
2. Strong imbalance between signal (target) and background (1:1000)

A potential trap in this scenario is that algorithms, especially decision trees, are prone to memorizing individual downloads rather than generalizing download patterns/behavior. Even though boosting helps here, the performance on the full test set will depend on how comparable the feature distributions in the train and test sets are.
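Both challenges are easy to quantify before any modeling. The following is a minimal sketch on made-up rows (the toy data is not taken from the competition files, and `summarize` is a hypothetical helper): it counts unique values per categorical column and the number of background rows per signal row.

```python
from collections import defaultdict

def summarize(rows, columns, target="is_attributed"):
    """Return (unique-value count per column, background rows per signal row)."""
    uniques = defaultdict(set)
    signal = background = 0
    for row in rows:
        for col in columns:
            uniques[col].add(row[col])
        if row[target] == 1:
            signal += 1
        else:
            background += 1
    cardinality = {col: len(uniques[col]) for col in columns}
    # Guard against a sample that contains no downloads at all.
    ratio = background / signal if signal else float("inf")
    return cardinality, ratio

# Toy rows for illustration only.
toy_rows = [
    {"ip": 1, "app": 3, "is_attributed": 0},
    {"ip": 2, "app": 3, "is_attributed": 0},
    {"ip": 3, "app": 9, "is_attributed": 1},
    {"ip": 1, "app": 3, "is_attributed": 0},
]
cardinality, ratio = summarize(toy_rows, ["ip", "app"])
print(cardinality, ratio)
```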

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
# Thanks to yuliagm: https://www.kaggle.com/yuliagm/talkingdata-eda-plus-time-patterns

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

For memory efficiency, the full training set is read in chunks the size of the test set.

Overall, this comparison focuses on the categorical features IP, APP, DEVICE, OS and CHANNEL.
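The chunking idea itself does not depend on pandas. As a rough sketch using only the standard library (the in-memory CSV text and the helper `iter_chunks` are assumptions for illustration), unique values can be accumulated chunk by chunk so that only one chunk of rows is held in memory at a time:

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy stand-in for train.csv; the real file is far larger.
toy_csv = io.StringIO(
    "ip,app,is_attributed\n"
    "10,3,0\n"
    "11,3,0\n"
    "10,9,1\n"
    "12,3,0\n"
    "11,9,0\n"
)
unique_ips = set()
chunk_sizes = []
for chunk in iter_chunks(csv.DictReader(toy_csv), chunk_size=2):
    chunk_sizes.append(len(chunk))
    unique_ips.update(int(row["ip"]) for row in chunk)

print(chunk_sizes, sorted(unique_ips))
```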

In [2]:
#Read training dataset via chunks
train_reader = pd.read_csv('../input/train.csv', chunksize=18790470)
test = pd.read_csv('../input/test.csv')

#test['click_time'] = pd.to_datetime(test['click_time'])
#Analysis focuses on 5 variables
variables = ['ip', 'app', 'device', 'os', 'channel']

#Create the sets that hold the unique values found only in training, only in test, or in both
#For training-only and shared values, the suffix marks the label state seen in training:
#00 = never attributed, 01 = seen both attributed and not attributed, 11 = only attributed
train_only_sets00 = [set([]) for _ in variables]
train_only_sets01 = [set([]) for _ in variables]
train_only_sets11 = [set([]) for _ in variables]

test_only_sets = [set([]) for _ in variables]

shared_sets00 = [set([]) for _ in variables]
shared_sets01 = [set([]) for _ in variables]
shared_sets11 = [set([]) for _ in variables]

The analysis investigates whether the unique values of the categorical features are comparably distributed in the test and training sets. More specifically, I focus on
1. Unique values that are shared by training and test set
2. Unique values that occur in the training set only
3. Unique values that occur in the test set only

Sets 1 and 2 are further broken down by the training label: values seen only with a download ("is_attributed" == 1), only without a download, or with both.

Calculation of the sets might take some time (10 - 30s)
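The label breakdown can be sketched on toy data as follows. This is a simpler formulation than the incremental update used in the notebook cell, and it assumes that two accumulated sets per variable fit in memory; `split_by_label_state` and the toy chunks are hypothetical.

```python
def split_by_label_state(chunks):
    """chunks: iterable of lists of (value, is_attributed) pairs."""
    seen_loaded = set()
    seen_notloaded = set()
    for chunk in chunks:
        for value, is_attributed in chunk:
            if is_attributed == 1:
                seen_loaded.add(value)
            else:
                seen_notloaded.add(value)
    # A value seen with both labels, even in different chunks, counts as "both".
    both = seen_loaded & seen_notloaded
    return seen_loaded - both, seen_notloaded - both, both

# Value 1 has no download in the first chunk but a download in the second.
toy_chunks = [
    [(1, 0), (2, 1), (3, 0)],
    [(1, 1), (4, 1), (3, 0)],
]
toy_loaded_only, toy_notloaded_only, toy_both = split_by_label_state(toy_chunks)
print(toy_loaded_only, toy_notloaded_only, toy_both)
```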

In [3]:
#Creating the sets for all variables chunk by chunk
loaded_only = [set([]) for _ in variables]
notloaded_only = [set([]) for _ in variables]
both_states = [set([]) for _ in variables]

for cnum, chunk in enumerate(train_reader):
    print("Reading chunk %i" % cnum)
    app_loaded = chunk["is_attributed"] == 1
    app_notloaded = chunk["is_attributed"] == 0
    for vnum, v in enumerate(variables):
        train_notloaded = set(chunk[v][app_notloaded])
        train_loaded = set(chunk[v][app_loaded])
        
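        #Values seen with only one label so far (earlier chunks plus this chunk)
        loaded_diff = loaded_only[vnum] | (train_loaded - train_notloaded)
        notloaded_diff = notloaded_only[vnum] | (train_notloaded - train_loaded)
        
        #Values that had one label in earlier chunks and the other label in this chunk
        chunk_carry = loaded_diff & notloaded_diff
        
        #Once a value has been seen with both labels it stays in both_states
        both_states[vnum] = both_states[vnum] | (train_notloaded & train_loaded) | chunk_carry
        loaded_only[vnum] = loaded_diff - both_states[vnum]
        notloaded_only[vnum] = notloaded_diff - both_states[vnum]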
print("Finalizing sets")

for vnum, v in enumerate(variables):
    test_set = set(test[v])
    
    #Unique values of variable that are shared between training and test set
    shared_sets00[vnum] = test_set & notloaded_only[vnum]
    shared_sets01[vnum] = test_set & both_states[vnum]
    shared_sets11[vnum] = test_set & loaded_only[vnum]

    #Unique values of variable that are only in the test set
    test_only_sets[vnum] = test_set - shared_sets00[vnum] - shared_sets01[vnum] - shared_sets11[vnum]
    
    #Unique values of variable that are only in the training set
    train_only_sets00[vnum] = notloaded_only[vnum] - test_set
    train_only_sets01[vnum] = both_states[vnum] - test_set
    train_only_sets11[vnum] = loaded_only[vnum] - test_set

print("Done creating sets")
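A quick sanity check for this kind of bookkeeping is that the seven subsets form a partition: they should be pairwise disjoint and together cover every value seen in either dataset. A minimal sketch on toy sets (the helper `partition_values` and all values are assumptions for illustration):

```python
def partition_values(test_values, loaded_only, notloaded_only, both_states):
    """Split all values into the seven subsets used in the analysis."""
    test_values = set(test_values)
    train_values = loaded_only | notloaded_only | both_states
    return {
        "shared_onlyloaded": test_values & loaded_only,
        "shared_both": test_values & both_states,
        "shared_noneloaded": test_values & notloaded_only,
        "train_only_onlyloaded": loaded_only - test_values,
        "train_only_both": both_states - test_values,
        "train_only_noneloaded": notloaded_only - test_values,
        "test_only": test_values - train_values,
    }

toy_partition = partition_values(
    test_values=[1, 2, 5, 6],
    loaded_only={2, 4},
    notloaded_only={3},
    both_states={1},
)
all_values = {1, 2, 3, 4, 5, 6}
# Equal total size and equal union together imply the subsets do not overlap.
total_size = sum(len(s) for s in toy_partition.values())
union = set().union(*toy_partition.values())
print(total_size == len(all_values), union == all_values)
```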

After the sets of unique values have been calculated, the dataframe for analysis is prepared.

In [4]:
#DataFrame for analysis consists of 7 columns: 3 shared subsets, 3 training-only subsets and the test-only subset
#Each row holds the number of unique items per subset - one row per variable (ip, app...)

ana_data = {"shared_onlyloaded": list(map(len, shared_sets11)),
            "shared_both": list(map(len, shared_sets01)),
            "shared_noneloaded": list(map(len, shared_sets00)),
            "train_only_onlyloaded": list(map(len, train_only_sets11)),
            "train_only_both": list(map(len, train_only_sets01)),
            "train_only_noneloaded": list(map(len, train_only_sets00)),
            "test_only": list(map(len, test_only_sets))
           }

ana_frame = pd.DataFrame(data=ana_data)
#DataFrame is normalized towards total number of unique items
total = ana_frame.sum(axis=1)

for col in list(ana_frame):
    ana_frame[col] = ana_frame[col].divide(total)
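The normalization divides each row by its own total, so the fractions for one variable sum to one, and multiplying by the total again recovers the counts (which is what the bar annotations in the plotting function rely on). A minimal pure-Python sketch with made-up counts; `normalize_counts` is a hypothetical helper:

```python
def normalize_counts(counts):
    """Convert subset sizes into fractions of the row total."""
    total = sum(counts.values())
    fractions = {name: size / total for name, size in counts.items()}
    return fractions, total

# Made-up subset sizes for a single variable.
toy_counts = {"shared_both": 30, "train_only_both": 50, "test_only": 20}
fractions, total = normalize_counts(toy_counts)
# Multiplying back by the total recovers the original counts.
recovered = {name: round(frac * total) for name, frac in fractions.items()}
print(fractions, recovered)
```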

To analyze these sets I want to understand the size of these sets and how their distribution overlaps to get an indication of the comparability of training and test set.

In [5]:
from matplotlib import pyplot
#General function for analysis of the categorical variables

def analyze_column(col_num, col_name, bins, ignore_overview=False):
    #Keep only the row that belongs to the selected variable
    ip_frame = ana_frame.iloc[[col_num]]
    
    if not ignore_overview:
        #Plotting the initial overview
        ax = ip_frame.plot.bar()
        ax.set_xticklabels([variables[col_num]])
        ax.set_title(col_name + "-Overview on items that are shared or differ between training and test set")

        for pnum, p in enumerate(ax.patches):
            ax.annotate(str(round(p.get_height()*total[col_num])), (p.get_x() * 0.99, p.get_height() * (1.01)))

    #plotting distribution
    pyplot.figure()
    pyplot.hist([list(shared_sets01[col_num]),list(shared_sets00[col_num]),list(shared_sets11[col_num]),
                list(test_only_sets[col_num]),list(train_only_sets01[col_num]), list(train_only_sets00[col_num]),
                list(train_only_sets11[col_num])], bins, alpha=1.0, 
                label=['shared_both', 'shared_noneloaded', 'shared_onlyloaded', 'test_only',
                       'train_only_both', 'train_only_noneloaded', 'train_only_onlyloaded'],
                stacked=True
               )
    pyplot.legend()
    pyplot.title(col_name + "-Analysis: Distribution of shared and differing unique values in training and test set")
    pyplot.show()

In [6]:
analyze_column(0, "IP", 50)

**Overview:** One can clearly see that the IP ranges in the test and training sets differ significantly. The majority of unique values (50% - train_only_both) is only available in the training set and without a clear signal (i.e. has both is_attributed==0 and is_attributed==1). Of the 10% with a strong signal (i.e. exclusively is_attributed==1; train_only_onlyloaded + shared_onlyloaded), only 0.3% exist in both the training and test set. Overall, the test set introduces more new IP addresses (~18% - dark red) than it shares with the training set (~12% - blue, orange).

**Distribution:** The right-hand side around IP > 125000 is dominated by IP addresses that only occur in the training set. Let's take a closer look.

In [7]:
analyze_column(0, "IP", np.arange(125000, 128000, 50), True)

On closer inspection one can see:

1. The cut-off looks like an intentional design by the competition hosts: IP < 126500 is a regular mix of IPs, some of which are shared with the training set and are attributed; IP >= 126500 consists of training-only IP addresses. Still, there is a large number of training-only IP addresses that are attributed, which might derail algorithms. Furthermore, a large number of IP addresses occur only in the test set. Hence, the decision tree branches are hopefully generalized well enough to cope with these new items.

2. The unique values are not evenly distributed and accumulate around certain IP addresses. Interestingly, there are areas where test_only dominates (dark red) and others where the shared values dominate (blue, orange). This is another hint that the test set was designed to ensure that the models are not a "glorified IP blacklist".

3. The unique values for train_only (IP >= 126500) have no holes, i.e., all IP addresses are covered, whereas the part relevant for testing seems incomplete. That means that for the IP range relevant for testing our model we lack the full range of IP address behavior.
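Observation 3 can be checked directly rather than read off the histogram. A minimal sketch, assuming the encoded IPs are plain integers; `find_gaps` and the toy values are hypothetical and only illustrate the check:

```python
def find_gaps(values, start, stop):
    """Return the integers in [start, stop) that are absent from values."""
    present = set(values)
    return [v for v in range(start, stop) if v not in present]

# Toy values: a contiguous training-only block and a test range with holes.
toy_train_only = {126500, 126501, 126502, 126503}
toy_test_ips = {126490, 126493, 126494}

gaps_train = find_gaps(toy_train_only, 126500, 126504)
gaps_test = find_gaps(toy_test_ips, 126490, 126495)
print(gaps_train, gaps_test)
```

An empty result means the range is fully covered; a non-empty one lists the missing ids.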

As pointed out by the competition hosts, IP addresses are encoded. Identifying this encoding could help reduce cardinality of IP addresses and improve generalization.

In [8]:
analyze_column(0, "IP", np.arange(123128, 123257, 2), True)

Looking at the most granular level, one can see that areas with no training information (dark red), weak information (blue) or information on non-download IP addresses (orange) alternate and are quite narrow. Defining proper IP bands via decision trees might be quite challenging.
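To get a feeling for how narrow such bands are, the composition of fixed-width bands can be tabulated instead of plotted. The sketch below is only an illustration: the band width, the helper `band_composition` and the toy values are assumptions, and fixed-width bands are just one of many possible groupings.

```python
from collections import Counter

def band_composition(subsets, band_width):
    """Count, per band of consecutive ids, how many values each subset contributes."""
    bands = {}
    for name, values in subsets.items():
        for value in values:
            band_start = (value // band_width) * band_width
            bands.setdefault(band_start, Counter())[name] += 1
    return bands

# Toy subsets of encoded IPs, loosely mimicking the alternating pattern.
toy_ip_subsets = {
    "test_only": {123128, 123129, 123140},
    "shared_both": {123130, 123131},
    "shared_noneloaded": {123141},
}
bands = band_composition(toy_ip_subsets, band_width=10)
for band_start in sorted(bands):
    print(band_start, dict(bands[band_start]))
```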

In [9]:
analyze_column(1, "APP", 20)

The graphs for APP show a very different picture compared to IP.  Roughly 40% of apps are in the training set only (dark brown) and have no downloads.  Moreover, only a fraction of <5% of apps is newly introduced via the test set. Also, distracting downloads (i.e. that are only attributable to this app and occur only in the training set) are limited to <1%. 

In summary, it seems like the app number is also an indicator of the popularity of the app. Lower numbers correlate with a larger blue share, i.e. for the majority of apps at the lower end, data on both downloading and not downloading the app is available, and the app appears in both the train and test set.

App numbers > 550 could be good for training time-series models on how users typically click apps they don't end up downloading.
App numbers < 400 could be used in an over-sampled fashion to train specifically for app download patterns.
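
As a rough illustration of the over-sampling idea, the sketch below repeats the download rows of low-numbered apps. The threshold of 400 comes from the observation above; the repetition factor, the helper `oversample_downloads` and the toy rows are assumptions, and any such resampling would need to be validated against a hold-out set.

```python
def oversample_downloads(rows, app_threshold=400, factor=3):
    """Repeat rows with a download and app < app_threshold `factor` times."""
    result = []
    for row in rows:
        is_target = row["is_attributed"] == 1 and row["app"] < app_threshold
        result.extend([row] * (factor if is_target else 1))
    return result

# Toy rows for illustration only.
toy_clicks = [
    {"app": 3, "is_attributed": 1},
    {"app": 3, "is_attributed": 0},
    {"app": 600, "is_attributed": 1},
    {"app": 12, "is_attributed": 0},
]
resampled = oversample_downloads(toy_clicks)
downloads = sum(row["is_attributed"] for row in resampled)
print(len(resampled), downloads)
```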

In [13]:
analyze_column(2, "Device", 50)

In [14]:
analyze_column(3, "OS", 50)

In [15]:
analyze_column(4, "channel", 30)

Overall, the channel data shows no issues with channel partners that are only available in the test set, but 10% of the channel partners in the training set are not used in the test set. Looking at the distribution, one could infer that channel partners are organized in tiers (e.g., < 100: top-tier, highly trustworthy or with high volumes). The number of training-only channels at the higher end of the spectrum supports this perspective.

This was an initial introduction to the unique values and challenges in the test and training data. Hopefully it was helpful and insightful. I will add some more analysis on click behavior and also on the actual distributions beyond unique values.

Kindly let me know where you agree, where you disagree, or what your thoughts on the data are.