# The final touch of Elo data demystification!

I hope most of you have followed my previous work. If not, I encourage you to read it first, so that you understand the context of what I am going to talk about in this kernel.

I suggest reading the kernels in the following order:

- https://www.kaggle.com/raddar/towards-de-anonymizing-the-data-some-insights
- https://www.kaggle.com/raddar/target-true-meaning-revealed
- https://www.kaggle.com/raddar/card-id-loyalty-different-points-in-time (bonus)
- https://www.kaggle.com/raddar/merchant-id-imputations (bonus)

In this kernel I am going to show you some interesting things you can do once you know how the target is calculated!

### And it is one of the most interesting puzzles in the competition - so sit back, relax and enjoy!

Let's read the data first and do data unscaling as discussed in previous kernels:

In [None]:
import numpy as np
import pandas as pd
import itertools

pd.set_option('display.float_format', '{:.10f}'.format)

train = pd.read_csv('../input/train.csv')
cols = ['card_id','merchant_id','month_lag','purchase_amount','purchase_date','authorized_flag']
historical_transactions = pd.read_csv('../input/historical_transactions.csv',usecols=cols).fillna('')
new_merchant_transactions = pd.read_csv('../input/new_merchant_transactions.csv',usecols=cols).fillna('')

historical_transactions['purchase_amount'] = np.round(historical_transactions['purchase_amount'] / 0.00150265118 + 497.06,2)
new_merchant_transactions['purchase_amount'] = np.round(new_merchant_transactions['purchase_amount'] / 0.00150265118 + 497.06,2)
train['target'] = 2**train['target']

#only interested in authorized transactions:
historical_transactions = historical_transactions.loc[historical_transactions['authorized_flag']=='Y'].reset_index(drop=True)

Here I will start with some simple data exploration which is going to be useful for later insights and calculations.

We are going to use the fact that each card has its own observation (reference) date. For each `card_id`, the transactions are split into the `historical_transactions` and `new_merchant_transactions` tables based on that reference date. So here comes the question - is it possible to create a `new_merchant_transactions` table for any reference date for any `card_id`?

If you think about it, you can only do that by shifting the reference date back in time, not into the future. We already know that the `new_merchant_transactions` table should contain only new merchants (meaning merchants not present in `historical_transactions`) for each `card_id`. So in general, we could extract information like `(card_id, merchant_id, purchase_date)`, which would represent the first appearance of a `merchant_id` for a specific `card_id`.

Let's try that!

In [None]:
# adding "new merchant" flag for each transaction
first_hist_merchant_transaction = historical_transactions.groupby(['card_id','merchant_id'])['month_lag'].min().reset_index(name='month_lag')
first_new_merchant_transaction = new_merchant_transactions.groupby(['card_id','merchant_id'])['month_lag'].min().reset_index(name='month_lag')
first_hist_merchant_transaction['new'] = 1
first_new_merchant_transaction['new'] = 1
historical_transactions = historical_transactions.merge(first_hist_merchant_transaction, on = ['card_id','merchant_id','month_lag'], how = 'left')
new_merchant_transactions = new_merchant_transactions.merge(first_new_merchant_transaction, on = ['card_id','merchant_id','month_lag'], how = 'left')
historical_transactions.loc[pd.isnull(historical_transactions['new']),'new'] = 0
new_merchant_transactions.loc[pd.isnull(new_merchant_transactions['new']),'new'] = 0
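The same first-appearance idea can be sketched on a toy table in plain Python. The rows, the card and merchant names, and the helper names `first_appearances` and `new_merchants_after` are all made up for illustration. The point is only that once the first purchase date per `(card_id, merchant_id)` is known, the "new merchant" side of the split can be rebuilt for any earlier reference date.

```python
# Toy transactions: (card_id, merchant_id, purchase_date). All values are hypothetical.
transactions = [
    ("C_1", "M_A", "2017-01-05"),
    ("C_1", "M_B", "2017-02-10"),
    ("C_1", "M_A", "2017-03-01"),
    ("C_1", "M_C", "2017-03-15"),
    ("C_2", "M_A", "2017-02-20"),
]

def first_appearances(rows):
    """Earliest purchase_date for each (card_id, merchant_id) pair."""
    first = {}
    for card_id, merchant_id, date in rows:
        key = (card_id, merchant_id)
        # ISO-formatted dates compare correctly as plain strings
        if key not in first or date < first[key]:
            first[key] = date
    return first

def new_merchants_after(first, card_id, reference_date):
    """Merchants whose first purchase on this card happened after reference_date."""
    return sorted(m for (c, m), d in first.items()
                  if c == card_id and d > reference_date)

first = first_appearances(transactions)
print(new_merchants_after(first, "C_1", "2017-02-01"))  # ['M_B', 'M_C']
print(new_merchants_after(first, "C_1", "2017-03-05"))  # ['M_C']
```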

Let's look at the `purchase_amount` distribution by `month_lag` and `new` flag:

In [None]:
dist_hist = historical_transactions.groupby(['new','month_lag'])['purchase_amount'].sum().reset_index(name='total_purchase_amount')
dist_new = new_merchant_transactions.groupby(['new','month_lag'])['purchase_amount'].sum().reset_index(name='total_purchase_amount')
dist = pd.concat([dist_hist, dist_new]).reset_index(drop=True)

In [None]:
# keyword arguments are required by recent pandas versions
dist.pivot(index='month_lag', columns='new', values='total_purchase_amount')

As you can see, we are missing transactions worth 160-170 million (judging by lag = -2, -1, 0) for each future month.

The idea of this kernel is to try to reverse engineer the missing future transactions using `historical_transactions` and `target` information alone!

# Let's start!

As I found out earlier, the `target` is calculated as a ratio of future/historical transactions. After digging around in the data, I found that the most likely formula is:

\begin{equation*}
target(card\_id) = \frac{\sum_{m} sum\_future\_purchase\_amount (m,card\_id)/2}{\sum_{m} last\_historic\_transaction\_purchase\_amount (m, card\_id)}
\end{equation*}

where *m* represents the set of merchants, and only those merchants that appear in both historical **and** future transactions. The tricky part of this formula is that we do not know the most important part - *m*. If we knew *m*, we could easily reverse engineer the numerator of the formula, as `target` is given and the denominator of the ratio is easy to calculate.

So, we can actually transform the problem into this:

\begin{equation*}
loyal\_spending(card\_id) = 2 \cdot target(card\_id) \cdot \sum_{m} last\_historic\_transaction\_purchase\_amount (m, card\_id)
\end{equation*}

where `loyal_spending` stands for the total future purchase amount over the merchants in *m*, and we try to estimate it by looking for the most likely *m*.
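As a quick illustration, both directions of the assumed formula can be written as two tiny functions. The function names are mine; the numbers (last amounts 5.30 and 166.49, target 0.5663309969446729) are the ones that show up for the first train card later in this kernel.

```python
def loyal_spending(target, last_amounts):
    """Inverted formula: future spending on the merchants in m."""
    return 2 * target * sum(last_amounts)

def target_from(future_spending, last_amounts):
    """Forward direction of the assumed target formula."""
    return (future_spending / 2) / sum(last_amounts)

last_amounts = [5.30, 166.49]   # last historical purchase per merchant in m
target = 0.5663309969446729     # unscaled train target of the first card

spend = loyal_spending(target, last_amounts)
print(round(spend, 2))                     # 194.58
print(target_from(194.58, last_amounts))   # very close to the given target
```

The round trip is not exact to the last digit, which is why the search has to work with a small tolerance.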

If we knew `loyal_spending` for each `card_id`, we could fill in the previous table easily! On top of that, we could build models which could directly try to predict `loyal_spending` for each `card_id` and use that in our model stack!

# Combinatorics!

So how does one decide the set of *m* for each `card_id`? You will see that this is actually a very interesting combinatorics problem!

But first, I want to remind everyone that we are dealing with money - which means `purchase_amount` is a float with 2 decimal places. This is a very important piece of information for solving the puzzle. Why? You would expect `loyal_spending` to have exactly 2 decimal places as well - a sum of money is still money :)
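Because of floating point arithmetic, the product is rarely an exact 2 decimal number, so the check needs a little slack. A minimal sketch of such a check follows; the name `looks_like_money` and its `decimals` parameter are my own, chosen to echo the comparison at 5 decimals (with a fallback at 4) used in the function further down.

```python
def looks_like_money(value, decimals=5):
    """True if rounding to cents changes nothing at `decimals` places."""
    return round(value, 2) == round(value, decimals)

target = 0.5663309969446729  # unscaled target of the first train card

print(looks_like_money(2 * target * 171.79))  # True:  194.580003930... is 194.58 at 5 decimals
print(looks_like_money(2 * target * 100.00))  # False: 113.266199...
```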

So the idea is quite simple:

- start with an empty set of possible solutions `M`
- take a candidate set `m`, containing 1 to N `merchant_id`'s from the pool of historical merchants
- calculate `loyal_spending` based on the formula above
- check whether `loyal_spending` rounds nicely to 2 decimal places
- if it does, add `m` to your pool of solutions `M`
- repeat for `n` iterations or until all possible `merchant_id` combinations are exhausted
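To get a feel for why a cap on the number of iterations is needed, here is a small sketch that counts the candidate sets. Order does not matter, so these are combinations. The helper name `candidate_count` and the merchant counts 10, 30 and 100 are arbitrary illustrations, not values taken from the data.

```python
from math import comb

def candidate_count(n_merchants, max_size):
    """Number of non-empty merchant subsets with at most max_size members."""
    top = min(max_size, n_merchants)
    return sum(comb(n_merchants, k) for k in range(1, top + 1))

for n in (10, 30, 100):
    # subsets of size <= 2 versus subsets of size <= 9
    print(n, candidate_count(n, 2), candidate_count(n, 9))
```

Sets of size 1 and 2 stay cheap even for cards with many merchants, while larger sets explode very quickly.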

If `M` contains only one solution - you got lucky, as you have identified a single solution for the given `target` value and `historical_transactions`. However, if `M` contains more than one solution, this is where you have to make assumptions and decide which is the most likely solution. I am going to cover this later.



# Enough theory - start coding!

Not yet... :) Before doing anything, let's decide on a strategy for selecting the best solution `m` from the pool of solutions `M`. The simplest and most straightforward way I could think of was to calculate the average `month_lag` of the associated merchants and use the solution whose `average_month_lag` is closest to 0 - thus we give higher priority to merchants who appeared closest to the reference date.

There is still a possible tie situation, where 2 or more `m` sets have the same `average_month_lag`. For now I decided to just pick the first `m`, but of course this could be made more sophisticated. However, a more sophisticated solution is not necessarily better, as we cannot actually measure the error we are going to make (actually we could do a retrospective analysis, but that is out of scope).
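The selection rule, including the "first one wins" tie-break, fits in a few lines. This is only a sketch: the pool `M` below is entirely made up, and `pick_best` is a hypothetical helper name. Historical `month_lag` values are never positive, so "closest to 0" is the same as "largest".

```python
# Hypothetical pool of solutions M: {amounts in m: (loyal_spending, average_month_lag)}
M = {
    (12.0, 30.5): (48.14, -6.0),
    (8.25,): (9.35, -1.0),
    (3.1, 40.0): (48.82, -1.0),
}

def pick_best(solutions):
    """Return the entry with the highest (closest to 0) average month_lag.

    max() keeps the first maximal item it meets, and dicts preserve
    insertion order, so ties go to the earliest-inserted m.
    """
    return max(solutions.items(), key=lambda item: item[1][1])

amounts, (spending, avg_lag) = pick_best(M)
print(amounts, spending, avg_lag)  # (8.25,) 9.35 -1.0
```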

If you feel it is hard to follow - relax!
I have presented the process, but I am pretty sure you are going to re-read all of this when we do some actual coding and everything will make sense!

# The code! 

First, let's extract the last `purchase_amount` for each `(card_id, merchant_id)`. This is necessary because only the last transaction matters in the loyalty calculation:

In [None]:
historical_transactions = historical_transactions.sort_values(['card_id','purchase_date']).reset_index(drop=True)
historical_transactions = historical_transactions.groupby(['card_id','merchant_id']).tail(1).reset_index(drop=True)

We want to avoid using pandas in further calculations, as it is extremely slow in indexing the data on such huge tables. Therefore, let's transform the data into useful python dicts containing the relevant information for our task.

In [None]:
# month lags
amtmth = historical_transactions[['card_id', 'purchase_amount','month_lag']].groupby(['card_id', 'purchase_amount'])['month_lag'].max().reset_index()
amtmth = amtmth.groupby('card_id').apply(lambda x: dict(zip(x['purchase_amount'],x['month_lag']))).reset_index(name = 'dict')
amtmth = dict(zip(amtmth['card_id'],amtmth['dict']))

# purchase amounts 
amtpurch = historical_transactions.groupby('card_id')['purchase_amount'].apply(list).reset_index(name='amounts')
amtpurch = dict(zip(amtpurch['card_id'],amtpurch['amounts']))

# targets
target = dict(zip(train['card_id'],train['target']))

This is the function I am going to use which does most of what I have discussed above:

In [None]:
def get_future_amount(trainnum, max_combinations = 100000):
    "max_combinations: number of possible combinations to explore (less - faster, more - slower)"

    card_id = train['card_id'][trainnum]
    k = 1
    
    #collection of possible solutions
    comb = {}
    comb2 = {}
    
    for i in range(1, 10):
        # we want to fully exhaust solutions of length 1 and 2
        # if we found something, do not go into deeper permutation levels
        # 2 is arbitrary and can be tuned. However, starting from 3 performance drops significantly
        if (len(comb) > 0) and (i > 2):
            break
        # need to set hard limit or else we are doomed :)
        if k == max_combinations:
            break    
        for subset in itertools.combinations(amtpurch[card_id], i):
            k+=1
            if k == max_combinations:
                break
            amt = np.round(np.sum(subset), 2)
            amt_month = np.mean([amtmth[card_id][x] for x in subset])
            amt_target = amt * target[card_id] * 2 #this is the loyalty formula as discussed earlier!!!
            # let's see if it rounds up nicely...
            if np.round(amt_target, 2) == np.round(amt_target, 5):      
                comb[tuple(sorted(subset))] = np.round(amt_target, 2), amt_month
            # backup plan with higher error tolerance...
            if np.round(amt_target, 2) == np.round(amt_target, 4):      
                comb2[tuple(sorted(subset))] = np.round(amt_target, 2), amt_month
    
    # if no higher precision combinations found, use combinations with lower precision
    if len(comb) == 0:
        comb = comb2
    
    best = np.nan
    if len(comb) > 0:       
        q = [z[0] for z in comb.values()]
        m = [z[1] for z in comb.values()]
        best = q[np.argmax(m)]
    return comb, best, target[card_id]

The function is quite costly - I would recommend running it with multiprocessing when processing all `card_id`'s (it takes a couple of hours on 12 threads). Let's take a look at the first `card_id` in train:
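One possible way to parallelise the run is a process pool from the standard library. This is a sketch under assumptions, not the setup I actually used: `solve_card` is a stand-in worker that would call `get_future_amount(trainnum)[1]` in the kernel, and `solve_all`, `processes` and `chunksize` are illustrative names and values.

```python
from multiprocessing import Pool

def solve_card(trainnum):
    """Stand-in worker; in the kernel this would return get_future_amount(trainnum)[1]."""
    return float(trainnum)  # placeholder result so the sketch runs on its own

def solve_all(n_cards, processes=12, chunksize=500):
    """Map the worker over card positions 0..n_cards-1, keeping the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(solve_card, range(n_cards), chunksize=chunksize)
```

When run as a script, call `solve_all(len(train))` from under an `if __name__ == "__main__":` guard. With the fork start method the worker processes inherit module-level dicts such as `amtpurch`, `amtmth` and `target`, so those do not have to be passed to each task.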

In [None]:
get_future_amount(0)

There is a lot of information here, so let's go through it step by step:

- The first element is a dictionary of valid merchant proposals with their respective last `purchase_amount` (the actual `merchant_id` does not matter - the amount does!).

- The dict keys represent the actual `purchase_amount`'s used to derive the `loyal_spending`.

- The dict values represent a) `loyal_spending` and b) `average_month_lag`, as discussed earlier.

- The second element is the final `loyal_spending` selected from the given candidates (based on `average_month_lag`).

- The third element is just for debugging purposes - the train target value.


Everything is quite simple here. Let's do some simple math. The sums of the last historical purchase amounts behind each possible `loyal_spending` are:

`6.5 + 73.93 = 80.43`

`19 + 129.95 = 148.95`

`17.4 + 530.76 = 548.16`

`120.67 + 530.76 = 651.43`

`50 + 166.49 = 216.49`

`5.3 + 166.49 = 171.79`

We know, that `target = 0.5663309969446729`. 

Let's put all that into `loyal_spending` formula:

`loyal_spending = 2 * 0.5663309969446729 * 80.43 = 91.1000041685201`

`loyal_spending = 2 * 0.5663309969446729 * 148.95 = 168.710003989818`

`loyal_spending = 2 * 0.5663309969446729 * 548.16 = 620.879998570384`

`loyal_spending = 2 * 0.5663309969446729 * 651.43 = 737.850002679337`

`loyal_spending = 2 * 0.5663309969446729 * 216.49 = 245.209995057105`

`loyal_spending = 2 * 0.5663309969446729 * 171.79 = 194.580003930251`

All of these round nicely to a float with 2 decimal places.
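The arithmetic above is easy to replay. The snippet below takes the six sums and the target from the text and prints how far each product is from the nearest cent; the variable names are mine.

```python
target = 0.5663309969446729
# sums of last historical purchase amounts for the six candidate sets listed above
candidate_sums = [80.43, 148.95, 548.16, 651.43, 216.49, 171.79]

residuals = []
for s in candidate_sums:
    spending = 2 * target * s
    residual = abs(spending - round(spending, 2))
    residuals.append(residual)
    print(f"{s:8.2f} -> {spending:.6f} (distance to nearest cent: {residual:.1e})")
```

Every distance should come out below 0.00001, which is what the 5 decimal comparison in `get_future_amount` requires.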

As discussed earlier, all of these `loyal_spending` values are possible. However, we already decided to pick the one closest to the reference date (using `average_month_lag`). This means that the total loyal spending for this `card_id` is 194.58 over 2 months.

Let's take a look at what this `card_id` looks like in the `historical_transactions` table (which has already been filtered down to the last transactions):

In [None]:
historical_transactions[historical_transactions['card_id']==train['card_id'][0]]

Based on that we can now pinpoint which actual `merchant_id`'s were used in loyalty calculation: `M_ID_1df4c6f47a` (`purchase_amount = 5.30`) and `M_ID_820c7b73c8` (`purchase_amount = 166.49`), and in the next 2 months a person spent 194.58 on these 2 merchants. How awesome is that?! - we were able to reverse engineer the data we were not provided in this competition! (I feel a bit confused why they did not provide such transactions for train part, but that is another topic...)
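Mapping the winning amounts back to merchants is a simple reverse lookup. In the sketch below the first two rows come from the text above, while the `M_ID_example_*` rows and the helper name `merchants_for` are made up for illustration.

```python
# Last purchase per merchant for one card.
last_purchase = {
    "M_ID_1df4c6f47a": 5.30,    # from the text
    "M_ID_820c7b73c8": 166.49,  # from the text
    "M_ID_example_1": 6.50,     # hypothetical merchant id
    "M_ID_example_2": 73.93,    # hypothetical merchant id
}

def merchants_for(amounts, last_purchase):
    """Merchant ids whose last purchase_amount appears in the winning key."""
    wanted = set(amounts)
    return sorted(m for m, amt in last_purchase.items() if amt in wanted)

print(merchants_for((5.3, 166.49), last_purchase))
# ['M_ID_1df4c6f47a', 'M_ID_820c7b73c8']
```

If two merchants of the same card share the same last amount, this lookup cannot tell them apart, so such cases would need extra care.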

To wrap it up, let's run all of this for the first 100 cards in the train set, just to see whether the function generalizes well enough:

In [None]:
out = []

for i in range(100):
    out.append(get_future_amount(i)[1])

In [None]:
train['loyal_spending'] = np.nan
train.loc[train.index[:100], 'loyal_spending'] = out  # .loc avoids chained assignment
train.head(100)

It seems all is good and the logic transfers to most of the other `card_id`'s as well!

# P.S.

Although I spent a lot of time on finding the formula for the `loyal_spending` calculation, I am still not 100% confident in it. However, if you are able to find another formula that works better, you can easily reproduce the steps above and get results in the same way I did.

# What's next?

The findings are universal and applicable to all `card_id`'s. This means that you can use the calculated `loyal_spending` as a target for a new meta model and hope that the stacking model improves.

In my experience this approach did not give much uplift - I personally expected it to provide *killer* models in the stack. This may be related to the `m` selection process, which I believe still involves a high error rate.

However, models of this type are in our final stack submission, so they have proven useful!

# Thank you for following!

## upvote if you liked the content :)

### see you all in next competitions - I hope to not stop making kernels in the future.