# Question this notebook seeks to answer
We are given 5 ```resp``` values: ```resp```, ```resp_1```, ```resp_2```, ```resp_3``` and ```resp_4```, which represent returns over different time horizons. A trade with a positive ```resp``` contributes positively to the ```utility score```, so for competition submissions most of us assign ```action```=1 when ```resp```>0.

So how often are all 5 of these values positive for the same sample? Or only 4, 3, 2 or 1 of them? How often do the various combinations of the 5 values agree or disagree in sign?
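The first of these questions can be sketched directly by counting the positive targets in each row. The snippet below is only a minimal illustration: the ```toy``` frame and its values are made up, and the names ```n_positive``` and ```positive_share``` are hypothetical, not part of the notebook. On the real data the same two lines would run on ```train```.

```python
import pandas as pd

# Made-up returns for illustration only; the real values come from train.csv.
toy = pd.DataFrame({
    'resp':   [0.01, -0.02,  0.03, -0.01],
    'resp_1': [0.02, -0.01, -0.01, -0.02],
    'resp_2': [0.01, -0.03,  0.02,  0.01],
    'resp_3': [0.03, -0.02,  0.01, -0.03],
    'resp_4': [0.02, -0.01,  0.04,  0.02],
})
targets = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']

# Number of strictly positive targets in each row (between 0 and 5).
n_positive = (toy[targets] > 0).sum(axis=1)

# Share of rows for each count of positive targets.
positive_share = n_positive.value_counts(normalize=True).sort_index()
print(positive_share)
```

A count of 5 means all horizons are positive and a count of 0 means none are; the counts in between are the mixed cases.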

In [None]:
import itertools
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from pytz import timezone
print('tic', datetime.now(timezone('Canada/Pacific')).isoformat(timespec='minutes'))

In [None]:
train = pd.read_csv('../input/jane-street-market-prediction/train.csv')

# just slimming down

# remove rows we don't need
train = train.loc[ train['weight']>0 ]

# remove columns we don't need
train = train[ ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'date', 'weight'] ]

In [None]:
targets = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
# for every combination of 2 to 5 targets, flag the rows where all targets share the same sign
for howmany in range(2, 6):
    for combo in itertools.combinations(targets, howmany):
        label = '|' + '|'.join(combo) + '|'  # e.g. '|resp|resp_1|'
        pos, neg = True, True
        for target in combo:
            pos = pos & (train[target]>0)
            neg = neg & (train[target]<0)
        train[label] = pos | neg
# raw string so that '\|' is a regex escape, not an invalid string escape
train[ train.columns[train.columns.str.contains(r'\|resp')] ]

In [None]:
# sync means positive together or negative together
sync_frequency = train[ train.columns[train.columns.str.contains(r'\|resp')] ].sum() / len(train)
sync_frequency.plot.bar(grid=True, figsize=(15, 5), ylabel='sync frequency', xlabel='combination')
sync_frequency.sort_values(ascending=False)
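As a cross-check on the loop-based computation, the same sync frequencies can be sketched with ```np.sign```: a row is in sync for a combination exactly when the absolute sum of its signs equals the number of targets in the combination. This is an alternative sketch, not the method used above. The ```toy``` frame is made up and ```sync_sketch``` is a hypothetical name.

```python
import itertools
import numpy as np
import pandas as pd

# Made-up returns for illustration only; the real values come from train.csv.
toy = pd.DataFrame({
    'resp':   [0.01, -0.02,  0.03, -0.01],
    'resp_1': [0.02, -0.01, -0.01, -0.02],
    'resp_2': [0.01, -0.03,  0.02,  0.01],
    'resp_3': [0.03, -0.02,  0.01, -0.03],
    'resp_4': [0.02, -0.01,  0.04,  0.02],
})
targets = list(toy.columns)

# +1, -1 or 0 for every value; a zero can never count as in sync.
signs = np.sign(toy[targets])

# |sum of signs| == number of targets  <=>  all positive or all negative.
sync_sketch = pd.Series({
    '|' + '|'.join(combo) + '|':
        (signs[list(combo)].sum(axis=1).abs() == len(combo)).mean()
    for howmany in range(2, len(targets) + 1)
    for combo in itertools.combinations(targets, howmany)
})
print(sync_sketch.sort_values(ascending=False))
```

The labels follow the same ```|resp|resp_1|``` pattern as the columns above, so the two results can be compared entry by entry.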

## consistency #1: correlation between pairs
Pearson correlation describes pairs only, not combinations of more than two variables, so it can't replace the bar plot above. Hence the purpose of this notebook.

In [None]:
# the bar plot is consistent with our expectation that resp is highly correlated with resp_4
train[ train.columns[train.columns.str.contains('^resp')] ].corr()

## consistency #2: pairplot
```pairplot``` also shows pairs only, not combinations of more than two variables, so it can't replace the bar plot above either. Hence the purpose of this notebook.

In [None]:
sns.pairplot(train[ train.columns[train.columns.str.contains('^resp')] ])

## sanity check: in case we need to be convinced

In [None]:
# sanity: recompute one sync frequency by hand and compare it with the automated value
auto = sync_frequency['|resp_1|resp_2|resp_3|']
manual = (((train['resp_1']>0) & (train['resp_2']>0) & (train['resp_3']>0)) | 
         ((train['resp_1']<0) & (train['resp_2']<0) & (train['resp_3']<0))).sum()/len(train)
np.testing.assert_allclose(auto, manual)
manual, auto

In [None]:
print('toc', datetime.now(timezone('Canada/Pacific')).isoformat(timespec='minutes') )