# 复赛数据
在给出若干日内来自某成熟国家xx的部分用户的点击购买数据，以及来自某待成熟国家yy和待成熟国家zz的A部分用户的点击购买数据，以及国家yy和zz的B部分用户的截止最后一条购买数据之前的所有点击购买数据，让参赛人预测B部分用户的最后一条购买数据。

商品属性表: 商品的类目id、店铺id以及加密价格，其中价格的加密函数f(x)为一个单调增函数。
训练数据: 给出xx国的用户的点击、购买数据和yy国、zz国的A部分用户的点击、购买数据。
测试数据: 给出yy国、zz国的B部分用户的最后一条购买数据之前的点击购买数据.

训练数据和测试数据的数据结构是一样的，其中各字段含义如下：
country_id: 买家国家id, 只有'xx','yy','zz'三种取值
buyer_admin_id: 买家id
item_id: 商品id
log_time: 商品详情页访问时间
irank: 每个买家对应的所有记录按照时间顺序的逆排序
buy_flag: 当日是否购买

数据集特点：
每个用户有若干条点击数据和至少1条购买数据 （但测试数据中该条购买记录可能未给出到选手）
每个用户的最后一条数据的buy_flag一定为1 （但测试数据中该条数据未给出到选手）
测试数据中每个用户的最后一条点击数据（也是购买数据）所对应的商品一定在训练数据中出现过.
可能存在少量跨国买家.

要求选手提交的数据
关于yy国、zz国的B部分用户每个用户的最后一条购买数据的预测Top30

提交说明：
选手提交的CSV文件, 其格式应如下：
buyer_admin_id,predict 1,predict 2,…,predict 30
其中buyer_admin_id为买家id, predict 1 ,…, predict 30 为预测用户购买商品Top30的item_id依概率从高到低排序，不含表头，例如：

1233434,4354,23432,6546,...,91343

2132133,154,20987,34349,...,78772



评估方法：
MRR(Mean Reciprocal Rank)： 首先对选手提交的表格中的每个用户计算用户得分
$$
score\left(buyer\right) = \sum_{k=1}^{30}\frac{s\left(buyer,k\right)}{k}
$$

其中, 如果选手对该buyer的预测结果predict k命中该buyer的最后一条购买数据则$s\left(buyer,k\right)=1$; 否则$s\left(buyer,k\right)=0$. 最终得分为所有这些buyer的平均值。


In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline
import gc

import warnings
warnings.filterwarnings("ignore")

In [2]:
item = pd.read_csv('../data/Antai_AE/Antai_AE_round2_item_attr_20190813.csv')
train = pd.read_csv('../data/Antai_AE/Antai_AE_round2_train_20190813.csv')
test = pd.read_csv('../data/Antai_AE/Antai_AE_round2_test_20190813.csv')

In [3]:
item.isnull().any()

item_id       False
cate_id       False
store_id      False
item_price    False
dtype: bool

In [4]:
train.head()

Unnamed: 0,country_id,buyer_admin_id,item_id,log_time,irank,buy_flag
0,xx,1,7554,2018-04-19 10:59:56,79,0
1,xx,1,7937,2018-04-19 11:17:45,62,0
2,xx,1,7544,2018-04-18 22:19:30,81,0
3,xx,1,7559,2018-04-18 22:15:27,82,0
4,xx,1,7554,2018-04-19 11:28:34,46,0


In [5]:
test.head()

Unnamed: 0,country_id,buyer_admin_id,item_id,log_time,irank,buy_flag
0,zz,186,5759164,2018-04-16 05:11:47,37,0
1,zz,186,2321601,2018-04-16 04:57:35,48,0
2,zz,186,5244747,2018-04-17 12:24:43,2,0
3,zz,186,2136020,2018-04-17 11:53:41,8,0
4,zz,186,2137602,2018-04-16 04:57:36,47,0


In [6]:
train["log_time"].min(), train["log_time"].max()

('2018-04-16 00:00:00', '2018-04-30 23:59:00')

In [7]:
test["log_time"].min(), test["log_time"].max()

('2018-04-16 00:00:05', '2018-04-30 23:52:06')

In [8]:
train["country_id"].unique(), test["country_id"].unique()

(array(['xx', 'zz', 'yy'], dtype=object), array(['zz', 'yy'], dtype=object))

In [9]:
train["irank"].min(), test["irank"].min()

(1, 2)

In [10]:
train["buyer_admin_id"].nunique(), test["buyer_admin_id"].nunique()

(614960, 9844)

In [11]:
train["irank"].max(), test["irank"].max()

(264473, 39848)

In [12]:
train["buy_flag"].unique(), test["buy_flag"].unique()

(array([0, 1]), array([0, 1]))

In [13]:
train.isnull().any()

country_id        False
buyer_admin_id    False
item_id           False
log_time          False
irank             False
buy_flag          False
dtype: bool

In [14]:
test.isnull().any()

country_id        False
buyer_admin_id    False
item_id           False
log_time          False
irank             False
buy_flag          False
dtype: bool

In [15]:
train.groupby("buyer_admin_id").size().min(), test.groupby("buyer_admin_id").size().min()

(1, 1)

In [16]:
train.groupby("buyer_admin_id").size().max(), test.groupby("buyer_admin_id").size().max()

(264473, 39847)

In [17]:
pd.concat([train, test]).groupby("buyer_admin_id")["country_id"].nunique().value_counts()

1    624804
Name: country_id, dtype: int64

In [18]:
df = pd.concat([train.assign(is_train=1), test.assign(is_train=0)])
del train, test; gc.collect()

df['log_time'] = pd.to_datetime(df['log_time'])
df['date'] = df['log_time'].dt.date
df['day'] = df['log_time'].dt.day
df['hour'] = df['log_time'].dt.hour

df = pd.merge(df, item, how='left', on='item_id')

In [19]:
df.head()

Unnamed: 0,country_id,buyer_admin_id,item_id,log_time,irank,buy_flag,is_train,date,day,hour,cate_id,store_id,item_price
0,xx,1,7554,2018-04-19 10:59:56,79,0,1,2018-04-19,19,10,1467.0,9682.0,2067.0
1,xx,1,7937,2018-04-19 11:17:45,62,0,1,2018-04-19,19,11,1467.0,9541.0,1865.0
2,xx,1,7544,2018-04-18 22:19:30,81,0,1,2018-04-18,18,22,1467.0,9682.0,1604.0
3,xx,1,7559,2018-04-18 22:15:27,82,0,1,2018-04-18,18,22,1467.0,9682.0,2067.0
4,xx,1,7554,2018-04-19 11:28:34,46,0,1,2018-04-19,19,11,1467.0,9682.0,2067.0


In [20]:
memory = df.memory_usage().sum() / 1024**2 
print('Before memory usage of properties dataframe is :', memory, " MB")

dtype_dict = {'buyer_admin_id' : 'int32', 
              'item_id' : 'int32', 
              'store_id' : pd.Int32Dtype(),
              'irank' : 'int16',
              'item_price' : float,
              'cate_id' : pd.Int16Dtype(),
              'is_train' : 'int8',
              'day' : 'int8',
              'hour' : 'int8',
              'country_id': str,
              'date': str
             }

df = df.astype(dtype_dict)
memory = df.memory_usage().sum() / 1024**2 
print('After memory usage of properties dataframe is :', memory, " MB")

Before memory usage of properties dataframe is : 5631.612686157227  MB
After memory usage of properties dataframe is : 3469.475672721863  MB


In [21]:
df.isnull().any()

country_id        False
buyer_admin_id    False
item_id           False
log_time          False
irank             False
buy_flag          False
is_train          False
date              False
day               False
hour              False
cate_id            True
store_id           True
item_price         True
dtype: bool

In [22]:
df[['store_id', 'item_price', 'cate_id']].min()

store_id      1.0
item_price    1.0
cate_id       1.0
dtype: float64

In [23]:
# for col in ['store_id', 'item_price', 'cate_id']:
#     df[col] = df[col].fillna(0).astype(np.int32).replace(0, np.nan)
# df.to_hdf('../data/train_test_round2.h5', key='df', mode='w')
# del df; gc.collect()

In [24]:
# %%time
# df = pd.read_hdf('../data/train_test_round2.h5', key='df')

In [25]:
# del df; gc.collect()

In [26]:
# %%time
# train = pd.read_csv('../data/Antai_AE/Antai_AE_round2_train_20190813.csv')
# test = pd.read_csv('../data/Antai_AE/Antai_AE_round2_test_20190813.csv')
# item = pd.read_csv('../data/Antai_AE/Antai_AE_round2_item_attr_20190813.csv')
# del train, test; gc.collect()

# data content

In [27]:
df.head()

Unnamed: 0,country_id,buyer_admin_id,item_id,log_time,irank,buy_flag,is_train,date,day,hour,cate_id,store_id,item_price
0,xx,1,7554,2018-04-19 10:59:56,79,0,1,2018-04-19,19,10,1467,9682,2067.0
1,xx,1,7937,2018-04-19 11:17:45,62,0,1,2018-04-19,19,11,1467,9541,1865.0
2,xx,1,7544,2018-04-18 22:19:30,81,0,1,2018-04-18,18,22,1467,9682,1604.0
3,xx,1,7559,2018-04-18 22:15:27,82,0,1,2018-04-18,18,22,1467,9682,2067.0
4,xx,1,7554,2018-04-19 11:28:34,46,0,1,2018-04-19,19,11,1467,9682,2067.0


In [28]:
df.isnull().sum()

country_id            0
buyer_admin_id        0
item_id               0
log_time              0
irank                 0
buy_flag              0
is_train              0
date                  0
day                   0
hour                  0
cate_id           95958
store_id          95958
item_price        95958
dtype: int64

In [29]:
df.describe()

Unnamed: 0,buyer_admin_id,item_id,irank,buy_flag,is_train,day,hour,cate_id,store_id,item_price
count,52724770.0,52724770.0,52724770.0,52724770.0,52724770.0,52724770.0,52724770.0,52628809.0,52628809.0,52628810.0
mean,311421.9,4333457.0,255.7213,0.1631748,0.983806,21.64609,9.267078,1681.968033,55937.194974,2718.526
std,184476.1,2532043.0,3213.24,0.369525,0.1262211,3.961407,6.221378,1058.869124,31055.458137,6205.391
min,1.0,1.0,-32768.0,0.0,0.0,16.0,0.0,1.0,1.0,1.0
25%,146739.0,2166051.0,22.0,0.0,1.0,18.0,4.0,744.0,30288.0,295.0
50%,312889.0,4211300.0,56.0,0.0,1.0,21.0,9.0,1769.0,57261.0,877.0
75%,474097.0,6461643.0,142.0,0.0,1.0,25.0,13.0,2307.0,81254.0,2301.0
max,626645.0,9167200.0,32767.0,1.0,1.0,30.0,23.0,4793.0,123617.0,50806.0


In [30]:
df["irank"].min()

-32768

In [31]:
df["irank"].sort_values()

8803892   -32768
5657204   -32768
3618399   -32768
354146    -32768
1387808   -32768
           ...  
8755071    32767
3294630    32767
5039742    32767
4969069    32767
3618400    32767
Name: irank, Length: 52724767, dtype: int16

In [32]:
train = df['is_train']==1
test = df['is_train']==0

In [33]:
train_count = len(df[train])
print('训练集样本量是', train_count)
test_count = len(df[test])
print('测试集样本量是', test_count)
print('样本比例为：', train_count/test_count)

训练集样本量是 51870942
测试集样本量是 853825
样本比例为： 60.75125699060112


In [34]:
def groupby_cnt_ratio(df, col):
    if isinstance(col, str):
        col = [col]
    key = ['is_train', 'country_id'] + col
    
    cnt_stat = df.groupby(key).size().to_frame('count')
    ratio_stat = (cnt_stat / cnt_stat.groupby(['is_train', 'country_id']).sum()).rename(columns={'count':'count_ratio'})
    return pd.merge(cnt_stat, ratio_stat, on=key, how='outer').sort_values(by=['count'], ascending=False)

In [35]:
groupby_cnt_ratio(df, [])

Unnamed: 0_level_0,Unnamed: 1_level_0,count,count_ratio
is_train,country_id,Unnamed: 2_level_1,Unnamed: 3_level_1
1,xx,42046596,1.0
1,yy,5241393,1.0
1,zz,4582953,1.0
0,yy,429319,1.0
0,zz,424506,1.0


In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x='is_train', data=df, palette=['red', 'blue', 'green'], hue='country_id', order=[1, 0])
plt.xticks(np.arange(2), ('train', 'test'))
plt.xlabel('data file')
plt.title('cntry no.');

In [None]:
print('训练集中用户数量',len(df[train]['buyer_admin_id'].unique()))
print('测试集中用户数量',len(df[test]['buyer_admin_id'].unique()))

In [None]:
union = list(set(df[train]['buyer_admin_id'].unique()).intersection(set(df[test]['buyer_admin_id'].unique())))
print('同时在训练集和测试集出现的用户，id如下：', union)

In [None]:
admin_cnt = groupby_cnt_ratio(df, 'buyer_admin_id')
admin_cnt.groupby(['is_train','country_id']).head(3)

In [None]:
admin_cnt.groupby(['is_train','country_id'])['count'].agg(['max','min','median'])

In [None]:
fig, ax = plt.subplots(1, 3 ,figsize=(16,6))
ax[0].set(xlabel='buyer records')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].reset_index().query("is_train==1 and country_id=='xx'")['count'].values, ax=ax[0])\
    .set_title('train - xx cntry buyer records')

ax[1].set(xlabel='buyer records')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].reset_index().query("is_train==1 and country_id=='yy'")['count'].values, ax=ax[1])\
    .set_title('yy cntry buyer records')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].reset_index().query("is_train==0 and country_id=='yy'")['count'].values, ax=ax[1])
ax[1].legend(labels=['train', 'test'], loc="upper right")

ax[2].set(xlabel='buyer records')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].reset_index().query("is_train==1 and country_id=='zz'")['count'].values, ax=ax[2])\
    .set_title('zz cntry buyer records')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].reset_index().query("is_train==0 and country_id=='zz'")['count'].values, ax=ax[2])
ax[2].legend(labels=['train', 'test'], loc="upper right")

In [None]:
admin_cnt.columns = ['buyer_click_records_count', 'buyer_click_records_count_ratio']
admin_user_cnt = groupby_cnt_ratio(admin_cnt, 'buyer_click_records_count')
admin_user_cnt.columns = ['same_click_records_buyers_count', 'same_click_records_buyers_count_ratio']
admin_user_cnt.head()

In [None]:
# xx cnt 
admin_user_cnt.reset_index().set_index("buyer_click_records_count")\
.query("is_train==1 and country_id=='xx'")[['same_click_records_buyers_count','same_click_records_buyers_count_ratio']].T

In [None]:
# yy cnt
admin_user_cnt.loc[([1,0],'yy',slice(None))].unstack(0).head(10)

In [None]:
# zz cnt
admin_user_cnt.loc[([1,0],'zz',slice(None))].unstack(0).head(10)

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(25,25))
admin_plot = admin_user_cnt.reset_index()
sax = sns.barplot(x='same_click_records_buyers_count', y='same_click_records_buyers_count_ratio', \
            data=admin_plot[(admin_plot['buyer_click_records_count']<100) & (admin_plot['country_id']=='xx')], 
            estimator=np.mean, ax=ax[0])
_ = sax.set_title('train - xx cntry same_click_records_buyers_count and same_click_records_buyers_count_ratio')
sax.set_xticklabels(sax.get_xticklabels(), rotation=45)


sax = sns.barplot(x='same_click_records_buyers_count', y='same_click_records_buyers_count_ratio', hue='is_train', \
            data=admin_plot[(admin_plot['buyer_click_records_count']<100) & (admin_plot['country_id']=='yy')], 
            estimator=np.mean, ax=ax[1])
_ = sax.set_title('yy cntry same_click_records_buyers_count and same_click_records_buyers_count_ratio')
_ = sax.set_xticklabels(sax.get_xticklabels(), rotation=90)

sax = sns.barplot(x='same_click_records_buyers_count', y='same_click_records_buyers_count_ratio', hue='is_train', \
            data=admin_plot[(admin_plot['buyer_click_records_count']<100) & (admin_plot['country_id']=='zz')], 
            estimator=np.mean, ax=ax[2])
_ = sax.set_title('zz cntry same_click_records_buyers_count and same_click_records_buyers_count_ratio')
_ = sax.set_xticklabels(sax.get_xticklabels(), rotation=90)

In [None]:
print('商品表中商品数：',len(item['item_id'].unique()))
print('训练集中商品数：',len(df[train]['item_id'].unique()))
print('测试集中商品数：',len(df[test]['item_id'].unique()))
print('仅训练集有的商品数：',len(list(set(df[train]['item_id'].unique()).difference(set(df[test]['item_id'].unique())))))
print('仅测试集有的商品数：',len(list(set(df[test]['item_id'].unique()).difference(set(df[train]['item_id'].unique())))))
print('训练集测试集共同商品数：',len(list(set(df[train]['item_id'].unique()).intersection(set(df[test]['item_id'].unique())))))
print('训练集中不在商品表的商品数：',len(list(set(df[train]['item_id'].unique()).difference(set(item['item_id'].unique())))))
print('测试集中不在商品表的商品数：',len(list(set(df[test]['item_id'].unique()).difference(set(item['item_id'].unique())))))

In [None]:
item_cnt = groupby_cnt_ratio(df.query("buy_flag==1"), 'item_id')
item_cnt.columns=['sales', 'sales_ratio']
item_cnt.reset_index(inplace=True)
item_cnt

In [None]:
top_item_plot = item_cnt.groupby(['is_train','country_id']).head(10)
top_item_plot

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(16,16))
sns.barplot(x='item_id', y='sales', data=top_item_plot[top_item_plot['country_id']=='xx'], 
            order=top_item_plot['item_id'][top_item_plot['country_id']=='xx'], ax=ax[0], estimator=np.mean)\
    .set_title('xx cntry - top sales')

sns.barplot(x='item_id', y='sales', hue='is_train', data=top_item_plot[top_item_plot['country_id']=='yy'], 
            order=top_item_plot['item_id'][top_item_plot['country_id']=='yy'], ax=ax[1], estimator=np.mean)\
    .set_title('yy cntry - top sales')
_ = plt.xticks(rotation=45)

sns.barplot(x='item_id', y='sales', hue='is_train', data=top_item_plot[top_item_plot['country_id']=='zz'], 
            order=top_item_plot['item_id'][top_item_plot['country_id']=='zz'], ax=ax[2], estimator=np.mean)\
    .set_title('zz cntry - top sales')
_ = plt.xticks(rotation=45)

In [None]:
item_order_cnt = groupby_cnt_ratio(item_cnt, 'sales')
item_order_cnt.columns = ['sales_count', 'sales_count_ratio']
item_order_cnt

In [None]:
item_order_cnt.groupby(['is_train','country_id']).head(5).sort_values(by=['country_id','is_train'])

In [None]:
item_order_plot = item_order_cnt.reset_index()

xx_item_order_plot = item_order_plot[item_order_plot['country_id']=='xx']

yy_item_order_plot = item_order_plot[item_order_plot['country_id']=='yy']
yy_item_order_plot_1 = yy_item_order_plot[yy_item_order_plot['is_train']==1]
yy_item_order_plot_0 = yy_item_order_plot[yy_item_order_plot['is_train']==0]

zz_item_order_plot = item_order_plot[item_order_plot['country_id']=='zz']
zz_item_order_plot_1 = zz_item_order_plot[zz_item_order_plot['is_train']==1]
zz_item_order_plot_0 = zz_item_order_plot[zz_item_order_plot['is_train']==0]

In [None]:
def text_style_func(pct, allvals):
    absolute = int(round(pct/100.*np.sum(allvals)))
    return "{:.1f}%({:d})".format(pct, absolute)

def pie_param(ax, df, color_palette):
    return ax.pie(df['sales_count_ratio'].values, \
                  autopct=lambda pct: text_style_func(pct, df['sales_count']), \
                  labels=df['sales'], explode=[0.1]+np.zeros(len(df)-1).tolist(), pctdistance=0.7, \
                  colors=sns.color_palette(color_palette, 8))

fig, ax = plt.subplots(2, 3, figsize=(25,12))

ax[0,0].set(xlabel='xx cntry - sales_count')
ax[0,0].set(ylabel='xx cntry - sales_count_ratio')
pie_param(ax[0,0], xx_item_order_plot, "coolwarm")

ax[0,1].set(xlabel='yy cntry - train sales_count')
pie_param(ax[0,1], yy_item_order_plot_1, "Set3")

ax[0,2].set(xlabel='yy cntry - test sales_count')
_ = pie_param(ax[0,2], yy_item_order_plot_0, "Set3")

ax[1,1].set(xlabel='zz cntry - train sales_count')
pie_param(ax[1,1], zz_item_order_plot_1, "Set3")

ax[1,2].set(xlabel='zz cntry - test sales_count')
_ = pie_param(ax[1,2], zz_item_order_plot_0, "Set3")

In [None]:
print(xx_item_order_plot.head(20)['sales_count_ratio'].sum())
print(yy_item_order_plot_1.head(20)['sales_count_ratio'].sum())
print(yy_item_order_plot_0.head(20)['sales_count_ratio'].sum())
print(zz_item_order_plot_1.head(20)['sales_count_ratio'].sum())
print(zz_item_order_plot_0.head(20)['sales_count_ratio'].sum())

In [None]:
print('商品品类数', len(item['cate_id'].unique()))
print('训练集商品品类数', len(df[train]['cate_id'].unique()))
print('测试集商品品类数', len(df[test]['cate_id'].unique()))

In [None]:
cate_cnt = item.groupby(['cate_id']).size().to_frame('count').reset_index()
cate_cnt.sort_values(by=['count'], ascending=False).head(5)

In [None]:
plt.figure(figsize=(12,4))
sns.kdeplot(data=cate_cnt[cate_cnt['count']<1000]['count']);

In [None]:
print('商品店铺数', len(item['store_id'].unique()))
print('训练集店铺数', len(df[train]['store_id'].unique()))
print('测试集店铺数', len(df[train]['store_id'].unique()))

In [None]:
store_cate_cnt = item.groupby(['store_id'])['cate_id'].nunique().to_frame('count').reset_index()
store_cate_cnt.sort_values(by=['count'], ascending=False).head(5)

In [None]:
store_cnt_cate_cnt = store_cate_cnt.groupby(['count']).size().reset_index()
store_cnt_cate_cnt.columns = ['store_cate_count', 'same_store_cate_count_store_count']

In [None]:
store_cnt_cate_cnt.head()

In [None]:
plt.figure(figsize=(12,4))
sns.barplot(x='store_cate_count', y='same_store_cate_count_store_count', \
            data=store_cnt_cate_cnt[store_cnt_cate_cnt['store_cate_count']<50], estimator=np.mean);

In [None]:
store_item_cnt = item.groupby(['store_id'])['item_id'].nunique().to_frame('count').reset_index()
store_item_cnt.sort_values(by=['count'], ascending=False).head(5)

In [None]:
store_cnt_item_cnt = store_item_cnt.groupby(['count']).size().reset_index()
store_cnt_item_cnt.columns = ['store_item_count', 'same_store_item_count_store_count']

In [None]:
store_cnt_item_cnt.T

In [None]:
plt.figure(figsize=(16,4))
sns.barplot(x='store_item_count', y='same_store_item_count_store_count', \
            data=store_cnt_item_cnt[store_cnt_item_cnt['store_item_count']<80], estimator=np.mean);

In [None]:
print(item['item_price'].max(), item['item_price'].min(), item['item_price'].mean(), item['item_price'].median())

In [None]:
plt.figure(figsize=(16,4))
plt.subplot(121)
sns.kdeplot(item['item_price'])
plt.subplot(122)
sns.kdeplot(item['item_price'][item['item_price']<5000]);

In [None]:
price_cnt = item.groupby(['item_price']).size().to_frame('count').reset_index()
price_cnt.sort_values(by=['count'], ascending=False).head(10)

In [None]:
print(df[train].query("buy_flag==1")['item_price'].max(), df[train].query("buy_flag==1")['item_price'].min(), \
      df[train].query("buy_flag==1")['item_price'].mean(), df[train].query("buy_flag==1")['item_price'].median())
print(df[test].query("buy_flag==1")['item_price'].max(), df[test].query("buy_flag==1")['item_price'].min(), \
      df[test].query("buy_flag==1")['item_price'].mean(), df[test].query("buy_flag==1")['item_price'].median())

In [None]:
plt.figure(figsize=(12,4))
sns.kdeplot(df[train].query("buy_flag==1")[df[train]['item_price']<2000][['item_id','item_price']].drop_duplicates()['item_price'])
sns.kdeplot(df[test].query("buy_flag==1")[df[test]['item_price']<2000][['item_id','item_price']].drop_duplicates()['item_price'])
_ = plt.legend(["train","test"])

In [None]:
df[train].query("buy_flag==1").groupby(['item_price'])['item_id'].nunique().to_frame('same_price_item_count').head()

In [None]:
price_cnt = groupby_cnt_ratio(df.query("buy_flag==1"), 'item_price')
price_cnt.groupby(['is_train', 'country_id']).head(5)

In [None]:
print(df[train]['log_time'].min(), df[train]['log_time'].max())
print(df[test]['log_time'].min(), df[test]['log_time'].max())

In [None]:
date_cnt = groupby_cnt_ratio(df.query("buy_flag==1"), 'date')
date_cnt.columns = ['date_sales', "date_sales_ratio"]
date_cnt = date_cnt.reset_index().sort_values(by="date")
date_cnt

In [None]:
fig, ax = plt.subplots(3, 1, figsize=(25,30))
sax = sns.lineplot(x='date', y='date_sales', hue='country_id', data=date_cnt[(date_cnt['is_train']==1)], 
            estimator=np.mean, ax=ax[0])
_ = sax.set_title('train cntry - date sales')
sax.set_xticklabels(sax.get_xticklabels(), rotation=90)

sax = sns.lineplot(x='date', y='date_sales', hue='is_train', data=date_cnt[(date_cnt['country_id']=='yy')], 
            estimator=np.mean, ax=ax[1])
sax.set_title('yy cntry date sales')
_ = sax.set_xticklabels(sax.get_xticklabels(), rotation=90)

sax = sns.lineplot(x='date', y='date_sales', hue='is_train', data=date_cnt[(date_cnt['country_id']=='zz')], 
            estimator=np.mean, ax=ax[2])
sax.set_title('zz cntry date sales')
_ = sax.set_xticklabels(sax.get_xticklabels(), rotation=90)

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(20,16))
def barplot(ax, df, title):
    df['date'] = df['date'].astype(str)
    sns.barplot(y='date', x='date_sales', data=df, order=sorted(df['date'].unique()), ax=ax, estimator=np.mean)\
    .set_title(title)
    
barplot(ax[0][0], seven[(seven['is_train']==1) & (seven['buyer_country_id']=='xx')], 'xx cntry 7 month date sales')
barplot(ax[1][0], eight[(eight['is_train']==1) & (eight['buyer_country_id']=='xx')], 'xx cntry 8 month date sales')
barplot(ax[0][1], seven[(seven['is_train']==1) & (seven['buyer_country_id']=='yy')], 'train - yy cntry 7 month date sales')
barplot(ax[1][1], eight[(eight['is_train']==1) & (eight['buyer_country_id']=='yy')], 'train - yy cntry 8 month date sales')
barplot(ax[0][2], seven[(seven['is_train']==0) & (seven['buyer_country_id']=='yy')], 'test - yy cntry 7 month date sales')
barplot(ax[1][2], eight[(eight['is_train']==0) & (eight['buyer_country_id']=='yy')], 'test - yy cntry 8 month date sales')
plt.tight_layout()