# Easy understanding for beginner

This competition is hosted by OTTO, a major German online retailer. The task is to predict three user actions: clicks (product views), cart additions, and orders.


# 1.Data Load

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
import gc
pd.set_option("display.max_columns",None)
data_path = Path('/content/drive/MyDrive/Colab Notebooks/kaggle/OTTO/dataset')

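The training data is huge: 12,899,778 sessions (216,716,095 rows when counted per action, such as a click or a cart addition), so we take only the first 100,000 sessions.

In this competition, it helps to think of a "session" as roughly equivalent to a "user".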

In [2]:
%%time
sample_size = 100_000
chunks = pd.read_json(data_path / 'original' / 'train.jsonl',lines=True,chunksize=sample_size)
for chunk in chunks:
    train_df = chunk
    break
train_df.set_index('session',drop=True,inplace=True)
train_df

CPU times: user 5.7 s, sys: 2.72 s, total: 8.43 s
Wall time: 12.1 s


session,events
0,"[{'aid': 1517085, 'ts': 1659304800025, 'type':..."
1,"[{'aid': 424964, 'ts': 1659304800025, 'type': ..."
2,"[{'aid': 763743, 'ts': 1659304800038, 'type': ..."
3,"[{'aid': 1425967, 'ts': 1659304800095, 'type':..."
4,"[{'aid': 613619, 'ts': 1659304800119, 'type': ..."
...,...
99995,"[{'aid': 1387489, 'ts': 1659326711310, 'type':..."
99996,"[{'aid': 1091948, 'ts': 1659326711396, 'type':..."
99997,"[{'aid': 366639, 'ts': 1659326711431, 'type': ..."
99998,"[{'aid': 845181, 'ts': 1659326711611, 'type': ..."


# 2.Looking at sample sessions

As an example, let's look at session 4.

In [3]:
# Pattern: one item ordered, the rest only added to the cart
train_df.iloc[4,0]

[{'aid': 613619, 'ts': 1659304800119, 'type': 'clicks'},
 {'aid': 298827, 'ts': 1659304836708, 'type': 'clicks'},
 {'aid': 298827, 'ts': 1659304900468, 'type': 'orders'},
 {'aid': 383828, 'ts': 1661161611985, 'type': 'clicks'},
 {'aid': 255379, 'ts': 1661161636464, 'type': 'clicks'},
 {'aid': 1838173, 'ts': 1661161670830, 'type': 'clicks'},
 {'aid': 1453726, 'ts': 1661161695814, 'type': 'clicks'},
 {'aid': 1838173, 'ts': 1661161708717, 'type': 'clicks'},
 {'aid': 255379, 'ts': 1661161751223, 'type': 'clicks'},
 {'aid': 383828, 'ts': 1661161753524, 'type': 'clicks'},
 {'aid': 1554752, 'ts': 1661504170116, 'type': 'clicks'},
 {'aid': 1554752, 'ts': 1661504180466, 'type': 'carts'},
 {'aid': 917213, 'ts': 1661504212575, 'type': 'clicks'},
 {'aid': 917213, 'ts': 1661504216807, 'type': 'carts'},
 {'aid': 758750, 'ts': 1661504488403, 'type': 'clicks'},
 {'aid': 758750, 'ts': 1661504510200, 'type': 'carts'},
 {'aid': 678521, 'ts': 1661586368496, 'type': 'clicks'},
 {'aid': 1081407, 'ts': 16615

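This user clicked 15 products, ordered one, and left three in the cart without ordering them.

Each event carries a timestamp, so let's look at when each action happened.

The first ordered product (aid = article_id 298827) was clicked at 7:00:36 on 2022/8/1 Japan time and ordered about one minute later, at 7:01:40. (Deciding to buy within a minute: they must have really liked it ^^)

The product with aid 1554752 was clicked at 17:56:10 on 8/26 and added to the cart 10 seconds later. This looks like "I'll decide later whether to buy, but let me put it in the cart for now."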
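
The times quoted above can be reproduced from the raw `ts` values. The following is a minimal sketch using only the standard library; the helper name `ms_to_jst` is made up for illustration, and Japan time is modeled as a fixed UTC+9 offset.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time as a fixed UTC+9 offset (no daylight saving time)
JST = timezone(timedelta(hours=9))

def ms_to_jst(ts_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a JST datetime (hypothetical helper)."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=JST)

click_ts = 1659304836708   # click on aid 298827 in session 4
order_ts = 1659304900468   # order of aid 298827 in session 4

print(ms_to_jst(click_ts).strftime("%Y-%m-%d %H:%M:%S"))  # 2022-08-01 07:00:36
print(ms_to_jst(order_ts).strftime("%Y-%m-%d %H:%M:%S"))  # 2022-08-01 07:01:40
print((order_ts - click_ts) / 1000, "seconds between click and order")
```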

In [4]:
# Pattern: add to the cart first, decide later
train_df.iloc[1,0]

[{'aid': 424964, 'ts': 1659304800025, 'type': 'carts'},
 {'aid': 1492293, 'ts': 1659304852871, 'type': 'clicks'},
 {'aid': 1492293, 'ts': 1659304863627, 'type': 'carts'},
 {'aid': 910862, 'ts': 1659304891923, 'type': 'clicks'},
 {'aid': 910862, 'ts': 1659304900209, 'type': 'carts'},
 {'aid': 1491172, 'ts': 1659385939248, 'type': 'clicks'},
 {'aid': 1491172, 'ts': 1659385945915, 'type': 'carts'},
 {'aid': 424964, 'ts': 1659385993848, 'type': 'clicks'},
 {'aid': 1515526, 'ts': 1659386025990, 'type': 'clicks'},
 {'aid': 440486, 'ts': 1659473014870, 'type': 'clicks'},
 {'aid': 109488, 'ts': 1659473065576, 'type': 'clicks'},
 {'aid': 1507622, 'ts': 1659473076244, 'type': 'clicks'},
 {'aid': 1734061, 'ts': 1659855882096, 'type': 'clicks'},
 {'aid': 854637, 'ts': 1659990929876, 'type': 'clicks'},
 {'aid': 854637, 'ts': 1659990941327, 'type': 'carts'},
 {'aid': 718983, 'ts': 1659990943793, 'type': 'clicks'},
 {'aid': 215311, 'ts': 1659990959575, 'type': 'clicks'},
 {'aid': 215311, 'ts': 165999

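# 3.Create a model
Based on the data structure we looked at above, let's build a model for submission.

For the "orders" prediction, we recommend the items the session added to the cart; for the "carts" prediction, we recommend the items the session clicked. Up to 20 items can be recommended per prediction, so the remaining slots are filled with the best-selling items.

For the "clicks" prediction, we also recommend the best-selling items.
This simple model scores 0.351.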
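
The "fill up to 20 with best sellers" rule described above can be sketched as a small function. This is only an illustration: the name `build_labels` and the deduplication step are assumptions, not part of the notebook, which concatenates the id strings directly and can therefore emit repeated ids or more than 20 ids per row.

```python
from typing import Iterable, List

def build_labels(session_aids: Iterable[int],
                 popular_aids: Iterable[int],
                 k: int = 20) -> List[int]:
    """Session items first, then popular items, without duplicates, capped at k.

    Hypothetical helper for illustration only.
    """
    labels: List[int] = []
    seen = set()
    for aid in list(session_aids) + list(popular_aids):
        if len(labels) >= k:
            break
        if aid not in seen:
            seen.add(aid)
            labels.append(aid)
    return labels

# Session 12899780 clicked 1142000 twice; the repeat is dropped before filling
clicked = [1142000, 582732, 973453, 736515, 1142000]
best_sellers = [80222, 1022566, 166037]
print(" ".join(map(str, build_labels(clicked, best_sellers))))
```

Keeping the session's own items ahead of the best sellers mirrors the order used in the notebook's string concatenation.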

The DataFrame we have looked at so far is the raw JSON data loaded as-is, with one row per session. To make it easier to handle in Pandas, we convert it to a DataFrame with one row per action.

In [5]:
del train_df
gc.collect()

0

In [6]:
%%time
train_df = pd.DataFrame()
chunks = pd.read_json(data_path / 'original' / 'train.jsonl',lines=True,chunksize=100_000)
for chunk in chunks:
    event_dict = {'session' : [], 'aid' : [], 'ts' : [], 'type' : []}
    for session,events in zip(chunk['session'].tolist(),chunk['events'].tolist()):
        for event in events:
            event_dict['session'].append(session)
            event_dict['aid'].append(event['aid'])
            event_dict['ts'].append(event['ts'])
            event_dict['type'].append(event['type'])
    train_df = pd.DataFrame(event_dict)
    break
train_df = train_df.reset_index(drop=True)
train_df

CPU times: user 13.5 s, sys: 2.28 s, total: 15.8 s
Wall time: 15.6 s


Unnamed: 0,session,aid,ts,type
0,0,1517085,1659304800025,clicks
1,0,1563459,1659304904511,clicks
2,0,1309446,1659367439426,clicks
3,0,16246,1659367719997,clicks
4,0,1781822,1659367871344,clicks
...,...,...,...,...
5227648,99999,1544954,1660373630318,clicks
5227649,99999,1032408,1660373656430,clicks
5227650,99999,1544954,1660373678083,clicks
5227651,99999,554230,1660373715477,clicks
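
The same session-to-event flattening could also be sketched with pandas' own `explode` and `json_normalize`, shown here on a tiny hand-made frame rather than the competition file. The toy data and the variable names `sessions`, `exploded`, and `events` are assumptions for illustration; whether this is faster than the explicit loop on the full data has not been measured here.

```python
import pandas as pd

# Toy stand-in for one chunk of train.jsonl: one row per session,
# with a list of event dicts in the 'events' column
sessions = pd.DataFrame({
    "session": [0, 1],
    "events": [
        [{"aid": 1, "ts": 1000, "type": "clicks"},
         {"aid": 2, "ts": 2000, "type": "carts"}],
        [{"aid": 3, "ts": 1500, "type": "clicks"}],
    ],
})

# One row per event, still holding the event as a dict
exploded = sessions.explode("events", ignore_index=True)

# Expand each dict into aid / ts / type columns and attach the session id
events = pd.concat(
    [exploded[["session"]], pd.json_normalize(exploded["events"].tolist())],
    axis=1,
)
print(events)
```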


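To see how many minutes were spent on each action, we add a minutes column and set it to the time difference to the next row within the same session.

The Unix timestamps used in this competition have 13 digits, so they are in milliseconds.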

In [7]:
train_df['minutes'] = train_df[['session','ts']].groupby('session').diff(-1)*(-1/1000/60)
train_df

Unnamed: 0,session,aid,ts,type,minutes
0,0,1517085,1659304800025,clicks,1.741433
1,0,1563459,1659304904511,clicks,1042.248583
2,0,1309446,1659367439426,clicks,4.676183
3,0,16246,1659367719997,clicks,2.522450
4,0,1781822,1659367871344,clicks,0.240867
...,...,...,...,...,...
5227648,99999,1544954,1660373630318,clicks,0.435200
5227649,99999,1032408,1660373656430,clicks,0.360883
5227650,99999,1544954,1660373678083,clicks,0.623233
5227651,99999,554230,1660373715477,clicks,0.172533
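
To make the per-session time difference easier to follow, here is a toy sketch with made-up timestamps. It uses `shift(-1)` instead of `diff(-1)`, which should give the same values; the frame `toy` and the column `next_ts` are assumptions for illustration. Note that the last event of each session has no following event, so its value is NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "session": [0, 0, 0, 1, 1],
    "ts": [0, 60_000, 180_000, 1_000, 31_000],  # milliseconds
})

# Timestamp of the next event within the same session (NaN for the last event)
next_ts = toy.groupby("session")["ts"].shift(-1)

# Milliseconds -> seconds -> minutes
toy["minutes"] = (next_ts - toy["ts"]) / 1000 / 60
print(toy)
```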


Next, we build the list of best-selling items.

In [8]:
temp = train_df.groupby(['type','aid'])['session'].agg('count').reset_index()
temp.columns = ['type','aid','count']
order_num_df = temp.loc[(temp['type'] == 'orders'),]
order_num_df = order_num_df.sort_values(['count'],ascending=False).reset_index()
order_num_df

Unnamed: 0,index,type,aid,count
0,812656,orders,80222,127
1,840266,orders,1022566,106
2,815207,orders,166037,94
3,858211,orders,1629608,76
4,861353,orders,1733943,75
...,...,...,...,...
54751,831592,orders,727990,1
54752,831593,orders,728010,1
54753,831594,orders,728086,1
54754,831595,orders,728208,1


In [9]:
order_num_df.aid = ' ' + order_num_df.aid.astype('str')
best_sold_list = order_num_df[:20].aid.sum()
best_sold_list

' 80222 1022566 166037 1629608 1733943 332654 351335 923948 1603001 544144 1083665 832192 29735 231487 563117 247240 673407 125278 800391 527209'

# Load the test data

In [10]:
test_df = pd.DataFrame()
chunks = pd.read_json(data_path / 'original' / 'test.jsonl',lines=True,chunksize=100_000)
for chunk in chunks:
    event_dict = {'session': [],'aid': [],'ts': [],'type': []}
    for session, events in zip(chunk['session'].tolist(), chunk['events'].tolist()):
        for event in events:
            event_dict['session'].append(session)
            event_dict['aid'].append(event['aid'])
            event_dict['ts'].append(event['ts'])
            event_dict['type'].append(event['type'])
    chunk_session = pd.DataFrame(event_dict)
    test_df = pd.concat([test_df, chunk_session])         
test_df = test_df.reset_index(drop=True)
test_df

Unnamed: 0,session,aid,ts,type
0,12899779,59625,1661724000278,clicks
1,12899780,1142000,1661724000378,clicks
2,12899780,582732,1661724058352,clicks
3,12899780,973453,1661724109199,clicks
4,12899780,736515,1661724136868,clicks
...,...,...,...,...
6928118,14571577,1141710,1662328774770,clicks
6928119,14571578,519105,1662328775009,clicks
6928120,14571579,739876,1662328775605,clicks
6928121,14571580,202353,1662328781067,clicks


In [11]:
test_df["date_time"] = pd.to_datetime(test_df.ts, unit="ms")
test_df

Unnamed: 0,session,aid,ts,type,date_time
0,12899779,59625,1661724000278,clicks,2022-08-28 22:00:00.278
1,12899780,1142000,1661724000378,clicks,2022-08-28 22:00:00.378
2,12899780,582732,1661724058352,clicks,2022-08-28 22:00:58.352
3,12899780,973453,1661724109199,clicks,2022-08-28 22:01:49.199
4,12899780,736515,1661724136868,clicks,2022-08-28 22:02:16.868
...,...,...,...,...,...
6928118,14571577,1141710,1662328774770,clicks,2022-09-04 21:59:34.770
6928119,14571578,519105,1662328775009,clicks,2022-09-04 21:59:35.009
6928120,14571579,739876,1662328775605,clicks,2022-09-04 21:59:35.605
6928121,14571580,202353,1662328781067,clicks,2022-09-04 21:59:41.067


In [None]:
test_df.info()

In [24]:
new = test_df.groupby('session')['date_time'].max() - test_df.groupby('session')['date_time'].min()
new

session
12899779          0 days 00:00:00
12899780   0 days 00:02:34.870000
12899781   3 days 21:22:39.847000
12899782   0 days 22:12:32.610000
12899783   3 days 16:05:39.826000
                    ...          
14571577          0 days 00:00:00
14571578          0 days 00:00:00
14571579          0 days 00:00:00
14571580          0 days 00:00:00
14571581          0 days 00:00:00
Name: date_time, Length: 1671803, dtype: timedelta64[ns]

In [25]:
new.max()

Timedelta('6 days 23:37:57.899000')

In [20]:
test_df[test_df['session']==12899780]

Unnamed: 0,session,aid,ts,type,date_time
1,12899780,1142000,1661724000378,clicks,2022-08-28 22:00:00.378
2,12899780,582732,1661724058352,clicks,2022-08-28 22:00:58.352
3,12899780,973453,1661724109199,clicks,2022-08-28 22:01:49.199
4,12899780,736515,1661724136868,clicks,2022-08-28 22:02:16.868
5,12899780,1142000,1661724155248,clicks,2022-08-28 22:02:35.248


Add a minutes column to the test data.

Sorting the events by time spent, longest first, slightly improved the score.

In [None]:
%%time
test_df['minutes'] = test_df[['session','ts']].groupby('session').diff(-1)*(-1/1000/60)
test_df = test_df.sort_values(['minutes'],ascending=False)
test_action_df = test_df.copy()
test_action_df.aid = ' ' + test_df.aid.astype('str')
test_action_df = test_action_df.groupby(['session','type'])['aid'].sum().reset_index()
test_action_df

CPU times: user 8min, sys: 24.4 s, total: 8min 24s
Wall time: 7min 58s


Unnamed: 0,session,type,aid
0,12899779,clicks,59625
1,12899780,clicks,1142000 582732 973453 736515 1142000
2,12899781,carts,199008
3,12899781,clicks,199008 194067 199008 199008 199008 199008 573...
4,12899782,carts,1494780 834354 975116 127404 413962 595994 13...
...,...,...,...
1948868,14571577,clicks,1141710
1948869,14571578,clicks,519105
1948870,14571579,clicks,739876
1948871,14571580,clicks,202353
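
The cell above took about 8 minutes of wall time. One possible reason is that `sum()` on string columns concatenates Python strings group by group. The sketch below shows the same sort-then-aggregate idea on toy data using `' '.join` as the aggregation. It is an assumption that this would be faster on the full test set; it has not been benchmarked here. Unlike the notebook's version, the joined string has no leading space, so later concatenation with `best_sold_list` would still work but other code relying on the leading space may need adjusting.

```python
import pandas as pd

toy = pd.DataFrame({
    "session": [1, 1, 1, 2],
    "type": ["clicks", "clicks", "carts", "clicks"],
    "aid": [10, 11, 12, 20],
    "minutes": [0.5, 3.0, None, None],
})

# Longest time first; rows without a following event (NaN) go last
toy = toy.sort_values("minutes", ascending=False, kind="stable")

# Join the aids of each (session, type) group in the sorted order
toy["aid"] = toy["aid"].astype(str)
toy_action = toy.groupby(["session", "type"])["aid"].agg(" ".join).reset_index()
print(toy_action)
```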


In [None]:
next_orders_df = pd.DataFrame(test_action_df.loc[(test_action_df['type'] == 'carts')])
next_orders_df['type'] = 'orders'
next_orders_df

Unnamed: 0,session,type,aid
2,12899781,orders,199008
4,12899782,orders,1494780 834354 975116 127404 413962 595994 13...
10,12899786,orders,955252
12,12899787,orders,1682750 1682750 1682750
16,12899790,orders,1830166 1219653
...,...,...,...
1948716,14571430,orders,903014
1948730,14571443,orders,942326
1948774,14571486,orders,350578
1948788,14571499,orders,1132907


In [None]:
next_carts_df = pd.DataFrame(test_action_df.loc[(test_action_df['type'] == 'clicks'),])
next_carts_df['type'] = 'carts'
next_carts_df

Unnamed: 0,session,type,aid
0,12899779,carts,59625
1,12899780,carts,1142000 582732 973453 736515 1142000
3,12899781,carts,199008 194067 199008 199008 199008 199008 573...
5,12899782,carts,603159 779477 1299062 602722 413962 975116 16...
7,12899783,carts,607638 1729553 255297 300127 1754419 1216820 ...
...,...,...,...
1948868,14571577,carts,1141710
1948869,14571578,carts,519105
1948870,14571579,carts,739876
1948871,14571580,carts,202353


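For the clicks prediction, we originally recommended the best-selling items, but recommending the items the session had already clicked improved the score.

Showing items the user has clicked at least once probably works better than showing items unrelated to their interests.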

In [None]:
next_clicks_df = pd.DataFrame(test_action_df.loc[(test_action_df['type'] == 'clicks'),]).copy()

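Later in this notebook, we add logic that recommends "also-bought" items in addition to the carted items for the orders prediction, which slightly improved the score.

However, adding the clicked items here instead of the also-bought items (carted items + clicked items) improved the score substantially.

Showing items the user has already clicked once again seems to work better than showing also-bought items.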

In [None]:
next_orders_df

Unnamed: 0,session,type,aid
2,12899781,orders,199008
4,12899782,orders,1494780 834354 975116 127404 413962 595994 13...
10,12899786,orders,955252
12,12899787,orders,1682750 1682750 1682750
16,12899790,orders,1830166 1219653
...,...,...,...
1948716,14571430,orders,903014
1948730,14571443,orders,942326
1948774,14571486,orders,350578
1948788,14571499,orders,1132907


In [None]:
next_orders_df = pd.merge(next_orders_df,next_clicks_df[['session','aid']],on='session',how='left')
# Sessions with carts but no clicks have NaN in aid_y; use '' so the carted items are not lost
next_orders_df['aid'] = next_orders_df['aid_x'] + next_orders_df['aid_y'].fillna('')
next_orders_df = next_orders_df.drop(['aid_x','aid_y'],axis=1)
next_orders_df

Unnamed: 0,session,type,aid
0,12899781,orders,199008 199008 194067 199008 199008 199008 199...
1,12899782,orders,1494780 834354 975116 127404 413962 595994 13...
2,12899786,orders,955252 955252
3,12899787,orders,1682750 1682750 1682750 1682750 1024433
4,12899790,orders,1830166 1219653 1830166
...,...,...,...
242828,14571430,orders,903014 903014 1162324
242829,14571443,orders,942326 1407032 942326 568535
242830,14571486,orders,350578 350578 350578
242831,14571499,orders,1132907 1132907


In [None]:
recommend_df = pd.concat([next_orders_df,next_carts_df,next_clicks_df],axis=0)
recommend_df['session_type'] = recommend_df['session'].astype('str') + '_' + recommend_df['type']
recommend_df

Unnamed: 0,session,type,aid,session_type
0,12899781,orders,199008 199008 194067 199008 199008 199008 199...,12899781_orders
1,12899782,orders,1494780 834354 975116 127404 413962 595994 13...,12899782_orders
2,12899786,orders,955252 955252,12899786_orders
3,12899787,orders,1682750 1682750 1682750 1682750 1024433,12899787_orders
4,12899790,orders,1830166 1219653 1830166,12899790_orders
...,...,...,...,...
1948868,14571577,clicks,1141710,14571577_clicks
1948869,14571578,clicks,519105,14571578_clicks
1948870,14571579,clicks,739876,14571579_clicks
1948871,14571580,clicks,202353,14571580_clicks


In [None]:
sample_sub = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kaggle/OTTO/dataset/original/sample_submission.csv')
sample_sub

Unnamed: 0,session_type,labels
0,12899779_clicks,129004 126836 118524
1,12899779_carts,129004 126836 118524
2,12899779_orders,129004 126836 118524
3,12899780_clicks,129004 126836 118524
4,12899780_carts,129004 126836 118524
...,...,...
5015404,14571580_carts,129004 126836 118524
5015405,14571580_orders,129004 126836 118524
5015406,14571581_clicks,129004 126836 118524
5015407,14571581_carts,129004 126836 118524


In [None]:
sample_sub = pd.merge(sample_sub,recommend_df[['session_type','aid']],on="session_type",how='left')
sample_sub['next'] = sample_sub['aid'] + best_sold_list
sample_sub['next'] = sample_sub['next'].fillna(best_sold_list)  # avoid inplace fillna on a column selection
sample_sub['next'] = sample_sub['next'].str.strip()
sample_sub = sample_sub.drop(['labels','aid'],axis=1)
sample_sub.columns = ("session_type",'labels')
sample_sub

Unnamed: 0,session_type,labels
0,12899779_clicks,59625 80222 1022566 166037 1629608 1733943 332...
1,12899779_carts,59625 80222 1022566 166037 1629608 1733943 332...
2,12899779_orders,80222 1022566 166037 1629608 1733943 332654 35...
3,12899780_clicks,1142000 582732 973453 736515 1142000 80222 102...
4,12899780_carts,1142000 582732 973453 736515 1142000 80222 102...
...,...,...
5015404,14571580_carts,202353 80222 1022566 166037 1629608 1733943 33...
5015405,14571580_orders,80222 1022566 166037 1629608 1733943 332654 35...
5015406,14571581_clicks,1100210 80222 1022566 166037 1629608 1733943 3...
5015407,14571581_carts,1100210 80222 1022566 166037 1629608 1733943 3...


In [None]:
sample_sub.to_csv('/content/drive/MyDrive/Colab Notebooks/kaggle/OTTO/submission/Easy_understanding_for_beginner1.csv',index=False)
sample_sub.to_csv('Easy_understanding_for_beginner1.csv',index=False)

In [None]:
sample_sub.to_csv('submission.csv',index=False)

In [None]:
!pip install kaggle -q
import os
import json
with open("/content/drive/MyDrive/Colab Notebooks/kaggle/kaggle.json", 'r') as f:
    json_data = json.load(f)
os.environ['KAGGLE_USERNAME'] = json_data['username']
os.environ['KAGGLE_KEY'] = json_data['key']

In [None]:
!kaggle competitions submit -c otto-recommender-system -f Easy_understanding_for_beginner1.csv -m ""

100% 867M/867M [00:08<00:00, 111MB/s]
Successfully submitted to OTTO – Multi-Objective Recommender System

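# 4.Adding also-bought items

If products B, C, and D were also ordered when product A was ordered, we list "B, C, D" as the also-bought items for A.

Adding this improved the score slightly, from 0.383 to 0.384.

As mentioned above, the "recommend the clicked items" logic scored better (0.410) than this also-bought logic, so that is the version we submit. The also-bought logic below is kept for reference.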
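
The also-bought idea can be illustrated on a handful of made-up order events. This is a simplified sketch, not the notebook's implementation: it counts each pair once per session and keeps up to three partners even when fewer than three exist, whereas the cell below counts order rows and skips items with fewer than three partners. The names `order_events`, `co_counts`, and `top_pairs` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy (session, aid) order events
order_events = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 10)]

# Distinct ordered items per session
aids_by_session = defaultdict(set)
for session, aid in order_events:
    aids_by_session[session].add(aid)

# For each item, count how often every other item was ordered in the same session
co_counts = defaultdict(Counter)
for aids in aids_by_session.values():
    for a, b in permutations(sorted(aids), 2):
        co_counts[a][b] += 1

# Keep up to three most frequent partners per item
top_pairs = {a: [b for b, _ in counts.most_common(3)]
             for a, counts in co_counts.items()}
print(top_pairs)
```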

In [None]:
train_order_df = train_df.loc[(train_df['type'] == 'orders'),].copy()
aid_counts = train_df.aid.value_counts()
aid_counts

29735      3716
832192     3299
1733943    3046
108125     2900
1603001    2863
           ... 
1346268       1
606579        1
1298735       1
1023270       1
662215        1
Name: aid, Length: 663079, dtype: int64

In [None]:
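pairs = {}
# For each of the 10,000 most frequent aids, find the three items most often ordered in the same sessions
for i in aid_counts.index.values[:10000]:
    # Sessions that ordered this aid
    custs = train_order_df.loc[train_order_df.aid==i.item(),'session'].unique()
    # Other aids ordered in those sessions, most frequent first
    aid_orders = train_order_df.loc[(train_order_df.session.isin(custs))&(train_order_df.aid!=i.item()),'aid'].value_counts()
    # Skip aids with fewer than three co-ordered items
    if len(aid_orders) < 3:
        continue
    pairs[i.item()] = aid_orders.index[:3].tolist()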

In [None]:
pairs_df = pd.DataFrame(pairs)
pairs_df = pairs_df.T
pairs_df['aid'] = pairs_df.index
pairs_df['3pairs'] = ' ' + pairs_df[0].astype('str') + ' ' + pairs_df[1].astype('str') + ' ' + pairs_df[2].astype('str')
pairs_df

Unnamed: 0,0,1,2,aid,3pairs
29735,832192,619885,231487,29735,832192 619885 231487
832192,298888,1691100,29735,832192,298888 1691100 29735
1733943,536184,1457252,680985,1733943,536184 1457252 680985
108125,1503,329725,1492204,108125,1503 329725 1492204
1603001,80222,1472383,101118,1603001,80222 1472383 101118
...,...,...,...,...,...
1171179,1716596,714968,669218,1171179,1716596 714968 669218
778291,1847859,558296,254154,778291,1847859 558296 254154
1203675,1384753,248962,495779,1203675,1384753 248962 495779
990697,1288196,1167802,1609126,990697,1288196 1167802 1609126


In [None]:
test_orders_df = pd.DataFrame(test_df.loc[(test_df["type"] == 'carts'), ]).copy()
test_orders_df = pd.merge(test_orders_df, pairs_df[['aid','3pairs']], on = 'aid', how = 'left')
test_orders_df['type'] = 'orders'
test_orders_df['aid'] = ' ' + test_orders_df['aid'].astype('str')
test_orders_df

Unnamed: 0,session,aid,ts,type,minutes,3pairs
0,12905715,1531805,1661727521491,orders,9954.876183,674295 436764 84886
1,12908612,755467,1661731587214,orders,9875.326483,
2,12914996,471073,1661746910460,orders,9630.287600,532024 340205 904917
3,12909406,631155,1661734150961,orders,9591.169100,1588316 641790 1253946
4,12911707,1662401,1661740775890,orders,9484.075567,443425 488528 125957
...,...,...,...,...,...,...
570006,14571335,1731301,1662328716651,orders,,
570007,14571341,1441093,1662328573060,orders,,1480985 1786188 1705711
570008,14571393,447242,1662328683124,orders,,196039 835431 457623
570009,14571486,350578,1662328680320,orders,,


In [None]:
next_orders_df = test_orders_df.groupby(['session','type'])[['aid','3pairs']].sum().reset_index()
next_orders_df = next_orders_df.replace(0, ' ')
next_orders_df['aid'] = next_orders_df['aid'] + next_orders_df['3pairs']
next_orders_df = next_orders_df.drop(['3pairs'], axis = 1)

recommend_df = pd.concat([next_orders_df, next_carts_df, next_clicks_df], axis =0)
recommend_df["session_type"] = recommend_df["session"].astype('str') + "_" + recommend_df["type"] 
recommend_df

Unnamed: 0,session,type,aid,session_type
0,12899781,orders,199008,12899781_orders
1,12899782,orders,1494780 834354 975116 127404 413962 595994 13...,12899782_orders
2,12899786,orders,955252,12899786_orders
3,12899787,orders,1682750 1682750 1682750,12899787_orders
4,12899790,orders,1830166 1219653,12899790_orders
...,...,...,...,...
1948868,14571577,clicks,1141710,14571577_clicks
1948869,14571578,clicks,519105,14571578_clicks
1948870,14571579,clicks,739876,14571579_clicks
1948871,14571580,clicks,202353,14571580_clicks


In [None]:
sample_sub = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kaggle/OTTO/dataset/original/sample_submission.csv')
sample_sub = pd.merge(sample_sub, recommend_df[["session_type","aid"]], on = "session_type", how ="left")
sample_sub['next'] = sample_sub['aid'] + best_sold_list
sample_sub['next'] = sample_sub['next'].fillna(best_sold_list)  # avoid inplace fillna on a column selection
sample_sub['next'] = sample_sub['next'].str.strip()
sample_sub = sample_sub.drop(["labels", "aid"], axis = 1)
sample_sub.columns = ("session_type", "labels")
sample_sub.to_csv('/content/drive/MyDrive/Colab Notebooks/kaggle/OTTO/submission/Easy_understanding_for_beginner2.csv', index=False)
sample_sub.to_csv('Easy_understanding_for_beginner2.csv', index=False)

In [None]:
!pip install kaggle -q
import os
import json
with open("/content/drive/MyDrive/Colab Notebooks/kaggle/kaggle.json", 'r') as f:
    json_data = json.load(f)
os.environ['KAGGLE_USERNAME'] = json_data['username']
os.environ['KAGGLE_KEY'] = json_data['key']

In [None]:
!kaggle competitions submit -c otto-recommender-system -f Easy_understanding_for_beginner2.csv -m ""

100% 854M/854M [00:10<00:00, 87.5MB/s]
Successfully submitted to OTTO – Multi-Objective Recommender System