# FAST PANDAS LEFT JOIN (357x faster)

Hi, I think many people are irritated by the overhead of joining each test dataframe with user dataframes (or content dataframes). For a left join where the right table's index is unique, we can join them much faster than with `pd.merge`.

* UPDATE: added the method @alijs1 mentioned (`right_index=True`), 13x faster.
* UPDATE: added the method @doctorkael mentioned (`right_index=True` and present users), 45x faster.

Discussion: https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/197023

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import riiideducation

In [None]:
users = pd.read_csv('../input/riiid-test-answer-prediction/train.csv', usecols=['user_id'])['user_id'].unique()

### Maybe you have a user dataframe like this.

In [None]:
n_cols = 100
df_user = pd.DataFrame(np.random.random((users.shape[0], n_cols)), index=users, columns=[f'feat{i}' for i in range(n_cols)])
df_user.index.name = 'user_id'
print(df_user.shape)
df_user

# Comparison

In [None]:
# prepare data
env = riiideducation.make_env()
iter_test = env.iter_test()

list_df = []
for itr, (df_test, sample_prediction_df) in enumerate(iter_test):
    df_test.loc[:, 'answered_correctly'] = 0.5
    list_df.append(df_test)
    env.predict(df_test.loc[df_test['content_type_id'] == 0, ['row_id', 'answered_correctly']])

# pd.merge()

It should take around 1.75 sec.

In [None]:
%%timeit
for df_test in list_df:
    df_test.merge(df_user, how='left', on='user_id')

# pd.merge() with right_index=True

Mentioned by @alijs1 in https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/197023.

It should take around 139 ms. This is 13 times faster!

In [None]:
%%timeit
for df_test in list_df:
    df_test.merge(df_user, how='left', left_on='user_id', right_index=True)

# pd.merge() with right_index=True and the present users filtering

Mentioned by @doctorkael in https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/197023.

It should take around 38.3 ms. This is 45 times faster!

In [None]:
%%timeit
for df_test in list_df:
    df_test.merge(df_user.loc[df_user.index.isin(df_test['user_id'])],
                  how='left', left_on='user_id', right_index=True)
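
The filtering step above can be illustrated on toy data. This is only a sketch: `toy_user` and `toy_test` are hypothetical stand-ins for `df_user` and one `df_test` batch, not the competition data. It shows that `isin` shrinks the right table to the users present in the batch before the merge runs.

```python
import pandas as pd

# Toy stand-ins (hypothetical data) for df_user and one df_test batch.
toy_user = pd.DataFrame({"feat0": [0.1, 0.2, 0.3, 0.4]},
                        index=pd.Index([1, 2, 3, 4], name="user_id"))
toy_test = pd.DataFrame({"row_id": [0, 1, 2], "user_id": [3, 1, 3]})

# Keep only the right-table rows whose user_id occurs in this batch.
present = toy_user.loc[toy_user.index.isin(toy_test["user_id"])]
print(len(toy_user), "rows in the full table,", len(present), "rows after filtering")

# Merge against the smaller table, matching the left column to the right index.
merged = toy_test.merge(present, how="left", left_on="user_id", right_index=True)
print(merged)
```

The smaller the batch is relative to the user table, the more rows this filter removes before the join.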

# Fast left join
It should take around 4.89 ms, <span style="color: red; ">**357 TIMES FASTER!!!!!!**</span>

In [None]:
%%timeit
for df_test in list_df:
    pd.concat([df_test.reset_index(drop=True), df_user.reindex(df_test['user_id'].values).reset_index(drop=True)], axis=1)
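
For reuse, the same reindex-and-concat trick could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the name `fast_left_join` and the toy frames are hypothetical, and the helper assumes the right table's index is unique and holds the key values.

```python
import pandas as pd


def fast_left_join(left: pd.DataFrame, right: pd.DataFrame, key: str) -> pd.DataFrame:
    """Left-join `right` onto `left` via reindex + concat (hypothetical helper).

    Assumes `right.index` is unique and contains the values of `left[key]`.
    """
    if not right.index.is_unique:
        raise ValueError("right.index must be unique for this shortcut")
    # Pick one row of `right` per row of `left`; keys missing from `right` give NaN rows.
    looked_up = right.reindex(left[key].to_numpy()).reset_index(drop=True)
    # Drop both indexes so the two frames line up by position.
    return pd.concat([left.reset_index(drop=True), looked_up], axis=1)


# Toy data (hypothetical) standing in for df_test and df_user.
toy_test = pd.DataFrame({"row_id": [10, 11, 12, 13], "user_id": [3, 1, 3, 99]})
toy_user = pd.DataFrame({"feat0": [0.1, 0.2, 0.3]},
                        index=pd.Index([1, 2, 3], name="user_id"))

joined = fast_left_join(toy_test, toy_user, "user_id")
print(joined)
```

The uniqueness check matters because `reindex` raises an error when the source index has duplicate labels, whereas a regular merge would duplicate rows instead.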

# Of course, they are equivalent.

However, `right_index=True` preserves the original left index, so `reset_index(drop=True)` is required before the results compare equal.

In [None]:
for df_test in list_df:
    df_merge = df_test.merge(df_user, how='left', on='user_id')
    df_merge_right_index = df_test.merge(df_user, how='left',
                                         left_on='user_id', right_index=True).reset_index(drop=True)
    df_merge_right_index_user = df_test.merge(df_user.loc[df_user.index.isin(df_test['user_id'])],
                                              how='left', left_on='user_id', right_index=True).reset_index(drop=True)
    df_fast_merge = pd.concat([df_test.reset_index(drop=True),
                               df_user.reindex(df_test['user_id'].values).reset_index(drop=True)], axis=1)
    print(df_merge.equals(df_merge_right_index), 
          df_merge.equals(df_merge_right_index_user), 
          df_merge.equals(df_fast_merge), 
          )
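
The note about index preservation can be seen on a tiny example. This is a sketch with hypothetical toy frames (`left`, `right`); the left frame is given a non-default index on purpose so the difference is visible.

```python
import pandas as pd

# Hypothetical toy data; the left frame has a non-default index on purpose.
left = pd.DataFrame({"user_id": [3, 1, 3]}, index=[100, 101, 102])
right = pd.DataFrame({"feat0": [0.1, 0.3]}, index=pd.Index([1, 3], name="user_id"))

# Joining against the right index keeps the left frame's index labels.
by_index = left.merge(right, how="left", left_on="user_id", right_index=True)

# Joining on a plain column on both sides produces a fresh 0..n-1 index.
by_column = left.merge(right.reset_index(), how="left", on="user_id")

print(list(by_index.index))
print(list(by_column.index))

# Compare the two results once the preserved index has been dropped.
print(by_index.reset_index(drop=True).equals(by_column))
```

In the notebook the test batches may carry their own index, which is why the comparison cell resets it before calling `equals`.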

Enjoy your Kaggle life!!!