**Imports**

In [1]:
import pandas as pd

**Reading interaction data**

In [2]:
interaction_book_id = pd.read_parquet('../Processed/interaction_RE_v1.parquet')

In [3]:
interaction_book_id.shape

(228648342, 3)

**Reading books data - Only the `book_id` column**

In [4]:
req_book_ids = pd.read_parquet('../Processed/books_SE_v3.parquet', columns=["book_id"])

In [5]:
req_book_ids.shape

(2113033, 1)

In [6]:
req_book_ids.columns

Index(['book_id'], dtype='object')

In [7]:
req_book_ids.dtypes

book_id    object
dtype: object

**Updating the datatype of the `book_id` column**

In [8]:
req_book_ids["book_id"] = req_book_ids["book_id"].astype('Int64')

In [9]:
req_book_ids.dtypes

book_id    Int64
dtype: object
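The cast matters because the later join matches these IDs against the `mapped_book_id` keys in the interaction data, which appear to be integers. The following is a minimal sketch on toy data, assuming the IDs are stored as numeric strings (the notebook only shows the dtype as `object`, so the stored type is an assumption). `to_nullable_int` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Parse a column of numeric strings into pandas' nullable Int64 dtype."""
    # errors="raise" makes a malformed ID fail loudly instead of being
    # silently turned into a missing value.
    return pd.to_numeric(series, errors="raise").astype("Int64")


# Toy stand-in for req_book_ids; the None simulates a missing ID.
toy = pd.DataFrame({"book_id": ["5333265", "1333909", None]}, dtype=object)
toy["book_id"] = to_nullable_int(toy["book_id"])
print(toy.dtypes)
```

Unlike NumPy's `int64`, the nullable `Int64` dtype can hold missing values, so a missing ID does not force the column to float.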

**Renaming `book_id` to `mapped_book_id` so it matches the join key in the interaction data**

In [10]:
req_book_ids.rename(columns={"book_id":"mapped_book_id"}, inplace=True)

In [11]:
req_book_ids.head()

| | mapped_book_id |
|---|---|
| 0 | 5333265 |
| 1 | 1333909 |
| 2 | 7327624 |
| 3 | 6066819 |
| 4 | 287140 |


In [12]:
req_book_ids.tail()

| | mapped_book_id |
|---|---|
| 2360650 | 3084038 |
| 2360651 | 26168430 |
| 2360652 | 2342551 |
| 2360653 | 22017381 |
| 2360654 | 11419866 |


**Keeping only the rows of `interaction_book_id` whose `mapped_book_id` also appears in `req_book_ids`, using an inner join on `mapped_book_id`**

In [13]:
filtered_interactions = pd.merge(interaction_book_id, req_book_ids, on='mapped_book_id', how='inner')

In [14]:
filtered_interactions.shape

(219925589, 3)
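Because `req_book_ids` contributes only the key column, the inner join acts as a row filter and adds no new columns. The sketch below uses made-up toy data to show this and compares the result with a boolean `isin` mask, an alternative the notebook does not use. The two approaches agree only when the ID list has no duplicates; with duplicate IDs, the merge would repeat interaction rows.

```python
import pandas as pd

# Toy stand-ins for interaction_book_id and req_book_ids (values are made up).
interactions = pd.DataFrame({
    "index": [0, 1, 2, 3, 4],
    "book_id": [948, 948, 10, 11, 12],
    "mapped_book_id": [12, 12, 7, 99, 3],
})
wanted = pd.DataFrame({"mapped_book_id": [3, 7, 12]})

# Inner join: keeps only rows whose key exists in both frames.
via_merge = pd.merge(interactions, wanted, on="mapped_book_id", how="inner")

# Alternative: boolean mask, which never duplicates rows.
mask = interactions["mapped_book_id"].isin(wanted["mapped_book_id"])
via_isin = interactions[mask]

# Compare after normalising row order and row labels.
same_rows = (
    via_merge.sort_values("index").reset_index(drop=True)
    .equals(via_isin.sort_values("index").reset_index(drop=True))
)
print(len(interactions), len(via_merge), same_rows)
```

On a frame with hundreds of millions of rows, the mask may also use less memory than a merge, although that is not measured here.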

**Reduction in interaction records**

In [15]:
interaction_book_id.shape[0] - filtered_interactions.shape[0]

8722753
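To put the drop in relative terms, the short calculation below reuses the two row counts reported above. The variable names are illustrative and do not appear in the notebook.

```python
rows_before = 228_648_342  # interaction_book_id.shape[0]
rows_after = 219_925_589   # filtered_interactions.shape[0]

dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before
print(f"dropped {dropped:,} rows ({dropped_pct:.2f}% of the original)")
```

The filter removes just under 4% of the interaction records.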

**Number of unique `mapped_book_id` values after filtering**

In [16]:
len(filtered_interactions["mapped_book_id"].unique())

2113028

**Reduction in the number of unique `mapped_book_id` values**

In [17]:
len(req_book_ids["mapped_book_id"].unique()) - len(filtered_interactions["mapped_book_id"].unique())

5
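A difference of 5 means five IDs in `req_book_ids` have no interaction records at all. The sketch below shows one way to list such IDs, using made-up toy data. The frame and variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: `req` lists the required books, `filtered` the kept interactions.
req = pd.DataFrame({"mapped_book_id": [3, 7, 12, 40]})
filtered = pd.DataFrame({"mapped_book_id": [12, 12, 7, 3]})

# True for every required book that appears in at least one interaction.
has_interactions = req["mapped_book_id"].isin(filtered["mapped_book_id"].unique())

# Invert the mask to get the books with no interactions.
books_without_interactions = req.loc[~has_interactions, "mapped_book_id"]
print(books_without_interactions.tolist())
```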

**The `index` column holds the interaction record numbers we need to fetch, as shown below**

In [18]:
filtered_interactions.head()

| | index | book_id | mapped_book_id |
|---|---|---|---|
| 0 | 0 | 948 | 12 |
| 1 | 85452 | 948 | 12 |
| 2 | 148581 | 948 | 12 |
| 3 | 328679 | 948 | 12 |
| 4 | 928107 | 948 | 12 |


**Exporting the filtered interaction data**

In [19]:
filtered_interactions.to_parquet('../Processed/filtered_interaction_RE_v1.parquet', index=True, compression="snappy")