# Book Recommendation Hackathon 

**Task:** Rank 20 editions for each user from 200 candidates, optimizing Score = 0.7×NDCG@20 + 0.3×Diversity@20
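To make the metric concrete, here is a minimal sketch of the weighted score. The exact definition of Diversity@20 is not given in this notebook, so the sketch takes it as a precomputed number; the function names (`dcg_at_k`, `ndcg_at_k`, `combined_score`) are illustrative and not part of the competition code.

```python
import math

def dcg_at_k(relevances, k=20):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, candidate_relevances, k=20):
    """DCG of the submitted order divided by the DCG of the best possible order.

    ranked_relevances: relevance of each recommended item, in submitted order.
    candidate_relevances: relevance of every candidate (assumed known for evaluation).
    """
    ideal = dcg_at_k(sorted(candidate_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def combined_score(ndcg, diversity, w_ndcg=0.7, w_div=0.3):
    """Weighted sum from the task statement; diversity is supplied by the caller."""
    return w_ndcg * ndcg + w_div * diversity

ndcg = ndcg_at_k([0, 1], [0, 1])
score = combined_score(ndcg, diversity=0.5)
```

Because the NDCG term carries 70% of the weight, a re-ranking step should only trade away relevance when the diversity gain is large enough to compensate.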

**Strategy:** classic CatBoost ranker plus re-ranking. The ranker is fitted on NDCG, so the re-ranking step targets the remaining 30% of the metric (Diversity@20).
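The re-ranking step is not shown in this part of the notebook. As a rough sketch of one way it could work, the snippet below greedily picks items and rewards those that bring genres not yet covered. This is an assumption: the function `rerank_for_diversity`, its `trade_off` parameter, and the genre-coverage notion of diversity are hypothetical, and the ranker scores are assumed to be scaled to [0, 1].

```python
def rerank_for_diversity(candidates, scores, genres, k=20, trade_off=0.7):
    """Greedily select k items, balancing ranker score against genre novelty.

    candidates: list of edition ids
    scores: dict edition_id -> ranker score, assumed to lie in [0, 1]
    genres: dict edition_id -> set of genre ids
    """
    selected, covered = [], set()
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(item):
            item_genres = genres.get(item, set())
            # Share of this item's genres that are not covered by earlier picks
            novelty = len(item_genres - covered) / len(item_genres) if item_genres else 0.0
            return trade_off * scores[item] + (1 - trade_off) * novelty
        best = max(remaining, key=gain)
        selected.append(best)
        covered |= genres.get(best, set())
        remaining.remove(best)
    return selected

scores = {1: 0.9, 2: 0.8, 3: 0.5}
genres = {1: {'a'}, 2: {'a'}, 3: {'b'}}
top2 = rerank_for_diversity([1, 2, 3], scores, genres, k=2)
```

With `trade_off=1.0` the function returns the plain score order, which makes it easy to check how much relevance the diversity term gives up.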

In [49]:
import sys 
import os
import warnings 

os.environ['OPENBLAS_NUM_THREADS'] = '1'
warnings.filterwarnings('ignore')

In [50]:
import pandas as pd
import datetime as dt
import time

interactions = pd.read_csv('data/interactions.csv')
editions = pd.read_csv('data/editions.csv')
users = pd.read_csv('data/users.csv')
book_genres = pd.read_csv('data/book_genres.csv')
genres = pd.read_csv('data/genres.csv')
authors = pd.read_csv('data/authors.csv')  # not used below
target_users = pd.read_csv('submit/targets.csv')  # target users, which I will weight more heavily
target_interactions = pd.read_csv('submit/candidates.csv')  # candidate editions to rank for the target users

print('all data frames have been loaded successfully')

all data frames have been loaded successfully


In [51]:
%%time

interactions['event_ts'] = pd.to_datetime(interactions['event_ts'])

split_date = pd.Timestamp('2025-03-12')

feature_source = interactions.loc[interactions['event_ts'] < split_date]
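# Assumption: train should hold the events on or after the split date;
# the original filter repeated the `<` used for feature_source, making both frames identical.
train = interactions.loc[interactions['event_ts'] >= split_date]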

CPU times: user 47.1 ms, sys: 113 ms, total: 160 ms
Wall time: 207 ms
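A minimal sketch of a time-based split on toy data, assuming the intent is to compute features from events before the cut-off and to hold out the later events. The frame `events` and the names `history` and `holdout` are made up for illustration.

```python
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'edition_id': [10, 11, 10, 12],
    'event_ts': pd.to_datetime(['2025-03-01', '2025-03-15', '2025-03-11', '2025-03-12']),
})

cutoff = pd.Timestamp('2025-03-12')

# Events strictly before the cut-off feed the features
history = events.loc[events['event_ts'] < cutoff]
# Events on or after the cut-off are held out
holdout = events.loc[events['event_ts'] >= cutoff]
```

Keeping the two slices disjoint avoids computing features from the same events that later serve as labels.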


In [None]:
%%time

book_genres = book_genres.groupby('book_id')['genre_id'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
enriched_editions = editions.merge(book_genres, on='book_id')

enriched_editions['author_productivity'] = enriched_editions['author_id'].map(enriched_editions['author_id'].value_counts())

feature_source = feature_source.drop('event_ts', axis=1)
feature_source = feature_source.merge(users, on='user_id')
feature_source = feature_source.merge(enriched_editions, on='edition_id')

# book_id and publisher_id are fully determined by edition_id, so they add no information
feature_source = feature_source.drop(columns=['book_id', 'publisher_id'])

feature_source['edition_popularity_score'] = feature_source.edition_id.map(feature_source.edition_id.value_counts())
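feature_source['writer_mean_age'] = feature_source.groupby('edition_id')['age'].transform('mean')  # mean age of the edition's readers, despite the column name
feature_source['book_age'] = 2026 - feature_source['publication_year']  # years since publication, relative to 2026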
feature_source['user_mean_rating'] = feature_source.groupby('user_id')['rating'].transform('mean')
feature_source['book_mean_rating'] = feature_source.groupby('edition_id')['rating'].transform('mean')
feature_source = feature_source.drop('rating', axis=1)

user_cols = ['user_id', 'gender', 'age', 'user_mean_rating']

user_features = feature_source[user_cols].drop_duplicates().reset_index(drop=True)

# One row per edition: leave out the user columns and the per-interaction event_type
book_cols = [f for f in feature_source.columns if f not in user_cols + ['event_type']]
book_features = feature_source[book_cols].drop_duplicates(subset=['edition_id']).reset_index(drop=True)

CPU times: user 1.89 s, sys: 195 ms, total: 2.08 s
Wall time: 2.22 s


NameError: name 'enriched_interactions' is not defined

#### 2.3.2 - Feature engineering for interactions

I add edition popularity so that users can be offered popular books they have not interacted with yet (a higher `edition_popularity_score` means a more popular edition).

I add the mean age of each edition's readers (`writer_mean_age`), which should help match editions to the age group that usually reads them.

I add `book_age` to capture how recent each book is.

I also add per-entity average ratings, `user_mean_rating` and `book_mean_rating`, the so-called "biases".
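The features described above can be reproduced on a toy frame to see what each one contains. The data in `toy` is invented; the column names and the reference year 2026 follow the notebook cell.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':          [1, 1, 2, 3],
    'edition_id':       [10, 11, 10, 10],
    'age':              [20, 20, 30, 40],
    'rating':           [5.0, 3.0, 4.0, 3.0],
    'publication_year': [2020, 2000, 2020, 2020],
})
reference_year = 2026  # same constant as in the notebook

# Number of interactions per edition
toy['edition_popularity_score'] = toy['edition_id'].map(toy['edition_id'].value_counts())
# Mean age of the users who interacted with the edition
toy['writer_mean_age'] = toy.groupby('edition_id')['age'].transform('mean')
# Years since publication
toy['book_age'] = reference_year - toy['publication_year']
# Rating "biases": how generous a user is and how well an edition is rated
toy['user_mean_rating'] = toy.groupby('user_id')['rating'].transform('mean')
toy['book_mean_rating'] = toy.groupby('edition_id')['rating'].transform('mean')
```

`transform('mean')` keeps the original row count, so every interaction row receives the aggregate of its own user or edition.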

## 3 - Extract user and book information from the final data
#### This is needed to rebuild the same features for the target users
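A minimal sketch of how the two lookup tables could be reused for the candidate pairs, assuming the candidates file has `user_id` and `edition_id` columns (the toy frames below are invented). Left joins are used here so that users or editions absent from the training data stay in the output with missing features instead of disappearing.

```python
import pandas as pd

user_features = pd.DataFrame({'user_id': [1, 2], 'age': [20, 30], 'user_mean_rating': [4.0, 3.5]})
book_features = pd.DataFrame({'edition_id': [10, 11], 'book_age': [6, 26], 'book_mean_rating': [4.0, 3.0]})
candidates = pd.DataFrame({'user_id': [1, 1, 2, 3], 'edition_id': [10, 11, 12, 10]})

# Attach per-user and per-edition features to every candidate pair
target_rows = (
    candidates
    .merge(user_features, on='user_id', how='left')
    .merge(book_features, on='edition_id', how='left')
)
```

The rows for user 3 and edition 12 come back with NaN features, which the ranker then has to handle as missing values.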

## 4 - Adding additional books and separating the target from data.

positives / negatives = 1 / 4 

Negatives are chosen by popularity: each user receives the most popular editions as negatives.
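The idea can be sketched with a simple per-user loop on toy data. This is an illustration only: the loop skips editions the user already interacted with up front, the names `negatives_per_positive` and `popular` are hypothetical, and a loop like this would be slow on the full data.

```python
import pandas as pd

positives = pd.DataFrame({
    'user_id':    [1, 1, 2, 2, 3, 3],
    'edition_id': [10, 11, 10, 11, 10, 12],
})
negatives_per_positive = 1  # toy value for illustration

# Edition ids ordered from most to least popular
popular = positives['edition_id'].value_counts().index.to_numpy()

rows = []
for user_id, seen in positives.groupby('user_id')['edition_id']:
    seen = set(seen)
    quota = len(seen) * negatives_per_positive
    # Most popular editions this user has not interacted with
    unseen = [e for e in popular if e not in seen][:quota]
    rows.extend({'user_id': user_id, 'edition_id': e, 'event_type': 0} for e in unseen)

negatives = pd.DataFrame(rows)
```

Popular-but-unseen items make harder negatives than uniformly random ones, since popularity alone cannot separate them from the positives.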

### 4.1 - Creating negatives 

DO NOT RUN THIS LOCALLY: on Kaggle it ran for an hour and generated a 1.4 GB file.

In [26]:
import pandas as pd
import numpy as np

needed_cols = [
    'user_id',
    'edition_id',
    'event_type',
    'gender',
    'age',
    'author_id',
    'publication_year',
    'age_restriction',
    'language_id',
    'title',
    'description',
    'genre_id',
    'author_productivity',
    'edition_popularity_score',
    'writer_mean_age',
    'book_age',
    'user_mean_rating',
    'book_mean_rating'
 ]

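# 1. Number of negatives per user: 8 for every observed interaction
num_negative_samples_per_user = enriched_interactions['user_id'].value_counts() * 8

# Cap the count at the number of available books;
# a larger count would index past the end of book_features and raise an IndexError
max_books = len(book_features)
num_negative_samples_per_user = num_negative_samples_per_user.clip(upper=max_books)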

# 2. Extract unique user features ONCE instead of filtering per loop
user_cols = ['user_id', 'gender', 'age', 'user_mean_rating']
user_features = enriched_interactions[user_cols].drop_duplicates(subset=['user_id']).set_index('user_id')

# 3. Create arrays of what we need to duplicate
user_ids = num_negative_samples_per_user.index.to_numpy()
counts = num_negative_samples_per_user.to_numpy()

# np.repeat duplicates each user_id exactly 'count' times
# e.g., if User A needs 3 samples: ['A', 'A', 'A']
repeated_user_ids = np.repeat(user_ids, counts)

# 4. Generate the corresponding book indices: [0, 1, 2, ..., 0, 1, ...]
temp_df = pd.DataFrame({'user_id': repeated_user_ids})
book_indices = temp_df.groupby('user_id', sort=False).cumcount().to_numpy()

# 5. Clean up book_features for fast indexing
book_features = book_features.sort_values('edition_popularity_score', ascending=False).reset_index(drop=True)

# 6. Extract all required rows at once using vectorized indexing
sampled_books = book_features.iloc[book_indices].reset_index(drop=True)
sampled_users = user_features.loc[repeated_user_ids].reset_index(drop=True)

# 7. Glue it all together!
negativity_builder = pd.concat([sampled_users, sampled_books], axis=1)

# 8. Fill in static columns and enforce the final column order
negativity_builder['user_id'] = repeated_user_ids  # restore user_id, which reset_index(drop=True) discarded
negativity_builder['event_type'] = 0

negativity_builder = negativity_builder[needed_cols]
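The core of the vectorised cell is the combination of `np.repeat` and `groupby(...).cumcount()`. A toy example with invented data shows what the two intermediate arrays look like and how they index into a popularity-sorted table:

```python
import numpy as np
import pandas as pd

# User A needs 3 negatives, user B needs 1
counts = pd.Series({'A': 3, 'B': 1})
repeated = np.repeat(counts.index.to_numpy(), counts.to_numpy())

# Position of each row within its user's block: 0, 1, 2 for A and 0 for B
positions = (
    pd.DataFrame({'user_id': repeated})
    .groupby('user_id', sort=False)
    .cumcount()
    .to_numpy()
)

# Rows already sorted by popularity, most popular first
ranked_books = pd.DataFrame({'edition_id': [10, 11, 12]})
sampled = ranked_books.iloc[positions]['edition_id'].tolist()
```

Every user therefore receives the same prefix of the popularity ranking, and only the length of the prefix differs.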

### 4.2 - Combine all data

In [28]:
final_interactions = pd.concat([enriched_interactions, negativity_builder], ignore_index=True)

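# Put positives (event_type != 0) first so that drop_duplicates keeps them
# when a sampled negative collides with a real interaction
final_interactions = final_interactions.sort_values(
    by='event_type',
    key=lambda x: x == 0,
    kind='stable'
)

final_interactions = final_interactions.drop_duplicates(subset=['user_id', 'edition_id'], keep='first')
final_interactions = final_interactions.reset_index(drop=True)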
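A small sketch of this de-duplication on toy data, assuming that `event_type == 0` marks a sampled negative, that any other value is a real interaction, and that the real interaction should survive when both exist for the same user and edition. The helper column `_is_negative` is introduced only for this example.

```python
import pandas as pd

combined = pd.DataFrame({
    'user_id':    [1, 1, 1, 2],
    'edition_id': [10, 10, 11, 10],
    'event_type': [0, 2, 0, 1],
})

deduped = (
    combined
    .assign(_is_negative=combined['event_type'].eq(0))
    # False sorts before True, so real interactions come first
    .sort_values('_is_negative', kind='stable')
    .drop_duplicates(subset=['user_id', 'edition_id'], keep='first')
    .drop(columns='_is_negative')
    .reset_index(drop=True)
)
```

The direction of the sort matters: if negatives came first, `keep='first'` would replace a real interaction with its sampled negative.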