# Book Recommendation Hackathon 

**Task:** Rank 20 editions for each user from 200 candidates, optimizing Score = 0.7×NDCG@20 + 0.3×Diversity@20

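To make the objective concrete, here is a minimal sketch of the combined score for a single user. It assumes the common linear-gain, log2-discount form of NDCG and takes the ideal ordering over the supplied list only. The organizers' exact gain function and their definition of Diversity@20 are not stated in this notebook, so `dcg_at_k`, `ndcg_at_k`, `combined_score`, and the `diversity` value are illustrative assumptions.

```python
import math

def dcg_at_k(relevances, k=20):
    """Discounted cumulative gain over the first k relevance values (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=20):
    """DCG of the given ordering divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def combined_score(ndcg, diversity, w_ndcg=0.7, w_div=0.3):
    """Weighted sum from the task statement; diversity is supplied by the caller."""
    return w_ndcg * ndcg + w_div * diversity

# Toy ranking: relevance of each recommended item, in recommendation order.
ranked_relevances = [1, 0, 1, 0, 0]
ndcg = ndcg_at_k(ranked_relevances, k=20)
score = combined_score(ndcg, diversity=0.5)  # 0.5 is a made-up diversity value
```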
**Strategy:** a classic two-stage setup: a CatBoost ranker followed by re-ranking. CatBoost is fitted on NDCG, so the re-ranking step targets the remaining 30% of the metric (Diversity@20).

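The notebook does not reach the re-ranking step, so the following is only a sketch of one way it could look: a greedy, MMR-style pass that subtracts a penalty for each genre a candidate shares with items already selected. The function name `rerank_for_diversity`, the `penalty` value, and the use of genre overlap as the diversity signal are all assumptions, not the author's method.

```python
def rerank_for_diversity(candidates, k=20, penalty=0.1):
    """Greedy re-ranking: repeatedly pick the item whose relevance score is highest
    after subtracting `penalty` for every genre shared with already selected items.

    candidates: list of (edition_id, relevance_score, set_of_genre_ids).
    Returns the selected edition ids in their new order.
    """
    remaining = list(candidates)
    selected, seen_genres = [], set()
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: c[1] - penalty * len(c[2] & seen_genres))
        selected.append(best[0])
        seen_genres |= best[2]
        remaining.remove(best)
    return selected

# Toy candidates: "b" scores higher than "c" but repeats the genre of "a".
candidates = [
    ("a", 0.90, {1}),
    ("b", 0.85, {1}),
    ("c", 0.80, {2}),
]
top2 = rerank_for_diversity(candidates, k=2, penalty=0.1)
```

With `penalty=0.0` the function returns the plain relevance order, so the penalty is the single knob that trades NDCG for diversity.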
## 1 - loading data

In [1]:
import pandas as pd

interactions = pd.read_csv('data/interactions.csv')
editions = pd.read_csv('data/editions.csv')
users = pd.read_csv('data/users.csv')
book_genres = pd.read_csv('data/book_genres.csv')
genres = pd.read_csv('data/genres.csv')
authors = pd.read_csv('data/authors.csv')  # not used
target_users = pd.read_csv('submit/targets.csv')  # target users; I will give them more weight
target_interactions = pd.read_csv('submit/candidates.csv')  # their candidate interactions to predict on

print('all data frames have been loaded successfully')

all data frames have been loaded successfully


## 2 - data preparation

#### 2.1 - enriched books' data

In [2]:
book_genres = book_genres.groupby('book_id')['genre_id'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
#book_genres = book_genres.reset_index().groupby('book_id')['genre_id'].apply(list).reset_index() - not yet sure what is better
enriched_editions = editions.merge(book_genres, on='book_id')

print("books' data has been enriched successfully")

books' data has been enriched successfully

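A small sketch on made-up data shows what the aggregation above produces and one side effect worth checking: `merge` defaults to an inner join, so an edition whose book has no genre rows would be dropped. The toy frames and the `genre_list` column are hypothetical. Whether the real data contains such editions is not shown in this notebook.

```python
import pandas as pd

book_genres_toy = pd.DataFrame({"book_id": [1, 1, 2], "genre_id": [10, 11, 10]})
editions_toy = pd.DataFrame({"edition_id": [100, 101, 102], "book_id": [1, 2, 3]})

# One row per book, genres joined into a space-separated string.
genres_per_book = (
    book_genres_toy.groupby("book_id")["genre_id"]
    .apply(lambda x: " ".join(x.astype(str)))
    .reset_index()
)

inner = editions_toy.merge(genres_per_book, on="book_id")             # default how="inner"
left = editions_toy.merge(genres_per_book, on="book_id", how="left")  # keeps book 3

# Splitting the string recovers the list form if it is needed later.
left["genre_list"] = left["genre_id"].fillna("").str.split()
```

Because the string form can be split back into a list at any time, the choice between the two representations does not have to be made up front.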

#### 2.1.1 - Feature engineering 
I add an `author_productivity` feature: the number of editions in the catalogue by the same author.

In [3]:
enriched_editions['author_productivity']= enriched_editions.author_id.map(enriched_editions.author_id.value_counts())

print("author_productivity has been added successfully")

author_productivity has been added successfully

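As a quick illustration on made-up data, the `map(value_counts())` idiom counts rows per author, so the feature counts editions rather than distinct books, and a missing `author_id` yields a missing count. The toy frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"edition_id": [1, 2, 3, 4], "author_id": [7.0, 7.0, 8.0, None]})

# value_counts() skips NaN, so the edition without an author gets NaN here.
toy["author_productivity"] = toy["author_id"].map(toy["author_id"].value_counts())
```

If missing authors occur in the real data, a `fillna(0)` or a separate "unknown author" bucket may be worth considering before training.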

#### 2.3 - enriched interactions

In [4]:
interactions = interactions.drop('event_ts', axis=1)
enriched_interactions = interactions.merge(users, on='user_id')
enriched_interactions = enriched_interactions.merge(enriched_editions, on='edition_id')

print("interactions have been successfully merged")

interactions have been successfully merged


#### 2.3.1 - adjusting logic 

I drop the columns that are fully determined by `edition_id`, then build more features.

In [5]:
enriched_interactions = enriched_interactions.drop('book_id', axis=1) #1 to 1 with edition_id
enriched_interactions = enriched_interactions.drop('publisher_id', axis=1) #1 to 1 with edition_id

print("1 to 1 features have been dropped successfully")

1 to 1 features have been dropped successfully

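The "1 to 1" claim can be checked before dropping a column. The sketch below uses a hypothetical helper, `is_determined_by`, on made-up data; it tests whether each `edition_id` maps to a single value of another column.

```python
import pandas as pd

def is_determined_by(df, key, col):
    """True if every value of `key` maps to at most one distinct value of `col`."""
    return bool((df.groupby(key)[col].nunique(dropna=False) <= 1).all())

toy = pd.DataFrame({
    "edition_id": [100, 100, 101, 102],
    "book_id":    [1, 1, 1, 2],
    "event_type": [1, 2, 1, 1],
})

book_ok = is_determined_by(toy, "edition_id", "book_id")      # each edition has one book
event_ok = is_determined_by(toy, "edition_id", "event_type")  # edition 100 has two event types
```

Note that the relation only runs one way: in the toy data editions 100 and 101 share book 1, so `book_id` still carries grouping information that `edition_id` alone does not.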

#### 2.3.2 - feature engineering for interactions 

I add `edition_popularity_score`, the number of interactions per edition (a higher value means a more popular edition), so that users can be offered popular books they have not clicked on.

I add `writer_mean_age`, which despite its name is the mean age of the users who interacted with each edition; it should capture which age group an edition appeals to.

I add `book_age` to capture how recent each book is.

I also add the per-user and per-edition averages `user_mean_rating` and `book_mean_rating`, the so-called "biases".

In [6]:
enriched_interactions['edition_popularity_score'] = enriched_interactions.edition_id.map(enriched_interactions.edition_id.value_counts())
enriched_interactions['writer_mean_age'] = enriched_interactions.groupby('edition_id')['age'].transform('mean')
enriched_interactions['book_age'] = 2026 - enriched_interactions['publication_year']
enriched_interactions['user_mean_rating'] = enriched_interactions.groupby('user_id')['rating'].transform('mean')
enriched_interactions['book_mean_rating'] = enriched_interactions.groupby('edition_id')['rating'].transform('mean')
enriched_interactions = enriched_interactions.drop('rating', axis=1)

print("feature engineering has been done successfully")

feature engineering has been done successfully

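To show what each derived column contains, here is the same set of expressions applied to a four-row toy frame. The data is made up, and the missing rating is a hypothetical case included only to show that `transform('mean')` skips NaN values.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "edition_id": [100, 101, 100, 100],
    "age":        [20, 20, 30, 40],
    "rating":     [8.0, 6.0, 10.0, None],
})

# Interactions per edition, broadcast back to every row of that edition.
toy["edition_popularity_score"] = toy["edition_id"].map(toy["edition_id"].value_counts())
# Mean age of the users who interacted with the edition.
toy["writer_mean_age"] = toy.groupby("edition_id")["age"].transform("mean")
# Per-user and per-edition rating averages ("biases").
toy["user_mean_rating"] = toy.groupby("user_id")["rating"].transform("mean")
toy["book_mean_rating"] = toy.groupby("edition_id")["rating"].transform("mean")
```

User 3 has no rated interaction, so `user_mean_rating` is NaN for that row, while `book_mean_rating` for edition 100 is still 9.0 from the two available ratings.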

## 3 - Extracting user and book features from the final data
#### This is needed to recreate the same features for the target candidates

In [25]:
user_cols = ['user_id', 'gender', 'age', 'user_mean_rating']

# One row per user; drop=True avoids carrying the old index along as a column.
user_features = enriched_interactions[user_cols].drop_duplicates().reset_index(drop=True)
print('user_features_dataframe has been created successfully')

# Every column that is not a user column describes the edition.
book_cols = [f for f in enriched_interactions.columns if f not in user_cols]
book_features = enriched_interactions[book_cols].drop_duplicates().reset_index(drop=True)

print('book_features has been created successfully')


user_features_dataframe has been created successfully
book_features has been created successfully

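One way the two lookup tables could later be attached to the candidate pairs is a pair of left merges. This is a sketch on made-up frames; the column layout of `candidates.csv` is an assumption, and the toy feature tables are hypothetical.

```python
import pandas as pd

user_features_toy = pd.DataFrame(
    {"user_id": [1, 2], "age": [20, 30], "user_mean_rating": [7.0, 9.0]}
)
book_features_toy = pd.DataFrame({"edition_id": [100, 101], "book_mean_rating": [8.5, 6.0]})
candidates_toy = pd.DataFrame({"user_id": [1, 1, 2, 3], "edition_id": [100, 101, 100, 102]})

# how="left" keeps every candidate pair; validate raises if a lookup table
# has more than one row per key, which would silently duplicate candidates.
candidate_rows = (
    candidates_toy
    .merge(user_features_toy, on="user_id", how="left", validate="many_to_one")
    .merge(book_features_toy, on="edition_id", how="left", validate="many_to_one")
)
```

Users or editions that never appear in the training interactions (user 3 and edition 102 here) come out with NaN features. The `validate` check matters because `book_features` in this notebook keeps `event_type`, so it may hold more than one row per edition.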

## 4 - Adding negative samples and separating the target from the data

The ratio of positives to negatives is 1 : 4.

Negatives are chosen by popularity: each user receives the most popular editions as negative examples.

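A minimal sketch of popularity-based negative sampling on made-up data follows. It additionally skips editions the user has already interacted with, which is an assumption about the intent rather than something the notebook implements. The helper name `sample_popular_negatives` is hypothetical.

```python
import pandas as pd

def sample_popular_negatives(interactions, ratio=4):
    """For each user, take the most popular editions the user has not interacted
    with, up to ratio * (number of that user's interactions)."""
    popularity = interactions["edition_id"].value_counts()  # most popular first
    rows = []
    for user_id, group in interactions.groupby("user_id")["edition_id"]:
        seen = set(group)
        n_neg = ratio * len(group)
        unseen = [e for e in popularity.index if e not in seen][:n_neg]
        rows.extend((user_id, e) for e in unseen)
    return pd.DataFrame(rows, columns=["user_id", "edition_id"])

interactions_toy = pd.DataFrame({
    "user_id":    [1, 2, 2, 3, 4, 4],
    "edition_id": [100, 100, 101, 102, 103, 100],
})
negatives = sample_popular_negatives(interactions_toy, ratio=4)
```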
The final frame must contain these columns:

```python
['user_id',
 'edition_id',
 'event_type',
 'gender',
 'age',
 'author_id',
 'publication_year',
 'age_restriction',
 'language_id',
 'title',
 'description',
 'genre_id',
 'author_productivity',
 'edition_popularity_score',
 'writer_mean_age',
 'book_age',
 'user_mean_rating',
 'book_mean_rating']
```

In [57]:
from tqdm import tqdm

needed_cols = [
    'user_id',
    'edition_id',
    'event_type',
    'gender',
    'age',
    'author_id',
    'publication_year',
    'age_restriction',
    'language_id',
    'title',
    'description',
    'genre_id',
    'author_productivity',
    'edition_popularity_score',
    'writer_mean_age',
    'book_age',
    'user_mean_rating',
    'book_mean_rating'
 ]

# Four negatives per observed interaction (positives : negatives = 1 : 4).
num_negative_samples_per_user = enriched_interactions.user_id.value_counts() * 4

# Most popular editions first, so the first n rows are the n most popular ones.
popular_books = (
    book_features
    .drop(columns='index', errors='ignore')
    .sort_values('edition_popularity_score', ascending=False)
    .reset_index(drop=True)
)

# One row per user, indexed by user_id for fast lookup.
user_rows = user_features.drop_duplicates('user_id').set_index('user_id')

debug_cap = 11  # same cut-off as the original `if i == 10: break`; set to None for a full run

# Iterate over unique users; collecting frames and concatenating once avoids
# the quadratic cost of calling pd.concat inside the loop.
negative_frames = []
for user_id, n_neg in tqdm(num_negative_samples_per_user.items(),
                           total=len(num_negative_samples_per_user)):
    if debug_cap is not None:
        n_neg = min(n_neg, debug_cap)
    # NOTE: editions the user already interacted with are not filtered out here.
    # NOTE: event_type is carried over from book_features, as in the original loop;
    # negatives will probably need an explicit label of their own.
    negatives = popular_books.head(n_neg).copy()
    negatives['user_id'] = user_id
    for col in ['gender', 'age', 'user_mean_rating']:
        negatives[col] = user_rows.at[user_id, col]
    negative_frames.append(negatives[needed_cols])

negativity_builder = pd.concat(negative_frames, ignore_index=True)

In [55]:
book_features.head(1)


|  | edition_id | event_type | author_id | publication_year | age_restriction | language_id | title | description | genre_id | author_productivity | edition_popularity_score | writer_mean_age | book_age | book_mean_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1010122669 | 2 | 2386468.0 | 2024 | 16 | 119 | Тайна мертвого ректора. Книга 1 | После грандиозной и кровопролитной битвы граф ... | 1243 1244 | 33 | 378 | 33.580645 | 2 | 9.017794 |
