In this task you must predict, from a user's previous likes, the next track the user will like.

Input format
You are given three files:
train - the training set. Each line is the sequence of track ids liked by one user. The likes are guaranteed to be listed in the order in which the user gave them.

test - the test set. It has exactly the same format, but each line is missing the last like, which is the one to predict.
The test data is split into public and private parts. During the competition you will see results on the public part only. The final scoring is done on the private part.

track_artists.csv - information about track artists. Every track is guaranteed to have exactly one artist. For tracks that actually have several artists, we kept the one considered the main artist of the track.
The file baseline.py contains a naive solution. Note that this solution may take more than an hour to run.
The file score.py contains code you can use to compute MRR locally for your solution.

Output format
Submit a file that contains, for each user, a separate line with at most 100 tracks separated by spaces.

Notes
MRR@100 is used as the metric.
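
The contents of score.py are not reproduced in this notebook, so the following is only a minimal sketch of the usual MRR@k definition (the mean over users of 1/rank of the true track within the first k predictions, or 0 if it is absent). The names `reciprocal_rank` and `mrr_at_k` are hypothetical, and the official script may differ in details.

```python
def reciprocal_rank(predicted, target, k=100):
    """Return 1/rank of target within the first k predictions, or 0.0 if absent."""
    for rank, track in enumerate(predicted[:k], start=1):
        if track == target:
            return 1.0 / rank
    return 0.0


def mrr_at_k(predictions, targets, k=100):
    """Average the reciprocal rank over all users."""
    if not targets:
        return 0.0
    total = sum(reciprocal_rank(p, t, k) for p, t in zip(predictions, targets))
    return total / len(targets)


# Toy data: the first user's target is ranked 2nd, the second user's target is missing.
toy_predictions = [["10", "20", "30"], ["40", "50"]]
toy_targets = ["20", "60"]
toy_score = mrr_at_k(toy_predictions, toy_targets)  # (1/2 + 0) / 2
```

Because only the rank of the single correct track matters, the order of the candidates in each submitted line is as important as which candidates are included.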

In [1]:
import pandas as pd
import numpy as np
from itertools import islice
import gc
from tqdm.notebook import trange, tqdm

In [6]:
# count the likes of each track and store the counts in a DataFrame
track_stats = {}
with open('train') as f:
    lines = f.readlines()
    for line in lines:
        tracks = line.strip().split(' ')
        for track in tracks:
            if track not in track_stats:
                track_stats[track] = 0
            track_stats[track] += 1
        
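track_stats_df = pd.DataFrame.from_dict(track_stats, orient='index').reset_index()
track_stats_df.columns = ['track', 'likes_count']  # later cells refer to this column as 'likes_count'
# track ids are read as strings; cast to int64 so they can be merged with trackId (assumed numeric) from track_artists.csv
track_stats_df['track'] = track_stats_df['track'].astype('int64')

In [8]:
track_stats_df.head()

Unnamed: 0,track,likes_count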
0,333396,1
1,267089,1
2,155959,1
3,353335,1
4,414000,1
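
The per-track like counting above can also be sketched with `collections.Counter`. This is an illustrative alternative on toy input, not the code that produced the table; `count_likes` and the toy lines are hypothetical.

```python
from collections import Counter


def count_likes(lines):
    """Count how many times each track id occurs across all user histories."""
    counts = Counter()
    for line in lines:
        # split() without arguments also drops the trailing newline
        counts.update(line.split())
    return counts


# Toy input in the same format as the train file: one user per line.
toy_lines = ["1 2 3\n", "2 3\n", "3\n"]
toy_counts = count_likes(toy_lines)
```

`toy_counts.most_common(n)` then returns the n most liked tracks, which is the same information the notebook later obtains by sorting `track_stats`.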


In [None]:
track_artists_df = pd.read_csv('track_artists.csv')  # columns used below: trackId, artistId

In [None]:
track_likes_artist_df = track_stats_df.merge(track_artists_df, how='left', left_on='track', right_on='trackId').drop(columns='trackId')

In [None]:
# build training examples from train: the last like of each user is the target (y), the preceding likes are the input (X)
Y_list = []
X_list = []
with open('train') as f:
    lines = list(islice(f, 0, 5000))  # read only the first 5000 lines because of limited resources
    for (i, line) in enumerate(lines):
        tracks_l = line.strip().split(' ')
        Y_list.append(tracks_l[-1])
        X_list.append(tracks_l[:-1])
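
The split used above (hold out each user's last like as the target) can be sketched as a small helper. The function name `split_history` and the rule for one-track histories are assumptions made for illustration; the notebook itself does not special-case short histories.

```python
def split_history(line):
    """Split one line of track ids into (input likes, held-out last like)."""
    tracks = line.split()
    if len(tracks) < 2:
        # Assumption: a history with a single like leaves no input, so skip it.
        return None
    return tracks[:-1], tracks[-1]


toy_example = split_history("11 22 33 44\n")
```

Applying the same split to a held-out part of train gives a local validation set with the same shape as the test file.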

In [None]:
# temporary DataFrame built from the train lines read above
t = pd.DataFrame({'track_liked': X_list, 'y': Y_list})

t['y'] = t['y'].astype('int64')
t = t.merge(track_likes_artist_df, how='left', left_on='y', right_on='track', copy=False)
t.drop(columns=['track', 'likes_count'], inplace=True)
t.rename(columns={'artistId':'y_artist'}, inplace=True)
t.head()

In [None]:
## General idea: explore a DataFrame with one row per (user, liked track) pair (not enough resources to train a neural network with embeddings)
# one row per (user, liked track) pair
user_preferences_df = t.explode('track_liked').reset_index()
user_preferences_df.rename(columns={'index':'user_id'}, inplace=True)
user_preferences_df = user_preferences_df.astype('int64')
user_preferences_df.head()

In [None]:
# merge DataFrames to find the favourite artist of each user
user_track_artist_df = user_preferences_df.merge(track_likes_artist_df, how='left', left_on='track_liked', right_on='track', copy=False)
user_track_artist_df.drop(columns=['track'], inplace=True)
# count each user's likes per artist, to be added as a new column
# (name= keeps the count column name stable across pandas versions)
t2 = user_track_artist_df[['user_id', 'artistId']].value_counts().reset_index(name='user_likes_per_artist')
# general df with user likes per artist
user_track_artist_df = user_track_artist_df.merge(t2, how='left', on=['user_id', 'artistId'])
user_track_artist_df.sample(5)

In [None]:
# get the most liked track of each artist
t3 = track_likes_artist_df.sort_values(["artistId", "likes_count"]).groupby("artistId").tail(1)
t3 = t3.drop(columns=['likes_count'])  # avoid an in-place drop on a derived frame
user_track_artist_df = user_track_artist_df.merge(t3, how='left', on='artistId')
user_track_artist_df.rename(columns={'track':'best_song_of_artist'}, inplace=True)

In [None]:
# general df
user_track_artist_df.head()
# user_track_artist_df.to_csv('user_track_artist_df.csv')

In [None]:
# Prepare the data for the models
features = ['user_id', 'track_liked', 'likes_count', 'artistId', 'user_likes_per_artist', 'best_song_of_artist']
y = user_track_artist_df['y']
X = user_track_artist_df[features]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import time

In [None]:
#Classifier implementing the k-nearest neighbors vote.
startTime = time.time()
neigh_model = KNeighborsClassifier(n_neighbors=15)
neigh_model.fit(X_train, y_train) 
neigh_y_pred = neigh_model.predict(X_test)
print(accuracy_score(neigh_y_pred, y_test))
executionTime = (time.time() - startTime)
print('Execution time in mins: ' + str(executionTime/60))
gc.collect()

In [None]:
# Random forest classifier
startTime = time.time()
rfc_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
rfc_model.fit(X_train, y_train) 
rfc_y_pred = rfc_model.predict(X_test)
print(accuracy_score(rfc_y_pred, y_test))
executionTime = (time.time() - startTime)
print('Execution time in mins: ' + str(executionTime/60))
gc.collect()

In [None]:
#Naive Bayes classifier for multinomial models
startTime = time.time()
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train) 
nb_y_pred = nb_model.predict(X_test)
print(accuracy_score(nb_y_pred, y_test))
executionTime = (time.time() - startTime)
print('Execution time in mins: ' + str(executionTime/60))
gc.collect()

neigh_accuracy_score = 0.007493232983429062
rfc_accuracy_score = 0.01964085297418631 # best result
nb_accuracy_score = 0.001270878721859114


In [None]:
# read the test data
X_list = []
with open('test') as f:
    lines = list(islice(f, 0, None))  # read all lines; change the bounds to read only a part
    for (i, line) in enumerate(lines):
        tracks_l = line.strip().split(' ')
        X_list.append(tracks_l)

In [None]:
# create a df from the test data
test_df = pd.DataFrame({'track_liked': X_list})
test_df = test_df.explode('track_liked').reset_index()
test_df.rename(columns={'index':'user_id'}, inplace=True)
test_df = test_df.astype('int64')
test_df.head()

In [None]:
# prepare the test df for prediction
test_df = test_df.merge(track_likes_artist_df, how='left', left_on='track_liked', right_on='track', copy=False)
test_df.drop(columns=['track'], inplace=True)

# name= keeps the count column name stable across pandas versions
tt2 = test_df[['user_id', 'artistId']].value_counts().reset_index(name='user_likes_per_artist')
test_df = test_df.merge(tt2, how='left', on=['user_id', 'artistId'])

test_df = test_df.merge(t3, how='left', on='artistId')
test_df.rename(columns={'track':'best_song_of_artist'}, inplace=True)
test_df.head()
gc.collect()

In [None]:
# predict on the test df in chunks because of limited resources
pred_list = []
for i in trange(0, len(test_df), 100000):
    # select the training feature columns explicitly so they match the fitted model
    rfc_test_pred = rfc_model.predict(test_df[features].iloc[i: i+100000])
    pred_list.append(rfc_test_pred)
    gc.collect()


In [None]:
# create a df with the predictions from all chunks
result_df = pd.DataFrame(np.concatenate(pred_list), columns=['rfc_prediction'])
# result_df.to_csv('pred_test_5k.csv')

In [None]:
# concat dataframes to get result pairs: 'user_id'--'rfc_prediction'
f = pd.concat([test_df, result_df], axis=1)
f.head()

In [None]:
rfc_result = f[['user_id', 'rfc_prediction']
        ].groupby(by=['user_id', 'rfc_prediction']
        ).size(
        ).reset_index(
        ).rename(columns={0:'pred_count'}
        ).sort_values(by=['user_id', 'pred_count'], ascending=[True,False]
        )[['user_id', 'rfc_prediction']
        ].groupby(by='user_id').agg({'rfc_prediction': lambda x: x.tolist()})
rfc_result.head()
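
The aggregation above turns one prediction per (user, liked track) row into a per-user list ordered by how often each track was predicted. A minimal pure-Python sketch of the same idea, on toy data, could look like this; `rank_by_votes` is a hypothetical name.

```python
from collections import Counter, defaultdict


def rank_by_votes(user_ids, predictions):
    """Group row-level predictions by user and order them by frequency."""
    votes = defaultdict(Counter)
    for user_id, track in zip(user_ids, predictions):
        votes[user_id][track] += 1
    # most_common() sorts by count, highest first
    return {user: [track for track, _ in counter.most_common()]
            for user, counter in votes.items()}


# Toy data: user 0 has three rows (track 9 predicted twice), user 1 has one row.
toy_users = [0, 0, 0, 1]
toy_preds = [7, 9, 9, 5]
toy_ranked = rank_by_votes(toy_users, toy_preds)
```

Since a classifier outputs a single track per row, a user with few distinct predictions ends up with a short list, which is why the later cells pad it with other candidate sources.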

Tuning the naive baseline to improve predictions

1. Idea: recommend more tracks by the user's most liked artist

In [None]:
# get the favourite artist of each user
fav_art_df = test_df[['user_id', 'track_liked', 'artistId', 'user_likes_per_artist']].sort_values(by=['user_id', 'artistId', 'user_likes_per_artist'], ascending=[True, True, False])

tdf = fav_art_df.groupby('user_id')['user_likes_per_artist'].max().reset_index()

fav_art_df = fav_art_df.merge(tdf, how='left', on='user_id', suffixes=['_all', '_max'])
fav_art_df.head()

In [None]:
# users_favorite_artist_and_his_liked_tracks_df
ufaahlt_df = fav_art_df.loc[fav_art_df['user_likes_per_artist_all']==fav_art_df['user_likes_per_artist_max']][['user_id', 'track_liked', 'artistId']]

In [None]:
# for each user, collect up to five tracks by the favourite artist that the user has not liked yet
# (note: head() takes the first matching rows; they are not sorted by likes_count)
more_fav_artist_songs = {}
for i in trange(ufaahlt_df['user_id'].max() +1):
    a_id = ufaahlt_df[ufaahlt_df['user_id']==i]['artistId'].unique()[0]
    tr_list = ufaahlt_df[ufaahlt_df['user_id']==i]['track_liked'].to_list()
    more_fav_artist_songs[i] = track_likes_artist_df.loc[(track_likes_artist_df['artistId']==a_id) & (~track_likes_artist_df['track'].isin(tr_list))].head()['track'].to_list()
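
As an illustration of the favourite-artist idea, here is a small dictionary-based sketch. Unlike the cell above, it orders the candidates by overall popularity before truncating; that ordering, the function name `favourite_artist_candidates`, and the toy dictionaries are assumptions made for the example.

```python
from collections import Counter


def favourite_artist_candidates(liked_tracks, track_artist, track_likes, limit=5):
    """Return unseen tracks by the user's most liked artist, most popular first."""
    artist_counts = Counter(track_artist[t] for t in liked_tracks if t in track_artist)
    if not artist_counts:
        return []
    favourite = artist_counts.most_common(1)[0][0]
    liked = set(liked_tracks)
    candidates = [t for t, artist in track_artist.items()
                  if artist == favourite and t not in liked]
    candidates.sort(key=lambda t: track_likes.get(t, 0), reverse=True)
    return candidates[:limit]


# Toy catalogue: tracks 1, 2, 3, 5 belong to artist "A", track 4 to artist "B".
toy_track_artist = {1: "A", 2: "A", 3: "A", 4: "B", 5: "A"}
toy_track_likes = {1: 10, 2: 3, 3: 8, 4: 50, 5: 1}
toy_fav = favourite_artist_candidates([1, 4, 2], toy_track_artist, toy_track_likes)
```

Working with plain dictionaries also avoids filtering the whole DataFrame once per user, which is what makes the loop above slow on the full test set.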

In [None]:
fav_artist_pred_df = pd.DataFrame(more_fav_artist_songs.items())

In [None]:
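# DataFrame with two prediction columns: from the ML model and from more_fav_artist_songs
ml_fa_df = rfc_result.reset_index().merge(fav_artist_pred_df, how='left', left_on='user_id', right_on=0).drop(columns=0).rename(columns={1:'fa_prediction'})
# combined list used below to skip tracks that are already recommended
# (assumed definition: the original cell that created 'ml_fa' is not in this notebook)
ml_fa_df['ml_fa'] = ml_fa_df['rfc_prediction'] + ml_fa_df['fa_prediction']
ml_fa_df.head()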

2. Idea: recommend more tracks by the artist whose track was liked last

In [None]:
# get the artist of the last liked track
two_lists_df = test_df[['user_id','track_liked', 'artistId']].groupby(by='user_id').agg({'track_liked': lambda x: x.tolist(), 'artistId': lambda x: x.tolist()})

two_lists_df['last_like_artist'] =  [item[-1] for item in two_lists_df['artistId']]
two_lists_df.head()

In [None]:
# collect tracks by the last liked artist that the user has not liked and that are not already recommended --> DataFrame
more_last_artist_songs = {}
for i in trange(two_lists_df.last_valid_index()+1):
    a_id = two_lists_df.iloc[i]['last_like_artist']
    tr_list = two_lists_df.iloc[i]['track_liked']
    tr_list2 = ml_fa_df.iloc[i]['ml_fa']
    more_last_artist_songs[i] = track_likes_artist_df.loc[
                                (track_likes_artist_df['artistId']==a_id) & (
                                ~track_likes_artist_df['track'].isin(tr_list2)) & (
                                ~track_likes_artist_df['track'].isin(tr_list))].head()['track'].to_list()
last_songs_likes_df = pd.DataFrame(more_last_artist_songs.items()) 

In [None]:
# dataframe with predictions from 3 sources: machine learning (rfc), other songs of the favourite artist (fa), other tracks by the artist of the last liked song (ls)
res3 = ml_fa_df.merge(last_songs_likes_df, how='left', left_on='user_id', right_on=0).drop(columns=0).rename(columns={1:'ls_prediction'})
res3.head()

3. Getting recommendations from the most popular songs in general

In [None]:
# Top 100 tracks general
popular_tracks = sorted(track_stats.items(), key=lambda item: item[1], reverse=True)[:100]
popular_tracks_list = [x[0] for x in popular_tracks]

# top 1000 tracks general
top_tracks = sorted(track_stats.items(), key=lambda item: item[1], reverse=True)[:1000]
top_tracks_set = set([x[0] for x in top_tracks])

# dict with scores for the top 1000 tracks (square root of the likes count)
global_track_score = {}
for track in top_tracks:
    global_track_score[track[0]] = track_stats[track[0]] ** 0.5

In [None]:
# dictionary: track -> {other track: number of users who liked both}, restricted to the top 1000 tracks
track_count = {}
with open('train') as f:
    lines = f.readlines()
    for line in lines:
        tracks = line.strip().split(' ')
        filtered_tracks = []
        for track in tracks:
            if track in top_tracks_set:
                filtered_tracks.append(track)
        for i in range(len(filtered_tracks)):
            track1 = filtered_tracks[i]
            for j in range(len(filtered_tracks)):
                if i != j:
                    track2 = filtered_tracks[j]
                    if track1 not in track_count:
                        track_count[track1] = {}
                    current_count = track_count[track1]
                    if track2 not in current_count:
                        current_count[track2] = 0
                    current_count[track2] += 1

In [None]:
# getting recommendations for the test data
with open('test') as f:
    test = f.readlines()
result = []
empty_track_score = 0
for query in test:
    test_tracks = query.strip().split(' ')
    track_score = {}
    for track in test_tracks:
        if track in track_count:
            for track_id in track_count[track]:
                score = track_count[track][track_id]
                if track_id not in track_score:
                    track_score[track_id] = 0
                track_score[track_id] += score / global_track_score[track] / global_track_score[track_id]
    if len(track_score) == 0:
        result.append(' '.join(popular_tracks_list) + '\n')
        empty_track_score += 1
    else:
        best_tracks = sorted(track_score.items(), key=lambda item: item[1], reverse=True)[:100]
        result.append(' '.join([x[0] for x in best_tracks]) + '\n')
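
The score accumulated above divides each co-occurrence count by the square roots of both tracks' like counts, which is a cosine-style similarity between tracks summed over everything the user has liked. The following is a compact sketch of that scoring on toy data. It assumes each track appears at most once per history and, for brevity, does not restrict itself to the top 1000 tracks; `build_cooccurrence` and `recommend` are hypothetical names.

```python
from collections import defaultdict
from itertools import permutations
from math import sqrt


def build_cooccurrence(histories):
    """Count likes per track and co-likes per ordered track pair."""
    likes = defaultdict(int)
    pairs = defaultdict(lambda: defaultdict(int))
    for tracks in histories:
        for track in tracks:
            likes[track] += 1
        for a, b in permutations(tracks, 2):
            pairs[a][b] += 1
    return likes, pairs


def recommend(liked, likes, pairs, k=100):
    """Rank tracks by summed co-occurrence / (sqrt(likes_a) * sqrt(likes_b))."""
    scores = defaultdict(float)
    for a in liked:
        for b, n_ab in pairs.get(a, {}).items():
            scores[b] += n_ab / sqrt(likes[a]) / sqrt(likes[b])
    return sorted(scores, key=scores.get, reverse=True)[:k]


toy_histories = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
toy_likes, toy_pairs = build_cooccurrence(toy_histories)
toy_recs = recommend(["a"], toy_likes, toy_pairs)
```

Like the cell above, this sketch does not remove tracks the user has already liked from the ranking; filtering them out is a possible refinement, since an already liked track cannot be the missing last like.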
    

In [None]:
top_song_df = pd.DataFrame(result).reset_index()
top_song_df.rename(columns={'index':'user_id', 0: 'top_song_pr'}, inplace=True)

In [None]:
top_song_df['list_top_song_pr'] = [[int(x) for x in top_song_df['top_song_pr'][i].strip().split()] for i in range(len(top_song_df))]

In [None]:
# dataframe with predictions from 4 sources: machine learning (rfc),
#                                           other songs of the favourite artist (fa),
#                                           other tracks by the artist of the last liked song (ls),
#                                           recommendations from the most popular (liked) songs in general
# merge the list column (not the raw string) so it can be concatenated with the other prediction lists
res3 = res3.merge(top_song_df[['user_id', 'list_top_song_pr']], how='left', on='user_id')
res3.head()

In [None]:
# prepare the data for the final check: the same sources concatenated in different priority orders (to compare their leaderboard scores)
res3['ml_fa_ls'] = res3['rfc_prediction'] + res3['fa_prediction'] + res3['ls_prediction'] + res3['list_top_song_pr']
res3['ml_ls_fa'] = res3['rfc_prediction'] + res3['ls_prediction'] + res3['fa_prediction'] + res3['list_top_song_pr']
res3['ls_ml_fa'] = res3['ls_prediction'] + res3['rfc_prediction'] + res3['fa_prediction'] + res3['list_top_song_pr']
res3['ls_fa_ml'] = res3['ls_prediction'] + res3['fa_prediction'] + res3['rfc_prediction'] + res3['list_top_song_pr']
res3['fa_ml_ls'] = res3['fa_prediction'] + res3['rfc_prediction'] + res3['ls_prediction'] + res3['list_top_song_pr']
res3['fa_ls_ml'] = res3['fa_prediction'] + res3['ls_prediction'] + res3['rfc_prediction'] + res3['list_top_song_pr']

In [None]:
# remove duplicates from the list, keep the order of elements, keep at most 100 elements
def remove_dup(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)][:100]
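
The same order-preserving deduplication can be sketched with `dict.fromkeys`, together with a helper that turns the merged list into one submission line. Both function names and the toy lists are hypothetical; the sketch only illustrates how several candidate sources can be merged in priority order.

```python
from itertools import chain


def merge_candidates(*candidate_lists, limit=100):
    """Concatenate lists in priority order, drop repeats, keep at most `limit`."""
    # dict keys keep first-insertion order, so earlier lists win on duplicates
    merged = dict.fromkeys(chain.from_iterable(candidate_lists))
    return list(merged)[:limit]


def to_submission_line(tracks):
    """Join track ids with spaces, as required by the output format."""
    return ' '.join(str(track) for track in tracks)


toy_merged = merge_candidates([5, 7], [7, 9], [5, 11])
toy_line = to_submission_line(toy_merged)
```

Because MRR rewards the correct track appearing early, the order in which the sources are passed matters, which is why the notebook tries all six orderings of the three main sources.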

In [None]:
iter_pred = ['ml_fa_ls', 'ml_ls_fa', 'ls_ml_fa', 'ls_fa_ml', 'fa_ml_ls', 'fa_ls_ml']
for el in iter_pred:
    res3[el] = res3[el].apply(remove_dup)

In [None]:
# writing files to check the score on the YandexCup site
for elem in iter_pred:
    file = 'result_' + elem
    with open(file, 'w') as f:  # open each file once and write one line per user
        for tracks in res3[elem]:
            # track ids are integers here, so convert them to strings before joining
            f.write(' '.join(map(str, tracks)) + '\n')

CONCLUSION:
The idea of exploding the dataframe did not work well. A better model (RandomForestClassifier) plus tuning of the input data doubled the score compared with the 'naive' code.
The most promising idea: working with the last liked songs (even though we had no timestamps for the likes).
Final result: 78th place out of 251 participants (Ycup22).