#### [Background]
Hello, everyone! 😀 Welcome to the 2nd Korea Startup Forum x DACON Book Recommendation Algorithm AI Competition (recruitment-linked).



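A recommender system analyzes past purchasing behavior to automatically find and surface the products a customer is likely to buy next.

This lets customers discover the products they want with less effort and makes purchase decisions easier, which in turn helps increase the store's sales.

Because a better customer experience drives revenue, many companies are now investing heavily in developing recommender systems.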



This competition is recruitment-linked: participants who demonstrate strong problem-solving skills will be recommended to startups.



#### [Topic]
Develop an AI model for a book recommendation algorithm



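#### [Description]
Using user information and book information,

participants must develop an AI model that predicts, as a regression task, the rating a user gave to a book.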



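This is an individual (non-team), recruitment-linked competition.

After the competition ends, DACON verifies the final code and solution materials and recommends outstanding participants to startups that are members of the Korea Startup Forum.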
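
Because the task is framed as rating regression, the notebook below scores every model with RMSE (root mean squared error). A minimal sketch of that metric follows, using made-up ratings on the 0 to 10 scale; the function name `rmse` and the sample values are illustrative and do not come from the competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length rating lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: the errors are 1, 3 and 0, so the mean squared error is 10/3.
actual = [8, 0, 10]
predicted = [7.0, 3.0, 10.0]
score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

Large misses dominate this metric because errors are squared before averaging, which is worth remembering when many ratings sit at the extremes of the scale.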

In [1]:
import time
import pandas as pd
from surprise import SVD, SVDpp, Dataset, Reader, accuracy
from surprise.model_selection import cross_validate, GridSearchCV

## Load Data

In [2]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [3]:
# Create the Reader and Dataset objects for the Surprise library
reader = Reader(rating_scale=(0, 10))
train = Dataset.load_from_df(train[['User-ID', 'Book-ID', 'Book-Rating']], reader)
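
For orientation, the `SVD` model tuned in the next section estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. By default Surprise also clips the estimate to the `rating_scale` given to the `Reader`. The sketch below mirrors that rule in plain Python. All names (`predict_rating`, `mu`, `b_u`, `b_i`, `p_u`, `q_i`) and numbers are made up for illustration, and this is not Surprise's actual code.

```python
def predict_rating(mu, b_u=None, b_i=None, p_u=None, q_i=None, scale=(0, 10)):
    """Biased matrix-factorization estimate, clipped to the rating scale.

    Terms for an unknown user or item are passed as None and skipped, so a
    pair never seen in training falls back to the global mean.
    """
    est = mu
    if b_u is not None:
        est += b_u
    if b_i is not None:
        est += b_i
    if p_u is not None and q_i is not None:
        est += sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, est))

# Known user and known book: 3.0 + 1.5 - 0.5 + (0.2*0.5 - 0.4*1.0 + 0.1*2.0)
known = predict_rating(3.0, b_u=1.5, b_i=-0.5, p_u=[0.2, -0.4, 0.1], q_i=[0.5, 1.0, 2.0])
# Unknown user and unknown book: only the global mean remains
cold = predict_rating(3.0)
# A raw estimate above the scale is clipped to the upper bound
clipped = predict_rating(9.0, b_u=1.5, b_i=1.0)
print(known, cold, clipped)
```

The fallback matters for the test set: any user or book that is absent from the training data contributes no learned term to its prediction.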

## Parameters Tuning

### Tuning SVD

In [None]:
param_grid = {"n_epochs": [15], "lr_all": [0.0035, 0.004], "reg_all": [0.5, 0.1, 0.15], "n_factors": [3, 5, 10]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
gs.fit(train)
print("Done.")
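
`GridSearchCV` tries every combination in `param_grid` and cross-validates each one, so the cost grows multiplicatively with every value added to the grid. A quick way to count the work for the grid above, assuming one fit per combination per fold:

```python
from itertools import product

param_grid = {
    "n_epochs": [15],
    "lr_all": [0.0035, 0.004],
    "reg_all": [0.5, 0.1, 0.15],
    "n_factors": [3, 5, 10],
}
cv = 3

# Cartesian product of all value lists, one dict per candidate configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_combinations = len(combinations)
n_fits = n_combinations * cv
print(f"{n_combinations} combinations x {cv} folds = {n_fits} model fits")
```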

In [23]:
# 1st RUN
print(gs.best_score)
print(gs.best_params)

{'rmse': 3.419399762751754}
{'rmse': {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.2, 'n_factors': 5}}


In [7]:
# 2nd RUN
print(gs.best_score)
print(gs.best_params)

{'rmse': 3.418821353387598}
{'rmse': {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.1, 'n_factors': 3}}


In [10]:
# 3rd RUN
print(gs.best_score)
print(gs.best_params)

{'rmse': 3.4168375355416924}
{'rmse': {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.15, 'n_factors': 3}}


#### Checking how well SVD does

In [5]:
params = {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.15, 'n_factors': 3}
svd = SVD(**params)
cross_validate(svd, train, measures=['RMSE'], cv=3, verbose=False)

{'test_rmse': array([3.41994225, 3.41433136, 3.41046539]),
 'fit_time': (3.3977558612823486, 3.500286340713501, 3.52862811088562),
 'test_time': (1.9803352355957031, 1.8988232612609863, 1.9033949375152588)}

In [4]:
params = {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.15, 'n_factors': 5}
svd = SVD(**params)
cross_validate(svd, train, measures=['RMSE'], cv=3, verbose=False)

{'test_rmse': array([3.41637341, 3.41539549, 3.42517588]),
 'fit_time': (3.4813151359558105, 3.667008399963379, 3.6680119037628174),
 'test_time': (2.280870199203491, 1.8877503871917725, 1.9583783149719238)}
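
With only three folds, a single fold can swing the comparison, so it may help to look at the mean and spread of the fold scores rather than individual values. The sketch below does that for the two `cross_validate` outputs above. The RMSE values are copied from those outputs; the variable names are illustrative.

```python
from statistics import mean, stdev

# test_rmse arrays copied from the two cross_validate outputs above
fold_rmse = {
    "n_factors=3": [3.41994225, 3.41433136, 3.41046539],
    "n_factors=5": [3.41637341, 3.41539549, 3.42517588],
}

summary = {name: (mean(scores), stdev(scores)) for name, scores in fold_rmse.items()}
for name, (avg, spread) in summary.items():
    print(f"{name}: mean RMSE = {avg:.5f}, std = {spread:.5f}")

best = min(summary, key=lambda name: summary[name][0])
print("Lower mean RMSE:", best)
```

The gap between the two means (about 0.004) is of the same order as the fold-to-fold standard deviation, so the preference for `n_factors=3` is a weak one.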

### Tuning SVD++

In [11]:
param_grid = {"n_epochs": [15], "lr_all": [0.003, 0.004], "reg_all": [0.1, 0.2], "n_factors": [5, 10]}
gs = GridSearchCV(SVDpp, param_grid, measures=["rmse"], cv=3)
gs.fit(train)
print("Done.")

Done.


In [12]:
# 1st RUN
print(gs.best_score)
print(gs.best_params)

{'rmse': 3.427634180220371}
{'rmse': {'n_epochs': 15, 'lr_all': 0.004, 'reg_all': 0.2, 'n_factors': 5}}


## Optional: Filter Sparse Books and Users

In [4]:
print(f"Original training data size : {train.df.shape[0]}")
# Keep only books that have at least 5 ratings
train_clean = train.df.groupby('Book-ID').filter(lambda x: len(x) >= 5)
print(f"After removing books with fewer than 5 ratings : {train_clean.shape[0]}")
# Keep only users who rated at least 5 of the remaining books
train_clean = train_clean.groupby('User-ID').filter(lambda x: len(x) >= 5)
print(f"After removing users with fewer than 5 ratings : {train_clean.shape[0]}")
train = Dataset.load_from_df(train_clean, reader)
print("Done.")

Original training data size : 871393
After removing books with fewer than 5 ratings : 547006
After removing users with fewer than 5 ratings : 469321
Done.
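
One thing to keep in mind with this two-step filter is that dropping users in the second step can push some books back below the threshold, so a single pass does not guarantee that every remaining book still has at least 5 ratings. The sketch below shows the effect, and one way to repeat the filter until nothing changes, on toy `(user, book)` pairs with a threshold of 2. The function names and data are hypothetical.

```python
from collections import Counter

def filter_once(rows, min_count):
    """One pass: drop sparse books first, then sparse users (same order as above)."""
    book_counts = Counter(book for _, book in rows)
    rows = [(u, b) for u, b in rows if book_counts[b] >= min_count]
    user_counts = Counter(user for user, _ in rows)
    return [(u, b) for u, b in rows if user_counts[u] >= min_count]

def filter_until_stable(rows, min_count):
    """Repeat the pass until no more rows are removed."""
    while True:
        filtered = filter_once(rows, min_count)
        if len(filtered) == len(rows):
            return filtered
        rows = filtered

rows = [
    ("ua", "bx"), ("ua", "by"),
    ("ub", "bx"), ("ub", "by"),
    ("uc", "bx"), ("uc", "bz"),
    ("ud", "bz"),
]

one_pass = filter_once(rows, min_count=2)
stable = filter_until_stable(rows, min_count=2)

# After one pass, book "bz" is left with a single rating (from "uc")
print(Counter(book for _, book in one_pass))
print(len(one_pass), "rows after one pass,", len(stable), "rows once stable")
```

Whether the stricter version is preferable depends on how much training data one is willing to discard; the notebook's single pass keeps more rows.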


## Ensemble Training

In [4]:
train = train.build_full_trainset()

##################################################
# SVD++ #
svdpp_params = {
    'n_epochs': 15, 
    'lr_all': 0.004, 
    'reg_all': 0.15,
    'cache_ratings': False
}
model1 = SVDpp(**svdpp_params, n_factors=3)
model2 = SVDpp(**svdpp_params, n_factors=4)

##################################################
# SVD #
svd_params = {
    'n_epochs': 15, 
    'lr_all': 0.004
}
model3 = SVD(**svd_params, reg_all=0.1, n_factors=3)
model4 = SVD(**svd_params, reg_all=0.1, n_factors=4)
model5 = SVD(**svd_params, reg_all=0.1, n_factors=5)
model6 = SVD(**svd_params, reg_all=0.15, n_factors=3)
model7 = SVD(**svd_params, reg_all=0.15, n_factors=4)
model8 = SVD(**svd_params, reg_all=0.15, n_factors=5)
model9 = SVD(**svd_params, reg_all=0.125, n_factors=3)
model10 = SVD(**svd_params, reg_all=0.125, n_factors=4)
model11 = SVD(**svd_params, reg_all=0.125, n_factors=5)

# all_models = [model1, model2, model3, model4, model5, model6, model7, model8]
all_svds = [model3, model4, model5, model6, model7, model8, model9, model10, model11]

for i, model in enumerate(all_svds):
    start_time = time.time()
    print("==>> TRAINING model", i+1)
    
    model.fit(train)
    
    end_time = time.time()
    duration = (end_time-start_time)/60
    print(f"=== Time took : ({duration:.2f}) minutes ===")
    

==>> TRAINING model 1
=== Time took : (0.04) minutes ===
==>> TRAINING model 2
=== Time took : (0.04) minutes ===
==>> TRAINING model 3
=== Time took : (0.04) minutes ===
==>> TRAINING model 4
=== Time took : (0.04) minutes ===
==>> TRAINING model 5
=== Time took : (0.04) minutes ===
==>> TRAINING model 6
=== Time took : (0.04) minutes ===
==>> TRAINING model 7
=== Time took : (0.04) minutes ===
==>> TRAINING model 8
=== Time took : (0.04) minutes ===
==>> TRAINING model 9
=== Time took : (0.04) minutes ===


## Ensemble Prediction

In [5]:
submit1 = pd.read_csv('./sample_submission.csv')
submit2 = pd.read_csv('./sample_submission.csv')
submit3 = pd.read_csv('./sample_submission.csv')
submit4 = pd.read_csv('./sample_submission.csv')
submit5 = pd.read_csv('./sample_submission.csv')
submit6 = pd.read_csv('./sample_submission.csv')
submit7 = pd.read_csv('./sample_submission.csv')
submit8 = pd.read_csv('./sample_submission.csv')
submit9 = pd.read_csv('./sample_submission.csv')
submit10 = pd.read_csv('./sample_submission.csv')
submit11 = pd.read_csv('./sample_submission.csv')

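# model1 and model2 (SVD++) are not in all_svds, so they are not fit in this run.
# Uncomment the next two lines only after fitting them.
# submit1['Book-Rating'] = test.apply(lambda row: model1.predict(row['User-ID'], row['Book-ID']).est, axis=1)
# submit2['Book-Rating'] = test.apply(lambda row: model2.predict(row['User-ID'], row['Book-ID']).est, axis=1)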
submit3['Book-Rating'] = test.apply(lambda row: model3.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit4['Book-Rating'] = test.apply(lambda row: model4.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit5['Book-Rating'] = test.apply(lambda row: model5.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit6['Book-Rating'] = test.apply(lambda row: model6.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit7['Book-Rating'] = test.apply(lambda row: model7.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit8['Book-Rating'] = test.apply(lambda row: model8.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit9['Book-Rating'] = test.apply(lambda row: model9.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit10['Book-Rating'] = test.apply(lambda row: model10.predict(row['User-ID'], row['Book-ID']).est, axis=1)
submit11['Book-Rating'] = test.apply(lambda row: model11.predict(row['User-ID'], row['Book-ID']).est, axis=1)
print("Done.")

Done.


### SVD++s Ensemble

In [6]:
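# submit1 and submit2 hold SVD++ predictions only if model1 and model2 were fit and
# predicted with; otherwise this averages the untouched sample_submission values.
ensemble_svdpp_submit = pd.concat([submit1, submit2])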
averaged_ratings = ensemble_svdpp_submit.groupby('ID')['Book-Rating'].mean().reset_index()
averaged_ratings.to_csv('./Submissions/4_tuned_ensemble_2SVDpps.csv', index=False)

### SVDs Ensemble

In [6]:
ensemble_svd_submit = pd.concat([submit3, submit4, submit5, submit6, submit7, submit8, submit9, submit10, submit11])
averaged_ratings = ensemble_svd_submit.groupby('ID')['Book-Rating'].mean().reset_index()
averaged_ratings.to_csv('./Submissions/5_ensemble_SVDs.csv', index=False)
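
The `concat` plus `groupby('ID').mean()` pattern used here is a plain unweighted average of the per-model predictions for each ID. Note that `groupby` sorts by key by default, so the output rows come back ordered by `ID`, which may or may not match the row order of `sample_submission.csv`. Below is a dependency-free sketch of the same averaging that keeps the first model's row order; the IDs and ratings are made up.

```python
def average_predictions(model_outputs):
    """Average ratings per ID across models, keeping first-seen ID order.

    model_outputs: list of lists of (id, rating) pairs, one list per model.
    """
    totals = {}
    counts = {}
    for rows in model_outputs:
        for row_id, rating in rows:
            totals[row_id] = totals.get(row_id, 0.0) + rating
            counts[row_id] = counts.get(row_id, 0) + 1
    # dicts preserve insertion order, so IDs come out as first seen
    return [(row_id, totals[row_id] / counts[row_id]) for row_id in totals]

model_a = [("TEST_2", 6.0), ("TEST_1", 4.0)]
model_b = [("TEST_2", 8.0), ("TEST_1", 5.0)]
model_c = [("TEST_2", 7.0), ("TEST_1", 3.0)]
averaged = average_predictions([model_a, model_b, model_c])
print(averaged)
```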

### All Models Ensemble

In [7]:
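# This list needs SVD++ predictions in submit1 and submit2 (commented out above)
# and stops at submit8, so submit9 through submit11 are not part of this average.
ensemble_all_submit = pd.concat([submit1, submit2, submit3, submit4, submit5, submit6, submit7, submit8])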
averaged_ratings = ensemble_all_submit.groupby('ID')['Book-Rating'].mean().reset_index()
averaged_ratings.to_csv('./Submissions/4_ensemble_AllSVDs.csv', index=False)