## Recommender systems

##### Input data

You are given two samples of user sessions: ids of viewed items and ids of purchased items. One sample is used for training (estimating item popularity), the other for testing.

Each file contains one session per line. Session format: ids of viewed items separated by `,`, then `;`, then ids of purchased items (if any), separated by commas. For example, `1,2,3,4;` or `1,2,3,4;5,6`.

It is guaranteed that the ids of purchased items within a session are all distinct.
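To make the session format concrete, here is a minimal sketch of parsing a single line. The helper name `parse_session` is hypothetical and is not part of the notebook, which loads the files with `pd.read_csv` instead.

```python
def parse_session(line):
    """Split one session line into (viewed, bought) lists of string ids."""
    # Everything before ';' is the viewed part, everything after is the bought part.
    viewed_part, _, bought_part = line.strip().partition(";")
    viewed = viewed_part.split(",") if viewed_part else []
    bought = bought_part.split(",") if bought_part else []
    return viewed, bought


print(parse_session("1,2,3,4;"))     # (['1', '2', '3', '4'], [])
print(parse_session("1,2,3,4;5,6"))  # (['1', '2', '3', '4'], ['5', '6'])
```

Ids are kept as strings here, which matches how the notebook later uses them as dictionary keys.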

In [1]:
%pylab inline
import pandas as pd
import copy
from collections import Counter

Populating the interactive namespace from numpy and matplotlib


In [2]:
sessions = pd.read_csv('./coursera_sessions_train.txt',delimiter=';', header=None,names=['viewed','bought'])

In [3]:
sessions.info()
sessions.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
viewed    50000 non-null object
bought    3608 non-null object
dtypes: object(2)
memory usage: 781.3+ KB


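| | viewed | bought |
|---|---|---|
| 0 | 0,1,2,3,4,5 | |
| 1 | 9,10,11,9,11,12,9,11 | |
| 2 | 16,17,18,19,20,21 | |
| 3 | 24,25,26,27,24 | |
| 4 | 34,35,36,34,37,35,36,37,38,39,38,39 | |
| 5 | 42 | |
| 6 | 47,48,49 | |
| 7 | 59,60,61,62,60,63,64,65,66,61,67,68,67 | 67,60,63 |
| 8 | 71,72,73,74 | |
| 9 | 76,77,78 | |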


##### Important:

    Sessions in which the user bought nothing are excluded from the quality evaluation.
    If an item did not occur in the training sample, its popularity is 0.
    Recommend distinct items only, and no more of them than the number of distinct items the user viewed.
    The number of recommendations never exceeds the minimum of two numbers: the number of items the user viewed and k in recall@k / precision@k.
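The rules above can be illustrated with a small sketch. The function `top_k` and the toy popularity values are assumptions made for illustration only. A `Counter` returns 0 for ids it has never seen, and slicing with `[:k]` returns fewer than k items when the session has fewer distinct views.

```python
from collections import Counter


def top_k(viewed_ids, popularity, k):
    """Return at most k distinct viewed ids, most popular first."""
    # dict.fromkeys drops repeats while keeping the order of first appearance.
    distinct = list(dict.fromkeys(viewed_ids))
    # sorted() is stable, so equally popular ids keep their viewing order.
    ranked = sorted(distinct, key=lambda item: popularity[item], reverse=True)
    return ranked[:k]


popularity = Counter({"a": 3, "b": 5})  # "c" is unseen, so its popularity is 0
print(top_k(["a", "b", "a", "c"], popularity, k=5))  # ['b', 'a', 'c']
print(top_k(["a", "b", "a", "c"], popularity, k=1))  # ['b']
```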

##### Task

    On the training sample, compute how often each id occurs among viewed items and among purchased items (an id may appear several times in the viewed list; count every occurrence).
    Implement two recommendation algorithms:

    sort the viewed ids by popularity (frequency among viewed items),
    sort the viewed ids by purchase frequency (frequency among purchases).
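As a rough illustration of the task, the sketch below counts frequencies on two made-up training sessions and ranks the same viewed ids in both ways. The data and variable names (`train`, `view_counts`, `buy_counts`) are hypothetical and chosen only to show that the two orderings can differ.

```python
from collections import Counter

# Toy training data: (viewed ids, bought ids) per session.
train = [
    (["1", "2", "1"], ["2"]),
    (["2", "3"], []),
]

view_counts = Counter()
buy_counts = Counter()
for viewed, bought in train:
    view_counts.update(viewed)  # repeated views are all counted
    buy_counts.update(bought)

session = ["3", "1", "2"]  # distinct ids viewed in some new session
by_views = sorted(session, key=lambda x: view_counts[x], reverse=True)
by_purchases = sorted(session, key=lambda x: buy_counts[x], reverse=True)

print(by_views)      # ['1', '2', '3']
print(by_purchases)  # ['2', '3', '1']
```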

### Build frequency dictionaries, split each session into string ids, and deduplicate them

In [4]:
# Work on copies of the raw columns; sessions without purchases get the sentinel -1.
viewed = copy.copy(sessions.viewed.values)
bought = copy.copy(sessions.bought.fillna(-1).values)
viewed_dic = []
bought_dic = []
for idx, item in enumerate(viewed):
    ids = item.split(",")
    viewed_dic += ids               # count every occurrence, including repeats
    viewed[idx] = pd.unique(ids)    # distinct ids in order of first appearance
viewed_dic = Counter(viewed_dic)

for idx, item in enumerate(bought):
    if isinstance(item, str):       # skip the -1 sentinel
        ids = item.split(",")
        bought_dic += ids
        bought[idx] = pd.unique(ids)
bought_dic = Counter(bought_dic)


### Get recommendations from the given dictionaries and sessions

In [5]:
def recommend_t(viewed_dic, viewed, bought_dic, bought, i=0):
    # Debug helper. Note that i is used both as an item id (dictionary lookups)
    # and as a session index (viewed[i], bought[i]).
    print(viewed_dic[str(i)], viewed[i], bought_dic[str(i)], bought[i])

    k1_v, k1_b, k5_v, k5_b = [], [], [], []
    sorted_viewed, sorted_bought = [], []

    for idx, item in enumerate(viewed):
        # sorted() is stable, so ties keep the order in which items were viewed
        sorted_viewed.append(sorted(list(item), key=lambda x: viewed_dic[x], reverse=True))
        sorted_bought.append(sorted(list(item), key=lambda x: bought_dic[x], reverse=True))

        k1_v.append(sorted_viewed[idx][0])
        k5_v.append(sorted_viewed[idx][:5])
        k1_b.append(sorted_bought[idx][0])
        k5_b.append(sorted_bought[idx][:5])

    return {'k1_v': k1_v, 'k5_v': k5_v, 'sorted_viewed': sorted_viewed,
            'k1_b': k1_b, 'k5_b': k5_b, 'sorted_bought': sorted_bought}

a = recommend_t(viewed_dic, viewed, bought_dic, bought, 7)

312 ['59' '60' '61' '62' '63' '64' '65' '66' '67' '68'] 2 ['67' '60' '63']


In [6]:
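def recommend(dic, viewed, k=5):
    # For each session, rank its distinct viewed ids by their frequency in dic
    # (descending) and keep at most k of them. Ids missing from dic count as 0.
    rec = []
    sorting = []

    for idx, item in enumerate(viewed):
        sorting.append(sorted(list(item), key=lambda x: dic[x], reverse=True))
        rec.append(sorting[idx][:k])

    return {'rec': rec, 'sorted': sorting, 'k': k,
            'list': 'rec - list of recommendation; sorted - sorted list; k - top k'}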

In [7]:
t1_v = recommend(viewed_dic, viewed,1)['rec']
t5_v = recommend(viewed_dic, viewed,5)['rec']
t1_b = recommend(bought_dic, viewed,1)['rec']
t5_b = recommend(bought_dic, viewed,5)['rec']


### Computing the metrics

$$Recall@k = \frac{|\text{purchased} \cap \text{top-}k|}{|\text{purchased}|}$$

$$Precision@k = \frac{|\text{purchased} \cap \text{top-}k|}{k}$$

#### AverageRecall@1, AveragePrecision@1, AverageRecall@5, AveragePrecision@5
    Precision@k: the share of recommended items that were purchased.
    Recall@k: the share of purchased items that appeared among the recommendations.
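A worked example for a single session may help. The sketch below follows the convention used by this notebook's `recommend_metrics`: precision divides the number of hits by k, and recall divides it by the number of purchased items. The function name `precision_recall_at_k` and the recommendation list are hypothetical.

```python
def precision_recall_at_k(bought, recommended, k):
    """Return (precision@k, recall@k) for one session with at least one purchase."""
    # Hits are the purchased items that appear among the top-k recommendations.
    hits = len(set(bought) & set(recommended[:k]))
    return hits / k, hits / len(bought)


bought = ["67", "60", "63"]
recommended = ["63", "68", "66", "61", "67"]  # hypothetical ranking

p5, r5 = precision_recall_at_k(bought, recommended, k=5)
p1, r1 = precision_recall_at_k(bought, recommended, k=1)
print(round(p5, 2), round(r5, 2))  # 0.4 0.67
print(round(p1, 2), round(r1, 2))  # 1.0 0.33
```

At k=5 two of the three purchases (`63` and `67`) are recommended, so precision is 2/5 and recall is 2/3. At k=1 the single recommendation is a hit, so precision is 1 while recall drops to 1/3.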

In [8]:
#0.44 0.51 0.82 0.21

In [9]:
def recommend_metrics(bought, top_list, rounded=2):
    # k is inferred as the length of the longest recommendation list
    k = len(max(top_list, key=len))

    recal, precision = [], []
    for idx, item in enumerate(bought):
        if not isinstance(item, int):   # -1 marks a session without purchases
            # number of purchased items that were also recommended
            num = float(len(np.intersect1d(np.array(item), np.array(top_list[idx]))))
            recal.append(num / len(item))
            precision.append(num / k)   # divide by k even if fewer items were recommended
    average_rec = round(np.mean(recal), rounded)
    aver_prec = round(np.mean(precision), rounded)
    return {'recal': recal, 'precision': precision, 'k': k,
            'average_recal': average_rec, 'average_precision': aver_prec}

In [10]:
def write_answer_string_to_file(answer, filename):
    with open(filename, 'w') as f_out:
        f_out.write(answer)

### Recommendations by view frequency: quality on the training sample

In [11]:
AverageRecall_1 = recommend_metrics(bought,t1_v)['average_recal']
AveragePrecision_1 = recommend_metrics(bought,t1_v)['average_precision']
AverageRecall_5 = recommend_metrics(bought,t5_v)['average_recal']
AveragePrecision_5 = recommend_metrics(bought,t5_v)['average_precision']

In [12]:
print(AverageRecall_1,AveragePrecision_1,AverageRecall_5,AveragePrecision_5)

0.44 0.51 0.82 0.21


In [13]:
write_answer_string_to_file('0.44 0.51 0.82 0.21','recomend_1.txt')

### Recommendations by purchase frequency: quality on the training sample

In [14]:
AverageRecall_1 = recommend_metrics(bought,t1_b)['average_recal']
AveragePrecision_1 = recommend_metrics(bought,t1_b)['average_precision']
AverageRecall_5 = recommend_metrics(bought,t5_b)['average_recal']
AveragePrecision_5 = recommend_metrics(bought,t5_b)['average_precision']

In [15]:
print(AverageRecall_1,AveragePrecision_1,AverageRecall_5,AveragePrecision_5)

0.69 0.8 0.93 0.25


In [16]:
write_answer_string_to_file('0.69 0.8 0.93 0.25','recomend_3.txt')

In [17]:
test = pd.read_csv('./coursera_sessions_test.txt',delimiter=';', header=None, names=['viewed','bought'])
sessions.info()  # note: this describes the training frame; test.info() would describe the test set
test.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
viewed    50000 non-null object
bought    3608 non-null object
dtypes: object(2)
memory usage: 781.3+ KB


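| | viewed | bought |
|---|---|---|
| 0 | 6,7,8 | |
| 1 | 13,14,15 | |
| 2 | 22,23 | |
| 3 | 28,29,30,31,32,33 | |
| 4 | 40,41 | |
| 5 | 43,44,43,45,43,45,43,46 | |
| 6 | 50,51,47,52,49,53,54,55,56,57,58 | |
| 7 | 63,68,69,70,66,61,59,61,66,68 | 66,63 |
| 8 | 75 | |
| 9 | 79,80,81,82,83 | |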


In [18]:
# .values gives plain object arrays, matching how the training data was prepared
viewed = copy.copy(test.viewed.values)
bought = copy.copy(test.bought.fillna(-1).values)
for idx, item in enumerate(viewed):
    viewed[idx] = pd.unique(item.split(","))

for idx, item in enumerate(bought):
    if isinstance(item, str):   # skip the -1 sentinel
        bought[idx] = pd.unique(item.split(","))

In [19]:
t1_v = recommend(viewed_dic, viewed,1)['rec']
t5_v = recommend(viewed_dic, viewed,5)['rec']
t1_b = recommend(bought_dic, viewed,1)['rec']
t5_b = recommend(bought_dic, viewed,5)['rec']

### Recommendations by view frequency: quality on the test sample

In [20]:
AverageRecall_1 = recommend_metrics(bought,t1_v)['average_recal']
AveragePrecision_1 = recommend_metrics(bought,t1_v)['average_precision']
AverageRecall_5 = recommend_metrics(bought,t5_v)['average_recal']
AveragePrecision_5 = recommend_metrics(bought,t5_v)['average_precision']
print(AverageRecall_1,AveragePrecision_1,AverageRecall_5,AveragePrecision_5)

0.42 0.48 0.8 0.2


In [24]:
write_answer_string_to_file('0.42 0.48 0.8 0.2','recomend_2.txt')

### Recommendations by purchase frequency: quality on the test sample

In [22]:
AverageRecall_1 = recommend_metrics(bought,t1_b)['average_recal']
AveragePrecision_1 = recommend_metrics(bought,t1_b)['average_precision']
AverageRecall_5 = recommend_metrics(bought,t5_b)['average_recal']
AveragePrecision_5 = recommend_metrics(bought,t5_b)['average_precision']
print(AverageRecall_1,AveragePrecision_1,AverageRecall_5,AveragePrecision_5)

0.46 0.53 0.82 0.21


In [25]:
write_answer_string_to_file('0.46 0.53 0.82 0.21','recomend_4.txt')