# ALS applications

## Dzen dataset

The data comes from the [dzen.ru](https://dzen.ru/) site and consists of likes that users gave to text articles.

### Columns
1. item_id - unique id of an item (article)
2. user_id - unique id of a user
3. source_id - unique id of an author. Two items with the same source_id come from the same author
4. name - title of the article
5. The raw dataset stores each user_id together with the list of item_ids that the user liked

In [36]:
!curl -O -J -L 'https://www.dropbox.com/s/ia4bvhuqg8kesee/zen_dataset.zip?dl=1'
!unzip zen_dataset.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    60    0    60    0     0    183      0 --:--:-- --:--:-- --:--:--   185
100   320  100   320    0     0    394      0 --:--:-- --:--:-- --:--:--   394
100 24.0M  100 24.0M    0     0   109k      0  0:03:45  0:03:45 --:--:--  161k

In [1]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from tqdm.notebook import tqdm
import ast

In [14]:
item_names = pd.read_csv("zen_item_to_name.csv")
item_sources = pd.read_csv("zen_item_to_source.csv")
dataset = pd.read_csv("zen_ratings.csv", converters={'item_ids': ast.literal_eval})

In [15]:
item_names

Unnamed: 0,id,name
0,94962,Что обычно ожидало русских казачек в руках у к...
1,3972,Почему Россия решила строить новую скоростную ...
2,94644,"5 неприличных фактов об Андрее Макаревиче, кот..."
3,82518,"Что стало с красавицей Хмельницкой, которую му..."
4,53264,"Понять и Простить: Почему угонщики, бежавшие и..."
...,...,...
104498,36769,"Плюс один источник мифа о рыцарях, неспособных..."
104499,9190,Мой сад - малоуходный
104500,52731,Купил первую в жизни циркулярную пилу. Честный...
104501,72660,Решили предложить Марине помощь в лечении ч.10


In [16]:
item_sources

Unnamed: 0,id,source
0,94962,2919814402697966089
1,3972,3263022753228392991
2,94644,-3857390427602554682
3,82518,-9036908390349249792
4,53264,3353856219169766284
...,...,...
104498,36769,3818746211375738614
104499,9190,4975535765688979937
104500,52731,3720366796439288909
104501,72660,-7860042973720636310


In [17]:
dataset

Unnamed: 0,user_id,item_ids
0,993675863667353526,"[15267, 61075, 81203, 17066, 25471, 88427, 638..."
1,4250619547882954185,"[4555, 94644, 84972, 17774, 94962, 78217, 2485..."
2,3847785305345691076,"[1898, 26703, 16525, 86939, 55017, 31069, 4035..."
3,1785181112918558233,"[75601, 102458, 28716, 100694, 5757, 47104, 60..."
4,5078748097863903181,"[72260, 40825, 2615, 42549, 379, 100818, 56827..."
...,...,...
75905,4954138831959898373,"[11881, 55520, 63054, 48015, 66952, 103830, 21..."
75906,4967793435819938014,"[74697, 11830, 63858, 87245, 41956, 62089, 686..."
75907,7137764184903122777,"[10353, 1775, 103680, 29704, 9782, 13295, 9975..."
75908,2624987805086334956,"[24324, 18854, 73319, 66641, 64078, 97387, 426..."


In [18]:
total_interactions_count = dataset.item_ids.map(len).sum()
user_coo = np.zeros(total_interactions_count, dtype=np.int64)
item_coo = np.zeros(total_interactions_count, dtype=np.int64)
pos = 0

for user_id, item_ids in enumerate(tqdm(dataset.item_ids)):
    user_coo[pos : pos + len(item_ids)] = user_id
    item_coo[pos : pos + len(item_ids)] = item_ids
    pos += len(item_ids)
    
shape = (max(user_coo) + 1, max(item_coo) + 1)
user_item_matrix = sp.coo_matrix(
    (np.ones(len(user_coo)), (user_coo, item_coo)), shape=shape
)
user_item_matrix = user_item_matrix.tocsr()
sp.save_npz("data_train.npz", user_item_matrix)
# Free memory; only data_train.npz is needed from here on
del user_coo
del item_coo
del dataset


In [24]:
# You can start here if the precomputation above has already been done
user_item_matrix = sp.load_npz("data_train.npz")

In [25]:
user_item_matrix

<75910x104503 sparse matrix of type '<class 'numpy.float64'>'
	with 5792423 stored elements in Compressed Sparse Row format>

In [33]:
def sparce_matrix_report(matrix):
    print('Size of raw data:', matrix.data.nbytes / 10**6, 'Mb')
    print('Feedback matrix size:', matrix.shape)

In [34]:
sparce_matrix_report(user_item_matrix)

Size of raw data: 46.339384 Mb
Feedback matrix size: (75910, 104503)


In [37]:
item_weights = np.array(user_item_matrix.tocsc().sum(0))[0]
top_to_bottom_order = np.argsort(-item_weights)
item_mapping = np.empty(top_to_bottom_order.shape, dtype=int)
item_mapping[top_to_bottom_order] = np.arange(len(top_to_bottom_order))
total_item_count = (item_weights > 0).sum()
total_user_count = user_item_matrix.shape[0]


def build_debug_dataset(user_item_matrix, item_pct: float, user_pct: float):
    '''Take the given fraction of the most-liked items and the given fraction of random users'''
    user_count = int(total_user_count * user_pct)
    item_count = int(total_item_count * item_pct)
    item_ids = top_to_bottom_order[:item_count]
    user_ids = np.random.choice(
        np.arange(user_item_matrix.shape[0]), size=user_count, replace=False
    )
    train = user_item_matrix[user_ids]
    train = train[:, item_ids]
    return train

In [40]:
debug_dataset = build_debug_dataset(user_item_matrix, 0.05, 0.05)

sparce_matrix_report(debug_dataset)

Size of raw data: 1.116696 Mb
Feedback matrix size: (3795, 5019)


This is useful for debugging (just to save time).

**Final answers must use the full dataset!**

## Split dataset matrix (5 points)

Split the matrix as follows: for a random 20% of users, remove one like. The removed likes form the test data and everything else is the train data. (10 points)

In [3]:
def split_data(ratings):
    # your code here
    return train_matrix, test_matrix

In [None]:
train_ratings, test_ratings = split_data(...)
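The split described above can be sketched roughly as follows. This is only one possible reading of the task, not the reference solution: the name `split_data_sketch`, the `test_user_share` and `seed` parameters, and the choice to sample test users only among those with at least two likes (so that every test user keeps something in train) are assumptions.

```python
import numpy as np
import scipy.sparse as sp


def split_data_sketch(ratings, test_user_share=0.2, seed=0):
    """Hold out one random like for a random share of users (illustrative)."""
    rng = np.random.default_rng(seed)
    ratings = sp.csr_matrix(ratings)
    n_users = ratings.shape[0]

    # Assumption: only users with at least two likes can become test users.
    likes_per_user = np.diff(ratings.indptr)
    eligible = np.flatnonzero(likes_per_user >= 2)
    n_test = min(len(eligible), int(n_users * test_user_share))
    test_users = rng.choice(eligible, size=n_test, replace=False)

    # Pick one stored position per test user inside the CSR arrays.
    offsets = rng.integers(0, likes_per_user[test_users])
    held_out = ratings.indptr[test_users] + offsets
    test_items = ratings.indices[held_out]

    test = sp.csr_matrix(
        (np.ones(n_test), (test_users, test_items)), shape=ratings.shape
    )
    # Remove the held-out likes from a copy of the original matrix.
    train = ratings.copy()
    train.data[held_out] = 0
    train.eliminate_zeros()
    return train, test
```

Working directly on `indptr` and `indices` avoids a Python loop over users, which matters on the full matrix.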

## Implement ALS, IALS (10 points each)

Note that, because of the size of the data, you need to implement the algorithms with _sparse matrices_!

In [2]:
def als(ratings, k: int, lam: float):
    '''Alternating Least Squares algorithm

    Args:
        ratings: sparse matrix of ratings
        k: size of embeddings
        lam: regularization term
        
    Returns:
        two matrices: of user embeddings and of item embeddings
    '''
    # your code here
    return user_embeddings, item_embeddings
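A minimal sketch of one way to keep ALS sparse-friendly is shown below. It assumes that missing entries are treated as observed zeros, so the objective is `||R - U V^T||_F^2 + lam (||U||_F^2 + ||V||_F^2)` and each half-step has the closed form `U = R V (V^T V + lam I)^{-1}`. The name `als_sketch` and the `n_iters` and `seed` parameters are illustrative and not part of the assignment template.

```python
import numpy as np
import scipy.sparse as sp


def als_sketch(ratings, k=8, lam=1.0, n_iters=10, seed=0):
    """ALS under the assumption that missing entries are zeros."""
    rng = np.random.default_rng(seed)
    ratings = sp.csr_matrix(ratings, dtype=np.float64)
    ratings_t = ratings.T.tocsr()
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, k))
    items = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        # U = R V (V^T V + lam I)^{-1}; the k x k matrix is symmetric,
        # so solving against (R V)^T and transposing gives the same result.
        users = np.linalg.solve(items.T @ items + reg, (ratings @ items).T).T
        # Same update for items with the roles of U and V swapped.
        items = np.linalg.solve(users.T @ users + reg, (ratings_t @ users).T).T
    return users, items
```

Only the products `R @ V` and `R^T @ U` touch the ratings, so the dense user-by-item matrix is never materialized.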

In [None]:
def ials(ratings, k: int, lam: float):
    '''Implicit Alternating Least Squares algorithm

    Args:
        ratings: sparse matrix of ratings
        k: size of embeddings
        lam: regularization term

    Returns:
        two matrices: of user embeddings and of item embeddings
    '''
    # your code here
    return user_embeddings, item_embeddings

## Compute MRR@100 metric for test users (10 points)

For ALS and IALS algorithms.

**Don't forget to use the full dataset!**

In [None]:
def mrr(predictions, test_ratings):
    # your code here
    return mrr_value

In [None]:
mrr_als = mrr(als_predictions, test_ratings)
print(mrr_als)

In [None]:
mrr_ials = mrr(ials_predictions, test_ratings)
print(mrr_ials)

## Adjust hyperparameters of ALS and IALS to maximize MRR (20 points)

The main hyperparameters are the regularization strength and, for the implicit case, the weights.

In [None]:
# your code here
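One simple way to organize the search is a plain grid, sketched below. The helper `grid_search_sketch` and the `fit_and_score` callback are hypothetical: the callback is assumed to train a model with the given parameters and return its MRR.

```python
from itertools import product


def grid_search_sketch(fit_and_score, grid):
    """Try every parameter combination and keep the best score.

    grid maps a parameter name to the list of values to try;
    fit_and_score(**params) is assumed to return a number to maximize.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((fit_and_score(**params), params))
    best_score, best_params = max(results, key=lambda pair: pair[0])
    return best_params, best_score, results
```

A call could look like `grid_search_sketch(score_ials, {"lam": [0.1, 1, 10], "alpha": [1, 10, 40]})`, where `score_ials` is a user-written function. Running the search on the debug dataset first keeps each trial short.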

Optimal parameters of ALS are:

....

Optimal parameters of IALS are:

....

## Get similarities from item2item CF and SLIM (10 points each)

Item2item can be taken from the first homework; SLIM was implemented in class.

Alternatively, you can use libraries, but in that case you will need to convert the dataset to their format.

You only need to compute item similarities, not predictions for users.

In [None]:
i2i_similarities = ... # your code here

In [None]:
slim_similarities = ... # your code here
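For the item2item part, a minimal sketch using cosine similarity between item columns is shown below. The homework version may use a different similarity, so treat the cosine choice and the name `item_cosine_similarities` as assumptions.

```python
import numpy as np
import scipy.sparse as sp


def item_cosine_similarities(ratings):
    """Sparse item-by-item cosine similarities with a zero diagonal."""
    ratings = sp.csc_matrix(ratings, dtype=np.float64)
    # Euclidean norm of every item column.
    norms = np.sqrt(np.asarray(ratings.multiply(ratings).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0  # items without likes stay all-zero
    normalized = ratings @ sp.diags(1.0 / norms)
    sims = (normalized.T @ normalized).tocsr()
    # Drop self-similarities so they do not dominate top-neighbour lists.
    sims = (sims - sp.diags(sims.diagonal())).tocsr()
    sims.eliminate_zeros()
    return sims
```

The result stays sparse because two items get a nonzero similarity only if at least one user liked both.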

## Compare similarities from four algorithms (20 points)

* plot distributions
* compute metrics (which you think are relevant)
* look at several top similar lists

Draw a conclusion about how these methods differ in computing similarities.

In [4]:
# your code here
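One possible metric for the comparison is the overlap between top-neighbour lists produced by two methods. The sketch below assumes item similarities for ALS and IALS are taken as cosine similarities between item embeddings; the helpers `embedding_neighbors` and `mean_jaccard` are illustrative names.

```python
import numpy as np


def embedding_neighbors(item_embeddings, query_items, k=10):
    """Top-k cosine neighbours of query_items, excluding the item itself."""
    query_items = np.asarray(query_items)
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    unit = item_embeddings / np.maximum(norms, 1e-12)
    scores = unit[query_items] @ unit.T
    # Mask each query item so it is never its own neighbour.
    scores[np.arange(len(query_items)), query_items] = -np.inf
    return np.argsort(-scores, axis=1)[:, :k]


def mean_jaccard(neighbors_a, neighbors_b):
    """Average Jaccard overlap between two aligned sets of neighbour lists."""
    overlaps = []
    for a, b in zip(neighbors_a, neighbors_b):
        a, b = set(a.tolist()), set(b.tolist())
        overlaps.append(len(a & b) / len(a | b))
    return float(np.mean(overlaps))
```

Computing neighbours only for a sample of query items keeps the dense score matrix small on the full dataset.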

Conclusion:

....