In [2]:
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
import math

path = '/kaggle/input/steam-data/'

train_interactions = pd.read_parquet(path + 'train_interactions.parquet')
test_interactions = pd.read_parquet(path + 'test_interactions.parquet')

games_metadata = pd.read_parquet(path + 'games_metadata.parquet')

## SBERT + KNN

### Game Metadata Preprocessing (Content-Based Features)

This part of the pipeline is responsible for transforming raw Steam data into semantic vectors. The primary goal is to combine textual information, numerical metrics, and technical specifications into a unified vector space.

#### Key Processing Stages:

1. **Text Engineering (NLP):**
    * **Feature Concatenation:** Fields such as `name`, `tags`, `genres`, and `short_description` are combined into a single string. This allows the model to capture the full context of a game (e.g., combining title keywords with genre-specific tags).
    * **SBERT Embeddings:** We utilize the **SBERT** (`paraphrase-multilingual-MiniLM-L12-v2`) pre-trained transformer. It understands semantic meaning across multiple languages and encodes the text into a fixed-length dense vector.

2. **Numerical Feature Processing:**
    * **Cleaning:** Handling infinite values (`inf`) and missing data (`NaN`) resulting from data scraping inconsistencies.
    * **Log-Transformation:** Applying the $\ln(1+x)$ function to features like review counts and playtime. This is essential for normalizing data with extreme variance (ranging from zero to millions).
    * **Scaling:** Bringing all numerical values to a common scale using `StandardScaler` to ensure stability during neural network training.



3. **Technical Specifications:**
    * Converting operating system support flags (Windows, Mac, Linux) into a binary format (1 or 0).
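
The numerical stage above can be sketched on a single column using only the standard library. The review counts are hypothetical, and `log_then_standardize` is an illustrative helper rather than part of the notebook; it uses the population standard deviation, as `StandardScaler` does.

```python
import math
import statistics

def log_then_standardize(values):
    """Apply ln(1 + x), then rescale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.fmean(logged)
    std = statistics.pstdev(logged)  # population std (ddof=0)
    if std == 0:
        return [0.0 for _ in logged]
    return [(v - mean) / std for v in logged]

# Hypothetical review counts spanning six orders of magnitude.
review_counts = [0, 12, 350, 48_000, 2_000_000]
scaled = log_then_standardize(review_counts)
print([round(v, 2) for v in scaled])
```

Taking the logarithm first keeps the largest titles from dominating the mean and variance that the scaler estimates.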

#### Final Vector Structure:
The output of this process is a feature matrix where each game is represented by a concatenated vector:
$$V_{game} = [E_{sbert} \parallel F_{numeric} \parallel F_{os}]$$

* $E_{sbert}$: semantic vector (384 dimensions).
* $F_{numeric}$: eight scaled indicators (price, DLC count, Metacritic score, review counts, playtime).
* $F_{os}$: three binary platform flags (Windows, Mac, Linux), giving 395 dimensions in total.



#### Final Output:
The function returns a `game_map` dictionary where the key is the `appid` and the value is the multidimensional feature vector. This dictionary acts as the model's "memory," defining the characteristics of every game in the dataset.

In [3]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler

def preprocess_metadata(df):
    df = df.copy()
    
    # 1. Clean the text fields
    df['short_description'] = df['short_description'].fillna('')
    df['name'] = df['name'].fillna('Unknown Game')
    
    def list_to_str(x):
        if isinstance(x, (list, np.ndarray)):
            return ", ".join(map(str, x))
        return str(x) if pd.notnull(x) else ""

    print("Generating combined text...")
    df['combined_text'] = (
        df['name'] + ". Tags: " + 
        df['tags'].apply(list_to_str) + ". Genres: " + 
        df['genres'].apply(list_to_str) + ". Description: " + 
        df['short_description']
    )

    num_cols = [
        'price', 'dlc_count', 'metacritic_score', 'positive', 'negative', 
        'average_playtime_forever', 'median_playtime_forever', 'num_reviews_total'
    ]
    
    # 2. Numeric features: coerce to numbers, replace inf/NaN with 0, forbid negatives
    for col in num_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)
        df[col] = df[col].fillna(0)
        df[col] = df[col].clip(lower=0)

    # Compress heavy-tailed counts with ln(1 + x) before scaling
    log_cols = ['positive', 'negative', 'num_reviews_total', 'average_playtime_forever']
    for col in log_cols:
        df[col] = np.log1p(df[col])

    scaler = StandardScaler()
    num_features = scaler.fit_transform(df[num_cols])

    os_features = df[['windows', 'mac', 'linux']].fillna(False).astype(int).values

    print("Running SentenceTransformer (this may take a while)...")
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    text_embeddings = model.encode(df['combined_text'].tolist(), show_progress_bar=True, batch_size=64)

    final_matrix = np.hstack([text_embeddings, num_features, os_features])

    game_map = {appid: vec for appid, vec in zip(df['appid'], final_matrix)}
    
    return game_map

game_vectors = preprocess_metadata(games_metadata)


Generating combined text...
Running SentenceTransformer (this may take a while)...



### Dataset Formation and Negative Sampling Logic

This code block is responsible for preparing the data for neural network training. The main task here is to transform flat purchase lists into training examples that force the model to understand user preferences.

#### Key Components:

1. **User History (Context):**
   * A game history is identified for each user. 
   * The `MAX_HISTORY = 10` parameter limits the number of previous games considered. If there are fewer games, **Padding** (filling with zeros) is applied so that the tensors have the same size for fast GPU processing.

2. **Negative Sampling (Training Logic):**
   * The model is trained using a comparison method. We provide it not only with the game the user bought (**Positive Target**) but also with a random game they did not buy (**Negative Target**).
   * This forces the neural network to "push" the vectors: bringing the user profile closer to purchased games and moving it away from random ones.



#### `__getitem__` Method Mechanics:

| Component | Formation Logic |
| :--- | :--- |
| **History Vecs** | Vectors of the user's latest games from `game_vectors`. If there are too few, they are supplemented with zero vectors. |
| **Target Vec** | The vector of a randomly selected game from the user's real history (what we are trying to predict). |
| **Neg Vec** | The vector of a game randomly selected from the entire `all_appids` catalog, excluding those the user has already bought. |
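
The table can be read as the following sketch, which works on item ids instead of vectors. `build_example` is a hypothetical helper, not the notebook's `__getitem__`. It uses rejection sampling for the negative and assumes the user does not own the entire catalog, otherwise the loop would never end.

```python
import random

MAX_HISTORY = 10  # same value as in the notebook

def build_example(history, catalog, rng=random):
    """Return (context, positive, negative) item ids for one user."""
    owned = set(history)
    # Hold out one purchase as the positive target.
    t_idx = rng.randrange(len(history))
    positive = history[t_idx]
    # Context: the remaining purchases, truncated to the last MAX_HISTORY.
    context = [g for i, g in enumerate(history) if i != t_idx][-MAX_HISTORY:]
    # Pad with None so every example has the same length.
    context = context + [None] * (MAX_HISTORY - len(context))
    # Rejection sampling: redraw until the item is not owned.
    while True:
        negative = rng.choice(catalog)
        if negative not in owned:
            break
    return context, positive, negative

random.seed(0)
ctx, pos, neg = build_example([1, 2, 3], list(range(1, 101)))
print(ctx, pos, neg)
```

Rejection sampling avoids rebuilding the set of non-owned games for every example, which matters when the catalog is large and most users own only a few titles.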

In [14]:
import torch
from torch.utils.data import Dataset, DataLoader
import random

user_histories = train_interactions.groupby('playerid')['appid'].apply(list).to_dict()
all_appids = list(game_vectors.keys())

MAX_HISTORY = 10 
NEG_SAMPLES = 2  # declared but unused: __getitem__ draws one negative per example

class SteamRecDataset(Dataset):
    def __init__(self, user_histories, game_vectors, all_appids, is_train=True):
        self.user_ids = list(user_histories.keys())
        self.user_histories = user_histories
        self.game_vectors = game_vectors
        self.all_appids = set(all_appids)
        self.is_train = is_train

        self.emb_dim = len(next(iter(game_vectors.values())))

    def __len__(self):
        return len(self.user_ids)

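    def __getitem__(self, idx):
        uid = self.user_ids[idx]
        history = self.user_histories[uid]

        if len(history) < 2 and self.is_train:
            # Single-purchase user: the only game serves as both target and context.
            # Caveat: this leaks the target into the history for such users.
            target_game = history[0]
            prev_games = history
        else:
            # Hold out one random purchase as the target; the rest form the context.
            target_game = random.choice(history)
            prev_games = [g for g in history if g != target_game]

        # Keep only the last MAX_HISTORY games in stored order.
        prev_games = prev_games[-MAX_HISTORY:]

        target_vec = torch.tensor(self.game_vectors[target_game], dtype=torch.float32)

        # Pad short histories with zero vectors so every example has the same shape.
        history_vecs = [self.game_vectors[g] for g in prev_games]
        while len(history_vecs) < MAX_HISTORY:
            history_vecs.append(np.zeros(self.emb_dim))
        history_vecs = torch.tensor(np.array(history_vecs), dtype=torch.float32)

        if self.is_train:
            # Negative sample: a random catalog game the user has never bought.
            # Note: the set difference is rebuilt on every call (O(catalog size)).
            neg_game = random.choice(list(self.all_appids - set(history)))
            neg_vec = torch.tensor(self.game_vectors[neg_game], dtype=torch.float32)
            return history_vecs, target_vec, neg_vec
        else:
            return history_vecs, target_vec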

train_dataset = SteamRecDataset(user_histories, game_vectors, all_appids)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

### Model Architecture: SteamTransformer

This model is a hybrid neural network based on the **Transformer** architecture, adapted for recommendation tasks. It analyzes the sequence of a user's past games and estimates the probability of interaction with a new target game.

#### Key Architectural Blocks:

1. **Input Projection:**
   * Input vectors (extracted from metadata and SBERT) have a high dimensionality. The `nn.Linear(input_dim, 256)` layer compresses them into a compact feature space (hidden dimension), which is more efficient for the transformer to process.

2. **Transformer Encoder (Context Extraction):**
   * **Self-Attention:** The attention mechanism lets the network weigh which games in the user's history are most relevant to one another. For example, if two of the 10 games in the history are shooters, attention can emphasize them when building the user representation. Note that no positional encoding is added, so the encoder treats the history as an unordered set.
   * **Multi-head Attention (4 heads):** Allows the model to simultaneously track different types of dependencies (e.g., genre similarity and price similarity).
   

3. **Feature Aggregation:**
   * After the transformer, we obtain one vector per history slot. `torch.mean(..., dim=1)` averages them into a single **"User Interest Vector"**, an averaged representation of the player's preferences. Because no padding mask is passed, padded slots are included in this average.

4. **Classifier Head:**
   * The model concatenates the user vector and the target game vector (`torch.cat`).
   * **Fully Connected Layers:** A Multi-Layer Perceptron (MLP) analyzes this combination.
   * **Dropout (0.2):** Prevents overfitting by forcing the network not to rely on specific neurons.
   * **Sigmoid:** The final layer outputs the probability of purchase.
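
The aggregation step averages every history slot, including padded ones. The sketch below shows a possible alternative, not what the notebook implements: a masked mean that averages only the real slots. The toy vectors and the `masked_mean` helper are assumptions made for illustration.

```python
def masked_mean(vectors, mask):
    """Average only the vectors whose mask entry is True."""
    kept = [v for v, keep in zip(vectors, mask) if keep]
    if not kept:
        return [0.0] * len(vectors[0])
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

# Two real games followed by two padding slots (2-dimensional toy vectors).
outputs = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
is_real = [True, True, False, False]

# Plain mean over all slots, as torch.mean(..., dim=1) would compute.
plain = [sum(v[d] for v in outputs) / len(outputs) for d in range(2)]
pooled = masked_mean(outputs, is_real)
print(plain, pooled)
```

With a plain mean, users with short histories get a representation dominated by whatever the padded slots produce, which is why a mask is worth considering.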
   

#### Technical Specifications:
* **Hidden Dimension:** 256
* **Feedforward Dimension:** 512 (inside the transformer)
* **Dropout:** 0.2
* **Output:** Probability of purchase/interaction.

#### Advantage over Base Models:
Unlike simple fully connected networks, `SteamTransformer` accounts for the **relationships between games within the history**. It understands not just "what the user bought," but "how their purchases relate to one another," which is critical for the complex video game market.

In [16]:
import torch.nn as nn

class SteamTransformer(nn.Module):
    def __init__(self, input_dim, nhead=4, num_layers=2):
        super().__init__()

        self.input_proj = nn.Linear(input_dim, 256)
        # Encoder over the projected history (batch_first=True: inputs are [batch, seq, dim])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=256, nhead=nhead, dim_feedforward=512, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        self.fc = nn.Sequential(
            nn.Linear(256 + 256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )

    def forward(self, history, target):
        hist_emb = self.input_proj(history)  
        target_emb = self.input_proj(target) 
 
        user_features = self.transformer(hist_emb)

        user_combined = torch.mean(user_features, dim=1) 

        combined = torch.cat([user_combined, target_emb], dim=-1)
        return self.fc(combined)

input_size = len(next(iter(game_vectors.values())))
model = SteamTransformer(input_dim=input_size)

### Validation of the model: HitRate@K and NDCG@K

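This block evaluates recommendation quality on held-out test interactions. For each test user, the first test game is scored against up to 99 randomly sampled games that are absent from the user's training history, and HitRate@K and NDCG@K are computed from the rank of the true game among these candidates.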
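
With a single relevant item per user, both metrics depend only on the rank of that item. The following is a minimal sketch with hypothetical scores. `hit_and_ndcg` is an illustrative helper that resolves ties in favour of the positive, which is a simplification.

```python
import math

def hit_and_ndcg(scores, positive_index=0, k=10):
    """Return (hit, ndcg) for one user with a single relevant item."""
    pos_score = scores[positive_index]
    # Rank = 1 + number of candidates scored strictly higher than the positive.
    rank = 1 + sum(1 for s in scores if s > pos_score)
    if rank > k:
        return 0, 0.0
    # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    return 1, 1.0 / math.log2(rank + 1)

# Index 0 is the true game; two candidates outscore it, so its rank is 3.
scores = [0.80, 0.95, 0.10, 0.85, 0.30]
hit, ndcg = hit_and_ndcg(scores, k=3)
print(hit, ndcg)
```

A hit at rank 1 contributes 1.0 to NDCG, while a hit at rank 3 contributes 0.5, so NDCG rewards placing the true game near the top rather than merely inside the top K.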


In [21]:
def evaluate_metrics(model, game_vectors, user_histories, test_interactions, k=10):
    model.eval()
    hr, ndcg = 0, 0
    users = test_interactions['playerid'].unique()

    device = next(model.parameters()).device
    
    all_appids = list(game_vectors.keys())
    emb_dim = len(next(iter(game_vectors.values())))

    with torch.no_grad():
        for uid in users:
            history = user_histories.get(uid, [])[-MAX_HISTORY:]
            if not history: continue

            h_vecs = [game_vectors[g] for g in history]
            while len(h_vecs) < MAX_HISTORY:
                h_vecs.append(np.zeros(emb_dim))

            h_tensor = torch.tensor(np.array(h_vecs), dtype=torch.float32).unsqueeze(0).to(device)

            pos_game = test_interactions[test_interactions['playerid'] == uid]['appid'].iloc[0]

            neg_candidates = list(set(all_appids) - set(user_histories.get(uid, [])))
            neg_games = random.sample(neg_candidates, min(99, len(neg_candidates)))
            
            candidate_games = [pos_game] + neg_games
            c_vecs = torch.tensor(np.array([game_vectors[g] for g in candidate_games]), dtype=torch.float32).to(device)

            num_candidates = len(candidate_games)
            h_tensor_expanded = h_tensor.expand(num_candidates, -1, -1)

            scores = model(h_tensor_expanded, c_vecs).squeeze().cpu().numpy()

            rank_indices = np.argsort(scores)[::-1]
            try:
                rank = np.where(rank_indices == 0)[0][0] + 1 
                if rank <= k:
                    hr += 1
                    ndcg += 1 / math.log2(rank + 1)
            except IndexError:

                continue

    final_hr = hr / len(users)
    final_ndcg = ndcg / len(users)
    return final_hr, final_ndcg

### Model Training Process

This block implements the neural network training loop using Binary Cross-Entropy (BCE).

#### Key Parameters:
* **Optimizer**: `Adam` (lr=0.0005) with L2 regularization (`weight_decay`).
* **Loss Function**: `BCELoss`.
* **Configuration**: 5 training epochs.



1. **Dual Pass**: For each batch, predictions are calculated separately for "positive" games and "negative" candidates.
2. **Balancing**: The average loss value `(pos_loss + neg_loss) / 2` allows for uniform adjustment of the model weights.
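
As a worked example of the two loss terms, consider one positive and one negative prediction. The prediction values are hypothetical, and `bce` is a scalar stand-in for `BCELoss` applied to a single element.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for one prediction in (0, 1)."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

pos_pred, neg_pred = 0.9, 0.2   # hypothetical sigmoid outputs
pos_loss = bce(pos_pred, 1)     # -ln(0.9)
neg_loss = bce(neg_pred, 0)     # -ln(1 - 0.2)
loss = (pos_loss + neg_loss) / 2
print(round(pos_loss, 4), round(neg_loss, 4), round(loss, 4))
```

The positive term shrinks as the score for the bought game approaches 1, and the negative term shrinks as the score for the random game approaches 0. Averaging gives both an equal weight.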

In [22]:
import torch.optim as optim
from tqdm.auto import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.0005, weight_decay=1e-5)
criterion = nn.BCELoss()

EPOCHS = 5

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0

    pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
    
    for history_vecs, pos_vecs, neg_vecs in pbar:
        history_vecs = history_vecs.to(device)
        pos_vecs = pos_vecs.to(device)
        neg_vecs = neg_vecs.to(device)
        
        optimizer.zero_grad()
  
        pos_preds = model(history_vecs, pos_vecs)
        pos_loss = criterion(pos_preds, torch.ones_like(pos_preds))

        neg_preds = model(history_vecs, neg_vecs)
        neg_loss = criterion(neg_preds, torch.zeros_like(neg_preds))
        
        # Total loss: mean of the positive and negative terms
        loss = (pos_loss + neg_loss) / 2
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        pbar.set_postfix({'loss': f"{loss.item():.4f}"})

    print("Running validation...")
    hr, ndcg = evaluate_metrics(model, game_vectors, user_histories, test_interactions, k=10)
    print(f"Epoch {epoch+1} finished. Loss: {total_loss/len(train_loader):.4f} | HR@10: {hr:.4f} | NDCG@10: {ndcg:.4f}")

Running validation...
Epoch 1 finished. Loss: 0.1796 | HR@10: 0.9465 | NDCG@10: 0.7619
Running validation...
Epoch 2 finished. Loss: 0.1715 | HR@10: 0.9486 | NDCG@10: 0.7730
Running validation...
Epoch 3 finished. Loss: 0.1703 | HR@10: 0.9478 | NDCG@10: 0.7701
Running validation...
Epoch 4 finished. Loss: 0.1704 | HR@10: 0.9517 | NDCG@10: 0.7821
Running validation...
Epoch 5 finished. Loss: 0.1684 | HR@10: 0.9501 | NDCG@10: 0.7742


### The main validation function

In [24]:
import numpy as np
import pandas as pd
import math
from tqdm import tqdm


def calculate_metrics(test_df, recommender_model, k=10):
    """
    Calculates HitRate@K, Recall@K, NDCG@K
    """
    known_users = set(recommender_model.user_map.keys())
    test_df_filtered = test_df[test_df['playerid'].isin(known_users)].copy()

    ground_truth = test_df_filtered.groupby('playerid')['appid'].apply(list).to_dict()

    hits = 0
    total_recall = 0
    total_ndcg = 0
    n_users = len(ground_truth)

    if n_users == 0:
        return {"Error": "No overlapping users in test set"}

    for user, actual_items in tqdm(ground_truth.items()):
        recs = recommender_model.recommend(user, top_k=k)

        hit = any(item in actual_items for item in recs)
        if hit:
            hits += 1

        intersect = set(recs).intersection(set(actual_items))
        recall = len(intersect) / len(actual_items) if len(actual_items) > 0 else 0
        total_recall += recall

        dcg = 0
        idcg = 0

        for i, item in enumerate(recs):
            if item in actual_items:
                dcg += 1 / math.log2((i + 1) + 1)

        num_relevant_in_top_k = min(len(actual_items), k)
        for i in range(num_relevant_in_top_k):
            idcg += 1 / math.log2((i + 1) + 1)

        ndcg = dcg / idcg if idcg > 0 else 0
        total_ndcg += ndcg

    return {
        f"HitRate@{k}": hits / n_users,
        f"Recall@{k}": total_recall / n_users,
        f"NDCG@{k}": total_ndcg / n_users
    }

### Wrapper for the model

In [25]:
class SteamRecommender:
    def __init__(self, model, user_histories, game_vectors, all_appids, device):
        self.model = model
        self.user_histories = user_histories
        self.game_vectors = game_vectors
        self.all_appids = all_appids
        self.device = device

        self.user_map = user_histories 

        self.candidate_pool = all_appids 

    def recommend(self, user_id, top_k=10):
        self.model.eval()

        history = self.user_histories.get(user_id, [])[-MAX_HISTORY:]
        if not history:
            return []

        emb_dim = len(next(iter(self.game_vectors.values())))
        h_vecs = [self.game_vectors.get(g, np.zeros(emb_dim)) for g in history]
        while len(h_vecs) < MAX_HISTORY:
            h_vecs.append(np.zeros(emb_dim))
        
        h_tensor = torch.tensor(np.array(h_vecs), dtype=torch.float32).unsqueeze(0).to(self.device)

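        # Exclude every game in the user's full history, not only the last MAX_HISTORY.
        bought_games = set(self.user_histories.get(user_id, []))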
        candidates = [g for g in self.candidate_pool if g not in bought_games]

        c_vecs = torch.tensor(np.array([self.game_vectors[g] for g in candidates]), dtype=torch.float32).to(self.device)
        
        with torch.no_grad():

            h_expanded = h_tensor.expand(len(candidates), -1, -1)
            scores = self.model(h_expanded, c_vecs).squeeze().cpu().numpy()

        top_indices = np.argsort(scores)[::-1][:top_k]
        return [candidates[i] for i in top_indices]

In [28]:
recommender = SteamRecommender(
    model=model, 
    user_histories=user_histories, 
    game_vectors=game_vectors, 
    all_appids=all_appids,
    device=device
)

metrics = calculate_metrics(test_interactions[:1000], recommender, k=10)
print(f"Final Test Metrics: {metrics}")

100%|██████████| 1000/1000 [22:52<00:00,  1.37s/it]

Final Test Metrics: {'HitRate@10': 0.018, 'Recall@10': 0.018, 'NDCG@10': 0.006429227523346531}





### Why did the approach fail?

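Despite using advanced embeddings, the model reached only HitRate@10 = 0.018 when ranked against the full catalog, far below the roughly 0.95 measured during training against 99 sampled negatives. Several barriers likely contributed: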

#### 1. Semantic Noise and "Marketing" Text
Models like **SBERT** are excellent at capturing textual meaning, but Steam descriptions are full of artistic epithets.
* **Problem:** Two games might have similar descriptions ("epic adventure", "open world") but radically different gameplay (platformer vs. strategy). The model saw textual similarity but failed to recognize the difference in the actual gaming experience.



#### 2. Ignoring Popularity (Popular != Similar)
The content-based approach looks for *similarity*, but purchases on Steam are often driven by trends and popularity.
* **Problem:** KNN might recommend an "ideally similar" game that has 0 reviews and 0 players, instead of a hit that the user would actually be likely to buy.

#### 3. Lack of Collaborative Signal
This version relied solely on the properties of the game itself, completely ignoring the behavior of other people.
* **Problem:** The most powerful signal ("People who bought A also bought B") was not taken into account. Content-based KNN did not know that games could be related in meaning even if their descriptions were not similar.



#### 4. "Curse of Dimensionality"
Concatenating the 384 SBERT dimensions with the numeric and OS features produces 395-dimensional vectors.
* **Problem:** In high-dimensional spaces, cosine similarities tend to concentrate: games become nearly "equally distant" from one another, so similarity search loses accuracy.
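
The HR@10 values printed during training and the final test metrics are not directly comparable. The validation loop ranks the true game against up to 99 sampled negatives, while the final test ranks it against the whole catalog. The sketch below compares the hit rate a uniformly random ranker would get in each setting. The catalog size of 50,000 is a hypothetical figure chosen for illustration, not the dataset's actual size.

```python
def random_hit_rate(k, n_candidates):
    """Expected HitRate@k of a uniformly random ranking with one relevant item."""
    return min(k, n_candidates) / n_candidates

sampled = random_hit_rate(10, 100)      # 1 positive + 99 sampled negatives
full = random_hit_rate(10, 50_000)      # hypothetical full-catalog size
print(sampled, full)
```

Under these assumptions the random baseline drops by a factor of 500 between the two protocols, so part of the gap between validation and test numbers comes from the evaluation setup rather than from the model.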

## Hybrid Pipeline: Two-Tower Transformer with BPR Loss

This code block implements a modern recommendation system architecture that combines content-based features (SBERT + Meta) with collaborative learning.

#### Key Pipeline Stages:

1. **Feature Compression (PCA):**
    * The initial `game_vectors` (SBERT embeddings plus numeric and OS features) have high dimensionality.
    * **PCA** reduces them to 64 principal components, which lowers noise and speeds up neural network computations.



2. **Two-Tower Architecture:**
    * **Item Tower:** Combines static features (Content) and trainable embeddings (ID). This allows the model to recognize a game as both an "RPG with a Fantasy tag" and a specific object with unique user behavior.
    * **User Tower (Transformer):** Processes a sequence of game vectors from the user's history. The Transformer identifies complex dependencies between purchases to form a final user interest vector.



3. **Training via BPR (Bayesian Personalized Ranking):**
    * Instead of "bought/not bought" classification, the model learns **ranking**.
    * **Negative Sampling:** For every actual purchase (Positive), a random game the user did not purchase (Negative) is selected.
    * **Goal:** To make the dot product of the user vector and the "good" game higher than with the "bad" game.



4. **Inference Optimization:**
    * After training, vectors for all games (`all_item_vectors`) are pre-calculated.
    * This turns item scoring at recommendation time into a single matrix multiplication (dot product) against the precomputed item vectors; only the user tower still runs per request.
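
The BPR objective for one (user, positive, negative) triple is $-\ln \sigma(s_{pos} - s_{neg})$, where each score is a dot product with the user vector. The sketch below evaluates it in plain Python. The vectors are hypothetical, and `log_sigmoid` is written in a numerically stable form.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bpr_loss(user_vec, pos_vec, neg_vec):
    """BPR loss for one (user, positive, negative) triple."""
    return -log_sigmoid(dot(user_vec, pos_vec) - dot(user_vec, neg_vec))

user = [1.0, 0.5]
good = [2.0, 1.0]   # dot product with user: 2.5
bad = [0.5, 0.0]    # dot product with user: 0.5

print(bpr_loss(user, good, bad))   # small: the positive already outranks the negative
print(bpr_loss(user, bad, good))   # large: the order is wrong
```

Only the difference between the two scores matters, so the loss optimizes relative order and does not calibrate scores as probabilities.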

| Parameter | Value | Description |
| :--- | :--- | :--- |
| **EMBED_DIM** | 64 | Dimensionality of the model's internal space. |
| **MAX_HISTORY** | 10 | Depth of user history analysis. |
| **Loss** | BPR | Ranking order optimization. |
| **Device** | CUDA/CPU | Automatic GPU switching for acceleration. |
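
Once item vectors are precomputed, recommendation reduces to scoring and sorting. This is a minimal sketch with hypothetical 2-dimensional vectors. The `recommend` function is illustrative and is not the notebook's `SteamInferenceModel`.

```python
def recommend(user_vec, item_vectors, owned, top_k=3):
    """Score every item by dot product and return the top_k unseen item indices."""
    scores = []
    for idx, vec in enumerate(item_vectors):
        if idx in owned:
            continue  # never recommend something the user already has
        scores.append((sum(a * b for a, b in zip(user_vec, vec)), idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]

# Precomputed once after training (toy values).
item_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
user_vec = [1.0, 0.2]

recs = recommend(user_vec, item_vectors, owned={0}, top_k=2)
print(recs)
```

In practice the loop would be a single matrix-vector product, but the structure is the same: one user vector, one fixed item matrix, and a mask over owned items.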

In [30]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.decomposition import PCA
from tqdm.auto import tqdm

print("Preparing features and mappings...")
unique_apps = sorted(games_metadata['appid'].unique())
app2idx = {appid: i + 1 for i, appid in enumerate(unique_apps)}
idx2app = {i: appid for appid, i in app2idx.items()}

pop_counts = train_interactions['appid'].value_counts()
pop_dict = pop_counts.to_dict()

raw_feat_matrix = np.array([game_vectors[aid] for aid in unique_apps])
pca = PCA(n_components=64) 
compressed_feats = pca.fit_transform(raw_feat_matrix)

game_feats_matrix = np.zeros((len(unique_apps) + 1, 64))
game_feats_matrix[1:] = compressed_feats
game_feats_tensor = torch.FloatTensor(game_feats_matrix)

MAX_HISTORY = 10
EMBED_DIM = 64
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class SteamDataset(Dataset):
    def __init__(self, interactions, app2idx, max_history=10):
        self.user_histories = interactions.groupby('playerid')['appid'].apply(list).to_dict()
        self.user_ids = list(self.user_histories.keys())
        self.app2idx = app2idx
        self.all_apps = list(app2idx.keys())
        self.max_history = max_history

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, idx):
        uid = self.user_ids[idx]
        full_history = self.user_histories[uid]

        t_idx = np.random.randint(0, len(full_history))
        target_pos = full_history[t_idx]

        context = [g for i, g in enumerate(full_history) if i != t_idx][-self.max_history:]

        user_bought = set(full_history)
        while True:
            target_neg = np.random.choice(self.all_apps)
            if target_neg not in user_bought:
                break

        hist_ids = [self.app2idx.get(aid, 0) for aid in context]
        while len(hist_ids) < self.max_history:
            hist_ids.append(0) 
            
        return (
            torch.LongTensor(hist_ids),
            torch.LongTensor([self.app2idx[target_pos]]),
            torch.LongTensor([self.app2idx[target_neg]])
        )

class SteamTwoTower(nn.Module):
    def __init__(self, num_apps, game_feats_tensor, embed_dim=64):
        super().__init__()
        self.game_content = nn.Embedding.from_pretrained(game_feats_tensor, freeze=True)
        self.id_emb = nn.Embedding(num_apps + 1, embed_dim)

        self.merge = nn.Linear(embed_dim + 64, embed_dim)

        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def get_item_vec(self, ids):
        comb = torch.cat([self.id_emb(ids), self.game_content(ids)], dim=-1)
        return self.merge(comb)

    def forward(self, hist_ids, pos_ids, neg_ids):
        hist_vecs = self.get_item_vec(hist_ids)
        user_out = self.transformer(hist_vecs)
        user_vec = torch.mean(user_out, dim=1) 

        pos_vec = self.get_item_vec(pos_ids).squeeze(1)
        neg_vec = self.get_item_vec(neg_ids).squeeze(1)

        return user_vec, pos_vec, neg_vec

model = SteamTwoTower(len(app2idx), game_feats_tensor, EMBED_DIM).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

train_dataset = SteamDataset(train_interactions, app2idx, MAX_HISTORY)
train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)

print("Starting training...")
model.train()
for epoch in range(10):
    total_loss = 0
    for h_ids, p_ids, n_ids in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        h_ids, p_ids, n_ids = h_ids.to(DEVICE), p_ids.to(DEVICE), n_ids.to(DEVICE)
        
        optimizer.zero_grad()
        user_v, pos_v, neg_v = model(h_ids, p_ids, n_ids)

        pos_scores = (user_v * pos_v).sum(dim=-1)
        neg_scores = (user_v * neg_v).sum(dim=-1)

        # BPR loss; logsigmoid is the numerically stable form of log(sigmoid(x))
        loss = -nn.functional.logsigmoid(pos_scores - neg_scores).mean()
        
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1} Loss: {total_loss/len(train_loader):.4f}")

user_histories_final = train_interactions.groupby('playerid')['appid'].apply(list).to_dict()

model.eval()
with torch.no_grad():
    all_item_ids = torch.arange(len(app2idx) + 1).to(DEVICE)
    all_item_vectors = model.get_item_vec(all_item_ids).cpu().numpy()

print("Pipeline is ready for metric computation.")

Preparing features and mappings...
Starting training...
Epoch 1 Loss: 0.1399
Epoch 2 Loss: 0.1010
Epoch 3 Loss: 0.0993
Epoch 4 Loss: 0.0969
Epoch 5 Loss: 0.0903
Epoch 6 Loss: 0.0955
Epoch 7 Loss: 0.0884
Epoch 8 Loss: 0.0889
Epoch 9 Loss: 0.0923
Epoch 10 Loss: 0.0830
Pipeline is ready for metric computation.


In [32]:
class SteamInferenceModel:
    def __init__(self, model, app2idx, user_histories, item_vectors, device):
        self.model = model
        self.app2idx = app2idx
        self.user_map = user_histories  
        self.item_vectors = item_vectors 
        self.device = device
        self.all_appids = sorted(app2idx.keys())
        self.idx2app = {i: aid for aid, i in app2idx.items()}

    def recommend(self, user_id, top_k=10):
        self.model.eval()

        history = self.user_map.get(user_id, [])[-MAX_HISTORY:]
        if not history:
            return []

        hist_ids = [self.app2idx.get(aid, 0) for aid in history]
        while len(hist_ids) < MAX_HISTORY:
            hist_ids.append(0)
        
        h_tensor = torch.LongTensor(hist_ids).unsqueeze(0).to(self.device)
        
        with torch.no_grad():

            hist_vecs = self.model.get_item_vec(h_tensor)
            user_vec = torch.mean(self.model.transformer(hist_vecs), dim=1).cpu().numpy()

        scores = np.dot(self.item_vectors, user_vec.T).flatten()

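        # Mask every game in the user's full history, not only the last MAX_HISTORY.
        bought_indices = [self.app2idx.get(aid, 0) for aid in self.user_map.get(user_id, [])]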
        scores[bought_indices] = -1e9
        scores[0] = -1e9

        top_indices = np.argsort(scores)[::-1][:top_k]
        return [self.idx2app[i] for i in top_indices if i in self.idx2app]

recommender = SteamInferenceModel(
    model=model, 
    app2idx=app2idx, 
    user_histories=user_histories_final, 
    item_vectors=all_item_vectors, 
    device=DEVICE
)

final_results = calculate_metrics(test_interactions, recommender, k=10)
print("\n=== FINAL TEST RESULTS ===")
for m, v in final_results.items():
    print(f"{m}: {v:.4f}")

  0%|          | 0/44021 [00:00<?, ?it/s]


=== FINAL TEST RESULTS ===
HitRate@10: 0.0483
Recall@10: 0.0483
NDCG@10: 0.0157


### Analysis and Conclusions

Despite the more complex **Two-Tower Transformer** architecture, the final metrics (**HitRate@10: 0.0483**) remained below those of standard methods. Likely reasons:

#### 1. Dominance of the "Collaborative Signal"
In Steam data, community behavior (social signal) is the most accurate predictor.
* **KNN** directly exploits this signal through the principle: *"People who bought this also bought that"*.
* The **Neural Network** attempts to learn these connections indirectly through embeddings. To "catch up" with KNN in terms of memorizing such dependencies, a significantly larger volume of data is required.



#### 2. Sparsity and PCA Issues
To reduce the dimensionality of the combined feature vectors (SBERT embeddings plus numeric and OS features), **PCA** was used (compression to 64 components).
* **Consequence:** Along with noise, subtle semantic differences might have been removed. If two games differed only by one specific tag, PCA could have "averaged" their vectors, making them indistinguishable to the model.