# Neural Collaborative Filtering: Two-Tower Architecture

**Objective:** Build a deep learning recommender system that learns latent representations of users and movies in a shared embedding space.

**Key Innovation:** Rather than manually engineering features or using simple matrix factorization, we use neural networks to automatically discover complex, nonlinear patterns in user preferences.

## Dependencies and Setup

In [None]:
import numpy as np
import numpy.ma as ma
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input, Dot
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
from recsysNN_utils import *

# Display settings
pd.set_option("display.precision", 1)

# Reproducibility
np.random.seed(1)
tf.random.set_seed(1)

## Data Loading and Exploration

The dataset contains user-movie interactions with ratings. We'll examine both:
1. **Content-based baselines** (for comparison)
2. **Raw training data** (for neural collaborative filtering)

In [None]:
# Content-based baseline recommendations (for comparison)
top10_df = pd.read_csv("./data/content_top10_df.csv")
bygenre_df = pd.read_csv("./data/content_bygenre_df.csv")

print("Top 10 Content-Based Recommendations:")
display(top10_df)

print("\nGenre-Based Recommendations:")
display(bygenre_df)

## Training Data Structure

Load the collaborative filtering dataset with user and item features.

In [None]:
# Load preprocessed data
item_train, user_train, y_train, item_features, user_features, \
    item_vecs, movie_dict, user_to_genre = load_data()

# Feature dimensions
num_user_features = user_train.shape[1] - 3  # Exclude user_id, rating_count, avg_rating
num_item_features = item_train.shape[1] - 1  # Exclude movie_id

# Column slicing indices for training
uvs = 3  # User genre vector start
ivs = 3  # Item genre vector start
u_s = 3  # Start of user features to use
i_s = 1  # Start of item features to use

print(f"User feature dimension: {num_user_features}")
print(f"Item feature dimension: {num_item_features}")
print(f"Training samples: {len(y_train):,}")
print(f"Unique movies: {len(item_vecs):,}")

### Data Preview: User Features

**User features include:**
- User ID
- Average rating given
- Number of ratings
- Genre preference vector (14 dimensions)

In [None]:
pprint_train(user_train, user_features, uvs, u_s, maxcount=5)

### Data Preview: Item Features

**Movie features include:**
- Movie ID
- Release year
- Genre flags (14 dimensions)

In [None]:
pprint_train(item_train, item_features, ivs, i_s, maxcount=5, user=False)

### Target Variable: Ratings

In [None]:
print(f"Sample ratings: {y_train[:10]}")
print(f"Rating range: [{y_train.min()}, {y_train.max()}]")
print(f"Mean rating: {y_train.mean():.2f}")

## Feature Scaling

**Why scale?**

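1. **Convergence speed:** Features on very different scales produce a poorly conditioned loss surface, which slows gradient descent
2. **Numerical stability:** Large raw inputs (e.g., release years near 2000) can produce large activations and gradients early in training
3. **Weight initialization:** Standard initializers (e.g., Glorot, the Keras `Dense` default) are designed for inputs of roughly unit scale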

**Scaling strategies:**
- **StandardScaler:** For user/item features → $z = \frac{x - \mu}{\sigma}$
- **MinMaxScaler((-1, 1)):** For ratings → matches the range of the model output, since the dot product of two unit vectors always lies in $[-1, 1]$
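
To make the two transforms concrete, here is a minimal NumPy sketch of what each scaler computes. The toy arrays are assumptions for illustration only and are not taken from the dataset.

```python
import numpy as np

# Toy feature column and toy ratings (illustrative values, not from the dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ratings = np.array([0.5, 2.0, 3.5, 5.0])

# Standardization: z = (x - mean) / std, using the population std like StandardScaler
z = (x - x.mean()) / x.std()

# Min-max scaling of ratings onto [-1, 1]
r_min, r_max = ratings.min(), ratings.max()
scaled = 2.0 * (ratings - r_min) / (r_max - r_min) - 1.0

# Inverse transform: map scaled values back to the original rating units
restored = (scaled + 1.0) / 2.0 * (r_max - r_min) + r_min

print("standardized:", np.round(z, 3))
print("scaled ratings:", scaled)
print("restored ratings:", restored)
```

The inverse step is the same mapping that `scalerTarget.inverse_transform` applies to predictions later in the notebook.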

In [None]:
# Preserve unscaled versions for display
item_train_unscaled = item_train.copy()
user_train_unscaled = user_train.copy()
y_train_unscaled = y_train.copy()

# Standardize features (zero mean, unit variance)
scalerItem = StandardScaler()
item_train = scalerItem.fit_transform(item_train)

scalerUser = StandardScaler()
user_train = scalerUser.fit_transform(user_train)

# Scale ratings to [-1, 1] for smoother optimization
scalerTarget = MinMaxScaler((-1, 1))
y_train = scalerTarget.fit_transform(y_train.reshape(-1, 1)).flatten()

print("Scaling complete.")
print(f"User features: mean={user_train.mean():.3f}, std={user_train.std():.3f}")
print(f"Item features: mean={item_train.mean():.3f}, std={item_train.std():.3f}")
print(f"Target ratings: range=[{y_train.min():.2f}, {y_train.max():.2f}]")

## Train-Test Split

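80/20 split with a fixed random state for reproducibility. The three arrays must be shuffled identically so that each user row stays paired with its item row and rating. Note that the scalers above were fit before this split, so the test rows influenced the scaling statistics; for a stricter evaluation, fit the scalers on the training portion only.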
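
The row-alignment requirement can be illustrated with a small NumPy sketch that applies one shared permutation to all arrays. The stand-in arrays below are hypothetical; the notebook itself relies on `train_test_split` with a fixed `random_state`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

# Stand-in arrays whose rows are trivially linked: row i of each belongs together
y = np.arange(n)
users = y * 10
items = y * 100

# One permutation, reused for every array, keeps the rows paired
perm = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]

users_train, users_test = users[train_idx], users[test_idx]
items_train, items_test = items[train_idx], items[test_idx]
y_train_demo, y_test_demo = y[train_idx], y[test_idx]

print("train rows:", y_train_demo)
print("test rows: ", y_test_demo)
```

If each array were shuffled with a different permutation, the model would be trained on mismatched (user, item, rating) triples.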

In [None]:
# A single call applies the same permutation to all three arrays, keeping rows aligned
item_train, item_test, user_train, user_test, y_train, y_test = train_test_split(
    item_train, user_train, y_train, train_size=0.80, shuffle=True, random_state=1
)

print(f"Training samples: {len(y_train):,}")
print(f"Test samples: {len(y_test):,}")
print(f"Train/test ratio: {len(y_train)/len(y_test):.1f}")

## Two-Tower Architecture

### Design Philosophy

**Why two separate towers?**

1. **Decoupled inputs:** Each tower sees only its own features; the towers are trained jointly but interact only through the final dot product
2. **Scalability:** Precompute item embeddings once, score millions of users in real-time
3. **Interpretability:** Can analyze user/item clusters separately

**Architecture choice: 256 → 128 → 32**

- **Layer 1 (256):** High-capacity feature extraction
- **Layer 2 (128):** Dimensionality reduction + abstraction
- **Layer 3 (32):** Compact embedding for dot-product similarity

**Why ReLU?** Non-saturating, computationally efficient, empirically strong

**Why linear output?** Embeddings are constrained by L2 normalization, not activation
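
As a rough sense of model size, the parameter count of one tower can be worked out by hand. The sketch below assumes a 14-dimensional input (the genre preference vector described earlier); the helper names are made up for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def tower_params(n_features, widths=(256, 128, 32)):
    """Total trainable parameters of a 256 -> 128 -> 32 tower."""
    total = 0
    n_in = n_features
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


# Assumed input width: 14 genre-preference features
user_tower_params = tower_params(14)
print(user_tower_params)  # 40864
```

Most of the parameters sit in the 256 → 128 layer, so the input width has only a small effect on the total.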

### Mathematical Formulation

**User Tower:**
$$
\mathbf{v}_u = \text{L2Normalize}\left( \phi_u(\mathbf{x}_u; \theta_u) \right)
$$

**Item Tower:**
$$
\mathbf{v}_m = \text{L2Normalize}\left( \phi_m(\mathbf{x}_m; \theta_m) \right)
$$

**Predicted Rating:**
$$
\hat{r}_{u,m} = \mathbf{v}_u^T \mathbf{v}_m = \sum_{k=1}^{32} v_{u,k} \cdot v_{m,k}
$$

**Loss (MSE):**
$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - \hat{r}_i \right)^2
$$
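
The four formulas above can be traced end to end in plain NumPy. This is a sketch with randomly initialized weights, not the trained Keras model; the input widths (14 and 16), the batch size, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_tower(n_in, widths=(256, 128, 32)):
    """Random (untrained) weights for a 256 -> 128 -> 32 tower."""
    params = []
    for n_out in widths:
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((W, np.zeros(n_out)))
        n_in = n_out
    return params


def tower(x, params):
    """phi(x; theta): ReLU hidden layers followed by a linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b


def l2_normalize(v, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)


batch = 4
x_u = rng.normal(size=(batch, 14))   # assumed user feature width
x_m = rng.normal(size=(batch, 16))   # assumed item feature width
r = rng.uniform(-1.0, 1.0, size=batch)  # ratings already scaled to [-1, 1]

v_u = l2_normalize(tower(x_u, init_tower(14)))
v_m = l2_normalize(tower(x_m, init_tower(16)))
r_hat = np.sum(v_u * v_m, axis=1)    # row-wise dot product
loss = np.mean((r - r_hat) ** 2)     # MSE

print("predictions:", np.round(r_hat, 3))
print("loss:", round(float(loss), 4))
```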

In [None]:
# Embedding dimension
embedding_dim = 32

# User Tower: Maps user features → 32D embedding
user_NN = Sequential([
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(embedding_dim, activation='linear')
], name='user_tower')

# Item Tower: Maps movie features → 32D embedding
item_NN = Sequential([
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(embedding_dim, activation='linear')
], name='item_tower')

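# Build full model with Keras Functional API
# Note: calling tf ops directly on symbolic tensors works with Keras 2 (TF <= 2.15).
# Under Keras 3 they must be wrapped in a layer, e.g. keras.layers.Lambda.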
input_user = Input(shape=(num_user_features,), name='user_input')
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)  # Normalize to unit sphere

input_item = Input(shape=(num_item_features,), name='item_input')
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)  # Normalize to unit sphere

# Dot product: measures cosine similarity (since vectors are normalized)
output = Dot(axes=1, name='dot_product')([vu, vm])

# Final model
model = keras.Model([input_user, input_item], output, name='two_tower_recsys')

model.summary()

### Why L2 Normalization?

Constraining embeddings to the unit hypersphere ($||\mathbf{v}||_2 = 1$) has several benefits:

1. **Dot product = Cosine similarity:**
   $$\mathbf{v}_u^T \mathbf{v}_m = ||\mathbf{v}_u|| \cdot ||\mathbf{v}_m|| \cdot \cos(\theta) = \cos(\theta)$$

2. **Scale invariance:** Prevents magnitude from dominating similarity

3. **Gradient stability:** Predictions are bounded in $[-1, 1]$, so the squared error and its gradient with respect to the prediction stay bounded

4. **Geometric interpretation:** Embeddings live on the unit sphere in 32-dimensional space
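
A small numerical check of the first point, using two arbitrary example vectors. It also shows a related identity for unit vectors, $||\mathbf{a} - \mathbf{b}||^2 = 2 - 2\cos(\theta)$, which means ranking by squared Euclidean distance and ranking by cosine similarity agree.

```python
import numpy as np

# Arbitrary example vectors (not model outputs)
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# Cosine similarity from the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after L2 normalization
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
dot_unit = a_hat @ b_hat

# Squared Euclidean distance between the unit vectors
sq_dist = np.sum((a_hat - b_hat) ** 2)

print(f"cosine similarity:      {cosine:.4f}")
print(f"dot of unit vectors:    {dot_unit:.4f}")
print(f"squared distance:       {sq_dist:.4f}")
print(f"2 - 2 * cosine:         {2 - 2 * cosine:.4f}")
```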

## Model Compilation and Training

**Optimizer:** Adam with learning rate 0.01
- Adaptive learning rates per parameter
- Momentum for faster convergence
- Empirically robust across architectures

**Loss:** Mean Squared Error (regression task)

**Training strategy:**
- Batch size: 256 (balance between gradient noise and memory)
- Epochs: 30 (check the learning curves in the diagnostics section to confirm the loss has plateaued)
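
For readers who want to see what "adaptive learning rates" and "momentum" mean in practice, here is a minimal NumPy sketch of the Adam update rule applied to a toy quadratic. The hyperparameter values are commonly used defaults and the objective is made up for illustration; this is not how Keras implements its optimizer internally.

```python
import numpy as np


def adam_minimize(grad_fn, w0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-7, steps=2000):
    """Run plain Adam updates and return the final parameters."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # first moment: running mean of gradients (momentum)
    v = np.zeros_like(w)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        # Each parameter gets its own effective step size via sqrt(v_hat)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w


# Toy objective: f(w) = ||w - target||^2, with gradient 2 * (w - target)
target = np.array([3.0, -1.0])
w_final = adam_minimize(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
print("final parameters:", np.round(w_final, 3))
```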

In [None]:
# Compile model
model.compile(
    optimizer=Adam(learning_rate=0.01),
    loss=MeanSquaredError()
)

# Train
history = model.fit(
    [user_train[:, u_s:], item_train[:, i_s:]],
    y_train,
    epochs=30,
    batch_size=256,
    validation_split=0.1,
    verbose=1
)

## Training Diagnostics

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(np.log10(history.history['loss']), label='log10(Training Loss)')
plt.plot(np.log10(history.history['val_loss']), label='log10(Validation Loss)')
plt.xlabel('Epoch')
plt.ylabel('log10(MSE)')
plt.title('Log-Scale Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final training loss: {history.history['loss'][-1]:.4f}")
print(f"Final validation loss: {history.history['val_loss'][-1]:.4f}")

## Model Evaluation

In [None]:
# Evaluate on test set
test_loss = model.evaluate(
    [user_test[:, u_s:], item_test[:, i_s:]],
    y_test,
    verbose=0
)

print(f"Test MSE: {test_loss:.4f}")
print(f"Test RMSE: {np.sqrt(test_loss):.4f}")

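# Convert back to the original rating scale.
# MinMaxScaler((-1, 1)) maps a rating span of (max - min) onto a span of 2,
# so errors in scaled units are multiplied by (max - min) / 2.
test_rmse_original = np.sqrt(test_loss) * (y_train_unscaled.max() - y_train_unscaled.min()) / 2
print(f"Test RMSE (original scale): {test_rmse_original:.3f} stars")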
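
A quick numerical check of how an RMSE measured in the $[-1, 1]$ target space maps back to rating units. The rating range and the synthetic predictions below are assumptions for illustration, not results from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
r_min, r_max = 0.5, 5.0  # assumed rating range

# Synthetic "true" ratings and noisy "predictions" in original units
y_true = rng.uniform(r_min, r_max, size=1000)
y_hat = np.clip(y_true + rng.normal(0.0, 0.5, size=1000), r_min, r_max)


def to_scaled(r):
    """Min-max map from [r_min, r_max] onto [-1, 1]."""
    return 2.0 * (r - r_min) / (r_max - r_min) - 1.0


rmse_original = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_scaled = np.sqrt(np.mean((to_scaled(y_true) - to_scaled(y_hat)) ** 2))

# The scaled interval has width 2, so the conversion factor is (max - min) / 2
rmse_recovered = rmse_scaled * (r_max - r_min) / 2.0

print(f"RMSE in rating units:     {rmse_original:.4f}")
print(f"RMSE in scaled units:     {rmse_scaled:.4f}")
print(f"Recovered from scaled:    {rmse_recovered:.4f}")
```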

## Inference: Recommend Movies for New User

**Workflow:**
1. User specifies genre preferences (0-5 scale)
2. Create user feature vector
3. Replicate across all movies
4. Compute predictions (dot products in embedding space)
5. Sort and return top-N
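
Steps 3 to 5 can be sketched without the trained model. In the snippet below, `toy_score` is a hypothetical stand-in for `model.predict` (it simply computes cosine similarity on raw features), and the random feature matrices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, n_genres = 6, 14

# One user row (preferences on a 0-5 scale) and one row per movie (genre flags)
user_row = rng.integers(0, 6, size=(1, n_genres)).astype(float)
item_rows = rng.integers(0, 2, size=(n_movies, n_genres)).astype(float)

# Step 3: replicate the user row so it lines up with every movie row
user_rows = np.tile(user_row, (n_movies, 1))


def toy_score(u, m, eps=1e-12):
    """Hypothetical scorer standing in for the trained two-tower model."""
    u_hat = u / np.maximum(np.linalg.norm(u, axis=1, keepdims=True), eps)
    m_hat = m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), eps)
    return np.sum(u_hat * m_hat, axis=1)


# Step 4: one score per (user, movie) pair
scores = toy_score(user_rows, item_rows)

# Step 5: sort descending and keep the top N
top_n = 3
top_idx = np.argsort(-scores)[:top_n]

print("top movie indices:", top_idx)
print("their scores:     ", np.round(scores[top_idx], 3))
```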

In [None]:
# Define a new user profile
new_user_preferences = {
    'Action': 5,
    'Adventure': 5,
    'Animation': 0,
    'Children': 0,
    'Comedy': 0,
    'Crime': 3,
    'Documentary': 0,
    'Drama': 0,
    'Fantasy': 5,
    'Horror': 0,
    'Mystery': 4,
    'Romance': 0,
    'SciFi': 5,
    'Thriller': 4
}

print("New user profile:")
for genre, rating in new_user_preferences.items():
    if rating > 0:
        print(f"  {genre}: {rating}/5")

In [None]:
# Construct user feature vector
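# Format: three leading columns (user ID and rating statistics), then genre preferences.
# The leading columns are removed by the u_s slice before prediction,
# so the zeros below are placeholders.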
new_user_id = 0
new_avg_rating = 0
new_rating_count = 0

user_vec = np.array([
    [new_user_id, new_avg_rating, new_rating_count,
     new_user_preferences['Action'],
     new_user_preferences['Adventure'],
     new_user_preferences['Animation'],
     new_user_preferences['Children'],
     new_user_preferences['Comedy'],
     new_user_preferences['Crime'],
     new_user_preferences['Documentary'],
     new_user_preferences['Drama'],
     new_user_preferences['Fantasy'],
     new_user_preferences['Horror'],
     new_user_preferences['Mystery'],
     new_user_preferences['Romance'],
     new_user_preferences['SciFi'],
     new_user_preferences['Thriller']]
])

# Replicate user vector for all movies
user_vecs = gen_user_vecs(user_vec, len(item_vecs))

# Scale features
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)

# Predict ratings
y_pred = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]], verbose=0)

# Unscale predictions
y_pred_unscaled = scalerTarget.inverse_transform(y_pred.reshape(-1, 1)).flatten()

# Sort by predicted rating (descending)
sorted_indices = np.argsort(-y_pred_unscaled)
sorted_predictions = y_pred_unscaled[sorted_indices]
sorted_items = item_vecs[sorted_indices]

print("\n🎬 Top 10 Recommended Movies:\n")
print_pred_movies(sorted_predictions, sorted_items, movie_dict, maxcount=10)

## Inference: Predict for Existing User

Compare model predictions against actual ratings for a user in the dataset.

In [None]:
# Select user
uid = 2

# Generate (user, item) pairs for all movies
user_vecs, y_actual = get_user_vecs(uid, user_train_unscaled, item_vecs, user_to_genre)

# Scale and predict
suser_vecs = scalerUser.transform(user_vecs)
sitem_vecs = scalerItem.transform(item_vecs)
y_pred = model.predict([suser_vecs[:, u_s:], sitem_vecs[:, i_s:]], verbose=0)
y_pred_unscaled = scalerTarget.inverse_transform(y_pred.reshape(-1, 1)).flatten()

# Sort by prediction
sorted_indices = np.argsort(-y_pred_unscaled)
sorted_predictions = y_pred_unscaled[sorted_indices]
sorted_items = item_vecs[sorted_indices]
sorted_user = user_vecs[sorted_indices]
sorted_actual = y_actual[sorted_indices]

print(f"\n📊 Predictions vs. Actual Ratings for User {uid}:\n")
print_existing_user(
    sorted_predictions, 
    sorted_actual.reshape(-1, 1), 
    sorted_user, 
    sorted_items, 
    ivs, 
    uvs, 
    movie_dict, 
    maxcount=20
)

## Embedding Analysis: Movie Similarity

### Extract Item Embeddings

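We can isolate the item tower to generate embeddings for all movies. These embeddings capture latent factors that explain user preferences. To compare two embeddings we first need a distance measure; squared Euclidean distance is used here.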

In [None]:
def euclidean_distance_squared(a, b):
    """
    Compute squared Euclidean distance between two vectors.
    
    d² = Σ(aᵢ - bᵢ)²
    
    Args:
        a, b: numpy arrays of shape (n,)
    
    Returns:
        float: squared distance
    """
    return np.sum((a - b) ** 2)

In [None]:
# Quick validation
a1 = np.array([1.0, 2.0, 3.0])
b1 = np.array([1.0, 2.0, 3.0])
print(f"Distance (identical vectors): {euclidean_distance_squared(a1, b1):.3f} (expect 0)")

a2 = np.array([1.1, 2.1, 3.1])
b2 = np.array([1.0, 2.0, 3.0])
print(f"Distance (small difference): {euclidean_distance_squared(a2, b2):.3f} (expect 0.03)")

a3 = np.array([0, 1, 0])
b3 = np.array([1, 0, 0])
print(f"Distance (orthogonal): {euclidean_distance_squared(a3, b3):.3f} (expect 2)")

### Build Item-Only Model

Extract the item tower with L2 normalization to generate embeddings.

In [None]:
# Build model that outputs only item embeddings
input_item_standalone = Input(shape=(num_item_features,))
vm_standalone = item_NN(input_item_standalone)
vm_standalone = tf.linalg.l2_normalize(vm_standalone, axis=1)

item_model = keras.Model(input_item_standalone, vm_standalone, name='item_embeddings')
item_model.summary()

### Generate All Movie Embeddings

In [None]:
# Scale and embed all movies
scaled_item_vecs = scalerItem.transform(item_vecs)
movie_embeddings = item_model.predict(scaled_item_vecs[:, i_s:], verbose=0)

print(f"Movie embedding matrix shape: {movie_embeddings.shape}")
print(f"Embedding dimension: {movie_embeddings.shape[1]}")
print(f"Number of movies: {movie_embeddings.shape[0]}")

# Verify L2 normalization
norms = np.linalg.norm(movie_embeddings, axis=1)
print(f"\nEmbedding norms (should all be ~1.0):")
print(f"  Mean: {norms.mean():.6f}")
print(f"  Std:  {norms.std():.6f}")
print(f"  Min:  {norms.min():.6f}")
print(f"  Max:  {norms.max():.6f}")

## Find Similar Movies

**Method:** Compute pairwise distances in embedding space

**Interpretation:** Movies with small embedding distance share latent characteristics that drive user preferences, even if their explicit features (genres) differ.
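
For larger catalogs, the pairwise distances can be computed without a Python double loop. The sketch below uses the identity $||\mathbf{a} - \mathbf{b}||^2 = ||\mathbf{a}||^2 + ||\mathbf{b}||^2 - 2\,\mathbf{a}^T\mathbf{b}$ on random unit vectors; the function name and the toy embedding matrix are assumptions for illustration.

```python
import numpy as np


def pairwise_sq_dists(E):
    """Squared Euclidean distance between every pair of rows of E."""
    sq_norms = np.sum(E ** 2, axis=1)
    gram = E @ E.T
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Rounding can leave tiny negative values; clip them to zero
    return np.maximum(d2, 0.0)


rng = np.random.default_rng(0)
E = rng.normal(size=(5, 32))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows

D = pairwise_sq_dists(E)

# Reference result from an explicit double loop
D_loop = np.array([[np.sum((E[i] - E[j]) ** 2) for j in range(len(E))]
                   for i in range(len(E))])

print("max abs difference vs. loop:", np.max(np.abs(D - D_loop)))
```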

In [None]:
# Compute pairwise distance matrix
n_movies = len(movie_embeddings)
distance_matrix = np.zeros((n_movies, n_movies))

for i in range(n_movies):
    for j in range(n_movies):
        distance_matrix[i, j] = euclidean_distance_squared(
            movie_embeddings[i], 
            movie_embeddings[j]
        )

# Mask diagonal (self-distance)
masked_distances = ma.masked_array(distance_matrix, mask=np.eye(n_movies))

print("Distance matrix computed.")
print(f"Shape: {distance_matrix.shape}")
print(f"\nSample distances for movie 0:")
print(f"  Min: {masked_distances[0].min():.4f}")
print(f"  Mean: {masked_distances[0].mean():.4f}")
print(f"  Max: {masked_distances[0].max():.4f}")

### Nearest Neighbor Table

For each movie, find its nearest neighbor in embedding space.
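
One way to find every nearest neighbor at once is to overwrite the diagonal with infinity and take a row-wise `argmin`. The small distance matrix below is hand-written for illustration.

```python
import numpy as np

# Hand-written symmetric distance matrix for four items (illustrative values)
D = np.array([
    [0.0, 0.2, 0.9, 0.5],
    [0.2, 0.0, 0.4, 0.7],
    [0.9, 0.4, 0.0, 0.1],
    [0.5, 0.7, 0.1, 0.0],
])

# Exclude self-matches by making each item infinitely far from itself
D_no_self = D.copy()
np.fill_diagonal(D_no_self, np.inf)

# Index of the closest other item for every row
nearest = np.argmin(D_no_self, axis=1)
print(nearest)  # [1 0 3 2]
```

This gives the same result as masking the diagonal, but works on a plain array.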

In [None]:
# Build similarity table
count = 50  # Number of movies to display
similarity_table = [["Movie", "Genres", "Most Similar Movie", "Genres"]]

for i in range(min(count, n_movies)):
    # Find nearest neighbor
    nearest_idx = np.argmin(masked_distances[i])
    
    # Get movie details
    movie1_id = int(item_vecs[i, 0])
    movie2_id = int(item_vecs[nearest_idx, 0])
    
    similarity_table.append([
        movie_dict[movie1_id]['title'],
        movie_dict[movie1_id]['genres'],
        movie_dict[movie2_id]['title'],
        movie_dict[movie2_id]['genres']
    ])

# Display as HTML table
from IPython.display import HTML
html_table = tabulate.tabulate(similarity_table, tablefmt='html', headers='firstrow')
display(HTML(html_table))

## Key Insights

### What the Model Learned

1. **Latent Taste Profiles:** The 32-dimensional embedding space captures abstract preferences beyond genre labels

2. **Automatic Feature Discovery:** The network learns which combinations of genres/attributes matter for prediction

3. **Similarity Structure:** Movies cluster by shared audience appeal, not just explicit features

### Why This Beats Simple Approaches

| Method | Representation | Limitation / Strength |
|--------|---------------|------------|
| **Content-Based** | Boolean genre flags | Ignores user-specific preferences |
| **Matrix Factorization** | Linear embeddings | Cannot model nonlinear interactions |
| **Two-Tower DNN** | Learned embeddings | ✓ Captures complex patterns automatically |

### Production Advantages

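- **Scalability:** Precompute item embeddings once; scoring a user then costs one dot product per item, and approximate nearest-neighbor indexes can make retrieval sublinear in catalog size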
- **Cold Start:** New users can be scored immediately from genre preferences
- **Interpretability:** Embedding space can be visualized (t-SNE, UMAP)
- **Extensibility:** Easy to add side features (actor, director, year)
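
The scalability point can be sketched as follows: with item embeddings precomputed, scoring one user is a single matrix-vector product, and the top-k items can be selected without sorting the whole catalog. The random embeddings and sizes are assumptions for illustration, and this is a brute-force scan rather than an approximate nearest-neighbor index.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, k = 1000, 32, 5

# Precomputed, unit-length item embeddings (random stand-ins here)
item_emb = rng.normal(size=(n_items, dim))
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

# One user embedding, computed at request time
user_emb = rng.normal(size=dim)
user_emb /= np.linalg.norm(user_emb)

# Score every item with a single matrix-vector product
scores = item_emb @ user_emb

# Select the k best items without a full sort, then order just those k
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]

print("top-k item indices:", top_k)
print("top-k scores:      ", np.round(scores[top_k], 3))
```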

## Next Steps

**Potential improvements:**

1. **Attention mechanism:** Learn feature importance weights
2. **Triplet loss:** Directly optimize embedding distances
3. **Multi-task learning:** Predict ratings + clicks simultaneously
4. **Temporal dynamics:** Model how preferences evolve over time
5. **Hybrid approach:** Combine with content-based features (plot summaries, cast)

## References

1. **Covington, P., Adams, J., & Sargin, E. (2016).** "Deep neural networks for YouTube recommendations." *RecSys*.
2. **He, X., et al. (2017).** "Neural collaborative filtering." *WWW*.
3. **Koren, Y., Bell, R., & Volinsky, C. (2009).** "Matrix factorization techniques for recommender systems." *Computer*, 42(8).