# Thrills & Mysteries - Book Recommender System
Notebook 4 Enhanced: Interactive Cluster-Aware Gradio Application
name: Sai Sneha Siddapura Venkataramappa
uniqname: saisneha

FEATURES:
1. Cluster-aware recommendations with Î±/Î²/Î³ weighting
2. Multiple recommendation modes (Similar, Explore, Discover)
3. Cluster visualization and insights
4. Explainability with common themes
5. Diversity controls

In [18]:
import os
import gc
import pickle
import warnings
from pathlib import Path
from collections import Counter
import requests
from urllib.parse import quote

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gradio as gr
from rapidfuzz import process, fuzz
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings("ignore")
np.random.seed(42)

print("Imports successful")

Imports successful


In [59]:
print("LOADING CHECKPOINTS AND DATA FOR DEPLOYMENT")

BASE_DIR = Path.cwd()
CHECKPOINTS_DIR = BASE_DIR / "checkpoints"
EMBEDDINGS_DIR = BASE_DIR / "embeddings"

# Load both train and test data
print("Loading checkpoint 1 (train data)...")
with open(CHECKPOINTS_DIR / 'checkpoint_1_data_prepared.pkl', 'rb') as f:
    checkpoint_1 = pickle.load(f)

df_train = checkpoint_1['df_train']
df_test = checkpoint_1['df_test']
print(f"Loaded: {len(df_train):,} train + {len(df_test):,} test books")

# Combine datasets for deployment
df_combined = pd.concat([df_train, df_test], ignore_index=True)
print(f"Combined: {len(df_combined):,} total books for recommendations")

# Load embeddings info
print("Loading checkpoint 2 (embeddings)...")
with open(CHECKPOINTS_DIR / 'checkpoint_2_embeddings.pkl', 'rb') as f:
    checkpoint_2 = pickle.load(f)

best_model_name = checkpoint_2['best_model_name']
optimal_k = checkpoint_2['optimal_k']

# Load both train and test embeddings
print(f"Loading {best_model_name} embeddings...")
model_dir = EMBEDDINGS_DIR / ("miniLM" if best_model_name == "MiniLM" else "mpnet")
train_embeddings = np.load(model_dir / 'train_embeddings.npy')
test_embeddings = np.load(model_dir / 'test_embeddings.npy')
print(f"Loaded train embeddings: {train_embeddings.shape}")
print(f"Loaded test embeddings: {test_embeddings.shape}")

# Combine embeddings
combined_embeddings = np.vstack([train_embeddings, test_embeddings])
print(f"Combined embeddings: {combined_embeddings.shape}")

# Load evaluation results (keep these - they're from proper train/test split)
print("Loading checkpoint 3 (evaluation metrics - from test set)...")
with open(CHECKPOINTS_DIR / 'checkpoint_3_evaluation.pkl', 'rb') as f:
    checkpoint_3 = pickle.load(f)

eval_results = checkpoint_3['eval_results']
CLUSTER_NAMES = checkpoint_3['cluster_names']

# Handle cluster assignments for combined data
df_plot = checkpoint_2['df_plot']
umap_sample_indices = checkpoint_2['umap_sample_indices']

# Initialize clusters for combined dataset
combined_clusters = np.full(len(df_combined), -1)

# Apply train clusters (umap_sample_indices are relative to train set)
combined_clusters[umap_sample_indices] = df_plot['cluster'].values

if len(df_test) > 0:
    from sklearn.metrics.pairwise import cosine_similarity

    print("Assigning clusters to test set books...")
    # Get test indices in combined dataset
    test_start_idx = len(df_train)
    test_end_idx = len(df_combined)

    # Find clustered train books
    clustered_mask = combined_clusters[:len(df_train)] != -1
    clustered_indices = np.where(clustered_mask)[0]

    if len(clustered_indices) > 0:
        # For each test book, find nearest clustered train book
        for test_idx in range(test_start_idx, test_end_idx):
            test_emb = combined_embeddings[test_idx].reshape(1, -1)
            train_embs = combined_embeddings[clustered_indices]

            # Find most similar clustered book
            similarities = cosine_similarity(test_emb, train_embs).flatten()
            nearest_idx = clustered_indices[np.argmax(similarities)]

            # Assign same cluster
            combined_clusters[test_idx] = combined_clusters[nearest_idx]

    print(f"Test books clustered: {(combined_clusters[test_start_idx:] != -1).sum()}/{len(df_test)}")

df_combined['kmeans_cluster'] = combined_clusters

print(f"âœ“ Deployment data ready (K={optimal_k} clusters)")
print(f"âœ“ Clustered books: {(combined_clusters != -1).sum():,}/{len(df_combined):,}")

# Update the main dataframe variable for rest of code
df_train = df_combined  # Use combined as "train" for rest of notebook
train_embeddings = combined_embeddings
train_clusters = combined_clusters

LOADING CHECKPOINTS AND DATA FOR DEPLOYMENT
Loading checkpoint 1 (train data)...
Loaded: 6,217 train + 1,555 test books
Combined: 7,772 total books for recommendations
Loading checkpoint 2 (embeddings)...
Loading MPNet embeddings...
Loaded train embeddings: (6217, 768)
Loaded test embeddings: (1555, 768)
Combined embeddings: (7772, 768)
Loading checkpoint 3 (evaluation metrics - from test set)...
Assigning clusters to test set books...
Test books clustered: 1555/1555
âœ“ Deployment data ready (K=9 clusters)
âœ“ Clustered books: 7,772/7,772


In [60]:
# Rebuild genre matrix for combined data
print("\nPREPARING GENRE VECTORS FOR COMBINED DATA")
all_genres_list = []
for genres in df_train['genres']:
    if isinstance(genres, list):
        all_genres_list.extend(genres)

unique_genres = sorted(set(all_genres_list))
print(f"Found {len(unique_genres)} unique genres")

mlb = MultiLabelBinarizer(classes=unique_genres)
mlb.fit([unique_genres])
train_genre_matrix = mlb.transform(df_train['genres'])
print(f"Genre matrix: {train_genre_matrix.shape}")


PREPARING GENRE VECTORS FOR COMBINED DATA
Found 751 unique genres
Genre matrix: (7772, 751)


In [61]:
print("INITIALIZING CLUSTER-AWARE RECOMMENDER")

class ClusterAwareRecommender:
    """Enhanced cluster-aware hybrid recommender"""

    def __init__(self, df, embeddings, genre_matrix, clusters,
                 alpha_content=0.5, alpha_genre=0.3, alpha_cluster=0.2):
        self.df = df
        self.embeddings = embeddings
        self.genre_matrix = genre_matrix
        self.clusters = clusters
        self.alpha_content = alpha_content
        self.alpha_genre = alpha_genre
        self.alpha_cluster = alpha_cluster

        # Normalize embeddings
        self.norm_embeddings = embeddings / np.linalg.norm(
            embeddings, axis=1, keepdims=True
        )

        self.clustered_mask = clusters != -1
        print(f"âœ“ Recommender initialized")
        print(f"  Weights: C={alpha_content}, G={alpha_genre}, Cl={alpha_cluster}")

    def compute_content_similarity(self, query_idx):
        """Compute content-based similarity"""
        query_emb = self.norm_embeddings[query_idx].reshape(1, -1)
        similarities = np.dot(self.norm_embeddings, query_emb.T).flatten()
        similarities[query_idx] = -1
        return similarities

    def compute_genre_similarity(self, query_idx):
        """Compute genre-based similarity (Jaccard)"""
        query_genres = self.genre_matrix[query_idx]
        intersection = np.dot(self.genre_matrix, query_genres)
        union = (self.genre_matrix.sum(axis=1) + query_genres.sum() - intersection)
        union = np.where(union == 0, 1, union)
        return intersection / union

    def compute_cluster_bonus(self, query_idx):
        """Compute cluster membership bonus"""
        query_cluster = self.clusters[query_idx]
        if query_cluster == -1:
            return np.zeros(len(self.clusters))
        same_cluster = (self.clusters == query_cluster).astype(float)
        return same_cluster

    def recommend(self, query_idx, top_k=10, within_cluster_only=False):
        """Generate cluster-aware hybrid recommendations"""
        content_sim = self.compute_content_similarity(query_idx)
        genre_sim = self.compute_genre_similarity(query_idx)
        cluster_bonus = self.compute_cluster_bonus(query_idx)

        hybrid_scores = (
            self.alpha_content * content_sim +
            self.alpha_genre * genre_sim +
            self.alpha_cluster * cluster_bonus
        )

        if within_cluster_only and self.clusters[query_idx] != -1:
            query_cluster = self.clusters[query_idx]
            cluster_mask = self.clusters == query_cluster
            hybrid_scores = np.where(cluster_mask, hybrid_scores, -np.inf)

        hybrid_scores[query_idx] = -np.inf
        top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]

        results = []
        for idx in top_indices:
            if hybrid_scores[idx] == -np.inf:
                continue

            cluster_id = int(self.clusters[idx]) if self.clusters[idx] != -1 else None
            results.append({
                'index': idx,
                'title': self.df.iloc[idx]['title'],
                'author': self.df.iloc[idx]['author'],
                'cluster': cluster_id,
                'cluster_name': CLUSTER_NAMES.get(cluster_id, 'Unclustered') if cluster_id is not None else 'Unclustered',
                'genres': ', '.join(self.df.iloc[idx]['genres']) if isinstance(self.df.iloc[idx]['genres'], list) else str(self.df.iloc[idx]['genres']),
                'description': self.df.iloc[idx]['description'][:300] + '...',
                'content_sim': float(content_sim[idx]),
                'genre_sim': float(genre_sim[idx]),
                'cluster_bonus': float(cluster_bonus[idx]),
                'hybrid_score': float(hybrid_scores[idx])
            })

        return results

    def recommend_with_diversity(self, query_idx, top_k=10,
                                 within_cluster=7, cross_cluster=3):
        """Balanced recommendations: within + cross cluster"""
        within_recs = self.recommend(query_idx, top_k=within_cluster,
                                    within_cluster_only=True)

        all_recs = self.recommend(query_idx, top_k=top_k*2,
                                 within_cluster_only=False)

        query_cluster = self.clusters[query_idx]
        cross_recs = [r for r in all_recs
                     if r['cluster'] != query_cluster and r['cluster'] is not None]
        cross_recs = cross_recs[:cross_cluster]

        combined = within_recs + cross_recs
        combined.sort(key=lambda x: x['hybrid_score'], reverse=True)

        return combined[:top_k]

INITIALIZING CLUSTER-AWARE RECOMMENDER


In [62]:
print("DEFINING HELPER FUNCTIONS")

def find_book_fuzzy(query, titles, threshold=70):
    """Fuzzy search for book titles with better matching"""
    # Clean query
    query = query.strip()

    # Try exact match first (case insensitive)
    for title in titles:
        if title.lower() == query.lower():
            return [(title, 100)]

    # Try starts-with match (common for book searches like "The Da Vinci Code")
    query_lower = query.lower()
    starts_matches = []
    for title in titles:
        if title.lower().startswith(query_lower):
            score = 95
            starts_matches.append((title, score))

    if starts_matches:
        return starts_matches[:5]

    # Try contains match (substring)
    contains_matches = []
    for title in titles:
        title_lower = title.lower()
        if query_lower in title_lower:
            # Score based on position and length ratio
            position_score = 1 - (title_lower.index(query_lower) / len(title_lower))
            length_ratio = len(query) / len(title)
            score = int(80 + (10 * position_score) + (10 * length_ratio))
            contains_matches.append((title, score))

    if contains_matches:
        contains_matches.sort(key=lambda x: x[1], reverse=True)
        return contains_matches[:5]

    # Fall back to fuzzy matching only if nothing else worked
    matches = process.extract(
        query,
        titles,
        scorer=fuzz.token_sort_ratio,
        limit=5
    )
    # rapidfuzz returns (match, score, index) tuples
    matches = [(title, score) for title, score, idx in matches if score >= threshold]
    return matches

def fetch_cover_image(title):
    """Fetch book cover from Google Books API"""
    try:
        query = quote(title)
        url = f"https://www.googleapis.com/books/v1/volumes?q=intitle:{query}&maxResults=1"
        r = requests.get(url, timeout=3).json()
        img_url = r['items'][0]['volumeInfo'].get('imageLinks', {}).get('thumbnail')
        return img_url if img_url else "https://via.placeholder.com/128x192?text=No+Cover"
    except:
        return "https://via.placeholder.com/128x192?text=No+Cover"

def extract_common_themes(query_desc, rec_desc, top_n=3):
    """Extract common themes using TF-IDF"""
    try:
        texts = [query_desc, rec_desc]
        vectorizer = TfidfVectorizer(max_features=30, stop_words='english',
                                     ngram_range=(1,2))
        tfidf = vectorizer.fit_transform(texts)
        feature_names = vectorizer.get_feature_names_out()

        query_tfidf = tfidf[0].toarray().flatten()
        rec_tfidf = tfidf[1].toarray().flatten()
        common_scores = query_tfidf * rec_tfidf

        if common_scores.max() == 0:
            return []

        top_idx = np.argsort(common_scores)[-top_n:][::-1]
        themes = [feature_names[i] for i in top_idx if common_scores[i] > 0]
        return themes
    except:
        return []

def get_cluster_color(cluster_name):
    """Get color for cluster badges"""
    colors = {
        "Domestic & Psychological Thrillers": "#8b4513",
        "Horror & Supernatural Mysteries": "#6b2c91",
        "Police Procedurals & Detective Fiction": "#2c5f7d",
        "Literary & British Mysteries": "#3d6e5c",
        "Comics & Graphic Novels": "#b5651d",
        "Romantic Suspense": "#8b1538",
        "Espionage & Military Thrillers": "#2c3e50",
        "True Crime & Crime Journalism": "#9a2a2a",
        "Historical Mysteries": "#a0522d",
        "Unclustered": "#5a5a5a"
    }
    return colors.get(cluster_name, "#5a5a5a")

def format_recommendations_html(query_title, query_cluster, recommendations,
                               alpha_c, alpha_g, alpha_cl, mode):
    """Format recommendations as beautiful gothic-themed HTML"""

    # Get query book details
    query_idx = title_list.index(query_title)
    query_book = df_train.iloc[query_idx]
    query_cover = fetch_cover_image(query_title)
    query_author = query_book.get('author', 'Unknown')
    query_rating = query_book.get('rating', 'N/A')
    query_genres = ', '.join(query_book['genres']) if isinstance(query_book['genres'], list) else str(query_book['genres'])
    query_desc = query_book['description'][:200] + ("..." if len(query_book['description']) > 200 else "")

    cluster_color = get_cluster_color(query_cluster)

    # Input book header with gothic styling
    html = f"""
    <div style="font-family: 'Crimson Text', serif; background: #0d0d0d; color: #c9c9c9;">
        <div style="background: linear-gradient(135deg, rgba(139, 69, 19, 0.15) 0%, rgba(139, 69, 19, 0.05) 100%);
                    border: 1px solid rgba(139, 69, 19, 0.4); padding: 25px; margin-bottom: 30px; position: relative;">
            <div style="position: absolute; top: 10px; left: 10px; right: 10px; bottom: 10px;
                        border: 1px solid rgba(139, 69, 19, 0.2); pointer-events: none;"></div>
            <div style="position: relative;">
                <div style="font-family: 'Cinzel', serif; font-size: 0.85em; color: #8b4513;
                            font-weight: 600; letter-spacing: 3px; text-transform: uppercase; margin-bottom: 15px;">
                    â—† Input Book â—†
                </div>
                <div style="display: flex; gap: 20px; align-items: start;">
                    <img src="{query_cover}" width="100" style="border: 1px solid rgba(139, 69, 19, 0.3);
                                                                 box-shadow: 0 4px 12px rgba(0,0,0,0.5);">
                    <div style="flex: 1;">
                        <div style="font-size: 1.3em; color: #d4d4d4; font-weight: 600; margin-bottom: 10px;">
                            {query_title} <span style="color: #8b4513;">by</span> {query_author}
                        </div>
                        <div style="font-size: 0.95em; color: #a0673d; margin-bottom: 8px;">
                            Rating: <span style="color: #8b4513; font-weight: 600;">{query_rating}</span>
                        </div>
                        <div style="font-size: 0.9em; color: #7a7a7a; margin-bottom: 8px;">
                            <strong style="color: #8b4513;">Cluster:</strong> {query_cluster}
                        </div>
                        <div style="font-size: 0.9em; color: #7a7a7a; margin-bottom: 12px;">
                            <strong style="color: #8b4513;">Genres:</strong> {query_genres}
                        </div>
                        <div style="font-size: 0.95em; color: #7a7a7a; font-style: italic; line-height: 1.5;">
                            {query_desc}
                        </div>
                    </div>
                </div>
                <div style="margin-top: 15px; padding-top: 15px; border-top: 1px solid rgba(139, 69, 19, 0.2);">
                    <span style="font-family: 'Cinzel', serif; color: #8b4513; font-size: 0.85em; letter-spacing: 2px;">
                        MODE: {mode.upper()}
                    </span>
                    <span style="color: #7a7a7a; font-size: 0.85em; margin-left: 15px;">
                        Weights: Content={alpha_c:.1f} | Genre={alpha_g:.1f} | Cluster={alpha_cl:.1f}
                    </span>
                </div>
            </div>
        </div>
    """

    # Recommendations section with gothic table
    html += f"""
        <div style="background: linear-gradient(135deg, rgba(0, 0, 0, 0.6) 0%, rgba(20, 20, 20, 0.8) 100%);
                    border: 1px solid rgba(139, 69, 19, 0.4); padding: 25px; position: relative;">
            <div style="position: absolute; top: 10px; left: 10px; right: 10px; bottom: 10px;
                        border: 1px solid rgba(139, 69, 19, 0.2); pointer-events: none;"></div>
            <div style="font-family: 'Cinzel', serif; font-size: 1.1em; color: #8b4513; font-weight: 600;
                        letter-spacing: 3px; text-transform: uppercase; margin-bottom: 20px; position: relative; text-align: center;">
                â—† Top {len(recommendations)} Recommendations â—†
            </div>
            <div style="overflow-x: auto; position: relative;">
                <table style="width:100%; border-collapse: collapse; position: relative;">
                    <thead style="background: linear-gradient(135deg, rgba(139, 69, 19, 0.15) 0%, rgba(139, 69, 19, 0.05) 100%);">
                        <tr>
                            <th style="padding: 18px 15px; text-align: center; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 5%;">Rank</th>
                            <th style="padding: 18px 15px; text-align: left; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 10%;">Cover</th>
                            <th style="padding: 18px 15px; text-align: left; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 20%;">Title & Author</th>
                            <th style="padding: 18px 15px; text-align: center; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 18%;">Cluster</th>
                            <th style="padding: 18px 15px; text-align: left; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 32%;">Description</th>
                            <th style="padding: 18px 15px; text-align: center; color: #8b4513; font-family: 'Cinzel', serif;
                                       font-size: 0.85em; font-weight: 600; border-bottom: 1px solid rgba(139, 69, 19, 0.4);
                                       letter-spacing: 2px; text-transform: uppercase; width: 15%;">Scores</th>
                        </tr>
                    </thead>
                    <tbody>
    """

    # Recommendation rows
    for i, rec in enumerate(recommendations, 1):
        cover = fetch_cover_image(rec['title'])
        same_cluster = rec['cluster_name'] == query_cluster
        badge_color = get_cluster_color(rec['cluster_name'])

        html += f"""
                        <tr style="border-bottom: 1px solid rgba(139, 69, 19, 0.15);">
                            <td style="padding: 18px 15px; text-align: center; vertical-align: middle;">
                                <div style="font-family: 'Cinzel', serif; font-size: 1.4em; color: #8b4513; font-weight: 600;">
                                    {i}
                                </div>
                            </td>
                            <td style="padding: 18px 15px; vertical-align: middle;">
                                <img src="{cover}" width="90" style="border: 1px solid rgba(139, 69, 19, 0.3);
                                                                      box-shadow: 0 4px 12px rgba(0,0,0,0.5);">
                            </td>
                            <td style="padding: 18px 15px; vertical-align: middle;">
                                <div style="font-size: 1.1em; color: #d4d4d4; font-weight: 600; margin-bottom: 6px;">
                                    {rec['title']}
                                </div>
                                <div style="font-size: 0.95em; color: #a0673d;">
                                    by {rec['author']}
                                </div>
                            </td>
                            <td style="padding: 18px 15px; vertical-align: middle; text-align: center;">
                                <div style="background: {badge_color}; color: white; padding: 8px 12px;
                                           border-radius: 4px; font-size: 0.85em; margin-bottom: 6px;">
                                    {'ðŸŽ¯ Same' if same_cluster else 'ðŸ”„ Cross'}
                                </div>
                                <div style="font-size: 0.8em; color: #7a7a7a; line-height: 1.3;">
                                    {rec['cluster_name'][:30]}{'...' if len(rec['cluster_name']) > 30 else ''}
                                </div>
                            </td>
                            <td style="padding: 18px 20px; vertical-align: middle;">
                                <div style="font-size: 0.9em; color: #7a7a7a; font-style: italic; line-height: 1.6;">
                                    {rec['description']}
                                </div>
                            </td>
                            <td style="padding: 18px 12px; vertical-align: middle;">
                                <div style="font-size: 0.8em; color: #7a7a7a; line-height: 1.5;">
                                    <div style="margin-bottom: 4px;">
                                        <span style="color: #8b4513;">Content:</span> {rec['content_sim']:.3f}
                                    </div>
                                    <div style="margin-bottom: 4px;">
                                        <span style="color: #8b4513;">Genre:</span> {rec['genre_sim']:.3f}
                                    </div>
                                    <div style="margin-bottom: 4px;">
                                        <span style="color: #8b4513;">Cluster:</span> {rec['cluster_bonus']:.1f}
                                    </div>
                                    <div style="padding-top: 6px; border-top: 1px solid rgba(139, 69, 19, 0.2);">
                                        <span style="color: #a0673d; font-weight: 600;">Total:</span> {rec['hybrid_score']:.3f}
                                    </div>
                                </div>
                            </td>
                        </tr>
        """

    html += """
                    </tbody>
                </table>
            </div>
        </div>
    </div>
    """

    return html

# Pre-compute title list
title_list = df_train['title'].tolist()

print("Helper functions defined")


DEFINING HELPER FUNCTIONS
Helper functions defined


In [63]:
print("DEFINING GRADIO APP FUNCTIONS")

def search_and_recommend(query, mode, alpha_content, alpha_genre, alpha_cluster,
                        top_k, search_threshold, progress=gr.Progress()):
    """Main function for Gradio app with cluster awareness"""

    progress(0, desc="Searching for book...")

    # Find matching books
    matches = find_book_fuzzy(query, title_list, threshold=search_threshold)

    if not matches:
        return f"""
        <div style='padding: 40px; text-align: center; background: #fff3cd;
                    border-radius: 10px; border: 2px solid #ffc107;'>
            <h3 style='color: #856404; margin: 0 0 10px 0;'>
                  No matches found for "{query}"
            </h3>
            <p style='color: #856404; margin: 0;'>
                Try a different title or lower the search threshold (try 50).
            </p>
        </div>
        """

    progress(0.3, desc="Found book, analyzing cluster...")

    # Use best match
    best_match = matches[0][0]
    match_score = matches[0][1]
    query_idx = title_list.index(best_match)
    query_cluster = train_clusters[query_idx]
    query_cluster_name = CLUSTER_NAMES.get(query_cluster, 'Unclustered') if query_cluster != -1 else 'Unclustered'

    # Adjust weights based on mode
    if mode == "Similar (Within Cluster)":
        alpha_c, alpha_g, alpha_cl = 0.4, 0.2, 0.4
    elif mode == "Explore (Balanced)":
        alpha_c, alpha_g, alpha_cl = alpha_content, alpha_genre, alpha_cluster
    elif mode == "Discover (Cross-Cluster)":
        alpha_c, alpha_g, alpha_cl = 0.5, 0.4, 0.1

    progress(0.5, desc="Computing recommendations...")

    # Initialize recommender
    recommender = ClusterAwareRecommender(
        df_train,
        train_embeddings,
        train_genre_matrix,
        train_clusters,
        alpha_content=alpha_c,
        alpha_genre=alpha_g,
        alpha_cluster=alpha_cl
    )

    # Generate recommendations - FIXED BALANCED MODE
    if mode == "Similar (Within Cluster)":
        recommendations = recommender.recommend(
            query_idx,
            top_k=top_k,
            within_cluster_only=True
        )
    elif mode == "Explore (Balanced)":
        progress(0.6, desc="Finding balanced recommendations...")
        # FIXED: Actually get cross-cluster recommendations
        within_count = int(top_k * 0.7)
        cross_count = top_k - within_count

        # Get within-cluster recs
        within_recs = recommender.recommend(
            query_idx,
            top_k=within_count * 2,
            within_cluster_only=True
        )[:within_count]

        progress(0.75, desc="Finding cross-cluster books...")

        # Get ALL recommendations
        all_recs = recommender.recommend(
            query_idx,
            top_k=top_k * 3,
            within_cluster_only=False
        )

        # Filter for DIFFERENT clusters only
        query_cluster_id = query_cluster if query_cluster != -1 else None
        cross_recs = []
        for rec in all_recs:
            if rec['cluster'] != query_cluster_id and rec['cluster'] is not None:
                cross_recs.append(rec)
                if len(cross_recs) >= cross_count:
                    break

        # Combine and sort by hybrid score
        recommendations = within_recs + cross_recs
        recommendations.sort(key=lambda x: x['hybrid_score'], reverse=True)
        recommendations = recommendations[:top_k]

    else:  # Discover (Cross-Cluster)
        progress(0.6, desc="Discovering new sub-genres...")

        # For discover mode, we want to HEAVILY favor cross-cluster books
        # First get all recommendations with LOW cluster weight
        temp_recommender = ClusterAwareRecommender(
            df_train,
            train_embeddings,
            train_genre_matrix,
            train_clusters,
            alpha_content=0.6,  # High content similarity
            alpha_genre=0.35,   # Moderate genre matching
            alpha_cluster=0.05  # Very low cluster bonus (encourage different clusters)
        )

        all_recs = temp_recommender.recommend(
            query_idx,
            top_k=top_k * 4,  # Get many more candidates
            within_cluster_only=False
        )

        # Separate by cluster
        query_cluster_id = query_cluster if query_cluster != -1 else None
        cross_recs = [r for r in all_recs if r['cluster'] != query_cluster_id and r['cluster'] is not None]
        within_recs = [r for r in all_recs if r['cluster'] == query_cluster_id]

        # Prioritize cross-cluster heavily (at least 80% should be cross-cluster)
        min_cross = int(top_k * 0.8)
        recommendations = cross_recs[:min_cross]

        # Fill remaining spots with best matches (cross or within)
        remaining = top_k - len(recommendations)
        if remaining > 0:
            remaining_recs = cross_recs[min_cross:] + within_recs
            recommendations.extend(remaining_recs[:remaining])

    progress(0.9, desc="Formatting results...")

    # Format as HTML
    html_output = format_recommendations_html(
        best_match,
        query_cluster_name,
        recommendations,
        alpha_c,
        alpha_g,
        alpha_cl,
        mode
    )

    # Add search info header with auto-scroll script
    search_info = f"""
    <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
                padding: 20px; border-radius: 10px; margin-bottom: 20px;
                box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
        <div style="color: white;">
            <h4 style="margin: 0 0 10px 0; font-size: 18px;">  Search Results</h4>
            <p style="margin: 0 0 8px 0; font-size: 15px;">
                Found: <strong>"{best_match}"</strong> (Match: {match_score}%)
            </p>
            <p style="margin: 0; font-size: 14px; opacity: 0.9;">
                Cluster: {query_cluster_name}
            </p>
    """

    if len(matches) > 1:
        search_info += """
        <details style="margin-top: 10px;">
            <summary style="cursor: pointer; font-size: 13px; opacity: 0.9;">
                Other possible matches â–¼
            </summary>
            <div style="margin-top: 8px; padding-left: 10px;">
        """
        for title, score in matches[1:]:
            search_info += f"<div style='font-size: 13px; margin: 4px 0;'>â€¢ {title} ({score}%)</div>"
        search_info += "</div></details>"

    search_info += """
        </div>
    </div>
    <script>
        setTimeout(function() {
            // Try to find the results element and scroll to it
            const results = document.querySelector('.gradio-container');
            if (results) {
                const resultsSection = results.querySelectorAll('.gr-prose')[1];
                if (resultsSection) {
                    resultsSection.scrollIntoView({ behavior: 'smooth', block: 'start' });
                }
            }
        }, 100);
    </script>
    """

    progress(1.0, desc="Complete!")

    return search_info + html_output


print("Gradio functions defined")

DEFINING GRADIO APP FUNCTIONS
Gradio functions defined


In [66]:
print("CREATING ENHANCED GRADIO INTERFACE")

custom_css = """
@import url('https://fonts.googleapis.com/css2?family=Crimson+Text:ital,wght@0,400;0,600;1,400&family=Cinzel:wght@400;600;700&display=swap');

body, .gradio-container {
    background: #0d0d0d !important;
    color: #c9c9c9 !important;
    font-family: 'Crimson Text', serif !important;
}

.gradio-container::before {
    content: '';
    position: fixed;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    background: repeating-linear-gradient(0deg, transparent, transparent 2px, rgba(0, 0, 0, .3) 2px, rgba(0, 0, 0, .3) 4px),
                radial-gradient(ellipse at center, rgba(20, 20, 20, 0.9) 0%, #0d0d0d 100%);
    pointer-events: none;
    opacity: 0.4;
    z-index: 0;
}

h1, .gr-prose h1 {
    font-family: 'Cinzel', serif !important;
    color: #8b4513 !important;
    text-align: center !important;
    letter-spacing: 8px !important;
    font-weight: 600 !important;
    text-transform: uppercase !important;
    margin: 20px 0 !important;
    font-size: 2.5em !important;
    border-bottom: 1px solid rgba(139, 69, 19, 0.3) !important;
    padding-bottom: 20px !important;
}

.gr-prose p {
    text-align: center !important;
    color: #6b6b6b !important;
    font-style: italic !important;
    letter-spacing: 2px !important;
    margin-bottom: 30px !important;
}

.gr-box, .gr-input, .gr-textbox, textarea, input {
    background: linear-gradient(135deg, rgba(0, 0, 0, 0.6) 0%, rgba(20, 20, 20, 0.8) 100%) !important;
    border: 1px solid rgba(139, 69, 19, 0.5) !important;
    border-radius: 0 !important;
    color: #c9c9c9 !important;
    font-family: 'Crimson Text', serif !important;
}

.gr-input:focus, textarea:focus, input:focus {
    border-color: #8b4513 !important;
    box-shadow: inset 0 0 15px rgba(0, 0, 0, 0.8) !important;
}

label, .gr-label {
    font-family: 'Cinzel', serif !important;
    color: #8b4513 !important;
    font-weight: 600 !important;
    letter-spacing: 2px !important;
    text-transform: uppercase !important;
    font-size: 0.9em !important;
}

/* Radio button styling */
.gr-radio-group, .gr-radio {
    background: transparent !important;
}

input[type="radio"] {
    appearance: none;
    -webkit-appearance: none;
    width: 18px !important;
    height: 18px !important;
    border: 2px solid #8b4513 !important;
    border-radius: 0 !important;
    background: rgba(0, 0, 0, 0.6) !important;
    cursor: pointer !important;
    position: relative !important;
    margin-right: 10px !important;
}

input[type="radio"]:checked {
    background: #8b4513 !important;
}

input[type="radio"]:checked::after {
    content: 'â—†';
    position: absolute;
    top: 50%;
    left: 50%;
    transform: translate(-50%, -50%);
    color: #000;
    font-size: 12px;
}

.gr-radio-group label {
    cursor: pointer !important;
    padding: 12px 20px !important;
    border: 1px solid rgba(139, 69, 19, 0.3) !important;
    margin: 5px 0 !important;
    background: linear-gradient(135deg, rgba(0, 0, 0, 0.4) 0%, rgba(20, 20, 20, 0.6) 100%) !important;
    transition: all 0.3s ease !important;
    display: flex !important;
    align-items: center !important;
}

.gr-radio-group label:hover {
    background: linear-gradient(135deg, rgba(139, 69, 19, 0.2) 0%, rgba(139, 69, 19, 0.1) 100%) !important;
    border-color: #8b4513 !important;
}

.gr-button, button {
    background: linear-gradient(135deg, #1a1a1a 0%, #000000 100%) !important;
    color: #8b4513 !important;
    border: 1px solid rgba(139, 69, 19, 0.6) !important;
    border-radius: 0 !important;
    font-family: 'Cinzel', serif !important;
    font-weight: 600 !important;
    letter-spacing: 2px !important;
    text-transform: uppercase !important;
    padding: 16px 35px !important;
    transition: all 0.4s ease !important;
    cursor: pointer !important;
}

.gr-button:hover, button:hover {
    background: linear-gradient(135deg, #000000 0%, #1a1a1a 100%) !important;
    border-color: #8b4513 !important;
    color: #a0673d !important;
    box-shadow: 0 5px 25px rgba(139, 69, 19, 0.3) !important;
}

.gr-panel, .gr-form, .gr-block {
    background: transparent !important;
    border: none !important;
}

::placeholder {
    color: #4a4a4a !important;
    font-style: italic !important;
}
"""

with gr.Blocks(
    title="Thrills & Mysteries â€” Cluster-Aware Book Recommender",
    css=custom_css
) as app:

    gr.HTML("<h1>â—† Thrills & Mysteries â€” Book Reccomender System â—†</h1>")
    gr.HTML("<p>Discover Your Next Literary Obsession</p>")

    # HORIZONTAL TABS for info (more user-friendly)
    with gr.Tabs():
        with gr.Tab("â—† SEARCH"):
            with gr.Row():
                with gr.Column(scale=8):  # 80% width for input
                    # Main Input
                    book_input = gr.Textbox(
                        label="Enter Book Title",
                        placeholder="The Shining, Sherlock Holmes, Death on the Nile...",
                        lines=1
                    )

                    # Recommendation Mode
                    mode_radio = gr.Radio(
                        choices=[
                            "Similar (Within Cluster)",
                            "Explore (Balanced)",
                            "Discover (Cross-Cluster)"
                        ],
                        value="Explore (Balanced)",
                        label="Recommendation Mode",
                        info="Choose your exploration strategy",
                        interactive=True
                    )

                    with gr.Accordion("âš™ Advanced Configuration", open=False):
                        gr.Markdown("**Weight Customization** _(Applies in Explore mode)_")

                        with gr.Row():
                            alpha_content_slider = gr.Slider(
                                minimum=0.0,
                                maximum=1.0,
                                value=0.5,
                                step=0.1,
                                label="Content Weight (Î±)"
                            )

                            alpha_genre_slider = gr.Slider(
                                minimum=0.0,
                                maximum=1.0,
                                value=0.3,
                                step=0.1,
                                label="Genre Weight (Î²)"
                            )

                            alpha_cluster_slider = gr.Slider(
                                minimum=0.0,
                                maximum=1.0,
                                value=0.2,
                                step=0.1,
                                label="Cluster Weight (Î³)"
                            )

                        with gr.Row():
                            top_k_slider = gr.Slider(
                                minimum=3,
                                maximum=20,
                                value=10,
                                step=1,
                                label="Number of Recommendations"
                            )

                            threshold_slider = gr.Slider(
                                minimum=40,
                                maximum=90,
                                value=50,  # LOWERED to 50
                                step=5,
                                label="Search Threshold",
                                info="Lower = more forgiving matching"
                            )

                    # Examples as collapsible accordion - MOVED ABOVE BUTTON
                    examples_accordion = gr.Accordion("ðŸ“š Example Searches", open=False)
                    with examples_accordion:
                        examples_component = gr.Examples(
                            examples=[
                                ["Death on the Nile", "Similar (Within Cluster)", 0.5, 0.3, 0.2, 8, 50],
                                ["Sherlock Holmes", "Discover (Cross-Cluster)", 0.5, 0.3, 0.2, 10, 50],
                                ["The Girl with the Dragon Tattoo", "Explore (Balanced)", 0.5, 0.3, 0.2, 10, 50],
                                ["The Shining", "Similar (Within Cluster)", 0.5, 0.3, 0.2, 10, 50]
                            ],
                            inputs=[
                                book_input, mode_radio, alpha_content_slider,
                                alpha_genre_slider, alpha_cluster_slider,
                                top_k_slider, threshold_slider
                            ]
                        )

                    recommend_btn = gr.Button("â—† Discover Recommendations â—†", variant="primary")

                with gr.Column(scale=2):  # 20% width for quick start
                    gr.HTML("""
                        <div style="background: linear-gradient(135deg, rgba(139, 69, 19, 0.15) 0%, rgba(139, 69, 19, 0.05) 100%);
                                    border: 1px solid rgba(139, 69, 19, 0.4); padding: 20px; border-radius: 0;
                                    font-family: 'Crimson Text', serif; color: #c9c9c9; position: relative;">
                            <div style="position: absolute; top: 8px; left: 8px; right: 8px; bottom: 8px;
                                        border: 1px solid rgba(139, 69, 19, 0.2); pointer-events: none;"></div>
                            <div style="position: relative;">
                                <div style="font-family: 'Cinzel', serif; font-size: 0.9em; color: #8b4513;
                                            font-weight: 600; letter-spacing: 3px; text-transform: uppercase;
                                            margin-bottom: 15px; text-align: center; border-bottom: 1px solid rgba(139, 69, 19, 0.3);
                                            padding-bottom: 10px;">
                                    â—† Quick Start â—†
                                </div>
                                <div style="font-size: 0.9em; line-height: 1.6; color: #a0a0a0;">
                                    <p style="margin: 12px 0;"><strong style="color: #8b4513;">1.</strong> Enter a book title</p>
                                    <p style="margin: 12px 0; font-size: 0.85em; color: #7a7a7a; padding-left: 15px;">
                                        Try: The Shining, Death on the Nile, Sherlock Holmes
                                    </p>
                                    <p style="margin: 12px 0;"><strong style="color: #8b4513;">2.</strong> Choose mode:</p>
                                    <ul style="margin: 8px 0 12px 20px; padding: 0; font-size: 0.85em; color: #7a7a7a;">
                                        <li style="margin: 5px 0;"><strong>Similar</strong>: Same sub-genre</li>
                                        <li style="margin: 5px 0;"><strong>Explore</strong>: 70% same + 30% different</li>
                                        <li style="margin: 5px 0;"><strong>Discover</strong>: New sub-genres</li>
                                    </ul>
                                    <p style="margin: 12px 0;"><strong style="color: #8b4513;">3.</strong> Click discover</p>
                                    <div style="margin-top: 15px; padding-top: 15px; border-top: 1px solid rgba(139, 69, 19, 0.2);">
                                        <p style="font-size: 0.8em; font-style: italic; color: #6b6b6b; margin: 0;">
                                            ðŸ’¡ Lower threshold if no matches
                                        </p>
                                    </div>
                                </div>
                            </div>
                        </div>
                    """)

            # OUTPUT BELOW INPUT (as requested)
            gr.HTML("<hr style='border: 1px solid rgba(139, 69, 19, 0.3); margin: 30px 0;'>")
            output_html = gr.HTML(label="Recommendations")

        with gr.Tab("â—† SYSTEM STATS"):
            gr.Markdown(
                f"""
                ## System Statistics

                | Metric | Value |
                |--------|-------|
                | **Total Books** | {len(df_train):,} Mystery & Thriller titles |
                | **Clusters** | {optimal_k} distinct sub-genres |
                | **Model** | {best_model_name} (Sentence Transformer) |
                | **Diversity Score** | {eval_results['diversity']:.3f} |
                | **Genre Precision** | {eval_results['genre_precision']:.1%} |
                | **Within-Cluster Rate** | {eval_results['within_cluster_rate']:.1%} |

                ---

                ### What These Metrics Mean

                - **Diversity**: How varied the recommendations are (higher = more diverse)
                - **Genre Precision**: Accuracy of genre-based recommendations
                - **Within-Cluster Rate**: % of similar-mode recommendations from same sub-genre
                """
            )

        with gr.Tab("â—† THE 9 CLUSTERS"):
            gr.Markdown("## Sub-Genre Clusters")

            cluster_info = ""
            for i in range(optimal_k):
                color = get_cluster_color(CLUSTER_NAMES[i])
                cluster_info += f"""
                <div style="background: {color}; color: white; padding: 12px 18px;
                            margin: 8px 0; border-radius: 5px; font-size: 14px; font-weight: 600;">
                    <strong>{i}.</strong> {CLUSTER_NAMES[i]}
                </div>
                """

            gr.HTML(cluster_info)

        with gr.Tab("â—† ABOUT"):
            gr.Markdown(
                f"""
                ## How It Works

                This recommender uses a **hybrid algorithm** combining three signals:

                1. **Content Similarity (Î±)**: AI embeddings capture plot, themes, and writing style
                2. **Genre Matching (Î²)**: Compares genre tags using Jaccard similarity
                3. **Cluster Bonus (Î³)**: Leverages sub-genre groupings from K-means clustering

                **Final Score** = Î± Ã— Content + Î² Ã— Genre + Î³ Ã— Cluster

                ---

                ## The Three Modes

                - **Similar**: Maximizes cluster bonus (Î±=0.4, Î²=0.2, Î³=0.4) and only returns books from the same sub-genre
                - **Explore**: Uses your custom weights and returns 70% same cluster + 30% different clusters
                - **Discover**: Reduces cluster bonus (Î±=0.5, Î²=0.4, Î³=0.1) to encourage cross-genre exploration

                ---

                ## Project Info

                **Built with**: Python Â· Sentence Transformers ({best_model_name}) Â· K-Means Clustering Â· Scikit-learn Â· Gradio
                **Course**: STATS 507 Final Project
                **Institution**: University of Michigan
                **Author**: Sai Sneha Siddapura Venkataramappa
                **Dataset**: {len(df_train):,} mystery and thriller books from Goodreads
                """
            )

    # When example is clicked, collapse the accordion
    examples_component.dataset.click(
        fn=lambda: gr.Accordion(open=False),
        inputs=None,
        outputs=examples_accordion
    )

    # Button action
    recommend_btn.click(
        fn=lambda: gr.Button(interactive=False),  # Just disable, don't change text
        inputs=None,
        outputs=recommend_btn,
        queue=False
    ).then(
        fn=search_and_recommend,
        inputs=[
            book_input, mode_radio, alpha_content_slider,
            alpha_genre_slider, alpha_cluster_slider,
            top_k_slider, threshold_slider
        ],
        outputs=output_html,
        show_progress=True
    ).then(
        fn=lambda: gr.Button(interactive=True),  # Re-enable after completion
        inputs=None,
        outputs=recommend_btn
    )

print("Gradio interface created")

CREATING ENHANCED GRADIO INTERFACE
Gradio interface created


In [67]:
print("LAUNCHING ENHANCED CLUSTER-AWARE GRADIO APP")

print("\n Launching app...")
print(" Features enabled:")
print("  - Cluster-aware recommendations")
print("  - Multiple recommendation modes")
print("  - Weight customization")
print("  - Visual cluster indicators")
print("  - Detailed explainability")

app.launch(
    share=True,
    show_error=True,
    debug = True
)

print("NOTEBOOK 4 ENHANCED - COMPLETE ")

LAUNCHING ENHANCED CLUSTER-AWARE GRADIO APP

 Launching app...
 Features enabled:
  - Cluster-aware recommendations
  - Multiple recommendation modes
  - Weight customization
  - Visual cluster indicators
  - Detailed explainability
Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://189cc5512b75f06f30.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


âœ“ Recommender initialized
  Weights: C=0.5, G=0.3, Cl=0.2
âœ“ Recommender initialized
  Weights: C=0.5, G=0.4, Cl=0.1
âœ“ Recommender initialized
  Weights: C=0.6, G=0.35, Cl=0.05
âœ“ Recommender initialized
  Weights: C=0.4, G=0.2, Cl=0.4
âœ“ Recommender initialized
  Weights: C=0.5, G=0.4, Cl=0.1
âœ“ Recommender initialized
  Weights: C=0.6, G=0.35, Cl=0.05
âœ“ Recommender initialized
  Weights: C=0.5, G=0.4, Cl=0.1
âœ“ Recommender initialized
  Weights: C=0.6, G=0.35, Cl=0.05
âœ“ Recommender initialized
  Weights: C=0.5, G=0.4, Cl=0.1
âœ“ Recommender initialized
  Weights: C=0.6, G=0.35, Cl=0.05
âœ“ Recommender initialized
  Weights: C=0.4, G=0.2, Cl=0.4
âœ“ Recommender initialized
  Weights: C=0.5, G=0.4, Cl=0.1
âœ“ Recommender initialized
  Weights: C=0.6, G=0.35, Cl=0.05
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://189cc5512b75f06f30.gradio.live
NOTEBOOK 4 ENHANCED - COMPLETE 
