# Project Recommendation System

This notebook implements a recommendation system for Lovable projects using:
1. Semantic search with FAISS to find similar projects based on content
2. Simple cold-start recommendation strategy for new projects

Let's get started!

## 1. Setup and Data Loading

### Memory Management

This notebook works with vector embeddings, which can consume significant memory on large datasets. We call Python's garbage collector (`gc`) at key points to release unreferenced objects and reduce the risk of kernel crashes.
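
To put the memory concern in perspective, the size of a dense embedding matrix can be estimated directly from its shape. The sketch below is illustrative only: `embedding_memory_mb` is a hypothetical helper that is not part of this notebook, it assumes float32 vectors (4 bytes per value), and the one-million-row catalog is a made-up comparison point.

```python
import numpy as np

def embedding_memory_mb(n_vectors, dimension, dtype=np.float32):
    """Approximate size in MiB of a dense (n_vectors, dimension) embedding matrix."""
    bytes_per_value = np.dtype(dtype).itemsize
    return n_vectors * dimension * bytes_per_value / (1024 ** 2)

small = embedding_memory_mb(32, 384)          # the shape this notebook's embeddings end up with
large = embedding_memory_mb(1_000_000, 384)   # hypothetical one-million-project catalog
print(f"32 x 384: {small:.3f} MiB")
print(f"1,000,000 x 384: {large:.1f} MiB")
```

At the scale of this notebook the embeddings themselves are tiny, so most of the memory pressure is likely to come from the loaded model rather than from the vectors.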

In [95]:
# Install required packages
!pip install faiss-cpu numpy pandas scikit-learn sentence-transformers matplotlib


In [114]:
import json
import numpy as np
import pandas as pd
import faiss
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
import gc  # Import garbage collector for memory management
import os

In [97]:
# Load the enriched projects data
with open('enriched_data/enriched_projects.json', 'r') as file:
    projects_data = json.load(file)

# Convert to DataFrame for easier manipulation
projects_df = pd.DataFrame(projects_data)

# Display the first few projects
projects_df.head()

Unnamed: 0,id,title,description,link,remixes,image_url,author_img,scraped_at,full_description,technologies,image_analysis,text_features,scraped_date,popularity_score,screenshot_description
0,c97c0a8e-4d68-4bc4-9fe0-c26ee71c3856,fancy-saas-splash,A modern landing page template for SaaS produc...,https://lovable.dev/projects/c97c0a8e-4d68-4bc...,"{'count': 35, 'text': '35 Remixes'}",https://storage.googleapis.com/gpt-engineer-sc...,https://lh3.googleusercontent.com/a/ACg8ocKpsv...,2025-05-21T18:49:16.306404,,[],"{'dimensions': {'width': 1920, 'height': 1979,...","{'word_count': 3, 'sentiment': {'polarity': 0....",2025-05-21,175,
1,65e44bd6-84d0-4f67-a532-c7f2a71f7d46,valk-life,A lifestyle application or website dedicated t...,https://lovable.dev/projects/65e44bd6-84d0-4f6...,"{'count': 40, 'text': '40 Remixes'}",https://storage.googleapis.com/gpt-engineer-sc...,https://avatars.githubusercontent.com/u/441197...,2025-05-21T18:49:16.337169,,[],"{'dimensions': {'width': 1920, 'height': 2935,...","{'word_count': 2, 'sentiment': {'polarity': 0....",2025-05-21,200,
2,6fb25ca9-773a-4499-aa1a-2c5df274248d,monster-landing-magic,An eye-catching landing page template with bol...,https://lovable.dev/projects/6fb25ca9-773a-449...,"{'count': 47, 'text': '47 Remixes'}",https://storage.googleapis.com/gpt-engineer-sc...,https://lh3.googleusercontent.com/a/ACg8ocKpsv...,2025-05-21T18:49:16.388370,,[],"{'dimensions': {'width': 1920, 'height': 2655,...","{'word_count': 3, 'sentiment': {'polarity': 0....",2025-05-21,235,
3,156c4586-27b3-4c80-992f-6996d77bfbc1,future-real-estate-site,A forward-thinking real estate platform showca...,https://lovable.dev/projects/156c4586-27b3-4c8...,"{'count': 48, 'text': '48 Remixes'}",https://storage.googleapis.com/gpt-engineer-sc...,https://lh3.googleusercontent.com/a/ACg8ocKBv6...,2025-05-21T18:49:16.422113,,[],"{'dimensions': {'width': 1920, 'height': 2126,...","{'word_count': 4, 'sentiment': {'polarity': 0....",2025-05-21,240,
4,f60ff3fb-aac0-4b5b-a359-c84636b03332,turn-based-chess-duel,An interactive chess game application featurin...,https://lovable.dev/projects/f60ff3fb-aac0-4b5...,"{'count': 52, 'text': '52 Remixes'}",https://storage.googleapis.com/gpt-engineer-sc...,https://lh3.googleusercontent.com/a/ACg8ocJgj-...,2025-05-21T18:49:16.456430,,[],"{'dimensions': {'width': 1920, 'height': 1080,...","{'word_count': 4, 'sentiment': {'polarity': 0....",2025-05-21,260,


In [98]:
# Examine the data
print(f"Total number of projects: {len(projects_df)}")
print(f"Columns available: {projects_df.columns.tolist()}")

# Check for missing values
print("\nMissing values by column:")
print(projects_df.isnull().sum())

Total number of projects: 32
Columns available: ['id', 'title', 'description', 'link', 'remixes', 'image_url', 'author_img', 'scraped_at', 'full_description', 'technologies', 'image_analysis', 'text_features', 'scraped_date', 'popularity_score', 'screenshot_description']

Missing values by column:
id                         0
title                      0
description                0
link                       0
remixes                    0
image_url                  0
author_img                 0
scraped_at                 0
full_description          26
technologies              26
image_analysis             0
text_features              0
scraped_date               0
popularity_score           0
screenshot_description    31
dtype: int64


## 2. Semantic Search with FAISS

We'll create embeddings for each project based on its description and title, then build a FAISS index to enable fast similarity search.

In [99]:
# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [100]:
# Create a function to generate text for embedding
def create_embedding_text(row):
    """Build the text that is embedded for one project row."""
    text = f"Title: {row['title']}. "
    text += f"Description: {row['description']}. "

    # text_features may be missing or not a dict for some rows; fall back to empty
    features = row.get('text_features')
    if not isinstance(features, dict):
        features = {}

    # Add project category if available
    if 'project_category' in features:
        text += f"Category: {features['project_category']}. "

    # Add keywords if available
    if 'keywords' in features:
        text += f"Keywords: {', '.join(features['keywords'])}."

    return text

In [101]:
# Generate text for each project
projects_df['embedding_text'] = projects_df.apply(create_embedding_text, axis=1)

# Preview the text we'll use for embeddings
print(projects_df['embedding_text'].iloc[0])

Title: fancy-saas-splash. Description: A modern landing page template for SaaS products featuring sleek design elements and engaging visual splash effects.. Category: landing_page. Keywords: fancy, saas, splash.


In [102]:
# Generate embeddings for all projects
project_embeddings = model.encode(projects_df['embedding_text'].tolist(), show_progress_bar=True)

# Display shape and sample of embeddings
print(f"Embedding shape: {project_embeddings.shape}")
print(f"Sample embedding (first 5 dimensions): {project_embeddings[0][:5]}")

Embedding shape: (32, 384)
Sample embedding (first 5 dimensions): [-0.0309571   0.01860341  0.04708434  0.00311612  0.07348949]


In [103]:
# Force garbage collection to free memory
gc.collect()

2599

In [104]:
# Set up FAISS index
dimension = project_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # Exact (brute-force) search; returns squared L2 distances

# Add the project embeddings to the index
index.add(project_embeddings)

# Verify the index size
print(f"Number of vectors in the index: {index.ntotal}")

Number of vectors in the index: 32
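
As a rough mental model, a flat L2 index compares the query against every stored vector and returns the `k` smallest squared Euclidean distances. The NumPy sketch below illustrates that idea on random, made-up vectors; `brute_force_l2_search` is a hypothetical helper, not a FAISS function. It also checks the identity "squared L2 = 2 - 2 * cosine" for unit-length vectors, which applies to the project embeddings only under the assumption that they are normalized.

```python
import numpy as np

def brute_force_l2_search(vectors, query, k):
    """Return (squared L2 distances, indices) of the k vectors closest to query."""
    diffs = vectors - query                          # broadcast the query across all rows
    sq_dists = np.einsum('ij,ij->i', diffs, diffs)   # row-wise sum of squared differences
    order = np.argsort(sq_dists)[:k]                 # indices of the k smallest distances
    return sq_dists[order], order

rng = np.random.default_rng(0)
vectors = rng.normal(size=(6, 4)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # make every row unit length
query = vectors[2]                                   # query with a vector that is in the set

dists, idx = brute_force_l2_search(vectors, query, k=3)
cosines = vectors[idx] @ query                       # cosine similarity for unit vectors
print(idx, dists)
```

Under the normalization assumption, ranking by squared L2 distance and ranking by cosine similarity give the same order, so a smaller `distance` value always means a closer match.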


In [105]:
# Function to get semantically similar projects
def get_similar_projects(query_text, top_k=5):
    # Convert query to embedding
    query_embedding = model.encode([query_text])
    
    # Search for similar projects in the FAISS index
    distances, indices = index.search(query_embedding, top_k)
    
    # Get the similar projects
    similar_projects = projects_df.iloc[indices[0]]
    
    # Add distance information
    similar_projects = similar_projects.copy()
    similar_projects['distance'] = distances[0]
    
    return similar_projects[['id', 'title', 'description', 'distance', 'remixes']]

In [106]:
# Test the semantic search with a sample query
query = "I need a landing page for my SaaS product"
similar_projects = get_similar_projects(query, top_k=5)
print(f"Query: '{query}'\n")
similar_projects

Query: 'I need a landing page for my SaaS product'



Unnamed: 0,id,title,description,distance,remixes
0,c97c0a8e-4d68-4bc4-9fe0-c26ee71c3856,fancy-saas-splash,A modern landing page template for SaaS produc...,0.907642,"{'count': 35, 'text': '35 Remixes'}"
2,6fb25ca9-773a-4499-aa1a-2c5df274248d,monster-landing-magic,An eye-catching landing page template with bol...,1.123671,"{'count': 47, 'text': '47 Remixes'}"
19,84322c50-98b7-4990-ba4e-2545accf5d91,landing-simulator-sorcery,A landing page creation tool with magical drag...,1.167778,"{'count': 1115, 'text': '1115 Remixes'}"
27,32740053-d4a5-474d-90a4-a52ce5eabadc,lovable-product-grove,An e-commerce platform showcasing a curated co...,1.370921,"{'count': 1, 'text': '1 Remix'}"
9,721b7097-37cd-4dc4-8946-0910b3ea8bc7,agri-dom,An agricultural domain management system for f...,1.4383,"{'count': 1996, 'text': '1996 Remixes'}"


## 3. Cold Start Recommendation

Now let's implement a strategy to recommend new projects that have few or no remixes but might be a good fit.

In [107]:
# Define new projects as those with fewer than 10 remixes
new_projects = projects_df[projects_df['remixes'].apply(lambda x: x['count'] if isinstance(x, dict) and 'count' in x else 0) < 10]
print(f"Number of new projects: {len(new_projects)}")

Number of new projects: 8


In [108]:
def recommend_new_projects(query, k=3, diversity_weight=0.2):
    # Boolean mask of new projects (fewer than 10 remixes), computed once
    is_new = (projects_df['remixes'].apply(
        lambda x: x['count'] if isinstance(x, dict) and 'count' in x else 0) < 10).to_numpy()

    # Positional indices of new projects (valid for projects_df.iloc and project_embeddings)
    new_project_indices = np.flatnonzero(is_new).tolist()

    if len(new_project_indices) == 0:
        return pd.DataFrame(columns=['id', 'title', 'description', 'distance', 'remixes', 'combined_score'])

    # Create a FAISS index for new projects only
    new_project_embeddings = project_embeddings[is_new]
    dimension = new_project_embeddings.shape[1]
    new_projects_index = faiss.IndexFlatL2(dimension)
    new_projects_index.add(new_project_embeddings)

    # Convert query to embedding
    query_embedding = model.encode([query])

    # Search for similar new projects. Request extra candidates, but never more than
    # exist: FAISS pads missing results with index -1, which would map to the wrong row
    n_candidates = min(k * 2, len(new_project_indices))
    distances, local_indices = new_projects_index.search(query_embedding, n_candidates)

    # Convert local indices to global indices
    global_indices = [new_project_indices[i] for i in local_indices[0]]

    # Get the candidate projects
    candidates = projects_df.iloc[global_indices].copy()
    candidates['distance'] = distances[0]

    # Calculate diversity score based on uniqueness of project categories
    if len(candidates) > 0 and 'text_features' in candidates.columns:
        # Extract categories if they exist
        categories = candidates['text_features'].apply(
            lambda x: x.get('project_category', 'unknown') if isinstance(x, dict) else 'unknown')

        # Count occurrences of each category among the candidates
        category_counts = categories.value_counts()

        # Calculate diversity score (lower count = more unique = higher score)
        candidates['diversity_score'] = categories.apply(lambda x: 1.0/category_counts[x])
    else:
        candidates['diversity_score'] = 1.0

    # Min-max normalize distances, then invert so that 1.0 is the closest candidate
    max_dist = candidates['distance'].max()
    min_dist = candidates['distance'].min()
    if max_dist > min_dist:
        candidates['distance_norm'] = 1 - ((candidates['distance'] - min_dist) / (max_dist - min_dist))
    else:
        candidates['distance_norm'] = 1.0

    # Combined score: relevance and diversity
    candidates['combined_score'] = (1 - diversity_weight) * candidates['distance_norm'] + \
                                   diversity_weight * candidates['diversity_score']

    # Sort by combined score
    candidates = candidates.sort_values('combined_score', ascending=False)

    return candidates.head(k)[['id', 'title', 'description', 'distance', 'remixes', 'combined_score']]
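
The scoring step can be easier to follow on a tiny, made-up candidate set. The sketch below reproduces only the arithmetic of the combined score (inverted min-max distance plus inverse category frequency); the three candidates, their categories, and their distances are invented for illustration.

```python
import pandas as pd

candidates = pd.DataFrame({
    'title': ['a', 'b', 'c'],                                  # hypothetical projects
    'category': ['landing_page', 'dashboard', 'dashboard'],
    'distance': [1.0, 1.5, 2.0],
})
diversity_weight = 0.2

# Diversity: 1 / (number of candidates sharing the same category)
counts = candidates['category'].value_counts()
candidates['diversity_score'] = candidates['category'].map(lambda c: 1.0 / counts[c])

# Relevance: min-max scale the distances and invert, so the closest candidate gets 1.0
lo, hi = candidates['distance'].min(), candidates['distance'].max()
candidates['distance_norm'] = 1 - (candidates['distance'] - lo) / (hi - lo)

# Weighted blend of relevance and diversity
candidates['combined_score'] = (
    (1 - diversity_weight) * candidates['distance_norm']
    + diversity_weight * candidates['diversity_score']
)
print(candidates[['title', 'combined_score']])
```

Here `a` scores 0.8 * 1.0 + 0.2 * 1.0 = 1.0, `b` scores 0.8 * 0.5 + 0.2 * 0.5 = 0.5, and `c` scores 0.8 * 0.0 + 0.2 * 0.5 = 0.1. Because both terms are computed relative to the candidate pool, the same project can receive a different `combined_score` when `k` (and therefore the pool) changes.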

In [109]:
# Test the new project recommendation function
query = "I need a landing page for my SaaS product"
new_project_recommendations = recommend_new_projects(query)
print(f"Query: '{query}'\n")
new_project_recommendations

Query: 'I need a landing page for my SaaS product'



Unnamed: 0,id,title,description,distance,remixes,combined_score
27,32740053-d4a5-474d-90a4-a52ce5eabadc,lovable-product-grove,An e-commerce platform showcasing a curated co...,1.370921,"{'count': 1, 'text': '1 Remix'}",1.0
11,0a723aaf-fa4e-4c29-8841-7b9dfaf2611b,tableone-fundraise-hub,A fundraising platform specifically designed f...,1.461263,"{'count': 2, 'text': '2 Remixes'}",0.754972
25,ee95d51b-52bd-4ed8-a6a9-681b011ba839,echoing-site-template-26,A website template system with audio or voice ...,1.504088,"{'count': 9, 'text': '9 Remixes'}",0.505487


## 4. Complete Recommendation Pipeline

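Now let's combine everything into a complete recommendation pipeline. Note that the "established" list is produced by searching the full index, so it can also contain projects with fewer than 10 remixes.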

In [110]:
def get_project_recommendations(query, established_count=5, new_count=2):
    print(f"\n===== Recommendations for: '{query}' =====\n")
    
    # Get recommendations for established projects using KNN/semantic search
    print(f"Top {established_count} established projects:")
    established_recommendations = get_similar_projects(query, top_k=established_count)
    display(established_recommendations)
    
    # Get recommendations for new projects
    print(f"\nTop {new_count} promising new projects:")
    new_recommendations = recommend_new_projects(query, k=new_count)
    display(new_recommendations)
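
The search over the full index does not itself distinguish established projects from new ones. One possible way to make the split explicit is to extract the remix count once and filter on it. The sketch below runs on toy rows copied from the outputs in this notebook; `remix_count` and `NEW_PROJECT_THRESHOLD` are hypothetical names, and the cutoff of 10 is assumed to match the definition used in section 3.

```python
import pandas as pd

NEW_PROJECT_THRESHOLD = 10  # assumption: same cutoff as the cold-start section

def remix_count(remixes):
    """Extract the integer remix count from a {'count': ..., 'text': ...} dict."""
    if isinstance(remixes, dict) and 'count' in remixes:
        return remixes['count']
    return 0  # treat missing or malformed values as zero remixes

toy_df = pd.DataFrame({
    'title': ['fancy-saas-splash', 'lovable-product-grove', 'agri-dom'],
    'remixes': [
        {'count': 35, 'text': '35 Remixes'},
        {'count': 1, 'text': '1 Remix'},
        {'count': 1996, 'text': '1996 Remixes'},
    ],
})

counts = toy_df['remixes'].apply(remix_count)
established = toy_df[counts >= NEW_PROJECT_THRESHOLD]
new = toy_df[counts < NEW_PROJECT_THRESHOLD]
print(established['title'].tolist())
print(new['title'].tolist())
```

Applying a filter like this to the search results (or building a separate index for established projects) would keep the two lists disjoint.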

In [111]:
# Test with different queries
queries = [
    "I need a landing page for my SaaS product",
    "Looking for a tool to visualize music",
    "I want to build a chess game",
    "Need a dashboard for food tracking",
    "AI tool that can adapt to my thinking"
]

for query in queries:
    get_project_recommendations(query)


===== Recommendations for: 'I need a landing page for my SaaS product' =====

Top 5 established projects:


Unnamed: 0,id,title,description,distance,remixes
0,c97c0a8e-4d68-4bc4-9fe0-c26ee71c3856,fancy-saas-splash,A modern landing page template for SaaS produc...,0.907642,"{'count': 35, 'text': '35 Remixes'}"
2,6fb25ca9-773a-4499-aa1a-2c5df274248d,monster-landing-magic,An eye-catching landing page template with bol...,1.123671,"{'count': 47, 'text': '47 Remixes'}"
19,84322c50-98b7-4990-ba4e-2545accf5d91,landing-simulator-sorcery,A landing page creation tool with magical drag...,1.167778,"{'count': 1115, 'text': '1115 Remixes'}"
27,32740053-d4a5-474d-90a4-a52ce5eabadc,lovable-product-grove,An e-commerce platform showcasing a curated co...,1.370921,"{'count': 1, 'text': '1 Remix'}"
9,721b7097-37cd-4dc4-8946-0910b3ea8bc7,agri-dom,An agricultural domain management system for f...,1.4383,"{'count': 1996, 'text': '1996 Remixes'}"



Top 2 promising new projects:


Unnamed: 0,id,title,description,distance,remixes,combined_score
27,32740053-d4a5-474d-90a4-a52ce5eabadc,lovable-product-grove,An e-commerce platform showcasing a curated co...,1.370921,"{'count': 1, 'text': '1 Remix'}",1.0
11,0a723aaf-fa4e-4c29-8841-7b9dfaf2611b,tableone-fundraise-hub,A fundraising platform specifically designed f...,1.461263,"{'count': 2, 'text': '2 Remixes'}",0.701743



===== Recommendations for: 'Looking for a tool to visualize music' =====

Top 5 established projects:


Unnamed: 0,id,title,description,distance,remixes
5,f6f0a2d7-0b30-4d2c-9290-e18ae9e1fd05,musicwave-harmony,A music visualization and playback app that cr...,0.889122,"{'count': 53, 'text': '53 Remixes'}"
8,b8ce5f67-1b0f-4932-a684-e51855e29de9,hungry-dashboard,A food tracking and nutrition analytics dashbo...,1.380395,"{'count': 91, 'text': '91 Remixes'}"
29,d9e9e909-200c-4b87-875b-9305d31e815d,sleek-navisphere-65-90-76-19-70,A modern navigation interface with a sophistic...,1.39793,"{'count': 27, 'text': '27 Remixes'}"
25,ee95d51b-52bd-4ed8-a6a9-681b011ba839,echoing-site-template-26,A website template system with audio or voice ...,1.489588,"{'count': 9, 'text': '9 Remixes'}"
6,0fa655bd-55f5-4761-af2c-39b0bfabbaae,ai-tool-hub,A centralized platform that aggregates and org...,1.492048,"{'count': 71, 'text': '71 Remixes'}"



Top 2 promising new projects:


Unnamed: 0,id,title,description,distance,remixes,combined_score
25,ee95d51b-52bd-4ed8-a6a9-681b011ba839,echoing-site-template-26,A website template system with audio or voice ...,1.489588,"{'count': 9, 'text': '9 Remixes'}",0.9
26,ea007ae4-bf41-4777-bc2b-32fbfcb9592b,tactile-signature-tool,A digital signature application with haptic fe...,1.563866,"{'count': 1, 'text': '1 Remix'}",0.780046



===== Recommendations for: 'I want to build a chess game' =====

Top 5 established projects:


Unnamed: 0,id,title,description,distance,remixes
4,f60ff3fb-aac0-4b5b-a359-c84636b03332,turn-based-chess-duel,An interactive chess game application featurin...,1.124233,"{'count': 52, 'text': '52 Remixes'}"
1,65e44bd6-84d0-4f67-a532-c7f2a71f7d46,valk-life,A lifestyle application or website dedicated t...,1.564737,"{'count': 40, 'text': '40 Remixes'}"
21,fafe259a-ff46-45f0-812e-1c598bf4b505,characterforge-imagix,A character creation and visualization tool fo...,1.587932,"{'count': 1695, 'text': '1695 Remixes'}"
30,ca2c8cb7-488c-4548-be13-782d945ee184,custom-thinking-ai,A personalized artificial intelligence system ...,1.603545,"{'count': 1, 'text': '1 Remix'}"
19,84322c50-98b7-4990-ba4e-2545accf5d91,landing-simulator-sorcery,A landing page creation tool with magical drag...,1.671813,"{'count': 1115, 'text': '1115 Remixes'}"



Top 2 promising new projects:


Unnamed: 0,id,title,description,distance,remixes,combined_score
30,ca2c8cb7-488c-4548-be13-782d945ee184,custom-thinking-ai,A personalized artificial intelligence system ...,1.603545,"{'count': 1, 'text': '1 Remix'}",0.9
11,0a723aaf-fa4e-4c29-8841-7b9dfaf2611b,tableone-fundraise-hub,A fundraising platform specifically designed f...,1.692112,"{'count': 2, 'text': '2 Remixes'}",0.595838



===== Recommendations for: 'Need a dashboard for food tracking' =====

Top 5 established projects:


Unnamed: 0,id,title,description,distance,remixes
8,b8ce5f67-1b0f-4932-a684-e51855e29de9,hungry-dashboard,A food tracking and nutrition analytics dashbo...,0.52038,"{'count': 91, 'text': '91 Remixes'}"
15,4b043dc7-4365-4c96-b875-11bd9694b03e,chef-cuistot-ia,A culinary AI assistant that helps with recipe...,1.233432,"{'count': 2, 'text': '2 Remixes'}"
11,0a723aaf-fa4e-4c29-8841-7b9dfaf2611b,tableone-fundraise-hub,A fundraising platform specifically designed f...,1.362932,"{'count': 2, 'text': '2 Remixes'}"
9,721b7097-37cd-4dc4-8946-0910b3ea8bc7,agri-dom,An agricultural domain management system for f...,1.470143,"{'count': 1996, 'text': '1996 Remixes'}"
6,0fa655bd-55f5-4761-af2c-39b0bfabbaae,ai-tool-hub,A centralized platform that aggregates and org...,1.473712,"{'count': 71, 'text': '71 Remixes'}"



Top 2 promising new projects:


Unnamed: 0,id,title,description,distance,remixes,combined_score
15,4b043dc7-4365-4c96-b875-11bd9694b03e,chef-cuistot-ia,A culinary AI assistant that helps with recipe...,1.233432,"{'count': 2, 'text': '2 Remixes'}",0.866667
11,0a723aaf-fa4e-4c29-8841-7b9dfaf2611b,tableone-fundraise-hub,A fundraising platform specifically designed f...,1.362932,"{'count': 2, 'text': '2 Remixes'}",0.714042



===== Recommendations for: 'AI tool that can adapt to my thinking' =====

Top 5 established projects:


Unnamed: 0,id,title,description,distance,remixes
30,ca2c8cb7-488c-4548-be13-782d945ee184,custom-thinking-ai,A personalized artificial intelligence system ...,0.564497,"{'count': 1, 'text': '1 Remix'}"
6,0fa655bd-55f5-4761-af2c-39b0bfabbaae,ai-tool-hub,A centralized platform that aggregates and org...,0.925114,"{'count': 71, 'text': '71 Remixes'}"
13,ec1d4f1e-2506-4da5-a91b-34afa90cceb6,wrlds-ai-integration,A system that integrates artificial intelligen...,0.976814,"{'count': 2354, 'text': '2354 Remixes'}"
20,513db1a2-0fcc-4643-bd43-f10d076dfa80,cortex-second-brain,A knowledge management system that serves as a...,1.164604,"{'count': 1438, 'text': '1438 Remixes'}"
7,4e4f8fa9-169c-4762-b885-56ac83b4ade3,chatgpt-clone,A replica of the ChatGPT interface and functio...,1.194527,"{'count': 68, 'text': '68 Remixes'}"



Top 2 promising new projects:


Unnamed: 0,id,title,description,distance,remixes,combined_score
30,ca2c8cb7-488c-4548-be13-782d945ee184,custom-thinking-ai,A personalized artificial intelligence system ...,0.564497,"{'count': 1, 'text': '1 Remix'}",0.9
15,4b043dc7-4365-4c96-b875-11bd9694b03e,chef-cuistot-ia,A culinary AI assistant that helps with recipe...,1.265414,"{'count': 2, 'text': '2 Remixes'}",0.38553


In [112]:
# Free memory by garbage collecting
print("Cleaning up memory...")
gc.collect()
print("Done.")

Cleaning up memory...
Done.


## 5. Save the Models

Let's save the FAISS index, the embeddings, and the project metadata for future use.

In [115]:
# Make sure the models directory exists
os.makedirs('models', exist_ok=True)

# Save FAISS index
faiss.write_index(index, 'models/project_search_index.faiss')

# Save project information and embeddings for future use
np.save('models/project_embeddings.npy', project_embeddings)
projects_df[['id', 'title', 'description', 'remixes']].to_csv('models/project_metadata.csv', index=False)
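
One thing to keep in mind when reloading the metadata: `to_csv` writes the `remixes` dictionaries as plain strings, so they need to be parsed back into dictionaries. A minimal sketch of that round trip follows, using an in-memory buffer instead of the `models/` directory and a one-row frame with a placeholder `id`.

```python
import ast
import io
import pandas as pd

original = pd.DataFrame({
    'id': ['example-id'],  # placeholder, not a real project id
    'title': ['fancy-saas-splash'],
    'remixes': [{'count': 35, 'text': '35 Remixes'}],
})

# Write and re-read the frame the same way the metadata CSV would be handled
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

raw_type = type(restored.loc[0, 'remixes'])  # the dict comes back as a string

# Parse the string representation back into a Python dict
restored['remixes'] = restored['remixes'].apply(ast.literal_eval)
print(restored.loc[0, 'remixes']['count'])
```

The index and embeddings can be read back with `faiss.read_index` and `np.load` respectively; storing the metadata as JSON instead of CSV would be another way to avoid the parsing step.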

## 6. Conclusion

In this notebook, we've created a simple but effective recommendation system that:

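1. Uses semantic search (exact nearest-neighbor search over sentence embeddings with FAISS) to find projects similar to the user's query
2. Recommends promising new projects (fewer than 10 remixes), ranked by a blend of relevance and category diversity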

This semantic search approach matches on the meaning of user queries and project descriptions rather than on exact keywords, which is intended to give more relevant recommendations than keyword matching. It could be improved with:

- Personalization based on user preferences and history
- Testing different embedding models for better semantic understanding
- Adding more sophisticated filtering options
- Periodic re-embedding and re-indexing as new projects are added (the embedding model is pre-trained and is not trained in this notebook)