# Datathon 2025 RAG Solution

### Ingestion

Let's first import relevant packages, take a look at our data, and create a master dataframe.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
media_df = pd.read_csv('Datasets/media.csv')
places_p2_df = pd.read_csv('Datasets/places.csv')
reviews_df = pd.read_csv('Datasets/reviews.csv')

In [6]:
print(f"P2 Media preview: {media_df.head}\n")
print(f"P2 Places preview: {places_p2_df.head}\n")
print(f"P2 Reviews preview: {reviews_df.head}\n")

P2 Media preview: <bound method NDFrame.head of          place_id                                          media_url
0         place_1  https://cdn.corner.inc/place-photo/AUjq9jnss_x...
1         place_1  https://cdn.corner.inc/place-photo/AUjq9jliO8l...
2         place_1  https://cdn.corner.inc/place-photo/AUjq9jmYn9S...
3         place_1  https://cdn.corner.inc/place-photo/AUjq9jnc5Zm...
4         place_1  https://cdn.corner.inc/place-photo/AUjq9jmiIE3...
...           ...                                                ...
37617  place_1500  https://cdn.corner.inc/place-photo/AUGGfZkzoPe...
37618  place_1500  https://cdn.corner.inc/place-photo/AUGGfZniHDf...
37619  place_1500  https://cdn.corner.inc/place-photo/AUGGfZnZCnm...
37620  place_1500  https://cdn.corner.inc/place-photo/AUGGfZn2wUa...
37621  place_1500  https://cdn.corner.inc/ugc/9781d311-2298-4582-...

[37622 rows x 2 columns]>

P2 Places preview: <bound method NDFrame.head of         place_id                        name  n

Aggregate reviews corresponding to a given place_id into a list of strings. Do the same for media, then merge both into larger df.

In [7]:
reviews_df = reviews_df.drop_duplicates(subset=['place_id', 'review_text'])
media_df = media_df.drop_duplicates(subset=['place_id', 'media_url'])

reviews_agg = (
    reviews_df
    .groupby('place_id')['review_text']
    .apply(list)
    .reset_index()
)

media_agg = (
    media_df
    .groupby('place_id')['media_url']
    .apply(list)
    .reset_index()
)

merge_df = pd.merge(places_p2_df, media_agg, on='place_id', how='inner')
merge_df = pd.merge(merge_df, reviews_agg, on='place_id', how='inner')
merge_df = merge_df.drop_duplicates(subset='place_id').reset_index(drop=True)
merge_df.head()

Unnamed: 0,place_id,name,neighborhood,latitude,longitude,tags,short_description,emoji,media_url,review_text
0,place_1,Public Records,Brooklyn,40.68227,-73.9864,"{night_club,cafe,bar,restaurant}",vinyl dance club,💿,[https://cdn.corner.inc/place-photo/AUjq9jnss_...,[the best place to dance until 4am in nyc. get...
1,place_2,Silence Please,,40.71895,-73.9949,{cafe},vinyl cafe,💿,[https://cdn.corner.inc/place-photo/AWYs27xW6j...,[i heard they charge an entrance fee now at th...
2,place_3,schmuck.,,40.72637,-73.98647,{bar},craft cocktails,🍸,[https://cdn.corner.inc/ugc/0875b9e6-d6fe-4db1...,[apparently this is very vibey and THE spot bu...
3,place_4,The Django,Tribeca,40.71941,-74.00491,"{bar,night_club,restaurant}",underground jazz,🎷,[https://cdn.corner.inc/place-photo/AUjq9jkq_2...,[The prettiest jazz club I’ve been! Good cockt...
4,place_5,Honeycomb Hi-Fi Lounge,Park Slope,40.68077,-73.97775,{bar},listening bar,🎵,[https://cdn.corner.inc/place-photo/cb4ddc19-d...,"[listening bar, One of my favorite bars in NYC..."


Let's concatenate reviews associated with a given place_id in preparation for embedding. Hopefully this can help with vibier searches.

In [8]:
def concat_reviews(series):
    """Join all review texts in the group into a single string."""
    return ' '.join(series.astype(str))

# Group reviews and create the all_reviews column
agg_reviews = (
    reviews_df
    .groupby('place_id')['review_text']
    .apply(concat_reviews)
    .reset_index(name='concat_reviews')
)

# Merge the concatenated reviews into merge_df
merge_df = merge_df.merge(
    agg_reviews,
    how='left',
    left_on='place_id',
    right_on='place_id'
)

Let's write a function to prepare available structured data for semantic embedding, tack on the concatenated reviews and add it to our df.

In [9]:
def combining_text(row):
    name = str(row.get('name', ''))
    neighborhood = str(row.get('neighborhood', ''))
    tags = str(row.get('tags', ''))
    short_description = str(row.get('short_description', ''))
    emojis = str(row.get('emojis', ''))
    reviews = str(row.get('concat_reviews', ''))

    combined_text = (
        f"Name: {name}. "
        f"Neighborhood: {neighborhood}. "
        f"Tags: {tags}. "
        f"Description: {short_description}. "
        f"Emojis: {emojis}."
        f"User Reviews: {reviews}."
    )
    return combined_text

merge_df['combined_text'] = merge_df.apply(combining_text, axis=1)

Ok, now let's actually start embedding our data. Cross-Encoders will be too computationally heavy for this. Instead will use bi-encoders such as sentence_transformers and CLIP for seemantic and multimodal embeddings respectively.

### Metadata Text Embeddings (Dense + Sparse)

Eventually, we want a hybrid search which requires is a combination of dense, sparse, and multimodal embeddings. Let's start with dense.

In [10]:
# Load tqdm and MiniLM model(dense semantic text embedding)
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
metadata_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

  from .autonotebook import tqdm as notebook_tqdm


In [11]:
embeddings = []
batch_size = 32

for i in tqdm(range(0, len(merge_df), batch_size), desc="Generating embeddings from dataset"):
    batch = merge_df['combined_text'].iloc[i: i + batch_size].tolist()
    batch_embeddings = metadata_model.encode(batch, normalize_embeddings=True)
    embeddings.append(batch_embeddings)

embeddings = np.vstack(embeddings)

# Save in Dataframe
merge_df['dense_metadata_embedding'] = embeddings.tolist()

Generating embeddings from dataset: 100%|██████████| 47/47 [00:20<00:00,  2.33it/s]


Ok, now let's try a sparse text embedding model from fastembed. Hopefully with the hybrid, we can pick up explicit meaning as well as implied.

In [12]:
from fastembed import SparseTextEmbedding
sparse_model = SparseTextEmbedding(model_name="prithivida/Splade_PP_en_v1")

Fetching 5 files: 100%|██████████| 5/5 [00:09<00:00,  1.96s/it]


Time to embed!

In [13]:
sparse_embeddings = []

for i in tqdm(range(0, len(merge_df), batch_size), desc="Generating sparse embeddings"):
    batch = merge_df['combined_text'].iloc[i: i + batch_size].tolist()
    batch_embeddings = list(sparse_model.embed(batch))
    sparse_embeddings.extend(batch_embeddings)

merge_df['sparse_metadata_embedding'] = sparse_embeddings

Generating sparse embeddings: 100%|██████████| 47/47 [03:36<00:00,  4.60s/it]


Let's save it real quick.

In [14]:
import joblib

In [15]:
# Save the DataFrame
joblib.dump(merge_df, 'merge_dense_and_sparse_df.joblib')

['merge_dense_and_sparse_df.joblib']

### Media (Image) Embeddings

Because there are over 30,000 media_urls, We opted for batch processing 1 image per place_id, reducing the image's resolution, and then embedding it using a Hugging Face CLIP model for multimodal embedding in another notebook. The result is the image_embeddings_sorted.csv that we can simply read in as a df. Ideally, with more time we could embed several/all images corresponding to a place_id, and then take their arithmetic mean for a more generally representative embedding per place_id.

In [16]:
image_df = pd.read_csv('Datasets/image_embeddings_sorted.csv')

Also, even though we already embedded the images in another file, let's import and load the CLIP model hear to use for embedding queries in the future.

In [17]:
# Embedded 1 image for each location using CLIPProcessor, CLIPModel in another ipynb
from transformers import CLIPProcessor, CLIPModel

# Load the CLIP model and processor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


### Vector Similarity Search + Hybrid Search Model

Let's use Facebook AI Similarity Search (FAISS) for efficient vector similarity search on our:
- Metadata Embeddings (Dense)
- Media Embeddings (Image)

And, let's use Cosine Similarity for our sparse model.

In [18]:
import faiss
from sklearn.metrics.pairwise import cosine_similarity

Metadata FAISS (Dense)

In [19]:
# Get metadata embeddings
all_dense_embeddings = np.array(merge_df['dense_metadata_embedding'].tolist(), dtype='float32')

# Determine embedding dimension
d = all_dense_embeddings.shape[1]

# Create FAISS index (L2 distance)
index_dense_metadata = faiss.IndexFlatL2(d)
index_dense_metadata.add(all_dense_embeddings)

print(f"Built FAISS dense metadata index with {index_dense_metadata.ntotal} vectors of dimension {d}.")

# Let's make a search function
def search_places_dense_metadata(query, index=index_dense_metadata, top_k=5):
    query_embedding = metadata_model.encode([query], normalize_embeddings=True)[0].astype('float32').reshape(1, -1)
    distances, indices = index.search(query_embedding, top_k)

    results = []
    for i, idx in enumerate(indices[0]):
        row = merge_df.iloc[idx]
        results.append({
            'Place Name': row['name'],
            'Neighborhood': row['neighborhood'],
            'Tags': row['tags'],
            'Description': row['short_description'],
            'Distance': distances[0][i]
        })
    return results

Built FAISS dense metadata index with 1499 vectors of dimension 384.


In [20]:
# Let's try a query and output the 5 nearest embeddings
query = "where to drink a matcha"
search_results = search_places_dense_metadata(query, index_dense_metadata, top_k=5)

for result in search_results:
    print("------------------------------")
    print(f"Place Name   : {result['Place Name']}")
    print(f"Neighborhood : {result['Neighborhood']}")
    print(f"Tags         : {result['Tags']}")
    print(f"Description  : {result['Description']}")
    print(f"L2 Distance  : {result['Distance']:.4f}")

------------------------------
Place Name   : Sorate
Neighborhood : SoHo
Tags         : {destination}
Description  : matcha bar
L2 Distance  : 0.7257
------------------------------
Place Name   : The Mandarin
Neighborhood : nan
Tags         : {cafe}
Description  : cozy cafe
L2 Distance  : 0.7298
------------------------------
Place Name   : Setsugekka East Village
Neighborhood : East Village
Tags         : {shop}
Description  : japanese tea house
L2 Distance  : 0.7679
------------------------------
Place Name   : Kettl Tea
Neighborhood : NoHo
Tags         : {destination}
Description  : matcha & soba tea
L2 Distance  : 0.7777
------------------------------
Place Name   : Nana’s Green Tea
Neighborhood : Koreatown
Tags         : {restaurant}
Description  : japanese desserts
L2 Distance  : 0.8102


Image FAISS

In [21]:
# Extract embedding columns (assumes embed_0 to embed_511)
embedding_cols = [col for col in image_df.columns if col.startswith('embed_')]
all_embeddings = image_df[embedding_cols].to_numpy().astype('float32')

# Determine embedding dimension
d = all_embeddings.shape[1]

# Build FAISS index
index_image = faiss.IndexFlatL2(d)
index_image.add(all_embeddings)

print(f"Built FAISS image index with {index_image.ntotal} vectors of dimension {d}.")

# Let's make a search function
def search_places_image(query, index=index_image, top_k=5):
    # Encode the text query using CLIP text encoder
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    query_embedding = clip_model.get_text_features(**inputs)
    query_embedding = query_embedding / query_embedding.norm(p=2, dim=-1, keepdim=True)
    query_embedding = query_embedding.detach().cpu().numpy().astype('float32').reshape(1, -1)

    # Perform FAISS search
    distances, indices = index.search(query_embedding, top_k)

    # Gather results from image_df (which holds the place info)
    results = []
    for i, idx in enumerate(indices[0]):
        row = merge_df.iloc[idx]
        results.append({
            'Place Name': row['name'],
            'Neighborhood': row['neighborhood'],
            'Tags': row['tags'],
            'Description': row['short_description'],
            'Distance': distances[0][i]
        })
    return results

Built FAISS image index with 1499 vectors of dimension 512.


In [22]:
# Let's try a query and output the 5 nearest embeddings
search_results = search_places_image("Something to do on a gloomy day", index_image, top_k=5)

for result in search_results:
    print("------------------------------")
    print(f"Place Name   : {result['Place Name']}")
    print(f"Neighborhood : {result['Neighborhood']}")
    print(f"Tags         : {result['Tags']}")
    print(f"Description  : {result['Description']}")
    print(f"L2 Distance  : {result['Distance']:.4f}")

------------------------------
Place Name   : Bird & Branch Coffee Roasters
Neighborhood : Hell's Kitchen
Tags         : {cafe}
Description  : specialty coffee
L2 Distance  : 1.5070
------------------------------
Place Name   : ThirdSpace
Neighborhood : Brooklyn
Tags         : {destination}
Description  : creative events
L2 Distance  : 1.5274
------------------------------
Place Name   : % Arabica New York Nolita
Neighborhood : Little Italy
Tags         : {cafe}
Description  : coffee & pastries
L2 Distance  : 1.5314
------------------------------
Place Name   : Supermoon Bakehouse
Neighborhood : Lower East Side
Tags         : {bakery}
Description  : creative bakery
L2 Distance  : 1.5317
------------------------------
Place Name   : Little Ruby's East Village
Neighborhood : East Village
Tags         : {restaurant}
Description  : aussie brunch
L2 Distance  : 1.5338


Metadata Cosine Similarity (Sparse)

In [23]:
def sparse_to_dense(sparse_embedding, dim=30315):
    dense_vec = np.zeros(dim, dtype=np.float32)
    dense_vec[sparse_embedding.indices] = sparse_embedding.values
    return dense_vec

print("Converting sparse embeddings to dense matrix...")
sparse_embeddings_dense = np.vstack([
    sparse_to_dense(embedding, dim=30315)
    for embedding in tqdm(merge_df['sparse_metadata_embedding'], desc="Converting embeddings")
])

# Let's make a search function
def search_places_sparse_metadata(query, sparse_model, embeddings_matrix, top_k=5):
    # Embed the query (FastEmbed sparse model)
    query_sparse = list(sparse_model.embed([query]))[0]
    query_dense = sparse_to_dense(query_sparse, dim=30315).reshape(1, -1)

    # Compute cosine similarities
    similarities = cosine_similarity(query_dense, embeddings_matrix)[0]

    # Fast top-k retrieval
    top_indices = np.argpartition(-similarities, top_k)[:top_k]
    top_indices = top_indices[np.argsort(similarities[top_indices])[::-1]]

    # Prepare results
    results = []
    for idx in top_indices:
        row = merge_df.iloc[idx]
        results.append({
            'Place Name': row['name'],
            'Neighborhood': row['neighborhood'],
            'Tags': row['tags'],
            'Description': row['short_description'],
            'Similarity': similarities[idx]
        })

    return results

Converting sparse embeddings to dense matrix...


Converting embeddings: 100%|██████████| 1499/1499 [00:00<00:00, 26082.70it/s]


In [25]:
# Let's try a query and output the 5 nearest embeddings
search_results = search_places_sparse_metadata("dance-y bars that have disco balls", sparse_model, sparse_embeddings_dense, top_k=5)

# Output the results
for result in search_results:
    print("------------------------------")
    print(f"Place Name        : {result['Place Name']}")
    print(f"Neighborhood      : {result['Neighborhood']}")
    print(f"Tags              : {result['Tags']}")
    print(f"Description       : {result['Description']}")
    print(f"Cosine Similarity : {result['Similarity']:.4f}")

------------------------------
Place Name        : Cafe Balearica
Neighborhood      : Brooklyn
Tags              : {bar,restaurant}
Description       : disco bar
Cosine Similarity : 0.3478
------------------------------
Place Name        : Ciao Ciao Disco
Neighborhood      : Brooklyn
Tags              : {bar}
Description       : disco drag club
Cosine Similarity : 0.3199
------------------------------
Place Name        : Gabriela
Neighborhood      : Brooklyn
Tags              : {bar}
Description       : disco club
Cosine Similarity : 0.2824
------------------------------
Place Name        : Jupiter Disco
Neighborhood      : Bushwick
Tags              : {bar,night_club}
Description       : space disco
Cosine Similarity : 0.2816
------------------------------
Place Name        : Doris
Neighborhood      : Brooklyn
Tags              : {bar}
Description       : disco cocktails
Cosine Similarity : 0.2725


Now that we have our three similarity search functions, let's final make our hybrid search function! First, we need to normalize FAISS distance.

In [26]:
from sklearn.preprocessing import MinMaxScaler

In [27]:
# helper to normalize scores between 0 and 1
def normalize_scores(scores):
    scores = np.array(scores).reshape(-1, 1)
    scaler = MinMaxScaler()
    return scaler.fit_transform(scores).flatten()

In [28]:
# hybrid search function
# Full hybrid search function
def hybrid_search(query, metadata_index, image_index, sparse_embeddings,
                  sparse_model, metadata_model, processor, clip_model, top_k=5,
                  weight_dense=0.4, weight_sparse=0.3, weight_image=0.3):
    # Metadata (Dense)
    query_dense = metadata_model.encode([query], normalize_embeddings=True)[0].astype('float32').reshape(1, -1)
    distances_dense, indices_dense = metadata_index.search(query_dense, top_k)
    scores_dense = -distances_dense[0]  # Negative L2 distance → higher is better

    # Sparse Embeddings
    query_sparse = list(sparse_model.embed([query]))[0]
    query_sparse_dense = sparse_to_dense(query_sparse, dim=30315).reshape(1, -1)
    similarities_sparse = cosine_similarity(query_sparse_dense, sparse_embeddings)[0]
    top_indices_sparse = np.argpartition(-similarities_sparse, top_k)[:top_k]
    top_indices_sparse = top_indices_sparse[np.argsort(similarities_sparse[top_indices_sparse])[::-1]]
    scores_sparse = similarities_sparse[top_indices_sparse]

    # Image Embeddings
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    query_image_embedding = clip_model.get_text_features(**inputs)
    query_image_embedding = query_image_embedding / query_image_embedding.norm(p=2, dim=-1, keepdim=True)
    query_image_embedding = query_image_embedding.detach().cpu().numpy().astype('float32').reshape(1, -1)
    distances_image, indices_image = image_index.search(query_image_embedding, top_k)
    scores_image = -distances_image[0]  # Negative L2 distance → higher is better

    # Normalize Scores
    norm_dense = normalize_scores(scores_dense)
    norm_sparse = normalize_scores(scores_sparse)
    norm_image = normalize_scores(scores_image)

    # Hybrid Scoring
    hybrid_scores = (weight_dense * norm_dense[:top_k] +
                     weight_sparse * norm_sparse[:top_k] +
                     weight_image * norm_image[:top_k])

    # Gather Results
    results = []
    for i in range(top_k):
        idx = indices_dense[0][i]  # Take from dense indices
        row = merge_df.iloc[idx]
        results.append({
            'Place Name': row['name'],
            'Neighborhood': row['neighborhood'],
            'Tags': row['tags'],
            'Description': row['short_description'],
            'Hybrid Score': hybrid_scores[i],
            'Dense Score': norm_dense[i],
            'Sparse Score': norm_sparse[i],
            'Image Score': norm_image[i]
        })

    results = [res for res in results if res['Hybrid Score'] > 0.1]

    # Sort Final Output
    results = sorted(results, key=lambda x: x['Hybrid Score'], reverse=True)

    return results

Let's test out our hybrid_search function - It should be a great at specific and vague queries.

In [33]:
user_query = "what to if I want to exercise"

results = hybrid_search(
    query=user_query,
    metadata_index=index_dense_metadata,       # dense FAISS index
    image_index=index_image,                   # image FAISS index
    sparse_embeddings=sparse_embeddings_dense, # dense-converted sparse embeddings
    metadata_model=metadata_model,
    sparse_model=sparse_model,                 # Dense text embedding model (MiniLM)
    processor=processor,                       # CLIP processor
    clip_model=clip_model,                         # CLIP model
)

for res in results:
    print("-------------------------")
    print(f"Place Name       : {res['Place Name']}")
    print(f"Neighborhood     : {res['Neighborhood']}")
    print(f"Tags             : {res['Tags']}")
    print(f"Description      : {res['Description']}")
    print(f"Hybrid Score     : {res['Hybrid Score']:.4f}")
    print(f"Dense Score      : {res['Dense Score']:.4f}")
    print(f"Image Score      : {res['Image Score']:.4f}")
    print(f"Sparse Score     : {res['Sparse Score']:.4f}")

-------------------------
Place Name       : VITAL Climbing Gym - Brooklyn
Neighborhood     : Brooklyn
Tags             : {health}
Description      : rooftop bouldering
Hybrid Score     : 1.0000
Dense Score      : 1.0000
Image Score      : 1.0000
Sparse Score     : 1.0000
-------------------------
Place Name       : Pebble Beach
Neighborhood     : Brooklyn
Tags             : {nature}
Description      : riverside views
Hybrid Score     : 0.3371
Dense Score      : 0.4505
Image Score      : 0.1954
Sparse Score     : 0.3276
-------------------------
Place Name       : Cafe Mogador
Neighborhood     : East Village
Tags             : {bar,restaurant}
Description      : moroccan cafe
Hybrid Score     : 0.2378
Dense Score      : 0.4153
Image Score      : 0.1779
Sparse Score     : 0.0612


In [38]:
from pathlib import Path

# 1. Save merged dataframe with all embeddings
output_dir = "Datasets/processed_data"
Path(output_dir).mkdir(exist_ok=True)

merge_df.to_csv(f"{output_dir}/merged.csv", index=False)
print("Saved merged.csv with all embeddings.")

# 2. Save dense metadata FAISS index
faiss.write_index(index_dense_metadata, f"{output_dir}/metadata.index")
print("Saved metadata.index")

# 3. Save image embedding FAISS index
faiss.write_index(index_image, f"{output_dir}/image.index")
print("Saved image.index")

Saved merged.csv with all embeddings.
Saved metadata.index
Saved image.index
