## Multimodal Product Search

**Goal**: Build a system that lets users search for products using either text or images. CLIP generates the image and text embeddings, FAISS stores them, results are ranked by cosine similarity, and a user-facing UI exposes the search.

The following concepts will be used:
<br>
1. Multimodal embeddings
2. Shared embedding space (text & image will be embedded into the same latent space)
3. Image processing
4. Cosine similarity
5. Vector database
6. Retrieval & ranking
7. UI
<br>


| Phase                            | Tasks                                                         | Est. Time |
| -------------------------------- | ------------------------------------------------------------- | --------- |
| **1. Research & Setup**          | Understand CLIP, FAISS, define scope, gather dataset          | 3–4 hrs   |
| **2. Embedding Pipeline**        | Encode product images + text descriptions using CLIP/OpenCLIP | 3–5 hrs   |
| **3. Indexing + Search**         | Use FAISS/LanceDB to build index, implement similarity search | 3–4 hrs   |
| **4. Query System**              | Add text + image search input, return top-k products          | 2–3 hrs   |
| **5. Frontend / UI (optional)**  | Build simple UI with Streamlit/Gradio to demo system          | 2–4 hrs   |
| **6. Polishing + Writeup**       | Refactor code, write README, evaluate search quality          | 2–3 hrs   |
| **7. Bonus Features (optional)** | Personalization, hybrid scoring, caching, dockerize           | 2–4 hrs   |


### CLIP

CLIP is the foundation for many multimodal search systems. It embeds images and text into a shared latent space and is trained so that embeddings of matching image-text pairs end up close together, while mismatched pairs are pushed apart.

Overview: https://www.youtube.com/watch?v=KcSXcpluDe4

Paper: https://arxiv.org/abs/2103.00020

Paper explanation: https://www.youtube.com/watch?v=T9XSU0pKX2E
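
Because text and images land in the same space, a text embedding can be compared directly against an image embedding. The sketch below illustrates that comparison with cosine similarity. The 3-dimensional vectors are made-up stand-ins rather than real CLIP outputs, and the function name is only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for CLIP embeddings (real ones have hundreds of dimensions).
text_red_shoe = [0.9, 0.1, 0.2]   # hypothetical embedding of the text "red running shoe"
image_red_shoe = [0.8, 0.2, 0.1]  # hypothetical embedding of a photo of that shoe
image_blender = [0.1, 0.9, 0.3]   # hypothetical embedding of an unrelated product photo

sim_match = cosine_similarity(text_red_shoe, image_red_shoe)
sim_other = cosine_similarity(text_red_shoe, image_blender)
print(f"text vs matching image:  {sim_match:.3f}")
print(f"text vs unrelated image: {sim_other:.3f}")
```

The matching pair scores higher than the unrelated pair, which is the property the retrieval step relies on.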

### Modular File Hierarchy 

'''
multimodal-product-search/
│
├── data/                   # Raw and processed product data + images
│   ├── products.csv        # Your product catalog (name, category, description, image_path)
│   └── images/             # Product images
│
├── notebooks/              # For EDA or prototype testing
│   └── explore_clip.ipynb
│
├── src/                    # Core source code
│   ├── __init__.py
│   ├── config.py           # Paths, hyperparams
│   ├── data_loader.py      # Load product metadata + images
│   ├── embedder.py         # Load CLIP and generate embeddings
│   ├── indexer.py          # Build and query FAISS index
│   ├── search.py           # Combined image/text search logic
│   └── utils.py            # Shared functions (normalization, logging, etc.)
│
├── app/                    # UI or API interface (optional)
│   ├── streamlit_app.py    # If using Streamlit
│   └── api.py              # If building FastAPI/Flask backend
│
├── tests/                  # Unit tests
│   └── test_embedding.py
│
├── requirements.txt        # Package dependencies
├── README.md               # Project overview
└── run.py                  # Entry point script (e.g., build index, search interactively)
'''



### Project Workflow

**Step 1: Data**

- Place product metadata (name, category, description, image path) in data/products.csv
- Place all product images in data/images/
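
A minimal sketch of what data_loader.py could look like, assuming the CSV has the columns listed in the file hierarchy (name, category, description, image_path) and that image_path holds a file name relative to data/images/. The `Product` class and `load_products` function are hypothetical names chosen for illustration.

```python
import csv
import io
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Product:
    name: str
    category: str
    description: str
    image_path: Path

def load_products(csv_file, image_root=Path("data/images")):
    """Read catalog rows from an open CSV file object into Product records."""
    products = []
    for row in csv.DictReader(csv_file):
        products.append(Product(
            name=row["name"].strip(),
            category=row["category"].strip(),
            description=row["description"].strip(),
            # Assumption: image_path is a file name relative to data/images/.
            image_path=image_root / row["image_path"].strip(),
        ))
    return products

# In-memory sample so the sketch runs without files on disk.
sample_csv = io.StringIO(
    "name,category,description,image_path\n"
    "Trail Runner,shoes,Lightweight red running shoe,trail_runner.jpg\n"
    "Mini Blender,kitchen,Compact 300 W blender,mini_blender.jpg\n"
)
products = load_products(sample_csv)
for p in products:
    print(p.name, "->", p.image_path)
```

With a real catalog, the same function would be called on `open("data/products.csv", newline="", encoding="utf-8")`.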

**Step 2: Embedding**

embedder.py: Load CLIP and encode both:

- Images (via encode_image)
- Descriptions (via encode_text)
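
The embedding step can be sketched without loading a model. CLIP and OpenCLIP models expose `encode_image` and `encode_text`, but they operate on preprocessed tensors and tokens; the `encoder` object below is a hypothetical wrapper that hides that preprocessing and returns plain lists of floats. `DummyEncoder` exists only so the sketch runs without model weights. The step shown here is L2 normalization, which makes a later inner-product search equivalent to cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def embed_catalog(products, encoder):
    """Return (image_vectors, text_vectors), one unit-length vector per product.

    `encoder` is a hypothetical wrapper exposing encode_image(path) and
    encode_text(string), each returning a list of floats.
    """
    image_vecs = [l2_normalize(encoder.encode_image(p["image_path"])) for p in products]
    text_vecs = [l2_normalize(encoder.encode_text(p["description"])) for p in products]
    return image_vecs, text_vecs

class DummyEncoder:
    """Stand-in for a CLIP wrapper; produces arbitrary 3-d vectors."""
    def encode_image(self, path):
        return [float(len(path)), 1.0, 0.0]

    def encode_text(self, text):
        return [float(len(text)), 0.0, 1.0]

catalog = [
    {"description": "Lightweight red running shoe", "image_path": "data/images/trail_runner.jpg"},
    {"description": "Compact blender", "image_path": "data/images/mini_blender.jpg"},
]
image_vecs, text_vecs = embed_catalog(catalog, DummyEncoder())
print(len(image_vecs), "image vectors,", len(text_vecs), "text vectors")
```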

**Step 3: Indexing**

indexer.py:

- Store embeddings in FAISS or Weaviate
- Save indexes for later use
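
To make the indexing step concrete, here is a pure-Python stand-in for an exhaustive inner-product index with save and load. It is not FAISS: in FAISS the rough equivalents would be `faiss.IndexFlatIP`, `index.add`, `index.search`, `faiss.write_index`, and `faiss.read_index`, operating on float32 NumPy arrays. The class name and the JSON file format below are assumptions made for illustration only.

```python
import json
import os
import tempfile

class FlatInnerProductIndex:
    """Exhaustive inner-product search over stored vectors (no compression)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        for v in vectors:
            if len(v) != self.dim:
                raise ValueError(f"expected {self.dim} dims, got {len(v)}")
            self.vectors.append(list(v))

    def search(self, query, k):
        """Return up to k (score, position) pairs, best score first."""
        scores = [
            (sum(q * x for q, x in zip(query, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        scores.sort(key=lambda pair: pair[0], reverse=True)
        return scores[:k]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"dim": self.dim, "vectors": self.vectors}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        index = cls(payload["dim"])
        index.add(payload["vectors"])
        return index

index = FlatInnerProductIndex(dim=2)
index.add([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])

# Save to a temporary location, then reload to confirm the round trip.
index_path = os.path.join(tempfile.mkdtemp(), "products.index.json")
index.save(index_path)
restored = FlatInnerProductIndex.load(index_path)
print(restored.search([0.0, 1.0], k=2))
```

The positions returned by `search` are row numbers, so the product metadata needs to be kept in the same order as the vectors were added.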

**Step 4: Search Logic**

search.py:

- Given a text or image query, return the top-k most similar products
- Can later combine modalities (text + image fusion)
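
One possible shape for the search logic, including a simple form of text + image fusion. The fusion here is a weighted average of the two unit-length query vectors followed by renormalization; this is only one option and is an assumption, not something the plan prescribes. The function names, the `text_weight` parameter, and the toy vectors are all hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_queries(text_vec=None, image_vec=None, text_weight=0.5):
    """Build one unit-length query from a text vector, an image vector, or both."""
    if text_vec is None and image_vec is None:
        raise ValueError("provide a text vector, an image vector, or both")
    if image_vec is None:
        return normalize(text_vec)
    if text_vec is None:
        return normalize(image_vec)
    t, i = normalize(text_vec), normalize(image_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, i)]
    return normalize(mixed)

def top_k(query, product_vecs, product_names, k=3):
    """Rank products by inner product with the query (cosine for unit vectors)."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), name)
        for vec, name in zip(product_vecs, product_names)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy catalog: three products with made-up, normalized embeddings.
names = ["red shoe", "blue shoe", "blender"]
vecs = [normalize(v) for v in ([0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9])]

query = fuse_queries(text_vec=[1.0, 0.0, 0.0], image_vec=[0.8, 0.3, 0.0], text_weight=0.5)
for score, name in top_k(query, vecs, names, k=2):
    print(f"{name}: {score:.3f}")
```

Setting `text_weight` closer to 1.0 lets the text dominate the fused query, and closer to 0.0 lets the image dominate.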

**Step 5: Interactive App (Optional)**

Add streamlit_app.py to let users:

- Upload an image or enter text
- View the top-k product matches
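
A rough sketch of how streamlit_app.py might be laid out, assuming Streamlit is installed. `search_fn` is a hypothetical callable that takes the text, the uploaded image file, and k, and returns a list of (name, image_path, score) tuples; `format_match` and `run_app` are illustrative names as well.

```python
def format_match(rank, name, score):
    """Build the caption shown under each result image."""
    return f"{rank}. {name} (score {score:.3f})"

def run_app(search_fn):
    """Render the search page.

    search_fn(text, image_file, k) is assumed to return a list of
    (name, image_path, score) tuples, best match first.
    """
    # Imported here so format_match stays usable without Streamlit installed.
    import streamlit as st

    st.title("Multimodal Product Search")
    text = st.text_input("Describe the product")
    image_file = st.file_uploader("...or upload an image", type=["jpg", "jpeg", "png"])
    k = st.slider("Number of results", min_value=1, max_value=20, value=5)

    # Only search once the user has supplied at least one kind of query.
    if text or image_file is not None:
        matches = search_fn(text or None, image_file, k)
        for rank, (name, image_path, score) in enumerate(matches, start=1):
            st.image(image_path, caption=format_match(rank, name, score))
```

To try it, call `run_app(...)` with the project's search function at the bottom of the file and start the app with `streamlit run app/streamlit_app.py`.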