# Build Google-Like Search with PostgreSQL (PGVector) + Python (Step-by-Step)

## Install PostgreSQL

* **[Install PostgreSQL](https://www.w3schools.com/postgresql/postgresql_install.php)**

## Install PGVector

* **[Install PGVector](https://www.youtube.com/watch?v=erNoI9COOn0)**

#### 1. **Linux/Mac**

Compile and install the extension (supports Postgres 13+):

```shell
cd /tmp
git clone --branch v0.8.1 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install # sudo is needed only if the PostgreSQL directories are not writable by your user
```
#### 2. **Windows**

Ensure [C++ support in Visual Studio](https://learn.microsoft.com/en-us/cpp/build/building-on-the-command-line?view=msvc-170#download-and-install-the-tools) is installed, then run `x64 Native Tools Command Prompt for VS [version]` **as administrator** and build with `nmake`:

```shell
set "PGROOT=C:\Program Files\PostgreSQL\18"
cd %TEMP%
git clone --branch v0.8.1 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install
```

**Note:** Start the command prompt as **administrator** (right-click the Command Prompt icon and choose "Run as administrator"); otherwise the installation can fail.

## Table of Contents

1. **Setup & Configs**
2. **Load Blogs Dataset**
3. **Create Table for Embeddings**
4. **Generate Embeddings (OpenAI/OpenSource-Ollama)**
5. **Insert into PostgreSQL**
6. **Search Embeddings (Vector)**
7. **Hybrid Search**
8. **(Optional) Create Index for Faster Search**

## 1. Setup & Configs

In [None]:
# PGVector + Python Semantic Search Demo
# Build Google-Like Search using PostgreSQL + PGVector

!pip install openai psycopg2-binary pandas tqdm python-dotenv

In [1]:
import os
import psycopg2
from psycopg2.extras import execute_values

import pandas as pd
from tqdm import tqdm
from openai import OpenAI

from dotenv import load_dotenv
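load_dotenv() ## Loads the .env file from the current directory.

# Initialize the OpenAI-compatible client. It points at a local Ollama server here;
# for OpenAI itself, drop base_url and pass api_key=os.environ["OPENAI_API_KEY"].
client = OpenAI(
    base_url='http://localhost:11434/v1', ## Ollama's OpenAI-compatible endpoint
    api_key="ollama",  ## Placeholder; Ollama ignores the key
)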

# PostgreSQL config
DB_CONFIG = {
    "host": os.getenv("POSTGRESQL_HOST", "localhost"), ## Hostname of your PostgreSQL server (AWS, Google Cloud, etc.); it must be reachable from this machine.
    "port": os.getenv("POSTGRESQL_PORT", "5432"),
    "dbname": os.getenv("POSTGRESQL_DATABASE", "postgres"), ## Use your own database if you have created one.
    "user": os.getenv("POSTGRESQL_USER", "postgres"), ## PostgreSQL user (default: postgres)
    "password": os.getenv("POSTGRESQL_PASS", "postgres"), ## Password
}

EMBEDDING_MODEL = "all-minilm"  # 384 dimensions. Other embedding models: bge-m3, bge-large, embeddinggemma, qwen3-embedding, text-embedding-3-small, etc. Their dimensions differ, so adjust vector(384) in the table definition to match.

### 1.1 Test PostgreSQL Connection

In [2]:
def create_connection():
    connection = psycopg2.connect(**DB_CONFIG)
    connection.autocommit=True
    cursor = connection.cursor()

    return connection, cursor

In [3]:
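connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
try:
    connection, cursor = create_connection()
    print("Connected to PostgreSQL")
except Exception as e:
    print("Connection failed:", e)
finally:
    if cursor is not None:
        cursor.close()
    if connection is not None:
        connection.close()
        print('Connection Closed.')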

Connected to PostgreSQL
Connection Closed.


### 1.2 Embedding Model

In [4]:
def get_embedding(text):
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,  ## Set in the config cell: all-minilm, bge-m3, bge-large, embeddinggemma, qwen3-embedding, an OpenAI embedding model, etc.
        input=text
    )
    return response.data[0].embedding

embed = get_embedding("test text")

print(f"Embeddings Length: {len(embed)}")

Embeddings Length: 384


## 2. Load Dataset

* **[Medium articles dataset](https://www.kaggle.com/datasets/dorianlazar/medium-articles-dataset)**
    * **Data about 6K+ articles published in 2019 from 7 different publications**

In [8]:
medium_blogs = pd.read_csv("medium_data.csv")
medium_blogs["subtitle"] = medium_blogs["subtitle"].fillna(value="") ## Fill NaNs so string concatenation does not fail later
medium_blogs["publication"] = medium_blogs["publication"].fillna(value="") ## Fill NaNs so string concatenation does not fail later
medium_blogs["date"] = pd.to_datetime(medium_blogs["date"])

print(f"Dataset Size : {medium_blogs.shape[0]}")

medium_blogs.head()

Dataset Size : 6508


| | id | url | title | subtitle | image | claps | responses | reading_time | publication | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | https://towardsdatascience.com/a-beginners-gui... | A Beginner’s Guide to Word Embedding with Gens... | | 1.png | 850 | 8 | 8 | Towards Data Science | 2019-05-30 |
| 1 | 2 | https://towardsdatascience.com/hands-on-graph-... | Hands-on Graph Neural Networks with PyTorch & ... | | 2.png | 1100 | 11 | 9 | Towards Data Science | 2019-05-30 |
| 2 | 3 | https://towardsdatascience.com/how-to-use-ggpl... | How to Use ggplot2 in Python | A Grammar of Graphics for Python | 3.png | 767 | 1 | 5 | Towards Data Science | 2019-05-30 |
| 3 | 4 | https://towardsdatascience.com/databricks-how-... | Databricks: How to Save Files in CSV on Your L... | When I work on Python projects dealing… | 4.jpeg | 354 | 0 | 4 | Towards Data Science | 2019-05-30 |
| 4 | 5 | https://towardsdatascience.com/a-step-by-step-... | A Step-by-Step Implementation of Gradient Desc... | One example of building neural… | 5.jpeg | 211 | 3 | 4 | Towards Data Science | 2019-05-30 |


## 3. Create Table for Embeddings

In [46]:
create_table_sql = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id SERIAL PRIMARY KEY,
    url TEXT,
    title TEXT,
    subtitle TEXT,
    content TEXT,
    publication TEXT,
    claps INT,
    reading_time SMALLINT, 
    embedding vector(384),
    date TIMESTAMP
);
"""


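connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
try:
    connection, cursor = create_connection()
    print("Connected to PostgreSQL.")
    cursor.execute(create_table_sql)
    print("Table Created.")
except Exception as e:
    print("Table creation failed:", e)
finally:
    if cursor is not None:
        cursor.close()
    if connection is not None:
        connection.close()
        print('Connection Closed.')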

Connected to PostgreSQL.
Table Created.
Connection Closed.
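
The `vector(384)` column only accepts vectors of exactly that length, which matches the 384 values that `all-minilm` returned in section 1.2. If you switch models, the dimension has to change with it. As a minimal sketch, the DDL could be generated from the measured embedding length; `build_documents_ddl` is a hypothetical helper introduced here for illustration, and the column list simply mirrors the table above.

```python
def build_documents_ddl(dim: int, table: str = "documents") -> str:
    """Return CREATE statements for a table whose vector column has `dim` dimensions."""
    if not isinstance(dim, int) or dim <= 0:
        raise ValueError("dim must be a positive integer")
    if not table.isidentifier():
        # Table names cannot be passed as query parameters, so validate before formatting.
        raise ValueError("table must be a plain identifier")
    return f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS {table} (
    id SERIAL PRIMARY KEY,
    url TEXT,
    title TEXT,
    subtitle TEXT,
    content TEXT,
    publication TEXT,
    claps INT,
    reading_time SMALLINT,
    embedding vector({dim}),
    date TIMESTAMP
);
"""


# In the notebook, dim could come from len(get_embedding("test text")).
ddl = build_documents_ddl(384)
print(ddl)
```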


## 4. Generate Embeddings

In [28]:
print("Sample Entry: ")
print()
print(("Title : " + medium_blogs['title'] + "\n\nSubTitle : " + medium_blogs['subtitle'] + "\n\nPublication : " + medium_blogs['publication'])[3])

Sample Entry: 

Title : Databricks: How to Save Files in CSV on Your Local Computer

SubTitle : When I work on Python projects dealing…

Publication : Towards Data Science
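
Some titles in this dataset carry leftover HTML markup (one such title, wrapped in a `<strong>` tag, shows up in the search results in section 6). The following is a minimal sketch of how the text could be cleaned before it is embedded. `clean_text` and `build_embedding_text` are hypothetical helpers, the regex is a simple tag stripper rather than a full HTML parser, and cleaning the input would change the embeddings (and therefore the scores) relative to the run recorded in this notebook.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def clean_text(value) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    if value is None:
        return ""
    text = html.unescape(_TAG_RE.sub("", str(value)))
    return " ".join(text.split())


def build_embedding_text(title, subtitle="", publication="") -> str:
    """Build the same 'Title / SubTitle / Publication' text used in this section."""
    return (
        f"Title : {clean_text(title)}"
        f"\n\nSubTitle : {clean_text(subtitle)}"
        f"\n\nPublication : {clean_text(publication)}"
    )


raw_title = '<strong class="markup--strong markup--h3-strong">Entity embedding using t-SNE</strong>'
print(build_embedding_text(raw_title, "", "Towards Data Science"))
```

In the notebook this could be applied row by row, for example with `medium_blogs.apply(lambda r: build_embedding_text(r["title"], r["subtitle"], r["publication"]), axis=1)`, after the NaN values have been filled as in section 2.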


In [36]:
%%time

medium_blogs['embedding'] = ("Title : " + medium_blogs['title'] \
                             + "\n\nSubTitle : " + medium_blogs['subtitle'] \
                             + "\n\nPublication : " + medium_blogs['publication']
                            ).apply(get_embedding)

medium_blogs["content"] = "" ## Adding Empty Content Column
medium_blogs = medium_blogs[["url", "title", "subtitle", "content", "publication", "claps", "reading_time", "embedding", "date"]]

medium_blogs.head()

CPU times: user 7.86 s, sys: 307 ms, total: 8.17 s
Wall time: 12min 5s


| | url | title | subtitle | content | publication | claps | reading_time | embedding | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/a-beginners-gui... | A Beginner’s Guide to Word Embedding with Gens... | | | Towards Data Science | 850 | 8 | [-0.037931133061647415, -0.07476947456598282, ... | 2019-05-30 |
| 1 | https://towardsdatascience.com/hands-on-graph-... | Hands-on Graph Neural Networks with PyTorch & ... | | | Towards Data Science | 1100 | 9 | [-0.17751798033714294, -0.044844068586826324, ... | 2019-05-30 |
| 2 | https://towardsdatascience.com/how-to-use-ggpl... | How to Use ggplot2 in Python | A Grammar of Graphics for Python | | Towards Data Science | 767 | 5 | [-0.015444916673004627, -0.010802224278450012,... | 2019-05-30 |
| 3 | https://towardsdatascience.com/databricks-how-... | Databricks: How to Save Files in CSV on Your L... | When I work on Python projects dealing… | | Towards Data Science | 354 | 4 | [-0.06038868427276611, 0.05434291437268257, -0... | 2019-05-30 |
| 4 | https://towardsdatascience.com/a-step-by-step-... | A Step-by-Step Implementation of Gradient Desc... | One example of building neural… | | Towards Data Science | 211 | 4 | [-0.10472466796636581, -0.05851958319544792, 0... | 2019-05-30 |
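
The run above sends one request per article, which is where most of the 12 minutes of wall time goes. The OpenAI embeddings API accepts a list of inputs per request, so batching is a natural optimization. The sketch below is a hedged illustration: `embed_in_batches` is a hypothetical helper, the batch size of 64 is an arbitrary assumption, and the embedding call is injected as a function so the example runs here with a stand-in. Whether your local endpoint accepts list input depends on the server and model, so verify that before relying on it.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed `texts` in chunks; `embed_batch` maps a list of strings to a list of vectors."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        batch_vectors = embed_batch(chunk)
        if len(batch_vectors) != len(chunk):
            raise RuntimeError("embedding backend returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors


# With the client from section 1, the injected function might look like this
# (assumes the endpoint accepts a list for `input`):
#
# def openai_embed_batch(chunk):
#     response = client.embeddings.create(model=EMBEDDING_MODEL, input=chunk)
#     return [item.embedding for item in response.data]


def fake_embed_batch(chunk):
    """Stand-in backend: a 2-dimensional 'embedding' based on text length."""
    return [[float(len(text)), 0.0] for text in chunk]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```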


## 5. Insert Data into Table

In [42]:
def insert(records):
    sql = """
    INSERT INTO documents (url, title, subtitle, content, publication, claps, reading_time, embedding, date)
    VALUES %s;
    """

    connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
    try:
        connection, cursor = create_connection()
        print("Connected to PostgreSQL.")

        # execute_values expands "VALUES %s" into batched multi-row INSERTs
        execute_values(cursor, sql, records)
        print("Data Inserted.")
    except Exception as e:
        print("Insert failed:", e)
    finally:
        if cursor is not None:
            cursor.close()
        if connection is not None:
            connection.close()
            print('Connection Closed.')

In [47]:
%time insert(medium_blogs.values)

Connected to PostgreSQL.
Data Inserted.
Connection Closed.
CPU times: user 3.24 s, sys: 14.2 ms, total: 3.25 s
Wall time: 5.61 s
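
The insert above passes each embedding as a Python list and relies on PostgreSQL to convert the resulting array into a `vector`. An alternative is to send pgvector's text form, `[x1,x2,...]`, which is the same representation the search functions produce with `str(...)`. The helper below is a small sketch of that conversion; `to_pgvector_literal` is a hypothetical name, and the finiteness check is a defensive assumption so that bad values are caught before they reach the database.

```python
import math


def to_pgvector_literal(embedding) -> str:
    """Format a sequence of numbers as pgvector's text representation, e.g. '[1.0,2.5]'."""
    values = [float(x) for x in embedding]
    if not values:
        raise ValueError("embedding must contain at least one value")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"


print(to_pgvector_literal([1, 2.5, -0.25]))
```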


## 6. Search Documents/Articles (Semantic Search)

In [39]:
def search(query_text, top_k=5):
    connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
    try:
        connection, cursor = create_connection()
        print("Connected to PostgreSQL.")

        # pgvector parses the '[x1, x2, ...]' text form, so send the list as a string
        q_emb = str(get_embedding(query_text))

        # <=> is cosine distance, so 1 - distance is cosine similarity
        cursor.execute("""
        SELECT url, title, content, reading_time, claps, 1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
        """, (q_emb, q_emb, top_k))

        return pd.DataFrame(cursor.fetchall(), columns=["URL", "Title", "Content", "Reading Time", "Claps", "Cosine_Similarity"])
    except Exception as e:
        print("Search failed:", e)
    finally:
        if cursor is not None:
            cursor.close()
        if connection is not None:
            connection.close()
            print('Connection Closed.')

In [40]:
query = "embeddings introduction"
results = search(query, top_k=5)

results

Connected to PostgreSQL.
Connection Closed.


| | URL | Title | Content | Reading Time | Claps | Cosine_Similarity |
|---|---|---|---|---|---|---|
| 0 | https://medium.com/swlh/embedding-of-categoric... | Embedding of categorical variables for Deep Le... | | 5 | 84 | 0.513503 |
| 1 | https://towardsdatascience.com/word-embeddings... | Word Embeddings for NLP | | 7 | 176 | 0.508565 |
| 2 | https://towardsdatascience.com/a-beginners-gui... | A Beginner’s Guide to Word Embedding with Gens... | | 8 | 850 | 0.483658 |
| 3 | https://towardsdatascience.com/entity-embeddin... | `<strong class="markup--strong markup--h3-stron...` | | 4 | 23 | 0.473463 |
| 4 | https://towardsdatascience.com/distributed-vec... | Distributed Vector Representation : Simplified | | 5 | 44 | 0.449757 |
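
The query orders rows by `embedding <=> query`, which is pgvector's cosine distance, and reports `1 - distance`, which is cosine similarity. Smaller distance means larger similarity, so the best match comes first. The pure-Python sketch below reproduces that relationship on toy 2-dimensional vectors; the vectors and labels are made up for illustration and are unrelated to the dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """What the <=> operator computes: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# ORDER BY embedding <=> query: ascending distance, i.e. descending similarity
ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
for name in ranked:
    print(name, round(cosine_similarity(query, docs[name]), 4))
```

Vector length does not matter here: `[2.0, 0.0]` scores the same as `[1.0, 0.0]` would, because only direction is compared.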


In [42]:
for i, (url, title, content, reading_time, claps, similarity) in enumerate(results.values):
    print(f"{i} - {title} (similarity: {similarity:.4f}, claps: {claps}, reading_time: {reading_time})")
    print(f"{url}")
    print(content[:150], "...")

0 - Embedding of categorical variables for Deep Learning model — explained (similarity: 0.5135, claps: 84, reading_time: 5)
https://medium.com/swlh/embedding-of-categorical-variables-for-deep-learning-model-explained-6aa8a1a04603
 ...
1 - Word Embeddings for NLP (similarity: 0.5086, claps: 176, reading_time: 7)
https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4
 ...
2 - A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model (similarity: 0.4837, claps: 850, reading_time: 8)
https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92
 ...
3 - <strong class="markup--strong markup--h3-strong">Entity embedding using t-SNE</strong> (similarity: 0.4735, claps: 23, reading_time: 4)
https://towardsdatascience.com/entity-embedding-using-t-sne-973cb5c730d7
 ...
4 - Distributed Vector Representation : Simplified (similarity: 0.4498, claps: 44, reading_time: 5)
https://towardsdatascience.com/distributed-vector-representation-simplified-

In [43]:
query = "python machine learning"
results = search(query, top_k=5)

results

Connected to PostgreSQL.
Connection Closed.


| | URL | Title | Content | Reading Time | Claps | Cosine_Similarity |
|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/logistic-regres... | Logistic Regression in Machine Learning using ... | | 7 | 195 | 0.57601 |
| 1 | https://towardsdatascience.com/hackcvilleds-46... | Evaluating Classification Models | | 10 | 378 | 0.567973 |
| 2 | https://towardsdatascience.com/automate-your-m... | Automate Your ML Model Tuning and Selection us... | | 6 | 103 | 0.550186 |
| 3 | https://towardsdatascience.com/the-hundred-pag... | The Hundred-Page Machine Learning Book Book Re... | | 10 | 618 | 0.539624 |
| 4 | https://towardsdatascience.com/3-advanced-pyth... | 3 Advanced Python Functions for Data Scientists | | 3 | 1700 | 0.499216 |


In [44]:
for i, (url, title, content, reading_time, claps, similarity) in enumerate(results.values):
    print(f"{i} - {title} (similarity: {similarity:.4f}, claps: {claps}, reading_time: {reading_time})")
    print(f"{url}")
    print(content[:150], "...")

0 - Logistic Regression in Machine Learning using Python (similarity: 0.5760, claps: 195, reading_time: 7)
https://towardsdatascience.com/logistic-regression-explained-and-implemented-in-python-880955306060
 ...
1 - Evaluating Classification Models (similarity: 0.5680, claps: 378, reading_time: 10)
https://towardsdatascience.com/hackcvilleds-4636c6c1ba53
 ...
2 - Automate Your ML Model Tuning and Selection using AutoML in Python (similarity: 0.5502, claps: 103, reading_time: 6)
https://towardsdatascience.com/automate-your-ml-model-tuning-and-selection-2f8c0b6992ce
 ...
3 - The Hundred-Page Machine Learning Book Book Review (similarity: 0.5396, claps: 618, reading_time: 10)
https://towardsdatascience.com/the-hundred-page-machine-learning-book-book-review-72b51c5ad083
 ...
4 - 3 Advanced Python Functions for Data Scientists (similarity: 0.4992, claps: 1700, reading_time: 3)
https://towardsdatascience.com/3-advanced-python-functions-for-data-scientists-f869016da63a
 ...


## 7. Hybrid Search

In [56]:
def hybrid_search(query_text, publication=None, reading_time=0, claps=0, top_k=5):
    # reading_time and claps are exclusive lower bounds. They are applied only when
    # a publication is given; without one this falls back to a plain vector search.
    connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
    try:
        connection, cursor = create_connection()
        print("Connected to PostgreSQL.")

        # pgvector parses the '[x1, x2, ...]' text form, so send the list as a string
        q_emb = str(get_embedding(query_text))

        if publication:
            # Metadata filters narrow the candidates; <=> (cosine distance) ranks them
            cursor.execute("""
                SELECT url, title, content, reading_time, claps, 1 - (embedding <=> %s::vector) AS cosine_similarity
                FROM documents
                WHERE publication = %s
                AND reading_time > %s
                AND claps > %s
                ORDER BY embedding <=> %s::vector
                LIMIT %s;
            """, (q_emb, publication, reading_time, claps, q_emb, top_k))
        else:
            cursor.execute("""
                SELECT url, title, content, reading_time, claps, 1 - (embedding <=> %s::vector) AS cosine_similarity
                FROM documents
                ORDER BY embedding <=> %s::vector
                LIMIT %s;
            """, (q_emb, q_emb, top_k))

        return pd.DataFrame(cursor.fetchall(), columns=["URL", "Title", "Content", "Reading Time", "Claps", "Cosine_Similarity"])
    except Exception as e:
        print("Search failed:", e)
    finally:
        if cursor is not None:
            cursor.close()
        if connection is not None:
            connection.close()
            print('Connection Closed.')

In [57]:
print(medium_blogs["publication"].unique())

['Towards Data Science' 'UX Collective' 'The Startup'
 'The Writing Cooperative' 'Data Driven Investor' 'Better Marketing'
 'Better Humans']


In [58]:
query = "python machine learning"
results = hybrid_search(query, publication="Towards Data Science", top_k=5)

results

Connected to PostgreSQL.
Connection Closed.


| | URL | Title | Content | Reading Time | Claps | Cosine_Similarity |
|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/logistic-regres... | Logistic Regression in Machine Learning using ... | | 7 | 195 | 0.57601 |
| 1 | https://towardsdatascience.com/hackcvilleds-46... | Evaluating Classification Models | | 10 | 378 | 0.567973 |
| 2 | https://towardsdatascience.com/automate-your-m... | Automate Your ML Model Tuning and Selection us... | | 6 | 103 | 0.550186 |
| 3 | https://towardsdatascience.com/the-hundred-pag... | The Hundred-Page Machine Learning Book Book Re... | | 10 | 618 | 0.539624 |
| 4 | https://towardsdatascience.com/3-advanced-pyth... | 3 Advanced Python Functions for Data Scientists | | 3 | 1700 | 0.499216 |


In [59]:
query = "python machine learning"
results = hybrid_search(query, publication="Towards Data Science", reading_time=10, top_k=5)

results

Connected to PostgreSQL.
Connection Closed.


| | URL | Title | Content | Reading Time | Claps | Cosine_Similarity |
|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/the-limitations... | The Limitations of Machine Learning | | 12 | 810 | 0.487581 |
| 1 | https://towardsdatascience.com/20-popular-mach... | 20 Popular Machine Learning Metrics. Part 1: C... | | 11 | 403 | 0.483002 |
| 2 | https://towardsdatascience.com/machine-learnin... | Machine Learning Algorithms In Layman’s Terms,... | | 12 | 958 | 0.471053 |
| 3 | https://towardsdatascience.com/feature-enginee... | Fundamental Techniques of Feature Engineering ... | | 14 | 3500 | 0.456673 |
| 4 | https://towardsdatascience.com/cross-validatio... | Cross Validation: A Beginner’s Guide | | 18 | 648 | 0.446142 |


In [60]:
query = "python machine learning"
results = hybrid_search(query, publication="Towards Data Science", claps=500, top_k=5)

results

Connected to PostgreSQL.
Connection Closed.


| | URL | Title | Content | Reading Time | Claps | Cosine_Similarity |
|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/the-hundred-pag... | The Hundred-Page Machine Learning Book Book Re... | | 10 | 618 | 0.539624 |
| 1 | https://towardsdatascience.com/3-advanced-pyth... | 3 Advanced Python Functions for Data Scientists | | 3 | 1700 | 0.499216 |
| 2 | https://towardsdatascience.com/optimization-wi... | Optimization with Python: How to make the most... | | 7 | 1000 | 0.490182 |
| 3 | https://towardsdatascience.com/the-limitations... | The Limitations of Machine Learning | | 12 | 810 | 0.487581 |
| 4 | https://towardsdatascience.com/bite-sized-pyth... | Bite-Sized Python Recipes | | 6 | 533 | 0.476455 |
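
In the function above, the `reading_time` and `claps` filters only take effect together with a `publication`. If you want each filter to be independently optional, one approach is to assemble the `WHERE` clause from whichever filters were supplied while still passing every value as a query parameter. The following is a sketch under that assumption; `build_filtered_query` and `assemble_params` are hypothetical helpers, not part of psycopg2 or pgvector.

```python
def build_filtered_query(publication=None, min_reading_time=None, min_claps=None):
    """Return (sql, filter_params) for a vector search with optional metadata filters."""
    conditions = []
    filter_params = []
    if publication:
        conditions.append("publication = %s")
        filter_params.append(publication)
    if min_reading_time is not None:
        conditions.append("reading_time > %s")
        filter_params.append(min_reading_time)
    if min_claps is not None:
        conditions.append("claps > %s")
        filter_params.append(min_claps)

    # Only fixed condition strings are interpolated; user values stay as %s parameters.
    where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = f"""
SELECT url, title, content, reading_time, claps, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM documents
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""
    return sql, filter_params


def assemble_params(q_emb, filter_params, top_k):
    """Order parameters to match the placeholders: SELECT, WHERE..., ORDER BY, LIMIT."""
    return [q_emb, *filter_params, q_emb, top_k]


sql, filter_params = build_filtered_query(min_claps=500)
print(sql)
print(assemble_params("[0.1, 0.2]", filter_params, 5))
```

Inside a search function, the result would be executed as `cursor.execute(sql, assemble_params(q_emb, filter_params, top_k))`.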


## 8. Index for Faster Search

In [62]:
index_sql = """
CREATE INDEX IF NOT EXISTS documents_embedding_idx
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

ANALYZE documents;
"""

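connection = cursor = None  # Defined up front so cleanup is safe if connecting fails
try:
    connection, cursor = create_connection()
    print("Connected to PostgreSQL.")
    cursor.execute(index_sql)
    print("IVFFlat index created (cosine distance).")
except Exception as e:
    print("Index creation failed:", e)
finally:
    if cursor is not None:
        cursor.close()
    if connection is not None:
        connection.close()
        print('Connection Closed.')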

Connected to PostgreSQL.
IVFFlat index created (cosine distance).
Connection Closed.
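
The index above uses `lists = 100`. The pgvector README suggests, as a starting point, about `rows / 1000` lists for up to 1M rows and `sqrt(rows)` beyond that, with `ivfflat.probes` starting near `sqrt(lists)`. For a table of roughly 6.5K rows that points to far fewer lists than 100. The helper below only encodes that rule of thumb; `suggest_ivfflat_settings` is a hypothetical name, and the numbers are a first guess to tune against measured recall and latency, not a guarantee.

```python
import math


def suggest_ivfflat_settings(row_count: int):
    """Return (lists, probes) starting values from the rows/1000 and sqrt heuristics."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


lists, probes = suggest_ivfflat_settings(6508)  # dataset size printed in section 2
print(f"WITH (lists = {lists});")
print(f"SET ivfflat.probes = {probes};")
```

More probes improve recall at the cost of speed; `SET ivfflat.probes` applies to the current session, so it has to be issued on the same connection that runs the search.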


## Summary

**What we built:**

- Stored text embeddings in PostgreSQL using PGVector  
- Performed semantic (meaning-based) search  
- Optionally added a vector index for faster retrieval  
- Saw how hybrid filtering works with SQL + vectors