# **Needle in a Haystack Problem**  
### **Semantic Search with OpenAI LLM**

**Author**: [Shantanu Borkar](https://www.linkedin.com/in/shantanubkr)

**Reviewer**: [Sarvesh Bhujle](https://www.linkedin.com/in/sarvesh-bhujle-78701231b/)


<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://miro.medium.com/v2/resize:fit:2000/1*aDcggCoq28DmWhin4W1vDA.jpeg"> 
</p>
</div>

---

## **Overview**

The **Needle in a Haystack Problem** is a semantic search program that uses OpenAI's GPT-4 to find the most relevant sentence from a "haystack" of abstract sentences based on a user-provided query (the "needle"). The results are neatly formatted in Markdown for clear, structured output, making it ideal for Jupyter Notebook environments or any interface supporting Markdown rendering.

---

## **Features**

- **Semantic Search:** Matches sentences based on meaning rather than exact keywords.  
- **Markdown Output:** Neatly structures query results using Markdown formatting for readability.  
- **AI-powered Matching:** Utilizes GPT-4’s chat completion API for natural language processing.  
- **Error Handling:** Includes fallback responses for unmatched queries.  

---

## **How It Works**

1. **Environment Setup:**  
   - Loads API keys using `dotenv`.  
   - Configures OpenAI's GPT-4 for semantic search.  

2. **Semantic Search:**  
   - The `find_needle()` function constructs a prompt instructing GPT-4 to identify the most relevant sentence from the haystack based on the query.  
   - The result is formatted in Markdown.  

3. **Testing:**  
   - A set of test queries (`test_needles`) is processed to check how well the model finds matches.  

4. **Output:**  
   - Results are displayed using Jupyter's `Markdown` formatting via the `IPython.display` module.  

---

## **1️⃣ Environment Setup**

### **Step 1: Install Dependencies**  

You need two main libraries:
- **openai** — to use GPT-4 for semantic search.  
- **python-dotenv** — to store your API key securely.

Run this in your terminal:
```bash
pip install openai python-dotenv
```

### **Step 2: Add Your API Key**
Create a `.env` file in the same folder as your script.

Add your OpenAI API key like this:

`.env` file:

```bash
OPENAI_API_KEY=your_api_key_here
```

### **Step 3: Import Libraries and Load API Key**

Now, let's set up the environment and load the API key in your Python script:

**✅ What this does:**

- Loads your API key from the `.env` file.
- Configures the OpenAI client.
- Sets up Markdown display for clean output in Jupyter notebooks.

```python
from dotenv import load_dotenv
from IPython.display import Markdown, display
import openai
import os

# Load environment variables
load_dotenv()

# Set the API key
openai.api_key = os.getenv("OPENAI_API_KEY")
```
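If the `.env` file is missing or the variable name is misspelled, `os.getenv` quietly returns `None` and the problem only shows up at the first API call. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper name, not part of `openai` or `python-dotenv`.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Check that your .env file exists and that "
            "load_dotenv() ran before this call."
        )
    return key
```

Calling `openai.api_key = require_api_key()` right after `load_dotenv()` would then stop the notebook early with a readable message instead of a later authentication error.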

## **2️⃣ Semantic Search Function**

Now, let's create a function called `find_needle()`.

This function will:

1. Take in a list of sentences (the haystack).

2. Accept a user query (the needle).

3. Ask GPT-4 to find the best match.

4. Format the result in Markdown.

**✅ Explanation:**

- The prompt tells GPT-4:

    - Find the most relevant sentence from the haystack.

    - Format the output using Markdown.

- If no match is found, the model is asked to respond with **"No match found."**

- The function returns the AI's response.

```python
def find_needle(haystack, needle):
    """Ask GPT-4 for the haystack sentence that best matches the needle."""
    # The prompt spells out the Markdown format for both the match and no-match cases.
    prompt = f"""
    You are an advanced AI skilled in semantic search. Your task is to find the most relevant sentence from the provided haystack that best matches the given query, even if the match is approximate.

    Format the output in Markdown with the following structure:

    **Query:** "{needle}"

    **Best Match:** "[Insert best matching sentence from haystack]"

    If no match is found, respond with:

    **Query:** "{needle}"

    **"No match found."**

    Haystack:
    {chr(10).join(haystack)}
    """

    # Send the prompt through the chat completions API.
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": "You are a helpful assistant"},
                  {"role": "user", "content": prompt}]
    )

    # Fall back to a fixed message if the API returns no choices.
    if not response.choices:
        return '**Best Match:** "No match found."'
    return response.choices[0].message.content.strip()
```
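`find_needle()` only guards against an empty `choices` list, so a network failure or an invalid key would still raise an exception. One way to keep the notebook output tidy is a thin wrapper like the sketch below. `find_needle_safely` is a hypothetical name, and catching every `Exception` is a simplification chosen for brevity; a real program would likely catch narrower error types.

```python
def find_needle_safely(find_fn, haystack, needle):
    """Call find_fn(haystack, needle) and return a Markdown fallback on failure.

    find_fn is any function with the same signature as find_needle().
    """
    if not haystack:
        # Nothing to search, so skip the API call entirely.
        return f'**Query:** "{needle}"\n\n**"No match found."**'
    try:
        return find_fn(haystack, needle)
    except Exception as exc:  # broad on purpose: this is only a sketch
        return f'**Query:** "{needle}"\n\n**Error:** {type(exc).__name__}: {exc}'
```

In the test loop, `find_needle_safely(find_needle, haystack, test_needle)` could stand in for the direct call, so that one failed request does not stop the remaining queries.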


## **3️⃣ Create the Haystack and Test Queries**

Now, let's build our haystack (abstract sentences). The test needles (queries) are defined in the next step, just before they are run.

**✅ Why this setup?**

- The haystack is a collection of abstract sentences — not simple keyword matches.

- The test needles include both valid and absurd queries, testing if GPT-4 can handle unexpected inputs.


```python
haystack = [
    "The clock forgot to chime, yet time marched on.",
    "A cat danced in the moonlight, chasing shadows.",
    "Silence echoed louder than the thunderstorm.",
    "The invisible wind wrote secrets in the sand.",
    "A broken compass pointed to nowhere and everywhere.",
    "Colors argued on the painter's palette.",
    "Dreams wandered through locked doors.",
    "The echo of a whisper traveled farther than the shout.",
    "An unread book knew all the answers.",
    "Lightning remembered what the sky forgot.",
    "The mountain moved, but the pebble stayed still.",
    "A question mark floated above an empty chair.",
    "Stars blinked, wondering who was watching.",
    "The glass was both full and empty at once.",
    "A shadow ran faster than the light chasing it."
]
```

## **4️⃣ Running the Tests**
Now, let's test the function by running each query through the haystack.

**✅ Explanation:**

- Loops through each needle (query).

- Calls `find_needle()` to search for the most relevant haystack sentence.

- Displays the result using Markdown for clean output.


```python
test_needles = [
    "What did the wind write?",
    "Which object defied direction?",
    "Who knew the answers but remained silent?",
    "What sound was louder than noise?",
    "Which event happened despite no motion?",
    "What did obama tell sarvesh?"
]

# Run every query against the haystack and render the Markdown result.
for test_needle in test_needles:
    match = find_needle(haystack, test_needle)
    display(Markdown(match))
```
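Reading the rendered output by eye works for six queries, but a small scoring helper makes the check repeatable. The sketch below assumes the model follows the `**Best Match:** "..."` format requested in the prompt. `extract_best_match`, `score_matches`, and the `expected` mapping are hypothetical names, and the expected answers are labels you would write yourself.

```python
import re


def extract_best_match(markdown_text):
    """Pull the quoted sentence out of a '**Best Match:** "..."' line, if any."""
    found = re.search(r'\*\*Best Match:\*\*\s*"(.*?)"\s*$', markdown_text, re.MULTILINE)
    if not found or found.group(1) == "No match found.":
        # Covers both the prompt's no-match format and find_needle()'s fallback string.
        return None
    return found.group(1)


def score_matches(find_fn, haystack, expected):
    """Return the fraction of queries whose extracted match equals the expected sentence.

    `expected` maps each query to a haystack sentence, or to None when the
    query should produce no match.
    """
    hits = 0
    for needle, wanted in expected.items():
        got = extract_best_match(find_fn(haystack, needle))
        hits += (got == wanted)
    return hits / len(expected)
```

For example, `expected` could map `"What did the wind write?"` to `"The invisible wind wrote secrets in the sand."` and `"What did obama tell sarvesh?"` to `None`, and `score_matches(find_needle, haystack, expected)` would then report how many of those labels the model reproduced exactly. Exact string comparison is strict: a response that paraphrases the sentence counts as a miss.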

---

## 🎯 **Conclusion**

Congratulations! 🎉  

You've successfully built a **semantic search tool** powered by **OpenAI's GPT-4**.  


In this journey, you learned how to:

- **Set up your environment** — installing dependencies and configuring API keys.  

- **Create a haystack and needles** — designing abstract sentences and queries for testing.  

- **Implement semantic search** — using GPT-4 to find the most relevant matches.  

- **Handle errors and no-match cases** — ensuring your program responds gracefully.  

- **Customize and expand** — adding your own haystacks, queries, and AI models.  


This is just the beginning! 🚀  

You can now enhance your tool by:

- Adding **confidence scores** for matches.  

- Implementing **similarity thresholds** — only showing results above a certain match level.

- Building a simple **web interface** — using **Flask** or **Streamlit** to let others test your app.
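
As a rough sketch of the first two ideas, the snippet below scores each sentence with cosine similarity and rejects matches under a cutoff. This is a different technique from the prompt-based search used in this notebook: it assumes you already have numeric vectors (for example from an embeddings model) for the query and for each haystack sentence. The function names and the `0.5` threshold are illustrative assumptions, not tuned values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match_with_threshold(query_vec, sentence_vecs, sentences, threshold=0.5):
    """Return (sentence, score) for the closest sentence, or (None, score) below the threshold."""
    if not sentences:
        return None, 0.0
    scores = [cosine_similarity(query_vec, vec) for vec in sentence_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        # The closest sentence is still too far away to report as a match.
        return None, scores[best]
    return sentences[best], scores[best]
```

The returned score can double as a simple confidence value to show next to each result, and an absurd query would ideally fall below the threshold and come back as `None`.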


The possibilities are endless.  

Remember: **Every great solution starts with asking the right questions.**  

Happy coding! 💡✨


---

# Thank You for visiting The Hackers Playbook! 🌐

If you liked this research material:

- [Subscribe to our newsletter.](https://thehackersplaybook.substack.com)

- [Follow us on LinkedIn.](https://www.linkedin.com/company/the-hackers-playbook/)

- [Leave a star on our GitHub.](https://www.github.com/thehackersplaybook)

<div style="display:flex; align-items:center; padding: 50px;">
<p style="margin-right:10px;">
    <img height="200px" style="width:auto;" width="200px" src="https://avatars.githubusercontent.com/u/192148546?s=400&u=95d76fbb02e6c09671d87c9155f17ca1e4ef8f21&v=4"> 
</p>
</div>