
In [None]:
# personal_search_engine.py

import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- 1. Make sure NLTK has the sentence tokenizer data ---
# The data is fetched on the first run; later runs find it already cached.
nltk.download("punkt")
# Newer NLTK releases load the tokenizer from "punkt_tab" instead:
nltk.download("punkt_tab")

# --- 2. Define your small "database" of documents ---

documents = [
    # Doc 0: AI analytics tool
    """
    We recently launched a new AI analytics tool for business intelligence.
    The tool helps organizations analyze large volumes of operational data in real time.
    It provides interactive dashboards, anomaly detection, and predictive insights.
    This AI analytics platform is designed for non-technical business users.
    Users can create custom reports using a simple drag-and-drop interface.
    The solution integrates with existing CRM and ERP systems.
    """,

    # Doc 1: Finance report
    """
    The quarterly finance report shows a steady increase in revenue.
    Operating margins improved due to cost optimization initiatives.
    The finance team highlighted risks related to foreign exchange volatility.
    Shareholders were informed about a new dividend payout policy.
    The CFO also discussed long-term capital allocation plans.
    The report emphasizes the importance of disciplined financial planning.
    """,

    # Doc 2: Cloud infrastructure (AWS, Azure)
    """
    Our cloud infrastructure runs on both AWS and Microsoft Azure.
    We use AWS for compute-intensive workloads and Azure for analytics.
    The architecture includes virtual networks, load balancers, and managed databases.
    Security groups and network security policies are centrally managed.
    We use infrastructure-as-code to automate provisioning and updates.
    This hybrid-cloud design improves scalability, reliability, and cost control.
    """,

    # Doc 3: Marketing campaign (SEO)
    """
    The new marketing campaign focuses heavily on SEO and content marketing.
    We conducted keyword research to identify high-intent search terms.
    The team optimized landing pages for better click-through and conversion rates.
    Social media posts and email newsletters support the SEO strategy.
    Performance is monitored using web analytics dashboards.
    The campaign will be iterated based on user engagement metrics.
    """,

    # Doc 4: AI + machine learning together
    """
    Our latest AI tool uses machine learning models to classify customer feedback.
    The system identifies sentiment, topics, and intent in support tickets.
    Machine learning helps prioritize issues that need urgent attention.
    The AI tool continuously retrains on new labeled data from the support team.
    This approach improves accuracy and adapts to changing customer language.
    Over time, the machine learning pipeline becomes a core asset for the company.
    """
]

# --- 3. Chunk documents: sentence tokenize + group every 3 sentences ---

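def chunk_documents(docs, chunk_size=3):
    """
    Split each document into sentences and group them into chunks of
    up to `chunk_size` sentences. Returns a list of dicts with keys
    "doc_id", "chunk_id" (a running index across all documents), and "text".
    """
    chunks = []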
    for doc_id, doc in enumerate(docs):
        sentences = sent_tokenize(doc.strip())
        # Group every `chunk_size` sentences together
        for i in range(0, len(sentences), chunk_size):
            chunk_sentences = sentences[i:i + chunk_size]
            chunk_text = " ".join(chunk_sentences)
            if chunk_text.strip():
                chunks.append({
                    "doc_id": doc_id,
                    "chunk_id": len(chunks),
                    "text": chunk_text
                })
    return chunks

chunks = chunk_documents(documents, chunk_size=3)

# --- 4. Vectorize chunks using TF-IDF ---

chunk_texts = [c["text"] for c in chunks]

vectorizer = TfidfVectorizer(stop_words="english")
chunk_vectors = vectorizer.fit_transform(chunk_texts)

# --- 5. Function to search ---

def search(query, top_k=5):
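    """
    Return the top_k chunks ranked by cosine similarity to the query.

    Each result is a dict with "score", "doc_id", "chunk_id", and "text".
    Chunks that share no vocabulary with the query score 0.0 and can still
    appear in the results when fewer than top_k chunks match.
    """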
    query_vector = vectorizer.transform([query])
    similarities = cosine_similarity(query_vector, chunk_vectors).flatten()

    scored_chunks = []
    for score, chunk in zip(similarities, chunks):
        scored_chunks.append({
            "score": float(score),
            "doc_id": chunk["doc_id"],
            "chunk_id": chunk["chunk_id"],
            "text": chunk["text"]
        })

    # Sort by similarity score (descending)
    scored_chunks.sort(key=lambda x: x["score"], reverse=True)

    return scored_chunks[:top_k]

# --- 6. Simple interactive loop ---

if __name__ == "__main__":
    print("Welcome to your personal mini-Google! (type 'exit' to quit)\n")

    while True:
        query = input("Enter your search query: ").strip()
        # "exit", "quit", or an empty query ends the session
        if query.lower() in ("exit", "quit", ""):
            print("Goodbye!")
            break

        results = search(query, top_k=5)

        print("\nTop results:")
        for rank, r in enumerate(results, start=1):
            print(f"\nResult #{rank}")
            print(f"Score: {r['score']:.4f}")
            print(f"From document: {r['doc_id']}")
            print(f"Chunk text: {r['text']}")
        print("\n" + "-" * 80 + "\n")
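
The chunking step in section 3 can be looked at in isolation with a minimal sketch. It assumes the sentences are already split, so it does not need NLTK, and it uses a hypothetical helper name, `group_sentences`, that is not part of the script above. It only mirrors the `range(0, len(sentences), chunk_size)` grouping.

```python
def group_sentences(sentences, chunk_size=3):
    """Join consecutive sentences into chunks of at most `chunk_size` (hypothetical helper)."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

# Seven placeholder sentences: two full chunks plus a shorter final one.
sample = [f"Sentence {n}." for n in range(1, 8)]
grouped = group_sentences(sample)
for chunk in grouped:
    print(chunk)
```

The last chunk is simply shorter when the sentence count is not a multiple of `chunk_size`. Each sample document in the script has six sentences, so it should yield two chunks of three sentences each.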

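To make the scoring in sections 4 and 5 more concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is an illustration, not scikit-learn's implementation: the simple whitespace tokenizer, the three-sentence toy corpus, and the helper names `tokenize`, `tfidf_vector`, and `cosine` are all assumptions, and there is no stop-word removal. The IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace split with edge punctuation stripped.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

# Toy corpus (illustrative only, not the documents from the script).
corpus = [
    "The campaign focuses on SEO.",
    "The finance report shows revenue.",
    "The cloud runs on AWS.",
]
tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for doc in tokenized for term in set(doc))
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens):
    """Map known terms to tf * idf, then scale the vector to unit (L2) length."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

doc_vectors = [tfidf_vector(doc) for doc in tokenized]
query_vector = tfidf_vector(tokenize("SEO campaign"))
scores = [cosine(query_vector, v) for v in doc_vectors]

# Rank documents by similarity, highest first.
for doc, score in sorted(zip(corpus, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.4f}  {doc}")
```

Only the first sentence shares terms with the query, so the other two score exactly zero. The same effect explains the `0.0000` entries in the sample output: `search` always returns `top_k` results, even when some chunks have nothing in common with the query.
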
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Welcome to your personal mini-Google! (type 'exit' to quit)


Top results:

Result #1
Score: 0.2010
From document: 3
Chunk text: Social media posts and email newsletters support the SEO strategy. Performance is monitored using web analytics dashboards. The campaign will be iterated based on user engagement metrics.

Result #2
Score: 0.1730
From document: 3
Chunk text: The new marketing campaign focuses heavily on SEO and content marketing. We conducted keyword research to identify high-intent search terms. The team optimized landing pages for better click-through and conversion rates.

Result #3
Score: 0.0000
From document: 0
Chunk text: We recently launched a new AI analytics tool for business intelligence. The tool helps organizations analyze large volumes of operational data in real time. It provides interactive dashboards, anomaly detection, and predictive insights.

Result #4
Score: 0.0000
From document: 0
Chunk text: This AI analytics platform is designed for non-technical busine