# Task 3: Build a Search Engine Using TF-IDF and Cosine Similarity

## Overview
In this task, you will build a document search system using TF-IDF vectorization and perform top-k cosine similarity search entirely in memory.

You will use a dataset of pre-merged Yahoo Answers entries stored in `data/qa_data.json`. This task is split into two parts:

Task 3.1: Preprocess and vectorize the documents, then save both the vectorizer and the document embeddings.  
Task 3.2: Accept a search query and return the top-k most relevant results using cosine similarity.

The same index-then-rank pattern underlies semantic search engines, retrieval-augmented generation (RAG), and other vector-based NLP applications, although TF-IDF itself matches on shared terms rather than on meaning.
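
Before working with the full dataset, it may help to see the two building blocks on a toy corpus. The sketch below is illustrative only: the three documents and the `toy_*` names are made up and are not taken from `qa_data.json`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only to illustrate the idea
toy_docs = [
    "how do I learn python programming",
    "best recipe for chocolate cake",
    "python programming tips for beginners",
]

# Learn the vocabulary and IDF weights, then vectorize the documents
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse, one row per document

# The query must be transformed with the SAME fitted vectorizer
toy_query = toy_vectorizer.transform(["python programming"])

# One cosine similarity score per document
toy_scores = cosine_similarity(toy_query, toy_matrix)[0]

for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {doc}")
```

The cake recipe shares no terms with the query, so its score is exactly zero, while both Python documents score above zero.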

In [1]:
import joblib
import json
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Task 3.1: Preprocess and Index the Documents
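
Saving both the fitted vectorizer and the document vectors matters because a query can only be compared against the documents if it is transformed with the same vocabulary and IDF weights. The following is a minimal round-trip sketch, assuming a temporary directory and two made-up documents (`sample_docs`, `out_dir`, and the other names are hypothetical), that checks the artifacts survive being written to disk and read back.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sample_docs = ["first sample document", "second sample text"]  # hypothetical data

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists

    # Fit on the documents and build a dense matrix of TF-IDF vectors
    fitted = TfidfVectorizer().fit(sample_docs)
    dense = fitted.transform(sample_docs).toarray()

    # Persist the vectorizer and the vectors
    joblib.dump(fitted, out_dir / "vectorizer.pkl")
    np.save(out_dir / "document_vectors.npy", dense)

    # Load both back, as a later search step would
    reloaded_vectorizer = joblib.load(out_dir / "vectorizer.pkl")
    reloaded_vectors = np.load(out_dir / "document_vectors.npy")

# The reloaded artifacts should match what was saved
roundtrip_ok = (
    np.allclose(dense, reloaded_vectors)
    and reloaded_vectorizer.vocabulary_ == fitted.vocabulary_
)
print(roundtrip_ok)
```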

In [2]:
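def build_and_index_documents():
    """Fit a TF-IDF vectorizer on the documents and save it with the vectors."""
    # Open and load the JSON file (a list of pre-merged question/answer strings)
    with open("data/qa_data.json", "r", encoding="utf-8") as file:
        data = json.load(file)

    # Report how many documents were loaded
    print(len(data))

    # Keep only the 1000 most frequent terms as features
    vectorizer = TfidfVectorizer(max_features=1000)

    # Learn the vocabulary and IDF weights, then vectorize every document
    tfidf_matrix = vectorizer.fit_transform(data)
    vectors = tfidf_matrix.toarray()  # dense array, one row per document

    # Persist both artifacts; the output/ directory must already exist
    joblib.dump(vectorizer, "output/vectorizer.pkl")
    np.save("output/document_vectors.npy", vectors)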

In [3]:
build_and_index_documents()

1000


## Task 3.2: Run a Search Query
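
The ranking step of a top-k search can be sketched in isolation. `argsort` returns document indices rather than scores, so the score array has to stay available under its own name if the top-k scores are needed afterwards. The scores below are made up for illustration.

```python
import numpy as np

scores = np.array([0.10, 0.75, 0.00, 0.40, 0.62])  # made-up similarity scores
k = 3

# argsort sorts ascending, so reverse to get the best-scoring documents first
ranked = np.argsort(scores)[::-1]
top_k_indices = ranked[:k]

# Index the ORIGINAL score array with the selected document indices
top_k_scores = scores[top_k_indices]

print(top_k_indices.tolist())  # [1, 4, 3]
print(top_k_scores.tolist())   # [0.75, 0.62, 0.4]
```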

In [4]:
def run_query(query: str, k: int = 5):
    """Print the k documents most similar to the query."""
    vectorizer = joblib.load("output/vectorizer.pkl")
    vectors = np.load("output/document_vectors.npy")

    # Project the query into the same TF-IDF space as the documents
    query_vector = vectorizer.transform([query])

    # One cosine similarity score per document
    cosine_scores = cosine_similarity(query_vector, vectors)[0]

    # Sort document indices by score, highest first, and keep the top k.
    # The indices go in their own variable so the scores are not overwritten.
    ranked_indices = cosine_scores.argsort()[::-1]
    top_k_indices = ranked_indices[:k]
    top_k_scores = cosine_scores[top_k_indices]  # kept for inspection; not printed

    # Reload the original documents to display the matching text
    with open("data/qa_data.json", "r", encoding="utf-8") as file:
        data = json.load(file)

    print(f"Top {k} results for: {query}")
    for idx in top_k_indices:
        print(f"doc_{idx} {data[idx]}")

In [5]:
run_query("python data science", 3)

Top 3 results for: python data science
doc_160 Where can I get ideas for our science project?  My daughter gets her science project ideas from infoplease.com.  She loves this site.
doc_848 Why indian  good in Math and science?  Indians r best at everythin dude
doc_167 What is programming?  Programming is instructing a computer to do something for you with the help of a programming language. The role of a programming language can be described in two ways:\n\n   1. Technical: It is a means for instructing a Computer to perform Tasks\n   2. Conceptual: It is a framework within which we organize our ideas about things and processes. \n\nAccording to the last statement, in programming we deal with two kind of things:\n\n    * data, representing ``objects'' we want to manipulate\n    * procedures, i.e. ``descriptions'' or ``rules'' that define how to manipulate data. \n\nAccording to Abelson and Sussman ([ABELSON, 1985, 4,])\n\n    ``.....we should pay particular attention to the means that 