In [1]:
import json
import os

# Define the path to the JSON file
json_path = os.path.join('wiki_corpus.json')

# Load the JSON data
with open(json_path, 'r', encoding='utf-8') as file:
    wiki_data = json.load(file)

# Display the structure of the first few entries to understand the data
first_few_entries = wiki_data[:3]
first_few_entries

[[['? (film)'],
  '? (also written Tanda Tanya, meaning Question Mark) is a 2011 Indonesian drama film directed by Hanung Bramantyo. It stars Revalina Sayuthi Temat, Reza Rahadian, Agus Kuncoro, Endhita, Rio Dewanto, and Hengky Sulaeman. The film focuses around Indonesia\'s religious pluralism, which often results in conflict between different beliefs, represented in a plot that revolves around the interactions of three families, one Buddhist, one Muslim, and one Catholic. After undergoing numerous hardships and the deaths of several family members in religious violence, they are reconciled.\nBased on Bramantyo\'s experiences as a mixed-race child, ? was meant to counter the portrayal of Islam as a "radical religion". Owing to the film\'s theme of religious pluralism and controversial subject matter, Bramantyo had difficulty finding backing. Eventually, Mahaka Pictures put forth Rp 5 billion ($600,000) to fund the production. Filming began on 5 January 2011 in Semarang.\nReleased on 7 

In [None]:
# !pip install sentence-transformers

In [3]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load preprocessed text data (one document per line)
with open('preprocessed_data.txt', 'r', encoding='utf-8') as file:
    documents = file.read().splitlines()

# Generate embeddings
embeddings = model.encode(documents, show_progress_bar=True)

print("Embeddings generated for the text data.")

Batches:   0%|          | 0/201 [00:00<?, ?it/s]

Embeddings generated for the text data.
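
Encoding 6428 documents takes 201 batches, so it can be worth caching the result between sessions. The following is a minimal sketch, not part of the original notebook: `load_or_compute`, `fake_encode`, and the `embeddings.npy` file name are hypothetical, and a stand-in function replaces `model.encode` so the example runs without downloading a model.

```python
import os
import tempfile

import numpy as np


def load_or_compute(cache_path, compute_fn):
    """Load an array from cache_path if it exists; otherwise compute and save it."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    array = np.asarray(compute_fn())
    np.save(cache_path, array)
    return array


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "embeddings.npy")  # hypothetical file name
    calls = []

    def fake_encode():
        # Stand-in for model.encode(documents); records each invocation
        calls.append(1)
        return np.ones((3, 4), dtype=np.float32)

    first = load_or_compute(cache_path, fake_encode)   # computes and writes the cache
    second = load_or_compute(cache_path, fake_encode)  # reads the cache instead
    print(first.shape, len(calls))
```

In the notebook itself, `compute_fn` could be something like `lambda: model.encode(documents, show_progress_bar=True)`, assuming `documents` has not changed since the cache was written.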


In [4]:
print(f"Total documents loaded: {len(documents)}")
print("\nFirst few documents:")
for doc in documents[:5]:
    print(doc)
    print("---")

Total documents loaded: 6428

First few documents:
? (film) also written tanda tanya meaning question mark 2011 indonesian drama film directed hanung bramantyo star revalina sayuthi temat reza rahadian agus kuncoro endhita rio dewanto hengky sulaeman film focus around indonesia religious pluralism often result conflict different belief represented plot revolves around interaction three family one buddhist one muslim one catholic undergoing numerous hardship death several family member religious violence reconciled based bramantyos experience mixedrace child meant counter portrayal islam radical religion owing film theme religious pluralism controversial subject matter bramantyo difficulty finding backing eventually mahaka picture put forth rp 5 billion 600000 fund production filming began 5 january 2011 semarang released 7 april 2011 critical commercial success received favourable review viewed 550000 people screened internationally nominated nine citra award 2011 indonesian film festi

In [None]:
# Concatenating title and text for each document
# documents = [entry[0][0] + " " + entry[1] for entry in wiki_data]
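
Each entry in `wiki_data` has the shape `[[title], text]`, which is why the commented line above uses `entry[0][0]` for the title and `entry[1]` for the body. The sketch below shows one way a title could be joined to a lightly cleaned body. It is an assumption about the preprocessing, not the method actually used to build `preprocessed_data.txt`: `build_document`, `STOPWORDS`, and `sample_entry` are hypothetical, the stopword list is illustrative only, and no lemmatization is applied.

```python
import re

# Tiny illustrative stopword list (an assumption, not the real one)
STOPWORDS = {"a", "an", "the", "is", "by", "of", "in", "and"}


def build_document(entry):
    """Join the title with a lowercased, punctuation-free, stopword-filtered body."""
    title, text = entry[0][0], entry[1]
    # Drop every character that is neither a word character nor whitespace
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    tokens = [token for token in cleaned.split() if token not in STOPWORDS]
    return title + " " + " ".join(tokens)


sample_entry = [["? (film)"], "? is a 2011 Indonesian drama film directed by Hanung Bramantyo."]
print(build_document(sample_entry))
# ? (film) 2011 indonesian drama film directed hanung bramantyo
```

The title is left untouched here, matching the loaded documents, where the title keeps its original casing and the body is lowercased.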

In [5]:
import numpy as np

# Cosine Similarity Function
def cosine_similarity(vec_a, vec_b):
    """Calculate the cosine similarity between two vectors."""
    cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return cos_sim

# Search Function
def search(query, model, documents, embeddings, top_n=5):
    """
    Search the documents for the given query.

    Parameters:
    - query: The search query string.
    - model: The sentence transformer model for embedding.
    - documents: A list of documents.
    - embeddings: The precomputed embeddings for the documents.
    - top_n: Number of top results to return.

    Returns:
    - A list of tuples (document, score) sorted by relevance to the query.
    """
    # Generate the query embedding
    query_embedding = model.encode([query])[0]
    
    # Calculate similarities with all document embeddings
    similarities = np.array([cosine_similarity(query_embedding, doc_embedding) for doc_embedding in embeddings])
    
    # Get the top N most similar document indices
    top_indices = np.argsort(similarities)[::-1][:top_n]
    
    # Retrieve the top N most similar documents and their scores
    top_documents_scores = [(documents[idx], similarities[idx]) for idx in top_indices]
    
    return top_documents_scores
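
The `search` function above calls `cosine_similarity` once per document in a Python loop. The same scores can be computed with a single matrix-vector product, which tends to be faster on larger corpora. This is a sketch under stated assumptions rather than a drop-in replacement: `top_n_cosine`, `toy_docs`, and `toy_query` are hypothetical names, and small hand-made vectors stand in for real embeddings.

```python
import numpy as np


def top_n_cosine(query_vec, doc_matrix, top_n=5):
    """Return (indices, scores) for the top_n rows of doc_matrix closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero vectors
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # One matrix-vector product yields every dot product at once
    scores = doc_matrix @ query_vec / (doc_norms * query_norm + eps)
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]


toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])
indices, scores = top_n_cosine(toy_query, toy_docs, top_n=2)
print(indices, scores)
```

With the notebook's variables, the call would look like `top_n_cosine(model.encode([query])[0], embeddings, top_n)`, and the returned indices can be used to look up entries in `documents`.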

In [6]:
# Example search query
query = "55 wall st"

top_n = 5  # Number of results to return
top_results = search(query, model, documents, embeddings, top_n)

for doc, score in top_results:
    print(f"Document: {doc}, Score: {score:.4f}")

Document: 55 Wall Street 55 wall street formerly national city bank building eightstory building wall street william hanover street financial district lower manhattan new york city new york united state lowest three story completed either 1841 1842 fourstory merchant exchange designed isaiah rogers greek revival style 1907 1910 mckim mead white removed original fourth story added five floor create present building facade part interior new york city designated landmark building listed new york state register historic place national register historic place nrhp national historic landmark also contributing property wall street historic district listed nrhp 55 wall street granite facade includes two stacked colonnade facing wall street twelve column inside cruciform banking hall 60foot 18 vaulted ceiling corinthian column marble floor wall entablature around interior banking hall among largest united state completed office citibanks predecessor national city bank corner banking hall fourth

In [7]:
# Example search query
query = "fountain of time"

top_n = 5  # Number of results to return
top_results = search(query, model, documents, embeddings, top_n)

for doc, score in top_results:
    print(f"Document: {doc}, Score: {score:.4f}")

Document: Fountain of Time fountain time simply time sculpture lorado taft measuring 126 foot 10 inch 3866 length situated western edge midway plaisance within washington park chicago illinois united state sculpture inspired henry austin dobson poem paradox time 100 figure passing father time created monument 100 year peace united state united kingdom following treaty ghent 1814 father time face 100 across water basin fountain water turned 1920 sculpture dedicated 1922 contributing structure washington park united state registered historic district national register historic place listing part larger beautification plan midway plaisance time constructed new type molded steelreinforced concrete claimed durable cheaper alternative said first kind finished work art made concrete completion millennium park 2004 considered important art installation chicago park district time one several chicago work art funded benjamin fergusons trust fund time undergone several restoration deterioration d

In [8]:
# Example search query
query = "easter brown snake"

top_n = 5  # Number of results to return
top_results = search(query, model, documents, embeddings, top_n)

for doc, score in top_results:
    print(f"Document: {doc}, Score: {score:.4f}")

Document: Eastern brown snake eastern brown snake pseudonaja textilis often referred common brown snake specie extremely venomous snake family elapidae specie native eastern central australia southern new guinea first described andr marie constant dumril gabriel bibron auguste dumril 1854 adult eastern brown snake slender build grow 2 7 ft length colour surface range pale brown black underside pale creamyellow often orange grey splotch eastern brown snake found habitat except dense forest often farmland outskirt urban area place populated main prey house mouse specie oviparous international union conservation nature classifies snake leastconcern specie though status new guinea unclear considered world secondmost venomous land snake inland taipan oxyuranus microlepidotus based ld50 value subcutaneous mouse main effect venom circulatory systemcoagulopathy haemorrhage bleeding cardiovascular collapse cardiac arrest one main component venom prothrombinase complex pseutarinc break prothromb

In [9]:
# Example search query
query = "super mario"

top_n = 5  # Number of results to return
top_results = search(query, model, documents, embeddings, top_n)

for doc, score in top_results:
    print(f"Document: {doc}, Score: {score:.4f}")

Document: New Super Mario Bros. new super mario bros 2006 platform video game developed published nintendo nintendo d first released may 2006 north america japan pal region june 2006 first installment new super mario bros subseries super mario franchise follows mario fight way bowsers henchman rescue princess peach mario access several old new powerups help complete quest including super mushroom fire flower super star giving unique ability traveling eight world 80 level mario must defeat bowser jr bowser saving princess peach new super mario bros commercially critically successful praise went towards game improvement introduction made mario franchise faithfulness older mario game criticism targeted low difficulty level lingering similarity previous game called one best game available nintendo d several critic calling one best sidescrolling super mario title sold 30 million copy worldwide making bestselling game nintendo d one bestselling video game time game success led line sequel re