
In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


In [2]:
import re

import requests
from bs4 import BeautifulSoup

def get_full_wikipedia_content(title):
    # Construct the URL for the full Wikipedia page
    url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"

    try:
        # Make the request; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the main content div
        content_div = soup.find(id="mw-content-text")
        if content_div is None:
            return "Error fetching page: main content not found"

        # Extract all paragraphs
        paragraphs = content_div.find_all('p')

        # Combine all paragraph texts
        full_text = '\n\n'.join(para.get_text() for para in paragraphs)

        # Remove numeric citations [1], [2], etc. (lettered notes like [b] remain)
        full_text = re.sub(r'\[\d+\]', '', full_text)

        return full_text.strip()

    except requests.RequestException as e:
        return f"Error fetching page: {str(e)}"

# First install beautifulsoup4 if you haven't:
# pip install beautifulsoup4

# Example usage
title = "Batman"
text = get_full_wikipedia_content(title)


In [3]:
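# Flatten the article into one string. Newlines are removed without inserting
# a space, so the end of one paragraph runs straight into the next.
text = text.replace('\n', '')

# Naive sentence split on "."; the periods themselves are dropped.
lis = text.split(".")

def combine_strings(original_list, chunk_size=3):
    # Join every `chunk_size` consecutive sentences into one chunk
    return [''.join(original_list[i:i + chunk_size])
            for i in range(0, len(original_list), chunk_size)]

# Group the sentences into chunks of three
combined = combine_strings(lis)

# Rebuild the text with a "\n\n\n" separator after every chunk
text = " "
for i in combined:
    text += i + "\n\n\n"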

In [4]:
documents = text.split("\n\n\n")

In [5]:
len(documents)

136
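
The chunking cell above drops the periods and joins paragraphs without a space, which shows up later in results such as "Frank MillerDC". A minimal sketch of an alternative is given here, assuming a regex-based sentence split is acceptable; `chunk_sentences` and `sample` are hypothetical names that are not part of the notebook, and the split heuristic will mis-handle abbreviations.

```python
import re

def chunk_sentences(text, chunk_size=3):
    """Group sentences into chunks of `chunk_size`, keeping punctuation."""
    # Collapse all whitespace (including newlines) to single spaces so that
    # adjacent paragraphs stay separated by a space.
    flat = re.sub(r'\s+', ' ', text).strip()
    # Split after '.', '!' or '?' when followed by whitespace. This is a rough
    # heuristic, not a real sentence tokenizer.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', flat) if s]
    return [' '.join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), chunk_size)]

sample = "First one. Second one.\n\nThird one. Fourth one."
chunks = chunk_sentences(sample, chunk_size=3)
print(chunks)
```

Because the function returns a list directly, there is no need to join the chunks with a separator and split them again, and no empty trailing document is produced.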

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Keep only ASCII letters and whitespace. This removes punctuation,
    # digits, and apostrophes, so "batman's" becomes "batmans".
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize on whitespace
    words = text.split()

    # Remove stopwords
    words = [w for w in words if w not in stop_words]

    # Lemmatize (WordNetLemmatizer treats every word as a noun by default)
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

preprocessed_docs = [preprocess_text(doc) for doc in documents]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [7]:
# Step 2: Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Step 3: Fit and transform the documents to generate the TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(preprocessed_docs)



In [8]:
# Get number of documents (rows)
n_docs = tfidf_matrix.shape[0]
print("Number of vectors", n_docs)

# Get number of terms/features (columns)
n_terms = tfidf_matrix.shape[1]
print("features of a vector", n_terms)

Number of vectors 136
features of a vector 2107


In [9]:
# Step 4: Define the search query
query = "Who created Batman comic"

# Step 5: Transform the query into a TF-IDF vector
query_vector = vectorizer.transform([query])


In [11]:

# Step 6: Compute cosine similarity between the query and the documents
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Step 7: Get the top 3 results based on similarity scores
top_n = 3
top_indices = np.argsort(cosine_similarities)[::-1][:top_n]

# Step 8: Print the results
print("Top 3 Search Results:")
for i, idx in enumerate(top_indices):
    print(f"{i + 1}. Document: {documents[idx]} (Score: {cosine_similarities[idx]:.4f})")
    print("================")

Top 3 Search Results:
1. Document:  Batman[b] is a superhero who appears in American comic books published by DC Comics Batman was created by the artist Bob Kane and writer Bill Finger, and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939 In the DC Universe, Batman is the alias of Bruce Wayne, a wealthy American playboy, philanthropist, and industrialist who resides in Gotham City (Score: 0.3357)
2. Document:  Various creators worked to return Batman to his darker roots in the 1970s and 1980s, culminating with the 1986 miniseries The Dark Knight Returns by Frank MillerDC has featured Batman in many comic books, including comics published under its imprints such as Vertigo and Black Label; he has been considered DC's flagship character since the 1990s The longest-running Batman comic, Detective Comics, is the longest-running comic book in the United States (Score: 0.2262)
3. Document: In early 1939, following the success of Superman, DC Comics' editors requ
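
`TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so for these vectors cosine similarity reduces to a plain dot product. The sketch below checks this on a toy corpus using scikit-learn's `linear_kernel`; the `toy_*` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus standing in for preprocessed_docs (assumption for illustration)
toy_docs = [
    "batman created bob kane bill finger",
    "batman gotham city",
    "robin tim drake",
]
toy_vectorizer = TfidfVectorizer()  # norm='l2' is the default
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
toy_query = toy_vectorizer.transform(["who created batman"])

# Cosine similarity divides by the vector norms; linear_kernel does not.
cos_scores = cosine_similarity(toy_query, toy_matrix).flatten()
dot_scores = linear_kernel(toy_query, toy_matrix).flatten()

print(np.allclose(cos_scores, dot_scores))
print(int(np.argmax(cos_scores)))
```

Either function gives the same ranking here; the equivalence would not hold if the vectorizer were created with `norm=None`.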

In [13]:
def retrieve(query, top_n=3):
    # Transform the query into a TF-IDF vector using the fitted vocabulary
    query_vector = vectorizer.transform([query])

    # Compute cosine similarity between the query and every document
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Indices of the top_n highest-scoring documents, best first
    top_indices = np.argsort(cosine_similarities)[::-1][:top_n]

    # Print the results
    print(f"Top {top_n} Search Results:")
    for i, idx in enumerate(top_indices):
        print(f"{i + 1}. Document: {documents[idx]} (Score: {cosine_similarities[idx]:.4f})")
        print("================")


In [14]:

query = "Who created Batman comic"
retrieve(query)

Top 3 Search Results:
1. Document:  Batman[b] is a superhero who appears in American comic books published by DC Comics Batman was created by the artist Bob Kane and writer Bill Finger, and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939 In the DC Universe, Batman is the alias of Bruce Wayne, a wealthy American playboy, philanthropist, and industrialist who resides in Gotham City (Score: 0.3357)
2. Document:  Various creators worked to return Batman to his darker roots in the 1970s and 1980s, culminating with the 1986 miniseries The Dark Knight Returns by Frank MillerDC has featured Batman in many comic books, including comics published under its imprints such as Vertigo and Black Label; he has been considered DC's flagship character since the 1990s The longest-running Batman comic, Detective Comics, is the longest-running comic book in the United States (Score: 0.2262)
3. Document: In early 1939, following the success of Superman, DC Comics' editors requ

In [16]:
query = "Batmans love life"

retrieve(query)

Top 3 Search Results:
1. Document:  Vicki's attempts to uncover Batman's true identity lead to a complicated romantic involvement that waxed and waned over the years, especially during the early 1980s when their relationship became more seriousTalia al Ghul, introduced in Detective Comics #411 (1971), is another key player in Batman's love life Their relationship is fraught with conflict due to her father, Ra's al Ghul, and his criminal ambitions (Score: 0.1903)
2. Document: Everybody loves to draw Batman, and everybody wants to put their own spin on it (Score: 0.1884)
3. Document:  Over the years, they have shared intense connections, often navigating the fine line between love and conflict Their relationship culminated in an engagement during the Rebirth eraAnother important figure is Vicki Vale, a journalist introduced in Batman #49 (1948) (Score: 0.1313)


In [19]:
query = "Batmans first love"

retrieve(query)

Top 3 Search Results:
1. Document:  The most prominent of these, Duke Thomas, later becomes Batman's crimefighting partner as The SignalBatman's romantic history spans decades, filled with relationships that reflect his struggle between personal happiness and his duty as Gotham's protector His first love interest was Julie Madison, introduced in Detective Comics #31 (1939) (Score: 0.2137)
2. Document: Everybody loves to draw Batman, and everybody wants to put their own spin on it (Score: 0.2030)
3. Document: The third Robin in the mainstream comics is Tim Drake, who first appeared in 1989 He went on to star in his own comic series, and goes by the name Red Robin, a variation on the traditional Robin persona In the first decade of the new millennium, Stephanie Brown served as the fourth in-universe Robin between stints as her self-made vigilante identity the Spoiler, and later as Batgirl (Score: 0.1429)
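
In the cells above, the documents pass through `preprocess_text` before vectorizing, but the queries go to `vectorizer.transform` as typed, so a query token may not match the form stored in the vocabulary. One possible variant applies the same normalization to both sides and returns the hits instead of printing them. In this sketch `normalize` is a simplified, hypothetical stand-in for `preprocess_text` (no stopword removal or lemmatization, to avoid the NLTK downloads), and `search`, `raw_docs`, and the `toy_*` names are likewise illustrative.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def normalize(text):
    # Hypothetical stand-in for preprocess_text: lowercase, letters only
    return re.sub(r'[^a-z\s]', '', text.lower())

def search(query, vectorizer, matrix, docs, preprocess=normalize, top_n=3):
    """Return (document, score) pairs, best first, skipping zero scores."""
    # Apply the same preprocessing to the query as to the documents
    query_vector = vectorizer.transform([preprocess(query)])
    scores = cosine_similarity(query_vector, matrix).flatten()
    order = np.argsort(scores)[::-1][:top_n]
    return [(docs[i], float(scores[i])) for i in order if scores[i] > 0]

raw_docs = [
    "Batman's allies include Robin.",
    "Gotham City is fictional.",
    "Robin is Tim Drake.",
]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform([normalize(d) for d in raw_docs])

hits = search("Batman's allies", toy_vectorizer, toy_matrix, raw_docs)
print(hits)
```

Returning the original, unprocessed text alongside the score keeps the results readable, and filtering out zero scores avoids listing chunks that share no term with the query.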
