# **Simple Search Engine**

## **Working of Simple Search Engine**

* The Simple Search Engine operates on a straightforward keyword-based search methodology. It utilizes the PyPDF2 library in Python to extract text content from uploaded PDF files. Once a PDF is uploaded, the search engine reads through each page, preprocesses the text, and indexes it based on stemmed tokens.

* When a user submits a search query, the search engine identifies relevant documents by finding the intersection of all pages containing the queried keywords. This process involves preprocessing the query, tokenizing it into individual terms, and identifying the stemmed forms of these terms.

* The search engine then looks up each stemmed term in its index to retrieve a list of document IDs where the term appears. By taking the intersection of these sets of document IDs for all query terms, the engine determines the pages that contain all the keywords provided by the user.

* Finally, the search engine presents the text content of these identified pages to the user, allowing them to access the relevant information they are looking for. This simple yet effective approach provides users with a quick and intuitive way to search for specific information within PDF documents.

In [11]:
!pip install PyPDF2




In [12]:
import re
import nltk
import PyPDF2
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


class SimpleSearchEngine:
    def __init__(self):
        self.index = {}  #  Initialize an empty dictionary to store the index.
        self.documents = {}  # Store document text by document ID
        self.vectorizer = TfidfVectorizer()
        self.stemmer = PorterStemmer()

    def preprocess_text(self, text):  # Define a method to preprocess text by converting it to lowercase and removing non-alphanumeric characters
        text = text.lower()
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        text = re.sub(' +', ' ', text)
        return text

    def stem_tokens(self, tokens): # Define a method to stem tokens using the Porter Stemmer.
        return [self.stemmer.stem(token) for token in tokens]

    def index_pdf(self, pdf_file): # Define a method to index a PDF file by extracting text from each page.
        with open(pdf_file, 'rb') as file: # Open the PDF file
            reader = PyPDF2.PdfReader(file)   # Create a PDF reader object.

            for page_num in range(len(reader.pages)): # Iterate through each page of the PDF.
                page_text = reader.pages[page_num].extract_text() # Extract text from the current page.
                processed_text = self.preprocess_text(page_text) # Preprocess the extracted text.
                doc_id = f"{pdf_file}_page_{page_num}"  # Generate unique document ID
                self.documents[doc_id] = page_text  # Store the document text in self.document
                self.index_document(doc_id, processed_text) # Index the document by calling the index_document method.

        # Update TF-IDF vectorizer
        self.vectorizer.fit(list(self.documents.values()))

    def index_document(self, doc_id, text): # Define a method to index a document by splitting text into tokens, stemming them, and updating the index.
        tokens = text.split()
        stemmed_tokens = self.stem_tokens(tokens)
        for token in stemmed_tokens:
            if token not in self.index:
                self.index[token] = []
            self.index[token].append(doc_id)

    def get_document_text(self, doc_id):
        return self.documents.get(doc_id, None)

    def search(self, query): # Define method for Search query
        processed_query = self.preprocess_text(query)
        query_vector = self.vectorizer.transform([processed_query])
        query_tokens = processed_query.split()
        stemmed_query_tokens = self.stem_tokens(query_tokens)
        result_set = set()
        for token in stemmed_query_tokens:
            if token in self.index:
                if not result_set:
                    result_set.update(self.index[token])
                else:
                    result_set.intersection_update(self.index[token])
        return [self.get_document_text(doc_id) for doc_id in result_set]




In [13]:
# Passing Necessary Parameters

search_engine = SimpleSearchEngine() # calling simplesearchengine
pdf_file = '/content/Projects.pdf' # passing our PDF
search_engine.index_pdf(pdf_file) # passing it to index_pdf to get indexed suitable for searching.

In [14]:
# Querying in Simple Search Engine

query = "python" # Passing Query
results = search_engine.search(query)

for result in results: # For Printing all pages found by the search result
    print("Search result:")
    print(result)

# As python in present in project 1,2,3 and 4 it returns all the pages

Search result:
4 | P a g e  
 
Project 4(Four)  
Simple Search Engine  
 
• The Simple Search Engine operates on a straightforward keyword -based 
search methodology. It utilizes the PyPDF2 library in Python to extract text 
content from uploaded PDF files. Once a PDF is uploaded, the search engine 
reads through each page, preprocess es the text, and indexes it based on 
stemmed tokens.  
• When a user submits a search query, the search engine identifies relevant 
documents by finding the intersection of all pages containing the queried 
keywords. This process involves preprocessing the query, tokenizing it into 
individual terms, and identifyin g the stemmed forms of these terms.  
• The search engine then looks up each stemmed term in its index to retrieve a 
list of document IDs where the term appears. By taking the intersection of 
these sets of document IDs for all query terms, the engine determines the 
pages that contain all the key words provided by the user.  
• Finally, the 

In [15]:
query = "Samsung Vs Apple Comparison Module" # Passing Query
results = search_engine.search(query)

for result in results: # for prints all the pages found by result running loop to fetch all the found page
    print("Search result:")
    print(result)

# Here, the search query is very specific, focusing on "Samsung Vs Apple Comparison Module".
# The search engine returns only the page where all keywords are found.

Search result:
2 | P a g e  
 
Project 2(Two)  
Samsung Vs Apple Comparison Module  
Overview:  
The Apple vs Samsung Comparison Module provides a comprehensive analysis of the performance 
and financial metrics of Apple Inc. and Samsung Electronics Co., Ltd. Users can compare various 
aspects such as revenue, net income, market capitalization, stock pr ices, and other key metrics to 
gain insights into the competitive landscape between these two technology giants.  
Features:  
1. Revenue Analysis:  Compare the annual revenue trends of Apple and Samsung over the 
years.  
2. Net Income Comparison:  Analyze the net income of Apple and Samsung to understand their 
profitability.  
3. Market Capitalization Trends:  Explore the market capitalization trends of both companies 
and how they have evolved over time.  
4. Stock Price Comparison:  Visualize the stock price movements of Apple and Samsung and 
identify any significant trends.  
5. Dashboard Visualization:  Present all the comparat