<h2>Information Retrieval Assignment 1</h2>
<h4>Group ID: 26</h4>
<h4>Group Members Name with Student ID:</h4>
<h4>1. KARTHIKEYAN J - 2024AA05372</h4>
<h4>2. JANGALE SAVEDANA SUBHASH PRATIBHA - 2024AA05187</h4>
<h4>3. GANAPATHY SUBRAMANIAN S - 2024AA05188</h4>
<h4>4. ANANDAN A - 2024AA05269</h4>

<h3>Problem Statement</h3>
<h4>Designing a Text Search and Query Correction System using Levenshtein Edit Distance algorithm for Medical Documents</h4`>

# 1. Import and download the required libraries

In [40]:
import re
import os
from collections import defaultdict
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import PyPDF2
import pandas as pd
from docx import Document  # Must be imported!
import os

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saved\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saved\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\saved\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Global variables and NLP setup

In [42]:
# Global variables
inverted_index = defaultdict(set)
all_terms = set()
documents = []
doc_metadata = []

# NLP setup
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

## 1. Data Preprocessing
### a) User defined functions to remove all punctuation, numbers, and special characters from the dataset. 
### b) Apply lemmatization techniques to convert words to their base or root forms.

In [44]:
def preprocess_text(text):
    """Full preprocessing with intermediate steps"""
    print("\n=== ORIGINAL TEXT (SAMPLE) ===")
    print(text[:200] + "...\n" if len(text) > 200 else text)
    
    # 1. Clean text
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    print("=== AFTER CLEANING ===")
    print(cleaned[:200] + "...\n" if len(cleaned) > 200 else cleaned)
    
    # 2. Tokenization
    tokens = word_tokenize(cleaned)
    print(f"TOKENS ({len(tokens)}):", tokens[:30], "...\n")
    
    # 3. Stopword removal
    filtered = [w for w in tokens if w not in stop_words and len(w) > 2]
    print(f"AFTER STOPWORD REMOVAL ({len(filtered)}):", filtered[:30], "...\n")
    
    # 4. Lemmatization
    lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
    print(f"FINAL PROCESSED TERMS ({len(lemmatized)}):", lemmatized[:30], "...")
    
    return lemmatized

### User defined functions to read different types of files from a directory.

In [46]:
def read_txt(file_path):
    """Read text file"""
    encodings = ['utf-8', 'latin-1', 'windows-1252']
    for encoding in encodings:
        with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        
def read_pdf(file_path):
    """Read PDF file"""
    text = ""
    with open(file_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                text += page.extract_text()
    return text

def read_csv(file_path):
    """Read CSV file"""
    encodings = ['utf-8', 'latin-1', 'windows-1252']
    for encoding in encodings:
       df = pd.read_csv(file_path, encoding=encoding)
       return ' '.join(df.select_dtypes(include=['object']).astype(str).values.flatten())

def read_excel(file_path):
    """Read Excel file"""
    df = pd.read_excel(file_path)
    return ' '.join(df.select_dtypes(include=['object']).astype(str).values.flatten())

def read_docx(file_path):
    """Read Word DOCX file"""
    doc = Document(file_path)
    return '\n'.join([para.text for para in doc.paragraphs])


### a) Load documents from the directory provided.
### b) Preprocess each document.
### c) create non-positional inverted index. Will display terms, unique terms and sample terms from each document

In [48]:
def load_documents(directory):
    """Load documents from directory and build index"""
    global documents, doc_metadata, inverted_index, all_terms
    document_metadata = []
    
    if not os.path.exists(directory):
        raise FileNotFoundError(f"Directory not found: {directory}")
    
    print(f"Loading documents from: {directory}")
    
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            try:
                if file.endswith('.txt'):
                    text = read_txt(file_path)
                elif file.endswith('.pdf'):
                    text = read_pdf(file_path)
                elif file.endswith('.csv'):
                    text = read_csv(file_path)
                elif file.endswith(('.xls', '.xlsx')):
                    text = read_excel(file_path)
                elif file.endswith('.docx'):
                    text = read_docx(file_path)
                else:
                    continue
                
                if text.strip():
                    doc_id = len(documents)
                    documents.append(text)
                    doc_metadata.append({
                        'file_name': file,
                        'file_path': file_path
                    })

                    print(f"\n===============Loading: {file}=====================================")
                    # Add to index
                    terms = preprocess_text(text)  # preprocessing each 
                    for term in terms:
                        inverted_index[term].add(doc_id) # inverted index creation
                        all_terms.add(term)

                     # Store metadata - PROPERLY INDENTED
                    document_metadata.append({
                        'doc_id': doc_id,
                        'filename': file,
                        'filetype': os.path.splitext(file)[1],
                        'terms': len(terms),
                        'unique_terms': len(set(terms))
                    })
            
                    # Display file processing info
                    print(f"\n📄 {file} ({document_metadata[-1]['filetype']})")
                    print(f"  - Terms: {document_metadata[-1]['terms']}")
                    print(f"  - Unique terms: {document_metadata[-1]['unique_terms']}")
                    print(f"  - Sample terms: {list(set(terms))[:5]}...")
                       
                    print(f"Loaded: {file}")
            except Exception as e:
                print(f"Error processing {file}: {str(e)}")

    print(f"\nTOTAL SUMMARY")
    print(f"\nTotal documents loaded: {len(documents)}")
    print(f"Unique terms in index: {len(all_terms)}")

    # Show most frequent terms
    top_terms = sorted(inverted_index.items(), 
                      key=lambda x: len(x[1]), 
                      reverse=True)[:5]
    print("\nTop 5 terms:")
    for term, doc_ids in top_terms:
        print(f"  {term}: appears in {len(doc_ids)} documents")

    return inverted_index, document_metadata

### a) Levenshtein distance logic
### b) Suggest terms for misspelled search strings

In [50]:
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    
    if len(s2) == 0:
        return len(s1)
    
    prev_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        curr_row = [i + 1]
        for j, c2 in enumerate(s2):
            inserts = prev_row[j + 1] + 1
            deletes = curr_row[j] + 1
            substitute = prev_row[j] + (c1 != c2)
            curr_row.append(min(inserts, deletes, substitute))
        prev_row = curr_row
    
    return prev_row[-1]

def suggest_terms(misspelled_word, inverted_index, max_suggestions=5):
    # First check for quick matches with common errors
    suggestions = []
    
    # Calculate distances to all terms in our vocabulary
    distances = []
    for correct_word in inverted_index.keys():
        distance = levenshtein(misspelled_word.lower(), correct_word.lower())
        distances.append((correct_word, distance))
    
    # Sort by distance (closest first)
    distances.sort(key=lambda x: x[1])
    
    # Get the top N suggestions with smallest distance
    closest_matches = [word for word, dist in distances[:max_suggestions]]
    
    return closest_matches

### a) Wildcard Search
### b) Regular search

In [52]:
# Search Functions
def wildcard_search(query, inverted_index):
    if not query.endswith('*'):
        return []
    prefix = query[:-1].lower()
    return sorted([term for term in inverted_index.keys() 
                 if term.startswith(prefix)])

def regular_search(query, inverted_index, doc_metadata):
    terms = preprocess_text(query)
    if not terms:
        return []
    
    # Find documents containing ALL terms (AND logic)
    matching_docs = None
    for term in terms:
        if term in inverted_index:
            if matching_docs is None:
                matching_docs = set(inverted_index[term])
            else:
                matching_docs.intersection_update(inverted_index[term])
        else:
            return []  # If any term doesn't exist, return nothing
    
    return list(matching_docs) if matching_docs else []

def search(query, inverted_index, doc_metadata):
    if query.endswith('*'):
        terms = wildcard_search(query, inverted_index)
        return {
            'type': 'wildcard',
            'query': query,
            'count': len(terms),
            'results': terms
        }
    else:
        doc_ids = regular_search(query, inverted_index, doc_metadata)
        results = []
        for doc_id in doc_ids:
            doc = doc_metadata[doc_id]
            #preview = doc['content'][:100] + '...' if len(doc['content']) > 100 else doc['content']
            results.append({
                'doc_id': doc_id,
                'filename': doc['filename'],
                #'preview': preview
            })
        return {
            'type': 'regular', 
            'query': query,
            'count': len(results),
            'results': results
        }

## Main Function to call all the user defined functions

In [54]:
def main():
    """Run the search system"""
    print("Medical Document Search System\n")
    
    # Load documents
    directory = "D:/AIML/IR/Assignment/medical_documents/"
    print(directory)
    
    path = os.path.abspath(directory)
    try:
        inverted_index, document_metadata  =  load_documents(path) #load all the document from directory

        print("\n***************TESTING SEARCHES**********************")
         
        while True:
             print("\nSearch options:")
             print("- Regular search: 'eg: diabetes'")
             print("- Wildcard search: 'eg: cardio*'")
             print("- Mispelled word: 'eg: cardeo'")
             print("Type 'exit' to quit\n")
             query = input("\nEnter Search term: ").strip()
    
             results = search(query, inverted_index, document_metadata)
    
             if results['type'] == 'wildcard':
                print(f"\nFound {results['count']} matching terms:")
                for term in results['results'][:20]:  # Show first 20 matches
                    print(f"- {term} (in {len(inverted_index[term])} documents)")
                if results['count'] > 20:
                    print(f"... plus {results['count']-20} more terms")
             else:
                if results['results']: 
                    print(f"\nFound {results['count']} documents:")
                    for doc in results['results']:
                        print(f"\n{doc['filename']}")
                        #print(doc['preview'])
                else:
                    print("\nNo direct matches found")

                print("\nDid you mean: ")
                suggestions = suggest_terms(query, inverted_index)
                print(f"'{query}': {suggestions}")
    
    
             if query.lower() == 'exit':
                break


    
    except Exception as e:
        print(f"Error: {str(e)}")
        return

## Main Function to call

In [56]:
if __name__ == "__main__":
    main()

Medical Document Search System

D:/AIML/IR/Assignment/medical_documents/
Loading documents from: D:\AIML\IR\Assignment\medical_documents


=== ORIGINAL TEXT (SAMPLE) ===
Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks
Pranav RajpurkarPRANAVSR @CS.STANFORD .EDU
Awni Y. HannunAWNI @CS.STANFORD .EDU
Masoumeh Haghpanahi MHAGHPANAHI @IRHYTHMTEC...

=== AFTER CLEANING ===
cardiologistlevel arrhythmia detection with convolutional neural networks
pranav rajpurkarpranavsr csstanford edu
awni y hannunawni csstanford edu
masoumeh haghpanahi mhaghpanahi irhythmtech com
codie...

TOKENS (4517): ['cardiologistlevel', 'arrhythmia', 'detection', 'with', 'convolutional', 'neural', 'networks', 'pranav', 'rajpurkarpranavsr', 'csstanford', 'edu', 'awni', 'y', 'hannunawni', 'csstanford', 'edu', 'masoumeh', 'haghpanahi', 'mhaghpanahi', 'irhythmtech', 'com', 'codie', 'bourn', 'cbourn', 'irhythmtech', 'com', 'andrew', 'y', 'ng', 'ang'] ...

AFTER STOPWORD REMOVAL (2853): ['cardiol


Enter Search term:  cardiology



=== ORIGINAL TEXT (SAMPLE) ===
cardiology
=== AFTER CLEANING ===
cardiology
TOKENS (1): ['cardiology'] ...

AFTER STOPWORD REMOVAL (1): ['cardiology'] ...

FINAL PROCESSED TERMS (1): ['cardiology'] ...

Found 4 documents:

Cardio.pdf

mtsamples.csv

gender-differences-arteries.pdf

mtsamples.xlsx

Did you mean: 
'cardiology': ['cardiology', 'cardiolo', 'radiology', 'cardiologist', 'cardiol']

Search options:
- Regular search: 'eg: diabetes'
- Wildcard search: 'eg: cardio*'
- Mispelled word: 'eg: cardeo'
Type 'exit' to quit




Enter Search term:  cardio*



Found 12 matching terms:
- cardio (in 2 documents)
- cardiogenic (in 1 documents)
- cardiographic (in 1 documents)
- cardiol (in 1 documents)
- cardiolo (in 1 documents)
- cardiologist (in 2 documents)
- cardiologistlevel (in 1 documents)
- cardiology (in 4 documents)
- cardiopulmonary (in 4 documents)
- cardiovascular (in 9 documents)
- cardioversion (in 1 documents)
- cardioverter (in 1 documents)

Search options:
- Regular search: 'eg: diabetes'
- Wildcard search: 'eg: cardio*'
- Mispelled word: 'eg: cardeo'
Type 'exit' to quit




Enter Search term:  heart



=== ORIGINAL TEXT (SAMPLE) ===
heart
=== AFTER CLEANING ===
heart
TOKENS (1): ['heart'] ...

AFTER STOPWORD REMOVAL (1): ['heart'] ...

FINAL PROCESSED TERMS (1): ['heart'] ...

Found 9 documents:

Cardio.pdf

Cardiovascular  Pulmonary.txt

DataAnalyticsinhealthcare.pdf

gender-differences-arteries.pdf

Medical Specialty.txt

mtsamples.csv

mtsamples.xlsx

train_1.txt

Train_Data.txt

Did you mean: 
'heart': ['heart', 'healt', 'pert', 'wear', 'ear']

Search options:
- Regular search: 'eg: diabetes'
- Wildcard search: 'eg: cardio*'
- Mispelled word: 'eg: cardeo'
Type 'exit' to quit




Enter Search term:  exit



=== ORIGINAL TEXT (SAMPLE) ===
exit
=== AFTER CLEANING ===
exit
TOKENS (1): ['exit'] ...

AFTER STOPWORD REMOVAL (1): ['exit'] ...

FINAL PROCESSED TERMS (1): ['exit'] ...

Found 1 documents:

train_1.txt

Did you mean: 
'exit': ['exit', 'exist', 'exam', 'exact', 'eric']
