**Multi-Label Document Classification with Relevance Feedback for Query-Specific Retrieval**

---



This Information Retrieval (IR) project builds a small document retrieval system around four IR concepts: document representation, indexing, query processing, and relevance feedback. The system accepts documents in multiple formats (.txt, .pdf, .docx), preprocesses the text, extracts TF-IDF features, and applies multi-label classification so that a document can be assigned to several topics at once. At retrieval time, the user's query is transformed into a TF-IDF vector and ranked against the document index by cosine similarity. A relevance feedback step based on the Rocchio algorithm then refines the query using documents the user marks as relevant or non-relevant, with the aim of improving precision and recall in subsequent searches. Overall, the project demonstrates how IR techniques can be combined with machine learning to create an adaptive, user-centered document retrieval system.

In [None]:
!pip install pdfplumber python-docx scikit-learn pandas nltk



In [None]:
from google.colab import files
uploaded_files = files.upload()

Saving sample1.txt to sample1 (1).txt
Saving sample2.txt to sample2 (1).txt
Saving sample3.txt to sample3 (1).txt
Saving sample4.docx to sample4 (1).docx
Saving sample5.pdf to sample5 (1).pdf


In [None]:
import pdfplumber
from docx import Document
import os

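def extract_text(file_path):
    """Return the plain text of a .txt, .docx, or .pdf file ("" if unsupported)."""
    # Compare against a lowercased name so an extension like ".PDF" is handled too.
    name = file_path.lower()

    if name.endswith(".txt"):
        # The context manager closes the file handle once reading is done.
        with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
            return f.read()

    elif name.endswith(".docx"):
        doc = Document(file_path)
        return "\n".join(p.text for p in doc.paragraphs)

    elif name.endswith(".pdf"):
        # extract_text() may return None for a page without a text layer.
        pages = []
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                pages.append(page.extract_text() or "")
        # Join with newlines so words at page boundaries are not fused together.
        return "\n".join(pages)

    else:
        return ""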

In [None]:
documents = []

for fname in uploaded_files.keys():
    text = extract_text(fname)
    documents.append({"id": len(documents)+1, "filename": fname, "text": text})

len(documents), documents[:2]

(5,
 [{'id': 1,
   'filename': 'sample1 (1).txt',
   'text': 'The government introduced a new education reform bill that changes the structure of primary school examinations.\nThe bill is expected to bring transparency into the school system.\nLegal experts stated that the law focuses on improving teacher accountability.\n'},
  {'id': 2,
   'filename': 'sample2 (1).txt',
   'text': 'The stock market showed rapid growth in the technology sector during the third quarter.\nInvestors are focusing heavily on artificial intelligence and software startups.\nFinancial analysts predict a strong market recovery next year.\n'}])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [d["text"] for d in documents]

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
doc_vectors = vectorizer.fit_transform(corpus)


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression


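# Weak labels: tag each document with every topic whose keywords occur in its text.
# These are plain substring checks, so "law" also matches "laws" and "lawyer".
labels = []
for d in documents:
    t = d["text"].lower()
    doc_labels = []
    if "law" in t or "court" in t:
        doc_labels.append("law")
    if "education" in t or "school" in t:
        doc_labels.append("education")
    if "finance" in t or "market" in t:
        doc_labels.append("finance")
    if "health" in t or "medical" in t:
        doc_labels.append("health")
    if "technology" in t or "software" in t:
        doc_labels.append("technology")
    labels.append(doc_labels)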

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

clf = OneVsRestClassifier(LogisticRegression(max_iter=2000))
clf.fit(doc_vectors, Y)

print("Labels learned:", mlb.classes_)


Labels learned: ['education' 'finance' 'health' 'law' 'technology']
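
The keyword rules in the labeling cell use substring tests, which also fire inside unrelated words (for example, "law" inside "flaw" or "lawn"). One possible alternative, sketched here under the assumption that whole-word matching is wanted, tokenizes the text first. The `KEYWORDS` table mirrors the keywords used in the notebook, while the helper name `weak_labels` is hypothetical.

```python
import re

# Keyword lists copied from the labeling cell; the dict layout is an assumption.
KEYWORDS = {
    "law": ["law", "court"],
    "education": ["education", "school"],
    "finance": ["finance", "market"],
    "health": ["health", "medical"],
    "technology": ["technology", "software"],
}

def weak_labels(text):
    """Return every label whose keywords appear as whole words in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [label for label, words in KEYWORDS.items() if tokens & set(words)]

print(weak_labels("The court reviewed the school policy"))
print(weak_labels("A flaw in the lawn mower"))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "laws" or "schools", so the keyword lists would need those variants added (or a stemming step) to keep the same coverage.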


In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search(query, top_k=5):
    """Rank documents by cosine similarity to `query` and attach predicted labels."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(doc_vectors, q_vec).reshape(-1)
    sorted_idx = np.argsort(scores)[::-1]  # highest score first

    results = []

    for idx in sorted_idx[:top_k]:
        # Predict the label indicator row for this document,
        # then map it back to label names.
        pred = clf.predict(doc_vectors[idx])
        pred = np.array(pred).reshape(1, -1)
        pred_labels = mlb.inverse_transform(pred)[0]

        results.append({
            "doc_id": documents[idx]["id"],
            "filename": documents[idx]["filename"],
            "score": float(scores[idx]),
            "labels": pred_labels,
            "text_snippet": documents[idx]["text"][:200] + "..."
        })
    return results


In [None]:
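def rocchio(query, relevant_ids, non_relevant_ids, alpha=1.0, beta=0.75, gamma=0.15):
    """Apply Rocchio feedback to `query` and return the top-weighted terms as a new query string."""
    q_vec = vectorizer.transform([query]).toarray()[0]

    # Document ids are 1-based, while rows of doc_vectors are 0-based.
    # Each feedback term is skipped when its id list is empty, because the
    # mean of an empty array is NaN and would poison the whole vector.
    new_q = alpha * q_vec
    if relevant_ids:
        rel_vecs = doc_vectors[[i-1 for i in relevant_ids]].toarray()
        new_q = new_q + beta * rel_vecs.mean(axis=0)
    if non_relevant_ids:
        non_rel_vecs = doc_vectors[[i-1 for i in non_relevant_ids]].toarray()
        new_q = new_q - gamma * non_rel_vecs.mean(axis=0)

    # Convert the vector to keywords: take the 10 highest-weighted terms
    # (in ascending order of weight) and drop any whose weight is not positive.
    terms = vectorizer.get_feature_names_out()
    top_terms = np.argsort(new_q)[-10:]
    expanded_query = " ".join(terms[i] for i in top_terms if new_q[i] > 0)

    return expanded_query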
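
The `rocchio` function above turns the refined vector back into a keyword string, which discards the term weights. A variant that keeps the weights is to rank documents with the refined vector directly. The sketch below is self-contained and uses a hypothetical three-document corpus; `toy_docs`, `rocchio_vector`, and `rank` are illustrative names, row indices are 0-based, and clipping negative weights to zero is a common convention assumed here rather than something the notebook does.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the uploaded documents.
toy_docs = [
    "education reform changes school examinations",
    "stock market growth technology sector",
    "school teachers welcome education funding",
]
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)

def rocchio_vector(query, relevant_rows, non_relevant_rows,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(non-relevant)."""
    new_q = alpha * toy_vec.transform([query]).toarray()[0]
    if len(relevant_rows) > 0:
        new_q = new_q + beta * toy_matrix[relevant_rows].toarray().mean(axis=0)
    if len(non_relevant_rows) > 0:
        new_q = new_q - gamma * toy_matrix[non_relevant_rows].toarray().mean(axis=0)
    # Assumption: negative term weights are clipped to zero.
    return np.maximum(new_q, 0.0)

def rank(query_vector):
    """Return row indices sorted by descending cosine similarity, plus the scores."""
    scores = cosine_similarity(toy_matrix, query_vector.reshape(1, -1)).ravel()
    return np.argsort(scores)[::-1], scores

refined = rocchio_vector("education reform", relevant_rows=[0, 2], non_relevant_rows=[1])
order, scores = rank(refined)
print(order, scores)
```

Because the refined vector lives in the same TF-IDF space as the documents, no second call to the vectorizer is needed, and terms that only received a small boost still contribute in proportion to their weight.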


In [None]:
query = "education law reform"
results = search(query)

for r in results:
    print(r, "\n")



{'doc_id': 1, 'filename': 'sample1 (1).txt', 'score': 0.3631337593145946, 'labels': ('education', 'law'), 'text_snippet': 'The government introduced a new education reform bill that changes the structure of primary school examinations.\nThe bill is expected to bring transparency into the school system.\nLegal experts stated...'} 

{'doc_id': 5, 'filename': 'sample5 (1).pdf', 'score': 0.0, 'labels': ('education', 'technology'), 'text_snippet': 'The use of machine learning in smart classrooms is increasing rapidly.\nEducational institutions are adopting technology for improved student engagement.\nExperts believe AI-based tools will revolutioni...'} 

{'doc_id': 4, 'filename': 'sample4 (1).docx', 'score': 0.0, 'labels': ('law',), 'text_snippet': 'The Supreme Court ruled in favor of a new amendment that strengthens consumer protection laws.\nThis ruling is expected to affect corporate policies nationwide....'} 

{'doc_id': 3, 'filename': 'sample3 (1).txt', 'score': 0.0, 'labels': (), 'tex

In [None]:
relevant = [1,3]
non_relevant = [2]

updated_query = rocchio(query, relevant, non_relevant)
print("Updated Query:", updated_query)


Updated Query: medical results reported responses vaccine new school education reform law
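
One way to check whether feedback changes the ranking in the intended direction is precision at k: the fraction of the top k results that were judged relevant. The sketch below uses a hypothetical helper, `precision_at_k`. The ranked ids are the first two `doc_id` values printed by the two `search` calls in this notebook, and the relevant set is the `relevant` list from the cell above. Since those same judgments were fed into `rocchio`, the comparison illustrates the metric and is not an independent evaluation.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked ids that appear in relevant_ids."""
    if k <= 0:
        return 0.0
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k

judged_relevant = {1, 3}   # from `relevant = [1,3]`
before_feedback = [1, 5]   # top two doc ids for the original query
after_feedback = [3, 1]    # top two doc ids for the updated query

print(precision_at_k(before_feedback, judged_relevant, 2))
print(precision_at_k(after_feedback, judged_relevant, 2))
```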


In [None]:
search(updated_query)


[{'doc_id': 3,
  'filename': 'sample3 (1).txt',
  'score': 0.3953706533332504,
  'labels': (),
  'text_snippet': 'A recent medical study shows positive results in the clinical trial for a new vaccine.\nDoctors reported improved immunity responses in all age groups.\nThe health department has approved further testin...'},
 {'doc_id': 1,
  'filename': 'sample1 (1).txt',
  'score': 0.3716227536304303,
  'labels': ('education', 'law'),
  'text_snippet': 'The government introduced a new education reform bill that changes the structure of primary school examinations.\nThe bill is expected to bring transparency into the school system.\nLegal experts stated...'},
 {'doc_id': 4,
  'filename': 'sample4 (1).docx',
  'score': 0.03755032425152954,
  'labels': ('law',),
  'text_snippet': 'The Supreme Court ruled in favor of a new amendment that strengthens consumer protection laws.\nThis ruling is expected to affect corporate policies nationwide....'},
 {'doc_id': 5,
  'filename': 'sample5 (1).pdf',