# Information Retrieval Project  
### Web Crawling, TF-IDF Indexing, Ranking, and Evaluation

This notebook documents the full pipeline for the IR project:
1. Web crawler using Scrapy  
2. HTML parsing  
3. TF-IDF index construction  
4. Query processing and ranking using cosine similarity  
5. Results generation  
6. Discussion and evaluation

The crawled dataset comes from:
**https://en.wikipedia.org/wiki/Machine_learning**

All code used for crawling, indexing, and query ranking is included in this project folder.
This project was created by Riddhi Das, with some logical guidance from ChatGPT. No external code repositories or third-party implementations were used; all components were developed specifically for this assignment.

## Imports

- Loads all required Python libraries.
- Enables HTML parsing, TF-IDF vectorization, cosine similarity, and data loading.
- Prepares the environment for the rest of the pipeline.

In [1]:
import os
import json
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

## List Crawled Documents

- Points the notebook to the folder containing crawled HTML files.
- Lists how many documents were collected.
- Confirms that the crawler successfully generated data for indexing.

In [2]:
html_dir = "../data/crawl_html"

files = os.listdir(html_dir)
len(files), files[:10]

(34,
 ['Autoencoder.html',
  'Liquid_state_machines.html',
  'IDSIA.html',
  'Deep_learning.html',
  'Long_short-term_memory.html',
  'Main_Page.html',
  'Isolation_forest.html',
  'Reservoir_computing.html',
  'Gated_recurrent_unit.html',
  'Julia_(programming_language).html'])

## Parse HTML Files

- Opens each HTML file and extracts clean plain text.
- Converts raw HTML pages into usable document content.
- Builds `docs` and `doc_ids` lists for TF-IDF processing.

In [3]:
docs = []
doc_ids = []

for f in os.listdir(html_dir):
    if f.endswith(".html"):
        path = os.path.join(html_dir, f)
        # Use a context manager so each file handle is closed after parsing.
        with open(path, encoding="utf-8", errors="ignore") as fh:
            text = BeautifulSoup(fh, "html.parser").get_text(" ", strip=True)
        docs.append(text)
        doc_ids.append(f)

len(docs)

34
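
`get_text` returns every text node on the page, which on Wikipedia can include the contents of `<script>` and `<style>` elements. The sketch below is not the notebook's parser: it uses only the standard library's `html.parser` as a stand-in to show one way of skipping those elements. The names `TextExtractor` and `html_to_text` and the sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, not one of the crawled files.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Deep <b>learning</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

Dropping script and style content keeps tokens such as CSS property names out of the TF-IDF vocabulary.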

## Build TF-IDF Matrix

- Converts the cleaned text documents into numerical TF-IDF vectors.
- Creates the document-term matrix that represents documents mathematically.
- This is the core representation used for ranking.

In [4]:
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)

tfidf_matrix.shape

(34, 20419)
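
To make the weights in this matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each document vector. Tokenization and stop-word removal are left out, and the toy documents and the helper name `tfidf_vectors` are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vectors, idf) using smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    # Smoothed IDF, matching scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf


# Toy corpus (hypothetical), already tokenized.
toy_docs = [
    "neural network training".split(),
    "neural network inference".split(),
    "reservoir computing".split(),
]
vectors, idf = tfidf_vectors(toy_docs)

# Terms that occur in fewer documents receive a larger IDF.
print(idf["reservoir"] > idf["neural"])
```

Because every row is scaled to unit length, the dot product of two rows is already their cosine similarity.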

## Build the Inverted Index

- Iterates through every term in the vocabulary.
- Records which documents contain the term and with what TF-IDF weight.
- Produces a classical IR inverted index (term → posting list).

In [5]:
vocab = vectorizer.get_feature_names_out()
index = {}

for term_idx, term in enumerate(vocab):
    column = tfidf_matrix[:, term_idx].toarray().ravel()
    postings = {}
    for i, score in enumerate(column):
        if score > 0:
            postings[doc_ids[i]] = float(score)
    index[term] = postings

len(index)

20419
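
The index has the shape `term -> {document_id: weight}`, so it can also be used for scoring directly, without the dense matrix. The following is a rough sketch of term-at-a-time scoring over such a structure. The postings and query weights are made-up toy values (only the file names are borrowed from the crawl listing), and `score_with_index` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Toy postings in the same shape as the notebook's index (weights are invented).
toy_index = {
    "neural": {"Deep_learning.html": 0.42, "Autoencoder.html": 0.31},
    "memory": {"Long_short-term_memory.html": 0.55, "Deep_learning.html": 0.10},
}


def score_with_index(index, query_weights):
    """Accumulate query-weight * document-weight for every posting of each query term."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Terms missing from the index contribute nothing.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A unit-length toy query vector (0.6^2 + 0.8^2 = 1).
ranking = score_with_index(toy_index, {"neural": 0.6, "memory": 0.8})
print([doc_id for doc_id, _ in ranking])
```

Only documents that contain at least one query term are touched, which is the main reason inverted indexes scale better than comparing the query against every document.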

## Save index.json

- Saves the inverted index to a JSON file.
- Allows external inspection or reuse of the index.
- Completes the indexing phase of the pipeline.

In [6]:
with open("../data/index.json", "w") as f:
    json.dump(index, f)

"index.json saved."

'index.json saved.'
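
For reuse outside the notebook, the saved file can be read back with `json.load`. The sketch below checks the round trip on a toy index written to a temporary directory, so it does not touch `../data/index.json`. The toy posting is an invented value.

```python
import json
import os
import tempfile

# Toy index in the same shape as the notebook's (weight is invented).
toy_index = {"neural": {"Deep_learning.html": 0.42}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")

    # Write the index, then read it back from disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_index, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == toy_index)
```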

## Load Queries

- Loads the input queries from the queries.csv file.
- Prepares the notebook to generate ranked results for each query.
- Ensures the retrieval stage has valid inputs.

In [7]:
queries_df = pd.read_csv("../data/queries.csv")
queries_df

|   | query_id | query_text |
|---|---|---|
| 0 | 6E93CDD1-52F9-4F41-A405-54E398EF6FF8 | information overload |
| 1 | 0D97BCC6-C46E-4242-9777-7CEAED55B362 | database server hardware specs |
| 2 | 78452FF4-94D7-422C-9283-A14615C44ADC | search engine open sorce |


## Rank Documents Using Cosine Similarity

- Transforms each query into a TF-IDF vector.
- Computes cosine similarity between the query and all documents.
- Selects the top 3 most relevant documents for each query.
- Produces the main output of the ranking system.

In [8]:
results = []

for _, row in queries_df.iterrows():
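    # Access columns by name; integer keys on a labeled Series are deprecated in pandas.
    qid = row["query_id"]
    qtext = row["query_text"]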

    qvec = vectorizer.transform([qtext])
    sims = cosine_similarity(qvec, tfidf_matrix).flatten()
    ranked = sims.argsort()[::-1][:3]

    for rank, idx in enumerate(ranked, start=1):
        results.append([qid, rank, doc_ids[idx]])

results_df = pd.DataFrame(results, columns=["query_id", "rank", "document_id"])
results_df


|   | query_id | rank | document_id |
|---|---|---|---|
| 0 | 6E93CDD1-52F9-4F41-A405-54E398EF6FF8 | 1 | Outline_of_computer_science.html |
| 1 | 6E93CDD1-52F9-4F41-A405-54E398EF6FF8 | 2 | John_Preskill.html |
| 2 | 6E93CDD1-52F9-4F41-A405-54E398EF6FF8 | 3 | Reservoir_computing.html |
| 3 | 0D97BCC6-C46E-4242-9777-7CEAED55B362 | 1 | Outline_of_computer_science.html |
| 4 | 0D97BCC6-C46E-4242-9777-7CEAED55B362 | 2 | Machine_learning.html |
| 5 | 0D97BCC6-C46E-4242-9777-7CEAED55B362 | 3 | Wikimedia_Foundation.html |
| 6 | 78452FF4-94D7-422C-9283-A14615C44ADC | 1 | Wikimedia_Foundation.html |
| 7 | 78452FF4-94D7-422C-9283-A14615C44ADC | 2 | Julia_(programming_language).html |
| 8 | 78452FF4-94D7-422C-9283-A14615C44ADC | 3 | Felix_Gers.html |
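
The ranking step can be illustrated without scikit-learn. The sketch below computes cosine similarity over sparse dict vectors and keeps the top `k` documents. Unlike the notebook cell, which always returns three documents, this variant drops documents whose similarity is zero; that is a design choice of the sketch, not the notebook's behavior. The document IDs, vectors, and helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return up to k (doc_id, score) pairs with score > 0, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored = [(doc_id, s) for doc_id, s in scored if s > 0.0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; real ones would come from the TF-IDF matrix.
doc_vecs = {
    "A.html": {"search": 1.0, "engine": 1.0},
    "B.html": {"engine": 1.0, "learning": 1.0},
    "C.html": {"learning": 1.0},
}
query_vec = {"search": 1.0, "engine": 1.0}

ranked = top_k(query_vec, doc_vecs)
print([doc_id for doc_id, _ in ranked])
```

Filtering zero scores matters for queries whose terms barely occur in the corpus: otherwise the remaining slots are filled with documents that share nothing with the query.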


## Save results.csv

- Saves all ranked results into results.csv.
- Produces the final deliverable for the retrieval component.
- Ensures results can be submitted or evaluated externally.

In [9]:
results_df.to_csv("../data/results.csv", index=False)
"results.csv saved."

'results.csv saved.'

# Evaluation & Discussion

### Retrieval Quality
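TF-IDF with cosine similarity produced plausible rankings for this corpus. Broad queries retrieved overview articles: for example, `Outline_of_computer_science.html` ranked first for both "information overload" and "database server hardware specs". More specific queries matched pages that share their terminology.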
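
The notebook does not include relevance judgments, so ranking quality is assessed by inspection only. If judgments were available, a simple metric such as precision at k could quantify it. The sketch below assumes a hypothetical set of relevant document IDs per query; the IDs and the helper `precision_at_k` are placeholders, not results from this project.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=3):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_doc_ids)
    return hits / k


# Hypothetical ranking and relevance judgments for a single query.
ranked = ["doc_a.html", "doc_b.html", "doc_c.html"]
relevant = {"doc_a.html", "doc_d.html"}

p_at_3 = precision_at_k(ranked, relevant, k=3)
print(round(p_at_3, 3))
```

Averaging this value over all queries would give a single number to compare against alternative ranking methods.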

### Dataset Coverage
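The crawler gathered 34 Wikipedia pages, most of them related to machine learning, including neural networks, inference methods, and probabilistic models. The crawl also picked up a few general pages such as `Main_Page.html` and `Wikimedia_Foundation.html`. This provided good variety for ranking tests.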

### Observations
- TF-IDF is effective for lexical matching but does not capture deeper semantic meaning.
- Documents that share terminology naturally rank higher.
- The method is fast, transparent, and works well for classical IR tasks.

Overall, the system performs as expected and demonstrates a complete IR pipeline from crawling to ranking.
