# Baseline Model

## Overview 
The goal of this notebook is to define a simple baseline model for NeuroSift: an AI agent that extracts scientific methods information (e.g., number of EEG channels, programming tools used) from neuroscience papers. This baseline will help measure improvements made by more advanced models like Flan-T5.

## Table of Contents
1. [Model Choice](#model-choice)
2. [Feature Selection](#feature-selection)
3. [Implementation](#implementation)
4. [Evaluation](#evaluation)


In [None]:
# Import necessary libraries
import PyPDF2
import streamlit as st
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import tempfile


## Model Choice

A simple TF-IDF + cosine similarity model was chosen as the baseline. For each question (e.g., "How many EEG channels were used?"), the model returns the sentence in the PDF whose TF-IDF vector is most similar to that of the question.

This model requires no training and is easy to implement with scikit-learn. Because TF-IDF compares word overlap rather than meaning, it establishes a performance floor based on keyword matching alone, against which more powerful models like Flan-T5 or SciFive can be evaluated.
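To make the scoring step concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on three made-up sentences. It is a simplified stand-in for scikit-learn's `TfidfVectorizer`, not a reimplementation of it: the tokenizer is a plain whitespace split, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) and the sample sentences are assumptions introduced for illustration only.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, and strip simple punctuation (illustrative only).
    tokens = (w.strip(".,?!()").lower() for w in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up sentences standing in for text extracted from a paper.
sentences = [
    "EEG was recorded with 64 channels",
    "Sixty-four electrodes captured brain activity",
    "Data were analyzed in MATLAB",
]
question = "How many EEG channels were used?"

vectors = tfidf_vectors(sentences + [question])
scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
best = max(range(len(sentences)), key=scores.__getitem__)

print("Best match:", sentences[best])
for sentence, score in zip(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

The first sentence ranks highest because it shares "EEG" and "channels" with the question. The second sentence describes the same fact in different words and scores exactly 0, while the third earns a small score only because it shares the function word "were". Both effects are limitations that a stronger model would be expected to overcome.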


## Feature Selection

Input Features: Sentences extracted from PDF documents.

Target Query: A user question (e.g., “Which software was used?”).

Representation: TF-IDF vectors for both question and sentences.
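Since sentences are the input features, the way the text is split matters. The sketch below shows one possible splitter that breaks only after sentence-ending punctuation followed by whitespace and a capital letter, so decimals such as "0.5" stay intact. The function name `split_sentences`, the `min_chars` parameter, and the sample text are assumptions for illustration; abbreviations followed by a capitalized word (e.g. "et al. Smith") would still be split incorrectly.

```python
import re

# Break after ., ! or ? only when followed by whitespace and an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text, min_chars=10):
    """Split text into sentences, dropping fragments of min_chars characters or fewer."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    flat = re.sub(r"\s+", " ", text).strip()
    parts = _SENTENCE_BOUNDARY.split(flat)
    return [p.strip() for p in parts if len(p.strip()) > min_chars]

# Made-up sample containing decimals and a mid-sentence line break.
sample = "EEG was sampled at 0.5 kHz\nusing 64 channels. Analysis used MATLAB v9.4."
sentences = split_sentences(sample)

for s in sentences:
    print(repr(s))
```

On this sample the splitter yields two sentences, whereas a plain `text.split(".")` would also cut at "0.5" and "v9.4" and produce fragments.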


## Implementation

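The baseline extracts the text of a PDF, splits it into sentences, and returns the sentence whose TF-IDF vector has the highest cosine similarity to the question.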



In [None]:
from utils.config import dir_dataset

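def extract_text(pdf_file):
    """Return the text of every page in the PDF, joined by newlines."""
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Guard against pages that yield no text (e.g., scanned images)
    return "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

def get_best_match(text, question):
    """Return the sentence most similar to the question and its cosine score."""
    # Naive split on "." (this also breaks decimals such as "0.5"); drop short fragments
    sentences = [s.strip() for s in text.split(".") if len(s.strip()) > 10]
    if not sentences:
        raise ValueError("No sentences could be extracted from the text.")
    # Fit on the sentences plus the question so both share one vocabulary
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences + [question])
    # The last row is the question; compare it against every sentence row
    sims = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    best_index = sims.argmax()
    return sentences[best_index], sims[best_index]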

# Example usage:
with open("PMC9204106.pdf", "rb") as f: 
    paper_text = extract_text(f)

question = "How many EEG channels were used?"
answer, score = get_best_match(paper_text, question)

print("Question:", question)
print("Best Match:", answer)
print("Similarity Score:", score)


## Evaluation

[Clearly state what metrics you will use to evaluate the model's performance. These metrics will serve as a starting point for evaluating more complex models later on.]
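One candidate metric for a retrieval-style baseline like this is a top-1 hit rate: the fraction of questions for which the returned sentence contains a hand-labeled answer string. The sketch below is only an assumption about how such an evaluation could look, not the metric chosen for this project. The names `contains_answer`, `hit_rate`, and `toy_retrieve`, as well as the three toy examples, are hypothetical; `toy_retrieve` is a word-overlap stand-in so the block runs without a PDF.

```python
import re

def contains_answer(retrieved_sentence, gold_answer):
    # Case-insensitive substring check; note that "64" would also match "164".
    return gold_answer.lower() in retrieved_sentence.lower()

def hit_rate(examples, retrieve):
    """Fraction of (text, question, gold_answer) examples whose retrieved sentence contains the answer."""
    if not examples:
        return 0.0
    hits = sum(contains_answer(retrieve(text, q), gold) for text, q, gold in examples)
    return hits / len(examples)

def toy_retrieve(text, question):
    # Hypothetical stand-in retriever: pick the sentence sharing the most words with the question.
    q_words = set(re.findall(r"\w+", question.lower()))
    sentences = re.split(r"(?<=\.)\s+", text)
    return max(sentences, key=lambda s: len(q_words & set(re.findall(r"\w+", s.lower()))))

# Made-up examples: (paper text, question, hand-labeled answer string).
toy_examples = [
    ("EEG was recorded with 64 channels. Data were analyzed in MATLAB.",
     "How many EEG channels were recorded?", "64"),
    ("Sixty-four electrodes were applied. Analysis was run in Python.",
     "Which software was used?", "Python"),
    ("Participants wore a 32-electrode cap. The channels were cleaned offline.",
     "How many EEG channels were used?", "32"),
]

rate = hit_rate(toy_examples, toy_retrieve)
print(f"Top-1 hit rate: {rate:.2f}")
```

To score the actual baseline, the same function could be called with `lambda text, q: get_best_match(text, q)[0]` as the retriever. The third toy example is a deliberate miss: the sentence with the most shared words does not contain the answer, which is the kind of error this metric is meant to expose.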



In [None]:
# Evaluate the baseline model
# Example for a classification problem
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)

# For a regression problem, you might use:
# mse = mean_squared_error(y_test, y_pred)

# Your evaluation code here
