# Coherence Checker Using Multi-Approach Based NSP


Done by: Sudip Das

Last Updated: May 7th, 2025

## Introduction

Learning a new language can be particularly challenging for young children and non-native English speakers, especially when it comes to constructing coherent sentences. To assist with this process, we leverage Next Sentence Prediction (NSP) as a tool for evaluating the coherence of user-generated sentences. By doing so, we can provide immediate feedback and improve their understanding of English grammar and syntax.

This project is structured into multiple phases, each designed to explore different approaches for improving sentence coherence detection. The primary objectives include:

- Fine-tuning a pre-trained model on a subset of the Wikipedia dataset for enhanced understanding of coherent sentence pairs.

- Experimenting with various prompting methods, including standard and few-shot prompts, to assess how well the model can identify coherence.

- Implementing a retrieval-augmented generation (RAG) approach that utilizes a FAISS index built from coherent and incoherent sentence pairs.

- Combining above approaches to achieve higher accuracy in NSP tasks.

- Demonstrating the practical benefits of these techniques for young learners, including clearer feedback on writing errors and more personalized language learning support.


## Install & Import Dependencies

In [None]:
#Install dependencies

!pip install \
  torch==2.1.2 \
  torchaudio==2.1.2 \
  transformers==4.29.2 \
  datasets==2.12.0 \
  huggingface_hub==0.15.1 \
  sentence-transformers==2.2.2 \
  numpy==1.26.3 \
  scikit-learn==1.4.0 \
  pandas==2.1.4 \
  matplotlib==3.8.2 \
  tqdm==4.66.1 \
  nltk==3.8.1 \
  faiss-cpu==1.7.4 \
  evaluate==0.4.1 \
  accelerate==0.25.0



In [None]:
# Import necessary libraries
import os
import random
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import nltk
from nltk.tokenize import sent_tokenize

# Hugging Face imports
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    BertForNextSentencePrediction,
    BertTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
    BartForConditionalGeneration,
    BartTokenizer,
    Trainer,
    TrainingArguments,
    pipeline
)
from sentence_transformers import SentenceTransformer
import faiss
import evaluate

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

In [None]:
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Download NLTK resources
nltk.download('punkt')

Using device: cuda


[nltk_data] Downloading package punkt to
[nltk_data]     /teamspace/studios/this_studio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Set your Hugging Face access token
os.environ["HUGGINGFACE_TOKEN"] = "Enter your token here"

from huggingface_hub import login
login(token=os.environ["HUGGINGFACE_TOKEN"])

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /teamspace/studios/this_studio/.cache/huggingface/token
Login successful


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Extract Dataset

The Wiki Paragraphs dataset used in this NSP task provides a large-scale, high-quality source of English sentence pairs for fine-tuning. By utilizing Wikipedia as the source, the dataset benefits from a wide range of topics, sentence structures, and language styles. Moreover, the dataset is balanced in terms of coherent (1) and incoherent (0) labels, ensuring that the model does not develop a bias towards one class.

The dataset is available on https://huggingface.co/datasets/dennlinger/wiki-paragraphs and the size is substantially large with over 25 million training pairs and millions more for validation and testing. For the given resources, the dataset is shrunk down to 100,000 sentences with a 70,15,15 split.

**Skip the next 3 lines while re running the notebook. The dataset has over 30 million sentence pairs and cannot fit in the drive. Hence it is randomly sampled to 100,000 sentence pairs. This is then saved in the wiki-paragraphs-sampled folder and can be directly loaded from there.**

In [None]:
def explore_wiki_dataset(train_path, val_path, test_path):
    """
    Explore the Wiki Paragraphs dataset from CSV files.

    Args:
        train_path (str): Path to the training CSV file
        val_path (str): Path to the validation CSV file
        test_path (str): Path to the test CSV file
    """
    # Load the datasets
    train_df = pd.read_csv(train_path)
    val_df = pd.read_csv(val_path)
    test_df = pd.read_csv(test_path)

    # Display dataset sizes
    print(f"Dataset sizes:")
    print(f"Train: {len(train_df)} examples")
    print(f"Validation: {len(val_df)} examples")
    print(f"Test: {len(test_df)} examples")

    # Display column names
    print(f"\nColumns in the dataset: {list(train_df.columns)}")

    # Display class distribution
    print(f"\nClass distribution:")
    print(f"Train: {train_df['label'].mean():.2f} positive")
    print(f"Validation: {val_df['label'].mean():.2f} positive")
    print(f"Test: {test_df['label'].mean():.2f} positive")

    # Display a few examples
    print("\nExample from train dataset:")
    sample = train_df.iloc[0]
    print(f"Sentence 1: {sample['sentence1']}")
    print(f"Sentence 2: {sample['sentence2']}")
    print(f"Label: {sample['label']}")

    return train_df, val_df, test_df

In [None]:
train_df, val_df, test_df = explore_wiki_dataset(
    train_path="wiki-paragraphs/train.csv",
    val_path="wiki-paragraphs/validation.csv",
    test_path="wiki-paragraphs/test.csv"
)

Dataset sizes:
Train: 25375582 examples
Validation: 3163684 examples
Test: 3171464 examples

Columns in the dataset: ['sentence1', 'sentence2', 'label']

Class distribution:
Train: 0.50 positive
Validation: 0.50 positive
Test: 0.50 positive

Example from train dataset:
Sentence 1: Terminfo:10904051
Sentence 2: Terminfo is a library and database that enables programs to use display terminals in a device-independent manner.
Label: 1


In [None]:
def sample_directly_from_wiki_dataset(train_path, val_path, test_path,
                                      train_size=70000, val_size=15000, test_size=15000,
                                      random_state=42):
    """

    Args:
        train_path (str): Path to the training CSV file
        val_path (str): Path to the validation CSV file
        test_path (str): Path to the test CSV file
        train_size (int): Number of examples to sample from train set
        val_size (int): Number of examples to sample from validation set
        test_size (int): Number of examples to sample from test set
        random_state (int): Random seed for reproducibility

    Returns:
        tuple: (sampled_train_df, sampled_val_df, sampled_test_df)
    """
    # Make sure the output directory exists
    os.makedirs("wiki-paragraphs-sampled", exist_ok=True)

    # Sample from training set
    print(f"Sampling {train_size} examples from training set...")
    train_df = pd.read_csv(train_path)
    sampled_train = train_df.sample(n=train_size, random_state=random_state)
    del train_df  # Free up memory

    # Sample from validation set
    print(f"Sampling {val_size} examples from validation set...")
    val_df = pd.read_csv(val_path)
    sampled_val = val_df.sample(n=val_size, random_state=random_state)
    del val_df  # Free up memory

    # Sample from test set
    print(f"Sampling {test_size} examples from test set...")
    test_df = pd.read_csv(test_path)
    sampled_test = test_df.sample(n=test_size, random_state=random_state)
    del test_df  # Free up memory

    # Show the new sizes
    print(f"\nNew dataset sizes:")
    print(f"Train: {len(sampled_train)} examples")
    print(f"Validation: {len(sampled_val)} examples")
    print(f"Test: {len(sampled_test)} examples")

    # Check class balance
    print(f"\nNew class distribution:")
    print(f"Train: {sampled_train['label'].mean():.2f} positive")
    print(f"Validation: {sampled_val['label'].mean():.2f} positive")
    print(f"Test: {sampled_test['label'].mean():.2f} positive")

    # Save the new splits to CSV files
    print("\nSaving new splits to CSV files...")
    sampled_train.to_csv("wiki-paragraphs-sampled/train.csv", index=False)
    sampled_val.to_csv("wiki-paragraphs-sampled/validation.csv", index=False)
    sampled_test.to_csv("wiki-paragraphs-sampled/test.csv", index=False)
    print("Saved to wiki-paragraphs-sampled/ directory")

    return sampled_train, sampled_val, sampled_test

# Run the function
train_df, val_df, test_df = sample_directly_from_wiki_dataset(
    train_path="wiki-paragraphs/train.csv",
    val_path="wiki-paragraphs/validation.csv",
    test_path="wiki-paragraphs/test.csv"
)

Sampling 70000 examples from training set...
Sampling 15000 examples from validation set...
Sampling 15000 examples from test set...

New dataset sizes:
Train: 70000 examples
Validation: 15000 examples
Test: 15000 examples

New class distribution:
Train: 0.50 positive
Validation: 0.50 positive
Test: 0.50 positive

Saving new splits to CSV files...
Saved to wiki-paragraphs-sampled/ directory


## Run from this block to directly use the randomly sampled dataset of (100,000 sentence pairs)
**Note:** If the samples are already loaded, it can be directly run from here saving time and space

In [None]:
#Run this directly if dataset is already loaded and sampled
train_df = pd.read_csv('/content/drive/My Drive/CS225_final_project/wiki-paragraphs-sampled/train.csv')
val_df = pd.read_csv("/content/drive/My Drive/CS225_final_project/wiki-paragraphs-sampled/validation.csv")
test_df = pd.read_csv("/content/drive/My Drive/CS225_final_project/wiki-paragraphs-sampled/test.csv")

In [None]:
train_data = train_df.reset_index(drop=True)
val_data = val_df.reset_index(drop=True)
test_data = test_df.reset_index(drop=True)

In [None]:
print(train_data.columns)
print(train_data.head())


Index(['sentence1', 'sentence2', 'label'], dtype='object')
                                           sentence1  \
0  Damaging agents: Although numerous insects and...   
1  Pyatt became an unrestricted free agent after ...   
2  Beginning performance 13 January 2012, Dillon ...   
3  In 2007, an addition opened, providing a resea...   
4  Recovering from a major knee operation, Bennet...   

                                           sentence2  label  
0  The hickory bark beetle ("Scolytus quadrispino...      1  
1  On May 24, 2016, the Ottawa Senators signed hi...      1  
2  Dillon was also the understudy for the charact...      0  
3  Helen Bentley donated of land on Bentley Road ...      1  
4  While growing up in Nottingham, England, Benne...      0  


## Fine-tuning a BERT-based NSP Model

In this step, we fine-tuned a pre-trained BERT model (using the bert-base-uncased architecture) for the Next Sentence Prediction (NSP) task. The training involved using a labeled dataset of sentence pairs where each pair is assigned a label indicating coherence (1) or incoherence (0).

Dataset Preparation: The sentence pairs were tokenized and converted into a format suitable for BERT using the Hugging Face BertTokenizer. The input included two sentences concatenated with special tokens and padded or truncated to a fixed length. Labels were added as the target output.

Fine-tuning Process: When fine-tuning the BERT model for the NSP task, several parameters and arguments were carefully chosen to balance performance, resource constraints, and ease of implementation.

- Batch Size

Training batch size: 16

Evaluation batch size: 32

A smaller batch size (16) was chosen for training to fit the model into memory while processing relatively long sequences. A larger batch size (32) for evaluation allows more efficient use of resources since the model isn’t updating gradients during evaluation, and the focus is on computing metrics.

- Learning Rate

Value: 2e-5

This rate is small enough to allow stable convergence but large enough to make meaningful progress within a limited number of steps. Fine-tuning pre-trained transformer models often benefits from relatively low learning rates because the model’s weights are already initialized to a useful state, and small updates are sufficient to adapt to the new task.

- Number of Epochs

Value: 1

A single epoch was used to prevent overfitting, given that the model already starts from a strong pre-trained base and the training set is large. Running more epochs might lead to marginal gains in accuracy, but it also increases training time and the risk of overfitting, particularly if the dataset is not extremely diverse.

- Warmup Steps

Value: 100

Warmup steps are employed to gradually increase the learning rate at the beginning of training. This helps avoid large, sudden updates to the model’s weights in the initial steps, which could destabilize training. By warming up over 100 steps, the model starts training smoothly and settles into a stable learning pattern.

- Weight Decay

Value: 0.01

Adding a small weight decay term helps regularize the model and prevent overfitting by penalizing large weights. This encourages the model to learn more generalizable features rather than fitting noise in the training data.


In [None]:
class SentencePairDataset(Dataset):
    """
    Dataset for sentence pairs with next sentence prediction labels.
    """
    def __init__(self, sentence_pairs, tokenizer, max_length=128):
        self.sentence_pairs = sentence_pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        pair = self.sentence_pairs.iloc[idx]
        sentence_a = pair['sentence1']
        sentence_b = pair['sentence2']
        label = pair['label']
        # Tokenize the sentence pair
        encoding = self.tokenizer(sentence_a, sentence_b,
                                  truncation=True,
                                  max_length=self.max_length,
                                  padding='max_length',
                                  return_tensors='pt')

        encoding = {key: val.squeeze(0) for key, val in encoding.items()}
        encoding['labels'] = torch.tensor(label, dtype=torch.long)

        return encoding

In [None]:
def train_bert_nsp_model(train_data, val_data, model_name="bert-base-uncased", output_dir="./bert_nsp_model"):
    """
    Fine-tune a BERT model for Next Sentence Prediction.

    Args:
        train_data (list): List of training sentence pairs
        val_data (list): List of validation sentence pairs
        model_name (str): Name of the pre-trained BERT model to use
        output_dir (str): Directory to save the fine-tuned model

    Returns:
        tuple: (model, tokenizer)
    """
    # Load the tokenizer and model
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForNextSentencePrediction.from_pretrained(model_name)

    # Create datasets
    train_dataset = SentencePairDataset(train_data, tokenizer)
    val_dataset = SentencePairDataset(val_data, tokenizer)

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        warmup_steps=100,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=250,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        learning_rate=2e-5,
    )

    # Define compute_metrics function for evaluation
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        accuracy = accuracy_score(labels, predictions)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )

    # Train the model
    print("Training the BERT NSP model...")
    trainer.train()

    # Save the best model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    return model, tokenizer

In [None]:
def evaluate_bert_nsp_model(model, tokenizer, test_data):
    """
    Evaluate the fine-tuned BERT NSP model on test data.

    Args:
        model: Fine-tuned BERT NSP model
        tokenizer: BERT tokenizer
        test_data (list): List of test sentence pairs

    Returns:
        dict: Evaluation metrics
    """
    # Create test dataset
    test_dataset = SentencePairDataset(test_data, tokenizer)
    test_dataloader = DataLoader(test_dataset, batch_size=32)

    # Move model to device
    model.to(device)
    model.eval()

    all_preds = []
    all_labels = []

    # Evaluate
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Evaluating"):
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}

            # Forward pass
            outputs = model(**{k: v for k, v in batch.items() if k != 'labels'})

            # Get predictions
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=1).cpu().numpy()
            labels = batch['labels'].cpu().numpy()

            all_preds.extend(predictions)
            all_labels.extend(labels)

    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')

    # Print results
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Test Precision: {precision:.4f}")
    print(f"Test Recall: {recall:.4f}")
    print(f"Test F1 Score: {f1:.4f}")

    # Return metrics
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

In [None]:
def predict_sentence_coherence(model, tokenizer, sentence_a, sentence_b):
    """
    Predict whether sentence_b is a coherent continuation of sentence_a.

    Args:
        model: Fine-tuned BERT NSP model
        tokenizer: BERT tokenizer
        sentence_a (str): First sentence
        sentence_b (str): Second sentence

    Returns:
        tuple: (is_coherent, confidence_score)
    """
    # Tokenize the sentence pair
    encoding = tokenizer(sentence_a, sentence_b,
                         truncation=True,
                         max_length=128,
                         padding='max_length',
                         return_tensors='pt')

    # Move to device
    encoding = {k: v.to(device) for k, v in encoding.items()}

    # Get prediction
    model.eval()
    with torch.no_grad():
        outputs = model(**encoding)

    # Process outputs
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    is_next_prob = probabilities[0, 0].item()  # Probability of being "IsNext"
    is_coherent = is_next_prob >= 0.5

    return is_coherent, is_next_prob

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

**Note: Skip the next cell if GPU memory is the issue. Training can be slow and heavy. The model has been loaded in the 'bert_nsp_model' folder so that training doesn't need to be repeated. The trained model can then be directly reused from there.**

In [None]:
bert_nsp_model, bert_tokenizer = train_bert_nsp_model(train_data, val_data)

# Uncomment to evaluate the model on test data
bert_nsp_metrics = evaluate_bert_nsp_model(bert_nsp_model, bert_tokenizer, test_data)

# Example usage
sentence_a = "The cat sat on the mat."
sentence_b = "It was purring loudly."
is_coherent, confidence = predict_sentence_coherence(bert_nsp_model, bert_tokenizer, sentence_a, sentence_b)
print(f"Sentences are {'coherent' if is_coherent else 'incoherent'} with confidence {confidence:.4f}")

Training the BERT NSP model...




{'loss': 1.6137, 'learning_rate': 1.929824561403509e-05, 'epoch': 0.06}
{'loss': 0.6252, 'learning_rate': 1.8128654970760235e-05, 'epoch': 0.11}
{'loss': 0.5994, 'learning_rate': 1.695906432748538e-05, 'epoch': 0.17}
{'loss': 0.6086, 'learning_rate': 1.578947368421053e-05, 'epoch': 0.23}
{'loss': 0.5841, 'learning_rate': 1.4619883040935675e-05, 'epoch': 0.29}
{'loss': 0.5862, 'learning_rate': 1.345029239766082e-05, 'epoch': 0.34}
{'loss': 0.5931, 'learning_rate': 1.2280701754385966e-05, 'epoch': 0.4}
{'loss': 0.5728, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.46}
{'loss': 0.5765, 'learning_rate': 9.941520467836257e-06, 'epoch': 0.51}
{'loss': 0.5806, 'learning_rate': 8.771929824561405e-06, 'epoch': 0.57}
{'loss': 0.5725, 'learning_rate': 7.60233918128655e-06, 'epoch': 0.63}
{'loss': 0.5685, 'learning_rate': 6.432748538011696e-06, 'epoch': 0.69}
{'loss': 0.5597, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.74}
{'loss': 0.5657, 'learning_rate': 4.093567251461989e-06, 'epoc

Evaluating:   0%|          | 0/469 [00:00<?, ?it/s]

Test Accuracy: 0.7113
Test Precision: 0.7124
Test Recall: 0.7074
Test F1 Score: 0.7099
Sentences are incoherent with confidence 0.2728


**Results:** The model was evaluated on a test set of sentence pairs. It achieved an accuracy of 71.1%, with a precision of 71.2%, recall of 70.7%, and F1 score of 70.9%. These metrics indicate a solid performance for this initial fine-tuning phase, showing that the model can effectively distinguish between coherent and incoherent sentences.

The model is used to test on a real-world example and it correctly identified the pair as incoherent, providing a confidence score of approximately 27.3%. This example demonstrates how the fine-tuned NSP model can be used to help learners understand when two sentences do not form a logically coherent pair.

**Evaluation of Result:** The results of this fine-tuning step suggest that the BERT model effectively learned to identify sentence coherence within the training and test datasets. The metrics are promising, especially given the relatively short training duration. However, there may still be room for improvement, such as fine-tuning with additional epochs, experimenting with different learning rates, or using a larger, more diverse dataset. In terms of practical applications, these results indicate that the model can serve as a reliable starting point for providing coherence feedback to language learners.

## Prompting based approach for NSP

The idea for this approach is to leverage the model’s pre-trained knowledge without the need for extensive fine-tuning. Here the model is presented with a sentence pair and asked to decide if the second sentence logically follows the first.

A pre-trained language model 'google/flan-t5-base', is loaded along with its tokenizer. The model is then moved to GPU (if available) for efficient inference.

Two types of prompting are tested here. One is the standard or direct format which states “Determine if the second sentence is a logical continuation of the first sentence. Answer ‘yes’ if it is coherent, or ‘no’ if it is not coherent.” and the other prompting method is a few shot method where the model receives few examples before performing its tasks.

Once the prompt is formatted, it is tokenized and fed into the model. The model generates a short text response (e.g., “yes” or “no”), which is then analyzed to determine coherence. Responses are normalized and mapped to binary labels (coherent or incoherent).

In [None]:
def setup_prompt_model(model_name="google/flan-t5-base"):
    """
    Set up a pre-trained language model for prompting.

    Args:
        model_name (str): Name of the pre-trained model to use

    Returns:
        tuple: (model, tokenizer)
    """
    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Move model to device
    model.to(device)

    return model, tokenizer

In [None]:
def is_positive_response(response):
    """
    Normalize and check for clear affirmative response.
    """
    normalized = response.strip().lower()
    return normalized in {"yes", "yes.", "coherent", "logical", "yes it is"}

In [None]:
def predict_coherence_with_prompting(model, tokenizer, sentence_a, sentence_b, prompt_template=None):
    """
    Predict whether sentence_b is a coherent continuation of sentence_a using prompting.

    Args:
        model: Pre-trained language model
        tokenizer: Tokenizer for the model
        sentence_a (str): First sentence
        sentence_b (str): Second sentence
        prompt_template (str, optional): Custom prompt template

    Returns:
        tuple: (is_coherent, raw_output)
    """
    # Default prompt template if none provided
    if prompt_template is None:
        prompt_template = (
            "Determine if the second sentence is a logical continuation of the first sentence. "
            "Answer with 'yes' if it is coherent, or 'no' if it is not coherent.\n\n"
            "First sentence: {sentence_a}\n"
            "Second sentence: {sentence_b}\n\n"
            "Is the second sentence a coherent continuation of the first? "
        )

    # Format the prompt
    prompt = prompt_template.format(sentence_a=sentence_a, sentence_b=sentence_b)

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(device)

    # Generate response
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_length=10,  # Short response expected
            num_return_sequences=1,
            temperature=0.7,
            top_p=0.9
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip().lower()
    is_coherent = is_positive_response(response)

    return is_coherent, response

In [None]:
def evaluate_prompt_approach(model, tokenizer, test_data, prompt_template=None):
    """
    Evaluate the prompting-based approach on test data.

    Args:
        model: Pre-trained language model
        tokenizer: Tokenizer for the model
        test_data (list): List of test sentence pairs
        prompt_template (str, optional): Custom prompt template

    Returns:
        dict: Evaluation metrics
    """
    # Sample a subset for evaluation (prompting can be slow)
    if len(test_data) > 100:
        sampled_test_data = random.sample(test_data, 100)
    else:
        sampled_test_data = test_data

    all_preds = []
    all_labels = []
    all_responses = []

    # Evaluate
    for pair in tqdm(sampled_test_data, desc="Evaluating prompting approach"):
        sentence_a = pair['sentence_a']
        sentence_b = pair['sentence_b']
        true_label = pair['label']

        # Get prediction
        is_coherent, response = predict_coherence_with_prompting(
            model, tokenizer, sentence_a, sentence_b, prompt_template
        )

        # Convert boolean to int (0/1)
        pred_label = 1 if is_coherent else 0

        all_preds.append(pred_label)
        all_labels.append(true_label)
        all_responses.append(response)

    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')

    # Print results
    print(f"Prompt Approach Test Accuracy: {accuracy:.4f}")
    print(f"Prompt Approach Test Precision: {precision:.4f}")
    print(f"Prompt Approach Test Recall: {recall:.4f}")
    print(f"Prompt Approach Test F1 Score: {f1:.4f}")

    print("\nExample responses:")
    for i in range(min(5, len(all_responses))):
        print(f"Response {i+1}: {all_responses[i]}")

    # Return metrics
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'responses': all_responses
    }

In [None]:
def experiment_with_different_prompts(model, tokenizer, test_data):
    """
    Experiment with different prompt templates and compare their performance.

    Args:
        model: Pre-trained language model
        tokenizer: Tokenizer for the model
        test_data (list): List of test sentence pairs

    Returns:
        dict: Results for each prompt template
    """
    # Define different prompt templates
    prompt_templates = {
        "standard": (
            "Determine if the second sentence is a logical continuation of the first sentence. "
            "Answer with 'yes' if it is coherent, or 'no' if it is not coherent.\n\n"
            "First sentence: {sentence_a}\n"
            "Second sentence: {sentence_b}\n\n"
            "Is the second sentence a coherent continuation of the first? "
        ),

        "few_shot": (
            "Below are examples of sentence pairs. For each pair, decide if the second sentence "
            "logically follows the first. Answer \"Yes\" if it does, or \"No\" if it does not.\n\n"
            "Example 1:\n"
            "First: The dog chased the ball into the yard.\n"
            "Second: It brought the ball back to its owner.\n"
            "Answer: Yes\n\n"
            "Example 2:\n"
            "First: The sun set behind the mountains.\n"
            "Second: I finished my homework before breakfast.\n"
            "Answer: No\n\n"
            "Example 3:\n"
            "First: The cake was ready to serve.\n"
            "Second: Everyone sat down at the table to eat dessert.\n"
            "Answer: Yes\n\n"
            "Now, consider this pair:\n"
            "First: {sentence_a}\n"
            "Second: {sentence_b}\n"
            "Answer: "
        )
    }

    # Sample a small subset for quick experimentation
    if len(test_data) > 100:
        sampled_test_data = random.sample(test_data, 100)
    else:
        sampled_test_data = test_data

    # Evaluate each prompt template
    results = {}
    for name, template in prompt_templates.items():
        print(f"\nEvaluating prompt template: {name}")
        metrics = evaluate_prompt_approach(model, tokenizer, sampled_test_data, template)
        results[name] = metrics

    # Compare results
    print("\nPrompt Template Comparison:")
    for name, metrics in results.items():
        print(f"{name}: Accuracy={metrics['accuracy']:.4f}, F1={metrics['f1']:.4f}")

    return results

In [None]:
test_data = test_data.rename(columns={
    'sentence1': 'sentence_a',
    'sentence2': 'sentence_b'
}).to_dict(orient='records')

In [None]:
prompt_model, prompt_tokenizer = setup_prompt_model()

#prompt_metrics = evaluate_prompt_approach(prompt_model, prompt_tokenizer, test_data)

prompt_experiment_results = experiment_with_different_prompts(prompt_model, prompt_tokenizer, test_data)

# Example usage
sentence_a = "The cat sat on the mat."
sentence_b = "It was purring loudly."
is_coherent, response = predict_coherence_with_prompting(prompt_model, prompt_tokenizer, sentence_a, sentence_b)
print(f"Sentences are {'coherent' if is_coherent else 'incoherent'}")
print(f"Model response: {response}")


Evaluating prompt template: standard


Evaluating prompting approach:   0%|          | 0/100 [00:00<?, ?it/s]

Prompt Approach Test Accuracy: 0.5100
Prompt Approach Test Precision: 0.4878
Prompt Approach Test Recall: 0.4167
Prompt Approach Test F1 Score: 0.4494

Example responses:
Response 1: yes
Response 2: no
Response 3: no
Response 4: no
Response 5: yes

Evaluating prompt template: few_shot


Evaluating prompting approach:   0%|          | 0/100 [00:00<?, ?it/s]

Prompt Approach Test Accuracy: 0.5600
Prompt Approach Test Precision: 0.5476
Prompt Approach Test Recall: 0.4792
Prompt Approach Test F1 Score: 0.5111

Example responses:
Response 1: yes
Response 2: no
Response 3: no
Response 4: no
Response 5: answer: yes

Prompt Template Comparison:
standard: Accuracy=0.5100, F1=0.4494
few_shot: Accuracy=0.5600, F1=0.5111
Sentences are incoherent
Model response: no


#### Results
The prompting based model performance is tested on the test set extracted from the wiki-paragraphs dataset.

Using the standard prompt, the approach achieved an accuracy of 51% and an F1 score of 0.4494.

Few-shot prompting, on the other hand, yielded better results, with an accuracy of 56% and an F1 score of 0.5111.

The results demonstrate that both the standard prompting and few-shot prompting approaches are capable of predicting sentence coherence, though the overall accuracy and F1 scores leave room for improvement. It also suggests that including examples in the prompt helps the model make more informed decisions, likely by clarifying the task and setting clearer expectations.

While prompting leverages a pre-trained model’s existing knowledge, it may not fully adapt to the nuances of the NSP task. Therefore fine-tuning on a labeled dataset and then prompting can result in better accuracy.



## Retrieval-Augmented Generation Based NSP Approach

In this section, a retrieval-augmented generation (RAG) methodology is used for coherence detection. The key idea behind RAG is to leverage a large collection of text and build a searchable index that allows retrieval of relevant context for any given input sentence. This approach enables the model to incorporate more meaningful background information into its predictions, rather than solely relying on the sentence pairs provided.

#### Data Preparation:
As the objective is to teach young children or non-English speakers, no difficult vocabulary is used. Text from Wikipedia is extracted using an online Wiki text extractor and then inputted below as a text file. The next step involved reading raw text files containing various paragraphs or articles and splitting them into sentences. These sentences were then cleaned, normalized, and filtered to remove very short or incomplete lines. Once prepared, the sentences served as the foundation for building the index and generating sentence pairs.

#### Sentence Pair Creation:
From the pool of sentences, coherent (positive) pairs were formed by taking consecutive sentences, while incoherent (negative) pairs were constructed by randomly selecting non-consecutive sentences. This approach ensured a balanced dataset with a mix of logically connected and unrelated sentence pairs, which is crucial for training and evaluating the coherence detection model.

#### Building a FAISS Index:
Using a pre-trained sentence encoder, all sentences were transformed into high-dimensional vector embeddings. To improve similarity-based retrieval, these embeddings were normalized for cosine similarity. A FAISS index was then built, which allows for fast and efficient similarity searches. The index enables us to retrieve context sentences that are most relevant to any input sentence, providing additional information that can improve the accuracy of coherence judgments.

#### Contextual Predictions with RAG:
For each test sentence pair, the FAISS index was queried to retrieve several related sentences from the corpus. These retrieved sentences formed a context block that was appended to the first sentence before comparing it with the second sentence. The model, using this enriched input, then predicted whether the pair was coherent or incoherent. By augmenting the input with relevant background context, the model was better equipped to make informed predictions.

#### Evaluation and Results:
The RAG-based approach was evaluated on a limited set of samples, as the retrieval process and context enhancement can be computationally intensive.

In [None]:
import re

def read_text_file(file_path):
    """Read text from file and split into sentences"""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Clean the content
    content = re.sub(r'\n+', ' ', content)  # Replace multiple newlines with space
    content = re.sub(r'\s+', ' ', content)  # Replace multiple spaces with single space

    # Split content into sentences
    sentences = sent_tokenize(content)

    # Remove very short sentences
    sentences = [s for s in sentences if len(s.split()) > 3]

    print(f"Extracted {len(sentences)} sentences from {file_path}")
    return sentences

In [None]:
# Function to create sentence pairs for training
def create_sentence_pairs(sentences, positive_ratio=0.5):
    """Create coherent and incoherent sentence pairs"""
    pairs = []
    labels = []

    # Create positive pairs (coherent sentences)
    for i in range(len(sentences) - 1):
        pairs.append((sentences[i], sentences[i + 1]))
        labels.append(1)  # Coherent pair

    # Create negative pairs (incoherent sentences)
    num_positive = len(pairs)
    num_negative = int(num_positive * (1 - positive_ratio) / positive_ratio)

    for _ in range(num_negative):
        idx1 = random.randint(0, len(sentences) - 1)
        idx2 = random.randint(0, len(sentences) - 1)

        # Ensure sentences are not consecutive
        while abs(idx1 - idx2) <= 1 or idx1 >= len(sentences) - 1 or idx2 >= len(sentences) - 1:
            idx1 = random.randint(0, len(sentences) - 1)
            idx2 = random.randint(0, len(sentences) - 1)

        pairs.append((sentences[idx1], sentences[idx2]))
        labels.append(0)  # Incoherent pair

    return pairs, labels

In [None]:
# Function to build FAISS index
def build_faiss_index(sentences, sentence_encoder):
    """Build a FAISS index from a list of sentences"""
    # Remove duplicates while preserving order
    unique_sentences = []
    seen = set()
    for s in sentences:
        if s not in seen:
            seen.add(s)
            unique_sentences.append(s)

    # Encode sentences
    print(f"Encoding {len(unique_sentences)} sentences for FAISS index...")
    embeddings = sentence_encoder.encode(unique_sentences, show_progress_bar=True)

    # Normalize embeddings for cosine similarity
    faiss.normalize_L2(embeddings)

    # Build FAISS index
    vector_dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(vector_dimension)  # Inner product for cosine similarity
    index.add(embeddings)

    print(f"FAISS index built with {index.ntotal} vectors of dimension {vector_dimension}")
    return index, unique_sentences

In [None]:
# Function to retrieve context from FAISS index
def retrieve_context(query_sentence, faiss_index, sentences_list, sentence_encoder, k=3):
    """Retrieve context sentences from the FAISS index"""
    # Encode query sentence
    query_embedding = sentence_encoder.encode([query_sentence])
    faiss.normalize_L2(query_embedding)

    # Search the index
    distances, indices = faiss_index.search(query_embedding, k+1)  # +1 to account for the query itself

    # Get the retrieved sentences (excluding the query if it's in the index)
    retrieved_sentences = []
    for idx in indices[0]:
        if idx < len(sentences_list) and sentences_list[idx] != query_sentence:
            retrieved_sentences.append(sentences_list[idx])
            if len(retrieved_sentences) >= k:
                break

    return retrieved_sentences


In [None]:
# Function to predict with context
def predict_with_context(sentence_a, sentence_b, faiss_index, sentences_list,
                         sentence_encoder, tokenizer, nsp_model, device, retrieval_k=3):
    """Predict if sentence_b is a coherent continuation of sentence_a using RAG"""
    # Retrieve context from FAISS index
    context_sentences = retrieve_context(sentence_a, faiss_index, sentences_list,
                                         sentence_encoder, k=retrieval_k)

    # Build enhanced input with context
    enhanced_context = " ".join(context_sentences) + " " + sentence_a

    # Truncate if too long (BERT has a token limit)
    if len(enhanced_context.split()) > 250:  # Arbitrary limit to prevent exceeding BERT's token limit
        enhanced_context = " ".join(enhanced_context.split()[-250:])

    # Prepare inputs for the NSP model
    encoding = tokenizer(enhanced_context, sentence_b, return_tensors='pt')

    # can be uncommented if using with the fine tuned Bert based NSP model for max length=512
    # encoding = tokenizer(
    #     enhanced_context,
    #     sentence_b,
    #     return_tensors="pt",
    #     truncation=True,
    #     max_length=512,
    #     padding="max_length"
    # )

    # Move inputs to the appropriate device
    encoding = {k: v.to(device) for k, v in encoding.items()}

    # Get model prediction
    with torch.no_grad():
        outputs = nsp_model(**encoding)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)
        prediction = torch.argmax(probs, dim=1).item()
        confidence = probs[0][prediction].item()

    # Return 1 for IsNextSentence, 0 for NotNextSentence
    # BERT's NSP task uses 0 for IsNextSentence and 1 for NotNextSentence, so we invert it
    return 1 - prediction, confidence

In [None]:
from sklearn.metrics import classification_report

# Function to evaluate the model on a limited number of samples
def evaluate_samples(test_pairs, test_labels, faiss_index, sentences_list,
                            sentence_encoder, tokenizer, nsp_model, device, num_samples=50):
    """Evaluate the model on a limited number of test samples"""
    # Ensure we have a balanced set of samples (equal positive and negative examples)
    pos_pairs = [(pair, label) for pair, label in zip(test_pairs, test_labels) if label == 1]
    neg_pairs = [(pair, label) for pair, label in zip(test_pairs, test_labels) if label == 0]

    # Take up to half of samples from each class
    samples_per_class = min(num_samples // 2, len(pos_pairs), len(neg_pairs))

    # If one class has fewer than half the samples, take more from the other class
    pos_samples = min(samples_per_class, len(pos_pairs))
    neg_samples = min(num_samples - pos_samples, len(neg_pairs))

    # If still not enough samples, adjust positive samples again
    if pos_samples + neg_samples < num_samples and len(pos_pairs) > pos_samples:
        pos_samples = min(num_samples - neg_samples, len(pos_pairs))

    # Randomly sample from each class
    random.seed(42)  # For reproducibility
    sampled_pos = random.sample(pos_pairs, pos_samples)
    sampled_neg = random.sample(neg_pairs, neg_samples)

    # Combine and shuffle
    sampled_pairs = sampled_pos + sampled_neg
    random.shuffle(sampled_pairs)

    sample_test_pairs = [pair for pair, _ in sampled_pairs]
    sample_test_labels = [label for _, label in sampled_pairs]

    print(f"Evaluating on {len(sample_test_pairs)} samples ({pos_samples} coherent, {neg_samples} incoherent)")

    # Evaluate the samples
    predictions = []
    confidences = []

    for i, (sentence_a, sentence_b) in enumerate(sample_test_pairs):
        pred, conf = predict_with_context(sentence_a, sentence_b, faiss_index, sentences_list,
                                         sentence_encoder, tokenizer, nsp_model, device)
        predictions.append(pred)
        confidences.append(conf)

        # Print progress every 5 pairs
        if (i+1) % 5 == 0:
            print(f"Evaluated {i+1}/{len(sample_test_pairs)} pairs...")

    # Generate classification report
    report = classification_report(sample_test_labels, predictions, target_names=['Incoherent', 'Coherent'])

    # Calculate accuracy
    accuracy = sum(1 for p, t in zip(predictions, sample_test_labels) if p == t) / len(sample_test_labels)

    return {
        'predictions': predictions,
        'confidences': confidences,
        'accuracy': accuracy,
        'classification_report': report,
        'test_pairs': sample_test_pairs,
        'test_labels': sample_test_labels
    }

In [None]:
# Initialize models
sentence_encoder = SentenceTransformer('all-MiniLM-L6-v2')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
nsp_model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
training_files = ["/content/drive/My Drive/CS225_final_project/rag-based-nsp/dog.txt", "/content/drive/My Drive/CS225_final_project/rag-based-nsp/siamese_cat.txt"]
# Read and process all training files
all_training_sentences = []
for file_path in training_files:
    sentences = read_text_file(file_path)
    all_training_sentences.extend(sentences)

print(f"Total training sentences: {len(all_training_sentences)}")

# Create sentence pairs for training
training_pairs, training_labels = create_sentence_pairs(all_training_sentences)
print(f"Created {len(training_pairs)} sentence pairs for training")

Extracted 346 sentences from rag-based-nsp/dog.txt
Extracted 100 sentences from rag-based-nsp/siamese_cat.txt
Total training sentences: 446
Created 890 sentence pairs for training


In [None]:
# Build FAISS index
faiss_index, sentences_list = build_faiss_index(all_training_sentences, sentence_encoder)


Encoding 446 sentences for FAISS index...


Batches:   0%|          | 0/14 [00:00<?, ?it/s]

FAISS index built with 446 vectors of dimension 384


In [None]:
def test_single_pair(sentence_a, sentence_b):
    """Test a single pair of sentences"""
    pred, conf = predict_with_context(sentence_a, sentence_b, faiss_index, sentences_list,
                                     sentence_encoder, tokenizer, nsp_model, device)

    print(f"Sentence A: {sentence_a}")
    print(f"Sentence B: {sentence_b}")
    print(f"Prediction: {'Coherent' if pred == 1 else 'Incoherent'} (confidence: {conf:.4f})")

    # Show retrieved context
    context = retrieve_context(sentence_a, faiss_index, sentences_list, sentence_encoder)
    print("\nRetrieved context:")
    for i, ctx in enumerate(context):
        print(f"{i+1}. {ctx}")

    return pred, conf

# Example usage of test_single_pair function
print("\nTesting a custom sentence pair:")
# Replace with your own sentences to test
custom_a = "Dogs are known for their loyalty and companionship."
custom_b = "Many different breeds have been developed for various purposes."
test_single_pair(custom_a, custom_b)


Testing a custom sentence pair:
Sentence A: Dogs are known for their loyalty and companionship.
Sentence B: Many different breeds have been developed for various purposes.
Prediction: Coherent (confidence: 1.0000)

Retrieved context:
1. These sophisticated forms of social cognition and communication may account for dogs' trainability, playfulness, and ability to fit into human households and social situations, and probably also their co-existence with early human hunter-gatherers.
2. Health benefits The scientific evidence is mixed as to whether a dog's companionship can enhance human physical and psychological well-being.
3. Cultural importance Artworks have depicted dogs as symbols of guidance, protection, loyalty, fidelity, faithfulness, alertness, and love.


(1, 0.9999909400939941)

In [None]:
# Path to test file
test_file = "/content/drive/My Drive/add-your-path"

# Read and process test file
test_sentences = read_text_file(test_file)
test_pairs, test_labels = create_sentence_pairs(test_sentences)
print(f"Created {len(test_pairs)} sentence pairs for testing")

# Evaluate the model on limited samples
print("\nEvaluating the model on 50 samples...")
results = evaluate_samples(test_pairs, test_labels, faiss_index, sentences_list,
                                 sentence_encoder, tokenizer, nsp_model, device, num_samples=50)

# Print results
print("\nResults:")
print(f"Accuracy: {results['accuracy']:.4f}")
print("\nClassification Report:")
print(results['classification_report'])

Extracted 337 sentences from rag-based-nsp/cat.txt
Created 672 sentence pairs for testing

Evaluating the model on 50 samples...
Evaluating on 50 samples (25 coherent, 25 incoherent)
Evaluated 5/50 pairs...
Evaluated 10/50 pairs...
Evaluated 15/50 pairs...
Evaluated 20/50 pairs...
Evaluated 25/50 pairs...
Evaluated 30/50 pairs...
Evaluated 35/50 pairs...
Evaluated 40/50 pairs...
Evaluated 45/50 pairs...
Evaluated 50/50 pairs...

Results:
Accuracy: 0.5600

Classification Report:
              precision    recall  f1-score   support

  Incoherent       1.00      0.12      0.21        25
    Coherent       0.53      1.00      0.69        25

    accuracy                           0.56        50
   macro avg       0.77      0.56      0.45        50
weighted avg       0.77      0.56      0.45        50



#### Results
The overall accuracy of 56% on the 50-sample test set highlights that the RAG-based approach is making progress in coherence detection but is far from perfect. The F1 score for the coherent class is substantially higher than that for the incoherent class. This imbalance suggests that the model is more confident and consistent in identifying coherence, but struggles with nuanced or subtly incoherent pairs. Such a disparity might point to a need for more diverse or challenging examples in the training data, particularly for the incoherent category.

The ability to retrieve context from the FAISS index does improve the model’s understanding, as evidenced by the confident prediction on coherent pairs and the clear rationale behind certain decisions. However, the retrieval process is only as good as the data it draws from. As the context is mostly from 'dogs' page and the evaluation is on a 'cats' page it fails to achieve high accuracy. One random example is taken for a dog's case and the prediction obtained was coherent (label 1) with a confidence of 99.999%. This further confirms that the mismatch in context and evaluation text resulted in the lower accuracy. This can be improved by updating and enhancing the size of the context with more relevant information.

## Enhanced NSP/Sentence Coherence Using Tiered Inference Approach

This section is the final phase of the sentence coherence prediction project, where the previously tested strategies: fine-tuned Next Sentence Prediction (NSP), Retrieval-Augmented Generation (RAG), and prompting, are integrated into a robust, interpretable, and fallback-safe system.

Combining these methods into a unified model pipeline is not practical because:

- The fine-tuned BERT NSP model operates as a classification head trained on binary coherence labels.

- RAG-based methods rely on external context retrieval and are sensitive to token limits and sequence padding.

- Few-shot prompting models like FLAN-T5 are decoder-based and follow instruction templates rather than logits from a classification head.

Therefore, the final prediction architecture is a tiered decision approach based on confidence score.


The objective is to reuse the fine-tuned BERT NSP model from the first step. If the confidence level of the prediction is below 0.8, the model should fall back to the RAG based prediction. As the test samples are the wiki paragraphs, it is hard to provide a definitive context. Therefore, RAG produces the lowest accuracy in terms of results compared to the other two methods. Hence, it has been commented out in this architecture. If the fine-tuned NSP model isn't confident in its prediction, the model switches to few shot prompting based NSP approach. This time the model chosen is flan-t5-large.

In [None]:
# Reload the fine‑tuned BERT NSP checkpoint
nsp_model     = BertForNextSentencePrediction.from_pretrained("/content/drive/My Drive/CS225_final_project/bert_nsp_model").to(device)
nsp_tokenizer = BertTokenizer.from_pretrained("/content/drive/My Drive/CS225_final_project/bert_nsp_model")

In [None]:
#Load the flan-t5-large model
prompt_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
prompt_model     = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

prompt_model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=1024, out_features=2816, bias=False)
              (wi_1): Linear(in_features=1024, out_features=2816, bias=False)
       

In [None]:
# FEW‑SHOT prompt block

FEW_SHOT_TEMPLATE = (
    "Below are examples of sentence pairs. For each pair, decide if the second sentence "
    "logically follows the first. Answer \"Yes\" if it does, or \"No\" if it does not.\n\n"
    "Example 1:\n"
    "First: The dog chased the ball into the yard.\n"
    "Second: It brought the ball back to its owner.\n"
    "Answer: Yes\n\n"
    "Example 2:\n"
    "First: The sun set behind the mountains.\n"
    "Second: I finished my homework before breakfast.\n"
    "Answer: No\n\n"
    "Example 3:\n"
    "First: The cake was ready to serve.\n"
    "Second: Everyone sat down at the table to eat dessert.\n"
    "Answer: Yes\n\n"
    "Now, consider this pair:\n"
    "First: {sentence_a}\n"
    "Second: {sentence_b}\n"
    "Answer: "
)

In [None]:
def predict_coherence_with_prompting(model, tokenizer, sentence_a, sentence_b, prompt_template):
    prompt = prompt_template.format(sentence_a=sentence_a, sentence_b=sentence_b)

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)

    output_ids = model.generate(
        **inputs,
        max_new_tokens=1,      # one‑token answer
        num_beams=2,
        early_stopping=True
    )
    response = tokenizer.decode(output_ids[0],skip_special_tokens=True).strip().lower()

    return is_positive_response(response), response

In [None]:
# Build a FAISS index on ALL training sentences

all_train_sentences = pd.concat([train_df["sentence1"], train_df["sentence2"]]).tolist()
sentence_encoder    = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
faiss_index, sentences_list = build_faiss_index(all_train_sentences, sentence_encoder)

Encoding 138628 sentences for FAISS index...


Batches:   0%|          | 0/4333 [00:00<?, ?it/s]

FAISS index built with 138628 vectors of dimension 384


In [None]:
def combined_predict(sent_a, sent_b, retrieval_k = 2):
    """
    Returns (prediction, chosen_tier, nsp_conf, rag_conf, prompt_conf)
    tier = 'NSP' | 'RAG' | 'PROMPT'
    """
    # fine-tuned BERT based NSP
    is_coh, p_next = predict_sentence_coherence(
        nsp_model, nsp_tokenizer, sent_a, sent_b
    )

    is_coh = not is_coh
    p_next = 1.0 - p_next

    nsp_conf = p_next if is_coh else (1.0 - p_next)


    if nsp_conf >= 0.8:
        return int(is_coh), "Fine-tuned NSP", nsp_conf

    # RAG‑BERT
    # rag_pred, rag_conf = predict_with_context(sent_a, sent_b,faiss_index, sentences_list, sentence_encoder,
    #     nsp_tokenizer, nsp_model, device,retrieval_k=retrieval_k)
    # if rag_conf >= 0.8:
    #     return rag_pred, "RAG", rag_conf

    # Few‑shot prompting (FLAN‑T5‑Large)
    prompt_pred, raw_resp = predict_coherence_with_prompting(
        prompt_model, prompt_tokenizer,
        sent_a, sent_b,
        prompt_template=FEW_SHOT_TEMPLATE
    )

    return int(prompt_pred), "PROMPT", 0.8


In [None]:
examples = [
    # 1. Obviously coherent
    ("The dog chased the ball into the yard.",
     "It brought the ball back to its owner."),

    # 2. Clearly incoherent
    ("I turned off the lights before going to bed.",
     "Saturn’s rings are made mostly of ice."),

    # 3. Borderline / ambiguous
    ("The class finished the science experiment.",
     "Afterwards, the students wrote their lab reports.")
]

for i, (a, b) in enumerate(examples, 1):
    pred, tier, conf = combined_predict(a, b)
    print(f"\nPair {i}:")
    print(f"  Sentence A: {a}")
    print(f"  Sentence B: {b}")
    print(f"  ➜ Prediction : {'coherent' if pred else 'incoherent'}")
    print(f"  ➜ Chosen tier: {tier}   (confidence={conf:.3f})")


Pair 1:
  Sentence A: The dog chased the ball into the yard.
  Sentence B: It brought the ball back to its owner.
  ➜ Prediction : coherent
  ➜ Chosen tier: Fine-tuned NSP   (confidence=0.873)

Pair 2:
  Sentence A: I turned off the lights before going to bed.
  Sentence B: Saturn’s rings are made mostly of ice.
  ➜ Prediction : incoherent
  ➜ Chosen tier: PROMPT   (confidence=0.800)

Pair 3:
  Sentence A: The class finished the science experiment.
  Sentence B: Afterwards, the students wrote their lab reports.
  ➜ Prediction : coherent
  ➜ Chosen tier: Fine-tuned NSP   (confidence=0.829)


In [None]:
#  Evaluate on the test split

y_true, y_pred = [], []
for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Evaluating"):

    pred, _, _ = combined_predict(row["sentence1"], row["sentence2"])
    y_true.append(row["label"])
    y_pred.append(pred)

acc  = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")

print("\nCombined model on Wiki-Paragraphs test set")
print(f"Accuracy : {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall   : {rec:.4f}")
print(f"F1-score : {f1:.4f}")


Evaluating:   0%|          | 0/15000 [00:00<?, ?it/s]


Combined model on Wiki-Paragraphs test set (1% sample)
Accuracy : 0.6333
Precision: 0.7373
Recall   : 0.4126
F1-score : 0.5291


#### Results
The final combined model was tested across three example pairs and the model was able to do NSP accurately. In one of the cases, fine fine-tuned model did not produce results with confidence, hence it fell back to a prompt-based NSP checker. Even after using the training set for building the FAISS index, the RAG based approach was not accurate enough.

The combined model was also tested on the test set (15,000 sentence pairs).

The model achieves a moderate accuracy of 63.33%, indicating that it correctly classifies about two-thirds of all sentence pairs. The precision of 73.73% is relatively high, suggesting that when your model predicts a positive relationship between sentences, it's correct about 74% of the time.

However, the recall is notably lower at 41.26%, indicating that the model identifies less than half of the actual positive sentence pairs in the dataset. This substantial gap between precision and recall results in an F1-score of 52.91%. This performance pattern suggests that the combined model is more conservative in making positive predictions.

## User Interaction: Try Your Own Sentences
The final cell allows users to manually input two sentences and receive immediate feedback on whether the second sentence logically follows the first. This feature demonstrates how the model can be applied in real-world settings to assess sentence coherence:

In [None]:
user_sentence_a = input("Sentence A (first sentence): ")
user_sentence_b = input("Sentence B (second sentence): ")
pred, _, _ = combined_predict(user_sentence_a, user_sentence_b)

if pred:
    print("Coherent — the second sentence logically follows the first.")
else:
    print("❌  Incoherent — the second sentence does NOT follow the first.")

Sentence A (first sentence):  Hi, I am Sudip!
Sentence B (second sentence):  I learned a lot while doing this project!


❌  Incoherent — the second sentence does NOT follow the first.


## Use Case & Educational Value

This interactive mode is particularly valuable for:

- Children learning how to write coherent narratives or essays.

- Beginner English learners practicing sentence order and logical flow.

- Tutoring tools that offer instant, interpretable feedback for writing practice.

- Language learning platforms that help users improve sentence construction.

- Demo/testing purposes to showcase how sentence coherence can be modeled computationally.

## Real-World Integration

This logic could easily be embedded into a website or educational application, with a simple interface allowing users to:

- Type two sentences,

- Click a “Check Coherence” button,

- And receive a verdict like:

“Coherent — good continuation!”

“Incoherent — try revising your second sentence.”

Such tools can encourage self-guided learning, improve feedback loops for writing assignments, and introduce AI-powered writing aids into classrooms or online platforms.

