<a href="https://colab.research.google.com/github/tcharos/NLP-Toxicity-Detection/blob/main/AIDL_CS01_NLP_Project_task_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install required packages (scikit-learn is used later for cosine similarity)
!pip install langchain-community langchain-text-splitters pypdf sentence-transformers numpy scikit-learn


In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel
import torch

### 1. Load and segment a PDF

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
online_document = "https://academicweb.nd.edu/~lemmon/courses/deep-learning/lecture-book/deep-learning-book-2025.pdf"

loader = PyPDFLoader(online_document) # define loader
pages = loader.load() # load

print(f"Loaded {len(pages)} pages")
print(f"First page preview:\n{pages[0].page_content[:500]}")

In [None]:
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len
)  # define splitter: 1000-character chunks with a 20-character (2%) overlap

chunks = splitter.split_documents(pages)

print(f"\nCreated {len(chunks)} chunks")
print(f"First chunk:\n{chunks[0].page_content}")
print(f"Metadata: {chunks[0].metadata}")
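The `chunk_overlap` argument is measured in the same units as `length_function` (characters here), not as a percentage. The sketch below is a deliberately simplified fixed-window chunker, not the algorithm `RecursiveCharacterTextSplitter` actually uses, and the helper name `sliding_window_chunks` is hypothetical. It only illustrates what a character overlap between consecutive chunks means.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)

# Overlap expressed as a fraction of the chunk size
print(f"overlap of 20 chars: {20 / 1000:.0%}, overlap of 200 chars: {200 / 1000:.0%}")
```

With `chunk_size=1000`, an overlap of 20 characters is 2% of a chunk; a 20% overlap would correspond to `chunk_overlap=200`.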

### 2. Embed and store in a NumPy array

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

model_emb = SentenceTransformer('all-MiniLM-L6-v2')

# Extract text and generate embeddings
texts = [doc.page_content for doc in chunks]
embeddings_array = model_emb.encode(texts, convert_to_numpy=True)

# Save and reload example
np.save('embeddings.npy', embeddings_array)
embeddings_array = np.load('embeddings.npy')
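Saving only the embedding matrix means the chunk texts must be regenerated in exactly the same order before the row indices are meaningful again. One possible alternative, sketched here with toy data, is to store both in a single `.npz` archive with `np.savez`. The names `demo_texts`, `demo_embeddings`, and `index.npz` are placeholders for illustration, not part of the notebook.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for `texts` and `embeddings_array` (hypothetical values)
demo_texts = ["first chunk", "second chunk", "third chunk"]
demo_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

# Write both arrays into one archive so rows and texts stay aligned
path = os.path.join(tempfile.mkdtemp(), "index.npz")
np.savez(path, embeddings=demo_embeddings, texts=np.array(demo_texts))

# Reload and confirm that row i of the matrix still belongs to text i
with np.load(path) as data:
    loaded_embeddings = data["embeddings"]
    loaded_texts = data["texts"].tolist()

print(loaded_embeddings.shape, loaded_texts)
```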

### 3. Use an S/LLM from Hugging Face for paraphrasing

In [None]:
# Use a model from Hugging Face; see a model card for usage, e.g. https://huggingface.co/google/gemma-2b,
# or select one from Unsloth
# Paraphrase the question 2 times (create the necessary prompt)

In [17]:
# Load model and tokenizer
model_llm, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model_llm)

def get_paraphrases(question, n=2):
    prompt = f"Paraphrase the following question in {n} different ways. Return only the paraphrased questions, one per line:\n\nQuestion: {question}"
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model_llm.generate(**inputs, max_new_tokens=100)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
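    # `decoded` contains the prompt followed by the generated text, so keep only the last n lines
    # (a simple heuristic; adjust it to the model's actual output format)
    return [question] + decoded.split('\n')[-n:]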

question = "what is an RNN?"
paraphrased_queries = get_paraphrases(question)

==((====))==  Unsloth 2026.1.2: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [18]:
print("--- Paraphrased Questions ---")
for i, query in enumerate(paraphrased_queries, 1):
    print(f"Paraphrase {i}: {query}")

--- Paraphrased Questions ---
Paraphrase 1: what is an RNN?
Paraphrase 2: 1. What is the term for a particular kind of recurrent neural network that is often used in machine learning applications?
Paraphrase 3: 2. What is a type of neural network design that is optimized for handling sequential data?
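The output above shows that the model numbers its paraphrases ("1.", "2."), and those markers end up inside the queries that get embedded. A small post-processing helper could strip them. The sketch below is one possible approach; `parse_paraphrases` is a hypothetical name, and it assumes the paraphrases are the last `n` non-empty lines of the generated text.

```python
import re


def parse_paraphrases(generated: str, n: int = 2) -> list[str]:
    """Return the last n non-empty lines with leading list markers removed."""
    cleaned = []
    for line in generated.splitlines():
        # Drop leading markers such as "1. ", "2) ", "- " or "* "
        line = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", line).strip()
        if line:
            cleaned.append(line)
    return cleaned[-n:]


# Generated lines copied from the output above
raw = (
    "1. What is the term for a particular kind of recurrent neural network "
    "that is often used in machine learning applications?\n"
    "\n"
    "2. What is a type of neural network design that is optimized for handling sequential data?\n"
)
for paraphrase in parse_paraphrases(raw):
    print(paraphrase)
```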


### 4. Retrieve the 5 most semantically similar chunks (cosine similarity) for every paraphrase, then apply a 0.3 threshold and select the top 3

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

all_indices = []
threshold = 0.3

for q in paraphrased_queries:
    q_emb = model_emb.encode([q])
    sims = cosine_similarity(q_emb, embeddings_array)[0]

    # Get indices of top 5 that also meet threshold
    top_5_idx = sims.argsort()[-5:][::-1]
    filtered_idx = [i for i in top_5_idx if sims[i] >= threshold]
    all_indices.extend(filtered_idx)

# Unique indices and select final top 3 (simplification: take the first 3 unique)
unique_top_indices = list(dict.fromkeys(all_indices))[:3]
retrieved_context = "\n".join([chunks[i].page_content for i in unique_top_indices])
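For reference, the top-k plus threshold step can also be written with NumPy alone, since cosine similarity is the dot product of L2-normalized vectors. This is a minimal sketch on toy 2-D vectors; `top_k_above_threshold` is a hypothetical helper, and it assumes no row is the zero vector.

```python
import numpy as np


def top_k_above_threshold(query_vec, matrix, k=5, threshold=0.3):
    """Return (index, score) pairs for the k most similar rows, best first,
    dropping any score below the threshold."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(sims)[::-1][:k]  # indices of the k highest scores
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]


# Toy "embeddings": identical, orthogonal, diagonal and opposite to the query
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
hits = top_k_above_threshold(np.array([1.0, 0.0]), toy, k=3, threshold=0.3)
print(hits)
```

Returning the score together with the index makes it possible to rank candidates across several queries later, instead of relying on insertion order.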

In [20]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

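# 1. Build the list of queries to search with.
# Note: paraphrased_queries already starts with the original question, so the question
# is searched twice here; the duplicate hits are removed in step 2 below.
all_queries = [question] + paraphrased_queries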
candidate_pool = []
threshold = 0.3

print(f"--- Verification: Searching for {len(all_queries)} query variations ---")

for q in all_queries:
    # Embed the specific query
    q_emb = model_emb.encode([q])

    # Calculate cosine similarity against all chunk embeddings
    sims = cosine_similarity(q_emb, embeddings_array)[0]

    # Get indices of top 5 matches
    top_5_idx = sims.argsort()[-5:][::-1]

    # Filter by threshold and print findings for verification
    for idx in top_5_idx:
        score = sims[idx]
        if score >= threshold:
            candidate_pool.append((idx, score))
            print(f"Match Found! Query: '{q[:30]}...' | Score: {score:.4f} | Index: {idx}")

# 2. Sort all matches by score and remove duplicates
candidate_pool.sort(key=lambda x: x[1], reverse=True)

final_indices = []
seen_indices = set()

for idx, score in candidate_pool:
    if idx not in seen_indices:
        final_indices.append(idx)
        seen_indices.add(idx)
    if len(final_indices) == 3: # Limit to top 3 as per requirements
        break

# 3. Final Verification Print
print(f"\n--- Final Selection (Top {len(final_indices)} Unique Chunks) ---")
for i, idx in enumerate(final_indices):
    content_preview = chunks[idx].page_content[:150].replace('\n', ' ')
    print(f"{i+1}. [Index {idx}] Content: {content_preview}...")

# Store for Step 5
retrieved_context = "\n\n".join([chunks[i].page_content for i in final_indices])

--- Verification: Searching for 4 query variations ---
Match Found! Query: 'what is an RNN?...' | Score: 0.6279 | Index: 403
Match Found! Query: 'what is an RNN?...' | Score: 0.4881 | Index: 407
Match Found! Query: 'what is an RNN?...' | Score: 0.4469 | Index: 487
Match Found! Query: 'what is an RNN?...' | Score: 0.4154 | Index: 485
Match Found! Query: 'what is an RNN?...' | Score: 0.4137 | Index: 416
Match Found! Query: 'what is an RNN?...' | Score: 0.6279 | Index: 403
Match Found! Query: 'what is an RNN?...' | Score: 0.4881 | Index: 407
Match Found! Query: 'what is an RNN?...' | Score: 0.4469 | Index: 487
Match Found! Query: 'what is an RNN?...' | Score: 0.4154 | Index: 485
Match Found! Query: 'what is an RNN?...' | Score: 0.4137 | Index: 416
Match Found! Query: '1. What is the term for a part...' | Score: 0.6313 | Index: 403
Match Found! Query: '1. What is the term for a part...' | Score: 0.5886 | Index: 416
Match Found! Query: '1. What is the term for a part...' | Score: 0.5875 | I
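The sort-then-deduplicate logic can also be expressed as a small reusable function that keeps each chunk's best score across all queries. The sketch below is one way to do it; `merge_candidates` is a hypothetical name, and the sample scores are only the complete lines of the verification output above, so the result is not the notebook's actual final selection.

```python
def merge_candidates(per_query_hits, top_n=3):
    """Keep each chunk's best score across all queries and return the
    top_n (index, score) pairs, best first."""
    best = {}
    for hits in per_query_hits:
        for idx, score in hits:
            if idx not in best or score > best[idx]:
                best[idx] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# A subset of the (index, score) pairs printed in the verification output
per_query_hits = [
    [(403, 0.6279), (407, 0.4881), (487, 0.4469), (485, 0.4154), (416, 0.4137)],
    [(403, 0.6313), (416, 0.5886)],
]
merged = merge_candidates(per_query_hits)
print(merged)
```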

In [None]:
# Re-ranking (not mandatory)

# ----------- Disclaimer - Not my code (AI generated), but I wanted to try it out -----------

from sentence_transformers import CrossEncoder

rerank_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

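original_query = all_queries[0]

# retrieved_context is a single joined string, so iterating over it would yield characters.
# Rebuild the list of candidate chunk texts from final_indices instead.
candidate_chunks = [chunks[i].page_content for i in final_indices]
pairs = [[original_query, chunk] for chunk in candidate_chunks]

# Score each (query, chunk) pair with the cross-encoder
rerank_scores = rerank_model.predict(pairs)

reranked_results = sorted(
    zip(candidate_chunks, rerank_scores),
    key=lambda x: x[1],
    reverse=True
)

final_retrieved_context = [chunk for chunk, score in reranked_results[:3]]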

print("Re-ranking complete. Chunks are now ordered by relevance to the original question.")

# ----------- Disclaimer - Not my code (AI generated) -----------

### 5. Apply the augmentation phase using an SLM (could be the one used in step 3 or the one aligned with DPO)

In [21]:
final_prompt = f"""Use the following context to answer the question.
Context: {retrieved_context}
Question: {paraphrased_queries[0]}
Answer:"""

inputs = tokenizer([final_prompt], return_tensors="pt").to("cuda")
outputs = model_llm.generate(**inputs, max_new_tokens=250)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Use the following context to answer the question.
Context: in more detail.
FIGURE 4. Training Curves for model MSE. (left) 1D Con-
volutional Model, (middle) Sequential Model, (right) LSTM
model)
2. Recurrent Neural Networks
A recurrent neural network (RNN) is a model architecture where the output
of a hidden node not only depends on the input, but also on the ”past” output
from the hidden node. You can therefore think of an RNN as a dynamical
system in which the activation levels of the hidden layers are the system’s
states. This means that the computation graph of an RNN has self loops in
contrast to the graphs of feedforward sequential models.

202 6. DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING
There are two useful variations on the LSTM that are frequently used;
gated recurrent units (GRU) and bidirectional LSTMs. A GRU [ CGCB14,
CVMG+14] may be seen as a simplified version an LSTM whose operation
can again be explained in terms of ”gating layers”. Keras also has a GRU
layer whos