# NoLimit Data Scientist Technical Test - RAG Chatbot

## 0. Installing dependencies

In [None]:
!pip install --upgrade pymupdf
!pip install tools

In [None]:
import os
import re
import fitz

## 1. Extracting text from the PDFs

In [None]:
# Clone the GitHub repo to access the PDF files
!git clone https://github.com/salmadanu/nolimit-ds-test-salmanadhira.git

In [None]:
def extract_text_from_pdf_folder(folder_path):
    """Return {filename: {page_number: text}} for every PDF in folder_path."""
    texts = {}
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(folder_path, filename)

            # Extract page by page so sources can be cited later on
            page_texts = {}
            with fitz.open(pdf_path) as doc:  # closes the file even on errors
                for page_num, page in enumerate(doc, start=1):
                    page_texts[page_num] = page.get_text()

            texts[filename] = page_texts
    return texts

In [None]:
folder_path = "/content/nolimit-ds-test-salmanadhira/dataset"
pdf_texts = extract_text_from_pdf_folder(folder_path)

# Check
sample_file = list(pdf_texts.keys())[0]
print(f"Sample File: {sample_file}")
print(pdf_texts[sample_file][1][:500])

Sample File: Fake_News_Stance_Detection_Using_Deep_Learning_Architecture_CNN-LSTM.pdf
Received August 10, 2020, accepted August 24, 2020, date of publication August 26, 2020, date of current version September 9, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3019735
Fake News Stance Detection Using Deep
Learning Architecture (CNN-LSTM)
MUHAMMAD UMER
1, ZAINAB IMTIAZ
1, SALEEM ULLAH
1,
ARIF MEHMOOD
2, GYU SANG CHOI
3, AND BYUNG-WON ON4
1Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
2Departmen


## 2. Preprocessing the data
Remove back matter (acknowledgments, references, bibliography) and normalize whitespace.

In [None]:
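def preprocess_text(text):
    # Keep only the text before the first back-matter keyword on this page.
    # Applied per page, so later pages of a reference list are not dropped.
    pattern = r"\b(?:Acknowledgment|Acknowledgement|Acknowledgements|References|Bibliography)\b"
    text = re.split(pattern, text, flags=re.IGNORECASE)[0]
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()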
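
Because `preprocess_text` runs on one page at a time, pages that come after the start of a reference list are kept in full. The following is a minimal sketch of a document-level variant that stops at the first back-matter keyword and drops all later pages. The name `truncate_at_back_matter` and its pattern are hypothetical and not part of the original notebook; the pattern is slightly broadened to also catch "Acknowledgments".

```python
import re

# Hypothetical pattern: the same keywords as preprocess_text, matched
# case-insensitively, plus the spelling "Acknowledgments".
BACK_MATTER = re.compile(
    r"\b(?:Acknowledge?ments?|References|Bibliography)\b", re.IGNORECASE
)

def truncate_at_back_matter(pages):
    """Keep text up to the first back-matter keyword, then drop later pages.

    `pages` maps page number -> raw page text for a single file, in the
    shape produced by extract_text_from_pdf_folder.
    """
    kept = {}
    for page_num in sorted(pages):
        text = pages[page_num]
        match = BACK_MATTER.search(text)
        if match:
            # Keep the part of this page that precedes the keyword.
            kept[page_num] = re.sub(r"\s+", " ", text[:match.start()]).strip()
            break  # every later page is treated as back matter
        kept[page_num] = re.sub(r"\s+", " ", text).strip()
    return kept
```

Like the per-page version, this sketch cuts at the first keyword match, so a body sentence that merely mentions "references" would end the document early. Restricting the match to heading-like lines would be one way to reduce that risk.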

In [None]:
preprocessed_texts = {}
for filename, pages in pdf_texts.items():
    preprocessed_texts[filename] = {}
    for page_num, text in pages.items():
        preprocessed_texts[filename][page_num] = preprocess_text(text)

# Check
sample_file = list(preprocessed_texts.keys())[0]
sample_page = list(preprocessed_texts[sample_file].keys())[0]
print(f"Sample File: {sample_file}")
print(f"Sample Page: {sample_page}")
print(preprocessed_texts[sample_file][sample_page][:500])

Sample File: Fake_News_Stance_Detection_Using_Deep_Learning_Architecture_CNN-LSTM.pdf
Sample Page: 1
Received August 10, 2020, accepted August 24, 2020, date of publication August 26, 2020, date of current version September 9, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3019735 Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM) MUHAMMAD UMER 1, ZAINAB IMTIAZ 1, SALEEM ULLAH 1, ARIF MEHMOOD 2, GYU SANG CHOI 3, AND BYUNG-WON ON4 1Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan 2Departmen


## 3. Chunking

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

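def chunk_pdfs(pdf_texts, chunk_size=400, chunk_overlap=50):
    """Split every page into overlapping chunks, keeping file and page metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # measured in characters (length_function=len)
        chunk_overlap=chunk_overlap,
        length_function=len,
        # Whitespace was collapsed during preprocessing, so in practice only
        # the " " and "" separators can match.
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = []
    for filename, pages in pdf_texts.items():
        for page_num, text in pages.items():
            page_chunks = splitter.split_text(text)
            for i, chunk in enumerate(page_chunks):
                chunks.append({
                    "filename": filename,
                    "page_number": page_num,
                    "chunk_id": i,  # index within the page, not globally unique
                    "text": chunk
                })
    return chunks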
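
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented (that splitter prefers to break on separators such as spaces); the function `sliding_window_chunks` is a hypothetical stand-in that cuts purely by character position.

```python
def sliding_window_chunks(text, chunk_size=400, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Unlike
    RecursiveCharacterTextSplitter, this sketch ignores word boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the notebook's defaults, each new window starts 350 characters after the previous one, so a sentence that straddles a boundary is likely to appear whole in at least one chunk.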

In [None]:
chunks = chunk_pdfs(preprocessed_texts)

print(f"Total chunks: {len(chunks)}")
print(chunks[0])

Total chunks: 5987
{'filename': 'Fake_News_Stance_Detection_Using_Deep_Learning_Architecture_CNN-LSTM.pdf', 'page_number': 1, 'chunk_id': 0, 'text': 'Received August 10, 2020, accepted August 24, 2020, date of publication August 26, 2020, date of current version September 9, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3019735 Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM) MUHAMMAD UMER 1, ZAINAB IMTIAZ 1, SALEEM ULLAH 1, ARIF MEHMOOD 2, GYU SANG CHOI 3, AND BYUNG-WON ON4 1Department of Computer Science, Khwaja'}


## 4. Adding the chunks to a vector database

In [None]:
!pip install faiss-cpu langchain sentence-transformers langchain_huggingface

In [None]:
!pip install -U langchain-community

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

def build_faiss_vectorstore(chunks):
    texts = [chunk["text"] for chunk in chunks]
    metadatas = [
        {"filename": c["filename"], "page_number": c["page_number"], "chunk_id": c["chunk_id"]}
        for c in chunks
    ]

    vectorstore = FAISS.from_texts(texts, embedding_model, metadatas=metadatas)
    return vectorstore

vectorstore = build_faiss_vectorstore(chunks)

In [None]:
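vectorstore.save_local("faiss_index")
# The docstore is stored with pickle, so only enable
# allow_dangerous_deserialization for index files you created yourself.
vectorstore = FAISS.load_local("faiss_index", embedding_model, allow_dangerous_deserialization=True)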

In [None]:
# Check whether the retrieved documents are relevant
query = "What is stance detection?"
results = vectorstore.similarity_search(query, k=10)

for res in results:
    print(res.page_content[:200])
    print(res.metadata)

stance detection. Stance detection is a task to automatically determine whether the author of a text supports, opposes, or is neutral to the proposition or target in the test [6]. Stance detection is 
{'filename': '1-s2.0-S1877050921023449-main.pdf', 'page_number': 2, 'chunk_id': 7}
Bontcheva, ‘‘Stance detection with bidirectional conditional encoding,’’ in Proc. 2016 Conf. Empirical Methods Natural Lang. Process., Austin, TX, USA, Nov. 2016, pp. 876–885. [Online]. Available: htt
{'filename': 'Fake_News_Stance_Detection_Using_Deep_Learning_Architecture_CNN-LSTM.pdf', 'page_number': 11, 'chunk_id': 15}
in the feature-based machine learning approach using Support Vector Machine is by far the most commonly employed feature-based machine learning approach for stance detection. Support Vector Machine is
{'filename': '1-s2.0-S1877050921023449-main.pdf', 'page_number': 2, 'chunk_id': 9}
help alleviate the burdensome and time-consuming human activity of fact checking [10], [11]. Despite that, 
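
The check above relies on `similarity_search`. Assuming the vector store uses a flat L2 index (which, to my understanding, is what `FAISS.from_texts` builds by default), retrieval amounts to ranking every stored vector by its Euclidean distance to the query embedding. The sketch below shows that idea in plain Python on toy vectors; `l2_top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
def l2_top_k(query_vec, doc_vecs, k=3):
    """Return (index, squared L2 distance) for the k vectors closest to query_vec."""
    scored = []
    for i, vec in enumerate(doc_vecs):
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        scored.append((i, dist))
    # Smaller distance means more similar.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy 2-D vectors standing in for the 768-dimensional embeddings.
docs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_top_k([1.0, 0.0], docs, k=2))
```

A ranking like this only measures closeness in embedding space, which is why a bibliography entry that mentions "stance detection" can score as well as a passage that defines the term.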

In [None]:
# Check the embedding dimension (to debug an AssertionError at deployment)
sample_vector = embedding_model.embed_query("test query")
print("Embedding dimension:", len(sample_vector))

Embedding dimension: 768


In [None]:
import faiss
print("FAISS index dimension:", vectorstore.index.d)

FAISS index dimension: 768
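
The two cells above compare the dimensions by eye. A small guard could make the same check explicit before serving queries. This is a sketch under the assumption that the AssertionError mentioned above comes from a mismatch between the query embedding length and the index dimension; `check_dimensions` is a hypothetical helper.

```python
def check_dimensions(embedding_dim, index_dim):
    """Raise a descriptive error if query embeddings cannot be searched in the index."""
    if embedding_dim != index_dim:
        raise ValueError(
            f"Embedding dimension {embedding_dim} does not match "
            f"FAISS index dimension {index_dim}; was the index built "
            "with a different embedding model?"
        )
    return True

# In the notebook this could be called as (assumed usage):
# check_dimensions(len(embedding_model.embed_query("test query")), vectorstore.index.d)
check_dimensions(768, 768)
```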


## 5. Querying the LLM
Generate an answer to the query from the retrieved chunks.

In [None]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline

from transformers import pipeline

qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-large",
    device=0
)

llm = HuggingFacePipeline(pipeline=qa_pipeline)

template = """
You are an academic assistant helping summarize research papers.
Use the provided CONTEXT to answer the QUESTION clearly and concisely.
- Write the answer in well-formed sentences, even if the context has fragmented text.
- Do not copy broken words or incomplete phrases directly from the context.
- If the question is about methods or models, list them clearly and EXPLAIN their purpose.
- If needed, cite authors or papers mentioned in the context.
- If the answer cannot be found in the context, say "The context does not provide enough information."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:
"""

def format_doc(doc):
    meta = doc.metadata
    source = f"(Source: {meta.get('filename', 'unknown')}, page {meta.get('page_number', '?')})"
    return f"{doc.page_content}\n{source}"

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
qa_chain = LLMChain(llm=llm, prompt=prompt)

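def answer_query(query, vectorstore, k=3, fetch_k=20):
    # MMR picks k diverse chunks out of the fetch_k nearest candidates.
    # A small k helps keep the prompt near the model's 512-token input limit.
    results = vectorstore.max_marginal_relevance_search(query, k=k, fetch_k=fetch_k)
    context = "\n\n".join(format_doc(doc) for doc in results)
    answer = qa_chain.run({"context": context, "question": query})
    return answer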

Device set to use cuda:0
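
`answer_query` retrieves with maximal marginal relevance (MMR) rather than plain similarity. The sketch below shows the greedy selection rule in plain Python, assuming cosine similarity and a trade-off weight `lambda_mult` of 0.5. It illustrates the technique and is not LangChain's actual code; `cosine` and `mmr_select` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, cand_vecs, k=3, lambda_mult=0.5):
    """Greedy maximal marginal relevance over candidate vectors.

    Each step picks the candidate that maximizes
    lambda_mult * sim(query, c) - (1 - lambda_mult) * max sim(c, selected).
    """
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, cand_vecs[i])
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]
print(mmr_select([1.0, 0.0], candidates, k=2))
```

Because the selection is greedy, the first few picks do not depend on how many results are requested in total. Requesting more only appends further, less relevant or more redundant chunks.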


### Q&A

In [None]:
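def clean_answer(text: str) -> str:
    # Rejoin words hyphenated across line breaks ("detec- tion" -> "detection").
    # Caveat: this also merges suspended hyphens ("left- or" -> "leftor").
    text = re.sub(r"-\s+", "", text)
    # Collapse repeated whitespace.
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()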

In [None]:
query = "What is framing analysis in computational media studies?"
answer = answer_query(query, vectorstore)
answer = clean_answer(answer)
print(answer[:1].upper() + answer[1:])  # slicing avoids an IndexError on an empty answer

Framing theory is one of the most popular theoretical frameworks in communication research. Generally speaking, the (Source: 10346313.pdf, page 3) as defined in this article. Lastly, with the support of a research grant and a cross-disciplinary team, our system aims to make computational framing analysis accessible to researchers with limited experience in computer science. Through a click-and-run web-based system, users can follow the guidance on the website and run advanced computational analysis step-by-step. We also make it our (Source: 10346313.pdf, page 8) of computational communication research will also be discussed. Mapping the Field: Computational Framing Analysis The field of computational communication research comprises work by scholars of various disciplinary backgrounds, research perspectives, and methodological approaches. One consistent criticism of this type of research is its lack of contributions to journalism and communication (Source: 10346313.pdf, page 8)


In [None]:
query = "How is propaganda detection defined in computational linguistics?"
answer = answer_query(query, vectorstore)
answer = clean_answer(answer)
print(answer[:1].upper() + answer[1:])

The intentional influencing of someone’s opinion using various rhetorical and psychological techniques (Da San Martino et al., 2020b). Propaganda uses techniques such as loaded language (using words or phrases with strong emotional connotations to influence an audience’s opinion) or flag waving (associating oneself or one’s cause with patriotism or a national symbol to gain support).


In [None]:
query = "What deep learning models are commonly applied to propaganda detection?"
answer = answer_query(query, vectorstore)
answer = clean_answer(answer)
print(answer[:1].upper() + answer[1:])

CNN, RNN and LSTM


In [None]:
query = "How can Twitter data be preprocessed for misinformation detection?"
answer = answer_query(query, vectorstore)
answer = clean_answer(answer)
print(answer[:1].upper() + answer[1:])

Token indices sequence length is longer than the specified maximum sequence length for this model (527 > 512). Running this sequence through the model will result in indexing errors


Most existing truth discovery methods focus on handling structured input in the form of Subject-Predicate-Object (SPO) tuples, while social media data is highly unstructured and noisy. Second, truth discovery methods can not be well applied when a (Source: 1708.01967v3.pdf, page 9) arXiv:1702.05638, 2017. [63] Martin Potthast, Sebastian K opsel, Benno Stein, and Matthias Hagen. Clickbait detection. In European Conference on Information Retrieval, pages 810–817. Springer, 2016. [64] Vahed Qazvinian, Emily Rosengren, Dragomir R Radev, and Qiaozhu Mei.
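
The warning above reports a 527-token prompt against the model's 512-token maximum, so part of the input may not be usable by the model. One possible mitigation is to drop trailing passages until the context fits a budget. The sketch below assumes a caller-supplied token counter; `fit_context` is a hypothetical helper, and its default whitespace count only approximates a real tokenizer. In the notebook, something like `lambda s: len(qa_pipeline.tokenizer(s)["input_ids"])` could be passed instead (an untested assumption).

```python
def fit_context(passages, budget, count_tokens=lambda s: len(s.split())):
    """Keep passages, in order, while their combined token count stays within budget.

    count_tokens defaults to a whitespace word count, which only approximates
    a real tokenizer; pass a counter based on the model's tokenizer in practice.
    """
    kept, used = [], 0
    for passage in passages:
        cost = count_tokens(passage)
        if used + cost > budget:
            break  # adding this passage would exceed the budget
        kept.append(passage)
        used += cost
    return kept
```

The budget should be smaller than the model limit, since the instructions and the question in the prompt template also consume tokens.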


In [None]:
query = "How is framing analysis applied to coverage of international conflicts?"
answer = answer_query(query, vectorstore)
answer = clean_answer(answer)
print(answer[:1].upper() + answer[1:])

Journalists make their best efforts to pursue objectivity, media framing often favors one side over another in political disputes, thus always resulting in some degree of bias (Entman, 2010). Hence, a news framing analysis is helpful because it not only tells us whether a news article is leftor right-leaning (or positive or negative), but also reveals how the article is structured to promote a
