
# Setting up the environment

- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)


## Installing Haystack


In [None]:
%%bash

pip install --upgrade pip
# Quote the requirement so the shell never treats the brackets as a glob pattern
pip install "farm-haystack[colab,preprocessing,elasticsearch,metrics,inference]"

Collecting pip
  Downloading pip-24.1-py3-none-any.whl (1.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 10.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.1
Collecting farm-haystack[colab,elasticsearch,inference,metrics,preprocessing]
  Downloading farm_haystack-1.26.2-py3-none-any.whl.metadata (31 kB)
Collecting boilerpy3 (from farm-haystack[colab,elasticsearch,inference,metrics,preprocessing])
  Downloading boilerpy3-1.0.7-py3-none-any.whl.metadata (5.8 kB)
Collecting events (from farm-haystack[colab,elasticsearch,inference,metrics,preprocessing])
  Downloading Events-0.5-py3-none-any.whl.metadata (3.9 kB)
Collecting httpx (from farm-haystack[colab,elasticsearch,inference,metrics,preprocessing])
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting lazy-imports==0.3.1 (fro



In [None]:
from haystack.telemetry import tutorial_running

tutorial_running(5)

Set the logging level of the `haystack` logger to INFO; all other loggers stay at WARNING:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
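The effect of these two calls can be checked with the standard library alone. The sketch below repeats the configuration and then inspects the effective level of a child logger; `some_other_library` is a hypothetical logger name used only for illustration.

```python
import logging

# Root logger: warnings and above, using the same format string as the cell above.
logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
# The "haystack" logger, and every logger below it, logs at INFO.
logging.getLogger("haystack").setLevel(logging.INFO)

# A logger without its own level inherits from the nearest configured ancestor.
haystack_child = logging.getLogger("haystack.nodes.retriever")
other_library = logging.getLogger("some_other_library")  # hypothetical name

print(logging.getLevelName(haystack_child.getEffectiveLevel()))
print(logging.getLevelName(other_library.getEffectiveLevel()))
```

The child logger reports INFO because it inherits from `haystack`; loggers outside that hierarchy fall back to whatever level the root logger has.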

In [None]:
# Here are some imports that we'll need

from haystack.nodes import DensePassageRetriever
from haystack.utils import fetch_archive_from_http
from haystack.document_stores import InMemoryDocumentStore

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Variables for fine-tuning
doc_dir = "/content"
train_filename = "DPRtrain.json"
dev_filename = "DPRuji.json"

query_model = "firqaaa/indo-dpr-question_encoder-single-squad-base"
passage_model = "firqaaa/indo-dpr-ctx_encoder-single-squad-base"

save_dir = "/content/drive/MyDrive/models"
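`DPRtrain.json` and `DPRuji.json` are expected to be in the DPR training format: a JSON list of records, each with a question, its answers, and lists of positive, negative, and hard negative passages. The sketch below is a small sanity check for one such record, assuming that layout. The helper name `check_dpr_record` and the sample record are illustrative only and are not taken from the actual files.

```python
# Keys of one record in the DPR training format (assumed layout).
REQUIRED_KEYS = {"question", "answers", "positive_ctxs", "negative_ctxs", "hard_negative_ctxs"}
CTX_FIELDS = ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs")


def check_dpr_record(record, num_positives=1, num_hard_negatives=1):
    """Return a list of problems found in one DPR-style training record."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if len(record.get("positive_ctxs", [])) < num_positives:
        problems.append("fewer positive passages than num_positives")
    if len(record.get("hard_negative_ctxs", [])) < num_hard_negatives:
        problems.append("fewer hard negative passages than num_hard_negatives")
    for field in CTX_FIELDS:
        for ctx in record.get(field, []):
            if "text" not in ctx:
                problems.append(f"{field} entry without a 'text' field")
    return problems


# Made-up sample record, for illustration only.
sample = {
    "question": "apa tujuan utama data science",
    "answers": ["menganalisis data"],
    "positive_ctxs": [{"title": "", "text": "tujuan utama data science adalah menganalisis data"}],
    "negative_ctxs": [],
    "hard_negative_ctxs": [{"title": "", "text": "ilmu komputer digunakan untuk pengenalan pola"}],
}

print(check_dpr_record(sample))
```

Running a check like this over every record before training makes it easier to spot entries that cannot supply the requested number of positive or hard negative passages.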

## Start an Elasticsearch server

You can start Elasticsearch on your local machine using Docker:

In [None]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()



If Docker is not readily available in your environment (e.g., in Colab notebooks), you can instead download the Elasticsearch release archive and run it manually:

In [None]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

Wait 30 seconds to make sure Elasticsearch is ready before continuing:

In [None]:
import time

time.sleep(30)
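A fixed sleep can be too short on a slow machine and wastefully long on a fast one. As an alternative, the sketch below polls until a probe succeeds or a deadline passes. It assumes Elasticsearch answers plain HTTP on `http://localhost:9200`; the helper names `elasticsearch_is_up` and `wait_until_ready` are hypothetical and not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on the given URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: the server is not ready yet.
        return False


def wait_until_ready(probe, max_wait=60.0, interval=1.0):
    """Call probe() until it returns True or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage (assumed address): wait_until_ready(elasticsearch_is_up, max_wait=120)
```

Passing the probe in as a function keeps the waiting logic independent of Elasticsearch, so the same loop can be tested or reused with any readiness check.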

In [None]:
import os

from haystack.document_stores import ElasticsearchDocumentStore


# Make sure these indices do not collide with existing ones; they are wiped clean before data is inserted
doc_index = "docs"
label_index = "labels"

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index=doc_index,
    label_index=label_index,
    embedding_field="emb",
    embedding_dim=768,
    excluded_meta_data=["emb"],
)

# Fine-tuning the DPR model

In [None]:
# Initialize the single-squad version of the DPR model

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="firqaaa/indo-dpr-question_encoder-single-squad-base",
    passage_embedding_model="firqaaa/indo-dpr-ctx_encoder-single-squad-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=16,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True
)

In [None]:
# Initialize the multiset version of the DPR model

retriever2 = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="firqaaa/indo-dpr-question_encoder-multiset-base",
    passage_embedding_model="firqaaa/indo-dpr-ctx_encoder-multiset-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True
)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO:haystack.modeling.model.language_model:Auto-detected model language: english


INFO:haystack.modeling.model.language_model:Auto-detected model language: english


In [None]:
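# Initialize torch.distributed as a single-process group.
# rank must be smaller than world_size, and init_process_group waits until
# world_size processes have joined, so one notebook process uses rank=0, world_size=1.
import torch.distributed as dist
import os

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'

dist.init_process_group("gloo", rank=0, world_size=1)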

## Fine-tuning experiments

In [None]:
# Train the single-squad retriever

retriever.train(
    data_dir=doc_dir,
    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=5,
    batch_size=16,
    grad_acc_steps=8,
    save_dir=save_dir,
    evaluate_every=1000,
    embed_title=True,
    num_positives=1,
    num_hard_negatives=1,
)

INFO:haystack.modeling.data_handler.data_silo:
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|
 (o)(o)------'\ _ /     ( )
 
INFO:haystack.modeling.data_handler.data_silo:LOADING TRAIN DATA
INFO:haystack.modeling.data_handler.data_silo:Loading train set from: /content/DPRtrain.json 
Preprocessing dataset: 100%|██████████| 3/3 [00:02<00:00,  1.20 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING DEV DATA
INFO:haystack.modeling.data_handler.data_silo:Loading dev set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  1.48 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING TEST DATA
INFO:haystack.modeling.data_handler.data_silo:Loading test set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  1.91 Dicts/s]
INFO:hay
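The `batch_size`, `grad_acc_steps`, `num_positives`, and `num_hard_negatives` arguments interact. The arithmetic below is a sketch under the assumption that training uses in-batch negatives, as in the original DPR setup, so every passage in a batch that is not a question's own positive acts as a negative for that question.

```python
# Values taken from the retriever.train(...) call above.
batch_size = 16
grad_acc_steps = 8
num_positives = 1
num_hard_negatives = 1

# Gradients are accumulated over grad_acc_steps batches before each optimizer step.
effective_batch_size = batch_size * grad_acc_steps

# Passages encoded in one forward pass.
passages_per_batch = batch_size * (num_positives + num_hard_negatives)

# Assumption: in-batch negatives, so everything except the own positive is a negative.
negatives_per_question = passages_per_batch - num_positives

print(effective_batch_size)    # 128
print(passages_per_batch)      # 32
print(negatives_per_question)  # 31
```

Under this assumption, gradient accumulation raises the effective batch size to 128 without increasing the number of negatives each question is contrasted with, which stays tied to `batch_size`.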

In [None]:
# Train the multiset retriever

retriever2.train(
    data_dir=doc_dir,
    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=5,
    batch_size=16,
    grad_acc_steps=8,
    save_dir=save_dir,
    evaluate_every=500,
    embed_title=True,
    num_positives=1,
    num_hard_negatives=1,
    learning_rate=0.000001,
    weight_decay=0.0001
)

INFO:haystack.modeling.data_handler.data_silo:
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|
 (o)(o)------'\ _ /     ( )
 
INFO:haystack.modeling.data_handler.data_silo:LOADING TRAIN DATA
INFO:haystack.modeling.data_handler.data_silo:Loading train set from: /content/DPRtrain.json 
Preprocessing dataset: 100%|██████████| 3/3 [00:01<00:00,  1.75 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING DEV DATA
INFO:haystack.modeling.data_handler.data_silo:Loading dev set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  2.48 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING TEST DATA
INFO:haystack.modeling.data_handler.data_silo:Loading test set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  2.23 Dicts/s]
INFO:hay

In [None]:
retriever2.train(
    data_dir=doc_dir,
    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=5,
    batch_size=16,
    grad_acc_steps=8,
    save_dir=save_dir,
    evaluate_every=1000,
    embed_title=True,
    num_positives=1,
    num_hard_negatives=1,
    learning_rate=0.000001,
    weight_decay=0.0001
)

In [None]:
retriever2.train(
    data_dir=doc_dir,
    train_filename=train_filename,
    dev_filename=dev_filename,
    test_filename=dev_filename,
    n_epochs=5,
    batch_size=16,
    grad_acc_steps=8,
    save_dir=save_dir,
    evaluate_every=3000,
    embed_title=True,
    num_positives=1,
    num_hard_negatives=1,
)

INFO:haystack.modeling.data_handler.data_silo:
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|
 (o)(o)------'\ _ /     ( )
 
INFO:haystack.modeling.data_handler.data_silo:LOADING TRAIN DATA
INFO:haystack.modeling.data_handler.data_silo:Loading train set from: /content/DPRtrain.json 
Preprocessing dataset: 100%|██████████| 3/3 [00:03<00:00,  1.15s/ Dicts]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING DEV DATA
INFO:haystack.modeling.data_handler.data_silo:Loading dev set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  1.47 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING TEST DATA
INFO:haystack.modeling.data_handler.data_silo:Loading test set from: /content/DPRuji.json
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  1.45 Dicts/s]
INFO:hay

In [None]:
reloaded_retriever = DensePassageRetriever.load(load_dir=save_dir, document_store=document_store)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:DPR model loaded from /content/drive/MyDrive/models


# Building hybrid passage retrieval

## Storing data in the document store

In [None]:
# Path to the SQuAD-format dataset that will be indexed.
# Note: doc_dir now points to a single file, not to the training data directory.
doc_dir = "/content/drive/MyDrive/dataset/data_pelatihan.json"

In [None]:
from haystack.nodes import PreProcessor

# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to not have duplicate elements
# and also split our documents into shorter passages using the PreProcessor
preprocessor = PreProcessor(
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)
document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

# The add_eval_data() method converts the given dataset in JSON format into
# Haystack document and label objects. Those objects are then indexed in their
# respective document and label index in the document store.
# The method can be used with any dataset in SQuAD format.
document_store.add_eval_data(
    filename=doc_dir,
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor,
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1687.17docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 2169.84docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1855.07docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1769.00docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1982.19docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 2351.07docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1669.71docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1886.78docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1611.33docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 2055.02docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1932.86docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 2063.11docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 1453.33docs/s]
Preprocessing: 100%|██████████| 1/1 [00:00<00:00,
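The `PreProcessor` above is created without explicit split settings, so it falls back to its defaults. The sketch below illustrates word-based splitting under the assumption that those defaults are 200 words per passage with no overlap. It is a simplified stand-in rather than Haystack's implementation: it normalizes whitespace when joining words and ignores sentence boundaries.

```python
def split_by_words(text, split_length=200, split_overlap=0):
    """Split text into passages of at most split_length words each."""
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    chunks = []
    for split_id, start in enumerate(range(0, len(words), step)):
        chunk_words = words[start:start + split_length]
        chunks.append({"content": " ".join(chunk_words), "meta": {"_split_id": split_id}})
        if start + split_length >= len(words):
            break  # the last words are already covered by this chunk
    return chunks


# Illustration with a synthetic 450-word text.
text = " ".join(f"kata{i}" for i in range(450))
for chunk in split_by_words(text):
    print(chunk["meta"]["_split_id"], len(chunk["content"].split()))
```

Passages shorter than the split length, like most entries in a small dataset, come out as a single chunk with `_split_id` 0.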

## Initializing the two retriever components

In [None]:
# Initialize Retriever
from haystack.nodes import BM25Retriever, DensePassageRetriever

sparse_retriever = BM25Retriever(document_store=document_store)

# Alternative: Evaluate dense retrievers (EmbeddingRetriever or DensePassageRetriever)
# The EmbeddingRetriever uses a single transformer based encoder model for query and document.
# In contrast, DensePassageRetriever uses two separate encoders for both.

# Please make sure the "embedding_dim" parameter in the DocumentStore above matches the output dimension of your models!
# Please also take care that the PreProcessor splits your files into chunks that can be completely converted with
#        the max_seq_len limitations of Transformers
# The SentenceTransformer model "sentence-transformers/multi-qa-mpnet-base-dot-v1" generally works well with the EmbeddingRetriever on any kind of English text.
# For more information and suggestions on different models check out the documentation at: https://www.sbert.net/docs/pretrained_models.html

# from haystack.nodes import EmbeddingRetriever, DensePassageRetriever
# retriever = EmbeddingRetriever(document_store=document_store,
#                                embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")
dense_retriever = DensePassageRetriever.load(load_dir=save_dir, document_store=document_store)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.nodes.retriever.dense:DPR model loaded from /content/drive/MyDrive/models


In [None]:
document_store.update_embeddings(retriever=dense_retriever, index=doc_index)

INFO:haystack.document_stores.search_engine:Updating embeddings for all 246 docs ...
Updating embeddings:   0%|          | 0/246 [00:00<?, ? Docs/s]
Create embeddings:   0%|          | 0/256 [00:00<?, ? Docs/s]
Create embeddings:   6%|▋         | 16/256 [00:01<00:20, 11.90 Docs/s]
Create embeddings:  12%|█▎        | 32/256 [00:01<00:09, 23.97 Docs/s]
Create embeddings:  19%|█▉        | 48/256 [00:01<00:05, 35.29 Docs/s]
Create embeddings:  25%|██▌       | 64/256 [00:01<00:04, 45.28 Docs/s]
Create embeddings:  31%|███▏      | 80/256 [00:02<00:03, 53.65 Docs/s]
Create embeddings:  38%|███▊      | 96/256 [00:02<00:02, 59.99 Docs/s]
Create embeddings:  44%|████▍     | 112/256 [00:02<00:02, 65.39 Docs/s]
Create embeddings:  50%|█████     | 128/256 [00:02<00:01, 69.38 Docs/s]
Create embeddings:  56%|█████▋    | 144/256 [00:02<00:01, 72.41 Docs/s]
Create embeddings:  62%|██████▎   | 160/256 [00:03<00:01, 74.59 Docs/s]
Create embeddings:  69%|██████▉   | 176/25
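Once every document has an embedding, dense retrieval comes down to comparing a query vector with the stored passage vectors. The sketch below ranks passages by dot product, which is the similarity DPR was trained with. The three-dimensional vectors and document ids are made up for illustration; the real embeddings have 768 dimensions, matching `embedding_dim` above.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(query_vec, passage_vecs, top_k=5):
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, for illustration only.
passages = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.2, 0.8, 0.1],
    "doc-c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

for doc_id, score in rank_by_dot_product(query, passages, top_k=2):
    print(doc_id, round(score, 3))
```

In the notebook this comparison is delegated to Elasticsearch, which is why the embeddings must be written to the index before the dense retriever can be queried.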

## Initializing the JoinDocuments and reranker components

In [None]:
from haystack.nodes import JoinDocuments, SentenceTransformersRanker

join_documents = JoinDocuments(join_mode="concatenate")
rerank = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1



## Building the pipeline

In [None]:
from haystack.pipelines import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=sparse_retriever, name="SparseRetriever", inputs=["Query"])
pipeline.add_node(component=dense_retriever, name="DenseRetriever", inputs=["Query"])
pipeline.add_node(component=join_documents, name="JoinDocuments", inputs=["SparseRetriever", "DenseRetriever"])
pipeline.add_node(component=rerank, name="ReRanker", inputs=["JoinDocuments"])
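The data flow of this pipeline can be sketched in plain Python: both retrievers return ranked lists, the join step merges them, and the reranker rescores the merged candidates. The sketch assumes that "concatenate" keeps each document id once with its highest score. The function names, document ids, and scores are made up for illustration, and a dictionary lookup stands in for the cross-encoder.

```python
def join_concatenate(result_lists, top_k_join=None):
    """Merge ranked lists, keeping each document id once with its best score."""
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    merged = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return merged if top_k_join is None else merged[:top_k_join]


def rerank(candidates, score_fn, top_k):
    """Rescore candidates with score_fn and return the top_k best."""
    rescored = [(doc_id, score_fn(doc_id)) for doc_id, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Toy retriever outputs as (doc_id, score) pairs.
sparse_results = [("d1", 0.93), ("d2", 0.80)]
dense_results = [("d2", 0.71), ("d3", 0.65)]

# Stand-in for the cross-encoder: a fixed relevance score per document.
cross_encoder_scores = {"d1": 0.20, "d2": 0.99, "d3": 0.55}

joined = join_concatenate([sparse_results, dense_results], top_k_join=3)
reranked = rerank(joined, cross_encoder_scores.get, top_k=2)
print(joined)
print(reranked)
```

Because the reranker scores each candidate against the query from scratch, the final order can differ from the order produced by either retriever, as it does for `d1` in this toy example.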

## Running retrieval

In [None]:
def pretty_print_results(prediction):
    # Print id, score, content, and metadata of every returned document
    for doc in prediction["documents"]:
        print(doc.id, "\t", doc.score)
        print(doc.content)
        print(doc.meta)
        print("\n", "\n")

In [None]:
prediction = pipeline.run(
    query="siapa yang menulis paper bagus tentang data science atau sains data",
    params={
        "SparseRetriever": {"top_k": 1},
        "DenseRetriever": {"top_k": 1},
        # Keep at most 2 merged documents (one per retriever) before reranking
        "JoinDocuments": {"top_k_join": 2},
        "ReRanker": {"top_k": 1, "debug": True},
    },
)

In [None]:
prediction2 = pipeline.run(
    query="siapa yang menulis paper bagus tentang data science atau sains data",
    params={
        "SparseRetriever": {"top_k": 5},
        "DenseRetriever": {"top_k": 5},
        # Keep at most 10 merged documents (five per retriever) before reranking
        "JoinDocuments": {"top_k_join": 10},
        "ReRanker": {"top_k": 5, "debug": True},
    },
)

In [None]:
pretty_print_results(prediction2)

9f5a325f0a3e3aa15376fa412342656f-0 	 0.9995142221450806
 ada paper bagus yang ditulis oleh david donoho  2017   seorang profesor statistika dari stanford university yang bergelut dalam sains data donoho mempertegas bahwa tukey  1962  telah mendorong perlunya reformasi statistika dari fokus pada deskripsi dan inferensi menuju akuisisi data dan prediksi inilah yang kemudian dikenal sebagai data science atau sains data  clevaland  2001   yang menjadi istilah pertama kali untuk konsep tersebut data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah menganalisis data  menemukan pola  dan membuat prediksi di masa depan
{'_split_id': 0, '_split_offset': 0, 'document_id': 508}

 

9e10a7173d9194f36bcd731256aefdf3-0 	 0.9861878752708435
data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah meng

In [None]:
pretty_print_results(prediction)

9f5a325f0a3e3aa15376fa412342656f-0 	 0.9995142221450806
 ada paper bagus yang ditulis oleh david donoho  2017   seorang profesor statistika dari stanford university yang bergelut dalam sains data donoho mempertegas bahwa tukey  1962  telah mendorong perlunya reformasi statistika dari fokus pada deskripsi dan inferensi menuju akuisisi data dan prediksi inilah yang kemudian dikenal sebagai data science atau sains data  clevaland  2001   yang menjadi istilah pertama kali untuk konsep tersebut data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah menganalisis data  menemukan pola  dan membuat prediksi di masa depan
{'_split_id': 0, '_split_offset': 0, 'document_id': 508}

 



In [None]:
prediction2["_debug"]

{'JoinDocuments': {'input': {'documents': [<Document: {'content': ' ada paper bagus yang ditulis oleh david donoho  2017   seorang profesor statistika dari stanford university yang bergelut dalam sains data donoho mempertegas bahwa tukey  1962  telah mendorong perlunya reformasi statistika dari fokus pada deskripsi dan inferensi menuju akuisisi data dan prediksi inilah yang kemudian dikenal sebagai data science atau sains data  clevaland  2001   yang menjadi istilah pertama kali untuk konsep tersebut data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah menganalisis data  menemukan pola  dan membuat prediksi di masa depan', 'content_type': 'text', 'score': 0.9311534358016594, 'meta': {'_split_id': 0, '_split_offset': 0, 'document_id': 508}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '9f5a325f0a3e3aa15376fa412342656f-0'}>,
    <Document: {'content': 'data science atau sains 

In [None]:
prediction2["_debug"]

{'Query': {'input': {'debug': True}, 'output': {}, 'exec_time_ms': 0.26},
 'SparseRetriever': {'input': {'root_node': 'Query',
   'query': 'apa saja disiplin ilmu yang menjadi komponen utama dalam data science',
   'top_k': 5,
   'debug': True},
  'output': {'documents': [<Document: {'content': ' ada paper bagus yang ditulis oleh david donoho  2017   seorang profesor statistika dari stanford university yang bergelut dalam sains data donoho mempertegas bahwa tukey  1962  telah mendorong perlunya reformasi statistika dari fokus pada deskripsi dan inferensi menuju akuisisi data dan prediksi inilah yang kemudian dikenal sebagai data science atau sains data  clevaland  2001   yang menjadi istilah pertama kali untuk konsep tersebut data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah menganalisis data  menemukan pola  dan membuat prediksi di masa depan', 'content_type': 'text', 'score': 0.

# Evaluating hybrid passage retrieval

In [None]:
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)

In [None]:
eval_labels

[<MultiLabel: {'labels': [{'id': '3584e40c-e1f6-4c0a-9cdb-9934f6fe413e', 'query': 'apa tujuan utama data science ', 'document': {'id': '9e10a7173d9194f36bcd731256aefdf3-0', 'content': 'data science atau sains data adalah kombinasi dari beberapa disiplin ilmu  seperti ilmu komputer  matematika  dan statistik tujuan utama data science adalah menganalisis data  menemukan pola  dan membuat prediksi di masa depan dalam sains data  ilmu komputer digunakan untuk pengenalan pola  visualisasi  pergudangan data  dan komputasi kinerja tinggi matematika digunakan untuk pemodelan matematika  sedangkan statistik digunakan untuk pemodelan statistic dan stohestic serta probabilitas dengan menggunakan teori dan teknik dari berbagai bidang  data science membantu mengumpulkan  membersihkan  mengintegrasikan  menganalisis  memvisualisasikan  dan berinteraksi dengan data untuk menghasilkan produk data yang bermanfaat bagi para pengambil keputusan di berbagai industri seperti sains  teknik  ekonomi  politik

In [None]:
eval_result = pipeline.eval(labels=eval_labels, params= {"SparseRetriever": {"top_k": 1}, "DenseRetriever": {"top_k": 1}, "JoinDocuments": {"top_k_join": 2, "debug":True}, "ReRanker":{"top_k":1, "debug":True}})

In [None]:
eval_result.save("/content")

INFO:haystack.schema:Saving evaluation results to /content


In [None]:
eval_result.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.9919028340080972,
  'recall_single_hit': 0.9919028340080972,
  'precision': 0.07912645490044871,
  'map': 0.8488586813866329,
  'mrr': 0.8603120486776996,
  'ndcg': 0.8893290573112195},
 'DenseRetriever': {'recall_multi_hit': 0.9757085020242915,
  'recall_single_hit': 0.9757085020242915,
  'precision': 0.07469635627530366,
  'map': 0.7052182329438191,
  'mrr': 0.7426662518767781,
  'ndcg': 0.7830985207424542},
 'JoinDocuments': {'recall_multi_hit': 0.9919028340080972,
  'recall_single_hit': 0.9919028340080972,
  'precision': 0.04834569015375151,
  'map': 0.8483235238641795,
  'mrr': 0.8603120486776996,
  'ndcg': 0.8890518506610406},
 'ReRanker': {'recall_multi_hit': 0.9919028340080972,
  'recall_single_hit': 0.9919028340080972,
  'precision': 0.07672064777327936,
  'map': 0.8355740638129301,
  'mrr': 0.8890768588137009,
  'ndcg': 0.8891097103046356}}
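To make metrics such as `recall_single_hit`, `precision`, and `mrr` easier to read, the sketch below implements their common textbook definitions for ranked lists of document ids. This is an assumption about what the names mean; Haystack's exact computation (for example, how it treats multiple labels per query) may differ. The two example queries are made up.

```python
def recall_single_hit(ranked_ids, relevant_ids):
    """1.0 if at least one relevant document was retrieved, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids) else 0.0


def precision(ranked_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant."""
    if not ranked_ids:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in ranked_ids) / len(ranked_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Made-up results: (retrieved ids in rank order, set of relevant ids).
queries = [
    (["a", "b", "c"], {"b"}),  # relevant document found at rank 2
    (["x", "y", "z"], {"q"}),  # relevant document not retrieved
]

print(mean([recall_single_hit(r, rel) for r, rel in queries]))
print(mean([precision(r, rel) for r, rel in queries]))
print(mean([reciprocal_rank(r, rel) for r, rel in queries]))
```

With these definitions, a query that returns exactly one document scores either 0 or 1 on all three metrics at once, so recall, precision, and MRR coincide whenever every node is limited to a single result.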

In [None]:
eval_result.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.9990291262135922,
  'recall_single_hit': 0.9990291262135922,
  'precision': 0.3211650485436893,
  'map': 0.9684358144552319,
  'mrr': 0.9867313915857605,
  'ndcg': 0.9813553761972242},
 'DenseRetriever': {'recall_multi_hit': 0.945631067961165,
  'recall_single_hit': 0.945631067961165,
  'precision': 0.2928155339805825,
  'map': 0.8011394282632146,
  'mrr': 0.8248058252427184,
  'ndcg': 0.845350006940612},
 'JoinDocuments': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.23331676683618432,
  'map': 0.9588376868546771,
  'mrr': 0.9868700878409616,
  'ndcg': 0.9770869570259663},
 'ReRanker': {'recall_multi_hit': 0.9922330097087378,
  'recall_single_hit': 0.9922330097087378,
  'precision': 0.3170873786407768,
  'map': 0.8532524271844659,
  'mrr': 0.8855339805825243,
  'ndcg': 0.8982049069757914}}

In [None]:
#5,5,10,5 #12
eval_result.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.9990291262135922,
  'recall_single_hit': 0.9990291262135922,
  'precision': 0.3213592233009709,
  'map': 0.9679611650485437,
  'mrr': 0.9872168284789644,
  'ndcg': 0.9812514421510329},
 'DenseRetriever': {'recall_multi_hit': 0.9271844660194175,
  'recall_single_hit': 0.9271844660194175,
  'precision': 0.28388349514563105,
  'map': 0.7733346817691478,
  'mrr': 0.7933495145631068,
  'ndcg': 0.8190896348452924},
 'JoinDocuments': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.23051317614424413,
  'map': 0.9591023655416859,
  'mrr': 0.9873555247341655,
  'ndcg': 0.9773709777478317},
 'ReRanker': {'recall_multi_hit': 0.9941747572815534,
  'recall_single_hit': 0.9941747572815534,
  'precision': 0.31708737864077674,
  'map': 0.8521601941747572,
  'mrr': 0.8842233009708738,
  'ndcg': 0.8977232487350842}}

In [None]:
eval_result2 = pipeline.eval(labels=eval_labels, params= {"SparseRetriever": {"top_k": 10}, "DenseRetriever": {"top_k": 10}, "JoinDocuments": {"top_k_join": 20, "debug":True}, "ReRanker":{"top_k":10, "debug":True}})

In [None]:
# top_k = 10
eval_result2.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.16759439050701191,
  'map': 0.9557553442088811,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9755700538324369},
 'DenseRetriever': {'recall_multi_hit': 0.9786407766990292,
  'recall_single_hit': 0.9786407766990292,
  'precision': 0.16368932038834955,
  'map': 0.7848736515641854,
  'mrr': 0.8293492834026815,
  'ndcg': 0.8466910096866652},
 'JoinDocuments': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.1150465578094527,
  'map': 0.9449883651600481,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9706836820770883},
 'ReRanker': {'recall_multi_hit': 0.996116504854369,
  'recall_single_hit': 0.996116504854369,
  'precision': 0.16708737864077672,
  'map': 0.8156115466772829,
  'mrr': 0.8677311604253353,
  'ndcg': 0.8760220638191221}}

In [None]:
eval_result3 = pipeline.eval(labels=eval_labels, params= {"SparseRetriever": {"top_k": 15}, "DenseRetriever": {"top_k": 15}, "JoinDocuments": {"top_k_join": 30, "debug":True}, "ReRanker":{"top_k":15, "debug":True}})

In [None]:
eval_result3.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.11420095546309139,
  'map': 0.9489653191943415,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9723577403020937},
 'DenseRetriever': {'recall_multi_hit': 0.983495145631068,
  'recall_single_hit': 0.983495145631068,
  'precision': 0.1129449838187702,
  'map': 0.7788662817636465,
  'mrr': 0.8297523431261294,
  'ndcg': 0.8451577182285669},
 'JoinDocuments': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.07630602736557054,
  'map': 0.9384872281859054,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9675258337833305},
 'ReRanker': {'recall_multi_hit': 0.9980582524271845,
  'recall_single_hit': 0.9980582524271845,
  'precision': 0.11346278317152103,
  'map': 0.7999539496756903,
  'mrr': 0.860011183530601,
  'ndcg': 0.866357557598478}}

In [None]:
# top_k = 15
eval_result3.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.11420095546309139,
  'map': 0.9489653191943415,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9723577403020937},
 'DenseRetriever': {'recall_multi_hit': 0.983495145631068,
  'recall_single_hit': 0.983495145631068,
  'precision': 0.1129449838187702,
  'map': 0.7788662817636465,
  'mrr': 0.8297523431261294,
  'ndcg': 0.8451577182285669},
 'JoinDocuments': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.11411003236245955,
  'map': 0.9489653191943415,
  'mrr': 0.9868932038834951,
  'ndcg': 0.9723577403020937},
 'ReRanker': {'recall_multi_hit': 1.0,
  'recall_single_hit': 1.0,
  'precision': 0.11411003236245955,
  'map': 0.8011921354274567,
  'mrr': 0.8619980424106639,
  'ndcg': 0.8680664590395107}}

In [None]:
eval_result4 = pipeline.eval(labels=eval_labels, params= {"top_k":1})

In [None]:
eval_result4 = pipeline.eval(labels=eval_labels, params= {"SparseRetriever": {"top_k": 1}, "DenseRetriever": {"top_k": 1}, "JoinDocuments": {"top_k_join": 2, "debug":True}, "ReRanker":{"top_k":1, "debug":True}})

In [None]:
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.974757281553398,
  'recall_single_hit': 0.974757281553398,
  'precision': 0.974757281553398,
  'map': 0.974757281553398,
  'mrr': 0.974757281553398,
  'ndcg': 0.974757281553398},
 'DenseRetriever': {'recall_multi_hit': 0.7378640776699029,
  'recall_single_hit': 0.7378640776699029,
  'precision': 0.7378640776699029,
  'map': 0.7378640776699029,
  'mrr': 0.7378640776699029,
  'ndcg': 0.7378640776699029},
 'JoinDocuments': {'recall_multi_hit': 0.9854368932038835,
  'recall_single_hit': 0.9854368932038835,
  'precision': 0.8563106796116505,
  'map': 0.9800970873786408,
  'mrr': 0.9800970873786408,
  'ndcg': 0.9814953663002777},
 'ReRanker': {'recall_multi_hit': 0.9436893203883495,
  'recall_single_hit': 0.9436893203883495,
  'precision': 0.9436893203883495,
  'map': 0.9436893203883495,
  'mrr': 0.9436893203883495,
  'ndcg': 0.9436893203883495}}

In [None]:
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.9757281553398058,
  'recall_single_hit': 0.9757281553398058,
  'precision': 0.9757281553398058,
  'map': 0.9757281553398058,
  'mrr': 0.9757281553398058,
  'ndcg': 0.9757281553398058},
 'DenseRetriever': {'recall_multi_hit': 0.7378640776699029,
  'recall_single_hit': 0.7378640776699029,
  'precision': 0.7378640776699029,
  'map': 0.7378640776699029,
  'mrr': 0.7378640776699029,
  'ndcg': 0.7378640776699029},
 'JoinDocuments': {'recall_multi_hit': 0.987378640776699,
  'recall_single_hit': 0.987378640776699,
  'precision': 0.8567961165048543,
  'map': 0.9815533980582525,
  'mrr': 0.9815533980582525,
  'ndcg': 0.9830787932454927},
 'ReRanker': {'recall_multi_hit': 0.9446601941747573,
  'recall_single_hit': 0.9446601941747573,
  'precision': 0.9446601941747573,
  'map': 0.9446601941747573,
  'mrr': 0.9446601941747573,
  'ndcg': 0.9446601941747573}}

In [None]:
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.9757281553398058,
  'recall_single_hit': 0.9757281553398058,
  'precision': 0.9757281553398058,
  'map': 0.9757281553398058,
  'mrr': 0.9757281553398058,
  'ndcg': 0.9757281553398058},
 'DenseRetriever': {'recall_multi_hit': 0.7378640776699029,
  'recall_single_hit': 0.7378640776699029,
  'precision': 0.7378640776699029,
  'map': 0.7378640776699029,
  'mrr': 0.7378640776699029,
  'ndcg': 0.7378640776699029},
 'JoinDocuments': {'recall_multi_hit': 0.9757281553398058,
  'recall_single_hit': 0.9757281553398058,
  'precision': 0.9757281553398058,
  'map': 0.9757281553398058,
  'mrr': 0.9757281553398058,
  'ndcg': 0.9757281553398058},
 'ReRanker': {'recall_multi_hit': 0.9757281553398058,
  'recall_single_hit': 0.9757281553398058,
  'precision': 0.9757281553398058,
  'map': 0.9757281553398058,
  'mrr': 0.9757281553398058,
  'ndcg': 0.9757281553398058}}

In [None]:
# training data 6
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.974757281553398,
  'recall_single_hit': 0.974757281553398,
  'precision': 0.974757281553398,
  'map': 0.974757281553398,
  'mrr': 0.974757281553398,
  'ndcg': 0.974757281553398},
 'DenseRetriever': {'recall_multi_hit': 0.7378640776699029,
  'recall_single_hit': 0.7378640776699029,
  'precision': 0.7378640776699029,
  'map': 0.7378640776699029,
  'mrr': 0.7378640776699029,
  'ndcg': 0.7378640776699029},
 'JoinDocuments': {'recall_multi_hit': 0.9854368932038835,
  'recall_single_hit': 0.9854368932038835,
  'precision': 0.8563106796116505,
  'map': 0.9800970873786408,
  'mrr': 0.9800970873786408,
  'ndcg': 0.9814953663002777},
 'ReRanker': {'recall_multi_hit': 0.9378640776699029,
  'recall_single_hit': 0.9378640776699029,
  'precision': 0.9378640776699029,
  'map': 0.9378640776699029,
  'mrr': 0.9378640776699029,
  'ndcg': 0.9378640776699029}}

In [None]:
# test data 6
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.8385214007782101,
  'recall_single_hit': 0.8404669260700389,
  'precision': 0.8404669260700389,
  'map': 0.8385214007782101,
  'mrr': 0.8404669260700389,
  'ndcg': 0.8389616622286594},
 'DenseRetriever': {'recall_multi_hit': 0.6478599221789884,
  'recall_single_hit': 0.6498054474708171,
  'precision': 0.6498054474708171,
  'map': 0.6478599221789884,
  'mrr': 0.6498054474708171,
  'ndcg': 0.6483001836294376},
 'JoinDocuments': {'recall_multi_hit': 0.9280155642023347,
  'recall_single_hit': 0.9299610894941635,
  'precision': 0.745136186770428,
  'map': 0.8832684824902723,
  'mrr': 0.8852140077821011,
  'ndcg': 0.8954261927039261},
 'ReRanker': {'recall_multi_hit': 0.9124513618677043,
  'recall_single_hit': 0.914396887159533,
  'precision': 0.914396887159533,
  'map': 0.9124513618677043,
  'mrr': 0.914396887159533,
  'ndcg': 0.9128916233181535}}

In [None]:
# test data 12
eval_result4.calculate_metrics()

{'SparseRetriever': {'recall_multi_hit': 0.8385214007782101,
  'recall_single_hit': 0.8404669260700389,
  'precision': 0.8404669260700389,
  'map': 0.8385214007782101,
  'mrr': 0.8404669260700389,
  'ndcg': 0.8389616622286594},
 'DenseRetriever': {'recall_multi_hit': 0.6478599221789884,
  'recall_single_hit': 0.6498054474708171,
  'precision': 0.6498054474708171,
  'map': 0.6478599221789884,
  'mrr': 0.6498054474708171,
  'ndcg': 0.6483001836294376},
 'JoinDocuments': {'recall_multi_hit': 0.9280155642023347,
  'recall_single_hit': 0.9299610894941635,
  'precision': 0.745136186770428,
  'map': 0.8832684824902723,
  'mrr': 0.8852140077821011,
  'ndcg': 0.8954261927039261},
 'ReRanker': {'recall_multi_hit': 0.9163424124513618,
  'recall_single_hit': 0.9182879377431906,
  'precision': 0.9182879377431906,
  'map': 0.9163424124513618,
  'mrr': 0.9182879377431906,
  'ndcg': 0.9167826739018111}}
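
The nested dictionaries returned by `calculate_metrics()` are hard to compare by eye across runs. A minimal sketch, assuming the dictionary keeps the node → metric → value shape shown in the outputs above, turns one result into a pandas DataFrame. The helper name `metrics_to_frame` is hypothetical, and the sample values are rounded from the last output purely for illustration.

```python
import pandas as pd


def metrics_to_frame(metrics: dict) -> pd.DataFrame:
    """Return one row per pipeline node and one column per metric."""
    # orient="index" turns the outer keys (node names) into row labels.
    return pd.DataFrame.from_dict(metrics, orient="index")


# Values rounded from the output above, for illustration only.
sample_metrics = {
    "SparseRetriever": {"recall_single_hit": 0.8405, "mrr": 0.8405},
    "DenseRetriever": {"recall_single_hit": 0.6498, "mrr": 0.6498},
    "ReRanker": {"recall_single_hit": 0.9183, "mrr": 0.9183},
}

metrics_df = metrics_to_frame(sample_metrics)
print(metrics_df.round(4))
```

In the notebook itself the same helper could be called as `metrics_to_frame(eval_result4.calculate_metrics())`, provided the return value has the shape printed above.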

In [None]:
eval_result4.save("/content")

INFO:haystack.schema:Saving evaluation results to /content


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('gt_data_uji.csv')
df

Unnamed: 0,answer_id,document_id,question_id,text,answer_start,answer_end,answer_category,question,file_name,context
0,1275,1079,2923,dua titik dianggap sebagai tetangga jika jarak...,414,510,,bagaimana sebuah titik ditentukan sebagai titi...,,se dengan baik ide utama dbscan adalah bahwa s...
1,1276,1081,2918,separasi dinyatakan dalam ukuran between clus...,321,389,,bagaimana cara mengukur separasi pada clustering,,engan within cluster sum of square wcss atau...
2,1407,1234,3065,algoritma relief menghitung bobot quality est...,258,332,,apa salah satu yang dihitung pada algoritma re...,,algoritma fisher score mengevaluasi fitur seca...
3,1217,995,2843,tujuan utama data science adalah menganalisis ...,124,226,,apa tujuan utama data science,,data science atau sains data adalah kombinasi ...
4,1218,996,2960,business understanding adalah tahap pertama da...,1,71,,apa tahap pertama dalam proyek analitik data,,business understanding adalah tahap pertama d...
...,...,...,...,...,...,...,...,...,...,...
254,1517,1225,3172,algoritma relief mampu menangani fitur kategor...,117,250,,apa yang tidak dapat ditangani oleh algoritma ...,,sebaliknya jika nilai fitur a pada sampel r i...
255,1518,1227,3173,nonlinear feature extraction memetakan data ke...,0,309,,apa maksud dari nonlinear feature extraction?,,nonlinear feature extraction memetakan data ke...
256,1520,1027,3174,contoh dari performance matrics accuracy err...,564,743,,apa saja yang termasuk performance matrics?,,taset terdiri dari baris dan kolom yang berkai...
257,1521,1230,3175,linear feature extraction memproyeksikan mentr...,64,139,,bagaimana linear feature extraction bekerja?,,dua kategori utama ekstraksi fitur adalah line...


In [None]:
df2 = df['question']
df2

0      bagaimana sebuah titik ditentukan sebagai titi...
1      bagaimana cara mengukur separasi pada clustering 
2      apa salah satu yang dihitung pada algoritma re...
3                         apa tujuan utama data science 
4          apa tahap pertama dalam proyek analitik data 
                             ...                        
254    apa yang tidak dapat ditangani oleh algoritma ...
255        apa maksud dari nonlinear feature extraction?
256          apa saja yang termasuk performance matrics?
257         bagaimana linear feature extraction bekerja?
258             apa yang dilakukan oleh ekstraksi fitur?
Name: question, Length: 259, dtype: object

In [None]:
# Print every question that will be sent to the pipeline
for query_text in df2:
    print(query_text)

bagaimana sebuah titik ditentukan sebagai titik yang bertetangga dalam dbscan clustering 
bagaimana cara mengukur separasi pada clustering 
apa salah satu yang dihitung pada algoritma relief 
apa tujuan utama data science 
apa tahap pertama dalam proyek analitik data 
mengapa proyek data sains juga disebut sebagai proyek bisnis 
selain harus selalu berorientasi pada pencapaian hasil yang fokus pada bisnis  apakah hal penting yang menunjukkan bahwa proyek data sains adalah proyek bisnis 
siapa yang akan menggunakan informasi yang dihasilkan dari penggunaan data 
dimanakan produk diuji ketika test phase 
mengapa penting menguji model dalam sains data 
apa yang dilakukan dalam langkah evaluasi dalam praktik ilmuwan data 
terdiri dari apa saja tree pada decision tree 
apa yang dimaksud dengan decision tree 
apa yang disebut probabilitas awal dari h pada teorema bayes 
apa yang melambangkan data training atau data pembelajaran pada konteks machine learning  supervised learning  
apa yang me

In [None]:
def runquestions(questions):
    """Run every question through the pipeline and collect the documents it returns."""
    # Parameters for each pipeline node (identical for every query)
    params = {
        "SparseRetriever": {"top_k": 5},
        "DenseRetriever": {"top_k": 5},
        "JoinDocuments": {"top_k_join": 10, "debug": True},
        "ReRanker": {"top_k": 1},
    }

    results = []
    # Iterate over the argument instead of the global df2
    for query_text in questions:
        # Run the prediction with the specified query and parameters
        prediction = pipeline.run(query=query_text, params=params)

        # Record every returned document (one per question while ReRanker top_k is 1)
        for doc in prediction["documents"]:
            results.append(
                {"question": query_text, "id": doc.id, "score": doc.score, "content": doc.content}
            )

    # Convert the results list to a DataFrame
    return pd.DataFrame(results)


In [None]:
resultdf = runquestions(df2)

In [None]:
resultdf

Unnamed: 0,question,id,score,content
0,bagaimana sebuah titik ditentukan sebagai titi...,8d37e96e15a66bc72585441cdb2f6f2e-0,0.997214,sebuah titik adalah outlier jika titik tersebu...
1,bagaimana cara mengukur separasi pada clustering,5d05a1f9c28b85f344efcbf954b53dd4-0,0.997691,kohesi dapat diukur dengan within cluster sum ...
2,apa salah satu yang dihitung pada algoritma re...,cd0b8414ac45a94a5921b8f1a72dbf3e-0,0.991723,sebaliknya jika nilai fitur a pada sampel r i...
3,apa tujuan utama data science,9e10a7173d9194f36bcd731256aefdf3-0,0.998549,data science atau sains data adalah kombinasi ...
4,apa tahap pertama dalam proyek analitik data,7b13cd5e078eaf6a4ae6e39eaafe5672-0,0.999717,business understanding adalah tahap pertama d...
...,...,...,...,...
254,apa yang tidak dapat ditangani oleh algoritma ...,6e0e9298f0af1cb84aa4213636823ba9-0,0.996199,algoritma relieff lebih robost sehingga mampu ...
255,apa maksud dari nonlinear feature extraction?,fe1054ff949aa343ae97c94501805627-0,0.996683,dua kategori utama ekstraksi fitur adalah line...
256,apa saja yang termasuk performance matrics?,69068109e51f1d3e2c678865632736d2-0,0.481371,teknik ini tidak menjamin distribusi sample da...
257,bagaimana linear feature extraction bekerja?,6d47ca3f0244ff579c330db883f33154-0,0.473563,nonlinear feature extraction memetakan data ke...


In [None]:
resultdf.to_csv('prediction.csv', index=False)

In [None]:
dfgt = pd.read_csv("gt_data_uji.csv")
dfgt

Unnamed: 0,answer_id,document_id,question_id,text,answer_start,answer_end,answer_category,question,file_name,context
0,1275,1079,2923,dua titik dianggap sebagai tetangga jika jarak...,414,510,,bagaimana sebuah titik ditentukan sebagai titi...,,se dengan baik ide utama dbscan adalah bahwa s...
1,1276,1081,2918,separasi dinyatakan dalam ukuran between clus...,321,389,,bagaimana cara mengukur separasi pada clustering,,engan within cluster sum of square wcss atau...
2,1407,1234,3065,algoritma relief menghitung bobot quality est...,258,332,,apa salah satu yang dihitung pada algoritma re...,,algoritma fisher score mengevaluasi fitur seca...
3,1217,995,2843,tujuan utama data science adalah menganalisis ...,124,226,,apa tujuan utama data science,,data science atau sains data adalah kombinasi ...
4,1218,996,2960,business understanding adalah tahap pertama da...,1,71,,apa tahap pertama dalam proyek analitik data,,business understanding adalah tahap pertama d...
...,...,...,...,...,...,...,...,...,...,...
254,1517,1225,3172,algoritma relief mampu menangani fitur kategor...,117,250,,apa yang tidak dapat ditangani oleh algoritma ...,,sebaliknya jika nilai fitur a pada sampel r i...
255,1518,1227,3173,nonlinear feature extraction memetakan data ke...,0,309,,apa maksud dari nonlinear feature extraction?,,nonlinear feature extraction memetakan data ke...
256,1520,1027,3174,contoh dari performance matrics accuracy err...,564,743,,apa saja yang termasuk performance matrics?,,taset terdiri dari baris dan kolom yang berkai...
257,1521,1230,3175,linear feature extraction memproyeksikan mentr...,64,139,,bagaimana linear feature extraction bekerja?,,dua kategori utama ekstraksi fitur adalah line...


In [None]:
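# Count rows where the retrieved passage is identical to the ground-truth context
# (both frames are in the same question order, so rows are compared one to one)
matching_rows_count = int((resultdf['content'] == dfgt['context']).sum())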

In [None]:
matching_rows_count

21
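
The count above depends on exact string equality, which is sensitive to leading, trailing, or repeated spaces. A minimal sketch, assuming the two columns are row-aligned (same question order), normalises whitespace before comparing and reports the share of matching rows. `normalize_text` and `match_rate` are hypothetical helper names, and the toy strings are invented for illustration.

```python
import pandas as pd


def normalize_text(series: pd.Series) -> pd.Series:
    """Collapse runs of whitespace and trim both ends of every string."""
    return series.astype(str).str.replace(r"\s+", " ", regex=True).str.strip()


def match_rate(predicted: pd.Series, expected: pd.Series) -> float:
    """Fraction of rows whose normalised strings are identical (compared by position)."""
    pred = normalize_text(predicted).reset_index(drop=True)
    gold = normalize_text(expected).reset_index(drop=True)
    return float((pred == gold).mean())


# Invented toy data: the first two rows differ only in whitespace.
toy_pred = pd.Series(["data science  adalah", " kohesi dapat diukur", "teks lain"])
toy_gold = pd.Series(["data science adalah", "kohesi dapat diukur ", "teks berbeda"])

rate = match_rate(toy_pred, toy_gold)
print(f"{rate:.2f}")
```

Whether whitespace actually accounts for any of the misses in this notebook is not known from the output; the sketch only removes that one source of spurious mismatches.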

In [None]:
resultdf.rename(columns={'content': 'context'}, inplace=True)

In [None]:
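# Join predictions with the ground-truth rows that share the same passage
# (df2 holds only the question column, so the ground-truth frame dfgt is used)
merged_df = pd.merge(resultdf, dfgt, on='context')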

In [None]:
sorted_df = merged_df.sort_values(by='question_x')

In [None]:
sorted_df

Unnamed: 0,question_x,id,score,context,answer_id,document_id,question_id,text,answer_start,answer_end,answer_category,question_y,file_name
12,apa ciri dari hasil clustering yang baik,4f0f576566298295e7e73d5faf516249-0,0.999410,cluster yang dihasilkan dinilai valid atau tid...,1270,1074,2916,evaluasi pada clustering dapat dilakukan deng...,292,434,,bagaimana evaluasi clustering dengan menggunak...,
83,apa definisi dari random sampling,4ae36c63f7f7cfffd5516c6e7e521a79-0,0.973814,kelemahannya jumlah data semakin bertambah da...,1385,1205,3049,mereplikasi sampel dari kelas minoritas secara...,266,377,,bagaimana cara random oversampling menyamakan ...,
82,apa definisi dari random sampling,4ae36c63f7f7cfffd5516c6e7e521a79-0,0.973814,kelemahannya jumlah data semakin bertambah da...,1384,1205,3041,mereplikasi sampel dari kelas minoritas secara...,266,377,,bagaimana cara kerja random oversampling,
35,apa fungsi method df.drop pada library pandas,2465fb2a1cf7434f31b2b8455d62af79-0,0.997434,drop dari library pandas untuk menghapus kol...,1314,1130,3026,df nunique,230,240,,apa nama method yang digunakan untuk menghapus...,
36,apa fungsi method df.drop pada library pandas,2465fb2a1cf7434f31b2b8455d62af79-0,0.997434,drop dari library pandas untuk menghapus kol...,1315,1130,3030,mengidentifikasi dan hapus kolom yang hanya be...,131,211,,apa fungsi dari penggunaan method df.nunique,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,sebutan lain q2 adalah apa,2782c706c505bd67ec2f5d05f7c9f444-0,0.984372,simpangan baku adalah salah satu ukuran sebara...,1337,1147,2996,q2,294,297,,median dikenal juga dengan sebutan apa,
56,setelah mengidentifikasi tujuan bisnis maka se...,48f5412cf30205b9e0b6cdd8cc8c8840-0,0.994547,tiga hal utama dalam business understanding ad...,1317,1131,2970,tahapan business understanding dalam ai melipu...,284,397,,apakah masalah dan solusi termasuk dalam tahap...,
2,setelah menuliskan semua langkah kemudian tug...,d6686b06e1cb7f288626baed940accb2-0,0.990586,kemudian berdasarkan umpan balik produk dapat...,1290,1096,2942,program perangkat lunak yang diuji bermigras...,237,313,,apa output dari deployment phase,
72,untuk menampilkan simpangan deviasi data vis...,46b7e44f8d927ed59def62f77d7c83da-0,0.998657,visualisasi merupakan suatu teknik dalam pemb...,1353,1162,3013,visualisasi sendiri memiliki empat tujuan anta...,296,908,,apa saja tujuan dari visualisasi data,
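
Merging on `context` pairs each prediction with every ground-truth row that shares the same passage, which is why some questions appear more than once above (for example rows 82 and 83). A sketch of an alternative, assuming each question string is unique in both frames, joins on `question` and then flags whether the retrieved passage equals the labelled one. The column name `context_match` and the toy frames are illustrative and not part of the notebook.

```python
import pandas as pd

# Invented stand-ins for the prediction and ground-truth frames.
pred = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "context": ["passage a", "passage b", "passage a"],
    "score": [0.99, 0.48, 0.97],
})
gold = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "context": ["passage a", "passage c", "passage a"],
})

# validate="one_to_one" raises MergeError if a question is duplicated on either side.
joined = pd.merge(
    pred, gold, on="question", suffixes=("_pred", "_gold"), validate="one_to_one"
)
joined["context_match"] = joined["context_pred"] == joined["context_gold"]

print(joined[["question", "score", "context_match"]])
print("matches:", int(joined["context_match"].sum()))
```

Joining on the question keeps exactly one row per query, so the number of matches can be read directly as a per-question hit count.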


In [None]:
df = pd.read_json("answersDPR.json")
df

Unnamed: 0,question,answers,positive_ctxs,negative_ctxs,hard_negative_ctxs
0,apa yang dimaksud dengan silhouette coefficient,[nilai kohesi dan separasi ini dapat dievaluas...,"[{'title': '', 'text': 'evaluasi ini disebut j...",[],"[{'title': '', 'text': 'kemudian berdasarkan u..."
1,apa yang dimaksud dengan kohesi,[ ukuran kedekatan data dalam suatu cluster],"[{'title': '', 'text': 'evaluasi ini disebut j...",[],"[{'title': '', 'text': 'kohesi dapat diukur de..."
2,bagaimana kohesi didapatkan dari sebuah cluster,[didapatkan dengan menghitung rata rata jarak ...,"[{'title': '', 'text': 'evaluasi ini disebut j...",[],"[{'title': '', 'text': 'kohesi dapat diukur de..."
3,apa yang perlu dihitung untuk mencari nilai ko...,[rata rata jarak data dengan data lain dalam c...,"[{'title': '', 'text': 'evaluasi ini disebut j...",[],"[{'title': '', 'text': 'yang keenam mean abso..."
4,bagaimana cara menghitung nilai separasi dan k...,[silhouette coefficient ],"[{'title': '', 'text': 'evaluasi ini disebut j...",[],"[{'title': '', 'text': 'kohesi dapat diukur de..."
...,...,...,...,...,...
1028,apa tujuan visualisasi dengan komposisi compo...,[untuk melihat komposisi dari suatu variabel ...,"[{'title': '', 'text': 'jika ingin melihat dis...",[],"[{'title': '', 'text': 'tujuan visualisasi ke ..."
1029,visualisasi apa yang biasa digunakan pada komp...,[stacked bar chart],"[{'title': '', 'text': 'jika ingin melihat dis...",[],"[{'title': '', 'text': 'tujuan visualisasi ke ..."
1030,untuk melihat keterhubungan antara suatu varia...,[relasi atau relationship ],"[{'title': '', 'text': 'jika ingin melihat dis...",[],"[{'title': '', 'text': 'tujuan visualisasi ke ..."
1031,sebutkan 7 macam visualisasi data,[pie chart bar chart line graphs scatter pl...,"[{'title': '', 'text': 'visualisasi data dapat...",[],"[{'title': '', 'text': 'visualisasi yang tepat..."


In [None]:
from sklearn.model_selection import train_test_split

# df holds the DPR examples loaded from answersDPR.json; whole rows are split,
# so there is no separate label column.

# Split the rows 80:20 into training and test sets (fixed seed for reproducibility)
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Print the shapes of both sets to verify the split
print("Training set - Features:", X_train.shape)
print("Testing set - Features:", X_test.shape)

Training set - Features: (826, 5)
Testing set - Features: (207, 5)


In [None]:
X_train.to_json('train_data.json', orient='records')

In [None]:
X_test.to_json('test_data.json', orient='records')
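
A quick check that the two files were written as intended can be sketched as follows. It assumes the same `orient='records'` layout used above, writes to a temporary directory so nothing in the working folder is touched, and uses a small invented frame in place of `answersDPR.json`. To keep the sketch free of extra dependencies, the split here uses `DataFrame.sample` rather than `train_test_split`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the DPR examples.
toy = pd.DataFrame({
    "question": [f"question {i}" for i in range(10)],
    "answers": [[f"answer {i}"] for i in range(10)],
})

# 80:20 split with plain pandas (not the sklearn call used in the notebook).
test_part = toy.sample(frac=0.2, random_state=42)
train_part = toy.drop(test_part.index)

with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train_data.json"
    test_path = Path(tmp) / "test_data.json"
    train_part.to_json(train_path, orient="records")
    test_part.to_json(test_path, orient="records")

    # orient="records" writes a JSON list with one object per row.
    train_records = json.loads(train_path.read_text(encoding="utf-8"))
    test_records = json.loads(test_path.read_text(encoding="utf-8"))

train_questions = {record["question"] for record in train_records}
test_questions = {record["question"] for record in test_records}
print(len(train_records), len(test_records))
print("overlap:", train_questions & test_questions)
```

An empty overlap confirms that no question landed in both files, which matters because the test file is meant to measure performance on unseen questions.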