# **Relevance Information Retrieval**
---
* Github: https://github.com/pinantyo/DiscordBot.git

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install -q transformers torch sentence-transformers

In [None]:
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import torch

## Data

In [None]:
questions = ["daftar kelas", "Daftar kelas", "Register class", "regist kelas", 
             "regis kelas", "mau daftar kelas", "Registrasi kelas", "les", "kursus",
             "list kelas", "Mata kuliah", "mata kuliah", "matkul", "Matkul"]

answers = [
  "Silahkan daftar kelas TORCHE di https://torche.app/registration",
  "Bisa daftar kelas di https://torche.app/registration",
  "Kalau mau daftar les/kursus, bisa di https://torche.app/registration",
  "Semua kelas yang tersedia di TORCHE bisa dilihat di https://torche.app/courses"
]

## TensorFlow

In [None]:
model_name = "indolem/indobert-base-uncased"

In [None]:
model = TFAutoModel.from_pretrained(model_name, from_pt=True, output_hidden_states=True, trainable=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predict

In [None]:
import tensorflow as tf
import re
from sklearn.metrics.pairwise import cosine_similarity


def normalize_test(text):
  text = text.strip().lower()                       # Trim whitespace and convert to lower case
  text = re.sub(r'https?://\S+|www\.\S+', '', text) # Remove URLs
  text = re.sub(r'[-+]?[0-9]+', '', text)           # Remove numbers
  text = re.sub(r'[^\w\s]','', text)                # Remove punctuation characters
  return text


# Map the formatted inputs to the dictionary layout expected by the BERT model
def map_example_to_dict(input_ids, attention_masks):
  return {
      "input_ids": input_ids,               # Token IDs looked up by the embedding layer
      "attention_mask": attention_masks,    # Marks which positions the model should take into account
  }

def tokenize_data(list_data):
  encoding = {'input_ids':[], 'attention_mask':[]}
  for i in list_data:
    tokenized_data = tokenizer.encode_plus(
        i,
        max_length=128,
        truncation=True,
        padding='max_length',
        return_tensors='tf'
    )

    encoding['input_ids'].append(tokenized_data['input_ids'][0])
    encoding['attention_mask'].append(tokenized_data['attention_mask'][0])
  
  encoding['input_ids'] = tf.stack(encoding['input_ids'])
  encoding['attention_mask'] = tf.stack(encoding['attention_mask'])

  # return tf.data.Dataset.from_tensor_slices((encoding['input_ids'], encoding['attention_mask'])).map(map_example_to_dict)
  
  return encoding


def get_features(data):
  encoding = tokenize_data(data)

  # Take the features from the last hidden layer
  output = model(encoding)
  features = output.last_hidden_state

  # Get the attention mask
  att_mask = encoding['attention_mask']

  mask = tf.cast(tf.broadcast_to(tf.expand_dims(att_mask, axis=-1), features.shape), dtype='float')

  # Masked average pooling: sum only the non-padding token vectors,
  # then divide by the number of real tokens in each sequence
  pooling = tf.reduce_sum(features * mask, axis=1) / tf.maximum(tf.reduce_sum(mask, axis=1), 1e-9)

  # Return the pooled features as a NumPy array
  return tf.stop_gradient(pooling).numpy()


def ask_bot(question, answers):
  # Text preprocessing
  text_normalized = normalize_test(question)
  answers_normalized = [normalize_test(i) for i in answers]

  # Extract features
  feature_q = get_features([text_normalized])
  features_a = get_features(answers_normalized)

  print(f'Question: {question}', end="\n\n")
  prediction = list(cosine_similarity(feature_q, features_a).reshape(-1))

  # Pick the index of the answer with the highest similarity score
  index_high = prediction.index(max(prediction))
  print(f'Answer:\n{answers[index_high]}')
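
The pooling step in `get_features` is meant to turn per-token vectors into one vector per sentence by averaging. As a rough, standalone illustration of masked mean pooling (not the notebook's own code), the sketch below uses NumPy on a toy batch; `masked_mean_pool` and the toy arrays are hypothetical names and values chosen for the example.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions only."""
    # Expand the mask to (batch, seq, 1) so it broadcasts over the hidden dimension
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1e-9)      # (batch, 1), guards against division by zero
    return summed / counts

# Toy batch (assumed values): 2 sequences, 3 token slots, hidden size 2
embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last slot is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],     # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(embeddings, attention_mask)
print(pooled)
```

In the first sequence the padded slot is excluded from both the sum and the token count, so its large values do not leak into the sentence vector.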

In [None]:
ask = str(input("Masukkan pertanyaan: "))

ask_bot(ask, answers)

Masukkan pertanyaan: saya mau daftar les dimana ya
Question: saya mau daftar les dimana ya

Answer:
Kalau mau daftar les/kursus, bisa di https://torche.app/registration


In [None]:
ask = str(input("Masukkan pertanyaan: "))

ask_bot(ask, answers)

Masukkan pertanyaan: kak mau daftar kelas dong
Question: kak mau daftar kelas dong

Answer:
Bisa daftar kelas di https://torche.app/registration
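
`ask_bot` returns only the single answer with the highest cosine similarity. When debugging why a particular answer was chosen, it can help to look at the full ranking and the scores. The following is a minimal sketch of that selection step on toy vectors, written with plain NumPy instead of scikit-learn; `rank_by_cosine` and the toy vectors are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs):
    """Return candidate indices sorted by descending cosine similarity, plus the scores."""
    query = np.asarray(query_vec, dtype=float)
    candidates = np.asarray(candidate_vecs, dtype=float)
    # Normalize to unit length; the small floor avoids division by zero
    query_unit = query / max(np.linalg.norm(query), 1e-12)
    norms = np.maximum(np.linalg.norm(candidates, axis=1, keepdims=True), 1e-12)
    scores = (candidates / norms) @ query_unit
    order = np.argsort(-scores)
    return order, scores

# Toy 2-D vectors standing in for sentence embeddings (assumed values)
toy_query = [1.0, 0.0]
toy_candidates = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

order, scores = rank_by_cosine(toy_query, toy_candidates)
for idx in order:
    print(idx, round(float(scores[idx]), 4))
```

With real embeddings, the arrays returned by `get_features` could be passed in the same way, and a minimum score could be checked before replying so that unrelated questions do not always receive an answer.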
