<a href="https://colab.research.google.com/github/yousenwang/information-retrieval/blob/main/information_retrieval_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval

# Outline

- TFIDF (Term Frequency Inverse Document Frequency)
- KD Tree (K - Dimensional Tree)

# TFIDF
- Term Frequency Inverse Document Frequency

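- TF-IDF = TF * IDF

- TF (Term Frequency)
  - The importance of a word depends on how many times it appears in the document.
- IDF (Inverse Document Frequency)
  - The more documents a word appears in, the less it characterizes any single document, so we take the inverse of its document frequency.
  - The more distinctive a word is for a document, the lower its document frequency across the whole collection.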
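
The two factors above can be sketched in a few lines of plain Python. This is a minimal illustration using the textbook formula `tf * log(N / df)`; the function name `tf_idf` and the three sample sentences are made up for this example. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # TF is the share of the document taken up by the term.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus, not taken from english_FAQ.csv.
docs = [
    "the label is printed at the station",
    "the spare parts record",
    "the station settings",
]
weights = tf_idf(docs)
print(weights[0])
```

Here "the" appears in every document, so its IDF is `log(3 / 3) = 0` and its weight vanishes, while "label" (found in one document only) outweighs "station" (found in two).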

# KDTree

- K - Dimensional Tree
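
A k-d tree is a binary tree that splits the points along one coordinate axis per level, so a nearest-neighbour search can skip whole branches. The sketch below is a minimal pure-Python version for illustration only; the helper names `build` and `nearest` and the sample points are assumptions of this example, and it is not how scikit-learn's `KDTree` is implemented internally.

```python
import math

def build(points, depth=0):
    """Recursively split points on one axis per level; returns nested dicts."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return (distance, point) of the stored point closest to target."""
    if node is None:
        return best
    dist = math.dist(node["point"], target)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

# Hypothetical 2-D points; the notebook itself uses 171-dimensional TF-IDF vectors.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
distance, point = nearest(tree, (9, 2))
print(point, distance)
```

The pruning step is what makes the tree faster than comparing the query against every point, although that advantage tends to shrink as the number of dimensions grows.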

In [52]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KDTree

In [10]:
faq = pd.read_csv('english_FAQ.csv')

In [11]:
faq.head()

| Unnamed: 0 | question | reply | Similar question |
|---|---|---|---|
| 0 | How to jump to the specified work station? | You need to log in the administrator account, ... | How to return to the previous station? / How c... |
| 1 | How to deal with the discrepancy between the r... | Background of Service Center -> Receiving and ... | How to deal with the discrepancy? / What if th... |
| 2 | How to modify the content of the label printed... | The service center can contact the manufacture... | Can the content of the label printed by the wo... |
| 3 | How to upgrade Product SN after replacing spar... | Click `Parts Replacement` to enter the spare p... | How to upgrade the SN after replacing the PCBA... |
| 4 | How to work with multiple product failures? | During the operation of service center, the [W... | Meet a variety of bad how to deal with? How to... |


# Training: only the reply text is used as input for now; questions are not included


In [28]:
tfidf = TfidfVectorizer().fit(faq['reply'])
train = tfidf.transform(faq.reply).toarray()
faq["tfidf"] = list(train)

In [29]:
faq.head()

| Unnamed: 0 | question | reply | Similar question | tfidf |
|---|---|---|---|---|
| 0 | How to jump to the specified work station? | You need to log in the administrator account, ... | How to return to the previous station? / How c... | [0.0, 0.13059249446406113, 0.0, 0.153621717161... |
| 1 | How to deal with the discrepancy between the r... | Background of Service Center -> Receiving and ... | How to deal with the discrepancy? / What if th... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.13140786205855845,... |
| 2 | How to modify the content of the label printed... | The service center can contact the manufacture... | Can the content of the label printed by the wo... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | How to upgrade Product SN after replacing spar... | Click `Parts Replacement` to enter the spare p... | How to upgrade the SN after replacing the PCBA... | [0.0, 0.0, 0.0, 0.0, 0.10546317532403915, 0.0,... |
| 4 | How to work with multiple product failures? | During the operation of service center, the [W... | Meet a variety of bad how to deal with? How to... | [0.3052216076243383, 0.0, 0.15261080381216915,... |


In [30]:
faq['tfidf'][0].shape

(171,)

In [66]:
kdtree = KDTree(train)

In [38]:
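# Query with the stored TF-IDF vector of reply 2, so the closest match is that reply itself (distance 0).
print(faq.question[2])
distance, idx = kdtree.query(faq.tfidf[2].reshape(1,-1), k=3)

# idx[0] holds the row positions of the 3 nearest replies, closest first.
for i, value in enumerate(idx[0]):
  print(f'question : {faq["question"][value]}')
  print(f'Distance : {distance[0][i]}')
  print(f'reply : {faq["reply"][value]}')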

How to modify the content of the label printed in the station?
question : How to modify the content of the label printed in the station?
Distance : 0.0
reply : The service center can contact the manufacturer to modify the label printing at the factory station. The maintenance service background of the factory -> Engineering Support -> the printing at the factory station. The label content can be modified by selecting the corresponding printing template.
question : Where can I see the use record of spare parts?
Distance : 1.1204617066588503
reply : For example, in the process of operation in the station, you can check the replacement of spare parts of the product in the field of `Replacement Maintenance Record` of the repaired product information; In the background of the service center -> Engineering Support -> Spare Parts Management, select the corresponding manufacturer and click `Spare Parts Usage Record` to view the use of spare parts of the manufacturer in the service center.
ques

In [39]:
import joblib

In [47]:
joblib.dump(tfidf, 'trained_tfidf_vectorizer.pkl')
joblib.dump(kdtree, 'trained_kd_tree.pkl')

['trained_kd_tree.pkl']

# Prediction: input a question

In [48]:
str_to_numerical = joblib.load('trained_tfidf_vectorizer.pkl')
info_retrieve = joblib.load('trained_kd_tree.pkl')

In [53]:
question = [faq.question[2]]

In [54]:
# Vectorize the question text with the vocabulary learned from the replies.
numerical = str_to_numerical.transform(question).toarray()

# Look up the 3 replies whose TF-IDF vectors are closest to the question vector.
distance, idx = info_retrieve.query(numerical, k=3)
print(question)
print(numerical.shape)

for i, value in enumerate(idx[0]):
  print(f'question : {faq["question"][value]}')
  print(f'Distance : {distance[0][i]}')
  print(f'reply : {faq["reply"][value]}')

['How to modify the content of the label printed in the station?']
(1, 171)
question : How to modify the content of the label printed in the station?
Distance : 0.938872278120583
reply : The service center can contact the manufacturer to modify the label printing at the factory station. The maintenance service background of the factory -> Engineering Support -> the printing at the factory station. The label content can be modified by selecting the corresponding printing template.
question : Where can I see the use record of spare parts?
Distance : 1.0655491778976458
reply : For example, in the process of operation in the station, you can check the replacement of spare parts of the product in the field of `Replacement Maintenance Record` of the repaired product information; In the background of the service center -> Engineering Support -> Spare Parts Management, select the corresponding manufacturer and click `Spare Parts Usage Record` to view the use of spare parts of the manufacturer 

In [63]:

# For every FAQ question, retrieve the single nearest reply and compare it with the true reply.
for k in faq.index:
  question = [faq.question[k]]
  reply = [faq.reply[k]]
  numerical = str_to_numerical.transform(question).toarray()

  distance, idx = info_retrieve.query(numerical, k=1)
  print(question)
  print(reply)
  print(numerical.shape)

  for i, value in enumerate(idx[0]):
    print(f'Predicted_Q : {faq["question"][value]}')
    print(f'Distance : {distance[0][i]}')
    print(f'Predicted_Reply : {faq["reply"][value]}')
  print("\n")

['How to jump to the specified work station?']
['You need to log in the administrator account, enter the service center background -> Schedule Control -> Schedule Maintenance, enter the `Warranty Item` page, select the corresponding product serial number, pull down the `Maintenance operation`, click `Schedule Maintenance`, and select `Skip to designated site` to skip.']
(1, 171)
Predicted_Q : How to set the products and stations that operators can work on?
Distance : 1.1383577351791334
Predicted_Reply : Service center background -> Service Settings -> Station Settings, select the operator, click [product Settings] to check the product, you can set the operator can work on the product; Click `Station Settings` to check `station Settings` to set the station that the operator can operate.


['How to deal with the discrepancy between the received repair products and the warranty application?']
['Background of Service Center -> Receiving and Sending of goods -> Receiving and sending managem

# Reference

- [Information Retrieval | NLTK | Day 06 | NLP Tutorial | Python](https://www.youtube.com/watch?v=4DJVB3FrcvY&list=PLl1gyDCKkiQTCdmjNcBQuPCOJxwAgBJu8&index=1&ab_channel=Learnbay)
- [scikit-learn: TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)
- [KNN (scikit-learn: Nearest Neighbors)](https://scikit-learn.org/stable/modules/neighbors.html)
- [KD-Tree (video)](https://www.youtube.com/watch?v=1OoM0phlO_U&t=1s&ab_channel=OsirisSalazar)
