***Problem Statement*** - Extract information from a file containing research data: user/item interactions, star ratings, timestamps, product reviews, social networks, item-to-item relationships (e.g. co-purchases, compatibility), product images, price, brand and category information, GPS data, heart-rate sequences, and other metadata.

***Solution Strategy*** - Build a proof of concept (POC) that meets the following requirements:

- Users should get responses drawn from the research data.
- If users want to check the original data behind a response, the bot should also provide a citation.

***Goal*** - Meeting these two requirements well in the POC would indicate that the overall approach is accurate enough to justify further improvements and customizations.

***Data Used*** - Recommendation datasets stored in a single PDF file

***Tools used*** - LlamaIndex is the only tool used for now. It was chosen for its query engine, its data loaders and directory readers, and the small amount of code an implementation needs.

**Import the necessary libraries**

In [None]:
!pip install llama-index



In [None]:
# Install docx2txt, a dependency for reading .docx documents
!pip install docx2txt



In [None]:
!pip install pypdf



In [None]:
!pip install openai



In [None]:
from llama_index.core.llms import ChatMessage
import os
import openai

**Mount your Google Drive and Set the API key**

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Set the API key
filepath = "/content/drive/My Drive/Semantic_Spotter_Support/"

with open(filepath + "OpenAI_API_Key.txt", "r") as f:
  # Read the whole file and strip the trailing newline so it does not end up in the key
  openai.api_key = f.read().strip()
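
Trailing whitespace is worth checking when a key is read from a text file. Joining the lines with `' '.join(f.readlines())` keeps the newline at the end of the file, whereas `read().strip()` removes it. The sketch below shows the difference on a throwaway file with a dummy key; the helper name `read_api_key` and the temporary file are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def read_api_key(path):
    """Return the key stored in a text file, without surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Throwaway file holding a dummy key followed by a newline (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "key.txt"
    key_file.write_text("sk-dummy-key\n", encoding="utf-8")

    with open(key_file, "r", encoding="utf-8") as f:
        joined = " ".join(f.readlines())  # the trailing newline survives here
    cleaned = read_api_key(key_file)

print(repr(joined))
print(repr(cleaned))
```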

In [None]:
from llama_index.core import SimpleDirectoryReader

**Data Loading**

In [None]:
reader = SimpleDirectoryReader(input_dir="/content/drive/MyDrive/Semantic_Spotter_Support")

In [None]:
?SimpleDirectoryReader

In [None]:
documents = reader.load_data()
print(f"Loaded {len(documents)} docs")

Loaded 25 docs


In [None]:
documents

[Document(id_='8d5af365-90a9-49f9-822c-343fe283a70c', embedding=None, metadata={'file_path': '/content/drive/MyDrive/Semantic_Spotter_Support/OpenAI_API_Key.txt', 'file_name': 'OpenAI_API_Key.txt', 'file_type': 'text/plain', 'file_size': 56, 'creation_date': '2024-06-30', 'last_modified_date': '2024-06-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='[REDACTED API KEY]', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='bf3eb708-1343-4fd6-b3fd-2d07caec8380', embedding=None, metadata={'page_label': '1', 'file_name': 'Recommender_Systems_Datasets.pdf', 'file_path': '/content/drive/
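
The first document in the output above comes from `OpenAI_API_Key.txt`, so the key file stored in the same folder is loaded and would be indexed along with the PDF. One way to avoid this is to select only PDF files before handing them to the reader. The helper `select_pdf_files` below is hypothetical and the folder is a throwaway stand-in. The resulting list could be passed to `SimpleDirectoryReader(input_files=...)`, or the reader's `required_exts=[".pdf"]` argument could be used instead.

```python
import tempfile
from pathlib import Path


def select_pdf_files(folder):
    """Return the sorted paths of all PDF files directly inside `folder`."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Throwaway folder that mimics the layout used in this notebook
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "OpenAI_API_Key.txt").write_text("dummy")
    (Path(tmp) / "Recommender_Systems_Datasets.pdf").write_bytes(b"%PDF-1.4")
    pdf_names = [Path(p).name for p in select_pdf_files(tmp)]

print(pdf_names)
```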

**Building the query engine**

In [None]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import VectorStoreIndex
from IPython.display import display, HTML

# create parser and parse document into nodes
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents)

# build index
index = VectorStoreIndex(nodes)

# Construct Query Engine
query_engine = index.as_query_engine()
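
Conceptually, a vector index embeds each node, embeds the query, and returns the nodes whose vectors are most similar to the query vector. The toy sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. The vectors, node names, and the `top_k` helper are made up for illustration. The actual `VectorStoreIndex` relies on an embedding model and its own storage, so this is not its implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=2):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical node embeddings; real embeddings have many more dimensions
toy_nodes = {
    "page 6 (recipes)": [0.9, 0.1, 0.0],
    "page 2 (reviews)": [0.1, 0.9, 0.1],
    "page 9 (fitness)": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

results = top_k(toy_query, toy_nodes, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```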

**Checking responses and response parameters**

In [None]:
response = query_engine.query("how many Number of base recipes:?")

In [None]:
#Checking the response
response.response

'36,000'

In [None]:
#Check the source node
response.source_nodes

[NodeWithScore(node=TextNode(id_='9b62ac0a-b1f5-4a2a-823a-786c49232b84', embedding=None, metadata={'page_label': '6', 'file_name': 'Recommender_Systems_Datasets.pdf', 'file_path': '/content/drive/MyDrive/Semantic_Spotter_Support/Recommender_Systems_Datasets.pdf', 'file_type': 'application/pdf', 'file_size': 722679, 'creation_date': '2024-06-30', 'last_modified_date': '2024-06-30'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e5efe93e-dc8d-4850-9022-25887248e5eb', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '6', 'file_name': 'Recommender_Systems_Datasets.pdf', 'file_path': '/content/drive/MyDrive/Semantic_Spotter_Support/Recommender_Systems_Datasets.pdf', 'file_type': 'application/pdf',

In [None]:
#Extract the file name
response.source_nodes[0].node.metadata['file_name']

'Recommender_Systems_Datasets.pdf'

In [None]:
#Extract the similarity score of the second retrieved node (index 1)
response.source_nodes[1].score

0.7761363231169902

**Creating a response Pipeline**

In [None]:
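## Query response function
def query_response(user_input):
  """Answer a question and append a citation to the top-ranked source file."""
  response = query_engine.query(user_input)
  if not response.source_nodes:
    # Nothing was retrieved, so there is no source to cite
    return response.response
  file_name = response.source_nodes[0].node.metadata['file_name']
  final_response = response.response + '\n Check further at ' + file_name + ' document'
  return final_response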
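
The source-node metadata shown earlier also carries a `page_label`, so the citation could point to a page as well as a file. The sketch below builds such a citation from a plain metadata dictionary. `format_citation` is a hypothetical helper, and the sample dictionary only mirrors keys visible in the output above.

```python
def format_citation(metadata):
    """Build a short citation string from a node's metadata dictionary."""
    file_name = metadata.get("file_name", "unknown file")
    page = metadata.get("page_label")
    if page is None:
        return f"Check further at {file_name}"
    return f"Check further at {file_name}, page {page}"


# Stand-in for response.source_nodes[0].node.metadata
sample_metadata = {
    "page_label": "6",
    "file_name": "Recommender_Systems_Datasets.pdf",
}
citation = format_citation(sample_metadata)
print(citation)
```

Inside `query_response`, the dictionary would come from `response.source_nodes[0].node.metadata`.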

In [None]:
def initialize_conv():
  print('Feel free to ask Questions regarding Recommender Systems and Personalization Datasets. Press exit once you are done')
  while True:
    user_input = input()
    if user_input.lower() == 'exit':
      print('Exiting the program... bye')
      break
    else:
      response = query_response(user_input)
      display(HTML(f'<p style="font-size:20px">{response}</p>'))
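
Because the response is interpolated directly into HTML, the `\n` before the citation is collapsed when rendered, and any `<` or `&` in the answer is interpreted as markup. The sketch below shows one way to handle both, using a hypothetical helper `to_html_paragraph` and a made-up sample string.

```python
import html


def to_html_paragraph(text, font_size_px=20):
    """Escape text for HTML and turn newlines into <br> line breaks."""
    escaped = html.escape(text).replace("\n", "<br>")
    return f'<p style="font-size:{font_size_px}px">{escaped}</p>'


# Made-up response text containing a newline and HTML-special characters
snippet = to_html_paragraph("36,000\n Check further at A&B <doc>")
print(snippet)
```

The returned string could then be passed to `display(HTML(...))` in place of the inline f-string.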

In [None]:
initialize_conv()

Feel free to ask Questions regarding Recommender Systems and Personalization Datasets. Press exit once you are done
how many streamers in Twitch


number of nodes


number of nodes in Twitter


Num of unique images


Number of unique images in Reddit Submissions


Timespan


Timespan in Reddit Submissions


Pairwise Fashion Explanations


Mentioned Items and the Percentages:


Mentioned Items and the Percentages in Pairwise Fashion Explanations


exit
Exiting the program... bye


**Build a Testing Pipeline**

In [None]:
questions = ['Number of recipes in Recipe Pairs data?', "number of Workouts in EndoMondo Fitness Tracking Data ?",
             'Amazon Product Reviews ?']

In [None]:
def testing_pipeline(questions):
  """Ask each question, show the answer, and record the tester's feedback."""
  test_feedback = []
  for question in questions:
    # Query once so the recorded response is the one the tester actually saw
    answer = query_response(question)
    print(question)
    print(answer)
    print('\n Please provide your feedback on the response provided by the bot')
    user_input = input()
    test_feedback.append((question, answer, user_input))
  feedback_df = pd.DataFrame(test_feedback, columns=['Question', 'Response', 'Good or Bad'])
  return feedback_df

In [None]:
import pandas as pd

In [None]:
testing_pipeline(questions)

Number of recipes in Recipe Pairs data?
60,000
 Check further at Recommender_Systems_Datasets.pdf document

 Please provide your feedback on the response provided by the bot
good
number of Workouts in EndoMondo Fitness Tracking Data ?
253,020
 Check further at Recommender_Systems_Datasets.pdf document

 Please provide your feedback on the response provided by the bot
good
Amazon Product Reviews ?
The Amazon Product Reviews dataset contains questions and answers about products from Amazon. It includes basic statistics such as 1.48 million questions, 4,019,744 answers, 309,419 labeled yes/no questions, and 191,185 unique products with questions. The metadata for this dataset includes question and answer text, question type, answer type, timestamps, and product IDs to reference the review dataset. An example entry from the dataset includes information like ASIN, question type, answer type, answer time, Unix time, question, and answer.
 Check further at Recommender_Systems_Datasets.pdf doc

| | Question | Response | Good or Bad |
|---|---|---|---|
| 0 | Number of recipes in Recipe Pairs data? | 60,000\n Check further at Recommender_Systems_... | good |
| 1 | number of Workouts in EndoMondo Fitness Tracki... | 253,020\n Check further at Recommender_Systems... | good |
| 2 | Amazon Product Reviews ? | The Amazon Product Reviews dataset contains qu... | good |
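
Once feedback has been collected, a simple summary is the share of responses rated good. The sketch below computes it from a list of (question, response, feedback) tuples, the same shape that `testing_pipeline` accumulates. The helper name `summarize_feedback` and the sample rows are placeholders for illustration, not results from this notebook.

```python
from collections import Counter


def summarize_feedback(rows):
    """Count feedback labels and return (counts, share of 'good' labels)."""
    labels = [feedback.strip().lower() for _, _, feedback in rows]
    counts = Counter(labels)
    good_share = counts["good"] / len(labels) if labels else 0.0
    return counts, good_share


# Placeholder rows; real rows would come from the testing pipeline
sample_rows = [
    ("question 1", "response 1", "good"),
    ("question 2", "response 2", "Good "),
    ("question 3", "response 3", "bad"),
    ("question 4", "response 4", "good"),
]
counts, good_share = summarize_feedback(sample_rows)
print(dict(counts), f"{good_share:.0%}")
```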


**Final Steps: Refining and Enhancing**

- Improve Dataset Quality: Clean the dataset and add more relevant documents.
- Tune Retrieval: This pipeline uses a vector index rather than a TfidfVectorizer, so adjust the node parser and retriever settings (for example chunk size and the number of retrieved nodes) for better accuracy.
- Enhance Response Formatting: Include more detailed snippets, keyword highlighting, or links to the full documents.
- Implement Additional Features: Based on user feedback, add features such as filtering results by dataset and type of information.
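
As an illustration of the keyword-highlighting idea in the list above, the sketch below wraps query words that appear in a snippet in `<mark>` tags so they stand out when rendered as HTML. The helper `highlight_keywords`, the minimum word length, and the sample strings are assumptions made for this example. It is a simplification: a query word that happens to equal an HTML entity name could match inside an escaped character.

```python
import html
import re


def highlight_keywords(snippet, query, min_length=4):
    """Wrap whole query words found in the snippet in <mark> tags."""
    words = {w for w in re.findall(r"\w+", query.lower()) if len(w) >= min_length}
    escaped = html.escape(snippet)
    if not words:
        return escaped
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in sorted(words)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"<mark>\1</mark>", escaped)


# Made-up snippet and query; only whole-word matches are highlighted
highlighted = highlight_keywords("Recipes and recipe pairs", "recipe pairs data")
print(highlighted)
```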