# Build a RAG System with Unstructured Excel Data

- Load an Excel dataset of course reviews
- Combine the relevant fields into a single text column
- Convert the rows into documents that can be embedded with a Hugging Face model
- Store the document embeddings in a FAISS vector store
- Use an OpenAI chat model to answer a query from the retrieved reviews

In [1]:
# Libraries
import pandas as pd
from langchain_community.document_loaders import DataFrameLoader
# Import from langchain_community, consistent with DataFrameLoader above
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from openai import OpenAI
from IPython.display import Markdown

In [2]:
# Load course reviews data from Excel into a DataFrame
reviews = pd.read_excel('Reviews_lab.xlsx')

In [3]:
# Display the first five rows of the loaded DataFrame for a preview
reviews.head()

|   | Course Name | Student Name | Timestamp | Rating | Comment |
|---|---|---|---|---|---|
| 0 | Master Python for Data Analysis and Business A... | Gaurav Mehra | 2024-08-21 06:46:55+00:00 | 4.0 | NaN |
| 1 | Master Python for Data Analysis and Business A... | Harigovind S | 2024-08-21 04:35:13+00:00 | 5.0 | NaN |
| 2 | Data Literacy and Business Analytics for Busin... | Celine Jayme | 2024-08-21 01:42:37+00:00 | 4.0 | NaN |
| 3 | Decision Making with Problem Solving & Critica... | Donovan Smith | 2024-08-20 20:02:59+00:00 | 4.0 | NaN |
| 4 | Econometrics and Statistics for Business in R ... | Mark Stent | 2024-08-20 16:59:09+00:00 | 4.0 | NaN |


In [4]:
# Create a new 'Review' column by combining course name, rating, and comment
reviews['Review'] = 'Course: ' + reviews['Course Name'].astype(str) + ', Rating: ' + reviews['Rating'].astype(str) + ', Comment: ' + reviews['Comment'].astype(str)

In [5]:
# Remove the original columns no longer needed after combining them into 'Review'
reviews.drop(columns=['Course Name', 'Rating', 'Comment'], inplace=True)
reviews.head()

|   | Student Name | Timestamp | Review |
|---|---|---|---|
| 0 | Gaurav Mehra | 2024-08-21 06:46:55+00:00 | Course: Master Python for Data Analysis and Bu... |
| 1 | Harigovind S | 2024-08-21 04:35:13+00:00 | Course: Master Python for Data Analysis and Bu... |
| 2 | Celine Jayme | 2024-08-21 01:42:37+00:00 | Course: Data Literacy and Business Analytics f... |
| 3 | Donovan Smith | 2024-08-20 20:02:59+00:00 | Course: Decision Making with Problem Solving &... |
| 4 | Mark Stent | 2024-08-20 16:59:09+00:00 | Course: Econometrics and Statistics for Busine... |


In [6]:
# Load the transformed DataFrame as documents using the DataFrameLoader
loader = DataFrameLoader(reviews, page_content_column='Review')
docs = loader.load()
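
Conceptually, the loader turns each row into a document whose text comes from `page_content_column` and whose remaining columns become metadata. The plain-Python sketch below illustrates that mapping only; `row_to_document` is a hypothetical helper, the row values are made up, and a dict stands in for both the DataFrame row and the resulting document.

```python
def row_to_document(row, page_content_column):
    """Split one row (a dict) into document text and metadata."""
    # Every column except the text column is kept as metadata
    metadata = {key: value for key, value in row.items() if key != page_content_column}
    return {"page_content": row[page_content_column], "metadata": metadata}


# Made-up values for illustration
sample_row = {
    "Student Name": "Example Student",
    "Timestamp": "2024-01-01 00:00:00+00:00",
    "Review": "Course: Example Course, Rating: 5.0, Comment: Helpful",
}

document = row_to_document(sample_row, page_content_column="Review")
print(document["page_content"])
print(document["metadata"])
```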

In [7]:
# Preview the first three documents to ensure proper transformation
docs[:3]

[Document(page_content='Course: Master Python for Data Analysis and Business Analytics 2024, Rating: 4.0, Comment: nan', metadata={'Student Name': 'Gaurav Mehra', 'Timestamp': '2024-08-21 06:46:55+00:00'}),
 Document(page_content='Course: Master Python for Data Analysis and Business Analytics 2024, Rating: 5.0, Comment: nan', metadata={'Student Name': 'Harigovind S', 'Timestamp': '2024-08-21 04:35:13+00:00'}),
 Document(page_content='Course: Data Literacy and Business Analytics for Business Leaders, Rating: 4.0, Comment: nan', metadata={'Student Name': 'Celine Jayme', 'Timestamp': '2024-08-21 01:42:37+00:00'})]
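
The previews above show `Comment: nan` because `astype(str)` casts missing comments to the string `'nan'`. One possible way to keep that placeholder out of the embedded text is to build the review string with a small helper that skips empty fields. The helper name `build_review` is hypothetical, and the sketch is plain Python so it does not depend on the dataset.

```python
import math


def build_review(course, rating, comment):
    """Combine the fields into one string, leaving out a missing comment."""

    def is_missing(value):
        # pandas represents missing cells as float NaN; None and blank strings
        # are treated as missing too
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return str(value).strip() == ""

    parts = [f"Course: {course}", f"Rating: {rating}"]
    if not is_missing(comment):
        parts.append(f"Comment: {comment}")
    return ", ".join(parts)


print(build_review("Master Python for Data Analysis", 4.0, float("nan")))
print(build_review("Master Python for Data Analysis", 5.0, "Great course"))
```

With the DataFrame from this notebook, the helper could be applied row by row before the original columns are dropped, for example `reviews.apply(lambda r: build_review(r['Course Name'], r['Rating'], r['Comment']), axis=1)`.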

## Embed Documents and Store Vectors in FAISS

Use a Sentence Transformers model from Hugging Face to convert each review into an embedding vector, then store the vectors in a FAISS (Facebook AI Similarity Search) index.

In [9]:
# Initialise the embedding model from Hugging Face's Sentence Transformers
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



In [10]:
# Embed the documents and store the vectors in FAISS for efficient similarity search
db_faiss = FAISS.from_documents(docs, embedding_model)

In [11]:
# Access the FAISS index and check the number of vectors stored
print(f"Number of documents in FAISS: {db_faiss.index.ntotal}")

Number of documents in FAISS: 861


## Retrieve Relevant Documents Using FAISS and Sentence Transformers Embeddings

In [13]:
# Function to retrieve relevant documents from FAISS based on a query
def retrieve_docs(query, k):
    # Perform a similarity search on the FAISS index for the given query and return k documents
    docs_faiss = db_faiss.similarity_search(query, k=k)
    return docs_faiss
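
Conceptually, `similarity_search` embeds the query and returns the `k` stored vectors closest to it. The sketch below shows that idea as a brute-force search, assuming Euclidean (L2) distance as the metric. The 3-dimensional vectors, the texts, and the function name `top_k_l2` are invented for illustration; a real FAISS index is far more optimised than this loop.

```python
import math


def top_k_l2(query_vec, stored, k):
    """Return the k (text, distance) pairs whose vectors are closest to query_vec."""
    scored = []
    for text, vec in stored:
        # Euclidean distance between the query vector and the stored vector
        distance = math.sqrt(sum((q - v) ** 2 for q, v in zip(query_vec, vec)))
        scored.append((text, distance))
    # Smaller distance means more similar
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "embeddings" invented for illustration
stored = [
    ("time series review", [0.9, 0.1, 0.0]),
    ("python basics review", [0.1, 0.9, 0.0]),
    ("forecasting review", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]

for text, distance in top_k_l2(query_vec, stored, k=2):
    print(f"{distance:.3f}  {text}")
```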

In [14]:
# Define the query for retrieving relevant feedback on time series courses
query = "What feedback have we received on the time series courses, and what common themes or suggestions for improvement are there?"

In [15]:
# Call retrieve_docs and specify the number of documents to retrieve
context = retrieve_docs(query, k = 5)

In [16]:
# Display the retrieved documents to verify the results
for doc in context:
    print(doc.page_content)

Course: Master Time Series Analysis and Forecasting with Python 2024, Rating: 5.0, Comment: A sensational course: the materials, the level of expertise of the trainer, the Q&A support.
Course: Master Time Series Analysis and Forecasting with Python 2024, Rating: 4.0, Comment: It was good till at present With DataSet provided and course Outcomes and Explanantions
Course: Master Time Series Analysis and Forecasting with Python 2024, Rating: 4.0, Comment: The course covers topics that I wanted to learn about. Will know more once I start learning from the course.
Course: Forecasting Models & Time Series Analysis for Business in R, Rating: 5.0, Comment: The course is clear, concise and full of incredibly helpful information. The guidance is thorough and there are no missing or ambiguous parts which allows you to follow the whole process easily. Definitely recommend!
Course: Master Time Series Analysis and Forecasting with Python 2024, Rating: 5.0, Comment: Really informative and well struct

## Generate Responses Using the OpenAI API Based on the User Query and Context

In [17]:
# System message that defines the assistant's role as a manager reviewing feedback
system_message = """
You are the manager of an online data analysis education platform.
Your primary responsibility is to carefully review feedback from users, including ratings and comments, to gain insights into their experiences.
"""

In [18]:
# Function to generate response based on query and context
def generate_response(query, context):
    # Create the messages for the API call
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"Question: {query}\nReviews: {context}"}
    ]

    # Call the OpenAI API and generate a response
    # (OpenAI() reads the API key from the OPENAI_API_KEY environment variable)
    client = OpenAI()
    response = client.chat.completions.create(
        model = "gpt-4o-mini",
        messages=messages,
        max_tokens=600,
        temperature=0.7
    )
    
    # Extract and return the assistant's message
    assistant_message = response.choices[0].message.content
    return assistant_message
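
Interpolating the list of retrieved documents directly into the f-string sends each `Document`'s full Python representation, metadata included, to the model. A possible alternative is to join only the review text into a numbered list first. In this sketch `FakeDoc` is a hypothetical stand-in exposing the same `page_content` attribute as a LangChain `Document`, and `format_context` is an assumed helper name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (same attribute names)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(documents):
    """Join the text of each document into a numbered, newline-separated block."""
    return "\n".join(
        f"{i}. {doc.page_content}" for i, doc in enumerate(documents, start=1)
    )


# Made-up reviews for illustration
sample_docs = [
    FakeDoc("Course: A, Rating: 5.0, Comment: Clear and concise"),
    FakeDoc("Course: B, Rating: 4.0, Comment: Good examples"),
]
print(format_context(sample_docs))
```

In the notebook this would correspond to calling `generate_response(query, format_context(context))`, so that only plain review text reaches the prompt.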

In [19]:
# Retrieve a larger context (50 documents) for the same query
query = "What feedback have we received on the time series courses, and what common themes or suggestions for improvement are there?"

context = retrieve_docs(query, k=50)
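
Retrieving 50 documents makes the prompt considerably longer than with `k=5`. If prompt size becomes a concern, one option is to cap the context at a character budget before sending it. This is only a sketch: `trim_to_budget` and the budget value are assumptions, and a character count is a rough proxy for the token count the API actually uses.

```python
def trim_to_budget(texts, max_chars):
    """Keep texts in order until adding the next one would exceed max_chars."""
    kept, used = [], 0
    for text in texts:
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    return kept


# Three placeholder texts of 40 characters each
sample_texts = ["a" * 40, "b" * 40, "c" * 40]
print(len(trim_to_budget(sample_texts, max_chars=100)))
```

Because similarity search returns the closest matches first, trimming from the end drops the least relevant documents.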

In [21]:
# Generate and display the response
answer = generate_response(query, context)
display(Markdown(answer))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
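
The tokenizers warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first cell of the notebook, before the embedding model is loaded:

```python
import os

# Set before Hugging Face tokenizers are used, so the library does not
# have to disable parallelism itself after the process forks
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```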


The feedback for the time series courses, particularly the "Master Time Series Analysis and Forecasting with Python 2024" and "Forecasting Models & Time Series Analysis for Business in R," reveals several common themes and suggestions for improvement.

### Positive Feedback Themes:
1. **Quality of Content**: Many users praised the courses for being clear, concise, and full of helpful information. Comments like "sensational course," "really informative," and "well-structured" indicate that the content is well-received.
   
2. **Expertise of Instructors**: Reviewers frequently noted the expertise and teaching ability of the instructors, such as Diogo in the R course, which contributed to a positive learning experience.

3. **Practical Applications**: Several users appreciated the practical nature of the courses, stating that the courses allowed them to apply their knowledge to real-life problems.

4. **Support and Resources**: The Q&A support and additional materials provided were highlighted as beneficial, enhancing the overall learning experience.

### Areas for Improvement:
1. **Incomplete Feedback**: A significant number of reviews for the "Master Time Series Analysis and Forecasting with Python 2024" course included "nan" comments, indicating a lack of detailed feedback. Encouraging users to provide more specific comments could help in understanding their experiences better.

2. **Diverse Learning Outcomes**: While many users rated the course highly, some ratings were lower (such as a 1.5 rating), suggesting that there may be gaps in meeting the expectations of all learners. Gathering more detailed feedback from lower-rated reviews could help identify specific areas of dissatisfaction.

3. **Engagement with Material**: Some comments expressed a need for more engagement or interaction within the course. Suggestions could include incorporating more interactive elements or practical exercises to keep learners engaged.

4. **Course Progression**: A few reviews mentioned that while the course content was good, they would know more about its effectiveness once they fully engaged with the material. This indicates a potential need for clearer course outcomes or milestones to guide learners through their journey.

### Conclusion:
Overall, the time series courses appear to be well-regarded, especially for their quality content and knowledgeable instructors. To enhance user satisfaction further, efforts should be made to encourage detailed feedback, address any gaps identified in lower ratings, and consider suggestions for increasing interactivity and learner engagement.