# Final Project - 22KDL1

<div align="center">

**Group members**

<table>
    <tr>
        <th>Full name</th>
        <th>Student ID</th>
    </tr>
    <tr>
        <td>Nguyễn Trường Thịnh</td>
        <td>22280086</td>
    </tr>
    <tr>
        <td>Lê Nguyễn Quỳnh Anh</td>
        <td>22280002</td>
    </tr>
    <tr>
        <td>Trần Bình Phương</td>
        <td>22280071</td>
    </tr>
    <tr>
        <td>Lê Thanh Thùy</td>
        <td>22280094</td>
    </tr>
</table>

</div>

## **Question 1:** LLM integration

The task involves building an AI capable of language translation.

<div style="margin-left: 20px;">

**1. Single Text Translation:**  
Write Python code that uses the OpenAI API to translate a given text into Vietnamese.  
(Check first whether the text is already in the destination language.)  
For example, translating "Hello" into Vietnamese should return "Xin chào", while "Xin chào" should be returned unchanged.
</div>

<div style="margin-left: 20px;">

**2. Multiple Texts Translation:** <br>
Similar to 1.1, but the input is a list of texts. The Python code should accept a list of strings and return their translations in the specified language.<br>
For instance, translating ["Hello", "I am John", "Tôi là sinh viên"] into Vietnamese should return ["Xin chào", "Tôi tên là John", "Tôi là sinh viên"].
</div>

In [1]:
import google.generativeai as genai
import os
from dotenv import load_dotenv



In [2]:
# Load environment variables from the .env file
load_dotenv(dotenv_path="environment.env")
# Configure the Gemini API
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

# Model to use
model = genai.GenerativeModel('gemini-pro')

In [3]:
def is_vietnamese(text):
    """
    Heuristically decide whether the given text is Vietnamese by checking
    whether it contains any Vietnamese letter with a diacritic.
    Vietnamese written without diacritics is not detected.
    """
    vietnamese_chars = "àáảãạăằắẳẵặâầấẩẫậđèéẻẽẹêềếểễệìíỉĩịòóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵ"
    # Lowercase first so that uppercase letters such as "Đ" are matched too
    for char in text.lower():
        if char in vietnamese_chars:
            return True
    return False
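
The diacritic check above can miss text whose accents are stored as separate combining characters. The following is a minimal sketch, not part of the original notebook, that normalizes the input to NFC before checking; the name `looks_vietnamese` is hypothetical. It remains a heuristic: a word from another language that shares a character, such as "café", is still classified as Vietnamese.

```python
import unicodedata

# Lowercase Vietnamese letters that carry diacritics, normalized to NFC so the
# set holds single precomposed characters only.
VIETNAMESE_CHARS = set(unicodedata.normalize(
    "NFC",
    "àáảãạăằắẳẵặâầấẩẫậđèéẻẽẹêềếểễệìíỉĩịòóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵ",
))

def looks_vietnamese(text):
    """Heuristic: True if the text contains any Vietnamese diacritic letter."""
    # NFC merges "a" + combining grave accent into the single character "à".
    normalized = unicodedata.normalize("NFC", text).lower()
    return any(char in VIETNAMESE_CHARS for char in normalized)

print(looks_vietnamese("Hello"))           # False
print(looks_vietnamese("ĐÀ NẴNG"))         # True
print(looks_vietnamese("Xin cha\u0300o"))  # True (decomposed accent)
```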

In [4]:
def translate_single_text(request):
    """
    Translate a single text into the requested language with the Gemini API.
    `request` is a dict with the keys 'text' and 'dest_language'.
    """
    # Text that is already Vietnamese is returned unchanged
    if is_vietnamese(request['text']):
        return request['text']
    try:
        prompt = f"Translate the following text to {request['dest_language']}: {request['text']}"
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(f"Error during translation: {e}")
        return None

In [5]:
def translate_multiple_texts(request):
    """
    Translate a list of texts into the requested language with the Gemini API.
    `request` is a dict whose 'text' key holds the list of strings.
    """
    translated_texts = []
    for text in request['text']:
        # Text that is already Vietnamese is kept as it is
        if is_vietnamese(text):
            translated_texts.append(text)
        else:
            prompt = f"Translate the following text to {request['dest_language']}: {text}"
            response = model.generate_content(prompt)
            translated_texts.append(response.text)
    return translated_texts
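
The loop above sends one request per string. A possible alternative is to send all strings in a single numbered prompt and split the reply afterwards. The sketch below covers only the prompt building and reply parsing, with no API call. It assumes the model keeps the numbering in its answer, which is not guaranteed, so the parser returns `None` when the count does not match. Both function names are hypothetical.

```python
import re

def build_batch_prompt(texts, dest_language):
    """Build one prompt that asks for a numbered translation of every text."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(texts, start=1))
    return (
        f"Translate each numbered line to {dest_language}. "
        "Reply with the same numbering, one translation per line.\n"
        f"{numbered}"
    )

def parse_numbered_response(response_text, expected_count):
    """Parse lines such as '1. Xin chào'; return None if the count is off."""
    translations = []
    for line in response_text.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            translations.append(match.group(1).strip())
    return translations if len(translations) == expected_count else None

prompt = build_batch_prompt(["Hello", "Good morning"], "vi")
# A hand-written reply standing in for the model output (assumption).
fake_reply = "1. Xin chào\n2. Chào buổi sáng"
print(parse_numbered_response(fake_reply, 2))
```

When the parser returns `None`, the caller could fall back to translating the strings one at a time.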

In [9]:
if __name__ == '__main__':
    json_1 = {
        'text': 'Hello',
        'dest_language': 'vi'
    }

    translated_1 = translate_single_text(json_1)
    print(f"Original: '{json_1['text']}'| Translated: '{translated_1}'")

    json_2 = {
        'text': 'Xin chào',
        'dest_language': 'vi'
    }

    translated_2 = translate_single_text(json_2)
    print(f"Original: '{json_2['text']}'| Translated: '{translated_2}'")

    json_3 = {
        'text': [
            "Hello",
            "I am John",
            "Tôi là sinh viên",
            "Good morning",
            "Bạn có khỏe không?",
            "The weather is nice today."
        ],
        'dest_language': 'vi'
    }

    translated_3 = translate_multiple_texts(json_3)
    print(f"Original: '{json_3['text']}' \nTranslated: '{translated_3}'")
    translated_3_sentences = ", ".join(translated_3)
    print(f"Original: '{json_3['text']}' \nTranslated: '{translated_3_sentences}'")

    GOOGLE_API_KEY = ""

Original: 'Hello'| Translated: 'Xin chào'
Original: 'Xin chào'| Translated: 'Xin chào'
Original: '['Hello', 'I am John', 'Tôi là sinh viên', 'Good morning', 'Bạn có khỏe không?', 'The weather is nice today.']' 
Translated: '['Xin chào', 'Tôi là John', 'Tôi là sinh viên', 'Chào buổi sáng', 'Bạn có khỏe không?', 'Hôm nay thời tiết đẹp quá.']'
Original: '['Hello', 'I am John', 'Tôi là sinh viên', 'Good morning', 'Bạn có khỏe không?', 'The weather is nice today.']' 
Translated: 'Xin chào, Tôi là John, Tôi là sinh viên, Chào buổi sáng, Bạn có khỏe không?, Hôm nay thời tiết đẹp quá.'


## **Question 2:** Chatbot Development

Assignment Test: Chatbot Development from Website Data. The data is at https://www.presight.io/privacy-policy.html
<div style="margin-left: 20px;">

**1. Data Access and Indexing:** <br>
To build the chatbot, first crawl the specified website to gather the relevant data, then preprocess and structure this data into a searchable index that is ready for query retrieval.

</div>

<div style="margin-left: 20px;">

**2. Chatbot Development:** <br>
Develop a chatbot that employs natural language processing to comprehend user questions, searches the indexed data from 2.1 for the best match, and delivers the most accurate response drawn from the website's information.

</div>
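
The index-then-retrieve flow described above can be sketched with the standard library alone. In this minimal sketch, word counts and cosine similarity stand in for the sentence embeddings used later in the notebook. The two passages are made-up examples, not text from the real page, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(passages):
    """Pair every passage with its word counts."""
    return [(passage, Counter(tokenize(passage))) for passage in passages]

def search(query, index):
    """Return the passage whose word counts are closest to the query."""
    query_counts = Counter(tokenize(query))
    return max(index, key=lambda entry: cosine(query_counts, entry[1]))[0]

# Hypothetical passages for illustration only.
passages = [
    "We use cookies to remember your preferences.",
    "We do not share personal data with third parties.",
]
index = build_index(passages)
print(search("Do you share my data?", index))
# We do not share personal data with third parties.
```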

### **Crawling data**

In [12]:
import json
from bs4 import BeautifulSoup
import requests

url = 'https://www.presight.io/privacy-policy.html'

In [None]:
def crawl_page(url):
    """Crawl the page and group its text into sections (h2) and subsections (i)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop early on an HTTP error status
    soup = BeautifulSoup(response.text, 'html.parser')

    sections = []
    current_section = None

    all_elements = soup.find_all(['h2', 'i', 'p', 'ul'])

    element_index = 0
    while element_index < len(all_elements):
        element = all_elements[element_index]
        if element.name == 'h2':
            current_section = {
                "section_title": [element.text.strip()],
                "content": [],
                "subsections": [],
                "items": []
            }

            sections.append(current_section)

            # Walk the following tags until the next h2 or the end of the document
            next_element_index = element_index + 1
            while next_element_index < len(all_elements) and all_elements[next_element_index].name != 'h2':
                next_element = all_elements[next_element_index]
                # Route each tag to its part: p is section content, i starts a subsection
                if next_element.name == 'p':
                    current_section['content'].append(next_element.text.strip())
                elif next_element.name == 'i':
                    # A subsection has its own p and ul tags, stored as its content and items
                    current_subsection = {
                        "subsection_title": [next_element.text.strip()],
                        "content": [],
                        "items": []
                    }
                    current_section['subsections'].append(current_subsection)

                    # As with sections, walk on until the next h2 or i, or the end of the document
                    subsection_next_element_index = next_element_index + 1
                    while subsection_next_element_index < len(all_elements) and all_elements[subsection_next_element_index].name not in ['h2', 'i']:
                        subsection_next_element = all_elements[subsection_next_element_index]
                        # p tags go into the subsection's content
                        if subsection_next_element.name == 'p':
                            current_subsection['content'].append(subsection_next_element.text.strip())
                        # ul tags: store each direct li child in the subsection's items
                        elif subsection_next_element.name == 'ul':
                            for li in subsection_next_element.find_all('li', recursive=False):
                                current_subsection['items'].append(li.text.strip())
                        subsection_next_element_index += 1

                    next_element_index = subsection_next_element_index - 1
                # Same handling for a ul that belongs directly to the section
                elif next_element.name == 'ul':
                    for li in next_element.find_all('li', recursive=False):
                        current_section['items'].append(li.text.strip())
                next_element_index += 1

            element_index = next_element_index

        else:
            element_index += 1

    return {"sections": sections}
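
To try the grouping idea without a network request, here is a minimal sketch that uses the standard library's `HTMLParser` instead of BeautifulSoup. It handles only `h2`, `p`, and `li` tags, with no subsections, and the HTML snippet is invented for illustration. The class name `SectionParser` is hypothetical.

```python
from html.parser import HTMLParser

class SectionParser(HTMLParser):
    """Group <p> and <li> text under the most recent <h2> heading."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._tag = None     # tag whose text is currently being collected
        self._buffer = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p", "li"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "h2":
            self.sections.append({"section_title": text, "content": [], "items": []})
        elif self.sections:
            key = "content" if tag == "p" else "items"
            self.sections[-1][key].append(text)
        self._tag = None

# Invented snippet, not the real privacy policy page.
sample_html = """
<h2>Cookies</h2><p>We use <b>cookies</b>.</p>
<ul><li>Session</li><li>Preference</li></ul>
<h2>Contact</h2><p>Email us.</p>
"""
parser = SectionParser()
parser.feed(sample_html)
print(parser.sections)
```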

In [None]:
# Crawl the website
data = crawl_page(url)

# Write the result to a JSON file
with open('structured_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print("Data saved to structured_data.json")

Data saved to structured_data.json


### **Building Chatbot**

#### Preparation

In [14]:
import json
import logging
import torch
import re
import google.generativeai as genai
from sentence_transformers import SentenceTransformer, util
import os
import random

In [15]:
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
device = "cuda" if torch.cuda.is_available() else "cpu"
sbert_model = sbert_model.to(device)

In [16]:
# Reload the environment variables
load_dotenv(dotenv_path="environment.env")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

#### **Part 2.1** Embedding data

In [17]:
logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')

def load_json_from_file(file_path):
    """
    Load JSON data from a file; return None if loading fails
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:  # Read the file as UTF-8
            data = json.load(f)
            return data
    except FileNotFoundError:
        logging.error(f"File not found at '{file_path}'")
        return None
    except json.JSONDecodeError:
        logging.error(f"Invalid JSON format in file '{file_path}'")
        return None
    except UnicodeDecodeError as e:  # Report encoding problems separately
        logging.error(f"UnicodeDecodeError occurred: {e} in file '{file_path}'")
        return None
    except Exception as e:
        logging.error(f"An unexpected error occurred: {e}")
        return None

In [18]:
file_path = "structured_data.json"
loaded_data = load_json_from_file(file_path)

if loaded_data:
    print("Successfully loaded JSON data from file:")
    print(json.dumps(loaded_data, indent=1))
else:
    print("Failed to load JSON data")

Successfully loaded JSON data from file:
{
 "sections": [
  {
   "section_title": [
    "PRIVACY POLICY"
   ],
   "content": [],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Last updated 15 Sep 2023"
   ],
   "content": [
    "At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Information Collection and Use"
   ],
   "content": [
    "We collect several different types of information for various purposes to provide and improve our Service to you."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Types of Data Collected"
   ],
   "content": [],
   "subsections": [
    {
     "subsection_title": [
      "Personal Data"
     ],
     "content": [
      "While using our Service, we may ask you to p

In [19]:
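# The crawler stored "Last updated 15 Sep 2023" as a section title, with the
# introduction as its content. Move the introduction under "PRIVACY POLICY".
loaded_data['sections'][0]['content'] = loaded_data['sections'][1]['content']

# Insert a new section that holds the "Last updated ..." line as its content
new_data = {
    "section_title": None,
    "content": None,
    "subsections": [],
    "items": []
}

loaded_data['sections'].insert(1, new_data)

loaded_data['sections'][1]['section_title'] = ["Latest Version"]
loaded_data['sections'][1]['content'] = loaded_data['sections'][2]['section_title']

# Remove the original "Last updated ..." section, which is now redundant
del loaded_data['sections'][2]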

In [20]:
print(json.dumps(loaded_data, indent=1))

{
 "sections": [
  {
   "section_title": [
    "PRIVACY POLICY"
   ],
   "content": [
    "At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Latest Version"
   ],
   "content": [
    "Last updated 15 Sep 2023"
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Information Collection and Use"
   ],
   "content": [
    "We collect several different types of information for various purposes to provide and improve our Service to you."
   ],
   "subsections": [],
   "items": []
  },
  {
   "section_title": [
    "Types of Data Collected"
   ],
   "content": [],
   "subsections": [
    {
     "subsection_title": [
      "Personal Data"
     ],
     "content": [
      "While using our Service, we may ask you to provide us with c

In [21]:
def clean_text(text):
    """
    Clean a text string: replace newlines with spaces and collapse repeated whitespace
    """
    if not isinstance(text, str):
        return ""
    text = text.replace('\n', ' ').strip()
    text = re.sub(r'\s+', ' ', text)

    return text

In [22]:
def preprocess_privacy_policy(data):
    """
    Flatten the privacy policy JSON into one list of strings per section:
    the section's own text first, followed by one string per subsection
    """
    processed_texts = []

    for section in data.get("sections", []):
        section_texts = []
        section_title = section.get("section_title", [])
        if isinstance(section_title, list):
            section_title = " ".join(item.strip() for item in section_title if isinstance(item, str))
        section_title = section_title.strip()

        section_content = section.get("content", [])
        if isinstance(section_content, list):
            section_content = " ".join(item.strip() for item in section_content if isinstance(item, str))
        section_content = section_content.strip()

        items = section.get("items", [])
        if isinstance(items, list):
            items = [item.strip() for item in items if isinstance(item, str)]
            if items:
                section_content += " " +  ", ".join(items)

        section_text = " ".join(filter(None, [section_title, section_content])).strip()

        if section_text:
            section_texts.append(section_text)  

        for subsection in section.get("subsections", []):
            subsection_title = subsection.get("subsection_title", [])
            if isinstance(subsection_title, list):
                subsection_title = " ".join(item.strip() for item in subsection_title if isinstance(item, str))
            subsection_title = subsection_title.strip()

            subsection_content = subsection.get("content", [])
            if isinstance(subsection_content, list):
                subsection_content = " ".join(item.strip() for item in subsection_content if isinstance(item, str))
            subsection_content = subsection_content.strip()

            subsection_items = subsection.get("items", [])
            if isinstance(subsection_items, list):
                subsection_items = [item.strip() for item in subsection_items if isinstance(item, str)]
                if subsection_items:
                    subsection_content += " " + ", ".join(subsection_items)

            subsection_text = " ".join(filter(None, [subsection_title, subsection_content])).strip()

            if subsection_text:
                section_texts.append(subsection_text)  

        if not (section_content or items or section.get("subsections")):
            continue

        processed_texts.append(section_texts)

    cleaned_texts = [text for text in processed_texts if text]
    return cleaned_texts
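
The function above repeats the same title/content/items joining for sections and subsections. One possible refactor, sketched here under the assumption that each field is a list of strings, moves that step into a helper. The name `join_block` is hypothetical. Unlike the original, it never leaves a double space when the content is empty.

```python
def join_block(title, content, items):
    """Join a block's title, paragraphs, and list items into one string."""
    def as_text(value, separator):
        # Accept a list of strings (or a bare string) and skip other entries.
        if isinstance(value, str):
            value = [value]
        return separator.join(v.strip() for v in (value or []) if isinstance(v, str))

    # Titles and paragraphs are joined with spaces, list items with commas.
    parts = [as_text(title, " "), as_text(content, " "), as_text(items, ", ")]
    return " ".join(part for part in parts if part)

print(join_block(["Cookies"], ["We use cookies."], ["Session", "Preference"]))
# Cookies We use cookies. Session, Preference
```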

In [23]:
texts = preprocess_privacy_policy(loaded_data)
embeddings = [sbert_model.encode(text) for text in texts]

#### **Part 2.2** Chatbot Development

##### Finding the best match paragraph based on index

In [24]:
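def find_match(user_query, texts, embeddings, model):
    """
    Return the text that best matches the query by cosine similarity.
    If a section's own text matches best, the whole section (including its
    subsections) is returned; otherwise only the matching subsection is.
    """
    query_embedding = model.encode(user_query)

    best_score = -1
    best_match_index = None

    for i in range(len(embeddings)):
        for j in range(len(embeddings[i])):
            score = util.cos_sim(query_embedding, embeddings[i][j]).item()
            if score > best_score:
                best_score = score
                best_match_index = (i, j)

    # Nothing was indexed, so there is nothing to match against
    if best_match_index is None:
        return ""

    section_index, text_index = best_match_index
    if text_index == 0:
        result = "\n".join(item.strip() for item in texts[section_index] if isinstance(item, str))
    else:
        # This entry is a single string; joining over it would put a
        # newline between every character
        result = texts[section_index][text_index].strip()

    return result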
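
The nested search above always returns some passage, even for an unrelated question. The sketch below shows the same nested arg-max on plain Python lists, with an added similarity threshold. The toy vectors and the `min_score=0.3` cutoff are arbitrary assumptions, not values from the notebook, and `best_nested_match` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_nested_match(query_vec, nested_vecs, min_score=0.3):
    """Return ((i, j), score) for the best vector, or (None, score) below min_score."""
    best_index, best_score = None, -1.0
    for i, group in enumerate(nested_vecs):
        for j, vec in enumerate(group):
            score = cosine(query_vec, vec)
            if score > best_score:
                best_index, best_score = (i, j), score
    if best_score < min_score:
        return None, best_score
    return best_index, best_score

# Toy 2-D "embeddings": two sections, the first with two entries.
nested = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
index, score = best_nested_match([0.9, 0.1], nested)
print(index)  # (0, 0)
```

A caller could answer "I could not find this in the policy" when the index is `None` instead of passing an unrelated passage to the language model.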

##### Using Gemini API to answer queries

In [25]:
def chatbot_response(user_query, texts, embeddings, sbert_model):
    match = find_match(user_query, texts, embeddings, sbert_model)

    model = genai.GenerativeModel("gemini-1.5-flash")

    # Separate the parts with newlines so they do not run together in the prompt
    prompt = (
        f"Given the following text:\n{match}\n\n"
        f"Answer this question: {user_query}\n"
        "Requirements:\n"
        "1. Include all relevant information from the text.\n"
        "2. Answer in natural English.\n"
        "Break long output into multiple lines."
    )

    response = model.generate_content(prompt)

    return response.text

##### Communicating with the chatbot

In [26]:
def greeting_response(text):
    """Return a random greeting if the text contains one, otherwise None."""
    text = text.lower()

    bot_greetings = ['hi', 'hello', 'hey']
    user_greetings = ['hi', 'hey', 'hello']

    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)
    return None
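
Because the function above splits on whitespace, a greeting followed by punctuation, such as "Hello!", is not recognized. A minimal sketch of a punctuation-tolerant variant follows; the name `detect_greeting` and the `rng` parameter are hypothetical and only make the random choice easy to control.

```python
import random
import re

GREETINGS = ("hi", "hello", "hey")

def detect_greeting(text, rng=random):
    """Return a random greeting if the text contains one as a whole word, else None."""
    # Extract alphabetic words so that "Hello!" yields the token "hello".
    words = re.findall(r"[a-z]+", text.lower())
    if any(word in GREETINGS for word in words):
        return rng.choice(GREETINGS)
    return None

print(detect_greeting("Hello!") in GREETINGS)  # True
print(detect_greeting("What is this?"))        # None
```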

Sample questions:<br>
- Who do you share data with?
- How do you use my personal information?
- What types of personal data do you collect?
- What about the privacy policy changes?
- How about third-party websites?
- What do you use cookies for?
- What types of personal data might the service ask you to provide?
- How can I access personal information?
- When was the last version updated?

In [29]:
print('I will answer queries about the Privacy Policy of Presight. To exit, type bye.')

exit_list = ['exit', 'see you later', 'bye', 'quit', 'break', 'thanks']

while True:
    user_input = input()
    print('User:', user_input)
    if user_input.lower() in exit_list:
        print('Chatbot: Bye Bye! Chat with you later!')
        break

    # Call greeting_response once so the check and the reply use the same result
    greeting = greeting_response(user_input)
    if greeting is not None:
        print('Chatbot: ' + greeting + '\n')
    else:
        response = chatbot_response(user_input, texts, embeddings, sbert_model)
        print('Chatbot: ' + response)

I will answer queries about the Privacy Policy of Presight. To exit, type bye.
User: hello
Chatbot: hi

User: Who do you share data with?
Chatbot: Your personal data will not be shared, transferred, rented, or exchanged with any third parties, including AI models.

User: What do you use cookies for?
Chatbot: We use cookies to enhance your experience on our website.  You can control their use through your web browser settings.

User: What types of personal data might the service ask you to provide?
Chatbot: The service may ask you to provide the following personally identifiable information:

* Email address
* First name and last name
* Phone number
* Address
* State
* Province
* ZIP/Postal code
* City
* Cookies
* Usage Data

User: How can I access personal information?
Chatbot: To access your personal information, log into the application and go to your settings and profile.  This allows you to view, correct, amend, or append your information.

User: What about the privacy