# Final Assignment - Python for DS

| Full Name | Student ID |
|-----------|------------|
| Trương Minh Hoàng | 22280034 |
| Nguyễn Duy Huân | 22280035 |

We built a Streamlit web app for trying out the chatbot: https://presight-chatbot-byhoangvahuan.streamlit.app/

All code files, along with the crawled and indexed data, are available in this GitHub repository: https://github.com/tmhoanggg/Final-Assignment_Python-for-DS

## Question 1: LLM Integration

In [1]:
import google.generativeai as genai
from dotenv import load_dotenv
import os

load_dotenv()

class Translator():
    def __init__(self, model):
        self.__api_key = os.getenv("API_KEY")
        self.model = model

    def translate_single_text(self, request):
        """Translate a single text string."""
        text = request['text']
        dest_language = request['dest_language']

        genai.configure(api_key=self.__api_key)
        # Adjacent string literals are concatenated, so the prompt carries
        # no stray indentation from line continuations.
        prompt = (
            f"Translate text {text} into destination language {dest_language}. "
            "If the text is already in the destination language, keep it unchanged. "
            "Do not add redundant information or punctuation."
        )
        response = self.model.generate_content(prompt)

        return response.text

    def translate_multiple_texts(self, request):
        """Translate a list of text strings."""
        dest_language = request['dest_language']
        # Reuse the single-text method for each item, preserving order.
        return [
            self.translate_single_text({'text': text, 'dest_language': dest_language})
            for text in request['text']
        ]



In [2]:
model = genai.GenerativeModel("gemini-1.5-flash")

translator = Translator(model)

In [3]:
json_1 = {
    'text': 'Hello',
    'dest_language': 'vi'
}

translator.translate_single_text(json_1)


'Xin chào\n'

In [4]:
json_2 = {
    'text': ['Hello', 'I am Peter', 'Tôi là sinh viên'],
    'dest_language': 'vi'
}

translator.translate_multiple_texts(json_2)

['Xin chào\n', 'Tôi là Peter\n', 'Tôi là sinh viên\n']
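The outputs above end with a trailing newline that the model appends to its reply. A minimal sketch of a post-processing step follows, assuming a hypothetical helper named `clean_translation` that is not part of the original notebook. It only strips surrounding whitespace and leaves the translated text otherwise unchanged.

```python
def clean_translation(raw):
    """Strip surrounding whitespace from one reply or from a list of replies.

    Hypothetical helper, not part of the original Translator class.
    """
    if isinstance(raw, str):
        return raw.strip()
    return [item.strip() for item in raw]


# Values copied from the notebook outputs above.
print(clean_translation('Xin chào\n'))
print(clean_translation(['Xin chào\n', 'Tôi là Peter\n', 'Tôi là sinh viên\n']))
```

The helper could be applied to the return values of `translate_single_text` and `translate_multiple_texts` without touching the class itself.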

## Question 2: Chatbot Development


### 2.1 Data Access and Indexing

In [5]:
from selenium import webdriver
from selenium.webdriver.edge.options import Options
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By 
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np




In [6]:
def scrapingData(url):
    DRIVER_PATH = "C:/edgedriver_win32/msedgedriver.exe"
    options = Options()
    options.add_argument('--headless')
    service = Service(executable_path=DRIVER_PATH)
    # Options and Service come from selenium.webdriver.edge, so the driver must be Edge.
    driver = webdriver.Edge(service=service, options=options)

    content = []
    headers = []
    try:
        driver.get(url)
        elements = driver.find_elements(By.XPATH, "//h2 | //p | //ul | //i")

        for element in elements:
            text = element.text.strip()
            # Skip empty elements before recording them as headers or content.
            if not text:
                continue
            if element.tag_name in ('h2', 'i'):
                headers.append(text)
            if element.tag_name == 'ul':
                # Join the list items into a single content entry.
                li_elements = element.find_elements(By.TAG_NAME, 'li')
                list_items = [li.text.strip() for li in li_elements if li.text.strip()]
                if list_items:
                    content.append(" ".join(list_items))
            else:
                content.append(text)
    finally:
        # Always close the browser, even if scraping fails.
        driver.quit()
    return content, headers
        

In [7]:
url = 'https://www.presight.io/privacy-policy.html'
content, headers = scrapingData(url)
# Show the first 5 elements
content[:5]

['PRIVACY POLICY',
 'Last updated 15 Sep 2023',
 'At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.',
 'Information Collection and Use',
 'We collect several different types of information for various purposes to provide and improve our Service to you.']

In [8]:
headers

['PRIVACY POLICY',
 'Last updated 15 Sep 2023',
 'Information Collection and Use',
 'Types of Data Collected',
 'Personal Data',
 'Usage Data',
 'Use of Data',
 'Consent',
 'Access to Personal Information',
 'Accessing Your Personal Information',
 'Automated Edit Checks',
 'Disclosure of Information',
 'Sharing of Personal Data',
 'Google User Data and Google Workspace APIs',
 'Data Security',
 'Data Retention & Disposal',
 "Quality, Including Data Subjects' Responsibilities for Quality",
 'Monitoring and Enforcement',
 'Cookies',
 'Third-Party Websites',
 'Changes to Privacy Policy',
 'Contact Us',
 'Purposeful Use Only']

In [9]:
def index_content(content, headers):
    indexed_content = {}
    current_index = None
    title = None
    content_list = []
    for item in content:
        # Check whether the item is a header
        if item in headers:
            # On a new title, save the previous title and its content
            if current_index is not None:
                indexed_content[current_index] = {
                    'title': title,
                    'content': content_list
                }
            # Start a new section: set the title and reset the content
            title = item
            content_list = []
            current_index = len(indexed_content)
        else:
            # Otherwise, add the item to the current title's content
            content_list.append(item)

    # Save the last section
    if current_index is not None:
        indexed_content[current_index] = {
            'title': title,
            'content': content_list
        }
    return indexed_content


indexed_content = index_content(content, headers)

for index, data in indexed_content.items():
    print(f"Index {index}:")
    print(f"\tTitle: {data['title']}")
    print(f"\tContent: {data['content']}")

Index 0:
	Title: PRIVACY POLICY
	Content: []
Index 1:
	Title: Last updated 15 Sep 2023
	Content: ['At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.']
Index 2:
	Title: Information Collection and Use
	Content: ['We collect several different types of information for various purposes to provide and improve our Service to you.']
Index 3:
	Title: Types of Data Collected
	Content: []
Index 4:
	Title: Personal Data
	Content: ['While using our Service, we may ask you to provide us with certain personally identifiable information that can be used to contact or identify you ("Personal Data"). Personally identifiable information may include, but is not limited to:', 'Email address First name and last name Phone number Address, State, Province, ZIP/Postal code, City Cookies and Usage Data']
Index 5:
	Title: Usage Data
	Content: ['We may al

In [12]:
indexed_content[0] = {'title': 'PRIVACY POLICY - Last update', 'content': ['Last updated 15 Sep 2023']}
indexed_content[1]['title'] = 'PRIVACY POLICY - Description'
indexed_content[3]['content'] = 'Personal Data, Usage Data'
indexed_content[8]['content'] = 'Accessing Your Personal Information, Automated Edit Checks'
indexed_content

{0: {'title': 'PRIVACY POLICY - Last update',
  'content': ['Last updated 15 Sep 2023']},
 1: {'title': 'PRIVACY POLICY - Description',
  'content': ['At Presight, we are committed to protecting the privacy of our customers and visitors to our website. This Privacy Policy explains how we collect, use, and disclose information about our customers and visitors.']},
 2: {'title': 'Information Collection and Use',
  'content': ['We collect several different types of information for various purposes to provide and improve our Service to you.']},
 3: {'title': 'Types of Data Collected',
  'content': 'Personal Data, Usage Data'},
 4: {'title': 'Personal Data',
  'content': ['While using our Service, we may ask you to provide us with certain personally identifiable information that can be used to contact or identify you ("Personal Data"). Personally identifiable information may include, but is not limited to:',
   'Email address First name and last name Phone number Address, State, Province, Z
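The manual corrections above leave `content` as a plain string for entries 3 and 8, while the other entries hold a list. If a uniform type is preferred downstream, one option is to wrap bare strings in a list. The sketch below assumes a hypothetical helper `normalize_content` that is not part of the original notebook, and uses two stand-in entries copied from the output above.

```python
def normalize_content(indexed_content):
    """Return a copy of the index in which every 'content' value is a list.

    Hypothetical helper, not part of the original notebook.
    """
    normalized = {}
    for key, section in indexed_content.items():
        content = section["content"]
        if isinstance(content, str):
            # Wrap a bare string so that all entries share the same type.
            content = [content]
        normalized[key] = {"title": section["title"], "content": list(content)}
    return normalized


# Stand-in entries mirroring the two shapes shown above.
sample = {
    2: {
        "title": "Information Collection and Use",
        "content": ["We collect several different types of information for "
                    "various purposes to provide and improve our Service to you."],
    },
    3: {"title": "Types of Data Collected", "content": "Personal Data, Usage Data"},
}

print(normalize_content(sample)[3]["content"])
```

The input dictionary is left untouched, so the helper can be tried without affecting the rest of the notebook.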

In [23]:
import json

# Save the dictionary to a JSON file
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(indexed_content, file, ensure_ascii=False)
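
One detail to keep in mind when reloading `data.json`: JSON object keys are always strings, so the integer keys of `indexed_content` come back as `'0'`, `'1'`, and so on. A minimal sketch of a round trip that restores integer keys follows. It uses a small stand-in dictionary, a temporary file, and a hypothetical helper `load_indexed_content`, none of which appear in the original notebook.

```python
import json
import os
import tempfile


def load_indexed_content(path):
    """Load the index and convert the string keys back to integers.

    Hypothetical helper, not part of the original notebook.
    """
    with open(path, "r", encoding="utf-8") as file:
        raw = json.load(file)
    return {int(key): value for key, value in raw.items()}


# Small stand-in for the real indexed_content dictionary.
sample = {0: {"title": "PRIVACY POLICY - Last update",
              "content": ["Last updated 15 Sep 2023"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file, ensure_ascii=False)
    restored = load_indexed_content(path)

# Compare the reloaded dictionary with the one that was saved.
print(restored == sample)
```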

### 2.2 Chatbot Development

In [24]:
class Chatbot():
    def __init__(self, encoder, model):
        self.__api_key = os.getenv("API_KEY")
        self.encoder = encoder
        self.model = model

    def separate(self, indexed_content):
        titles = {key: value['title'] for key, value in indexed_content.items()}
        content = {key: value['content'] for key, value in indexed_content.items()}
        return titles, content

    def chat_bot(self, indexed_content, query):
        titles, content = self.separate(indexed_content)
        keys = list(titles.keys())
        # Embed the section titles and the query, then pick the closest title.
        titles_embedding = self.encoder.encode(list(titles.values()))
        query_embedding = self.encoder.encode([query])
        similarities = cosine_similarity(query_embedding, titles_embedding)
        # argmax returns a position, so map it back to the dictionary key.
        most_similar_key = keys[int(np.argmax(similarities))]

        genai.configure(api_key=self.__api_key)

        prompt = (
            "You are an expert in answering questions about the privacy policy of a company. "
            "If the query is just a greeting or small talk, just reply normally. "
            f"Otherwise, answer the query based on the following information: {content[most_similar_key]}. "
            f"This is the query: {query}"
        )
        # The full response object is returned; callers read response.text.
        return self.model.generate_content(prompt)

In [25]:
encoder = SentenceTransformer('all-MiniLM-L6-v2')
model = genai.GenerativeModel("gemini-1.5-flash")

chatbot = Chatbot(encoder=encoder, model=model)

In [26]:
query = "What types of data does the Presight website collect?"
response = chatbot.chat_bot(indexed_content, query)
print(f"Q: {query}")
print(f"A: {response.text}")

Q: What types of data does the Presight website collect?
A: The Presight website collects two main types of data: Personal Data and Usage Data.  However, this is a very general answer.  To provide a truly helpful response, I need more specifics.  What constitutes "Personal Data" and "Usage Data" for Presight needs further definition.  For example:

* **Personal Data:**  Does this include names, email addresses, IP addresses, location data, cookies, etc.?  A comprehensive list is required for a complete answer.
* **Usage Data:**  What specific actions or behaviors are tracked?  Examples include browsing history, search queries, interaction with specific features, timestamps of actions, etc.  Again, a detailed list is crucial.


Without this detailed information about what each category *specifically* entails for the Presight website, my answer remains incomplete and potentially misleading.  Please provide the specific definitions of "Personal Data" and "Usage Data" as used by Presight's
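
The hedged answer above is probably a consequence of retrieval: the query is compared only with section titles, and only one section's content is passed to the model. Here the likely best match, "Types of Data Collected", holds only the short string set in In [12]. One possible refinement is to rank sections by title plus content and to pass the top few sections as context. The sketch below illustrates that idea under stated assumptions: `toy_encode`, `cosine`, `as_text`, and `top_k_sections` are hypothetical names, the bag-of-words encoder is a stand-in for a real sentence encoder so that the example runs without downloads, and the sample index text is illustrative.

```python
import math
from collections import Counter


def toy_encode(text):
    """Toy bag-of-words 'embedding' (stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def as_text(content):
    """Join list content into one string; pass plain strings through."""
    return content if isinstance(content, str) else " ".join(content)


def top_k_sections(indexed_content, query, k=2, encode=toy_encode):
    """Return the keys of the k sections most similar to the query.

    Each section is represented by its title plus its content, so a
    section can match on its body text and not only on its heading.
    """
    query_vec = encode(query)
    scored = []
    for key, section in indexed_content.items():
        section_text = f"{section['title']} {as_text(section['content'])}"
        scored.append((cosine(query_vec, encode(section_text)), key))
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


# Tiny stand-in index; the text is illustrative, not the full policy.
sample_index = {
    0: {"title": "Types of Data Collected", "content": "Personal Data, Usage Data"},
    1: {"title": "Personal Data",
        "content": ["Email address First name and last name Phone number"]},
    2: {"title": "Cookies", "content": ["Illustrative text about cookies"]},
}

print(top_k_sections(sample_index, "What types of data are collected?", k=2))
```

In the notebook, the toy encoder could be replaced by the `SentenceTransformer` encoder together with `cosine_similarity`, and the contents of the returned sections could be concatenated into the prompt. Whether this improves the answers would need to be checked against the real data.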

Streamlit web app: https://presight-chatbot-byhoangvahuan.streamlit.app/

GitHub repository: https://github.com/tmhoanggg/Final-Assignment_Python-for-DS

## Thank you for viewing our work