### *Import Necessary Libraries and Define Preprocessing Functions*

## 2.0 Preprocessing the Text

This preprocessing pipeline standardizes and cleans textual data for efficient search queries. The steps are as follows:

---

### Step 1: Tokenization
- Splits text into individual words (tokens).
- Replaces punctuation with spaces (so hyphenated words split into separate tokens) and converts text to lowercase.

**Example Input:**  
*"Michelin-starred restaurant serves exquisite dishes in Paris!"*  
**Processed Tokens:**  
`["michelin", "starred", "restaurant", "serves", "exquisite", "dishes", "in", "paris"]`

---

### Step 2: Stopword Removal
- Removes common words like "and", "is", "the" that carry little meaning on their own.

**Before:**  
`["michelin", "starred", "restaurant", "serves", "exquisite", "dishes", "in", "paris"]`  
**After:**  
`["michelin", "starred", "restaurant", "serves", "exquisite", "dishes", "paris"]`

---

### Step 3: Stemming
- Reduces words to their stems using the Porter stemmer; a stem is not always a dictionary word.

**Examples:**  
`"serving" → "serv"`  
`"dishes" → "dish"`

**Final Tokens:**  
`["michelin", "star", "restaur", "serv", "exquisit", "dish", "pari"]`
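
The first two steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `SAMPLE_STOPWORDS` is a hypothetical, hand-picked list standing in for NLTK's English stopwords, and stemming is left out because it depends on NLTK.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English list instead.
SAMPLE_STOPWORDS = {"a", "an", "and", "in", "is", "of", "the"}

def tokenize(text):
    """Replace non-alphanumeric characters with spaces, lowercase, and split."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return cleaned.lower().split()

def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

sentence = "Michelin-starred restaurant serves exquisite dishes in Paris!"
tokens = tokenize(sentence)
filtered = remove_stopwords(tokens)

print(tokens)
print(filtered)
```

Because the hyphen is replaced by a space, "Michelin-starred" yields two tokens rather than one fused token.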

---

## Processed Columns
In this section, preprocessing is applied to the `description` column. Section 3 applies the same kind of preprocessing to two further columns:
- `cuisineType`
- `facilitiesServices`

---

## Example: Preprocessing in Action

**Raw Description:**  
*"Michelin-starred restaurant offering exquisite French cuisine in Paris."*  
**Processed Description:**  
`"michelin star restaur offer exquisit french cuisin pari"`


In [None]:
import pandas as pd
import re
import math
import json
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Initialize NLP tools for preprocessing
stop_words = set(stopwords.words('english'))  # Common stopwords
stemmer = PorterStemmer()  # Stemmer for reducing words to their base forms

def preprocess_text(text):
    """
    Preprocess the text:
    - Remove punctuation
    - Convert to lowercase
    - Remove stopwords
    - Apply stemming for normalization
    """
    if not isinstance(text, str):
        return []
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    tokens = text.lower().split()
    return [stemmer.stem(word) for word in tokens if word not in stop_words]

## *Create Vocabulary and Inverted Index (Task 2.1.1)*

## 2.1 Building the Vocabulary and Inverted Index

The vocabulary and inverted index are fundamental components for efficient text-based search. Here’s a breakdown of their creation:

---

### Step 1: Build Vocabulary
- Maps each unique word in the dataset to a unique integer (Term ID).
- **Input:** Preprocessed descriptions.
- **Output:** A dictionary where keys are words and values are Term IDs.

**Example Vocabulary** (illustrative IDs; words are stems, numbered in alphabetical order):

| Word        | Term ID |
|-------------|---------|
| "dish"      | 0       |
| "pari"      | 1       |
| "restaur"   | 2       |

### Step 2: Build Inverted Index

- Maps each Term ID to a list of Document IDs where the term appears.
- **Input:** Preprocessed descriptions and vocabulary.
- **Output:** A dictionary where keys are Term IDs and values are lists of Document IDs.

**Example Inverted Index:**

| Term ID | Document IDs |
|---------|--------------|
| 0       | [1, 3, 5]    |
| 1       | [0, 2, 4]    |
| 2       | [2]          |
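
To make the two structures concrete, here is a small sketch on a made-up corpus of three already-preprocessed documents. The names `toy_docs`, `toy_vocabulary` and `toy_index` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is a list of stemmed tokens
toy_docs = [
    ["dish", "pari"],
    ["restaur", "dish"],
    ["pari", "restaur", "dish", "dish"],
]

# Vocabulary: every unique term gets one integer ID, assigned in sorted order
toy_vocabulary = {
    term: term_id
    for term_id, term in enumerate(sorted({t for doc in toy_docs for t in doc}))
}

# Inverted index: term ID -> list of IDs of the documents containing the term.
# Iterating over set(doc) records each document at most once per term.
toy_index = defaultdict(list)
for doc_id, doc in enumerate(toy_docs):
    for term in set(doc):
        toy_index[toy_vocabulary[term]].append(doc_id)

print(toy_vocabulary)
print(dict(toy_index))
```

Since documents are visited in increasing ID order, every posting list comes out sorted without an explicit sort.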

In [18]:
def build_vocabulary(documents):
    """
    Create a vocabulary mapping each unique word in the dataset to a unique integer (term ID).
    """
    return {word: idx for idx, word in enumerate(sorted(set(word for doc in documents for word in doc)))}

def build_inverted_index(documents, vocabulary):
    """
    Create an inverted index that maps each term ID to the list of document IDs where the term appears.
    """
    inverted_index = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        for word in doc:
            if word in vocabulary:  # Only process words in the vocabulary
                term_id = vocabulary[word]
                if doc_id not in inverted_index[term_id]:  # Avoid duplicate entries
                    inverted_index[term_id].append(doc_id)
    return inverted_index

# Load dataset
df = pd.read_csv(r"C:\Users\39339\Desktop\ADM\HW3\michelin_restaurants.csv")

# Preprocess restaurant descriptions
descriptions = [preprocess_text(desc) for desc in df['description']]

# Create vocabulary and inverted index
vocabulary = build_vocabulary(descriptions)  # Map words to unique IDs
inverted_index = build_inverted_index(descriptions, vocabulary)  # Map term IDs to document IDs

# Save vocabulary and inverted index for future use
vocabulary_path = r"C:\Users\39339\Desktop\ADM\HW3\vocabulary.csv"
inverted_index_path = r"C:\Users\39339\Desktop\ADM\HW3\inverted_index.json"

# Save vocabulary as CSV
pd.DataFrame(list(vocabulary.items()), columns=["Word", "Term ID"]).to_csv(vocabulary_path, index=False)

# Save inverted index as JSON
with open(inverted_index_path, 'w') as f:
    json.dump(inverted_index, f)

print("Vocabulary and inverted index created and saved to files.")


Vocabulary and inverted index created and saved to files.


## *Execute Conjunctive Query (Task 2.1.2)*

## 2.1.2 Executing a Conjunctive Query

A conjunctive query retrieves documents (restaurants) where **all query words** are present in their descriptions. Below is a concise breakdown of how it works:

---

### **Function: `conjunctive_query`**
This function performs the following steps:
1. **Preprocess the Query:**
   - Tokenizes, cleans, and stems the query text.
   - Converts query words into their corresponding Term IDs using the vocabulary.

2. **Find Matching Documents:**
   - Intersects the lists of Document IDs for all Term IDs in the query.
   - Ensures that only documents containing all query terms are returned.
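
The intersection step can be sketched as follows. This is an illustrative helper under stated assumptions rather than the notebook's function: `intersect_postings` and `sample_index` are hypothetical names, and starting from the shortest posting list is a common optimisation that the notebook does not necessarily apply.

```python
def intersect_postings(posting_lists):
    """Return the sorted document IDs that occur in every posting list."""
    if not posting_lists:
        return []
    # Starting from the shortest list keeps the intermediate result small
    ordered = sorted(posting_lists, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= set(postings)
        if not result:  # No document can match any more, stop early
            break
    return sorted(result)

# Hypothetical index: term ID -> document IDs
sample_index = {0: [1, 3, 5], 1: [0, 3, 4, 5], 2: [3, 5, 7]}
matches = intersect_postings([sample_index[t] for t in (0, 1, 2)])
print(matches)  # [3, 5]
```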


### Example: Conjunctive Query Execution

### **Input Query**
*"Michelin-starred fine dining in Paris"*

---

### **Steps**

1. **Preprocess the Query**  
   - **Tokens:**  
     `["michelin", "star", "fine", "dine", "pari"]`

2. **Convert Tokens to Term IDs**  
   - Using the vocabulary (the IDs shown are illustrative):  
     `["michelin" → 12, "star" → 23, "fine" → 34, "dine" → 56, "pari" → 78]`  
   - **Term IDs:**  
     `[12, 23, 34, 56, 78]`

3. **Retrieve Matching Documents**  
   - Using the inverted index, find documents containing all terms:  
     **Matching Document IDs:** `[0, 5, 9]`

4. **Output Results**  
   - Display restaurant names, addresses, descriptions, and websites for matching documents.

---

### **Results Table**

The rows below are placeholders that illustrate the output format only.

| Restaurant Name       | Address          | Description                                      | Website               |
|-----------------------|------------------|------------------------------------------------|-----------------------|
| Le Jules Verne        | Eiffel Tower     | A Michelin-starred restaurant in Paris.         | www.lejulesverne.com |
| L'Astrance            | Rue Beethoven    | Fine dining experience with exquisite cuisine.  | www.lastrance.com    |
| Epicure               | Rue Saint-Honoré | Luxurious dining at the heart of Paris.         | www.epicure.com      |

---

### **Number of Matching Restaurants**  
`3`


In [19]:
def conjunctive_query(query, vocabulary, inverted_index):
    """
    Execute a conjunctive query:
    - Find restaurants where all query words are present in their description.
    """
    query_tokens = preprocess_text(query)  # Preprocess the query terms

    # If the query is empty, or any query word is missing from the vocabulary,
    # no document can contain all query words
    if not query_tokens or any(word not in vocabulary for word in query_tokens):
        return []

    term_ids = [vocabulary[word] for word in query_tokens]  # Map query words to term IDs

    # Find documents containing all the terms (intersection of lists)
    matching_docs = set(inverted_index.get(term_ids[0], []))  # Start with the first term's document list
    for term_id in term_ids[1:]:
        matching_docs &= set(inverted_index.get(term_id, []))  # Intersect with subsequent term's document lists

    return sorted(matching_docs)  # Return the matching document IDs in ascending order


# Input query from user
query = input("Enter your query: ")

# Execute conjunctive query
matching_docs = conjunctive_query(query, vocabulary, inverted_index)

# Create a results table
table = []
for idx in matching_docs:
    row = df.iloc[idx]
    table.append({
        "restaurantName": row['restaurantName'],  # Restaurant name
        "address": row['address'],               # Address
        "description": row['description'],       # Description
        "website": row['website']                # Website URL
    })

# Display the results table in the desired format
display(pd.DataFrame(table))
print(f"Number of matching restaurants: {len(matching_docs)}")


Unnamed: 0,restaurantName,address,description,website
0,Osteria Numero 2,via Ghisiolo 2/a,"A beautiful farmhouse not far from the town, w...",https://www.osterianumero2.it/


Number of matching restaurants: 1


### *Build Ranked Search Engine with TF-IDF (Task 2.2.1)*

### **2.2 Conjunctive Query & Ranking Score**

For the second Search Engine, given a query, we want to get the *top-k* (in this case, we chose $k=5$) documents related to the query and a **similarity** measure. In particular, we chose to perform the following procedure:

1. We built a dictionary containing all the words found in the `description` column of each document (or reused a previously built vocabulary). Using this vocabulary, we built the **TfIdf inverted index** of the words. The TfIdf inverted index is a dictionary of the form:

    ```
    {
    term_id_1:[(document_1, tfIdf_{term,document1}), (document_2, tfIdf_{term,document2}), (document_4, tfIdf_{term,document4})],
    term_id_2:[(document_1, tfIdf_{term,document1}), (document_3, tfIdf_{term,document3}), (document_5, tfIdf_{term,document5}), (document_6, tfIdf_{term,document6})],
    ...}
    ```

    where `document_i` is the *id* of a document that contains a specific word, the `term_id_i` is the *id* of a specific word in the vocabulary, and the `TfIdf` is the Term Frequency-Inverse Document Frequency value of the `term_id_i` within `document_i`. The Term Frequency-Inverse Document Frequency is given by:

    \begin{equation}
    \text{TfIdf}(t,d) = \text{Tf}(t,d) \times \text{Idf}(t) = \text{Tf}(t,d) \times \left[\log \left(\frac{1+N}{1+df}\right) + 1\right],
    \tag{1}
    \end{equation}

    where $\text{Tf}(t,d)$ is the number of times term $t$ appears in document $d$ divided by the document length, and $\text{Idf}(t)$ is the **inverse document frequency** of term $t$, where $df$ is the number of documents that include term $t$ and $N$ is the total number of documents. This is the smoothed form computed by `compute_idf`: adding 1 to the numerator and denominator avoids division by zero, and the trailing $+1$ keeps the weight of a term that occurs in every document above zero. The purpose of the Idf is to give higher weight to terms that are rare across the entire corpus and lower weight to terms that are common.

    The TfIdf index gives a mapping from every word in the vocabulary to all the documents that contain it and a measurement of **how important** that word is within each document relative to its importance across all documents. 

2. Once we built (and saved) the vocabulary and TfIdf inverted index, we took a **query** as input. As a first step, we preprocess the query and then search for all the documents that contain **all** of the words/tokens in the query. To perform this search, we take advantage of the TfIdf inverted index since we only have to take the intersection of the lists of the terms contained in the query (the lists of the first elements of the tuple).

3. Once we got all the relevant documents, we decided to sort them by their **Cosine Similarity** with respect to the query. The Cosine Similarity is a vector similarity measure that measures how similar two vectors $\vec{A}$ and $\vec{B}$ are by taking the angle $\theta$ between them. It is given by:

    \begin{equation}
    \text{sim}\left(\vec{A}, \vec{B}\right) = \cos(\theta) = \frac{\vec{A}\cdot\vec{B}}{|\vec{A}||\vec{B}|}.
    \tag{2}
    \end{equation}

    In the context of NLP, we can represent documents as vectors where each vector value is the tfIdf representation of a term within the document. Therefore, we can obtain the similarity between documents by taking the **cosine similarity** of their tfIdf representations. In this case, we obtained the cosine similarity between the query and all the obtained documents. 
    
    Once we had a similarity value for all the relevant documents, we **sorted** them in descending order with respect to the similarity with the query. To maintain the top-$k$ documents efficiently, we used a **max-heap** data structure.
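
The three steps above can be sketched end to end on a made-up corpus. This is a minimal illustration under stated assumptions: the corpus and the names `toy_docs`, `smoothed_idf`, `tfidf_vector` and `cosine` are hypothetical, and the IDF uses the smoothed form found in the notebook's `compute_idf`, log((N + 1) / (df + 1)) + 1.

```python
import heapq
import math
from collections import Counter

# Hypothetical toy corpus of already-preprocessed descriptions
toy_docs = [
    ["modern", "pizza", "terrac"],
    ["pizza", "pizza", "wine"],
    ["wine", "terrac"],
]

def smoothed_idf(docs):
    """IDF with +1 smoothing: log((N + 1) / (df + 1)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((n_docs + 1) / (df + 1)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; TF is normalised by the token count."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

idf = smoothed_idf(toy_docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in toy_docs]

query_tokens = ["pizza"]
query_vector = tfidf_vector(query_tokens, idf)

# Step 2: keep only the documents that contain every query token
candidates = [i for i, doc in enumerate(toy_docs) if set(query_tokens) <= set(doc)]

# Step 3: score the candidates and keep the best k with a heap
scored = [(i, cosine(query_vector, doc_vectors[i])) for i in candidates]
top_k = heapq.nlargest(2, scored, key=lambda pair: pair[1])
print(top_k)
```

Document 1 ranks first because "pizza" makes up a larger share of its vector than it does for document 0.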

---

#### **2.2.1 Inverted Index**

As in the first Search Engine we built, the TfIdf inverted index is a fundamental tool for the Search Engine. This is why the index is computed *before* making any query and saved to disk as a JSON file. This allows it to be loaded when needed instead of being recalculated every time. In this notebook it is also kept in memory in the `tfidf_inverted_index` variable.
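
One detail matters when the index is reloaded from JSON: integer keys come back as strings and tuples come back as lists. The sketch below shows a round trip on a made-up index; `toy_tfidf_index` and its values are hypothetical.

```python
import json

# Hypothetical TF-IDF index: term ID -> list of (document ID, score) pairs
toy_tfidf_index = {0: [(1, 0.42), (3, 0.17)], 7: [(2, 0.9)]}

serialized = json.dumps(toy_tfidf_index)
raw = json.loads(serialized)

# After loading, keys are strings and each pair is a two-element list
print(list(raw.keys()))  # ['0', '7']

# Restore integer term IDs and (doc_id, score) tuples
restored = {
    int(term_id): [(doc_id, score) for doc_id, score in postings]
    for term_id, postings in raw.items()
}
print(restored == toy_tfidf_index)  # True
```

Without the conversion, a lookup such as `raw[0]` raises a `KeyError` because only the string key `"0"` exists.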

As an exercise, we can observe the first 5 elements of the TfIdf inverted index value for a given term, here "data" (assuming that term is present in the vocabulary):

```python
term_id = vocabulary["data"]
print(tfidf_inverted_index[term_id][:5])
```


In [20]:
def compute_tf(document):
    """
    Compute term frequency (TF):
    - Calculate how often each word appears in the document relative to its length.
    """
    tf = defaultdict(int)
    for word in document:
        tf[word] += 1
    return {word: count / len(document) for word, count in tf.items()}  # Normalize by document length

def compute_idf(documents, vocabulary):
    """
    Compute inverse document frequency (IDF):
    - Measures the importance of a word across the entire dataset.
    """
    num_docs = len(documents)  # Total number of documents
    doc_freq = defaultdict(int)
    for doc in documents:
        unique_words = set(doc)  # Consider only unique words in each document to avoid repetitions for a word
        for word in unique_words:
            if word in vocabulary:
                doc_freq[word] += 1 #increasing the count if the word is in a document
    return {word: math.log((num_docs + 1) / (doc_freq[word] + 1)) + 1 for word in vocabulary}

def compute_tfidf(document, idf):
    """
    Compute the tf-idf scores combining term frequency (TF) 
    and inverse document frequency (IDF) for a given document.
    """
    tf = compute_tf(document)
    return {word: tf[word] * idf[word] for word in document if word in idf}

# Calculate IDF for all words in the vocabulary
idf = compute_idf(descriptions, vocabulary)

# Calculate TF-IDF scores for all documents
tfidf = [compute_tfidf(doc, idf) for doc in descriptions]

# Build updated inverted index with TF-IDF scores
tfidf_inverted_index = defaultdict(list)
for doc_id, doc_tfidf in enumerate(tfidf):
    for word, score in doc_tfidf.items():
        term_id = vocabulary[word]
        tfidf_inverted_index[term_id].append((doc_id, score))  # Store document ID and TF-IDF score

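# Save updated inverted index
# Note: this is the same file written in Task 2.1.1, so the plain inverted index on disk is overwritten
tfidf_inverted_index_path = r"C:\Users\39339\Desktop\ADM\HW3\inverted_index.json"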
with open(tfidf_inverted_index_path, 'w') as f:
    json.dump(tfidf_inverted_index, f)

print("TF-IDF scores computed and updated inverted index saved.")


TF-IDF scores computed and updated inverted index saved.


### Execute Ranked Query (Task 2.2.2)


In [None]:
def cosine_similarity(vec1, vec2):
    """
    Measures similarity between the tf-idf scores of the given query and each document.
    """
    common_words = set(vec1.keys()) & set(vec2.keys())  # Find common words
    numerator = sum(vec1[word] * vec2[word] for word in common_words)  # Dot product
    norm_vec1 = math.sqrt(sum(val ** 2 for val in vec1.values()))  # Magnitude of vec1
    norm_vec2 = math.sqrt(sum(val ** 2 for val in vec2.values()))  # Magnitude of vec2
    return numerator / (norm_vec1 * norm_vec2) if norm_vec1 and norm_vec2 else 0  # If either vector has zero magnitude (e.g. it is empty), return 0

def ranked_query(query, k=5):
    """
    Find and rank restaurants based on cosine similarity between the query and each document.
    """
    from heapq import nlargest  # Heap-based top-k selection from the standard library

    query_tokens = preprocess_text(query)  # Preprocess the query terms
    query_tfidf = compute_tfidf(query_tokens, idf)  # Compute TF-IDF for the query

    # Score every document against the query
    scores = [
        (doc_id, cosine_similarity(query_tfidf, doc_tfidf))
        for doc_id, doc_tfidf in enumerate(tfidf)
    ]

    # Keep only the k highest-scoring documents; nlargest uses a heap internally
    return nlargest(k, scores, key=lambda x: x[1])

# Input query from user
query = input("Enter your query: ")
#k = int(input("Enter the number of top results to display: "))

# Execute ranked query
top_k_results = ranked_query(query, k=5)

# Display results in a table
results_table = [
    {
        "Restaurant Name": df.iloc[doc_id]["restaurantName"],
        "Address": df.iloc[doc_id]["address"],
        "Description": df.iloc[doc_id]["description"],
        "Website": df.iloc[doc_id]["website"],
        "Similarity Score": f"{score:.4f}"
    }
    for doc_id, score in top_k_results
]

display(pd.DataFrame(results_table))  # Print the results table


Unnamed: 0,Restaurant Name,Address,Description,Website,Similarity Score
0,Saur,via Filippo Turati 8,"In a tiny rural village, this contemporary, al...",https://ristorantesaur.it,0.3061
1,Razzo,via Andrea Doria 17/f,"A quiet restaurant with a relaxed, young and m...",https://vadoarazzo.it/,0.2719
2,La Botte,via Giuseppe Garibaldi 8,A modern and welcoming contemporary bistro sit...,http://www.trattorialabottestresa.it,0.2632
3,Piccolo Lord,corso San Maurizio 69 bis/g,"Professional service in a welcoming, modern re...",https://www.ristorantepiccololord.it/,0.2507
4,La Valle,via Umberto I 25,A well - run restaurant in a quiet area just o...,https://www.ristorantelavalle.it/,0.2408


## Ex 3. Define a New Score!
Now, we will define a custom ranking metric to prioritize restaurants based on user queries.

Steps:
* User Query: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.
* New Ranking Metric: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like priceRange, facilitiesServices, and cuisineType.
* You will use a heap data structure (e.g., Python’s heapq library) to maintain the top-k restaurants.
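
As a sketch of the heap idea: Python's `heapq` implements a min-heap, so a common way to keep the k best items is a heap of size k whose root is the weakest score kept so far. The function name `top_k_stream` and the sample scores are hypothetical.

```python
import heapq

def top_k_stream(scored_items, k):
    """Return the k highest-scoring (item_id, score) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # Min-heap of (score, item_id); heap[0] is the weakest kept item
    for item_id, score in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # The new item beats the weakest kept item, so swap them
            heapq.heapreplace(heap, (score, item_id))
    return [(item_id, score) for score, item_id in sorted(heap, reverse=True)]

# Hypothetical (restaurant index, score) pairs
sample_scores = [(0, 0.31), (1, 0.12), (2, 0.87), (3, 0.45), (4, 0.66)]
best = top_k_stream(sample_scores, 3)
print(best)  # [(2, 0.87), (4, 0.66), (3, 0.45)]
```

The heap never holds more than k entries, so memory stays bounded regardless of how many restaurants are scored. `heapq.nlargest(k, ...)` gives the same result in one call.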

#### New Scoring Function:  
Define a scoring function that takes into account various attributes:
* Description Match: Give weight based on the query similarity to the description (using TF-IDF scores).
* Cuisine Match: Increase the score for matching cuisine types.
* Facilities and Services: Give more points for matching facilities/services (e.g., “Terrace,” “Air conditioning”).
* Price Range: Higher scores could be given to more affordable options based on the user’s choice.

### Output:
The output should include:
* restaurantName
* address
* description
* website
* The new similarity score based on the custom metric.  

Are the results you obtain better than with the previous scoring function? Explain and compare results.

## 3. **Define a New Scoring System!**

Can we do better? In the last implementation of our Search Engine, we sorted results by the **cosine similarity** of the TF-IDF representation of their `description` field with respect to the query. However, this approach does not account for the following considerations:

1. A restaurant can be **more relevant** to a query if it matches the query across multiple fields. For example:
    - If we search for the query *"pizza"* and it appears not only in the `description` but also in the `cuisineType` or `facilitiesServices`, these fields should contribute to the ranking score. Matching across multiple fields provides a broader understanding of relevance.

2. Users often prioritize restaurants that meet **additional preferences** such as:
    - **Facilities:** For example, a user searching for "wheelchair access" should find restaurants where this facility is explicitly listed.
    - **Price Compatibility:** Users may prefer restaurants that fit within their desired price range.
    - **Cuisine Type:** A user searching for "Italian" should find restaurants where this cuisine is explicitly provided.

To address these points, we propose a **weighted** scoring system that considers the following fields: `description`, `cuisineType`, `facilitiesServices`, and `priceRange`.

---

### **3.1 Custom Weighted Scoring Formula**

Our new scoring metric combines the cosine similarity of the query with multiple fields of the dataset (`description`, `cuisineType`, `facilitiesServices`) and considers the price compatibility. The formula is defined as:

\begin{equation}
\text{Score} = w_{\text{desc}} \cdot \text{CS}_{\text{desc}} + w_{\text{cuis}} \cdot \text{CS}_{\text{cuis}} + w_{\text{facil}} \cdot \text{CS}_{\text{facil}} + w_{\text{price}} \cdot \text{Sim}_{\text{price}}
\tag{1}
\end{equation}
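
A short numeric sketch of the weighted sum may help. The component values passed in are made up; the default weights `(0.4, 0.2, 0.2, 0.2)` are the defaults of `get_top_k` in the implementation, and `weighted_score` is a hypothetical helper name.

```python
def weighted_score(cs_desc, cs_cuis, cs_facil, sim_price,
                   weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of three cosine similarities and the price term."""
    w_desc, w_cuis, w_facil, w_price = weights
    return (w_desc * cs_desc + w_cuis * cs_cuis
            + w_facil * cs_facil + w_price * sim_price)

# Every component at its maximum
perfect = weighted_score(1.0, 1.0, 1.0, 1.0)

# Made-up values: half-matching description, no cuisine match,
# full facilities match, price within budget
partial = weighted_score(0.5, 0.0, 1.0, 1.0)

print(round(perfect, 4), round(partial, 4))
```

Because these weights sum to 1 and every component lies in [0, 1], the score itself stays in [0, 1].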

---

### **3.2 Key Components**

If the three text fields share a single context weight equally, that is $w_{\text{desc}} = w_{\text{cuis}} = w_{\text{facil}} = w_{\text{context}}/3$, Eq. (1) reduces to:

\begin{equation}
\text{Score} = \frac{w_{\text{context}}}{3} \cdot \left[\text{CS}_{\text{desc}} + \text{CS}_{\text{cuis}} + \text{CS}_{\text{facil}}\right] + w_{\text{price}} \cdot \text{Sim}_{\text{price}}
\tag{2}
\end{equation}

Where:
- **Cosine similarities:**
  \begin{equation}
  \text{CS}_{\text{desc}}, \text{CS}_{\text{cuis}}, \text{CS}_{\text{facil}}
  \tag{3}
  \end{equation}
  These represent the cosine similarities for `description`, `cuisineType`, and `facilitiesServices`, respectively.

- **Price compatibility:**
  \begin{equation}
  \text{Sim}_{\text{price}}
  \tag{4}
  \end{equation}
  Quantifies how compatible the restaurant's price range is with the user's maximum price preference.

- **Weights:**
  \begin{equation}
  w_{\text{desc}}, w_{\text{cuis}}, w_{\text{facil}}, w_{\text{price}}
  \tag{5}
  \end{equation}
  These are the weights assigned to each factor, controlling their relative importance in the overall score.

---

### **3.3 Component Details**

#### **Cosine Similarity for Text Fields**

Cosine similarity is used to compute the alignment between the TF-IDF representation of the query and each field. It is defined as:

\begin{equation}
\text{CS}(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \cdot \|\vec{B}\|}
\tag{6}
\end{equation}

This ensures that fields with higher overlap with the query receive higher scores.

#### **Price Compatibility**

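For the price similarity component, the price range of each restaurant is converted to a number by counting the characters of its `priceRange` value (for example, `€€` becomes 2), and the user's maximum price is converted the same way. The component is then a binary indicator:

\begin{equation}
\text{Sim}_{\text{price}} = \begin{cases} 1 & \text{if } p_{\text{doc}} \le p_{\max} \\ 0 & \text{otherwise} \end{cases}
\tag{7}
\end{equation}

where $p_{\text{doc}}$ is the restaurant's numeric price level and $p_{\max}$ is the user's maximum. Restaurants whose price range does not exceed the user's maximum budget receive the full price weight $w_{\text{price}}$; all others receive no price contribution.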
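
A minimal sketch of the price component, assuming price ranges are strings made of `€` symbols as in the implementation. The helper names `price_level` and `price_compatibility` are hypothetical.

```python
def price_level(price_range):
    """Convert a string such as '€€' to a number by counting its characters."""
    return len(price_range.strip())

def price_compatibility(doc_price, max_price):
    """Return 1.0 if the restaurant fits the budget, otherwise 0.0."""
    return 1.0 if doc_price <= max_price else 0.0

budget = price_level("€€€")
within = price_compatibility(price_level("€€"), budget)
above = price_compatibility(price_level("€€€€"), budget)
print(within, above)  # 1.0 0.0
```

This is an all-or-nothing rule: a restaurant one level above the budget scores the same as one three levels above it.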

---

### **3.4 Implementation**

We implemented this scoring system in Python using the following key functions:


In [None]:
import numpy as np
from collections import defaultdict
import math
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from heapq import nlargest
import re


# Initialize NLP tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess a given text by removing special characters, extra spaces,
    and stopwords, then applying stemming followed by lemmatization.
    The function splits the text into words, so the output is a list of normalized tokens.
    """
    if not isinstance(text, str):
        return []
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    tokens = text.lower().split()
    return [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens if word not in stop_words]

def build_vocabulary(documents):
    """
    Create a vocabulary from a list of tokenized documents.
    Each unique word is mapped to exactly one integer index.
    """
    return {word: idx for idx, word in enumerate(sorted(set(word for doc in documents for word in doc)))}

def optimized_get_idf(documents, vocabulary):
    """
    Calculate the IDF scores for all words in the vocabulary
    """
    doc_freq = defaultdict(int)
    for doc in documents:
        unique_words = set(doc)
        for word in unique_words:
            if word in vocabulary:
                doc_freq[word] += 1
    num_docs = len(documents)
    return {word: math.log((num_docs + 1) / (doc_freq[word] + 1)) + 1 for word in vocabulary}

def compute_tfidf(document, idf):
    """
    Compute TF-IDF scores for a given document.
    """
    tf = defaultdict(int)
    for word in document:
        tf[word] += 1
    return {word: (tf[word] / len(document)) * idf[word] for word in document if word in idf}

def cosine_similarity(vec1, vec2):
    """
    Calculate the cosine similarity between two TF-IDF vectors.
    """
    common_words = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[word] * vec2[word] for word in common_words)
    norm_vec1 = math.sqrt(sum(val ** 2 for val in vec1.values()))
    norm_vec2 = math.sqrt(sum(val ** 2 for val in vec2.values()))
    return numerator / (norm_vec1 * norm_vec2) if norm_vec1 and norm_vec2 else 0

def compute_custom_score(tfidf_query_desc, tfidf_desc, tfidf_query_cuis, tfidf_cuis, tfidf_query_facil, tfidf_facil, max_price, doc_price, weights):
    """
    Compute a custom score for ranking, based on a weighted combination of:
    - Cosine similarity for text descriptions
    - Matching of cuisine preferences
    - Matching of desired facilities
    - Price similarity between the query and the document
    The weights determine the relative importance of each component in the final score.
    """
    w_desc, w_cuis, w_facil, w_price = weights

    # Cosine similarity for description
    sim_desc = cosine_similarity(tfidf_query_desc, tfidf_desc) * w_desc

    # Matching cuisines
    matching_cuis = cosine_similarity(tfidf_query_cuis, tfidf_cuis) * w_cuis

    # Matching facilities
    matching_facil = cosine_similarity(tfidf_query_facil, tfidf_facil) * w_facil

    # Price similarity
    #sim_price = (1 / (1 + abs(max_price - doc_price))) * w_price
    sim_price = w_price if doc_price <= max_price else 0

    # Total score
    return sim_desc + matching_cuis + matching_facil + sim_price

def get_top_k(query, cuis, facil, max_price, descriptions, prices, idf_desc, tfidf_desc, idf_cuis, tfidf_cuis, idf_facil, tfidf_facil, k=5, weights=(0.4, 0.2, 0.2, 0.2)):
    """
    Retrieve the top-k ranked restaurants based on the given custom scoring metric.
    This function processes user input, calculates query vectors, and ranks 
    the restaurants based on their similarity to the query and other criteria.
    """
    # Preprocess the queries
    query_tokens = preprocess_text(query)
    cuis_tokens = preprocess_text(cuis)
    facil_tokens = preprocess_text(facil)

    # Compute query TF-IDF
    tfidf_query_desc = compute_tfidf(query_tokens, idf_desc)
    tfidf_query_cuis = compute_tfidf(cuis_tokens, idf_cuis)
    tfidf_query_facil = compute_tfidf(facil_tokens, idf_facil)

    scores = []
    for idx in range(len(descriptions)):
        # Compute custom score
        score = compute_custom_score(
            tfidf_query_desc,
            tfidf_desc[idx],
            tfidf_query_cuis,
            tfidf_cuis[idx],
            tfidf_query_facil,
            tfidf_facil[idx],
            max_price,
            prices[idx],
            weights
        )
        scores.append((idx, score))

    # Get top-k results using a heap
    return nlargest(k, scores, key=lambda x: x[1])

# Example Usage
if __name__ == "__main__":
    # Load dataset
    df = pd.read_csv(r"C:\Users\39339\Desktop\ADM\HW3\michelin_restaurants.csv")
    descriptions = [preprocess_text(desc) for desc in df['description']]
    cuisines = [preprocess_text(cuisine) for cuisine in df['cuisineType']]
    facilities = [preprocess_text(facility) for facility in df['facilitiesServices']]
    prices = [len(price) for price in df['priceRange']]  # Convert € symbols to numeric scale

    # Build vocabularies and IDF
    vocabulary_desc = build_vocabulary(descriptions)
    idf_desc = optimized_get_idf(descriptions, vocabulary_desc)

    vocabulary_cuis = build_vocabulary(cuisines)
    idf_cuis = optimized_get_idf(cuisines, vocabulary_cuis)

    vocabulary_facil = build_vocabulary(facilities)
    idf_facil = optimized_get_idf(facilities, vocabulary_facil)

    # Compute TF-IDF for documents
    tfidf_desc = [compute_tfidf(doc, idf_desc) for doc in descriptions]
    tfidf_cuis = [compute_tfidf(cuis, idf_cuis) for cuis in cuisines]
    tfidf_facil = [compute_tfidf(facil, idf_facil) for facil in facilities]


    # User inputs
    query = input("Enter your query for the description: ")
    cuis = input("Enter the cuisine types: ")
    facil = input("Enter the facilities: ")
    max_price = len(input("Enter the maximum price (€, €€, etc.): ").strip())  # Convert € symbols to numeric

    # Get top-k results
    top_k_results = get_top_k(query, cuis, facil, max_price, descriptions, prices, idf_desc, tfidf_desc, idf_cuis, tfidf_cuis, idf_facil, tfidf_facil, k=5)

    # Display results
    results_table = [
        {
            "Restaurant Name": df.iloc[idx]["restaurantName"],
            "Address": df.iloc[idx]["address"],
            "Description": df.iloc[idx]["description"],
            "Website": df.iloc[idx]["website"],
            "Score": f"{score:.4f}"
        }
        for idx, score in top_k_results
    ]

    display(pd.DataFrame(results_table))


Unnamed: 0,Restaurant Name,Address,Description,Website,Score
0,Osteria Numero 2,via Ghisiolo 2/a,"A beautiful farmhouse not far from the town, w...",https://www.osterianumero2.it/,1.0
1,Remo Villa Cariolato,strada di Bertesina 313,"Long recommended by the Michelin Guide, the hi...",https://www.removillacariolato.it,0.6205
2,La Passion,via San Nicolò 5/b,"A small, cosy, completely wood - panelled stub...",https://www.lapassion.it/it/,0.5754
3,Vert Osteria Contemporanea,Località Bogonza,Housed in a rustic building on the green slope...,https://vertosteria.it/,0.5426
4,Aubergine,via Ghislandi 5,Situated in a town famous for its thermal bath...,https://www.ristoranteaubergine.it/,0.5234
