# ADM HW 3

For this homework, no dataset has been provided. Instead, you have to build your own. Your search engine will run on text documents. So, here we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a crawler.py module, a parser.py module, and a engine.py module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful; you are likely dealing with exceptions and other possible issues!

### 1.1. Get the list of Michelin restaurants
* You should begin by compiling a list of restaurants to include in your document corpus. Specifically, you will focus on web scraping the Michelin Restaurants in Italy. Your task is to collect the URL associated with each restaurant in this list. The output of this step should be a .txt file where each line contains a single restaurant’s URL. By the end, you should have approximately 2,000 restaurants on your list. The number changes daily, so some groups might have different number of restaurants.

---
We will start by loading the relevant libraries first. Then we will try to scrape the *relevant* links on only the first page of the [Michelin website](https://guide.michelin.com/en/it/restaurants/) to test. If the operation goes on successfully, we will scrape the links from the 100 pages!


In [None]:
# import the relevant libraries
import requests
from bs4 import BeautifulSoup
import os
import re
import lxml
import csv
import random   
import time
from tqdm import tqdm
import asyncio
from tqdm.asyncio import tqdm
import aiohttp
from aiohttp import ClientSession, ClientResponseError
from crawler import *
from myparser import *
import pandas as pd

In [2]:
# setting current working directory to /Users/saifdev/Desktop/ADM/ADM_HW3
os.chdir("/Users/saifdev/Desktop/ADM/ADM_HW3")

Before I begin to build the function `scraping_urls`, which will scrape all the links for the Michelin restaurants in Italy and save them to `urls.txt`, I will start by trying to scrape only the first page to test and practice.

In [None]:
# A SMALL TEST
url = "https://guide.michelin.com/en/it/restaurants/page/" # url of the Michelin Guide Italy
with open('test.txt', 'w') as file: # open a file to write the links of the restaurants
    current_page = url + "1" # the first page only for testing
    request = requests.get(current_page) # get the first page
    soup = BeautifulSoup(request.content, 'lxml') # parse the content of the first page
    for a in soup.select("div.flex-fill a"): # Ref Keith Galli Y.T. channel :)
        file.write("https://guide.michelin.com"+a.get('href') + '\n')   # write the links of the restaurants in the first page

In [None]:
# check the number of lines written in the file = 20 (and also manually checked)
with open('test.txt', 'r') as file: # open the file to read the links of the restaurants
    num_lines = len(file.readlines()) # count the number of lines in the file
    result = "20" if num_lines == 20 else "not 20" # check if the number of lines is 20
    print(f"The number of urls scraped is {result}") # print the result. It is indeed 20


The number of urls scraped is 20


It worked as expected. Now we will scrape the urls from the whole 100 pages on the Micheline website using the `scraping_urls` function.

A little note about the `scraping_urls` function:
* It is a function that takes the url of the Michelin Guide Italy webpage and the number of pages to scrape as input and returns a text file containing the URLs of all the restaurants. The function is defined in the crawler.py file.

In [5]:
pages = 100 # set the number of pages to scrape
url = "https://guide.michelin.com/en/it/restaurants/page/" # url of the Michelin Guide Italy
scraping_urls(url=url, pages=pages) # scrape the urls of the restaurants in the Michelin Guide Italy

Scraping Pages: 100%|██████████| 100/100 [03:37<00:00,  2.17s/it]


Checking if the number of urls scraped is approximately 2000

In [6]:
# check the number of urls scraped
with open('urls.txt', 'r') as file:
    num_lines = len(file.readlines())
    print(f"Total Number of restaurants urls scraped is {num_lines}")


Total Number of restaurants urls scraped is 2029


---

### 1.2. Crawl Michelin restaurant pages

##### Once you have all the URLs on the list, you should:

* Download the HTML corresponding to each of the collected URLs.
* After collecting each page, immediately save its HTML in a file. This way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
* Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the restaurants from page 1, page 2, ... of the Michelin restaurant list.
* Tip: Due to the large number of pages to download, consider using methods that can help shorten the process. If you employed a particular process or approach, kindly describe it.

---

Now, we will scrape the information of the restaurants from the urls scraped. To do this, we will use the function load_urls, fetch_and_save, and download_html_in_batches. All these functions are defined in the crawler.py file.

Some information about each of these functions:
* **load_urls**: This function loads the urls of the restaurants from the urls.txt file.
* **fetch_and_save**: This function fetches the html content of the urls and saves the html file into a folder of a particular batch/page.
* **download_html_in_batches**: This function downloads the html content of the urls in batches of 20. The function makes used of `asyncio` and `aiohttp` to download the html content in parallel. This function is used to speed up the process of downloading the html content of the urls due to concurrency.

In [None]:
url_file = 'urls.txt' # file containing the urls of the restaurants
output_dir = 'michelin_html_batches' # directory to save the sub-directories and the html files
urls = load_urls(url_file) # load the urls of the restaurants

start_time = time.time()
try:
    await download_html_in_batches(urls, output_dir) # using asyncio environment to download the html files in batches
except RuntimeError:
    # Check if there's already an event loop running. Why? Because we can't create a new event loop in a thread that already has
    # an event loop running. Jupyter notebooks often have an event loop running by default. If we try to create a new event loop
    # in a Jupyter notebook, we'll get a RuntimeError. Thus, use asyncio.run if not in a running event loop environment
    asyncio.run(download_html_in_batches(urls, output_dir))

print(f"Finished downloading in {time.time() - start_time} seconds")

Now, we have the html files of the restaurants in the Michelin Guide Italy website downloaded in batches - organised into folders. Now is the time that we start parsing them.

---
### 1.3 Parse downloaded pages

##### At this point, you should have all the HTML documents about the restaurant of interest, and you can start to extract specific information. The list of the information we desire for each restaurant and their format is as follows:

* Restaurant Name (to save as restaurantName): string;
* Address (to save as address): string;
* City (to save as city): string;
* Postal Code (to save as postalCode): string;
* Country (to save as country): string;
* Price Range (to save as priceRange): string;
* Cuisine Type (to save as cuisineType): string;
* Description (to save as description): string;
* Facilities and Services (to save as facilitiesServices): list of strings;
* Accepted Credit Cards (to save as creditCards): list of strings;
* Phone Number (to save as phoneNumber): string;
* URL to the Restaurant Page (to save as website): string. 


---

we have created many functions to parse the html files and extract the information we need. These functions ae in the myparser.py file. Here is the list of functions. The functions are listed below and their names are self-explanatory:  

* `get_restaurant_name(soup)` - taking in soup object and returning the name of the restaurant for example.
* `get_address(soup)`
* `get_city(address)`
* `get_postal_code(address)`
* `get_country(address)`
* `get_price_range(soup)`
* `get_cuisine_type(soup)`
* `get_description(soup)`
* `get_facilities_services(soup)`
* `get_credit_cards(soup)`
* `get_phone_number(soup)`
* `get_website(soup)`

We will test these functions of a random html file form the directories we have created. This will help us to understand the output of these functions and help us evaluate how well they perform.

In [8]:
# Test the parser functions on a particular html file
with open('/Users/saifdev/Desktop/ADM/ADM_HW3/michelin_html_batches/batch_1/20tre.html', 'r', encoding='utf-8') as file:
	content = file.read()
	soup = BeautifulSoup(content, 'lxml')
	name = get_restaurant_name(soup)
	print(f"Restaurant name: {name}")
	address = get_address(soup)
	print(f"Restaurant address: {address}")
	city = get_city(address)
	print(f"Restaurant city: {city}")
	zipcode = get_postal_code(address)
	print(f"Restaurant zipcode: {zipcode}")
	country = get_country(address)
	print(f"Restaurant country: {country}")
	price = get_price_range(soup)
	print(f"Restaurant price range: {price}")
	cuisine = get_cuisine_type(soup)
	print(f"Restaurant cuisine type: {cuisine}")
	description = get_description(soup)
	print(f"Restaurant description: {description}")
	facilities = get_facilities_services(soup)
	print(f"Restaurant facilities and services: {facilities}")
	credit_cards = get_credit_cards(soup)
	print(f"Restaurant credit cards: {credit_cards}")
	phone = get_phone_number(soup)
	print(f"Restaurant phone number: {phone}")
	website = get_website(soup)
	print(f"Restaurant website: {website}")

Restaurant name: 20Tre
Restaurant address: via David Chiossone 20 r, Genoa, 16123, Italy
Restaurant city: Genoa
Restaurant zipcode: 16123
Restaurant country: Italy
Restaurant price range: €€
Restaurant cuisine type: Farm to table, Modern Cuisine
Restaurant description: Situated in the heart of Genoa’s historic centre, this contemporary-style restaurant focuses on just a few dishes, almost all fish-based, presented in a very modern style and in generous portions. Seasonal ingredients and market-fresh produce are the guiding philosophy here.
Restaurant facilities and services: ['Air conditioning']
Restaurant credit cards: ['amex', 'dinersclub', 'mastercard', 'visa']
Restaurant phone number: +39 010 247 6191
Restaurant website: https://www.ristorante20tregenova.it/


Our Parser functions work perfectly. Let's employ them now to get the required data from all the html files we have stored. Then, we will gather the data into a csv file which we can later read into the Pandas DataFrame.

To do this, we will use the following functions stored in myparser.py:
`extract_restaurant_data`, `save_restaurant_data_to_csv`, `forming_a_csv`

A brief explanation of the functions:
* `extract_restaurant_data` - This function takes the html content of a restaurant page and extracts the relevant data using the parser functions.
* `save_restaurant_data_to_csv` - This function takes the extracted data and saves it to a csv file.
* `forming_a_csv` - This function takes the html files stored in the directory and extracts the data from each file and saves it to a csv file.


In [None]:
# form a csv file with the information of the restaurants using extract_restaurant_data function and the save_restaurant_data_to_csv function
forming_a_csv()

Data saved to michelin_restaurant_data.csv


Finally, we have the csv file with the information of the restaurants using the aforementioned functions. The csv file is named *"michelin_restaurants.csv"*. We can now load the csv file using `pd.read_csv` and perform the required tasks.

In [3]:
data = pd.read_csv('michelin_restaurant_data.csv')
data.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",Melizzano,82030.0,Italy,€€,"Modern Cuisine, Campanian",A rustic restaurant with contemporary Mediterr...,Air conditioning; Car park; Garden or park; Gr...,amex; mastercard; visa,+39 0824 944506,https://www.locandaradici.it/
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...",Sant'Omobono Terme,24038.0,Italy,€€€,Italian,"Situated in the Imagna valley, this welcoming,...",Air conditioning; Wheelchair access,amex; dinersclub; mastercard; visa,+39 035 851134,https://www.frosioristoranti.it
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Furore,84010.0,Italy,€€,"Regional Cuisine, Farm to table",Patience is needed to get to this restaurant f...,Air conditioning; Car park; Garden or park; Gr...,amex; dinersclub; jcb; maestrocard; mastercard...,+39 089 830360,https://www.baccofurore.it/
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...",Santa Marina Salina,98050.0,Italy,€€,"Seafood, Seasonal Cuisine","Fish plays the starring role here, where the a...",Terrace,amex; jcb; maestrocard; mastercard; visa,+39 090 984 3486,
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Rimini,47921.0,Italy,€,"Cuisine from Romagna, Traditional Cuisine",Borgo San Giuliano is situated just a stroll f...,Terrace,amex; maestrocard; mastercard; visa,+39 0541 56074,https://www.osteriadeborg.it/


In [14]:
print(f"Checking the shape of the data: It should be {data.shape}")

Checking the shape of the data: It should be (2029, 12)


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2029 entries, 0 to 2028
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   restaurantName      2029 non-null   object 
 1   address             2029 non-null   object 
 2   city                2000 non-null   object 
 3   postalCode          1983 non-null   float64
 4   country             2029 non-null   object 
 5   priceRange          1983 non-null   object 
 6   cuisineType         2029 non-null   object 
 7   description         2029 non-null   object 
 8   facilitiesServices  1959 non-null   object 
 9   creditCards         1979 non-null   object 
 10  phoneNumber         1983 non-null   object 
 11  website             1875 non-null   object 
dtypes: float64(1), object(11)
memory usage: 190.3+ KB


# 2.0 Preprocessing
## 2.0.0) Preprocessing the text

Before building the search engine, we must clean and prepare the text in each restaurant’s description.

We will be using the following procedure incorporated into the function `preprocessing` (defined in `functions.py`):

+ Remove stopwords.
+ Remove punctuation.
+ Bring everything to lowercase.
+ Handle compund words with '-'.
+ Apply stemming.
>For this, we use the nltk library.

We will apply the preprocessing function on every row of the 'description' column of our dataset, to then create a 'cleaned_description' column in which each row contains the tokenized and ready-to-use version of the documents.

In [151]:
data['cleaned_description'] = data['description'].apply(preprocessing)

In [6]:
data['cleaned_description'].head()

0    [rustic, restaur, contemporari, mediterranean,...
1    [situat, imagna, valley, welcom, famili, run, ...
2    [patienc, need, get, restaur, coast, result, w...
3    [fish, play, star, role, authent, tradit, flav...
4    [borgo, san, giuliano, situat, stroll, sea, ol...
Name: cleaned_description, dtype: object

In [8]:
data.to_csv('michelin_restaurant_data.csv', index=False)

## 2.1 Conjunctive Query
This first version of the search engine narrows the search to the description field of each restaurant. Only restaurants whose descriptions contain **all** the query words will be returned.
### 2.1.1 Creating our Index! 
>Vocabulary File: Create a file called vocabulary.csv that maps each word to a unique integer (term_id).

We have prepared a function in the file `functions.py` named as `create_vocabulary` that returns a dictionary with words as keys and IDs as values. It also creates a new file `vocabulary.csv` that maps each word to a unique integer (term_id)

In [155]:
vocabulary = create_vocabulary(data['cleaned_description'])

In [156]:
with open('vocabulary.csv', mode='r') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        if i == 0:
            continue  # Skip the header row
        if i > 10:  # Print first 10 rows
            break
        print(row)
print(f"Our vocabulary contains {len(vocabulary)} unique words. This may vary according to description preprocessing choices")

['local', '1']
['owner', '2']
['style', '3']
['vast', '4']
['region', '5']
['surround', '6']
['mediterranean', '7']
['larg', '8']
['decor', '9']
['contemporari', '10']
Our vocabulary contains 7853 unique words. This may vary according to description preprocessing choices


>Inverted Index: Build a dictionary mapping each term_id to a list of document IDs where that term appears.

+ Map Words to Document IDs: For each word in each description, find which document (restaurant) it appears in.
+ Use Term IDs: Replace each word with its term_id from vocabulary.csv.
+ Store the Inverted Index: Save the inverted index to a file (e.g., inverted_index.json) so that you don’t need to recompute it.

To do this we will use  `create_inverted_index` function stored in the `functions.py` file.

In [159]:
inverted_index = create_inverted_index(data['cleaned_description'], vocabulary)

### 2.1.2 Execute the Query
When the user inputs a query, for example, "vegan restaurant with garden", the search engine will:

- Preprocess the string containing the search query using the **preprocessing** function.
- For each unique term in the query, its term ID is fetched from the **dictionary**.
- If the term ID exists, the corresponding document list is retrieved from the **inverted index** and stored.
- Iteratively compute the intersection of document lists to find documents containing all query terms.
- Retrieve and return matching document details (restaurantName, address, description, website) from the dataset.


The function outputs a DataFrame with details of matching documents or None if no matches are found.

In [160]:
def conjuntive_query(query, dictionary, inverted_index):

    #preprocessing query text with the same function used to clean descriptions
    cleaned_query = preprocessing(query)

    unique_words = set(cleaned_query)

    doc_lists = []
    for word in unique_words:
        #retrive the corresponding term_id from the dictionary
        term_id = dictionary.get(word)
        if term_id:
            #retrive and append the list of documents containing term_id
            docs = inverted_index.get(str(term_id))
            if list:
                doc_lists.append(docs)
    
    if doc_lists:
        intersection = doc_lists[0] #initialize as the first list
        for docs in doc_lists[1:]:
            intersection = list(set(intersection) & set(docs)) #iteratively intersect with any other list
    else:
        intersection = []

    result = data.loc[intersection, ['restaurantName', 'address', 'description', 'website']]
    print(query)

    if intersection:
        return result
    else:
        print('No matches found')
        return None

In [161]:
def load_dictionary(csv_file):
    dictionary = {}
    with open(csv_file, mode='r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header row
        for row in reader:
            term, term_id = row
            dictionary[term] = int(term_id)  # Store term and its term_id
    return dictionary

def load_inverted_index(json_file):
    with open(json_file, 'r') as file:
        inverted_index = json.load(file)
    return inverted_index

In [162]:
dictionary = load_dictionary('vocabulary.csv')
inverted_index = load_inverted_index('inverted_index.json')
conjuntive_query(input('Input your query'), dictionary,inverted_index)

vegan restaurant with garden


Unnamed: 0,restaurantName,address,description,website
194,Il Mirto,"via Provinciale Lacco, 284, Forio, 80075, Italy","When you arrive at the luxury Botania hotel, y...",http://www.ilmirtoristorantevegetariano.it
1245,Li Lioni,"SS 131 km 244,400 - regione Li Lioni, Porto To...",The highlights of this restaurant just outside...,https://www.tenutalilioni.it/


We now use the `highlight_matches` function of the `functions.py` file, that is an auxiliary tool that helps us to visually check the correctness of the conjuntive_query function: the matching query terms are highlighted in the output documents.

Here we will show the **conjuntive_query** function in action for two queries: *'Vegan restaurant with garden'* and *'Romantic restaurant with terrace'.*
We can double check the correctness thanks to the highlight matches function

In [185]:
query = "vegan restaurant with garden"
result = conjuntive_query(query, dictionary, inverted_index)

if result is not None:
    query_words = preprocessing(query)  # Process query the same way as descriptions
    highlight_matches(result, query_words)
else:
    print("No matches found.")

vegan restaurant with garden
Restaurant: Il Mirto
Address: via Provinciale Lacco, 284, Forio, 80075, Italy
Description: When you arrive at the luxury Botania hotel, you’ll be accompanied through 3ha grounds of lush Mediterranean vegetation to the **restaur**ant. You couldn’t have a better introduction to the cuisine that awaits at Il Mirto, where you can choose between two long tasting menus or a shorter four-course option featuring dishes from these menus. One of the menus is vegetarian, the other **vegan** and almost all the vegetables are sourced from the property’s own kitchen **garden** or from other local growers from Ischia. A delightful experience in a beautiful natural setting, with excellent service provided by young, friendly staff.
Website: http://www.ilmirtoristorantevegetariano.it

Restaurant: Li Lioni
Address: SS 131 km 244,400 - regione Li Lioni, Porto Torres, 07046, Italy
Description: The highlights of this **restaur**ant just outside the tourist centre of Porto Torres

In [186]:
query = "romantic restaurant with terrace"
result = conjuntive_query(query, dictionary, inverted_index)

if result is not None:
    query_words = preprocessing(query)  # Process query the same way as descriptions
    highlight_matches(result, query_words)
else:
    print("No matches found.")

romantic restaurant with terrace
Restaurant: Urubamba
Address: via Gaetano Filangieri 16/c, Naples, 80121, Italy
Description: Situated amid luxury boutiques and historic palazzi in one of Naples’ most exclusive districts, this **restaur**ant boasts a chic and **romant**ic dining room with soft lighting on the first floor, as well as an exclusive outdoor **terrac**e for fine weather. The cuisine is totally in keeping with the trendy setting, featuring a successful combination of two different gastronomic cultures (namely Peruvian and Japanese) that blend together perfectly.
Website: http://www.urubamba.it

Restaurant: Sogno
Address: via Porto San Felice 41, San Felice del Benaco, 25010, Italy
Description: This aptly named **restaur**ant (“sogno” is the Italian for dream) really is a dream come true, especially in summer when you can dine on the **romant**ic lakeside **terrac**e. The delicious Mediterranean-style cuisine will more than satisfy your tastebuds, while there’s also the optio

## 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity
For the second search engine, given a query, retrieve the top-k restaurants ranked by relevance to the query.

### 2.2.1 Inverted Index with TF-IDF Scores
> tfIdf Scores: Calculate TF-IDF scores for each term in each restaurant’s description.

Assume that: 

   + $q$ is our query;
   
   + $tf_{t,d}$ is the term frequency (number of occurrences of term t in document d);

   + $idf_{t}$ is the  inverse document frequency of term t defined by $idf_{t} = log(\frac{N}{df_{t}})$;
    
   + $N$ is the number of documents (restaurants);
    
   + $df_t$ is the the document frequency.

Then the tf-idf weighting scheme assigned to term $t$ is a weight in document d given by $$tf-idf_{t,d} = tf_{t,d} × idf_{t}$$

This allows to define the TF-IDF scores as the sum over all the terms of the query of the weights: $$ Score(q, d) = \sum_{t∈q} tf-idf_{t,d}$$

    
> Updated Inverted Index: Build a new inverted index where each entry is a term, and the value is a list of tuples containing document IDs and TF-IDF scores.

We have implemented the `tf_idf_inverted_index` in the `functions.py` file.

In [168]:
vocabulary = load_dictionary('vocabulary.csv')
tf_idf_inverted_index = tf_idf_inverted_index(data, vocabulary)

### 2.2.2 Execute the Ranked Query
For the ranked search engine:

+ Process the query terms.
   > using the preprocessing function
+ Use Cosine Similarity to rank matching restaurants based on the TF-IDF vectors of the query and each document.
   > computing query vector and document vectors with one component in the vector for each dictionary term. Such components are computed using the tf-idf scores.
+ Return the top-k results or all matching restaurants if fewer than k have non-zero similarity.
+ Each result should include: restaurantName, address, description, website, Similarity score (between 0 and 1)

In [169]:
import numpy as np
import math
from collections import defaultdict
N = len(data)
N

2029

We first define in `functions.py` the `query_vector` function, which generates a TF-IDF vector for a search query based on a given vocabulary and inverted index, representing the query in the same space as document vectors for similarity comparisons, for ranking search results.

Input: 
- query: A string containing the search query.
- vocabulary: A dictionary mapping terms to unique IDs.
- inverted_index: A dictionary mapping term IDs to a list of document IDs and their TF-IDF scores.

Steps:
- The query is preprocessed (e.g., cleaned and tokenized).
- For each query term:
    + Check Existence: If the term is not in the vocabulary, it is skipped.
    + TF Calculation: Term frequency (TF) is computed based on the query.
    + IDF Calculation: Inverse document frequency (IDF) is computed using the document frequency (DF) from the inverted index.
    + TF-IDF Computation: Combines TF and IDF to calculate the TF-IDF score for the term.
    + Populate Vector: The TF-IDF score is stored in the query vector using the term’s ID.
- Final Vector: Builds the final vector, ensuring all terms in the vocabulary are represented (defaulting to zero if absent).

Output:

Returns a vector where each position corresponds to a term in the vocabulary and contains its TF-IDF score for the query.

In [None]:
vocabulary = load_dictionary('vocabulary.csv')
inverted_index = load_inverted_index('inverted_index.json')
v= query_vector('vegan restaurant with garden', vocabulary, inverted_index)

Term: vegan, TF: 1
Term: vegan, IDF: 7.615298339825815
Term: vegan, TF-IDF: 7.615298339825815
Term: restaur, TF: 1
Term: restaur, IDF: 7.615298339825815
Term: restaur, TF-IDF: 7.615298339825815
Term: garden, TF: 1
Term: garden, IDF: 7.615298339825815
Term: garden, TF-IDF: 7.615298339825815
Final Query Vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Similarly, we compute the vector representation of the documents using the  `doc_vectors` function in the `functions.py` file, by creating a document-term matrix $D$, where rows represent terms in the vocabulary, columns represent documents, and each entry contains the TF-IDF score of a term in a document.

Input:
- data: Dataset of documents.
- vocabulary: Maps terms to unique IDs.
- tf_idf_inverted_index: Maps term IDs to lists of document IDs and TF-IDF scores.

Matrix Initialization: 
- $D$ is a zero matrix with dimensions (num_terms,num_docs).

Matrix Population:

- For each term ID in the inverted index:
    + Adjust term and document indices to match matrix bounds.
    + Populate $D$ with TF-IDF scores for documents containing the term.

Output:
Returns the completed $D$, capturing the TF-IDF relationship between terms and documents.

In [175]:
D = doc_vectors(data, vocabulary, tf_idf_inverted_index)
D.shape

(7853, 2029)

The matrix $D$ represents the structure of the document vectors. Each column corresponds to a document, and each row represents a term from the vocabulary, identified by its term_id. The value at position $D[i,j]$ is the TF-IDF score of term $i$ in document $j$. Since the vocabulary contains 7853 terms, each document vector has 7853 components, reflecting the TF-IDF scores for all terms in the vocabulary. Our dataset represents 2029 documents (restaurants), so $D\in\mathbb{R}^{7853 x 2029}$.

The following **cosine_similarity** function computes the cosine similarity between two vectors.

In [176]:
def cosine_similarity(a,b):
    cos_sim = np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim

The following **ranked_search** function implements a ranked search engine to retrieve the top $k$ restaurants, based on their similarity to a user's query.

It takes in imput the user's search query and the number $k$ of top results to return.

It converts the query into a vector using the **query_vector** function, which maps the query terms to the vocabulary and computes their weights using the inverted index.
Then for each document in the dataset, it computes its cosine similarity with the query vector calling the **cosine_similarity** function and appends the resulting similarity score to an initialized list.
The similarity scores are finally added to the dataset as a new column (SimilarityScore) in order to sort the dataset by such scores in descending order to select the top $k$ results.

Te function outputs a DataFrame containing the top $k$ restaurants, along with their name, address, description, website, and similarity score.

In [177]:
def ranked_search(query, k):
    
    q = query_vector(query, vocabulary, inverted_index)
    similarity_scores = []
   
    for doc_id in range(len(data)):
        # document vector (doc_id-th column of V)
        doc_vector = D[:, doc_id]
        s = cosine_similarity(q, doc_vector)
        similarity_scores.append(s)

    data['SimilarityScore'] = similarity_scores

    #sorting rows in dataset
    top_k_data = data.sort_values(by='SimilarityScore', ascending=False).head(k)[['restaurantName', 'address', 'description', 'website', 'SimilarityScore']]
   
   
    return top_k_data
    

In [None]:
ranked_search('vegan restaurant with garden',  3 )

Term: vegan, TF: 1
Term: vegan, IDF: 7.615298339825815
Term: vegan, TF-IDF: 7.615298339825815
Term: restaur, TF: 1
Term: restaur, IDF: 7.615298339825815
Term: restaur, TF-IDF: 7.615298339825815
Term: garden, TF: 1
Term: garden, IDF: 7.615298339825815
Term: garden, TF-IDF: 7.615298339825815
Final Query Vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Unnamed: 0,restaurantName,address,description,website,SimilarityScore
94,Sale Grosso,"vicolo De' Facchini 4/a, Bologna, 40124, Italy","A simple, bistro - style restaurant serving we...",,0.26472
1245,Li Lioni,"SS 131 km 244,400 - regione Li Lioni, Porto To...",The highlights of this restaurant just outside...,https://www.tenutalilioni.it/,0.183576
194,Il Mirto,"via Provinciale Lacco, 284, Forio, 80075, Italy","When you arrive at the luxury Botania hotel, y...",http://www.ilmirtoristorantevegetariano.it,0.157729


In [179]:
ranked_search('romantic restaurant with terrace',  int(input('k')) )

Term: romant, TF: 1
Term: romant, IDF: 7.615298339825815
Term: romant, TF-IDF: 7.615298339825815
Term: restaur, TF: 1
Term: restaur, IDF: 7.615298339825815
Term: restaur, TF-IDF: 7.615298339825815
Term: terrac, TF: 1
Term: terrac, IDF: 7.615298339825815
Term: terrac, TF-IDF: 7.615298339825815
Final Query Vector: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7.615298339825815, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Unnamed: 0,restaurantName,address,description,website,SimilarityScore
1229,Capo Santa Chiara,"via al Capo di Santa Chiara 69, Boccadasse, Ge...","The chef at this modern, renovated restaurant ...",https://ristorantecaposantachiara.com/,0.214436
1884,Andrea Monesi - Locanda di Orta,"via Olina 18, Orta San Giulio, 28016, Italy",Situated in the historic centre of one of the ...,https://www.andreamonesi.com/,0.202419
1379,Villa Baroni,"via Acquadro 12, Bodio Lomnago, 21020, Italy",This romantic restaurant on the lakeshore has ...,https://www.villabaroni.it/,0.1844


In [42]:
conjuntive_query('romantic restaurant with terrace', vocabulary, inverted_index)

romantic restaurant with terrace


Unnamed: 0,restaurantName,address,description,website
513,Urubamba,"via Gaetano Filangieri 16/c, Naples, 80121, Italy",Situated amid luxury boutiques and historic pa...,http://www.urubamba.it
1283,Sogno,"via Porto San Felice 41, San Felice del Benaco...",This aptly named restaurant (“sogno” is the It...,https://www.sognogarda.it/
1936,"Il Giardino ""Da Felicin""","via Vallada 18, Monforte d'Alba, 12065, Italy","A bastion of classic Piedmontese cuisine, this...",https://www.felicin.it/
923,La Speranzina Restaurant & Relais,"via Dante 16, Sirmione, 25019, Italy","A romantic, Mediterranean-style restaurant in ...",https://www.lasperanzina.it/
1313,La Serra,"via Marconi 169, località Belvedere Fornillo, ...","Despite its bright, elegant and airy feel, the...",https://www.leagavi.it/ristorante-la-serra
1442,Al Monastero,"via Ristori 9, Cividale del Friuli, 33043, Italy",Enjoy excellent local hams and salami and othe...,https://www.almonastero.com/
167,Aria,"Via Costa 18, Lavagna, 16033, Italy",The Villa Riviera resort home to this restaura...,https://www.villarivieraresort.com/it/hotel-ri...
1080,Unterwirt,"Gudon 45, Gudon, 39043, Italy","Occupying a private house, this restaurant boa...",https://www.unterwirtgufidaun.com/it
827,Il Flauto di Pan,"via Santa Chiara 26, Ravello, 84010, Italy",It would be no exaggeration to say that this r...,https://www.hotelvillacimbrone.com/it/il-flaut...
444,Maxi,"via Luigi Serio 8, SS 145 Sorrentina km 14,500...","From the entrance to the Capo la Gala hotel, y...",https://www.hotelcapolagala.com/it/dining/maxi...


# 3. Define a New Score!


Now, we will define a custom ranking metric to prioritize restaurants based on user queries.

### Steps:
User query:

The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.
New Ranking Metric: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like priceRange, facilitiesServices, and cuisineType.
You will use a heap data structure (e.g., Python’s heapq library) to maintain the top-k restaurants.

### New Scoring Function:
Define a scoring function that takes into account various attributes:

Description Match: Give weight based on the query similarity to the description (using TF-IDF scores).
Cuisine Match: Increase the score for matching cuisine types.
Facilities and Services: Give more points for matching facilities/services (e.g., “Terrace,” “Air conditioning”).
Price Range: Higher scores could be given to more affordable options based on the user’s choice.

### Output:
The output should include:

restaurantName,
address,
description,
website,
The new similarity score based on the custom metric.
Are the results you obtain better than with the previous scoring function? Explain and compare results.



# Import
### Introduction

In this cell, we import the necessary packages and load the files that will be used throughout the project, including:
- The TF-IDF inverted index file,
- The Michelin restaurant data,
- The vocabulary and inverted index for text processing.


In [None]:
# Import necessary libraries and functions
from HLP.funcions import *  
import pandas as pd         
import numpy as np          
import json                

# File path setup
root = "/Users/lucanudo/Documents/GitHub/ADM_HW3/"

# Load the TF-IDF inverted index from a JSON file
file_path = root + "tf_idf_inverted_index.json"
try:
    with open(file_path, 'r') as file:
        TFiDF = json.load(file)  # Load the TF-IDF inverted index
except FileNotFoundError:
    print("The file was not found. Please check the provided path.")  # Error message if file is missing

# Load restaurant data from a CSV file
data = pd.read_csv(root + 'michelin_restaurant_data.csv')
data['id'] = list(data['description'].keys())  # Add a unique identifier column based on the description index

# Load vocabulary from a CSV file
vocabulary = pd.read_csv(root + 'vocabulary.csv')

# Load the inverted index from another JSON file
file_path = root + "inverted_index.json"
try:
    with open(file_path, 'r') as file:
        inverted_index = json.load(file)  # Load the inverted index
except FileNotFoundError:
    print("The file was not found. Please check the provided path.")  # Error message if file is missing




# Functions

### Query Functions Explanation

This script defines two functions for handling restaurant queries:

### 1. conjunctive_query (query, dictionary, inverted_index)
This function processes a conjunctive query (AND search) to find restaurants that match all the keywords in the query as point 2.1.

- **Preprocessing**: The query is cleaned and tokenized into unique words.
- **Inverted Index Lookup**: For each unique word, the function retrieves the corresponding document IDs (restaurants) from the inverted index.
- **Intersection**: It computes the intersection of the document lists for all query words to find the common documents (restaurants).
- **Return Data**: The matching restaurants are then retrieved from the dataset using the intersection of document IDs and returned as a DataFrame with key restaurant details such as name, address, and description.

### 2. customized_query (query_text, dictionary, inverted_index, TFiDF, vocabulary, query_price, query_services)
This function refines the results from "conjunctive_query" by considering additional filters like price range, available services, and cuisine types.

- **Precise Match Filtering**: First, the "conjunctive_query" function is used to filter the restaurants based on the query text.
- **TF-IDF Scoring**: A TF-IDF score is calculated for each restaurant based on the query words. The score is accumulated for each matching term in the restaurant descriptions.
- **Price Range Scoring**: A price score is added, where restaurants with a price range close to the user's query receive a higher score.
- **Services and Cuisine Scoring**: Restaurants are further scored based on the presence of requested facilities and services, as well as the type of cuisine.
- **Sorting**: The results are sorted by the accumulated score in descending order using "heapq" for efficient sorting.

### Result
The final result is a sorted list of restaurants, with the most relevant ones appearing at the top, based on the combination of query match, TF-IDF score, price range, services, and cuisine type.


In [None]:
import heapq

def conjunctive_query(query, dictionary, inverted_index):
    # Preprocessing query text
    cleaned_query = preprocessing(query)
    unique_words = set(cleaned_query)

    doc_lists = []
    for word in unique_words:
        # Retrieve term_id from the dictionary
        term_id = dictionary.get(word)
        if term_id:
            # Retrieve the list of documents and append
            docs = inverted_index.get(str(term_id))
            if docs:
                doc_lists.append(docs)

    if doc_lists:
        # Initialize intersection with the first list
        intersection = doc_lists[0]
        for docs in doc_lists[1:]:
            # Use heapq to compute intersection more efficiently
            intersection = list(heapq.nsmallest(len(intersection), set(intersection) & set(docs)))
    else:
        intersection = []

    result = data.loc[intersection, [
        'restaurantName', 'address', 'description', 'website',
        'priceRange', 'cuisineType', 'facilitiesServices', 'id'
    ]]

    if intersection:
        return result
    else:
        print('No matches found')
        return None


def customized_query(query_text, dictionary, inverted_index, TFiDF, vocabulary, query_price, query_services):
    # Filter precise match
    filter_precise_match = conjunctive_query(query_text, dictionary, inverted_index)
    filter_precise_match['priceRange_numeric'] = filter_precise_match['priceRange'].apply(convert_to_numeric)
    filter_precise_match['score'] = 0

    for w in query_text.split(' '):
        for index, ristorante in filter_precise_match.iterrows():
            if w in vocabulary:
                TFIDF_w = {str(elemento[0]): elemento[1] for elemento in TFiDF[str(vocabulary[w])]}
                id_ristorante = ristorante['id']
                if id_ristorante:
                    TFIDF_w_r = TFIDF_w.get(str(id_ristorante), 0)
                    filter_precise_match.at[index, 'score'] += TFIDF_w_r

    # Calculate additional scores
    for i, restaurant in filter_precise_match.iterrows():
        price_value = restaurant['priceRange_numeric']
        price_diff = price_value - query_price
        score_price = -0.5 * price_diff
        if price_diff == 0:
            score_price = 2
        filter_precise_match.at[i, 'score'] += score_price

        if isinstance(restaurant['facilitiesServices'], str):
            score_services = sum(1 for service in query_services if service in restaurant['facilitiesServices'].split(';')) * 0.1
        else:
            score_services = 0

        if isinstance(restaurant['cuisineType'], str):
            score_cuisine = sum(1 for service in query_services if service in restaurant['cuisineType'].split(',')) * 0.1
        else:
            score_cuisine = 0

        filter_precise_match.at[i, 'score'] += score_price + score_cuisine + score_services

    # Use heapq to sort by score in descending order
    sorted_restaurants = heapq.nlargest(
        len(filter_precise_match), 
        filter_precise_match.itertuples(index=False), 
        key=lambda x: x.score
    )

    # Convert back to DataFrame
    columns = ['restaurantName', 'address', 'description', 'website', 'score']
    result = pd.DataFrame.from_records(
        [(r.restaurantName, r.address, r.description, r.website, r.score) for r in sorted_restaurants],
        columns=columns
    )

    return result


# Output


### User Query Input and Output

This is the section of the code where the user is prompted to input their query parameters. The program will then execute the customized_query function based on these inputs and produce the corresponding results.

1. **Query Text**  
   The user is asked to enter the query text (e.g., "restaurant with garden").

2. **Price Range**  
   The user provides a desired price range between 1 and 5, which corresponds to different pricing levels for the restaurants.

3. **Cuisine Types**  
   The user is prompted to enter a list of cuisine types separated by commas (e.g., "Ligurian, Italian"). This will be used to filter the restaurants based on cuisine.

4. **Services**  
   The user is asked to specify a list of services separated by commas (e.g., "Air conditioning, Car park"). This will help narrow down the search to restaurants offering the specified services.

After the user inputs these parameters, the customized_query function is called with the user's inputs along with other necessary data like the dictionary, inverted_index, TFiDF, and vocabulary variables. The result is then displayed, which will show restaurants that match the criteria specified by the user.

This cell is responsible for triggering the output based on the user's query.


In [None]:


# Define the query parameters
query_text = 'restaurant with garden'
query_price = 1
query_cuisine = ['Ligurian']
query_services = ['Air conditioning', 'Car park']

customized_query(query_text, dictionary, inverted_index, TFiDF, vocabulary, query_price, query_services)



  filter_precise_match.at[i, 'score'] += score_price


Unnamed: 0,restaurantName,address,description,website,score
1534,Forentum,"piazza Plebiscito 16, Lavello, 85024, Italy","This rustic, family restaurant in the historic...",https://www.forentum.it,-0.9
1189,La Fratanza,"via Garibaldi 37, Nocera Superiore, 84015, Italy",Follow your satnav and you’ll easily find this...,https://www.lafratanzaristorante.it/,-0.9
767,Borgo Spoltino,"strada Selva Alta, Mosciano Sant'Angelo, 64023...",Borgo Spoltino occupies a 19C farmhouse amid h...,https://borgospoltino.it/,-0.9
55,La Masseria,"via Chiesa 61, località Marzaglia, Modena, 411...",What used to be an old watermill is now a welc...,https://www.ristorantemasseria.com/,-0.9
999,Dalla Libera,"via Farra 52, Sernaglia della Battaglia, 31020...","At this restaurant, an American-style barbecue...",http://www.trattoriadallalibera.it,-0.9
...,...,...,...,...,...
1086,Villa Cordevigo,"Loc. Cordevigo, Verona, Italy",Half an hour’s drive up the road from Verona i...,,
10,The Ashbee Hotel,"Viale San Pancrazio n.46, Taormina, Italy",Built steps away from the ruins of an ancient ...,,
1549,Villa Neri Resort & Spa,"Contrada Arrigo, Linguaglossa, Italy","If you think of volcanoes in Italy, Mount Vesu...",,
1037,"The First Arte, Rome","Via del Vantaggio 14, Rome, Italy","Here, just off the Via del Corso, at the First...",,


### Are the results you obtain better than with the previous scoring function? Explain and compare results.


In this scenario, we are comparing the results of three different search methods for the query "vegan restaurant with garden":

1. **The scoring function with weighted variables** (used in the **customized_query** function).
2. **The similarity score function** (used in **ranked_search**).
3. **The conjunctive query** that matches keywords in the description (used in **conjunctive_query**).

To evaluate which approach produces the "better" results, we need to consider **precision**, **relevance**, and **how well each method meets the intent of the query**. In this case, the query specifically asks for "vegan restaurants with garden," which requires results that are not only vegan but also likely have outdoor garden spaces. 

#### **1. Results from the Scoring Function with Weighted Variables:**

| **restaurantName** | **address** | **description** | **website** | **score** |
|--------------------|-------------|-----------------|-------------|-----------|
| Secondo Tempo      | via Vittorio Amedeo 55, Termini Imerese, 90018... | Situated on the first floor of a building (for...) | [Link](http://www.ristorantesecondotempo.it) | 4.0 |
| Cuculia            | via dei Serragli 3r, Florence, 50124, Italy | This beautiful restaurant with a smart and elegant interior, and a wonderful outdoor garden. | [Link](https://www.cuculia.it) | 4.0 |
| Osteria 1861       | via Valentino Izzo 1, Santa Maria di Castellab... | Accessed via a narrow alleyway behind the beac... | [Link](https://www.osteria1861.it) | 4.0 |
| Sale Grosso        | vicolo De' Facchini 4/a, Bologna, 40124, Italy | A simple, bistro-style restaurant serving we... | NaN | 1.0 |
| Li Lioni           | SS 131 km 244,400 - regione Li Lioni, Porto To... | The highlights of this restaurant just outside... | [Link](https://www.tenutalilioni.it/) | 1.0 |

**Analysis**: 
- This function ranks the results based on various factors, including TF-IDF, price, services, and the type of cuisine. **Cuculia** stands out here as a restaurant with a garden, and it receives a high score (4.0), likely due to its relevance to the query.
- The results are well-rounded and consider multiple factors, such as the type of cuisine and the environment (likely considering the garden).
- **Sale Grosso** and **Li Lioni**, despite being relevant in terms of cuisine, rank low due to other factors like description match and amenities.

#### **2. Results from the Similarity Score Function:**

| **restaurantName** | **address** | **description** | **website** | **SimilarityScore** |
|--------------------|-------------|-----------------|-------------|---------------------|
| Sale Grosso        | vicolo De' Facchini 4/a, Bologna, 40124, Italy | A simple, bistro-style restaurant serving we... | NaN | 0.264720 |
| Li Lioni           | SS 131 km 244,400 - regione Li Lioni, Porto To... | The highlights of this restaurant just outside... | [Link](https://www.tenutalilioni.it/) | 0.183576 |
| Il Mirto           | via Provinciale Lacco, 284, Forio, 80075, Italy | When you arrive at the luxury Botania hotel, y... | [Link](http://www.ilmirtoristorantevegetariano.it) | 0.157729 |

**Analysis**:
- The **similarity score function** ranks the results by the cosine similarity between the query vector and the document vectors. 
- **Sale Grosso** and **Li Lioni** appear in the top results, but they receive relatively low scores (0.2647 and 0.1836), suggesting that they are not particularly relevant to the query, even though they may contain some of the keywords. 
- **Il Mirto** also appears, but it is not particularly relevant either. This function doesn’t account for factors like garden spaces, services, or even cuisine in a meaningful way, relying solely on textual similarity, which often leads to less relevant results when the query is specific (like "vegan restaurant with garden").

#### **3. Results from the Conjunctive Query (Exact Keyword Match in Description):**

| **restaurantName** | **address** | **description** | **website** |
|--------------------|-------------|-----------------|-------------|
| Il Mirto           | via Provinciale Lacco, 284, Forio, 80075, Italy | When you arrive at the luxury Botania hotel, y... | [Link](http://www.ilmirtoristorantevegetariano.it) |
| Li Lioni           | SS 131 km 244,400 - regione Li Lioni, Porto To... | The highlights of this restaurant just outside... | [Link](https://www.tenutalilioni.it/) |

**Analysis**:
- The **conjunctive query** approach strictly matches keywords (in this case, "vegan" and "garden") in the description. 
- It returns **Il Mirto** and **Li Lioni**, which are vegan-friendly restaurants, but there is no clear mention of a garden or outdoor space in the descriptions provided.
- The results are less relevant in terms of matching the user's expectations of finding a vegan restaurant with a garden, as it doesn't consider context or additional factors like restaurant ambiance or setting.

### **Comparison and Conclusion:**

- **The scoring function with weighted variables** (used in customized_query) produces the most relevant and well-rounded results, as it integrates multiple factors that can affect a restaurant’s suitability for the query. The **Cuculia** restaurant is a standout due to its garden, and the overall scores give a clearer picture of the restaurants' relevance. It is especially useful for a query that implies several considerations (vegan, garden, type of restaurant).
  
- **The similarity score function** (used in ranked_search) is less effective in this context. While it ranks **Sale Grosso** and **Li Lioni**, it doesn’t capture the nuances of the query well because it relies purely on textual similarity, without considering the full context of the query (e.g., garden space, vegan offerings). It thus produces results that are not well aligned with the user's search intent.

- **The conjunctive query** (used in conjunctive_query), while focusing on keyword matching, does not perform well either. It doesn’t fully capture the intent of the user, as it returns restaurants like **Il Mirto** and **Li Lioni**, which may be vegan but do not mention a garden in the description. This method is highly restrictive because it only matches exact keywords and does not account for context or other relevant criteria like ambiance or services.

**Conclusion**: 
The **scoring function with weighted variables** offers **the best results** because it provides a more comprehensive and accurate ranking of restaurants, taking into account not only the query keywords but also essential factors like the restaurant's environment (e.g., garden) and amenities, alongside cuisine type and price. This makes it more likely to return results that align with the user's needs, whereas the other two functions either rely too heavily on simple keyword matching or similarity without context, leading to less precise outcomes.



# 4. Visualizing the Most Relevant Restaurants

Maps can provide users with an easy way to see where restaurants are located. This is especially useful for understanding which regions in Italy have more options.

Steps for Visualization:
1. Geocode Locations: Collect information on unique restaurant locations in Italy (in the format of City and Region). You can use tools such as Google API, OpenStreetMap, or a pre-defined list to retrieve representative coordinates for each region.
2. Ask a Large Language Model (LLM): Alternatively, you can compile a list of unique cities and regions in Italy, formatted as (City, Region), and ask an LLM (e.g., ChatGPT) to provide coordinates for these locations. This can be an efficient way to gather data without using API calls. Just make sure that the retrieved information is correct and helpful.
3. Map Setup: Use a mapping library like plotly or folium to create a visual display of restaurants by region.
4. Encoding Price Ranges: Incorporate a visual representation for price ranges:
    * Use color-coding or marker size to represent the restaurant's price range (€, €€, €€€, €€€€).
    * Include a legend for interpreting price levels.
5. Plot Top-K Restaurants: Use the custom score from Step 3 to select the top-k restaurants for display.
This map will give users an overview of restaurant options across different regions in Italy, with an indication of cost based on visual cues.

---

First, we will extract the cities from the addresses of the restaurants using the `get_city` function defined in the `functions.py` file. 


## Algoritmic Question (AQ)

A robot is tasked with collecting $n$ packages located at coordinates $(x_i, y_i)$ starting at position $(0,0)$, where:
​
+ $t$ is the number of test cases
+ $n$ is the number of packages
+ $(x_i, y_i)$ are the coordinates of the $i$-th package.


The robot can only move right (R) or up (U). Therefore, to successfully collect all packages without retracing its steps (moving down or left), the following condition must hold for all pairs $i<j$ in the list of package coordinates:  $$x_i\leq x_j \iff  y_i\leq y_j$$

This ensures that the robot can reach every package while moving strictly in the allowed directions.

### **Input**

The first line contains t — the number of test cases.
Each test case starts with n, the number of packages. The next n lines contain the coordinates $(x_i,y_i)$ for each package $i\in\{1,...,n\}$.

### **Output**
For each test case, print "YES" and the lexicographically smallest path, or "NO" if it’s impossible to collect all packages.

### **Pseudo-algorithm**

`t` = input number of test cases  

**For** each test case:  
- get `n` = input number of packages  
- Read and store package coordinates in `coords`  

  **If** `n == 1`:  
  - Extract `(x, y)` from `coords`  
  - Print "YES" and path = `'R' * x + 'U' * y`  
  - **Skip** the rest of the test case  

  **Else**:  
  - Sort `coords` lexicographically  
  - *Initialize* `possible = True`  

  **For** `(x1, y1), (x2, y2)` in consecutive pairs of `coords`:  
  - **If** `x2 < x1` or `y2 < y1`:  
    - `possible = False` 

  **If** not `possible`:  
  - Print "NO"  

  **Else**:  
  - *Initialize* `(x0, y0) = (0, 0)` and `path = ""`  
  - **For** `(x, y)` in `coords`:  
    - Add `'R' * (x - x0)` and `'U' * (y - y0)` to `path`  
    - Update `(x0, y0) = (x, y)`  
  - Print "YES" and `path`  

It follows the implementation of the pseudocode for the algorithm

### **AQ1. Algorithm**



In [211]:
t = int(input('test cases'))

for test in range(t):
    n = int(input('number of packages'))
    coords = []
    for i in range(n):
        (x,y) = map(int, input(f'Coordinates of {i+1}-th package').split())
        coords.append((x,y))

    #if the condition doesn't hold for any pair, print no
    if n == 1:  # Special case: only one package
        x, y = coords[0]
        print("YES")
        print('R' * x + 'U' * y)
        continue

    else: 
        sorted_pcks= sorted(coords)
        possible = True

        for (x1, y1), (x2, y2) in zip(sorted_pcks, sorted_pcks[1:]):
            if x2 < x1 or y2 < y1:
                possible = False
                break

        if not possible:
            print('NO')
        else:
            path = []
            #generate lexicographically shortest path:
            tempx, tempy = 0,0 #starting at zero
            for x,y in sorted_pcks:
                steps_r = x-tempx
                steps_u = y-tempy
                path.append(steps_r * 'R')
                path.append(steps_u * 'U')
                tempx, tempy =  x, y

            print('Yes')
            print(''.join(path))


Yes
RUUURRRRUU
NO
YES
RRRRUUU


### **AQ2. Proof of Correctness**
##### Case 1: n = 1
For a single package $(x, y)$, the robot can always reach the package using exactly x right moves and y up moves.
The output path $'R' ⋅ x + 'U' ⋅ y$ is valid and lexicographically smallest by construction.
##### Case 2: General Case (n > 1)
Sorting ensures that the robot processes the packages in order of increasing $x$-coordinates. Within packages with the same $x$-coordinate, it processes those with increasing $y$-coordinates.
If the sorted order is invalid (i.e., $x_2 < x_1$ or $y_2 < y_1$), the algorithm correctly identifies this and outputs "NO."
##### Feasibility:
The algorithm checks all pairs of consecutive packages in the sorted list to ensure that moving between them is feasible without moving left or down.
If any pair violates this condition, the robot cannot collect all packages, and the algorithm outputs "NO."
##### Path Construction:
The algorithm constructs the path by calculating the exact number of right and up moves needed to reach each package from the current position.
Because the coordinates are sorted lexicographically, the robot always moves in the lexicographically smallest way.
By appending right moves first ('R' ⋅ steps_r) followed by up moves ('U' ⋅ steps_u), the algorithm ensures the path is constructed in lexicographical order.
##### Output Correctness:
If the feasibility check passes, the algorithm outputs "YES" and the constructed path. Both are correct because the robot can visit all packages in order without violating the movement constraints.

We can further consider some edge cases:
+ Single Test Case (t == 1): The loop correctly handles a single test case.
+ Multiple Test Cases: The algorithm handles each test case independently, so results from one test case do not affect others.

Boundary Values:
+ If n == 1, the algorithm outputs the correct path.
+ If coords contain only points along the x-axis or y-axis, the algorithm handles these correctly.
+ If any package is unreachable (due to invalid coordinates), the algorithm outputs "NO."

### **AQ.3 Time Complexity**

The **outer loop** requires $t$ iterations, one for each test case, so it takes $O(t)$. The **inner loop** requires $n$ iterations, one for each different number of packages, in order to read and append the coordinates, which are operations that require constant time, hence of the order $O(n)$.

Inside the loop, in the best case, that is $n = 1$, we just compute this check, write the path which has lenght equal to the sum of the coordinates, print it and exit the inner loop to go to the next test case, which takes constant time (const).

In the worst case, that is $n>1$, after reading the $n$ coordinates and append them to the $n$-dimentional list $(const⋅n)$, we **sort it**, wich costs $O(nlogn)$, since python uses the TimSort algorithm.

Then, for each of the $n-1$ pairs of coordinates in the $n$-dimentional sorted list we need to compare their $x$ and $y$ to check the feasibility of the operation, performing a total of $2⋅(n-1)$ checks. Subsequentially we iterate once again over the coordinates of the list to compute the number of 'R' and 'U' steps, and perform constant time operations, which costs $O(n)$.

Finally we have $O(t⋅[n+const+nlogn+2(n-1)+n(const)])$, but since the term $nlogn$ clearly dominates on the others it is summed with, we end up with $$T(t,n) = O(t⋅nlogn)$$


### **AQ.4 LLM tool complexity evaluation**

We instructed ChatGPT with the following prompt: 

*"Evaluate the time complexity of the following code using Big O notation"* , and fed it with our algorithm.
#### ChatGPT Response:
" To evaluate the time complexity of the given code, we break it down into its major components:

1. Outer Loop (Test Cases):
    This loop runs for $t$ test cases, where $t$ is the number of test cases provided as input.
2. Inner Loop (Reading Coordinates):
    This loop runs $n$ times for each test case, where $n$ is the number of packages in the test case.
    Time complexity: $O(n)$ for input and appending the coordinates to the coords list.
3. Special Case for $n == 1$:
    For the special case where $n == 1$, the algorithm directly calculates and prints the path. This involves creating strings of length $x$ and $y$, where
    $x$ and $y$ are the coordinates of the package.
    Time complexity for this case: $O(x+y)$
4. Sorting Coordinates:
    Sorting $n$ coordinates lexicographically has a time complexity of $O(nlogn)$.
5. Feasibility Check:
    The loop iterates through $n - 1$ pairs of coordinates to check the feasibility condition.
    Time complexity: $O(n)$.
6. Path Construction:
    The loop iterates over $n$ coordinates, and for each coordinate, it computes the number of steps (steps_r and steps_u) and appends the corresponding characters to the path.
    Appending strings and joining the path involve processing a total of $O(total path length)$, where the path length is proportional to the sum of the $x$ and $y$ coordinates of all packages: $\sum (x_i+y_i)$.
    Time complexity: $O(n+total path length)$.


7. Overall Time Complexity
    For each test case:
    + Input Reading: $O(n)$,
    + Special Case ($n == 1$): $O(x+y)$ (only if $n==1$).
    + Sorting: $O(nlogn)$
    + Feasibility Check: $O(n)$
    + Path Construction: $O$($n$+total path length).

    Combining these:

    Worst-case complexity per test case: $O(nlogn+n+total path length).$

For $t$ test cases:

Overall complexity: $O(t⋅(nlogn+n+total path length))$.
If we assume the path length scales linearly with $n$ (e.g. total path length = $O(n)$), the final complexity simplifies to:
$O(t⋅nlogn)$"

  The response provided is correct.

### **AQ.5 Greedy alorithm optimality**
> Assume now that the robot can also move towards the left or downwards, and consider the greedy approach: from the current location go to the closest package. Notice that now we can always collect all packages. Prove that the greedy algorithm is optimal (i.e., it minimizes the total distance traveled), or provide a counterexample showing that it is not.

In [212]:
t = int(input("test cases"))

for test in range(t):
    n = int(input("number of packages"))
    coords = []
    for i in range(n):
        x, y = map(int, input(f"Coordinates of {i+1}-th package: ").split())
        coords.append((x, y))

    # Start at (0, 0)
    x0, y0 = 0, 0
    path = []
    
    while coords:
        # Find the nearest package based on Manhattan distance
        coords.sort(key=lambda p: abs(p[0] - x0) + abs(p[1] - y0))
        x, y = coords.pop(0)  # Get the nearest package
        
        # path instructions
        if x > x0:
            path.append('R' * (x - x0))
        elif x < x0:
            path.append('L' * (x0 - x))
        
        if y > y0:
            path.append('U' * (y - y0))
        elif y < y0:
            path.append('D' * (y0 - y))
        
        # Update current position
        x0, y0 = x, y

    print("Yes") #always possible
    print("".join(path))


Yes
RRRURRUULLLUUU
Yes
RRRRRRUUULLUUUUU


Sorting the packages by Manhattan distance has a time complexity of $O(n^2)$ in total across $n$ steps.
The path construction is an $O(total path length)$.
The overall complexity is then $O(t⋅n ^2)$, which is acceptable for small values of $n$, but for larger inputs, optimizations (e.g., using a priority queue) would be necessary.

The greedy algorithm described, where the robot moves to the closest package based on Manhattan distance, is not guaranteed to be optimal in terms of minimizing the total distance traveled. The robot here always chooses the nearest package (in terms of Manhattan distance) from the current location, without considering how this decision might affect future moves. This approach does not always lead to the globally optimal solution. The problem arises because choosing the "locally optimum" among the packages at every step can result in a suboptimal path when considering all packages.

#### Counterexample:
>Start: (0, 0)
>Packages at (2, 2), (5, 0), and (0, 5).

- Greedy Algorithm
    * Start at (0, 0).
    * Closest package: (2, 2) (distance = 4). Move to (2, 2).
    * Closest package from (2,2): (0, 5) and (5, 0) equally distant, so assume it chooses (5, 0) (distance = 5). Move to (5, 0).
    * Last package: (0, 5) (distance = 10).
    * Total distance (Greedy): $4+5+10=19$

- Possible Alternative Optimal Solution
    * Start at (0, 0).
    * Move to (0, 5) first (distance = 5).
    * From (0, 5) move to (2, 2) (distance = 5).
    * Finally, move to (5, 0) (distance = 5).
    * Total distance (Optimal): $5+5+5=15$

The greedy algorithm makes a locally optimal decision at each step but does not minimize the total distance. This counterexample proves that the greedy approach is not guaranteed to be optimal.