# ADM HW 3

For this homework, no dataset has been provided. Instead, you have to build your own. Your search engine will run on text documents. So, here we detail the procedure to follow for the data collection. We strongly suggest you work on different modules when implementing the required functions. For example, you may have a crawler.py module, a parser.py module, and an engine.py module: this is a good practice that improves readability in reporting and efficiency in deploying the code. Be careful; you will likely have to deal with exceptions and other possible issues!

### 1.1. Get the list of Michelin restaurants
* You should begin by compiling a list of restaurants to include in your document corpus. Specifically, you will focus on web scraping the Michelin Restaurants in Italy. Your task is to collect the URL associated with each restaurant in this list. The output of this step should be a .txt file where each line contains a single restaurant’s URL. By the end, you should have approximately 2,000 restaurants on your list. The number changes daily, so some groups might have a different number of restaurants.

---
We will start by loading the relevant libraries. Then we will scrape the *relevant* links from only the first page of the [Michelin website](https://guide.michelin.com/en/it/restaurants/) as a test. If that succeeds, we will scrape the links from all 100 pages!


In [None]:
# import the relevant libraries
import requests
from bs4 import BeautifulSoup
import os
import time
import asyncio
from crawler import *
from myparser import *
from functions import *
import pandas as pd
from tqdm import tqdm
tqdm.pandas() # register progress_apply on pandas objects (tqdm must be imported before this call)
import matplotlib.pyplot as plt
from tqdm.asyncio import tqdm

In [2]:
# setting current working directory to the project folder (adjust the path to your machine)
os.chdir("/Users/saifdev/Desktop/ADM/ADM_HW_3")

Before I begin to build the function `scraping_urls`, which will scrape all the links for the Michelin restaurants in Italy and save them to `urls.txt`, I will start by trying to scrape only the first page to test and practice.

In [None]:
# A SMALL TEST
url = "https://guide.michelin.com/en/it/restaurants/page/" # url of the Michelin Guide Italy
with open('test.txt', 'w') as file: # open a file to write the links of the restaurants
    current_page = url + "1" # the first page only for testing
    request = requests.get(current_page) # get the first page
    soup = BeautifulSoup(request.content, 'lxml') # parse the content of the first page
    for a in soup.select("div.flex-fill a"): # Ref Keith Galli Y.T. channel :)
        file.write("https://guide.michelin.com"+a.get('href') + '\n')   # write the links of the restaurants in the first page

In [None]:
# check the number of lines written in the file = 20 (and also manually checked)
with open('test.txt', 'r') as file: # open the file to read the links of the restaurants
    num_lines = len(file.readlines()) # count the number of lines in the file
    result = "20" if num_lines == 20 else "not 20" # check if the number of lines is 20
    print(f"The number of urls scraped is {result}") # print the result. It is indeed 20


The number of urls scraped is 20


It worked as expected. Now we will scrape the URLs from all 100 pages of the Michelin website using the `scraping_urls` function.

A little note about the `scraping_urls` function:
* It takes the URL of the Michelin Guide Italy listing and the number of pages to scrape as input, and writes the URLs of all the restaurants to a text file (`urls.txt`). The function is defined in the `crawler.py` file.

In [None]:
pages = 100 # set the number of pages to scrape
url = "https://guide.michelin.com/en/it/restaurants/page/" # url of the Michelin Guide Italy
scraping_urls(url=url, pages=pages) # scrape the urls of the restaurants in the Michelin Guide Italy

Checking if the number of urls scraped is approximately 2000

In [None]:
# check the number of urls scraped
with open('urls.txt', 'r') as file:
    num_lines = len(file.readlines())
    print(f"Total Number of restaurants urls scraped is {num_lines}")


Total Number of restaurants urls scraped is 2029


* So, all the URLs pointing to the restaurant pages on the Michelin website have been scraped and saved in `urls.txt`. Now we can move on to the next section.

---

### 1.2. Crawl Michelin restaurant pages

##### Once you have all the URLs on the list, you should:

* Download the HTML corresponding to each of the collected URLs.
* After collecting each page, immediately save its HTML in a file. This way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
* Organize the downloaded HTML pages into folders. Each folder will contain the HTML of the restaurants from page 1, page 2, ... of the Michelin restaurant list.
* Tip: Due to the large number of pages to download, consider using methods that can help shorten the process. If you employed a particular process or approach, kindly describe it.

---

Now, we will download the HTML page of each restaurant from the scraped URLs. To do this, we will use the functions `load_urls`, `fetch_and_save`, and `download_html_in_batches`, all defined in the `crawler.py` file.

Some information about each of these functions:
* **load_urls**: This function loads the urls of the restaurants from the urls.txt file.
* **fetch_and_save**: This function fetches the html content of the urls and saves the html file into a folder of a particular batch/page.
* **download_html_in_batches**: This function downloads the html content of the urls in batches of 20. It makes use of `asyncio` and `aiohttp` to issue the requests concurrently, which speeds up the download compared with fetching the pages one at a time.
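
The batching idea can be sketched as follows. This is only an illustration under stated assumptions, not the code in `crawler.py`: the names `fake_fetch`, `fetch_and_save_sketch` and `download_in_batches_sketch` are hypothetical, the fetch step is a stand-in coroutine (so the sketch runs offline) where the real crawler would issue an `aiohttp` request, and the file name is assumed to be the last segment of the URL.

```python
import asyncio
import os
import tempfile

BATCH_SIZE = 20  # one listing page holds 20 restaurants


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP GET (hypothetical; the crawler would use aiohttp here)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"<html><body>{url}</body></html>"


async def fetch_and_save_sketch(url, folder, fetch, limiter):
    """Fetch one page and write it to disk straight away, so nothing is lost on a crash."""
    async with limiter:  # cap the number of requests in flight
        html = await fetch(url)
    name = url.rstrip("/").split("/")[-1] + ".html"  # assumed naming scheme
    with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
        handle.write(html)


async def download_in_batches_sketch(urls, output_dir, fetch=fake_fetch, max_concurrent=10):
    """Download URLs batch by batch; folder batch_i mirrors listing page i."""
    limiter = asyncio.Semaphore(max_concurrent)
    for start in range(0, len(urls), BATCH_SIZE):
        folder = os.path.join(output_dir, f"batch_{start // BATCH_SIZE + 1}")
        os.makedirs(folder, exist_ok=True)
        batch = urls[start:start + BATCH_SIZE]
        # all pages of one batch are requested concurrently
        await asyncio.gather(*(fetch_and_save_sketch(u, folder, fetch, limiter) for u in batch))


def run_demo():
    """Run the sketch on 45 made-up URLs and return the output directory."""
    urls = [f"https://example.com/restaurant/r{i}" for i in range(45)]
    out = tempfile.mkdtemp()
    # in a plain script; inside a notebook cell, await the coroutine instead
    asyncio.run(download_in_batches_sketch(urls, out))
    return out
```

Writing each file as soon as its page arrives is what makes the download resumable, and the semaphore keeps the number of simultaneous requests bounded so the server is not flooded.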

In [None]:
url_file = 'urls.txt' # file containing the urls of the restaurants
output_dir = 'michelin_html_batches' # directory to save the sub-directories and the html files
urls = load_urls(url_file) # load the urls of the restaurants

start_time = time.time()
# Jupyter already runs an event loop, so the coroutine can be awaited directly at the top level of a cell.
# Calling asyncio.run() here would raise a RuntimeError because a loop is already running.
# In a plain Python script (no running loop), use instead:
#     asyncio.run(download_html_in_batches(urls, output_dir))
await download_html_in_batches(urls, output_dir) # download the html files concurrently, batch by batch

print(f"Finished downloading in {time.time() - start_time:.2f} seconds")

Now we have the html files of the restaurants in the Michelin Guide Italy website downloaded in batches and organised into folders. It is time to start parsing them.

---
### 1.3 Parse downloaded pages

##### At this point, you should have all the HTML documents about the restaurant of interest, and you can start to extract specific information. The list of the information we desire for each restaurant and their format is as follows:

* Restaurant Name (to save as restaurantName): string;
* Address (to save as address): string;
* City (to save as city): string;
* Postal Code (to save as postalCode): string;
* Country (to save as country): string;
* Price Range (to save as priceRange): string;
* Cuisine Type (to save as cuisineType): string;
* Description (to save as description): string;
* Facilities and Services (to save as facilitiesServices): list of strings;
* Accepted Credit Cards (to save as creditCards): list of strings;
* Phone Number (to save as phoneNumber): string;
* URL to the Restaurant Page (to save as website): string. 


---

We have created a set of functions to parse the html files and extract the information we need. They live in the `myparser.py` file and are listed below; their names are self-explanatory:  

* `get_restaurant_name(soup)` - for example, takes a soup object and returns the name of the restaurant.
* `get_address(soup)`
* `get_city(address)`
* `get_postal_code(address)`
* `get_country(address)`
* `get_price_range(soup)`
* `get_cuisine_type(soup)`
* `get_description(soup)`
* `get_facilities_services(soup)`
* `get_credit_cards(soup)`
* `get_phone_number(soup)`
* `get_website(soup)`
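
Three of these helpers work on the address string rather than on the soup. A minimal sketch of how such a split might look is given below. The function `split_address` is hypothetical (the notebook uses three separate functions whose code is not shown here), and it assumes the address ends with `..., city, postal code, country`, with the five-digit Italian postal code sometimes missing.

```python
import re


def split_address(address: str) -> dict:
    """Split 'street, city, postal code, country' from the right-hand side.

    Working from the right keeps commas inside the street part harmless.
    """
    parts = [p.strip() for p in address.split(",")]
    country = parts[-1] if parts else None
    city, postal_code = None, None
    if len(parts) >= 2 and re.fullmatch(r"\d{5}", parts[-2]):
        # regular case: ..., city, 16123, Italy
        postal_code = parts[-2]
        city = parts[-3] if len(parts) >= 3 else None
    elif len(parts) >= 3:
        # no postal code: ..., city, Italy
        city = parts[-2]
    return {"city": city, "postalCode": postal_code, "country": country}


print(split_address("via David Chiossone 20 r, Genoa, 16123, Italy"))
print(split_address("Piazza Tasso 34, Sorrento, Italy"))
```

Addresses without a postal code are one plausible source of the missing `postalCode` values seen later in the dataset.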

We will test these functions on a sample html file from the directories we have created. This will help us understand their output and evaluate how well they perform.

In [None]:
# Test the parser functions on a particular html file
with open('/Users/saifdev/Desktop/ADM/ADM_HW3/michelin_html_batches/batch_1/20tre.html', 'r', encoding='utf-8') as file:
	content = file.read()
	soup = BeautifulSoup(content, 'lxml')
	name = get_restaurant_name(soup)
	print(f"Restaurant name: {name}")
	address = get_address(soup)
	print(f"Restaurant address: {address}")
	city = get_city(address)
	print(f"Restaurant city: {city}")
	zipcode = get_postal_code(address)
	print(f"Restaurant zipcode: {zipcode}")
	country = get_country(address)
	print(f"Restaurant country: {country}")
	price = get_price_range(soup)
	print(f"Restaurant price range: {price}")
	cuisine = get_cuisine_type(soup)
	print(f"Restaurant cuisine type: {cuisine}")
	description = get_description(soup)
	print(f"Restaurant description: {description}")
	facilities = get_facilities_services(soup)
	print(f"Restaurant facilities and services: {facilities}")
	credit_cards = get_credit_cards(soup)
	print(f"Restaurant credit cards: {credit_cards}")
	phone = get_phone_number(soup)
	print(f"Restaurant phone number: {phone}")
	website = get_website(soup)
	print(f"Restaurant website: {website}")

Restaurant name: 20Tre
Restaurant address: via David Chiossone 20 r, Genoa, 16123, Italy
Restaurant city: Genoa
Restaurant zipcode: 16123
Restaurant country: Italy
Restaurant price range: €€
Restaurant cuisine type: Farm to table, Modern Cuisine
Restaurant description: Situated in the heart of Genoa’s historic centre, this contemporary-style restaurant focuses on just a few dishes, almost all fish-based, presented in a very modern style and in generous portions. Seasonal ingredients and market-fresh produce are the guiding philosophy here.
Restaurant facilities and services: ['Air conditioning']
Restaurant credit cards: ['amex', 'dinersclub', 'mastercard', 'visa']
Restaurant phone number: +39 010 247 6191
Restaurant website: https://www.ristorante20tregenova.it/


Our parser functions work as expected on this file. Let's now apply them to all the html files we have stored. Then, we will gather the data into a csv file, which we can later read into a pandas DataFrame.

To do this, we will use the following functions stored in `myparser.py`:
- `extract_restaurant_data`, `save_restaurant_data_to_csv`, `forming_a_csv`

A brief explanation of the functions:
* `extract_restaurant_data` - This function takes the html content of a restaurant page and extracts the relevant data using the parser functions.
* `save_restaurant_data_to_csv` - This function takes the extracted data and saves it to a csv file.
* `forming_a_csv` - This function takes the html files stored in the directory and extracts the data from each file and saves it to a csv file.


In [None]:
# form a csv file with the information of the restaurants using extract_restaurant_data function and the save_restaurant_data_to_csv function
forming_a_csv()

Data saved to michelin_restaurant_data.csv


Finally, we have the csv file with the information of the restaurants, produced by the aforementioned functions. The csv file is named *"michelin_restaurant_data.csv"*. We can now load it using `pd.read_csv` and perform the required tasks.

In [3]:
data = pd.read_csv('michelin_restaurant_data.csv')
data.head()

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",Melizzano,82030.0,Italy,€€,"Modern Cuisine, Campanian",A rustic restaurant with contemporary Mediterr...,Air conditioning; Car park; Garden or park; Gr...,amex; mastercard; visa,+39 0824 944506,https://www.locandaradici.it/
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...",Sant'Omobono Terme,24038.0,Italy,€€€,Italian,"Situated in the Imagna valley, this welcoming,...",Air conditioning; Wheelchair access,amex; dinersclub; mastercard; visa,+39 035 851134,https://www.frosioristoranti.it
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Furore,84010.0,Italy,€€,"Regional Cuisine, Farm to table",Patience is needed to get to this restaurant f...,Air conditioning; Car park; Garden or park; Gr...,amex; dinersclub; jcb; maestrocard; mastercard...,+39 089 830360,https://www.baccofurore.it/
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...",Santa Marina Salina,98050.0,Italy,€€,"Seafood, Seasonal Cuisine","Fish plays the starring role here, where the a...",Terrace,amex; jcb; maestrocard; mastercard; visa,+39 090 984 3486,
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Rimini,47921.0,Italy,€,"Cuisine from Romagna, Traditional Cuisine",Borgo San Giuliano is situated just a stroll f...,Terrace,amex; maestrocard; mastercard; visa,+39 0541 56074,https://www.osteriadeborg.it/


In [4]:
print(f"Checking the shape of the data: It should be {data.shape}")

Checking the shape of the data: It should be (2029, 12)


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2029 entries, 0 to 2028
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   restaurantName      2029 non-null   object 
 1   address             2029 non-null   object 
 2   city                2000 non-null   object 
 3   postalCode          1983 non-null   float64
 4   country             2029 non-null   object 
 5   priceRange          1983 non-null   object 
 6   cuisineType         2029 non-null   object 
 7   description         2029 non-null   object 
 8   facilitiesServices  1959 non-null   object 
 9   creditCards         1979 non-null   object 
 10  phoneNumber         1983 non-null   object 
 11  website             1875 non-null   object 
dtypes: float64(1), object(11)
memory usage: 190.3+ KB
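
One detail in the summary above deserves attention: `postalCode` was read as `float64`, while the specification asks for a string. Besides showing values such as `82030.0`, a numeric column drops leading zeros, which matters for codes like Rome's 00186. The sketch below shows one possible remedy, passing `dtype` to `pd.read_csv`; the two-row CSV is a toy stand-in for the real file.

```python
import io

import pandas as pd

# toy CSV: one postal code with leading zeros, one missing value
csv_text = (
    "restaurantName,postalCode\n"
    "Pipero Roma,00186\n"
    "Grand Hotel Excelsior Vittoria,\n"
)

# default behaviour: the column is inferred as float64
as_float = pd.read_csv(io.StringIO(csv_text))

# forcing a string dtype keeps the code exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"postalCode": str})

print(as_float["postalCode"].tolist())
print(as_text["postalCode"].tolist())
```

With the default read the first code becomes the number 186.0; with `dtype={"postalCode": str}` it stays `'00186'`, and the missing value remains `NaN` in both cases.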


# Section 2 Search Engines

This search engine allows you to retrieve restaurants based on a user query. We’ll build two types of search engines:

* Conjunctive Search Engine: Returns restaurants where all query terms appear in the description.
* Ranked Search Engine: Returns the top-k restaurants sorted by similarity to the query, using **TF-IDF and Cosine Similarity**.

### 2.0.0) Preprocessing the Text
Before building the search engine, you must clean and prepare the text in each restaurant’s description. We will:

* Remove stopwords.
* Remove punctuation.
* Apply stemming.
* Perform any other necessary cleaning to improve search accuracy.

---
Before preprocessing the data, we first need to check whether it contains missing or empty values.

In [4]:
missing_values = data.isnull().sum()
print(f"Missing values in the data: \n{missing_values}")
print(f"The total number of empty descriptions are: {data[data['description'] == ''].shape[0]}")

Missing values in the data: 
restaurantName          0
address                 0
city                   29
postalCode             46
country                 0
priceRange             46
cuisineType             0
description             0
facilitiesServices     70
creditCards            50
phoneNumber            46
website               154
dtype: int64
The total number of empty descriptions are: 0


In [5]:
# filling the missing values of columns cuisineType and facilitiesServices with empty strings. This will be useful for the search engine later
data['cuisineType'] = data['cuisineType'].fillna('')
data['facilitiesServices'] = data['facilitiesServices'].fillna('')

Because we have no missing or empty values in the description column, we will, for now, not remove any rows where some of the other data is missing.

Now, we will preprocess the text in the description column using the following procedure, implemented in the function `preprocess` (defined in `functions.py`):

* Tokenize the text (taking apostrophes into account)
* Remove stop words
* Apply stemming
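
The three steps can be sketched with the standard library alone. This is not the `preprocess` function from `functions.py`: the stop-word list is a tiny illustrative one, and `crude_stem` is a deliberately rough stand-in for a real stemmer (the stems in the notebook output, such as `restaur` and `contemporari`, look like those of a Porter-style stemmer, which would normally come from a library). All names in the block are hypothetical.

```python
import re

# tiny illustrative stop-word list (an assumption; a real list is much larger)
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "this", "to", "with"}


def crude_stem(token: str) -> str:
    """Very rough suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_sketch(text: str, stem=crude_stem) -> list:
    """Lower-case, tokenize, drop stop words and stem."""
    text = text.lower().replace("’", "'")  # normalise curly apostrophes
    # a word, optionally followed by an apostrophe part ("genoa's" stays one token)
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    # drop the possessive "'s", remove any other apostrophe
    tokens = [t[:-2] if t.endswith("'s") else t.replace("'", "") for t in tokens]
    return [stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess_sketch("Situated in the heart of Genoa’s historic centre."))
```

Punctuation disappears as a side effect of the tokenizing pattern, which only matches runs of letters. The `stem` parameter lets a real stemmer be plugged in without touching the rest of the pipeline.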

We will store the result in a new column, `description_clean`.

In [6]:
# Applying the preprocess function to the description column
data['description_clean'] = data['description'].apply(preprocess)

data['description_clean'].head()

0    [rustic, restaur, contemporari, decor, surroun...
1    [situat, imagna, valley, welcom, restaur, serv...
2    [patienc, need, get, restaur, coast, result, w...
3    [fish, play, star, role, authent, tradit, flav...
4    [borgo, san, giuliano, situat, stroll, sea, ol...
Name: description_clean, dtype: object

### 2.1 Conjunctive Query
This first version of the search engine narrows the search to the description field of each restaurant. Only restaurants whose descriptions contain all the query words will be returned.

#### 2.1.1 Create Your Index!
* Vocabulary File: Create a file called vocabulary.csv that maps each word to a unique integer (term_id).
* Inverted Index: Build a dictionary mapping each term_id to a list of document IDs where that term appears.
---

We have prepared a function named `get_vocab` in the file `functions.py` that creates a dictionary `vocabulary` with words as keys and IDs as values. It also creates a new file `vocabulary.csv` that maps each word to a unique integer (term_id).

In [7]:
vocabulary = get_vocab(data, export_to_csv=True)
vocab_csv = pd.read_csv('vocabulary.csv')
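
For readers without access to `functions.py`, a vocabulary builder along these lines might look like the sketch below. It is an assumption-laden illustration, not the actual `get_vocab`: ids are handed out in order of first appearance, and the CSV uses the column names `id` and `term` that the later cells rely on. The function names are hypothetical.

```python
import csv
import io


def build_vocab_sketch(tokenised_docs):
    """Map each distinct term to a consecutive integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenised_docs:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab


def vocab_to_csv(vocab, handle):
    """Write the vocabulary as rows of (id, term) to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["id", "term"])
    for term, term_id in vocab.items():
        writer.writerow([term_id, term])


docs = [["rustic", "restaur", "contemporari"], ["situat", "restaur"]]
vocab = build_vocab_sketch(docs)

buffer = io.StringIO()  # stands in for open("vocabulary.csv", "w", newline="")
vocab_to_csv(vocab, buffer)
print(buffer.getvalue())
```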

Now, we will move on to creating an **Inverted Index**, which will be a dictionary mapping each term_id to a list of document IDs where that term appears.

In [8]:
terms = vocab_csv.drop(columns=['id'])
# print the first 3 terms of the vocabulary for demonstration
terms.head(3).term.to_list()

['rustic', 'restaur', 'contemporari']

In [9]:
# create a new column in the terms dataframe that contains the document ids of the documents that contain the term
# this is a smart approach demonstrated by the professor Chatzigeorgiou
terms['document_id'] = terms.term.progress_apply(lambda term: list(data.loc[data.description_clean.apply(lambda row: term in row)].index))
terms.loc[:5]

100%|██████████| 7534/7534 [00:10<00:00, 699.27it/s]


Unnamed: 0,term,document_id
0,rustic,"[0, 4, 9, 17, 30, 34, 43, 44, 48, 55, 157, 165..."
1,restaur,"[0, 1, 2, 4, 5, 6, 7, 8, 10, 11, 13, 15, 16, 1..."
2,contemporari,"[0, 10, 16, 19, 36, 38, 47, 66, 69, 71, 72, 76..."
3,decor,"[0, 4, 11, 14, 23, 24, 30, 39, 41, 42, 48, 59,..."
4,surround,"[0, 10, 14, 26, 35, 61, 146, 159, 168, 174, 18..."
5,greeneri,"[0, 130, 167, 198, 411, 451, 520, 917, 962, 11..."


Our result closely resembles the standard layout, as shown in the picture below:

![Introduction to information Retrieval Book](images/inverted_index.png "Inverted Index")

Ref: Introduction to Information Retrieval book

Now, we use the terms dataframe to create an inverted index. The inverted index is a dictionary where the keys are the terms and the values are the lists of ids of the documents that contain the term.

In [10]:
inverted_index = dict(zip(terms.term, terms.document_id))
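
The DataFrame approach above checks every term against every document. An alternative is to build the same dictionary in a single pass over the documents. The sketch below shows that idea; `build_inverted_index_sketch` is a hypothetical name, and it assumes document ids are the positions 0, 1, 2, ... of the tokenised descriptions, as is the case for a default `RangeIndex`.

```python
from collections import defaultdict


def build_inverted_index_sketch(tokenised_docs):
    """Return {term: [doc_id, ...]} with each posting list in increasing doc_id order."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tokenised_docs):
        for term in set(tokens):  # list each document once per term
            index[term].append(doc_id)
    return dict(index)


toy_docs = [
    ["rustic", "restaur"],
    ["restaur", "valley"],
    ["rustic", "rustic", "sea"],
]
toy_index = build_inverted_index_sketch(toy_docs)
print(toy_index["rustic"])
```

To key the index by term_id, as the assignment text describes, the term can be looked up in the vocabulary dictionary before it is used as a key.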

#### 2.1.2 Execute the Query
When the user inputs a query, for example, "modern seasonal cuisine", the search engine will:

* Process the query terms.
* Find and return a list of restaurants containing all query words in their description.

The output should include the following columns: restaurantName, address, description, website

---

To do this, we will use the function `search_engine` defined in the `functions.py` file.

`search_engine` function description:

**Steps** (example query: "modern seasonal cuisine"):
1. Preprocess the query.
2. Intersect the lists of document_ids in which the query terms appear, as follows:
   * Sort the query terms by the length of their lists of document_ids, then intersect the shortest list with the second shortest.
   * Continue intersecting with the next shortest list until all lists are processed.
3. What is left after the intersection is the list of document_ids to be returned.
4. Use the returned list of document_ids to build a pandas DataFrame.
5. Return the DataFrame.

**Args:**
- `query` (str): The query to search for.
- `inverted_index` (dict): The inverted index to use for the search.
- `dataset` (pandas.DataFrame): The dataset to search in.
- `columns_for_dataset` (list): The columns in the dataset to include in the results.

**Returns:**
- `pandas.DataFrame`: The search results as a pandas DataFrame.
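
The intersection step described above can be sketched in a few lines. This is an illustration of the shortest-list-first strategy rather than the code of `search_engine`; the function name and the toy index (with made-up posting lists) are hypothetical, and the query is assumed to be preprocessed already.

```python
def conjunctive_search_sketch(query_terms, inverted_index):
    """Return the sorted ids of documents that contain every query term."""
    postings = []
    for term in set(query_terms):
        if term not in inverted_index:
            return []  # a term missing from the index means no document can match
        postings.append(inverted_index[term])
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list to keep intermediate results small
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
        if not result:
            break  # nothing left to intersect
    return sorted(result)


# toy index with stemmed terms and made-up posting lists
toy_postings = {
    "modern": [0, 2, 5, 7],
    "season": [2, 5],
    "cuisin": [1, 2, 5, 9],
}
print(conjunctive_search_sketch(["modern", "season", "cuisin"], toy_postings))
```

The resulting ids can then be used to select rows, for example with `data.loc[doc_ids, columns_for_dataset]`.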


In [11]:
# Read the query from the user (it is preprocessed inside search_engine)
# query = "modern seasonal cuisine"
query = input("Enter your query: ")
columns_for_dataset = ["restaurantName", "address", "description", "website"]
results = search_engine(query=query, inverted_index=inverted_index, dataset=data, columns_for_dataset=columns_for_dataset)

In [13]:
results

Unnamed: 0,restaurantName,address,description,website
52,Winter Garden Florence,"piazza Ognissanti 1, Florence, 50123, Italy",Horse-drawn carriages once entered the old cou...,https://www.wintergardenflorence.com/it/
79,San Michele,"via Castello di Fagagna 33, Fagagna, 33034, Italy",Situated next to the ruins of the old castle a...,http://sanmichele.restaurant
102,Mima,"via Madonnelle 9, Vico Equense, 80069, Italy",You’ll be won over by the seasonal Mediterrane...,http://www.domo20.com/restaurant
256,Pipero Roma,"corso Vittorio Emanuele II 250, Rome, 00186, I...",Situated opposite the church of Santa Maria in...,https://www.piperoroma.it/
330,Chichibio,"via Guglielmo Marconi 1, Roccaraso, 67037, Italy","Despite its lack of awards, this restaurant st...",
455,La Bandiera,"contrada Pastini 4, Civitella Casanova, 65010,...",Although it takes a while to reach this restau...,https://www.labandiera.it/
514,Osteria Ophis,"corso Serpente Aureo 54/b, Offida, 63073, Italy",Situated in the beautiful historic centre of O...,https://www.osteriaophis.com/
570,Sintesi,"viale dei Castani 17, Ariccia, 00072, Italy","A modern, welcoming restaurant whose motto “Tr...",http://ristorantesintesi.it
600,Secondo Tempo,"via Vittorio Amedeo 55, Termini Imerese, 90018...",Situated on the first floor of a building (for...,http://www.ristorantesecondotempo.it
636,Piccolo Lord,"corso San Maurizio 69 bis/g, Turin, 10124, Italy","Professional service in a welcoming, modern re...",https://www.ristorantepiccololord.it/


### 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

For the second search engine, given a query, retrieve the top-k restaurants ranked by relevance to the query.

#### 2.2.1 Inverted Index with TF-IDF Scores
* TfIdf Scores: Calculate TF-IDF scores for each term in each restaurant’s description.
* Updated Inverted Index: Build a new inverted index where each entry is a term, and the value is a list of tuples containing document IDs and TF-IDF scores.

---


We will start by building an inverted index dictionary which contains the terms as keys and lists of tuples as values. Each tuple contains the index of a document where the term appears and the corresponding TF-IDF score.

The `inverted_index_tfidf` function (defined in `functions.py`) will help us in this operation. It calculates the Term Frequency-Inverse Document Frequency (TF-IDF) of each term in each document. It works as follows:

**Term Frequency (TF):**
The TF of a term is the number of times the term appears in a document divided by the total number of words in that document.

**Inverse Document Frequency (IDF):**
IDF is calculated as the number of documents in the corpus divided by the number of documents in the corpus that contain the term.

**Args:**
- `inverted_index` (dict): The inverted index containing terms and their corresponding document lists.
- `dataset` (pandas.DataFrame): The dataset containing the documents.

**Returns:**
- `dict`: A dictionary which contains the terms as keys and lists of tuples as values. Each tuple contains the index of a document where the term appears and the corresponding TF-IDF score.
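
A compact sketch of such an index is shown below. It is not the `inverted_index_tfidf` function itself, and it takes the tokenised documents directly instead of an existing inverted index. TF follows the definition above. For IDF, the text describes the plain ratio of corpus size to document frequency; many formulations take the logarithm of that ratio, so the sketch uses the logarithm by default (an assumption) and keeps the plain ratio available through the hypothetical `use_log` flag.

```python
import math
from collections import Counter


def tfidf_index_sketch(tokenised_docs, use_log=True):
    """Return {term: [(doc_id, tfidf), ...]} for a list of tokenised documents."""
    n_docs = len(tokenised_docs)
    counts = [Counter(tokens) for tokens in tokenised_docs]
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for c in counts for term in c)

    index = {}
    for doc_id, (tokens, c) in enumerate(zip(tokenised_docs, counts)):
        for term, count in c.items():
            tf = count / len(tokens)
            ratio = n_docs / doc_freq[term]
            idf = math.log(ratio) if use_log else ratio
            index.setdefault(term, []).append((doc_id, tf * idf))
    return index


toy_corpus = [["decor", "rustic", "decor"], ["rustic", "sea"]]
toy_tfidf = tfidf_index_sketch(toy_corpus)
print(toy_tfidf["decor"])
```

With the logarithm, a term that occurs in every document gets an IDF of zero and therefore never contributes to a ranking, which is the usual motivation for that variant.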


In [14]:
inverted_tf_idf_index = inverted_index_tfidf(inverted_index, data)

Let's inspect the TF-IDF inverted index. We will use a word that is in the vocabulary, "decor", and print the first 5 documents that contain the term alongside their TF-IDF scores.

In [15]:
inverted_tf_idf_index["decor"][:5]

[(0, 0.09), (4, 0.04), (11, 0.05), (14, 0.01), (23, 0.07)]

#### 2.2.2 Execute the Ranked Query

For the ranked search engine:
* Process the query terms.
* Use Cosine Similarity to rank matching restaurants based on the TF-IDF vectors of the query and each document.
* Return the top-k results or all matching restaurants if fewer than k have non-zero similarity.

Each result should include: restaurantName, address, description, website, and the similarity score (between 0 and 1).

---

To do this, we will use the function `search_engine_tfidf`, which, together with its helper functions, is defined in the `functions.py` file.

`search_engine_tfidf` function description:

A search engine using the TF-IDF and cosine similarity scores for ranking documents.

**Steps:**
1. Preprocess the query using the function `preprocess`.
2. Calculate the TF-IDF scores for the query terms using `calculate_query_tfidf`.
3. Compute the cosine similarity between the query and each document in the dataset using `cosine_similarity`.
4. Rank the documents based on their cosine similarity scores.
5. Use the document IDs to create a pandas DataFrame with the specified columns.
6. Return the DataFrame.
7. The caller then retrieves the top `k` documents from the ranked DataFrame using `.head(k)`.

**Args:**
- `inverted_tf_idf_index` (dict): The inverted index with TF-IDF scores.
- `query` (str): The query to search for.
- `inverted_index` (dict): The inverted index of the dataset.
- `data` (pandas.DataFrame): The dataset to search in.
- `columns_for_dataset` (list): The columns in the dataset to include in the results.

**Returns:**
- `pandas.DataFrame`: The search results as a pandas DataFrame.
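
The ranking step can be illustrated with a small, self-contained sketch. It is not the code of `search_engine_tfidf` or its helpers: the function names are hypothetical, the query is weighted with a log-based IDF (an assumption), the toy index uses made-up weights, and each document norm is computed over all of its terms so that the score is a true cosine between 0 and 1.

```python
import math
from collections import Counter, defaultdict


def doc_norms_from_index(tfidf_index):
    """Euclidean norm of every document vector, computed from the inverted index."""
    squares = defaultdict(float)
    for postings in tfidf_index.values():
        for doc_id, weight in postings:
            squares[doc_id] += weight * weight
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}


def rank_by_cosine_sketch(query_terms, tfidf_index, n_docs, k=10):
    """Return up to k (doc_id, cosine) pairs, best first; only non-zero scores are kept."""
    counts = Counter(t for t in query_terms if t in tfidf_index)
    if not counts:
        return []
    total = sum(counts.values())
    # query weights: tf in the query times idf taken from the collection
    query_vec = {
        t: (c / total) * math.log(n_docs / len(tfidf_index[t]))
        for t, c in counts.items()
    }
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    norms = doc_norms_from_index(tfidf_index)

    dots = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in tfidf_index[term]:
            dots[doc_id] += q_weight * d_weight

    scores = []
    for doc_id, dot in dots.items():
        denom = q_norm * norms[doc_id]
        if denom > 0 and dot > 0:
            scores.append((dot / denom, doc_id))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scores[:k]]


# toy TF-IDF index for 3 documents (weights are made up)
toy_weights = {
    "modern": [(0, 0.5), (1, 0.2)],
    "terrac": [(1, 0.4)],
    "sea": [(2, 0.7)],
}
print(rank_by_cosine_sketch(["modern", "terrac"], toy_weights, n_docs=3))
```

In the toy example document 1 ranks first because it matches both query terms, document 0 follows, and document 2 is absent because its similarity is zero.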

In [16]:
# query = "modern seasonal cuisine"
query = input("Enter your query: ")
search_results_tf_idf = search_engine_tfidf(inverted_tf_idf_index, query, inverted_index, data, columns_for_dataset)

Having executed the query, we can finally list the top k results, where k for now is 10. The results are ordered in descending order with respect to cosine similarity scores.

In [17]:
# define k = 10
k = 10
search_results_tf_idf.head(k)

Unnamed: 0,restaurantName,address,description,website,cosine_similarity
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",A rustic restaurant with contemporary Mediterr...,https://www.locandaradici.it/,0.999979
1,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Patience is needed to get to this restaurant f...,https://www.baccofurore.it/,0.999979
2,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...","Fish plays the starring role here, where the a...",,0.999979
3,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Borgo San Giuliano is situated just a stroll f...,https://www.osteriadeborg.it/,0.999979
4,Trequarti,"piazza del Donatore 3/4, Val Liona, 36044, Italy",This modern and minimalist-style restaurant wi...,https://www.ristorantetrequarti.com/,0.999979
5,Il Buco,"II rampa Marina Piccola 5, Sorrento, 80067, Italy","Situated in the centre of Sorrento, this resta...",https://www.ilbucoristorante.it/,0.999883
6,San Martino,"piazza Cappelletto 1, località Rio San Martino...","A smart, elegant restaurant with a bright, min...",https://www.ristorantesanmartino.info/,0.999348
7,St. George by Heinz Beck,"viale San Pancrazio 46, Taormina, 98039, Italy",One of the most desirable and exclusive locati...,https://www.theashbeehotel.it/menu-st-george-r...,0.998867
8,Grand Hotel Excelsior Vittoria,"Piazza Tasso 34, Sorrento, Italy",The Grand Hotel Excelsior Vittoria benefits fr...,,0.998867
9,Al Volt,"via Fiume 73, Riva del Garda, 38066, Italy",As you explore the narrow streets which lead f...,http://www.ristorantealvolt.com,0.998867


# 3. Define a New Score!

Now, we will define a custom ranking metric to prioritize restaurants based on user queries.

Steps:
* User Query: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.
* New Ranking Metric: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like priceRange, facilitiesServices, and cuisineType.
* You will use a heap data structure (e.g., Python’s heapq library) to maintain the top-k restaurants.  


**New Scoring Function**:
Define a scoring function that takes into account various attributes:
* Description Match: Give weight based on the query similarity to the description (using TF-IDF scores).
* Cuisine Match: Increase the score for matching cuisine types.
* Facilities and Services: Give more points for matching facilities/services (e.g., “Terrace,” “Air conditioning”).
* Price Range: Higher scores could be given to more affordable options based on the user’s choice.

**Output**: 
The output should include: restaurantName, address, description, website, The new similarity score based on the custom metric.

Are the results you obtain better than with the previous scoring function? **Explain and compare results.**

--- 

### Enhanced Ranking Criteria for Top-End Restaurants

Instead of limiting the scoring to only the cosine similarity scores derived from the `description` field, we can include other attributes like `priceRange`, `facilitiesServices`, and `cuisineType`. Our proposal for a more domain-specific ranking, the domain here being top-end restaurants, is to design a ranking criterion as follows:

Given the base cosine similarity scores between the query and documents, we calculate the final score as follows:
* If the query contains a term that is in the `cuisineType` of the document, we enhance the score by **50%**. This is because we believe that the cuisine type is a very important factor in the relevance of the document specific to our domain of listing results for the *top-end restaurants*.
* If the query contains a term that is in the `facilitiesServices` of the document, we enhance the score by **30%**. This is because we believe that the facilities and services of the restaurant are important factors in the relevance of the document. However, we give them a lower weight than the cuisine type, which we consider the stronger signal of relevance.
* Lastly, if the query contains a term that is a synonym for "low-cost", we enhance the score by **20%**. This is because we believe that some customers might be looking for low-cost restaurants (but not everyone, as we are looking at Michelin restaurants). And because our domain is top-end restaurants, it is coherent not to penalise the score if the price is high!

After retrieving relevant documents, we’ll rank them using this new custom score.

*Note: we define the list of synonyms for **low-cost** ourselves.*
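
A possible shape for this score, together with the heap-based top-k selection the task asks for, is sketched below. Several choices are assumptions rather than part of the description above: the boosts are added into one multiplier, the low-cost boost is applied only to restaurants in the `€` or `€€` brackets, the result is divided by the maximum multiplier (2.0) so that it stays between 0 and 1, and the synonym list is an illustrative one. The base scores in the example are made up.

```python
import heapq

# illustrative synonym list for "low-cost" (our own choice)
LOW_COST_SYNONYMS = {"cheap", "affordable", "budget", "inexpensive", "reasonable"}
AFFORDABLE = {"€", "€€"}  # assumed meaning of "low-cost" in price brackets


def custom_score(base, query_terms, cuisine, facilities, price_range):
    """Boost a base cosine score by cuisine (+50%), facilities (+30%) and price (+20%)."""
    terms = {t.lower() for t in query_terms}
    cuisine_words = set(cuisine.lower().replace(",", " ").split())
    facility_words = set(facilities.lower().replace(";", " ").split())
    boost = 1.0
    if terms & cuisine_words:
        boost += 0.5
    if terms & facility_words:
        boost += 0.3
    if terms & LOW_COST_SYNONYMS and price_range in AFFORDABLE:
        boost += 0.2
    return base * boost / 2.0  # 2.0 is the largest possible multiplier


def top_k(scored_docs, k):
    """Keep the k best (doc_id, score) pairs using a min-heap of size at most k."""
    heap = []
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict the current worst
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


query_terms = ["reasonable", "priced", "italian", "cuisine", "terrace"]
# (doc_id, base score, cuisineType, facilitiesServices, priceRange)
candidates = [
    (1, 0.93, "Italian", "Air conditioning; Wheelchair access", "€€€"),
    (3, 0.93, "Seafood, Seasonal Cuisine", "Terrace", "€€"),
    (4, 0.92, "Cuisine from Romagna, Traditional Cuisine", "Terrace", "€"),
]
scored = [(doc_id, custom_score(base, query_terms, c, f, p)) for doc_id, base, c, f, p in candidates]
print(top_k(scored, k=2))
```

Note that a generic query word such as "cuisine" also triggers the cuisine boost in this sketch, so a more careful version might ignore such words when matching. The heap never grows beyond k entries, which keeps memory use constant however many candidates are scored.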

---

First, let's query for a restaurant with a terrace and reasonable prices. As the question requires, we use our basic Boolean conjunctive search engine to run the query.

In [18]:
# query = "reasonable priced italian cuisine with terrace"
query = input("Enter your query: ")
results = search_engine(query=query, inverted_index=inverted_index, dataset=data, columns_for_dataset=columns_for_dataset)
results.head()

Unnamed: 0,restaurantName,address,description,website


**BAM!** We get no results! This is expected because the query is very specific and the Boolean search is *conjunctive*: it yields results only if *all* the query terms are present in the description. In our case, no restaurant description contains every term of the query.

So, although the question expects us to use the results of the search engine built in Section 2.1, that is not workable for this query. Instead, we take the *search results (doc_ids)* from the search engine based on TF-IDF and cosine similarity as the baseline, and compare them with the results of our new engine with its customised scoring function.
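
To illustrate why a conjunctive query can come back empty, here is a minimal sketch on a toy inverted index. The index contents and the helper name `conjunctive_search` are hypothetical; the real `search_engine` in our modules works on the full vocabulary and dataset.

```python
def conjunctive_search(query_terms, inverted_index):
    """Return the doc ids that contain every query term (set intersection)."""
    postings = [inverted_index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy index: term -> set of document ids containing the term
toy_index = {
    "terrace": {0, 2},
    "italian": {1, 2},
    "cuisine": {0, 1, 2},
}

print(conjunctive_search(["italian", "terrace"], toy_index))                # {2}
print(conjunctive_search(["italian", "terrace", "reasonable"], toy_index))  # set()
```

A single term that is missing from the index (here `reasonable`) contributes an empty posting set, and the intersection of all postings is then empty no matter how well the other terms match.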

In [19]:
# Using our search engine based on the TF-IDF algorithm and cosine similarity to find the most relevant restaurants. 
# The results will later be used for comparison with the results of our new scoring engine.
search_results_tf_idf = search_engine_tfidf(inverted_tf_idf_index, query, inverted_index, data, columns_for_dataset)
search_results_tf_idf.head(k)

Unnamed: 0,restaurantName,address,description,website,cosine_similarity
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",A rustic restaurant with contemporary Mediterr...,https://www.locandaradici.it/,0.935413
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...","Situated in the Imagna valley, this welcoming,...",https://www.frosioristoranti.it,0.93446
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Patience is needed to get to this restaurant f...,https://www.baccofurore.it/,0.934069
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...","Fish plays the starring role here, where the a...",,0.931451
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Borgo San Giuliano is situated just a stroll f...,https://www.osteriadeborg.it/,0.925255
5,Terrazza Bosquet,"piazza Tasso 34, Sorrento, 80067, Italy","In the heart of Sorrento, a delightful garden ...",https://terrazzabosquet.exvitt.it,0.925019
6,Trequarti,"piazza del Donatore 3/4, Val Liona, 36044, Italy",This modern and minimalist-style restaurant wi...,https://www.ristorantetrequarti.com/,0.924176
7,Il Buco,"II rampa Marina Piccola 5, Sorrento, 80067, Italy","Situated in the centre of Sorrento, this resta...",https://www.ilbucoristorante.it/,0.860109
8,Osteria L'Orciaia,"via Capitan Goro 10, Montebenichi, 52021, Italy",A small rustic - style hostelry set inside a 1...,https://www.osterialorciaia.it/,0.860097
9,The Ashbee Hotel,"Viale San Pancrazio n.46, Taormina, Italy",Built steps away from the ruins of an ancient ...,,0.860097


With the TF-IDF search results in hand, we can now re-rank the restaurants with our custom scoring function, `my_engine`, defined in `functions.py`. Its output is stored in `engine_results`.

`my_engine` function description:

A search engine using our own defined scores for ranking documents.

**Steps:**
1. Load the scores from search_results_tf_idf engine.
2. Apply additional score adjustments:
   - Enhance the score by **50%** if the query contains a term in the `cuisineType` of the document.
   - Enhance the score by **30%** if the query contains a term in the `facilitiesServices` of the document.
   - Enhance the score by **20%** if the query contains a term that is a synonym for "low cost".
3. Use `nlargest` to get the top `k` documents by score.
4. Extract the document IDs for the top `k` documents.
5. Prepare the search results DataFrame with the specified columns and the adjusted similarity scores.
6. Return the search results DataFrame.

**Args:**
- `query` (str): The search query.
- `search_results_tf_idf` (pandas.DataFrame): Search results from the TF-IDF search engine.
- `data` (pandas.DataFrame): The dataset.
- `synonyms_for_low_price` (set): Synonyms for low price.
- `columns_for_dataset` (list): Columns to include in the search results.
- `facilities` (set): All facilities and services.
- `cuisine_types` (set): All cuisine types.
- `k` (int): Number of search results to return.

**Returns:**
- `pandas.DataFrame`: The search results.
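
Steps 3 to 5 of the description above can be sketched as follows. This is an illustration under assumptions, not the code in `functions.py`: `top_k_results` is a hypothetical helper, `scores` is assumed to be a dict mapping doc ids (row labels of `data`) to adjusted scores, and the toy scores are made up.

```python
import heapq

import pandas as pd


def top_k_results(scores, data, columns, k):
    """Pick the k highest-scoring doc ids and build the results DataFrame."""
    # nlargest returns (doc_id, score) pairs sorted by descending score
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    doc_ids = [doc_id for doc_id, _ in top]
    results = data.loc[doc_ids, columns].copy()
    # Scores are attached in the same order as the selected rows
    results["similarity_score"] = [score for _, score in top]
    return results.reset_index(drop=True)


toy_data = pd.DataFrame({
    "restaurantName": ["A", "B", "C", "D"],
    "priceRange": ["€", "€€", "€€€", "€€"],
})
toy_scores = {0: 0.4, 1: 1.8, 2: 0.9, 3: 1.2}  # made-up adjusted scores

top2 = top_k_results(toy_scores, toy_data, ["restaurantName"], k=2)
print(top2)
```

Keeping the doc ids and the scores in the same list of pairs until the DataFrame is built is a simple way to make sure each score stays attached to the restaurant it was computed for.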

Finally, we can deploy our final search engine. Before that, we have to:
1. Extract the facilities and cuisine types from the data using the `get_all_facilities` and `get_all_cuisine_types` functions defined in the `functions.py` file. These functions simply collect the distinct facilities and cuisine types in the entire dataset.
2. Initialise a thoughtful synonym list for the term *low-price*.
3. Use the `my_engine` function to get the results of our new scoring engine.
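
As a rough idea of what collecting the distinct values involves, here is a minimal sketch. It is not the code of `get_all_facilities` or `get_all_cuisine_types`; the helper name `collect_distinct` and the assumption that values are separated by `;` or `,` (as they appear in our dataset preview) are ours.

```python
import re

import pandas as pd


def collect_distinct(series, separators=r"[;,]"):
    """Split every cell on the separators and return the distinct cleaned values."""
    values = set()
    for cell in series.dropna():
        for part in re.split(separators, str(cell)):
            part = part.strip().lower()
            if part:
                values.add(part)
    return values


toy = pd.DataFrame({
    "facilitiesServices": ["Air conditioning; Wheelchair access", "Terrace", None],
    "cuisineType": ["Seafood, Seasonal Cuisine", "Italian", "Italian"],
})

toy_facilities = collect_distinct(toy["facilitiesServices"])
toy_cuisines = collect_distinct(toy["cuisineType"])
print(sorted(toy_facilities))
print(sorted(toy_cuisines))
```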

In [20]:
# get all the facilities and cuisine types from the dataset
facilities = {facility.strip().lower() for facility in get_all_facilities(data) if facility.strip()}
cuisine_types = {cuisine.strip().lower() for cuisine in get_all_cuisine_types(data) if cuisine.strip()}
# define the synonyms for low price
synonyms_for_low_price = ['cheap', 'affordable', 'budget', 'inexpensive', 'reasonable', 'low-cost', 'economical', 'cost-effective', 'low-priced', 'wallet-friendly', 'bargain', 'value for money', 'discounted', 'economy', 'thrifty', 'pocket-friendly', 'modestly priced', 'fair-priced', 'money-saving', 'cut-rate', 'low-budget', 'accessible']
# fetch the results of the search engine
engine_results = my_engine(query, search_results_tf_idf, data, synonyms_for_low_price, columns_for_dataset, facilities, cuisine_types, k=10)

In [21]:
engine_results

Unnamed: 0,restaurantName,address,description,website,similarity_score
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",A rustic restaurant with contemporary Mediterr...,https://www.locandaradici.it/,1.824055
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...","Situated in the Imagna valley, this welcoming,...",https://www.frosioristoranti.it,1.822196
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Patience is needed to get to this restaurant f...,https://www.baccofurore.it/,1.821434
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...","Fish plays the starring role here, where the a...",,1.816329
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Borgo San Giuliano is situated just a stroll f...,https://www.osteriadeborg.it/,1.804246
5,Onda Blu,"via Orsa Minore 1, San Mauro a Mare, 47030, Italy",Almost appearing to rise up directly out of th...,https://www.ristoranteondablu.com/,1.803788
6,Terrazza Bosquet,"piazza Tasso 34, Sorrento, 80067, Italy","In the heart of Sorrento, a delightful garden ...",https://terrazzabosquet.exvitt.it,1.802144
7,Trequarti,"piazza del Donatore 3/4, Val Liona, 36044, Italy",This modern and minimalist-style restaurant wi...,https://www.ristorantetrequarti.com/,1.677213
8,Il Buco,"II rampa Marina Piccola 5, Sorrento, 80067, Italy","Situated in the centre of Sorrento, this resta...",https://www.ilbucoristorante.it/,1.677189
9,Osteria L'Orciaia,"via Capitan Goro 10, Montebenichi, 52021, Italy",A small rustic - style hostelry set inside a 1...,https://www.osterialorciaia.it/,1.677189


---

#### Are the results you obtain better than with the previous (cosine similarity based) scoring function? Explain and compare results.

---

By design, the new scoring function is better suited to our domain than the plain cosine similarity score because it takes more factors into account:
1. The cosine similarity between the query and the description of the restaurant.
2. The presence of query terms in the restaurant's cuisine type.
3. The presence of query terms in the restaurant's facilities and services.
4. Whether the query asks for a low-cost restaurant (detected through our list of synonyms).

Comparing the two result tables for this query, however, the ranking changes very little. The first five restaurants appear in the same order, Onda Blu enters the list at position 5, and The Ashbee Hotel drops out. Every custom score is about 1.95 times a cosine similarity from the TF-IDF table (1.5 × 1.3 = 1.95), which suggests that the cuisine and facilities boosts were applied to all ten documents. For this query, the new function therefore rescales the scores more than it reorders them.
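
To compare the two rankings programmatically, one option is a small helper like the one below. The two lists are copied from the result tables above; `compare_rankings` is a hypothetical name introduced only for this sketch.

```python
def compare_rankings(before, after):
    """Report which names entered or left the list and how shared names moved."""
    before_set, after_set = set(before), set(after)
    return {
        "entered": [name for name in after if name not in before_set],
        "dropped": [name for name in before if name not in after_set],
        # positive value = moved up in the new ranking, negative = moved down
        "rank_change": {
            name: before.index(name) - after.index(name)
            for name in after if name in before_set
        },
    }


tfidf_top10 = [
    "Locanda Radici", "Posta", "Hostaria Baccofurore", "Nni Lausta",
    "Osteria de Börg", "Terrazza Bosquet", "Trequarti", "Il Buco",
    "Osteria L'Orciaia", "The Ashbee Hotel",
]
custom_top10 = [
    "Locanda Radici", "Posta", "Hostaria Baccofurore", "Nni Lausta",
    "Osteria de Börg", "Onda Blu", "Terrazza Bosquet", "Trequarti",
    "Il Buco", "Osteria L'Orciaia",
]

diff = compare_rankings(tfidf_top10, custom_top10)
print(diff["entered"], diff["dropped"])
```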


# 4. Visualizing the Most Relevant Restaurants

Maps can provide users with an easy way to see where restaurants are located. This is especially useful for understanding which regions in Italy have more options.

Steps for Visualization:
1. Geocode Locations: Collect information on unique restaurant locations in Italy (in the format of City and Region). You can use tools such as Google API, OpenStreetMap, or a pre-defined list to retrieve representative coordinates for each region.
2. Ask a Large Language Model (LLM): Alternatively, you can compile a list of unique cities and regions in Italy, formatted as (City, Region), and ask an LLM (e.g., ChatGPT) to provide coordinates for these locations. This can be an efficient way to gather data without using API calls. Just make sure that the retrieved information is correct and helpful.
3. Map Setup: Use a mapping library like plotly or folium to create a visual display of restaurants by region.
4. Encoding Price Ranges: Incorporate a visual representation for price ranges:
    * Use color-coding or marker size to represent the restaurant's price range (€, €€, €€€, €€€€).
    * Include a legend for interpreting price levels.
5. Plot Top-K Restaurants: Use the custom score from Step 3 to select the top-k restaurants for display.
This map will give users an overview of restaurant options across different regions in Italy, with an indication of cost based on visual cues.

---

First, we will extract the cities from the addresses of the restaurants using the `get_city` function defined in the `functions.py` file. 
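
For illustration, a city extractor along these lines could look as follows. This is a sketch and not the `get_city` code from `functions.py`: the name `extract_city` is hypothetical, and it assumes addresses of the form `street, city, postal code, Italy`, where the postal code may be missing.

```python
def extract_city(address):
    """Return the city part of a comma-separated address, or None if empty."""
    parts = [p.strip() for p in str(address).split(",") if p.strip()]
    # Drop the trailing country name, if present (assumes Italian addresses)
    if parts and parts[-1].lower() == "italy":
        parts = parts[:-1]
    # Drop the postal code, if present
    if parts and parts[-1].replace(" ", "").isdigit():
        parts = parts[:-1]
    return parts[-1] if parts else None


print(extract_city("via Forzieri 12, Rimini, 47921, Italy"))      # Rimini
print(extract_city("Viale San Pancrazio n.46, Taormina, Italy"))  # Taormina
```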


In [22]:
# Read the data from the csv file
data = pd.read_csv('michelin_restaurant_data.csv')
# Make a column for the city of the restaurant
data["city"] = data.address.apply(get_city)
# Get the unique cities in the dataset
unique_cities = list(data.city.unique())
print(f"Number of Unique cities in the dataset: {len(unique_cities)}")

Number of Unique cities in the dataset: 1160


##### Section 4.1:
* Geocode Locations: Collect information on unique restaurant locations in Italy (in the format of City and Region). You can use tools such as Google API, OpenStreetMap, or a pre-defined list to retrieve representative coordinates for each region.

We will use the unique cities extracted above to get their corresponding *regions* and the *latitude* and *longitude* of each region, as the exercise requires.

We use two functions defined in the `functions.py` file to get the regions with their coordinates and, in addition, the latitude and longitude of each individual restaurant address in the dataset:

* `get_region_and_coordinates`: takes a city and an API key and returns the region together with the latitude and longitude of that region.
* `get_coordinates`: takes a restaurant address and an API key and returns the latitude and longitude of that address.

Both functions are documented in the `functions.py` file.
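
The parsing half of such a geocoding helper might look like the sketch below. It assumes a response in the JSON format of the Google Geocoding API, which may not be the service our functions actually call. The name `parse_geocode_response` is hypothetical, and the sample payload is hand-written and trimmed for illustration rather than a real response.

```python
def parse_geocode_response(payload):
    """Extract (region, lat, lng) from a Geocoding-style JSON payload."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None, None, None
    first = payload["results"][0]
    location = first["geometry"]["location"]
    region = None
    for component in first.get("address_components", []):
        # Italian regions are reported as first-level administrative areas
        if "administrative_area_level_1" in component.get("types", []):
            region = component["long_name"]
            break
    return region, location["lat"], location["lng"]


# Hand-written, trimmed sample payload (illustrative values)
sample_payload = {
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "Rimini", "short_name": "Rimini",
             "types": ["locality", "political"]},
            {"long_name": "Emilia-Romagna", "short_name": "Emilia-Romagna",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "Italy", "short_name": "IT",
             "types": ["country", "political"]},
        ],
        "geometry": {"location": {"lat": 44.064796, "lng": 12.563341}},
    }],
}

print(parse_geocode_response(sample_payload))
print(parse_geocode_response({"status": "ZERO_RESULTS", "results": []}))
```

Returning a triple of `None` values on failure lets the caller decide whether to skip the row or write an empty placeholder.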



In [None]:
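API_KEY = 'YOUR API KEY'
# NOTE: both files are opened in append mode, so delete them before re-running
# this cell from scratch, otherwise rows will be duplicated.
with open('city_region_coordinates.csv', 'a') as f:
    for city in tqdm(unique_cities, desc="Fetching coordinates for regions"):
        region, lat, lng = get_region_and_coordinates(city, API_KEY)
        # keep only the cities that were geocoded successfully
        if region and lat and lng:
            f.write(f"{city},{region},{lat},{lng}\n")


with open('restaurant_coordinates.csv', 'a') as f:
    addresses = list(data.address)
    for address in tqdm(addresses, desc="Fetching coordinates for restaurants"):
        lat, lng = get_coordinates(address, API_KEY)
        # Always write one row per restaurant (empty fields on failure), so that row i
        # of this file still matches row i of `data` in the index-based merge later on.
        if lat and lng:
            f.write(f"{lat},{lng}\n")
        else:
            f.write(",\n")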

The coordinates of the regions and of the restaurants are now stored in two different files:
- `city_region_coordinates.csv`
- `restaurant_coordinates.csv`

We can now load them, do some preprocessing (like color mapping the regions and merging the two datasets), and visualize them on a map using the folium library.

In [23]:
# Read the data from the csv file containing the coordinates of the regions
region_data = pd.read_csv('city_region_coordinates.csv', names=['city', 'region', 'lat_region', 'lng_region'])

# count the distinct region labels returned by the geocoder; this determines how many colors we need for the map
num_regions = len(region_data['region'].unique())
print(f"Number of regions in Italy: {num_regions}")


# Assign colors to regions
colors = plt.get_cmap('tab20', num_regions)
# Create a dictionary that maps each region to a color
region_color_map = {region: colors(i) for i, region in enumerate(region_data['region'].unique())}
# Map the colors to the regions
region_data['color_region'] = region_data['region'].map(region_color_map)

# Convert RGBA to hex for folium
region_data['color_region'] = region_data['color_region'].apply(lambda x: '#%02x%02x%02x' % (int(x[0]*255), int(x[1]*255), int(x[2]*255)))

region_data.head()

Number of regions in Italy: 27


Unnamed: 0,city,region,lat_region,lng_region,color_region
0,Melizzano,Campania,41.112508,14.845462,#1f77b4
1,Sant'Omobono Terme,Lombardy,45.479067,9.845243,#1f77b4
2,Furore,Campania,41.112508,14.845462,#1f77b4
3,Santa Marina Salina,Sicily,37.39793,14.658782,#aec7e8
4,Rimini,Emilia-Romagna,44.596761,11.21864,#ff7f0e


The table above shows the first 5 rows of the region_data dataframe. The dataframe contains the city, region, latitude, longitude, and the color of each region. The color is used to color the regions in the map.
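
One detail in the table above is that Campania and Lombardy share the same colour (`#1f77b4`). `tab20` is a qualitative palette with 20 entries, so requesting 27 colours from it necessarily repeats some. A possible alternative, shown here only as a sketch, is to sample a continuous colormap at evenly spaced points. The helper name `distinct_region_colors` and the choice of `gist_rainbow` are assumptions for the example.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex


def distinct_region_colors(regions, cmap_name="gist_rainbow"):
    """Map each distinct region label to its own hex colour."""
    unique_regions = list(dict.fromkeys(regions))  # keep first-seen order
    cmap = plt.get_cmap(cmap_name)
    n = len(unique_regions)
    # i / n lies in [0, 1), so every region samples a different point of the colormap
    return {region: to_hex(cmap(i / n)) for i, region in enumerate(unique_regions)}


# 27 placeholder labels, matching the number of region labels reported above
labels = [f"region_{i}" for i in range(27)]
color_map = distinct_region_colors(labels)
print(len(color_map), len(set(color_map.values())))
```

Neighbouring colours in a continuous colormap can still be hard to tell apart by eye, so this trades the clear contrast of a qualitative palette for uniqueness.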

Next, we will load the data containing the coordinates of the restaurants. We will then merge the data with the region_data dataframe to get the region of each restaurant. This will be useful for using just a single dataframe to plot the regions and the restaurants on the map.



In [24]:
# load the data containing the coordinates of the restaurants
coordinates_data = pd.read_csv('restaurant_coordinates.csv', names=['address_lat', 'address_lng'])

# merge the data containing the restaurants information with the data containing the coordinates of the restaurants
dataset = pd.merge(data, coordinates_data, left_index=True, right_index=True)

# now, encode for price for each restaurant
price_colors = {
    '€': 'green',
    '€€': 'yellow',
    '€€€': 'orange',
    '€€€€': 'red'
}
# Assign colors to price ranges
dataset['color'] = dataset['priceRange'].map(price_colors)

# merge the dataset with the region data to make a unified dataset
dataset = pd.merge(dataset, region_data, left_index=True, right_index=True)
dataset.head()

Unnamed: 0,restaurantName,address,city_x,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website,address_lat,address_lng,color,city_y,region,lat_region,lng_region,color_region
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",Melizzano,82030.0,Italy,€€,"Modern Cuisine, Campanian",A rustic restaurant with contemporary Mediterr...,Air conditioning; Car park; Garden or park; Gr...,amex; mastercard; visa,+39 0824 944506,https://www.locandaradici.it/,41.234823,14.476782,yellow,Melizzano,Campania,41.112508,14.845462,#1f77b4
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...",Sant'Omobono Terme,24038.0,Italy,€€€,Italian,"Situated in the Imagna valley, this welcoming,...",Air conditioning; Wheelchair access,amex; dinersclub; mastercard; visa,+39 035 851134,https://www.frosioristoranti.it,45.814752,9.532718,orange,Sant'Omobono Terme,Lombardy,45.479067,9.845243,#1f77b4
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Furore,84010.0,Italy,€€,"Regional Cuisine, Farm to table",Patience is needed to get to this restaurant f...,Air conditioning; Car park; Garden or park; Gr...,amex; dinersclub; jcb; maestrocard; mastercard...,+39 089 830360,https://www.baccofurore.it/,40.618573,14.550961,yellow,Furore,Campania,41.112508,14.845462,#1f77b4
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...",Santa Marina Salina,98050.0,Italy,€€,"Seafood, Seasonal Cuisine","Fish plays the starring role here, where the a...",Terrace,amex; jcb; maestrocard; mastercard; visa,+39 090 984 3486,,38.559471,14.871372,yellow,Santa Marina Salina,Sicily,37.39793,14.658782,#aec7e8
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Rimini,47921.0,Italy,€,"Cuisine from Romagna, Traditional Cuisine",Borgo San Giuliano is situated just a stroll f...,Terrace,amex; maestrocard; mastercard; visa,+39 0541 56074,https://www.osteriadeborg.it/,44.064796,12.563341,green,Rimini,Emilia-Romagna,44.596761,11.21864,#ff7f0e
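
A note on the second merge: it joins `dataset` and `region_data` by row position, although `region_data` has one row per unique city while `dataset` has one row per restaurant. The two line up for the first rows shown above, but a positional join stops matching as soon as a city repeats. A key-based join on the city name is a more robust alternative. The toy frames below are made up to show the difference and are not part of our pipeline.

```python
import pandas as pd

restaurants = pd.DataFrame({
    "restaurantName": ["Terrazza Bosquet", "Il Buco", "Osteria de Börg"],
    "city": ["Sorrento", "Sorrento", "Rimini"],
})
regions = pd.DataFrame({
    "city": ["Sorrento", "Rimini"],
    "region": ["Campania", "Emilia-Romagna"],
})

# Positional join: row i is paired with row i, whatever the city is
positional = pd.merge(restaurants, regions, left_index=True, right_index=True)

# Key-based join: every restaurant gets the region of its own city
by_city = restaurants.merge(
    regions.drop_duplicates("city"),  # guard against repeated city rows
    on="city",
    how="left",                       # keep restaurants even without a region
    validate="many_to_one",           # fail loudly if a city maps to two rows
)

print(positional[["restaurantName", "city_x", "city_y"]])
print(by_city)
```

In the positional result, Il Buco (Sorrento) is paired with the Rimini row and the third restaurant is lost, while the key-based result keeps all three restaurants with the right region.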


Now that we have the dataset ready, we can plot the map using the `make_map` function which will create an interactive map using folium. The make_map function is defined in the functions.py file and is well commented for better understanding.

In [25]:
map_filename = 'michelin_restaurants_map.html'
make_map(dataset, map_filename)



Map saved to michelin_restaurants_map.html
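
A remark on the marker colours: as far as we know, `folium.Icon` supports only a fixed list of colour names for its `color` argument, and `yellow` (which we use for `€€`) is not among them. The sketch below shows one way to keep the price encoding within that list and to handle missing price ranges. The colour set is written from memory of the folium documentation, and the names `PRICE_TO_MARKER_COLOR` and `marker_color` are hypothetical.

```python
# Colour names accepted by folium.Icon (assumed list, see the lead-in)
FOLIUM_ICON_COLORS = {
    "red", "darkred", "lightred", "orange", "beige", "green", "darkgreen",
    "lightgreen", "blue", "darkblue", "cadetblue", "lightblue", "purple",
    "darkpurple", "pink", "white", "gray", "lightgray", "black",
}

# Same encoding as before, with 'beige' standing in for 'yellow'
PRICE_TO_MARKER_COLOR = {
    "€": "green",
    "€€": "beige",
    "€€€": "orange",
    "€€€€": "red",
}


def marker_color(price_range, default="gray"):
    """Return a supported marker colour, falling back to `default` if unknown."""
    return PRICE_TO_MARKER_COLOR.get(price_range, default)


print(marker_color("€€"), marker_color(None))
```

The result could then be used as the `color` value of each marker, for example through `dataset['priceRange'].apply(marker_color)`.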


---

5. Plot Top-K Restaurants: Use the custom score from Step 3 to select the top-k restaurants for display.
This map will give users an overview of restaurant options across different regions in Italy, with an indication of cost based on visual cues.

To perform task 5, we first load the results of the search engine based on our scoring criteria and merge them with the unified dataset built earlier. The merge produces some duplicate columns, which we drop. We then build a map with the `make_map` function showing the restaurants most relevant to the user's query, with restaurants colored by price range and regions colored by the region they belong to. The map is saved as an HTML file that can be opened in the browser.

In [26]:
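# get the top-k results of our custom scoring engine for the query
engine_results = my_engine(query, search_results_tf_idf, data, synonyms_for_low_price, columns_for_dataset, facilities, cuisine_types, k=10)

# attach coordinates, price colors and region information to the top-k results
top_k_data = pd.merge(engine_results, dataset, on=['restaurantName'], how='left', suffixes=('_x', '_y'))

# columns present in both frames are duplicated: keep the engine's copy (_x) and drop
# the dataset's copy (_y); this also drops `city_y` left over from the earlier merge
cols_to_drop = [col for col in top_k_data.columns if col.endswith('_y')]
top_k_data.drop(columns=cols_to_drop, inplace=True)
# strip the '_x' suffix from the remaining column names
cols_to_rename = {col: col.replace('_x', '') for col in top_k_data.columns}
top_k_data.rename(columns=cols_to_rename, inplace=True)

top_k_data.head(5)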

Unnamed: 0,restaurantName,address,description,website,similarity_score,city,postalCode,country,priceRange,cuisineType,facilitiesServices,creditCards,phoneNumber,address_lat,address_lng,color,region,lat_region,lng_region,color_region
0,Locanda Radici,"SP 21, contrada San Vincenzo, Melizzano, 82030...",A rustic restaurant with contemporary Mediterr...,https://www.locandaradici.it/,1.824055,Melizzano,82030.0,Italy,€€,"Modern Cuisine, Campanian",Air conditioning; Car park; Garden or park; Gr...,amex; mastercard; visa,+39 0824 944506,41.234823,14.476782,yellow,Campania,41.112508,14.845462,#1f77b4
1,Posta,"viale Vittorio Veneto 169, Sant'Omobono Terme,...","Situated in the Imagna valley, this welcoming,...",https://www.frosioristoranti.it,1.822196,Sant'Omobono Terme,24038.0,Italy,€€€,Italian,Air conditioning; Wheelchair access,amex; dinersclub; mastercard; visa,+39 035 851134,45.814752,9.532718,orange,Lombardy,45.479067,9.845243,#1f77b4
2,Hostaria Baccofurore,"via G.B. Lama 9, Furore, 84010, Italy",Patience is needed to get to this restaurant f...,https://www.baccofurore.it/,1.821434,Furore,84010.0,Italy,€€,"Regional Cuisine, Farm to table",Air conditioning; Car park; Garden or park; Gr...,amex; dinersclub; jcb; maestrocard; mastercard...,+39 089 830360,40.618573,14.550961,yellow,Campania,41.112508,14.845462,#1f77b4
3,Nni Lausta,"via Risorgimento 188, Santa Marina Salina, 980...","Fish plays the starring role here, where the a...",,1.816329,Santa Marina Salina,98050.0,Italy,€€,"Seafood, Seasonal Cuisine",Terrace,amex; jcb; maestrocard; mastercard; visa,+39 090 984 3486,38.559471,14.871372,yellow,Sicily,37.39793,14.658782,#aec7e8
4,Osteria de Börg,"via Forzieri 12, Rimini, 47921, Italy",Borgo San Giuliano is situated just a stroll f...,https://www.osteriadeborg.it/,1.804246,Rimini,47921.0,Italy,€,"Cuisine from Romagna, Traditional Cuisine",Terrace,amex; maestrocard; mastercard; visa,+39 0541 56074,44.064796,12.563341,green,Emilia-Romagna,44.596761,11.21864,#ff7f0e


In [27]:
map_filename = 'my_engine_results_map.html'
make_map(top_k_data, map_filename)

Map saved to my_engine_results_map.html


