# **Homework 3 - Michelin restaurants in Italy**

*Group#20*

- **Marco Zimmatore** - [zimmatore.1947442@studenti.uniroma1.it](mailto:zimmatore.1947442@studenti.uniroma1.it)
- **Gabriele Cabibbo** - [cabibbo.2196717@studenti.uniroma1.it](mailto:cabibbo.2196717@studenti.uniroma1.it)
- **Emre Yeşil** - [1emreyesil@gmail.com](mailto:1emreyesil@gmail.com)
- **Emanuele Iaccarino** - [emanueleiaccarino.ei@gmail.com](mailto:emanueleiaccarino.ei@gmail.com)

___

## **1. Data collection**

Ideally, we would call all the functions together from ``` engine.py ``` and run the whole collection and parsing pipeline in a single step, but this was not possible inside the notebook with our "more efficient" implementation.

We’ll explore these issues later on, along with the solution implemented to address them.

In any case, I left the ``` engine.py ``` file in the DataCollection folder: it works when run from there, but not when called from this Jupyter Notebook.

### **1.1 Get the list of Michelin restaurants**

Fun fact: I first ran this part of the code on 04/11/2024; the next day the number of restaurants had dropped from 2037 to 1983 (the official site confirms this, since it now lists 100 pages instead of 102).

In [None]:
from DataCollection.crawler import get_michelin_urls
import logging
import time

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

start_time = time.time()
get_michelin_urls()
logging.info(f"Time to collect urls: {time.time() - start_time} seconds") # <3

DEBUG:root:Analyzing page 1
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): guide.michelin.com:443
DEBUG:urllib3.connectionpool:https://guide.michelin.com:443 "GET /en/it/restaurants/page/1 HTTP/1.1" 200 None
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/campania/gragnano/restaurant/o-me-o-il-mare
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/abruzzo/popoli_1845563/restaurant/donevandro
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/piemonte/alba/restaurant/ape-vino-e-cucina
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/campania/sorrento/restaurant/da-bob-cook-fish
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/basilicata/matera/restaurant/da-mo
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/sardegna/cagliari/restaurant/sa-domu-sarda
DEBUG:root:Found restaurant URL: https://guide.michelin.com/en/sicilia/palermo/restaurant/charleston
DEBUG:root:Found restaurant URL: https:/

### **1.2 Crawl Michelin restaurant pages**

Using asyncio in the web scraping code lets us handle multiple HTML downloads concurrently, which makes the process faster and more efficient. Instead of waiting for each download to complete before starting the next one (sequentially), the event loop switches to another ready task whenever one is waiting for data, so several pages are being downloaded at the same time.

More info can be found [here](https://www.zenrows.com/blog/asynchronous-web-scraping-python#what-is-asynchronous-web-scraping), while the official GitHub repository is available at this [link](https://github.com/oxylabs/asynchronous-web-scraping-python).
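
To make the idea concrete, here is a minimal sketch of concurrent downloads. It is not the code in `DataCollection.crawler`: `fake_download`, `download_all`, `run_demo`, the example URLs and the `max_concurrent` limit are all hypothetical, and the network request is simulated with `asyncio.sleep` so the example runs offline.

```python
import asyncio
import time


async def fake_download(url: str, delay: float = 0.1) -> str:
    """Stand-in for an HTTP request (assumption): wait, then return dummy HTML."""
    await asyncio.sleep(delay)  # while this task waits, the event loop runs others
    return f"<html>{url}</html>"


async def download_all(urls: list[str], max_concurrent: int = 5) -> list[str]:
    """Download every URL concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:  # limits how many downloads run at once
            return await fake_download(url)

    # gather() schedules all coroutines and returns results in input order
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo() -> tuple[list[str], float]:
    """Script entry point: returns the pages and the elapsed time in seconds."""
    urls = [f"https://example.com/restaurant/{i}" for i in range(10)]
    start = time.perf_counter()
    pages = asyncio.run(download_all(urls))
    return pages, time.perf_counter() - start
```

With ten simulated downloads of 0.1 s each and five allowed at once, the elapsed time should be close to two waits rather than ten. `run_demo` uses `asyncio.run()`, so it is meant for a plain script; inside a notebook cell one would write `await download_all(urls)` instead.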


Apparently, ``` engine.py ``` works as a script while the same call fails here because Jupyter already runs an event loop in the background, and `asyncio.run()` cannot be called from a running event loop.
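
The behaviour can be reproduced outside the notebook with a small sketch. The helper names (`in_running_loop`, `job`, `nested_run_raises`, `run_anywhere`) are hypothetical and are not part of the project code; only `asyncio.get_running_loop()` and `asyncio.run()` are standard library calls.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside an already running event loop."""
    try:
        asyncio.get_running_loop()  # raises RuntimeError if no loop is running
    except RuntimeError:
        return False
    return True


async def job() -> str:
    """Tiny coroutine standing in for the real download function."""
    await asyncio.sleep(0)
    return "done"


async def nested_run_raises() -> bool:
    """Call asyncio.run() from inside a running loop, as a notebook cell would."""
    coro = job()
    try:
        asyncio.run(coro)
    except RuntimeError:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return True
    return False


def run_anywhere(coro):
    """Hypothetical script helper: only valid when no loop is running yet."""
    if in_running_loop():
        raise RuntimeError("already inside an event loop: use 'await' instead")
    return asyncio.run(coro)
```

In a script, `run_anywhere(job())` works because no loop exists yet. In a notebook cell a loop is already running, so the coroutine has to be awaited directly.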

In [None]:
from DataCollection.crawler import download_html_async

# Calling asyncio.run() here raised: "asyncio.run() cannot be called from a running event loop"
# The solution was found here: https://community.openai.com/t/error-when-using-langchain-webresearchretriever-runtimeerror-asyncio-run-cannot-be-called-from-a-running-event-loop/341969/5
# and here: https://github.com/googlecolab/colabtools/issues/3720
async def main():  # Wrapper coroutine, so we can await it on the loop Jupyter already runs
    await download_html_async()

await main()


INFO:root:Downloading batch 1
DEBUG:root:Saved HTML to page_1\restaurant_8.html
DEBUG:root:Saved HTML to page_1\restaurant_1.html
DEBUG:root:Saved HTML to page_1\restaurant_3.html
DEBUG:root:Saved HTML to page_1\restaurant_11.html
DEBUG:root:Saved HTML to page_1\restaurant_13.html
DEBUG:root:Saved HTML to page_1\restaurant_9.html
DEBUG:root:Saved HTML to page_1\restaurant_17.html
DEBUG:root:Saved HTML to page_1\restaurant_6.html
DEBUG:root:Saved HTML to page_1\restaurant_5.html
DEBUG:root:Saved HTML to page_1\restaurant_10.html
DEBUG:root:Saved HTML to page_1\restaurant_20.html
DEBUG:root:Saved HTML to page_1\restaurant_15.html
DEBUG:root:Saved HTML to page_1\restaurant_16.html
DEBUG:root:Saved HTML to page_1\restaurant_4.html
DEBUG:root:Saved HTML to page_1\restaurant_19.html
DEBUG:root:Saved HTML to page_1\restaurant_2.html
DEBUG:root:Saved HTML to page_1\restaurant_12.html
DEBUG:root:Saved HTML to page_1\restaurant_7.html
DEBUG:root:Saved HTML to page_1\restaurant_18.html
DEBUG:root

In [14]:
# Move the page folders under a single main folder to keep things tidy
from DataCollection.organize_folders import organize_folders
organize_folders()

Created main folder: michelin_restaurants
Moved folder 'page_1' to 'michelin_restaurants'
Moved folder 'page_10' to 'michelin_restaurants'
Moved folder 'page_100' to 'michelin_restaurants'
Moved folder 'page_11' to 'michelin_restaurants'
Moved folder 'page_12' to 'michelin_restaurants'
Moved folder 'page_13' to 'michelin_restaurants'
Moved folder 'page_14' to 'michelin_restaurants'
Moved folder 'page_15' to 'michelin_restaurants'
Moved folder 'page_16' to 'michelin_restaurants'
Moved folder 'page_17' to 'michelin_restaurants'
Moved folder 'page_18' to 'michelin_restaurants'
Moved folder 'page_19' to 'michelin_restaurants'
Moved folder 'page_2' to 'michelin_restaurants'
Moved folder 'page_20' to 'michelin_restaurants'
Moved folder 'page_21' to 'michelin_restaurants'
Moved folder 'page_22' to 'michelin_restaurants'
Moved folder 'page_23' to 'michelin_restaurants'
Moved folder 'page_24' to 'michelin_restaurants'
Moved folder 'page_25' to 'michelin_restaurants'
Moved folder 'page_26' to 'm

### **1.3 Parse downloaded pages**

In the same way, we use asyncio for parsing, which lets us process multiple HTML files at the same time.
Additionally, working in batches lets us control the level of parallelism without overloading our resources: we parse 100 files per batch.

Smart and efficient way to perform it!
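
As an illustration of the batching idea, here is a minimal sketch. It is not the implementation in `DataCollection.parser`: `batched`, `parse_file` and `parse_in_batches` are hypothetical names, and the parser is a placeholder that only records the file path.

```python
import asyncio
from typing import Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def parse_file(path: str) -> dict[str, str]:
    """Placeholder parser (assumption): a real one would read and parse the HTML."""
    await asyncio.sleep(0)  # yield control so other tasks in the batch can run
    return {"file": path}


async def parse_in_batches(paths: list[str], batch_size: int = 100) -> list[dict[str, str]]:
    """Parse all files, keeping at most `batch_size` tasks alive at any time."""
    rows: list[dict[str, str]] = []
    for batch in batched(paths, batch_size):
        # The next batch starts only after every task in this one has finished
        rows.extend(await asyncio.gather(*(parse_file(p) for p in batch)))
    return rows
```

With 250 paths and the default batch size, this runs three batches of 100, 100 and 50 files. The batch size is the only knob: a smaller value uses less memory and fewer open files, at the cost of less overlap between tasks.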

In [6]:
from DataCollection.parser import parse_all_restaurants
import asyncio
await parse_all_restaurants()

INFO:root:Starting to parse all restaurants.
DEBUG:root:TSV file header written.
INFO:root:Processing batch 1 with 100 files.
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_20.html, URL: https://guide.michelin.com/en/lombardia/milano/restaurant/procaccini
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_10.html, URL: https://guide.michelin.com/en/campania/marina-di-casal-velino/restaurant/alessandro-feo
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_13.html, URL: https://guide.michelin.com/en/emilia-romagna/noceto_1827072/restaurant/palazzo-utini
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_8.html, URL: https://guide.michelin.com/en/toscana/bibbiena/restaurant/il-tirabuscio262517
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_2.html, URL: https://guide.michelin.com/en/abruzzo/popoli_1845563/restaurant/donevandro
DEBUG:root:Processing file: michelin_restaurants\page_1\restaurant_5.html, U

In [15]:
import shutil
folder_path = "michelin_restaurants"
shutil.rmtree(folder_path)
print(f"The folder '{folder_path}' has been deleted successfully.")
# The folder is too heavy to push to the repository, and we no longer need it after parsing

The folder 'michelin_restaurants' has been deleted successfully.


In [12]:
from DataCollection.file_type_converter import tsv_to_csv
tsv_to_csv('michelin_restaurants_data.tsv', 'michelin_restaurants_data.csv')

Conversion complete: michelin_restaurants_data.csv


## **2. Search Engine**

___

### **2.0 Preprocessing**

#### **2.0.1 Preprocessing the Text**

### **2.1 Conjunctive Query** 

#### **2.1.1 Create Your Index!**

#### **2.1.2 Execute the Query**

### **2.2 Ranked Search Engine with TF-IDF and Cosine Similarity**

#### **2.2.1 Inverted Index with TF-IDF Scores**

#### **2.2.2 Execute the Ranked Query**


## **3. Define a New Score!**

___

1. **User Query**: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.

2. **New Ranking Metric**: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like `priceRange`, `facilitiesServices`, and `cuisineType`.

3. You will use a **heap data structure** (e.g., Python’s heapq library) to maintain the *top-k* restaurants.
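
The top-*k* step can be sketched with `heapq` as follows. This is only an illustration under assumptions: the function name `top_k`, the `(name, score)` input format and the sample scores are hypothetical, and the custom score itself is assumed to be computed beforehand.

```python
import heapq


def top_k(scored: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring (name, score) pairs, best first."""
    if k <= 0:
        return []
    heap: list[tuple[float, str]] = []  # min-heap: lowest kept score sits at heap[0]
    for name, score in scored:
        if len(heap) < k:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            # The new score beats the weakest kept one: swap them in O(log k)
            heapq.heapreplace(heap, (score, name))
    return [(name, score) for score, name in sorted(heap, reverse=True)]


# Hypothetical scores, for illustration only
example = [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7)]
best_two = top_k(example, 2)
```

Keeping a min-heap of size *k* means each restaurant costs at most O(log k) to process, so the full result list never has to be sorted.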


## **4. Visualizing the Most Relevant Restaurants**

___

## **5. BONUS: Advanced Search Engine**

___

## **Algorithmic Question (AQ)**
___