In [3]:
import os
import time
import crawler
import requests 
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
from constants import *

This notebook contains the third homework assignment for the ADM course. 

To improve the organization of the project, we created three additional Python files: `constants`, `crawler`, and `search_engine`.

- **`constants`**: Contains variables for constant links and paths used throughout the project.  
- **`crawler`**: Includes functions necessary for scraping data from the Michelin website.  
- **`search_engine`**: Provides key functions to identify the best restaurants based on the given query.

### Part 1 (Data collection)

#### 1.1. Get the list of Michelin restaurants


`get_restaurants_links` scrapes restaurant links from a website across multiple pages. It constructs URLs for each page, fetches their content, and parses the HTML to extract restaurant links. Each link is stored in a list with a prefix indicating the page number. The function returns the complete list of links.

In [6]:
restaurants_list = crawler.get_restaurants_links()

100%|██████████| 100/100 [00:52<00:00,  1.90it/s]


1. **`write_restaurants_list`**:
   - This function takes a list of restaurant links and saves them to a text file, with each link written on a new line.

In [7]:
def write_restaurants_list(restaurants_list):
    with open('restaurants.txt', 'w') as _file:
        restaurants_str = '\n'.join(restaurants_list)  # Convert the list to a single string with newline separators
        _file.write(restaurants_str)  # Write the string to the file

write_restaurants_list(restaurants_list)

2. **`read_restaurants_list`**:
   - This function reads the restaurant links from a text file and returns them as a list, where each line in the file corresponds to one link.

In [2]:
def read_restaurants_list():
    with open('restaurants.txt', 'r') as _file:
        restaurants_list = _file.read().split('\n')  # Split the file content into a list of lines

    return restaurants_list  # Return the list of restaurant links

# Read the list from the file and print it
restaurants_list = read_restaurants_list()
print(restaurants_list)


['page1>https://guide.michelin.com/en/campania/gragnano/restaurant/o-me-o-il-mare', 'page1>https://guide.michelin.com/en/abruzzo/popoli_1845563/restaurant/donevandro', 'page1>https://guide.michelin.com/en/piemonte/alba/restaurant/ape-vino-e-cucina', 'page1>https://guide.michelin.com/en/campania/sorrento/restaurant/da-bob-cook-fish', 'page1>https://guide.michelin.com/en/basilicata/matera/restaurant/da-mo', 'page1>https://guide.michelin.com/en/sardegna/cagliari/restaurant/sa-domu-sarda', 'page1>https://guide.michelin.com/en/sicilia/palermo/restaurant/charleston', 'page1>https://guide.michelin.com/en/toscana/bibbiena/restaurant/il-tirabuscio262517', 'page1>https://guide.michelin.com/en/emilia-romagna/cesenatico/restaurant/la-buca130947', 'page1>https://guide.michelin.com/en/campania/marina-di-casal-velino/restaurant/alessandro-feo', 'page1>https://guide.michelin.com/en/lombardia/cervesina/restaurant/dama-1213583', 'page1>https://guide.michelin.com/en/campania/napoli/restaurant/il-ristoran

#### 1.2. Crawl Michelin restaurant pages

**`download_restaurants_pages`**:  
This function downloads the HTML content of restaurant pages from a list of URLs. It organizes the saved pages into directories named after the page number from which the links were extracted. Each restaurant's page is saved as an HTML file using the last part of its URL as the file name.

In [44]:
crawler.download_restaurants_pages(restaurants_list)

100%|██████████| 1983/1983 [08:58<00:00,  3.68it/s]


### 1.3. Parse downloaded pages

#### To parse each page we need the following functions provided in crawler.py

1. **`extract_title`**: Retrieves the restaurant's name from the HTML content.
2. **`extract_address`**: Parses and structures the address, city, postal code, and country from the given details.
3. **`extract_details`**: Extracts the price range and cuisine type of the restaurant.
4. **`extract_description`**: Retrieves the restaurant's description from the HTML content.
5. **`extract_facilities`**: Extracts a list of available facilities or services provided by the restaurant.
6. **`extract_cards`**: Extracts the names of accepted credit cards from the HTML content.
7. **`extract_phone_number`**: Retrieves the restaurant's phone number.
8. **`extract_website`**: Extracts the restaurant's website URL, if available.
9. **`process_restaurant_file`**: Processes a single restaurant's HTML file, extracting its attributes and appending them to a dictionary.
10. **`process_pages`**: Iterates through directories of restaurant HTML files, extracting their details into a consolidated attributes dictionary.

In [3]:
pages = os.listdir(pages_base_dir)
pages.sort(key=lambda x: int(x[4:]))

rest_attr_dict = crawler.process_pages(pages, pages_base_dir)

100%|██████████| 100/100 [04:16<00:00,  2.57s/it]


In [5]:
restaurants_df = pd.DataFrame(rest_attr_dict)
restaurants_df.to_csv('restaurants.tsv', sep='\t')
restaurants_df.head(5)

Unnamed: 0,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,facilitiesServices,creditCards,phoneNumber,website
0,La Trattoria Enrico Bartolini,Località Badiola,Castiglione della Pescaia,58043,Italy,€€€€,"Mediterranean Cuisine, Grills",After a majestic picture-postcard approach via...,"[Air conditioning, Car park, Garden or park, I...","[Amex, Mastercard, Visa]",+39 0564 944322,https://www.enricobartolini.net/ristorante-la-...
1,Donevandro,via Garibaldi 2,Popoli,65026,Italy,€€,"Contemporary, Seasonal Cuisine","Up until a few years ago, the owner-chef at th...",[Air conditioning],"[Mastercard, Visa]",+39 388 887 6858,http://www.donevandroristorante.it
2,O Me O Il Mare,Via Roma 45/47,Gragnano,80054,Italy,€€€€,"Italian Contemporary, Modern Cuisine","Known around the world as the town of pasta, G...","[Air conditioning, Interesting wine list, Whee...","[Amex, Dinersclub, Mastercard, Visa]",+39 081 620 0550,
3,Sa Domu Sarda,via Sassari 51,Cagliari,9124,Italy,€€,Sardinian,"Despite being an island, Sardinia’s traditiona...","[Air conditioning, Terrace]","[Amex, Mastercard, Visa]",+39 070 653400,https://www.osteriasadomusarda.it/
4,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,€€,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"[Air conditioning, Terrace, Wheelchair access]","[Amex, Dinersclub, Maestrocard, Mastercard, Visa]",+39 0173 363453,https://www.apewinebar.it/alba/


### Part 2 (Search Engine)

In [1]:
import search_engine

import os
import time
import crawler
import requests 
import numpy as np
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
from constants import *

  from pandas.core import (
[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
restaurants_df = pd.read_csv('restaurants.tsv', sep='\t', index_col=0)

### 2.0. Preprocessing

**`clean_text`**:  
This function preprocesses text for NLP tasks. It performs the following steps:  

1. Converts the input text to lowercase.  
2. Removes punctuation from the text.  
3. Tokenizes the text into words.  
4. Filters out stopwords (commonly used words with little semantic value).  
5. Applies stemming to reduce words to their root form.  

The function can return either a tokenized list of stemmed words or a cleaned string, depending on the `tokenized` parameter.

In [3]:
tqdm.pandas()
restaurants_df['cleaned_description'] = restaurants_df['description'].progress_apply(lambda text: search_engine.clean_text(text))

100%|██████████| 1982/1982 [00:02<00:00, 907.47it/s]


### 2.1 Conjunctive Query

#### 2.1.1 Create Our Index!

**`create_vocab_indexer`**:  
This function generates a vocabulary indexer from a specified column in a DataFrame. It tokenizes the text in the column, collects all unique tokens into a set, and converts the set into a DataFrame with a single column named `'token'`. The resulting DataFrame provides a list of all unique tokens in the text corpus.

In [4]:
vocab_df = search_engine.create_vocab_indexer(restaurants_df, 'cleaned_description')
vocab_df.to_csv('vocabulary.csv')

**`create_inverted_indexer`**:  
This function builds an inverted index to map tokens from a vocabulary to the indices of restaurants whose descriptions contain those tokens. It iterates through each token in the vocabulary and checks where the token appears in the `'cleaned_description'` column of the restaurants DataFrame. The result is a dictionary where each token index is associated with a list of restaurant indices where the token occurs.

In [5]:
vocab_documents_dict = search_engine.create_inverted_indexer(vocab_df, restaurants_df)

7707it [00:14, 529.22it/s]


#### 2.1.2 Execute the Query

**`conjunctive_search`**:  
This function performs a conjunctive (AND) search on restaurant descriptions. It takes a query string, cleans and tokenizes it, and finds restaurants whose descriptions contain all the tokens in the query. It uses an inverted index to retrieve matching document indices for each token, intersects these indices to find common matches, and returns details (name, address, description, website) of the matching restaurants.

In [7]:
search_engine.conjuctive_search("modern seasonal cuisine", vocab_df, vocab_documents_dict, restaurants_df).head(5)

Unnamed: 0,restaurantName,address,description,website
1153,Ronchi Rò,località Cime di Dolegna 12,Ronchi Rò is an estate-cum-agriturismo surroun...,https://www.ronchiro.it
898,La Bandiera,contrada Pastini 4,Although it takes a while to reach this restau...,https://www.labandiera.it/
1544,Babette,via Michelangelo 17,Situated just beyond the centre of Albenga in ...,https://www.ristorantebabette.net/
1289,Flurin,Laubengasse 2,Flurin occupies an old medieval tower in Glore...,https://www.flurin.it
12,20Tre,via David Chiossone 20 r,"Run by three partners, this contemporary-style...",https://www.ristorante20tregenova.it/


#### 2.2.2 Execute the Ranked Query

**`create_inverted_indexer_with_tfidf`**:  
This function extends a traditional inverted index by incorporating TF-IDF (Term Frequency-Inverse Document Frequency) scores for each token in a restaurant dataset. It works as follows:  

1. Iterates through each token in the vocabulary.
2. Identifies which restaurant descriptions contain the token using regex for exact word matching.
3. Calculates the **TF-IDF** score for the token in each matching document:  
   - **TF**: Frequency of the token in the document divided by the total number of tokens in the document.  
   - **IDF**: Logarithm of the total number of documents divided by the number of documents containing the token.  
4. Stores the matching document indices along with their respective TF-IDF scores for each token in a dictionary.  

The result is a dictionary where each token is mapped to a list of tuples containing document indices and their corresponding TF-IDF scores.

In [8]:
vocab_documents_with_tfidf = search_engine.create_inverted_indexer_with_tfidf(vocab_df, restaurants_df)

  vocab = row[0]
100%|██████████| 7707/7707 [01:31<00:00, 84.34it/s]


**`cosine_similarity`**:  
This function calculates the cosine similarity between a given vector and all rows in a matrix. Cosine similarity measures the angle between vectors, providing a score to determine how similar they are, irrespective of their magnitudes.  

- It computes the norms (magnitudes) of the vector and each row of the matrix.  
- It calculates the dot product of the vector with each matrix row.  
- The cosine similarity is obtained by dividing the dot product by the product of the norms.  


**`rank_search_engine`**

This function performs ranked search for a query over a set of restaurant descriptions using TF-IDF and cosine similarity:

1. **Query Processing:**  
   - The input query is cleaned and tokenized.
   - A binary vector representation of the query is created, where each position corresponds to a vocabulary term, and a `1` indicates the term is present in the query.

2. **Constructing Document Vectors:**  
   - Each restaurant is represented by a vector where each position corresponds to a vocabulary term.
   - The TF-IDF scores from `vocab_documents_with_tfidf` are used to populate these vectors for each restaurant.

3. **Computing Similarity:**  
   - Cosine similarity is calculated between the query vector and all restaurant vectors.
   - This determines how closely the query matches each restaurant's description.

4. **Ranking Results:**  
   - The restaurants are ranked by their similarity scores in descending order.
   - The top `k` restaurants are selected based on their scores.

5. **Returning Results:**  
   - A DataFrame of the top `k` restaurants is returned, including their similarity scores, for display or further processing. 


In [10]:
ranked_results = search_engine.rank_search_engine(query='modern seasonal cuisine', 
                                                  vocab_df=vocab_df, 
                                                  vocab_documents_with_tfidf=vocab_documents_with_tfidf,
                                                  restaurants_df=restaurants_df, 
                                                  k=10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  selected_rests['Similarity Score'] = np.sort(similarity_scores)[::-1][:k]


### Part 4 (Visualizing the Most Relevant Restaurants)

**`get_coordinates_with_details`**

This function retrieves geographic coordinates (latitude and longitude) for a given restaurant by querying its detailed address, including postal code, city, and country. If the full address is unavailable, it falls back to using the city and country. It leverages the `geopy` library's `Nominatim` geocoder to perform the lookup.

**`enrich_with_coordinates`**

This function enriches a restaurant DataFrame by adding latitude and longitude columns. It applies the `get_coordinates_with_details` function to each row to fetch coordinates and filters out rows where coordinates could not be determined.

In [12]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="restaurant_locator")

def get_coordinates_with_details(row):
    """Get latitude and longitude using address, postal code, city, and country."""
    query = f"{row['address']}, {row['postalCode']}, {row['city']}, {row['country']}"
    location = geolocator.geocode(query)
    if location:
        return location.latitude, location.longitude
    elif location is None:
        location = geolocator.geocode(f"{row['city']}, {row['country']}")
        
        if location:
            return location.latitude, location.longitude

    return None, None

def enrich_with_coordinates(restaurants_df):
    """Add latitude and longitude to the DataFrame."""
    restaurants_df[['latitude', 'longitude']] = restaurants_df.apply(
        lambda row: pd.Series(get_coordinates_with_details(row)), axis=1
    )
    return restaurants_df.dropna(subset=['latitude', 'longitude'])


In [14]:
import folium

def plot_top_restaurants(top_k_restaurants):
    
    # Initialize a map centered on Italy
    italy_map = folium.Map(location=[41.8719, 12.5674], zoom_start=6)
    
    # Define a color scheme for price ranges
    price_color = {
        "€": "green",
        "€€": "blue",
        "€€€": "orange",
        "€€€€": "red"
    }
    
    # Add each restaurant to the map
    for _, row in top_k_restaurants.iterrows():
        popup_info = f"""
        <b>{row['restaurantName']}</b><br>
        {row['address']}<br>
        {row['city']}, {row['country']}<br>
        Price Range: {row['priceRange']}<br>
        Cuisine Type: {row['cuisineType']}
        """
        
        folium.Marker(
            location=[row['latitude'], row['longitude']],
            popup=popup_info,
            icon=folium.Icon(color=price_color.get(row['priceRange'], "gray"), icon="cutlery"),
        ).add_to(italy_map)
    
    return italy_map

enriched_restaurants_df = enrich_with_coordinates(ranked_results)  # Add coordinates to the DataFrame
map_result = plot_top_restaurants(enriched_restaurants_df)

# Display map in Jupyter notebook
map_result


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  restaurants_df[['latitude', 'longitude']] = restaurants_df.apply(
