# Vivino Webscraping 
This code automates the process of collecting data from Vivino, a popular platform for wine reviews and ratings. The scraper uses Selenium to navigate Vivino's dynamic content, extracting key details about wine bottles and storing them in an SQLite3 database for analysis. The goal is to create a structured dataset of information pertaining to each wine bottle, including the name, producer, wine type, vintage, price, region, country, ratings, and the number of reviews.

## Set up Notebook
Begin by setting up the notebook environment.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
import re
import sqlite3

## Vivino Data Storage 
Vivino caps each filtered search at approximately 2,000 bottles. To scrape the more than 24,000 bottles listed on its home page, the filtering criteria (here, the price range) need to be adjusted incrementally to create a list of URLs. Each URL should return no more than approximately 2,000 bottles, so that the scraping process stays within the website's constraints.

In [None]:
# Create a list of URL templates, each filtered to return roughly 2000 wines
page_urls = ['https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=500&price_range_min=375&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=375&price_range_min=210&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=210&price_range_min=143&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=143&price_range_min=110&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=110&price_range_min=87&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=87&price_range_min=70&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=70&price_range_min=58&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=58&price_range_min=48&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=48&price_range_min=40&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=40&price_range_min=34&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=34&price_range_min=28&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=28&price_range_min=22&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            'https://www.vivino.com/explore?currency_code=CAD&min_rating=1&order_by=ratings_average&order=desc&page={page_num}&price_range_max=22&price_range_min=0&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1',
            
]

# Record the number of pages in each URL
number_of_pages = [76, 73, 77, 72, 75, 67, 75, 76, 73, 76, 68, 77, 47]

# Create an empty list to store URLs
page_url_list = []

# Loop through each URL template and corresponding number of pages
for i in range(len(page_urls)):
    for page_num in range(1, number_of_pages[i] + 1):
        # Format the URL with the current page number
        formatted_url = page_urls[i].format(page_num=page_num)
        page_url_list.append(formatted_url)
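
The templates above differ only in their price bounds, so they could also be generated from a list of price breakpoints. The following is a minimal sketch of that idea, not part of the original scraper: `BASE_TEMPLATE`, `price_bounds`, and `build_templates` are illustrative names, and the breakpoints are copied from the URLs above.

```python
# Hypothetical helper: rebuild the URL templates from price breakpoints.
# Doubled braces keep {page_num} as a literal placeholder for later formatting.
BASE_TEMPLATE = (
    "https://www.vivino.com/explore?currency_code=CAD&min_rating=1"
    "&order_by=ratings_average&order=desc&page={{page_num}}"
    "&price_range_max={price_max}&price_range_min={price_min}"
    "&vc_only=&wsa_year=null&discount_prices=false&wine_type_ids[]=1"
)

# Price breakpoints in CAD, highest first (taken from the URLs above)
price_bounds = [500, 375, 210, 143, 110, 87, 70, 58, 48, 40, 34, 28, 22, 0]


def build_templates(bounds):
    """Return one URL template per adjacent (max, min) pair of price bounds."""
    return [
        BASE_TEMPLATE.format(price_max=upper, price_min=lower)
        for upper, lower in zip(bounds, bounds[1:])
    ]


generated_templates = build_templates(price_bounds)
print(len(generated_templates))
print(generated_templates[0].format(page_num=1))
```

Keeping the breakpoints in one list makes it easier to re-tune the price bands if a band starts returning more bottles than Vivino will serve.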

## Web Scraping 
Now that the data is segmented into manageable partitions, the next step is to scrape each URL in the generated list sequentially. The goal is to extract relevant information for each wine bottle, including the name, type, vintage, rating, price, and other available details.

### SQLite3 Database for Web Scraping Storage
To store such large amounts of data, an SQLite3 database was created. This database saves the data as a separate file on the local drive, making it easy to access, query, and update as needed. Additionally, it protects against data loss by storing already processed information, even if the scraping process is interrupted or crashes.

#### SQLite3 Database Creation
The red_wines table is created with a UNIQUE constraint on the URL column. Each wine bottle on Vivino has a unique URL associated with it, which we can leverage to prevent duplicate entries in the table. This is especially useful when rerunning the code after an interruption or crash, as it ensures that previously scraped data is not duplicated.

Additionally, a scraper_state table is created to store the most recently requested URL from the page_url_list. This enables the scraper to resume from where it left off after an interruption, eliminating the need to rerun the entire scraping process from the beginning.

In [None]:
# Database setup
conn = sqlite3.connect('red_wines_final.db')  # Connect to SQLite database (or create it if it doesn't exist)
c = conn.cursor()

# Create a table for storing wine data if it doesn't already exist
c.execute('''
    CREATE TABLE IF NOT EXISTS red_wines (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        Producer TEXT,
        WineType TEXT,
        Year INTEGER,
        Region TEXT,
        Country TEXT,
        URL TEXT UNIQUE,
        Rating REAL, 
        Num_Ratings INTEGER,
        Price REAL,
        url_idx INTEGER
    )
''')

# Create a table for storing the url pages state
c.execute('''
    CREATE TABLE IF NOT EXISTS scraper_state (
        id INTEGER PRIMARY KEY,
        url_idx INTEGER,
        last_url TEXT
    )
''')

# Check if there's a last URL to resume from
c.execute('SELECT last_url FROM scraper_state WHERE id = 1')
result = c.fetchone()
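
To illustrate how the UNIQUE constraint interacts with `INSERT OR IGNORE` (the statement used later in the main loop), here is a small sketch against an in-memory database. The table name `demo_wines` and the example.com URLs are placeholders, not real Vivino data.

```python
import sqlite3

# In-memory database so the demo leaves no file behind
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_wines ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, URL TEXT UNIQUE, Rating REAL)"
)

# Placeholder rows; the third repeats the first URL
demo_rows = [
    ("https://example.com/w/1", 4.5),
    ("https://example.com/w/2", 4.2),
    ("https://example.com/w/1", 4.5),
]

for row in demo_rows:
    # A row whose URL already exists is silently skipped instead of raising an error
    demo_conn.execute(
        "INSERT OR IGNORE INTO demo_wines (URL, Rating) VALUES (?, ?)", row
    )
demo_conn.commit()

row_count = demo_conn.execute("SELECT COUNT(*) FROM demo_wines").fetchone()[0]
print(row_count)
demo_conn.close()
```

Only two rows are stored, which is why re-scraping a page after a crash does not create duplicates.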

### URL Index Tracking
In addition to the scraper state, we can track which URL within the page_url_list has been processed, for reference in case the program is interrupted.

In [None]:
# If the scraper has a saved state, resume from the last processed URL
if result and result[0]:
    # Find the index of the saved URL in the URL list
    next_url_idx = page_url_list.index(result[0])  # Index of the URL to resume scraping from
    url_idx = next_url_idx - 1                     # Index of the last fully processed URL (incremented when the page is scraped)

    # Set the starting URL to resume scraping
    start_url = result[0]
    print(f"Resuming from last URL: {start_url}")

else:
    # If no saved state exists, start scraping from the first URL in the list
    next_url_idx = 0  # Start from the first URL
    url_idx = -1      # No previous URL processed yet

    # Set the starting URL to the first URL in the list
    start_url = page_url_list[0]
    print("Starting from the initial URL.")
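
One caveat with the resume logic: `list.index` raises a `ValueError` if the saved URL is no longer in the list, for example after the price bands have been edited. The sketch below wraps the same index arithmetic in a hypothetical helper, `resolve_resume_point`, that falls back to a fresh start in that case; the example.com URLs are placeholders.

```python
def resolve_resume_point(page_url_list, saved_url):
    """Return (next_url_idx, url_idx, start_url) for a saved URL or a fresh start."""
    next_url_idx = 0
    if saved_url:
        try:
            next_url_idx = page_url_list.index(saved_url)
        except ValueError:
            # Saved URL is not in the current list; start over from the first URL
            next_url_idx = 0
    # url_idx trails next_url_idx by one until the page is actually scraped
    return next_url_idx, next_url_idx - 1, page_url_list[next_url_idx]


demo_urls = [
    "https://example.com/explore?page=1",
    "https://example.com/explore?page=2",
    "https://example.com/explore?page=3",
]

fresh = resolve_resume_point(demo_urls, None)
resumed = resolve_resume_point(demo_urls, demo_urls[2])
stale = resolve_resume_point(demo_urls, "https://example.com/explore?page=99")
print(fresh, resumed, stale)
```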

### Selenium for Web Scraping
Many elements on Vivino's webpage are dynamically loaded using JavaScript, and Selenium handles this dynamic content by waiting for elements to render before extracting data. Unlike static parsing tools such as BeautifulSoup, which only see the HTML they are given, Selenium mimics the behavior of a real user by controlling a web browser.

#### WebDriver Options
To speed up the web scraping process, we configure the WebDriver to disable image loading. This reduces page load times, which is particularly beneficial on image-heavy websites like Vivino. By optimizing the loading process, scraping becomes more efficient, especially when dealing with numerous pages.

In [None]:
# Setup WebDriver options to not load images for faster loading
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)

### Element Extraction and HTML Parsing
To collect the required data for each wine bottle on Vivino's webpage, the necessary HTML elements must be identified and extracted. Since Vivino relies heavily on dynamically loaded elements, Selenium can be used to load these elements. Methods such as find_element() and find_elements() can then efficiently extract information, including the wine's name, producer, type, ratings, location, and price. The following section outlines the key functions and the main scraping loop used to parse and store this data effectively.

### Price Extraction
Vivino uses multiple paths for displaying the price per bottle. The following functions are designed to handle this complexity by extracting numeric price values and accommodating two potential XPaths. The first function processes and cleans the price text to extract a numeric value, while the second function utilizes both a primary and secondary XPath to ensure reliable price retrieval.

In [None]:
# Function to extract price 
def extract_price(price_text):
    """
    Extracts a numeric price value from a string by cleaning and parsing it.

    Arguments:
        price_text (str): The text containing the price information to be extracted.

    Returns:
        float or None: The extracted price as a float, or None if no numeric value is found.
    """
    # Remove any non-numeric characters except for commas and decimal points
    cleaned_text = re.sub(r'[^\d.,]', '', price_text)
    # Remove commas to convert the text into a proper float format
    cleaned_text = cleaned_text.replace(',', '')
    # Use regex to find the numeric pattern
    match = re.search(r'\d+(\.\d{1,2})?', cleaned_text)
    return float(match.group()) if match else None
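
A few quick checks show how this cleaning behaves. The input strings below are made-up examples rather than text captured from Vivino, and the function is repeated in condensed form so the snippet runs on its own. Note the last case: because commas are stripped as thousands separators, a price written with a decimal comma would be misread.

```python
import re


def extract_price(price_text):
    # Same logic as the function above, condensed so this snippet is self-contained
    cleaned_text = re.sub(r'[^\d.,]', '', price_text).replace(',', '')
    match = re.search(r'\d+(\.\d{1,2})?', cleaned_text)
    return float(match.group()) if match else None


# Hypothetical inputs illustrating the cleaning rules
simple_price = extract_price("CA$39.99")          # currency prefix is dropped
thousands_price = extract_price("CA$1,234.50")    # comma treated as a thousands separator
missing_price = extract_price("Price unavailable")  # no digits at all
decimal_comma_price = extract_price("12,50 €")    # decimal comma is NOT handled

print(simple_price, thousands_price, missing_price, decimal_comma_price)
```

Since the URLs above request prices in CAD, the decimal-comma case may never arise in practice, but it is worth keeping in mind if the currency or locale is changed.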

In [None]:
# Function to find the price using primary and fallback XPaths
def find_price(driver, index):
    """
    Vivino has two possible XPaths to retrieve the price. This function finds the price using primary and fallback XPaths.

    Arguments:
        driver (WebDriver): The Selenium WebDriver instance controlling the browser.  
        index (int): The zero-based index of the wine item for which the price is to be retrieved.

    Returns:
        float or None: The extracted price as a float or `None` if neither XPath works.
    """
    # Primary XPath to use
    primary_xpath = f"(//div[contains(@class, 'addToCartButton__currency--2CTNX addToCartButton__prefix--3LzGf')]/following-sibling::div)[{index+1}]"
    # Fallback XPath to use if the first one fails
    fallback_xpath = f"(//div[contains(@class, 'addToCart__subText--1pvFt addToCart__ppcPrice--ydrd5')])[{index+1}]"

    try:
        # Try using the primary XPath first
        price_element = driver.find_element(By.XPATH, primary_xpath)
        price_text = price_element.text
        price = extract_price(price_text)
        if price:
            return price  # Return the price if a valid one is found
    except Exception as e:
        # If the primary XPath fails, print the error and move on to the fallback
        print(f"Primary XPath failed: {e}")

    try:
        # Use the fallback XPath if the primary one didn't work
        price_element = driver.find_element(By.XPATH, fallback_xpath)
        price_text = price_element.text
        price = extract_price(price_text)
        return price  # Return the price from the fallback XPath
    except Exception as e:
        # If both XPaths fail, print the error
        print(f"Fallback XPath also failed: {e}")
        return None

### Wine Type and Year Extraction
Vivino stores the wine name and year in a single string (e.g., 'Cabernet Sauvignon 2020'). To handle this, the following function separates the wine type and year from the given string.

In [None]:
def extract_wine_type_and_year(wine_type_full):
    """
    Extracts the wine type and year from a given wine type string.

    If the wine type string contains a four-digit year (e.g., "Red Wine 2020"), the function
    separates the wine type and the year. If no year is found, the function sets the year to None.

    Arguments:
        wine_type_full (str): The full string containing the wine type and possibly the year.

    Returns:
        tuple: A tuple containing the wine type (str) and the year (int or None).
               Example: ("Red Wine", 2020) or ("Red Wine", None).
    """
    # Use regex to find the wine type and a four-digit year at the end of the string
    match = re.search(r'(.*?)(\d{4})$', wine_type_full)
    if match:
        wine_type = match.group(1).strip()  # Extract the wine type
        year = int(match.group(2))         # Extract the year and convert to an integer
    else:
        wine_type = wine_type_full         # Use the full string if no year is found
        year = None                        # Set year to None

    return wine_type, year
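
The regex above accepts any four trailing digits as a year. If that proves too permissive, a stricter variant could require a standalone year between 1900 and 2099. The following is a sketch of such a variant under that assumption; `split_type_and_vintage` is a hypothetical name, and the sample strings are illustrative.

```python
import re


def split_type_and_vintage(wine_type_full):
    """Hypothetical stricter variant: accept only a standalone trailing year (1900-2099)."""
    text = wine_type_full.strip()
    # \b ensures the year is not the tail of a longer number
    match = re.match(r'(.*?)\s*\b((?:19|20)\d{2})$', text)
    if match:
        return match.group(1), int(match.group(2))
    return text, None


with_year = split_type_and_vintage("Cabernet Sauvignon 2020")
without_year = split_type_and_vintage("Brut N.V.")
long_number = split_type_and_vintage("Lot 12345")
print(with_year, without_year, long_number)
```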


## Main Scraping Loop
This section contains the main scraping loop for collecting data from Vivino's explore page. The loop handles navigating through multiple pages, extracting wine details, and storing the data in a database while maintaining progress tracking for resumption in case of interruptions.

The key steps of this loop are as follows:
1. Page Loading: Opens and loads the page, using WebDriverWait to ensure all elements are present before extraction
2. Data Extraction: Extracts information for each wine bottle and commits it to the database 
3. Pagination: Automatically navigates to the next URL in the page_url_list and updates the scraper_state to track the current URL index for resumption 
4. Error Handling: Skips over any problematic URL and logs its index for resumption
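
The text parsing in the data extraction step can be tried out without a browser. The sketch below isolates two of the conversions as hypothetical helpers, `parse_num_ratings` and `parse_location`; unlike the inline code in the loop, they return None rather than raising when the text is empty. The sample strings are assumed formats, not captured page text.

```python
import re


def parse_num_ratings(text):
    """Keep only the digits of a ratings caption; return None if there are none."""
    digits = re.sub(r'\D', '', text or '')
    return int(digits) if digits else None


def parse_location(text):
    """Split 'Region, Country' into a (region, country) tuple, using None for missing parts."""
    parts = [part.strip() for part in (text or '').split(',')]
    region = parts[0] if parts and parts[0] else None
    country = parts[1] if len(parts) > 1 and parts[1] else None
    return region, country


# Hypothetical inputs
ratings_count = parse_num_ratings("1,234 ratings")
no_count = parse_num_ratings("ratings")
full_location = parse_location("Napa Valley, United States")
region_only = parse_location("Bordeaux")
print(ratings_count, no_count, full_location, region_only)
```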

In [None]:
# Open the Vivino explore page
driver.get(start_url)

# Allow the page to load
time.sleep(5)

# Loop until every URL in page_url_list has been processed or navigation fails
while True:

    try:
        # Wait for the main elements to be present on the page
        WebDriverWait(driver, 30).until(
            EC.presence_of_all_elements_located((By.XPATH, "//a[@data-testid='vintagePageLink']"))
        )
        
        wine_elements = driver.find_elements(By.XPATH, "//a[@data-testid='vintagePageLink']")
        producers = driver.find_elements(By.XPATH, "//div[contains(@class, 'wineInfoVintage__truncate--3QAtw')][1]")
        wine_types = driver.find_elements(By.XPATH, "//div[contains(@class, 'wineInfoVintage__vintage--VvWlU wineInfoVintage__truncate--3QAtw')]")
        ratings = driver.find_elements(By.XPATH, "//div[contains(@class, 'vivinoRating_averageValue__uDdPM')]")
        num_ratings = driver.find_elements(By.XPATH, "//div[contains(@class, 'vivinoRating_caption__xL84P')]")
        locations = driver.find_elements(By.XPATH, "//div[contains(@class, 'wineInfoLocation__regionAndCountry--1nEJz')]")

        url_idx += 1

        # Extract data from the current page
        for i in range(len(wine_elements)):
            try:
                # Extract elements needed for the database
                url = wine_elements[i].get_attribute("href")    # Extract the wine bottle's unique URL 
                producer = producers[i].text                    # Extract the producer name
                wine_type_full = wine_types[i].text             # Extract the wine type name 
                rating_text = ratings[i].text                   # Extract the rating as a str
                num_ratings_text = num_ratings[i].text          # Extract the number of ratings as a str
                location = locations[i].text.split(", ")        # Extract the location including country and region
                                
                # convert rating to number
                rating = float(rating_text)

                # Parse number of ratings by removing all characters that are not a digit
                num_ratings_value = int(re.sub(r'\D', '', num_ratings_text)) if num_ratings_text else None
                
                # Extract Region and Country if available
                region = location[0].strip() if len(location) > 0 else None
                country = location[1].strip() if len(location) > 1 else None

                # Extract the year from the wine type, if present
                wine_type, year = extract_wine_type_and_year(wine_type_full)
                                    
                # Find the price using the find_price function
                price = find_price(driver, i)                
                
                # Insert into the database (avoid duplicates with UNIQUE constraint on URL)
                c.execute('''
                    INSERT OR IGNORE INTO red_wines (Producer, WineType, Year, Region, Country, URL, Rating, Num_Ratings, Price, url_idx)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                ''', (producer, wine_type, year, region, country, url, rating, num_ratings_value, price, url_idx))

                
            except Exception as e:
                print(f"Could not extract all data for a wine element: {e}")

        # Commit after each page to save progress
        conn.commit()

        try:
            
            # Update the URL index
            next_url_idx += 1

            # Stop once every URL in the list has been processed
            if next_url_idx >= len(page_url_list):
                break

            # Get the next URL from the list
            next_url = page_url_list[next_url_idx]
            
            # Store the next_url in the database for resuming later
            c.execute('''
                INSERT OR REPLACE INTO scraper_state (id, url_idx, last_url)
                VALUES (1, ?, ?)
            ''', (url_idx, next_url,))
            conn.commit()    

            driver.get(next_url)

            # Allow some time for the new page to load
            time.sleep(5)

        except Exception as e:
            # If there is an error navigating to the next URL
            print(f"Error navigating to the next URL: {e}. Last processed page index: {url_idx}")
            break
    
    except Exception as e:
        url_idx = next_url_idx  # Keep the tracking index in sync with the URL that failed
        print(f"No elements found, moving to next page: {e}. Skipped over url: {url_idx}")
        try:            
            # Update the URL index
            next_url_idx += 1

            # Stop once every URL in the list has been processed
            if next_url_idx >= len(page_url_list):
                break

            # Get the next URL from the list
            next_url = page_url_list[next_url_idx]
            
            # Store the next_url in the database for resuming later
            c.execute('''
                INSERT OR REPLACE INTO scraper_state (id, url_idx, last_url)
                VALUES (1, ?, ?)
            ''', (url_idx, next_url,))
            conn.commit()    

            driver.get(next_url)

            # Allow some time for the new page to load
            time.sleep(5)

        except Exception as e:
            print(f"Error in main loop: {e}")
            break

# Close the database connection
conn.close()

# Close the driver
driver.quit()
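
The "advance to the next URL and save state" steps appear twice in the loop above, once on the success path and once on the error path. One way to avoid the repetition is a small helper. The sketch below is an illustration rather than the notebook's code: `save_and_advance` is a hypothetical name, it leaves the `driver.get` call to the caller, and it is exercised here against an in-memory database with placeholder URLs.

```python
import sqlite3


def save_and_advance(conn, page_url_list, next_url_idx, url_idx):
    """Move to the next URL and persist it; return (next_url_idx, next_url or None)."""
    next_url_idx += 1
    if next_url_idx >= len(page_url_list):
        # Every URL has been processed; nothing left to save or load
        return next_url_idx, None
    next_url = page_url_list[next_url_idx]
    conn.execute(
        "INSERT OR REPLACE INTO scraper_state (id, url_idx, last_url) VALUES (1, ?, ?)",
        (url_idx, next_url),
    )
    conn.commit()
    return next_url_idx, next_url


# Demo against an in-memory database with placeholder URLs
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE scraper_state (id INTEGER PRIMARY KEY, url_idx INTEGER, last_url TEXT)"
)
demo_urls = ["https://example.com/p1", "https://example.com/p2"]

first_step = save_and_advance(demo_conn, demo_urls, 0, 0)
saved_state = demo_conn.execute(
    "SELECT url_idx, last_url FROM scraper_state WHERE id = 1"
).fetchone()
second_step = save_and_advance(demo_conn, demo_urls, first_step[0], 1)
print(first_step, saved_state, second_step)
demo_conn.close()
```

In the main loop, the caller would break when the returned URL is None and otherwise pass it to `driver.get`.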


### Scraper State
Lastly, if the code is to be rerun from the start, the scraper state needs to be cleared so that the scraper begins again from the first URL in the page_url_list.

In [None]:
# Database setup
conn = sqlite3.connect('red_wines_final.db')  # Connect to the same SQLite database used by the scraper above
c = conn.cursor()

# Add a reset flag
RESET_SCRAPER_STATE = False  # Set to True to reset the scraper state

if RESET_SCRAPER_STATE:
    # Clear the scraper_state table
    c.execute('DELETE FROM scraper_state')
    conn.commit()
    print("Scraper state has been reset.")