# The following code is used for scraping wines from Vivino

The code is organised into functions that automate the scraping process. It is intended to scrape wines in the premium sector, priced above 50 pounds, from Vivino.

### Problems during scraping
* The price filter is a drag slider with no input boxes
* The price was not always attached to the same HTML element
* The automatic window scroll often got stuck when the number of resulting wine cards exceeded 1000

### Scraping solutions
* Selenium's drag-and-drop action (`ActionChains.drag_and_drop_by_offset`) is used to move the price slider
* All elements that can hold the price of a wine have been identified
* The scrape is split into parts by price interval and wine category to avoid computational issues with very long result pages:
  1. The algorithm iterates over a list of price intervals from 50 pounds to the maximum price
  2. Within each price interval, the algorithm scrapes red wine separately from the other, less popular wine categories

  Splitting the data by price interval and wine category keeps the Selenium scraping process running smoothly
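
The splitting strategy above can be sketched as a plain list of scrape tasks, one per price interval and category group. This is only an illustration, not the notebook's actual code: the interval bounds in pounds and the category names are hypothetical placeholders, because the real intervals are set through slider pixel offsets.

```python
from itertools import product

# Hypothetical price intervals in pounds (assumed for illustration only)
price_intervals = [(50, 60), (60, 75), (75, 100), (100, None)]
# Red wine is scraped on its own; the remaining categories are scraped together
# (category names are assumptions, not taken from the page)
category_groups = [("red",), ("white", "sparkling", "rosé", "dessert", "fortified")]


def build_scrape_tasks(intervals, groups):
    """Return one (price_interval, category_group) task per combination."""
    return [(interval, group) for interval, group in product(intervals, groups)]


tasks = build_scrape_tasks(price_intervals, category_groups)
for interval, group in tasks:
    print(interval, group)
```

Each task corresponds to one filtered result page, so no single page has to hold every wine at once.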

### Steps of algorithm

#### Pre - Algorithm
1. Modules are imported
2. Functions for changing the price interval, automatically scrolling the page and iterating over results are defined

#### Algorithm
3. The first price interval is set
4. Resulting page is filtered using price interval and red wine category
5. Resulting page is scrolled down to the bottom using Selenium
6. HTML code is extracted from the resulting page using Beautiful Soup, where all essential wine data are retrieved
7. Data subset is added to the whole data
8. Resulting page is filtered using the same price interval, but using the rest of wine categories
9. Resulting page is scrolled down to the bottom using Selenium
10. HTML code is extracted from the resulting page using Beautiful Soup, where all essential wine data are retrieved
11. Data subset is added to the whole data

**Process (3-11) is repeated until no price intervals are left**

#### Post - Algorithm
12. Data frame is created based on the whole data and column names
13. Data are cleaned and processed, and the correct data types are set
14. Data are exported to csv/xlsx

### Pre - Algorithm

In [None]:
# Importing essential packages for scraping and data management
# Note: the tqdm package is used to display a progress bar during the scraping process
from tqdm.notebook import tqdm 
from bs4 import BeautifulSoup
import selenium
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import sys
import pandas as pd
import numpy as np
import re
import time

In [None]:
# Selecting the path of chromedriver
# import shutil

# shutil.which returns the path of the executable (os.system would only return the exit status)
# CHROMEDRIVER_PATH = shutil.which("chromedriver")
# print(CHROMEDRIVER_PATH)

In [None]:
# Launching the Selenium browser
wd = webdriver.Chrome()
# Opening the link
refresh_link = "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1MTBQS660dXdSSwYSAWoFQNn0NNuyxKLM1JLEHLX8ohTblNTiZLX8pErblMzi5PzSvJL4gtSi5NS8EgCfGBoC"
wd.get(refresh_link)
time.sleep(2)
# Accepting cookies
wd.find_element(By.XPATH,'//*[@id="cookie-notice-container"]/div/button').click()
time.sleep(2)

In [None]:
# Function for setting price interval on the website
def price_slider(left_value, right_value):
    # identifying the left and right element of price slider
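    # (find_element with By.CSS_SELECTOR replaces the find_element_by_* helpers removed in Selenium 4)
    left_slider = wd.find_element(By.CSS_SELECTOR, "div[class^='rc-slider-handle rc-slider-handle-1']")
    right_slider = wd.find_element(By.CSS_SELECTOR, "div[class^='rc-slider-handle rc-slider-handle-2']")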
    time.sleep(2)
    # moving the left element of price slider in pixels (set by left_value)
    ActionChains(wd).drag_and_drop_by_offset(left_slider, left_value, 0).perform()
    time.sleep(2)
    # moving the right element of price slider in pixels (set by right_value)
    ActionChains(wd).drag_and_drop_by_offset(right_slider, right_value, 0).perform()
    time.sleep(3)

In [None]:
### Function for automatically scrolling page to the bottom using Selenium
# Partially referenced from: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
def scroll_page():
    # extracting scroll height
    last_height = wd.execute_script("return document.body.scrollHeight")
    while True:
        # scrolling down to the bottom
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1.5)
        # extracting new scroll height and comparing with the last scroll height (if no difference, stop the function)
        new_height = wd.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            
            # check, whether it's stuck
            wd.execute_script("scrollBy(0,-3000);")
            time.sleep(1.5)
            wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1.5)
            new_height = wd.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
        last_height = new_height
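
The stopping rule of `scroll_page` can be tried without a browser. The sketch below is an illustration only: `FakePage` is a hypothetical stand-in for the Selenium driver that loads more content a fixed number of times and stalls once, and `scroll_until_stable` reimplements the same idea (scroll, compare heights, scroll up and retry once before giving up) against it.

```python
class FakePage:
    """Hypothetical stand-in for a lazily loading page (not part of Selenium)."""

    def __init__(self, total_loads, stuck_on_load):
        self.height = 1000
        self.loads_left = total_loads
        # remaining-load count at which scrolling stalls until the page is nudged upwards
        self.stuck_on_load = stuck_on_load
        self.nudged = False

    def get_height(self):
        return self.height

    def scroll_down(self):
        stalled = self.loads_left == self.stuck_on_load and not self.nudged
        if self.loads_left > 0 and not stalled:
            self.height += 1000
            self.loads_left -= 1
        self.nudged = False

    def scroll_up(self):
        self.nudged = True


def scroll_until_stable(page):
    """Scroll down until the height stops growing, retrying once after scrolling up."""
    last_height = page.get_height()
    while True:
        page.scroll_down()
        new_height = page.get_height()
        if new_height == last_height:
            # the page may be stuck: scroll up, scroll down again and re-check
            page.scroll_up()
            page.scroll_down()
            new_height = page.get_height()
            if new_height == last_height:
                break
        last_height = new_height
    return last_height


final_height = scroll_until_stable(FakePage(total_loads=3, stuck_on_load=2))
print(final_height)
```

Without the retry step, the loop would stop at the first stall and miss the wine cards that load afterwards.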

In [None]:
# Function for scraping data from resulting page
def scrape_wine():
    
    # extracting the wine cards with all necessary data
    WineCards = wd.find_elements(By.CSS_SELECTOR, "div[class^='wineCard__wineCard--2dj2T']")
    wine_data = []
    for winecard in tqdm(WineCards):
        
        try:
            # assigning the empty row list
            row = []
            # extracting html code of wine card element
            winecard_html = winecard.get_attribute('innerHTML')
            soup = BeautifulSoup(winecard_html, "html.parser")
            
            # assigning values to wine brand and price attributes to avoid None values
            wine_brand_name = ''
            wine_price = ''
            
            # extracting wine brand name
            if soup.find('div', attrs={'class': re.compile('^wineInfoVintage__truncate.*')}):
                wine_brand_name = soup.find('div', attrs={'class': re.compile('^wineInfoVintage__truncate.*')}).text
            # extracting wine name
            wine_name = soup.find('div', attrs={'class': re.compile('^wineInfoVintage__vintage.*')}).text
            # extracting wine price
            if soup.find('div', attrs={'class': re.compile('^addToCartButton__price.*')}):
                wine_price = soup.find('div', attrs={'class': re.compile('^addToCartButton__price.*')}).text
            # check for the 2nd case of price
            if soup.find('div', attrs={'class': re.compile('^addToCart__subText--1pvFt.*')}):
                price_value = soup.find('div', attrs={'class': re.compile('^addToCart__subText--1pvFt.*')}).text
                if re.search("Available online", price_value):
                    wine_price = price_value

            # extracting wine rating
            wine_rating = soup.find("div", class_="vivinoRating_averageValue__uDdPM").text
            # extracting the number of wine reviews
            wine_review_count = soup.find("div", class_="vivinoRating_caption__xL84P").text
            wine_location = soup.find('div', attrs={'class': re.compile('^wineInfoLocation__regionAndCountry.*')}).text
            
            # adding all attributes to the row
            row.extend([wine_brand_name, 
                        wine_name, 
                        wine_price,  
                        wine_rating, 
                        wine_review_count, 
                        wine_location])
            # adding row to the data
            wine_data.append(row)

        except Exception as e:
            print('Error on line {}'.format(sys.exc_info()[-1].tb_lineno), type(e).__name__, e)
            pass
    
    return wine_data

### Algorithm

**BE AWARE** This algorithm takes a long time to run because there are many pages to go through. It is highly recommended to use the data stored in wines_scrape_clean_result.csv instead; it contains the same result.

In [None]:
# Assigning price intervals as pixel offsets for the left and right handles of the price slider
price_slider_intervals = [(245, -120), 
                          (255, -110), 
                          (265, -100), 
                          (275, -95), 
                          (280, -85), 
                          (290, -70), 
                          (305, -55), 
                          (323, -33), 
                          (345, 0)]
data = []

for index, value in enumerate(price_slider_intervals):
    
    # refreshing the page
    wd.get(refresh_link)
    time.sleep(2)
    # accepting cookies
    try:
        if wd.find_element(By.XPATH,'//*[@id="cookie-notice-container"]/div/button'):
            wd.find_element(By.XPATH,'//*[@id="cookie-notice-container"]/div/button').click()
    except Exception:
        pass
    time.sleep(1)
    
    # extracting the left and right pixel offsets for the price slider
    left_value = value[0]
    right_value = value[1]
    # applying the offsets using the price_slider function defined earlier
    price_slider(left_value, right_value)
    
    print(f"Price interval {index + 1} is set")
    
    # selecting red wine separately (it's the largest category)
    wd.find_element(By.XPATH,'//*[@id="explore-page-app"]/div/div/div[2]/div[1]/div/div[1]/div[2]/label[1]').click()
    time.sleep(2)
    # scrolling page to the bottom
    scroll_page()
    
    # scraping wine listed on the page (here, only red wine)
    block_data = scrape_wine()
    # adding subset to the whole data
    data = data + block_data
    
    # unselecting red wine
    wd.find_element(By.XPATH,'//*[@id="explore-page-app"]/div/div/div[2]/div[1]/div/div[1]/div[2]/label[1]').click()
    time.sleep(2)
    # selecting the rest of wine categories
    for i in range(2, 7):
        time.sleep(0.5)
        wd.find_element(By.XPATH,'//*[@id="explore-page-app"]/div/div/div[2]/div[1]/div/div[1]/div[2]/label[' + str(i) + ']').click()
    
    
    time.sleep(2)
    # scrolling page to the bottom
    scroll_page()
    # scraping wine listed on the page (here, all categories of wine, except red wine)
    block_data = scrape_wine()
    # adding subset to the whole data
    data = data + block_data

### Post - Algorithm

In [None]:
# assigning column names
columns = ['WineBrand',
           'WineName',
           'WinePrice',
           'WineRating',
           'WineReviewCount',
           'WineLocation']
# creating dataframe
wine_df = pd.DataFrame(data, columns=columns)

In [None]:
# data cleaning procedures

# stripping and replacing values for wine price and review count
wine_df["WinePrice"] = wine_df["WinePrice"].str.replace("Available online from", "")
wine_df["WinePrice"] = wine_df["WinePrice"].str.replace("£", "")
wine_df["WinePrice"] = wine_df["WinePrice"].str.strip()
wine_df["WineReviewCount"] = wine_df["WineReviewCount"].str.replace(" ratings", "")

# changing data formats of columns (values that cannot be parsed, e.g. an empty price, become NaN)
wine_df['WinePrice'] = pd.to_numeric(wine_df['WinePrice'], errors='coerce')
wine_df['WineRating'] = pd.to_numeric(wine_df['WineRating'], errors='coerce')
wine_df['WineReviewCount'] = pd.to_numeric(wine_df['WineReviewCount'], errors='coerce')

# adding new column with wine year
wine_df['WineYear'] = wine_df['WineName'].str.extract(r'(19\d{2}|20\d{2})', expand=False)

# removing duplicate rows
wine_df = wine_df.drop_duplicates()
wine_df.shape
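
The cleaning rules above can be checked on a few raw strings before running them on the full data frame. The sketch below uses only the standard library; the helper names (`parse_price`, `parse_review_count`, `extract_year`) and the sample strings are made up for illustration and assume the text formats handled above ("£", "Available online from", " ratings").

```python
import re


def parse_price(raw):
    """Strip the known prefixes and return the price as a float, or None if empty."""
    cleaned = raw.replace("Available online from", "").replace("£", "").strip()
    return float(cleaned) if cleaned else None


def parse_review_count(raw):
    """Drop the ' ratings' suffix and return the count as a float, or None if empty."""
    cleaned = raw.replace(" ratings", "").strip()
    return float(cleaned) if cleaned else None


def extract_year(wine_name):
    """Return the first four-digit year starting with 19 or 20, or None."""
    match = re.search(r"(19\d{2}|20\d{2})", wine_name)
    return match.group(1) if match else None


# Hypothetical sample values, not taken from the scraped data
print(parse_price("Available online from £62.50"))
print(parse_review_count("1234 ratings"))
print(extract_year("Example Estate Reserve 2015"))
```

Returning `None` for an empty price mirrors the case where a wine card has no price element, so such rows can be inspected instead of breaking the type conversion.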

In [None]:
# exporting data to csv
wine_df.to_csv("vivino_wine_scrape_result.csv", index=False)

# Visualisations

In [None]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

In [None]:
# Grouping and data preprocessing 
# averaging only the numeric columns (recent pandas versions raise an error when text columns are included)
wine_year_slice = wine_df.groupby('WineYear')[["WineReviewCount", "WineRating"]].mean()
wine_year_slice = wine_year_slice.reset_index()
wine_year_slice['WineYear'] = wine_year_slice['WineYear'].astype('int64', copy=False)
wine_year_slice = wine_year_slice[wine_year_slice['WineYear'] >= 1975]
wine_year_slice['WineYear'] = wine_year_slice['WineYear'].astype(str, copy=False)
xticks = wine_year_slice['WineYear'].str[2:]
colors = []

# Assigning colors based on the average wine rating (same color order as the legend below)
for value in wine_year_slice["WineRating"]:
    if value < 4.2:
        colors.append('#00b3fa')
    elif value < 4.25:
        colors.append('#00a7fa')
    elif value < 4.3:
        colors.append('#005cfa')
    else:
        colors.append('#6800fa')

fig = plt.figure(figsize=(20, 6))
ax = fig.add_subplot(111)
ax.set_frame_on(False)

# Creating bar chart, where average number of reviews is shown by each year from 1975 to 2021
sns.barplot(data=wine_year_slice, x=xticks, y="WineReviewCount", palette = colors)

# Adding legend elements
leg1 = mpatches.Patch(color='#00b3fa', label='Average rating < 4.2')
leg2 = mpatches.Patch(color='#00a7fa', label='Average rating < 4.25')
leg3 = mpatches.Patch(color='#005cfa', label='Average rating < 4.3')
leg4 = mpatches.Patch(color='#6800fa', label='Average rating >= 4.3')

# Creating the chart legend
legend = ax.legend(handles=[leg1, leg2, leg3, leg4],
          title='Range of average wine rating [4.17, 4.42]',
          loc = (0.03, 0.65),
          fontsize=13, 
          fancybox=True)

# Changing the font size of legend title
plt.setp(legend.get_title(),fontsize=12.9)

# Changing the font size of x-axis and y-axis ticks
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

# Changing the font size of x-axis and y-axis labels
plt.xlabel('Wine year of production', fontsize=16)
plt.ylabel('Average number of wine reviews', fontsize=16)

# Saving the bar chart with good resolution
plt.savefig('VivinoBarChart.jpeg', dpi=600)

# Displaying the figure
plt.show()
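
The rating-to-color rule is easier to keep in sync with the legend when it lives in one place. A minimal sketch, assuming the thresholds 4.2, 4.25 and 4.3 and the color order used in the legend; `rating_color`, `THRESHOLDS` and `COLORS` are hypothetical names, not part of the notebook.

```python
from bisect import bisect_right

# Upper bounds of the first three rating bins; ratings of 4.3 and above fall in the last bin
THRESHOLDS = [4.2, 4.25, 4.3]
# One color per bin, from lowest to highest average rating
COLORS = ['#00b3fa', '#00a7fa', '#005cfa', '#6800fa']


def rating_color(value):
    """Return the bar color for an average rating."""
    return COLORS[bisect_right(THRESHOLDS, value)]


# Hypothetical sample ratings
for rating in (4.17, 4.2, 4.29, 4.42):
    print(rating, rating_color(rating))
```

`bisect_right` places a rating equal to a threshold in the next bin, which matches the strict `<` comparisons used for the bars.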