# Use BeautifulSoup to scrape red wine data from Vivino using LCBO data

This notebook connects to the Vivino website and, for each red wine previously scraped from the LCBO, scrapes the data Vivino holds for the matching bottle. The most important fields are the name, rating and number of reviews.

In [1]:
import re
import requests
import time
import random
from bs4 import BeautifulSoup
import pandas as pd
import csv

Read the LCBO data and use the bottle names as search terms on the Vivino website.

In [2]:
dfl = pd.read_csv('lcbo_redwine.csv')

In [3]:
dfl['search'] = dfl['name'] + ' ' + dfl['region']

In [4]:
# Check number of bottles
len(list(dfl['search']))

6089
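
Because these search strings are later substituted into a URL query, a character such as `&` in a bottle name could be misread as a query separator. The following is a minimal sketch of one way to guard against that, assuming the standard library's `urllib.parse.quote_plus` encoding is acceptable for Vivino's `q` parameter (this has not been checked against the site). `SEARCH_URL` and `build_search_url` are hypothetical names, and the bottle name is made up.

```python
from urllib.parse import quote_plus, urlsplit, parse_qs

# Same shape as the search URL template used in this notebook
SEARCH_URL = 'https://www.vivino.com/search/wines?q={kw}&start={page}'

def build_search_url(kw, page=1):
    """Return the search URL with the keyword percent-encoded (hypothetical helper)."""
    return SEARCH_URL.format(kw=quote_plus(kw), page=page)

# Made-up bottle name containing characters that are special in a URL
example = build_search_url('Smith & Sons Merlot Niagara, Ontario')

# Decoding the query string recovers the original search term unchanged
print(parse_qs(urlsplit(example).query)['q'][0])
```

Without the encoding step, everything after the `&` would be parsed as a separate query parameter rather than as part of the search term.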

## Web scraping www.vivino.com using the site's search function

This is the main code. It uses BeautifulSoup to extract information from Vivino for each bottle obtained from the LCBO: the LCBO bottle name is submitted to Vivino's search, and the first result is taken as the best match. Matches won't be perfect but are mostly accurate. The most salient information for this purpose is the rating and the number of reviews.

Finally, write the data to a .csv file, marking each bottle that was missed.
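
The two text-to-number conversions involved, for the rating and for the review count, can be sketched on their own. This is an illustration under assumptions: the input strings below are invented, the exact text shown on vivino.com may differ, and `parse_score` and `parse_num_reviews` are hypothetical helper names.

```python
import re

def parse_score(text):
    """Convert a rating string such as '4,2' or '4.2' to a float."""
    # A decimal comma is normalised to a decimal point before conversion
    return float(text.strip().replace(',', '.'))

def parse_num_reviews(text):
    """Return the first integer found in the text, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0

# Illustrative inputs only; the real page text is an assumption
print(parse_score(' 4,2 '))
print(parse_num_reviews('1234 ratings'))
```

Keeping the conversions in small functions makes them easy to check against sample strings before starting a run over thousands of bottles.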

In [5]:
url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

# Helper function to scrape the first bottle of search result 
def get_wines(kw):
    """Scrape vivino.com for rating and number of reviews for a single bottle.

    Keyword arguments:
        kw -- a string containing the bottle name and associated information for the search
    
    Yields a single tuple containing:
        title -- the name of the bottle as stored on vivino.com
        region -- the region where the wine was produced
        country -- the country where the wine was produced
        score -- the average rating score of the wine
        num_reviews -- the number of reviews of the wine
    """
    with requests.session() as s:
        page = 1
        soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
        params = [wc['data-vintage'] for wc in soup.select('.default-wine-card')]
                
        # Look up the first result card once and reuse it for every field
        card = soup.find('div', attrs={'class': 'default-wine-card vintage-price-id-' + str(params[0])})

        title = card.find('span', attrs={'class': 'bold'}).get_text().strip()

        region, country = card.find('span', attrs={'class': 'text-block wine-card__region'})\
            .get_text().strip().split('\n·\n')

        score = card.find('div', attrs={'class': 'text-inline-block light average__number'}).get_text().strip()
        # Accept a decimal comma as well as a decimal point
        score = float(score.replace(',', '.'))

        num_reviews = card.find('div', attrs={'class': 'text-inline-block average__stars'}).get_text().strip()
        # Raw string avoids an invalid-escape warning; take the first run of digits followed by a space
        num_reviews = int(re.findall(r'\d+ ', num_reviews)[0].strip())
        
        yield title, region, country, score, num_reviews
        
        # Pause for a random 1-3 seconds after each request; without a delay the server may block the IP address
        time.sleep(random.randint(1,3))
            
# Test using a single search name before providing full list of thousands of bottles
wines = ['Solaia 2009 Tuscany, Italy']
# Comment above line and uncomment below line to do full search. Be prepared to wait for a few hours!
#wines = list(dfl['search'])

# Main loop: iterate through the list of bottle names, try to collect the data and write each row to a .csv file
total = len(wines)
i=1
with open('vivino_redwine1.csv', 'w', encoding='utf-8', newline='') as csvfile:
    bottle_writer = csv.writer(csvfile)
    bottle_writer.writerow(['title','region','country','score','num_reviews']) 
    for wine in wines:
        try:
            bottle = list(*get_wines(wine))
        except Exception:  # a bare except would also swallow Ctrl-C during a long run
            bottle = ['missed','missed','missed',0,0]
        bottle_writer.writerow(bottle)
        print('Finished bottle ' + str(i) + ' of ' + str(total))
        i += 1

Finished bottle 1 of 1
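
After a full run it can be useful to check how many bottles were recorded as missed. The sketch below counts rows whose `title` is `'missed'`, the placeholder written by the main loop. `count_missed` is a hypothetical helper, and the sample rows are made up to keep the example self-contained.

```python
import csv
import io

def count_missed(csv_lines):
    """Return (missed, total) for rows in the layout written by the main loop."""
    reader = csv.DictReader(csv_lines)
    missed = total = 0
    for row in reader:
        total += 1
        if row['title'] == 'missed':
            missed += 1
    return missed, total

# Small in-memory sample with the same header as the output file (values are made up)
sample = io.StringIO(
    'title,region,country,score,num_reviews\n'
    'Example Wine,Tuscany,Italy,4.5,1200\n'
    'missed,missed,missed,0,0\n'
)
print(count_missed(sample))
```

To apply it to the real output, pass an open file handle instead, for example `open('vivino_redwine1.csv', encoding='utf-8', newline='')`.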
