# Project: Wine

## Ideas
* Again, look at a word cloud of the words used to describe wines, and build a small glossary for those who want to shine in polite society.
* There are only 20 unique tasters; we could check which of them predicts price best.
* Relationship between price and scores. It would be interesting to see whether variance grows with price (i.e. cheap wines are clear-cut, while expensive ones are more controversial).
* Word clouds of `description` for any slice at all: by taster, by variety, by region.
* Prices by region.
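
The price and score variance idea could be checked with a grouped variance. The sketch below is only an illustration: the table is made up (not taken from the datasets), and the two price buckets and their threshold are arbitrary assumptions. `points` and `price` are the column names used in the review files.

```python
import pandas as pd

# Made-up sample; real data would come from the winemag CSV files.
sample = pd.DataFrame({
    "price":  [10, 12, 15, 200, 250, 300],
    "points": [85, 86, 85, 88, 95, 91],
})

# Arbitrary two-bucket split; the threshold of 50 is an assumption.
sample["price_bucket"] = pd.cut(
    sample["price"], bins=[0, 50, 1000], labels=["cheap", "expensive"]
)

# Sample variance of the scores inside each price bucket.
variance_by_bucket = sample.groupby("price_bucket", observed=True)["points"].var()
print(variance_by_bucket)
```

On real data, a finer split (for example quantile-based buckets) would probably be more informative than two hand-picked bins.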

In [2]:
# main requirements

import pandas as pd
import numpy as np

### A. Working with ready datasets

In [3]:
data1 = pd.read_csv('winemag-data_first150k.csv')
data2 = pd.read_csv('winemag-data-130k-v2.csv')
data = [data1, data2]

In [4]:
for d in data:
    print(d.columns)
    display(d.head())

Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
       'price', 'province', 'region_1', 'region_2', 'variety', 'winery'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


Index(['Unnamed: 0', 'country', 'description', 'designation', 'points',
       'price', 'province', 'region_1', 'region_2', 'taster_name',
       'taster_twitter_handle', 'title', 'variety', 'winery'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [5]:
data2.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

### B. Manually parsing vivino.com

### 1. Beautiful Soup

We are trying to drill down to see individual wine properties in the explore page. The HTML tree classes of the original page include:
* `body class="inner-page"`
* `div class="wrap"`
* `div id="explore-page-app"`
* `div class="explorerPage__explorePage--26aGH layout__outer--S05yQ"`
* `div class="layout__inner--3JC-x"`
* `div class="explorerPage__columns--1TTaK"`
* `div class="explorerPage__results--3wqLw"`
* `div class="explorerCard__explorerCard--3Q7_0 explorerPageResults__explorerCard--3q6Qe"`
* etc

Eventually, we need the element with the class `"anchor__anchor--2QZvA"` to be visible in order to get the link to the page of each wine.

In [6]:
import requests
from bs4 import BeautifulSoup


headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Host": "www.vivino.com",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Accept-Language": "en-gb",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive"    
}

response = requests.get("https://www.vivino.com/explore?e=eJwNyb0KgCAYBdC3ubNB611aWtsj4stMBH9CzertaznLCZkdgotUCPKwVwr65ThA_0w4_22a8fIe9mCT7EwVj7QxS3XRllWayWINEndTNO46L-w-zyUdWg==", headers=headers)
content = response.content

parser = BeautifulSoup(content, 'html.parser')
body = parser.body
# print(body)
wine_titles = body.find_all(class_="vintageTitle__wine--U7t9G")
for title in wine_titles:
    type(title.text)
# type(wine_titles)
wine_titles

[]

#### Observation on Beautiful Soup

Plain Beautiful Soup does not work here: vivino.com builds the wine list dynamically with JavaScript, so the individual list elements are not present in the HTML that `requests` receives and cannot be parsed without executing JavaScript.

The deepest element of the HTML page that can be reached with Beautiful Soup is the container with id `"explore-page-app"`.
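
The effect can be reproduced offline. The sketch below parses a hand-written HTML shell, which is an assumption standing in for what the server returns before any JavaScript runs: the container is found, but no wine titles are.

```python
from bs4 import BeautifulSoup

# Hypothetical static response: the app container exists but is still empty,
# because its children are rendered later by JavaScript in the browser.
static_html = """
<html><body class="inner-page">
  <div class="wrap"><div id="explore-page-app"></div></div>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
container = soup.find(id="explore-page-app")
titles = soup.find_all(class_="vintageTitle__wine--U7t9G")

print(container is not None)  # whether the container is present
print(len(titles))            # number of title elements in the static markup
```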

### 2. Selenium web driver

First, I considered using PhantomJS to render web pages that rely on JavaScript, see details [here](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python).
However, after some attempts it turned out that PhantomJS support has been deprecated, and the developers of Selenium suggest using headless versions of Chrome or Firefox instead
(see details [here](https://stackoverflow.com/questions/50416538/python-phantomjs-says-i-am-not-using-headless)).


In [7]:
from selenium import webdriver

In [8]:
# driver = webdriver.PhantomJS("/Users/sveta/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs")
# response_selenium = driver.get("https://www.vivino.com/explore?e=eJwNyb0KgCAYBdC3ubNB611aWtsj4stMBH9CzertaznLCZkdgotUCPKwVwr65ThA_0w4_22a8fIe9mCT7EwVj7QxS3XRllWayWINEndTNO46L-w-zyUdWg==")
# response_selenium
# a_elements = response_selenium.find_elements_by_tag_name("a")
# for el in a_elements:
#     print(el)

Below is the code required to initialize a web driver.

Everything is done through the WebDriver instance. The `find_element(By)` and `find_element_by_id` methods return a different type of object, the WebElement.
See documentation [here](https://www.selenium.dev/documentation/en/getting_started_with_webdriver/locating_elements/).
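
The `find_element_by_*` helpers used in this notebook were deprecated in Selenium 4 and are absent from later releases, where `find_elements(by, value)` is the remaining form. A minimal sketch of a compatibility helper follows, assuming only that the driver exposes one of the two APIs. `find_by_class` and the two stub drivers are hypothetical names for illustration; `"class name"` is the string value behind `By.CLASS_NAME`.

```python
def find_by_class(driver, class_name):
    """Return all elements with the given class on old or new Selenium."""
    legacy = getattr(driver, "find_elements_by_class_name", None)
    if legacy is not None:
        return legacy(class_name)
    # "class name" is the locator string behind By.CLASS_NAME.
    return driver.find_elements("class name", class_name)


# Stub drivers standing in for real WebDriver instances (illustration only).
class LegacyDriver:
    def find_elements_by_class_name(self, class_name):
        return ["legacy:" + class_name]


class ModernDriver:
    def find_elements(self, by, value):
        return [by + ":" + value]


print(find_by_class(LegacyDriver(), "winery"))
print(find_by_class(ModernDriver(), "winery"))
```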

In [9]:
# Initialize web driver

def initialize_chrome_driver(long_screen=False):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15')
    options.add_argument("start-maximized")
    if long_screen:
        options.add_argument("--window-size=1920x10800")
    else:
        options.add_argument("--window-size=1920x1080")
    # options.add_argument("disable-infobars")
    # options.add_argument("--no-sandbox")
    # options.add_argument("--disable-extensions")
    # options.add_argument("--disable-dev-shm-usage")
    # options.add_argument('--lang=en')
    # options.add_argument('--incognito')

    browser = webdriver.Chrome("/Users/sveta/Documents/Data analysis/Vivino-project/lib/chromedriver.uu", options=options)
    return browser

### Parsing the wine list (no scroll)

In [10]:
# script to check functionality of google.com
# browser.get('http://www.google.com/xhtml')
# res = browser.find_element_by_id("SIvCob")

To check whether the explore page is loaded correctly, a screenshot is saved to the working folder.

After several attempts to load the web page, I got a response saying that my IP had been temporarily blocked for exceeding bulk request limits. Therefore, I added some options to the web driver, such as a **User agent** that does not reveal that Chrome is running headless.

In [34]:
# get the screenshot of the explore page

browser = initialize_chrome_driver()
test_explore_page = "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1MTBQS660dXdSSwYSAWoFQNn0NNuyxKLM1JLEHLX8JNuixJLMvPTi-MSy1KLE9FS1fNuU1OJktfKS6FhbQwDu-xpj"
browser.get(test_explore_page)
# res = browser.find_element_by_class_name("inner-page")
browser.get_screenshot_as_file('test_screenshot.png')
# browser.text

True

Once the page loads correctly, we try to get the individual wine properties.

They can be found in elements with the class `"anchor__anchor--2QZvA"`.
Each such element carries the link that leads to the individual web page of that wine.

We save each link to a Python list.
However, the page also contains links that are unnecessary for our study (such as wine-regions, wine-countries, etc.). Those are explicitly excluded.

In [37]:
def get_list_no_scroll(browser, page, class_name="anchor__anchor--2QZvA"):
    browser.get(page)
    results_list = browser.find_elements_by_class_name(class_name)
    wine_pages_list = []
    for el in results_list:
        cur_link = el.get_property("href")
        if cur_link.startswith('https://www.vivino.com/wine-countries/')\
        or cur_link.startswith('https://www.vivino.com/wine-regions/')\
        or cur_link.startswith('https://www.vivino.com/redirect/')\
        or cur_link.startswith('https://instagram')\
        or cur_link.startswith('https://facebook')\
        or cur_link.startswith('https://twitter'):
            continue
        else:
            wine_pages_list.append(cur_link)
    return wine_pages_list
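
The chain of `startswith` checks in the filter above could be condensed, since `str.startswith` accepts a tuple of prefixes. The sketch below is one possible shape, not the notebook's actual code: `EXCLUDED_PREFIXES` and `is_wine_link` are hypothetical names, the region link is made up, and the guard for a missing `href` assumes that such anchors may yield `None` or an empty string.

```python
# Prefixes of links that are not individual wine pages (same list as in the filter above).
EXCLUDED_PREFIXES = (
    "https://www.vivino.com/wine-countries/",
    "https://www.vivino.com/wine-regions/",
    "https://www.vivino.com/redirect/",
    "https://instagram",
    "https://facebook",
    "https://twitter",
)


def is_wine_link(link):
    """True for a non-empty link that matches none of the excluded prefixes."""
    # A missing href would otherwise break the startswith() call.
    return bool(link) and not link.startswith(EXCLUDED_PREFIXES)


candidates = [
    "https://www.vivino.com/kaiken-sa-malbec-reserva/w/1152375?year=2015&cart_item_source=",
    "https://www.vivino.com/wine-regions/example",  # hypothetical region link
    None,
]
wine_links = [link for link in candidates if is_wine_link(link)]
print(wine_links)
```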

In [68]:
browser = initialize_chrome_driver(long_screen=True)
wine_pages_list_v2 = get_list_no_scroll(browser, test_explore_page)
# len(wine_pages_list_v2)

[]


In [61]:
# This is a temporary solution to get the first 25 links, since sometimes the page is not loading fast enough and does not gain any useful links, so we use results that we already had before
wine_pages_list_v2 = wine_pages_list
len(wine_pages_list_v2)

25

We can see that even with a JavaScript-enabled web driver, the wines are not all loaded simultaneously, and the search returns only a few elements. Therefore, we need to implement scrolling.
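
Instead of reusing an earlier result when a page load returns nothing, the request could be retried a few times. This is a sketch of that alternative, not what the notebook does: `retry_until_nonempty` is a hypothetical helper, and the stand-in `fetch` function below only imitates a call such as `get_list_no_scroll(browser, test_explore_page)`.

```python
import time


def retry_until_nonempty(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns a non-empty result or attempts run out."""
    result = []
    for attempt in range(attempts):
        result = fetch()
        if result:
            break
        if attempt < attempts - 1:
            # Wait a little longer after every empty answer.
            time.sleep(delay * (attempt + 1))
    return result


# Stand-in for the real page load: empty twice, then one link.
answers = iter([[], [], ["https://www.vivino.com/a/w/1"]])
links = retry_until_nonempty(lambda: next(answers), attempts=3, delay=0)
print(links)
```

Given the request limits mentioned earlier, the number of attempts should probably stay small.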

### Parsing a wine list (with JavaScript scrolling)

It appears that we need a scroller to get the whole list of wines, since they are not all loaded simultaneously.

First, to implement a scroller, we need to find an element located at the bottom of the page and run a script until that element becomes visible.

We try to find an element with the class `"addWidgetLink__addWidgetLink--aPZ_V"` to indicate the page bottom, and see if the scrolling works.

In [14]:
# scroll to the bottom of a page and wait until the page-end element becomes visible

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def scroll_until_page_bottom(browser, page, page_end_class="addWidgetLink__addWidgetLink--aPZ_V"):
    browser.get(page)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(browser, 50).until(EC.visibility_of_element_located((By.CLASS_NAME, page_end_class)))
    except TimeoutException:
        print("exception raised")

In [15]:
# trying to extract more wines 

# red wines with ratings of 4.5+ and price above 400 
explore_link_1 = "https://www.vivino.com/explore?e=eJzLLbI10TNVy83MszUxMFDLTawA08mVtu5OaslAIkCtwNZQLT3NtiyxKDO1JDFHLT_JtiixJDMvvTg-sSy1KDE9VS3fNiW1OFmtvCQ61tYQADLIGy0="
# dessert wines with ratings of 4.5+ and price above 25 
explore_link_2 = "https://www.vivino.com/explore?e=eJwNxbEKgCAUBdC_eWNYGE13aWltj4iXmQipoWL197mc4yJk05OzHl2NX0ghSH2YRlKVmW60ZE4UjlZnvijsiJytN2njoiMbTQGHToqevKwYfiK_GwY="

scroll_until_page_bottom(browser, explore_link_1)
results_list_1 = browser.find_elements_by_class_name("anchor__anchor--2QZvA")

scroll_until_page_bottom(browser, explore_link_2)
results_list_2 = browser.find_elements_by_class_name("anchor__anchor--2QZvA")

wine_count_1 = len(results_list_1)
wine_count_2 = len(results_list_2)
print(wine_count_1)
print(wine_count_2)

exception raised
exception raised
105
105


In [16]:
# browser.get_screenshot_as_file('scroll_screen.png')

Scrolling until a certain element becomes visible does not seem to work properly: both waits above timed out.
Therefore, the next attempt is to scroll the page down a certain number of times and wait a bit after each scroll (to allow the page to load more wines). This solution is inspired by [this article](https://dev.to/mr_h/python-selenium-infinite-scrolling-3o12).

We write a function that scrolls down with a sleep timer (2 s) after each scroll, collects the web elements that have a certain class (`"anchor__anchor--2QZvA"`), and returns a list of their links.

To limit the number of elements, we can either manually restrict the length of the resulting list (using the variable `total_wine_num`) or target the overall number of matching wines, which is shown at the top of the explore page.

Also, since the page loads only a few new elements with each scroll, we check the web elements only every 10 scrolls to increase efficiency (hence the iteration counter).
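
A loop that only stops at a target count never ends when the page holds fewer wines than the target. The sketch below shows one way to bound it, with a scroll budget and a stall check; it is an assumption about how the loop could be hardened, not the notebook's function. `collect_until_stalled` and `FakePage` are hypothetical, and the `scroll` and `collect` callables stand in for the Selenium calls.

```python
def collect_until_stalled(scroll, collect, target, max_scrolls=200, check_every=10):
    """Scroll, collect links every few scrolls, stop on target, stall, or budget."""
    links, seen = [], set()
    for step in range(1, max_scrolls + 1):
        scroll()
        if step % check_every:
            continue  # only inspect the page every `check_every` scrolls
        before = len(links)
        for link in collect():
            if link not in seen:
                seen.add(link)
                links.append(link)
        # Stop when the target is reached or a whole check added nothing new.
        if len(links) >= target or len(links) == before:
            break
    return links[:target]


class FakePage:
    """Imitates a page that reveals a few more links per scroll, up to a total."""

    def __init__(self, total, per_scroll):
        self.total, self.per_scroll, self.loaded = total, per_scroll, 0

    def scroll(self):
        self.loaded = min(self.total, self.loaded + self.per_scroll)

    def collect(self):
        return ["wine-%d" % i for i in range(self.loaded)]


page = FakePage(total=40, per_scroll=3)
links = collect_until_stalled(page.scroll, page.collect, target=50)
print(len(links))
```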

In [17]:
import time 

def scroll_load_scroll(driver, page, timeout=2, class_name="anchor__anchor--2QZvA"):
    
    # Open the explore page
    driver.get(page)
    
    # extract the total number of wines with a given search criteria
#     time.sleep(timeout)
#     total_wine_string = browser.find_element_by_class_name("querySummary__querySummary--39WP2").text
#     total_wine_num = int(total_wine_string.split()[1])

    total_wine_num = 50
    
    results_list = []
    count_iter = 0

    while len(results_list) < total_wine_num:
        
#         browser.get_screenshot_as_file('scroll_screen_' + str(count_iter) +'.png')
#         print("gotcha!")

        if count_iter % 10 == 0:

            timepoint_1 = time.time()

            all_results = driver.find_elements_by_class_name(class_name)

            timepoint_2 = time.time()
            print("time to parse elements: " + str(timepoint_2 - timepoint_1))

            # get the link and check whether it meets the required criteria in order to be included to the list 
            for el in all_results:
                cur_link = el.get_property("href")
                if cur_link.startswith('https://www.vivino.com/wine-countries/')\
                or cur_link.startswith('https://www.vivino.com/wine-regions/')\
                or cur_link.startswith('https://www.vivino.com/redirect/')\
                or cur_link.startswith('https://instagram')\
                or cur_link.startswith('https://facebook')\
                or cur_link.startswith('https://twitter')\
                or cur_link in results_list:
                    continue
                else:
                    results_list.append(cur_link)
        
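        timepoint_3 = time.time()
        if count_iter % 10 == 0:
            # the list is only updated on every 10th iteration, so only then is timepoint_2 fresh
            print("time to update list of links: " + str(timepoint_3 - timepoint_2))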
        
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(timeout)
        
        timepoint_4 = time.time()
        print("time to scroll and wait: " + str(timepoint_4 - timepoint_3))
        
        count_iter += 1
        
    print("list has {} elements".format(len(results_list)))
    return results_list

The `scroll_load_scroll` solution above, which collects elements after each scroll (or every 10 scrolls), is rather slow. As an alternative, the `scroll_and_load` function takes the number of scrolls as an argument and performs all of them before collecting the elements.

In [18]:
def scroll_and_load(driver, page, timeout=1, scrolls=60, class_name="anchor__anchor--2QZvA"):
    timepoint_0 = time.time()
    results_list = []
    driver.get(page)
    timepoint_1 = time.time()
    for _ in range(scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(timeout)
    timepoint_2 = time.time()
#     print("time to scroll and wait: {} s.".format(timepoint_2 - timepoint_1))
    loaded_elements = driver.find_elements_by_class_name(class_name)
    timepoint_3 = time.time()
#     print("time to load elements: {} s.".format(timepoint_3 - timepoint_2))
    for el in loaded_elements:
        cur_link = el.get_property("href")
        if cur_link.startswith('https://www.vivino.com/wine-countries/')\
        or cur_link.startswith('https://www.vivino.com/wine-regions/')\
        or cur_link.startswith('https://www.vivino.com/redirect/')\
        or cur_link.startswith('https://instagram')\
        or cur_link.startswith('https://facebook')\
        or cur_link.startswith('https://twitter')\
        or cur_link in results_list:
            continue
        else:
            results_list.append(cur_link)
    timepoint_4 = time.time()
#     print("time to extract the list of links: {} s.".format(timepoint_4 - timepoint_3))
    print("total time elapsed: {} s.".format(timepoint_4 - timepoint_0))
    print("list has {} elements".format(len(results_list)))
    if results_list:
        # skip the average when nothing was collected, to avoid dividing by zero
        print("average time per element is {} s".format((timepoint_4 - timepoint_0)/len(results_list)))
    return results_list

In [19]:
browser = initialize_chrome_driver(long_screen=False)

In [20]:
# some other links for testing

# all fortified wines with a rating above 4.5 (the overall result should be 423)
explore_link_3 = "https://www.vivino.com/explore?e=eJwNxL0KgCAYBdC3uWNY2HiXltb2iPgyEyE1TPp5-zrDCZm6ahF8pEKQh1opmJd9B_M34GANt_GS7G2RHWlhluKjO2e5bBZnkbja0-Au48RGfyshGv4="
# all red wines, sorted by rating
explore_link_4 = "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1MTBQS660dXdSSwYSAWoFQNn0NNuyxKLM1JLEHLX8JNuixJLMvPTi-MSy1KLE9FS1fNuU1OJktfKS6FhbQwDu-xpj"
# all red wines, sorted by popularity
explore_link_5 = "https://www.vivino.com/explore?e=eJzLLbI1VMvNzLM1UMtNrLA1MTBQS660dXdSSwYSAWoFQNn0NNuyxKLM1JLEHLX8JNuixJLMvPTi-OT80rwStXzblNTiZLXykuhYW0MAulcZsQ=="

In [21]:
list_with_scroll = scroll_and_load(browser, explore_link_5)

total time elapsed: 131.5886869430542 s.
list has 1425 elements
average time per element is 0.09234293820565206 s


In [22]:
list_with_scroll[-1]

'https://www.vivino.com/kaiken-sa-malbec-reserva/w/1152375?year=2015&cart_item_source='

The above strategy (60 scrolls, each with a 1 s timeout) collected 1425 elements. Below we experiment with the number of scrolls and the timeout value to see whether they speed up the loading of new elements. Since the total time and the list size differ between experiments, the average loading time per element is the figure to compare when judging whether a strategy is efficient.

In [23]:
browser = initialize_chrome_driver()
list_with_scroll = scroll_and_load(browser, explore_link_5, scrolls=100)

total time elapsed: 158.18566989898682 s.
list has 1275 elements
average time per element is 0.12406719207763672 s


In [24]:
browser = initialize_chrome_driver()
list_with_scroll = scroll_and_load(browser, explore_link_5, timeout=0.5)

total time elapsed: 59.10882592201233 s.
list has 749 elements
average time per element is 0.07891699055008322 s


In [25]:
browser = initialize_chrome_driver()
list_with_scroll = scroll_and_load(browser, explore_link_5, timeout=0.5, scrolls=100)

total time elapsed: 113.08866906166077 s.
list has 1275 elements
average time per element is 0.08869699534247903 s


We can see that a timeout of 0.5 s with 100 scrolls loads elements relatively quickly and yields relatively many results. We will use a 0.5 s timeout with 200 scrolls from here on and apply it to various wine groups, red vs. white and further divided by country (the white-wine run below uses a 1 s timeout and 150 scrolls instead).

To use this approach, we feed the explore page for each country and wine type (sorted in descending order of popularity) to our program and run the web crawler through each of them. The results are saved in a list.
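
The four timing experiments above can be put side by side. The sketch below recomputes the time per link from the totals printed in the outputs (rounded to two decimals); `runs` and `seconds_per_link` are names chosen for illustration.

```python
# (scrolls, timeout in s, total time in s, links collected) from the runs above.
runs = [
    (60, 1.0, 131.59, 1425),
    (100, 1.0, 158.19, 1275),
    (60, 0.5, 59.11, 749),
    (100, 0.5, 113.09, 1275),
]


def seconds_per_link(run):
    """Average time spent per collected link for one run."""
    _, _, total_time, link_count = run
    return total_time / link_count


# Print the runs from the most to the least efficient.
for scrolls, timeout, total_time, link_count in sorted(runs, key=seconds_per_link):
    print(f"scrolls={scrolls:>3} timeout={timeout} "
          f"links={link_count:>4} s/link={total_time / link_count:.4f}")
```

Ranked by time per link, the 0.5 s timeout with 60 scrolls comes first but collects only 749 links, while 0.5 s with 100 scrolls is second and collects 1275.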

In [26]:
argentina_red = "https://www.vivino.com/explore?e=eJwNirsOgCAMAP-mMyauXVxc3Y0xtSIhETClPvh7u9wNd0mwgxQzOkj0Ye8ccMNxADZMcFkNBz4k0SudUDYU0phDXbncWaHg7ivDq_NiK7dqJvkBsTQc7g=="
australia_red = "https://www.vivino.com/explore?e=eJwNijsKgDAQBW_z6gi229jY2ovIumoImETyUXN7t5kpZnyiDt4FMvD8UW8MpNE4QBQTbq32pIeTOwpfiBslLi7YvEqsoSDSfmTBW-ZFV2lZzfUHsTcc8Q=="
austria_red = "https://www.vivino.com/explore?e=eJwNijsOgCAQBW_zakxst7GxtTfGrCsSEgED64fbSzNTzIRMHYKPZBD4o94YSKVxgDRMuFp1Bz2cvVU-kTbKrD66skq6oyLRbovg1Xlpq9TSzPoDsTYc8A=="
chile_red = "https://www.vivino.com/explore?e=eJwNijsKgDAQBW_z6gi229jY2otIXGMI5CPJ-ru928wUM6lShxQyGST7Um8M-KNxACsmnFr9QbetwYmNKBtVKyH7tnK5sqDQ7hrjkXnRlb-m5vgDsTIc6g=="
france_red = "https://www.vivino.com/explore?e=eJwNirsOgCAMAP-mMyauXVxc3Y0xtQIhETClvv7eLnfDXRbsIKeCDjK92DsH_OE4ABsmOK3GgDdJ8koH1A2FNJXYVq5XUai4-8bw6LzYyl8zB_kBsT4c8w=="
germany_red = "https://www.vivino.com/explore?e=eJwNijsOgCAQBW_zakxst7GxtTfG4IqERMDA-uH2bjNTzMRCHWJIZBDtR70x4EbjAFZMuLT6gx5bghN7Im9UrITk68r5ToJMu6uMV-ZFV25VvbsfsS0c5A=="
italy_red = "https://www.vivino.com/explore?e=eJwNirsOgCAMAP-mMyauXVxc3Y0xtSJpImCgPvh7u9wNd7FgB1ESOoj0Ye8ccMNxADZMcFkNBz5UxCudkDcspJJCXTnfSSHj7ivDq_NiK7dqFv0BsUYc-A=="
portugal_red = "https://www.vivino.com/explore?e=eJwNijsKgDAQBW_z6gi229jY2otIXGMImA_J-ru928wUM7FShxgSGUT7Um8M-KNxACsmFK3-oNvW4MSeyBtVKyH5tnK-kiDT7hrjkXnRlb-mLvIDsVQc_w=="
spain_red = "https://www.vivino.com/explore?e=eJwNirsOgCAMAP-mMyauXVxc3Y0xtSIhETC0Pvh7u9wNd6liBylmdJDow9454IbjAGyY4LIaDnyoRq90QtmwksYcZOVyZ4WCuxeGV-fFVm5i9vIDsT0c8w=="
usa_red = "https://www.vivino.com/explore?e=eJwNijEOgCAMAH_TGRPXLi6u7saYWpWQCBhaVH5vl7vhLhbsIIaEDiJ92DsH3HAcgA0T3Fb9iQ-VcChdkDcspCF5WTnXpJBxP4Th1XmxlZuYq_yxXR0D"

full_red_list = []
length_list = []

browser = initialize_chrome_driver()
country_pages = [argentina_red, australia_red, austria_red, chile_red, france_red, germany_red, italy_red, portugal_red, spain_red, usa_red]
for page in country_pages:
    country_list = scroll_and_load(browser, page, timeout=0.5, scrolls=200)
    length_list.append(len(country_list))
    full_red_list.append(country_list)
    print("{} finished and gained {} links".format(page, len(country_list)))

total time elapsed: 350.8615720272064 s.
list has 2027 elements
average time per element is 0.173094016786979 s
https://www.vivino.com/explore?e=eJwNirsOgCAMAP-mMyauXVxc3Y0xtSIhETClPvh7u9wNd0mwgxQzOkj0Ye8ccMNxADZMcFkNBz4k0SudUDYU0phDXbncWaHg7ivDq_NiK7dqJvkBsTQc7g== finished and gained 2027 links
total time elapsed: 372.25341987609863 s.
list has 2028 elements
average time per element is 0.1835569131538948 s
https://www.vivino.com/explore?e=eJwNijsKgDAQBW_z6gi229jY2ovIumoImETyUXN7t5kpZnyiDt4FMvD8UW8MpNE4QBQTbq32pIeTOwpfiBslLi7YvEqsoSDSfmTBW-ZFV2lZzfUHsTcc8Q== finished and gained 2028 links
total time elapsed: 429.04940485954285 s.
list has 2025 elements
average time per element is 0.2118762493133545 s
https://www.vivino.com/explore?e=eJwNijsOgCAQBW_zakxst7GxtTfGrCsSEgED64fbSzNTzIRMHYKPZBD4o94YSKVxgDRMuFp1Bz2cvVU-kTbKrD66skq6oyLRbovg1Xlpq9TSzPoDsTYc8A== finished and gained 2025 links
total time elapsed: 386.37881684303284 s.
list has 2026 elements
average time per element is 0.1907101761

In [27]:
argentina_white = "https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISYy0F20oaX39vlrvhuCTYQIoZHSR6sXMO-MOhBzaMcFoNO94k0SsdUFYU0phDXbhcWaHg5ivDo9OMra3VTPIDsT4c7w=="
australia_white = "https://www.vivino.com/explore?e=eJwNizsOgCAQBW_zajS229jY2htj1lUJiYCBxc_tpZkpJuMTNfAukIHnlzpjIB8NPaRixFWrPejm5HblE3GlxOqCzYvEEhSRtj0LHp1mauuaq7n8sUEc8g=="
austria_white = "https://www.vivino.com/explore?e=eJwNizsOgCAQBW_zajS229jY2htj1hUJiYCB9Xd7aWaKyYRMDYKPZBD4pc4YyEdDD6kYcdbqdro5e6t8IK2UWX10ZZF0RUWizRbBo9NMbV1LNesPsUAc8Q=="
chile_white = "https://www.vivino.com/explore?e=eJwNizsKgDAQBW_z6ii229jY2ovIumoI5CPJ-ru9aWaKYUKmBsFFMgj8UmcM5KOhh1SMOGu1B92c3a7skVbKrC7aski6oiLRthfBo9NMbV1LtfgfsTwc6w=="
france_white = "https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISoy0F20oaX39vlrvhuCTYQIoZHSR6sXMO-MOhBzaMcFoNHm-SuCsdUFYU0phDXbhcWaHgtleGR6cZW1ur2csPsUgc9A=="
germany_white = "https://www.vivino.com/explore?e=eJwNizsOgCAQBW_zajS229jY2htj1hUJiYCB9Xd7aWaKyYRMDYKPZBD4pc4YyEdDD6kYcdbqdro5e6t8IK2UWX10ZZF0RUWizRbBo9NMbV1L9WZ_sTcc5Q=="
italy_white = "https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISYy0B20obX39vlrvhuFiwgSgJHUR6sXMO-MOhBzaMcFoNO95UxCsdkFcspJJCXThfSSHj5ivDo9OMra3VLPoDsVAc-Q=="
portugal_white = "https://www.vivino.com/explore?e=eJwNizsKgDAQBW-zdRTb19jY2ovIumoImA_J-ru9aWaKYXxGQ94FGPL8ojOG5MPQk1SMlGq1B27Oblc-Ka7IrC7Yski8glLEthehR6cZbV1LddIfsV4dAA=="
spain_white = "https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISYy0F20oTX39vlrvhuFSxgRQzOkj0Yucc8IdDD2wY4bQadrypRq90QFmxksYcZOFyZYWCmxeGR6cZW1vF7OUHsUcc9A=="
usa_white = "https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISo5aCbaVJffy9We6G42LBBmJI6CDSi51zwB8OPbBhhMuqP_CmEnalE_KKhTQkLwvnmhQybrswPDrN2Noq5io_sWcdBA=="

full_white_list = []
white_length_list = []

browser = initialize_chrome_driver()
country_white_pages = [argentina_white, australia_white, austria_white, chile_white, france_white, germany_white, italy_white, portugal_white, spain_white, usa_white]
for page in country_white_pages:
    country_list = scroll_and_load(browser, page, timeout=1, scrolls=150)
    white_length_list.append(len(country_list))
    full_white_list.append(country_list)
    print("{} finished and gained {} links".format(page, len(country_list)))

total time elapsed: 232.58302402496338 s.
list has 1603 elements
average time per element is 0.14509234187458725 s
https://www.vivino.com/explore?e=eJwNi7sKgDAMAP8mcxXXLC6u7iISYy0F20oaX39vlrvhuCTYQIoZHSR6sXMO-MOhBzaMcFoNO94k0SsdUFYU0phDXbhcWaHg5ivDo9OMra3VTPIDsT4c7w== finished and gained 1603 links
total time elapsed: 263.62972378730774 s.
list has 1754 elements
average time per element is 0.15030200900074558 s
https://www.vivino.com/explore?e=eJwNizsOgCAQBW_zajS229jY2htj1lUJiYCBxc_tpZkpJuMTNfAukIHnlzpjIB8NPaRixFWrPejm5HblE3GlxOqCzYvEEhSRtj0LHp1mauuaq7n8sUEc8g== finished and gained 1754 links
total time elapsed: 254.5235240459442 s.
list has 1674 elements
average time per element is 0.1520451159175294 s
https://www.vivino.com/explore?e=eJwNizsOgCAQBW_zajS229jY2htj1hUJiYCB9Xd7aWaKyYRMDYKPZBD4pc4YyEdDD6kYcdbqdro5e6t8IK2UWX10ZZF0RUWizRbBo9NMbV1LNesPsUAc8Q== finished and gained 1674 links
total time elapsed: 246.8078384399414 s.
list has 1625 elements
average time per element is 0.15188174

Check the resulting sample size for red and white wines:

In [28]:
records = 0
for i in full_red_list:
    records += len(i)
print('our search gained {} red wine records'.format(records))

our search gained 18377 red wine records


In [29]:
records = 0
for i in full_white_list:
    records += len(i)
print('our search gained {} white wine records'.format(records))

our search gained 16612 white wine records
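
The counts above sum the per-country lists as they are. Whether any link appears in more than one list is not checked in the notebook; a quick way to check would be to flatten the lists and compare against a set. The lists below are made up for illustration; the real inputs would be `full_red_list` or `full_white_list`.

```python
from itertools import chain

# Hypothetical per-country lists with one link repeated across countries.
per_country = [
    ["https://www.vivino.com/a/w/1", "https://www.vivino.com/b/w/2"],
    ["https://www.vivino.com/b/w/2", "https://www.vivino.com/c/w/3"],
]

total_records = sum(len(links) for links in per_country)
unique_links = set(chain.from_iterable(per_country))

print(total_records, len(unique_links))
```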


Save the results to files (for convenience, using the [pickle library](https://docs.python.org/3/library/pickle.html)).

In [31]:
import pickle
with open("popular_reds_sample", 'wb') as f:
    pickle.dump(full_red_list, f)

In [32]:
with open("popular_whites_sample", 'wb') as f:
    pickle.dump(full_white_list, f)
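
Reading a saved list back is the mirror image of the cells above. The sketch below does a full round trip in a temporary folder so that it does not touch the real files; the sample list is made up.

```python
import os
import pickle
import tempfile

# Hypothetical nested list shaped like full_red_list.
sample = [["https://www.vivino.com/a/w/1"], ["https://www.vivino.com/b/w/2"]]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "popular_reds_sample")
    with open(path, "wb") as f:
        pickle.dump(sample, f)
    # Files must be opened in binary mode for both dump and load.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample)
```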

### Parsing the properties of each wine from the list 

We try to extract the following information for each wine: 
- id:
    * wine name (this usually also includes the year) (`class="wine"`)
- properties:
    * winery (`class="winery"`)
    * wine type (`class="wineLocationHeader__wineType--14nrC"`)
    * grapes (`class="wineFacts__fact--3BAsi"`) 
    * wine style (`class="wineFacts__fact--3BAsi"`) 
    * region (`class="anchor__anchor--3DOSm"`)
    * country (`class="wineLocationHeader__country--1RcW2"`)
- quality information:
    * number of reviews (`class="vivinoRatingWide__basedOn--s6y0t"`)
    * average price (`class="purchaseAvailabilityPPC__amount--2_4GT"`)
    * rating score (`class="vivinoRatingWide__averageValue--1zL_5"`)
    * 3 random community reviews (`class="reviewCard__reviewNote--fbIdd"`)
- taste structure:
    * there are 4 progress bars: light/bold, smooth/tannic, dry/sweet, soft/acidic 
    * each taste progressbar has the following class: `class="indicatorBar__progress--3aXLX"` with the following style properties identifying the bar location: `style="width: 15%; left: 85%;"` 
    * notes mention (`class="tasteNote__popularKeywords--1q7RG"`) 
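
The `style` string of a taste bar has to be turned into numbers before it can be analysed. A minimal sketch, assuming the string always has the `width: N%; left: M%;` shape quoted above; `parse_bar_style` is a hypothetical helper, and missing values are returned as NaN.

```python
import re


def parse_bar_style(style):
    """Extract the width and left offset (in percent) from a style string."""
    values = {
        name: float(number)
        for name, number in re.findall(r"(width|left):\s*([\d.]+)%", style)
    }
    nan = float("nan")
    return values.get("width", nan), values.get("left", nan)


width, left = parse_bar_style("width: 15%; left: 85%;")
print(width, left)
```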
    
Some of these elements are not visible until the page is scrolled down to the end. Therefore, just as with the wine list, we may need to scroll down, but this time without a pause. Instead, we scroll until a specific element at the bottom becomes visible, which should ensure that all other elements are loaded too. The bottom element chosen for this purpose has the class `addWidgetLink__addWidgetLink--aPZ_V`. 

In [41]:
# scroll until the end of a specific wine page 

def scroll_until_the_end_of_wine_page(browser, page, page_end_class="addWidgetLink__addWidgetLink--aPZ_V"):
    browser.get(page)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        WebDriverWait(browser, 50).until(EC.visibility_of_element_located((By.CLASS_NAME, page_end_class)))
    except TimeoutException:
        print("exception raised")

Once the whole wine page is loaded, we need to check the existence of each item required for the analysis.

When parsing the data, note that each page item has either a unique class name (in which case a single element can be extracted with `browser.find_element_by_class_name()`) or a generic class name shared by several elements (in which case the list of elements is extracted with `browser.find_elements_by_class_name()` and then indexed to get the desired data).

Generally, we are interested in the text of a given web element. In specific cases, however (e.g. the wine taste structure, which is shown as a slider bar on the vivino web page), we are interested in the element's attributes instead.

Some of the required information may be missing from a wine page. In such cases no exception should be raised; instead, the program should record `np.nan` as the value.

To handle these specifics, we introduce the function `extract_data_with_exceptions`, which takes two optional boolean arguments: `multiple_elements` tells whether a given class has more than one element on a page, and `get_style` tells whether we need the `style` attribute of a web element instead of its text.


In [42]:
# extract either a single element, or a list of multiple elements
def extract_data_with_exceptions(browser, class_name, multiple_elements=False, get_style=False):
    if multiple_elements:
        try:
            list_with_data = []
            data_list = browser.find_elements_by_class_name(class_name)
            for element in data_list:
                if get_style:
                    list_with_data.append(element.get_attribute("style"))
                else:
                    list_with_data.append(element.text)
        except Exception:
            # fall back to an empty list so that the callers' len() checks yield np.nan
            list_with_data = []
        return list_with_data
    else:
        try:
            if get_style:
                data = browser.find_element_by_class_name(class_name).get_attribute("style")
            else:
                data = browser.find_element_by_class_name(class_name).text
        except Exception:
            data = np.nan
        return data

The data will be stored in a pandas DataFrame. For convenience, the data for each wine is extracted as a Python dictionary, and the dictionaries are added to the target DataFrame one by one.
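
`DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on recent versions one option is to collect the dictionaries in a plain list and build the frame once at the end. A minimal sketch with made-up rows; the keys mirror a few of those used in the extraction function below.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the per-wine dictionaries.
rows = [
    {"wine_name": "Wine A 2015", "winery": "Winery A", "score": "4.5"},
    {"wine_name": "Wine B 2017", "winery": np.nan, "score": "4.2"},
]

# One constructor call instead of repeated appends.
wines = pd.DataFrame(rows)
print(wines.shape)
```

Building the frame once also avoids copying the whole table on every added row.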

In [43]:
def extract_data_as_dict(browser):
    data_dict = {}
    # extract wine name - OK text
    data_dict["wine_name"] = extract_data_with_exceptions(browser, "vintage") 
    # extract winery - OK text
    data_dict["winery"] = extract_data_with_exceptions(browser, "winery")
    # extract wine type - OK text
    data_dict["wine_type"] = extract_data_with_exceptions(browser, "wineLocationHeader__wineType--14nrC")
    # extract grapes - OK text list element
    facts_list = extract_data_with_exceptions(browser, "wineFacts__fact--3BAsi", multiple_elements=True)
    data_dict["grapes"] = facts_list[1] if len(facts_list) > 1 else np.nan
    # extract wine style - OK text list element
    data_dict["wine_style"] = facts_list[3] if len(facts_list) > 3 else np.nan
    # extract region - OK text 
    data_dict["region"] = extract_data_with_exceptions(browser, "anchor__anchor--3DOSm")
    # extract country - OK text
    data_dict["country"] = extract_data_with_exceptions(browser, "wineLocationHeader__country--1RcW2")
    # extract the number of reviews - OK text 
    data_dict["reviews"] = extract_data_with_exceptions(browser, "vivinoRatingWide__basedOn--s6y0t")
    # extract the average price - OK text
    data_dict["price"] = extract_data_with_exceptions(browser, "purchaseAvailabilityPPC__amount--2_4GT")
    # extract the rating score - OK text
    data_dict["score"] = extract_data_with_exceptions(browser, "vivinoRatingWide__averageValue--1zL_5")
    # extract 3 reviews - OK text list
    reviews_list = extract_data_with_exceptions(browser, "reviewCard__reviewNote--fbIdd", multiple_elements=True)
    data_dict["review_1"] = reviews_list[0] if len(reviews_list) > 0 else np.nan
    data_dict["review_2"] = reviews_list[1] if len(reviews_list) > 1 else np.nan
    data_dict["review_3"] = reviews_list[2] if len(reviews_list) > 2 else np.nan
    # extract taste bar
    taste_list = extract_data_with_exceptions(browser, "indicatorBar__progress--3aXLX", multiple_elements=True, get_style=True)
    data_dict["taste_light_bold"] = taste_list[0] if len(taste_list) > 0 else np.nan
    data_dict["taste_smooth_tannic"] = taste_list[1] if len(taste_list) > 1 else np.nan
    data_dict["taste_dry_sweet"] = taste_list[2] if len(taste_list) > 2 else np.nan
    data_dict["taste_soft_acidic"] = taste_list[3] if len(taste_list) > 3 else np.nan
    # extract keywords
    keyword_list = extract_data_with_exceptions(browser, "tasteNote__popularKeywords--1q7RG", multiple_elements=True)
    data_dict["keywords_1"] = keyword_list[0] if len(keyword_list) > 0 else np.nan
    data_dict["keywords_2"] = keyword_list[1] if len(keyword_list) > 1 else np.nan
    data_dict["keywords_3"] = keyword_list[2] if len(keyword_list) > 2 else np.nan
    return data_dict

In [63]:
df_25_wines = pd.DataFrame()

for page in wine_pages_list:
    scroll_until_the_end_of_wine_page(browser, page, page_end_class="addWidgetLink__addWidgetLink--aPZ_V")
    cur_wine_data = extract_data_as_dict(browser)
    df_25_wines = df_25_wines.append(cur_wine_data, ignore_index=True)

In [64]:
df_25_wines.head()

Unnamed: 0,country,grapes,keywords_1,keywords_2,keywords_3,price,region,review_1,review_2,review_3,reviews,score,taste_dry_sweet,taste_light_bold,taste_smooth_tannic,taste_soft_acidic,wine_name,wine_style,wine_type,winery
0,United States,100% Cabernet Sauvignon,"chocolate, oak, vanilla","black fruit, blackberry...","leather, cocoa, graphit...",£366.67,Napa Valley,,,,88 ratings,4.9,width: 15%; left: 11.0281%;,width: 15%; left: 85%;,width: 15%; left: 52.0888%;,width: 15%; left: 51.0406%;,Kayli Morgan Vineyard Cabernet Sauvignon 2015,Napa Valley Cabernet Sauvignon,Red wine,Hundred Acre
1,Italy,Cabernet Sauvignon,,,,£250.53,Veneto,,,,82 ratings,4.9,,,,,Raro Cabernet Sauvignon Selezione 2007,Northern Italy Red,Red wine,Marion
2,United States,Cabernet Sauvignon,,,,£185.40,Rutherford,,,,80 ratings,4.9,,,,,Patriarch 2012,Napa Valley Cabernet Sauvignon,Red wine,Frank Family
3,,,"honey, leather, stone","apricot, peach, apple","caramel, butterscotch, ...",£322.51,4.9\n74 ratings,,,,74 ratings,4.9,,,,,Chateau D Yguene 2001,,Red wine,
4,United States,100% Cabernet Sauvignon,"dark fruit, blackberry,...","earthy, cocoa, smoke","coffee, oak, chocolate",£193.12,Rutherford,,,,69 ratings,4.9,width: 15%; left: 0px;,width: 15%; left: 85%;,width: 15%; left: 47.2605%;,width: 15%; left: 51.55%;,The Beast Cabernet Sauvignon 2012,Napa Valley Cabernet Sauvignon,Red wine,Del Dotto
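The scraped values above are still raw strings (`£366.67`, `88 ratings`, `width: 15%; left: 52.0888%;`). The sketch below shows one way they could be turned into numbers. The helper names are hypothetical, and reading the `left` offset as the position of the indicator on the taste bar is an assumption about how the page renders it, not something the page states.

```python
import math
import re


def parse_price(text):
    """'£366.67' -> 366.67; NaN when no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", str(text).replace(",", ""))
    return float(match.group()) if match else math.nan


def parse_ratings_count(text):
    """'88 ratings' -> 88; NaN when no number is found."""
    match = re.search(r"\d+", str(text).replace(",", ""))
    return int(match.group()) if match else math.nan


def parse_left_percent(style):
    """Extract the `left` offset of a style string as a percentage."""
    match = re.search(r"left:\s*([\d.]+)(%|px)", str(style))
    if not match:
        return math.nan
    value, unit = float(match.group(1)), match.group(2)
    if unit == "%":
        return value
    # a pixel offset cannot be converted without the bar width, unless it is zero
    return 0.0 if value == 0 else math.nan


print(parse_price("£366.67"))
print(parse_ratings_count("88 ratings"))
print(parse_left_percent("width: 15%; left: 52.0888%;"))
```

Missing values pass through as NaN, so the helpers could be applied column by column with `Series.apply` without special-casing empty cells.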


Now that we know the web scraper works correctly, we might want to apply it to a larger set of wines in one run (e.g. the roughly 20k red wines saved by the script above).

However, when the script was run, it managed to load data for approximately 340 wines before it triggered the bulk request limits on vivino.com.

The following message was returned: 
*Your IP address (185.192.69.14) has been temporarily blocked for exceeding bulk request limits. If you believe this was done in error or you have legitimate needs to access our pages and data above and beyond these limits please contact admin@vivino.com with the subject 'Requests Blocked' and we'll try and resolve the issue.*

Still, for documentation purposes, the code below is kept here and can be un-commented if needed. 

In [67]:
# wine_df_new = pd.DataFrame()

# browser = initialize_chrome_driver()

# start_time = time.time()
# for page in list_with_scroll:
#     scroll_until_the_end_of_wine_page(browser, page, page_end_class="addWidgetLink__addWidgetLink--aPZ_V")
#     cur_wine_data = extract_data_as_dict(browser)
#     wine_df_new = wine_df_new.append(cur_wine_data, ignore_index=True)
# end_time = time.time()

# print(wine_df_new)
# print("time elapsed: " + str(end_time - start_time) + " s.")

# wine_df_new.to_csv('red_wine_highest_score.csv')

### 3. Using Vivino API

#### Wine data

A closer look at Vivino's request/response pairs showed that the site has its own API, not advertised to users, which returns a structured JSON string. The request that the page's JavaScript sends for the explore page looks as follows:
https://www.vivino.com/api/explore/explore?country_code=GB&currency_code=GBP&grape_filter=varietal&min_rating=1&order_by=ratings_average&order=desc&page=4&price_range_max=400&price_range_min=0&wine_type_ids[]=1

From the format of this request, we can see that the following arguments are passed to the explore endpoint: 
* `country_code`
* `currency_code`
* `grape_filter`
* `min_rating`
* `order_by`
* `order`
* `page`
* `price_range_max`
* `price_range_min`
* `wine_type_ids[]`

To make sure that all wines are included in the response, we can widen the search criteria by setting a wide price range. Further, since the data is returned page by page, we can iterate over the page number until the whole database is loaded.
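As a minimal sketch of that iteration, the query string can be assembled from a dictionary instead of being edited by hand. The function name `build_explore_url` and its default values are illustrative choices; only the parameter names come from the request shown above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://www.vivino.com/api/explore/explore"


def build_explore_url(page, price_min=0, price_max=400, wine_type_id=1):
    """Assemble the explore URL for one result page."""
    params = {
        "country_code": "GB",
        "currency_code": "GBP",
        "grape_filter": "varietal",
        "min_rating": 1,
        "order_by": "ratings_average",
        "order": "desc",
        "page": page,
        "price_range_max": price_max,
        "price_range_min": price_min,
        "wine_type_ids[]": wine_type_id,
    }
    # safe="[]" keeps the square brackets of `wine_type_ids[]` unescaped
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe="[]")


# URLs for the first three pages
for page_number in range(1, 4):
    print(build_explore_url(page_number))
```

Widening the price range then only means passing a larger `price_max`, and the page loop stays unchanged.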

But first, we check which data can be retrieved (wine data & reviews) using this method.

We will simply use the `requests` library and pass headers specifying that JSON is expected in return (otherwise, the request might trigger the IP block for bulk requests).

In [69]:
test_page = "https://www.vivino.com/api/explore/explore?country_code=GB&currency_code=GBP&grape_filter=varietal&min_rating=1&order_by=ratings_average&order=desc&page=4&price_range_max=400&price_range_min=0&wine_type_ids[]=1"

headers_api = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'User-Agent': 'python/requests',
}

response = requests.get(test_page, headers=headers_api)
test_result = response.content

Since the vivino.com server returns a JSON string, we convert it to a Python-readable format using the `json` library, which parses the string into a Python dictionary. 

In [96]:
import json 
json_obj = json.loads(test_result)

In [97]:
test_df = pd.DataFrame(json_obj['explore_vintage']['matches'])
test_df.shape

(25, 3)

We can see that a single page request yields 25 wines. We need to 'unroll' the `vintage` column, each entry of which is itself a dictionary. The results are stored in `new_test_df`.

In [80]:
new_test_df = pd.DataFrame()
for row in range(len(test_df)):
#     print(row)
    new_test_df = new_test_df.append(test_df['vintage'][row], ignore_index=True)

In [81]:
new_test_df.head(2)

Unnamed: 0,grapes,has_valid_ratings,id,image,name,seo_name,statistics,wine,year,top_list_rankings
0,,1.0,1438498.0,{'location': '//images.vivino.com/thumbs/SCnpJ...,Vega Sicilia Unico 1970,vega-sicilia-unico-1970,"{'status': 'Normal', 'ratings_count': 379, 'ra...","{'id': 77137, 'name': 'Unico', 'seo_name': 'un...",1970.0,
1,,1.0,1232825.0,{'location': '//images.vivino.com/thumbs/rUPGZ...,Domaine de La Romanée-Conti La Tâche Grand Cru...,domaine-de-la-romanee-conti-la-tache-grand-cru...,"{'status': 'Normal', 'ratings_count': 364, 'ra...","{'id': 83911, 'name': 'La Tâche Grand Cru', 's...",2000.0,"[{'rank': 1, 'previous_rank': 1, 'description'..."
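A note on the row-by-row `append` used above: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the loop needs an alternative. The sketch below builds the frame in one call from a list of dictionaries and, as a second option, flattens the nested dictionaries with `pd.json_normalize`. The `matches` payload is a made-up stand-in for `json_obj['explore_vintage']['matches']`, and column order and dtypes may differ from the output shown above.

```python
import pandas as pd

# toy stand-in for json_obj['explore_vintage']['matches'] (values are made up)
matches = [
    {"vintage": {"id": 1, "name": "Wine A 2010", "year": 2010,
                 "statistics": {"ratings_count": 10, "ratings_average": 4.5}}},
    {"vintage": {"id": 2, "name": "Wine B 2012", "year": 2012,
                 "statistics": {"ratings_count": 20, "ratings_average": 4.1}}},
]

vintage_records = [match["vintage"] for match in matches]

# option 1: one row per vintage, nested dictionaries kept as objects
vintages = pd.DataFrame(vintage_records)

# option 2: nested dictionaries flattened into dotted column names
flat = pd.json_normalize(vintage_records)

print(vintages.shape)
print(flat.columns.tolist())
```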


We can see that a lot of useful data is stored in the `wine` and `statistics` columns, which can be unrolled further. 

In [124]:
def extract_wine_info(df):
    for stat_field in ['ratings_count', 'ratings_average', 'labels_count']:
        df[stat_field] = df['statistics'].apply(lambda x: x[stat_field]) 
    df['wine_id'] = df['wine'].apply(lambda x: x['id'])
    df['wine_name'] = df['wine'].apply(lambda x: x['name'])
    df['type'] = df['wine'].apply(lambda x: x['type_id'])
    df['vintage_type'] = df['wine'].apply(lambda x: x['vintage_type'])
    df['country'] = df['wine'].apply(lambda x: x['style']['country']['name'])
    df['region'] = df['wine'].apply(lambda x: x['region']['name'])
    df['winery'] = df['wine'].apply(lambda x: x['winery']['name'])
    for taste in ['acidity', 'fizziness', 'intensity', 'sweetness', 'tannin']:
        df[taste] = df['wine'].apply(lambda x: x['taste']['structure'][taste])
    for style_char in ['id', 'regional_name', 'varietal_name', 'body', 'body_description',
                       'acidity', 'acidity_description', 'grapes']:
        df['style_' + style_char] = df['wine'].apply(lambda x: x['style'][style_char])
    df['flavor'] = df['wine'].apply(lambda x: x['taste']['flavor'])
    return df
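The lookups in `extract_wine_info` assume that every nested key is present for every wine. If some wines lack a style or taste entry (an assumption, suggested by the wines without taste data in the scraped table earlier), a chained lookup such as `x['style']['country']['name']` would raise. A hypothetical helper `get_nested` that falls back to NaN could look like this:

```python
import math


def get_nested(record, *keys, default=math.nan):
    """Walk nested dictionaries; return `default` if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    # treat an explicit null at the end of the path as missing, too
    return default if current is None else current


# toy record: the style is present, the taste entry is null
wine = {"name": "Unico", "style": {"country": {"name": "Spain"}}, "taste": None}

print(get_nested(wine, "style", "country", "name"))
print(get_nested(wine, "taste", "structure", "acidity"))
```

In the function above it could be used as `df['wine'].apply(lambda x: get_nested(x, 'style', 'country', 'name'))`.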

In [84]:
test_df_extended = extract_wine_info(new_test_df)

In [85]:
test_df_extended.columns

Index(['grapes', 'has_valid_ratings', 'id', 'image', 'name', 'seo_name',
       'statistics', 'wine', 'year', 'top_list_rankings', 'ratings_count',
       'ratings_average', 'labels_count', 'wine_id', 'wine_name', 'type',
       'vintage_type', 'country', 'region', 'winery', 'acidity', 'fizziness',
       'intensity', 'sweetness', 'tannin', 'style_id', 'style_regional_name',
       'style_varietal_name', 'style_body', 'style_body_description',
       'style_acidity', 'style_acidity_description', 'style_grapes', 'flavor'],
      dtype='object')

Once the necessary columns are extracted, we can drop the columns that are no longer required.

In [86]:
test_df_extended.drop(['grapes', 'image', 'seo_name', 'statistics', 'top_list_rankings', 'wine'],
                      inplace=True, axis=1)

In [87]:
test_df_extended.head(2)

Unnamed: 0,has_valid_ratings,id,name,year,ratings_count,ratings_average,labels_count,wine_id,wine_name,type,...,tannin,style_id,style_regional_name,style_varietal_name,style_body,style_body_description,style_acidity,style_acidity_description,style_grapes,flavor
0,1.0,1438498.0,Vega Sicilia Unico 1970,1970.0,379,4.8,919,77137,Unico,1,...,3.445054,180,Spanish,Ribera Del Duero Red,5,Very full-bodied,3,High,"[{'id': 19, 'name': 'Tempranillo', 'seo_name':...","[{'group': 'oak', 'stats': {'count': 833, 'sco..."
1,1.0,1232825.0,Domaine de La Romanée-Conti La Tâche Grand Cru...,2000.0,364,4.8,2194,83911,La Tâche Grand Cru,1,...,2.61897,283,Burgundy,Côte de Nuits Red,3,Medium-bodied,3,High,"[{'id': 14, 'name': 'Pinot Noir', 'seo_name': ...","[{'group': 'earth', 'stats': {'count': 165, 's..."


#### Reviews data

From the looks of the API request we can see that it takes the following arguments:
* wine ID
* year
* number of reviews per page

We test sending a request to the reviews API; it also returns a JSON string with the requested information.

In [149]:
test_reviews_page = "https://www.vivino.com/api/wines/83496/reviews?year=2016&per_page=15"
response = requests.get(test_reviews_page, headers=headers_api)
reviews_test_result = response.content

In [150]:
json_obj = json.loads(reviews_test_result)
reviews_test_df = pd.DataFrame(json_obj['reviews'])

In [151]:
reviews_test_df.head(2)

Unnamed: 0,id,rating,note,language,created_at,aggregated,user,vintage,activity,flavor_word_matches,tagged_note
0,112855589,4.0,Dark purple in color. Notes of cherry and dark...,en,2018-12-22T05:26:15.000Z,True,"{'id': 23558734, 'seo_name': 'tim.kroetsch', '...","{'id': 91466354, 'seo_name': 'joel-gott-cabern...","{'id': 291243329, 'statistics': {'likes_count'...","[{'id': 93, 'match': 'cherry'}, {'id': 134, 'm...",Dark purple in color. Notes of cherry and dark...
1,106411098,4.0,Somewhat heady and delicious cab. Decidedly fr...,en,2018-10-08T00:01:05.000Z,True,"{'id': 8170315, 'seo_name': 'm6fe64166df314181...","{'id': 91466354, 'seo_name': 'joel-gott-cabern...","{'id': 276753387, 'statistics': {'likes_count'...","[{'id': 334, 'match': 'plum'}]",Somewhat heady and delicious cab. Decidedly fr...


According to an email from Birkir Barkarson (the CTO of Vivino), requests should be limited to 1000 per 10-minute window in order to avoid major disruption to their servers and the resulting IP block. 

The `ratelimiter` is therefore set to 1 request per second (600 requests per 10 minutes), which stays comfortably under that limit.
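For reference, the arithmetic behind that setting, together with a standard-library alternative to the `ratelimiter` package, can be sketched as follows. The decorator `min_interval` and the function `fetch_page` are hypothetical names and not part of any library; the sketch enforces a minimum gap between calls, which is a simpler policy than a sliding window.

```python
import time
from functools import wraps

REQUESTS_PER_WINDOW = 1000
WINDOW_SECONDS = 10 * 60
# shortest allowed average gap between requests: 600 / 1000 = 0.6 s
MIN_ALLOWED_GAP = WINDOW_SECONDS / REQUESTS_PER_WINDOW


def min_interval(seconds):
    """Decorator: make consecutive calls start at least `seconds` apart."""
    def decorator(func):
        last_call = [None]  # mutable cell holding the time of the previous call

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = seconds - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator


@min_interval(1.0)  # 1 s between calls, more conservative than the 0.6 s allowed
def fetch_page(page_number):
    # placeholder body: a real version would send the HTTP request here
    return "page {}".format(page_number)


print(MIN_ALLOWED_GAP)
```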

In [112]:
# all preprocessing steps to get from JSON string to a row in a DataFrame

def extract_json_to_df(df, json_str):
    json_obj = json.loads(json_str)
    for wine in json_obj['explore_vintage']['matches']:
#         print(type(wine))
        df = df.append(wine['vintage'], ignore_index=True)
    return df
        
my_df = pd.DataFrame()
my_df = extract_json_to_df(my_df, test_result)
my_df.head(2)

Unnamed: 0,grapes,has_valid_ratings,id,image,name,seo_name,statistics,wine,year,top_list_rankings
0,,1.0,1438498.0,{'location': '//images.vivino.com/thumbs/SCnpJ...,Vega Sicilia Unico 1970,vega-sicilia-unico-1970,"{'status': 'Normal', 'ratings_count': 379, 'ra...","{'id': 77137, 'name': 'Unico', 'seo_name': 'un...",1970.0,
1,,1.0,1232825.0,{'location': '//images.vivino.com/thumbs/rUPGZ...,Domaine de La Romanée-Conti La Tâche Grand Cru...,domaine-de-la-romanee-conti-la-tache-grand-cru...,"{'status': 'Normal', 'ratings_count': 364, 'ra...","{'id': 83911, 'name': 'La Tâche Grand Cru', 's...",2000.0,"[{'rank': 1, 'previous_rank': 1, 'description'..."


In [119]:
from ratelimiter import RateLimiter

test_page = "https://www.vivino.com/api/explore/explore?country_code=GB&currency_code=GBP&grape_filter=varietal&min_rating=1&order_by=ratings_average&order=desc&\
page=4&price_range_max=400&price_range_min=0&wine_type_ids[]=1"

headers_api = {
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'User-Agent': 'python/requests',
}

@RateLimiter(max_calls=1, period=1)
def get_wine_json(page, headers, df):
    response = requests.get(page, headers=headers)
    json_str = response.content
    df = extract_json_to_df(df, json_str)
    return df
#     print(time.time())

subset_red = pd.DataFrame()

for i in range(1,60):
    page = "https://www.vivino.com/api/explore/explore?country_code=GB&currency_code=GBP&grape_filter=varietal&min_rating=1&order_by=ratings_average&order=desc&\
page={}&price_range_max=400&price_range_min=0&wine_type_ids[]=1".format(i)
    subset_red = get_wine_json(page, headers_api, subset_red)

In [121]:
subset_red.shape

(1475, 10)

In [126]:
subset_red.head()

Unnamed: 0,grapes,has_valid_ratings,id,image,name,seo_name,statistics,wine,year,top_list_rankings,ratings_count,ratings_average,labels_count,wine_id,wine_name,type,vintage_type
0,,1.0,143500908.0,{'location': '//images.vivino.com/thumbs/ZOn7K...,Hundred Acre Kayli Morgan Vineyard Cabernet Sa...,hundred-acre-kayli-morgan-vineyard-cabernet-sa...,"{'status': 'Normal', 'ratings_count': 88, 'rat...","{'id': 84351, 'name': 'Kayli Morgan Vineyard C...",2015.0,,88,4.9,602,84351,Kayli Morgan Vineyard Cabernet Sauvignon,1,0
1,,1.0,155452769.0,{'location': '//images.vivino.com/labels/bDEH9...,Marion Raro Cabernet Sauvignon Selezione 2007,marion-raro-cabernet-sauvignon-selezione-2007,"{'status': 'Normal', 'ratings_count': 82, 'rat...","{'id': 5926791, 'name': 'Raro Cabernet Sauvign...",2007.0,,82,4.9,315,5926791,Raro Cabernet Sauvignon Selezione,1,0
2,,1.0,142749808.0,{'location': '//images.vivino.com/thumbs/BFAcH...,Frank Family Patriarch 2012,frank-family-patriarch-rutherford-red-wine-2012,"{'status': 'Normal', 'ratings_count': 80, 'rat...","{'id': 4382344, 'name': 'Patriarch', 'seo_name...",2012.0,,80,4.9,341,4382344,Patriarch,1,0
3,,1.0,57646109.0,{'location': '//images.vivino.com/labels/V5JCH...,Chateau D Yguene 2001,chateau-d-yguene-red-wine-v-jdmvq-2001,"{'status': 'Normal', 'ratings_count': 74, 'rat...","{'id': 3474900, 'name': 'Chateau D Yguene', 's...",2001.0,,74,4.9,624,3474900,Chateau D Yguene,1,0
4,,1.0,3208609.0,{'location': '//images.vivino.com/thumbs/easjT...,Del Dotto The Beast Cabernet Sauvignon 2012,del-dotto-the-beast-cabernet-sauvignon-2012,"{'status': 'Normal', 'ratings_count': 69, 'rat...","{'id': 1403662, 'name': 'The Beast Cabernet Sa...",2012.0,,69,4.9,357,1403662,The Beast Cabernet Sauvignon,1,0


In [154]:
test_test = pd.json_normalize(json_obj['reviews'])

In [156]:
test_test.head()

Unnamed: 0,id,rating,note,language,created_at,aggregated,flavor_word_matches,tagged_note,user.id,user.seo_name,...,vintage.image.variations.bottle_large,vintage.image.variations.bottle_medium,vintage.image.variations.bottle_medium_square,vintage.image.variations.bottle_small,vintage.image.variations.bottle_small_square,vintage.image.variations.label,vintage.image.variations.label_large,vintage.image.variations.label_medium,vintage.image.variations.label_medium_square,vintage.image.variations.label_small_square
0,112855589,4.0,Dark purple in color. Notes of cherry and dark...,en,2018-12-22T05:26:15.000Z,True,"[{'id': 93, 'match': 'cherry'}, {'id': 134, 'm...",Dark purple in color. Notes of cherry and dark...,23558734,tim.kroetsch,...,,,,,,,,,,
1,106411098,4.0,Somewhat heady and delicious cab. Decidedly fr...,en,2018-10-08T00:01:05.000Z,True,"[{'id': 334, 'match': 'plum'}]",Somewhat heady and delicious cab. Decidedly fr...,8170315,m6fe64166df3141817e4d7f40f683be6,...,,,,,,,,,,
2,125161577,4.0,"Deep flavor. Strong notes of black fruit, toba...",en,2019-04-27T05:12:23.000Z,True,"[{'id': 39, 'match': 'black fruit'}, {'id': 15...","Deep flavor. Strong notes of black fruit, toba...",34010037,christopher.sugo,...,,,,,,,,,,
3,173222507,4.0,"Fruit forward, green pepper, tobacco, nice dep...",en,2020-08-11T04:10:44.000Z,True,"[{'id': 276, 'match': 'minerals'}, {'id': 320,...","Fruit forward, green pepper, tobacco, nice dep...",9553409,e1bc48eca68223d2dcc77da21cb4f643,...,,,,,,,,,,
4,128927894,4.0,Nice bold yet balanced wine and at a nice pric...,en,2019-06-06T19:37:27.000Z,True,"[{'id': 49, 'match': 'blackberry'}, {'id': 93,...",Nice bold yet balanced wine and at a nice pric...,31887895,costas.tsitsiragos,...,,,,,,,,,,
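The `flavor_word_matches` column holds a list of `{'id', 'match'}` dictionaries per review, which is a convenient starting point for word-frequency work. A minimal sketch of counting the matched words is given below. The rows are toy data that reuse a few words visible in the output above, and `count_flavor_words` is a hypothetical helper name.

```python
from collections import Counter

# toy rows shaped like the `flavor_word_matches` column (combinations are made up)
flavor_word_matches = [
    [{"id": 93, "match": "cherry"}, {"id": 334, "match": "plum"}],
    [{"id": 334, "match": "plum"}],
    [{"id": 39, "match": "black fruit"}],
    None,  # a review without any matched words
    [{"id": 49, "match": "blackberry"}, {"id": 93, "match": "cherry"}],
]


def count_flavor_words(rows):
    """Count how often each matched flavor word occurs across reviews."""
    counts = Counter()
    for matches in rows:
        # skip reviews where the field is missing or not a list
        if not isinstance(matches, list):
            continue
        counts.update(item["match"] for item in matches)
    return counts


word_counts = count_flavor_words(flavor_word_matches)
print(word_counts.most_common(3))
```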


### Eugene comments 

Ideas for hypotheses
* The smaller the population we want to draw conclusions about, the easier it is to collect a representative sample. So instead of trying to draw conclusions about the entire world wine market, it is simpler (and more realistic) to draw conclusions about more local populations: countries or continents.
* For example, it could be interesting to compare American and European wines, as the two main wine-supplying continents.
    * Is there a difference in ratings between grape varieties? (some varieties grow better in one region, some in another)
    * Is there a difference in food pairings? (hypothesis: more cheese in Europe, more meat in America)
    * Differences in descriptions (which words are used to describe the wines)
    * Build something like a description of the prototypical American and European wine (take the average tastes/varieties/ratings, report the averages, find the existing wine closest to that average, and we have the prototypical wines of each continent! This could be split further by variety, white/red, or something else)
* Separately, of course, it is interesting to look at samples of cheap and expensive wines.
    * This may be difficult, but I have a fantasy of taking temperature data for different years in the wine-producing regions and correlating temperature with wine ratings/prices. Can the price of a winery's wine be predicted from temperature?
    * Look at word clouds for cheap, mid-priced and super-expensive wines. Hypothesis: descriptions of expensive wines will be more pompous.
    * Compare the descriptions attached to good and bad ratings for cheap, mid-priced and super-expensive wines. Hypothesis: with expensive wines people are unhappy about different things than with cheap ones (for example, I noticed that one-star ratings on expensive wines often come with the note "crooked", meaning it was stored badly. Maybe there will be more insights)
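The "prototypical wine" idea from the list above can be sketched in a few lines: standardise the numeric features, then pick the wine closest to the mean. The feature columns and values below are made up for illustration, and Euclidean distance on z-scores is just one possible choice of similarity.

```python
import numpy as np


def most_prototypical(features):
    """Return the index of the row closest to the column means (on z-scores)."""
    features = np.asarray(features, dtype=float)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    z_scores = (features - features.mean(axis=0)) / std
    # after z-scoring the mean sits at the origin, so the distance is the row norm
    distances = np.linalg.norm(z_scores, axis=1)
    return int(np.argmin(distances))


# toy data: columns are (rating, price, acidity); values are made up
wines = np.array([
    [4.9, 300.0, 3.0],
    [4.0, 20.0, 2.5],
    [4.3, 110.0, 2.8],
    [3.5, 10.0, 2.0],
])

print(most_prototypical(wines))
```

Running the same function on the subset of wines from one continent, variety or colour would give the prototype for that group.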

### On sample representativeness
Depending on what we end up wanting to test, we will need to generate a sample that is representative of exactly the population we are studying. For instance, if we compare America and Europe, we need representative samples for those two continents. I think we can limit ourselves to the following parameters:
* Country
* Region
* Vintage year
* Type (white/red/rosé/sparkling)
* Grape varieties
* Price 

We need to look at the distribution of wines across these variables and try to get similar proportions in our sample. Do we have information on how these factors are distributed in the population (for example, by continent, by country, or worldwide)?
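If the population proportions are known, one way to keep them in the sample is proportionate stratified sampling. Below is a minimal pandas sketch on a made-up population; the column names, the helper `stratified_sample` and the 50% fraction are illustrative assumptions.

```python
import pandas as pd

# toy population: 8 US wines and 4 French wines (values are made up)
population = pd.DataFrame({
    "country": ["US"] * 8 + ["France"] * 4,
    "wine_type": ["red"] * 4 + ["white"] * 4 + ["red"] * 2 + ["white"] * 2,
    "price": list(range(12)),
})


def stratified_sample(df, strata, frac, seed=0):
    """Draw the same fraction from every stratum so proportions are preserved."""
    return df.groupby(strata).sample(frac=frac, random_state=seed)


sample = stratified_sample(population, ["country", "wine_type"], frac=0.5)
print(sample.groupby(["country", "wine_type"]).size())
```

Continuous variables such as price would first need to be binned (for example with `pd.cut`) before they can serve as strata.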

### One more thing...
All these ideas came out of my own head, and I know only slightly more than nothing about wine. I think really cool and interesting hypotheses could emerge if we read a bit more about wine. I think this is an important stage of this kind of work in general. As in science: you do a literature review, then form hypotheses, then think about what data you need, then test the hypotheses. This may be overkill, but if the goal is to practise analytics, then preliminary analysis is an important part of it. We could then write the project up almost like a real paper, with an introduction, references and so on.


So tell me what you think. We can have a call and discuss. Maybe you know more about wine than I do and will have more interesting hypotheses.