## Vivino Webscraper 

#### Introduction: 
Vivino is a popular website and database for wine enthusiasts, with information about wines from around the world. This project is a proof of concept for scraping data from Vivino in order to understand the wine market in Spain. The proof of concept has some limitations. First, the list of scraped wines is not a complete view of Spanish wine, because wine culture in Spain has traditionally centered on small production and barrel sales. For this reason, many small producers are not included in Vivino's database and are therefore not represented in the data. Second, the results include only wines currently available for purchase. Lastly, because Vivino is a consumer database, its ratings may not reflect professional opinion, and the 5-star rating system limits how much nuance a rating can carry.

url = https://www.vivino.com/api/explore/explore?country_code=kr&currency_code=KRW&grape_filter=varietal&min_rating=1&order_by=ratings_average&order=desc&price_range_max=500000&price_range_min=0&page=1&language=en
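
To make the query easier to read, the URL above can be split into its endpoint and its individual parameters. The sketch below uses only the standard library and the URL exactly as written; the variable names `endpoint` and `params` are illustrative. Note that this URL asks for `country_code=kr` with prices in `KRW`, and that every parsed value comes back as a string.

```python
from urllib.parse import urlsplit, parse_qsl

url = (
    "https://www.vivino.com/api/explore/explore"
    "?country_code=kr&currency_code=KRW&grape_filter=varietal&min_rating=1"
    "&order_by=ratings_average&order=desc&price_range_max=500000"
    "&price_range_min=0&page=1&language=en"
)

# Separate the endpoint from the query string.
parts = urlsplit(url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qsl returns (key, value) pairs; every value is a string.
params = dict(parse_qsl(parts.query))

print(endpoint)
for key, value in params.items():
    print(f"{key} = {value}")
```

The URL uses the key `country_code`, while the payload in the next cell uses `country_codes`. Which spelling the endpoint honours is not confirmed in this notebook, so it is worth checking before trusting a country filter.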

In [52]:
import requests
import json
import pandas as pd
import time

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

payload = {
        "country_codes": "kr",  
        "currency_code": "KRW",
        # "food_ids[]": [4],
        "grape_filter": "varietal",
        "min_rating": "1",
        "order_by": "ratings_average",
        "order": "desc",
        "page": 1,
        "price_range_max": "500000",
        "price_range_min": "0",
        "language": "en",
}


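r = requests.get('https://www.vivino.com/api/explore/explore',
                 params=payload, headers=headers)
r.raise_for_status()  # stop early on a non-2xx response
explore = r.json()['explore_vintage']  # parse the JSON body once
n_matches = explore['records_matched']  # total number of matching vintages
matches = explore['matches']  # records on the requested page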

In [53]:
n_matches

3639

In [68]:
matches

[{'vintage': {'id': 177404982,
   'seo_name': 'weingut-carl-loewen-eighteen-ninetysix-riesling-2023',
   'name': 'Carl Loewen 1896 Riesling 2023',
   'statistics': {'status': 'Normal',
    'ratings_count': 115,
    'ratings_average': 5,
    'labels_count': 15,
    'wine_ratings_count': 354,
    'wine_ratings_average': 4.6,
    'wine_status': ''},
   'image': {'location': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pl_480x640.png',
    'variations': {'bottle_large': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pb_x960.png',
     'bottle_medium': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pb_x600.png',
     'bottle_medium_square': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pb_600x600.png',
     'bottle_small': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pb_x300.png',
     'bottle_small_square': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pb_300x300.png',
     'label': '//images.vivino.com/thumbs/Q8yn9Dn7SZa_V_IBeD51yw_pl_480x640.png',
     'labe
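
The raw dump above is long and is cut off in the output. A more compact way to see the shape of one record is to list its key paths instead of its values. The helper below is a minimal sketch; `key_paths` and the `sample` record are hypothetical, and `sample` keeps only a few fields copied from the output above.

```python
def key_paths(obj, prefix="", max_depth=3):
    """Yield dotted key paths of a nested dict, descending at most max_depth levels."""
    if not isinstance(obj, dict) or max_depth == 0:
        return
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        # Recurse into nested dicts; non-dict values end the path.
        yield from key_paths(value, path, max_depth - 1)


# Hypothetical, heavily trimmed record shaped like the output above.
sample = {
    "vintage": {
        "id": 177404982,
        "name": "Carl Loewen 1896 Riesling 2023",
        "statistics": {"ratings_count": 115, "ratings_average": 5},
    }
}

for path in key_paths(sample):
    print(path)
```

In the notebook the same call would be `key_paths(matches[0])`, with `max_depth` raised to reach fields such as the taste structure.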

In [111]:
column_names=["vintage_id", "vintage_name", "vintage_year", "vintage_price", "vintage_ratings_average", "vintage_ratings_count", "vintage_wine_id", "vintage_wine_name",  "vintage_winery", "vintage_country", "vintage_region", "vintage_wine_type_id", "acidity", "fizziness", "intensity", "sweetness", "tannin", "flavor", "foods" ]

In [98]:
import numpy as np

In [116]:
# with open('response_data.json', 'r', encoding='utf-8') as file:
#     data = json.load(file)
matches = r.json()['explore_vintage']['matches']

# Check for None values
# foods = matches[0]['vintage']['wine']['style']['food']
results = []
for match in matches:
    if match['vintage'] is None:
        print('vintage is missing.')
    else:
        vintage_id = match['vintage']['id']
        vintage_name = match['vintage']['name']
        vintage_year = match["vintage"]["year"]
        vintage_price = match["prices"][0]["amount"]
        vintage_ratings_average = match["vintage"]["statistics"]["ratings_average"]
        vintage_ratings_count = match["vintage"]["statistics"]["ratings_count"]
        
        if match['vintage']['wine'] is None:
            print('wine is missing.')
        else:
            vintage_wine_id = match["vintage"]["wine"]["id"]
            vintage_wine_name = f'{match["vintage"]["wine"]["name"]} {match["vintage"]["year"]}'
            vintage_winery = match["vintage"]["wine"]["winery"]["name"]
            vintage_country = match['vintage']['wine']['region']['country']['name']
            vintage_region = match['vintage']['wine']['region']['name']
            vintage_wine_type_id = match['vintage']['wine']['type_id']
            
            taste = match['vintage']['wine']['taste']
            # Default every taste field to NaN; overwrite only when the API returns a value.
            acidity = fizziness = intensity = sweetness = tannin = flavor = np.nan

            if taste is not None:
                structure = taste['structure']
                if structure is not None:
                    # A key can be present with a None value, so map None to NaN explicitly.
                    acidity, fizziness, intensity, sweetness, tannin = (
                        np.nan if structure[key] is None else structure[key]
                        for key in ('acidity', 'fizziness', 'intensity', 'sweetness', 'tannin')
                    )
                flavor_list = taste['flavor']
                if flavor_list:  # skips both None and an empty list
                    # Keep only the first flavor group.
                    flavor = flavor_list[0]['group']
                    if flavor is None:
                        flavor = np.nan

            style = match['vintage']['wine']['style']
            foods = []  # default: no food pairings
            if style is not None and style['food'] is not None:
                # Collect the name of every suggested food pairing.
                foods = [food['name'] for food in style['food']]
                
            results.append((vintage_id, vintage_name, vintage_year, vintage_price, vintage_ratings_average, vintage_ratings_count, vintage_wine_id, vintage_wine_name,  vintage_winery, vintage_country, vintage_region, vintage_wine_type_id, acidity, fizziness, intensity, sweetness, tannin, flavor, foods))

df = pd.DataFrame(results, columns=column_names)
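
The nested `None` checks in the loop above can also be collapsed into one small helper that walks a path and falls back to a default. This is a sketch, not part of the original notebook: `dig` is a hypothetical name, and the `match` record is a made-up stand-in with the same nesting as the API response. `math.nan` is used so the block runs without NumPy; it is the same float value as `np.nan`.

```python
import math


def dig(obj, *keys, default=math.nan):
    """Follow keys and indexes through nested dicts and lists.

    Returns default if a key or index is missing, or if any step yields None.
    """
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
        if obj is None:
            return default
    return obj


# Made-up record with the same nesting as one element of `matches`.
match = {
    "vintage": {
        "id": 1,
        "wine": {
            "taste": {"structure": {"acidity": 4.61, "tannin": None}, "flavor": []},
            "style": None,
        },
    },
    "prices": [{"amount": 82045}],
}

acidity = dig(match, "vintage", "wine", "taste", "structure", "acidity")
tannin = dig(match, "vintage", "wine", "taste", "structure", "tannin")   # None becomes NaN
flavor = dig(match, "vintage", "wine", "taste", "flavor", 0, "group")    # empty list becomes NaN
price = dig(match, "prices", 0, "amount")
foods = [food["name"] for food in dig(match, "vintage", "wine", "style", "food", default=[])]

print(acidity, tannin, flavor, price, foods)
```

The trade-off is that a misspelled key silently produces the default instead of raising, so the explicit version above is easier to debug while the response format is still being explored.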

In [117]:
df.shape

(25, 19)

In [113]:
df.head()

|  | vintage_id | vintage_name | vintage_year | vintage_price | vintage_ratings_average | vintage_ratings_count | vintage_wine_id | vintage_wine_name | vintage_winery | vintage_country | vintage_region | vintage_wine_type_id | acidity | fizziness | intensity | sweetness | tannin | flavor | foods |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 177404982 | Carl Loewen 1896 Riesling 2023 | 2023 | 82045 | 5.0 | 115 | 1945087 | 1896 Riesling 2023 | Carl Loewen | Germany | Mosel | 2 | 4.61 |  | 3.32 | 2.05 |  | tree_fruit | [Pork, Shellfish, Spicy food, Poultry, Cured M... |
| 1 | 2184215 | Château Pétrus Pomerol 1960 | 1960 | 6209294 | 4.9 | 65 | 1166837 | Pomerol 1960 | Château Pétrus | France | Pomerol | 1 | 3.37 |  | 4.13 | 1.97 | 3.45 | oak | [Beef, Lamb, Game (deer, venison), Poultry] |
| 2 | 1510217 | Château Haut-Brion Pessac-Léognan (Premier Gra... | 1989 | 3244496 | 4.8 | 1538 | 1152755 | Pessac-Léognan (Premier Grand Cru Classé) 1989 | Château Haut-Brion | France | Pessac-Léognan | 1 | 4.22 |  | 4.08 | 1.67 | 4.23 | oak | [Beef, Lamb, Game (deer, venison), Poultry] |
| 3 | 2611979 | Château Latour Grand Vin Pauillac (Premier Gra... | 1982 | 2759686 | 4.8 | 1445 | 1655970 | Grand Vin Pauillac (Premier Grand Cru Classé) ... | Château Latour | France | Pauillac | 1 | 4.12 |  | 4.12 | 1.68 | 4.13 | oak | [Beef, Lamb, Game (deer, venison), Poultry] |
| 4 | 1889890 | Château Pétrus Pomerol 1990 | 1990 | 6861922 | 4.8 | 1245 | 1166837 | Pomerol 1990 | Château Pétrus | France | Pomerol | 1 | 3.37 |  | 4.13 | 1.97 | 3.45 | oak | [Beef, Lamb, Game (deer, venison), Poultry] |


In [114]:
# import numpy as np
# 
# all_results = []
# for i in range(int(n_matches / 25)):
#     # Each page holds 25 results; print a notice every time the page advances
#     payload['page'] = i + 1
#     print(f'Requesting data from page: {payload["page"]}')
# 
#     r = requests.get('https://www.vivino.com/api/explore/explore?',
#                  params=payload, headers=headers)
#     results = [
#         (
#             t["vintage"]["year"],
#             t["vintage"]["wine"]["id"],
#             f'{t["vintage"]["wine"]["name"]} {t["vintage"]["year"]}',
#             t["vintage"]["statistics"]["ratings_average"],
#             t["vintage"]["statistics"]["ratings_count"],
#             t["prices"][0]["amount"],
#             t["vintage"]["wine"]["winery"]["name"],
#             t['vintage']['wine']['region']['country']['name'],
#             t['vintage']['wine']['region']['name'],
#             t['vintage']['wine']['type_id'],
#             # Safely accessing and returning np.nan if any part is None
#             (t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure", {}).get('acidity', np.nan) 
#              if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure") is not None else np.nan),           
#             (t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure", {}).get('fizziness', np.nan) 
#              if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure") is not None else np.nan),            
#             (t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure", {}).get('intensity', np.nan) 
#              if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure") is not None else np.nan),            
#             (t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure", {}).get('sweetness', np.nan) 
#              if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure") is not None else np.nan),            
#             (t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure", {}).get('tannin', np.nan) 
#              if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("structure") is not None else np.nan),           
#             t.get("vintage", {}).get("wine", {}).get("taste", {}).get("flavor", [{}])[0].get("group", np.nan) 
#             if t.get("vintage", {}).get("wine", {}).get("taste", {}).get("flavor") else np.nan,
#             
#         )
#         for t in r.json()["explore_vintage"]["matches"] 
#     ]   
#     
#     all_results.extend(results)
# 
# df = pd.DataFrame(all_results, columns=column_names)
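
The commented-out loop above iterates `int(n_matches / 25)` times, which rounds down and skips the final, partly filled page. With `n_matches = 3639` that is 145 pages, or 3625 rows, which is the row count that appears in the `value_counts()` output further down. The sketch below rounds up instead and pauses between requests. `page_count`, `collect_pages`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real request so the block runs offline.

```python
import math
import time

PAGE_SIZE = 25  # one page returned 25 rows in this notebook (df.shape == (25, 19))


def page_count(n_matches, page_size=PAGE_SIZE):
    """Number of pages needed to cover n_matches, counting a partial last page."""
    return math.ceil(n_matches / page_size)


def collect_pages(fetch_page, n_matches, delay=1.0, page_size=PAGE_SIZE):
    """Call fetch_page(page_number) for every page and concatenate the returned lists."""
    all_matches = []
    for page in range(1, page_count(n_matches, page_size) + 1):
        all_matches.extend(fetch_page(page))
        time.sleep(delay)  # be polite to the server between requests
    return all_matches


def fake_fetch(page, total=60, page_size=PAGE_SIZE):
    """Offline stand-in for the real request: returns the row numbers on a page."""
    start = (page - 1) * page_size
    return list(range(start, min(start + page_size, total)))


print(page_count(3639))   # pages when rounding up
print(int(3639 / 25))     # pages when rounding down, as in the loop above
print(len(collect_pages(fake_fetch, 60, delay=0)))
```

In the notebook, `fetch_page` would set `payload['page']`, call `requests.get` with the same `params` and `headers` as before, and return `r.json()['explore_vintage']['matches']`.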
    

In [82]:
df.head()

|  | Year | Wine_ID | Wine_name(year) | Ratings_average | Ratings_count | price | Winery | Country | Region | Wine_type_id | acidity | fizziness | intensity | sweetness | tannin | flavor | food |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 1945087 | 1896 Riesling 2023 | 5.0 | 115 | 82045 | Carl Loewen | Germany | Mosel | 2 | 4.61 |  | 3.32 | 2.05 |  | tree_fruit | [] |
| 1 | 1960 | 1166837 | Pomerol 1960 | 4.9 | 65 | 6209294 | Château Pétrus | France | Pomerol | 1 | 3.37 |  | 4.13 | 1.97 | 3.45 | oak | [] |
| 2 | 1989 | 1152755 | Pessac-Léognan (Premier Grand Cru Classé) 1989 | 4.8 | 1538 | 3244496 | Château Haut-Brion | France | Pessac-Léognan | 1 | 4.22 |  | 4.08 | 1.67 | 4.23 | oak | [] |
| 3 | 1982 | 1655970 | Grand Vin Pauillac (Premier Grand Cru Classé) ... | 4.8 | 1445 | 2759686 | Château Latour | France | Pauillac | 1 | 4.12 |  | 4.12 | 1.68 | 4.13 | oak | [] |
| 4 | 1990 | 1166837 | Pomerol 1990 | 4.8 | 1245 | 6861922 | Château Pétrus | France | Pomerol | 1 | 3.37 |  | 4.13 | 1.97 | 3.45 | oak | [] |


In [83]:
df['Wine_type_id'].unique()
df['food'].value_counts()

food
[]    3625
Name: count, dtype: int64
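
Every row in the output above has the same empty `food` value, so the count carries no information. Once the column holds real lists of names, as `foods` does in the extraction loop, counting individual foods means flattening the lists first. The sketch below does that with the standard library on a made-up `foods_column`; in pandas, `df['foods'].explode().value_counts()` should give the equivalent result.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for df['foods']: one list of food names per wine.
foods_column = [
    ["Beef", "Lamb", "Poultry"],
    ["Pork", "Shellfish", "Poultry"],
    [],  # a wine with no pairing suggestions
]

# Flatten the per-wine lists into one stream of names and count each name.
food_counts = Counter(chain.from_iterable(foods_column))

for food, count in food_counts.most_common():
    print(food, count)
```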

In [83]:
df.to_csv(r"C:\SD card\Documents\Data Analytics\Vivino Webscraper\vivino.csv", index=False)
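
One thing to keep in mind when saving to CSV: a cell that holds a Python list is written as its string form, so a column like `foods` comes back as text such as `"['Pork', 'Shellfish']"` rather than as a list. The sketch below shows the round trip with the standard `csv` module and an in-memory buffer standing in for `vivino.csv`; the two rows are trimmed examples, and `ast.literal_eval` is one way to turn the text back into a list.

```python
import ast
import csv
import io

rows = [
    {"vintage_id": 177404982, "foods": ["Pork", "Shellfish"]},
    {"vintage_id": 2184215, "foods": []},
]

# Write the rows; the csv module stores each list as str(list).
buffer = io.StringIO()  # in-memory stand-in for vivino.csv
writer = csv.DictWriter(buffer, fieldnames=["vintage_id", "foods"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back; every field is now a string.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["foods"] = ast.literal_eval(row["foods"])  # "['Pork', ...]" -> list
    restored.append(row)

print(restored)
```

With pandas, the same conversion can be applied after `pd.read_csv` by mapping `ast.literal_eval` over the column.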