
<h1><center>In Vino Veritas</center></h1>
<h2><center> <span style="font-weight:normal"><font color='#e42618'> Finding truth in wine data</font>  </span></center></h2>


<h3><center><font color='gray'>JONAS GOTTAL</font></center></h3>




<h4>Project scope</h4>

'In Vino Veritas' is an old Latin phrase that means 'in wine, there is truth', and we would like to find that truth in wine data. What is objectively good wine, and how is it influenced by the weather? We use Wine Spectator's official vintage charts as an objective measure of wine quality and enrich them with wine-specific features derived from public weather data.
<br>

---
---


In [25]:
#!conda env export > environment.yml
#!pip freeze | grep -v "^-e" | grep -v "@" | awk -F= '{print $1 "==" $3}' > requirements.txt


# Predictive modelling
We follow our earlier cleaning approach for the merged data:

In [18]:
# data handling
import pandas as pd
import numpy as np

# model selection and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from xgboost import XGBRegressor
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# load csv
df = pd.read_csv('data/merged_data.csv')

# Replace non-breaking spaces in column names
df.columns = df.columns.str.replace('\xa0', ' ')

df.head()


| Unnamed: 0 | Vintage | Score | Drink Window | Description | Country | Region | Variety | Vola_Temp | Vola_Rain | Longest_Dry | Longest_Wet | Avg_Rain | Count_above35 | Count_under10 | Count_under0 | Coulure_Wind | June_Rain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 93–97 | NYR | As vines struggled to ripen their fruit at the... | United States | Napa | Cabernet | 0.212736 | 7.168308 | 29.0 | 4.0 | 0.149378 | 10.0 | 73.0 | 0.0 | 17.094839 | 1.18 |
| 1 | 2020 | 87 | Drink | The setup—wet winter into dry spring—was ideal... | United States | Napa | Cabernet | 0.251209 | 12.225392 | 73.0 | 3.0 | 0.027841 | 17.0 | 57.0 | 1.0 | 15.233226 | 4.5 |
| 2 | 2019 | 97 | Hold | A wet spring resulted in less overt tannic str... | United States | Napa | Cabernet | 0.219853 | 6.959674 | 58.0 | 4.0 | 0.31573 | 11.0 | 52.0 | 0.0 | 15.071129 | 30.7 |
| 3 | 2018 | 99 | Hold | A wet winter provided sufficient water through... | United States | Napa | Cabernet | 0.213956 | 6.207925 | 67.0 | 7.0 | 0.403734 | 3.0 | 68.0 | 0.0 | 16.054113 | 1.25 |
| 4 | 2017 | 92 | Drink or hold | Drought broke over the winter, with lots of ve... | United States | Napa | Cabernet | 0.221765 | 13.114877 | 54.0 | 4.0 | 0.000872 | 19.0 | 46.0 | 0.0 | 15.661129 | 0.0 |


In [19]:

# Normalise capitalisation: 'Drink or hold' -> 'Drink or Hold'
df['Drink Window'] = df['Drink Window'].str.replace('Drink or hold', 'Drink or Hold')

# Encode 'Drink', 'Hold', 'Drink or Hold', 'Past peak' and 'NYR' as 2, 1, 0, -1 and NaN
df['Drink Window'] = df['Drink Window'].replace('Drink', '2')
df['Drink Window'] = df['Drink Window'].replace('Hold', '1')
df['Drink Window'] = df['Drink Window'].replace('Drink or Hold', '0')
df['Drink Window'] = df['Drink Window'].replace('Past peak', '-1')
df['Drink Window'] = df['Drink Window'].replace('NYR', np.nan) # not yet rated

# Remove trailing asterisks (*) from the "Score" column
df['Score'] = df['Score'].str.rstrip('*')

# Replace 'NYR' with NaN in the "Score" column
df['Score'] = df['Score'].replace('NYR', np.nan)


# Drop rows where 'Score' is NaN or infinite
df = df.dropna(subset=['Score'])
df = df[~df['Score'].isin([np.inf, -np.inf])]

# Define a function to convert a score to a number
def convert_score(score):
    if isinstance(score, str) and '–' in score:
        low, high = map(int, score.split('–'))
        return (low + high) / 2
    else:
        return score
   

# Apply the function to the "Score" column
df['Score'] = df['Score'].apply(convert_score)

# Cast Vintage to int, and Score and Drink Window to float
df['Vintage'] = df['Vintage'].astype(int)
df['Score'] = df['Score'].astype(float)
df['Drink Window'] = df['Drink Window'].astype(float)
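The cleaning rules above can also be collected into two small helpers, which makes them easier to check in isolation. This is a minimal sketch, not the notebook's actual code: the names `DRINK_WINDOW_CODES`, `encode_drink_window` and `parse_score` are hypothetical, and the example inputs only assume the formats visible in the data preview (plain scores, en-dash ranges, trailing asterisks, and `NYR`).

```python
import math

# Hypothetical lookup mirroring the codes used above; labels without an
# entry (such as "NYR") map to NaN.
DRINK_WINDOW_CODES = {
    "drink": 2.0,
    "hold": 1.0,
    "drink or hold": 0.0,
    "past peak": -1.0,
}


def encode_drink_window(label):
    """Map a drink-window label to its numeric code, NaN if unknown."""
    if not isinstance(label, str):
        return math.nan
    # Lower-casing makes 'Drink or hold' and 'Drink or Hold' equivalent
    return DRINK_WINDOW_CODES.get(label.strip().lower(), math.nan)


def parse_score(raw):
    """Convert a raw score such as '87', '93–97' or '95*' to a float."""
    if raw is None:
        return math.nan
    if not isinstance(raw, str):
        return float(raw)  # already numeric (possibly NaN)
    text = raw.strip().rstrip("*")
    if text == "NYR":  # not yet rated
        return math.nan
    if "–" in text:  # an en dash separates the bounds of a score range
        low, high = (float(part) for part in text.split("–"))
        return (low + high) / 2
    return float(text)
```

With pandas, these would be applied as `df['Score'].apply(parse_score)` and `df['Drink Window'].apply(encode_drink_window)`. Rows with a NaN score would still need to be dropped afterwards.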


## Basic model: XGBoost

In [20]:
# Drop Vintage, Drink Window, Description, Country, Region and Variety.
# Note: 'Unnamed: 0' (the row index saved in the CSV) is kept and therefore used as a feature.
df = df.drop(['Vintage', 'Drink Window', 'Description', 'Country', 'Region', 'Variety'], axis=1)
df.head()

| Unnamed: 0 | Score | Vola_Temp | Vola_Rain | Longest_Dry | Longest_Wet | Avg_Rain | Count_above35 | Count_under10 | Count_under0 | Coulure_Wind | June_Rain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95.0 | 0.212736 | 7.168308 | 29.0 | 4.0 | 0.149378 | 10.0 | 73.0 | 0.0 | 17.094839 | 1.18 |
| 1 | 87.0 | 0.251209 | 12.225392 | 73.0 | 3.0 | 0.027841 | 17.0 | 57.0 | 1.0 | 15.233226 | 4.5 |
| 2 | 97.0 | 0.219853 | 6.959674 | 58.0 | 4.0 | 0.31573 | 11.0 | 52.0 | 0.0 | 15.071129 | 30.7 |
| 3 | 99.0 | 0.213956 | 6.207925 | 67.0 | 7.0 | 0.403734 | 3.0 | 68.0 | 0.0 | 16.054113 | 1.25 |
| 4 | 92.0 | 0.221765 | 13.114877 | 54.0 | 4.0 | 0.000872 | 19.0 | 46.0 | 0.0 | 15.661129 | 0.0 |


In [21]:
# Count the remaining NaN values in the target column
df['Score'].isna().sum()

0

In [22]:
# Separate the target variable and input variables
X = df.drop('Score', axis=1)
y = df['Score']

# Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an XGBoost regression model
model = XGBRegressor()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
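# Adjusted R-squared. Note: r is the training-set size and p is fixed at 2,
# not the test-set size and the number of feature columns.
r = X_train.shape[0]
p = 2
adj_r2 = 1-(((1-r2)*(r-1))/(r-p-1))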
print('R-square score: {:.6f}'.format(r2))
print('Adjusted R-square score on test set: {:.6f}'.format(adj_r2))
print("The mean squared error (MSE) on test set: {:.6f}".format(mse))
print("The absolute error (MAE) on test set: {:.6f}".format(mae))

R-square score: -0.152839
Adjusted R-square score on test set: -0.154807
The mean squared error (MSE) on test set: 17.229869
The absolute error (MAE) on test set: 3.019083
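The adjusted score above is labelled as a test-set metric, but it is computed with the training-set size and a fixed `p = 2`. A more common convention takes the number of evaluated rows as the sample size and the number of feature columns as `p`. The sketch below shows that variant; the helper name `adjusted_r2` is hypothetical, and the printed value uses made-up inputs rather than results from this notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for n_samples observations and n_features predictors."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Illustrative call with made-up numbers (not results from this notebook)
print(adjusted_r2(0.5, n_samples=101, n_features=10))
```

In this notebook the call would presumably be `adjusted_r2(r2, X_test.shape[0], X_test.shape[1])`. With few test rows and many features, the penalty is noticeably larger than the one printed above.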


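This is only a quick, preliminary modelling approach, but the results are poor: the negative R-squared score (-0.15) means the model predicts the test scores worse than simply predicting their mean would, and the MAE is about 3 points. The initial analysis already suggested that the weather data carry no strong predictive signal.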
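To put the negative R-squared score reported above into context: R-squared compares the model's squared error with that of a constant prediction at the mean of the evaluated targets. That constant scores exactly 0, and anything worse goes negative. The sketch below is a hand-written illustration of that definition, not the scikit-learn implementation. The helper name `r2` is hypothetical, and the five scores are the ones shown by `df.head()`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


scores = [95.0, 87.0, 97.0, 99.0, 92.0]  # the five scores shown by df.head()

# Constant prediction at the mean of the evaluated scores
mean_baseline = [sum(scores) / len(scores)] * len(scores)
# A constant prediction that is further away from the scores
poor_guess = [90.0] * len(scores)

print(r2(scores, mean_baseline))  # 0.0
print(r2(scores, poor_guess))     # negative: worse than the mean baseline
```

A model with a negative test R-squared is therefore beaten by a constant guess, which is a useful sanity check before tuning hyperparameters.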