# Wine points prediction

In [118]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('../')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Here we will try to predict the points a wine will get based on its known characteristics (i.e. features, in ML terminology). The main point at this stage is to establish a simple, ideally very cost-effective, baseline to which models with increased complexity and resource demands will be compared.

In the real world there is a tradeoff between complexity and performance, and part of the data scientist's job is to present a tradeoff table showing what performance is achievable at each complexity level. Complexity should then be translated into cost. For example:
 * Compute cost
 * Maintenance cost
 * Serving cost (e.g. is a new platform needed?)
 

## Loading the data

In [1]:
import pandas as pd
import cufflinks as cf; cf.go_offline()

In [2]:
wine_reviews = pd.read_csv("data/winemag-data-130k-v2.csv")
wine_reviews.shape

(129971, 14)

In [3]:
wine_reviews.sample(5)

| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74225 | 74225 | France | A wine made entirely from Pinot Meunier is unu... | Blanc de Noirs Brut | 89 | 59.0 | Champagne | Champagne | | Roger Voss | @vossroger | H. Blin NV Blanc de Noirs Brut Pinot Meunier (... | Pinot Meunier | H. Blin |
| 59743 | 59743 | Italy | Aromas of toasted oak, vanilla and a confectio... | Mongris Riserva | 88 | 30.0 | Northeastern Italy | Collio | | Kerin O’Keefe | @kerinokeefe | Marco Felluga 2012 Mongris Riserva Pinot Grigi... | Pinot Grigio | Marco Felluga |
| 70995 | 70995 | US | This rich, fruity Cabernet comes from one of t... | Watchtower Vineyard Estate | 90 | 45.0 | California | Dry Creek Valley | Sonoma | | | Gustafson Family 2007 Watchtower Vineyard Esta... | Cabernet Sauvignon | Gustafson Family |
| 34108 | 34108 | US | Among the impressive single-vineyard Adelsheim... | Ribbon Springs Vineyard | 93 | 75.0 | Oregon | Ribbon Ridge | Willamette Valley | Paul Gregutt | @paulgwine | Adelsheim 2013 Ribbon Springs Vineyard Pinot N... | Pinot Noir | Adelsheim |
| 121596 | 121596 | Argentina | Aromas of herbs and desert brush blend with ra... | Reserve | 89 | 19.0 | Mendoza Province | Valle de Uco | | Michael Schachner | @wineschach | Salentein 2014 Reserve Malbec (Valle de Uco) | Malbec | Salentein |


## Points prediction

Points is a numeric target: integer scores on an ordered scale. We therefore treat this as a regression problem rather than a classification problem. Regression solutions can be measured with several metrics:

* MSE - [Mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)
* R2 - [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination) (coefficient of determination)
* MAE - [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error)

Read more [here](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b)
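As a minimal sketch of what these three metrics compute, the definitions can be written out in plain Python. The function names and the toy numbers below are illustrative only; the notebook itself relies on the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy scores, made up for illustration.
y_true = [86, 88, 90, 92]
y_pred = [87, 88, 89, 93]

# A constant prediction equal to the mean of y_true.
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)

print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
print(r2(y_true, mean_pred))
```

Predicting the mean of the evaluated data gives an R2 of exactly 0, so an R2 close to 0 is what one should expect from a mean-only baseline.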

### Train and test set split

To report results properly, let's split the data into train and test sets:

In [25]:
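# Sample 80% of rows for training (no random_state, so the split varies between runs)
train_data = wine_reviews.sample(frac=0.8)
# The rest is the test set; .copy() avoids SettingWithCopyWarning when adding columns later
test_data = wine_reviews[~wine_reviews.index.isin(train_data.index)].copy()
assert len(train_data) + len(test_data) == len(wine_reviews)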

In [39]:
len(test_data), len(train_data)

(25994, 103977)
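The split above draws a random sample without a fixed seed, so the reported metrics can shift slightly from run to run. A minimal sketch of a reproducible variant, assuming a hypothetical helper `split_train_test` and an arbitrary seed of 42 (neither appears in the original notebook):

```python
import pandas as pd


def split_train_test(df, frac=0.8, seed=42):
    """Return (train, test); the same seed always yields the same split."""
    train = df.sample(frac=frac, random_state=seed)
    # Everything not sampled into train becomes the test set.
    test = df.drop(train.index).copy()
    return train, test


# Toy frame standing in for wine_reviews.
toy = pd.DataFrame({"points": range(80, 100)})

train_a, test_a = split_train_test(toy)
train_b, test_b = split_train_test(toy)

# Both calls select exactly the same rows.
print(train_a.index.equals(train_b.index))
```

Fixing the seed makes comparisons between models fairer, since every model is evaluated on the same held-out rows.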

### Baselines

In [27]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [28]:
def calc_prediction_quality(df, pred_score_col, true_score_col):
    return pd.Series({'MSE': mean_squared_error(df[true_score_col], df[pred_score_col]),
                      'MAE': mean_absolute_error(df[true_score_col], df[pred_score_col]),
                      'R2': r2_score(df[true_score_col], df[pred_score_col])})

#### Baseline 1

The most basic baseline is simply the average points in the training set. The implementation is as simple as:

In [64]:
test_data['basiline_1_predicted_points'] = train_data.points.mean()
b1_stats = calc_prediction_quality(test_data, 'basiline_1_predicted_points', 'points')
b1_stats

MSE    9.265837
MAE    2.491054
R2    -0.000092
dtype: float64

#### Baseline 2

We can probably improve by predicting the average score of the wine's country of origin, falling back to the overall average for countries not seen in training:

In [65]:
avg_points_by_country = train_data.groupby('country').points.mean()
avg_points_by_country.head()

country
Argentina                 86.697832
Armenia                   88.000000
Australia                 88.532902
Austria                   90.115712
Bosnia and Herzegovina    86.500000
Name: points, dtype: float64

In [66]:
test_data['basiline_2_predicted_points'] = test_data.country.map(avg_points_by_country).fillna(train_data.points.mean())
b2_stats = calc_prediction_quality(test_data, 'basiline_2_predicted_points', 'points')
b2_stats

MSE    8.829102
MAE    2.428411
R2     0.047046
dtype: float64

#### Baseline 3

Adding more breakdowns increases granularity but can result in overfitting, since small groups yield noisy averages. Still, let's try adding the province:

In [44]:
avg_points_by_country_and_region = train_data.groupby(['country','province']).points.mean().rename('basiline_3_predicted_points')
avg_points_by_country_and_region.head()

country    province        
Argentina  Mendoza Province    86.824793
           Other               85.926773
Armenia    Armenia             88.000000
Australia  Australia Other     85.435897
           New South Wales     87.470588
Name: basiline_3_predicted_points, dtype: float64

In [59]:
test_data_with_baseline_3 = test_data.merge(avg_points_by_country_and_region, on=['country', 'province'], how='left')
# Fall back to baseline 2, then to the global training mean (a scalar: merge resets the index, so test_data would not align)
test_data_with_baseline_3.basiline_3_predicted_points = test_data_with_baseline_3.basiline_3_predicted_points.fillna(
    test_data_with_baseline_3.basiline_2_predicted_points).fillna(train_data.points.mean())
test_data_with_baseline_3.shape, test_data.shape

((25994, 17), (25994, 16))

In [67]:
b3_stats = calc_prediction_quality(test_data_with_baseline_3, 'basiline_3_predicted_points', 'points')
b3_stats

MSE    8.324264
MAE    2.341909
R2     0.101535
dtype: float64
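One hedged way to limit the overfitting risk mentioned above is to trust a group average only when the group has enough training rows, and to fall back to a coarser estimate otherwise. The sketch below is not what the notebook does; the helper name `group_mean_with_fallback`, the `min_count` threshold and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def group_mean_with_fallback(train, test, key, target, min_count):
    """Predict the per-group training mean, but only for groups with at
    least `min_count` training rows; otherwise use the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(key)[target].agg(["mean", "count"])
    reliable_means = stats.loc[stats["count"] >= min_count, "mean"]
    # Groups that are unseen or too small map to NaN and get the global mean.
    return test[key].map(reliable_means).fillna(global_mean)


# Toy data, made up for illustration.
train = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Armenia"],
    "points": [90, 88, 92, 91, 89, 88],
})
test = pd.DataFrame({"country": ["US", "Armenia", "Chile"]})

preds = group_mean_with_fallback(train, test, "country", "points", min_count=2)
print(preds)
```

Here "Armenia" has a single training row and "Chile" has none, so both receive the global mean, while "US" receives its own group mean.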

### Baselines summary

In [72]:
baseline_summary = pd.DataFrame([b1_stats, b2_stats, b3_stats], index=['baseline_1', 'baseline_2','baseline_3'])
baseline_summary

| | MSE | MAE | R2 |
|---|---|---|---|
| baseline_1 | 9.265837 | 2.491054 | -0.000092 |
| baseline_2 | 8.829102 | 2.428411 | 0.047046 |
| baseline_3 | 8.324264 | 2.341909 | 0.101535 |


In [73]:
baseline_summary.to_csv('data/baselines_summary.csv')  # keep the index: it holds the baseline names

## Training a boosted trees regressor

In [96]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#### Preparing data - Label encoding categorical features

In [108]:
categorical_features = ['country','province','region_1','region_2','taster_name','variety','winery']
numerical_features = ['price']
features = categorical_features + numerical_features

In [106]:
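# Integer-encode each categorical column (missing values become the category 'NA').
# fit_transform refits `le` on every column, so `le` ends up holding only the last column's mapping.
encoded_features = wine_reviews[categorical_features].apply(lambda col: le.fit_transform(col.fillna('NA')))
encoded_features['price'] = wine_reviews.price
encoded_features['points'] = wine_reviews.points
encoded_features.head()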

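| | country | province | region_1 | region_2 | taster_name | variety | winery | price | points |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 332 | 424 | 6 | 9 | 691 | 11608 | | 87 |
| 1 | 32 | 108 | 738 | 6 | 16 | 451 | 12956 | 15.0 | 87 |
| 2 | 41 | 269 | 1218 | 17 | 15 | 437 | 13018 | 14.0 | 87 |
| 3 | 41 | 218 | 549 | 6 | 0 | 480 | 14390 | 13.0 | 87 |
| 4 | 41 | 269 | 1218 | 17 | 15 | 441 | 14621 | 65.0 | 87 |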
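The cell above reuses a single `LabelEncoder` for every column and fits it on the full dataset, so the mappings cannot be reapplied to new data later. As a hedged alternative sketch (not the notebook's approach), scikit-learn's `OrdinalEncoder` can be fitted on the training rows only and map categories unseen in training to a sentinel value. This assumes scikit-learn 0.24 or later; the toy frames and the sentinel `-1` are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for the train and test categorical columns.
train = pd.DataFrame({
    "country": ["US", "France", None],
    "variety": ["Malbec", "Pinot Noir", "Malbec"],
})
test = pd.DataFrame({
    "country": ["France", "Chile"],
    "variety": ["Malbec", "Riesling"],
})

# Categories not seen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train.fillna("NA"))

test_encoded = pd.DataFrame(
    encoder.transform(test.fillna("NA")),
    columns=test.columns,
    index=test.index,
)
print(test_encoded)
```

One fitted encoder object now holds the mapping for all columns, so the same transformation can be reused at serving time.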


#### Re-splitting to train and test

In [107]:
train_encoded_features = encoded_features[encoded_features.index.isin(train_data.index)]
test_encoded_features = encoded_features[encoded_features.index.isin(test_data.index)]
assert(len(train_encoded_features) + len(test_encoded_features) == len(wine_reviews))

#### Fitting a tree-regressor

In [119]:
from commons.models import i_feel_lucky_xgboost_training

In [125]:
xgb_clf, clf_name = i_feel_lucky_xgboost_training(train_encoded_features, test_encoded_features, features, 'points', name='xgb_clf_points_prediction')

Let's look at the function output - specifically the **xgb_clf_points_prediction** column:

In [128]:
test_encoded_features.head()

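| | country | province | region_1 | region_2 | taster_name | variety | winery | price | points | xgb_clf_points_prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 32 | 108 | 738 | 6 | 16 | 451 | 12956 | 15.0 | 87 | 85.588341 |
| 4 | 41 | 269 | 1218 | 17 | 15 | 441 | 14621 | 65.0 | 87 | 89.602592 |
| 21 | 41 | 269 | 788 | 11 | 15 | 441 | 135 | 20.0 | 87 | 86.302856 |
| 22 | 22 | 332 | 992 | 6 | 9 | 691 | 904 | 19.0 | 87 | 86.459175 |
| 38 | 22 | 342 | 856 | 6 | 14 | 460 | 7322 | 11.0 | 86 | 83.836739 |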


In [130]:
xgb_stats = calc_prediction_quality(test_encoded_features, 'xgb_clf_points_prediction','points')
xgb_stats

MSE    7.542085
MAE    2.245315
R2     0.185958
dtype: float64

In [133]:
all_compared = pd.DataFrame([b1_stats, b2_stats, b3_stats, xgb_stats], index=['baseline_1', 'baseline_2','baseline_3','regression_by_xgb'])
all_compared

| | MSE | MAE | R2 |
|---|---|---|---|
| baseline_1 | 9.265837 | 2.491054 | -0.000092 |
| baseline_2 | 8.829102 | 2.428411 | 0.047046 |
| baseline_3 | 8.324264 | 2.341909 | 0.101535 |
| regression_by_xgb | 7.542085 | 2.245315 | 0.185958 |


In [134]:
all_compared.to_csv('data/all_models_compared.csv')  # keep the index: it holds the model names