# Recommending Wine to Wine Reviewers

This notebook aims to use both Collaborative Filtering and Content-Based Recommendation Engines to recommend wines to wine reviewers based on the data set "wine-reviews." 

The two recommendation methods that I'll use:
* **Collaborative Filter (CF) Recommendations** - collaborative filter recommendations use the 'wisdom of the masses' to recommend items to users based on similarities between user ratings of the items. The logic is: Person A likes Items 1, 2 and 3, and Person B likes Items 1 and 2, therefore Person B will likely like Item 3 as well.
* **Content-Based Recommendations** - content-based recommendations represent each item as a feature vector and measure how similar items are to each other, here with cosine similarity. The logic is that a user who likes one item will probably like other items with similar features.

This notebook will investigate both methods of recommendation engines to produce results for wine reviewers.
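
To make the Person A / Person B logic concrete, here is a minimal sketch of neighbourhood-style collaborative filtering on toy data. The names `likes`, `jaccard` and `recommend_for` are hypothetical helpers for illustration only, and this is not the SVD model used later in the notebook.

```python
# Toy "likes" data mirroring the Person A / Person B example (hypothetical).
likes = {
    "A": {"item1", "item2", "item3"},
    "B": {"item1", "item2"},
    "C": {"item4"},
}

def jaccard(a, b):
    """Overlap between two sets of liked items: 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_for(user, likes):
    """Score items the user has not tried by the similarity of the users who like them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest-scoring unseen items first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

recs = recommend_for("B", likes)
print(recs)
```

Person B shares two of three items with Person A and nothing with Person C, so item3 ranks above item4.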

In [1]:
# Importing dependencies

# Surprise is a Python library for collaborative filtering recommendation algorithms
from surprise import SVD
from surprise import NMF
from surprise.model_selection import cross_validate
from surprise import Reader, Dataset

# scikit-learn is a popular Python library for machine learning and data science models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

## Reading and Analyzing the Data

First we will read the wine review data and conduct some simple data exploration on it.

In [2]:
# Data pulled into a Pandas DataFrame

wine_df = pd.read_csv('/kaggle/input/wine-reviews/winemag-data-130k-v2.csv')
wine_df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [3]:
# Investigate our numeric columns

wine_df.describe()

Unnamed: 0.1,Unnamed: 0,points,price
count,129971.0,129971.0,120975.0
mean,64985.0,88.447138,35.363389
std,37519.540256,3.03973,41.022218
min,0.0,80.0,4.0
25%,32492.5,86.0,17.0
50%,64985.0,88.0,25.0
75%,97477.5,91.0,42.0
max,129970.0,100.0,3300.0


In [4]:
# Check the non-null count and data types for each column

wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             129971 non-null  int64  
 1   country                129908 non-null  object 
 2   description            129971 non-null  object 
 3   designation            92506 non-null   object 
 4   points                 129971 non-null  int64  
 5   price                  120975 non-null  float64
 6   province               129908 non-null  object 
 7   region_1               108724 non-null  object 
 8   region_2               50511 non-null   object 
 9   taster_name            103727 non-null  object 
 10  taster_twitter_handle  98758 non-null   object 
 11  title                  129971 non-null  object 
 12  variety                129970 non-null  object 
 13  winery                 129971 non-null  object 
dtypes: float64(1), int64(2), object(11)


## Collaborative Filter Recommendations

I'll start with CF recommendations. 

The required format for CF recommendations is ['userId', 'itemId', 'rating']. 

In our case, that will be ['tasterId', 'wineId', 'points']. The collaborative filtering will use the Surprise library's Singular Value Decomposition (SVD) model, made famous by Simon Funk during the Netflix competition. Essentially, the model uses stochastic gradient descent to minimize the squared error of its predictions.
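
As a rough illustration of that idea, the sketch below trains a tiny biased matrix factorization with stochastic gradient descent in plain Python. It is a simplified stand-in and not Surprise's implementation; the function name `funk_svd`, the toy ratings and the hyperparameter values are all assumptions chosen for this small example.

```python
import random

def funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i by SGD on the squared error."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Step each parameter along the negative gradient of the
            # regularized squared error for this single rating
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

# Hypothetical (tasterId, wineId, points) triples
ratings = [(0, 0, 87), (0, 1, 90), (1, 0, 85), (1, 2, 92), (2, 1, 91), (2, 2, 94)]
predict = funk_svd(ratings)
print(f"Predicted points for taster 0 on wine 2: {predict(0, 2):.2f}")
```

The returned `predict` function can score any taster and wine seen in training, including pairs that were never rated together, which is what makes the model usable for recommendations.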

In [5]:
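# Select the columns necessary for CF and encode tasters and wines as integer category codes
# Note: rows with a missing taster_name get the code -1, so they are all treated as a single anonymous taster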

wine_cf_df = wine_df.loc[:, ['points', 'taster_name', 'title']]
wine_cf_df.loc[:, 'tasterId'] = wine_cf_df.loc[:, 'taster_name'].astype('category').cat.codes
wine_cf_df.loc[:, 'wineId'] = wine_cf_df.loc[:, 'title'].astype('category').cat.codes

wine_cf_df.head()

Unnamed: 0,points,taster_name,title,tasterId,wineId
0,87,Kerin O’Keefe,Nicosia 2013 Vulkà Bianco (Etna),9,79521
1,87,Roger Voss,Quinta dos Avidagos 2011 Avidagos Red (Douro),15,89368
2,87,Paul Gregutt,Rainstorm 2013 Pinot Gris (Willamette Valley),14,89782
3,87,Alexander Peartree,St. Julian 2013 Reserve Late Harvest Riesling ...,0,100878
4,87,Paul Gregutt,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,14,102810


Next, I instantiate the Surprise Reader and Dataset classes.
* The **Reader** is used to parse the dataset
* The **Dataset** is used to hold the data as parsed by a Surprise Reader

I'll then instantiate the SVD algorithm and run Surprise's built-in cross-validation method to check how the model handles the data.

In [6]:
# We know that the minimum and maximum of the rating scale 'points' is 80 and 100
reader = Reader(rating_scale=(80, 100))

# We load the data into the Surprise dataset with the reader
data = Dataset.load_from_df(wine_cf_df[['tasterId', 'wineId', 'points']], reader)

In [7]:
# Set the algorithm to Surprise's SVD and cross validate on our data
algo = SVD()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.8073  2.7933  2.7875  2.7976  2.8012  2.7974  0.0067  
MAE (testset)     2.1640  2.1504  2.1423  2.1426  2.1527  2.1504  0.0080  
Fit time          12.31   11.59   12.77   11.56   11.43   11.93   0.52    
Test time         0.44    0.30    0.43    0.29    0.40    0.37    0.06    


{'test_rmse': array([2.80733572, 2.79334911, 2.78750987, 2.79758324, 2.80122726]),
 'test_mae': array([2.16398659, 2.15038604, 2.14229566, 2.14261534, 2.15272081]),
 'fit_time': (12.310510873794556,
  11.594529151916504,
  12.766496658325195,
  11.560883283615112,
  11.429670095443726),
 'test_time': (0.43811869621276855,
  0.2973320484161377,
  0.4274940490722656,
  0.29324889183044434,
  0.40477538108825684)}

SVD performs reasonably well on the data, with a mean RMSE of about 2.80 points and a mean MAE of about 2.15 points across the five folds. Other algorithms in the Surprise library (such as KNNBasic) might do better, but for our purposes we'll stick with SVD.
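
For reference, the two error measures reported by the cross validation can be written out in a few lines. This is a minimal sketch on made-up numbers; the `rmse` and `mae` helpers and the example scores are illustrative and are not taken from the notebook's results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true and predicted point scores
actual = [87, 90, 92, 85]
predicted = [88, 90, 89, 86]

print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"MAE:  {mae(actual, predicted):.4f}")
```

RMSE is never smaller than MAE on the same predictions, which matches the pattern in the cross validation output above.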

Next I'll set up a train_set of the data using the build_full_trainset method of the Surprise Dataset class. Note that we can't simply fit the raw data; it must first be prepared for fitting with this method. Afterwards, we'll set up a test set using the build_anti_testset method of the train_set object. The anti-testset contains every user-item pair whose rating is not known (the taster hasn't reviewed that wine yet). This is important for exploration and for recommending new items to the user.
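
Conceptually, an anti-testset is just the set of user-item pairs that do not appear in the known ratings, each given a placeholder rating (the global mean by default in Surprise). The sketch below shows the idea in plain Python; `anti_testset` is a hypothetical helper written for illustration, not Surprise's code.

```python
def anti_testset(ratings):
    """Return (user, item, fill) for every user-item pair that has no known rating."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    known = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Every combination of a known user and a known item, minus the rated pairs
    return [(u, i, global_mean) for u in users for i in items if (u, i) not in known]

# Hypothetical (tasterId, wineId, points) triples
toy_ratings = [(0, "a", 88), (0, "b", 90), (1, "a", 92)]
unseen = anti_testset(toy_ratings)
print(unseen)
```

The number of pairs grows with users times items, so on a large catalogue the anti-testset can be far bigger than the original ratings.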

In [8]:
# Building a data trainset and testset for building predictions of what new wines users will like

train_set = data.build_full_trainset()
test_set = train_set.build_anti_testset()
predictions = algo.fit(train_set).test(test_set)

In [9]:
# We can now see the top 10 wines for a user, and make other data analysis possible with the recommendations

predictions_df = pd.DataFrame(predictions)

# Get the top 10 recommended wines for Taster #9 (Kerin O'Keefe)
uid = 9

uid_preds = predictions_df.loc[predictions_df['uid'] == uid, :].sort_values(['est'], ascending=False).iloc[:10]
uid_preds['wineLabel'] = uid_preds.loc[:, 'iid'].apply(lambda i: wine_cf_df.loc[wine_cf_df['wineId'] == i, 'title'].values[0])
uid_preds.loc[:, ['wineLabel', 'est']].reset_index(drop=True)

Unnamed: 0,wineLabel,est
0,Ravines 2006 Chardonnay (Finger Lakes),91.575788
1,Guitián 2006 Sobre Lias Godello (Valdeorras),91.543624
2,Château du Cayrou 2011 Malbec Valley Malbec (C...,91.363991
3,Peltier 2011 Hybrid Cabernet Sauvignon (Lodi),91.353269
4,Cayuse 2013 Cailloux Vineyard Syrah (Walla Wal...,91.310347
5,Feudi di San Gregorio 2008 Sirica Red (Campania),91.295066
6,Plush 2010 Smooth Red (California),91.268974
7,Line 39 2011 Cabernet Sauvignon (North Coast),91.250388
8,Pujanza 2005 Norte (Rioja),91.2019
9,Lamoreaux Landing 2013 Grüner Veltliner (Finge...,91.197164


After our model is fitted to the data, we can use it to predict how a user will like a certain item.

In [10]:
# Check how Taster #9 (Kerin O'Keefe) is predicted to rate Wine #103657 (Tantara 2010 Gwendolyn Pinot Noir)
tasterId_ = 9
wineId_ = 103657

pred = predictions_df.loc[(predictions_df['uid'] == tasterId_) & (predictions_df['iid'] == wineId_), 'est'].values[0]

print(f"{wine_cf_df.loc[wine_cf_df['tasterId'] == int(tasterId_),'taster_name'].values[0]} is predicted to rate {wine_cf_df.loc[wine_cf_df['wineId'] == int(wineId_), 'title'].values[0]} with a score of {pred:.2f}"
     f" which is {pred - predictions_df.loc[0, 'r_ui']:.2f} off the global mean.")

Kerin O’Keefe is predicted to rate Tantara 2010 Gwendolyn Pinot Noir (Sta. Rita Hills) with a score of 88.98 which is 0.53 off the global mean.


## Content-Based Recommendations

The other type of recommendation engine that we'll explore is content-based recommendation. Content-based algorithms operate on a matrix of items and the items' features to find similarities between items. In the example below, I use cosine similarity between term frequency-inverse document frequency (TF-IDF) vectors of the wine descriptions. This allows us to find similar wines based solely on the terms in their descriptions, with each term's frequency weighted against how common it is across all descriptions.

In [11]:
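# Importing dependencies from scikit-learn's feature extraction, metrics and model selection modules
# (already imported in the first cell; repeated here so this section can run on its own)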

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.model_selection import train_test_split

In [12]:
# We once again need wineId codes, so we use the same categorization of wine titles

wine_df.loc[:, 'wineId'] = wine_df.loc[:, 'title'].astype('category').cat.codes

wine_df.loc[:, ['description', 'wineId', 'title']].head()

Unnamed: 0,description,wineId,title
0,"Aromas include tropical fruit, broom, brimston...",79521,Nicosia 2013 Vulkà Bianco (Etna)
1,"This is ripe and fruity, a wine that is smooth...",89368,Quinta dos Avidagos 2011 Avidagos Red (Douro)
2,"Tart and snappy, the flavors of lime flesh and...",89782,Rainstorm 2013 Pinot Gris (Willamette Valley)
3,"Pineapple rind, lemon pith and orange blossom ...",100878,St. Julian 2013 Reserve Late Harvest Riesling ...
4,"Much like the regular bottling from 2012, this...",102810,Sweet Cheeks 2012 Vintner's Reserve Wild Child...


In [13]:
# Split the wine dataframe into a train and test split. Although we are not evaluating the model, the dataframe is
# much too large for Kaggle's kernels, so I'm only training on 5% of the data. A full implementation would require
# significantly more memory or distributed training.
    
train_wine, test_wine = train_test_split(wine_df, train_size=0.05)

train_wine.reset_index(drop=True, inplace=True)

print(f"Training on {len(train_wine)} samples.")

Training on 6498 samples.


There are a number of different feature choices for content-based recommendations, but I'll be using NLP TF-IDF features for this recommender. Other features could include categorical columns such as the province or country of the wine, selected keywords from the description, its price, and any other available data. But since the description is readily available, that is what I'll use.

In [14]:
# We're running the content-based recommender on TFIDF data from the wine descriptions. 

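# min_df=1 keeps every term (an integer min_df of 0 has the same effect but is rejected by newer scikit-learn versions)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')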

tfidf_matrix = tf.fit_transform(train_wine['description'])

print(f"The term-frequency inverse document frequency matrix is {tfidf_matrix.shape[0]} by {tfidf_matrix.shape[1]}")

The term-frequency inverse document frequency matrix is 6498 by 229797


The linear_kernel function from sklearn computes the dot product between every pair of rows in the TF-IDF matrix. Because TfidfVectorizer L2-normalizes each row by default, that dot product is equal to the cosine similarity. We then find the most similar indices for each item (wine), turn them into a list of similar items (wines), and store those items (other than the first, which is the item itself) in the results.
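
The link between a plain dot product and cosine similarity can be checked on two small vectors. This is a minimal sketch in pure Python; the vectors and the helper names `l2_normalize`, `dot` and `cosine` are illustrative assumptions, not part of scikit-learn.

```python
import math

def dot(a, b):
    """Plain dot product, which is what a linear kernel computes for a pair of rows."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two hypothetical term-weight vectors
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

kernel_value = dot(l2_normalize(a), l2_normalize(b))
print(f"dot of normalized vectors: {kernel_value:.6f}")
print(f"cosine similarity:         {cosine(a, b):.6f}")
```

Both lines print the same value, which is why the cheaper dot product can stand in for cosine similarity once the rows have unit length.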

In [15]:
# First finding the cosine similarities for the tfidf matrix
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Next, appending the results to a dictionary of the similar items to each wine
results = {}
for idx, row in train_wine.iterrows():
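    # Keep only the 101 highest-scoring indices (the wine itself plus its 100 nearest neighbours), most similar first
    similar_indices = cosine_similarities[idx].argsort()[:-102:-1]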
    similar_items = [(cosine_similarities[idx][i], train_wine['wineId'][i]) for i in similar_indices]
    results[row['wineId']] = similar_items[1:]
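
The slice applied to the argsort result decides how many neighbours are kept. The sketch below uses a plain Python list, whose slicing rules match NumPy's for this case; the toy scores and variable names are assumptions for illustration. A negative stop keeps only the top entries, while a positive stop keeps nearly everything.

```python
# Hypothetical similarity scores of six wines against wine 1 (itself, score 1.0)
scores = [0.10, 1.00, 0.30, 0.05, 0.70, 0.20]

# Indices ordered from lowest to highest score, as argsort would return them
order = sorted(range(len(scores)), key=scores.__getitem__)

k = 2
# Negative stop: walk backwards from the end and keep k + 1 entries
# (the item itself plus its k nearest neighbours)
top = order[:-(k + 2):-1]
# Positive stop: walk backwards until just before position 1,
# which keeps all but the two lowest-scoring entries
almost_all = order[:1:-1]

print(order)       # ascending order of similarity
print(top)         # the item itself followed by its k best matches
print(almost_all)  # a much longer list in the same descending order
```

Both slices start with the same best matches, so the top recommendations agree either way; the difference is how many entries are stored per wine.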

Some simple functions below allow us to find the item by its id, and then grab the top num recommendations for a given item. 

In [16]:
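def item(item_id):
    # Look up the wine title for a given wineId (the first match if the title appears more than once)
    return train_wine.loc[train_wine['wineId'] == item_id]['title'].tolist()[0].split(' - ')[0]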

def recommend(item_id, num):
    print('Recommending ' + str(num) + ' products similar to ' + item(item_id) + ' ...')
    print('-----')
    recs = results[item_id][:num]
    for rec in recs:
        print('Recommended: ' + item(rec[1]) + '(score: ' + f"{rec[0]:.2f}" + ')')

And now we can see the top num recommendations for any given item. 

In [17]:
# itemId (wineId) is grabbed from the trainset of wines

itemId_ = train_wine.loc[:, 'wineId'].values[0]
itemName_ = train_wine.loc[train_wine['wineId'] == itemId_, 'title'].values[0]

print(f"Using itemId {itemId_} which is {itemName_} \n")

# The recommend function is then run to find and return the top num matches (5 in this case)

recommend(item_id=itemId_, num=5)

Using itemId 68840 which is Loring Wine Company 2015 Rosella's Vineyard Pinot Noir (Santa Lucia Highlands) 

Recommending 5 products similar to Loring Wine Company 2015 Rosella's Vineyard Pinot Noir (Santa Lucia Highlands) ...
-----
Recommended: Giornata 2013 Luna Matta Vineyard Nebbiolo (Paso Robles)(score: 0.08)
Recommended: Grattamacco 2011  Bolgheri Superiore(score: 0.08)
Recommended: Roblar 2014 Sangiovese (Santa Ynez Valley)(score: 0.06)
Recommended: Davis Family 2013 Soul Patch Estate Grown Syrah (Russian River Valley)(score: 0.06)
Recommended: Masseria del Feudo Grottarossa 2010 Il Giglio Nero d'Avola (Sicilia)(score: 0.06)


In [18]:
results[itemId_][0][1]

51148

In [19]:
# We can then compare the descriptions of the two wines to see how they match

description_original = train_wine.loc[train_wine['title'] == itemName_, 'description'].values[0]

description_matched = train_wine.loc[train_wine['wineId'] == results[itemId_][0][1], 'description'].values[0]

print(f"First wine description: \n{description_original} \n\nMatched wine description: \n{description_matched}")

First wine description: 
Intense baked plum and black cherry aromas meet with vanilla, caramel and sagebrush on the nose of this bottling from one of the appellation's most coveted vineyards. The tangy, energetic and grippy palate offers crushed cranberry flavors, more black plum, gingerbread and exotic Indian spice on the finish. 

Matched wine description: 
Vibrant red plum and cherry aromas meet hibiscus, cinnamon and a touch of inviting cotton candy on the nose of this bottling. The palate brims with cinnamon spice, baked strawberry and red plum, further enhanced by licorice and ginger snap flavors. The structure is elegantly grippy.


Both of these recommendation models have practical uses in building a hybrid, more comprehensive recommendation engine. For instance, content-based models allow for recommendations in the absence of past user decisions, which is useful for a first-time reviewer. As the user's actions and decision history are compiled over time, the recommendation engine could begin to rely more on collaborative filtering to recommend wines liked by reviewers with similar tastes.

Such a hybrid recommendation engine would handle three cases:
* **First-time reviewers** - mostly content-based recommendations, seeded by asking the reviewer to pick a few wines they already enjoy. This is similar to Netflix asking new users to select some movies they've previously enjoyed.
* **Repeat reviewers** - as data is compiled on the user's tastes, the engine begins to shift from content-based to collaborative filtering.
* **New wines** - as new items are added that have not yet been rated by reviewers, the content-based model can recommend them based on the winery's description. It is therefore important for exploration purposes that the engine never completely abandons content-based recommendations.
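
One simple way to express that shift is a weight that moves from the content-based score towards the collaborative filtering score as a reviewer accumulates ratings. The sketch below is only one possible design: `blend_weight`, `hybrid_score`, `half_point` and `min_content` are hypothetical names and values, and it assumes both scores have already been rescaled to the same 0 to 1 range.

```python
def blend_weight(n_ratings, half_point=20):
    """Share of the final score given to collaborative filtering (0 to 1).

    With half_point=20 (an assumed value), a reviewer with 20 ratings
    gets an even split between the two models.
    """
    return n_ratings / (n_ratings + half_point)

def hybrid_score(cf_score, content_score, n_ratings, half_point=20, min_content=0.1):
    """Blend the two scores, always reserving min_content for the content-based model."""
    w_cf = min(blend_weight(n_ratings, half_point), 1.0 - min_content)
    return w_cf * cf_score + (1.0 - w_cf) * content_score

# A first-time reviewer relies entirely on the content-based score
print(hybrid_score(cf_score=0.9, content_score=0.4, n_ratings=0))
# A reviewer with many ratings leans mostly on collaborative filtering
print(hybrid_score(cf_score=0.9, content_score=0.4, n_ratings=500))
```

The cap on the collaborative filtering weight reflects the last point above: the content-based model keeps some influence so that new wines can still surface.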