# Sommelier Digital

This notebook uses tasting notes scraped from Wine Enthusiast by [Kaggle](https://www.kaggle.com/zynicide/wine-reviews) user [zachthoutt](https://www.kaggle.com/zynicide) to recommend new wines based on similar tasting notes.

This will be a content-based recommender: we use wine reviews to find wines whose tasting notes describe them similarly, then rank the recommendations by both the Wine Enthusiast score (higher is better) and the price (more expensive wines are penalized).

*****
## Imports & Data Reading

We will need pandas to read in the data and scikit-learn's `CountVectorizer` to build the bag-of-words features, plus NumPy and `re` for array handling and text cleanup.

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

import numpy as np

import re

*****
## Exploratory Data Analysis

Take a look at what we've got.

In [2]:
# read in the initial data

df = pd.read_csv('../data/winemag-data_first150k.csv')

# rescale the points to [0,1]
df['points'] = df['points']/100.
df.head()

| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 0.96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 0.96 | 110.0 | Northern Spain | Toro |  | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 0.96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 0.96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 0.95 | 66.0 | Provence | Bandol |  | Provence red blend | Domaine de la Bégude |


In [3]:
# pull the review text out of the dataframe as an array of strings
words = df['description']
corpus = words.values
print(corpus)

['This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.'
 'Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.'
 'Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.'
 ...
 'This classic example comes f

### Read in Bag of Words

We need to map each review to a bag of words. Doing this naively leads to an enormous vocabulary, so we drop any word that appears in fewer than some threshold fraction of reviews, which eliminates words that aren't common descriptors of wine. Based on data exploration, we also remove frequently appearing words that say nothing useful about the wine, and we collapse common variants that mean the same thing, like "acid" and "acidity", into a single root.
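The thresholding idea can be illustrated on a toy corpus. This is only a sketch in plain Python, not the notebook's code: the function name `frequent_terms` and the four sample documents are made up for illustration, and whitespace splitting stands in for real tokenization. It keeps a term only if the fraction of documents containing it reaches the threshold, which is how a fractional `min_df` behaves in `CountVectorizer`.

```python
from collections import Counter

def frequent_terms(documents, min_fraction):
    """Return terms whose document frequency is at least min_fraction of the corpus.

    Hypothetical helper for illustration only.
    """
    doc_freq = Counter()
    for doc in documents:
        # count each term once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    threshold = min_fraction * len(documents)
    return sorted(term for term, count in doc_freq.items() if count >= threshold)

# made-up sample documents
docs = ["ripe cherry oak", "cherry spice", "oak cherry vanilla", "pear citrus"]
kept = frequent_terms(docs, min_fraction=0.5)
print(kept)
```

With four documents and a fraction of 0.5, only terms that appear in at least two documents survive.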

In [34]:
# minimum fraction of documents in the corpus a vocabulary word appears in, to eliminate
# uncommon descriptors
appearance_rate = .0005
max_count = 50

# keep hyphenated words as a single word
pattern = "(?u)\\b[\\w-]+\\b"
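
To see what this token pattern changes, it can be compared with `CountVectorizer`'s documented default pattern using `re.findall`, which is how the vectorizer applies the pattern. The sample sentence is invented for illustration.

```python
import re

# token pattern from the cell above: hyphens count as part of a word
pattern = "(?u)\\b[\\w-]+\\b"
# CountVectorizer's default: runs of two or more word characters
default_pattern = r"(?u)\b\w\w+\b"

text = "a full-bodied red-cherry wine"  # made-up example text

tokens = re.findall(pattern, text)
default_tokens = re.findall(default_pattern, text)

print(tokens)
print(default_tokens)
```

The custom pattern keeps `full-bodied` and `red-cherry` intact, while the default splits them at the hyphen. The custom pattern also admits one-character tokens such as `a`, which the preprocessing step removes separately.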

In [44]:
# want to remove numbers, underscores, adverbs, and possessives from the vocabulary

# some common wine review words to get rid of because they aren't helpful
# derived from data exploration
evil_words = ['wine', 'flavor', 'finish', 'offer', 'like', 'texture', 
              'note', 'hint', 'good', 'bad', 'great', 'nice', 'year', 
              'kind', 'away', 'time', 'perfect', 'color', 'feel', 'just', 
              'palate', 'best', 'grape', 'tremendous', 'enjoy', 'age', 
              'elegant', 'hail', 'background']

# words that come up as roots of other words, so acid = acidity, etc.
flatten_words = ['acid', 'sweet', 'fruit', 'tannin']

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text) # remove numbers
    text = re.sub(r'\b(\w+ly)\b', '', text) # remove adverbs (any word ending in -ly)
    text = re.sub(r'\b\w{1,3}\b', '', text) # remove words of three characters or fewer
    text = text.replace('_', '-')
    
    # get rid of common meaningless words; the pattern drops every word that
    # contains a listed word, so 'age' also removes 'aged' and 'vintage'
    for word in evil_words:
        text = re.sub(r'\w*{}\w*'.format(word), '', text)
        
    # convert variants to the root, so 'acidity' becomes 'acid'
    for word in flatten_words:
        text = re.sub(r'\w*{}\w*'.format(word), word, text)
        
    return text
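
As a quick check of how the two substring patterns behave, here is a small standalone sketch that applies them to one invented sentence, using a single entry from each word list. The final whitespace cleanup is only for readability and is not part of the notebook's function.

```python
import re

drop_word = "age"    # one entry from evil_words
root_word = "acid"   # one entry from flatten_words

sample = "sage and vintage notes with bright acidity"  # made-up text

# drop every token that contains the unwanted word anywhere inside it
dropped = re.sub(r"\w*{}\w*".format(drop_word), "", sample)

# collapse every token that contains the root down to the root itself
flattened = re.sub(r"\w*{}\w*".format(root_word), root_word, dropped)

# tidy leftover runs of spaces (illustration only)
result = " ".join(flattened.split())
print(result)
```

Note that "sage" and "vintage" both disappear because they contain "age", so the word list should be chosen with its substrings in mind.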

In [45]:
vectorizer = CountVectorizer(stop_words='english', 
                             preprocessor=preprocess_text, 
                             token_pattern=pattern,
                             strip_accents='ascii',
                             min_df=appearance_rate,
                             ngram_range=(1,4))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead
vocabulary = vectorizer.get_feature_names_out()

In [46]:
# Summarize the data
print('size of vocabulary = {}'.format(np.shape(X)[1]))

#count_values = X.toarray().sum(axis=0)
#print('most common words are')
# output n-grams
#for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vectorizer.vocabulary_.items()], reverse=True):
#    print(ng_count, ng_text)

size of vocabulary = 6101


Look at how a few wines are described by our bag of words, to check that we get something that makes sense.

In [48]:
print(vectorizer.inverse_transform(X[10]))
print(vectorizer.inverse_transform(X[537]))

[array(['juicy', 'fruit', 'ripe', 'delicious', 'white', 'pear', 'aromatic',
       'structure', 'citrus', 'elegance', 'complexity', 'come',
       'gorgeous', 'ranks', 'whites', 'opens', 'yellow', 'spring',
       'flower', 'herb', 'orchard', 'scents', 'creamy', 'combines',
       'peach', 'almond', 'savory', 'mineral', 'grace', 'lingering',
       'orchard fruit', 'fruit scents', 'white peach', 'ripe pear',
       'pear citrus', 'citrus white', 'white almond'], dtype='<U35')]
[array(['juicy', 'aromas', 'cassis', 'chocolate', 'baking', 'toasty',
       'drink', 'deep', 'pure', 'long', 'long drink', 'berry',
       'integrated', 'peppery', 'spice', 'forward', 'baking spice',
       'pepper', 'blend', 'cabernet', 'franc', 'petit', 'verdot',
       'cabernet franc', 'petit verdot', 'tight', 'blend cabernet',
       'merlot', 'sauvignon', 'cabernet sauvignon', 'bold',
       'verdot cabernet', 'petit verdot cabernet',
       'verdot cabernet franc', 'petit verdot cabernet franc', 'racy',
 

*****
## Candidate Generation Model

We will generate candidate wines by searching the corpus of wine reviews for the wines most similar in profile to an input wine. Each review is represented by a bag-of-words feature vector, built from the vocabulary defined in the EDA above.
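
A minimal sketch of candidate generation, under assumptions: the notebook does not say which similarity measure it uses, so cosine similarity is assumed here, and the helpers `bag_of_words`, `cosine_similarity`, and `top_candidates` as well as the four sample reviews are hypothetical. Plain-Python counters stand in for the sparse matrix `X`; on the full corpus one would more likely work on `X` directly, for example with `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count whitespace-separated tokens (a toy stand-in for CountVectorizer)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_candidates(query_index, bags, k=2):
    """Indices of the k reviews most similar to the query, excluding the query itself."""
    query = bags[query_index]
    scored = [(cosine_similarity(query, bag), i)
              for i, bag in enumerate(bags) if i != query_index]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# made-up, already-preprocessed reviews
reviews = [
    "ripe cherry oak vanilla",
    "cherry oak spice vanilla",
    "citrus pear mineral oak",
    "pear citrus almond peach",
]
bags = [bag_of_words(r) for r in reviews]
candidates = top_candidates(0, bags, k=2)
print(candidates)
```

Review 1 shares three of four terms with the query and ranks first; review 2 shares only "oak" and ranks second.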

*****
## Scoring Model

The candidates we choose will be similar to the wine we are comparing against. To turn them into recommendations, we sort them by score and apply a penalty based on price.
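
One way this ranking could look is sketched below. The formula is an assumption, not the notebook's scoring model: `recommendation_score`, the linear penalty, and the `price_weight` value are all hypothetical, and the three candidate wines are invented. Missing prices are not handled.

```python
def recommendation_score(points, price, price_weight=0.001):
    """Hypothetical ranking score: rescaled points minus a linear price penalty."""
    return points - price_weight * price

# made-up candidates; points are on the rescaled [0, 1] range
candidates = [
    {"winery": "Winery A", "points": 0.96, "price": 235.0},
    {"winery": "Winery B", "points": 0.96, "price": 65.0},
    {"winery": "Winery C", "points": 0.90, "price": 20.0},
]

ranked = sorted(
    candidates,
    key=lambda w: recommendation_score(w["points"], w["price"]),
    reverse=True,
)
for wine in ranked:
    print(wine["winery"], round(recommendation_score(wine["points"], wine["price"]), 3))
```

With this weight, two wines with equal points are separated by price, and a cheaper wine with slightly fewer points can outrank an expensive one. The weight controls how strongly price matters and would need tuning against real recommendations.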