# Wine2Vec Exploration
##### By Zack Thoutt

Here is a little data exploration of my new wine review dataset using word2vec. My theory is that the words a sommelier would use to describe a wine (oaky, tannic, acidic, berry, etc.) can be used to predict the type of wine (Pinot Noir, Cabernet Sav., etc.). Let's see if we can extract some interesting relationships from the data and somewhat validate this theory.

In [120]:
from collections import Counter
import numpy as np
import nltk
import re
import sklearn.manifold
import multiprocessing
import pandas as pd
import gensim.models.word2vec as w2v

---

### Get the Data
The dataset can be found on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews) or you can run my scraper on [GitHub](https://github.com/zackthoutt/wine-deep-learning).

In [121]:
data = pd.read_json('winemag-data_first150k.json', dtype={
    'points': np.int32,
    'price': np.float32,
})

In [122]:
labels = data['variety']
descriptions = data['description']

### Explore the Data
There are several hundred fairly common wine varietals and probably thousands of niche ones. Identifying them all will be difficult, but I hypothesize that it should be possible to classify the most common 50 or 100 varietals with this wine review dataset.

Let's take a look at a few reviews and see if we as humans can tell a difference in the descriptive words used for different types of wine.

In [198]:
print('{}   :   {}'.format(labels.tolist()[0], descriptions.tolist()[0]))
print('{}   :   {}'.format(labels.tolist()[56], descriptions.tolist()[56]))
print('{}   :   {}'.format(labels.tolist()[93], descriptions.tolist()[93]))

Cabernet Sauvignon   :   This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
Sauvignon Blanc   :   Delicious while also young and textured, this wine comes from biodynamically grown grapes. It has a strong sense of minerality as well as intense citrus and green fruits. It's tight at the moment and needs to round out, so drink from 2018.
Chardonnay   :   A smoky scent and earthy, crisp-apple flavors make this medium-bodied wine a change of pace from the average butterball Chardonnay. It has welcome acidity and a nicely smooth texture.


Even if you're not someone who knows wine, I think there is a pretty clear distinction in the descriptions of these different types of wine. The Cabernet Sauvignon (a red wine) was described with words like cherry, tannins, and caramel. The next two reviews are white wines, but even they show differences in their descriptions. The Sauvignon Blanc is described in terms of minerality, citrus, and green fruits, while the Chardonnay is described as smoky and earthy with crisp-apple flavors, in contrast to the average "butterball" Chardonnay. This gives us good motivation to move forward and explore the data more.
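
To make that eyeball comparison a bit more concrete, here is a minimal sketch that lists the words appearing in one review and in neither of the other two. It uses one sentence from each of the three reviews printed above; the helper names (`words`, `distinctive_words`) are mine and purely illustrative, and three sentences are obviously far too little data to conclude anything.

```python
import re

# One sentence from each of the three reviews printed above.
reviews = {
    "Cabernet Sauvignon": "Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background.",
    "Sauvignon Blanc": "It has a strong sense of minerality as well as intense citrus and green fruits.",
    "Chardonnay": "A smoky scent and earthy, crisp-apple flavors make this medium-bodied wine a change of pace from the average butterball Chardonnay.",
}

def words(text):
    """Lowercase a review and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

vocab = {variety: words(text) for variety, text in reviews.items()}

def distinctive_words(variety):
    """Words used in this variety's review and in none of the others."""
    others = set().union(*(v for k, v in vocab.items() if k != variety))
    return vocab[variety] - others

for variety in reviews:
    print(variety, sorted(distinctive_words(variety)))
```

Shared filler words such as "a" and "and" drop out, while descriptors such as "tannins", "citrus", and "smoky" remain attached to a single review.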

One limitation I expect with this dataset is that there will be far more reviews for popular wine varietals than for less popular ones. This isn't necessarily bad, but it means that we will probably only be able to classify the most popular N varietals.

In [9]:
varietal_counts = labels.value_counts()
print(varietal_counts[:50])

Chardonnay                       14482
Pinot Noir                       14291
Cabernet Sauvignon               12800
Red Blend                        10062
Bordeaux-style Red Blend          7347
Sauvignon Blanc                   6320
Syrah                             5825
Riesling                          5524
Merlot                            5070
Zinfandel                         3799
Sangiovese                        3345
Malbec                            3208
White Blend                       2824
Rosé                              2817
Tempranillo                       2556
Nebbiolo                          2241
Portuguese Red                    2216
Sparkling Blend                   2004
Shiraz                            1970
Corvina, Rondinella, Molinara     1682
Rhône-style Red Blend             1505
Barbera                           1365
Pinot Gris                        1365
Cabernet Franc                    1363
Sangiovese Grosso                 1346
Pinot Grigio             

If you drink wine regularly, you will probably recognize the most reviewed wines listed above. The value counts for the different varietals support my theory that less popular wines might not have enough reviews to classify them. The most popular varietals have thousands of reviews, but towards the bottom end of the top 50 there are only a few hundred reviews each. This isn't a problem for building a word2vec model like we are going to do next, but it is something to keep in mind as we move forward trying to create a wine classifier.
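
For the later classifier, one way to act on this would be to keep only the reviews for the N most reviewed varietals. The following is a rough sketch of that idea on toy data, not something I have run on the real dataset; `toy_labels`, `toy_descriptions`, and `keep_top_varietals` are hypothetical names introduced just for this example.

```python
from collections import Counter

# Toy stand-ins for the `labels` and `descriptions` columns (hypothetical data).
toy_labels = ["Chardonnay", "Pinot Noir", "Chardonnay", "Syrah",
              "Pinot Noir", "Chardonnay", "Furmint"]
toy_descriptions = ["review {}".format(i) for i in range(len(toy_labels))]

def keep_top_varietals(labels, descriptions, top_n):
    """Keep only the reviews whose varietal is among the top_n most reviewed."""
    top = {variety for variety, _ in Counter(labels).most_common(top_n)}
    return [(label, text) for label, text in zip(labels, descriptions) if label in top]

filtered = keep_top_varietals(toy_labels, toy_descriptions, top_n=2)
print(len(filtered), "of", len(toy_labels), "reviews kept")
```

With the real data, `top_n` would be something like 50 or 100, and the cutoff could be chosen by looking at how many reviews the least common varietal kept still has.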

### Word2Vec Model
##### Formatting the Data
In order to train a word2vec model on these reviews, we first concatenate all of the description data into one giant string, which will then be split into sentences and words.

In [12]:
corpus_raw = ""
for description in descriptions:
    # Reviews are appended back to back, with no separator between them.
    corpus_raw += description

Next, we need to tokenize the wine corpus using NLTK. This process breaks the corpus into a list of sentences and then breaks each sentence into a list of words, stripping out less useful characters like commas and hyphens along the way (the regular expression below replaces every character other than an ASCII letter with a space, so digits are dropped too). In this way, we can train the word2vec model with the context of sentences and relative word placement.

In [18]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [19]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [20]:
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [23]:
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [25]:
print(raw_sentences[234])
print(sentence_to_wordlist(raw_sentences[234]))

Tart cherry lingers on the finish.A deeper salmon color with elegantly lacy bubbles and a slight cloudy appearance, this sparkler by Norm Yost offers dessicated watermelon, dried orange blossoms, yeast, citrus rinds and fresher strawberry notes on the nose.
['Tart', 'cherry', 'lingers', 'on', 'the', 'finish', 'A', 'deeper', 'salmon', 'color', 'with', 'elegantly', 'lacy', 'bubbles', 'and', 'a', 'slight', 'cloudy', 'appearance', 'this', 'sparkler', 'by', 'Norm', 'Yost', 'offers', 'dessicated', 'watermelon', 'dried', 'orange', 'blossoms', 'yeast', 'citrus', 'rinds', 'and', 'fresher', 'strawberry', 'notes', 'on', 'the', 'nose']
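
Notice the "finish.A" in the sample above: it suggests that the last sentence of one review and the first sentence of the next were fused, because the reviews were concatenated without any whitespace between them. Below is a small sketch of the effect and of one possible fix (joining with a space). The two reviews are made-up toy text, and `split_sentences` is a naive regex stand-in I wrote for illustration, not the Punkt tokenizer itself.

```python
import re

# Two toy reviews (hypothetical text) standing in for `descriptions`.
toy_descriptions = [
    "Tart cherry lingers on the finish.",
    "A deeper salmon color with lacy bubbles.",
]

def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: split on whitespace
    that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

joined_without_space = "".join(toy_descriptions)
joined_with_space = " ".join(toy_descriptions)

print(split_sentences(joined_without_space))  # the two reviews stay fused
print(split_sentences(joined_with_space))     # each review is its own sentence
```

Since the word2vec context window only looks within a sentence, fused sentences let words from unrelated reviews end up in each other's context, so a separator is probably worth trying in a future run.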


In [143]:
token_count = sum([len(sentence) for sentence in sentences])
print('The wine corpus contains {0:,} tokens'.format(token_count))

The wine corpus contains 7,077,125 tokens


For some context, all of the Game of Thrones books combined make up only ~1,800,000 tokens, so this dataset is nearly 4x as large as that book series.

##### Training the Model
It took some experimenting to get the model to train well. The main hyperparameters that I had to tune were `min_word_count` and `context_size`.

I usually train word2vec models with a `min_word_count` closer to 3-5, but since this dataset is so large I had to bump it up to 10. When I trained the model with a smaller `min_word_count`, I got a lot of winery and vineyard noise in my word similarities (i.e., the words most similar to "cherry" were a bunch of foreign vineyards, wineries, regions, etc.). After looking through some of the descriptions, I came to the conclusion that most of the wine descriptions don't mention the wine varietal, vineyard, or winery, but some do. So I played with the `min_word_count` until those rare instances had less of an effect on the model.
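
To illustrate what the minimum count does, here is a minimal sketch of the filtering idea: words that occur fewer than `min_count` times in the whole corpus are left out of the vocabulary before training. The sentences are toy data and `vocabulary` is a hypothetical helper; gensim does this internally when building the vocabulary.

```python
from collections import Counter

# Toy tokenized sentences (hypothetical), in the same shape as `sentences`.
toy_sentences = [
    ["ripe", "cherry", "and", "plum"],
    ["cherry", "and", "oak", "from", "Oakville"],
    ["plum", "and", "cherry", "notes"],
]

def vocabulary(sentences, min_count):
    """Return the words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

print(sorted(vocabulary(toy_sentences, min_count=1)))
print(sorted(vocabulary(toy_sentences, min_count=2)))
```

Raising the threshold removes one-off tokens like the place name "Oakville" while keeping the descriptors that recur across reviews, which is the same effect I was after with winery and vineyard names.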

I also had to play with the `context_size` quite a bit. 10 is a pretty large context size, but it makes sense here because, in wine descriptions, nearly all of the words in a sentence are related to each other and to what we are trying to accomplish. I might even experiment with bumping the `context_size` up higher at some point, but even now most of the words in each sentence will be associated with each other in the model.
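
As a rough picture of what the context size controls, the sketch below generates the (target, context) word pairs a skip-gram model would see for one short sentence. It is a simplification: it always uses the full window, whereas real implementations may shrink the window randomly for each word, and `context_pairs` is a name I made up for this example.

```python
def context_pairs(sentence, window):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, sentence[j]

sentence = ["Tart", "cherry", "lingers", "on", "the", "finish"]
narrow = list(context_pairs(sentence, window=1))
wide = list(context_pairs(sentence, window=10))
print(len(narrow), "pairs with window=1")
print(len(wide), "pairs with window=10")
```

With a window of 1, "Tart" is only ever paired with "cherry"; with a window of 10, every word in this sentence is paired with every other word, which is the behavior I wanted for short, dense tasting notes.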

In [185]:
num_features = 300
min_word_count = 10
num_workers = multiprocessing.cpu_count()
context_size = 10
downsampling = 1e-3
seed = 1993

In [186]:
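# Written against gensim < 4.0. In gensim 4.x, `size` is named `vector_size`,
# `model.iter` is `model.epochs`, `model.wv.vocab` is `model.wv.key_to_index`,
# and `most_similar` is called on `model.wv` rather than on the model.
wine2vec = w2v.Word2Vec(
    sg=1,  # 1 = skip-gram
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)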

In [187]:
wine2vec.build_vocab(sentences)

In [188]:
print('Word2Vec vocabulary length:', len(wine2vec.wv.vocab))

Word2Vec vocabulary length: 11979


In [189]:
print(wine2vec.corpus_count)

292314


In [190]:
wine2vec.train(sentences, total_examples=wine2vec.corpus_count, epochs=wine2vec.iter)

26586210

### Playing with the Model
Now that we have a trained model we can get to the fun part and start playing around with the results. As you can tell from the outputs below, there is definitely still some noise in the data that could be worked out by tuning the parameters further, but overall we are getting pretty good results.

##### Words closest to a given word
"melon," "berry," and "oak" are words that someone might use to describe the taste/smell of a wine.

In [191]:
wine2vec.most_similar('melon')

[('nectarine', 0.7073046565055847),
 ('peach', 0.6438678503036499),
 ('honeydew', 0.6428186893463135),
 ('papaya', 0.6423326730728149),
 ('cantaloupe', 0.6362674236297607),
 ('guava', 0.6074447631835938),
 ('pear', 0.6029884815216064),
 ('canteloupe', 0.6017950773239136),
 ('clementine', 0.6005336046218872),
 ('passion', 0.5996620059013367)]
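
The scores next to each word are cosine similarities between word vectors. Here is a minimal sketch of how such a ranking can be computed by hand. The three vectors are toy values I invented (the real model uses 300 dimensions), and this `most_similar` is a simplified stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical values, not taken from the model).
toy_vectors = {
    "melon": [0.9, 0.1, 0.0],
    "peach": [0.8, 0.3, 0.1],
    "oak": [0.0, 0.2, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("melon", toy_vectors))
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.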

In [192]:
wine2vec.most_similar('berry')

[('berries', 0.6290175914764404),
 ('blackberry', 0.6250549554824829),
 ('raspberry', 0.559834361076355),
 ('black', 0.5525908470153809),
 ('incorporate', 0.5403679013252258),
 ('blueberry', 0.5378624796867371),
 ('loganberry', 0.530583381652832),
 ('huckleberry', 0.5194034576416016),
 ('manly', 0.5146172046661377),
 ('Blackberry', 0.5041558742523193)]

In [197]:
wine2vec.most_similar('oak')

[('vanillins', 0.5344773530960083),
 ('woodsap', 0.5198507905006409),
 ('charry', 0.47783029079437256),
 ('cloaked', 0.4773528575897217),
 ('elaborated', 0.4762468934059143),
 ('regime', 0.4708051383495331),
 ('jacket', 0.46834731101989746),
 ('oaky', 0.4653569459915161),
 ('application', 0.4612783193588257),
 ('puncheons', 0.45585477352142334)]

Another thing that someone might use to describe a wine is how acidic it is.

In [199]:
wine2vec.most_similar('acidic')

[('raspingly', 0.5183893442153931),
 ('tartly', 0.4958021938800812),
 ('acidically', 0.49269402027130127),
 ('sheering', 0.49023425579071045),
 ('unforgiving', 0.48199647665023804),
 ('angular', 0.4760863184928894),
 ('pinching', 0.47394663095474243),
 ('pointy', 0.47168493270874023),
 ('ultracrisp', 0.46784496307373047),
 ('frisky', 0.4669594466686249)]

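Or what the body is like. "Full-bodied" would be something that is thick like whole milk, while "light-bodied" would be something that is thin like skim milk. Because the tokenizer replaces hyphens with spaces, "full-bodied" appears in the corpus as the two tokens "full" and "bodied", so we query "full" here.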

In [206]:
wine2vec.most_similar('full')

[('Full', 0.584251880645752),
 ('creamily', 0.4710482656955719),
 ('voluminous', 0.4688924551010132),
 ('fulfilling', 0.46560823917388916),
 ('bodied', 0.46444156765937805),
 ('expectedly', 0.4592348337173462),
 ('fullish', 0.4530738592147827),
 ('mouthfilling', 0.44827550649642944),
 ('explosively', 0.443299263715744),
 ('robustly', 0.4367332458496094)]

Finally, you can also feel in your mouth how much tannin a wine has. Wines with lots of tannins give you a dry, furry feeling on your tongue.

In [210]:
wine2vec.most_similar('tannins')

[('Tannins', 0.6397914290428162),
 ('tannin', 0.5773420333862305),
 ('resistance', 0.5176475644111633),
 ('pliant', 0.5133403539657593),
 ('Wrapped', 0.5123817920684814),
 ('furry', 0.5082734823226929),
 ('tannic', 0.5056681632995605),
 ('unobtrusive', 0.49775996804237366),
 ('negotiable', 0.4963838756084442),
 ('vise', 0.48365217447280884)]
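
Several of the neighbor lists above contain capitalized duplicates of the query word ("Full", "Tannins", "Blackberry"), because the tokenizer keeps case and so treats them as separate words. One thing I could try is lowercasing every token before training. The sketch below shows a hypothetical variant of `sentence_to_wordlist` that does this; I have not retrained the model with it.

```python
import re

def sentence_to_lowercase_wordlist(raw):
    """Variant of sentence_to_wordlist that also lowercases every token."""
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.lower().split()

print(sentence_to_lowercase_wordlist("Tannins are firm, but the tannins soften."))
```

The trade-off is that proper nouns such as winery or region names are lowercased too, so they can no longer be told apart from ordinary words that happen to be spelled the same way.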

##### Linear relationships between word pairs

In [101]:
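def nearest_similarity_cosmul(start1, end1, end2):
    """Print and return start2 such that start1 : end1 :: start2 : end2."""
    # Rank words that are close to end2 and start1 but far from end1.
    similarities = wine2vec.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2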

In [212]:
nearest_similarity_cosmul('oak', 'vanilla', 'cherry');

oak is related to vanilla, as cedar is related to cherry


In [213]:
nearest_similarity_cosmul('full', 'berry', 'light');

full is related to berry, as Unoaked is related to light


In [216]:
nearest_similarity_cosmul('tannins', 'plum', 'fresh');

tannins is related to plum, as refreshing is related to fresh


In [222]:
nearest_similarity_cosmul('full', 'bodied', 'acidic');

full is related to bodied, as pinching is related to acidic
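
For anyone curious how these analogies are scored, `most_similar_cosmul` is documented as using the multiplicative combination objective proposed by Levy and Goldberg (often called 3CosMul). Below is a minimal sketch of that idea on toy vectors: each candidate is scored by the product of its shifted cosine similarities to the positive words, divided by the product for the negative words. The vectors are values I invented, and `cosmul_score` and `toy_analogy` are illustrative names, so treat this as a sketch of the idea rather than gensim's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, epsilon=1e-6):
    """Product of shifted similarities to the positive vectors, divided by
    the product of shifted similarities to the negative vectors."""
    numerator = 1.0
    for vec in positive:
        numerator *= (1 + cosine(candidate, vec)) / 2  # shift into [0, 1]
    denominator = 1.0
    for vec in negative:
        denominator *= (1 + cosine(candidate, vec)) / 2
    return numerator / (denominator + epsilon)  # epsilon avoids division by zero

# Toy 3-dimensional vectors (hypothetical values, not taken from the model).
toy_vectors = {
    "oak": [1.0, 0.0, 1.0],
    "vanilla": [1.0, 0.0, 0.0],
    "cherry": [0.0, 1.0, 0.0],
    "cedar": [0.0, 1.0, 1.0],
    "lemon": [1.0, 0.2, 0.0],
}

def toy_analogy(start1, end1, end2, vectors):
    """Return start2 such that start1 : end1 :: start2 : end2, by best score."""
    inputs = {start1, end1, end2}
    positive = [vectors[end2], vectors[start1]]
    negative = [vectors[end1]]
    candidates = [word for word in vectors if word not in inputs]
    return max(candidates, key=lambda w: cosmul_score(vectors[w], positive, negative))

print(toy_analogy("oak", "vanilla", "cherry", toy_vectors))
```

In this toy setup "cedar" beats "lemon" because it is close to both "cherry" and "oak" while being far from "vanilla", whereas "lemon" is penalized for sitting right next to "vanilla".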


### Conclusion
I think that exploring this wine2vec model has helped validate the theory that there is a lot of useful information in these wine descriptions that can probably be used to classify wine varietals. I have not yet trained any classifiers, but we saw early on that reviews of different varietals use different descriptive words, and based on our wine2vec model there is definitely enough context to link these descriptive words together and classify wines when the words are used in certain combinations.

That's all I have for now. As always, let me know if anyone has any questions, comments, insights, ideas, etc. I'll be posting more of my analyses and models soon!