# DeepSom: Exploratory Data Analysis
### Notebook 1

#### By TJ Cycyota
##### Thanks to Zack Thoutt for inspiration

### Hypothesis
A neural network can be trained on wine data from Wine Enthusiast (scraped from winemag.com), including reviews, growing region, and price, to predict grape varietals.
<img src="winemag_screenshot.png" style="height: 300px;">

#### Input Data

12,895 rows

13 feature columns:

- Country
- Description (free-text)
- Designation
- Points (80-100 based on Wine Enthusiast score)
- Price
- Province
- Region 1
- Region 2
- Taster name (who contributed the review)
- Taster Twitter handle
- Title (of the wine)
- Variety
- Winery

In [1]:
from collections import Counter
import numpy as np
import nltk
import re
import sklearn.manifold
import multiprocessing
import pandas as pd
import gensim.models.word2vec as w2v
import seaborn as sns
%matplotlib inline

nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\thomas.j.cycyota\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


---

### Load the Data
The dataset can be found on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews), or you can run Zack Thoutt's scraper from [GitHub](https://github.com/zackthoutt/wine-deep-learning). The cell below loads a previously scraped dataset from local storage.

In [2]:
data = pd.read_json('winemag-data.json', dtype={
    'points': np.int32,
    'price': np.float32,
})

In [3]:
data.head(15)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,France,"This Chardonnay-based wine, with 10% Chenin Bl...",,87,10.0,Languedoc-Roussillon,Pays d'Oc,,Lauren Buzzeo,@laurbuzz,Domaine Rives-Blanques 2016 White (Pays d'Oc),White Blend,Domaine Rives-Blanques
1,France,Wood aromas and spice show strongly in this wi...,Domaine de la Ferté,87,35.0,Burgundy,Givry,,Roger Voss,@vossroger,Domaines Devillard 2015 Domaine de la Ferté (...,Pinot Noir,Domaines Devillard
2,France,"Showcasing the rich vintage, this is a generou...",Le Renard,87,23.0,Burgundy,Bourgogne,,Roger Voss,@vossroger,Domaines Devillard 2015 Le Renard (Bourgogne),Pinot Noir,Domaines Devillard
3,France,This large appellation in the Côte Chalonnaise...,Laurent Dufouleur Château Mi-Pont,87,40.0,Burgundy,Mercurey,,Roger Voss,@vossroger,L. Tramier & Fils 2015 Laurent Dufouleur Châte...,Pinot Noir,L. Tramier & Fils
4,France,"This tangy, ripe and fruity wine is crisp with...",,87,19.0,Burgundy,Mâcon-Villages,,Roger Voss,@vossroger,Louis Max 2016 Mâcon-Villages,Chardonnay,Louis Max
5,US,"Crisp apple perfume builds on the palate, sugg...",Semi-Dry,87,15.0,New York,Finger Lakes,Finger Lakes,Anna Lee C. Iijima,,Highland Cellars 2015 Semi-Dry Riesling (Finge...,Riesling,Highland Cellars
6,US,The freshness and concentration of the fruit f...,,87,25.0,California,Lodi,Central Valley,Jim Gordon,@gordone_cellars,Infidel 2014 Zinfandel (Lodi),Zinfandel,Infidel
7,France,"Crisp apple aromas lead to a spicy, yellow and...",,87,18.0,Burgundy,Bourgogne,,Roger Voss,@vossroger,Domaine Bernard Moreau 2016 Bourgogne,Chardonnay,Domaine Bernard Moreau
8,France,"A crisp, tangy and mineral-driven wine, this i...",La Belouse,87,20.0,Burgundy,Mâcon-Cruzille,,Roger Voss,@vossroger,Domaine de l'Echelette 2016 La Belouse (Mâcon...,Chardonnay,Domaine de l'Echelette
9,Italy,"Oak-driven spice, camphor and coconut aromas l...",Riserva,87,,Piedmont,Barbaresco,,Kerin O’Keefe,@kerinokeefe,Sassi San Cristoforo 2012 Riserva (Barbaresco),Nebbiolo,Sassi San Cristoforo


In [4]:
data.shape

(12895, 13)

In [5]:
data.describe()

Unnamed: 0,points,price
count,12895.0,12213.0
mean,89.58356,40.185047
std,2.98751,45.77026
min,80.0,4.0
25%,87.0,20.0
50%,90.0,30.0
75%,92.0,48.0
max,100.0,1116.0


### Break data into Labels and Descriptions

This will allow us to explore the different wine varieties included in the dataset, and will come in handy later in the notebook for Word2Vec implementation.

In [6]:
labels = data['variety']
descriptions = data['description']

In [7]:
#variety_list = data.groupby('variety')['title'].nunique()
variety_list = data.variety.value_counts()
variety_list.tail()

Verdejo-Viura      1
Juhfark            1
Feteascǎ Regalǎ    1
Diamond            1
Bobal              1
Name: variety, dtype: int64

Seems like we have some less common wine varieties in this data! Turns out "Feteascǎ Regalǎ" is from Romania, and "Juhfark" is from Hungary. Already learning something new! After doing a sum on the variety_list, we can see that every row of data is labeled with a variety, and there are quite a few different varieties. 

In [8]:
variety_list.sum()

12895

In [9]:
data.variety.nunique()

319

### Explore the Data
There are several hundred fairly common varietals of wine and probably thousands of other niche varietals. It will be difficult to identify them all, but I hypothesize that it should be possible to classify the most common, say, 50 or 100 wine varietals with this wine review dataset.

Let's take a look at a few reviews and see if we as humans can tell a difference in the descriptive words used for different types of wine.

In [10]:
print('{}   :   {}'.format(labels.tolist()[11], descriptions.tolist()[11]))
print(' ')
print('{}   :   {}'.format(labels.tolist()[100], descriptions.tolist()[100]))
print(' ')
print('{}   :   {}'.format(labels.tolist()[1000], descriptions.tolist()[1000]))

Zinfandel   :   This is a great value for a wine from the celebrated Amador region. Aromas like strawberry jam and brown sugar lead to very ripe and fruity flavors in this full-bodied but lively wine. It's fun to sip and doesn't try for a serious profile or heavy texture.
 
Red Blend   :   This opens with aromas of oak, blackberry and baking spice. The dense, taut palate shows black currant, toast and vanilla alongside tightly wound tannins. Drink through 2020.
 
Sparkling Blend   :   Dry and nicely mature, this complex wine is always among California's best bubblies. It combines great balance, tiny bubbles and some very interesting flavors that make it as appealing as a well-cellared white Burgundy at its peak. Hints of toast, butter and almond fill the aroma, and lemon, crisp apple and baking spices fill the palate. It has lively acidity that's softened by a good sense of body.



Even if you're not someone who knows wine, I think there is a pretty clear distinction in the descriptions of these different types of wine. The Zinfandel (a red wine) is described with words like strawberry jam, brown sugar, ripe, and fruity. The Red Blend is also a red, but its review leans on oak, blackberry, baking spice, black currant, and tightly wound tannins. The Sparkling Blend is described with toast, butter, almond, lemon, and crisp apple, along with tiny bubbles and lively acidity. This provides us with good motivation to move forward and explore the data more.

One of the limitations that I think we will have with this dataset is that there will be a lot more reviews for popular wine varietals than for less popular ones. This isn't necessarily bad, but it means that we will probably only be able to classify the most popular N varietals.

In [11]:
varietal_counts = labels.value_counts()
print(varietal_counts[:50])

Pinot Noir                    1762
Chardonnay                    1212
Bordeaux-style Red Blend       986
Red Blend                      933
Cabernet Sauvignon             721
Riesling                       636
Sparkling Blend                418
Sauvignon Blanc                364
Gamay                          363
Syrah                          326
Nebbiolo                       300
Portuguese Red                 221
Merlot                         213
Glera                          197
Rosé                           187
White Blend                    184
Zinfandel                      181
Malbec                         173
Rhône-style Red Blend          157
Sangiovese                     148
Cabernet Franc                 145
Champagne Blend                143
Tempranillo                    142
Pinot Gris                     138
Grüner Veltliner               116
Bordeaux-style White Blend     115
Grenache                       114
Barbera                        106
Petite Sirah        

If you drink wine regularly you will probably recognize the most reviewed wines listed above. The value counts for the different wine varietals support my theory that less popular wines might not have enough reviews to classify them. The most popular varietals have more than a thousand reviews each, but by the 28th most common varietal in the list above the count has already dropped to roughly a hundred. This isn't a problem for building a word2vec model like we are going to do next, but it is something to keep in mind as we move forward trying to create a wine classifier.
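As a rough illustration of restricting the problem to the most popular N varietals, the sketch below uses `collections.Counter` (already imported at the top of the notebook). The `toy_labels` list and the `keep_top_n` helper are hypothetical stand-ins, not part of the original notebook.

```python
from collections import Counter

# Toy stand-in for the notebook's `labels` series (hypothetical values).
toy_labels = ["Pinot Noir", "Chardonnay", "Pinot Noir", "Bobal",
              "Chardonnay", "Pinot Noir", "Juhfark"]

def keep_top_n(labels, n):
    """Return the set of the n most frequent labels."""
    counts = Counter(labels)
    return {label for label, _ in counts.most_common(n)}

top = keep_top_n(toy_labels, 2)

# Indices of the rows whose label is among the top-n varietals.
kept_rows = [i for i, label in enumerate(toy_labels) if label in top]
print(sorted(top), kept_rows)
```

With the real data, the same idea would be applied to `labels` with N set to 50 or 100, and the rare varietals would either be dropped or grouped into a single "other" class.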

### Word2Vec Model
##### Formatting the Data
To train a word2vec model, we first concatenate all of the description data into one giant string.

In [12]:
corpus_raw = ""
for description in descriptions:
    corpus_raw += description
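One thing to watch with plain `+=` concatenation: nothing separates the end of one review from the start of the next, so the final period of a review sits directly against the first word of the following one, and a sentence tokenizer may not see a boundary there. The sketch below shows the difference on two toy strings (hypothetical text, not the notebook's data). Joining on a space is one possible alternative, not what the notebook does.

```python
# Two toy descriptions (hypothetical stand-ins for rows of `descriptions`).
toy_descriptions = ["Drink through 2020.", "Dry and nicely mature."]

# Same pattern as the cell above: no separator between reviews.
glued = ""
for description in toy_descriptions:
    glued += description

# Alternative: put a single space between consecutive reviews.
spaced = " ".join(toy_descriptions)

print(glued)
print(spaced)
```

Switching to the joined version would likely change the sentence count reported later in the notebook, so the outputs shown here correspond to the original concatenation.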

Next, we need to tokenize the wine corpus using NLTK. This breaks the corpus into a list of sentences and then breaks each sentence into a list of words, stripping out less useful characters like commas and hyphens in the process. In this way, we are able to train the word2vec model with the context of sentences and relative word placement.

In [13]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer)

<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x0000000013501518>


In [14]:
raw_sentences = tokenizer.tokenize(corpus_raw)


In [15]:
def sentence_to_wordlist(raw):
    # Replace every non-letter character with a space, then split on whitespace
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [16]:
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [17]:
print(raw_sentences[150])
print(sentence_to_wordlist(raw_sentences[150]))

The rustic sensation carries over to the palate along with a bitter note.
[u'The', u'rustic', u'sensation', u'carries', u'over', u'to', u'the', u'palate', u'along', u'with', u'a', u'bitter', u'note']
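Because the regular expression keeps only ASCII letters, hyphenated words are split in two, apostrophes leave single-letter tokens behind, and case is preserved. That last point is why capitalized and lowercase forms (for example `Full` and `full`) appear as separate words in the similarity results further down. A minimal sketch of a lowercasing variant follows. The function name `sentence_to_wordlist_lower` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def sentence_to_wordlist_lower(raw):
    """Hypothetical variant of sentence_to_wordlist that also lowercases."""
    # Replace every non-letter with a space, lowercase, then split on whitespace.
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.lower().split()

tokens = sentence_to_wordlist_lower("Full-bodied, it's firm.")
print(tokens)
```

Lowercasing merges sentence-initial and mid-sentence forms of the same word, at the cost of also merging proper nouns with common words.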


In [18]:
token_count = sum([len(sentence) for sentence in sentences])
print('The wine corpus contains {0:,} tokens'.format(token_count))

The wine corpus contains 551,287 tokens


For some context, all of the Game of Thrones (GOT) books combined make up ~1,800,000 tokens, so this corpus is roughly 30% of the size of the GOT book series.

##### Training the Model
It took some experimenting to get the model to train well. The main hyperparameters that I had to tune were `min_word_count` and `context_size`.

I usually train word2vec models with a `min_word_count` closer to 3-5, but since this dataset is so large I had to bump it up to 10. When I trained the model with a smaller `min_word_count`, I got a lot of winery and vineyard noise in my word similarities (i.e., the words most similar to "cherry" were a bunch of foreign vineyards, wineries, regions, etc.). After looking through some of the descriptions, I came to the conclusion that most of them don't mention the wine varietal, vineyard, or winery, but some do. So I played with the `min_word_count` until those rare instances had less of an effect on the model. Note that the configuration cell below sets `min_word_count = 1`, so the vocabulary used for the results in this notebook keeps every token.
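To make the effect of a minimum word count concrete, here is a small sketch of frequency-based vocabulary pruning: words that occur fewer than `min_count` times across the corpus are dropped before training. The `prune_vocab` helper and the toy sentences are hypothetical. This only illustrates the idea and is not gensim's internal implementation.

```python
from collections import Counter

# Toy tokenized sentences (hypothetical, not taken from the dataset).
toy_sentences = [["ripe", "cherry", "tannins"],
                 ["cherry", "oak", "Amador"],
                 ["ripe", "cherry", "oak"]]

def prune_vocab(sentences, min_count):
    """Return the words that occur at least min_count times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

vocab = prune_vocab(toy_sentences, 2)
print(sorted(vocab))
```

Here the one-off place name `Amador` falls out of the vocabulary at `min_count=2`, which is the kind of rare winery or region term the paragraph above describes as noise.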

I also had to play with the `context_size` quite a bit. 10 is a pretty large context size, but it makes sense here because nearly all of the words in a sentence are related to each other in the context of wine descriptions and what we are trying to accomplish. I might even experiment with bumping the `context_size` up higher at some point. Note that the configuration cell below uses `context_size = 5`, so only words within five positions of each other are paired during training.
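The role of the context size can be sketched by listing the (center, context) pairs a skip-gram model would see for one sentence. The `skipgram_pairs` function below is a hypothetical, simplified illustration: it uses a fixed window and leaves out refinements that real word2vec implementations apply, such as downsampling of frequent words.

```python
def skipgram_pairs(words, window):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

toy_sentence = ["firm", "tannins", "and", "ripe", "cherry"]
narrow = skipgram_pairs(toy_sentence, 1)
wide = skipgram_pairs(toy_sentence, 5)
print(len(narrow), len(wide))
```

With a window of 1 only neighbouring words are paired, while a window at least as long as the sentence pairs every word with every other word. That is the sense in which a larger context size ties all the words of a description sentence together.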

In [19]:
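num_features = 300                         # dimensionality of the word vectors
min_word_count = 1                         # drop words seen fewer times than this
num_workers = multiprocessing.cpu_count()  # number of training threads
context_size = 5                           # max distance between a word and its context
downsampling = 1e-3                        # threshold for downsampling frequent words
seed = 1993                                # seed for the random number generator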

In [20]:
wine2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

In [21]:
wine2vec.build_vocab(sentences)

In [22]:
print('Word2Vec vocabulary length:', len(wine2vec.wv.vocab))

('Word2Vec vocabulary length:', 12149)


In [23]:
print(wine2vec.corpus_count)

24106


In [24]:
wine2vec.train(sentences, total_examples=wine2vec.corpus_count, epochs=wine2vec.iter)

(1972196, 2756435)

### Playing with the Model
Now that we have a trained model we can get to the fun part and start playing around with the results. As you can tell from the outputs below, there is definitely still some noise in the data that could be worked out by tuning the parameters further, but overall we are getting pretty good results.

##### Words closest to a given word
"grip," "nutty," and "popcorn" are words that someone might use to describe the taste, smell, or feel of a wine.

In [32]:
wine2vec.most_similar('grip')

[(u'energy', 0.8699564933776855),
 (u'sturdy', 0.8553102016448975),
 (u'crunch', 0.850803017616272),
 (u'grippy', 0.8497793078422546),
 (u'upright', 0.8402553796768188),
 (u'astringency', 0.8346679210662842),
 (u'firmness', 0.8314436078071594),
 (u'acidic', 0.8300552368164062),
 (u'gripping', 0.829682469367981),
 (u'thanks', 0.8195494413375854)]
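The score next to each word in the output above is a cosine similarity between word vectors: values near 1 mean the two vectors point in nearly the same direction. The sketch below computes it by hand for a few made-up 2-D vectors. The `toy_vectors` values are hypothetical and have nothing to do with the trained model's 300-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D vectors, for illustration only.
toy_vectors = {
    "grip": [1.0, 0.0],
    "grippy": [2.0, 0.0],
    "popcorn": [0.0, 3.0],
}

def rank_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_similar("grip", toy_vectors)
print(ranking)
```

Vector length does not matter here, only direction, which is why `grippy` scores a perfect match with `grip` in this toy example despite being twice as long.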

In [76]:
wine2vec.most_similar('nutty')

[(u'bready', 0.9144774079322815),
 (u'buttery', 0.9054234027862549),
 (u'peachy', 0.8972986936569214),
 (u'warming', 0.8951722383499146),
 (u'leesy', 0.875410258769989),
 (u'aspect', 0.8745570182800293),
 (u'sugary', 0.8719088435173035),
 (u'yeasty', 0.8661444783210754),
 (u'pulpy', 0.8655073642730713),
 (u'overt', 0.8603687286376953)]

In [77]:
wine2vec.most_similar('popcorn')

[(u'peanut', 0.9437341690063477),
 (u'suggestive', 0.9409604668617249),
 (u'balsam', 0.9379593729972839),
 (u'wafer', 0.9378612041473389),
 (u'Drenched', 0.9377850294113159),
 (u'marshmallow', 0.9364205598831177),
 (u'plastic', 0.9346927404403687),
 (u'resin', 0.9338468313217163),
 (u'sassafras', 0.9336036443710327),
 (u'stems', 0.9324836730957031)]

Another thing that someone might use to describe a wine is how acidic it is.

In [54]:
wine2vec.most_similar('acidic')

[(u'energy', 0.8874624967575073),
 (u'narrow', 0.8824058771133423),
 (u'choppy', 0.8776118755340576),
 (u'wiry', 0.8742727041244507),
 (u'acid', 0.8734411001205444),
 (u'foamy', 0.8705145120620728),
 (u'snappy', 0.8683796525001526),
 (u'grating', 0.8674183487892151),
 (u'tartaric', 0.8597971796989441),
 (u'edgy', 0.8556100130081177)]

Or what the body is like. "full-bodied" would be something that is thick like whole milk while "light-bodied" would be something that is thin like skim milk.

In [80]:
wine2vec.most_similar('full')

[(u'Full', 0.8341772556304932),
 (u'generously', 0.7808302640914917),
 (u'broadly', 0.7773233652114868),
 (u'filling', 0.7626659274101257),
 (u'lush', 0.7553116083145142),
 (u'smoothly', 0.7549856901168823),
 (u'richly', 0.7546985149383545),
 (u'broad', 0.7536203861236572),
 (u'lavishly', 0.7522820234298706),
 (u'Brawny', 0.7506261467933655)]

Finally, you can also feel in your mouth how much tannin a wine has. Wines with lots of tannins give you a dry, furry feeling on your tongue.

In [56]:
wine2vec.most_similar('tannins')

[(u'support', 0.7396936416625977),
 (u'Firm', 0.7319772839546204),
 (u'tannic', 0.7305991053581238),
 (u'firm', 0.7191610932350159),
 (u'grainy', 0.7110586762428284),
 (u'grained', 0.7042685151100159),
 (u'tannin', 0.6936203241348267),
 (u'Tannins', 0.6932768821716309),
 (u'Structured', 0.6904804706573486),
 (u'structure', 0.6904735565185547)]

##### Linear relationships between word pairs

In [57]:
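def nearest_similarity_cosmul(start1, end1, end2):
    """Find start2 such that start1 is to end1 as start2 is to end2."""
    # Reward similarity to end2 and start1, penalize similarity to end1
    similarities = wine2vec.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    # The first entry is the best-scoring (word, score) pair
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2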
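For intuition about what the function above asks the model to do, here is a hand-rolled sketch of a multiplicative analogy score. Each candidate word is scored by the product of its shifted cosine similarities `(1 + cos) / 2` to the positive words, divided by the same product over the negative words plus a small epsilon. To my understanding this mirrors the objective behind `most_similar_cosmul`, but treat it as an assumption and not as gensim's exact code. The vectors and word names are made up.

```python
import math

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosmul_best(vectors, positive, negative, eps=1e-6):
    """Return the candidate word with the highest multiplicative score."""
    normed = {word: unit(vec) for word, vec in vectors.items()}

    def shifted(word, term):
        # Cosine similarity mapped from [-1, 1] to [0, 1].
        cos = sum(a * b for a, b in zip(normed[word], normed[term]))
        return (1.0 + cos) / 2.0

    best_word, best_score = None, float("-inf")
    for word in normed:
        if word in positive or word in negative:
            continue  # never return one of the query words
        score = (math.prod(shifted(word, t) for t in positive)
                 / (math.prod(shifted(word, t) for t in negative) + eps))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up 3-D vectors, for illustration only.
toy_vectors = {
    "start1": [1.0, 0.0, 0.0],
    "end1": [0.0, 1.0, 0.0],
    "end2": [0.0, 0.0, 1.0],
    "candidate_a": [1.0, 0.0, 1.0],  # close to start1 and end2, far from end1
    "candidate_b": [0.0, 1.0, 1.0],  # close to end1 and end2
    "candidate_c": [1.0, 1.0, 0.0],  # close to start1 and end1
}

best = cosmul_best(toy_vectors, positive=["end2", "start1"], negative=["end1"])
print(best)
```

The winning candidate is the one that resembles both positive words while staying away from the negative word, which is the relationship the printed "is related to" sentences are meant to express.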

In [67]:
nearest_similarity_cosmul('oak', 'vanilla', 'butter');

oak is related to vanilla, as yeast is related to butter




In [69]:
nearest_similarity_cosmul('full', 'berry', 'cherry');

full is related to berry, as lush is related to cherry




In [70]:
nearest_similarity_cosmul('tannins', 'plum', 'fresh');

tannins is related to plum, as fizz is related to fresh




In [71]:
nearest_similarity_cosmul('full', 'bodied', 'acidic');

full is related to bodied, as narrow is related to acidic


