# DeepSom: Exploratory Data Analysis
### Notebook 1

#### By TJ Cycyota
##### Thanks to [Zack Thoutt](https://github.com/zackthoutt/wine-deep-learning) for inspiration

### Hypothesis
A neural network can be trained on wine data from Wine Enthusiast (scraped from winemag.com), including reviews, growing region, and price, to predict grape varietals. It will take several notebooks to get there, but let's start by seeing what our data looks like.
<img src="winemag_screenshot.png" style="height: 300px;">

#### Input Data

12,895 rows

13 feature columns:
    - Country 
    - Description (free-text)
    - Designation
    - Points (80-100 based on Wine Enthusiast score)
    - Price
    - Province 
    - Region 1
    - Region 2
    - Taster name (who contributed the review)
    - Taster Twitter handle
    - Title (of the wine)
    - Variety
    - Winery

### Import packages

In [3]:
from collections import Counter
import numpy as np
import nltk
import re
import sklearn.manifold
import multiprocessing
import pandas as pd
import gensim.models.word2vec as w2v
import seaborn as sns
%matplotlib inline

nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/thomas.j.cycyota/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Load the Data
Data was previously scraped using the file 'scrape-winemag.py'. The dataset can be found on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews), or you can run Zack Thoutt's scraper on [GitHub](https://github.com/zackthoutt/wine-deep-learning). The line below loads the dataset included in this repo.

In [4]:
data = pd.read_json('winemag-data.json', dtype={
    'points': np.int32,
    'price': np.float32,
})

In [5]:
data.head()

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | France | This Chardonnay-based wine, with 10% Chenin Bl... | | 87 | 10.0 | Languedoc-Roussillon | Pays d'Oc | | Lauren Buzzeo | @laurbuzz | Domaine Rives-Blanques 2016 White (Pays d'Oc) | White Blend | Domaine Rives-Blanques |
| 1 | France | Wood aromas and spice show strongly in this wi... | Domaine de la Ferté | 87 | 35.0 | Burgundy | Givry | | Roger Voss | @vossroger | Domaines Devillard 2015 Domaine de la Ferté (... | Pinot Noir | Domaines Devillard |
| 2 | France | Showcasing the rich vintage, this is a generou... | Le Renard | 87 | 23.0 | Burgundy | Bourgogne | | Roger Voss | @vossroger | Domaines Devillard 2015 Le Renard (Bourgogne) | Pinot Noir | Domaines Devillard |
| 3 | France | This large appellation in the Côte Chalonnaise... | Laurent Dufouleur Château Mi-Pont | 87 | 40.0 | Burgundy | Mercurey | | Roger Voss | @vossroger | L. Tramier & Fils 2015 Laurent Dufouleur Châte... | Pinot Noir | L. Tramier & Fils |
| 4 | France | This tangy, ripe and fruity wine is crisp with... | | 87 | 19.0 | Burgundy | Mâcon-Villages | | Roger Voss | @vossroger | Louis Max 2016 Mâcon-Villages | Chardonnay | Louis Max |


Looks like we have 10K+ reviews! Should be great for some exploratory analysis, but we'll need to go much bigger for deep learning work.

In [6]:
data.shape

(12895, 13)

The average wine scores about 89.6 points on winemag.com (the lowest score in this data is 80, so the site does not appear to publish ratings below that). The average bottle costs about $40, although the median is only $30 and the most expensive bottle is $1,116.

It will be interesting to return to these two columns to see if they offer any predictive power for determining wine variety (is a cabernet sauvignon more expensive than a red blend? I think so).

In [7]:
data.describe()

| | points | price |
|---|---|---|
| count | 12895.0 | 12213.0 |
| mean | 89.58356 | 40.185047 |
| std | 2.98751 | 45.770435 |
| min | 80.0 | 4.0 |
| 25% | 87.0 | 20.0 |
| 50% | 90.0 | 30.0 |
| 75% | 92.0 | 48.0 |
| max | 100.0 | 1116.0 |
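
One thing the `count` row shows that I did not call out: `price` has fewer non-null values than `points`, so some reviews have no price at all. A quick check of the size of that gap, using only the two counts printed above (the variable names are mine):

```python
# Counts copied from the data.describe() output above.
total_reviews = 12895   # non-null "points" values (equals the row count)
priced_reviews = 12213  # non-null "price" values

missing_price = total_reviews - priced_reviews
missing_share = missing_price / total_reviews

print('{:,} reviews have no price ({:.1%} of the data)'.format(missing_price, missing_share))
```

On the loaded DataFrame, `data['price'].isna().sum()` should give the same count directly.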


### Break data into Labels and Descriptions

This will allow us to explore the different wine varieties included in the dataset, and will come in handy later in the notebook for Word2Vec implementation.

In [8]:
labels = data['variety']
descriptions = data['description']

Let's look at some of the varieties. You can switch .tail() to .head() to see the most common varieties, but I like these less-common ones.

In [12]:
#variety_list = data.groupby('variety')['title'].nunique()
variety_list = data.variety.value_counts()
variety_list.tail()

Cabernet Merlot    1
Pinot Auxerrois    1
Cabernet-Syrah     1
Tinta del Pais     1
Valvin Muscat      1
Name: variety, dtype: int64

Seems like we have some less common wine varieties in this data! If you run this notebook you may see different varietals in the `.tail()` output (two I came across, "Feteascǎ Regalǎ" and "Juhfark", turn out to be from Romania and Hungary, respectively). Already learning something new!

`value_counts()` skips missing values, so if the counts in `variety_list` add up to the total number of rows, every review is labeled with a variety. The two cells below confirm this, and also show that there are quite a few different varieties. We won't need to drop nulls from this column.

In [13]:
# This number should match the total number of reviews to indicate each review has a variety tagged to it.
variety_list.sum()

12895

In [14]:
#Number of unique wine varietals.
data.variety.nunique()

319

### Initial Exploration
Let's take a look at a few descriptions of a few different varieties, chosen at random. Hopefully they use different words to describe the wine since we'll be using these descriptions to try to classify the wine variety.

While working through the cell below, I also noticed that some of the wine varieties contain non-ASCII characters (é, ô, etc.). We'll have to figure out how to deal with those later.

In [15]:
print('{} : {}'.format(labels.tolist()[11], descriptions.tolist()[11]))
print(' ')
print('{} : {}'.format(labels.tolist()[100], descriptions.tolist()[100]))
print(' ')
print('{} : {}'.format(labels.tolist()[1000], descriptions.tolist()[1000]))

Zinfandel : This is a great value for a wine from the celebrated Amador region. Aromas like strawberry jam and brown sugar lead to very ripe and fruity flavors in this full-bodied but lively wine. It's fun to sip and doesn't try for a serious profile or heavy texture.
 
Red Blend : This opens with aromas of oak, blackberry and baking spice. The dense, taut palate shows black currant, toast and vanilla alongside tightly wound tannins. Drink through 2020.
 
Sparkling Blend : Dry and nicely mature, this complex wine is always among California's best bubblies. It combines great balance, tiny bubbles and some very interesting flavors that make it as appealing as a well-cellared white Burgundy at its peak. Hints of toast, butter and almond fill the aroma, and lemon, crisp apple and baking spices fill the palate. It has lively acidity that's softened by a good sense of body.
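
For the non-ASCII characters mentioned above, one option (not necessarily what the later notebooks will do) is to fold accented letters to their plain equivalents with the standard library. This is a minimal sketch; `strip_accents` is a hypothetical helper name, and the sample strings are varieties that appear in this dataset.

```python
import unicodedata

def strip_accents(text):
    """Return text with accented letters replaced by their unaccented base letters."""
    # NFKD splits a letter like 'é' into 'e' plus a combining accent mark;
    # dropping the combining marks leaves the plain letter behind.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

for variety in ['Rosé', 'Rhône-style Red Blend']:
    print(variety, '->', strip_accents(variety))
```

Whether to fold accents at all is a judgment call: it keeps labels easy to type, but it also discards spelling information that may matter for some names.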


There seem to be clear differences in the adjectives and tastes used to describe these wines. Now let's look at the most commonly reviewed varieties. These are not surprising, and if you've tasted wine you may have heard of a few!

In [16]:
varietal_counts = labels.value_counts()
print(varietal_counts[:20])

Pinot Noir                  1762
Chardonnay                  1212
Bordeaux-style Red Blend     986
Red Blend                    933
Cabernet Sauvignon           721
Riesling                     636
Sparkling Blend              418
Sauvignon Blanc              364
Gamay                        363
Syrah                        326
Nebbiolo                     300
Portuguese Red               221
Merlot                       213
Glera                        197
Rosé                         187
White Blend                  184
Zinfandel                    181
Malbec                       173
Rhône-style Red Blend        157
Sangiovese                   148
Name: variety, dtype: int64
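
Since the end goal is a classifier, it helps to know how lopsided these labels are. The sketch below uses only numbers already printed in this notebook (the top-20 counts, the row count, and the number of unique varieties) to work out how much of the data the top 20 varieties cover.

```python
# Counts copied from the top-20 listing above.
top20_counts = [1762, 1212, 986, 933, 721, 636, 418, 364, 363, 326,
                300, 221, 213, 197, 187, 184, 181, 173, 157, 148]
total_reviews = 12895    # from data.shape
total_varieties = 319    # from data.variety.nunique()

top20_share = sum(top20_counts) / total_reviews
remaining_varieties = total_varieties - len(top20_counts)

print('Top 20 varieties cover {:.1%} of reviews'.format(top20_share))
print('The other {} varieties share the remaining {:.1%}'.format(
    remaining_varieties, 1 - top20_share))
```

A long tail like this usually means the rare varieties will be hard to predict without more data or some grouping of labels.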


## Word2Vec
### Formatting the Data
Word2Vec is a common approach to word embedding, and after hearing and reading about it, I was excited to try it out. This notebook uses the gensim library. 

NLTK's sentence tokenizer takes a single string as input, so we'll first concatenate all of the descriptions together.

In [17]:
# Join every description into one string; str.join avoids the cost of repeated += in a loop
corp_raw = "".join(descriptions)
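
One caveat about concatenating with no separator, as the cell above does: the last word of one review ends up pressed against the first word of the next, and a sentence tokenizer may not see a boundary there. A small illustration with two stand-in reviews (shortened by me, not rows from the dataset):

```python
# Two short stand-in reviews, shortened for illustration.
sample_descriptions = ['Drink through 2020.', 'Dry and nicely mature.']

glued = ''.join(sample_descriptions)     # no separator, as in the cell above
spaced = ' '.join(sample_descriptions)   # keeps a gap at each review boundary

print(glued)
print(spaced)
```

Switching the notebook to `' '.join(...)` would be a reasonable follow-up, but expect the sentence and token counts reported below to shift slightly if you do.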

Additionally, we must tokenize our corpus to allow for cleaner processing by Word2Vec:

    Giant string -> Array of sentences -> Array of words
       corp_raw  ->    raw_sentences   ->   sentences

In [18]:
# Load the NLTK English tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Apply tokenizer to our giant string
raw_sentences = tokenizer.tokenize(corp_raw)

In [19]:
# Use regular expressions to clean and split the sentences
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [23]:
print(raw_sentences[101])
print(sentence_to_wordlist(raw_sentences[101]))

That said, it does have the potential to bring out more of the black currant fruit of the vintage.
['That', 'said', 'it', 'does', 'have', 'the', 'potential', 'to', 'bring', 'out', 'more', 'of', 'the', 'black', 'currant', 'fruit', 'of', 'the', 'vintage']
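
A side effect of the `[^a-zA-Z]` pattern is that accented letters count as non-letters, so a word like "Rosé" is cut short. The sketch below compares it with a Unicode-aware pattern. `sentence_to_wordlist_unicode` is a hypothetical variant and the sample sentence is made up.

```python
import re

def sentence_to_wordlist_unicode(raw):
    """Variant of sentence_to_wordlist that keeps accented letters (hypothetical helper)."""
    # [^\W\d_]+ matches runs of word characters that are not digits or underscores,
    # which is roughly "runs of Unicode letters".
    return re.findall(r'[^\W\d_]+', raw)

sample = 'A crisp Rosé from the Rhône, best before 2020.'

ascii_only = re.sub('[^a-zA-Z]', ' ', sample).split()   # behaviour of the cell above
unicode_aware = sentence_to_wordlist_unicode(sample)

print(ascii_only)
print(unicode_aware)
```

With the ASCII-only pattern the fragments "Ros", "Rh" and "ne" enter the vocabulary as if they were words, which is worth keeping in mind when reading the similarity results later on.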


In [24]:
token_count = sum([len(sentence) for sentence in sentences])
print('There are {0:,} tokens in the wine corpus'.format(token_count))

There are 551,287 tokens in the wine corpus


### Training the Model
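Hats off to Zack Thoutt again on this one. I have used his settings as a starting point to get Word2Vec to work well, and his notebook describes how these hyperparameters influence the model (quoted below). Note that the quote discusses values of 10 for `min_word_count` and `context_size`, while the cell that follows in this notebook sets them to 1 and 5.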

> It took some experimenting to get the model to train well. The main things hyperparameters that I had to tune were `min_word_count` and `context_size`. 

> I usually train word2vec models with a `min_word_count` closer to 3-5, but since this dataset is so large I had to bump it up to 10. When I was training the model on a smaller `min_word_count` I was getting a lot of winery and vinyard noise in my word similarities (ie the words most similar to "cherry" were a bunch of foreign vinyards, wineries, regions, etc.). After looking through some of the descriptions I came to the conclusion that most of the wine descriptions don't mention the wine varietal, vinyard, or winery, but some do. So I played with the `min_word_count` until those rare instances had less of an effect on the model.

> I also had to play with the `context_size` quite a bit. 10 is a pretty large context size, but it makes sense here because really all of the words in a sentence are related to each other in the context of wine descriptions and what were are trying to accomplish. I might even experiment with bumping the `context_size` up higher at some point, but even now most of the words in each sentence will be associated with each other in the model.
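
To make the effect of `min_word_count` concrete, here is a toy sketch of vocabulary pruning: words seen fewer than `min_count` times are simply dropped before training. The sentences are made up, "Amador" stands in for a rare place name, and `vocabulary` is a hypothetical helper rather than anything from gensim.

```python
from collections import Counter

# Toy tokenized sentences (made up); "Amador" stands in for a rare place name.
toy_sentences = [
    ['ripe', 'cherry', 'and', 'plum'],
    ['cherry', 'and', 'oak', 'from', 'Amador'],
    ['plum', 'and', 'cherry', 'with', 'oak'],
]

def vocabulary(sentences, min_count):
    """Return the words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)

print(vocabulary(toy_sentences, min_count=1))
print(vocabulary(toy_sentences, min_count=2))
```

Raising the threshold removes the one-off place name along with other rare words, which is the noise reduction the quote describes.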

In [25]:
num_features = 300
min_word_count = 1
num_workers = multiprocessing.cpu_count()
context_size = 5
downsampling = 1e-3
seed = 1993
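
For intuition about `context_size`: in skip-gram mode, each word is paired with the words around it, up to `window` positions on either side. The sketch below generates those pairs for a fixed window. It is a simplification, since as far as I know gensim also varies the effective window size at random during training. The tokens come from the Red Blend review shown earlier, and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions of the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

tokens = ['aromas', 'of', 'oak', 'blackberry', 'and', 'baking', 'spice']
pairs = list(skipgram_pairs(tokens, window=2))

# Context words paired with "oak" when the window is 2.
oak_context = [context for center, context in pairs if center == 'oak']
print(oak_context)
```

A larger window pairs "oak" with more of the sentence, which is why the quote argues for a wide context in short tasting notes.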

In [26]:
wine2vec = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)
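
A compatibility note: `size`, `iter` and `wv.vocab`, as used in this notebook, are gensim 3.x names. To the best of my knowledge, gensim 4 renamed them to `vector_size`, `epochs` and `wv.key_to_index`. The sketch below translates the constructor arguments. `to_gensim4_kwargs` is a hypothetical helper, and `workers` is left out for brevity.

```python
# Keyword arguments as written in the cell above (gensim 3.x names, workers omitted).
params_gensim3 = {'sg': 1, 'seed': 1993, 'size': 300,
                  'min_count': 1, 'window': 5, 'sample': 1e-3}

# Constructor arguments renamed in gensim 4 (an assumption based on my reading of its docs).
RENAMED_IN_GENSIM4 = {'size': 'vector_size', 'iter': 'epochs'}

def to_gensim4_kwargs(kwargs):
    """Return a copy of kwargs with gensim 3.x names swapped for their gensim 4 equivalents."""
    return {RENAMED_IN_GENSIM4.get(name, name): value for name, value in kwargs.items()}

params_gensim4 = to_gensim4_kwargs(params_gensim3)
print(params_gensim4)
```

If you are on gensim 4, passing `**params_gensim4` to `Word2Vec` should be the equivalent call, but check it against the version you have installed.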

In [27]:
wine2vec.build_vocab(sentences)
print('Word2Vec vocabulary length:', len(wine2vec.wv.vocab))
print('Wine2vec Corpus Count: ', wine2vec.corpus_count)

Word2Vec vocabulary length: 12149
Wine2vec Corpus Count:  24106


In [28]:
wine2vec.train(sentences, total_examples=wine2vec.corpus_count, epochs=wine2vec.epochs)


(1971754, 2756435)
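
The tuple returned by `train()` can be sanity-checked against the token count from earlier. My reading of the gensim docs is that it holds the number of words actually trained on (after frequent-word downsampling) and the raw number of words seen across all epochs. Treat that interpretation as an assumption.

```python
token_count = 551287                          # tokens in the corpus, printed earlier
trained_words, raw_words = 1971754, 2756435   # tuple returned by train() above

epochs = raw_words // token_count             # passes over the corpus
kept_share = trained_words / raw_words        # share of word occurrences used for training

print('{} passes over the corpus'.format(epochs))
print('{:.1%} of word occurrences survived downsampling'.format(kept_share))
```

The raw count is an exact multiple of the corpus size, which suggests the model ran a whole number of epochs over every token.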

## Results from Word2Vec
### Most Similar Words
Word2Vec has done the hard work of looking at every word in every review and figuring out how it relates to the words around it. We can now use the trained model to see which words are similar to each other, using the `most_similar` method:

In [30]:
wine2vec.wv.most_similar('grip')

[('energy', 0.8755673766136169),
 ('sturdy', 0.8264442682266235),
 ('decent', 0.8251634836196899),
 ('gripping', 0.8220580816268921),
 ('lending', 0.8150339722633362),
 ('tension', 0.8149839043617249),
 ('astringency', 0.8133258819580078),
 ('grippy', 0.8099431395530701),
 ('crunch', 0.8075447082519531),
 ('backbone', 0.807217001914978)]
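
The scores returned by `most_similar` are cosine similarities between word vectors: 1.0 means the two vectors point in the same direction, and values near 0 mean they are unrelated. Here is a minimal pure-Python sketch of the calculation. The 3-dimensional vectors are made up for illustration, while the real embeddings have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, not values from the trained model.
grip = [0.9, 0.1, 0.3]
energy = [0.8, 0.2, 0.4]
popcorn = [0.1, 0.9, 0.0]

print(round(cosine_similarity(grip, energy), 3))
print(round(cosine_similarity(grip, popcorn), 3))
```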

In [31]:
wine2vec.wv.most_similar('nutty')

[('buttery', 0.9079457521438599),
 ('bready', 0.8945122957229614),
 ('leesy', 0.8885785937309265),
 ('peachy', 0.8761761784553528),
 ('warming', 0.87123042345047),
 ('ends', 0.8668086528778076),
 ('sugary', 0.8646137714385986),
 ('dull', 0.8611634373664856),
 ('echo', 0.8596324920654297),
 ('melony', 0.8593854904174805)]

In [32]:
wine2vec.wv.most_similar('popcorn')

[('marshmallow', 0.945315420627594),
 ('accompanies', 0.9438775777816772),
 ('peanut', 0.9384106397628784),
 ('wafer', 0.9363114833831787),
 ('resin', 0.9332039952278137),
 ('ham', 0.9314595460891724),
 ('pumpkin', 0.9313677549362183),
 ('flowing', 0.9292489886283875),
 ('balsam', 0.9274548888206482),
 ('suggestive', 0.9250581860542297)]

In [33]:
wine2vec.wv.most_similar('acidic')

[('energy', 0.8652548789978027),
 ('healthy', 0.8616147637367249),
 ('acid', 0.8548575043678284),
 ('wiry', 0.8542773723602295),
 ('astringency', 0.8526881337165833),
 ('choppy', 0.8437875509262085),
 ('narrow', 0.8407387733459473),
 ('clipped', 0.8403691053390503),
 ('abrasive', 0.8389471769332886),
 ('textural', 0.838520348072052)]

In [34]:
wine2vec.wv.most_similar('full')

[('Full', 0.8538987636566162),
 ('generously', 0.7861971259117126),
 ('broadly', 0.7821114659309387),
 ('bold', 0.7763442993164062),
 ('robustly', 0.7744323015213013),
 ('richly', 0.7699888944625854),
 ('soothing', 0.7663235664367676),
 ('filling', 0.7661536335945129),
 ('solidly', 0.7657864093780518),
 ('luscious', 0.7656110525131226)]

In [35]:
wine2vec.wv.most_similar('tannins')

[('tannin', 0.7542928457260132),
 ('support', 0.7377289533615112),
 ('Firm', 0.7294570803642273),
 ('tannic', 0.7086830139160156),
 ('framework', 0.7004613280296326),
 ('grained', 0.7001125812530518),
 ('Structured', 0.6990479826927185),
 ('firm', 0.6921792030334473),
 ('Tannins', 0.685518205165863),
 ('grainy', 0.6804839968681335)]
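
Notice that "Full" shows up as a neighbour of "full", and "Tannins" and "Firm" as neighbours of "tannins". The cleaning step keeps capitalization, so a word at the start of a sentence gets its own vector. Lowercasing tokens before training would merge them. Here is a small sketch of the idea, using a made-up token list:

```python
from collections import Counter

# A few tokens as the current cleaning step might emit them (made-up sample).
tokens = ['Full', 'bodied', 'with', 'firm', 'tannins', 'Firm', 'Tannins', 'full']

cased = Counter(tokens)                               # case-sensitive counts
folded = Counter(token.lower() for token in tokens)   # counts after lowercasing

print(len(cased), 'distinct tokens before lowercasing')
print(len(folded), 'distinct tokens after lowercasing')
```

Whether to lowercase is a trade-off: it pools the evidence for each word, but it also erases the difference between proper nouns and ordinary words.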

### Linear Relationships Between Word Pairs

In [38]:
def nearest_similarity_cosmul(start1, end1, end2):
    similarities = wine2vec.wv.most_similar_cosmul(
        positive=[end2, start1],
        negative=[end1]
    )
    start2 = similarities[0][0]
    print("{start1} is related to {end1}, as {start2} is related to {end2}".format(**locals()))
    return start2
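
As I understand it, `most_similar_cosmul` ranks candidates with the multiplicative objective (often called 3CosMul) proposed by Levy and Goldberg. Each cosine similarity is shifted into the range 0 to 1, then the similarities to the positive words are multiplied together and divided by the similarities to the negative words, plus a small epsilon. The sketch below applies that scoring to made-up 2-dimensional vectors. `cosmul_score` is a hypothetical helper, not gensim's implementation.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    """Cosine similarity mapped from [-1, 1] onto [0, 1]."""
    return (1 + cosine(u, v)) / 2

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Product of shifted similarities to positives, divided by that of negatives."""
    numerator = math.prod(shifted(candidate, p) for p in positive)
    denominator = math.prod(shifted(candidate, n) for n in negative) + eps
    return numerator / denominator

# Made-up 2-dimensional vectors, purely for illustration.
vectors = {
    'oak': [1.0, 0.2], 'vanilla': [0.9, 0.5],
    'butter': [0.2, 0.9], 'yeast': [0.4, 0.7], 'plum': [-0.8, 0.1],
}
positive = [vectors['butter'], vectors['oak']]   # end2 and start1
negative = [vectors['vanilla']]                  # end1
candidates = ['yeast', 'plum']                   # words not used in the query

best = max(candidates, key=lambda w: cosmul_score(vectors[w], positive, negative))
print(best)
```

The positive and negative lists mirror the argument order in `nearest_similarity_cosmul`, so the winner plays the role of `start2`.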

In [39]:
nearest_similarity_cosmul('oak', 'vanilla', 'butter');

oak is related to vanilla, as yeast is related to butter


In [40]:
nearest_similarity_cosmul('clean', 'berry', 'cherry');

clean is related to berry, as pure is related to cherry


In [41]:
nearest_similarity_cosmul('tannins', 'plum', 'fresh');

tannins is related to plum, as bubbles is related to fresh


In [42]:
nearest_similarity_cosmul('full', 'bodied', 'acidic');

full is related to bodied, as healthy is related to acidic


## Potential Applications

We'll keep exploring the applications of this data, but I can already imagine it would be interesting to offer this as a service to people! It could work almost like a synonym generator for writing wine descriptions, or as a way to describe relationships between different wines.