### Can we predict which variety of wine a review is describing?
Zack posted a cool dataset of 130k reviews from [WineMag](http://winemag.com), proposing we try to predict a wine's variety using the words in the review text. Given my sister says she'd make up words when selling wine, I thought it'd be fun to try out my first document classifier (and kaggle!) to see if there are any real patterns behind wine descriptions.

In this notebook I try out the following:

 - A quick look & clean of the data
 - Creating term frequency inverse document frequency (Tfidf) vectors of the review descriptions
 - Running Logistic Regression Classifier on our bag-of-words vectors
 - Hyperparameter tuning with grid search
 - Lastly, a check for any interesting topics with Latent Dirichlet Allocation
 
I was a little surprised by some of the results, feedback is most welcome!

In [1]:
import pandas as pd
df = pd.read_csv('./InputData/winemag-data-130k-v2.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [2]:
df.drop(df.columns[[0]], axis=1, inplace=True) # ditch that unnamed row numbers column
df.describe(include='all')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
count,129908,129971,92506,129971.0,120975.0,129908,108724,50511,103727,98758,129971,129970,129971
unique,43,119955,37979,,,425,1229,17,19,15,118840,707,16757
top,US,"Cigar box, café au lait, and dried tobacco aro...",Reserve,,,California,Napa Valley,Central Coast,Roger Voss,@vossroger,Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma...,Pinot Noir,Wines & Winemakers
freq,54504,3,2009,,,36247,4480,11065,25514,25514,11,13272,222
mean,,,,88.447138,35.363389,,,,,,,,
std,,,,3.03973,41.022218,,,,,,,,
min,,,,80.0,4.0,,,,,,,,
25%,,,,86.0,17.0,,,,,,,,
50%,,,,88.0,25.0,,,,,,,,
75%,,,,91.0,42.0,,,,,,,,


  So, we care about description and variety. Hm, description... uniques and counts don't add up. if there are ~120k unique descriptions and 130k total, we may have some duplicates. Let's take a look at one just to be sure they're actually just duplicates:

In [3]:
dups = df[df.duplicated('description')]
dups.sort_values('description', ascending=False).iloc[3:7]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
23271,Italy,‘Na Vota is a fresh and easygoing wine that is...,'Na Vota,85,18.0,Piedmont,Ruché di Castagnole Monferrato,,,,Cantine Sant'Agata 2007 'Na Vota (Ruché di Ca...,Ruché,Cantine Sant'Agata
103573,New Zealand,this medium-bodied Sauvignon Blanc shows only ...,,88,15.0,New Zealand,,,Joe Czerwinski,@JoeCz,Tussock Jumper 2016 Sauvignon Blanc (New Zealand),Sauvignon Blanc,Tussock Jumper
121126,US,lean and zesty Chardonnay with a burst of mout...,,84,7.0,California,California,California Other,,,Eye Candy 2012 Chardonnay (California),Chardonnay,Eye Candy
109172,Austria,Zweigelt can do easy-drinking styles but in th...,Heideboden,90,26.0,Burgenland,,,Anne Krebiehl MW,@AnneInVino,Nittnaus Hans und Christine 2013 Heideboden Zw...,Zweigelt,Nittnaus Hans und Christine


Well good. I was a little worried reviewers were re-using their descriptions on different wines.... There are no duplicates to drop 

In [4]:
# dedupped_df = df.drop_duplicates(subset='description')
# print('Total unique reviews:', len(dedupped_df))
# print('\nVariety description \n', dedupped_df['variety'].describe())

Variety will be our class label, and 632 labels seems like a lot for <100k documents. Well maybe they're evenly balanced classes?!

In [5]:
varieties =  df['variety'].value_counts()
varieties

Pinot Noir                      13272
Chardonnay                      11753
Cabernet Sauvignon               9472
Red Blend                        8946
Bordeaux-style Red Blend         6915
Riesling                         5189
Sauvignon Blanc                  4967
Syrah                            4142
Rosé                             3564
Merlot                           3102
Nebbiolo                         2804
Zinfandel                        2714
Sangiovese                       2707
Malbec                           2652
Portuguese Red                   2466
White Blend                      2360
Sparkling Blend                  2153
Tempranillo                      1810
Rhône-style Red Blend            1471
Pinot Gris                       1455
Champagne Blend                  1396
Cabernet Franc                   1353
Grüner Veltliner                 1345
Portuguese White                 1159
Bordeaux-style White Blend       1066
Pinot Grigio                     1052
Gamay       

In [6]:
varieties.describe()

count      707.000000
mean       183.833098
std        976.188990
min          1.000000
25%          2.000000
50%          6.000000
75%         28.500000
max      13272.000000
Name: variety, dtype: float64

No cigar. You'd be right 10% of the time if your model just labeled everything "Pinot Noir", and the majority of labels have less than 5 reviews each. On the bright side, unbalanced labels seems to be the norm in many datasets, and the distribution on the top wines isn't as bad as it could be.

So what to do?

Sometime I'd love to try the "shove in similar, computer generated fake data" technique to balance the labels, but for now I'm going to try something quicker. Let's just look at the top 20 varieties. That may not be great for a production classifier of every [WineMag](http://winemag.com) review in history, but I kinda doubt my sister's restaurant carried over 600 varieties of wine (ever order a "Teroldego Rotaliano"?)

In [7]:
top_wines_df =  df.loc[ df['variety'].isin(varieties.axes[0][:20])] # first column[0],  rows :20  
top_wines_df['variety'].describe()

count          93914
unique            20
top       Pinot Noir
freq           13272
Name: variety, dtype: object

In [8]:
top_wines_df['variety']

0                      White Blend
1                   Portuguese Red
2                       Pinot Gris
3                         Riesling
4                       Pinot Noir
9                       Pinot Gris
10              Cabernet Sauvignon
12              Cabernet Sauvignon
14                      Chardonnay
15                        Riesling
16                          Malbec
17                          Malbec
20                       Red Blend
21                      Pinot Noir
22                     White Blend
23                          Merlot
25                      Pinot Noir
26                     White Blend
28                       Red Blend
31                       Red Blend
32                     White Blend
33                       Red Blend
34                 Sauvignon Blanc
35                      Pinot Noir
37              Cabernet Sauvignon
41                      Pinot Noir
43                 Sauvignon Blanc
44                          Merlot
45                  

Now we've got ~93k reviews, 20 labels, and an unbalanced but much more manageable distribution of labels across our documents. Let's make some vectors & do some classifying!

In [9]:
# our labels, as numbers. 

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(top_wines_df['variety'])
y

array([18,  8,  6, ..., 11,  7,  6])

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vect = TfidfVectorizer()
x = vect.fit_transform(top_wines_df['description'])
x

<93914x27105 sparse matrix of type '<class 'numpy.float64'>'
	with 3261558 stored elements in Compressed Sparse Row format>

Now we've got a sparse matrix of our ~70k docs with ~26k columns. 
Let's have a go at learning, using a basic train test split, accuracy_score, and a <br>                             **Logistic Regression classifier**.

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10)

In [23]:
import time
start = time.time()
clf = LogisticRegression()
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
print(time.time() - start)

11.073572874069214


In [22]:
accuracy_score(y_test, pred)

0.72136396831638672

~72% accuracy? Not bad! However, probably a good call to be skeptical and think through why this might be a little too optimistic. In terms of model performance, I've read that using a holdout method with train_test_split isn't as trusty as using a cross validation assessment.

But in terms of our subject material, is there anything obvious I missed? One thing that came to mind was the names of the varieties: what if they're in the actual review? Our classifier would probably be pretty pleased with that set of features, and while it would be a reality of the data and therefore "fair game", it feels a bit like cheating the spirit of my original objective (... imagine, reviewer: "This lovely pinot noir tastes like berries!" classifier: "it's pinot noir!" reviewer: "you're so smart!" )

Let's try again, with any variety names removed:

In [25]:
wine_stop_words = []
for variety in top_wines_df['variety'].unique():
    for word in variety.split(' '):
        wine_stop_words.append(word.lower())
 

In [26]:
wine_stop_words = pd.Series(data=wine_stop_words).unique()
wine_stop_words

array(['cabernet', 'sauvignon', 'blanc', 'pinot', 'noir', 'chardonnay',
       'tempranillo', 'malbec', 'rosé', 'syrah', 'sangiovese', 'sparkling',
       'blend', 'red', 'riesling', 'portuguese', 'nebbiolo', 'white',
       'zinfandel', 'bordeaux-style', 'merlot', 'shiraz', 'corvina,',
       'rondinella,', 'molinara'], dtype=object)

In [27]:
vect2 = TfidfVectorizer(stop_words=list(wine_stop_words))
x2 = vect2.fit_transform(top_wines_df['description'])
x2

<71202x26565 sparse matrix of type '<class 'numpy.float64'>'
	with 2392650 stored elements in Compressed Sparse Row format>

Double checking that I understand things here: before removing our variety name-like list, our matrix had 26587 columns. Now it's at 26565, and the difference is 22, the length of our stop words. Lovely. That means our reviews did include every one of those words.

Note that it's likely a better move to check if each variety name mentioned in reviews is always equal with the review's actual label classifier, but I'll save that one for another day. Let's classify our new stuff:

In [28]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2, y, test_size=0.25, random_state=10)
clf = LogisticRegression()
clf.fit(x_train2, y_train2)
pred = clf.predict(x_test2)
accuracy_score(y_test, pred)

0.62973990225268239

Huzzah. We've dropped our accuracy nearly 10%, meaning we were right about the importance of those variety presences in the reviews.

Let's see how much we can get back up with some hyperparameter tuning.

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

x_train3, x_test3, y_train3, y_test3 = train_test_split(
    top_wines_df['description'].values, y, test_size=0.25, random_state=10)

pipe = Pipeline([
    ('vect', TfidfVectorizer(stop_words=list(wine_stop_words))), 
    ('clf', LogisticRegression(random_state=0))
])

param_grid = [
  {
    'vect__ngram_range': [(1, 2)],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0]
  },
  {
    'vect__ngram_range': [(1, 2)],
    'vect__use_idf':[False],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0]
    },
]

grid = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=3, verbose=1, n_jobs=-1)

We did another train test split, as this time we'll be feeding the raw reviews as our "X" into our pipeline. I'm looking to see if IDF is helping or hurting, and what happens when we up our "grams" (number of words per column) from 1 to 2. With unigram we see "smokey" and "finish" as unrelated, but bigram we'll look for the whole phrase "smokey finish" in reviews.

We're also checking if L1 or L2 penalties could help us ignore some unnecessary word columns. Let's try this out!

In [None]:
grid.fit(x_train3, y_train3)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 18.8min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 32.9min finished

print(grid.best_params_, grid.best_score_)
{'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 2)} 0.63607423082

grid.best_estimator_.score(x_test3, y_test3)
0.6463681815628336

(Copied & pasted in above the results from my home computer)

Alright, we brought the score back up a bit! Usint tfidf was in the best params (assuming they don't show the defaults, or it could be because I didn't list it explicitly), L2 penalty may have helped and putting bigrams into play also probably helped. ~65% accuracy isn't half bad, and it could probably be brought up more were we to dig into those hyperparams.

Lastly, let's see if there are any interesting topics across reviews from **Latent Dirichlet Allocation** decomposition. For this one I'm going to replace our previous stop list of wine varieties and instead use sklearn's basic english set, so our topics aren't mucked up.

In [31]:
from sklearn.decomposition import LatentDirichletAllocation as LDA
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS.union(wine_stop_words)
vect = TfidfVectorizer(stop_words=stop_words)
x = vect.fit_transform(top_wines_df['description'])
lda = LDA(learning_method='batch')
topics = lda.fit_transform(x)

There doesn't seem to be a great way to print topics built in, so we'll borrow this handy print function found in Muller & Guido's intro to ML with Python:

In [42]:
import numpy as np
def print_topics(topics, feature_names, sorting, topics_per_chunk=5,
                 n_words=10):
    for i in range(0, len(topics), topics_per_chunk):
        # for each chunk:
        these_topics = topics[i: i + topics_per_chunk]
        # maybe we have less than topics_per_chunk left
        len_this_chunk = len(these_topics)
        # print topic headers
        print(("topic {!s:<8}" * len_this_chunk).format(*these_topics))
        print(("-------- {0:<5}" * len_this_chunk).format(""))
        # print top n_words frequent words
        for i in range(n_words):
            try:
                print(("{!s:<14}" * len_this_chunk).format(
                    *feature_names[sorting[these_topics, i]]))
            except:
                pass
        print("\n")
        

sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
feature_names = np.array(vect.get_feature_names())
        
print_topics(topics=range(10), feature_names=feature_names, sorting=sorting)

topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
aromas        palate        fruit         flavors       spice         
flavors       nose          vineyard      citrus        aromas        
finish        finish        new           apple         wine          
palate        notes         wine          acidity       cherry        
berry         acidity       mix           pineapple     fruit         
plum          lemon         cherry        crisp         bright        
nose          flavors       flavors       wine          pair          
fruit         dry           oak           peach         offers        
herbal        fresh         barrel        finish        tones         
cherry        lime          french        fruit         almond        


topic 5       topic 6       topic 7       topic 8       topic 9       
--------      --------      --------      --------      --------      
frui

Nothing crazy jumps out at me, but thinking about topic 0: "sweet, peach, & honey", maybe these would be white wines or roses? Topic 1 & 4 I could see being topics about reds: "wood, spice, firm, tannins, black..."

I would love to add a topics column next to each document for the topic the document is most likely a member of, no doubt some interesting patterns could up. I couldn't find any great examples of how to do that online, so if anyone knows of one please let me know!

One last interesting thing: when I first ran LDA topics on the original .json file of reviews, "cheeseburger" popped up as topic term...

In [43]:
top_wines_df[top_wines_df['description'].str.contains('cheeseburger')].values

array([['US',
        'This dry, country-style Cab has some sharp, green tannins that bring astringency to the blackberry, cherry and cola flavors. Will go fine with a cheeseburger.',
        nan, 84, 18.0, 'California', 'Paso Robles', 'Central Coast',
        'Cabernet Sauvignon', "Bishop's Peak"],
       ['US',
        'This is a soft, easy wine, with sweet black raspberry, cherry and spice flavors. Drink now with a nice homemade cheeseburger.',
        nan, 84, 25.0, 'California', 'Santa Barbara County',
        'Central Coast', 'Syrah', 'Curtis'],
       ['Italy',
        "You absolutely can't go wrong with a wine like this. Valpolicella Torre d'Orti is an easy, fresh wine that would pair with anything from lasagna to cheeseburgers. It's crisp and thin with a nice quality of cherry, raspberry and cassis fruit.",
        "Torre d'Orti", 86, nan, 'Veneto', 'Valpolicella Superiore', nan,
        'Corvina, Rondinella, Molinara', 'Cavalchina'],
       ['US',
        'A strong Zinfandel,

In [44]:
dedupped_df[dedupped_df['description'].str.contains('cheeseburger')]['points'].describe()

count    50.000000
mean     84.800000
std       1.551826
min      83.000000
25%      84.000000
50%      84.000000
75%      86.000000
max      89.000000
Name: points, dtype: float64

In [45]:
dedupped_df['points'].describe()

count    97821.000000
mean        87.956819
std          3.217883
min         80.000000
25%         86.000000
50%         88.000000
75%         90.000000
max        100.000000
Name: points, dtype: float64

Some 50 reviews use the word cheeseburger ("it will happily wash down simple fare, like cheeseburgers"), and on average their review rating points are ~84, about a full standard deviation lower than the overall average.

I guess bad wine goes with good cheeseburgers?