# Wine Reviews Modeling
**by Remington Greider-Little, April 2022**<br/>
**Data Analytics @ Newman University**

## About this Data Set
**This data is from [the Wine Reviews data set from Kaggle](https://www.kaggle.com/datasets/zynicide/wine-reviews?select=winemag-data-130k-v2.csv).**<br/>
**Number of Records:** 130,000<br/>
**Number of original fields:** 14 (including a supplied index)<br/>
**Fields include:**
- `country` - The country that the wine is from
- `description` - Description of the wine
- `designation` - The vineyard within the winery the grapes came from
- `points` - The reviewer's rating of the wine on a scale of 1-100
- `price` - The cost for a bottle of the wine
- `province` - The province or state that the wine is from
- `region` - The wine-growing area within a province or state (e.g., Napa)
- `title` - Title of the wine review. These often contain the vintage, which could be useful.
- `variety` - The type of grapes used to make the wine
- `winery` - The winery that made the wine
- `vintage` - The year the wine's grapes were picked

## Import Libraries & Set Default Plot Attributes

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
%matplotlib inline

import sklearn
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import stopwords
from nltk import word_tokenize

import keras
import tensorflow as tf
from keras.models import Model, load_model
from keras.layers import Dense, Embedding, Input, Activation, CuDNNGRU, Bidirectional, Dropout, GlobalMaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

In [5]:
# Customize seaborn plot styles
# Seaborn docs: https://seaborn.pydata.org/tutorial/aesthetics.html

# Adjust to retina quality
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

# Adjust dpi and font size
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook', font_scale = 0.8)

# Display tick marks
sns.set_style('ticks')

# Remove borders
plt.rc('axes.spines', top=False, right=False, left=False, bottom=False)

In [6]:
# Color palettes for plots
# Named colors: https://matplotlib.org/stable/gallery/color/named_colors.html
# Seaborn color palette docs: https://seaborn.pydata.org/tutorial/color_palettes.html
# Seaborn palette chart: https://www.codecademy.com/article/seaborn-design-ii

# cp1 Color Palette - a binary blue/orange palette
blue = 'deepskyblue' # Use 'skyblue' for a lighter blue
orange = 'orange'
cp1 = [blue, orange]

# cp2 Color Palette - 5 colors for use with categorical data
turquoise = 'mediumaquamarine'
salmon = 'darksalmon'
tan = 'tan'
gray = 'darkgray'
cp2 = [blue, turquoise, salmon, tan, gray]

# cp3 Color Palette - blue-to-orange diverging palette for correlation heatmaps
cp3 = sns.diverging_palette(242, 39, s=100, l=65, n=11)

# cp4 Palette - Reversed binary color order when needed for certain plots
cp4 = [orange, blue]

# cpd Palette - blue-to-orange diverging palette for correlation heatmaps
cpd = sns.diverging_palette(242, 39, s=100, l=65, n=11)

# Set the default palette
sns.set_palette(cp1)

## Read and Review Data

In [7]:
df = pd.read_csv('wine_cleaned.csv')
df.head(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,title,variety,winery,price_log,vintage,cleaned_description
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,,2013.0,aromas include tropical fruit broom brimstone ...
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2.70805,2011.0,ripe fruity wine smooth still structured firm ...
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2.639057,2013.0,tart snappy flavors lime flesh rind dominate g...
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,2.564949,2013.0,pineapple rind lemon pith orange blossom start...
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,4.174387,2012.0,much like regular bottling comes across rather...
5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem,2.70805,2011.0,blackberry raspberry aromas show typical navar...
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo,2.772589,2013.0,bright informal red opens aromas candied berry...
7,France,This dry and restrained wine offers spice in p...,,87,24.0,Alsace,Alsace,Trimbach 2012 Gewurztraminer (Alsace),Gewürztraminer,Trimbach,3.178054,2012.0,dry restrained wine offers spice profusion bal...
8,Germany,Savory dried thyme notes accent sunnier flavor...,Shine,87,12.0,Rheinhessen,,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer,Heinz Eifel,2.484907,2013.0,savory dried thyme notes accent sunnier flavor...
9,France,This has great depth of flavor with its fresh ...,Les Natures,87,27.0,Alsace,Alsace,Jean-Baptiste Adam 2012 Les Natures Pinot Gris...,Pinot Gris,Jean-Baptiste Adam,3.295837,2012.0,great depth flavor fresh apple pear fruits tou...


## Preparing for Modeling

To better differentiate between wine ratings, I will break down the 100-point scale that Wine Enthusiast uses into the following classifications:

In [8]:
classifications = [['98-100', 'Classic'],['94-97', 'Superb'], ['90-93', 'Excellent'], ['87-89', 'Very Good'], ['83-86', 'Good'],['80-82', 'Acceptable']]
print(tabulate(classifications))

------  ----------
98-100  Classic
94-97   Superb
90-93   Excellent
87-89   Very Good
83-86   Good
80-82   Acceptable
------  ----------


In [9]:
# Map each review's points to a rating class (0-4)
def points_to_class(points):
    if points in range(80,83):
        return 0   # Acceptable
    elif points in range(83,87):
        return 1   # Good
    elif points in range(87,90):
        return 2   # Very Good
    elif points in range(90,94):
        return 3   # Excellent
    else:
        return 4   # Superb and Classic (94-100) share one class
    
df["rating"] = df["points"].apply(points_to_class)
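
The same binning can also be expressed with `pd.cut`, a vectorized alternative to applying a Python function row by row. This is only a sketch: the bin edges are taken from the classification table above, and the small `points` series is made-up sample data. Unlike `points_to_class`, it yields `NaN` for scores outside 80-100 instead of placing them in class 4.

```python
import pandas as pd

# Made-up sample scores covering the edges of each band
points = pd.Series([80, 82, 83, 86, 87, 89, 90, 93, 94, 100])

# Right-inclusive bins: (79, 82], (82, 86], (86, 89], (89, 93], (93, 100]
bin_edges = [79, 82, 86, 89, 93, 100]

# labels=False returns the integer bin index (0-4) instead of an interval
rating = pd.cut(points, bins=bin_edges, labels=False)
print(rating.tolist())
```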

In [10]:
# Preview the number of wines in each class
df['rating'].value_counts().head()

2    46366
3    42871
1    31635
4     6174
0     2925
Name: rating, dtype: int64

**Note:** 
- Class 4 combines the Superb (94-97) and Classic (98-100) bands, so there are five classes in total
- The data set is unbalanced: the large majority of reviews fall in classes 1, 2, and 3, while classes 0 and 4 are rare

In [11]:
# num_classes - the number of classes we are working with
# embedding_dim - the dimensions of the word vectors
# epochs - the maximum number of forward and backward passes through all of the training examples
# batch_size - the number of training examples in each pass
# max_len - the maximum length (in words) which will be considered in a text description
# class_weights - classes with higher weights attached to them (class 0 and class 4) will have a higher impact on the learning algorithm. Each instance of class 0 or class 4 is treated as 7 instances.

num_classes = 5
embedding_dim = 300 
epochs = 50
batch_size = 128
max_len = 100

class_weights = {0: 7,
                1: 1,
                2: 1, 
                3: 1,
                4: 7}
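
The weights above are set by hand. One common alternative is to derive them from the class frequencies with the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class counts printed earlier. It only illustrates the heuristic and is not the weighting used in this notebook.

```python
import numpy as np

# Class counts copied from the value_counts() output above (classes 0-4)
counts = np.array([2925, 31635, 46366, 42871, 6174])

# "Balanced" heuristic: rarer classes receive proportionally larger weights
weights = counts.sum() / (len(counts) * counts)

balanced_class_weights = {cls: round(float(w), 2) for cls, w in enumerate(weights)}
print(balanced_class_weights)
```

By this heuristic the rarest class (0) gets the largest weight, and the three common classes get weights below 1.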

In [12]:
# One hot encoding target
def onehot(arr, num_class):
    return np.eye(num_class)[np.array(arr.astype(int)).reshape(-1)]

y = onehot(df["rating"], num_classes)
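
To see what `onehot` produces, the sketch below runs the same `np.eye` indexing trick on three made-up labels. Row `k` of the identity matrix is the one-hot vector for class `k`, and `argmax` recovers the original label.

```python
import numpy as np

def onehot(arr, num_class):
    # Index the identity matrix by label: row k is the one-hot vector for class k
    return np.eye(num_class)[np.asarray(arr, dtype=int).reshape(-1)]

labels = np.array([0, 2, 4])   # made-up example labels
encoded = onehot(labels, 5)
print(encoded)

# argmax along each row reverses the encoding
decoded = encoded.argmax(axis=1)
print(decoded)
```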

## Train-Validation Split

In [13]:
# Train, validation split (test is on another set)
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(df["description"], y, test_size = 0.05)
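
Because the classes are unbalanced, a 5% validation split can end up with very few examples of the rare classes. A possible refinement, shown here on made-up data, is to pass `stratify` so each class keeps its share, and `random_state` so the split is reproducible. The toy `texts` and `labels` arrays and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, unbalanced labels: 200 samples across 5 classes
labels = np.repeat([0, 1, 2, 3, 4], [20, 60, 60, 40, 20])
texts = np.array([f"review {i}" for i in range(len(labels))])

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels,
    test_size=0.05,      # same validation share as above
    stratify=labels,     # keep class proportions in both splits
    random_state=42,     # arbitrary seed for reproducibility
)

# Count how many validation samples landed in each class
print(np.bincount(y_va))
```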

## Tokenizing Inputs and Preparing the Embedding Matrix

In [14]:
# Prepare embeddings 
embeddings_index = {}

# Read pre-trained word vectors and populate to dictionary
with open("glove.840B.300d.txt", encoding = "utf8") as f:
    for line in f:
        values = line.split()
        # The last embedding_dim fields are the vector; everything before them is the token
        word = ''.join(values[:-embedding_dim])
        coefs = np.asarray(values[-embedding_dim:], dtype='float32')
        embeddings_index[word] = coefs
    
# Fit the tokenizer on the training descriptions only
tokenizer = Tokenizer(num_words = None)
tokenizer.fit_on_texts(X_train)

# Convert each training description to a sequence of word indices
sequences_train = tokenizer.texts_to_sequences(X_train)

# Pad short sequences with 0s and truncate long ones to max_len
X_train = pad_sequences(sequences_train, maxlen=max_len)

sequences_val = tokenizer.texts_to_sequences(X_val)
X_val = pad_sequences(sequences_val, maxlen = max_len)

word_index = tokenizer.word_index
                
# create embedding layer 
# We can designate "Out of Vocabulary" word vectors here 
# In this case, they are initialized to zero vector
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
        
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
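
Words missing from GloVe keep an all-zero row in `embedding_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The helper below is a hypothetical addition (`embedding_coverage` is not part of the original notebook), demonstrated on toy dictionaries shaped like `word_index` and `embeddings_index`.

```python
import numpy as np

def embedding_coverage(word_index, embeddings_index):
    """Return the share of vocabulary words that have a pre-trained vector,
    plus the list of words that do not."""
    missing = [w for w in word_index if w not in embeddings_index]
    covered = 1 - len(missing) / len(word_index)
    return covered, missing

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary
toy_word_index = {"wine": 1, "tannins": 2, "brimstone": 3, "vulkà": 4}
toy_embeddings = {w: np.zeros(3, dtype="float32") for w in ["wine", "tannins", "brimstone"]}

coverage, missing_words = embedding_coverage(toy_word_index, toy_embeddings)
print(f"{coverage:.0%} of the vocabulary covered, missing: {missing_words}")
```

In the notebook, the equivalent call would be `embedding_coverage(word_index, embeddings_index)`.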

## Training the Classifier

In [15]:
# Frozen embedding layer initialized with the pre-trained GloVe vectors
embedding_layer = Embedding(len(word_index) + 1, embedding_dim, weights = [embedding_matrix], input_length = max_len, trainable = False) 
inputs = Input(shape=(max_len, ), dtype = 'int32')
embedded_sequences = embedding_layer(inputs) 
# CuDNNGRU only has GPU kernels, so this cell needs a CUDA-enabled TensorFlow install
x = Bidirectional(CuDNNGRU(50, return_sequences=True))(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(50, activation = 'relu')(x)
x = Dropout(0.1)(x)
output = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=inputs, outputs=output)
model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
        
checkpoint = ModelCheckpoint("model.h5", monitor='val_loss', verbose=1, save_best_only=True, mode='min')
early = EarlyStopping(monitor='val_loss', mode='min', patience=3)
callback = [checkpoint, early]
        
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val), callbacks=callback, class_weight = class_weights)


Epoch 1/50


InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNV2' used by {{node model/bidirectional/forward_cu_dnngru/CudnnRNNV2}} with these attrs: [dropout=0, seed=0, input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="gru", seed2=0, is_training=true]
Registered devices: [CPU]
Registered kernels:
  device='GPU'; T in [DT_HALF]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_DOUBLE]

	 [[model/bidirectional/forward_cu_dnngru/CudnnRNNV2]] [Op:__inference_train_function_2230]
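
The traceback shows that `CudnnRNNV2` has kernels registered only for GPU, while this session reports `Registered devices: [CPU]`. In TensorFlow 2.x the standard `GRU` layer runs on CPU and, with its default arguments, can select the cuDNN kernel on its own when a GPU is present. The sketch below rebuilds the same architecture with `GRU` on a made-up vocabulary. The sizes and the random `toy_matrix` are placeholders rather than the notebook's GloVe weights, and no training results are implied.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, GRU,
                                     GlobalMaxPooling1D, Dense, Dropout)
from tensorflow.keras.models import Model

# Placeholder sizes standing in for len(word_index) + 1, embedding_dim, max_len
vocab_size, toy_dim, toy_len, n_classes = 20, 8, 10, 5
toy_matrix = np.random.rand(vocab_size, toy_dim).astype("float32")

inputs = Input(shape=(toy_len,), dtype="int32")
x = Embedding(vocab_size, toy_dim,
              embeddings_initializer=tf.keras.initializers.Constant(toy_matrix),
              trainable=False)(inputs)
# GRU runs on CPU; on a GPU it can use the cuDNN kernel automatically
x = Bidirectional(GRU(50, return_sequences=True))(x)
x = GlobalMaxPooling1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(n_classes, activation="softmax")(x)

gru_model = Model(inputs=inputs, outputs=outputs)
gru_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# One forward pass on random token ids to confirm the model runs on this machine
probs = gru_model.predict(np.random.randint(0, vocab_size, size=(4, toy_len)), verbose=0)
print(probs.shape)
```

In the notebook itself, the corresponding change would be to import `GRU` and use it in place of `CuDNNGRU` in the `Bidirectional` wrapper.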

## Testing the Model

In [None]:
from sklearn.metrics import accuracy_score

test = pd.read_csv("wine.csv", index_col = False)
test["rating"] = test["points"].apply(points_to_class)

sequences_test = tokenizer.texts_to_sequences(test["description"])
X_test = pad_sequences(sequences_test, maxlen=max_len)

# Predictions
pred_test = model.predict(X_test)
pred_test = [np.argmax(x) for x in pred_test]

# Actual
true_test = onehot(test["rating"], num_classes)
true_test = [np.argmax(x) for x in true_test]

# Find accuracies
accuracy = accuracy_score(true_test, pred_test)

In [None]:
# Test model on the test split
# Use the model to generate class predictions for the test split, based on its features only
# (predict returns class probabilities, so take the most likely class per row)
y_pred = np.argmax(model.predict(X_test), axis=1)

# Compare the model's predictions to the test labels
score = accuracy_score(test["rating"], y_pred) * 100

# Report the model and its score
model.summary()
print(f'  Test accuracy: {score:.2f}%')

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(true_test, pred_test)
# Normalize each row so that it sums to 1 (share of each true class)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)

class_name = ["Acceptable", "Good", "Very Good", "Excellent", "Superb/Classic"]
plt.colorbar()
tick_marks = np.arange(len(class_name))
plt.xticks(tick_marks, class_name, rotation=45)
plt.yticks(tick_marks, class_name)
plt.xlabel('Predicted class')
plt.ylabel('True class')

plt.show()
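
After the row normalization above, each row sums to 1 and the diagonal holds the per-class recall. The sketch below shows this on a small made-up count matrix (the numbers are not model results).

```python
import numpy as np

# Made-up confusion counts: rows = true class, columns = predicted class
toy_cm = np.array([[8, 2, 0],
                   [1, 6, 3],
                   [0, 5, 15]])

# Same row normalization as in the cell above
toy_norm = toy_cm.astype(float) / toy_cm.sum(axis=1)[:, np.newaxis]

# The diagonal of the normalized matrix is the recall of each class
recall_per_class = np.diag(toy_norm)
print(recall_per_class)
```

Reading the diagonal this way is helpful with unbalanced classes, where overall accuracy can hide poor recall on the rare ones.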