This notebook is based on:
 - https://medium.com/tensorflow/predicting-the-price-of-wine-with-the-keras-functional-api-and-tensorflow-a95d1c2c1b03
 - https://www.kaggle.com/learn/embeddings
 

In [1]:
import os
import math
import numpy as np
import pandas as pd
# pip install -q -U tensorflow==1.7.0
import tensorflow as tf
print("Tensorflow version:", tf.__version__)

from sklearn.preprocessing import LabelEncoder
from tensorflow import keras

path_data = "../data/raw/winemag-data_first150k.csv"

Tensorflow version: 1.7.0


In [2]:
# Read data
raw_data_df = pd.read_csv(path_data)
cols = list(raw_data_df.columns)
print("Columns: ", cols)
# raw_data_df['description'].head()

Columns:  ['Unnamed: 0', 'country', 'description', 'designation', 'points', 'price', 'province', 'region_1', 'region_2', 'variety', 'winery']


## Representing descriptions as bag of words

In [3]:
# Feature 1: wine descriptions as a bag of words
f_description = raw_data_df['description']
vocab_size = 12000  # Hyperparameter: experiment with different values for your dataset
# Create a tokenizer to preprocess our text descriptions
tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
# NOTE: fitted on the full dataset here; fit on the training split only once you hold out test data
tokenize.fit_on_texts(f_description)
# Binary bag-of-words (bow) matrix: one row per description, vocab_size columns
# (returned as a dense NumPy array that is mostly zeros)
description_bag_train = tokenize.texts_to_matrix(f_description)

In [4]:
description_bag_train[0:5]

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])
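
To make the matrix above less opaque, here is a minimal pure-Python sketch of a binary bag-of-words encoding. It is not the Keras implementation: the helper names `build_word_index` and `to_binary_bow` and the toy sentences are assumptions made for illustration, and words are split on whitespace rather than with the Keras filter rules. Like the Keras tokenizer, it ranks words by frequency starting at index 1, leaves column 0 unused, and keeps only indices below `num_words`.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_binary_bow(texts, word_index, num_words):
    """One row per text; column i is 1.0 if the word with index i occurs."""
    rows = []
    for text in texts:
        row = [0.0] * num_words  # column 0 is reserved and stays 0
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and idx < num_words:
                row[idx] = 1.0
        rows.append(row)
    return rows

# Hypothetical toy descriptions, not taken from the wine dataset
toy_texts = ["ripe fruit and oak", "ripe cherry and spice", "dry and crisp"]
toy_index = build_word_index(toy_texts)
toy_bow = to_binary_bow(toy_texts, toy_index, num_words=5)
for row in toy_bow:
    print(row)
```

In this toy example "and" is the most frequent word, so it lands in column 1 and that column is 1.0 in every row. The same effect probably explains why the second column of `description_bag_train` is 1 for all five rows shown above.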

In [5]:
f_description.head()

0    This tremendous 100% varietal wine hails from ...
1    Ripe aromas of fig, blackberry and cassis are ...
2    Mac Watson honors the memory of a wine once ma...
3    This spent 20 months in 30% new French oak, an...
4    This is the top wine from La Bégude, named aft...
Name: description, dtype: object

## Representing descriptions as a word embedding

**Preparation**: We’ll first need to convert each description to a sequence of integers, one per word, where each integer is that word’s index in our vocabulary. We can do that with the handy Keras `texts_to_sequences` method.

In [10]:
# Deep model feature: Word embeddings of wine descriptions
train_embed = tokenize.texts_to_sequences(f_description)
train_embed[0:1]

[[7, 1695, 397, 408, 8, 3076, 25, 2555, 1, 335, 317, 118, 603, 63, 10,
  39, 111, 40, 20, 13, 1, 3, 1200, 179, 4, 283, 3185, 2, 24, 876,
  29, 135, 102, 22, 1, 3, 382, 607, 1440, 10, 2, 715, 83, 1, 2218,
  25, 666, 11, 18, 12, 26, 63, 1817, 4, 12, 11, 383, 1033, 1312, 562]]
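
As a rough illustration of what happens inside that call, the sketch below maps words to integers while preserving word order and repeats. The function `to_sequence` and the tiny `toy_index` dictionary are hypothetical stand-ins for a fitted tokenizer, and the whitespace split is a simplification of the Keras text filtering.

```python
def to_sequence(text, word_index, num_words):
    """Map each word to its integer index, keeping word order and repeats."""
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        # Unknown words, and words ranked at or beyond num_words, are dropped
        if idx is not None and idx < num_words:
            sequence.append(idx)
    return sequence

# Hypothetical index, ordered by frequency as a fitted tokenizer would produce
toy_index = {"and": 1, "the": 2, "wine": 3, "ripe": 4, "oak": 5}
toy_sequence = to_sequence("The wine and the oak and tannins", toy_index, num_words=5)
print(toy_sequence)
```

Here "oak" (index 5) falls outside the `num_words=5` cutoff and "tannins" is not in the index at all, so both are silently skipped. Unlike the bag-of-words row, the result keeps the order of the words and the repeated "the" and "and".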

Now that we’ve got integerized description vectors, we need to make sure they’re all the same length to feed them into our model. Keras has a handy method for that too. We’ll use `pad_sequences` to add zeros to each description vector so that they’re all the same length. By default it pads at the front, and it truncates any description longer than `maxlen`.

In [11]:
max_seq_length = 170
train_embed = keras.preprocessing.sequence.pad_sequences(train_embed, maxlen=max_seq_length)
train_embed[0:2]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           7, 1695,  397,  408,    8, 3076,   25, 2555,    1,  335,  317,
         118,  603,   63,   10,   39,  111,   40,   20,   13,    1,    3,
        1200,  179,    4,  283, 3185,    2,   24,  876,   29,  135,  102,
          22,    1,    3,  382,  607, 
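
The leading zeros in the output above come from the default front padding. A minimal sketch of that behaviour, assuming the Keras defaults of `padding='pre'` and `truncating='pre'`, is shown below. The helper `pad_pre` is a hypothetical name, it assumes `maxlen` is at least 1, and it returns plain lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # 'pre' truncation drops the earliest tokens
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_padded = pad_pre([[7, 1695, 397], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(toy_padded)
```

The short sequence gains zeros on the left, while the long one loses its first token. Index 0 is never assigned to a real word by the tokenizer, so it is safe to use as the padding value.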

With our descriptions converted to vectors that are all the same length, we’re ready to create our embedding layer and feed it into a deep model.

**Word Embedding**: First, we’ll define the shape of our inputs for the Keras functional API model. Then we’ll feed them into the Embedding layer.

Here I’m using an Embedding layer with 8 dimensions. The dimensions are learned during training and are not guaranteed to be interpretable, but intuitively each one can be thought of as an axis of variation, for example:
  1. Dimension 1: How old is the wine?
  2. Dimension 2: How acidic is the flavour?
  3. Dimension 3: How mature is the intended reviewer?
  4. etc.


The output of the Embedding layer will be a three-dimensional tensor with shape [batch size, sequence length (170 in this example), embedding dimension (8 in this example)]. To connect our Embedding layer to the Dense, fully connected output layer, we need to flatten it first:



In [19]:
# One integer word index per position in the padded description
deep_inputs = keras.layers.Input(shape=(max_seq_length,))
embedding = keras.layers.Embedding(vocab_size, 8, input_length=max_seq_length)(deep_inputs)
embedding = keras.layers.Flatten()(embedding)
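
Conceptually, the Embedding layer is a lookup table and Flatten concatenates the looked-up vectors. The pure-Python sketch below walks one padded description through both steps to show where the shapes come from. The random `embedding_table` is only a stand-in for the trainable weights, and the variable names are assumptions for illustration.

```python
import random

vocab_size, embed_dim, seq_length = 12000, 8, 170

# Stand-in for the trainable embedding matrix: one 8-dim vector per word index
random.seed(0)
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]

# One padded description (batch size 1): 167 zeros followed by three word indices
padded = [0] * (seq_length - 3) + [7, 1695, 397]

embedded = [embedding_table[idx] for idx in padded]     # shape (170, 8)
flattened = [x for vector in embedded for x in vector]  # shape (1360,)

print(len(embedded), len(embedded[0]), len(flattened))
```

Every padding position looks up the same row 0 of the table, so the padded part of each description contributes an identical block of values to the flattened vector.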


Once the embedding output is flattened, we can feed it into a Dense output layer, build the model, and compile it.

In [18]:
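# Single linear unit: the model predicts one continuous value (regression)
embed_out = keras.layers.Dense(1, activation='linear')(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
print(deep_model.summary())
# Accuracy is not meaningful for regression; track mean absolute error instead
deep_model.compile(loss='mse', optimizer='adam', metrics=['mae'])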

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 170)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 170, 8)            96000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 1360)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 1361      
Total params: 97,361
Trainable params: 97,361
Non-trainable params: 0
_________________________________________________________________
None
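
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the three sizes used in this notebook. It is plain arithmetic rather than a Keras call, and the variable names are chosen for illustration.

```python
vocab_size, embed_dim, seq_length = 12000, 8, 170

embedding_params = vocab_size * embed_dim  # one vector per vocabulary index
flat_units = seq_length * embed_dim        # Flatten has no weights of its own
dense_params = flat_units * 1 + 1          # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flat_units, dense_params, total_params)
```

Almost all of the weights sit in the embedding table, so `vocab_size` and the embedding dimension are the main levers on model size.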
