# Wine filtering by flavours
In this notebook I extract flavours from the wine descriptions taken from the Wine Enthusiast website, and I cross-check them against an additional data source of foods and spices. I also train a word2vec model. The flavour list and the word2vec model are then used together to build a filtering system.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import nltk
import re
import pickle
import spacy

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import *
from textblob import TextBlob
from textblob import Word
from collections import Counter
from nltk import word_tokenize
from nltk import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

%matplotlib inline

In [2]:
os.chdir('/home/fykos/Documents/workspace/wine_recommendation_system//')

In [3]:
wine = pd.read_csv('data/raw/winemag-data-130k-v2.csv')

In [4]:
wine.head(3)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


In [5]:
def normalize(review):
    review_letters = re.sub('[^a-zA-Z]', ' ', str(review))
    review_letters = review_letters.lower()
    return (" ".join(review_letters.split()))

In [6]:
def remove_stopwords(review):
    stop_words = set(stopwords.words('english'))
    ls = [word for word in review.split() if word not in stop_words]
    txt = " ".join(ls)
    return (txt)
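
As a quick illustration of what these two helpers do to a single description, the sketch below applies them to a made-up sentence. To stay self-contained it replaces NLTK's English stop-word list with a tiny hand-written set (`demo_stop_words`, an assumption for this example only), so the words removed will differ from the real pipeline.

```python
import re

def normalize(review):
    # keep letters only, lowercase, and collapse runs of whitespace
    review_letters = re.sub('[^a-zA-Z]', ' ', str(review))
    return " ".join(review_letters.lower().split())

# hypothetical stand-in for stopwords.words('english')
demo_stop_words = {'the', 'of', 'and', 'a', 'is', 'this'}

def remove_demo_stopwords(review):
    # drop every token that appears in the demo stop-word set
    return " ".join(w for w in review.split() if w not in demo_stop_words)

# made-up description, not taken from the dataset
sample = "Aromas of ripe cherry & vanilla; 14% alcohol, well-balanced."
cleaned = remove_demo_stopwords(normalize(sample))
print(cleaned)
```

Note that digits and punctuation are replaced by spaces, so a hyphenated word such as "well-balanced" becomes two separate tokens.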

# Extracting the flavours from the descriptions
Here I use TextBlob to extract all the nouns and noun phrases, since fruit flavours are normally expressed as nouns, and I then refine this list against a second list taken from an external source.

In [7]:
noun_words = []
for review in wine['description']:
    blob = TextBlob(normalize(review))
    noun_words.append(" ".join([word for word, tag in blob.tags if tag == 'NN'] + blob.noun_phrases))

In [44]:
# save the noun word list because it takes too long to regenerate every time
with open('data/modified/noun_word_list', 'wb') as pickle_out:
    pickle.dump(noun_words, pickle_out)

In [9]:
stops = ['wine', 'pasta', 'whole', 'character', 'cabernet', 'wood', 'spicy', 'tannins', 'crisp', 'juicy', 'fruits', 'blend', 'sauvignon', 'structure', 'fruity', 'aromas', 'flavors', 'ripe', 'syrup', 'cake', 'cheese', 'cream', 'bean', 'hard', 'milk', 'sauce', 'barbecue', 'steak', 'rock', 'powder', 'ruby', 'oil', 'salt', 'pastry', 'flesh', 'bitter', 'sugar', 'leather', 'herbal', 'creamy', 'table', 'brown', 'golden', 'gold', 'extract', 'broad', 'natural', 'salmon', 'tongue', 'dry', 'pure', 'root', 'sea', 'port', 'chewy', 'solid', 'blue', 'pink', 'ground', 'beef', 'purple', 'spring', 'lean', 'raw', 'red', 'black', 'white', 'yellow', 'mature', 'tropical', 'meat', 'wild', 'new', 'juice', 'firm', 'sweet', 'fresh', 'light', 'flower', 'green', 'soft', 'skin', 'spice', 'dark', 'herb', 'palate', 'valley', 'finish', 'drink', 'flavor', 'fruit', 'aroma', 'note', 'texture', 'thi', 'acidity']
tfidf_vectorizer = TfidfVectorizer(stop_words=stops)
tfidf_matrix = tfidf_vectorizer.fit_transform(noun_words)

In [10]:
# features holds all the words in the tf-idf vocabulary, in the same order as the columns of the matrix
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
features = tfidf_vectorizer.get_feature_names()
weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term':features, 'weights':weights})
weights_df = weights_df.sort_values(by='weights', ascending=False)
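
The ranking above averages each term's TF-IDF score over all descriptions. A minimal pure-Python sketch of that arithmetic on three made-up documents follows. It assumes scikit-learn's default weighting (smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2-normalised rows) and is meant only to show the idea, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

# three made-up "noun lists", standing in for entries of noun_words
docs = ["cherry plum cherry", "cherry vanilla", "lemon lime"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# document frequency and smoothed inverse document frequency
df = Counter(t for doc in tokenized for t in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # raw term count times idf, then scale the row to unit L2 norm
    counts = Counter(doc)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]

# mean weight per term across all documents (absent terms count as 0)
mean_weights = {t: sum(r.get(t, 0.0) for r in rows) / n_docs for t in vocab}
ranked = sorted(mean_weights, key=mean_weights.get, reverse=True)
print(ranked)
```

A term that appears in many documents gets a lower IDF but contributes to more rows, which is why a frequent flavour can still end up at the top of the averaged ranking.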

In [11]:
# load the foods database
food_db = pd.read_csv('data/raw/8b. AUSNUT 2011-13 AHS Food Nutrient Database.csv')

In [12]:
# build a set of lowercase words from the comma-separated food names
test = set(word.strip().lower() for name in food_db['Food Name'] for word in name.split(','))

In [13]:
# pick out the foods from the wine terms
terms = weights_df[weights_df['weights'] > 0.001]
foods = []
for term in terms['term']:
    if term in test:
        print(term)
        foods.append(term)

cherry
berry
plum
apple
blackberry
citrus
vanilla
pepper
peach
raspberry
chocolate
currant
lemon
pear
orange
licorice
lime
tart
grapefruit
coffee
melon
strawberry
honey
pineapple
apricot
jam
almond
cranberry
herbs
blueberry
cinnamon
mocha
mint
grape
caramel
tea
cherries
pie
prune
sage
raisin
tomato
mango
nectarine
sour
berries
coconut
pomegranate
butter
olive
bacon
banana
mushroom
fig
thyme
butterscotch
mousse
ginger
nutmeg
lychee
bread
rhubarb
jasmine
pine
liqueur
truffle
yeast
nut
balsamic
hazelnut
quince
fennel
watermelon
dill
beer
custard
guava
marmalade
pork
cardamom


In [22]:
with open('data/modified/flavours_list.pickle', 'wb') as pickle_out:
    pickle.dump(foods, pickle_out)

# Implementing the word2vec model

In [14]:
# word-tokenized reviews, in the form word2vec expects as input
wine_processed_reviews = []
for review in wine['description']:
    wine_processed_reviews.append(word_tokenize(remove_stopwords(normalize(review))))

In [15]:
from gensim.models import Word2Vec

## Word2Vec needs a large training corpus to perform well
## (gensim >= 4.0 renames the `size` argument to `vector_size`)
cbow_model = Word2Vec(wine_processed_reviews, min_count = 4, size = 300, window = 8)

In [16]:
cbow_model.wv.most_similar('jam', topn=5)

[('compote', 0.6358773112297058),
 ('preserves', 0.6207005977630615),
 ('syrup', 0.6142028570175171),
 ('liqueur', 0.5811886787414551),
 ('pie', 0.5681503415107727)]

In [17]:
cbow_model.wv.most_similar('marmalade', topn=5)

[('creamsicle', 0.6927849650382996),
 ('honey', 0.6405869722366333),
 ('pekoe', 0.6153730750083923),
 ('slices', 0.6142177581787109),
 ('mandarin', 0.6107165217399597)]

# Find wines based on query

In [23]:
with open('data/modified/flavours_list.pickle', 'rb') as pickle_in:
    foods_list = pickle.load(pickle_in)

In [24]:
print(foods_list)

['cherry', 'berry', 'plum', 'apple', 'blackberry', 'citrus', 'vanilla', 'pepper', 'peach', 'raspberry', 'chocolate', 'currant', 'lemon', 'pear', 'orange', 'licorice', 'lime', 'tart', 'grapefruit', 'coffee', 'melon', 'strawberry', 'honey', 'pineapple', 'apricot', 'jam', 'almond', 'cranberry', 'herbs', 'blueberry', 'cinnamon', 'mocha', 'mint', 'grape', 'caramel', 'tea', 'cherries', 'pie', 'prune', 'sage', 'raisin', 'tomato', 'mango', 'nectarine', 'sour', 'berries', 'coconut', 'pomegranate', 'butter', 'olive', 'bacon', 'banana', 'mushroom', 'fig', 'thyme', 'butterscotch', 'mousse', 'ginger', 'nutmeg', 'lychee', 'bread', 'rhubarb', 'jasmine', 'pine', 'liqueur', 'truffle', 'yeast', 'nut', 'balsamic', 'hazelnut', 'quince', 'fennel', 'watermelon', 'dill', 'beer', 'custard', 'guava', 'marmalade', 'pork', 'cardamom']


In [50]:
query = 'burnt rubber'

In [26]:
def wine_finder(query):
    query_rank = tfidf_vectorizer.transform([query])
    sims = cosine_similarity(query_rank, tfidf_matrix).flatten()
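    # [:-5:-1] walks the sorted indices backwards and stops before position -5,
    # so it returns the 4 most similar descriptions, best first
    related_docs_indices = sims.argsort()[:-5:-1]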
    return related_docs_indices, sims[related_docs_indices]
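
The slice used in `wine_finder` is easy to misread, so here is the same ranking logic in plain Python with made-up similarity scores. The helpers `argsort` and `top_n` are hypothetical names introduced for this sketch. It only illustrates the slicing behaviour and is not part of the pipeline.

```python
def argsort(values):
    # indices that would sort `values` in ascending order
    return sorted(range(len(values)), key=values.__getitem__)

def top_n(values, n):
    # the n highest-scoring indices, best first
    return argsort(values)[::-1][:n]

# made-up cosine similarities for six documents
sims = [0.10, 0.80, 0.05, 0.40, 0.65, 0.30]

order = argsort(sims)
four_best = order[:-5:-1]   # same slice as in wine_finder
five_best = top_n(sims, 5)  # explicit alternative when five results are wanted

print(four_best)
print(five_best)
```

An explicit `n` avoids the off-by-one that comes with negative slice bounds, at the cost of one extra reversal.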

In [46]:
def wine_results_printer(query):
    print('The top wines that match the query (', query, '):')
    doc_indices, similarities = wine_finder(query)
    for index, doc_index in enumerate(doc_indices):
        print(index, ')', wine['description'][doc_index])
        print()

In [51]:
wine_results_printer(query)

The top wines that match the query ( burnt rubber ):
0 ) Tastes and smells hot, with a burnt rubber note that persists through the tart, peppery finish. The wine is very dry, with little fruit.

1 ) Quite tannic, this hits hard with char, burnt rubber and espresso flavors, some black fruit just showing through. Best to drink up by 2018.

2 ) This has aromas of prunes and burnt rubber, but it tastes better than that. It's a pretty raw effort.

3 ) Possibly too much wood has been used in maturing this wine, which results in a toasty, burnt character to the otherwise ripe fruit. At four years, the wood should have mellowed.


In [220]:
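# the outputs below were produced with this query
query = 'plum vanilla jam'
similarity_dictionary = dict()
for word in query.split():
    # keep only the neighbouring words, dropping their similarity scores
    similarity_dictionary[word] = [similar_word for similar_word, _ in cbow_model.wv.most_similar(word, topn=2)]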

In [221]:
similarity_dictionary

{'jam': ['liqueur', 'preserves'],
 'plum': ['licorice', 'loamy'],
 'vanilla': ['caramelized', 'toffee']}

In [222]:
ls = []
for index, word in enumerate(query.split()):
    for similar_word in similarity_dictionary[word]:
        temporary_list = query.split()
        temporary_list[index] = similar_word
        ls.append(temporary_list)
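
The expansion step above can also be wrapped in a small function. The sketch below uses a hypothetical helper, `expand_query`, and hard-codes the neighbour lists shown in the output above so that it runs without the trained model.

```python
def expand_query(query, similar_words):
    """Return variants of `query` with one word swapped for a neighbour."""
    words = query.split()
    variants = []
    for position, word in enumerate(words):
        for neighbour in similar_words.get(word, []):
            variant = list(words)          # copy, so the original stays intact
            variant[position] = neighbour  # swap exactly one word
            variants.append(" ".join(variant))
    return variants

# neighbour lists copied from the similarity_dictionary output
neighbours = {
    'plum': ['licorice', 'loamy'],
    'vanilla': ['caramelized', 'toffee'],
    'jam': ['liqueur', 'preserves'],
}

variants = expand_query('plum vanilla jam', neighbours)
for variant in variants:
    print(variant)
```

Each variant differs from the original query in a single word, so a three-word query with two neighbours per word yields six alternative queries.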

In [223]:
wine_results_printer(query)
for new_query in ls:
    query = " ".join(new_query)
    wine_results_printer(query)
    print('=======================')

The top wines that match the query ( plum vanilla jam ):
0 ) Soft and dull, with blackberry and cherry jam and sweet vanilla flavors. The tannins and acids both are low. Not going anywhere.

1 ) Firm and full-bodied, this opens with aromas of mature plum, underbrush, baked earth, vanilla and exotic spice. The high-toned, chewy palate doles out ripe black cherry, raspberry jam, vanilla, mocha, white pepper and mint alongside polished tannins.

2 ) Plum, French oak, coffee, vanilla and Oriental spice aromas emerge in the glass. The concentrated palate offers mature black cherry, blackberry jam, vanilla and licorice alongside fine-grained tannins.

3 ) Made entirely from Sangiovese, this opens with aromas of black plum, vanilla and toast. The concentrated palate doles out black cherry jam, cedar and tobacco alongside assertive tannins. Drink through 2019.

The top wines that match the query ( licorice vanilla jam ):
0 ) Soft and dull, with blackberry and cherry jam and sweet vanilla flavors