# Wine filtering by flavours
In this notebook I extract the flavours from the wine descriptions I have from the Wine Enthusiast website, and I use an additional data source with flavours and spices to mix and match. I also train a word2vec model on the descriptions. The flavour list and the word2vec model are then used to build a filtering system.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import nltk
import re
import pickle
import spacy

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import *
from collections import Counter
from nltk import word_tokenize
from nltk import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# spaCy model used by noun_finder below; the 'en' shortcut was removed in spaCy 3,
# so the full model name is assumed here
nlp = spacy.load('en_core_web_sm')
%matplotlib inline

In [4]:
# os.chdir('/home/fykos/Documents/workspace/wine_beer_exploration/')
os.chdir('C://Users/vasileios.vyzas/Documents/workspace/Projects/Miscellaneous/wine_recommendation_system//')

In [5]:
wine = pd.read_csv('data/raw/winemag-data-130k-v2.csv')

In [6]:
def normalize(review):
    # keep letters only, lowercase, and collapse runs of whitespace
    review_letters = re.sub('[^a-zA-Z]', ' ', str(review))
    review_letters = review_letters.lower()
    return " ".join(review_letters.split())
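As a quick check of what `normalize` does, here is a small sketch that runs it on a made-up review string (the sample text is my own, not taken from the dataset). Everything that is not a letter, digits and punctuation included, becomes a space, and the result is lowercased and collapsed to single spaces.

```python
import re

def normalize(review):
    # same logic as the cell above: letters only, lowercase, single spaces
    review_letters = re.sub('[^a-zA-Z]', ' ', str(review))
    return " ".join(review_letters.lower().split())

sample_review = "Ripe, juicy plum & 20% new oak!"  # hypothetical text
normalized_sample = normalize(sample_review)
print(normalized_sample)

# str() is applied first, so a missing value is turned into a word
missing_sample = normalize(float('nan'))
print(missing_sample)
```

One thing worth noticing: because of the `str(review)` call, a missing description does not raise an error but comes through as the token `nan`.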

In [7]:
def remove_stopwords(review):
    # requires the NLTK 'stopwords' corpus: nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))
    ls = [word for word in review.split() if word not in stop_words]
    return " ".join(ls)

# Extracting the flavours from the descriptions
Here I use spaCy to extract the nouns from each description, since the fruit flavours are nouns. I then filter the resulting terms against a food list I have from an external source.

In [6]:
def noun_finder(review):
    blob = nlp(normalize(review))
    # 'NN' is the Penn Treebank tag for singular or mass nouns
    return " ".join([token.text for token in blob if token.tag_ == 'NN'])

In [12]:
noun_list = wine['description'].map(noun_finder).values

In [13]:
noun_list

array(['fruit broom brimstone herb palate offering apple citrus sage acidity',
       'fruity wine firm berry acidity',
       'lime flesh rind pineapple acidity wine steel', ...,
       'gravel soil wine character fruity spice favor structure wine age couple',
       'style pinot gris acidity weight core spice apple structure wine age drink',
       'spiciness texture fruit profile feel aftertaste drink'],
      dtype=object)

In [14]:
stops = ['wine', 'pasta', 'whole', 'character', 'cabernet', 'wood', 'spicy', 'tannins', 'crisp', 'juicy', 'fruits', 'blend', 'sauvignon', 'structure', 'fruity', 'aromas', 'flavors', 'ripe', 'syrup', 'cake', 'cheese', 'cream', 'bean', 'hard', 'milk', 'sauce', 'barbecue', 'steak', 'rock', 'powder', 'ruby', 'oil', 'salt', 'pastry', 'flesh', 'bitter', 'sugar', 'leather', 'herbal', 'creamy', 'table', 'brown', 'golden', 'gold', 'extract', 'broad', 'natural', 'salmon', 'tongue', 'dry', 'pure', 'root', 'sea', 'port', 'chewy', 'solid', 'blue', 'pink', 'ground', 'beef', 'purple', 'spring', 'lean', 'raw', 'red', 'black', 'white', 'yellow', 'mature', 'tropical', 'meat', 'wild', 'new', 'juice', 'firm', 'sweet', 'fresh', 'light', 'flower', 'green', 'soft', 'skin', 'spice', 'dark', 'herb', 'palate', 'valley', 'finish', 'drink', 'flavor', 'fruit', 'aroma', 'note', 'texture', 'thi', 'acidity']
tfidf_vectorizer = TfidfVectorizer(stop_words=stops)
tfidf_matrix = tfidf_vectorizer.fit_transform(noun_list)

In [15]:
# features holds all the words in the TF-IDF vocabulary, in the same order as the matrix columns
features = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# average TF-IDF weight of each term over all descriptions
weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term':features, 'weights':weights})
weights_df = weights_df.sort_values(by='weights', ascending=False)

In [18]:
# load the foods database
food_db = pd.read_csv('data/modified/8b. AUSNUT 2011-13 AHS Food Nutrient Database.csv')

In [19]:
# process foods: split each comma-separated food name into lowercase parts
test = {word.strip().lower() for name in food_db['Food Name'] for word in name.split(',')}

In [20]:
# pick out the foods from the wine terms, keeping the TF-IDF ranking order
terms = weights_df[weights_df['weights'] > 0.001]
foods = [term for term in terms['term'] if term in test]
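The two steps above (building a food vocabulary from comma-separated names, then keeping only the wine terms found in it) can be sketched with plain Python. The food names, terms and weights below are made up for illustration; only the `0.001` threshold comes from the notebook.

```python
# Hypothetical food names in the comma-separated style of a nutrient database
toy_food_names = ["Apple, red, raw", "Cherry, sweet, raw", "Bread, white"]

# One lowercase vocabulary entry per comma-separated part
toy_food_vocab = {
    part.strip().lower()
    for name in toy_food_names
    for part in name.split(',')
}

# Hypothetical (term, mean TF-IDF weight) pairs, already sorted by weight
toy_terms = [("cherry", 0.030), ("oak", 0.020), ("apple", 0.012), ("bread", 0.0004)]

# Keep terms that pass the weight threshold and match a vocabulary entry exactly
toy_foods = [
    term for term, weight in toy_terms
    if weight > 0.001 and term in toy_food_vocab
]
print(toy_foods)
```

Since the match is an exact string comparison, qualifiers such as `raw`, `red` or `sweet` also end up in the vocabulary. That is probably why words like these had to go into the `stops` list, and why a few non-flavour entries can still slip into `foods`.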

In [22]:
print(foods)

['cherry', 'berry', 'plum', 'apple', 'blackberry', 'vanilla', 'pepper', 'citrus', 'lemon', 'raspberry', 'peach', 'pear', 'chocolate', 'currant', 'licorice', 'lime', 'coffee', 'melon', 'grapefruit', 'honey', 'apricot', 'pineapple', 'strawberry', 'cinnamon', 'almond', 'mocha', 'mint', 'orange', 'jam', 'grape', 'blueberry', 'tea', 'sage', 'pie', 'caramel', 'cranberry', 'raisin', 'olive', 'tomato', 'coconut', 'butter', 'bacon', 'fig', 'mango', 'banana', 'thyme', 'prune', 'mushroom', 'pomegranate', 'butterscotch', 'ginger', 'lychee', 'mousse', 'bread', 'nut', 'truffle', 'yeast', 'jasmine', 'nectarine', 'hazelnut', 'fennel', 'liqueur', 'tart', 'herbs', 'quince', 'dill', 'watermelon', 'heart', 'lamb', 'pork', 'chicken', 'guava', 'seafood', 'maple', 'custard', 'energy', 'soy', 'beer', 'cooking', 'coating']


# Implementing the word2vec model

In [11]:
# prepare the input for the word2vec model
# the NLTK resources must be downloaded once, otherwise this cell raises a LookupError
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data for word_tokenize; newer NLTK releases may ask for 'punkt_tab'
ls = [word_tokenize(remove_stopwords(normalize(x))) for x in wine['description']]


In [24]:
from gensim.models import Word2Vec

## Word2Vec needs a large corpus to train on in order to perform well
## sg defaults to 0, so this trains a CBOW model
cbow_model = Word2Vec(ls, min_count=2, vector_size=150, window=10)  # `size=150` in gensim < 4.0



In [27]:
cbow_model.wv.most_similar('aromas', topn=100)

[('scents', 0.7888871431350708),
 ('notes', 0.5859709978103638),
 ('bouquet', 0.5810145735740662),
 ('whiff', 0.5469571948051453),
 ('fragrances', 0.5263883471488953),
 ('whiffs', 0.5257620215415955),
 ('rubber', 0.4845849275588989),
 ('fragrance', 0.4818279445171356),
 ('clicky', 0.47520989179611206),
 ('tones', 0.46636292338371277),
 ('blast', 0.46254128217697144),
 ('accents', 0.4550802707672119),
 ('smelling', 0.45021432638168335),
 ('aroma', 0.4491744041442871),
 ('nose', 0.4393962323665619),
 ('scent', 0.4302075505256653),
 ('soap', 0.4269176423549652),
 ('tire', 0.42656657099723816),
 ('smells', 0.41807618737220764),
 ('fur', 0.4124946892261505),
 ('resin', 0.40236401557922363),
 ('dust', 0.40189939737319946),
 ('bath', 0.3994298577308655),
 ('hay', 0.3991416096687317),
 ('pressed', 0.39890554547309875),
 ('barlett', 0.39778491854667664),
 ('candle', 0.3935548663139343),
 ('stewed', 0.3927515149116516),
 ('muddled', 0.3906274139881134),
 ('chewing', 0.38984057307243347),
 ('hide

In [26]:
cbow_model.wv.most_similar('pork', topn=50)

[('lamb', 0.8895524740219116),
 ('chops', 0.8595080971717834),
 ('potatoes', 0.8537752628326416),
 ('meats', 0.8528634309768677),
 ('sausage', 0.8490621447563171),
 ('roast', 0.8372448682785034),
 ('stew', 0.8363399505615234),
 ('risotto', 0.8271031975746155),
 ('veal', 0.8163446187973022),
 ('chicken', 0.8158363103866577),
 ('tenderloin', 0.8108752369880676),
 ('beef', 0.8078727722167969),
 ('dish', 0.8061364889144897),
 ('duck', 0.805750846862793),
 ('steaks', 0.8040581941604614),
 ('sausages', 0.7944930195808411),
 ('broiled', 0.7938192486763),
 ('barbecued', 0.7880339026451111),
 ('lasagna', 0.7877833843231201),
 ('braised', 0.779943585395813),
 ('ham', 0.7761930823326111),
 ('sauce', 0.7709249258041382),
 ('ribs', 0.7707759737968445),
 ('oven', 0.7700439691543579),
 ('rib', 0.7694214582443237),
 ('pastas', 0.769004762172699),
 ('venison', 0.7595199346542358),
 ('cheeses', 0.757583737373352),
 ('roasts', 0.757253110408783),
 ('tuna', 0.7561944127082825),
 ('brisket', 0.755761742591
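
The scores in the two `most_similar` outputs above are cosine similarities between word vectors. As a rough illustration of that ranking, here is a standard-library sketch with made-up 3-dimensional vectors (the trained model uses 150 dimensions). `toy_vectors` and `toy_most_similar` are hypothetical names and not part of gensim.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, chosen so that the two meats point the same way
toy_vectors = {
    "pork":   [0.9, 0.1, 0.0],
    "lamb":   [0.8, 0.2, 0.1],
    "cherry": [0.0, 0.9, 0.4],
}

def toy_most_similar(word, topn=2):
    # score every other word against the query word and keep the best topn
    scores = [
        (other, cosine(toy_vectors[word], vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_neighbours = toy_most_similar("pork")
print(toy_neighbours)
```

Words that occur in similar contexts end up with vectors pointing in similar directions, which would explain why `pork` retrieves other meats and dishes while `aromas` retrieves `scents` and `bouquet`.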

# Find wines based on query

In [42]:
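def wine_finder(query):
    # project the query into the same TF-IDF space as the noun lists
    query_rank = tfidf_vectorizer.transform([query])
    sims = cosine_similarity(query_rank, tfidf_matrix).flatten()
    # indices of the 7 most similar descriptions, best match first
    related_docs_indices = sims.argsort()[:-8:-1]
    return related_docs_indices, sims[related_docs_indices]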
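
Here is a small sketch of the same retrieval idea on a toy corpus, mainly to show how the reversed slice works. The three documents, the `toy_` names and the `k` parameter are my own additions. Note that `[:-8:-1]` in `wine_finder` yields seven indices, so a general top-`k` slice is `[:-(k + 1):-1]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions, already reduced to flavour nouns
toy_docs = ["plum vanilla jam", "lemon citrus lime", "plum cherry"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

def toy_finder(query, k=2):
    # reuse the fitted vocabulary so the query and the documents share columns
    query_vec = toy_vectorizer.transform([query])
    sims = cosine_similarity(query_vec, toy_matrix).flatten()
    # argsort is ascending, so a reversed slice over the last k entries gives the top k
    top = sims.argsort()[:-(k + 1):-1]
    return top, sims[top]

toy_indices, toy_scores = toy_finder("plum jam")
print(toy_indices, toy_scores)
```

The first document matches both query words and ranks first, the third matches only `plum`, and the second shares nothing with the query, so its similarity is zero. Query words that are not in the fitted vocabulary are silently ignored by `transform`.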

In [43]:
def wine_results_printer(query):
    print('The top wines that match the query (', query, '):')
    doc_indices, similarities = wine_finder(query)
    for index, doc_index in enumerate(doc_indices):
        print(index, ')', wine['description'][doc_index])
        print()

In [32]:
print('Select the flavours and notes you are looking for in your wine: ')
print(foods)

Select the flavours and notes you are looking for in your wine: 
['cherry', 'berry', 'plum', 'apple', 'blackberry', 'vanilla', 'pepper', 'citrus', 'lemon', 'raspberry', 'peach', 'pear', 'chocolate', 'currant', 'licorice', 'lime', 'coffee', 'melon', 'grapefruit', 'honey', 'apricot', 'pineapple', 'strawberry', 'cinnamon', 'almond', 'mocha', 'mint', 'orange', 'jam', 'grape', 'blueberry', 'tea', 'sage', 'pie', 'caramel', 'cranberry', 'raisin', 'olive', 'tomato', 'coconut', 'butter', 'bacon', 'fig', 'mango', 'banana', 'thyme', 'prune', 'mushroom', 'pomegranate', 'butterscotch', 'ginger', 'lychee', 'mousse', 'bread', 'nut', 'truffle', 'yeast', 'jasmine', 'nectarine', 'hazelnut', 'fennel', 'liqueur', 'tart', 'herbs', 'quince', 'dill', 'watermelon', 'heart', 'lamb', 'pork', 'chicken', 'guava', 'seafood', 'maple', 'custard', 'energy', 'soy', 'beer', 'cooking', 'coating']


In [44]:
query = 'plum vanilla jam'
wine_results_printer(query)

The top wines that match the query ( plum vanilla jam ):
0 ) Blackberry jam aromas lead to a fruity wine, soft and juicy. Fresh and rounded, it has balanced acidity and an open, generous character. Drink now.

1 ) Soft and dull, with blackberry and cherry jam and sweet vanilla flavors. The tannins and acids both are low. Not going anywhere.

2 ) Made entirely from Sangiovese, this opens with aromas of black plum, vanilla and toast. The concentrated palate doles out black cherry jam, cedar and tobacco alongside assertive tannins. Drink through 2019.

3 ) Made entirely from Sangiovese, this opens with aromas of black plum, vanilla and toast. The concentrated palate doles out black cherry jam, cedar and tobacco alongside assertive tannins. Drink through 2019.

4 ) A little heavy and syrupy, but for twelve bucks, you get a creamy wine with pineapple jam and vanilla flavors.

5 ) Plum-jam-flavored wine, this is very sweet, rich and very direct. The freshness is here, but this is more about t

# Print wines based on alternative queries

In [None]:
similarity_dictionary = dict()
for word in query.split():
    # keep the two nearest neighbours of each query word in the word2vec space
    similarity_dictionary[word] = [similar for similar, score in cbow_model.wv.most_similar(word, topn=2)]

In [None]:
similarity_dictionary

In [None]:
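# build alternative queries by swapping one query word at a time for a similar word
alternative_queries = []
for index, word in enumerate(query.split()):
    for similar_word in similarity_dictionary[word]:
        temporary_list = query.split()
        temporary_list[index] = similar_word
        alternative_queries.append(temporary_list)

In [None]:
wine_results_printer(query)
for new_query in alternative_queries:
    # join into a new string instead of overwriting `query`
    wine_results_printer(" ".join(new_query))
    print('=======================')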
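
Since these last cells have no recorded output, here is a self-contained sketch of the substitution step. A hand-written dictionary stands in for `cbow_model.wv.most_similar`, and the neighbour words in it are invented for illustration, not taken from the trained model.

```python
# Hypothetical neighbours standing in for cbow_model.wv.most_similar(word, topn=2)
toy_similar = {
    "plum": ["prune", "cherry"],
    "vanilla": ["caramel", "toast"],
}

def expand_query(query, similar):
    words = query.split()
    alternatives = []
    for index, word in enumerate(words):
        # words without known neighbours are simply left unchanged
        for similar_word in similar.get(word, []):
            swapped = list(words)  # copy so the original query stays intact
            swapped[index] = similar_word
            alternatives.append(" ".join(swapped))
    return alternatives

toy_alternatives = expand_query("plum vanilla", toy_similar)
print(toy_alternatives)
```

Each alternative differs from the original query in exactly one word, so a query of `n` words with two neighbours each gives `2n` alternatives. The `.get(word, [])` fallback is a design choice of this sketch: with the real model, a query word that is missing from the word2vec vocabulary would raise a `KeyError`, so some guard of this kind is probably needed.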