# Python Group Presentation

### Using NLP and Machine Learning to Predict the Price of Wines from Wine Reviews

#### Group Members 
Onno Ho,
Sabrina Lin,
Shaun Ang,
Natalie Rohr,
Shaun Whitmarsh,
Jemma Shin

### Import and Download the NLP Module 

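import nltk
nltk.download() #opens the NLTK downloader; the steps below need a tokenizer model and the stopwords corpus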

### Import Dataset, Clean, and View 

In [1]:
import nltk
import os
import pandas as pd
import numpy as np
import math

#view your current working directory 
print("Current Working Directory: " , os.getcwd())

#import data

data = pd.read_csv('winemag-data-130k-v2.csv', encoding="utf-8") #imports dataset as 'data'; the file is UTF-8, and reading it as ISO-8859-1 garbles accented characters
data.rename(columns={'Unnamed: 0':'Index Number'}, inplace=True) #renames first column
data.head() #views top few rows of the dataframe


Current Working Directory:  C:\Users\user\Downloads


Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
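
The review data contains accented names, so the `encoding` passed to `read_csv` matters. As a minimal sketch (the sample string is only an illustration), this is what happens when UTF-8 bytes are decoded as ISO-8859-1, and how the damage can be reversed as long as no bytes were lost:

```python
# Illustrative sample value; the byte-level behaviour is standard Python.
original = "Vulkà Bianco"

# Text stored as UTF-8 but decoded with the wrong codec becomes garbled.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reversing the two steps recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

Reading the file with the correct encoding in the first place avoids the need for any repair step.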


### Lower case description

In [2]:
data['description'] = data['description'].str.lower()

### Tokenize Descriptive Text

In [3]:
from nltk.tokenize import word_tokenize

data['description_tokenized'] = data['description'].apply(word_tokenize)
print(data['description_tokenized'])

0         [aromas, include, tropical, fruit, ,, broom, ,...
1         [this, is, ripe, and, fruity, ,, a, wine, that...
2         [tart, and, snappy, ,, the, flavors, of, lime,...
3         [pineapple, rind, ,, lemon, pith, and, orange,...
4         [much, like, the, regular, bottling, from, 201...
5         [blackberry, and, raspberry, aromas, show, a, ...
6         [here, 's, a, bright, ,, informal, red, that, ...
7         [this, dry, and, restrained, wine, offers, spi...
8         [savory, dried, thyme, notes, accent, sunnier,...
9         [this, has, great, depth, of, flavor, with, it...
10        [soft, ,, supple, plum, envelopes, an, oaky, s...
11        [this, is, a, dry, wine, ,, very, spicy, ,, wi...
12        [slightly, reduced, ,, this, wine, offers, a, ...
13        [this, is, dominated, by, oak, and, oak-driven...
14        [building, on, 150, years, and, six, generatio...
15        [zesty, orange, peels, and, apple, notes, abou...
16        [baked, plum, ,, molasses, ,, 

### Remove Stopwords

In [4]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

data['filtered_description'] = data['description_tokenized'].apply(lambda x: [item for item in x if item not in stop_words])
print(data['filtered_description'])

0         [aromas, include, tropical, fruit, ,, broom, ,...
1         [ripe, fruity, ,, wine, smooth, still, structu...
2         [tart, snappy, ,, flavors, lime, flesh, rind, ...
3         [pineapple, rind, ,, lemon, pith, orange, blos...
4         [much, like, regular, bottling, 2012, ,, comes...
5         [blackberry, raspberry, aromas, show, typical,...
6         ['s, bright, ,, informal, red, opens, aromas, ...
7         [dry, restrained, wine, offers, spice, profusi...
8         [savory, dried, thyme, notes, accent, sunnier,...
9         [great, depth, flavor, fresh, apple, pear, fru...
10        [soft, ,, supple, plum, envelopes, oaky, struc...
11        [dry, wine, ,, spicy, ,, tight, ,, taut, textu...
12        [slightly, reduced, ,, wine, offers, chalky, ,...
13        [dominated, oak, oak-driven, aromas, include, ...
14        [building, 150, years, six, generations, winem...
15        [zesty, orange, peels, apple, notes, abound, s...
16        [baked, plum, ,, molasses, ,, 
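
Punctuation tokens such as `,` and `.` are not in the English stopword list, so they survive the filter above. One possible follow-up, sketched here with a hypothetical helper `drop_punctuation` and a made-up token list, keeps only tokens that contain at least one letter or digit:

```python
def drop_punctuation(tokens):
    """Keep tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Made-up token list in the style of the filtered descriptions.
tokens = ["ripe", "fruity", ",", "wine", "smooth", "oak-driven", "'s", "."]
cleaned = drop_punctuation(tokens)
print(cleaned)
```

Hyphenated words such as `oak-driven` are kept because they contain letters. Under these assumptions the helper could be applied to the column with `data['filtered_description'].apply(drop_punctuation)`.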

### Stem Words

In [5]:
from nltk.stem import SnowballStemmer

ps = SnowballStemmer("english")

data['stemmed_description'] = data['filtered_description'].apply(lambda x: [ps.stem(y) for y in x])
print(data['stemmed_description'])

0         [aroma, includ, tropic, fruit, ,, broom, ,, br...
1         [ripe, fruiti, ,, wine, smooth, still, structu...
2         [tart, snappi, ,, flavor, lime, flesh, rind, d...
3         [pineappl, rind, ,, lemon, pith, orang, blosso...
4         [much, like, regular, bottl, 2012, ,, come, ac...
5         [blackberri, raspberri, aroma, show, typic, na...
6         ['s, bright, ,, inform, red, open, aroma, cand...
7         [dri, restrain, wine, offer, spice, profus, .,...
8         [savori, dri, thyme, note, accent, sunnier, fl...
9         [great, depth, flavor, fresh, appl, pear, frui...
10        [soft, ,, suppl, plum, envelop, oaki, structur...
11        [dri, wine, ,, spici, ,, tight, ,, taut, textu...
12        [slight, reduc, ,, wine, offer, chalki, ,, tan...
13        [domin, oak, oak-driven, aroma, includ, roast,...
14        [build, 150, year, six, generat, winemak, trad...
15        [zesti, orang, peel, appl, note, abound, sprig...
16        [bake, plum, ,, molass, ,, bal

### Add back cleaned description to the dataset, remove working columns

In [6]:
data['description_clean']= data['stemmed_description']

data = data.drop(columns=['description_tokenized', 'filtered_description','stemmed_description'])

data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br..."
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu..."
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d..."
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso..."
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac..."


### Data Summary Stats

In [7]:
print("There are {} observations and {} features in this dataset. \n".format(data.shape[0],data.shape[1]))

print("There are {} types of wine in this dataset such as {}... \n".format(len(data.variety.unique()),
                                                                           ", ".join(data.variety.unique()[0:7])))

print("There are {} countries producing wine in this dataset such as {}... \n".format(len(data.country.unique()),
                                                                                      ", ".join(data.country.unique()[0:7])))

There are 129971 observations and 15 features in this dataset. 

There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir, Tempranillo-Merlot, Frappato... 

There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France, Germany, Argentina... 



### Extract the Wine Vintage from the Name ('title') of the Wine

In [8]:
from dateutil.parser import parse

n_rows = len(data.index)

vintage = []

for x in range(n_rows):
    try:
        vintage.append(parse(data.at[x,'title'], fuzzy=True).year)
    except (ValueError, OverflowError):
        #no year could be parsed from the title: fall back to 2010, used here as a stand-in for the mean vintage of the dataset,
        #so that the vintage constraint in the optimization does not treat the wine as coming from year 0
        vintage.append(2010)

data['vintage'] = pd.Series(vintage)
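
Fuzzy date parsing may interpret any number in a title as part of a date, so some vintages can come out wrong. A narrower alternative, sketched here as an assumption rather than the method used in this notebook, is to look only for a four-digit year between 1900 and 2099. The helper name `extract_vintage` and the pattern are hypothetical:

```python
import re

# Matches a standalone four-digit year from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")

def extract_vintage(title, default=None):
    """Return the first plausible year in the title, or `default` if none is found."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else default

print(extract_vintage("Nicosia 2013 Vulkà Bianco (Etna)"))
print(extract_vintage("Quinta dos Avidagos 2011 Avidagos Red (Douro)"))
print(extract_vintage("Example Winery NV Brut", default=2010))  # made-up title without a year
```

With this helper the column could be built as `data['title'].apply(extract_vintage, default=2010)`, keeping the same fallback year as the loop above.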



In [9]:
data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean,vintage
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br...",2013
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu...",2011
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d...",2013
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso...",2013
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac...",2012


### Clean up wine variety into 6 common wines and others

In [10]:
n_rows = len(data.index)

variety_cleaned = []

for x in range(n_rows):
    y = data.at[x,'variety']
    if 'Chardonnay' == y:
        z = 'Chardonnay'
    elif 'Sauvignon Blanc' == y:
        z = 'Sauvignon Blanc'
    elif 'Riesling' == y:
        z = 'Riesling'
    elif 'Cabernet Sauvignon' == y:
        z = 'Cabernet Sauvignon'
    elif 'Merlot' == y:
        z = 'Merlot'
    elif 'Pinot Noir' == y:
        z = 'Pinot Noir'
    else:
        z = 'Others'
    variety_cleaned.append(z)

data['variety_cleaned'] = pd.Series(variety_cleaned)
data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean,vintage,variety_cleaned
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br...",2013,Others
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu...",2011,Others
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d...",2013,Others
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso...",2013,Riesling
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac...",2012,Pinot Noir
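
The chain of `if`/`elif` comparisons above can also be written as a membership test. This is a minimal sketch of the same grouping; the names `TOP_VARIETIES` and `clean_variety` are hypothetical:

```python
# The six varieties kept as their own category; everything else becomes 'Others'.
TOP_VARIETIES = {
    "Chardonnay", "Sauvignon Blanc", "Riesling",
    "Cabernet Sauvignon", "Merlot", "Pinot Noir",
}

def clean_variety(variety):
    """Return the variety if it is one of the six common ones, else 'Others'."""
    return variety if variety in TOP_VARIETIES else "Others"

for name in ["White Blend", "Riesling", "Pinot Noir", float("nan")]:
    print(name, "->", clean_variety(name))
```

Missing values fall into `'Others'`, as they do in the loop. Under these assumptions the whole column could be produced with `data['variety'].apply(clean_variety)`.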


### Create Variety Code (For building a constraint in the optimization model) 

In [11]:
variety_code_cleaned = []

for x in range(n_rows):
    y = data.at[x,'variety']
    if 'Chardonnay' == y:
        z = 1
    elif 'Sauvignon Blanc' == y:
        z = 10
    elif 'Riesling' == y:
        z = 100
    elif 'Cabernet Sauvignon' == y:
        z = 1000
    elif 'Merlot' == y:
        z = 10000
    elif 'Pinot Noir' == y:
        z = 100000
    else:
        z = 0
    variety_code_cleaned.append(z)

data['variety_code_cleaned'] = pd.Series(variety_code_cleaned)
data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean,vintage,variety_cleaned,variety_code_cleaned
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br...",2013,Others,0
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu...",2011,Others,0
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d...",2013,Others,0
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso...",2013,Riesling,100
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac...",2012,Pinot Noir,100000
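
The power-of-ten codes are meant for the optimization model, which requires the codes of the 12 selected bottles to sum to 111111. The sketch below checks by enumeration that, with at most 12 bottles, only one bottle of each coded variety reaches that sum. The helper `count_vectors` is hypothetical and written only for this check:

```python
def count_vectors(n_codes, max_total):
    """Yield every tuple of n_codes non-negative counts whose sum is at most max_total."""
    if n_codes == 0:
        yield ()
        return
    for first in range(max_total + 1):
        for rest in count_vectors(n_codes - 1, max_total - first):
            yield (first,) + rest

def matching_counts(codes, target, box_size):
    """Return all bottle counts per code whose weighted sum equals the target."""
    return [
        counts for counts in count_vectors(len(codes), box_size)
        if sum(c * code for c, code in zip(counts, codes)) == target
    ]

codes = [1, 10, 100, 1000, 10000, 100000]  # Chardonnay ... Pinot Noir
solutions = matching_counts(codes, target=111111, box_size=12)
print(solutions)
```

The encoding depends on the box size: with 15 or more bottles, eleven Chardonnays and no Sauvignon Blanc would produce the same sum, so a larger box would need a different formulation, such as one constraint per variety.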


### Create a country point system (for another constraint) 

In [12]:
#Italy - 10000
#Spain - 10000
#France - 10000
#United States - 15000 (wine is for US customers, thus US wines should have a heavy penalty)
#China - 5000
#Argentina - 5000
#Chile - 5000
#Australia - 5000
#South Africa - 5000
#Germany - 5000
#Portugal - 2500
#Romania - 2500
#Note: the loop below does not apply these weights; it assigns a power-of-ten code to six countries, mirroring the variety code

country_code_cleaned = []

for x in range(n_rows):
    y = data.at[x,'country']
    if 'Italy' == y:
        z = 1
    elif 'Spain' == y:
        z = 10
    elif 'France' == y:
        z = 100
    elif 'US' == y:
        z = 1000
    elif 'Argentina' == y:
        z = 10000
    elif 'Chile' == y:
        z = 100000
    else:
        z = 0
    country_code_cleaned.append(z)

data['country_code_cleaned'] = pd.Series(country_code_cleaned)
data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean,vintage,variety_cleaned,variety_code_cleaned,country_code_cleaned
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br...",2013,Others,0,1
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu...",2011,Others,0,0
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d...",2013,Others,0,1000
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso...",2013,Riesling,100,1000
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac...",2012,Pinot Noir,100000,1000


In [13]:
np.count_nonzero(data['variety_cleaned'] == 'Chardonnay')
#11753 Chardonnay

11753

In [14]:
np.count_nonzero(data['variety_cleaned'] == 'Sauvignon Blanc')
#4967 Sauvignon Blanc

4967

In [15]:
np.count_nonzero(data['variety_cleaned'] == 'Riesling')
#5189 Riesling

5189

In [16]:
np.count_nonzero(data['variety_cleaned'] == 'Cabernet Sauvignon')
#9472 Cabernet Sauvignon

9472

In [17]:
np.count_nonzero(data['variety_cleaned'] == 'Merlot')
#3102 Merlot

3102

In [18]:
np.count_nonzero(data['variety_cleaned'] == 'Pinot Noir')
#13272 Pinot Noir

13272

In [19]:
np.count_nonzero(data['variety_cleaned'] == 'Others')
#82216 Others

82216
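
The seven separate counts above can also be obtained in one call with `Series.value_counts()`. A minimal sketch on a made-up stand-in for `data['variety_cleaned']`:

```python
import pandas as pd

# Made-up miniature stand-in for data['variety_cleaned'].
sample = pd.Series(["Others", "Others", "Others", "Riesling", "Pinot Noir"])

# One row per category, sorted from most to least frequent.
counts = sample.value_counts()
print(counts)
```

On the real column, `data['variety_cleaned'].value_counts()` would list all seven categories at once.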

### Remove NA Values in Price 

In [20]:
data = data.dropna(subset=['price'])

### Optimization

In [None]:
from gurobipy import *

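#look up the columns by name rather than by position, so the mapping does not depend on column order
cols = ['title', 'country_code_cleaned', 'points', 'price', 'variety_cleaned', 'vintage', 'variety_code_cleaned']

title, country, points, price, variety, vintage, code = multidict({row[0]: list(row[1:]) for row in data[cols].values})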

b = 12 #number of bottles

mean_points = data['points'].mean() #mean points

m = Model('wine_box')

x= m.addVars(title,vtype=GRB.BINARY,lb=0, name = 'wines') #binary, instead of integer, sets the constraint that every bottle is unique

m.setObjective(quicksum(price[i]*x[i] for i in title), GRB.MINIMIZE)

#1) There are 12 bottles 

m.addConstr(quicksum(x[i] for i in title) == b)

#2) First 6 bottles will be the 6 most common varieties. Chardonnay, Sauvignon Blanc, Riesling, Cabernet Sauvignon, Merlot, Pinot Noir. The rest are unique wines.

m.addConstr(quicksum(x[i]*code[i] for i in title) == 111111) #each variety has a power-of-ten code (others = 0, Chardonnay = 1, Sauvignon Blanc = 10, Riesling = 100, ...)
                                                            #with only 12 bottles, the sum can equal 111111 only if exactly one of each of the 6 most common varieties is picked

#3) Wines should come from a variety of countries and not just from the typical top wine-exporting ones (e.g. France, USA, Italy): one bottle each from the six coded countries, the other six from elsewhere

m.addConstr(quicksum(x[i]*country[i] for i in title) == 111111)

#4) Wines should not all be new: the average vintage should be 2010 or earlier

m.addConstr(quicksum(x[i]*vintage[i] for i in title) <= (2010*b))

#5) The 12 bottles should have an average score of at least 88.44 points (which is the mean)

m.addConstr(quicksum(x[i]*points[i] for i in title) >= (mean_points*b))

m.optimize()

# print the selected wines only (binary variables set to 1)
for v in m.getVars():
    if v.x > 0.5:
        print('%s %g' % (v.varName, v.x))
    
#print optimal value
print('Obj: %g' % m.objVal)

Optimize a model with 5 rows, 110638 columns and 469174 nonzeros
Variable types: 0 continuous, 110638 integer (110638 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+05]
  Objective range  [4e+00, 3e+03]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+01, 1e+05]
Presolve removed 0 rows and 72013 columns
Presolve time: 1.03s
Presolved: 5 rows, 38625 columns, 166915 nonzeros
Variable types: 0 continuous, 38625 integer (25955 binary)

Root relaxation: objective 7.621476e+01, 15 iterations, 0.07 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time
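
The structure of the model can be illustrated on a toy problem small enough to solve by brute force. This is a sketch under stated assumptions, not the Gurobi model: the six wines, the box size of 3, and the thresholds are all made up, the country constraint is left out, and exhaustive search only works for tiny inputs.

```python
from itertools import combinations

# Made-up wines: (title, price, points, vintage, variety_code)
wines = [
    ("A Chardonnay", 20.0, 90, 2008, 1),
    ("B Chardonnay", 12.0, 86, 2012, 1),
    ("C Sauvignon Blanc", 15.0, 88, 2011, 10),
    ("D Sauvignon Blanc", 30.0, 92, 2007, 10),
    ("E Other", 10.0, 85, 2013, 0),
    ("F Other", 18.0, 91, 2009, 0),
]

BOX_SIZE = 3            # bottles per box
TARGET_CODE = 11        # one Chardonnay (1) + one Sauvignon Blanc (10) + one other (0)
MAX_AVG_VINTAGE = 2010  # average vintage must be 2010 or earlier
MIN_AVG_POINTS = 88     # average score must be at least 88

def is_feasible(box):
    """Check the variety-code, vintage, and points constraints for one box."""
    return (
        sum(w[4] for w in box) == TARGET_CODE
        and sum(w[3] for w in box) <= MAX_AVG_VINTAGE * BOX_SIZE
        and sum(w[2] for w in box) >= MIN_AVG_POINTS * BOX_SIZE
    )

# Enumerate every box of distinct wines and keep the feasible ones.
feasible = [box for box in combinations(wines, BOX_SIZE) if is_feasible(box)]

# Objective: minimize the total price of the box.
best = min(feasible, key=lambda box: sum(w[1] for w in box))
best_titles = [w[0] for w in best]
best_price = sum(w[1] for w in best)
print(best_titles, best_price)
```

The cheapest combination overall fails the points constraint, so the selected box costs more than the three cheapest eligible bottles. That trade-off between price and the quality and vintage constraints is what the full model resolves over the whole dataset.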



### Use below window to test out code

In [None]:
data.loc[data['title'] == "Cramele Recas 2011 UnWineD Pinot Grigio (Viile Timisului)"] 