# Python Group Presentation

### Using NLP and Machine Learning to predict price of wines using wine reviews

#### Group Members 
Onno Ho,
Sabrina Lin,
Shaun Ang,
Natalie Rohr,
Shaun Whitmarsh,
Jemma Shin

### Import and Download the NLP Module 

import nltk

nltk.download()

### Import Dataset, Clean, and View 

In [1]:
import nltk
import os
import pandas as pd
import numpy as np
import math as math

#view your current working directory 
print("Current Working Directory: " , os.getcwd())

#import data

data = pd.read_csv('winemag-data-130k-v2.csv') #imports dataset as 'data'
data.rename(columns={'Unnamed: 0':'Index Number'}, inplace=True) #renames first column
data.head() #views top few rows of the dataframe


Current Working Directory:  C:\Users\user\Dropbox


Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


### Lower case description

In [17]:
data['description'] = data['description'].str.lower()

### Tokenize Descriptive Text

In [18]:
from nltk.tokenize import sent_tokenize, word_tokenize

data['description_tokenized'] = data['description'].apply(nltk.word_tokenize)
print(data['description_tokenized'])

0         [aromas, include, tropical, fruit, ,, broom, ,...
1         [this, is, ripe, and, fruity, ,, a, wine, that...
2         [tart, and, snappy, ,, the, flavors, of, lime,...
3         [pineapple, rind, ,, lemon, pith, and, orange,...
4         [much, like, the, regular, bottling, from, 201...
5         [blackberry, and, raspberry, aromas, show, a, ...
6         [here, 's, a, bright, ,, informal, red, that, ...
7         [this, dry, and, restrained, wine, offers, spi...
8         [savory, dried, thyme, notes, accent, sunnier,...
9         [this, has, great, depth, of, flavor, with, it...
10        [soft, ,, supple, plum, envelopes, an, oaky, s...
11        [this, is, a, dry, wine, ,, very, spicy, ,, wi...
12        [slightly, reduced, ,, this, wine, offers, a, ...
13        [this, is, dominated, by, oak, and, oak-driven...
14        [building, on, 150, years, and, six, generatio...
15        [zesty, orange, peels, and, apple, notes, abou...
16        [baked, plum, ,, molasses, ,, 

### Remove Stopwords

In [24]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

data['filtered_description'] = data['description_tokenized'].apply(lambda x: [item for item in x if item not in stop_words])
print(data['filtered_description'])

0         [aromas, include, tropical, fruit, ,, broom, ,...
1         [ripe, fruity, ,, wine, smooth, still, structu...
2         [tart, snappy, ,, flavors, lime, flesh, rind, ...
3         [pineapple, rind, ,, lemon, pith, orange, blos...
4         [much, like, regular, bottling, 2012, ,, comes...
5         [blackberry, raspberry, aromas, show, typical,...
6         ['s, bright, ,, informal, red, opens, aromas, ...
7         [dry, restrained, wine, offers, spice, profusi...
8         [savory, dried, thyme, notes, accent, sunnier,...
9         [great, depth, flavor, fresh, apple, pear, fru...
10        [soft, ,, supple, plum, envelopes, oaky, struc...
11        [dry, wine, ,, spicy, ,, tight, ,, taut, textu...
12        [slightly, reduced, ,, wine, offers, chalky, ,...
13        [dominated, oak, oak-driven, aromas, include, ...
14        [building, 150, years, six, generations, winem...
15        [zesty, orange, peels, apple, notes, abound, s...
16        [baked, plum, ,, molasses, ,, 

### Stem Words

In [25]:
from nltk.stem import SnowballStemmer

ps = SnowballStemmer("english")

data['stemmed_description'] = data['filtered_description'].apply(lambda x: [ps.stem(y) for y in x])
print(data['stemmed_description'])

0         [aroma, includ, tropic, fruit, ,, broom, ,, br...
1         [ripe, fruiti, ,, wine, smooth, still, structu...
2         [tart, snappi, ,, flavor, lime, flesh, rind, d...
3         [pineappl, rind, ,, lemon, pith, orang, blosso...
4         [much, like, regular, bottl, 2012, ,, come, ac...
5         [blackberri, raspberri, aroma, show, typic, na...
6         ['s, bright, ,, inform, red, open, aroma, cand...
7         [dri, restrain, wine, offer, spice, profus, .,...
8         [savori, dri, thyme, note, accent, sunnier, fl...
9         [great, depth, flavor, fresh, appl, pear, frui...
10        [soft, ,, suppl, plum, envelop, oaki, structur...
11        [dri, wine, ,, spici, ,, tight, ,, taut, textu...
12        [slight, reduc, ,, wine, offer, chalki, ,, tan...
13        [domin, oak, oak-driven, aroma, includ, roast,...
14        [build, 150, year, six, generat, winemak, trad...
15        [zesti, orang, peel, appl, note, abound, sprig...
16        [bake, plum, ,, molass, ,, bal

### Add back cleaned description to the dataset, remove working columns

In [29]:
data['description_clean']= data['stemmed_description']

data = data.drop(columns=['description_tokenized', 'filtered_description','stemmed_description','lower_case'])

data.head()

Unnamed: 0,Index Number,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_clean
0,0,Italy,"aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,"[aroma, includ, tropic, fruit, ,, broom, ,, br..."
1,1,Portugal,"this is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,"[ripe, fruiti, ,, wine, smooth, still, structu..."
2,2,US,"tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,"[tart, snappi, ,, flavor, lime, flesh, rind, d..."
3,3,US,"pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,"[pineappl, rind, ,, lemon, pith, orang, blosso..."
4,4,US,"much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,"[much, like, regular, bottl, 2012, ,, come, ac..."


### Data Summary Stats

In [6]:
print("There are {} observations and {} features in this dataset. \n".format(data.shape[0],data.shape[1]))

print("There are {} types of wine in this dataset such as {}... \n".format(len(data.variety.unique()),
                                                                           ", ".join(data.variety.unique()[0:7])))

print("There are {} countries producing wine in this dataset such as {}... \n".format(len(data.country.unique()),
                                                                                      ", ".join(data.country.unique()[0:7])))

There are 129971 observations and 18 features in this dataset. 

There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir, Tempranillo-Merlot, Frappato... 

There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France, Germany, Argentina... 

