# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_:

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the ones that are really numbers into numeric columns, and extract the others from the strings that contain them.

In [5]:
import re

import fuzzy_pandas as fpd
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from nltk.stem.porter import PorterStemmer
from statsmodels.sandbox.regression.predstd import wls_prediction_std

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

In [6]:
df = pd.read_csv("wine-reviews.csv")

In [7]:
# Map every spelling of a bottle size found in the data to millilitres
r = {
    '187 ml': 187,
    '375 ml': 375, '375ML': 375,
    '500 ml': 500, '500ML': 500,
    '750 ml': 750, '750ML': 750,
    '1 L': 1000, '1L': 1000,
    '1.5 L': 1500, '1.5L': 1500,
    '3 L': 3000, '3L': 3000,
}

df['bottle_size_ml'] = df['bottle size'].replace(r)
df.bottle_size_ml.value_counts()

750     21662
375       244
500        99
3000       17
1500       16
1000       15
187         2
Name: bottle_size_ml, dtype: int64
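
The lookup dictionary above needs one entry per spelling, so a size that was never listed would slip through unconverted. As an alternative, here is a minimal sketch of a regex-based parser. The function name `parse_bottle_size_ml` and the sample strings are assumptions made for illustration, not part of the original notebook.

```python
import re

# Multipliers to convert each recognised unit to millilitres.
UNIT_TO_ML = {"ml": 1, "l": 1000}


def parse_bottle_size_ml(text):
    """Return the size in millilitres, or None if the string is not recognised."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(ml|l)\s*", str(text), flags=re.IGNORECASE
    )
    if match is None:
        return None
    amount, unit = match.groups()
    return int(round(float(amount) * UNIT_TO_ML[unit.lower()]))


# Sample strings (made up, but in the same style as the dataset's values).
sizes = ["750 ml", "750ML", "1.5 L", "3L", "187 ml", "magnum"]
parsed = [parse_bottle_size_ml(s) for s in sizes]
print(parsed)
```

On the real data this could be applied with something like `df['bottle size'].map(parse_bottle_size_ml)`; anything that comes back as `None` is worth inspecting by hand.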

In [8]:
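# Keep the part before ", Buy Now", drop the literal dollar sign, and treat 'N/A' as missing
df['price_clean'] = (
    df.price.str.split(',', expand=True)[0]
    .str.replace('$', '', regex=False)
    .replace('N/A', np.nan)
    .astype(float)
)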

In [9]:
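# Pull out the leading digits as a number; "Not rated yet ..." has none and becomes NaN
df['avg_rating_clean'] = df['user avg rating'].str.extract(r'^(\d+)', expand=False).astype(float)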

In [10]:
df['user avg rating'].value_counts()

Not rated yet [Add Your Review]    22047
90 [Add Your Review]                   3
89 [Add Your Review]                   2
98 [Add Your Review]                   1
83 [Add Your Review]                   1
95 [Add Your Review]                   1
Name: user avg rating, dtype: int64

In [11]:
df['alcohol_clean'] = df.alcohol.str.replace('%', '').astype(float)

In [12]:
df['year'] = df.wine_name.str.extract(r'.*?(\d\d\d\d)').astype(float)

In [13]:
df.wine_desc.value_counts()

86-88 This could work as a rich wine, because there is good structure and piles of botrytis. It could be delicious, with its lovely dry finish, but that's for the future.                                                                                                                                                                                                                     3
It is entirely possible that this wine was simply released way too soon and thus tasted too early, but as is tannic and thin, a showcase for warm-weather Grenache without the variety's classic fruitiness and balance.                                                                                                                                                                       2
Long a producer of high quality and good value, Navarro seems to be raising the bar higher on its Pinot Noir. A vivid, potent and concentrated fruit component lights up this full-bodied wine from the first whiff through the finish

In [14]:
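# The last word of the appellation string is the country (raw string avoids an invalid-escape warning)
df['appellation_clean'] = df.appellation.str.extract(r'(\w*.)$').astype(str)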

In [15]:
df.variety.value_counts()

Pinot Noir                                         2496
Chardonnay                                         2064
Cabernet Sauvignon                                 1909
Red Blends, Red Blends                             1255
Bordeaux-style Red Blend                           1063
Syrah                                              1005
Riesling                                            871
Sauvignon Blanc                                     782
Merlot                                              660
Rosé                                                550
Zinfandel                                           512
Malbec                                              437
Champagne Blend, Sparkling                          389
Portuguese Red                                      372
Tempranillo                                         365
White Blend                                         352
Sparkling Blend, Sparkling                          348
Nebbiolo                                        

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/).
* Once upon a time in 1976, [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)), and France got very angry about it.
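
The Parkerization idea, for instance, can be poked at with a plain `groupby` before any model is involved. The sketch below uses a handful of made-up rows. The column names `year`, `category` and `alcohol_clean` follow the cleaned dataframe, but the values (including the `"Red"`/`"White"` labels) are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows standing in for the cleaned wine dataframe.
toy = pd.DataFrame({
    "year": [1995, 1995, 2005, 2005, 2012, 2012],
    "category": ["Red", "White", "Red", "Red", "Red", "White"],
    "alcohol_clean": [13.0, 12.5, 14.0, 14.5, 14.5, 13.0],
})

# Average alcohol percentage of red wines for each vintage year.
reds = toy[toy["category"] == "Red"]
alcohol_by_year = reds.groupby("year")["alcohol_clean"].mean()
print(alcohol_by_year)
```

If the averages drift upward over the years on the real data, that is a hint worth following up. It is not proof, since the mix of regions and grapes reviewed also changes over time.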

In [17]:
df.wine_points.value_counts()

87.0     2376
90.0     2325
92.0     2023
88.0     1879
91.0     1846
93.0     1825
86.0     1747
84.0     1443
94.0     1300
89.0     1179
85.0     1003
83.0      942
82.0      680
95.0      638
81.0      257
96.0      231
80.0      151
97.0      136
98.0       46
99.0       17
100.0      11
Name: wine_points, dtype: int64

## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - How does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc.
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - Race and car speed for whether you get a warning vs a ticket
    - Wet/dry pavement and car weight for whether you survive a car crash

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.
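
To make the free idea concrete, here is a minimal sketch of a red-vs-white classifier built from word counts. The six descriptions and their labels are invented for illustration. On the real data you would presumably use `df.wine_desc` as the text and a red/white label derived from the `category` column, and hold out a test set with `train_test_split`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented descriptions and labels, standing in for the real reviews.
descriptions = [
    "black cherry and firm tannins",
    "dark plum, leather and tannins",
    "blackberry, cherry and oak",
    "crisp citrus and green apple",
    "lemon, grapefruit and fresh acidity",
    "apple, citrus and honey",
]
labels = ["red", "red", "red", "white", "white", "white"]

# Count the words in each description, then fit a logistic regression on the counts.
model = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
model.fit(descriptions, labels)

predictions = model.predict(["cherry and tannins", "citrus and apple"])
print(predictions)
```

With only six training sentences this proves nothing about wine. It just shows the shape of the pipeline: text in, word counts in the middle, a category out.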

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

In [18]:
# Drop everything after the first comma, e.g. "Auxerrois, Other White" -> "Auxerrois"
df['grapes'] = df.variety.str.replace(r'(,.*)$', '', regex=True).astype(str)

In [19]:
df = df[~df['grapes'].str.contains('Blend')]

In [20]:
df = df[~df['grapes'].str.contains('Portuguese')]

In [21]:
df.columns

Index(['url', 'wine_points', 'wine_name', 'wine_desc', 'taster', 'price',
       'designation', 'variety', 'appellation', 'winery', 'alcohol',
       'bottle size', 'category', 'importer', 'date published',
       'user avg rating', 'bottle_size_ml', 'price_clean', 'avg_rating_clean',
       'alcohol_clean', 'year', 'appellation_clean', 'grapes'],
      dtype='object')

In [22]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,...,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],750,25.0,,14.5,2011.0,Spain,Tempranillo
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.5,2012.0,US,Chardonnay
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,...,,12/1/2014,Not rated yet [Add Your Review],750,25.0,,13.5,2013.0,US,Auxerrois
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,...,,12/1/2014,Not rated yet [Add Your Review],750,65.0,,13.0,2011.0,US,Pinot Noir
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,...,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],750,17.0,,13.0,2013.0,Spain,Albariño


In [24]:
df.grapes.value_counts().head()

Pinot Noir            2496
Chardonnay            2064
Cabernet Sauvignon    1909
Syrah                 1005
Riesling               871
Name: grapes, dtype: int64

In [25]:
df.sort_values(by='grapes')

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,...,importer,date published,user avg rating,bottle_size_ml,price_clean,avg_rating_clean,alcohol_clean,year,appellation_clean,grapes
5520,https://www.winemag.com/buying-guide/papaioann...,88.0,Papaioannou 2003 Old Vines Single Vineyard Agi...,This single-vineyard Agiorgitiko has an earthy...,Susan Kostrzewa,"$26, Buy Now",Old Vines Single Vineyard,"Agiorgitiko, Greek Red","Nemea, Greece",Papaioannou,...,"Fantis Imports, Inc",8/1/2008,Not rated yet [Add Your Review],750,26.0,,13.5,2003.0,Greece,Agiorgitiko
16402,https://www.winemag.com/buying-guide/harlaftis...,81.0,Harlaftis 2001 Argilos Agiorgitiko (Corinth),,Joe Czerwinski,"$12, Buy Now",Argilos,"Agiorgitiko, Greek Red","Corinth, Greece",Harlaftis,...,Athenee Importers,9/1/2004,Not rated yet [Add Your Review],750,12.0,,12.0,2001.0,Greece,Agiorgitiko
7228,https://www.winemag.com/buying-guide/estate-co...,83.0,Estate Constantin Gofas 2005 Mythic River Agio...,This rustic red starts with a nose of red berr...,Susan Kostrzewa,"$13, Buy Now",Mythic River,"Agiorgitiko, Greek Red","Nemea, Greece",Estate Constantin Gofas,...,Athena Importing Co,8/1/2009,Not rated yet [Add Your Review],750,13.0,,13.5,2005.0,Greece,Agiorgitiko
6472,https://www.winemag.com/buying-guide/harlaftis...,84.0,Harlaftis 2007 Argilos Agiorgitiko (Nemea),This refreshing red starts with aromas of alls...,Susan Kostrzewa,"$14, Buy Now",Argilos,"Agiorgitiko, Greek Red","Nemea, Greece",Harlaftis,...,Athenee Importers,4/1/2010,Not rated yet [Add Your Review],750,14.0,,13.0,2007.0,Greece,Agiorgitiko
4086,https://www.winemag.com/buying-guide/tsantali-...,87.0,Tsantali 2003 Réserve Agiorgitiko (Nemea),"Agiorgitiko, famed in Nemea and popular for it...",Susan Kostrzewa,"$16, Buy Now",Réserve,"Agiorgitiko, Greek Red","Nemea, Greece",Tsantali,...,"Fantis Imports, Inc",8/1/2008,Not rated yet [Add Your Review],750,16.0,,13.5,2003.0,Greece,Agiorgitiko
16404,https://www.winemag.com/buying-guide/palivou-2...,81.0,Palivou 2002 St. George (rose) Agiorgitiko (Co...,,Joe Czerwinski,"$13, Buy Now",St. George (rose),"Agiorgitiko, Greek Red","Corinth, Greece",Palivou,...,"Stellar Importing Company, LLC",9/1/2004,Not rated yet [Add Your Review],750,13.0,,13.0,2002.0,Greece,Agiorgitiko
13554,https://www.winemag.com/buying-guide/achaia-cl...,83.0,Achaia Clauss 2000 Agiorgitiko (Corinth),,Joe Czerwinski,"$10, Buy Now",,"Agiorgitiko, Greek Red","Corinth, Greece",Achaia Clauss,...,"Stellar Importing Company, LLC",9/1/2004,Not rated yet [Add Your Review],750,10.0,,12.0,2000.0,Greece,Agiorgitiko
21855,https://www.winemag.com/buying-guide/skouras-2...,87.0,Skouras 2004 Grande Cuvée Agiorgitiko (Nemea),The nose first offers suede and leather before...,,"$25, Buy Now",Grande Cuvée,"Agiorgitiko, Greek Red","Nemea, Greece",Skouras,...,Diamond Importers Inc,10/1/2006,Not rated yet [Add Your Review],750,25.0,,12.5,2004.0,Greece,Agiorgitiko
10834,https://www.winemag.com/buying-guide/spiropoul...,86.0,Spiropoulos 2006 Red Stag Agiorgitiko (Nemea),"Ripe cherry, cedar, vanilla and spice aromas a...",Susan Kostrzewa,"$15, Buy Now",Red Stag,"Agiorgitiko, Greek Red","Nemea, Greece",Spiropoulos,...,Athenee Importers,4/1/2010,Not rated yet [Add Your Review],750,15.0,,13.0,2006.0,Greece,Agiorgitiko
21851,https://www.winemag.com/buying-guide/palivou-2...,87.0,Palivou 2003 Agiorgitiko (Nemea),So smooth and creamy. Solid oaking leads to po...,,"$25, Buy Now",,"Agiorgitiko, Greek Red","Nemea, Greece",Palivou,...,"Stellar Importing Company, LLC",10/1/2006,Not rated yet [Add Your Review],750,25.0,,13.0,2003.0,Greece,Agiorgitiko


In [26]:
# One 0/1 column per grape; the 'wine_points' prefix is only a label on the column names
df_grape = pd.get_dummies(df.grapes, prefix='wine_points', dtype=float)

In [27]:
df.wine_points.isnull().sum()

0

In [28]:
X = df_grape  
y = df.wine_points

In [29]:
# No constant is added, so every grape keeps its own dummy column
mod = sm.OLS(y, X)
res = mod.fit()

In [30]:
res.summary2().tables[1].sort_values("Coef.", ascending=False)

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| wine_points_Muscat Ottonel | 100.000000 | 3.516482 | 28.437515 | 7.952203e-174 | 93.107332 | 106.892668 |
| wine_points_Rosenmuskateller | 98.000000 | 3.516482 | 27.868765 | 3.573276e-167 | 91.107332 | 104.892668 |
| wine_points_Scheurebe | 96.200000 | 1.572618 | 61.171864 | 0.000000e+00 | 93.117505 | 99.282495 |
| wine_points_Grenache-Mourvèdre | 96.000000 | 3.516482 | 27.300014 | 1.211220e-160 | 89.107332 | 102.892668 |
| wine_points_Welschriesling | 95.000000 | 1.112009 | 85.430952 | 0.000000e+00 | 92.820347 | 97.179653 |
| wine_points_Tokaji | 94.500000 | 1.243264 | 76.009600 | 0.000000e+00 | 92.063074 | 96.936926 |
| wine_points_Bual | 94.500000 | 2.486528 | 38.004800 | 2.049951e-303 | 89.626148 | 99.373852 |
| wine_points_Syrah-Petit Verdot | 94.000000 | 3.516482 | 26.731264 | 3.092379e-154 | 87.107332 | 100.892668 |
| wine_points_Sirica | 94.000000 | 3.516482 | 26.731264 | 3.092379e-154 | 87.107332 | 100.892668 |
| wine_points_Nosiola | 94.000000 | 3.516482 | 26.731264 | 3.092379e-154 | 87.107332 | 100.892668 |
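
Because `X` holds one dummy column per grape and no constant, each coefficient in the table above should simply be that grape's average score, so a grape with a single 100-point review lands at the top. The sketch below checks that equivalence on made-up rows. It uses NumPy's least-squares solver as a stand-in for `sm.OLS`, and adds a review count so thinly reviewed grapes are easy to spot.

```python
import numpy as np
import pandas as pd

# Made-up scores: two grapes with several reviews, one with a single review.
toy = pd.DataFrame({
    "grapes": ["Riesling", "Riesling", "Syrah", "Syrah", "Syrah", "Bual"],
    "wine_points": [90.0, 92.0, 85.0, 88.0, 88.0, 99.0],
})

# One dummy column per grape, no intercept column.
dummies = pd.get_dummies(toy["grapes"], dtype=float)

# Least-squares fit of the scores on the dummies (stand-in for sm.OLS).
coefs, *_ = np.linalg.lstsq(
    dummies.to_numpy(), toy["wine_points"].to_numpy(), rcond=None
)

summary = pd.DataFrame({
    "coef": pd.Series(coefs, index=dummies.columns),
    "mean": toy.groupby("grapes")["wine_points"].mean(),
    "n_reviews": toy["grapes"].value_counts(),
})
print(summary.sort_values("coef", ascending=False))
```

One possible next step on the real data is to keep only grapes with some minimum number of reviews before ranking them. The cutoff is a judgment call.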


## Second Analysis: Spanish Wine Detector 

In [34]:
df['is_spanish'] = 0

In [35]:
df.loc[df.appellation_clean == 'Spain', 'is_spanish'] = 1

In [36]:
df = df[df.wine_desc.notnull()]

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(stop_words='english', max_features=200)
X = count_vectorizer.fit_transform(df.wine_desc)
print(count_vectorizer.get_feature_names_out().tolist())

['acidity', 'acids', 'age', 'aging', 'alcohol', 'apple', 'apricot', 'aromas', 'baked', 'balance', 'balanced', 'barrel', 'berry', 'best', 'big', 'bit', 'black', 'blackberries', 'blackberry', 'blend', 'bodied', 'bottle', 'bottling', 'bright', 'cab', 'cabernet', 'caramel', 'cassis', 'cedar', 'character', 'chardonnay', 'cherries', 'cherry', 'chocolate', 'cinnamon', 'citrus', 'clean', 'coffee', 'cola', 'color', 'come', 'comes', 'complex', 'complexity', 'concentrated', 'concentration', 'core', 'creamy', 'crisp', 'currant', 'dark', 'deep', 'delicious', 'delivers', 'dense', 'depth', 'dried', 'drink', 'dry', 'earth', 'earthy', 'easy', 'edge', 'elegant', 'feel', 'feels', 'fine', 'finish', 'finishes', 'firm', 'flavor', 'flavors', 'floral', 'fresh', 'freshness', 'fruit', 'fruits', 'fruity', 'glass', 'good', 'grapefruit', 'great', 'green', 'hard', 'heavy', 'herb', 'herbal', 'herbs', 'high', 'hint', 'hints', 'honey', 'intense', 'juicy', 'just', 'lead', 'leather', 'lemon', 'licorice', 'light', 'like'

In [38]:
words_df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())

In [39]:
words_df.head()

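| | acidity | acids | age | aging | alcohol | apple | apricot | aromas | baked | balance | ... | weight | white | wild | wine | winery | wines | wood | years | yellow | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |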


In [40]:
words_df.sum().sort_values(ascending=False)

wine          10623
flavors        9042
fruit          6837
finish         4581
palate         4468
aromas         4318
acidity        4130
cherry         3814
tannins        3587
ripe           3347
black          3232
drink          3029
oak            2594
rich           2568
dry            2507
sweet          2369
red            2321
nose           2287
notes          2269
spice          2240
fresh          2048
berry          1904
blackberry     1688
soft           1683
plum           1642
good           1582
dark           1577
shows          1574
white          1541
fruits         1538
              ...  
deep            459
toasty          457
round           455
smoke           453
plenty          447
mocha           447
power           441
nice            438
sweetness       433
right           433
yellow          431
strong          429
barrel          427
refreshing      422
way             422
comes           421
make            414
heavy           413
baked           412


In [41]:
porter_stemmer = PorterStemmer()
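
def stemming_tokenizer(str_input):
    # Replace anything that is not a letter, digit, or hyphen with a space, then lowercase and split
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    # Reduce every word to its Porter stem
    words = [porter_stemmer.stem(word) for word in words]
    return words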
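
A tokenizer like this is usually handed to a vectorizer through its `tokenizer` argument, so that different forms of a word end up in one column. The sketch below shows that wiring with `TfidfVectorizer`. To keep it free of the `nltk` dependency it swaps in `toy_stem`, a hypothetical stand-in that only strips a trailing "s", in place of `PorterStemmer`. The two sentences are made up.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def toy_stem(word):
    # Hypothetical stand-in for PorterStemmer.stem: only strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def toy_stemming_tokenizer(str_input):
    # Same cleaning steps as stemming_tokenizer, with the stand-in stemmer.
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return [toy_stem(word) for word in words]


docs = ["Firm tannins and ripe flavors", "A soft tannin, one bright flavor"]

# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = TfidfVectorizer(tokenizer=toy_stemming_tokenizer, token_pattern=None)
matrix = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

With the real `stemming_tokenizer` the call would look the same. The resulting matrix could then feed a classifier that predicts `is_spanish`, though that step is an assumption about where the analysis goes next.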