# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, text analysis, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_...

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the ones to numbers that are numbers, and extract the others from the appropriate strings.

In [42]:
import pandas as pd
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)

In [43]:
df = pd.read_csv("wine-reviews.csv")
df.head()
del df['user avg rating']

In [44]:
import re
df['price'] = df['price'].str.extract(r'\$(.+),')

In [45]:
df['price'] = df['price'].astype(float)

In [46]:
df['alcohol'] = df['alcohol'].str.extract(r'^(.+)%')

In [47]:
df['alcohol'] = df['alcohol'].astype(float) / 100

In [48]:
df = df[df.wine_desc.notna()]

In [49]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published
0,https://www.winemag.com/buying-guide/artadi-2011-vinas-gain-tempranillo-rioja/,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black plum and coconut filter into a round, fluffy palate that's friendly and pure but not very dense or structured. Baked flavors of molasses and gamy berry ...",Michael Schachner,25.0,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,0.145,750 ml,Red,Folio Fine Wine Partners,12/1/2014
1,https://www.winemag.com/buying-guide/adelsheim-2012-stoller-vineyard-chardonnay-willamette-valley-dundee-hills/,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Dundee Hills),"A tiny production wine, this is rich, tart and vividly fruity. The generous mix of citrus, apple and peach fruit is augmented by barrel fermentation flavors of toasted hazelnuts, caramel and bakin...",Paul Gregutt,65.0,Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,12/1/2014
2,https://www.winemag.com/buying-guide/adelsheim-2013-ribbon-springs-vineyard-other-white-auxerrois-willamette-valley-ribbon-ridge/,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerrois (Ribbon Ridge),"This is another fine vintage for this rare wine. It's loaded with cool climate, mineral-laced scents of grapefruit, kiwi and melon. A whiff of fennel adds further interest. Super refreshing and a ...",Paul Gregutt,25.0,Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,12/1/2014
3,https://www.winemag.com/buying-guide/jcb-2011-no-11-pinot-noir-sonoma-coast/,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),"Light in color and lilting floral aromas of rose, this is an inviting cool-climate Pinot Noir swirling in equal parts strawberry and spice, subtle and sophisticated.",Virginie Boone,65.0,No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,0.13,750 ml,Red,,12/1/2014
4,https://www.winemag.com/buying-guide/pazo-pondal-2013-albarino-rias-baixas/,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, melon and peach are pure as stream water. This feels round and juicy, with flavors of green herbs, lettuce, lime and orange. Tangerine notes carry the f...",Michael Schachner,17.0,,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,0.13,750 ml,White,Vinaio Imports,12/1/2014


In [50]:
df['bottle size'] = df['bottle size'].str.lower()
#Some are ml and some are L but i'm not going to convert because I don't think I'm using it

In [51]:
df['date published'] = pd.to_datetime(df['date published'])

In [52]:
df.dtypes

url                       object
wine_points              float64
wine_name                 object
wine_desc                 object
taster                    object
price                    float64
designation               object
variety                   object
appellation               object
winery                    object
alcohol                  float64
bottle size               object
category                  object
importer                  object
date published    datetime64[ns]
dtype: object

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 90's wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/)
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it

In [53]:
#What do wines from different places taste like?

In [54]:
# plot date vs type?

In [55]:
df.tail()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published
42290,https://www.winemag.com/buying-guide/concannon-2002-stampmakers-red-wine-central-coast-livermore-valley/,84.0,Concannon 2002 Stampmaker's Red Wine Red (Livermore Valley),"Very fruit forward in cherries and pomegranates, with rich tannins that grip the palate. Feels dry all the way to the finish, when it turns cough-mediciney sweet.",,24.0,Stampmaker's Red Wine,Rhône-style Red Blend,"Livermore Valley, Central Coast, California, US",Concannon,,750ml,Red,,2005-06-01
42291,https://www.winemag.com/buying-guide/san-simeon-2001-merlot-central-coast-paso-robles/,84.0,San Simeon 2001 Merlot (Paso Robles),"Very dry and robust in the mouth, a clean wine with earthy-berry flavors that finishes with some sturdy tannins, although it's nice and soft in acidity.",,22.0,,Merlot,"Paso Robles, Central Coast, California, US",San Simeon,,750ml,Red,,2005-06-01
42292,https://www.winemag.com/buying-guide/torres-anguix-2003-tinto-tempranillo-tinto-pais-ribera-duero/,84.0,Torres de Anguix 2003 Tinto (Ribera del Duero),"Black in color and saturated with plum, fruit cake and vanilla aromas. Big in the mouth but clumsy, with dense, thumping black cherry and blackberry flavors. In fact, everything about the wine is ...",Michael Schachner,10.0,Tinto,"Tinto del Pais, Tempranillo","Ribera del Duero, Northern Spain, Spain",Torres de Anguix,0.134,750ml,Red,Quality Wines of Spain,2005-06-01
42293,https://www.winemag.com/buying-guide/villacezan-2001-doce-meses-red-vino-tierra-leon/,84.0,Villacezan 2001 Doce Meses Red (Vino Tierra de León),"Muddled to start, with chocolate and earth aromas along with heavy, seemingly chewable berry fruit. Big on the palate, with slightly overripe plum backed by overt vanilla and brown sugar. A three-...",Michael Schachner,17.0,Doce Meses,"Red Blends, Red Blends","Vino Tierra de León, Northern Spain, Spain",Villacezan,0.135,750ml,Red,VinLozano Imports,2005-06-01
42294,https://www.winemag.com/buying-guide/rochioli-1997-three-corner-vineyard-pinot-noir-sonoma-russian-river-valley/,84.0,Rochioli 1997 Three Corner Vineyard Pinot Noir (Russian River Valley),"Smells oaky-beetrooty. Cherries, cocoa, nut an aggressive, slightly vulgar, raw quality. Dusty. Not a great success.",,,Three Corner Vineyard,Pinot Noir,"Russian River Valley, Sonoma, California, US",Rochioli,,750ml,Red,,2005-06-01


## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - For example, how does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algos options: logistic regression, random forest, etc
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - Race and car speed for if you get a waring vs ticket
    - Wet/dry pavement and car weight if you survive or not in a car crash)

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.

In [56]:
df.category.value_counts()

Red            25273
White          11801
Sparkling       2518
Rose            1320
Dessert          739
Port/Sherry      357
Fortified         31
Name: category, dtype: int64

In [116]:
df_white_red = df[(df.category == "Red") | (df.category == "White") & (df.wine_desc.notnull()) ]
#df_white_red.head()

In [58]:
df_white_red.category.map(lambda x: 1 if x=="Red" else 0)

0        1
1        0
2        0
3        1
4        0
        ..
42290    1
42291    1
42292    1
42293    1
42294    1
Name: category, Length: 37074, dtype: int64

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

----
# Words to predict whether a wine is white or red

In [59]:
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer

In [60]:
def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

In [61]:
texts = df_white_red['wine_desc'].dropna()

In [62]:
idf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, use_idf=True, norm='l1')
X = idf_vectorizer.fit_transform(texts)
idf_df = pd.DataFrame(X.toarray(), columns=idf_vectorizer.get_feature_names())
#idf_df



Unnamed: 0,-,--,-age,-berri,-cherri,-dijon-clon,-domin,-emilion,-est,-esteph,-georg,-j,-jacqu,-julien,-lichin,-like,-pepper,-plum,-run,-salt,-vent,-weet,-year-old,0,000,000-,000-acr,000-case,000-feet,000-foot,000-foot-high,000-liter,002,01,02,03,030-feet,04,05,06,061,064,07,08,09,1,1-3,1-liter,10,10-15,...,zemmer,zerba,zero,zest,zest-l,zesti,zestier,zibibbo,zier,zieregg,zierfandl,zigzag,zimmermann,zin,zin-bas,zin-friendli,zin-lik,zinck,zinfanat,zinfand,zinfandel,zinfandel-lik,zinfandel-mak,zing,zingarelli,zinger,zingi,zinni,zio,zip,zipolo,zippi,zippiest,zlahtina,zocker,zone,zonin,zonk,zoom,zoppega,zork,zotovich,zuccardi,zucchini,zull,zuri,zweigelt,zweigelt-pinot,zwerithal,zwiegelt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37070,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37071,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37072,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
y = df_white_red['category']

In [64]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [65]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=30)
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=30)

In [66]:
clf.score(X_test, y_test)

0.9557665336066458

In [67]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['red', 'white'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted red,Predicted white
Is red,6266,16
Is white,394,2593


In [68]:
import eli5 

feature_names=list(idf_df.columns)
eli5.show_weights(clf, feature_names=feature_names)

Weight,Feature
0.0479  ± 0.1744,cherri
0.0469  ± 0.1742,tannin
0.0415  ± 0.1646,appl
0.0309  ± 0.1203,peach
0.0296  ± 0.0890,blackberri
0.0289  ± 0.1119,citru
0.0281  ± 0.1195,pear
0.0242  ± 0.0962,lemon
0.0222  ± 0.0854,tropic
0.0209  ± 0.0843,pineappl


# Predicting wine score based on geography
using NLP because the place listings aren't formatted *too* cleanly. I need to limit the words used, so just writing my own features would probably be better for this.

In [115]:
texts = df['appellation'].dropna()

In [70]:
def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

In [71]:
#Count vectorizer with binary because some mention California/france several times. Found on StackOverflow

count_vectorizer = CountVectorizer(stop_words='english', binary=True, tokenizer=simple_tokenizer)
X = count_vectorizer.fit_transform(texts)
X = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

In [72]:
X.head()

Unnamed: 0,-cless,-costa,-georges,-vent,-vivant,aarde,abruzzo,achaia,aconcagua,acqui,adelaida,adelaide,adige,aegean,africa,agata,aglianico,agrelo,ahr,aigialias,aix-en-provence,alba,albana,albemarle,alcamo,alella,alenquer,alentejano,alentejo,alexander,alghero,alicante,aligot,almansa,aloxe-corton,alpujarra,alsace,alta,alto,altos,amador,amalfi,amarone,amboise,america,amindeo,amyndeon,ancenis,ancient,andalucia,...,vouvray,vre,vulkanland,vulture,w,wa,wachau,wagram,wagram-donauland,wahluke,waiheke,waipara,wairarapa,wairau,wales,walker,walla,washington,washington-oregon,weinland,weinviertel,wellington,western,weststeiermark,wiener,willamette,willcox,willow,wine,wrattonbully,y,yadkin,yakima,yamhill,yamhill-carlton,yarra,yea,yecla,ynez,yolo,york,yorkville,yountville,yuba,zamora,ze,zealand,zeaux,zenata,zonda
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [73]:
y = df['wine_points']

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [75]:
from sklearn.linear_model import LinearRegression 

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [76]:
#adding predictions to the dataframe. They don't seem so accurate

In [77]:
y_pred = model.predict(X_test)

In [78]:
df['prediction'] = pd.DataFrame(y_pred)

In [79]:
df.head()

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,prediction
0,https://www.winemag.com/buying-guide/artadi-2011-vinas-gain-tempranillo-rioja/,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black plum and coconut filter into a round, fluffy palate that's friendly and pure but not very dense or structured. Baked flavors of molasses and gamy berry ...",Michael Schachner,25.0,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,0.145,750 ml,Red,Folio Fine Wine Partners,2014-12-01,88.311523
1,https://www.winemag.com/buying-guide/adelsheim-2012-stoller-vineyard-chardonnay-willamette-valley-dundee-hills/,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Dundee Hills),"A tiny production wine, this is rich, tart and vividly fruity. The generous mix of citrus, apple and peach fruit is augmented by barrel fermentation flavors of toasted hazelnuts, caramel and bakin...",Paul Gregutt,65.0,Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,2014-12-01,90.436035
2,https://www.winemag.com/buying-guide/adelsheim-2013-ribbon-springs-vineyard-other-white-auxerrois-willamette-valley-ribbon-ridge/,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerrois (Ribbon Ridge),"This is another fine vintage for this rare wine. It's loaded with cool climate, mineral-laced scents of grapefruit, kiwi and melon. A whiff of fennel adds further interest. Super refreshing and a ...",Paul Gregutt,25.0,Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,2014-12-01,88.975098
3,https://www.winemag.com/buying-guide/jcb-2011-no-11-pinot-noir-sonoma-coast/,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),"Light in color and lilting floral aromas of rose, this is an inviting cool-climate Pinot Noir swirling in equal parts strawberry and spice, subtle and sophisticated.",Virginie Boone,65.0,No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,0.13,750 ml,Red,,2014-12-01,88.965576
4,https://www.winemag.com/buying-guide/pazo-pondal-2013-albarino-rias-baixas/,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, melon and peach are pure as stream water. This feels round and juicy, with flavors of green herbs, lettuce, lime and orange. Tangerine notes carry the f...",Michael Schachner,17.0,,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,0.13,750 ml,White,Vinaio Imports,2014-12-01,84.979736


In [86]:
df['residual'] = df['wine_points'] - df['prediction']

In [92]:
df

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,prediction,residual
0,https://www.winemag.com/buying-guide/artadi-2011-vinas-gain-tempranillo-rioja/,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black plum and coconut filter into a round, fluffy palate that's friendly and pure but not very dense or structured. Baked flavors of molasses and gamy berry ...",Michael Schachner,25.0,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,0.145,750 ml,Red,Folio Fine Wine Partners,2014-12-01,88.311523,1.688477
1,https://www.winemag.com/buying-guide/adelsheim-2012-stoller-vineyard-chardonnay-willamette-valley-dundee-hills/,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Dundee Hills),"A tiny production wine, this is rich, tart and vividly fruity. The generous mix of citrus, apple and peach fruit is augmented by barrel fermentation flavors of toasted hazelnuts, caramel and bakin...",Paul Gregutt,65.0,Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,2014-12-01,90.436035,-0.436035
2,https://www.winemag.com/buying-guide/adelsheim-2013-ribbon-springs-vineyard-other-white-auxerrois-willamette-valley-ribbon-ridge/,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerrois (Ribbon Ridge),"This is another fine vintage for this rare wine. It's loaded with cool climate, mineral-laced scents of grapefruit, kiwi and melon. A whiff of fennel adds further interest. Super refreshing and a ...",Paul Gregutt,25.0,Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,0.135,750 ml,White,,2014-12-01,88.975098,1.024902
3,https://www.winemag.com/buying-guide/jcb-2011-no-11-pinot-noir-sonoma-coast/,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),"Light in color and lilting floral aromas of rose, this is an inviting cool-climate Pinot Noir swirling in equal parts strawberry and spice, subtle and sophisticated.",Virginie Boone,65.0,No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,0.130,750 ml,Red,,2014-12-01,88.965576,1.034424
4,https://www.winemag.com/buying-guide/pazo-pondal-2013-albarino-rias-baixas/,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, melon and peach are pure as stream water. This feels round and juicy, with flavors of green herbs, lettuce, lime and orange. Tangerine notes carry the f...",Michael Schachner,17.0,,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,0.130,750 ml,White,Vinaio Imports,2014-12-01,84.979736,5.020264
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42290,https://www.winemag.com/buying-guide/concannon-2002-stampmakers-red-wine-central-coast-livermore-valley/,84.0,Concannon 2002 Stampmaker's Red Wine Red (Livermore Valley),"Very fruit forward in cherries and pomegranates, with rich tannins that grip the palate. Feels dry all the way to the finish, when it turns cough-mediciney sweet.",,24.0,Stampmaker's Red Wine,Rhône-style Red Blend,"Livermore Valley, Central Coast, California, US",Concannon,,750ml,Red,,2005-06-01,,
42291,https://www.winemag.com/buying-guide/san-simeon-2001-merlot-central-coast-paso-robles/,84.0,San Simeon 2001 Merlot (Paso Robles),"Very dry and robust in the mouth, a clean wine with earthy-berry flavors that finishes with some sturdy tannins, although it's nice and soft in acidity.",,22.0,,Merlot,"Paso Robles, Central Coast, California, US",San Simeon,,750ml,Red,,2005-06-01,,
42292,https://www.winemag.com/buying-guide/torres-anguix-2003-tinto-tempranillo-tinto-pais-ribera-duero/,84.0,Torres de Anguix 2003 Tinto (Ribera del Duero),"Black in color and saturated with plum, fruit cake and vanilla aromas. Big in the mouth but clumsy, with dense, thumping black cherry and blackberry flavors. In fact, everything about the wine is ...",Michael Schachner,10.0,Tinto,"Tinto del Pais, Tempranillo","Ribera del Duero, Northern Spain, Spain",Torres de Anguix,0.134,750ml,Red,Quality Wines of Spain,2005-06-01,,
42293,https://www.winemag.com/buying-guide/villacezan-2001-doce-meses-red-vino-tierra-leon/,84.0,Villacezan 2001 Doce Meses Red (Vino Tierra de León),"Muddled to start, with chocolate and earth aromas along with heavy, seemingly chewable berry fruit. Big on the palate, with slightly overripe plum backed by overt vanilla and brown sugar. A three-...",Michael Schachner,17.0,Doce Meses,"Red Blends, Red Blends","Vino Tierra de León, Northern Spain, Spain",Villacezan,0.135,750ml,Red,VinLozano Imports,2005-06-01,,
