### Machine Learning using NLP of Wine Descriptions
**Goal:** Perform NLP on wine descriptions (using TF-IDF), then use the TF-IDF scores, along with other wine info, to build a machine learning algorithm that can predict the rating score for a wine.

**Input:** A CSV file of reviews with descriptions, varietals, regions, and scores from Wine Enthusiast Magazine. The CSV was obtained from user zackthoutt on kaggle.com: [wine review data](https://www.kaggle.com/zynicide/wine-reviews). This dataset is 150,000 wine reviews, and only includes wines with scores from 80-100. The dataset has been cleaned to include only those reviews that list a price (many had no listed price). The cleaned dataset contains approximately 120,000 entries.
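
The price-based cleaning described above could be done roughly as follows. This is a minimal sketch on a made-up three-row frame, not the script actually used to produce the cleaned file; the names `raw` and `priced` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the raw reviews (values are made up; the real file has more columns).
raw = pd.DataFrame({
    "description": ["Ripe and fruity.", "Tart and snappy.", "Dense and tannic."],
    "points": [87, 87, 90],
    "price": [15.0, None, 65.0],
})

# Keep only the reviews that list a price, then renumber the rows.
priced = raw.dropna(subset=["price"]).reset_index(drop=True)
print(len(raw), "->", len(priced))
```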

**Expected Output:** A machine learning algorithm that can take in:
1. Wine description (textual)
2. Country of origin
3. Designation (optional) - this is more specific data about the wine
4. Price
5. Province (optional)
6. Region
7. Varietal
8. Winery

**The algorithm should then output a score (from 80-100).**

Initially, we will create an algorithm that will attempt to predict score from *only* the text description, and will then compare this to an ML algorithm including text data in addition to the other wine data.
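
One possible shape for the second, combined model is sketched below. It is only an illustration under several assumptions: the toy rows and their scores are made up, only a few of the listed columns are used, and `Ridge` regression is swapped in as a stand-in estimator rather than being the model this notebook settles on.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that mimic a few columns of the wine data.
toy = pd.DataFrame({
    "description": [
        "Ripe and fruity with smooth tannins.",
        "Tart and snappy lime flavors.",
        "Pineapple rind and lemon pith.",
        "Rich, dense and age-worthy.",
        "Blackberry and raspberry aromas.",
        "Crisp citrus with a clean finish.",
    ],
    "country": ["Portugal", "US", "US", "US", "Spain", "US"],
    "variety": ["Portuguese Red", "Pinot Gris", "Riesling",
                "Pinot Noir", "Tempranillo-Merlot", "Pinot Gris"],
    "price": [15.0, 14.0, 13.0, 65.0, 15.0, 18.0],
    "points": [87, 86, 87, 92, 87, 88],
})

features = ColumnTransformer([
    # A single column name (not a list) hands the vectorizer a 1-D series of strings.
    ("text", TfidfVectorizer(), "description"),
    # One-hot encode the categorical columns; unseen categories become all zeros.
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["country", "variety"]),
    # Pass the numeric price through unchanged.
    ("num", "passthrough", ["price"]),
])

combined_model = make_pipeline(features, Ridge(alpha=1.0))
X_toy = toy.drop(columns="points")
combined_model.fit(X_toy, toy["points"])
preds = combined_model.predict(X_toy)
print(preds.round(1))
```

Wrapping the feature steps and the estimator in one pipeline means the same preprocessing is applied at fit and predict time.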

In [2]:
# import dependencies

import pandas as pd
import numpy as np
import os

In [6]:
#load the data into a pandas dataframe
data = os.path.join('data', 'winemag-data-130k-prices-only.csv')
df = pd.read_csv(data)

In [9]:
#check to see that the data loaded correctly
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
1,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
2,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
3,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
4,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem


In [13]:
df.dtypes

Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

In [14]:
#now we need to use the 'description' column, and perform some nlp on this to extract usable data.
#first we will construct a new df with just description text and points
text_df = pd.DataFrame({'points': df.points, 'description': df.description})
text_df.head()

Unnamed: 0,description,points
0,"This is ripe and fruity, a wine that is smooth...",87
1,"Tart and snappy, the flavors of lime flesh and...",87
2,"Pineapple rind, lemon pith and orange blossom ...",87
3,"Much like the regular bottling from 2012, this...",87
4,Blackberry and raspberry aromas show a typical...,87


In [17]:
#let's first look at max and min length descriptions to see if there is any major difference
text_df['description'].str.len().max()

829

In [18]:
text_df['description'].str.len().min()

20

In [19]:
text_df['description'].str.len().mean()

244.24520768753874
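
The three length checks above can also be collected in one call. The sketch below uses a made-up series named `toy_desc` in place of `text_df['description']`.

```python
import pandas as pd

# Made-up descriptions standing in for the real column.
toy_desc = pd.Series([
    "Ripe and fruity.",
    "Tart.",
    "Dense, tannic and long on the finish.",
])

# Character length of each description, then min, max and mean in one pass.
lengths = toy_desc.str.len()
summary = lengths.agg(["min", "max", "mean"])
print(summary)
```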

In [52]:
X = text_df.description.tolist()
y = text_df.points.values

In [53]:
y

array([87, 87, 87, ..., 90, 90, 90], dtype=int64)

In [54]:
#import train test split to split our data into a training set and a testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [55]:
len(X_train)

90731

In [56]:
len(X_test)

30244

In [41]:
#at this point, I wasn't sure whether to run CountVectorizer BEFORE or AFTER splitting the data
#I'm going with after: fitting on the training set only keeps test-set vocabulary out of the model
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
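
The "after splitting" choice can be illustrated on a tiny made-up corpus. In this sketch (all names and documents are assumptions for illustration), the vocabulary is learned from the training documents only and then reused on the held-out documents, so both matrices share the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up documents and scores.
docs = [
    "ripe and fruity",
    "tart and snappy",
    "pineapple rind and lemon pith",
    "dense tannic finish",
    "blackberry and raspberry aromas",
    "crisp citrus finish",
]
labels = [87, 87, 87, 90, 87, 90]

docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=2, random_state=42
)

vect_demo = CountVectorizer()
train_counts = vect_demo.fit_transform(docs_train)  # learn the vocabulary from training text only
test_counts = vect_demo.transform(docs_test)        # reuse that vocabulary; unseen words are dropped

print(train_counts.shape, test_counts.shape)
```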

In [44]:
X_train_counts = vect.fit_transform(X_train)

In [45]:
X_train_counts.shape

(90731, 27432)

In [46]:
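#let's examine our data
#(note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
vect.get_feature_names()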

['000',
 '008',
 '01',
 '02',
 '03',
 '030',
 '04',
 '04s',
 '05',
 '056',
 '06',
 '061',
 '064',
 '07',
 '07s',
 '08',
 '080',
 '08s',
 '09',
 '093',
 '09s',
 '10',
 '100',
 '100th',
 '101',
 '1016',
 '103',
 '104',
 '106',
 '107th',
 '108',
 '10th',
 '11',
 '110',
 '111',
 '112',
 '114',
 '115',
 '116',
 '1194',
 '11th',
 '12',
 '120',
 '1200',
 '122',
 '125',
 '1252',
 '126',
 '128',
 '1290',
 '12g',
 '12th',
 '13',
 '130',
 '130th',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137',
 '1375',
 '1396',
 '13g',
 '13th',
 '14',
 '140',
 '1429',
 '146',
 '147',
 '14g',
 '14th',
 '15',
 '150',
 '1500',
 '150th',
 '154',
 '155g',
 '159g',
 '15g',
 '15s',
 '15th',
 '16',
 '160',
 '1600',
 '1607',
 '160g',
 '161',
 '1610',
 '1628',
 '164',
 '1649',
 '165',
 '1667',
 '1690',
 '1692',
 '16g',
 '16th',
 '17',
 '170',
 '1700s',
 '170g',
 '171',
 '172',
 '1737',
 '1740',
 '1744',
 '175',
 '1756',
 '1759',
 '177',
 '1772',
 '1780',
 '1787',
 '1789',
 '179',
 '17th',
 '18',
 '180',
 '1800',
 '1800s',

In [48]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(90731, 27432)
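
As a side note, scikit-learn's `TfidfVectorizer` combines the counting and TF-IDF steps used above. The sketch below checks, on a made-up corpus and with default settings on both sides, that the one-step and two-step routes give the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents.
docs = [
    "ripe and fruity with smooth tannins",
    "tart and snappy lime flavors",
    "ripe blackberry and raspberry aromas",
]

# Two steps, as in this notebook: raw counts first, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer does both with the same defaults.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```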

In [57]:
y_train

array([85, 92, 86, ..., 88, 87, 92], dtype=int64)

In [58]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB().fit(X_train_tfidf, y_train)

In [59]:
print(f"Training Data Score: {mnb.score(X_train_tfidf, y_train)}")

Training Data Score: 0.27863684958834356


OK, so we can see here that our Naive Bayes model using only the descriptions does not produce a good predictive model: it matches the exact score for only about 28% of the training data. We should next test a Support Vector Machine model.
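
A Support Vector Machine could be tried along the following lines. This is a hedged sketch rather than a tested result: it uses `LinearSVC` (one of several SVM variants in scikit-learn) on made-up documents, and the names `toy_docs`, `toy_points` and `svm_model` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up descriptions and scores.
toy_docs = [
    "tart and snappy lime flavors",
    "simple tart citrus",
    "thin and tart green apple",
    "dense tannic structure with a long finish",
    "rich dense blackberry and long finish",
    "long complex finish with dense fruit",
]
toy_points = [87, 87, 87, 90, 90, 90]

# TF-IDF features feed a linear SVM that treats each score as a class.
svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_model.fit(toy_docs, toy_points)

train_preds = svm_model.predict(toy_docs)
print(svm_model.predict(["tart lime and green apple"]))
```

On the real data, the pipeline would be fit on `X_train` and `y_train` in place of the toy lists.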

In [66]:
mnb.predict(X_train_tfidf[80000])

array([87], dtype=int64)

In [67]:
y_train[80000]

90
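
The prediction above misses by three points (87 versus 90), which exact-match accuracy counts the same as any other miss. Because the scores are ordered numbers, it may help to also report how far off the predictions are. The sketch below compares accuracy with mean absolute error on made-up values; it is a suggestion, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up true and predicted scores.
y_true = np.array([90, 87, 92])
y_pred = np.array([87, 87, 90])

acc = accuracy_score(y_true, y_pred)       # share of exact matches
mae = mean_absolute_error(y_true, y_pred)  # average miss, in points

print(f"accuracy: {acc:.2f}, mean absolute error: {mae:.2f} points")
```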

In [None]:
#to run a prediction on the test data, we first need to transform X_test with the already-fitted
#CountVectorizer and TfidfTransformer (calling transform, not fit_transform)
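
The step described in the comment above could look like the following. It is a self-contained sketch on made-up documents; in the notebook itself, the fitted `vect`, `tfidf_transformer` and `mnb` objects would be applied to `X_test` and `y_test` in the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up training and test documents with scores.
train_docs = [
    "tart and snappy lime flavors",
    "simple tart citrus",
    "dense tannic structure with a long finish",
    "rich dense blackberry and long finish",
]
train_points = [87, 87, 90, 90]
test_docs = ["tart lime and citrus", "dense and rich with a long finish"]
test_points = [87, 90]

# Fit the vectorizer, the TF-IDF weights and the model on training data only.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(count_vect.fit_transform(train_docs))
nb_model = MultinomialNB().fit(train_tfidf, train_points)

# Apply the fitted steps to the test data with transform, never fit_transform.
test_tfidf = tfidf.transform(count_vect.transform(test_docs))
test_score = nb_model.score(test_tfidf, test_points)
print(f"Testing Data Score: {test_score}")
```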