### Machine Learning using NLP of Wine Descriptions
**Goal:** perform NLP on wine descriptions (using TF-IDF). Use the TF-IDF scores, along with other wine info, to build a machine learning algorithm that can predict the rating score for wine.

**Input:**  A CSV file of reviews with descriptions, varietals, region, and score from Wine Enthusiast Magazine. The CSV was obtained from user zackthoutt on kaggle.com: [wine review data](https://www.kaggle.com/zynicide/wine-reviews). This dataset contains 150,000 wine reviews, and only includes wines with scores from 80-100. The dataset has been cleaned to include only those reviews that include a price (many had no listed price). The cleaned dataset contains approximately 120,000 entries.

**Expected Output:** A machine learning algorithm that can take in:
1. Wine description (textual)
2. Country of origin
3. Designation (optional) - this is more specific data about the wine
4. Price
5. Province (optional)
6. Region
7. Varietal
8. Winery

**The algorithm should then output a score (from 80-100).**

Initially, we will create an algorithm that will attempt to predict score from *only* the text description, and will then compare this to an ML algorithm including text data in addition to the other wine data.

In [201]:
# import dependencies

import pandas as pd
import numpy as np
import os
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

In [2]:
#load the data into a pandas dataframe
data = os.path.join('data', 'winemag-data-130k-prices-only.csv')
df = pd.read_csv(data)

In [3]:
#check to see that the data loaded correctly
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
1,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
2,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
3,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
4,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem


In [4]:
df.dtypes

Unnamed: 0                 int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

In [5]:
#now we need to use the 'description' column, and perform some nlp on this to extract usable data.
#first we will construct a new df with just description text and points
text_df = pd.DataFrame({'points': df.points, 'description': df.description})
text_df.head()

Unnamed: 0,description,points
0,"This is ripe and fruity, a wine that is smooth...",87
1,"Tart and snappy, the flavors of lime flesh and...",87
2,"Pineapple rind, lemon pith and orange blossom ...",87
3,"Much like the regular bottling from 2012, this...",87
4,Blackberry and raspberry aromas show a typical...,87


We will now map points to stars to "bin" the points into 5 bins for easier classification

In [108]:
text_df['stars'] = text_df['points'].map({80:1, 81:1, 82:1, 83:1, 84:2, 85:2, 86:2, 87:2, 88:3, 89:3, 90:3, 91:3, 92:4, 93:4, 94:4, 95:4, 96:5, 97:5, 98:5, 99:5, 100:5})

In [111]:
text_df.head(10)

Unnamed: 0,description,points,stars
0,"This is ripe and fruity, a wine that is smooth...",87,2
1,"Tart and snappy, the flavors of lime flesh and...",87,2
2,"Pineapple rind, lemon pith and orange blossom ...",87,2
3,"Much like the regular bottling from 2012, this...",87,2
4,Blackberry and raspberry aromas show a typical...,87,2
5,"Here's a bright, informal red that opens with ...",87,2
6,This dry and restrained wine offers spice in p...,87,2
7,Savory dried thyme notes accent sunnier flavor...,87,2
8,This has great depth of flavor with its fresh ...,87,2
9,"Soft, supple plum envelopes an oaky structure ...",87,2
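
The dictionary above places four consecutive point values in each bin, with 100 folded into the top bin. As a sketch, the same mapping can be written arithmetically. `points_to_stars` is a hypothetical helper that does not appear in the notebook.

```python
def points_to_stars(points: int) -> int:
    """Map an 80-100 point score to a 1-5 star bin.

    Each bin spans four point values (80-83 -> 1, 84-87 -> 2, ...);
    a score of 100 is folded into the top bin.
    """
    if not 80 <= points <= 100:
        raise ValueError(f"points must be between 80 and 100, got {points}")
    return min((points - 80) // 4 + 1, 5)


# Spot-check the boundaries of a few bins
for score in (80, 83, 84, 95, 96, 100):
    print(score, points_to_stars(score))
```

If the helper were used, `text_df['points'].map(points_to_stars)` should produce the same `stars` column as the dictionary.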


In [17]:
#let's first look at max and min length descriptions to see if there is any major difference
text_df['description'].map(lambda x: len(x)).max()

829

In [18]:
text_df['description'].map(lambda x: len(x)).min()

20

In [91]:
#let's add a new column that is the length. We will use this to feed into a predictive model to see if just the description
#length alone could be predictive of score.
text_df['length'] = text_df['description'].map(lambda x: len(x))
text_df.head()

Unnamed: 0,description,points,length
0,"This is ripe and fruity, a wine that is smooth...",87,227
1,"Tart and snappy, the flavors of lime flesh and...",87,186
2,"Pineapple rind, lemon pith and orange blossom ...",87,199
3,"Much like the regular bottling from 2012, this...",87,249
4,Blackberry and raspberry aromas show a typical...,87,261


In [19]:
text_df['description'].map(lambda x: len(x)).mean()

244.24520768753874

In [113]:
Q = text_df.length.values
Q

array([227, 186, 199, ..., 225, 216, 169], dtype=int64)

In [94]:
Z = text_df.length.values.reshape(-1,1)
Z

array([[227],
       [186],
       [199],
       ...,
       [225],
       [216],
       [169]], dtype=int64)

In [95]:
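from sklearn.model_selection import train_test_split
#y must be defined before splitting; here the target is the exact point score
y = text_df.points.values
Z_train, Z_test, y_train, y_test = train_test_split(Z, y, random_state=42)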

In [97]:
from sklearn.naive_bayes import MultinomialNB
Zmnb = MultinomialNB().fit(Z_train, y_train)

In [98]:
print(f"Training Data Score: {Zmnb.score(Z_train, y_train)}")

Training Data Score: 0.13176312395983733


In [112]:
#predict expects a 2D array, so reshape the single sample to (1, n_features)
print(Zmnb.predict(Z_train[5].reshape(1, -1)))
print(y_train[5])

[88]
90
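
An accuracy of 0.13 is easier to judge against a trivial baseline: always predicting the most common score. The sketch below computes that baseline on made-up scores, not the real data. `majority_baseline_accuracy` and `toy_points` are illustrative names.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Made-up scores for illustration only
toy_points = [87, 88, 88, 90, 88, 91, 87, 88]
label, accuracy = majority_baseline_accuracy(toy_points)
print(f"Always predicting {label} scores {accuracy:.2f}")
```

Running the same function on `y_train` would show whether a model is doing better than simply guessing the most frequent score.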


In [210]:
X = text_df.description.tolist()
y = text_df.points.values

In [113]:
y = text_df.stars.values

In [114]:
y

array([2, 2, 2, ..., 3, 3, 3], dtype=int64)

In [211]:
#import train test split to split our data into a training set and a testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [212]:
len(X_train)

90731

In [117]:
X_train

["An unusual blend of Riesling, Albariño and Sauvignon Blanc that is largely neutral on the nose and then waxy, honeyed and pithy in the mouth. Along the way there's an appealing dry, melony, petrol character and some good old-fashioned tang. A breed unto itself.",
 "Coronato is a robust, modern super Tuscan only produced for export markets. Indeed, you can sense a New World appeal within its rich, succulent fabric thanks to penetrating notes of coffee, espresso, tar, aniseed, chocolate and bursting cherry. It's soft and velvety with enormous charm.",
 "This isn't the most powerful, impressive Pinot Noir from the Sonoma Coast, but it has a lot going for it. It's bone dry, crisply acidic and elegant on the palate, with a flavor of sour cherry candy that's savory and clean. Drink now.",
 "Alluring aromas of cocoa powder, berry fruits, graphite and raw oak lead to a jammy palate with density and weight. This old-vines Monastrell tastes of chocolaty oak and herbal blackberry. On the finish

In [11]:
X_train[70010]

'Silky yet slightly grippy, this is a mineral-driven wine, with violet and spicy cardamom aromas atop bright, vibrant swathes of cherry and wild strawberry. This vineyard is where Joseph Swan himself first planted Pinot Noir, it retains all the promise and greatness of his initial discovery of this magical place along the Laguna Ridge.'

In [12]:
len(X_test)

30244

In [58]:
#at this point, I'm not sure if we should run CountVectorizer on the text BEFORE or AFTER splitting
#I'm going to go with after, so that the vocabulary is learned from the training data only
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [59]:
X_train_counts = vect.fit_transform(X_train)

In [45]:
X_train_counts.shape

(90731, 27432)
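
To illustrate what fitting after the split implies, the toy sketch below fits a `CountVectorizer` on training text only and then reuses it on test text. The sentences are invented for the example. Words that appear only in the test text are ignored, and both matrices share the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences, for illustration only
train_docs = ["ripe cherry and plum", "crisp lime and green apple"]
test_docs = ["ripe apple with unexpected truffle"]

vect_demo = CountVectorizer()
train_counts = vect_demo.fit_transform(train_docs)  # learns the vocabulary
test_counts = vect_demo.transform(test_docs)        # reuses it without refitting

print(sorted(vect_demo.vocabulary_))
print(train_counts.shape, test_counts.shape)
# Only "ripe" and "apple" are in the training vocabulary, so only they are counted
print(test_counts.sum())
```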

In [60]:
#let's examine our data
vect.get_feature_names()

['000',
 '008',
 '01',
 '02',
 '03',
 '030',
 '04',
 '04s',
 '05',
 '056',
 '06',
 '061',
 '064',
 '07',
 '07s',
 '08',
 '080',
 '08s',
 '09',
 '093',
 '09s',
 '10',
 '100',
 '100th',
 '101',
 '1016',
 '103',
 '104',
 '106',
 '107th',
 '108',
 '10th',
 '11',
 '110',
 '111',
 '112',
 '114',
 '115',
 '116',
 '1194',
 '11th',
 '12',
 '120',
 '1200',
 '122',
 '125',
 '1252',
 '126',
 '128',
 '1290',
 '12g',
 '12th',
 '13',
 '130',
 '130th',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137',
 '1375',
 '1396',
 '13g',
 '13th',
 '14',
 '140',
 '1429',
 '146',
 '147',
 '14g',
 '14th',
 '15',
 '150',
 '1500',
 '150th',
 '154',
 '155g',
 '159g',
 '15g',
 '15s',
 '15th',
 '16',
 '160',
 '1600',
 '1607',
 '160g',
 '161',
 '1610',
 '1628',
 '164',
 '1649',
 '165',
 '1667',
 '1690',
 '1692',
 '16g',
 '16th',
 '17',
 '170',
 '1700s',
 '170g',
 '171',
 '172',
 '1737',
 '1740',
 '1744',
 '175',
 '1756',
 '1759',
 '177',
 '1772',
 '1780',
 '1787',
 '1789',
 '179',
 '17th',
 '18',
 '180',
 '1800',
 '1800s',

In [61]:
#make a stop list of the numerical feature names (the first 540 entries of the sorted vocabulary)
stop_list = vect.get_feature_names()
stop_list = stop_list[:540]
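
The slice above depends on a hard-coded position in the sorted vocabulary. Assuming the intent is to collect every token that begins with a digit, a sketch that does not depend on that position could look like this. `numeric_tokens` is a hypothetical helper, and the sample list is a hand-picked subset of the output shown earlier.

```python
def numeric_tokens(feature_names):
    """Return the tokens whose first character is a digit."""
    return [token for token in feature_names if token[0].isdigit()]


# A few entries copied from the vocabulary listing, plus two ordinary words
sample_features = ["000", "04s", "100th", "1700s", "aaron", "abbey"]
print(numeric_tokens(sample_features))
```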

In [48]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(90731, 27432)
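
For intuition about what the transformer computes, here is a hand-rolled sketch of TF-IDF weighting in plain Python. It assumes scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It uses simple whitespace tokenization rather than `CountVectorizer`'s tokenizer. The sentences are invented.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(["ripe cherry wine", "crisp lime wine", "ripe plum wine"])
print(rows[0])
```

In the first row, "cherry" (one document) receives more weight than "ripe" (two documents), which in turn outweighs "wine" (all three documents).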

In [115]:
type(X_train_tfidf)

scipy.sparse.csr.csr_matrix

In [137]:
X_train_tfidf

<90731x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 3143726 stored elements in Compressed Sparse Row format>

In [117]:
#now we will explore turning the sparse matrix from the TF-IDF transform into a dataframe.
#If we can do this, we can then create a new dataframe with all other wine values that also includes the TF_IDF
#we can then do this on all the data, and then split the data into training and testing data.
TF = pd.DataFrame({'TF_IDF': list(X_train_tfidf)})

In [118]:
TF.head()

Unnamed: 0,TF_IDF
0,"(0, 1319)\t0.17940045025301316\n (0, 25809)..."
1,"(0, 16724)\t0.032516835550230085\n (0, 1348..."
2,"(0, 16724)\t0.04859509521515342\n (0, 1348)..."
3,"(0, 16724)\t0.085053971917573\n (0, 1348)\t..."
4,"(0, 16724)\t0.11281086063920866\n (0, 12702..."


Above, we can see that casting the sparse matrix to a list allows it to be added to a dataframe. Next, we have to test whether we can pass this into an ML algorithm.

In [57]:
y_train

array([85, 92, 86, ..., 88, 87, 92], dtype=int64)

In [58]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB().fit(X_train_tfidf, y_train)

In [134]:
X_t2 = TF['TF_IDF'].tolist()
X_t2 = np.array(X_t2).reshape(-1,1)

In [146]:
import scipy.sparse as sps

In [147]:
sps.coo_matrix(TF.TF_IDF.values)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

In [143]:
X_t3 = sps.csr_matrix(TF['TF_IDF'].values.T)
X_t3

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

In [135]:
X_t2

array([[<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 38 stored elements in Compressed Sparse Row format>],
       [<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 42 stored elements in Compressed Sparse Row format>],
       [<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 36 stored elements in Compressed Sparse Row format>],
       ...,
       [<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 38 stored elements in Compressed Sparse Row format>],
       [<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 41 stored elements in Compressed Sparse Row format>],
       [<1x27432 sparse matrix of type '<class 'numpy.float64'>'
	with 51 stored elements in Compressed Sparse Row format>]], dtype=object)

In [136]:

mnb2 = MultinomialNB().fit(X_t2, y_train)

ValueError: setting an array element with a sequence.
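
The errors above appear to arise because each DataFrame cell holds its own 1-row sparse matrix, so NumPy sees an array of objects rather than numbers. One possible alternative, sketched here with toy values, is to keep everything sparse and append extra numeric columns with `scipy.sparse.hstack`. The `price` values and the tiny text matrix are made up, and this is not the approach the notebook ends up using.

```python
import numpy as np
import scipy.sparse as sps

# Stand-in for a TF-IDF matrix: 2 documents, 3 terms (made-up weights)
text_features = sps.csr_matrix(np.array([[0.0, 0.7, 0.7],
                                         [1.0, 0.0, 0.0]]))

# Stand-in for one extra numeric feature per document
price = np.array([[15.0], [65.0]])

# Append the price column to the right of the text features
combined = sps.hstack([text_features, sps.csr_matrix(price)]).tocsr()
print(combined.shape)
print(combined.toarray())
```

The result is still a sparse matrix, so it can be passed to `fit` in the same way as `X_train_tfidf`.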

In [59]:
print(f"Training Data Score: {mnb.score(X_train_tfidf, y_train)}")

Training Data Score: 0.27863684958834356


Ok, so we can see here that our Naive Bayes model using only the descriptions does not produce a good predictive model. We should next test a Support Vector Machine model (SGDClassifier with hinge loss fits a linear SVM).

In [66]:
mnb.predict(X_train_tfidf[80000])

array([87], dtype=int64)

In [67]:
y_train[80000]

90

In [None]:
#to run a prediction on the test data, we will first need to count vectorize, transform, and tfidf
#transform the X_test data
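
One way to avoid repeating the vectorize and transform steps by hand is a scikit-learn `Pipeline`, which applies the same fitted steps to any new text. The sketch below uses invented descriptions and labels. `text_clf` and the toy data are illustrative names, and the classifier settings simply mirror the ones used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Invented descriptions and scores, for illustration only
toy_train = [
    "bold outstanding wine with a long finish",
    "outstanding depth and bold structure",
    "thin and sour with a short finish",
    "sour thin and disappointing",
]
toy_labels = [92, 92, 82, 82]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])

text_clf.fit(toy_train, toy_labels)
# predict() runs transform on each step before calling the classifier
toy_pred = text_clf.predict(["bold and outstanding"])
print(toy_pred)
```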

In [72]:
from sklearn.linear_model import SGDClassifier
sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_tfidf, y_train)

In [73]:
print(f"Training Data Score: {sgdc.score(X_train_tfidf, y_train)}")

Training Data Score: 0.4703133438405837


In [86]:
print(sgdc.predict(X_train_tfidf[70010]))
print(y_train[70010])

[95]
95


In [148]:
#now let's try doing the countvectorizer on our original data with n-grams and removing extraneous words
X = text_df.description.tolist()
y = text_df.points.values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [209]:
vect = CountVectorizer(ngram_range=(1,3), stop_words=stop_list)

In [210]:
X_train_counts = vect.fit_transform(X_train)

In [208]:
vect.get_feature_names()

['aacacia',
 'aacacia flower',
 'aand',
 'aand tingling',
 'aaron',
 'aaron jackson',
 'aaron pott',
 'aaron walker',
 'abacela',
 'abacela albariño',
 'abacela dedication',
 'abacela in',
 'abacela makes',
 'abacela tenth',
 'abacela wines',
 'abadal',
 'abadal label',
 'abadal maker',
 'abadia',
 'abadia retuerta',
 'abandon',
 'abandon bristling',
 'abandon it',
 'abandon of',
 'abandon stewed',
 'abandon yourself',
 'abandoned',
 'abandoned grillo',
 'abandoned thus',
 'abate',
 'abate fetel',
 'abate subzone',
 'abbazia',
 'abbazia di',
 'abbey',
 'abbey has',
 'abbey in',
 'abbey of',
 'abbey ridge',
 'abbey ruins',
 'abbey which',
 'abbey would',
 'abbott',
 'abbott cajoles',
 'abbott chalone',
 'abbott ex',
 'abbott has',
 'abbott in',
 'abbott sources',
 'abbreviated',
 'abbreviated finish',
 'abc',
 'abc production',
 'abeille',
 'abeille family',
 'abeilles',
 'abeilles is',
 'abeja',
 'abeja chardonnays',
 'abeja estate',
 'abeja jumps',
 'abeja kicks',
 'abeja main',
 'abe

In [211]:
X_train_counts.shape

(90731, 2019329)
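
The jump from 27,432 to 2,019,329 columns comes largely from counting every 1-, 2- and 3-word sequence as its own feature. The plain-Python sketch below shows how quickly the number of n-grams grows for a single short phrase. `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams("ripe black cherry fruit".split())
print(grams)
print(len(grams))  # 4 unigrams + 3 bigrams + 2 trigrams = 9
```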

In [212]:
#now we will run multinomial Bayes using just the vectorized data (no tf_idf run)
mnb = MultinomialNB().fit(X_train_counts, y_train)

In [213]:
print(f"Training Data Score: {mnb.score(X_train_counts, y_train)}")

Training Data Score: 0.789608843724857


In [215]:
X_test_counts = vect.transform(X_test)

In [216]:
print(f"MNB on non-tfidf Testing Data Score: {mnb.score(X_test_counts, y_test)}")

MNB on non-tfidf Testing Data Score: 0.26094431953445313


We can now see that the training score using just the CountVectorizer counts, with n-grams and numerical "words" removed, is much better (.790 vs. .279 with MNB done on vectorized and TF_IDF data without n-grams or stop words removed). The testing score, however, is only .261, so the model does not generalize well.

We will now try the SGDClassifier on the data

In [217]:
sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_counts, y_train)

In [218]:
print(f"Training Data Score: {sgdc.score(X_train_counts, y_train)}")

Training Data Score: 0.9425664877495012


In [219]:
print(f"SGDC on non-tfidf Testing Data Score w/ 3-grams: {sgdc.score(X_test_counts, y_test)}")

SGDC on non-tfidf Testing Data Score w/ 3-grams: 0.2871313318344134


We see a testing score of .287, which is slightly better, but not much.

Now let's do a TF_IDF transform and feed into both the MNB and SGDC models.

In [173]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [180]:
mnb3 = MultinomialNB().fit(X_train_tfidf, y_train)

In [181]:
print(f"Training Data Score: {mnb3.score(X_train_tfidf, y_train)}")

Training Data Score: 0.3470478667710044


A training score of only .347! That works worse.

Next is SGDC on TF_IDF data

In [200]:
sgdc3 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=1, max_iter=5, tol=None).fit(X_train_tfidf, y_train)

In [201]:
print(f"Training Data Score: {sgdc3.score(X_train_tfidf, y_train)}")

Training Data Score: 0.9734930729298696


Now to run on our test data. First we have to transform the test X data using our vectorizer (with just transform, not fit_transform), and then with the fitted tf_idf transformer.

In [185]:
X_test_counts = vect.transform(X_test)

In [186]:
X_test_counts.shape

(30244, 517778)

In [187]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [189]:
X_test_tfidf.shape

(30244, 517778)

In [202]:
print(f"Testing Data Score: {sgdc3.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.2884208438037297


Well, that's not nearly as nice. Let's see how the other models score on our test data.

In [203]:
print(f"Testing Data Score: {sgdc.score(X_test_counts, y_test)}")

Testing Data Score: 0.2517193492924216


In [204]:
print(f"Testing Data Score: {mnb3.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.20635497950006612


In [205]:
print(f"Testing Data Score: {mnb.score(X_test_counts, y_test)}")

Testing Data Score: 0.25221531543446635


In [207]:
X_test_counts

<30244x517778 sparse matrix of type '<class 'numpy.int64'>'
	with 2092202 stored elements in Compressed Sparse Row format>

In [18]:
#let's try just removing the numerical stuff and using only 1-grams in our countvectorizer model
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words=stop_list)
X_train_counts = vect.fit_transform(X_train)

In [19]:
vect.get_feature_names()

['aacacia',
 'aand',
 'aaron',
 'abacela',
 'abadal',
 'abadia',
 'abandon',
 'abandoned',
 'abate',
 'abbazia',
 'abbey',
 'abbott',
 'abbreviated',
 'abc',
 'abeille',
 'abeilles',
 'abeja',
 'abel',
 'abele',
 'abelis',
 'abelé',
 'abernathy',
 'abeyance',
 'abide',
 'abiding',
 'abilio',
 'abilities',
 'ability',
 'abiouness',
 'able',
 'ably',
 'abnormal',
 'abnormally',
 'aboard',
 'abondante',
 'aboriginal',
 'abound',
 'abounds',
 'abouriou',
 'about',
 'abovde',
 'above',
 'abraham',
 'abrasive',
 'abrasiveness',
 'abraxas',
 'abreu',
 'abrigo',
 'abroad',
 'abrupt',
 'abruptly',
 'abruptness',
 'abruzzo',
 'absence',
 'absense',
 'absent',
 'absinthe',
 'absolute',
 'absolutely',
 'absorb',
 'absorbed',
 'absorbingly',
 'absurd',
 'absurdly',
 'abtsberg',
 'abundance',
 'abundant',
 'abundantly',
 'abupt',
 'abused',
 'abuts',
 'abuzz',
 'abv',
 'ac',
 'acacia',
 'academic',
 'acai',
 'acccessible',
 'accelerate',
 'accelerates',
 'accent',
 'accented',
 'accenting',
 'accent

In [20]:
X_test_counts = vect.transform(X_test)

In [23]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [24]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [25]:
mnb = MultinomialNB().fit(X_train_counts, y_train)

In [26]:
print(f"Training Data Score: {mnb.score(X_train_counts, y_train)}")

Training Data Score: 0.3972953014956299


In [27]:
print(f"Testing Data Score: {mnb.score(X_test_counts, y_test)}")

Testing Data Score: 0.23148393069699774


In [30]:
sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_counts, y_train)

In [31]:
print(f"Training Data Score: {sgdc.score(X_train_counts, y_train)}")

Training Data Score: 0.30471393459787727


In [32]:
print(f"Testing Data Score: {sgdc.score(X_test_counts, y_test)}")

Testing Data Score: 0.1983533924084116


In [232]:
mnb_tf = MultinomialNB().fit(X_train_tfidf, y_train)

In [233]:
print(f"Training Data Score: {mnb_tf.score(X_train_tfidf, y_train)}")

Training Data Score: 0.27803066206698923


In [234]:
sgdc_tf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_tfidf, y_train)

In [236]:
print(f"Training Data Score: {sgdc_tf.score(X_train_tfidf, y_train)}")

Training Data Score: 0.4596775082386395


In [237]:
print(f"Testing Data Score: {sgdc_tf.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.20827271524930566


In [33]:
sgdc = SGDClassifier().fit(X_train_counts, y_train)



In [34]:
print(f"Training Data Score: {sgdc.score(X_train_counts, y_train)}")

Training Data Score: 0.3215659476915277


In [35]:
print(f"Testing Data Score: {sgdc.score(X_test_counts, y_test)}")

Testing Data Score: 0.1961711413834149


In [36]:
sgdc = SGDClassifier().fit(X_train_tfidf, y_train)



In [37]:
print(f"Training Data Score: {sgdc.score(X_train_tfidf, y_train)}")

Training Data Score: 0.4641743174879589


In [38]:
print(f"Testing Data Score: {sgdc.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.20827271524930566


In [213]:
#now let's try a CountVectorizer that removes the numerical stop words, uses 1- and 2-grams, and has a minimum doc freq of 3
vect = CountVectorizer(min_df=3, stop_words=stop_list, ngram_range=(1,2))
X_train_counts = vect.fit_transform(X_train)
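
As a rough illustration of what a minimum document frequency does, the sketch below keeps only terms that occur in at least `min_df` documents. It works on pre-tokenized toy documents. `filter_by_min_df` is a hypothetical helper that mimics the idea, not scikit-learn's implementation.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Return the sorted terms that appear in at least min_df documents."""
    # set(tokens) ensures a term is counted once per document
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return sorted(term for term, count in df.items() if count >= min_df)


# Invented, pre-tokenized documents
toy_docs = [["ripe", "cherry"], ["ripe", "plum"], ["ripe", "cherry", "oak"]]
print(filter_by_min_df(toy_docs, min_df=2))
```

Rare terms such as misspellings ("aacacia", "abupt") tend to appear in very few documents, which is why a cutoff like this shrinks the vocabulary.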

In [148]:
vect = CountVectorizer(stop_words=stop_list, ngram_range=(1,2))
X_train_counts = vect.fit_transform(X_train)

In [214]:
vect.get_feature_names()

['aaron',
 'aaron jackson',
 'aaron pott',
 'abacela',
 'abandon',
 'abate',
 'abate fetel',
 'abbey',
 'abbey ridge',
 'abbott',
 'abbreviated',
 'abbreviated finish',
 'abeja',
 'abelé',
 'abilities',
 'ability',
 'ability of',
 'ability to',
 'able',
 'able to',
 'ably',
 'ably marries',
 'abnormal',
 'abound',
 'abound along',
 'abound and',
 'abound from',
 'abound here',
 'abound in',
 'abound it',
 'abound on',
 'abound the',
 'abound there',
 'abound throughout',
 'abound while',
 'abound with',
 'abounds',
 'abounds in',
 'abounds on',
 'abounds with',
 'abouriou',
 'abouriou grape',
 'about',
 'about acid',
 'about all',
 'about an',
 'about and',
 'about any',
 'about anything',
 'about as',
 'about average',
 'about balance',
 'about berry',
 'about black',
 'about bright',
 'about but',
 'about buttered',
 'about chardonnay',
 'about cherries',
 'about citrus',
 'about complexity',
 'about crisp',
 'about dark',
 'about delicious',
 'about dry',
 'about easy',
 'about eleg

In [215]:
X_test_counts = vect.transform(X_test)

In [216]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [217]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [218]:
mnb = MultinomialNB().fit(X_train_counts, y_train)

In [219]:
print(f"Training Data Score: {mnb.score(X_train_counts, y_train)}")

Training Data Score: 0.6836362434008222


In [220]:
print(f"Testing Data Score: {mnb.score(X_test_counts, y_test)}")

Testing Data Score: 0.26626768945906626


In [221]:
sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_counts, y_train)

In [222]:
print(f"Training Data Score: {sgdc.score(X_train_counts, y_train)}")

Training Data Score: 0.6254532629421036


In [223]:
print(f"Testing Data Score: {sgdc.score(X_test_counts, y_test)}")

Testing Data Score: 0.24189921967993652


In [194]:
mnb = MultinomialNB().fit(X_train_tfidf, y_train)

In [195]:
print(f"Training Data Score: {mnb.score(X_train_tfidf, y_train)}")

Training Data Score: 0.6646791063693777


In [196]:
print(f"Testing Data Score: {mnb.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.6031278931358286


In [229]:
sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_tfidf, y_train)

In [230]:
print(f"Training Data Score: {sgdc.score(X_train_tfidf, y_train)}")

Training Data Score: 0.912896363977031


In [231]:
print(f"Testing Data Score: {sgdc.score(X_test_tfidf, y_test)}")

Testing Data Score: 0.2785676497817749


In [232]:
print(sgdc.predict(X_test_tfidf[3006]))
print(y_test[3006])

[85]
85


In [236]:
predicted = sgdc.predict(X_test_tfidf)

In [237]:
metrics.accuracy_score(y_test, predicted)

0.2785676497817749

In [240]:
metrics.mean_squared_error(y_test, predicted)

5.302241766962042

In [241]:
metrics.mean_absolute_error(y_test, predicted)

1.6391019706388044
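
Exact-match accuracy treats a prediction that is one point off the same as one that is ten points off, while the mean absolute error above suggests most misses are small. A complementary measure, sketched here on made-up values, is the share of predictions within a given number of points. `accuracy_within` is a hypothetical helper, not part of `sklearn.metrics`.

```python
def accuracy_within(y_true, y_pred, tolerance=1):
    """Return the fraction of predictions within `tolerance` points of the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up scores for illustration only
toy_true = [85, 90, 88, 92]
toy_pred = [85, 91, 84, 90]
for tol in (0, 1, 2):
    print(tol, accuracy_within(toy_true, toy_pred, tolerance=tol))
```

Called as `accuracy_within(y_test, predicted, tolerance=2)`, it would report how often the model lands within two points of the reviewer's score.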

In [242]:
#I was able to get a roughly 28% accurate model (of exact scores!) using tf_idf with a sgdc model. The details are:
#vect = CountVectorizer(min_df=3, stop_words=stop_list, ngram_range=(1,2))
#tfidf_transformer = TfidfTransformer()
#sgdc = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_tfidf, y_train)

In [287]:
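def desc_predict(words):
    #predict a score for one description using the fitted vect, tfidf_transformer and sgdc
    #the transformers expect a list of documents, so wrap the single string
    words_list = [words]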
    words_count = vect.transform(words_list)
    words_tfidf = tfidf_transformer.transform(words_count)
    predict_score = sgdc.predict(words_tfidf)
    print("This wine deserves a score of...")
    print(predict_score[0])

In [288]:
desc_predict('This wine has hints of blackberry and a surprisingly bright feel.\
             Floral notes can be discerned followed by a strong, crisp finish')

This wine deserves a score of...
88


In [289]:
desc_predict('This wine is not very good')

This wine deserves a score of...
82


In [290]:
desc_predict("This is a fantastic wine, that deserves to be cherished and drunk upon special occasions.")

This wine deserves a score of...
91


In [294]:
desc_predict('Hearty, bold, and breathtaking - this wine is outstanding.')

This wine deserves a score of...
96


In [295]:
desc_predict('You should avoid this wine.')

This wine deserves a score of...
80


In [296]:
desc_predict('Earthy, woody, with a hint of berry juicy')

This wine deserves a score of...
84


In [297]:
desc_predict("This wine contains some material over 100 years old, but shows no signs of fragility. Instead, it's concentrated through age and should hold in the bottle indefinitely. It's dark coffee-brown in color, with delectable aromas of rancio, dried fig, molasses and black tea, yet despite enormous concentration avoids excessive weight. And it's amazingly complex and fresh on the nearly endless finish.")

This wine deserves a score of...
100


In [299]:
desc_predict("aromas of seashell and hints of turnip abound. A slight fishy taste is preceeded by a mushy mouthfeel")

This wine deserves a score of...
82


In [302]:
desc_predict("Triumphant and bold, this wine astounds with new tastes upon every sip. It hints at rosemary then delivers a punch of juiciness")

This wine deserves a score of...
92


In [304]:
desc_predict("berry juicy")

This wine deserves a score of...
88
