# Text Analysis on Wine Data Using NLP

Wine is enjoyed by many adults, yet with so many kinds available there is a lot about it that most of us do not know. In this notebook we explore the **winemag-data-130k-v2.csv** dataset from Kaggle, which contains roughly 130,000 wine reviews.

**Content**

This dataset contains these columns:
* country: The country that the wine is from
* description: The taster's description of the wine
* designation: The vineyard within the winery where the grapes that made the wine are from
* points: The score WineEnthusiast gave the wine, on a scale of 1-100
* price: The cost for a bottle of the wine
* province: The province or state that the wine is from
* region_1: The wine growing area in a province or state
* region_2: Sometimes there are more specific regions specified within a wine growing area, but this value can be blank
* taster_name: The name of the wine taster
* taster_twitter_handle: Twitter handle of the wine taster
* title: The title of the wine review, which often contains the vintage
* variety: The type of grapes used to make the wine
* winery: The winery that made the wine

The goals of this notebook include:

* Finding the relationship between countries and wine ratings
* Analyzing the title to extract the wine's vintage year
* Comparing wine price against vintage year
* Checking whether the grape variety affects the selling price of the wine
* Exploring wine descriptions using NLP


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
reviews = pd.read_csv('winemag-data-130k-v2.csv')
reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [3]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 14 columns):
Unnamed: 0               129971 non-null int64
country                  129908 non-null object
description              129971 non-null object
designation              92506 non-null object
points                   129971 non-null int64
price                    120975 non-null float64
province                 129908 non-null object
region_1                 108724 non-null object
region_2                 50511 non-null object
taster_name              103727 non-null object
taster_twitter_handle    98758 non-null object
title                    129971 non-null object
variety                  129970 non-null object
winery                   129971 non-null object
dtypes: float64(1), int64(2), object(11)
memory usage: 13.9+ MB


# Data Cleaning

In [4]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [5]:
# Drop exact duplicate rows, keeping the first occurrence of each.
reviews = reviews.drop_duplicates()
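
One caveat: `drop_duplicates()` compares every column by default, and `Unnamed: 0` appears to be a row identifier. If that column is unique for every row, no two rows can match and nothing is dropped. A minimal sketch on a made-up toy frame (the data and the `content_cols` name are illustrative, not part of the notebook) shows the difference when the identifier is left out of the comparison:

```python
import pandas as pd

# Toy frame standing in for the reviews data; 'Unnamed: 0' mimics the row-id column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "title": ["Wine A 2012", "Wine A 2012", "Wine B 2013"],
    "description": ["Ripe and fruity.", "Ripe and fruity.", "Tart and snappy."],
})

# Comparing every column finds no duplicates because the id column always differs.
all_cols = toy.drop_duplicates()

# Comparing every column except the id treats rows 0 and 1 as duplicates.
content_cols = [c for c in toy.columns if c != "Unnamed: 0"]
deduped = toy.drop_duplicates(subset=content_cols)

print(len(all_cols), len(deduped))
```

Whether the same review under two different ids should count as a duplicate is a judgment call; the sketch only shows how to express that choice.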

Next, we check for missing (NaN/None) values in country, price, and points. Because the analysis depends on these three columns, we remove any row where one of them is missing. After that, we look at the rest of the data to see whether any more cleaning is needed.

In [6]:
reviews[pd.isnull(reviews['country'])]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
913,913,,"Amber in color, this wine has aromas of peach ...",Asureti Valley,87,30.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2014 Asureti Valley Chinuri,Chinuri,Gotsa Family Wines
3131,3131,,"Soft, fruity and juicy, this is a pleasant, si...",Partager,83,,,,,Roger Voss,@vossroger,Barton & Guestier NV Partager Red,Red Blend,Barton & Guestier
4243,4243,,"Violet-red in color, this semisweet wine has a...",Red Naturally Semi-Sweet,88,18.0,,,,Mike DeSimone,@worldwineguys,Kakhetia Traditional Winemaking 2012 Red Natur...,Ojaleshi,Kakhetia Traditional Winemaking
9509,9509,,This mouthwatering blend starts with a nose of...,Theopetra Malagouzia-Assyrtiko,92,28.0,,,,Susan Kostrzewa,@suskostrzewa,Tsililis 2015 Theopetra Malagouzia-Assyrtiko W...,White Blend,Tsililis
9750,9750,,This orange-style wine has a cloudy yellow-gol...,Orange Nikolaevo Vineyard,89,28.0,,,,Jeff Jenssen,@worldwineguys,Ross-idi 2015 Orange Nikolaevo Vineyard Chardo...,Chardonnay,Ross-idi
11150,11150,,"A blend of 85% Melnik, 10% Grenache Noir and 5...",,89,20.0,,,,Jeff Jenssen,@worldwineguys,Orbelus 2013 Melnik,Melnik,Orbelus
11348,11348,,"Light and fruity, this is a wine that has some...",Partager,82,,,,,Roger Voss,@vossroger,Barton & Guestier NV Partager White,White Blend,Barton & Guestier
14030,14030,,"This Furmint, grown in marl soils, has aromas ...",Márga,88,25.0,,,,Jeff Jenssen,@worldwineguys,St. Donat 2013 Márga White,White Blend,St. Donat
16000,16000,,"Jumpy, jammy aromas of foxy black fruits are s...",Valle de los Manantiales Vineyard,86,40.0,,,,Michael Schachner,@wineschach,Familia Deicas 2015 Valle de los Manantiales V...,Tannat,Familia Deicas
16749,16749,,Winemaker: Bartho Eksteen. This wooded Sauvy s...,Cape Winemakers Guild Vloekskoot Wooded,91,,,,,Lauren Buzzeo,@laurbuzz,Bartho Eksteen 2016 Cape Winemakers Guild Vloe...,Sauvignon Blanc,Bartho Eksteen


In [7]:
reviews = reviews.dropna(axis=0, how='any', subset=['country'])

In [8]:
reviews[pd.isnull(reviews['price'])]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
13,13,Italy,This is dominated by oak and oak-driven aromas...,Rosso,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Masseria Setteporte 2012 Rosso (Etna),Nerello Mascalese,Masseria Setteporte
30,30,France,Red cherry fruit comes laced with light tannin...,Nouveau,86,,Beaujolais,Beaujolais-Villages,,Roger Voss,@vossroger,Domaine de la Madone 2012 Nouveau (Beaujolais...,Gamay,Domaine de la Madone
31,31,Italy,Merlot and Nero d'Avola form the base for this...,Calanìca Nero d'Avola-Merlot,86,,Sicily & Sardinia,Sicilia,,,,Duca di Salaparuta 2010 Calanìca Nero d'Avola-...,Red Blend,Duca di Salaparuta
32,32,Italy,"Part of the extended Calanìca series, this Gri...",Calanìca Grillo-Viognier,86,,Sicily & Sardinia,Sicilia,,,,Duca di Salaparuta 2011 Calanìca Grillo-Viogni...,White Blend,Duca di Salaparuta
50,50,Italy,This blend of Nero d'Avola and Syrah opens wit...,Scialo,86,,Sicily & Sardinia,Sicilia,,,,Viticultori Associati Canicatti 2008 Scialo Re...,Red Blend,Viticultori Associati Canicatti
54,54,Italy,"A blend of Nero d'Avola and Nerello Mascalese,...",Rosso,85,,Sicily & Sardinia,Sicilia,,,,Corvo 2010 Rosso Red (Sicilia),Red Blend,Corvo
79,79,Portugal,"Grown on the sandy soil of Tejo, the wine is t...",Bridão,86,,Tejo,,,Roger Voss,@vossroger,Adega Cooperativa do Cartaxo 2014 Bridão Touri...,Touriga Nacional,Adega Cooperativa do Cartaxo
137,137,South Africa,"This is great Chenin Blanc, wood fermented but...",Hope Marguerite,90,,Walker Bay,,,Roger Voss,@vossroger,Beaumont 2005 Hope Marguerite Chenin Blanc (Wa...,Chenin Blanc,Beaumont
159,159,Italy,"Intense aromas of ripe red berry, menthol, esp...",Filo di Seta,91,,Tuscany,Brunello di Montalcino,,Kerin O’Keefe,@kerinokeefe,Castello Romitorio 2011 Filo di Seta (Brunell...,Sangiovese,Castello Romitorio


In [9]:
reviews = reviews.dropna(axis=0, how='any', subset=['price'])

In [10]:
reviews[pd.isnull(reviews['points'])]

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
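
The three checks above could also be done in one pass. A minimal sketch on a made-up toy frame (the values and the `key_cols` name are illustrative): `isna().sum()` counts the missing values per column, and `dropna` with a `subset` list removes every row that is missing any of them.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing country and one missing price.
toy = pd.DataFrame({
    "country": ["Italy", None, "US", "France"],
    "price": [np.nan, 30.0, 14.0, 20.0],
    "points": [87, 87, 87, 90],
})

key_cols = ["country", "price", "points"]

# Number of missing values in each column of interest.
missing = toy[key_cols].isna().sum()
print(missing)

# Keep only the rows where all three columns are present.
cleaned = toy.dropna(subset=key_cols)
print(len(cleaned))
```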


Here, we will drop columns of data that we do not need.

Columns:
* taster_name
* taster_twitter_handle
* province
* designation
* Unnamed: 0

In [11]:
# Drop all unneeded columns in a single call.
reviews = reviews.drop(['taster_name', 'taster_twitter_handle', 'province', 'designation', 'Unnamed: 0'], axis=1)

In [12]:
reviews.head()

Unnamed: 0,country,description,points,price,region_1,region_2,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",87,15.0,,,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",87,14.0,Willamette Valley,Willamette Valley,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",87,13.0,Lake Michigan Shore,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",87,65.0,Willamette Valley,Willamette Valley,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
5,Spain,Blackberry and raspberry aromas show a typical...,87,15.0,Navarra,,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem


If you look at the title column, you can see a year in most entries. That year indicates the vintage. In winemaking, vintage refers to the process of picking the grapes and creating the finished wine. A vintage wine is one made from grapes that were all, or primarily, grown and harvested in a single specified year.

We will extract the year out of the title and create a new column for it so that we can do some data analysis with it.

In [19]:
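# Return the first all-digit token in the title as the vintage (a string),
# or None if the title has no such token. The first numeric token is not
# always a year (e.g. "10 Knots 2010 ..." yields "10").
def vintage_extraction(title):
    for word in title.split():
        if word.isdigit():
            return word
    return None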

In [20]:
reviews['vintage'] = reviews['title'].apply(vintage_extraction)

In [36]:
reviews['vintage'].value_counts()

2013    15078
2014    14766
2012    14641
2011    11363
2010    11041
2015     9577
2009     9010
2008     6690
2007     6470
2006     5152
2016     3521
2005     3282
2004     1600
2000      732
2001      668
1999      619
1998      540
2003      498
2002      332
1997      297
41         83
14         66
1996       64
3          60
1995       44
1852       40
7          38
10         29
39         27
772        26
        ...  
1789        1
360         1
325         1
1070        1
1904        1
1966        1
1965        1
253         1
1947        1
34          1
585         1
1974        1
68          1
1752        1
19          1
1935        1
01          1
69          1
1872        1
1969        1
002         1
26          1
428         1
310         1
128         1
733         1
1941        1
13          1
1919        1
813         1
Name: vintage, Length: 178, dtype: int64

In [37]:
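# Note: 'vintage' holds strings, so this comparison is lexicographic, not numeric
# (for example, '14' > '100'), and the filter does not catch every small number.
reviews[reviews['vintage'] <= '100']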

Unnamed: 0,country,description,points,price,region_1,region_2,title,variety,winery,vintage
459,France,"A rich, bottled-aged wine, full of toast as we...",92,100.0,Champagne,,Chanoine NV Tzarina No 1 Brut (Champagne),Champagne Blend,Chanoine,1
1412,Portugal,The balance between wood aging and fruit is at...,88,30.0,,,Quinta dos Murças NV 10 Anos Old Tawny (Port),Port,Quinta dos Murças,10
5800,Argentina,Earthy cherry and prune aromas lack pop and se...,83,8.0,Mendoza,,Region 1 2014 Malbec (Mendoza),Malbec,Region 1,1
9408,Italy,This sparkling wine has a delicate fragrance o...,83,15.0,Prosecco,,Piera Martellozzo NV 075 Carati (Prosecco),Glera,Piera Martellozzo,075
10400,France,Aged in glass demijohns for one year and oak t...,91,30.0,Maury,,Mas Amiel NV Cuvée Spéciale 10 Ans d'Âge Grena...,Grenache,Mas Amiel,10
11063,Portugal,Taking the fashion for statement bottles to a ...,89,25.0,,,Martha's Wines NV 10 Years Tawny (Port),Port,Martha's Wines,10
13150,US,"A strongly flavored, sugary, unsubtle wine, le...",84,22.0,Paso Robles,Central Coast,10 Knots 2010 Viognier (Paso Robles),Viognier,10 Knots,10
20563,Portugal,This first aged tawny release from Wine & Soul...,91,50.0,,,Wine & Soul NV 10 Years Old Tawny (Port),Port,Wine & Soul,10
21024,Argentina,Briny cherry and raspberry aromas offer up a n...,83,8.0,Mendoza,,Region 1 2013 Cabernet Sauvignon (Mendoza),Cabernet Sauvignon,Region 1,1
22437,Portugal,"Ervamoira, one of the most remote of the Douro...",89,40.0,,,Ramos-Pinto NV RP10 Quinta da Ervamoira Tawny ...,Port,Ramos-Pinto,10


Looking at the values extracted into the vintage column, we can see numbers that are not years. In the dataframe above, the number sometimes refers to an age statement such as "10 Years", sometimes to a label such as "No 1", and sometimes to part of the winery name such as "10 Knots" or "Region 1". We need to remove these rows, since the extracted value is not a usable vintage year.

To do this without losing useful data, we will raise the cutoff gradually: first drop values up to 100, then up to 500, then up to 1000. After each step we will look at the values that remain and check whether anything else needs to be dropped.
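
An alternative to filtering after the fact is to search the title for a plausible year directly. The sketch below assumes that a vintage is a standalone four-digit number between 1900 and 2017; both bounds are illustrative choices, not taken from the dataset documentation, and `extract_year` is a hypothetical helper name.

```python
import re

# A standalone run of exactly four digits, e.g. "2010" but not "RP10" or "075".
YEAR_PATTERN = re.compile(r"\b(\d{4})\b")

def extract_year(title, min_year=1900, max_year=2017):
    """Return the first standalone 4-digit number within the range as an int, else None."""
    for match in YEAR_PATTERN.finditer(title):
        year = int(match.group(1))
        if min_year <= year <= max_year:
            return year
    return None

# Titles taken from the rows shown above.
titles = [
    "10 Knots 2010 Viognier (Paso Robles)",
    "Region 1 2014 Malbec (Mendoza)",
    "Chanoine NV Tzarina No 1 Brut (Champagne)",
    "Quinta dos Murças NV 10 Anos Old Tawny (Port)",
]
for t in titles:
    print(t, "->", extract_year(t))
```

On the dataframe this could be applied as `reviews['title'].apply(extract_year)`; titles without a plausible year would come back as missing values. The trade-off is that a genuine vintage outside the chosen range is skipped, so the bounds should be checked against the data.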

# Text Preprocessing

In [18]:
import string
from nltk.corpus import stopwords

In [None]:
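# Build the stopword set once; calling stopwords.words('english') per word is very slow.
# Requires the NLTK stopwords corpus (nltk.download('stopwords')).
STOPWORDS = set(stopwords.words('english'))

def process_text(text):
    # Remove punctuation characters, then drop English stopwords.
    nopunc = "".join(char for char in text if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]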

In [None]:
#from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#bow_transformer = CountVectorizer(analyzer=process_text).fit(reviews['description'])
#print(len(bow_transformer.vocabulary_))

In [None]:
#description_bow = bow_transformer.transform(reviews['description'])

In [None]:
#print('Shape of Sparse Matrix: ', description_bow.shape)
#print('Amount of Non-Zero occurrences: ', description_bow.nnz)

In [None]:
#Note: this value is the percentage of non-zero entries (the density of the matrix)
#sparsity = (100.0 * description_bow.nnz / (description_bow.shape[0] * description_bow.shape[1]))
#print('sparsity: {}'.format(round(sparsity)))
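
To make the quantities in the commented-out cells above concrete, here is a small standard-library sketch of a bag-of-words count matrix built from made-up token lists. It does not use scikit-learn, and `docs`, `vocab`, and `matrix` are illustrative names.

```python
from collections import Counter

# Made-up tokenized descriptions, as process_text might return them.
docs = [
    ["ripe", "fruity", "smooth"],
    ["tart", "snappy", "lime"],
    ["ripe", "lime", "ripe"],
]

# One column per distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document; each cell is how often that token occurs in the document.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

n_rows, n_cols = len(matrix), len(vocab)
nnz = sum(1 for row in matrix for value in row if value != 0)
density = 100.0 * nnz / (n_rows * n_cols)

print("Shape:", (n_rows, n_cols))
print("Non-zero entries:", nnz)
print("Percent non-zero:", round(density, 2))
```

With real review text the vocabulary is far larger than any single description, so most cells are zero, which is why such matrices are usually stored in a sparse format.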

# TF-IDF

In [None]:
#from sklearn.feature_extraction.text import TfidfTransformer
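
As background for this section: TF-IDF weights a term by how often it appears in a document, scaled down by how many documents contain it. The sketch below uses one common textbook form, tf = count / document length and idf = ln(N / df), on made-up token lists. scikit-learn's `TfidfTransformer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up tokenized documents.
docs = [["cherry", "oak", "cherry"], ["oak", "vanilla"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} for one document, with tf = count / len(doc)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

A term that occurs in every document ("oak" here) gets a weight of zero under this form, while a term concentrated in one document ("cherry") is weighted up.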