### Meaning of each column:

**1) country:** The country that the wine is from.<br>
**2) description:** A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.<br>
**3) designation:** The vineyard within the winery where the grapes that made the wine are from.<br>
**4) points:** The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80).<br>
**5) price:** The cost for a bottle of the wine. <br>
**6) province:** The province or state that the wine is from.<br>
**7) region_1:** The wine-growing area in a province or state (e.g. Napa).<br>
**8) region_2:** A more specific region within a wine-growing area (e.g. Rutherford inside the Napa Valley). This value is sometimes blank.<br>
**9) variety:** The type of grapes used to make the wine (e.g. Pinot Noir). <br>
**10) winery:** The winery that made the wine. <br>

# Questions:
1) What are the specialties of the wines from different countries?<br>
2) How does the wine taste with different types of grapes (variety)?<br>
3) What makes some wines more expensive, or earn a higher score, than others?<br>
4) Is there any correlation between the taste of the wine and its price/score?

In [1]:
import utils
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
%matplotlib inline

In [5]:
### Start from raw data:
# raw_data = utils.loadRawData()
# data = utils.cleanAndTransform(raw_data)
# utils.saveData(data)

### Start from processed data:
data = utils.loadCleanData()

In [7]:
# I'm just gonna work with these columns:
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.)
M = data[['country', 'description', 'points', 'price', 'variety', 'clean_des']].to_numpy()

# Train-validation-test split:
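X = M[:, -1]    # clean_des: the cleaned description text
Y = M[:, :-2]   # country, description, points, price (note: `:-2` also drops 'variety')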
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.18, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.17, random_state=42)
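
Because the second split is taken from what remains after the first, the 0.17 applies to 82% of the data rather than to all of it. A minimal sketch of the resulting proportions, using a toy array of 10,000 rows (an assumed size, not the real dataset) and throwaway `toy_` names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 10,000 rows (assumed size, for illustration only).
n = 10_000
toy_X = np.arange(n)
toy_Y = np.arange(n)

# Same two-step split as in the cell above.
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.18, random_state=42)
toy_X_train, toy_X_val, toy_Y_train, toy_Y_val = train_test_split(
    toy_X_train, toy_Y_train, test_size=0.17, random_state=42)

# Fraction of the full dataset that ends up in each part.
fractions = {
    "train": len(toy_X_train) / n,
    "val": len(toy_X_val) / n,
    "test": len(toy_X_test) / n,
}
print(fractions)
```

The split therefore works out to roughly 68% train, 14% validation, and 18% test.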

In [8]:
# Transform to document-term matrix:
# (Here I assume that the order of the words is not important, so I use a bag-of-words model.)
def vectorize(words):
    vec = CountVectorizer()
    bow = vec.fit_transform(words)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    return pd.DataFrame(bow.toarray(), columns=vec.get_feature_names_out())

In [9]:
# Note that these matrices will differ: each call fits its own vocabulary, so the columns do not line up across splits.
# However, if we train the model properly, it should be robust to the variation between data points.
Xv_train = vectorize(X_train)
Xv_val = vectorize(X_val)
Xv_test = vectorize(X_test)

#### Top 15 words in the training data:

In [11]:
Xv_train.sum().sort_values(ascending=False).head(15)

wine       53466
flavors    50748
fruit      36760
finish     24564
aromas     22234
cherry     20996
acidity    19341
tannins    19194
palate     19120
black      16419
ripe       16105
dry        15972
drink      15271
sweet      14190
oak        13959
dtype: int64

#### Top 15 words in the validation data:
(We expect this to be fairly similar to the top 15 from the training data.)

In [10]:
Xv_val.sum().sort_values(ascending=False).head(15)

wine       11038
flavors    10279
fruit       7725
finish      5027
aromas      4578
cherry      4160
palate      4008
acidity     3999
tannins     3944
black       3380
dry         3245
ripe        3227
drink       3126
oak         2869
sweet       2820
dtype: int64
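
To make "fairly similar" concrete, the two lists can be compared directly. The sketch below copies the counts printed above into two Series, checks that both lists contain the same 15 words, and computes the validation-to-training count ratio for each word. As a rough reference point it uses the row ratio implied by the split, 0.17 / 0.83; word counts need not track row counts exactly, so this is only a sanity check.

```python
import pandas as pd

# Counts copied from the two outputs above.
train_top = pd.Series({
    "wine": 53466, "flavors": 50748, "fruit": 36760, "finish": 24564,
    "aromas": 22234, "cherry": 20996, "acidity": 19341, "tannins": 19194,
    "palate": 19120, "black": 16419, "ripe": 16105, "dry": 15972,
    "drink": 15271, "sweet": 14190, "oak": 13959,
})
val_top = pd.Series({
    "wine": 11038, "flavors": 10279, "fruit": 7725, "finish": 5027,
    "aromas": 4578, "cherry": 4160, "palate": 4008, "acidity": 3999,
    "tannins": 3944, "black": 3380, "dry": 3245, "ripe": 3227,
    "drink": 3126, "oak": 2869, "sweet": 2820,
})

# Do both lists contain the same words, regardless of order?
same_words = set(train_top.index) == set(val_top.index)

# Per-word ratio of validation count to training count (aligned by word).
ratio = (val_top / train_top).sort_values()

# Validation/training row ratio implied by the two-step split.
expected_ratio = 0.17 / 0.83

print(same_words)
print(ratio.round(3))
print(round(expected_ratio, 3))
```

The two lists contain the same 15 words; only a few adjacent ranks swap (palate/acidity/tannins, dry/ripe, oak/sweet).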