# Predict the quality of Wine based on the description with custom Word Embeddings

## Load and explore data

In [1]:
import gensim.models
import gensim
import pandas as pd

df = pd.read_csv('data/winemag-data-130k-v2.csv')

In [2]:
# Split points into a binary label: more than 89 points = 'good', otherwise 'bad'
df['label'] = df['points'].apply(lambda x: 'good' if x > 89 else 'bad')

## Modelling

In [3]:
from util import cleanse_data

clean_txt = cleanse_data(df)
df['clean_desc'] = clean_txt

In [4]:
# Tokenise every cleaned description into a list of words (one list per review)
corpus = [desc.split(' ') for desc in df.clean_desc]

In [5]:
model = gensim.models.Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
print(f'The word embedding has a vocabulary size of {len(model.wv)} words.')

# Raw string so the backslash is not read as an (invalid) escape sequence
model.save(r'embeddings\description_emb.bin')

The word embedding has a vocabulary size of 30463 words.
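
With `min_count=1`, every token that occurs at least once gets a vector, which is why the vocabulary is this large. As a rough illustration on a made-up toy corpus (not the wine data), the sketch below counts how many distinct words would survive different `min_count` thresholds, given that gensim ignores words whose total frequency is below the threshold. `vocab_size` is a hypothetical helper, not part of gensim.

```python
from collections import Counter

# Toy corpus for illustration only; the real corpus is the tokenised wine descriptions.
toy_corpus = [
    ['ripe', 'cherry', 'and', 'oak'],
    ['ripe', 'plum', 'and', 'spice'],
    ['crisp', 'apple', 'and', 'citrus'],
]

def vocab_size(corpus, min_count):
    """Count distinct words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return sum(1 for freq in counts.values() if freq >= min_count)

for min_count in (1, 2, 3):
    print(min_count, vocab_size(toy_corpus, min_count))
```

Raising `min_count` removes rare words, many of which are typos or one-off names, at the cost of having no vector for them at prediction time.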


In [6]:
from sklearn.model_selection import train_test_split

X = df['clean_desc']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
from util import train_bernoulli

model_min1_vs100 = train_bernoulli(r'embeddings\description_emb.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.6779380650125024
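
`train_bernoulli` lives in `util` and is not shown here. A common way to feed word embeddings to a classifier, which may or may not match that helper, is to average the vectors of all words in a description. The sketch below does this in plain Python with made-up 3-dimensional vectors; `doc_vector` and `binarize` are hypothetical names. It also illustrates a caveat: scikit-learn's `BernoulliNB` binarizes its inputs (default `binarize=0.0`), so under this assumption only the sign of each averaged dimension would reach the model.

```python
# Hypothetical 3-dimensional embeddings; real vectors would come from model.wv.
toy_vectors = {
    'ripe':   [0.5, -0.25, 0.0],
    'cherry': [0.25, 0.75, -0.5],
    'oak':    [-0.25, 0.25, 1.0],
}

def doc_vector(tokens, vectors, size=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]

def binarize(vector, threshold=0.0):
    """Mimic BernoulliNB's default preprocessing: 1 if value > threshold, else 0."""
    return [1 if value > threshold else 0 for value in vector]

features = doc_vector(['ripe', 'cherry', 'unknown'], toy_vectors)
print(features)            # averaged embedding of the two known words
print(binarize(features))  # what a Bernoulli model would effectively see
```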


So we have an accuracy of around 68%. This is worse than the TF-IDF vectorizer. There are several ways to improve on it: we could fine-tune the parameters of our word embedding, use prebuilt word embeddings, or use a model other than BernoulliNB. Since the word embedding we just built was rather simple, we start by improving it.

In [8]:
# Let's create word embeddings with different parameters
# (raw strings keep the backslash in each path from being read as an escape sequence)

model_min_count_two_100 = gensim.models.Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2, workers=4)
model_min_count_two_100.save(r'embeddings\description_emb_min2_vs100.bin')

model_min_count_three_100 = gensim.models.Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=3, workers=4)
model_min_count_three_100.save(r'embeddings\description_emb_min3_vs100.bin')

model_min_count_one_300 = gensim.models.Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, workers=4)
model_min_count_one_300.save(r'embeddings\description_emb_min1_vs300.bin')

model_min_count_two_300 = gensim.models.Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=2, workers=4)
model_min_count_two_300.save(r'embeddings\description_emb_min2_vs300.bin')

model_min_count_three_300 = gensim.models.Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=3, workers=4)
model_min_count_three_300.save(r'embeddings\description_emb_min3_vs300.bin')
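
The same parameter sweep could be written as a loop, which makes it easier to add further values later. This is a sketch only: `embedding_grid` is a hypothetical helper, and unlike the cell above it also generates a `min1_vs100` file name, whereas the notebook saved that first model as `description_emb.bin`. The training calls are left as comments so the snippet runs without the corpus.

```python
import os
from itertools import product

def embedding_grid(min_counts=(1, 2, 3), vector_sizes=(100, 300), folder='embeddings'):
    """Yield (min_count, vector_size, path) for every parameter combination."""
    for vector_size, min_count in product(vector_sizes, min_counts):
        filename = f'description_emb_min{min_count}_vs{vector_size}.bin'
        yield min_count, vector_size, os.path.join(folder, filename)

for min_count, vector_size, path in embedding_grid():
    print(min_count, vector_size, path)
    # In the notebook, each model would be trained and saved here, e.g.:
    # model = gensim.models.Word2Vec(sentences=corpus, vector_size=vector_size,
    #                                window=5, min_count=min_count, workers=4)
    # model.save(path)
```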

In [9]:
model_min2_vs100 = train_bernoulli(r'embeddings\description_emb_min2_vs100.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.6874783612233122


In [10]:
model_min3_vs100 = train_bernoulli(r'embeddings\description_emb_min3_vs100.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.6847855356799385


In [11]:
model_min1_vs300 = train_bernoulli(r'embeddings\description_emb_min1_vs300.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.6748220811694556


In [12]:
model_min2_vs300 = train_bernoulli(r'embeddings\description_emb_min2_vs300.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.6829390267359108


In [13]:
model_min3_vs300 = train_bernoulli(r'embeddings\description_emb_min3_vs300.bin', X_train, X_test, y_train, y_test)

Accuracy: 0.676860934795153
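
To compare the six runs side by side, the accuracies printed above can be collected and ranked. The numbers are copied from the cell outputs; the dictionary layout and variable names are just one convenient choice.

```python
# Accuracies copied from the cell outputs above, keyed by (min_count, vector_size).
results = {
    (1, 100): 0.6779380650125024,
    (2, 100): 0.6874783612233122,
    (3, 100): 0.6847855356799385,
    (1, 300): 0.6748220811694556,
    (2, 300): 0.6829390267359108,
    (3, 300): 0.676860934795153,
}

# Sort from best to worst accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for (min_count, vector_size), accuracy in ranked:
    print(f'min_count={min_count}, vector_size={vector_size}: {accuracy:.4f}')

best_params, best_accuracy = ranked[0]
```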


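Note: The results vary slightly between runs, even with a fixed random state for the data split, most likely because Word2Vec training with several worker threads (`workers=4`) is not deterministic. The differences are small, though.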
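
If repeatable numbers are needed, gensim's documentation describes a fully reproducible Word2Vec run as requiring a fixed `seed`, a single worker thread, and, in Python 3, a fixed `PYTHONHASHSEED` environment variable. A minimal sketch of such a configuration follows; `reproducible_w2v_params` is a hypothetical helper, and the training call is left as a comment because it needs the corpus.

```python
def reproducible_w2v_params(vector_size=100, min_count=1, seed=42):
    """Keyword arguments for a repeatable (but slower) Word2Vec run."""
    return {
        'vector_size': vector_size,
        'window': 5,
        'min_count': min_count,
        'workers': 1,  # a single thread avoids ordering jitter between workers
        'seed': seed,  # fixed seed for the random number generator
    }

params = reproducible_w2v_params(min_count=2)
print(params)
# Assumed usage in the notebook (PYTHONHASHSEED must be set before Python starts):
# model = gensim.models.Word2Vec(sentences=corpus, **params)
```

Training on one worker is noticeably slower, so this trade-off is mainly worthwhile when small accuracy differences between configurations need to be compared.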

The best-performing word embedding is the one with min_count=2 and vector_size=100. Its accuracy is around 69%, which is still far worse than the TF-IDF vectorizer.
It seems we cannot improve much further with our own word embedding. Let's test some prebuilt word embeddings.