# Predicting points based on description (NLP) and other features with CatBoost


In [2]:
import numpy as np
import pandas as pd

First of all, we are going to load our data and clean the dataset.

In [5]:
data=pd.read_csv('wine-reviews/winemag-data-130k-v2.csv')

In [6]:
data=data.dropna(subset=['price'])

In [7]:
data=data.drop_duplicates(['description','title'])
data=data.reset_index(drop=True)

In [8]:
data=data.fillna(-1)

# NLP
Our basic features are ready, so we can now build features from the description using the NLTK library.
NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”


In [10]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
import re

from nltk.tokenize import RegexpTokenizer

We convert every word to lowercase because there is no difference in meaning between 'This' and 'this'. We also get rid of irrelevant characters by replacing everything that is not a letter (digits, punctuation) with a space.

In [11]:
data['description']= data['description'].str.lower()
data['description']= data['description'].apply(lambda elem: re.sub('[^a-zA-Z]',' ', elem))  
data['description']

0         this is ripe and fruity  a wine that is smooth...
1         tart and snappy  the flavors of lime flesh and...
2         pineapple rind  lemon pith and orange blossom ...
3         much like the regular bottling from       this...
4         blackberry and raspberry aromas show a typical...
5         here s a bright  informal red that opens with ...
6         this dry and restrained wine offers spice in p...
7         savory dried thyme notes accent sunnier flavor...
8         this has great depth of flavor with its fresh ...
9         soft  supple plum envelopes an oaky structure ...
10        this is a dry wine  very spicy  with a tight  ...
11        slightly reduced  this wine offers a chalky  t...
12        building on     years and six generations of w...
13        zesty orange peels and apple notes abound in t...
14        baked plum  molasses  balsamic vinegar and che...
15        raw black cherry aromas are direct and simple ...
16        desiccated blackberry  leather
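
To see what these two steps do to a single description, here is a minimal sketch on a made-up sentence (the sample text is an assumption, not a row from the dataset). Note how the apostrophe in "here's" becomes a space, which is why a stray `s` token appears among the most common words further down.

```python
import re

# Hypothetical sample sentence, not taken from the dataset.
sample = "Here's a 2015 Pinot, ripe & fruity!"

lowered = sample.lower()
# Replace every character that is not an ASCII letter with a space.
cleaned = re.sub('[^a-zA-Z]', ' ', lowered)

print(repr(cleaned))
print(cleaned.split())
```

Each removed character is replaced one-for-one, so the cleaned string keeps its original length and only gains extra spaces; the tokenizer in the next step ignores those.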

We can't analyze whole sentences directly, so we use a regex tokenizer to split each description into a list of words.

In [12]:
tokenizer = RegexpTokenizer(r'\w+')
words_descriptions = data['description'].apply(tokenizer.tokenize)
words_descriptions.head()

0    [this, is, ripe, and, fruity, a, wine, that, i...
1    [tart, and, snappy, the, flavors, of, lime, fl...
2    [pineapple, rind, lemon, pith, and, orange, bl...
3    [much, like, the, regular, bottling, from, thi...
4    [blackberry, and, raspberry, aromas, show, a, ...
Name: description, dtype: object

Now that each description is split into individual words, we can build the vocabulary and also add a new feature: the description length in tokens.

In [13]:
all_words = [word for tokens in words_descriptions for word in tokens]
data['description_lengths']= [len(tokens) for tokens in words_descriptions]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))

4624968 words total, with a vocabulary size of 29486
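
The same bookkeeping can be traced by hand on a tiny example. This is only a sketch: the two descriptions are made up, and `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')` so that the snippet needs nothing beyond the standard library.

```python
import re
from collections import Counter

# Hypothetical, already-cleaned descriptions.
descriptions = ["ripe cherry and plum", "cherry and oak"]

# re.findall(r'\w+') stands in for RegexpTokenizer(r'\w+') here.
tokenized = [re.findall(r'\w+', text) for text in descriptions]

all_words = [word for tokens in tokenized for word in tokens]
lengths = [len(tokens) for tokens in tokenized]   # per-description length feature
vocab = sorted(set(all_words))                    # unique words
counts = Counter(all_words)                       # word frequencies

print(lengths)
print(len(all_words), len(vocab))
print(counts['cherry'], counts['oak'])
```

Seven tokens in total collapse to a vocabulary of five, because `cherry` and `and` each occur twice.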


Let's check the most common words in our vocabulary.

In [14]:
from collections import Counter
count_all_words = Counter(all_words)
count_all_words.most_common(100)

[('and', 302908),
 ('the', 190834),
 ('a', 154824),
 ('of', 149861),
 ('with', 104095),
 ('this', 98014),
 ('is', 81926),
 ('it', 74638),
 ('wine', 66708),
 ('flavors', 55626),
 ('in', 55172),
 ('to', 48455),
 ('s', 46898),
 ('fruit', 42627),
 ('on', 40239),
 ('that', 34359),
 ('aromas', 34293),
 ('palate', 33563),
 ('finish', 30983),
 ('acidity', 28935),
 ('from', 27774),
 ('but', 27565),
 ('tannins', 25883),
 ('drink', 25692),
 ('cherry', 25586),
 ('black', 24936),
 ('are', 22572),
 ('ripe', 22538),
 ('has', 20419),
 ('for', 19024),
 ('red', 18603),
 ('by', 17485),
 ('notes', 16619),
 ('spice', 16210),
 ('oak', 16022),
 ('an', 15673),
 ('as', 15504),
 ('its', 15195),
 ('dry', 15044),
 ('nose', 14962),
 ('now', 14954),
 ('rich', 14690),
 ('berry', 14530),
 ('fresh', 14506),
 ('full', 13629),
 ('plum', 13077),
 ('sweet', 11813),
 ('apple', 11652),
 ('blend', 11580),
 ('soft', 11563),
 ('blackberry', 11319),
 ('well', 11317),
 ('white', 11010),
 ('fruits', 10844),
 ('light', 10839),
 ('

We can see that there are many stop words and other words that can't help us with our goal of predicting points.
Now we want to
1. Reduce words that share the same root to a single form (for example run, running, runs -> run). We will use PorterStemmer from the NLTK library.
2. Delete all stop words.
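
A minimal sketch of both operations on a handful of tokens. The token list and the tiny stop word set are assumptions made for illustration; the notebook itself uses the full `stopwords.words('english')` list. Note that Porter stems are not always dictionary words.

```python
from nltk.stem.porter import PorterStemmer

# Tiny hand-written stop word set, for illustration only; the notebook
# uses the full list from stopwords.words('english').
toy_stopwords = {'this', 'is', 'and', 'the'}

# Hypothetical tokens from one cleaned description.
tokens = ['this', 'is', 'ripe', 'and', 'the', 'cherry', 'flavors']

ps = PorterStemmer()
kept = [word for word in tokens if word not in toy_stopwords]
stemmed = [ps.stem(word) for word in kept]

print(kept)
print(stemmed)
```

Stop words are removed before stemming, matching the order used in the cell that follows.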


In [15]:
stopword_set = set(stopwords.words('english'))  # set membership checks are fast
ps = PorterStemmer()
# Drop stop words first, then stem what remains.
words_descriptions = words_descriptions.apply(lambda elem: [word for word in elem if word not in stopword_set])
words_descriptions = words_descriptions.apply(lambda elem: [ps.stem(word) for word in elem])
data['description_cleaned'] = words_descriptions.apply(lambda elem: ' '.join(elem))

In [16]:
all_words = [word for tokens in words_descriptions for word in tokens]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
count_all_words = Counter(all_words)
count_all_words.most_common(100)

2822364 words total, with a vocabulary size of 21073


[('wine', 69125),
 ('flavor', 62686),
 ('fruit', 53836),
 ('finish', 35863),
 ('aroma', 35564),
 ('palat', 33674),
 ('acid', 33330),
 ('cherri', 29505),
 ('drink', 28905),
 ('tannin', 27717),
 ('black', 24963),
 ('ripe', 24037),
 ('dri', 22844),
 ('note', 21892),
 ('spice', 20040),
 ('red', 18821),
 ('rich', 18382),
 ('fresh', 18095),
 ('berri', 16569),
 ('oak', 16557),
 ('show', 15940),
 ('nose', 14976),
 ('plum', 14252),
 ('sweet', 13919),
 ('full', 13729),
 ('offer', 13698),
 ('blackberri', 13395),
 ('textur', 13370),
 ('blend', 13280),
 ('appl', 13155),
 ('balanc', 13005),
 ('bodi', 13003),
 ('soft', 12045),
 ('age', 11719),
 ('crisp', 11409),
 ('well', 11328),
 ('white', 11150),
 ('light', 11149),
 ('dark', 10653),
 ('structur', 10643),
 ('citru', 10109),
 ('raspberri', 9909),
 ('cabernet', 9858),
 ('vanilla', 9829),
 ('hint', 9750),
 ('herb', 9717),
 ('miner', 9669),
 ('fruiti', 9653),
 ('bright', 9380),
 ('give', 9222),
 ('pepper', 9131),
 ('touch', 8885),
 ('lemon', 8666),
 ('y

As we can see, the vocabulary shrank by about 8.4k entries (from 29,486 to 21,073), and the remaining words are much more meaningful.
There are three different ways to represent our descriptions:

1. **Bag of Words Counts** - represents each description as a vector with one entry per vocabulary word, holding how many times that word occurs (or just 0/1 for absent/present in the binary variant).
2. **TF-IDF (Term Frequency, Inverse Document Frequency)** - weights words by how often they occur in a description, discounting words that are frequent across the whole dataset.
3. **Word2Vec** - captures semantic meaning. We won't use it in this kernel.
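
To make the difference between the first two representations concrete, here is a small sketch on a made-up two-document corpus (the corpus is an assumption; the vectorizers run with their default settings rather than the ones used later in this kernel).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: 'ripe' occurs in both documents, 'cherry' in one.
corpus = ["ripe cherry cherry", "ripe plum"]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus).toarray()

tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(corpus).toarray()

# vocabulary_ maps each term to its column index (alphabetical order).
cherry = count_vect.vocabulary_['cherry']
ripe = count_vect.vocabulary_['ripe']

print(counts)          # raw counts per document
print(tfidf.round(3))  # TF-IDF weights per document
```

In the count matrix `ripe` gets a 1 in both rows. In the TF-IDF matrix it is weighted lower than `cherry` and `plum`, because a word that appears in every document says little about any one of them.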

We will check which representation performs better in our case: Bag of Words Counts or TF-IDF.

First we will test Bag of Words Counts.

Let's define some useful functions and then test the chosen techniques.


In [17]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor, cv

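def prepare_dataframe(vect, data, features=True):
    # Vectorize the cleaned descriptions into a dense matrix of word features.
    vectorized=vect.fit_transform(data['description_cleaned']).toarray()
    vectorized=pd.DataFrame(vectorized)
    if features:
        # Keep the original columns except the target, the index column and the text.
        X=data.drop(columns=['points','Unnamed: 0','description','description_cleaned'])
        X=X.fillna(-1)
        print(X.columns)
        X=pd.concat([X.reset_index(drop=True),vectorized.reset_index(drop=True)],axis=1)
        # Positions of the categorical columns in X
        # (every original column except price and description_lengths).
        categorical_features_indices =[0,1,3,4,5,6,7,8,9,10]
    else:
        # Use the word features only.
        X=vectorized
        categorical_features_indices =[]
    y=data['points']
    return X,y,categorical_features_indices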

In [18]:
# Model definition and training.
def rmse(model, X, y):
    # Root mean squared error of the model's predictions on (X, y).
    return np.sqrt(np.mean((model.predict(X) - y) ** 2))

def perform_model(X_train, y_train,X_valid, y_valid,X_test, y_test,categorical_features_indices,name):
    model = CatBoostRegressor(
        random_seed = 100,
        loss_function = 'RMSE',
        iterations=800,
    )
    
    model.fit(
        X_train, y_train,
        cat_features = categorical_features_indices,
        verbose=False,
        eval_set=(X_valid, y_valid)
    )
    
    # RMSE is computed explicitly: depending on the CatBoost version,
    # model.score() may return R^2 instead of RMSE.
    print(f"{name} technique RMSE on training data: {rmse(model, X_train, y_train)}")
    print(f"{name} technique RMSE on test data: {rmse(model, X_test, y_test)}")
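
The model is trained with `loss_function = 'RMSE'`, so the reported scores are on the same scale as the points column. As a reminder of what the metric measures, here is a small worked example with made-up values (the numbers are assumptions, not model output).

```python
import numpy as np

# Hypothetical true and predicted point scores.
y_true = np.array([88.0, 90.0, 92.0])
y_pred = np.array([87.0, 90.0, 94.0])

errors = y_pred - y_true                # [-1, 0, 2]
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt((1 + 0 + 4) / 3)

print(rmse)
```

Because the errors are squared before averaging, a single prediction that is off by 2 points contributes four times as much as one that is off by 1.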
    

In [19]:
def prepare_variable(vect, data, features_append=True):
    X, y , categorical_features_indices = prepare_dataframe(vect, data,features_append)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, 
                                                        random_state=42)
    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, 
                                                        random_state=52)
    return X_train, y_train,X_valid, y_valid,X_test, y_test, categorical_features_indices
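
The two chained splits hold out 5% of the rows for testing and then 20% of the remaining 95% for validation, which works out to roughly 76% train, 19% validation and 5% test. A quick sketch on 1000 placeholder rows (an assumption standing in for the real feature matrix) confirms the sizes:

```python
from sklearn.model_selection import train_test_split

# 1000 hypothetical rows stand in for the real feature matrix.
rows = list(range(1000))

# Same test_size and random_state values as in prepare_variable.
train_valid, test = train_test_split(rows, test_size=0.05, random_state=42)
train, valid = train_test_split(train_valid, test_size=0.2, random_state=52)

print(len(train), len(valid), len(test))
```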

In [20]:
vect= CountVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data)
perform_model(*training_variable, 'Bag of Words Counts')

Index(['country', 'designation', 'price', 'province', 'region_1', 'region_2',
       'taster_name', 'taster_twitter_handle', 'title', 'variety', 'winery',
       'description_lengths'],
      dtype='object')
Bag of Words Counts technique RMSE on training data: 1.5107158933234435
Bag of Words Counts technique RMSE on test data: 1.5872264225695767
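
The hard-coded categorical index list in `prepare_dataframe` can be cross-checked against the column order printed above. The sketch below derives the indices by name instead; treating `price` and `description_lengths` as the only numeric columns is an assumption based on that printed list.

```python
# Column order as printed by prepare_dataframe.
columns = ['country', 'designation', 'price', 'province', 'region_1',
           'region_2', 'taster_name', 'taster_twitter_handle', 'title',
           'variety', 'winery', 'description_lengths']

# Assumption: these two columns are the only numeric ones.
numeric_columns = {'price', 'description_lengths'}

categorical_features_indices = [
    i for i, name in enumerate(columns) if name not in numeric_columns
]
print(categorical_features_indices)
```

Deriving the indices this way keeps them correct if a column is later added or dropped.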


Now we can try TF-IDF.

In [None]:
vect= TfidfVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data)
perform_model(*training_variable, 'TF-IDF')


So far we have used other meaningful features alongside the description. Let's drop all of them and predict based ONLY on the descriptions.

In [None]:
vect= CountVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data, False)
perform_model(*training_variable, 'Bag of Words Counts')

In [None]:
vect= TfidfVectorizer(analyzer='word', token_pattern=r'\w+',max_features=500)
training_variable=prepare_variable(vect, data, False)
perform_model(*training_variable, 'TF-IDF')

As we can see, our scores are similar, and they clearly outperform the approach without any NLP operations (test score of about 2.09).
* EDA + CatBoost without NLP: https://www.kaggle.com/mistrzuniu1/eda-catboost-feature-importance/