1. NLP feature extraction
2. PCA
3. regression framework that can sit on top of either PCA or AENNs
4. RBM/AENN

## 1. Feature Extraction

Many of the columns in the data are text blocks or categorical, so they need to be converted to numeric representations before matrix operations can be applied to them.

Processing steps for description text:
1. convert to lowercase
2. strip out nonalphanumeric characters
3. tokenize
4. strip out stop words
5. apply one-hot encoding for remaining tokens
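
The steps above can be sketched on two toy sentences using only the standard library. This is a minimal illustration, not the notebook's implementation: `STOP_WORDS` is a small hypothetical stand-in for NLTK's English stop-word list, and `tokenize`, `docs`, `vocab` and `encoded` are names introduced here for the example.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop words.
STOP_WORDS = {"the", "and", "is", "a", "of", "with"}

def tokenize(text):
    """Lowercase, strip non-alphanumeric characters (keeping hyphens),
    split on whitespace, and drop stop words."""
    cleaned = re.sub(r"[^\w\s\-]", "", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

docs = [
    "Tart and snappy, the flavors of lime dominate.",
    "The palate is crisp, with lime and stainless-steel notes.",
]
token_sets = [set(tokenize(d)) for d in docs]

# A sorted vocabulary gives a stable column order across runs.
vocab = sorted(set().union(*token_sets))

# One row per document, one column per vocabulary token (1 = token present).
encoded = [[1 if tok in toks else 0 for tok in vocab] for toks in token_sets]

print(vocab)
print(encoded)
```

Each row marks which vocabulary tokens occur in that document, so a token that appears several times still contributes a single 1.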

(Optional): if time permits, use bag-of-words representation or tf-idf instead of one-hot encoding.
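
For comparison, here is a rough sketch of how the optional bag-of-words and tf-idf representations differ from the binary encoding. It assumes one common tf-idf definition (term frequency normalized by document length, `idf = log(N / df)`); library implementations often use smoothed variants, so their values may differ. The `token_lists` data is hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized descriptions (lowercased, stop words removed).
token_lists = [
    ["tart", "lime", "lime", "crisp"],
    ["ripe", "fruity", "crisp"],
    ["tart", "tannic", "earthy"],
]

vocab = sorted({tok for toks in token_lists for tok in toks})
n_docs = len(token_lists)

# Bag-of-words: raw count of each vocabulary token per document.
counts = [Counter(toks) for toks in token_lists]
bow = [[c[tok] for tok in vocab] for c in counts]

# Document frequency: number of documents that contain each token.
doc_freq = {tok: sum(1 for c in counts if tok in c) for tok in vocab}

# tf-idf (unsmoothed variant): (count / document length) * log(N / df).
tfidf = [
    [(c[tok] / len(toks)) * math.log(n_docs / doc_freq[tok]) for tok in vocab]
    for c, toks in zip(counts, token_lists)
]
```

Unlike the binary encoding, bag-of-words keeps repeat counts (`lime` appears twice in the first document), and tf-idf additionally down-weights tokens that occur in many documents.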

`description` is the only text block field that will be processed this way. Simple one-hot encoding will be used for the remaining categorical features: `country`, `designation`, `province`, `region_1`, `region_2`, `taster_name`, `variety`, `winery`.


In [38]:
# Setup
import os
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

# Display settings
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = None  # None disables truncation; -1 is deprecated in recent pandas

# Import data, using first 10k rows
CSV_PATH = os.path.join('..', 'data', 'raw', 'winemag-data-130k-v2.csv')
df = pd.read_csv(CSV_PATH, nrows=10000)
df.shape

(10000, 14)

In [39]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish.",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore),Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew.",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley),Pinot Noir,Sweet Cheeks


In [40]:
# Get list of common stopwords from NLTK package
stop_words = set(stopwords.words('english'))

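# convert to lowercase, remove all nonalphanumeric characters (keeping hyphens),
# split into tokens, and collect the vocabulary
unique_words = set()
df['description'] \
    .str.lower() \
    .str.replace(r'[^\w\s\-]', '', regex=True) \
    .str.split() \
    .apply(unique_words.update)

# remove stopwords; sort so the token (column) order is reproducible across runs
tokens = sorted(unique_words - stop_words)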

# how many unique words are there in the description field?
len(tokens)

13820

In [41]:
df['description_clean'] = df['description'] \
    .str.lower() \
    .str.replace(r'[^\w\s\-]', '', regex=True) \
    .str.split()

df['description_clean'].head()

0    [aromas, include, tropical, fruit, broom, brimstone, and, dried, herb, the, palate, isnt, overly, expressive, offering, unripened, apple, citrus, and, dried, sage, alongside, brisk, acidity]                                                                                            
1    [this, is, ripe, and, fruity, a, wine, that, is, smooth, while, still, structured, firm, tannins, are, filled, out, with, juicy, red, berry, fruits, and, freshened, with, acidity, its, already, drinkable, although, it, will, certainly, be, better, from, 2016]                       
2    [tart, and, snappy, the, flavors, of, lime, flesh, and, rind, dominate, some, green, pineapple, pokes, through, with, crisp, acidity, underscoring, the, flavors, the, wine, was, all, stainless-steel, fermented]                                                                        
3    [pineapple, rind, lemon, pith, and, orange, blossom, start, off, the, aromas, the, palate, is, a, bit, more, opulent, with, notes, 

In [57]:
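# Extract matrix of one-hot encodings for description
def encode_description(words):
    # build the lookup set once per row rather than once per token
    present = set(words)
    return [1 if token in present else 0 for token in tokens]

description_enc = pd.DataFrame(df['description_clean'].apply(encode_description).tolist(),
                               index=df.index)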

# Save to csv
description_enc.to_csv(os.path.join('..', 'data', 'intermediate', 'description_encoded.csv'),
                       header=False,
                       index=False)

In [58]:
# Test simple one-hot encoding
cols_to_enc = ['country', 'designation', 'province', 'region_1', 'region_2', 
               'taster_name', 'variety', 'winery']

# Define a function to process each column
def get_one_hot_matrix(col_name):
    '''
    Takes a string column name as input, outputs a pd DataFrame containing one-hot encoding of the column.
    '''
    
    print(f'encoding {col_name}')
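    # get tokens, skipping NaN: NaN never compares equal to anything,
    # so a NaN token would only add an all-zero column
    col_tokens = df[col_name].dropna().unique()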

    # return a matrix of one-hot encodings for each token
    col_enc = df.apply(lambda row: [1 if row[col_name] == token else 0 for token in col_tokens],
                       axis=1) \
                .apply(pd.Series)
    
    # save intermediate file to csv, just in case
    csv_path = os.path.join('..', 'data', 'intermediate', f'{col_name}_encoded.csv')
    col_enc.to_csv(csv_path, header=False, index=False)
    
    return col_enc

# compute and append encoded cols
encoded_cols = [description_enc]
for i in cols_to_enc:
    encoded = get_one_hot_matrix(i)
    encoded_cols.append(encoded)

encoding country
encoding designation
encoding province
encoding region_1
encoding region_2
encoding taster_name
encoding variety
encoding winery
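
As a point of comparison, the same kind of categorical encoding can be sketched with pandas' built-in `pd.get_dummies`, which avoids the row-wise `apply`. This is an alternative sketch, not the method used above; the `toy` frame is hypothetical data standing in for the wine columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
toy = pd.DataFrame({
    "country": ["Italy", "US", "US", None],
    "variety": ["White Blend", "Pinot Gris", "Riesling", "Riesling"],
})

# One indicator column per observed category. With the default
# dummy_na=False, a missing value yields zeros in all of that
# column's indicators.
toy_enc = pd.get_dummies(toy, columns=["country", "variety"], dtype=int)

print(toy_enc.shape)
```

Unlike the unnamed integer columns produced above, `get_dummies` labels each output column as `<column>_<category>`, which can make the feature matrix easier to inspect.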


In [59]:
# fill in missing values for price feature
df['price_clean'] = df['price'].fillna(df['price'].mean())
encoded_cols.append(df['price_clean'])

In [60]:
# compile and export final feature matrix
FINAL_DATA_PATH = os.path.join('..', 'data', 'final')
os.makedirs(FINAL_DATA_PATH, exist_ok=True)  # make sure the output directory exists
final_data = pd.concat(encoded_cols, axis=1)
final_data.to_csv(os.path.join(FINAL_DATA_PATH, 'features.csv'), header=False, index=False)

# export label vector
df['points'].to_csv(os.path.join(FINAL_DATA_PATH, 'labels.csv'), header=False, index=False)