### Overview

#### Problem space:
I model this as a regression problem, in which the price of a wine is predicted from its features and reviews.
The overall objective is to demonstrate a comparative evaluation of different approaches and models.


#### Dataset
The full dataset available on Kaggle is much larger than the file used here: https://www.kaggle.com/zynicide/wine-reviews

I use the file with 130K rows and a 70:30 train-test split.

#### Approach: 
Pre-processing: Based on the exploratory analysis of the dataset, the pre-processing steps are:
1. Dropping rows with missing price values
2. Adding a `year` column whose values are extracted from the existing `title` column
3. Replacing missing values with "none" for categorical columns and 0 for numerical columns.

Two approaches are demonstrated, each with a neural-network regression model at its core.

1. Baseline: Some feature engineering is performed outside the neural network, which is a simple feed-forward network with two dense hidden layers. Categorical features are encoded as n-dimensional one-hot vectors, so very large and sparse feature vectors are fed to the network. Results from three models with slightly different configurations are shown (Models 1-3 below; the network architecture is displayed in the output cell). Further models can be trained using other combinations of the provided feature extraction and feature selection methods.


2. Improvement: Feature learning and regression both happen inside the neural network. Categorical features are encoded as numerical indexes in a pre-processing step, instead of the large one-hot encodings above, and the network learns a dense vector (embedding) for each categorical feature. These embeddings can also be extracted and fed to another regression algorithm. The textual review description is excluded in this approach; it could be included in more complex network architectures or network ensembles. See Model 5 below; the network architecture is displayed in the output cell.
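
To make the contrast between the two encodings concrete, here is a minimal sketch on a toy column. The values and the helper names (`one_hot`, `to_index`) are illustrative assumptions and are not part of the notebook.

```python
# Toy categorical column (illustrative values only).
countries = ["US", "France", "Italy", "US", "Spain"]

# Vocabulary: sorted distinct values, each mapped to an integer index.
vocab = sorted(set(countries))
index_of = {value: i for i, value in enumerate(vocab)}

def one_hot(value):
    """Approach 1 style: an n-dimensional vector with a single 1."""
    vec = [0] * len(vocab)
    vec[index_of[value]] = 1
    return vec

def to_index(value):
    """Approach 2 style: one integer, later looked up in an embedding table."""
    return index_of[value]

one_hot_rows = [one_hot(c) for c in countries]
index_rows = [to_index(c) for c in countries]

print(vocab)            # ['France', 'Italy', 'Spain', 'US']
print(one_hot_rows[0])  # [0, 0, 0, 1]
print(index_rows)       # [3, 0, 1, 3, 2]
```

The one-hot width grows with the number of distinct values, while the index encoding stays at one number per column regardless of cardinality.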

#### Evaluation:
Evaluation is performed on the test split.
The R2 score is used as the primary evaluation metric.
The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, gets an R2 score of 0.0.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
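
The three cases described above (perfect, constant and worse-than-constant predictions) can be checked with a small hand-rolled version of the formula. This is only a sketch: `r2_score_manual` is a hypothetical helper and the prices are made up.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed over the whole evaluation set."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

prices = [10.0, 20.0, 30.0, 40.0]  # made-up prices, mean is 25.0

perfect = r2_score_manual(prices, prices)
constant = r2_score_manual(prices, [25.0] * 4)             # always predicts the mean
poor = r2_score_manual(prices, [40.0, 30.0, 20.0, 10.0])   # reversed ordering

print(perfect, constant, poor)  # 1.0 0.0 -3.0
```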

#### Results:
Given the limited experiments, approach 2 (test R2 of about 0.48 for Model 5) appears to strongly outperform approach 1 (negative test R2 for Models 1-3).

In [2]:
from platform import python_version
print(python_version())

2.7.14


In [4]:
import numpy as np
import sklearn
import tensorflow as tf

print(np.__version__)
print(sklearn.__version__)
print(tf.__version__)

1.13.1
0.19.1
1.14.2


In [5]:
import os
import pandas as pd
import pandas_profiling

In [None]:
# change the path to your dataset location
wine_reviews = pd.read_csv("/Users/snegi/Downloads/wine-reviews/winemag-data-130k-v2.csv", 
                           index_col=0, encoding='utf-8')
wine_reviews.head(2)

In [6]:
wine_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
country                  129908 non-null object
description              129971 non-null object
designation              92506 non-null object
points                   129971 non-null int64
price                    120975 non-null float64
province                 129908 non-null object
region_1                 108724 non-null object
region_2                 50511 non-null object
taster_name              103727 non-null object
taster_twitter_handle    98758 non-null object
title                    129971 non-null object
variety                  129970 non-null object
winery                   129971 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


In [7]:
'''pandas profiling for exploratory analysis of dataset, like number of unique and null values in each column, correlation
between continuous valued columns, statistical summary (mean, median, standard deviation, quartiles etc.) 
of continuous valued columns'''
profile = pandas_profiling.ProfileReport(wine_reviews)
display(profile)


Dataset overview
Statistic,Value
Number of variables,13
Number of observations,129971
Total size in memory,18.9 MiB
Average record size in memory,152.0 B

Variable types
Type,Count
Numeric,2
Categorical,11
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Categorical variables
Variable,Distinct count,Missing (n),Most frequent values (count)
country,44,63,"US (54504); France (22093); Italy (19540); Spain (6645); Portugal (5691); Chile (4472); Argentina (3800); Austria (3345); Australia (2329); Germany (2165)"
description,119955,0,"the most frequent descriptions occur 3 times each"
designation,37980,37465,"Reserve (2009); Estate (1322); Reserva (1259); Riserva (698); Estate Grown (621); Brut (513); Dry (413); Barrel sample (375); Crianza (343); Estate Bottled (342)"
province,426,63,"California (36247); Washington (8639); Bordeaux (5941); Tuscany (5897); Oregon (5373); Burgundy (3980); Northern Spain (3851); Piedmont (3729); Mendoza Province (3264); Veneto (2716)"
region_1,1230,21247,"Napa Valley (4480); Columbia Valley (WA) (4124); Russian River Valley (3091); California (2629); Paso Robles (2350); Willamette Valley (2301); Mendoza (2301); Alsace (2163); Champagne (1613); Barolo (1599)"
region_2,18,79460,"Central Coast (11065); Sonoma (9028); Columbia Valley (8103); Napa (6814); Willamette Valley (3423); California Other (2663); Finger Lakes (1777); Sierra Foothills (1462); Napa-Sonoma (1169); Central Valley (1062)"
taster_name,20,26244,"Roger Voss (25514); Michael Schachner (15134); Kerin O’Keefe (10776); Virginie Boone (9537); Paul Gregutt (9532); Matt Kettmann (6332); Joe Czerwinski (5147); Sean P. Sullivan (4966); Anna Lee C. Iijima (4415); Jim Gordon (4177)"
taster_twitter_handle,16,31213,"@vossroger (25514); @wineschach (15134); @kerinokeefe (10776); @vboone (9537); @paulgwine (9532); @mattkettmann (6332); @JoeCz (5147); @wawinereport (4966); @gordone_cellars (4177); @AnneInVino (3685)"
title,118840,0,"Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma County) (11); Korbel NV Brut Sparkling (California) (9); Segura Viudas NV Extra Dry Sparkling (Cava) (8); Ruinart NV Brut Rosé (Champagne) (7); Gloria Ferrer NV Blanc de Noirs Sparkling (Carneros) (7); Segura Viudas NV Aria Estate Extra Dry Sparkling (Cava) (7); Mumm Napa NV Brut Prestige Sparkling (Napa Valley) (6); Jacquart NV Brut Mosaïque (Champagne) (6); J Vineyards & Winery NV Brut Rosé Sparkling (Russian River Valley) (6); Pierre Sparr NV Brut Réserve Sparkling (Crémant d'Alsace) (6)"
variety,708,1,"Pinot Noir (13272); Chardonnay (11753); Cabernet Sauvignon (9472); Red Blend (8946); Bordeaux-style Red Blend (6915); Riesling (5189); Sauvignon Blanc (4967); Syrah (4142); Rosé (3564); Merlot (3102)"
winery,16757,0,"Wines & Winemakers (222); Testarossa (218); DFJ Vinhos (215); Williams Selyem (211); Louis Latour (199); Georges Duboeuf (196); Chateau Ste. Michelle (194); Concha y Toro (164); Columbia Crest (159); Kendall-Jackson (130)"

Numeric variables
Statistic,points,price
Distinct count,21,391
Missing (n),0,8996
Infinite (n),0,0
Mean,88.447,35.363
Minimum,80,4
5-th percentile,84,10
Q1,86,17
Median,88,25
Q3,91,42
95-th percentile,93,85
Maximum,100,3300
Range,20,3296
Interquartile range,5,25
Standard deviation,3.0397,41.022
Coef of variation,0.034368,1.16
Kurtosis,-0.29596,829.52
MAD,2.4855,20.03
Skewness,0.045921,18.001
Sum,11495563,4278100
Variance,9.24,1682.8
Memory size,2.0 MiB,2.0 MiB
Most frequent values (count),"88 (17207); 87 (16933); 90 (15410); 86 (12600); 89 (12226); 91 (11359); 92 (9613); 85 (9530); 93 (6489); 84 (6480)","20.0 (6940); 15.0 (6066); 25.0 (5805); 30.0 (4951); 18.0 (4883); 12.0 (3934); 40.0 (3872); 35.0 (3801); 13.0 (3549); 16.0 (3547)"
Smallest values (count),"80 (397); 81 (692); 82 (1836); 83 (3025); 84 (6480)","4.0 (11); 5.0 (46); 6.0 (120); 7.0 (433); 8.0 (892)"
Largest values (count),"96 (523); 97 (229); 98 (77); 99 (33); 100 (19)","1900.0 (1); 2000.0 (2); 2013.0 (1); 2500.0 (2); 3300.0 (1)"

Sample rows
Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


#### Important observations from the above dataset profile:
- all columns except "points" and "price" are categorical or textual
- 'region_2' and 'designation' have a very high number of missing values
- review points only range from 80 to 100, i.e. only good reviews are available. This suggests that the textual reviews ("description") are likely to show low variability in sentiment across entries

### Variations of feature extraction

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer

In [8]:
'''One boolean-valued feature is constructed for each string value that a feature can take.
For instance, if title has cardinality n, it is expanded to n features in the output.'''
# the description column can be dropped, since it results in very high-dimensional output feature vectors
def feature_extraction_1(df, description=False):
    if not description:
        df_X = df.drop(["price", "description"], axis=1)
    else:
        df_X = df.drop(["price"], axis=1)
    
    df_X_dict = df_X.to_dict("records")
    vec = DictVectorizer()
    X = vec.fit_transform(df_X_dict).toarray()
    print("Shape of data transformed into feature vectors:", X.shape)
    y = df["price"].values
    print("Size of prediction vector:", len(y))
    return(X, y)

In [9]:
# similar to feature_extraction_1 above, but drops the features with many missing values
def feature_extraction_2(df):
    df_X = df.drop(["price","region_2", "designation", "description"], axis=1)
    print("number of columns ", len(df_X.columns))
    df_X_dict = df_X.to_dict("records")
    vec = DictVectorizer()
    X = vec.fit_transform(df_X_dict).toarray()
    print("Shape of data transformed into feature vectors:", X.shape)
    y = df["price"].values
    print("Size of prediction vector:", len(y))
    return(X, y)

In [10]:
#excludes description feature
'''Uses a count based transformation for the title field, no. of transformed dimensions is equal to the vocabulary 
size of the text in the field'''
def feature_extraction_3(df):
    df_X = df.drop(["price", "region_2", "designation", "description"], axis=1)
    print("number of columns ", len(df_X.columns))
    df_X_1 = df_X[["title"]]
    df_X_2 = df_X.drop(["title"], axis=1)
    
    X_1_vectorizer = CountVectorizer()
    X_1 = X_1_vectorizer.fit_transform(df_X_1["title"].tolist()).toarray()
 
    df_X_2_dict = df_X_2.to_dict("records")
    vec = DictVectorizer()
    X_2 = vec.fit_transform(df_X_2_dict).toarray()
    
    X = np.concatenate((X_1,X_2),axis=1)
    
    print("Shape of data transformed into feature vectors:",X.shape)
    y = df["price"].values
    print("Size of prediction vector:",len(y))
    return(X,y)
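
A count-based transformation of the title field can be illustrated without scikit-learn. The sketch below approximates what `CountVectorizer` does with its default settings (lower-casing and the documented default token pattern); `tokenize` and `count_vector` are hypothetical helpers, and the two titles are the year-stripped ones shown later in `df_no_nulls.head(2)`.

```python
import re
from collections import Counter

titles = [
    "Quinta dos Avidagos Avidagos Red (Douro)",
    "Rainstorm Pinot Gris (Willamette Valley)",
]

def tokenize(text):
    # Lower-case, then keep runs of two or more word characters
    # (the default token_pattern documented for CountVectorizer).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# One output dimension per distinct token, in sorted order.
vocabulary = sorted(set(tok for title in titles for tok in tokenize(title)))

def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

rows = [count_vector(title) for title in titles]

print(vocabulary)
print(rows[0])  # [2, 1, 1, 0, 0, 1, 0, 1, 0, 0]
print(rows[1])  # [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
```

Each row has as many entries as the vocabulary, which is why the title field alone contributes tens of thousands of dimensions on the real data.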

In [33]:
'''Same as above, but includes the description feature by transforming the description field into a vocabulary-count-based vector'''
def feature_extraction_4(df):
    df_X = df.drop(["price", "region_2", "designation"], axis=1)
    print("number of columns ", len(df_X.columns))
    df_X_1 = df_X[["title"]]
    df_X_2 = df_X[["description"]]
    df_X_3 = df_X.drop(["title", "description"], axis=1)
        
    X_1_vectorizer = CountVectorizer()
    X_1 = X_1_vectorizer.fit_transform(df_X_1["title"].tolist()).toarray()
 
    #top 100 tokens considered
    X_2_vectorizer = CountVectorizer(max_features=100)
    X_2 = X_2_vectorizer.fit_transform(df_X_2["description"].tolist()).toarray()

    df_X_3_dict = df_X_3.to_dict("records")
    vec = DictVectorizer()
    X_3 = vec.fit_transform(df_X_3_dict).toarray()
        
    X = np.concatenate((X_1, X_2, X_3),axis=1)
    
    print("Shape of data transformed into feature vectors:", X.shape)
    y = df["price"].values
    print("Size of prediction vector:", len(y))
    return(X, y)

### Feature selection and dimensionality reduction

In [12]:
from sklearn.decomposition import PCA

def dim_reduction(X, dimensions):
    pca = PCA(n_components=dimensions)
    X = pca.fit_transform(X)
    print("feature vector reduced to ", dimensions, " dimensions")
    return X

In [13]:
from sklearn.feature_selection import VarianceThreshold

def feature_selection(X, p=0.8):
    feature_sel = VarianceThreshold(threshold=(p * (1 - p)))
    X = feature_sel.fit_transform(X)
    print("no. of features selected: ", X.shape[1])
    return X
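
The threshold `p * (1 - p)` used in `feature_selection` is the variance of a 0/1 feature that equals 1 in a fraction `p` of the rows. As far as I understand `VarianceThreshold`, columns whose variance does not exceed the threshold are dropped, so with `p=0.8` a one-hot column that is constant in more than 80% of the rows is removed. The sketch below checks this on toy columns; the data and helper names are illustrative assumptions.

```python
def bernoulli_threshold(p):
    """Variance of a 0/1 feature that equals 1 with probability p."""
    return p * (1 - p)

def variance(column):
    """Population variance, the quantity compared against the threshold."""
    mean = sum(column) / float(len(column))
    return sum((x - mean) ** 2 for x in column) / len(column)

# Toy one-hot columns over 20 rows (illustrative data).
rare = [1] * 1 + [0] * 19     # set in 5% of rows, variance 0.0475
common = [1] * 8 + [0] * 12   # set in 40% of rows, variance 0.24

threshold = bernoulli_threshold(0.8)  # about 0.16, the default p above

kept = [name for name, col in [("rare", rare), ("common", common)]
        if variance(col) > threshold]

print(kept)  # ['common']
```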

### NN model architecture

In [14]:
import tensorflow as tf
from tensorflow.keras.optimizers import *

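def nn_model_fn(n_dimensions): 
    # a simple feed-forward model architecture; the layers have no activation, so they are linear
    batch_size = 32       # not used here: the batch size is passed to model.fit()
    optimizer = 'adam'
    learning_rate = 0.01  # not used: 'adam' is passed by name, so the Keras default learning rate applies
    
    input_layer = tf.keras.layers.Input(shape=(n_dimensions,))
    Dense_layer_1 = tf.keras.layers.Dense(20, input_shape=(n_dimensions,))    
    Dense_layer_2 = tf.keras.layers.Dense(10)    
    Dense_layer_3 = tf.keras.layers.Dense(1, kernel_initializer = tf.initializers.glorot_uniform(seed=42))    
    # NOTE: Dense_layer_3 is defined but not applied below, so the model output is the
    # 10-unit Dense_layer_2 (this matches the summaries printed for Models 1-3).
    # For a scalar price prediction the output would be
    # Dense_layer_3(Dense_layer_2(Dense_layer_1(input_layer))).
    logits = Dense_layer_2(Dense_layer_1(input_layer))
    model = tf.keras.models.Model(inputs=input_layer, outputs=logits)
    model.compile(loss='mse',
                  optimizer = optimizer,
                  metrics= ['mae', r2])
    model.summary()
    return model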
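
The parameter counts that Keras reports for this architecture can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses the feature-vector width reported for Model 1 (40801); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

n_dimensions = 40801  # feature-vector width reported for Model 1

layer_1 = dense_params(n_dimensions, 20)
layer_2 = dense_params(20, 10)
layer_3 = dense_params(10, 1)  # single-unit output layer

print(layer_1, layer_2, layer_1 + layer_2)  # 816040 210 816250
print(layer_3)                              # 11
```

The total of 816,250 in the Model 1 summary equals the first two layers only, which suggests that the single-unit output layer is not part of the trained graph.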

In [120]:
# R2 score, an evaluation metric often used for regression. The best value is 1.0; there is no lower bound, so it can be arbitrarily negative
def r2(y_true, y_pred):
    SS_res = tf.reduce_sum(tf.square(y_true - y_pred))
    SS_tot = tf.reduce_sum(tf.square(y_true - tf.reduce_mean(y_true)))
    return(1 - SS_res / (SS_tot + tf.keras.backend.epsilon()))
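
One caveat, stated as my understanding rather than a guarantee: a function passed through `metrics` is evaluated batch by batch and the reported number is an average over batches, which is not the same as R2 over the whole set. The sketch below shows how far the two can diverge on made-up numbers; `r2_plain` is a hypothetical plain-Python counterpart of `r2`.

```python
def r2_plain(y_true, y_pred):
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up prices: two cheap wines followed by two expensive ones.
y_true = [10.0, 12.0, 100.0, 120.0]
y_pred = [12.0, 10.0, 110.0, 110.0]

# R2 over the full set.
global_r2 = r2_plain(y_true, y_pred)

# R2 per batch of two rows, then averaged.
batches = [(y_true[i:i + 2], y_pred[i:i + 2]) for i in range(0, 4, 2)]
batch_r2 = [r2_plain(t, p) for t, p in batches]
mean_batch_r2 = sum(batch_r2) / len(batch_r2)

print(round(global_r2, 3))  # 0.979
print(batch_r2)             # [-3.0, 0.0]
print(mean_batch_r2)        # -1.5
```

Within a small batch the spread of true prices can be tiny, so SS_tot is small and the per-batch score drops sharply even when the predictions are close. A score computed once over all test predictions, as done with `r2_score` for Model 5, avoids this effect.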

### Data Cleaning, pre-processing

In [17]:
# replacing missing values with the string "none" in string-valued columns and with 0 in the numerical 'points' column
def null_handling(df):
    values = {"country":"none", "designation":"none", "province":"none", "region_1":"none",
              "region_2":"none", "taster_name":"none", "taster_twitter_handle":"none", 
              "points":0,"title":"none", "variety":"none", "winery":"none"}
    df = df.fillna(value=values, inplace=False)
    return df

In [45]:
import re

# extracting year from the title 
def extract_year(row):
    year = re.search(r'([1-2][0-9]{3})',row['title'])
    if year is not None:
        year_value = year.group(1)
    else: year_value = "none"
    
    return year_value

def new_title(row):
    year = re.search(r'([1-2][0-9]{3})',row['title'])
    if year is not None:
        year_value = year.group(1)
        new_title = row['title'].replace(year_value,"")
    else: 
        year_value = "none"
        new_title = row['title']
    
    return new_title
 
# adding year as a separate column and replacing the title column with a column holding the new title values, which exclude the year
def add_year_column(df):
    df['year'] = df.apply (lambda row: extract_year(row), axis=1)
    df['title'] = df.apply (lambda row: new_title(row), axis=1)
    
    return df
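
The year extraction can be tried on single titles before applying it to the whole frame. The sketch below uses the same regular expression on two titles from the dataset; `split_year` is a hypothetical helper, and collapsing the doubled space left behind by the removed year is an addition of mine, not something the functions above do.

```python
import re

YEAR_PATTERN = re.compile(r'([1-2][0-9]{3})')

def split_year(title):
    """Return (year, title_without_year); year is "none" when absent."""
    match = YEAR_PATTERN.search(title)
    if match is None:
        return "none", title
    year = match.group(1)
    # Remove the year, then collapse the doubled space it leaves behind.
    cleaned = " ".join(title.replace(year, "").split())
    return year, cleaned

print(split_year("Rainstorm 2013 Pinot Gris (Willamette Valley)"))
# ('2013', 'Rainstorm Pinot Gris (Willamette Valley)')
print(split_year("Korbel NV Brut Sparkling (California)"))
# ('none', 'Korbel NV Brut Sparkling (California)')
```

Note that the pattern accepts any four-digit run from 1000 to 2999, so a number in a wine name that is not a vintage would also be picked up as the year.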

## Approach 1: Baseline
Simple neural network with feature engineering and feature extraction

In [42]:
# since we are predicting price, dropping all the rows with missing values of price
df = wine_reviews.dropna(subset=["price"])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120975 entries, 1 to 129970
Data columns (total 13 columns):
country                  120916 non-null object
description              120975 non-null object
designation              86196 non-null object
points                   120975 non-null int64
price                    120975 non-null float64
province                 120916 non-null object
region_1                 101400 non-null object
region_2                 50292 non-null object
taster_name              96479 non-null object
taster_twitter_handle    91559 non-null object
title                    120975 non-null object
variety                  120974 non-null object
winery                   120975 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB


### Model 1
Excludes description field. Doesn't use any dimensionality reduction or feature selection outside the neural net.

In [46]:
# additional column 'year' added
df_no_nulls = add_year_column(null_handling(df))

In [123]:
df_no_nulls.head(2)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,year
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,none,none,Roger Voss,@vossroger,Quinta dos Avidagos Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,2011
2,US,"Tart and snappy, the flavors of lime flesh and...",none,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,2013


In [51]:
df_no_nulls.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120975 entries, 1 to 129970
Data columns (total 14 columns):
country                  120975 non-null object
description              120975 non-null object
designation              120975 non-null object
points                   120975 non-null int64
price                    120975 non-null float64
province                 120975 non-null object
region_1                 120975 non-null object
region_2                 120975 non-null object
taster_name              120975 non-null object
taster_twitter_handle    120975 non-null object
title                    120975 non-null object
variety                  120975 non-null object
winery                   120975 non-null object
year                     120975 non-null object
dtypes: float64(1), int64(1), object(12)
memory usage: 13.8+ MB


In [124]:
# taking a smaller sample for faster execution
df_sample = df_no_nulls.sample(frac=.50, random_state=42)

In [140]:
(X, y) = feature_extraction_3(df_sample)

n_dimensions = X.shape[1]

('number of columns ', 10)
('Shape of data transformed into feature vectors:', (60488, 40801))
('Size of prediction vector:', 60488)


In [144]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=32)

In [128]:
model = nn_model_fn(n_dimensions)
model.fit(X_train, y_train, batch_size=32, shuffle=True, epochs=10, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 40801)             0         
_________________________________________________________________
dense_27 (Dense)             (None, 20)                816040    
_________________________________________________________________
dense_28 (Dense)             (None, 10)                210       
Total params: 816,250
Trainable params: 816,250
Non-trainable params: 0
_________________________________________________________________
Train on 33872 samples, validate on 8469 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a79c39790>

In [129]:
model.evaluate(x=X_test, y=y_test, batch_size=32, verbose=1)



[1201.5914213105182, 13.950552, -5.2824736]

### Model 2
Same as Model 1, but includes the description feature

In [132]:
# df_no_nulls = null_handling(df)
# df_sample = df_no_nulls.sample(frac =.50, random_state = 42)

# different feature extraction method
(X, y) = feature_extraction_4(df_sample)

('number of columns ', 11)
('Shape of data transformed into feature vectors:', (60488, 40901))
('Size of prediction vector:', 60488)


In [134]:
n_dimensions = X.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=32)

model_2 = nn_model_fn(n_dimensions)
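# NOTE: fit() is called on the full X, y rather than X_train, y_train, so the rows in
# X_test are seen during training (the log below shows 48390 training samples)
model_2.fit(X, y, batch_size=32, shuffle=True, epochs=10, validation_split=0.2)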

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_13 (InputLayer)        (None, 40901)             0         
_________________________________________________________________
dense_30 (Dense)             (None, 20)                818040    
_________________________________________________________________
dense_31 (Dense)             (None, 10)                210       
Total params: 818,250
Trainable params: 818,250
Non-trainable params: 0
_________________________________________________________________
Train on 48390 samples, validate on 12098 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a78f10050>

In [54]:
model_2.evaluate(x=X_test, y=y_test, batch_size=32, verbose=1)



[968.08356484722, 10.233475, -2.9665246]

### Model 3
Same as Model 2, plus feature selection based on a variance threshold

In [135]:
#X = dim_reduction(X, 2000)
(X, y) = feature_extraction_4(df_sample)
X = feature_selection(X, p=0.95)

('number of columns ', 11)
('Shape of data transformed into feature vectors:', (60488, 40901))
('Size of prediction vector:', 60488)
('no. of features selected: ', 148)


In [137]:
n_dimensions = X.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=32)

model_3 = nn_model_fn(n_dimensions)
# NOTE: as for Model 2, fit() is called on the full X, y rather than X_train, y_train
model_3.fit(X, y, batch_size=32, shuffle=True, epochs=10, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_14 (InputLayer)        (None, 148)               0         
_________________________________________________________________
dense_33 (Dense)             (None, 20)                2980      
_________________________________________________________________
dense_34 (Dense)             (None, 10)                210       
Total params: 3,190
Trainable params: 3,190
Non-trainable params: 0
_________________________________________________________________
Train on 48390 samples, validate on 12098 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a7be4bfd0>

In [138]:
model_3.evaluate(x=X_test, y=y_test, batch_size=32, verbose=1)



[1475.047717614851, 16.273384, -7.009284]

Observations from the results above:
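- Overall, performance is low (negative test R2 for all three models); the models do not fit well
- Including description-based features appeared to improve performance (Model 2 vs Model 1), although Models 2 and 3 were fitted on the full sample rather than the training split, so the comparison with Model 1 is not like-for-like
- Feature selection outside the neural network produced poorer results. However, this may be specific to the simple feature selection method used here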


There are several areas for improvement in the regression pipeline presented above:

1. Null replacement: nulls in continuous-valued columns (points) can be replaced by interpolation
2. Feature engineering: use tokens instead of the full string in categorical features
3. Feature selection: 
4. Hyperparameter optimisation
5. Textual features: Description features can be incorporated using pre-trained word embeddings, text2vec, and ensemble methods of incorporating a 
6. Scalable implementation: An equivalent Spark and Spark ML implementation would be faster and more scalable

## Approach 2: Improvement
The regression model learns a dense representation for each categorical feature value within the network, so categorical features do not need to be input as one-hot encodings.

In [71]:
'''The neural network architecture comprises one embedding layer for each categorical column. 
The dense categorical feature vectors from the embedding layers are concatenated with the numerical feature(s), 
followed by two dense layers.'''

from tensorflow.keras.layers import (Dense, Embedding,
                          Activation, Input, concatenate, Reshape, Flatten, PReLU)
from tensorflow.keras.models import Model


def nn_model_fn_embeddings(numerical_cols, emb_sizes):
        inputs = []
        flatten_layers = []

        for (var,sz) in emb_sizes.items():
            input_c = Input(shape=(1,), dtype='float32', name='input_' + var)
            embed_c = Embedding(*sz, input_length=1, name='embedded_' + var)(input_c)
            embed_c.trainable = True
            flatten_c = Flatten()(embed_c)

            inputs.append(input_c)
            flatten_layers.append(flatten_c)

        # concatenating the continuous-valued features
        input_num = Input(shape=(len(numerical_cols),), dtype='float32')
        flatten_layers.append(input_num)
        inputs.append(input_num)

        flatten = concatenate(flatten_layers, axis=-1)
        fc1 = Dense(200)(flatten)

        fc1 = Activation('relu')(fc1)
     

        fc2 = Dense(100, kernel_initializer='normal')(fc1)
        fc2 = Activation('relu')(fc2)
        
        
        output = Dense(1, kernel_initializer='normal', name='prediction_output')(fc2) 
        model = Model(inputs=inputs, outputs=output)
        
        model.compile(loss='mse',
                  optimizer = 'adam',
                  metrics= ['mae',r2])

        model.summary()
        return model
    
    
def prepare_inputs(X_df, embedding_sizes, numerical_cols):
    cat_vars = embedding_sizes.keys()
    #numerical_vars = [x for x in X.columns if x not in cat_vars]
    x_inputs = []
        
    for col in cat_vars:
            x_inputs.append(X_df[col].values)

    x_inputs.append(X_df[numerical_cols].values)
    
    return x_inputs


def calc_embedding_size(df):
    categorical_cols = [x for x in df.columns if df[x].dtype == object]
    df_cat = df[categorical_cols]
    cat_sizes = df_cat.apply(pd.Series.nunique).to_dict()  
    
            
    col_emb_sizes={col: (cat_sizes[col],np.minimum(50, (cat_sizes[col] + 1) // 2))
                  for col in cat_sizes.keys()}
    
    return col_emb_sizes
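
The sizing rule in `calc_embedding_size` gives each categorical column an embedding of half its cardinality (rounded up), capped at 50 dimensions. A minimal sketch of the rule follows; `embedding_dim` is a hypothetical helper, and the cardinalities are illustrative, loosely based on the distinct counts in the dataset profile rather than on the sampled frame.

```python
def embedding_dim(n_categories, cap=50):
    """Half the cardinality (rounded up), capped at `cap` dimensions."""
    return min(cap, (n_categories + 1) // 2)

# Illustrative cardinalities (assumed, not recomputed from the sample).
cardinalities = {"taster_name": 20, "country": 44, "winery": 16757}

# Same (input_dim, output_dim) tuple shape as col_emb_sizes above.
sizes = {col: (n, embedding_dim(n)) for col, n in cardinalities.items()}

for col in sorted(sizes):
    print(col, sizes[col])
# country (44, 22)
# taster_name (20, 10)
# winery (16757, 50)
```

With this rule a high-cardinality column such as winery is represented by 50 learned numbers per row instead of a one-hot vector with thousands of entries.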

In [78]:
df = wine_reviews.dropna(subset=["price"])
df_no_nulls = add_year_column(null_handling(df))
df_sample = df_no_nulls.sample(frac =.50, random_state = 42)

In [80]:
X_df = df_sample.drop(["description", "region_2", "designation","price"],axis=1)


numerical_cols = [x for x in X_df.columns if X_df[x].dtype != object]
embedding_sizes = calc_embedding_size(X_df)

In [81]:
from sklearn.preprocessing import LabelEncoder

# encodes every column as integer indexes; note that this also label-encodes the numeric
# column 'points' (its values become indexes of the sorted distinct values), it does not normalise it
X_df = X_df.apply(LabelEncoder().fit_transform)
y_df = df_sample["price"]


# a single call keeps the feature rows and the target rows aligned
X_df_train, X_df_test, y_df_train, y_df_test = train_test_split(
    X_df, y_df, test_size=0.3, random_state=32)


X_train_inputs_list = prepare_inputs(X_df_train, embedding_sizes, numerical_cols)
X_test_inputs_list = prepare_inputs(X_df_test, embedding_sizes, numerical_cols)

### Model 5

In [83]:
model_5 = nn_model_fn_embeddings(numerical_cols,embedding_sizes)
model_5.fit(X_train_inputs_list, y_df_train, batch_size=32, epochs=10, validation_split=0.2)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_province (InputLayer)     (None, 1)            0                                            
__________________________________________________________________________________________________
input_variety (InputLayer)      (None, 1)            0                                            
__________________________________________________________________________________________________
input_region_1 (InputLayer)     (None, 1)            0                                            
__________________________________________________________________________________________________
input_country (InputLayer)      (None, 1)            0                                            
__________________________________________________________________________________________________
input_tast

<tensorflow.python.keras.callbacks.History at 0x1a4365dfd0>

In [121]:
from sklearn.metrics import r2_score

y_pred = model_5.predict(X_test_inputs_list)
y_pred = np.reshape(y_pred, (y_pred.shape)[0])
y_test = np.array(y_df_test.tolist())
r2_score(y_test, y_pred)

0.47614041042577204