## Loading Data

In [None]:
import pandas as pd
import numpy as np
import gc

In [None]:
train=pd.read_table('../input/train.tsv')
test=pd.read_table('../input/test.tsv')

## Preprocessing

In [None]:
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder

Brand name, category name, item condition and shipping are all categorical variables. We are going to encode those into proper formats for later processing, but right now we should take care of more pressing matters: dealing with missing values.

In [None]:
train=train.fillna('missing')
test=test.fillna('missing')

As a first approximation, we simply fill missing values with the string 'missing', but this can certainly be improved in a few cases. We could try imputing values where possible, especially for the category feature, by training a secondary classifier. That is left for a later iteration, though.

At this point, a closer look at the category_name feature is worthwhile.

category_name appears to be split into 3 increasingly specific levels. It might be of interest to treat these separately. For now, the full category string is encoded as a single feature.
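
If the levels were to be treated separately, the split could look roughly like the sketch below. It assumes the levels are separated by '/' (the delimiter is not stated above), and the column names `cat_lv1`, `cat_lv2` and `cat_lv3` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for allData; the '/' separator is an assumption.
df = pd.DataFrame({'category_name': ['Men/Tops/T-shirts',
                                     'Electronics/Computers & Tablets/Components & Parts',
                                     'missing']})

# Split on the first two separators only, so any further '/' stays inside level 3.
levels = df['category_name'].str.split('/', n=2, expand=True)
levels.columns = ['cat_lv1', 'cat_lv2', 'cat_lv3']  # hypothetical column names

# Rows without a separator (such as the 'missing' filler) have no deeper levels.
levels = levels.fillna('missing')
df = pd.concat([df, levels], axis=1)
print(df)
```

Each of the three new columns could then be label-encoded in the same way as the other categorical features.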

In [None]:
#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
allData=pd.concat([train,test]).reset_index(drop=True)

In [None]:
del train
del test

gc.collect()

In [None]:
le=LabelEncoder()
allData['c_lv3']=le.fit_transform(allData.category_name.ravel())

The other 3 categorical features will be handled directly: Just encode them and be done with it.

In [None]:
allData['c_brand']=le.fit_transform(allData.brand_name.str.lower())
allData['c_condition']=le.fit_transform(allData.item_condition_id)
allData['c_shipping']=le.fit_transform(allData.shipping)
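
To make the encoding step concrete, here is a small NumPy stand-in for what label encoding produces. It is a sketch rather than the scikit-learn implementation, and the toy brand names are invented: each distinct value is replaced by its position in the sorted list of distinct values.

```python
import numpy as np

brands = np.array(['nike', 'missing', 'apple', 'nike', 'apple'])  # toy data

# np.unique sorts the distinct values and reports, for every entry,
# its position in that sorted list.
classes, codes = np.unique(brands, return_inverse=True)

# Codes run from 0 to n_classes-1, so an embedding needs max code + 1 rows.
vocab_size = int(codes.max()) + 1

print(classes)
print(codes)
print(vocab_size)
```

This is also why the vocabulary sizes used for the embeddings later on are computed as the maximum code plus one.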

Now for the two most difficult features to handle: Description and Name. We start by tokenizing both of them, that is, splitting the text into words and replacing each word with an integer index.

In [None]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts(allData.name.str.lower())
allData['c_names']=pd.Series(tokenizer.texts_to_sequences(allData.name.str.lower()))

In [None]:
#The same tokenizer is fitted again, so its vocabulary now covers names and descriptions
tokenizer.fit_on_texts(allData.item_description.str.lower())
allData['c_descriptions']=pd.Series(tokenizer.texts_to_sequences(allData.item_description.str.lower()))
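
As a rough picture of what the tokenizer does, the pure-Python sketch below builds a word index ordered by frequency and converts texts to integer sequences. It is a simplified stand-in, not the Keras code: it splits on whitespace only and does no punctuation filtering. The helper names `fit_word_index` and `to_sequences` are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common sorts by descending count; ties keep first-seen order.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_names = ['Red Nike shoes', 'Nike running shoes', 'Blue dress']
word_index = fit_word_index(toy_names)
sequences = to_sequences(toy_names, word_index)

print(word_index)
print(sequences)
```

Index 0 is never assigned to a word, and the sequences have different lengths, which is what the padding step below addresses.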

To feed all this to a neural network, we need to standardize the length of all sequences. To accomplish this, we pad each text with a dummy word until it reaches a fixed length: the length of the longest name for names, and a cap of 60 tokens for descriptions (longer descriptions are truncated). We start by gathering the maximum name and description lengths in words.

In [None]:
max_desc=allData['c_descriptions'].apply(lambda x: len(x)).max()
max_name=allData['c_names'].apply(lambda x: len(x)).max()

In [None]:
dummy_name=allData['c_names'].apply(lambda x: max(x) if len(x)>=1 else 0).max()+1
dummy_desc=allData['c_descriptions'].apply(lambda x: max(x) if len(x)>=1 else 0).max()+1

Now we build a 'pad' function to apply to the sequences.

In [None]:
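def pad(sequence,maxlen,dummy_word):
    #Truncate sequences longer than maxlen, then right-pad shorter ones with dummy_word
    lsequence=list(sequence)[:maxlen]
    lsequence.extend([dummy_word]*(maxlen-len(lsequence)))
    #Always return an array, whether the sequence was truncated or padded
    return np.array(lsequence)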
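
The effect of padding is easiest to see on a toy input. The sketch below uses a standalone helper, `pad_to_length` (a hypothetical name, written to behave like the function above), and shows that sequences of unequal length become one regular 2D array.

```python
import numpy as np

def pad_to_length(sequence, maxlen, dummy_word):
    """Truncate to maxlen, then right-pad with dummy_word (illustrative helper)."""
    trimmed = list(sequence)[:maxlen]
    return np.array(trimmed + [dummy_word] * (maxlen - len(trimmed)))

toy_sequences = [[3, 1, 2], [5], [4, 2, 6, 1, 7]]
dummy = 8  # one past the largest real word index, mirroring dummy_name

# Every row now has exactly 4 entries, so the result stacks into a 3x4 array.
padded = np.array([pad_to_length(s, 4, dummy) for s in toy_sequences])
print(padded)
```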

Finally, we apply the function. Names are padded to max_name, while descriptions are truncated or padded to 60 tokens.

In [None]:
allData['c_names']=allData['c_names'].apply(lambda x: pad(x,max_name,dummy_name))
allData['c_descriptions']=allData['c_descriptions'].apply(lambda x: pad(x,60,dummy_desc))

Now all that is left is to convert these DataFrame columns into numpy arrays so that Keras can deal with them. Before training, we drop all data points with a price below 1 (items sold for free), since those points would only serve to confuse the neural network. The increase in error from ignoring them should be smaller than what we would see if we tried to fit them.

In [None]:
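#as_matrix() was removed in pandas 1.0; .values is the drop-in replacement
#Training rows are the ones without a test_id
tr_data=allData[(allData.price.values>=1.0) & (np.isnan(allData.test_id.values))]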
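
The row selection relies on a side effect of stacking the two tables: columns that exist in only one of them are filled with NaN in the rows coming from the other. A minimal sketch on invented toy frames:

```python
import pandas as pd

# Toy stand-ins for the train and test tables (values are made up).
toy_train = pd.DataFrame({'train_id': [0, 1, 2],
                          'name': ['a', 'b', 'c'],
                          'price': [10.0, 0.0, 25.0]})
toy_test = pd.DataFrame({'test_id': [0, 1], 'name': ['d', 'e']})

both = pd.concat([toy_train, toy_test], sort=False).reset_index(drop=True)

# Train rows have no test_id, and test rows have no train_id or price.
is_train = both['test_id'].isna()
is_test = both['train_id'].isna()

# NaN >= 1.0 is False, so the price filter alone already excludes test rows.
train_rows = both[is_train & (both['price'] >= 1.0)]
test_rows = both[is_test]

print(train_rows)
print(test_rows)
```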

In [None]:
names=np.array(list(tr_data.c_names))
descs=np.array(list(tr_data.c_descriptions))
category3=tr_data.c_lv3.ravel()
brand=tr_data.c_brand.ravel()
condition=tr_data.c_condition.ravel()
shipping=tr_data.c_shipping.ravel()
labels=tr_data.price.ravel()

Setting up vocab size for later embeddings...

In [None]:
max_c_lv3=allData.c_lv3.max()
max_brand=allData.c_brand.max()
max_condition=allData.c_condition.max()
max_shipping=allData.c_shipping.max()

We also need to prepare our test set:

In [None]:
#Test rows are the ones without a train_id (.values replaces the removed as_matrix())
te_data=allData[np.isnan(allData.train_id.values)]

In [None]:
names_te=np.array(list(te_data.c_names))
descs_te=np.array(list(te_data.c_descriptions))
category3_te=te_data.c_lv3.ravel()
brand_te=te_data.c_brand.ravel()
condition_te=te_data.c_condition.ravel()
shipping_te=te_data.c_shipping.ravel()

Scaling the labels so that they are in a sensible range...

In [None]:
#s_labels=scaler.fit_transform(labels.reshape(-1,1))
s_labels=np.log(labels+1.).reshape(-1,1)
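
The log transform compresses the wide range of prices, and it has to be undone when predictions are turned back into prices. A small sketch with made-up prices, using NumPy's `log1p` and `expm1`, which compute log(x+1) and exp(x)-1:

```python
import numpy as np

prices = np.array([3.0, 10.0, 50.0, 2000.0])  # toy values

scaled = np.log1p(prices).reshape(-1, 1)   # same as np.log(prices + 1.)
restored = np.expm1(scaled).reshape(-1)    # same as np.exp(scaled) - 1.

print(scaled.ravel())
print(restored)
```

The inverse step corresponds to the `np.exp(...)-1.` applied to the model output when the submission file is written.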

Getting the input variables in a dict...

In [None]:
X={'name':names,
   'descriptions':descs,
   #'category_level_1':category1,
   #'category_level_2':category2,
   'category_level_3':category3,
   'brand':brand,
   'condition':condition,
   'shipping':shipping}

Including the test set...

In [None]:
X_te={'name':names_te,
      'descriptions':descs_te,
      #'category_level_1':category1,
      #'category_level_2':category2,
      'category_level_3':category3_te,
      'brand':brand_te,
      'condition':condition_te,
      'shipping':shipping_te}

In [None]:
del names
del descs
del category3
del brand
del condition
del shipping
del labels

del names_te
del descs_te
del category3_te
del brand_te
del condition_te
del shipping_te

del allData

gc.collect()

## Building a Keras Model

Now we must build our neural network model. We have two distinct types of features in this problem: categorical variables and text variables, and each calls for different treatment. We pass the text, category and brand inputs through embedding layers first, to build a better feature space before the recurrent (LSTM) and dense layers; condition and shipping are fed in directly as numbers. The output is a single neuron with linear activation, as the target variable is a scalar value.
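
Conceptually, an embedding layer is a lookup table with one trainable row per integer code. The NumPy sketch below illustrates the lookup, the flattening and the concatenation with a plain numeric input. It is an illustration with random weights and invented sizes, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 5, 3                      # toy sizes: 5 codes, 3-dim vectors
table = rng.normal(size=(vocab_size, emb_dim))  # these weights are learned in a real layer

brand_ids = np.array([[2], [0], [4]])           # batch of 3 samples, shape (3, 1)
embedded = table[brand_ids]                     # row lookup, shape (3, 1, 3)
flat = embedded.reshape(len(brand_ids), -1)     # shape (3, 3), like a Flatten layer

shipping = np.array([[1.0], [0.0], [1.0]])      # numeric input used as-is, shape (3, 1)
joined = np.concatenate([flat, shipping], axis=1)  # shape (3, 4)

print(joined.shape)
```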

In [None]:
import keras.layers as kl
import keras.models as km
import keras.backend as K
import keras

In [None]:
def schedule(e):
    if e<2:
        return 0.0013
    elif e==2:
        return 0.0012
    else:
        return 0.0011
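
The schedule function maps an epoch index to a learning rate. Keras's LearningRateScheduler passes zero-based epoch indices, so for the 5 epochs used below the rates can be listed by calling the function directly (it is repeated here so the snippet runs on its own):

```python
def schedule(e):
    # Same step schedule as above: a slight decay after the first two epochs.
    if e < 2:
        return 0.0013
    elif e == 2:
        return 0.0012
    else:
        return 0.0011

rates = [schedule(epoch) for epoch in range(5)]
print(rates)  # [0.0013, 0.0013, 0.0012, 0.0011, 0.0011]
```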

We now set up our layers one by one, starting with the inputs. Variable names that start with NN are Neural Network layers.

In [None]:
#Inputs
NN_names=kl.Input(shape=[X['name'].shape[1]],name='name')
NN_descs=kl.Input(shape=[X['descriptions'].shape[1]],name='descriptions')
#NN_cat1=kl.Input(shape=[1],name='category_level_1')
#NN_cat2=kl.Input(shape=[1],name='category_level_2')
NN_cat3=kl.Input(shape=[1],name='category_level_3')
NN_brand=kl.Input(shape=[1],name='brand')
NN_condition=kl.Input(shape=[1],name='condition')
NN_shipping=kl.Input(shape=[1],name='shipping')

#Embeddings
NN_emb_name=kl.Embedding(dummy_name+1, 20)(NN_names)
NN_emb_desc=kl.Embedding(dummy_desc+1, 30)(NN_descs)
#NN_emb_cat1=kl.Embedding(max_c_lv1+1, 3)(NN_cat1)
#NN_emb_cat2=kl.Embedding(max_c_lv2+1, 5)(NN_cat2)
NN_emb_cat3=kl.Embedding(max_c_lv3+1, 8)(NN_cat3)
NN_emb_brand=kl.Embedding(max_brand+1, 5)(NN_brand)

#LSTM Layer
NN_lstm_name=kl.LSTM(8)(NN_emb_name)
NN_lstm_desc=kl.LSTM(20)(NN_emb_desc)

#Main layer, joins all data
NN_main=kl.concatenate([#kl.Flatten() (NN_emb_cat1),
#                        kl.Flatten() (NN_emb_cat2),
                        kl.Flatten() (NN_emb_cat3),
                        kl.Flatten() (NN_emb_brand),
                        NN_condition,
                        NN_shipping,
                        NN_lstm_name,
                        NN_lstm_desc])

#Two dense layers, each followed by dropout, to process the whole picture
NN_main=kl.Dropout(.1) (kl.Dense(128,activation='relu') (NN_main))
NN_main=kl.Dropout(.1) (kl.Dense(64,activation='relu') (NN_main))

#output
NN_output=kl.Dense(1,activation='linear') (NN_main)

model=km.Model([NN_names,NN_descs,NN_cat3,NN_brand,NN_condition,NN_shipping],NN_output)
model.compile(loss="mean_squared_error", optimizer=keras.optimizers.Adam(lr=0.0013,decay=0.0), metrics=["mae"])
model.summary()

And now we are ready to train!

In [None]:
history=model.fit(X,s_labels,epochs=5,batch_size=15000,validation_split=0.0,callbacks=[keras.callbacks.LearningRateScheduler(schedule)])

## Outputs and Results

Preparing predictions into submission format and saving the csv file:

In [None]:
#.values replaces as_matrix(), which was removed in pandas 1.0
pd.DataFrame({'test_id':te_data.test_id.values.astype(int),
              'price':(np.exp(model.predict(X_te))-1.).reshape(-1)}).to_csv('submissions.csv',
                                                                             index=False,
                                                                             header=True,columns=['test_id','price'])

And we're done. Just as a curiosity, let's check the distribution plots for training and test prices to see how we're doing:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_palette('nipy_spectral')

In [None]:
sns.distplot(s_labels)
sns.distplot(np.log(pd.read_csv('submissions.csv').price.values+1))
plt.show()

As a general rule, the shapes are pretty close. The neural network fails to capture much of the noise inherent in the training set: it overpredicts values near the peak of the distribution and produces a shorter tail. Given more time, we could run a couple more epochs and see whether that improves things.