# The Price Is Right: Predicting Prices with Product Images

### Milestone Report

------------

**Steven Chen, Edward Chou, Richard Yang**

(Edward Chou and Richard Yang are not part of 230, but are part of 229.)



## Introduction

------------

Online shopping is quickly becoming the norm, but the experience differs greatly from retail shopping where people have the opportunity to closely examine a product, weighing  in the feel of a material or the scent of a cream before making a purchase decision.  Online shoppers must rely entirely on the few images and paragraph descriptions to make that decision.

Our goal is to create a machine learning model that can predict item prices based on a product image and description, which could be used by both buyers and sellers to suggest fair prices for products, or warn of inaccurate or unreasonable pricing. In addition, by learning which features tend to result in predicted higher or lower prices, our proposed model can help sellers increase the perceived value of their products on shopping websites, helping guide product design, photo selection, and product description to improve a buyer’s impression.  We hope to create a system that takes in inputs of a product image and descriptions and outputs an estimated price based on the features found in the image. We will evaluate our model's success by comparing the estimated price to the corresponding actual price.

## Loading Packages

---------------

For this project, we choose to use Keras with a Tensorflow backend. Keras is well suited for building complex CNNs, and we have experience with both Tensorflow and Keras from the CS230 programming assignments.

In [2]:
from keras import applications
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model
from keras.layers import Dropout, Flatten, Dense, Input
from keras.initializers import glorot_uniform
#from keras import backend as K
#K.set_image_dim_ordering('th')
from sklearn.model_selection import train_test_split
import numpy as np

Using TensorFlow backend.


## Datasets

------------

We use cars and bikes as the products to perform price prediction, due to the wide visual variances in bike and car models and the models' close visual correlations to their prices. We have collected and cleaned image and price datasets for both bikes and cars that are visually consistent, rich in visual detail and well-suited to train our models.  

Our first dataset, named bikes, is curated from Bicycle Blue Book. We collect images, descriptions of specifications, and MSRP prices from the listings. We preprocess the images by resizing so that the smaller axis is 224, then filtering out low-quality images, removing images with noisy backgrounds. We noticed that high quality images all had a solid white background, so we filter images based on the ratio of white pixels. Our final dataset contains clean, white-background, side-view photos of bikes with similar orientation. The dataset consists of 21,843 images, each with a product name and MSRP price.

Our second dataset, named cars, is a dataset of passenger vehicle images along with their MSRP prices. We retrieve a portion of the dataset from Kaggle (www.kaggle.com/jshih7/car-price-prediction), containing a car's make, model, and year along with price data from Edmunds. We use Google Images to create a dataset, using search terms consisting of the make, model, year, and including additional keywords such as "angular front view". We collect a subset of car images without backgrounds and in the same orientation, clearly displaying the proportions and details. The cars dataset has 12,000 unique rows, each with the car model and trim, image, and price.

The bike dataset prices range between \$70 and \$9,000, and the car dataset prices range between \$2,000 and \$497,650, with some expensive outliers omitted from both. See below for histograms of the prices of both the cars and bikes. We note that the histograms closely follow an exponential distribution, which is expected with the real-world economy where there many more models at the modest and regular price segments vs. luxury segments.

**Note:** Our datasets are quite large, and are not included with this notebook. If you wish to download them, the bike images are at https://stanford.box.com/s/o4nbzogxm0gqjd0o69diua36atweugw5, and the bike prices csv is at https://stanford.box.com/s/mksmn25hyljk0crl2j3qkqkhtt5j8ref. Both files are required to run the model.

Samples from the bikes dataset:
![bike dataset](http://www.stevenzc.com/assets/cs230/bike_montage.jpg)


Samples from the cars dataset:
![bike dataset](http://www.stevenzc.com/assets/cs230/car_montage.jpg)

Histogram of the bikes dataset prices:
![bike dataset](http://www.stevenzc.com/assets/cs230/bike_histogram.png)


Histogram of the the cars dataset prices:
![bike dataset](http://www.stevenzc.com/assets/cs230/car_histogram.png)

## Approach

------------

For our model, we choose to use transfer learning with the VGG-16 network. Transfer learning allows us to take advantage of the interesting and complex features an existing deep object recognition CNN has learned, and use this complexity to increase the accuracy of our model. In addition, many complex CNNs would be difficult or infeasible to train due to the time and compute required, so using pretrained parameters is very helpful.

In the following cells, we insert text to explain what happens.

We initialize the VGG-16 network without the final (top) layer, using the learned ImageNet weights. VGG-16 is a very deep CNN trained for object recognition on the ImageNet challenge.

In [4]:
# build the VGG16 network
input_tensor = Input(shape=(224,224,3))
model = applications.VGG16(weights='imagenet', include_top=False, input_tensor = input_tensor)

We build our own layer on top of VGG. In particular, we flatten the final feature mapping of VGG-16 (consisting of 512 7 by 7 filters) into a single dimension. We then add a fully connected layer of 256 hidden units with ReLU activations, and use uniform Xavier initialization.

We finish our model with an output layer of a single linear activation neuron, which will output the predicted price.

In [13]:
# build a classifier model to put on top of the convolutional model
top_model = Sequential()
print(model.output_shape[1:])
top_model.add(Flatten(input_shape=(model.output_shape[1:])))


# Output layer
# We do random weight intialization
# Maybe this is why our loss is so bad?
top_model.add(Dense(256, activation='relu', kernel_initializer='glorot_uniform'))
top_model.add(Dense(1, activation='linear', name='output', kernel_initializer='glorot_uniform'))

(7, 7, 512)


We set the pretrained VGG layers to be non-trainable so that we do spend time learning them. Instead, our learning will focus on the new layers we have added.

In [19]:
# set the first 19 layers (up to the last conv block)
# to non-trainable (weights will not be updated)
for layer in new_model.layers[:19]:
    layer.trainable = False

In [20]:
# add the model on top of the convolutional base
new_model = Model(inputs= model.input, outputs = top_model(model.output))
new_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

Above, we can see our added layer as sequential_2. Only our new layer is trainable: the rest are not.

We compile the model using mean squared error as the loss (since we are performing regression), and use an RMSprop optimizer.

In [21]:

# SGD
#new_model.compile(loss='mean_squared_error',
#              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
#              metrics=['accuracy'])

# RMSprop
new_model.compile(loss='mean_squared_error',
                  optimizer=optimizers.RMSprop(lr=0.01, rho=0.9, epsilon=1e-07, decay=0.0))

We use the great Keras ImageDataGenerator to process our images. We rescale the image colors to be between 0 and 1, then perform mean subtraction on each image channel, in order to help our images be more standardized and similar to images the VGG network has seen before.

Our dataset relies on the bike images and the price csv to be in the root directory, because that is where FloydHub puts them. As of now, we read in the images into a large numpy array, then feed this into the network. We hit memory issues when trying to load all 20000 plus images, so for now we load a smaller subset.

In [24]:
# prepare data augmentation configuration
train_datagen = ImageDataGenerator(rescale=1. / 255, samplewise_center=True)

test_datagen = ImageDataGenerator(rescale=1. / 255, samplewise_center=True)

data_path = "/datasets/bikes_im/"
file = open("/datasets/bikes_filtered.csv")
i = -1
X = np.zeros((4000, 224, 224, 3))
Y = np.zeros((4000, 1))
for data_point in file:
    i += 1
    index, name, msrp = data_point.split(",")
    img_path = data_path + index + '.jpg'
    img = image.load_img(img_path, target_size=(224, 224))
    X[i] = image.img_to_array(img)
    Y[i] = int(msrp)
    
    # TODO: Change this to use the full dataset
    if i% 1000 == 0:
        print(i)
    if i == 3999:
        break

print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)
print(X_train.shape)
print(y_train.shape)


0
1000
2000
3000
(4000, 224, 224, 3)
(3600, 224, 224, 3)
(3600, 1)


In [25]:
train_generator = train_datagen.flow(
    x = X_train,
    y = y_train,
    batch_size= 64)


validation_generator = test_datagen.flow(
    x = X_test,
    y = y_test,
    batch_size= 64)

epochs = 10
nb_train_samples = X_train.shape[0]
nb_validation_samples = X_test.shape[0]

# fine-tune the model
new_model.fit_generator(
    train_generator,
    samples_per_epoch=nb_train_samples,
    epochs=epochs,
    validation_data=validation_generator,
    nb_val_samples=nb_validation_samples)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc820a0d4a8>

As we can see, over training epochs the training loss steadily decreases. Training the network on about 4,000 images over 10 epochs takes about one hour on the FloydHub machines. The network definitely has room for a lot of improvement: see next steps for our subsequent plans.

In [23]:
file.close()

## Next Steps

------

We have two main priorities for next steps: training and tuning the network, and working on interesting evaluations and feature extractions.

### Training and Tuning

We have many things to work on next for the model:

* Reading images on demand from disk, so that we can use our entire dataset without running out of memory.
* Doing dataset augmentation by flipping images, using shearing, resizing, and so on.
* Tuning hyperparameters: learning rate, choice of optimizer, momentum, decay, mini-batch size, etc. We have not had much time to tune, and training the network, even on FloydHub's powerful machines, takes a significant amount of time.
* Working with the cars dataset and performing the same tuning steps as above.
* (Possibly) trying different neural network architectures. It would be fun to try to build an end-to-end, simpler network and see how that does, or try transfer learning with a different network.

### Evaluations

One of our stretch goals is to compare our model's performance to humans, if we can get it to make strong predictions. We would like to run a study on Amazon Mechanical Turk where we ask annotators to guess what the price of a bike/car is, and average the guesses of multiple annotators. It would be very cool if our model could outperform humans.

### Feature Extraction

Finally, we would like to look at the activations of our neural network's convolutional filters and extract parts of images that result in higher or lower prices. For instance, the network may predict that any bike that has disc brakes or thin tires generally costs more than a bike without those features. If we could do this, we'd also like to be able to generate the network's idea of a "very expensive" and "very cheap" bike, and see how they look. These features could be used to inform manufacturers of what visual features result in cheaper or more expensive products.
