# Classifying product titles using convolutional neural networks

Text classification helps us better understand and organize data. I built a simple CNN classifier using Keras, with TensorFlow as the backend, to classify products available on e-commerce sites. The data for this experiment are product titles from three distinct categories on a popular e-commerce site. Reference: [Tutorial](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)

## Collecting data

For this experiment I've collected product titles belonging to the following categories. 

* Women's clothing
* Cameras
* Home appliances

Since these categories are distinct, meaning they have no overlap of contextual information, our model should make fewer classification errors.

The word embeddings are pre-trained vectors trained on part of the Google News dataset ([download, 1.5GB](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)).

## Initializing

We need the following libraries:
* Gensim
* Keras
* NLTK
* Pandas
* Numpy

[Conda](https://conda.io/docs/) can be used to manage the virtual environment.

In [1]:
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.layers import Flatten
from keras.layers import MaxPooling1D
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
import nltk
from nltk.corpus import stopwords

MAX_NB_WORDS = 200000
MAX_SEQUENCE_LENGTH = 30
EMBEDDING_DIM = 300

EMBEDDING_FILE = "../lib/GoogleNews-vectors-negative300.bin"
category_index = {"clothing": 0, "camera": 1, "home-appliances": 2}

# Fetch the stopword corpus if it is not already available locally
nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))


Using TensorFlow backend.


## Loading data
It is important to make sure that the data doesn't have any `null`/`NaN` values.

In [2]:
clothing = pd.read_csv("clothing.tsv", sep='\t')
cameras = pd.read_csv("cameras.tsv", sep='\t')
home_appliances = pd.read_csv("home.tsv", sep='\t')

datasets = [clothing, cameras, home_appliances]

print("Make sure there are no null values in the datasets")
for data in datasets:
    print("Has null values: ", data.isnull().values.any())


Make sure there are no null values in the datasets
Has null values:  False
Has null values:  False
Has null values:  False


## Preprocessing

Stopwords, i.e. words that occur frequently and carry little information, are removed first. Then we use classes provided by Keras to prepare the text so it can be used by neural network models.

In [3]:
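def preprocess(text):
    # Lowercase the title and split it into words
    words = text.strip().lower().split()
    # Drop stopwords and join the remaining words back into a string
    return " ".join(word for word in words if word not in STOPWORDS)
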
for dataset in datasets:
    dataset['title'] = dataset['title'].apply(preprocess)
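
To see what this cleaning step does to a single title, here is a minimal standalone sketch. `SAMPLE_STOPWORDS` and `preprocess_title` are hypothetical names used only for illustration, and the tiny stopword set stands in for NLTK's English list so the example runs without any downloads.

```python
# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# example runs without downloading the corpus.
SAMPLE_STOPWORDS = {"a", "an", "and", "for", "the", "with", "in", "of"}


def preprocess_title(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase a title, split on whitespace, and drop stopwords."""
    words = text.strip().lower().split()
    return " ".join(word for word in words if word not in stopwords)


cleaned = preprocess_title("  The Sports Action Camera with a Waterproof Case ")
print(cleaned)  # sports action camera waterproof case
```

Note that punctuation is left untouched here; the Keras tokenizer used in the next step applies its own character filters.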

To prepare the vector (array of integers) representation of the text:
* Combine titles from all three categories to obtain a list of texts.
* Drop duplicates.
* Initialize the tokenizer with `num_words = MAX_NB_WORDS` (200K). The tokenizer counts word occurrences and, when converting text, keeps only the most frequent words (Keras keeps the top `num_words - 1`).
* Use the tokenizer's `texts_to_sequences` method to convert text to arrays of integers.
* The arrays obtained from the previous step might not be of uniform length, so use the `pad_sequences` method to obtain arrays with length equal to `MAX_SEQUENCE_LENGTH` (30).

In [4]:
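# Stack the titles of all three categories into one Series; `+` would instead
# join the strings row by row
all_texts = pd.concat([clothing['title'], cameras['title'], home_appliances['title']], ignore_index=True)
all_texts = all_texts.drop_duplicates()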

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_texts)

clothing_sequences = tokenizer.texts_to_sequences(clothing['title'])
electronics_sequences = tokenizer.texts_to_sequences(cameras['title'])
home_appliances_sequences = tokenizer.texts_to_sequences(home_appliances['title'])

clothing_data = pad_sequences(clothing_sequences, maxlen=MAX_SEQUENCE_LENGTH)
electronics_data = pad_sequences(electronics_sequences, maxlen=MAX_SEQUENCE_LENGTH)
home_appliances_data = pad_sequences(home_appliances_sequences, maxlen=MAX_SEQUENCE_LENGTH)

The tokenizer's `word_index` assigns a unique ID to each word in the data. For example:

In [5]:
word_index = tokenizer.word_index
test_string = "sports action spy pen camera"
print("word\t\tid")
print("-" * 20)
for word in test_string.split():
    print("%s\t\t%s" % (word, word_index[word]))

word		id
--------------------
sports		16
action		13
spy		7
pen		55
camera		2


The tokenizer replaces each word with its unique integer ID to obtain a vector representation of the title.
Example:

In [6]:
test_sequence = tokenizer.texts_to_sequences(["sports action camera", "spy pen camera"])
padded_sequence = pad_sequences(test_sequence, maxlen=MAX_SEQUENCE_LENGTH)
print("Text to Vector", test_sequence)
print("Padded Vector", padded_sequence)

Text to Vector [[16, 13, 2], [7, 55, 2]]
Padded Vector [[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0 16 13  2]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  7 55  2]]
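
The mapping and padding shown above can be imitated in plain Python. This is only a rough sketch of the idea, not the Keras implementation: `build_word_index` and `to_padded_sequence` are hypothetical helpers, and the sketch assumes ids are handed out by descending frequency and that padding and truncation both happen at the front, which is the Keras default.

```python
from collections import Counter


def build_word_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; ties keep first-seen order because sorted() is stable.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_padded_sequence(text, word_index, maxlen):
    """Map words to ids, skip unknown words, then left-truncate and left-pad."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-maxlen:]                      # keep only the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids   # zeros go in front


titles = ["sports action camera", "spy pen camera", "spy camera"]
toy_index = build_word_index(titles)
padded = to_padded_sequence("spy pen camera", toy_index, maxlen=6)
print(toy_index)  # {'camera': 1, 'spy': 2, 'sports': 3, 'action': 4, 'pen': 5}
print(padded)     # [0, 0, 0, 2, 5, 1]
```

Because id 0 is reserved for padding, the embedding matrix built later needs one more row than there are words.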


Product titles belonging to the three categories have been kept separate so far for the sake of understanding. To prepare the input layer, all three categories are combined and shuffled as shown below.

The category (the label, or `y`) is converted to a one-hot vector that the convnet can work with, using the `keras.utils` function `to_categorical`. Example:

In [7]:
print("clothing: \t\t", to_categorical(category_index["clothing"], 3))
print("camera: \t\t", to_categorical(category_index["camera"], 3))
print("home appliances: \t", to_categorical(category_index["home-appliances"], 3))

clothing: 		 [ 1.  0.  0.]
camera: 		 [ 0.  1.  0.]
home appliances: 	 [ 0.  0.  1.]
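
For intuition, one-hot encoding can be written by hand in a few lines. The `one_hot` helper below is a hypothetical illustration and not the Keras function; it returns plain lists rather than NumPy arrays.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


category_ids = {"clothing": 0, "camera": 1, "home-appliances": 2}
encoded = {name: one_hot(idx, len(category_ids)) for name, idx in category_ids.items()}
print(encoded["camera"])  # [0.0, 1.0, 0.0]
```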


In [8]:
print("clothing shape: ", clothing_data.shape)
print("electronics shape: ", electronics_data.shape)
print("home appliances shape: ", home_appliances_data.shape)

data = np.vstack((clothing_data, electronics_data, home_appliances_data))
category = pd.concat([clothing['category'], cameras['category'], home_appliances['category']]).values
category = to_categorical(category)
print("-"*10)
print("combined data shape: ", data.shape)
print("combined category/label shape: ", category.shape)

clothing shape:  (392721, 30)
electronics shape:  (1347, 30)
home appliances shape:  (11425, 30)
----------
combined data shape:  (405493, 30)
combined category/label shape:  (405493, 3)


Shuffling and splitting the data is necessary since the categories are stacked one after the other. `nb_validation_samples` is the number of samples held out for validation, so it marks the index that separates the training set from the validation set. This step can be simplified with [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn.

In [9]:
VALIDATION_SPLIT = 0.4
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
category = category[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = category[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = category[-nb_validation_samples:]
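
As a rough sketch of the scikit-learn alternative, the same shuffle and split could look like the following. The toy arrays are hypothetical stand-ins for `data` and `category`, and `random_state=42` is an arbitrary choice made here only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the padded titles and one-hot labels (hypothetical values).
rng = np.random.RandomState(0)
toy_data = rng.randint(0, 50, size=(10, 30))
toy_labels = np.eye(3)[np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])]

# test_size plays the role of VALIDATION_SPLIT; shuffling is on by default.
x_tr, x_va, y_tr, y_va = train_test_split(
    toy_data, toy_labels, test_size=0.4, random_state=42
)
print(x_tr.shape, x_va.shape)  # (6, 30) (4, 30)
```

Given how unevenly the three categories are represented, the `stratify` argument of `train_test_split` may also be worth exploring so that both splits keep similar class proportions.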

## word2vec embeddings

Word2Vec brings in semantic similarity information that the convnet can leverage. This experiment uses pre-trained vectors from [Google News](https://code.google.com/archive/p/word2vec/). Another option is [GloVe](https://nlp.stanford.edu/projects/glove/).

In [10]:
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))

Found 3000000 word vectors of word2vec


The following examples should help explain the intent behind using pre-trained word2vec vectors:

In [11]:
#odd man out
print("Odd word out:", word2vec.doesnt_match("banana apple grapes carrot".split()))
print("-"*10)
print("Cosine similarity between TV and HBO:", word2vec.similarity("tv", "hbo"))
print("-"*10)
print("Most similar words to Computers:", ", ".join(map(lambda x: x[0], word2vec.most_similar("computers"))))
print("-"*10)

Odd word out: carrot
----------
Cosine similarity between TV and HBO: 0.613064891522
----------
Most similar words to Computers: computer, laptops, PCs, laptop_computers, desktop_computers, Computers, laptop, notebook_computers, Dell_OptiPlex_desktop, automated_seismographs
----------
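
The similarity score printed above is a cosine similarity between two word vectors. A minimal sketch of that computation follows; the 3-dimensional vectors are made up for illustration, whereas the real word2vec vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" chosen so that the two fruits point the same way.
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "carrot": [0.1, 0.9, 0.2],
}
fruit_score = cosine_similarity(toy_vectors["apple"], toy_vectors["banana"])
veg_score = cosine_similarity(toy_vectors["apple"], toy_vectors["carrot"])
print(fruit_score > veg_score)  # True
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0, which is the kind of signal the classifier can pick up from the embedding layer.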


A Keras embedding layer can be obtained with Gensim's `word2vec.get_keras_embedding(train_embeddings=False)` method or constructed as shown below.
The null word embeddings count is the number of all-zero rows in the matrix, i.e. the words not found in our pre-trained vectors (in this case Google News), plus the padding row at index 0. In this context the missing words are possibly brand names or other unique words.

In [12]:
from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
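for word, i in word_index.items():
    # Skip ids that fall outside the matrix when the vocabulary exceeds MAX_NB_WORDS
    if i >= nb_words:
        continue
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)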
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

embedding_layer = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
                            embedding_matrix.shape[1], # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Null word embeddings: 1473


2724
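
To make the construction of the embedding matrix concrete, here is a small sketch with made-up data. `toy_pretrained`, `toy_word_index` and the word "acmebrand" are hypothetical, and the vectors have 4 dimensions instead of 300.

```python
import numpy as np

# Hypothetical toy data: a 4-dimensional "pre-trained" table and a small word index.
toy_pretrained = {
    "camera": np.array([0.1, 0.2, 0.3, 0.4]),
    "pen": np.array([0.5, 0.1, 0.0, 0.2]),
}
toy_word_index = {"camera": 1, "pen": 2, "acmebrand": 3}  # "acmebrand" is made up

dim = 4
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))  # row 0 is the padding id
for word, i in toy_word_index.items():
    if word in toy_pretrained:
        toy_matrix[i] = toy_pretrained[word]

# Rows that are entirely zero: the padding row plus every out-of-vocabulary word.
null_rows = int(np.sum(~toy_matrix.any(axis=1)))
print(null_rows)  # 2
```

Row `i` of the matrix holds the vector for the word with id `i`, so the embedding layer can turn a padded sequence of ids directly into a sequence of vectors.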

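I recommend [this](https://www.youtube.com/watch?v=FmpDIaiMIeA) 30-minute video about how convnets work to understand the layers.
Model explanation:
1. Embedding layer as input
2. A Dropout layer to avoid overfitting
3. Three 1-dimensional convolution layers (300, 150 and 75 filters, kernel size 3, stride 2)
4. A Flatten layer followed by Dropout
5. A Dense layer with 150 units, followed by Dropout
6. A final Dense layer with 3 units, one per category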

In [37]:

from keras.models import Sequential
from keras.layers import Conv1D, Dense, Dropout, Flatten

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
# Three strided 1D convolutions that progressively shorten the sequence
model.add(Conv1D(300, 3, padding='valid', activation='relu', strides=2))
model.add(Conv1D(150, 3, padding='valid', activation='relu', strides=2))
model.add(Conv1D(75, 3, padding='valid', activation='relu', strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150, activation='sigmoid'))
model.add(Dropout(0.2))
# Softmax yields a probability distribution over the three mutually exclusive categories
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 30, 300)           817200    
_________________________________________________________________
dropout_18 (Dropout)         (None, 30, 300)           0         
_________________________________________________________________
conv1d_21 (Conv1D)           (None, 14, 300)           270300    
_________________________________________________________________
conv1d_22 (Conv1D)           (None, 6, 150)            135150    
_________________________________________________________________
conv1d_23 (Conv1D)           (None, 2, 75)             33825     
_________________________________________________________________
flatten_4 (Flatten)          (None, 150)               0         
_________________________________________________________________
dropout_19 (Dropout)         (None, 150)               0         
__________
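
The output shapes and parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for a 1D convolution with `'valid'` padding; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1


def conv1d_param_count(in_channels, filters, kernel_size):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


length, channels = 30, 300  # MAX_SEQUENCE_LENGTH, EMBEDDING_DIM
shapes, params = [], []
for filters in (300, 150, 75):
    params.append(conv1d_param_count(channels, filters, kernel_size=3))
    length = conv1d_output_length(length, kernel_size=3, stride=2)
    channels = filters
    shapes.append((length, channels))

print(shapes)  # [(14, 300), (6, 150), (2, 75)]
print(params)  # [270300, 135150, 33825]
```

The last convolution leaves 2 positions with 75 channels each, which is why the Flatten layer reports 150 values per title.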

In [38]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)

Train on 243296 samples, validate on 162197 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa87f94f7f0>

In [None]:
from keras.utils import plot_model
plot_model(model, to_file='model.png')


In [45]:
from keras.layers import Conv2D
cnn = Sequential()
cnn.add(embedding_layer)
cnn.add(Conv2D(300, 3, strides=1, padding="same", activation="relu", input_shape=(1, 300, 1)))



ValueError: Input 0 is incompatible with layer conv2d_6: expected ndim=4, found ndim=3
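
The error comes from a mismatch in the number of dimensions: the embedding layer emits a 3D tensor of shape (batch, sequence length, embedding dimension), whereas a 2D convolution expects 4D input of shape (batch, rows, columns, channels) under the default channels-last layout. The NumPy sketch below only illustrates the shapes involved; the batch size of 2 is arbitrary.

```python
import numpy as np

batch, seq_len, emb_dim = 2, 30, 300

# Shape of what an embedding layer produces: (batch, sequence length, embedding dim).
embedded = np.zeros((batch, seq_len, emb_dim))
print(embedded.ndim)  # 3

# A 2D convolution expects (batch, rows, cols, channels); add a single channel axis.
with_channel = embedded[..., np.newaxis]
print(with_channel.shape)  # (2, 30, 300, 1)
```

In Keras, one possible way to get the same effect inside the model is a `Reshape((MAX_SEQUENCE_LENGTH, EMBEDDING_DIM, 1))` layer between the embedding layer and `Conv2D`. That is an untested suggestion here, not something verified against this notebook.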