## Using text descriptions to make predictions for new items

This notebook covers how to learn new features from the text data. These features can improve your predictions, particularly for new items that first appear in the test set and therefore have no sales history to train on.

The competition description states that the test set can (and does) include some new items that have never been seen in the training data. How do we make predictions for them? Well, we can look at the sales of items from the same shop, or items from the same category. These are both valid options and will help you. But intuition tells us that if a new album by the band FooBar arrives in the shops, one of the best ways to predict the sales for it would be to look at the performance of other albums by the same band. Can we do this with the competition data?

*(spoiler: the answer is yes)*

**_Note:_** The following approach allows you to take advantage of data in multiple languages without needing to speak any of them :) It also doesn't depend on any kind of external translation services like Google Translate, except to produce an initial training set which has already been done for you, and can be downloaded at the bottom of this notebook.

## Item and category descriptions

Fortunately there are just two fields that contain all of the information we know about an item: `item_name` and `item_category_name`. Unfortunately, they are unstructured text fields. Let's take a look:

In [None]:
!pip install -q --upgrade nltk
!pip install -q umap-learn
import nltk
nltk.download('averaged_perceptron_tagger_ru')

In [None]:
from itertools import product

import pandas as pd
from IPython.display import display
from textblob.utils import strip_punc
from tqdm.auto import tqdm

In [None]:
datadir = "../input/competitive-data-science-predict-future-sales"

In [None]:
# Get the name and category name of each item, along with its average price.
df = (
    pd.read_csv(f"{datadir}/sales_train.csv")
    .merge(pd.read_csv(f"{datadir}/items.csv"))
    .merge(pd.read_csv(f"{datadir}/item_categories.csv"))
    .groupby(["item_id", "item_category_name", "item_name"])["item_price"]
    .mean()
    .reset_index()
)
df

In [None]:
def remove_punctuation(text):
    return strip_punc(text, all=True)

In [None]:
df["item_name"] = df["item_name"].apply(remove_punctuation)
df["item_category_name"] = df["item_category_name"].apply(remove_punctuation)
with pd.option_context("display.max_colwidth", None):
    display(df.sort_values(["item_name", "item_category_name"]))

Straight away there are a few interesting points to note:

1. For both items and categories, we can see common words that suggest similarity (e.g. the PC, PS3 and XBox versions of *007 Legends*).
2. We can see a mix of both Russian and English text.

The first point is exactly what we were hoping for. This suggests we can use some text pre-processing to help our models learn which items are related. Let's look at the options:

* **One-hot encoding (bag-of-words).** Tells us which words have been seen in each item with a simple true/false flag. Doesn't account for the similarity of word meanings though, or the importance of each word in the description. It also can't give us a useful comparison between different pairs of items: difference is only measured by how many exact word matches there are in each description - the non-common words do not contribute to the similarity measurement.
* **TF-IDF vectorisation.** Slightly better than one-hot encoding, as we now give each word a score based on its importance. Apart from this though, it has many of the same shortcomings as OHE.
* **Word embeddings.** Word embeddings are the output of a neural network that has been trained to spot words that often appear in similar contexts. Word embeddings are a very powerful way of representing words and sentences. Not only can they tell us if the same words are appearing in two different item descriptions, but they can also measure how similar the *meanings* of the words are.
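
To make the limitation of the first two options concrete, here is a minimal sketch using scikit-learn. The item names are made up for illustration and are not taken from the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item names, invented for illustration only.
names = ["007 legends ps3", "007 legends xbox", "fifa ps3", "halo xbox"]

# binary=True gives the true/false flags of a one-hot (bag-of-words) encoding.
bow_sim = cosine_similarity(CountVectorizer(binary=True).fit_transform(names))
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(names))

# Items that share words get a non-zero similarity score...
print(bow_sim[0, 1], tfidf_sim[0, 1])

# ...but "fifa ps3" and "halo xbox" share no words, so both encodings score
# them as completely unrelated, even though both are console games.
print(bow_sim[2, 3], tfidf_sim[2, 3])
```

Neither encoding can tell that *ps3* and *xbox* are related words, which is the gap that word embeddings are meant to fill.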

Word embeddings would be a good option, but how do we create them? We could train our own word embedding model (or better still, a document embedding model, which also accounts for how the words in our item descriptions are grouped). However, to capture the full meaning of each word, and therefore spot similarities between items, a model needs a large corpus to learn from. Trained on our own data alone, it would only spot similarities when the exact same words keep appearing in similar items, and a simple one-hot or TF-IDF encoder could already identify those. We want embeddings that capture much more subtle similarities, e.g. that *PS3* and *XBox* products are related.

Fortunately, it's very easy to get hold of a word embedding model that's been pre-trained on a very large corpus of data. Let's use `gensim` to do this:

In [None]:
import gensim.downloader

In [None]:
en_model = gensim.downloader.load("glove-wiki-gigaword-300")

As the next cell shows, the model does a good job of recognising the similarity between the words "PS3" and "XBox":

In [None]:
en_model.similar_by_word("xbox")

A crude, but effective way of creating a single vector for an item description is to take the vector for each word in the sentence and average them. This will give us a single vector that we can then compare to the vectors of other items to see which ones are most similar by finding the items with the smallest cosine distance. A simple `NearestNeighbors` model from scikit-learn will do this for us efficiently.
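
Before applying this to the real data, here is a toy sketch of the averaging idea. The three-dimensional `toy_vectors` are invented for illustration; a real model such as the one loaded above uses hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_vectors = {
    "ps3": np.array([0.9, 0.1, 0.0]),
    "xbox": np.array([0.8, 0.2, 0.0]),
    "game": np.array([0.7, 0.0, 0.3]),
    "jazz": np.array([0.0, 0.1, 0.9]),
    "album": np.array([0.1, 0.0, 0.8]),
}


def mean_vector(text):
    """Average the vectors of every known word in the text."""
    known = [toy_vectors[w] for w in text.split() if w in toy_vectors]
    return np.mean(known, axis=0) if known else np.zeros(3)


def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


ps3_game = mean_vector("ps3 game")
xbox_game = mean_vector("xbox game")
jazz_album = mean_vector("jazz album")

d_similar = cosine_distance(ps3_game, xbox_game)
d_unrelated = cosine_distance(ps3_game, jazz_album)
print(d_similar)    # small distance: similar items
print(d_unrelated)  # much larger distance: unrelated items
```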

Let's try that for our item descriptions. We will combine both the category name and the item name to make an overall item description.

The vectors generated by this model have 300 dimensions, but we can use dimensionality reduction to get 2D vectors that we can visualise. t-SNE works well for this but is slow for the size of this dataset, so we will use UMAP (`pip install umap-learn`) instead which produces similar results much faster.

In [None]:
df["description"] = df["item_category_name"] + "\n" + df["item_name"]

In [None]:
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Remove any words that don't add meaning, such as "the", "in", "of", etc.
stopwords = stopwords.words("english")


def vectorize(row):
    text = row["description"]
    tokens = [
        word
        for word in word_tokenize(strip_punc(text.lower(), all=True))
        if word not in stopwords
    ]
    vecs = [en_model[tok] for tok in tokens if tok in en_model]
    if vecs:
        vector = np.mean(vecs, axis=0)
    else:
        vector = np.zeros_like(en_model.vectors[0])
    return vector

In [None]:
vecs = pd.DataFrame(
    data=np.array(df.apply(vectorize, axis=1).tolist()), index=df["description"]
)

In [None]:
import random

import seaborn as sns
from matplotlib import pyplot as plt
from umap import UMAP  # UMAP is faster than t-SNE

sns.set()


def plot_vectors(vectors, data=None, n_labels=0):
    fig, ax = plt.subplots(figsize=(16, 9))
    plt.close()
    ax.axis("off")
    seed = 42

    # Reduce the vectors to 2D with UMAP so that they can be plotted.
    reducer = UMAP(n_components=2, random_state=seed)
    reduced = reducer.fit_transform(vectors)
    colours = np.log1p(data["item_price"])
    ax.scatter(reduced[:, 0], reduced[:, 1], c=colours, cmap="RdBu_r", alpha=0.2)

    random.seed(seed)
    for idx in random.sample(range(len(reduced)), n_labels):
        x, y = reduced[idx]
        name = data.iloc[idx]["description"]
        if len(name) > 37:
            name = name[:37] + "..."

        ax.annotate(
            name,
            (x, y),
            xycoords="data",
            xytext=(random.randint(-100, 100), random.randint(-100, 100)),
            horizontalalignment="right" if x < 0 else "left",
            textcoords="offset points",
            color="black",
            bbox=dict(boxstyle="round", fc=(0.03, 0.85, 0.37, 0.45), ec="none"),
            arrowprops=dict(arrowstyle="simple", linewidth=5, ec="none",),
        )

    fig.tight_layout()
    return fig

In [None]:
from sklearn.neighbors import NearestNeighbors

display(plot_vectors(vecs, data=df, n_labels=15))

nn = NearestNeighbors(n_neighbors=3, metric="cosine", n_jobs=-1).fit(vecs)

# Preview the nearest neighbours for the first few items.
with pd.option_context("display.max_colwidth", None):
    display(
        pd.DataFrame(
            df["description"].values[
                nn.kneighbors(vecs[:10], n_neighbors=3, return_distance=False)
            ]
        )
    )

We can see from the plot that some clustering has taken place, and some of that clustering is identifying items with a similar price which is good. However, if we train a nearest neighbours model to show which items are closest to each other, we can actually see that the results are quite disappointing. Why is this? It's because our word embedding model was trained on *English* language text - it has no knowledge of Russian. This means that Russian words, which make up a significant amount of our text, aren't contributing any information to our similar items model. You can see this in the preview above: the only rows that have similar items are the ones with more English words in their descriptions.

We could use a Russian language model instead, but this would ignore the English words. To see which language would capture most of the information, let's do a simple count of how many Russian words there are vs. English words:

In [None]:
cyrillic = set("АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя")

en, ru = 0, 0


def count_langs(text):
    global en, ru
    tokens = [
        set(word)
        for word in word_tokenize(strip_punc(text.lower(), all=True))
        if word not in stopwords
    ]

    for tok in tokens:
        if tok & cyrillic:
            ru += 1
        else:
            en += 1


df["description"].apply(count_langs)

# (Rough) percentage of words that are English
en / (en + ru)

Oh dear: 40% of our words are English and 60% are Russian. Whichever model we choose, we are going to lose a lot of information. To get the most value out of our data, we need to find a way of using both languages.

### Aside: word vectors

You might now wonder if we can just use a Russian-language model to vectorise the Russian words, an English-language model to vectorise the English words, and then take the mean of all the vectors as before to get an overall vector for the mixed-language text. Unfortunately this won't work, because word vectors only have meaning in the vector space defined by the model they came from. Comparing one word vector to another vector from a different model will give a meaningless answer. Let's illustrate with an example:

In [None]:
# Note that this model was trained on POS (part-of-speech) tagged words.
# This means that we have to append a POS tag to the end of any Russian words before we look them up in this model.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")

In [None]:
print(en_model.similar_by_word("king", 1))

# The result for this should be 'царица', which means 'queen'
print(ru_model.similar_by_word("царь_NOUN", 1))

In [None]:
tsar_vector = ru_model.get_vector("царь_NOUN")
ru_model.similar_by_vector(tsar_vector, 2)

In [None]:
king_vector = en_model.get_vector("king")
en_model.similar_by_vector(king_vector, 2)

So far so good. But what happens when we look up the vectors in different models?

In [None]:
print(en_model.similar_by_vector(tsar_vector, 2))
print(ru_model.similar_by_vector(king_vector, 2))

We get results that make no sense. This is because vectors produced by one model are meaningless in the vector space of a different model. If we mix vectors from two different models our item clustering will be completely useless. We need to find a better way of using the information from both languages.
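
One way to build intuition for this is the synthetic sketch below. It does not use the real models: the vectors are random stand-ins, and "model B" is simply "model A" with randomly rotated axes. Similarities within each space are identical, yet comparing a vector from one space with a vector from the other tells us nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300


def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Two random stand-in "word vectors" in a hypothetical model A.
king_a = rng.normal(size=dim)
queen_a = king_a + 0.1 * rng.normal(size=dim)  # deliberately close to king_a

# Hypothetical model B: the same geometry, but with randomly rotated axes.
# The Q factor of a QR decomposition is an orthogonal matrix.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
king_b, queen_b = king_a @ rotation, queen_a @ rotation

within_a = cosine(king_a, queen_a)
within_b = cosine(king_b, queen_b)
across = cosine(king_a, king_b)

print(within_a)  # high: the two words are close in model A
print(within_b)  # the same value: a rotation preserves angles
print(across)    # much lower: the "same word" compared across two spaces
```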

## The `transvec` package

Fortunately there is a way around this. [Exploiting Similarities among Languages for Machine Translation (Mikolov et al., 2013)](https://arxiv.org/pdf/1309.4168.pdf) shows that word vector spaces for different languages share some linear similarities in their structure. This means we can use a transformation matrix to convert word vectors from a source space (e.g. Russian) into a target space (e.g. English), and the transformed vector should be close to vectors in the target space with a similar meaning.

The research paper shows that we can learn the transformation matrix from some training data using Ordinary Least Squares regression. The data we need is just pairs of Russian words with their English translations.
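
As a rough sketch of the idea (not `transvec`'s actual implementation), the least-squares fit can be written in a few lines of NumPy. The vectors here are synthetic stand-ins for the embeddings of translated word pairs, and `true_map` is a hypothetical "ground truth" used only to generate them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 200, 5

# Stand-ins for translated word pairs: each row of `source` plays the role of
# a Russian word vector, and the same row of `target` plays the role of the
# vector of its English translation.
true_map = rng.normal(size=(dim, dim))
source = rng.normal(size=(n_pairs, dim))
target = source @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

# Ordinary least squares: find W that minimises ||source @ W - target||^2.
W, *_ = np.linalg.lstsq(source, target, rcond=None)

# A new source-space vector can now be mapped into the target space.
new_word = rng.normal(size=dim)
translated = new_word @ W
print(np.abs(translated - new_word @ true_map).max())  # small mapping error
```

With real embeddings the relationship is only approximately linear, so the mapped vector lands near, rather than exactly on, the translation's vector.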

A Russian/English dictionary would make ideal training data, but we don't have one available, so we'll build a training set from the data we have instead. A simple translation training set was produced by running all of the words in the category and item names through the Google Translate API. This gave ~10,500 word pairs to learn a translation matrix from.

To make this technique reusable in other competitions, the code for producing these translation models was put into a separate package called `transvec`, which is now available on PyPI. `transvec` allows you to convert word vectors from any number of source languages into the vector space of a single 'target' language. In this example, we will use English as the target language since this notebook is in English, although it would be just as valid to use Russian as the target language.

In [None]:
!pip install -q transvec

In [None]:
from transvec.transformers import TranslationWordVectorizer

# transvec also includes a tokenizer that deals with the Russian POS tags that the pre-trained Russian model uses.
from transvec.tokenizers import EnRuTokenizer


word_pairs = pd.read_csv(f"{datadir}/../enru-word-pairs/ru_word_translations.csv")[["en", "ru"]]

# The transvec model takes the target language model first, followed by any source languages (you can provide more if you have a mix of more than two languages).
enru_model = TranslationWordVectorizer(
    en_model, ru_model, alpha=1, missing="ignore"
).fit(word_pairs)

In [None]:
# Our model can now automatically translate Russian words into a vector in English space.
# It doesn't get it right every time, but we can see that 6 out of the top ten words for "царь" ("tsar"), are correctly related by meaning.
# Note that if we provided an English word, the model would just default to the normal English-language vectors.
print(enru_model.similar_by_word("царь_NOUN"))

We can now use this to produce item vectors that use the information from *both* languages! `transvec` models include a scikit-learn `Transformer` API to make this task easier:

In [None]:
tokenizer = EnRuTokenizer()
tokens = df["description"].apply(tokenizer.tokenize)

item_vectors = pd.DataFrame(
    enru_model.transform(tokens), index=df["description"]
).fillna(0)

In [None]:
display(plot_vectors(item_vectors, data=df, n_labels=7))

nn = NearestNeighbors(n_neighbors=3, metric="cosine", n_jobs=-1).fit(item_vectors)

# Preview the nearest neighbours for the first few items.
with pd.option_context("display.max_colwidth", None):
    display(
        pd.DataFrame(
            df["description"].values[
                nn.kneighbors(item_vectors[:10], n_neighbors=3, return_distance=False)
            ]
        )
    )

We can now see that items are being clustered based on both the English and the Russian words, and that there are more distinct clusters which do a better job of grouping items with similar prices.

## Putting it all together

We can now use our item vectors to find the most similar items for every record. We can add this in to our original train and test data, meaning that we can now add features to each record based on not just the history of the item in question, but also the history of similar items.

In [None]:
# First, prepare the training data and test data into a single dataframe.

def denormalize(df):
    return df.merge(pd.read_csv(f"{datadir}/items.csv")).merge(
        pd.read_csv(f"{datadir}/item_categories.csv")
    )


train = (
    # Take the mean item price and item count for each month/shop/item combo
    pd.read_csv(f"{datadir}/sales_train.csv")
    .groupby(["date_block_num", "shop_id", "item_id"])
    .agg({"item_price": "mean", "item_cnt_day": ["mean", "sum"]})
    .reset_index()
)
train.columns = ["_".join([c for c in col if c]) for col in train.columns]
train.rename(columns={"item_cnt_day_sum": "item_cnt_month"}, inplace=True)
train = denormalize(train)

test = pd.read_csv(f"{datadir}/test.csv").drop("ID", axis=1)
test = denormalize(test)
test["date_block_num"] = 34

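# ignore_index gives every row a unique label; otherwise the train and test
# row labels would overlap.
data = pd.concat([train, test], ignore_index=True)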
data

We have textual descriptions for everything, but for the items in the test set there is no price or sales information. If the items are similar to another item that was seen in the historic training data, we could use the pricing/sales info for the similar item as a best guess.

In [None]:
# Next, prepare a nearest neighbours lookup table to allow us to look up similar items quickly.

items = data.groupby("item_id")[["item_category_name", "item_name"]].first()
items["description"] = items["item_category_name"] + "\n" + items["item_name"]
items = items.drop(columns=["item_category_name", "item_name"])

tokenizer = EnRuTokenizer()
tokens = items["description"].apply(tokenizer.tokenize)

# Index by item ID this time - it's easier to work with later on.
item_vectors = pd.DataFrame(enru_model.transform(tokens), index=items.index).fillna(0)
nn = NearestNeighbors(n_neighbors=3, metric="cosine", n_jobs=-1).fit(item_vectors)

# Number of neighbours we want to calculate for each item.
k = 3

# kneighbors returns row positions, so map them back to item IDs.
all_neighbours = pd.DataFrame(
    item_vectors.index.values[nn.kneighbors(n_neighbors=k, return_distance=False)],
    index=item_vectors.index,
)
all_neighbours

In [None]:
def add_nearest_neighbours(df, nns=all_neighbours):
    "Create a copy of df with extra columns containing the item IDs of the k most similar items (by description)"

    mergecol = "item_id"
    nn_ids = (
        nns.loc[df[mergecol]]
        .astype(np.int16)
        .rename(mapper=lambda x: f"{mergecol}_nn_{x + 1}", axis="columns")
    )

    return pd.concat([df, nn_ids.set_index(df.index)], axis=1)

In [None]:
# Add in the most similar item IDs to each row.
data = add_nearest_neighbours(data)
data

Now that we have the most similar items, we can use their IDs for self-joins to get features for any new items we have no history for. To demonstrate this, let's generate a new feature: the average monthly item count (across shops) for each item and for its most similar items.

We'll define a new function to do this. As well as gathering data on nearest neighbours, it can also deal with time lags for us (get the data from X months ago) so we can generate historic features for both the item itself and its nearest neighbours.
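
The core trick is easier to see on a toy table first. The sketch below uses made-up sales figures and handles a single lag and a single neighbour; the full function that follows generalises the same shift-and-merge idea.

```python
import pandas as pd

# Hypothetical monthly sales, invented for illustration only.
sales = pd.DataFrame(
    {
        "date_block_num": [0, 0, 1, 1],
        "item_id": [10, 20, 10, 20],
        "item_id_nn_1": [20, 10, 20, 10],  # ID of the most similar item
        "item_cnt_month": [5.0, 7.0, 6.0, 9.0],
    }
)

# Lag of one month: relabel each month's figures as belonging to the month
# after, so that a join on date_block_num picks up the previous month's value.
lagged = sales[["date_block_num", "item_id", "item_cnt_month"]].copy()
lagged["date_block_num"] += 1
lagged = lagged.rename(
    columns={
        "item_id": "item_id_nn_1",
        "item_cnt_month": "item_cnt_month_lag_1_nn_1",
    }
)

# Self-join on the neighbour's ID rather than on the item's own ID.
result = sales.merge(lagged, on=["date_block_num", "item_id_nn_1"], how="left")
print(result)
```

Rows for month 0 get `NaN` because there is no earlier month to look up, while rows for month 1 pick up the neighbour's count from month 0.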

In [None]:
def lagjoin(df, groupon, features, lags, nns=(0,), agg="mean", dtype=None):
    lagcols = pd.DataFrame(index=df.index)

    # Accept a single column name as well as a list of names.
    if isinstance(groupon, str):
        groupon = [groupon]
    if isinstance(features, str):
        features = [features]

    for lag, nn in tqdm(list(product(lags, nns))):
        # A lag of 0 means the current month. A nn of 0 means the current item, not a similar one.
        
        if not lag and not nn:
            # Duplicate of original data.
            continue

        shifted = df[groupon + features].groupby(groupon).agg(agg).reset_index()
        shifted["date_block_num"] += lag

        lgrpcols = [
            col if col != "item_id" or not nn else f"{col}_nn_{nn}" for col in groupon
        ]
        rgrpcols = groupon

        newfeatures = df.merge(
            shifted, left_on=lgrpcols, right_on=rgrpcols, how="left"
        )[[f + "_y" if f in df.columns else f for f in features]]
        newfeatures.columns = features

        colnames = [fcol + f"_lag_{lag}" if lag else fcol for fcol in features]
        if nn:
            colnames = [fcol + f"_nn_{nn}" for fcol in colnames]

        if dtype is None:
            newdata = newfeatures.values
        else:
            newdata = newfeatures.values.astype(dtype)

        lagcols = pd.concat(
            [lagcols, pd.DataFrame(newdata, columns=colnames, index=lagcols.index)],
            axis=1,
        )

    return lagcols

In [None]:
newfeatures = lagjoin(
    data,
    
    # Try different groupings, e.g. month, item and shop ID.
    groupon=["date_block_num", "item_id"],
    
    # Can also be a list of columns.
    features="item_cnt_month",
    
    # A lag of 0 lets us get data for similar items in the same month.
    lags=[0, 1],
    
    # A nearest neighbour of 0 lets us get data for the same item in previous months.
    nns=[0, 1, 2, 3],
)

data = pd.concat([data, newfeatures], axis=1)
data

We now have data for both the item and similar items during the current month (for training data) and previous months (for training and test data). You can go further by imputing missing values, e.g. by taking averages across shop/item groups, which will give you more data to work with for "new" items (items that are not present in the training data).
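
As one possible starting point for that imputation, the sketch below fills gaps with a group mean. The rows are invented, and grouping by shop and category is an assumption made for illustration; other groupings may suit your model better.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration; NaN marks a missing feature.
toy = pd.DataFrame(
    {
        "shop_id": [1, 1, 1, 2, 2],
        "item_category_id": [40, 40, 40, 40, 40],
        "item_cnt_month_lag_1": [2.0, 4.0, np.nan, 10.0, np.nan],
    }
)

# Mean of the known values within each shop/category group, aligned to rows.
group_mean = toy.groupby(["shop_id", "item_category_id"])[
    "item_cnt_month_lag_1"
].transform("mean")

# Only the missing entries are replaced; known values are left untouched.
toy["item_cnt_month_lag_1"] = toy["item_cnt_month_lag_1"].fillna(group_mean)
print(toy)
```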