### Problem - sparse matrix

I can see that people tend to represent text features using bag-of-words, i.e. a tf-idf transform. Because there are a lot of distinct words in the item names, the resulting matrix is very sparse and has thousands of dimensions. Keeping only the N most important features is one solution, although the remaining words can obviously be crucial as well.

In this kernel, I'd like to propose a few methods of representing text features in a more compact way. I will focus on item names, but the same approach can be applied to cities and categories as well.

Let's start with the imports:

In [None]:
import numpy as np
import pandas as pd
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.pipeline import FeatureUnion
import string

# Plotly is used for the interactive scatter plots below
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

Thanks to **deargle** and **Kâzım Anıl Eren**, we can read the translated item names directly into the notebook from a Kaggle dataset. Let's read them together with the original names.


In [None]:
items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")
items_english = pd.read_csv("../input/predict-future-sales-supplementary/item_category.csv")

items = list(items["item_name"])
items_english = list(items_english["item_name_translated"])

In [None]:
# Print the first 10 translated names
for name in items_english[:10]:
    print(name)

Let's clean the text a little bit by stripping punctuation.

In [None]:
items_english = [''.join(sign for sign in item if sign not in string.punctuation) for item in items_english]
# Print the first 10 cleaned names
for name in items_english[:10]:
    print(name)

To keep the kernel feasible, I'm going to use a random sample of 1000 examples.

In [None]:
np.random.seed(123)
ind = np.arange(len(items_english))
np.random.shuffle(ind)
ind = ind[:1000]
items_english = np.array(items_english)[ind]

### Starting with bag-of-words

One of the easiest ways to represent text is to convert it to bag-of-words.

A bag-of-words vector is very sparse: it has as many elements as the vocabulary, and only the few positions corresponding to tokens that occur in the text are non-zero (1's or counts for plain bag-of-words, weights for tf-idf). As features, we can use characters or words, and their n-grams.
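To make the sparsity concrete, here is a minimal pure-Python sketch of a plain (count-based) bag-of-words. The three item names are made up for illustration and are not taken from the competition data, and the helper `bag_of_words` is hypothetical; the real cells below use `TfidfVectorizer`, which additionally reweights the counts.

```python
from collections import Counter

# Hypothetical toy item names, not taken from the competition data
names = ["pink floyd the wall cd", "pink floyd animals lp", "xbox 360 game"]

# Vocabulary: every distinct word gets one column
vocabulary = sorted({word for name in names for word in name.split()})
column = {word: j for j, word in enumerate(vocabulary)}


def bag_of_words(name):
    """Return a count vector with one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(name.split()).items():
        vector[column[word]] = count
    return vector


matrix = [bag_of_words(name) for name in names]

# Fraction of non-zero cells in the whole matrix
nonzero = sum(1 for row in matrix for value in row if value)
density = nonzero / (len(names) * len(vocabulary))
print(len(vocabulary), nonzero, density)
```

Even with three short names, more than half of the cells are zero. With thousands of names the vocabulary keeps growing while each name still contains only a handful of words, so the share of zeros gets much larger.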

In [None]:
tfidf_word_vectorizer = TfidfVectorizer(max_features=None, analyzer='word', ngram_range=(1, 1))
tfidf_word_features = tfidf_word_vectorizer.fit_transform(items_english)
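# vocabulary_ maps each token to its column index;
# get_feature_names() was removed in newer scikit-learn releases
len(tfidf_word_vectorizer.vocabulary_)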

There are 15258 words in the item names. Which are the most informative? Is it enough to keep the top N to improve our models?

Because the text is still messy and the Russian language has rich morphology, we can use character n-grams (sequences of characters of a given length) instead of words, to catch all the important features.
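A small sketch of why character n-grams help with morphology: two forms of the same word are completely different tokens at the word level, but they share almost all of their character trigrams. The helper `char_ngrams` is hypothetical and simplified; as far as I know, the `analyzer='char'` mode of `TfidfVectorizer` also counts n-grams that span spaces between words, which this toy version ignores.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


singular = char_ngrams("audiobook")
plural = char_ngrams("audiobooks")

shared = singular & plural
# Jaccard similarity: shared trigrams divided by all distinct trigrams
jaccard = len(shared) / len(singular | plural)

print(sorted(shared))
print(jaccard)
```

At the word level "audiobook" and "audiobooks" have nothing in common, while their trigram sets overlap almost entirely.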

In [None]:
tfidf_char_vectorizer = TfidfVectorizer(max_features=None, analyzer='char', ngram_range=(3, 3))
tfidf_char_features = tfidf_char_vectorizer.fit_transform(items_english)

In [None]:
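# vocabulary_ maps each n-gram to its column index;
# get_feature_names() was removed in newer scikit-learn releases
len(tfidf_char_vectorizer.vocabulary_)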

The number of features is huge!

It's not a very efficient way to represent the text. We could keep it as a sparse CSR matrix (which is what `TfidfVectorizer` returns) to store it better, but handling mixed types of matrices or two separate matrices is brittle.
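For readers who haven't met CSR (compressed sparse row) before, here is a minimal pure-Python sketch of the idea: only the non-zero values are stored, together with their column indices and a pointer to where each row starts. The helpers `to_csr` and `get_row` and the toy matrix are made up for illustration; `scipy.sparse.csr_matrix` keeps the same three arrays under the names `data`, `indices` and `indptr`.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(j)    # its column
        indptr.append(len(data))     # where the next row starts in `data`
    return data, indices, indptr


def get_row(csr, i, n_cols):
    """Rebuild dense row `i` from the CSR arrays."""
    data, indices, indptr = csr
    row = [0.0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        row[indices[k]] = data[k]
    return row


# Hypothetical tf-idf-like matrix: 3 items, 5 features
dense = [
    [0.0, 0.7, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
]
csr = to_csr(dense)
print(csr)
```

Here 15 cells shrink to 4 stored values plus their bookkeeping. The saving grows with the number of zero cells, but every other dense feature then has to be converted or stacked carefully, which is the brittleness mentioned above.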

We can, however, compress the data and represent the items in a continuous space, based on their syntax (not semantics!). Let's try to do that.

In [None]:
projector = TSNE(n_components=2, perplexity=5, random_state=42)
tfidf_projection = projector.fit_transform(tfidf_char_features.toarray())

In [None]:
# Create a trace
trace = go.Scatter(
    text=items_english,
    x=tfidf_projection[:, 0],
    y=tfidf_projection[:, 1],
    mode='markers',
)
layout = go.Layout(title="Char tf-idf projection of items names", hovermode='closest')
figure = go.Figure(data=[trace], layout=layout)
py.iplot(figure, filename='projection.html')


We can see that the data is now represented in an interesting way. For example, many PINK FLOYD albums are grouped together, and there are clear clusters for audiobooks, board games, XBOX etc.

This representation is probably much better than using only the top N features from the tf-idf vectorizer, because it takes all the words into account. In two dimensions we can show that some item names are syntactically more similar than others. We can suspect that this similarity carries some semantic meaning, and that the obtained features will correlate well with the features derived for the competition.


### Word-embeddings

The next possible step is to use word embeddings. This is, however, non-trivial: item names don't have a constant length, it's hard to get a good representation of domain-specific words, the models require a lot of memory, and, in my opinion, the semantics of the item names may already be covered by the categories. But we have to check it!

I used the fastText library to load the official Facebook model trained on Wikipedia. Then I used `get_sentence_vector()` to infer a fixed-length vector for each item name, which gets rid of the problem of variable sentence length. The result of this operation is uploaded [here](https://www.kaggle.com/davids1992/predictfuturesalesitemnametranslatedfasttext).
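The sketch below shows the general idea of turning a name of any length into one fixed-length vector by averaging word vectors. It is not the fastText implementation: the 4-dimensional `toy_vectors` table and the `sentence_vector` helper are made up, and I'm assuming (from my reading of the fastText documentation) that `get_sentence_vector()` averages L2-normalised word vectors. Real fastText also builds vectors for unknown words from subwords, which this toy table cannot do.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; real ones come from the fastText model
toy_vectors = {
    "pink": np.array([1.0, 0.0, 0.0, 0.0]),
    "floyd": np.array([0.0, 2.0, 0.0, 0.0]),
    "wall": np.array([0.0, 0.0, 3.0, 4.0]),
}


def sentence_vector(text, vectors, dim=4):
    """Average the L2-normalised vectors of the known words in `text`."""
    normalised = []
    for word in text.lower().split():
        vec = vectors.get(word)
        if vec is None:
            continue  # no subword fallback in this toy table
        norm = np.linalg.norm(vec)
        if norm > 0:
            normalised.append(vec / norm)
    if not normalised:
        return np.zeros(dim)
    return np.mean(normalised, axis=0)


shorter = sentence_vector("Pink Floyd", toy_vectors)
longer = sentence_vector("Pink Floyd The Wall", toy_vectors)
print(shorter.shape, longer.shape)
```

Both names end up with the same number of dimensions regardless of how many words they contain, so the vectors can be stacked into one feature matrix.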

Loading:

In [None]:
fasttext_features = pd.read_csv("../input/predictfuturesalesitemnametranslatedfasttext/fasttext_features.csv")
# Keep only the embedding columns and convert them to a NumPy array
fasttext_features = fasttext_features.drop(['item_id', 'Unnamed: 0'], axis=1).values

We can visualize these features in the same way.

In [None]:
projector = TSNE(n_components=2, perplexity=60.0, random_state=42)
fasttext_projection = projector.fit_transform(fasttext_features)

In [None]:
# Keep the same 1000 sampled items as in the tf-idf plot
fasttext_projection = fasttext_projection[ind]

In [None]:
# Create a trace
trace = go.Scatter(
    text=items_english,
    x=fasttext_projection[:, 0],
    y=fasttext_projection[:, 1],
    mode='markers',
)
layout = go.Layout(title="FastText projection of items names", hovermode='closest')
figure = go.Figure(data=[trace], layout=layout)
py.iplot(figure, filename='projection.html')


Sentence embeddings don't look good at all. I checked the embedding generation and the shuffling a few times and I can't see any bug, so it's hard to say why no patterns can be seen. One guess is that I lowercased everything, and "Pink Floyd" can have a very different representation than "pink floyd" in fastText. Also, the variable length and Russian characters may confuse the model.

Will it work with the regressors? It's always worth checking.

---
If you find my kernel interesting, please don't forget to upvote :)