## Text vector representations (embeddings)

In [1]:
import pandas as pd
import numpy as np

import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [2]:
df = pd.read_csv("airbnb-eng.tsv", delimiter='\t')
df.sample(5)

Unnamed: 0,Price,Description,lang
13850,78.0,"Magical Spanish villa tucked away on a palm tree-lined street in Silver Lake. The bedroom has french doors which open to a tranquil, lush garden and it has it's own private bathroom. Just walking distance to Sunset Blvd. and the best of Silver Lake This 2 bedroom, 2 bath house has a wonderful indoor/outdoor flow. Hardwood floors, coved ceilings, built-ins, vintage-style baths and a beautiful fireplace make this a very unique and comfortable space. The guest room for rent has wonderful natural light and looks out on a private, dense garden. It faces west, so it is amazing during the sunset! The room has a queen-size bed, a spacious walk-in closet and it's own, private bathroom with a bathtub. The galley kitchen has all of the amenities, a breakfast nook and it opens up to the magical b...",en
7043,249.0,"This nice one bedroom apartment is newly renovated and located only two blocks from the French quarter with free street parking. It is part of a historic mansion built in 1855 and located on the historic Esplanade Ave with 10ft ceilings throughout. Walking distance to all French Quarter festivities. The court yard and all of apt #3. As much as you'd like. I'm there if you need me and not if you don't. The mansions on Esplanade Ave is an attraction to tourism. During carnival and festival season you'll enjoy a short scenic wall to the cities heart of history and festivities. The other direction is a 2 miles straight shot to New Orleans City Park where you can enjoy kayaking, bike trails, golf, fishing, tennis and much more. This apartment is conveniently located two blocks from the Fren...",en
40442,60.0,"Hello Everyone! my name is Ester, my boyfriend Antoine and I have lived in our studio for almost one year. We both moved from outside of California for work. You will enjoy staying at our place! It is situated in an excellent area where all the buzz is happening on Venice! You''ll see many people riding their bikes, cars with surf boards, and palm trees everywhere! The weather is perfect year round. The scenery is beautiful filled with flowers and the streets are filled with artistic murals. Our studio is a perfect little space with a full kitchen at your disposal. We have a full size bed as well as space for an air mattress (which we can provide) as well. You will have access to the pool and hot tub to relax at any time and our bikes (2) are also available to use. At night you can me...",en
41512,150.0,"Second floor room with double bed TV, Cable, Internet in private house with front and back yard. Private in ground pool, grill, etc. In the heart of Beach zone of Staten Island, Brooklyn, Rockaways, Long Island, Jersey Shore. Surfs up!!!",en
56198,34.0,"Room available in a lovely terraced house in Mile End. Lovely people and lovely neighborhood. Close to Victoria Park, shoreditch and Olympic Park. 15 mins by tube or bus to central London.",en


In [3]:
df.dropna(inplace=True)

In [4]:
len(df)

81406

## Byte Pair Encoding

First, let's apply compression method - byte pair encoding. We will use sentencepiece framework by Google: it is provided with python mappings:

In [5]:
import sentencepiece as spm

In [6]:
! cat sample.txt

this is the first sentence
and this is the second
bla bla bla
the end

In [7]:
spm.SentencePieceTrainer.train('--input=sample.txt --model_prefix=sample_model --vocab_size=64 --character_coverage=1.0 --model_type=bpe')
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('sample_model.model')

print(sp_bpe.encode_as_pieces('thisisatestblablabla'))

['▁this', 'is', 'a', 'te', 'st', 'bla', 'bla', 'bla']


That is "encoded" data.

Sentencepiece can be run from CLI as well:

In [8]:
! spm_train --input=sample.txt --model_prefix=sample_model --vocab_size=64 --character_coverage=1.0 --model_type=bpe

sentencepiece_trainer.cc(79) LOG(INFO) Starts training with : 
trainer_spec {
  input: sample.txt
  input_format: 
  model_prefix: sample_model
  model_type: BPE
  vocab_size: 64
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}


In [9]:
!echo "this is a test blablabla." | spm_encode --model=sample_model.model

▁this ▁is ▁a ▁ te st ▁bla bla bla .


Let's apply BPE to our Airbnb dataset: to do so we have to first reformat raw (important! preprocessing and tokenisation are performed within the framework) data in such format, that each row represents separate sentence.

In [10]:
text = df.Description
text[0]

"Hi everyone, Cosy bedroom in a modern apartment located in a central area, Paris 11th. * THE APARTMENT : it's a 2 bedrooms (one is mine) apartment of 47m2 (509 sq ft) fully renovated, warm atmosphere + a living room with a equiped kitchen, wifi. I provide towels and sheets * CENTRAL AREA : Cosmopolite, non-touristic, very close to the Marais, Bastille and Republic. * TRANSPORTS - 2 metro (3' walk) Saint Ambroise (line 9) or Richard Lenoir (line 5) - 2 city bike station Best J."

In [11]:
text[0].split('.')

['Hi everyone, Cosy bedroom in a modern apartment located in a central area, Paris 11th',
 " * THE APARTMENT : it's a 2 bedrooms (one is mine) apartment of 47m2 (509 sq ft) fully renovated, warm atmosphere + a living room with a equiped kitchen, wifi",
 ' I provide towels and sheets * CENTRAL AREA : Cosmopolite, non-touristic, very close to the Marais, Bastille and Republic',
 " * TRANSPORTS - 2 metro (3' walk) Saint Ambroise (line 9) or Richard Lenoir (line 5) - 2 city bike station Best J",
 '']

In [12]:
sents = [t.split('.') for t in text]

with open('airbnb_sents.txt', 'w') as f:
    for t in sents:
        for s in t:
            f.write(s)
            f.write('\n')

In [13]:
! head -n 1 airbnb_sents.txt

Hi everyone, Cosy bedroom in a modern apartment located in a central area, Paris 11th


Run BPE, dictionary size = 1,000.

In [14]:
%%time

! spm_train --input=airbnb_sents.txt --model_prefix=airbnb_model_1 --vocab_size=1000 --character_coverage=0.99--model_type=bpe

sentencepiece_trainer.cc(79) LOG(INFO) Starts training with : 
trainer_spec {
  input: airbnb_sents.txt
  input_format: 
  model_prefix: airbnb_model_1
  model_type: UNIGRAM
  vocab_size: 1000
  self_test_sample_size: 0
  character_coverage: 0.99
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalizat

~1 min was required to calculate the results, the higher compression rate (the lower size of the dictionary), the more time is required.

We can encode the text with n-grams:

In [15]:
!echo "Hi everyone, Cosy bedroom in a modern apartment located in a central area, Paris 11th." | spm_encode --model=airbnb_model_1.model


▁H i ▁every one , ▁Co s y ▁bedroom ▁in ▁a ▁modern ▁apartment ▁located ▁in ▁a ▁central ▁area , ▁Paris ▁1 1 th .


Or with corresponding ids:

In [16]:
!echo "Hi everyone, Cosy bedroom in a modern apartment located in a central area, Paris 11th." | spm_encode --model=airbnb_model_1.model --output_format=id

120 29 592 305 5 270 4 19 68 11 8 278 37 109 11 8 347 89 5 417 65 177 71 0


Now we can write our encoded data to train vowpall wabbit:

In [17]:
with open('airbnb_texts.txt', 'w') as f:
    for t in text:
        f.write(t)
        f.write('\n')

In [18]:
%%time

!spm_encode --model=airbnb_model_1.model --output_format=id airbnb_texts.txt > airbnb_bpe_encoded.txt

CPU times: user 37.9 ms, sys: 32.3 ms, total: 70.2 ms
Wall time: 7.02 s


In [19]:
! head -n 1 airbnb_bpe_encoded.txt

120 29 592 305 5 270 4 19 68 11 8 278 37 109 11 8 347 89 5 417 65 177 71 0 3 0 853 41 374 749 114 242 91 326 114 3 0 134 38 4 8 54 299 39 305 10 190 15 43 37 13 3 0 27 127 39 617 0 740 3 410 43 249 396 5 801 939 3 0 8 93 30 14 8 3 15 437 29 31 35 57 5 595 0 70 526 471 6 877 3 0 112 91 326 114 191 117 144 41 568 117 3 0 270 4 224 31 256 200 5 297 20 28 12 252 301 73 5 111 132 9 7 579 18 218 5 49 18 84 106 32 6 423 31 348 56 73 0 3 0 110 191 699 145 374 118 191 114 145 77 54 436 39 0 38 76 43 965 41 27 44 87 218 15 39 679 3 0 43 81 272 29 83 275 725 511 159 39 679 155 43 77 54 139 690 187 49 228 3 0


In [20]:
with open('airbnb_bpe_encoded.txt', 'r') as bpe:
    encoded_texts = bpe.readlines()
    
len(encoded_texts)

81406

In [21]:
def convert_to_vw(raw_text, target):
    return "{} |d {}".format(float(target), raw_text)
        
def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object)

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
X_train, X_test, y_train, y_test = train_test_split(encoded_texts, df.Price, test_size=0.33, random_state=42)

In [24]:
len(X_train), len(y_train), len(X_test), len(y_test)

(54542, 54542, 26864, 26864)

In [25]:
write_vw(X_train, y_train, "airbnb-train-bpe.vw")
write_vw(X_test, y_test, "airbnb-test-bpe.vw")

r2 metric scoring:

In [26]:
from sklearn.metrics import r2_score

def read_target_from_vw(vw_object):
    return float(vw_object.split(' ')[0])

def calc_r2(predictions_path, answers_path):
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])

        
    return r2_score(y_expected, y_pred)

Model itself:

In [27]:
%%time
! vw --final_regressor airbnb-lin-model-bpe.vw.bin airbnb-train-bpe.vw --passes 30 -l 10 -c -k --bit_precision 23 

final_regressor = airbnb-lin-model-bpe.vw.bin
Num weight bits = 23
learning rate = 10
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-bpe.vw.cache
Reading datafile = airbnb-train-bpe.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
324.000000 324.000000            1            1.0  18.0000   0.0000      270
438.408447 552.816895            2            2.0  48.0000  24.4879      302
1115.257774 1792.107101            4            4.0 145.0000  87.5479      334
4840.171246 8565.084717            8            8.0 110.0000 240.8061      303
4695.488470 4550.805695           16           16.0 192.0000 255.6292      355
6673.270553 8651.052636           32           32.0 403.0000 238.2178      319
18521.320179 30369.369805           64           64.0  78.0000  58.0186      313
26787.004926 35052.689673          128          128.0 189.0000

In [28]:
! vw --testonly --initial_regressor airbnb-lin-model-bpe.vw.bin --predictions airbnb-1-predictions-bpe.txt airbnb-test-bpe.vw

only testing
predictions = airbnb-1-predictions-bpe.txt
Num weight bits = 23
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airbnb-test-bpe.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
23058.980469 23058.980469            1            1.0 215.0000  63.1482      319
14398.844482 5738.708496            2            2.0 189.0000 113.2457      277
9505.072540 4611.300598            4            4.0  25.0000 116.8266      309
16812.621964 24120.171387            8            8.0 698.0000 533.2416      354
11089.064659 5365.507355           16           16.0  58.0000 102.7545      350
12351.417374 13613.770089           32           32.0  50.0000  73.4964      370
8441.588420 4531.759466           64           64.0  54.0000  86.7532      317
14959.975895 21478.363370          128          128.0 245.0000 145.8976      207
13014.284634 11068

In [29]:
print(calc_r2("airbnb-1-predictions-bpe.txt", "airbnb-test-bpe.vw"))

0.3283231885514082


The quality is almost the same as with "raw" tokens, but the model is trained faster and amount of features is less.

Let's compare with linear regression from sklearn:

In [30]:
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
vectorizer = CountVectorizer()  
X = vectorizer.fit_transform(encoded_texts)

In [32]:
len(vectorizer.get_feature_names())

990

As expected (:

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, df.Price, test_size=0.33, random_state=42)

In [34]:
%%time

regressor = Ridge(solver='sparse_cg').fit(X_train, y_train)

CPU times: user 1.37 s, sys: 82.5 ms, total: 1.45 s
Wall time: 1.45 s


In [35]:
y_pred = regressor.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)

0.3376712432605553


We have much ligher model by using BPE and small increase of quality in comparison to linear regression on raw tokens.

## word2vec

Now we try to train word2vec model to get word vectors. Model is always trained on preprocessed texts, so we perform the same preprocessing as before: tokenization, punctuation and stop-words removal, stemming.

In [36]:
import re
import nltk.data 
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

words_only = re.compile('[a-z]+')

def letters(s, regex = words_only):
    if isinstance(s, str):
        return words_only.findall(s.lower())
    else:
        return []

def remove_stopwords(tokens, sw = stop_words):
    return [t for t in tokens if not t in sw]

def preprocess(s):
    return remove_stopwords(letters(s))

In [37]:
from nltk.stem.snowball import SnowballStemmer
from functools import lru_cache


snowball = SnowballStemmer("english")

def stemm_description(d):
    @lru_cache(maxsize=128)
    def stemm_token(token):
        return snowball.stem(token)

    return [stemm_token(t) for t in d if len(stemm_token(t)) >= 3]

In [38]:
from multiprocessing import Pool
from tqdm import tqdm_notebook as tqdm

with Pool(8) as p:
    clean_text = list(tqdm(p.imap(preprocess, df.Description), total=len(df)))
    
df['clean_text'] = clean_text
    
with Pool(8) as p:
    stems = list(tqdm(p.imap(stemm_description, df.clean_text), total=len(df)))
    
df['stems'] = stems

df.sample()

HBox(children=(FloatProgress(value=0.0, max=81406.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=81406.0), HTML(value='')))




Unnamed: 0,Price,Description,lang,clean_text,stems
71947,80.0,"Bright apartment with a long terrace, 10 minutes from the Vatican museums and St. Peter (Vatican City). It's well connected to the center with a metro station (Cipro) 10 min by feet. The neighborhood is quiet, but full of shops, markets and several restaurants. The apartment has a spacious entrance with 2 armchairs, a dining room with kitchenette and other 2 armchairs, one can become an additional bed, fridge with freezer, microwave, TV, induction hob and oven, washing machine, iron. There are 2 bedrooms about 20 square meters, one with a king bed with a large desk, and the other one with 2 single beds and 2 desks (if necessary you can add an extra bed). Moreover the apartment has a bathroom and long balcony with a beautiful high (from the last floor) view of the city and the Vatican g...",en,"[bright, apartment, long, terrace, minutes, vatican, museums, st, peter, vatican, city, well, connected, center, metro, station, cipro, min, feet, neighborhood, quiet, full, shops, markets, several, restaurants, apartment, spacious, entrance, armchairs, dining, room, kitchenette, armchairs, one, become, additional, bed, fridge, freezer, microwave, tv, induction, hob, oven, washing, machine, iron, bedrooms, square, meters, one, king, bed, large, desk, one, single, beds, desks, necessary, add, extra, bed, moreover, apartment, bathroom, long, balcony, beautiful, high, last, floor, view, city, vatican, gardens, great, spot, enjoy, summer, breeze, evening, kitchenette, everything, need, cooking, various, pots, meal, peacefully, bedrooms, dining, room, wood]","[bright, apart, long, terrac, minut, vatican, museum, peter, vatican, citi, well, connect, center, metro, station, cipro, min, feet, neighborhood, quiet, full, shop, market, sever, restaur, apart, spacious, entranc, armchair, dine, room, kitchenett, armchair, one, becom, addit, bed, fridg, freezer, microwav, induct, hob, oven, wash, machin, iron, bedroom, squar, meter, one, king, bed, larg, desk, one, singl, bed, desk, necessari, add, extra, bed, moreov, apart, bathroom, long, balconi, beauti, high, last, floor, view, citi, vatican, garden, great, spot, enjoy, summer, breez, even, kitchenett, everyth, need, cook, various, pot, meal, peac, bedroom, dine, room, wood]"


In [39]:
with open('text_for_vector_models.txt', 'w') as f:
    for s in df.stems:
        f.write(' '.join(s))
        f.write('\n')

In [40]:
from gensim.models import word2vec

Main model params:

* size — vector size, 
* window — size of the observation window,
* min_count — minimum word frequency in the corpora (not equal to 0 by default!),
* sg — learning algorithm to use (0 — CBOW, 1 — Skip-gram),
* sample — threshold for downsampling of high frequency words,
* workers — amount of threads,
* alpha — learning rate,
* iter — iterations count,
* max_vocab_size — allows to limit used RAM for dictionary creation (in case of limit violation the lowest words by frequence would be dropped out from the dictionary). For example: 10 mln of words = 1Gb of RAM.

In [41]:
%time w2v_model = word2vec.Word2Vec(df.stems, workers=4, size=300, min_count=10, window=4, sample=1e-3)

CPU times: user 1min 18s, sys: 0 ns, total: 1min 18s
Wall time: 40.9 s


If you work with large texts collection, which does not fit into memory, we can use an iterator as dataset loader object (file format is the same as sentencepiece: one row = one sentence, words are split by spaces; possible to have several files).

In [42]:
with open('sentences_w2v.txt', 'w') as f:
    for s in df.stems:
        f.write(' '.join(s)+'\n')

In [43]:
!head -n 3 sentences_w2v.txt

everyon cosi bedroom modern apart locat central area pari apart bedroom one mine apart fulli renov warm atmospher live room equip kitchen wifi provid towel sheet central area cosmopolit non tourist close marai bastill republ transport metro walk saint ambrois line richard lenoir line citi bike station best
comfort calm apart room center pari bastill area welcom explain live area bakeri market restaur bar live build close public transport metro ledru rollin charonn bastill gare lyon bus germain louvr gare austerlitz gare nord live room bedroom kitchen bathroom sleep person doubl bed sofa bed also divid separ part quiet apart direct opposit love park central locat live restaur cafe fruit market live build love area nice vie quartier market bakeri apart equip fulli equip kitchen refriger freezer toaster kettl microwav induct hob dishwash wash machin coffe maker sheet towel linen provid broadband internet adsl
minut walk publiqu oberkampf nilmont lachais rent flat includ shower room toil

In [44]:
%%time

import os

class Sentences(object):
    def __init__(self, filename):
        self.filename = filename
 
    def __iter__(self):
            for line in open(self.filename, 'r'):
                yield line.split()
 

sents = Sentences('sentences_w2v.txt') 
model_f = word2vec.Word2Vec(sents, workers=4, size=300, min_count=10, window=4, sample=1e-3)

CPU times: user 1min 23s, sys: 0 ns, total: 1min 23s
Wall time: 43.4 s


We can look at the closest word's neighbours (model is trained on stemms):

In [45]:
print(w2v_model.wv.most_similar(positive=["room"], topn=5))

[('bedroom', 0.6954348087310791), ('larg', 0.5923123359680176), ('bathroom', 0.5694455504417419), ('adjoin', 0.5647478103637695), ('livingroom', 0.5623283982276917)]


In [46]:
print(w2v_model.wv.most_similar(positive=["shower"], topn=5))

[('clawfoot', 0.6027348041534424), ('sink', 0.5912506580352783), ('vaniti', 0.5649619102478027), ('bathtub', 0.546120285987854), ('rainshow', 0.5302311182022095)]


In [47]:
print(model_f.wv.most_similar(positive=["shower"], topn=5))

[('sink', 0.5929206609725952), ('clawfoot', 0.5884297490119934), ('vaniti', 0.5783952474594116), ('bathtub', 0.5565468668937683), ('washbasin', 0.5494244694709778)]


In [48]:
print(w2v_model.wv.most_similar(positive=["large"], topn=5))

KeyError: "word 'large' not in vocabulary"

Problem of the word2vec model - out of vocabulary words, which are not included into training dataset (in our case because we use stemms).

Let's train linear regression: first we have to get vector for descriptions by averaging the words' vectors.

In [49]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(list(word2vec.values())[0])

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

In [50]:
w2v = dict(zip(w2v_model.wv.index2word, w2v_model.wv.vectors))

In [51]:
from sklearn.pipeline import Pipeline

regressor = Ridge(solver='sparse_cg')

w2v_pipeline = Pipeline([
    ("word2vec vectorizer", MeanEmbeddingVectorizer(w2v)),
    ("regressor", regressor)])

In [52]:
X_train, X_test, y_train, y_test = train_test_split(df.stems, df.Price, test_size=0.33, random_state=42)

In [53]:
%%time 

w2v_pipeline.fit(X_train,y_train)

CPU times: user 31.2 s, sys: 12.1 s, total: 43.3 s
Wall time: 23.1 s


Pipeline(steps=[('word2vec vectorizer',
                 <__main__.MeanEmbeddingVectorizer object at 0x7f34868c5520>),
                ('regressor', Ridge(solver='sparse_cg'))])

In [54]:
y_pred = w2v_pipeline.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)

0.24305641736730677


Not impressive, probably, the problem is that we take all words with same weight, let's weight them with tf-idf coefficients:

In [55]:
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(list(word2vec.values())[0])

    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter,defaultdict

regressor = Ridge(solver='sparse_cg')

w2v_tfidf_pipeline = Pipeline([
    ("word2vec vectorizer", TfidfEmbeddingVectorizer(w2v)),
    ("regressor", regressor)])

In [57]:
%time w2v_tfidf_pipeline.fit(X_train,y_train)


y_pred = w2v_tfidf_pipeline.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)

CPU times: user 1min 6s, sys: 19.9 s, total: 1min 26s
Wall time: 55.6 s
0.2734952616211819


Slightly better than mean vector, but still a room for improvement!

Let's try fastText model:

In [58]:
import fasttext


%time ft_model = fasttext.train_unsupervised('sentences_w2v.txt', minn=3, maxn=4, dim=300)

CPU times: user 11min 45s, sys: 5.96 s, total: 11min 51s
Wall time: 5min 58s


In [59]:
ft_model.get_nearest_neighbors('room')

[(0.7296572923660278, 'bathroom'),
 (0.7287658452987671, 'bedroom'),
 (0.6851889491081238, 'bdroom'),
 (0.6800824403762817, 'bedrooom'),
 (0.6773774027824402, 'boxroom'),
 (0.6652435064315796, 'bathrooom'),
 (0.6594610214233398, 'rooom'),
 (0.6421788334846497, 'live'),
 (0.6419817209243774, 'mudroom'),
 (0.6401411890983582, 'spacious')]

No more out of vocabulary words problem:

In [60]:
ft_model.get_nearest_neighbors('hotrl')

[(0.6229067444801331, 'hot'),
 (0.6198985576629639, 'hotdog'),
 (0.605341374874115, 'hottub'),
 (0.5855438709259033, 'hote'),
 (0.5623665452003479, 'hotpot'),
 (0.5119271874427795, 'notr'),
 (0.5116435885429382, 'hottest'),
 (0.510097324848175, 'hotter'),
 (0.48678746819496155, 'hothous'),
 (0.48394525051116943, 'dame')]

In [61]:
class MeanFTEmbeddingVectorizer(object):
    def __init__(self, ft_model):
        self.ft_model = ft_model
        self.dim = 300

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.ft_model[w] for w in words ]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

In [62]:
regressor = Ridge(solver='sparse_cg')

ft_pipeline = Pipeline([
    ("fasttext vectorizer", MeanFTEmbeddingVectorizer(ft_model)),
    ("regressor", regressor)])

In [63]:
%%time 
ft_pipeline.fit(X_train,y_train)

CPU times: user 37.8 s, sys: 2.85 s, total: 40.6 s
Wall time: 36.8 s


Pipeline(steps=[('fasttext vectorizer',
                 <__main__.MeanFTEmbeddingVectorizer object at 0x7f34bbe163a0>),
                ('regressor', Ridge(solver='sparse_cg'))])

In [64]:
y_pred = ft_pipeline.predict(X_test)
score = r2_score(y_test, y_pred)
print(score)

0.3504588499177357


Better than word2vec !