## Text preprocessing

We will preprocess the text data from the Airbnb dataset to improve the quality of a Vowpal Wabbit model.

In [2]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [3]:
df = pd.read_csv("airbnb-100k.tsv", delimiter='\t')
df.sample(5)

Unnamed: 0,Price,Description
93449,115.0,"A room of 30m² bathed in the light, thanks to his three large windows is equipped with: • A boxspringbed of 180x210 cm • Shower and private toilet • Hair dryer • 32’ smart televison FHD • Wifi Ultra speedy This spacious room is situated in our guesthouse Living in Brûsel, Urban B&B. This B&B is located in a quiet neighborhood, very close to the European institutions and is an ideal Guesthouse to discover Brussels as a tourist or with the family. Metro station Mérode is situated at a mere 600m. Tram 81 stops in front of the house. Let yourself be seduced by the marriage of the old and the contemporary in a typical Brussels house. This spacious and quiet room of 30m ² is bathed in the sunlight thanks to its three large original windows and a small balcony. You can dream away in very co..."
26263,26.0,"A great place to refresh yourself, this service is only for a shower!, close to the #7 train. A bathroom in a nice home. Guest will only use the rest room to shower. Rest room, Front yard Great diverse area Take the #7 train to the 90 street stop"
51148,180.0,"Hello! I am listing my apartment ideally for you to rent for the majority or entire month of February. It is a spacious one bedroom, two bathroom apartment in the heart of Manhattan, with a doorman, laundry room, gym, roof deck, and gorgeous view. My apartment also offers Wifi and Cable TV. It has everything you will need. I am seeking either a single person or a couple. These photos are a few months old, the apartment is a little more lived in now - dining chairs, rugs, etc. The apartment offers everything you will need for long stay, the location is a block away from all major train lines in the city, and a block away from Central Park. You are truly in the heart of the world here. This is a doorman building and a pretty quiet building at that. It is perfect for a professional who ne..."
85653,250.0,"60m2 Luxury Flat,5min walk to the Arc de Triomphe. The flat is very luminous,at the 7th floor with elevator and has two terraces where you can take your breakfast or just chill in the sun. Completely renovated recently. Charming, romantic and classy! Quiet apartment, full of sun... luxurious and comfy! Enjoy my sync music system in every room (Sonos), relax on my hotel quality bed, enjoy the terrace couch for sunbath surrounded by charming Parisians roofs. Fully equipped, it's a perfect place for a romantic week end in Paris! 5 minutes walk to Arc de Triomphe (Etoile) and Champs Elysées. Subway : Pereire or Ternes (line 2) I love meeting people from all around the world, I will gladly share local tips for directions, restaurants, shopping etc. around the house! The district is really c..."
62750,46.0,"A few steps from Metro Piramide and 10min walking to Trastevere, a beautiful room with private bathroom. Breakfast included! Free WI-FI! Very well connected, a few minutes from all main interesting points: - Colosseum (4min) - Vatican (10 min) - Trastevere (10 min)"


Plan:

- split sentences into tokens (words)
- convert to lowercase
- remove stop words (frequent words that carry little meaning, such as articles, pronouns, and modal verbs)
- perform stemming (reduce each word to its base form)

The NLTK (Natural Language Toolkit) library implements methods for natural language processing, including stop-word lists for many languages (Russian among them).

In [4]:
import re
import nltk.data 
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Tokenization can also be performed with NLTK, but for now we will use a simple tokenizer based on regular expressions.

In [5]:
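# tokenize: keep runs of ASCII letters only (digits and punctuation are dropped)
words_only = re.compile('[a-z]+')

def letters(s, regex=words_only):
    # Non-string values (e.g. NaN for a missing description) yield no tokens
    if isinstance(s, str):
        return regex.findall(s.lower())
    else:
        return []

# stop-words removal
def remove_stopwords(tokens, sw=frozenset(stop_words)):
    # A set gives constant-time membership tests
    return [t for t in tokens if t not in sw]

def preprocess(s):
    return remove_stopwords(letters(s))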

In [6]:
s = 'I get help from a local service called "Key Butler" to welcome you and let you in etc, and they....' 

preprocess(s)

['get',
 'help',
 'local',
 'service',
 'called',
 'key',
 'butler',
 'welcome',
 'let',
 'etc']

In [7]:
df['clean_description'] = df.Description.map(preprocess)
df.sample(4)

Unnamed: 0,Price,Description,clean_description
5411,87.0,"This 1903 Victorian on Cherry Hill has a super comfortable welcoming vibe. Our ""Cedar Tree"" room is decorated with vibrant color and silky soft wall to wall carpet. The queen-sized bed is plush and layered with cozy linens for sweet dreams. A 1903, Victorian home with four unique bedrooms to rent! Click on our profile picture on the right, then scroll down on the left side to see our other beautiful bedrooms :o The feeling of the first-floor harkens to a graceful, bye-gone era of calm retreat to the intoxication of thoughtful conversation with a stranger. Relax in the parlor, or the lounge-room. Enjoy meals in the formal dining room. Check your email everywhere--we have wifi. Guests have keyed-access to their own rooms, their private, or shared baths, all first-floor rooms, including ...","[victorian, cherry, hill, super, comfortable, welcoming, vibe, cedar, tree, room, decorated, vibrant, color, silky, soft, wall, wall, carpet, queen, sized, bed, plush, layered, cozy, linens, sweet, dreams, victorian, home, four, unique, bedrooms, rent, click, profile, picture, right, scroll, left, side, see, beautiful, bedrooms, feeling, first, floor, harkens, graceful, bye, gone, era, calm, retreat, intoxication, thoughtful, conversation, stranger, relax, parlor, lounge, room, enjoy, meals, formal, dining, room, check, email, everywhere, wifi, guests, keyed, access, rooms, private, shared, baths, first, floor, rooms, including, two, social, rooms, formal, dining, area, kitchen, includes, access, refrigerator, guests, may, cook, permission, washer, dryer, used, certain, times, ...]"
8306,133.0,"Enjoy a beautiful contemporary residence with picturesque park views near Fenway Park. Upscale amenities throughout include, fully equipped kitchens, open floor plans, oversized windows, spacious walk-in closets, as well as a pool, fitness center, & clubhouse for your entertainment. 1 BED 1 BATH SLEEPS 3 B1251 For your convenience, we have a guest service representative available in the area. Experience Luxury living in the heart of Boston in these premium residences that will provide you with an uplifting lifestyle and comfortable stay. With a contemporary design throughout this high-rise has cutting-edge finishes and amenities throughout including: fully equipped kitchens with stainless steel and energy efficient appliances, bathrooms with elegant flooring and custom vanities, remar...","[enjoy, beautiful, contemporary, residence, picturesque, park, views, near, fenway, park, upscale, amenities, throughout, include, fully, equipped, kitchens, open, floor, plans, oversized, windows, spacious, walk, closets, well, pool, fitness, center, clubhouse, entertainment, bed, bath, sleeps, b, convenience, guest, service, representative, available, area, experience, luxury, living, heart, boston, premium, residences, provide, uplifting, lifestyle, comfortable, stay, contemporary, design, throughout, high, rise, cutting, edge, finishes, amenities, throughout, including, fully, equipped, kitchens, stainless, steel, energy, efficient, appliances, bathrooms, elegant, flooring, custom, vanities, remarkable, views, olmstead, park, private, courtyard, stunning, landscaping, elevated, ame..."
76691,49.0,"1. Private jacuzzi bath in quiet surroundings near Division corridor. 2. Kitchen is open, sharing with me! a. Use refrig.. b. Coffee Po(URL HIDDEN)1. cream(URL HIDDEN)2.sugar 3. Washer and Dryer back porch 4. WiFi pass word 4vcujkeg98wxhe 5. WiFi is Centruy Link 3838 6. Main. person lives in private room on first floor. 7. shower on the first floor is shared with you, me, and David (main.) Great view of down town Portland Closet to two bus lines #9 on Powell 4 blocks South The #4 on Division 4 blocks North A ton of things to do on Division St. Pool Arriving and leaving It is Southeast Portland a very popular living area in Portland and close to down town. Bus, light- rail and trolly The loft and setting are in a beautiful 100 yr old home open spiral staircase Lic. #700238","[private, jacuzzi, bath, quiet, surroundings, near, division, corridor, kitchen, open, sharing, use, refrig, b, coffee, po, url, hidden, cream, url, hidden, sugar, washer, dryer, back, porch, wifi, pass, word, vcujkeg, wxhe, wifi, centruy, link, main, person, lives, private, room, first, floor, shower, first, floor, shared, david, main, great, view, town, portland, closet, two, bus, lines, powell, blocks, south, division, blocks, north, ton, things, division, st, pool, arriving, leaving, southeast, portland, popular, living, area, portland, close, town, bus, light, rail, trolly, loft, setting, beautiful, yr, old, home, open, spiral, staircase, lic]"
10858,45.0,"Cosy apartment in a calm and secure residential area close to parks, shopping mall and supermarkets. It will take you just 20 minutes to get to downtown by metro. Fast access from airport - just 15 minutes driving. A nice studio in basement renovated in April 2013, at a 15-minute metro ride from Montreal downtown. Everything is provided for your stay and included in the price (laundry, dishes, unlimited wi-fi, TV, washer, dryer, stove, microwave). You can also enjoy the BBQ grill on the backyard. There's a free shared laundry room with washer&dryer in the basement. My furnished apartment is to rent for short term from 4 nights to 6 months The apartment is 700 meters from metro Monk and Angrignon, 700 meters from Plaza Angrignon (supermarkets, shops, restaurants) and 600 meters from ...","[cosy, apartment, calm, secure, residential, area, close, parks, shopping, mall, supermarkets, take, minutes, get, downtown, metro, fast, access, airport, minutes, driving, nice, studio, basement, renovated, april, minute, metro, ride, montreal, downtown, everything, provided, stay, included, price, laundry, dishes, unlimited, wi, fi, tv, washer, dryer, stove, microwave, also, enjoy, bbq, grill, backyard, free, shared, laundry, room, washer, dryer, basement, furnished, apartment, rent, short, term, nights, months, apartment, meters, metro, monk, angrignon, meters, plaza, angrignon, supermarkets, shops, restaurants, meters, beautiful, parc, angrignon, smart, tv, browse, internet, check, mailbox, see, movies, directly, tv, free, unlimited, calls, u, canada, phone, apa]"


A closer look shows that the dataset contains texts in several languages, not only English:

In [8]:
df.iloc[41403]

Price                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            350
Description          Le rez-de chaussée de la Maison Duquette est à 2 minutes de marche du métro Villa-Maria, Il s'agit d'une maison familiale avec 4 grandes chambres, salon, sall
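
One side effect of the ASCII-only pattern `[a-z]+` used above is worth keeping in mind for non-English texts: accented letters are not matched, so a word containing them is split into fragments. The sketch below illustrates this on a made-up phrase and shows one possible alternative pattern, `[^\W\d_]+`, which matches runs of Unicode letters. It is an illustration only, not the tokenizer used in the rest of this notebook.

```python
import re

ascii_letters = re.compile(r'[a-z]+')        # pattern used by the tokenizer above
unicode_letters = re.compile(r'[^\W\d_]+')   # assumed alternative: word characters minus decimal digits and "_"

sample = "Living in Brûsel near Mérode".lower()

ascii_tokens = ascii_letters.findall(sample)
unicode_tokens = unicode_letters.findall(sample)

print(ascii_tokens)    # accented words are broken into pieces
print(unicode_tokens)  # accented words stay whole
```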

Let's find out which other languages are present.

We will use the spaCy library; it is written in Cython, so it is quite fast.

Documentation can be found [here](https://spacy.io/models).

The library includes pretrained models for 16 languages, plus a multi-language one.

In [11]:
import spacy
from spacy_langdetect import LanguageDetector
from tqdm.notebook import tqdm  # tqdm_notebook at the package top level is deprecated
from langdetect import DetectorFactory

DetectorFactory.seed = 0
spacy.util.fix_random_seed(0)

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

In [12]:
# To speed things up, use only the first sentence for language detection

def detect_lang(t):
    if isinstance(t, str):
        text = t.split('.')[0]
        doc = nlp(text)
        return (doc._.language.get('language'))
    else:
        return None
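
Taking everything before the first period is a cheap heuristic, and it can return very little text when a description starts with a numbered list or an abbreviation. The sketch below shows the effect on the beginning of one of the descriptions sampled above, next to a hypothetical alternative, `first_chars` (not part of the original notebook), that keeps a fixed-length prefix instead.

```python
def first_sentence(text):
    # Same rule as in detect_lang: everything before the first period
    return text.split('.')[0]

def first_chars(text, limit=100):
    # Hypothetical alternative: a fixed-length prefix, regardless of punctuation
    return text[:limit]

sample = ("1. Private jacuzzi bath in quiet surroundings near Division corridor. "
          "2. Kitchen is open, sharing with me!")

print(repr(first_sentence(sample)))   # only the list number survives
print(repr(first_chars(sample, 40)))  # a 40-character prefix of the text
```

With such short inputs the detector has almost nothing to work with, which may account for some of the rare or `UNKNOWN` labels in the language counts.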

`LanguageDetector` determines the language of a text. It is rather slow, so we will use `multiprocessing` to run the detection in parallel.

In [13]:
from multiprocessing import Pool

with Pool(2) as p:
    langs = list(tqdm(p.imap(detect_lang, df.Description), total=len(df)))
    
df['lang'] = langs

In [14]:
df.lang.value_counts()

en         82612
fr          6655
es          3300
it          2756
de          1760
da           525
nl           503
zh-cn        284
af           235
ca           202
ko           164
pt           139
ro           129
UNKNOWN       97
no            77
vi            55
el            51
et            49
tl            44
ru            38
so            33
pl            26
fi            26
hu            23
sw            22
sv            21
id            20
cy            19
zh-tw         16
lt            16
sq            11
sk            10
ja            10
tr            10
cs             5
sl             5
lv             4
hr             3
bg             1
Name: lang, dtype: int64

The last two preprocessing steps (stop-word removal, as done above, and stemming) depend heavily on the language. To keep things simple we keep only the English texts, although NLTK and spaCy support many European languages (the corresponding models have to be downloaded separately).

In [15]:
# .copy() avoids SettingWithCopyWarning when columns are added to eng_df later
eng_df = df[df.lang == 'en'].copy()
len(eng_df)

82612

In [17]:
eng_df.to_csv('eng_df.csv', index=False)

In [18]:
del df

We use the Snowball stemmer from NLTK:

In [19]:
from nltk.stem.snowball import SnowballStemmer

snowball = SnowballStemmer("english")
snowball.stem('apartment')

'apart'

Stemming is not a fast operation, although it is faster than lemmatization (converting a word to its dictionary form, which is preferable for morphologically rich languages such as Russian). So we cache the stemmer's results to avoid processing the same word several times.

In [20]:
from functools import lru_cache

In [21]:
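# Cache at module level so it is shared across all descriptions;
# a cache created inside stemm_description would be rebuilt on every call
@lru_cache(maxsize=None)
def stemm_token(token):
    return snowball.stem(token)

def stemm_description(d):
    # Stem each token once and keep stems of at least three characters
    stems = (stemm_token(t) for t in d)
    return [s for s in stems if len(s) >= 3]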

In [22]:
tokens = eng_df.clean_description[1]

stemm_description(tokens)[:10]

['comfort',
 'calm',
 'apart',
 'room',
 'center',
 'pari',
 'bastill',
 'area',
 'welcom',
 'explain']
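
For the cache to pay off across descriptions, the cached function has to outlive a single call, for example by being defined once at module level. A minimal sketch with a stand-in stemmer (`toy_stem` is hypothetical and only strips a trailing "s") shows how `cache_info()` reports the saved calls:

```python
from functools import lru_cache

calls = 0  # counts how often the function body actually runs

@lru_cache(maxsize=None)
def toy_stem(token):
    # Stand-in for snowball.stem, used only for illustration
    global calls
    calls += 1
    return token[:-1] if token.endswith("s") else token

docs = [["rooms", "beds", "rooms"], ["beds", "kitchen", "rooms"]]
stems = [[toy_stem(t) for t in doc] for doc in docs]

info = toy_stem.cache_info()
print(stems)
print(calls, info.hits, info.misses)  # body runs once per distinct token
```

Note that with `multiprocessing` each worker process keeps its own cache, so the saving applies per worker.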

In [24]:
from multiprocessing import Pool

with Pool(2) as p:
    stems = list(tqdm(p.imap(stemm_description, eng_df.clean_description), total=len(eng_df)))
    
eng_df['stems'] = stems

In [25]:
eng_df.sample(3)

Unnamed: 0,Price,Description,clean_description,lang,stems
77476,65.0,"This big bright room is on the second floor of a newly renovated three bedrooms duplex. The central air keeps you cool during summer and warm during winter. There is plenty of living space, a fully equipped kitchen and furnished backyard with a grill. See "" Guest Access "" for more details. The space is equipped with central air and heat. There is a fully equipped chef's kitchen with brand new appliances including a dishwasher. ( See GUEST ACCESS for more information ) Our guest have full access to the chef's kitchen equipped with brand new appliances as well as s fully furnished backyard with charcoal grill. The room has a smart TV hooked up to Netflix, Hulu and Showtime. It is wifi enabled so internet browsing is also an option. There is an iron and iron board. Our guests also hav...","[big, bright, room, second, floor, newly, renovated, three, bedrooms, duplex, central, air, keeps, cool, summer, warm, winter, plenty, living, space, fully, equipped, kitchen, furnished, backyard, grill, see, guest, access, details, space, equipped, central, air, heat, fully, equipped, chef, kitchen, brand, new, appliances, including, dishwasher, see, guest, access, information, guest, full, access, chef, kitchen, equipped, brand, new, appliances, well, fully, furnished, backyard, charcoal, grill, room, smart, tv, hooked, netflix, hulu, showtime, wifi, enabled, internet, browsing, also, option, iron, iron, board, guests, also, access, living, areas, tea, room, comfortable, blue, leather, couch, brown, leather, butterfly, chair, sink, right, perfect, area, enjoy, nice, ...]",en,"[big, bright, room, second, floor, newli, renov, three, bedroom, duplex, central, air, keep, cool, summer, warm, winter, plenti, live, space, fulli, equip, kitchen, furnish, backyard, grill, see, guest, access, detail, space, equip, central, air, heat, fulli, equip, chef, kitchen, brand, new, applianc, includ, dishwash, see, guest, access, inform, guest, full, access, chef, kitchen, equip, brand, new, applianc, well, fulli, furnish, backyard, charcoal, grill, room, smart, hook, netflix, hulu, showtim, wifi, enabl, internet, brows, also, option, iron, iron, board, guest, also, access, live, area, tea, room, comfort, blue, leather, couch, brown, leather, butterfli, chair, sink, right, perfect, area, enjoy, nice, convers, ...]"
90585,57.0,"Walkout apartment with keyless locks. Clean renovated basement with windows. All new furniture and appliances, new queen bed; high quality 100% cotton linens. Unlimited internet and easy access to TTC. Quiet residential neighbourhood with tree-lined streets and Victorian houses between Leslieville and the Danforth. Free parking either behind or close to the house. Parks, waterfront trail, and restaurants close by.","[walkout, apartment, keyless, locks, clean, renovated, basement, windows, new, furniture, appliances, new, queen, bed, high, quality, cotton, linens, unlimited, internet, easy, access, ttc, quiet, residential, neighbourhood, tree, lined, streets, victorian, houses, leslieville, danforth, free, parking, either, behind, close, house, parks, waterfront, trail, restaurants, close]",en,"[walkout, apart, keyless, lock, clean, renov, basement, window, new, furnitur, applianc, new, queen, bed, high, qualiti, cotton, linen, unlimit, internet, easi, access, ttc, quiet, residenti, neighbourhood, tree, line, street, victorian, hous, leslievill, danforth, free, park, either, behind, close, hous, park, waterfront, trail, restaur, close]"
72144,200.0,This is a self contained apartment above my home. A spacious and comfortable backyard space to relax in and easy access to the city. Blocks from Frenchman Street and the quarter with walking access to the Bywater and Marigny,"[self, contained, apartment, home, spacious, comfortable, backyard, space, relax, easy, access, city, blocks, frenchman, street, quarter, walking, access, bywater, marigny]",en,"[self, contain, apart, home, spacious, comfort, backyard, space, relax, easi, access, citi, block, frenchman, street, quarter, walk, access, bywat, marigni]"


Let's compare the model's quality on raw texts (`Description` column), tokens without stop words (`clean_description` column), and stems (`stems` column).

In [26]:
eng_df.dropna(subset=['Price', "stems"], inplace=True)

Y = eng_df['Price']
X_raw = eng_df['Description'].map(letters)
X_tokens = eng_df['clean_description']
X_stems = eng_df['stems']

In [27]:
def convert_to_vw(raw_text, target):
    return "{} |d {}".format(float(target), " ".join(raw_text))
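
Each line follows the Vowpal Wabbit text format: the label, then `|` and a namespace name (`d` here), then space-separated features. Because `|` and `:` have special meaning in that format (`:` separates a feature from its value), tokens must not contain them. The `[a-z]+` tokenizer already guarantees this; for other tokenizers a defensive variant might look like the sketch below, where `to_vw_line` is a hypothetical helper and not part of the notebook.

```python
def to_vw_line(tokens, target, namespace="d"):
    # Strip the characters that VW treats as syntax, then drop empty tokens
    safe = [t.replace("|", "").replace(":", "") for t in tokens]
    safe = [t for t in safe if t]
    return "{} |{} {}".format(float(target), namespace, " ".join(safe))

line = to_vw_line(["wifi", "10:30", "a|b"], 46)
print(line)
```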

In [28]:
def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object + '\n')

In [29]:
from sklearn.metrics import r2_score

def read_target_from_vw(vw_object):
    return float(vw_object.split(' ')[0])

def calc_r2(predictions_path, answers_path):
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)
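
`r2_score` computes the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal pure-Python sketch of the same formula (`r2_by_hand` and the toy prices are illustrative only):

```python
def r2_by_hand(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - ss_res / ss_tot

prices = [40.0, 80.0, 120.0, 160.0]
perfect = r2_by_hand(prices, prices)          # predictions equal the targets
baseline = r2_by_hand(prices, [100.0] * 4)    # always predict the mean price

print(perfect, baseline)
```

A score of 0 means the model is no better than always predicting the mean price; a score of 1 means perfect predictions.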

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, Y, test_size=0.33, random_state=4200)
X_train_tokens, X_test_tokens, y_train, y_test = train_test_split(X_tokens, Y, test_size=0.33, random_state=4200)
X_train_stems, X_test_stems, y_train, y_test = train_test_split(X_stems, Y, test_size=0.33, random_state=4200)

In [32]:
write_vw(X_train_raw, y_train, "airbnb-train-raw.vw")
write_vw(X_test_raw, y_test, "airbnb-test-raw.vw")

write_vw(X_train_tokens, y_train, "airbnb-train-tokens.vw")
write_vw(X_test_tokens, y_test, "airbnb-test-tokens.vw")

write_vw(X_train_stems, y_train, "airbnb-train-stems.vw")
write_vw(X_test_stems, y_test, "airbnb-test-stems.vw")

In [33]:
! head -n 1 airbnb-train-raw.vw

220.0 |d square feet high ceilings king bed completely renovated you won t find many bedroom apartments like this in the west village you re steps away from the meatpacking district high line and whitney museum when you walk outside and you can enjoy this luxurious clutter free airy home when you decide to stay in the apartment is newly renovated including the furniture and finishings some highlights include brand new samsung suhd tv with amazon fire tv high speed wifi kitchen sodastream bialetti espresso maker full set of pots pans bang and olufsen a bluetooth speaker luxurious king size leesa mattress as a guest you will have a key to the main entry of the building as well as the apartment which is a storey walk up aside from a few cupboards and areas of the closet where i will keep my belongings the rest of the apartment is yours to use there is plenty of storage space i will be around manhattan most of the time to help if needed i m always


In [34]:
! head -n 1 airbnb-train-tokens.vw

220.0 |d square feet high ceilings king bed completely renovated find many bedroom apartments like west village steps away meatpacking district high line whitney museum walk outside enjoy luxurious clutter free airy home decide stay apartment newly renovated including furniture finishings highlights include brand new samsung suhd tv amazon fire tv high speed wifi kitchen sodastream bialetti espresso maker full set pots pans bang olufsen bluetooth speaker luxurious king size leesa mattress guest key main entry building well apartment storey walk aside cupboards areas closet keep belongings rest apartment use plenty storage space around manhattan time help needed always


In [35]:
! head -n 1 airbnb-train-stems.vw

220.0 |d squar feet high ceil king bed complet renov find mani bedroom apart like west villag step away meatpack district high line whitney museum walk outsid enjoy luxuri clutter free airi home decid stay apart newli renov includ furnitur finish highlight includ brand new samsung suhd amazon fire high speed wifi kitchen sodastream bialetti espresso maker full set pot pan bang olufsen bluetooth speaker luxuri king size leesa mattress guest key main entri build well apart storey walk asid cupboard area closet keep belong rest apart use plenti storag space around manhattan time help need alway


We train a model on each of the three datasets and compare them:

In [36]:
%%time

! vw --final_regressor airbnb-lin-model-raw.vw.bin airbnb-train-raw.vw --passes 20 -l 3 -c -k  --bit_precision 18

final_regressor = airbnb-lin-model-raw.vw.bin
Num weight bits = 18
learning rate = 3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-raw.vw.cache
Reading datafile = airbnb-train-raw.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
48400.000000 48400.000000            1            1.0 220.0000   0.0000      174
125191.367188 201982.734375            2            2.0 450.0000   0.5751        9
72780.285416 20369.203644            4            4.0 240.0000  40.3795      170
36699.550247 618.815079            8            8.0  40.0000  10.0007       33
80979.453055 125259.355862           16           16.0 135.0000  52.5366      153
51445.165040 21910.877026           32           32.0  79.0000  60.4543       71
44872.192968 38299.220895           64           64.0 172.0000 166.1870      171
36163.280758 27454.368548          128        

In [37]:
%%time

! vw --final_regressor airbnb-lin-model-tokens.vw.bin airbnb-train-tokens.vw --passes 20 -l 3 -c -k --bit_precision 18

final_regressor = airbnb-lin-model-tokens.vw.bin
Num weight bits = 18
learning rate = 3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-tokens.vw.cache
Reading datafile = airbnb-train-tokens.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
48400.000000 48400.000000            1            1.0 220.0000   0.0000       98
125101.125000 201802.250000            2            2.0 450.0000   0.7759        6
75713.196960 26325.268921            4            4.0 240.0000  15.5428       96
39055.715233 2398.233505            8            8.0  40.0000   4.1926       24
84181.565842 129307.416451           16           16.0 135.0000  38.5242      127
57145.070851 30108.575860           32           32.0  79.0000  30.2245       45
54811.209314 52477.347777           64           64.0 172.0000  81.9622      104
41783.796841 28756.384368          1

In [38]:
%%time

! vw --final_regressor airbnb-lin-model-stems.vw.bin airbnb-train-stems.vw --passes 20 -l 3 -c -k --bit_precision 18

final_regressor = airbnb-lin-model-stems.vw.bin
Num weight bits = 18
learning rate = 3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-stems.vw.cache
Reading datafile = airbnb-train-stems.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
48400.000000 48400.000000            1            1.0 220.0000   0.0000       96
125062.570312 201725.140625            2            2.0 450.0000   0.8618        5
75542.389709 26022.209106            4            4.0 240.0000  16.8910       96
38900.226391 2258.063072            8            8.0  40.0000   5.2025       23
83627.623821 128355.021252           16           16.0 135.0000  47.9530      120
56023.923013 28420.222205           32           32.0  79.0000  30.9137       44
53072.210422 50120.497831           64           64.0 172.0000  88.2881      103
40484.896336 27897.582250          128 
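
The three training commands differ only in file names. The sketch below assembles the same command as a Python list and annotates each flag; `vw_train_command` is a hypothetical helper, and the flag descriptions are a brief summary of the Vowpal Wabbit command-line options rather than a full reference.

```python
def vw_train_command(train_path, model_path, passes=20, learning_rate=3, bits=18):
    return [
        "vw",
        "--final_regressor", model_path,   # where to save the trained model
        train_path,                        # training data in VW text format
        "--passes", str(passes),           # number of passes over the data
        "-l", str(learning_rate),          # learning rate
        "-c",                              # use a cache file (required for multiple passes)
        "-k",                              # discard any existing cache first
        "--bit_precision", str(bits),      # hash features into 2**bits weight slots
    ]

cmd = vw_train_command("airbnb-train-stems.vw", "airbnb-lin-model-stems.vw.bin")
print(" ".join(cmd))
print(2 ** 18)  # size of the weight table with 18 bits
```

Such a list can be passed to `subprocess.run(cmd, check=True)` instead of using the `!` shell escape.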

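The first thing we see: even on such a small dataset, preprocessing speeds up training, in our case because the feature set is smaller (for the first training example, 174 features in the raw text versus 98 after stop-word removal and 96 after stemming).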

What about quality?

In [60]:
%%time
! vw --testonly --initial_regressor airbnb-lin-model-raw.vw.bin --predictions airbnb-1-predictions-raw.txt airbnb-test-raw.vw 


only testing
predictions = airbnb-1-predictions-raw.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airbnb-test-raw.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
599.695374 599.695374            1            1.0  55.0000  79.4887       28
305.984709 12.274044            2            2.0  36.0000  39.5034      186
207.019877 108.055046            4            4.0  80.0000  67.7075       24
4919.375420 9631.730963            8            8.0  70.0000  74.8798      175
2999.275561 1079.175702           16           16.0 100.0000 142.4937      173
12161.945164 21324.614767           32           32.0 150.0000 152.4244      185
24285.223548 36408.501932           64           64.0 150.0000  82.4140       47
19419.673518 14554.123487          128          128.0 300.0000 122.5763       28
13868.897035 8318.120552     

In [40]:

! vw --testonly --initial_regressor airbnb-lin-model-tokens.vw.bin --predictions airbnb-1-predictions-tokens.txt airbnb-test-tokens.vw


only testing
predictions = airbnb-1-predictions-tokens.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airbnb-test-tokens.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
852.703064 852.703064            1            1.0  55.0000  84.2011       19
433.786043 14.869022            2            2.0  36.0000  39.8560       92
280.920192 128.054341            4            4.0  80.0000  76.6952       17
5309.742549 10338.564906            8            8.0  70.0000  73.2144       95
3037.758406 765.774262           16           16.0 100.0000 129.0590      104
11758.061741 20478.365077           32           32.0 150.0000 135.2268       99
23336.385150 34914.708559           64           64.0 150.0000  91.4685       37
18445.038022 13553.690894          128          128.0 300.0000 136.3479       19
13362.039025 8279.04002

In [41]:
! vw --testonly --initial_regressor airbnb-lin-model-stems.vw.bin --predictions airbnb-1-predictions-stems.txt airbnb-test-stems.vw


only testing
predictions = airbnb-1-predictions-stems.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airbnb-test-stems.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
980.918945 980.918945            1            1.0  55.0000  86.3196       19
565.562332 150.205719            2            2.0  36.0000  48.2558       88
630.141131 694.719930            4            4.0  80.0000  76.4350       17
5932.519242 11234.897353            8            8.0  70.0000  74.3693       90
3213.406160 494.293078           16           16.0 100.0000 109.6876      103
11970.408771 20727.411382           32           32.0 150.0000 140.3474       98
22687.461167 33404.513562           64           64.0 150.0000  82.0521       35
17803.776708 12920.092250          128          128.0 300.0000 138.9059       19
13497.156398 9190.536089

In [42]:
print("RAW")
print(calc_r2("airbnb-1-predictions-raw.txt", "airbnb-test-raw.vw"))

RAW
0.34760673021608524


In [43]:
print("TOKENS")
print(calc_r2("airbnb-1-predictions-tokens.txt", "airbnb-test-tokens.vw"))

TOKENS
0.3744525473300868


In [44]:
print("STEMS")
print(calc_r2("airbnb-1-predictions-stems.txt", "airbnb-test-stems.vw"))

STEMS
0.3578918225807295


The models trained on raw texts (R² ≈ 0.348) and on stems (R² ≈ 0.358) are slightly worse than the model trained on tokens without stop words (R² ≈ 0.374).

Is there another way to reduce the number of features, keeping only the most relevant ones? We can compute tf-idf for every word-document pair in the collection and keep only the words whose score exceeds a threshold.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [48]:
X_train = X_raw
X_train.apply(lambda x: ' '.join(x))[0]

'hi everyone cosy bedroom in a modern apartment located in a central area paris th the apartment it s a bedrooms one is mine apartment of m sq ft fully renovated warm atmosphere a living room with a equiped kitchen wifi i provide towels and sheets central area cosmopolite non touristic very close to the marais bastille and republic transports metro walk saint ambroise line or richard lenoir line city bike station best j'

In [49]:
x_train, x_test = X_train_stems.apply(lambda x: ' '.join(x)) , X_test_stems.apply(lambda x: ' '.join(x)) 

Compute the tf-idf transformation. The result is a sparse matrix:

In [50]:
vec = TfidfVectorizer(ngram_range=(1, 1))
x_train_tfidf = vec.fit_transform(x_train)

x_train_tfidf

<54542x44692 sparse matrix of type '<class 'numpy.float64'>'
	with 3326139 stored elements in Compressed Sparse Row format>

Convert the matrix to a list with one dictionary per document, mapping each word `w` to `tfidf(w, d, D)`, where `d` is the document and `D` is the collection:

In [51]:
index_value={i[1]:i[0] for i in vec.vocabulary_.items()}

fully_indexed = []
for row in x_train_tfidf:
    fully_indexed.append({index_value[column]:value for (column,value) in zip(row.indices,row.data)})

Let's look at the words of one document whose score passes a threshold, for example `tf-idf >= 0.1`:

In [52]:
tr = 0.1

[ (k, v ) for k, v in sorted(fully_indexed[4].items(), key=lambda item: -item[1]) if v >= tr]

[('doubl', 0.2580593795785133),
 ('split', 0.24115506933644754),
 ('neff', 0.2022642917696268),
 ('larg', 0.1962100846501572),
 ('origin', 0.18475126372913314),
 ('storag', 0.18029147151524685),
 ('level', 0.17138729784739948),
 ('slate', 0.16540308912231033),
 ('costa', 0.16540308912231033),
 ('side', 0.15770388173708214),
 ('finsburi', 0.1478633566100029),
 ('ceil', 0.14407523335957956),
 ('rais', 0.14128854967409324),
 ('front', 0.13858454237819792),
 ('place', 0.13625522822264793),
 ('form', 0.12490200014466063),
 ('convers', 0.12452005863135523),
 ('power', 0.12396028773002239),
 ('railway', 0.12042232913462157),
 ('chest', 0.11927026423601682),
 ('compris', 0.11767866051622303),
 ('benefit', 0.11727743203760296),
 ('shower', 0.11650003976394883),
 ('field', 0.11459355750386695),
 ('cupboard', 0.11364892399903367),
 ('follow', 0.11321064888943784),
 ('cooker', 0.11292363409760428),
 ('flat', 0.11097829368394094),
 ('hallway', 0.11085913741963234),
 ('leafi', 0.10875189729225716),


In each document we keep at most the 20 most important words (those with the highest tf-idf) that also pass the threshold `tr`:

In [53]:
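def important_words(x, n=20):
    # Sort words by descending tf-idf, drop those below the global threshold `tr`,
    # and keep at most n; spaces are replaced because VW splits features on whitespace
    return [(k.replace(' ', '_'), v) for k, v in sorted(x.items(), key=lambda item: -item[1]) if v >= tr][:n]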

In [54]:
def convert_to_vw_tf(raw_text, target):
    return "{} |d {}".format(float(target), " ".join(["{}:{}".format(k, v) for k, v in raw_text]))

def write_vw_tf(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw_tf(x, y)
            if not vw_object:
                continue
            f.write(vw_object + '\n')
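The writer above relies on Vowpal Wabbit's text format, `label |namespace feature:value ...`, in which `:`, `|` and whitespace are reserved characters. The stems used here are plain alphabetic tokens, so no escaping is needed. For less clean input, a defensive variant could look like the sketch below. The helpers `sanitize_vw_feature` and `to_vw_line` are hypothetical and are not part of the notebook.

```python
def sanitize_vw_feature(name):
    """Replace characters that VW treats as separators (hypothetical helper)."""
    for ch in (":", "|", " ", "\t", "\n"):
        name = name.replace(ch, "_")
    return name


def to_vw_line(label, weighted_words, namespace="d"):
    """Format (word, weight) pairs as one VW example with explicit feature values."""
    features = " ".join(
        "{}:{:.6f}".format(sanitize_vw_feature(word), weight)
        for word, weight in weighted_words
    )
    return "{} |{} {}".format(float(label), namespace, features)


# Toy input, chosen only for illustration
example_line = to_vw_line(450, [("gym", 0.675228), ("top floor", 0.348730)])
```

Rounding the weights to six decimals is a design choice of this sketch. It keeps the files smaller, and the notebook's full-precision values work just as well.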

In [55]:
x_train_updated = [important_words(x) for x in fully_indexed]
len(x_train_updated)

54542

In [56]:
x_train_updated[0]

[('suhd', 0.22702091541382582),
 ('leesa', 0.2092237299910606),
 ('olufsen', 0.20394496122668826),
 ('bialetti', 0.20394496122668826),
 ('sodastream', 0.19972994577223743),
 ('high', 0.19182189013874973),
 ('whitney', 0.18010511781362673),
 ('bang', 0.173690686601339),
 ('clutter', 0.16209468549638292),
 ('luxuri', 0.1596956274846224),
 ('asid', 0.1577930528524131),
 ('king', 0.15471383698293836),
 ('meatpack', 0.1502494251422974),
 ('samsung', 0.14901867589237205),
 ('apart', 0.14825207040563462),
 ('bluetooth', 0.1472195142269145),
 ('decid', 0.13671041244968138),
 ('renov', 0.13544658673657073),
 ('amazon', 0.1284808481645425),
 ('speaker', 0.12747745077125505)]

In [57]:
write_vw_tf(x_train_updated, y_train, "airbnb-train-important-stems.vw")

In [58]:
! head -n 3 airbnb-train-important-stems.vw

220.0 |d suhd:0.22702091541382582 leesa:0.2092237299910606 olufsen:0.20394496122668826 bialetti:0.20394496122668826 sodastream:0.19972994577223743 high:0.19182189013874973 whitney:0.18010511781362673 bang:0.173690686601339 clutter:0.16209468549638292 luxuri:0.1596956274846224 asid:0.1577930528524131 king:0.15471383698293836 meatpack:0.1502494251422974 samsung:0.14901867589237205 apart:0.14825207040563462 bluetooth:0.1472195142269145 decid:0.13671041244968138 renov:0.13544658673657073 amazon:0.1284808481645425 speaker:0.12747745077125505
450.0 |d gym:0.6752280868300573 amaz:0.5857861817339045 floor:0.34873046248046086 apart:0.2816179052965157
50.0 |d inst:0.3303694220060813 middl:0.2169097540580892 neiu:0.17606802050466333 argosi:0.17606802050466333 kendal:0.16518471100304066 depaul:0.16518471100304066 rican:0.16168106790911377 student:0.1615717403334601 intern:0.15009980649780472 peruvian:0.14667829926165915 dominican:0.14551470800700664 uic:0.14551470800700664 cultur:0.1437913480248

In [61]:
%%time

! vw --final_regressor airbnb-lin-model-tfidf.vw.bin airbnb-train-important-stems.vw --passes 15 -l 3 -c --bit_precision 18 -k

final_regressor = airbnb-lin-model-tfidf.vw.bin
Num weight bits = 18
learning rate = 3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-important-stems.vw.cache
Reading datafile = airbnb-train-important-stems.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
48400.000000 48400.000000            1            1.0 220.0000   0.0000       21
124623.609375 200847.218750            2            2.0 450.0000   1.8402        5
76784.553589 28945.497803            4            4.0 240.0000   4.3585       21
40157.342880 3530.132172            8            8.0  40.0000   2.2763       16
87028.076649 133898.810417           16           16.0 135.0000   3.5806       21
63429.498878 39830.921108           32           32.0  79.0000   8.8392       21
65524.439860 67619.380842           64           64.0 172.0000  12.7161       21
53577.385478 41630.

This model is the fastest to train (the previous ones needed 2.5-3 seconds). We can train for longer to try to improve quality. Let's increase the number of passes (epochs, i.e. runs over the data) and see what quality we get.

In [62]:
%%time

! vw --final_regressor airbnb-lin-model-tfidf.vw.bin airbnb-train-important-stems.vw --passes 35 -l 3 -c --bit_precision 18 -k

final_regressor = airbnb-lin-model-tfidf.vw.bin
Num weight bits = 18
learning rate = 3
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airbnb-train-important-stems.vw.cache
Reading datafile = airbnb-train-important-stems.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
48400.000000 48400.000000            1            1.0 220.0000   0.0000       21
124623.609375 200847.218750            2            2.0 450.0000   1.8402        5
76784.553589 28945.497803            4            4.0 240.0000   4.3585       21
40157.342880 3530.132172            8            8.0  40.0000   2.2763       16
87028.076649 133898.810417           16           16.0 135.0000   3.5806       21
63429.498878 39830.921108           32           32.0  79.0000   8.8392       21
65524.439860 67619.380842           64           64.0 172.0000  12.7161       21
53577.385478 41630.

In [63]:
x_test_tfidf = vec.transform(x_test)

fully_indexed_test = []
for row in x_test_tfidf:
    fully_indexed_test.append({index_value[column]:value for (column,value) in zip(row.indices,row.data)})

In [64]:
x_test_updated = [important_words(x) for x in fully_indexed_test]
write_vw_tf(x_test_updated, y_test, "airbnb-test-important-stems.vw")

In [65]:
! vw --testonly --initial_regressor airbnb-lin-model-tfidf.vw.bin --predictions airbnb-1-predictions-stems.txt airbnb-test-important-stems.vw

only testing
predictions = airbnb-1-predictions-stems.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airbnb-test-important-stems.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
5731.760254 5731.760254            1            1.0  55.0000 130.7084       19
4014.961792 2298.163330            2            2.0  36.0000  83.9392       21
4030.666748 4046.371704            4            4.0  80.0000 121.7081       12
7675.492446 11320.318143            8            8.0  70.0000 108.3420       21
4766.873722 1858.254999           16           16.0 100.0000 153.5942       21
11999.351437 19231.829151           32           32.0 150.0000 145.4472       21
26360.865262 40722.379086           64           64.0 150.0000 175.8636       21
20796.356198 15231.847134          128          128.0 300.0000 284.8787       18
14396.3

In [66]:
print("TF-IDF")
print(calc_r2("airbnb-1-predictions-stems.txt", "airbnb-test-important-stems.vw"))

TF-IDF
0.3535020945957654


We can see that the quality is comparable to the previous runs and better than the original model on raw text (R² of 0.354 versus 0.348), even though we compressed the data heavily: we kept at most 20 words for each text.

Of course, the differences in training time here are tiny, on the order of seconds. On real-world data the scale is much larger, and the difference can amount to hours or even days. We expect the speed-up to scale proportionally on such data.

# Run on cluster

In [67]:
import pyspark
from pyspark.sql import SparkSession

conf = (
    pyspark.SparkConf()
    .setAppName("CourseraLocalSpark")
    .setMaster("local[*]")
)

sc = pyspark.SparkContext.getOrCreate(conf)
spark = SparkSession.builder.appName('NLP-examples').getOrCreate()

airbnb = spark.read.option("header",True).option("parserLib", "univocity")\
    .option("delimiter", "\t") \
    .csv('airbnb-100k.tsv')

In [68]:
airbnb.printSchema()

root
 |-- Price: string (nullable = true)
 |-- Description: string (nullable = true)



In [71]:
def all_preprocess(row):
    Y = row.Price
    X_raw = row.Description
    X_processed = stemm_description(preprocess(X_raw))
    return (Y, X_processed)

In [72]:
def convert_to_vw_spark(data):
    target, text = data
    return convert_to_vw(text, target)

In [73]:
X_rdd = (
    airbnb.rdd
    .filter(lambda x: bool(x.Price) and bool(x.Description))
    .filter(lambda x: detect_lang(x.Description)=='en')
    .map(all_preprocess)
    .cache()
)

In [74]:
%%time 

X_rdd.take(5)

CPU times: user 49.5 ms, sys: 21.6 ms, total: 71.1 ms
Wall time: 10min 48s


[('50.0',
  ['everyon',
   'cosi',
   'bedroom',
   'modern',
   'apart',
   'locat',
   'central',
   'area',
   'pari',
   'apart',
   'bedroom',
   'one',
   'mine',
   'apart',
   'fulli',
   'renov',
   'warm',
   'atmospher',
   'live',
   'room',
   'equip',
   'kitchen',
   'wifi',
   'provid',
   'towel',
   'sheet',
   'central',
   'area',
   'cosmopolit',
   'non',
   'tourist',
   'close',
   'marai',
   'bastill',
   'republ',
   'transport',
   'metro',
   'walk',
   'saint',
   'ambrois',
   'line',
   'richard',
   'lenoir',
   'line',
   'citi',
   'bike',
   'station',
   'best']),
 ('125.0',
  ['comfort',
   'calm',
   'apart',
   'room',
   'center',
   'pari',
   'bastill',
   'area',
   'welcom',
   'explain',
   'live',
   'area',
   'bakeri',
   'market',
   'restaur',
   'bar',
   'live',
   'build',
   'close',
   'public',
   'transport',
   'metro',
   'ledru',
   'rollin',
   'charonn',
   'bastill',
   'gare',
   'lyon',
   'bus',
   'germain',
   'louvr'

In [75]:
print("\n".join(X_rdd.map(convert_to_vw_spark).take(5)))

50.0 |d everyon cosi bedroom modern apart locat central area pari apart bedroom one mine apart fulli renov warm atmospher live room equip kitchen wifi provid towel sheet central area cosmopolit non tourist close marai bastill republ transport metro walk saint ambrois line richard lenoir line citi bike station best
125.0 |d comfort calm apart room center pari bastill area welcom explain live area bakeri market restaur bar live build close public transport metro ledru rollin charonn bastill gare lyon bus germain louvr gare austerlitz gare nord live room bedroom kitchen bathroom sleep person doubl bed sofa bed also divid separ part quiet apart direct opposit love park central locat live restaur cafe fruit market live build love area nice vie quartier market bakeri apart equip fulli equip kitchen refriger freezer toaster kettl microwav induct hob dishwash wash machin coffe maker sheet towel linen provid broadband internet adsl
59.0 |d minut walk publiqu oberkampf nilmont lachais rent flat 

To compute TF-IDF we can use Spark's built-in MLlib library. For efficiency, its TF module uses feature hashing.
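Feature hashing maps every token straight to a column index, so no vocabulary has to be built or shared between workers. The sketch below shows the idea in plain Python. It uses CRC32 as a stand-in hash, which is an assumption of this example and not the hash function Spark applies. The function `hashing_tf_sketch` is hypothetical.

```python
import zlib
from collections import Counter


def hashing_tf_sketch(tokens, num_features=2**14):
    """Count tokens per hashed bucket; returns {bucket index: term frequency}."""
    counts = Counter(
        # CRC32 is only a stand-in for whichever hash the real library uses
        zlib.crc32(tok.encode("utf-8")) % num_features
        for tok in tokens
    )
    return dict(sorted(counts.items()))


# Toy stemmed document, chosen only for illustration
toy_vector = hashing_tf_sketch(["apart", "bedroom", "apart", "metro"])
```

The price of this shortcut is that different words can collide in one bucket and that bucket indices cannot be mapped back to words, so the hashed vectors show indices such as `812` instead of stems.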

In [76]:
from pyspark.mllib.feature import HashingTF, IDF

In [77]:
documents = X_rdd.map(lambda x: x[1])

hashingTF = HashingTF(numFeatures=2**14)
tf = hashingTF.transform(documents)
tf.cache()

PythonRDD[18] at RDD at PythonRDD.scala:53

In [78]:
%%time

tf.take(1)

CPU times: user 3.29 ms, sys: 932 µs, total: 4.22 ms
Wall time: 3.43 s


[SparseVector(16384, {436: 1.0, 812: 2.0, 1063: 1.0, 1202: 1.0, 1281: 1.0, 2081: 1.0, 2253: 1.0, 2374: 3.0, 2634: 1.0, 3420: 1.0, 3443: 1.0, 3613: 1.0, 3944: 1.0, 4750: 1.0, 4767: 1.0, 4949: 1.0, 5066: 1.0, 5789: 1.0, 6278: 1.0, 7254: 1.0, 7668: 1.0, 7695: 1.0, 8356: 1.0, 8459: 1.0, 8981: 2.0, 9012: 1.0, 9294: 1.0, 9536: 1.0, 11713: 1.0, 12131: 1.0, 12381: 1.0, 12556: 1.0, 12620: 1.0, 13067: 1.0, 13314: 1.0, 13467: 1.0, 13588: 1.0, 14296: 1.0, 14706: 2.0, 14722: 1.0, 15153: 1.0, 15706: 2.0})]

In [79]:
%%time

idf = IDF().fit(tf)
tfidf = idf.transform(tf)

CPU times: user 33.9 ms, sys: 8.16 ms, total: 42 ms
Wall time: 9min 21s


In [80]:
tfidf.take(1)

[SparseVector(16384, {436: 2.2227, 812: 3.4274, 1063: 5.5703, 1202: 7.5692, 1281: 2.17, 2081: 4.5315, 2253: 2.9518, 2374: 2.2942, 2634: 3.9335, 3420: 2.119, 3443: 0.6858, 3613: 0.6081, 3944: 2.7449, 4750: 1.1501, 4767: 1.14, 4949: 2.803, 5066: 0.8058, 5789: 4.4452, 6278: 3.5628, 7254: 4.4782, 7668: 5.3988, 7695: 3.1893, 8356: 1.9697, 8459: 1.5624, 8981: 1.3976, 9012: 3.2982, 9294: 1.4327, 9536: 8.3112, 11713: 3.1602, 12131: 1.7543, 12381: 7.1022, 12556: 1.8506, 12620: 0.6933, 13067: 2.1395, 13314: 8.088, 13467: 2.0324, 13588: 1.2255, 14296: 5.0512, 14706: 2.122, 14722: 0.5746, 15153: 1.5551, 15706: 4.0375})]
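These weights are larger than 1, unlike the scikit-learn values seen earlier. A likely explanation, assuming MLlib's documented formula `idf = log((m + 1) / (df + 1))` with `m` documents in total and no per-document normalization, is that each entry is simply the raw term count multiplied by the idf. The sketch below reproduces that arithmetic; `spark_style_idf` is a hypothetical name and the document counts are made up.

```python
import math


def spark_style_idf(num_docs, doc_freq):
    """idf = log((m + 1) / (df + 1)), the formula assumed for MLlib's IDF."""
    return math.log((num_docs + 1) / (doc_freq + 1))


# Made-up counts: a bucket seen in 1 document vs. in 999 of 1000 documents
rare = spark_style_idf(1000, 1)
common = spark_style_idf(1000, 999)

# A token occurring twice in a document gets twice the idf as its weight
weight_for_two_occurrences = 2.0 * rare
```

Because the scale differs from the normalized scikit-learn scores, a cut-off such as `tf-idf >= 0.1` would have to be chosen again before it is applied to these vectors.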