 <h1 align="center">Let's explore Russki!!</h1> 
<img src=http://www.svfs-russia.com/images/russian-language.gif>

## This notebook will show:
1. How to read in word vectors
2. How to create a vector representation for each row of the "Description" column

PS. The precomputed feature file is available at https://s3.us-east-2.amazonaws.com/datafaculty/train_desc_features.npy

In [None]:
import os
import pandas as pd
import numpy as np
import glob
import nltk
import gensim

### Data import and usual stuff!!!

In [None]:
train=pd.read_csv("../input/avito-demand-prediction/train.csv")

In [None]:
train.head()

### Let's load the word vectors

In [None]:
from gensim.models import KeyedVectors

In [None]:
ru_model = KeyedVectors.load_word2vec_format('../input/fasttext-russian-2m/wiki.ru.vec')
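
For context, a `.vec` file in the word2vec text format starts with a header line giving the vocabulary size and the vector dimension, followed by one line per word: the token, then its space-separated components. The sketch below parses a tiny in-memory sample by hand to show the kind of content `load_word2vec_format` reads. The sample words and values and the helper name `parse_vec_text` are made up for illustration, and this is not how gensim implements loading.

```python
import io
import numpy as np

# Hypothetical 3-word, 4-dimensional sample in word2vec text format
SAMPLE = """3 4
автомобили 0.1 0.2 0.3 0.4
мотоциклы 0.2 0.1 0.0 0.5
грузовики 0.3 0.3 0.1 0.2
"""

def parse_vec_text(handle):
    """Read word2vec text format into a {word: vector} dict (illustrative helper)."""
    # Header line: "<number of words> <vector dimension>"
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = np.array([float(v) for v in parts[1:]], dtype=np.float32)
    assert len(vectors) == n_words, "header count does not match the body"
    return vectors

toy_vectors = parse_vec_text(io.StringIO(SAMPLE))
print(len(toy_vectors), toy_vectors["автомобили"].shape)
```
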

In [None]:
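# gensim >= 4.0 removed `vocab` in favor of `key_to_index`; this line handles both
print("The vocabulary size for this corpus is {}".format(len(getattr(ru_model, "key_to_index", None) or ru_model.vocab)))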

### Let's explore these vectors and have some fun

In [None]:
# Pick a word 
find_similar_to = 'Автомобили'.lower()

In [None]:
ru_model.similar_by_word(find_similar_to)

### Using Yandex Translate, let's analyze the results:
* We searched for Автомобили, which is "Cars" in English
* The closest matches by cosine similarity were:
* aвтомобили → Cars
* микроавтомобили → Midget Car
* автомобили\xa0 → Cars
* автомобили» → Cars»
* легковые → Automobile
* автомобили → Cars
* мотоциклы → Motorcycles
* спецавтомобили → Special Vehicles
* грузовики → Trucks
* автомобилевозы → Car Carrier
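
The ranking above is based on cosine similarity between word vectors. As a rough sketch of what that measure computes, the snippet below ranks a few made-up 3-dimensional vectors against a query. The words, the values, and the helper `cosine_similarity` are assumptions for illustration and are not taken from the fastText model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dim vectors, made up for illustration only
toy = {
    "cars": np.array([1.0, 0.9, 0.1]),
    "trucks": np.array([0.9, 1.0, 0.2]),
    "apples": np.array([0.0, 0.1, 1.0]),
}

query = "cars"
# Score every other word against the query and sort from most to least similar
ranked = sorted(
    ((w, cosine_similarity(toy[query], v)) for w, v in toy.items() if w != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # "trucks" ranks above "apples"
```

Because cosine similarity ignores vector length, near-duplicate tokens such as the same word with stray punctuation tend to score very high, which is consistent with the repeated "Cars" entries in the list.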



### Let's get back to business and create features from the 'Description' column by summing word vectors

In [None]:
import nltk

# Build the tokenizer once instead of on every call
tok=nltk.tokenize.toktok.ToktokTokenizer()

def tokenize(x):
    '''Input: One description. Returns a list of lowercased tokens.'''
    return [t.lower() for t in tok.tokenize(x)]

def get_vector(x):
    '''Input: Single token. Returns its vector, or a 300 dim vector of zeros if the word is out of vocab.'''
    try:
        return ru_model.get_vector(x)
    except KeyError:
        return np.zeros(shape=300)

def vector_sum(x):
    '''Input: List of word vectors. Returns their element-wise sum.'''
    return np.sum(x,axis=0)

In [None]:
features=[]
## Replace missing descriptions (NaN) with '' so that they tokenize to an empty list
for desc in train['description'].fillna('').values:
    tokens=tokenize(desc)
    if len(tokens)!=0:
        word_vecs=[get_vector(w) for w in tokens]
        features.append(vector_sum(word_vecs))
    else: ## Empty or missing description: use a 300 dim vector filled with zeros
        features.append(np.zeros(shape=300))
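
To make the loop above easier to follow, here is a self-contained toy version of the same idea: look up each token, use zeros for unknown tokens, and sum the results. The 4-dimensional vocabulary, the whitespace tokenizer, and the names `toy_vocab`, `toy_get_vector` and `describe` are all assumptions for illustration; the real pipeline uses the 300-dimensional fastText vectors and the Toktok tokenizer.

```python
import numpy as np

DIM = 4  # toy dimension; the real vectors above have 300

# Hypothetical vocabulary, made up for illustration
toy_vocab = {
    "red": np.array([1.0, 0.0, 0.0, 0.0]),
    "car": np.array([0.0, 2.0, 0.0, 0.0]),
}

def toy_get_vector(token):
    # Out-of-vocabulary tokens contribute a zero vector
    return toy_vocab.get(token, np.zeros(DIM))

def describe(text):
    tokens = text.lower().split()  # simple whitespace split stands in for Toktok
    if not tokens:
        return np.zeros(DIM)  # empty description: all zeros
    return np.sum([toy_get_vector(t) for t in tokens], axis=0)

print(describe("Red car zzz"))  # [1. 2. 0. 0.]
print(describe(""))             # [0. 0. 0. 0.]
```

Note that a sum ignores word order, so "red car" and "car red" map to the same vector, and longer descriptions tend to produce vectors with larger magnitude.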

In [None]:
print("Features were extracted from {} rows".format(len(features)))

In [None]:
## Convert into numpy array
train_desc_features=np.array(features)
print("Shape of features extracted from 'Description' column is:")
print(train_desc_features.shape)

## We now have a dense representation of the text. It can be combined with the regular columns to build an XGBoost or CatBoost model, which may perform better than one using TF-IDF features alone.
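
One possible way to combine the two kinds of input is to stack the numeric columns next to the text features so that a single matrix can be passed to a model. The sketch below uses small stand-in arrays: `desc_features` plays the role of `train_desc_features`, and the two numeric columns (`price`, `seq`) are hypothetical examples, not a statement about which columns to use.

```python
import numpy as np

n_rows = 3
# Stand-in for train_desc_features (4 dims here instead of 300)
desc_features = np.ones((n_rows, 4), dtype=np.float32)

# Hypothetical numeric columns, one value per row
price = np.array([100.0, 250.0, 80.0], dtype=np.float32)
seq = np.array([1, 5, 2], dtype=np.float32)

# Put the regular columns side by side, then append the text features
regular = np.column_stack([price, seq])
X = np.hstack([regular, desc_features])
print(X.shape)  # (3, 6)
```

Each row of `X` then holds the regular values followed by the description vector for the same row, so the row order of both inputs must match.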

In [None]:
## Write out as a .npy file to be used later for modelling
## np.save("train_desc_features.npy",train_desc_features)
## This step fails due to kernel limitations, so I generated the file locally; it can be downloaded from:
## https://s3.us-east-2.amazonaws.com/datafaculty/train_desc_features.npy
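
For reference, a minimal save and reload round trip with `np.save` and `np.load` looks like the sketch below. It uses a small random array and a temporary directory (both assumptions for illustration) rather than the full feature matrix.

```python
import os
import tempfile
import numpy as np

# Small random stand-in for the real feature matrix
toy_features = np.random.rand(5, 300).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_desc_features.npy")
    np.save(path, toy_features)   # write the array in .npy format
    restored = np.load(path)      # read it back fully into memory

print(restored.shape, restored.dtype)
```

Storing the array as `float32` rather than `float64` halves the file size, which may help when output size is the limiting factor.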