# Abstract
### Problem
Predict the relevance of each (search term, product) pair listed in the test set.

### Method
For NLP problems, we often need to generate some self-made text features to help the ML model.  
How can we generate self-made text features?  

- String Distance
- TF-IDF
- Word2Vec

# 0. Import Necessary Packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# 1. Read Dataset

In [None]:
train_df = pd.read_csv('../input/train.csv', encoding="ISO-8859-1")
test_df = pd.read_csv('../input/test.csv', encoding="ISO-8859-1")
desc_df = pd.read_csv('../input/product_descriptions.csv')

# 2. Dataset Summary

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
desc_df.head()

# 3. Combine Train and Test Dataset

In [None]:
all_df = pd.concat((train_df, test_df), axis=0, ignore_index=True)
all_df.head()

In [None]:
print("Number of instances: ", all_df.shape[0])
print("Number of features: ", all_df.shape[1])

In [None]:
# Merge product description.
all_df = pd.merge(all_df, desc_df, how='left', on='product_uid')
all_df.head()

# 4. Text Preprocessing
Text preprocessing can involve several operations, such as removing stopwords, dropping numbers, lemmatization, and stemming.   
In this case, we only apply stemming.
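
The other operations mentioned above can be sketched without any extra library. This is only an illustration: the tiny stopword set and the `simple_preprocess` helper are hypothetical names that are not part of this notebook's pipeline, and a real project would more likely use a complete stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "for", "of", "and", "in", "with"}

def simple_preprocess(text, drop_numbers=True):
    """Lowercase, split on whitespace, drop stopwords and (optionally) pure numbers."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    if drop_numbers:
        # Remove tokens made only of digits; mixed tokens such as '2x4' are kept.
        tokens = [t for t in tokens if not re.fullmatch(r"\d+", t)]
    return " ".join(tokens)

print(simple_preprocess("The 3 brackets for a 2x4 shelf"))  # brackets 2x4 shelf
```

Whether dropping numbers helps is an open question for product search, since sizes and quantities can carry meaning; that is why the sketch makes it optional.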

In [None]:
stemmer = SnowballStemmer('english')

def str_stemmer(s):
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

In [None]:
all_df['search_term'] = all_df['search_term'].map(str_stemmer)
all_df['product_title'] = all_df['product_title'].map(str_stemmer)
all_df['product_description'] = all_df['product_description'].map(str_stemmer)

# 5. Self-made Text Features (optional)

### Levenshtein
- ratio(string1, string2)
  - Computes the similarity of two strings.
  - https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-ratio
  - With this function, we can add two new features -- 'dist_in_title' and 'dist_in_desc'. Despite the names, both are similarity scores, so a higher value means a closer match.
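
To make the idea of a string-similarity ratio concrete, here is a minimal pure-Python sketch. It assumes the ratio is normalized as `(len(a) + len(b) - distance) / (len(a) + len(b))`, with a substitution counted as two edits; the exact definition used by the library should be checked against the documentation linked above. The names `indel_distance` and `similarity_ratio` are hypothetical.

```python
def indel_distance(a, b):
    """Edit distance where insert/delete cost 1 and a substitution costs 2 (assumed weighting)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            sub_cost = 0 if ca == cb else 2
            curr.append(min(prev[j] + 1,             # delete ca
                            curr[j - 1] + 1,         # insert cb
                            prev[j - 1] + sub_cost)) # match or substitute
        prev = curr
    return prev[-1]

def similarity_ratio(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for nothing in common."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - indel_distance(a, b)) / total

print(similarity_ratio("hello", "hello world"))  # 0.625
```

Here "hello world" needs six insertions to be built from "hello", and the two lengths sum to 16, which gives (16 - 6) / 16.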

In [None]:
import Levenshtein

In [None]:
Levenshtein.ratio('hello', 'hello world')

In [None]:
all_df['dist_in_title'] = all_df.apply(lambda x:Levenshtein.ratio(
    x['search_term'],x['product_title']), axis=1)
all_df['dist_in_desc'] = all_df.apply(lambda x:Levenshtein.ratio(
    x['search_term'],x['product_description']), axis=1)

### TF-IDF
- Step 1: create a new column containing all free text
- Step 2: build a term dictionary using gensim/sklearn
- Step 3: Bag-of-Words
  - The function **doc2bow()** simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. 
- Step 4: create our own corpus
- Step 5: build the TF-IDF model
- Step 6: compute the similarity between two texts
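
Before turning to gensim, the steps above can be sketched in plain Python on a toy corpus. This is an illustration under stated assumptions, not gensim's implementation: the three documents are made up, all helper names (`bow`, `tfidf_vector`, `sparse_cosine`) are hypothetical, and the weighting uses the `log2(N / df)` variant of IDF with L2 normalization.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
toy_docs = ["angle bracket steel", "steel deck screw", "wood deck stain"]
toy_tokens = [doc.split() for doc in toy_docs]

# Step 2: term dictionary mapping each distinct word to an integer id.
toy_vocab = {w: i for i, w in enumerate(sorted({w for doc in toy_tokens for w in doc}))}

def bow(tokens):
    """Step 3: sparse bag-of-words {word_id: count}; unknown words are skipped."""
    return dict(Counter(toy_vocab[w] for w in tokens if w in toy_vocab))

# Step 5: inverse document frequency, using the log2(N / df) variant.
doc_freq = Counter(word_id for doc in toy_tokens for word_id in bow(doc))
idf = {word_id: math.log2(len(toy_tokens) / n) for word_id, n in doc_freq.items()}

def tfidf_vector(tokens):
    """L2-normalized TF-IDF weights {word_id: weight}."""
    weights = {i: count * idf[i] for i, count in bow(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {i: w / norm for i, w in weights.items()} if norm else {}

def sparse_cosine(v1, v2):
    """Step 6: for unit-length vectors, cosine similarity is the dot product."""
    return sum(w * v2.get(i, 0.0) for i, w in v1.items())

query = tfidf_vector("steel bracket".split())
print(sparse_cosine(query, tfidf_vector(toy_tokens[0])))
```

Because "bracket" appears in only one document, it receives a larger weight than "steel", which appears in two; rare shared words therefore contribute more to the similarity.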

In [None]:
# Merge all free-text features as one new feature.
# The parentheses keep the trailing "+ ' . '" inside the same expression.
all_df['all_texts'] = (all_df['product_title'] + ' . '
                       + all_df['product_description'] + ' . ')

In [None]:
all_df['all_texts'][:5]

In [None]:
from gensim.utils import tokenize
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) 
                        for x in all_df['all_texts'].values) # fit dictionary
print(dictionary)

In [None]:
# Bag-of-words: to turn documents into vectors, convert the corpus to BoW format.
class MyCorpus(object):
    def __iter__(self):
        for x in all_df['all_texts'].values:
            # convert corpus to BoW format
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))

corpus = MyCorpus()

In [None]:
# Build TF-IDF model.
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus) # fit model

In [None]:
# Test TF-IDF model.
tfidf[dictionary.doc2bow(list(tokenize('hello world, good morning', errors='ignore')))]

In [None]:
from gensim.similarities import MatrixSimilarity

def to_tfidf(text):
    res = tfidf[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res

def cos_sim(text1, text2):
    tfidf1 = to_tfidf(text1)
    tfidf2 = to_tfidf(text2)
    index = MatrixSimilarity([tfidf1],num_features=len(dictionary))
    sim = index[tfidf2]
    return float(sim[0])

In [None]:
# Test cosine similarity.
text1 = 'hello world'
text2 = 'hello from the other side'
cos_sim(text1, text2)

In [None]:
# Generate two new features -- 'tfidf_cos_sim_in_title' and 'tfidf_cos_sim_in_desc'.
all_df['tfidf_cos_sim_in_title'] = all_df.apply(lambda x: cos_sim(
    x['search_term'], x['product_title']), axis=1)
all_df['tfidf_cos_sim_in_desc'] = all_df.apply(lambda x: cos_sim(
    x['search_term'], x['product_description']), axis=1)

In [None]:
# Check new feature value.
all_df['tfidf_cos_sim_in_title'][:5]

### Word2Vec
- Step 1: split the text into a list of sentences
- Step 2: split each sentence into a list of words in order to build the corpus
  - nltk.tokenize.word_tokenize
    - https://www.nltk.org/api/nltk.tokenize.html
  - gensim.utils.tokenize
    - https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize
- Step 3: train the Word2Vec model
  - https://radimrehurek.com/gensim/models/word2vec.html
- Step 4: compute the similarity between two texts
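
Step 4 compares whole texts, while Word2Vec produces one vector per word. A common way to bridge the gap is to average the word vectors of each text and take the cosine of the two averages. The sketch below shows that idea with a hypothetical three-dimensional embedding table standing in for a trained model; `toy_vectors`, `average_vector` and `cosine_similarity` are illustrative names only.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained Word2Vec model would supply real ones.
toy_vectors = {
    "deck":  [0.9, 0.1, 0.0],
    "patio": [0.8, 0.2, 0.1],
    "screw": [0.1, 0.9, 0.2],
}

def average_vector(tokens, vectors, dim=3):
    """Mean of the vectors of known tokens; None if no token is in the table."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is missing or all zeros."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(average_vector(["deck"], toy_vectors),
                              average_vector(["patio"], toy_vectors))
sim_far = cosine_similarity(average_vector(["deck"], toy_vectors),
                            average_vector(["screw"], toy_vectors))
print(sim_close > sim_far)  # True
```

Returning `None` when no token is known makes the out-of-vocabulary case explicit, so the similarity can fall back to 0.0 instead of dividing by zero.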

In [None]:
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
# Test sentences splitting.
tokenizer.tokenize(all_df['all_texts'].values[0])

In [None]:
# free-text ==> sentences
sentences = [tokenizer.tokenize(x) for x in all_df['all_texts'].values]

In [None]:
# Flatten the list of lists, since there is no hierarchical relationship between sentences.
sentences = [y for x in sentences for y in x]

In [None]:
len(sentences) # how many sentences in total 

In [None]:
# Build the corpus: sentences ==> words
from nltk.tokenize import word_tokenize
w2v_corpus = [word_tokenize(x) for x in sentences]

In [None]:
# Train the model.
from gensim.models.word2vec import Word2Vec

# Note: gensim 4.0 renamed the `size` argument to `vector_size`.
model = Word2Vec(w2v_corpus, size=128, window=5, min_count=5, workers=4)

In [None]:
# Test term -- 'right', returns a vector (looked up through the model's keyed vectors).
model.wv['right']

In [None]:
# Keyed vectors: map each vocabulary word to its trained vector.
vocab = model.wv

# Get the vector of any text as the average of its in-vocabulary word vectors.
def get_vector(text):
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            res += vocab[word]
            count += 1
    if count == 0:
        return res  # no known word: return the zero vector instead of dividing by zero
    return res / count

In [None]:
# Test get_vector.
print(get_vector('life is like a box of chocolate'))

In [None]:
from scipy import spatial

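def w2v_cos_sim(text1, text2):
    try:
        w2v1 = get_vector(text1)
        w2v2 = get_vector(text2)
        sim = 1 - spatial.distance.cosine(w2v1, w2v2)
        # NaN appears when a text has no in-vocabulary word; treat it as no similarity.
        return 0.0 if np.isnan(sim) else float(sim)
    except Exception:
        return 0.0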

In [None]:
# Test w2v_cos_sim.
w2v_cos_sim('hello world', 'hello from the other side')

In [None]:
# Generate two new features -- 'w2v_cos_sim_in_title' and 'w2v_cos_sim_in_desc'.
all_df['w2v_cos_sim_in_title'] = all_df.apply(
    lambda x: w2v_cos_sim(x['search_term'], x['product_title']), axis=1)
all_df['w2v_cos_sim_in_desc'] = all_df.apply(
    lambda x: w2v_cos_sim(x['search_term'], x['product_description']), axis=1)

In [None]:
# Show current dataframe.
all_df.head(5)

In [None]:
# Drop all text features that cannot be fed into the ML model.
all_df = all_df.drop(
    ['search_term','product_title','product_description','all_texts'],axis=1)

In [None]:
# Show current dataframe; all remaining features are numeric.
all_df.head(5) 

In [None]:
# Fill all NaN with 0.
all_df = all_df.fillna(0)

# 6. Reshape Train and Test Data

In [None]:
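# Separate train and test data by position. After concat(ignore_index=True),
# the first len(train_df) rows of all_df are train rows and the rest are test rows;
# the original test_df index restarts at 0, so it cannot be used to select test rows.
n_train = len(train_df)
train_df = all_df.iloc[:n_train]
test_df = all_df.iloc[n_train:]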

In [None]:
# Keep the test ids for the submission file.
test_ids = test_df['id']

In [None]:
# Extract target values of train data.
y_train = train_df['relevance'].values

In [None]:
# For train and test data, drop the id and label columns.
X_train = train_df.drop(['id','relevance'],axis=1).values
X_test = test_df.drop(['id','relevance'],axis=1).values

# 7. Modeling

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

In [None]:
params = [1,3,5,6,7,8,9,10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, 
                                          scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")

In [None]:
rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)

# 8. Prediction

In [None]:
y_pred = rf.predict(X_test)

# 9. Submission

In [None]:
pd.DataFrame({"id": test_ids, "relevance": y_pred}).to_csv('submission.csv',index=False)