## Data Cleaning and Shape Examination


## Home Depot Product Search Relevance



Shoppers rely on Home Depot’s product authority to find and buy the latest products and to get timely solutions to their home improvement needs. From installing a new ceiling fan to remodeling an entire kitchen, with the click of a mouse or tap of the screen, customers expect the correct results to their queries – quickly. Speed, accuracy and delivering a frictionless customer experience are essential.

In this competition, Home Depot is asking Kagglers to help them improve their customers' shopping experience by developing a model that can accurately predict the relevance of search results.

Search relevancy is an implicit measure Home Depot uses to gauge how quickly they can get customers to the right products. Currently, human raters evaluate the impact of potential changes to their search algorithms, which is a slow and subjective process. By removing or minimizing human input in search relevance evaluation, Home Depot hopes to increase the number of iterations their team can perform on the current search algorithms.

### Data description

This data set contains a number of products and real customer search terms from Home Depot's website. The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters.

The relevance is a number between 1 (not relevant) and 3 (highly relevant). For example, a search for "AA battery" would be considered highly relevant to a pack of size AA batteries (relevance = 3), mildly relevant to a cordless drill battery (relevance = 2), and not relevant to a snow shovel (relevance = 1).

Each pair was evaluated by at least three human raters. The provided relevance scores are the average value of the ratings. There are three additional things to know about the ratings:

- The specific instructions given to the raters are provided in relevance_instructions.docx.
- Raters did not have access to the attributes.
- Raters had access to product images, while the competition does not include images.

Your task is to predict the relevance for each pair listed in the test set. Note that the test set contains both seen and unseen search terms.
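
Because each label is the mean of at least three ratings on the 1 to 3 scale, relevance values are not restricted to integers. A minimal sketch with made-up ratings (the numbers are illustrative and not taken from the data set; `average_relevance` is a hypothetical helper):

```python
from statistics import mean

def average_relevance(ratings):
    """Return the mean of the individual rater scores (each between 1 and 3)."""
    if len(ratings) < 3:
        raise ValueError("each pair is rated by at least three raters")
    return mean(ratings)

# Hypothetical ratings for one (search_term, product_uid) pair
example_ratings = [3, 3, 2]
example_relevance = round(average_relevance(example_ratings), 2)
print(example_relevance)
```

This is why the target is treated as a regression problem in the rest of the notebook rather than as a three-class classification.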



### File descriptions

- train.csv - the training set, contains products, searches, and relevance scores
- test.csv - the test set, contains products and searches. You must predict the relevance for these pairs.
- product_descriptions.csv - contains a text description of each product. You may join this table to the training or test set via the product_uid.
- attributes.csv -  provides extended information about a subset of the products (typically representing detailed technical specifications). Not every product will have attributes.
- sample_submission.csv - a file showing the correct submission format
- relevance_instructions.docx - the instructions provided to human raters
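
The join via product_uid mentioned above could look roughly like the following sketch. The two small frames are made-up stand-ins for train.csv and product_descriptions.csv, and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train.csv and product_descriptions.csv (values are made up)
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "product_uid": [100001, 100001, 100002],
    "search_term": ["angle bracket", "l bracket", "deck over"],
})
toy_descriptions = pd.DataFrame({
    "product_uid": [100001, 100002],
    "product_description": ["Steel angle bracket", "Deck coating"],
})

# A left join keeps every training row, even if a description were missing
toy_joined = toy_train.merge(toy_descriptions, on="product_uid", how="left")
print(toy_joined.shape)
```

A left join is used so that the number of rows in the training set is preserved; a product without a description would simply get a missing value.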

### Data fields

- id - a unique Id field which represents a (search_term, product_uid) pair
- product_uid - an id for the products
- product_title - the product title
- product_description - the text description of the product (may contain HTML content)
- search_term - the search query
- relevance - the average of the relevance ratings for a given id
- name - an attribute name
- value - the attribute's value

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

# training_data = pd.read_csv("input/train.csv.zip", encoding="ISO-8859-1")
# testing_data = pd.read_csv("input/test.csv.zip", encoding="ISO-8859-1")
# attribute_data = pd.read_csv('input/attributes.csv.zip')
# descriptions = pd.read_csv('input/product_descriptions.csv.zip')


training_data = pd.read_csv("../input/train.csv", encoding="ISO-8859-1")
testing_data = pd.read_csv("../input/test.csv", encoding="ISO-8859-1")
attribute_data = pd.read_csv('../input/attributes.csv')
descriptions = pd.read_csv('../input/product_descriptions.csv')


Let's examine the data and try to spot anything suspicious about it.

In [None]:
print("training data shape is:",training_data.shape)
print("testing data shape is:",testing_data.shape)
print("attribute data shape is:",attribute_data.shape)
print("description data shape is:",descriptions.shape)

In [None]:
print("training data has empty values:",training_data.isnull().values.any())
print("testing data has empty values:",testing_data.isnull().values.any())
print("attribute data has empty values:",attribute_data.isnull().values.any())
print("description data has empty values:",descriptions.isnull().values.any())

In [None]:
training_data.head(10)

In [None]:
print("there are in total {} products ".format(len(training_data.product_title.unique())))
print("there are in total {} search query ".format(len(training_data.search_term.unique())))
print("there are in total {} product_uid".format(len(training_data.product_uid.unique())))




In [None]:
testing_data.head(10)

In [None]:
print("there are in total {} products ".format(len(testing_data.product_title.unique())))
print("there are in total {} search query ".format(len(testing_data.search_term.unique())))
print("there are in total {} product_uid".format(len(testing_data.product_uid.unique())))





In [None]:
attribute_data.head(10)

In [None]:
print("there are in total {} product_uid ".format(len(attribute_data.product_uid.unique())))
print("there are in total {} names ".format(len(attribute_data.name.unique())))
print("there are in total {} values".format(len(attribute_data.value.unique())))






In [None]:
descriptions.head(10)

In [None]:
print("there are in total {} product_uid ".format(len(descriptions.product_uid.unique())))
print("there are in total {} product_descriptions ".format(len(descriptions.product_description.unique())))







In [None]:
# number of digit groups per description
(descriptions.product_description.str.count(r'\d+') + 1).hist(bins=30)
# number of non-word characters per description
(descriptions.product_description.str.count(r'\W') + 1).hist(bins=30)




In [None]:
(training_data.product_title.str.count("\\d+") + 1).hist(bins=30)  # plot number of digit groups in title
(training_data.product_title.str.count("\\w+") + 1).hist(bins=30)  # plot number of words in title





In [None]:
(training_data.search_term.str.count("\\w+") + 1).hist(bins=30)  # plot number of words in search terms
(training_data.search_term.str.count("\\d+") + 1).hist(bins=30)  # plot number of digit groups in search terms







In [None]:
(training_data.relevance ).hist(bins=30)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

training_data.relevance.plot(kind='hist', density=True)  # 'normed' was removed from matplotlib

mu, std = norm.fit(training_data.relevance)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
print('number of descriptions containing html tags:', descriptions.product_description.str.contains('<br').sum())

In [None]:
descriptions[descriptions.product_description.str.contains("<br")].values.tolist()[:3]

In [None]:
descriptions.product_description.str.contains("Click here to review our return policy for additional information regarding returns").values.sum()

In [None]:
training_data[training_data.search_term.str.contains("^\\d+ . \\d+$")].head(10)

In [None]:
training_data[training_data.product_uid==100030]

From the above, the following conclusions can be drawn:
   - Some fields in the __description__ dataset contain HTML tags (possibly an error made by the scraper), along with the text _Click here to review our return policy_
   - There are no missing/empty values in any of these datasets
   - In the __description__ dataset, the product_description field contains more digits than word characters
   - Some queries in the __training__ dataset are too terse, so it is hard to guess what the user meant in a broad sense
   - Some of the search queries in the __training__ dataset are overly specific, for example 8 4616809045 9
   - The number of digit occurrences in the product title tends to be greater than the number of characters in the __training__ dataset (and the same is true for the search query field)
   - The relevance score lies between 1 and 3. Because the density of products whose relevance score is between 2 and 3 is higher, we can conclude that most search queries have been rated between 2 and 3
   - The histogram of the relevance score does not follow a normal distribution



In order to continue the analysis we will need the whole datasets:
   - The description dataset can be joined to the training dataset by product_uid (the same holds for the attribute dataset); the HTML parts can then be cleaned
   
   

## Data cleaning

In [None]:
## let's create first the cleaning functions
from bs4 import BeautifulSoup
import lxml
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list
from nltk.metrics import edit_distance
from string import punctuation
from collections import Counter


def remove_html_tag(text):
    soup = BeautifulSoup(text, 'lxml')
    text = soup.get_text().replace('Click here to review our return policy for additional information regarding returns', '')
    return text

def str_stemmer(doc):
    # note: despite its name, this function only tokenizes and filters; it does not stem
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return ' '.join(tokens)

def str_stemmer_title(s):
    # lower-case, split on whitespace and stem every word
    return " ".join(map(stemmer.stem, s.lower().split()))

def str_common_word(str1, str2):
    # count the distinct words of str1 that occur in str2 (substring match)
    whole_set = set(str1.split())
    return sum(int(str2.find(word) >= 0) for word in whole_set)


def get_shared_words(row_data):
    # row_data is (search_term, product_description, product_title);
    # count the search-term words found in the description and in the title
    search, description, title = row_data
    return str_common_word(search, description) + str_common_word(search, title)
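
Note that `str_common_word` relies on `str.find`, so a query word is counted whenever it occurs as a substring of the target text, not only as a whole word. The sketch below contrasts the two behaviours; `common_substrings` and `common_whole_words` are hypothetical helpers introduced only for illustration.

```python
def common_substrings(query, text):
    """Count distinct query words that occur anywhere in text (substring match)."""
    return sum(1 for word in set(query.split()) if word in text)

def common_whole_words(query, text):
    """Count distinct query words that occur in text as whole tokens."""
    return len(set(query.split()) & set(text.split()))

query = "pain relief"
title = "exterior paint primer"
substring_hits = common_substrings(query, title)    # "pain" matches inside "paint"
whole_word_hits = common_whole_words(query, title)  # no token equals "pain"
print(substring_hits, whole_word_hits)
```

Substring matching is more forgiving towards plurals and compound words, but it can also inflate the count with accidental matches, so which variant works better is an empirical question.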



In [None]:
############### cleaning html tags ##################
has_tag_in = descriptions.product_description.str.contains('<br')
descriptions.loc[has_tag_in, 'product_description'] = descriptions.loc[has_tag_in, 'product_description'].map(remove_html_tag)
###############

Examining the search queries in the __training__ dataset shows that the _search_term_ field contains a lot of misspellings (more than 3000). This might be fixed by querying Google for its spelling suggestion.

In [None]:
import requests
import re
import time
from random import randint

START_SPELL_CHECK="<span class=\"spell\">Showing results for</span>"
END_SPELL_CHECK="<br><span class=\"spell_orig\">Search instead for"
HTML_Codes = (("'", '&#39;'),('"', '&quot;'),('>', '&gt;'),('<', '&lt;'),('&', '&amp;'))

def spell_check(s):
    q = '+'.join(s.split())
    time.sleep(  randint(0,1) ) #relax and don't let google be angry
    r = requests.get("https://www.google.co.uk/search?q="+q)
    content = r.text
    start=content.find(START_SPELL_CHECK) 
    if ( start > -1 ):
        start = start + len(START_SPELL_CHECK)
        end=content.find(END_SPELL_CHECK)
        search= content[start:end]
        search = re.sub(r'<[^>]+>', '', search)
        for code in HTML_Codes:
            search = search.replace(code[1], code[0])
        search = search[1:]
    else:
        search = s
    return search 

Correcting the misspelled words might indeed help; however, to keep the results reproducible on Kaggle, we will not apply spell correction.
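
If spell correction were wanted without an external service, one reproducible option is fuzzy matching against a vocabulary built from the data itself. The following is only a sketch of that idea using the standard library's `difflib`, not something this notebook does; `correct_query`, `toy_vocabulary` and the `cutoff` value are assumptions.

```python
from difflib import get_close_matches

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each query word by its closest vocabulary word, if one is close enough."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        match = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)

# Hypothetical vocabulary; in practice it could be built from product_title tokens
toy_vocabulary = ["battery", "ceiling", "fan", "shovel", "drill"]
fixed_query = correct_query("ceilling fan batery", toy_vocabulary)
print(fixed_query)
```

Words with no sufficiently similar vocabulary entry are left unchanged, which keeps model numbers and brand names intact.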

In [None]:
training_data = pd.merge(training_data, descriptions, 
                         on="product_uid", how="left")

In [None]:
training_data.head(3)

In [None]:
print("It has blank/empty fields ",training_data.isnull().values.sum())


## Feature Engineering

### Plan
We are going to do the following:
1. Join the __training__ dataset with __description__ by _product uid_ (already done)

2. Create numeric columns based on the text columns
    - count the number of words from the search query which appear in both product_title and product_description
    - compute the edit distance between the search query and both product_title and product_description
    - compute the cosine similarity between the search query, product_title and product_description
    - count the number of words in the product description
    - create new columns for each pair
    
3. Remove all text columns

As a result we will have vectors of numbers that suit machine learning well.
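
To make two of the planned similarity features concrete, here is a small standard-library sketch of Jaccard and cosine similarity on token lists. It illustrates the measures themselves; the helper names are hypothetical and the implementation differs from the NLTK-based functions used in this notebook.

```python
import math
from collections import Counter

def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def cosine_similarity_counts(tokens_a, tokens_b):
    """Cosine of the angle between the two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_tokens = "aa battery".split()
title_tokens = "aa battery pack".split()
jaccard_value = jaccard_similarity(query_tokens, title_tokens)
cosine_value = cosine_similarity_counts(query_tokens, title_tokens)
print(round(jaccard_value, 3), round(cosine_value, 3))
```

Both helpers return 0.0 for empty inputs instead of dividing by zero, which matters once stop-word filtering leaves a title or query without tokens.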

In [None]:
print("has blank/empty values",training_data.isnull().values.any())

In [None]:
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter


def sentence_similarity(columns,stopwords=None):
    sent1, sent2 = columns[0], columns[1]
    sent1 = sent1.split(' ')
    sent2 = sent2.split(' ')
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

def get_jaccard_sim(columns): 
    str1, str2 = columns[0], columns[1]
    a = set(str1.split()) 
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def calc_edit_dist(row):
    return edit_distance(*row)

    

In [None]:
################begin testing

def str_stemmer_tokens(tokens):
    # tokens is an already split list of words; despite the name, no stemming is applied
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return ' '.join(tokens)

def str_stemmer_title(s):
    return " ".join(map(stemmer.stem, s))

def str_common_word(str1, str2):
    whole_set = set(str1.split())
#     return sum(int(str2.find(word)>=0) for word in whole_set)
    return sum(int(str2.find(word)>=0) for word in whole_set)


# def str_common_word(str1, str2):
#     return sum(int(str2.find(word)>=0) for word in str1.split())


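def str_common_word2(str1, str2):
    # str1 and str2 are token lists; count the tokens of str2 that also occur in str1
    part_of_first = set(str1)
    return sum(1 for word in str2 if word in part_of_first)

def get_shared_words_mut(row_data):
    # row_data holds token lists: (search_term, product_title, product_description)
    search, title, description = row_data
    return str_common_word2(search, title) + str_common_word2(search, description)


def get_shared_words_imut(row_data):
    # row_data holds strings: (search_term, product_description, product_title)
    search, description, title = row_data
    return str_common_word(search, description) + str_common_word(search, title)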
    
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter


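def sentence_similarity(columns,stopwords=None):
    sent1, sent2 = columns[0], columns[1]
    # accept plain strings as well as token lists; without this, a string
    # would be compared character by character
    if isinstance(sent1, str):
        sent1 = sent1.split()
    if isinstance(sent2, str):
        sent2 = sent2.split()
    if stopwords is None:
        stopwords = []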
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

def get_jaccard_sim(columns): 
    str1, str2 = columns[0], columns[1]
    a = set(str1) 
    b = set(str2)
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


In [None]:
############## apply stemming #####################
#  also .apply(, raw=True) might be a good options
# https://github.com/s-heisler/pycon2017-optimizing-pandas to see why it is done on this way
############## apply stemming #####################



training_data['search_term_tokens'] = training_data.search_term.str.lower().str.split()
training_data['product_title_tokens'] = training_data.product_title.str.lower().str.split()
training_data['product_description_tokens'] = training_data.product_description.str.lower().str.split()

training_data['search_term'] = [str_stemmer_title(_) for _ in training_data.search_term_tokens.values.tolist()]
training_data['product_title'] = [str_stemmer_tokens(_) for _ in training_data.product_title_tokens.values.tolist()]
training_data['product_description'] = [str_stemmer_tokens(_) for _ in training_data.product_description_tokens.values.tolist()]


training_data['shared_words_mut'] = [get_shared_words_mut(columns)
                         for columns in 
                         training_data[['search_term_tokens', 'product_title_tokens', 'product_description_tokens']].values.tolist()
                        ]

training_data['shared_words'] = list(map(get_shared_words_imut, training_data[['search_term','product_description', 'product_title']].values))


training_data['j_dis_sqt'] = [get_jaccard_sim(rows) for rows in training_data[["search_term_tokens","product_title_tokens"]].values]
training_data['j_dis_sqd'] = [get_jaccard_sim(rows) for rows in training_data[["search_term_tokens","product_description_tokens"]].values]

training_data['search_query_length'] = training_data.search_term.str.len()
training_data['number_of_words_in_descr'] = training_data.product_description.str.count("\\w+")


training_data['cos_dis_sqt'] = [ sentence_similarity(rows) for rows in training_data[["search_term","product_title"]].values]
training_data['cos_dis_sqd'] = [sentence_similarity(rows) for rows in training_data[["search_term","product_description"]].values]




In [None]:
# these two lines take too long to execute
# training_data["edistance_sprot"] = [edit_distance(word1, word2) for word1, word2 in
#                                     training_data[["search_term","product_title"]].values.tolist()]


# training_data["edistance_sd"] = [edit_distance(word1, word2) for word1, word2 in
#                                     training_data[["search_term","product_description"]].values.tolist()]

In [None]:
# training_data.corr()
training_data.head(3)

__test dataset__
We have to apply the same transformations to both data sets, except for the relevance score field, since it is the target. In addition, we must not take any action that might lead to overfitting the data.
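
One way to guarantee that both data sets receive identical transformations is to put the feature code into a single function and call it on each frame. The sketch below uses toy frames; `add_basic_features` and the `number_of_words_in_title` column are hypothetical and only illustrate the pattern.

```python
import pandas as pd

def add_basic_features(df):
    """Return a copy of df with feature columns derived only from its own text."""
    out = df.copy()
    out["search_query_length"] = out["search_term"].str.len()
    out["number_of_words_in_title"] = out["product_title"].str.count(r"\w+")
    return out

toy_train = pd.DataFrame({"search_term": ["aa battery"],
                          "product_title": ["AA battery pack"],
                          "relevance": [3.0]})
toy_test = pd.DataFrame({"search_term": ["snow shovel"],
                         "product_title": ["Steel snow shovel 18 in"]})

toy_train_features = add_basic_features(toy_train)
toy_test_features = add_basic_features(toy_test)

# Apart from the target, both frames end up with identical columns
feature_columns = [c for c in toy_train_features.columns if c != "relevance"]
print(feature_columns == list(toy_test_features.columns))
```

Because the function never reads the relevance column, it cannot leak the target into the features.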

In [None]:
testing_data = pd.merge(testing_data, descriptions, 
                         on="product_uid", how="left")
print("has blank/empty values",testing_data.isnull().values.any())

In [None]:
############## apply stemming for test data #####################
# testing_data['search_term'] = list(map(str_stemmer_title, testing_data['search_term'].values))
# testing_data['product_title'] = list(map(str_stemmer, testing_data['product_title'].values))
# testing_data['product_description'] = list(map(str_stemmer, testing_data['product_description'].values))
testing_data['search_term_tokens'] = testing_data.search_term.str.lower().str.split()
testing_data['product_title_tokens'] = testing_data.product_title.str.lower().str.split()
testing_data['product_description_tokens'] = testing_data.product_description.str.lower().str.split()

testing_data['search_term'] = [str_stemmer_title(_) for _ in testing_data.search_term_tokens.values.tolist()]
testing_data['product_title'] = [str_stemmer_tokens(_) for _ in testing_data.product_title_tokens.values.tolist()]
testing_data['product_description'] = [str_stemmer_tokens(_) for _ in testing_data.product_description_tokens.values.tolist()]

############## end stemming #####################

In [None]:
############## building custom features for the test data; let's build a few of them before comparing which one is the best ###########
# testing_data['shared_words'] = list(map(get_shared_words, testing_data[['search_term','product_description', 'product_title']].values))
# testing_data["edistance_sprot"] = list(map(calc_edit_dist, testing_data[["search_term","product_title"]].values))
# testing_data["edistance_sd"] = list(map(calc_edit_dist, testing_data[["search_term","product_description"]].values))


# testing_data['cos_dis_sqt'] = list(map(sentence_similarity ,testing_data[["search_term","product_title"]].values))
# testing_data['cos_dis_sqd'] = list(map(sentence_similarity, testing_data[["search_term","product_description"]].values))



# testing_data['j_dis_sqt'] = list(map(get_jaccard_sim, testing_data[["search_term","product_title"]].values))
# testing_data['j_dis_sqd'] = list(map(get_jaccard_sim, testing_data[["search_term","product_description"]].values))

# testing_data['j_dis_sqt'] = list(map(get_jaccard_sim, testing_data[["search_term","product_title"]].values))
# testing_data['j_dis_sqd'] = list(map(get_jaccard_sim, testing_data[["search_term","product_description"]].values))

# testing_data['search_query_length'] = testing_data.search_term.str.len()
# testing_data['number_of_words_in_descr'] = testing_data.product_description.str.count("\\w+")

testing_data['shared_words_mut'] = [get_shared_words_mut(columns)
                         for columns in 
                         testing_data[['search_term_tokens', 'product_title_tokens', 'product_description_tokens']].values.tolist()
                        ]

testing_data['shared_words'] = list(map(get_shared_words_imut, testing_data[['search_term','product_description', 'product_title']].values))


testing_data['j_dis_sqt'] = [get_jaccard_sim(rows) for rows in testing_data[["search_term_tokens","product_title_tokens"]].values]
testing_data['j_dis_sqd'] = [get_jaccard_sim(rows) for rows in testing_data[["search_term_tokens","product_description_tokens"]].values]

testing_data['search_query_length'] = testing_data.search_term.str.len()
testing_data['number_of_words_in_descr'] = testing_data.product_description.str.count("\\w+")


testing_data['cos_dis_sqt'] = [ sentence_similarity(rows) for rows in testing_data[["search_term","product_title"]].values]
testing_data['cos_dis_sqd'] = [sentence_similarity(rows) for rows in testing_data[["search_term","product_description"]].values]




In [None]:
# these two lines take too long to execute

# testing_data["edistance_sprot"] = [edit_distance(word1, word2) for word1, word2 in
#                                     testing_data[["search_term","product_title"]].values.tolist()]


# testing_data["edistance_sd"] = [edit_distance(word1, word2) for word1, word2 in
#                                     testing_data[["search_term","product_description"]].values.tolist()]


In [None]:
testing_data.corr(numeric_only=True)  # skip the text and token columns

In [None]:
training_data.describe()

In [None]:
testing_data.describe()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 12))
temp = training_data.drop(['product_uid','id'],axis=1)
sns.heatmap(temp.corr(numeric_only=True), annot=True)  # skip the text and token columns
plt.show()

In [None]:
import seaborn as sns
plt.figure(figsize=(12, 12))
temp = testing_data.drop(['product_uid','id'],axis=1)
sns.heatmap(temp.corr(numeric_only=True), annot=True)  # skip the text and token columns
plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

training_data.cos_dis_sqd.plot(kind='hist', density=True)

mu, std = norm.fit(training_data.cos_dis_sqd)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

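Let's check whether this feature follows a Gaussian distribution. According to the Shapiro-Wilk test below, it does not. Note that SciPy documents the Shapiro p-value as potentially inaccurate for samples larger than 5000, so the Q-Q plot should be read alongside it.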

In [None]:
from statsmodels.graphics.gofplots import qqplot
from scipy.stats import shapiro


from matplotlib import pyplot
qqplot(training_data.cos_dis_sqd, line='s')
pyplot.show()

stat, p = shapiro(training_data.cos_dis_sqd)
print('Statistics=%.3f, p=%.3f' % (stat, p))

Let's run a few other tests to check whether it follows a normal distribution.

In [None]:
from scipy.stats import normaltest

stat, p = normaltest(training_data.cos_dis_sqd)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
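
SciPy's `normaltest` is based on the D'Agostino-Pearson test, which combines sample skewness and kurtosis. As a quick sanity check, these two moments can also be computed by hand. The sketch below uses synthetic samples rather than the notebook's features, and the helper names are hypothetical.

```python
import random

def sample_skewness(values):
    """Third standardized moment (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

random.seed(0)
gaussian_sample = [random.gauss(0.0, 1.0) for _ in range(20000)]
skewed_sample = [random.expovariate(1.0) for _ in range(20000)]

print(round(sample_skewness(gaussian_sample), 2), round(excess_kurtosis(gaussian_sample), 2))
print(round(sample_skewness(skewed_sample), 2), round(excess_kurtosis(skewed_sample), 2))
```

Values close to zero for both moments are consistent with a Gaussian shape, while a clearly positive skewness, as in the exponential sample, points to a long right tail.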

In [None]:
from scipy.stats import anderson

result = anderson(training_data.cos_dis_sqd)
print('Statistic: %.3f' % result.statistic)
p = 0
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

training_data.cos_dis_sqt.plot(kind='hist', density=True)

mu, std = norm.fit(training_data.cos_dis_sqt)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
from matplotlib import pyplot
qqplot(training_data.cos_dis_sqt, line='s')
pyplot.show()

stat, p = shapiro(training_data.cos_dis_sqt)
print('Statistics=%.3f, p=%.3f' % (stat, p))

The histogram below suggests that the sum of shared words between the search query, the product title and the product description roughly follows a normal distribution.


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

training_data.shared_words.plot(kind='hist', density=True)

mu, std = norm.fit(training_data.shared_words)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
from statsmodels.graphics.gofplots import qqplot
from scipy.stats import shapiro


from matplotlib import pyplot
qqplot(training_data.shared_words, line='s')
pyplot.show()

stat, p = shapiro(training_data.shared_words)
print('Statistics=%.3f, p=%.3f' % (stat, p))

In [None]:
# %matplotlib inline
# import matplotlib.pyplot as plt
# from scipy.stats import norm  

# training_data.edistance_sprot.plot(kind='hist', normed=True)

# mu, std = norm.fit(training_data.edistance_sprot)

# xmin, xmax = plt.xlim()
# x = np.linspace(xmin, xmax, 100)
# p = norm.pdf(x, mu, std)
# plt.plot(x, p, 'k', linewidth=2)
# title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
# plt.title(title)

# plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

training_data.search_query_length.plot(kind='hist', density=True)

mu, std = norm.fit(training_data.search_query_length)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

Let's examine whether the same behaviour can be spotted in the __testing__ dataset.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

testing_data.cos_dis_sqd.plot(kind='hist', density=True)

mu, std = norm.fit(testing_data.cos_dis_sqd)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

testing_data.cos_dis_sqt.plot(kind='hist', density=True)

mu, std = norm.fit(testing_data.cos_dis_sqt)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

testing_data.shared_words.plot(kind='hist', density=True)

mu, std = norm.fit(testing_data.shared_words)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
# %matplotlib inline
# import matplotlib.pyplot as plt
# from scipy.stats import norm  

# testing_data.edistance_sprot.plot(kind='hist', normed=True)

# mu, std = norm.fit(testing_data.edistance_sprot)

# xmin, xmax = plt.xlim()
# x = np.linspace(xmin, xmax, 100)
# p = norm.pdf(x, mu, std)
# plt.plot(x, p, 'k', linewidth=2)
# title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
# plt.title(title)

# plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm  

testing_data.search_query_length.plot(kind='hist', density=True)

mu, std = norm.fit(testing_data.search_query_length)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
training_data.shape

np.max([np.log(74067) / np.log(x) for x in training_data.search_query_length])


In [None]:
sns.pairplot(training_data)

In [None]:
# sns.pairplot(testing_data)
training_data.corr(numeric_only=True)

# 3. Let's start machine learning
first of all let's create training and test data sets


We are going to apply the following models:
1. RandomForestRegressor
2. LinearRegression
4. GradientBoostingRegressor 
5. BaggingRegressor
6. Chain model withing pipeline
7. XGBoost
8. CatBoost
9. Naive Baies
10. PolynomialFeatures for all previous algorithms


### Plan
We are going to do the following:
0. Define the pipeline.
1. Drop the non-numeric columns, because their information has already been transformed into numeric features.
2. Apply the models listed above, both inside a pipeline and outside one.
3. Train the models and compare their results on the __test__ dataset.
4. Write a summary of the results.
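
The comparison step of the plan could be sketched as a loop over a dictionary of models scored with cross-validation. Everything here is an assumption for illustration: the data is synthetic, the `models` dictionary holds only two of the listed regressors, and RMSE is assumed as the metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and a target kept inside the 1..3 relevance range
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.clip(2.0 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.3, size=300), 1, 3)

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=20, random_state=17),
}

# scikit-learn reports negated RMSE, so flip the sign to get a positive error
rmse_by_model = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=3,
                             scoring='neg_root_mean_squared_error')
    rmse_by_model[name] = -scores.mean()
    print(name, round(rmse_by_model[name], 4))
```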





In [None]:
# Drop the raw text and token columns; their information is already encoded in numeric features
df_training = training_data.drop(
    ['product_title', 'search_term', 'product_description',
     'product_title_tokens', 'product_description_tokens', 'search_term_tokens'],
    axis=1)

y_train = df_training['relevance'].values
X_train = df_training.drop(['id','relevance'],axis=1).values

In [None]:
df_training.head(3)

In [None]:
# Same column selection as for the training set, so the feature order matches X_train
X_test = testing_data.drop(
    ['id', 'product_title', 'search_term', 'product_description',
     'product_title_tokens', 'product_description_tokens', 'search_term_tokens'],
    axis=1).values

id_test = testing_data['id']
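
An alternative to listing every text column by hand is to select the numeric columns by dtype and reuse one column list for both sets, which keeps the train and test matrices aligned. This is only a sketch on toy frames; `toy_train`, `toy_test` and `feature_cols` are hypothetical names, and the real frames contain more columns.

```python
import pandas as pd

# Toy frames that mimic the layout of the real data (hypothetical values)
toy_train = pd.DataFrame({
    'id': [1, 2, 3],
    'search_term': ['aa battery', 'drill', 'shovel'],
    'shared_words': [2, 1, 0],
    'search_query_length': [2, 1, 1],
    'relevance': [3.0, 2.0, 1.0],
})
toy_test = toy_train.drop(columns=['relevance'])

# Keep numeric columns only, then remove the identifier and the target
feature_cols = (toy_train.select_dtypes(include='number')
                .columns.drop(['id', 'relevance']))

# Index both frames with the same list so column order is identical
X_toy_train = toy_train[feature_cols].values
X_toy_test = toy_test[feature_cols].values
print(list(feature_cols))
```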


## 3.1 RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 1, n_jobs = -1, random_state = 17, verbose = 1)
rfr.fit(X_train, y_train)

y_pred = rfr.predict(X_test)

pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)
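
Before writing a submission file, a local error estimate can be obtained from a hold-out split of the training data. The sketch below uses synthetic data and assumes RMSE as the metric; `X_demo`, `y_demo` and `rfr_demo` are hypothetical names, not objects from the cells above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and a target kept inside the 1..3 relevance range
rng = np.random.default_rng(17)
X_demo = rng.normal(size=(400, 5))
y_demo = np.clip(2.0 + 0.4 * X_demo[:, 0] + rng.normal(0, 0.3, size=400), 1, 3)

# Hold back 25% of the rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=17)

rfr_demo = RandomForestRegressor(n_estimators=50, random_state=17)
rfr_demo.fit(X_fit, y_fit)

# RMSE on rows the model has not seen during fitting
rmse_val = np.sqrt(mean_squared_error(y_val, rfr_demo.predict(X_val)))
print("hold-out RMSE: %.4f" % rmse_val)
```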



## 3.2 LinearRegression


In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression(n_jobs = -1)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)

## 3.3 GradientBoostingRegressor

In [None]:
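import sklearn.model_selection  # makes sklearn.model_selection.GridSearchCV available
from sklearn.ensemble import GradientBoostingRegressor

# Single-point grid; 'squared_error' and None are the current names for the
# former 'ls' loss and 'auto' max_features (assumes scikit-learn >= 1.0)
param_grid = {
                'loss' : ['squared_error'],
                'n_estimators' : [3],
                'max_depth' : [9],
                'max_features' : [None]
             }

gbr = GradientBoostingRegressor()

model_gbr = sklearn.model_selection.GridSearchCV(estimator = gbr, n_jobs = -1, param_grid = param_grid)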
model_gbr.fit(X_train, y_train)

y_pred = model_gbr.predict(X_test)

# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)
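
A grid with a single value per parameter only fits one configuration. As a hedged sketch of a search that actually compares candidates, the example below tries four combinations on synthetic data and reads back the best one; the grid values, `demo_grid` and `search` are illustrative assumptions, not the settings used above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.2, size=200)

# Two values per parameter give 2 x 2 = 4 candidate configurations
demo_grid = {'n_estimators': [10, 30], 'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid=demo_grid, cv=3,
                      scoring='neg_root_mean_squared_error')
search.fit(X_demo, y_demo)

# best_score_ is the negated RMSE of the best candidate
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 4))
```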

## 3.4 BaggingRegressor based on RandomForestRegressor

In [None]:
from sklearn.ensemble import BaggingRegressor
rf = RandomForestRegressor(max_depth = 20, max_features =  'sqrt', n_estimators = 3)
clf = BaggingRegressor(rf, n_estimators=3, max_samples=0.1, random_state=25)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)


## 3.5 Chain model within a pipeline


In [None]:
# Define the models that will be chained together into one bigger model that predicts the relevance score
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold

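# Define the standard scaler (StandardScaler ignores y, so only X_train is needed to fit it).
# The scaled arrays below are kept for inspection only: the pipeline refits the scaler itself.
scaler = StandardScaler()
scaler.fit(X_train)
scaled_train_data = scaler.transform(X_train)
scaled_test_data = scaler.transform(X_test)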


rf = RandomForestRegressor(n_estimators=4, max_depth=6, random_state=0)
clf = BaggingRegressor(rf, n_estimators=4, max_samples=0.1, random_state=25)


pipeline = Pipeline(steps = [('scaling', scaler), ('baggingregressor', clf)])
#end pipeline 
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)
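
A `Pipeline` fits each transformer on the data passed to `fit` and applies it again at prediction time, so scaling by hand beforehand is not required. The sketch below checks this equivalence on synthetic data. `Ridge` is swapped in for the bagged random forest purely to keep the comparison deterministic, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical)
rng = np.random.default_rng(0)
X_fit = rng.normal(5.0, 2.0, size=(100, 3))
y_fit = X_fit @ np.array([0.3, -0.2, 0.1]) + rng.normal(0, 0.1, size=100)
X_new = rng.normal(5.0, 2.0, size=(10, 3))

# Manual route: fit the scaler, transform, then fit the model
manual_scaler = StandardScaler().fit(X_fit)
manual_model = Ridge(alpha=1.0).fit(manual_scaler.transform(X_fit), y_fit)
manual_pred = manual_model.predict(manual_scaler.transform(X_new))

# Pipeline route: the same two steps handled by one object
pipe = Pipeline(steps=[('scaling', StandardScaler()), ('ridge', Ridge(alpha=1.0))])
pipe_pred = pipe.fit(X_fit, y_fit).predict(X_new)

same = np.allclose(manual_pred, pipe_pred)
print(same)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler matters more for the linear models in this notebook.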


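## 3.6 Naive Bayes

Note: the cell below fits `BayesianRidge`, a Bayesian linear regression, rather than a Naive Bayes model. scikit-learn's Naive Bayes estimators are classifiers, while relevance is a continuous target.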

In [None]:
from sklearn.linear_model import BayesianRidge

gnb = BayesianRidge()
param_grid = {}
model_nb = sklearn.model_selection.GridSearchCV(estimator = gnb, param_grid = param_grid, n_jobs = -1)
model_nb.fit(X_train, y_train)

y_pred = model_nb.predict(X_test)
# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)

## 3.7 XGBoost

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
param_grid = {'max_depth':[5, 6], 
              'n_estimators': [130, 150, 170], 
              'learning_rate' : [0.1]}
model_xgb = sklearn.model_selection.GridSearchCV(estimator = xgb, param_grid = param_grid, n_jobs = -1)
model_xgb.fit(X_train, y_train)

y_pred = model_xgb.predict(X_test)
# pd.DataFrame({"id": id_test, "relevance": y_pred}).to_csv('submission.csv',index=False)
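
Relevance is defined on the range 1 to 3, but a regressor can predict values outside it. One optional post-processing step, which is not part of the cells above and is shown here only as an assumption, is to clip predictions to that range before writing the submission. The `raw_pred` values are made up for illustration.

```python
import numpy as np

# Hypothetical raw predictions, two of them outside the valid range
raw_pred = np.array([0.7, 1.4, 2.9, 3.3])

# Limit every prediction to the valid relevance interval [1, 3]
clipped_pred = np.clip(raw_pred, 1.0, 3.0)
print(clipped_pred)
```

Clipping can only move an out-of-range prediction closer to any true label inside the range, so it never increases the squared error.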


# Results



|Regressor|Train|Kaggle|
|:----------------------|----------|----------|
|CatBoostRegressor|-|-|
|XGBRegressor|-|-|
|GradientBoostingRegressor|-|-|
|PolynomialFeatures on GradientBoostingRegressor|-|-|
|PolynomialFeatures on XGBRegressor|-|-|
|PolynomialFeatures on LinearRegression|-|-|
|PolynomialFeatures on BaggingRegressor on RandomForestRegressor|-|-|
|BaggingRegressor on RandomForestRegressor|0.53480|0.53427|
|Chaining together using Pipeline|0.53063|0.53100|
|BayesianRidge|-|-|
|LinearRegression|-|-|
|PolynomialFeatures on BayesianRidge|-|-|
|RandomForestRegressor|0.59063|0.58869|
