# Crowdflower Search Result Relevance

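* Description
  * Build an open-source model that can be used to measure the relevance of search results
  * Evaluate the accuracy of a search algorithm using search queries from e-commerce sites together with the returned products and their descriptions

* Evaluation
  * Scored by quadratic weighted kappa, which measures the agreement between two sets of ratings
  * Here it measures the agreement between the human relevance scores and the predicted relevance scores
  * Typically ranges from 0 (chance-level agreement between raters) to 1 (complete agreement between raters)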
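
As a rough illustration of the metric described above, here is a minimal pure-Python sketch of quadratic weighted kappa. The function name `quadratic_weighted_kappa` and its `min_rating` / `max_rating` parameters are assumptions made for this example (the 1 to 4 range follows the `median_relevance` scale); it is not the competition's official scorer.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=4):
    """Agreement between two integer rating lists, penalising large disagreements more."""
    n = max_rating - min_rating + 1
    total = len(y_true)

    # Observed confusion matrix: rows are true ratings, columns are predicted ratings
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected in this cell if the two raters were independent
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0:
        # Both lists contain one and the same rating everywhere
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4]))
print(quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1]))
```

The second call shows why the 0 to 1 range is only the typical case: systematic disagreement pushes the score below zero.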

* ### Data description

* train.csv - training data set
  - id : product ID
  - query : the search query used
  - product_title : product name
  - product_description : full product description, including HTML formatting tags
  - median_relevance : median relevance score given by 3 raters; an integer between 1 and 4
  - relevance_variance : variance of the relevance scores given by the raters

* test.csv - test data set
  - id : product ID
  - query : the search query used
  - product_title : product name
  - product_description : full product description, including HTML formatting tags

* sampleSubmission.csv - a sample submission file in the correct format
  - id : product ID
  - prediction : predicted value

# Training Goal

* Before modeling, carry out data preprocessing and build Bag of Words features

# Process

* ### Bag of Words
  * Processed separately for query, title, and description
  * Bag of Words features are built with TF-IDF

* ### Preprocessing
  1. Remove HTML tags : **BeautifulSoup**
  2. Convert values that are not of string type
  3. Split query / title / description and build the corpus
  4. Convert text to lowercase
  5. Use a regular expression so that only letters and digits remain
  6. Apply 'english' stopwords
  7. Stemming : **PorterStemmer**
  8. Bag of Words : **TF-IDF**
  9. Stack : create two versions, one using query / title / description and one using only query / title

In [5]:
import numpy as np
import pandas as pd
import sklearn as sk

import matplotlib as mpl
import matplotlib.pylab as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, roc_auc_score, accuracy_score

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from scipy.sparse import hstack

import re
import joblib
from joblib import dump, load
from bs4 import BeautifulSoup

pd.options.display.max_columns = 400
pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 600
pd.options.display.precision = 10

## Data load / Variables define

* train data / test data / submission data load

In [6]:
df_train = pd.read_excel("./__data/excel/train.xlsx").fillna("")
df_test = pd.read_excel("./__data/excel/test.xlsx").fillna("")

In [3]:
y = df_train['median_relevance']

In [14]:
df_train.head()

Unnamed: 0,id,query,product_title,product_description,median_relevance,relevance_variance
0,1,bridal shower decorations,Accent Pillow with Heart Design - Red/Black,"Red satin accent pillow embroidered with a heart in black thread. 8"" x 8"".",1,0.0
1,2,led christmas lights,Set of 10 Battery Operated Multi LED Train Christmas Lights - Clear Wire,"Set of 10 Battery Operated Train Christmas Lights Item #X124210 Features: Color: multi-color bulbs with matching train light covers / clear wire Multi-color consists of red, green, blue and yellow bulbs Number of bulbs on string: 10 Bulb size: micro LED Spacing between bulbs: 6 inches Lighted length: 4.5 feet Total length: 5.5 feet 12 inch lead cord Additional product features: LED lights use 90% less energy Cool to the touch If one bulb burns out, the rest will stay lit Lights are equipped with Lamp Lock feature, which makes them replaceable, interchangeable and keeps them from falling ou...",4,0.0
2,4,projector,ViewSonic Pro8200 DLP Multimedia Projector,,4,0.471
3,5,wine rack,"Concept Housewares WR-44526 Solid-Wood Ceiling/Wall-Mount Wine Rack, Charcoal Grey, 6 Bottle","Like a silent and sturdy tree, the Southern Enterprises Bird and Branch Coat Rack is an eye-catching addition to your home décor. This tree themed coat rack features strong branches with pinecone accents and a small bird perched at the top to give it a whimsical and welcoming appearance while still making it sturdy enough to hold your coats, hats, umbrellas and more. Whether it serves as a coat rack, a hat rack or a combination of the two, it’ll be a great space saver that gets appreciated for its graceful appearance.\nNumber of Hooks: 10\nFrame Material: Metal\nHardware Material: Metal\nD...",4,0.0
4,7,light bulb,Wintergreen Lighting Christmas LED Light Bulb (Pack of 25),"WTGR1011\nFeatures\nNickel base, 60,000 average hours, acrylic resin bulb material\nChristmas light bulb\nSteady dimmable replacement lamps\nNickel bases prevent corrosion in sockets\nWattage: 0.96 Watts\nVoltage: 130 Volts\nDimmable: Yes\nLight Source: LED\nBulb Shape Type: Candle\n\nColor Amber\nBulb Color: Amber",2,0.471


In [13]:
df_test['product_description'][8]

'<ul>\n\t\t<li>\n\t\t\tEnglish \n\t\t\t\t</li>\n    \t<li>\n    \t\t    \n    \t</li>\n    \t<li>\n    \t\t\t \n    \t\t\t \n    \t\t</li>\n    \t</ul>\n\n    \n\t\t\tThis translation tool is for your convenience only. The accuracy and accessibility of the resulting translation is not guaranteed.\n\t\n\t\n\n\n\n\t\t\n\t\t\n\t\t\t\t\t\t\t<ul>\n\t\t\t\t\t\t\t\t<li>\n\t\t\t\t\t\t\t\t\t\tEnglishEnglish\n\t\t\t\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t\t\t\t<li>\n\t\t\t\t\t\t\t碼?晩邈磨?馬Arabic\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\t訝?뻼竊덄?鵝볩펹Chinese (Simplified)\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\t訝?뻼竊덄퉩鵝볩펹Chinese (Traditional)\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\t훻eskyCzech\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\tNederlandsDutch\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\tSuomiFinnish\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\t?貫貫管館菅觀郭Greek\n\t\t\t\t\t\t\t\t</li>\n\t\t\t\t\t<li>\n\t\t\t\t\t\t\t鬧?淚?瘻Hebrew\n\t\t\t\t\t\t

* Remove HTML tags

In [7]:
%%time
f_names = ['query', 'product_title', 'product_description']
for feat in f_names:
    for num in range(len(df_train)):
        # Write through .loc; chained indexing (df_train[feat][num] = ...) assigns to a
        # possible copy and raises SettingWithCopyWarning
        df_train.loc[num, feat] = BeautifulSoup(df_train.loc[num, feat], "lxml").text


Wall time: 18min 10s


In [80]:
%%time
f_names = ['query', 'product_title', 'product_description']
for feat in f_names:
    for num in range(len(df_test)):
        # Write through .loc; chained indexing (df_test[feat][num] = ...) assigns to a
        # possible copy and raises SettingWithCopyWarning
        df_test.loc[num, feat] = BeautifulSoup(df_test.loc[num, feat], "lxml").text


Wall time: 42min 13s
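
The two cells above visit every cell of the frame one at a time. A column-wise alternative can be sketched as follows; this is only an illustration under stated assumptions. `strip_tags` and `strip_html_columns` are hypothetical helpers, and the standard-library `HTMLParser` is used here as a stand-in so the example runs without extra packages. No timing is claimed for it.

```python
from html.parser import HTMLParser

import pandas as pd


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (stdlib stand-in)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(str(markup))
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def strip_html_columns(df, columns, strip=strip_tags):
    """Return a copy of `df` with `strip` applied to every value in `columns`."""
    out = df.copy()
    for col in columns:
        # One assignment per column instead of one per cell
        out[col] = out[col].map(strip)
    return out


# Toy frame for illustration only (not rows from the data set)
demo = pd.DataFrame({
    "query": ["wine rack"],
    "product_title": ["<b>Wine</b> Rack"],
    "product_description": ["<ul><li>Holds 6 bottles</li></ul>"],
})
cleaned = strip_html_columns(demo, ["query", "product_title", "product_description"])
print(cleaned.loc[0, "product_title"])
print(cleaned.loc[0, "product_description"])
```

To keep the notebook's parser, the `strip` argument could be set to `lambda s: BeautifulSoup(s, "lxml").text`.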


* Convert values that are not strings

In [4]:
def str_check(d_train, d_test):
    f_names = ['query', 'product_title', 'product_description']
    frame_list = [d_train, d_test]
    frame_name = ['df_train', 'df_test']
    for frame in range(len(frame_list)):
        for feat in f_names:
            for num in range(len(frame_list[frame])):
                if type(frame_list[frame][feat][num]) != str:
                    print(frame_name[frame], '/', feat, '/', 'index :', num, '/', 'value :', frame_list[frame][feat][num])
                    
str_check(df_train, df_test)

df_test / product_title / index : 13493 / value : 21.5
df_test / product_title / index : 14627 / value : 28
df_test / product_title / index : 16065 / value : 60
df_test / product_title / index : 16987 / value : 7.75
df_test / product_title / index : 17460 / value : 12.5
df_test / product_title / index : 19751 / value : 41.5


In [5]:
def convert_str(d_train, d_test):
    f_names = ['query', 'product_title', 'product_description']
    for frame in (d_train, d_test):
        for feat in f_names:
            # Assign the whole column at once to avoid chained-indexing writes;
            # values that are already str are left unchanged by str()
            frame[feat] = frame[feat].map(str)

convert_str(df_train, df_test)


In [83]:
df_train.to_csv('./df_train.csv', index=False)

In [84]:
df_test.to_csv('./df_test.csv', index=False)

In [2]:
df_train = pd.read_excel("./__data/excel/df_train.xlsx").fillna("")
df_test = pd.read_excel("./__data/excel/df_test.xlsx").fillna("")

## Split query / title / description

In [15]:
train_q = list(df_train['query'])
train_t = list(df_train['product_title'])
train_d = list(df_train['product_description'])

test_q = list(df_test['query'])
test_t = list(df_test['product_title'])
test_d = list(df_test['product_description'])

In [16]:
train_q

['bridal shower decorations',
 'led christmas lights',
 'projector',
 'wine rack',
 'light bulb',
 'oakley polarized radar',
 'boyfriend jeans',
 'screen protector samsung',
 'pots and pans set',
 'waffle maker',
 'oakley radar',
 'workout clothes for women',
 'decorative pillows',
 'wall clocks',
 'cuisinart coffee maker',
 'thomas the train',
 'silver necklace',
 'bluray hobbit extended',
 'cat grass',
 'soda stream',
 'microwave',
 'aqua shoes',
 'leather mens briefcase',
 'girls halloween costumes',
 'knife victorinox',
 'micro usb to hdmi',
 'zippo',
 'sleeping bags',
 'routers',
 'hello kitty',
 'golf clubs',
 'converse low top',
 'memory foam pillow',
 'kitchen faucet',
 'hollister polo',
 'dc shoes black',
 'donut shoppe k cups',
 'hollister polo',
 'eye cream',
 'plantronics corded headset',
 'sleeping bags',
 'ps3 wireless controller',
 'bluesky gel nail polish',
 'yankee candle',
 'projector',
 'macbook case 13 case',
 'girls halloween costumes',
 'iphone 5',
 'vanity fair b

## Build the corpus from train data and test data

In [8]:
df_corpus = pd.concat([df_train, df_test], axis=0)

In [9]:
df_corpus = df_corpus.drop('median_relevance', axis = 1)
df_corpus = df_corpus.drop('relevance_variance', axis = 1)

In [10]:
%%time
corpusdata = list(df_corpus.apply(lambda x:'%s %s %s' % (x['query'], x['product_title'], x['product_description']), axis=1))

Wall time: 1.23 s


## Convert all text data to lowercase

In [11]:
def lower_convert(data):
    for num in range(len(data)):
        data[num] = data[num].lower()

In [12]:
lower_convert(train_q)
lower_convert(train_t)
lower_convert(train_d)
lower_convert(test_q)
lower_convert(test_t)
lower_convert(test_d)
lower_convert(corpusdata)

## Apply a regular expression to keep only alphanumeric tokens of two or more characters

In [13]:
def alphabet_stopwords(data):
    # Replace each string, in place, with its list of tokens made of two or more
    # ASCII letters/digits; punctuation, underscores and single characters are dropped
    for num in range(len(data)):
        data[num] = re.findall(r'[a-zA-Z0-9]{2,}', data[num])

In [14]:
%%time
alphabet_stopwords(train_q)
alphabet_stopwords(train_t)
alphabet_stopwords(train_d)
alphabet_stopwords(test_q)
alphabet_stopwords(test_t)
alphabet_stopwords(test_d)
alphabet_stopwords(corpusdata)

Wall time: 7.33 s
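
To make the rule in this section's heading concrete, here is a small sketch on a single string. `keep_alnum_tokens` is a hypothetical helper written for this example; it folds the lowercasing step and the regular expression into one call, which is an assumption of the sketch rather than how the notebook organises the steps.

```python
import re


def keep_alnum_tokens(text):
    """Lowercase `text` and return its runs of two or more ASCII letters/digits."""
    return re.findall(r'[a-z0-9]{2,}', text.lower())


# Sample strings adapted from the df_train.head() output shown earlier
print(keep_alnum_tokens('Red satin accent pillow. 8" x 8".'))
print(keep_alnum_tokens('Concept Housewares WR-44526 Wine Rack, 6 Bottle'))
```

Single-character tokens such as `8`, `x` and `6` are dropped, and a hyphenated code like `WR-44526` is split into two tokens.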


## Stopwords (english)

In [15]:
def eng_stopwords(data):
    # A set makes each membership test O(1) instead of a scan over the stopword list
    stop = set(stopwords.words('english'))

    # Build a new list of token lists; the input lists are left untouched
    return [[word for word in tokens if word not in stop] for tokens in data]

In [16]:
%%time
train_q = eng_stopwords(train_q)
train_t = eng_stopwords(train_t)
train_d = eng_stopwords(train_d)
test_q = eng_stopwords(test_q)
test_t = eng_stopwords(test_t)
test_d = eng_stopwords(test_d)
corpus_data = eng_stopwords(corpusdata)

Wall time: 14.3 s


## Stemming

In [52]:
def stemPorter(text):
    porter = PorterStemmer()
    stem_data = []
    for num in text:
        final_stem = []
        for word in num:
            final_stem.append(porter.stem(word))
        stem_data.append(final_stem)
    return stem_data

In [53]:
%%time
train_q = stemPorter(train_q)
train_t = stemPorter(train_t)
train_d = stemPorter(train_d)
test_q = stemPorter(test_q)
test_t = stemPorter(test_t)
test_d = stemPorter(test_d)

Wall time: 56.7 s


In [54]:
%%time
corpus_data = stemPorter(corpus_data)

Wall time: 54.8 s


## Reshape the data for Bag of Words (word list to sentence)

In [19]:
def data_join(list_data):
    for num in range(len(list_data)):
        list_data[num] = (" ").join(list_data[num])

In [20]:
data_join(train_q)
data_join(train_t)
data_join(train_d)
data_join(test_q)
data_join(test_t)
data_join(test_d)
data_join(corpus_data)

## TF-IDF

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
tfv = TfidfVectorizer(analyzer='word', token_pattern=r'\w+', ngram_range=(1, 3), stop_words = 'english')

In [24]:
%%time
tfv.fit(corpus_data)

Wall time: 15 s


TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='\\w+', tokenizer=None, use_idf=True,
        vocabulary=None)

In [25]:
%%time
train_Q = tfv.transform(train_q)
train_T = tfv.transform(train_t)
train_D = tfv.transform(train_d)
test_Q = tfv.transform(test_q)
test_T = tfv.transform(test_t)
test_D = tfv.transform(test_d)

Wall time: 9.93 s


In [None]:
all_train_X = hstack((train_Q, train_T, train_D))
tit_train_X = hstack((train_Q, train_T))
all_test_X = hstack((test_Q, test_T, test_D))
tit_test_X = hstack((test_Q, test_T))
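
The fit-once, transform-per-field, then stack pattern used above can be checked on toy data. The strings below are made up for illustration and are not rows from the data set; the sketch only shows how the column counts relate, assuming the same `TfidfVectorizer` settings apart from `stop_words`.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-preprocessed strings (illustrative only)
queries = ["wine rack", "led christma light"]
titles = ["wood wall mount wine rack", "batteri led christma light clear wire"]
descriptions = ["solid wood rack hold six bottl", "set ten batteri oper light"]

# Fit one vocabulary on the combined text so every field maps to the same columns
corpus = [" ".join(parts) for parts in zip(queries, titles, descriptions)]
tfv = TfidfVectorizer(analyzer="word", token_pattern=r"\w+", ngram_range=(1, 3))
tfv.fit(corpus)

Q = tfv.transform(queries)
T = tfv.transform(titles)
D = tfv.transform(descriptions)

# Side-by-side stacking: each field keeps its own block of columns
all_X = hstack((Q, T, D))
tit_X = hstack((Q, T))

n_terms = len(tfv.vocabulary_)
print(n_terms, Q.shape, all_X.shape, tit_X.shape)
```

Because each field keeps its own block, the same term in the query and in the title ends up in two different columns, so a downstream model can weight them separately.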

In [None]:
joblib.dump(all_train_X, 'all_train_X.pkl')

In [None]:
joblib.dump(tit_train_X, 'tit_train_X.pkl')

In [None]:
joblib.dump(all_test_X, 'all_test_X.pkl')

In [None]:
joblib.dump(tit_test_X, 'tit_test_X.pkl')
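
For completeness, a saved matrix can be read back with `joblib.load`. The sketch below round-trips a tiny stand-in matrix through a temporary directory; the file name `demo_X.pkl` is hypothetical and chosen only for this example.

```python
import os
import tempfile

import joblib
from scipy.sparse import csr_matrix

# Stand-in for a stacked TF-IDF matrix
X = csr_matrix([[0.0, 0.5], [0.7, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_X.pkl")  # hypothetical file name
    joblib.dump(X, path)
    restored = joblib.load(path)

# Number of entries that differ between the original and the restored matrix
n_diff = (restored != X).nnz
print(restored.shape, n_diff)
```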