### TagRank for product data
Inspired by TextRank, TagRank is a graph-based method for keyword extraction.
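
As a rough illustration of the graph-based idea (not the TagRank implementation used in this notebook), the sketch below links tokens that co-occur within a small window and scores them with a plain power-iteration PageRank. The function names `cooccurrence_graph` and `pagerank`, the toy titles, and the damping factor of 0.85 are assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_graph(token_lists, window=3):
    """Build an undirected weighted graph linking tokens that fall in the same window."""
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in token_lists:
        for i, w in enumerate(tokens):
            # Link w to the tokens that follow it inside the window.
            for v in tokens[i + 1:i + window]:
                if v != w:
                    graph[w][v] += 1.0
                    graph[v][w] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank computed by simple power iteration."""
    nodes = list(graph)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Each neighbour m passes on a share of its score proportional to the edge weight.
            incoming = sum(scores[m] * graph[m][n] / out_weight[m] for m in graph[n])
            new_scores[n] = (1.0 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


# Toy, hand-written token lists (not taken from data.csv).
toy_titles = [
    ["運動", "短褲", "現貨"],
    ["運動", "套裝", "短褲"],
    ["手機", "殼", "現貨"],
]
toy_scores = pagerank(cooccurrence_graph(toy_titles))
for word, score in sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 4))
```

Tokens that co-occur with many other tokens end up with higher scores, which is the property TagRank relies on when it ranks the tokens of a product title.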

#### Preprocess
- Download "Product Keyword Extraction - Data" and save it as "data.csv"

In [1]:
import numpy as np
import pandas as pd
from preprocessing.preprocess import preprocess, tokenize
import networkx as nx
import operator

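# load the log; the file is tab-separated despite its .csv extension
product_log = pd.read_csv("data.csv", sep="\t", names=["product_name", "category", "query", "event", "date"])

# compare the row count before and after removing exact duplicates
print product_log.shape, product_log.drop_duplicates().shape
product_log = product_log.drop_duplicates()
# drop the header row, which is read as data because `names` is given
product_log = product_log[product_log.event != "Event"]
# drop rows whose product name is a spreadsheet error value
product_log = product_log[(product_log.product_name != "#NAME?") & (product_log.product_name != "#ERROR!")]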
product_log.head(5)

(10317, 5) (10310, 5)


| | product_name | category | query | event | date |
|---|---|---|---|---|---|
| 0 | --- X 10 --- 七色 多層次搭配 圓下擺 LAYERED 素面 無袖背心 打底 | Male Fashion | 無袖 | Impression | 31/7/17 |
| 1 | ︱IBIT︱Gymshark 熱銷款 運動T恤 健身T恤 圓領短T 運動短T 健身鯊魚 | Male Fashion | gymshark | Impression | 31/7/17 |
| 2 | ︱IBIT︱Gymshark 超高彈性 短褲 運動短褲 跑步短褲 深蹲褲 訓練短褲 | Male Fashion | gymshark | Impression | 31/7/17 |
| 3 | ::另類情侶兄弟姊妹殼::電力滿格/不足黑白趣味浮雕手機軟殼i5/i5s/i5se/i6/i... | Mobile & Gadgets | 軟殼 | Click | 31/7/17 |
| 4 | ：新舊手機商場：Iphone6 16金 （需要看細圖密我） | Mobile & Gadgets | iphone6 系列 | Impression | 30/7/17 |


In [2]:
# preprocess the data
product_log["query"] = product_log["query"].map(preprocess)
product_log.product_name = product_log.product_name.map(preprocess)

# tokenize
%time titles = product_log.product_name.map(lambda x: list(tokenize(x)))
%time keywords = product_log['query'].map(lambda x: list(tokenize(x)))
product_log["query_tokens"] = keywords
product_log["title_tokens"] = titles

Building prefix dict from /Users/rren/workspace/x/keyword_extraction/virtenv/lib/python2.7/site-packages/jieba/dict.txt ...
Loading model from cache /var/folders/dg/j9x86cjs0_v8qks12vvzcr400000gp/T/jieba.cache
Loading model cost 0.395578145981 seconds.
Prefix dict has been built succesfully.


CPU times: user 12min 38s, sys: 2.31 s, total: 12min 40s
Wall time: 12min 42s
CPU times: user 1min 27s, sys: 367 ms, total: 1min 28s
Wall time: 1min 28s


In [3]:
product_log[["event","query_tokens","title_tokens"]]

| | event | query_tokens | title_tokens |
|---|---|---|---|
| 0 | Impression | [無袖] | [七色, 搭配, 圓下, 擺, layered, 素面, 無袖, 背心, 打底] |
| 1 | Impression | [gymshark] | [ibit, gymshark, 熱銷款, 運動, t, 恤, 健身, t, 恤, 圓領, ... |
| 2 | Impression | [gymshark] | [ibit, gymshark, 超高, 彈性, 褲, 運動, 短褲, 跑步, 褲, 蹲, ... |
| 3 | Click | [軟殼] | [情侶, 兄弟, 姊妹, 殼, 電力, 滿格, 趣味, 浮雕, 手機, 軟殼, i5i5si... |
| 4 | Impression | [iphone6] | [手機, 商場, iphone6, 需要, 看, 圖密] |
| 5 | Impression | [iphone] | [手機, 商場, iphone7, g, 需要, 圖密] |
| 6 | Click | [iphone7] | [手機, 商場, iphone7, g, 需要, 圖密] |
| 7 | Impression | [男生, 衣著] | [免運, 現貨, gildan, 典素, t, 買, 會] |
| 8 | Impression | [t, 恤] | [免運, 現貨, gildan, 典素, t, 買, 會] |
| 9 | Click | [休閒, 短褲] | [免運, 現貨, 大量, 現貨, 帶, 休, 閒, 色, m5xl] |


In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix

In [190]:
class TagRank(object):
    def __init__(self,window=3):
        self.window = window
        self.graph = None
        
    def _vectorize(self,tokens):
        """
        Vectorize a list of tokens
        """
        data = []
        row = []
        col = []
        for w in tokens:
            data.append(1.)
            row.append(self.vocab[w])
            col.append(0)
        return csr_matrix((data,(row,col)),shape=(len(self.vocab),1))
    
    def _query_title_vec(self,query,title,event):
        """
        Find the cooccurrence matrix of query and title
        """
        data = []
        row = []
        col = []
        # an impression gets weight 2; any other event (e.g. a click) gets weight 3
        if event == "Impression":
            val = 2.
        else:
            val = 3.
            
        for wt in title:
            for wq in query:
                data.append(val)
                row.append(self.vocab[wt])
                col.append(self.vocab[wq])
                
                data.append(val)
                row.append(self.vocab[wq])
                col.append(self.vocab[wt])
        return csr_matrix((data,(row,col)),shape=(len(self.vocab),len(self.vocab)))
        
        
        
    def _construct_sparse_matrix(self,events,queries,titles):
        # construct vocabulary
        vocab = set([w for sentence in (queries + titles) for w in sentence])
        vocab = {w:i for (w,i) in zip(vocab,range(len(vocab)))}
        self.vocab = vocab
        self.i2w = {i:w for w,i in vocab.iteritems()}
        self.coo_matrix = csr_matrix(([],([],[])),shape=(len(self.vocab), len(self.vocab)))
        
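        # accumulate the weighted co-occurrence matrix over all log records
        for e,q,t in zip(events,queries,titles):
            # title-title co-occurrence (outer product of the count vector), scaled by 1/|V|
            vec = self._vectorize(t)
            self.coo_matrix += (vec * vec.T) / len(self.vocab)
            # query-title co-occurrence, weighted by event type
            self.coo_matrix += self._query_title_vec(q,t,e)
        # average over the number of records
        self.coo_matrix /= len(titles)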
        
        
    def build_graph(self,events,queries,titles):
        self._construct_sparse_matrix(events,queries,titles)
        self.graph = nx.from_scipy_sparse_matrix(self.coo_matrix)
        self.scores = nx.pagerank(self.graph)
        # return the scores so that callers assigning the result get them
        return self.scores
        
    def get_ranking(self,tokens,topK=5):
        rankings = {i:0. for i in range(len(self.vocab))}
        for w in tokens:
            i = self.vocab[w]
            rankings[i] += self.scores[i]
#             for n in self.graph.neighbors(i):
#                 rankings[n] += self.scores[n]
        rankings = sorted(rankings.items(), key=operator.itemgetter(1),reverse=True)
        results = [(self.i2w[i],score) for i,score in rankings[:topK]]
        return results
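
To make the edge weighting easier to follow, here is a dictionary-based sketch of what `_construct_sparse_matrix` accumulates, run on a tiny hand-written log. It is only an illustration under the assumption that it mirrors the sparse-matrix code above; the function name `edge_weights` and the toy records are not part of the notebook.

```python
from collections import Counter, defaultdict


def edge_weights(events, queries, titles):
    """Dict-based version of the weighted co-occurrence matrix (illustrative only)."""
    vocab = {w for tokens in list(queries) + list(titles) for w in tokens}
    weights = defaultdict(float)
    for event, query, title in zip(events, queries, titles):
        # Title-title co-occurrence: product of token counts, scaled by 1/|V|.
        counts = Counter(title)
        for a, ca in counts.items():
            for b, cb in counts.items():
                weights[(a, b)] += ca * cb / len(vocab)
        # Query-title links: 2 for an impression, 3 for any other event (e.g. a click).
        val = 2.0 if event == "Impression" else 3.0
        for wt in title:
            for wq in query:
                weights[(wt, wq)] += val
                weights[(wq, wt)] += val
    # Average over the number of log records.
    return {pair: w / len(titles) for pair, w in weights.items()}


# Toy records written for this example (not taken from data.csv).
toy_events = ["Impression", "Click"]
toy_queries = [["gymshark"], ["短褲"]]
toy_titles = [["gymshark", "運動", "短褲"], ["運動", "短褲"]]
toy_weights = edge_weights(toy_events, toy_queries, toy_titles)
print(toy_weights[("運動", "短褲")])
```

Because the query-title weights (2 or 3) are much larger than the title-title weight (1/|V|), tokens that users actually searched for dominate the graph, and clicked queries count more than impressions.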

In [191]:
tagrank = TagRank()

In [192]:
%time scores = tagrank.build_graph(product_log.event, product_log.query_tokens, product_log.title_tokens)

CPU times: user 1min 11s, sys: 11.6 s, total: 1min 23s
Wall time: 1min 22s


In [193]:
def print_ranking(i):
    print 
    print product_log.product_name.iloc[i]
    print "-------------------------------"
    rankings = tagrank.get_ranking(product_log.title_tokens.iloc[i])
    for w,score in rankings:
        print w,score

In [196]:
for i in np.random.choice(range(product_log.shape[0]),replace=False,size=10):
    print_ranking(i)


[現貨]oppor11r9sr9sp夏日新款浮雕水果檸檬指環手繩手機殼r9r9pa59a57全包防摔保護殼
-------------------------------
殼 0.0332699566407
保護 0.00591856076572
現貨 0.00589649486818
機殼 0.00267229303267
新款 0.00238265920783

💖xl4xl💖大尺碼大碼兩件式顯瘦休閒套裝運動套裝短褲
-------------------------------
運動 0.0084891698556
尺碼 0.00610662265001
套裝 0.00558173682432
短褲 0.00284072547758
休閒 0.00241370575427

最火爆的聯名款 bape 海賊王 one piece 喬巴 a bathing ape
-------------------------------
bape 0.000303780662746
ape 0.00020032891443
one 0.000159068689068
海賊 6.54835519533e-05
名款 5.85434367496e-05

mryuu夏季男女生寬鬆直筒情侣款運動休閒褲9分褲子歐美長褲男裝棉褲七分褲棉褲adidasnike
-------------------------------
褲 0.0444642511859
運動 0.0084891698556
休閒 0.00241370575427
褲子 0.000998138452847
男裝 0.000728534667536

samsung 原廠2a旅充 充電器傳輸線 有測試圖 note2 a7 j7 s6 s4 三星原廠充電器
-------------------------------
samsung 0.00311945051661
j7 0.00286481614662
原廠 0.00187837044103
線 0.000945837641454
a7 0.000927999783926

轉賣吊牌未拆全新只有試穿領口微被化妝品沾到現貨l尺寸尺寸偏小建議m號體型穿著韓標韓國喜宴穿搭白色蕾絲優雅風格瘦版體型
-------------------------------
現貨 0

### TO DO
- Use TF-IDF to lower the importance of frequently repeated words, such as "現貨"
- Make the graph directed (edges from product_name tokens to query tokens)
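
One possible way to approach the first item is to rescale each token's score by its inverse document frequency, so that tokens appearing in almost every title are pushed down. The sketch below is an assumption about how this could be done, not something implemented in the notebook: the helper names `inverse_document_frequency` and `rerank_with_idf`, the smoothed IDF formula, and the toy titles and scores are all made up for illustration.

```python
import math
from collections import Counter


def inverse_document_frequency(token_lists):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, with df = number of titles containing the token."""
    n_docs = len(token_lists)
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    return {w: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0 for w, df in doc_freq.items()}


def rerank_with_idf(ranking, idf):
    """Multiply each (token, score) pair by the token's IDF and sort again."""
    rescored = [(w, score * idf.get(w, 1.0)) for w, score in ranking]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy titles and scores invented for this example.
toy_titles = [["現貨", "手機", "殼"], ["現貨", "運動", "短褲"], ["現貨", "套裝"]]
toy_ranking = [("現貨", 0.006), ("殼", 0.005)]

idf = inverse_document_frequency(toy_titles)
reranked = rerank_with_idf(toy_ranking, idf)
for word, score in reranked:
    print(word, score)
```

With this rescaling, a token such as "現貨" that occurs in every toy title keeps its original score, while rarer tokens are boosted and can overtake it.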