## [Kaggle Clone Coding] Mercari Price Suggestion Challenge
- [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge/overview)
- [Mercari Interactive EDA + Topic Modelling](https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling/notebook)
  
- Task: product price recommendation / suggestion
---

### Downloading the data to Colab via the Kaggle API

In [1]:
!pip install kaggle
from google.colab import files
files.upload()



Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"yoonj98","key":"<redacted>"}'}

In [7]:
import os 
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/Kaggle/kaggle/'
os.chdir('/content/drive/MyDrive/Kaggle/kaggle/')
os.getcwd()

'/content/drive/MyDrive/Kaggle/kaggle'

In [8]:
# Copy the API command from the Data tab of the competition you want to download
! kaggle competitions download -c mercari-price-suggestion-challenge

Downloading test.tsv.7z to /content/drive/MyDrive/Kaggle/kaggle
 94% 32.0M/34.0M [00:00<00:00, 86.4MB/s]
100% 34.0M/34.0M [00:00<00:00, 85.9MB/s]
Downloading test_stg2.tsv.zip to /content/drive/MyDrive/Kaggle/kaggle
 99% 291M/294M [00:02<00:00, 105MB/s]
100% 294M/294M [00:02<00:00, 117MB/s]
Downloading train.tsv.7z to /content/drive/My Drive/Kaggle/kaggle
100% 74.3M/74.3M [00:00<00:00, 105MB/s]

Downloading sample_submission_stg2.csv.zip to /content/drive/My Drive/Kaggle/kaggle
  0% 0.00/7.77M [00:00<?, ?B/s]
100% 7.77M/7.77M [00:00<00:00, 71.2MB/s]
Downloading sample_submission.csv.7z to /content/drive/My Drive/Kaggle/kaggle
  0% 0.00/170k [00:00<?, ?B/s]
100% 170k/170k [00:00<00:00, 23.9MB/s]


In [9]:
# Check the downloaded files
!ls

kaggle.json		  sample_submission_stg2.csv.zip  test.tsv.7z
sample_submission.csv.7z  test_stg2.tsv.zip		  train.tsv.7z


In [10]:
# Extract the archives
!p7zip -d test.tsv.7z
!p7zip -d train.tsv.7z


7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 35617013 bytes (34 MiB)

Extracting archive: test.tsv.7z
--
Path = test.tsv.7z
Type = 7z
Physical Size = 35617013
Headers Size = 122
Method = LZMA2:24
Solid = -
Blocks = 1


### Code
1. Exploratory Data Analysis
2. Text Processing  
  2.1. Tokenizing and tf-idf algorithm  
  2.2. K-means Clustering  
  2.3. Latent Dirichlet Allocation (LDA) / Topic Modelling

In [24]:
pip install plotly



In [27]:
import numpy as np
import pandas as pd

In [13]:
train = pd.read_csv('/content/drive/MyDrive/Kaggle/kaggle/train.tsv', sep='\t')
test = pd.read_csv('/content/drive/MyDrive/Kaggle/kaggle/test.tsv', sep='\t')

print(train.shape)
print(test.shape)

(1482535, 8)
(693359, 7)


In [17]:
train.head()

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | | 44.0 | 0 | Complete with certificate of authenticity |


In [14]:
train.dtypes

train_id               int64
name                  object
item_condition_id      int64
category_name         object
brand_name            object
price                float64
shipping               int64
item_description      object
dtype: object

#### Target Variable
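- The price to suggest to marketplace sellers.
- The median price over all items is $17 (the mean is about $26.74), but the distribution is heavily right-skewed: most prices are low, with a long tail reaching $2,009. A log transform is therefore applied, adding 1 to each value first so that a price of 0 does not diverge to negative infinity.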
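
The log transform described above can be sketched in plain Python. This is only an illustration, not the notebook's own code: the `prices` list is made-up sample data, and `math.log1p` / `math.expm1` stand in for the NumPy equivalents (`np.log1p`, `np.expm1`) that one would typically apply to `train['price']`.

```python
import math

# Hypothetical sample prices, including a zero like the minimum in the data.
prices = [0.0, 10.0, 17.0, 29.0, 2009.0]

# log(price + 1): a zero price maps to 0 instead of negative infinity.
log_prices = [math.log1p(p) for p in prices]

# expm1 inverts the transform, e.g. to turn predictions back into dollars.
restored = [math.expm1(v) for v in log_prices]

print(log_prices)
print(restored)
```

Keeping the inverse next to the forward transform makes it harder to forget to convert predictions back to the original price scale.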

In [18]:
train.price.describe()

count    1.482535e+06
mean     2.673752e+01
std      3.858607e+01
min      0.000000e+00
25%      1.000000e+01
50%      1.700000e+01
75%      2.900000e+01
max      2.009000e+03
Name: price, dtype: float64

#### Shipping
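- Shipping fees are split between sellers and buyers. The mean of `shipping` below is about 0.447, so under the competition's encoding (1 = seller pays, 0 = buyer pays) sellers cover shipping for roughly 45% of items and buyers for the remaining 55%.
- The reference notebook reports that buyers who have to pay for shipping pay a lower average price than buyers who do not pay extra for shipping.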
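
One way to check the claim about average prices is to group prices by the shipping flag. The sketch below is plain Python over the five `(shipping, price)` pairs visible in the `train.head()` output above, which is far too few rows to conclude anything. On the full data the equivalent would be something like `train.groupby('shipping')['price'].mean()`. It assumes the encoding `shipping = 1` for seller-paid and `0` for buyer-paid.

```python
from collections import defaultdict
from statistics import mean

# (shipping, price) pairs copied from the five rows of train.head() above.
rows = [(1, 10.0), (0, 52.0), (1, 10.0), (1, 35.0), (0, 44.0)]

# Collect prices per shipping flag.
prices_by_flag = defaultdict(list)
for shipping, price in rows:
    prices_by_flag[shipping].append(price)

# Mean price per flag, and the share of items whose shipping the seller pays.
mean_price = {flag: mean(vals) for flag, vals in prices_by_flag.items()}
seller_pays_share = sum(flag for flag, _ in rows) / len(rows)

print(mean_price)
print(seller_pays_share)
```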

In [19]:
train.shipping.describe()

count    1.482535e+06
mean     4.472744e-01
std      4.972124e-01
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      1.000000e+00
max      1.000000e+00
Name: shipping, dtype: float64

#### Item Category
- A category usually consists of three levels joined by `/`, so it is split into three columns.

In [20]:
train['category_name'].value_counts()[:5]

Women/Athletic Apparel/Pants, Tights, Leggings    60177
Women/Tops & Blouses/T-Shirts                     46380
Beauty/Makeup/Face                                34335
Beauty/Makeup/Lips                                29910
Electronics/Video Games & Consoles/Games          26557
Name: category_name, dtype: int64

In [21]:
def split_cat(text):
    # Split "A/B/C" into its levels; missing values (NaN floats) have no .split().
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")
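
If a category string has more than three levels, `split("/")` returns more than three parts. The `zip(*...)` unpacking in the next cell then silently keeps only the first three. The sketch below illustrates this. The five-level string is a made-up example, and `None` stands in for a missing value.

```python
def split_cat(text):
    # Same idea as the cell above: missing values fall back to "No Label".
    try:
        return text.split("/")
    except AttributeError:
        return ("No Label", "No Label", "No Label")

# The second string is a hypothetical five-level category.
samples = [
    "Men/Tops/T-shirts",
    "Electronics/Tablets/iPad/Tablet/eBook Readers",
    None,
]

# zip(*rows) stops at the shortest row, so exactly three columns come out.
general_cat, subcat_1, subcat_2 = zip(*[split_cat(s) for s in samples])

print(general_cat)
print(subcat_1)
print(subcat_2)
```

Any levels beyond the third are dropped rather than merged into `subcat_2`, which is worth knowing when interpreting the sub-category counts.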

In [22]:
train['general_cat'], train['subcat_1'], train['subcat_2'] = \
zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | general_cat | subcat_1 | subcat_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | | 10.0 | 1 | No description yet | Men | Tops | T-shirts |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... | Electronics | Computers & Tablets | Components & Parts |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... | Women | Tops & Blouses | Blouse |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... | Home | Home Décor | Home Décor Accents |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | | 44.0 | 0 | Complete with certificate of authenticity | Women | Jewelry | Necklaces |


In [23]:
print("There are %d unique first sub-categories." % train['subcat_1'].nunique())
print("There are %d unique second sub-categories." % train['subcat_2'].nunique())

There are 114 unique first sub-categories.
There are 871 unique second sub-categories.


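There are 7 main categories (with 114 first-level and 871 second-level sub-categories). The two most popular, women's and beauty items, account for more than 50% of the observations, followed by kids and electronics.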

In [29]:
x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]

In [None]:
# Skipped because of a plotly library error

# trace1 = go.Bar(x=x, y=y, text=pct)
# layout = dict(title= 'Number of Items by Main Category',
#               yaxis = dict(title='Count'),
#               xaxis = dict(title='Category'))
# fig=dict(data=[trace1], layout=layout)
# py.iplot(fig)

In [30]:
x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]

#### Brand Name

In [31]:
print("There are %d unique brand names in the training dataset." % train['brand_name'].nunique())

There are 4809 unique brand names in the training dataset.


#### Item Description
- Remove punctuation
- Stop words: English
- Drop words of 3 characters or fewer

In [32]:
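import re
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Note: in the original run `re`, `string` and `stop_words` were not yet imported,
# and a bare `except` turned the resulting NameError into 0 for every row. That
# likely explains the all-zero desc_len column in the output further down.
def wordCount(text):
    # Missing descriptions arrive as NaN (a float); count them as 0 words.
    if not isinstance(text, str):
        return 0
    text = text.lower()
    # Replace punctuation, digits and control whitespace with spaces.
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    txt = regex.sub(" ", text)
    # Keep words that are not stop words and are longer than 3 characters.
    words = [w for w in txt.split(" ")
             if w not in ENGLISH_STOP_WORDS and len(w) > 3]
    return len(words)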

In [33]:
train['desc_len'] = train['item_description'].apply(lambda x: wordCount(x))
test['desc_len'] = test['item_description'].apply(lambda x: wordCount(x))

In [34]:
train.head()

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | general_cat | subcat_1 | subcat_2 | desc_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | | 10.0 | 1 | No description yet | Men | Tops | T-shirts | 0 |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... | Electronics | Computers & Tablets | Components & Parts | 0 |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... | Women | Tops & Blouses | Blouse | 0 |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... | Home | Home Décor | Home Décor Accents | 0 |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | | 44.0 | 0 | Complete with certificate of authenticity | Women | Jewelry | Necklaces | 0 |


In [35]:
df = train.groupby('desc_len')['price'].mean().reset_index()

In [37]:
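# Drop rows with a missing item description.
# .copy() avoids the SettingWithCopyWarning that the original run shows further down.
train = train[pd.notnull(train['item_description'])].copy()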

- Words that appear frequently in general
  * Words sellers use to attract buyers: size, free, shipping (however, the two variables price and shipping fee are not correlated)
  * Words that differentiate price: brand names

#### Text Processing - Item Description
Pre-processing: tokenization

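1. Split the text into sentences, then tokenize each sentence into words
2. Remove punctuation and filter out stop words
3. Convert everything to lower case
4. Drop words shorter than 3 characters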

In [48]:
import re
import string
from nltk import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')

stop = set(stopwords.words('english'))

def tokenize(text):
    """
    sent_tokenize(): segment text into sentences
    word_tokenize(): break sentences into words
    """
    try: 
        regex = re.compile('[' +re.escape(string.punctuation) + '0-9\\r\\t\\n]')
        text = regex.sub(" ", text) # remove punctuation
        
        tokens_ = [word_tokenize(s) for s in sent_tokenize(text)]
        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent
        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        filtered_tokens = [w for w in tokens if re.search('[a-zA-Z]', w)]
        filtered_tokens = [w.lower() for w in filtered_tokens if len(w)>=3]
        
        return filtered_tokens
            
    except TypeError as e: print(text,e)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [49]:
train['tokens'] = train['item_description'].map(tokenize)
test['tokens'] = test['item_description'].map(tokenize)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [50]:
for description, tokens in zip(train['item_description'].head(),
                              train['tokens'].head()):
    print('description:', description)
    print('tokens:', tokens)
    print()

description: No description yet
tokens: ['description', 'yet']

description: This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.
tokens: ['keyboard', 'great', 'condition', 'works', 'like', 'came', 'box', 'ports', 'tested', 'work', 'perfectly', 'lights', 'customizable', 'via', 'razer', 'synapse', 'app']

description: Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!
tokens: ['adorable', 'top', 'hint', 'lace', 'key', 'hole', 'back', 'pale', 'pink', 'also', 'available', 'white']

description: New with tags. Leather horses. Retail for [rm] each. Stand about a foot high. They are being sold as a pair. Any questions please ask. Free shipping. Just got out of storage
tokens: ['new', 'tags', 'leather', 'horses', 'retail', 'stand', 'foot', 'high', 'sold', 'pair', 'questions', 'please', 

In [54]:
from collections import Counter

cat_desc = dict()
general_cats = list(train['general_cat'].unique())

for cat in general_cats: 
    text = " ".join(train.loc[train['general_cat']==cat, 'item_description'].values)
    cat_desc[cat] = tokenize(text)

women100 = Counter(cat_desc['Women']).most_common(100)
beauty100 = Counter(cat_desc['Beauty']).most_common(100)
kids100 = Counter(cat_desc['Kids']).most_common(100)
electronics100 = Counter(cat_desc['Electronics']).most_common(100)

In [55]:
from wordcloud import WordCloud

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate(str(tup))
    return wordcloud

Pre-processing: tf-idf  
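tf-idf stands for term frequency-inverse document frequency. It quantifies how important a word is to a document relative to the vocabulary of the whole corpus.  

- Term frequency: how often a word occurs in a given document
- Inverse document frequency: the inverse of how many documents in the corpus contain the word

If a word is used widely across all documents, its presence in one particular document tells us little about that document. Inverse document frequency can therefore be seen as a penalty term that down-weights common words such as "a", "the" and "and". In short, tf-idf is a weighting scheme for how relevant a word is to a specific document.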
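
To make the weighting concrete, here is a small sketch of the smoothed idf that scikit-learn's `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The three-document corpus is made up for illustration, and the helper name `smoothed_idf` is our own.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency (scikit-learn default, smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical toy corpus of already-tokenized descriptions.
docs = [
    ["brand", "new", "keyboard"],
    ["brand", "new", "blouse"],
    ["new", "leather", "horses"],
]

# Document frequency: in how many documents each token appears.
vocab = sorted({tok for doc in docs for tok in doc})
doc_freq = {tok: sum(tok in doc for doc in docs) for tok in vocab}
idf = {tok: smoothed_idf(len(docs), doc_freq[tok]) for tok in vocab}

# Common words get the smallest weight, rare words the largest.
for tok in sorted(idf, key=idf.get):
    print(f"{tok:10s} df={doc_freq[tok]} idf={idf[tok]:.3f}")
```

The same formula accounts for the scale of the weights in this notebook. With roughly 2.18 million train and test descriptions and `min_df=10`, the rarest retained term has `df = 10`, so its weight is about ln(2,175,895 / 11) + 1 ≈ 13.195.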

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10,
                             max_features=180000,
                             tokenizer=tokenize,
                             ngram_range=(1, 2))

In [57]:
all_desc = np.append(train['item_description'].values, test['item_description'].values)
vz = vectorizer.fit_transform(list(all_desc))

`vz` is the tf-idf matrix: one row per description and one column per unique token (unigram or bigram) found across all descriptions.

In [58]:
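# idf_ holds one inverse-document-frequency weight per feature (the column is still named 'tfidf').
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['tfidf'])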

The 10 tokens with the lowest scores. These are very common words that cannot be used to tell one description from another.

In [59]:
tfidf.sort_values(by=['tfidf'], ascending=True).head(10)

| | tfidf |
|---|---|
| new | 2.175653 |
| size | 2.330674 |
| brand | 2.75566 |
| condition | 2.799306 |
| brand new | 2.874418 |
| free | 2.903426 |
| shipping | 3.070592 |
| worn | 3.107882 |
| used | 3.16531 |
| never | 3.276901 |


The 10 tokens with the highest scores. These are rare, specific terms, so a token of this kind often lets us guess the category of the item it comes from.

In [61]:
tfidf.sort_values(by=['tfidf'], ascending=False).head(10)

| | tfidf |
|---|---|
| postnatal | 13.195054 |
| subdrip rda | 13.195054 |
| lmt | 13.195054 |
| lbs length | 13.195054 |
| place step | 13.195054 |
| light volts | 13.195054 |
| thumb point | 13.195054 |
| wedgwood | 13.195054 |
| novelty bill | 13.195054 |
| colour brow | 13.195054 |


The tf-idf matrix is high-dimensional, so its dimensionality has to be reduced, here with SVD.

#### t-Distributed Stochastic Neighbor Embedding (t-SNE)  
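t-SNE takes a set of points in a high-dimensional space and, based on probability distributions, finds a representation of those points in a lower-dimensional space, typically the 2D plane.

However, t-SNE is computationally expensive, so another dimensionality-reduction technique is usually applied before it.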

In [62]:
trn = train.copy()
tst = test.copy()
trn['is_train'] = 1
tst['is_train'] = 0

sample_sz = 15000

combined_df = pd.concat([trn, tst])
combined_sample = combined_df.sample(n=sample_sz)
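# Reuse the vocabulary fitted on all descriptions. Calling fit_transform here would
# refit the vectorizer and leave its feature names out of step with `vz`.
vz_sample = vectorizer.transform(list(combined_sample['item_description']))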

In [63]:
from sklearn.decomposition import TruncatedSVD

n_comp=30
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)

In [64]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=500)

In [65]:
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 15000 samples in 0.057s...
[t-SNE] Computed neighbors for 15000 samples in 12.053s...
[t-SNE] Computed conditional probabilities for sample 1000 / 15000
[t-SNE] Computed conditional probabilities for sample 2000 / 15000
[t-SNE] Computed conditional probabilities for sample 3000 / 15000
[t-SNE] Computed conditional probabilities for sample 4000 / 15000
[t-SNE] Computed conditional probabilities for sample 5000 / 15000
[t-SNE] Computed conditional probabilities for sample 6000 / 15000
[t-SNE] Computed conditional probabilities for sample 7000 / 15000
[t-SNE] Computed conditional probabilities for sample 8000 / 15000
[t-SNE] Computed conditional probabilities for sample 9000 / 15000
[t-SNE] Computed conditional probabilities for sample 10000 / 15000
[t-SNE] Computed conditional probabilities for sample 11000 / 15000
[t-SNE] Computed conditional probabilities for sample 12000 / 15000
[t-SNE] Computed conditional probabilities for sa

In [68]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

output_notebook()
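# "previewsave" is not a valid tool name in current Bokeh ("save" is), and recent
# releases take width/height instead of plot_width/plot_height. Either mismatch
# is a likely cause of the ValueError recorded below.
plot_tfidf = bp.figure(width=700, height=600,
                       title="tf-idf clustering of the item description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,save",
    x_axis_type=None, y_axis_type=None, min_border=1)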

ValueError: ignored

In [69]:
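# combined_sample keeps its original, non-unique row labels, so assign by position
# with .values. Assigning the Series directly triggers index alignment, which is
# the likely cause of the ValueError recorded below.
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
tfidf_df['description'] = combined_sample['item_description'].values
tfidf_df['tokens'] = combined_sample['tokens'].values
tfidf_df['category'] = combined_sample['general_cat'].values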

ValueError: ignored

In [70]:
plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "tokens": "@tokens", "category":"@category"}
show(plot_tfidf)

NameError: ignored

#### K-Means Clustering

In [None]:
from sklearn.cluster import MiniBatchKMeans

num_clusters = 30 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
                               init='k-means++',
                               n_init=1,
                               init_size=1000, batch_size=1000, verbose=0, max_iter=1000)

In [None]:
kmeans = kmeans_model.fit(vz)
kmeans_clusters = kmeans.predict(vz)
kmeans_distances = kmeans.transform(vz)

In [None]:
sorted_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i in range(num_clusters):
    print("Cluster %d:" % i)
    aux = ''
    for j in sorted_centroids[i, :10]:
        aux += terms[j] + ' | '
    print(aux)
    print() 

In [None]:
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)

In [None]:
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
# Assign by position (.values): combined_sample keeps its original row labels.
kmeans_df['description'] = combined_sample['item_description'].values
kmeans_df['category'] = combined_sample['general_cat'].values

In [None]:
# Recent Bokeh releases take width/height, and the save tool is named "save".
plot_kmeans = bp.figure(width=700, height=600,
                        title="KMeans clustering of the description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,save",
    x_axis_type=None, y_axis_type=None, min_border=1)

In [None]:
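from bokeh.models import ColumnDataSource
from bokeh.palettes import viridis

# `colormap` is not defined anywhere above. As a stand-in (an assumption, not the
# original palette), build one color per cluster from Bokeh's viridis palette.
colormap = np.array(viridis(num_clusters))

source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
                                    color=colormap[kmeans_clusters],
                                    description=kmeans_df['description'],
                                    category=kmeans_df['category'],
                                    cluster=kmeans_df['cluster']))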

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"description": "@description", "category": "@category", "cluster":"@cluster" }
show(plot_kmeans)

#### Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics present in a corpus.

In [None]:
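from sklearn.feature_extraction.text import CountVectorizer

# LDA works on raw term counts, so use CountVectorizer rather than tf-idf weights.
cvectorizer = CountVectorizer(min_df=4,
                              max_features=180000,
                              tokenizer=tokenize,
                              ngram_range=(1,2))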

In [None]:
cvz = cvectorizer.fit_transform(combined_sample['item_description'])

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=20,
                                      learning_method='online',
                                      max_iter=20,
                                      random_state=42)

In [None]:
X_topics = lda_model.fit_transform(cvz)

In [None]:
n_top_words = 10
topic_summaries = []

topic_word = lda_model.components_  # get the topic words
vocab = cvectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' | '.join(topic_words)))

In [None]:
tsne_lda = tsne_model.fit_transform(X_topics)

In [None]:
# Normalize each row so the topic weights of a document sum to 1.
doc_topic = X_topics / X_topics.sum(axis=1, keepdims=True)

# Dominant topic of each document, as plain Python ints.
lda_keys = doc_topic.argmax(axis=1).tolist()

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
# Assign by position (.values): combined_sample keeps its original row labels.
lda_df['description'] = combined_sample['item_description'].values
lda_df['category'] = combined_sample['general_cat'].values
lda_df['topic'] = lda_keys

In [None]:
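# `plot_lda` is never created above, so build it like the other figures
# (the title is a placeholder chosen here).
plot_lda = bp.figure(width=700, height=600,
                     title="LDA topic visualization",
                     tools="pan,wheel_zoom,box_zoom,reset,hover,save",
                     x_axis_type=None, y_axis_type=None, min_border=1)

# `colormap` is assumed to be an array with at least one color per topic.
source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    description=lda_df['description'],
                                    topic=lda_df['topic'],
                                    category=lda_df['category']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"description":"@description",
                "topic":"@topic", "category":"@category"}
show(plot_lda)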

In [None]:
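def prepareLDAData():
    # pyLDAvis expects per-term corpus counts and topic-term rows that sum to 1.
    components = lda_model.components_
    data = {
        'vocab': vocab,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
        # vocabulary_ maps terms to column indices, not counts; sum the count matrix instead.
        'term_frequency': np.asarray(cvz.sum(axis=0)).ravel(),
        'topic_term_dists': components / components.sum(axis=1, keepdims=True)
    }
    return data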

In [None]:
import pyLDAvis

# Assign by position (.values) so the sampled rows line up with lda_df.
lda_df['len_docs'] = combined_sample['tokens'].map(len).values
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)

In [None]:
import IPython.display
from IPython.display import display, HTML, Javascript