In [1]:
import sys
import os
import pandas as pd
import numpy as np
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
from LDA_BERT.model import model
from LDA_BERT.text_processing.sentence_splitting import split_into_sentences
from LDA_BERT.text_processing import is_hyperlink

### Load Sample Data (Self Posts from the Mattress Subreddit)

In [2]:
df = pd.read_csv('../examples/examples_data/matt.csv')
df = df.dropna()
df.head()

                                                text
0  It was Me not the Mattress. I went through a l...
1  TIL about Purple mattresses.... Holy balls.......
2  Just received my new Tempur-Pedic LUXEbreeze s...
3  I work for Mattress Firm in operations in a ve...
4  I don't like the Purple 2.... Been sleeping on...


### Preprocess the Data for the Model

In [3]:
# Drop hyperlink tokens from each post, split it into sentences,
# and record which post (doc_no) every sentence came from.
sent_data = []
for k, text in enumerate(df.text.values):
    text = ' '.join([w for w in text.split() if not is_hyperlink(w)])
    for sent in split_into_sentences(text):
        sent_data.append({'sent': sent, 'doc_no': k})
sent_df = pd.DataFrame(sent_data)
sent_df.head()

                                                sent  doc_no
0                        It was Me not the Mattress.       0
1  I went through a lot of mattresses over a numb...       0
2     My lower back pain always seemed to come back.       0
3  Turns out I have something called chronic pelv...       0
4  It is a bizarre problem that doctors don't tot...       0
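
The cell above relies on two helpers from `LDA_BERT.text_processing` whose internals are not shown in this notebook. As a rough illustration of the same steps, here is a minimal sketch with hypothetical stand-ins (`looks_like_hyperlink` and `naive_split_into_sentences` are invented names, and the real helpers may well handle more cases, such as abbreviations or markdown links).

```python
import re

# Hypothetical stand-ins; LDA_BERT's own helpers may behave differently.
URL_RE = re.compile(r'^(https?://|www\.)\S+$', re.IGNORECASE)

def looks_like_hyperlink(token):
    """Return True when a whitespace-delimited token looks like a URL."""
    return bool(URL_RE.match(token))

def naive_split_into_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

post = "It was me. See https://example.com/bed for details! Worth it?"
# Remove URL-like tokens first, then split what is left into sentences.
cleaned = ' '.join(w for w in post.split() if not looks_like_hyperlink(w))
rows = [{'sent': s, 'doc_no': 0} for s in naive_split_into_sentences(cleaned)]
for row in rows:
    print(row)
```

Removing links before splitting matters in this naive version, because the dots inside a URL followed by whitespace could otherwise be mistaken for sentence boundaries.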


In [4]:
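rws = sent_df.sent.values  # raw sentences
print(len(rws))
# idx_in[k] is used below as the position in rws of the k-th sentence kept by preprocessing
sentences, token_lists, idx_in = model.preprocess(rws)
print(len(sentences))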

12403
Preprocessing raw texts ...
Preprocessing raw texts. Done!
10920
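
Preprocessing keeps 10920 of the 12403 raw sentences, and later cells use `idx_in[k]` to get back from a kept sentence to its raw text. The filtering rule inside `model.preprocess` is not shown here, so the sketch below assumes a simple minimum-token rule purely for illustration; `preprocess_sketch` and `min_tokens` are hypothetical names.

```python
import re

def preprocess_sketch(raw_texts, min_tokens=3):
    """Toy preprocessing that remembers where each kept sentence came from.

    The drop rule (fewer than min_tokens word tokens) is an assumption,
    not the rule used by LDA_BERT.
    """
    sentences, token_lists, idx_in = [], [], []
    for i, raw in enumerate(raw_texts):
        tokens = re.findall(r"[a-z']+", raw.lower())
        if len(tokens) < min_tokens:
            continue  # dropped sentences leave a gap in the index mapping
        sentences.append(' '.join(tokens))
        token_lists.append(tokens)
        idx_in.append(i)  # position of this sentence in raw_texts
    return sentences, token_lists, idx_in

raw = ["I am a side sleeper.", "Thanks!", "Memory foam way too hot."]
toy_sentences, toy_tokens, toy_idx_in = preprocess_sketch(raw)
# Row 1 of the kept sentences maps back to row 2 of the raw input.
print(toy_idx_in, raw[toy_idx_in[1]])
```

Any array computed on the preprocessed sentences (embeddings, cluster labels, SVD components) is indexed by kept position, so it has to go through `idx_in` before it is used to index `sent_df`.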


### Fit the Model

In [5]:
tm = model.Topic_Model(k=10, method='LDA_BERT')
tm.fit(sentences, token_lists)

Clustering embeddings ...
Getting vector representations for LDA ...
Getting vector representations for LDA. Done!
Getting vector representations for BERT ...




Getting vector representations for BERT. Done!
Fitting Autoencoder ...
Fitting Autoencoder Done!
Clustering embeddings. Done!
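
The fit log reports LDA vectors, BERT vectors, and an autoencoder, in that order. How the two representations are combined is not shown in this notebook. One plausible arrangement, assumed here for illustration only, is to concatenate a weighted topic vector with the sentence embedding and then compress the result. The weight `gamma`, the 768-dimensional embedding size, and the random stand-in data are all assumptions, and the autoencoder step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sentences, k_topics, bert_dim = 6, 10, 768  # toy sizes; 768 is an assumed embedding width

# Stand-ins for the real model outputs: topic probabilities and dense embeddings.
lda_vecs = rng.dirichlet(np.ones(k_topics), size=n_sentences)  # each row sums to 1
bert_vecs = rng.normal(size=(n_sentences, bert_dim))

gamma = 15.0  # hypothetical weight controlling how much the topic part counts
combined = np.hstack([gamma * lda_vecs, bert_vecs])
print(combined.shape)
```

A weight of this kind would be needed because the topic vector has far fewer dimensions than the embedding and would otherwise contribute little to Euclidean distances.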


### Generate LSA Topic Model as Baseline

In [6]:
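from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import randomized_svd
# TF-IDF over unigrams to trigrams, ignoring rare (< 3 sentences) and common (> 10%) terms
tfidf = TfidfVectorizer(min_df=3, max_df=0.1,
                        stop_words='english', ngram_range=(1,3),
                        sublinear_tf=True)
tfidf_m = tfidf.fit_transform(sentences)
# Input is terms x sentences, so each row of V holds one component's weight per sentence
U, S, V = randomized_svd(tfidf_m.T, 15, random_state=2020)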

In [7]:
for k in V.T[:,0].argsort()[-15:][::-1]:
    print(sent_df.sent.values[idx_in[k]])
    print('_____________')

I’m looking to get a memory foam mattress.
_____________
The first was a Silk+Snow, which was my very first memory foam (lived with standard ol' innersprings all my life).
_____________
Memory foam is not soft enough.
_____________
I have a bed in the box memory foam mattress.
_____________
We have never had a memory foam mattress.
_____________
I'm thinking of just diying a memory foam mattress.
_____________
Memory Foam after a year?
_____________
The bed is much better than my old memory foam mattress but still heats up.
_____________
What memory foam should we look at?
_____________
Would it work with memory foam?
_____________
Memory foam way too hot.
_____________
Is it worth getting a memory foam topper for my sunken 14 year old memory foam mattress or should I just focus on a new mattress?
_____________
So please, if you buy a memory foam mattress, be weary of fiberglass!
_____________
So I had this memory foam mattress from Lucid for a few years.
_____________
I sleep hot and 

In [8]:
V.T[:,0].argsort()[-10:][::-1]

array([ 1989, 10119, 10069,    66,  5414,  8667,  6671,   441,  2438,
        4087], dtype=int64)
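
To make the indexing in the LSA cells easier to follow, here is a small self-contained sketch of the same pattern on a toy corpus. The sentences are made up for illustration, and the vectorizer uses default settings rather than the ones above. Because the matrix passed to `randomized_svd` is terms x sentences, row 0 of the third return value holds the first component's weight for every sentence, which is what `V.T[:,0]` selects in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.extmath import randomized_svd

toy_sentences = [
    "memory foam mattress too hot",
    "memory foam mattress is soft",
    "new memory foam mattress topper",
    "memory foam sleeps hot",
    "delivery was late again",
    "customer service never answered",
]

tfidf_toy = TfidfVectorizer().fit_transform(toy_sentences)  # sentences x terms
# Transpose to terms x sentences so the right singular vectors index sentences.
U_toy, S_toy, Vt_toy = randomized_svd(tfidf_toy.T, n_components=2, random_state=0)

# Sentences sorted by their weight on the first component, strongest first.
top = Vt_toy[0].argsort()[::-1]
for i in top[:3]:
    print(round(float(Vt_toy[0, i]), 3), toy_sentences[i])
```

In the notebook the indices returned by `argsort` refer to preprocessed sentences, which is why they pass through `idx_in` before indexing `sent_df`.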

### K-means to Generate Topic Clusters (Note: They Are Not Very Good)

In [9]:
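from sklearn.cluster import KMeans
# Fix the seed so the cluster assignments are reproducible
cm = KMeans(n_clusters=10, random_state=2020)
clusters = cm.fit_predict(tm.vec['LDA_BERT'])
# Map each cluster label to the original row positions (in sent_df) of its sentences
cluster_idx = {k: np.array(idx_in)[clusters == k] for k in range(10)}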

In [10]:
for sent in sent_df.sent.values[cluster_idx[1]][:20]:
    print(sent)
    print('-----------------------------------------')

If you have a Sleep Sherpa store near you, you can try out nearly all the online beds in that one store & order from there too- same price & everything.
-----------------------------------------
I now know most of those are sponsored reviews and outright fabrications.
-----------------------------------------
So last week I started really researching mattresses.
-----------------------------------------
I spent a lot of time here in this sub, specifically, to learn more about mattresses and to avoid disinformation.
-----------------------------------------
I just received it two days ago.
-----------------------------------------
In those two nights I have slept better than I have in years.
-----------------------------------------
I wake up without pain and I haven’t woken up with the sheets sticking to me due to sweat.
-----------------------------------------
Edit: does anyone use a cotton mattress pad or moisture barrier on their LUXEbreeze, or does that make the cooling less effec
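
The judgment that the clusters are not very good comes from reading the sentences. One way to put a number on it, sketched here on synthetic data rather than on the notebook's vectors, is the silhouette score, which is near 1 for tight, well-separated clusters and near 0 when clusters overlap. The helper name `kmeans_silhouette` and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated blobs versus structureless uniform noise.
blobs = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                   rng.normal(5.0, 0.1, size=(50, 2))])
noise = rng.uniform(0.0, 1.0, size=(100, 2))

def kmeans_silhouette(vectors, k, seed=0):
    """Cluster with K-means and return (labels, silhouette score)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
    return labels, silhouette_score(vectors, labels)

blob_labels, blob_score = kmeans_silhouette(blobs, 2)
noise_labels, noise_score = kmeans_silhouette(noise, 2)
print(f"blobs: {blob_score:.2f}  noise: {noise_score:.2f}")

# Same label -> row-position mapping as cluster_idx, using toy positions.
toy_idx_in = np.arange(len(blobs))
toy_cluster_idx = {c: toy_idx_in[blob_labels == c] for c in range(2)}
```

Applied to this notebook, the equivalent call would be `silhouette_score(tm.vec['LDA_BERT'], clusters)`. Its `sample_size` argument can reduce the cost on larger inputs.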

### Similar Document Extraction (This is where LDA_BERT seems to shine)

In [11]:
from sklearn.metrics.pairwise import euclidean_distances

In [12]:
dist = euclidean_distances(tm.vec['LDA_BERT_FULL'])

In [13]:
dist[1988].argsort()[:10]

array([1988, 3971, 4540, 4324, 2291, 8781, 4325, 5587, 4530, 8945],
      dtype=int64)

In [14]:
for k in dist[528].argsort()[:10]:
    print(sent_df.sent.values[idx_in[k]])
    print('---------------------------')

I’m about 6’4 290 and I’m a side sleeper.
---------------------------
I’m 6’2 230 and I mainly sleep on my back, and sometimes on the side.
---------------------------
I’m a back and side sleeper and 5’7”/130 pounds.
---------------------------
Notes about us - I’m 5’3” 130lb and a side/stomach sleeper.
---------------------------
I am mostly a back sleeper but spend some time on my side.
---------------------------
* I’m a side sleeper.
---------------------------
I am a side sleeper.
---------------------------
I am a side sleeper.
---------------------------
I am a side sleeper.
---------------------------
I am a side sleeper.
---------------------------
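
The lookup above builds the full pairwise distance matrix first. If `tm.vec['LDA_BERT_FULL']` has one row per preprocessed sentence (10920 here), that matrix has about 119 million entries, roughly 0.95 GB as float64. When only a few queries are needed, distances from a single row are enough. A minimal sketch, with `nearest_rows` as a hypothetical helper name and toy vectors in place of the real embeddings:

```python
import numpy as np

def nearest_rows(vecs, query_row, n=10):
    """Indices of the n rows closest (Euclidean) to vecs[query_row], nearest first."""
    diffs = vecs - vecs[query_row]
    d = np.sqrt((diffs ** 2).sum(axis=1))
    return np.argsort(d, kind='stable')[:n]

toy = np.array([[0.0, 0.0],
                [0.1, 0.0],
                [5.0, 5.0],
                [0.0, 0.2],
                [5.1, 5.0]])
# The query row itself comes first, at distance 0.
print(nearest_rows(toy, 0, n=3))
```

As in the output of `dist[1988].argsort()[:10]`, the first hit is always the query sentence itself, so the useful neighbours start at the second position.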


In [15]:
U2, S2, V2 = randomized_svd(tm.vec['LDA_BERT_FULL'].T, 15, random_state=2020)

In [16]:
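# Sentences with the largest weights on the third component (index 2), strongest last;
# these are the preprocessed sentences, not the raw text
for k in V2.T[:,2].argsort()[-15:]:
    print(sentences[k])
    print('_____________')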

my parents just never saw it as a priority to replace it so i never saw it as a priority to find comfort
_____________
i absolutely hated the topper and couldnt sleep on it after 2 nights of trying
_____________
i never received any emails from bear regarding this and it was especially surprising given that it had already gone out for delivery
_____________
well i contacted the company a couple weeks ago by email because no one is picking up the phone and they havent responded
_____________
also because of the pandemic i obviously can’t go to a store to test it out
_____________
not even going to deal with their warranty department on it because they just make it a pain the arse
_____________
after doing some more research i saw that they have terrible delivery and customer service so im canceling that order
_____________
i havent had the chance to try any out on stores because of the virus
_____________
the place doesn’t do returns or exchanges due to the virus
_____________
they dont