This is an attempt to model Medium's digest by topic. It's a simple experiment that combines (high-dimensional) word embeddings with dynamic bi-directional encoding of article features, and a proof of concept for modeling "live" data that is streamed and publicly available. The repo contains models for headless web scrapers and dataset models for natural-language articles.
Before you start slinging, do this first (requires Node > 7 and Python 2 or 3):

```
npm i && python setup.py install
```
1. Scrape the `medium` dataset:

   ```
   bin/scrape medium
   ```

   This will run for a while, depending on your internet connection; any page that takes longer than 30s is skipped, and scraping uses 8 threads. The scraper crawls the topics page and collects the available topics. It then visits each topic's main page (e.g. *culture*) and extracts all landing-page articles (reading each `href` according to the attribute `data-post-id`). Finally, it visits each article page and locates the Medium API payload embedded in the page, which looks like this:

   ```html
   <script>
   // <![CDATA[
   window["obvInit"]({"value":{...}});
   </script>
   ```

   It uses `page.evaluate(...)` to run a regexp over the script content, parses the match as JSON, and passes it back to Node.js. It then strips the metadata and reduces the object to the Python model `textsum.Article` (in `textsum/dataset/article.py`) with the features `title`, `subtitle`, `text`, `tags`, `description`, and `short_description`. We now have raw data that we can use to do fun things.
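The payload extraction above can be sketched in Python (the actual scraper does the equivalent inside `page.evaluate(...)` in Node; the `obvInit` name comes from the page source, while the script content and field values here are illustrative):

```python
import json
import re

# Hypothetical script content as it appears on an article page.
script = 'window["obvInit"]({"value": {"title": "A Post", "tags": ["culture"]}});'

# Regexp over the script content: capture the JSON argument passed to obvInit.
match = re.search(r'obvInit"\]\((\{.*\})\)', script)
payload = json.loads(match.group(1))

print(payload["value"]["title"])  # -> A Post
```

The same capture-then-`json.loads` pattern works for any page that embeds its data as a single function-call argument.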
2. Convert the raw data to numpy records of examples:

   ```
   bin/records --src=data/medium --dst=records/medium --pad
   ```

   This takes the raw data from `src` and serializes it as `textsum.Article` objects for consumption. While serializing, it tokenizes all the features (`title`, `subtitle`, ...) mentioned in step 1, saves them as `np.ndarray`s, and stores them in `dst` by `topic`. Next, the examples are piped to `*.npy` files, which comes in handy with the native `tf.data` API (think Hadoop or Spark, but with native TensorFlow compatibility). Finally, the record `tokens` collected for each topic are gathered into a `set`, so repeated tokens are not stored in memory; this is done in a map -> reduce fashion. For each `topic`, a `map` operation on its own thread gathers the tokens as a `set` of `str`s, and a `union` operation reduces the total space per `topic`. The `map` stage returns the individual vocab for each feature (as in step 1) and is again reduced by `union`. We now have a vocab file for each feature in the dataset.
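The map -> reduce vocab collection described above can be sketched like this (the per-topic token lists and thread layout are illustrative, not the actual records produced by `bin/records`):

```python
from functools import reduce
from multiprocessing.pool import ThreadPool

# Illustrative tokenized articles grouped by topic (assumed shapes).
articles_by_topic = {
    "culture": [["a", "post", "about", "art"], ["art", "and", "film"]],
    "tech": [["a", "post", "about", "code"], ["code", "review"]],
}

def topic_vocab(tokens_lists):
    # map stage: one topic per thread, tokens gathered as a set of strs,
    # so duplicates within a topic are dropped immediately
    return set(tok for tokens in tokens_lists for tok in tokens)

with ThreadPool(len(articles_by_topic)) as pool:
    vocabs = pool.map(topic_vocab, articles_by_topic.values())

# reduce stage: union collapses duplicates across topics
vocab = reduce(set.union, vocabs)
print(sorted(vocab))
```

Using `set` at the map stage keeps peak memory proportional to the number of distinct tokens per topic rather than the total token count.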
3. Final step: Sling (run the experiment).

   ```
   bin/experiment \
     --model_dir=article_model \
     --dataset_dir=records/medium \
     --input_feature='text' \
     --target_feature='title' \
     --schedule='train'
   ```
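For reference, the `*.npy` record files the experiment consumes are plain padded arrays; a minimal sketch of the round trip (the file layout and token ids here are assumptions, and the real pipeline feeds these arrays to the `tf.data` API):

```python
import os
import tempfile

import numpy as np

# Two padded examples of token ids (illustrative values).
records = np.array([[1, 5, 9, 0], [2, 7, 0, 0]], dtype=np.int64)

# One *.npy file per topic, e.g. a hypothetical "culture" topic.
path = os.path.join(tempfile.mkdtemp(), "culture.npy")
np.save(path, records)

loaded = np.load(path)
# In the experiment these arrays would feed tf.data, e.g. via
# tf.data.Dataset.from_tensor_slices(loaded).
print(loaded.shape)  # -> (2, 4)
```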