This is an attempt to model Medium's digest by topic. It's a simple experiment that combines (high-dimensional) word embeddings with dynamic bi-directional encoding of article features, and a proof of concept for modeling "live" data that is streamed and publicly available. The repo contains models for headless web scrapers and dataset models for natural-language articles.
Before you start slinging, do this first (requires Node > 7 and Python 2 or 3):

```
npm i && python setup.py install
```
1. Scrape the `medium` dataset:

   ```
   bin/scrape medium
   ```

   This will run for a while, depending on your internet connection; any page that takes longer than 30s is skipped, and scraping uses 8 threads. The scraper crawls the topics page and collects the available topics. It then visits each topic's main page (e.g. *culture*) and extracts all landing-page articles (reading each `href` according to the attribute `data-post-id`). Finally, it visits each article page and locates the Medium API payload embedded in the page, which looks like this:

   ```html
   <script>
   // <![CDATA[
   window["obvInit"]({"value":{...}});
   </script>
   ```

   It uses `page.evaluate(...)` to run a regexp over the script content, parses the match as JSON, and passes it back to Node.js. It then strips the metadata and reduces the object to the Python model `textsum.Article` (in `textsum/dataset/article.py`) with the features `title`, `subtitle`, `text`, `tags`, `description`, and `short_description`. We now have raw data that we can use to do fun things.
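The payload extraction above can be sketched in Python (the actual scraper does the equivalent inside `page.evaluate(...)` in Node; the `obvInit` name comes from the page source, while the script content and field values here are illustrative):

```python
import json
import re

# Hypothetical script content as it appears on an article page.
script = 'window["obvInit"]({"value": {"title": "A Post", "tags": ["culture"]}});'

# Regexp over the script content: capture the JSON argument passed to obvInit.
match = re.search(r'obvInit"\]\((\{.*\})\)', script)
payload = json.loads(match.group(1))

print(payload["value"]["title"])  # -> A Post
```

The same capture-then-`json.loads` pattern works for any page that embeds its data as a single function-call argument.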
2. Convert the raw data to numpy records of examples:

   ```
   bin/records --src=data/medium --dst=records/medium --pad
   ```

   This takes the raw data from `src` and serializes it as `textsum.Article` objects for consumption. While serializing, it tokenizes all the features (`title`, `subtitle`, ...) mentioned in step 1, saves them as `np.ndarray`s, and stores them in `dst` by `topic`. Next, the examples are piped to `*.npy` files, which comes in handy with the native `tf.data` API (think Hadoop or Spark, but with native TensorFlow compatibility). Finally, the record `tokens` collected for each topic are gathered into a `set`, so repeated tokens are not stored in memory; this is done in a map -> reduce fashion. For each `topic`, a `map` operation on its own thread gathers the tokens as a `set` of `str`s, and a `union` operation reduces the total space per `topic`. The `map` stage returns the individual vocab for each feature (as in step 1) and is again reduced by `union`. We now have a vocab file for each feature in the dataset.
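The map -> reduce vocab collection described above can be sketched like this (the per-topic token lists and thread layout are illustrative, not the actual records produced by `bin/records`):

```python
from functools import reduce
from multiprocessing.pool import ThreadPool

# Illustrative tokenized articles grouped by topic (assumed shapes).
articles_by_topic = {
    "culture": [["a", "post", "about", "art"], ["art", "and", "film"]],
    "tech": [["a", "post", "about", "code"], ["code", "review"]],
}

def topic_vocab(tokens_lists):
    # map stage: one topic per thread, tokens gathered as a set of strs,
    # so duplicates within a topic are dropped immediately
    return set(tok for tokens in tokens_lists for tok in tokens)

with ThreadPool(len(articles_by_topic)) as pool:
    vocabs = pool.map(topic_vocab, articles_by_topic.values())

# reduce stage: union collapses duplicates across topics
vocab = reduce(set.union, vocabs)
print(sorted(vocab))
```

Using `set` at the map stage keeps peak memory proportional to the number of distinct tokens per topic rather than the total token count.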
3. Final step: Sling (run the experiment).

   ```
   bin/experiment \
     --model_dir=article_model \
     --dataset_dir=records/medium \
     --input_feature='text' \
     --target_feature='title' \
     --schedule='train'
   ```
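For reference, the `*.npy` record files the experiment consumes are plain padded arrays; a minimal sketch of the round trip (the file layout and token ids here are assumptions, and the real pipeline feeds these arrays to the `tf.data` API):

```python
import os
import tempfile

import numpy as np

# Two padded examples of token ids (illustrative values).
records = np.array([[1, 5, 9, 0], [2, 7, 0, 0]], dtype=np.int64)

# One *.npy file per topic, e.g. a hypothetical "culture" topic.
path = os.path.join(tempfile.mkdtemp(), "culture.npy")
np.save(path, records)

loaded = np.load(path)
# In the experiment these arrays would feed tf.data, e.g. via
# tf.data.Dataset.from_tensor_slices(loaded).
print(loaded.shape)  # -> (2, 4)
```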