In [2]:
from datetime import datetime
import os
import ipywidgets as widgets
from IPython.display import display, HTML
from autoextract.sync import request_raw
from parsel import Selector
import html_text

# AutoExtract articleBodyHtml example

The [AutoExtract API](https://scrapinghub.com/autoextract) is a service for 
automatically extracting information from web content. In this notebook
we show how to extract the body content of article pages automatically,
focusing on the features offered by the returned attribute `articleBodyHtml`. 

The Scrapinghub client library ``scrapinghub-autoextract`` provides access to the article 
extraction API from Python. A key is required to access the service. You can obtain one
on [this page](https://scrapinghub.com/autoextract). The client library looks
for this key in the environment variable ``SCRAPINGHUB_AUTOEXTRACT_KEY``, but **you can
also set it in the text box below and press enter**. 

In [None]:
def set_key(event):
    os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = event.value
    print(f"New key set at {datetime.now()}")
    
key = widgets.Text(placeholder='Fill with your AutoExtract key', layout={'width': '400px'})
key.on_submit(set_key)
display(widgets.HBox([widgets.Label("AutoExtract key:"), key]))
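If the widget is not available (for example, when the notebook is run as a plain script), the key can also be placed in the environment directly from Python. The following is a minimal sketch; the helper name `ensure_key` is hypothetical and not part of the client library:

```python
import os

KEY_VAR = "SCRAPINGHUB_AUTOEXTRACT_KEY"


def ensure_key(key=None, environ=os.environ):
    """Store the key if one is given, then report whether a key is available.

    `environ` defaults to the real process environment; passing a plain
    dict makes the helper easy to try out without touching it.
    """
    if key:
        environ[KEY_VAR] = key
    return bool(environ.get(KEY_VAR))
```

Calling `ensure_key()` with no arguments before the first request gives an early hint that the key is missing, instead of failing later inside the request.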

The function [``request_raw``](https://github.com/scrapinghub/scrapinghub-autoextract#synchronous-api) 
is the entry point to the AutoExtract API. For convenience, let's define the function 
``autoextract_article`` as:  

In [4]:
def autoextract_article(url):
    query = [{'url': url, 'pageType': 'article'}]
    return request_raw(query)[0]['article']
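`autoextract_article` assumes that the request succeeded. A slightly more defensive variant could inspect the result before indexing into it. The sketch below assumes, based on our reading of the AutoExtract documentation, that a failed query comes back with an `error` field in place of `article`; the helper name `article_from_result` is hypothetical:

```python
def article_from_result(result):
    """Return the 'article' dict of a single AutoExtract result, or raise.

    Assumption: a failed query carries an 'error' string instead of the
    'article' key.
    """
    if "error" in result:
        raise RuntimeError(f"AutoExtract error: {result['error']}")
    if "article" not in result:
        raise KeyError("result has no 'article' field")
    return result["article"]
```

With it, the last line of `autoextract_article` could become `return article_from_result(request_raw(query)[0])`.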

Among the [attributes that are extracted](https://doc.scrapinghub.com/autoextract.html#article-extraction),
this notebook focuses on ``articleBodyHtml``, which contains the simplified, 
normalized and cleaned-up article content as HTML.

Let's see an extraction example for [this page](https://www.independent.ie/sport/soccer/premier-league/man-united-charged-for-failing-to-ensure-players-conducted-themselves-in-an-orderly-fashion-against-liverpool-38881375.html):

In [5]:
nfl_article = autoextract_article("https://www.independent.ie/sport/soccer/premier-league/man-united-charged-for-failing-to-ensure-players-conducted-themselves-in-an-orderly-fashion-against-liverpool-38881375.html")
HTML(nfl_article['articleBodyHtml'])

NoApiKey: API key not found. Please set SCRAPINGHUB_AUTOEXTRACT_KEY environment variable or pass

Note how only the relevant content of the article was extracted, leaving out elements
such as ads and unrelated content. AutoExtract relies on advanced machine learning
models that are able to discriminate between what is relevant and what is not.  

Also note how figures with captions were extracted. Many 
[other elements can also be present](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field). 

## Styling

Having normalized HTML code has some cool advantages. One is that the content
can be formatted independently of the original style with simple CSS rules.
That means that the same consistent formatting can be applied even if content is coming
from very different pages with different formats.  

AutoExtract encapsulates the `articleBodyHtml` content within ``article`` tags. For example:
```html
<article>
    <p>This is a simple article</p>
</article>
```

To apply some style rules, we are going to add the class `beauty` to the `article` tag. 
The function `show` takes care of that:  

In [8]:
def show(article):
    return HTML(article['articleBodyHtml'].replace("<article>", "<article class='beauty'>"))
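The plain `str.replace` above only matches an `<article>` tag that has no attributes; if the opening tag ever carried one, the class would silently not be applied. A slightly more tolerant sketch uses a regular expression on the first opening tag. The helper `add_article_class` is hypothetical, and it assumes the tag has no `class` attribute of its own:

```python
import re

# Matches an opening <article ...> tag, with or without attributes.
_ARTICLE_OPEN = re.compile(r"<article\b([^>]*)>", re.IGNORECASE)


def add_article_class(html, css_class="beauty"):
    """Add a class attribute to the first <article> tag of an HTML string."""
    return _ARTICLE_OPEN.sub(
        lambda m: f'<article class="{css_class}"{m.group(1)}>',
        html,
        count=1,  # only the outermost (first) article tag is touched
    )
```

`show` could then return `HTML(add_article_class(article['articleBodyHtml']))`.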

Now let's create some CSS style rules to be applied for the `beauty` class:  

In [9]:
style = """
<style>
    .beauty {
        font-family: 'Benton Sans', Sans-Serif;
        line-height: 23px;
        font-size: 17.008px;
        font-style: normal;   
        background-color: #F9F9F9;
        padding: 20px;
    }
    .beauty h2, .beauty h3, .beauty h4, .beauty h5, .beauty h6 { 
        font-family: Majerit, serif;
        font-weight: 700;
    }
    .beauty p { 
        margin-bottom: 10px;
        color: #444;
    }
    .beauty figcaption {
        display: table-caption; 
        caption-side: bottom;     
        border-bottom: 0.063rem dotted #D0D0D0;
        margin-bottom: 10px;
        line-height: 22px;
        font-size: 13px; 
        color: #646464; 
        text-align: center;       
    }
    .beauty figcaption * {
        text-align: center;
        font-size: 13px; 
        color: #646464;         
    }
    .beauty figcaption p { margin-bottom: 0px;}
    .beauty figure { 
        display: table;
        margin: 0 auto;
    }
</style>
"""
display(HTML(style))

Let's show the article again. It looks better, doesn't it?

In [54]:
show(nfl_article)

## Tweets and other embeddings

Have a look at the following page:

In [66]:
musk_article = autoextract_article("https://www.cnet.com/news/elon-musks-top-10-weirdest-tweets-of-2019/")
show(musk_article)

The page is full of tweets, but they are not rendered in the format usually seen on web pages. 
Don't worry: everything is in place to get them formatted. All we have to do is include
the [Twitter widgets JavaScript library](https://developer.twitter.com/en/docs/twitter-for-websites/javascript-api/guides/set-up-twitter-for-websites)
in the page. Let's do it: 

In [None]:
twitter_js = '<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>'
display(HTML(twitter_js))

Now the tweets in the article are nicely formatted. Facebook and Instagram content
can also be formatted by [including their JavaScript libraries](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field).    

That is not all. Other `iframe`-based multimedia content such as videos, podcasts and maps 
will also be present and functional in the `articleBodyHtml` attribute.  

## Cherry picking

Another advantage of having a normalized structure is that we can pick only the parts
we are interested in. In the following example, we pick just the images
from [this article](https://eu.thespectrum.com/story/news/2019/05/02/st-george-ironman-2019-athletes-raise-kidney-disease-awareness/3510491002/),
together with their captions, to build a list of images. 

In [4]:
iron_article = autoextract_article("https://eu.thespectrum.com/story/news/2019/05/02/st-george-ironman-2019-athletes-raise-kidney-disease-awareness/3510491002/")

In [10]:
sel = Selector(iron_article['articleBodyHtml'])
images = [{'img_url': fig.xpath(".//img/@src").get(),
           'caption': html_text.selector_to_text(fig.xpath("(.//figcaption)"))} 
          for fig in sel.xpath("//figure")]
images

[{'img_url': 'https://www.gannett-cdn.com/presto/2019/04/30/PSTG/acecc248-6344-41fa-9030-18073f2dccc2-attach.jpg?width=540&height=&fit=bounds&auto=webp',
  'caption': 'Melodie Carli races forward on her bike during IRONMAN Copenhagen on Aug. 19, 2018. (Photo: Melodie Carli)'},
 {'img_url': 'https://www.gannett-cdn.com/presto/2019/04/23/PSTG/907681f6-ff69-45dd-9d3d-ab09b4a8f513-Image-1.png?width=180&height=240&fit=bounds&auto=webp',
  'caption': 'Melodie Carli crosses the finish line at the IRONMAN 70.3 Cartagena in Cartagena, Colombia on Dec. 3, 2017. (Photo: Melodie Carli)'},
 {'img_url': 'https://www.gannett-cdn.com/presto/2019/04/30/PSTG/9c25ef67-8f7f-4828-a98f-6014607b7d7b-IMG_5651.JPG?width=180&height=240&fit=bounds&auto=webp',
  'caption': 'Melodie Carli celebrates with her medal after completing IRONMAN 70.3 Colombia on Dec. 3, 2017. (Photo: Melodie Carli)'}]

The [parsel](https://github.com/scrapy/parsel) and [html-text](https://github.com/TeamHG-Memex/html-text)
libraries were used as helpers for the task. `parsel` makes it possible to query the content using
XPath and CSS expressions, and `html-text` converts HTML content to plain text properly.    

Note that the source code of the page in question contains no `figcaption`
tag: AutoExtract's machine learning capabilities are able to detect that a particular
section of the page is really a figure caption even if it was not annotated with the right
HTML tag. The same intelligence is applied to other elements such as `blockquote`. 
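Because the structure is normalized, the same cherry picking can be approximated with the standard library alone. The sketch below is a rough stand-in for the `parsel` and `html-text` cell above, not a replacement: the class `FigureCollector` and the function `extract_figures` are hypothetical names, and the caption text is only whitespace-normalized, which is cruder than what `html-text` does.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect the first image URL and the caption text of every <figure>."""

    def __init__(self):
        super().__init__()
        self.figures = []
        self._current = None       # figure being parsed, if any
        self._in_caption = False   # True while inside <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._current = {"img_url": None, "caption": ""}
        elif self._current is not None:
            if tag == "img" and self._current["img_url"] is None:
                self._current["img_url"] = dict(attrs).get("src")
            elif tag == "figcaption":
                self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._current is not None:
            # Collapse runs of whitespace left over from nested tags.
            self._current["caption"] = " ".join(self._current["caption"].split())
            self.figures.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current["caption"] += data


def extract_figures(html):
    parser = FigureCollector()
    parser.feed(html)
    parser.close()
    return parser.figures
```

For example, `extract_figures(iron_article['articleBodyHtml'])` would return a list of dictionaries with the same `img_url` and `caption` keys as the cell above.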

Heading levels are also normalized, which makes it easy to automatically extract 
a table of contents from `articleBodyHtml`. The function `print_toc` presented below
prints the table of contents of an article extracted by AutoExtract.

In [21]:
def print_toc(html):  
    for section in Selector(html).css("h2,h3,h4,h5,h6"):
        level = int(section.root.tag[-1]) - 2
        print(f"{'  ' * level}{section.css('::text').get()}")

Let's try it with [this article](http://cs231n.github.io/neural-networks-1/):

In [20]:
article_toc = autoextract_article("http://cs231n.github.io/neural-networks-1/")        
print_toc(article_toc['articleBodyHtml'])

Quick intro
Modeling one neuron
  Biological motivation and connections
  Single neuron as a linear classifier
  Commonly used activation functions
Neural Network architectures
  Layer-wise organization
  Example feed-forward computation
  Representational power
  Setting number of layers and their sizes
Summary
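`print_toc` prints directly, and `section.css('::text').get()` returns only the first text node of a heading, so a heading with inline markup (for example `<em>`) would be cut short. When the table of contents is needed as data, a standard-library sketch along these lines could be used instead; `HeadingCollector` and `build_toc` are hypothetical names:

```python
from html.parser import HTMLParser

HEADINGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (depth, title) pairs, where h2 has depth 0, h3 depth 1, etc."""

    def __init__(self):
        super().__init__()
        self.toc = []
        self._level = None   # depth of the heading being parsed, if any
        self._text = []      # text fragments of the current heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._level = int(tag[1]) - 2
            self._text = []

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._level is not None:
            title = " ".join("".join(self._text).split())
            self.toc.append((self._level, title))
            self._level = None

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)


def build_toc(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.toc
```

The same indented listing can then be produced with `for depth, title in build_toc(html): print('  ' * depth + title)`.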


## Try it yourself

Now it is time to try it yourself. Set the `url` variable below and execute the cell
to see what AutoExtract returns for it:

In [10]:
url = "https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private"

article = autoextract_article(url)
show(article)