In [21]:
from datetime import datetime
import os
import ipywidgets as widgets
from IPython.display import display, HTML
from autoextract.sync import request_raw

# AutoExtract articleBodyHtml example

The Scrapinghub client library ``scrapinghub-autoextract`` provides access to the
Article Extraction API from Python. A key is required to access the service. You can
obtain one on [this page](https://scrapinghub.com/autoextract). The client library looks
for this key in the environment variable ``SCRAPINGHUB_AUTOEXTRACT_KEY``, but you can
also type it in the text box below and press Enter.
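
Before sending any request, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch; the helper name ``has_autoextract_key`` is hypothetical and is not part of the client library.

```python
import os

KEY_VAR = "SCRAPINGHUB_AUTOEXTRACT_KEY"

def has_autoextract_key(environ=os.environ):
    """Return True if a non-empty key is present in the given environment mapping."""
    # Hypothetical helper: it only inspects the mapping, it does not validate the key.
    return bool(environ.get(KEY_VAR, "").strip())

if not has_autoextract_key():
    print(f"{KEY_VAR} is not set; fill in the text box or export it before starting Jupyter.")
```

The check only tells you that a value exists. Whether the key is valid is decided by the service when the first request is made.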
 

In [29]:
def set_key(event):
    os.environ['SCRAPINGHUB_AUTOEXTRACT_KEY'] = event.value
    print(f"New key set at {datetime.now()}")
    
key = widgets.Text(placeholder='Fill with your AutoExtract key', layout={'width': '400px'})
key.on_submit(set_key)
display(widgets.HBox([widgets.Label("AutoExtract key:"), key]))

HBox(children=(Label(value='AutoExtract key:'), Text(value='', layout=Layout(width='400px'), placeholder='Fill…

The function [``request_raw``](https://github.com/scrapinghub/scrapinghub-autoextract#synchronous-api)
is the entry point to the AutoExtract API. For convenience, let's define the function
``autoextract_article`` as:

In [None]:
def autoextract_article(url):
    query = [{'url': url, 'pageType': 'article'}]
    return request_raw(query)[0]['article']
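
``autoextract_article`` indexes straight into the first result, so a failed extraction surfaces as a bare ``KeyError``. A slightly more defensive variant could look like the sketch below. It assumes that a failed query is reported under an ``error`` key in the result dictionary; that is an assumption about the response format, and ``article_from_result`` is a hypothetical helper, not a library function.

```python
def article_from_result(result):
    """Return the 'article' payload of one query result, or raise a readable error."""
    # Assumption: a failed query carries an 'error' key instead of 'article'.
    if "error" in result:
        raise RuntimeError(f"AutoExtract query failed: {result['error']}")
    if "article" not in result:
        raise KeyError("result has no 'article' field; was pageType set to 'article'?")
    return result["article"]
```

Under that assumption it would be used as ``article_from_result(request_raw(query)[0])``.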

Among the [attributes that are extracted](https://doc.scrapinghub.com/autoextract.html#article-extraction),
this notebook focuses on the attribute ``articleBodyHtml``, which contains the simplified,
normalized and cleaned-up article content as HTML code.

Let's see an extraction example for [this page](https://www.independent.ie/sport/soccer/premier-league/man-united-charged-for-failing-to-ensure-players-conducted-themselves-in-an-orderly-fashion-against-liverpool-38881375.html):

In [38]:
nfl_art = autoextract_article("https://www.independent.ie/sport/soccer/premier-league/man-united-charged-for-failing-to-ensure-players-conducted-themselves-in-an-orderly-fashion-against-liverpool-38881375.html")
HTML(nfl_art['articleBodyHtml'].replace("article", "unformatted_article"))
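
The cell above renames the tag with a plain ``str.replace``, which also rewrites the word "article" wherever it appears in the text or in attribute values. A sketch that touches only the opening and closing tags, using a regular expression, might look like this; ``rename_article_tag`` is a hypothetical helper introduced for illustration.

```python
import re

def rename_article_tag(html, new_tag="unformatted_article"):
    """Rename <article> / </article> tags without touching the word in the text."""
    # Match "<article" or "</article" only when followed by whitespace, "/" or ">".
    return re.sub(r"(</?)article(?=[\s/>])", rf"\g<1>{new_tag}", html)

sample = "<article><p>An article about football.</p></article>"
print(rename_article_tag(sample))
```

A regular expression is enough here because ``articleBodyHtml`` is already normalized; for arbitrary HTML a real parser would be the safer choice.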

Note how only the relevant content of the article was extracted, avoiding elements
like ads, unrelated content, etc. AutoExtract relies on advanced machine learning
models that are able to discriminate between what is relevant and what is not.

Also note how figures with captions were extracted. Many
[other elements can also be present](https://doc.scrapinghub.com/autoextract.html#format-of-articlebodyhtml-field).

Having normalized HTML code has some cool advantages. One is that the content
can be formatted independently of the source with simple CSS rules.   
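
When writing such rules by hand, one common pitfall is that in a comma-separated selector list every selector needs its own ancestor prefix, otherwise the rule leaks to the rest of the page. The sketch below generates a ``<style>`` block with each selector scoped to a root element; ``scoped_css`` is a hypothetical helper and not part of any library.

```python
def scoped_css(root, rules):
    """Build a <style> block where every selector is prefixed with `root`."""
    blocks = []
    for selector, declarations in rules.items():
        # Prefix each comma-separated selector on its own: "h2, h3" -> "root h2, root h3".
        scoped = ", ".join(f"{root} {part.strip()}" for part in selector.split(","))
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        blocks.append(f"{scoped} {{ {body} }}")
    return "<style>\n" + "\n".join(blocks) + "\n</style>"

print(scoped_css("article", {"h2, h3": {"font-weight": "700"}}))
```

The returned string can be concatenated with ``articleBodyHtml`` and passed to ``HTML`` like any hand-written style block.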

In [40]:
HTML("<style>unformatted_article figcaption { color: red;}</style>")

In [None]:
def show(article, style): 
    return HTML(style + article['articleBodyHtml'])

In [28]:
url = "https://www.cnet.com/news/elon-musks-top-10-weirdest-tweets-of-2019/"
#url = "https://www.vox.com/policy-and-politics/2020/1/17/21046874/netherlands-universal-health-insurance-private"


article = autoextract_article(url)

In [125]:
style = """
<style>
    article {
        font-family: 'Benton Sans', Sans-Serif;
        line-height: 23px;
        font-size: 17.008px;
        font-style: normal;        
    }
    article h2, article h3, article h4, article h5, article h6 {
        font-family: Majerit, serif;
        font-weight: 700;
    }
    article p { 
        margin-bottom: 10px;
        color: #444;
    }
    article figcaption {
        display: table-caption; 
        caption-side: bottom;     
        border-bottom: 0.063rem dotted #D0D0D0;
        margin-bottom: 10px;
        line-height: 22px;
        font-size: 13px; 
        color: #646464; 
        text-align: center;       
    }
    article figcaption * {
        text-align: center;
        font-size: 13px; 
        color: #646464;         
    }
    article figcaption p { margin-bottom: 0px;}
    article figure { 
        display: table; 
        margin-bottom: 30px;
    }
</style>
"""
show(article, style)

In [None]:
twitter_js = """
<script>window.twttr = (function(d, s, id) {
  var js, fjs = d.getElementsByTagName(s)[0],
    t = window.twttr || {};
  if (d.getElementById(id)) return t;
  js = d.createElement(s);
  js.id = id;
  js.src = "https://platform.twitter.com/widgets.js";
  fjs.parentNode.insertBefore(js, fjs);

  t._e = [];
  t.ready = function(f) {
    t._e.push(f);
  };

  return t;
}(document, "script", "twitter-wjs"));</script>
"""

