<small><i>August 2014 - This notebook was created by [Oriol Pujol Vila](http://www.maia.ub.es/~oriol). Source and license info are in the folder.</i></small>

# Data hunting and gathering (part 2)

<img style = "border-radius:20px;" src = "http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg">

# Contents and Requirements   

**SESSION 2: When every other thing fails: Create our own web API - Scraping**

    + Understanding HTML and CSS
    + XPath selectors
    + Scraping dynamic content with Selenium   

SOFTWARE REQUIREMENTS FOR SESSION 2
    
    + Mozilla Firefox Quantum
    + Firefox Add-on: Xpath Try
    + Download Geckodriver from `https://github.com/mozilla/geckodriver/releases`
    
ADDITIONAL PYTHON LIBRARIES

    + lxml #pip install lxml
    + selenium #pip install selenium
    
NOT REQUIRED BUT USED FOR SOME EXAMPLES: 

    + I will use VLC to automatically play the scraped audio.


<div class = "alert alert-danger" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;">
**DISCLAIMER AND USER AGREEMENT:** Ensure you are allowed to use these tools for retrieving data and be respectful with web pages and apps. Ethical use of these tools is mandatory. The content provided by this notebook is for educational purposes only. 
<p>

THE NOTEBOOK/SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE NOTEBOOK CONTENTS OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</div>

<div class = "alert alert-danger" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;">
**REQUIREMENTS:** From this point on you will need:
<ul>
<li> Firefox</li>
<li> Firefox Add-on: XPath Try</li>
<li> Selenium</li>
<li> Geckodriver</li>
<li>lxml</li>
</ul>
</div>

<img style="border-radius:20px;" src="./files/big_picture.jpg">

## 3. "Making your own API": Web scraping

Sometimes data is on the web but there is no API to access it, the API lacks functionality, or its terms of service are not adequate. Since we, as humans, can see the data on the page, we might wonder how to extract it automatically. The discipline for doing so is **Web Scraping**. 

Before we start, it is useful to understand a little how web pages are created and data stored. In this section a brief introduction to web front-end development is presented. We will focus on two basic aspects:

+ Basic HTML + CSS static pages.
+ Dynamic HTML (a basic JavaScript example using JQuery).


<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**Firebug example on a page.**

Go to a page and check its contents using Inspect Element
</div>

### 3.1 Basic HTML + CSS 101

The most basic web pages are built upon HTML + CSS. This division stands for content and design, respectively. **HTML (Hypertext Markup Language)** gives websites structure and stores the content. This is our target for scraping. **CSS (Cascading Style Sheets)**, on the other hand, formats the content and singles out parts of it for visualization purposes, i.e. it defines the style (e.g. font family, color, borders, image style, relative positioning of the content, etc.). HTML files include tags and references to style, so it is worthwhile to understand a little of both technologies; this helps us scrape data more efficiently.


HTML is a tagged language usually rendered by a browser. Tags are specified in the following format:

<p style="text-align: center">&lt;tag_name *attributes*&gt; content &lt;/tag_name&gt;</p>

<p>
<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">STRUCTURE of an HTML file:

<ul>
    <li> HTML files start with &lt;!DOCTYPE html&gt;. This tells the browser that the document uses HTML5. Earlier versions of the HTML standard used several different doctype declarations. </li>
    <li> The first tag in a web page is &lt;html&gt; and its corresponding &lt;/html&gt; closing tag. All the web page is found inside these tags. </li>
    <li> HTML files have a &lt;head&gt; and a &lt;body&gt; </li>
    <li> In the head we find the &lt;title&gt; tag, which specifies the web page's name. We can also find references to CSS stylesheets (&lt;link&gt;) used for formatting the page and links to JavaScript files (&lt;script&gt;) that give the web page dynamic behavior.</li>
    <li> In the body we find the content of the page. </li> 
        <ul>
            <li> Headings and text paragraphs can be created using &lt;h#&gt; (# is a number from 1 to 6) and &lt;p&gt;, respectively. </li>
            <li> Hyperlinks (links) are given in the <strong>href</strong> attribute of the &lt;a&gt; (anchor) tag. </li>
            <li> Images can be embedded using the &lt;img&gt; tag and setting the <strong>src</strong> attribute to the resource. Caution: img is a special tag and does not have a closing tag, e.g. &lt;img src = "my_pic.jpg" /&gt; </li>
        </ul>
</ul>
</div>
</p>
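
To connect this structure with scraping, here is a minimal sketch that parses a tiny page and reads the pieces listed above. The page content and the URL `http://example.com` are made up for illustration, and `lxml` (used later in this notebook) is assumed to be installed.

```python
from lxml import etree

# A tiny, made-up page that follows the structure described above.
page = """<!DOCTYPE html>
<html>
  <head><title>My page</title></head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph with a <a href="http://example.com">link</a>.</p>
    <img src="my_pic.jpg" />
  </body>
</html>"""

tree = etree.HTML(page)

title = tree.xpath('/html/head/title/text()')  # text inside <title>
links = tree.xpath('//a/@href')                # href of every anchor
images = tree.xpath('//img/@src')              # src of every image

print(title, links, images)
```

The selection strings passed to `xpath` are explained in section 3.3; for now it is enough to notice that they follow the nesting of the tags.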

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us build a basic HTML web page, adding the following tags. Remember that nearly all tags must be closed using &lt;/tag&gt;

+ DOCTYPE
+ html
+ head
+ title
+ body

<ol>
<li>Create a file 'example.html' in your favorite editor.</li>
<li>Create a basic html web page containing a *title*, *h1*, *p*, *img* and *a* tags.</li>
</ol>
</div>

If you are lazy go to the files folder and double-click on "example.html". You can check the html code executing the following line.

In [47]:
%%html

	<head>
		<title>
			Basic knowledge for web scraping.
		</title>	
	</head>
	<body>
		<h1>About HTML
		</h1>
		<p>HTML (Hypertext Markup Language) is the basic language for providing content on the web. It is a tagged language. You can read more about it at the <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
        
        <p> One of the following rubberduckies is clickable
	</p>
	<p>
            <img src = "files/rubberduck.jpg"/>
        
            <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
        </p>
	</body>



<div class  = "alert alert-success">** EXERCISE ** <p>
Change the type of the previous cell to *Markdown* and execute it (SHIFT+ENTER). For the images to show you must use the relative path to the image, e.g. ./files/rubberduck.jpg
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Old style HTML** static pages rely heavily on tables and lists: 

<ul>
<li> Making ordered and unordered lists is simple: *ol* (ordered list), *ul* (unordered list) are the main tags. Each item is inserted as *li* (list item) </li>
<li> *table* is the containing tag for building tables, each table row is given as *tr* and columns depend on the table data elements *td*. Tables may have a head (*thead*) and a body (*tbody*). *th* is the same as *td* but for the header. If you want a multi column cell then use colspan=number of cells to cover.
</li>
</ul>
</div>

The next example shows a simple table build. Check the markdown code.

<table>
<thead>
<tr><th colspan = 2>A table</th></tr>
</thead>
<tbody>
<tr>
<td>Hello I am element 1.1</td><td>Hello I am element 1.2</td>
</tr>
<tr>
<td colspan=2>Hello I am element 2.1 and 2.2</td>
</tr>
</tbody>
</table>
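
Tables like this one are a frequent scraping target. As a rough sketch (assuming `lxml` is installed), the same table can be read row by row from Python; the variable names are only illustrative.

```python
from lxml import etree

# The table shown above, as a string.
table_html = """
<table>
<thead>
<tr><th colspan="2">A table</th></tr>
</thead>
<tbody>
<tr><td>Hello I am element 1.1</td><td>Hello I am element 1.2</td></tr>
<tr><td colspan="2">Hello I am element 2.1 and 2.2</td></tr>
</tbody>
</table>
"""

tree = etree.HTML(table_html)

# Header cells live in thead, data cells in tbody.
header = tree.xpath('//thead/tr/th/text()')
# One list of cell texts per table row.
rows = [row.xpath('./td/text()') for row in tree.xpath('//tbody/tr')]

print(header)
print(rows)
```

Note that a cell spanning two columns still appears as a single `td`, so rows may have different lengths.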

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Current HTML** static pages rely heavily on containers and style: 

<ul>
<li> *div* stands for division and marks a block of content.
</li>
<li> *span* is used to single out an element of a block content.
</li>
</ul>

</div>

By themselves they are not much but when combined with the *style* attribute they become interesting.

For example, consider the following example of code:

<div style = "width:100px;height:100px;background-color:red;padding:10px;font-family:Verdana;font-size:24px;color:pink;display:inline-block">  Box 1
</div>
<div style = "width:100px;height:100px;background-color:blue;padding:10px;font-family:Futura;font-size:24px;color:lightblue;display:inline-block">  Box 2
</div>
<div style = "width:100px;height:100px;background-color:yellow;padding:10px;font-family:Garamond;font-size:24px;color:orange;display:inline-block">  Box 3
</div>
<div style = "width:100px;height:100px;background-color:green;padding:10px;font-family:ArialNarrow;font-size:24px;color:lightgreen;display:inline-block">  Box 4
</div>

The attribute *style* is also referred to as *inline CSS* and lets us give the skeleton some skin and makeup.

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us build a basic HTML web page and check the magic of CSS in action before going in detail into CSS.
<ol>
<li>Create a file 'example2.html' using your favorite editor.</li>
<li>Fill the header and body basic HTML structure</li>
<li>Let us add three containers *div* in the body.</li> 
<li>Select one of them. This will be used as a navigation bar and will contain an unordered list with three elements: Home, Brief Bio, Hobbies</li>
<li>Select another division and create a table inside. Each row will contain information about your profile, e.g. the first row may contain Name: Your Name, the second row Position: Your current position, etc</li>
<li>The last one will contain an image of yourself and a paragraph with your contact info (email)</li>
</ol>
<p>
Check the result. It looks nearly professional, doesn't it?
</p>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE FOLLOW UP**

Let us add some style.
<ol>
<li>Add the class "navbar" as an attribute to the *div* containing the list. (eg. class = "navbar")</li>
<li>Add the class "head" to the *div* containing the image and the email.</li>
<li>Add the class "right" to the *div* containing the table.</li>
<li>Add the identifier "email" to the paragraph containing the email. (eg. id = "email")</li>
<li>Finally, let us link the style definitions for the classes and ids we have just added by inserting the following line in the head tag:
<p>&lt;link type="text/css" rel="stylesheet" href="stylesheet.css"/&gt;</p>
</li>
</ol>
<p>
Check the result now. Do not forget to hover over your navigation bar.
</p>
</div>

The former exercise is a very simple demonstration of the separation between content and styling. Observe that the HTML file you have created does not contain any explicit styling. However, we have added two new elements to the mix: classes and identifiers as attributes of the tags. As you can imagine, the styling rules for each class and id are collected in the stylesheet.css file we have just linked.

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">**COMMENT:**
Very simple formatting can also be given using HTML tags. For example, the *strong* and *em* tags produce bold and italic text, respectively.
</div>

**CSS (which stands for Cascading Style Sheets)** is a language used to describe the appearance and formatting of your HTML. A style sheet is a file that describes how an HTML file should look. The word cascading refers to the fact that more specific style rules override more generic ones. We will see that in a minute. 


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">FORMAT of a CSS file:

<ol>
    <li> CSS files contain a set of style rules applied to a certain selection of the content of the HTML file. The format is as follows:  <span style = "font-family:Courier;color:gray">css_selector { 
                property: value;
        }</span>
    and may contain many properties.
    </li>
    <li> <span style = "font-family:Courier;color:gray"> css_selector </span> identifies a certain context of the Document Object Model (DOM), i.e. it lets us traverse the DOM and select specific blocks. For example, 
      <span style = "font-family:Courier;color:gray">div { color:red; }</span>
    selects all *div* tags and applies a red font color to their content.
    </li>
    <li> We can use any html tag as element for selection.</li>
</ol>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
    **CLASSES:**
        <ul>
        <li> We may link/format a certain set of tags of the html file with a unique CSS style by means of a **class**. </li>
        <li> In HTML the class is defined as an attribute and can be shared among tags, e.g. &lt;div class = "my_class"&gt; and &lt;p class = "my_class"&gt;
        </li>
        <li> In the CSS file, the class is identified by a dot preceding the name, e.g.
         <p style = "font-family:Courier;margin-left:100px;">
            .my_class { font-family:Verdana; }
        </p>
        </li>
        </ul>
    </div>


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
    **IDENTIFIERS:**
        <ul>
        <li> If we want to single out an element to apply a certain style we can use an **identifier**. </li>
        <li> In HTML the identifier is defined as the attribute **id**, e.g. &lt;div id = "my_ID"&gt;
        </li>
        <li> In the CSS file, the identifier name is preceded by a hash sign (#), e.g.
         <p style = "font-family:Courier;margin-left:100px;">
            #my_ID { font-size:24px; }
        </p>
        </li>
        </ul>
</div>
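
Classes and identifiers are also the most convenient handles for a scraper. The following sketch anticipates the XPath syntax of section 3.3 and shows one possible way to select by class and by id with `lxml`; the snippet and its names (`my_class`, `my_ID`) simply reuse the examples above.

```python
from lxml import etree

# Made-up fragment reusing the class and id names from the boxes above.
snippet = """
<div class="my_class">first block</div>
<p class="my_class other">a paragraph</p>
<div id="my_ID">the special block</div>
"""
tree = etree.HTML(snippet)

# Elements whose class attribute is exactly "my_class".
exact = tree.xpath('//*[@class="my_class"]/text()')

# Elements whose space-separated class list contains the token "my_class".
token = tree.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " my_class ")]/text()'
)

# The single element carrying the identifier.
by_id = tree.xpath('//*[@id="my_ID"]/text()')

print(exact, token, by_id)
```

The exact comparison misses the paragraph because its class attribute holds two class names; the token test is closer to what the CSS selector `.my_class` does.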

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Check stylesheet.css and inspect the formatting of the identifiers and the classes.
<ol>
<li>
Create an identifier name_props that changes the font-family to Courier.
</li>
<li>
Add this identifier to the *td* tag with your name in the profile.
</li>
</ol>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us make our own css style sheet and explore a little of the more advanced CSS selectors.
<ol>
<li>Create a file 'mystylesheet.css' using your favorite editor and open example3.html in your browser.</li>
<li>What happens if we add the following style command? 
<p style="font-family:Courier">div {font-family:Verdana;font-size:16px;}</p></li>
<li>Add the following line in the style sheet:
<p style="font-family:Courier">div div{color:red;}</p></li>
<li>Add the following line in the style sheet:
<p style="font-family:Courier">div>div{color:green;}</p></li>
<li>Add a style so that the container with Bruce Lee's comments on Choy Li Fut has *font-size:14px*, *font-family:Courier*, *background-color:#FFCC66*, and *color:yellow*.</li>
<li>Add a style to the *img* tag with *height* of *230px* and *width* of *200px*. Set the width of the table to 700 pixels.</li>
</ol>
</div>

In this last exercise we have seen another type of CSS selection. The HTML document can be seen as a tree structure. The root of the tree is the *html* tag. It has two children, *head* and *body*. Head may have different children such as *title*, *link*, or *script*. Body may have any combination of tags: *div*, *p*, *a*, etc. These tags can be nested, e.g. we can find a *div* inside a *div* inside a *div*. In the example we have seen how to refer to nested elements. The elements can be HTML tags, classes or identifiers.

+ "elem1 elem2" refers to any elem2 nested inside an elem1 at any depth (there may be any number of elements in between).
+ "elem1>elem2" refers only to an elem2 whose direct parent is an elem1.
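
The difference between the two combinators can be checked from Python. This is only a sketch on a made-up fragment: it uses the XPath counterparts of the two CSS selectors (`//div//div` for "div div" and `//div/div` for "div>div"), with `lxml` assumed to be installed.

```python
from lxml import etree

# div "b" is a direct child of div "a"; div "c" is nested deeper, inside a table cell.
nested = """
<div id="a">
  <div id="b">
    <table><tr><td><div id="c">deep</div></td></tr></table>
  </div>
</div>
"""
tree = etree.HTML(nested)

# Counterpart of the CSS selector "div div": any div below another div.
descendants = tree.xpath('//div//div/@id')

# Counterpart of the CSS selector "div>div": only divs whose direct parent is a div.
children = tree.xpath('//div/div/@id')

print(descendants, children)
```

Here `c` is matched only by the descendant form, because its direct parent is a `td`.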

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE:**
What does "div table>img" select?

</div>

<div class = "alert alert-danger" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;">We can use CSS selectors to access elements in the document. However, I prefer to introduce XPath for that purpose.

</div>

## 3.3 Selecting elements with XPATH

XPath is an alternative way of navigating through XML-like documents. It follows a similar structure to file directory navigations. In this sense, we can define an absolute path using `/`. This means that we have to give the complete path to the element we want to select. For example `xpath('/html/body/p')` will select all the paragraphs in the body of the html root.

If the path starts with `//` we are not starting at the root but will select matching elements anywhere in the hierarchy. For example  `xpath('//a/div')` will select every 'div' that is a direct child of an 'a', anywhere in the document. 

We may also use wildcards such as `*`. For example `xpath('//a/div/*')` will return all the child elements of any a/div anywhere in the document. And `xpath('/*/*/div')` will look for divs two levels below the root element.

If the selection returns more than one element we can choose one using brackets. Positions start at 1. For example `xpath('//a/div[1]')` will return the first div child of each 'a' and `xpath('//a/div[last()]')` the last one. To index the whole result set instead, wrap the path in parentheses: `xpath('(//a/div)[1]')`.

We can work with attributes using `@`. `xpath('//@name')` returns all attributes called 'name' anywhere in the document, and `xpath('//div[@name]')` selects, from all the divs in the document, only those that have an attribute 'name'. Note that it selects the divs, not the attributes. `xpath('//div[not(@*)]')` will return all the divs without attributes. We can even look for specific attribute values: `xpath('//div[@name="chachiname"]')`.
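
The following sketch tries these expressions on a small made-up document (lists instead of links, to keep the fragment simple). It assumes `lxml` is installed; the tag and attribute values are illustrative only.

```python
from lxml import etree

doc = """
<html><body>
  <p>intro</p>
  <ul>
    <li name="first">A</li>
    <li>B</li>
  </ul>
  <ul>
    <li name="chachiname">C</li>
  </ul>
</body></html>
"""
tree = etree.HTML(doc)

absolute = tree.xpath('/html/body/p/text()')       # full path from the root
anywhere = tree.xpath('//ul/li/text()')            # li children of any ul
wild = tree.xpath('//ul/*')                        # every child element of any ul
second_level = tree.xpath('/*/*/ul')               # ul two levels below the root element

first_each = tree.xpath('//ul/li[1]/text()')       # first li of EACH ul
first_all = tree.xpath('(//ul/li)[1]/text()')      # first li of the whole result set
last_each = tree.xpath('//ul/li[last()]/text()')   # last li of each ul

names = tree.xpath('//@name')                      # the attribute values themselves
with_name = tree.xpath('//li[@name]/text()')       # li elements that have a name
no_attrs = tree.xpath('//li[not(@*)]/text()')      # li elements without attributes
exact = tree.xpath('//li[@name="chachiname"]/text()')

print(first_each, first_all, names, no_attrs)
```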

There are built-in functions that help in locating elements, such as `count()`,
`name()`, `starts-with()`, `contains()`. For example, `xpath('//*[contains(name(),"iv")]')` will select all elements anywhere in the document whose tag name contains the substring 'iv'; and `xpath('//*[count(div)=2]')` will return all elements with exactly two div children.

We can combine the elements selected by several paths using `|` (union), e.g. `xpath('//div/p|//div/a')` returns the elements matching either div/p or div/a.

We can refer to the parent, ancestors, children, or descendants in a path using axes, e.g. `xpath('//div/div/parent::*')` returns the parent nodes of every div that is a direct child of a div.
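
As a quick check of functions, unions and axes, here is a sketch on another made-up fragment (the ids `outer`, `x` and `y` are only illustrative; `lxml` is assumed to be installed).

```python
from lxml import etree

doc = """
<html><body>
  <div id="outer">
    <div id="x">one</div>
    <div id="y">two</div>
    <p>para</p>
    <a href="#">link</a>
  </div>
</body></html>
"""
tree = etree.HTML(doc)

# name() + contains(): elements whose tag name contains "iv" (only the divs here).
iv_tags = [el.tag for el in tree.xpath('//*[contains(name(), "iv")]')]

# count(): elements with exactly two div children.
two_divs = tree.xpath('//*[count(div)=2]/@id')

# starts-with(): attributes can be tested by prefix.
hash_links = tree.xpath('//a[starts-with(@href, "#")]/text()')

# Union of two paths; results come back in document order.
union = tree.xpath('//div/p/text() | //div/a/text()')

# parent axis: the parents of every div that is a child of a div.
parents = tree.xpath('//div/div/parent::*/@id')

print(iv_tags, two_divs, hash_links, union, parents)
```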




|Expression|Description|
|----------|-----------|
|nodename|Selects all nodes with the name "nodename"|
|/	|Selects from the root node|
|//	|Selects nodes in the document from the current node that match the selection no matter where they are|
|.	|Selects the current node|
|..	|Selects the parent of the current node|
|@	|Selects attributes|
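
With `lxml`, `xpath` can also be called on an element, in which case `.` and `..` are relative to that element. A small sketch on a made-up table row (the id `label` is hypothetical):

```python
from lxml import etree

doc = """
<table>
  <tr><td><span id="label">Importe</span></td><td><span>100</span></td></tr>
</table>
"""
tree = etree.HTML(doc)

# Pick one element and use it as the current node.
label = tree.xpath('//span[@id="label"]')[0]

own_text = label.xpath('./text()')                   # text of the current node
row_tag = label.xpath('../..')[0].tag                # span -> td -> tr
value = label.xpath('../../td[2]/span/text()')       # sibling cell in the same row

# A leading // is still absolute, even when called on an element...
all_spans = label.xpath('//span/text()')
# ...while .// only looks below the current node.
below = label.xpath('.//span')

print(own_text, row_tag, value, all_spans, below)
```

Starting from a known element and climbing with `..` is the same idea used further down to reach the value cell next to a labelled cell.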

Want more? Check https://www.guru99.com/xpath-selenium.html


<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**XPATH Exercises**:
<ol>
<li>Which text appears in the lists of the page?</li>
<li>Get all the attributes on the page.</li>
<li>Is there any link on the page?</li>
<li>Can you get the style sheet information?</li>
<li>Select the divs that have the class "container".</li>
<li>What happens if you remove the brackets in the last one?</li>
</ol>
</div>

In [1]:
from lxml import etree
import urllib.request

# Download the page and parse it into an element tree that we can query with XPath.
url = "https://www.google.com"
html = urllib.request.urlopen(url).read()
tree = etree.HTML(html)


In [2]:
tree.xpath('//div/text()')
#tree.xpath('//li/a/text()')

['Ofrecido por Google en:  ', '    ', '    ', '  ']

In [3]:
tree.xpath('//@*')

['',
 'http://schema.org/WebPage',
 'es',
 'Google.es permite acceder a la información mundial en castellano, catalán, gallego, euskara e inglés.',
 'description',
 'noodp',
 'robots',
 'text/html; charset=UTF-8',
 'Content-Type',
 '/images/branding/googleg/1x/googleg_standard_color_128dp.png',
 'image',
 'VKhMWXnZ8LFXDyyVI3yiXg==',
 'VKhMWXnZ8LFXDyyVI3yiXg==',
 '#fff',
 'VKhMWXnZ8LFXDyyVI3yiXg==',
 'mngb',
 'gbar',
 'gb1',
 'gb1',
 'https://www.google.es/imghp?hl=es&tab=wi',
 'gb1',
 'https://maps.google.es/maps?hl=es&tab=wl',
 'gb1',
 'https://play.google.com/?hl=es&tab=w8',
 'gb1',
 'https://www.youtube.com/?gl=ES&tab=w1',
 'gb1',
 'https://news.google.com/?tab=wn',
 'gb1',
 'https://mail.google.com/mail/?tab=wm',
 'gb1',
 'https://drive.google.com/?tab=wo',
 'gb1',
 'text-decoration:none',
 'https://www.google.es/intl/es/about/products?tab=wh',
 'guser',
 '100%',
 'gbn',
 'gbi',
 'gbf',
 'gbf',
 'gbe',
 'http://www.google.es/history/optout?hl=es',
 'gb4',
 '/preferences?hl=es',
 'g

In [5]:
tree.xpath('//@href')

['https://www.google.es/imghp?hl=es&tab=wi',
 'https://maps.google.es/maps?hl=es&tab=wl',
 'https://play.google.com/?hl=es&tab=w8',
 'https://www.youtube.com/?gl=ES&tab=w1',
 'https://news.google.com/?tab=wn',
 'https://mail.google.com/mail/?tab=wm',
 'https://drive.google.com/?tab=wo',
 'https://www.google.es/intl/es/about/products?tab=wh',
 'http://www.google.es/history/optout?hl=es',
 '/preferences?hl=es',
 'https://accounts.google.com/ServiceLogin?hl=es&passive=true&continue=https://www.google.com/&ec=GAZAAQ',
 '/advanced_search?hl=es&authuser=0',
 'https://www.google.com/setprefs?sig=0_RJSBwLz86swv8bSyl4drz993Gpw%3D&hl=ca&source=homepage&sa=X&ved=0ahUKEwjukKu1_8fsAhVp8OAKHQSrAtAQ2ZgBCAU',
 'https://www.google.com/setprefs?sig=0_RJSBwLz86swv8bSyl4drz993Gpw%3D&hl=gl&source=homepage&sa=X&ved=0ahUKEwjukKu1_8fsAhVp8OAKHQSrAtAQ2ZgBCAY',
 'https://www.google.com/setprefs?sig=0_RJSBwLz86swv8bSyl4drz993Gpw%3D&hl=eu&source=homepage&sa=X&ved=0ahUKEwjukKu1_8fsAhVp8OAKHQSrAtAQ2ZgBCAc',
 '/intl

In [6]:
tree.xpath('//style')

[<Element style at 0x7faa5edaff00>,
 <Element style at 0x7faa5edaffa0>,
 <Element style at 0x7faa5edb6050>]

In [7]:
for item in tree.xpath('//a'):
    print(item.values())

['gb1', 'https://www.google.es/imghp?hl=es&tab=wi']
['gb1', 'https://maps.google.es/maps?hl=es&tab=wl']
['gb1', 'https://play.google.com/?hl=es&tab=w8']
['gb1', 'https://www.youtube.com/?gl=ES&tab=w1']
['gb1', 'https://news.google.com/?tab=wn']
['gb1', 'https://mail.google.com/mail/?tab=wm']
['gb1', 'https://drive.google.com/?tab=wo']
['gb1', 'text-decoration:none', 'https://www.google.es/intl/es/about/products?tab=wh']
['http://www.google.es/history/optout?hl=es', 'gb4']
['/preferences?hl=es', 'gb4']
['_top', 'gb_70', 'https://accounts.google.com/ServiceLogin?hl=es&passive=true&continue=https://www.google.com/&ec=GAZAAQ', 'gb4']
['/advanced_search?hl=es&authuser=0']
['https://www.google.com/setprefs?sig=0_RJSBwLz86swv8bSyl4drz993Gpw%3D&hl=ca&source=homepage&sa=X&ved=0ahUKEwjukKu1_8fsAhVp8OAKHQSrAtAQ2ZgBCAU']
['https://www.google.com/setprefs?sig=0_RJSBwLz86swv8bSyl4drz993Gpw%3D&hl=gl&source=homepage&sa=X&ved=0ahUKEwjukKu1_8fsAhVp8OAKHQSrAtAQ2ZgBCAY']
['https://www.google.com/setprefs?

In [8]:
for item in tree.xpath('//link[contains(@href,"css")]'):
    print (item.values())

In [9]:
tree.xpath('//div[@id="SIvCob"]/text()') #exact match
#tree.xpath('//div[contains(@class,"container")]')

['Ofrecido por Google en:  ', '    ', '    ', '  ']

In [10]:
tree.xpath('//div/@class="container"') # without brackets this is a boolean test, not a node selection


False
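
The `False` above is not an empty selection: without brackets the expression is a comparison, so XPath returns a boolean. The sketch below, on a made-up fragment, shows the different result types `lxml` can return (a list of nodes, a boolean, a number, a string).

```python
from lxml import etree

tree = etree.HTML('<div class="container">a</div><div class="other">b</div>')

# With brackets: a predicate that filters nodes, so we get a list of elements.
nodes = tree.xpath('//div[@class="container"]')

# Without brackets: "is there any div class attribute equal to 'container'?"
flag = tree.xpath('//div/@class="container"')

# Functions at the top level return numbers or strings.
n = tree.xpath('count(//div)')      # a float
s = tree.xpath('string(//div)')     # text of the first matching node

print(len(nodes), flag, n, s)
```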

## Practice XPATH with your browser 

We need Firefox and the XPath Try add-on.


Let us suppose we want to scrape licitaciones (public tenders) from `https://contrataciondelestado.es`. Our use case is a simple one this time. We want to automatically:

1. Go to Licitaciones.
2. Get `organo de contratacion`, `categoria`, `importe` from all elements in the first page.


<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**EXERCISE IN PAIRS:** Take the inspector and write down what you need to look for completing points 1 and 2.
</div>

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">
**SOLUTION:**
<ul>
<li> **Point 1:** Observe that it directly points to `/wps/portal/licRecientes`. So we do not really need to use any automation here. We can directly feed `https://contrataciondelestado.es/wps/portal/licRecientes`.
</li>
    
<li> **Point 2:** This is a messed-up tabular result. But we find that `categoria` in the first element is governed by  `id = ab_hcategoria1`, `organo` by `id = ab_hocontratacion1`, and `importe` by `id=ab_hprecio1`. So we can get all results from the page iterating on the last number
</li>    
</ul>

</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**EXERCISE IN PAIRS:** Install in Firefox the Add-On `XPath Try`. 
Check how to find the value of the quantity `price`. Write down the XPath.
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**EXERCISE IN PAIRS:** Write down the code that scrapes that value.
</div>

In [11]:
#My code

from lxml import etree
import urllib.request

url = "https://contrataciondelestado.es/wps/portal/licRecientes"
html = urllib.request.urlopen(url).read()
tree = etree.HTML(html)

tree.xpath('//*[@id="ab_hprecio1"]//parent::tr/td[2]/span/text()')




['56.402,14']

Let us iterate on all values of the web page.

In [12]:
from lxml import etree
import urllib.request

url = "https://contrataciondelestado.es/wps/portal/licRecientes"
html = urllib.request.urlopen(url).read()
tree = etree.HTML(html)

for i in range(1,10):
    print('ab_hprecio'+str(i))
    query = '//*[@id="ab_hprecio'+str(i)+'"]//parent::tr/td[2]/span/text()'
    value = tree.xpath(query)
    print(value)
    

ab_hprecio1
['5.004,04']
ab_hprecio2
['34.500,00']
ab_hprecio3
['56.402,14']
ab_hprecio4
['14.784,00']
ab_hprecio5
['1.033.136,48']
ab_hprecio6
['34.710,74']
ab_hprecio7
[]
ab_hprecio8
[]
ab_hprecio9
[]


In [13]:
from lxml import etree
import urllib.request

url = "https://contrataciondelestado.es/wps/portal/licRecientes"
html = urllib.request.urlopen(url).read()
tree = etree.HTML(html)

fin=False
i=1
while not(fin):
    try:
        query='//*[@id="ab_hprecio'+str(i)+'"]//parent::tr/td[2]/span/text()'
        value = tree.xpath(query)
        print(value[0])
        i=i+1
    except IndexError:
        fin=True
    

5.004,04
34.500,00
56.402,14
14.784,00
1.033.136,48
34.710,74
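
An alternative to looping over the numeric suffix is to let XPath match every id with the same prefix using `starts-with()`. The sketch below runs on a made-up fragment, not on the live page: the markup is an assumption that mimics the layout the queries above rely on (a labelled cell followed by a cell holding the value in a `span`).

```python
from lxml import etree

# Hypothetical markup: the first cell carries the id, the second one the value.
sample = """
<table>
  <tr><td id="ab_hprecio1">Importe</td><td><span>5.004,04</span></td></tr>
  <tr><td id="ab_hprecio2">Importe</td><td><span>34.500,00</span></td></tr>
  <tr><td id="ab_hcategoria1">Categoria</td><td><span>Servicios de salud.</span></td></tr>
</table>
"""
tree = etree.HTML(sample)

# Every element whose id starts with "ab_hprecio", then the value cell of its row.
query = '//*[starts-with(@id, "ab_hprecio")]/parent::tr/td[2]/span/text()'
prices = tree.xpath(query)

print(prices)
```

A single query avoids guessing how many elements the page holds, at the price of losing the explicit index of each element.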


What about the rest of the elements from the other pages?

We are **stuck** at this point. But before proceeding let us wrap up the former scraping:

In [14]:
from lxml import etree
import urllib.request

url = "https://contrataciondelestado.es/wps/portal/licRecientes"
html = urllib.request.urlopen(url).read()
# Parse explicitly as UTF-8 so that accented characters are decoded correctly.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.HTML(html, parser=parser)

fin=False
i=1
col = []
while not(fin):
    doc={}
    try:
        doc['id']=i
        query='//*[@id="ab_hprecio'+str(i)+'"]//parent::tr/td[2]/span/text()'
        value = tree.xpath(query)
        doc['importe']=value[0]
        query='//*[@id="ab_hcategoria'+str(i)+'"]//parent::tr/td[2]/span/text()'
        value = tree.xpath(query)
        doc['categoria']=value[0]
        query='//*[@id="ab_hocontratacion'+str(i)+'"]//parent::tr/td[2]/span/text()'
        value = tree.xpath(query)
        doc['organo']=value[0]
        i=i+1
        col.append(doc)
    except IndexError:
        # No element with this index: we have reached the end of the page.
        fin=True

col

[{'id': 1,
  'importe': '50.000,00',
  'categoria': 'Servicios de exposición en museos.',
  'organo': 'Presidencia de la Diputación Provincial de Valencia'},
 {'id': 2,
  'importe': '5.004,04',
  'categoria': 'Servicios de arquitectura, construcción, ingeniería e inspección.',
  'organo': 'Servicio de Salud de las Illes Balears'},
 {'id': 3,
  'importe': '30.000,00',
  'categoria': 'Barreras de seguridad.',
  'organo': 'Junta de Gobierno Local de Ayuntamiento de Priego de Córdoba'},
 {'id': 4,
  'importe': '34.500,00',
  'categoria': 'Luces para iluminación exterior.',
  'organo': 'Junta de Gobierno del Ayuntamiento de Maó-Mahón'},
 {'id': 5,
  'importe': '56.402,14',
  'categoria': 'Servicios de salud.',
  'organo': 'Director Gerente de Mutua Intercomarcal MATEPSS Nº 39'},
 {'id': 6,
  'importe': '14.784,00',
  'categoria': 'Material médico fungible.',
  'organo': 'Servicio Cántabro de Salud'}]
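
The scraped `importe` values are strings in Spanish number format (dot as thousands separator, comma as decimal mark). If we want to compute with them, a small helper can convert them; `parse_importe` is a hypothetical name introduced here, not part of any library.

```python
def parse_importe(text):
    """Convert a Spanish-formatted amount such as '1.033.136,48' to a float."""
    # Drop the thousands separators, then turn the decimal comma into a dot.
    return float(text.replace('.', '').replace(',', '.'))

amounts = ['5.004,04', '34.500,00', '1.033.136,48']
values = [parse_importe(a) for a in amounts]

print(values)
```

For monetary data where exact cents matter, `decimal.Decimal` from the standard library could be used instead of `float`.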

## 3.4 The inexistent data scraping case

As a simple exercise, try to scrape the numerical value in the text box of the hidden.html file.

In [24]:
from IPython.display import IFrame
IFrame("file:./files/hidden.html", width=700, height=350)

In [18]:
# %load files/hidden.html
<!DOCTYPE html>
<html>
<head>
<title>The hidden scraper</title>
<link rel='stylesheet' type='text/css' href='hiddenstylesheet.css'/>
        <script type='text/javascript' src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js">
</script>
        <script type='text/javascript' src='hiddenscript.js'></script>
</head>
<body>
<div></div>
</body>
</html>


SyntaxError: invalid syntax (<ipython-input-18-290d09de4567>, line 2)

In [22]:
#Solution
from urllib.request import urlopen
socket = urlopen("file:./files/hidden.html")
print (socket.read().decode('latin-1'))

<!DOCTYPE html>
<html>
<head>
<title>The hidden scraper</title>
<link rel='stylesheet' type='text/css' href='hiddenstylesheet.css'/>
        <script type='text/javascript' src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js">
</script>
        <script type='text/javascript' src='hiddenscript.js'></script>
</head>
<body>
<div></div>
</body>
</html>



... and the value?

Problems and limitations of lxml and basic scraping techniques:

+ Content generated after the DOM loads. We only get the HTML that was in the response; anything added later (for example by JavaScript) is never retrieved.
+ Really broken HTML/XML.
+ Proprietary or login-protected content, which can be difficult depending on the login flow of the page.
+ Forms that require JavaScript interaction.
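
The first limitation can be seen directly on the hidden.html source printed above. In this sketch the markup is pasted as a string (trimmed to one script tag) and parsed with `lxml`: the `div` exists, but it is empty, because its content is only created when a browser runs the script.

```python
from lxml import etree

# Static source of the page, as returned by the server (trimmed).
hidden_html = """<!DOCTYPE html>
<html>
<head>
<title>The hidden scraper</title>
<script type='text/javascript' src='hiddenscript.js'></script>
</head>
<body>
<div></div>
</body>
</html>"""

tree = etree.HTML(hidden_html)

divs = tree.xpath('//div')             # the container is present...
texts = tree.xpath('//div//text()')    # ...but holds no text at all
scripts = tree.xpath('//script/@src')  # the value is produced by this script

print(len(divs), texts, scripts)
```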

# 4 Advanced scraping using automation tools

We see the data in our web browser, but the data is not directly found in the HTML. However, "the data is out there": it has been dynamically generated by a function call. Thus there are two versions of the web page. The first contains static data and function calls; the second contains the data after those function calls have been interpreted. The question now is how to access this post-interpretation data. There are several ways. One is to run our own JavaScript interpreter, such as node.js. Another is to take advantage of the browser and use it as the interpreter.

Automation tools such as mechanize or selenium are suites designed for testing web interfaces automatically from scripts. They let us start a browser and interact with the web page the same way a human user would. We can use these tools for our scraping purposes.


## The Cepstral demo and our new goal.
<small>An updated version of the case study by Asheesh Laroia (paulproteus on GitHub)</small>

Our new goal is to deal with dynamically generated data, as in the following case. Cepstral is a text-to-speech provider. Let us check its web page.

We will need to download geckodriver so that Selenium can drive Firefox:
https://github.com/mozilla/geckodriver/releases

In [25]:
!ls


3. scraping_EEUB_student_py36-s02.ipynb geckodriver-v0.26.0-macos.tar.gz
3. scraping_EEUB_student_py36_s01.ipynb geckodriver.log
3. scraping_solutions_py36-s02.ipynb    geckodriver_old
3. scraping_solutions_py36_s01.ipynb    geckodriver_old_new
3. scraping_student_py36_s01.ipynb      ine
RPA_example_UIVISION.rtf                micoleccion
dict.txt                                micoleccion.pkl
files                                   scraped_image.bmp
geckodriver                             styles


In [26]:
from IPython.display import HTML
HTML('<iframe src="http://cepstral.com" width=700 height=350></iframe>')

Our goal is to retrieve the audio file that has been played using web scraping techniques. Let us see how we can do it.

In [30]:
from selenium import webdriver

browser = webdriver.Firefox(executable_path=r'./geckodriver')
browser.get('http://seleniumhq.org/')

In [32]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'./geckodriver')
driver.get("http://google.com/")
print ("Headless Firefox Initialized")
driver.quit()

Headless Firefox Initialized


Let us run the demo first with a normal browser to check what it does and then we move to a headless browser.

In [34]:
#CEPSTRAL DEMO
%reset -f

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = False  # show the browser window so we can watch what happens

url = 'http://www.cepstral.com/en/demos'  # address of the web page
browser = webdriver.Firefox(options=options, executable_path=r'./geckodriver')  # open a Firefox browser
browser.get(url)

# Type our own sentence into the demo text box and submit it
element = browser.find_element_by_css_selector("#demo_text")
element.clear()
s = ' My word is Milan. But I also enjoy Roma. But this summer I will travel to Moscow'
element.send_keys(s)
browser.find_element_by_css_selector('#demo_submit').click()

# Wait up to 30 seconds for the <audio> element to appear, then grab the rendered HTML
browser.implicitly_wait(30)
browser.find_element_by_css_selector('audio')
html = browser.page_source
browser.quit()
print('DONE')

DONE


In [4]:
print (html)


<html class=" js audio"><head>
				<title>Cepstral - Demo High Quality Text to Speech Voices Full of Personality for Free</title>
		<!-- meta -->
		<meta charset="utf-8"> 
		<meta name="description" content="Demo Cepstral text to speech voices for free. Discover the only text to speech provider that offers natural voices that have personality and style.">
		<link href="http://www.cepstral.com/favicon.ico" rel="SHORTCUT ICON">
		<!-- stylesheets -->
		<!-- CSS -->
					            <link rel="stylesheet" type="text/css" href="/media/css/main.css?v=2.1.6">
	    			            <link rel="stylesheet" type="text/css" href="/media/css/std_page.css?v=2.1.6">
	    			            <link rel="stylesheet" type="text/css" href="/media/css/demos.css?v=2.1.6">
	    			            <link rel="stylesheet" type="text/css" href="/media/css/plugins/jquery-ui-1.10.0.custom.min.css?v=2.1.6">
	    		<!--[if lt IE 9]>
			<script src="/media/js/html5shiv.js"></script>
		<![endif]-->
	</head>
	<body>
		<input typ

In [35]:
#Check the data is in
'.mp3' in html


True

In [36]:
#locate it
html.find('.mp3')

9630

In [37]:
chunks=html.split('"')
for chunk in chunks:
    if '.mp3' in chunk:
        break


In [38]:
print (chunk)

/demos/audio/4ccrj0lopjvjakvcupqvfuceh3.1603364463177.mp3
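
Splitting on double quotes works here, but it depends on how the attribute happens to be quoted. As an alternative, here is a small sketch with a regular expression. The helper name `find_mp3_paths` and the `<audio src=...>` sample snippet are assumptions made for illustration; the real page markup may differ.

```python
import re


def find_mp3_paths(html_text):
    """Return every quoted attribute value ending in .mp3, in document order."""
    # A single- or double-quoted value that contains no quotes and ends with ".mp3".
    pattern = re.compile(r"""["']([^"']+\.mp3)["']""")
    return pattern.findall(html_text)


sample = '<audio src="/demos/audio/4ccrj0lopjvjakvcupqvfuceh3.1603364463177.mp3"></audio>'
paths = find_mp3_paths(sample)
print(paths[0] if paths else "no .mp3 reference found")
```

Unlike the `for ... break` loop, this returns an empty list when nothing matches, so a missing audio file is easy to detect.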


In [39]:
from urllib.parse import urljoin
furl = urljoin(url, chunk)  # resolve the scraped path against the page URL
print (furl)

http://www.cepstral.com/demos/audio/4ccrj0lopjvjakvcupqvfuceh3.1603364463177.mp3
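
`urljoin` resolves the scraped path the way a browser would. A short sketch of the rule it applies follows; the paths `example.mp3`, `clip.mp3` and the `example.com` URL are made-up inputs. A path starting with `/` replaces the whole path of the base URL, a relative path replaces only its last segment, and an absolute URL is returned unchanged.

```python
from urllib.parse import urljoin

base = 'http://www.cepstral.com/en/demos'

# Root-relative path: everything after the host is replaced.
root_relative = urljoin(base, '/demos/audio/example.mp3')
# Relative path: only the last segment of the base ("demos") is replaced.
relative = urljoin(base, 'clip.mp3')
# Absolute URL: the base is ignored.
absolute = urljoin(base, 'http://example.com/a.mp3')

print(root_relative)
print(relative)
print(absolute)
```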


In [40]:
import os

# Path to the media player (macOS location of VLC); note the trailing space before the URL
player = "/Applications/VLC.app/Contents/MacOS/VLC " 

# Replace the path above with your own media player
os.system(player+furl)


0
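
Concatenating the URL into a shell string works, but any special character in the URL is then interpreted by the shell. A hedged alternative is to use `subprocess.run` with an argument list. In the sketch below the helpers `build_player_command` and `play` are hypothetical, and the VLC path is the macOS location used above; adjust it for your system.

```python
import os
import subprocess


def build_player_command(player_path, media_url):
    """Build the argument list; no shell is involved, so the URL is passed verbatim."""
    return [player_path, media_url]


def play(player_path, media_url):
    """Launch the player if it exists at player_path; return True if it was started."""
    if not os.path.exists(player_path):
        print("player not found:", player_path)
        return False
    subprocess.run(build_player_command(player_path, media_url), check=False)
    return True


player_path = "/Applications/VLC.app/Contents/MacOS/VLC"
furl = "http://www.cepstral.com/demos/audio/4ccrj0lopjvjakvcupqvfuceh3.1603364463177.mp3"
command = build_player_command(player_path, furl)
print(command)
```

Calling `play(player_path, furl)` would then start the player; it is left out of the sketch so that nothing is launched by accident.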

## 4.1 Starting with Selenium 

+ Requirements
        pip install selenium
        
To drive Firefox, Selenium also needs geckodriver, either on your PATH or referenced explicitly through executable_path as in the cells above. Download it from
https://github.com/mozilla/geckodriver/releases

Once it is in place, the following code should work fine.


In [41]:
from selenium import webdriver
browser = webdriver.Firefox()

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Basic manipulation in Selenium:**
<p>
A webdriver instance lets us manipulate the web session, control cookies, retrieve the HTML code or find elements in the source code.
</p>
Given a webdriver instance (e.g.<span style = "font-family:Courier;">
            browser = webdriver.Firefox()</span>), the most relevant methods are:

<ul>
<li>**Open URL:**  .get(url) (e.g.
<span style = "font-family:Courier;"> browser.get(url)</span>)</li>
<li>**Selection:** .find_element(s)... [find_element returns the first match, find_elements the complete list]
<ul>
<li>..._by_link_text('foo') - find the link with text foo</li>
<li>..._by_partial_link_text() - similar to contains ...</li>
<li>..._by_css_selector()</li>
<li>..._by_tag_name()</li>
<li>..._by_xpath()</li>
<li>..._by_class_name()</li>
</ul>
</li>
<li>**Retrieve source:** .page_source</li>
  
</ul>
</div>
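
Note that recent Selenium 4 releases removed the `find_element_by_*` helpers in favour of `find_element(by, value)`. The sketch below is a plain-Python helper (the name `to_locator` is hypothetical) that maps the old suffixes listed above to the locator strategy strings defined in `selenium.webdriver.common.by.By`. It does not need Selenium installed and is only meant to illustrate the correspondence.

```python
# Locator strategy strings, matching the constants of selenium.webdriver.common.by.By
OLD_SUFFIX_TO_STRATEGY = {
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "css_selector": "css selector",
    "tag_name": "tag name",
    "xpath": "xpath",
    "class_name": "class name",
}


def to_locator(old_method_name, value):
    """Translate e.g. ('find_element_by_css_selector', '#demo_text')
    into the (strategy, value) pair accepted by find_element(...)."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if old_method_name.startswith(prefix):
            suffix = old_method_name[len(prefix):]
            return OLD_SUFFIX_TO_STRATEGY[suffix], value
    raise ValueError("not a find_element(s)_by_* method: " + old_method_name)


locator = to_locator("find_element_by_css_selector", "#demo_text")
print(locator)
# With Selenium 4 this pair would be used as: browser.find_element(*locator)
```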

<div class = "alert alert-info" style = "background-color:lightyellow;border-radius:10px;border-width:3px;border-color:darkorange;font-family:Verdana,sans-serif;font-size:16px;color:brown">**Other web driver utilities:**
<ul>
<li>browser.execute_script('window.close()') - execute any JavaScript on the loaded page</li>
<li>browser.save_screenshot('foo.png') - save a screenshot of the current page</li>
<li>browser.switch_to.alert: handle JavaScript alert pop-ups</li>
<li>browser.forward() / browser.back(): navigation</li>
</ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Element manipulation in Selenium:**
<p>
Consider the result of a selection, e.g. 

<span style = "font-family:Courier;">element = browser.find_element_by_css_selector('div')</span>

We can do several things on it.
<ul>
<li>element**.click()** - click on a selected element</li>
<li>Element properties:
<ul>
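<li>element**.location**: x, y location of the element on the page</li>
<li>element**.parent**: the webdriver instance the element was found from (not its parent node in the DOM)</li>
<li>element**.tag_name**: the tag of the element</li>
<li>element**.text**: visible text of the element and its children</li>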
</ul>
</li>
   
</ul>
</div>




<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**Exercise: Back to business.**
Now we can interact with the web page programmatically.
</div>

In [42]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

url = 'https://contrataciondelestado.es/wps/portal/licRecientes'  # address of the web page
browser = webdriver.Firefox()  # open a Firefox browser
browser.get(url)

browser.implicitly_wait(5)

try:
    # click() returns None, so there is nothing useful to store
    browser.find_element_by_xpath('//*[@value="Next >>"]').click()
except NoSuchElementException:
    print('NOT FOUND')

With this we are now able to scrape all contracts. Let us keep the effort to a minimum and use what we know so far (even if it is not the optimal way of doing it).

In [43]:
from lxml import etree
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


url = 'https://contrataciondelestado.es/wps/portal/licRecientes'  # address of the web page
browser = webdriver.Firefox()  # open a Firefox browser
browser.get(url)

browser.implicitly_wait(5)

# Each field sits in the table row that holds an element whose id is prefix + item number
fields = {'importe': 'ab_hprecio', 'categoria': 'ab_hcategoria', 'organo': 'ab_hocontratacion'}

i = 1      # running record id across all pages
col = []   # collected records
for page in range(5):
    tree = etree.HTML(browser.page_source)
    itemN = 1
    while True:
        doc = {'id': i}
        print(itemN)
        try:
            for key, prefix in fields.items():
                query = '//*[@id="' + prefix + str(itemN) + '"]//parent::tr/td[2]/span/text()'
                doc[key] = tree.xpath(query)[0]
        except IndexError:
            break  # no more items on this page
        col.append(doc)
        i = i + 1
        itemN = itemN + 1
    try:
        browser.find_element_by_xpath('//*[@value="Next >>"]').click()
    except NoSuchElementException:
        print('NOT FOUND')

col

1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7


[{'id': 1,
  'importe': '262,000.00',
  'categoria': 'Servicios de impresión y servicios conexos.',
  'organo': 'Rectorado de la Universidad de Granada'},
 {'id': 2,
  'importe': '15,798.00',
  'categoria': 'Servicios relacionados con la contaminación del agua.',
  'organo': 'Aena. Dirección del Aeropuerto de Fuerteventura'},
 {'id': 3,
  'importe': '20,560.43',
  'categoria': 'Servicios de consultoría en ingeniería civil.',
  'organo': 'Consejería para la Transición Ecológica y Sostenibilidad'},
 {'id': 4,
  'importe': '92,948.66',
  'categoria': 'Equipo de salvamento y emergencia.',
  'organo': 'Junta de Gobierno del Ayuntamiento de Ourense'},
 {'id': 5,
  'importe': '14,507.63',
  'categoria': 'Otros servicios.',
  'organo': 'Aena. Dirección del Aeropuerto de Fuerteventura'},
 {'id': 6,
  'importe': '60,000.00',
  'categoria': 'Equipo de laboratorio, óptico y de precisión (excepto gafas).',
  'organo': 'Rector de la Universidad Carlos III de Madrid'},
 {'id': 7,
  'importe': '60,000
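
The scraped `importe` values are strings with a comma as thousands separator. Here is a minimal sketch for turning them into numbers before any analysis. The helper name `parse_importe` is hypothetical, the two sample records are copied from the output above, and the code assumes the `1,234.56` format seen there.

```python
from decimal import Decimal


def parse_importe(text):
    """Convert a string such as '262,000.00' into a Decimal."""
    return Decimal(text.replace(",", ""))


sample = [
    {"id": 1, "importe": "262,000.00"},
    {"id": 2, "importe": "15,798.00"},
]
total = sum(parse_importe(doc["importe"]) for doc in sample)
print(total)
```

`Decimal` is used instead of `float` so that monetary amounts are not subject to binary rounding.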

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**A more elaborate exercise:** We will end this section by scraping a web application. As expected, the response of a web application is dynamic content that depends on the inputs. We therefore change our problem slightly and use the search engine of the same portal.
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Form input with Selenium:**
<ul>
<li> element**.send_keys()** - type text and special keys (RETURN, arrows, etc.) </li>
<li> element**.clear()** - clear the element</li>
</ul>
<p>

**Example.**

<p style="font-family:Courier;">
from selenium.webdriver.common.keys import Keys
<br>input.send_keys('Ip Man',Keys.RETURN)
</p>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Scrolling and moving:**
Moving around the page is tricky; be prepared to show a little patience.

ActionChains provide a way of chaining one or more actions and then executing them with perform().
<ul>
<li>move_by_offset(x,y)</li>
<li>move_to_element() - for highlighting, hovering, rollover, etc.</li>
<li>move_to_element_with_offset(elem, x, y)</li>
</ul>
</div>

In [45]:
%reset -f
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

url = 'https://contrataciondelestado.es/wps/portal/licitaciones'  # address of the web page
browser = webdriver.Firefox()  # open a Firefox browser
browser.get(url)

browser.implicitly_wait(10)

#1) We have to get the web app going
browser.find_element_by_xpath("//*[contains(@id, 'logoFormularioBusqueda')]").click()

# Access can be slow enough to time out, but we can recover by checking the results


#2) Fill in some search
browser.find_element_by_xpath('//*[@title="Tipo de Contrato"]/option[text()="Servicios"]').click()

# Alternative: wrap the <select> element in selenium.webdriver.support.ui.Select and use
#   Select(select_element).select_by_visible_text('Servicios')
#   Select(select_element).select_by_value('1')

element = browser.find_element_by_xpath('//*[@alt="Organización contratante"]')
element.clear()
element.send_keys('Universidad Juan Carlos I', Keys.RETURN)  # we press RETURN here


#3) Retrieve results
try:
    browser.find_element_by_xpath("//span[contains(@id,'etNoHayRtdos')]")
    print("No results!")
except NoSuchElementException:
    print("There are results!")


#4) Now retrieve some stuff

# Clear the text field and run a second search
browser.find_element_by_xpath('//*[@alt="Organización contratante"]').clear()

browser.find_element_by_xpath('//*[@title="Tipo de Contrato"]/option[text()="Servicios"]').click()
element = browser.find_element_by_xpath("//input[contains(@id,'texoorgano')]")
element.clear()
element.send_keys('Rector de la Universidad Carlos III de Madrid', Keys.RETURN)  # we press RETURN here

try:
    browser.find_element_by_xpath("//table[contains(@id,'myTablaBusquedaCustom')]")
    print("Got it!")
except NoSuchElementException:
    print("Failed again...")

# Get an element
cells = browser.find_elements_by_xpath("//table[contains(@id,'myTablaBusquedaCustom')]/tbody/tr[1]/td[1]/div[2]")
print('Let us see what the Rector buys...\n' + cells[0].text)

No results!
Got it!
Let us see what the Rector buys...
Servicio de encuestas para un estudio de la relación entre clase social y preferencias políticas dentro del proyecto Classpol
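
Slow pages are the main source of failures in scripts like this one. Selenium's explicit waits (`WebDriverWait`) handle this by polling a condition until it holds or a timeout expires. The sketch below reproduces only that polling idea in plain Python; `wait_until` and its parameters are hypothetical names, not Selenium's API, and the toy condition stands in for "the results table is present".

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Toy condition: it becomes true on the third call.
calls = {"n": 0}


def table_is_present():
    calls["n"] += 1
    return calls["n"] >= 3


found = wait_until(table_is_present, timeout=2.0, poll=0.01)
print(found, calls["n"])
```

With Selenium, the condition would typically be a function that calls `find_elements_by_xpath(...)` and returns the resulting list, which is empty (and therefore falsy) until the element shows up.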


# Robotic Process Automation (RPA)

Robotic process automation is well beyond the scope of this course, but a brief introduction to RPA for web scraping may be interesting. Here we will use UI.Vision as an example of an RPA suite.

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**INSTALL:** Install the UI.Vision extension on your Chrome or Firefox browser from `https://ui.vision/`
</div>

In [None]:
# %load RPA_example_UIVISION.rtf
{
  "Name": "Prova",
  "CreationDate": "2019-10-27",
  "Commands": [
    {
      "Command": "open",
      "Target": "https://contrataciondelestado.es/wps/portal/licRecientes",
      "Value": ""
    },
    {
      "Command": "store",
      "Target": "0",
      "Value": "i"
    },
    {
      "Command": "while_v2",
      "Target": "(${i}<3)",
      "Value": ""
    },
    {
      "Command": "executeScript",
      "Target": "return Number ${i}*10",
      "Value": "j"
    },
    {
      "Command": "executeScript",
      "Target": "return Number ${j}+4",
      "Value": "j"
    },
    {
      "Command": "echo",
      "Target": "${j}",
      "Value": ""
    },
    {
      "Command": "storeText",
      "Target": "xpath=//*[@id=\"tabla_liciRecientes\"]/tbody/tr[${j}]/td[2]/span",
      "Value": "!csvLine"
    },
    {
      "Command": "echo",
      "Target": "${!csvLine}",
      "Value": ""
    },
    {
      "Command": "executeScript",
      "Target": "return Number ${i}+1",
      "Value": "i"
    },
    {
      "Command": "echo",
      "Target": "${i}",
      "Value": ""
    },
    {
      "Command": "end",
      "Target": "",
      "Value": ""
    },
    {
      "Command": "csvSave",
      "Target": "Licitaciones",
      "Value": ""
    },
    {
      "Command": "localStorageExport",
      "Target": "licitaciones.csv",
      "Value": ""
    }
  ]
}
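
For readers who prefer Python, the loop in this macro can be paraphrased as follows. This is a sketch of the macro's intended arithmetic only (row index `j = 10*i + 4` for `i = 0, 1, 2`), under the assumption that the `executeScript` steps evaluate as plain arithmetic. The function name `macro_targets` is hypothetical; the code builds the XPath targets but does not touch a browser.

```python
def macro_targets(iterations=3):
    """Return the XPath target that the storeText step would use on each pass."""
    targets = []
    i = 0                          # store 0 -> i
    while i < iterations:          # while_v2 (${i} < 3)
        j = i * 10                 # executeScript: ${i} * 10 -> j
        j = j + 4                  # executeScript: ${j} + 4  -> j
        targets.append('//*[@id="tabla_liciRecientes"]/tbody/tr[%d]/td[2]/span' % j)
        i = i + 1                  # executeScript: ${i} + 1  -> i
    return targets


for target in macro_targets():
    print(target)
```

Each pass of the macro stores the text found at that target in `!csvLine`; the final `csvSave` and `localStorageExport` steps then write the collected lines to `licitaciones.csv`.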