# XPaths and Selectors

Use XPath syntax to navigate scrapy selectors. Together, these two concepts will get you up and running with scraping HTML documents.


In [9]:
# !pip install scrapy

## XPathology

### Slashes and Brackets

* Single forward slash / looks forward `one` generation
* Double forward slash // looks forward `all` future generations
* Square brackets [] help narrow in on specific elements

### Counting Elements

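* `xpath = "/html/body/*"` : the number of selected elements equals the number of children of the body element.
* `xpath = "/html/body//*"` : the number of selected elements equals the total number of descendants of the body element.
* `xpath = "/*"` : the number of selected elements equals the number of root elements in the HTML document, which is usually 1 (the html root element).
* `xpath = "//*"` : the number of selected elements equals the total number of elements in the entire HTML document.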

### Body Appendages

In [18]:
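# Print how many elements the XPath selects; assumes `sel` is a Selector
# that has already been created for the document being explored
def how_many_elements( xpath ):
    print( len(sel.xpath( xpath )) )

# Create an XPath string to navigate to the children of the body element
xpath = '//body/*'

# Print the number of selected elements
how_many_elements( xpath )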

23


### Choose DataCamp!

In [26]:
from scrapy import Selector

html = '''
<html>
  <body>
    <div>
      <p>Hello World!</p>
      <div>
        <p>Choose DataCamp!</p>
      </div>
    </div>
    <div>
      <p>Thanks for Watching!</p>
    </div>
  </body>
</html>
'''

sel = Selector( text = html )

In [27]:
def print_element_text( xpath ):
  text = ' '.join( sel.xpath( xpath ).xpath( './text()' ).extract() )
  print( text )

In [28]:
# Create an XPath string to the desired paragraph element
xpath = "/html/body/div[1]/div/p"

# Print the element text
print_element_text( xpath )

Choose DataCamp!


## Off the Beaten XPath

### Where it's @

Select the paragraph element containing the phrase: **"Thanks for Watching!"**

In [31]:
from scrapy import Selector

html = '''
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose DataCamp!</p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
'''

sel = Selector( text = html )

In [32]:
# Create an XPath string to select the desired p element
xpath = '//*[@id="div3"]/p'

# Print the text of the selection
print_element_text( xpath )

Thanks for Watching!


### Check your Class

Select the paragraph element containing the phrase: **"Hello World!"**

In [33]:
# Create an XPath string to select the p element by class
xpath = '//p[@class="class-1 class-2"]'

# Print the text of the selection
print_element_text( xpath )

Hello World!


### Hyper(link) Active

Select the value of the `href` attribute of the DataCamp hyperlink.

In [34]:
html = '''
<html>
  <body>
    <div id="div1" class="class-1">
      <p class="class-1 class-2">Hello World!</p>
      <div id="div2">
        <p id="p2" class="class-2">Choose 
            <a href="http://datacamp.com">DataCamp!</a>!
        </p>
      </div>
    </div>
    <div id="div3" class="class-2">
      <p class="class-2">Thanks for Watching!</p>
    </div>
  </body>
</html>
'''

sel = Selector( text = html)

In [36]:
# Function to print the data extracted by your XPath
def print_attribute( xpath ):
    print( "You have selected:" )
    for i,el in enumerate(sel.xpath( xpath ).extract()):
        print( "%d) %s" % (i+1, el) )

In [37]:
# Create an XPath to the href attribute
xpath = '//p[@id="p2"]/a/@href'

# Print the selection; there should be only one
print_attribute( xpath )

You have selected:
1) http://datacamp.com


### Secret Links

Assign to the variable `xpath` an XPath string that points to all `href` attribute values of the `a` hyperlink elements whose class attribute contains the string `"course-block"`. Recall that we use the `contains` call within the XPath string to check whether an attribute value contains a particular string.

In [40]:
# Import requests
import requests

url = 'https://www.datacamp.com/courses/all'

# Create the string html containing the HTML source
html = requests.get( url ).content

# Create the Selector object sel from html
sel = Selector( text = html )

In [41]:
# Print the number of elements the XPath has selected
def how_many_elements( xpath ):
    print( "You've selected %d elements" % len(sel.xpath( xpath )) )

# Print the first few (at most four) selected elements
def preview( xpath ):
    els = sel.xpath( xpath ).extract()
    for i,el in enumerate( els[:4] ):
        print( "Element %d: %s" % (i+1,el) )

In [42]:
# Create an xpath to the href attributes
xpath = '//a[contains(@class,"course-block")]/@href'

# Print out how many elements are selected
how_many_elements( xpath )
# Preview the selected elements
preview( xpath )

You've selected 518 elements
Element 1: /courses/free-introduction-to-r
Element 2: /courses/free-introduction-to-r
Element 3: /courses/data-table-data-manipulation-r-tutorial
Element 4: /courses/data-table-data-manipulation-r-tutorial


## Selector Objects

In [43]:
from scrapy import Selector

html = '''
<html>
    <body>
        <div>HELLO</div>
        <div><p>GOODBYE</p></div>
        <div><span><p>NOPE</p><p>ALMOST</p><p>YOU GOT IT!</p></span></div>
    </body>
</html>
'''

sel = Selector( text = html )

### XPath Chaining

* `sel.xpath('/html/body/div[2]')` = `sel.xpath('/html').xpath('./body/div[2]')`
  * = `sel.xpath('/html').xpath('./body').xpath('./div[2]')`

Chain together two xpath calls that produce the same selection as:

```python
sel.xpath('//div/span/p[3]')
```

In [44]:
# Chain together xpath methods to select desired p element
sel.xpath( '//div' ).xpath( './span/p[3]' )

[<Selector xpath='./span/p[3]' data='<p>YOU GOT IT!</p>'>]

### Divvy Up This Exercise


In [45]:
html = '''
<html>
    <body>
        <div>Div 1: 
            <p>paragraph 1</p>
        </div>
        <div>Div 2: 
            <p>paragraph 2</p> 
            <p>paragraph 3</p> 
        </div>
        <div>Div 3: <p>paragraph 4</p> 
            <p>paragraph 5</p> 
            <p>paragraph 6</p>
        </div>
        <div>Div 4: 
            <p>paragraph 7</p>
        </div>
        <div>Div 5: 
            <p>paragraph 8</p>
        </div>
    </body>
</html>
'''

In [46]:
from scrapy import Selector

# Create a Selector selecting html as the HTML document
sel = Selector( text = html )

# Create a SelectorList of all div elements in the HTML document
divs = sel.xpath( '//div' )
print(divs)

[<Selector xpath='//div' data='<div>Div 1: \n            <p>paragraph 1<'>, <Selector xpath='//div' data='<div>Div 2: \n            <p>paragraph 2<'>, <Selector xpath='//div' data='<div>Div 3: <p>paragraph 4</p> \n        '>, <Selector xpath='//div' data='<div>Div 4: \n            <p>paragraph 7<'>, <Selector xpath='//div' data='<div>Div 5: \n            <p>paragraph 8<'>]


## The Source of Source

### Requesting a Selector

In [47]:
# Import a scrapy Selector
from scrapy import Selector

# Import requests
import requests

url = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the string html containing the HTML source
html = requests.get( url ).content

# Create the Selector object sel from html
sel = Selector( text = html )

# Print the number of elements in the HTML document
print( "There are 1020 elements in the HTML document.")
print( "You have found: ", len( sel.xpath('//*') ) )

There are 1020 elements in the HTML document.
You have found:  1020
