# CSS Locators, Chaining, and Responses

Learn CSS Locator syntax and start experimenting with chaining CSS Locators together with XPath. We also introduce the `Response` object, which behaves like a Selector but gives us additional tools for scaling our scraping efforts across multiple websites.

In [1]:
# Import scrapy's Selector and the requests library
from scrapy import Selector
import requests

## CSS Locators

### Get an "a" in this Course

In [20]:
from scrapy.http import TextResponse

url = 'https://www.datacamp.com/courses/all'
html = requests.get(url)
html = html.text

In [3]:
def how_many_elements( css ):
    # Count the elements in the global `html` string matched by the CSS Locator
    sel = Selector( text = html )
    print( len(sel.css( css )) )

In [4]:
from scrapy import Selector

# Create a selector from the html (of a secret website)
sel = Selector( text = html )

# Fill in the blank
css_locator = 'div.course-block > a'

# Print the number of selected elements.
how_many_elements( css_locator )

326


### The CSS Wildcard

You can use the wildcard `*` in CSS Locators too! In fact, it works the same way as in XPath: use it when you want to ignore the tag type. For example:

* The CSS Locator string `'*'` selects all elements in the HTML document.
* The CSS Locator string `'*.class-1'` selects all elements that belong to `class-1`, but the wildcard is unnecessary because the string `'.class-1'` does the same job.
* The CSS Locator string `'*#uid'` selects the element whose `id` attribute equals `uid`, but the wildcard is unnecessary because the string `'#uid'` does the same job.

In this exercise, work by analogy with the wildcard character you know from XPath notation to find out how to select all the children of a given element in CSS Locator notation.

In [8]:
# Create the CSS Locator to all children of the element whose id is uid
css_locator = "#uid > *"

### Selectors with CSS

In [9]:
html = '''
<html>
<body>
    <div class="hello datacamp">
        <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
</body>
</html>
'''
sel = Selector( text = html )

In [10]:
type(html)

str

In [11]:
sel.css("div > p")

[<Selector xpath='descendant-or-self::div/p' data='<p>Hello World!</p>'>]

In [15]:
sel.css("div > p").extract()

['<p>Hello World!</p>']

## Attribute and Text Selection

### You've been `href` ed

In [21]:
from scrapy import Selector

# Create a selector object from the website's HTML
sel = Selector( text = html )

# Select all hyperlinks that are children of div elements in class "course-block"
course_as = sel.css( 'div.course-block > a' )

# Select all href attributes by chaining with css
hrefs_from_css = course_as.css( '::attr(href)' )

# Select all href attributes by chaining with xpath
hrefs_from_xpath = course_as.xpath( './@href' )

In [22]:
print(hrefs_from_xpath)

[<Selector xpath='./@href' data='/courses/free-introduction-to-r'>, <Selector xpath='./@href' data='/courses/intermediate-r'>, <Selector xpath='./@href' data='/courses/introduction-to-machine-lear...'>, <Selector xpath='./@href' data='/courses/cleaning-data-in-r'>, <Selector xpath='./@href' data='/courses/intro-to-python-for-data-sci...'>, <Selector xpath='./@href' data='/courses/intermediate-r-practice'>, <Selector xpath='./@href' data='/courses/data-visualization-with-ggpl...'>, <Selector xpath='./@href' data='/courses/data-visualization-with-ggpl...'>, <Selector xpath='./@href' data='/courses/intermediate-python'>, <Selector xpath='./@href' data='/courses/data-visualization-with-ggpl...'>, <Selector xpath='./@href' data='/courses/text-mining-with-bag-of-word...'>, <Selector xpath='./@href' data='/courses/case-study-exploring-basebal...'>, <Selector xpath='./@href' data='/courses/introduction-to-portfolio-an...'>, <Selector xpath='./@href' data='/courses/introduction-to-credit-risk-.

### Top Level Text

In [27]:
from scrapy.http import TextResponse

url = 'https://www.DataCamp.com'
res = requests.get(url)
res = TextResponse(body=res.content, url=url)

In [31]:
def our_xpath( xpath ):
    xextr = res.xpath( xpath ).extract()
    return xextr

def our_css( css ):
    cextr = res.css( css ).extract()
    return cextr

def print_results( xpath, css_locator ):
    print( "Your XPath extracts to following:")
    print( our_xpath(xpath) )
    print("_________________\n")
    print( "Your CSS Locator extracts the following:")
    print( our_css(css_locator) )
    return None

In [32]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]/text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3::text'

# Print the text from our selections
print_results( xpath, css_locator )

Your XPath extracts to following:
[]
_________________

Your CSS Locator extracts the following:
[]


### All Level Text

In [33]:
# Create an XPath string to the desired text.
xpath = '//p[@id="p3"]//text()'

# Create a CSS Locator string to the desired text.
css_locator = 'p#p3 ::text'

# Print the text from our selections
print_results( xpath, css_locator )

Your XPath extracts to following:
[]
_________________

Your CSS Locator extracts the following:
[]


### Text Extraction

In [34]:
html = '''
<p id="p-example">
    Hello world! Try <a href="http://www.datacamp.com">DataCamp</a> today!
</p>
'''

sel = Selector( text = html )

In [35]:
sel.xpath('//p[@id="p-example"]/text()').extract()

['\n    Hello world! Try ', ' today!\n']

In [15]:
sel.xpath('//p[@id="p-example"]//text()').extract()

['\n    Hello world! Try ', 'DataCamp', ' today!\n']

In [16]:
sel.css('p#p-example::text').extract()

['\n    Hello world! Try ', ' today!\n']

In [17]:
sel.css('p#p-example ::text').extract()

['\n    Hello world! Try ', 'DataCamp', ' today!\n']

In [18]:
type(sel)

scrapy.selector.unified.Selector

## Respond

### Reveal By Response

In [36]:
from scrapy.http import TextResponse

url = 'https://www.datacamp.com/courses/all'
response = requests.get(url)
response = TextResponse(body=response.content, url=url)

In [37]:
def print_url_title( url, title ):
    print( "Here is what you found:" )
    print( "\t-URL: %s" % url )
    print( "\t-Title: %s" % title )

In [38]:
# Get the URL to the website loaded in response
this_url = response.url

# Get the title of the website loaded in response
this_title = response.xpath( '/html/head/title/text()' ).extract_first()

# Print out our findings
print_url_title( this_url, this_title )

Here is what you found:
	-URL: https://www.datacamp.com/courses/all
	-Title: Data Science Courses: R & Python Analysis Tutorials | DataCamp


### Responding with Selectors

In [39]:
# Create a CSS Locator string to the desired hyperlink elements
css_locator = 'a.course-block__link'

# Select the hyperlink elements from response and sel
response_as = response.css(css_locator)
sel_as = sel.css(css_locator)

# Examine similarity
nr = len(response_as)
ns = len(sel_as)
for i in range(min(nr, ns, 2)):
    print("Element %d from response: %s" % (i+1, response_as[i]))
    print("Element %d from sel: %s" % (i+1, sel_as[i]))
    print("")

### Selecting from a Selection

In [42]:
# Select all desired div elements
divs = response.css( 'div.course-block' )

# Take the first div element
first_div = divs[0]

# Extract the text from the h4 element in first_div
h4_text = first_div.css('h4::text').extract_first()

# Print out the text
print( "The text from the h4 element is:", h4_text )

The text from the h4 element is: Introduction to R


## Survey

### Titular

In [43]:
# Create a SelectorList of the course titles
crs_title_els = response.css( 'h4::text' )

# Extract the course titles 
crs_titles = crs_title_els.extract()

# Print out the course titles 
for el in crs_titles:
    print( ">>", el )

>> Introduction to R
>> Intermediate R
>> Introduction to Machine Learning
>> Cleaning Data in R
>> Introduction to Python
>> Intermediate R: Practice
>> Data Visualization with ggplot2 (Part 1)
>> Data Visualization with ggplot2 (Part 2)
>> Intermediate Python
>> Data Visualization with ggplot2 (Part 3)
>> Text Mining with Bag-of-Words in R
>> Case Study: Exploring Baseball Pitching Data in R
>> Introduction to Portfolio Analysis in R
>> Credit Risk Modeling in R
>> Machine Learning with caret in R
>> Introduction to Databases in Python
>> Manipulating Time Series Data with xts and zoo in R
>> Time Series Analysis in R
>> Importing & Cleaning Data in R: Case Studies
>> Financial Trading in R
>> Importing and Managing Financial Data in R
>> Interactive Data Visualization with Bokeh
>> Case Study: Exploratory Data Analysis in R
>> Introduction to Importing Data in R
>> Intermediate Importing Data in R
>> Data Visualization in R
>> Python Data Science Toolbox (Part 2)
>> Python Data Scie

In [45]:
for i, el in enumerate(crs_titles, start=1):
    print(i, el)

1 Introduction to R
2 Intermediate R
3 Introduction to Machine Learning
4 Cleaning Data in R
5 Introduction to Python
6 Intermediate R: Practice
7 Data Visualization with ggplot2 (Part 1)
8 Data Visualization with ggplot2 (Part 2)
9 Intermediate Python
10 Data Visualization with ggplot2 (Part 3)
11 Text Mining with Bag-of-Words in R
12 Case Study: Exploring Baseball Pitching Data in R
13 Introduction to Portfolio Analysis in R
14 Credit Risk Modeling in R
15 Machine Learning with caret in R
16 Introduction to Databases in Python
17 Manipulating Time Series Data with xts and zoo in R
18 Time Series Analysis in R
19 Importing & Cleaning Data in R: Case Studies
20 Financial Trading in R
21 Importing and Managing Financial Data in R
22 Interactive Data Visualization with Bokeh
23 Case Study: Exploratory Data Analysis in R
24 Introduction to Importing Data in R
25 Intermediate Importing Data in R
26 Data Visualization in R
27 Python Data Science Toolbox (Part 2)
28 Python Data Science Toolb

In [48]:
import pandas as pd

pd.DataFrame({ 'title': [el for el in crs_titles] }).head(10)

|   | title |
|---|---|
| 0 | Introduction to R |
| 1 | Intermediate R |
| 2 | Introduction to Machine Learning |
| 3 | Cleaning Data in R |
| 4 | Introduction to Python |
| 5 | Intermediate R: Practice |
| 6 | Data Visualization with ggplot2 (Part 1) |
| 7 | Data Visualization with ggplot2 (Part 2) |
| 8 | Intermediate Python |
| 9 | Data Visualization with ggplot2 (Part 3) |


In [None]:
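# Calculate the number of children of the mystery element
# (`mystery` is not defined in this notebook; it is assumed to be a
# Selector provided by the exercise environment)
how_many_kids = len( mystery.xpath( './*' ) )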

# Print out the number
print( "The number of elements you selected was:", how_many_kids )

## Scraping For Reals

### What's the Div, Yo?

In [49]:
from scrapy import Request

# A Request only describes what to fetch; it is not a downloaded response
request = Request(url='https://www.datacamp.com/courses/all')

In [50]:
print(type(request))
print(request.url)

<class 'scrapy.http.request.Request'>
https://www.datacamp.com/courses/all


In [51]:
from scrapy.http import TextResponse

url = 'https://www.datacamp.com/courses/all'
resp = requests.get(url)
resp = TextResponse(body=resp.content, url=url)

course_divs = resp.css('div.course-block')

In [52]:
print( len(course_divs) )

326


### Inspecting course-block

In [53]:
first_div = course_divs[0]
children = first_div.xpath('./*')
print( len(children) )

4


### The first child

In [54]:
first_div = course_divs[0]
children = first_div.xpath('./*')

In [55]:
first_child = children[0]
print( first_child.extract() )

<div class="js-bookmark-icon-large-58">
      <a class="dc-bookmark-icon dc-bookmark-icon--large dc-bookmark-icon--bookmark dc-bookmark-icon--hidden js-bookmark-icon tooltip-trigger--primary-dark top ds-snowplow-bookmarking-dashboard-bookmark  js-bookmark-icon-click-create dc-bookmarking-tooltip" data-toggle="tooltip" title="Bookmark" data-remote="true" rel="nofollow" data-method="post" href="/courses/free-introduction-to-r/user_bookmarks"></a>


  </div>


### The second child

In [56]:
first_div = course_divs[0]
children = first_div.xpath('./*')

In [57]:
second_child = children[1]
print( second_child.extract() )

<a class="course-block__link ds-snowplow-link-course-block" href="/courses/free-introduction-to-r">
      <div class="course-block__technology course-block__technology--r"></div>
      <div class="course-block__body">
        <h4 class="course-block__title">Introduction to R</h4>
        <p class="course-block__description">
          Master the basics of data analysis by manipulating common data structures such as vectors, matrices, and data frames.
        </p>
            <div class="course-block__extra-info dc-u-fx dc-u-fx-aifs dc-u-fx-jcc dc-u-fx-fww">
              <span class="course-block__length dc-u-mh-12 dc-u-fx-center">
                
<span class="
    dc-icon 
      dc-icon--size-12
      dc-icon--primary
      dc-icon--flex dc-u-mr-8
  ">
  <svg class="dc-icon__svg">
    <use xlink:href="#clock"></use>
  </svg>
</span>

                4 hours
              </span>
            </div>
      </div>
</a>


### The forgotten child

In [44]:
first_div = course_divs[0]
children = first_div.xpath('./*')

In [45]:
third_child = children[2]
print( third_child.extract() )

<span class="js-mobile-progress-container js-mobile-course-progress" data-id="58" data-user-id="">
    </span>


### Listful

* In one CSS Locator

In [47]:
links = resp.css('div.course-block > a::attr(href)').extract()

* Stepwise

In [49]:
# step 1: course blocks
course_divs = resp.css('div.course-block')
# step 2: hyperlink elements
hrefs = course_divs.xpath('./a/@href')
# step 3: extract the links
links = hrefs.extract()

### Get Schooled

In [50]:
for l in links:
    print( l )

/courses/free-introduction-to-r
/courses/data-table-data-manipulation-r-tutorial
/courses/dplyr-data-manipulation-r-tutorial
/courses/ggvis-data-visualization-r-tutorial
/courses/reporting-with-r-markdown
/courses/intermediate-r
/courses/introduction-to-machine-learning-with-r
/courses/cleaning-data-in-r
/courses/intro-to-python-for-data-science
/courses/intermediate-r-practice
/courses/data-visualization-with-ggplot2-1
/courses/data-visualization-with-ggplot2-2
/courses/intermediate-python-for-data-science
/courses/data-visualization-with-ggplot2-part-3
/courses/intro-to-text-mining-bag-of-words
/courses/exploring-pitch-data-with-r
/courses/working-with-the-rstudio-ide-part-1
/courses/introduction-to-portfolio-analysis-in-r
/courses/writing-functions-in-r
/courses/introduction-to-credit-risk-modeling-in-r
/courses/machine-learning-toolbox
/courses/working-with-the-rstudio-ide-part-2
/courses/joining-data-in-r-with-dplyr
/courses/introduction-to-relational-databases-in-python
/courses/