# Spiders

Learn how to build web crawlers with Scrapy. A Scrapy spider crawls the web across multiple pages, following links and scraping each page automatically, using the techniques covered in the previous chapters.

## Your First Spider

* Required imports

In [3]:
# !pip install scrapy

In [4]:
import scrapy
from scrapy.crawler import CrawlerProcess

### Weaving the Web

In [5]:
class DCspider( scrapy.Spider ):
    
    name = 'dc_spider'
    
    def start_requests( self ):
        urls = [ 'https://www.datacamp.com/courses/all' ]
        for url in urls:
            yield scrapy.Request( url = url, callback = self.parse )

    def parse( self, response ):
        # simple example: write out the html
        html_file = 'datasets/DC_courses.html'
        with open( html_file, 'wb' ) as fout:
            fout.write( response.body )

* A spider needs a method called `start_requests`
* It must have at least one parser method to handle the HTML code

### The Skinny on start_requests

In [6]:
def start_requests( self ):
    urls = ['https://www.datacamp.com/courses/all']
    for url in urls:
        yield scrapy.Request( url = url, callback = self.parse )

* The `scrapy.Request` here fills in the `response` variable for us
* The `url` argument tells us which site to scrape
* The `callback` argument tells us where to send the `response` variable for processing
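
To make the roles of `url` and `callback` concrete, here is a minimal pure-Python sketch of the hand-off. It does not use Scrapy: `FakeRequest`, `FakeResponse`, and `run` are hypothetical stand-ins invented for illustration, and the real engine downloads pages asynchronously rather than in a simple loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a downloaded page."""
    url: str
    body: str


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: where to go, and who handles the result."""
    url: str
    callback: Callable[[FakeResponse], object]


def run(requests: Iterable[FakeRequest]) -> List[object]:
    """Pretend to download each URL, then pass the response to its callback."""
    results = []
    for req in requests:
        # No network access here: the body is a stub built from the URL.
        response = FakeResponse(url=req.url, body="<html>stub for %s</html>" % req.url)
        results.append(req.callback(response))
    return results


def parse(response: FakeResponse) -> str:
    # The callback receives the response built for the requested URL.
    return "parsed " + response.url


results = run([FakeRequest(url="https://www.datacamp.com/courses/all", callback=parse)])
print(results)
```

The point of the sketch is only the wiring: the request names a destination and a handler, and whatever fetches the page is responsible for calling that handler with the response.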

### Inheriting the Spider

While learning about `scrapy` spiders, we saw that the main piece of customized code is the `class` for the spider. To help build familiarity with classes, you will complete a short piece of code that finishes a toy model of the spider class. We have left out the code that actually runs the spider, keeping only the parts needed to create the class.

As mentioned in the lesson, a `class` is a collection of related variables and functions housed together. Sometimes one class wants to use methods from another class, so we inherit the methods from that other class. That is what we do in the spider class.

We wrote the function `inspect_class` to look at your class once it is finished.
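
As a small illustration of inheritance on its own, the sketch below uses a hypothetical `BaseSpider` class in place of `scrapy.Spider`, so it runs without Scrapy installed. The class and method names are invented for this example.

```python
class BaseSpider:
    """Hypothetical base class standing in for scrapy.Spider."""

    name = "base"

    def describe(self) -> str:
        # Defined once here, available to every subclass.
        return "spider named " + self.name


class MySpider(BaseSpider):
    # Only the name is overridden; describe() is inherited unchanged.
    name = "my_spider"


spider = MySpider()
description = spider.describe()
print(description)

# describe shows up on the instance even though MySpider never defined it.
print("describe" in dir(spider))
```

This is the same check that `inspect_class` performs when it looks for `from_crawler` in `dir(newc)`: a method found on the instance but not written in the subclass must have come from the parent.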

In [11]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider(scrapy.Spider):
    name = "your_spider"
    # start_requests method
    def start_requests(self):
        pass
    # parse method
    def parse(self, response):
        pass

def inspect_class(c):
    newc = c()
    meths = dir(newc)
    if 'name' in meths:
        print("Your spider class name is:", newc.name)
    if 'from_crawler' in meths:
        print("It seems you have inherited methods from scrapy.Spider -- NICE!")
    else:
        print("Oh no! It doesn't seem that you are inheriting the methods from scrapy.Spider!!")
  
# Inspect Your Class
inspect_class(YourSpider)

Your spider class name is: your_spider
It seems you have inherited methods from scrapy.Spider -- NICE!


### Hurl the URLs

In [12]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
    name = "your_spider"
    # start_requests method
    def start_requests( self ):
        urls = ["https://www.datacamp.com", "https://scrapy.org"]
        for url in urls:
            yield url
    
    # parse method
    def parse( self, response ):
        pass

def inspect_class( c ):
    newc = c()
    meths = dir( newc )
    if 'start_requests' in meths:
        print( "The start_requests method yields the following urls:" )
        for u in newc.start_requests():
            print(  "\t-", u )
  
# Inspect Your Class
inspect_class( YourSpider )

The start_requests method yields the following urls:
	- https://www.datacamp.com
	- https://scrapy.org


## Start Requests

### Self Referencing is Classy

You may have noticed that within the spider class, we always pass the argument `self` to the `start_requests` and `parse` methods.

This lets us reference one method from another within the class. That is, if we want to refer to the method `parse` inside `start_requests`, we need to write `self.parse` rather than just `parse`; what `self` does is tell the code: "Look in the same class as `start_requests` for a method called `parse` to use."
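
The difference between `self.parse` and a bare `parse` can be seen in a short pure-Python sketch. The `Demo` class and its method names are hypothetical and exist only for this illustration.

```python
class Demo:
    def start(self):
        # self.parse looks up parse on this instance and binds it to self.
        return self.parse

    def start_broken(self):
        # A bare `parse` is looked up as a local or global name, not on the class.
        # This snippet defines no global `parse`, so the lookup fails.
        return parse

    def parse(self, response):
        return "parsed: " + response


demo = Demo()
callback = demo.start()
print(callback.__name__)
print(callback("page"))

try:
    demo.start_broken()
    bare_name_worked = True
except NameError:
    bare_name_worked = False
print(bare_name_worked)
```

Because `self.parse` is a bound method, it can be handed to other code as a callback and still knows which instance it belongs to.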

In [13]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
    name = "your_spider"
    # start_requests method
    def start_requests( self ):
        self.print_msg( "Hello World!" )
    # parse method
    def parse( self, response ):
        pass
    # print_msg method
    def print_msg( self, msg ):
        print( "Calling start_requests in YourSpider prints out:", msg )

def inspect_class( c ):
    newc = c()
    try:
        newc.start_requests()
    except:
        print( "Oh No! Something is wrong with the code! Keep trying." )

# Inspect Your Class
inspect_class( YourSpider )

Calling start_requests in YourSpider prints out: Hello World!


### Starting with Start Requests

In [17]:
# Import scrapy library
import scrapy

# Create the spider class
class YourSpider( scrapy.Spider ):
    name = "your_spider"
    # start_requests method
    def start_requests( self ):
        yield scrapy.Request( url = "https://www.datacamp.com", callback = self.parse )
    # parse method
    def parse( self, response ):
        pass

def inspect_class( c ):
    newc = c()
    try:
        y = list( newc.start_requests() )
        first_yield = y[0]
        print( "The url you would scrape is:", first_yield.url )
        cb = first_yield.callback
        print( "The name of the callback method you called is:", cb.__name__ )
    except:
        print( "Oh No! Something is wrong with the code! Keep trying." )
    
# Inspect Your Class
inspect_class( YourSpider )

The url you would scrape is: https://www.datacamp.com
The name of the callback method you called is: parse


## Parse and Crawl

### Pen Names

In [18]:
# Import the scrapy library
import scrapy
import requests
from scrapy.http import TextResponse

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the Spider class
class DCspider( scrapy.Spider ):
    name = 'dcspider'
    # start_requests method
    def start_requests( self ):
        yield scrapy.Request( url = url_short, callback = self.parse )
    # parse method
    def parse( self, response ):
        # Create an extracted list of course author names
        author_names = response.css( 'p.course-block__author-name::text' ).extract()
        # Here we will just return the list of Authors
        return author_names

def inspect_spider( s ):
    news = s()
    try:
        req = list( news.start_requests() )[0]
        url = req.url
        html = requests.get( url ).content
        response = TextResponse( url = url, body = html, encoding = 'utf-8' )
        author_names = req.callback( response )
        print( 'You have collected the author names:')
        for a in author_names:
            print('\t-', a )
    except:
        print( 'Oh no! Something went wrong with the code. Keep trying!')
        
# Inspect the spider
inspect_spider( DCspider )

You have collected the author names:
	- Jonathan Cornelissen
	- Matt Dowle
	- Garrett Grolemund
	- Garrett Grolemund
	- Garrett Grolemund
	- Filip Schouwenaars
	- Gilles Inghelbrecht
	- Nick Carchedi
	- Filip Schouwenaars
	- Filip Schouwenaars
	- Mark Peterson


### Crawler Time

In [19]:
# Import the scrapy library
import scrapy
import requests
from scrapy.http import TextResponse

# Create the Spider class
class DCdescr( scrapy.Spider ):
    name = 'dcdescr'
    # start_requests method
    def start_requests( self ):
        yield scrapy.Request( url = url_short, callback = self.parse )
  
    # First parse method
    def parse( self, response ):
        links = response.css( 'div.course-block > a::attr(href)' ).extract()
        # Follow each of the extracted links
        for link in links:
            yield response.follow( url = link, callback = self.parse_descr )
      
    # Second parsing method
    def parse_descr( self, response ):
        # Extract course description
        course_descr = response.css( 'p.course__description::text' ).extract_first()
        # For now, just yield the course description
        yield course_descr
    
def inspect_spider( s ):
    news = s()
    try:
        req1 = list( news.start_requests() )[0]
        html1 = requests.get( req1.url ).content
        response1 = TextResponse( url = req1.url, body = html1, encoding = 'utf-8' )
        req2 = list( news.parse( response1 ) )[0]
        html2 = requests.get( req2.url ).content
        response2 = TextResponse( url = req2.url, body = html2, encoding = 'utf-8' )
        for d in news.parse_descr( response2 ):
          print("One course description you found is:", d )
          break
    except:
        print("Oh no! Something is wrong with the code. Keep trying!")


# Inspect the spider
inspect_spider( DCdescr )

One course description you found is: In this introduction to R, you will master the basics of this beautiful open source language, including factors, lists and data frames. With the knowledge gained in this course, you will be ready to undertake your first very own data analysis. With over 2 million users worldwide R is rapidly becoming the leading programming language in statistics and data science. Every year, the number of R users grows by 40% and an increasing number of organizations are using it in their day-to-day activities. Leverage the power of R by completing this free R online course today!


## Capstone

### Time to Run

In [20]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
    name = "dc_chapter_spider"
    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url = url_short, callback = self.parse_front)
    
    # First parsing method
    def parse_front(self, response):
        course_blocks = response.css('div.course-block')
        course_links = course_blocks.xpath('./a/@href')
        links_to_follow = course_links.extract()
        for url in links_to_follow:
            yield response.follow(url = url, callback = self.parse_pages)

    # Second parsing method
    def parse_pages(self, response):
        crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
        crs_title_ext = crs_title.extract_first().strip()
        ch_titles = response.css('h4.chapter__title::text')
        ch_titles_ext = [t.strip() for t in ch_titles.extract()]
        dc_dict[ crs_title_ext ] = ch_titles_ext
    
def previewCourses( dc_dict, n = 3 ):
    crs_titles = list( dc_dict.keys() )
    print( "A preview of DataCamp Courses:")
    print("---------------------------------------\n")
    for t in crs_titles[:n]:
        print( "TITLE: %s" % t)
        for i,ct in enumerate(dc_dict[t]):
          print("\tChapter %d: %s" % (i+1,ct) )
        print("")

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

2020-02-15 05:24:43 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-15 05:24:43 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-5.0.0-1029-gcp-x86_64-with-debian-buster-sid
2020-02-15 05:24:43 [scrapy.crawler] INFO: Overridden settings: {}
2020-02-15 05:24:43 [scrapy.extensions.telnet] INFO: Telnet Password: 997c49bc3533869d
2020-02-15 05:24:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-02-15 05:24:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddleware

A preview of DataCamp Courses:
---------------------------------------

TITLE: Introduction to R
	Chapter 1: Intro to basics
	Chapter 2: Vectors
	Chapter 3: Matrices
	Chapter 4: Factors
	Chapter 5: Data frames
	Chapter 6: Lists

TITLE: Reporting with R Markdown
	Chapter 1: Authoring R Markdown Reports
	Chapter 2: Embedding Code
	Chapter 3: Compiling Reports
	Chapter 4: Configuring R Markdown (optional)

TITLE: Data Analysis in R, the data.table Way
	Chapter 1: Data.table novice
	Chapter 2: Data.table yeoman
	Chapter 3: Data.table expert



### DataCamp Descriptions


In [None]:
# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'

# Create the Spider class
class DC_Description_Spider(scrapy.Spider):
    name = "dc_description_spider"
    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url = url_short, callback = self.parse_front)
        
    # First parsing method
    def parse_front(self, response):
        course_blocks = response.css('div.course-block')
        course_links = course_blocks.xpath('./a/@href')
        links_to_follow = course_links.extract()
        for url in links_to_follow:
            yield response.follow(url = url, callback = self.parse_pages)
        
    # Second parsing method
    def parse_pages(self, response):
        # Create a SelectorList of the course titles text
        crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
        # Extract the text and strip it clean
        crs_title_ext = crs_title.extract_first().strip()
        # Create a SelectorList of course descriptions text
        crs_descr = response.css( 'p.course__description::text' )
        # Extract the text and strip it clean
        crs_descr_ext = crs_descr.extract_first().strip()
        # Fill in the dictionary
        dc_dict[crs_title_ext] = crs_descr_ext
    
# Defined at module level (not inside the class) so the call below can find it
def previewCourses( dc_dict, n = 1 ):
    crs_titles = list( dc_dict.keys() )
    print( "A preview of DataCamp Courses:")
    print("---------------------------------------\n")
    for t in crs_titles[:n]:
        print( "TITLE: %s" % t)
        print("\tDescription: %s" % dc_dict[t] )
        print("")

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Description_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

### Capstone Crawler

In [3]:
from scrapy.http import TextResponse
import requests

url_short = 'https://assets.datacamp.com/production/repositories/2560/datasets/19a0a26daa8d9db1d920b5d5607c19d6d8094b3b/all_short'
response = requests.get(url_short)
response = TextResponse(body=response.content, url=url_short)

# parse method
def parse(self, response):
  # Extracted course titles
  crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
  # Extracted course descriptions
  crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
  # Fill in the dictionary
  for crs_title, crs_descr in zip(crs_titles, crs_descrs):
    dc_dict[crs_title] = crs_descr
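
The `parse` function above pairs each title with a description using `zip`. The sketch below repeats that pattern on hypothetical sample data (the titles and descriptions are invented) and adds a `strip()` call, on the assumption that extracted text can carry surrounding newlines and indentation.

```python
# Hypothetical sample data, shaped like the lists that .extract() returns.
crs_titles = ["Course A", "Course B", "Course C"]
crs_descrs = ["\n   First description.\n  ", "\n   Second description.\n  "]

dc_dict = {}
for crs_title, crs_descr in zip(crs_titles, crs_descrs):
    # strip() removes the surrounding newlines and indentation.
    dc_dict[crs_title.strip()] = crs_descr.strip()

print(dc_dict)
```

Note that `zip` stops at the shorter list, so "Course C" is dropped here. If a page is missing a description for one course, the remaining pairs can also shift out of alignment, which is worth checking before trusting the dictionary.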

In [4]:
# Import scrapy
import scrapy

# Import the CrawlerProcess
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class YourSpider(scrapy.Spider):
    name = 'yourspider'
    # start_requests method
    def start_requests(self):
        yield scrapy.Request(url = url_short, callback = self.parse)
      
    def parse(self, response):
        # My version of the parser you wrote in the previous part
        crs_titles = response.xpath('//h4[contains(@class,"block__title")]/text()').extract()
        crs_descrs = response.xpath('//p[contains(@class,"block__description")]/text()').extract()
        for crs_title, crs_descr in zip(crs_titles, crs_descrs):
            dc_dict[crs_title] = crs_descr
    
def previewCourses( dc_dict, n = 3 ):
    parse( self = None, response = response )
    crs_titles = list( dc_dict.keys() )
    print( "A preview of DataCamp Courses:")
    print("---------------------------------------\n")
    for t in crs_titles[:n]:
        print( "TITLE: %s" % t)
        print( "\tDESCRIPTION: %s" % dc_dict[t] )
        print("")
    
# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

2019-05-18 06:48:04 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-05-18 06:48:04 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) - [GCC 7.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.5, Platform Linux-3.10.0-957.12.2.el7.x86_64-x86_64-with-debian-buster-sid
2019-05-18 06:48:04 [scrapy.crawler] INFO: Overridden settings: {}
2019-05-18 06:48:04 [scrapy.extensions.telnet] INFO: Telnet Password: 8417b1c7ad659f28
2019-05-18 06:48:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-05-18 06:48:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddleware

A preview of DataCamp Courses:
---------------------------------------

TITLE: Introduction to R
	DESCRIPTION: 
          Master the basics of data analysis by manipulating common data structures such as vectors, matrices and data frames.
        

TITLE: Data Analysis in R, the data.table Way
	DESCRIPTION: 
          Master core concepts in data manipulation such as subsetting, updating, indexing and joining your data using data.table.
        

TITLE: Data Manipulation in R with dplyr
	DESCRIPTION: 
          Master techniques for data manipulation using the select, mutate, filter, arrange, and summarise functions in dplyr.
        



## Stop Scratching and Start Scraping!

### Feeding the Machine

`Data Acquisition Process` **Access Raw Data --> Parse & Extract**

### Scraping Skills

* **Objective**: Scrape websites computationally
* **How**? We decided to use scrapy
* **How**? We need to work with:
  * `Selector` and `Response` objects
  * Maybe even create a Spider
* **How**? We need to learn XPath and CSS Locator notation
* **How**? Understand the structure of HTML

### What Do You Need to Know?

* The structure of HTML
* XPath and CSS Locator notation
* How to use `Selector` and `Response` objects in `scrapy`
* How to set up a spider
* How to scrape a website
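
As a quick refresher on XPath-style selection, the sketch below uses Python's standard `xml.etree.ElementTree` module instead of scrapy's `Selector`, so it runs without any installation. This is a swapped-in tool that supports only a small subset of XPath and needs well-formed markup. The snippet is hypothetical: the class names and author names mirror the examples in this chapter, but the `href` paths are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet modeled on the course blocks scraped above.
html = """
<div>
  <div class="course-block">
    <a href="/courses/intro-to-r">
      <p class="course-block__author-name">Jonathan Cornelissen</p>
    </a>
  </div>
  <div class="course-block">
    <a href="/courses/data-table">
      <p class="course-block__author-name">Matt Dowle</p>
    </a>
  </div>
</div>
"""

root = ET.fromstring(html)

# XPath-style query: every <p> with the given class, anywhere below the root.
authors = [p.text for p in root.findall(".//p[@class='course-block__author-name']")]

# Direct <a> children of each course block, reading the href attribute.
links = [a.get("href") for a in root.findall(".//div[@class='course-block']/a")]

print(authors)
print(links)
```

Real web pages are rarely well-formed XML, which is one reason to prefer scrapy's `Selector` and `Response` objects for actual scraping.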