### SCRAPY BASICS

Scrapy is a fast, open-source web crawling framework written in Python. It extracts data from web pages using selectors written as CSS or XPath expressions.

<li>Create a new project in PyCharm</li>
<li>Go to File -> Settings -> Python Interpreter</li>
<li>Add the Scrapy package to the interpreter's package list</li>
<li>Go to Terminal and type: scrapy startproject "myproject"</li>
<li>A subdirectory named myproject will be created in the current directory</li>


<img src="scrapy_project_creation.png">


<img src="new_project_structure.png">

### creating first spider

<li>Go to the spiders folder</li>
<li>Create a Python file quotes_tutorial.py and define a spider in it whose name attribute is "quotes"</li>
<li>Go to terminal and navigate to the project folder: cd .\quotetutorial\</li>
<li>Go to terminal and type: scrapy crawl quotes</li>

### open scrapy in shell mode

<li>Go to terminal and type: scrapy shell "https://quotes.toscrape.com/"</li>

In [1]: response.css("title")

Out[1]: [\<Selector xpath='descendant-or-self::title' data='\<title>Quotes to Scrape</title>\n    \<...'\>]


In [3]: response.css('title::text').extract()

Out[3]: ['Quotes to Scrape']


In [4]: response.css('title::text').extract_first()

Out[4]: 'Quotes to Scrape'


<li>extract() returns an empty list when nothing matches, so indexing the result (for example extract()[0]) raises an IndexError; extract_first() returns None instead of raising an error</li>

In [5]: response.css("span.text::text").extract()
Out[5]: 
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',

 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',

 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',

 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',

 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",

 '“Try not to become a man of success. Rather become a man of value.”',

 '“It is better to be hated for what you are than to be loved for what you are not.”',

 "“I have not failed. I've just found 10,000 ways that won't work.”",

 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 
 '“A day without sunshine is like, you know, night.”']

In [6]: response.css("span.text::text")[1].extract()

Out[6]: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'

In [7]: response.css(".author::text").extract()

Out[7]: 
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']




### installing SelectorGadget chrome extension

<li>Google search "Selector Gadget Chrome"</li>
<li>Go to chrome://extensions</li>
<li>Click on the extension icon and add extension</li>

#### using SelectorGadget

<li>Click on the elements you want to scrape on the page</li>
<li>Copy the CSS selector shown by the extension, for example .a-color-base.a-text-normal</li>
<li>Go to terminal</li>

scrapy shell "https://www.amazon.in/s?k=s22+samsung&crid=163A3PO1H6GYM&sprefix=s22+samsung%2Caps%2C383&ref=nb_sb_noss"


In [1]: response.css(".a-color-base.a-text-normal::text")[1].extract()

Out[1]: 'Samsung Galaxy S21 FE 5G (Graphite, 8GB, 256GB Storage)'


### Extracting data with xpath

scrapy shell "https://quotes.toscrape.com/"

In [2]: response.xpath("//title/text()").extract()

Out[2]: ['Quotes to Scrape']

In [3]: response.xpath("//span[@class='text']/text()").extract()

Out[3]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [4]: response.xpath("//span[@class='text']/text()")[1].extract()

Out[4]: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'



### Extracting data with css along with xpath

In [5]: response.css("li.next a").xpath("@href").extract()

Out[5]: ['/page/2/']

In [6]: response.css("a").xpath("@href").extract()

Out[6]: 
['/',
 '/login',
 '/author/Albert-Einstein',
 '/tag/change/page/1/',
 '/tag/deep-thoughts/page/1/',
 '/tag/thinking/page/1/',
 '/tag/world/page/1/',
 '/author/J-K-Rowling',
 '/tag/abilities/page/1/',
 '/tag/choices/page/1/',
 '/author/Albert-Einstein',
 '/tag/inspirational/page/1/',
 '/tag/life/page/1/',
 '/tag/live/page/1/',
 '/tag/miracle/page/1/',
 '/tag/miracles/page/1/',
 '/author/Jane-Austen',
 '/tag/aliteracy/page/1/',
 '/tag/books/page/1/',
 '/tag/classic/page/1/',
 '/tag/humor/page/1/',
 '/author/Marilyn-Monroe',
 '/tag/be-yourself/page/1/',
 '/tag/inspirational/page/1/',
 '/author/Albert-Einstein',
 '/tag/adulthood/page/1/',
 '/tag/success/page/1/',
 '/tag/value/page/1/',
 '/author/Andre-Gide',
 '/tag/life/page/1/',
 '/tag/love/page/1/',
 '/author/Thomas-A-Edison',
 '/tag/edison/page/1/',
 '/tag/failure/page/1/',
 '/tag/inspirational/page/1/',
 '/tag/paraphrased/page/1/',
 '/author/Eleanor-Roosevelt',
 '/tag/misattributed-eleanor-roosevelt/page/1/',
 '/author/Steve-Martin',
 '/tag/humor/page/1/',
 '/tag/obvious/page/1/',
 '/tag/simile/page/1/',
 '/page/2/',
 '/tag/love/',
 '/tag/inspirational/',
 '/tag/life/',
 '/tag/humor/',
 '/tag/books/',
 '/tag/reading/',
 '/tag/friendship/',
 '/tag/friends/',
 '/tag/truth/',
 '/tag/simile/',
 'https://www.goodreads.com/quotes',
 'https://scrapinghub.com']
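Most of the extracted links are relative to the site root. Here is a short sketch of turning them into absolute URLs with the standard library; the base URL is the page opened in the shell, and the `hrefs` list is a hand-picked subset of the output above. Inside a spider, `response.urljoin(href)` does the same job using the response's own URL as the base.

```python
from urllib.parse import urljoin

base_url = "https://quotes.toscrape.com/"
hrefs = ["/", "/login", "/page/2/", "https://www.goodreads.com/quotes"]

# Relative paths are resolved against base_url; absolute URLs pass through unchanged
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```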



### creating items to temporarily store scraped data

<li> Go to items.py file</li>
<li> Define the fields</li>
<li> Import the item class from items.py into the spider file in the spiders folder (spiders/quotes_spider.py)</li>

### Storing scraped data in JSON, XML or CSV file

<li>Go to terminal and navigate to folder cd .\quotetutorial\</li>

##### JSON
<li>Go to terminal and type: scrapy crawl quotes -o items.json</li>


##### XML
<li>Go to terminal and type: scrapy crawl quotes -o items.xml</li>

##### CSV
<li>Go to terminal and type: scrapy crawl quotes -o items.csv</li>

### Setting up the pipeline

<li> Go to settings.py file</li>
<li> Uncomment ITEM_PIPELINES</li>
<li> set the priority of the pipeline if there are multiple pipelines</li>
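Once uncommented, the setting might look like the sketch below. The dotted path is an assumption based on a project named `quotetutorial` and the `QuotetutorialPipeline` class; the second, commented-out entry is hypothetical and only there to show ordering. Priorities are customarily chosen in the 0 to 1000 range, and pipelines with lower numbers run first.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # Assumed module path: <project>.pipelines.<PipelineClass>
    "quotetutorial.pipelines.QuotetutorialPipeline": 300,
    # "quotetutorial.pipelines.SomeOtherPipeline": 400,  # hypothetical, would run after 300
}
```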

#### Storing in SQLite database

<li> Go to pipelines.py file</li>



    import sqlite3


    class QuotetutorialPipeline:

        def __init__(self):
            # Open the database and make sure the table exists before any item arrives
            self.create_connection()
            self.create_table()

        def create_connection(self):
            # The file is created in the directory where the crawl is started
            self.conn = sqlite3.connect("myquotes.db")
            self.curr = self.conn.cursor()

        def create_table(self):
            self.curr.execute("""CREATE TABLE IF NOT EXISTS quotes_tb(
                title TEXT,
                author TEXT,
                tag TEXT
            )""")

        def process_item(self, item, spider):
            # Called by Scrapy once for every item the spider yields
            self.store_db(item)
            print("Pipeline :" + item['title'][0])
            return item

        def store_db(self, item):
            # Each field is expected to hold a list, so only its first entry is stored
            self.curr.execute("""INSERT INTO quotes_tb VALUES (?,?,?)""", (
                item['title'][0],
                item['author'][0],
                item['tag'][0]
            ))
            self.conn.commit()

<li> Go to terminal and type: scrapy crawl quotes (this creates myquotes.db and fills the quotes_tb table)</li>
<li> Go to https://sqliteonline.com/ </li>
<li> Go to File and browse to the folder where the database file is located</li>

<img src="sqlite_database.png">
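The stored rows can also be inspected from Python instead of a web viewer. The sketch below uses an in-memory database with one made-up row so that it runs on its own; pointing `sqlite3.connect` at `myquotes.db` would read the real table.

```python
import sqlite3

# In-memory database for the sketch; use sqlite3.connect("myquotes.db") for the real file
conn = sqlite3.connect(":memory:")
curr = conn.cursor()

# Same table layout as the pipeline, plus one sample row (made-up data)
curr.execute("CREATE TABLE IF NOT EXISTS quotes_tb(title TEXT, author TEXT, tag TEXT)")
curr.execute(
    "INSERT INTO quotes_tb VALUES (?,?,?)",
    ("sample quote", "sample author", "sample-tag"),
)
conn.commit()

# Read everything back
rows = curr.execute("SELECT title, author, tag FROM quotes_tb").fetchall()
for title, author, tag in rows:
    print(f"{author}: {title} [{tag}]")

conn.close()
```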


### Scraping multiple pages

    # Inside the spider's parse() method, after the items have been yielded:
    next_page = response.css('li.next a::attr(href)').get()
    print(next_page)

    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

### Scraping using pagination

    # Inside the spider's parse() method; page_number is a class attribute of QuoteSpider
    next_page = 'http://quotes.toscrape.com/page/' + str(QuoteSpider.page_number) + '/'

    if QuoteSpider.page_number < 11:
        QuoteSpider.page_number += 1
        yield response.follow(next_page, callback=self.parse)

###  Logging in with Scrapy FormRequest

<li> Click on the login button</li>
<li> Enter username and password</li>
<li> Go to inspect element</li>
<li> Go to network tab</li>
<li> Click on login to submit the form</li>
<li> Select the login request and go to the Payload tab</li>
<li> csrf_token can be seen in the payload</li>

    import scrapy


    class QuoteSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/login'
        ]

        def parse(self, response):
            # The first form input that has a value attribute is the hidden csrf_token field
            token = response.css('form input::attr(value)').extract_first()
            print(token)

AYowUResZEPBCxytWdqKFguQkvMzNnbXJiGacpTSDmhILOjfVHrl


### Creating spiders automatically

<li> Go to Terminal and type the command: scrapy startproject amazontutorial</li>

##### OUTPUT 

New Scrapy project 'amazontutorial', using template directory 

'C:\Users\vijay\anaconda3\envs\data_science\lib\site-packages\scrapy\templates\project',
 created in:
 
    C:\Users\vijay\Jupyter Lab Scripts\Data_Science_Libraries\SCRAPY\amazontutorial

You can start your first spider with:

    cd amazontutorial
    
    scrapy genspider example example.com

PS C:\Users\vijay\Jupyter Lab Scripts\Data_Science_Libraries\SCRAPY> cd .\amazontutorial\

PS C:\Users\vijay\Jupyter Lab Scripts\Data_Science_Libraries\SCRAPY\amazontutorial> scrapy genspider amazon_spider amazon.com

Created spider 'amazon_spider' using template 'basic' in module:
    
    amazontutorial.spiders.amazon_spider

<img src="amazon_spider_creation.png">



### Setting User Agent

<li> Go to settings.py file</li>
<li> Uncomment USER_AGENT</li>
<li> Google search "googlebot user agent"</li>
<li> COPY : Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) </li>
<li> Paste in variable USER_AGENT</li>

### Bypass Restrictions using User-Agent

pip install scrapy-user-agents

Copy and paste the following into settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }

Then comment out USER_AGENT in settings.py.

### Bypass Restrictions using Proxy

pip install scrapy_proxy_pool

Repo : https://github.com/rejoiceinhope/scrapy-proxy-pool

Enable this middleware by adding the following settings to your settings.py:

    PROXY_POOL_ENABLED = True

Then copy and paste the following into settings.py:

    DOWNLOADER_MIDDLEWARES = {
        # ...
        'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
        'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
        # ...
    }