# Scrapy Utils
This Jupyter notebook provides some useful tips for managing Scrapy.

In [1]:
from scrapy import Request
from scrapy.utils.request import request_fingerprint

import shutil
import os

## Methods
### Retrieve fingerprint from url
Scrapy can cache requests. By default, the cached entries are stored in `.scrapy/httpcache`, and each entry is saved under the fingerprint of its request.

[request_fingerprint](https://github.com/scrapy/scrapy/blob/master/scrapy/utils/request.py) generates a SHA1 hash from the following parameters:
- request method ('GET', 'POST', etc)
- request url
- request body
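
The idea of combining these three inputs into a single hash can be sketched with the standard library alone. This is a simplified illustration, not Scrapy's actual implementation: the helper name `simple_fingerprint` is hypothetical, and it skips the URL canonicalization that Scrapy performs, so its output is not guaranteed to match `request_fingerprint`.

```python
import hashlib


def simple_fingerprint(url, method='GET', body=b''):
    """Hash the method, URL and body into a 40-character hex digest.

    Hypothetical helper for illustration only; the URL is used as-is,
    without any canonicalization.
    """
    sha1 = hashlib.sha1()
    sha1.update(method.encode('utf-8'))
    sha1.update(url.encode('utf-8'))
    sha1.update(body)
    return sha1.hexdigest()


# The same URL yields a different digest when the method changes.
get_fp = simple_fingerprint('https://www.scrapy.org/')
post_fp = simple_fingerprint('https://www.scrapy.org/', method='POST')
```

Because the method and body are part of the hash, a `GET` and a `POST` to the same URL end up as two separate cache entries.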

In [2]:
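def get_fingerprint(url, method='GET', headers=None):
    # Headers are passed to the request, but request_fingerprint ignores them by default.
    request = Request(url, method=method, headers=headers)
    return request_fingerprint(request)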

In [3]:
fingerprint = get_fingerprint('https://www.scrapy.org/')
fingerprint

'0929449f1da37f7c7a27ec01e041e7ac62b91165'

For example, when a request is made to https://www.scrapy.org/, Scrapy stores the request and the response in __.scrapy/httpcache/spider_name/09/0929449f1da37f7c7a27ec01e041e7ac62b91165__

In [4]:
def cache_path_from_fingerprint(finger_print, spider_name, httpcache_dir='httpcache'):
    return os.path.join('.scrapy', httpcache_dir, spider_name, finger_print[:2], finger_print)

In [5]:
cache_path = cache_path_from_fingerprint(fingerprint, spider_name='test')
cache_path

'.scrapy/httpcache/test/09/0929449f1da37f7c7a27ec01e041e7ac62b91165'
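
To see what a cache entry actually holds, its directory can simply be listed. A minimal sketch, assuming the default filesystem cache storage, which to my knowledge writes files such as `meta`, `request_headers`, `response_headers` and `response_body` into the entry directory. The helper name `list_cache_entry` is hypothetical.

```python
import os


def list_cache_entry(cache_path):
    """Return the sorted file names inside a cache entry, or [] if it is missing."""
    if not os.path.isdir(cache_path):
        return []
    return sorted(os.listdir(cache_path))


# Path taken from the example output above; the result depends on the local cache.
files = list_cache_entry('.scrapy/httpcache/test/09/0929449f1da37f7c7a27ec01e041e7ac62b91165')
```

An empty list means the URL has not been cached for that spider, or the entry has already been deleted.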

### Delete directory
A simple function that deletes a directory recursively. Warning: because `ignore_errors=True` is set, no error is reported if the directory doesn't exist or cannot be removed.

In [6]:
def delete_directory(path):
    shutil.rmtree(path, ignore_errors=True)
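
If silent failure is a concern, one option is to check for the directory first and report whether anything was removed. This is a minimal sketch. `delete_directory_checked` is a hypothetical name and is not one of the notebook's original helpers.

```python
import os
import shutil


def delete_directory_checked(path):
    """Delete a directory tree and return True if it existed and is now gone."""
    if not os.path.isdir(path):
        return False  # nothing to delete
    shutil.rmtree(path)  # errors such as permission problems are raised here
    return not os.path.exists(path)
```

The return value makes it easy to tell a real deletion apart from a mistyped path or a wrong spider name.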

## Applications

I usually use these functions to locate a badly cached URL and delete its entry, so that the cache is rebuilt on the next crawl.

In [7]:
delete_directory(cache_path)