Install scrapy using pip.
pip install scrapy
This project extracts data from Naukri and saves it to a folder using the Scrapy framework.
To scrape any website, an understanding of the DOM and XPath is helpful. XPath is easy to learn; for this project I used Xpather to prepare the needed XPath queries.
Run the below command to create the project:
scrapy startproject naukriscraper
It will create a directory named after the project with the following contents:
naukriscraper/
    scrapy.cfg            # deploy configuration file
    naukriscraper/        # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Add a spider to the project by creating python_spider.py in the spiders directory.
import os

import scrapy


class PythonJobSpider(scrapy.Spider):
    name = 'pythonjob'

    def start_requests(self):
        urls = [
            'https://www.naukri.com/python-jobs-in-hyderabad',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        outdirpath = os.path.join(os.getcwd(), 'scrapefiles')
        os.makedirs(outdirpath, exist_ok=True)  # create the output folder if it is missing
        page = response.url.split('/')[3]
        filename = os.path.join(outdirpath, 'python-%s.html' % page)
        # Each job card carries a data-url attribute; parse each card with its own Selector.
        for link in response.xpath('//*[@data-url]').getall():
            sel = scrapy.Selector(text=link)
            yield {
                'jobtitle': sel.xpath('string(//div/span/ul/li[@title])').extract_first(),
                'joburl': sel.xpath('//div/@data-url').extract_first(),
                'companyname': sel.xpath('//div/span/span/span[@class="org"]/text()').extract_first(),
                'experience': sel.xpath('//div/span/span[@class="exp"]/text()').extract_first(),
                'location': sel.xpath('//div/span/span[@class="loc"]/span/text()').extract_first(),
                'skills': sel.xpath('//div/span/div/span[@class="skill"]/text()').extract_first(),
                'moredesc': sel.xpath('//div/span/div[2][@class="more desc"]/span/text()').extract_first(),
                'salary': sel.xpath('//div/div/span[@class="salary"]/text()').extract_first(),
                'postedby': sel.xpath('//div/div/div/a[@class="rec_name"]/text()').extract_first(),
                'dayposted': sel.xpath('//div/div/div/span[@class="date"]/text()').extract_first(),
            }
        # Save the raw page for reference.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        # Follow the rel="next" link in the page head, if present, to crawl subsequent pages.
        next_page = response.xpath('/html/head/link[contains(@rel,"next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Define USER_AGENT in the settings.py file so requests identify themselves as a browser:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
Run the below command from the project's top-level directory to scrape the data:
scrapy crawl pythonjob
This will crawl the link and save each web page in the scrapefiles folder.
To save the yielded items to a file, run the below:
scrapy crawl pythonjob -o pythonjob.json
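Scrapy's feed exports infer the output format from the file extension, so the same spider can write other formats without code changes:

```shell
scrapy crawl pythonjob -o pythonjob.csv   # comma-separated values
scrapy crawl pythonjob -o pythonjob.jl    # JSON Lines, one item per line
```

JSON Lines is handy for large crawls, since items are appended one per line instead of being held in a single JSON array.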
Deploy scrapy project using scrapyd
Install scrapyd and scrapyd-client
pip install scrapyd
pip install git+https://github.com/scrapy/scrapyd-client
Start the scrapyd server and verify it is reachable at http://localhost:6800/
Deploy the scrapy project directly from the project folder by running the below command.
scrapyd-deploy samplescrapy -p naukriscraper
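Here samplescrapy is the deploy target name, which must match a [deploy:...] section in the project's scrapy.cfg. A minimal sketch (the url value assumes scrapyd is running locally on the default port):

```ini
[settings]
default = naukriscraper.settings

[deploy:samplescrapy]
url = http://localhost:6800/
project = naukriscraper
```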
Schedule a crawl manually via the scrapyd API:
curl http://localhost:6800/schedule.json -d project=naukriscraper -d spider=pythonjob
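The schedule.json endpoint returns a jobid for the queued run; the status of running and finished jobs can then be checked with the listjobs endpoint:

```shell
# Lists pending, running, and finished jobs for the project.
curl "http://localhost:6800/listjobs.json?project=naukriscraper"
```

The scrapyd web UI at http://localhost:6800/ shows the same information, along with links to each job's logs.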
Happy crawling.