Install scrapy using pip.
pip install scrapy
This project extracts data from Naukri and saves it to a folder using the Scrapy framework.
To scrape any website, an understanding of the DOM and XPath is helpful. XPath is easy to learn; for this project I used Xpather to prepare the needed XPath queries.
Run the below command to create the project:
scrapy startproject naukriscraper
It will create a directory named after the project with the following contents:
naukriscraper/
    scrapy.cfg            # deploy configuration file
    naukriscraper/        # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Add a spider to the project by creating python_spider.py in the spiders directory.
import os

import scrapy


class PythonJobSpider(scrapy.Spider):
    name = 'pythonjob'

    def start_requests(self):
        urls = [
            'https://www.naukri.com/python-jobs-in-hyderabad',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        outdirpath = os.path.join(os.getcwd(), 'scrapefiles')
        os.makedirs(outdirpath, exist_ok=True)  # create the output folder if it is missing
        page = response.url.split('/')[3]
        filename = os.path.join(outdirpath, 'python-%s.html' % page)
        # Each job card carries a data-url attribute; parse each card with its own Selector.
        for link in response.xpath('//*[@data-url]').getall():
            sel = scrapy.Selector(text=link)
            yield {
                'jobtitle': sel.xpath('string(//div/span/ul/li[@title])').extract_first(),
                'joburl': sel.xpath('//div/@data-url').extract_first(),
                'companyname': sel.xpath('//div/span/span/span[@class="org"]/text()').extract_first(),
                'experience': sel.xpath('//div/span/span[@class="exp"]/text()').extract_first(),
                'location': sel.xpath('//div/span/span[@class="loc"]/span/text()').extract_first(),
                'skills': sel.xpath('//div/span/div/span[@class="skill"]/text()').extract_first(),
                'moredesc': sel.xpath('//div/span/div[2][@class="more desc"]/span/text()').extract_first(),
                'salary': sel.xpath('//div/div/span[@class="salary"]/text()').extract_first(),
                'postedby': sel.xpath('//div/div/div/a[@class="rec_name"]/text()').extract_first(),
                'dayposted': sel.xpath('//div/div/div/span[@class="date"]/text()').extract_first(),
            }
        # Save the raw page for reference.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        # Follow the rel="next" link in the page head, if present, to crawl subsequent pages.
        next_page = response.xpath('/html/head/link[contains(@rel,"next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Define USER_AGENT in the settings.py file so requests identify themselves as a browser:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
Run the below command from the project's top-level directory to scrape the data:
scrapy crawl pythonjob
This will crawl the link and save each web page in the scrapefiles folder.
To save the yielded items to a file, run the below:
scrapy crawl pythonjob -o pythonjob.json
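Scrapy's feed exports infer the output format from the file extension, so the same spider can write other formats without code changes:

```shell
scrapy crawl pythonjob -o pythonjob.csv   # comma-separated values
scrapy crawl pythonjob -o pythonjob.jl    # JSON Lines, one item per line
```

JSON Lines is handy for large crawls, since items are appended one per line instead of being held in a single JSON array.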
Deploy scrapy project using scrapyd
Install scrapyd and scrapyd-client
pip install scrapyd
pip install git+https://github.com/scrapy/scrapyd-client
Start the scrapyd server and verify it is reachable at http://localhost:6800/
Deploy the scrapy project directly from the project folder by running the below command.
scrapyd-deploy samplescrapy -p naukriscraper
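Here samplescrapy is the deploy target name, which must match a [deploy:...] section in the project's scrapy.cfg. A minimal sketch (the url value assumes scrapyd is running locally on the default port):

```ini
[settings]
default = naukriscraper.settings

[deploy:samplescrapy]
url = http://localhost:6800/
project = naukriscraper
```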
Schedule a crawl manually via the scrapyd API:
curl http://localhost:6800/schedule.json -d project=naukriscraper -d spider=pythonjob
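The schedule.json endpoint returns a jobid for the queued run; the status of running and finished jobs can then be checked with the listjobs endpoint:

```shell
# Lists pending, running, and finished jobs for the project.
curl "http://localhost:6800/listjobs.json?project=naukriscraper"
```

The scrapyd web UI at http://localhost:6800/ shows the same information, along with links to each job's logs.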
Happy crawling.