# Job Data Web Scraping





#### Problem statement:

We want to build a dataset of all job postings from the Indian job search site "https://www.naukri.com/".

#### Input:

We have a file 'link_by_areas.csv' (stored in the 'data' folder) with the links to the job postings by area.

#### Output:

A CSV file which contains the job title, company name, expected experience, salary, location, associated tags, function area, posting date and scraping date.

Data should be scraped for every link, looping through all of its job posting pages.


![title](img/jobs_site.png)

#### Table of contents:
    
   1. Importing libraries, reading input data
   2. Understanding the URL structure
   3. Scraping data from Naukri.com
   4. Saving the scraped data into a csv file

#### 1. Importing libraries, reading input data

In [None]:
import time
import pandas as pd

from bs4 import BeautifulSoup
from selenium import webdriver

import warnings
warnings.filterwarnings("ignore")

In [None]:
joblinks = pd.read_csv('data/link_by_areas.csv')
print('Number of job areas:', len(joblinks))
joblinks.head()

#### 2. Understanding the URL structure


We want to create a general rule that lets us scrape any link from our list.

- All the links start with the domain: "https://www.naukri.com/".

- Next comes the job area we search for, e.g. "interior-design".

- Then the suffix "-jobs".

- Then the page number, e.g. "-2" (omitted for the 1st page). <---- **We need to add it to our list of links.**

- Finally the query tail, e.g. "?xt=catsrch&qf[]=12".
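
The rule above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own method (the cells below use `yarl`); the helper name `page_url` and the example link are assumptions assembled from the parts listed above.

```python
from urllib.parse import urlsplit


def page_url(first_page_url, page):
    """Build the link for a given result page from a first-page link (hypothetical helper)."""
    parts = urlsplit(first_page_url)
    # The first page has no suffix; later pages get "-2", "-3", ...
    suffix = '' if page == 1 else f'-{page}'
    query = f'?{parts.query}' if parts.query else ''
    return f'{parts.scheme}://{parts.netloc}{parts.path}{suffix}{query}'


# Illustrative link built from the parts described in the list above
example = 'https://www.naukri.com/interior-design-jobs?xt=catsrch&qf[]=12'

print(page_url(example, 1))  # same as the input link
print(page_url(example, 2))  # path ends with "-jobs-2", query tail unchanged
```

The page suffix is inserted between the path and the query tail, which is the same position the `'{}'` placeholder takes in the general links built below.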

In [None]:
urls = joblinks['link'].tolist()
urls[:2]

In [None]:
from yarl import URL  # library for splitting a link into its parts

# path of each link (the part between the domain and the query tail)
jobs_part = [URL(i).path for i in urls]

jobs_part[:5]

In [None]:
tails = []

for i in urls:
    tails.append(URL(i).query_string)
tails[:5]

In [None]:
# creating a general link template; '{}' marks where the page suffix is inserted later

gen_urls = []

for i in range(len(tails)):
    
    url = 'https://www.naukri.com' + jobs_part[i] + '{}?' + tails[i]
    gen_urls.append(url)
    
gen_urls[:5]

In [None]:
len(joblinks['link']) == len(gen_urls)

#### 3. Scraping data from Naukri.com

- Defining a DataFrame
- Extract the data

In [None]:
# To keep the run short, this cell scrapes only two links from the input file
# (gen_urls[2:4]) and the first two result pages of each.

from datetime import date
from selenium.webdriver.chrome.service import Service  # Selenium 4 style driver setup

df_posts = pd.DataFrame(columns = ['Function_Area', 'Job_Title', 'Experience', 'Company', 'Scraping_Date', 'Salary', 'Location', 'Tags_Associated', 'Posting_Date'])

LI_CLASS = 'fleft grey-text br2 placeHolderLi '   # shared prefix of the experience/salary/location items
SPAN_CLASS = 'ellipsis fleft fs12 lh16'           # span that holds the visible text


def text_or_none(elem):
    """Return the text of a tag, or None if the tag was not found."""
    return elem.text if elem is not None else None


def nested_text(parent, tag, class_, span_class):
    """Return the text of the span inside parent > tag, or None if either is missing."""
    outer = parent.find(tag, class_ = class_)
    if outer is None:
        return None
    return text_or_none(outer.find('span', class_ = span_class))


date_today = date.today().strftime("%d/%m/%Y")  # scraping date

for urll in gen_urls[2:4]:

    for page in range(1,3):

        # the first page has no page suffix, later pages get "-2", "-3", ...
        str_page = '' if page == 1 else '-' + str(page)
        url = urll.format(str_page)  # adding page to the link

        # start the browser (with Selenium 3, pass the path directly to webdriver.Chrome instead)
        driver = webdriver.Chrome(service = Service('/Users/anastasia/Desktop/global-warming/chromedriver'))
        driver.get(url)
        time.sleep(15) # let the page render and avoid being blocked as a bot
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()  # close the browser and end the driver session

        results = soup.find('div', class_ = 'list')
        if results is None:  # no job list on this page
            continue
        job_elems = results.find_all('article', class_ ='jobTuple bgWhite br4 mb-8')

        for job_elem in job_elems:

            # A missing field is stored as None instead of skipping the whole posting
            Job_Title = text_or_none(job_elem.find('a', class_ = 'title fw500 ellipsis'))
            Company = text_or_none(job_elem.find('a', class_ = 'subTitle ellipsis fleft'))
            Experience = nested_text(job_elem, 'li', LI_CLASS + 'experience', SPAN_CLASS)
            Salary = nested_text(job_elem, 'li', LI_CLASS + 'salary', SPAN_CLASS)
            Location = nested_text(job_elem, 'li', LI_CLASS + 'location', SPAN_CLASS)

            # Tags: find_all returns an empty list (never None) when nothing matches
            assoc_tags = [tag.text for tag in job_elem.find_all('li', class_ = 'fleft fs12 grey-text lh16 dot')]

            # Date job was posted
            Posting_Date = nested_text(job_elem, 'div', 'type br2 fleft green', 'fleft fw500')

            df_posts.loc[len(df_posts.index)] = [URL(url).path[1:], Job_Title, Experience, Company,
                                                date_today, Salary, Location, assoc_tags, Posting_Date]
            
        

In [None]:
df_posts
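
The extraction step can be tried offline on a small piece of HTML before running it against the live site. The sketch below is only an illustration under stated assumptions: the markup in `SAMPLE_HTML` and its short class names are made up and much simpler than the real Naukri.com page, and `text_or_none` is a hypothetical helper. It shows how a missing element can be recorded as `None` rather than causing an error.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page uses longer class names.
SAMPLE_HTML = """
<div class="list">
  <article class="jobTuple">
    <a class="title">Interior Designer</a>
    <a class="subTitle">Acme Studio</a>
    <li class="salary"><span>Not disclosed</span></li>
  </article>
  <article class="jobTuple">
    <a class="title">Site Supervisor</a>
  </article>
</div>
"""


def text_or_none(parent, tag, class_):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    elem = parent.find(tag, class_=class_)
    return elem.get_text(strip=True) if elem is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = []
for job in soup.find_all('article', class_='jobTuple'):
    rows.append({
        'Job_Title': text_or_none(job, 'a', 'title'),
        'Company': text_or_none(job, 'a', 'subTitle'),
        'Salary': text_or_none(job, 'li', 'salary'),
    })

for row in rows:
    print(row)
```

The second posting has no company or salary element, so those fields come back as `None` while the posting itself is still kept.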

In [None]:
df_posts.shape
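
The scraping cell in this section fixes the number of pages per link. To loop through all pages of a link, one possible approach is to keep requesting pages until one comes back without postings. The sketch below assumes that behaviour; `scrape_all_pages`, `fetch_jobs`, `fake_fetch` and the `example.com` links are hypothetical stand-ins, where `fetch_jobs` would wrap the Selenium and BeautifulSoup step.

```python
def scrape_all_pages(url_template, fetch_jobs, max_pages=1000):
    """Collect postings page by page until a page returns nothing (or max_pages is hit)."""
    all_jobs = []
    for page in range(1, max_pages + 1):
        # The first page has no suffix; later pages get "-2", "-3", ...
        suffix = '' if page == 1 else '-' + str(page)
        jobs = fetch_jobs(url_template.format(suffix))
        if not jobs:  # an empty page is taken as the end of the listing
            break
        all_jobs.extend(jobs)
    return all_jobs


def fake_fetch(url):
    """Stand-in for the real page download and parsing; returns canned results."""
    pages = {
        'https://example.com/demo-jobs?x=1': ['job A', 'job B'],
        'https://example.com/demo-jobs-2?x=1': ['job C'],
    }
    return pages.get(url, [])


collected = scrape_all_pages('https://example.com/demo-jobs{}?x=1', fake_fetch)
print(collected)  # ['job A', 'job B', 'job C']
```

The `max_pages` cap is a safety limit in case the site never returns an empty page.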

In [None]:
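# Optional: reset df_posts to an empty table before a fresh scraping run.
# Skip this cell to keep the rows collected above; otherwise the file saved in section 4 will be empty.
df_posts = pd.DataFrame(columns = ['Function_Area', 'Job_Title', 'Experience', 'Company', 'Scraping_Date', 'Salary', 'Location', 'Tags_Associated', 'Posting_Date'])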

In [None]:
df_posts

#### 4. Saving the scraped data into a csv file

In [None]:
# index=False keeps the DataFrame's row index out of the file
df_posts.to_csv('scraped_job_data.csv', index=False)
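
Because `Tags_Associated` holds a Python list for each posting, writing it to CSV as-is stores the list's string form (for example `"['a', 'b']"`), which is awkward to read back. One option is to join the tags into a single delimited string first. The sketch below illustrates the idea with the standard `csv` module on made-up rows; `flatten_tags` and the `'; '` separator are assumptions, not part of the original notebook.

```python
import csv
import io

# Made-up rows in the same shape as two of the scraped columns
rows = [
    {'Job_Title': 'Interior Designer', 'Tags_Associated': ['AutoCAD', '3D Max']},
    {'Job_Title': 'Site Supervisor', 'Tags_Associated': []},
]


def flatten_tags(row, sep='; '):
    """Return a copy of the row with the tag list joined into one string."""
    flat = dict(row)
    flat['Tags_Associated'] = sep.join(row['Tags_Associated'])
    return flat


buffer = io.StringIO()  # in-memory file, so the example writes nothing to disk
writer = csv.DictWriter(buffer, fieldnames=['Job_Title', 'Tags_Associated'])
writer.writeheader()
writer.writerows(flatten_tags(r) for r in rows)

csv_text = buffer.getvalue()
print(csv_text)
```

With a DataFrame, the same effect could be had by applying a join to the `Tags_Associated` column before calling `to_csv`.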