# **Web scraping for job listings from timesjobs.com using BeautifulSoup**

### Project aim:
*        Get the first few pages of job listings from timesjobs.com
*        Provide basic information about each job, such as:
            a. Job title
            b. Company
            c. Experience required
            d. Job description
            e. Key skills
            f. Time when the listing was posted, etc.


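#### Import necessary libraries
*  BeautifulSoup from bs4 to parse the HTML
*  requests to fetch the pages
*  re (regular expressions) to clean the text
*  pandas to store the data in xlsx format
*  lxml (the parser passed to BeautifulSoup) and an Excel writer such as openpyxl must also be installed, although they are not imported directly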

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
import bs4
bs4.__version__

'4.6.3'

#### Get Inputs:
*        Job Title
*        Location
*        Experience
*        Maximum number of pages to parse
*        Name of the file where the results are stored (the .xlsx extension is added automatically)

In [3]:
jobTitle   = input('Enter a job title to search: ')
location   = input('Enter a location: ')
experience = input('Enter years of experience: ')
maxPages   = int(input('Enter max number of pages to be parsed: '))
fileName   = input('Enter the file name where results to be stored: ') + '.xlsx'

url = 'https://www.timesjobs.com/candidate/job-search.html?\
searchType=personalizedSearch&from=submit&txtKeywords={}&txt\
Location={}&cboWorkExp1={}&sequence='.format(jobTitle, location, experience)

Enter a job title to search: data science
Enter a location: bangalore
Enter years of experience: 1
Enter max number of pages to be parsed: 50
Enter the file name where results to be stored: sample_output
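
The search terms above are placed into the URL exactly as typed, so a value containing a space (such as "data science") is sent unescaped. A minimal sketch of an alternative, assuming the same query parameters as the URL above, is to build the query string with `urllib.parse.urlencode`. The helper name `build_search_url` is hypothetical and not part of the original notebook, and whether the site treats `+` and a raw space identically is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.timesjobs.com/candidate/job-search.html'

def build_search_url(job_title, location, experience):
    """Return the search URL, ending in 'sequence=' so a page number can be appended."""
    params = {
        'searchType': 'personalizedSearch',
        'from': 'submit',
        'txtKeywords': job_title,
        'txtLocation': location,
        'cboWorkExp1': experience,
    }
    # urlencode escapes each value (a space becomes '+') and joins the pairs with '&'
    return BASE_URL + '?' + urlencode(params) + '&sequence='

url = build_search_url('data science', 'bangalore', '1')
print(url)
```

The result can be used in place of the hand-formatted `url`, since it still ends with `sequence=`.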


### Define the data to fetch

In [4]:
columns = ['Job type', 'Company Name', 'Experience required',
           'Work Location', 'Compensation', 'Job description',
           'Skill set', 'Posted Time', 'WFH Available', 'More details']

### Helper functions

#### Function to get the job cards from the TimesJobs website

In [5]:
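def getJobCards(pageNumber):
    # `url` ends with 'sequence=', so appending the page number selects that results page
    times_job_page = requests.get(url + str(pageNumber), timeout=30)  # 30 s is an arbitrary limit
    times_job_page.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    # each job listing is an <li class="wht-shd-bx"> element
    return BeautifulSoup(times_job_page.content, 'lxml').find_all('li', class_='wht-shd-bx')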

#### Lambda function to collapse whitespace (\n, \r, \t and repeated spaces) into a single space

In [6]:
cleanText = lambda x: re.sub(r'\s+', ' ', x)
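
The substitution above collapses each run of whitespace into one space but leaves a leading or trailing space in place. A small sketch of a variant that also trims the ends is shown here; the name `clean_text` is hypothetical and the sample string is made up.

```python
import re

def clean_text(text):
    """Collapse runs of whitespace (spaces, newlines, tabs) into one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

raw = '\r\n  Data   Science\t\n'
print(repr(clean_text(raw)))
```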

#### Get the tags where the job listings are present
Function to extract the following data from a single job card:
* Job type
* Company name
* Experience required
* Work location
* Compensation
* Job description
* Skill set
* Posted time
* WFH available
* URL for more details

In [7]:
def parseJobCard(jobCard):
    jobType         = jobCard.find('a').text
    moreDetails     = jobCard.find('a').get('href')
    companyName     = jobCard.find('h3', class_='joblist-comp-name').contents[0]
    jobDetails      = jobCard.find('ul', class_="top-jd-dtl")
    # drop the first 11 characters of the item, which precede the experience range (e.g. '0 - 3 yrs')
    reqExp          = jobDetails.select_one('li').text[11:]
    location        = jobDetails.select_one('li > span').text
    secondListChild = jobDetails.select_one('li + li').text
    compensation    = secondListChild if 'p.a' in secondListChild else 'NA'
    jobDescription  = jobCard.find(class_='list-job-dtl').select_one('li:nth-of-type(1)').contents[2]
    skillSet        = jobCard.find(class_='srp-skills').text
    tags            = jobCard.find(class_='sim-posted')
    postedTime      = tags.select('span')[-1].text
    isWFHAvailable  = 'Available' if 'Work from Home' in tags.select_one('span').text else 'NA'
    
    return {
        'Job type'           : cleanText(jobType),
        'Company Name'       : cleanText(companyName),
        'Experience required': cleanText(reqExp),
        'Work Location'      : cleanText(location), 
        'Compensation'       : cleanText(compensation),
        'Job description'    : cleanText(jobDescription),
        'Skill set'          : cleanText(skillSet),
        'Posted Time'        : cleanText(postedTime),
        'WFH Available'      : cleanText(isWFHAvailable),
        'More details'       : moreDetails
    }
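
The function above assumes every tag is present on every card. When `find` or `select_one` matches nothing it returns `None`, and accessing `.text` on that raises `AttributeError`. A minimal sketch of a defensive lookup follows; `text_or_default` is a hypothetical helper, the HTML is a made-up card that only reuses the class names from `parseJobCard`, and the built-in `html.parser` is used so that lxml is not needed for the example.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, selector, default='NA'):
    """Return the stripped text of the first tag matching the CSS selector, or `default`."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else default

# Made-up card for illustration; the class names are the ones used in parseJobCard.
sample_html = '''
<li class="wht-shd-bx">
  <h3 class="joblist-comp-name">Example Corp</h3>
  <span class="srp-skills">python , sql</span>
</li>
'''
card = BeautifulSoup(sample_html, 'html.parser').select_one('li.wht-shd-bx')

company = text_or_default(card, 'h3.joblist-comp-name')
skills = text_or_default(card, '.srp-skills')
posted = text_or_default(card, '.sim-posted')  # this tag is absent in the sample card
print(company, skills, posted, sep=' | ')
```

With a helper like this, a card that lacks one field yields `'NA'` for that column instead of aborting the whole run.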

### Main logic

#### Create empty dataframe to hold the result

In [10]:
parsedResultData = pd.DataFrame([], columns = columns)

#### Get job listings for up to maxPages pages

In [11]:
for page in range(1, maxPages + 1):
    # fetch the next page
    jobCards = getJobCards(page)
    # stop as soon as a page returns no job cards
    if len(jobCards) == 0:
        break
    parsedJobdf = pd.DataFrame([parseJobCard(jobCard) for jobCard in jobCards],
                               columns = columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
    parsedResultData = pd.concat([parsedResultData, parsedJobdf])
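
Growing a DataFrame inside the loop copies the accumulated data on every page. One common alternative, sketched here under the assumption that each parsed card is a plain dict, is to collect the rows in a list and build the frame once at the end. `fake_pages` and `demo_columns` are hypothetical stand-ins so that the example runs without network access.

```python
import pandas as pd

demo_columns = ['Job type', 'Company Name']

# Stand-in for the parsed cards of two result pages, followed by an empty page.
fake_pages = [
    [{'Job type': 'Data Science', 'Company Name': 'A'},
     {'Job type': 'Data Analyst', 'Company Name': 'B'}],
    [{'Job type': 'ML Engineer', 'Company Name': 'C'}],
    [],
]

rows = []
for page_rows in fake_pages:
    if not page_rows:  # an empty page means there are no more results
        break
    rows.extend(page_rows)

# Build the DataFrame once, after all pages have been collected.
result = pd.DataFrame(rows, columns=demo_columns)
print(len(result), result.size)
```

Here `len(result)` is the number of rows (one per listing), while `result.size` is rows multiplied by columns.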

#### Print the number of listings parsed

In [12]:
# len() counts rows (one per listing); .size would count rows × columns (12500 for 10 columns)
print(f'Number of job listings parsed: {len(parsedResultData)}')

Number of job listings parsed: 1250


#### Print the first 5 rows

In [13]:
parsedResultData.head()

| | Job type | Company Name | Experience required | Work Location | Compensation | Job description | Skill set | Posted Time | WFH Available | More details |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Science | ADmyBRAND | 0 - 3 yrs | Bengaluru / Bangalore | | Data ScienceSelecting features , building and... | data mining , api , machine learning | Posted few days ago | | https://www.timesjobs.com/job-detail/data-scie... |
| 1 | Data Science | DataWeave Software Pvt. Ltd. | 0 - 3 yrs | Bengaluru / Bangalore | | Data ScienceWe the Data Science team at DataW... | natural language processing , machine learnin... | Posted few days ago | | https://www.timesjobs.com/job-detail/data-scie... |
| 2 | Data Science | capgemini | 0 - 3 yrs | Hyderabad/Secunderabad, Mumbai, Pune, Bengalur... | | Job DescriptionHands on experience in Python ... | hive , cloudera , python , sas , scala , impa... | Posted few days ago | | https://www.timesjobs.com/job-detail/data-scie... |
| 3 | Data Analyst / Data Science | CANVAS27.com | 1 - 6 yrs | Ahmedabad, Bengaluru / Bangalore, Chennai, Del... | ₹Rs 4.00 - 9.00 Lacs p.a. | Common data science toolkits , such as Python... | r , data analysis , logistic regression , sql... | Posted a month ago | | https://www.timesjobs.com/job-detail/data-anal... |
| 4 | Explore Job Opening on Data Science | IIBM Institute of Business Management | 0 - 3 yrs | Bengaluru / Bangalore, Chennai, Delhi/NCR, Hyd... | | IIBM Institute offers job linked internship a... | IT Proffestionals , Python , Java | Posted few days ago | Available | https://www.timesjobs.com/job-detail/explore-j... |


#### Store the parsed data in an .xlsx file

In [14]:
parsedResultData.to_excel(fileName, index = False)