# Target:	DE Candidate
<br><br>Test Concept:	You are asked to scrape data from the job recruitment platform LinkedIn and store the requested data features in our data warehouse.
<br><br>In this case, we want all available information about hiring opportunities for data-related positions. Use these search keywords:
-	Senior Data Engineer / Data Engineer
-	Senior Data Scientist / Data Scientist
-	Senior Data Analyst / Data Analyst
-	Senior Business Intelligence / Business Intelligence Analyst
<br><br>These are the data features that we want:
1.	Company Name
2.	Job Posting Time (23 hours ago, 1 minute ago, etc)
3.	Number of applicants
4.	Seniority level  (Entry level/Associate/Mid-senior level)
5.	Size of employee (company size)
6.	Company industry
7.	Detail description (job desc, job req, benefit, etc)
8.	Employment type
9.	Job Function

<br><br>Store all information in your own data warehouse; the “Detail Description” can be stored in Google Cloud Storage. You also need to visualize or present the results (aggregation, dashboard, etc.) in a way that gives meaningful insights.
<br><br>Key objectives:
-	9 features for each job opportunity
<br><br>Tools needed:
-	GitHub (Documentation + File Management)
<br><br>Effort (time/duration):
-	Until 9 April; the faster, the better.
<br><br>Output:
-	GitHub Repository

<br><br>Notes:
-	Make sure that your script runs well before submission; we will try it!
-	Write proper documentation, especially for the scraping flow.
-	Send the output to academi@blank-space.io cc: jedi@blank-space.io, rahadian@blank-space.io, rahul@blank-space.io



# Author note
Adapted from https://amandeepsaluja.com/extracting-job-information-from-linkedin-jobs-using-beautifulsoup-and-selenium/
<br><br>Limitations:
- Works only when not logged in to LinkedIn
- "Company Size" is not included because it is only visible when logged in
- Location is limited to Indonesia
- The "Show more" button in the description is ignored

In [1]:
# Install when required
# !pip install selenium
# !pip install webdriver_manager

In [2]:
# importing packages
import pandas as pd
import re

from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from time import sleep
# from time import time
from webdriver_manager.chrome import ChromeDriverManager
import urllib.parse

In [30]:
# Replace the variables here.
# The keyword is wrapped in %22 (double quotes) so the search uses it exactly as written.
# Test with "Data Engineer" first.
keyword = "Data Engineer"
url = "https://www.linkedin.com/jobs/search?keywords=%22"+ urllib.parse.quote(keyword, safe='') +"%22&location=indonesia"
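The brief lists eight search tags, while the cell above tests only one of them. A minimal sketch of how the same URL pattern could be generated for every tag follows; `KEYWORDS`, `build_search_url`, and the `location` parameter are illustrative names that do not appear in the original notebook.

```python
import urllib.parse

# Search tags taken from the test brief.
KEYWORDS = [
    "Senior Data Engineer", "Data Engineer",
    "Senior Data Scientist", "Data Scientist",
    "Senior Data Analyst", "Data Analyst",
    "Senior Business Intelligence", "Business Intelligence Analyst",
]


def build_search_url(keyword, location="indonesia"):
    """Build the public job-search URL with the keyword wrapped in %22 (quotes)."""
    quoted_keyword = urllib.parse.quote(keyword, safe="")
    quoted_location = urllib.parse.quote(location, safe="")
    return (
        "https://www.linkedin.com/jobs/search?keywords=%22"
        + quoted_keyword
        + "%22&location="
        + quoted_location
    )


# One URL per keyword, ready to be passed to driver.get(...) in turn.
urls = {kw: build_search_url(kw) for kw in KEYWORDS}
```

Looping over `urls` and running the rest of the notebook once per entry would produce one result set per keyword.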

In [31]:
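# this will open up a new browser window with the url provided above
# note: this call style targets Selenium 3; Selenium 4 expects a Service object
# and replaces find_element_by_xpath(...) with find_element(By.XPATH, ...)
# driver = webdriver.Chrome()
driver = webdriver.Chrome(ChromeDriverManager().install())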
driver.get(url)
sleep(3)
action = ActionChains(driver)

[WDM] - Current google-chrome version is 89.0.4389
[WDM] - Get LATEST driver version for 89.0.4389
[WDM] - Driver [C:\Users\rnauv\.wdm\drivers\chromedriver\win32\89.0.4389.23\chromedriver.exe] found in cache






In [32]:
# scroll down for all available jobs
no_of_jobs = int(driver.find_element_by_xpath('/html/body/main/div/section[2]/div/h1/span').text)
linkedin_job_per_page = 20
for i in range(0, round(no_of_jobs/linkedin_job_per_page)):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)
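The number of scrolls depends on the result count read from the page header. The sketch below isolates that arithmetic in two standalone helpers. It rounds up, which errs on the side of one extra scroll, and it tolerates counts that contain separators. The helper names and the `1,000+` format are assumptions for illustration; only the plain integer case is confirmed by the notebook.

```python
import math
import re


def parse_job_count(text):
    """Extract the integer from a results-count string such as '141' or '1,000+'."""
    digits = re.sub(r"[^\d]", "", text)
    if not digits:
        raise ValueError("no digits in job count: {!r}".format(text))
    return int(digits)


def scrolls_needed(no_of_jobs, jobs_per_page=20):
    """Round up so that a final, partially filled page still triggers a scroll."""
    return math.ceil(no_of_jobs / jobs_per_page)
```

For example, `scrolls_needed(parse_job_count("141"))` gives 8, one more than `round(141 / 20)`.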

In [33]:
# parsing the visible webpage
pageSource = driver.page_source
lxml_soup = BeautifulSoup(pageSource, 'lxml')

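# searching for all job containers (one <li> per job card, direct children only)
job_container = lxml_soup.find('ul', class_ = 'jobs-search__results-list').find_all('li', recursive=False)

print('You are scraping information about {} jobs.'.format(len(job_container)))
# note: for some reason this does not return every job listed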

You are scraping information about 141 jobs.


In [34]:
# setting up list for job information
# Company Name
company_name = []

# Job Posting Time (23 hours ago, 1 minute ago, etc)
post_date = []

In [35]:
# for loop for company name and posting date
for job in job_container:

    # company name
    company_names = job.select_one('img')['alt']
    # remove "Graphic"
    company_names = company_names[:-8]
    # Append to list
    company_name.append(company_names)
    
    # Keep this code part if maybe location is needed
    # # job location
    # job_locations = job.find("span", class_="job-result-card__location").text
    # job_location.append(job_locations)
    
    # posting date
    post_dates = job.select_one('time').text
    post_date.append(post_dates)
    
# Check how many rows we got
print("Company Name:", len(company_name), "rows")
print("Job Posting Time:", len(post_date), "rows")


Company Name: 141 rows
Job Posting Time: 141 rows


In [36]:
# Number of applicants
applicants = []

# Seniority level (Entry level/Associate/Mid-senior level)
level = []

# Size of employee
# NA
# Have to be logged in

# Company industry
industries = []

# Detail description (job desc, job req, benefit, etc)
job_desc = []

# Employment type
emp_type = []

# Job Function
functions = []

In [37]:
# for loop for job description and criteria
print("Start processing detail..")
for x in range(1,len(company_name)+1):
# for x in range(23,24):
    
    # clicking on different job containers to view information about the job
    job_xpath = '/html/body/main/div/section/ul/li[{}]/img'.format(x)
    driver.find_element_by_xpath(job_xpath).click()
    sleep(1)
    
    # job description
    jobdesc_xpath = '/html/body/main/section/div[2]/section[2]/div'
    job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
    # re-read from section[3] when a compensation ("Base pay range") section comes first
    if "Base pay range" in job_descs:
        jobdesc_xpath = '/html/body/main/section/div[2]/section[3]/div'
        job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
    job_desc.append(job_descs)
    
    # Seniority level
    # fall back to section[3] when a compensation section shifts the layout
    seniority_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[1]'
    try:
        seniority = driver.find_element_by_xpath(seniority_xpath).text.splitlines()[1]
    except Exception:
        seniority_xpath = '/html/body/main/section/div[2]/section[3]/ul/li[1]'
        seniority = driver.find_element_by_xpath(seniority_xpath).text.splitlines()[1]
    level.append(seniority)
    
    # Employment type
    # fall back to section[3] when a compensation section shifts the layout
    type_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[2]'
    try:
        employment_type = driver.find_element_by_xpath(type_xpath).text.splitlines()[1]
    except Exception:
        type_xpath = '/html/body/main/section/div[2]/section[3]/ul/li[2]'
        employment_type = driver.find_element_by_xpath(type_xpath).text.splitlines()[1]
    
    emp_type.append(employment_type)
    
    # Job function
    # use section[3] when section[2] has no function entries (compensation section present)
    job_function = ''
    function_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[3]/span'
    if len(driver.find_elements_by_xpath(function_xpath)) != 0:
        for function_elem in driver.find_elements_by_xpath(function_xpath):
            job_function = job_function + ',' + function_elem.text
    else:
        function_xpath = '/html/body/main/section/div[2]/section[3]/ul/li[3]/span'
        for function_elem in driver.find_elements_by_xpath(function_xpath):
            job_function = job_function + ',' + function_elem.text
    
    # remove comma in the front
    job_function = job_function[1:]
    functions.append(job_function)
    
    # Industries
    # use section[3] when section[2] has no industry entries (compensation section present)
    industry_type = ''
    industry_xpath = '/html/body/main/section/div[2]/section[2]/ul/li[4]/span'
        
    if len(driver.find_elements_by_xpath(industry_xpath)) != 0:
        for industry_elem in driver.find_elements_by_xpath(industry_xpath):
            industry_type = industry_type + ',' + industry_elem.text
        # remove comma in the front
        industry_type = industry_type[1:]
    else:
        industry_xpath = '/html/body/main/section/div[2]/section[3]/ul/li[4]/span'
        if len(driver.find_elements_by_xpath(industry_xpath)) != 0:
            for industry_elem in driver.find_elements_by_xpath(industry_xpath):
                industry_type = industry_type + ',' + industry_elem.text
            # remove comma in the front
            industry_type = industry_type[1:]
        else:
            # if somehow they don't give the industry type
            industry_type = 'NA'
    industries.append(industry_type)
    
    # applicants
    # fall back to the <figure> element when the posting shows "Be among the first 25 applicants"
    applicant_xpath = '/html/body/main/section/div[2]/section/div/div/h3[2]/span[2]'  
    try:
        applicant = driver.find_element_by_xpath(applicant_xpath).text
    except:
        applicant_xpath = '/html/body/main/section/div[2]/section/div/div/h3[2]/figure'
        applicant = driver.find_element_by_xpath(applicant_xpath).text
    
    applicants.append(applicant)
    if x % 10 == 0:
        print(str(x) + "/" + str(len(company_name)) + " data processed")

# to check if we have all information
print("Applicants:", len(applicants), "rows")
print("Job Description:", len(job_desc), "rows")
print("Level:", len(level), "rows")
print("Employee Type:", len(emp_type), "rows")
print("Functions:", len(functions), "rows")
print("Industries:", len(industries), "rows")

print("Finished")

Start processing detail..
10/141 data processed
20/141 data processed
30/141 data processed
40/141 data processed
50/141 data processed
60/141 data processed
70/141 data processed
80/141 data processed
90/141 data processed
100/141 data processed
110/141 data processed
120/141 data processed
130/141 data processed
140/141 data processed
Applicants: 141 rows
Job Description: 141 rows
Level: 141 rows
Employee Type: 141 rows
Functions: 141 rows
Industries: 141 rows
Finished
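The detail loop repeats one pattern several times: try the XPath under `section[2]`, then retry under `section[3]` when a compensation block shifts the layout. A minimal sketch of that fallback as a reusable helper follows. `first_match`, `fake_page`, and `fake_lookup` are hypothetical names; the fake lookup stands in for `driver.find_element_by_xpath(...).text` so the example runs without a browser.

```python
def first_match(lookup, xpaths):
    """Return the first non-empty result of lookup(xpath), or None if all fail."""
    for xpath in xpaths:
        try:
            result = lookup(xpath)
        except Exception:  # with Selenium this would typically be NoSuchElementException
            continue
        if result:
            return result
    return None


# Stand-in for the live page: only the section[3] variant exists here.
fake_page = {
    "/html/body/main/section/div[2]/section[3]/ul/li[1]": "Seniority level\nAssociate",
}


def fake_lookup(xpath):
    # Raises KeyError when the element is "missing", like a failed element lookup.
    return fake_page[xpath]


candidates = [
    "/html/body/main/section/div[2]/section[2]/ul/li[1]",
    "/html/body/main/section/div[2]/section[3]/ul/li[1]",
]
text = first_match(fake_lookup, candidates)
# The second line of the element text holds the value, as in the loop above.
seniority = text.splitlines()[1] if text else "NA"
```

With such a helper, each field in the loop would need only its list of candidate XPaths instead of a separate try/except or if/else block.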


In [38]:
# creating a dataframe
job_data = pd.DataFrame({
    'Company Name': company_name,
    'Job Posting Time': post_date,
    'Number of Applicants': applicants,
    'Seniority Level': level,
    # 'Size of Employee': 'NA',
    'Company Industry': industries,
    'Detail description': job_desc,
    'Employment Type': emp_type,
    'Job Function': functions
})

# cleaning description column
job_data['Detail description'] = job_data['Detail description'].str.replace('\n',' ')

print(job_data.info())
job_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141 entries, 0 to 140
Data columns (total 8 columns):
Company Name            141 non-null object
Job Posting Time        141 non-null object
Number of Applicants    141 non-null object
Seniority Level         141 non-null object
Company Industry        141 non-null object
Detail description      141 non-null object
Employment Type         141 non-null object
Job Function            141 non-null object
dtypes: object(8)
memory usage: 8.9+ KB
None


| | Company Name | Job Posting Time | Number of Applicants | Seniority Level | Company Industry | Detail description | Employment Type | Job Function |
|---|---|---|---|---|---|---|---|---|
| 0 | kumparan.com | 2 days ago | Be among the first 25 applicants | Entry level | Marketing and Advertising,Online Media,Internet | We are looking for a Data Engineer to join our... | Full-time | Information Technology |
| 1 | PT. SURYA MADISTRINDO | 13 hours ago | 117 applicants | Entry level | Tobacco | At PT Surya Madistrindo (Subsidiary of PT Guda... | Full-time | Information Technology |
| 2 | Tata Consultancy Services | 1 week ago | 54 applicants | Mid-Senior level | Information Technology and Services | Direct message the job poster from Tata Consul... | Full-time | Information Technology |
| 3 | Danone | 3 days ago | 110 applicants | Not Applicable | Consumer Goods,Food & Beverages,Food Production | About The Job Data engineers are responsible ... | Full-time | Information Technology |
| 4 | CIMB Niaga | 1 week ago | 124 applicants | Mid-Senior level | Banking | Job Description: Perform application system d... | Full-time | Information Technology |


# Observing

In [12]:
job_data['Company Name'].value_counts()

DKATALIS (Digital Katalis)           2
PT Bank Net Indonesia Syariah Tbk    2
Snaphunt                             2
Mekari (PT. Mid Solusi Nusantara)    2
Tjetak                               2
Flip                                 2
Shipper                              2
Sayurbox                             2
Ajaib                                2
Pt SehatQ Harsana Emedika            1
Jamtangan.com                        1
PT SehatQ Harsana Emedika            1
99.co                                1
Brainworx Solusi Integrasi           1
Flip.id                              1
Blue Bird Group                      1
Machtwatch                           1
Lemonilo                             1
SehatQ                               1
PT Bank Jago Tbk                     1
Blibli.com                           1
Name: Company Name, dtype: int64

In [13]:
job_data['Job Posting Time'].value_counts()

2 weeks ago     7
1 week ago      6
4 weeks ago     5
1 month ago     4
3 weeks ago     3
2 months ago    2
20 hours ago    1
15 hours ago    1
3 days ago      1
Name: Job Posting Time, dtype: int64
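Relative strings such as `2 weeks ago` lose their meaning once the scrape time is forgotten, so it can help to convert them to an approximate timestamp before loading them into the warehouse. The sketch below assumes the English `<number> <unit>s ago` format seen in the output above and approximates a month as 30 days. `posting_time_to_datetime` and `scraped_at` are illustrative names.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit; a month is approximated as 30 days.
_UNIT_SECONDS = {
    "minute": 60,
    "hour": 60 * 60,
    "day": 24 * 60 * 60,
    "week": 7 * 24 * 60 * 60,
    "month": 30 * 24 * 60 * 60,
}

_PATTERN = re.compile(r"\s*(\d+)\s+(minute|hour|day|week|month)s?\s+ago\s*$")


def posting_time_to_datetime(text, scraped_at):
    """Convert e.g. '2 weeks ago' to an approximate datetime, or None if unparsed."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    amount = int(match.group(1))
    seconds = amount * _UNIT_SECONDS[match.group(2)]
    return scraped_at - timedelta(seconds=seconds)


# The reference time below is an arbitrary example, not the actual scrape time.
reference = datetime(2021, 4, 1, 12, 0)
approx_posted = posting_time_to_datetime("2 weeks ago", reference)
```

Strings that do not fit the pattern return `None`, so they can be reviewed instead of being silently mis-dated.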

In [14]:
job_data['Number of Applicants'].value_counts()

Be among the first 25 applicants    27
29 applicants                        1
96 applicants                        1
39 applicants                        1
Name: Number of Applicants, dtype: int64

In [15]:
job_data['Seniority Level'].value_counts()

Associate           20
Mid-Senior level     6
Entry level          2
Director             2
Name: Seniority Level, dtype: int64

In [16]:
job_data['Company Industry'].value_counts()

Information Technology and Services,Computer Software,Internet                       7
Information Technology and Services                                                  4
Computer Software,Internet,Financial Services                                        3
Retail                                                                               2
Health, Wellness and Fitness                                                         1
Transportation/Trucking/Railroad                                                     1
Information Technology and Services,Computer Software,Retail                         1
Marketing and Advertising,Internet,Consumer Goods                                    1
Internet                                                                             1
NA                                                                                   1
Information Technology and Services,Banking,Financial Services                       1
Construction                               

In [17]:
job_data.isna().sum()

Company Name            0
Job Posting Time        0
Number of Applicants    0
Seniority Level         0
Company Industry        0
Detail description      0
Employment Type         0
Job Function            0
dtype: int64

In [18]:
job_data['Employment Type'].value_counts()

Full-time    30
Name: Employment Type, dtype: int64

In [19]:
job_data['Job Function'].value_counts()

Information Technology                              26
Engineering                                          2
Information Technology,Analyst,Strategy/Planning     1
Other                                                1
Name: Job Function, dtype: int64
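Industries and job functions are joined with a comma in the detail loop, but some labels contain a comma themselves (for example `Health, Wellness and Fitness` in the industry output above), so the joined string cannot be split back reliably. A minimal sketch of joining with a different delimiter and counting individual labels follows; the pipe delimiter and the helper names are assumptions, not part of the original notebook.

```python
from collections import Counter

DELIMITER = "|"  # assumed delimiter; any character absent from the labels works


def join_labels(labels):
    """Join non-empty labels with the delimiter, without a leading separator."""
    return DELIMITER.join(label.strip() for label in labels if label.strip())


def count_labels(joined_values):
    """Count how often each individual label occurs across all rows."""
    counts = Counter()
    for value in joined_values:
        counts.update(part for part in value.split(DELIMITER) if part)
    return counts


rows = [
    join_labels(["Information Technology and Services", "Computer Software", "Internet"]),
    join_labels(["Health, Wellness and Fitness"]),
    join_labels(["Internet"]),
]
counts = count_labels(rows)
```

Counting per label rather than per joined string makes aggregations such as "postings per industry" straightforward.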

In [20]:
job_data[job_data['Job Function'] == '']

Unnamed: 0,Company Name,Job Posting Time,Number of Applicants,Seniority Level,Company Industry,Detail description,Employment Type,Job Function


# Clean Up

In [21]:
import numpy as np

In [22]:
# Clean up
# Append "Size of Employee" as an empty column to comply with the required columns
job_data['Size of Employee'] = np.nan

In [23]:
job_data.replace('NA', np.nan, inplace=True)

In [24]:
job_data.isna().sum()

Company Name             0
Job Posting Time         0
Number of Applicants     0
Seniority Level          0
Company Industry         1
Detail description       0
Employment Type          0
Job Function             0
Size of Employee        30
dtype: int64

In [25]:
def clean_number_of_applicants(string):
    # "Be among the first 25 applicants" -> "<25"
    # "Over 200 applicants"              -> ">200"
    # "29 applicants"                    -> "29"
    string_split = string.split()
    if len(string_split) == 6:
        clean_number = '<' + str(string_split[4])
    elif len(string_split) == 3:
        clean_number = '>' + str(string_split[1])
    elif len(string_split) == 2:
        clean_number = str(string_split[0])
    else:
        clean_number = np.nan
    return clean_number

In [26]:
job_data['Number of Applicants'] = job_data['Number of Applicants'].apply(clean_number_of_applicants)
job_data['Number of Applicants'].value_counts()

<25    27
29      1
39      1
96      1
Name: Number of Applicants, dtype: int64
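The cleaning function above decides by word count, which is compact but breaks if LinkedIn changes the wording slightly. As a hedged alternative, the sketch below matches on the number and the leading phrase. `clean_applicants` is a hypothetical name, and it returns `None` instead of `np.nan` for unparsed text so that it runs without NumPy; pandas treats `None` as missing in object columns.

```python
import re


def clean_applicants(text):
    """Map LinkedIn applicant text to '<25', '>200', '29', or None if unparsed."""
    text = text.strip()
    match = re.search(r"\d+", text)
    if match is None:
        return None
    number = match.group(0)
    lowered = text.lower()
    if lowered.startswith("be among the first"):
        return "<" + number
    if lowered.startswith("over"):
        return ">" + number
    return number
```

It could be applied in the same way, for example `job_data['Number of Applicants'].apply(clean_applicants)`.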

In [27]:
job_data['Keyword'] = keyword

In [52]:
job_data.head()

| | Company Name | Job Posting Time | Number of Applicants | Seniority Level | Company Industry | Detail description | Employment Type | Job Function |
|---|---|---|---|---|---|---|---|---|
| 0 | kumparan.com | 2 days ago | Be among the first 25 applicants | Entry level | Marketing and Advertising,Online Media,Internet | We are looking for a Data Engineer to join our... | Full-time | Information Technology |
| 1 | PT. SURYA MADISTRINDO | 13 hours ago | 117 applicants | Entry level | Tobacco | At PT Surya Madistrindo (Subsidiary of PT Guda... | Full-time | Information Technology |
| 2 | Tata Consultancy Services | 1 week ago | 54 applicants | Mid-Senior level | Information Technology and Services | Direct message the job poster from Tata Consul... | Full-time | Information Technology |
| 3 | Danone | 3 days ago | 110 applicants | Not Applicable | Consumer Goods,Food & Beverages,Food Production | About The Job Data engineers are responsible ... | Full-time | Information Technology |
| 4 | CIMB Niaga | 1 week ago | 124 applicants | Mid-Senior level | Banking | Job Description: Perform application system d... | Full-time | Information Technology |


In [29]:
job_data.to_csv('LinkedIn Job Data_' + keyword + '.csv', index=False)
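The brief asks for the "Detail Description" to be stored in Google Cloud Storage while the remaining features go to the warehouse, and the notebook stops at a CSV export. The sketch below shows one possible way to split a record accordingly. The bucket name, the `Description URI` column, the object naming scheme, and `split_record` are all hypothetical; the upload itself is left out, and would typically be done with the `google-cloud-storage` client (for example via a blob's `upload_from_string`).

```python
import hashlib

BUCKET = "my-job-descriptions"  # hypothetical bucket name


def split_record(record, keyword):
    """Separate the long description from the row destined for the warehouse.

    Returns (warehouse_row, object_name, description_text).
    """
    description = record["Detail description"]
    # A content hash gives a stable object name for identical descriptions.
    digest = hashlib.sha1(description.encode("utf-8")).hexdigest()[:16]
    folder = keyword.lower().replace(" ", "_")
    object_name = "descriptions/{}/{}.txt".format(folder, digest)

    row = {k: v for k, v in record.items() if k != "Detail description"}
    row["Description URI"] = "gs://{}/{}".format(BUCKET, object_name)
    return row, object_name, description


record = {
    "Company Name": "Danone",
    "Detail description": "About The Job",
    "Employment Type": "Full-time",
}
row, object_name, text = split_record(record, "Data Engineer")
```

The warehouse table then keeps a short reference per job, and the full text can be fetched from the bucket only when needed.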

In [41]:
job_data.columns

Index(['Company Name', 'Job Posting Time', 'Number of Applicants',
       'Seniority Level', 'Company Industry', 'Detail description',
       'Employment Type', 'Job Function'],
      dtype='object')

In [50]:
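# all_data is not defined in this notebook; it is assumed to hold the results of earlier keyword runs
pd.concat([all_data,job_data])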

| | Company Name | Job Posting Time | Number of Applicants | Seniority Level | Company Industry | Detail description | Employment Type | Job Function |
|---|---|---|---|---|---|---|---|---|
| 0 | kumparan.com | 2 days ago | Be among the first 25 applicants | Entry level | Marketing and Advertising,Online Media,Internet | We are looking for a Data Engineer to join our... | Full-time | Information Technology |
| 1 | PT. SURYA MADISTRINDO | 13 hours ago | 117 applicants | Entry level | Tobacco | At PT Surya Madistrindo (Subsidiary of PT Guda... | Full-time | Information Technology |
| 2 | Tata Consultancy Services | 1 week ago | 54 applicants | Mid-Senior level | Information Technology and Services | Direct message the job poster from Tata Consul... | Full-time | Information Technology |
| 3 | Danone | 3 days ago | 110 applicants | Not Applicable | Consumer Goods,Food & Beverages,Food Production | About The Job Data engineers are responsible ... | Full-time | Information Technology |
| 4 | CIMB Niaga | 1 week ago | 124 applicants | Mid-Senior level | Banking | Job Description: Perform application system d... | Full-time | Information Technology |
| 5 | ACT Foundation \| Aksi Cepat Tanggap | 1 week ago | Be among the first 25 applicants | Entry level | Philanthropy | Direct message the job poster from ACT Foundat... | Full-time | Information Technology |
| 6 | Flash Coffee | 2 days ago | Be among the first 25 applicants | Entry level | Food & Beverages | Do you love coffee & tech put together? Then y... | Full-time | Information Technology |
| 7 | Xiaomi Indonesia | 3 weeks ago | Over 200 applicants | Mid-Senior level | Internet | Direct message the job poster from Xiaomi Indo... | Contract | Analyst,Engineering |
| 8 | PT Bank Jago Tbk | 1 week ago | Be among the first 25 applicants | Entry level | Financial Services | We are seeking a hands-on senior data engineer... | Full-time | Information Technology |
| 9 | Accenture Southeast Asia | 1 day ago | 26 applicants | Mid-Senior level | Information Technology and Services,Computer S... | Accenture is a leading global professional ser... | Full-time | Design,Information Technology |
