# Project Name - Web Scraping of different Job Sites


### About the project
This project scrapes three job sites (Naukri Jobs, Indeed Jobs and LinkedIn Jobs) to collect a total of 180 Data Science job listings in India, 60 from each site.


Libraries used: bs4 (BeautifulSoup), selenium, pandas

## Let's Begin

##### Import Statements

In [1]:
# pip install selenium

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

### Naukri Data Science Jobs

In [2]:
def extract(page):
    '''
    This function is used to extract the webpage of www.naukri.com and convert it to the soup object using BeautifulSoup.
    
    Args: 
    int: Page Number for number of pages to extract.
    
    Returns:
    A BeautifulSoup object.
    '''
    
    # Fetch the url and the page number
    url = f"https://www.naukri.com/data-science-jobs-{page}?k=data%20science&nignbevent_src=jobsearchDeskGNB"
    
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    
    # Extract the web page
    soup = BeautifulSoup(driver.page_source,'html5lib')
    
    # Quit the driver so the browser session is shut down completely
    driver.quit()
    
    return soup 
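
If `driver.get()` raises (for example on a timeout), the function above never reaches its cleanup line and the browser stays open. A minimal sketch of one way to guard against that with `try`/`finally` follows. The names `fetch_page_source` and `make_driver` are hypothetical and not part of the original notebook; `make_driver` is assumed to be a zero-argument factory such as `webdriver.Chrome`.

```python
import time


def fetch_page_source(url, make_driver, wait_seconds=3):
    """Return the rendered HTML for `url`, always shutting the browser down.

    `make_driver` is a hypothetical zero-argument factory, for example
    `webdriver.Chrome` from Selenium.
    """
    driver = make_driver()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude pause so the listings can render
        return driver.page_source
    finally:
        # Runs even if get() raises, so no browser is left behind.
        driver.quit()
```

With Selenium installed, `BeautifulSoup(fetch_page_source(url, webdriver.Chrome), 'html5lib')` would then produce the same kind of soup object as `extract(page)`.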


In [3]:
def transform(soup):
    '''
    This function is used to extract the job title, job link, company name, experience, salary, location, skills and job post date for each job and create a dictionary.
    
    Args:
    The BeautifulSoup object returned by the 'extract(page)' function.
    
    Returns:
    A list of job dictionaries named 'joblist'.
    '''
    
    divs = soup.find_all('div', {'class' : 'srp-jobtuple-wrapper'})
          
    for job in divs:
        # Job Title
        title = job.find('a', {'class' : 'title'}).text.strip()
        
        # Job Link
        link = job.find('a').get('href')
        
        # Company name
        company_tag = job.find('span', {'class' : 'comp-dtls-wrap'})
        company = company_tag.find('a').get('title').strip()
        
        # Experience
        experience = job.find('span', {'class' : 'expwdth'}).get('title').strip()
        
        # Salary
        salary_tag = job.find('span', {'class' : 'ni-job-tuple-icon ni-job-tuple-icon-srp-rupee sal'})
        salary = salary_tag.find('span').get('title').strip()
        
        # Job Location
        location = job.find('span', {'class' : 'locWdth'}).get('title').strip()
        
        # Skills
        # Extract all the skills for each job.
        skills_tag = job.find('ul', {'class' : 'tags-gt'}).find_all('li')
        # Create a list of skills for each job by using a for loop.
        skills = []
        for skill in skills_tag:
            skills.append(skill.text) 
            
        # Job Post Date
        post_date = job.find('span', {'class' : 'job-post-day'}).text.strip()
        
        # Create a dictionary to store all the values.
        job = {
            'Title': title,
            'Job_Link': link,
            'Company': company,
            'Experience': experience,
            'Salary': salary,
            'Location': location,
            'Skills': skills,
            'Job_Post_Date': post_date
        }
        
        # Create a list of dictionaries for each job.
        joblist.append(job)
        
    return joblist
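
BeautifulSoup's `find()` returns `None` when no element matches, so a single listing without, say, a salary span would raise `AttributeError` and stop the whole page. The sketch below shows one possible guard. `text_or_default` is a hypothetical helper name, and the default label is an assumption chosen to match the 'Not disclosed' values already in the data.

```python
def text_or_default(tag, default="Not disclosed"):
    """Return the stripped text of a tag, or `default` when the tag is missing.

    `tag` is whatever `job.find(...)` returned: a BeautifulSoup Tag, or None
    when no element matched.
    """
    if tag is None:
        return default
    return tag.text.strip()
```

Inside the loop it could be used as `post_date = text_or_default(job.find('span', {'class' : 'job-post-day'}))`.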


In [4]:
# Check the number of jobs extracted.
joblist = []
for i in range(3):
    content = extract(i)
    transform(content)

print(len(joblist))

60
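
The loop above works because `transform` appends to the global `joblist`. As a sketch of an alternative, the driver loop below collects results without a global. It assumes a variant of `transform` that builds and returns a fresh list for each page (the notebook's version does not do this), and `collect_jobs` is a hypothetical name.

```python
def collect_jobs(pages, extract, transform):
    """Run extract/transform for each page and return one combined list.

    Assumes `transform` returns a new list per page instead of appending
    to a shared global list.
    """
    jobs = []
    for page in pages:
        jobs.extend(transform(extract(page)))
    return jobs
```

Under that assumption, `joblist = collect_jobs(range(3), extract, transform)` would replace the cell above, and each site's results would stay separate without resetting `joblist` by hand.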


In [5]:
# Load into a pandas dataframe.
df_naukri = pd.DataFrame(joblist)

# Check the first 5 rows.
df_naukri.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date
0,Data Science Analytics Sr Analyst - Data Science,https://www.naukri.com/job-listings-data-scien...,Accenture,5-8 Yrs,Not disclosed,Mumbai,"[Data Science, Publishing, Artificial Intellig...",2 Days Ago
1,Data Science Domain Manager,https://www.naukri.com/job-listings-data-scien...,Coursera,7-11 Yrs,Not disclosed,"Kolkata, Mumbai, New Delhi, Hyderabad/Secunder...","[Computer science, Content strategy, Data anal...",5 Days Ago
2,Data Science Analyst II Developer,https://www.naukri.com/job-listings-data-scien...,Aditya Industries,0-5 Yrs,4-9 Lacs PA,"Noida, Uttar Pradesh, Gurgaon/ Gurugram, Harya...","[Data Science, Data Science Analyst, DATA, Adv...",5 Days Ago
3,"Senior Analyst, Data Science & Analytics",https://www.naukri.com/job-listings-senior-ana...,Transunion,1-2 Yrs,Not disclosed,Pune,"[Data analysis, Operations research, Linux, Ri...",11 Days Ago
4,Senior Data Analyst (Data Science),https://www.naukri.com/job-listings-senior-dat...,Trigyn Technologies,5-10 Yrs,Not disclosed,Remote,"[Data Science, Data Mart, Artificial Intellige...",11 Days Ago


In [6]:
# Convert the list under 'Skills' to string.
df_naukri['Skills'] = df_naukri['Skills'].apply(lambda x: ', '.join(x))
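
Note that `', '.join(x)` only behaves as intended when `x` is a list. Applied to a plain string such as the 'Not available' placeholder used in the LinkedIn section, it would put a separator between every character. A small sketch of a helper that handles both cases (`skills_to_text` is a hypothetical name):

```python
def skills_to_text(value, separator=", "):
    """Join a list of skills into one string; pass plain strings through."""
    if isinstance(value, str):
        return value.strip()
    return separator.join(skill.strip() for skill in value)
```

It can be passed to `apply` in place of the lambda, for example `df_naukri['Skills'].apply(skills_to_text)`.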

In [7]:
# Add a column to identify the Job site which in this case is 'Naukri Jobs'.
df_naukri['Job_Site_Name'] = 'Naukri Jobs'

df_naukri.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date,Job_Site_Name
0,Data Science Analytics Sr Analyst - Data Science,https://www.naukri.com/job-listings-data-scien...,Accenture,5-8 Yrs,Not disclosed,Mumbai,"Data Science, Publishing, Artificial Intellige...",2 Days Ago,Naukri Jobs
1,Data Science Domain Manager,https://www.naukri.com/job-listings-data-scien...,Coursera,7-11 Yrs,Not disclosed,"Kolkata, Mumbai, New Delhi, Hyderabad/Secunder...","Computer science, Content strategy, Data analy...",5 Days Ago,Naukri Jobs
2,Data Science Analyst II Developer,https://www.naukri.com/job-listings-data-scien...,Aditya Industries,0-5 Yrs,4-9 Lacs PA,"Noida, Uttar Pradesh, Gurgaon/ Gurugram, Harya...","Data Science, Data Science Analyst, DATA, Adva...",5 Days Ago,Naukri Jobs
3,"Senior Analyst, Data Science & Analytics",https://www.naukri.com/job-listings-senior-ana...,Transunion,1-2 Yrs,Not disclosed,Pune,"Data analysis, Operations research, Linux, Ris...",11 Days Ago,Naukri Jobs
4,Senior Data Analyst (Data Science),https://www.naukri.com/job-listings-senior-dat...,Trigyn Technologies,5-10 Yrs,Not disclosed,Remote,"Data Science, Data Mart, Artificial Intelligen...",11 Days Ago,Naukri Jobs


In [8]:
# Save the data as a .csv file.
df_naukri.to_csv('naukri_jobs.csv')

### Indeed Data Science Jobs

In [9]:
def extract(page):
    '''
    This function is used to extract the webpage of in.indeed.com/jobs and convert it to the soup object using BeautifulSoup.
    
    Args: 
    int: Page Number for number of pages to extract.
    
    Returns:
    A BeautifulSoup object.
    '''
    
    # Fetch the url and the page number
    url = f"https://in.indeed.com/jobs?q=data+scientist&l=&sort=date&start={page}&vjk=7777907161d57e40"
    
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    
    # Extract the web page
    soup = BeautifulSoup(driver.page_source,'html5lib')

    # Quit the driver so the browser session is shut down completely
    driver.quit()
    
    return soup 
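
Judging by the loop further down, which calls `extract` with 0, 10, 20 and 30, the `page` argument here acts as a result offset passed to Indeed's `start` parameter rather than as a page number. The sketch below builds the same kind of URL with `urllib.parse.urlencode`. `indeed_url` is a hypothetical helper, and the `vjk` parameter is left out on the assumption that it is not required for the search to work.

```python
from urllib.parse import urlencode

INDEED_SEARCH = "https://in.indeed.com/jobs"


def indeed_url(start, query="data scientist"):
    """Build a search URL for the result page beginning at offset `start`."""
    params = {"q": query, "l": "", "sort": "date", "start": start}
    return f"{INDEED_SEARCH}?{urlencode(params)}"


# Offsets for four result pages, matching range(0, 40, 10).
urls = [indeed_url(start) for start in range(0, 40, 10)]
```

Building the query string this way keeps the encoding of the search phrase (`data+scientist`) out of the hand-written URL.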


In [10]:
def transform(soup):
    '''
    This function is used to extract the job title, job link, company name, experience, salary, location, skills and job post date for each job and create a dictionary.
    
    Args:
    The BeautifulSoup object returned by the 'extract(page)' function.
    
    Returns:
    A list of job dictionaries named 'joblist'.
    '''
    
    divs = soup.find_all('div', {'class' : 'job_seen_beacon'})
              
    for job in divs:
        # Job Title
        title = job.find('span').get('title').strip()
                
        # Job Link
        link = job.find('a').get('href')
        
        # Company name
        # find() returns None when the company name is missing, so .text raises AttributeError.
        try:
            company = job.find('span', {'class' : 'companyName'}).text.strip()
        except AttributeError:
            company = 'Not disclosed'
      
        
        # Experience
        # Experience is not explicitly given under a separate heading on the job listing page.
        experience = 'Not available'
        
        # Salary
        # find() returns None when the salary element is missing, so .text raises AttributeError.
        try:
            salary_tag = job.find('div', {'class' : 'css-1ihavw2 eu4oa1w0'}).text
            # Keep only values starting with '₹'; when no salary is listed, this selector picks up the next element instead.
            if salary_tag.startswith('₹'):
                salary = salary_tag
            else:
                salary = 'Not disclosed'
        except AttributeError:
            salary = 'Not disclosed'
        
        # Job Location
        location = job.find('div', {'class' : 'companyLocation'}).text.strip()
                
        # Skills / Summary
        # Skills are not explicitly mentioned for each job but the summary includes some skills.
        # Extract the summary for each job.
        skills_tag = job.find('ul').find_all('li')
        # Create a list of skills for each job by using a for loop.
        skills_summary = []
        for skill in skills_tag:
            skills_summary.append(skill.text.strip())
                    
        # Job Post Date
        # The span's full text is 'PostedJust posted', so take only its last child node to drop the 'Posted' prefix.
        post_date = job.find('span', {'class' : 'date'}).contents[-1].strip()
        
        # Create a dictionary to store all the values.
        job = {
            'Title': title,
            'Job_Link': link,
            'Company': company,
            'Experience': experience,
            'Salary': salary,
            'Location': location,
            'Skills': skills_summary,
            'Job_Post_Date': post_date
        }
        
        # Create a list of dictionaries for each job.
        joblist.append(job)
        
    return joblist


In [11]:
# Check the number of jobs extracted.
joblist = []
for i in range(0,40,10):
    content = extract(i)
    transform(content)

print(len(joblist))

60


In [12]:
# Load into a pandas dataframe.
df_indeed = pd.DataFrame(joblist)

# Check the first 5 rows.
df_indeed.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date
0,Visualization Specialist - Data Science,/rc/clk?jk=c0f8ea8f22c6d958&fccid=216c89cded62...,Bristlecone,Not available,Not disclosed,"Bengaluru, Karnataka","[Collaborate with other data scientists, produ...",Just posted
1,ML Engineer II,/rc/clk?jk=1b34210478aef1c1&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",[Experience in various statistical and machine...,Just posted
2,Data Scientist,/rc/clk?jk=7a83444dc37d615c&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",[Experience in various statistical and machine...,Just posted
3,Lead Data Scientist,/rc/clk?jk=4147dedb353a7d70&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",[Experience in various statistical and machine...,Just posted
4,Data Science Manager,/rc/clk?jk=3adcc4d4996ab6ac&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",[Experience in various statistical and machine...,Just posted


In [13]:
# Convert the list under 'Skills' to string.
df_indeed['Skills'] = df_indeed['Skills'].apply(lambda x: ', '.join(x))

In [14]:
# Add a column to identify the Job site which in this case is 'Indeed Jobs'.
df_indeed['Job_Site_Name'] = 'Indeed Jobs'

df_indeed.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date,Job_Site_Name
0,Visualization Specialist - Data Science,/rc/clk?jk=c0f8ea8f22c6d958&fccid=216c89cded62...,Bristlecone,Not available,Not disclosed,"Bengaluru, Karnataka","Collaborate with other data scientists, produc...",Just posted,Indeed Jobs
1,ML Engineer II,/rc/clk?jk=1b34210478aef1c1&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",Experience in various statistical and machine ...,Just posted,Indeed Jobs
2,Data Scientist,/rc/clk?jk=7a83444dc37d615c&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",Experience in various statistical and machine ...,Just posted,Indeed Jobs
3,Lead Data Scientist,/rc/clk?jk=4147dedb353a7d70&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",Experience in various statistical and machine ...,Just posted,Indeed Jobs
4,Data Science Manager,/rc/clk?jk=3adcc4d4996ab6ac&fccid=154ba50a8fca...,Nissan,Not available,Not disclosed,"Thiruvananthapuram, Kerala",Experience in various statistical and machine ...,Just posted,Indeed Jobs
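
The Indeed `Job_Link` values above are relative paths (`/rc/clk?jk=...`), unlike the absolute Naukri links. A minimal sketch of how they could be made absolute with `urllib.parse.urljoin`, assuming `https://in.indeed.com` is the right site root (`absolute_link` is a hypothetical helper name):

```python
from urllib.parse import urljoin

INDEED_BASE = "https://in.indeed.com"  # assumed site root for relative links


def absolute_link(href, base=INDEED_BASE):
    """Turn a relative href such as '/rc/clk?jk=...' into a full URL."""
    return urljoin(base, href)
```

Because `urljoin` leaves absolute URLs unchanged, the helper could be applied to the whole column with `df_indeed['Job_Link'].apply(absolute_link)`.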


In [15]:
# Save the data as a .csv file.
df_indeed.to_csv('indeed_jobs.csv')

### LinkedIn Data Science Jobs

In [16]:
def extract(page):
    '''
    This function is used to extract the webpage of www.linkedin.com/jobs and convert it to the soup object using BeautifulSoup.
    
    Args: 
    int: Page Number for number of pages to extract.
    
    Returns:
    A BeautifulSoup object.
    '''
    
    # Fetch the url and the page number
    url = f"https://www.linkedin.com/jobs/search?keywords=Data%20Scientist&location=India&geoId=102713980&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum={page}"
    
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)
    
    # Extract the web page
    soup = BeautifulSoup(driver.page_source,'html5lib')

    # Quit the driver so the browser session is shut down completely
    driver.quit()
    
    return soup 


In [17]:
def transform(soup):
    '''
    This function is used to extract the job title, job link, company name, experience, salary, location, skills and job post date for each job and create a dictionary.
    
    Args:
    The BeautifulSoup object returned by the 'extract(page)' function.
    
    Returns:
    A list of job dictionaries named 'joblist'.
    '''
    
    divs = soup.find_all('div', {'class' : 'base-card'})
              
    for job in divs:
        # Job Title
        title = job.find('span', {'class' : 'sr-only'}).text.strip()
                        
        # Job Link
        link = job.find('a').get('href')
                
        # Company name
        company = job.find('a', {'class' : 'hidden-nested-link'}).text.strip()
              
        # Experience
        # Experience is not explicitly given under a separate heading on the job listing page.
        experience = 'Not available'
        
        # Salary
        # Salary is not explicitly given under a separate heading on the job listing page.
        salary = 'Not available'
        
        # Job Location
        location = job.find('span', {'class' : 'job-search-card__location'}).text.strip()
                
        # Skills
        # Skills are not explicitly given under a separate heading on the job listing page.
        skills = 'Not available'
                    
        # Job Post Date        
        post_date = job.find('time').text.strip()
        
        # Create a dictionary to store all the values.
        job = {
            'Title': title,
            'Job_Link': link,
            'Company': company,
            'Experience': experience,
            'Salary': salary,
            'Location': location,
            'Skills': skills,
            'Job_Post_Date': post_date
        }
        
        # Create a list of dictionaries for each job.
        joblist.append(job)
        
    return joblist
    

In [18]:
# Check the number of jobs extracted.
joblist = []
for i in range(3):
    content = extract(i)
    transform(content)

print(len(joblist))

75


In [19]:
# Load into a pandas dataframe.
df_linkedin = pd.DataFrame(joblist)

# Check the first 5 rows.
df_linkedin.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date
0,AI/ML Engineer - Freshers (2023),https://in.linkedin.com/jobs/view/ai-ml-engine...,Mindfire Solutions,Not available,Not available,"Bhubaneswar, Odisha, India",Not available,1 week ago
1,AI/ML Data Scientist,https://in.linkedin.com/jobs/view/ai-ml-data-s...,Experfy,Not available,Not available,"Bengaluru, Karnataka, India",Not available,4 months ago
2,Junior Data Scientist,https://in.linkedin.com/jobs/view/junior-data-...,Quadrant.io,Not available,Not available,India,Not available,1 month ago
3,AI / ML Engineer,https://in.linkedin.com/jobs/view/ai-ml-engine...,HighPoints Technologies India (P) Ltd,Not available,Not available,"Bengaluru, Karnataka, India",Not available,1 week ago
4,ML Engineer,https://in.linkedin.com/jobs/view/ml-engineer-...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,1 week ago


In [20]:
# Add a column to identify the Job site which in this case is 'LinkedIn Jobs'.
df_linkedin['Job_Site_Name'] = 'LinkedIn Jobs'

df_linkedin.head()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date,Job_Site_Name
0,AI/ML Engineer - Freshers (2023),https://in.linkedin.com/jobs/view/ai-ml-engine...,Mindfire Solutions,Not available,Not available,"Bhubaneswar, Odisha, India",Not available,1 week ago,LinkedIn Jobs
1,AI/ML Data Scientist,https://in.linkedin.com/jobs/view/ai-ml-data-s...,Experfy,Not available,Not available,"Bengaluru, Karnataka, India",Not available,4 months ago,LinkedIn Jobs
2,Junior Data Scientist,https://in.linkedin.com/jobs/view/junior-data-...,Quadrant.io,Not available,Not available,India,Not available,1 month ago,LinkedIn Jobs
3,AI / ML Engineer,https://in.linkedin.com/jobs/view/ai-ml-engine...,HighPoints Technologies India (P) Ltd,Not available,Not available,"Bengaluru, Karnataka, India",Not available,1 week ago,LinkedIn Jobs
4,ML Engineer,https://in.linkedin.com/jobs/view/ml-engineer-...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,1 week ago,LinkedIn Jobs


In [21]:
# Keep only the first 60 rows to match the 60 jobs extracted from each of Naukri and Indeed.
df_linkedin = df_linkedin.head(60)

In [22]:
len(df_linkedin)

60

In [23]:
# Save the data as a .csv file.
df_linkedin.to_csv('linkedin_jobs.csv')

### Merge all three DataFrames into a single DataFrame

In [24]:
# Check the shape of all 3 dataframes before merging.
print(f"Naukri DF Shape: {df_naukri.shape}")
print(f"Indeed DF Shape: {df_indeed.shape}")
print(f"LinkedIn DF Shape: {df_linkedin.shape}")

Naukri DF Shape: (60, 9)
Indeed DF Shape: (60, 9)
LinkedIn DF Shape: (60, 9)


In [25]:
# All 3 dataframes have the same columns, so we can concatenate them row-wise into a single dataframe for further analysis.
df_jobs = pd.concat([df_naukri, df_indeed, df_linkedin], axis=0)

# Check the shape of the merged df.
df_jobs.shape

(180, 9)

In [26]:
# Check the number of rows for each job site to confirm.
df_jobs.value_counts("Job_Site_Name")

Job_Site_Name
Indeed Jobs      60
LinkedIn Jobs    60
Naukri Jobs      60
dtype: int64

In [27]:
# See the last 5 rows to check the index numbers.
df_jobs.tail()

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date,Job_Site_Name
55,Data Scientist - Python/Machine Learning,https://in.linkedin.com/jobs/view/data-scienti...,CavinKare,Not available,Not available,"Chennai, Tamil Nadu, India",Not available,1 week ago,LinkedIn Jobs
56,Data Scientist,https://in.linkedin.com/jobs/view/data-scienti...,Obviously AI,Not available,Not available,"Bengaluru, Karnataka, India",Not available,4 months ago,LinkedIn Jobs
57,Data Scientist - Business Intelligence,https://in.linkedin.com/jobs/view/data-scienti...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,5 days ago,LinkedIn Jobs
58,Data Scientist - Machine Learning,https://in.linkedin.com/jobs/view/data-scienti...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,5 days ago,LinkedIn Jobs
59,AI and ML Engineer,https://in.linkedin.com/jobs/view/ai-and-ml-en...,INSTATALENT RECRUIT LLP,Not available,Not available,"Bengaluru, Karnataka, India",Not available,2 weeks ago,LinkedIn Jobs


In [28]:
# Reset the index. reset_index returns a new dataframe, so assign it back.
df_jobs = df_jobs.reset_index(drop=True)
df_jobs

Unnamed: 0,Title,Job_Link,Company,Experience,Salary,Location,Skills,Job_Post_Date,Job_Site_Name
0,Data Science Analytics Sr Analyst - Data Science,https://www.naukri.com/job-listings-data-scien...,Accenture,5-8 Yrs,Not disclosed,Mumbai,"Data Science, Publishing, Artificial Intellige...",2 Days Ago,Naukri Jobs
1,Data Science Domain Manager,https://www.naukri.com/job-listings-data-scien...,Coursera,7-11 Yrs,Not disclosed,"Kolkata, Mumbai, New Delhi, Hyderabad/Secunder...","Computer science, Content strategy, Data analy...",5 Days Ago,Naukri Jobs
2,Data Science Analyst II Developer,https://www.naukri.com/job-listings-data-scien...,Aditya Industries,0-5 Yrs,4-9 Lacs PA,"Noida, Uttar Pradesh, Gurgaon/ Gurugram, Harya...","Data Science, Data Science Analyst, DATA, Adva...",5 Days Ago,Naukri Jobs
3,"Senior Analyst, Data Science & Analytics",https://www.naukri.com/job-listings-senior-ana...,Transunion,1-2 Yrs,Not disclosed,Pune,"Data analysis, Operations research, Linux, Ris...",11 Days Ago,Naukri Jobs
4,Senior Data Analyst (Data Science),https://www.naukri.com/job-listings-senior-dat...,Trigyn Technologies,5-10 Yrs,Not disclosed,Remote,"Data Science, Data Mart, Artificial Intelligen...",11 Days Ago,Naukri Jobs
...,...,...,...,...,...,...,...,...,...
175,Data Scientist - Python/Machine Learning,https://in.linkedin.com/jobs/view/data-scienti...,CavinKare,Not available,Not available,"Chennai, Tamil Nadu, India",Not available,1 week ago,LinkedIn Jobs
176,Data Scientist,https://in.linkedin.com/jobs/view/data-scienti...,Obviously AI,Not available,Not available,"Bengaluru, Karnataka, India",Not available,4 months ago,LinkedIn Jobs
177,Data Scientist - Business Intelligence,https://in.linkedin.com/jobs/view/data-scienti...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,5 days ago,LinkedIn Jobs
178,Data Scientist - Machine Learning,https://in.linkedin.com/jobs/view/data-scienti...,"Dubai Jobs, Gulf Jobs, Jobs in Dubai, Qatar, K...",Not available,Not available,"Bangalore Urban, Karnataka, India",Not available,5 days ago,LinkedIn Jobs


In [29]:
# Transfer the merged data to a .csv file.
df_jobs.to_csv("naukri_indeed_linkedin_jobs.csv", index=False)
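
The `Job_Post_Date` column mixes labels such as '2 Days Ago', '1 week ago', '4 months ago' and 'Just posted', which are hard to compare directly. As a rough sketch of how they could be turned into numbers for later analysis, the hypothetical helper below estimates the days since posting. Treating a month as 30 days is an assumption, and labels in other formats return `None`.

```python
import re

# Rough day counts per unit; a month is treated as 30 days (an assumption).
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30}


def days_since_posted(label):
    """Estimate days since posting from labels like '2 Days Ago'.

    Returns 0 for 'Just posted' style labels and None when nothing matches.
    """
    text = label.strip().lower()
    match = re.search(r"(\d+)\+?\s*(day|week|month)s?\s+ago", text)
    if match:
        return int(match.group(1)) * _UNIT_DAYS[match.group(2)]
    if "just" in text:
        return 0
    return None
```

For example, `df_jobs['Job_Post_Date'].apply(days_since_posted)` would give an approximate numeric column alongside the original labels.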

### Conclusion

In this project, data was scraped from three job sites (Naukri Jobs, Indeed Jobs and LinkedIn Jobs) using Python's Selenium and bs4 libraries. Three result pages were scraped from Naukri and LinkedIn and four from Indeed, yielding 60 Naukri, 60 Indeed and 75 LinkedIn listings; the LinkedIn set was trimmed to 60 to keep the number of records consistent across the job sites.<br>
This data is then used in the EDA project to uncover insights about data science jobs.